cs.RO / 1 / 2603.00016
Beyond Static Instruction: A Multi-agent AI Framework for Adaptive Augmented Reality Robot Training
Abstract
Augmented Reality (AR) offers powerful visualization capabilities for industrial robot training, yet current interfaces remain predominantly static, failing to account for learners' diverse cognitive profiles. In this paper, we present an AR application for robot training and propose a multi-agent AI framework for future integration that bridges the gap between static visualization and pedagogical intelligence. We report on an evaluation of the baseline AR interface with 36 participants performing a robotic pick-and-place task. While overall usability was high, notable disparities in task duration and learner characteristics highlighted the necessity of dynamic adaptation. To address this, we propose a multi-agent framework that orchestrates multiple components to perform complex preprocessing of multimodal inputs (e.g., voice, physiology, robot data) and adapt the AR application to the learner's needs. By employing autonomous Large Language Model (LLM) agents, the proposed system would dynamically adapt the learning environment in real time based on advanced LLM reasoning.
cs.RO / 2 / 2603.00020
A User Study on the Suitability of Teleoperation Interfaces for Primitive Manipulation Tasks
Abstract
The application of teleoperation to control robotic arms has been widely explored, and user-friendly teleoperation systems have been studied to facilitate higher performance and lower operational burden. To investigate the dominant factors in a practical teleoperation system, this study focused on the characteristics of the interface used to operate a robotic arm. The usability of an interface depends on the characteristics of the manipulation tasks to be completed; however, systematic comparisons of different interfaces across different tasks remain limited. In this study, we compared two widely used teleoperation interfaces, a 3D mouse and a VR controller, on two simple yet broadly applicable tasks with a six-degree-of-freedom (6DoF) robotic arm: repetitively pushing buttons and rotating knobs. Participants (N = 23) controlled the arm to push buttons and rotate knobs as many times as possible in 3-minute trials, each followed by a NASA-TLX workload rating. The results showed a clear connection between interface and task performance: the VR controller yielded higher performance for pushing buttons, whereas the 3D mouse performed better and was less demanding for knob rotation. These findings highlight the importance of considering the dominant motion primitives of a task when designing practical teleoperation interfaces.
cs.RO / 3 / 2603.00102
Designing Social Robots with Ethical, User-Adaptive Explainability in the Era of Foundation Models
Abstract
Foundation models are increasingly embedded in social robots, mediating not only what they say and do but also how they adapt to users over time. This shift renders traditional "one-size-fits-all" explanation strategies especially problematic: generic justifications are now wrapped around behaviour produced by models trained on vast, heterogeneous, and opaque datasets. We argue that ethical, user-adapted explainability must be treated as a core design objective for foundation-model-driven social robotics. We first identify open challenges around explainability and ethical concerns that arise when both adaptation and explanation are delegated to foundation models. Building on this analysis, we propose four recommendations for moving towards user-adapted, modality-aware, and co-designed explanation strategies grounded in smaller, fairer datasets. An illustrative use case of an LLM-driven socially assistive robot demonstrates how these recommendations might be instantiated in a sensitive, real-world domain.
cs.RO / 4 / 2603.00103
Autonomous Block Assembly for Boom Cranes with Passive Joint Dynamics: Integrated Vision MPC Control
Abstract
This paper presents an autonomous control framework for articulated boom cranes performing prefabricated block assembly in construction environments. The key challenge addressed is precise placement control under passive joint dynamics that cause pendulum-like sway, complicating the accurate positioning of building components. Our integrated approach combines real-time vision-based pose estimation of building blocks, collision-aware B-spline path planning, and nonlinear model predictive control (NMPC) to achieve autonomous pickup, placement, and obstacle-avoidance assembly operations. The framework is validated on a laboratory-scale testbed that emulates crane kinematics and passive dynamics while enabling rapid experimentation. The collision-aware planner generates feasible B-spline references in real time on CPU hardware with anytime performance, while the NMPC controller actively suppresses passive joint sway and tracks the planned trajectory under continuous vision feedback. Experimental results demonstrate autonomous block stacking and obstacle-avoidance assembly, with sway damping reducing settling times by more than an order of magnitude compared to uncontrolled passive dynamics, confirming the real-time feasibility of the integrated approach for construction automation.
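The abstract does not detail the planner's B-spline formulation; as a minimal sketch of the underlying primitive, the following evaluates a uniform cubic B-spline reference path through a few waypoints using the standard basis-matrix form (the waypoints and parameterization are illustrative assumptions, not the paper's planner):

```python
import numpy as np

# Uniform cubic B-spline basis matrix (standard form).
M = (1.0 / 6.0) * np.array([
    [-1.0,  3.0, -3.0, 1.0],
    [ 3.0, -6.0,  3.0, 0.0],
    [-3.0,  0.0,  3.0, 0.0],
    [ 1.0,  4.0,  1.0, 0.0],
])

def bspline_point(ctrl, s):
    """Evaluate a uniform cubic B-spline at global parameter s in [0, n_seg].

    ctrl: (n, d) array of control points, n >= 4.
    """
    ctrl = np.asarray(ctrl, dtype=float)
    n_seg = len(ctrl) - 3
    # Clamp to the valid parameter range and split into segment + local u.
    s = min(max(s, 0.0), n_seg - 1e-9)
    i = int(s)
    u = s - i
    U = np.array([u**3, u**2, u, 1.0])
    return U @ M @ ctrl[i:i + 4]

# Sample a smooth reference path through a few 2D waypoints.
waypoints = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0], [3.0, 2.0]])
path = np.array([bspline_point(waypoints, s) for s in np.linspace(0, 2, 50)])
```

The basis-matrix form keeps per-point evaluation to a 4x4 product, which is consistent with the real-time, CPU-only planning the abstract describes; collision awareness and NMPC tracking would sit on top of such a reference.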
cs.RO / 5 / 2603.00108
SurgFusion-Net: Diversified Adaptive Multimodal Fusion Network for Surgical Skill Assessment
Abstract
Robotic-assisted surgery (RAS) is established in clinical practice, and automated surgical skill assessment utilizing multimodal data offers transformative potential for surgical analytics and education. However, developing effective multimodal methods remains challenging due to task complexity, limited annotated datasets, and insufficient techniques for cross-modal information fusion. Existing state-of-the-art methods rely exclusively on RGB video and apply only to dry-lab settings, failing to address the significant domain gap between controlled simulation and real clinical cases, where the surgical environment, together with camera and tissue motion, introduces substantial complexity. This work introduces SurgFusion-Net and Divergence Regulated Attention (DRA), an innovative fusion strategy for multimodal surgical skill assessment. We contribute two first-of-their-kind clinical datasets: the RAH-skill dataset containing 279,691 RGB frames from 37 videos of Robot-assisted Hysterectomy (RAH), and the RARP-skill dataset containing 70,661 RGB frames from 33 videos of Robot-Assisted Radical Prostatectomy (RARP). Both datasets include M-GEARS skill annotations, corresponding optical flow, and tool segmentation masks. DRA incorporates adaptive dual attention and diversity-promoting multi-head attention to fuse information from three modalities based on the surgical context, enhancing assessment accuracy and reliability. Validated on the JIGSAWS benchmark and the RAH-skill and RARP-skill datasets, our approach outperforms recent baselines, with SCC improvements of 0.02 in LOSO and 0.04 in LOUO across JIGSAWS tasks, and gains of 0.0538 and 0.0493 on RAH-skill and RARP-skill, respectively.
cs.RO / 6 / 2603.00110
Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Model for Robotic Manipulation
Abstract
The scarcity of large-scale robotic data has motivated the repurposing of foundation models from other modalities for policy learning. In this work, we introduce PhysGen (Learning Physics from Pretrained Video Generation Models), a scalable continuous and sequential world interaction framework that leverages autoregressive video generation to solve robotic manipulation tasks. By treating the pretrained video model as a proxy for a physics simulator, PhysGen models the dynamic interplay between the external environment and robot actions. We introduce a multimodal continuous representation that unifies video and action into shared physical tokens, bridging the gap between discrete video generation and continuous robotic control. This approach enables the seamless transfer of implicit physical knowledge, such as object permanence and dynamics, from video pretraining to downstream manipulation. To ensure efficient convergence, we incorporate causal masking, inverse kinematics, Lookahead Multi-Token Prediction (L-MTP), and key-value (KV) caching. Experimental results on the Libero and ManiSkill benchmarks demonstrate that PhysGen consistently outperforms robust baselines, surpassing OpenVLA and WorldVLA by margins of 13.8% and 8.8%, respectively. Notably, in real-world scenarios, PhysGen matches the performance of large-scale action-pretrained models like $\pi_0$ without requiring prior action-specific pretraining, demonstrating superior capability in physically complex tasks such as grasping transparent objects. These findings validate the potential of extracting physical intuition from pretrained video generators to facilitate generalizable robotic manipulation.
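The abstract does not expose PhysGen's internals, but the KV caching it lists is a standard autoregressive mechanism. A minimal single-head numpy sketch (dimensions and random inputs are illustrative) shows that incremental decoding with an append-only cache reproduces full causal self-attention:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Append-only key/value cache for one attention head."""
    def __init__(self, d):
        self.K = np.zeros((0, d))
        self.V = np.zeros((0, d))

    def step(self, q, k, v):
        # Append the new token's key/value, then attend the query over
        # everything cached so far (causal by construction).
        self.K = np.vstack([self.K, k[None]])
        self.V = np.vstack([self.V, v[None]])
        w = softmax(q @ self.K.T / np.sqrt(len(q)))
        return w @ self.V

rng = np.random.default_rng(0)
d, T = 8, 5
Q, K, V = rng.normal(size=(3, T, d))

# Incremental decoding with the cache...
cache = KVCache(d)
inc = np.stack([cache.step(Q[t], K[t], V[t]) for t in range(T)])

# ...matches full causal attention computed from scratch.
mask = np.triu(np.full((T, T), -np.inf), k=1)
full = softmax(Q @ K.T / np.sqrt(d) + mask) @ V
```

The cache avoids recomputing past keys and values at each generation step, which is the efficiency the abstract credits KV caching for during autoregressive rollout.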
cs.RO / 7 / 2603.00117
PEPA: A Persistently Autonomous Embodied Agent with Personalities
Abstract
Living organisms exhibit persistent autonomy through internally generated goals and self-sustaining behavioral organization, yet current embodied agents remain driven by externally scripted objectives. This dependence on predefined task specifications limits their capacity for long-term deployment in dynamic, unstructured environments where continuous human intervention is impractical. We propose that personality traits provide an intrinsic organizational principle for achieving persistent autonomy. Analogous to genotypic biases shaping biological behavioral tendencies, personalities enable agents to autonomously generate goals and sustain behavioral evolution without external supervision. To realize this, we develop PEPA, a three-layer cognitive architecture that operates through three interacting systems: Sys3 autonomously synthesizes personality-aligned goals and refines them via episodic memory and daily self-reflection; Sys2 performs deliberative reasoning to translate goals into executable action plans; Sys1 grounds the agent in sensorimotor interaction, executing actions and recording experiences. We validate the framework through real-world deployment on a quadruped robot in a multi-floor office building. Operating without reliance on fixed task specifications, the robot autonomously arbitrates between user requests and personality-driven motivations, navigating elevators and exploring environments accordingly. Quantitative analysis across five distinct personality prototypes demonstrates stable, trait-aligned behaviors. The results confirm that personality-driven cognitive architectures enable sustained autonomous operation characteristic of persistent embodied systems. Code and demo videos are available at https://sites.google.com/view/pepa-persistent/.
cs.RO / 8 / 2603.00151
Multiview Progress Prediction of Robot Activities
Abstract
For robots to operate effectively and safely alongside humans, they must be able to understand the progress of ongoing actions. This ability, known as action progress prediction, is critical for tasks ranging from timely assistance to autonomous decision-making. However, modeling action progression in robotics has often been overlooked. Moreover, a single camera may be insufficient for understanding a robot's ego-actions, as self-occlusion can significantly hinder perception and model performance. In this paper, we propose a multi-view architecture for action progress prediction in robot manipulation tasks. Experiments on Mobile ALOHA demonstrate the effectiveness of the proposed approach.
cs.RO / 9 / 2603.00154
Trust in Autonomous Human-Robot Collaboration: Effects of Responsive Interaction Policies
Abstract
Trust plays a central role in human-robot collaboration, yet its formation is rarely examined under the constraints of fully autonomous interaction. This pilot study investigated how interaction policy influences trust during in-person collaboration with a social robot operating without Wizard-of-Oz control or scripted repair. Participants completed a multi-stage collaborative task with a mobile robot that autonomously managed spoken-language dialogue, affect inference, and task progression. Two interaction policies were compared: a responsive policy, in which the robot proactively adapted its dialogue and assistance based on inferred interaction state, and a neutral, reactive policy, in which the robot provided only direct, task-relevant responses when prompted. Responsive interaction was associated with significantly higher post-interaction trust under viable communication conditions, despite no reliable differences in overall task accuracy. Sensitivity analyses indicated that affective and experiential components of trust were more sensitive to communication breakdown than evaluative judgments of reliability, and that as language-mediated interaction degraded, the trust advantage associated with responsiveness attenuated and ratings became less clearly interpretable as calibrated evaluations of collaborative competence. These findings suggest that trust in autonomous human-robot interaction emerges from process-level interaction dynamics and operates within constraints imposed by communication viability, highlighting the importance of evaluating trust under real autonomy conditions when designing interactive robotic systems.
cs.RO / 10 / 2603.00167
EgoMoD: Predicting Global Maps of Dynamics from Local Egocentric Observations
Abstract
Efficient navigation in dynamic environments requires anticipating how motion patterns evolve beyond the robot's immediate perceptual range, enabling preemptive rather than purely reactive planning in crowded scenes. Maps of Dynamics (MoDs) offer a structured representation of motion tendencies in space useful for long-term global planning, but constructing them traditionally requires global environment observations over extended periods of time. We introduce EgoMoD, the first approach that learns to predict future MoDs directly from short egocentric video clips collected during robot operation. Our method learns to infer environment-wide motion tendencies from local dynamic cues using a video- and pose-conditioned architecture trained with MoDs computed from external observations as privileged supervision, allowing local observations to serve as predictive signals of global motion structure. Thanks to this, EgoMoD can forecast future motion dynamics over the whole environment rather than merely extrapolate past patterns within the robot's field of view. Experiments in large simulated environments show that EgoMoD accurately predicts future MoDs under limited observability, while evaluation with real images showcases its zero-shot transferability to real systems.
cs.RO / 11 / 2603.00182
Embedding Morphology into Transformers for Cross-Robot Policy Learning
Abstract
Cross-robot policy learning, that is, training a single policy to perform well across multiple embodiments, remains a central challenge in robot learning. Transformer-based policies, such as vision-language-action (VLA) models, are typically embodiment-agnostic and must infer kinematic structure purely from observations, which can reduce robustness across embodiments and even limit performance within a single embodiment. We propose an embodiment-aware transformer policy that injects morphology via three mechanisms: (1) kinematic tokens that factorize actions across joints and compress time through per-joint temporal chunking; (2) a topology-aware attention bias that encodes kinematic topology as an inductive bias in self-attention, encouraging message passing along kinematic edges; and (3) joint-attribute conditioning that augments topology with per-joint descriptors to capture semantics beyond connectivity. Across a range of embodiments, this structured integration consistently improves performance over a vanilla pi0.5 VLA baseline, indicating improved robustness both within an embodiment and across embodiments.
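The abstract does not give the exact form of the topology-aware attention bias; a minimal numpy sketch of one plausible reading, where an additive bias on the attention logits decays linearly with hop distance along the kinematic tree (the linear decay, gain, and toy serial chain are assumptions):

```python
import numpy as np

def hop_distances(edges, n):
    """All-pairs hop counts on the kinematic graph (Floyd-Warshall)."""
    D = np.full((n, n), np.inf)
    np.fill_diagonal(D, 0.0)
    for i, j in edges:
        D[i, j] = D[j, i] = 1.0
    for k in range(n):
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D

def topology_bias(edges, n, decay=1.0):
    # Joints closer along the kinematic chain get a larger (less negative)
    # additive bias on the attention logits.
    return -decay * hop_distances(edges, n)

# A toy 5-joint serial chain: 0-1-2-3-4.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
bias = topology_bias(edges, 5)

# The bias is added to logits before softmax, steering attention
# toward kinematically adjacent joints.
logits = np.zeros((5, 5)) + bias
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
```

With zero content logits, attention mass falls off monotonically with kinematic distance, which is the "message passing along kinematic edges" inductive bias the abstract describes.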
cs.RO / 12 / 2603.00319
Modeling PWM-Time-SOC Interaction in a Simulated Robot
Abstract
Accurate prediction of battery state of charge (SOC) is needed for autonomous robots to plan movements without exhausting the available power. This work develops a physics- and data-informed model from a simulation that predicts SOC depletion as a function of time and PWM duty cycle for a simulated 4-wheel Arduino robot. A forward-motion simulation incorporating motor electrical characteristics (resistance, inductance, back-EMF, torque constant) and mechanical dynamics (mass, drag, rolling resistance, wheel radius) was used to generate SOC time-series data across PWM values from 1% to 100%. Sparse Identification of Nonlinear Dynamics (SINDy), combined with least-squares regression, was applied to construct a unified nonlinear model that captures SOC(t, p). The framework allows for energy-aware planning for similar robots and can be extended to incorporate arbitrary initial SOC levels and environment-dependent parameters for real-world deployment.
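The abstract names SINDy with least-squares regression but not the library or data; a minimal sketch of sequentially thresholded least squares on a hand-picked candidate library, fitted to a synthetic linear-in-time discharge model (the library terms, synthetic coefficients, and threshold are illustrative assumptions, not the paper's):

```python
import numpy as np

# Synthetic "ground truth": discharge rate grows with PWM duty cycle p.
# (Coefficients here are made up for illustration.)
def soc_series(p, t):
    return 1.0 - (0.002 * p + 0.0004 * p**2) * t

t = np.linspace(0, 10, 101)
duties = [0.2, 0.5, 0.8, 1.0]

# Build a candidate-function library Theta(t, p) and the SOC targets.
rows, targets = [], []
for p in duties:
    soc = soc_series(p, t)
    for ti, si in zip(t, soc):
        rows.append([1.0, ti, p * ti, p**2 * ti, ti**2, p * ti**2])
        targets.append(si)
Theta, y = np.array(rows), np.array(targets)

# SINDy-style sequentially thresholded least squares: fit, zero out
# small coefficients, refit on the surviving library terms.
xi, _, _, _ = np.linalg.lstsq(Theta, y, rcond=None)
for _ in range(5):
    small = np.abs(xi) < 1e-4
    xi[small] = 0.0
    big = ~small
    if big.any():
        xi[big], _, _, _ = np.linalg.lstsq(Theta[:, big], y, rcond=None)
```

Because the synthetic data lies exactly in the span of three library terms, thresholding prunes the spurious columns and recovers the generating coefficients, which is the sparse SOC(t, p) model the abstract describes in spirit.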
cs.RO / 13 / 2603.00325
Geometric Look-Angle Shaping Strategy for Enclosed Inspection
Abstract
This paper introduces inspection through GLASS, a Geometric Look-Angle Shaping Strategy for enclosed regions using unmanned aerial vehicles. The vehicle's guidance command is constructed through a bounded, geometry-consistent shaping of the look angle relative to a desired standoff path. By embedding a smooth, hyperbolic-tangent-type shaping function within a polar geometric framework, GLASS ensures global existence of the guidance dynamics and avoids the far-field limitations inherent to conventional formulations. Lyapunov stability analysis establishes asymptotic convergence to a prescribed inspection standoff under explicit curvature feasibility conditions, along with analytical settling-time characteristics. The proposed strategy incorporates maximum turn-rate constraints without inducing singularities throughout the workspace. High-fidelity six-degree-of-freedom quadrotor simulations demonstrate the effectiveness of GLASS in representative enclosed inspection scenarios, highlighting a practically viable guidance framework for autonomous enclosed inspection missions.
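The paper's exact shaping law is not given in the abstract; a minimal sketch of what a hyperbolic-tangent-type, bounded look-angle command could look like (the normalization by standoff distance, the gain, and the saturation limit are illustrative assumptions):

```python
import numpy as np

def shaped_look_angle(r, r_d, delta_max=np.radians(60), k=1.5):
    """Bounded look-angle command from standoff error, tanh-shaped.

    r:   current radial distance to the inspection target
    r_d: desired standoff distance
    The command is smooth, saturates at +/- delta_max, and vanishes
    when the vehicle sits exactly on the standoff circle.
    """
    return delta_max * np.tanh(k * (r - r_d) / r_d)

# Sweep radial errors: outside the standoff -> steer inward,
# inside -> steer outward, always within the saturation bound.
r_d = 5.0
radii = np.linspace(1.0, 15.0, 200)
cmds = shaped_look_angle(radii, r_d)
```

The tanh keeps the command globally bounded and smooth for any radial error, which mirrors the abstract's claims of global existence and avoidance of far-field blow-up in conventional formulations.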
cs.RO / 14 / 2603.00338
Layered Safety: Enhancing Autonomous Collision Avoidance via Multistage CBF Safety Filters
Abstract
This paper presents a general end-to-end framework for constructing robust and reliable layered safety filters that can be leveraged to perform dynamic collision avoidance over a broad range of applications using only local perception data. Given a robot-centric point cloud, we begin by constructing an occupancy map which is used to synthesize a Poisson safety function (PSF). The resultant PSF is employed as a control barrier function (CBF) within two distinct safety filtering stages. In the first stage, we propose a predictive safety filter to compute optimal safe trajectories based on nominal potentially-unsafe commands. The resultant short-term plans are constrained to satisfy the CBF condition along a finite prediction horizon. In the second stage, instantaneous velocity commands are further refined by a real-time CBF-based safety filter and tracked by the full-order low-level robot controller. Assuming accurate tracking of velocity commands, we obtain formal guarantees of safety for the full-order system. We validate the optimality and robustness of our multistage architecture, in comparison to traditional single-stage safety filters, via a detailed Pareto analysis. We further demonstrate the effectiveness and generality of our collision avoidance methodology on multiple legged robot platforms across a variety of real-world dynamic scenarios.
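The second-stage filter in the abstract is a real-time CBF-based safety filter on velocity commands. For a single affine constraint the underlying QP has a closed-form projection, sketched below (the distance-based barrier, single point obstacle, and class-K gain are illustrative assumptions, not the paper's Poisson safety function):

```python
import numpy as np

def cbf_filter(u_nom, x, obstacle, d_safe, alpha=1.0):
    """Minimally modify a nominal velocity command to satisfy the CBF
    condition  dh/dt >= -alpha * h  for  h(x) = ||x - o||^2 - d_safe^2.

    With one affine constraint, the QP min ||u - u_nom||^2 has the
    closed-form projection solution below (no solver needed).
    """
    e = x - obstacle
    h = e @ e - d_safe**2
    a = 2.0 * e                 # gradient of h: dh/dt = a . u
    b = -alpha * h              # required lower bound on dh/dt
    if a @ u_nom >= b:
        return u_nom            # nominal command is already safe
    return u_nom + (b - a @ u_nom) / (a @ a) * a

x = np.array([2.0, 0.0])        # robot position
obs = np.array([0.0, 0.0])      # obstacle center
u_nom = np.array([-1.0, 0.0])   # nominal command heads straight at it
u_safe = cbf_filter(u_nom, x, obs, d_safe=1.5)
```

When the filter activates, the returned command makes the CBF constraint exactly tight, slowing the approach just enough; benign commands pass through unchanged, which is the "minimally invasive" behavior safety filters are meant to have.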
cs.RO / 15 / 2603.00351
Acoustic Sensing for Universal Jamming Grippers
Abstract
Universal jamming grippers excel at grasping unknown objects due to their compliant bodies. Traditional tactile sensors can compromise this compliance, reducing grasping performance. We present acoustic sensing as a form of morphological sensing, where the gripper's soft body itself becomes the sensor. A speaker and microphone are placed inside the gripper cavity, away from the deformable membrane, fully preserving compliance. Sound propagates through the gripper and object, encoding object properties, which are then reconstructed via machine learning. Our sensor achieves high spatial resolution in sensing object size (2.6 mm error) and orientation (0.6 deg error), remains robust to external noise levels of 80 dBA, and discriminates object materials (up to 100% accuracy) and 16 everyday objects (85.6% accuracy). We validate the sensor in a realistic tactile object sorting task, achieving 53 minutes of uninterrupted grasping and sensing, confirming the preserved grasping performance. Finally, we demonstrate that disentangled acoustic representations can be learned, improving robustness to irrelevant acoustic variations.
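The paper's learned reconstruction pipeline is not specified in the abstract; as a toy sketch of the general idea, the following uses log-magnitude spectra as acoustic features and a nearest-centroid classifier on synthetic signals (the resonance model, class frequencies, and sample counts are invented for illustration):

```python
import numpy as np

def spectrum(signal):
    """Log-magnitude spectrum as a fixed-length acoustic feature."""
    return np.log1p(np.abs(np.fft.rfft(signal)))

def make_signal(f, rng, n=1024, fs=8000.0):
    # Toy stand-in for the gripper's acoustic response: a resonance at
    # frequency f plus noise. Real responses would come from the mic.
    t = np.arange(n) / fs
    return np.sin(2 * np.pi * f * t) + 0.3 * rng.normal(size=n)

rng = np.random.default_rng(1)
classes = {"small": 400.0, "medium": 900.0, "large": 1600.0}

# "Train": average spectra per class -> one centroid per object class.
centroids = {
    name: np.mean([spectrum(make_signal(f, rng)) for _ in range(20)], axis=0)
    for name, f in classes.items()
}

def classify(signal):
    feat = spectrum(signal)
    return min(centroids, key=lambda c: np.linalg.norm(feat - centroids[c]))

pred = classify(make_signal(900.0, rng))
```

A real pipeline would replace the nearest-centroid step with the learned (and, per the abstract, disentangled) representations, but the feature path from sound to object property is the same in outline.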
cs.RO / 16 / 2603.00420
TMR-VLA: Vision-Language-Action Model for Magnetic Motion Control of a Tri-leg Silicone-Based Soft Robot
Abstract
In in-vivo environments, magnetically actuated soft robots offer advantages such as wireless operation and precise control, showing promising potential for painless detection and therapeutic procedures. We developed a tri-leg magnetically driven soft robot (TMR) whose multi-legged design enables more flexible gaits and diverse motion patterns. For this reconfigurable silicone soft robot, navigation can be decomposed into sequential motions, such as squatting, rotating, lifting a leg, and walking; its motion and behavior depend on its bending shape. To bridge motion-type descriptions and specific low-level voltage control, we introduce TMR-VLA, an end-to-end multimodal system for a tri-leg magnetic soft robot capable of performing hybrid motion types, which is promising for developing navigation ability by adapting the robot's shape to language-constrained motion types. TMR-VLA deploys the embodied endoluminal localization ability of EndoVLA and fuses sequential frames and natural language commands as input. The low-level voltage output is generated from the current observation state and the specified motion-type description. The results show that TMR-VLA can predict how the voltage applied to the TMR will change the dynamics of the silicone soft robot, reaching a 74% average success rate.
cs.RO / 17 / 2603.00446
HydroShear: Hydroelastic Shear Simulation for Tactile Sim-to-Real Reinforcement Learning
Abstract
In this paper, we address the problem of tactile sim-to-real policy transfer for contact-rich tasks. Existing methods primarily focus on vision-based sensors and emphasize image rendering quality while providing overly simplistic models of force and shear. Consequently, these models exhibit a large sim-to-real gap for many dexterous tasks. Here, we present HydroShear, a non-holonomic hydroelastic tactile simulator that advances the state of the art by modeling: a) stick-slip transitions, b) path-dependent force and shear build-up, and c) full SE(3) object-sensor interactions. HydroShear extends hydroelastic contact models using Signed Distance Functions (SDFs) to track the displacements of the on-surface points of an indenter during physical interaction with the sensor membrane. Our approach generates physics-based, computationally efficient force fields from arbitrary watertight geometries while remaining agnostic to the underlying physics engine. In experiments with GelSight Minis, HydroShear more faithfully reproduces real tactile shear compared to existing methods. This fidelity enables zero-shot sim-to-real transfer of reinforcement learning policies across four tasks: peg insertion, bin packing, book shelving for insertion, and drawer pulling for fine gripper control under slip. Our method achieves a 93% average success rate, outperforming policies trained on tactile images (34%) and alternative shear simulation methods (58%-61%).
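HydroShear's SDF-based contact tracking is more involved than the abstract conveys; as a minimal sketch of the basic ingredient, the following combines a sphere SDF with a penalty-style normal force on penetrating points (the sphere geometry and stiffness are illustrative assumptions, not the paper's hydroelastic model, and shear is omitted):

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance from query points to a sphere (negative inside)."""
    return np.linalg.norm(points - center, axis=-1) - radius

def penetration_forces(points, center, radius, stiffness=1e3):
    # Penalty-style normal force field: points inside the sphere are
    # pushed out along the SDF gradient, proportional to depth.
    d = sphere_sdf(points, center, radius)
    normals = (points - center) / np.linalg.norm(points - center, axis=-1, keepdims=True)
    depth = np.maximum(-d, 0.0)
    return stiffness * depth[:, None] * normals

center, radius = np.zeros(3), 1.0
pts = np.array([
    [0.5, 0.0, 0.0],   # inside: pushed along +x
    [2.0, 0.0, 0.0],   # outside: no force
])
F = penetration_forces(pts, center, radius)
```

Tracking how such surface points displace over a contact trajectory, rather than evaluating them instantaneously, is what gives the path-dependent shear build-up the abstract highlights.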
cs.RO / 18 / 2603.00455
Test-Driven Agentic Framework for Reliable Robot Controllers
Abstract
In this work, we present a test-driven, agentic framework for synthesizing a deployable low-level robot controller for navigation tasks. Given a 2D map with an image of an ultrasonic sensor-based robot, or a 3D robotic simulation environment, our framework iteratively refines the generated controller code using diagnostic feedback from structured test suites to achieve task success. We propose a dual-tier repair strategy to refine the generated code that alternates between prompt-level refinement and direct code editing. We evaluate the approach across 2D navigation tasks and 3D navigation in the Webots simulator. Experimental results show that test-driven synthesis substantially improves controller reliability and robustness over one-shot controller generation, especially when the initial prompt is underspecified. The source code and demonstration videos are available at: https://shivanshutripath.github.io/robotic_controller.github.io.
Chinese Translation
在本研究中,我们提出了一种基于测试驱动的自主框架,用于合成可部署的低级机器人控制器,以执行导航任务。给定一个包含超声波传感器机器人图像的二维地图或三维机器人仿真环境,我们的框架通过使用结构化测试套件的诊断反馈,迭代地优化生成的控制器代码,以实现任务成功。我们提出了一种双层修复策略,以交替进行提示级别的优化和直接代码编辑,从而改进生成的代码。我们在Webots仿真器中评估了该方法在二维导航任务和三维导航中的表现。实验结果表明,与一次性控制器生成相比,基于测试驱动的合成显著提高了控制器的可靠性和鲁棒性,尤其是在初始提示不明确的情况下。源代码和演示视频可在以下网址获取:https://shivanshutripath.github.io/robotic_controller.github.io。
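The iterative refinement loop with a dual-tier repair strategy described above can be sketched as follows. The function names and the even/odd alternation rule are assumptions for illustration; the paper's actual switching policy may differ:

```python
def synthesize_controller(generate, run_tests, max_iters=5):
    """Test-driven refinement loop (sketch of the dual-tier repair idea).

    `generate(feedback)` returns candidate controller code; `run_tests(code)`
    returns a list of diagnostic strings (empty list = all tests pass).
    Even iterations refine at the prompt level, odd iterations request a
    direct code edit -- alternating the two repair tiers.
    """
    feedback = None
    for i in range(max_iters):
        tier = "prompt-refine" if i % 2 == 0 else "code-edit"
        code = generate({"tier": tier, "diagnostics": feedback})
        feedback = run_tests(code)
        if not feedback:            # structured test suite passed
            return code
    raise RuntimeError("no passing controller within the iteration budget")
```

The diagnostic feedback from the structured test suite is what distinguishes this from one-shot generation: each retry is conditioned on the concrete failure, not just the original prompt.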
cs.RO / 19 / 2603.00500
Zero-Shot Robotic Manipulation via 3D Gaussian Splatting-Enhanced Multimodal Retrieval-Augmented Generation
通过3D高斯点云增强的多模态检索增强生成实现零样本机器人操控
Abstract
Existing end-to-end approaches of robotic manipulation often lack generalization to unseen objects or tasks due to limited data and poor interpretability. While recent Multimodal Large Language Models (MLLMs) demonstrate strong commonsense reasoning, they struggle with geometric and spatial understanding required for pose prediction. In this paper, we propose RobMRAG, a 3D Gaussian Splatting-Enhanced Multimodal Retrieval-Augmented Generation (MRAG) framework for zero-shot robotic manipulation. Specifically, we construct a multi-source manipulation knowledge base containing object contact frames, task completion frames, and pose parameters. During inference, a Hierarchical Multimodal Retrieval module first employs a three-priority hybrid retrieval strategy to find task-relevant object prototypes, then selects the geometrically closest reference example based on pixel-level similarity and Instance Matching Distance (IMD). We further introduce a 3D-Aware Pose Refinement module based on 3D Gaussian Splatting into the MRAG framework, which aligns the pose of the reference object to the target object in 3D space. The aligned results are reprojected onto the image plane and used as input to the MLLM to enhance the generation of the final pose parameters. Extensive experiments show that on a test set containing 30 categories of household objects, our method improves the success rate by 7.76% compared to the best-performing zero-shot baseline under the same setting, and by 6.54% compared to the state-of-the-art supervised baseline. Our results validate that RobMRAG effectively bridges the gap between high-level semantic reasoning and low-level geometric execution, enabling robotic systems that generalize to unseen objects while remaining inherently interpretable.
Chinese Translation
现有的端到端机器人操控方法由于数据有限和可解释性差,往往缺乏对未见物体或任务的泛化能力。尽管近期的多模态大型语言模型(MLLMs)展现出强大的常识推理能力,但在姿态预测所需的几何和空间理解方面仍显不足。本文提出了RobMRAG,一个基于3D高斯点云增强的多模态检索增强生成(MRAG)框架,用于零样本机器人操控。具体而言,我们构建了一个多源操控知识库,包含物体接触帧、任务完成帧和姿态参数。在推理过程中,分层多模态检索模块首先采用三优先混合检索策略找到与任务相关的物体原型,然后基于像素级相似性和实例匹配距离(IMD)选择几何上最接近的参考示例。我们进一步在MRAG框架中引入了基于3D高斯点云的3D感知姿态精细化模块,将参考物体的姿态对齐到目标物体在3D空间中的位置。对齐结果被重新投影到图像平面,并作为输入提供给MLLM,以增强最终姿态参数的生成。大量实验表明,在包含30类家用物体的测试集上,我们的方法在相同设置下相比于最佳零样本基线提高了7.76%的成功率,相比于最先进的监督基线提高了6.54%。我们的结果验证了RobMRAG有效地弥合了高层语义推理与低层几何执行之间的差距,使机器人系统能够对未见物体进行泛化,同时保持内在的可解释性。
cs.RO / 20 / 2603.00507
Optimal-Horizon Social Robot Navigation in Heterogeneous Crowds
异质人群中社交机器人导航的最优时间视野
Abstract
Navigating social robots in dense, dynamic crowds is challenging due to environmental uncertainty and complex human-robot interactions. While Model Predictive Control (MPC) offers strong real-time performance, its reliance on a fixed prediction horizon limits adaptability to changing environments and social dynamics. Furthermore, most MPC approaches treat pedestrians as homogeneous obstacles, ignoring social heterogeneity and cooperative or adversarial interactions, which often causes the Frozen Robot Problem in partially observable real-world environments. In this paper, we identify the planning horizon as a socially conditioned decision variable rather than a fixed design choice. Building on this insight, we propose an optimal-horizon social navigation framework that optimizes MPC foresight online according to inferred social context. A spatio-temporal Transformer infers pedestrian cooperation attributes from local trajectory observations, which serve as social priors for a reinforcement learning policy that optimally selects the prediction horizon under a task-driven objective. The resulting horizon-aware MPC incorporates socially conditioned safety constraints to balance navigation efficiency and interaction safety. Extensive simulations and real-world robot experiments demonstrate that optimal foresight selection is critical for robust social navigation in partially observable crowds. Compared to state-of-the-art baselines, the proposed approach achieves a 6.8% improvement in success rate, reduces collisions by 50%, and shortens navigation time by 19%, with a low timeout rate of 0.8%, validating the necessity of socially optimal planning horizons for efficient and safe robot navigation in crowded environments. Code and videos will be made available; the link is withheld while the paper is under review.
Chinese Translation
在密集、动态的人群中导航社交机器人面临环境不确定性和复杂的人机交互的挑战。尽管模型预测控制(Model Predictive Control, MPC)提供了强大的实时性能,但其对固定预测时间视野的依赖限制了对变化环境和社会动态的适应能力。此外,大多数MPC方法将行人视为同质障碍物,忽视了社会异质性以及合作或对抗的互动,这常常导致在部分可观察的真实环境中出现"冻结机器人问题"(Frozen Robot Problem)。在本文中,我们将规划时间视野视为一个社会条件化的决策变量,而不是固定的设计选择。基于这一见解,我们提出了一种最优时间视野社交导航框架,该框架根据推断的社会背景在线优化MPC的前瞻性。时空Transformer从局部轨迹观测中推断行人的合作属性,这些属性作为强化学习策略的社会先验,最优选择在任务驱动目标下的预测时间视野。所提出的时间视野感知MPC结合了社会条件化的安全约束,以平衡导航效率和互动安全。大量的仿真和真实机器人实验表明,最优前瞻性选择对于在部分可观察的人群中实现稳健的社交导航至关重要。与最先进的基线相比,所提出的方法在成功率上提高了6.8%,碰撞减少了50%,导航时间缩短了19%,且超时率仅为0.8%,验证了在拥挤环境中实现高效和安全机器人导航所需的社会最优规划时间视野。代码和视频链接在论文审稿期间暂不公开。
cs.RO / 21 / 2603.00555
Planning Method for Skill-Based Control of Robots Using a PLC as Skill Trigger
基于技能控制的机器人规划方法:使用PLC作为技能触发器
Abstract
Skill-based programming of robots provides a flexible approach for automation. Existing solutions neglect the optimization of motion sequences, leading to inefficiencies in execution. This work introduces a planning method that enhances skill-based robot programming by integrating motion sequence optimization. This optimization gives rise to a new MoveContinuousSkill. The software for executing the MoveContinuousSkill is implemented on a Programmable Logic Controller and applied across multiple robotic systems. Experimental results demonstrate a significant improvement in execution time through the optimized motion sequences.
Chinese Translation
基于技能的机器人编程提供了一种灵活的自动化方法。现有解决方案忽视了运动序列的优化,导致执行效率低下。本研究提出了一种规划方法,通过整合运动序列优化来增强基于技能的机器人编程。这一优化产生了新的 MoveContinuousSkill。执行 MoveContinuousSkill 的软件在可编程逻辑控制器(Programmable Logic Controller, PLC)上实现,并应用于多个机器人系统。实验结果表明,优化后的运动序列显著缩短了执行时间。
cs.RO / 22 / 2603.00592
LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models
LangGap:诊断和弥合视觉-语言-动作模型中的语言差距
Abstract
Vision-Language-Action (VLA) models achieve over 95% success on standard benchmarks. However, through systematic experiments, we find that current state-of-the-art VLA models largely ignore language instructions. Prior work lacks: (1) systematic semantic perturbation diagnostics, (2) a benchmark that forces language understanding by design, and (3) linguistically diverse training data. This paper constructs the LangGap benchmark, based on a four-dimensional semantic perturbation method -- varying instruction semantics while keeping the tabletop layout fixed -- revealing language understanding deficits in π0.5. Existing benchmarks like LIBERO assign only one task per layout, underutilizing available objects and target locations; LangGap fully diversifies pick-and-place tasks under identical layouts, forcing models to truly understand language. Experiments show that targeted data augmentation can partially close the language gap -- success rate improves from 0% to 90% with single-task training, and 0% to 28% with multi-task training. However, as semantic diversity of extended tasks increases, model learning capacity proves severely insufficient; even trained tasks perform poorly. This reveals a fundamental challenge for VLA models in understanding diverse language instructions -- precisely the long-term value of LangGap.
Chinese Translation
视觉-语言-动作(VLA)模型在标准基准测试中取得了超过95%的成功率。然而,通过系统实验,我们发现当前最先进的VLA模型在很大程度上忽视了语言指令。以往的研究存在以下不足:(1)缺乏系统的语义扰动诊断,(2)缺乏设计上强制语言理解的基准测试,以及(3)缺乏语言多样性的训练数据。本文构建了LangGap基准,基于一种四维语义扰动方法——在保持桌面布局固定的情况下,变化指令语义——揭示了π0.5在语言理解方面的不足。现有基准如LIBERO在每个布局下仅分配一个任务,未充分利用可用对象和目标位置;而LangGap在相同布局下全面多样化了取放任务,迫使模型真正理解语言。实验表明,针对性的数据增强可以部分弥补语言差距——在单任务训练下,成功率从0%提高到90%,在多任务训练下,从0%提高到28%。然而,随着扩展任务的语义多样性增加,模型的学习能力显得严重不足;即使是经过训练的任务表现也不佳。这揭示了VLA模型在理解多样化语言指令方面的根本挑战——这正是LangGap的长期价值所在。
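The core benchmark-construction idea, fully diversifying pick-and-place tasks under one fixed layout, reduces to enumerating object-target pairings. A minimal sketch (the template string and names are illustrative, not from the benchmark):

```python
from itertools import product

def perturb_instructions(objects, targets, template="put the {o} on the {t}"):
    """Fully diversify pick-and-place tasks under one fixed layout:
    every distinct (object, target) pairing becomes its own instruction,
    so a model cannot succeed by ignoring the language."""
    return [template.format(o=o, t=t)
            for o, t in product(objects, targets) if o != t]
```

Because the scene is identical across all generated tasks, any success-rate gap between instructions must come from language understanding rather than visual cues.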
cs.RO / 23 / 2603.00597
AI-IO: An Aerodynamics-Inspired Real-Time Inertial Odometry for Quadrotors
AI-IO:一种受气动学启发的四旋翼实时惯性测距
Abstract
Inertial Odometry (IO) has gained attention in quadrotor applications due to its sole reliance on inertial measurement units (IMUs), attributed to its lightweight design, low cost, and robust performance across diverse environments. However, most existing learning-based inertial odometry systems for quadrotors either use only IMU data or include additional dynamics-related inputs such as thrust, but still lack a principled formulation of the underlying physical model to be learned. This lack of interpretability hampers the model's ability to generalize and often limits its accuracy. In this work, we approach the inertial odometry learning problem from a different perspective. Inspired by the aerodynamics model and IMU measurement model, we identify the key physical quantity required for inertial odometry -- rotor speed measurements -- and design a transformer-based inertial odometry model. By incorporating rotor speed measurements, the proposed model improves velocity prediction accuracy by 36.9%. Furthermore, the transformer architecture more effectively exploits temporal dependencies for denoising and aerodynamics modeling, yielding an additional 22.4% accuracy gain over previous results. To support evaluation, we also provide a real-world quadrotor flight dataset capturing IMU measurements and rotor speed for high-speed motion. Finally, combined with an uncertainty-aware extended Kalman filter (EKF), our framework is validated across multiple datasets and real-time systems, demonstrating superior accuracy, generalization, and real-time performance. We share the code and data to promote further research (https://github.com/SJTU-ViSYS-team/AI-IO).
Chinese Translation
惯性测距(IO)因其仅依赖惯性测量单元(IMU),在四旋翼应用中受到关注,这归因于其轻量化设计、低成本以及在多种环境下的稳健性能。然而,大多数现有的基于学习的四旋翼惯性测距系统要么仅使用IMU数据,要么包含额外的与动力学相关的输入,如推力,但仍然缺乏对待学习的基础物理模型的原则性表述。这种缺乏可解释性的问题限制了模型的泛化能力,并常常影响其准确性。在本研究中,我们从不同的角度来解决惯性测距学习问题。受气动学模型和IMU测量模型的启发,我们识别出惯性测距所需的关键物理量——转子速度测量,并设计了一种基于Transformer的惯性测距模型。通过结合转子速度测量,所提模型将速度预测准确性提高了36.9%。此外,Transformer架构更有效地利用时间依赖性进行去噪和气动建模,相较于之前的结果,额外提高了22.4%的准确性。为了支持评估,我们还提供了一个真实世界的四旋翼飞行数据集,捕获了高速运动下的IMU测量和转子速度。最后,结合不确定性感知的扩展卡尔曼滤波器(EKF),我们的框架在多个数据集和实时系统中得到了验证,展示了卓越的准确性、泛化能力和实时性能。我们分享代码和数据以促进进一步的研究(https://github.com/SJTU-ViSYS-team/AI-IO)。
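The aerodynamics-inspired link between rotor speed and velocity can be seen in the well-known rotor-drag model, in which lateral specific force is roughly proportional to the product of total rotor speed and body-frame velocity. A minimal sketch of inverting that relation (the drag coefficient is an illustrative value, not from the paper):

```python
import numpy as np

def velocity_from_rotor_drag(acc_lateral, omega_sum, k_drag=5.7e-4):
    """Invert the linear rotor-drag model a ≈ -k * Σω * v to recover
    body-frame lateral velocity from the IMU-measured specific force
    and the summed rotor speeds. k_drag is an assumed constant that
    would be identified per vehicle in practice."""
    return -np.asarray(acc_lateral, dtype=float) / (k_drag * omega_sum)
```

This is why rotor speed is the key physical quantity: without Σω, the mapping from measured acceleration to velocity is unobservable and must be memorized by the network instead of derived.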
cs.RO / 24 / 2603.00600
I-Perceive: A Foundation Model for Active Perception with Language Instructions
I-Perceive:一种基于语言指令的主动感知基础模型
Abstract
Active perception, the ability of a robot to proactively adjust its viewpoint to acquire task-relevant information, is essential for robust operation in unstructured real-world environments. While critical for downstream tasks such as manipulation, existing approaches have largely been confined to local settings (e.g., table-top scenes) with fixed perception objectives (e.g., occlusion reduction). Addressing active perception with open-ended intents in large-scale environments remains an open challenge. To bridge this gap, we propose I-Perceive, a foundation model for active perception conditioned on natural language instructions, designed for mobile manipulators and indoor environments. I-Perceive predicts camera views that follow open-ended language instructions, based on image-based scene contexts. By fusing a Vision-Language Model (VLM) backbone with a geometric foundation model, I-Perceive bridges semantic and geometric understanding, thus enabling effective reasoning for active perception. We train I-Perceive on a diverse dataset comprising real-world scene-scanning data and simulation data, both processed via an automated and scalable data generation pipeline. Experiments demonstrate that I-Perceive significantly outperforms state-of-the-art VLMs in both prediction accuracy and instruction following of generated camera views, and exhibits strong zero-shot generalization to novel scenes and tasks.
Chinese Translation
主动感知是指机器人主动调整其视角以获取与任务相关信息的能力,这对于在非结构化的真实世界环境中进行稳健操作至关重要。虽然这对于诸如操作等下游任务至关重要,但现有的方法在很大程度上局限于局部环境(例如,桌面场景)和固定的感知目标(例如,遮挡减少)。在大规模环境中以开放式意图解决主动感知仍然是一个未解的挑战。为了解决这一问题,我们提出了I-Perceive,这是一种基于自然语言指令的主动感知基础模型,旨在用于移动操纵器和室内环境。I-Perceive根据基于图像的场景上下文预测遵循开放式语言指令的摄像机视角。通过将视觉-语言模型(Vision-Language Model, VLM)主干与几何基础模型相结合,I-Perceive在语义和几何理解之间架起了桥梁,从而实现了主动感知的有效推理。我们在一个多样化的数据集上训练I-Perceive,该数据集包含真实世界场景扫描数据和通过自动化和可扩展的数据生成管道处理的模拟数据。实验表明,I-Perceive在生成摄像机视角的预测准确性和指令遵循方面显著优于最先进的VLM,并且在新场景和任务上表现出强大的零样本泛化能力。
cs.RO / 25 / 2603.00615
TGM-VLA: Task-Guided Mixup for Sampling-Efficient and Robust Robotic Manipulation
TGM-VLA:任务引导的混合采样用于高效且稳健的机器人操控
Abstract
The performance of robotic imitation learning is fundamentally limited by data quality and training strategies. Prevalent sampling strategies on RLBench suffer from severe keyframe redundancy and imbalanced temporal distribution, leading to inefficient memory usage and unstable optimization. Moreover, reprojecting point clouds onto multi-view images with a black background--while more efficient than voxel-based methods--often causes dark objects to be indistinguishable and hard to manipulate. In this work, we propose a novel holistic framework that significantly improves both model performance and training efficiency. First, we redesign and optimize the keyframe sampling strategy, reducing memory consumption by 80% and accelerating training speed by 5x. Second, we augment the model with a color inversion projection branch--a simple yet effective module that resolves the ambiguity of dark objects. Finally, we propose a task-guided mixup technique that dynamically fuses point clouds and action heatmaps according to task instructions, greatly improving robustness to distractors and performance in multi-goal scenarios. Extensive experiments demonstrate that our method achieves state-of-the-art performance with a 90.5% success rate on RLBench and 68.8% on the COLOSSEUM benchmark under challenging interference conditions. Our code and checkpoints are available at https://github.com/PuFanqi23/TGM-VLA.
Chinese Translation
机器人的模仿学习性能在根本上受到数据质量和训练策略的限制。RLBench上普遍采用的采样策略存在严重的关键帧冗余和不平衡的时间分布,导致内存使用效率低下和优化不稳定。此外,将点云重新投影到黑色背景的多视图图像上——虽然比基于体素的方法更高效——常常导致深色物体难以辨识且难以操控。在本研究中,我们提出了一种新颖的整体框架,显著提高了模型性能和训练效率。首先,我们重新设计并优化了关键帧采样策略,将内存消耗降低了80%,并将训练速度提高了5倍。其次,我们通过一个颜色反转投影分支增强了模型——这是一个简单而有效的模块,解决了深色物体的模糊性。最后,我们提出了一种任务引导的混合技术,根据任务指令动态融合点云和动作热图,大大提高了对干扰物的鲁棒性以及在多目标场景中的表现。大量实验表明,我们的方法在RLBench上达到了90.5%的成功率,在COLOSSEUM基准测试中达到了68.8%的成功率,表现出色,尤其在具有挑战性的干扰条件下。我们的代码和检查点可在https://github.com/PuFanqi23/TGM-VLA获得。
cs.RO / 26 / 2603.00628
Validation of Space Robotics in Underwater Environments via Disturbance Robustness Equivalency
通过干扰鲁棒性等价性验证水下环境中的空间机器人技术
Abstract
We present an experimental validation framework for space robotics that leverages underwater environments to approximate microgravity dynamics. While neutral buoyancy conditions make underwater robotics an excellent platform for space robotics validation, there are still dynamical and environmental differences that need to be overcome. Given a high-level space mission specification, expressed in terms of a Signal Temporal Logic specification, we overcome these differences via the notion of maximal disturbance robustness of the mission. We formulate the motion planning problem such that the original space mission and the validation mission achieve the same disturbance robustness degree. The validation platform then executes its mission plan using a near-identical control strategy to the space mission where the closed-loop controller considers the spacecraft dynamics. Evaluating our validation framework relies on estimating disturbances during execution and comparing them to the disturbance robustness degree, providing practical evidence of operation in the space environment. Our evaluation features a dual-experiment setup: an underwater robot operating under near-neutral buoyancy conditions to validate the planning and control strategy of either an experimental planar spacecraft platform or a CubeSat in a high-fidelity space dynamics simulator.
Chinese Translation
我们提出了一种实验验证框架,用于空间机器人技术,该框架利用水下环境来近似微重力动态。尽管中性浮力条件使水下机器人成为空间机器人验证的优秀平台,但仍然存在需要克服的动力学和环境差异。基于高层次的空间任务规范,该规范以信号时序逻辑(Signal Temporal Logic)形式表达,我们通过任务的最大干扰鲁棒性概念来克服这些差异。我们将运动规划问题进行公式化,使得原始空间任务和验证任务实现相同的干扰鲁棒性程度。验证平台随后使用与空间任务几乎相同的控制策略执行其任务计划,其中闭环控制器考虑了航天器的动力学。评估我们的验证框架依赖于在执行过程中估计干扰并将其与干扰鲁棒性程度进行比较,从而提供在空间环境中操作的实际证据。我们的评估采用双实验设置:一台在近中性浮力条件下操作的水下机器人,用于验证实验平面航天器平台或高保真空间动力学模拟器中的CubeSat的规划和控制策略。
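The disturbance robustness degree at the heart of this framework comes from the quantitative semantics of Signal Temporal Logic. For the simplest case, a specification of the form "always keep the signal above a threshold", the robustness is the worst-case margin over the trace. A minimal sketch:

```python
def always_robustness(signal, threshold):
    """Quantitative STL semantics of G(signal > threshold): the worst-case
    margin over the sampled trace. A positive value means the specification
    is satisfied, and its magnitude is the disturbance robustness degree --
    the largest uniform perturbation the trace can absorb while still
    satisfying the formula."""
    return min(s - threshold for s in signal)
```

Matching the robustness degree between the space mission and the underwater validation mission, as the paper proposes, then amounts to planning both so that this margin is equal, even though the disturbance sources (buoyancy, drag vs. microgravity) differ.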
cs.RO / 27 / 2603.00663
Optimal Solutions for the Moving Target Vehicle Routing Problem via Branch-and-Price with Relaxed Continuity
通过放宽连续性的分支定价法求解移动目标车辆路径问题的最优解
Abstract
The Moving Target Vehicle Routing Problem (MT-VRP) seeks trajectories for several agents that intercept a set of moving targets, subject to speed, time window, and capacity constraints. We introduce an exact algorithm, Branch-and-Price with Relaxed Continuity (BPRC), for the MT-VRP. The main challenge in a branch-and-price approach for the MT-VRP is the pricing subproblem, which is complicated by moving targets and time-dependent travel costs between targets. Our key contribution is a new labeling algorithm that solves this subproblem by means of a novel dominance criterion tailored for problems with moving targets. Numerical results on instances with up to 25 targets show that our algorithm finds optimal solutions more than an order of magnitude faster than a baseline based on previous work, showing particular strength in scenarios with limited agent capacities.
Chinese Translation
移动目标车辆路径问题(MT-VRP)旨在为多个代理寻找轨迹,以拦截一组移动目标,同时受到速度、时间窗口和容量限制的约束。我们为MT-VRP引入了一种精确算法——放宽连续性的分支定价法(BPRC)。在MT-VRP的分支定价方法中,主要挑战在于定价子问题,该问题因移动目标和目标之间的时间依赖性旅行成本而变得复杂。我们的关键贡献是提出了一种新的标记算法,通过针对移动目标问题的新颖支配准则来解决这一子问题。在包含多达25个目标的实例上的数值结果表明,我们的算法比基于先前工作的基线快了一个数量级以上,特别在代理容量有限的场景中表现出色。
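The labeling algorithm at the core of branch-and-price prunes partial routes with a dominance check over their resource vectors. The sketch below shows the textbook criterion for cost, time, and load; the paper's contribution is a novel variant tailored to moving targets, which this baseline does not capture:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Label:
    cost: float   # reduced cost of the partial route
    time: float   # arrival time (targets move, so time matters for cost)
    load: int     # vehicle capacity consumed so far

def dominates(a, b):
    """Classical resource dominance: a is no worse than b in every
    resource and strictly better in at least one."""
    return (a.cost <= b.cost and a.time <= b.time and a.load <= b.load
            and (a.cost, a.time, a.load) != (b.cost, b.time, b.load))

def prune(labels):
    # keep only the non-dominated labels at a node; dominated partial
    # routes can never lead to a better complete route
    return [l for l in labels
            if not any(dominates(o, l) for o in labels if o is not l)]
```

With moving targets, arriving later changes the travel cost to every remaining target, which is exactly why the plain criterion above is too weak and a tailored dominance rule is needed.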
cs.RO / 28 / 2603.00694
Wild-Drive: Off-Road Scene Captioning and Path Planning via Robust Multi-modal Routing and Efficient Large Language Model
Wild-Drive:通过稳健的多模态路由和高效的大型语言模型进行越野场景描述和路径规划
Abstract
Explainability and transparent decision-making are essential for the safe deployment of autonomous driving systems. Scene captioning summarizes environmental conditions and risk factors in natural language, improving transparency, safety, and human--robot interaction. However, most existing approaches target structured urban scenarios; in off-road environments, they are vulnerable to single-modality degradations caused by rain, fog, snow, and darkness, and they lack a unified framework that jointly models structured scene captioning and path planning. To bridge this gap, we propose Wild-Drive, an efficient framework for off-road scene captioning and path planning. Wild-Drive adopts modern multimodal encoders and introduces a task-conditioned modality-routing bridge, MoRo-Former, to adaptively aggregate reliable information under degraded sensing. It then integrates an efficient large language model (LLM), together with a planning token and a gated recurrent unit (GRU) decoder, to generate structured captions and predict future trajectories. We also build the OR-C2P Benchmark, which covers structured off-road scene captioning and path planning under diverse sensor corruption conditions. Experiments on OR-C2P dataset and a self-collected dataset show that Wild-Drive outperforms prior LLM-based methods and remains more stable under degraded sensing. The code and benchmark will be publicly available at https://github.com/wangzihanggg/Wild-Drive.
Chinese Translation
可解释性和透明决策对于自主驾驶系统的安全部署至关重要。场景描述以自然语言总结环境条件和风险因素,从而提高透明度、安全性和人机交互。然而,大多数现有方法针对的是结构化的城市场景;在越野环境中,它们容易受到雨、雾、雪和黑暗等单一模态退化的影响,并且缺乏一个统一的框架来共同建模结构化场景描述和路径规划。为了解决这一问题,我们提出了Wild-Drive,这是一个高效的越野场景描述和路径规划框架。Wild-Drive采用现代多模态编码器,并引入了一种任务条件的模态路由桥接器MoRo-Former,以在感知退化的情况下自适应地聚合可靠信息。然后,它集成了一个高效的大型语言模型(LLM),以及一个规划令牌和一个门控递归单元(GRU)解码器,以生成结构化描述并预测未来轨迹。我们还构建了OR-C2P基准,涵盖了在各种传感器损坏条件下的结构化越野场景描述和路径规划。在OR-C2P数据集和一个自收集的数据集上的实验表明,Wild-Drive的性能优于先前基于LLM的方法,并在感知退化的情况下保持更稳定。代码和基准将公开发布在https://github.com/wangzihanggg/Wild-Drive。
cs.RO / 29 / 2603.00719
Keyframe-Guided Structured Rewards for Reinforcement Learning in Long-Horizon Laboratory Robotics
基于关键帧引导的结构化奖励在长时间跨度实验室机器人强化学习中的应用
Abstract
Long-horizon precision manipulation in laboratory automation, such as pipette tip attachment and liquid transfer, requires policies that respect strict procedural logic while operating in continuous, high-dimensional state spaces. However, existing approaches struggle with reward sparsity, multi-stage structural constraints, and noisy or imperfect demonstrations, leading to inefficient exploration and unstable convergence. We propose a Keyframe-Guided Reward Generation Framework that automatically extracts kinematics-aware keyframes from demonstrations, generates stage-wise targets via a diffusion-based predictor in latent space, and constructs a geometric progress-based reward to guide online reinforcement learning. The framework integrates multi-view visual encoding, latent similarity-based progress tracking, and human-in-the-loop reinforcement fine-tuning on a Vision-Language-Action backbone to align policy optimization with the intrinsic stepwise logic of biological protocols. Across four real-world laboratory tasks, including high-precision pipette attachment and dynamic liquid transfer, our method achieves an average success rate of 82% after 40--60 minutes of online fine-tuning. Compared with HG-DAgger (42%) and Hil-ConRFT (47%), our approach demonstrates the effectiveness of structured keyframe-guided rewards in overcoming exploration bottlenecks and providing a scalable solution for high-precision, long-horizon robotic laboratory automation.
Chinese Translation
实验室自动化中的长时间跨度精确操作,如移液器吸头安装和液体转移,需要遵循严格程序逻辑的策略,同时在连续的高维状态空间中运行。然而,现有方法在奖励稀疏性、多阶段结构约束以及噪声或不完美示范方面面临挑战,导致探索效率低下和收敛不稳定。我们提出了一种关键帧引导的奖励生成框架,该框架能够自动从示范中提取具有运动学意识的关键帧,通过潜在空间中的扩散预测器生成阶段性目标,并构建基于几何进展的奖励以指导在线强化学习。该框架集成了多视角视觉编码、基于潜在相似性的进展跟踪,以及在视觉-语言-动作(VLA)主干上进行的人在回路强化微调,旨在将策略优化与生物协议的内在逐步逻辑对齐。在包括高精度移液器吸头安装和动态液体转移在内的四个真实实验室任务中,我们的方法在经过40-60分钟的在线微调后实现了82%的平均成功率。与HG-DAgger(42%)和Hil-ConRFT(47%)相比,我们的方法展示了结构化关键帧引导奖励在克服探索瓶颈和提供高精度、长时间跨度机器人实验室自动化的可扩展解决方案方面的有效性。
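The geometric progress-based reward over stage-wise latent targets can be sketched as a dense shaping term plus a stage-completion bonus. The latent codes, keyframe targets, and thresholds below are illustrative placeholders, not the paper's actual values:

```python
import numpy as np

def progress_reward(z, z_prev, keyframes, stage, eps=0.05, bonus=1.0):
    """Dense reward for moving toward the current stage-wise keyframe
    target in latent space, with a bonus (and stage advance) once the
    target is reached. `keyframes` is an ordered array of latent targets,
    one per stage of the protocol."""
    d_prev = np.linalg.norm(z_prev - keyframes[stage])
    d = np.linalg.norm(z - keyframes[stage])
    r = d_prev - d                       # geometric progress: closer is better
    if d < eps and stage + 1 < len(keyframes):
        stage += 1                       # advance to the next stage target
        r += bonus                       # reward completing the stage
    return r, stage
```

Because the reward is the *change* in distance rather than the distance itself, exploration receives a nonzero signal at every step, which is what overcomes the sparsity of a success-only reward.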
cs.RO / 30 / 2603.00732
UniHM: Unified Dexterous Hand Manipulation with Vision Language Model
UniHM:基于视觉语言模型的统一灵巧手操控
Abstract
Planning physically feasible dexterous hand manipulation is a central challenge in robotic manipulation and Embodied AI. Prior work typically relies on object-centric cues or precise hand-object interaction sequences, foregoing the rich, compositional guidance of open-vocabulary instruction. We introduce UniHM, the first framework for unified dexterous hand manipulation guided by free-form language commands. We propose a Unified Hand-Dexterous Tokenizer that maps heterogeneous dexterous-hand morphologies into a single shared codebook, improving cross-dexterous hand generalization and scalability to new morphologies. Our vision language action model is trained solely on human-object interaction data, eliminating the need for massive real-world teleoperation datasets, and demonstrates strong generalizability in producing human-like manipulation sequences from open-ended language instructions. To ensure physical realism, we introduce a physics-guided dynamic refinement module that performs segment-wise joint optimization under generative and temporal priors, yielding smooth and physically feasible manipulation sequences. Across multiple datasets and real-world evaluations, UniHM attains state-of-the-art results on both seen and unseen objects and trajectories, demonstrating strong generalization and high physical feasibility. Our project page is at https://unihm.github.io/.
Chinese Translation
规划物理可行的灵巧手操控是机器人操控和具身人工智能中的一个核心挑战。以往的研究通常依赖于以物体为中心的线索或精确的手-物体交互序列,而忽视了开放词汇指令的丰富组合指导。我们提出了UniHM,这是第一个由自由形式语言命令指导的统一灵巧手操控框架。我们提出了一种统一的手-灵巧标记器(Unified Hand-Dexterous Tokenizer),将异构的灵巧手形态映射到一个共享的代码本中,从而提高了跨灵巧手的泛化能力和对新形态的可扩展性。我们的视觉语言动作模型仅基于人类-物体交互数据进行训练,消除了对大量真实世界远程操作数据集的需求,并在从开放式语言指令生成类人操控序列方面表现出强大的泛化能力。为了确保物理现实性,我们引入了一个物理引导的动态优化模块,该模块在生成和时间先验下执行分段关节优化,从而产生平滑且物理可行的操控序列。在多个数据集和真实世界评估中,UniHM在已见和未见物体及轨迹上均达到了最先进的结果,展示了强大的泛化能力和高物理可行性。我们的项目页面在 [https://unihm.github.io/](https://unihm.github.io/)。
cs.RO / 31 / 2603.00759
Online Generation of Collision-Free Trajectories in Dynamic Environments
动态环境中无碰撞轨迹的在线生成
Abstract
In this paper, we present an online method for converting an arbitrary geometric path represented by a sequence of states, generated by any planner (e.g., sampling-based planners like RRT or PRM, search-based planners like ARA*, etc.), into a corresponding kinematically feasible, jerk-limited trajectory. The method generates a sequence of quintic/quartic splines that can be discretized at a user-specified control rate, and then streamed to a low-level robot controller. Our approach enables real-time adaptation to newly captured changes in the environment. It can also be re-invoked at any time instance to generate a new trajectory from the robot's current to a desired target state or sequence of states. We can guarantee that the trajectory will remain collision-free for a certain amount of time in dynamic environments, while allowing bounded geometric deviation from the original path. The kinematic constraints are taken into account, including limited jerk. We validate the approach in a comparative simulation study against the competing method, demonstrating favorable behavior w.r.t. smoothness, computational time, and real-time performance, particularly in scenarios with frequent changes of target states (up to 1 [kHz]). Experiments on a real robot demonstrate that the proposed approach can be used in real-world scenarios including human presence.
Chinese Translation
在本文中,我们提出了一种在线方法,用于将由任意规划器(例如,基于采样的规划器如RRT或PRM,基于搜索的规划器如ARA*等)生成的一系列状态表示的任意几何路径转换为相应的运动学可行、加加速度受限(jerk-limited)的轨迹。该方法生成一系列五次/四次样条曲线,可以在用户指定的控制频率下进行离散化,然后流式传输到低层机器人控制器。我们的方法能够实时适应环境中新捕获的变化。它还可以在任何时刻重新调用,以从机器人的当前状态生成新的轨迹,指向期望的目标状态或状态序列。我们可以保证,在动态环境中,轨迹在一定时间内保持无碰撞,同时允许与原始路径的有限几何偏差。运动学约束被考虑在内,包括加加速度限制。我们在比较仿真研究中验证了该方法,与竞争方法进行了对比,展示了在平滑性、计算时间和实时性能方面的良好表现,特别是在目标状态频繁变化的场景中(高达1 [kHz])。在真实机器人上的实验表明,所提出的方法可以用于包括人类存在的真实场景。
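A quintic spline segment is fully determined by position, velocity, and acceleration at its two endpoints, which is what makes per-segment streaming at a fixed control rate possible. A minimal sketch of solving for the six polynomial coefficients (a standard boundary-value construction, not the paper's full pipeline):

```python
import numpy as np

def quintic_coeffs(p0, v0, a0, p1, v1, a1, T):
    """Coefficients c of q(t) = sum_i c[i] * t**i (i = 0..5) matching
    position, velocity, and acceleration at t = 0 and t = T. The six
    boundary conditions give a 6x6 linear system."""
    M = np.array([
        [1, 0, 0,      0,       0,        0],       # q(0)   = p0
        [0, 1, 0,      0,       0,        0],       # q'(0)  = v0
        [0, 0, 2,      0,       0,        0],       # q''(0) = a0
        [1, T, T**2,   T**3,    T**4,     T**5],    # q(T)   = p1
        [0, 1, 2*T,    3*T**2,  4*T**3,   5*T**4],  # q'(T)  = v1
        [0, 0, 2,      6*T,     12*T**2,  20*T**3], # q''(T) = a1
    ], dtype=float)
    return np.linalg.solve(M, np.array([p0, v0, a0, p1, v1, a1], float))
```

For a rest-to-rest unit move this recovers the classic minimum-jerk profile 10t^3 - 15t^4 + 6t^5; chaining such segments with matched boundary derivatives is what keeps the streamed trajectory jerk-limited.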
cs.RO / 32 / 2603.00871
Hippo: High-performance Interior-Point and Projection-based Solver for Generic Constrained Trajectory Optimization
Hippo:高性能内点法和基于投影的通用约束轨迹优化求解器
Abstract
Trajectory optimization is the core of modern model-based robotic control and motion planning. Existing trajectory optimizers, based on sequential quadratic programming (SQP) or differential dynamic programming (DDP), are often limited by their slow computation efficiency, low modeling flexibility, and poor convergence for complex tasks requiring hard constraints. In this paper, we introduce Hippo, a solver that can handle inequality constraints using the interior-point method (IPM) with an adaptive barrier update strategy and hard equality constraints via projection or IPM. Through extensive numerical benchmarks, we show that Hippo is a robust and efficient alternative to existing state-of-the-art solvers for difficult robotic trajectory optimization problems requiring high-quality solutions, such as locomotion and manipulation.
Chinese Translation
轨迹优化是现代基于模型的机器人控制和运动规划的核心。现有的轨迹优化器基于序列二次规划(SQP)或微分动态规划(DDP),通常受到计算效率低、建模灵活性差以及在需要严格约束的复杂任务中收敛性差的限制。本文介绍了Hippo,一个能够使用内点法(IPM)处理不等式约束的求解器,采用自适应障碍更新策略,并通过投影或内点法处理严格等式约束。通过广泛的数值基准测试,我们证明了Hippo是现有最先进求解器在需要高质量解决方案的困难机器人轨迹优化问题(如运动和操作)中的一种稳健且高效的替代方案。
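The interior-point idea behind Hippo, handling inequality constraints with a log barrier whose weight is progressively reduced, can be illustrated on a 1-D toy problem. The geometric shrink schedule below is a simple stand-in for the paper's adaptive barrier update strategy:

```python
def minimize_barrier_1d(phi_grad, lo, hi, iters=80):
    # the barrier objective is strictly convex on the feasible interval,
    # so its gradient has one sign change: bisect for the root
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if phi_grad(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def interior_point_1d(mu0=1.0, shrink=0.2, mu_min=1e-10):
    """Toy interior-point method: minimize (x - 2)^2 subject to x <= 1
    via the barrier phi(x) = (x - 2)^2 - mu * log(1 - x). As mu shrinks,
    the barrier minimizer approaches the constrained optimum x = 1 while
    every iterate stays strictly feasible."""
    x, mu = 0.0, mu0
    while mu > mu_min:
        x = minimize_barrier_1d(lambda t: 2*(t - 2) + mu/(1 - t),
                                -5.0, 1.0 - 1e-12)
        mu *= shrink
    return x
```

The same mechanism scales to trajectory optimization: each inequality (joint limit, contact constraint) contributes a barrier term, and the update schedule for mu trades convergence speed against numerical conditioning.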
cs.RO / 33 / 2603.00892
A Novel Reconfigurable Dexterous Hand Based on Triple-Symmetric Bricard Parallel Mechanism
基于三重对称布里卡德并联机制的新型可重构灵巧手
Abstract
This paper introduces a novel design for a robotic hand based on parallel mechanisms. The proposed hand uses a triple-symmetric Bricard linkage as its reconfigurable palm, enhancing adaptability to objects of varying shapes and sizes. Through topological and dimensional synthesis, the mechanism achieves a well-balanced degree of freedom and link configuration suitable for reconfigurable palm motion, balancing dexterity, stability, and load capacity. Furthermore, kinematic analysis is performed using screw theory and closed-loop constraints, and performance is evaluated based on workspace, stiffness, and motion/force transmission efficiency. Finally, a prototype is developed and tested through a series of grasping experiments, demonstrating the ability to perform stable and efficient manipulation across a wide range of objects. The results validate the effectiveness of the design in improving grasping versatility and operational precision, offering a promising solution for advanced robotic manipulation tasks.
Chinese Translation
本文介绍了一种基于并联机制的机器人手的新设计。所提出的手采用三重对称布里卡德连杆作为其可重构的掌心,增强了对不同形状和尺寸物体的适应能力。通过拓扑和尺寸综合,该机制实现了良好的自由度和平衡的连杆配置,适合可重构掌心的运动,平衡了灵巧性、稳定性和负载能力。此外,利用螺旋理论和闭环约束进行了运动学分析,并根据工作空间、刚度以及运动/力传递效率评估性能。最后,开发并测试了一个原型,通过一系列抓取实验,展示了其在广泛物体上进行稳定和高效操作的能力。结果验证了该设计在提高抓取多样性和操作精度方面的有效性,为先进的机器人操作任务提供了有前景的解决方案。
cs.RO / 34 / 2603.00913
Minimalist Compliance Control
极简柔顺控制
Abstract
Compliance control is essential for safe physical interaction, yet its adoption is limited by hardware requirements such as force-torque sensors. While recent reinforcement learning approaches aim to bypass these constraints, they often suffer from sim-to-real gaps, lack safety guarantees, and add system complexity. We propose Minimalist Compliance Control, which enables compliant behavior using only motor current or voltage signals readily available in modern servos and quasi-direct-drive motors, without force sensors, current control, or learning. External wrenches are estimated from actuator signals and Jacobians and incorporated into a task-space admittance controller, preserving sufficient force measurement accuracy for stable and responsive compliance control. Our method is embodiment-agnostic and plug-and-play with diverse high-level planners. We validate our approach on a robot arm, a dexterous hand, and two humanoid robots across multiple contact-rich tasks, using vision-language models, imitation learning, and model-based planning. The results demonstrate robust, safe, and compliant interaction across embodiments and planning paradigms.
Chinese Translation
柔顺控制对于安全的物理交互至关重要,但其应用受到力/力矩传感器等硬件要求的限制。尽管近期的强化学习方法旨在绕过这些限制,但它们往往面临仿真到现实的差距,缺乏安全保障,并增加了系统复杂性。我们提出了极简柔顺控制(Minimalist Compliance Control),该方法仅使用现代伺服电机和准直接驱动电机中现成可用的电机电流或电压信号,而无需力传感器、电流控制或学习。外部力旋量通过执行器信号和雅可比矩阵进行估计,并纳入任务空间导纳控制器中,从而保持足够的力测量精度,以实现稳定且响应迅速的柔顺控制。我们的方法与具体机器人本体无关,并且可以与多种高层规划器即插即用。我们在一个机械臂、一个灵巧手和两个人形机器人上验证了我们的方法,涵盖多个接触丰富的任务,使用视觉语言模型、模仿学习和基于模型的规划。结果表明,在不同本体和规划范式中,该方法均能实现稳健、安全且柔顺的交互。
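The two-step idea, estimate the external wrench from actuator signals, then feed it to a task-space admittance law, can be sketched as follows. The torque constants, friction term, and admittance gains are illustrative assumptions, not the paper's identified values:

```python
import numpy as np

def external_wrench(currents, torque_consts, jacobian, tau_bias):
    """Estimate the task-space external wrench without a force/torque
    sensor: joint torque ~ Kt * i minus a simple bias/friction model,
    mapped through the pseudoinverse of the Jacobian transpose."""
    tau_ext = torque_consts * np.asarray(currents, dtype=float) - tau_bias
    return np.linalg.pinv(jacobian.T) @ tau_ext

def admittance_step(x, xd, f_ext, M=1.0, D=10.0, dt=0.001):
    """One semi-implicit Euler step of the 1-D task-space admittance
    M * xdd + D * xd = f_ext: the commanded motion yields to contact
    forces instead of rigidly tracking a reference."""
    xdd = (f_ext - D * xd) / M
    xd = xd + xdd * dt
    return x + xd * dt, xd
```

Under a constant external force the admittance settles at the steady-state velocity f/D, which is the compliant "give" that makes contact-rich interaction safe without any force sensor in the loop.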
cs.RO / 35 / 2603.00926
DAM-VLA: A Dynamic Action Model-Based Vision-Language-Action Framework for Robot Manipulation
DAM-VLA:基于动态动作模型的视觉-语言-动作框架用于机器人操作
Abstract
In dynamic environments such as warehouses, hospitals, and homes, robots must seamlessly transition between gross motion and precise manipulations to complete complex tasks. However, current Vision-Language-Action (VLA) frameworks, largely adapted from pre-trained Vision-Language Models (VLMs), often struggle to reconcile general task adaptability with the specialized precision required for intricate manipulation. To address this challenge, we propose DAM-VLA, a dynamic action model-based VLA framework. DAM-VLA integrates VLM reasoning with diffusion-based action models specialized for arm and gripper control. Specifically, it introduces (i) an action routing mechanism, using task-specific visual and linguistic cues to select appropriate action models (e.g., arm movement or gripper manipulation), (ii) a dynamic action model that fuses high-level VLM cognition with low-level visual features to predict actions, and (iii) a dual-scale action weighting mechanism that enables dynamic coordination between the arm-movement and gripper-manipulation models. Across extensive evaluations, DAM-VLA achieves superior success rates compared to state-of-the-art VLA methods in simulated (SIMPLER, FurnitureBench) and real-world settings, showing robust generalization from standard pick-and-place to demanding long-horizon and contact-rich tasks.
Chinese Translation
在仓库、医院和家庭等动态环境中,机器人必须在粗大运动和精细操作之间无缝切换,以完成复杂任务。然而,当前的视觉-语言-动作(VLA)框架大多改编自预训练的视觉-语言模型(VLMs),往往难以兼顾通用任务适应性与复杂操作所需的专业精确度。为了解决这一挑战,我们提出了DAM-VLA,一个基于动态动作模型的VLA框架。DAM-VLA将VLM推理与专门用于机械臂和夹爪控制的基于扩散的动作模型相结合。具体而言,它引入了(i)一种动作路由机制,利用任务特定的视觉和语言线索选择合适的动作模型(例如,手臂运动或夹爪操作),(ii)一种动态动作模型,将高层次的VLM认知与低层次的视觉特征融合以预测动作,以及(iii)一种双尺度动作加权机制,实现手臂运动模型与夹爪操作模型之间的动态协调。在广泛的评估中,DAM-VLA在模拟(SIMPLER、FurnitureBench)和真实环境中均取得了优于最先进VLA方法的成功率,展示出从标准拾取与放置到高要求的长时程和接触密集型任务的强大泛化能力。
cs.RO / 36 / 2603.00936
DRIFT: Diffusion-based Rule-Inferred For Trajectories
DRIFT:基于扩散的轨迹规则推断
Abstract
Trajectory generation for mobile robots in unstructured environments faces a critical dilemma: balancing kinematic smoothness for safe execution with terminal precision for fine-grained tasks. Existing generative planners often struggle with this trade-off, yielding either smooth but imprecise paths or geometrically accurate but erratic motions. To address the aforementioned shortcomings, this article proposes DRIFT (Diffusion-based Rule-Inferred for Trajectories), a conditional diffusion framework designed to generate high-fidelity reference trajectories by integrating two complementary inductive biases. First, a Relational Inductive Bias, realized via a GNN-based Structured Scene Perception (SSP) module, encodes global topological constraints to ensure holistic smoothness. Second, a Temporal Attention Bias, implemented through a novel Graph-Conditioned Time-Aware GRU (GTGRU), dynamically attends to sparse obstacles and targets for precise local maneuvering. In the end, quantitative results demonstrate that DRIFT reconciles these conflicting objectives, achieving centimeter-level imitation fidelity (0.041m FDE) and competitive smoothness (27.19 Jerk). This balance yields highly executable reference plans for downstream control.
Chinese Translation
在非结构化环境中,移动机器人的轨迹生成面临一个关键困境:需要在保证安全执行的运动学平滑性与精细任务所需的终端精度之间取得平衡。现有的生成式规划器通常难以处理这一权衡,生成的路径要么平滑但不精确,要么几何上准确但运动不稳定。为了解决上述不足,本文提出了DRIFT(基于扩散的轨迹规则推断),一个条件扩散框架,旨在通过整合两种互补的归纳偏置来生成高保真参考轨迹。首先,通过基于图神经网络(GNN)的结构化场景感知(SSP)模块实现的关系归纳偏置,编码全局拓扑约束,以确保整体平滑性。其次,通过一种新颖的图条件时间感知门控循环单元(GTGRU)实现的时间注意偏置,动态关注稀疏障碍物和目标,以实现精确的局部机动。最终,定量结果表明,DRIFT调和了这些相互冲突的目标,实现了厘米级的模仿保真度(0.041m FDE)和有竞争力的平滑性(27.19 Jerk)。这种平衡为下游控制提供了高度可执行的参考规划。
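The two metrics this abstract reports, final displacement error (FDE) and jerk, have standard definitions that can be computed directly from a discrete trajectory; the sketch below follows those generic definitions and is not the paper's evaluation code.

```python
import numpy as np

def final_displacement_error(pred, gt):
    # FDE: Euclidean distance between the last predicted and ground-truth points
    return float(np.linalg.norm(pred[-1] - gt[-1]))

def mean_jerk(traj, dt):
    # Jerk is the third time derivative of position, here via finite differences
    vel = np.diff(traj, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    jerk = np.diff(acc, axis=0) / dt
    return float(np.linalg.norm(jerk, axis=1).mean())
```

A constant-velocity path has zero jerk, which is why the two objectives conflict: bending a trajectory toward an exact endpoint necessarily introduces acceleration changes.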
cs.RO / 37 / 2603.00948
HierKick: Hierarchical Reinforcement Learning for Vision-Guided Soccer Robot Control
HierKick:基于层次强化学习的视觉引导足球机器人控制
Abstract
Controlling soccer robots involves multi-time-scale decision-making, which requires balancing long-term tactical planning and short-term motion execution. Traditional end-to-end reinforcement learning (RL) methods face challenges in complex dynamic environments. This paper proposes HierKick, a vision-guided soccer robot control framework based on dual-frequency hierarchical RL. The framework adopts a hierarchical control architecture featuring a 5 Hz high-level policy that integrates YOLOv8 for real-time detection and selects tasks via a coach model, and a pre-trained 50 Hz low-level controller for precise joint control. Through this architecture, the framework achieves the four steps of approaching, aligning, dribbling, and kicking. Experimental results show that the success rates of this framework are 95.2\% in IsaacGym, 89.8\% in Mujoco, and 80\% in the real world. HierKick provides an effective hierarchical paradigm for robot control in complex environments, extendable to multi-time-scale tasks, with its modular design and skill reuse offering a new path for intelligent robot control.
Chinese Translation
足球机器人的控制涉及多时间尺度的决策制定,这需要在长期战术规划与短期运动执行之间取得平衡。传统的端到端强化学习(RL)方法在复杂动态环境中面临挑战。本文提出了HierKick,一个基于双频层次强化学习的视觉引导足球机器人控制框架。该框架采用层次控制架构,具有5 Hz的高层策略,集成了YOLOv8进行实时检测,并通过教练模型选择任务,同时配备了经过预训练的50 Hz低层控制器以实现精确的关节控制。通过这一架构,该框架实现了接近、对齐、运球和射门四个步骤。实验结果表明,该框架在IsaacGym中的成功率为95.2\%,在Mujoco中的成功率为89.8\%,在现实世界中的成功率为80\%。HierKick为复杂环境中的机器人控制提供了一种有效的层次范式,能够扩展到多时间尺度任务,其模块化设计和技能重用为智能机器人控制提供了一条新路径。
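The dual-frequency architecture (a 5 Hz task-selection policy over a 50 Hz low-level controller) amounts to re-querying the high level once every ten low-level ticks. A minimal sketch, with all callables assumed rather than taken from the paper:

```python
def hierarchical_loop(high_policy, low_controller, obs_fn, steps, high_hz=5, low_hz=50):
    """Two-rate control loop: the high-level policy re-selects a task every
    low_hz // high_hz low-level ticks; the low-level controller runs every tick."""
    ratio = low_hz // high_hz          # 10 low-level ticks per high-level decision
    task, actions = None, []
    for t in range(steps):
        if t % ratio == 0:
            task = high_policy(obs_fn(t))   # e.g. approach / align / dribble / kick
        actions.append(low_controller(task, obs_fn(t)))
    return actions
```

Decoupling the rates is what lets the pre-trained joint controller be reused across tasks while the slower policy handles perception-driven decisions.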
cs.RO / 38 / 2603.00972
MiniUGV$_2$: A Compact UAV-Deployable Tracked Ground Vehicle with Manipulation Capabilities
MiniUGV$_2$: 一种可由无人机部署的紧凑型履带式地面车辆,具备操作能力
Abstract
Exploring and inspecting \emph{Hidden Spaces}, defined as environments whose entrances are accessible only to aerial robots but remain unexplored due to geometric constraints, limited flight time, and communication loss, remains a major challenge. We present miniUGV$_2$, a compact UAV-deployable tracked ground vehicle that extends UAV capabilities into confined environments. The system introduces dual articulated arms, integrated LiDAR and depth sensing, and modular electronics for enhanced autonomy. A novel tether module with an electro-permanent magnetic head enables safe deployment, retrieval, and optional detachment, thereby overcoming prior entanglement issues. Experiments demonstrate robust terrain navigation, self-righting, and manipulation of objects up to 3.5 kg, validating miniUGV$_2$ as a versatile platform for hybrid aerial-ground robotics.
Chinese Translation
探索和检查\emph{隐秘空间}(即入口仅可由空中机器人进入,但因几何限制、有限的飞行时间和通信中断而未被探索的环境)仍然是一个重大挑战。我们提出了miniUGV$_2$,这是一种可由无人机部署的紧凑型履带式地面车辆,能够将无人机的能力扩展到封闭环境中。该系统引入了双关节臂、集成的激光雷达(LiDAR)和深度传感,以及模块化电子设备,以增强自主性。一个新型的、带有电永磁(electro-permanent magnet)吸头的系绳模块使得安全部署、回收和可选分离成为可能,从而克服了先前的缠绕问题。实验表明,该系统具备稳健的地形导航能力、自我翻正能力以及对重达3.5千克物体的操作能力,验证了miniUGV$_2$作为混合空地机器人平台的多功能性。
cs.RO / 39 / 2603.01023
An Open-Source Modular Benchmark for Diffusion-Based Motion Planning in Closed-Loop Autonomous Driving
用于闭环自主驾驶的基于扩散的运动规划开放源代码模块化基准
Abstract
Diffusion-based motion planners have achieved state-of-the-art results on benchmarks such as nuPlan, yet their evaluation within closed-loop production autonomous driving stacks remains largely unexplored. Existing evaluations abstract away ROS 2 communication latency and real-time scheduling constraints, while monolithic ONNX deployment freezes all solver parameters at export time. We present an open-source modular benchmark that addresses both gaps: using ONNX GraphSurgeon, we decompose a monolithic 18,398-node diffusion planner into three independently executable modules and reimplement the DPM-Solver++ denoising loop in native C++. Integrated as a ROS 2 node within Autoware, the open-source AD stack deployed on real vehicles worldwide, the system enables runtime-configurable solver parameters without model recompilation and per-step observability of the denoising process, breaking the black box of monolithic deployment. Unlike evaluations in standalone simulators such as CARLA, our benchmark operates within a production-grade stack and is validated through AWSIM closed-loop simulation. Through systematic comparison of DPM-Solver++ (first- and second-order) and DDIM across six step-count configurations (N in {3, 5, 7, 10, 15, 20}), we show that encoder caching yields a 3.2x latency reduction, and that second-order solving reduces FDE by 41% at N=3 compared to first-order. The complete codebase will be released as open-source, providing a direct path from simulation benchmarks to real-vehicle deployment.
Chinese Translation
基于扩散的运动规划器在诸如nuPlan等基准测试中已取得了最先进的成果,但在闭环生产自主驾驶系统中的评估仍然基本未被探索。现有的评估忽略了ROS 2通信延迟和实时调度约束,而单体ONNX部署在导出时冻结了所有求解器参数。我们提出了一个开放源代码模块化基准,解决了这两个问题:通过ONNX GraphSurgeon,我们将一个包含18,398个节点的单体扩散规划器分解为三个独立可执行的模块,并在原生C++中重新实现了DPM-Solver++去噪循环。该系统作为ROS 2节点集成在Autoware中,Autoware是一个在全球真实车辆上部署的开源自主驾驶堆栈,允许在不重新编译模型的情况下进行运行时可配置的求解器参数设置,并实现去噪过程的逐步可观察性,打破了单体部署的黑箱限制。与在CARLA等独立模拟器中的评估不同,我们的基准在生产级堆栈中运行,并通过AWSIM闭环仿真进行了验证。通过对DPM-Solver++(一阶和二阶)和DDIM在六种步数配置(N ∈ {3, 5, 7, 10, 15, 20})下的系统比较,我们显示编码器缓存实现了3.2倍的延迟减少,并且在N=3时,二阶求解相比一阶求解将最终距离误差(FDE)降低了41%。完整的代码库将作为开源发布,为从仿真基准到真实车辆部署提供直接路径。
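A runtime-configurable denoising loop of the kind the benchmark exposes can be sketched with plain DDIM updates (the paper's native-C++ DPM-Solver++ is more involved); `eps_model` and the noise schedule are placeholders, and each intermediate state is yielded to illustrate per-step observability.

```python
import numpy as np

def ddim_denoise(x, eps_model, alphas_cumprod, n_steps):
    """Deterministic DDIM sampling with a step count chosen at runtime.

    eps_model(x, t) predicts the noise; alphas_cumprod is the (decreasing)
    cumulative-alpha schedule from training. Yields every intermediate state,
    so a caller can inspect the denoising trajectory step by step.
    """
    T = len(alphas_cumprod)
    ts = np.linspace(T - 1, 0, n_steps + 1).round().astype(int)
    for t, t_prev in zip(ts[:-1], ts[1:]):
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        eps = eps_model(x, t)
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)   # predicted clean sample
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps
        yield t_prev, x
```

Because `n_steps` is an ordinary argument rather than a value frozen at export, the latency/quality trade-off (the N sweep in the abstract) becomes tunable at deployment time.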
cs.RO / 40 / 2603.01110
Compact Task-Aligned Imitation Learning for Laboratory Automation
紧凑型任务对齐模仿学习用于实验室自动化
Abstract
Robotic laboratory automation has traditionally relied on carefully engineered motion pipelines and task-specific hardware interfaces, resulting in high design cost and limited flexibility. While recent imitation learning techniques can generate general robot behaviors, their large model sizes often require high-performance computational resources, limiting applicability in practical laboratory environments. In this study, we propose a compact imitation learning framework for laboratory automation using small foundation models. The proposed method, TVF-DiT, aligns a self-supervised vision foundation model with a vision-language model through a compact adapter, and integrates them with a Diffusion Transformer-based action expert. The entire model consists of fewer than 500M parameters, enabling inference on low-VRAM GPUs. Experiments on three real-world laboratory tasks - test tube cleaning, test tube arrangement, and powder transfer - demonstrate an average success rate of 86.6%, significantly outperforming alternative lightweight baselines. Furthermore, detailed task prompts improve vision-language alignment and task performance. These results indicate that small foundation models, when properly aligned and integrated with diffusion-based policy learning, can effectively support practical laboratory automation with limited computational resources.
Chinese Translation
机器人实验室自动化传统上依赖于精心设计的运动管道和特定任务的硬件接口,这导致了高设计成本和有限的灵活性。尽管最近的模仿学习技术能够生成通用的机器人行为,但其庞大的模型规模通常需要高性能的计算资源,这限制了其在实际实验室环境中的适用性。在本研究中,我们提出了一种用于实验室自动化的紧凑型模仿学习框架,采用小型基础模型。所提出的方法TVF-DiT通过一个紧凑适配器将自监督视觉基础模型与视觉-语言模型对齐,并将它们与基于扩散变换器的动作专家集成。整个模型的参数少于5亿,能够在低显存的GPU上进行推理。在三个真实实验室任务——试管清洗、试管排列和粉末转移上的实验表明,平均成功率达到86.6%,显著优于其他轻量级基线。此外,详细的任务提示改善了视觉-语言对齐和任务表现。这些结果表明,当小型基础模型得到适当对齐并与基于扩散的策略学习集成时,可以有效支持在有限计算资源下的实际实验室自动化。
cs.RO / 41 / 2603.01113
From Dialogue to Execution: Mixture-of-Agents Assisted Interactive Planning for Behavior Tree-Based Long-Horizon Robot Execution
从对话到执行:混合代理辅助的交互式规划用于基于行为树的长时程机器人执行
Abstract
Interactive task planning with large language models (LLMs) enables robots to generate high-level action plans from natural language instructions. However, in long-horizon tasks, such approaches often require many questions, increasing user burden. Moreover, flat plan representations become difficult to manage as task complexity grows. We propose a framework that integrates Mixture-of-Agents (MoA)-based proxy answering into interactive planning and generates Behavior Trees (BTs) for structured long-term execution. The MoA consists of multiple LLM-based expert agents that answer general or domain-specific questions when possible, reducing unnecessary human intervention. The resulting BT hierarchically represents task logic and enables retry mechanisms and dynamic switching among multiple robot policies. Experiments on cocktail-making tasks show that the proposed method reduces human response requirements by approximately 27% while maintaining structural and semantic similarity to fully human-answered BTs. Real-robot experiments on a smoothie-making task further demonstrate successful long-horizon execution with adaptive policy switching and recovery from action failures. These results indicate that MoA-assisted interactive planning improves dialogue efficiency while preserving execution quality in real-world robotic tasks.
Chinese Translation
基于大型语言模型(LLMs)的交互式任务规划使机器人能够从自然语言指令中生成高层次的行动计划。然而,在长时程任务中,这种方法通常需要提出许多问题,增加了用户负担。此外,随着任务复杂性的增加,扁平的计划表示变得难以管理。我们提出了一种框架,将基于混合代理(Mixture-of-Agents, MoA)的代理回答集成到交互式规划中,并生成行为树(Behavior Trees, BTs)以实现结构化的长时程执行。MoA由多个基于LLM的专家代理组成,在可能的情况下回答一般性或特定领域的问题,从而减少不必要的人工干预。生成的BT以层次化方式表示任务逻辑,并支持重试机制和多个机器人策略之间的动态切换。在鸡尾酒制作任务上的实验表明,所提方法将人类响应需求减少了约27%,同时保持了与完全由人类回答生成的BT在结构和语义上的相似性。在冰沙(smoothie)制作任务上的真实机器人实验进一步展示了成功的长时程执行,具备自适应策略切换和从动作失败中恢复的能力。这些结果表明,MoA辅助的交互式规划在提高对话效率的同时,保持了真实世界机器人任务中的执行质量。
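The proxy-answering idea, letting expert agents absorb the questions they can answer and escalating only the rest to the human, reduces to a confidence-gated dispatch. A minimal sketch with hypothetical expert/human callables (in the paper's setting the experts would be LLM agents):

```python
def answer_or_escalate(question, experts, ask_human, threshold=0.7):
    """Each expert returns (answer, confidence); keep the most confident
    answer if it clears the threshold, otherwise escalate to the human."""
    answer, conf = max((expert(question) for expert in experts),
                       key=lambda r: r[1])
    if conf >= threshold:
        return answer, "agent"
    return ask_human(question), "human"
```

Every question the gate resolves on the agent side is one fewer interruption for the user, which is where the reported ~27% reduction in human responses comes from.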
cs.RO / 42 / 2603.01122
Fast Confidence-Aware Human Prediction via Hardware-accelerated Bayesian Inference for Safe Robot Navigation
基于硬件加速贝叶斯推理的快速置信度感知人类预测,以实现安全的机器人导航
Abstract
As robots increasingly integrate into everyday environments, ensuring their safe navigation around humans becomes imperative. Efficient and safe motion planning requires robots to account for human behavior, particularly in constrained spaces such as grocery stores or care homes, where interactions with multiple individuals are common. Prior research has employed Bayesian frameworks to model human rationality based on navigational intent, enabling the prediction of probabilistic trajectories for planning purposes. In this work, we present a simple yet novel approach for confidence-aware prediction that treats future predictions as particles. This framework is highly parallelized and accelerated on a graphics processing unit (GPU). As a result, it enables longer-term predictions at a frequency of 125 Hz and can be easily extended to multi-human predictions. Compared to existing methods, our implementation supports finer prediction time steps, yielding more granular trajectory forecasts. This enhanced resolution allows motion planners to respond effectively to subtle changes in human behavior. We validate our approach through real-world experiments, demonstrating a robot safely navigating among multiple humans with diverse navigational goals. Our results highlight the method's potential for robust and efficient human-robot coexistence in dynamic environments.
Chinese Translation
随着机器人越来越多地融入日常环境,确保它们在人类周围的安全导航变得至关重要。高效且安全的运动规划要求机器人考虑人类行为,特别是在杂货店或养老院等受限空间中,与多个个体的互动十分常见。先前的研究采用贝叶斯框架,基于导航意图对人类理性进行建模,从而为规划目的预测概率轨迹。在本研究中,我们提出了一种简单而新颖的置信度感知预测方法,将未来预测视为粒子。该框架高度并行化,并在图形处理单元(GPU)上加速。因此,它能够以125 Hz的频率进行更长时程的预测,并且可以轻松扩展到多人预测。与现有方法相比,我们的实现支持更细的预测时间步长,产生更精细的轨迹预测。这种增强的分辨率使运动规划器能够有效应对人类行为的细微变化。我们通过真实世界实验验证了该方法,展示了一台机器人在多个具有不同导航目标的人类之间安全导航的能力。我们的结果突显了该方法在动态环境中实现稳健且高效人机共存的潜力。
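Treating future predictions as particles amounts to rolling many noisy, goal-directed rollouts forward in parallel, with the inferred confidence scaling the process noise. A vectorized NumPy sketch of that idea (the same array math maps onto a GPU); all parameter names and the motion model are assumptions, not the paper's code:

```python
import numpy as np

def predict_particles(pos, goal, confidence, horizon, n_particles,
                      dt=0.008, speed=1.3, seed=0):
    """Propagate future-position particles for one human.

    Higher `confidence` in the goal model means less process noise, so the
    particle cloud stays tight; low confidence spreads the predictions out.
    """
    rng = np.random.default_rng(seed)
    parts = np.tile(np.asarray(pos, float), (n_particles, 1))
    out = []
    for _ in range(horizon):
        to_goal = goal - parts
        heading = to_goal / (np.linalg.norm(to_goal, axis=1, keepdims=True) + 1e-9)
        noise = rng.normal(scale=1.0 / (confidence + 1e-3), size=parts.shape)
        parts = parts + dt * (speed * heading + noise)
        out.append(parts.copy())
    return np.stack(out)            # shape: (horizon, n_particles, 2)
```

Because each particle and time step is an independent array operation, the loop parallelizes trivially, which is what makes high-frequency long-horizon prediction feasible on a GPU.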
cs.RO / 43 / 2603.01126
Pro-HOI: Perceptive Root-guided Humanoid-Object Interaction
Pro-HOI:感知根导向的人形-物体交互
Abstract
Executing reliable Humanoid-Object Interaction (HOI) tasks for humanoid robots is hindered by the lack of generalized control interfaces and robust closed-loop perception mechanisms. In this work, we introduce Perceptive Root-guided Humanoid-Object Interaction, Pro-HOI, a generalizable framework for robust humanoid loco-manipulation. First, we collect box-carrying motions that are suitable for real-world deployment and optimize penetration artifacts through a Signed Distance Field loss. Second, we propose a novel training framework that conditions the policy on a desired root-trajectory while utilizing reference motion exclusively as a reward. This design not only eliminates the need for intricate reward tuning but also establishes root trajectory as a universal interface for high-level planners, enabling simultaneous navigation and loco-manipulation. Furthermore, to ensure operational reliability, we incorporate a persistent object estimation module. By fusing real-time detection with Digital Twin, this module allows the robot to autonomously detect slippage and trigger re-grasping maneuvers. Empirical validation on a Unitree G1 robot demonstrates that Pro-HOI significantly outperforms baselines in generalization and robustness, achieving reliable long-horizon execution in complex real-world scenarios.
Chinese Translation
人形机器人执行可靠的人形-物体交互(HOI)任务受到缺乏通用控制接口和稳健闭环感知机制的限制。在本研究中,我们提出了感知根导向的人形-物体交互框架(Pro-HOI),这是一个用于稳健人形运动操控(loco-manipulation)的可泛化框架。首先,我们收集了适合真实部署的箱子搬运动作,并通过带符号距离场(SDF)损失优化穿透伪影。其次,我们提出了一种新颖的训练框架,该框架以期望的根轨迹作为策略的条件输入,同时仅将参考动作用作奖励。这一设计不仅消除了复杂奖励调优的需求,还将根轨迹确立为高层规划器的通用接口,从而实现导航与运动操控的同步进行。此外,为确保运行可靠性,我们引入了一个持久的物体估计模块。通过将实时检测与数字孪生(Digital Twin)融合,该模块使机器人能够自主检测滑移并触发重新抓取操作。在Unitree G1机器人上的实证验证表明,Pro-HOI在泛化能力和稳健性方面显著优于基线方法,实现了在复杂现实场景中可靠的长时程执行。
cs.RO / 44 / 2603.01128
A Deployable Bio-inspired Compliant Leg Design for Enhanced Leaping in Quadruped Robots
一种可部署的仿生柔性腿设计以增强四足机器人跳跃能力
Abstract
Quadruped robots are becoming increasingly essential for various applications, including industrial inspection and disaster search and rescue. These scenarios require robots to possess enhanced agility and obstacle-navigation skills. Nonetheless, the performance of current platforms is often constrained by insufficient peak motor power, limiting their ability to perform explosive jumps. To address this challenge, this paper proposes a bio-inspired method that emulates the energy-storage mechanism found in froghopper legs. We designed a Deployable Compliant Leg (DCL) utilizing a specialized 3D-printed elastic material, Polyether block amide (PEBA), featuring a lightweight internal lattice structure. This structure functions analogously to biological tendons, storing elastic energy during the robot's squatting phase and rapidly releasing it to augment motor output during the leap. The proposed mechanical design significantly enhances the robot's vertical jumping capability. Through finite element analysis (FEA) and experimental validation, we demonstrate a relative performance improvement of 17.1% in vertical jumping height.
Chinese Translation
四足机器人在工业巡检和灾难搜救等多种应用中变得越来越重要。这些场景要求机器人具备增强的敏捷性和越障能力。然而,当前平台的性能往往受到电机峰值功率不足的限制,制约了其进行爆发性跳跃的能力。为了解决这一挑战,本文提出了一种仿生方法,模拟沫蝉(froghopper)腿部的储能机制。我们设计了一种可部署柔性腿(Deployable Compliant Leg, DCL),采用一种特殊的3D打印弹性材料——聚醚嵌段酰胺(Polyether block amide, PEBA),并具有轻量化的内部点阵结构。该结构的功能类似于生物肌腱,在机器人下蹲阶段储存弹性能量,并在起跳时快速释放以增强电机输出。所提出的机械设计显著提高了机器人的垂直跳跃能力。通过有限元分析(FEA)和实验验证,我们展示了垂直跳跃高度17.1%的相对性能提升。
cs.RO / 45 / 2603.01151
D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping
D-REX:用于学习灵巧抓取的可微分真实-仿真-真实引擎
Abstract
Simulation provides a cost-effective and flexible platform for data generation and policy learning to develop robotic systems. However, bridging the gap between simulation and real-world dynamics remains a significant challenge, especially in physical parameter identification. In this work, we introduce a real-to-sim-to-real engine that leverages the Gaussian Splat representations to build a differentiable engine, enabling object mass identification from real-world visual observations and robot control signals, while enabling grasping policy learning simultaneously. Through optimizing the mass of the manipulated object, our method automatically builds high-fidelity and physically plausible digital twins. Additionally, we propose a novel approach to train force-aware grasping policies from limited data by transferring feasible human demonstrations into simulated robot demonstrations. Through comprehensive experiments, we demonstrate that our engine achieves accurate and robust performance in mass identification across various object geometries and mass values. Those optimized mass values facilitate force-aware policy learning, achieving superior and high performance in object grasping, effectively reducing the sim-to-real gap.
Chinese Translation
仿真为开发机器人系统提供了一种成本效益高且灵活的数据生成与策略学习平台。然而,弥合仿真与现实世界动力学之间的差距仍然是一个重大挑战,尤其是在物理参数辨识方面。在本研究中,我们引入了一种真实-仿真-真实引擎,该引擎利用高斯泼溅(Gaussian Splat)表示构建可微分引擎,使得能够从真实世界的视觉观测和机器人控制信号中辨识物体质量,同时进行抓取策略学习。通过优化被操控物体的质量,我们的方法自动构建高保真且物理上合理的数字孪生。此外,我们提出了一种新颖的方法,通过将可行的人类演示转化为仿真机器人演示,从有限数据中训练力感知抓取策略。通过全面的实验,我们证明了该引擎在各种物体几何形状和质量值下的质量辨识中均实现了准确且稳健的性能。这些优化后的质量值促进了力感知策略学习,在物体抓取中取得了优越的性能,有效缩小了仿真与现实之间的差距。
cs.RO / 46 / 2603.01153
RAG-RUSS: A Retrieval-Augmented Robotic Ultrasound for Autonomous Carotid Examination
RAG-RUSS:一种用于自主颈动脉检查的检索增强型机器人超声
Abstract
Robotic ultrasound (US) has recently attracted increasing attention as a means to overcome the limitations of conventional US examinations, such as the strong operator dependence. However, the decision-making process of existing methods is often either rule-based or relies on end-to-end learning models that operate as black boxes. This has been seen as a major limitation for clinical acceptance and raises safety concerns for widespread adoption in routine practice. To tackle this challenge, we introduce RAG-RUSS, an interpretable framework capable of performing a full carotid examination in accordance with the clinical workflow while explicitly explaining both the current stage and the next planned action. Furthermore, given the scarcity of medical data, we incorporate retrieval-augmented generation to enhance generalization and reduce dependence on large-scale training datasets. The method was trained on data acquired from 28 volunteers, while an additional four volumetric scans recorded from previously unseen volunteers were reserved for testing. The results demonstrate that the method can explain the current scanning stage and autonomously plan probe motions to complete the carotid examination, encompassing both transverse and longitudinal planes.
Chinese Translation
机器人超声(US)作为克服传统超声检查局限性(例如对操作者的强依赖性)的一种手段,近年来受到越来越多的关注。然而,现有方法的决策过程往往要么是基于规则的,要么依赖于以黑箱方式运行的端到端学习模型。这被视为临床接受度的主要限制,并引发了对其在常规实践中广泛应用的安全担忧。为了解决这一挑战,我们引入了RAG-RUSS,这是一种可解释的框架,能够按照临床工作流程执行完整的颈动脉检查,同时明确解释当前阶段和下一步计划的动作。此外,考虑到医学数据的稀缺性,我们结合了检索增强生成(retrieval-augmented generation)来增强泛化能力,并减少对大规模训练数据集的依赖。该方法在28名志愿者采集的数据上进行训练,另有来自先前未见志愿者的四个体积扫描被保留用于测试。结果表明,该方法能够解释当前扫描阶段,并自主规划探头运动以完成涵盖横切面和纵切面的颈动脉检查。
cs.RO / 47 / 2603.01176
Path Integral Particle Filtering for Hybrid Systems via Saltation Matrices
通过跳跃矩阵的混合系统路径积分粒子滤波
Abstract
We present an optimal-control-based particle filtering method for state estimation in hybrid systems that undergo intermittent contact with their environments. We follow the path integral filtering framework that exploits the duality between the smoothing problem and optimal control. We leverage saltation matrices to map out the uncertainty propagation during contact events for hybrid systems. The resulting path integral optimal control problem allows for a state estimation algorithm robust to outlier effects, flexible to non-Gaussian noise distributions, that also handles the challenging contact dynamics in hybrid systems. This work offers a computationally efficient and reliable estimation algorithm for hybrid systems with stochastic dynamics. We also present extensive experimental results demonstrating that our approach consistently outperforms strong baselines across multiple settings.
Chinese Translation
我们提出了一种基于最优控制的粒子滤波方法,用于对经历与环境间歇性接触的混合系统进行状态估计。我们遵循路径积分滤波框架,利用平滑问题与最优控制之间的对偶性。我们利用跳跃矩阵来描绘混合系统在接触事件期间的不确定性传播。由此产生的路径积分最优控制问题允许一种对异常值影响具有鲁棒性的状态估计算法,能够灵活应对非高斯噪声分布,并处理混合系统中具有挑战性的接触动态。该工作为具有随机动态的混合系统提供了一种计算高效且可靠的估计算法。我们还展示了大量实验结果,证明我们的方法在多种设置下始终优于强基线。
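For a time-invariant guard $g(x)=0$ with reset map $R$, the saltation matrix is $\Xi = D_xR + \big(f^+ - D_xR\,f^-\big)\,(D_xg)^\top / (D_xg \cdot f^-)$, and uncertainty propagates across the contact event as $\Sigma^+ = \Xi\,\Sigma^-\,\Xi^\top$. The sketch below applies this standard formula to the classic bouncing-ball hybrid system; it illustrates the covariance mapping only, not the paper's full path integral filter.

```python
import numpy as np

def saltation_matrix(DxR, f_minus, f_plus, Dxg):
    """Xi = DxR + (f+ - DxR f-) (Dxg)^T / (Dxg . f-), time-invariant guard."""
    return DxR + np.outer(f_plus - DxR @ f_minus, Dxg) / (Dxg @ f_minus)

# Bouncing ball: state [height, velocity], guard g(x) = height,
# reset R(x) = [height, -e * velocity] with restitution e.
e = 0.8
f_minus = np.array([-2.0, -9.81])   # pre-impact flow [v, -g]
f_plus = np.array([1.6, -9.81])     # post-impact flow [-e*v, -g]
DxR = np.diag([1.0, -e])
Dxg = np.array([1.0, 0.0])

Xi = saltation_matrix(DxR, f_minus, f_plus, Dxg)
Sigma_plus = Xi @ (0.01 * np.eye(2)) @ Xi.T   # covariance across the impact
```

Note that $\Xi$ differs from the plain reset Jacobian $D_xR$: the extra rank-one term accounts for particles that cross the guard at slightly different times, which is exactly the effect a particle filter must capture at contact events.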
cs.RO / 48 / 2603.01178
riMESA: Consensus ADMM for Real-World Collaborative SLAM
riMESA:用于现实世界协作SLAM的共识ADMM
Abstract
Collaborative Simultaneous Localization and Mapping (C-SLAM) is a fundamental capability for multi-robot teams as it enables downstream tasks like planning and navigation. However, existing C-SLAM back-end algorithms that are required to solve this problem struggle to address the practical realities of real-world deployments (i.e. communication limitations, outlier measurements, and online operation). In this paper we propose Robust Incremental Manifold Edge-based Separable ADMM (riMESA) -- a robust, incremental, and distributed C-SLAM back-end that is resilient to outliers, reliable in the face of limited communication, and can compute accurate state estimates for a multi-robot team in real-time. Through the development of riMESA, we, more broadly, make an argument for the use of Consensus Alternating Direction Method of Multipliers as a theoretical foundation for distributed optimization tasks in robotics like C-SLAM due to its flexibility, accuracy, and fast convergence. We conclude this work with an in-depth evaluation of riMESA on a variety of C-SLAM problem scenarios and communication network conditions using both synthetic and real-world C-SLAM data. These experiments demonstrate that riMESA is able to generalize across conditions, produce accurate state estimates, operate in real-time, and outperform the accuracy of prior works by a factor >7x on real-world datasets.
Chinese Translation
协作同时定位与地图构建(C-SLAM)是多机器人团队的一项基本能力,因为它能够支持规划和导航等下游任务。然而,现有的C-SLAM后端算法在解决这一问题时,难以应对现实部署中的实际情况(即通信限制、异常测量和在线操作)。在本文中,我们提出了一种鲁棒增量流形边缘分离ADMM(riMESA)——一种鲁棒的、增量的、分布式的C-SLAM后端,能够抵御异常值,在通信受限的情况下可靠运行,并能够实时计算多机器人团队的准确状态估计。通过riMESA的开发,我们更广泛地论证了共识交替方向乘子法(Consensus Alternating Direction Method of Multipliers)作为机器人领域分布式优化任务(如C-SLAM)的理论基础的有效性,因其具有灵活性、准确性和快速收敛性。我们以对riMESA在多种C-SLAM问题场景和通信网络条件下的深入评估作为结尾,使用合成和真实世界的C-SLAM数据。这些实验表明,riMESA能够在不同条件下进行泛化,产生准确的状态估计,实时运行,并在真实世界数据集上超越先前工作的准确性,提升幅度超过7倍。
cs.RO / 49 / 2603.01189
Agent-Based Simulation of Trust Development in Human-Robot Teams: An Empirically-Validated Framework
基于代理的信任发展模拟在人机团队中的应用:一个经实证验证的框架
Abstract
This paper presents an empirically grounded agent-based model capturing trust dynamics, workload distribution, and collaborative performance in human-robot teams. The model, implemented in NetLogo 6.4.0, simulates teams of 2--10 agents performing tasks of varying complexity. We validate against Hancock et al.'s (2021) meta-analysis, achieving interval validity for 4 of 8 trust antecedent categories and strong ordinal validity (Spearman $\rho = 0.833$). Sensitivity analysis using OFAT and full factorial designs ($n = 50$ replications per condition) reveals that robot reliability exhibits the strongest effect on trust ($\eta^2 = 0.35$) and dominates task success ($\eta^2 = 0.93$) and productivity ($\eta^2 = 0.89$), consistent with meta-analytic findings. Trust asymmetry ratios ranged from 0.07 to 0.55 -- below the meta-analytic benchmark of 1.50 -- revealing that per-event asymmetry does not guarantee cumulative asymmetry when trust repair mechanisms remain active. Scenario analysis uncovered trust-performance decoupling: the Trust Recovery scenario achieved the highest productivity (4.29) despite the lowest trust (38.2), while the Unreliable Robot scenario produced the highest trust (73.2) despite the lowest task success (33.4\%), establishing calibration error as a critical diagnostic distinct from trust magnitude. Factorial ANOVA confirmed significant main effects for reliability, transparency, communication, and collaboration ($p < .001$), explaining 45.4\% of trust variance. The open-source implementation provides an evidence-based tool for identifying overtrust and undertrust conditions prior to deployment.
Chinese Translation
本文提出了一个以实证为基础的基于代理的模型,捕捉人机团队中的信任动态、工作负载分配和协作表现。该模型在 NetLogo 6.4.0 中实现,模拟由 2 至 10 个代理组成的团队执行不同复杂度的任务。我们对照 Hancock 等人(2021)的元分析进行了验证,在 8 个信任前因类别中有 4 个达到区间有效性,并获得了较强的序数有效性(Spearman $\rho = 0.833$)。使用单因素实验(OFAT)和全因子设计(每个条件 $n = 50$ 次重复)进行的敏感性分析显示,机器人可靠性对信任的影响最强($\eta^2 = 0.35$),并主导任务成功($\eta^2 = 0.93$)和生产力($\eta^2 = 0.89$),与元分析结果一致。信任不对称比率范围为 0.07 至 0.55,低于元分析基准值 1.50,表明当信任修复机制保持活跃时,单次事件的不对称性并不保证累积的不对称性。情景分析揭示了信任与表现的解耦:信任恢复情景尽管信任最低(38.2),却实现了最高的生产力(4.29);而不可靠机器人情景尽管任务成功率最低(33.4%),却产生了最高的信任(73.2),从而确立了校准误差作为区别于信任强度的关键诊断指标。因子方差分析(ANOVA)确认了可靠性、透明度、沟通和协作的显著主效应($p < .001$),解释了 45.4% 的信任方差。该开源实现为在部署前识别过度信任与信任不足的情况提供了基于证据的工具。
cs.RO / 50 / 2603.01229
RMBench: Memory-Dependent Robotic Manipulation Benchmark with Insights into Policy Design
RMBench:具有策略设计洞察的记忆依赖型机器人操作基准
Abstract
Robotic manipulation policies have made rapid progress in recent years, yet most existing approaches give limited consideration to memory capabilities. Consequently, they struggle to solve tasks that require reasoning over historical observations and maintaining task-relevant information over time, which are common requirements in real-world manipulation scenarios. Although several memory-aware policies have been proposed, systematic evaluation of memory-dependent manipulation remains underexplored, and the relationship between architectural design choices and memory performance is still not well understood. To address this gap, we introduce RMBench, a simulation benchmark comprising 9 manipulation tasks that span multiple levels of memory complexity, enabling systematic evaluation of policy memory capabilities. We further propose Mem-0, a modular manipulation policy with explicit memory components designed to support controlled ablation studies. Through extensive simulation and real-world experiments, we identify memory-related limitations in existing policies and provide empirical insights into how architectural design choices influence memory performance. The website is available at https://rmbench.github.io/.
Chinese Translation
近年来,机器人操作策略取得了快速进展,但大多数现有方法对记忆能力的考虑有限。因此,它们在解决需要对历史观察进行推理并在时间上保持与任务相关信息的任务时遇到困难,而这些都是现实世界操作场景中的常见要求。尽管已经提出了几种记忆感知策略,但对记忆依赖型操作的系统评估仍然未得到充分探索,并且架构设计选择与记忆性能之间的关系仍不清楚。为了解决这一问题,我们引入了RMBench,这是一个包含9个操作任务的仿真基准,涵盖多个记忆复杂度级别,能够系统地评估策略的记忆能力。我们进一步提出了Mem-0,这是一种具有明确记忆组件的模块化操作策略,旨在支持受控消融研究。通过广泛的仿真和现实世界实验,我们识别了现有策略中的记忆相关限制,并提供了关于架构设计选择如何影响记忆性能的实证见解。网站地址为 https://rmbench.github.io/
cs.RO / 51 / 2603.01267
Certifiable Estimation with Factor Graphs
基于因子图的可证明估计
Abstract
Factor graphs provide a convenient modular modeling language that enables practitioners to design and deploy high-performance robotic state estimation systems by composing simple, reusable building blocks. However, inference in these models is typically performed using local optimization methods that can converge to suboptimal solutions, a serious reliability concern in safety-critical applications. Conversely, certifiable estimators based on convex relaxation can recover verifiably globally optimal solutions in many practical settings, but the computational cost of solving their large-scale relaxations necessitates specialized, structure-exploiting solvers that require substantial expertise to implement, significantly hampering practical deployment. In this paper, we show that these two paradigms, which have thus far been treated as independent in the literature, can be naturally synthesized into a unified framework for certifiable factor graph optimization. The key insight is that factor graph structure is preserved under Shor's relaxation and Burer-Monteiro factorization: applying these transformations to a QCQP with an associated factor graph representation yields a lifted problem admitting a factor graph model with identical connectivity, in which variables and factors are simple one-to-one algebraic transformations of those in the original QCQP. This structural preservation enables the Riemannian Staircase methodology for certifiable estimation to be implemented using the same mature, highly-performant factor graph libraries and workflows already ubiquitously employed throughout robotics and computer vision, making certifiable estimation as straightforward to design and deploy as conventional factor graph inference.
Chinese Translation
因子图提供了一种方便的模块化建模语言,使从业者能够通过组合简单、可重用的构建块来设计和部署高性能的机器人状态估计系统。然而,这些模型中的推断通常使用局部优化方法进行,可能收敛到次优解,这在安全关键应用中是一个严重的可靠性问题。相反,基于凸松弛的可证明估计器可以在许多实际场景中恢复可验证的全局最优解,但求解其大规模松弛问题的计算成本需要专门的、利用问题结构的求解器,而实现这类求解器需要相当的专业知识,显著妨碍了实际部署。在本文中,我们展示了这两种迄今在文献中被视为相互独立的范式可以自然地融合为一个统一的可证明因子图优化框架。关键的见解是,因子图结构在Shor松弛和Burer-Monteiro分解下得以保留:将这些变换应用于具有相应因子图表示的二次约束二次规划(QCQP),会产生一个提升问题,该提升问题同样具有一个连接性完全相同的因子图模型,其中的变量和因子只是原始QCQP中变量和因子的简单一一对应代数变换。这种结构保留使得面向可证明估计的黎曼阶梯(Riemannian Staircase)方法能够使用机器人学和计算机视觉领域已广泛采用的成熟、高性能因子图库和工作流程来实现,从而使可证明估计的设计和部署与传统因子图推断同样简单。
cs.RO / 52 / 2603.01294
Spherical Latent Motion Prior for Physics-Based Simulated Humanoid Control
基于物理的类人控制的球面潜在运动先验
Abstract
Learning motion priors for physics-based humanoid control is an active research topic. Existing approaches mainly include variational autoencoders (VAE) and adversarial motion priors (AMP). VAE introduces information loss, and random latent sampling may sometimes produce invalid behaviors. AMP suffers from mode collapse and struggles to capture diverse motion skills. We present the Spherical Latent Motion Prior (SLMP), a two-stage method for learning motion priors. In the first stage, we train a high-quality motion tracking controller. In the second stage, we distill the tracking controller into a spherical latent space. A combination of distillation, a discriminator, and a discriminator-guided local semantic consistency constraint shapes a structured latent action space, allowing stable random sampling without information loss. To evaluate SLMP, we collect a two-hour human combat motion capture dataset and show that SLMP preserves fine motion detail without information loss, and random sampling yields semantically valid and stable behaviors. When applied to a two-agent physics-based combat task, SLMP produces human-like and physically plausible combat behaviors only using simple rule-based rewards. Furthermore, SLMP generalizes across different humanoid robot morphologies, demonstrating its transferability beyond a single simulated avatar.
Chinese Translation
学习基于物理的类人控制的运动先验是一个活跃的研究课题。现有的方法主要包括变分自编码器(VAE)和对抗运动先验(AMP)。VAE引入了信息损失,随机潜在采样有时可能产生无效行为。AMP则面临模式崩溃的问题,难以捕捉多样的运动技能。我们提出了球面潜在运动先验(SLMP),这是一种用于学习运动先验的两阶段方法。在第一阶段,我们训练一个高质量的运动跟踪控制器。在第二阶段,我们将跟踪控制器蒸馏到一个球面潜在空间。通过蒸馏、判别器和判别器引导的局部语义一致性约束的结合,形成了一个结构化的潜在动作空间,使得稳定的随机采样得以实现而不损失信息。为了评估SLMP,我们收集了一个两小时的人类战斗动作捕捉数据集,并展示SLMP在不损失信息的情况下保留了细致的运动细节,随机采样产生了语义上有效且稳定的行为。当应用于一个双代理的基于物理的战斗任务时,SLMP仅使用简单的基于规则的奖励就能产生类人且物理上合理的战斗行为。此外,SLMP在不同的类人机器人形态中具有良好的泛化能力,展示了其超越单一模拟化身的可转移性。
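The core property of a spherical latent space can be illustrated in a few lines: encoder outputs are projected onto the unit hypersphere, so every random direction is itself a valid latent, which is what makes random sampling stable. A minimal sketch under assumed conventions (the function names and the normalized-Gaussian sampler are illustrative, not SLMP's implementation):

```python
import numpy as np

def to_spherical_latent(z):
    """Project an encoder output onto the unit hypersphere -- the basic
    operation behind a spherical latent action space. A minimal sketch,
    not SLMP's full training pipeline."""
    z = np.asarray(z, dtype=float)
    return z / np.linalg.norm(z)

def sample_spherical(dim, rng=None):
    """Uniform sampling on the sphere: normalize a standard Gaussian."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return to_spherical_latent(rng.normal(size=dim))
```

Because every sample lies on the sphere by construction, there is no "hole" in the latent space of the kind that makes random VAE samples produce invalid behaviors.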
cs.RO / 53 / 2603.01302
Hybrid TD3: Overestimation Bias Analysis and Stable Policy Optimization for Hybrid Action Space
混合 TD3:过估计偏差分析与混合动作空间的稳定策略优化
Abstract
Reinforcement learning in discrete-continuous hybrid action spaces presents fundamental challenges for robotic manipulation, where high-level task decisions and low-level joint-space execution must be jointly optimized. Existing approaches either discretize continuous components or relax discrete choices into continuous approximations, which suffer from scalability limitations and training instability in high-dimensional action spaces and under domain randomization. In this paper, we propose Hybrid TD3, an extension of Twin Delayed Deep Deterministic Policy Gradient (TD3) that natively handles parameterized hybrid action spaces in a principled manner. We conduct a rigorous theoretical analysis of overestimation bias in hybrid action settings, deriving formal bounds under twin-critic architectures and establishing a complete bias ordering across five algorithmic variants. Building on this analysis, we introduce a weighted clipped Q-learning target that marginalizes over the discrete action distribution, achieving equivalent bias reduction to standard clipped minimization while improving policy smoothness. Experimental results demonstrate that Hybrid TD3 achieves superior training stability and competitive performance against state-of-the-art hybrid action baselines.
Chinese Translation
在离散-连续混合动作空间中的强化学习为机器人操作带来了基本挑战,其中高层任务决策与低层关节空间执行必须共同优化。现有的方法要么对连续部分进行离散化,要么将离散选择放宽为连续近似,这在高维动作空间和领域随机化下存在可扩展性限制和训练不稳定性。本文提出了混合 TD3,这是双延迟深度确定性策略梯度(TD3)的扩展,能够以原则性的方式原生处理参数化的混合动作空间。我们对混合动作设置中的过估计偏差进行了严格的理论分析,推导了在双评论家架构下的正式界限,并建立了五种算法变体之间的完整偏差排序。在此分析的基础上,我们引入了一种加权裁剪 Q 学习目标,该目标对离散动作分布进行边际化,实现了与标准裁剪最小化相当的偏差减少,同时提高了策略的平滑性。实验结果表明,混合 TD3 在训练稳定性和与最先进的混合动作基线的竞争性能方面均表现优越。
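The weighted clipped Q-learning target described above can be sketched numerically: rather than taking a hard argmax over the discrete component, the clipped twin-critic values are averaged under the next-state discrete action distribution. A hedged sketch assuming simple array shapes (names and shapes are illustrative, not the paper's code):

```python
import numpy as np

def hybrid_clipped_target(q1, q2, pi_disc, reward, gamma=0.99):
    """Weighted clipped Q-learning target for a parameterized hybrid
    action space (a sketch under assumed shapes).

    q1, q2  : shape (K,) -- twin-critic values at the next state for each
              of K discrete actions, each paired with the actor's
              continuous parameters.
    pi_disc : shape (K,) -- next-state discrete action probabilities,
              used to marginalize instead of a hard argmax.
    """
    clipped = np.minimum(q1, q2)              # element-wise twin-critic clipping
    v_next = float(np.dot(pi_disc, clipped))  # expectation over discrete actions
    return reward + gamma * v_next
```

Marginalizing keeps the same pessimistic min-of-two-critics bias reduction while avoiding the abrupt target switches a hard argmax produces, which is the smoothness benefit the abstract refers to.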
cs.RO / 54 / 2603.01404
D-GVIO: A Buffer-Driven and Efficient Decentralized GNSS-Visual-Inertial State Estimator for Multi-Agent Systems
D-GVIO:一种基于缓冲区驱动的高效去中心化GNSS-视觉-惯性状态估计器用于多智能体系统
Abstract
Cooperative localization is essential for swarm applications like collaborative exploration and search-and-rescue missions. However, maintaining real-time capability, robustness, and computational efficiency on resource-constrained platforms presents significant challenges. To address these challenges, we propose D-GVIO, a buffer-driven and fully decentralized GNSS-Visual-Inertial Odometry (GVIO) framework that leverages a novel buffering strategy to support efficient and robust distributed state estimation. The proposed framework is characterized by four core mechanisms. Firstly, through covariance segmentation, covariance intersection and buffering strategy, we modularize propagation and update steps in distributed state estimation, significantly reducing computational and communication burdens. Secondly, the left-invariant extended Kalman filter (L-IEKF) is adopted for information fusion, which exhibits superior state estimation performance over the traditional extended Kalman filter (EKF) since its state transition matrix is independent of the system state. Thirdly, a buffer-based re-propagation strategy is employed to handle delayed measurements efficiently and accurately by leveraging the L-IEKF, eliminating the need for costly re-computation. Finally, an adaptive buffer-driven outlier detection method is proposed to dynamically cull GNSS outliers, enhancing robustness in GNSS-challenged environments.
Chinese Translation
协作定位对于群体应用,如协作探索和搜索救援任务至关重要。然而,在资源受限的平台上保持实时能力、鲁棒性和计算效率面临重大挑战。为了解决这些挑战,我们提出了D-GVIO,一种基于缓冲区驱动的完全去中心化GNSS-视觉-惯性里程计(GVIO)框架,该框架利用一种新颖的缓冲策略来支持高效和鲁棒的分布式状态估计。该框架的特点在于四个核心机制。首先,通过协方差分段、协方差交集和缓冲策略,我们将分布式状态估计中的传播和更新步骤模块化,显著减少了计算和通信负担。其次,采用左不变扩展卡尔曼滤波器(L-IEKF)进行信息融合,其状态估计性能优于传统的扩展卡尔曼滤波器(EKF),因为其状态转移矩阵与系统状态无关。第三,采用基于缓冲的重新传播策略,通过利用L-IEKF高效准确地处理延迟测量,消除了昂贵的重新计算需求。最后,提出了一种自适应缓冲驱动的异常值检测方法,以动态剔除GNSS异常值,从而增强在GNSS受限环境中的鲁棒性。
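The buffer-based re-propagation mechanism can be sketched abstractly: a delayed measurement is fused at its buffered time step, and only the stored inputs from that point onward are re-propagated, instead of recomputing the whole trajectory. A minimal Python sketch where `update` and `propagate` are caller-supplied stand-ins for the L-IEKF update and propagation steps (all names are illustrative assumptions):

```python
def repropagate(buffer, delayed_idx, update, propagate):
    """Buffer-driven handling of a delayed measurement (sketch).

    buffer      : list of (state_estimate, input) pairs, oldest first.
    delayed_idx : index of the time step the late measurement belongs to.
    update      : callback fusing the measurement into a state.
    propagate   : callback advancing a state with a stored input.
    """
    state, _ = buffer[delayed_idx]
    state = update(state)                 # fuse the delayed measurement
    for i in range(delayed_idx, len(buffer)):
        _, u = buffer[i]
        buffer[i] = (state, u)            # overwrite the stale estimate
        state = propagate(state, u)       # roll forward with stored inputs
    return state
```

Only the tail of the buffer is touched, which is why the cost stays bounded by the buffer length rather than the full trajectory.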
cs.RO / 55 / 2603.01414
Jailbreaking Embodied LLMs via Action-level Manipulation
通过动作级操控破解具身大型语言模型
Abstract
Embodied Large Language Models (LLMs) enable AI agents to interact with the physical world through natural language instructions and actions. However, beyond the language-level risks inherent to LLMs themselves, embodied LLMs with real-world actuation introduce a new vulnerability: instructions that appear semantically benign may still lead to dangerous real-world consequences, revealing a fundamental misalignment between linguistic security and physical outcomes. In this paper, we introduce Blindfold, an automated attack framework that leverages the limited causal reasoning capabilities of embodied LLMs in real-world action contexts. Rather than iterative trial-and-error jailbreaking of black-box embodied LLMs, Blindfold adopts an Adversarial Proxy Planning strategy: it compromises a local surrogate LLM to perform action-level manipulations that appear semantically safe but could result in harmful physical effects when executed. Blindfold further conceals key malicious actions by injecting carefully crafted noise to evade detection by defense mechanisms, and it incorporates a rule-based verifier to improve the attack executability. Evaluations on both embodied AI simulators and a real-world 6DoF robotic arm show that Blindfold achieves up to 53% higher attack success rates than SOTA baselines, highlighting the urgent need to move beyond surface-level language censorship and toward consequence-aware defense mechanisms to secure embodied LLMs.
Chinese Translation
具身大型语言模型(LLMs)使人工智能代理能够通过自然语言指令和动作与物理世界进行交互。然而,除了LLMs自身固有的语言层面风险外,具身LLMs在现实世界中的执行引入了一种新的脆弱性:看似语义上无害的指令仍可能导致危险的现实后果,揭示了语言安全与物理结果之间的根本不一致。在本文中,我们介绍了Blindfold,这是一种自动化攻击框架,利用具身LLMs在现实世界行动背景下有限的因果推理能力。与对黑箱具身LLMs进行迭代试错式破解不同,Blindfold采用了对抗代理规划策略:它妥协了一个本地替代LLM,以执行看似语义安全但在执行时可能导致有害物理效果的动作级操控。Blindfold进一步通过注入精心设计的噪声来隐藏关键恶意动作,以规避防御机制的检测,并结合基于规则的验证器以提高攻击的可执行性。在具身人工智能模拟器和现实世界的六自由度机器人手臂上的评估表明,Blindfold的攻击成功率比最先进的基线高出多达53%,突显了超越表面语言审查、朝向后果意识防御机制以保护具身LLMs的迫切需求。
cs.RO / 56 / 2603.01436
PhysGraph: Physically-Grounded Graph-Transformer Policies for Bimanual Dexterous Hand-Tool-Object Manipulation
PhysGraph:用于双手灵巧工具物体操作的物理基础图变换器策略
Abstract
Bimanual dexterous manipulation for tool use remains a formidable challenge in robotics due to the high-dimensional state space and complicated contact dynamics. Existing methods naively represent the entire system state as a single configuration vector, disregarding the rich structural and topological information inherent to articulated hands. We present PhysGraph, a physically-grounded graph transformer policy designed explicitly for challenging bimanual hand-tool-object manipulation. Unlike prior works, we represent the bimanual system as a kinematic graph and introduce per-link tokenization to preserve fine-grained local state information. We propose a physically-grounded bias generator that injects structural priors directly into the attention mechanism, including kinematic spatial distance, dynamic contact states, geometric proximity, and anatomical properties. This allows the policy to explicitly reason about physical interactions rather than learning them implicitly from sparse rewards. Extensive experiments show that PhysGraph significantly outperforms the ManipTrans baseline in manipulation precision and task success rates while using only 51% of the parameters of ManipTrans. Furthermore, the inherent topological flexibility of our architecture shows qualitative zero-shot transfer to unseen tool/object geometries, and is sufficiently general to be trained on three robotic hands (Shadow, Allegro, Inspire).
Chinese Translation
由于高维状态空间和复杂的接触动力学,面向工具使用的双手灵巧操作在机器人领域仍然是一项艰巨的挑战。现有方法简单地将整个系统状态表示为单一的配置向量,忽视了关节手固有的丰富结构和拓扑信息。我们提出了PhysGraph,一种专门为具有挑战性的双手-工具-物体操作设计的物理基础图变换器策略。与之前的工作不同,我们将双手系统表示为运动学图,并引入每个连杆的标记化,以保留细粒度的局部状态信息。我们提出了一种物理基础的偏置生成器,直接将结构先验注入到注意力机制中,包括运动学空间距离、动态接触状态、几何接近性和解剖特性。这使得策略能够显式地推理物理交互,而不是从稀疏奖励中隐式学习。大量实验表明,PhysGraph在操作精度和任务成功率上显著优于基线方法ManipTrans,同时仅使用ManipTrans的51%的参数。此外,我们架构固有的拓扑灵活性在未见过的工具/物体几何形状上显示出定性的零样本迁移能力,并且足够通用,可以在三种机器人手(Shadow、Allegro、Inspire)上进行训练。
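Injecting a structural prior into attention, as the bias generator does, amounts to adding a precomputed matrix to the attention logits before the softmax. A single-head, NumPy-only sketch (shapes and the additive-bias layout are assumptions for illustration, not PhysGraph's architecture):

```python
import numpy as np

def biased_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive physically-grounded
    bias on the logits, in the spirit of PhysGraph's bias generator.

    q, k, v : (T, d) matrices of T per-link tokens.
    bias    : (T, T) precomputed structural prior, e.g. scaled kinematic
              distance or contact indicators (illustrative assumption).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + bias        # inject the prior pre-softmax
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

A strongly negative bias entry effectively masks a pair of links out of each other's attention, while a positive one biases the policy toward physically related links; the prior steers attention without any learned parameters.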
cs.RO / 57 / 2603.01465
Non-Markovian Long-Horizon Robot Manipulation via Keyframe Chaining
通过关键帧链实现非马尔可夫长时间机器人操控
Abstract
Existing Vision-Language-Action (VLA) models often struggle to generalize to long-horizon tasks due to their heavy reliance on immediate observations. While recent studies incorporate retrieval mechanisms or extend context windows to handle procedural tasks, they often struggle to capture Non-Markovian dependencies, where optimal actions rely solely on specific past states rather than the current observation. To address this, we introduce Keyframe-Chaining VLA, a framework that extracts and links key historical frames to model long-horizon dependencies. Specifically, we propose an automatic keyframe selector that learns a discriminative embedding space, effectively identifying distinct state transitions. To capture task-critical information, we design a progress-aware query mechanism that dynamically retrieves historical frames based on their temporal relevance to the current execution phase. These selected keyframes are integrated into the VLA as interleaved visual tokens, explicitly grounding the policy in the long-horizon temporal context. Finally, we introduce a suite of four Non-Markovian manipulation tasks built upon the ManiSkill simulator to measure task success rates. Experimental results demonstrate that our method achieves superior performance, effectively tackling robot manipulation tasks characterized by long-horizon temporal dependencies. Code is available at https://github.com/cytoplastm/KC-VLA.
Chinese Translation
现有的视觉-语言-行动(VLA)模型通常由于过度依赖即时观察而难以推广到长时间任务。尽管近期研究通过引入检索机制或扩展上下文窗口来处理程序性任务,但它们往往难以捕捉非马尔可夫依赖关系,即最优行动仅依赖于特定的过去状态,而不是当前观察。为了解决这一问题,我们提出了关键帧链VLA(Keyframe-Chaining VLA),这是一个提取和链接关键历史帧以建模长时间依赖关系的框架。具体而言,我们提出了一种自动关键帧选择器,该选择器学习一个区分性的嵌入空间,有效识别不同的状态转变。为了捕捉任务关键的信息,我们设计了一种进度感知查询机制,根据历史帧与当前执行阶段的时间相关性动态检索历史帧。这些选择的关键帧作为交错的视觉标记集成到VLA中,明确将策略扎根于长时间的时间上下文中。最后,我们引入了一套基于ManiSkill模拟器构建的四个非马尔可夫操控任务,以测量任务成功率。实验结果表明,我们的方法在处理具有长时间依赖关系的机器人操控任务时表现优越。代码可在 https://github.com/cytoplastm/KC-VLA 获取。
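The paper's keyframe selector is learned over a discriminative embedding space; as a simplified stand-in, a greedy rule that keeps a frame whenever its embedding moves far enough from the last kept keyframe illustrates the mechanism of isolating distinct state transitions (the L2 metric and the threshold are illustrative assumptions):

```python
import numpy as np

def select_keyframes(embeddings, threshold):
    """Greedy keyframe selection over per-frame embeddings: keep a frame
    when it has drifted more than `threshold` (L2) from the last kept
    keyframe. A simplified stand-in for the learned selector."""
    keep = [0]                               # always keep the first frame
    for t in range(1, len(embeddings)):
        if np.linalg.norm(embeddings[t] - embeddings[keep[-1]]) > threshold:
            keep.append(t)
    return keep
```

The kept indices are exactly the frames a VLA would then receive as interleaved visual tokens, giving the policy a compact long-horizon history instead of the full frame sequence.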
cs.RO / 58 / 2603.01469
Mean-Flow based One-Step Vision-Language-Action
基于均值流的一步视觉-语言-动作
Abstract
Recent advances in Flow-Matching-based Vision-Language-Action (VLA) frameworks have demonstrated remarkable advantages in generating high-frequency action chunks, particularly for highly dexterous robotic manipulation tasks. Despite these notable achievements, their practical applications are constrained by prolonged generation latency, which stems from inherent iterative sampling requirements and architectural limitations. To address this critical bottleneck, we propose a Mean-Flow based One-Step VLA approach. Specifically, we resolve the noise-induced issues in the action generation process, thereby eliminating the consistency constraints inherent to conventional Flow-Matching methods. This significantly enhances generation efficiency and enables one-step action generation. Real-world robotic experiments show that the generation speed of the proposed Mean-Flow based One-Step VLA is 8.7 times and 83.9 times faster than that of SmolVLA and Diffusion Policy, respectively. These results elucidate its great potential as a high-efficiency backbone for VLA-based robotic manipulation.
Chinese Translation
近年来,基于流匹配的视觉-语言-动作(VLA)框架在生成高频动作片段方面展现出了显著优势,尤其是在高度灵巧的机器人操控任务中。尽管取得了这些显著成就,其实际应用仍受到生成延迟过长的限制,这主要源于固有的迭代采样需求和架构限制。为了解决这一关键瓶颈,我们提出了一种基于均值流的一步VLA方法。具体而言,我们解决了动作生成过程中由噪声引起的问题,从而消除了传统流匹配方法固有的一致性约束。这显著提高了生成效率,并实现了一步动作生成。实际的机器人实验表明,所提出的基于均值流的一步VLA的生成速度分别比SmolVLA和扩散策略快8.7倍和83.9倍。这些结果阐明了其作为高效VLA基础架构在机器人操控中的巨大潜力。
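The latency saving comes from replacing the sampling loop with a single network call: conventional flow matching integrates an instantaneous velocity field over many steps, whereas a mean-flow model predicts the *average* velocity over the whole interval. A toy contrast under assumed 1-D dynamics (both `velocity` and `mean_velocity` are illustrative stand-ins for trained networks):

```python
import numpy as np

def flow_matching_sample(velocity, noise, steps=10):
    """Multi-step Euler integration of an instantaneous velocity field
    (conventional flow-matching inference)."""
    x, dt = np.asarray(noise, dtype=float), 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)
    return x

def mean_flow_sample(mean_velocity, noise):
    """One-step generation: a network predicting the average velocity
    over [0, 1] maps noise to an action chunk in one forward pass."""
    x = np.asarray(noise, dtype=float)
    return x + mean_velocity(x)
```

For the linear field v(x, t) = -x the exact endpoint is x0·e^(-1), so the corresponding mean velocity is x0·(e^(-1) - 1); one evaluation of `mean_flow_sample` reproduces what the Euler loop only approximates.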
cs.RO / 59 / 2603.01474
ROSER: Few-Shot Robotic Sequence Retrieval for Scalable Robot Learning
ROSER:面向可扩展机器人学习的少样本机器人序列检索
Abstract
A critical bottleneck in robot learning is the scarcity of task-labeled, segmented training data, despite the abundance of large-scale robotic datasets recorded as long, continuous interaction logs. Existing datasets contain vast amounts of diverse behaviors, yet remain structurally incompatible with modern learning frameworks that require cleanly segmented, task-specific trajectories. We address this data utilization crisis by formalizing robotic sequence retrieval: the task of extracting reusable, task-centric segments from unlabeled logs using only a few reference examples. We introduce ROSER, a lightweight few-shot retrieval framework that learns task-agnostic metric spaces over temporal windows, enabling accurate retrieval with as few as 3-5 demonstrations, without any task-specific training required. To validate our approach, we establish comprehensive evaluation protocols and benchmark ROSER against classical alignment methods, learned embeddings, and language model baselines across three large-scale datasets (LIBERO, DROID, and nuScenes). Our experiments demonstrate that ROSER consistently outperforms all prior methods in both accuracy and efficiency, achieving sub-millisecond per-match inference while maintaining superior distributional alignment. By reframing data curation as few-shot retrieval, ROSER provides a practical pathway to unlock underutilized robotic datasets, fundamentally improving data availability for robot learning.
Chinese Translation
机器人学习中的一个关键瓶颈是任务标记的分段训练数据的稀缺,尽管存在大量记录为长时间连续交互日志的大规模机器人数据集。现有数据集包含大量多样化的行为,但在结构上与现代学习框架不兼容,这些框架要求干净分段的、特定任务的轨迹。我们通过形式化机器人序列检索来解决这一数据利用危机:仅使用少量参考示例,从未标记的日志中提取可重用的、以任务为中心的片段。我们提出了ROSER,一个轻量级的少样本检索框架,它在时间窗口上学习与任务无关的度量空间,使得在不需要任何特定任务训练的情况下,仅通过3-5个演示就能实现准确的检索。为了验证我们的方法,我们建立了全面的评估协议,并将ROSER与经典对齐方法、学习嵌入和语言模型基线在三个大规模数据集(LIBERO、DROID和nuScenes)上进行基准测试。我们的实验表明,ROSER在准确性和效率上始终优于所有先前的方法,实现了每次匹配亚毫秒的推理,同时保持了优越的分布对齐。通过将数据策划重新定义为少样本检索,ROSER为解锁未充分利用的机器人数据集提供了一条实用的途径,从根本上改善了机器人学习的数据可用性。
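Retrieval over temporal windows reduces, in its simplest form, to sliding a window the length of the reference over the log's per-step embeddings and ranking windows by distance. A plain-L2 sketch (ROSER's learned task-agnostic metric is replaced here by Euclidean distance purely for illustration):

```python
import numpy as np

def retrieve_segments(log_emb, ref_emb, top_k=3):
    """Rank every window of an unlabeled log by distance to a reference
    segment and return the start indices of the top_k closest windows.

    log_emb : (N, d) per-step embeddings of the continuous log.
    ref_emb : (L, d) embeddings of one reference demonstration.
    """
    L = len(ref_emb)
    scores = []
    for s in range(len(log_emb) - L + 1):
        window = log_emb[s:s + L]
        scores.append((float(np.linalg.norm(window - ref_emb)), s))
    scores.sort()                       # smallest distance first
    return [s for _, s in scores[:top_k]]
```

With 3-5 references one would aggregate scores across references; the sub-millisecond matching the abstract reports would additionally require vectorized or indexed search rather than this explicit loop.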
cs.RO / 60 / 2603.01477
SFCo-Nav: Efficient Zero-Shot Visual Language Navigation via Collaboration of Slow LLM and Fast Attributed Graph Alignment
SFCo-Nav:通过慢速大语言模型与快速属性图对齐的协作实现高效零样本视觉语言导航
Abstract
Recent advances in large vision-language models (VLMs) and large language models (LLMs) have enabled zero-shot approaches to visual language navigation (VLN), where an agent follows natural language instructions using only ego perception and reasoning. However, existing zero-shot methods typically construct a naive observation graph and perform per-step VLM-LLM inference on it, resulting in high latency and computation costs that limit real-time deployment. To address this, we present SFCo-Nav, an efficient zero-shot VLN framework inspired by the principle of slow-fast cognitive collaboration. SFCo-Nav integrates three key modules: 1) a slow LLM-based planner that produces a strategic chain of subgoals, each linked to an imagined object graph; 2) a fast reactive navigator for real-time object graph construction and subgoal execution; and 3) a lightweight asynchronous slow-fast bridge that aligns structured, attributed imagined and perceived graphs to estimate navigation confidence, triggering the slow LLM planner only when necessary. To the best of our knowledge, SFCo-Nav is the first slow-fast collaboration zero-shot VLN system supporting asynchronous LLM triggering according to the internal confidence. Evaluated on the public R2R and REVERIE benchmarks, SFCo-Nav matches or exceeds prior state-of-the-art zero-shot VLN success rates while cutting total token consumption per trajectory by over 50% and running more than 3.5 times faster. Finally, we demonstrate SFCo-Nav on a legged robot in a hotel suite, showcasing its efficiency and practicality in indoor environments.
Chinese Translation
最近,大型视觉语言模型(VLMs)和大型语言模型(LLMs)的进展使得零样本视觉语言导航(VLN)方法成为可能,其中代理仅依靠自我感知和推理来遵循自然语言指令。然而,现有的零样本方法通常构建一个简单的观察图,并在其上进行逐步的VLM-LLM推理,导致高延迟和计算成本,从而限制了实时部署。为了解决这个问题,我们提出了SFCo-Nav,一个高效的零样本VLN框架,灵感来自于慢速与快速认知协作的原则。SFCo-Nav整合了三个关键模块:1)基于慢速LLM的规划器,生成一系列战略性子目标,每个子目标与一个想象中的对象图相关联;2)快速反应导航器,用于实时对象图构建和子目标执行;3)轻量级异步慢速-快速桥接器,将结构化、属性化的想象图和感知图对齐,以估计导航置信度,仅在必要时触发慢速LLM规划器。据我们所知,SFCo-Nav是第一个支持根据内部置信度异步触发LLM的慢速-快速协作零样本VLN系统。在公共R2R和REVERIE基准上评估时,SFCo-Nav的成功率与之前最先进的零样本VLN相匹配或超过,同时每条轨迹的总令牌消耗减少了50%以上,运行速度提高了3.5倍以上。最后,我们在酒店套房中展示了SFCo-Nav在足式机器人上的应用,展示了其在室内环境中的高效性和实用性。
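The asynchronous triggering logic is the part that cuts token cost: the fast navigator acts every step, while the slow planner fires only when confidence drops below a threshold. A minimal control-loop sketch where the planner and navigator are plain callbacks and the 0.6 default is an illustrative assumption:

```python
def run_episode(confidences, fast_step, slow_plan, threshold=0.6):
    """Slow-fast collaboration sketch: invoke the expensive slow planner
    only when graph-alignment confidence is below `threshold`; the cheap
    fast navigator runs every step. Returns the slow-trigger count."""
    triggers = 0
    for c in confidences:
        if c < threshold:
            slow_plan()            # expensive LLM replan, rare by design
            triggers += 1
        fast_step()                # cheap reactive action, every step
    return triggers
```

Because LLM calls scale with the trigger count rather than the step count, per-trajectory token consumption drops roughly in proportion to how often confidence stays above the threshold.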
cs.RO / 61 / 2603.01479
Multimodal Adversarial Quality Policy for Safe Grasping
多模态对抗质量策略用于安全抓取
Abstract
Vision-guided robot grasping based on Deep Neural Networks (DNNs) generalizes well but poses safety risks in Human-Robot Interaction (HRI). Recent works have addressed this by designing benign adversarial attacks and patches in the RGB modality, yet their depth-independent characteristics limit effectiveness in the RGBD modality. In this work, we propose the Multimodal Adversarial Quality Policy (MAQP) to realize multimodal safe grasping. Our framework introduces two key components. First, the Heterogeneous Dual-Patch Optimization Scheme (HDPOS) mitigates the distribution discrepancy between RGB and depth modalities in patch generation by adopting modality-specific initialization strategies, employing a Gaussian distribution for depth patches and a uniform distribution for RGB patches, while jointly optimizing both modalities under a unified objective function. Second, the Gradient-Level Modality Balancing Strategy (GLMBS) is designed to resolve the optimization imbalance from RGB and depth patches in patch shape adaptation by reweighting gradient contributions based on per-channel sensitivity analysis and applying distance-adaptive perturbation bounds. We conduct extensive experiments on benchmark datasets and a cobot, showing the effectiveness of MAQP.
Chinese Translation
基于深度神经网络(DNN)的视觉引导机器人抓取具有良好的泛化能力,但在人与机器人交互(HRI)中存在安全风险。近期的研究通过设计良性对抗攻击和RGB模态的补丁来解决这一问题,但深度无关特性限制了其在RGBD模态下的有效性。在本研究中,我们提出了多模态对抗质量策略(MAQP)以实现多模态安全抓取。我们的框架引入了两个关键组件。首先,异构双补丁优化方案(HDPOS)通过采用模态特定的初始化策略,缓解了RGB和深度模态在补丁生成中的分布差异,使用高斯分布生成深度补丁,使用均匀分布生成RGB补丁,同时在统一目标函数下共同优化这两种模态。其次,梯度级模态平衡策略(GLMBS)旨在通过基于每通道敏感性分析重新加权梯度贡献并应用距离自适应扰动界限,解决RGB和深度补丁在补丁形状适应中的优化不平衡问题。我们在基准数据集和协作机器人上进行了广泛的实验,展示了MAQP的有效性。
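The modality-specific initialization in HDPOS is straightforward to sketch: depth patches start from a Gaussian and RGB patches from a uniform distribution, as the abstract states. The patch shape and the concrete distribution parameters below are illustrative assumptions:

```python
import numpy as np

def init_patches(shape, rng=None):
    """Heterogeneous dual-patch initialization sketch (HDPOS-style):
    uniform init for the RGB patch, Gaussian init for the depth patch.
    The [0, 1) range and N(0.5, 0.1) parameters are assumptions, not
    the paper's exact values."""
    rng = rng if rng is not None else np.random.default_rng(0)
    rgb_patch = rng.uniform(0.0, 1.0, size=(*shape, 3))   # uniform, per channel
    depth_patch = rng.normal(0.5, 0.1, size=shape)        # Gaussian, single channel
    return rgb_patch, depth_patch
```

Both patches would then be optimized jointly under one objective; the differing initial distributions compensate for the statistical mismatch between color and depth inputs.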
cs.RO / 62 / 2603.01480
Towards Robot Skill Learning and Adaptation with Gaussian Processes
基于高斯过程的机器人技能学习与适应
Abstract
General robot skill adaptation requires expressive representations robust to varying task configurations. While recent learning-based skill adaptation methods, refined via Reinforcement Learning (RL), have shown success, existing skill models often lack sufficient representational capacity for anything beyond minor environmental changes. In contrast, Gaussian Process (GP)-based skill modelling provides an expressive representation with useful analytical properties; however, adaptation of GP-based skills remains underexplored. This paper proposes a novel, robust skill adaptation framework that utilises GPs with sparse via-points for compact and expressive modelling. The model considers the trajectory's poses and leverages its first and second analytical derivatives to preserve the skill's kinematic profile. We present three adaptation methods to cater for the variability between initial and observed configurations. Firstly, an optimisation agent that adjusts the path's via-points while preserving the demonstration velocity. Second, a behaviour cloning agent trained to replicate output trajectories from the optimisation agent. Lastly, an RL agent that has learnt to modify via-points whilst maintaining the kinematic profile and enabling online capabilities. Evaluated across three tasks (drawer opening, cube-pushing and bar manipulation) in both simulation and hardware, our proposed methods outperform every benchmark in success rates. Furthermore, the results demonstrate that the GP-based representation enables all three methods to attain high cosine similarity and low velocity magnitude errors, indicating strong preservation of the kinematic profile. Overall, our formulation provides a compact representation capable of adapting to large deviations from a single demonstrated skill.
Chinese Translation
一般的机器人技能适应需要对不同任务配置具有鲁棒性的、富有表现力的表示。尽管近期通过强化学习(Reinforcement Learning, RL)微调的基于学习的技能适应方法取得了成功,但现有的技能模型通常缺乏足够的表示能力,无法应对超出轻微环境变化的情况。相比之下,基于高斯过程(Gaussian Process, GP)的技能建模提供了一种具有有用分析特性的表达方式;然而,基于GP的技能适应仍然未得到充分探索。本文提出了一种新颖的、鲁棒的技能适应框架,该框架利用具有稀疏途径点(via-points)的高斯过程进行紧凑且富有表现力的建模。该模型考虑了轨迹的姿态,并利用其一阶和二阶分析导数来保持技能的运动学特征。我们提出了三种适应方法,以应对初始配置与观察到的配置之间的变异性。首先,一个优化代理调整路径的途径点,同时保持演示速度。其次,一个行为克隆代理被训练以复制优化代理的输出轨迹。最后,一个RL代理学习到在保持运动学特征的同时修改途径点,并具备在线能力。在三个任务(抽屉打开、立方体推动和杆件操作)的仿真和硬件评估中,我们提出的方法在成功率上超越了所有基准。此外,结果表明,基于GP的表示使得所有三种方法都能达到高余弦相似度和低速度幅度误差,表明运动学特征得到了良好的保持。总体而言,我们的公式提供了一种紧凑的表示,能够适应单一演示技能的大幅偏差。
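The compactness of a via-point GP is easy to see in code: a handful of via-points fully determines a smooth trajectory through standard GP posterior-mean regression. A 1-D sketch with a squared-exponential kernel (hyperparameters are illustrative assumptions; the paper additionally exploits the GP's analytical first and second derivatives, omitted here):

```python
import numpy as np

def gp_trajectory(via_t, via_x, query_t, length=0.3, noise=1e-6):
    """Posterior mean of a squared-exponential GP fitted to sparse
    via-points, evaluated at `query_t`. Moving a via-point reshapes the
    whole trajectory smoothly -- the property the adaptation agents
    exploit."""
    def k(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)
    K = k(via_t, via_t) + noise * np.eye(len(via_t))  # jitter for stability
    alpha = np.linalg.solve(K, via_x)                 # kernel weights
    return k(query_t, via_t) @ alpha
```

Because the representation is just the via-points (plus fixed hyperparameters), an optimisation or RL agent adapts the skill by moving a few numbers rather than re-learning a dense trajectory.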
cs.RO / 63 / 2603.01505
FATE: Closed-Loop Feasibility-Aware Task Generation with Active Repair for Physically Grounded Robotic Curricula
FATE:具有主动修复的闭环可行性意识任务生成用于物理基础的机器人课程
Abstract
Recent breakthroughs in generative simulation have harnessed Large Language Models (LLMs) to generate diverse robotic task curricula, yet these open-loop paradigms frequently produce linguistically coherent but physically infeasible goals, stemming from ungrounded task specifications or misaligned objective formulations. To address this critical limitation, we propose FATE (Feasibility-Aware Task gEneration), a closed-loop, self-correcting framework that reimagines task generation as an iterative validation-and-refinement process. Unlike conventional methods that decouple generation and verification into discrete stages, FATE embeds a generalist embodied agent directly into the generation loop to proactively guarantee the physical groundedness of the resulting curriculum. FATE instantiates a sequential auditing pipeline: it first validates static scene attributes (e.g., object affordances, layout compatibility) and subsequently verifies execution feasibility via simulated embodied interaction. Critical to its performance, upon detecting an infeasible task, FATE deploys an active repair module that autonomously adapts scene configurations or policy specifications, converting unworkable proposals into physically valid task instances. Extensive experiments validate that FATE generates semantically diverse, physically grounded task curricula while achieving a substantial reduction in execution failure rates relative to state-of-the-art generative baselines.
Chinese Translation
最近在生成模拟方面的突破利用大型语言模型(LLMs)生成多样化的机器人任务课程,但这些开环范式常常产生语言上连贯但在物理上不可行的目标,这源于未扎根的任务规范或目标表述不一致。为了解决这一关键限制,我们提出了FATE(可行性意识任务生成),这是一个闭环自我修正框架,将任务生成重新构想为一个迭代的验证与改进过程。与将生成和验证分解为离散阶段的传统方法不同,FATE将一个通用的具身代理直接嵌入生成循环中,以主动保证生成课程的物理基础性。FATE实现了一个顺序审计管道:它首先验证静态场景属性(例如,物体可供性、布局兼容性),然后通过模拟的具身交互验证执行的可行性。其性能的关键在于,当检测到不可行的任务时,FATE部署一个主动修复模块,自动调整场景配置或策略规范,将不可行的提案转换为物理有效的任务实例。大量实验验证了FATE生成语义多样、物理基础的任务课程,同时在执行失败率方面相较于最先进的生成基线实现了显著降低。
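The closed-loop validate-and-repair process reduces to a small control structure: propose a task, audit it, and repair instead of discarding. A minimal sketch where `propose`, `validate`, and `repair` are caller-supplied callbacks standing in for the LLM generator, the embodied-agent audit, and the active repair module (all names are illustrative):

```python
def generate_feasible_task(propose, validate, repair, max_iters=5):
    """FATE-style closed-loop task generation sketch: iterate
    validation and active repair until a physically feasible task is
    found or the iteration budget is exhausted."""
    task = propose()
    for _ in range(max_iters):
        if validate(task):
            return task              # passed the feasibility audit
        task = repair(task)          # adapt scene/policy, then re-audit
    return None                      # budget exhausted, no valid task
```

The contrast with open-loop pipelines is that an infeasible proposal re-enters the loop as input to `repair`, so generation effort is recycled rather than wasted.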
cs.RO / 64 / 2603.01560
(hu)Man vs. Machine: In the Future of Motorsport, can Autonomous Vehicles Compete?
人类与机器:在未来的赛车运动中,自动驾驶车辆能否竞争?
Abstract
Motorsport has historically driven technological innovation in the automotive industry. Autonomous racing provides a proving ground to push the limits of performance of autonomous vehicle (AV) systems. In principle, AVs could be at least as fast, if not faster, than humans. However, human-driven racing has thus far provided broader audience appeal and is more strategically challenging. Both provide opportunities to push each other even further technologically, yet competitions remain separate. This paper evaluates whether the future of motorsport could encompass joint competition between humans and AVs. Analysis of the current state of the art, as well as recent competition outcomes, shows that while technical performance has reached comparable levels, there are substantial challenges in racecraft, strategy and safety that need to be overcome. Outstanding issues in mixed human-AI racing are explored, ranging from an initial assessment of critical factors such as system-level latencies to effective planning and risk guarantees. The crucial non-technical aspect of audience engagement and appeal regarding the changing character of motorsport is addressed. In the wider context of motorsport and AVs, this work outlines a proposed agenda for future research to 'keep pushing the possible', in the true spirit of motorsport.
Chinese Translation
赛车运动历史上推动了汽车工业的技术创新。自动驾驶赛车提供了一个测试平台,以推动自动驾驶车辆(AV)系统的性能极限。原则上,自动驾驶车辆的速度可以至少与人类相当,甚至更快。然而,迄今为止,人类驾驶的赛车更具广泛的观众吸引力,并且在战略上更具挑战性。两者都为彼此在技术上进一步推动提供了机会,但竞争依然是分开的。本文评估了未来的赛车运动是否可以涵盖人类与自动驾驶车辆的联合竞争。对当前技术水平的分析以及最近比赛结果的评估表明,尽管技术性能已达到可比水平,但在赛车技术、战略和安全方面仍存在重大挑战需要克服。探讨了混合人类与人工智能赛车中涉及的突出问题,从对系统级延迟等关键因素的初步评估,到有效的规划和风险保障。本文还讨论了与赛车运动变化特征相关的观众参与和吸引力这一重要非技术性方面。在赛车运动和自动驾驶车辆的更广泛背景下,本研究概述了未来研究的建议议程,以“持续推动可能性”,体现赛车运动的真正精神。
cs.RO / 65 / 2603.01581
KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models
KERV:用于具身视觉-语言-动作模型的运动学校正推测解码
Abstract
Vision-Language-Action (VLA) models build a token-domain robot control paradigm, yet suffer from low speed. Speculative Decoding (SD) is an optimization strategy that can boost inference speed. Two key issues emerge when integrating VLA and SD: first, SD relies on re-inference to address token errors, which is computationally expensive; second, to mitigate token errors, the acceptance threshold in SD requires careful adjustment. Existing works fail to address the above two issues effectively. Meanwhile, as the bridge between AI and the physical world, existing embodied intelligence has overlooked the application of robotic kinematics. To address these issues, we innovatively combine token-domain VLA models with kinematic-domain prediction for SD, proposing a kinematic-rectified SD framework named KERV. We employ a kinematics-based Kalman Filter to predict actions and compensate for SD errors, avoiding costly re-inference. Moreover, we design a kinematics-based adjustment strategy to dynamically rectify the acceptance threshold, addressing the difficulty of threshold determination. Experimental results across diverse tasks and environments demonstrate that KERV achieves 27%~37% acceleration with nearly no Success Rate loss.
Chinese Translation
视觉-语言-动作(VLA)模型构建了一种基于标记的机器人控制范式,但速度较慢。推测解码(SD)是一种可以提升推理速度的优化策略。在将VLA与SD结合时出现两个关键问题:首先,SD依赖于重新推理来解决标记错误,这在计算上是昂贵的;其次,为了减轻标记错误,SD中的接受阈值需要仔细调整。现有研究未能有效解决上述两个问题。同时,作为人工智能与物理世界之间的桥梁,现有的具身智能忽视了机器人运动学的应用。为了解决这些问题,我们创新性地将基于标记的VLA模型与运动学领域的预测相结合,提出了一种名为KERV的运动学校正SD框架。我们采用基于运动学的卡尔曼滤波器来预测动作并补偿SD错误,从而避免昂贵的重新推理。此外,我们设计了一种基于运动学的调整策略,以动态校正接受阈值,解决阈值确定的难题。跨多种任务和环境的实验结果表明,KERV在几乎没有成功率损失的情况下实现了27%~37%的加速。
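The rectification idea can be sketched with a simple kinematic predictor: each draft action from speculative decoding is compared against a motion prediction, and outliers are replaced by the prediction instead of triggering re-inference. Below, a constant-velocity predictor stands in for KERV's Kalman filter, and the tolerance plays the role of the (here static) acceptance threshold; all values are illustrative assumptions:

```python
def kf_rectify(draft, x, v, tol=0.5):
    """Kinematic rectification sketch for speculatively decoded actions.

    draft : list of 1-D draft actions from the fast decoder.
    x, v  : current position and velocity estimates.
    tol   : acceptance threshold on deviation from the prediction.
    """
    out = []
    for a in draft:
        pred = x + v                              # kinematic prediction
        a = a if abs(a - pred) <= tol else pred   # rectify, don't re-infer
        v = a - x                                 # update velocity estimate
        x = a
        out.append(x)
    return out
```

The key property is that a rejected token costs one prediction rather than one full VLA forward pass, which is where the reported 27%-37% acceleration would come from.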
cs.RO / 66 / 2603.01631
Learning Thermal-Aware Locomotion Policies for an Electrically-Actuated Quadruped Robot
学习电动四足机器人的热感知运动策略
Abstract
Electrically-actuated quadrupedal robots possess high mobility on complex terrains, but their motors tend to accumulate heat under high-torque cyclic loads, potentially triggering overheat protection and limiting long-duration tasks. This work proposes a thermal-aware control method that incorporates motor temperatures into reinforcement learning locomotion policies and introduces thermal-constraint rewards to prevent temperature exceedance. Real-world experiments on the Unitree A1 demonstrate that, under a fixed 3 kg payload, the baseline policy triggers overheat protection and stops within approximately 7 minutes, whereas the proposed method can operate continuously for over 27 minutes without thermal interruptions while maintaining comparable command-tracking performance, thereby enhancing sustainable operational capability.
Chinese Translation
电动四足机器人在复杂地形上具有高机动性,但其电机在高扭矩循环负载下容易积累热量,可能触发过热保护并限制长时间任务的执行。本研究提出了一种热感知控制方法,将电机温度纳入强化学习运动策略,并引入热约束奖励以防止温度超标。在Unitree A1的实际实验中,固定3 kg负载下的基线策略在大约7分钟内触发过热保护并停止运行,而所提出的方法在不发生热中断的情况下能够连续运行超过27分钟,同时保持可比的指令跟踪性能,从而增强了可持续操作能力。
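A thermal-constraint reward term can be sketched as a penalty that grows as any motor approaches the protection limit, so the policy learns to trade instantaneous torque for sustained operation. The limit, margin, and quadratic shape below are illustrative assumptions, not the paper's exact reward:

```python
def thermal_reward(temps, t_max=90.0, margin=10.0, weight=1.0):
    """Thermal-constraint reward sketch: zero penalty while all motor
    temperatures stay below (t_max - margin), growing quadratically as
    any motor nears the protection limit t_max."""
    penalty = 0.0
    for t in temps:
        overshoot = max(0.0, t - (t_max - margin))   # active only near limit
        penalty += (overshoot / margin) ** 2
    return -weight * penalty
```

Added to the usual command-tracking reward, this term is inactive in normal operation and only shapes behavior when heat actually accumulates, which is consistent with the comparable tracking performance reported above.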
cs.RO / 67 / 2603.01673
B$^2$F-Map: Crowd-sourced Mapping with Bayesian B-spline Fusion
B$^2$F-Map:基于众包的贝叶斯B样条融合地图构建
Abstract
Crowd-sourced mapping offers a scalable alternative to creating maps using traditional survey vehicles. Yet, existing methods either rely on prior high-definition (HD) maps or neglect uncertainties in the map fusion. In this work, we present a complete pipeline for HD map generation using production vehicles equipped only with a monocular camera, consumer-grade GNSS, and IMU. Our approach includes on-cloud localization using lightweight standard-definition maps, on-vehicle mapping via an extended object trajectory (EOT) Poisson multi-Bernoulli (PMB) filter with Gibbs sampling, and on-cloud multi-drive optimization and Bayesian map fusion. We represent the lane lines using B-splines, where each B-spline is parameterized by a sequence of Gaussian distributed control points, and propose a novel Bayesian fusion framework for B-spline trajectories with differing density representation, enabling principled handling of uncertainties. We evaluate our proposed approach, B$^2$F-Map, on large-scale real-world datasets collected across diverse driving conditions and demonstrate that our method is able to produce geometrically consistent lane-level maps.
Chinese Translation
众包地图构建为使用传统测量车辆创建地图提供了一种可扩展的替代方案。然而,现有方法要么依赖于先验的高清(HD)地图,要么忽视了地图融合中的不确定性。在本研究中,我们提出了一种完整的高清地图生成流程,该流程仅使用配备单目相机、消费级GNSS和IMU的量产车辆。我们的方法包括使用轻量级标准清晰度地图进行云端定位,通过扩展物体轨迹(EOT)泊松多伯努利(PMB)滤波器结合吉布斯采样进行车载地图构建,以及云端多行程优化和贝叶斯地图融合。我们使用B样条表示车道线,其中每个B样条由一系列高斯分布的控制点参数化,并提出了一种新颖的贝叶斯融合框架,用于处理具有不同密度表示的B样条轨迹,从而能够原则性地处理不确定性。我们在跨越多种驾驶条件的大规模真实世界数据集上评估了我们提出的方法B$^2$F-Map,并证明我们的方法能够生成几何一致的车道级地图。
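Since each B-spline control point carries a Gaussian distribution, the elementary fusion step across drives is the standard product-of-Gaussians update. A per-coordinate sketch with diagonal covariance (a minimal piece of the framework; handling B-splines with differing knot densities, which the paper addresses, is omitted):

```python
def fuse_control_points(mu1, var1, mu2, var2):
    """Bayesian fusion of two Gaussian estimates of the same B-spline
    control point: precision-weighted mean and combined variance."""
    var = 1.0 / (1.0 / var1 + 1.0 / var2)    # fused variance (precisions add)
    mu = var * (mu1 / var1 + mu2 / var2)     # precision-weighted mean
    return mu, var
```

The fused variance is always smaller than either input variance, which is how repeated drives over the same lane line tighten the map instead of merely averaging it.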
cs.RO / 68 / 2603.01700
TacMamba: A Tactile History Compression Adapter Bridging Fast Reflexes and Slow VLA Reasoning
TacMamba:一种触觉历史压缩适配器,连接快速反应与缓慢的视觉-语言-动作推理
Abstract
In visually ambiguous manipulation, such as detecting a button click, tactile feedback is often the sole source of ground truth. However, fusing tactile data poses a significant challenge due to a spatiotemporal mismatch: tactile perception requires high-frequency processing with long-horizon memory (System 1), whereas visual policies operate at low control frequencies (System 2). Existing architectures struggle to bridge this gap: Transformers are computationally prohibitive for high-frequency loops (>100Hz), while LSTMs suffer from forgetting over extended interaction histories. In this paper, we introduce TacMamba, a hierarchical architecture that aligns high-bandwidth tactile reflexes with low-frequency visual planning. Our approach comprises three core contributions: (1) a custom high-frequency tactile interface designed for flexible integration; (2) a Mamba-based Tactile History Compressor that encodes continuous force history into a compact state with O(1) inference latency (0.45 ms), enabling plug-and-play fusion with VLA models without joint pre-training and (3) a Tactile-Guided Dual-Stage Training strategy that leverages temporal discrimination for self-supervised representation learning and phase-uniform sampling to mitigate data sparsity. Experiments on discrete counting and implicit state switching demonstrate that TacMamba achieves 100% success rates, significantly outperforming the visual-only pi_0.5 baseline, while strictly satisfying hard real-time constraints.
Chinese Translation
在视觉模糊的操作中,例如检测按钮点击,触觉反馈往往是唯一的真实依据。然而,融合触觉数据面临着显著的挑战,因为存在时空不匹配:触觉感知需要高频处理和长时间记忆(系统1),而视觉策略则在低控制频率下运行(系统2)。现有架构难以弥合这一差距:变换器(Transformers)在高频循环(>100Hz)中计算开销过大,而长短期记忆网络(LSTMs)在处理较长交互历史时容易遗忘。本文提出了TacMamba,一种层次化架构,将高带宽的触觉反应与低频视觉规划对齐。我们的方法包括三个核心贡献:(1)一个定制的高频触觉接口,旨在灵活集成;(2)一个基于Mamba的触觉历史压缩器,将连续的力历史编码为紧凑状态,具有O(1)推理延迟(0.45毫秒),使其能够与视觉-语言-动作(VLA)模型无缝融合,而无需联合预训练;(3)一种触觉引导的双阶段训练策略,利用时间区分进行自我监督表示学习,并通过相位均匀采样来缓解数据稀疏性。在离散计数和隐式状态切换的实验中,TacMamba实现了100%的成功率,显著优于仅使用视觉的基线pi_0.5,同时严格满足硬实时约束。
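The O(1)-per-step property of a recurrent history compressor can be illustrated with the simplest possible fixed-size state: an exponential-decay accumulator standing in for TacMamba's Mamba-based compressor (the decay rate and scalar state are illustrative assumptions; the real model learns a much richer state):

```python
import numpy as np

def compress_step(state, force, decay=0.95):
    """Fold one high-frequency force sample into a fixed-size recurrent
    state. Per-step cost is constant regardless of how long the history
    is -- the property that makes >100 Hz loops feasible."""
    return decay * state + (1.0 - decay) * np.asarray(force, dtype=float)
```

The low-frequency visual policy can then read this compact state at its own rate, which is the alignment between System 1 and System 2 described above.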
cs.RO / 69 / 2603.01705
A Safety-Aware Shared Autonomy Framework with BarrierIK Using Control Barrier Functions
BarrierIK:一种基于控制障碍函数的安全感知共享自主框架
Abstract
Shared autonomy blends operator intent with autonomous assistance. In cluttered environments, linear blending can produce unsafe commands even when each source is individually collision-free. Many existing approaches model obstacle avoidance through potentials or cost terms, which only enforce safety as a soft constraint. In contrast, safety-critical control requires hard guarantees. We investigate the use of control barrier functions (CBFs) at the inverse kinematics (IK) layer of shared autonomy, targeting post-blend safety while preserving task performance. Our approach is evaluated in simulation on representative cluttered environments and in a VR teleoperation study comparing pure teleoperation with shared autonomy. Across conditions, employing CBFs at the IK layer reduces violation time and increases minimum clearance while maintaining task performance. In the user study, participants reported higher perceived safety and trust, lower interference, and an overall preference for shared autonomy with our safety filter. Additional materials available at https://berkguler.github.io/barrierik.
Chinese Translation
共享自主将操作员意图与自主辅助相结合。在复杂环境中,线性混合可能会产生不安全的指令,即使每个来源在单独情况下都是无碰撞的。许多现有方法通过势能或成本项来建模障碍物避免,这仅将安全性作为软约束。相比之下,安全关键控制需要硬性保证。我们研究了在共享自主的逆运动学(IK)层中使用控制障碍函数(CBFs),旨在在保持任务性能的同时实现混合后的安全性。我们在代表性的复杂环境中进行了仿真评估,并在虚拟现实远程操作研究中比较了纯远程操作与共享自主。在各种条件下,在IK层使用CBFs减少了违规时间并增加了最小间隙,同时保持了任务性能。在用户研究中,参与者报告了更高的感知安全性和信任度,更低的干扰,以及对我们安全过滤器的共享自主的整体偏好。更多材料可在 https://berkguler.github.io/barrierik 获取。
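The core idea of a post-blend CBF safety filter, minimally correcting a blended command so the barrier condition still holds, has a closed-form solution in the single-constraint case. The sketch below is a simplified stand-in for the paper's IK-layer filter; the obstacle geometry and gain are illustrative.

```python
import numpy as np

def cbf_filter(u_des, grad_h, h, alpha=1.0):
    """Minimal CBF safety filter: project a desired (blended) command onto
    the half-space  grad_h . u + alpha * h >= 0.

    This is the closed-form solution of the single-constraint QP
        min ||u - u_des||^2   s.t.   grad_h . u >= -alpha * h,
    a simplified stand-in for a full IK-layer implementation.
    """
    margin = grad_h @ u_des + alpha * h
    if margin >= 0.0:
        return u_des                    # already safe: pass through unchanged
    # minimally correct the command onto the constraint boundary
    return u_des - (margin / (grad_h @ grad_h)) * grad_h

# Toy example: h(x) = clearance along +x; the blended command drives into it.
u_unsafe = np.array([-2.0, 0.0])        # commanded velocity toward the obstacle
grad_h = np.array([1.0, 0.0])           # safety (clearance) increases along +x
u_safe = cbf_filter(u_unsafe, grad_h, h=0.5)
print(u_safe)  # [-0.5  0. ] -- speed clipped so clearance cannot decay too fast
```

Note that the filter acts after blending, which is exactly where the abstract locates the hazard: each source can be individually safe while their blend is not.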
cs.RO / 70 / 2603.01751
Shape-Interpretable Visual Self-Modeling Enables Geometry-Aware Continuum Robot Control
形状可解释的视觉自我建模实现几何感知的连续机器人控制
Abstract
Continuum robots possess high flexibility and redundancy, making them well suited for safe interaction in complex environments, yet their continuous deformation and nonlinear dynamics pose fundamental challenges to perception, modeling, and control. Existing vision-based control approaches often rely on end-to-end learning, achieving shape regulation without explicit awareness of robot geometry or its interaction with the environment. Here, we introduce a shape-interpretable visual self-modeling framework for continuum robots that enables geometry-aware control. Robot shapes are encoded from multi-view planar images using a Bezier-curve representation, transforming visual observations into a compact and physically meaningful shape space that uniquely characterizes the robot's three-dimensional configuration. Based on this representation, neural ordinary differential equations are employed to self-model both shape and end-effector dynamics directly from data, enabling hybrid shape-position control without analytical models or dense body markers. The explicit geometric structure of the learned shape space allows the robot to reason about its body and surroundings, supporting environment-aware behaviors such as obstacle avoidance and self-motion while maintaining end-effector objectives. Experiments on a cable-driven continuum robot demonstrate accurate shape-position regulation and tracking, with shape errors within 1.56% of image resolution and end-effector errors within 2% of robot length, as well as robust performance in constrained environments. By elevating visual shape representations from two-dimensional observations to an interpretable three-dimensional self-model, this work establishes a principled alternative to vision-based end-to-end control and advances autonomous, geometry-aware manipulation for continuum robots.
Chinese Translation
连续机器人具有高度的灵活性和冗余性,使其非常适合在复杂环境中进行安全交互,但其连续变形和非线性动力学对感知、建模和控制提出了基本挑战。现有的基于视觉的控制方法通常依赖于端到端学习,实现形状调节而不明确意识到机器人几何形状或其与环境的交互。在此,我们提出了一种形状可解释的视觉自我建模框架,旨在实现连续机器人的几何感知控制。机器人形状通过多视角平面图像使用贝塞尔曲线表示进行编码,将视觉观测转化为一个紧凑且具有物理意义的形状空间,独特地表征机器人的三维配置。基于这一表示,神经常微分方程被用来直接从数据中自我建模形状和末端执行器的动力学,实现混合形状-位置控制,而无需解析模型或密集的身体标记。学习到的形状空间的显式几何结构使机器人能够推理其身体及周围环境,支持环境感知行为,如避障和自我运动,同时保持末端执行器的目标。在一台由电缆驱动的连续机器人上进行的实验表明,形状-位置调节和跟踪准确,形状误差在图像分辨率的1.56%以内,末端执行器误差在机器人长度的2%以内,并且在受限环境中表现出稳健的性能。通过将视觉形状表示从二维观测提升到可解释的三维自我建模,本研究为基于视觉的端到端控制建立了一个原则性的替代方案,并推动了连续机器人的自主几何感知操控。
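The compact shape space described above rests on representing the robot's backbone as a Bezier curve over a few 3D control points. A minimal evaluation routine looks as follows; the control-point values are illustrative, not taken from the paper.

```python
import numpy as np
from math import comb

def bezier(control_points, ts):
    """Evaluate a 3D Bezier curve at parameters ts in [0, 1].

    A handful of 3D control points summarizes a continuum robot's
    backbone, giving a compact, physically meaningful shape space.
    """
    P = np.asarray(control_points, dtype=float)   # (n+1, 3)
    n = len(P) - 1
    ts = np.asarray(ts, dtype=float)
    # Bernstein basis: B_{i,n}(t) = C(n,i) * t^i * (1-t)^(n-i)
    basis = np.stack([comb(n, i) * ts**i * (1 - ts)**(n - i) for i in range(n + 1)])
    return basis.T @ P                            # (len(ts), 3) points on the curve

ctrl = [[0, 0, 0], [0, 0.2, 0.4], [0.1, 0.5, 0.7], [0.3, 0.6, 1.0]]
pts = bezier(ctrl, np.linspace(0, 1, 5))
print(pts[0], pts[-1])  # endpoints coincide with first/last control points
```

With the shape reduced to the control points, a dynamics model (neural ODE in the paper) only needs to predict how this low-dimensional vector evolves, rather than a dense marker field.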
cs.RO / 71 / 2603.01766
Neural Implicit Action Fields: From Discrete Waypoints to Continuous Functions for Vision-Language-Action Models
神经隐式动作场:从离散路径点到视觉-语言-动作模型的连续函数
Abstract
Despite the rapid progress of Vision-Language-Action (VLA) models, the prevailing paradigm of predicting discrete waypoints remains fundamentally misaligned with the intrinsic continuity of physical motion. This discretization imposes rigid sampling rates, lacks high-order differentiability, and introduces quantization artifacts that hinder precise, compliant interaction. We propose Neural Implicit Action Fields (NIAF), a paradigm shift that reformulates action prediction from discrete waypoints to continuous action function regression. By utilizing an MLLM as a hierarchical spectral modulator over a learnable motion prior, NIAF synthesizes infinite-resolution trajectories as continuous-time manifolds. This formulation enables analytical differentiability, allowing for explicit supervision of velocity, acceleration, and jerk to ensure mathematical consistency and physical plausibility. Our approach achieves state-of-the-art results on CALVIN and LIBERO benchmarks across diverse backbones. Furthermore, real-world experiments demonstrate that NIAF enables stable impedance control, bridging the gap between high-level semantic understanding and low-level dynamic execution.
Chinese Translation
尽管视觉-语言-动作(VLA)模型取得了快速进展,但预测离散路径点的主流范式与物理运动的内在连续性根本不一致。这种离散化施加了严格的采样率,缺乏高阶可微性,并引入量化伪影,阻碍了精确、柔顺的交互。我们提出了神经隐式动作场(Neural Implicit Action Fields, NIAF),这一范式转变将动作预测从离散路径点重构为连续动作函数回归。通过利用多模态大型语言模型(MLLM)作为可学习运动先验的层次谱调制器,NIAF合成无限分辨率的轨迹作为连续时间流形。这一公式化实现了解析可微性,允许对速度、加速度和加加速度(jerk)进行显式监督,以确保数学一致性和物理合理性。我们的方法在CALVIN和LIBERO基准测试中取得了最先进的结果,适用于多种骨干网络。此外,现实世界的实验表明,NIAF能够实现稳定的阻抗控制,弥合高层语义理解与低层动态执行之间的差距。
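The analytical-differentiability argument can be made concrete with any closed-form trajectory: derivatives for velocity, acceleration, and jerk come out exactly, with no finite differencing over waypoints. The polynomial below is only an illustration of this property; NIAF itself predicts such continuous functions with a network.

```python
import numpy as np

def poly_traj(coeffs):
    """A continuous-time trajectory q(t) as a polynomial, with analytical
    velocity, acceleration, and jerk. Illustrates the closed-form
    differentiability that a continuous action function affords, in
    contrast to finite differences over discrete waypoints.
    """
    pos = np.polynomial.Polynomial(np.asarray(coeffs, dtype=float))
    return pos, pos.deriv(1), pos.deriv(2), pos.deriv(3)

# q(t) = 1 + 2t + 3t^2 + 4t^3
pos, vel, acc, jerk = poly_traj([1, 2, 3, 4])
t = 0.5
print(pos(t), vel(t), acc(t), jerk(t))
# 3.25, 8.0, 18.0, 24.0 -- exact derivatives at any t, any sampling rate
```

Because the derivatives exist at every t, a loss can supervise them directly, which is what the abstract means by enforcing mathematical consistency across position, velocity, acceleration, and jerk.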
cs.RO / 72 / 2603.01813
SSMG-Nav: Enhancing Lifelong Object Navigation with Semantic Skeleton Memory Graph
SSMG-Nav:通过语义骨架记忆图增强终身物体导航
Abstract
Navigating to out-of-sight targets from human instructions in unfamiliar environments is a core capability for service robots. Despite substantial progress, most approaches underutilize reusable, persistent memory, constraining performance in lifelong settings. Many are additionally limited to single-modality inputs and employ myopic greedy policies, which often induce inefficient back-and-forth maneuvers (BFMs). To address such limitations, we introduce SSMG-Nav, a framework for object navigation built on a Semantic Skeleton Memory Graph (SSMG) that consolidates past observations into a spatially aligned, persistent memory anchored by topological keypoints (e.g., junctions, room centers). SSMG clusters nearby entities into subgraphs, unifying entity- and space-level semantics to yield a compact set of candidate destinations. To support multimodal targets (images, objects, and text), we integrate a vision-language model (VLM). For each subgraph, a multimodal prompt synthesized from memory guides the VLM to infer a target belief over destinations. A long-horizon planner then trades off this belief against traversability costs to produce a visit sequence that minimizes expected path length, thereby reducing backtracking. Extensive experiments on challenging lifelong benchmarks and standard ObjectNav benchmarks demonstrate that, compared to strong baselines, our method achieves higher success rates and greater path efficiency, validating the effectiveness of SSMG-Nav.
Chinese Translation
在不熟悉的环境中,根据人类指令导航到视线之外的目标是服务机器人的一项核心能力。尽管取得了显著进展,但大多数方法未能充分利用可重用的持久记忆,这限制了其在终身设置中的表现。许多方法还仅限于单一模态输入,并采用短视的贪婪策略,这往往导致低效的来回操作(BFMs)。为了解决这些局限性,我们提出了SSMG-Nav,这是一个基于语义骨架记忆图(SSMG)的物体导航框架,该框架将过去的观察整合为一个空间对齐的持久记忆,并以拓扑关键点(例如,交叉口、房间中心)为锚点。SSMG将附近的实体聚类为子图,统一实体和空间层面的语义,从而生成一组紧凑的候选目的地。为了支持多模态目标(图像、物体和文本),我们集成了视觉-语言模型(VLM)。对于每个子图,从记忆中合成的多模态提示引导VLM推断目标在目的地上的信念。然后,一个长远规划器在信念与可通行性成本之间进行权衡,以生成最小化预期路径长度的访问序列,从而减少回溯。在具有挑战性的终身基准和标准ObjectNav基准上的大量实验表明,与强基线相比,我们的方法实现了更高的成功率和更大的路径效率,验证了SSMG-Nav的有效性。
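The belief-versus-travel-cost trade-off that the long-horizon planner resolves can be sketched by brute force on a tiny instance: the objective is the expected distance travelled until the target's true subgraph is visited. The numbers below are invented for illustration, and a real planner would use a scalable search rather than enumerating permutations.

```python
from itertools import permutations

def best_visit_sequence(belief, dist, start=0):
    """Pick the visit order of candidate destinations minimizing expected
    path length until the target is found, given a belief over where the
    target is and pairwise travel costs. Brute force, for illustration.
    """
    best, best_cost = None, float("inf")
    for order in permutations(belief):
        cost, travelled, here = 0.0, 0.0, start
        for dest in order:
            travelled += dist[here][dest]
            cost += belief[dest] * travelled   # expected length if target is here
            here = dest
        if cost < best_cost:
            best, best_cost = list(order), cost
    return best, best_cost

# Node 0 is the robot; candidates 1-3 carry the VLM's target belief.
belief = {1: 0.6, 2: 0.3, 3: 0.1}
dist = [[0, 5, 2, 1],
        [5, 0, 4, 6],
        [2, 4, 0, 3],
        [1, 6, 3, 0]]
seq, expected_len = best_visit_sequence(belief, dist)
print(seq, expected_len)  # [2, 1, 3] 5.4 -- beats greedy belief order [1, 2, 3]
```

Note the optimal sequence visits a nearby medium-belief candidate before the highest-belief one, precisely the backtracking-avoiding behavior a myopic greedy policy misses.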
cs.RO / 73 / 2603.01850
Tiny-DroNeRF: Tiny Neural Radiance Fields aboard Federated Learning-enabled Nano-drones
Tiny-DroNeRF:搭载于支持联邦学习的纳米无人机上的微型神经辐射场
Abstract
Sub-30g nano-sized aerial robots can leverage their agility and form factor to autonomously explore cluttered and narrow environments, like in industrial inspection and search and rescue missions. However, the price for their tiny size is a strong limit in their resources, i.e., sub-100 mW microcontroller units (MCUs) delivering ~100 GOps/s at best, and memory budgets well below 100 MB. Despite these strict constraints, we aim to enable complex vision-based tasks aboard nano-drones, such as dense 3D scene reconstruction: a key robotic task underlying fundamental capabilities like spatial awareness and motion planning. Top-performing 3D reconstruction methods leverage neural radiance fields (NeRF) models, which require GBs of memory and massive computation, usually delivered by high-end GPUs consuming 100s of Watts. Our work introduces Tiny-DroNeRF, a lightweight NeRF model, based on Instant-NGP, and optimized for running on a GAP9 ultra-low-power (ULP) MCU aboard our nano-drones. Then, we further empower our Tiny-DroNeRF by leveraging a collaborative federated learning scheme, which distributes the model training among multiple nano-drones. Our experimental results show a 96% reduction in Tiny-DroNeRF's memory footprint compared to Instant-NGP, with only a 5.7 dB drop in reconstruction accuracy. Finally, our federated learning scheme allows Tiny-DroNeRF to train with an amount of data otherwise impossible to keep in a single drone's memory, increasing the overall reconstruction accuracy. Ultimately, our work combines, for the first time, NeRF training on an ULP MCU with federated learning on nano-drones.
Chinese Translation
重量低于30克的纳米级空中机器人可以利用其灵活性和小巧的外形自主探索拥挤和狭窄的环境,如工业检查和搜救任务。然而,微小体积带来的代价是资源的严重限制,即低于100毫瓦的微控制器单元(MCUs),在最佳情况下提供约100 GOps/s的计算能力,以及内存预算远低于100 MB。尽管面临这些严格的限制,我们的目标是使纳米无人机能够执行复杂的基于视觉的任务,例如密集的3D场景重建:这是支撑空间意识和运动规划等基本能力的关键机器人任务。顶尖的3D重建方法利用神经辐射场(NeRF)模型,这些模型通常需要GB级的内存和大量计算,通常由消耗数百瓦特的高端GPU提供。我们的工作引入了Tiny-DroNeRF,这是一种轻量级的NeRF模型,基于Instant-NGP,并针对在我们的纳米无人机上运行的GAP9超低功耗(ULP)MCU进行了优化。随后,我们通过利用协作的联邦学习方案进一步增强了Tiny-DroNeRF,该方案将模型训练分布在多个纳米无人机之间。我们的实验结果表明,与Instant-NGP相比,Tiny-DroNeRF的内存占用减少了96%,而重建精度仅下降了5.7 dB。最后,我们的联邦学习方案使Tiny-DroNeRF能够使用在单个无人机内存中无法存储的数据量进行训练,从而提高了整体重建精度。最终,我们的工作首次将ULP MCU上的NeRF训练与纳米无人机上的联邦学习相结合。
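The collaborative scheme described above follows the standard federated-averaging pattern: each drone trains on its own data, and only parameters, not images, are exchanged and merged. The sketch below shows the weighted merge step; layer shapes and data sizes are illustrative, and the paper's exact aggregation rule may differ.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Federated averaging: combine per-client parameter lists as a
    data-size-weighted mean. Each client (drone) keeps its raw data
    local; only these weight arrays are communicated.
    """
    total = sum(client_sizes)
    avg = [np.zeros_like(w) for w in client_weights[0]]
    for weights, n in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            avg[i] += (n / total) * w
    return avg

# Two drones with one-layer "models" and unequal amounts of local data.
drone_a = [np.array([1.0, 1.0])]
drone_b = [np.array([3.0, 3.0])]
merged = fedavg([drone_a, drone_b], client_sizes=[1, 3])
print(merged[0])  # [2.5 2.5] -- drone_b's larger dataset dominates the merge
```

This is also why the scheme sidesteps the sub-100 MB memory budget: the effective training set is the union of all drones' observations, while each MCU only ever holds its own share.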
cs.RO / 74 / 2603.01898
SaferPath: Hierarchical Visual Navigation with Learned Guidance and Safety-Constrained Control
SaferPath:具有学习引导和安全约束控制的分层视觉导航
Abstract
Visual navigation is a core capability for mobile robots, yet end-to-end learning-based methods often struggle with generalization and safety in unseen, cluttered, or narrow environments. These limitations are especially pronounced in dense indoor settings, where collisions are likely and end-to-end models frequently fail. To address this, we propose SaferPath, a hierarchical visual navigation framework that leverages learned guidance from existing end-to-end models and refines it through a safety-constrained optimization-control module. SaferPath transforms visual observations into a traversable-area map and refines guidance trajectories using Model Predictive Stein Variational Evolution Strategy (MP-SVES), efficiently generating safe trajectories in only a few iterations. The refined trajectories are tracked by an MPC controller, ensuring robust navigation in complex environments. Extensive experiments in scenarios with unseen obstacles, dense unstructured spaces, and narrow corridors demonstrate that SaferPath consistently improves success rates and reduces collisions, outperforming representative baselines such as ViNT and NoMaD, and enabling safe navigation in challenging real-world settings.
Chinese Translation
视觉导航是移动机器人核心能力之一,但基于端到端学习的方法在未见过的、杂乱或狭窄环境中往往难以实现泛化和安全性。这些局限性在密集的室内环境中尤为明显,在这些环境中,碰撞的可能性较高,而端到端模型经常失败。为了解决这一问题,我们提出了SaferPath,一个分层视觉导航框架,利用现有端到端模型的学习引导,并通过安全约束优化控制模块进行优化。SaferPath将视觉观测转换为可通行区域地图,并利用模型预测斯坦因变分进化策略(Model Predictive Stein Variational Evolution Strategy, MP-SVES)优化引导轨迹,仅需少量迭代即可高效生成安全轨迹。优化后的轨迹由模型预测控制器(MPC)跟踪,确保在复杂环境中的稳健导航。在未见障碍物、密集非结构化空间和狭窄走廊的场景中进行的广泛实验表明,SaferPath始终提高成功率并减少碰撞,优于代表性基线如ViNT和NoMaD,并能够在具有挑战性的现实世界环境中实现安全导航。
cs.RO / 75 / 2603.01953
Closed-Loop Action Chunks with Dynamic Corrections for Training-Free Diffusion Policy
面向免训练扩散策略的带动态修正的闭环动作块
Abstract
Diffusion-based policies have achieved remarkable results in robotic manipulation but often struggle to adapt rapidly in dynamic scenarios, leading to delayed responses or task failures. We present DCDP, a Dynamic Closed-Loop Diffusion Policy framework that integrates chunk-based action generation with real-time correction. DCDP integrates a self-supervised dynamic feature encoder, cross-attention fusion, and an asymmetric action encoder-decoder to inject environmental dynamics before action execution, achieving real-time closed-loop action correction and enhancing the system's adaptability in dynamic scenarios. In dynamic PushT simulations, DCDP improves adaptability by 19% without retraining while requiring only 5% additional computation. Its modular design enables plug-and-play integration, achieving both temporal coherence and real-time responsiveness in dynamic robotic scenarios, including real-world manipulation tasks. The project page is at: https://github.com/wupengyuan/dcdp
Chinese Translation
基于扩散的策略在机器人操控中取得了显著成果,但在动态场景中往往难以快速适应,导致反应延迟或任务失败。我们提出了DCDP(动态闭环扩散策略)框架,该框架将基于块的动作生成与实时修正相结合。DCDP集成了自监督动态特征编码器、交叉注意力融合和非对称动作编码器-解码器,以在动作执行之前注入环境动态,实现实时闭环动作修正,并增强系统在动态场景中的适应性。在动态PushT仿真中,DCDP在不重新训练的情况下提高了19%的适应性,同时仅需额外5%的计算量。其模块化设计使得即插即用的集成成为可能,在动态机器人场景中实现了时间一致性和实时响应,包括现实世界的操控任务。项目页面地址为:https://github.com/wupengyuan/dcdp
cs.RO / 76 / 2603.01982
From Transportation to Manipulation: Transforming Magnetic Levitation to Magnetic Robotics
从运输到操作:将磁悬浮技术转变为磁性机器人
Abstract
Magnetic Levitation (MagLev) systems fundamentally increase the flexibility of in-machine material flow in industrial automation. Therefore, these systems enable dynamic throughput optimization, which is especially beneficial for high-mix low-volume manufacturing. Until now, MagLev installations have been used primarily for in-machine transport, while their potential for manipulation is largely unexplored. This paper introduces the 6D-Platform MagBot, a low-cost six-degree-of-freedom parallel kinematic mechanism that couples two movers into a composite robotic platform. Experiments show that the 6D-Platform MagBot achieves sub-millimeter positioning accuracy and supports fully autonomous pick-up and drop-off via a docking station, allowing rapid and repeatable reconfiguration of the machine. Relative to a single mover, the proposed platform substantially expands the reachable workspace, payload, and functional dexterity. By unifying transportation and manipulation, this work advances Magnetic Levitation towards Magnetic Robotics, enabling manufacturing solutions that are more agile, efficient, and adaptable.
Chinese Translation
磁悬浮(MagLev)系统从根本上提高了工业自动化中机器内部材料流动的灵活性。因此,这些系统能够实现动态吞吐量优化,这对于高混合低批量生产尤为有利。迄今为止,MagLev 装置主要用于机器内部运输,而其在操作方面的潜力尚未得到充分探索。本文介绍了一种低成本的六自由度并联运动平台——6D-Platform MagBot,该平台将两个移动器结合成一个复合机器人平台。实验表明,6D-Platform MagBot 实现了亚毫米级的定位精度,并通过对接站支持完全自主的拾取和放置,从而允许机器的快速和可重复的重新配置。与单一移动器相比,所提出的平台显著扩展了可达工作空间、有效载荷和功能灵活性。通过统一运输和操作,本研究推动了磁悬浮技术向磁性机器人发展的进程,使制造解决方案变得更加灵活、高效和适应性强。
cs.RO / 77 / 2603.01999
Learning Vision-Based Omnidirectional Navigation: A Teacher-Student Approach Using Monocular Depth Estimation
基于视觉的全向导航学习:一种使用单目深度估计的教师-学生方法
Abstract
Reliable obstacle avoidance in industrial settings demands 3D scene understanding, but widely used 2D LiDAR sensors perceive only a single horizontal slice of the environment, missing critical obstacles above or below the scan plane. We present a teacher-student framework for vision-based mobile robot navigation that eliminates the need for LiDAR sensors. A teacher policy trained via Proximal Policy Optimization (PPO) in NVIDIA Isaac Lab leverages privileged 2D LiDAR observations that account for the full robot footprint to learn robust navigation. The learned behavior is distilled into a student policy that relies solely on monocular depth maps predicted by a fine-tuned Depth Anything V2 model from four RGB cameras. The complete inference pipeline, comprising monocular depth estimation (MDE), policy execution, and motor control, runs entirely onboard an NVIDIA Jetson Orin AGX mounted on a DJI RoboMaster platform, requiring no external computation for inference. In simulation, the student achieves success rates of 82-96.5%, consistently outperforming the standard 2D LiDAR teacher (50-89%). In real-world experiments, the MDE-based student outperforms the 2D LiDAR teacher when navigating around obstacles with complex 3D geometries, such as overhanging structures and low-profile objects, that fall outside the single scan plane of a 2D LiDAR.
Chinese Translation
在工业环境中,可靠的障碍物避免需要对三维场景的理解,但广泛使用的二维激光雷达(LiDAR)传感器仅能感知环境的单一水平切片,无法捕捉到扫描平面上下的关键障碍物。我们提出了一种基于视觉的移动机器人导航的教师-学生框架,消除了对激光雷达传感器的需求。通过在NVIDIA Isaac Lab中使用近端策略优化(Proximal Policy Optimization, PPO)训练的教师策略,利用特权的二维激光雷达观测数据(涵盖整个机器人足迹)来学习稳健的导航。学习到的行为被提炼成一个仅依赖于由经过微调的Depth Anything V2模型从四个RGB摄像头预测的单目深度图的学生策略。完整的推理流程,包括单目深度估计(Monocular Depth Estimation, MDE)、策略执行和电机控制,完全在安装在DJI RoboMaster平台上的NVIDIA Jetson Orin AGX上运行,无需外部计算进行推理。在仿真中,学生的成功率达到82-96.5%,始终优于标准的二维激光雷达教师(50-89%)。在实际实验中,基于MDE的学生在绕过具有复杂三维几何形状的障碍物(如悬垂结构和低矮物体)时表现优于二维激光雷达教师,这些障碍物超出了二维激光雷达的单一扫描平面。
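The distillation step described above, transferring behavior from the privileged LiDAR teacher to the depth-only student, typically amounts to regressing student actions onto teacher actions on shared states. The sketch below shows that objective in its simplest (MSE) form; the actual policies are neural networks, and the action values here are invented.

```python
import numpy as np

def distillation_loss(student_actions, teacher_actions):
    """Behavior-cloning style distillation objective: the vision-only
    student is regressed onto actions produced by the privileged
    (LiDAR-observing) teacher on the same states.
    """
    s = np.asarray(student_actions, dtype=float)
    t = np.asarray(teacher_actions, dtype=float)
    return float(np.mean((s - t) ** 2))        # MSE over the action batch

teacher = [[0.5, 0.0], [0.3, 0.1]]            # privileged-policy actions
student = [[0.4, 0.0], [0.3, 0.3]]            # depth-only policy actions
print(distillation_loss(student, teacher))    # 0.0125
```

The asymmetry is the point: the teacher's privileged 2D LiDAR exists only at training time, so at deployment the student runs from monocular depth alone.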
cs.RO / 78 / 2603.02004
CHOP: Counterfactual Human Preference Labels Improve Obstacle Avoidance in Visuomotor Navigation Policies
CHOP:反事实人类偏好标签改善视觉运动导航策略中的障碍物避让
Abstract
Visuomotor navigation policies have shown strong perception-action coupling for embodied agents, yet they often struggle with safe navigation and dynamic obstacle avoidance in complex real-world environments. We introduce CHOP, a novel approach that leverages Counterfactual Human Preference Labels to align visuomotor navigation policies towards human intuition of safety and obstacle avoidance in navigation. In CHOP, for each visual observation, the robot's executed trajectory is included among a set of counterfactual navigation trajectories: alternative trajectories the robot could have followed under identical conditions. Human annotators provide pairwise preference labels over these trajectories based on anticipated outcomes such as collision risk and path efficiency. These aggregated preferences are then used to fine-tune visuomotor navigation policies, aligning their behavior with human preferences in navigation. Experiments on the SCAND dataset show that visuomotor navigation policies fine-tuned with CHOP reduce near-collision events by 49.7%, decrease deviation from human-preferred trajectories by 45.0%, and increase average obstacle clearance by 19.8% on average across multiple state-of-the-art models, compared to their pretrained baselines. These improvements transfer to real-world deployments on a Ghost Robotics Vision60 quadruped, where CHOP-aligned policies improve average goal success rates by 24.4%, increase minimum obstacle clearance by 6.8%, reduce collision and intervention events by 45.7%, and improve normalized path completion by 38.6% on average across navigation scenarios, compared to their pretrained baselines. Our results highlight the value of counterfactual preference supervision in bridging the gap between large-scale visuomotor policies and human-aligned, safety-aware embodied navigation.
Chinese Translation
视觉运动导航策略在具身智能体中表现出强烈的感知-行动耦合,但在复杂的现实环境中,它们常常在安全导航和动态障碍物避让方面面临挑战。我们提出了CHOP,一种新颖的方法,利用反事实人类偏好标签来使视觉运动导航策略与人类对安全和障碍物避让的直觉相一致。在CHOP中,对于每个视觉观察,机器人执行的轨迹被纳入一组反事实导航轨迹中:在相同条件下机器人可以遵循的替代轨迹。人类标注者根据预期结果(如碰撞风险和路径效率)对这些轨迹提供成对的偏好标签。这些聚合的偏好随后用于微调视觉运动导航策略,使其行为与人类在导航中的偏好相一致。在SCAND数据集上的实验表明,使用CHOP微调的视觉运动导航策略将近碰撞事件减少了49.7%,与人类偏好轨迹的偏离减少了45.0%,并且在多个最先进模型中,平均障碍物清除率提高了19.8%,相较于其预训练基线。这些改进在Ghost Robotics Vision60四足机器人上的实际部署中得到了验证,其中CHOP对齐的策略使平均目标成功率提高了24.4%,最小障碍物清除率提高了6.8%,碰撞和干预事件减少了45.7%,并且在导航场景中,标准化路径完成率平均提高了38.6%,相较于其预训练基线。我们的结果强调了反事实偏好监督在弥合大规模视觉运动策略与人类对齐、安全意识的具身导航之间的价值。
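Pairwise preference labels like CHOP's are commonly turned into a training signal with a Bradley-Terry objective: the probability that the preferred trajectory scores higher is a sigmoid of the score gap, and its negative log-likelihood is minimized. This is a generic sketch of that recipe, not necessarily the paper's exact formulation.

```python
import math

def preference_loss(score_preferred, score_other):
    """Bradley-Terry pairwise preference objective: minimize the negative
    log-likelihood of sigma(s_preferred - s_other), i.e. of the model
    ranking the human-preferred trajectory above the alternative.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(score_preferred - score_other))))

# If the policy's scorer already ranks the preferred trajectory higher,
# the loss is small; a wrong ranking is penalized heavily.
print(round(preference_loss(2.0, 0.0), 4))   # 0.1269
print(round(preference_loss(0.0, 2.0), 4))   # 2.1269
```

Counterfactual trajectories matter here because each pair compares the executed path against an alternative the robot could have taken in the identical scene, so the gradient points directly at safer choices.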
cs.RO / 79 / 2603.02035
LAD-Drive: Bridging Language and Trajectory with Action-Aware Diffusion Transformers
LAD-Drive:通过动作感知扩散变换器连接语言与轨迹
Abstract
While multimodal large language models (MLLMs) provide advanced reasoning for autonomous driving, translating their discrete semantic knowledge into continuous trajectories remains a fundamental challenge. Existing methods often rely on unimodal planning heads that inherently limit their ability to represent multimodal driving behavior. Furthermore, most generative approaches frequently condition on one-hot encoded actions, discarding the nuanced navigational uncertainty critical for complex scenarios. To resolve these limitations, we introduce LAD-Drive, a generative framework that structurally disentangles high-level intention from low-level spatial planning. LAD-Drive employs an action decoder to infer a probabilistic meta-action distribution, establishing an explicit belief state that preserves the nuanced intent typically lost by one-hot encodings. This distribution, fused with the vehicle's kinematic state, conditions an action-aware diffusion decoder that utilizes a truncated denoising process to refine learned motion anchors into safe, kinematically feasible trajectories. Extensive evaluations on the LangAuto benchmark demonstrate that LAD-Drive achieves state-of-the-art results, outperforming competitive baselines by up to 59% in Driving Score while significantly reducing route deviations and collisions. We will publicly release the code and models on https://github.com/iis-esslingen/lad-drive.
Chinese Translation
尽管多模态大型语言模型(MLLMs)为自动驾驶提供了先进的推理能力,但将其离散语义知识转化为连续轨迹仍然是一个基本挑战。现有方法通常依赖于单模态规划头,这在本质上限制了其表示多模态驾驶行为的能力。此外,大多数生成方法经常基于独热编码的动作进行条件处理,忽视了在复杂场景中至关重要的细微导航不确定性。为了解决这些局限性,我们提出了LAD-Drive,一个生成框架,结构性地将高层意图与低层空间规划分离。LAD-Drive采用动作解码器推断概率元动作分布,建立一个明确的信念状态,保留通常在独热编码中丢失的细微意图。该分布与车辆的运动状态融合,为动作感知扩散解码器提供条件,该解码器利用截断去噪过程将学习到的运动锚点精炼为安全且运动学上可行的轨迹。在LangAuto基准上的广泛评估表明,LAD-Drive达到了最先进的结果,在驾驶评分上超越竞争基线高达59%,同时显著减少了路线偏差和碰撞。我们将公开发布代码和模型,网址为https://github.com/iis-esslingen/lad-drive。
cs.RO / 80 / 2603.02083
$\pi$-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs
$\pi$-StepNFT:基于流的VLA在线强化学习中,更宽的空间需要更精细的步骤
Abstract
Flow-based vision-language-action (VLA) models excel in embodied control but suffer from intractable likelihoods during multi-step sampling, hindering online reinforcement learning. We propose $\pi$-StepNFT (Step-wise Negative-aware Fine-Tuning), a critic-and-likelihood-free framework that requires only a single forward pass per optimization step and eliminates auxiliary value networks. We identify that wider exploration spaces necessitate finer-grained, step-wise guidance for alignment. Empirically, $\pi$-StepNFT unlocks latent potential on LIBERO with competitive few-shot robustness. Moreover, it achieves superior generalization on ManiSkill, outperforming value-based baselines in OOD scenarios by preventing overfitting to multimodal features. This property offers a promising, scalable solution for complex real-world applications.
Chinese Translation
基于流的视觉-语言-动作(VLA)模型在具身控制方面表现出色,但在多步采样过程中面临难以处理的似然问题,从而阻碍了在线强化学习的发展。我们提出了$\pi$-StepNFT(逐步负样本感知微调),这是一种无需评价网络(critic)和似然的框架,每个优化步骤仅需一次前向传播,并消除了辅助价值网络。我们发现,更宽的探索空间需要更细粒度的逐步指导以实现对齐。实证结果表明,$\pi$-StepNFT在LIBERO上释放了潜在能力,展现出有竞争力的少样本鲁棒性。此外,它在ManiSkill上实现了优越的泛化能力,通过防止对多模态特征的过拟合,在OOD场景中优于基于价值的基线。这一特性为复杂的现实世界应用提供了有前景的可扩展解决方案。
cs.RO / 81 / 2603.02104
ACDC: Adaptive Curriculum Planning with Dynamic Contrastive Control for Goal-Conditioned Reinforcement Learning in Robotic Manipulation
ACDC:用于机器人操作的目标条件强化学习的动态对比控制自适应课程规划
Abstract
Goal-conditioned reinforcement learning has shown considerable potential in robotic manipulation; however, existing approaches remain limited by their reliance on prioritizing collected experience, resulting in suboptimal performance across diverse tasks. Inspired by human learning behaviors, we propose a more comprehensive learning paradigm, ACDC, which integrates multidimensional Adaptive Curriculum (AC) Planning with Dynamic Contrastive (DC) Control to guide the agent along a well-designed learning trajectory. More specifically, at the planning level, the AC component schedules the learning curriculum by dynamically balancing diversity-driven exploration and quality-driven exploitation based on the agent's success rate and training progress. At the control level, the DC component implements the curriculum plan through norm-constrained contrastive learning, enabling magnitude-guided experience selection aligned with the current curriculum focus. Extensive experiments on challenging robotic manipulation tasks demonstrate that ACDC consistently outperforms the state-of-the-art baselines in both sample efficiency and final task success rate.
Chinese Translation
目标条件强化学习在机器人操作中展现了相当大的潜力;然而,现有方法仍然受到依赖于优先考虑收集经验的限制,导致在多样化任务中的表现不尽如人意。受到人类学习行为的启发,我们提出了一种更全面的学习范式ACDC,它将多维自适应课程(AC)规划与动态对比(DC)控制相结合,以引导智能体沿着精心设计的学习轨迹前进。更具体地说,在规划层面,AC组件基于智能体的成功率和训练进展,动态平衡多样性驱动的探索与质量驱动的开发,从而安排学习课程。在控制层面,DC组件通过范数约束的对比学习来实施课程计划,实现与当前课程重点一致的、由幅度引导的经验选择。对具有挑战性的机器人操作任务进行的广泛实验表明,ACDC在样本效率和最终任务成功率方面始终优于最先进的基线方法。
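The adaptive-curriculum idea of shifting from diversity-driven exploration toward quality-driven exploitation as competence grows can be sketched with a simple blending rule driven by success rate and training progress. The specific blend below is illustrative, not the paper's actual schedule.

```python
def curriculum_weights(success_rate, progress):
    """Sketch of adaptive curriculum scheduling: early in training (low
    progress, low success) weight diversity-driven exploration; as the
    agent succeeds more, shift weight toward quality-driven exploitation.
    The linear blend is a stand-in for the paper's learned/tuned schedule.
    """
    exploit = 0.5 * (success_rate + progress)   # grows with competence
    exploit = min(max(exploit, 0.0), 1.0)
    return {"explore": 1.0 - exploit, "exploit": exploit}

print(curriculum_weights(0.1, 0.0))  # mostly exploration at the start
print(curriculum_weights(0.9, 0.8))  # mostly exploitation late in training
```

These weights would then steer which experiences the DC component samples, aligning experience selection with the current curriculum focus.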
cs.RO / 82 / 2603.02114
Real-Time Thermal-Inertial Odometry on Embedded Hardware for High-Speed GPS-Denied Flight
面向高速GPS拒止飞行的嵌入式硬件实时热惯性里程计
Abstract
We present a real-time monocular thermal-inertial odometry system designed for high-velocity, GPS-denied flight on embedded hardware. The system fuses measurements from a FLIR Boson+ 640 longwave infrared camera, a high-rate IMU, a laser range finder, a barometer, and a magnetometer within a fixed-lag factor graph. To sustain reliable feature tracks under motion blur, low contrast, and rapid viewpoint changes, we employ a lightweight thermal-optimized front-end with multi-stage feature filtering. Laser range finder measurements provide per-feature depth priors that stabilize scale during weakly observable motion. High-rate inertial data is first pre-filtered using a Chebyshev Type II infinite impulse response (IIR) filter and then preintegrated, improving robustness to airframe vibrations during aggressive maneuvers. To address barometric altitude errors induced at high airspeeds, we train an uncertainty-aware gated recurrent unit (GRU) network that models the temporal dynamics of static pressure distortion, outperforming polynomial and multi-layer perceptron (MLP) baselines. Integrated on an NVIDIA Jetson Xavier NX, the complete system supports closed-loop quadrotor flight at 30 m/s with drift under 2% over kilometer-scale trajectories. These contributions expand the operational envelope of thermal-inertial navigation, enabling reliable high-speed flight in visually degraded and GPS-denied environments.
Chinese Translation
我们提出了一种实时单目热惯性里程计系统,旨在在嵌入式硬件上实现高速、GPS拒止条件下的飞行。该系统融合了来自FLIR Boson+ 640长波红外相机、高频率惯性测量单元(IMU)、激光测距仪、气压计和磁力计的测量数据,采用固定滞后因子图进行处理。为了在运动模糊、低对比度和快速视角变化下保持可靠的特征轨迹,我们采用了一种轻量级的热优化前端,并进行了多阶段特征过滤。激光测距仪的测量提供了每个特征的深度先验,在弱可观测运动中稳定了尺度。高频率的惯性数据首先使用切比雪夫II型无限脉冲响应(IIR)滤波器进行预过滤,然后进行预积分,从而提高了在激烈机动过程中对机体振动的鲁棒性。为了解决高空速下引起的气压高度误差,我们训练了一个不确定性感知的门控递归单元(GRU)网络,以建模静态压力失真的时间动态,其性能优于多项式和多层感知器(MLP)基线。该完整系统集成于NVIDIA Jetson Xavier NX上,支持以30米/秒的速度进行闭环四旋翼飞行,在公里级轨迹上漂移小于2%。这些贡献扩展了热惯性导航的操作范围,使得在视觉退化和GPS拒止环境下实现可靠的高速飞行成为可能。
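The IIR pre-filtering stage applied to the high-rate IMU stream is just a recursive difference equation over the samples. The sketch below applies a generic IIR filter and demonstrates it with a one-pole low-pass; an actual system would obtain the coefficients from a Chebyshev Type II design (e.g. `scipy.signal.cheby2`) tuned to the airframe's vibration band.

```python
def iir_filter(b, a, x):
    """Apply an IIR filter via the direct-form difference equation
        a[0]*y[n] = sum_i b[i]*x[n-i] - sum_{j>=1} a[j]*y[n-j].
    (b, a) are transfer-function coefficients; here they are chosen by
    hand for illustration rather than from a Chebyshev II design.
    """
    y = []
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        acc -= sum(a[j] * y[n - j] for j in range(1, len(a)) if n - j >= 0)
        y.append(acc / a[0])
    return y

# A one-pole low-pass (b=[0.1], a=[1, -0.9], unity DC gain) attenuates the
# high-frequency "vibration" riding on a constant 1.0 signal.
noisy = [1.0 + (-1) ** n * 0.5 for n in range(200)]
smooth = iir_filter([0.1], [1.0, -0.9], noisy)
print(abs(smooth[-1] - 1.0) < 0.1)  # True -- steady state near the true value
```

Filtering before preintegration matters because vibration energy that survives into the preintegrated factors would otherwise bias the factor-graph solution during aggressive maneuvers.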
cs.RO / 83 / 2603.02115
Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons
Robometer:通过轨迹比较扩展通用机器人奖励模型
Abstract
General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame-level supervision. While effective for expert demonstrations, this paradigm scales poorly to large-scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM-1M, a reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real-world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos at https://robometer.github.io/.
Chinese Translation
通用机器人奖励模型通常通过专家示范来预测绝对任务进展,仅提供局部的帧级监督。虽然这种方法在专家示范中有效,但在大量机器人数据集中,由于失败和次优轨迹的普遍存在,且分配密集的进展标签存在歧义,因此其扩展性较差。我们提出了Robometer,一个可扩展的奖励建模框架,它结合了轨迹内部进展监督和轨迹间偏好监督。Robometer的训练目标是双重的:一方面是帧级进展损失,它将奖励幅度锚定在专家数据上;另一方面是轨迹比较偏好损失,它对同一任务的轨迹施加全局排序约束,从而能够有效地从真实和增强的失败轨迹中学习。为了支持这一大规模的构想,我们整理了RBM-1M,一个奖励学习数据集,包含超过一百万条轨迹,涵盖多种机器人形态和任务,包括大量的次优和失败数据。在基准测试和现实世界评估中,Robometer学习到的奖励函数比以往的方法更具可推广性,并在多样化的下游应用中提高了机器人学习性能。代码、模型权重和视频请访问 https://robometer.github.io/。
cs.RO / 84 / 2603.02139
Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation
重新思考相机选择:关于鱼眼相机在机器人操作中的特性的一项实证研究
Abstract
The adoption of fisheye cameras in robotic manipulation, driven by their exceptionally wide Field of View (FoV), is rapidly outpacing a systematic understanding of their downstream effects on policy learning. This paper presents the first comprehensive empirical study to bridge this gap, rigorously analyzing the properties of wrist-mounted fisheye cameras for imitation learning. Through extensive experiments in both simulation and the real world, we investigate three critical research questions: spatial localization, scene generalization, and hardware generalization. Our investigation reveals that: (1) The wide FoV significantly enhances spatial localization, but this benefit is critically contingent on the visual complexity of the environment. (2) Fisheye-trained policies, while prone to overfitting in simple scenes, unlock superior scene generalization when trained with sufficient environmental diversity. (3) While naive cross-camera transfer leads to failures, we identify the root cause as scale overfitting and demonstrate that hardware generalization performance can be improved with a simple Random Scale Augmentation (RSA) strategy. Collectively, our findings provide concrete, actionable guidance for the large-scale collection and effective use of fisheye datasets in robotic learning. More results and videos are available on https://robo-fisheye.github.io/
Chinese Translation
鱼眼相机因其极宽的视场(FoV)而在机器人操作中被迅速采用,其采用速度已超过了对其在策略学习中下游影响的系统性理解。本文呈现了第一项全面的实证研究,以填补这一空白,严格分析了腕部安装的鱼眼相机在模仿学习中的特性。通过在模拟和现实世界中进行广泛实验,我们探讨了三个关键研究问题:空间定位、场景泛化和硬件泛化。我们的研究发现:(1)宽视场显著增强了空间定位能力,但这一优势在很大程度上依赖于环境的视觉复杂性。(2)尽管鱼眼训练的策略在简单场景中容易过拟合,但在环境多样性充分的情况下,能够实现更优的场景泛化。(3)虽然简单的跨相机迁移会导致失败,但我们确定其根本原因是尺度过拟合,并证明通过简单的随机尺度增强(Random Scale Augmentation, RSA)策略可以改善硬件泛化性能。总体而言,我们的发现为大规模收集和有效利用鱼眼数据集在机器人学习中的应用提供了具体的、可操作的指导。更多结果和视频可在 https://robo-fisheye.github.io/ 上查看。
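The Random Scale Augmentation (RSA) remedy for scale overfitting can be sketched as rescaling each training image by a random factor and cropping or padding back to the original size, so the policy cannot lock onto one camera's apparent object scale. Nearest-neighbor resampling keeps this dependency-free; the exact transform in the paper may differ.

```python
import numpy as np

def random_scale_augment(img, rng, lo=0.8, hi=1.2):
    """Random Scale Augmentation sketch: rescale by a random factor, then
    center-crop (if enlarged) or pad at the top-left (if shrunk) back to
    the original resolution.
    """
    h, w = img.shape[:2]
    s = rng.uniform(lo, hi)
    # nearest-neighbor resize by factor s
    ys = np.clip((np.arange(int(h * s)) / s).astype(int), 0, h - 1)
    xs = np.clip((np.arange(int(w * s)) / s).astype(int), 0, w - 1)
    scaled = img[np.ix_(ys, xs)]
    # restore the original (h, w) canvas
    out = np.zeros_like(img)
    sh, sw = scaled.shape[:2]
    top, left = max(0, (sh - h) // 2), max(0, (sw - w) // 2)
    crop = scaled[top:top + h, left:left + w]
    out[:crop.shape[0], :crop.shape[1]] = crop
    return out

rng = np.random.default_rng(0)
img = np.arange(64 * 64).reshape(64, 64)
aug = random_scale_augment(img, rng)
print(aug.shape)  # (64, 64) -- size preserved, apparent scale randomized
```

By randomizing apparent scale at train time, the same policy weights become less sensitive to the intrinsics of whichever fisheye lens is mounted at deployment.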
cs.CV / 1 / 2603.00060
Learning Under Extreme Data Scarcity: Subject-Level Evaluation of Lightweight CNNs for fMRI-Based Prodromal Parkinson's Detection
在极端数据稀缺下学习:基于轻量级卷积神经网络的个体层面fMRI前驱帕金森病检测评估
Abstract
Deep learning is often applied in settings where data are limited, correlated, and difficult to obtain, yet evaluation practices do not always reflect these constraints. Neuroimaging for prodromal Parkinson's disease is one such case, where subject numbers are small and individual scans produce many highly related samples. This work examines prodromal Parkinson's detection from resting-state fMRI as a machine learning problem centered on learning under extreme data scarcity. Using fMRI data from 40 subjects, including 20 prodromal Parkinson's cases and 20 healthy controls, ImageNet-pretrained convolutional neural networks are fine-tuned and evaluated under two different data partitioning strategies. Results show that commonly used image-level splits allow slices from the same subject to appear in both training and test sets, leading to severe information leakage and near-perfect accuracy. When a strict subject-level split is enforced, performance drops substantially, yielding test accuracies between 60 and 81 percent. Models with different capacity profiles are compared, including VGG19, Inception V3, Inception ResNet V2, and the lightweight MobileNet V1. Under subject-level evaluation, MobileNet demonstrates the most reliable generalization, outperforming deeper architectures despite having significantly fewer parameters. These results indicate that in extreme low-data regimes, evaluation strategy and model capacity have a greater impact on performance than architectural depth. Although the analysis is limited to a single cohort of 40 subjects and does not include external validation or cross-validation, it provides a concrete case study and practical recommendations for evaluating deep learning models under severe data scarcity.
Chinese Translation
深度学习通常应用于数据有限、相关性强且难以获取的环境中,但评估实践并不总是反映这些限制。前驱帕金森病的神经影像学就是一个这样的例子,其中受试者数量较少,个体扫描产生许多高度相关的样本。本研究将静息态fMRI下的前驱帕金森病检测视为一个机器学习问题,重点关注在极端数据稀缺下的学习。使用来自40名受试者的fMRI数据,包括20例前驱帕金森病病例和20名健康对照,针对两种不同的数据划分策略对基于ImageNet预训练的卷积神经网络进行了微调和评估。结果表明,常用的图像级划分允许同一受试者的切片同时出现在训练集和测试集中,导致严重的信息泄露和接近完美的准确率。当强制执行严格的受试者级划分时,性能显著下降,测试准确率在60%到81%之间。比较了不同容量特征的模型,包括VGG19、Inception V3、Inception ResNet V2和轻量级的MobileNet V1。在受试者级评估下,MobileNet表现出最可靠的泛化能力,尽管参数显著较少,却超越了更深的架构。这些结果表明,在极端低数据环境中,评估策略和模型容量对性能的影响大于架构深度。尽管分析仅限于一个包含40名受试者的单一队列,并且未包含外部验证或交叉验证,但它提供了一个具体的案例研究和在严重数据稀缺下评估深度学习模型的实际建议。
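The difference between image-level and subject-level partitioning that drives the leakage result above can be demonstrated with a toy splitter. The 40-subject cohort size mirrors the abstract, while the 50 slices per subject and the helper names are illustrative:

```python
import random

def image_level_split(samples, test_frac=0.2, seed=0):
    """Naive split over individual slices: slices from the same subject
    can land in both the train and test sets, leaking subject identity."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def subject_level_split(samples, test_frac=0.2, seed=0):
    """Split whole subjects, so no subject contributes to both sets."""
    rng = random.Random(seed)
    subjects = sorted({subj for subj, _ in samples})
    rng.shuffle(subjects)
    cut = int(len(subjects) * (1 - test_frac))
    held_out = set(subjects[cut:])
    train = [s for s in samples if s[0] not in held_out]
    test = [s for s in samples if s[0] in held_out]
    return train, test

def leaked_subjects(train, test):
    """Subjects that appear on both sides of a split."""
    return {subj for subj, _ in train} & {subj for subj, _ in test}

# 40 subjects, mirroring the cohort in the abstract; 50 slices each is
# an illustrative number.
data = [(subj, sl) for subj in range(40) for sl in range(50)]
```

Running both splitters on `data` shows the image-level split leaking many subjects into both sets, while the subject-level split leaks none.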
cs.CV / 2 / 2603.00114
Automated Quality Check of Sensor Data Annotations
传感器数据注释的自动质量检查
Abstract
The monitoring of the route and track environment plays an important role in automated driving. For example, it can be used as an assistance system for route monitoring in automation level Grade of Automation (GoA) 2, where the train driver is still on board. In fully automated, driverless driving at automation level GoA4, these systems finally take over environment monitoring completely independently. With the help of artificial intelligence (AI), they react automatically to risks and dangerous events on the route. To train such AI algorithms, large amounts of training data are required, which must meet high-quality standards due to their safety relevance. In this publication, we present an automatic method for assuring the quality of training data, significantly reducing the manual workload and accelerating the development of these systems. We propose an open-source tool designed to detect nine common errors found in multi-sensor datasets for railway vehicles. To evaluate the performance of the framework, all detected errors were manually validated. Six issue detection methods achieved 100% precision, while the three remaining methods reached precision rates of 96% and 97%.
Chinese Translation
路线和轨道环境的监测在自动驾驶中发挥着重要作用。例如,它可以作为自动化等级(Grade of Automation, GoA)2中路线监测的辅助系统,此时列车司机仍在车上。在完全自动化、无人驾驶的GoA4级别中,这些系统最终完全独立地承担环境监测的任务。借助人工智能(AI),它们能够自动对路线上的风险和危险事件作出反应。为了训练这样的AI算法,需要大量的训练数据,这些数据由于其安全相关性必须符合高质量标准。在本出版物中,我们提出了一种自动化的方法来确保训练数据的质量,显著减少了人工工作量,加速了这些系统的开发。我们提出了一种开源工具,旨在检测铁路车辆多传感器数据集中常见的九种错误。为了评估该框架的性能,所有检测到的错误均进行了人工验证。六种问题检测方法达到了100%的精准度,而另外三种方法的精准率分别为96%和97%。
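A minimal sketch of what such automated annotation checks might look like. The paper's nine error types are not enumerated in the abstract, so the three checks and issue codes below are illustrative assumptions:

```python
def check_annotation(ann, img_w, img_h, valid_labels, min_area=4):
    """Return issue codes for one bounding-box annotation.
    The checks are illustrative stand-ins for the paper's error types."""
    issues = []
    x1, y1, x2, y2 = ann["bbox"]
    # Box must lie inside the image and have positive extent.
    if not (0 <= x1 < x2 <= img_w and 0 <= y1 < y2 <= img_h):
        issues.append("bbox_out_of_bounds_or_degenerate")
    # Implausibly small boxes are usually annotation slips.
    elif (x2 - x1) * (y2 - y1) < min_area:
        issues.append("bbox_too_small")
    # Labels must come from the dataset's label set.
    if ann["label"] not in valid_labels:
        issues.append("unknown_label")
    return issues
```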
cs.CV / 3 / 2603.00116
VoxelDiffusionCut: Non-destructive Internal-part Extraction via Iterative Cutting and Structure Estimation
VoxelDiffusionCut:通过迭代切割和结构估计实现非破坏性内部部件提取
Abstract
Non-destructive extraction of the target internal part, such as batteries and motors, by cutting surrounding structures is crucial at recycling and disposal sites. However, the diversity of products and the lack of information on disassembly procedures make it challenging to decide where to cut. This study explores a method for non-destructive extraction of a target internal part that iteratively estimates the internal structure from observed cutting surfaces and formulates cutting plans based on the estimation results. A key requirement is to estimate the probability of the target part's presence from partial observations. However, learning conditional generative models for this task is challenging: The high dimensionality of 3D shape representations makes learning difficult, and conventional models (e.g., conditional variational autoencoders) often fail to capture multi-modal predictive uncertainty due to mode collapse, resulting in overconfident predictions. To address these issues, we propose VoxelDiffusionCut, which iteratively estimates the internal structure represented as voxels using a diffusion model and plans cuts for non-destructive extraction of the target internal part based on the estimation results. Voxel representation allows the model to predict only attributes at fixed grid positions, i.e., types of constituent parts, making learning more tractable. The diffusion model completes the voxel representation conditioned on observed cutting surfaces, capturing uncertainty in unobserved regions to avoid erroneous cuts. Experimental results in simulation suggest that the proposed method can estimate internal structures from observed cutting surfaces and enable non-destructive extraction of the target internal part by leveraging the estimated uncertainty.
Chinese Translation
在回收和处置场所,通过切割周围结构非破坏性地提取目标内部部件(如电池和电动机)至关重要。然而,产品的多样性以及对拆解程序缺乏信息使得决定切割位置变得具有挑战性。本研究探讨了一种非破坏性提取目标内部部件的方法,该方法通过观察切割表面迭代估计内部结构,并根据估计结果制定切割计划。一个关键要求是从部分观察中估计目标部件存在的概率。然而,为此任务学习条件生成模型是具有挑战性的:3D形状表示的高维性使得学习变得困难,传统模型(例如条件变分自编码器)往往由于模式崩溃而无法捕捉多模态预测不确定性,导致过于自信的预测。为了解决这些问题,我们提出了VoxelDiffusionCut,该方法使用扩散模型迭代估计以体素表示的内部结构,并根据估计结果规划切割以实现目标内部部件的非破坏性提取。体素表示使模型仅能预测固定网格位置的属性,即组成部件的类型,从而使学习更加可行。扩散模型在观察到的切割表面条件下完成体素表示,捕捉未观察区域的不确定性,以避免错误切割。模拟实验结果表明,所提出的方法能够从观察到的切割表面估计内部结构,并通过利用估计的不确定性实现目标内部部件的非破坏性提取。
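The uncertainty-aware planning idea can be illustrated in miniature: average several sampled completions to obtain a per-cell probability that the target part is present, and only cut where that probability stays low. This sketch uses a 2D grid for brevity (the paper operates on 3D voxels) and is not the authors' diffusion-based estimator:

```python
def presence_probability(samples):
    """Per-cell probability that the target part is present, estimated
    by averaging binary completions drawn from a generative model."""
    n = len(samples)
    h, w = len(samples[0]), len(samples[0][0])
    return [[sum(s[i][j] for s in samples) / n for j in range(w)]
            for i in range(h)]

def safe_to_cut(prob, threshold=0.05):
    """A cell is safe to cut only if the chance of hitting the target
    part stays below the threshold across the sampled completions."""
    return [[p < threshold for p in row] for row in prob]
```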
cs.CV / 4 / 2603.00118
Efficient Image Super-Resolution with Multi-Scale Spatial Adaptive Attention Networks
基于多尺度空间自适应注意力网络的高效图像超分辨率
Abstract
This paper introduces a lightweight image super-resolution (SR) network, termed the Multi-scale Spatial Adaptive Attention Network (MSAAN), to address the common dilemma between high reconstruction fidelity and low model complexity in existing SR methods. The core of our approach is a novel Multi-scale Spatial Adaptive Attention Module (MSAA), designed to jointly model fine-grained local details and long-range contextual dependencies. The MSAA comprises two synergistic components: a Global Feature Modulation Module (GFM) that learns coherent texture structures through differential feature extraction, and a Multi-scale Feature Aggregation Module (MFA) that adaptively fuses features from local to global scales using pyramidal processing. To further enhance the network's capability, we propose a Local Enhancement Block (LEB) to strengthen local geometric perception and a Feature Interactive Gated Feed-Forward Module (FIGFF) to improve nonlinear representation while reducing channel redundancy. Extensive experiments on standard benchmarks (Set5, Set14, B100, Urban100, Manga109) across $\times2$, $\times3$, and $\times4$ scaling factors demonstrate that both our lightweight (MSAAN-light) and standard (MSAAN) versions achieve superior or competitive performance in terms of PSNR and SSIM, while maintaining significantly lower parameters and computational costs than state-of-the-art methods. Ablation studies validate the contribution of each component, and visual results show that MSAAN reconstructs sharper edges and more realistic textures.
Chinese Translation
本文介绍了一种轻量级图像超分辨率(SR)网络,称为多尺度空间自适应注意力网络(MSAAN),旨在解决现有SR方法中高重建保真度与低模型复杂性之间的常见困境。我们方法的核心是一个新颖的多尺度空间自适应注意力模块(MSAA),旨在共同建模细粒度的局部细节和长距离的上下文依赖关系。MSAA由两个协同组件组成:一个全局特征调制模块(GFM),通过差异特征提取学习一致的纹理结构,以及一个多尺度特征聚合模块(MFA),利用金字塔处理自适应地融合从局部到全局尺度的特征。为了进一步增强网络的能力,我们提出了一个局部增强块(LEB),以加强局部几何感知,并引入一个特征交互门控前馈模块(FIGFF),以改善非线性表示,同时减少通道冗余。在标准基准(Set5、Set14、B100、Urban100、Manga109)上进行的广泛实验表明,我们的轻量级版本(MSAAN-light)和标准版本(MSAAN)在PSNR和SSIM方面均实现了优越或具有竞争力的性能,同时相比于最先进的方法保持了显著更低的参数和计算成本。消融研究验证了每个组件的贡献,视觉结果显示MSAAN重建了更清晰的边缘和更真实的纹理。
cs.CV / 5 / 2603.00119
BiSe-Unet: A Lightweight Dual-path U-Net with Attention-refined Context for Real-time Medical Image Segmentation
BiSe-Unet:一种轻量级双路径 U-Net,具有注意力精炼上下文的实时医学图像分割
Abstract
During image-guided procedures, real-time image segmentation is often required. This demands lightweight AI models that can operate on resource-constrained devices. One important use case is endoscopy-guided colonoscopy, where polyps must be detected in real time. The Kvasir-Seg dataset, a publicly available benchmark for this task, contains 1,000 high-resolution endoscopic images of polyps with corresponding pixel-level segmentation masks. Achieving real-time inference speed for clinical deployment in constrained environments requires highly efficient and lightweight network architectures. However, many existing models remain too computationally intensive for embedded deployment. Lightweight architectures, although faster, often suffer from reduced spatial precision and weaker contextual understanding, leading to degraded boundary quality and reduced diagnostic reliability. To address these challenges, we introduce BiSe-UNet, a lightweight dual-path U-Net that integrates an attention-refined context path with a shallow spatial path for detailed feature preservation, followed by a depthwise separable decoder for efficient reconstruction. Evaluated on the Kvasir-Seg dataset, BiSe-UNet achieves competitive Dice and IoU scores while sustaining real-time throughput exceeding 30 FPS on Raspberry Pi 5, demonstrating its effectiveness for accurate, lightweight, and deployable medical image segmentation on edge hardware.
Chinese Translation
在图像引导的操作中,通常需要实时图像分割。这要求轻量级的人工智能模型能够在资源受限的设备上运行。一个重要的应用场景是内窥镜引导的结肠镜检查,其中必须实时检测息肉。Kvasir-Seg 数据集是一个公开可用的基准数据集,包含 1,000 张高分辨率的息肉内窥镜图像及其对应的像素级分割掩码。在受限环境中实现临床部署的实时推理速度需要高度高效和轻量级的网络架构。然而,许多现有模型在嵌入式部署中仍然计算密集。尽管轻量级架构速度更快,但通常会出现空间精度降低和上下文理解较弱的问题,导致边界质量下降和诊断可靠性降低。为了解决这些挑战,我们提出了 BiSe-UNet,这是一种轻量级双路径 U-Net,它将注意力精炼的上下文路径与浅层空间路径结合,以保留详细特征,随后使用深度可分离解码器进行高效重建。在 Kvasir-Seg 数据集上的评估表明,BiSe-UNet 在保持超过 30 FPS 的实时吞吐量的同时,达到了具有竞争力的 Dice 和 IoU 分数,证明了其在边缘硬件上进行准确、轻量且可部署的医学图像分割的有效性。
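The Dice and IoU scores used for evaluation can be computed from binary masks as follows (a standard formulation, independent of the paper's code):

```python
def dice_iou(pred, target):
    """Dice and IoU for binary masks given as flat 0/1 sequences."""
    inter = sum(p & t for p, t in zip(pred, target))
    p_sum, t_sum = sum(pred), sum(target)
    union = p_sum + t_sum - inter
    # Both metrics conventionally score 1.0 when both masks are empty.
    dice = 2 * inter / (p_sum + t_sum) if (p_sum + t_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou
```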
cs.CV / 6 / 2603.00122
NovaLAD: A Fast, CPU-Optimized Document Extraction Pipeline for Generative AI and Data Intelligence
NovaLAD:一个快速、CPU优化的文档提取管道,用于生成式人工智能和数据智能
Abstract
Document extraction is a prerequisite for retrieval-augmented generation (RAG), knowledge bases, and downstream generative AI: it turns unstructured documents such as PDFs and scans into structured text and layout-aware representations. We introduce NovaLAD, a comprehensive document parsing system that integrates two concurrent YOLO object detection models - element detection and layout detection - with rule-based grouping and optional vision-language enhancement. Each incoming page image is processed by both models simultaneously: the element model detects semantic content such as titles, headers, text, tables, and images, while the layout model detects structural regions such as layout_box, column_group, multi_column, and row_group. A key design decision is to first route each image or figure through an image classifier (ViT) that judges its relevance; only useful images are then submitted to the Vision LLM for title, summary, and structured information extraction, reducing noise and cost. NovaLAD is built for speed: it runs on CPU, employs parallel execution for detection, classification, OCR, and conversion, and emits several output formats, including structured JSON, Markdown, RAG-ready texts, and knowledge graphs. On the DP-Bench benchmark (upstage/dp-bench), NovaLAD achieves 96.49% TEDS and 98.51% NID, surpassing both commercial and open-source parsers. This paper details the extraction pipeline, the system architecture, the data flow, and the design choices that keep NovaLAD both accurate and usable without a GPU.
Chinese Translation
文档提取是检索增强生成(RAG)、知识库和下游生成式人工智能工作之前的重要步骤。它将PDF和扫描等非结构化文档转化为结构化文本和布局感知表示。我们介绍了NovaLAD,一个综合的文档解析系统,集成了两个并行的YOLO目标检测模型——元素检测和布局检测——以及基于规则的分组和可选的视觉-语言增强。当页面图像被输入时,首先进行的是同时通过这两个模型的处理。元素模型识别出语义内容,如标题、头部、文本、表格、图像等,而布局模型则识别出结构区域,如layout_box、column_group、multi_column、row_group等。一个关键的设计决策是首先将图像或图形通过一个图像分类器(ViT),以决定其是否相关。只有有用的图像才会提交给视觉语言模型(Vision LLM)进行标题、摘要和结构化信息的提取,从而减少噪声和成本。NovaLAD旨在提高速度:它在CPU上运行,采用并行执行进行检测、分类、光学字符识别(OCR)和转换,并生成多种形式,包括结构化JSON、Markdown、适用于RAG的文本和知识图谱。我们在DP-Bench基准(upstage/dp-bench)上进行测试,获得了96.49%的TEDS和98.51%的NID,优于商业和开源解析器。本文解释了如何提取数据、架构如何运作、数据流动的方式,以及如何在不需要GPU的情况下使NovaLAD既准确又可用。
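The concurrent two-model stage can be sketched with a thread pool: both detectors run on the same page image, and elements are then grouped under the layout region that contains them. The detector stubs and output schema below are invented for illustration; they are not NovaLAD's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in detectors; in NovaLAD these are two YOLO models.
def detect_elements(page):
    return [{"type": "title", "bbox": (0, 0, 100, 10)},
            {"type": "text", "bbox": (0, 12, 100, 80)}]

def detect_layout(page):
    return [{"type": "layout_box", "bbox": (0, 0, 100, 90)}]

def parse_page(page):
    """Run element and layout detection concurrently, then group
    elements under the layout region that contains them."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_elem = pool.submit(detect_elements, page)
        f_layout = pool.submit(detect_layout, page)
        elements, regions = f_elem.result(), f_layout.result()

    def inside(inner, outer):
        return (outer[0] <= inner[0] and outer[1] <= inner[1]
                and inner[2] <= outer[2] and inner[3] <= outer[3])

    return [
        {"region": r["type"],
         "children": [e for e in elements if inside(e["bbox"], r["bbox"])]}
        for r in regions
    ]
```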
cs.CV / 7 / 2603.00123
CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers
CT-Flow:通过模型上下文协议服务器协调CT解读工作流程
Abstract
Recent advances in Large Vision-Language Models (LVLMs) have shown strong potential for multi-modal radiological reasoning, particularly in tasks like diagnostic visual question answering (VQA) and radiology report generation. However, most existing approaches for 3D CT analysis largely rely on static, single-pass inference. In practice, clinical interpretation is a dynamic, tool-mediated workflow where radiologists iteratively review slices and use measurement, radiomics, and segmentation tools to refine findings. To bridge this gap, we propose CT-Flow, an agentic framework designed for interoperable volumetric interpretation. By leveraging the Model Context Protocol (MCP), CT-Flow shifts from closed-box inference to an open, tool-aware paradigm. We curate CT-FlowBench, the first large-scale instruction-tuning benchmark tailored for 3D CT tool-use and multi-step reasoning. Built upon this, CT-Flow functions as a clinical orchestrator capable of decomposing complex natural language queries into automated tool-use sequences. Experimental evaluations on CT-FlowBench and standard 3D VQA datasets demonstrate that CT-Flow achieves state-of-the-art performance, surpassing baseline models by 41% in diagnostic accuracy and achieving a 95% success rate in autonomous tool invocation. This work provides a scalable foundation for integrating autonomous, agentic intelligence into real-world clinical radiology.
Chinese Translation
近年来,大型视觉语言模型(LVLMs)的进展显示出在多模态放射学推理方面的强大潜力,特别是在诊断视觉问答(VQA)和放射学报告生成等任务中。然而,现有的大多数3D CT分析方法主要依赖于静态的单次推理。在实际应用中,临床解读是一个动态的、工具中介的工作流程,放射科医生需要反复审查切片,并使用测量、放射组学和分割工具来细化发现。为了解决这一问题,我们提出了CT-Flow,一个旨在实现可互操作体积解读的智能框架。通过利用模型上下文协议(MCP),CT-Flow从封闭式推理转变为开放的、工具感知的范式。我们策划了CT-FlowBench,这是第一个针对3D CT工具使用和多步骤推理的大规模指令调优基准。在此基础上,CT-Flow作为临床协调者,能够将复杂的自然语言查询分解为自动化的工具使用序列。在CT-FlowBench和标准3D VQA数据集上的实验评估表明,CT-Flow在诊断准确性上超过基线模型41%,并在自主工具调用中实现了95%的成功率。这项工作为将自主的、智能的智能整合到现实世界的临床放射学中提供了可扩展的基础。
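The orchestration idea, decomposing a natural language query into an automated tool-use sequence, can be sketched with a minimal tool registry. The tool names and outputs below are invented for illustration; they are not the paper's actual MCP servers:

```python
# Each tool takes the running context and returns either an updated
# context or a final answer.  Behaviour here is mocked.
TOOLS = {
    "segment_organ": lambda ctx, organ: {**ctx, "mask": f"{organ}_mask"},
    "measure_volume": lambda ctx: {**ctx, "volume_ml": 1450.0},
    "report": lambda ctx: f"{ctx['mask']}: {ctx['volume_ml']} ml",
}

def run_plan(plan, ctx=None):
    """Execute a tool-call sequence, threading the context through so
    each step can build on earlier tool outputs."""
    ctx = ctx or {}
    out = None
    for name, args in plan:
        out = TOOLS[name](ctx, *args)
        if isinstance(out, dict):
            ctx = out
    return out

# A plan an orchestrator might emit for "how large is the liver?"
plan = [("segment_organ", ("liver",)),
        ("measure_volume", ()),
        ("report", ())]
```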
cs.CV / 8 / 2603.00124
OrthoAI: A Lightweight Deep Learning Framework for Automated Biomechanical Analysis in Clear Aligner Orthodontics -- A Methodological Proof-of-Concept
OrthoAI:一种轻量级深度学习框架,用于清晰矫治器正畸中的自动生物力学分析——方法学概念验证
Abstract
Clear aligner therapy now dominates orthodontics, yet clinician review of digitally planned tooth movements-typically via ClinCheck (Align Technology)-remains slow and error-prone. We present OrthoAI, an open-source proof-of-concept decision-support system combining lightweight 3D dental segmentation with automated biomechanical analysis to assist treatment-plan evaluation. The framework uses a Dynamic Graph CNN trained on landmark-reconstructed point clouds from 3DTeethLand (MICCAI) and integrates a rule-based biomechanical engine grounded in orthodontic evidence (Kravitz et al 2009; Simon et al 2014). The system decomposes per-tooth motion across six degrees of freedom, computes movement-specific predictability, issues alerts when biomechanical limits are exceeded, and derives an exploratory composite index. With 60,705 trainable parameters, segmentation reaches a Tooth Identification Rate of $81.4\%$ and mIoU of $8.25\%$ on surrogate point clouds-reflecting sparse landmark supervision rather than dense meshes. Although spatial boundaries are coarse, downstream analysis depends mainly on tooth identity and approximate centroid/axis estimation. Results establish a baseline for future full-mesh training and highlight current perceptual limits. The end-to-end pipeline runs in $<4s$ on consumer hardware. Code, weights, and analysis tools are released to support reproducible research in geometric deep learning and digital orthodontics. The system has not been validated on real intraoral meshes and should not be assumed to generalize beyond landmark-derived representations.
Chinese Translation
清晰矫治器治疗现已主导正畸领域,但临床医生对数字化规划的牙齿移动(通常通过ClinCheck(Align Technology)进行)的审查仍然缓慢且易出错。我们提出了OrthoAI,一个开源的概念验证决策支持系统,结合轻量级3D牙齿分割与自动生物力学分析,以辅助治疗计划评估。该框架使用基于3DTeethLand(MICCAI)中标志点重建点云训练的动态图卷积神经网络(Dynamic Graph CNN),并集成了基于正畸证据的规则驱动生物力学引擎(Kravitz et al 2009; Simon et al 2014)。该系统在六个自由度上分解每颗牙齿的运动,计算运动特定的可预测性,当生物力学限制被超越时发出警报,并推导出探索性复合指数。具有60,705个可训练参数的分割在代理点云上达到了81.4%的牙齿识别率和8.25%的平均交并比(mIoU),反映出稀疏标志点监督而非密集网格。尽管空间边界较粗糙,但下游分析主要依赖于牙齿身份和近似质心/轴的估计。结果为未来的全网格训练建立了基线,并突出了当前的感知限制。该端到端管道在消费级硬件上运行时间少于4秒。代码、权重和分析工具已发布,以支持几何深度学习和数字正畸领域的可重复研究。该系统尚未在真实口内网格上进行验证,不应假定其能够推广到超出标志点导出表示的情况。
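Decomposing per-tooth motion across six degrees of freedom amounts to splitting a rigid transform into three translations and three rotation angles. A sketch under the assumption of an intrinsic Z-Y-X Euler convention, which the abstract does not specify:

```python
import math

def decompose_rigid(R, t):
    """Split a per-tooth rigid motion (3x3 rotation R, translation t)
    into six degrees of freedom: three translations plus Z-Y-X Euler
    angles in degrees.  The axis convention is an assumption."""
    sy = math.hypot(R[0][0], R[1][0])
    if sy > 1e-9:
        rx = math.atan2(R[2][1], R[2][2])
        ry = math.atan2(-R[2][0], sy)
        rz = math.atan2(R[1][0], R[0][0])
    else:  # gimbal lock: pitch near +/-90 degrees
        rx = math.atan2(-R[1][2], R[1][1])
        ry = math.atan2(-R[2][0], sy)
        rz = 0.0
    return {"tx": t[0], "ty": t[1], "tz": t[2],
            "rx_deg": math.degrees(rx),
            "ry_deg": math.degrees(ry),
            "rz_deg": math.degrees(rz)}

def exceeds_limits(dof, limits):
    """Flag the degrees of freedom that exceed per-DoF biomechanical
    limits, mimicking the alert step described in the abstract."""
    return [k for k, v in dof.items() if abs(v) > limits.get(k, float("inf"))]
```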
cs.CV / 9 / 2603.00126
QuickGrasp: Responsive Video-Language Querying Service via Accelerated Tokenization and Edge-Augmented Inference
QuickGrasp:通过加速标记化和边缘增强推理实现的响应式视频语言查询服务
Abstract
Video-language models (VLMs) are reshaping video querying services, bringing unified solutions to complex perception and reasoning tasks. However, deploying large VLMs in real-world systems remains challenging due to their high resource demands, and remote-based deployment often results in unacceptable response delays. Although small, locally deployable VLMs offer faster responses, they unavoidably fall short in accuracy. To reconcile this trade-off, we propose QuickGrasp, a responsive, quality of service (QoS)-aware system that bridges this gap through a local-first architecture with on-demand edge augmentation. Built upon the highly modular architecture of VLMs, QuickGrasp shares the vision representation across model variants to avoid redundant computation. To maximize system-wide efficiency, QuickGrasp introduces three key designs: accelerated video tokenization, query-adaptive edge augmentation, and delay-aware, accuracy-preserving vision token density configuration. We implement a prototype of QuickGrasp and evaluate it across multiple video understanding benchmarks. The results show that QuickGrasp matches the accuracy of large VLMs while achieving up to a 12.8x reduction in response delay. QuickGrasp represents a key advancement toward building responsive video querying services for open-world understanding that fully leverage the capabilities of VLMs.
Chinese Translation
视频语言模型(VLMs)正在重塑视频查询服务,为复杂的感知和推理任务提供统一的解决方案。然而,由于其高资源需求,在现实世界系统中部署大型VLM仍然面临挑战,远程部署往往导致不可接受的响应延迟。尽管小型、可本地部署的VLM提供了更快的响应,但在准确性上不可避免地存在不足。为了解决这一权衡,我们提出了QuickGrasp,一个响应式、关注服务质量(QoS)的系统,通过本地优先架构与按需边缘增强来弥合这一差距。QuickGrasp建立在VLM高度模块化的架构之上,跨模型变体共享视觉表示,以避免冗余计算。为了最大化系统整体效率,QuickGrasp引入了三个关键设计:加速视频标记化、查询自适应边缘增强,以及延迟感知、保持准确性的视觉标记密度配置。我们实现了QuickGrasp的原型,并在多个视频理解基准上进行了评估。结果表明,QuickGrasp在准确性上与大型VLM相匹配,同时实现了高达12.8倍的响应延迟减少。QuickGrasp代表了在开放世界理解中构建响应式视频查询服务的重要进展,充分利用了VLM的能力。
cs.CV / 10 / 2603.00127
Segmenting Low-Contrast XCTs of Concretes: An Unsupervised Approach
低对比度混凝土XCT的分割:一种无监督方法
Abstract
This work tests a self-annotation-based unsupervised methodology for training a convolutional neural network (CNN) model for semantic segmentation of X-ray computed tomography (XCT) scans of concretes. Concrete poses a unique challenge for XCT imaging due to the similar X-ray attenuation coefficients of aggregates and mortar, resulting in low contrast between the two phases in the ensuing images. While CNN-based models are a proven technique for semantic segmentation in such challenging cases, they typically require labeled training data, which is often unavailable for new datasets or costly to obtain. To counter that limitation, a self-annotation technique is used here which leverages superpixel algorithms to identify perceptually similar local regions in an image and relates them to the global context in the image by utilizing the receptive field of a CNN-based model. This enables the model to learn a global-local relationship in the images and to identify semantically similar structures. We therefore present the performance of the unsupervised training methodology on our XCT datasets and discuss potential avenues for further improvements.
Chinese Translation
本研究测试了一种基于自我标注的无监督方法,用于训练卷积神经网络(CNN)模型,以实现混凝土X射线计算机断层扫描(XCT)图像的语义分割。由于骨料和砂浆具有相似的X射线衰减系数,混凝土在XCT成像中面临独特的挑战,导致后续图像中两相之间的对比度较低。尽管基于CNN的模型在此类挑战性情况下的语义分割中已被证明是一种有效技术,但它们通常需要标注的训练数据,而这些数据在新的数据集中往往不可用或获取成本高昂。为了解决这一限制,本文采用了一种自我标注技术,该技术利用超像素算法识别图像中感知相似的局部区域,并通过利用CNN模型的感受野将其与图像的全局上下文相关联。这使得模型能够学习图像中的全局-局部关系,并识别语义上相似的结构。因此,我们展示了无监督训练方法在我们的XCT数据集上的性能,并讨论了进一步改进的潜在途径。
cs.CV / 11 / 2603.00132
Predicting Local Climate Zones using Urban Morphometrics and Satellite Imagery
利用城市形态测量和卫星影像预测地方气候区
Abstract
The Local Climate Zone (LCZ) framework is commonly employed to represent urban form in morphological analyses, even though its mapping predominantly relies on satellite imagery. Urban morphometrics, describing urban form via numerical measures of the physical aspects and spatial relationships of its elements, offers another avenue. This study evaluates the ability of morphometric assessment to predict LCZs using a) a morphometric-based LCZ prediction, and b) a fusion-based LCZ prediction combining morphometrics with satellite imagery. We calculate 321 2D morphometric attributes from building footprints and street networks, covering their various properties at multiple spatial scales. Subsequently, we develop four classification schemes: morphometric-based prediction, baseline image-based prediction, and two techniques fusing morphometrics with imagery. We evaluate them across five sites. Results from the morphometric-based prediction indicate that the correspondence between 2D urban morphometrics and urban LCZ types is selective and inconsistent, rendering the efficacy of this method site-dependent. Nevertheless, it demonstrated that a much broader range of urban form properties is relevant for distinguishing LCZ types compared to standard parameters. Relative to the image-based baseline, the fusion yielded relatively distinct accuracy improvements for urban LCZ types at two sites; however, gains at the remaining sites were negligible or even slightly negative, suggesting that the benefits of fusion are modest and inconsistent. Collectively, these results indicate that the relationship between the LCZs and the measurable, visible aspects of urban form is tenuous, thus the LCZ framework should be used with caution in morphological studies.
Chinese Translation
地方气候区(Local Climate Zone, LCZ)框架通常用于在形态分析中表示城市形态,尽管其映射主要依赖于卫星影像。城市形态测量通过对城市元素的物理特征和空间关系进行数值度量,提供了另一种途径。本研究评估了形态测量评估预测LCZ的能力,采用了a) 基于形态测量的LCZ预测,以及b) 将形态测量与卫星影像相结合的融合型LCZ预测。我们从建筑轮廓和街道网络中计算了321个二维形态测量属性,涵盖了其在多个空间尺度上的各种特性。随后,我们开发了四种分类方案:基于形态测量的预测、基线影像预测,以及两种将形态测量与影像融合的技术。我们在五个地点对这些方案进行了评估。基于形态测量的预测结果表明,二维城市形态测量与城市LCZ类型之间的对应关系是选择性的且不一致的,这使得该方法的有效性具有地点依赖性。然而,它表明,与标准参数相比,区分LCZ类型所需的城市形态特性的范围要广泛得多。相较于基于影像的基线,融合在两个地点的城市LCZ类型上产生了相对明显的准确性提升;然而,在其余地点的增益微不足道,甚至略有负值,表明融合的好处是适度且不一致的。总体而言,这些结果表明,LCZ与城市形态的可测量、可视化方面之间的关系是脆弱的,因此在形态学研究中应谨慎使用LCZ框架。
cs.CV / 12 / 2603.00133
You Don't Need All That Attention: Surgical Memorization Mitigation in Text-to-Image Diffusion Models
你不需要那么多注意力:文本到图像扩散模型中的外科式记忆减轻
Abstract
Generative models have been shown to "memorize" certain training data, leading to verbatim or near-verbatim generating images, which may cause privacy concerns or copyright infringement. We introduce Guidance Using Attractive-Repulsive Dynamics (GUARD), a novel framework for memorization mitigation in text-to-image diffusion models. GUARD adjusts the image denoising process to guide the generation away from an original training image and towards one that is distinct from training data while remaining aligned with the prompt, guarding against reproducing training data, without hurting image generation quality. We propose a concrete instantiation of this framework, where the positive target that we steer towards is given by a novel method for (cross) attention attenuation based on (i) a novel statistical mechanism that automatically identifies the prompt positions where cross attention must be attenuated and (ii) attenuating cross-attention in these per-prompt locations. The resulting GUARD offers a surgical, dynamic per-prompt inference-time approach that, we find, is by far the most robust method in terms of consistently producing state-of-the-art results for memorization mitigation across two architectures and for both verbatim and template memorization, while also improving upon or yielding comparable results in terms of image quality.
Chinese Translation
生成模型已被证明会“记忆”某些训练数据,导致逐字或近乎逐字生成图像,这可能引发隐私问题或版权侵犯。我们提出了一种新的记忆减轻框架——使用吸引-排斥动态的引导(Guidance Using Attractive-Repulsive Dynamics, GUARD),用于文本到图像扩散模型中。GUARD 调整图像去噪过程,以引导生成远离原始训练图像,朝向与训练数据不同但仍与提示一致的图像,从而防止重现训练数据,同时不损害图像生成质量。我们提出了该框架的具体实例,其中我们引导的正目标是通过一种新方法实现的(交叉)注意力衰减,该方法基于(i)一种新统计机制,自动识别需要衰减交叉注意力的提示位置,以及(ii)在这些每个提示位置衰减交叉注意力。最终的 GUARD 提供了一种外科式的动态每提示推理时间方法,我们发现,这在两个架构中对于记忆减轻而言,是迄今为止最稳健的方法,能够持续产生最先进的结果,适用于逐字和模板记忆,同时在图像质量方面也有所改善或产生可比结果。
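The per-prompt attenuation step can be sketched as scaling the cross-attention weights at flagged token positions and renormalizing. Identifying which positions to attenuate is the paper's statistical mechanism; here they are simply given as input:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attenuate_cross_attention(weights, positions, factor=0.1):
    """Scale the attention weights at the flagged prompt positions and
    renormalize, so the image query attends less to trigger tokens."""
    scaled = [w * factor if i in positions else w
              for i, w in enumerate(weights)]
    z = sum(scaled)
    return [w / z for w in scaled]
```

With `factor=0.0` the flagged token is masked out entirely; intermediate factors merely down-weight it.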
cs.CV / 13 / 2603.00136
TinyVLM: Zero-Shot Object Detection on Microcontrollers via Vision-Language Distillation with Matryoshka Embeddings
TinyVLM:通过使用 Matryoshka 嵌入的视觉-语言蒸馏在微控制器上实现零样本物体检测
Abstract
Zero-shot object detection enables recognising novel objects without task-specific training, but current approaches rely on large vision language models (VLMs) like CLIP that require hundreds of megabytes of memory - far exceeding the constraints of micro controller units (MCUs). We present TinyVLM, the first framework enabling zero-shot object detection on resource-constrained MCUs with less than 1MB of memory. Our approach introduces three key innovations: (1) a decoupled architecture that separates visual inference from text encoding, allowing precomputed class embeddings to be stored in flash memory; (2) Matryoshka distillation that trains nested embeddings at multiple dimensions (16-256), enabling flexible accuracy-memory trade-offs; and (3) quantized embedding storage that reduces class prototype memory by 4x with minimal accuracy loss. Trained on Conceptual Captions 3M (CC3M), TinyVLM achieves competitive zero-shot accuracy on COCO, Flowers102, and Food101 while requiring only 285KB of RAM and 892KB of flash memory for the deployed vision encoder. We demonstrate real-time inference at 26 FPS on STM32H7 and over 1,000 FPS on MAX78000 with its CNN accelerator, enabling practical zero-shot detection on edge devices for the first time.
Chinese Translation
零样本物体检测使得在没有特定任务训练的情况下识别新物体成为可能,但当前的方法依赖于大型视觉语言模型(VLM),如CLIP,这些模型需要数百兆字节的内存,远远超出了微控制器单元(MCU)的限制。我们提出了TinyVLM,这是第一个能够在资源受限的MCU上实现零样本物体检测的框架,其内存小于1MB。我们的方法引入了三个关键创新:(1)解耦架构,将视觉推理与文本编码分开,允许将预计算的类别嵌入存储在闪存中;(2)Matryoshka 蒸馏,在多个维度(16-256)上训练嵌套嵌入,实现灵活的准确性-内存权衡;(3)量化嵌入存储,将类别原型内存减少4倍,同时保持最小的准确性损失。在Conceptual Captions 3M(CC3M)上训练的TinyVLM在COCO、Flowers102和Food101上实现了具有竞争力的零样本准确性,同时仅需285KB的RAM和892KB的闪存用于部署的视觉编码器。我们在STM32H7上展示了26 FPS的实时推理,并在MAX78000的CNN加速器上超过1,000 FPS,首次实现了在边缘设备上的实用零样本检测。
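Two of the three innovations lend themselves to a compact sketch: Matryoshka-style embeddings are used by truncating to a prefix of the full vector, and class prototypes are stored as int8 with one scale factor. The helper names and the toy prototypes are illustrative, not TinyVLM's code:

```python
import math

def truncate(embedding, dim):
    """Matryoshka-style use: the first `dim` coordinates of a nested
    embedding form a valid smaller embedding."""
    return embedding[:dim]

def quantize_int8(vec):
    """Symmetric int8 quantization: one float scale plus 1 byte per
    dimension, a 4x saving over float32 prototypes."""
    scale = max(abs(v) for v in vec) / 127 or 1.0
    return scale, [round(v / scale) for v in vec]

def dequantize(scale, qvec):
    return [q * scale for q in qvec]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def classify(image_emb, prototypes, dim):
    """Zero-shot: pick the class whose dequantized text prototype is
    closest in cosine similarity at the chosen nesting dimension."""
    return max(prototypes,
               key=lambda c: cosine(truncate(image_emb, dim),
                                    dequantize(*prototypes[c])[:dim]))
```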
cs.CV / 14 / 2603.00138
Latent Replay Detection: Memory-Efficient Continual Object Detection on Microcontrollers via Task-Adaptive Compression
潜在重放检测:基于任务自适应压缩的微控制器上内存高效的持续目标检测
Abstract
Deploying object detection on microcontrollers (MCUs) enables intelligent edge devices but current models cannot learn new object categories after deployment. Existing continual learning methods require storing raw images far exceeding MCU memory budgets of tens of kilobytes. We present Latent Replay Detection (LRD), the first framework for continual object detection under MCU memory constraints. Our key contributions are: 1. Task-Adaptive Compression: Unlike fixed PCA, we propose learnable compression with FiLM (Feature-wise Linear Modulation) conditioning, where task specific embeddings modulate the compression to preserve discriminative features for each task's distribution; 2. Spatial-Diverse Exemplar Selection: Traditional sampling ignores spatial information critical for detection - we select exemplars maximizing bounding box diversity via farthest-point sampling in IoU space, preventing localization bias in replay; 3. MCU-Deployable System: Our latent replay stores 150 bytes per sample versus >10KB for images, enabling a 64KB buffer to hold 400+ exemplars. Experiments on CORe50 (50 classes, 5 tasks) demonstrate that LRD achieves mAP@50 on the initial task and maintains strong performance across subsequent tasks - a significant improvement over naive fine-tuning while operating within strict MCU constraints. Our task-adaptive FiLM compression and spatial diverse exemplar selection work synergistically to preserve detection capabilities. Deployed on STM32H753ZI, ESP32-S3, and MAX78000 MCUs, LRD achieves 4.9-97.5ms latency per inference within a 64KB memory budget-enabling practical continual detection on edge devices for the first time.
Chinese Translation
在微控制器(MCUs)上部署目标检测使智能边缘设备成为可能,但当前模型在部署后无法学习新的目标类别。现有的持续学习方法需要存储原始图像,远超出微控制器数十千字节的内存预算。我们提出了潜在重放检测(Latent Replay Detection, LRD),这是第一个在微控制器内存限制下进行持续目标检测的框架。我们的主要贡献包括:1. 任务自适应压缩:与固定的主成分分析(PCA)不同,我们提出了可学习的压缩方法,采用特征线性调制(FiLM)条件,其中任务特定的嵌入调节压缩,以保留每个任务分布的判别特征;2. 空间多样性示例选择:传统采样忽略了对检测至关重要的空间信息——我们通过在交并比(IoU)空间中进行最远点采样,选择最大化边界框多样性的示例,防止重放中的定位偏差;3. 微控制器可部署系统:我们的潜在重放每个样本存储150字节,而图像则超过10KB,使得64KB的缓冲区能够容纳400多个示例。在CORe50(50类,5个任务)上的实验表明,LRD在初始任务上实现了mAP@50,并在后续任务中保持了强劲的性能——相较于简单的微调,显著提升,同时在严格的微控制器限制内运行。我们的任务自适应FiLM压缩和空间多样性示例选择协同工作,以保留检测能力。部署在STM32H753ZI、ESP32-S3和MAX78000微控制器上,LRD在64KB内存预算内每次推理的延迟为4.9-97.5毫秒——首次实现了边缘设备上的实用持续检测。
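The spatial-diverse exemplar selection can be sketched as greedy farthest-point sampling under the distance 1 - IoU, so each kept exemplar overlaps the previously kept ones as little as possible. A minimal version, independent of the paper's implementation:

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def farthest_point_sample(boxes, k):
    """Greedy farthest-point sampling with distance = 1 - IoU: each new
    exemplar is the box least overlapping those already kept, which
    maximizes the spatial diversity of the replay buffer."""
    chosen = [0]
    while len(chosen) < min(k, len(boxes)):
        def min_dist(i):
            return min(1 - iou(boxes[i], boxes[j]) for j in chosen)
        nxt = max((i for i in range(len(boxes)) if i not in chosen),
                  key=min_dist)
        chosen.append(nxt)
    return chosen
```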
cs.CV / 15 / 2603.00139
Towards Data-driven Nitrogen Estimation in Wheat Fields using Multispectral Images
基于多光谱图像的小麦田氮素估计的数据驱动方法
Abstract
The modernization of agriculture has motivated the development of advanced analytics and decision-support systems to improve resource utilization and reduce environmental impacts. Targeted Spraying and Fertilization (TSF) is a critical operation that enables farmers to apply inputs more precisely, optimizing resource use and promoting environmental sustainability. However, accurate TSF is a challenging problem, due to external factors such as crop type, fertilization phase, soil conditions, and weather dynamics. In this paper, we present TerrAI, a Neural Network-based solution for TSF, which considers the spatio-temporal variability across different parcels. Our experimental study over a real-world remote sensing dataset validates the soundness of TerrAI for data-driven agricultural practices.
Chinese Translation
农业现代化促使了先进分析和决策支持系统的发展,以提高资源利用效率并减少环境影响。精准喷洒和施肥(Targeted Spraying and Fertilization, TSF)是一个关键操作,使农民能够更精确地施用投入,优化资源使用并促进环境可持续性。然而,由于作物类型、施肥阶段、土壤条件和天气动态等外部因素,准确的 TSF 是一个具有挑战性的问题。本文提出了 TerrAI,一种基于神经网络的 TSF 解决方案,考虑了不同地块之间的时空变异性。我们在真实世界遥感数据集上的实验研究验证了 TerrAI 在数据驱动农业实践中的有效性。
cs.CV / 16 / 2603.00140
Steering Away from Memorization: Reachability-Constrained Reinforcement Learning for Text-to-Image Diffusion
远离记忆化:针对文本到图像扩散的可达性约束强化学习
Abstract
Text-to-image diffusion models often memorize training data, revealing a fundamental failure to generalize beyond the training set. Current mitigation strategies typically sacrifice image quality or prompt alignment to reduce memorization. To address this, we propose Reachability-Aware Diffusion Steering (RADS), an inference-time framework that prevents memorization while preserving generation fidelity. RADS models the diffusion denoising process as a dynamical system and applies concepts from reachability analysis to approximate the "backward reachable tube"--the set of intermediate states that inevitably evolve into memorized samples. We then formulate mitigation as a constrained reinforcement learning (RL) problem, where a policy learns to steer the trajectory away from memorization via minimal perturbations in the caption embedding space. Empirical evaluations show that RADS achieves a superior Pareto frontier between generation diversity (SSCD), quality (FID), and alignment (CLIP) compared to state-of-the-art baselines. Crucially, RADS provides robust mitigation without modifying the diffusion backbone, offering a plug-and-play solution for safe generation. Our website is available at: https://s-karnik.github.io/rads-memorization-project-page/.
Chinese Translation
文本到图像扩散模型往往会记忆训练数据,显示出在训练集之外泛化的根本失败。目前的缓解策略通常牺牲图像质量或提示对齐以减少记忆化。为了解决这一问题,我们提出了可达性意识扩散引导(Reachability-Aware Diffusion Steering, RADS),这是一种推理时框架,能够在保持生成保真度的同时防止记忆化。RADS将扩散去噪过程建模为一个动态系统,并应用可达性分析的概念来近似“反向可达管”(backward reachable tube)——即不可避免地演变为记忆样本的中间状态集合。然后,我们将缓解问题表述为一个受约束的强化学习(Reinforcement Learning, RL)问题,其中策略通过在标题嵌入空间中的最小扰动学习引导轨迹远离记忆化。实证评估表明,与最先进的基线相比,RADS在生成多样性(SSCD)、质量(FID)和对齐(CLIP)之间实现了更优的帕累托前沿。重要的是,RADS在不修改扩散主干的情况下提供了稳健的缓解,提供了一种即插即用的安全生成解决方案。我们的官方网站可访问: https://s-karnik.github.io/rads-memorization-project-page/.
cs.CV / 17 / 2603.00141
From Scale to Speed: Adaptive Test-Time Scaling for Image Editing
从规模到速度:用于图像编辑的自适应测试时间缩放
Abstract
Image Chain-of-Thought (Image-CoT) is a test-time scaling paradigm that improves image generation by extending inference time. Most Image-CoT methods focus on text-to-image (T2I) generation. Unlike T2I generation, image editing is goal-directed: the solution space is constrained by the source image and instruction. This mismatch causes three challenges when applying Image-CoT to editing: inefficient resource allocation with fixed sampling budgets, unreliable early-stage verification using general MLLM scores, and redundant edited results from large-scale sampling. To address this, we propose ADaptive Edit-CoT (ADE-CoT), an on-demand test-time scaling framework to enhance editing efficiency and performance. It incorporates three key strategies: (1) a difficulty-aware resource allocation that assigns dynamic budgets based on estimated edit difficulty; (2) edit-specific verification in early pruning that uses region localization and caption consistency to select promising candidates; and (3) depth-first opportunistic stopping, guided by an instance-specific verifier, that terminates when intent-aligned results are found. Extensive experiments on three SOTA editing models (Step1X-Edit, BAGEL, FLUX.1 Kontext) across three benchmarks show that ADE-CoT achieves superior performance-efficiency trade-offs. With comparable sampling budgets, ADE-CoT obtains better performance with more than 2x speedup over Best-of-N.
Chinese Translation
图像思维链(Image Chain-of-Thought, Image-CoT)是一种测试时间缩放范式,通过延长推理时间来改善图像生成。大多数Image-CoT方法专注于文本到图像(text-to-image, T2I)生成。与T2I生成不同,图像编辑是目标导向的:解决方案空间受到源图像和指令的限制。这种不匹配在将Image-CoT应用于编辑时造成了三个挑战:固定采样预算下的资源分配效率低下、使用通用多模态大型语言模型(MLLM)评分的早期验证不可靠,以及大规模采样导致的冗余编辑结果。为了解决这些问题,我们提出了自适应编辑思维链(ADaptive Edit-CoT, ADE-CoT),这是一个按需的测试时间缩放框架,旨在提高编辑效率和性能。它包含三个关键策略:(1)基于估计的编辑难度进行的难度感知资源分配,动态分配预算;(2)在早期修剪中进行的特定编辑验证,利用区域定位和标题一致性选择有前景的候选;(3)由实例特定验证器指导的深度优先机会停止,当找到意图一致的结果时终止。在三个基准上对三种最先进的编辑模型(Step1X-Edit、BAGEL、FLUX.1 Kontext)进行的广泛实验表明,ADE-CoT在性能与效率的权衡上表现优越。在相当的采样预算下,ADE-CoT在性能上优于Best-of-N,且速度提升超过2倍。
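The depth-first opportunistic stopping described in strategy (3) can be illustrated with a minimal sketch: draw candidates one at a time, keep the best so far, and terminate as soon as a verifier score clears a threshold instead of exhausting a fixed Best-of-N budget. The `generate`/`verify` callables, the scalar score, and the threshold are illustrative assumptions, not the paper's actual API.

```python
def opportunistic_best_of_n(generate, verify, budget, threshold):
    """Sample candidates sequentially; stop early once the verifier score
    clears `threshold`, otherwise fall back to the best candidate seen
    within `budget` samples. `generate`/`verify` are caller-supplied
    callables (hypothetical interface, for illustration only)."""
    best_candidate, best_score = None, float("-inf")
    for _ in range(budget):
        candidate = generate()
        score = verify(candidate)
        if score > best_score:
            best_candidate, best_score = candidate, score
        if score >= threshold:  # intent-aligned result found: stop early
            break
    return best_candidate, best_score
```

With an easy edit, the loop exits after the first acceptable sample, which is where the claimed speedup over fixed-budget Best-of-N comes from.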
cs.CV / 18 / 2603.00143
GrapHist: Graph Self-Supervised Learning for Histopathology
GrapHist:用于组织病理学的图自监督学习
Abstract
Self-supervised vision models have achieved notable success in digital pathology. However, their domain-agnostic transformer architectures are not originally designed to account for fundamental biological elements of histopathology images, namely cells and their complex interactions. In this work, we hypothesize that a biologically-informed modeling of tissues as cell graphs offers a more efficient representation learning. Thus, we introduce GrapHist, a novel graph-based self-supervised learning framework for histopathology, which learns generalizable and structurally-informed embeddings that enable diverse downstream tasks. GrapHist integrates masked autoencoders and heterophilic graph neural networks that are explicitly designed to capture the heterogeneity of tumor microenvironments. We pre-train GrapHist on a large collection of 11 million cell graphs derived from breast tissues and evaluate its transferability across in- and out-of-domain benchmarks. Our results show that GrapHist achieves competitive performance compared to its vision-based counterparts in slide-, region-, and cell-level tasks, while requiring four times fewer parameters. It also drastically outperforms fully-supervised graph models on cancer subtyping tasks. Finally, we also release five graph-based digital pathology datasets used in our study at https://huggingface.co/ogutsevda/datasets , establishing the first large-scale graph benchmark in this field. Our code is available at https://github.com/ogutsevda/graphist .
Chinese Translation
自监督视觉模型在数字病理学中取得了显著成功。然而,它们的领域无关的变换器架构并不是最初设计用来考虑组织病理图像的基本生物元素,即细胞及其复杂的相互作用。在本研究中,我们假设将组织建模为细胞图的生物学信息化方法提供了更高效的表征学习。因此,我们提出了GrapHist,一个用于组织病理学的新颖图基自监督学习框架,该框架学习可泛化且结构信息丰富的嵌入,能够支持多样的下游任务。GrapHist集成了掩蔽自编码器和异质图神经网络,专门设计用于捕捉肿瘤微环境的异质性。我们在从乳腺组织衍生的1100万个细胞图的大型数据集上对GrapHist进行了预训练,并评估其在领域内外基准测试中的可迁移性。我们的结果表明,GrapHist在切片、区域和细胞级任务中与基于视觉的同类方法相比表现出竞争力,同时参数量仅为其四分之一。它在癌症亚型任务上也大幅超越了完全监督的图模型。最后,我们还在 https://huggingface.co/ogutsevda/datasets 发布了我们研究中使用的五个基于图的数字病理数据集,建立了该领域首个大规模图基准。我们的代码可在 https://github.com/ogutsevda/graphist 获取。
cs.CV / 19 / 2603.00144
Disentangled Hierarchical VAE for 3D Human-Human Interaction Generation
用于三维人际互动生成的解耦层次变分自编码器
Abstract
Generating realistic 3D Human-Human Interaction (HHI) requires coherent modeling of the physical plausibility of the agents and their interaction semantics. Existing methods compress all motion information into a single latent representation, limiting their ability to capture fine-grained actions and inter-agent interactions. This often leads to semantic misalignment and physically implausible artifacts, such as penetration or missed contact. We propose Disentangled Hierarchical Variational Autoencoder (DHVAE) based latent diffusion for structured and controllable HHI generation. DHVAE explicitly disentangles the global interaction context and individual motion patterns into a decoupled latent structure by employing a CoTransformer module. To mitigate implausible and physically inconsistent contacts in HHI, we incorporate contrastive learning constraints with our DHVAE to promote a more discriminative and physically plausible latent interaction space. For high-fidelity interaction synthesis, DHVAE employs a DDIM-based diffusion denoising process in the hierarchical latent space, enhanced by a skip-connected AdaLN-Transformer denoiser. Extensive evaluations show that DHVAE achieves superior motion fidelity, text alignment, and physical plausibility with greater computational efficiency.
Chinese Translation
生成逼真的三维人际互动(HHI)需要对代理的物理合理性及其互动语义进行一致性建模。现有方法将所有运动信息压缩为单一的潜在表示,限制了它们捕捉细粒度动作和代理间互动的能力。这常常导致语义不对齐和物理不合理的伪影,例如穿透或错失接触。我们提出了一种基于解耦层次变分自编码器(DHVAE)的潜在扩散方法,用于结构化和可控的HHI生成。DHVAE通过采用CoTransformer模块,明确地将全局互动上下文和个体运动模式解耦为一个独立的潜在结构。为了减轻HHI中的不合理和物理不一致接触,我们在DHVAE中结合了对比学习约束,以促进更具辨别性和物理合理性的潜在互动空间。为了实现高保真度的互动合成,DHVAE在层次潜在空间中采用基于DDIM的扩散去噪过程,并通过跳连的AdaLN-Transformer去噪器进行增强。大量评估表明,DHVAE在运动保真度、文本对齐和物理合理性方面表现优越,且计算效率更高。
cs.CV / 20 / 2603.00145
M-Gaussian: A Magnetic Gaussian Framework for Efficient Multi-Stack MRI Reconstruction
M-高斯:一种用于高效多堆叠MRI重建的磁性高斯框架
Abstract
Magnetic Resonance Imaging (MRI) is a crucial non-invasive imaging modality. In routine clinical practice, multi-stack thick-slice acquisitions are widely used to reduce scan time and motion sensitivity, particularly in challenging scenarios such as fetal brain imaging. However, the resulting severe through-plane anisotropy compromises volumetric analysis and downstream quantitative assessment, necessitating robust reconstruction of isotropic high-resolution volumes. Implicit neural representation methods, while achieving high quality, suffer from computational inefficiency due to complex network structures. We present M-Gaussian, adapting 3D Gaussian Splatting to MRI reconstruction. Our contributions include: (1) Magnetic Gaussian primitives with physics-consistent volumetric rendering, (2) neural residual field for high-frequency detail refinement, and (3) multi-resolution progressive training. Our method achieves an optimal balance between quality and speed. On the FeTA dataset, M-Gaussian achieves 40.31 dB PSNR while being 14 times faster, representing the first successful adaptation of 3D Gaussian Splatting to multi-stack MRI reconstruction.
Chinese Translation
磁共振成像(MRI)是一种重要的非侵入性成像技术。在常规临床实践中,多堆叠厚切片采集被广泛应用于减少扫描时间和运动敏感性,特别是在胎儿脑部成像等具有挑战性的场景中。然而,由于产生了严重的平面间各向异性,这妨碍了体积分析和后续的定量评估,因此需要对各向同性高分辨率体积进行稳健重建。隐式神经表示方法虽然能够实现高质量的重建,但由于复杂的网络结构,计算效率较低。我们提出了M-高斯,将3D高斯散射(3D Gaussian Splatting)应用于MRI重建。我们的贡献包括:(1)具有物理一致性的体积渲染的磁性高斯原语;(2)用于高频细节精细化的神经残差场;(3)多分辨率渐进训练。我们的方法在质量和速度之间实现了最佳平衡。在FeTA数据集上,M-高斯达到了40.31 dB的峰值信噪比(PSNR),同时速度提高了14倍,代表了3D高斯散射首次成功应用于多堆叠MRI重建。
cs.CV / 21 / 2603.00147
Leveraging GenAI for Segmenting and Labeling Centuries-old Technical Documents
利用生成式人工智能对百年技术文献进行分割和标注
Abstract
Image segmentation and image recognition are well-established computational techniques in the broader discipline of image processing. Segmentation makes it possible to locate areas in an image, while recognition identifies specific objects within an image. These techniques have shown remarkable accuracy with modern images, mainly because the amount of training data is vast. Achieving similar accuracy in digitized images of centuries-old documents is more challenging. This difficulty is due to two main reasons: first, the lack of sufficient training data, and second, the high degree of specialization required in a given domain. Despite these limitations, the ability to segment and recognize objects in these collections is important for automating the curation, cataloging, and dissemination of knowledge, making the contents of priceless collections accessible to scholars and the general public. In this paper, we report on our ongoing work in segmenting and labeling images pertaining to shipbuilding treatises from the XVI and XVII centuries, a historical period known as the Age of Exploration. To this end, we leverage SAM2 for image segmentation; Florence2 and ChatGPT for labeling; and a specialized ontology (ontoShip) and glossary (glosShip) of nautical architecture for enhancing the labeling process. Preliminary results demonstrate the potential of marrying these technologies for improving the curation and retrieval of priceless historical documents. We also discuss the challenges and limitations encountered in this approach and ideas on how to overcome them in the future.
Chinese Translation
图像分割和图像识别是图像处理领域中成熟的计算技术。分割可以定位图像中的区域,而识别则识别图像中特定的对象。这些技术在现代图像中表现出显著的准确性,主要是因为训练数据量庞大。然而,在对数百年历史文献的数字化图像中实现类似的准确性则更具挑战性。这种困难主要源于两个原因:首先,缺乏足够的训练数据;其次,特定领域的专业化程度较高。尽管存在这些限制,在这些文献中分割和识别对象的能力对于自动化知识的策展、编目和传播至关重要,使得珍贵文献的内容能够被学者和公众所获取。在本文中,我们报告了我们在对16和17世纪船舶建造论著的图像进行分割和标注方面的持续工作,这一历史时期被称为探索时代。为此,我们利用SAM2进行图像分割;使用Florence2和ChatGPT进行标注;并采用专门的本体ontoShip和海洋建筑术语表glosShip来增强标注过程。初步结果表明,将这些技术结合起来可以改善珍贵历史文献的策展和检索。我们还讨论了在这一方法中遇到的挑战和局限性,以及未来克服这些问题的思路。
cs.CV / 22 / 2603.00148
Mechanistically Guided LoRA Improves Paraphrase Consistency in Medical Vision-Language Models
机制指导的低秩适应(LoRA)提高了医学视觉-语言模型中的释义一致性
Abstract
Medical Vision-Language Models can give different yes or no answers to rephrasings of the same clinical question. We study this in MedGemma-4B using PSF-Med Sadanandan and Behzadan (2025), which provides paraphrase pairs for systematic consistency evaluation on medical VQA. On MIMIC-CXR binary questions (n = 158), the baseline flip rate is 14.6% and mean margin difference is 1.63 logits. We validate that Gemma Scope 2 Sparse Autoencoders (SAEs) transfer to MedGemma activations, achieving R2 ~= 0.997 on both medical and general text (n = 100 prompts each, p < 0.001 for exceeding a 0.95 threshold). We then fine-tune Low-Rank Adaptation (LoRA) adapters with a combined loss that balances paraphrase consistency with answer accuracy. This combined approach prevents mode collapse that occurs with pure consistency training while reducing flip rate from 14.6% to 4.4% (p = 0.002, two-proportion z-test) and margin difference from 1.63 to 0.33 (79.5% reduction). Accuracy remains stable at 84.2% baseline versus 82.3% after training (-1.9pp, not significant). On PadChest Balanced (n = 250), flip rate drops from 13.6% to 7.8%, mean margin difference drops from 1.08 to 0.35 (67.9% reduction), and accuracy increases from 66.4% to 69.4%. A layer-range ablation shows that early layers reduce margin differences more than mechanistically selected middle layers.
Chinese Translation
医学视觉-语言模型对同一临床问题的不同表述可能给出不同的“是”或“否”答案。我们在 MedGemma-4B 中研究了这一现象,使用 PSF-Med Sadanandan 和 Behzadan (2025) 提供的释义对进行系统一致性评估。针对 MIMIC-CXR 的二元问题(n = 158),基线翻转率为 14.6%,平均边际差异为 1.63 logits。我们验证了 Gemma Scope 2 稀疏自编码器(SAEs)能够转移到 MedGemma 激活,在医学和通用文本上均实现 R2 ≈ 0.997(每个 n = 100 的提示,p < 0.001,超过 0.95 的阈值)。然后,我们使用结合了释义一致性与答案准确性的损失函数对低秩适应(LoRA)适配器进行了微调。这种结合方法防止了纯一致性训练中出现的模式崩溃,同时将翻转率从 14.6% 降低到 4.4%(p = 0.002,两比例 z 检验),边际差异从 1.63 降低到 0.33(减少 79.5%)。准确率在训练前后保持稳定,基线为 84.2%,训练后为 82.3%(-1.9pp,不显著)。在 PadChest Balanced 数据集上(n = 250),翻转率从 13.6% 降低到 7.8%,平均边际差异从 1.08 降低到 0.35(减少 67.9%),准确率从 66.4% 提高到 69.4%。层范围消融实验表明,早期层对边际差异的减少效果优于机制选择的中间层。
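The reported flip-rate comparison (14.6% vs 4.4%, p = 0.002) comes from a two-proportion z-test, which is reproducible from counts alone. A minimal sketch using a pooled standard error and a two-sided p-value; the flip counts below are back-computed from the reported rates over n = 158 and are illustrative, not taken from the paper's raw data.

```python
from math import sqrt, erf

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test with a two-sided p-value.
    x1/n1 and x2/n2 are event counts over totals (e.g. paraphrase flips
    before and after fine-tuning)."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                   # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # pooled standard error
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF via erf
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```

With roughly 23 vs 7 flips over 158 questions each, this test lands near the p = 0.002 the abstract reports.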
cs.CV / 23 / 2603.00149
Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction
物理一致性扩散用于高效流体超分辨率的多尺度残差修正
Abstract
Existing image SR and generic diffusion models transfer poorly to fluid SR: they are sampling-intensive, ignore physical constraints, and often yield spectral mismatch and spurious divergence. We address fluid super-resolution (SR) with \textbf{ReMD} (\underline{Re}sidual-\underline{M}ultigrid \underline{D}iffusion), a physics-consistent diffusion framework. At each reverse step, ReMD performs a \emph{multigrid residual correction}: the update direction is obtained by coupling data consistency with lightweight physics cues and then correcting the residual across scales; the multiscale hierarchy is instantiated with a \emph{multi-wavelet} basis to capture both large structures and fine vortical details. This coarse-to-fine design accelerates convergence and preserves fine structures while remaining equation-free. Across atmospheric and oceanic benchmarks, ReMD improves accuracy and spectral fidelity, reduces divergence, and reaches comparable quality with markedly fewer sampling steps than diffusion baselines. Our results show that enforcing physics consistency \emph{inside} the diffusion process via multigrid residual correction and multi-wavelet multiscale modeling is an effective route to efficient fluid SR. Our code is available at https://github.com/lizhihao2022/ReMD.
Chinese Translation
现有的图像超分辨率(SR)和通用扩散模型在流体超分辨率方面表现不佳:它们采样密集,忽略物理约束,且常常导致光谱不匹配和虚假发散。我们提出了一种物理一致性的扩散框架——ReMD(Residual-Multigrid Diffusion),用于流体超分辨率(SR)。在每个反向步骤中,ReMD执行多网格残差修正:更新方向通过将数据一致性与轻量级物理线索相结合获得,然后在各个尺度上修正残差;多尺度层次使用多小波基来捕捉大结构和细微涡旋细节。这种粗到细的设计加速了收敛并保持了细微结构,同时保持无方程特性。在大气和海洋基准测试中,ReMD提高了准确性和光谱保真度,减少了发散,并在显著减少采样步骤的情况下达到了与扩散基线相当的质量。我们的结果表明,通过多网格残差修正和多小波多尺度建模在扩散过程中强制执行物理一致性是实现高效流体超分辨率的有效途径。我们的代码可在 https://github.com/lizhihao2022/ReMD 获取。
cs.CV / 24 / 2603.00150
Attention to Neural Plagiarism: Diffusion Models Can Plagiarize Your Copyrighted Images!
关注神经抄袭:扩散模型可以抄袭您的版权图像!
Abstract
In this paper, we highlight a critical threat posed by emerging neural models: data plagiarism. We demonstrate how modern neural models (e.g., diffusion models) can replicate copyrighted images, even when protected by advanced watermarking techniques. To expose vulnerabilities in copyright protection and facilitate future research, we propose a general approach to neural plagiarism that can either forge replicas of copyrighted data or introduce copyright ambiguity. Our method, based on "anchors and shims", employs inverse latents as anchors and finds shim perturbations that gradually deviate the anchor latents, thereby evading watermark or copyright detection. By applying perturbations to the cross-attention mechanism at different timesteps, our approach induces varying degrees of semantic modification in copyrighted images, enabling it to bypass protections ranging from visible trademarks and signatures to invisible watermarks. Notably, our method is a purely gradient-based search that requires no additional training or fine-tuning. Experiments on MS-COCO and real-world copyrighted images show that diffusion models can replicate copyrighted images, underscoring the urgent need for countermeasures against neural plagiarism.
Chinese Translation
在本文中,我们强调了新兴神经模型所带来的一个重要威胁:数据抄袭。我们展示了现代神经模型(例如,扩散模型)如何能够复制受版权保护的图像,即使这些图像采用了先进的水印技术进行保护。为了揭示版权保护中的脆弱性并促进未来的研究,我们提出了一种通用的神经抄袭方法,该方法可以伪造受版权保护数据的副本或引入版权模糊性。我们的方法基于“锚点和垫片”,使用逆潜变量作为锚点,并寻找逐渐偏离锚点潜变量的垫片扰动,从而规避水印或版权检测。通过在不同时间步对交叉注意机制施加扰动,我们的方法在受版权保护的图像中引起不同程度的语义修改,使其能够绕过从可见商标和签名到不可见水印的各种保护。值得注意的是,我们的方法是一种纯粹基于梯度的搜索,不需要额外的训练或微调。在 MS-COCO 和实际版权图像上的实验表明,扩散模型能够复制受版权保护的图像,强调了对抗神经抄袭的紧迫需求。
cs.CV / 25 / 2603.00152
Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design
Dr. Seg:通过感知导向设计重新审视视觉大语言模型的GRPO训练
Abstract
Following the success of Group Relative Policy Optimization (GRPO) in foundation LLMs, an increasing number of works have sought to adapt GRPO to Visual Large Language Models (VLLMs) for visual perception tasks (e.g., detection and segmentation). However, much of this line of research rests on a long-standing yet unexamined assumption: training paradigms developed for language reasoning can be transferred seamlessly to visual perception. Our experiments show that this assumption is not valid, revealing intrinsic differences between reasoning-oriented and perception-oriented settings. Using reasoning segmentation as a representative case, we surface two overlooked factors: (i) the need for a broader output space, and (ii) the importance of fine-grained, stable rewards. Building on these observations, we propose Dr.~Seg, a simple, plug-and-play GRPO-based framework consisting of a Look-to-Confirm mechanism and a Distribution-Ranked Reward module, requiring no architectural modifications and integrating seamlessly with existing GRPO-based VLLMs. Extensive experiments demonstrate that Dr.~Seg improves performance in complex visual scenarios while maintaining strong generalization. Code and models will be available at https://github.com/xVI-group-SCU/Dr-Seg.
Chinese Translation
在群体相对策略优化(GRPO)在基础大语言模型(LLMs)中的成功之后,越来越多的研究试图将GRPO适应于视觉大语言模型(VLLMs)以应对视觉感知任务(例如,检测和分割)。然而,这一研究方向的许多工作基于一个长期存在但未被检验的假设:为语言推理开发的训练范式可以无缝转移到视觉感知上。我们的实验表明,这一假设并不成立,揭示了推理导向和感知导向设置之间的内在差异。以推理分割作为代表性案例,我们发现了两个被忽视的因素:(i)对更广泛输出空间的需求,以及(ii)细粒度、稳定奖励的重要性。基于这些观察,我们提出了Dr. Seg,一个简单的即插即用的基于GRPO的框架,包含一个观察确认机制(Look-to-Confirm)和一个分布排名奖励模块(Distribution-Ranked Reward),无需架构修改,并能与现有的基于GRPO的VLLMs无缝集成。大量实验表明,Dr. Seg在复杂视觉场景中提高了性能,同时保持了强大的泛化能力。代码和模型将发布在 https://github.com/xVI-group-SCU/Dr-Seg。
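For readers unfamiliar with GRPO, its core step normalizes each sampled completion's reward against the statistics of its own sampling group. A generic sketch of that group-relative advantage computation (the standard formulation, not Dr. Seg's specific reward design):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: for a group of rewards sampled from the same
    prompt, normalize each reward by the group mean and (population)
    standard deviation. `eps` guards against zero-variance groups."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

The resulting advantages are zero-mean within each group, so completions are only credited relative to their siblings rather than by absolute reward.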
cs.CV / 26 / 2603.00155
EfficientPosterGen: Semantic-aware Efficient Poster Generation via Token Compression and Accurate Violation Detection
EfficientPosterGen:通过令牌压缩和准确的违规检测实现语义感知的高效海报生成
Abstract
Automated academic poster generation aims to distill lengthy research papers into concise, visually coherent presentations. Existing Multimodal Large Language Models (MLLMs) based approaches, however, suffer from three critical limitations: low information density in full-paper inputs, excessive token consumption, and unreliable layout verification. We present EfficientPosterGen, an end-to-end framework that addresses these challenges through semantic-aware retrieval and token-efficient multimodal generation. EfficientPosterGen introduces three core innovations: (1) Semantic-aware Key Information Retrieval (SKIR), which constructs a semantic contribution graph to model inter-segment relationships and selectively preserves important content; (2) Visual-based Context Compression (VCC), which renders selected text segments into images to shift textual information into the visual modality, significantly reducing token usage while generating poster-ready bullet points; and (3) Agentless Layout Violation Detection (ALVD), a deterministic color-gradient-based algorithm that reliably detects content overflow and spatial sparsity without auxiliary MLLMs. Extensive experiments demonstrate that EfficientPosterGen achieves substantial improvements in token efficiency and layout reliability while maintaining high poster quality, offering a scalable solution for automated academic poster generation. Our code is available at https://github.com/vinsontang1/EfficientPosterGen-Code.
Chinese Translation
自动化学术海报生成旨在将冗长的研究论文提炼为简明、视觉连贯的展示。然而,现有的基于多模态大型语言模型(MLLMs)的方法存在三个关键限制:全文输入的信息密度低、令牌消耗过多以及布局验证不可靠。我们提出了EfficientPosterGen,一个端到端框架,通过语义感知的检索和令牌高效的多模态生成来解决这些挑战。EfficientPosterGen引入了三个核心创新:(1)语义感知关键信息检索(SKIR),构建语义贡献图以建模段落间关系,并选择性地保留重要内容;(2)基于视觉的上下文压缩(VCC),将选定的文本段落渲染为图像,将文本信息转移到视觉模态,显著减少生成海报准备好的要点时的令牌使用;(3)无代理布局违规检测(ALVD),一种基于确定性颜色渐变的算法,能够可靠地检测内容溢出和空间稀疏,而无需辅助的MLLMs。大量实验表明,EfficientPosterGen在令牌效率和布局可靠性方面取得了显著改善,同时保持了高质量的海报,为自动化学术海报生成提供了可扩展的解决方案。我们的代码可在 https://github.com/vinsontang1/EfficientPosterGen-Code 获取。
cs.CV / 27 / 2603.00156
BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation
BiCLIP:双向一致的语言-图像处理用于稳健的医学图像分割
Abstract
Medical image segmentation is a cornerstone of computer-assisted diagnosis and treatment planning. While recent multimodal vision-language models have shown promise in enhancing semantic understanding through textual descriptions, their resilience in "in-the-wild" clinical settings-characterized by scarce annotations and hardware-induced image degradations-remains under-explored. We introduce BiCLIP (Bidirectional and Consistent Language-Image Processing), a framework engineered to bolster robustness in medical segmentation. BiCLIP features a bidirectional multimodal fusion mechanism that enables visual features to iteratively refine textual representations, ensuring superior semantic alignment. To further stabilize learning, we implement an augmentation consistency objective that regularizes intermediate representations against perturbed input views. Evaluation on the QaTa-COV19 and MosMedData+ benchmarks demonstrates that BiCLIP consistently surpasses state-of-the-art image-only and multimodal baselines. Notably, BiCLIP maintains high performance when trained on as little as 1% of labeled data and exhibits significant resistance to clinical artifacts, including motion blur and low-dose CT noise.
Chinese Translation
医学图像分割是计算机辅助诊断和治疗规划的基石。尽管最近的多模态视觉-语言模型在通过文本描述增强语义理解方面显示出前景,但它们在“野外”临床环境中的韧性——这种环境以稀缺的注释和硬件引起的图像退化为特征——仍然未得到充分探讨。我们提出了BiCLIP(双向一致的语言-图像处理),这是一个旨在增强医学分割稳健性的框架。BiCLIP具有双向多模态融合机制,使视觉特征能够迭代地优化文本表示,确保更优的语义对齐。为了进一步稳定学习,我们实施了一种增强一致性目标,以规范中间表示,抵御扰动输入视图的影响。在QaTa-COV19和MosMedData+基准测试中的评估表明,BiCLIP始终超越了最先进的仅图像和多模态基线。值得注意的是,BiCLIP在仅使用1%的标注数据进行训练时仍能保持高性能,并对临床伪影(包括运动模糊和低剂量CT噪声)表现出显著的抵抗力。
cs.CV / 28 / 2603.00157
FujiView: Multimodal Late-Fusion for Predicting Scenic Visibility
富士视野:用于预测景观能见度的多模态后融合
Abstract
Visibility of natural landmarks such as Mount Fuji is a defining factor in both tourism planning and visitor experience, yet it remains difficult to predict due to rapidly changing atmospheric conditions. We present FujiView, a multimodal learning framework and dataset for predicting scenic visibility by fusing webcam imagery with structured meteorological data. Our late-fusion approach combines image-derived class probabilities with numerical weather features to classify visibility into five categories. The dataset currently comprises over 100,000 webcam images paired with concurrent and forecasted weather conditions from more than 40 cameras around Mount Fuji, and continues to expand; it will be released to support further research in environmental forecasting. Experiments show that YOLO-based vision features dominate short-term horizons such as "nowcasting" and "samedaycasting", while weather-driven forecasts increasingly take over as the primary predictive signal beyond $+1$d. Late fusion consistently yields the highest overall accuracy, achieving an accuracy (ACC) of approximately 0.89 for same-day prediction and up to 84% for next-day forecasts. These results position Scenic Visibility Forecasting (SVF) as a new benchmark task for multimodal learning.
Chinese Translation
自然地标的能见度,例如富士山,是旅游规划和游客体验的决定性因素,但由于快速变化的气象条件,预测其能见度仍然困难。我们提出了富士视野(FujiView),这是一个多模态学习框架和数据集,通过将网络摄像头图像与结构化气象数据融合来预测景观能见度。我们的后融合方法将图像派生的类别概率与数值气象特征相结合,将能见度分类为五个类别。该数据集目前包含超过100,000张与富士山周围40多台摄像头的实时和预测天气条件配对的网络摄像头图像,并持续扩展;它将被发布以支持环境预测的进一步研究。实验表明,基于YOLO的视觉特征在短期预测如“即时预报”(nowcasting)和“同日预报”(samedaycasting)中占主导地位,而气象驱动的预测在超过$+1$d后逐渐成为主要预测信号。后融合方法始终产生最高的整体准确率,对于同日预测达到约0.89的准确率,对于次日预测达到84%。这些结果将景观能见度预测(Scenic Visibility Forecasting, SVF)确立为多模态学习的新基准任务。
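The late-fusion step described above amounts to combining per-class probability vectors from the vision and weather models. A minimal sketch, assuming a fixed convex-combination weight; the paper's actual fusion weights and class definitions are not stated in the abstract.

```python
def late_fuse(p_vision, p_weather, w_vision=0.5):
    """Late fusion as a convex combination of per-class probability
    vectors from two models, renormalized for numerical safety.
    `w_vision` is an illustrative hyperparameter."""
    assert len(p_vision) == len(p_weather)
    fused = [w_vision * pv + (1 - w_vision) * pw
             for pv, pw in zip(p_vision, p_weather)]
    total = sum(fused)
    return [f / total for f in fused]
```

A horizon-dependent `w_vision` (high for nowcasting, low beyond +1d) would mirror the abstract's finding that vision features dominate short horizons while weather features take over later.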
cs.CV / 29 / 2603.00159
FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation
FlowPortrait:基于强化学习的音频驱动肖像视频生成
Abstract
Generating realistic talking-head videos remains challenging due to persistent issues such as imperfect lip synchronization, unnatural motion, and evaluation metrics that correlate poorly with human perception. We propose FlowPortrait, a reinforcement-learning framework for audio-driven portrait animation built on a multimodal backbone for autoregressive audio-to-video generation. FlowPortrait introduces a human-aligned evaluation system based on Multimodal Large Language Models (MLLMs) to assess lip-sync accuracy, expressiveness, and motion quality. These signals are combined with perceptual and temporal consistency regularizers to form a stable composite reward, which is used to post-train the generator via Group Relative Policy Optimization (GRPO). Extensive experiments, including both automatic evaluations and human preference studies, demonstrate that FlowPortrait consistently produces higher-quality talking-head videos, highlighting the effectiveness of reinforcement learning for portrait animation.
Chinese Translation
生成逼真的说话人头像视频仍然面临诸多挑战,如唇部同步不完美、运动不自然以及评估指标与人类感知的相关性较差。我们提出了FlowPortrait,一个基于多模态骨干网络的音频驱动肖像动画的强化学习框架,用于自回归音频到视频的生成。FlowPortrait引入了一种基于多模态大型语言模型(Multimodal Large Language Models, MLLMs)的人类对齐评估系统,以评估唇同步准确性、表现力和运动质量。这些信号与感知和时间一致性正则化器结合,形成一个稳定的复合奖励,用于通过群体相对策略优化(Group Relative Policy Optimization, GRPO)对生成器进行后训练。大量实验,包括自动评估和人类偏好研究,表明FlowPortrait始终能够生成更高质量的说话人头像视频,突显了强化学习在肖像动画中的有效性。
cs.CV / 30 / 2603.00160
DINOv3 Meets YOLO26 for Weed Detection in Vegetable Crops
DINOv3与YOLO26在蔬菜作物中的杂草检测研究
Abstract
Developing robust models for precision vegetable weeding is currently constrained by the scarcity of large-scale, annotated weed-crop datasets. To address this limitation, this study proposes a foundational crop-weed detection model by integrating heterogeneous datasets and leveraging self-supervised learning. A total of 618,642 crop-weed images were initially collected and subsequently refined to 199,388 filtered images for fine-tuning a DINOv3 vision transformer (ViT-small) through a sequential curation strategy. The fine-tuned DINOv3 backbone was then integrated into YOLO26, serving either as a primary backbone or part of a dual-backbone architecture. A feature alignment loss was introduced in the dual backbone framework to enhance feature fusion with minimal computational overhead. Experimental results show that the proposed DINOv3-finetuned ViT-small-based YOLO26-large achieved up to a +5.4% mAP50 gain on in-domain images collected in the 2025 season. Moreover, it demonstrated strong cross-domain generalization with mAP50 improvements of +14.0% on the 2021-2023 season dataset and +11.9% on the 2024 season dataset, compared to the standard YOLO26-large. Although the DINOv3-YOLO26-large model has 45.6% more parameters and a 2.9x increase in inference latency, it maintains real-time performance at ~28.5 frames per second (fps). The curated dataset and software programs developed in this study will be made publicly available.
Chinese Translation
目前,开发用于精确蔬菜除草的稳健模型受到大规模标注的杂草-作物数据集稀缺的限制。为了解决这一问题,本研究通过整合异构数据集和利用自监督学习,提出了一种基础的作物-杂草检测模型。最初收集了618,642张作物-杂草图像,经过筛选后精炼至199,388张图像,以通过顺序策划策略对DINOv3视觉变换器(ViT-small)进行微调。微调后的DINOv3主干随后被集成到YOLO26中,作为主要主干或双主干架构的一部分。在双主干框架中引入了特征对齐损失,以在最小计算开销的情况下增强特征融合。实验结果表明,所提出的基于DINOv3微调的ViT-small的YOLO26-large在2025年生长季收集的领域内图像上实现了高达+5.4%的mAP50增益。此外,与标准YOLO26-large相比,它表现出强大的跨领域泛化能力,在2021-2023年生长季数据集上mAP50提高了+14.0%,在2024年生长季数据集上提高了+11.9%。尽管DINOv3-YOLO26-large模型的参数数量增加了45.6%,推理延迟增加到2.9倍,但其仍保持约28.5帧每秒(fps)的实时性能。本研究中开发的策划数据集和软件程序将公开发布。
cs.CV / 31 / 2603.00161
SKINOPATHY AI: Smartphone-Based Ophthalmic Screening and Longitudinal Tracking Using Lightweight Computer Vision
SKINOPATHY AI:基于轻量级计算机视觉的智能手机眼科筛查与纵向跟踪
Abstract
Early ophthalmic screening in low-resource and remote settings is constrained by access to specialized equipment and trained practitioners. We present SKINOPATHY AI, a smartphone-first web application that delivers five complementary, explainable screening modules entirely through commodity mobile hardware: (1) redness quantification via LAB a* color-space normalization; (2) blink-rate estimation using MediaPipe FaceMesh Eye Aspect Ratio (EAR) with adaptive thresholding; (3) pupil light reflex characterization through Pupil-to-Iris Ratio (PIR) time-series analysis; (4) scleral color indexing for icterus and anemia proxies via LAB/HSV statistics; and (5) iris-landmark-calibrated lesion encroachment measurement with millimeter-scale estimates and longitudinal trend tracking. The system is implemented as a React/FastAPI stack with OpenCV and MediaPipe, MongoDB-backed session persistence, and PDF report generation. All algorithms are fully deterministic, privacy-preserving, and designed for non-diagnostic consumer triage. We detail system architecture, algorithm design, evaluation methodology, clinical context, and ethical boundaries of the platform. SKINOPATHY AI demonstrates that multi-signal ophthalmic screening is feasible on unmodified smartphones without cloud-based AI inference, providing a foundation for future clinically validated mobile ophthalmoscopy tools.
Chinese Translation
在资源匮乏和偏远地区,早期眼科筛查受到专业设备和训练有素的从业人员的限制。我们提出了SKINOPATHY AI,这是一款以智能手机为首的网络应用程序,提供五个互补的、可解释的筛查模块,完全通过普通移动硬件实现: (1) 通过LAB a*颜色空间归一化进行红色量化; (2) 使用MediaPipe FaceMesh眼部纵横比(EAR)和自适应阈值估计眨眼频率; (3) 通过瞳孔与虹膜比率(PIR)时间序列分析表征瞳孔光反射; (4) 通过LAB/HSV统计进行巩膜颜色指数化,以作为黄疸和贫血的代理; (5) 使用毫米级估计和纵向趋势跟踪进行虹膜标志校准的病变侵入测量。该系统作为React/FastAPI堆栈实现,结合OpenCV和MediaPipe,使用MongoDB支持会话持久性,并生成PDF报告。所有算法都是完全确定性的,保护隐私,并设计用于非诊断性的消费者分流。我们详细介绍了系统架构、算法设计、评估方法、临床背景和平台的伦理边界。SKINOPATHY AI展示了在未修改的智能手机上进行多信号眼科筛查的可行性,无需基于云的AI推断,为未来经过临床验证的移动眼底镜工具奠定了基础。
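The blink-rate module builds on the standard Eye Aspect Ratio over six eye landmarks (EAR = (|p2−p6| + |p3−p5|) / (2|p1−p4|), per Soukupová and Čech). A minimal sketch of that formula; the fixed threshold and the landmark ordering here are illustrative, since the app uses adaptive thresholding and MediaPipe FaceMesh indices not given in the abstract.

```python
from math import dist

def eye_aspect_ratio(landmarks):
    """Classic Eye Aspect Ratio over six eye landmarks p1..p6:
    EAR = (|p2-p6| + |p3-p5|) / (2 * |p1-p4|), where p1/p4 are the
    horizontal eye corners and the other points are vertical pairs."""
    p1, p2, p3, p4, p5, p6 = landmarks
    return (dist(p2, p6) + dist(p3, p5)) / (2.0 * dist(p1, p4))

def is_blink(ear, threshold=0.2):
    """A frame counts toward a blink when EAR drops below the threshold
    (fixed here for illustration; the app adapts it per user)."""
    return ear < threshold
```

Counting threshold crossings over a timed window then yields blinks per minute.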
cs.CV / 32 / 2603.00163
A Boundary-Metric Evaluation Protocol for Whiteboard Stroke Segmentation Under Extreme Imbalance
极端不平衡下白板笔画分割的边界度量评估协议
Abstract
The binary segmentation of whiteboard strokes is hindered by extreme class imbalance: stroke pixels constitute only $1.79\%$ of the image on average, and the thin-stroke subset averages $1.14\% \pm 0.41\%$ in the foreground. Standard region metrics (F1, IoU) can mask thin-stroke failures because the vast majority of the background dominates the score. In contrast, adding boundary-aware metrics and a thin-subset equity analysis changes how loss functions rank and exposes hidden trade-offs. We contribute an evaluation protocol that jointly examines region metrics, boundary metrics (BF1, B-IoU), a core/thin-subset equity analysis, and per-image robustness statistics (median, IQR, worst-case) under seeded, multi-run training with non-parametric significance testing. Five losses -- cross-entropy, focal, Dice, Dice+focal, and Tversky -- are trained three times each on a DeepLabV3-MobileNetV3 model and evaluated on 12 held-out images split into core and thin subsets. Overlap-based losses improve F1 by more than 20 points over cross-entropy ($0.663$ vs $0.438$, $p < 0.001$). In addition, the boundary metrics confirm that the gain extends to the precision of the contour. Adaptive thresholding and Sauvola binarization at native resolution achieve a higher mean F1 ($0.787$ for Sauvola) but with substantially worse worst-case performance (F1 $= 0.452$ vs $0.565$ for Tversky), exposing a consistency-accuracy trade-off: classical baselines lead on mean F1 while the learned model delivers higher worst-case reliability. Doubling training resolution further increases F1 by 12.7 points.
Chinese Translation
白板笔画的二元分割受到极端类别不平衡的阻碍,笔画像素平均仅占图像的 $1.79\%$,而且,细笔画子集的前景平均仅为 $1.14\% \pm 0.41\%$。标准区域度量(F1,IoU)可能掩盖细笔画的失败,因为绝大多数背景主导了得分。相反,添加边界感知度量和细子集公平性分析改变了损失函数的排名方式,并揭示了隐藏的权衡。我们贡献了一种评估协议,联合考察区域度量、边界度量(BF1,B-IoU)、核心/细子集公平性分析以及在种子、多次训练下的每图像鲁棒性统计(中位数,四分位距,最坏情况),并进行非参数显著性检验。五种损失函数——交叉熵、焦点损失、Dice损失、Dice+焦点损失和Tversky损失——在DeepLabV3-MobileNetV3模型上各训练三次,并在12张保留图像上进行评估,这些图像被分为核心和细子集。基于重叠的损失在F1上比交叉熵提高了20多个点($0.663$ 对比 $0.438$,$p < 0.001$)。此外,边界度量确认这种增益扩展到轮廓的精度。自适应阈值和Sauvola二值化在原始分辨率下实现了更高的平均F1(Sauvola为$0.787$),但最坏情况性能显著较差(F1 $= 0.452$ 对比 $0.565$,Tversky),揭示了一种一致性-准确性权衡:经典基线在平均F1上领先,而学习模型则提供了更高的最坏情况可靠性。将训练分辨率加倍进一步提高了F1值12.7点。
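Among the compared losses, Tversky generalizes Dice by weighting false positives and false negatives separately: $TI = TP / (TP + \alpha \cdot FP + \beta \cdot FN)$, which is why it can be tuned toward recall on thin strokes. A minimal sketch over flattened binary masks; the $\alpha/\beta$ values are generic defaults, not the study's settings.

```python
def tversky_index(pred, target, alpha=0.5, beta=0.5, eps=1e-8):
    """Tversky index over flattened binary masks (lists of 0/1):
    TI = TP / (TP + alpha*FP + beta*FN). With alpha = beta = 0.5 this
    reduces to the Dice coefficient; raising beta penalizes missed
    foreground (false negatives) more, favoring thin-stroke recall."""
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)
```

The corresponding training loss is simply `1 - tversky_index(...)` computed on soft predictions.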
cs.CV / 33 / 2603.00165
ConFoThinking: Consolidated Focused Attention Driven Thinking for Visual Question Answering
ConFoThinking:用于视觉问答的综合聚焦注意力驱动思维
Abstract
Thinking with Images improves fine-grained VQA for MLLMs by emphasizing visual cues. However, tool-augmented methods depend on grounding capability, which remains unreliable for MLLMs. In parallel, attention-driven methods that crop Regions of Interest (ROIs) have been proposed, but they are constrained by (1) fragmented attention signals scattered across layers, leading to suboptimal localization, and (2) reliance on question- or redundant-text-conditioned attention extraction. Our analysis reveals three patterns: MLLMs may attend to the correct region yet generate incorrect coordinates, where-to-look attention is often fragmented across layers, and attention extraction is query-sensitive. Motivated by these observations, we propose ConFoThinking, a Consolidated-Focused-Attention-Driven Thinking framework that learns to aggregate attention into a designated intermediate layer, from which we mine and zoom in on salient regions for downstream visual understanding. Moreover, we extract attention using concise semantic cues of what to look into, which mitigates the semantic noise introduced by question- or redundant-text-based attention extraction. Experiments across five VQA benchmarks demonstrate that ConFoThinking significantly improves perception performance. The code, checkpoints, and dataset will be released upon acceptance.
Chinese Translation
图像思维通过强调视觉线索来改善多模态大语言模型(MLLMs)的细粒度视觉问答(VQA)。然而,工具增强方法依赖于视觉定位(grounding)能力,而这一能力在MLLMs中仍然不可靠。同时,已有研究提出了基于注意力的感兴趣区域(ROIs)裁剪方法,但它们受到以下限制:(1)注意力信号在层间分散,导致次优定位;(2)依赖于问题或冗余文本条件的注意力提取。我们的分析揭示了三种模式:MLLMs可能关注正确区域但生成错误坐标,关注位置的注意力通常在层间碎片化,以及注意力提取对查询敏感。基于这些动机,我们提出了ConFoThinking,一个综合聚焦注意力驱动的思维框架,学习将注意力聚合到指定的中间层,从中挖掘并放大显著区域以进行下游视觉理解。此外,我们使用简洁的语义线索提取注意力,指示关注的内容,从而减轻由问题或冗余文本基础的注意力提取引入的语义噪声。在五个VQA基准测试中的实验表明,ConFoThinking显著提高了感知性能。代码、检查点和数据集将在被接受后发布。
cs.CV / 34 / 2603.00166
Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?
探索人工智能服从性:为什么生成纯色图像比生成赛博朋克图像更困难?
Abstract
Recent advances in generative AI have demonstrated a remarkable ability to produce high-quality content. However, these models often exhibit a "Paradox of Simplicity": while they can render intricate landscapes, they often fail at simple, deterministic tasks. To address this, we formalize Obedience as the ability to align with instructions and establish a hierarchical grading system ranging from basic semantic alignment to pixel-level systemic precision, which provides a unified paradigm for incorporating and categorizing existing literature. Then, we conduct case studies to identify common obedience gaps, revealing how generative priors often override logical constraints. To evaluate high-level obedience, we present VIOLIN (VIsual Obedience Level-4 EvaluatIoN), the first benchmark focused on pure color generation across six variants. Extensive experiments on SOTA models reveal fundamental obedience limitations and yield further exploratory insights. By establishing this framework, we aim to draw more attention to AI Obedience and encourage deeper exploration to bridge this gap.
Chinese Translation
最近在生成性人工智能领域的进展展示了其产生高质量内容的显著能力。然而,这些模型常常表现出“简单性悖论”:尽管它们能够渲染复杂的景观,但在简单的、确定性的任务上却常常失败。为了解决这个问题,我们将服从性正式化为与指令对齐的能力,并建立了一个从基本语义对齐到像素级系统精度的分级系统,这为整合和分类现有文献提供了统一的范式。随后,我们进行案例研究以识别常见的服从性差距,揭示生成性先验如何常常覆盖逻辑约束。为了评估高级服从性,我们提出了 VIOLIN(VIsual Obedience Level-4 EvaluatIoN),这是第一个专注于纯色生成的基准,涵盖六种变体。对最先进(SOTA)模型的广泛实验揭示了基本的服从性局限性,并提供了进一步的探索性见解。通过建立这一框架,我们旨在引起对人工智能服从性的更多关注,并鼓励更深入的探索以弥补这一差距。
cs.CV / 35 / 2603.00168
Image-Based Classification of Olive Species Specific to Turkiye with Deep Neural Networks
基于图像的土耳其特有橄榄种类的深度神经网络分类
Abstract
In this study, image processing and deep learning methodologies were employed to automatically classify local olive species cultivated in Turkiye. A stereo camera was utilized to capture images of five distinct olive species, which were then preprocessed to ensure their suitability for analysis. Convolutional Neural Network (CNN) architectures, specifically MobileNetV2 and EfficientNetB0, were employed for image classification. These models were optimized through a transfer learning approach. The training and testing results indicated that the EfficientNetB0 model exhibited the optimal performance, with an accuracy of 94.5%. The findings demonstrate that deep learning-based systems offer an effective solution for classifying olive species with high accuracy. The developed method has significant potential for application in areas such as automatic identification and quality control of agricultural products.
Chinese Translation
本研究采用图像处理和深度学习方法,自动分类在土耳其栽培的本地橄榄种类。使用立体相机捕捉五种不同橄榄种类的图像,并对其进行预处理,以确保适合分析。采用卷积神经网络(CNN)架构,特别是MobileNetV2和EfficientNetB0进行图像分类。这些模型通过迁移学习方法进行了优化。训练和测试结果表明,EfficientNetB0模型表现最佳,准确率达到94.5%。研究结果表明,基于深度学习的系统为高准确率的橄榄种类分类提供了有效的解决方案。所开发的方法在农业产品的自动识别和质量控制等领域具有重要的应用潜力。
cs.CV / 36 / 2603.00170
A Novel Evolutionary Method for Automated Skull-Face Overlay in Computer-Aided Craniofacial Superimposition
一种用于计算机辅助颅面叠加的自动化新进化方法
Abstract
Craniofacial Superimposition is a forensic technique for identifying skeletal remains by comparing a post-mortem skull with ante-mortem facial photographs. A critical step in this process is Skull-Face Overlay (SFO). This stage involves aligning a 3D skull model with a 2D facial image, typically guided by cranial and facial landmarks' correspondence. However, its accuracy is undermined by individual variability in soft-tissue thickness, introducing significant uncertainty into the overlay. This paper introduces Lilium, an automated evolutionary method to enhance the accuracy and robustness of SFO. Lilium explicitly models soft-tissue variability using a 3D cone-based representation whose parameters are optimized via a Differential Evolution algorithm. The method enforces anatomical, morphological, and photographic plausibility through a combination of constraints: landmark matching, camera parameter consistency, head pose alignment, skull containment within facial boundaries, and region parallelism. This emulation of the usual forensic practitioners' approach leads Lilium to outperform the state-of-the-art method in terms of both accuracy and robustness.
Chinese Translation
颅面叠加是一种法医学技术,通过将尸检后的颅骨与生前的面部照片进行比较来识别骨骼遗骸。该过程中的一个关键步骤是颅面叠加(Skull-Face Overlay, SFO)。这一阶段涉及将3D颅骨模型与2D面部图像对齐,通常是通过颅骨和面部标志点之间的对应关系来指导。然而,个体软组织厚度的变异性削弱了其准确性,给叠加过程引入了显著的不确定性。本文介绍了Lilium,一种自动化进化方法,用于提高SFO的准确性和稳健性。Lilium明确建模软组织的变异性,采用基于3D锥体的表示,其参数通过差分进化(Differential Evolution)算法进行优化。该方法通过一系列约束来强制执行解剖学、形态学和摄影学的合理性:标志点匹配、相机参数一致性、头部姿态对齐、颅骨在面部边界内的包容性以及区域平行性。这种对法医从业者常规方法的模拟使得Lilium在准确性和稳健性方面超越了现有的先进方法。
cs.CV / 37 / 2603.00171
AdaFocus: Knowing When and Where to Look for Adaptive Visual Reasoning
AdaFocus:了解何时何地进行自适应视觉推理
Abstract
Multimodal Large Language Models (MLLMs) are shifting towards "Thinking with Images" by actively exploring image details. While effective, large-scale training is computationally expensive, which has spurred growing interest in lightweight, training-free solutions. However, existing training-free methods suffer from two flaws: perceptual redundancy from indiscriminate cropping, which adds overhead and noise; and a drift between semantic intent and spatial attention, which prevents accurate localization of user-focused regions. To address these challenges, we propose AdaFocus, a novel training-free framework designed for adaptive visual reasoning. AdaFocus follows a two-stage pipeline: a confidence-based module decides when to crop, and a semantic-guided localization module determines where to crop. This enables adaptive visual reasoning without additional training. Experimentally, AdaFocus delivers substantial performance gains while achieving an approximately 4.0x inference speedup over the SOTA method ZoomEyes, representing a significant advance in both accuracy and efficiency.
Chinese Translation
多模态大型语言模型(MLLMs)正朝着“用图像思考”的方向发展,通过主动探索图像细节来实现。尽管效果显著,但大规模训练的计算成本高昂,这引发了对轻量级、无训练解决方案的日益关注。然而,现有的无训练方法存在两个缺陷:由于无差别裁剪导致的感知冗余,增加了开销和噪声;以及语义意图与空间注意力之间的漂移,阻碍了用户关注区域的准确定位。为了解决这些挑战,我们提出了AdaFocus,一种旨在自适应视觉推理的新型无训练框架。AdaFocus遵循两阶段流程:基于置信度的模块决定何时裁剪,语义引导的定位模块确定何处裁剪。这使得在不增加额外训练的情况下实现自适应视觉推理。在实验中,AdaFocus在性能上显著提升,同时实现了约4.0倍的推理速度提升,相较于最先进的方法ZoomEyes,代表了准确性和效率的重大进步。
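The two-stage when/where decision described above can be sketched in a few lines. The confidence threshold and attention-peak centering below are illustrative stand-ins, not AdaFocus's actual modules:

```python
def adaptive_crop(answer_conf, attn_map, conf_threshold=0.75):
    """Confidence-gated cropping: skip the crop when the model is already
    confident (the 'when'), otherwise center a crop on the attention peak
    (the 'where'). Threshold value and peak-centering rule are
    illustrative assumptions, not the paper's modules.
    """
    if answer_conf >= conf_threshold:
        return None                            # confident: answer directly
    peak_row = max(range(len(attn_map)), key=lambda i: max(attn_map[i]))
    row = attn_map[peak_row]
    peak_col = max(range(len(row)), key=lambda j: row[j])
    return (peak_row, peak_col)                # crop centered here

attn = [[0.1, 0.2], [0.9, 0.3]]
print(adaptive_crop(0.5, attn))                # low confidence -> (1, 0)
print(adaptive_crop(0.9, attn))                # high confidence -> None
```

Gating on confidence is what avoids the "perceptual redundancy from indiscriminate cropping" the abstract criticizes: only uncertain queries pay the cropping cost.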
cs.CV / 38 / 2603.00173
Summer-22B: A Systematic Approach to Dataset Engineering and Training at Scale for Video Foundation Model
Summer-22B:视频基础模型的大规模数据集工程与训练的系统化方法
Abstract
We describe our experience training Summer-22B, a video foundation model developed from scratch. This report documents the engineering challenges, design decisions, and lessons learned while scaling from raw footage collection to a functional model trained on approximately 50 million clips. We outline our approach combining metadata-driven dataset curation, multi-stage filtering, $\mu$P parameterization, and hypersphere-constrained optimization. We developed the Lavender Data system for dataset management and adopted inference-aware architectural choices. We share observations on what worked in our setting: dataset engineering consumed the majority of effort, architectural variants showed smaller differences than we expected, and $\mu$P hyperparameter transfer appeared effective even under geometric constraints. We hope this account proves useful to others undertaking similar projects.
Chinese Translation
我们描述了从零开始训练Summer-22B这一视频基础模型的经验。本报告记录了从原始视频素材收集到训练出一个功能性模型(约5000万个片段)的过程中所面临的工程挑战、设计决策和经验教训。我们概述了结合元数据驱动的数据集策划、多阶段过滤、$\mu$P参数化和超球体约束优化的方法。我们开发了Lavender Data系统用于数据集管理,并采用了考虑推理的架构选择。我们分享了在我们环境中有效的观察:数据集工程消耗了大部分精力,架构变体之间的差异比我们预期的小,以及$\mu$P超参数转移在几何约束下似乎也有效。我们希望这一经验分享对其他进行类似项目的人有所帮助。
cs.CV / 39 / 2603.00175
Infinite Self-Attention
无限自注意力
Abstract
The quadratic cost of softmax attention limits Transformer scalability in high-resolution vision. We introduce Infinite Self-Attention (InfSA), a spectral reformulation that treats each attention layer as a diffusion step on a content-adaptive token graph, accumulating multi-hop interactions through a discounted Neumann series over attention matrices. This links self-attention to classical graph centrality (Katz, PageRank, eigenvector centrality) for interpretable token weighting. We also show the Neumann kernel equals the fundamental matrix of an absorbing Markov chain, so a token's centrality is its expected number of random-walk visits before absorption. We then propose Linear-InfSA, a linear-time variant that approximates the principal eigenvector of the implicit attention operator without forming the full attention matrix. It keeps an auxiliary state of fixed size proportional to per-head dimension dh (independent of sequence length N), is drop-in compatible with Vision Transformers, and supports stable training at 4096 by 4096 and inference at 9216 by 9216 (about 332k tokens). In a 4-layer ViT (53.5M parameters, 59 GFLOPs at 224 by 224), Linear-InfSA reaches 84.7% top-1 on ImageNet-1K, a +3.2 point architectural gain over an equal-depth softmax ViT trained with the same recipe. On ImageNet-V2, InfViT variants outperform all compared baselines (up to 79.8% vs 76.8%), indicating robustness under distribution shift. On an A100 40GB GPU, Linear-InfViT runs at 231 images/s and 0.87 J/image (13x better throughput and energy than equal-depth ViT) and is the only tested model to complete 9216 by 9216 inference without out-of-memory. The linear approximation closely matches the dominant eigenvector of the quadratic operator (cosine 0.985).
Chinese Translation
软最大注意力的二次成本限制了变换器在高分辨率视觉中的可扩展性。我们提出了无限自注意力(Infinite Self-Attention,InfSA),这是一种谱重构方法,将每个注意力层视为内容自适应令牌图上的扩散步骤,通过对注意力矩阵的折扣诺依曼级数累积多跳交互。这将自注意力与经典图中心性(Katz、PageRank、特征向量中心性)联系起来,以实现可解释的令牌加权。我们还表明,诺依曼核等于吸收马尔可夫链的基本矩阵,因此令牌的中心性是其在吸收之前的随机游走访问的期望次数。然后我们提出了线性无限自注意力(Linear-InfSA),这是一种线性时间变体,能够在不形成完整注意力矩阵的情况下近似隐式注意力算子的主特征向量。它保持一个固定大小的辅助状态,与每头维度 dh 成正比(与序列长度 N 无关),与视觉变换器(Vision Transformers)兼容,并支持在 4096 x 4096 的稳定训练和在 9216 x 9216 的推理(约 332k 令牌)。在一个 4 层的 ViT(53.5M 参数,224 x 224 时 59 GFLOPs)中,Linear-InfSA 在 ImageNet-1K 上达到了 84.7% 的 top-1 精度,相比于使用相同配方训练的相同深度的软最大 ViT 提升了 +3.2 个百分点。在 ImageNet-V2 上,InfViT 变体的表现优于所有比较的基线(最高 79.8% 对比 76.8%),表明其在分布转移下的鲁棒性。在 A100 40GB GPU 上,Linear-InfViT 的运行速度为 231 张图像/秒,能耗为 0.87 J/图像(比相同深度的 ViT 提高了 13 倍的吞吐量和能量效率),并且是唯一一个在不发生内存溢出的情况下完成 9216 x 9216 推理的测试模型。线性近似与二次算子的主特征向量(余弦相似度 0.985)非常接近。
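The discounted Neumann series underlying InfSA has a well-known closed form: for an attention matrix A and discount β such that the spectral radius of βA is below one, Σ_k β^k A^k = (I − βA)^{-1}, the same kernel as Katz centrality. A small NumPy check of this identity (illustrative, not the paper's linear-time variant):

```python
import numpy as np

def neumann_attention(A, beta=0.5, terms=50):
    """Accumulate multi-hop token interactions via a truncated discounted
    Neumann series I + beta*A + beta^2*A^2 + ...  When beta times the
    spectral radius of A is below one, this converges to the closed form
    (I - beta*A)^{-1}. Illustrative sketch, not the paper's Linear-InfSA.
    """
    n = A.shape[0]
    total = np.eye(n)
    term = np.eye(n)
    for _ in range(terms):
        term = beta * (term @ A)               # next hop, discounted
        total += term
    return total

A = np.full((3, 3), 1.0 / 3.0)                 # uniform (row-stochastic) attention
closed_form = np.linalg.inv(np.eye(3) - 0.5 * A)
print(np.allclose(neumann_attention(A), closed_form))   # True
```

Row sums of the closed-form kernel recover Katz-style centralities, which is the graph-centrality interpretation of token weighting the abstract refers to.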
cs.CV / 40 / 2603.00184
Zero-Shot and Supervised Bird Image Segmentation Using Foundation Models: A Dual-Pipeline Approach with Grounding DINO~1.5, YOLOv11, and SAM~2.1
基于基础模型的零样本与监督鸟类图像分割:结合 Grounding DINO~1.5、YOLOv11 和 SAM~2.1 的双管道方法
Abstract
Bird image segmentation remains a challenging task in computer vision due to extreme pose diversity, complex plumage patterns, and variable lighting conditions. This paper presents a dual-pipeline framework for binary bird image segmentation leveraging 2025 foundation models. We introduce two operating modes built upon Segment Anything Model 2.1 (SAM 2.1) as a shared frozen backbone: (1) a zero-shot pipeline using Grounding DINO 1.5 to detect birds via the text prompt "bird" before prompting SAM 2.1 with bounding boxes requiring no labelled bird data; and (2) a supervised pipeline that fine-tunes YOLOv11 on the CUB-200-2011 dataset for high-precision detection, again prompting SAM 2.1 for pixel-level masks. The segmentation model is never retrained for new species or domains. On CUB-200-2011 (11,788 images, 200 species), the supervised pipeline achieves IoU 0.912, Dice 0.954, and F1 0.953 outperforming all prior baselines including SegFormer-B2 (IoU 0.842) by +7.0 percentage points. The zero-shot pipeline achieves IoU 0.831 using only a text prompt, the first such result reported on this benchmark. We demonstrate that prompt-based foundation model pipelines outperform task specific end-to-end trained segmentation networks, while requiring only lightweight detector fine-tuning (~1 hour) for domain adaptation. Complete PyTorch implementation, dataset preparation scripts, and trained weights are publicly available.
Chinese Translation
鸟类图像分割在计算机视觉中仍然是一项具有挑战性的任务,原因在于极端的姿态多样性、复杂的羽毛图案以及变化的光照条件。本文提出了一种利用2025年基础模型的双管道框架,用于二元鸟类图像分割。我们介绍了两种操作模式,基于共享的冻结骨干网络 Segment Anything Model 2.1 (SAM 2.1):(1) 使用 Grounding DINO 1.5 的零样本管道,通过文本提示“bird”检测鸟类,然后用边界框提示 SAM 2.1,无需标注鸟类数据;(2) 监督管道在 CUB-200-2011 数据集上微调 YOLOv11,以实现高精度检测,再次提示 SAM 2.1 生成像素级掩码。分割模型从未针对新物种或领域进行重训练。在 CUB-200-2011 数据集(11,788 张图像,200 个物种)上,监督管道实现了 IoU 0.912、Dice 0.954 和 F1 0.953,超越了所有先前的基准,包括 SegFormer-B2(IoU 0.842),提高了 7.0 个百分点。零样本管道仅使用文本提示实现了 IoU 0.831,这是在该基准上报告的首个此类结果。我们证明了基于提示的基础模型管道优于特定任务的端到端训练分割网络,同时仅需轻量级检测器微调(约 1 小时)即可实现领域适应。完整的 PyTorch 实现、数据集准备脚本和训练权重均已公开。
cs.CV / 41 / 2603.00188
Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression
通过无训练的 KV 缓存压缩实现高效的长时间跨度 GUI 代理
Abstract
Large Vision-Language Models (VLMs) have emerged as powerful engines for autonomous GUI agents, yet their deployment is severely constrained by the substantial memory footprint and latency of the Key-Value (KV) cache during long-horizon interactions. While existing cache compression methods have proven effective for LLMs, we empirically demonstrate that they suffer from suboptimal performance in GUI scenarios due to a fundamental misalignment: unlike general visual tasks where attention sparsity varies across layers, GUI attention patterns exhibit uniform high-sparsity across all transformer layers. Motivated by this insight, we propose ST-Lite, a training-free KV cache compression framework tailored for efficient GUI agents that explicitly addresses the dynamic spatio-trajectory dependencies within GUI data streams. ST-Lite introduces a novel dual-branch scoring policy incorporating Component-centric Spatial Saliency (CSS) and Trajectory-aware Semantic Gating (TSG). Specifically, CSS preserves the structural integrity of interactive UI elements by evaluating local neighborhood saliency, while TSG mitigates historical redundancy by dynamically filtering visually repetitive KV pairs within the interaction trajectory. Extensive evaluations demonstrate that with only a 10-20% cache budget, ST-Lite achieves a 2.45x decoding acceleration while maintaining comparable or even superior performance compared to full-cache baselines, offering a scalable solution for resource-constrained GUI agents.
Chinese Translation
大型视觉-语言模型(VLMs)已成为自主 GUI 代理的强大引擎,但在长时间交互中,由于键值(KV)缓存的巨大内存占用和延迟,其部署受到严重限制。尽管现有的缓存压缩方法在大语言模型(LLMs)中已被证明有效,但我们实证表明,它们在 GUI 场景中表现不佳,原因在于基本的不匹配:与一般视觉任务中注意力稀疏性在各层之间变化不同,GUI 注意力模式在所有变换器层中表现出均匀的高稀疏性。基于这一见解,我们提出了 ST-Lite,一种无训练的 KV 缓存压缩框架,专为高效的 GUI 代理量身定制,明确解决 GUI 数据流中的动态时空轨迹依赖性。ST-Lite 引入了一种新颖的双分支评分策略,结合了以组件为中心的空间显著性(CSS)和轨迹感知的语义门控(TSG)。具体而言,CSS 通过评估局部邻域显著性来保持交互 UI 元素的结构完整性,而 TSG 通过动态过滤交互轨迹中视觉上重复的 KV 对来减轻历史冗余。广泛的评估表明,仅使用 10-20% 的缓存预算,ST-Lite 实现了 2.45 倍的解码加速,同时保持与全缓存基线相当甚至更优的性能,为资源受限的 GUI 代理提供了可扩展的解决方案。
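As a schematic of dual-branch KV selection under a fixed cache budget, the sketch below linearly combines a per-entry spatial-saliency score and a trajectory-redundancy score and keeps the top fraction. The linear combination and its weight are assumptions for illustration; the paper's actual CSS/TSG scoring functions are not reproduced here:

```python
import numpy as np

def select_kv(css_scores, tsg_scores, budget=0.2, weight=0.5):
    """Keep the top `budget` fraction of cached KV entries, ranked by a
    weighted combination of a component-centric spatial-saliency score
    and a trajectory-redundancy score. The combination rule and weight
    are illustrative assumptions, not the paper's CSS/TSG definitions.
    """
    score = weight * np.asarray(css_scores) + (1 - weight) * np.asarray(tsg_scores)
    k = max(1, int(round(budget * score.size)))
    keep = np.argsort(score)[::-1][:k]          # indices of highest scores
    return np.sort(keep)

css = [0.9, 0.1, 0.5, 0.2, 0.8]
tsg = [0.8, 0.2, 0.4, 0.1, 0.9]
print(select_kv(css, tsg, budget=0.4))          # entries 0 and 4 survive
```

Because GUI attention is uniformly sparse across layers, a single budget like this can be applied per layer without the layer-wise tuning dense-vision methods need.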
cs.CV / 42 / 2603.00194
SKeDA: A Generative Watermarking Framework for Text-to-video Diffusion Models
SKeDA:一种针对文本到视频扩散模型的生成水印框架
Abstract
The rise of text-to-video generation models has raised growing concerns over content authenticity, copyright protection, and malicious misuse. Watermarking serves as an effective mechanism for regulating such AI-generated content, where high fidelity and strong robustness are particularly critical. Recent generative image watermarking methods provide a promising foundation by leveraging watermark information and pseudo-random keys to control the initial sampling noise, enabling lossless embedding. However, directly extending these techniques to videos introduces two key limitations: (1) existing designs implicitly rely on strict alignment between video frames and the frame-dependent pseudo-random binary sequences used for watermark encryption; once this alignment is disrupted, subsequent watermark extraction becomes unreliable; and (2) video-specific distortions, such as inter-frame compression, significantly degrade watermark reliability. To address these issues, we propose SKeDA, a generative watermarking framework tailored for text-to-video diffusion models. SKeDA consists of two components: (1) Shuffle-Key-based Distribution-preserving Sampling (SKe) employs a single base pseudo-random binary sequence for watermark encryption and derives frame-level encryption sequences through permutation. This design transforms watermark extraction from synchronization-sensitive sequence decoding into permutation-tolerant set-level aggregation, substantially improving robustness against frame reordering and loss; and (2) Differential Attention (DA), which computes inter-frame differences and dynamically adjusts attention weights during extraction, enhancing robustness against temporal distortions. Extensive experiments demonstrate that SKeDA preserves high video generation quality and watermark robustness.
Chinese Translation
文本到视频生成模型的兴起引发了对内容真实性、版权保护和恶意滥用的日益关注。水印技术作为调节此类人工智能生成内容的有效机制,其高保真性和强鲁棒性尤为关键。近期的生成图像水印方法通过利用水印信息和伪随机密钥来控制初始采样噪声,为无损嵌入提供了有希望的基础。然而,直接将这些技术扩展到视频中会引入两个主要限制:现有设计隐含地依赖于视频帧与用于水印加密的帧依赖伪随机二进制序列之间的严格对齐。一旦这种对齐被破坏,后续水印提取就变得不可靠;而视频特有的失真,如帧间压缩,显著降低了水印的可靠性。为了解决这些问题,我们提出了SKeDA,这是一种针对文本到视频扩散模型量身定制的生成水印框架。SKeDA由两个组件组成:(1)基于洗牌密钥的分布保持采样(Shuffle-Key-based Distribution-preserving Sampling, SKe)采用单一基础伪随机二进制序列进行水印加密,并通过置换推导出帧级加密序列。该设计将水印提取从对同步敏感的序列解码转变为对置换容忍的集合级聚合,显著提高了对帧重排序和丢失的鲁棒性;(2)差分注意力(Differential Attention, DA),计算帧间差异并在提取过程中动态调整注意力权重,提高了对时间失真的鲁棒性。大量实验表明,SKeDA在保持高视频生成质量的同时,增强了水印的鲁棒性。
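The shuffle-key construction can be illustrated directly: one base binary key is permuted per frame, so every frame carries the same multiset of watermark bits, and extraction can aggregate at the set level, tolerating frame reordering or loss. A sketch under these assumptions (the paper's distribution-preserving sampling is not shown):

```python
import numpy as np

def frame_keys(base_key, num_frames, seed=0):
    """Derive per-frame encryption sequences as permutations of a single
    base key. Because each frame holds the same multiset of bits, the
    decoder can aggregate recovered bits as a set instead of decoding a
    synchronization-sensitive per-frame sequence. Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    return [base_key[rng.permutation(base_key.size)] for _ in range(num_frames)]

base = np.random.default_rng(42).integers(0, 2, size=16)
keys = frame_keys(base, num_frames=8)
print(all(sorted(k) == sorted(base) for k in keys))     # True
```

Dropping or reordering frames changes which permutations survive but not the aggregated bit multiset, which is the robustness property the abstract claims.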
cs.CV / 43 / 2603.00197
A Case Study on Concept Induction for Neuron-Level Interpretability in CNN
关于卷积神经网络中神经元级可解释性的概念归纳案例研究
Abstract
Deep Neural Networks (DNNs) have advanced applications in domains such as healthcare, autonomous systems, and scene understanding, yet the internal semantics of their hidden neurons remain poorly understood. Prior work introduced a Concept Induction-based framework for hidden neuron analysis and demonstrated its effectiveness on the ADE20K dataset. In this case study, we investigate whether the approach generalizes by applying it to the SUN2012 dataset, a large-scale scene recognition benchmark. Using the same workflow, we assign interpretable semantic labels to neurons and validate them through web-sourced images and statistical testing. Our findings confirm that the method transfers to SUN2012, showing its broader applicability.
Chinese Translation
深度神经网络(DNN)在医疗、自动化系统和场景理解等领域取得了显著进展,但其隐藏神经元的内部语义仍然不够清晰。之前的研究提出了一种基于概念归纳的框架用于隐藏神经元分析,并在ADE20K数据集上展示了其有效性。在本案例研究中,我们通过将该方法应用于SUN2012数据集——一个大规模场景识别基准,来探讨该方法的普适性。使用相同的工作流程,我们为神经元分配可解释的语义标签,并通过网络来源的图像和统计测试对其进行验证。我们的研究结果确认该方法可以迁移到SUN2012,显示出其更广泛的适用性。
cs.CV / 44 / 2603.00198
Stateful Token Reduction for Long-Video Hybrid VLMs
用于长视频混合视觉语言模型的状态感知令牌减少
Abstract
Token reduction is an effective way to accelerate long-video vision-language models (VLMs), but most existing methods are designed for dense Transformers and do not directly account for hybrid architectures that interleave attention with linear-time state-space blocks (e.g., Mamba). We study query-conditioned token reduction for hybrid video VLMs and analyze reduction behavior through two properties: layerwise sparsity (how many tokens capture query-relevant information) and importance stability (whether token-importance rankings persist across depth). Although token importance is sparse within each layer, the set of important tokens changes across layers, so aggressive early pruning is unreliable. Motivated by this, we propose a low-to-high progressive reduction schedule and a unified language-aware scoring mechanism for both attention and Mamba blocks (using an implicit-attention proxy for Mamba), enabling all-layer token reduction in hybrids. Under an aggressive compression setting (retaining 25% of visual tokens), our approach delivers substantial prefilling speedups (3.8--4.2x) with near-baseline accuracy at test time, and light finetuning under reduction further improves performance on long-context video benchmarks.
Chinese Translation
令牌减少是一种加速长视频视觉语言模型(VLMs)的有效方法,但大多数现有方法是为密集型变换器设计的,并未直接考虑与线性时间状态空间块(例如,Mamba)交错的混合架构。我们研究了针对混合视频 VLMs 的查询条件令牌减少,并通过两个属性分析其减少行为:层级稀疏性(多少令牌捕获与查询相关的信息)和重要性稳定性(令牌重要性排名是否在不同层次间保持一致)。尽管每层内的令牌重要性是稀疏的,但重要令牌的集合在不同层次间变化,因此激进的早期剪枝是不可靠的。基于此,我们提出了一种低到高的渐进式减少调度和统一的语言感知评分机制,适用于注意力和 Mamba 块(使用隐式注意力代理来处理 Mamba),从而实现混合模型中的全层令牌减少。在激进的压缩设置下(保留 25% 的视觉令牌),我们的方法在测试时提供了显著的预填充加速(3.8-4.2 倍),并在减少下的轻量微调进一步提高了长上下文视频基准的性能。
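Because important-token sets shift across layers, the paper prunes gently early and aggressively late. A low-to-high retention ramp can be sketched as below; the linear interpolation is an illustrative assumption, since the exact schedule is not specified in the abstract:

```python
def retention_schedule(num_layers, final_keep=0.25):
    """Low-to-high progressive token reduction: retain (almost) all tokens
    in early layers, where important-token sets are still shifting, and
    interpolate down to the final keep ratio at the last layer. The
    linear ramp is an illustrative assumption, not the paper's schedule.
    """
    if num_layers == 1:
        return [final_keep]
    return [1.0 + (final_keep - 1.0) * i / (num_layers - 1)
            for i in range(num_layers)]

print(retention_schedule(4))                   # [1.0, 0.75, 0.5, 0.25]
```

With `final_keep=0.25` the last layer matches the paper's aggressive setting of retaining 25% of visual tokens.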
cs.CV / 45 / 2603.00201
AdURA-Net: Adaptive Uncertainty and Region-Aware Network
AdURA-Net:自适应不确定性与区域感知网络
Abstract
One of the common issues in clinical decision-making is the presence of uncertainty, which often arises from ambiguity in radiology reports, reflecting genuine diagnostic uncertainty or limitations of automated label extraction in complex cases. This is especially true for multilabel datasets such as CheXpert and MIMIC-CXR, which contain positive, negative, and uncertain labels. In clinical decision-making, the uncertain label plays a tricky role, as the model should not be forced to provide a confident prediction in the absence of sufficient evidence. The ability of the model to say it does not understand whenever it is not confident is crucial, especially in high-risk clinical decision-making. Here, we propose AdURA-Net, a geometry-driven adaptive uncertainty-aware framework for reliable thoracic disease classification. The key highlights of the proposed model are: a) adaptive dilated convolution and multiscale deformable alignment coupled with a DenseNet backbone, capturing the anatomical complexities of medical images; and b) a Dual Head Loss, which combines masked binary cross entropy with logits and a Dirichlet evidential learning objective.
Chinese Translation
临床决策中常见的问题之一是存在不确定性,这通常源于放射学报告中的模糊性,这些模糊性往往反映了真正的诊断不确定性或在各种复杂案例中自动标签提取的局限性。尤其是在多标签数据集的情况下,如 CheXpert、MIMIC-CXR 等,这些数据集包含正面、负面和不确定等标签。在临床决策中,不确定标签扮演着复杂的角色,因为模型在缺乏足够证据的情况下不应被迫提供自信的预测。模型能够在不自信时表示其不理解的能力至关重要,尤其是在涉及高风险的临床决策中。在此,我们提出了 AdURA-Net,一种基于几何驱动的自适应不确定性感知框架,用于可靠的胸部疾病分类。所提模型的关键亮点包括:a) 自适应膨胀卷积和多尺度可变形对齐,结合主干网络 Densenet 架构,捕捉医学图像的解剖复杂性;b) 双头损失(Dual Head Loss),将掩蔽的二元交叉熵与逻辑回归结合,并引入 Dirichlet 证据学习目标。
cs.CV / 46 / 2603.00206
TACIT Benchmark: A Programmatic Visual Reasoning Benchmark for Generative and Discriminative Models
TACIT基准:用于生成模型和判别模型的程序化视觉推理基准
Abstract
Existing visual reasoning benchmarks predominantly rely on natural language prompts, evaluate narrow reasoning modalities, or depend on subjective scoring procedures such as LLM-as-judge. We introduce the TACIT Benchmark, a programmatic visual reasoning benchmark comprising 10 tasks across 6 reasoning domains: spatial navigation, abstract pattern completion, causal simulation, logical constraint satisfaction, graph theory, and topology. The benchmark provides dual-track evaluation: a generative track in which models must produce solution images verified through deterministic computer-vision pipelines, and a discriminative track offering five-way multiple choice with structurally plausible near-miss distractors. Each distractor violates exactly one structural constraint, requiring models to reason about fine-grained visual differences rather than exploit superficial cues. Version 0.1.0 distributes 6,000 puzzles (108,000 PNG images across three resolutions) with fully deterministic seeded generation and reproducible verification. The dataset, generation code, and evaluation harness are released under the Apache 2.0 license on HuggingFace (DOI: 10.57967/hf/7904).
Chinese Translation
现有的视觉推理基准主要依赖自然语言提示,评估狭窄的推理模式,或依赖主观评分程序,如LLM-as-judge。我们介绍了TACIT基准,这是一个包含6个推理领域(空间导航、抽象模式补全、因果模拟、逻辑约束满足、图论和拓扑)的10个任务的程序化视觉推理基准。该基准提供双轨评估:生成轨道要求模型生成通过确定性计算机视觉管道验证的解决方案图像,而判别轨道则提供五选一的多项选择,具有结构上合理的近似干扰项。每个干扰项恰好违反一个结构约束,要求模型推理细微的视觉差异,而不是利用表面的线索。版本0.1.0分发了6,000个难题(108,000个PNG图像,分为三种分辨率),具有完全确定性的种子生成和可重复的验证。数据集、生成代码和评估工具在HuggingFace上根据Apache 2.0许可证发布(DOI: 10.57967/hf/7904)。
cs.CV / 47 / 2603.00207
VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models
VisRef:思维过程中的视觉重聚焦提升多模态大规模推理模型的测试时间扩展
Abstract
Advances in large reasoning models have shown strong performance on complex reasoning tasks by scaling test-time compute through extended reasoning. However, recent studies observe that in vision-dependent tasks, extended textual reasoning at inference time can degrade performance as models progressively lose attention to visual tokens and increasingly rely on textual priors alone. To address this, prior works use reinforcement learning (RL)-based fine-tuning to route visual tokens or employ refocusing mechanisms during reasoning. While effective, these methods are computationally expensive, requiring large-scale data generation and policy optimization. To leverage the benefits of test-time compute without additional RL fine-tuning, we propose VisRef, a visually grounded test-time scaling framework. Our key idea is to actively guide the reasoning process by re-injecting a coreset of visual tokens that are semantically relevant to the reasoning context while remaining diverse and globally representative of the image, enabling more grounded multi-modal reasoning. Experiments on three visual reasoning benchmarks with state-of-the-art multi-modal large reasoning models demonstrate that, under fixed test-time compute budgets, VisRef consistently outperforms existing test-time scaling approaches by up to 6.4%.
Chinese Translation
大规模推理模型的进展在复杂推理任务中表现出色,通过扩展推理来提升测试时间计算。然而,近期研究观察到,在依赖视觉的任务中,推理时的扩展文本推理可能会降低性能,因为模型逐渐对视觉标记失去关注,越来越依赖文本先验。为了解决这个问题,之前的研究使用基于强化学习(RL)的微调来引导视觉标记,或在推理过程中采用重聚焦机制。尽管这些方法有效,但计算成本高,需进行大规模数据生成和策略优化。为了在不增加RL微调的情况下利用测试时间计算的优势,我们提出了VisRef,一个视觉基础的测试时间扩展框架。我们的关键思想是通过重新注入与推理上下文语义相关的视觉标记核心集,积极引导推理过程,同时保持多样性并在全球范围内代表图像,从而实现更为扎实的多模态推理。在三个视觉推理基准上进行的实验表明,在固定的测试时间计算预算下,VisRef的表现始终优于现有的测试时间扩展方法,提升幅度可达6.4%。
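Selecting a token coreset that is both query-relevant and diverse is classically done with maximal marginal relevance (MMR). The sketch below uses MMR as a stand-in for VisRef's coreset criterion, which the abstract does not specify:

```python
def mmr_coreset(relevance, sim, k, lam=0.7):
    """Greedy maximal-marginal-relevance selection: pick tokens relevant
    to the reasoning context while penalizing similarity to tokens
    already chosen. MMR and lam are illustrative assumptions standing in
    for VisRef's unspecified coreset criterion.
    """
    chosen = []
    cand = set(range(len(relevance)))
    while cand and len(chosen) < k:
        def score(i):
            redundancy = max(sim[i][j] for j in chosen) if chosen else 0.0
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(cand, key=score)
        chosen.append(best)
        cand.remove(best)
    return chosen

# Tokens 0 and 1 are near-duplicates (sim = 1), so diversity picks token 2
print(mmr_coreset([0.9, 0.88, 0.5], [[1, 1, 0], [1, 1, 0], [0, 0, 1]], k=2))
```

The redundancy penalty is what keeps the re-injected tokens "diverse and globally representative of the image" rather than a cluster of near-duplicates around the single most relevant region.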
cs.CV / 48 / 2603.00217
Physical Evaluation of Naturalistic Adversarial Patches for Camera-Based Traffic-Sign Detection
自然对抗补丁在基于摄像头的交通标志检测中的物理评估
Abstract
This paper studies how well Naturalistic Adversarial Patches (NAPs) transfer to a physical traffic-sign setting when the detector is trained on a customized dataset for an autonomous vehicle (AV) environment. We construct a composite dataset, CompGTSRB (a dataset customized for the AV environment), by pasting traffic-sign instances from the German Traffic Sign Recognition Benchmark (GTSRB) onto undistorted backgrounds captured from the target platform. CompGTSRB is used to train a YOLOv5 model and to generate patches using a Generative Adversarial Network (GAN) with latent-space optimization, following existing NAP methods. We carried out a series of experiments on our Quanser QCar testbed utilizing the front CSI camera provided in the QCar. Across configurations, which vary distance, patch size, and patch placement, NAPs reduce the detector's STOP-class confidence. These results, along with a detailed step-by-step methodology, demonstrate the utility of the CompGTSRB dataset and the proposed systematic physical protocols for credible patch evaluation. The research further motivates work on defenses that address localized patch corruption in embedded perception pipelines.
Chinese Translation
本文研究了自然对抗补丁(Naturalistic Adversarial Patches, NAPs)在物理交通标志环境中的转移效果,特别是在检测器使用为自主车辆(Autonomous Vehicle, AV)环境定制的数据集进行训练时的表现。我们通过将德国交通标志识别基准(German Traffic Sign Recognition Benchmark, GTSRB)中的交通标志实例粘贴到从目标平台捕获的无失真背景上,构建了一个复合数据集CompGTSRB(为AV环境定制的数据集)。CompGTSRB用于训练YOLOv5模型,并使用生成对抗网络(Generative Adversarial Network, GAN)进行潜在空间优化,生成补丁,遵循现有的NAP方法。我们在Quanser QCar测试平台上进行了一系列实验,利用QCar提供的前置CSI摄像头。在不同配置下,NAPs降低了检测器对STOP类别的置信度。不同配置包括距离、补丁大小和补丁位置。这些结果以及详细的逐步方法论表明了CompGTSRB数据集的实用性和所提出的系统物理协议在可信补丁评估中的应用价值。该研究进一步激励了针对嵌入式感知管道中局部补丁破坏的防御研究。
cs.CV / 49 / 2603.00223
Pretty Good Measurement for Radiomics: A Quantum-Inspired Multi-Class Classifier for Lung Cancer Subtyping and Prostate Cancer Risk Stratification
优秀测量在放射组学中的应用:一种量子启发的多类分类器用于肺癌亚型分类和前列腺癌风险分层
Abstract
We investigate a quantum-inspired approach to supervised multi-class classification based on the \emph{Pretty Good Measurement} (PGM), viewed as an operator-valued decision rule derived from quantum state discrimination. The method associates each class with an encoded mixed state and performs classification through a single POVM construction, thus providing a genuinely multi-class strategy without reduction to pairwise or one-vs-rest schemes. In this perspective, classification is reformulated as the discrimination of a finite ensemble of class-dependent density operators, with performance governed by the geometry induced by the encoding map and by the overlap structure among classes. To assess the practical scope of this framework, we apply the PGM-based classifier to two biomedical radiomics case studies: histopathological subtyping of non-small-cell lung carcinoma (NSCLC) and prostate cancer (PCa) risk stratification. The evaluation is conducted under protocols aligned with previously reported radiomics studies, enabling direct comparison with established classical baselines. The results show that the PGM-based classifier is consistently competitive and, in several settings, improves upon standard methods. In particular, the method performs especially well in the NSCLC binary and three-class tasks, while remaining competitive in the four-class case, where increased class overlap yields a more demanding discrimination geometry. In the PCa study, the PGM classifier remains close to the strongest ensemble baseline and exhibits clinically relevant sensitivity--specificity trade-offs across feature-selection scenarios.
Chinese Translation
我们研究了一种基于“优秀测量”(Pretty Good Measurement, PGM)的量子启发式监督多类分类方法,该方法被视为从量子态区分中推导出的算子值决策规则。该方法将每个类别与编码的混合态关联,并通过单一的POVM(正算子值测量)构造进行分类,从而提供一种真正的多类策略,而无需简化为成对或一对多的方案。在这一视角下,分类被重新表述为对有限的类别依赖密度算子的区分,其性能受编码映射所诱导的几何结构和类别之间重叠结构的影响。为了评估该框架的实际应用范围,我们将基于PGM的分类器应用于两个生物医学放射组学案例研究:非小细胞肺癌(NSCLC)的组织病理学亚型分类和前列腺癌(PCa)风险分层。评估是在与之前报告的放射组学研究一致的协议下进行的,从而能够与已建立的经典基准进行直接比较。结果表明,基于PGM的分类器始终具有竞争力,并且在多个设置中优于标准方法。特别是,该方法在NSCLC的二分类和三分类任务中表现尤为出色,同时在四分类任务中保持竞争力,尽管类别重叠增加导致更具挑战性的区分几何。在PCa研究中,PGM分类器与最强的集成基准保持接近,并在特征选择场景中展现出临床相关的灵敏度-特异性权衡。
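The PGM construction itself is compact: with class priors p_i and encoded states ρ_i, set ρ = Σ_i p_i ρ_i and take POVM elements E_i = ρ^{-1/2} (p_i ρ_i) ρ^{-1/2}; a test point encoded as state σ is assigned to argmax_i Tr(E_i σ). A NumPy sketch of this standard construction (the paper's radiomics-feature-to-state encoding is not shown):

```python
import numpy as np

def pgm_povm(states, priors):
    """POVM elements of the Pretty Good Measurement:
    E_i = rho^{-1/2} (p_i rho_i) rho^{-1/2} with rho = sum_i p_i rho_i.
    A pseudo-inverse square root handles rank-deficient rho. Sketch of
    the textbook construction, not the paper's full pipeline.
    """
    rho = sum(p * s for p, s in zip(priors, states))
    w, v = np.linalg.eigh(rho)
    inv_sqrt_vals = np.array([1.0 / np.sqrt(x) if x > 1e-12 else 0.0 for x in w])
    inv_sqrt = (v * inv_sqrt_vals) @ v.conj().T     # rho^{-1/2}
    return [inv_sqrt @ (p * s) @ inv_sqrt for p, s in zip(priors, states)]

def pgm_classify(sigma, povm):
    """Pick the class whose POVM element has the largest overlap Tr(E_i sigma)."""
    return int(np.argmax([np.trace(e @ sigma).real for e in povm]))

s0 = np.diag([1.0, 0.0])                       # class-0 pure state
s1 = np.diag([0.0, 1.0])                       # class-1 pure state
povm = pgm_povm([s0, s1], [0.5, 0.5])
print(pgm_classify(s0, povm), pgm_classify(s1, povm))   # 0 1
```

Because all classes enter one POVM that sums to the identity, the rule is genuinely multi-class, as the abstract stresses, rather than a reduction to pairwise or one-vs-rest decisions.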
cs.CV / 50 / 2603.00266
Adversarial Patch Generation for Visual-Infrared Dense Prediction Tasks via Joint Position-Color Optimization
通过联合位置-颜色优化生成视觉-红外密集预测任务的对抗补丁
Abstract
Multimodal adversarial attacks for dense prediction remain largely underexplored. In particular, visual-infrared (VI) perception systems introduce unique challenges due to heterogeneous spectral characteristics and modality-specific intensity distributions. Existing adversarial patch methods are primarily designed for single-modal inputs and fail to account for cross-spectral inconsistencies, leading to reduced attack effectiveness and poor stealthiness when applied to VI dense prediction models. To address these challenges, we propose a joint position-color optimization framework (AP-PCO) for generating adversarial patches in visual-infrared settings. The proposed method optimizes patch placement and color composition simultaneously using a fitness function derived from model outputs, enabling a single patch to perturb both visible and infrared modalities. To further bridge spectral discrepancies, we introduce a cross-modal color adaptation strategy that constrains patch appearance according to infrared grayscale characteristics while maintaining strong perturbations in the visible domain, thereby reducing cross-spectral saliency. The optimization procedure operates without requiring internal model information, supporting flexible black-box attacks. Extensive experiments on visual-infrared dense prediction tasks demonstrate that the proposed AP-PCO achieves consistently strong attack performance across multiple architectures, providing a practical benchmark for robustness evaluation in VI perception systems.
Chinese Translation
多模态对抗攻击在密集预测领域仍然未得到充分探索。特别是,视觉-红外(VI)感知系统由于光谱特性异质和模态特定的强度分布,带来了独特的挑战。现有的对抗补丁方法主要针对单模态输入设计,未能考虑跨光谱不一致性,导致在应用于VI密集预测模型时攻击效果降低且隐蔽性差。为了解决这些挑战,我们提出了一种联合位置-颜色优化框架(AP-PCO),用于在视觉-红外环境中生成对抗补丁。该方法通过从模型输出中导出的适应度函数,同时优化补丁的放置和颜色组合,使单个补丁能够扰动可见和红外模态。为了进一步弥合光谱差异,我们引入了一种跨模态颜色适应策略,根据红外灰度特性约束补丁外观,同时在可见域内保持强扰动,从而减少跨光谱显著性。优化过程无需内部模型信息,支持灵活的黑箱攻击。在视觉-红外密集预测任务上的大量实验表明,所提出的AP-PCO在多种架构中实现了一致的强攻击性能,为VI感知系统的鲁棒性评估提供了实用的基准。
cs.CV / 51 / 2603.00273
Ozone Cues Mitigate Reflected Downwelling Radiance in LWIR Absorption-Based Ranging
臭氧信号减轻基于LWIR吸收测距中的反射下行辐射
Abstract
Passive long-wave infrared (LWIR) absorption-based ranging relies on atmospheric absorption to estimate distances to objects from their emitted thermal radiation. First demonstrated decades ago for objects much hotter than the air and recently extended to scenes with low temperature variations, this ranging has depended on reflected radiance being negligible. Downwelling radiance is especially problematic, sometimes causing large inaccuracies. In two new ranging methods, we use characteristic features from ozone absorption to estimate the contribution of reflected downwelling radiance. The quadspectral method gives a simple closed-form range estimate from four narrowband measurements, two at a water vapor absorption line and two at an ozone absorption line. The hyperspectral method uses a broader spectral range to improve accuracy while also providing estimates of temperature, emissivity profiles, and contributions of downwelling from a collection of zenith angles. Experimental results demonstrate improved ranging accuracy, in one case reducing error from over 100 m when reflected light is not modeled to 6.8 m with the quadspectral method and 1.2 m with the hyperspectral method.
Chinese Translation
被动长波红外(LWIR)基于吸收的测距依赖大气吸收,根据物体发出的热辐射估算到物体的距离。这种测距方法几十年前首次在比空气温度高得多的物体上演示,最近扩展到温度变化较小的场景,但一直依赖于反射辐射可忽略不计这一假设。下行辐射尤其成问题,有时会导致较大的不准确性。在两种新的测距方法中,我们利用臭氧吸收的特征谱线来估算反射下行辐射的贡献。四谱法通过四个窄带测量提供简单的闭式范围估计,其中两个在水蒸气吸收线,两个在臭氧吸收线。高光谱法使用更广的光谱范围来提高准确性,同时还提供温度、发射率剖面和来自一组天顶角的下行辐射贡献的估计。实验结果表明测距准确性有所提高,在一种情况下,当未建模反射光时,误差从超过100米降低到四谱法的6.8米和高光谱法的1.2米。
cs.CV / 52 / 2603.00289
Seeking Necessary and Sufficient Information from Multimodal Medical Data
从多模态医学数据中寻求必要和充分的信息
Abstract
Learning multimodal representations from medical images and other data sources can provide richer information for decision-making. While various multimodal models have been developed for this, they overlook learning features that are both necessary (must be present for the outcome to occur) and sufficient (enough to determine the outcome). We argue learning such features is crucial as they can improve model performance by capturing essential predictive information, and enhance model robustness to missing modalities as each modality can provide adequate predictive signals. Such features can be learned by leveraging the Probability of Necessity and Sufficiency (PNS) as a learning objective, an approach that has proven effective in unimodal settings. However, extending PNS to multimodal scenarios remains underexplored and is non-trivial as key conditions of PNS estimation are violated. We address this by decomposing multimodal representations into modality-invariant and modality-specific components, then deriving tractable PNS objectives for each. Experiments on synthetic and real-world medical datasets demonstrate our method's effectiveness. Code will be available on GitHub.
Chinese Translation
从医学图像和其他数据源学习多模态表示可以为决策提供更丰富的信息。尽管已经开发了各种多模态模型,但它们忽视了学习既必要(必须存在以导致结果发生)又充分(足以决定结果)的特征。我们认为,学习这样的特征至关重要,因为它们可以通过捕捉基本的预测信息来提高模型性能,并增强模型对缺失模态的鲁棒性,因为每个模态都可以提供足够的预测信号。通过利用必要性和充分性概率(Probability of Necessity and Sufficiency, PNS)作为学习目标,可以学习这些特征,这种方法在单模态环境中已被证明有效。然而,将PNS扩展到多模态场景仍然未被充分探索,并且并非易事,因为PNS估计的关键条件被违反。我们通过将多模态表示分解为模态不变和模态特定的组件来解决这一问题,然后为每个组件推导可处理的PNS目标。在合成和真实世界医学数据集上的实验表明了我们方法的有效性。代码将在GitHub上发布。
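For readers unfamiliar with the learning objective named above, Pearl's Probability of Necessity and Sufficiency can be stated as follows; the closed form on the second line holds only under exogeneity and monotonicity, which is one reason extending PNS estimation to multimodal settings is non-trivial when such conditions are violated.

```latex
% Counterfactual definition (X binary with values x, x'):
\mathrm{PNS} \;=\; P\!\left(Y_{x} = 1,\; Y_{x'} = 0\right)
% Under exogeneity and monotonicity (Y_{x'} \le Y_{x}), PNS is identifiable:
\mathrm{PNS} \;=\; P(Y = 1 \mid X = x) \;-\; P(Y = 1 \mid X = x')
```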
cs.CV / 53 / 2603.00324
Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees
感知证明:具有组合一致性保证的认证工具使用多模态推理
Abstract
We present Proof-of-Perception (PoP), a tool-using framework that casts multimodal reasoning as an executable graph with explicit reliability guarantees. Each perception or logic node outputs a conformal set, yielding calibrated, stepwise uncertainty; a lightweight controller uses these certificates to allocate compute under a budget, expanding with extra tool calls only when needed and stopping early otherwise. This grounds answers in verifiable evidence, reduces error compounding and hallucinations, and enables principled accuracy-compute trade-offs. Across document, chart, and multi-image QA benchmarks, PoP improves performance and reliability over strong chain-of-thought, ReAct-style, and program-of-thought baselines while using computation more efficiently.
Chinese Translation
我们提出了感知证明(Proof-of-Perception, PoP),这是一个将多模态推理视为具有明确可靠性保证的可执行图的工具使用框架。每个感知或逻辑节点输出一个一致性集合,产生经过校准的逐步不确定性;一个轻量级控制器利用这些证书在预算内分配计算资源,仅在需要时扩展额外的工具调用,否则提前停止。这使得答案基于可验证的证据,减少了错误累积和幻觉,并实现了原则性的准确性与计算资源的权衡。在文档、图表和多图像问答基准测试中,PoP在性能和可靠性上优于强链式思维、ReAct风格和思维程序基线,同时更有效地使用计算资源。
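The conformal sets that PoP attaches to each node can be illustrated with a standard split-conformal construction. The sketch below is hypothetical (not PoP's actual calibration): for each test input it returns the set of labels whose nonconformity score falls under a calibration-set quantile, giving marginal coverage of at least 1 − α.

```python
import numpy as np

def conformal_set(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction: calibrate a score threshold so the
    returned label set covers the true label with prob >= 1 - alpha."""
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true label.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of calibration scores.
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    # Prediction set: all labels whose score is within the threshold.
    return [np.where(1.0 - p <= q)[0].tolist() for p in test_probs]

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(3), size=100)   # toy classifier outputs
cal_labels = rng.integers(0, 3, size=100)
sets = conformal_set(cal_probs, cal_labels, rng.dirichlet(np.ones(3), size=5))
```

In PoP's framing, a larger set signals higher stepwise uncertainty, which is what the controller uses to decide whether to spend extra tool calls.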
cs.CV / 54 / 2603.00337
Diffusion-Based Low-Light Image Enhancement with Color and Luminance Priors
基于扩散的低光照图像增强方法:结合颜色和亮度先验
Abstract
Low-light images often suffer from low contrast, noise, and color distortion, degrading visual quality and impairing downstream vision tasks. We propose a novel conditional diffusion framework for low-light image enhancement that incorporates a Structured Control Embedding Module (SCEM). SCEM decomposes a low-light image into four informative components including illumination, illumination-invariant features, shadow priors, and color-invariant cues. These components serve as control signals that condition a U-Net-based diffusion model trained with a simplified noise-prediction loss. Thus, the proposed SCEM-equipped diffusion method enforces structured enhancement guided by physical priors. In experiments, our model is trained only on the LOLv1 dataset and evaluated without fine-tuning on LOLv2-real, LSRW, DICM, MEF, and LIME. The method achieves state-of-the-art performance in quantitative and perceptual metrics, demonstrating strong generalization across benchmarks. https://casted.github.io/scem/.
Chinese Translation
低光照图像通常存在低对比度、噪声和颜色失真等问题,降低了视觉质量并影响下游视觉任务。我们提出了一种新颖的条件扩散框架用于低光照图像增强,该框架结合了结构化控制嵌入模块(Structured Control Embedding Module, SCEM)。SCEM将低光照图像分解为四个信息丰富的组件,包括照明、照明不变特征、阴影先验和颜色不变线索。这些组件作为控制信号,调节基于U-Net的扩散模型,该模型使用简化的噪声预测损失进行训练。因此,所提出的配备SCEM的扩散方法通过物理先验强制实施结构化增强。在实验中,我们的模型仅在LOLv1数据集上进行训练,并在不进行微调的情况下评估LOLv2-real、LSRW、DICM、MEF和LIME。该方法在定量和感知指标上实现了最先进的性能,展示了在各基准测试中的强泛化能力。
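The "simplified noise-prediction loss" mentioned above is the standard DDPM objective: corrupt the clean image to timestep t and regress the injected noise with a conditional denoiser. A minimal NumPy sketch follows, with a placeholder denoiser and `cond` standing in for SCEM's control signals (names are illustrative assumptions, not the paper's code).

```python
import numpy as np

def noise_prediction_loss(denoiser, x0, cond, t, alphas_cumprod, rng):
    """Simplified DDPM objective: corrupt x0 to x_t, then regress the noise.
    `cond` stands in for SCEM's control signals (illumination, priors, ...)."""
    eps = rng.standard_normal(x0.shape)
    a = alphas_cumprod[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps     # forward noising
    eps_hat = denoiser(x_t, cond, t)                   # conditional prediction
    return np.mean((eps - eps_hat) ** 2)               # MSE on the noise

rng = np.random.default_rng(0)
alpha_bar = np.linspace(0.9999, 0.98, 1000).cumprod()  # toy noise schedule
x0 = rng.standard_normal((8, 8))
# Placeholder denoiser that always predicts zero noise.
loss = noise_prediction_loss(lambda x, c, t: np.zeros_like(x),
                             x0, None, 500, alpha_bar, rng)
```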
cs.CV / 55 / 2603.00362
Percept-Aware Surgical Planning for Visual Cortical Prostheses with Vascular Avoidance
具有血管避让的感知意识外科规划用于视觉皮层假体
Abstract
Cortical visual prostheses aim to restore sight by electrically stimulating neurons in early visual cortex (V1). With the emergence of high-density and flexible neural interfaces, electrode placement within three-dimensional cortex has become a critical surgical planning problem. Existing strategies emphasize visual field coverage and anatomical heuristics but do not directly optimize predicted perceptual outcomes under safety constraints. We present a percept-aware framework for surgical planning of cortical visual prostheses that formulates electrode placement as a constrained optimization problem in anatomical space. Electrode coordinates are treated as learnable parameters and optimized end-to-end using a differentiable forward model of prosthetic vision. The objective minimizes task-level perceptual error while incorporating vascular avoidance and gray matter feasibility constraints. Evaluated on simulated reading and natural image tasks using realistic folded cortical geometry (FreeSurfer fsaverage), percept-aware optimization consistently improves reconstruction fidelity relative to coverage-based placement strategies. Importantly, vascular safety constraints eliminate margin violations while preserving perceptual performance. The framework further enables co-optimization of multi-electrode thread configurations under fixed insertion budgets. These results demonstrate how differentiable percept models can inform anatomically grounded, safety-aware computer-assisted planning for cortical neural interfaces and provide a foundation for optimizing next-generation visual prostheses.
Chinese Translation
皮层视觉假体旨在通过电刺激早期视觉皮层(V1)中的神经元来恢复视力。随着高密度和灵活神经接口的出现,电极在三维皮层内的放置已成为一个关键的外科规划问题。现有策略强调视觉场覆盖和解剖启发式,但并未在安全约束下直接优化预测的感知结果。我们提出了一种感知意识框架,用于皮层视觉假体的外科规划,将电极放置形式化为解剖空间中的约束优化问题。电极坐标被视为可学习参数,并通过可微分的假体视觉前向模型进行端到端优化。目标是最小化任务级感知误差,同时结合血管避让和灰质可行性约束。在使用真实折叠皮层几何(FreeSurfer fsaverage)进行的模拟阅读和自然图像任务评估中,感知意识优化相较于基于覆盖的放置策略,始终提高了重建保真度。重要的是,血管安全约束消除了边缘违规,同时保持了感知性能。该框架进一步支持在固定插入预算下的多电极线配置的共同优化。这些结果展示了可微分感知模型如何为基于解剖的、安全意识的计算机辅助规划提供信息,并为优化下一代视觉假体奠定基础。
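The constrained formulation above — electrode coordinates as free parameters, a perceptual objective, and a vascular safety margin — can be caricatured in 2D with gradient descent plus a hinge penalty. This toy sketch is not the paper's differentiable percept model; it simply attracts electrodes to target percept locations while repelling them from vessel positions inside the margin.

```python
import numpy as np

def plan_electrodes(targets, vessels, n_elec=4, margin=0.5,
                    steps=300, lr=0.05, seed=0):
    """Toy percept-aware planning: electrode coordinates are learnable
    parameters, optimized to cover target percept locations while a hinge
    penalty keeps them at least `margin` away from vessel positions."""
    rng = np.random.default_rng(seed)
    pos = rng.standard_normal((n_elec, 2))
    for _ in range(steps):
        # Attraction: each target pulls its nearest electrode toward it.
        d = np.linalg.norm(targets[:, None, :] - pos[None, :, :], axis=-1)
        nearest = d.argmin(axis=1)
        grad = np.zeros_like(pos)
        for i, t in zip(nearest, targets):
            grad[i] += pos[i] - t
        # Vascular avoidance: unit-strength push when inside the margin.
        dv = np.linalg.norm(pos[:, None, :] - vessels[None, :, :], axis=-1)
        for e in range(n_elec):
            for v in range(len(vessels)):
                if dv[e, v] < margin:
                    grad[e] -= (pos[e] - vessels[v]) / (dv[e, v] + 1e-8)
        pos -= lr * grad
    return pos

targets = np.array([[0., 0.], [2., 0.], [0., 2.], [2., 2.]])  # percepts
vessels = np.array([[1., 1.]])                                # keep-out point
pos = plan_electrodes(targets, vessels)
```

The paper replaces the quadratic attraction with task-level perceptual error backpropagated through a differentiable forward model of prosthetic vision.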
cs.CV / 56 / 2603.00372
Unsupervised Semantic Segmentation in Synchrotron Computed Tomography with Self-Correcting Pseudo Labels
基于自校正伪标签的同步辐射计算机断层扫描无监督语义分割
Abstract
X-ray computed tomography (CT) is a widely used imaging technique that provides detailed examination of the internal structure of an object, with synchrotron CT (SR-CT) enabling improved data quality by using higher-energy, monochromatic X-rays. While SR-CT allows for improved resolution, time-resolved experimentation, and reduced imaging artifacts, it also produces significantly larger datasets than conventional CT. Accurate and efficient evaluation of these datasets is a critical component of these workflows, yet it is often performed manually, representing a major bottleneck in the analysis phase. While deep learning has emerged as a powerful tool capable of providing a wide range of purely data-driven solutions, it requires a substantial amount of labeled data for training, and manual annotation of SR-CT datasets is impractical in practice. In this paper, we introduce a novel framework that enables automatic segmentation of large, high-resolution SR-CT datasets by eliminating the need to hand-label images for deep learning training. First, we generate pseudo labels by clustering voxel values, identifying regions in the volume with similar attenuation coefficients and producing an initial semantic map. Afterwards, we train a segmentation model on the pseudo labels before utilizing the Unbiased Teacher approach to self-correct them, ensuring accurate final segmentations. We find our approach improves pixel-wise accuracy and mIoU by 13.31% and 15.94%, respectively, over the baseline pseudo labels when using a magnesium crystal SR-CT sample. Additionally, we extensively evaluate the different components of our workflow, including the segmentation model, loss function, pseudo-labeling strategy, and input type. Finally, we evaluate our approach on two additional samples, highlighting our framework's ability to produce segmentations that are considerably better than the original pseudo labels.
Chinese Translation
X射线计算机断层扫描(CT)是一种广泛应用的成像技术,能够详细检查物体的内部结构,而同步辐射CT(SR-CT)通过使用更高能量的单色X射线提高了数据质量。虽然SR-CT允许更高的分辨率、时间分辨实验和减少成像伪影,但它也产生了比传统CT显著更大的数据集。对这些数据集的准确和高效评估是这些工作流程的关键组成部分;然而,评估通常是手动进行的,这在分析阶段代表了一个主要瓶颈。尽管深度学习已成为一种强大的工具,能够提供广泛的纯数据驱动解决方案,但它需要大量标记数据进行训练,而在实践中手动标注SR-CT数据集是不切实际的。在本文中,我们提出了一种新颖的框架,使得通过消除手动标注图像的需求,能够自动分割大型高分辨率SR-CT数据集。首先,我们通过对体素值进行聚类生成伪标签,识别具有相似衰减系数的体积区域,从而生成初始语义图。随后,我们在伪标签上训练分割模型,然后利用无偏教师(Unbiased Teacher)方法自我校正这些标签,以确保最终分割的准确性。我们发现,在使用镁晶体SR-CT样本时,我们的方法在像素级准确性和mIoU上分别比基线伪标签提高了13.31%和15.94%。此外,我们对工作流程的不同组件进行了广泛评估,包括分割模型、损失函数、伪标注策略和输入类型。最后,我们在另外两个样本上评估了我们的方法,突显了我们的框架能够生成明显优于原始伪标签的分割结果。
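The first stage described above — pseudo labels from clustering on voxel values — can be sketched with a plain 1-D k-means over intensities (a stand-in for attenuation coefficients). The clustering method and initialization here are illustrative; the paper's pseudo-labeling strategy may differ.

```python
import numpy as np

def pseudo_labels(volume, k=3, iters=50):
    """Initial semantic map: cluster voxel intensities with 1-D k-means.
    Centers are initialized at spread-out quantiles of the intensity range."""
    v = volume.ravel().astype(float)
    centers = np.quantile(v, np.linspace(0.1, 0.9, k))
    for _ in range(iters):
        labels = np.abs(v[:, None] - centers[None, :]).argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):          # avoid emptying a cluster
                centers[c] = v[labels == c].mean()
    return labels.reshape(volume.shape), np.sort(centers)

rng = np.random.default_rng(0)
# Synthetic volume: background, matrix, and bright inclusions.
vol = np.concatenate([rng.normal(0.0, 0.05, 400),
                      rng.normal(0.5, 0.05, 400),
                      rng.normal(1.0, 0.05, 200)]).reshape(10, 10, 10)
labels, centers = pseudo_labels(vol, k=3)
```

These noisy labels are then treated as training targets for the segmentation model, which the Unbiased Teacher stage subsequently self-corrects.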
cs.CV / 57 / 2603.00382
DiffSOS: Acoustic Conditional Diffusion Model for Speed-of-Sound Reconstruction in Ultrasound Computed Tomography
DiffSOS:用于超声计算机断层扫描中声速重建的声学条件扩散模型
Abstract
Accurate Speed-of-Sound (SoS) reconstruction from acoustic waveforms is a cornerstone of ultrasound computed tomography (USCT), enabling quantitative velocity mapping that reveals subtle anatomical details and pathological variations often invisible in conventional imaging. However, practical utility is hindered by the limitations of existing algorithms; traditional Full Waveform Inversion (FWI) is computationally intensive, while current deep learning approaches tend to produce oversmoothed results lacking fine details. We propose DiffSOS, a conditional diffusion model that directly maps acoustic waveforms to SoS maps. Our framework employs a specialized acoustic ControlNet to strictly ground the denoising process in physical wave measurements. To ensure structural consistency, we optimize a hybrid loss function that integrates noise prediction, spatial reconstruction, and noise frequency content. To accelerate inference, we employ stochastic Denoising Diffusion Implicit Model (DDIM) sampling, achieving near real-time reconstruction with only 10 steps. Crucially, we exploit the stochastic generative nature of our framework to estimate pixel-wise uncertainty, providing a measure of reliability that is often absent in deterministic approaches. Evaluated on the OpenPros USCT benchmark, DiffSOS significantly outperforms state-of-the-art networks, achieving an average Multi-scale Structural Similarity of 0.957. Our approach provides high-fidelity SoS maps with a principled measure of confidence, facilitating safer and faster clinical interpretation.
Chinese Translation
从声波形中准确重建声速(SoS)是超声计算机断层扫描(USCT)的基石,能够实现定量速度映射,揭示常规成像中常常不可见的细微解剖细节和病理变化。然而,现有算法的局限性阻碍了其实用性;传统的全波形反演(FWI)计算量大,而当前的深度学习方法往往产生缺乏细节的过平滑结果。我们提出了DiffSOS,这是一种条件扩散模型,直接将声波形映射到声速图。我们的框架采用了专门的声学ControlNet,以严格基于物理波测量来进行去噪过程。为了确保结构一致性,我们优化了一种混合损失函数,集成了噪声预测、空间重建和噪声频率内容。为了加速推理,我们采用了随机去噪扩散隐式模型(DDIM)采样,仅需10步即可实现接近实时的重建。重要的是,我们利用框架的随机生成特性来估计像素级的不确定性,提供了一种在确定性方法中常常缺失的可靠性度量。在OpenPros USCT基准测试中评估,DiffSOS显著优于最先进的网络,平均多尺度结构相似性达到0.957。我们的方法提供了高保真度的声速图,并附带有原则性的置信度度量,从而促进了更安全、更快速的临床解读。
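The pixel-wise uncertainty estimate mentioned above follows directly from the sampler's stochasticity: draw several conditional reconstructions and take per-pixel statistics. A hypothetical sketch with a toy sampler in place of the DDIM reconstruction (all names and values illustrative):

```python
import numpy as np

def sos_with_uncertainty(sample_fn, waveform, n_samples=16, seed=0):
    """Exploit generative stochasticity: draw several conditional
    reconstructions and report the pixel-wise mean (SoS map) and standard
    deviation (uncertainty), which deterministic regressors cannot provide."""
    rng = np.random.default_rng(seed)
    draws = np.stack([sample_fn(waveform, rng) for _ in range(n_samples)])
    return draws.mean(axis=0), draws.std(axis=0)

# Toy stand-in for DDIM sampling: a fixed map plus a stochastic residual
# that is larger in the right half of the image.
def toy_sampler(waveform, rng):
    base = np.full((8, 8), 1500.0)           # ~speed of sound in water, m/s
    noise = rng.standard_normal((8, 8))
    noise[:, 4:] *= 5.0                      # right half is less certain
    return base + noise

mean_map, unc_map = sos_with_uncertainty(toy_sampler, None)
```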
cs.CV / 58 / 2603.00409
SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning
SSR:通过结构化场景推理推动空间智能的极限
Abstract
While Multimodal Large Language Models (MLLMs) excel in semantic tasks, they frequently lack the "spatial sense" essential for sophisticated geometric reasoning. Current models typically suffer from exorbitant modality-alignment costs and deficiency in fine-grained structural modeling precision. We introduce SSR, a framework designed for Structured Scene Reasoning that seamlessly integrates 2D and 3D representations via a lightweight alignment mechanism. To minimize training overhead, our framework anchors 3D geometric features to the large language model's pre-aligned 2D visual semantics through cross-modal addition and token interleaving, effectively obviating the necessity for large-scale alignment pre-training. To underpin complex spatial reasoning, we propose a novel scene graph generation pipeline that represents global layouts as a chain of independent local triplets defined by relative coordinates. This is complemented by an incremental generation algorithm, enabling the model to construct "language-model-friendly" structural scaffolds for complex environments. Furthermore, we extend these capabilities to the global-scale 3D grounding task, achieving absolute metric precision across heterogeneous data sources. At a 7B parameter scale, SSR achieves state-of-the-art performance on multiple spatial intelligence benchmarks, notably scoring 73.9 on VSI-Bench. Our approach significantly outperforms much larger models, demonstrating that efficient feature alignment and structured scene reasoning are the cornerstones of authentic spatial intelligence.
Chinese Translation
尽管多模态大型语言模型(MLLMs)在语义任务中表现出色,但它们常常缺乏进行复杂几何推理所必需的“空间感”。当前模型通常面临高昂的模态对齐成本和细粒度结构建模精度不足的问题。我们提出了SSR,一个旨在进行结构化场景推理的框架,通过轻量级对齐机制无缝整合2D和3D表示。为了最小化训练开销,我们的框架通过跨模态加法和标记交错,将3D几何特征锚定到大型语言模型预对齐的2D视觉语义上,有效消除了大规模对齐预训练的必要性。为了支撑复杂的空间推理,我们提出了一种新颖的场景图生成管道,将全局布局表示为由相对坐标定义的独立局部三元组链。这一过程辅以增量生成算法,使模型能够为复杂环境构建“语言模型友好”的结构框架。此外,我们将这些能力扩展到全球规模的3D全局定位任务,在异构数据源上实现绝对度量精度。在7B参数规模下,SSR在多个空间智能基准测试中取得了最先进的性能,特别是在VSI-Bench上得分73.9。我们的方法显著超越了更大规模的模型,证明了高效的特征对齐和结构化场景推理是实现真实空间智能的基石。
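The lightweight alignment mechanism above — cross-modal addition plus token interleaving — can be sketched as follows. The projection shape and sequence layout are illustrative assumptions, not SSR's exact design.

```python
import numpy as np

def fuse_tokens(tokens_2d, feats_3d, proj):
    """Lightweight alignment sketch: project 3D geometric features into the
    2D token space, add them to matching 2D tokens (cross-modal addition),
    then interleave the two streams into a single input sequence."""
    aligned_3d = feats_3d @ proj                 # project to 2D token dim
    added = tokens_2d + aligned_3d               # cross-modal addition
    # Interleave: [2d_0, 3d_0, 2d_1, 3d_1, ...]
    seq = np.empty((2 * len(tokens_2d), tokens_2d.shape[1]))
    seq[0::2] = added
    seq[1::2] = aligned_3d
    return seq

rng = np.random.default_rng(0)
t2d = rng.standard_normal((4, 8))     # 4 visual tokens, dim 8
f3d = rng.standard_normal((4, 6))     # 4 geometric features, dim 6
proj = rng.standard_normal((6, 8))    # learned projection (random here)
seq = fuse_tokens(t2d, f3d, proj)
```

Because the 3D features ride on top of already-aligned 2D semantics, only the small projection needs training, which is the claimed saving over full alignment pre-training.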
cs.CV / 59 / 2603.00412
PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models
PointAlign:用于3D视觉-语言模型的特征级对齐正则化
Abstract
The development of 3D Vision-Language Models (VLMs), crucial for applications in robotics, autonomous driving, and augmented reality, is severely constrained by the scarcity of paired 3D-text data. Existing methods rely solely on next-token prediction loss, using only language tokens for supervision. This results in inefficient utilization of limited 3D data and leads to a significant degradation and loss of valuable geometric information in intermediate representations. To address these limitations, we propose PointAlign, a novel feature-level alignment regularization method. PointAlign explicitly supervises intermediate point cloud tokens to preserve fine-grained 3D geometric-semantic information throughout the language modeling process. Specifically, we constrain the intermediate point cloud tokens within the LLM to align with visual input tokens via a consistency loss. By training only a lightweight alignment projector and LoRA adapters, PointAlign achieves explicit feature-level supervision with minimal computational overhead, effectively preventing geometric degradation. Extensive experiments on ModelNet40 and Objaverse datasets demonstrate that our method achieves a 2.08 pp improvement on average for classification tasks, with a substantial 7.50 pp gain on the challenging open-vocabulary Objaverse classification task and a 4.88 pp improvement on 3D object captioning evaluated by Qwen2-72B-Instruct, validating the effectiveness of PointAlign. Code is publicly available at https://github.com/yharoldsu0627/PointAlign.
Chinese Translation
3D视觉-语言模型(VLMs)的发展对机器人技术、自动驾驶和增强现实等应用至关重要,但受到配对的3D-文本数据稀缺的严重限制。现有方法仅依赖于下一个标记预测损失,使用语言标记进行监督。这导致有限的3D数据未得到有效利用,并在中间表示中显著降低和丧失了宝贵的几何信息。为了解决这些限制,我们提出了PointAlign,一种新颖的特征级对齐正则化方法。PointAlign明确监督中间点云标记,以在语言建模过程中保留细粒度的3D几何-语义信息。具体而言,我们通过一致性损失约束LLM中的中间点云标记与视觉输入标记对齐。通过仅训练一个轻量级对齐投影器和LoRA适配器,PointAlign以最小的计算开销实现了明确的特征级监督,有效防止了几何退化。在ModelNet40和Objaverse数据集上的大量实验表明,我们的方法在分类任务上平均提高了2.08个百分点,在具有挑战性的开放词汇Objaverse分类任务上获得了显著的7.50个百分点提升,并在Qwen2-72B-Instruct评估的3D物体描述任务上提高了4.88个百分点,验证了PointAlign的有效性。代码已公开发布于 https://github.com/yharoldsu0627/PointAlign。
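The consistency loss described above (a lightweight projector aligning intermediate point-cloud tokens with visual input tokens) might look like the following cosine-based sketch; the projector shape and exact loss form are assumptions for illustration, not PointAlign's published objective.

```python
import numpy as np

def alignment_loss(intermediate_pc_tokens, visual_tokens, W):
    """Feature-level alignment regularizer sketch: a lightweight projector W
    maps intermediate point-cloud tokens into the visual token space, and a
    cosine consistency loss pulls each corresponding pair together."""
    proj = intermediate_pc_tokens @ W
    a = proj / np.linalg.norm(proj, axis=-1, keepdims=True)
    b = visual_tokens / np.linalg.norm(visual_tokens, axis=-1, keepdims=True)
    cos = (a * b).sum(axis=-1)
    return float(np.mean(1.0 - cos))    # 0 when perfectly aligned

rng = np.random.default_rng(0)
vis = rng.standard_normal((16, 32))     # 16 visual input tokens, dim 32
W = np.eye(32)                          # identity projector for the demo
loss_aligned = alignment_loss(vis, vis, W)                   # identical
loss_random = alignment_loss(rng.standard_normal((16, 32)), vis, W)
```

This term would be added to the usual next-token prediction loss, so the intermediate layers keep a geometric signal instead of only serving language decoding.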
cs.CV / 60 / 2603.00413
DiffTrans: Differentiable Geometry-Materials Decomposition for Reconstructing Transparent Objects
DiffTrans:用于重建透明物体的可微几何-材料分解
Abstract
Reconstructing transparent objects from a set of multi-view images is a challenging task due to the complicated nature and indeterminate behavior of light propagation. Typical methods are primarily tailored to specific scenarios, such as objects following a uniform topology, exhibiting ideal transparency and surface specular reflections, or with only surface materials, which substantially constrains their practical applicability in real-world settings. In this work, we propose a differentiable rendering framework for transparent objects, dubbed DiffTrans, which allows for efficient decomposition and reconstruction of the geometry and materials of transparent objects, thereby reconstructing transparent objects accurately in intricate scenes with diverse topology and complex texture. Specifically, we first utilize FlexiCubes with dilation and smoothness regularization as the iso-surface representation to reconstruct an initial geometry efficiently from the multi-view object silhouette. Meanwhile, we employ the environment light radiance field to recover the environment of the scene. Then we devise a recursive differentiable ray tracer to further optimize the geometry, index of refraction and absorption rate simultaneously in a unified and end-to-end manner, leading to high-quality reconstruction of transparent objects in intricate scenes. A prominent advantage of the designed ray tracer is that it can be implemented in CUDA, enabling a significantly reduced computational cost. Extensive experiments on multiple benchmarks demonstrate the superior reconstruction performance of our DiffTrans compared with other methods, especially in intricate scenes involving transparent objects with diverse topology and complex texture. The code is available at https://github.com/lcp29/DiffTrans.
Chinese Translation
从一组多视角图像重建透明物体是一项具有挑战性的任务,因为光传播的复杂性和不确定性。典型的方法主要针对特定场景进行优化,例如遵循均匀拓扑的物体、表现出理想透明度和表面镜面反射的物体,或者仅具有表面材料的物体,这在很大程度上限制了它们在现实世界中的实际应用。在本研究中,我们提出了一种用于透明物体的可微渲染框架,称为DiffTrans,该框架允许高效地分解和重建透明物体的几何形状和材料,从而在具有多样拓扑和复杂纹理的复杂场景中准确重建透明物体。具体而言,我们首先利用具有膨胀和平滑性正则化的FlexiCubes作为等值面表示,从多视角物体轮廓中高效重建初始几何形状。同时,我们采用环境光辐射场来恢复场景的环境。然后,我们设计了一种递归可微光线追踪器,以统一和端到端的方式进一步优化几何形状、折射率和吸收率,从而在复杂场景中实现高质量的透明物体重建。所设计的光线追踪器的一个显著优势是可以在CUDA中实现,从而显著降低计算成本。在多个基准测试上的广泛实验表明,与其他方法相比,我们的DiffTrans在重建性能上具有优势,尤其是在涉及具有多样拓扑和复杂纹理的透明物体的复杂场景中。代码可在 https://github.com/lcp29/DiffTrans 获取。
cs.CV / 61 / 2603.00418
Station2Radar: query conditioned gaussian splatting for precipitation field
Station2Radar:基于查询条件的高斯溅射用于降水场
Abstract
Precipitation forecasting relies on heterogeneous data. Weather radar is accurate, but coverage is geographically limited and costly to maintain. Weather stations provide accurate but sparse point measurements, while satellites offer dense, high-resolution coverage without direct rainfall retrieval. To overcome these limitations, we propose Query-Conditioned Gaussian Splatting (QCGS), the first framework to fuse automatic weather station (AWS) observations with satellite imagery for generating precipitation fields. Unlike conventional 2D Gaussian splatting, which renders the entire image plane, QCGS selectively renders only queried precipitation regions, avoiding unnecessary computation in non-precipitating areas while preserving sharp precipitation structures. The framework combines a radar point proposal network that identifies rainfall-support locations with an implicit neural representation (INR) network that predicts Gaussian parameters for each point. QCGS enables efficient, resolution-flexible precipitation field generation in real time. Through extensive evaluation with benchmark precipitation products, QCGS demonstrates over 50% improvement in RMSE compared to conventional gridded precipitation products, and consistently maintains high performance across multiple spatiotemporal scales.
Chinese Translation
降水预测依赖于异构数据。气象雷达准确,但覆盖范围地理限制且维护成本高。气象站提供准确但稀疏的点测量,而卫星则提供密集的高分辨率覆盖,但无法直接获取降水数据。为克服这些局限性,我们提出了查询条件高斯溅射(Query-Conditioned Gaussian Splatting,QCGS),这是第一个将自动气象站(AWS)观测与卫星图像融合以生成降水场的框架。与传统的二维高斯溅射不同,QCGS仅选择性地渲染查询的降水区域,避免在无降水区域进行不必要的计算,同时保留清晰的降水结构。该框架结合了一个雷达点提议网络,用于识别降水支持位置,以及一个隐式神经表示(Implicit Neural Representation,INR)网络,用于预测每个点的高斯参数。QCGS实现了实时高效、分辨率灵活的降水场生成。通过与基准降水产品的广泛评估,QCGS在均方根误差(RMSE)上相比传统的网格降水产品提高了超过50%,并在多个时空尺度上始终保持高性能。
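The core efficiency argument — evaluating Gaussians only at queried locations rather than rasterizing the whole plane — can be sketched as direct mixture evaluation at query coordinates. This is an illustrative simplification of 2D Gaussian splatting (no alpha compositing, toy parameters), not QCGS's renderer.

```python
import numpy as np

def splat_at_queries(queries, means, covs_inv, weights):
    """Query-conditioned splatting sketch: evaluate the 2D Gaussian mixture
    only at queried coordinates instead of rendering the full image plane."""
    out = np.zeros(len(queries))
    for mu, ci, w in zip(means, covs_inv, weights):
        d = queries - mu                        # (Q, 2) offsets
        q = np.einsum('qi,ij,qj->q', d, ci, d)  # squared Mahalanobis dist
        out += w * np.exp(-0.5 * q)
    return out

means = np.array([[0.0, 0.0], [3.0, 3.0]])     # rainfall-support locations
covs_inv = np.stack([np.eye(2), np.eye(2)])    # inverse covariances
weights = np.array([2.0, 1.0])                 # rain-rate amplitudes
vals = splat_at_queries(np.array([[0.0, 0.0], [3.0, 3.0], [10.0, 10.0]]),
                        means, covs_inv, weights)
```

In QCGS, the means, covariances, and weights would come from the INR conditioned on the proposal network's points; `queries` can be any resolution, which is where the resolution flexibility comes from.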
cs.CV / 62 / 2603.00423
An Interpretable Local Editing Model for Counterfactual Medical Image Generation
一种可解释的局部编辑模型用于反事实医学图像生成
Abstract
Counterfactual medical image generation has emerged as a critical tool for enhancing AI-driven systems in the medical domain by answering "what-if" questions. However, existing approaches face two fundamental limitations: First, they fail to prevent unintended modifications, resulting in collateral changes in demographic attributes when only disease features should be affected. Second, they lack interpretability in their editing process, which significantly limits their utility in real-world medical applications. To address these limitations, we present InstructX2X, a novel interpretable local editing model for counterfactual medical image generation featuring Region-Specific Editing. This approach restricts modifications to specific regions, effectively preventing unintended changes while simultaneously providing a Guidance Map that offers inherently interpretable visual explanations of the editing process. Additionally, we introduce MIMIC-EDIT-INSTRUCTION, a dataset for counterfactual medical image generation derived from expert-verified medical VQA pairs. Through extensive experiments, InstructX2X achieves state-of-the-art performance across all major evaluation metrics. Our model successfully generates high-quality counterfactual chest X-ray images along with interpretable explanations.
Chinese Translation
反事实医学图像生成已成为增强医疗领域人工智能驱动系统的重要工具,能够回答“如果……会怎样”的问题。然而,现有方法面临两个基本限制:首先,它们未能防止意外修改,导致在仅应影响疾病特征时,人口属性发生附带变化。其次,它们在编辑过程中缺乏可解释性,这显著限制了其在实际医疗应用中的效用。为了解决这些限制,我们提出了InstructX2X,这是一种新颖的可解释局部编辑模型,专门用于反事实医学图像生成,并具有区域特定编辑的特点。该方法将修改限制在特定区域,有效防止意外变化,同时提供指导图(Guidance Map),为编辑过程提供内在可解释的视觉解释。此外,我们还引入了MIMIC-EDIT-INSTRUCTION,这是一个基于专家验证的医学视觉问答(VQA)对构建的反事实医学图像生成数据集。通过广泛的实验,InstructX2X在所有主要评估指标上实现了最先进的性能。我们的模型成功生成高质量的反事实胸部X光图像,并附带可解释的解释。
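Region-Specific Editing as described above amounts to gating the generated image by the guidance map so that unedited regions are copied verbatim from the input. A minimal sketch follows; the paper's actual gating may be soft or learned rather than a hard threshold.

```python
import numpy as np

def region_specific_edit(original, edited, guidance_map, threshold=0.5):
    """Region-specific composition sketch: a guidance map gates the edit so
    only disease-relevant regions change; everything else is copied from the
    original image, preventing collateral (e.g. demographic) changes."""
    mask = (guidance_map > threshold).astype(float)
    return mask * edited + (1.0 - mask) * original

orig = np.zeros((4, 4))                 # toy "input image"
fake = np.ones((4, 4))                  # toy "edited image"
gmap = np.zeros((4, 4))
gmap[1:3, 1:3] = 0.9                    # edit only the 2x2 center region
out = region_specific_edit(orig, fake, gmap)
```

The same `gmap` that gates the edit doubles as the visual explanation of where and why the model intervened.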
cs.CV / 63 / 2603.00431
Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models
面向分类法的表示对齐用于大型多模态模型的层次视觉识别
Abstract
A high-performing, general-purpose visual understanding model should map visual inputs to a taxonomic tree of labels and identify novel categories beyond the training set, for which few or no publicly available images exist. Large Multimodal Models (LMMs) have achieved remarkable progress in fine-grained visual recognition (FGVR) for known categories. However, they remain limited in hierarchical visual recognition (HVR) that aims at predicting consistent label paths from coarse to fine categories, especially for novel categories. To tackle these challenges, we propose Taxonomy-Aware Representation Alignment (TARA), a simple yet effective strategy to inject taxonomic knowledge into LMMs. TARA leverages representations from biology foundation models (BFMs) that encode rich biological relationships through hierarchical contrastive learning. By aligning the intermediate representations of visual features with those of BFMs, LMMs are encouraged to extract discriminative visual cues well structured in the taxonomy tree. Additionally, we align the representations of the first answer token with the ground-truth label, flexibly bridging the gap between contextualized visual features and categories of varying granularity according to user intent. Experiments demonstrate that TARA consistently enhances LMMs' hierarchical consistency and leaf node accuracy, enabling reliable recognition of both known and novel categories within complex biological taxonomies. Code is available at https://github.com/PKU-ICST-MIPL/TARA_CVPR2026.
Chinese Translation
一个高性能的通用视觉理解模型应将视觉输入映射到标签的分类树上,并识别训练集之外的新类别,这些类别几乎没有或没有公开可用的图像。大型多模态模型(LMMs)在已知类别的细粒度视觉识别(FGVR)方面取得了显著进展。然而,它们在层次视觉识别(HVR)方面仍然存在局限性,该任务旨在从粗到细的类别中预测一致的标签路径,尤其是在新类别的情况下。为了解决这些挑战,我们提出了面向分类法的表示对齐(TARA),这是一种简单而有效的策略,将分类法知识注入到LMMs中。TARA利用生物基础模型(BFMs)中的表示,这些模型通过层次对比学习编码了丰富的生物关系。通过将视觉特征的中间表示与BFMs的表示对齐,LMMs被鼓励提取在分类树中结构良好的区分性视觉线索。此外,我们将第一个答案标记的表示与真实标签对齐,根据用户意图灵活地弥合上下文化视觉特征与不同粒度类别之间的差距。实验表明,TARA始终增强了LMMs的层次一致性和叶节点准确性,使得在复杂生物分类法中可靠地识别已知和新类别成为可能。代码可在 https://github.com/PKU-ICST-MIPL/TARA_CVPR2026 获取。
cs.CV / 64 / 2603.00433
TAP-SLF: Parameter-Efficient Adaptation of Vision Foundation Models for Multi-Task Ultrasound Image Analysis
TAP-SLF:用于多任务超声图像分析的视觉基础模型的参数高效适应
Abstract
Executing multiple tasks simultaneously in medical image analysis, including segmentation, classification, detection, and regression, often introduces significant challenges regarding model generalizability and the optimization of shared feature representations. While Vision Foundation Models (VFMs) provide powerful general representations, full fine-tuning on limited medical data is prone to overfitting and incurs high computational costs. Moreover, existing parameter-efficient fine-tuning approaches typically adopt task-agnostic adaptation protocols, overlooking both task-specific mechanisms and the varying sensitivity of model layers during fine-tuning. In this work, we propose Task-Aware Prompting and Selective Layer Fine-Tuning (TAP-SLF), a unified framework for multi-task ultrasound image analysis. TAP-SLF incorporates task-aware soft prompts to encode task-specific priors into the input token sequence and applies LoRA to selected top layers of the encoder. This strategy updates only a small fraction of the VFM parameters while keeping the pre-trained backbone frozen. By combining task-aware prompts with selective high-layer fine-tuning, TAP-SLF enables efficient VFM adaptation to diverse medical tasks within a shared backbone. Results on the FMC_UIA 2026 Challenge test set, where TAP-SLF won fifth place, combined with evaluations on the officially released training dataset using an 8:2 train-test split, demonstrate that task-aware prompting and selective layer tuning are effective strategies for efficient VFM adaptation.
Chinese Translation
在医学图像分析中,同时执行多个任务(包括分割、分类、检测和回归)常常带来模型泛化能力和共享特征表示优化方面的重大挑战。尽管视觉基础模型(VFMs)提供强大的通用表示,但在有限的医学数据上进行全面微调容易导致过拟合,并且计算成本高昂。此外,现有的参数高效微调方法通常采用与任务无关的适应协议,忽视了任务特定机制以及在微调过程中模型层的不同敏感性。在本研究中,我们提出了任务感知提示与选择性层微调(TAP-SLF),这是一个用于多任务超声图像分析的统一框架。TAP-SLF结合任务感知的软提示,将任务特定的先验信息编码到输入令牌序列中,并对编码器的特定高层应用LoRA。这一策略仅更新VFM参数的一小部分,同时保持预训练的主干网络不变。通过将任务感知提示与选择性高层微调相结合,TAP-SLF使得VFM能够高效适应共享主干下的多样医学任务。在FMC_UIA 2026挑战赛测试集上的结果显示,TAP-SLF获得第五名,同时在官方发布的训练数据集上进行8:2的训练-测试划分评估,证明了任务感知提示和选择性层调优是高效VFM适应的有效策略。
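The two ingredients named above — task-aware soft prompts prepended to the token sequence, and LoRA applied to selected layers — can be sketched as follows. Shapes and the zero-init convention are illustrative assumptions, not TAP-SLF's exact configuration.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=4):
    """LoRA sketch: the frozen weight W is augmented with a low-rank update
    (alpha/r) * B @ A, so only A and B (a tiny parameter fraction) train."""
    return x @ (W + (alpha / r) * (B @ A)).T

def prepend_task_prompt(tokens, prompts, task_id):
    """Task-aware soft prompts: learned vectors for the selected task are
    prepended to the input token sequence as task-specific priors."""
    return np.concatenate([prompts[task_id], tokens], axis=0)

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))         # frozen backbone weight
A = np.zeros((4, d))                    # zero init => B @ A = 0 at start
B = rng.standard_normal((d, 4))
x = rng.standard_normal((3, d))         # 3 input tokens
y0 = lora_forward(x, W, A, B)           # equals the frozen forward initially
seg_prompts = {"seg": rng.standard_normal((2, d))}   # 2 soft-prompt tokens
seq = prepend_task_prompt(x, seg_prompts, "seg")
```

Restricting the LoRA adapters to the top encoder layers is the "selective layer" part: lower layers stay entirely frozen.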
cs.CV / 65 / 2603.00437
Self-Correction Inside the Model: Leveraging Layer Attention to Mitigate Hallucinations in Large Vision Language Models
模型内部自我修正:利用层注意力减轻大型视觉语言模型中的幻觉现象
Abstract
Although Large Vision-Language Models (LVLMs) have made substantial progress, hallucination, where generated text is not grounded in the visual input, remains a challenge. As LVLMs become stronger, previously reported hallucination patterns, such as linguistic bias and the overthinking phenomenon, become far less consistent, making the corresponding mitigation techniques substantially less effective. In this paper, we introduce an Internal self-Correction mechanism utilizing Layer Attention (ICLA) that operates directly on hidden states during generation. Each layer selectively retrieves information from all preceding layers through a diagonal cross-layer attention mechanism, enabling self-refinement without any external correction signals. By introducing and training only 0.2M and 0.1M additional parameters on LLaVA1.5-7B and Qwen2.5-VL-7B, respectively, ICLA consistently improves visual grounding across multiple hallucination benchmarks, demonstrating its effectiveness for more advanced LVLMs.
Chinese Translation
尽管大型视觉语言模型(LVLMs)已经取得了显著进展,但幻觉现象,即生成的文本未能与视觉输入相结合,仍然是一个挑战。随着LVLMs的不断增强,以往报告的幻觉模式,如语言偏见和过度思考现象,变得不再一致,从而使相应的减轻技术的有效性大大降低。在本文中,我们引入了一种利用层注意力的内部自我修正机制(Internal self-Correction mechanism utilizing Layer Attention, ICLA),该机制在生成过程中直接作用于隐藏状态。每一层通过对角交叉层注意力机制选择性地从所有前置层中检索信息,实现自我优化,而无需任何外部修正信号。仅在LLaVA1.5-7B和Qwen2.5-VL-7B上分别引入和训练0.2M和0.1M额外参数,ICLA在多个幻觉基准测试中始终提高了视觉基础,证明了其对更高级LVLMs的有效性。
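The diagonal cross-layer attention described above can be sketched per token: the current layer's state queries the stack of preceding layers' states at the same token position ("diagonal") and adds back a retrieved mixture. This is an illustrative single-head NumPy sketch, not the trained 0.1M–0.2M-parameter module.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_attention(hidden_states, q_w, k_w):
    """Cross-layer self-correction sketch: the current layer's state attends,
    per token, over the stack of all preceding layers' states and retrieves
    a corrective mixture of earlier representations."""
    *prev, cur = hidden_states                   # list of (T, d) layer states
    prev = np.stack(prev, axis=1)                # (T, L-1, d)
    q = cur @ q_w                                # (T, d) queries
    k = prev @ k_w                               # (T, L-1, d) keys
    attn = softmax(np.einsum('td,tld->tl', q, k) / np.sqrt(q.shape[-1]))
    retrieved = np.einsum('tl,tld->td', attn, prev)
    return cur + retrieved                       # residual self-refinement

rng = np.random.default_rng(0)
layers = [rng.standard_normal((5, 16)) for _ in range(4)]  # 4 layers, 5 tokens
out = layer_attention(layers, np.eye(16), np.eye(16))
```

Because attention is computed only along the layer axis at each token (not across tokens), the added parameter count stays tiny.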
cs.CV / 66 / 2603.00439
Mamba-CAD: State Space Model For 3D Computer-Aided Design Generative Modeling
Mamba-CAD:用于三维计算机辅助设计生成建模的状态空间模型
Abstract
Computer-Aided Design (CAD) generative modeling has strong, long-standing applications in industry. Recently, the parametric CAD sequence, as the design logic of an object, has been widely mined by sequence models. However, industrial CAD models, especially component objects, are fine-grained and complex, requiring longer parametric CAD sequences to define. To address this problem, we introduce Mamba-CAD, a self-supervised generative modeling approach for complex industrial CAD models that can model longer parametric CAD sequences. Specifically, we first design an encoder-decoder framework based on a Mamba architecture and pair it with a CAD reconstruction task for pre-training to model the latent representation of CAD models; we then utilize the learned representation to guide a generative adversarial network to produce fake representations of CAD models, which are finally recovered into parametric CAD sequences via the decoder of Mamba-CAD. To train Mamba-CAD, we further create a new dataset consisting of 77,078 CAD models with longer parametric CAD sequences. Comprehensive experiments are conducted to demonstrate the effectiveness of our model under various evaluation metrics, especially in the generation length of valid parametric CAD sequences. The code and dataset are available at https://github.com/Sunny-Hack/Code-for-Mamba-CAD-AAAI-2025-.
Chinese Translation
计算机辅助设计(CAD)生成建模在工业中具有强大且长期的应用。最近,作为对象设计逻辑的参数化CAD序列已被序列模型广泛挖掘。然而,工业CAD模型,特别是在组件对象中,具有细粒度和复杂性,需要更长的参数化CAD序列来定义。为了解决这一问题,我们引入了Mamba-CAD,一种用于工业复杂CAD模型的自监督生成建模方法,能够在更长的参数化CAD序列上进行建模。具体而言,我们首先设计了一个基于Mamba架构的编码器-解码器框架,并将其与CAD重建任务配对进行预训练,以建模CAD模型的潜在表示;然后,我们利用学习到的表示来指导生成对抗网络生成CAD模型的伪表示,最终通过Mamba-CAD的解码器将其恢复为参数化CAD序列。为了训练Mamba-CAD,我们进一步创建了一个包含77,078个具有更长参数化CAD序列的CAD模型的新数据集。我们进行了全面的实验,以证明我们模型在各种评估指标下的有效性,特别是在有效参数化CAD序列的生成长度方面。代码和数据集可以从 https://github.com/Sunny-Hack/Code-for-Mamba-CAD-AAAI-2025- 获取。
cs.CV / 67 / 2603.00443
SesaHand: Enhancing 3D Hand Reconstruction via Controllable Generation with Semantic and Structural Alignment
SesaHand:通过语义和结构对齐的可控生成增强3D手部重建
Abstract
Recent studies on 3D hand reconstruction have demonstrated the effectiveness of synthetic training data to improve estimation performance. However, most methods rely on game engines to synthesize hand images, which often lack diversity in textures and environments, and fail to include crucial components like arms or interacting objects. Generative models are promising alternatives to generate diverse hand images, but still suffer from misalignment issues. In this paper, we present SesaHand, which enhances controllable hand image generation from both semantic and structural alignment perspectives for 3D hand reconstruction. Specifically, for semantic alignment, we propose a pipeline with Chain-of-Thought inference to extract human behavior semantics from image captions generated by the Vision-Language Model. This semantics suppresses human-irrelevant environmental details and ensures sufficient human-centric contexts for hand image generation. For structural alignment, we introduce hierarchical structural fusion to integrate structural information with different granularity for feature refinement to better align the hand and the overall human body in generated images. We further propose a hand structure attention enhancement method to efficiently enhance the model's attention on hand regions. Experiments demonstrate that our method not only outperforms prior work in generation performance but also improves 3D hand reconstruction with the generated hand images.
Chinese Translation
近期关于3D手部重建的研究表明,合成训练数据在提高估计性能方面的有效性。然而,大多数方法依赖于游戏引擎合成手部图像,这些图像通常在纹理和环境上缺乏多样性,并且未能包含诸如手臂或交互对象等关键组件。生成模型是生成多样化手部图像的有希望的替代方案,但仍然存在对齐问题。在本文中,我们提出了SesaHand,从语义和结构对齐的角度增强可控手部图像生成,以用于3D手部重建。具体而言,在语义对齐方面,我们提出了一种通过链式思维推理的管道,从视觉-语言模型生成的图像标题中提取人类行为语义。这一语义抑制了与人类无关的环境细节,并确保了足够的人类中心上下文以进行手部图像生成。在结构对齐方面,我们引入了层次结构融合,将不同粒度的结构信息整合以进行特征细化,从而更好地对齐生成图像中的手部和整体人体。我们进一步提出了一种手部结构注意力增强方法,以有效增强模型对手部区域的关注。实验表明,我们的方法不仅在生成性能上超越了先前的工作,还改善了利用生成手部图像进行的3D手部重建。
cs.CV / 68 / 2603.00458
Improved Adversarial Diffusion Compression for Real-World Video Super-Resolution
改进的对抗性扩散压缩用于现实世界视频超分辨率
Abstract
While many diffusion models have achieved impressive results in real-world video super-resolution (Real-VSR) by generating rich and realistic details, their reliance on multi-step sampling leads to slow inference. One-step networks like SeedVR2, DOVE, and DLoRAL alleviate this by condensing generation into a single step, yet they remain heavy, with billions of parameters and multi-second latency. Recent adversarial diffusion compression (ADC) offers a promising path via pruning and distilling these models into a compact AdcSR network, but directly applying it to Real-VSR fails to balance spatial details and temporal consistency due to its lack of temporal awareness and the limitations of standard adversarial learning. To address these challenges, we propose an improved ADC method for Real-VSR. Our approach distills a large diffusion Transformer (DiT) teacher, DOVE, equipped with 3D spatio-temporal attention, into a pruned 2D Stable Diffusion (SD)-based AdcSR backbone augmented with lightweight 1D temporal convolutions, achieving significantly higher efficiency. In addition, we introduce a dual-head adversarial distillation scheme, in which discriminators in both the pixel and feature domains explicitly disentangle the discrimination of details and consistency into two heads, enabling both objectives to be effectively optimized without sacrificing one for the other. Experiments demonstrate that the resulting compressed AdcVSR model reduces parameter count by 95% and achieves an 8$\times$ acceleration over its DiT teacher DOVE, while maintaining competitive video quality.
Chinese Translation
尽管许多扩散模型在现实世界视频超分辨率(Real-VSR)中取得了令人印象深刻的结果,生成了丰富而逼真的细节,但它们对多步采样的依赖导致了推理速度缓慢。一些单步网络如SeedVR2、DOVE和DLoRAL通过将生成过程浓缩为单一步骤来缓解这一问题,然而它们仍然庞大,参数量达到数十亿,并且存在多秒的延迟。最近的对抗性扩散压缩(ADC)通过对这些模型进行剪枝和蒸馏,提供了一条有前景的路径,将其压缩为紧凑的AdcSR网络,但直接将其应用于Real-VSR时,由于缺乏时间感知和标准对抗学习的局限性,未能平衡空间细节和时间一致性。为了解决这些挑战,我们提出了一种改进的ADC方法用于Real-VSR。我们的方法将配备3D时空注意力的大型扩散Transformer(DiT)教师DOVE蒸馏为一个剪枝的基于2D稳定扩散(SD)的AdcSR骨干网络,并增强了轻量级的1D时间卷积,从而实现显著更高的效率。此外,我们引入了一种双头对抗蒸馏方案,其中像素域和特征域中的判别器明确地将细节和一致性的判别分解为两个头,使得两个目标能够有效优化,而不牺牲其中一个。实验表明,生成的压缩AdcVSR模型在参数复杂性上减少了95%,并在保持竞争性视频质量和效率的同时,实现了对其DiT教师DOVE的8倍加速。
cs.CV / 69 / 2603.00459
Explainable Continuous-Time Mask Refinement with Local Self-Similarity Priors for Medical Image Segmentation
基于局部自相似先验的可解释连续时间掩膜细化用于医学图像分割
Abstract
Accurate semantic segmentation of foot ulcers is essential for automated wound monitoring, yet boundary delineation remains challenging due to tissue heterogeneity and poor contrast with surrounding skin. To overcome the limitations of standard intensity-based networks, we present LSS-LTCNet: an ante-hoc explainable framework that synergizes deterministic structural priors with continuous-time neural dynamics. Our architecture departs from traditional black-box models by employing a Local Self-Similarity (LSS) mechanism that extracts dense, illumination-invariant texture descriptors to explicitly disentangle necrotic tissue from background artifacts. To enforce topological precision, we introduce a Liquid Time-Constant (LTC) refinement module that treats boundary evolution as an ODE-governed dynamic system, iteratively refining masks over continuous time-steps. Comprehensive evaluation on the MICCAI FUSeg dataset demonstrates that LSS-LTCNet achieves state-of-the-art boundary alignment, securing a peak Dice score of 86.96% and an exceptional 95th percentile Hausdorff Distance (HD95) of 8.91 pixels. Requiring merely 25.70M parameters, the model significantly outperforms heavier U-Net and transformer baselines in efficiency. By providing inherent visual audit trails alongside high-fidelity predictions, LSS-LTCNet offers a robust and transparent solution for computer-aided diagnosis in mobile healthcare (mHealth) settings.
Chinese Translation
准确的足部溃疡语义分割对于自动化伤口监测至关重要,但由于组织异质性和与周围皮肤的对比度较差,边界划定仍然具有挑战性。为克服标准强度基础网络的局限性,我们提出了LSS-LTCNet:一个先验可解释的框架,将确定性结构先验与连续时间神经动态相结合。我们的架构不同于传统的黑箱模型,采用局部自相似(Local Self-Similarity, LSS)机制,提取密集的、光照不变的纹理描述符,以明确区分坏死组织与背景伪影。为了确保拓扑精度,我们引入了液体时间常数(Liquid Time-Constant, LTC)细化模块,将边界演变视为一个由常微分方程(ODE)控制的动态系统,迭代地在连续时间步长上细化掩膜。在MICCAI FUSeg数据集上的全面评估表明,LSS-LTCNet实现了最先进的边界对齐,获得了86.96%的峰值Dice得分和8.91像素的卓越第95百分位Hausdorff距离(HD95)。该模型仅需25.70M参数,显著优于更重的U-Net和变换器基线的效率。通过提供固有的视觉审计轨迹以及高保真预测,LSS-LTCNet为移动医疗(mHealth)环境中的计算机辅助诊断提供了一种稳健且透明的解决方案。
cs.CV / 70 / 2603.00461
ReMoT: Reinforcement Learning with Motion Contrast Triplets
ReMoT:基于运动对比三元组的强化学习
Abstract
We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency -- a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation. (2) Group Relative Policy Optimization, which we empirically validate yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1% performance leap on spatio-temporal reasoning tasks.
Chinese Translation
我们提出了ReMoT,这是一种统一的训练范式,系统性地解决了视觉语言模型(VLMs)在时空一致性方面的基本缺陷——这是导航、机器人技术和自动驾驶中的一个关键失效点。ReMoT整合了两个核心组件:(1)一个基于规则的自动化框架,生成ReMoT-16K,这是一个大规模(16.5K三元组)运动对比数据集,源自视频元注释,超越了成本高昂的手动或模型生成。(2)群体相对策略优化(Group Relative Policy Optimization),我们通过实证验证其在学习这种对比推理方面的最佳性能和数据效率,远超标准的监督微调(Supervised Fine-Tuning)。我们还构建了第一个用于细粒度运动对比三元组的基准,以衡量VLM对细微运动属性(例如,相对方向)的辨别能力。最终模型在我们的新基准和多个标准VLM基准上实现了最先进的性能,在时空推理任务上取得了显著的25.1%的性能提升。
cs.CV / 71 / 2603.00462
OPGAgent: An Agent for Auditable Dental Panoramic X-ray Interpretation
OPGAgent:一个可审计的牙科全景X光解读代理
Abstract
Orthopantomograms (OPGs) are the standard panoramic radiograph in dentistry, used for full-arch screening across multiple diagnostic tasks. While Vision Language Models (VLMs) now allow multi-task OPG analysis through natural language, they underperform task-specific models on most individual tasks. Agentic systems that orchestrate specialized tools offer a path to both versatility and accuracy, yet this approach remains unexplored in dental imaging. To address this gap, we propose OPGAgent, a multi-tool agentic system for auditable OPG interpretation. OPGAgent coordinates specialized perception modules with a consensus mechanism through three components: (1) a Hierarchical Evidence Gathering module that decomposes OPG analysis into global, quadrant, and tooth-level phases and dynamically invokes tools, (2) a Specialized Toolbox encapsulating spatial, detection, utility, and expert zoos, and (3) a Consensus Subagent that resolves conflicts through anatomical constraints. We further propose OPG-Bench, a structured-report protocol based on (Location, Field, Value) triples derived from real clinical reports, which enables a comprehensive review of findings and hallucinations, extending beyond the limitations of VQA metrics. On our OPG-Bench and the public MMOral-OPG benchmark, OPGAgent outperforms current dental VLMs and medical agent frameworks across both structured-report and VQA evaluation. Code will be released upon acceptance.
Chinese Translation
全景X光片(Orthopantomograms, OPGs)是牙科中标准的全景放射线照片,广泛用于多种诊断任务的全弓筛查。尽管视觉语言模型(Vision Language Models, VLMs)现在允许通过自然语言进行多任务的OPG分析,但在大多数单独任务上,它们的表现不及任务特定模型。代理系统通过协调专业工具提供了兼具多样性和准确性的解决方案,但这一方法在牙科影像领域尚未得到探索。为了解决这一空白,我们提出了OPGAgent,一个用于可审计OPG解读的多工具代理系统。OPGAgent通过三个组件协调专业感知模块与共识机制:(1)层次证据收集模块,该模块将OPG分析分解为全局、象限和牙齿级别的阶段,并动态调用工具;(2)专业工具箱,封装空间、检测、实用和专家工具;(3)共识子代理,通过解剖约束解决冲突。我们进一步提出了OPG-Bench,一个基于真实临床报告中提取的(位置、领域、值)三元组的结构化报告协议,能够全面审查发现和幻觉,超越VQA指标的局限。在我们的OPG-Bench和公共MMOral-OPG基准上,OPGAgent在结构化报告和VQA评估中均优于当前的牙科VLM和医疗代理框架。代码将在接受后发布。
cs.CV / 72 / 2603.00466
DreamWorld: Unified World Modeling in Video Generation
梦境世界:视频生成中的统一世界建模
Abstract
Despite impressive progress in video generation, existing models remain limited to surface-level plausibility, lacking a coherent and unified understanding of the world. Prior approaches typically incorporate only a single form of world-related knowledge or rely on rigid alignment strategies to introduce additional knowledge. However, aligning a single form of world knowledge is insufficient to constitute a world model, which requires jointly modeling multiple heterogeneous dimensions (e.g., physical commonsense, 3D and temporal consistency). To address this limitation, we introduce \textbf{DreamWorld}, a unified framework that integrates complementary world knowledge into video generators via a \textbf{Joint World Modeling Paradigm}, jointly predicting video pixels and features from foundation models to capture temporal dynamics, spatial geometry, and semantic consistency. However, naively optimizing these heterogeneous objectives can lead to visual instability and temporal flickering. To mitigate this issue, we propose \textit{Consistent Constraint Annealing (CCA)} to progressively regulate world-level constraints during training, and \textit{Multi-Source Inner-Guidance} to enforce learned world priors at inference. Extensive evaluations show that DreamWorld improves world consistency, outperforming Wan2.1 by 2.26 points on VBench. Code will be made publicly available at \href{https://github.com/ABU121111/DreamWorld}{\textcolor{mypink}{\textbf{Github}}}.
Chinese Translation
尽管视频生成取得了令人瞩目的进展,但现有模型仍然局限于表层的可信度,缺乏对世界的连贯和统一理解。以往的方法通常仅融入单一形式的世界相关知识,或依赖于僵化的对齐策略来引入额外知识。然而,单一世界知识的对齐不足以构成一个世界模型,这需要对多个异构维度(例如,物理常识、三维和时间一致性)进行联合建模。为了解决这一局限性,我们提出了梦境世界(DreamWorld),这是一个统一框架,通过联合世界建模范式(Joint World Modeling Paradigm)将互补的世界知识整合到视频生成器中,联合预测视频像素和特征,以捕捉时间动态、空间几何和语义一致性。然而,简单地优化这些异构目标可能导致视觉不稳定和时间闪烁。为缓解这一问题,我们提出了一致约束退火(Consistent Constraint Annealing, CCA),在训练过程中逐步调节世界级约束,以及多源内部引导(Multi-Source Inner-Guidance),以在推理时强制执行学习到的世界先验。广泛的评估表明,梦境世界在世界一致性方面有所提升,在VBench上超越了Wan2.1,提升了2.26分。代码将公开发布于 Github(https://github.com/ABU121111/DreamWorld)。
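The abstract describes Consistent Constraint Annealing as progressively regulating world-level constraint weights during training, without giving the schedule. Below is a minimal linear schedule sketch; the warm-up, the linear shape, and even the direction of the ramp are assumptions for illustration only.

```python
def cca_weight(step, warmup, total, w_max=1.0):
    """Illustrative annealing schedule for a world-level constraint weight.

    Assumed behavior: the weight is held at zero during a warm-up phase so
    training first focuses on pixel fidelity, then ramps linearly toward
    w_max. The paper's actual schedule (and whether it ramps up or decays)
    is not specified in the abstract.
    """
    if step < warmup:
        return 0.0
    return w_max * min(1.0, (step - warmup) / (total - warmup))

early = cca_weight(step=0, warmup=100, total=1000)
mid = cca_weight(step=550, warmup=100, total=1000)
late = cca_weight(step=2000, warmup=100, total=1000)
```

Training loops would multiply this weight into the heterogeneous world-level loss terms, keeping them from destabilizing early optimization.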
cs.CV / 73 / 2603.00467
High Dynamic Range Imaging Based on an Asymmetric Event-SVE Camera System
基于非对称事件-SVE相机系统的高动态范围成像
Abstract
High dynamic range (HDR) imaging under extreme illumination remains challenging for conventional cameras due to overexposure. Event cameras provide microsecond temporal resolution and high dynamic range, while spatially varying exposure (SVE) sensors offer single-shot radiometric diversity. We present a hardware-algorithm co-designed HDR imaging system that tightly integrates an SVE micro-attenuation camera with an event sensor in an asymmetric dual-modality configuration. To handle non-coaxial geometry and heterogeneous optics, we develop a two-stage cross-modal alignment framework that combines feature-guided coarse homography estimation with a multi-scale refinement module based on spatial pooling and frequency-domain filtering. On top of the aligned representations, we build a cross-modal HDR reconstruction network with convolutional fusion, mutual-information regularization, and a learnable fusion loss that adaptively balances intensity cues and event-derived structural constraints. Comprehensive experiments on both synthetic benchmarks and real captures demonstrate that the proposed system consistently improves highlight recovery, edge fidelity, and robustness compared with frame-only or event-only HDR pipelines. The results indicate that jointly optimizing optical design, cross-modal alignment, and computational fusion provides an effective foundation for reliable HDR perception in highly dynamic and radiometrically challenging environments.
Chinese Translation
在极端光照条件下,高动态范围(HDR)成像对传统相机仍然具有挑战性,因为容易出现过曝。事件相机提供微秒级的时间分辨率和高动态范围,而空间变化曝光(SVE)传感器则提供单次拍摄的辐射多样性。我们提出了一种硬件-算法协同设计的HDR成像系统,该系统将SVE微衰减相机与事件传感器紧密集成在一个非对称双模态配置中。为了处理非同轴几何和异构光学,我们开发了一个两阶段的跨模态对齐框架,该框架结合了特征引导的粗略单应性估计和基于空间池化及频域滤波的多尺度精细化模块。在对齐表示的基础上,我们开发了一个跨模态HDR重建网络,该网络采用卷积融合、互信息正则化和可学习的融合损失,能够自适应地平衡强度线索和事件导出的结构约束。在合成基准和真实捕获的全面实验中,结果表明,与仅基于帧或仅基于事件的HDR管道相比,所提出的系统在高光恢复、边缘保真度和鲁棒性方面始终表现出改善。结果表明,联合优化光学设计、跨模态对齐和计算融合为在高度动态和辐射挑战环境中实现可靠的HDR感知提供了有效的基础。
cs.CV / 74 / 2603.00479
U-VLM: Hierarchical Vision Language Modeling for Report Generation
U-VLM:用于报告生成的层次化视觉语言建模
Abstract
Automated radiology report generation is key for reducing radiologist workload and improving diagnostic consistency, yet generating accurate reports for 3D medical imaging remains challenging. Existing vision-language models face two limitations: they do not leverage segmentation-pretrained encoders, and they inject visual features only at the input layer of language models, losing multi-scale information. We propose U-VLM, which enables hierarchical vision-language modeling in both training and architecture: (1) progressive training from segmentation to classification to report generation, and (2) multi-layer visual injection that routes U-Net encoder features to corresponding language model layers. Each training stage can leverage different datasets without unified annotations. U-VLM achieves state-of-the-art performance on CT-RATE (F1: 0.414 vs 0.258, BLEU-mean: 0.349 vs 0.305) and AbdomenAtlas 3.0 (F1: 0.624 vs 0.518 for segmentation-based detection) using only a 0.1B decoder trained from scratch, demonstrating that well-designed vision encoder pretraining outweighs the benefits of 7B+ pre-trained language models. Ablation studies show that progressive pretraining significantly improves F1, while multi-layer injection improves BLEU-mean. Code is available at https://github.com/yinghemedical/U-VLM.
Chinese Translation
自动化放射学报告生成对于减少放射科医生的工作负担和提高诊断一致性至关重要,但为三维医学影像生成准确报告仍然具有挑战性。现有的视觉语言模型面临两个限制:它们未能利用经过分割预训练的编码器,并且仅在语言模型的输入层注入视觉特征,从而丧失了多尺度信息。我们提出了U-VLM,它在训练和架构中实现了层次化的视觉语言建模:(1)从分割到分类再到报告生成的渐进式训练,以及(2)多层视觉注入,将U-Net编码器特征路由到相应的语言模型层。每个训练阶段可以利用不同的数据集,而无需统一的注释。U-VLM在CT-RATE(F1: 0.414对比0.258,BLEU-mean: 0.349对比0.305)和AbdomenAtlas 3.0(F1: 0.624对比0.518,基于分割的检测)上实现了最先进的性能,仅使用一个从头训练的0.1B解码器,证明了精心设计的视觉编码器预训练的优势超过了7B+预训练语言模型的好处。消融研究表明,渐进式预训练显著提高了F1,而多层注入则改善了BLEU-mean。代码可在 https://github.com/yinghemedical/U-VLM 获取。
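U-VLM's multi-layer visual injection routes U-Net encoder features at different scales to corresponding language-model layers. As a minimal sketch of one plausible routing scheme, the helper below spaces the K encoder scales evenly across the decoder depth; the function name and the evenly spaced assignment are assumptions, not the paper's exact mapping.

```python
def route_features(encoder_feats, num_lm_layers):
    """Assign each of K multi-scale encoder features to an LM layer index.

    Evenly spacing the K scales over the decoder depth is one simple choice;
    the paper may align scales to layers differently.
    """
    k = len(encoder_feats)
    if k == 1:
        return {0: encoder_feats[0]}
    layer_ids = [round(i * (num_lm_layers - 1) / (k - 1)) for i in range(k)]
    return dict(zip(layer_ids, encoder_feats))

# Three encoder scales injected into a 12-layer decoder.
routing = route_features(["scale_1", "scale_2", "scale_3"], num_lm_layers=12)
```

At each listed layer, the decoder would add (or cross-attend to) the routed feature instead of receiving visual tokens only at the input embedding, preserving multi-scale information.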
cs.CV / 75 / 2603.00482
TokenCom: Vision-Language Model for Multimodal and Multitask Token Communications
TokenCom:用于多模态和多任务令牌通信的视觉-语言模型
Abstract
Vision-Language Models (VLMs), with their strong capabilities in image and text understanding, offer a solid foundation for intelligent communications. However, their effectiveness is constrained by limited token granularity, overlong visual token sequences, and inadequate cross-modal alignment. To overcome these challenges, we propose TaiChi, a novel VLM framework designed for token communications. TaiChi adopts a dual-visual tokenizer architecture that processes both high- and low-resolution images to collaboratively capture pixel-level details and global conceptual features. A Bilateral Attention Network (BAN) is introduced to intelligently fuse multi-scale visual tokens, thereby enhancing visual understanding and producing compact visual tokens. In addition, a Kolmogorov Arnold Network (KAN)-based modality projector with learnable activation functions is employed to achieve precise nonlinear alignment from visual features to the text semantic space, thus minimizing information loss. Finally, TaiChi is integrated into a multimodal and multitask token communication system equipped with a joint VLM-channel coding scheme. Experimental results validate the superior performance of TaiChi, as well as the feasibility and effectiveness of the TaiChi-driven token communication system.
Chinese Translation
视觉-语言模型(VLMs)凭借其在图像和文本理解方面的强大能力,为智能通信提供了坚实的基础。然而,它们的有效性受到有限的令牌粒度、过长的视觉令牌序列以及不足的跨模态对齐的限制。为了解决这些挑战,我们提出了TaiChi,一种旨在令牌通信的新型VLM框架。TaiChi采用双视觉令牌化器架构,处理高分辨率和低分辨率图像,以协同捕捉像素级细节和全局概念特征。引入了一种双边注意力网络(Bilateral Attention Network, BAN),以智能融合多尺度视觉令牌,从而增强视觉理解并生成紧凑的视觉令牌。此外,采用基于Kolmogorov Arnold网络(KAN)的模态投影器,配备可学习的激活函数,以实现从视觉特征到文本语义空间的精确非线性对齐,从而最小化信息损失。最后,TaiChi被整合到一个多模态和多任务令牌通信系统中,该系统配备了联合VLM通道编码方案。实验结果验证了TaiChi的卓越性能,以及基于TaiChi的令牌通信系统的可行性和有效性。
cs.CV / 76 / 2603.00483
RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment
RAISE:基于需求自适应进化的无训练文本到图像对齐方法
Abstract
Recent text-to-image (T2I) diffusion models achieve remarkable realism, yet faithful prompt-image alignment remains challenging, particularly for complex prompts with multiple objects, relations, and fine-grained attributes. Existing training-free inference-time scaling methods rely on fixed iteration budgets that cannot adapt to prompt difficulty, while reflection-tuned models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models, often overfitting to reflection-path data and lacking transferability across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I generation. RAISE formulates image generation as a requirement-driven adaptive scaling process, evolving a population of candidates at inference time through a diverse set of refinement actions, including prompt rewriting, noise resampling, and instructional editing. Each generation is verified against a structured checklist of requirements, enabling the system to dynamically identify unsatisfied items and allocate further computation only where needed. This achieves adaptive test-time scaling that aligns computational effort with semantic query complexity. On GenEval and DrawBench, RAISE attains state-of-the-art alignment (0.94 overall on GenEval) while requiring 30-40% fewer generated samples and 80% fewer VLM calls than prior scaling and reflection-tuned baselines, demonstrating efficient, generalizable, and model-agnostic multi-round self-improvement. Code is available at https://github.com/LiyaoJiang1998/RAISE.
Chinese Translation
近期的文本到图像(T2I)扩散模型在现实感方面取得了显著进展,但忠实的提示-图像对齐仍然具有挑战性,尤其是对于包含多个对象、关系和细粒度属性的复杂提示。现有的无训练推理时间缩放方法依赖于固定的迭代预算,无法根据提示的难度进行调整,而反射调优模型则需要精心策划的反射数据集和扩展的扩散与视觉-语言模型的联合微调,往往会过拟合于反射路径数据,并缺乏跨模型的迁移能力。我们提出了RAISE(基于需求自适应自我改进进化),这是一种无训练、以需求驱动的自适应T2I生成的进化框架。RAISE将图像生成公式化为一种以需求驱动的自适应缩放过程,通过多样化的精炼操作(包括提示重写、噪声重采样和指令编辑)在推理时进化候选者的种群。每一代都通过结构化的需求清单进行验证,使系统能够动态识别未满足的项目,并仅在必要时分配进一步的计算。这实现了自适应的测试时间缩放,将计算努力与语义查询复杂性对齐。在GenEval和DrawBench上,RAISE达到了最先进的对齐效果(总体GenEval为0.94),同时生成样本数量减少(减少30-40%)和VLM调用次数减少(减少80%),相比于先前的缩放和反射调优基线,展示了高效、可迁移和模型无关的多轮自我改进。代码可在https://github.com/LiyaoJiang1998/RAISE获取。
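The checklist-driven loop that RAISE describes (verify a candidate against requirements, refine only while something is unmet) can be sketched generically. Everything below is illustrative: `verify`, the refinement actions, and the toy "image as a set of depicted elements" stand in for the actual generator and VLM verifier.

```python
def raise_loop(initial, verify, actions, max_rounds=5):
    """Requirement-adaptive refinement loop (a sketch; names are assumptions).

    `verify` returns a {requirement: satisfied} checklist for a candidate.
    Compute is spent only while some requirement is unmet, so easy prompts
    exit early while hard prompts receive more refinement rounds.
    """
    best = initial
    for _ in range(max_rounds):
        unmet = [req for req, ok in verify(best).items() if not ok]
        if not unmet:
            return best, True  # every checklist item satisfied: stop early
        # Evolve a small population via refinement actions, keep the top scorer
        # (stand-ins for prompt rewriting, noise resampling, instructional editing).
        population = [act(best, unmet) for act in actions]
        best = max(population, key=lambda c: sum(verify(c).values()))
    return best, False

# Toy example: a "generated image" is the set of prompt elements it depicts.
reqs = {"red cat", "on a table"}
verify = lambda cand: {r: (r in cand) for r in reqs}
edit = lambda cand, unmet: cand | {unmet[0]}  # fix one unmet item per round
result, satisfied = raise_loop(frozenset(), verify, [edit])
```

The early exit is what makes the scaling adaptive: the per-prompt budget is a consequence of the checklist, not a fixed iteration count.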
cs.CV / 77 / 2603.00486
Random Wins All: Rethinking Grouping Strategies for Vision Tokens
随机胜出:重新思考视觉标记的分组策略
Abstract
Since Transformers were introduced into vision architectures, their quadratic complexity has been a significant issue that many research efforts aim to address. A representative approach involves grouping tokens, performing self-attention calculations within each group, or pooling the tokens within each group into a single token. To this end, various carefully designed grouping strategies have been proposed to enhance the performance of Vision Transformers. Here, we pose the following questions: \textbf{Are these carefully designed grouping methods truly necessary? Is there a simpler and more unified token grouping method that can replace these diverse methods?} To answer them, we propose a simple and fast random grouping strategy for vision tokens. We validate this approach on multiple baselines, and experiments show that random grouping outperforms almost all other grouping methods. When transferred to downstream tasks, such as object detection, random grouping demonstrates even more pronounced advantages. To explain this phenomenon, we conduct a detailed analysis of the advantages of random grouping from multiple perspectives and identify several crucial elements for the design of grouping strategies: positional information, head feature diversity, global receptive field, and a fixed grouping pattern. We demonstrate that as long as these four conditions are met, vision tokens require only an extremely simple grouping strategy to efficiently and effectively handle various visual tasks. We also validate the effectiveness of our proposed random method across multiple modalities, including visual tasks, point cloud processing, and vision-language models. Code will be available at https://github.com/qhfan/random.
Chinese Translation
自从Transformer被引入视觉架构以来,其二次复杂性一直是许多研究努力试图解决的重大问题。一个代表性的方法涉及对标记进行分组,在每个组内执行自注意力计算,或将每个组内的标记汇聚为一个单一标记。为此,提出了多种精心设计的分组策略,以增强视觉Transformer的性能。在此,我们提出以下问题:这些精心设计的分组方法真的必要吗?是否存在一种更简单、更统一的标记分组方法可以替代这些多样化的方法?因此,我们提出了随机分组策略,这是一种简单快速的视觉标记随机分组策略。我们在多个基准上验证了这一方法,实验表明随机分组几乎优于所有其他分组方法。当转移到下游任务时,例如目标检测,随机分组表现出更明显的优势。针对这一现象,我们从多个角度对随机分组的优势进行了详细分析,并确定了分组策略设计的几个关键要素:位置信息、头特征多样性、全局感受野和固定分组模式。我们证明,只要满足这四个条件,视觉标记只需一种极其简单的分组策略即可高效、有效地处理各种视觉任务。我们还在多个模态上验证了我们提出的随机方法的有效性,包括视觉任务、点云处理和视觉-语言模型。代码将发布在 https://github.com/qhfan/random。
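The core mechanism is simple enough to state in a few lines: shuffle token indices once with a fixed seed and split them into equal-size groups, so every layer reuses the same partition (the "fixed grouping pattern" the abstract identifies as one of the four key conditions). The group size here is an illustrative choice.

```python
import random

def random_grouping(num_tokens, group_size, seed=0):
    """Fixed random partition of token indices into equal-size groups.

    Seeding once makes the pattern deterministic, so the same partition is
    reused across layers and forward passes; self-attention (or pooling)
    would then be applied within each returned group.
    """
    rng = random.Random(seed)
    idx = list(range(num_tokens))
    rng.shuffle(idx)
    # Split the shuffled indices into contiguous chunks of group_size.
    return [idx[i:i + group_size] for i in range(0, num_tokens, group_size)]

groups = random_grouping(num_tokens=16, group_size=4)
```

Because the grouping is data-independent, it costs essentially nothing at runtime, in contrast to learned or similarity-based grouping schemes.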
cs.CV / 78 / 2603.00492
ArtiFixer: Enhancing and Extending 3D Reconstruction with Auto-Regressive Diffusion Models
ArtiFixer:利用自回归扩散模型增强和扩展3D重建
Abstract
Per-scene optimization methods such as 3D Gaussian Splatting provide state-of-the-art novel view synthesis quality but extrapolate poorly to under-observed areas. Methods that leverage generative priors to correct artifacts in these areas hold promise but currently suffer from two shortcomings. The first is scalability: existing methods use image diffusion models or bidirectional video models that are limited in the number of views they can generate in a single pass (and thus require a costly iterative distillation process for consistency). The second is quality itself: generators used in prior work tend to produce outputs that are inconsistent with existing scene content and fail entirely in completely unobserved regions. To solve these, we propose a two-stage pipeline that leverages two key insights. First, we train a powerful bidirectional generative model with a novel opacity mixing strategy that encourages consistency with existing observations while retaining the model's ability to extrapolate novel content in unseen areas. Second, we distill it into a causal auto-regressive model that generates hundreds of frames in a single pass. This model can directly produce novel views or serve as pseudo-supervision to improve the underlying 3D representation in a simple and highly efficient manner. We evaluate our method extensively and demonstrate that it can generate plausible reconstructions in scenarios where existing approaches fail completely. When measured on commonly benchmarked datasets, we outperform all existing baselines by a wide margin, exceeding prior state-of-the-art methods by 1-3 dB PSNR.
Chinese Translation
每场景优化方法如3D高斯点云(3D Gaussian Splatting)提供了最先进的新视图合成质量,但在观察不足的区域表现不佳。利用生成先验来修正这些区域伪影的方法具有潜力,但目前存在两个不足之处。首先是可扩展性,现有方法使用的图像扩散模型或双向视频模型在单次处理时生成的视图数量有限(因此需要耗时的迭代蒸馏过程以确保一致性)。其次是质量本身,先前工作的生成器往往产生与现有场景内容不一致的输出,并在完全未观察到的区域完全失效。为了解决这些问题,我们提出了一种两阶段的管道,利用两个关键见解。首先,我们训练了一个强大的双向生成模型,采用了一种新颖的不透明度混合策略,鼓励与现有观察的一致性,同时保留模型在未见区域外推新内容的能力。其次,我们将其蒸馏为一个因果自回归模型,该模型能够在单次处理时生成数百帧。该模型可以直接生成新视图或作为伪监督,以简单且高效的方式改善基础3D表示。我们对我们的方法进行了广泛评估,证明它能够在现有方法完全失效的场景中生成可信的重建。在常用基准数据集上的测量结果显示,我们的表现大幅超越所有现有基线,PSNR比之前的最先进方法高出1-3 dB。
cs.CV / 79 / 2603.00493
COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation
COG:一种基于置信度的最优几何对应方法用于无监督单参考新物体姿态估计
Abstract
Estimating the 6DoF pose of a novel object from a single reference view is challenging due to occlusions, viewpoint changes, and outliers. A core difficulty lies in finding robust cross-view correspondences, as existing methods often rely on discrete one-to-one matching that is non-differentiable and tends to collapse onto sparse keypoints. We propose Confidence-aware Optimal Geometric Correspondence (COG), an unsupervised framework that formulates correspondence estimation as a confidence-aware optimal transport problem. COG produces balanced soft correspondences by predicting point-wise confidences and injecting them as optimal transport marginals, suppressing non-overlapping regions. Semantic priors from vision foundation models further regularize the correspondences, leading to stable pose estimation. This design integrates confidence into the correspondence-finding and pose-estimation pipeline, enabling unsupervised learning. Experiments show that unsupervised COG achieves performance comparable to supervised methods, while supervised COG outperforms them.
Chinese Translation
由于遮挡、视角变化和异常值,利用单一参考视图估计新物体的6自由度姿态具有挑战性。核心难点在于寻找稳健的跨视图对应关系,因为现有方法通常依赖于不可微分的离散一对一匹配,且往往会集中在稀疏关键点上。我们提出了基于置信度的最优几何对应方法(Confidence-aware Optimal Geometric Correspondence,COG),这是一个将对应关系估计公式化为基于置信度的最优传输问题的无监督框架。COG通过预测逐点置信度并将其作为最优传输边际注入,从而产生平衡的软对应关系,抑制非重叠区域。来自视觉基础模型的语义先验进一步规范化对应关系,从而实现稳定的姿态估计。该设计将置信度整合到对应关系寻找和姿态估计流程中,使得无监督学习成为可能。实验表明,无监督COG的性能与监督方法相当,而监督COG则优于这些方法。
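The idea of "confidences as optimal transport marginals" can be illustrated with a plain entropic-OT (Sinkhorn) sketch: points predicted to be non-overlapping get small marginal mass, so the transport plan routes little correspondence through them. This is a generic sketch under standard entropic OT, not COG's exact formulation.

```python
import math

def confidence_ot(cost, src_conf, tgt_conf, eps=0.1, iters=300):
    """Sinkhorn-style soft correspondence with confidences as marginals.

    cost[i][j] is a cross-view matching cost; src_conf / tgt_conf are the
    predicted point-wise confidences used as row / column marginals.
    Returns the dense soft correspondence plan.
    """
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):  # alternate marginal projections
        u = [src_conf[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [tgt_conf[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Two points per view; the second point in each view has low confidence,
# e.g. it lies in a predicted non-overlapping region.
plan = confidence_ot([[0.0, 1.0], [1.0, 0.0]], [0.9, 0.1], [0.9, 0.1])
```

Unlike discrete one-to-one matching, every step here is differentiable, which is what lets the confidences be learned end-to-end without correspondence labels.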
cs.CV / 80 / 2603.00503
M$^2$: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval
M$^2$: 通过轨迹摘要和洞察检索实现长时间跨度网络代理的双重记忆增强
Abstract
Multimodal Large Language Models (MLLMs) based agents have demonstrated remarkable potential in autonomous web navigation. However, handling long-horizon tasks remains a critical bottleneck. Prevailing strategies often rely heavily on extensive data collection and model training, yet still struggle with high computational costs and insufficient reasoning capabilities when facing complex, long-horizon scenarios. To address this, we propose M$^2$, a training-free, memory-augmented framework designed to optimize context efficiency and decision-making robustness. Our approach incorporates a dual-tier memory mechanism that synergizes Dynamic Trajectory Summarization (Internal Memory) to compress verbose interaction history into concise state updates, and Insight Retrieval Augmentation (External Memory) to guide the agent with actionable guidelines retrieved from an offline insight bank. Extensive evaluations across WebVoyager and OnlineMind2Web demonstrate that M$^2$ consistently surpasses baselines, yielding up to a 19.6% success rate increase and 58.7% token reduction for Qwen3-VL-32B, while proprietary models like Claude achieve accuracy gains up to 12.5% alongside significantly lower computational overhead.
Chinese Translation
基于多模态大型语言模型(MLLMs)的代理在自主网络导航中展现出显著潜力。然而,处理长时间跨度任务仍然是一个关键瓶颈。现有策略往往过于依赖大量数据收集和模型训练,但在面对复杂的长时间跨度场景时,仍然面临高计算成本和推理能力不足的问题。为了解决这一问题,我们提出了M$^2$,一个无训练的、增强记忆的框架,旨在优化上下文效率和决策的稳健性。我们的方法结合了双层记忆机制,利用动态轨迹摘要(Internal Memory)将冗长的交互历史压缩为简洁的状态更新,并通过洞察检索增强(External Memory)为代理提供从离线洞察库中检索的可操作指导。通过在WebVoyager和OnlineMind2Web上的广泛评估,M$^2$始终超越基线,Qwen3-VL-32B的成功率提高了多达19.6%,并减少了58.7%的标记数量,而像Claude这样的专有模型在准确性上提高了多达12.5%,同时显著降低了计算开销。
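The Internal Memory side (dynamic trajectory summarization) amounts to collapsing older interaction steps into a compact state summary while keeping recent steps verbatim. The sketch below assumes a simple step schema (`action`/`obs` fields) and a join-based summarizer; in practice the summary would come from the MLLM itself.

```python
def compress_trajectory(steps, keep_last=3):
    """Dynamic trajectory summarization sketch (field names are assumptions).

    Older steps collapse into a one-line state summary; the most recent
    keep_last steps stay verbatim, bounding the context the agent carries
    on long-horizon tasks.
    """
    old, recent = steps[:-keep_last], steps[-keep_last:]
    summary = " -> ".join(s["action"] for s in old)
    return {"state_summary": summary, "recent_steps": recent}

# Six verbose interaction steps, each with a large observation payload.
history = [{"action": f"click #{i}", "obs": "x" * 500} for i in range(6)]
memory = compress_trajectory(history)
```

The External Memory side would then prepend guidelines retrieved from an offline insight bank; only the compact summary plus the recent window reaches the model's context.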
cs.CV / 81 / 2603.00504
Hierarchical Classification for Improved Histopathology Image Analysis
改进的组织病理图像分析的层次分类
Abstract
Whole-slide image (WSI) analysis is essential for diagnostic tasks in pathology, yet existing deep learning methods primarily rely on flat classification, ignoring hierarchical relationships among class labels. In this study, we propose HiClass, a hierarchical classification framework for improved histopathology image analysis that enhances both coarse-grained and fine-grained WSI classification. Built upon a multiple instance learning approach, HiClass extends it by introducing bidirectional feature integration that facilitates information exchange between coarse-grained and fine-grained feature representations, effectively learning hierarchical features. Moreover, we introduce tailored loss functions, including a hierarchical consistency loss, intra- and inter-class distance losses, and a group-wise cross-entropy loss, to further optimize hierarchical learning. We assess the performance of HiClass on a gastric biopsy dataset with 4 coarse-grained and 14 fine-grained classes, achieving superior performance on both coarse-grained and fine-grained classification. These results demonstrate the effectiveness of HiClass in improving WSI classification by capturing coarse-grained and fine-grained histopathological characteristics.
Chinese Translation
全切片图像分析对于病理学中的诊断任务至关重要,但现有的深度学习方法主要依赖于平面分类,忽略了类别标签之间的层次关系。在本研究中,我们提出了 HiClass,一个用于改进组织病理图像分析的层次分类框架,增强了粗粒度和细粒度的全切片图像分类。HiClass 基于多实例学习方法构建,通过引入双向特征集成,促进粗粒度和细粒度特征表示之间的信息交换,有效地学习层次特征。此外,我们引入了定制的损失函数,包括层次一致性损失、类内和类间距离损失以及组间交叉熵损失,以进一步优化层次学习。我们在一个包含 4 个粗粒度类别和 14 个细粒度类别的胃活检数据集上评估了 HiClass 的性能,结果显示其在粗粒度分类和细粒度分类方面均取得了优越的分类性能。这些结果证明了 HiClass 在通过捕捉粗粒度和细粒度组织病理特征来改善全切片图像分类方面的有效性。
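One simple way to instantiate a hierarchical consistency loss is to roll fine-grained probability mass up to each coarse parent and penalize the gap with the coarse head's prediction. The L1 form and the fine-to-coarse mapping below are illustrative assumptions, not necessarily HiClass's exact formulation.

```python
def hierarchy_consistency_loss(coarse_probs, fine_probs, fine_to_coarse):
    """Gap between the coarse head and fine probabilities summed per parent.

    fine_to_coarse[f] gives the coarse parent index of fine class f. A zero
    loss means the two heads agree on how probability mass splits across
    the label hierarchy.
    """
    agg = [0.0] * len(coarse_probs)
    for f, p in enumerate(fine_probs):
        agg[fine_to_coarse[f]] += p  # roll fine mass up to its parent class
    return sum(abs(a - c) for a, c in zip(agg, coarse_probs))

# Two coarse classes; fine classes 0,1 map to coarse 0 and fine class 2 to coarse 1.
consistent = hierarchy_consistency_loss([0.7, 0.3], [0.4, 0.3, 0.3], [0, 0, 1])
inconsistent = hierarchy_consistency_loss([0.2, 0.8], [0.4, 0.3, 0.3], [0, 0, 1])
```

Added to the standard classification losses, a term like this nudges the coarse and fine heads toward mutually consistent predictions during joint training.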
cs.CV / 82 / 2603.00510
What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models
视觉标记究竟编码了什么?揭示多模态大型语言模型中的稀疏性和冗余性
Abstract
Multimodal large language models (MLLMs) project visual tokens into the embedding space of language models, yet the internal structuring and processing of visual semantics remain poorly understood. In this work, we introduce a two-fold analytical framework featuring a novel probing tool, $\textbf{EmbedLens}$, to conduct a fine-grained analysis. We uncover a pronounced semantic sparsity at the input level: visual tokens consistently partition into sink, dead, and alive categories. Remarkably, only the alive tokens, comprising $\approx60\%$ of the total input, carry image-specific meaning. Furthermore, using a targeted patch-compression benchmark, we demonstrate that these alive tokens already encode rich, fine-grained cues (e.g., objects, colors, and OCR) prior to entering the LLM. Internal visual computations (such as visual attention and feed-forward networks) are redundant for most standard tasks. For the small subset of highly vision-centric tasks that actually benefit from internal processing, we reveal that alive tokens naturally align with intermediate LLM layers rather than the initial embedding space, indicating that shallow-layer processing is unnecessary and that direct mid-layer injection is sufficient. Ultimately, our findings provide a unified mechanistic view of visual token processing, paving the way for more efficient and interpretable MLLM architectures through selective token pruning, minimized visual computation, and mid-layer injection. The code is released at: https://github.com/EIT-NLP/EmbedLens.
Chinese Translation
多模态大型语言模型(MLLMs)将视觉标记投射到语言模型的嵌入空间中,但视觉语义的内部结构和处理仍然不够清晰。在本研究中,我们引入了一个双重分析框架,采用了一种新颖的探测工具$\textbf{EmbedLens}$,以进行细致的分析。我们发现输入层面存在显著的语义稀疏性:视觉标记始终被划分为沉没、无效和有效三类。值得注意的是,只有有效标记(约占总输入的60%)携带特定于图像的意义。此外,通过使用针对性的补丁压缩基准,我们证明这些有效标记在进入LLM之前已经编码了丰富的细粒度线索(例如,物体、颜色和光学字符识别)。对于大多数标准任务,内部视觉计算(如视觉注意力和前馈网络)是冗余的。对于少数高度依赖视觉的任务,实际受益于内部处理,我们发现有效标记自然与中间LLM层对齐,而不是初始嵌入空间,这表明浅层处理是不必要的,直接的中层注入是足够的。最终,我们的发现提供了对视觉标记处理的统一机制视角,为通过选择性标记修剪、最小化视觉计算和中层注入来实现更高效和可解释的MLLM架构铺平了道路。代码已发布于:https://github.com/EIT-NLP/EmbedLens。
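The sink/dead/alive partition described above can be illustrated with a minimal numpy sketch under assumed criteria (the paper's actual probing tool is not reproduced here): tokens that absorb an outsized share of attention are treated as sinks, near-zero-norm tokens as dead, and the remainder as alive. The thresholds `sink_q` and `dead_norm` are hypothetical.

```python
import numpy as np

def partition_visual_tokens(attn_received, feat_norm,
                            sink_q=0.99, dead_norm=1e-2):
    """Partition visual tokens into sink / dead / alive.

    Hypothetical criteria: 'sink' tokens receive an outsized share of
    attention, 'dead' tokens have near-zero feature norm, and the rest
    are 'alive' -- the tokens assumed to carry image-specific meaning.
    """
    labels = np.full(attn_received.shape, "alive", dtype=object)
    sink_thresh = np.quantile(attn_received, sink_q)
    labels[attn_received >= sink_thresh] = "sink"
    labels[(feat_norm < dead_norm) & (labels == "alive")] = "dead"
    return labels

rng = np.random.default_rng(0)
attn = rng.random(576)      # attention mass received per token (toy data)
norms = rng.random(576)     # embedding norm per token (toy data)
norms[:10] = 0.0            # a few near-zero ("dead") tokens
labels = partition_visual_tokens(attn, norms)
print({k: int((labels == k).sum()) for k in ("sink", "dead", "alive")})
```

With these toy thresholds most tokens land in the alive category, loosely mirroring the paper's observation that only a subset of tokens carries image-specific content.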
cs.CV / 83 / 2603.00511
Multimodal Adaptive Retrieval Augmented Generation through Internal Representation Learning
通过内部表示学习的多模态自适应检索增强生成
Abstract
Visual Question Answering systems face reliability issues due to hallucinations, where models generate answers misaligned with visual input or factual knowledge. While Retrieval Augmented Generation frameworks mitigate this issue by incorporating external knowledge, static retrieval often introduces irrelevant or conflicting content, particularly in visual RAG settings where visually similar but semantically incorrect evidence may be retrieved. To address this, we propose Multimodal Adaptive RAG (MMA-RAG), which dynamically assesses the confidence in the internal knowledge of the model to decide whether to incorporate the retrieved external information into the generation process. Central to MMA-RAG is a decision classifier trained through a layer-wise analysis, which leverages joint internal visual and textual representations to guide the use of reverse image retrieval. Experiments demonstrated that the model achieves a significant improvement in response performance on three VQA datasets. Meanwhile, ablation studies highlighted the importance of internal representations in adaptive retrieval decisions. In general, the experimental results demonstrated that MMA-RAG effectively balances external knowledge utilization and inference robustness in diverse multimodal scenarios.
Chinese Translation
视觉问答系统面临由于幻觉而导致的可靠性问题,即模型生成的答案与视觉输入或事实知识不一致。虽然检索增强生成(Retrieval Augmented Generation, RAG)框架通过引入外部知识来缓解这一问题,但静态检索往往会引入无关或相互冲突的内容,特别是在视觉RAG设置中,可能会检索到视觉上相似但语义上不正确的证据。为了解决这一问题,我们提出了多模态自适应RAG(Multimodal Adaptive RAG, MMA-RAG),该方法动态评估模型内部知识的置信度,以决定是否将检索到的外部信息纳入生成过程。MMA-RAG的核心是通过逐层分析训练的决策分类器,该分类器利用联合的内部视觉和文本表示来指导反向图像检索的使用。实验表明,该模型在三个视觉问答数据集上的响应性能显著提升。同时,消融研究强调了内部表示在自适应检索决策中的重要性。总体而言,实验结果表明MMA-RAG在多样化的多模态场景中有效平衡了外部知识的利用和推理的稳健性。
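The adaptive decision at the heart of MMA-RAG, whether to rely on internal knowledge or fall back to retrieval, can be sketched as a logistic probe over the joint internal representation. The weights `w`, `b` stand in for the paper's layer-wise-trained decision classifier; everything below is illustrative, not the published model.

```python
import numpy as np

def retrieval_gate(visual_repr, text_repr, w, b, threshold=0.5):
    """Decide whether to use retrieved external evidence.

    A minimal stand-in for an adaptive-RAG decision classifier: a
    logistic probe over the concatenated internal visual and textual
    representations. When the predicted confidence in the model's own
    knowledge is low, fall back to retrieval.
    """
    z = np.concatenate([visual_repr, text_repr])
    p_confident = 1.0 / (1.0 + np.exp(-(w @ z + b)))
    mode = "internal" if p_confident >= threshold else "retrieve"
    return mode, float(p_confident)

rng = np.random.default_rng(1)
d = 8
w = rng.normal(size=2 * d)                  # probe weights (toy values)
decision, p = retrieval_gate(rng.normal(size=d), rng.normal(size=d), w, 0.0)
print(decision, round(p, 3))
```

In the real system the probe would be trained on examples where internal-only answers succeed or fail, and the retrieved evidence would come from reverse image retrieval.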
cs.CV / 84 / 2603.00512
Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding
基于小波的帧选择通过检测语义边界实现长视频理解
Abstract
Frame selection is crucial due to high frame redundancy and limited context windows when applying Large Vision-Language Models (LVLMs) to long videos. Current methods typically select frames with high relevance to a given query, resulting in a disjointed set of frames that disregard the narrative structure of video. In this paper, we introduce Wavelet-based Frame Selection by Detecting Semantic Boundary (WFS-SB), a training-free framework that presents a new perspective: effective video understanding hinges not only on high relevance but, more importantly, on capturing semantic shifts - pivotal moments of narrative change that are essential to comprehending the holistic storyline of video. However, direct detection of abrupt changes in the query-frame similarity signal is often unreliable due to high-frequency noise arising from model uncertainty and transient visual variations. To address this, we leverage the wavelet transform, which provides an ideal solution through its multi-resolution analysis in both time and frequency domains. By applying this transform, we decompose the noisy signal into multiple scales and extract a clean semantic change signal from the coarsest scale. We identify the local extrema of this signal as semantic boundaries, which segment the video into coherent clips. Building on this, WFS-SB comprises a two-stage strategy: first, adaptively allocating a frame budget to each clip based on a composite importance score; and second, within each clip, employing the Maximal Marginal Relevance approach to select a diverse yet relevant set of frames. Extensive experiments show that WFS-SB significantly boosts LVLM performance, e.g., improving accuracy by 5.5% on VideoMME, 9.5% on MLVU, and 6.2% on LongVideoBench, consistently outperforming state-of-the-art methods.
Chinese Translation
由于高帧冗余和有限的上下文窗口,在将大型视觉语言模型(LVLMs)应用于长视频时,帧选择至关重要。目前的方法通常选择与给定查询高度相关的帧,导致选择的帧集合缺乏连贯性,忽视了视频的叙事结构。本文提出了一种基于小波的帧选择方法,通过检测语义边界(WFS-SB),这是一个无训练的框架,提供了一个新的视角:有效的视频理解不仅依赖于高相关性,更重要的是捕捉语义变化——叙事变化的关键时刻,这对于理解视频的整体故事情节至关重要。然而,由于模型不确定性和瞬态视觉变化引起的高频噪声,直接检测查询-帧相似性信号中的突变通常是不可靠的。为了解决这个问题,我们利用小波变换,它通过在时间和频率域的多分辨率分析提供了理想的解决方案。通过应用这种变换,我们将噪声信号分解为多个尺度,并从最粗尺度中提取干净的语义变化信号。我们将该信号的局部极值识别为语义边界,从而将视频分割成连贯的片段。在此基础上,WFS-SB包括一个两阶段策略:首先,根据复合重要性评分自适应地为每个片段分配帧预算;其次,在每个片段内,采用最大边际相关性(Maximal Marginal Relevance)方法选择一组多样且相关的帧。大量实验表明,WFS-SB显著提升了LVLM的性能,例如,在VideoMME上提高了5.5%的准确率,在MLVU上提高了9.5%,在LongVideoBench上提高了6.2%,始终优于最先进的方法。
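The wavelet step above can be illustrated with a hand-rolled Haar approximation: repeatedly average adjacent samples of the query-frame similarity signal, then flag large jumps in the coarsest scale as semantic boundaries. The jump criterion (mean + 2 std of the coarse differences) is an assumption for this sketch, not the paper's exact rule.

```python
import numpy as np

def haar_coarse(signal, levels=3):
    """Coarsest-scale Haar approximation via repeated pairwise averaging."""
    a = np.asarray(signal, dtype=float)
    for _ in range(levels):
        if len(a) % 2:
            a = np.append(a, a[-1])          # pad to even length
        a = 0.5 * (a[0::2] + a[1::2])
    return a

def semantic_boundaries(similarity, levels=3):
    """Frame indices where the denoised coarse-scale signal jumps."""
    coarse = haar_coarse(similarity, levels)
    d = np.abs(np.diff(coarse))
    jumps = np.where(d > d.mean() + 2.0 * d.std())[0] + 1
    return jumps * (2 ** levels)             # map back to frame indices

# Toy query-frame similarity: two semantic segments plus noise.
rng = np.random.default_rng(2)
sim = np.concatenate([np.full(64, 0.2), np.full(64, 0.8)])
sim = sim + 0.05 * rng.standard_normal(sim.size)
print(semantic_boundaries(sim, levels=3))    # one boundary near frame 64
```

The detected boundaries would then segment the video into clips, after which a frame budget and Maximal Marginal Relevance selection are applied per clip.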
cs.CV / 85 / 2603.00515
MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence
MLLM-4D:面向基于视觉的时空智能
Abstract
Humans are born with vision-based 4D spatial-temporal intelligence, which enables us to perceive and reason about the evolution of 3D space over time from purely visual inputs. Despite its importance, this capability remains a significant bottleneck for current multimodal large language models (MLLMs). To tackle this challenge, we introduce MLLM-4D, a comprehensive framework designed to bridge the gaps in training data curation and model post-training for spatiotemporal understanding and reasoning. On the data front, we develop a cost-efficient data curation pipeline that repurposes existing stereo video datasets into high-quality 4D spatiotemporal instructional data. This results in the MLLM4D-2M and MLLM4D-R1-30k datasets for Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT), alongside MLLM4D-Bench for comprehensive evaluation. Regarding model training, our post-training strategy establishes a foundational 4D understanding via SFT and further catalyzes 4D reasoning capabilities by employing Group Relative Policy Optimization (GRPO) with specialized Spatiotemporal Chain of Thought (ST-CoT) prompting and Spatiotemporal reward functions (ST-reward) without involving the modification of architecture. Extensive experiments demonstrate that MLLM-4D achieves state-of-the-art spatial-temporal understanding and reasoning capabilities from purely 2D RGB inputs. Project page: https://github.com/GVCLab/MLLM-4D.
Chinese Translation
人类天生具备基于视觉的四维时空智能,这使我们能够从纯粹的视觉输入中感知和推理三维空间随时间的演变。尽管这一能力至关重要,但它仍然是当前多模态大语言模型(MLLMs)的一个重大瓶颈。为了解决这一挑战,我们提出了MLLM-4D,这是一个全面的框架,旨在弥补时空理解和推理的训练数据整理和模型后训练中的空白。在数据方面,我们开发了一种成本高效的数据整理管道,将现有的立体视频数据集重新利用为高质量的四维时空教学数据。这导致了用于监督微调(SFT)的MLLM4D-2M和用于强化微调(RFT)的MLLM4D-R1-30k数据集,以及用于全面评估的MLLM4D-Bench。在模型训练方面,我们的后训练策略通过SFT建立了基础的四维理解,并通过采用群体相对策略优化(GRPO)与专门的时空思维链(ST-CoT)提示和时空奖励函数(ST-reward)进一步促进四维推理能力,而无需修改架构。大量实验表明,MLLM-4D能够从纯粹的二维RGB输入中实现最先进的时空理解和推理能力。项目页面:https://github.com/GVCLab/MLLM-4D。
cs.CV / 86 / 2603.00518
Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training
Vision-TTT:高效且富有表现力的视觉表征学习与测试时训练
Abstract
Learning efficient and expressive visual representation has long been the pursuit of computer vision research. While Vision Transformers (ViTs) gradually replace traditional Convolutional Neural Networks (CNNs) as more scalable vision learners, their applications are plagued by the quadratic complexity of the self-attention mechanism. To address the challenge, we introduce a new linear-time sequence modeling method Test-Time Training (TTT) into vision and propose Vision-TTT, which compresses the visual token sequence in a novel self-supervised learning manner. By incorporating bidirectional scan strategy and the Conv2d module, Vision-TTT effectively extends vanilla TTT to model 2D visual correlations with global receptive fields. Extensive experiments show that \texttt{Vittt-T/S/B} achieve 77.3%, 81.2%, and 82.5% Top-1 accuracy on ImageNet classification and also greatly outperform their counterparts on downstream tasks. At 1280x1280 resolution, \texttt{Vittt-T} reduces FLOPs by 79.4% and runs 4.38x faster with 88.9% less memory than DeiT-T. These results demonstrate the expressiveness and efficiency of Vision-TTT as a strong candidate for the next-generation generic visual backbone.
Chinese Translation
高效且富有表现力的视觉表征学习一直是计算机视觉研究的追求。尽管视觉变换器(Vision Transformers, ViTs)逐渐取代传统卷积神经网络(Convolutional Neural Networks, CNNs)成为更具可扩展性的视觉学习者,但其应用受到自注意力机制的二次复杂度的困扰。为了解决这一挑战,我们将一种新的线性时间序列建模方法——测试时训练(Test-Time Training, TTT)引入视觉领域,并提出Vision-TTT,该方法以新颖的自监督学习方式压缩视觉标记序列。通过结合双向扫描策略和Conv2d模块,Vision-TTT有效地将普通TTT扩展到建模具有全局感受野的二维视觉相关性。大量实验表明,\texttt{Vittt-T/S/B}在ImageNet分类上分别达到了77.3%、81.2%、82.5%的Top-1准确率,并在下游任务中大幅超越了其对应模型。在1280x1280分辨率下,\texttt{Vittt-T}的FLOPs减少了79.4%,运行速度比DeiT-T快4.38倍,内存使用减少了88.9%。这些结果证明了Vision-TTT作为下一代通用视觉骨干网的强大候选者,具有良好的表现力和效率。
cs.CV / 87 / 2603.00519
Jano: Adaptive Diffusion Generation with Early-stage Convergence Awareness
Jano:具有早期收敛意识的自适应扩散生成
Abstract
Diffusion models have achieved remarkable success in generative AI, yet their computational efficiency remains a significant challenge, particularly for Diffusion Transformers (DiTs) requiring intensive full-attention computation. While existing acceleration approaches focus on content-agnostic uniform optimization strategies, we observe that different regions in generated content exhibit heterogeneous convergence patterns during the denoising process. We present Jano, a training-free framework that leverages this insight for efficient region-aware generation. Jano introduces an early-stage complexity recognition algorithm that accurately identifies regional convergence requirements within initial denoising steps, coupled with an adaptive token scheduling runtime that optimizes computational resource allocation. Through comprehensive evaluation on state-of-the-art models, Jano achieves substantial acceleration (average 2.0 times speedup, up to 2.4 times) while preserving generation quality. Our work challenges conventional uniform processing assumptions and provides a practical solution for accelerating large-scale content generation. The source code of our implementation is available at https://github.com/chen-yy20/Jano.
Chinese Translation
扩散模型在生成性人工智能中取得了显著成功,但其计算效率仍然是一个重大挑战,特别是对于需要大量全注意力计算的扩散变换器(Diffusion Transformers, DiTs)。虽然现有的加速方法集中于内容无关的均匀优化策略,但我们观察到在去噪过程中生成内容的不同区域表现出异质的收敛模式。我们提出了Jano,一个无训练的框架,利用这一见解实现高效的区域感知生成。Jano引入了一种早期复杂性识别算法,能够准确识别初始去噪步骤中的区域收敛需求,并结合自适应的令牌调度运行时,优化计算资源分配。通过对最先进模型的全面评估,Jano实现了显著加速(平均加速2.0倍,最高可达2.4倍),同时保持生成质量。我们的工作挑战了传统的均匀处理假设,并为加速大规模内容生成提供了实用解决方案。我们的实现源代码可在https://github.com/chen-yy20/Jano获取。
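Jano's early-stage complexity recognition can be approximated by a toy stand-in: record per-region latents over the first few denoising steps and mark regions whose relative step-to-step change falls below a tolerance as converged, so later steps can skip them. The convergence test and tolerance below are assumptions for illustration only.

```python
import numpy as np

def converged_regions(latents, tol=0.05):
    """Flag regions whose latents have stabilised over early steps.

    `latents` has shape (steps, regions, dim): region latents recorded
    over the first few denoising steps. A region counts as converged
    when its last step-to-step relative change drops below `tol`.
    """
    delta = np.linalg.norm(latents[-1] - latents[-2], axis=-1)
    scale = np.linalg.norm(latents[-1], axis=-1) + 1e-8
    return (delta / scale) < tol

rng = np.random.default_rng(3)
steps, regions, dim = 4, 6, 16
base = rng.normal(size=(regions, dim))
# Most regions keep changing with geometrically shrinking updates...
lat = np.stack([base + (0.5 ** t) * rng.normal(size=(regions, dim))
                for t in range(steps)])
lat[:, :2] = base[:2]                        # ...regions 0-1 are static
mask = converged_regions(lat, tol=0.05)
print(mask)
```

An adaptive scheduler would then allocate full computation only to the non-converged regions on subsequent denoising steps.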
cs.CV / 88 / 2603.00526
Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation
Mesh-Pro:用于艺术风格四边形网格生成的异步优势引导排名偏好优化
Abstract
Reinforcement learning (RL) has demonstrated remarkable success in text and image generation, yet its potential in 3D generation remains largely unexplored. Existing attempts typically rely on offline direct preference optimization (DPO) method, which suffers from low training efficiency and limited generalization. In this work, we aim to enhance both the training efficiency and generation quality of RL in 3D mesh generation. Specifically, (1) we design the first asynchronous online RL framework tailored to improve post-training efficiency for 3D mesh generation, which is 3.75$\times$ faster than synchronous RL. (2) We propose Advantage-guided Ranking Preference Optimization (ARPO), a novel RL algorithm that achieves a better trade-off between training efficiency and generalization than current RL algorithms designed for 3D mesh generation, such as DPO and group relative policy optimization (GRPO). (3) Based on asynchronous ARPO, we propose Mesh-Pro, which additionally introduces a novel diagonal-aware mixed triangular-quadrilateral tokenization for mesh representation and a ray-based reward for geometric integrity. Mesh-Pro achieves state-of-the-art performance on artistic and dense meshes.
Chinese Translation
强化学习(RL)在文本和图像生成方面取得了显著成功,但其在三维生成中的潜力仍然未被充分探索。现有的尝试通常依赖于离线直接偏好优化(DPO)方法,该方法存在训练效率低和泛化能力有限的问题。在本研究中,我们旨在提高RL在三维网格生成中的训练效率和生成质量。具体而言,(1) 我们设计了第一个异步在线RL框架,旨在提高三维网格生成后的训练效率,其速度比同步RL快3.75倍。(2) 我们提出了优势引导排名偏好优化(ARPO),这是一种新颖的RL算法,在训练效率和泛化能力之间实现了比现有为三维网格生成设计的RL算法(如DPO和组相对策略优化(GRPO))更好的平衡。(3) 基于异步ARPO,我们提出了Mesh-Pro,该方法还引入了一种新颖的对角线感知混合三角形-四边形标记化用于网格表示,以及基于光线的几何完整性奖励。Mesh-Pro在艺术和密集网格上实现了最先进的性能。
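One plausible reading of an advantage-guided ranking preference objective, sketched below, is a pairwise logistic preference loss over policy log-probabilities in which each pair is weighted by its advantage gap. This is a guess at the spirit of ARPO, not the paper's published formula.

```python
import numpy as np

def arpo_loss(logps, rewards):
    """Advantage-guided ranking preference loss (illustrative sketch).

    For each pair (i, j) with rewards[i] > rewards[j], apply a logistic
    preference loss on the policy log-probabilities, weighted by the
    pair's advantage gap, then normalise by the total weight.
    """
    adv = rewards - rewards.mean()
    loss, weight = 0.0, 0.0
    for i in range(len(logps)):
        for j in range(len(logps)):
            if rewards[i] > rewards[j]:
                w = adv[i] - adv[j]          # advantage gap for the pair
                loss += w * np.log1p(np.exp(-(logps[i] - logps[j])))
                weight += w
    return float(loss / max(weight, 1e-8))

logps = np.array([-1.0, -2.0, -3.0])   # policy log-probs per candidate mesh
rewards = np.array([1.0, 0.5, 0.0])    # e.g. a ray-based geometry reward
print(round(arpo_loss(logps, rewards), 4))
```

The loss is lowest when the policy's log-probabilities already rank candidates in reward order, which is the behaviour a ranking preference objective should encourage.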
cs.CV / 89 / 2603.00527
TP-Spikformer: Token Pruned Spiking Transformer
TP-Spikformer:令牌修剪脉冲变换器
Abstract
Spiking neural networks (SNNs) offer an energy-efficient alternative to traditional neural networks due to their event-driven computing paradigm. However, recent advancements in spiking transformers have focused on improving accuracy with large-scale architectures, which require significant computational resources and limit deployment on resource-constrained devices. In this paper, we propose a simple yet effective token pruning method for spiking transformers, termed TP-Spikformer, that reduces storage and computational overhead while maintaining competitive performance. Specifically, we first introduce a heuristic spatiotemporal information-retaining criterion that comprehensively evaluates tokens' importance, assigning higher scores to informative tokens for retention and lower scores to uninformative ones for pruning. Based on this criterion, we propose an information-retaining token pruning framework that employs a block-level early stopping strategy for uninformative tokens, instead of removing them outright. This also helps preserve more information during token pruning. We demonstrate the effectiveness, efficiency and scalability of TP-Spikformer through extensive experiments across diverse architectures, including Spikformer, QKFormer and Spike-driven Transformer V1 and V3, and a range of tasks such as image classification, object detection, semantic segmentation and event-based object tracking. Particularly, TP-Spikformer performs well in a training-free manner. These results reveal its potential as an efficient and practical solution for deploying SNNs in real-world applications with limited computational resources.
Chinese Translation
脉冲神经网络(SNNs)由于其事件驱动的计算范式,提供了比传统神经网络更为节能的替代方案。然而,近期脉冲变换器的进展主要集中在通过大规模架构提高准确性,这需要大量的计算资源,并限制了在资源受限设备上的部署。在本文中,我们提出了一种简单而有效的脉冲变换器令牌修剪方法,称为TP-Spikformer,该方法在保持竞争性能的同时减少存储和计算开销。具体而言,我们首先引入了一种启发式时空信息保留标准,该标准全面评估令牌的重要性,为信息丰富的令牌分配更高的保留分数,而为信息贫乏的令牌分配较低的修剪分数。基于这一标准,我们提出了一种信息保留令牌修剪框架,该框架采用区块级早停策略处理信息贫乏的令牌,而不是直接将其移除。这也有助于在令牌修剪过程中保留更多信息。我们通过在多种架构(包括Spikformer、QKFormer以及Spike-driven Transformer V1和V3)和一系列任务(如图像分类、目标检测、语义分割和基于事件的目标跟踪)中进行广泛实验,展示了TP-Spikformer的有效性、效率和可扩展性。特别地,TP-Spikformer在无训练的情况下表现良好。这些结果揭示了其作为在计算资源有限的实际应用中部署SNNs的高效且实用解决方案的潜力。
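The heuristic spatiotemporal criterion and the block-level early-stopping idea can be sketched as follows: score tokens by their firing rate plus temporal variability, keep the top fraction for processing, and let the rest bypass the block rather than being discarded. The scoring formula here is an assumption, not the paper's exact criterion.

```python
import numpy as np

def token_importance(spikes):
    """Heuristic spatiotemporal importance for spiking tokens.

    `spikes` has shape (T, N, D): spike trains over T timesteps for N
    tokens of dimension D. Importance combines mean firing rate with
    its variability across time (illustrative stand-in criterion).
    """
    rate = spikes.mean(axis=(0, 2))          # per-token firing rate
    var = spikes.mean(axis=2).var(axis=0)    # temporal variability
    return rate + var

def prune_block(spikes, n_tokens, keep_ratio=0.5):
    """Block-level early stopping: low-score tokens bypass the block."""
    k = max(1, int(n_tokens * keep_ratio))
    order = np.argsort(token_importance(spikes))[::-1]
    keep = np.sort(order[:k])                # processed by this block
    skip = np.sort(order[k:])                # carried through unchanged
    return keep, skip

rng = np.random.default_rng(4)
T, N, D = 4, 8, 16
spikes = (rng.random((T, N, D)) < 0.3).astype(float)
spikes[:, 0] = 0.0                           # token 0 never fires
keep, skip = prune_block(spikes, N, keep_ratio=0.5)
print(keep, skip)
```

Because skipped tokens are carried through rather than deleted, information is preserved for later blocks, matching the early-stopping framing above.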
cs.CV / 90 / 2603.00529
CaptionFool: Universal Image Captioning Model Attacks
CaptionFool:通用图像描述模型攻击
Abstract
Image captioning models are encoder-decoder architectures trained on large-scale image-text datasets, making them susceptible to adversarial attacks. We present CaptionFool, a novel universal (input-agnostic) adversarial attack against state-of-the-art transformer-based captioning models. By modifying only 7 out of 577 image patches (approximately 1.2% of the image), our attack achieves 94-96% success rate in generating arbitrary target captions, including offensive content. We further demonstrate that CaptionFool can generate "slang" terms specifically designed to evade existing content moderation filters. Our findings expose critical vulnerabilities in deployed vision-language models and underscore the urgent need for robust defenses against such attacks. Warning: This paper contains model outputs which are offensive in nature.
Chinese Translation
图像描述模型是基于编码器-解码器架构,训练于大规模图像-文本数据集,因此容易受到对抗攻击。我们提出了CaptionFool,这是一种针对最先进的基于变换器的图像描述模型的新型通用(输入无关)对抗攻击。通过仅修改577个图像补丁中的7个(约1.2%的图像),我们的攻击在生成任意目标描述方面实现了94-96%的成功率,包括攻击性内容。我们进一步展示了CaptionFool可以生成专门设计用于规避现有内容审核过滤器的“俚语”术语。我们的研究揭示了已部署的视觉-语言模型中的关键脆弱性,并强调了对这些攻击的强大防御的迫切需求。警告:本文包含具有攻击性性质的模型输出。
cs.CV / 91 / 2603.00535
RAFM: Retrieval-Augmented Flow Matching for Unpaired CBCT-to-CT Translation
RAFM:用于非配对CBCT到CT转换的检索增强流匹配
Abstract
Cone-beam CT (CBCT) is routinely acquired in radiotherapy but suffers from severe artifacts and unreliable Hounsfield Unit (HU) values, limiting its direct use for dose calculation. Synthetic CT (sCT) generation from CBCT is therefore an important task, yet paired CBCT--CT data are often unavailable or unreliable due to temporal gaps, anatomical variation, and registration errors. In this work, we introduce rectified flow (RF) into unpaired CBCT-to-CT translation in medical imaging. Although RF is theoretically compatible with unpaired learning through distribution-level coupling and deterministic transport, its practical effectiveness under small medical datasets and limited batch sizes remains underexplored. Direct application with random or batch-local pseudo pairing can produce unstable supervision due to semantically mismatched endpoint samples. To address this challenge, we propose Retrieval-Augmented Flow Matching (RAFM), which adapts RF to the medical setting by constructing retrieval-guided pseudo pairs using a frozen DINOv3 encoder and a global CT memory bank. This strategy improves empirical coupling quality and stabilizes unpaired flow-based training. Experiments on SynthRAD2023 under a strict subject-level true-unpaired protocol show that RAFM outperforms existing methods across FID, MAE, SSIM, PSNR, and SegScore. The code is available at https://github.com/HiLab-git/RAFM.git.
Chinese Translation
锥束CT(CBCT)在放射治疗中常规获取,但存在严重伪影和不可靠的亨斯菲尔德单位(HU)值,限制了其在剂量计算中的直接应用。因此,从CBCT生成合成CT(sCT)是一个重要任务,但由于时间间隔、解剖变化和配准错误,配对的CBCT-CT数据往往不可用或不可靠。在本研究中,我们将修正流(RF)引入医学影像中的非配对CBCT到CT转换。尽管RF在理论上通过分布级耦合和确定性传输与非配对学习兼容,但其在小型医学数据集和有限批量大小下的实际有效性仍未被充分探索。直接应用随机或批量局部伪配对可能会因语义不匹配的端点样本而产生不稳定的监督。为了解决这一挑战,我们提出了检索增强流匹配(RAFM),该方法通过使用冻结的DINOv3编码器和全局CT记忆库构建检索引导的伪配对,将RF适应于医学场景。这一策略提高了经验耦合质量,并稳定了基于流的非配对训练。在严格的受试者级真实非配对协议下对SynthRAD2023的实验表明,RAFM在FID、MAE、SSIM、PSNR和SegScore等指标上优于现有方法。代码可在https://github.com/HiLab-git/RAFM.git获取。
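At its core, retrieval-guided pseudo pairing reduces to a nearest-neighbour lookup of each CBCT embedding in a global CT memory bank. In the sketch below, random vectors stand in for the frozen DINOv3 features the paper uses.

```python
import numpy as np

def retrieve_pseudo_pairs(cbct_feats, ct_bank):
    """Retrieval-guided pseudo pairing against a global CT memory bank.

    Cosine nearest neighbour of each CBCT embedding among the CT
    embeddings selects the flow endpoint (pseudo pair) per sample.
    """
    a = cbct_feats / np.linalg.norm(cbct_feats, axis=1, keepdims=True)
    b = ct_bank / np.linalg.norm(ct_bank, axis=1, keepdims=True)
    return np.argmax(a @ b.T, axis=1)        # index into the CT bank

rng = np.random.default_rng(5)
ct_bank = rng.normal(size=(100, 32))         # stand-in for frozen features
cbct = ct_bank[[3, 42, 7]] + 0.05 * rng.normal(size=(3, 32))
print(retrieve_pseudo_pairs(cbct, ct_bank))
```

Retrieving semantically close endpoints this way is what replaces random or batch-local pseudo pairing and stabilises the unpaired flow-matching objective.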
cs.CV / 92 / 2603.00542
Adaptive Dynamic Dehazing via Instruction-Driven and Task-Feedback Closed-Loop Optimization for Diverse Downstream Task Adaptation
通过指令驱动和任务反馈闭环优化实现的自适应动态去雾以适应多样化下游任务
Abstract
In real-world vision systems,haze removal is required not only to enhance image visibility but also to meet the specific needs of diverse downstream tasks.To address this challenge,we propose a novel adaptive dynamic dehazing framework that incorporates a closed-loop optimization mechanism.It enables feedback-driven refinement based on downstream task performance and user instruction-guided adjustment during inference,allowing the model to satisfy the specific requirements of multiple downstream tasks without retraining.Technically,our framework integrates two complementary and innovative mechanisms: (1)a task feedback loop that dynamically modulates dehazing outputs based on performance across multiple downstream tasks,and (2) a text instruction interface that allows users to specify high-level task preferences.This dual-guidance strategy enables the model to adapt its dehazing behavior after training,tailoring outputs in real time to the evolving needs of multiple tasks.Extensive experiments across various vision tasks demonstrate the strong effectiveness,robustness,and generalizability of our approach.These results establish a new paradigm for interactive,task-adaptive dehazing that actively collaborates with downstream applications.
Chinese Translation
在现实世界的视觉系统中,去雾不仅需要增强图像的可见性,还需满足多样化下游任务的特定需求。为了解决这一挑战,我们提出了一种新颖的自适应动态去雾框架,该框架结合了闭环优化机制。它能够基于下游任务的性能进行反馈驱动的精细调整,并在推理过程中根据用户指令进行指导,从而使模型在不重新训练的情况下满足多个下游任务的特定要求。从技术上讲,我们的框架整合了两种互补且创新的机制:(1)一个任务反馈循环,根据多个下游任务的性能动态调节去雾输出;(2)一个文本指令接口,允许用户指定高层次的任务偏好。这种双重指导策略使模型能够在训练后适应其去雾行为,实时调整输出以满足多个任务的不断变化的需求。针对各种视觉任务的广泛实验证明了我们方法的强大有效性、鲁棒性和泛化能力。这些结果为交互式、任务自适应的去雾建立了一个新的范式,积极与下游应用协作。
cs.CV / 93 / 2603.00543
Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark
通过 ScaleFormer 实现跨尺度全色融合及 PanScale 基准测试
Abstract
Pansharpening aims to generate high-resolution multi-spectral images by fusing the spatial detail of panchromatic images with the spectral richness of low-resolution MS data. However, most existing methods are evaluated only under low-resolution settings, which limits their generalization to real-world, high-resolution scenarios. To bridge this gap, we systematically investigate the data, algorithmic, and computational challenges of cross-scale pansharpening. We first introduce PanScale, the first large-scale, cross-scale pansharpening dataset, accompanied by PanScale-Bench, a comprehensive benchmark for evaluating generalization across varying resolutions and scales. To realize scale generalization, we propose ScaleFormer, a novel architecture designed for multi-scale pansharpening. ScaleFormer reframes generalization across image resolutions as generalization across sequence lengths: it tokenizes images into patch sequences of the same resolution but variable length proportional to image scale. A Scale-Aware Patchify module enables training for such variations from fixed-size crops. ScaleFormer then decouples intra-patch spatial feature learning from inter-patch sequential dependency modeling, incorporating Rotary Positional Encoding to enhance extrapolation to unseen scales. Extensive experiments show that our approach outperforms SOTA methods in fusion quality and cross-scale generalization. The datasets and source code are available upon acceptance.
Chinese Translation
全色融合旨在通过将全色图像的空间细节与低分辨率多光谱数据的光谱丰富性相结合,生成高分辨率的多光谱图像。然而,现有的大多数方法在有限的低分辨率设置下进行评估,这限制了它们在真实世界高分辨率场景中的泛化能力。为了解决这一问题,我们系统地研究了跨尺度全色融合的数据、算法和计算挑战。我们首先介绍了 PanScale,这是第一个大规模的跨尺度全色融合数据集,并附带了 PanScale-Bench,这是一个全面的基准测试,用于评估在不同分辨率和尺度下的泛化能力。为了实现尺度泛化,我们提出了 ScaleFormer,这是一种为多尺度全色融合设计的新型架构。ScaleFormer 将跨图像分辨率的泛化重新构建为跨序列长度的泛化:它将图像标记为相同分辨率但长度可变的补丁序列,长度与图像尺度成比例。一个尺度感知的 Patchify 模块使得可以针对这种来自固定大小裁剪的变化进行训练。ScaleFormer 随后将补丁内部的空间特征学习与补丁之间的序列依赖建模解耦,结合旋转位置编码以增强对未见尺度的外推能力。大量实验表明,我们的方法在融合质量和跨尺度泛化方面优于现有最先进的方法。数据集和源代码将在接受后提供。
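The reframing of resolution generalization as sequence-length generalization can be shown concretely: with a fixed patch size, the token count grows with image area while the per-token dimension stays constant. The sketch below is a plain patchify routine, not the paper's Scale-Aware Patchify module.

```python
import numpy as np

def patchify(image, patch=16):
    """Tokenise an image into fixed-size patches.

    Sequence length scales with image resolution while each token keeps
    the same dimensionality -- the property that lets one model handle
    variable scales as variable-length sequences.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)   # (tokens, token_dim)

small = np.zeros((64, 64, 3))
large = np.zeros((256, 256, 3))
print(patchify(small).shape, patchify(large).shape)
```

A 64x64 image yields 16 tokens and a 256x256 image yields 256 tokens, both with token dimension 768, so cross-resolution generalization becomes a question of extrapolating over sequence length (aided in the paper by Rotary Positional Encoding).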
cs.CV / 94 / 2603.00545
Multiple Inputs and Mixed Data for Alzheimer's Disease Classification Based on 3D Vision Transformer
基于3D视觉变换器的阿尔茨海默病分类的多输入和混合数据
Abstract
The current methods for diagnosing Alzheimer's Disease using Magnetic Resonance Imaging (MRI) have significant limitations. Many previous studies used 2D Transformers to analyze individual brain slices independently, potentially losing critical 3D contextual information. Region of interest-based models often focus on only a few brain regions despite Alzheimer's affecting multiple areas. Additionally, most classification models rely on a single test, whereas diagnosing Alzheimer's requires a multifaceted approach integrating diverse data sources for a more accurate assessment. This study introduces a novel methodology called the Multiple Inputs and Mixed Data 3D Vision Transformer (MIMD-3DVT). This method processes consecutive slices together to capture the feature dimensions and spatial information, fuses multiple 3D ROI imaging data inputs, and integrates mixed data from demographic factors, cognitive assessments, and brain imaging. The proposed methodology was experimentally evaluated using a combined dataset that included the Alzheimer's Disease Neuroimaging Initiative (ADNI), the Australian Imaging, Biomarker, and Lifestyle Flagship Study of Ageing (AIBL), and the Open Access Series of Imaging Studies (OASIS). Our MIMD-3DVT, utilizing single or multiple ROIs, achieved an accuracy of 97.14%, outperforming the state-of-the-art methods in distinguishing between Normal Cognition and Alzheimer's Disease.
Chinese Translation
目前使用磁共振成像(MRI)诊断阿尔茨海默病的方法存在显著局限性。许多先前的研究使用2D变换器独立分析单个脑切片,可能会丢失关键的3D上下文信息。基于感兴趣区域(ROI)的模型往往只关注少数几个脑区,而阿尔茨海默病影响多个区域。此外,大多数分类模型依赖于单一测试,而诊断阿尔茨海默病需要一种多方面的方法,整合多种数据来源以获得更准确的评估。本研究提出了一种新方法,称为多输入和混合数据3D视觉变换器(MIMD-3DVT)。该方法将连续切片一起处理,以捕捉特征维度和空间信息,融合多种3D ROI成像数据输入,并整合来自人口因素、认知评估和脑成像的混合数据。我们的方法通过实验评估,使用了包含阿尔茨海默病神经成像倡议(ADNI)、澳大利亚成像、生物标志物和老龄化旗舰研究(AIBL)以及开放获取成像研究系列(OASIS)的组合数据集。我们的MIMD-3DVT,利用单个或多个ROI,达到了97.14%的准确率,超越了当前最先进的方法,在区分正常认知与阿尔茨海默病方面表现优异。
cs.CV / 95 / 2603.00550
Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning
弱监督视频异常检测:异常连接组件与意图推理
Abstract
Weakly supervised video anomaly detection (WS-VAD) involves identifying the temporal intervals that contain anomalous events in untrimmed videos, where only video-level annotations are provided as supervisory signals. However, a key limitation persists in WS-VAD, as dense frame-level annotations are absent, which often leaves existing methods struggling to learn anomaly semantics effectively. To address this issue, we propose a novel framework named LAS-VAD, short for Learning Anomaly Semantics for WS-VAD, which integrates anomaly-connected component mechanism and intention awareness mechanism. The former is designed to assign video frames into distinct semantic groups within a video, and frame segments within the same group are deemed to share identical semantic information. The latter leverages an intention-aware strategy to distinguish between similar normal and abnormal behaviors (e.g., taking items and stealing). To further model the semantic information of anomalies, as anomaly occurrence is accompanied by distinct characteristic attributes (i.e., explosions are characterized by flames and thick smoke), we additionally incorporate anomaly attribute information to guide accurate detection. Extensive experiments on two benchmark datasets, XD-Violence and UCF-Crime, demonstrate that our LAS-VAD outperforms current state-of-the-art methods with remarkable gains.
Chinese Translation
弱监督视频异常检测(WS-VAD)涉及在未剪辑视频中识别包含异常事件的时间间隔,其中仅提供视频级别的注释作为监督信号。然而,WS-VAD仍然存在一个关键限制,即缺乏密集的帧级注释,这常常使现有方法难以有效学习异常语义。为了解决这一问题,我们提出了一种新颖的框架,命名为LAS-VAD(Learning Anomaly Semantics for WS-VAD),该框架集成了异常连接组件机制和意图感知机制。前者旨在将视频帧分配到视频中的不同语义组内,同一组内的帧段被视为共享相同的语义信息。后者利用意图感知策略来区分相似的正常和异常行为(例如,拿取物品与盗窃)。为了进一步建模异常的语义信息,由于异常发生伴随着独特的特征属性(即,爆炸特征为火焰和浓烟),我们还结合了异常属性信息以指导准确检测。在两个基准数据集XD-Violence和UCF-Crime上的广泛实验表明,我们的LAS-VAD在显著提升的表现上超越了当前的最先进方法。
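One plausible reading of the anomaly-connected component mechanism is a connected-components grouping over frames: consecutive frames whose features are sufficiently similar fall into the same semantic group and are treated as sharing semantics. The cosine threshold and adjacency-only linking below are assumptions for this sketch.

```python
import numpy as np

def connected_components(feats, thresh=0.9):
    """Assign frames to semantic groups.

    Consecutive frames whose cosine similarity exceeds `thresh` share a
    component; frame segments within one component are treated as
    carrying identical semantic information.
    """
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = (f[:-1] * f[1:]).sum(axis=1)      # adjacent-frame similarity
    labels = np.zeros(len(feats), dtype=int)
    for i, s in enumerate(sims):
        labels[i + 1] = labels[i] + (0 if s > thresh else 1)
    return labels

# Toy features: five frames near prototype `a`, five near prototype `b`.
rng = np.random.default_rng(6)
a, b = rng.normal(size=32), rng.normal(size=32)
feats = np.stack([a + 0.01 * rng.normal(size=32) for _ in range(5)]
                 + [b + 0.01 * rng.normal(size=32) for _ in range(5)])
print(connected_components(feats))
```

With video-level labels only, such groups give coarser units than single frames over which anomaly semantics can be learned more stably.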
cs.CV / 96 / 2603.00560
Geometry OR Tracker: Universal Geometric Operating Room Tracking
几何操作室跟踪器:通用几何手术室跟踪
Abstract
In operating rooms (OR), world-scale multi-view 3D tracking supports downstream applications such as surgeon behavior recognition, where physically meaningful quantities such as distances and motion statistics must be measured in meters. However, real clinical deployments rarely satisfy the geometric prerequisites for stable multi-view fusion and tracking: camera calibration and RGB-D registration are often unreliable, leading to cross-view geometric inconsistency that produces "ghosting" during fusion and degrades 3D trajectories in a shared OR coordinate frame. To address this, we introduce Geometry OR Tracker, a two-stage pipeline that first rectifies imprecise calibration into a scale-consistent and geometrically consistent camera setup with a single global scale via a Multi-view Metric Geometry Rectification module, and then performs Occlusion-Robust 3D Point Tracking directly in the unified OR world frame. On the MM-OR benchmark, improved geometric consistency translates into tracking gains: our rectification front-end reduces cross-view depth disagreement by more than 30$\times$ compared to raw calibration. Ablation studies further demonstrate the relationship between calibration quality and tracking accuracy, showing that improved geometric consistency yields stronger world-frame tracking.
Chinese Translation
在手术室(OR)中,世界尺度的多视角三维跟踪支持下游应用,如外科医生行为识别,其中必须以米为单位测量物理上有意义的量,如距离和运动统计。然而,实际临床部署很少满足稳定的多视角融合和跟踪的几何前提条件:相机标定和RGB-D配准总是不可靠,导致跨视角几何不一致,从而在融合过程中产生“重影”,并降低共享OR坐标框架中的三维轨迹质量。为了解决这个问题,我们提出了几何操作室跟踪器(Geometry OR Tracker),这是一个两阶段的处理流程,首先通过多视角度量几何校正模块将不精确的标定纠正为具有单一全局尺度的尺度一致和几何一致的相机设置,然后直接在统一的OR世界框架中执行抗遮挡的三维点跟踪。在MM-OR基准测试中,改进的几何一致性转化为跟踪增益:我们的校正前端相比于原始标定减少了超过30倍的跨视角深度不一致。消融研究进一步证明了标定质量与跟踪精度之间的关系,显示出改进的几何一致性带来了更强的世界框架跟踪能力。
cs.CV / 97 / 2603.00565
MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs
MIDAS:用于越狱多模态大型语言模型的多图像分散与语义重构
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable performance but remain vulnerable to jailbreak attacks that can induce harmful content and undermine their secure deployment. Previous studies have shown that introducing additional inference steps, which disrupt security attention, can make MLLMs more susceptible to being misled into generating malicious content. However, these methods rely on single-image masking or isolated visual cues, which only modestly extend reasoning paths and thus achieve limited effectiveness, particularly against strongly aligned commercial closed-source models. To address this problem, in this paper, we propose Multi-Image Dispersion and Semantic Reconstruction (MIDAS), a multimodal jailbreak framework that decomposes harmful semantics into risk-bearing subunits, disperses them across multiple visual clues, and leverages cross-image reasoning to gradually reconstruct the malicious intent, thereby bypassing existing safety mechanisms. The proposed MIDAS enforces longer and more structured multi-image chained reasoning, substantially increases the model's reliance on visual cues while delaying the exposure of malicious semantics and significantly reducing the model's security attention, thereby improving the performance of jailbreak against advanced MLLMs. Extensive experiments across different datasets and MLLMs demonstrate that the proposed MIDAS outperforms state-of-the-art jailbreak attacks for MLLMs and achieves an average attack success rate of 81.46% across 4 closed-source MLLMs. Our code is available at this [link](https://github.com/Winnie-Lian/MIDAS).
Chinese Translation
多模态大型语言模型(MLLMs)已取得显著性能,但仍然容易受到越狱攻击,这可能导致有害内容的生成并削弱其安全部署。先前的研究表明,引入额外的推理步骤,干扰安全注意力,可以使MLLMs更容易被误导生成恶意内容。然而,这些方法依赖于单图像遮蔽或孤立的视觉线索,仅能适度延长推理路径,因此在针对强对齐的商业闭源模型时效果有限。为了解决这一问题,本文提出了多图像分散与语义重构(MIDAS),一种多模态越狱框架,它将有害语义分解为承载风险的子单元,分散到多个视觉线索中,并利用跨图像推理逐步重构恶意意图,从而绕过现有的安全机制。所提出的MIDAS强制实施更长且更结构化的多图像链式推理,显著增加模型对视觉线索的依赖,同时延迟恶意语义的暴露,并显著降低模型的安全注意力,从而提高对先进MLLMs的越狱性能。在不同数据集和MLLMs上的大量实验表明,所提出的MIDAS在越狱攻击方面优于最先进的MLLMs越狱攻击,并在4个闭源MLLMs上实现了81.46%的平均攻击成功率。我们的代码可在此[链接](https://github.com/Winnie-Lian/MIDAS)获取。
cs.CV / 98 / 2603.00574
Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation
解耦稳定性与可塑性以实现多模态测试时适应
Abstract
Adapting pretrained multi-modal models to evolving test-time distributions, known as multi-modal test-time adaptation, presents a significant challenge. Existing methods frequently encounter negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. To address these challenges, we propose Decoupling Adaptation for Stability and Plasticity (DASP), a novel diagnose-then-mitigate framework. Our analysis reveals a critical discrepancy within the unified latent space: the biased modality exhibits substantially higher interdimensional redundancy (i.e., strong correlations across feature dimensions) compared to the unbiased modality. Leveraging this insight, DASP identifies the biased modality and implements an asymmetric adaptation strategy. This strategy employs a decoupled architecture where each modality-specific adapter is divided into stable and plastic components. The asymmetric mechanism works as follows: for the biased modality, which requires plasticity, the plastic component is activated and updated to capture domain-specific information, while the stable component remains fixed. Conversely, for the unbiased modality, which requires stability, the plastic component is bypassed, and the stable component is updated using KL regularization to prevent negative transfer. This asymmetric design enables the model to adapt flexibly to new domains while preserving generalizable knowledge. Comprehensive evaluations on diverse multi-modal benchmarks demonstrate that DASP significantly outperforms state-of-the-art methods.
Chinese Translation
将预训练的多模态模型适应于不断变化的测试时分布,即多模态测试时适应,面临着重大挑战。现有方法常常在无偏模态中遭遇负迁移,而在有偏模态中则出现灾难性遗忘。为了解决这些挑战,我们提出了稳定性与可塑性解耦适应(Decoupling Adaptation for Stability and Plasticity, DASP),这是一种新颖的诊断-缓解框架。我们的分析揭示了统一潜在空间内的一个关键差异:与无偏模态相比,有偏模态表现出显著更高的维间冗余(即特征维度之间的强相关性)。利用这一见解,DASP识别出有偏模态并实施不对称适应策略。该策略采用解耦架构,其中每个模态特定的适配器被划分为稳定和可塑性组件。不对称机制的工作方式如下:对于需要可塑性的有偏模态,激活并更新可塑性组件以捕捉领域特定信息,而稳定组件保持不变。相反,对于需要稳定性的无偏模态,跳过可塑性组件,并使用KL正则化更新稳定组件以防止负迁移。这种不对称设计使模型能够灵活适应新领域,同时保持可泛化的知识。在多种多模态基准测试上的全面评估表明,DASP显著优于最先进的方法。
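The diagnose-then-mitigate loop described in the abstract lends itself to a compact sketch: measure interdimensional redundancy to identify the biased modality, then route only that modality through a plastic branch while the stable branch stays frozen. This is a toy illustration with assumed shapes, hand-made features, and a zero-initialized plastic branch, not the authors' DASP code; the KL regularization of the stable branch is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

class DecoupledAdapter:
    """Modality-specific adapter split into a stable and a plastic branch."""
    def __init__(self, dim):
        self.w_stable = rng.normal(scale=0.01, size=(dim, dim))
        self.w_plastic = np.zeros((dim, dim))  # plastic branch, zero-initialized

    def forward(self, x, plastic_on):
        out = x + x @ self.w_stable
        if plastic_on:                 # biased modality: plasticity activated
            out = out + x @ self.w_plastic
        return out                     # unbiased modality: plastic branch bypassed

def interdim_redundancy(feats):
    """Mean absolute off-diagonal correlation across feature dimensions."""
    c = np.corrcoef(feats, rowvar=False)
    return float(np.mean(np.abs(c[~np.eye(c.shape[0], dtype=bool)])))

# Toy features: the "biased" modality has strongly correlated dimensions.
base = rng.normal(size=(200, 1))
biased_feats = base @ np.ones((1, DIM)) + 0.1 * rng.normal(size=(200, DIM))
unbiased_feats = rng.normal(size=(200, DIM))

scores = {"audio": interdim_redundancy(biased_feats),
          "video": interdim_redundancy(unbiased_feats)}
biased_name = max(scores, key=scores.get)      # diagnose...
adapter = DecoupledAdapter(DIM)
x = rng.normal(size=(1, DIM))
out = adapter.forward(x, plastic_on=True)      # ...then mitigate asymmetrically
print(biased_name, out.shape)
```

The redundancy statistic cleanly separates the two toy modalities, which is the property DASP's diagnosis step relies on.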
cs.CV / 99 / 2603.00586
WildActor: Unconstrained Identity-Preserving Video Generation
WildActor:无约束身份保持的视频生成
Abstract
Production-ready human video generation requires digital actors to maintain strictly consistent full-body identities across dynamic shots, viewpoints and motions, a setting that remains challenging for existing methods. Prior methods often suffer from face-centric behavior that neglects body-level consistency, or produce copy-paste artifacts where subjects appear rigid due to pose locking. We present Actor-18M, a large-scale human video dataset designed to capture identity consistency under unconstrained viewpoints and environments. Actor-18M comprises 1.6M videos with 18M corresponding human images, covering both arbitrary views and canonical three-view representations. Leveraging Actor-18M, we propose WildActor, a framework for any-view conditioned human video generation. We introduce an Asymmetric Identity-Preserving Attention mechanism coupled with a Viewpoint-Adaptive Monte Carlo Sampling strategy that iteratively re-weights reference conditions by marginal utility for balanced manifold coverage. Evaluated on the proposed Actor-Bench, WildActor consistently preserves body identity under diverse shot compositions, large viewpoint transitions, and substantial motions, surpassing existing methods in these challenging settings.
Chinese Translation
生产就绪的人类视频生成要求数字化演员在动态镜头、视角和动作中保持严格一致的全身身份,这一设置对现有方法仍然具有挑战性。以往的方法往往受到以面部为中心的行为的影响,忽视了身体层面的连贯性,或产生复制粘贴的伪影,使得主体因姿势锁定而显得僵硬。我们提出了Actor-18M,这是一个大规模的人类视频数据集,旨在捕捉无约束视角和环境下的身份一致性。Actor-18M包含160万个视频和1800万张对应的人类图像,涵盖任意视角和标准三视图表示。利用Actor-18M,我们提出了WildActor,一个用于任意视角条件下的人类视频生成的框架。我们引入了一种不对称身份保持注意力机制,并结合了一种视角自适应的蒙特卡洛采样策略,通过边际效用迭代重新加权参考条件,以实现平衡的流形覆盖。在提出的Actor-Bench上进行评估时,WildActor在多样的镜头组合、大视角转换和显著动作下始终保持身体身份,超越了现有方法在这些具有挑战性的设置中的表现。
cs.CV / 100 / 2603.00589
AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution
AlignVAR:面向图像超分辨率的全球一致性视觉自回归
Abstract
Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored and faces two critical challenges: locality-biased attention, which fragments spatial structures, and residual-only supervision, which accumulates errors across scales; together, these severely compromise the global consistency of reconstructed images. To address these issues, we propose AlignVAR, a globally consistent visual autoregressive framework tailored for ISR, featuring two key components: (1) Spatial Consistency Autoregression (SCA), which applies an adaptive mask to reweight attention toward structurally correlated regions, thereby mitigating excessive locality and enhancing long-range dependencies; and (2) Hierarchical Consistency Constraint (HCC), which augments residual learning with full reconstruction supervision at each scale, exposing accumulated deviations early and stabilizing the coarse-to-fine refinement process. Extensive experiments demonstrate that AlignVAR consistently enhances structural coherence and perceptual fidelity over existing generative methods, while delivering over 10x faster inference with nearly 50% fewer parameters than leading diffusion-based approaches, establishing a new paradigm for efficient ISR.
Chinese Translation
视觉自回归(VAR)模型最近作为图像生成的一种有前景的替代方案出现,提供了稳定的训练、非迭代推理和通过下一个尺度预测实现的高保真合成。这鼓励了对VAR在图像超分辨率(ISR)中的探索,但其应用仍然未得到充分研究,并面临两个关键挑战:局部偏向注意力,这会破坏空间结构,以及仅依赖残差的监督,这在各尺度间累积错误,严重影响重建图像的全球一致性。为了解决这些问题,我们提出了AlignVAR,一个针对ISR量身定制的全球一致性视觉自回归框架,具有两个关键组件:(1)空间一致性自回归(SCA),它应用自适应掩码对结构相关区域的注意力进行重新加权,从而减轻过度局部性并增强长距离依赖;(2)层次一致性约束(HCC),它在每个尺度上通过全面重建监督增强残差学习,早期暴露累积偏差并稳定粗到细的细化过程。大量实验表明,AlignVAR在结构一致性和感知保真度上始终优于现有生成方法,同时以比领先的基于扩散的方法快10倍以上的速度推理,并且参数数量减少近50%,为高效的ISR建立了新的范式。
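SCA's idea of reweighting attention toward structurally correlated regions can be illustrated as adding a similarity-derived mask to the attention logits before the softmax. The cosine-similarity mask and the scalar `alpha` here are assumptions standing in for the paper's learned adaptive mask, not its actual formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sca_attention(q, k, v, feats, alpha=2.0):
    """Attention whose logits are boosted where token features are
    structurally correlated, countering purely local attention."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    mask = f @ f.T                        # token-token structural similarity
    return softmax(logits + alpha * mask) @ v

rng = np.random.default_rng(1)
n, d = 6, 4
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
feats = rng.normal(size=(n, d))
out = sca_attention(q, k, v, feats)
out_plain = sca_attention(q, k, v, feats, alpha=0.0)  # mask disabled
print(out.shape)
```

Setting `alpha=0` recovers plain scaled dot-product attention, so the mask term is the only change to the attention pathway.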
cs.CV / 101 / 2603.00595
UNICBench: UNIfied Counting Benchmark for MLLM
UNICBench:多模态大语言模型的统一计数基准
Abstract
Counting is a core capability for multimodal large language models (MLLMs), yet there is no unified counting dataset to rigorously evaluate this ability across image, text, and audio. We present UNICBench, a unified multimodal, multi-level counting benchmark and evaluation toolkit with accurate ground truth, deterministic numeric parsing, and stratified reporting. The corpus comprises 5,300 images (5,508 QA), 872 documents (5,888 QA), and 2,069 audio clips (2,905 QA), annotated with a three-level capability taxonomy and difficulty tags. Under a standardized protocol with fixed splits/prompts/seeds and modality-specific matching rules, we evaluate 45 state-of-the-art MLLMs across modalities. Results show strong performance on some basic counting tasks but significant gaps on reasoning and the hardest partitions, highlighting long-tail errors and substantial headroom for improving general counting. UNICBench offers a rigorous and comparable basis for measurement and a public toolkit to accelerate progress.
Chinese Translation
计数是多模态大语言模型(MLLMs)的核心能力,但目前尚无统一的计数数据集来严格评估这一能力在图像、文本和音频中的表现。我们提出了UNICBench,一个统一的多模态、多层次计数基准和评估工具包,具有准确的真实值、确定性的数字解析和分层报告。该语料库包含5,300张图像(5,508个问答)、872份文档(5,888个问答)和2,069个音频片段(2,905个问答),并附有三层能力分类和难度标签。在固定的分割/提示/种子和特定模态匹配规则的标准化协议下,我们评估了45个最先进的MLLM在不同模态下的表现。结果显示,在一些基本计数任务上表现良好,但在推理和最难的分区上存在显著差距,突显了长尾错误和在改进一般计数方面的巨大提升空间。UNICBench为测量提供了严格且可比较的基础,并提供了一个公共工具包以加速进展。
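The "deterministic numeric parsing" a counting benchmark needs can be as simple as extracting the final integer from a free-form model answer, so scoring is exact-match on numbers rather than strings. The regex and the last-number-wins rule below are illustrative assumptions about such a parser, not UNICBench's published implementation.

```python
import re

def parse_count(answer):
    """Pull the last integer out of a free-form answer; None if absent.
    Commas are stripped so '1,024' parses as one number."""
    nums = re.findall(r"-?\d+", answer.replace(",", ""))
    return int(nums[-1]) if nums else None

assert parse_count("There are 12 apples.") == 12
assert parse_count("I count 3, maybe 4 objects") == 4
assert parse_count("1,024 tokens") == 1024
assert parse_count("no idea") is None
print("parser ok")
```

Fixing a single deterministic rule like this is what makes results comparable across the 45 evaluated models.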
cs.CV / 102 / 2603.00604
Data-Centric Benchmark for Label Noise Estimation and Ranking in Remote Sensing Image Segmentation
面向数据的基准测试用于遥感图像分割中的标签噪声估计与排名
Abstract
High-quality pixel-level annotations are essential for the semantic segmentation of remote sensing imagery. However, such labels are expensive to obtain and often affected by noise due to the labor-intensive and time-consuming nature of pixel-wise annotation, which makes it challenging for human annotators to label every pixel accurately. Annotation errors can significantly degrade the performance and robustness of modern segmentation models, motivating the need for reliable mechanisms to identify and quantify noisy training samples. This paper introduces a novel Data-Centric benchmark, together with a novel, publicly available dataset and two techniques for identifying, quantifying, and ranking training samples according to their level of label noise in remote sensing semantic segmentation. The proposed methods leverage complementary strategies based on model uncertainty, prediction consistency, and representation analysis, and consistently outperform established baselines across a range of experimental settings. The outcomes of this work are publicly available at https://github.com/keillernogueira/label_noise_segmentation.
Chinese Translation
高质量的像素级标注对于遥感图像的语义分割至关重要。然而,由于像素级标注的劳动密集和耗时特性,这些标签的获取成本高昂,并且常常受到噪声的影响,这使得人类标注者难以准确标注每个像素。标注错误会显著降低现代分割模型的性能和鲁棒性,因此需要可靠的机制来识别和量化噪声训练样本。本文提出了一种新颖的面向数据的基准测试,并提供了一个新的公开数据集,以及两种技术,用于识别、量化和根据标签噪声水平对训练样本进行排名,适用于遥感语义分割。这些提出的方法利用基于模型不确定性、预测一致性和表示分析的互补策略,并在一系列实验设置中持续超越既定基线。本研究的成果已公开发布,网址为 https://github.com/keillernogueira/label_noise_segmentation。
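In the spirit of the model-uncertainty and prediction-consistency strategies mentioned above, one can rank samples by a score that grows when the model disagrees with the annotated class and when its predictive entropy is high. The exact score and function names below are assumptions for illustration, not the paper's techniques.

```python
import numpy as np

def noise_score(probs, labels):
    """Per-sample score: low confidence in the annotated class plus high
    normalized predictive entropy -> more likely to carry label noise."""
    conf = probs[np.arange(len(labels)), labels]
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return (1.0 - conf) + entropy / np.log(probs.shape[1])

probs = np.array([
    [0.95, 0.03, 0.02],   # confident and matches label 0 -> looks clean
    [0.40, 0.35, 0.25],   # highly uncertain -> suspicious
    [0.05, 0.90, 0.05],   # confident in class 1 but label says 0 -> mislabeled?
])
labels = np.array([0, 0, 0])
ranking = np.argsort(-noise_score(probs, labels))  # most suspicious first
print(ranking.tolist())
```

On this toy input the uncertain sample and the confidently contradicted sample rank above the clean one, which is the ordering a noise-ranking benchmark would reward.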
cs.CV / 103 / 2603.00607
IdGlow: Dynamic Identity Modulation for Multi-Subject Generation
IdGlow:多主体生成的动态身份调制
Abstract
Multi-subject image generation requires seamlessly harmonizing multiple reference identities within a coherent scene. However, existing methods relying on rigid spatial masks or localized attention often struggle with the "stability-plasticity dilemma," particularly failing in tasks that require complex structural deformations, such as identity-preserving age transformation. To address this, we present IdGlow, a mask-free, progressive two-stage framework built upon Flow Matching diffusion models. In the supervised fine-tuning (SFT) stage, we introduce task-adaptive timestep scheduling aligned with diffusion generative dynamics: a linear decay schedule that progressively relaxes constraints for natural group composition, and a temporal gating mechanism that concentrates identity injection within a critical semantic window, successfully preserving adult facial semantics without overriding child-like anatomical structures. To resolve attribute leakage and semantic ambiguity without explicit layout inputs, we further integrate a badcase-driven Vision-Language Model (VLM) for precise, context-aware prompt synthesis. In the second stage, we design a Fine-Grained Group-Level Direct Preference Optimization (DPO) with a weighted margin formulation to simultaneously eliminate multi-subject artifacts, elevate texture harmony, and recalibrate identity fidelity towards real-world distributions. Extensive experiments on two challenging benchmarks -- direct multi-person fusion and age-transformed group generation -- demonstrate that IdGlow fundamentally mitigates the stability-plasticity conflict, achieving a superior Pareto balance between state-of-the-art facial fidelity and commercial-grade aesthetic quality.
Chinese Translation
多主体图像生成需要在一个连贯的场景中无缝地协调多个参考身份。然而,现有方法依赖于刚性的空间掩膜或局部注意力,常常在处理“稳定性-可塑性困境”时遇到困难,特别是在需要复杂结构变形的任务中,例如保持身份的年龄转换。为了解决这个问题,我们提出了IdGlow,这是一种基于流匹配扩散模型的无掩膜渐进式两阶段框架。在监督微调(SFT)阶段,我们引入了与扩散生成动态相一致的任务自适应时间步调度:一种线性衰减调度,逐步放宽自然群体组成的约束,以及一种时间门控机制,将身份注入集中在关键语义窗口内,成功地保持了成年面部语义而不覆盖儿童的解剖结构。为了在没有明确布局输入的情况下解决属性泄漏和语义模糊问题,我们进一步整合了一个基于坏例驱动的视觉-语言模型(VLM),用于精确的上下文感知提示合成。在第二阶段,我们设计了一种细粒度群体级直接偏好优化(DPO),采用加权边际公式,旨在同时消除多主体伪影,提高纹理和谐性,并将身份保真度重新校准至真实世界分布。在两个具有挑战性的基准测试——直接多人物融合和年龄转换群体生成上的广泛实验表明,IdGlow根本上缓解了稳定性-可塑性冲突,在最先进的面部保真度和商业级美学质量之间实现了优越的帕累托平衡。
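The two SFT-stage schedules described above, a linearly decaying constraint weight and a gate confining identity injection to a critical semantic window, can be written in a few lines. The exact decay shape and the window bounds (0.3 to 0.7 of the timestep range) are hypothetical stand-ins, since the abstract does not specify them.

```python
def linear_decay(t, t_max):
    """Constraint weight that relaxes as generation proceeds (t: t_max -> 0)."""
    return t / t_max

def identity_gate(t, t_max, lo=0.3, hi=0.7):
    """1 inside the critical semantic window, 0 outside (assumed bounds)."""
    frac = t / t_max
    return 1.0 if lo <= frac <= hi else 0.0

t_max = 1000
schedule = [(t, round(linear_decay(t, t_max), 1), identity_gate(t, t_max))
            for t in (1000, 700, 500, 300, 100)]
print(schedule)
```

Concentrating identity injection mid-trajectory is what lets coarse structure (e.g., child-like anatomy) form before adult facial semantics are imposed.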
cs.CV / 104 / 2603.00609
Linking Modality Isolation in Heterogeneous Collaborative Perception
连接异构协作感知中的模态隔离
Abstract
Collaborative perception leverages data exchange among multiple agents to enhance overall perception capabilities. However, heterogeneity across agents introduces domain gaps that hinder collaboration, and this is further exacerbated by an underexplored issue: modality isolation. It arises when multiple agents with different modalities never co-occur in any training data frame, enlarging cross-modal domain gaps. Existing alignment methods rely on supervision from spatially overlapping observations, thus fail to handle modality isolation. To address this challenge, we propose CodeAlign, the first efficient, co-occurrence-free alignment framework that smoothly aligns modalities via cross-modal feature-code-feature (FCF) translation. The key idea is to explicitly identify the representation consistency through codebook, and directly learn mappings between modality-specific feature spaces, thereby eliminating the need for spatial correspondence. Codebooks regularize feature spaces into code spaces, providing compact yet expressive representations. With a prepared code space for each modality, CodeAlign learns FCF translations that map features to the corresponding codes of other modalities, which are then decoded back into features in the target code space, enabling effective alignment. Experiments show that, when integrating three modalities, CodeAlign requires only 8% of the training parameters of prior alignment methods, reduces communication load by 1024x, and achieves state-of-the-art perception performance on both the OPV2V and DAIR-V2X datasets. Code will be released on https://github.com/cxliu0314/CodeAlign.
Chinese Translation
协作感知利用多个智能体之间的数据交换来增强整体感知能力。然而,智能体之间的异质性引入了领域差距,阻碍了协作,而这一问题因模态隔离的存在而进一步加剧。模态隔离发生在不同模态的多个智能体在任何训练数据帧中从未同时出现,从而扩大了跨模态的领域差距。现有的对齐方法依赖于空间重叠观测的监督,因此无法处理模态隔离。为了解决这一挑战,我们提出了CodeAlign,这是第一个高效的、无共现的对齐框架,通过跨模态特征-编码-特征(FCF)转换平滑地对齐模态。其关键思想是通过代码本明确识别表示一致性,并直接学习模态特定特征空间之间的映射,从而消除对空间对应关系的需求。代码本将特征空间规范化为代码空间,提供紧凑而富有表现力的表示。通过为每个模态准备代码空间,CodeAlign学习将特征映射到其他模态对应代码的FCF转换,然后再解码回目标代码空间中的特征,从而实现有效对齐。实验表明,当整合三种模态时,CodeAlign仅需先前对齐方法8%的训练参数,通信负载减少1024倍,并在OPV2V和DAIR-V2X数据集上实现了最先进的感知性能。代码将发布在 https://github.com/cxliu0314/CodeAlign。
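The feature-code-feature (FCF) pipeline can be sketched as three steps: quantize a feature to its nearest code in the source modality's codebook, translate that code across modalities, and decode it as a feature in the target code space. The toy codebooks and the fixed one-to-one `code_map` below stand in for CodeAlign's learned codebooks and translation; all values are illustrative.

```python
import numpy as np

def nearest_code(x, codebook):
    """Quantize a feature to the index of its nearest codebook entry."""
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))

# Toy code spaces for two modalities plus a stand-in cross-modal code mapping.
lidar_codes = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [1., 1., 0.]])
camera_codes = np.array([[2., 0., 0.], [0., 2., 0.], [0., 0., 2.], [2., 2., 0.]])
code_map = {0: 2, 1: 0, 2: 3, 3: 1}

def fcf_translate(feat):
    src = nearest_code(feat, lidar_codes)   # feature -> code
    tgt = code_map[src]                     # code -> code (cross-modal)
    return camera_codes[tgt], src, tgt      # code -> feature in target space

feat = lidar_codes[2] + np.array([0.05, -0.05, 0.02])   # noisy LiDAR feature
decoded, src, tgt = fcf_translate(feat)
print(src, tgt, decoded.tolist())
```

Because only code indices need to cross the wire, this style of alignment is what enables the large communication reduction the abstract reports.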
cs.CV / 105 / 2603.00611
Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction: Dataset, Model and Benchmark
探索视频级压缩光谱重建的时空特征传播:数据集、模型与基准
Abstract
Recently, Spectral Compressive Imaging (SCI) has achieved remarkable success, unlocking significant potential for dynamic spectral vision. However, existing reconstruction methods, primarily image-based, suffer from two limitations: (i) The encoding process masks spatial-spectral features, leading to uncertainty in reconstructing missing information from single compressed measurements, and (ii) The frame-by-frame reconstruction paradigm fails to ensure temporal consistency, which is crucial for video perception. To address these challenges, this paper seeks to advance spectral reconstruction from the image level to the video level, leveraging the complementary features and temporal continuity across adjacent frames in dynamic scenes. Initially, we construct the first high-quality dynamic hyperspectral image dataset (DynaSpec), comprising 30 sequences obtained through frame-scanning acquisition. Subsequently, we propose the Propagation-Guided Spectral Video Reconstruction Transformer (PG-SVRT), which employs a spatial-then-temporal attention to effectively reconstruct spectral features from abundant video information, while using a bridged token to reduce computational complexity. Finally, we conduct simulation experiments to assess the performance of four SCI systems, and construct a DD-CASSI prototype for real-world data collection and benchmarking. Extensive experiments demonstrate that PG-SVRT achieves superior performance in reconstruction quality, spectral fidelity, and temporal consistency, while maintaining minimal FLOPs. Project page: https://github.com/nju-cite/DynaSpec
Chinese Translation
近年来,光谱压缩成像(Spectral Compressive Imaging, SCI)取得了显著成功,为动态光谱视觉开辟了重要潜力。然而,现有的重建方法主要基于图像,存在两个局限性:(i)编码过程掩盖了空间-光谱特征,导致从单个压缩测量中重建缺失信息的不确定性;(ii)逐帧重建范式未能确保时间一致性,而这在视频感知中至关重要。为了解决这些挑战,本文旨在将光谱重建从图像级提升到视频级,利用动态场景中相邻帧之间的互补特征和时间连续性。首先,我们构建了第一个高质量动态高光谱图像数据集(DynaSpec),该数据集包含通过逐帧扫描获取的30个序列。随后,我们提出了传播引导的光谱视频重建变换器(Propagation-Guided Spectral Video Reconstruction Transformer, PG-SVRT),该模型采用空间-然后-时间的注意力机制,有效地从丰富的视频信息中重建光谱特征,同时使用桥接标记以降低计算复杂度。最后,我们进行模拟实验以评估四个SCI系统的性能,并构建了一个DD-CASSI原型用于真实数据收集和基准测试。大量实验表明,PG-SVRT在重建质量、光谱保真度和时间一致性方面表现优越,同时保持最低的FLOPs。项目页面:https://github.com/nju-cite/DynaSpec
cs.CV / 106 / 2603.00643
Position: Evaluation of Visual Processing Should Be Human-Centered, Not Metric-Centered
立场:视觉处理的评估应以人为中心,而非以指标为中心
Abstract
This position paper argues that the evaluation of modern visual processing systems should no longer be driven primarily by single-metric image quality assessment benchmarks, particularly in the era of generative and perception-oriented methods. Image restoration exemplifies this divergence: while objective IQA metrics enable reproducible, scalable evaluation, they have increasingly drifted apart from human perception and user preferences. We contend that this mismatch risks constraining innovation and misguiding research progress across visual processing tasks. Rather than rejecting metrics altogether, this paper calls for a rebalancing of evaluation paradigms, advocating a more human-centered, context-aware, and fine-grained approach to assessing the visual models' outcomes.
Chinese Translation
本立场论文认为,现代视觉处理系统的评估不应再主要由单一指标的图像质量评估基准驱动,特别是在生成性和感知导向方法的时代。图像恢复便是这一分歧的典型例子:尽管客观的图像质量评估(IQA)指标能够实现可重复、可扩展的评估,但它们与人类感知和用户偏好之间的差距却日益加大。我们认为,这种不匹配可能限制创新并误导视觉处理任务的研究进展。本文并不提倡完全拒绝指标,而是呼吁对评估范式进行重新平衡,倡导一种更加以人为中心、关注上下文和细致入微的方法来评估视觉模型的结果。
cs.CV / 107 / 2603.00651
Exploring 3D Dataset Pruning
探索3D数据集剪枝
Abstract
Dataset pruning has been widely studied for 2D images to remove redundancy and accelerate training, while particular pruning methods for 3D data remain largely unexplored. In this work, we study dataset pruning for 3D data, whose commonly observed long-tail class distribution makes optimization under the conventional evaluation metrics, Overall Accuracy (OA) and Mean Accuracy (mAcc), inherently conflicting, and makes pruning particularly challenging. To address this, we formulate pruning as approximating the full-data expected risk with a weighted subset, which reveals two key errors: coverage error from insufficient representativeness and prior-mismatch bias from inconsistency between subset-induced class weights and target metrics. We propose representation-aware subset selection with per-class retention quotas for long-tail coverage, and prior-invariant teacher supervision using calibrated soft labels and embedding-geometry distillation. The retention quota also serves as a switch to control the OA-mAcc trade-off. Extensive experiments on 3D datasets show that our method can improve both metrics across multiple settings while adapting to different downstream preferences. Our code is available at https://github.com/XiaohanZhao123/3D-Dataset-Pruning.
Chinese Translation
数据集剪枝在2D图像中得到了广泛研究,以去除冗余并加速训练,而针对3D数据的特定剪枝方法仍然大多未被探索。在本研究中,我们研究了3D数据集剪枝,其中观察到的常见长尾类分布特性使得在传统评估指标整体准确率(Overall Accuracy, OA)和平均准确率(Mean Accuracy, mAcc)下的优化本质上存在冲突,并进一步使得剪枝变得特别具有挑战性。为了解决这个问题,我们将剪枝公式化为用加权子集近似全数据的期望风险,这揭示了两个关键错误:由于代表性不足导致的覆盖错误,以及由于子集引起的类权重与目标指标之间的不一致性所导致的先验不匹配偏差。我们提出了基于代表性的子集选择,针对长尾覆盖设定每类保留配额,并采用使用校准软标签和嵌入几何蒸馏的先验不变教师监督。保留配额还作为控制OA-mAcc权衡的开关。在多个设置下,我们在3D数据集上的大量实验表明,我们的方法可以在适应不同下游偏好的同时提高这两个指标。我们的代码可在 https://github.com/XiaohanZhao123/3D-Dataset-Pruning 获取。
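A per-class retention quota for long-tail coverage can be sketched as: keep a fixed fraction of each class, but never fewer than a floor of samples, so rare classes are not pruned away. The specific quota rule (`keep_frac`, `min_keep`) is an illustrative stand-in for the paper's representation-aware selection criterion.

```python
import numpy as np

def select_with_quota(labels, keep_frac=0.3, min_keep=2, seed=0):
    """Select a subset of sample indices with a per-class retention quota."""
    rng = np.random.default_rng(seed)
    kept = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        quota = max(min_keep, int(round(keep_frac * len(idx))))
        quota = min(quota, len(idx))          # cannot keep more than exist
        kept.extend(rng.choice(idx, size=quota, replace=False))
    return np.sort(np.array(kept))

# Long-tailed toy labels: class 0 is dominant, class 2 is rare.
labels = np.array([0] * 20 + [1] * 6 + [2] * 2)
subset = select_with_quota(labels)
counts = {c: int(np.sum(labels[subset] == c)) for c in (0, 1, 2)}
print(counts)
```

Raising `min_keep` shifts the subset toward rare classes (favoring mAcc) while lowering it favors the head classes (favoring OA), which is the switch-like trade-off the abstract describes.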
cs.CV / 108 / 2603.00654
RC-GeoCP: Geometric Consensus for Radar-Camera Collaborative Perception
RC-GeoCP:雷达-相机协同感知的几何共识
Abstract
Collaborative perception (CP) enhances scene understanding through multi-agent information sharing. While LiDAR-centric systems offer precise geometry, high costs and performance degradation in adverse weather necessitate multi-modal alternatives. Despite dense visual semantics and robust spatial measurements, the synergy between cameras and 4D radar remains underexplored in collaborative settings. This work introduces RC-GeoCP, the first framework to explore the fusion of 4D radar and images in CP. To resolve misalignment caused by depth ambiguity and spatial dispersion across agents, RC-GeoCP establishes a radar-anchored geometric consensus. Specifically, Geometric Structure Rectification (GSR) aligns visual semantics with geometry derived from radar to generate spatially grounded, geometry-consistent representations. Uncertainty-Aware Communication (UAC) formulates selective transmission as a conditional entropy reduction process to prioritize informative features based on inter-agent disagreement. Finally, the Consensus-Driven Assembler (CDA) aggregates multi-agent information via shared geometric anchors to form a globally coherent representation. We establish the first unified radar-camera CP benchmark on V2X-Radar and V2X-R, demonstrating state-of-the-art performance with significantly reduced communication overhead. Code will be released soon.
Chinese Translation
协同感知(CP)通过多智能体信息共享增强场景理解。尽管以激光雷达为中心的系统提供了精确的几何信息,但高成本和恶劣天气下的性能下降促使我们寻求多模态替代方案。尽管视觉语义密集且空间测量稳健,但在协同环境中,摄像头与4D雷达之间的协同作用仍未得到充分探索。本研究提出了RC-GeoCP,这是第一个探索在CP中融合4D雷达和图像的框架。为了解决因深度模糊和智能体间空间分散造成的错位,RC-GeoCP建立了一个以雷达为基础的几何共识。具体而言,几何结构校正(GSR)将视觉语义与雷达导出的几何信息对齐,以生成空间上扎根且几何一致的表示。基于不确定性的通信(UAC)将选择性传输形式化为条件熵减少过程,以根据智能体间的不一致性优先传递信息特征。最后,基于共识的汇聚器(CDA)通过共享的几何锚点聚合多智能体信息,以形成全球一致的表示。我们在V2X-Radar和V2X-R上建立了第一个统一的雷达-相机CP基准,展示了显著降低通信开销的最先进性能。代码将很快发布。
cs.CV / 109 / 2603.00655
Stateful Cross-layer Vision Modulation
状态跨层视觉调制
Abstract
Recent multimodal large language models (MLLMs) widely adopt multi-layer visual feature fusion to enhance visual representation. However, existing approaches typically perform static concatenation or weighted aggregation after visual encoding, without intervening in the representation formation process itself. As a result, fine-grained details from early layers may be progressively suppressed during hierarchical abstraction. Moreover, directly introducing shallow-layer features into the language model often leads to semantic distribution mismatch with the visual feature space that the LLM's cross-attention layers were pretrained on, which typically requires additional adaptation or fine-tuning of the LLM. To address these limitations, we revisit visual representation learning from the perspective of representation evolution control and propose a cross-layer memory-modulated vision framework (SCVM). Specifically, we introduce a recursively updated cross-layer memory state inside the vision encoder to model long-range inter-layer dependencies. We further design a layer-wise feedback modulation mechanism that refreshes token representations at each layer based on the accumulated memory, thereby structurally regulating the representation evolution trajectory. In addition, we incorporate an auxiliary semantic alignment objective that explicitly supervises the final memory state, encouraging progressive compression and reinforcement of task-relevant information. Experimental results on multiple visual question answering and hallucination evaluation benchmarks demonstrate that SCVM achieves consistent performance improvements without expanding visual tokens, introducing additional vision encoders, or modifying or fine-tuning the language model.
Chinese Translation
最近的多模态大型语言模型(MLLMs)广泛采用多层视觉特征融合来增强视觉表征。然而,现有方法通常在视觉编码后进行静态拼接或加权聚合,而不干预表征形成过程本身。因此,早期层的细粒度细节可能在层次抽象过程中逐渐被抑制。此外,直接将浅层特征引入语言模型通常会导致与LLM的跨注意力层预训练的视觉特征空间之间的语义分布不匹配,这通常需要额外的适应或微调LLM。为了解决这些局限性,我们从表征演变控制的角度重新审视视觉表征学习,并提出了一种跨层记忆调制视觉框架(SCVM)。具体而言,我们在视觉编码器内引入了一个递归更新的跨层记忆状态,以建模长距离的层间依赖关系。我们进一步设计了一种层级反馈调制机制,根据累积的记忆在每一层刷新标记表征,从而结构性地调节表征演变轨迹。此外,我们还结合了一个辅助语义对齐目标,明确监督最终的记忆状态,鼓励任务相关信息的逐步压缩和强化。在多个视觉问答和幻觉评估基准上的实验结果表明,SCVM在不扩展视觉标记、不引入额外视觉编码器或修改或微调语言模型的情况下,实现了一致的性能提升。
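The recursively updated cross-layer memory with layer-wise feedback can be sketched as a loop: after each encoder layer, the memory is updated from the layer's tokens, then fed back to modulate the next layer's representations. The exponential-moving-average update and tanh modulation below are assumed stand-ins for SCVM's actual mechanism.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_layers, n_tokens = 4, 3, 5

def encoder_layer(tokens, w):
    """Toy stand-in for one vision-encoder layer."""
    return np.tanh(tokens @ w)

weights = [rng.normal(scale=0.5, size=(d, d)) for _ in range(n_layers)]
w_mem = rng.normal(scale=0.5, size=(d, d))

tokens = rng.normal(size=(n_tokens, d))
memory = np.zeros(d)
for w in weights:
    tokens = encoder_layer(tokens, w)
    memory = 0.9 * memory + 0.1 * tokens.mean(axis=0)   # recursive memory update
    tokens = tokens + np.tanh(memory @ w_mem)           # feedback modulation
print(tokens.shape, memory.shape)
```

Because the memory accumulates across layers, early-layer detail can influence later representations, which is the failure mode of static post-hoc fusion that the abstract targets.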
cs.CV / 110 / 2603.00667
Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning
像病理学家一样:基于组织的全幻灯片图像推理
Abstract
Computational pathology has advanced rapidly in recent years, driven by domain-specific image encoders and growing interest in using vision-language models to answer natural-language questions about diseases. Yet, the core problem behind pathology question-answering remains unsolved, considering that a gigapixel slide contains far more information than necessary for a given question. Pathologists naturally navigate tissue and morphology complexity by scanning broadly, and zooming in selectively according to the clinical questions. Current models, in contrast, rely on uniform patch sampling or broad attention maps, often attending equally to irrelevant regions while overlooking key visual evidence. In this work, we try to bring models closer to how humans actually examine slides. We propose a question-guided, tissue-aware, and coarse-to-fine retrieval framework, HistoSelect, that consists of two key components: a group sampler that identifies question-relevant tissue regions, followed by a patch selector that retrieves the most informative patches within those regions. By selecting only the most informative patches, our method becomes significantly more efficient: reducing visual token usage by 70% on average, while improving accuracy across three pathology QA tasks. Evaluated on 356,000 question-answer pairs, our approach outperforms existing methods and produces answers grounded in interpretable, pathologist-consistent regions. Our results suggest that bringing human-like search and attention patterns into WSI reasoning is a promising direction for building practical and reliable pathology VLMs.
Chinese Translation
计算病理学近年来快速发展,这得益于特定领域的图像编码器以及对使用视觉-语言模型回答关于疾病的自然语言问题的日益关注。然而,病理问答背后的核心问题仍未解决,因为一个千兆像素的幻灯片包含的信息远超过给定问题所需的信息。病理学家通过广泛扫描并根据临床问题选择性放大,天然地导航组织和形态的复杂性。相比之下,当前模型依赖于均匀的补丁采样或广泛的注意力图,往往对无关区域给予同等关注,而忽视了关键的视觉证据。在本研究中,我们尝试使模型更接近人类实际检查幻灯片的方式。我们提出了一种基于问题引导、组织感知的粗到细检索框架HistoSelect,该框架由两个关键组件组成:一个群体采样器用于识别与问题相关的组织区域,随后是一个补丁选择器,用于在这些区域内检索最具信息量的补丁。通过仅选择最具信息量的补丁,我们的方法显著提高了效率:平均减少了70%的视觉标记使用,同时在三个病理问答任务中提高了准确性。在356,000个问答对的评估中,我们的方法优于现有方法,并生成基于可解释的、与病理学家一致的区域的答案。我们的结果表明,将类人搜索和注意模式引入全幻灯片图像推理是构建实用且可靠的病理视觉语言模型的一个有前景的方向。
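The coarse-to-fine retrieval described above, a group sampler over tissue regions followed by a patch selector inside the chosen regions, can be sketched with cosine similarity as the question-relevance score. The similarity measure and the region/patch counts are illustrative assumptions, not HistoSelect's components.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_patches(question, region_embs, patch_embs, patch_region,
                   n_regions=2, n_patches=4):
    """Coarse: rank regions by question relevance; fine: pick top patches
    inside the chosen regions only."""
    region_rank = np.argsort([-cosine(question, r) for r in region_embs])
    chosen = set(region_rank[:n_regions].tolist())
    cand = [i for i, r in enumerate(patch_region) if r in chosen]
    cand.sort(key=lambda i: -cosine(question, patch_embs[i]))
    return cand[:n_patches]

rng = np.random.default_rng(6)
question = rng.normal(size=8)                 # question embedding
region_embs = rng.normal(size=(5, 8))         # tissue-region embeddings
patch_embs = rng.normal(size=(50, 8))         # patch embeddings
patch_region = rng.integers(0, 5, size=50)    # which region each patch lies in
picked = select_patches(question, region_embs, patch_embs, patch_region)
print(len(picked), "of", len(patch_embs), "patches kept")
```

Pruning to a handful of patches out of the full pool is the mechanism behind the reported ~70% reduction in visual token usage.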
cs.CV / 111 / 2603.00668
Direct low-field MRI super-resolution using undersampled k-space
直接利用欠采样k空间进行低场MRI超分辨率
Abstract
Low-field magnetic resonance imaging (MRI) provides affordable access to diagnostic imaging but suffers from prolonged acquisition and limited image quality. Accelerated imaging can be achieved with k-space undersampling, while super-resolution (SR) and image quality transfer (IQT) methods typically rely on spatial-domain post-processing. In this work, we propose a novel framework for reconstructing high-field-like MR images directly from undersampled low-field k-space. Our approach employs a k-space dual channel U-Net that processes the real and imaginary components of undersampled k-space to restore missing frequency content. Experiments on low-field brain MRI demonstrate that our k-space-driven image enhancement consistently outperforms the counterpart spatial-domain method. Furthermore, reconstructions from undersampled k-space achieve image quality comparable to full k-space acquisitions. To the best of our knowledge, this is the first work that investigates low-field MRI SR/IQT directly from undersampled k-space.
Chinese Translation
低场磁共振成像(MRI)提供了经济实惠的诊断成像手段,但存在采集时间长和图像质量有限的问题。通过k空间欠采样可以实现加速成像,而超分辨率(SR)和图像质量转移(IQT)方法通常依赖于空间域后处理。在本研究中,我们提出了一种新颖的框架,能够直接从欠采样的低场k空间重建出高场MRI样的图像。我们的方法采用k空间双通道U-Net,处理欠采样k空间的实部和虚部,以恢复缺失的频率内容。在低场脑MRI的实验中,我们的k空间驱动图像增强方法始终优于相应的空间域方法。此外,从欠采样k空间重建的图像质量可与完整k空间采集相媲美。据我们所知,这是首个直接从欠采样k空间研究低场MRI超分辨率/IQT的工作。
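The input preparation the abstract implies can be sketched as: undersample k-space with a mask, then stack the real and imaginary parts as two channels for a k-space network. The regular every-other-line mask is an assumption; actual acquisitions use scanner-specific sampling patterns.

```python
import numpy as np

def undersample_kspace(image, keep_every=2):
    """FFT an image, zero out masked k-space lines, and return the
    dual-channel (real, imaginary) representation plus the mask."""
    k = np.fft.fftshift(np.fft.fft2(image))
    mask = np.zeros(k.shape, dtype=bool)
    mask[:, ::keep_every] = True            # keep every other phase-encode line
    k_under = np.where(mask, k, 0)
    dual = np.stack([k_under.real, k_under.imag])   # shape (2, H, W)
    return dual, mask

image = np.zeros((8, 8))
image[3:5, 3:5] = 1.0                       # toy "phantom"
dual, mask = undersample_kspace(image)
print(dual.shape, int(mask.sum()))
```

A dual-channel U-Net would then regress the missing complex-valued lines before an inverse FFT recovers the image.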
cs.CV / 112 / 2603.00675
Specializing Foundation Models via Mixture of Low-Rank Experts for Comprehensive Head CT Analysis
通过低秩专家混合模型专门化基础模型以实现全面的头部CT分析
Abstract
Foundation models pre-trained on large-scale datasets demonstrate strong transfer learning capabilities; however, their adaptation to complex multi-label diagnostic tasks, such as comprehensive head CT finding detection, remains understudied. Standard parameter-efficient fine-tuning methods such as LoRA apply uniform adaptations across pathology types, which may limit performance for diverse medical findings. We propose a Mixture of Low-Rank Experts (MoLRE) framework that extends LoRA with multiple specialized low-rank adapters and unsupervised soft routing. This approach enables conditional feature adaptation with less than 0.5% additional parameters and without explicit pathology supervision. We present a comprehensive benchmark of MoLRE across six state-of-the-art medical imaging foundation models spanning 2D and 3D architectures, general-domain, medical-domain, and head CT-specific pretraining, and model sizes ranging from 7M to 431M parameters. Using over 70,000 non-contrast head CT scans with 75 annotated findings (including hemorrhage, infarction, trauma, mass lesions, structural abnormalities, and chronic changes), our experiments demonstrate consistent performance improvements across all models. Gains vary substantially: general-purpose and medical-domain models show the largest improvements (DINOv3-Base: +4.6%; MedGemma: +4.3%), whereas 3D CT-specialized or very large models show more modest gains (+0.2-1.3%). The combination of MoLRE and MedGemma achieves the highest average detection AUC of 0.917. These findings highlight the importance of systematic benchmarking on target clinical tasks, as pretraining domain, architecture, and model scale interact in non-obvious ways.
Chinese Translation
在大规模数据集上预训练的基础模型展示了强大的迁移学习能力;然而,它们在复杂的多标签诊断任务(如全面的头部CT发现检测)中的适应性仍然未得到充分研究。标准的参数高效微调方法如LoRA在不同病理类型之间应用统一的调整,这可能限制了对多样化医学发现的性能。我们提出了一种低秩专家混合模型(Mixture of Low-Rank Experts, MoLRE)框架,该框架通过多个专门的低秩适配器和无监督的软路由扩展了LoRA。这种方法能够在不超过0.5%的额外参数和没有明确病理监督的情况下实现条件特征适应。我们对MoLRE进行了全面的基准测试,涵盖了六种最先进的医学影像基础模型,这些模型包括2D和3D架构、通用领域、医学领域以及特定于头部CT的预训练,模型规模范围从700万到4.31亿参数。使用超过70,000个无对比头部CT扫描和75个标注发现(包括出血、梗塞、创伤、肿块病变、结构异常和慢性变化),我们的实验显示所有模型的一致性能提升。提升幅度差异显著:通用和医学领域模型显示出最大的改进(DINOv3-Base: +4.6%;MedGemma: +4.3%),而3D CT专用或非常大的模型则显示出较为温和的提升(+0.2-1.3%)。MoLRE与MedGemma的结合实现了最高的平均检测AUC为0.917。这些发现强调了在目标临床任务上进行系统基准测试的重要性,因为预训练领域、架构和模型规模以非明显的方式相互作用。
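A Mixture of Low-Rank Experts layer can be sketched as several LoRA-style rank-r adapters sharing one frozen base weight, with a softmax router mixing their outputs per input. Shapes, rank, and the router form below are illustrative assumptions rather than the paper's exact design; note the standard LoRA zero-initialization of `B`, which makes the layer match the frozen base at the start of fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MoLRELayer:
    def __init__(self, d_in, d_out, n_experts=3, rank=2):
        self.w_base = rng.normal(scale=0.1, size=(d_in, d_out))   # frozen
        self.A = rng.normal(scale=0.1, size=(n_experts, d_in, rank))
        self.B = np.zeros((n_experts, rank, d_out))               # LoRA init: B = 0
        self.w_router = rng.normal(scale=0.1, size=(d_in, n_experts))

    def forward(self, x):
        gates = softmax(x @ self.w_router)      # unsupervised soft routing
        delta = sum(g * (x @ a @ b) for g, a, b in zip(gates, self.A, self.B))
        return x @ self.w_base + delta

layer = MoLRELayer(d_in=6, d_out=4)
x = rng.normal(size=6)
# With B = 0 every expert is inert, so the layer equals the frozen base.
assert np.allclose(layer.forward(x), x @ layer.w_base)
layer.B += 0.01   # after training, experts contribute routed low-rank updates
print(layer.forward(x).shape)
```

Since only `A`, `B`, and the router are trained, the added parameter count stays a tiny fraction of the base weight, consistent with the sub-0.5% overhead claimed above.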
cs.CV / 113 / 2603.00682
CoLC: Communication-Efficient Collaborative Perception with LiDAR Completion
CoLC:一种通信高效的基于激光雷达补全的协同感知方法
Abstract
Collaborative perception empowers autonomous agents to share complementary information and overcome perception limitations. While early fusion offers more perceptual complementarity and is inherently robust to model heterogeneity, its high communication cost has limited its practical deployment, prompting most existing works to favor intermediate or late fusion. To address this, we propose a communication-efficient early Collaborative perception framework that incorporates LiDAR Completion to restore scene completeness under sparse transmission, dubbed CoLC. Specifically, the CoLC integrates three complementary designs. First, each neighbor agent applies Foreground-Aware Point Sampling (FAPS) to selectively transmit informative points that retain essential structural and contextual cues under bandwidth constraints. The ego agent then employs Completion-Enhanced Early Fusion (CEEF) to reconstruct dense pillars from the received sparse inputs and adaptively fuse them with its own observations, thereby restoring spatial completeness. Finally, the Dense-Guided Dual Alignment (DGDA) strategy enforces semantic and geometric consistency between the enhanced and dense pillars during training, ensuring consistent and robust feature learning. Experiments on both simulated and real-world datasets demonstrate that CoLC achieves superior perception-communication trade-offs and remains robust under heterogeneous model settings. The code is available at https://github.com/CatOneTwo/CoLC.
Chinese Translation
协同感知使自主代理能够共享互补信息,克服感知局限性。尽管早期融合提供了更多的感知互补性,并且在模型异构性方面具有内在的鲁棒性,但其高通信成本限制了其实际部署,促使大多数现有工作倾向于中间或晚期融合。为了解决这一问题,我们提出了一种通信高效的早期协同感知框架,结合了激光雷达补全技术,以在稀疏传输下恢复场景的完整性,称为CoLC。具体而言,CoLC整合了三种互补设计。首先,每个邻居代理应用前景感知点采样(Foreground-Aware Point Sampling, FAPS)选择性地传输在带宽限制下保留重要结构和上下文线索的信息点。然后,自我代理采用补全增强早期融合(Completion-Enhanced Early Fusion, CEEF)从接收到的稀疏输入中重建密集柱,并自适应地将其与自身观测融合,从而恢复空间完整性。最后,密集引导双重对齐(Dense-Guided Dual Alignment, DGDA)策略在训练过程中强制增强柱和密集柱之间的语义和几何一致性,确保一致且鲁棒的特征学习。在模拟和真实世界数据集上的实验表明,CoLC在感知与通信的权衡上表现优越,并在异构模型设置下保持鲁棒性。代码可在 https://github.com/CatOneTwo/CoLC 获取。
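The FAPS step reduces to scoring points by how likely they belong to foreground and transmitting only the top-k within the bandwidth budget. The hand-made distance-based scorer below is an illustrative stand-in for a learned foreground scorer.

```python
import numpy as np

def faps(points, fg_scores, budget):
    """Foreground-aware point sampling: keep the `budget` highest-scoring
    points for transmission."""
    keep = np.argsort(-fg_scores)[:budget]
    return points[keep]

rng = np.random.default_rng(4)
points = rng.uniform(-50, 50, size=(1000, 3))        # toy LiDAR sweep
# Stand-in scorer: points near a "vehicle" at the origin score higher.
fg_scores = np.exp(-np.linalg.norm(points, axis=1) / 10)
sent = faps(points, fg_scores, budget=100)
print(sent.shape)
```

Transmitting 100 of 1000 points is a 10x bandwidth cut; CEEF on the ego side would then complete the dense structure these sparse points imply.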
cs.CV / 114 / 2603.00687
SCOUT: Fast Spectral CT Imaging in Ultra LOw-data Regimes via PseUdo-label GeneraTion
SCOUT:通过伪标签生成在超低数据条件下快速谱CT成像
Abstract
Noise and artifacts during computed tomography (CT) scans are a fundamental challenge affecting disease diagnosis. However, current methods either involve excessively long reconstruction times or rely on data-driven models for optimization, failing to adequately consider the valuable information inherent in the data itself, especially medical 3D data. This work proposes a reconstruction method under ultra-low raw data conditions, requiring no external data and avoiding lengthy pre-training processes. By leveraging spatial nonlocal similarity and the conjugate properties of the projection domain to generate pseudo-3D data for self-supervised training, high-fidelity results can be achieved in a very short time. Extensive experiments demonstrate that this method not only mitigates detector-induced ring artifacts but also exhibits unprecedented capabilities in detail recovery. This method provides a new paradigm for research using unlabeled raw projection data. Code is available at https://github.com/yqx7150/SCOUT.
Chinese Translation
在计算机断层扫描(CT)过程中,噪声和伪影是影响疾病诊断的基本挑战。然而,当前的方法要么涉及过长的重建时间,要么依赖于数据驱动模型进行优化,未能充分考虑数据本身所固有的有价值信息,尤其是医学3D数据。本文提出了一种在超低原始数据条件下的重建方法,该方法无需外部数据,避免了冗长的预训练过程。通过利用空间非局部相似性和投影域的共轭特性生成伪3D数据进行自监督训练,可以在非常短的时间内实现高保真度的结果。大量实验表明,该方法不仅减轻了探测器引起的环状伪影,还在细节恢复方面展现了前所未有的能力。该方法为使用未标记的原始投影数据的研究提供了新的范式。代码可在 https://github.com/yqx7150/SCOUT 获取。
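The conjugate property the abstract alludes to can be demonstrated for a real-valued image: its Fourier transform satisfies F(-u, -v) = conj(F(u, v)), so half of the spectrum determines the other half. This kind of built-in redundancy is what pseudo-data generation for self-supervision can exploit; how SCOUT applies it in the projection domain is not reproduced here.

```python
import numpy as np

# A real-valued toy image and its 2D Fourier transform.
img = np.random.default_rng(7).normal(size=(6, 6))
F = np.fft.fft2(img)

# Index-reversed spectrum: entry (i, j) becomes entry (-i mod N, -j mod N).
idx = (-np.arange(6)) % 6
F_flipped = F[idx][:, idx]

# Hermitian (conjugate) symmetry holds exactly for real input.
assert np.allclose(F_flipped, np.conj(F))
print("conjugate symmetry holds")
```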
cs.CV / 115 / 2603.00695
STMI: Segmentation-Guided Token Modulation with Cross-Modal Hypergraph Interaction for Multi-Modal Object Re-Identification
STMI:基于分割引导的跨模态超图交互的多模态目标重识别的标记调制
Abstract
Multi-modal object Re-Identification (ReID) aims to exploit complementary information from different modalities to retrieve specific objects. However, existing methods often rely on hard token filtering or simple fusion strategies, which can lead to the loss of discriminative cues and increased background interference. To address these challenges, we propose STMI, a novel multi-modal learning framework consisting of three key components: (1) Segmentation-Guided Feature Modulation (SFM) module leverages SAM-generated masks to enhance foreground representations and suppress background noise through learnable attention modulation; (2) Semantic Token Reallocation (STR) module employs learnable query tokens and an adaptive reallocation mechanism to extract compact and informative representations without discarding any tokens; (3) Cross-Modal Hypergraph Interaction (CHI) module constructs a unified hypergraph across modalities to capture high-order semantic relationships. Extensive experiments on public benchmarks (i.e., RGBNT201, RGBNT100, and MSVR310) demonstrate the effectiveness and robustness of our proposed STMI framework in multi-modal ReID scenarios.
Chinese Translation
多模态目标重识别(ReID)旨在利用不同模态的互补信息来检索特定对象。然而,现有方法往往依赖于硬性标记过滤或简单的融合策略,这可能导致判别线索的丧失和背景干扰的增加。为了解决这些挑战,我们提出了STMI,一种新颖的多模态学习框架,包含三个关键组件:(1)分割引导特征调制(SFM)模块利用SAM生成的掩码来增强前景表示并通过可学习的注意力调制抑制背景噪声;(2)语义标记重新分配(STR)模块采用可学习的查询标记和自适应重新分配机制,以提取紧凑且信息丰富的表示,而不丢弃任何标记;(3)跨模态超图交互(CHI)模块构建跨模态的统一超图,以捕捉高阶语义关系。在公共基准(即RGBNT201、RGBNT100和MSVR310)上的大量实验表明,我们提出的STMI框架在多模态ReID场景中具有有效性和鲁棒性。
cs.CV / 116 / 2603.00697
TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction
TokenSplat:面向前馈的无姿态重建的令牌对齐3D高斯溅射
Abstract
We present TokenSplat, a feed-forward framework for joint 3D Gaussian reconstruction and camera pose estimation from unposed multi-view images. At its core, TokenSplat introduces a Token-aligned Gaussian Prediction module that aligns semantically corresponding information across views directly in the feature space. Guided by coarse token positions and fusion confidence, it aggregates multi-scale contextual features to enable long-range cross-view reasoning and reduce redundancy from overlapping Gaussians. To further enhance pose robustness and disentangle viewpoint cues from scene semantics, TokenSplat employs learnable camera tokens and an Asymmetric Dual-Flow Decoder (ADF-Decoder) that enforces directionally constrained communication between camera and image tokens. This maintains clean factorization within a feed-forward architecture, enabling coherent reconstruction and stable pose estimation without iterative refinement. Extensive experiments demonstrate that TokenSplat achieves higher reconstruction fidelity and novel-view synthesis quality in pose-free settings, and significantly improves pose estimation accuracy compared to prior pose-free methods. Project page: https://kidleyh.github.io/tokensplat/.
Chinese Translation
我们提出了TokenSplat,这是一个前馈框架,用于从无姿态的多视角图像中进行联合3D高斯重建和相机姿态估计。TokenSplat的核心是引入了一个令牌对齐高斯预测模块,该模块直接在特征空间中对齐语义对应的信息。通过粗略的令牌位置和融合置信度的指导,它聚合多尺度上下文特征,以实现长距离的跨视角推理并减少重叠高斯的冗余。为了进一步增强姿态的鲁棒性并将视角线索与场景语义解耦,TokenSplat采用了可学习的相机令牌和一个非对称双流解码器(Asymmetric Dual-Flow Decoder,ADF-Decoder),该解码器强制相机令牌与图像令牌之间进行方向性受限的通信。这在前馈架构中保持了干净的因式分解,使得在没有迭代优化的情况下实现一致的重建和稳定的姿态估计。大量实验表明,TokenSplat在无姿态设置下实现了更高的重建保真度和新视图合成质量,并显著提高了相较于先前无姿态方法的姿态估计精度。项目页面:https://kidleyh.github.io/tokensplat/
cs.CV / 117 / 2603.00702
Towards Universal Khmer Text Recognition
迈向通用高棉文本识别
Abstract
Khmer is a low-resource language characterized by a complex script, presenting significant challenges for optical character recognition (OCR). While printed document text recognition has advanced thanks to available datasets, performance on other modalities, such as handwritten and scene text, remains limited by data scarcity. Training a separate model for each modality precludes cross-modality transfer learning, from which modalities with limited data could otherwise benefit. Moreover, deploying many modality-specific models incurs significant memory overhead and requires error-prone routing of each input image to the appropriate model. On the other hand, simply training on a combined dataset with a non-uniform data distribution across modalities often degrades performance on underrepresented modalities. To address these issues, we propose a universal Khmer text recognition (UKTR) framework capable of handling diverse text modalities. Central to our method is a novel modality-aware adaptive feature selection (MAFS) technique designed to adapt visual features to the modality of a particular input image and enhance recognition robustness across modalities. Extensive experiments demonstrate that our model achieves state-of-the-art (SoTA) performance. Furthermore, we introduce the first comprehensive benchmark for universal Khmer text recognition, which we release to the community to facilitate future research. Our datasets and models can be accessed via this gated repository (in review).
Chinese Translation
高棉语是一种低资源语言,其复杂的文字系统给光学字符识别(OCR)带来了显著挑战。尽管由于可用数据集,文档打印文本识别已有所进展,但在手写文本和场景文本等其他形式上的表现仍受到数据稀缺的限制。为每种形式训练特定的模型无法实现跨形式的迁移学习,而这对于数据有限的形式来说本可以带来益处。此外,部署多个特定形式的模型会导致显著的内存开销,并且需要将每个输入图像路由到相应模型的过程容易出错。另一方面,简单地在一个不同形式间数据分布不均的组合数据集上训练,往往会导致在代表性不足的形式上的性能下降。为了解决这些问题,我们提出了一种通用高棉文本识别(UKTR)框架,能够处理多样的文本形式。我们方法的核心是一种新颖的形式感知自适应特征选择(MAFS)技术,旨在根据特定输入图像形式调整视觉特征,并增强跨形式的识别鲁棒性。大量实验表明,我们的模型达到了最先进的(SoTA)性能。此外,我们还推出了首个全面的通用高棉文本识别基准,供社区使用,以促进未来的研究。我们的数据集和模型可通过此受限存储库访问。
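The abstract does not detail how MAFS adapts features per modality, but the idea of modality-conditioned feature selection can be sketched with a generic sigmoid gate: a modality embedding produces per-channel gate values that reweight shared backbone features. Everything here (the gate form, the logits, the function name) is an illustrative assumption, not the paper's architecture:

```python
import math

def mafs_gate(features, modality_logits):
    """Generic modality-conditioned gating: a sigmoid over learned
    per-channel logits reweights shared features, so the same backbone
    output is adapted to the input modality (printed / handwritten /
    scene). A stand-in for the paper's MAFS module, not its design."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    return [f * sigmoid(m) for f, m in zip(features, modality_logits)]

feats = [1.0, 1.0, 1.0, 1.0]
printed_logits = [4.0, 4.0, -4.0, -4.0]  # hypothetical learned logits
adapted = mafs_gate(feats, printed_logits)
# first two channels pass nearly unchanged, last two are suppressed
```

A different modality would supply different logits, selecting a different channel subset from the same features.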
cs.CV / 118 / 2603.00707
Towards Khmer Scene Document Layout Detection
面向高棉场景文档布局检测
Abstract
While document layout analysis for Latin scripts has advanced significantly, driven by the advent of large multimodal models (LMMs), progress for the Khmer language remains constrained because of the scarcity of annotated training data. This gap is particularly acute for scene documents, where perspective distortions and complex backgrounds challenge traditional methods. Given the structural complexities of Khmer script, such as diacritics and multi-layer character stacking, existing Latin-based layout analysis models fail to accurately delineate semantic layout units, particularly for dense text regions (e.g., list items). In this paper, we present the first comprehensive study on Khmer scene document layout detection. We contribute a novel framework comprising three key elements: (1) a robust training and benchmarking dataset specifically for Khmer scene layouts; (2) an open-source document augmentation tool capable of synthesizing realistic scene documents to scale training data; and (3) layout detection baselines utilizing YOLO-based architectures with oriented bounding boxes (OBB) to handle geometric distortions. To foster further research in the Khmer document analysis and recognition (DAR) community, we release our models, code, and datasets in this gated repository (in review).
Chinese Translation
尽管拉丁文字的文档布局分析在大型多模态模型(LMMs)的推动下取得了显著进展,但由于缺乏标注的训练数据,高棉语的进展仍然受到限制。这一差距在场景文档中尤为明显,因为透视畸变和复杂背景对传统方法构成了挑战。考虑到高棉文字的结构复杂性,如变音符号和多层字符堆叠,现有的基于拉丁文的布局分析模型无法准确划分语义布局单元,尤其是在密集文本区域(例如列表项)中。在本文中,我们首次全面研究高棉场景文档布局检测。我们提出了一个新颖的框架,包括三个关键元素:(1)专门针对高棉场景布局的强大训练和基准数据集;(2)一个开源文档增强工具,能够合成逼真的场景文档以扩展训练数据;以及(3)利用基于YOLO的架构和定向边界框(OBB)处理几何畸变的布局检测基线。为了促进高棉文档分析与识别(DAR)领域的进一步研究,我们将在此受限库中发布我们的模型、代码和数据集(正在审查中)。
cs.CV / 119 / 2603.00714
A Reconstruction System for Industrial Pipeline Inner Walls Using Panoramic Image Stitching with Endoscopic Imaging
基于全景图像拼接与内窥镜成像的工业管道内壁重建系统
Abstract
Visual analysis and reconstruction of pipeline inner walls remain challenging in industrial inspection scenarios. This paper presents a dedicated reconstruction system for pipeline inner walls via industrial endoscopes, which is built on panoramic image stitching technology. Equipped with a custom graphical user interface (GUI), the system extracts key frames from endoscope video footage, and integrates polar coordinate transformation with image stitching techniques to unwrap annular video frames of pipeline inner walls into planar panoramic images. Experimental results demonstrate that the proposed method enables efficient processing of industrial endoscope videos, and the generated panoramic stitched images preserve all detailed features of pipeline inner walls in their entirety. This provides intuitive and accurate visual support for defect detection and condition assessment of pipeline inner walls. In comparison with the traditional frame-by-frame video review method, the proposed approach significantly elevates the efficiency of pipeline inner wall reconstruction and exhibits considerable engineering application value.
Chinese Translation
在工业检测场景中,管道内壁的视觉分析和重建仍然面临挑战。本文提出了一种专门针对管道内壁的重建系统,该系统基于工业内窥镜和全景图像拼接技术构建。该系统配备了自定义图形用户界面(GUI),能够从内窥镜视频中提取关键帧,并将极坐标变换与图像拼接技术结合,能够将管道内壁的环形视频帧展开为平面全景图像。实验结果表明,所提出的方法能够高效处理工业内窥镜视频,生成的全景拼接图像完整保留了管道内壁的所有细节特征。这为管道内壁的缺陷检测和状态评估提供了直观而准确的视觉支持。与传统逐帧视频回顾方法相比,所提出的方法显著提高了管道内壁重建的效率,并展现出相当的工程应用价值。
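The polar-coordinate unwrapping step described above maps the annular view of the pipe wall onto a planar strip by sampling the frame along concentric circles. A minimal nearest-neighbour version (the center, radii, and resolution parameters are assumptions; the real system also stitches consecutive strips, which is not shown):

```python
import math

def unwrap_annulus(img, cx, cy, r_in, r_out, n_theta, n_r):
    """Unwrap the annulus of a 2D image (list of rows) between radii
    r_in and r_out into an (n_r x n_theta) planar strip, sampling with
    nearest-neighbour polar-to-Cartesian mapping."""
    h, w = len(img), len(img[0])
    strip = []
    for ri in range(n_r):
        r = r_in + (r_out - r_in) * ri / max(n_r - 1, 1)
        row = []
        for ti in range(n_theta):
            theta = 2 * math.pi * ti / n_theta
            x = int(round(cx + r * math.cos(theta)))
            y = int(round(cy + r * math.sin(theta)))
            row.append(img[y][x] if 0 <= y < h and 0 <= x < w else 0)
        strip.append(row)
    return strip

# 8x8 toy frame: a bright ring (value 9) around the center, like a pipe wall
frame = [[9 if 1.3 <= math.hypot(x - 4, y - 4) <= 3.3 else 0
          for x in range(8)] for y in range(8)]
panel = unwrap_annulus(frame, cx=4, cy=4, r_in=2, r_out=3, n_theta=8, n_r=2)
```

In practice bilinear interpolation would replace the nearest-neighbour lookup, and the unwrapped strips from key frames would then be fed to the stitching stage.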
cs.CV / 120 / 2603.00717
Diversity over Uniformity: Rethinking Representation in Generated Image Detection
多样性优于统一性:重新思考生成图像检测中的表示
Abstract
With the rapid advancement of generative models, generated image detection has become an important task in visual forensics. Although existing methods have achieved remarkable progress, they often rely, after training, on only a small subset of highly salient forgery cues, which limits their ability to generalize to unseen generative mechanisms. We argue that reliably generated image detection should not depend on a single decision path but should preserve multiple judgment perspectives, enabling the model to understand the differences between real and generated images from diverse viewpoints. Based on this idea, we propose an anti-feature-collapse learning framework that filters task-irrelevant components and suppresses excessive overlap among different forgery cues in the representation space, preventing discriminative information from collapsing into a few dominant feature directions. This design maintains diverse and complementary evidence within the model, reduces reliance on a small set of salient cues, and enhances robustness under unseen generative settings. Extensive experiments on multiple public benchmarks demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in cross-model scenarios, achieving an accuracy improvement of 5.02% and exhibiting superior generalization and detection reliability. The source code is available at https://github.com/Yanmou-Hui/DoU.
Chinese Translation
随着生成模型的快速发展,生成图像检测已成为视觉取证中的一项重要任务。尽管现有方法取得了显著进展,但它们通常在训练后仅依赖于一小部分高度显著的伪造线索,这限制了它们对未见生成机制的泛化能力。我们认为,可靠的生成图像检测不应依赖单一决策路径,而应保留多种判断视角,使模型能够从不同角度理解真实图像与生成图像之间的差异。基于这一理念,我们提出了一种反特征崩溃学习框架,该框架过滤与任务无关的成分,并抑制表示空间中不同伪造线索之间的过度重叠,防止判别信息崩溃为少数主导特征方向。该设计在模型内部保持多样性和互补证据,减少对少量显著线索的依赖,并增强在未见生成设置下的鲁棒性。在多个公共基准上的广泛实验表明,所提出的方法在跨模型场景中显著优于最先进的方法,准确率提高了5.02%,并表现出更好的泛化能力和检测可靠性。源代码可在 https://github.com/Yanmou-Hui/DoU 获取。
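The paper does not spell out its anti-feature-collapse loss, but the stated goal (suppress excessive overlap among forgery-cue directions so evidence stays diverse) matches a generic decorrelation penalty: sum the squared cosine similarities between distinct feature directions and drive it toward zero. A sketch in that spirit, not the paper's actual objective:

```python
import math

def overlap_penalty(features):
    """Redundancy penalty over a list of feature vectors: the sum of
    squared pairwise cosine similarities. Zero means the cue directions
    are mutually orthogonal; large values signal collapse onto a few
    dominant directions."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    n = len(features)
    return sum(cos(features[i], features[j]) ** 2
               for i in range(n) for j in range(i + 1, n))

orthogonal = [[1.0, 0.0], [0.0, 1.0]]   # diverse cues
collapsed  = [[1.0, 0.0], [1.0, 0.0]]   # both cues share one direction
```

Adding such a term to the training loss penalizes representations where all discriminative information concentrates in one decision path.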
cs.CV / 121 / 2603.00755
BornoViT: A Novel Efficient Vision Transformer for Bengali Handwritten Basic Characters Classification
BornoViT:一种新颖高效的视觉变换器用于孟加拉手写基本字符分类
Abstract
Handwritten character classification in the Bengali script is a significant challenge due to the complexity and variability of the characters. The models commonly used for classification are often computationally expensive and data-hungry, making them unsuitable for resource-limited languages such as Bengali. In this experiment, we propose a novel, efficient, and lightweight Vision Transformer model that effectively classifies Bengali handwritten basic characters and digits, addressing several shortcomings of traditional methods. The proposed solution utilizes a deep convolutional neural network (DCNN) in a more simplified manner compared to traditional DCNN architectures, with the aim of reducing computational burden. With only 0.65 million parameters, a model size of 0.62 MB, and 0.16 GFLOPs, our model, BornoViT, is significantly lighter than current state-of-the-art models, making it more suitable for resource-limited environments, which is essential for Bengali handwritten character classification. BornoViT was evaluated on the BanglaLekha Isolated dataset, achieving an accuracy of 95.77%, and demonstrating superior efficiency compared to existing state-of-the-art approaches. Furthermore, the model was evaluated on our self-collected dataset, Bornomala, consisting of approximately 222 samples from different age groups, where it achieved an accuracy of 91.51%.
Chinese Translation
在孟加拉文字符的手写字符分类中,由于字符的复杂性和变异性,这一任务面临着重大挑战。常用的分类模型通常计算开销大且对数据需求高,使其不适合资源有限的语言,如孟加拉语。在本实验中,我们提出了一种新颖、高效且轻量的视觉变换器模型,能够有效分类孟加拉手写基本字符和数字,解决了传统方法的若干不足之处。与传统的深度卷积神经网络(DCNN)架构相比,所提出的解决方案以更简化的方式利用了DCNN,旨在减少计算负担。我们的模型BornoViT仅有65万参数,模型大小为0.62 MB,计算量为0.16 GFLOPs,显著轻于当前最先进的模型,更加适合资源有限的环境,这对于孟加拉手写字符分类至关重要。BornoViT在BanglaLekha Isolated数据集上进行了评估,达到了95.77%的准确率,并展示了比现有最先进方法更优越的效率。此外,该模型还在我们自收集的数据集Bornomala上进行了评估,该数据集包含来自不同年龄组的约222个样本,模型在该数据集上的准确率达到了91.51%。
cs.CV / 122 / 2603.00756
Stroke outcome and evolution prediction from CT brain using a spatiotemporal diffusion autoencoder
基于时空扩散自编码器的CT脑部影像中卒中结果与演变预测
Abstract
Stroke is a major cause of death and disability worldwide. Accurate outcome and evolution prediction has the potential to revolutionize stroke care by individualizing clinical decision-making leading to better outcomes. However, despite a plethora of attempts and the rich data provided by neuroimaging, modelling the ultimate fate of brain tissue remains a challenging task. In this work, we apply recent ideas in the field of diffusion probabilistic models to generate a self-supervised semantically meaningful stroke representation from Computed Tomography (CT) images. We then improve this representation by extending the method to accommodate longitudinal images and the time from stroke onset. The effectiveness of our approach is evaluated on a dataset consisting of 5,824 CT images from 3,573 patients across two medical centers with minimal labels. Comparative experiments show that our method achieves the best performance for predicting next-day severity and functional outcome at discharge.
Chinese Translation
卒中是全球主要的死亡和残疾原因。准确的结果和演变预测有潜力通过个性化临床决策来彻底改变卒中护理,从而改善预后。然而,尽管有大量尝试和神经影像学提供的丰富数据,建模脑组织的最终命运仍然是一项具有挑战性的任务。在本研究中,我们应用扩散概率模型领域的最新思想,从计算机断层扫描(CT)图像中生成自监督的语义意义明确的卒中表示。然后,我们通过扩展该方法以适应纵向图像和卒中发作时间来改进这一表示。我们的方法在一个包含来自两个医疗中心的3,573名患者的5,824张CT图像的数据集上进行了有效性评估,标签极少。比较实验表明,我们的方法在预测次日严重程度和出院时功能结果方面达到了最佳性能。
cs.CV / 123 / 2603.00763
Analyzing and Improving Fast Sampling of Text-to-Image Diffusion Models
分析与改进文本到图像扩散模型的快速采样
Abstract
Text-to-image diffusion models have achieved unprecedented success but still struggle to produce high-quality results under limited sampling budgets. Existing training-free sampling acceleration methods are typically developed independently, leaving the overall performance and compatibility among these methods unexplored. In this paper, we bridge this gap by systematically elucidating the design space, and our comprehensive experiments identify the sampling time schedule as the most pivotal factor. Inspired by the geometric properties of diffusion models revealed through the Frenet-Serret formulas, we propose constant total rotation schedule (TORS), a scheduling strategy that ensures uniform geometric variation along the sampling trajectory. TORS outperforms previous training-free acceleration methods and produces high-quality images with 10 sampling steps on Flux.1-Dev and Stable Diffusion 3.5. Extensive experiments underscore the adaptability of our method to unseen models, hyperparameters, and downstream applications.
Chinese Translation
文本到图像的扩散模型已取得前所未有的成功,但在有限的采样预算下仍然难以生成高质量的结果。现有的无训练采样加速方法通常是独立开发的,导致这些方法之间的整体性能和兼容性未得到探索。本文通过系统阐明设计空间来填补这一空白,我们的综合实验确定采样时间调度是最关键的因素。受Frenet-Serret公式揭示的扩散模型几何特性的启发,我们提出了恒定总旋转调度(Constant Total Rotation Schedule, TORS),这是一种确保采样轨迹沿线均匀几何变化的调度策略。TORS在Flux.1-Dev和Stable Diffusion 3.5上优于之前的无训练加速方法,并在10个采样步骤下生成高质量图像。大量实验强调了我们的方法对未见模型、超参数和下游应用的适应性。
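A constant-total-rotation schedule as described above amounts to choosing coarse sampling steps so that each covers the same accumulated trajectory rotation, i.e. inverting the cumulative rotation profile. The sketch below assumes the rotation accrued on each fine sub-interval is already known (in TORS it would come from Frenet-Serret analysis of the sampling trajectory); integer units keep the toy arithmetic exact:

```python
def constant_rotation_schedule(rotation_rate, n_steps):
    """Pick n_steps coarse step boundaries so each coarse step covers
    (approximately) the same total rotation. rotation_rate[i] is the
    rotation accrued on the i-th fine sub-interval."""
    cum, c = [], 0
    for r in rotation_rate:
        c += r
        cum.append(c)
    target = c / n_steps
    bounds = [0]
    for k in range(1, n_steps):
        # first fine index whose cumulative rotation reaches k * target
        bounds.append(next(i for i, v in enumerate(cum) if v >= k * target) + 1)
    bounds.append(len(rotation_rate))
    return bounds

# rotation concentrated late in the trajectory -> steps cluster there
rates = [1] * 8 + [4] * 4
bounds = constant_rotation_schedule(rates, n_steps=4)
# a uniform-time schedule would instead give [0, 3, 6, 9, 12]
```

The contrast with the uniform schedule shows the key behavior: high-curvature regions of the trajectory receive denser sampling steps.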
cs.CV / 124 / 2603.00777
DUCX: Decomposing Unfairness in Tool-Using Chest X-ray Agents
DUCX:分解工具使用胸部X光代理中的不公平性
Abstract
Tool-using medical agents can improve chest X-ray question answering by orchestrating specialized vision and language modules, but this added pipeline complexity also creates new pathways for demographic bias beyond standalone models. We present DUCX (Decomposing Unfairness in Chest X-ray agents), a systematic audit of chest X-ray agents instantiated with MedRAX. To localize where disparities arise, we introduce a stage-wise fairness decomposition that separates end-to-end bias from three agent-specific sources: tool exposure bias (utility gaps conditioned on tool presence), tool transition bias (subgroup differences in tool-routing patterns), and model reasoning bias (subgroup differences in synthesis behaviors). Extensive experiments on tool-use-based agentic frameworks across five driver backbones reveal that (i) demographic gaps persist in end-to-end performance, with equalized-odds gaps of up to 20.79% and a fairness-utility tradeoff as low as 28.65%, and (ii) intermediate behaviors (tool usage, transition patterns, and reasoning traces) exhibit distinct subgroup disparities that are not predictable from end-to-end evaluation alone (e.g., conditioned on segmentation-tool availability, the subgroup utility gap reaches as high as 50%). Our findings underscore the need for process-level fairness auditing and debiasing to ensure the equitable deployment of clinical agentic systems. Code is available here: https://anonymous.4open.science/r/DUCK-E5FE/README.md
Chinese Translation
使用工具的医疗代理通过协调专业的视觉和语言模块,可以改善胸部X光问答的效果,但这种额外的流程复杂性也为人口统计偏见创造了新的途径,超出了独立模型的范围。我们提出了DUCX(分解胸部X光代理中的不公平性),对使用MedRAX实例化的胸部X光代理进行系统审计。为了定位差异出现的地方,我们引入了一种阶段性公平性分解方法,将端到端偏见与三种代理特定来源分开:工具暴露偏见(基于工具存在的效用差距)、工具转移偏见(工具路由模式中的亚组差异)和模型推理偏见(合成行为中的亚组差异)。在五个驱动骨干网络上进行的基于工具使用的代理框架的广泛实验表明:(i)在人口统计学表现中,端到端性能仍然存在差距,平衡几率高达20.79%,最低公平性-效用权衡降至28.65%;(ii)中间行为、工具使用、转移模式和推理轨迹表现出明显的亚组差异,这些差异无法仅通过端到端评估预测(例如,基于分割工具可用性,亚组效用差距高达50%)。我们的研究结果强调了进行流程级公平性审计和去偏见化的必要性,以确保临床代理系统的公平部署。代码可在此获取:https://anonymous.4open.science/r/DUCK-E5FE/README.md
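The equalized-odds disparity the audit reports is a standard fairness metric: the largest subgroup gap in either true-positive rate or false-positive rate. A minimal computation over (group, y_true, y_pred) triples (the toy data below is invented for illustration, and the sketch assumes every subgroup has both positives and negatives):

```python
from collections import defaultdict

def equalized_odds_gap(records):
    """Largest subgroup disparity in TPR or FPR over a list of
    (group, y_true, y_pred) triples."""
    tp, fn, fp, tn = (defaultdict(int) for _ in range(4))
    for g, y, p in records:
        if y and p: tp[g] += 1
        elif y: fn[g] += 1
        elif p: fp[g] += 1
        else: tn[g] += 1
    groups = {g for g, _, _ in records}
    tpr = {g: tp[g] / (tp[g] + fn[g]) for g in groups}
    fpr = {g: fp[g] / (fp[g] + tn[g]) for g in groups}
    return max(max(tpr.values()) - min(tpr.values()),
               max(fpr.values()) - min(fpr.values()))

audit = [
    ("A", 1, 1), ("A", 1, 1), ("A", 1, 0), ("A", 0, 0),
    ("B", 1, 1), ("B", 1, 0), ("B", 1, 0), ("B", 0, 0),
]
gap = equalized_odds_gap(audit)  # TPR gap: 2/3 (group A) vs 1/3 (group B)
```

The paper's stage-wise decomposition computes such gaps not only end-to-end but also conditioned on intermediate agent behavior (tool exposure, routing, reasoning).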
cs.CV / 125 / 2603.00793
Neural Functional Alignment Space: Brain-Referenced Representation of Artificial Neural Networks
神经功能对齐空间:基于大脑的人工神经网络表示
Abstract
We propose the Neural Functional Alignment Space (NFAS), a brain-referenced representational framework for characterizing artificial neural networks on equal functional grounds. NFAS departs from conventional alignment approaches that rely on layer-wise features or task-specific activations by modeling the intrinsic dynamical evolution of stimulus representations across network depth. Specifically, we model layer-wise embeddings as a depth-wise dynamical trajectory and apply Dynamic Mode Decomposition (DMD) to extract the stable mode. This representation is then projected into a biologically anchored coordinate system defined by distributed neural responses. We also introduce the Signal-to-Noise Consistency Index (SNCI) to quantify cross-model consistency at the modality level. Across 45 pretrained models spanning vision, audio, and language, NFAS reveals structured organization within this brain-referenced space, including modality-specific clustering and cross-modal convergence in integrative cortical systems. Our findings suggest that representation dynamics provide a principled basis for
Chinese Translation
我们提出了神经功能对齐空间(Neural Functional Alignment Space, NFAS),这是一个基于大脑的表示框架,用于在相同功能基础上表征人工神经网络。NFAS不同于依赖于层级特征或任务特定激活的传统对齐方法,它通过建模刺激表示在网络深度上的内在动态演变来实现。具体而言,我们将层级嵌入建模为深度动态轨迹,并应用动态模式分解(Dynamic Mode Decomposition, DMD)提取稳定模式。然后将这种表示投影到由分布式神经反应定义的生物学锚定坐标系统中。我们还引入了信噪一致性指数(Signal-to-Noise Consistency Index, SNCI)来量化模态层面的跨模型一致性。在涵盖视觉、音频和语言的45个预训练模型中,NFAS揭示了这一基于大脑的空间内的结构化组织,包括模态特定聚类和整合皮层系统中的跨模态收敛。我们的研究结果表明,表示动态为理解
cs.CV / 126 / 2603.00805
NERFIFY: A Multi-Agent Framework for Turning NeRF Papers into Code
NERFIFY:将 NeRF 论文转化为代码的多智能体框架
Abstract
The proliferation of neural radiance field (NeRF) research requires significant effort to reimplement papers before building upon them. We introduce NERFIFY, a multi-agent framework that reliably converts NeRF research papers into trainable Nerfstudio plugins, in contrast to generic paper-to-code methods and frontier models like GPT-5 that usually fail to produce runnable code. NERFIFY achieves domain-specific executability through six key innovations: (1) Context-free grammar (CFG): LLM synthesis is constrained by the Nerfstudio plugin interface formalized as a CFG, ensuring generated code satisfies architectural invariants. (2) Graph-of-Thought code synthesis: Specialized multi-file agents generate repositories in topological dependency order, validating contracts and errors at each node. (3) Compositional citation recovery: Agents automatically retrieve and integrate components (samplers, encoders, proposal networks) from the citation graphs of referenced papers. (4) Visual feedback: Artifacts are diagnosed through PSNR-minima ROI analysis, cross-view geometric validation, and VLM-guided patching to iteratively improve quality. (5) Knowledge enhancement: Beyond reproduction, methods can be improved with novel optimizations. (6) Benchmarking: An evaluation framework is designed for NeRF paper-to-code synthesis across 30 diverse papers. On papers without public implementations, NERFIFY achieves visual quality matching expert human code (+/-0.5 dB PSNR, +/-0.2 SSIM) while reducing implementation time from weeks to minutes. NERFIFY demonstrates that a domain-aware design enables code translation for complex vision papers, accelerating and democratizing reproducible research. Code, data and implementations will be publicly released.
Chinese Translation
神经辐射场(NeRF)研究的迅速发展需要在基于已有论文进行研究之前进行大量的重新实现工作。我们提出了 NERFIFY,一个多智能体框架,能够可靠地将 NeRF 研究论文转化为可训练的 Nerfstudio 插件,这与通常无法生成可运行代码的通用论文转代码方法和前沿模型(如 GPT-5)形成对比。NERFIFY 通过六项关键创新实现了领域特定的可执行性:(1)上下文无关文法(CFG):大型语言模型(LLM)合成受到 Nerfstudio 的形式化约束,确保生成的代码满足架构不变性。(2)思维图代码合成:专门的多文件智能体按照拓扑依赖顺序生成代码库,在每个节点验证合约和错误。(3)组合引用恢复:智能体自动从引用图中检索并整合组件(采样器、编码器、提议网络)。(4)视觉反馈:通过 PSNR 最小值 ROI 分析、跨视图几何验证和 VLM 引导的修补来诊断伪影,以迭代方式提高质量。(5)知识增强:除了再现,方法还可以通过新优化进行改进。(6)基准测试:设计了一个评估框架,用于对 30 篇不同论文进行 NeRF 论文到代码的合成。在没有公开实现的论文中,NERFIFY 实现了与专家人类代码相匹配的视觉质量(+/-0.5 dB PSNR,+/-0.2 SSIM),同时将实现时间从数周缩短至数分钟。NERFIFY 证明了领域感知设计能够为复杂视觉论文实现代码翻译,从而加速和民主化可重复研究。代码、数据和实现将公开发布。
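Generating repository files "in topological dependency order" (innovation 2 above) is the classic topological-sort problem: emit each file only after everything it imports has been emitted. A sketch with Kahn's algorithm (the plugin file names below are hypothetical, not NERFIFY's actual layout):

```python
from collections import deque

def dependency_order(deps):
    """Kahn's algorithm over a {file: [files it depends on]} map:
    returns an order in which each file appears only after all of its
    dependencies, mirroring generation in topological order."""
    indeg = {f: len(reqs) for f, reqs in deps.items()}
    ready = deque(sorted(f for f, d in indeg.items() if d == 0))
    order = []
    while ready:
        f = ready.popleft()
        order.append(f)
        for g, reqs in sorted(deps.items()):
            if f in reqs:
                indeg[g] -= 1
                if indeg[g] == 0:
                    ready.append(g)
    return order

# hypothetical Nerfstudio-style plugin layout, for illustration only
repo = {
    "field.py":    [],
    "sampler.py":  [],
    "model.py":    ["field.py", "sampler.py"],
    "pipeline.py": ["model.py"],
}
order = dependency_order(repo)
```

Validating contracts at each node then amounts to checking each generated file against the interfaces of the already-emitted files that precede it in this order.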
cs.CV / 127 / 2603.00825
COMBAT: Conditional World Models for Behavioral Agent Training
COMBAT:用于行为代理训练的条件世界模型
Abstract
Recent advances in video generation have spurred the development of world models capable of simulating 3D-consistent environments and interactions with static objects. However, a significant limitation remains in their ability to model dynamic, reactive agents that can intelligently influence and interact with the world. To address this gap, we introduce COMBAT, a real-time, action-controlled world model trained on the complex 1v1 fighting game Tekken 3. Our work demonstrates that diffusion models can successfully simulate a dynamic opponent that reacts to player actions, learning its behavior implicitly. Our approach utilizes a 1.2 billion parameter Diffusion Transformer, conditioned on latent representations from a deep compression autoencoder. We employ state-of-the-art techniques, including causal distillation and diffusion forcing, to achieve real-time inference. Crucially, we observe the emergence of sophisticated agent behavior by training the model solely on single-player inputs, without any explicit supervision for the opponent's policy. Unlike traditional imitation learning methods, which require complete action labels, COMBAT learns effectively from partially observed data to generate responsive behaviors for a controllable Player 1. We present an extensive study and introduce novel evaluation methods to benchmark this emergent agent behavior, establishing a strong foundation for training interactive agents within diffusion-based world models.
Chinese Translation
近期视频生成的进展推动了能够模拟三维一致环境及与静态物体互动的世界模型的发展。然而,它们在建模能够智能影响和与世界互动的动态反应代理方面仍存在显著限制。为了解决这一问题,我们提出了COMBAT,这是一种实时、基于动作控制的世界模型,训练于复杂的1对1格斗游戏《铁拳3》。我们的研究表明,扩散模型能够成功模拟一个对玩家动作做出反应的动态对手,隐式学习其行为。我们的方法利用了一个具有12亿参数的扩散变换器,条件化于深度压缩自编码器的潜在表示。我们采用了包括因果蒸馏和扩散强制在内的最先进技术,以实现实时推理。重要的是,我们观察到通过仅基于单人输入训练模型,复杂的代理行为得以出现,而无需对对手策略进行任何显式监督。与传统的模仿学习方法不同,后者需要完整的动作标签,COMBAT能够有效地从部分观察数据中学习,为可控的玩家1生成响应行为。我们进行了广泛的研究,并引入了新颖的评估方法来基准测试这一新兴的代理行为,为在基于扩散的世界模型中训练互动代理奠定了坚实的基础。
cs.CV / 128 / 2603.00828
MME: Mixture of Mesh Experts with Random Walk Transformer Gating
MME:随机游走变换器门控的网格专家混合模型
Abstract
In recent years, various methods have been proposed for mesh analysis, each offering distinct advantages and often excelling on different object classes. We present a novel Mixture of Experts (MoE) framework designed to harness the complementary strengths of these diverse approaches. We propose a new gate architecture that encourages each expert to specialise in the classes it excels in. Our design is guided by two key ideas: (1) random walks over the mesh surface effectively capture the regions that individual experts attend to, and (2) an attention mechanism that enables the gate to focus on the areas most informative for each expert's decision-making. To further enhance performance, we introduce a dynamic loss balancing scheme that adjusts a trade-off between diversity and similarity losses throughout the training, where diversity prompts expert specialization, and similarity enables knowledge sharing among the experts. Our framework achieves state-of-the-art results in mesh classification, retrieval, and semantic segmentation tasks. Our code is available at: https://github.com/amirbelder/MME-Mixture-of-Mesh-Experts.
Chinese Translation
近年来,针对网格分析提出了多种方法,每种方法都有其独特的优势,并且在不同的物体类别上表现出色。我们提出了一种新颖的专家混合(MoE)框架,旨在利用这些多样化方法的互补优势。我们提出了一种新的门控架构,鼓励每个专家专注于其擅长的类别。我们的设计基于两个关键思想:(1)在网格表面上的随机游走有效捕捉各个专家关注的区域;(2)一种注意力机制使得门控能够集中关注对每个专家决策最具信息量的区域。为了进一步提升性能,我们引入了一种动态损失平衡方案,该方案在训练过程中调整多样性损失和相似性损失之间的权衡,其中多样性促使专家专业化,而相似性则使专家之间能够共享知识。我们的框架在网格分类、检索和语义分割任务中实现了最先进的结果。我们的代码可在以下链接获取:https://github.com/amirbelder/MME-Mixture-of-Mesh-Experts。
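The random walks the MME gate relies on are ordinary walks over the mesh's face-adjacency graph: from each face, step to a uniformly chosen neighboring face, so the visited sequence summarizes a local surface region. A minimal sketch (the toy adjacency graph and seeding are illustrative; the paper's gate additionally applies attention over such walks, which is not shown):

```python
import random

def random_walk(adjacency, start, length, seed=0):
    """Uniform random walk of `length` steps over a face-adjacency
    graph given as {face: [neighbor faces]}; the visited sequence
    captures the local surface region being attended to."""
    rng = random.Random(seed)  # seeded for reproducibility
    walk = [start]
    for _ in range(length):
        walk.append(rng.choice(adjacency[walk[-1]]))
    return walk

# toy face-adjacency graph of a strip of four triangles
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
w = random_walk(adj, start=0, length=5)
```

Feeding several such walks per mesh gives the gate localized views from which it can decide which expert's specialty the input resembles.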
cs.CV / 129 / 2603.00853
Neural Discrimination-Prompted Transformers for Efficient UHD Image Restoration and Enhancement
神经区分提示变换器用于高效超高清图像恢复与增强
Abstract
We propose a simple yet effective UHDPromer, a neural discrimination-prompted Transformer, for Ultra-High-Definition (UHD) image restoration and enhancement. Our UHDPromer is inspired by an interesting observation that there implicitly exist neural differences between high-resolution and low-resolution features, and exploring such differences can facilitate low-resolution feature representation. To this end, we first introduce Neural Discrimination Priors (NDP) to measure the differences and then integrate NDP into the proposed Neural Discrimination-Prompted Attention (NDPA) and Neural Discrimination-Prompted Network (NDPN). The proposed NDPA re-formulates the attention by incorporating NDP to globally perceive useful discrimination information, while the NDPN explores a continuous gating mechanism guided by NDP to selectively permit the passage of beneficial content. To enhance the quality of restored images, we propose a super-resolution-guided reconstruction approach, which super-resolves low-resolution features to guide the final UHD image restoration. Experiments show that UHDPromer achieves the best computational efficiency while still maintaining state-of-the-art performance on three UHD image restoration and enhancement tasks, including low-light image enhancement, image dehazing, and image deblurring. The source codes and pre-trained models will be made available at https://github.com/supersupercong/uhdpromer.
Chinese Translation
我们提出了一种简单而有效的UHDPromer,一种神经区分提示变换器,用于超高清(UHD)图像的恢复与增强。我们的UHDPromer受到一个有趣观察的启发,即高分辨率和低分辨率特征之间隐含存在神经差异,探索这些差异可以促进低分辨率特征的表示。为此,我们首先引入神经区分先验(Neural Discrimination Priors, NDP)来测量这些差异,然后将NDP整合到提出的神经区分提示注意力(Neural Discrimination-Prompted Attention, NDPA)和神经区分提示网络(Neural Discrimination-Prompted Network, NDPN)中。所提出的NDPA通过结合NDP重新构建注意力,以全局感知有用的区分信息,而NDPN则探索一种由NDP引导的连续门控机制,以选择性地允许有益内容的通过。为了提高恢复图像的质量,我们提出了一种超分辨率引导的重建方法,该方法通过超分辨率低分辨率特征来引导最终的UHD图像恢复。实验表明,UHDPromer在保持先进性能的同时,实现了最佳的计算效率,适用于3个UHD图像恢复与增强任务,包括低光照图像增强、图像去雾和图像去模糊。源代码和预训练模型将发布在https://github.com/supersupercong/uhdpromer。
cs.CV / 130 / 2603.00870
PPC-MT: Parallel Point Cloud Completion with Mamba-Transformer Hybrid Architecture
PPC-MT:基于Mamba-Transformer混合架构的并行点云补全
Abstract
Existing point cloud completion methods struggle to balance high-quality reconstruction with computational efficiency. To address this, we propose PPC-MT, a novel parallel framework for point cloud completion leveraging a hybrid Mamba-Transformer architecture. Our approach introduces an innovative parallel completion strategy guided by Principal Component Analysis (PCA), which imposes a geometrically meaningful structure on unordered point clouds, transforming them into ordered sets and decomposing them into multiple subsets. These subsets are reconstructed in parallel using a multi-head reconstructor. This structured parallel synthesis paradigm significantly enhances the uniformity of point distribution and detail fidelity, while preserving computational efficiency. By integrating Mamba's linear complexity for efficient feature extraction during encoding with the Transformer's capability to model fine-grained multi-sequence relationships during decoding, PPC-MT effectively balances efficiency and reconstruction accuracy. Extensive quantitative and qualitative experiments on benchmark datasets, including PCN, ShapeNet-55/34, and KITTI, demonstrate that PPC-MT outperforms state-of-the-art methods across multiple metrics, validating the efficacy of our proposed framework.
Chinese Translation
现有的点云补全方法在高质量重建与计算效率之间难以取得平衡。为了解决这一问题,我们提出了PPC-MT,一种利用混合Mamba-Transformer架构的新型并行点云补全框架。我们的方法引入了一种创新的并行补全策略,该策略以主成分分析(Principal Component Analysis, PCA)为指导,为无序点云施加几何上有意义的结构,将其转化为有序集合,并将其分解为多个子集。这些子集通过多头重构器并行重建。这种结构化的并行合成范式显著增强了点分布的均匀性和细节保真度,同时保持了计算效率。通过将Mamba在编码过程中高效特征提取的线性复杂度与Transformer在解码过程中建模细粒度多序列关系的能力相结合,PPC-MT有效地平衡了效率与重建精度。在包括PCN、ShapeNet-55/34和KITTI等基准数据集上的大量定量和定性实验表明,PPC-MT在多个指标上优于最先进的方法,验证了我们提出的框架的有效性。
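The PCA-guided decomposition described above can be sketched directly: estimate the principal axis of the cloud, sort points by their projection onto it (imposing an order on the unordered set), then cut the ordered sequence into contiguous subsets for the parallel reconstruction heads. The power-iteration eigensolver and the even split below are simplifying assumptions, not PPC-MT's exact procedure:

```python
def principal_axis(points, iters=50):
    """Leading PCA direction of a 3D point list via power iteration
    on the 3x3 covariance matrix."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    c = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
          for j in range(3)] for i in range(3)]
    v = [1.0, 0.0, 0.0]
    for _ in range(iters):
        w = [sum(c[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v, mean

def pca_ordered_subsets(points, n_subsets):
    """Order points along the principal axis, then split the ordered
    sequence into contiguous subsets for parallel reconstruction."""
    v, mean = principal_axis(points)
    ordered = sorted(points, key=lambda p: sum((p[k] - mean[k]) * v[k]
                                               for k in range(3)))
    size = len(ordered) // n_subsets
    return [ordered[i * size:(i + 1) * size] for i in range(n_subsets)]

# toy cloud elongated along x with slight y jitter
pts = [(float(x), 0.1 * (x % 2), 0.0) for x in range(8)]
subsets = pca_ordered_subsets(pts, n_subsets=2)
```

Because the split follows the dominant geometric axis, each subset is a spatially coherent slab, which is what makes per-subset parallel reconstruction meaningful.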
cs.CV / 131 / 2603.00878
MMTA: Multi Membership Temporal Attention for Fine-Grained Stroke Rehabilitation Assessment
MMTA:多成员时间注意力用于细粒度中风康复评估
Abstract
To support the iterative assessments involved in a person's rehabilitation, automated assessment of a person's abilities during daily activities requires temporally precise segmentation of fine-grained actions in therapy videos. Existing temporal action segmentation (TAS) models struggle to capture sub-second micro-movements while retaining exercise context, blurring rapid phase transitions and limiting reliable downstream assessment of motor recovery. We introduce Multi-Membership Temporal Attention (MMTA), a high-resolution temporal transformer for fine-grained rehabilitation assessment. Unlike standard temporal attention, which assigns each frame a single attention context per layer, MMTA lets each frame attend to multiple locally normalized temporal attention windows within the same layer. We fuse these concurrent temporal views via feature-space overlap resolution, preserving competing local contexts near transitions while enabling longer-range reasoning through layer-wise propagation. This increases boundary sensitivity without additional depth or multi-stage refinement. MMTA supports both video and wearable IMU inputs within a unified single-stage architecture, making it applicable to both clinical and home settings. MMTA consistently improves over the Global Attention transformer, boosting Edit Score by +1.3 (Video) and +1.6 (IMU) on StrokeRehab while further improving 50Salads by +3.3. Ablations confirm that performance gains stem from multi-membership temporal views rather than architectural complexity, offering a practical solution for resource-constrained rehabilitation assessment.
Chinese Translation
为了增强个人康复过程中的迭代评估,自动化评估个人在日常活动中的能力需要对治疗视频中的细粒度动作进行时间上精确的分割。现有的时间动作分割(TAS)模型在捕捉亚秒级微动作的同时保持运动上下文方面存在困难,模糊了快速的相位转换,限制了对运动恢复的可靠下游评估。我们提出了多成员时间注意力(MMTA),这是一种用于细粒度康复评估的高分辨率时间变换器。与标准时间注意力不同,后者为每一帧在每一层分配一个单一的注意力上下文,MMTA允许每一帧在同一层内关注多个局部归一化的时间注意力窗口。我们通过特征空间重叠解析融合这些并发的时间视图,保留了接近转换的竞争局部上下文,同时通过层级传播实现了更长范围的推理。这在不增加深度或多阶段细化的情况下提高了边界敏感性。MMTA在统一的单阶段架构中支持视频和可穿戴IMU输入,使其适用于临床和家庭环境。MMTA在StrokeRehab上持续优于全局注意力变换器,提升了编辑分数(Edit Score)+1.3(视频)和+1.6(IMU),同时在50Salads上进一步提升了+3.3。消融实验确认,性能提升源于多成员时间视图而非架构复杂性,为资源受限的康复评估提供了实用的解决方案。
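The core multi-membership idea (each frame belongs to several overlapping windows, each normalized locally, with the views then fused) can be illustrated on attention weights alone. The sketch below softmax-normalizes within each window and averages across the windows covering a frame; MMTA's actual fusion operates in feature space with overlap resolution, so the averaging here is a simplifying stand-in:

```python
import math

def multi_membership_attention(scores, window, stride):
    """Each frame joins every overlapping window that covers it;
    weights are softmax-normalised *within each window*, then the
    per-window views of each frame are averaged. Assumes the sliding
    windows jointly cover all frames."""
    n = len(scores)
    fused, counts = [0.0] * n, [0] * n
    start = 0
    while start + window <= n:
        idx = range(start, start + window)
        exps = [math.exp(scores[i]) for i in idx]
        z = sum(exps)
        for i, e in zip(idx, exps):
            fused[i] += e / z
            counts[i] += 1
        start += stride
    return [f / c for f, c in zip(fused, counts)]

# 6 frames, window 4, stride 2 -> interior frames get two local views
attn = multi_membership_attention([0.0] * 6, window=4, stride=2)
```

With non-uniform scores, frames near a phase transition receive competing weights from the windows on either side, which is exactly the boundary sensitivity the paper targets.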
cs.CV / 132 / 2603.00881
Uncertainty-Aware Concept and Motion Segmentation for Semi-Supervised Angiography Videos
基于不确定性感知的半监督血管造影视频概念与运动分割
Abstract
Segmentation of the main coronary artery from X-ray coronary angiography (XCA) sequences is crucial for the diagnosis of coronary artery diseases. However, this task is challenging due to issues such as blurred boundaries, inconsistent radiation contrast, complex motion patterns, and a lack of annotated images for training. Although Semi-Supervised Learning (SSL) can alleviate the annotation burden, conventional methods struggle with complicated temporal dynamics and unreliable uncertainty quantification. To address these challenges, we propose SAM3-based Teacher-student framework with Motion-Aware consistency and Progressive Confidence Regularization (SMART), a semi-supervised vessel segmentation approach for X-ray angiography videos. First, our method utilizes SAM3's unique promptable concept segmentation design and innovates a SAM3-based teacher-student framework to maximize the performance potential of both the teacher and the student. Second, we enhance segmentation by integrating the vessel mask warping technique and motion consistency loss to model complex vessel dynamics. To address the issue of unreliable teacher predictions caused by blurred boundaries and minimal contrast, we further propose a progressive confidence-aware consistency regularization to mitigate the risk of unreliable outputs. Extensive experiments on three datasets of XCA sequences from different institutions demonstrate that SMART achieves state-of-the-art performance while requiring significantly fewer annotations, making it particularly valuable for real-world clinical applications where labeled data is scarce. Our code is available at: https://github.com/qimingfan10/SMART.
Chinese Translation
从X射线冠状动脉造影(XCA)序列中分割主要冠状动脉对于冠状动脉疾病的诊断至关重要。然而,由于边界模糊、辐射对比不一致、复杂运动模式以及缺乏标注图像进行训练等问题,这一任务具有挑战性。尽管半监督学习(Semi-Supervised Learning, SSL)可以减轻标注负担,但传统方法在处理复杂的时间动态和不可靠的不确定性量化方面表现不佳。为了解决这些挑战,我们提出了一种基于SAM3的教师-学生框架,结合运动感知一致性和渐进置信度正则化(SMART),用于X射线血管造影视频的半监督血管分割。首先,我们的方法利用SAM3独特的可提示概念分割设计,并创新性地构建了基于SAM3的教师-学生框架,以最大化教师和学生的性能潜力。其次,我们通过整合血管掩膜变形技术和运动一致性损失来增强分割,以建模复杂的血管动态。为了应对由于边界模糊和对比度低导致的教师预测不可靠的问题,我们进一步提出了一种渐进的置信度感知一致性正则化,以降低不可靠输出的风险。在来自不同机构的三个XCA序列数据集上进行的大量实验表明,SMART在显著减少标注需求的同时实现了最先进的性能,使其在标注数据稀缺的现实临床应用中尤为有价值。我们的代码可在以下链接获取:https://github.com/qimingfan10/SMART。
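A minimal sketch of two ingredients named in the SMART abstract above — the mean-teacher update and progressive confidence-aware consistency — under assumed interfaces (binary vessel probabilities and a linear threshold schedule; neither detail is taken from the paper):

```python
import numpy as np

def ema_update(teacher_w, student_w, momentum=0.99):
    """Standard mean-teacher update: teacher weights track an
    exponential moving average of the student's."""
    return [momentum * t + (1 - momentum) * s for t, s in zip(teacher_w, student_w)]

def confidence_masked_consistency(teacher_prob, student_prob, step, total_steps,
                                  tau_start=0.9, tau_end=0.6):
    """Progressive confidence regularization sketch: the confidence
    threshold relaxes over training, and the consistency loss only
    counts pixels where the teacher is confident enough."""
    tau = tau_start + (tau_end - tau_start) * step / total_steps
    conf = np.maximum(teacher_prob, 1 - teacher_prob)   # vessel/background confidence
    mask = conf >= tau
    if not mask.any():
        return 0.0
    return float(np.mean((teacher_prob[mask] - student_prob[mask]) ** 2))
```

Early in training only very confident teacher pixels supervise the student; the relaxing threshold gradually admits more of the blurry, low-contrast regions the abstract describes.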
cs.CV / 133 / 2603.00887
VEMamba: Efficient Isotropic Reconstruction of Volume Electron Microscopy with Axial-Lateral Consistent Mamba
VEMamba:具有轴向-横向一致性的高效体积电子显微镜各向同性重建
Abstract
Volume Electron Microscopy (VEM) is crucial for 3D tissue imaging but often produces anisotropic data with poor axial resolution, hindering visualization and downstream analysis. Existing methods for isotropic reconstruction often suffer from neglecting abundant axial information and employing simple downsampling to simulate anisotropic data. To address these limitations, we propose VEMamba, an efficient framework for isotropic reconstruction. The core of VEMamba is a novel 3D Dependency Reordering paradigm, implemented via two key components: an Axial-Lateral Chunking Selective Scan Module (ALCSSM), which intelligently re-maps complex 3D spatial dependencies (both axial and lateral) into optimized 1D sequences for efficient Mamba-based modeling, explicitly enforcing axial-lateral consistency; and a Dynamic Weights Aggregation Module (DWAM) to adaptively aggregate these reordered sequence outputs for enhanced representational power. Furthermore, we introduce a realistic degradation simulation and then leverage Momentum Contrast (MoCo) to integrate this degradation-aware knowledge into the network for superior reconstruction. Extensive experiments on both simulated and real-world anisotropic VEM datasets demonstrate that VEMamba achieves highly competitive performance across various metrics while maintaining a lower computational footprint. The source code is available on GitHub: https://github.com/I2-Multimedia-Lab/VEMamba
Chinese Translation
体积电子显微镜(VEM)对于三维组织成像至关重要,但通常会产生各向异性数据,且轴向分辨率较差,阻碍了可视化和后续分析。现有的各向同性重建方法往往忽视了丰富的轴向信息,并采用简单的下采样来模拟各向异性数据。为了解决这些局限性,我们提出了VEMamba,一个高效的各向同性重建框架。VEMamba的核心是一个新颖的三维依赖重排序范式,通过两个关键组件实现:轴向-横向块选择扫描模块(ALCSSM),该模块智能地将复杂的三维空间依赖关系(包括轴向和横向)重新映射为优化的1D序列,以便进行高效的基于Mamba的建模,并明确强制执行轴向-横向一致性;以及一个动态权重聚合模块(DWAM),用于自适应地聚合这些重排序后的序列输出,以增强表示能力。此外,我们引入了现实的降解模拟,并利用动量对比(MoCo)将这种降解感知知识整合到网络中,以实现更优的重建。在模拟和真实世界的各向异性VEM数据集上进行的广泛实验表明,VEMamba在各种指标上都实现了高度竞争的性能,同时保持了较低的计算负担。源代码可在GitHub上获取:https://github.com/I2-Multimedia-Lab/VEMamba
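The 3D Dependency Reordering paradigm described above amounts to re-mapping one volume into several 1-D scan sequences before sequence modeling. A toy version of the idea (the real ALCSSM chunking and scan orders are more elaborate; this only shows an axial-fastest versus lateral-fastest ordering):

```python
import numpy as np

def axial_lateral_sequences(volume, chunk=2):
    """Re-map a (z, y, x) volume into two chunked 1-D sequences:
    one where the axial (z) index varies fastest and one where the
    lateral (x) index varies fastest — a simplified view of how
    3D dependencies can be reordered for 1-D sequence models."""
    axial = volume.transpose(1, 2, 0).reshape(-1)   # z varies fastest
    lateral = volume.reshape(-1)                    # x varies fastest
    return axial.reshape(-1, chunk), lateral.reshape(-1, chunk)
```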
cs.CV / 134 / 2603.00905
pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning
pySpatial:为零样本空间推理生成3D视觉程序
Abstract
Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that equips MLLMs with the ability to interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools including 3D reconstruction, camera-pose recovery, novel-view rendering, etc. These operations convert raw 2D inputs into an explorable 3D scene, enabling MLLMs to reason explicitly over structured spatial representations. Notably, pySpatial requires no gradient-based fine-tuning and operates in a fully zero-shot setting. Experimental evaluations on the challenging MindCube and Omni3D-Bench benchmarks demonstrate that our framework pySpatial consistently surpasses strong MLLM baselines; for instance, it outperforms GPT-4.1-mini by 12.94% on MindCube. Furthermore, we conduct real-world indoor navigation experiments where the robot can successfully traverse complex environments using route plans generated by pySpatial, highlighting the practical effectiveness of our approach.
Chinese Translation
多模态大型语言模型(MLLMs)在通用感知和推理方面展现了强大的能力,但在需要对3D世界进行空间理解的任务中仍然存在困难。为了解决这个问题,我们提出了pySpatial,一个视觉编程框架,使MLLMs能够通过生成Python代码与空间工具进行交互。给定一系列图像和自然语言查询,该模型构建对空间工具的函数调用,包括3D重建、相机姿态恢复、新视角渲染等。这些操作将原始的2D输入转换为可探索的3D场景,使MLLMs能够在结构化空间表示上进行明确的推理。值得注意的是,pySpatial不需要基于梯度的微调,并且在完全零样本的设置中运行。在具有挑战性的MindCube和Omni3D-Bench基准上的实验评估表明,我们的框架pySpatial始终超越强大的MLLM基线;例如,在MindCube上,它比GPT-4.1-mini高出12.94%。此外,我们还进行了真实世界的室内导航实验,机器人能够成功地使用pySpatial生成的路线规划穿越复杂环境,突显了我们方法的实际有效性。
cs.CV / 135 / 2603.00906
ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration
ShiftLUT:用于高效图像恢复的空间位移增强查找表
Abstract
Look-Up Table based methods have emerged as a promising direction for efficient image restoration tasks. Recent LUT-based methods focus on improving their performance by expanding the receptive field. However, they inevitably introduce extra computational and storage overhead, which hinders their deployment in edge devices. To address this issue, we propose ShiftLUT, a novel framework that attains the largest receptive field among all LUT-based methods while maintaining high efficiency. Our key insight lies in three complementary components. First, a Learnable Spatial Shift module (LSS) is introduced to expand the receptive field by applying learnable, channel-wise spatial offsets on feature maps. Second, we propose an asymmetric dual-branch architecture that allocates more computation to the information-dense branch, substantially reducing inference latency without compromising restoration quality. Finally, we incorporate a feature-level LUT compression strategy called Error-bounded Adaptive Sampling (EAS) to minimize the storage overhead. Compared to the previous state-of-the-art method TinyLUT, ShiftLUT achieves a 3.8$\times$ larger receptive field and improves average PSNR by over 0.21 dB across multiple standard benchmarks, while maintaining a small storage size and inference time.
Chinese Translation
基于查找表(Look-Up Table, LUT)的方法已成为高效图像恢复任务的有前景方向。近期的基于LUT的方法专注于通过扩展感受野来提高性能。然而,这不可避免地引入了额外的计算和存储开销,阻碍了它们在边缘设备上的部署。为了解决这一问题,我们提出了ShiftLUT,一个新颖的框架,它在所有基于LUT的方法中实现了最大的感受野,同时保持高效性。我们的关键见解在于三个互补的组件。首先,引入可学习空间位移模块(Learnable Spatial Shift, LSS),通过在特征图上应用可学习的通道级空间偏移来扩展感受野。其次,我们提出了一种不对称双分支架构,将更多计算分配给信息密集的分支,显著减少推理延迟而不影响恢复质量。最后,我们结合了一种称为误差界限自适应采样(Error-bounded Adaptive Sampling, EAS)的特征级LUT压缩策略,以最小化存储开销。与之前的最先进方法TinyLUT相比,ShiftLUT实现了3.8倍更大的感受野,并在多个标准基准上平均提高了0.21 dB的PSNR,同时保持了较小的存储大小和推理时间。
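The Learnable Spatial Shift module above can be pictured as a per-channel translation of the feature map. A non-learned stand-in (the paper learns the offsets end-to-end; here they are fixed inputs, and borders are zero-padded):

```python
import numpy as np

def spatial_shift(feat, offsets):
    """Shift each channel of a (C, H, W) feature map by its own
    (dy, dx) offset, zero-padding the exposed border. A hypothetical,
    non-learned stand-in for the LSS module."""
    C, H, W = feat.shape
    out = np.zeros_like(feat)
    for c, (dy, dx) in enumerate(offsets):
        src = feat[c,
                   max(0, -dy):H - max(0, dy),
                   max(0, -dx):W - max(0, dx)]
        out[c,
            max(0, dy):H - max(0, -dy),
            max(0, dx):W - max(0, -dx)] = src
    return out
```

Because each channel can look in a different direction, stacking such shifts before a pointwise lookup enlarges the effective receptive field without enlarging the table.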
cs.CV / 136 / 2603.00908
UD-SfPNet: An Underwater Descattering Shape-from-Polarization Network for 3D Normal Reconstruction
UD-SfPNet:一种用于3D法线重建的水下去散射偏振形状网络
Abstract
Underwater optical imaging is severely hindered by scattering, but polarization imaging offers the unique dual advantages of descattering and shape-from-polarization (SfP) 3D reconstruction. To exploit these advantages, this paper proposes UD-SfPNet, an underwater descattering shape-from-polarization network that leverages polarization cues for improved 3D surface normal prediction. The framework jointly models polarization-based image descattering and SfP normal estimation in a unified pipeline, avoiding error accumulation from sequential processing and enabling global optimization across both tasks. UD-SfPNet further incorporates a novel color embedding module to enhance geometric consistency by exploiting the relationship between color encodings and surface orientation. A detail enhancement convolution module is also included to better preserve high-frequency geometric details that are lost under scattering. Experiments on the MuS-Polar3D dataset show that the proposed method significantly improves reconstruction accuracy, achieving a mean surface normal angular error of 15.12$^\circ$ (the lowest among compared methods). These results confirm the efficacy of combining descattering with polarization-based shape inference, and highlight the practical significance and potential applications of UD-SfPNet for optical 3D imaging in challenging underwater environments. The code is available at https://github.com/WangPuyun/UD-SfPNet.
Chinese Translation
水下光学成像受到散射的严重影响,但偏振成像提供了去散射和基于偏振的形状重建(SfP)的独特双重优势。为了利用这些优势,本文提出了UD-SfPNet,一种水下去散射偏振形状网络,利用偏振线索来提高3D表面法线预测的准确性。该框架在统一的流程中联合建模基于偏振的图像去散射和SfP法线估计,避免了顺序处理带来的误差累积,并实现了两个任务之间的全局优化。UD-SfPNet进一步结合了一种新颖的颜色嵌入模块,通过利用颜色编码与表面方向之间的关系来增强几何一致性。同时,还包含了一个细节增强卷积模块,以更好地保留在散射下丢失的高频几何细节。在MuS-Polar3D数据集上的实验表明,所提出的方法显著提高了重建精度,达到了15.12$^\circ$的平均表面法线角度误差(在比较方法中最低)。这些结果确认了将去散射与基于偏振的形状推断相结合的有效性,并突显了UD-SfPNet在挑战性水下环境中进行光学3D成像的实际意义和潜在应用。代码可在 https://github.com/WangPuyun/UD-SfPNet 获取。
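The polarization cues feeding SfP are standard: four linear-polarizer images give the Stokes parameters, from which the degree and angle of linear polarization follow. A sketch of that preprocessing step (UD-SfPNet's network itself is not reproduced here):

```python
import numpy as np

def stokes_from_polarizer_images(i0, i45, i90, i135):
    """Recover Stokes parameters from four linear-polarizer images
    (0°, 45°, 90°, 135°) and derive the degree (DoLP) and angle
    (AoLP) of linear polarization — the usual SfP input cues."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)          # total intensity
    s1 = i0 - i90
    s2 = i45 - i135
    dolp = np.sqrt(s1**2 + s2**2) / np.maximum(s0, 1e-8)
    aolp = 0.5 * np.arctan2(s2, s1)
    return s0, dolp, aolp
```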
cs.CV / 137 / 2603.00911
On the Exact Algorithmic Extraction of Finite Tesselations Through Prime Extraction of Minimal Representative Forms
通过最小代表形式的质数提取精确算法提取有限镶嵌
Abstract
The identification of repeating patterns in discrete grids is fundamental to symbolic reasoning, algorithm synthesis and structural optimization across diverse computational domains. Although statistical approaches targeting noisy data can approximately recognize patterns, symbolic analysis utilizing deterministic extraction of periodic structures is underdeveloped. This paper aims to fill this gap by employing a hierarchical algorithm that discovers exact tessellations in finite planar grids, addressing the problem where multiple independent patterns may coexist within a hierarchical structure. The proposed method utilizes composite discovery (dual inspection and breadth-first pruning) for identifying rectangular regions with internal repetition, normalization to a minimal representative form, and prime extraction (selective duplication and hierarchical memoization) to account for irregular dimensions and to achieve efficient computation time. We evaluate scalability on grid sizes from 2x2 to 32x32, showing that overlap detection on simple repeating tiles exhibits processing time under 1 ms, whereas complex patterns that require exhaustive search and systematic exploration show exponential growth. This algorithm provides deterministic behavior for exact, axis-aligned, rectangular tessellations, addressing a critical gap in symbolic grid analysis techniques, applicable to puzzle-solving reasoning tasks and identification of exact repeating structures in discrete symbolic domains.
Chinese Translation
在符号推理、算法合成和结构优化等多种计算领域中,识别离散网格中的重复模式是基础性的。尽管针对噪声数据的统计方法可以大致识别模式,但利用确定性提取周期结构的符号分析仍然不够成熟。本文旨在填补这一空白,通过采用一种分层算法,在有限平面网格中发现精确的镶嵌,解决多个独立模式可能共存于分层结构中的问题。所提出的方法利用复合发现(双重检查和广度优先剪枝)来识别具有内部重复的矩形区域,归一化为最小代表形式,并进行质数提取(选择性复制和分层记忆化)以考虑不规则维度并实现高效的计算时间。我们在2x2到32x32的网格大小上评估了可扩展性,结果表明简单重复瓷砖的重叠检测处理时间低于1毫秒,而需要穷举搜索和系统探索的复杂模式则表现出指数增长。该算法为精确的、轴对齐的矩形镶嵌提供了确定性行为,填补了符号网格分析技术中的一个关键空白,适用于解谜推理任务和在离散符号领域中识别精确重复结构。
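The normalization to a minimal representative form can be illustrated by the simplest version of the problem: finding the smallest rectangular tile that exactly tessellates a grid. A brute-force sketch (the paper's composite discovery and hierarchical memoization are what make this efficient; this version just checks every divisor pair):

```python
import numpy as np

def minimal_tile(grid):
    """Smallest (h, w) such that the grid is an exact tiling of its
    top-left h-by-w block — the 'minimal representative form' of an
    axis-aligned rectangular tessellation."""
    H, W = grid.shape
    for h in range(1, H + 1):
        if H % h:
            continue
        for w in range(1, W + 1):
            if W % w:
                continue
            tile = grid[:h, :w]
            if np.array_equal(np.tile(tile, (H // h, W // w)), grid):
                return h, w
    return H, W
```

A non-periodic grid is its own minimal form, so the function always returns.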
cs.CV / 138 / 2603.00912
VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection
VGGT-Det:挖掘VGGT内部先验用于无传感器几何的多视角室内3D物体检测
Abstract
Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where there are no sensor-provided geometric inputs (multi-view poses or depth). Recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two novel key components: (i) Attention-Guided Query Generation (AG): exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; (ii) Query-Driven Feature Aggregation (QD): a learnable See-Query interacts with object queries to 'see' what they need, and then dynamically aggregates multi-level geometric features across VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det significantly surpasses the best-performing method in the SG-Free setting by 4.4 and 8.6 [email protected] on ScanNet and ARKitScenes, respectively. Ablation study shows that VGGT's internally learned semantic and geometric priors can be effectively leveraged by our AG and QD.
Chinese Translation
当前的多视角室内3D物体检测器依赖于获取成本高昂的传感器几何信息(即,精确校准的多视角相机姿态),以将多视角信息融合为全局场景表示,这限制了其在现实场景中的应用。我们针对一个更为实用的设置:无传感器几何(SG-Free)多视角室内3D物体检测,其中没有传感器提供的几何输入(多视角姿态或深度)。最近的视觉几何基础变换器(Visual Geometry Grounded Transformer, VGGT)表明,可以直接从图像中推断出强大的3D线索。在此基础上,我们提出了VGGT-Det,这是第一个专为SG-Free多视角室内3D物体检测量身定制的框架。我们的方法不仅仅是利用VGGT的预测,而是将VGGT编码器集成到基于变换器的管道中。为了有效利用VGGT内部的语义和几何先验,我们引入了两个新颖的关键组件:(i)注意力引导的查询生成(Attention-Guided Query Generation, AG):利用VGGT注意力图作为语义先验来初始化物体查询,通过关注物体区域而保留全局空间结构,从而改善定位;(ii)查询驱动的特征聚合(Query-Driven Feature Aggregation, QD):一个可学习的See-Query与物体查询交互,以“查看”它们所需的内容,然后动态聚合VGGT层之间的多层几何特征,逐步将2D特征提升到3D。实验表明,VGGT-Det在SG-Free设置中显著超越了最佳表现方法,在ScanNet和ARKitScenes上分别提高了4.4和8.6的[email protected]。消融研究表明,VGGT内部学习的语义和几何先验可以通过我们的AG和QD有效利用。
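Attention-Guided query generation reduces, in caricature, to seeding object queries from the highest-attention locations. A hypothetical simplification (the paper's AG preserves global spatial structure rather than just taking a top-k as done here):

```python
import numpy as np

def attention_guided_queries(attn_map, feats, num_queries=4):
    """Pick the spatial locations with the highest aggregated attention
    mass and initialize object queries from the features there — an
    illustrative stand-in for the AG component."""
    flat = attn_map.reshape(-1)
    top = np.argsort(flat)[::-1][:num_queries]   # highest attention first
    return feats.reshape(-1, feats.shape[-1])[top]
```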
cs.CV / 139 / 2603.00918
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
通过内在自信奖励改善文本到图像生成
Abstract
Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics. We introduce ARC (Adaptive Rewarding by self-Confidence), a post-training framework that replaces external reward supervision with an internal self-confidence signal, obtained by evaluating how accurately the model recovers injected noise under self-denoising probes. ARC converts this intrinsic signal into scalar rewards, enabling fully unsupervised optimization without additional datasets, annotators, or reward models. Empirically, by reinforcing high-confidence generations, ARC delivers consistent gains in compositional generation, text rendering and text-image alignment over the baseline. We also find that integrating ARC with external rewards results in a complementary improvement, with alleviated reward hacking.
Chinese Translation
文本到图像生成推动了设计、媒体和数据增强等领域的内容创作。对文本到图像生成模型的后期训练是一条有前景的路径,旨在更好地匹配人类偏好、事实准确性和提升美学。我们提出了ARC(自适应自信奖励),这是一种后期训练框架,它用内部自信信号替代外部奖励监督,该信号通过评估模型在自我去噪探测下恢复注入噪声的准确程度而获得。ARC将这一内在信号转换为标量奖励,从而实现完全无监督的优化,无需额外的数据集、注释者或奖励模型。实证研究表明,通过强化高自信度生成,ARC在组合生成、文本渲染和文本-图像对齐方面相较于基线模型提供了一致的提升。我们还发现,将ARC与外部奖励结合使用会产生互补的改进,并减轻奖励操控问题。
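The intrinsic self-confidence signal can be sketched as a noise-recovery probe: inject known noise, ask the model to predict it back, and map the recovery error to a scalar reward. The `denoise_fn` interface and the exponential mapping below are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def self_confidence_reward(denoise_fn, latent, rng, sigma=0.1):
    """ARC-style intrinsic reward sketch: inject Gaussian noise into a
    latent, have the model estimate that noise back, and reward
    accurate recovery. `denoise_fn` maps a noised latent to a noise
    estimate (hypothetical interface)."""
    noise = rng.normal(0, sigma, latent.shape)
    noised = latent + noise
    pred = denoise_fn(noised)
    mse = np.mean((pred - noise) ** 2)
    return float(np.exp(-mse / sigma**2))   # 1.0 = perfectly confident recovery
```

No external reward model or annotator appears anywhere in the signal, which is the point of the framework.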
cs.CV / 140 / 2603.00919
DriveCode: Domain Specific Numerical Encoding for LLM-Based Autonomous Driving
DriveCode:基于领域特定的数字编码用于基于大语言模型的自动驾驶
Abstract
Large language models (LLMs) have shown great promise for autonomous driving. However, discretizing numbers into tokens limits precise numerical reasoning, fails to reflect the positional significance of digits in the training objective, and makes it difficult to achieve both decoding efficiency and numerical precision. These limitations affect both the processing of sensor measurements and the generation of precise control commands, creating a fundamental barrier for deploying LLM-based autonomous driving systems. In this paper, we introduce DriveCode, a novel numerical encoding method that represents numbers as dedicated embeddings rather than discrete text tokens. DriveCode employs a number projector to map numbers into the language model's hidden space, enabling seamless integration with visual and textual features in a unified multimodal sequence. Evaluated on OmniDrive, DriveGPT4, and DriveGPT4-V2 datasets, DriveCode demonstrates superior performance in trajectory prediction and control signal generation, confirming its effectiveness for LLM-based autonomous driving systems.
Chinese Translation
大型语言模型(LLMs)在自动驾驶领域展现出了巨大的潜力。然而,将数字离散化为标记限制了精确的数值推理,未能反映数字在训练目标中的位置重要性,并且使得实现解码效率与数值精度的平衡变得困难。这些限制影响了传感器测量的处理和精确控制命令的生成,成为部署基于LLM的自动驾驶系统的根本障碍。本文介绍了DriveCode,一种新颖的数字编码方法,它将数字表示为专用的嵌入,而不是离散的文本标记。DriveCode采用数字投影器将数字映射到语言模型的隐空间,从而实现与视觉和文本特征在统一的多模态序列中的无缝集成。在OmniDrive、DriveGPT4和DriveGPT4-V2数据集上的评估表明,DriveCode在轨迹预测和控制信号生成方面表现优越,确认了其在基于LLM的自动驾驶系统中的有效性。
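The number projector above amounts to replacing digit tokens with a small learned map from raw scalars into the model's hidden space. A toy two-layer version (shapes and the nonlinearity are illustrative, not DriveCode's actual architecture):

```python
import numpy as np

def number_projector(values, W1, b1, W2, b2):
    """Hypothetical number projector: embed raw scalar measurements
    directly into a d_model-dimensional hidden space with a tiny MLP,
    instead of discretizing them into digit tokens."""
    h = np.tanh(values[:, None] * W1 + b1)   # (N, hidden)
    return h @ W2 + b2                        # (N, d_model)
```

The resulting embeddings can then sit in the same multimodal sequence as visual and text features, which is the integration point the abstract describes.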
cs.CV / 141 / 2603.00931
Learning to Weigh Waste: A Physics-Informed Multimodal Fusion Framework and Large-Scale Dataset for Commercial and Industrial Applications
学习权重废物:一种物理信息驱动的多模态融合框架及其在商业和工业应用中的大规模数据集
Abstract
Accurate weight estimation of commercial and industrial waste is important for efficient operations, yet image-based estimation remains difficult because similar-looking objects may have different densities, and the visible size changes with camera distance. Addressing this problem, we propose the Multimodal Weight Predictor (MWP) framework that estimates waste weight by combining RGB images with physics-informed metadata, including object dimensions, camera distance, and camera height. We also introduce Waste-Weight-10K, a real-world dataset containing 10,421 synchronized image-metadata pairs collected from logistics and recycling sites. The dataset covers 11 waste categories and a wide weight range from 3.5 to 3,450 kg. Our model uses a Vision Transformer for visual features and a dedicated metadata encoder for geometric and category information, combining them with Stacked Mutual Attention Fusion that allows visual and physical cues to guide each other. This helps the model manage perspective effects and link objects to material properties. To ensure stable performance across the wide weight range, we train the model using Mean Squared Logarithmic Error. On the test set, the proposed method achieves 88.06 kg Mean Absolute Error (MAE), 6.39% Mean Absolute Percentage Error (MAPE), and an R2 coefficient of 0.9548. The model shows strong accuracy for light objects in the 0-100 kg range with 2.38 kg MAE and 3.1% MAPE, maintaining reliable performance for heavy waste in the 1000-2000 kg range with 11.1% MAPE. Finally, we incorporate a physically grounded explanation module using Shapley Additive Explanations (SHAP) and a large language model to provide clear, human-readable explanations for each prediction.
Chinese Translation
准确估计商业和工业废物的重量对于高效运营至关重要,但基于图像的估计仍然困难,因为外观相似的物体可能具有不同的密度,并且可见大小会随着相机距离的变化而变化。为了解决这个问题,我们提出了多模态重量预测器(Multimodal Weight Predictor, MWP)框架,通过结合RGB图像和物理信息驱动的元数据(包括物体尺寸、相机距离和相机高度)来估计废物重量。我们还引入了Waste-Weight-10K,这是一个包含10,421个同步图像-元数据的数据集,数据来自物流和回收现场。该数据集涵盖了11个废物类别,重量范围广泛,从3.5公斤到3,450公斤不等。我们的模型使用视觉变换器(Vision Transformer)提取视觉特征,并使用专门的元数据编码器处理几何和类别信息,通过堆叠互注意力融合(Stacked Mutual Attention Fusion)将它们结合在一起,使视觉和物理线索相互引导。这有助于模型管理透视效应并将物体与材料属性联系起来。为了确保在广泛的重量范围内保持稳定的性能,我们使用均方对数误差(Mean Squared Logarithmic Error)训练模型。在测试集上,所提出的方法实现了88.06公斤的平均绝对误差(Mean Absolute Error, MAE)、6.39%的平均绝对百分比误差(Mean Absolute Percentage Error, MAPE)和0.9548的R²系数。该模型在0-100公斤范围内的轻型物体上表现出强大的准确性,MAE为2.38公斤,MAPE为3.1%,在1000-2000公斤范围内的重型废物上保持可靠的性能,MAPE为11.1%。最后,我们结合了一个基于物理的解释模块,使用Shapley加法解释(Shapley Additive Explanations, SHAP)和大型语言模型,为每个预测提供清晰、易于理解的解释。
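The MSLE training loss mentioned above is what keeps a fixed absolute error from dominating on heavy items: in log space, errors become effectively relative, so a 10 kg miss on a 3,450 kg pallet costs far less than a 10 kg miss on a 5 kg parcel.

```python
import numpy as np

def msle(pred_kg, true_kg):
    """Mean Squared Logarithmic Error: squared difference of
    log(1 + x), which penalizes relative rather than absolute error."""
    return float(np.mean((np.log1p(pred_kg) - np.log1p(true_kg)) ** 2))
```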
cs.CV / 142 / 2603.00938
Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos
超越8位的视觉:HDR用户生成视频的主观与客观质量评估
Abstract
High Dynamic Range (HDR) user-generated (UGC) videos are rapidly proliferating across social platforms, yet most perceptual video quality assessment (VQA) systems remain tailored to Standard Dynamic Range (SDR). HDR has a higher bit depth, wide color gamut, and elevated luminance range, exposing distortions such as near-black crushing, highlight clipping, banding, and exposure flicker that amplify UGC artifacts and challenge SDR models. To catalyze progress, we curate Beyond8Bits, a large-scale subjective dataset of 44K videos from 6.5K sources with over 1.5M crowd ratings, spanning diverse scenes, capture conditions, and compression settings. We further introduce HDR-Q, the first Multimodal Large Language Model (MLLM) for HDR-UGC VQA. We propose (i) a novel HDR-aware vision encoder to produce HDR-sensitive embeddings, and (ii) HDR-Aware Policy Optimization (HAPO), an RL finetuning framework that anchors reasoning to HDR cues. HAPO augments GRPO via an HDR-SDR contrastive KL that encourages token reliance on HDR inputs and a Gaussian weighted regression reward for fine-grained MOS calibration. Across Beyond8Bits and public HDR-VQA benchmarks, HDR-Q delivers state-of-the-art performance.
Chinese Translation
高动态范围(HDR)用户生成(UGC)视频在社交平台上迅速传播,但大多数感知视频质量评估(VQA)系统仍然针对标准动态范围(SDR)进行优化。HDR具有更高的位深度、广色域和更高的亮度范围,暴露出近黑色压缩、亮部剪切、带状现象和曝光闪烁等失真,这些失真加剧了UGC伪影,并对SDR模型提出了挑战。为了推动这一领域的进展,我们整理了Beyond8Bits,这是一个大型主观数据集,包含来自6.5K来源的44K视频,获得超过150万的众包评分,涵盖多样的场景、拍摄条件和压缩设置。我们进一步推出HDR-Q,这是第一个用于HDR-UGC VQA的多模态大型语言模型(MLLM)。我们提出了(i)一种新颖的HDR感知视觉编码器,以生成对HDR敏感的嵌入,以及(ii)HDR感知策略优化(HAPO),这是一种将推理锚定于HDR线索的强化学习微调框架。HAPO通过HDR-SDR对比KL增强GRPO,鼓励令牌依赖于HDR输入,并通过高斯加权回归奖励进行细粒度的主观意见评分(MOS)校准。在Beyond8Bits和公共HDR-VQA基准测试中,HDR-Q展现了最先进的性能。
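The Gaussian-weighted regression reward used in HAPO can be written directly; the width `sigma` here is an assumed hyperparameter, not a value from the paper:

```python
import numpy as np

def gaussian_mos_reward(pred_mos, true_mos, sigma=0.25):
    """Gaussian-weighted regression reward sketch: full reward for an
    exact MOS prediction, decaying smoothly with the error — a dense
    signal for fine-grained MOS calibration."""
    return float(np.exp(-((pred_mos - true_mos) ** 2) / (2 * sigma ** 2)))
```

Compared with a hard hit-or-miss reward, the smooth decay still rewards near-misses, which is what makes fine-grained calibration trainable with RL.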
cs.CV / 143 / 2603.00947
\textsc{Mobile-VTON}: High-Fidelity On-Device Virtual Try-On
移动虚拟试衣(Mobile-VTON):高保真设备端虚拟试衣
Abstract
Virtual try-on (VTON) has recently achieved impressive visual fidelity, but most existing systems require uploading personal photos to cloud-based GPUs, raising privacy concerns and limiting on-device deployment. To address this, we present \textsc{Mobile-VTON}, a high-quality, privacy-preserving framework that enables fully offline virtual try-on on commodity mobile devices using only a single user image and a garment image. \textsc{Mobile-VTON} introduces a modular TeacherNet--GarmentNet--TryonNet (TGT) architecture that integrates knowledge distillation, garment-conditioned generation, and garment alignment into a unified pipeline optimized for on-device efficiency. Within this framework, we propose a Feature-Guided Adversarial (FGA) Distillation strategy that combines teacher supervision with adversarial learning to better match real-world image distributions. GarmentNet is trained with a trajectory-consistency loss to preserve garment semantics across diffusion steps, while TryonNet uses latent concatenation and lightweight cross-modal conditioning to enable robust garment-to-person alignment without large-scale pretraining. By combining these components, \textsc{Mobile-VTON} achieves high-fidelity generation with low computational overhead. Experiments on VITON-HD and DressCode at $1024{\times}768$ show that it matches or outperforms strong server-based baselines while running entirely offline. These results demonstrate that high-quality VTON is not only feasible but also practical on-device, offering a secure solution for real-world applications.
Chinese Translation
虚拟试衣(VTON)最近在视觉保真度方面取得了令人印象深刻的进展,但大多数现有系统需要将个人照片上传到基于云的GPU,这引发了隐私问题并限制了在设备上的部署。为了解决这一问题,我们提出了移动虚拟试衣(Mobile-VTON),这是一个高质量、保护隐私的框架,能够仅使用单张用户图像和一张服装图像在普通移动设备上实现完全离线的虚拟试衣。Mobile-VTON引入了一个模块化的TeacherNet-GarmentNet-TryonNet(TGT)架构,将知识蒸馏、服装条件生成和服装对齐集成到一个统一的管道中,以优化设备上的效率。在该框架内,我们提出了一种特征引导对抗(Feature-Guided Adversarial, FGA)蒸馏策略,将教师监督与对抗学习相结合,以更好地匹配真实世界的图像分布。GarmentNet通过轨迹一致性损失进行训练,以在扩散步骤中保持服装语义的一致性,而TryonNet则利用潜在拼接和轻量级跨模态条件化,实现了在没有大规模预训练的情况下,稳健的服装与人物对齐。通过结合这些组件,Mobile-VTON实现了高保真生成且计算开销低。在VITON-HD和DressCode数据集上进行的$1024{\times}768$实验表明,它的性能与强大的基于服务器的基线相匹配或超越,同时完全离线运行。这些结果表明,高质量的虚拟试衣不仅是可行的,而且在设备上也是实用的,为现实世界应用提供了安全的解决方案。
cs.CV / 144 / 2603.00949
StegoNGP: 3D Cryptographic Steganography using Instant-NGP
StegoNGP:基于即时神经图形原语的三维密码学隐写术
Abstract
Recently, Instant Neural Graphics Primitives (Instant-NGP) has achieved significant success in rapid 3D scene reconstruction, but securely embedding high-capacity hidden data, such as an entire 3D scene, remains a challenge. Existing methods rely on external decoders, require architectural modifications, and suffer from limited capacity, which makes them easily detectable. We propose a novel parameter-free 3D Cryptographic Steganography using Instant-NGP (StegoNGP), which leverages the Instant-NGP hash encoding function as a key-controlled scene switcher. By associating a default key with a cover scene and a secret key with a hidden scene, our method trains a single model to interweave both representations within the same network weights. The resulting model is indistinguishable from a standard Instant-NGP in architecture and parameter count. We also introduce an enhanced Multi-Key scheme, which assigns multiple independent keys across hash levels, dramatically expanding the key space and providing high robustness against partial key disclosure attacks. Experimental results demonstrated that StegoNGP can hide a complete high-quality 3D scene with strong imperceptibility and security, providing a new paradigm for high-capacity, undetectable information hiding in neural fields. The code can be found at https://github.com/jiang-wenxiang/StegoNGP.
Chinese Translation
近期,即时神经图形原语(Instant Neural Graphics Primitives,Instant-NGP)在快速三维场景重建方面取得了显著成功,但安全地嵌入高容量的隐蔽数据(例如整个三维场景)仍然是一个挑战。现有方法依赖于外部解码器,需要架构修改,并且容量有限,容易被检测到。我们提出了一种新颖的无参数三维密码学隐写术,利用即时神经图形原语(StegoNGP),该方法利用Instant-NGP哈希编码函数作为密钥控制的场景切换器。通过将默认密钥与覆盖场景关联,将秘密密钥与隐藏场景关联,我们的方法训练一个单一模型,在相同的网络权重中交织这两种表示。生成的模型在架构和参数数量上与标准的Instant-NGP无差异。我们还引入了一种增强的多密钥方案,在哈希级别上分配多个独立密钥,极大地扩展了密钥空间,并提供了对部分密钥泄露攻击的高鲁棒性。实验结果表明,StegoNGP能够以强隐蔽性和安全性隐藏完整的高质量三维场景,为神经场中的高容量、不可检测的信息隐藏提供了一种新范式。代码可在 https://github.com/jiang-wenxiang/StegoNGP 找到。
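The key-controlled scene switcher rests on Instant-NGP's spatial hash: XOR-ing a key into the hash redirects lookups to different table entries, so a single set of weights can serve two interleaved scenes. A simplified sketch (the primes follow the Instant-NGP paper; the key mixing here is illustrative, not StegoNGP's exact scheme):

```python
import numpy as np

def keyed_hash(coords, key, table_size=2**14):
    """Instant-NGP-style spatial hash with a key XOR-ed in. Different
    keys index different regions of the same hash table, which is the
    mechanism a key-controlled scene switcher can exploit."""
    primes = np.array([1, 2654435761, 805459861], dtype=np.uint64)
    h = np.zeros(coords.shape[0], dtype=np.uint64)
    for d in range(coords.shape[1]):
        h ^= coords[:, d].astype(np.uint64) * primes[d]   # wraps mod 2**64
    return (h ^ np.uint64(key)) % np.uint64(table_size)
```

The multi-key variant in the abstract would apply an independent key at each hash resolution level, multiplying the effective key space.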
cs.CV / 145 / 2603.00952
Decoupling Motion and Geometry in 4D Gaussian Splatting
在4D高斯点云中解耦运动与几何
Abstract
High-fidelity reconstruction of dynamic scenes is an important yet challenging problem. While recent 4D Gaussian Splatting (4DGS) has demonstrated the ability to model temporal dynamics, it couples Gaussian motion and geometric attributes within a single covariance formulation, which limits its expressiveness for complex motions and often leads to visual artifacts. To address this, we propose VeGaS, a novel velocity-based 4D Gaussian Splatting framework that decouples Gaussian motion and geometry. Specifically, we introduce a Galilean shearing matrix that explicitly incorporates time-varying velocity to flexibly model complex non-linear motions, while strictly isolating the effects of Gaussian motion from the geometry-related conditional Gaussian covariance. Furthermore, a Geometric Deformation Network is introduced to refine Gaussian shapes and orientations using spatio-temporal context and velocity cues, enhancing temporal geometric modeling. Extensive experiments on public datasets demonstrate that VeGaS achieves state-of-the-art performance.
Chinese Translation
动态场景的高保真重建是一个重要但具有挑战性的问题。尽管最近的4D高斯点云(4DGS)展示了建模时间动态的能力,但它将高斯运动和几何属性耦合在单一的协方差公式中,这限制了其对复杂运动的表达能力,并且常常导致视觉伪影。为了解决这个问题,我们提出了VeGaS,一个新颖的基于速度的4D高斯点云框架,能够解耦高斯运动和几何。具体而言,我们引入了一个伽利略剪切矩阵,该矩阵明确地结合了时间变化的速度,以灵活地建模复杂的非线性运动,同时严格隔离高斯运动对几何相关条件高斯协方差的影响。此外,我们引入了一个几何变形网络,利用时空上下文和速度线索来细化高斯形状和方向,从而增强时间几何建模。对公共数据集的广泛实验表明,VeGaS达到了最先进的性能。
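The decoupling can be made concrete with a schematic Galilean shear (the notation below is ours, for illustration only, not the paper's exact formulation): motion enters solely through a velocity term in the shear matrix, while the conditional covariance stays purely geometric.

```latex
% Spacetime coordinates (x, t); v(t) is a time-varying velocity.
% The shear moves the Gaussian without touching its geometry:
\[
A(t) = \begin{pmatrix} I_{3} & v(t)\,t \\ 0^{\top} & 1 \end{pmatrix},
\qquad
\mu(t) = \mu_{0} + v(t)\,t,
\qquad
\Sigma_{4\mathrm{D}}(t) = A(t)\,\Sigma_{\mathrm{geo}}\,A(t)^{\top},
\]
% so changing v(t) re-parameterizes the motion while the geometric
% covariance \Sigma_{geo}, and hence the splatted shape, is untouched.
```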
cs.CV / 146 / 2603.00976
PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation
PreciseCache:高效且高保真视频生成的精确特征缓存
Abstract
High computational costs and slow inference hinder the practical application of video generation models. While prior works accelerate the generation process through feature caching, they often suffer from notable quality degradation. In this work, we reveal that this issue arises from their inability to distinguish truly redundant features, which leads to the unintended skipping of computations on important features. To address this, we propose \textbf{PreciseCache}, a plug-and-play framework that precisely detects and skips truly redundant computations, thereby accelerating inference without sacrificing quality. Specifically, PreciseCache contains two components: LFCache for step-wise caching and BlockCache for block-wise caching. For LFCache, we compute the Low-Frequency Difference (LFD) between the prediction features of the current step and those from the previous cached step. Empirically, we observe that LFD serves as an effective measure of step-wise redundancy, accurately detecting highly redundant steps whose computation can be skipped through reusing cached features. To further accelerate generation within each non-skipped step, we propose BlockCache, which precisely detects and skips redundant computations at the block level within the network. Extensive experiments on various backbones demonstrate the effectiveness of our PreciseCache, which achieves an average of 2.6x speedup without noticeable quality loss. Source code will be released.
Chinese Translation
高计算成本和缓慢的推理速度阻碍了视频生成模型的实际应用。虽然之前的研究通过特征缓存加速生成过程,但它们往往会遭遇显著的质量下降。在本研究中,我们揭示了这一问题源于它们无法区分真正冗余的特征,从而导致重要特征的计算被意外跳过。为了解决这个问题,我们提出了\textbf{PreciseCache},一个即插即用的框架,能够精确检测并跳过真正冗余的计算,从而加速推理而不牺牲质量。具体而言,PreciseCache 包含两个组件:LFCache 用于逐步缓存,BlockCache 用于块级缓存。对于 LFCache,我们计算当前步骤的预测特征与之前缓存步骤的特征之间的低频差异(Low-Frequency Difference, LFD)。实证研究表明,LFD 是逐步冗余的有效度量,能够准确检测出可以通过重用缓存特征跳过计算的高度冗余步骤。为了进一步加速每个未跳过步骤的生成,我们提出了 BlockCache,它能够在网络内部精确检测并跳过块级冗余计算。在各种基础网络上的大量实验证明了我们的 PreciseCache 的有效性:平均实现了 2.6 倍的加速而没有明显的质量损失。源代码将会发布。
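The Low-Frequency Difference test can be sketched with a 1-D DFT low-pass: compare only the low-frequency content of the current and cached features, and reuse the cache when they agree. The band size and threshold below are illustrative assumptions:

```python
import numpy as np

def lowpass(x, keep=4):
    """Keep only the lowest `keep` frequency bins of a 1-D signal."""
    f = np.fft.rfft(x)
    f[keep:] = 0
    return np.fft.irfft(f, n=x.shape[0])

def should_skip(current_feat, cached_feat, threshold=1e-3):
    """LFD redundancy test sketch: if the low-frequency content of the
    current step's features barely differs from the cached step's, the
    step is treated as redundant and cached features are reused."""
    lfd = np.mean((lowpass(current_feat) - lowpass(cached_feat)) ** 2)
    return lfd < threshold
```

High-frequency jitter between steps therefore does not block a skip; only a genuine low-frequency change forces recomputation.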
cs.CV / 147 / 2603.00978
EraseAnything++: Enabling Concept Erasure in Rectified Flow Transformers Leveraging Multi-Object Optimization
EraseAnything++:利用多目标优化实现整流流变换器中的概念消除
Abstract
Removing undesired concepts from large-scale text-to-image (T2I) and text-to-video (T2V) diffusion models while preserving overall generative quality remains a major challenge, particularly as modern models such as Stable Diffusion v3, Flux, and OpenSora employ flow-matching and transformer-based architectures and extend to long-horizon video generation. Existing concept erasure methods, designed for earlier T2I/T2V models, often fail to generalize to these paradigms. To address this issue, we propose EraseAnything++, a unified framework for concept erasure in both image and video diffusion models with flow-matching objectives. Central to our approach is formulating concept erasure as a constrained multi-objective optimization problem that explicitly balances concept removal with preservation of generative utility. To solve the resulting conflicting objectives, we introduce an efficient utility-preserving unlearning strategy based on implicit gradient surgery. Furthermore, by integrating LoRA-based parameter tuning with attention-level regularization, our method anchors erasure on key visual representations and propagates it consistently across spatial and temporal dimensions. In the video setting, we further enhance consistency through an anchor-and-propagate mechanism that initializes erasure on reference frames and enforces it throughout subsequent transformer layers, thereby mitigating temporal drift. Extensive experiments on both image and video benchmarks demonstrate that EraseAnything++ substantially outperforms prior methods in erasure effectiveness, generative fidelity, and temporal consistency, establishing a new state of the art for concept erasure in next-generation diffusion models.
Chinese Translation
在大规模文本到图像(T2I)和文本到视频(T2V)扩散模型中去除不需要的概念,同时保持整体生成质量,仍然是一个主要挑战,尤其是现代模型如Stable Diffusion v3、Flux和OpenSora采用流匹配和基于变换器的架构,并扩展到长时间视频生成。现有的概念消除方法针对早期的T2I/T2V模型设计,往往无法推广到这些新范式。为了解决这一问题,我们提出了EraseAnything++,这是一个统一的框架,用于在具有流匹配目标的图像和视频扩散模型中进行概念消除。我们的方法的核心是将概念消除表述为一个受限的多目标优化问题,明确平衡概念移除与生成效用的保留。为了解决由此产生的冲突目标,我们引入了一种基于隐式梯度手术的高效效用保留的无学习策略。此外,通过将基于LoRA的参数调优与注意力级别的正则化相结合,我们的方法将消除锚定在关键视觉表示上,并在空间和时间维度上持续传播。在视频设置中,我们通过锚定和传播机制进一步增强一致性,该机制在参考帧上初始化消除,并在后续的变换器层中强制执行,从而减轻时间漂移。在图像和视频基准上的大量实验表明,EraseAnything++在消除有效性、生成保真度和时间一致性方面显著优于先前的方法,为下一代扩散模型中的概念消除建立了新的最先进水平。
cs.CV / 148 / 2603.00979
Fake It Right: Injecting Anatomical Logic into Synthetic Supervised Pre-training for Medical Segmentation
正确伪造:将解剖逻辑注入医疗分割的合成监督预训练
Abstract
Vision Transformers (ViTs) excel in 3D medical segmentation but require massive annotated datasets. While Self-Supervised Learning (SSL) mitigates this using unlabeled data, it still faces strict privacy and logistical barriers. Formula-Driven Supervised Learning (FDSL) offers a privacy-preserving alternative by pre-training on synthetic mathematical primitives. However, a critical semantic gap limits its efficacy: generic shapes lack the morphological fidelity, fixed spatial layouts, and inter-organ relationships of real anatomy, preventing models from learning essential global structural priors. To bridge this gap, we propose an Anatomy-Informed Synthetic Supervised Pre-training framework unifying FDSL's infinite scalability with anatomical realism. We replace basic primitives with a lightweight shape bank built from de-identified, label-only segmentation masks from 5 subjects. Furthermore, we introduce a structure-aware sequential placement strategy to govern the patch synthesis process. Instead of random placement, we enforce physiological plausibility using spatial anchors for correct localization and a topological graph to manage inter-organ interactions (e.g., preventing impossible overlaps). Extensive experiments on BTCV and MSD datasets demonstrate that our method significantly outperforms state-of-the-art FDSL baselines and SSL methods by 1.74\% and up to 1.66\%, while exhibiting a robust scaling effect where performance improves with increased synthetic data volume. This provides a data-efficient, privacy-compliant solution for medical segmentation. The code will be made publicly available upon acceptance.
Chinese Translation
视觉变换器(ViTs)在三维医疗分割中表现出色,但需要大量标注数据集。虽然自监督学习(SSL)通过使用未标记数据来缓解这一问题,但仍面临严格的隐私和后勤障碍。基于公式的监督学习(FDSL)通过在合成数学原语上进行预训练,提供了一种保护隐私的替代方案。然而,一个关键的语义差距限制了其有效性:通用形状缺乏真实解剖结构的形态保真度、固定空间布局和器官间关系,阻碍了模型学习重要的全局结构先验。为了解决这一问题,我们提出了一种解剖信息驱动的合成监督预训练框架,将FDSL的无限可扩展性与解剖现实性相结合。我们用来自5个受试者的去标识化、仅含标签的分割掩膜的轻量级形状库替换了基本原语。此外,我们引入了一种结构感知的顺序放置策略来管理补丁合成过程。我们通过使用空间锚点强制生理合理性来确保正确定位,并利用拓扑图管理器官间的相互作用(例如,防止不可能的重叠)。在BTCV和MSD数据集上的大量实验表明,我们的方法显著优于最先进的FDSL基线和SSL方法,分别提高了1.74%和高达1.66%,同时表现出强大的扩展效果,即随着合成数据量的增加,性能得以提升。这为医疗分割提供了一种数据高效、符合隐私要求的解决方案。代码将在接受后公开发布。
cs.CV / 149 / 2603.00983
Event-Anchored Frame Selection for Effective Long-Video Understanding
基于事件锚定的帧选择用于有效的长视频理解
Abstract
Massive frame redundancy and limited context window make efficient frame selection crucial for long-video understanding with large vision-language models (LVLMs). Prevailing approaches, however, adopt a flat sampling paradigm which treats the video as an unstructured collection of frames. In this paper, we introduce Event-Anchored Frame Selection (EFS), a hierarchical, event-aware pipeline. Leveraging self-supervised DINO embeddings, EFS first partitions the video stream into visually homogeneous temporal segments, which serve as proxies for semantic events. Within each event, it then selects the most query-relevant frame as an anchor. These anchors act as structural priors that guide a global refinement stage using an adaptive Maximal Marginal Relevance (MMR) scheme. This pipeline ensures the final keyframe set jointly optimizes for event coverage, query relevance, and visual diversity. As a training-free, plug-and-play module, EFS can be seamlessly integrated into off-the-shelf LVLMs, yielding substantial gains on challenging video understanding benchmarks. Specifically, when applied to LLaVA-Video-7B, EFS improves accuracy by 4.7%, 4.9%, and 8.8% on VideoMME, LongVideoBench, and MLVU, respectively.
Chinese Translation
大量的帧冗余和有限的上下文窗口使得在使用大型视觉语言模型(LVLMs)进行长视频理解时,高效的帧选择变得至关重要。然而,现有的方法采用了一种平面采样范式,将视频视为无结构的帧集合。本文介绍了一种基于事件锚定的帧选择(Event-Anchored Frame Selection, EFS)方法,这是一种分层的、事件感知的处理流程。EFS利用自监督的DINO嵌入,首先将视频流划分为视觉上同质的时间段,这些时间段作为语义事件的代理。在每个事件中,它选择与查询最相关的帧作为锚点。这些锚点作为结构先验,指导使用自适应的最大边际相关性(Maximal Marginal Relevance, MMR)方案进行全局优化阶段。该流程确保最终的关键帧集在事件覆盖、查询相关性和视觉多样性方面共同优化。作为一个无需训练的即插即用模块,EFS可以无缝集成到现成的LVLM中,在具有挑战性的视频理解基准上取得显著提升。具体而言,当应用于LLaVA-Video-7B时,EFS在VideoMME、LongVideoBench和MLVU上分别提高了4.7%、4.9%和8.8%的准确率。
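The global refinement stage of EFS builds on Maximal Marginal Relevance (MMR), a standard greedy trade-off between query relevance and redundancy. A minimal sketch with event anchors pre-seeded into the selection (the λ value, similarity measure, and function signature are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def mmr_select(frame_embs, query_emb, anchors, k, lam=0.7):
    """Greedy MMR keyframe selection seeded with per-event anchor frames.

    frame_embs: (N, d) L2-normalized frame embeddings.
    query_emb:  (d,)   L2-normalized query embedding.
    anchors:    indices of event anchor frames, kept in the result so every
                event stays covered before global refinement adds frames.
    """
    relevance = frame_embs @ query_emb                # query relevance per frame
    selected = list(anchors)
    candidates = [i for i in range(len(frame_embs)) if i not in selected]
    while len(selected) < k and candidates:
        sel_embs = frame_embs[selected]
        # Redundancy = max similarity to anything already selected.
        redundancy = (frame_embs[candidates] @ sel_embs.T).max(axis=1)
        scores = lam * relevance[candidates] - (1 - lam) * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)
```

With a low λ the diversity term dominates, so a near-duplicate of an already-selected anchor is passed over in favor of a visually distinct frame.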
cs.CV / 150 / 2603.00985
The Texture-Shape Dilemma: Boundary-Safe Synthetic Generation for 3D Medical Transformers
纹理-形状困境:3D医学变换器的边界安全合成生成
Abstract
Vision Transformers (ViTs) have revolutionized medical image analysis, yet their data-hungry nature clashes with the scarcity and privacy constraints of clinical archives. Formula-Driven Supervised Learning (FDSL) has emerged as a promising solution to this bottleneck, synthesizing infinite annotated samples from mathematical formulas without utilizing real patient data. However, existing FDSL paradigms rely on simple geometric shapes with homogeneous intensities, creating a substantial gap by neglecting tissue textures and noise patterns inherent in modalities like CT and MRI. In this paper, we identify a critical optimization conflict termed boundary aliasing: when high-frequency synthetic textures are naively added, they corrupt the image gradient signals necessary for learning structural boundaries, causing the model to fail in delineating real anatomical margins. To bridge this gap, we propose a novel Physics-inspired Spatially-Decoupled Synthesis framework. Our approach orthogonalizes the synthesis process: it first constructs a gradient-shielded buffer zone based on boundary distance to ensure stable shape learning, and subsequently injects physics-driven spectral textures into the object core. This design effectively reconciles robust shape representation learning with invariance to acquisition noise. Extensive experiments on the BTCV and MSD datasets demonstrate that our method significantly outperforms previous FDSL, as well as SSL methods trained on real-world medical datasets, by 1.43% on BTCV and up to 1.51% on MSD task, offering a scalable, annotation-free foundation for medical ViTs. The code will be made publicly available upon acceptance.
Chinese Translation
视觉变换器(ViTs)已经彻底改变了医学图像分析,但其对数据的高需求与临床档案的稀缺性和隐私限制相冲突。基于公式的监督学习(FDSL)作为解决这一瓶颈的有前景的方案,能够从数学公式中合成无限的标注样本,而无需使用真实患者数据。然而,现有的FDSL范式依赖于具有均匀强度的简单几何形状,忽视了CT和MRI等模态中固有的组织纹理和噪声模式,从而造成了显著的差距。在本文中,我们识别出一个关键的优化冲突,称为边界混叠:当高频合成纹理被简单地添加时,会破坏学习结构边界所需的图像梯度信号,导致模型无法准确划分真实的解剖边界。为了解决这一问题,我们提出了一种新颖的基于物理的空间解耦合成框架。我们的方法使合成过程正交化:首先根据边界距离构建一个梯度保护缓冲区,以确保形状学习的稳定性,然后将基于物理的光谱纹理注入对象核心。这一设计有效地调和了强大的形状表示学习与对采集噪声的不变性。在BTCV和MSD数据集上的广泛实验表明,我们的方法在BTCV上比之前的FDSL和在真实医学数据集上训练的SSL方法提高了1.43%,在MSD任务上提高了高达1.51%,为医学ViTs提供了一个可扩展的、无标注的基础。代码将在接受后公开发布。
cs.CV / 151 / 2603.00988
Foundation Models in Remote Sensing: Evolving from Unimodality to Multimodality
遥感中的基础模型:从单模态到多模态的演变
Abstract
Remote sensing (RS) techniques are increasingly crucial for deepening our understanding of the planet. As the volume and diversity of RS data continue to grow exponentially, there is an urgent need for advanced data modeling and understanding capabilities to manage and interpret these vast datasets effectively. Foundation models present significant new growth opportunities and immense potential to revolutionize the RS field. In this paper, we conduct a comprehensive technical survey on foundation models in RS, offering a brand-new perspective by exploring their evolution from unimodality to multimodality. We hope this work serves as a valuable entry point for researchers interested in both foundation models and RS and helps them launch new projects or explore new research topics in this rapidly evolving area. This survey addresses the following three key questions: What are foundation models in RS? Why are foundation models needed in RS? How can we effectively guide junior researchers in gaining a comprehensive and practical understanding of foundation models in RS applications? More specifically, we begin by outlining the background and motivation, emphasizing the importance of foundation models in RS. We then review existing foundation models in RS, systematically categorizing them into unimodal and multimodal approaches. Additionally, we provide a tutorial-like section to guide researchers, especially beginners, on how to train foundation models in RS and apply them to real-world tasks. The survey aims to equip researchers in RS with a deeper and more efficient understanding of foundation models, enabling them to get started easily and effectively apply these models across various RS applications.
Chinese Translation
遥感(RS)技术在加深我们对地球理解方面变得愈加重要。随着遥感数据的体量和多样性呈指数级增长,迫切需要先进的数据建模和理解能力,以有效管理和解读这些庞大的数据集。基础模型为遥感领域带来了显著的新增长机会和巨大的潜力,可能会彻底改变这一领域。本文对遥感中的基础模型进行了全面的技术调查,通过探索其从单模态到多模态的演变,提供了全新的视角。我们希望这项工作能为对基础模型和遥感感兴趣的研究人员提供一个有价值的切入点,帮助他们启动新项目或探索这一快速发展的领域中的新研究主题。本调查解决了以下三个关键问题:什么是遥感中的基础模型?为什么遥感中需要基础模型?我们如何有效指导初级研究人员全面而实际地理解遥感应用中的基础模型?更具体地说,我们首先概述了背景和动机,强调基础模型在遥感中的重要性。然后,我们回顾了现有的遥感基础模型,将其系统地分类为单模态和多模态方法。此外,我们提供了一个类似教程的部分,以指导研究人员,特别是初学者,如何在遥感中训练基础模型并将其应用于实际任务。该调查旨在使遥感研究人员对基础模型有更深入和高效的理解,使他们能够轻松入门并有效地将这些模型应用于各种遥感应用中。
cs.CV / 152 / 2603.00990
MLRecon: Robust Markerless Freehand 3D Ultrasound Reconstruction via Coarse-to-Fine Pose Estimation
MLRecon:通过粗到细的姿态估计实现鲁棒的无标记自由手3D超声重建
Abstract
Freehand 3D ultrasound (US) reconstruction promises volumetric imaging with the flexibility of standard 2D probes, yet existing tracking paradigms face a restrictive trilemma: marker-based systems demand prohibitive costs, inside-out methods require intrusive sensor attachment, and sensorless approaches suffer from severe cumulative drift. To overcome these limitations, we present MLRecon, a robust markerless 3D US reconstruction framework delivering drift-resilient 6D probe pose tracking using a single commodity RGB-D camera. Leveraging the generalization power of vision foundation models, our pipeline enables continuous markerless tracking of the probe, augmented by a vision-guided divergence detector that autonomously monitors tracking integrity and triggers failure recovery to ensure uninterrupted scanning. Crucially, we further propose a dual-stage pose refinement network that explicitly disentangles high-frequency jitter from low-frequency bias, effectively denoising the trajectory while maintaining the kinematic fidelity of operator maneuvers. Experiments demonstrate that MLRecon significantly outperforms competing sensorless and sensor-aided methods, achieving average position errors as low as 0.88 mm on complex trajectories and yielding high-quality 3D reconstructions with sub-millimeter mean surface accuracy. This establishes a new benchmark for low-cost, accessible volumetric US imaging in resource-limited clinical settings.
Chinese Translation
自由手3D超声(US)重建承诺提供体积成像,同时具备标准2D探头的灵活性,然而现有的跟踪范式面临着一个限制性的三难困境:基于标记的系统需要高昂的成本,内向外的方法要求侵入式传感器附加,而无传感器的方法则遭受严重的累积漂移。为了克服这些限制,我们提出了MLRecon,一个鲁棒的无标记3D超声重建框架,利用单个普通RGB-D相机提供抗漂移的6D探头姿态跟踪。我们的管道利用视觉基础模型的泛化能力,实现了探头的连续无标记跟踪,并通过视觉引导的偏差检测器增强,能够自主监测跟踪完整性并触发故障恢复,以确保扫描过程不间断。至关重要的是,我们进一步提出了一个双阶段姿态精炼网络,明确地将高频抖动与低频偏差分离,有效地去噪轨迹,同时保持操作员动作的运动学保真性。实验表明,MLRecon显著优于竞争的无传感器和有传感器辅助的方法,在复杂轨迹上实现了平均位置误差低至0.88毫米,并产生了亚毫米级的高质量3D重建。这为资源有限的临床环境中的低成本、可及的体积超声成像建立了新的基准。
cs.CV / 153 / 2603.01000
Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer
让你的图像随着你的动作而移动!——隐式多对象多动作转移
Abstract
Motion transfer has emerged as a promising direction for controllable video generation, yet existing methods largely focus on single-object scenarios and struggle when multiple objects require distinct motion patterns. In this work, we present FlexiMMT, the first implicit image-to-video (I2V) motion transfer framework that explicitly enables multi-object, multi-motion transfer. Given a static multi-object image and multiple reference videos, FlexiMMT independently extracts motion representations and accurately assigns them to different objects, supporting flexible recombination and arbitrary motion-to-object mappings. To address the core challenge of cross-object motion entanglement, we introduce a Motion Decoupled Mask Attention Mechanism that uses object-specific masks to constrain attention, ensuring that motion and text tokens only influence their designated regions. We further propose a Differentiated Mask Propagation Mechanism that derives object-specific masks directly from diffusion attention and progressively propagates them across frames efficiently. Extensive experiments demonstrate that FlexiMMT achieves precise, compositional, and state-of-the-art performance in I2V-based multi-object multi-motion transfer.
Chinese Translation
动作转移已成为可控视频生成的一个有前景的方向,但现有方法主要集中于单对象场景,当多个对象需要不同的动作模式时,表现不佳。在本研究中,我们提出了FlexiMMT,这是第一个隐式图像到视频(I2V)动作转移框架,明确支持多对象、多动作转移。给定一个静态的多对象图像和多个参考视频,FlexiMMT独立提取动作表示,并准确地将其分配给不同的对象,支持灵活的重组和任意的动作到对象映射。为了解决跨对象动作纠缠的核心挑战,我们引入了一种动作解耦掩码注意机制,该机制使用对象特定的掩码来约束注意力,确保动作和文本标记仅影响其指定区域。我们进一步提出了一种差异化掩码传播机制,该机制直接从扩散注意力中推导出对象特定的掩码,并有效地在帧之间逐步传播。大量实验表明,FlexiMMT在基于I2V的多对象多动作转移中实现了精确、组合和最先进的性能。
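The core idea of mask-constrained attention, where object-specific masks keep each motion's tokens from influencing other objects' regions, can be sketched as follows. This is a generic single-head illustration; the shapes, mask construction, and naming are assumptions, not FlexiMMT's actual implementation:

```python
import numpy as np

def masked_attention(q, k, v, allow):
    """Attention where token i may attend to token j only if allow[i, j].

    A boolean object mask with block structure keeps motion and text tokens
    confined to their designated region, preventing cross-object motion
    entanglement (illustrative sketch, single head, no batching).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    logits = np.where(allow, logits, -1e9)            # block disallowed pairs
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With a block-diagonal `allow` matrix, the output for tokens of one object is a convex combination of that object's values only.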
cs.CV / 154 / 2603.01007
Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving
Dr.Occ:基于深度和区域引导的来自全景摄像头的3D占用预测用于自动驾驶
Abstract
3D semantic occupancy prediction is crucial for autonomous driving perception, offering comprehensive geometric scene understanding and semantic recognition. However, existing methods struggle with geometric misalignment in view transformation due to the lack of pixel-level accurate depth estimation, and severe spatial class imbalance where semantic categories exhibit strong spatial anisotropy. To address these challenges, we propose Dr.Occ, a depth- and region-guided occupancy prediction framework. Specifically, we introduce a depth-guided 2D-to-3D View Transformer (D$^2$-VFormer) that effectively leverages high-quality dense depth cues from MoGe-2 to construct reliable geometric priors, thereby enabling precise geometric alignment of voxel features. Moreover, inspired by the Mixture-of-Experts (MoE) framework, we propose a region-guided Expert Transformer (R/R$^2$-EFormer) that adaptively allocates region-specific experts to focus on different spatial regions, effectively addressing spatial semantic variations. Thus, the two components make complementary contributions: depth guidance ensures geometric alignment, while region experts enhance semantic learning. Experiments on the Occ3D-nuScenes benchmark demonstrate that \textbf{Dr.Occ} improves the strong baseline BEVDet4D by 7.43\% mIoU and 3.09\% IoU under the full vision-only setting.
Chinese Translation
3D语义占用预测对于自动驾驶感知至关重要,它提供了全面的几何场景理解和语义识别。然而,现有方法在视图转换中由于缺乏像素级精确的深度估计而面临几何不对齐的问题,并且在语义类别表现出强烈的空间各向异性时,严重的空间类别不平衡也成为挑战。为了解决这些问题,我们提出了Dr.Occ,一个基于深度和区域引导的占用预测框架。具体而言,我们引入了一种深度引导的2D到3D视图变换器(D$^2$-VFormer),它有效利用来自MoGe-2的高质量密集深度线索来构建可靠的几何先验,从而实现体素特征的精确几何对齐。此外,受混合专家(Mixture-of-Experts, MoE)框架的启发,我们提出了一种区域引导的专家变换器(R/R$^2$-EFormer),它自适应地分配区域特定的专家以关注不同的空间区域,有效应对空间语义变化。因此,这两个组件相辅相成:深度引导确保几何对齐,而区域专家增强语义学习。在Occ3D-nuScenes基准上的实验表明,\textbf{Dr.Occ}在全视觉设置下将强基线BEVDet4D的mIoU提高了7.43\%,IoU提高了3.09\%。
cs.CV / 155 / 2603.01010
GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis
GeodesicNVS:用于新视图合成的概率密度测地线流匹配
Abstract
Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data transitions, which obscure deterministic structures and yield inconsistent view predictions. We propose a Data-to-Data Flow Matching framework that learns deterministic transformations directly between paired views, enhancing view-consistent synthesis through explicit data coupling. To further enhance geometric coherence, we introduce Probability Density Geodesic Flow Matching (PDG-FM), which constrains flow trajectories using geodesic interpolants derived from probability density metrics of pretrained diffusion models. Such alignment with high-density regions of the data manifold promotes more realistic interpolants between samples. Empirically, our method surpasses diffusion-based NVS baselines, demonstrating improved structural coherence and smoother transitions across views. These results highlight the advantages of incorporating data-dependent geometric regularization into deterministic flow matching for consistent novel view generation.
Chinese Translation
最近在生成建模方面的进展显著提升了新视图合成的效果,但在不同视点之间保持一致性仍然具有挑战性。基于扩散的模型依赖于随机噪声到数据的转换,这掩盖了确定性结构并导致视图预测不一致。我们提出了一种数据到数据流匹配框架,该框架直接学习配对视图之间的确定性变换,通过显式的数据耦合增强视图一致性合成。为了进一步增强几何一致性,我们引入了概率密度测地线流匹配(Probability Density Geodesic Flow Matching, PDG-FM),该方法使用从预训练扩散模型的概率密度度量导出的测地线插值来约束流轨迹。与数据流形的高密度区域的对齐促进了样本之间更为真实的插值。实证结果表明,我们的方法超越了基于扩散的新视图合成基线,展示了更好的结构一致性和更平滑的视图过渡。这些结果突显了将数据依赖的几何正则化纳入确定性流匹配以实现一致的新视图生成的优势。
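The data-to-data flow matching objective described above admits a compact sketch: with a straight-line interpolant between paired views, the regression target is the constant velocity between them. This is a generic flow-matching formulation under assumed linear interpolants, with any callable standing in for the network; it is not GeodesicNVS's PDG-FM variant, which replaces the straight line with density-derived geodesic interpolants:

```python
import numpy as np

def flow_matching_loss(model, x0, x1, t):
    """Data-to-data flow matching loss (illustrative sketch).

    Interpolant: x_t = (1 - t) * x0 + t * x1 between paired views x0, x1.
    Its time derivative is the constant velocity x1 - x0, which the model
    is trained to regress at (x_t, t).
    """
    t = t.reshape(-1, 1)                    # broadcast per-sample times
    x_t = (1.0 - t) * x0 + t * x1
    target_v = x1 - x0
    pred_v = model(x_t, t)
    return np.mean((pred_v - target_v) ** 2)
```

At inference, integrating the learned velocity field from one view deterministically transports it toward the paired view, which is what makes the transition deterministic rather than noise-to-data.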
cs.CV / 156 / 2603.01016
Implementation of Licensed Plate Detection and Noise Removal in Image Processing
车牌检测与图像处理中的噪声去除的实现
Abstract
A car license plate recognition system is an image processing technology used to identify vehicles by capturing their license plates. The technology is also known as automatic number-plate recognition, automatic vehicle identification, or optical character recognition for cars. In Malaysia, where the number of vehicles is increasing rapidly, the growing volume of traffic has created considerable demand for car license plate recognition systems. Such systems can be deployed in electronic parking payment, highway toll collection, and traffic surveillance, and as police enforcement tools. The technology also has the potential to be combined with techniques from other fields, such as biology and aerospace, to solve specialized problems.
Chinese Translation
车牌识别系统是一种图像处理技术,用于通过捕捉车辆的车牌来识别车辆。车牌识别技术也被称为自动车牌识别(Automatic Number-Plate Recognition)、自动车辆识别(Automatic Vehicle Identification)、车牌识别(Car License Plate Recognition)或汽车光学字符识别(Optical Character Recognition for Cars)。在马来西亚,随着车辆数量的快速增加,路上的车辆数量大幅上升,导致对车牌识别系统的需求显著增加。车牌识别系统可以应用于电子停车支付系统、高速公路收费系统、交通监控系统以及作为警务执法工具。此外,车牌识别系统技术还有潜力与其他不同领域的各种技术结合,如生物学、航空航天等,以实现解决一些专业问题的目标。
cs.CV / 157 / 2603.01026
RaUF: Learning the Spatial Uncertainty Field of Radar
RaUF:学习雷达的空间不确定性场
Abstract
Millimeter-wave radar offers unique advantages in adverse weather but suffers from low spatial fidelity, severe azimuth ambiguity, and clutter-induced spurious returns. Existing methods mainly focus on improving spatial perception effectiveness via coarse-to-fine cross-modal supervision, yet often overlook the ambiguous feature-to-label mapping, which may lead to ill-posed geometric inference and pose fundamental challenges to downstream perception tasks. In this work, we propose RaUF, a spatial uncertainty field learning framework that models radar measurements through their physically grounded anisotropic properties. To resolve conflicting feature-to-label mapping, we design an anisotropic probabilistic model that learns fine-grained uncertainty. To further enhance reliability, we propose a Bidirectional Domain Attention mechanism that exploits the mutual complementarity between spatial structure and Doppler consistency, effectively suppressing spurious or multipath-induced reflections. Extensive experiments on public benchmarks and real-world datasets demonstrate that RaUF delivers highly reliable spatial detections with well-calibrated uncertainty. Moreover, downstream case studies further validate the enhanced reliability and scalability of RaUF under challenging real-world driving scenarios.
Chinese Translation
毫米波雷达在恶劣天气条件下具有独特的优势,但其空间保真度低、方位模糊严重,并且容易受到杂波引起的虚假回波的影响。现有的方法主要集中在通过粗到细的跨模态监督来提高空间感知的有效性,但往往忽视了模糊的特征与标签映射,这可能导致不适定的几何推断,并对下游感知任务构成根本性挑战。在本研究中,我们提出了RaUF,一种空间不确定性场学习框架,通过其物理基础的各向异性特性对雷达测量进行建模。为了解决冲突的特征与标签映射,我们设计了一种各向异性概率模型,以学习细粒度的不确定性。为了进一步增强可靠性,我们提出了一种双向域注意机制,利用空间结构与多普勒一致性之间的互补性,有效抑制虚假或多路径引起的反射。在公共基准和真实世界数据集上的大量实验表明,RaUF能够提供高度可靠的空间检测,并具有良好的不确定性校准。此外,下游案例研究进一步验证了RaUF在具有挑战性的真实驾驶场景下的增强可靠性和可扩展性。
cs.CV / 158 / 2603.01028
Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features
基于内容的频率编码用于具有傅里叶-切比雪夫特征的隐式神经表示
Abstract
Implicit Neural Representations (INRs) have emerged as a powerful paradigm for various signal processing tasks, but their inherent spectral bias limits the ability to capture high-frequency details. Existing methods partially mitigate this issue by using Fourier-based features, which usually rely on fixed frequency bases. This forces multi-layer perceptrons (MLPs) to inefficiently compose the required frequencies, thereby constraining their representational capacity. To address this limitation, we propose Content-Aware Frequency Encoding (CAFE), which builds upon Fourier features through multiple parallel linear layers combined via a Hadamard product. CAFE can explicitly and efficiently synthesize a broader range of frequency bases, while the learned weights enable the selection of task-relevant frequencies. Furthermore, we extend this framework to CAFE+, which incorporates Chebyshev features as a complementary component to Fourier bases. This combination provides a stronger and more stable frequency representation. Extensive experiments across multiple benchmarks validate the effectiveness and efficiency of our approach, consistently achieving superior performance over existing methods. Our code is available at https://github.com/JunboKe0619/CAFE.
Chinese Translation
隐式神经表示(INRs)已成为各种信号处理任务的强大范式,但其固有的频谱偏差限制了捕捉高频细节的能力。现有方法通过使用基于傅里叶的特征部分缓解了这一问题,这些特征通常依赖于固定的频率基。这迫使多层感知器(MLPs)低效地组合所需的频率,从而限制了它们的表示能力。为了解决这一限制,我们提出了基于内容的频率编码(CAFE),该方法通过多个并行线性层结合哈达玛积构建在傅里叶特征之上。CAFE能够显式且高效地合成更广泛的频率基,而学习到的权重使得选择与任务相关的频率成为可能。此外,我们将该框架扩展到CAFE+,它将切比雪夫特征作为傅里叶基的补充组件。这种组合提供了更强大和更稳定的频率表示。在多个基准测试中的广泛实验验证了我们方法的有效性和效率,始终在性能上优于现有方法。我们的代码可在 https://github.com/JunboKe0619/CAFE 获取。
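The Hadamard-product construction at the heart of CAFE can be sketched compactly: parallel linear layers over the same Fourier features are combined elementwise, and products of sinusoids mix base frequencies into sums and differences (sin a · sin b contains a ± b terms), letting the encoder synthesize bases beyond the fixed bank. All shapes, weights, and function names below are illustrative assumptions, not the released implementation:

```python
import numpy as np

def fourier_features(x, freqs):
    """gamma(x) = [sin(2*pi*f*x), cos(2*pi*f*x)] over a fixed frequency bank."""
    proj = 2.0 * np.pi * np.outer(x, freqs)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

def cafe_encode(x, freqs, weight_list, bias_list):
    """Content-aware encoding sketch: Hadamard product of parallel branches.

    Each branch is a linear layer over the same Fourier features; their
    elementwise product composes new frequencies from the fixed bank, while
    the learned weights select task-relevant ones.
    """
    g = fourier_features(x, freqs)
    out = g @ weight_list[0] + bias_list[0]
    for w, b in zip(weight_list[1:], bias_list[1:]):
        out = out * (g @ w + b)        # Hadamard combination of branches
    return out
```

The encoded features would then feed a small MLP, which no longer has to compose the required frequencies on its own.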
cs.CV / 159 / 2603.01029
Vision-Language Feature Alignment for Road Anomaly Segmentation
视觉-语言特征对齐用于道路异常分割
Abstract
Safe autonomous systems in complex environments require robust road anomaly segmentation to identify unknown obstacles. However, existing approaches often rely on pixel-level statistics to determine whether a region appears anomalous. This reliance leads to high false-positive rates on semantically normal background regions such as sky or vegetation, and poor recall of true out-of-distribution (OOD) instances, thereby posing safety risks for robotic perception and decision-making. To address these challenges, we propose VL-Anomaly, a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt-learning-driven alignment module that adapts Mask2Former's visual features to CLIP text embeddings of known categories, effectively suppressing spurious anomaly responses in background regions. At inference time, we further introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity, and detector confidence, enabling more reliable anomaly prediction by leveraging complementary information sources. Extensive experiments demonstrate that VL-Anomaly achieves state-of-the-art performance on benchmark datasets including RoadAnomaly, SMIYC, and Fishyscapes. Code is released at https://github.com/NickHezhuolin/VL-aligner-Road-anomaly-segment.
Chinese Translation
在复杂环境中,安全的自主系统需要强大的道路异常分割能力以识别未知障碍物。然而,现有方法通常依赖于像素级统计来判断一个区域是否显得异常。这种依赖导致在语义上正常的背景区域(如天空或植被)上产生高假阳性率,并且对真实的分布外(Out-of-distribution, OOD)实例的召回率较低,从而对机器人感知和决策造成安全风险。为了解决这些挑战,我们提出了VL-Anomaly,一个结合了预训练视觉-语言模型(Vision-Language Models, VLMs)语义先验的视觉-语言异常分割框架。具体而言,我们设计了一个基于提示学习的对齐模块,将Mask2Former的视觉特征适应于已知类别的CLIP文本嵌入,有效抑制背景区域中的虚假异常响应。在推理时,我们进一步引入了一种多源推理策略,整合了文本引导的相似性、基于CLIP的图像-文本相似性和检测器置信度,通过利用互补信息源来实现更可靠的异常预测。大量实验表明,VL-Anomaly在RoadAnomaly、SMIYC和Fishyscapes等基准数据集上达到了最先进的性能。代码已发布在 https://github.com/NickHezhuolin/VL-aligner-Road-anomaly-segment。
cs.CV / 160 / 2603.01034
Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery
用于多维数据恢复的重参数化张量环函数分解
Abstract
Tensor Ring (TR) decomposition is a powerful tool for high-order data modeling, but is inherently restricted to discrete forms defined on fixed meshgrids. In this work, we propose a TR functional decomposition for both meshgrid and non-meshgrid data, where factors are parameterized by Implicit Neural Representations (INRs). However, optimizing this continuous framework to capture fine-scale details is intrinsically difficult. Through a frequency-domain analysis, we demonstrate that the spectral structure of TR factors determines the frequency composition of the reconstructed tensor and limits the high-frequency modeling capacity. To mitigate this, we propose a reparameterized TR functional decomposition, in which each TR factor is a structured combination of a learnable latent tensor and a fixed basis. This reparameterization is theoretically shown to improve the training dynamics of TR factor learning. We further derive a principled initialization scheme for the fixed basis and prove the Lipschitz continuity of our proposed model. Extensive experiments on image inpainting, denoising, super-resolution, and point cloud recovery demonstrate that our method achieves consistently superior performance over existing approaches. Code is available at https://github.com/YangyangXu2002/RepTRFD.
Chinese Translation
张量环(Tensor Ring, TR)分解是高阶数据建模的强大工具,但本质上受到在固定网格上定义的离散形式的限制。在本研究中,我们提出了一种适用于网格数据和非网格数据的TR函数分解,其中因子由隐式神经表示(Implicit Neural Representations, INRs)参数化。然而,优化这一连续框架以捕捉细微的细节本质上是困难的。通过频域分析,我们证明了TR因子的谱结构决定了重构张量的频率组成,并限制了高频建模能力。为了解决这一问题,我们提出了一种重参数化的TR函数分解,其中每个TR因子是一个可学习潜在张量和一个固定基的结构化组合。这种重参数化在理论上被证明可以改善TR因子学习的训练动态。我们进一步推导了固定基的原则性初始化方案,并证明了我们提出的模型的利普希茨连续性。在图像修复、去噪、超分辨率和点云恢复等任务上的大量实验表明,我们的方法在性能上始终优于现有方法。代码可在 https://github.com/YangyangXu2002/RepTRFD 获取。
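For readers unfamiliar with the discrete form that this work makes continuous: in a Tensor Ring decomposition, each entry of the full tensor is the trace of a product of core slices, with the ring closed by matching first and last ranks. A naive reconstruction sketch (for clarity, not efficiency; the paper's contribution, parameterizing the cores with INRs and reparameterizing them against a fixed basis, is not shown here):

```python
import numpy as np

def tr_reconstruct(cores):
    """Reconstruct a full tensor from Tensor Ring cores.

    cores[k] has shape (r_k, n_k, r_{k+1}) with r_N == r_0 (ring closure);
    each entry is T[i_1, ..., i_N] = Tr( G_1[:, i_1, :] @ ... @ G_N[:, i_N, :] ).
    Naive O(prod(n_k)) loop for illustration only.
    """
    dims = [c.shape[1] for c in cores]
    out = np.zeros(dims)
    for idx in np.ndindex(*dims):
        mat = np.eye(cores[0].shape[0])
        for k, i in enumerate(idx):
            mat = mat @ cores[k][:, i, :]   # chain the selected core slices
        out[idx] = np.trace(mat)            # close the ring
    return out
```

With all ranks equal to 1, the ring degenerates to an outer product of the core fibers, which gives a quick correctness check.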
cs.CV / 161 / 2603.01036
SMR-Net: Robot Snap Detection Based on Multi-Scale Features and Self-Attention Network
SMR-Net:基于多尺度特征和自注意力网络的机器人卡扣检测
Abstract
In robot automated assembly, snap assembly precision and efficiency directly determine overall production quality. As a core prerequisite, snap detection and localization critically affect subsequent assembly success. Traditional visual methods suffer from poor robustness and large localization errors when handling complex scenarios (e.g., transparent or low-contrast snaps), failing to meet high-precision assembly demands. To address this, this paper designs a dedicated sensor and proposes SMR-Net, a self-attention-based multi-scale object detection algorithm, to synergistically enhance detection and localization performance. SMR-Net adopts an attention-enhanced multi-scale feature fusion architecture: raw sensor data is encoded via an attention-embedded feature extractor to strengthen key snap features and suppress noise; three multi-scale feature maps are processed in parallel with standard and dilated convolutions for dimension unification while preserving resolution; and an adaptive reweighting network dynamically assigns weights to fused features, generating fine representations that integrate details and global semantics. Experimental results on Type A and Type B snap datasets show SMR-Net significantly outperforms traditional Faster R-CNN: Intersection over Union (IoU) improves by 6.52% and 5.8%, and mean Average Precision (mAP) increases by 2.8% and 1.5%, respectively. This fully demonstrates the method's superiority in complex snap detection and localization tasks.
Chinese Translation
在机器人自动化装配中,卡扣装配的精度和效率直接决定了整体生产质量。作为核心前提,卡扣的检测和定位对后续的装配成功至关重要。传统的视觉方法在处理复杂场景(例如透明或低对比度的卡扣)时,往往表现出较差的鲁棒性和较大的定位误差,无法满足高精度装配的需求。为了解决这一问题,本文设计了一种专用传感器,并提出了SMR-Net,一种基于自注意力的多尺度物体检测算法,以协同增强检测和定位性能。SMR-Net采用了增强注意力的多尺度特征融合架构:通过嵌入注意力的特征提取器对原始传感器数据进行编码,以增强关键卡扣特征并抑制噪声;三个多尺度特征图通过标准卷积和扩张卷积并行处理,以实现维度统一,同时保持分辨率;自适应重加权网络动态分配权重给融合特征,生成整合细节和全局语义的精细表示。在A型和B型卡扣数据集上的实验结果表明,SMR-Net显著优于传统的Faster R-CNN:交并比(IoU)分别提高了6.52%和5.8%,平均精度均值(mAP)分别增加了2.8%和1.5%。这充分证明了该方法在复杂卡扣检测和定位任务中的优越性。
cs.CV / 162 / 2603.01038
From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing
从直觉到调查:一种用于可泛化人脸反欺诈的工具增强推理多模态大语言模型(MLLM)框架
Abstract
Face recognition remains vulnerable to presentation attacks, calling for robust Face Anti-Spoofing (FAS) solutions. Recent MLLM-based FAS methods reformulate the binary classification task as the generation of brief textual descriptions to improve cross-domain generalization. However, their generalizability is still limited, as such descriptions mainly capture intuitive semantic cues (e.g., mask contours) while struggling to perceive fine-grained visual patterns. To address this limitation, we incorporate external visual tools into MLLMs to encourage deeper investigation of subtle spoof clues. Specifically, we propose the Tool-Augmented Reasoning FAS (TAR-FAS) framework, which reformulates the FAS task as a Chain-of-Thought with Visual Tools (CoT-VT) paradigm, allowing MLLMs to begin with intuitive observations and adaptively invoke external visual tools for fine-grained investigation. To this end, we design a tool-augmented data annotation pipeline and construct the ToolFAS-16K dataset, which contains multi-turn tool-use reasoning trajectories. Furthermore, we introduce a tool-aware FAS training pipeline, where Diverse-Tool Group Relative Policy Optimization (DT-GRPO) enables the model to autonomously learn efficient tool use. Extensive experiments under a challenging one-to-eleven cross-domain protocol demonstrate that TAR-FAS achieves SOTA performance while providing fine-grained visual investigation for trustworthy spoof detection.
Chinese Translation
人脸识别仍然容易受到展示攻击,这呼唤强健的人脸反欺诈(FAS)解决方案。近期基于多模态大语言模型(MLLM)的FAS方法将二分类任务重新表述为生成简短的文本描述,以提高跨领域的泛化能力。然而,它们的泛化能力仍然有限,因为这些描述主要捕捉直观的语义线索(例如,面具轮廓),而在感知细粒度视觉模式方面存在困难。为了解决这一局限性,我们将外部视觉工具纳入MLLM,以促进对微妙欺诈线索的深入调查。具体而言,我们提出了工具增强推理FAS(TAR-FAS)框架,将FAS任务重新表述为带有视觉工具的思维链(CoT-VT)范式,使得MLLM能够从直观观察开始,并自适应地调用外部视觉工具进行细粒度调查。为此,我们设计了一个工具增强的数据标注流程,并构建了ToolFAS-16K数据集,该数据集包含多轮工具使用推理轨迹。此外,我们引入了一个工具感知的FAS训练流程,其中多样化工具组相对策略优化(DT-GRPO)使模型能够自主学习有效的工具使用。在一个具有挑战性的1对11跨领域协议下的广泛实验表明,TAR-FAS在提供可信的欺诈检测的细粒度视觉调查的同时,达到了最先进的性能。
cs.CV / 163 / 2603.01050
MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline
MM-DeepResearch:一个简单有效的多模态自主搜索基线
Abstract
We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis, enabling it to conduct deep research tasks. However, we observe three main challenges in developing such agents: (1) scarcity of search-intensive multimodal QA data, (2) lack of effective search trajectories, and (3) prohibitive cost of training with online search APIs. To tackle them, we first propose Hyper-Search, a hypergraph-based QA generation method that models and connects visual and textual nodes within and across modalities, enabling it to generate search-intensive multimodal QA pairs that require invoking various search tools to solve. Second, we introduce DR-TTS, which first decomposes search-involved tasks into several categories according to search tool types, and optimizes a specialized search tool expert for each tool. It then recomposes the tool experts to jointly explore search trajectories via tree search, producing trajectories that successfully solve complex tasks using various search tools. Third, we build an offline search engine supporting multiple search tools, enabling agentic reinforcement learning without costly online search APIs. With these three designs, we develop MM-DeepResearch, a powerful multimodal deep research agent, and extensive results show its superiority across benchmarks. Code is available at https://github.com/HJYao00/MM-DeepResearch
Chinese Translation
我们旨在开发一个能够进行明确推理和规划、多工具调用以及跨模态信息综合的多模态研究代理,从而使其能够执行深度研究任务。然而,我们在开发此类代理时观察到三个主要挑战:(1)搜索密集型多模态问答数据稀缺,(2)缺乏有效的搜索轨迹,以及(3)使用在线搜索API进行训练的高昂成本。为了解决这些问题,我们首先提出了Hyper-Search,这是一种基于超图的问答生成方法,它建模并连接模态内和模态间的视觉和文本节点,从而生成需要调用各种搜索工具来解决的搜索密集型多模态问答对。其次,我们引入了DR-TTS,它首先根据搜索工具类型将涉及搜索的任务分解为几个类别,并分别为每种工具优化专业的搜索工具专家。然后,它重新组合工具专家,通过树搜索共同探索搜索轨迹,生成成功使用各种搜索工具解决复杂任务的轨迹。第三,我们构建了一个支持多种搜索工具的离线搜索引擎,使得自主强化学习无需使用昂贵的在线搜索API。通过这三项设计,我们开发了MM-DeepResearch,一个强大的多模态深度研究代理,广泛的结果显示其在基准测试中的优越性。代码可在 https://github.com/HJYao00/MM-DeepResearch 获取。
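The third design, an offline search engine that replaces costly live search APIs during RL training, can be approximated for intuition by a tiny inverted-index lookup. Everything below (class name, corpus, token-overlap ranking) is an illustrative assumption, not the paper's actual implementation.

```python
from collections import defaultdict

class OfflineSearchEngine:
    """Toy offline index standing in for a live search API during RL
    training. Names and ranking rule here are illustrative only."""

    def __init__(self, corpus):
        # corpus: {doc_id: text}; build an inverted index once, offline.
        self.corpus = corpus
        self.index = defaultdict(set)
        for doc_id, text in corpus.items():
            for token in text.lower().split():
                self.index[token].add(doc_id)

    def search(self, query, top_k=3):
        # Rank documents by how many query tokens they contain.
        scores = defaultdict(int)
        for token in query.lower().split():
            for doc_id in self.index.get(token, set()):
                scores[doc_id] += 1
        ranked = sorted(scores, key=lambda d: (-scores[d], d))
        return [(d, self.corpus[d]) for d in ranked[:top_k]]

engine = OfflineSearchEngine({
    "d1": "flow matching for image registration",
    "d2": "multimodal agent search baseline",
    "d3": "agent tool use with search",
})
results = engine.search("agent search")
```

Because the index is fixed, every rollout of the agent can issue unlimited queries at zero API cost, which is the property the abstract exploits for agentic reinforcement learning.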
cs.CV / 164 / 2603.01063
Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures
通过从失败中显式学习释放自主驾驶中的视觉-语言-动作(VLA)潜力
Abstract
Vision-Language-Action (VLA) models for autonomous driving often hit a performance plateau during Reinforcement Learning (RL) optimization. This stagnation arises from exploration capabilities constrained by previous Supervised Fine-Tuning (SFT), leading to persistent failures in long-tail scenarios. In these critical situations, all explored actions yield a zero-value driving score. This information-sparse reward signals a failure, yet fails to identify its root cause -- whether it is due to incorrect planning, flawed reasoning, or poor trajectory execution. To address this limitation, we propose VLA with Explicit Learning from Failures (ELF-VLA), a framework that augments RL with structured diagnostic feedback. Instead of relying on a vague scalar reward, our method produces detailed, interpretable reports that identify the specific failure mode. The VLA policy then leverages this explicit feedback to generate a Feedback-Guided Refinement. By injecting these corrected, high-reward samples back into the RL training batch, our approach provides a targeted gradient, which enables the policy to solve critical scenarios that unguided exploration cannot. Extensive experiments demonstrate that our method unlocks the latent capabilities of VLA models, achieving state-of-the-art (SOTA) performance on the public NAVSIM benchmark for overall PDMS, EPDMS score and high-level planning accuracy.
Chinese Translation
自主驾驶的视觉-语言-动作(VLA)模型在强化学习(RL)优化过程中常常达到性能平台期。这种停滞源于之前的监督微调(SFT)所限制的探索能力,导致在长尾场景中持续出现失败。在这些关键情况下,所有探索的动作都产生零值驾驶评分。这种信息稀疏的奖励信号表明失败,但无法识别其根本原因——是由于规划不当、推理错误还是轨迹执行不佳。为了解决这一局限性,我们提出了显式学习失败的VLA(ELF-VLA),这是一个通过结构化诊断反馈增强RL的框架。我们的方法不再依赖模糊的标量奖励,而是生成详细、可解释的报告,识别具体的失败模式。然后,VLA策略利用这些显式反馈生成反馈引导的细化。通过将这些修正后的高奖励样本重新注入RL训练批次,我们的方法提供了有针对性的梯度,使策略能够解决无指导探索无法应对的关键场景。大量实验表明,我们的方法释放了VLA模型的潜在能力,在公共NAVSIM基准测试中实现了整体PDMS、EPDMS评分和高层规划准确度的最新性能(SOTA)。
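The feedback-guided refinement loop described above (diagnose a zero-reward rollout, produce a corrected high-reward sample, inject it back into the RL batch) can be sketched as follows. The sample format and `toy_refine` are hypothetical stand-ins for the diagnostic reports and the policy's refinement step, not ELF-VLA's actual interfaces.

```python
def inject_refined_samples(batch, failures, refine_fn):
    """Sketch of feedback-guided refinement: failed rollouts are rewritten
    by `refine_fn` (standing in for diagnostic report + policy refinement)
    and appended to the training batch to provide a targeted gradient."""
    refined = []
    for sample in failures:
        fixed = refine_fn(sample)          # corrected action with diagnosis
        if fixed["reward"] > sample["reward"]:
            refined.append(fixed)          # keep only genuine improvements
    return batch + refined

# Hypothetical refine step: the diagnostic report pinpoints the failure
# mode (planning / reasoning / execution) and proposes a corrected,
# higher-reward action.
def toy_refine(sample):
    return {**sample, "action": sample["action"] + "_corrected", "reward": 1.0}

batch = [{"action": "a1", "reward": 0.8}]
failures = [{"action": "a2", "reward": 0.0}]
new_batch = inject_refined_samples(batch, failures, toy_refine)
```

The filter on `fixed["reward"]` mirrors the abstract's point that only corrected, high-reward samples are injected; unimproved failures would re-introduce the uninformative zero-value signal.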
cs.CV / 165 / 2603.01068
LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model
LLaDA-o:一种有效且长度自适应的全扩散模型
Abstract
We present \textbf{LLaDA-o}, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding and continuous diffusion for visual generation, while coupling them through a shared, simple, and efficient attention backbone that reduces redundant computation for fixed conditions. Building on MoD, we further introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, and reaches 87.04 on DPG-Bench for text-to-image generation, supporting the effectiveness of unified omni diffusion modeling. Code is available at https://github.com/ML-GSAI/LLaDA-o.
Chinese Translation
我们提出了LLaDA-o,一种用于多模态理解和生成的有效且长度自适应的全扩散模型。LLaDA-o建立在混合扩散(Mixture of Diffusion, MoD)框架之上,该框架将离散的掩蔽扩散用于文本理解,连续扩散用于视觉生成,同时通过共享的、简单的、高效的注意力骨干网络将两者耦合,从而减少固定条件下的冗余计算。在MoD的基础上,我们进一步引入了一种以数据为中心的长度适应策略,使得在多模态环境中能够灵活地进行长度解码,而无需改变架构。大量实验表明,LLaDA-o在多模态理解和生成基准测试中,在全扩散模型中达到了最先进的性能,并在文本到图像生成的DPG-Bench上达到了87.04,支持了统一全扩散建模的有效性。代码可在https://github.com/ML-GSAI/LLaDA-o获取。
cs.CV / 166 / 2603.01073
Flow Matching-enabled Test-Time Refinement for Unsupervised Cardiac MR Registration
基于流匹配的无监督心脏磁共振配准的测试时细化
Abstract
Diffusion-based unsupervised image registration has been explored for cardiac cine MR, but expensive multi-step inference limits practical use. We propose FlowReg, a flow-matching framework in displacement field space that achieves strong registration in as few as two steps and supports further refinement with more steps. FlowReg uses warmup-reflow training: a single-step network first acts as a teacher, then a student learns to refine from arbitrary intermediate states, removing the need for a pre-trained model as in existing methods. An Initial Guess strategy feeds back the model prediction as the next starting point, improving refinement from step two onward. On ACDC and MM2 across six tasks (including cross-dataset generalization), FlowReg outperforms the state of the art on five tasks (+0.6% mean Dice score on average), with the largest gain in the left ventricle (+1.09%), and reduces LVEF estimation error on all six tasks (-2.58 percentage points), using only 0.7% extra parameters and no segmentation labels. Anonymized code is available at https://github.com/mathpluscode/FlowReg.
Chinese Translation
基于扩散的无监督图像配准已被探索用于心脏动态磁共振成像(cine MR),但昂贵的多步骤推理限制了其实际应用。我们提出了FlowReg,一个在位移场空间中的流匹配框架,它能够在仅需两步的情况下实现强大的配准,并支持通过更多步骤进行进一步细化。FlowReg采用热身-重流训练:一个单步骤网络首先充当教师,然后学生从任意中间状态学习进行细化,消除了现有方法中对预训练模型的需求。初始猜测策略将模型预测反馈作为下一个起始点,从第二步开始改善细化。在ACDC和MM2的六个任务(包括跨数据集泛化)上,FlowReg在五个任务中超越了当前最先进的技术(平均提升0.6%的Dice系数),在左心室的提升最大(+1.09%),并在所有六个任务中减少了左心室射血分数(LVEF)估计误差(减少2.58个百分点),仅使用了0.7%的额外参数且不需要分割标签。匿名代码可在https://github.com/mathpluscode/FlowReg获取。
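The Initial Guess strategy, where each prediction is fed back as the starting point of the next refinement step, amounts to an iterated fixed-point loop. The sketch below uses a toy one-step model on a small displacement vector; `model_step` is a hypothetical stand-in for FlowReg's trained network, not its actual update.

```python
import numpy as np

def refine_with_initial_guess(model_step, x0, n_steps):
    """Sketch of the Initial Guess strategy: each prediction becomes
    the starting point of the next refinement step (illustrative)."""
    x = x0
    for _ in range(n_steps):
        x = model_step(x)   # prediction -> next initial guess
    return x

# Toy "model": moves a displacement field halfway toward a fixed target,
# standing in for one flow-matching step in displacement-field space.
target = np.array([1.0, -2.0, 0.5])
step = lambda x: x + 0.5 * (target - x)

x2 = refine_with_initial_guess(step, np.zeros(3), 2)   # two-step regime
x8 = refine_with_initial_guess(step, np.zeros(3), 8)   # more steps refine further
```

With this toy contraction, two steps already land close to the target and additional steps shrink the residual further, mirroring the abstract's "strong in two steps, refinable with more" behavior.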
cs.CV / 167 / 2603.01074
Adaptive Augmentation-Aware Latent Learning for Robust LiDAR Semantic Segmentation
自适应增强感知潜在学习用于鲁棒的LiDAR语义分割
Abstract
Adverse weather conditions significantly degrade the performance of LiDAR point cloud semantic segmentation networks by introducing large distribution shifts. Existing augmentation-based methods attempt to enhance robustness by simulating weather interference during training. However, they struggle to fully exploit the potential of augmentations due to the trade-off between minor and aggressive augmentations. To address this, we propose A3Point, an adaptive augmentation-aware latent learning framework that effectively utilizes a diverse range of augmentations while mitigating the semantic shift, which refers to the change in the semantic meaning caused by augmentations. A3Point consists of two key components: semantic confusion prior (SCP) latent learning, which captures the model's inherent semantic confusion information, and semantic shift region (SSR) localization, which decouples semantic confusion and semantic shift, enabling adaptive optimization strategies for different disturbance levels. Extensive experiments on multiple standard generalized LiDAR segmentation benchmarks under adverse weather demonstrate the effectiveness of our method, setting new state-of-the-art results.
Chinese Translation
不利的天气条件显著降低了LiDAR点云语义分割网络的性能,因为它们引入了较大的分布偏移。现有的基于增强的方法试图通过在训练过程中模拟天气干扰来增强鲁棒性。然而,由于轻微增强和激进增强之间的权衡,它们难以充分利用增强的潜力。为了解决这个问题,我们提出了A3Point,一个自适应增强感知潜在学习框架,能够有效利用多样化的增强,同时减轻语义偏移,即由增强引起的语义意义变化。A3Point由两个关键组件组成:语义混淆先验(SCP)潜在学习,捕捉模型固有的语义混淆信息,以及语义偏移区域(SSR)定位,解耦语义混淆和语义偏移,从而为不同干扰水平提供自适应优化策略。在不利天气条件下对多个标准泛化LiDAR分割基准的广泛实验表明我们的方法的有效性,取得了新的最先进结果。
cs.CV / 168 / 2603.01082
Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval
超越全球相似性:面向细粒度、多条件的多模态检索
Abstract
Recent advances in multimodal large language models (MLLMs) have substantially expanded the capabilities of multimodal retrieval, enabling systems to align and retrieve information across visual and textual modalities. Yet, existing benchmarks largely focus on coarse-grained or single-condition alignment, overlooking real-world scenarios where user queries specify multiple interdependent constraints across modalities. To bridge this gap, we introduce MCMR (Multi-Conditional Multimodal Retrieval): a large-scale benchmark designed to evaluate fine-grained, multi-condition cross-modal retrieval under natural-language queries. MCMR spans five product domains: upper and bottom clothing, jewelry, shoes, and furniture. It also preserves rich long-form metadata essential for compositional matching. Each query integrates complementary visual and textual attributes, requiring models to jointly satisfy all specified conditions for relevance. We benchmark a diverse suite of MLLM-based multimodal retrievers and vision-language rerankers to assess their condition-aware reasoning abilities. Experimental results reveal: (i) distinct modality asymmetries across models; (ii) visual cues dominate early-rank precision, while textual metadata stabilizes long-tail ordering; and (iii) MLLM-based pointwise rerankers markedly improve fine-grained matching by explicitly verifying query-candidate consistency. Overall, MCMR establishes a challenging and diagnostic benchmark for advancing multimodal retrieval toward compositional, constraint-aware, and interpretable understanding. Our code and dataset are available at https://github.com/EIT-NLP/MCMR
Chinese Translation
近年来,多模态大型语言模型(MLLMs)的进展显著扩展了多模态检索的能力,使系统能够在视觉和文本模态之间对齐和检索信息。然而,现有基准主要集中在粗粒度或单一条件的对齐上,忽视了现实场景中用户查询指定多个相互依赖的模态约束的情况。为了解决这一问题,我们引入了MCMR(多条件多模态检索):一个旨在评估自然语言查询下细粒度、多条件跨模态检索的大规模基准。MCMR涵盖五个产品领域:上衣和下装、珠宝、鞋子和家具。它还保留了丰富的长格式元数据,这对于组合匹配至关重要。每个查询整合了互补的视觉和文本属性,要求模型共同满足所有指定条件以确保相关性。我们基准测试了一系列多样化的基于MLLM的多模态检索器和视觉-语言重排序器,以评估它们的条件感知推理能力。实验结果显示:(i)模型之间存在明显的模态不对称性;(ii)视觉线索主导了早期排名的精度,而文本元数据则稳定了长尾排序;(iii)基于MLLM的逐点重排序器通过明确验证查询-候选项的一致性显著改善了细粒度匹配。总体而言,MCMR建立了一个具有挑战性和诊断性的基准,推动多模态检索朝向组合、约束感知和可解释理解的发展。我们的代码和数据集可在 https://github.com/EIT-NLP/MCMR 获取。
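MCMR's relevance criterion (a candidate counts as relevant only if it satisfies all specified conditions jointly) is simple to state in code. The predicates below are illustrative stand-ins for the visual-attribute and textual-metadata constraints a real query would encode.

```python
def is_relevant(candidate, conditions):
    """MCMR-style relevance: a candidate is relevant only if it satisfies
    *all* query conditions jointly. Predicates are illustrative."""
    return all(pred(candidate) for pred in conditions)

# Hypothetical product record and multi-condition query.
shoe = {"color": "black", "material": "leather", "heel_cm": 3}
conditions = [
    lambda c: c["color"] == "black",        # visual attribute
    lambda c: c["material"] == "leather",   # textual metadata
    lambda c: c["heel_cm"] <= 5,            # fine-grained constraint
]
ok = is_relevant(shoe, conditions)
```

Failing any single condition flips the label, which is what distinguishes this setup from global-similarity retrieval, where a candidate can rank highly while violating one constraint.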
cs.CV / 169 / 2603.01083
Can Vision Language Models Assess Graphic Design Aesthetics? A Benchmark, Evaluation, and Dataset Perspective
视觉语言模型能评估图形设计美学吗?基准、评估与数据集视角
Abstract
Assessing the aesthetic quality of graphic design is central to visual communication, yet remains underexplored in vision language models (VLMs). We investigate whether VLMs can evaluate design aesthetics in ways comparable to humans. Prior work faces three key limitations: benchmarks restricted to narrow principles and coarse evaluation protocols, a lack of systematic VLM comparisons, and limited training data for model improvement. In this work, we introduce AesEval-Bench, a comprehensive benchmark spanning four dimensions, twelve indicators, and three fully quantifiable tasks: aesthetic judgment, region selection, and precise localization. Then, we systematically evaluate proprietary, open-source, and reasoning-augmented VLMs, revealing clear performance gaps against the nuanced demands of aesthetic assessment. Moreover, we construct a training dataset to fine-tune VLMs for this domain, leveraging human-guided VLM labeling to produce task labels at scale and indicator-grounded reasoning to tie abstract indicators to concrete design regions. Together, our work establishes the first systematic framework for aesthetic quality assessment in graphic design. Our code and dataset will be released at: \href{https://github.com/arctanxarc/AesEval-Bench}{https://github.com/arctanxarc/AesEval-Bench}
Chinese Translation
评估图形设计的美学质量是视觉传播的核心,但在视觉语言模型(VLMs)中仍然未得到充分探索。我们研究了VLMs是否能够以与人类可比的方式评估设计美学。以往的研究面临三个关键限制:基准受限于狭窄的原则和粗略的评估协议,缺乏系统的VLM比较,以及用于模型改进的训练数据有限。在本研究中,我们引入了AesEval-Bench,这是一个涵盖四个维度、十二个指标和三个完全可量化任务的综合基准:美学判断、区域选择和精确定位。随后,我们系统地评估了专有、开源和推理增强的VLMs,揭示了其在美学评估细致需求面前的明显性能差距。此外,我们构建了一个训练数据集,以微调VLMs在这一领域的表现,利用人类指导的VLM标注大规模生成任务标签,并通过指标基础推理将抽象指标与具体设计区域联系起来。我们的工作共同建立了图形设计美学质量评估的第一个系统框架。我们的代码和数据集将发布于:https://github.com/arctanxarc/AesEval-Bench
cs.CV / 170 / 2603.01096
Unified Vision-Language Modeling via Concept Space Alignment
通过概念空间对齐实现统一的视觉-语言建模
Abstract
We introduce V-SONAR, a vision-language embedding space extended from the text-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026), which supports 1500 text languages and 177 speech languages. To construct V-SONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate V-SONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0). Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM; LCM team et al. 2024) operating in SONAR and trained with English text only, can perform both single- and multi-visual concept understanding in a zero-shot manner. Finally, we introduce V-LCM, which extends the LCM with vision-language instruction tuning. V-LCM encodes vision and language inputs into a unified sequence of latent embeddings via V-SONAR and SONAR, and it is trained with the same latent diffusion objective for next-embedding prediction as in LCM's text-only pre-training. Experiments on a large-scale multilingual and -modal instruction-tuning data mixture highlight the potential of V-LCM: V-LCM matches state-of-the-art vision-language models on tasks covering image/video captioning and question answering, while significantly outperforming them across 61 rich- to low-resource languages out of all 62 tested languages.
Chinese Translation
我们介绍了 V-SONAR,这是一个扩展自仅文本嵌入空间 SONAR(Omnilingual Embeddings Team 等,2026)的视觉-语言嵌入空间,支持 1500 种文本语言和 177 种语音语言。为了构建 V-SONAR,我们提出了一种事后对齐管道,将现有视觉编码器的表示映射到 SONAR 空间。我们对 V-SONAR 进行了全面评估,结果表明其嵌入在文本到视频检索任务中表现出竞争力。配备 OMNISONAR 文本解码器后,V-SONAR 在视频字幕任务上进一步超越了最先进的视觉-语言模型,包括 DREAM-1K(BLEU 23.9 对比 19.6)和 PE-VIDEO(BLEU 39.0 对比 30.0)。利用 V-SONAR,我们首先展示了在 SONAR 中运行并仅用英语文本训练的 Large Concept Model (LCM; LCM team 等,2024) 能够以零样本方式进行单一和多重视觉概念理解。最后,我们介绍了 V-LCM,它通过视觉-语言指令调优扩展了 LCM。V-LCM 通过 V-SONAR 和 SONAR 将视觉和语言输入编码为统一的潜在嵌入序列,并以与 LCM 的仅文本预训练相同的潜在扩散目标进行下一个嵌入预测的训练。在大规模多语言和多模态指令调优数据混合上的实验突显了 V-LCM 的潜力:V-LCM 在涵盖图像/视频字幕和问答任务上与最先进的视觉-语言模型相匹配,同时在 62 种测试语言中的 61 种丰富到低资源语言上显著超越它们。
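A post-hoc alignment that maps an existing encoder's representations into a frozen embedding space can, in its simplest form, be a least-squares linear map fit on paired features. The sketch below uses synthetic data and is a minimal illustration of the idea, not the paper's actual pipeline (which may use nonlinear mappings and richer training).

```python
import numpy as np

# Sketch: fit a linear map that sends vision features V into a frozen
# text embedding space S by least squares, given paired samples.
rng = np.random.default_rng(0)
d_vis, d_txt, n = 8, 6, 200

W_true = rng.normal(size=(d_txt, d_vis))        # hypothetical ground truth
V = rng.normal(size=(n, d_vis))                 # vision-encoder features
S = V @ W_true.T                                # paired target embeddings

# Solve min_W ||V W - S||_F^2 in one call; W_hat has shape (d_vis, d_txt).
W_hat, *_ = np.linalg.lstsq(V, S, rcond=None)
aligned = V @ W_hat                             # features now live in S-space
```

Once aligned, the vision features inherit whatever machinery already exists in the target space (retrieval, decoding), which is the leverage the abstract describes for reusing SONAR's text decoder.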
cs.CV / 171 / 2603.01098
Differential privacy representation geometry for medical image analysis
医疗图像分析中的差分隐私表示几何
Abstract
The effect of differential privacy (DP) in medical imaging is typically evaluated only through end-to-end performance, leaving the mechanism of privacy-induced utility loss unclear. We introduce Differential Privacy Representation Geometry for Medical Imaging (DP-RGMI), a framework that interprets DP as a structured transformation of representation space and decomposes performance degradation into encoder geometry and task-head utilization. Geometry is quantified by representation displacement from initialization and spectral effective dimension, while utilization is measured as the gap between linear-probe and end-to-end utility. Across over 594,000 images from four chest X-ray datasets and multiple pretrained initializations, we show that DP is consistently associated with a utilization gap even when linear separability is largely preserved. At the same time, displacement and spectral dimension exhibit non-monotonic, initialization- and dataset-dependent reshaping, indicating that DP alters representation anisotropy rather than uniformly collapsing features. Correlation analysis reveals that the association between end-to-end performance and utilization is robust across datasets but can vary by initialization, while geometric quantities capture additional prior- and dataset-conditioned variation. These findings position DP-RGMI as a reproducible framework for diagnosing privacy-induced failure modes and informing privacy model selection.
Chinese Translation
差分隐私(DP)在医学成像中的影响通常仅通过端到端性能进行评估,这使得隐私引起的效用损失机制不够明确。我们提出了医疗成像中的差分隐私表示几何(DP-RGMI),这是一个将DP解释为表示空间的结构化变换的框架,并将性能下降分解为编码器几何和任务头利用率。几何通过从初始化的表示位移和谱有效维度进行量化,而利用率则通过线性探测与端到端效用之间的差距进行测量。在来自四个胸部X光数据集和多个预训练初始化的594,000多幅图像中,我们展示了即使在大部分线性可分性得以保持的情况下,DP仍与利用率差距持续相关。同时,位移和谱维度表现出非单调、依赖于初始化和数据集的重塑,表明DP改变了表示的各向异性,而不是均匀地压缩特征。相关性分析揭示了端到端性能与利用率之间的关联在不同数据集上是稳健的,但可能因初始化而异,而几何量则捕捉到额外的先验和数据集条件变化。这些发现使DP-RGMI成为一个可重复的框架,用于诊断隐私引起的失败模式并指导隐私模型选择。
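The framework's two measurements can be sketched concretely: a spectral effective dimension of the feature covariance, and the utilization gap between linear-probe and end-to-end utility. The participation-ratio estimator below is one common choice of "effective dimension" and may differ from the paper's exact definition.

```python
import numpy as np

def effective_dimension(features):
    """Participation ratio of covariance eigenvalues, one common
    'spectral effective dimension' (the paper's estimator may differ)."""
    X = features - features.mean(axis=0)
    lam = np.linalg.eigvalsh(X.T @ X / len(X))
    lam = np.clip(lam, 0.0, None)
    return lam.sum() ** 2 / (lam ** 2).sum()

def utilization_gap(linear_probe_acc, end_to_end_acc):
    """Utilization measured as the gap between linear-probe and
    end-to-end utility, as described in the abstract."""
    return linear_probe_acc - end_to_end_acc

rng = np.random.default_rng(1)
iso = rng.normal(size=(5000, 16))    # isotropic: spreads over all 16 axes
d_eff = effective_dimension(iso)     # close to the ambient dimension 16
gap = utilization_gap(0.91, 0.84)    # probe beats end-to-end -> positive gap
```

Anisotropic features (a few dominant eigenvalues) would drive `d_eff` well below 16, which is the kind of reshaping the abstract attributes to DP rather than a uniform collapse.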
cs.CV / 172 / 2603.01099
HeroGS: Hierarchical Guidance for Robust 3D Gaussian Splatting under Sparse Views
HeroGS:稀疏视角下鲁棒3D高斯泼溅的分层引导
Abstract
3D Gaussian Splatting (3DGS) has recently emerged as a promising approach in novel view synthesis, combining photorealistic rendering with real-time efficiency. However, its success heavily relies on dense camera coverage; under sparse-view conditions, insufficient supervision leads to irregular Gaussian distributions, characterized by globally sparse coverage, blurred background, and distorted high-frequency areas. To address this, we propose HeroGS, Hierarchical Guidance for Robust 3D Gaussian Splatting, a unified framework that establishes hierarchical guidance across the image, feature, and parameter levels. At the image level, sparse supervision is converted into pseudo-dense guidance, globally regularizing the Gaussian distributions and forming a consistent foundation for subsequent optimization. Building upon this, Feature-Adaptive Densification and Pruning (FADP) at the feature level leverages low-level features to refine high-frequency details and adaptively densifies Gaussians in background regions. The optimized distributions then support Co-Pruned Geometry Consistency (CPG) at parameter level, which guides geometric consistency through parameter freezing and co-pruning, effectively removing inconsistent splats. The hierarchical guidance strategy effectively constrains and optimizes the overall Gaussian distributions, thereby enhancing both structural fidelity and rendering quality. Extensive experiments demonstrate that HeroGS achieves high-fidelity reconstructions and consistently surpasses state-of-the-art baselines under sparse-view conditions.
Chinese Translation
3D高斯泼溅(3DGS)最近成为一种有前景的新视图合成方法,结合了照片级真实感渲染与实时效率。然而,其成功在很大程度上依赖于密集的相机覆盖;在稀疏视角条件下,监督不足导致高斯分布不规则,表现为全局稀疏覆盖、模糊背景和失真的高频区域。为了解决这一问题,我们提出了HeroGS,即鲁棒3D高斯泼溅的分层引导,这是一个统一框架,在图像、特征和参数层面建立分层引导。在图像层面,稀疏监督被转换为伪密集引导,全局地规范化高斯分布,为后续优化奠定一致的基础。在此基础上,特征层面的特征自适应密集化与修剪(FADP)利用低级特征来细化高频细节,并自适应地在背景区域密集化高斯分布。优化后的分布随后支持参数层面的共同修剪几何一致性(CPG),通过参数冻结和共同修剪引导几何一致性,有效去除不一致的高斯点。分层引导策略有效约束和优化整体高斯分布,从而增强结构保真度和渲染质量。大量实验表明,HeroGS在稀疏视角条件下实现了高保真重建,并始终超越了最先进的基线。
cs.CV / 173 / 2603.01103
Data-Efficient Brushstroke Generation with Diffusion Models for Oil Painting
基于扩散模型的油画数据高效笔触生成
Abstract
Many creative multimedia systems are built upon visual primitives such as strokes or textures, which are difficult to collect at scale and fundamentally different from natural image data. This data scarcity makes it challenging for modern generative models to learn expressive and controllable primitives, limiting their use in process-aware content creation. In this work, we study the problem of learning human-like brushstroke generation from a small set of hand-drawn samples (n=470) and propose StrokeDiff, a diffusion-based framework with Smooth Regularization (SmR). SmR injects stochastic visual priors during training, providing a simple mechanism to stabilize diffusion models under sparse supervision without altering the inference process. We further show how the learned primitives can be made controllable through a B\'ezier-based conditioning module and integrated into a complete stroke-based painting pipeline, including prediction, generation, ordering, and compositing. This demonstrates how data-efficient primitive modeling can support expressive and structured multimedia content creation. Experiments indicate that the proposed approach produces diverse and structurally coherent brushstrokes and enables paintings with richer texture and layering, validated by both automatic metrics and human evaluation.
Chinese Translation
许多创意多媒体系统建立在诸如笔触或纹理等视觉原语之上,这些原语难以大规模收集,并且与自然图像数据有根本性的不同。这种数据稀缺性使得现代生成模型在学习表现力强且可控的原语时面临挑战,限制了它们在过程感知内容创作中的应用。在本研究中,我们探讨了从一小组手绘样本(n=470)中学习类人笔触生成的问题,并提出了StrokeDiff,一个基于扩散的框架,结合了平滑正则化(Smooth Regularization, SmR)。SmR在训练过程中注入随机视觉先验,提供了一种简单的机制,以在稀疏监督下稳定扩散模型,而不改变推理过程。我们进一步展示了如何通过基于Bézier的条件模块使学习到的原语可控,并将其整合到一个完整的基于笔触的绘画管道中,包括预测、生成、排序和合成。这表明数据高效的原语建模可以支持表现力丰富且结构化的多媒体内容创作。实验表明,所提出的方法生成了多样且结构一致的笔触,并使绘画具有更丰富的纹理和层次感,这得到了自动指标和人工评估的验证。
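The Bézier-based conditioning module consumes control-point parameterizations of strokes; a cubic Bézier sampler like the one below (an illustrative sketch, not StrokeDiff's actual module) shows the kind of stroke skeleton such a conditioning signal describes.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=50):
    """Sample a cubic Bézier curve from four control points:
    B(t) = (1-t)^3 p0 + 3(1-t)^2 t p1 + 3(1-t) t^2 p2 + t^3 p3."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# Hypothetical control points for an arching brushstroke skeleton.
p0, p1 = np.array([0.0, 0.0]), np.array([0.3, 1.0])
p2, p3 = np.array([0.7, 1.0]), np.array([1.0, 0.0])
skeleton = cubic_bezier(p0, p1, p2, p3)   # (50, 2) polyline along the stroke
```

Conditioning generation on such a low-dimensional skeleton is what makes the learned primitives controllable: moving a control point reshapes the stroke path without retraining the generator.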
cs.CV / 174 / 2603.01108
GroundedSurg: A Multi-Procedure Benchmark for Language-Conditioned Surgical Tool Segmentation
GroundedSurg:一种基于语言的多程序手术工具分割基准
Abstract
Clinically reliable perception of surgical scenes is essential for advancing intelligent, context-aware intraoperative assistance such as instrument handoff guidance, collision avoidance, and workflow-aware robotic support. Existing surgical tool benchmarks primarily evaluate category-level segmentation, requiring models to detect all instances of predefined instrument classes. However, real-world clinical decisions often require resolving references to a specific instrument instance based on its functional role, spatial relation, or anatomical interaction capabilities not captured by current evaluation paradigms. We introduce GroundedSurg, the first language-conditioned, instance-level surgical grounding benchmark. Each instance pairs a surgical image with a natural-language description targeting a single instrument, accompanied by structured spatial grounding annotations including bounding boxes and point-level anchors. The dataset spans ophthalmic, laparoscopic, robotic, and open procedures, encompassing diverse instrument types, imaging conditions, and operative complexities. By jointly evaluating linguistic reference resolution and pixel-level localization, GroundedSurg enables a systematic and realistic evaluation of vision-language models in clinically realistic multi-instrument scenes. Extensive experiments demonstrate substantial performance gaps across modern segmentation and VLMs, highlighting the urgent need for clinically grounded vision-language reasoning in surgical AI systems. Code and data are publicly available at https://github.com/gaash-lab/GroundedSurg
Chinese Translation
临床上对手术场景的可靠感知对于推动智能、上下文感知的术中辅助(如器械交接指导、碰撞避免和基于工作流程的机器人支持)至关重要。现有的手术工具基准主要评估类别级别的分割,要求模型检测所有预定义器械类别的实例。然而,现实世界中的临床决策通常需要根据特定器械的功能角色、空间关系或解剖交互能力来解析对特定器械实例的引用,而这些在当前的评估范式中并未得到体现。我们提出了GroundedSurg,这是第一个基于语言的实例级手术定位基准。每个实例将手术图像与针对单一器械的自然语言描述配对,并附有包括边界框和点级锚点在内的结构化空间定位注释。该数据集涵盖眼科、腹腔镜、机器人和开放手术,包含多种器械类型、成像条件和手术复杂性。通过联合评估语言引用解析和像素级定位,GroundedSurg使得在临床现实的多器械场景中对视觉-语言模型进行系统和现实的评估成为可能。大量实验表明,现代分割模型和视觉-语言模型(VLMs)之间存在显著的性能差距,突显了在手术人工智能系统中迫切需要基于临床的视觉-语言推理。代码和数据可在 https://github.com/gaash-lab/GroundedSurg 获取。
cs.CV / 175 / 2603.01111
DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles
DeAR:通过分解注意力头角色实现细粒度的视觉-语言模型适应
Abstract
Prompt learning is a dominant paradigm for adapting pre-trained Vision-Language Models (VLMs) to downstream tasks. However, existing methods often rely on a simplistic, layer-centric view, assuming shallow layers capture general features while deep layers handle task-specific knowledge. This assumption results in uncontrolled interactions between learnable tokens and original tokens. Task-specific knowledge could degrades the model's core generalization and creates a trade-off between task adaptation and the preservation of zero-shot generalization. To address this, we challenge the layer-centric view and propose \textbf{DeAR}, a framework that achieves fine-grained VLM adaptation by \textbf{De}composing \textbf{A}ttention head \textbf{R}oles. We posit that the functional specialization within VLMs occurs not between layers, but at the finer-grained level of individual attention heads in the deeper layers. Based on this insight, we introduce a novel metric, Concept Entropy, to systematically classify attention heads into distinct functional roles: \textit{Attribute}, \textit{Generalization}, and \textit{Mixed}. Guided by these roles, we introduce specialized attribute tokens and a Role-Based Attention Mask mechanism to precisely control information flow, ensuring generalization heads remain isolated from task-specific knowledge. We further incorporate a Task-Adaptive Fusion Strategy for inference. Extensive experiments on fifteen datasets show that DeAR achieves a strong balance between task adaptation and generalization, outperforming previous methods across various tasks.
Chinese Translation
提示学习是将预训练视觉-语言模型(VLM)适应于下游任务的主流范式。然而,现有方法往往依赖于一种简单的、以层为中心的视角,假设浅层捕捉一般特征,而深层处理任务特定知识。这一假设导致可学习的标记与原始标记之间的相互作用失控。任务特定知识可能会削弱模型的核心泛化能力,并在任务适应与零样本泛化的保持之间产生权衡。为了解决这个问题,我们挑战了以层为中心的观点,提出了DeAR,一个通过分解注意力头角色(Decomposing Attention head Roles)实现细粒度VLM适应的框架。我们认为,VLM内部的功能专业化并不是发生在层之间,而是在更细粒度的深层单个注意力头水平上。基于这一见解,我们引入了一种新颖的度量标准,概念熵(Concept Entropy),以系统地将注意力头分类为不同的功能角色:属性(Attribute)、泛化(Generalization)和混合(Mixed)。在这些角色的指导下,我们引入了专门的属性标记和基于角色的注意力掩码机制,以精确控制信息流,确保泛化头与任务特定知识保持隔离。我们进一步结合了一种任务自适应融合策略用于推理。在十五个数据集上的大量实验表明,DeAR在任务适应与泛化之间实现了良好的平衡,在各种任务中优于之前的方法。
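Head-role assignment via Concept Entropy could look like the following sketch: Shannon entropy over a head's normalized concept-similarity distribution, thresholded into the three roles. Both this entropy definition and the thresholds are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def concept_entropy(concept_sims):
    """Shannon entropy of a head's normalized concept-similarity
    distribution -- one plausible reading of 'Concept Entropy'."""
    p = np.asarray(concept_sims, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def classify_head(entropy, lo=0.5, hi=1.2):
    """Illustrative thresholds: low entropy -> focused Attribute head,
    high entropy -> broad Generalization head, otherwise Mixed."""
    if entropy < lo:
        return "Attribute"
    if entropy > hi:
        return "Generalization"
    return "Mixed"

focused = concept_entropy([0.94, 0.02, 0.02, 0.02])  # peaks on one concept
broad = concept_entropy([0.25, 0.25, 0.25, 0.25])    # spread uniformly
```

A head that attends to a single concept gets near-zero entropy and is tagged Attribute; a uniformly spread head reaches the maximum `log(n)` and is tagged Generalization, which is then shielded from task-specific tokens by the role-based mask.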
cs.CV / 176 / 2603.01115
GuiDINO: Rethinking Vision Foundation Model in Medical Image Segmentation
GuiDINO:重新思考医学图像分割中的视觉基础模型
Abstract
Foundation vision models are increasingly adopted in medical image analysis. Due to domain shift, these pretrained models misalign with medical image segmentation needs without being fully fine-tuned or lightly adapted. We introduce GuiDINO, a framework that repositions a native foundation model to act as a visual guidance generator for downstream segmentation. GuiDINO extracts visual feature representations from DINOv3 and converts them into a spatial guide mask via a lightweight TokenBook mechanism, which aggregates token-prototype similarities. This guide mask gates feature activations in multiple segmentation backbones, thereby injecting foundation-model priors while preserving the inductive biases and efficiency of dedicated medical architectures. Training relies on a guide-supervision objective that aligns the guide mask to ground-truth regions, optionally augmented by a boundary-focused hinge loss to sharpen fine structures. GuiDINO also supports parameter-efficient adaptation through LoRA on the DINOv3 guide backbone. Across diverse medical datasets and nnUNet-style inference, GuiDINO consistently improves segmentation quality and boundary robustness, suggesting a practical alternative to fine-tuning and offering a new perspective on how foundation models can best serve medical vision. Code is available at https://github.com/Hi-FishU/GuiDINO
Chinese Translation
基础视觉模型在医学图像分析中被越来越广泛地采用。由于领域转移,这些预训练模型在未经过完全微调或轻微适应的情况下,与医学图像分割的需求不匹配。我们提出了GuiDINO,一个将原生基础模型重新定位为下游分割的视觉引导生成器的框架。GuiDINO从DINOv3中提取视觉特征表示,并通过轻量级的TokenBook机制将其转换为空间引导掩膜,该机制聚合了token原型的相似性。该引导掩膜在多个分割主干网络中控制特征激活,从而在保留医学专用架构的归纳偏差和效率的同时,注入基础模型的先验知识。训练依赖于引导监督目标损失,该损失将引导掩膜与真实区域对齐,此外可选地通过关注边界的铰链损失来增强细结构的清晰度。GuiDINO还支持通过LoRA在DINOv3引导主干上的参数高效适应。在多样的医学数据集和nnUNet风格的推理中,GuiDINO始终提高了分割质量和边界鲁棒性,表明它是微调的实用替代方案,并为基础模型如何更好地服务于医学视觉提供了新的视角。代码可在 https://github.com/Hi-FishU/GuiDINO 获取。
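Token-prototype aggregation and guide-mask gating can be sketched in a few lines. The shapes, the mean aggregation, and the sigmoid squashing below are illustrative assumptions about how a TokenBook-style mechanism could produce a per-token gate, not GuiDINO's actual implementation.

```python
import numpy as np

def token_prototype_mask(tokens, prototypes):
    """TokenBook-style sketch: per-token similarity to a small prototype
    set, aggregated into one scalar guide value per spatial token."""
    sims = tokens @ prototypes.T                      # (n_tokens, n_protos)
    return 1.0 / (1.0 + np.exp(-sims.mean(axis=1)))   # sigmoid -> (0, 1)

def gate_features(features, guide):
    """Gate backbone activations with the spatial guide mask."""
    return features * guide[:, None]

rng = np.random.default_rng(2)
tokens = rng.normal(size=(64, 32))     # 64 spatial tokens from the guide encoder
prototypes = rng.normal(size=(4, 32))  # learned prototype vectors
guide = token_prototype_mask(tokens, prototypes)
gated = gate_features(rng.normal(size=(64, 16)), guide)
```

Because only a multiplicative mask crosses from the foundation model into the segmentation backbone, the backbone's own architecture and inductive biases stay untouched, which is the design point the abstract emphasizes.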
cs.CV / 177 / 2603.01116
Improved MambaBDA Framework for Robust Building Damage Assessment Across Disaster Domains
改进的 MambaBDA 框架用于跨灾害领域的稳健建筑损伤评估
Abstract
Reliable post-disaster building damage assessment (BDA) from satellite imagery is hindered by severe class imbalance, background clutter, and domain shift across disaster types and geographies. In this work, we address these problems and explore ways to improve MambaBDA, the BDA network of the ChangeMamba architecture, one of the most successful BDA models. The approach enhances MambaBDA with three modular components: (i) Focal Loss to mitigate class imbalance in damage classification, (ii) lightweight Attention Gates to suppress irrelevant context, and (iii) a compact Alignment Module to spatially warp pre-event features toward post-event content before decoding. We experiment on multiple satellite imagery datasets, including xBD, Pakistan Flooding, Turkey Earthquake, and Ida Hurricane, and conduct in-domain and cross-dataset tests. The proposed modular enhancements yield consistent improvements over the baseline model, with 0.8% to 5% performance gains in-domain, and up to 27% on unseen disasters. This indicates that the proposed enhancements are especially beneficial for the generalization capability of the system.
Chinese Translation
从卫星图像中可靠地进行灾后建筑损伤评估(BDA)受到严重类别不平衡、背景杂波以及不同灾害类型和地理区域之间的领域转移的阻碍。在本研究中,我们针对这些问题进行了探讨,并探索了改进 MambaBDA 的方法,MambaBDA 是 ChangeMamba 架构下的 BDA 网络,属于最成功的 BDA 模型之一。该方法通过三个模块化组件增强了 MambaBDA:(i)Focal Loss 用于缓解损伤分类中的类别不平衡,(ii)轻量级注意力门(Attention Gates)用于抑制无关背景,以及(iii)紧凑的对齐模块(Alignment Module)用于在解码之前将事件前特征在空间上扭曲对齐到事件后内容。我们在多个卫星图像数据集上进行了实验,包括 xBD、巴基斯坦洪水、土耳其地震和艾达飓风,并进行了领域内和跨数据集的测试。所提出的模块化增强在基线模型上产生了一致的改进,在领域内的性能提升幅度为 0.8% 到 5%,在未见过的灾害上则可达到 27%。这表明所提出的增强措施对系统的泛化能力尤其有益。
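Component (i) is the standard focal loss, FL(p_t) = -α(1-p_t)^γ log(p_t), which down-weights easy majority-class pixels so that rare damage classes dominate the gradient. A minimal numpy sketch (default α, γ follow the common convention; the paper's settings may differ):

```python
import numpy as np

def focal_loss(p_true, gamma=2.0, alpha=0.25, eps=1e-8):
    """Focal loss FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t);
    with gamma=0 and alpha=1 it reduces to plain cross-entropy."""
    p = np.clip(p_true, eps, 1.0)
    return -alpha * (1.0 - p) ** gamma * np.log(p)

# A well-classified majority-class pixel vs. a hard minority-class pixel:
easy = focal_loss(np.array([0.95]))   # confidently correct -> tiny loss
hard = focal_loss(np.array([0.10]))   # badly wrong -> large loss
```

The `(1 - p_t)^gamma` factor is what addresses the class-imbalance problem the abstract names: easy background pixels contribute almost nothing, so rare damage categories are not drowned out.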
cs.CV / 178 / 2603.01124
ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models
ClinCoT:面向临床的视觉思维链用于医学视觉语言模型
Abstract
Medical Vision-Language Models have shown promising potential in clinical decision support, yet they remain prone to factual hallucinations due to insufficient grounding in localized pathological evidence. Existing medical alignment methods primarily operate at the response level through preference optimization, improving output correctness but leaving intermediate reasoning weakly connected to visual regions. Although chain-of-thought (CoT) enhances multimodal reasoning, it remains largely text-centric, limiting effective integration of clinical visual cues. To address this gap, we propose ClinCoT, a clinical-aware visual chain-of-thought framework that transforms preference optimization from response-level correction to visual-driven reasoning. We introduce an automatic data generation pipeline that constructs clinically grounded preference pairs through reasoning with hypotheses-driven region proposals. Multiple Med-LLM evaluators rank and assign scores to each response, and these rankings serve as supervision to train the target model. We further introduce a scoring-based margin-aware optimization strategy that incorporates both preference ranking and score difference to refine region-level reasoning trajectories. To maintain alignment as the model's policy evolves during training, we adopt an iterative learning scheme that dynamically regenerates preference data. Extensive experiments on three medical VQA and report generation benchmarks demonstrate that ClinCoT consistently improves factual grounding and achieves superior performance compared with existing preference-based alignment methods.
Chinese Translation
医学视觉语言模型在临床决策支持中显示出良好的潜力,但由于缺乏对局部病理证据的充分基础,它们仍然容易出现事实幻觉。现有的医学对齐方法主要在响应层面通过偏好优化进行操作,虽然提高了输出的正确性,但使得中间推理与视觉区域的连接较弱。尽管思维链(Chain-of-Thought, CoT)增强了多模态推理,但其仍然主要以文本为中心,限制了临床视觉线索的有效整合。为了解决这一问题,我们提出了ClinCoT,一个面向临床的视觉思维链框架,将偏好优化从响应层面的修正转变为视觉驱动的推理。我们引入了一个自动数据生成管道,通过假设驱动的区域提议进行推理,构建临床基础的偏好对。多个医学语言模型(Med-LLMs)评估者对每个响应进行排名并分配分数,这些排名作为监督用于训练目标模型。我们进一步引入了一种基于评分的边际感知优化策略,结合偏好排名和分数差异,以细化区域级推理轨迹。为了在训练过程中保持模型策略的对齐,我们采用了一种迭代学习方案,动态再生偏好数据。在三个医学视觉问答(VQA)和报告生成基准上的广泛实验表明,ClinCoT在事实基础上持续改善,并且与现有的基于偏好的对齐方法相比,表现出更优的性能。
cs.CV / 179 / 2603.01125
Predictive Reasoning with Augmented Anomaly Contrastive Learning for Compositional Visual Relations
基于增强异常对比学习的组合视觉关系预测推理
Abstract
While visual reasoning for simple analogies has received significant attention, compositional visual relations (CVR) remain relatively unexplored due to their greater complexity. To solve CVR tasks, we propose Predictive Reasoning with Augmented Anomaly Contrastive Learning (PR-A$^2$CL), \ie, to identify an outlier image given three other images that follow the same compositional rules. To address the challenge of modelling abundant compositional rules, an Augmented Anomaly Contrastive Learning is designed to distil discriminative and generalizable features by maximizing similarity among normal instances while minimizing similarity between normal and anomalous outliers. More importantly, a predict-and-verify paradigm is introduced for rule-based reasoning, in which a series of Predictive Anomaly Reasoning Blocks (PARBs) iteratively leverage features from three out of the four images to predict those of the remaining one. Throughout the subsequent verification stage, the PARBs progressively pinpoint the specific discrepancies attributable to the underlying rules. Experimental results on SVRT, CVR and MC$^2$R datasets show that PR-A$^2$CL significantly outperforms state-of-the-art reasoning models.
Chinese Translation
尽管简单类比的视觉推理受到了广泛关注,但由于组合视觉关系(CVR)的复杂性,它们仍然相对未被探索。为了解决CVR任务,我们提出了基于增强异常对比学习的预测推理(PR-A$^2$CL),即在给定遵循相同组合规则的三幅图像的情况下,识别出一幅异常图像。为了解决建模丰富组合规则的挑战,我们设计了一种增强异常对比学习,以通过最大化正常实例之间的相似性,同时最小化正常实例与异常离群点之间的相似性,提炼出具有区分性和可泛化的特征。更重要的是,我们引入了一种基于规则的推理的预测与验证范式,其中一系列预测异常推理模块(PARBs)迭代地利用四幅图像中的三幅图像的特征来预测剩余一幅图像的特征。在随后的验证阶段,PARBs逐步确定归因于潜在规则的具体差异。在SVRT、CVR和MC$^2$R数据集上的实验结果表明,PR-A$^2$CL显著优于最先进的推理模型。
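The contrastive objective above, pulling the three rule-following images together while pushing the outlier away, can be sketched with cosine similarity. The similarity measure and temperature are standard choices assumed here, not taken from the paper:

```python
import math

def anomaly_contrastive_loss(normals, anomaly, temp=0.5):
    """Sketch of the anomaly-contrastive idea: lower loss means a
    tighter cluster of normal instances and a more distant outlier."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return num / (na * nb)
    # pairwise similarity among the rule-following (normal) instances
    pos = [cos(a, b) for i, a in enumerate(normals) for b in normals[i + 1:]]
    # similarity between each normal instance and the outlier
    neg = [cos(a, anomaly) for a in normals]
    return (sum(neg) / len(neg) - sum(pos) / len(pos)) / temp
```

With feature vectors in place of images, the loss decreases as the outlier's embedding moves away from the normal cluster.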
cs.CV / 180 / 2603.01140
Teacher-Guided Causal Interventions for Image Denoising: Orthogonal Content-Noise Disentanglement in Vision Transformers
教师引导的因果干预用于图像去噪:视觉变换器中的正交内容-噪声解耦
Abstract
Conventional image denoising models often inadvertently learn spurious correlations between environmental factors and noise patterns. Moreover, due to high-frequency ambiguity, they struggle to reliably distinguish subtle textures from stochastic noise, resulting in over-removed details or residual noise artifacts. We therefore revisit denoising via causal intervention, arguing that purely correlational fitting entangles intrinsic content with extrinsic noise, which directly degrades robustness under distribution shifts. Motivated by this, we propose the Teacher-Guided Causal Disentanglement Network (TCD-Net), which explicitly decomposes the generative mechanism via structured interventions on feature spaces within a Vision Transformer framework. Specifically, our method integrates three key components: (1) An Environmental Bias Adjustment (EBA) module projects features into a stable, de-centered subspace to suppress global environmental bias (de-confounding). (2) A dual-branch disentanglement head employs an orthogonality constraint to force a strict separation between content and noise representations, preventing information leakage. (3) To resolve structural ambiguity, we leverage Nano Banana Pro, Google's reasoning-guided AI image generation model, to guide a causal prior, effectively pulling content representations back onto the natural-image manifold. Extensive experiments demonstrate that TCD-Net outperforms mainstream methods across multiple benchmarks in both fidelity and efficiency, achieving a real-time speed of 104.2 FPS on a single RTX 5090 GPU.
Chinese Translation
传统的图像去噪模型常常无意中学习到环境因素与噪声模式之间的虚假相关性。此外,由于高频模糊性,它们难以可靠地区分细微纹理与随机噪声,导致细节过度去除或残留噪声伪影。因此,我们重新审视通过因果干预进行去噪,认为纯粹的相关拟合将内在内容与外在噪声纠缠在一起,这直接降低了在分布变化下的鲁棒性。基于此,我们提出了教师引导的因果解耦网络(Teacher-Guided Causal Disentanglement Network,TCD-Net),该网络通过在视觉变换器框架内对特征空间进行结构化干预,明确分解生成机制。具体而言,我们的方法整合了三个关键组件:(1) 环境偏差调整(Environmental Bias Adjustment,EBA)模块将特征投影到一个稳定的去中心化子空间,以抑制全球环境偏差(去混淆)。(2) 双分支解耦头采用正交约束,强制内容与噪声表示之间严格分离,防止信息泄漏。(3) 为了解决结构模糊性,我们利用Nano Banana Pro,谷歌的推理引导AI图像生成模型,来引导因果先验,有效地将内容表示拉回到自然图像流形上。大量实验表明,TCD-Net在多个基准测试中在保真度和效率上均优于主流方法,在单个RTX 5090 GPU上实现了104.2 FPS的实时速度。
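The dual-branch orthogonality constraint in component (2) can be sketched as a penalty on the squared cosine between the content and noise feature vectors; the exact penalty form is an assumption:

```python
import math

def orthogonality_loss(content, noise):
    """Squared cosine between the two branch outputs; zero when the
    content and noise representations are orthogonal (no leakage)."""
    dot = sum(c * n for c, n in zip(content, noise))
    nc = math.sqrt(sum(c * c for c in content))
    nn = math.sqrt(sum(n * n for n in noise))
    cos = dot / (nc * nn)
    return cos * cos
```

Driving this penalty to zero forces the two branches to carry disjoint directions of the feature space.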
cs.CV / 181 / 2603.01142
ArtLLM: Generating Articulated Assets via 3D LLM
ArtLLM:通过3D LLM生成可动资产
Abstract
Creating interactive digital environments for gaming, robotics, and simulation relies on articulated 3D objects whose functionality emerges from their part geometry and kinematic structure. However, existing approaches remain fundamentally limited: optimization-based reconstruction methods require slow, per-object joint fitting and typically handle only simple, single-joint objects, while retrieval-based methods assemble parts from a fixed library, leading to repetitive geometry and poor generalization. To address these challenges, we introduce ArtLLM, a novel framework for generating high-quality articulated assets directly from complete 3D meshes. At its core is a 3D multimodal large language model trained on a large-scale articulation dataset curated from both existing articulation datasets and procedurally generated objects. Unlike prior work, ArtLLM autoregressively predicts a variable number of parts and joints, inferring their kinematic structure in a unified manner from the object's point cloud. This articulation-aware layout then conditions a 3D generative model to synthesize high-fidelity part geometries. Experiments on the PartNet-Mobility dataset show that ArtLLM significantly outperforms state-of-the-art methods in both part layout accuracy and joint prediction, while generalizing robustly to real-world objects. Finally, we demonstrate its utility in constructing digital twins, highlighting its potential for scalable robot learning.
Chinese Translation
为游戏、机器人和仿真创建交互式数字环境依赖于可动的3D对象,其功能源于部件几何形状和运动结构。然而,现有的方法在根本上仍然有限:基于优化的重建方法需要缓慢的逐对象关节拟合,通常仅处理简单的单关节对象,而基于检索的方法则从固定库中组装部件,导致几何形状重复且泛化能力差。为了解决这些挑战,我们提出了ArtLLM,一个从完整3D网格直接生成高质量可动资产的新框架。其核心是一个3D多模态大型语言模型,经过在一个大规模的可动数据集上训练,该数据集由现有的可动数据集和程序生成的对象整理而成。与之前的工作不同,ArtLLM自回归地预测可变数量的部件和关节,从对象的点云中以统一的方式推断其运动结构。这种关注可动性的布局随后为3D生成模型提供条件,以合成高保真度的部件几何形状。在PartNet-Mobility数据集上的实验表明,ArtLLM在部件布局准确性和关节预测方面显著优于最先进的方法,同时在真实世界对象上具有良好的泛化能力。最后,我们展示了其在构建数字双胞胎方面的实用性,突显了其在可扩展机器人学习中的潜力。
cs.CV / 182 / 2603.01143
TC-SSA: Token Compression via Semantic Slot Aggregation for Gigapixel Pathology Reasoning
TC-SSA:通过语义槽聚合进行令牌压缩以实现千兆像素病理推理
Abstract
The application of large vision-language models to computational pathology holds great promise for diagnostic assistants but faces a critical computational bottleneck: the gigapixel scale of Whole Slide Images (WSIs). A single WSI typically contains over $10^5$ patches, creating sequence lengths that exceed the constraints of standard Transformer architectures. Existing solutions often resort to spatial sampling, which risks discarding diagnostically critical evidence. To address this, we propose TC-SSA (Token Compression via Semantic Slot Aggregation), a learnable token compression framework that aggregates patch features into a fixed number of semantic slots. A gated routing module assigns patches to slots using sparse Top-2 routing, followed by weighted aggregation, enabling global slide coverage under a strict token budget. The resulting representation retains diagnostically relevant information while reducing the number of visual tokens to 1.7% of the original sequence. On the SlideBench(TCGA), our model achieves 78.34% overall accuracy and 77.14% on the diagnosis subset, outperforming sampling-based baselines under comparable token budgets. The method also generalizes to MIL classification, reaching AUC of 95.83% on TCGA-BRCA, 98.27% on TCGA-NSCLC and 79.80% on PANDA. These results suggest that learnable semantic aggregation provides an effective trade-off between efficiency and diagnostic performance for gigapixel pathology reasoning.
Chinese Translation
将大型视觉-语言模型应用于计算病理学为诊断助手带来了巨大的潜力,但面临着一个关键的计算瓶颈:全切片图像(Whole Slide Images,WSIs)的千兆像素规模。单个WSI通常包含超过10^5个图块,导致序列长度超出标准Transformer架构的限制。现有解决方案通常诉诸于空间采样,这可能会丢弃诊断上至关重要的证据。为了解决这个问题,我们提出了TC-SSA(通过语义槽聚合进行令牌压缩),这是一种可学习的令牌压缩框架,将图块特征聚合到固定数量的语义槽中。一个门控路由模块使用稀疏的Top-2路由将图块分配到槽中,随后进行加权聚合,从而在严格的令牌预算下实现全局切片覆盖。生成的表示保留了诊断相关信息,同时将视觉令牌的数量减少到原始序列的1.7%。在SlideBench(TCGA)上,我们的模型实现了78.34%的总体准确率和77.14%的诊断子集准确率,优于在可比令牌预算下的基于采样的基线方法。该方法还可以推广到MIL分类,在TCGA-BRCA上达到95.83%的AUC,在TCGA-NSCLC上达到98.27%,在PANDA上达到79.80%。这些结果表明,可学习的语义聚合在千兆像素病理推理中提供了效率与诊断性能之间的有效权衡。
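The sparse Top-2 routing with weighted aggregation can be sketched as follows. In the actual model the slot embeddings and router are learned; here they are plain inputs, and dot-product scoring is an assumption:

```python
import math

def top2_route(patches, slots):
    """Sketch of gated Top-2 routing: each patch contributes to its
    two best-scoring semantic slots with softmax weights."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    dim = len(slots[0])
    agg = [[0.0] * dim for _ in slots]   # weighted-sum accumulator
    wsum = [0.0] * len(slots)            # total routing weight per slot
    for p in patches:
        scores = [dot(p, s) for s in slots]
        top2 = sorted(range(len(slots)), key=lambda i: -scores[i])[:2]
        e = [math.exp(scores[i]) for i in top2]
        z = sum(e)
        for i, ei in zip(top2, e):
            w = ei / z
            wsum[i] += w
            for d in range(dim):
                agg[i][d] += w * p[d]
    # weighted average per occupied slot; empty slots stay zero
    return [[v / wsum[i] if wsum[i] > 0 else 0.0 for v in agg[i]]
            for i in range(len(slots))]
```

However many patches the slide contains, the output is always one vector per slot, which is what keeps the token budget fixed.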
cs.CV / 183 / 2603.01147
ConVibNet: Needle Detection during Continuous Insertion via Frequency-Inspired Features
ConVibNet:基于频率启发特征的连续插入过程中的针头检测
Abstract
Purpose: Ultrasound-guided needle interventions are widely used in clinical practice, but their success critically depends on accurate needle placement, which is frequently hindered by the poor and intermittent visibility of needles in ultrasound images. Existing approaches remain limited by artifacts, occlusions, and low contrast, and often fail to support real-time continuous insertion. To overcome these challenges, this study introduces a robust real-time framework for continuous needle detection. Methods: We present ConVibNet, an extension of VibNet for detecting needles with significantly reduced visibility, addressing real-time, continuous needle tracking during insertion. ConVibNet leverages temporal dependencies across successive ultrasound frames to enable continuous estimation of both needle tip position and shaft angle in dynamic scenarios. To strengthen temporal awareness of needle-tip motion, we introduce a novel intersection-and-difference loss that explicitly leverages motion correlations across consecutive frames. In addition, we curated a dedicated dataset for model development and evaluation. Results: The performance of the proposed ConVibNet model was evaluated on our dataset, demonstrating superior accuracy compared to the baseline VibNet and UNet-LSTM models. Specifically, ConVibNet achieved a tip error of 2.80±2.42 mm and an angle error of 1.69±2.00 deg. These results represent a 0.75 mm improvement in tip localization accuracy over the best-performing baseline, while preserving real-time inference capability. Conclusion: ConVibNet advances real-time needle detection in ultrasound-guided interventions by integrating temporal correlation modeling with a novel intersection-and-difference loss, thereby improving accuracy and robustness and demonstrating high potential for integration into autonomous insertion systems.
Chinese Translation
目的:超声引导下的针头干预在临床实践中被广泛应用,但其成功在很大程度上依赖于针头的准确定位,而这一过程常常受到超声图像中针头可见性差和间歇性的限制。现有方法受到伪影、遮挡和低对比度的限制,通常无法支持实时的连续插入。为了解决这些挑战,本研究提出了一种强大的实时连续针头检测框架。方法:我们提出了ConVibNet,这是VibNet的扩展,旨在检测可见性显著降低的针头,解决插入过程中的实时连续针头跟踪问题。ConVibNet利用连续超声帧之间的时间依赖性,实现动态场景中针头尖端位置和杆角度的连续估计。为了增强对针头尖端运动的时间感知,我们引入了一种新颖的交集与差异损失,明确利用连续帧之间的运动相关性。此外,我们还创建了一个专门的数据集用于模型开发和评估。结果:在我们的数据集上评估了所提出的ConVibNet模型的性能,结果显示其准确性优于基线模型VibNet和UNet-LSTM。具体而言,ConVibNet实现了2.80±2.42毫米的尖端误差和1.69±2.00度的角度误差。这些结果相比最佳基线模型在尖端定位准确性上提高了0.75毫米,同时保持了实时推理能力。结论:ConVibNet通过将时间相关性建模与新颖的交集与差异损失相结合,推动了超声引导干预中的实时针头检测,从而提高了准确性和鲁棒性,并展示了其在自主插入系统中集成的高潜力。
cs.CV / 184 / 2603.01161
GRAD-Former: Gated Robust Attention-based Differential Transformer for Change Detection
GRAD-Former:基于门控鲁棒注意力的差异变换器用于变化检测
Abstract
Change detection (CD) in remote sensing aims to identify semantic differences between satellite images captured at different times. While deep learning has significantly advanced this field, existing approaches based on convolutional neural networks (CNNs), transformers and Selective State Space Models (SSMs) still struggle to precisely delineate change regions. In particular, traditional transformer-based methods suffer from quadratic computational complexity when applied to very high-resolution (VHR) satellite images and often perform poorly with limited training data, leading to under-utilization of the rich spatial information available in VHR imagery. We present GRAD-Former, a novel framework that enhances contextual understanding while maintaining efficiency through reduced model size. The proposed framework consists of a novel encoder with an Adaptive Feature Relevance and Refinement (AFRAR) module, together with fusion and decoder blocks. AFRAR integrates global-local contextual awareness through two proposed components: the Selective Embedding Amplification (SEA) module and the Global-Local Feature Refinement (GLFR) module. SEA and GLFR leverage gating mechanisms and differential attention, respectively, which generate multiple softmax maps to capture important features while suppressing irrelevant ones. Multiple experiments across three challenging CD datasets (LEVIR-CD, CDD, DSIFN-CD) demonstrate GRAD-Former's superior performance compared to existing approaches. Notably, GRAD-Former outperforms the current state-of-the-art models across all the metrics and all the datasets while using fewer parameters. Our framework establishes a new benchmark for remote sensing change detection performance. Our code will be released at: https://github.com/Ujjwal238/GRAD-Former
Chinese Translation
遥感中的变化检测(CD)旨在识别在不同时间捕获的卫星图像之间的语义差异。尽管深度学习在这一领域取得了显著进展,但现有基于卷积神经网络(CNNs)、变换器和选择性状态空间模型(SSMs)的方法仍然难以精确划定变化区域。特别是,传统的基于变换器的方法在应用于超高分辨率(VHR)卫星图像时面临二次计算复杂度的问题,并且在训练数据有限的情况下表现不佳,导致未能充分利用VHR图像中丰富的空间信息。我们提出了GRAD-Former,这是一种新颖的框架,旨在增强上下文理解,同时通过减少模型规模来保持效率。该框架由一个新颖的编码器、具有自适应特征相关性和精炼(AFRAR)模块、融合块和解码器块组成。AFRAR通过两个提出的组件:选择性嵌入放大(SEA)模块和全局-局部特征精炼(GLFR)模块,集成了全局-局部上下文意识。SEA和GLFR分别利用门控机制和差异注意力,生成多个softmax堆以捕捉重要特征,同时最小化捕获的不相关特征。在三个具有挑战性的CD数据集(LEVIR-CD、CDD、DSIFN-CD)上的多次实验表明,GRAD-Former的性能优于现有方法。值得注意的是,GRAD-Former在所有指标和所有数据集上均超越了当前的最先进模型,同时使用的参数更少。我们的框架为遥感变化检测性能建立了新的基准。我们的代码将发布在:https://github.com/Ujjwal238/GRAD-Former
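The differential attention used by GLFR follows the general differential-attention idea of subtracting two softmax maps so that score patterns common to both (noise) cancel. A minimal sketch over 1D score vectors; in a real layer the two score sets and the weight `lam` come from learned projections:

```python
import math

def softmax(xs):
    m = max(xs)  # shift for numerical stability
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def differential_attention(scores_a, scores_b, lam=0.5):
    """Difference of two softmax maps; identical score patterns
    (common-mode noise) cancel out of the attention weights."""
    sa, sb = softmax(scores_a), softmax(scores_b)
    return [a - lam * b for a, b in zip(sa, sb)]
```

Note that the resulting weights sum to `1 - lam` rather than 1, which is part of how the subtraction sharpens the map.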
cs.CV / 185 / 2603.01163
BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling
BeautyGRPO:通过动态路径引导和细粒度偏好建模实现面部修饰的美学对齐
Abstract
Face retouching requires removing subtle imperfections while preserving unique facial identity features, in order to enhance overall aesthetic appeal. However, existing methods suffer from a fundamental trade-off. Supervised learning on labeled data is constrained to pixel-level label mimicry, failing to capture complex subjective human aesthetic preferences. Conversely, while online reinforcement learning (RL) excels at preference alignment, its stochastic exploration paradigm conflicts with the high-fidelity demands of face retouching and often introduces noticeable noise artifacts due to accumulated stochastic drift. To address these limitations, we propose BeautyGRPO, a reinforcement learning framework that aligns face retouching with human aesthetic preferences. We construct FRPref-10K, a fine-grained preference dataset covering five key retouching dimensions, and train a specialized reward model capable of evaluating subtle perceptual differences. To reconcile exploration and fidelity, we introduce Dynamic Path Guidance (DPG). DPG stabilizes the stochastic sampling trajectory by dynamically computing an anchor-based ODE path and replanning a guided trajectory at each sampling timestep, effectively correcting stochastic drift while maintaining controlled exploration. Extensive experiments show that BeautyGRPO outperforms both specialized face retouching methods and general image editing models, achieving superior texture quality, more accurate blemish removal, and overall results that better align with human aesthetic preferences.
Chinese Translation
面部修饰需要去除细微的瑕疵,同时保留独特的面部特征,以增强整体美学吸引力。然而,现有方法面临着根本性的权衡。基于标注数据的监督学习受限于像素级标签模仿,无法捕捉复杂的主观人类美学偏好。相反,尽管在线强化学习(RL)在偏好对齐方面表现出色,其随机探索范式与面部修饰的高保真需求相冲突,并且由于累积的随机漂移,常常引入明显的噪声伪影。为了解决这些局限性,我们提出了BeautyGRPO,一个将面部修饰与人类美学偏好对齐的强化学习框架。我们构建了FRPref-10K,这是一个涵盖五个关键修饰维度的细粒度偏好数据集,并训练了一个专门的奖励模型,能够评估细微的感知差异。为了调和探索与保真度,我们引入了动态路径引导(Dynamic Path Guidance, DPG)。DPG通过动态计算基于锚点的常微分方程(ODE)路径并在每个采样时间步重新规划引导轨迹,稳定了随机采样轨迹,有效纠正了随机漂移,同时保持了受控探索。大量实验表明,BeautyGRPO在专门的面部修饰方法和通用图像编辑模型中表现优越,达到了更高的纹理质量、更准确的瑕疵去除,并且整体结果更好地与人类美学偏好对齐。
cs.CV / 186 / 2603.01164
FREE-Edit: Using Editing-aware Injection in Rectified Flow Models for Zero-shot Image-Driven Video Editing
FREE-Edit:在修正流模型中使用编辑感知注入进行零样本图像驱动视频编辑
Abstract
Image-driven video editing aims to propagate edit contents from the modified first frame to the remaining frames. The existing methods usually invert the source video to noise using a pre-trained image-to-video (I2V) model and then guide the sampling process using the edited first frame. Generally, a popular choice for maintaining motion and layout from the source video is intervening in the denoising process by injecting attention during reconstruction. However, such injection often leads to unsatisfactory results: excessive injection introduces conflicting semantics from the source video, while insufficient injection provides only a limited source representation. Recognizing this, we propose an Editing-awaRE (REE) injection method to modulate the injection intensity of each token. Specifically, we first compute the pixel difference between the source and edited first frames to form a corresponding editing mask. Next, we track the editing area throughout the entire video by using optical flow to warp the first-frame mask. Then, an editing-aware feature injection intensity for each token is generated accordingly, with no injection conducted on editing areas. Building upon REE injection, we further propose a zero-shot image-driven video editing framework with recently emerging rectified-flow models, dubbed FREE-Edit. Without fine-tuning or training, our FREE-Edit demonstrates effectiveness in various image-driven video editing scenarios, showing its capability to produce higher-quality outputs compared with existing techniques. Project page: https://free-edit.github.io/page/.
Chinese Translation
图像驱动的视频编辑旨在将修改后的第一帧的编辑内容传播到其余帧。现有方法通常使用预训练的图像到视频(I2V)模型将源视频反转为噪声,然后利用编辑后的第一帧引导采样过程。一般来说,为了保持源视频中的运动和布局,常用的选择是在重建过程中通过注入注意力来干预去噪过程。然而,这种注入往往导致不令人满意的结果,过度注入会导致源视频中的语义冲突,而不足的注入则带来有限的源表示。意识到这一点,我们提出了一种编辑感知注入(Editing-awaRE, REE)方法,以调节每个标记的注入强度。具体而言,我们首先计算源帧和编辑后第一帧之间的像素差异,以形成相应的编辑掩码。接下来,我们通过使用光流对第一帧掩码进行扭曲,跟踪整个视频中的编辑区域。然后,相应地生成每个标记的编辑感知特征注入强度,其中在编辑区域不进行注入。在REE注入的基础上,我们进一步提出了一种基于最近出现的修正流模型的零样本图像驱动视频编辑框架,称为FREE-Edit。在不进行微调或训练的情况下,我们的FREE-Edit在各种图像驱动视频编辑场景中展示了有效性,显示出其在生成更高质量输出方面的能力,相较于现有技术。项目页面:https://free-edit.github.io/page/
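The first two steps of REE injection, a pixel-difference editing mask followed by a per-token injection intensity that is zeroed on edited regions, can be sketched as below. The threshold is illustrative, and the optical-flow warping to later frames is omitted:

```python
def editing_mask(src_frame, edited_frame, thresh=0.1):
    """1 where the edited first frame differs from the source frame
    beyond an (assumed) threshold, 0 elsewhere."""
    return [[1 if abs(a - b) > thresh else 0
             for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(src_frame, edited_frame)]

def injection_intensity(mask):
    """No attention injection on edited regions, full injection
    elsewhere, per the editing-aware modulation described above."""
    return [[1 - m for m in row] for row in mask]
```

In the full method this mask is warped frame-by-frame with optical flow before the intensities are computed.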
cs.CV / 187 / 2603.01169
TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization
TripleSumm:用于视频摘要的自适应三模态融合
Abstract
The exponential growth of video content necessitates effective video summarization to efficiently extract key information from long videos. However, current approaches struggle to fully comprehend complex videos, primarily because they employ static or modality-agnostic fusion strategies. These methods fail to account for the dynamic, frame-dependent variations in modality saliency inherent in video data. To overcome these limitations, we propose TripleSumm, a novel architecture that adaptively weights and fuses the contributions of visual, text, and audio modalities at the frame level. Furthermore, a significant bottleneck for research into multimodal video summarization has been the lack of comprehensive benchmarks. Addressing this bottleneck, we introduce MoSu (Most Replayed Multimodal Video Summarization), the first large-scale benchmark that provides all three modalities. Extensive experiments demonstrate that TripleSumm achieves state-of-the-art performance, outperforming existing methods by a significant margin on four benchmarks, including MoSu. Our code and dataset are available at https://github.com/smkim37/TripleSumm.
Chinese Translation
视频内容的指数增长要求有效的视频摘要,以高效提取长视频中的关键信息。然而,当前的方法在全面理解复杂视频方面存在困难,主要是因为它们采用了静态或模态无关的融合策略。这些方法未能考虑视频数据中模态显著性动态、帧依赖的变化。为了解决这些局限性,我们提出了TripleSumm,一种新颖的架构,能够在帧级别自适应地加权和融合视觉、文本和音频模态的贡献。此外,针对多模态视频摘要研究的一个重大瓶颈是缺乏全面的基准测试。为了解决这一瓶颈,我们引入了MoSu(Most Replayed Multimodal Video Summarization),这是第一个提供三种模态的大规模基准测试。大量实验表明,TripleSumm在四个基准测试上,包括MoSu,达到了最先进的性能,显著超越了现有方法。我们的代码和数据集可在 https://github.com/smkim37/TripleSumm 获取。
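Frame-level adaptive fusion reduces to a per-frame softmax gate over the three modality features. In TripleSumm the gate scores would come from a learned network; here they are plain inputs, so this is only a sketch of the weighting step:

```python
import math

def fuse_frame(visual, text, audio, gate_scores):
    """Per-frame adaptive fusion: softmax over three gate scores
    yields modality weights, then a weighted sum of the features."""
    e = [math.exp(s) for s in gate_scores]
    z = sum(e)
    w = [v / z for v in e]          # one weight per modality
    feats = [visual, text, audio]
    return [sum(w[m] * feats[m][d] for m in range(3))
            for d in range(len(visual))]
```

Because the gate is evaluated per frame, a frame whose saliency lies in the audio track can weight audio heavily while a visually salient frame does the opposite.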
cs.CV / 188 / 2603.01174
VP-Hype: A Hybrid Mamba-Transformer Framework with Visual-Textual Prompting for Hyperspectral Image Classification
VP-Hype:一种结合视觉-文本提示的混合Mamba-Transformer框架用于高光谱图像分类
Abstract
Accurate classification of hyperspectral imagery (HSI) is often frustrated by the tension between high-dimensional spectral data and the extreme scarcity of labeled training samples. While hierarchical models like LoLA-SpecViT have demonstrated the power of local windowed attention and parameter-efficient fine-tuning, the quadratic complexity of standard Transformers remains a barrier to scaling. We introduce VP-Hype, a framework that rethinks HSI classification by unifying the linear-time efficiency of State-Space Models (SSMs) with the relational modeling of Transformers in a novel hybrid architecture. Building on a robust 3D-CNN spectral front-end, VP-Hype replaces conventional attention blocks with a Hybrid Mamba-Transformer backbone to capture long-range dependencies with significantly reduced computational overhead. Furthermore, we address the label-scarcity problem by integrating dual-modal Visual and Textual Prompts that provide context-aware guidance for the feature extraction process. Our experimental evaluation demonstrates that VP-Hype establishes a new state of the art in low-data regimes. Specifically, with a training sample distribution of only 2%, the model achieves Overall Accuracy (OA) of 99.69% on the Salinas dataset and 99.45% on the Longkou dataset. These results suggest that the convergence of hybrid sequence modeling and multi-modal prompting provides a robust path forward for high-performance, sample-efficient remote sensing.
Chinese Translation
高光谱图像(HSI)的准确分类常常受到高维光谱数据与标记训练样本极度稀缺之间的矛盾影响。尽管像LoLA-SpecViT这样的层次模型展示了局部窗口注意力和参数高效微调的强大能力,但标准Transformer的二次复杂性仍然是扩展的障碍。我们提出了VP-Hype,一个通过将状态空间模型(State-Space Models, SSMs)的线性时间效率与Transformer的关系建模统一于一种新型混合架构,从而重新思考HSI分类。VP-Hype基于强大的3D-CNN光谱前端,采用混合Mamba-Transformer主干替代传统的注意力模块,以显著降低计算开销来捕捉长程依赖。此外,我们通过整合双模态视觉和文本提示来解决标签稀缺问题,这些提示为特征提取过程提供了上下文感知的指导。我们的实验评估表明,VP-Hype在低数据环境下建立了新的技术前沿。具体而言,在仅有2%的训练样本分布下,该模型在Salinas数据集上实现了99.69%的总体准确率(Overall Accuracy, OA),在Longkou数据集上实现了99.45%的总体准确率。这些结果表明,混合序列建模与多模态提示的结合为高性能、样本高效的遥感提供了一条稳健的前进路径。
cs.CV / 189 / 2603.01194
RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations
RnG:一种统一的变换器,用于从部分观察中进行完整的 3D 建模
Abstract
Humans perceive the 3D world through 2D observations from limited viewpoints. While recent feed-forward generalizable 3D reconstruction models excel at recovering 3D structures from sparse images, their representations are often confined to observed regions, leaving unseen geometry unmodeled. This raises a key, fundamental challenge: Can we infer a complete 3D structure from partial 2D observations? We present RnG (Reconstruction and Generation), a novel feed-forward Transformer that unifies these two tasks by predicting an implicit, complete 3D representation. At the core of RnG, we propose a reconstruction-guided causal attention mechanism that separates reconstruction and generation at the attention level, and treats the KV-cache as an implicit 3D representation. Then, arbitrary poses can efficiently query this cache to render high-fidelity, novel-view RGBD outputs. As a result, RnG not only accurately reconstructs visible geometry but also generates plausible, coherent unseen geometry and appearance. Our method achieves state-of-the-art performance in both generalizable 3D reconstruction and novel view generation, while operating efficiently enough for real-time interactive applications. Project page: https://npucvr.github.io/RnG
Chinese Translation
人类通过有限视角的 2D 观察来感知 3D 世界。尽管最近的前馈可泛化 3D 重建模型在从稀疏图像中恢复 3D 结构方面表现出色,但它们的表示通常局限于观察到的区域,未观察到的几何形状则未被建模。这提出了一个关键的基本挑战:我们能否从部分 2D 观察中推断出完整的 3D 结构?我们提出了 RnG(重建与生成),一种新颖的前馈变换器,通过预测隐式的完整 3D 表示来统一这两项任务。在 RnG 的核心,我们提出了一种重建引导的因果注意机制,在注意力层面上分离重建和生成,并将 KV 缓存视为隐式的 3D 表示。然后,任意姿态可以高效地查询此缓存,以渲染高保真度的新视角 RGBD 输出。因此,RnG 不仅准确重建可见几何体,还生成合理且连贯的未见几何体和外观。我们的方法在可泛化 3D 重建和新视角生成方面均达到了最先进的性能,同时在实时交互应用中也能高效运行。项目页面:https://npucvr.github.io/RnG
cs.CV / 190 / 2603.01195
VisNec: Measuring and Leveraging Visual Necessity for Multimodal Instruction Tuning
VisNec:测量和利用视觉必要性进行多模态指令调优
Abstract
The effectiveness of multimodal instruction tuning depends not only on dataset scale, but critically on whether training samples genuinely require visual reasoning. However, existing instruction datasets often contain a substantial portion of visually redundant samples (solvable from text alone), as well as multimodally misaligned supervision that can degrade learning. To address this, we propose VisNec (Visual Necessity Score), a principled data selection framework that measures the marginal contribution of visual input during instruction tuning. By comparing predictive loss with and without visual context, VisNec identifies whether a training instance is vision-critical, redundant, or misaligned. To preserve task diversity, we combine VisNec with semantic clustering and select high-necessity samples within each cluster. Across 10 downstream benchmarks, training on only 15% of the LLaVA-665K dataset selected by VisNec achieves 100.2% of full-data performance. On the smaller Vision-Flan-186K dataset, our selection not only further reduces data size but also surpasses full-data training by 15.8%. These results demonstrate that measuring and leveraging visual necessity provides an effective solution for both efficient and robust multimodal instruction tuning. Codes and selected subsets will be released upon acceptance.
Chinese Translation
多模态指令调优的有效性不仅依赖于数据集的规模,更关键的是训练样本是否真正需要视觉推理。然而,现有的指令数据集中往往包含大量视觉冗余样本(仅通过文本即可解决),以及可能降低学习效果的多模态不对齐监督。为了解决这个问题,我们提出了VisNec(视觉必要性评分),这是一个原则性的数据选择框架,用于测量视觉输入在指令调优过程中的边际贡献。通过比较有无视觉上下文的预测损失,VisNec能够识别训练实例是视觉关键、冗余还是不对齐。为了保持任务的多样性,我们将VisNec与语义聚类相结合,并在每个聚类中选择高必要性的样本。在10个下游基准测试中,仅使用VisNec选择的LLaVA-665K数据集的15%进行训练,便达到了100.2%的全数据性能。在较小的Vision-Flan-186K数据集上,我们的选择不仅进一步减少了数据规模,还超越了全数据训练15.8%。这些结果表明,测量和利用视觉必要性为高效且稳健的多模态指令调优提供了有效的解决方案。代码和选择的子集将在接受后发布。
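The core VisNec measurement, the marginal contribution of the image, is the drop in predictive loss once visual context is provided. A sketch with an illustrative threshold `tau`; the paper's actual decision thresholds are not specified here:

```python
def visual_necessity(loss_text_only, loss_with_image, tau=0.1):
    """Sketch of a VisNec-style score: how much the predictive loss
    drops when the image is added to the text-only input."""
    score = loss_text_only - loss_with_image
    if score > tau:
        label = "vision-critical"
    elif score < -tau:
        label = "misaligned"  # the image actively hurts prediction
    else:
        label = "redundant"   # solvable from text alone
    return score, label
```

Selection then keeps high-necessity samples within each semantic cluster, preserving task diversity while discarding redundant and misaligned instances.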
cs.CV / 191 / 2603.01205
CoSMo3D: Open-World Promptable 3D Semantic Part Segmentation through LLM-Guided Canonical Spatial Modeling
CoSMo3D:通过LLM引导的规范空间建模实现开放世界可提示的3D语义部件分割
Abstract
Open-world promptable 3D semantic segmentation remains brittle as semantics are inferred in the input sensor coordinates. Yet, humans, in contrast, interpret parts via functional roles in a canonical space -- wings extend laterally, handles protrude to the side, and legs support from below. Psychophysical evidence shows that we mentally rotate objects into canonical frames to reveal these roles. To fill this gap, we propose CoSMo3D, which attains canonical space perception by inducing a latent canonical reference frame learned directly from data. By construction, we create a unified canonical dataset through LLM-guided intra- and cross-category alignment, exposing canonical spatial regularities across 200 categories. By induction, we realize canonicality inside the model through a dual-branch architecture with canonical map anchoring and canonical box calibration, collapsing pose variation and symmetry into a stable canonical embedding. This shift from input pose space to canonical embedding yields far more stable and transferable part semantics. Experimental results show that CoSMo3D establishes new state of the art in open-world promptable 3D segmentation.
Chinese Translation
开放世界可提示的3D语义分割仍然脆弱,因为语义是在输入传感器坐标中推断的。然而,与此相反,人类通过在规范空间中的功能角色来解释部件——翅膀向侧面延伸,手柄向外突出,腿部从下方支撑。心理物理证据表明,我们在心理上将物体旋转到规范框架中以揭示这些角色。为填补这一空白,我们提出了CoSMo3D,通过直接从数据中学习的潜在规范参考框架来实现规范空间感知。通过构建,我们通过LLM引导的类内和跨类对齐创建了统一的规范数据集,揭示了200个类别之间的规范空间规律。通过归纳,我们在模型内部实现了规范性,采用双分支架构,结合规范映射锚定和规范框校准,将姿态变化和对称性压缩为稳定的规范嵌入。从输入姿态空间到规范嵌入的转变产生了更稳定和可转移的部件语义。实验结果表明,CoSMo3D在开放世界可提示的3D分割中建立了新的最先进水平。
cs.CV / 192 / 2603.01224
Monocular 3D Object Position Estimation with VLMs for Human-Robot Interaction
基于视觉语言模型的单目3D物体位置估计用于人机交互
Abstract
Pre-trained general-purpose Vision-Language Models (VLMs) hold the potential to enhance intuitive human-machine interactions due to their rich world knowledge and 2D object detection capabilities. However, VLMs for 3D coordinate detection tasks are rare. In this work, we investigate the interactive abilities of VLMs by having them return 3D object positions given a monocular RGB image from a wrist-mounted camera, natural language input, and robot states. We collected and curated a heterogeneous dataset of more than 100,000 images and finetuned a VLM using QLoRA with a custom regression head. By implementing conditional routing, our model maintains its ability to process general visual queries while adding specialized 3D position estimation capabilities. Our results demonstrate robust predictive performance with a median MAE of 13 mm on the test set and a five-fold improvement over a simpler baseline without finetuning. In about 25% of the cases, predictions are within a range considered acceptable for the robot to interact with objects.
Chinese Translation
预训练的通用视觉语言模型(VLM)由于其丰富的世界知识和2D物体检测能力,有潜力增强直观的人机交互。然而,针对3D坐标检测任务的VLM仍然较为稀缺。在本研究中,我们通过从腕部安装的相机获取单目RGB图像、自然语言输入和机器人状态,探讨了VLM的交互能力,以返回3D物体位置。我们收集并整理了一个异构数据集,包含超过100,000张图像,并使用QLoRA对VLM进行了微调,添加了自定义回归头。通过实施条件路由,我们的模型在处理一般视觉查询的同时,增加了专门的3D位置估计能力。我们的结果显示出强大的预测性能,在测试集上的中位绝对误差(MAE)为13毫米,相较于未微调的简单基线提高了五倍。在约25%的情况下,预测结果在被认为是机器人与物体交互的可接受范围内。
cs.CV / 193 / 2603.01228
Towards Policy-Adaptive Image Guardrail: Benchmark and Method
面向政策自适应的图像保护栏:基准与方法
Abstract
Accurate rejection of sensitive or harmful visual content, i.e., harmful image guardrail, is critical in many application scenarios. This task must continuously adapt to the evolving safety policies and content across various domains and over time. However, traditional classifiers, confined to fixed categories, require frequent retraining when new policies are introduced. Vision-language models (VLMs) offer a more adaptable and generalizable foundation for dynamic safety guardrails. Despite this potential, existing VLM-based safeguarding methods are typically trained and evaluated under only a fixed safety policy. We find that these models are heavily overfitted to the seen policy, fail to generalize to unseen policies, and even lose the basic instruction-following ability and general knowledge. To address this issue, in this paper we make two key contributions. First, we benchmark the cross-policy generalization performance of existing VLMs with SafeEditBench, a new evaluation suite. SafeEditBench leverages image-editing models to convert unsafe images into safe counterparts, producing policy-aligned datasets where each safe-unsafe image pair remains visually similar except for localized regions violating specific safety rules. Human annotators then provide accurate safe/unsafe labels under five distinct policies, enabling fine-grained assessment of policy-aware generalization. Second, we introduce SafeGuard-VL, a reinforcement learning-based method with verifiable rewards (RLVR) for robust unsafe-image guardrails. Instead of relying solely on supervised fine-tuning (SFT) under fixed policies, SafeGuard-VL explicitly optimizes the model with policy-grounded rewards, promoting verifiable adaptation across evolving policies. Extensive experiments verify the effectiveness of our method for unsafe image guardrails across various policies.
Chinese Translation
准确拒绝敏感或有害的视觉内容,即有害图像保护栏,在许多应用场景中至关重要。该任务必须不断适应不断变化的安全政策和各个领域随时间演变的内容。然而,传统分类器受限于固定类别,在引入新政策时需要频繁重新训练。视觉-语言模型(VLMs)为动态安全保护栏提供了更具适应性和可推广性的基础。尽管有这种潜力,现有的基于VLM的保护方法通常仅在固定安全政策下进行训练和评估。我们发现这些模型严重过拟合于已见政策,无法推广到未见政策,甚至失去了基本的指令遵循能力和一般知识。为了解决这个问题,本文做出了两个关键贡献。首先,我们使用SafeEditBench,一个新的评估套件,对现有VLM的跨政策泛化性能进行基准测试。SafeEditBench利用图像编辑模型将不安全图像转换为安全对应物,生成政策对齐的数据集,其中每对安全-不安全图像在视觉上保持相似,除了违反特定安全规则的局部区域。然后,人类注释者在五个不同政策下提供准确的安全/不安全标签,从而实现对政策感知泛化的细致评估。其次,我们引入了SafeGuard-VL,一种基于强化学习的具有可验证奖励(RLVR)的方法,用于稳健的不安全图像保护栏。SafeGuard-VL不仅依赖于固定政策下的监督微调(SFT),而是明确地使用基于政策的奖励优化模型,促进在不断演变的政策下的可验证适应。大量实验验证了我们的方法在各种政策下对不安全图像保护栏的有效性。
cs.CV / 194 / 2603.01236
AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models
AgilePruner:针对大型视觉-语言模型中自适应视觉标记修剪的注意力与多样性的实证研究
Abstract
Large Vision-Language Models (LVLMs) have adopted visual token pruning strategies to mitigate substantial computational overhead incurred by extensive visual token sequences. While prior works primarily focus on either attention-based or diversity-based pruning methods, in-depth analysis of these approaches' characteristics and limitations remains largely unexplored. In this work, we conduct thorough empirical analysis using effective rank (erank) as a measure of feature diversity and attention score entropy to investigate visual token processing mechanisms and analyze the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) Our erank-based quantitative analysis shows that many diversity-oriented pruning methods preserve substantially less feature diversity than intended; moreover, analysis using the CHAIR dataset reveals that the diversity they do retain is closely tied to increased hallucination frequency compared to attention-based pruning. (2) We further observe that attention-based approaches are more effective on simple images where visual evidence is concentrated, while diversity-based methods better handle complex images with distributed features. Building on these empirical insights, we show that incorporating image-aware adjustments into existing hybrid pruning strategies consistently improves their performance. We also provide a minimal instantiation of our empirical findings through a simple adaptive pruning mechanism, which achieves strong and reliable performance across standard benchmarks as well as hallucination-specific evaluations. Our project page available at https://cvsp-lab.github.io/AgilePruner.
Chinese Translation
大型视觉-语言模型(LVLMs)采用视觉标记修剪策略,以减轻由于大量视觉标记序列带来的显著计算开销。尽管之前的研究主要集中于基于注意力或基于多样性的修剪方法,但对这些方法特征及其局限性的深入分析仍然较少被探讨。在本研究中,我们使用有效秩(effective rank,erank)作为特征多样性的度量,并结合注意力得分熵,进行全面的实证分析,以探讨视觉标记处理机制,并分析每种方法的优缺点。我们的分析揭示了两个见解:(1)基于erank的定量分析表明,许多以多样性为导向的修剪方法所保留的特征多样性远低于预期;此外,使用CHAIR数据集的分析显示,与基于注意力的修剪相比,它们所保留的多样性与幻觉频率的增加密切相关。(2)我们进一步观察到,基于注意力的方法在视觉证据集中的简单图像上更为有效,而基于多样性的方法则更适合处理特征分布复杂的图像。基于这些实证见解,我们展示了将图像感知调整纳入现有混合修剪策略中可以持续提高其性能。我们还通过简单的自适应修剪机制提供了我们实证发现的最小实例,该机制在标准基准测试以及针对幻觉的特定评估中均表现出强大而可靠的性能。我们的项目页面可访问:https://cvsp-lab.github.io/AgilePruner。
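The effective rank used as the diversity measure in the analysis above has a standard closed form: the exponential of the Shannon entropy of the normalized singular-value spectrum of a feature matrix. A minimal sketch, assuming this standard definition is the one the paper adopts:

```python
import numpy as np

def effective_rank(features: np.ndarray) -> float:
    """Effective rank (erank): exp of the Shannon entropy of the
    normalized singular-value distribution of a (tokens, dim) matrix."""
    s = np.linalg.svd(features, compute_uv=False)
    p = s / s.sum()
    p = p[p > 1e-12]                      # drop numerically-zero modes
    return float(np.exp(-(p * np.log(p)).sum()))

# identical tokens collapse to erank ~= 1; orthogonal tokens reach full rank
flat = np.tile(np.random.default_rng(0).standard_normal((1, 64)), (16, 1))
spread = np.eye(16, 64)
```

A pruning method that claims to preserve diversity should keep the erank of the surviving token set close to that of the full set, which is how the quantitative comparison above can be carried out.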
cs.CV / 195 / 2603.01250
The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction
MAMA-MIA挑战:推动乳腺MRI肿瘤分割和治疗反应预测的普适性与公平性
Garrucho, Lidia, Joshi, Smriti, Kushibar, Kaisar, Osuala, Richard, Bobowicz, Maciej, Bargalló, Xavier, Jaruševičius, Paulius, Geissler, Kai, Schäfer, Raphael, Alberb, Muhammad, Xu, Tony, Martel, Anne, Sleiman, Daniel, Awasthi, Navchetan, Awwad, Hadeel, Vilanova, Joan C., Martí, Robert, Schouten, Daan, Lee, Jeong Hoon, Rusu, Mirabela, Poeta, Eleonora, Vargas, Luisa, Pastor, Eliana, Zuluaga, Maria A., Kächele, Jessica, Bounias, Dimitrios, Ertl, Alexandra, Gwoździewicz, Katarzyna, Cosaka, Maria-Laura, Abo-Elhoda, Pasant M., Tantawy, Sara W., Sakrana, Shorouq S., Shawky-Abdelfatah, Norhan O., Abdo-Salem, Amr Muhammad, Kozana, Androniki, Divjak, Eugen, Ivanac, Gordana, Nikiforaki, Katerina, Klontzas, Michail E., García-Dosdá, Rosa, Gulsun-Akpinar, Meltem, Lafcı, Oğuz, Martín-Isla, Carlos, Díaz, Oliver, Igual, Laura, Lekadir, Karim
Abstract
Breast cancer is the most frequently diagnosed malignancy among women worldwide and a leading cause of cancer-related mortality. Dynamic contrast-enhanced magnetic resonance imaging plays a central role in tumor characterization and treatment monitoring, particularly in patients receiving neoadjuvant chemotherapy. However, existing artificial intelligence models for breast magnetic resonance imaging are often developed using single-center data and evaluated using aggregate performance metrics, limiting their generalizability and obscuring potential performance disparities across demographic subgroups. The MAMA-MIA Challenge was designed to address these limitations by introducing a large-scale benchmark that jointly evaluates primary tumor segmentation and prediction of pathologic complete response using pre-treatment magnetic resonance imaging only. The training cohort comprised 1,506 patients from multiple institutions in the United States, while evaluation was conducted on an external test set of 574 patients from three independent European centers to assess cross-continental and cross-institutional generalization. A unified scoring framework combined predictive performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams participated in the final evaluation phase. Results demonstrate substantial performance variability under external testing and reveal trade-offs between overall accuracy and subgroup fairness. The challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging.
Chinese Translation
乳腺癌是全球女性中最常被诊断的恶性肿瘤,也是癌症相关死亡的主要原因。动态对比增强磁共振成像在肿瘤特征描述和治疗监测中发挥着核心作用,尤其是在接受新辅助化疗的患者中。然而,现有的乳腺磁共振成像人工智能模型通常是基于单中心数据开发的,并使用汇总性能指标进行评估,这限制了它们的普适性,并掩盖了不同人口子群体之间潜在的性能差异。MAMA-MIA挑战旨在通过引入一个大规模基准,联合评估原发肿瘤分割和仅使用治疗前磁共振成像预测病理完全反应,来解决这些局限性。训练队列包括来自美国多个机构的1,506名患者,而评估则在来自三个独立欧洲中心的574名患者的外部测试集上进行,以评估跨大陆和跨机构的泛化能力。一个统一的评分框架结合了预测性能与年龄、绝经状态和乳腺密度等子群体一致性。最终评估阶段有26个国际团队参与。结果表明,在外部测试下性能存在显著变异,并揭示了整体准确性与子群体公平性之间的权衡。该挑战提供了标准化的数据集、评估协议和公共资源,以促进乳腺癌影像学中强大而公平的人工智能系统的发展。
cs.CV / 196 / 2603.01253
Cross-Modal Guidance for Fast Diffusion-Based Computed Tomography
基于跨模态引导的快速扩散计算机断层成像
Abstract
Diffusion models have emerged as powerful priors for solving inverse problems in computed tomography (CT). In certain applications, such as neutron CT, it can be expensive to collect large amounts of measurements even for a single scan, leading to sparse data sets from which it is challenging to obtain high quality reconstructions even with diffusion models. One strategy to mitigate this challenge is to leverage a complementary, easily available imaging modality; however, such approaches typically require retraining the diffusion model with large datasets. In this work, we propose incorporating an additional modality without retraining the diffusion prior, enabling accelerated imaging of costly modalities. We further examine the impact of imperfect side modalities on cross-modal guidance. Our method is evaluated on sparse-view neutron computed tomography, where reconstruction quality is substantially improved by incorporating X-ray computed tomography of the same samples.
Chinese Translation
扩散模型已成为解决计算机断层成像(CT)逆问题的强大先验。在某些应用中,例如中子CT,即使对于单次扫描,收集大量测量数据也可能非常昂贵,这导致数据集稀疏,甚至使用扩散模型也难以获得高质量的重建。缓解这一挑战的一种策略是利用一种互补的、易于获取的成像模态;然而,这种方法通常需要使用大数据集重新训练扩散模型。在本研究中,我们提出在不重新训练扩散先验的情况下,结合额外的模态,从而加速昂贵模态的成像。我们进一步考察了不完美的辅助模态对跨模态引导的影响。我们的方法在稀疏视图中子计算机断层成像中进行了评估,通过结合相同样本的X射线计算机断层成像,重建质量得到了显著改善。
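The cross-modal guidance idea above can be illustrated, in heavily simplified form, as adding a consistency term toward the side modality to the data-fidelity objective. This sketch drops the diffusion prior entirely and runs plain gradient descent on a toy linear inverse problem; the operator, weights, and variable names are all illustrative, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
A = rng.standard_normal((8, n))                    # sparse-view forward operator (toy)
x_true = rng.standard_normal(n)
y = A @ x_true                                     # sparse (e.g. neutron) measurements
x_side = x_true + 0.05 * rng.standard_normal(n)    # imperfect side modality (e.g. X-ray CT)

lam, lr = 0.5, 0.01
x = np.zeros(n)
for _ in range(3000):
    # data fidelity on sparse measurements + cross-modal consistency term
    grad = A.T @ (A @ x - y) + lam * (x - x_side)
    x -= lr * grad
```

The side modality fills in the directions that the sparse measurements leave unconstrained; in the actual method this regularization is applied within the diffusion sampling loop rather than by direct gradient descent.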
cs.CV / 197 / 2603.01284
FoSS: Modeling Long Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier State Space Integration
FoSS:通过傅里叶状态空间集成建模轨迹预测中的长程依赖性和多模态不确定性
Abstract
Accurate trajectory prediction is vital for safe autonomous driving, yet existing approaches struggle to balance modeling power and computational efficiency. Attention-based architectures incur quadratic complexity with increasing agents, while recurrent models struggle to capture long-range dependencies and fine-grained local dynamics. To address this, we present FoSS, a dual-branch framework that unifies frequency-domain reasoning with linear-time sequence modeling. The frequency-domain branch performs a discrete Fourier transform to decompose trajectories into amplitude components encoding global intent and phase components capturing local variations, followed by a progressive helix reordering module that preserves spectral order; two selective state-space submodules, Coarse2Fine-SSM and SpecEvolve-SSM, refine spectral features with O(N) complexity. In parallel, a time-domain dynamic selective SSM reconstructs self-attention behavior in linear time to retain long-range temporal context. A cross-attention layer fuses temporal and spectral representations, while learnable queries generate multiple candidate trajectories, and a weighted fusion head expresses motion uncertainty. Experiments on Argoverse 1 and Argoverse 2 benchmarks demonstrate that FoSS achieves state-of-the-art accuracy while reducing computation by 22.5% and parameters by over 40%. Comprehensive ablations confirm the necessity of each component.
Chinese Translation
准确的轨迹预测对安全的自动驾驶至关重要,但现有方法在建模能力和计算效率之间难以取得平衡。基于注意力的架构在代理数量增加时会导致二次复杂度,而递归模型则难以捕捉长程依赖性和细粒度的局部动态。为此,我们提出了FoSS,一个双分支框架,将频域推理与线性时间序列建模相结合。频域分支通过离散傅里叶变换将轨迹分解为编码全局意图的幅度分量和捕捉局部变化的相位分量,随后是一个渐进的螺旋重排序模块,保持谱序;两个选择性状态空间子模块,Coarse2Fine-SSM和SpecEvolve-SSM,以O(N)复杂度精炼谱特征。与此同时,一个时域动态选择性SSM以线性时间重建自注意力行为,以保留长程时间上下文。交叉注意力层融合时间和谱表示,而可学习的查询生成多个候选轨迹,加权融合头则表达运动不确定性。在Argoverse 1和Argoverse 2基准上的实验表明,FoSS在实现最先进准确性的同时,将计算量减少22.5%,参数量减少超过40%。全面的消融实验确认了每个组件的必要性。
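The amplitude/phase decomposition performed by the frequency-domain branch is a plain discrete Fourier transform; a minimal sketch of the split (and of its losslessness) on a toy trajectory:

```python
import numpy as np

T = 50
t = np.linspace(0.0, 5.0, T)
traj = np.stack([t, np.sin(t)], axis=1)          # toy (T, 2) x-y trajectory

spec = np.fft.rfft(traj, axis=0)                  # per-axis discrete Fourier transform
amplitude = np.abs(spec)                          # encodes overall shape / global intent
phase = np.angle(spec)                            # encodes timing / local variations

# the two components together losslessly reconstruct the trajectory
recon = np.fft.irfft(amplitude * np.exp(1j * phase), n=T, axis=0)
```

The paper's spectral submodules then process amplitude and phase as separate feature streams; the transform itself carries no information loss, so any degradation comes from the learned refinement, not the decomposition.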
cs.CV / 198 / 2603.01295
Multi-Level Bidirectional Decoder Interaction for Uncertainty-Aware Breast Ultrasound Analysis
多层双向解码器交互用于不确定性感知的乳腺超声分析
Abstract
Breast ultrasound interpretation requires simultaneous lesion segmentation and tissue classification. However, conventional multi-task learning approaches suffer from task interference and rigid coordination strategies that fail to adapt to instance-specific prediction difficulty. We propose a multi-task framework addressing these limitations through multi-level decoder interaction and uncertainty-aware adaptive coordination. Task Interaction Modules operate at all decoder levels, establishing bidirectional segmentation-classification communication during spatial reconstruction through attention weighted pooling and multiplicative modulation. Unlike prior single-level or encoder-only approaches, this multi-level design captures scale specific task synergies across semantic-to-spatial scales, producing complementary task interaction streams. Uncertainty-Proxy Attention adaptively weights base versus enhanced features at each level using feature activation variance, enabling per-level and per-sample task balancing without heuristic tuning. To support instance-adaptive prediction, multi-scale context fusion captures morphological cues across varying lesion sizes. Evaluation on multiple publicly available breast ultrasound datasets demonstrates competitive performance, including 74.5% lesion IoU and 90.6% classification accuracy on BUSI dataset. Ablation studies confirm that multi-level task interaction provides significant performance gains, validating that decoder-level bidirectional communication is more effective than conventional encoder-only parameter sharing. The code is available at: https://github.com/C-loud-Nine/Uncertainty-Aware-Multi-Level-Decoder-Interaction.
Chinese Translation
乳腺超声的解读需要同时进行病灶分割和组织分类。然而,传统的多任务学习方法存在任务干扰和僵化的协调策略,无法适应实例特定的预测难度。我们提出了一种多任务框架,通过多层解码器交互和不确定性感知的自适应协调来解决这些局限性。任务交互模块在所有解码器层次上运行,通过注意力加权池化和乘法调制,在空间重建过程中建立双向的分割-分类通信。与之前的单层或仅编码器的方法不同,这种多层设计捕捉了跨语义到空间尺度的特定任务协同,产生互补的任务交互流。不确定性代理注意力根据特征激活方差自适应地加权每一层的基础特征与增强特征,使得在每一层和每个样本之间实现任务平衡而无需启发式调优。为了支持实例自适应预测,多尺度上下文融合捕捉不同病灶大小的形态线索。在多个公开可用的乳腺超声数据集上的评估显示出竞争力的性能,包括在BUSI数据集上74.5%的病灶IoU和90.6%的分类准确率。消融研究确认多层任务交互提供了显著的性能提升,验证了解码器层级的双向通信比传统的仅编码器参数共享更为有效。代码可在以下链接获取:https://github.com/C-loud-Nine/Uncertainty-Aware-Multi-Level-Decoder-Interaction。
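The uncertainty-proxy idea of weighting base versus enhanced decoder features by activation variance can be sketched as follows; the sigmoid gating form here is a guessed placeholder, since the abstract does not specify the exact function:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def uncertainty_proxy_fuse(base, enhanced):
    """Blend base vs. enhanced features per sample, gated by the variance
    of the enhanced activations (toy proxy form, not the paper's exact one)."""
    var = enhanced.var(axis=(1, 2, 3), keepdims=True)   # (B, 1, 1, 1) per-sample variance
    alpha = sigmoid(var - var.mean())                   # higher variance -> lean on enhanced path
    return alpha * enhanced + (1.0 - alpha) * base

rng = np.random.default_rng(0)
base = rng.random((4, 8, 16, 16))       # plain decoder features (B, C, H, W)
enhanced = rng.random((4, 8, 16, 16))   # interaction-enhanced features
fused = uncertainty_proxy_fuse(base, enhanced)
```

Because the gate is computed per sample and per level from the features themselves, no heuristic task-weighting hyperparameter needs to be tuned, which is the point the abstract emphasizes.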
cs.CV / 199 / 2603.01301
When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains
强化学习何时能助力医疗视觉语言模型?解构视觉、监督微调与强化学习的收益
Abstract
Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
Chinese Translation
强化学习(RL)在医疗视觉语言模型(VLMs)的后训练中越来越多地被使用,但尚不清楚RL是改善了医疗视觉推理,还是主要锐化了由监督微调(SFT)已经诱导的行为。我们提出了一项受控研究,从视觉、SFT和RL三个维度解构这些效应。利用MedMNIST作为多模态测试平台,我们通过将VLM视觉塔与仅视觉基线进行基准测试来探测视觉感知,通过Accuracy@1与Pass@K量化推理支持和采样效率,并评估RL何时缩小支持差距以及收益如何在不同模态间转移。我们发现,当模型已经具有非平凡的支持(高Pass@K)时,RL最为有效:它主要锐化输出分布,提高Acc@1和采样效率,而SFT则扩展了支持,使得RL得以有效。基于这些发现,我们提出了一种边界感知的方案,并通过在PMC多项选择VQA的小型平衡子集上对OctoMed初始化模型进行RL后训练来实例化该方案,在六个医疗VQA基准测试中实现了强劲的平均表现。
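The Accuracy@1 versus Pass@K probe described above is straightforward to compute from sampled answers; a sketch (the 30%-per-sample toy model is illustrative only):

```python
import numpy as np

def acc_at_1(correct):
    """correct: (num_questions, K) booleans for K sampled answers each."""
    return correct[:, 0].mean()            # greedy / first-sample accuracy

def pass_at_k(correct):
    return correct.any(axis=1).mean()      # support: any of the K samples is right

# toy model: often *can* produce the answer (high Pass@K) but rarely ranks
# it first (low Acc@1) -- the regime where RL sharpening helps most
rng = np.random.default_rng(0)
correct = rng.random((1000, 8)) < 0.3      # each sampled answer right w.p. 0.3
```

A large Pass@K minus Acc@1 gap signals that the model has the knowledge but mis-ranks it, which is exactly where the abstract finds RL to be effective; when Pass@K itself is low, SFT is needed first to expand support.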
cs.CV / 200 / 2603.01305
AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models
AG-VAS:基于大型多模态模型的锚点引导零样本视觉异常分割
Abstract
Large multimodal models (LMMs) exhibit strong task generalization capabilities, offering new opportunities for zero-shot visual anomaly segmentation (ZSAS). However, existing LMM-based segmentation approaches still face fundamental limitations: anomaly concepts are inherently abstract and context-dependent, lacking stable visual prototypes, and the weak alignment between high-level semantic embeddings and pixel-level spatial features hinders precise anomaly localization. To address these challenges, we present AG-VAS (Anchor-Guided Visual Anomaly Segmentation), a new framework that expands the LMM vocabulary with three learnable semantic anchor tokens: [SEG], [NOR], and [ANO], establishing a unified anchor-guided segmentation paradigm. Specifically, [SEG] serves as an absolute semantic anchor that translates abstract anomaly semantics into explicit, spatially grounded visual entities (e.g., holes or scratches), while [NOR] and [ANO] act as relative anchors that model the contextual contrast between normal and abnormal patterns across categories. To further enhance cross-modal alignment, we introduce a Semantic-Pixel Alignment Module (SPAM) that aligns language-level semantic embeddings with high-resolution visual features, along with an Anchor-Guided Mask Decoder (AGMD) that performs anchor-conditioned mask prediction for precise anomaly localization. In addition, we curate Anomaly-Instruct20K, a large-scale instruction dataset that organizes anomaly knowledge into structured descriptions of appearance, shape, and spatial attributes, facilitating effective learning and integration of the proposed semantic anchors. Extensive experiments on six industrial and medical benchmarks demonstrate that AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting.
Chinese Translation
大型多模态模型(LMMs)展现出强大的任务泛化能力,为零样本视觉异常分割(ZSAS)提供了新的机遇。然而,现有基于LMM的分割方法仍面临基本的限制:异常概念本质上是抽象且依赖于上下文的,缺乏稳定的视觉原型,而高层语义嵌入与像素级空间特征之间的弱对齐阻碍了精确的异常定位。为了解决这些挑战,我们提出了AG-VAS(Anchor-Guided Visual Anomaly Segmentation),这是一个新的框架,通过三个可学习的语义锚点标记([SEG]、[NOR]和[ANO])扩展了LMM的词汇,建立了统一的锚点引导分割范式。具体而言,[SEG]作为绝对语义锚点,将抽象的异常语义转化为明确的、空间上扎根的视觉实体(例如,孔或划痕),而[NOR]和[ANO]则作为相对锚点,建模不同类别之间正常与异常模式的上下文对比。为了进一步增强跨模态对齐,我们引入了语义-像素对齐模块(SPAM),该模块将语言级语义嵌入与高分辨率视觉特征对齐,并结合锚点引导的掩码解码器(AGMD),实现基于锚点的掩码预测,以精确定位异常。此外,我们整理了Anomaly-Instruct20K,这是一个大规模指令数据集,将异常知识组织为外观、形状和空间属性的结构化描述,促进了所提语义锚点的有效学习和整合。在六个工业和医疗基准上的广泛实验表明,AG-VAS在零样本设置中实现了一致的最先进性能。
cs.CV / 201 / 2603.01324
Open-Vocabulary vs Supervised Learning Methods for Post-Disaster Visual Scene Understanding
开放词汇与监督学习方法在灾后视觉场景理解中的比较
Abstract
Aerial imagery is critical for large-scale post-disaster damage assessment. Automated interpretation remains challenging due to clutter, visual variability, and strong cross-event domain shift, while supervised approaches still rely on costly, task-specific annotations with limited coverage across disaster types and regions. Recent open-vocabulary and foundation vision models offer an appealing alternative, by reducing dependence on fixed label sets and extensive task-specific annotations. Instead, they leverage large-scale pretraining and vision-language representations. These properties are particularly relevant for post-disaster domains, where visual concepts are ambiguous and data availability is constrained. In this work, we present a comparative evaluation of supervised learning and open-vocabulary vision models for post-disaster scene understanding, focusing on semantic segmentation and object detection across multiple datasets, including FloodNet+, RescueNet, DFire, and LADD. We examine performance trends, failure modes, and practical trade-offs between different learning paradigms, providing insight into their applicability for real-world disaster response. The most notable remark across all evaluated benchmarks is that supervised training remains the most reliable approach (i.e., when the label space is fixed and annotations are available), especially for small objects and fine boundary delineation in cluttered scenes.
Chinese Translation
航空影像对于大规模灾后损害评估至关重要。然而,由于杂乱的背景、视觉变异性以及强烈的跨事件领域转移,自动化解读仍然面临挑战,而监督学习方法仍然依赖于昂贵的、特定任务的标注,这些标注在灾害类型和地区之间的覆盖范围有限。最近的开放词汇和基础视觉模型提供了一种有吸引力的替代方案,减少了对固定标签集和广泛特定任务标注的依赖。相反,它们利用大规模的预训练和视觉-语言表示。这些特性对于灾后领域尤其相关,因为视觉概念往往模糊且数据可用性受到限制。在本研究中,我们对监督学习和开放词汇视觉模型在灾后场景理解中的表现进行了比较评估,重点关注多个数据集(包括 FloodNet+、RescueNet、DFire 和 LADD)中的语义分割和目标检测。我们考察了不同学习范式之间的性能趋势、失败模式和实际权衡,为其在现实世界灾害响应中的适用性提供了见解。在所有评估基准中,最显著的观察是,监督训练仍然是最可靠的方法(即当标签空间固定且有可用标注时),尤其是在处理小物体和复杂场景中的细边界划分时。
cs.CV / 202 / 2603.01328
You Only Need One Stage: Novel-View Synthesis From A Single Blind Face Image
只需一个阶段:从单张盲人脸图像合成新视角
Abstract
We propose a novel one-stage method, NVB-Face, for generating consistent Novel-View images directly from a single Blind Face image. Existing approaches to novel-view synthesis for objects or faces typically require a high-resolution RGB image as input. When dealing with degraded images, the conventional pipeline follows a two-stage process: first restoring the image to high resolution, then synthesizing novel views from the restored result. However, this approach is highly dependent on the quality of the restored image, often leading to inaccuracies and inconsistencies in the final output. To address this limitation, we extract single-view features directly from the blind face image and introduce a feature manipulator that transforms these features into 3D-aware, multi-view latent representations. Leveraging the powerful generative capacity of a diffusion model, our framework synthesizes high-quality, consistent novel-view face images. Experimental results show that our method significantly outperforms traditional two-stage approaches in both consistency and fidelity.
Chinese Translation
我们提出了一种新颖的单阶段方法NVB-Face,能够直接从单张盲人脸图像生成一致的新视角图像。现有的物体或人脸新视角合成方法通常需要高分辨率RGB图像作为输入。在处理退化图像时,传统流程遵循两阶段过程:首先将图像恢复到高分辨率,然后从恢复结果合成新视角。然而,这种方法高度依赖于恢复图像的质量,常常导致最终输出的不准确和不一致。为了解决这一局限性,我们直接从盲人脸图像中提取单视图特征,并引入了一种特征操控器,将这些特征转换为具有3D感知的多视图潜在表示。利用扩散模型强大的生成能力,我们的框架合成高质量、一致的新视角人脸图像。实验结果表明,我们的方法在一致性和保真度方面显著优于传统的两阶段方法。
cs.CV / 203 / 2603.01332
Perspective-Equivariant Fine-tuning for Multispectral Demosaicing without Ground Truth
无地面真实值的多光谱去马赛克的视角等变微调
Abstract
Multispectral demosaicing is crucial to reconstruct full-resolution spectral images from snapshot mosaiced measurements, enabling real-time imaging from neurosurgery to autonomous driving. Classical methods are blurry, while supervised learning requires costly ground truth (GT) obtained from slow line-scanning systems. We propose Perspective-Equivariant Fine-tuning for Demosaicing (PEFD), a framework that learns multispectral demosaicing from mosaiced measurements alone. PEFD a) exploits the projective geometry of camera-based imaging systems to leverage a richer group structure than previous demosaicing methods to recover more null-space information, and b) learns efficiently without GT by adapting pretrained foundation models designed for 1-3 channel imaging. On intraoperative and automotive datasets, PEFD recovers fine details such as blood vessels and preserves spectral fidelity, substantially outperforming recent approaches, nearing supervised performance.
Chinese Translation
多光谱去马赛克对于从快照马赛克测量中重建全分辨率光谱图像至关重要,这使得从神经外科到自动驾驶的实时成像成为可能。经典方法往往模糊,而监督学习需要从缓慢的线扫描系统中获得昂贵的地面真实值(GT)。我们提出了用于去马赛克的视角等变微调框架(PEFD),该框架仅从马赛克测量中学习多光谱去马赛克。PEFD a) 利用基于相机的成像系统的投影几何,借助比以往去马赛克方法更丰富的群结构来恢复更多的零空间信息;b) 通过适应为1-3通道成像设计的预训练基础模型,能够高效地在没有GT的情况下进行学习。在术中和汽车数据集上,PEFD能够恢复细微的细节,如血管,并保持光谱保真度,显著优于最近的方法,接近监督学习的性能。
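The equivariance being fine-tuned for can be written as a penalty on the gap between transform-then-predict and predict-then-transform. The sketch below uses a 90-degree rotation as a simple stand-in for the richer projective transforms PEFD actually exploits:

```python
import numpy as np

def equivariance_penalty(f, y):
    """Mean-squared gap between f(g(y)) and g(f(y)) for a group action g.
    Here g is a 90-degree rotation, a toy stand-in for the perspective
    transforms of camera-based imaging that PEFD leverages."""
    g = lambda x: np.rot90(x, k=1, axes=(0, 1))
    return float(np.mean((f(g(y)) - g(f(y))) ** 2))

y = np.random.default_rng(0).random((16, 16, 8))            # toy H x W x bands stack
zero_gap = equivariance_penalty(lambda x: 2.0 * x, y)       # equivariant map
bias = np.linspace(0.0, 1.0, 16)[None, :, None]             # column-dependent gain
nonzero_gap = equivariance_penalty(lambda x: x * bias, y)   # breaks equivariance
```

Driving such a penalty to zero constrains the network on inputs related by the group action, which is how extra null-space information can be recovered without any ground-truth supervision.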
cs.CV / 204 / 2603.01361
MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention
MixerCSeg:一种高效的裂缝分割混合架构,通过解耦的Mamba注意力机制
Abstract
Feature encoders play a key role in pixel-level crack segmentation by shaping the representation of fine textures and thin structures. Existing CNN-, Transformer-, and Mamba-based models each capture only part of the required spatial or structural information, leaving clear gaps in modeling complex crack patterns. To address this, we present MixerCSeg, a mixer architecture designed like a coordinated team of specialists, where CNN-like pathways focus on local textures, Transformer-style paths capture global dependencies, and Mamba-inspired flows model sequential context within a single encoder. At the core of MixerCSeg is the TransMixer, which explores Mamba's latent attention behavior while establishing dedicated pathways that naturally express both locality and global awareness. To further enhance structural fidelity, we introduce a spatial block processing strategy and a Direction-guided Edge Gated Convolution (DEGConv) that strengthens edge sensitivity under irregular crack geometries with minimal computational overhead. A Spatial Refinement Multi-Level Fusion (SRF) module is then employed to refine multi-scale details without increasing complexity. Extensive experiments on multiple crack segmentation benchmarks show that MixerCSeg achieves state-of-the-art performance with only 2.05 GFLOPs and 2.54 M parameters, demonstrating both efficiency and strong representational capability. The code is available at https://github.com/spiderforest/MixerCSeg.
Chinese Translation
特征编码器在像素级裂缝分割中发挥着关键作用,通过塑造细微纹理和薄结构的表示。现有的基于CNN、Transformer和Mamba的模型仅捕捉到所需空间或结构信息的一部分,导致在建模复杂裂缝模式时存在明显的空白。为了解决这个问题,我们提出了MixerCSeg,这是一种设计得像协调专家团队的混合架构,其中类似CNN的路径专注于局部纹理,类似Transformer的路径捕捉全局依赖关系,而受Mamba启发的流则在单个编码器内建模顺序上下文。MixerCSeg的核心是TransMixer,它探索了Mamba的潜在注意力行为,同时建立了专门的路径,自然地表达局部性和全局意识。为了进一步增强结构的保真度,我们引入了一种空间块处理策略和一种方向引导的边缘门控卷积(DEGConv),在不规则裂缝几何形状下增强边缘敏感性,同时保持最小的计算开销。然后,采用空间细化多级融合(SRF)模块来细化多尺度细节,而不增加复杂性。在多个裂缝分割基准上的广泛实验表明,MixerCSeg以仅2.05 GFLOPs和2.54M参数实现了最先进的性能,展示了其高效性和强大的表示能力。代码可在 https://github.com/spiderforest/MixerCSeg 获取。
cs.CV / 205 / 2603.01371
TIMI: Training-Free Image-to-3D Multi-Instance Generation with Spatial Fidelity
TIMI:无训练的图像到三维多实例生成框架,具有空间保真度
Abstract
Precise spatial fidelity in Image-to-3D multi-instance generation is critical for downstream real-world applications. Recent work attempts to address this by fine-tuning pre-trained Image-to-3D (I23D) models on multi-instance datasets, which incurs substantial training overhead and struggles to guarantee spatial fidelity. In fact, we observe that pre-trained I23D models already possess meaningful spatial priors, which remain underutilized as evidenced by instance entanglement issues. Motivated by this, we propose TIMI, a novel Training-free framework for Image-to-3D Multi-Instance generation that achieves high spatial fidelity. Specifically, we first introduce an Instance-aware Separation Guidance (ISG) module, which facilitates instance disentanglement during the early denoising stage. Next, to stabilize the guidance introduced by ISG, we devise a Spatial-stabilized Geometry-adaptive Update (SGU) module that promotes the preservation of the geometric characteristics of instances while maintaining their relative relationships. Extensive experiments demonstrate that our method yields better performance in terms of both global layout and distinct local instances compared to existing multi-instance methods, without requiring additional training and with faster inference speed.
Chinese Translation
在图像到三维多实例生成中,精确的空间保真度对于下游实际应用至关重要。近期的研究尝试通过在多实例数据集上微调预训练的图像到三维(Image-to-3D, I23D)模型来解决这一问题,但这会带来大量的训练开销,并且难以保证空间保真度。事实上,我们观察到预训练的I23D模型已经具备有意义的空间先验,但由于实例纠缠问题,这些先验未得到充分利用。基于此,我们提出了TIMI,一个新颖的无训练框架,用于图像到三维多实例生成,能够实现高空间保真度。具体而言,我们首先引入了一个实例感知分离引导(Instance-aware Separation Guidance, ISG)模块,该模块在早期去噪阶段促进实例的解缠结。接下来,为了稳定ISG引入的引导,我们设计了一个空间稳定几何自适应更新(Spatial-stabilized Geometry-adaptive Update, SGU)模块,该模块在保持实例相对关系的同时,促进几何特征的保留。大量实验表明,我们的方法在全局布局和各个独立局部实例方面的性能优于现有的多实例方法,无需额外训练且具有更快的推理速度。
cs.CV / 206 / 2603.01398
Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis
用于真实大气湍流合成的连续曝光时间建模
Abstract
Atmospheric turbulence significantly degrades long-range imaging by introducing geometric warping and exposure-time-dependent blur, which adversely affects both visual quality and the performance of high-level vision tasks. Existing methods for synthesizing turbulence effects often oversimplify the relationship between blur and exposure-time, typically assuming fixed or binary exposure settings. This leads to unrealistic synthetic data and limited generalization capability of trained models. To address this gap, we revisit the modulation transfer function (MTF) formulation and propose a novel Exposure-Time-dependent MTF (ET-MTF) that models blur as a continuous function of exposure-time. For blur synthesis, we derive a tilt-invariant point spread function (PSF) from the ET-MTF, which, when integrated with a spatially varying blur-width field, provides a comprehensive and physically accurate characterization of turbulence-induced blur. Building on this synthesis pipeline, we construct ET-Turb, a large-scale synthetic turbulence dataset that explicitly incorporates continuous exposure-time modeling across diverse optical and atmospheric conditions. The dataset comprises 5,083 videos (2,005,835 frames), partitioned into 3,988 training and 1,095 test videos. Extensive experiments demonstrate that models trained on ET-Turb produce more realistic restorations and achieve superior generalization on real-world turbulence data compared to those trained on other datasets. The dataset is publicly available at: github.com/Jun-Wei-Zeng/ET-Turb.
Chinese Translation
大气湍流显著降低了远程成像的质量,导致几何扭曲和依赖曝光时间的模糊,这对视觉质量和高层次视觉任务的性能产生了不利影响。现有的湍流效应合成方法通常简化了模糊与曝光时间之间的关系,通常假设固定或二元曝光设置。这导致合成数据不够真实,训练模型的泛化能力有限。为了解决这一问题,我们重新审视了调制传递函数(MTF)的公式,并提出了一种新颖的依赖曝光时间的MTF(ET-MTF),将模糊建模为曝光时间的连续函数。为了合成模糊,我们从ET-MTF推导出一个倾斜不变的点扩散函数(PSF),该函数与空间变化的模糊宽度场结合,提供了对湍流引起的模糊的全面且物理准确的表征。在此合成流程的基础上,我们构建了ET-Turb,一个大规模的合成湍流数据集,明确地在多种光学和大气条件下纳入了连续曝光时间建模。该数据集包含5,083个视频(2,005,835帧),分为3,988个训练视频和1,095个测试视频。大量实验表明,与在其他数据集上训练的模型相比,在ET-Turb上训练的模型能够产生更真实的恢复效果,并在真实世界的湍流数据上实现更优的泛化能力。该数据集已公开可用,网址为:github.com/Jun-Wei-Zeng/ET-Turb。
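The abstract's central point, blur as a continuous function of exposure time, can be illustrated with a toy kernel whose width grows with t. The ET-MTF derivation itself is not reproduced here; the functional form and the sigma0/rate parameters below are made up for illustration:

```python
import numpy as np

def exposure_blur_kernel(t_exposure, sigma0=0.4, rate=1.2, size=21):
    """Toy 1-D turbulence blur kernel whose width grows continuously with
    exposure time, in contrast to fixed or binary exposure assumptions."""
    sigma = sigma0 + rate * np.sqrt(t_exposure)   # illustrative width law, not ET-MTF
    x = np.arange(size) - size // 2
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()                            # normalized PSF

short = exposure_blur_kernel(0.001)               # near-frozen turbulence
long_ = exposure_blur_kernel(0.1)                 # time-averaged, wider blur
```

Synthesizing training pairs across a continuum of exposure times, rather than two fixed settings, is what lets models trained on such data generalize to real cameras with arbitrary shutter speeds.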
cs.CV / 207 / 2603.01400
Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
通过局部和全局上下文优化实现高效视频大语言模型的令牌减少
Abstract
Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning primarily targets intra-frame spatial redundancy or prunes inside the LLM with shallow-layer overhead, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility. These methods also often discard subtle yet informative context from merged or pruned tokens. In this paper, we propose a new perspective that elaborates token Anchors within intra-frame and inter-frame contexts to comprehensively aggregate informative content via local-global Optimal Transport (AOT). Specifically, we first establish local- and global-aware token anchors within each frame under attention guidance; optimal transport then aggregates the informative contexts from pruned tokens, constructing intra-frame token anchors. Then, building on the temporal frame clips, the first frame within each clip is treated as the keyframe anchor to ensemble similar information from consecutive frames through optimal transport, while keeping distinct tokens to represent temporal dynamics, leading to efficient token reduction in a training-free manner. Extensive evaluations show that our proposed AOT obtains competitive performance across various short- and long-video benchmarks on leading video LLMs, achieving substantial computational efficiency while preserving temporal and visual fidelity. Project webpage: https://tyroneli.github.io/AOT.
Chinese Translation
视频大语言模型(VLLMs)展示了强大的视频理解能力,但由于冗余的视觉令牌而面临效率低下的问题。现有的剪枝方法主要针对帧内空间冗余,或在大语言模型(LLM)内部进行浅层剪枝,导致时空减少效果不佳,并未充分利用长上下文的压缩能力。这些方法往往会丢弃合并或剪枝令牌中的微妙但重要的上下文。本文提出了一种新的视角,构建帧内和帧间的令牌锚点,并通过局部-全局最优传输(AOT)全面聚合信息上下文。具体而言,我们首先在每一帧内建立局部和全局感知的令牌锚点,在注意力引导下,通过最优传输聚合来自剪枝令牌的信息上下文,从而构建帧内令牌锚点。然后,基于时间帧片段,考虑每个片段内的第一帧作为关键帧锚点,通过最优传输从连续帧中集成相似信息,同时保留不同的令牌以表示时间动态,从而以无训练的方式实现高效的令牌减少。广泛的评估表明,我们提出的AOT在领先的视频大语言模型上,在各种短视频和长视频基准测试中获得了具有竞争力的性能,显著提高了计算效率,同时保持了时间和视觉的保真度。项目网页:https://tyroneli.github.io/AOT。
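A generic entropic (Sinkhorn) optimal-transport plan can sketch how pruned-token content might be folded into anchors, as the abstract describes; AOT's actual attention-guided formulation may differ from this toy version:

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.5, iters=500):
    """Entropic optimal-transport plan between uniform marginals
    (standard Sinkhorn iterations; a sketch, not AOT's exact scheme)."""
    m, n = cost.shape
    K = np.exp(-cost / eps)
    a, b = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    u, v = np.ones(m), np.ones(n)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
pruned = rng.standard_normal((12, 16))       # tokens slated for removal
anchors = rng.standard_normal((3, 16))       # surviving anchor tokens
cost = np.linalg.norm(pruned[:, None, :] - anchors[None, :, :], axis=-1)
plan = sinkhorn_plan(cost)
# fold pruned-token content into each anchor as a transport-weighted average
merged = (plan.T @ pruned) / plan.sum(axis=0)[:, None]
anchors_updated = 0.5 * (anchors + merged)   # illustrative blend
```

Unlike hard top-k pruning, the transport plan distributes every pruned token's content across anchors, which is why such schemes can keep "subtle yet informative context" that would otherwise be discarded.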
cs.CV / 208 / 2603.01412
UETrack: A Unified and Efficient Framework for Single Object Tracking
UETrack:一个统一高效的单目标跟踪框架
Abstract
With growing real-world demands, efficient tracking has received increasing attention. However, most existing methods are limited to RGB inputs and struggle in multi-modal scenarios. Moreover, current multi-modal tracking approaches typically use complex designs, making them too heavy and slow for resource-constrained deployment. To tackle these limitations, we propose UETrack, an efficient framework for single object tracking. UETrack demonstrates high practicality and versatility, efficiently handling multiple modalities including RGB, Depth, Thermal, Event, and Language, and addresses the gap in efficient multi-modal tracking. It introduces two key components: a Token-Pooling-based Mixture-of-Experts mechanism that enhances modeling capacity through feature aggregation and expert specialization, and a Target-aware Adaptive Distillation strategy that selectively performs distillation based on sample characteristics, reducing redundant supervision and improving performance. Extensive experiments on 12 benchmarks across 3 hardware platforms show that UETrack achieves a superior speed-accuracy trade-off compared to previous methods. For instance, UETrack-B achieves 69.2% AUC on LaSOT and runs at 163/56/60 FPS on GPU/CPU/AGX, demonstrating strong practicality and versatility. Code is available at https://github.com/kangben258/UETrack.
Chinese Translation
随着现实世界需求的增长,高效跟踪受到了越来越多的关注。然而,现有的大多数方法仅限于RGB输入,并且在多模态场景中表现不佳。此外,目前的多模态跟踪方法通常采用复杂的设计,使其在资源受限的环境中显得过于沉重和缓慢。为了解决这些局限性,我们提出了UETrack,一个高效的单目标跟踪框架。UETrack展现了高度的实用性和通用性,能够高效处理包括RGB、深度、热成像、事件和语言在内的多种模态,并填补了高效多模态跟踪的空白。它引入了两个关键组件:基于Token-Pooling的专家混合机制,通过特征聚合和专家专业化增强建模能力;以及目标感知的自适应蒸馏策略,根据样本特征选择性地进行蒸馏,减少冗余监督并提高性能。在3个硬件平台上的12个基准测试中进行的广泛实验表明,UETrack在速度与准确性之间实现了优越的平衡。例如,UETrack-B在LaSOT上达到了69.2%的AUC,并在GPU/CPU/AGX上分别以163/56/60 FPS运行,展示了其强大的实用性和通用性。代码可在 https://github.com/kangben258/UETrack 获取。
cs.CV / 209 / 2603.01418
UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
UniTalking:一个统一的音视频框架用于生成对话肖像
Abstract
While state-of-the-art audio-video generation models like Veo3 and Sora2 demonstrate remarkable capabilities, their closed-source nature makes their architectures and training paradigms inaccessible. To bridge this gap in accessibility and performance, we introduce UniTalking, a unified, end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video. At its core, our framework employs Multi-Modal Transformer Blocks to explicitly model the fine-grained temporal correspondence between audio and video latent tokens via a shared self-attention mechanism. By leveraging powerful priors from a pre-trained video generation model, our framework ensures state-of-the-art visual fidelity while enabling efficient training. Furthermore, UniTalking incorporates a personalized voice cloning capability, allowing the generation of speech in a target style from a brief audio reference. Qualitative and quantitative results demonstrate that our method produces highly realistic talking portraits, achieving superior performance over existing open-source approaches in lip-sync accuracy, audio naturalness, and overall perceptual quality.
Chinese Translation
尽管最先进的音视频生成模型如Veo3和Sora2展现了卓越的能力,但其闭源特性使得其架构和训练范式难以获取。为了弥补这一在可及性和性能上的差距,我们提出了UniTalking,一个统一的端到端扩散框架,用于生成高保真度的语音和唇部同步视频。我们的框架核心采用多模态变换器模块(Multi-Modal Transformer Blocks),通过共享自注意力机制明确建模音频和视频潜在标记之间的细粒度时间对应关系。通过利用预训练视频生成模型的强大先验,我们的框架确保了最先进的视觉保真度,同时实现了高效的训练。此外,UniTalking还集成了个性化语音克隆能力,允许从简短的音频参考生成目标风格的语音。定性和定量结果表明,我们的方法生成了高度逼真的对话肖像,在唇部同步精度、音频自然性和整体感知质量方面超越了现有的开源方法。
cs.CV / 210 / 2603.01431
SeaVIS: Sound-Enhanced Association for Online Audio-Visual Instance Segmentation
SeaVIS:用于在线音视频实例分割的声音增强关联
Abstract
Recently, an audio-visual instance segmentation (AVIS) task has been introduced, aiming to identify, segment and track individual sounding instances in videos. However, prevailing methods primarily adopt the offline paradigm, that cannot associate detected instances across consecutive clips, making them unsuitable for real-world scenarios that involve continuous video streams. To address this limitation, we introduce SeaVIS, the first online framework designed for audio-visual instance segmentation. SeaVIS leverages the Causal Cross Attention Fusion (CCAF) module to enable efficient online processing, which integrates visual features from the current frame with the entire audio history under strict causal constraints. A major challenge for conventional VIS methods is that appearance-based instance association fails to distinguish between an object's sounding and silent states, resulting in the incorrect segmentation of silent objects. To tackle this, we employ an Audio-Guided Contrastive Learning (AGCL) strategy to generate instance prototypes that encode not only visual appearance but also sounding activity. In this way, instances preserved during per-frame prediction that do not emit sound can be effectively suppressed during instance association process, thereby significantly enhancing the audio-following capability of SeaVIS. Extensive experiments conducted on the AVISeg dataset demonstrate that SeaVIS surpasses existing state-of-the-art models across multiple evaluation metrics while maintaining a competitive inference speed suitable for real-time processing.
Chinese Translation
最近,提出了一种音视频实例分割(AVIS)任务,旨在识别、分割和跟踪视频中的单个发声实例。然而,现有方法主要采用离线范式,无法在连续片段之间关联检测到的实例,使其不适用于涉及连续视频流的现实场景。为了解决这一限制,我们引入了SeaVIS,这是第一个专为音视频实例分割设计的在线框架。SeaVIS利用因果交叉注意力融合(Causal Cross Attention Fusion, CCAF)模块实现高效的在线处理,该模块在严格的因果约束下,将当前帧的视觉特征与整个音频历史进行整合。传统视觉实例分割(VIS)方法面临的一个主要挑战是,基于外观的实例关联无法区分物体的发声状态和静音状态,导致静音物体的错误分割。为了解决这个问题,我们采用了音频引导对比学习(Audio-Guided Contrastive Learning, AGCL)策略,生成不仅编码视觉外观而且编码发声活动的实例原型。通过这种方式,在每帧预测过程中保留的未发声实例可以在实例关联过程中有效抑制,从而显著增强SeaVIS的音频跟随能力。在AVISeg数据集上进行的大量实验表明,SeaVIS在多个评估指标上超越了现有的最先进模型,同时保持了适合实时处理的竞争推理速度。
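The causal constraint at the heart of CCAF can be illustrated in a few lines: the current frame's visual feature queries the accumulated audio history, and audio tokens later than the current frame are masked out. A minimal NumPy sketch of the general idea (the function name, the single-query simplification, and the shapes are our own assumptions, not the paper's code):

```python
import numpy as np

def causal_cross_attention(visual_q, audio_kv, audio_t, frame_t):
    """Fuse the current frame's visual feature with the audio history.

    visual_q : (d,) query vector from the current video frame
    audio_kv : (T, d) audio tokens accumulated so far
    audio_t  : (T,) timestamp of each audio token
    frame_t  : timestamp of the current frame; later tokens are masked
    """
    d = visual_q.shape[0]
    scores = audio_kv @ visual_q / np.sqrt(d)                # (T,)
    scores = np.where(audio_t <= frame_t, scores, -np.inf)   # causal mask
    weights = np.exp(scores - scores.max())                  # stable softmax
    weights /= weights.sum()
    return weights @ audio_kv                                # fused (d,)
```

In the real module this would run per frame over learned features; the point is only that the attention weights can never touch audio from the future.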
cs.CV / 211 / 2603.01433
DOCFORGE-BENCH: A Comprehensive Benchmark for Document Forgery Detection and Analysis
DOCFORGE-BENCH:文档伪造检测与分析的综合基准
Abstract
We present DOCFORGE-BENCH, the first unified zero-shot benchmark for document forgery detection, evaluating 14 methods across eight datasets spanning text tampering, receipt forgery, and identity document manipulation. Unlike fine-tuning-oriented evaluations such as ForensicHub [Du et al., 2025], DOCFORGE-BENCH applies all methods with their published pretrained weights and no domain adaptation -- a deliberate design choice that reflects the realistic deployment scenario where practitioners lack labeled document training data. Our central finding is a pervasive calibration failure invisible under single-threshold protocols: methods achieve moderate Pixel-AUC (>=0.76) yet near-zero Pixel-F1. This AUC-F1 gap is not a discrimination failure but a score-distribution shift: tampered regions occupy only 0.27-4.17% of pixels in document images -- an order of magnitude less than in natural image benchmarks -- making the standard tau=0.5 threshold catastrophically miscalibrated. Oracle-F1 is 2-10x higher than fixed-threshold Pixel-F1, confirming that calibration, not representation, is the bottleneck. A controlled calibration experiment validates this: adapting a single threshold on N=10 domain images recovers 39-55% of the Oracle-F1 gap, demonstrating that threshold adaptation -- not retraining -- is the key missing step for practical deployment. Overall, no evaluated method works reliably out-of-the-box on diverse document types, underscoring that document forgery detection remains an unsolved problem. We further note that all eight datasets predate the era of generative AI editing; benchmarks covering diffusion- and LLM-based document forgeries represent a critical open gap on the modern attack surface.
Chinese Translation
我们提出了DOCFORGE-BENCH,这是第一个统一的零样本文档伪造检测基准,评估了14种方法在八个数据集上的表现,这些数据集涵盖了文本篡改、收据伪造和身份文件操控。与以微调为导向的评估(如ForensicHub [Du et al., 2025])不同,DOCFORGE-BENCH在没有领域适应的情况下,使用所有方法的已发布预训练权重进行评估——这一设计选择反映了现实部署场景中,实践者缺乏标注文档训练数据的情况。我们的核心发现是,在单阈值协议下,普遍存在不可见的校准失败:方法的Pixel-AUC(>=0.76)适中,但Pixel-F1接近于零。这个AUC-F1差距并不是区分失败,而是评分分布的偏移:篡改区域仅占文档图像中0.27-4.17%的像素——比自然图像基准低一个数量级——使得标准的tau=0.5阈值严重失校。Oracle-F1比固定阈值的Pixel-F1高出2-10倍,确认了校准而非表示是瓶颈。一项受控的校准实验验证了这一点:在N=10个领域图像上调整单个阈值可以恢复39-55%的Oracle-F1差距,证明阈值调整——而非重新训练——是实际部署的关键缺失步骤。总体而言,没有评估的方法能够在多样化的文档类型上可靠地开箱即用,这突显了文档伪造检测仍然是一个未解决的问题。我们进一步指出,所有八个数据集都早于生成AI编辑的时代;涵盖扩散和基于大语言模型(LLM)的文档伪造的基准在现代攻击面上代表了一个重要的开放空白。
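The AUC-F1 gap described here comes down to threshold miscalibration, which is easy to reproduce in miniature: when tampered pixels are about 1% of an image and all scores sit below 0.5, the fixed tau=0.5 threshold yields zero F1 even though the scores separate the classes perfectly. A hedged NumPy sketch of the calibration step (function names and the threshold grid are our own choices):

```python
import numpy as np

def pixel_f1(scores, mask, tau):
    """Pixel-level F1 of thresholded forgery scores against a binary mask."""
    pred = scores >= tau
    tp = np.logical_and(pred, mask).sum()
    fp = np.logical_and(pred, ~mask).sum()
    fn = np.logical_and(~pred, mask).sum()
    return 2 * tp / max(2 * tp + fp + fn, 1)

def calibrate_threshold(score_maps, masks, grid=np.linspace(0.01, 0.99, 99)):
    """Pick the single threshold maximizing mean F1 on a few labeled
    calibration images (the paper adapts on N=10 domain images)."""
    mean_f1 = [np.mean([pixel_f1(s, m, t) for s, m in zip(score_maps, masks)])
               for t in grid]
    return grid[int(np.argmax(mean_f1))]
```

With well-separated but low-valued scores, `pixel_f1` at tau=0.5 is 0 while the calibrated threshold recovers a perfect score, mirroring the paper's Oracle-F1 observation.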
cs.CV / 212 / 2603.01441
Unifying Language-Action Understanding and Generation for Autonomous Driving
统一语言-动作理解与生成用于自主驾驶
Abstract
Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language-action mapping. Finally, we replace the slow, step-by-step generation with C2F, a two-step coarse-to-fine generation method that efficiently decodes the action sequence, reducing inference time by 86%. Experiments on closed-loop driving benchmarks show consistent gains in instruction-following accuracy and driving performance, alongside reduced inference latency.
Chinese Translation
视觉-语言-动作(VLA)模型作为一种有前景的端到端自主驾驶范式,因其能够利用世界知识并推理复杂驾驶场景而受到重视。然而,现有方法存在两个关键限制:语言指令与动作输出之间的持续不对齐,以及典型自回归动作生成的固有低效性。本文介绍了LinkVLA,一种新颖的架构,直接解决这些挑战,以增强对齐性和效率。首先,我们通过将语言和动作标记统一到一个共享的离散词汇表中,建立了结构性链接,并在单一的多模态模型中进行处理。这从根本上强制实现了跨模态一致性。其次,为了创建深层语义链接,我们引入了辅助的动作理解目标,训练模型从轨迹生成描述性标题,促进双向语言-动作映射。最后,我们用一种两步粗到细的生成方法C2F替代了缓慢的逐步生成,该方法高效解码动作序列,节省了86%的推理时间。在闭环驾驶基准测试中的实验显示,指令遵循准确性和驾驶性能均有一致提升,同时推理延迟减少。
cs.CV / 213 / 2603.01450
Deepfake Forensics Adapter: A Dual-Stream Network for Generalizable Deepfake Detection
深度伪造取证适配器:一种用于可泛化深度伪造检测的双流网络
Abstract
The rapid advancement of deepfake generation techniques poses significant threats to public safety and causes societal harm through the creation of highly realistic synthetic facial media. Because existing detection methods show limited generalization to emerging forgery patterns, this paper presents Deepfake Forensics Adapter (DFA), a novel dual-stream framework that synergizes vision-language foundation models with targeted forensics analysis. Our approach integrates a pre-trained CLIP model with three core components to achieve specialized deepfake detection by leveraging the powerful general capabilities of CLIP without changing CLIP parameters: 1) a Global Feature Adapter identifies global inconsistencies in image content that may indicate forgery, 2) a Local Anomaly Stream enhances the model's ability to perceive local facial forgery cues by explicitly leveraging facial structure priors, and 3) an Interactive Fusion Classifier promotes deep interaction and fusion between global and local features using a transformer encoder. Extensive evaluations on frame-level and video-level benchmarks demonstrate the superior generalization capabilities of DFA, which achieves state-of-the-art performance on the challenging DFDC dataset with a frame-level AUC/EER of 0.816/0.256 and a video-level AUC/EER of 0.836/0.251, a 4.8% video-AUC improvement over previous methods. Our framework not only demonstrates state-of-the-art performance, but also points to a feasible and effective direction for developing a robust deepfake detection system with enhanced generalization against evolving deepfake threats. Our code is available at https://github.com/Liao330/DFA.git
Chinese Translation
深度伪造生成技术的快速发展对公共安全构成了重大威胁,并通过创建高度逼真的合成面部媒体造成了社会危害。鉴于现有的检测方法在对新兴伪造模式的泛化能力上存在局限性,本文提出了深度伪造取证适配器(Deepfake Forensics Adapter, DFA),这是一种新颖的双流框架,结合了视觉-语言基础模型与针对性的取证分析。我们的方法集成了一个预训练的CLIP模型,并通过三个核心组件实现了专业化的深度伪造检测,充分利用了CLIP强大的通用能力而不改变其参数:1)全局特征适配器用于识别可能表明伪造的图像内容中的全局不一致性;2)局部异常流通过明确利用面部结构先验,增强模型感知局部面部伪造线索的能力;3)交互融合分类器通过使用变换器编码器促进全局特征与局部特征之间的深度交互与融合。对帧级和视频级基准的广泛评估表明,DFA在泛化能力方面表现优越,特别是在具有挑战性的DFDC数据集上实现了最先进的性能,帧级AUC/EER为0.816/0.256,视频级AUC/EER为0.836/0.251,视频AUC较之前方法提高了4.8%。我们的框架不仅展示了最先进的性能,还指出了开发具有增强泛化能力的稳健深度伪造检测系统的可行且有效的方向。我们的代码可在 https://github.com/Liao330/DFA.git 获取。
cs.CV / 214 / 2603.01454
VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models
VidDoS:针对基于视频的大型语言模型的通用拒绝服务攻击
Abstract
Video-LLMs are increasingly deployed in safety-critical applications but are vulnerable to Energy-Latency Attacks (ELAs) that exhaust computational resources. Current image-centric methods fail because temporal aggregation mechanisms dilute individual frame perturbations. Additionally, real-time demands make instance-wise optimization impractical for continuous video streams. We introduce VidDoS, the first universal ELA framework tailored for Video-LLMs. Our method leverages universal optimization to create instance-agnostic triggers that require no inference-time gradient calculation. We achieve this through masked teacher forcing to steer models toward expensive target sequences, combined with a refusal penalty and early-termination suppression to override conciseness priors. Testing across three mainstream Video-LLMs and three video datasets, which include video question answering and autonomous driving scenarios, shows extreme degradation. VidDoS induces a token expansion of more than 205× and inflates the inference latency by more than 15× relative to clean baselines. Simulations of real-time autonomous driving streams further reveal that this induced latency leads to critical safety violations. We urge the community to recognize and mitigate these high-hazard ELAs in Video-LLMs.
Chinese Translation
视频大型语言模型(Video-LLMs)越来越多地应用于安全关键的场景,但它们易受到能量-延迟攻击(Energy-Latency Attacks, ELAs)的影响,这种攻击会耗尽计算资源。当前以图像为中心的方法失败,因为时间聚合机制稀释了单帧扰动。此外,实时需求使得对连续视频流进行实例级优化变得不切实际。我们提出了VidDoS,这是首个专为Video-LLMs量身定制的通用ELA框架。我们的方法利用通用优化创建实例无关的触发器,无需在推理时计算梯度。我们通过“掩蔽教师强制”(masked teacher forcing)引导模型朝向高成本目标序列,同时结合“拒绝惩罚”(refusal penalty)和“提前终止抑制”(early-termination suppression)来覆盖简洁性先验。对三种主流Video-LLMs和三个视频数据集的测试,包括视频问答和自动驾驶场景,显示出极大的性能下降。VidDoS导致的令牌扩展超过205倍,并使推理延迟相对于干净基线增加超过15倍。对实时自动驾驶流的模拟进一步揭示,这种引入的延迟导致了严重的安全违规。我们呼吁学术界认识并缓解Video-LLMs中这些高风险的ELA。
cs.CV / 215 / 2603.01455
From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
从逐字到要旨:通过语义信息瓶颈提炼金字塔多模态记忆以支持长时间视频代理
Abstract
While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. In inference, we design an entropy-driven top-down memory retrieval strategy, which first tries with the abstract Symbolic Schema and progressively "drills down" to the Sensory Buffer and Episodic Stream under high uncertainty. Extensive experiments across 4 benchmarks confirm the effectiveness of MM-Mem on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code is available at https://github.com/EliSpectre/MM-Mem.
Chinese Translation
尽管多模态大型语言模型在短期推理方面表现出色,但由于上下文窗口的限制和静态记忆机制无法反映人类认知效率,它们在长时间视频理解方面仍然面临挑战。现有的范式通常落入两个极端:以视觉为中心的方法通过密集的视觉积累导致高延迟和冗余,或以文本为中心的方法通过激进的字幕生成遭受细节丢失和幻觉。为了解决这一问题,我们提出了MM-Mem,一种基于模糊痕迹理论的金字塔多模态记忆架构。MM-Mem将记忆分层结构化为感官缓冲区、情节流和符号图式,从而实现将细粒度感知痕迹(逐字)逐步提炼为高级语义图式(要旨)。此外,为了管理记忆的动态构建,我们推导出一个语义信息瓶颈目标,并引入SIB-GRPO以优化记忆压缩与任务相关信息保留之间的权衡。在推理过程中,我们设计了一种基于熵驱动的自上而下的记忆检索策略,首先尝试抽象的符号图式,并在高不确定性下逐步“深入”到感官缓冲区和情节流。针对4个基准的广泛实验验证了MM-Mem在离线和流媒体任务上的有效性,展示了强大的泛化能力,并验证了以认知为灵感的记忆组织的有效性。代码可在 https://github.com/EliSpectre/MM-Mem 获取。
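The entropy-driven top-down retrieval can be sketched as a simple loop: answer from the most abstract memory level first, and drill down only while the answer distribution stays too uncertain. A minimal Python sketch (the level names, the `answer_with` callback, and the threshold are illustrative assumptions, not the paper's interface):

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of an answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def top_down_retrieve(answer_with, levels, threshold=0.5):
    """Query the most abstract memory level first and drill down under
    high uncertainty. `answer_with(level)` is assumed to return
    (answer, probs) for the given memory level."""
    for level in levels:                 # e.g. schema -> episodic -> sensory
        answer, probs = answer_with(level)
        if entropy(probs) <= threshold:  # confident enough: stop here
            return answer, level
    return answer, level                 # fall back to the finest level
```

A uniform answer distribution at the schema level (high entropy) triggers a drill-down; a peaked distribution at the episodic level stops the search there.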
cs.CV / 216 / 2603.01461
UltraStar: Semantic-Aware Star Graph Modeling for Echocardiography Navigation
UltraStar:用于超声心动图导航的语义感知星图建模
Abstract
Echocardiography is critical for diagnosing cardiovascular diseases, yet its high operational difficulty and the shortage of skilled sonographers hinder timely patient care. Consequently, research on automated probe navigation has significant clinical potential. To achieve robust navigation, it is essential to leverage historical scanning information, mimicking how experts rely on past feedback to adjust subsequent maneuvers. Practical scanning data collected from sonographers typically consists of noisy trajectories inherently generated through trial-and-error exploration. However, existing methods typically model this history as a sequential chain, forcing models to overfit these noisy paths, leading to performance degradation on long sequences. In this paper, we propose UltraStar, which reformulates probe navigation from path regression to anchor-based global localization. By establishing a Star Graph, UltraStar treats historical keyframes as spatial anchors connected directly to the current view, explicitly modeling geometric constraints for precise positioning. We further enhance the Star Graph with a semantic-aware sampling strategy that actively selects representative landmarks from massive history logs, reducing redundancy for accurate anchoring. Extensive experiments on a dataset with over 1.31 million samples demonstrate that UltraStar outperforms baselines and scales better with longer input lengths, revealing a more effective topology for history modeling under noisy exploration.
Chinese Translation
超声心动图在诊断心血管疾病中至关重要,但由于操作难度大,熟练的超声技师短缺,影响了及时的患者护理。因此,自动探头导航的研究具有重要的临床潜力。为了实现稳健的导航,利用历史扫描信息至关重要,这类似于专家如何依赖过去的反馈来调整后续操作。来自超声技师的实际扫描数据通常由通过试错探索生成的噪声轨迹组成。然而,现有方法通常将这一历史建模为顺序链,迫使模型过拟合这些噪声路径,从而导致在长序列上的性能下降。在本文中,我们提出了UltraStar,它将探头导航从路径回归重新构建为基于锚点的全局定位。通过建立星图,UltraStar将历史关键帧视为直接连接到当前视图的空间锚点,明确建模几何约束以实现精确定位。我们进一步通过语义感知采样策略增强星图,该策略主动从大量历史日志中选择代表性地标,减少冗余以实现准确锚定。在一个包含超过131万个样本的数据集上的广泛实验表明,UltraStar的表现优于基线,并且在更长输入长度下具有更好的扩展性,揭示了在噪声探索下历史建模的更有效拓扑。
cs.CV / 217 / 2603.01475
WildCross: A Cross-Modal Large Scale Benchmark for Place Recognition and Metric Depth Estimation in Natural Environments
WildCross:用于自然环境中地点识别和度量深度估计的跨模态大规模基准
Abstract
Recent years have seen a significant increase in demand for robotic solutions in unstructured natural environments, alongside growing interest in bridging 2D and 3D scene understanding. However, existing robotics datasets are predominantly captured in structured urban environments, making them inadequate for addressing the challenges posed by complex, unstructured natural settings. To address this gap, we propose WildCross, a cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF poses and synchronized dense lidar submaps. We conduct comprehensive experiments on visual, lidar, and cross-modal place recognition, as well as metric depth estimation, demonstrating the value of WildCross as a challenging benchmark for multi-modal robotic perception tasks. We provide access to the code repository and dataset at https://csiro-robotics.github.io/WildCross.
Chinese Translation
近年来,对在非结构化自然环境中机器人解决方案的需求显著增加,同时对桥接二维和三维场景理解的兴趣也在不断增长。然而,现有的机器人数据集主要是在结构化城市环境中捕获的,这使得它们无法有效应对复杂的非结构化自然环境所带来的挑战。为了解决这一问题,我们提出了WildCross,一个用于大规模自然环境中地点识别和度量深度估计的跨模态基准。WildCross包含超过476K个顺序RGB帧,配有半稠密深度和表面法线注释,每个帧都与准确的6DoF姿态对齐,并同步有稠密的激光雷达子地图。我们在视觉、激光雷达和跨模态地点识别以及度量深度估计方面进行了全面实验,展示了WildCross作为多模态机器人感知任务的挑战性基准的价值。我们提供了代码库和数据集的访问链接:https://csiro-robotics.github.io/WildCross。
cs.CV / 218 / 2603.01485
SCATR: Mitigating New Instance Suppression in LiDAR-based Tracking-by-Attention via Second Chance Assignment and Track Query Dropout
SCATR:通过二次机会分配和轨迹查询丢弃缓解基于LiDAR的注意力跟踪中的新实例抑制
Abstract
LiDAR-based tracking-by-attention (TBA) frameworks inherently suffer from high false negative errors, leading to a significant performance gap compared to traditional LiDAR-based tracking-by-detection (TBD) methods. This paper introduces SCATR, a novel LiDAR-based TBA model designed to address this fundamental challenge systematically. SCATR leverages recent progress in vision-based tracking and incorporates targeted training strategies specifically adapted for LiDAR. Our work's core innovations are two architecture-agnostic training strategies for TBA methods: Second Chance Assignment and Track Query Dropout. Second Chance Assignment is a novel ground truth assignment that concatenates unassigned track queries to the proposal queries before bipartite matching, giving these track queries a second chance to be assigned to a ground truth object and effectively mitigating the conflict between detection and tracking tasks inherent in tracking-by-attention. Track Query Dropout is a training method that diversifies supervised object query configurations to efficiently train the decoder to handle different track query sets, enhancing robustness to missing or newborn tracks. Experiments on the nuScenes tracking benchmark demonstrate that SCATR achieves state-of-the-art performance among LiDAR-based TBA methods, outperforming previous works by 7.6% AMOTA and successfully bridging the long-standing performance gap between LiDAR-based TBA and TBD methods. Ablation studies further validate the effectiveness and generalization of Second Chance Assignment and Track Query Dropout. Code can be found at the following link: https://github.com/TRAILab/SCATR
Chinese Translation
基于LiDAR的注意力跟踪(TBA)框架固有地面临高假阴性错误,导致与传统基于LiDAR的检测跟踪(TBD)方法之间存在显著的性能差距。本文提出了SCATR,一种旨在系统性解决这一根本挑战的新型基于LiDAR的TBA模型。SCATR利用了视觉跟踪的最新进展,并结合了专门为LiDAR调整的针对性训练策略。我们工作的核心创新是针对TBA方法的两种与架构无关的训练策略:二次机会分配(Second Chance Assignment)和轨迹查询丢弃(Track Query Dropout)。二次机会分配是一种新颖的真实标签分配方法,它在二分匹配之前将未分配的轨迹查询与提议查询连接起来,使这些轨迹查询有机会再次分配给真实目标,有效缓解了基于注意力的跟踪中检测与跟踪任务之间的冲突。轨迹查询丢弃是一种训练方法,通过多样化监督对象查询配置,来有效训练解码器以处理不同的轨迹查询集,从而增强对缺失或新生轨迹的鲁棒性。在nuScenes跟踪基准上的实验表明,SCATR在基于LiDAR的TBA方法中实现了最先进的性能,超越了之前的工作7.6% AMOTA,并成功弥合了基于LiDAR的TBA与TBD方法之间长期存在的性能差距。消融研究进一步验证了二次机会分配和轨迹查询丢弃的有效性和泛化能力。代码可在以下链接找到:https://github.com/TRAILab/SCATR
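The mechanics of Second Chance Assignment (append the still-unassigned track queries to the proposal queries as extra columns before bipartite matching) can be shown on a toy cost matrix. A brute-force sketch (the real matcher is the Hungarian algorithm over learned costs; all names and numbers here are illustrative):

```python
from itertools import permutations

def min_cost_match(cost):
    """Brute-force optimal one-to-one assignment; a stand-in for the
    Hungarian matcher, fine for tiny examples (assumes rows <= columns)."""
    n_gt, n_q = len(cost), len(cost[0])
    return list(min(permutations(range(n_q), n_gt),
                    key=lambda cols: sum(cost[i][j] for i, j in enumerate(cols))))

def second_chance_assign(proposal_cost, free_track_cost):
    """Concatenate the unassigned ('free') track queries to the proposal
    queries as extra columns, so a free track query can still win a
    ground-truth object instead of being forced to 'no object'."""
    n_prop = len(proposal_cost[0])
    cost = [p + t for p, t in zip(proposal_cost, free_track_cost)]
    return [("proposal", j) if j < n_prop else ("track", j - n_prop)
            for j in min_cost_match(cost)]
```

In the toy case below, the free track query has by far the lowest cost for the first ground-truth object, so the second chance lets it win the match that a proposals-only matcher would have given to a worse proposal.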
cs.CV / 219 / 2603.01490
ATA: Bridging Implicit Reasoning with Attention-Guided and Action-Guided Inference for Vision-Language Action Models
ATA:通过注意力引导和行动引导推理连接隐式推理与视觉语言行动模型
Abstract
Vision-Language-Action (VLA) models rely on current observations, including images, language instructions, and robot states, to predict actions and complete tasks. While accurate visual perception is crucial for precise action prediction and execution, recent work has attempted to further improve performance by introducing explicit reasoning during inference. However, such approaches face significant limitations. They often depend on data-intensive resources such as Chain-of-Thought (CoT) style annotations to decompose tasks into step-by-step reasoning, and in many cases require additional visual grounding annotations (e.g., bounding boxes or masks) to highlight relevant image regions. Moreover, they involve time-consuming dataset construction, labeling, and retraining, which ultimately results in longer inference sequences and reduced efficiency. To address these challenges, we propose ATA, a novel training-free framework that introduces implicit reasoning into VLA inference through complementary attention-guided and action-guided strategies. Unlike CoT or explicit visual-grounding methods, ATA formulates reasoning implicitly by integrating attention maps with an action-based region of interest (RoI), thereby adaptively refining visual inputs without requiring extra training or annotations. ATA is a plug-and-play implicit reasoning approach for VLA models, lightweight yet effective. Extensive experiments show that it consistently improves task success and robustness while preserving, and even enhancing, inference efficiency.
Chinese Translation
视觉-语言-行动(VLA)模型依赖于当前观察,包括图像、语言指令和机器人状态,以预测行动并完成任务。尽管准确的视觉感知对于精确的行动预测和执行至关重要,但近期的研究尝试通过在推理过程中引入显式推理来进一步提高性能。然而,这些方法面临着显著的限制。它们通常依赖于数据密集型资源,例如链式思维(Chain-of-Thought, CoT)风格的注释,将任务分解为逐步推理,并且在许多情况下需要额外的视觉定位注释(例如,边界框或掩码)来突出相关的图像区域。此外,它们涉及耗时的数据集构建、标注和再训练,最终导致更长的推理序列和效率降低。为了解决这些挑战,我们提出了ATA,一个新颖的无训练框架,通过互补的注意力引导和行动引导策略将隐式推理引入VLA推理。与CoT或显式视觉定位方法不同,ATA通过将注意力图与基于行动的兴趣区域(Region of Interest, RoI)相结合,隐式地形成推理,从而自适应地优化视觉输入,而无需额外的训练或注释。ATA是一种即插即用的隐式推理方法,适用于VLA模型,轻量且有效。大量实验表明,它在保持甚至增强推理效率的同时,始终提高任务成功率和鲁棒性。
cs.CV / 220 / 2603.01491
Radiometrically Consistent Gaussian Surfels for Inverse Rendering
辐射一致性的高斯表面点用于逆向渲染
Abstract
Inverse rendering with Gaussian Splatting has advanced rapidly, but accurately disentangling material properties from complex global illumination effects, particularly indirect illumination, remains a major challenge. Existing methods often query indirect radiance from Gaussian primitives pre-trained for novel-view synthesis. However, these pre-trained Gaussian primitives are supervised only towards limited training viewpoints, thus lack supervision for modeling indirect radiances from unobserved views. To address this issue, we introduce radiometric consistency, a novel physically-based constraint that provides supervision towards unobserved views by minimizing the residual between each Gaussian primitive's learned radiance and its physically-based rendered counterpart. Minimizing the residual for unobserved views establishes a self-correcting feedback loop that provides supervision from both physically-based rendering and novel-view synthesis, enabling accurate modeling of inter-reflection. We then propose Radiometrically Consistent Gaussian Surfels (RadioGS), an inverse rendering framework built on this principle that efficiently integrates radiometric consistency using Gaussian surfels and 2D Gaussian ray tracing. We further propose a finetuning-based relighting strategy that adapts Gaussian surfel radiances to new illuminations within minutes, achieving low rendering cost (<10ms). Extensive experiments on existing inverse rendering benchmarks show that RadioGS outperforms existing Gaussian-based methods in inverse rendering, while retaining the computational efficiency.
Chinese Translation
高斯点云的逆向渲染技术发展迅速,但从复杂的全局光照效果中准确分离材料属性,特别是间接光照,仍然是一个主要挑战。现有方法通常从为新视角合成预训练的高斯原语中查询间接辐射。然而,这些预训练的高斯原语仅针对有限的训练视角进行监督,因此缺乏对未观察视角的间接辐射建模的监督。为了解决这个问题,我们引入了辐射一致性,这是一种新颖的基于物理的约束,通过最小化每个高斯原语学习到的辐射与其基于物理的渲染对应物之间的残差,为未观察视角提供监督。最小化未观察视角的残差建立了一个自我校正的反馈循环,能够从基于物理的渲染和新视角合成中提供监督,从而实现间接反射的准确建模。随后,我们提出了辐射一致性的高斯表面点(RadioGS),这是一个基于我们的原则构建的逆向渲染框架,通过利用高斯表面点和二维高斯光线追踪高效整合辐射一致性。我们进一步提出了一种基于微调的重光照策略,该策略能够在几分钟内将高斯表面点的辐射适应于新的光照,实现低渲染成本(<10毫秒)。在现有逆向渲染基准上的大量实验表明,RadioGS在逆向渲染中优于现有的基于高斯的方法,同时保持计算效率。
cs.CV / 221 / 2603.01498
Tri-path DINO: Feature Complementary Learning for Remote Sensing Multi-Class Change Detection
三路径 DINO:遥感多类变化检测的特征互补学习
Abstract
In remote sensing imagery, multi-class change detection (MCD) is crucial for fine-grained monitoring, yet it has long been constrained by complex scene variations and the scarcity of detailed annotations. To address this, we propose the Tri-path DINO architecture, which adopts a three-path complementary feature learning strategy to facilitate the rapid adaptation of pre-trained foundation models to complex vertical domains. Specifically, we employ the DINOv3 pre-trained model as the backbone feature extraction network to learn coarse-grained features. An auxiliary path also adopts a siamese structure, progressively aggregating intermediate features from the siamese encoder to enhance the learning of fine-grained features. Finally, a multi-scale attention mechanism is introduced to augment the decoder network, where parallel convolutions adaptively capture and enhance contextual information under different receptive fields. The proposed method achieves optimal performance on the MCD task on both the Gaza facility damage assessment dataset (Gaza change) and the classic SECOND dataset. Grad-CAM visualizations further confirm that the main and auxiliary paths naturally focus on coarse-grained semantic changes and fine-grained structural details, respectively. This synergistic complementarity provides a robust and interpretable solution for advanced change detection tasks, offering a basis for rapid and accurate damage assessment.
Chinese Translation
在遥感图像中,多类变化检测(MCD)对细粒度监测至关重要,但长期以来受到复杂场景变化和详细注释稀缺的限制。为了解决这一问题,我们提出了三路径 DINO 架构,该架构采用三路径互补特征学习策略,以促进预训练基础模型对复杂垂直领域的快速适应。具体而言,我们采用 DINOv3 预训练模型作为主干特征提取网络,以学习粗粒度特征。辅助路径也采用了孪生结构,逐步聚合来自孪生编码器的中间特征,以增强细粒度特征的学习。最后,引入了一种多尺度注意力机制来增强解码器网络,其中并行卷积自适应地捕捉和增强不同感受野下的上下文信息。所提方法在加沙设施损坏评估数据集(Gaza change)和经典 SECOND 数据集上的 MCD 任务中实现了最佳性能。GradCAM 可视化进一步确认主路径和辅助路径分别自然聚焦于粗粒度语义变化和细粒度结构细节。这种协同互补性为高级变化检测任务提供了一个稳健且可解释的解决方案,为快速准确的损坏评估奠定了基础。
cs.CV / 222 / 2603.01506
OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar
OMG-头像:一次性多细节层次高斯头部头像
Abstract
We propose OMG-Avatar, a novel One-shot method that leverages a Multi-LOD (Level-of-Detail) Gaussian representation for animatable 3D head reconstruction from a single image in 0.2s. Our method enables LOD head avatar modeling using a unified model that accommodates diverse hardware capabilities and inference speed requirements. To capture both global and local facial characteristics, we employ a transformer-based architecture for global feature extraction and projection-based sampling for local feature acquisition. These features are effectively fused under the guidance of a depth buffer, ensuring occlusion plausibility. We further introduce a coarse-to-fine learning paradigm to support Level-of-Detail functionality and enhance the perception of hierarchical details. To address the limitations of 3DMMs in modeling non-head regions such as the shoulders, we introduce a multi-region decomposition scheme in which the head and shoulders are predicted separately and then integrated through cross-region combination. Extensive experiments demonstrate that OMG-Avatar outperforms state-of-the-art methods in reconstruction quality, reenactment performance, and computational efficiency.
Chinese Translation
我们提出了OMG-头像,一种新颖的一次性方法,利用多细节层次(Multi-LOD)高斯表示,从单幅图像中在0.2秒内重建可动画的3D头部。我们的方法使用统一模型进行LOD头部头像建模,适应多样的硬件能力和推理速度要求。为了捕捉全局和局部面部特征,我们采用基于变换器的架构进行全局特征提取,并使用基于投影的采样方法获取局部特征。这些特征在深度缓冲区的指导下有效融合,确保了遮挡的合理性。我们进一步引入了一种粗到细的学习范式,以支持细节层次功能并增强分层细节的感知。为了解决3DMM在建模非头部区域(如肩部)时的局限性,我们引入了一种多区域分解方案,其中头部和肩部分别预测,然后通过跨区域组合进行整合。大量实验表明,OMG-头像在重建质量、重演性能和计算效率方面优于现有的最先进方法。
cs.CV / 223 / 2603.01509
Retrieval, Refinement, and Ranking for Text-to-Video Generation via Prompt Optimization and Test-Time Scaling
通过提示优化和测试时缩放进行文本到视频生成的检索、精炼和排序
Abstract
While large-scale datasets have driven significant progress in Text-to-Video (T2V) generative models, these models remain highly sensitive to input prompts, demonstrating that prompt design is critical to generation quality. Current methods for improving video output often fall short: they either depend on complex, post-editing models, risking the introduction of artifacts, or require expensive fine-tuning of the core generator, which severely limits both scalability and accessibility. In this work, we introduce 3R, a novel RAG-based prompt optimization framework. 3R harnesses the power of current state-of-the-art T2V diffusion models and vision-language models, and it can be used with any T2V model without any model training. The framework leverages three key strategies: RAG-based modifiers extraction for enriched contextual grounding, diffusion-based preference optimization for aligning outputs with human preferences, and temporal frame interpolation for producing temporally consistent visual contents. Together, these components enable more accurate, efficient, and contextually aligned text-to-video generation. Experimental results demonstrate the efficacy of 3R in enhancing the static fidelity and dynamic coherence of generated videos, underscoring the importance of optimizing user prompts.
Chinese Translation
尽管大规模数据集推动了文本到视频(Text-to-Video, T2V)生成模型的显著进展,但这些模型对输入提示仍然高度敏感,表明提示设计对生成质量至关重要。目前改善视频输出的方法往往不尽如人意:它们要么依赖复杂的后期编辑模型,可能引入伪影,要么需要对核心生成器进行昂贵的微调,这严重限制了可扩展性和可访问性。在本研究中,我们提出了3R,一个基于RAG的提示优化框架。3R利用当前最先进的T2V扩散模型和视觉语言模型的强大能力。它可以与任何T2V模型一起使用,而无需任何模型训练。该框架采用三种关键策略:基于RAG的修饰符提取以丰富上下文基础,基于扩散的偏好优化以使输出与人类偏好对齐,以及时间帧插值以生成时间一致的视觉内容。这些组件共同实现了更准确、高效和上下文对齐的文本到视频生成。实验结果表明,3R在增强生成视频的静态保真度和动态一致性方面的有效性,强调了优化用户提示的重要性。
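Of the three strategies, the temporal frame interpolation step is the easiest to sketch. The version below is the simplest possible linear cross-fade stand-in (the paper's interpolation module is presumably learned, so treat this only as an illustration of where interpolation sits in the pipeline):

```python
import numpy as np

def interpolate_frames(frames, factor=2):
    """Insert (factor - 1) linearly blended frames between each adjacent
    pair of generated frames.

    frames : (N, H, W, C) float array of generated key frames
    """
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        for k in range(factor):
            t = k / factor
            out.append((1 - t) * a + t * b)  # cross-fade between the pair
    out.append(frames[-1])                   # keep the final key frame
    return np.stack(out)
```

With `factor=2`, two key frames become three output frames, the middle one a 50/50 blend.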
cs.CV / 224 / 2603.01515
FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation
FACE:一种基于面片的自回归表示用于高保真且高效的网格生成
Abstract
Autoregressive models for 3D mesh generation suffer from a fundamental limitation: they flatten meshes into long vertex-coordinate sequences. This results in prohibitive computational costs, hindering the efficient synthesis of high-fidelity geometry. We argue this bottleneck stems from operating at the wrong semantic level. We introduce FACE, a novel Autoregressive Autoencoder (ARAE) framework that reconceptualizes the task by generating meshes at the face level. Our one-face-one-token strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token. This simple yet powerful design reduces the sequence length by a factor of nine, leading to an unprecedented compression ratio of 0.11, halving the previous state-of-the-art. This dramatic efficiency gain does not compromise quality; by pairing our face-level decoder with a powerful VecSet encoder, FACE achieves state-of-the-art reconstruction quality on standard benchmarks. The versatility of the learned latent space is further demonstrated by training a latent diffusion model that achieves high-fidelity, single-image-to-mesh generation. FACE provides a simple, scalable, and powerful paradigm that lowers the barrier to high-quality structured 3D content creation.
Chinese Translation
用于三维网格生成的自回归模型存在一个根本性限制:它们将网格压缩为长的顶点坐标序列。这导致了高昂的计算成本,阻碍了高保真几何体的高效合成。我们认为这一瓶颈源于在错误的语义层面进行操作。我们提出了FACE,一个新颖的自回归自编码器(ARAE)框架,通过在面片级别生成网格重新构思了这一任务。我们的单面单标记策略将每个三角面片(网格的基本构建块)视为一个单一的统一标记。这一简单而强大的设计将序列长度减少了九倍,导致前所未有的压缩比0.11,较之前的最先进技术减半。这一显著的效率提升并未妥协质量;通过将我们的面片级解码器与强大的VecSet编码器配对,FACE在标准基准测试中达到了最先进的重建质量。通过训练一个潜在扩散模型,进一步展示了学习到的潜在空间的多样性,该模型实现了高保真度的单图像到网格生成。FACE提供了一种简单、可扩展且强大的范式,降低了高质量结构化三维内容创作的门槛。
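The nine-fold sequence reduction is simple arithmetic: flattening a triangle into vertex-coordinate tokens costs 3 vertices x 3 coordinates = 9 tokens, while one-face-one-token spends a single token per triangle. A toy tokenizer pair makes the ratio concrete (the token format is our own illustration; the paper additionally compresses the sequence with its autoencoder to reach the 0.11 ratio):

```python
def tokenize_per_vertex(faces):
    """Conventional flattening: one token per coordinate
    (3 vertices x 3 coordinates = 9 tokens per triangle)."""
    return [c for face in faces for vertex in face for c in vertex]

def tokenize_per_face(faces):
    """One-face-one-token: each triangle becomes a single composite token."""
    return [tuple(c for vertex in face for c in vertex) for face in faces]

# toy mesh: two triangles given as ((x, y, z), (x, y, z), (x, y, z))
mesh = [((0, 0, 0), (1, 0, 0), (0, 1, 0)),
        ((1, 0, 0), (1, 1, 0), (0, 1, 0))]
flat = tokenize_per_vertex(mesh)    # 18 coordinate tokens
packed = tokenize_per_face(mesh)    # 2 face tokens: 9x shorter
```

For an autoregressive model whose attention cost grows quadratically in sequence length, this factor of nine alone is a large share of the efficiency gain the abstract describes.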
cs.CV / 225 / 2603.01524
Better Matching, Less Forgetting: A Quality-Guided Matcher for Transformer-based Incremental Object Detection
更好的匹配,更少的遗忘:一种基于质量引导的变换器增量目标检测匹配器
Abstract
Incremental Object Detection (IOD) aims to continuously learn new object classes without forgetting previously learned ones. A persistent challenge is catastrophic forgetting, primarily attributed to background shift in conventional detectors. While pseudo-labeling mitigates this in dense detectors, we identify a novel, distinct source of forgetting specific to DETR-like architectures: background foregrounding. This arises from the exhaustiveness constraint of the Hungarian matcher, which forcibly assigns every ground truth target to one prediction, even when predictions primarily cover background regions (i.e., low IoU). This erroneous supervision compels the model to misclassify background features as specific foreground classes, disrupting learned representations and accelerating forgetting. To address this, we propose a Quality-guided Min-Cost Max-Flow (Q-MCMF) matcher. To avoid forced assignments, Q-MCMF builds a flow graph and prunes implausible matches based on geometric quality. It then optimizes for the final matching that minimizes cost and maximizes valid assignments. This strategy eliminates harmful supervision from background foregrounding while maximizing foreground learning signals. Extensive experiments on the COCO dataset under various incremental settings demonstrate that our method consistently outperforms existing state-of-the-art approaches.
Chinese Translation
增量目标检测(IOD)旨在不断学习新的目标类别,而不遗忘先前学习的类别。一个持续的挑战是灾难性遗忘,这主要归因于传统检测器中的背景偏移。虽然伪标签在密集检测器中缓解了这一问题,但我们识别出一种特有的、针对DETR(Detection Transformer)架构的遗忘新源:背景前置。这是由于匈牙利匹配器的全面性约束,强制将每个真实目标分配给一个预测,即使预测主要覆盖背景区域(即低IoU)。这种错误的监督迫使模型将背景特征错误分类为特定的前景类别,扰乱已学习的表示并加速遗忘。为了解决这个问题,我们提出了一种质量引导的最小成本最大流(Q-MCMF)匹配器。为了避免强制分配,Q-MCMF构建了一个流图,并根据几何质量修剪不合理的匹配。然后,它优化最终匹配,以最小化成本并最大化有效分配。该策略消除了来自背景前置的有害监督,同时最大化前景学习信号。在各种增量设置下对COCO数据集的广泛实验表明,我们的方法始终优于现有的最先进方法。
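The pruning-then-matching idea can be sketched directly: drop ground-truth/prediction pairs whose geometric quality (IoU) is implausible, then choose the assignment that first maximizes valid matches and then minimizes total cost, leaving low-IoU ground truths unmatched instead of forcing them onto background predictions. A brute-force toy version (the paper solves this as min-cost max-flow; names and thresholds here are illustrative):

```python
from itertools import permutations

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def quality_guided_match(gt_boxes, pred_boxes, costs, min_iou=0.3):
    """Prune geometrically implausible pairs, then pick the matching that
    maximizes valid assignments and minimizes total cost. Brute force for
    clarity; assumes len(gt_boxes) <= len(pred_boxes). Unmatched ground
    truths get None rather than a forced low-IoU prediction."""
    n_gt, n_pred = len(gt_boxes), len(pred_boxes)
    valid = [[iou(g, p) >= min_iou for p in pred_boxes] for g in gt_boxes]
    best = (-1, float("inf"), [None] * n_gt)
    for cols in permutations(range(n_pred), n_gt):
        assign = [j if valid[i][j] else None for i, j in enumerate(cols)]
        n_valid = sum(a is not None for a in assign)
        cost = sum(costs[i][a] for i, a in enumerate(assign) if a is not None)
        if (n_valid, -cost) > (best[0], -best[1]):
            best = (n_valid, cost, assign)
    return best[2]
```

In the test below, a Hungarian-style exhaustive matcher would force the second ground truth onto a far-away prediction (IoU 0); the quality-guided matcher leaves it unmatched, removing exactly the "background foregrounding" supervision the abstract identifies.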
cs.CV / 226 / 2603.01528
Boosting AI Reliability with an FSM-Driven Streaming Inference Pipeline: An Industrial Case
通过FSM驱动的流式推理管道提升人工智能可靠性:一个工业案例
Abstract
The widespread adoption of AI in industry is often hampered by its limited robustness when faced with scenarios absent from training data, leading to prediction bias and vulnerabilities. To address this, we propose a novel streaming inference pipeline that enhances data-driven models by explicitly incorporating prior knowledge. This paper presents our work on an industrial AI application that automatically counts excavator workloads from surveillance videos. Our approach integrates an object detection model with a Finite State Machine (FSM), which encodes knowledge of operational scenarios to guide and correct the AI's predictions on streaming data. In experiments on a real-world dataset of over 7,000 images from 12 site videos, encompassing more than 300 excavator workloads, our method demonstrates superior performance and greater robustness compared to the original solution based on manual heuristic rules. We will release the code at https://github.com/thulab/video-streamling-inference-pipeline.
Chinese Translation
人工智能在工业中的广泛应用常常受到其在训练数据缺失场景下的有限鲁棒性的阻碍,导致预测偏差和脆弱性。为了解决这一问题,我们提出了一种新颖的流式推理管道,通过明确地结合先验知识来增强数据驱动模型。本文展示了一个工业人工智能应用的研究,该应用能够自动从监控视频中计数挖掘机的工作负载。我们的方法将目标检测模型与有限状态机(Finite State Machine, FSM)相结合,FSM编码了操作场景的知识,以指导和修正人工智能在流数据上的预测。在对来自12个现场视频的超过7000张图像的真实世界数据集进行实验时,我们的方法显示出比基于手动启发式规则的原始解决方案更优越的性能和更强的鲁棒性。我们将发布代码,网址为 https://github.com/thulab/video-streamling-inference-pipeline。
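The FSM idea in the abstract above can be sketched with a minimal state machine. This is a hedged illustration, not the paper's implementation: the states and events are hypothetical labels for per-frame detector outputs, and the FSM encodes the prior that a workload follows a dig-swing-dump cycle, rejecting transitions that violate it.

```python
class WorkloadFSM:
    """Counts excavator workloads from a stream of (possibly noisy) per-frame
    action labels; transitions outside the known operational cycle are rejected."""

    # prior knowledge of the operational scenario: valid (state, event) -> next state
    TRANSITIONS = {
        ("idle", "dig"): "digging",
        ("digging", "swing"): "swinging",
        ("swinging", "dump"): "dumping",
        ("dumping", "idle"): "idle",
    }

    def __init__(self):
        self.state = "idle"
        self.count = 0

    def step(self, event):
        nxt = self.TRANSITIONS.get((self.state, event))
        if nxt is None:
            # implausible detector prediction: keep the current state
            return self.state
        if (self.state, event) == ("dumping", "idle"):
            self.count += 1  # one full dig-swing-dump cycle completed
        self.state = nxt
        return self.state
```

Feeding a stream that contains spurious detections (e.g. a "dump" while idle) shows how the FSM corrects the raw predictions: invalid events are ignored and only completed cycles are counted.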
cs.CV / 227 / 2603.01535
Benchmarking Semantic Segmentation Models via Appearance and Geometry Attribute Editing
通过外观和几何属性编辑对语义分割模型进行基准测试
Abstract
Semantic segmentation plays a pivotal role in various applications such as autonomous driving and medical image analysis. When deploying segmentation models in practice, it is critical to test their behaviors in varied and complex scenes in advance. In this paper, we construct an automatic data generation pipeline Gen4Seg to stress-test semantic segmentation models by generating various challenging samples with different attribute changes. Beyond previous evaluation paradigms focusing solely on global weather and style transfer, we investigate variations in both appearance and geometry attributes at the object and image level. These include object color, material, size, position, as well as image-level variations such as weather and style. To achieve this, we propose to edit visual attributes of existing real images with precise control of structural information, empowered by diffusion models. In this way, the existing segmentation labels can be reused for the edited images, which greatly reduces the labor costs. Using our pipeline, we construct two new benchmarks, Pascal-EA and COCO-EA. We benchmark a wide variety of semantic segmentation models, spanning from closed-set models to open-vocabulary large models. We have several key findings: 1) advanced open-vocabulary models do not exhibit greater robustness compared to closed-set methods under geometric variations; 2) data augmentation techniques, such as CutOut and CutMix, are limited in enhancing robustness against appearance variations; 3) our pipeline can also be employed as a data augmentation tool and improve both in-distribution and out-of-distribution performance. Our work suggests the potential of generative models as effective tools for automatically analyzing segmentation models, and we hope our findings will assist practitioners and researchers in developing more robust and reliable segmentation models.
Chinese Translation
语义分割在自动驾驶和医学图像分析等多个应用中发挥着重要作用。在实际部署分割模型时,提前测试其在多样化和复杂场景中的表现至关重要。本文构建了一个自动数据生成管道Gen4Seg,通过生成具有不同属性变化的各种挑战性样本来对语义分割模型进行压力测试。与以往仅关注全球天气和风格迁移的评估范式不同,我们在对象和图像层面上研究外观和几何属性的变化。这些变化包括对象的颜色、材料、大小、位置,以及图像层面的变化,如天气和风格。为此,我们提出通过扩散模型精确控制结构信息来编辑现有真实图像的视觉属性。通过这种方式,现有的分割标签可以被重新用于编辑后的图像,从而大大降低了人工成本。利用我们的管道,我们构建了两个新的基准,Pascal-EA和COCO-EA。我们对多种语义分割模型进行了基准测试,涵盖了从封闭集模型到开放词汇大模型的广泛范围。我们的研究得出几个关键发现:1)先进的开放词汇模型在几何变化下并未表现出比封闭集方法更大的鲁棒性;2)数据增强技术(如CutOut和CutMix)在增强对外观变化的鲁棒性方面有限;3)我们的管道还可以作为数据增强工具,改善在分布内和分布外的表现。我们的工作表明生成模型作为自动分析分割模型的有效工具的潜力,我们希望我们的发现能帮助从业者和研究人员开发出更鲁棒和可靠的分割模型。
cs.CV / 228 / 2603.01544
RA-Det: Towards Universal Detection of AI-Generated Images via Robustness Asymmetry
RA-Det:通过鲁棒性不对称性实现对AI生成图像的通用检测
Abstract
Recent image generators produce photo-realistic content that undermines the reliability of downstream recognition systems. As visual appearance cues become less pronounced, appearance-driven detectors that rely on forensic cues or high-level representations lose stability. This motivates a shift from appearance to behavior, focusing on how images respond to controlled perturbations rather than how they look. In this work, we identify a simple and universal behavioral signal. Natural images preserve stable semantic representations under small, structured perturbations, whereas generated images exhibit markedly larger feature drift. We refer to this phenomenon as robustness asymmetry and provide a theoretical analysis that establishes a lower bound connecting this asymmetry to memorization tendencies in generative models, explaining its prevalence across architectures. Building on this insight, we introduce Robustness Asymmetry Detection (RA-Det), a behavior-driven detection framework that converts robustness asymmetry into a reliable decision signal. Evaluated across 14 diverse generative models and against more than 10 strong detectors, RA-Det achieves superior performance, improving the average performance by 7.81 percent. The method is data- and model-agnostic, requires no generator fingerprints, and transfers across unseen generators. Together, these results indicate that robustness asymmetry is a stable, general cue for synthetic-image detection and that carefully designed probing can turn this cue into a practical, universal detector. The source code is publicly available on GitHub.
Chinese Translation
近期的图像生成器产生的照片级真实内容削弱了下游识别系统的可靠性。随着视觉外观线索变得不那么明显,依赖法医线索或高层次表征的外观驱动检测器失去稳定性。这促使我们从外观转向行为,关注图像如何对控制扰动作出反应,而不是它们的外观。在本研究中,我们识别出一种简单且通用的行为信号。自然图像在小的、结构化的扰动下保持稳定的语义表征,而生成图像则表现出明显更大的特征漂移。我们将这一现象称为鲁棒性不对称性,并提供了理论分析,建立了一个下界,将这种不对称性与生成模型中的记忆倾向联系起来,解释了其在不同架构中的普遍性。基于这一见解,我们引入了鲁棒性不对称检测(Robustness Asymmetry Detection, RA-Det),这是一个以行为为驱动的检测框架,将鲁棒性不对称性转化为可靠的决策信号。在14种不同的生成模型和超过10种强检测器的评估中,RA-Det表现优越,平均性能提高了7.81%。该方法不依赖于数据和模型,不需要生成器指纹,并能在未见过的生成器上迁移。综合这些结果表明,鲁棒性不对称性是合成图像检测的一个稳定、通用的线索,经过精心设计的探测可以将这一线索转化为实用的通用检测器。源代码已在Github上公开。
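The feature-drift cue from the abstract above reduces to a simple measurement loop. The sketch below is a hedged illustration under stated assumptions: `embed` stands for any pretrained semantic encoder, `perturb` for a small structured perturbation, and the threshold `tau` would have to be calibrated in practice; none of these names come from the paper.

```python
import numpy as np

def feature_drift(embed, image, perturb, n=4, seed=0):
    """Mean cosine distance between an image's features and those of its
    perturbed copies; robustness asymmetry predicts larger drift for
    generated images than for natural ones."""
    rng = np.random.default_rng(seed)
    f0 = embed(image)
    f0 = f0 / (np.linalg.norm(f0) + 1e-9)
    drifts = []
    for _ in range(n):
        f = embed(perturb(image, rng))
        f = f / (np.linalg.norm(f) + 1e-9)
        drifts.append(1.0 - float(f0 @ f))
    return sum(drifts) / n

def classify(drift, tau):
    # large drift under perturbation -> likely generated
    return "generated" if drift > tau else "natural"
```

With an identity "perturbation" the drift is zero by construction, while any perturbation that actually changes the feature direction yields a positive drift.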
cs.CV / 229 / 2603.01545
Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory
无训练的时空解耦推理视频分割与自适应对象记忆
Abstract
Reasoning Video Object Segmentation (ReasonVOS) is a challenging task that requires stable object segmentation across video sequences using implicit and complex textual inputs. Previous methods fine-tune Multimodal Large Language Models (MLLMs) to produce segmentation outputs, which demands substantial resources. Additionally, some existing methods couple the processing of spatial and temporal information, which affects the temporal stability of the model to some extent. To address these issues, we propose Training-Free \textbf{S}patio-temporal \textbf{D}ecoupled Reasoning Video Segmentation with \textbf{A}daptive Object \textbf{M}emory (SDAM). We aim to design a training-free reasoning video segmentation framework that outperforms existing methods requiring fine-tuning, using only pre-trained models. Meanwhile, we propose an Adaptive Object Memory module that selects and memorizes key objects based on motion cues in different video sequences. Finally, we propose Spatio-temporal Decoupling for stable temporal propagation. In the spatial domain, we achieve precise localization and segmentation of target objects, while in the temporal domain, we leverage key object temporal information to drive stable cross-frame propagation. Our method achieves excellent results on five benchmark datasets, including Ref-YouTubeVOS, Ref-DAVIS17, MeViS, ReasonVOS, and ReVOS.
Chinese Translation
推理视频对象分割(ReasonVOS)是一项具有挑战性的任务,要求在视频序列中使用隐式和复杂的文本输入进行稳定的对象分割。以往的方法通过微调多模态大型语言模型(MLLMs)来生成分割输出,这需要大量的资源。此外,一些现有方法在时空信息处理上是耦合的,这在一定程度上影响了模型的时间稳定性。为了解决这些问题,我们提出了无训练的时空解耦推理视频分割与自适应对象记忆(SDAM)。我们的目标是设计一个无训练的推理视频分割框架,超越现有需要微调的方法,仅使用预训练模型。同时,我们提出了一个自适应对象记忆模块,根据不同视频序列中的运动线索选择和记忆关键对象。最后,我们提出了时空解耦以实现稳定的时间传播。在空间域中,我们实现了目标对象的精确定位和分割,而在时间域中,我们利用关键对象的时间信息来驱动稳定的跨帧传播。我们的方法在五个基准数据集上取得了优异的结果,包括Ref-YouTubeVOS、Ref-DAVIS17、MeViS、ReasonVOS和ReVOS。
cs.CV / 230 / 2603.01547
PathMoE: Interpretable Multimodal Interaction Experts for Pediatric Brain Tumor Classification
PathMoE:用于儿童脑肿瘤分类的可解释多模态交互专家
Abstract
Accurate classification of pediatric central nervous system tumors remains challenging due to histological complexity and limited training data. While pathology foundation models have advanced whole-slide image (WSI) analysis, they often fail to leverage the rich, complementary information found in clinical text and tissue microarchitecture. To this end, we propose PathMoE, an interpretable multimodal framework that integrates H\&E slides, pathology reports, and nuclei-level cell graphs via an interaction-aware mixture-of-experts architecture built on state-of-the-art foundation models for each modality. By training specialized experts to capture modality uniqueness, redundancy, and synergy, PathMoE employs an input-dependent gating mechanism that dynamically weights these interactions, providing sample-level interpretability. We evaluate our framework on two dataset-specific classification tasks on an internal pediatric brain tumor dataset (PBT) and external TCGA datasets. PathMoE improves macro-F1 from 0.762 to 0.799 (+0.037) on PBT when integrating WSI, text, and graph modalities; on TCGA, augmenting WSI with graph knowledge improves macro-F1 from 0.668 to 0.709 (+0.041). These results demonstrate significant performance gains over state-of-the-art image-only baselines while revealing the specific modality interactions driving individual predictions. This interpretability is particularly critical for rare tumor subtypes, where transparent model reasoning is essential for clinical trust and diagnostic validation.
Chinese Translation
由于组织学的复杂性和有限的训练数据,儿童中枢神经系统肿瘤的准确分类仍然具有挑战性。尽管病理基础模型在全切片图像(WSI)分析方面取得了进展,但它们往往未能利用临床文本和组织微结构中丰富的互补信息。为此,我们提出了PathMoE,一种可解释的多模态框架,通过基于最先进的基础模型构建的交互感知专家混合架构,整合H&E切片、病理报告和细胞核级别的细胞图。通过训练专门的专家来捕捉模态的独特性、冗余性和协同效应,PathMoE采用输入依赖的门控机制,动态加权这些交互,从而提供样本级别的可解释性。我们在内部儿童脑肿瘤数据集(PBT)和外部TCGA数据集上对两个特定数据集的分类任务评估了我们的框架。当整合WSI、文本和图形模态时,PathMoE在PBT上的宏观F1从0.762提高到0.799(+0.037);在TCGA上,利用图形知识增强WSI使宏观F1从0.668提高到0.709(+0.041)。这些结果显示出相较于最先进的仅图像基线显著的性能提升,同时揭示了驱动个体预测的特定模态交互。这种可解释性对于罕见肿瘤亚型尤为重要,因为透明的模型推理对于临床信任和诊断验证至关重要。
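The input-dependent gating in the PathMoE abstract above can be sketched numerically. This is a minimal, hedged illustration of mixture-of-experts gating in general, not the paper's architecture: `gate_w` stands in for a learned gating layer, and the per-sample weights are what would be inspected for sample-level interpretability.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def moe_forward(x, experts, gate_w):
    """Input-dependent gating: each sample gets its own mixture weights over
    the experts, and the weights themselves expose which interaction
    (uniqueness / redundancy / synergy) dominated the prediction."""
    logits = gate_w @ x                      # one logit per expert, conditioned on x
    weights = softmax(logits)                # convex mixture over experts
    outputs = np.stack([e(x) for e in experts])
    return weights @ outputs, weights
```

Returning the weights alongside the fused output is the design choice that makes the gating interpretable: a dominant weight names the expert driving that particular prediction.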
cs.CV / 231 / 2603.01549
Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation
Pri4R:通过特权4D表示学习视觉-语言-动作模型的世界动态
Abstract
Humans learn not only how their bodies move, but also how the surrounding world responds to their actions. In contrast, while recent Vision-Language-Action (VLA) models exhibit impressive semantic understanding, they often fail to capture the spatiotemporal dynamics governing physical interaction. In this paper, we introduce Pri4R, a simple yet effective approach that endows VLA models with an implicit understanding of world dynamics by leveraging privileged 4D information during training. Specifically, Pri4R augments VLAs with a lightweight point track head that predicts 3D point tracks. By injecting VLA features into this head to jointly predict future 3D trajectories, the model learns to incorporate evolving scene geometry within its shared representation space, enabling more physically aware context for precise control. Due to its architectural simplicity, Pri4R is compatible with dominant VLA design patterns with minimal changes. During inference, we run the model using the original VLA architecture unchanged; Pri4R adds no extra inputs, outputs, or computational overhead. Across simulation and real-world evaluations, Pri4R significantly improves performance on challenging manipulation tasks, including a +10% gain on LIBERO-Long and a +40% gain on RoboCasa. We further show that 3D point track prediction is an effective supervision target for learning action-world dynamics, and validate our design choices through extensive ablations.
Chinese Translation
人类不仅学习自身的运动方式,还学习周围世界如何对其行为做出反应。相比之下,尽管近期的视觉-语言-动作(VLA)模型展现了令人印象深刻的语义理解,但它们往往无法捕捉支配物理交互的时空动态。本文介绍了Pri4R,这是一种简单而有效的方法,通过在训练过程中利用特权4D信息,赋予VLA模型对世界动态的隐含理解。具体而言,Pri4R通过一个轻量级的点轨迹头增强VLA,预测3D点轨迹。通过将VLA特征注入该头部以共同预测未来的3D轨迹,模型学习在其共享表示空间中融入不断变化的场景几何,从而为精确控制提供更具物理意识的上下文。由于其架构的简单性,Pri4R与主流VLA设计模式兼容,几乎不需要改动。在推理过程中,我们使用未改变的原始VLA架构运行模型;Pri4R不增加额外的输入、输出或计算开销。在模拟和现实世界评估中,Pri4R在具有挑战性的操作任务上显著提高了性能,包括在LIBERO-Long上提高了10%的表现,在RoboCasa上提高了40%。我们进一步展示了3D点轨迹预测是学习动作-世界动态的有效监督目标,并通过广泛的消融实验验证了我们的设计选择。
cs.CV / 232 / 2603.01552
Align-cDAE: Alzheimer's Disease Progression Modeling with Attention-Aligned Conditional Diffusion Auto-Encoder
Align-cDAE:基于注意力对齐的条件扩散自编码器的阿尔茨海默病进展建模
Abstract
Generative AI framework-based modeling and prediction of longitudinal human brain images offer an efficient mechanism to track neurodegenerative progression, essential for the assessment of diseases like Alzheimer's. Among the existing generative approaches, recent diffusion-based models have emerged as an effective alternative to generate disease progression images. Incorporating multi-modal and non-imaging attributes as conditional information into diffusion frameworks has been shown to improve controllability during such generations. However, existing methods do not explicitly ensure that information from non-imaging conditioning modalities is meaningfully aligned with image features to introduce desirable changes in the generated images, such as modulation of progression-specific regions. Further, more precise control over the generation process can be achieved by introducing progression-relevant structure into the internal representations of the model, a capability lacking in existing approaches. To address these limitations, we propose a diffusion autoencoder-based framework for disease progression modeling that explicitly enforces alignment between different modalities. The alignment is enforced by introducing an explicit objective function that enables the model to focus on the regions exhibiting progression-related changes. Further, we devise a mechanism to better structure the latent representational space of the diffusion auto-encoding framework. Specifically, we assign separate latent subspaces for integrating progression-related conditions and retaining subject-specific identity information, allowing better-controlled image generation. Experimental results demonstrate that enforcing alignment and better structuring of the latent representational space of the diffusion auto-encoding framework leads to more anatomically precise modeling of Alzheimer's disease progression.
Chinese Translation
基于生成性人工智能框架的人脑纵向影像建模与预测提供了一种有效的机制,以跟踪神经退行性进展,这对于评估阿尔茨海默病等疾病至关重要。在现有的生成方法中,最近的基于扩散的模型已成为生成疾病进展影像的有效替代方案。将多模态和非影像属性作为条件信息纳入扩散框架已被证明可以改善生成过程中的可控性。然而,现有方法并未明确确保来自非影像条件模态的信息与影像特征之间的有意义对齐,以引入生成影像中的期望变化,例如对特定进展区域的调制。此外,通过在模型的内部表示中引入与进展相关的结构,可以实现对生成过程的更精确控制,而现有方法在这方面存在不足。为了解决这些限制,我们提出了一种基于扩散自编码器的疾病进展建模框架,明确强制不同模态之间的对齐。通过引入一个明确的目标函数来强制对齐,使模型能够关注表现出与进展相关变化的区域。此外,我们设计了一种机制,以更好地构建扩散自编码框架的潜在表示空间。具体而言,我们为整合与进展相关的条件和保留个体特征信息分配了独立的潜在子空间,从而实现更可控的影像生成。这些结果表明,强制对齐和更好地构建扩散自编码框架的潜在表示空间能够实现对阿尔茨海默病进展的更解剖学精确建模。
cs.CV / 233 / 2603.01558
TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding
TopoMaskV3:具有密集偏移和高度预测的3D掩膜头用于道路拓扑理解
Abstract
Mask-based paradigms for road topology understanding, such as TopoMaskV2, offer a complementary alternative to query-based methods by generating centerlines via a dense rasterized intermediate representation. However, prior work was limited to 2D predictions and suffered from severe discretization artifacts, necessitating fusion with parametric heads. We introduce TopoMaskV3, which advances this pipeline into a robust, standalone 3D predictor via two novel dense prediction heads: a dense offset field for sub-grid discretization correction within the existing BEV resolution, and a dense height map for direct 3D estimation. Beyond the architecture, we are the first to address geographic data leakage in road topology evaluation by introducing (1) geographically distinct splits to prevent memorization and ensure fair generalization, and (2) a long-range (+/-100 m) benchmark. TopoMaskV3 achieves state-of-the-art 28.5 OLS on this geographically disjoint benchmark, surpassing all prior methods. Our analysis shows that the mask representation is more robust to geographic overfitting than Bezier, while LiDAR fusion is most beneficial at long range and exhibits larger relative gains on the overlapping original split, suggesting overlap-induced memorization effects.
Chinese Translation
基于掩膜的道路拓扑理解范式,如TopoMaskV2,通过生成中心线的密集栅格化中间表示,为查询基础方法提供了一种互补的替代方案。然而,之前的工作仅限于二维预测,并且受到严重的离散化伪影的影响,因此需要与参数头进行融合。我们引入了TopoMaskV3,将这一流程推进为一个强大的独立3D预测器,采用两个新颖的密集预测头:一个用于现有鸟瞰视图(BEV)分辨率内的子网格离散化修正的密集偏移场,以及一个用于直接3D估计的密集高度图。除了架构之外,我们首次通过引入(1)地理上不同的划分以防止记忆化并确保公平泛化,以及(2)一个长距离(+/-100米)基准,解决了道路拓扑评估中的地理数据泄漏问题。TopoMaskV3在这一地理上不重叠的基准上达到了28.5的OLS成绩,超越了所有先前的方法。我们的分析表明,掩膜表示在地理过拟合方面比Bezier更具鲁棒性,而LiDAR融合在长距离时最为有利,并且在重叠的原始划分上表现出更大的相对增益,暗示了重叠引起的记忆化效应。
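The dense offset and height heads in the abstract above amount to a simple coordinate reconstruction. The sketch below is a hedged illustration, not the paper's code: it shows how a discrete BEV cell index plus a predicted sub-grid offset and height recover a continuous 3D point, with all parameter names and the grid convention being assumptions.

```python
import numpy as np

def cells_to_points3d(cell_ij, offsets, heights, bev_origin, cell_size):
    """Recover continuous 3D centerline points from discrete BEV mask cells.

    cell_ij:  (N, 2) integer grid indices from the mask head
    offsets:  (N, 2) predicted sub-grid offsets in cells (correcting
              the discretization artifact within the BEV resolution)
    heights:  (N,)   predicted height per cell (direct 3D estimation)
    """
    # cell center = index + 0.5; the offset shifts it off the grid lattice
    xy = (np.asarray(cell_ij, float) + 0.5 + np.asarray(offsets, float)) * cell_size
    xy = xy + np.asarray(bev_origin, float)
    z = np.asarray(heights, float)[:, None]
    return np.concatenate([xy, z], axis=1)
```

With a zero offset the point sits at the cell center; a nonzero offset moves it sub-grid, which is exactly the discretization correction the dense offset field provides.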
cs.CV / 234 / 2603.01576
Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications
Cryo-Bench:冰冻圈应用基础模型的基准测试
Abstract
Geo-Foundation Models (GFMs) have been evaluated across diverse Earth observation tasks spanning multiple domains and have demonstrated strong potential for producing reliable maps even with sparse labels. However, benchmarking GFMs for Cryosphere applications has remained limited, primarily due to the lack of suitable evaluation datasets. To address this gap, we introduce \textbf{Cryo-Bench}, a benchmark compiled to evaluate GFM performance across key Cryospheric components. Cryo-Bench includes debris-covered glaciers, glacial lakes, sea ice, and calving fronts, spanning multiple sensors and broad geographic regions. We evaluate 14 GFMs alongside UNet and ViT baselines to assess their advantages, limitations, and optimal usage strategies. With a frozen encoder, UNet achieves the highest average mIoU of \textbf{66.38}, followed by TerraMind at \textbf{64.02}, across the five evaluation datasets included in Cryo-Bench. In the few-shot setting (10\% input data), GFMs such as DOFA and TerraMind outperform UNet, achieving mIoU scores of \textbf{59.53}, \textbf{56.62}, and \textbf{56.60}, respectively, compared to U-Net's 56.60. When fully finetuning GFMs, we observe inconsistent performance across datasets and models. However, tuning the learning rate along with finetuning substantially improves GFM performance. For example, evaluation on two representative datasets (GLID and CaFFe) shows an average relative improvement of \textbf{12.77\%}. Despite having minimal Cryosphere representation in their pretraining data, GFMs exhibit notable domain adaptation capabilities and produce meaningful results across tasks. Based on our findings, we recommend encoder fine-tuning with hyperparameter optimization to achieve the best possible performance, while using frozen encoders when users need quick results without extensive experimentation. (\href{https://github.com/Sk-2103/Cryo-Bench}{GitHub})
Chinese Translation
地理基础模型(Geo-Foundation Models, GFMs)已在多个领域的地球观测任务中进行了评估,并展示出在稀疏标签情况下生成可靠地图的强大潜力。然而,针对冰冻圈应用的GFMs基准测试仍然有限,主要是由于缺乏合适的评估数据集。为了解决这一问题,我们引入了Cryo-Bench,这是一个旨在评估GFM在关键冰冻圈组件性能的基准。Cryo-Bench包括覆盖碎石的冰川、冰川湖、海冰和冰崩前沿,涵盖了多个传感器和广泛的地理区域。我们评估了14个GFMs以及UNet和ViT基线,以评估它们的优缺点和最佳使用策略。在冻结编码器的情况下,UNet在Cryo-Bench包含的五个评估数据集中达到了最高的平均mIoU值66.38,其次是TerraMind,得分为64.02。在少样本设置(10%输入数据)中,GFMs如DOFA和TerraMind的表现优于UNet,分别达到了mIoU分数59.53、56.62和56.60,而U-Net的得分为56.60。当对GFMs进行完全微调时,我们观察到在数据集和模型之间的性能不一致。然而,调整学习率并进行微调显著提高了GFM的性能。例如,在两个代表性数据集(GLID和CaFFe)上的评估显示平均相对提升为12.77%。尽管在其预训练数据中对冰冻圈的表示极少,GFMs仍展现出显著的领域适应能力,并在各项任务中产生有意义的结果。基于我们的发现,我们建议通过超参数优化进行编码器微调,以实现最佳性能,同时在用户需要快速结果而不进行广泛实验时使用冻结编码器。(https://github.com/Sk-2103/Cryo-Bench)
cs.CV / 235 / 2603.01579
SkeleGuide: Explicit Skeleton Reasoning for Context-Aware Human-in-Place Image Synthesis
SkeleGuide:基于显式骨骼推理的上下文感知人像合成
Abstract
Generating realistic and structurally plausible human images into existing scenes remains a significant challenge for current generative models, which often produce artifacts like distorted limbs and unnatural poses. We attribute this systemic failure to an inability to perform explicit reasoning over human skeletal structure. To address this, we introduce SkeleGuide, a novel framework built upon explicit skeletal reasoning. Through joint training of its reasoning and rendering stages, SkeleGuide learns to produce an internal pose that acts as a strong structural prior, guiding the synthesis towards high structural integrity. For fine-grained user control, we introduce PoseInverter, a module that decodes this internal latent pose into an explicit and editable format. Extensive experiments demonstrate that SkeleGuide significantly outperforms both specialized and general-purpose models in generating high-fidelity, contextually-aware human images. Our work provides compelling evidence that explicitly modeling skeletal structure is a fundamental step towards robust and plausible human image synthesis.
Chinese Translation
将逼真且结构合理的人物图像生成到现有场景中仍然是当前生成模型面临的一项重大挑战,这些模型常常产生扭曲的肢体和不自然的姿势等伪影。我们将这种系统性失败归因于无法对人体骨骼结构进行显式推理。为了解决这个问题,我们提出了SkeleGuide,这是一种基于显式骨骼推理的新颖框架。通过对其推理和渲染阶段的联合训练,SkeleGuide学习生成一个内部姿势,作为强有力的结构先验,指导合成过程朝向高结构完整性。为了实现细粒度的用户控制,我们引入了PoseInverter,一个将内部潜在姿势解码为显式且可编辑格式的模块。大量实验表明,SkeleGuide在生成高保真、上下文感知的人物图像方面显著优于专用模型和通用模型。我们的工作提供了有力的证据,表明显式建模骨骼结构是实现稳健且合理的人物图像合成的基础步骤。
cs.CV / 236 / 2603.01586
InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning
InterCoG:基于交错的基础推理链实现空间精确的图像编辑
Abstract
Emerging unified editing models have demonstrated strong capabilities in general object editing tasks. However, it remains a significant challenge to perform fine-grained editing in complex multi-entity scenes, particularly those where targets are not visually salient and require spatial reasoning. To this end, we propose InterCoG, a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text that includes spatial relation details to explicitly deduce the location and identity of the edited target. It then conducts visual grounding via highlighting the editing targets with generated bounding boxes and masks in pixel space, and finally rewrites the editing description to specify the intended outcomes. To further facilitate this paradigm, we propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment to enforce spatial localization accuracy and reasoning interpretability, respectively. We also construct GroundEdit-45K, a dataset comprising 45K grounding-oriented editing samples with detailed reasoning annotations, and GroundEdit-Bench for grounding-aware editing evaluation. Extensive experiments substantiate the superiority of our approach in highly precise edits under spatially intricate and multi-entity scenes.
Chinese Translation
新兴的统一编辑模型在一般物体编辑任务中展现了强大的能力。然而,在复杂的多实体场景中进行细粒度编辑仍然是一个重大挑战,特别是在目标不明显且需要空间推理的情况下。为此,我们提出了InterCoG,一种新颖的文本-视觉交错基础推理框架,旨在实现复杂现实场景中的细粒度图像编辑。InterCoG的关键见解是首先在包含空间关系细节的文本中进行物体位置推理,以明确推导出编辑目标的位置和身份。然后,通过在像素空间中用生成的边界框和掩码突出编辑目标,进行视觉基础推理,最后重写编辑描述以指定预期结果。为了进一步促进这一范式,我们提出了两个辅助训练模块:多模态基础重建监督和多模态基础推理对齐,分别用于强化空间定位精度和推理可解释性。我们还构建了GroundEdit-45K,一个包含45K个以基础为导向的编辑样本及详细推理注释的数据集,以及GroundEdit-Bench用于基础感知编辑评估。大量实验验证了我们的方法在空间复杂和多实体场景下进行高精度编辑的优越性。
cs.CV / 237 / 2603.01593
PPEDCRF: Privacy-Preserving Enhanced Dynamic CRF for Location-Privacy Protection for Sequence Videos with Minimal Detection Degradation
PPEDCRF:一种隐私保护的增强动态条件随机场,用于序列视频的位置信息隐私保护,且最小化检测性能下降
Abstract
Dashcam videos collected by autonomous or assisted-driving systems are increasingly shared for safety auditing and model improvement. Even when explicit GPS metadata are removed, an attacker can still infer the recording location by matching background visual cues (e.g., buildings and road layouts) against large-scale street-view imagery. This paper studies location-privacy leakage under a background-based retrieval attacker, and proposes PPEDCRF, a privacy-preserving enhanced dynamic conditional random field framework that injects calibrated perturbations only into inferred location-sensitive background regions while preserving foreground detection utility. PPEDCRF consists of three components: (i) a dynamic CRF that enforces temporal consistency to discover and track location-sensitive regions across frames, (ii) a normalized control penalty (NCP) that allocates perturbation strength according to a hierarchical sensitivity model, and (iii) a utility-preserving noise injection module that minimizes interference to object detection and segmentation. Experiments on public driving datasets demonstrate that PPEDCRF significantly reduces location-retrieval attack success (e.g., Top-k retrieval accuracy) while maintaining competitive detection performance (e.g., mAP and segmentation metrics) compared with common baselines such as global noise, white-noise masking, and feature-based anonymization. The source code is available at https://github.com/mabo1215/PPEDCRF.git
Chinese Translation
由自主或辅助驾驶系统收集的行车记录仪视频越来越多地被用于安全审计和模型改进。即使明确的GPS元数据被移除,攻击者仍然可以通过将背景视觉线索(例如建筑物和道路布局)与大规模街景图像进行匹配,从而推断出录制位置。本文研究了在基于背景的检索攻击者下的位置信息隐私泄露,并提出了PPEDCRF,一种隐私保护的增强动态条件随机场框架,该框架仅在推断出的对位置敏感的背景区域中注入经过校准的扰动,同时保持前景检测的效用。PPEDCRF由三个部分组成:(i)一个动态条件随机场(dynamic CRF),它强制执行时间一致性,以发现和跟踪跨帧的对位置敏感的区域;(ii)一个归一化控制惩罚(NCP),根据分层敏感性模型分配扰动强度;(iii)一个保持效用的噪声注入模块,最小化对物体检测和分割的干扰。在公共驾驶数据集上的实验表明,PPEDCRF显著降低了位置信息检索攻击的成功率(例如,Top-k检索准确率),同时在与常见基线(如全局噪声、白噪声掩蔽和基于特征的匿名化)相比时,保持了竞争性的检测性能(例如,mAP和分割指标)。源代码可在 https://github.com/mabo1215/PPEDCRF.git 获取。
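The region-targeted perturbation idea in the abstract above can be sketched in a few lines. This is a hedged illustration, not the paper's NCP or CRF machinery: `sensitivity` stands in for the per-pixel weights a (hypothetical) region model would produce, and the noise model is plain Gaussian.

```python
import numpy as np

def inject_background_noise(frame, sensitivity, base_sigma=8.0, seed=0):
    """Perturb only location-sensitive background pixels; pixels with
    sensitivity 0 (foreground) are left untouched, preserving detection utility.

    frame:       (H, W, 3) float image in [0, 255]
    sensitivity: (H, W) per-pixel weights in [0, 1]; noise strength scales
                 with sensitivity, mimicking a hierarchical allocation
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, base_sigma, frame.shape) * sensitivity[..., None]
    return np.clip(frame + noise, 0.0, 255.0)
```

Because the perturbation is gated by the sensitivity map, a retrieval attacker matching background cues sees corrupted features while a detector operating on the untouched foreground does not.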
cs.CV / 238 / 2603.01594
Preference Score Distillation: Leveraging 2D Rewards to Align Text-to-3D Generation with Human Preference
偏好评分蒸馏:利用二维奖励将文本到三维生成与人类偏好对齐
Abstract
Human preference alignment presents a critical yet underexplored challenge for diffusion models in text-to-3D generation. Existing solutions typically require task-specific fine-tuning, posing significant hurdles in data-scarce 3D domains. To address this, we propose Preference Score Distillation (PSD), an optimization-based framework that leverages pretrained 2D reward models for human-aligned text-to-3D synthesis without 3D training data. Our key insight stems from the incompatibility of pixel-level gradients: due to the absence of noisy samples during reward model training, direct application of 2D reward gradients disturbs the denoising process. Noticing that a similar issue occurs with naive classifier guidance in conditional diffusion models, we fundamentally rethink preference alignment as a classifier-free guidance (CFG)-style mechanism through our implicit reward model. Furthermore, recognizing that frozen pretrained diffusion models constrain performance, we introduce an adaptive strategy to co-optimize preference scores and negative text embeddings. By incorporating CFG during optimization, online refinement of negative text embeddings dynamically enhances alignment. To our knowledge, we are the first to bridge human preference alignment with CFG theory under the score distillation framework. Experiments demonstrate the superiority of PSD in aesthetic metrics, seamless integration with diverse pipelines, and strong extensibility.
Chinese Translation
人类偏好对齐是文本到三维生成中扩散模型面临的一个关键但尚未充分探索的挑战。现有解决方案通常需要特定任务的微调,这在数据稀缺的三维领域中造成了显著障碍。为了解决这个问题,我们提出了偏好评分蒸馏(Preference Score Distillation, PSD),这是一个基于优化的框架,利用预训练的二维奖励模型进行人类对齐的文本到三维合成,而无需三维训练数据。我们的关键见解源于像素级梯度的不兼容性:由于在奖励模型训练期间缺乏噪声样本,直接应用二维奖励梯度会干扰去噪过程。我们注意到,在条件扩散模型中,简单的分类器引导也存在类似问题,因此我们从根本上将偏好对齐重新思考为一种无分类器引导(classifier-free guidance, CFG)风格的机制,通过我们的隐式奖励模型实现。此外,考虑到冻结的预训练扩散模型限制了性能,我们引入了一种自适应策略,以共同优化偏好评分和负文本嵌入。通过在优化过程中结合CFG,负文本嵌入的在线精炼动态增强了对齐效果。据我们所知,我们是首个在评分蒸馏框架下将人类偏好对齐与CFG理论结合起来的研究。实验结果表明,PSD在美学指标上具有优越性,能够与多种管道无缝集成,并展现出强大的扩展性。
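The CFG mechanism that the abstract above builds on has a compact algebraic form. The sketch below shows standard classifier-free guidance only, not PSD itself; in the paper's framing, the conditional term would carry the implicit-reward preference signal, which is an interpretation, not code from the source.

```python
import numpy as np

def cfg_guided_noise(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one. w = 0 ignores the condition,
    w = 1 reproduces it, and w > 1 amplifies it (PSD folds the preference
    signal into a CFG-style term of this shape)."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

Co-optimizing the negative (unconditional) text embedding, as PSD proposes, corresponds to making `eps_uncond` itself a moving target during optimization rather than a frozen prediction.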
cs.CV / 239 / 2603.01601
Dehallu3D: Hallucination-Mitigated 3D Generation from Single Image via Cyclic View Consistency Refinement
Dehallu3D:通过循环视图一致性细化减轻单图像的幻觉影响的3D生成
Abstract
Large 3D reconstruction models have revolutionized the 3D content generation field, enabling broad applications in virtual reality and gaming. Just like other large models, large 3D reconstruction models suffer from hallucinations as well, introducing structural outliers (e.g., odd holes or protrusions) that deviate from the input data. However, unlike other large models, hallucinations in large 3D reconstruction models remain severely underexplored, leading to malformed 3D-printed objects or insufficient immersion in virtual scenes. Such hallucinations largely originate from the fact that existing methods reconstruct 3D content from sparsely generated multi-view images, which suffer from large viewpoint gaps and discontinuities. To mitigate hallucinations by eliminating the outliers, we propose Dehallu3D for 3D mesh generation. Our key idea is to design a balanced multi-view continuity constraint to enforce smooth transitions across dense intermediate viewpoints, while avoiding over-smoothing that could erase sharp geometric features. Therefore, Dehallu3D employs a plug-and-play optimization module with two key constraints: (i) adjacent consistency to ensure geometric continuity across views, and (ii) adaptive smoothness to retain fine details. We further propose the Outlier Risk Measure (ORM) metric to quantify geometric fidelity in 3D generation from the perspective of outliers. Extensive experiments show that Dehallu3D achieves high-fidelity 3D generation by effectively preserving structural details while removing hallucinated outliers.
Chinese Translation
大型3D重建模型已经彻底改变了3D内容生成领域,使其在虚拟现实和游戏中得到了广泛应用。与其他大型模型一样,大型3D重建模型也面临幻觉问题,产生与输入数据偏离的结构异常(例如,奇怪的孔洞或突起)。然而,与其他大型模型不同的是,大型3D重建模型中的幻觉问题仍然未得到充分研究,导致3D打印物体畸形或虚拟场景沉浸感不足。这类幻觉主要源于现有方法从稀疏生成的多视图图像中重建3D内容,而这些图像存在较大的视角间隙和不连续性。为了通过消除异常值来减轻幻觉影响,我们提出了Dehallu3D用于3D网格生成。我们的关键思想是设计一个平衡的多视图连续性约束,以确保在密集中间视点之间的平滑过渡,同时避免过度平滑而抹去锐利的几何特征。因此,Dehallu3D采用了一个即插即用的优化模块,包含两个关键约束:(i)相邻一致性以确保视图之间的几何连续性,以及(ii)自适应平滑性以保留细节。我们进一步提出了异常值风险度量(Outlier Risk Measure, ORM)指标,从异常值的角度量化3D生成中的几何保真度。大量实验表明,Dehallu3D通过有效保留结构细节并去除幻觉异常值,实现了高保真的3D生成。
cs.CV / 240 / 2603.01602
YCDa: YCbCr Decoupled Attention for Real-time Realistic Camouflaged Object Detection
YCDa:用于实时逼真伪装物体检测的YCbCr解耦注意力
Abstract
Human vision exhibits remarkable adaptability in perceiving objects under camouflage. When color cues become unreliable, the visual system instinctively shifts its reliance from chrominance (color) to luminance (brightness and texture), enabling more robust perception in visually confusing environments. Drawing inspiration from this biological mechanism, we propose YCDa, an efficient early-stage feature processing strategy that embeds this "chrominance-luminance decoupling and dynamic attention" principle into modern real-time detectors. Specifically, YCDa separates color and luminance information in the input stage and dynamically allocates attention across channels to amplify discriminative cues while suppressing misleading color noise. The strategy is plug-and-play and can be integrated into existing detectors by simply replacing the first downsampling layer. Extensive experiments on multiple baselines demonstrate that YCDa consistently improves performance with negligible overhead as shown in Fig. Notably, YCDa-YOLO12s achieves a 112% improvement in mAP over the baseline on COD10K-D and sets new state-of-the-art results for real-time camouflaged object detection across COD-D datasets.
Chinese Translation
人类视觉在感知伪装物体方面表现出显著的适应性。当颜色线索变得不可靠时,视觉系统本能地将依赖从色度(颜色)转向亮度(亮度和纹理),从而在视觉混乱的环境中实现更强的感知能力。受到这一生物机制的启发,我们提出了YCDa,一种高效的早期特征处理策略,将这一“色度-亮度解耦与动态注意力”原理嵌入现代实时检测器中。具体而言,YCDa在输入阶段分离颜色和亮度信息,并在通道之间动态分配注意力,以增强区分性线索,同时抑制误导性的颜色噪声。该策略即插即用,可以通过简单替换第一个下采样层集成到现有检测器中。对多个基线的广泛实验表明,YCDa在几乎没有额外开销的情况下持续提高性能。如图所示,值得注意的是,YCDa-YOLO12s在COD10K-D上相较于基线实现了112%的mAP提升,并在COD-D数据集上设定了实时伪装物体检测的新最先进结果。
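The chrominance-luminance decoupling in the YCDa abstract above starts from a standard color-space conversion. The sketch below is a hedged illustration: the BT.601 full-range RGB-to-YCbCr transform is standard, while the gating function is a hypothetical stand-in for the learned dynamic channel attention, not the paper's layer.

```python
import numpy as np

def rgb_to_ycbcr(img):
    """BT.601 full-range RGB -> YCbCr; img is (H, W, 3) float in [0, 255].
    Y carries luminance (brightness/texture); Cb and Cr carry chrominance."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)

def decoupled_attention(ycbcr, gate):
    """Reweight the decoupled channels; `gate` is a length-3 vector standing
    in for the input-dependent attention that amplifies luminance cues and
    suppresses misleading color noise under camouflage."""
    return ycbcr * np.asarray(gate, float)[None, None, :]
```

A gray pixel maps to Cb = Cr = 128, making explicit that its discriminative content lives entirely in the luminance channel, which the gate can then emphasize.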
cs.CV / 241 / 2603.01603
Sparse View Distractor-Free Gaussian Splatting
稀疏视图无干扰高斯点云渲染
Abstract
3D Gaussian Splatting (3DGS) enables efficient training and fast novel view synthesis in static environments. To address challenges posed by transient objects, distractor-free 3DGS methods have emerged and shown promising results when dense image captures are available. However, their performance degrades significantly under sparse input conditions. This limitation primarily stems from the reliance on the color residual heuristics to guide the training, which becomes unreliable with limited observations. In this work, we propose a framework to enhance distractor-free 3DGS under sparse-view conditions by incorporating rich prior information. Specifically, we first adopt the geometry foundation model VGGT to estimate camera parameters and generate a dense set of initial 3D points. Then, we harness the attention maps from VGGT for efficient and accurate semantic entity matching. Additionally, we utilize Vision-Language Models (VLMs) to further identify and preserve the large static regions in the scene. We also demonstrate how these priors can be seamlessly integrated into existing distractor-free 3DGS methods. Extensive experiments confirm the effectiveness and robustness of our approach in mitigating transient distractors for sparse-view 3DGS training.
Chinese Translation
3D高斯点云渲染(3DGS)能够在静态环境中实现高效训练和快速的新视图合成。为了解决瞬态物体带来的挑战,无干扰3DGS方法应运而生,并在密集图像捕获可用时显示出良好的效果。然而,在稀疏输入条件下,它们的性能显著下降。这一限制主要源于依赖颜色残差启发式来指导训练,而在观察有限的情况下,这一方法变得不可靠。在本研究中,我们提出了一种框架,通过结合丰富的先验信息来增强稀疏视图条件下的无干扰3DGS。具体而言,我们首先采用几何基础模型VGGT来估计相机参数并生成一组密集的初始3D点。然后,我们利用VGGT的注意力图进行高效且准确的语义实体匹配。此外,我们还利用视觉-语言模型(VLMs)进一步识别和保留场景中的大静态区域。我们还展示了如何将这些先验信息无缝整合到现有的无干扰3DGS方法中。大量实验验证了我们的方法在减轻稀疏视图3DGS训练中的瞬态干扰物体方面的有效性和鲁棒性。
cs.CV / 242 / 2603.01605
What Helps -- and What Hurts: Bidirectional Explanations for Vision Transformers
什么是有帮助的 -- 什么是有害的:视觉变换器的双向解释
Abstract
Vision Transformers (ViTs) achieve strong performance in visual recognition, yet their decision-making remains difficult to interpret. We propose BiCAM, a bidirectional class activation mapping method that captures both supportive (positive) and suppressive (negative) contributions to model predictions. Unlike prior CAM-based approaches that discard negative signals, BiCAM preserves signed attributions to produce more complete and contrastive explanations. BiCAM further introduces a Positive-to-Negative Ratio (PNR) that summarizes attribution balance and enables lightweight detection of adversarial examples without retraining. Across ImageNet, VOC, and COCO, BiCAM improves localization and faithfulness while remaining computationally efficient. It generalizes to multiple ViT variants, including DeiT and Swin. These results suggest the importance of modeling both supportive and suppressive evidence for interpreting transformer-based vision models.
Chinese Translation
视觉变换器(Vision Transformers, ViTs)在视觉识别中表现出色,但其决策过程仍然难以解释。我们提出了BiCAM,一种双向类激活映射方法,能够捕捉对模型预测的支持性(正向)和抑制性(负向)贡献。与以往丢弃负信号的基于类激活映射(CAM)的方法不同,BiCAM保留了带符号的归因,以生成更完整和对比鲜明的解释。BiCAM进一步引入了正负比(Positive-to-Negative Ratio, PNR),该比率总结了归因的平衡,并能够在不重新训练的情况下轻量级检测对抗样本。在ImageNet、VOC和COCO数据集上,BiCAM提高了定位精度和可信度,同时保持了计算效率。它还可以推广到多个ViT变体,包括DeiT和Swin。这些结果表明,在解释基于变换器的视觉模型时,建模支持性和抑制性证据的重要性。
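The Positive-to-Negative Ratio can be computed directly from a signed attribution map. The abstract does not give BiCAM's exact definition, so the normalization below is an assumption:

```python
import numpy as np

def pnr(attribution, eps=1e-8):
    """Positive-to-Negative Ratio of a signed attribution map (sketch).

    attribution: array of signed per-location contributions.
    Returns sum of supportive mass / sum of suppressive mass.
    """
    pos = np.clip(attribution, 0, None).sum()    # supportive evidence
    neg = -np.clip(attribution, None, 0).sum()   # suppressive evidence
    return pos / (neg + eps)
```

A PNR far from its typical range on clean inputs is the kind of imbalance the paper exploits for lightweight adversarial-example detection.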
cs.CV / 243 / 2603.01613
Coarse-to-Fine Monocular Re-Localization in OpenStreetMap via Semantic Alignment
通过语义对齐实现开放街图中的粗到细单目重定位
Abstract
Monocular re-localization plays a crucial role in enabling intelligent agents to achieve human-like perception. However, traditional methods rely on dense maps, which face scalability limitations and privacy risks. OpenStreetMap (OSM), as a lightweight map that protects privacy, offers semantic and geometric information with global scalability. Nonetheless, there are still challenges in using OSM for localization: the inherent cross-modal discrepancies between natural images and OSM, as well as the high computational cost of global map-based localization. In this paper, we propose a hierarchical search framework with semantic alignment for localization in OSM. First, the semantic awareness capability of DINO-ViT is utilised to deconstruct visual elements to establish semantic relationships with OSM. Second, a coarse-to-fine search paradigm is designed to replace global dense matching, enabling efficient progressive refinement. Extensive experiments demonstrate that our method significantly improves both localization accuracy and speed. When trained on a single dataset, the 3° orientation recall of our method even outperforms the 5° recall of state-of-the-art methods.
Chinese Translation
单目重定位在使智能体实现类人感知方面发挥着至关重要的作用。然而,传统方法依赖于密集地图,这面临着可扩展性限制和隐私风险。开放街图(OpenStreetMap, OSM)作为一种保护隐私的轻量级地图,提供了具有全球可扩展性的语义和几何信息。然而,在使用OSM进行定位时仍然存在挑战:自然图像与OSM之间固有的跨模态差异,以及基于全球地图的定位的高计算成本。本文提出了一种具有语义对齐的分层搜索框架用于OSM中的定位。首先,利用DINO-ViT的语义感知能力对视觉元素进行解构,以建立与OSM的语义关系。其次,设计了一种粗到细的搜索范式,以替代全球密集匹配,从而实现高效的渐进式细化。大量实验表明,我们的方法显著提高了定位的准确性和速度。当在单一数据集上训练时,我们的方法在3°方向召回率上甚至超过了最先进方法的5°召回率。
cs.CV / 244 / 2603.01623
Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration
自适应谱特征预测用于扩散采样加速
Abstract
Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at selected diffusion steps by using cached features in previous steps. However, their preliminary design solely relies on local approximation, causing errors to grow rapidly with large skips and leading to degraded sample quality at high speedups. In this work, we propose spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficient for each basis via ridge regression, which is then leveraged to forecast features at multiple future diffusion steps. We theoretically reveal that our approach admits more favorable long-horizon behavior and yields an error bound that does not compound with the step size. Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to 4.79$\times$ speedup on FLUX.1 and 4.67$\times$ speedup on Wan2.1-14B, while maintaining much higher sample quality compared with the baselines.
Chinese Translation
扩散模型已成为高保真图像和视频生成的主要工具,但由于扩散变换器的多次迭代过程,其推理速度受到严重瓶颈。为了减少计算开销,近期的研究采用了特征缓存和重用方案,通过使用先前步骤中的缓存特征来跳过选定扩散步骤的网络评估。然而,它们的初步设计仅依赖于局部近似,导致在大跳跃时误差迅速增长,并在高加速下导致样本质量下降。在本研究中,我们提出了谱扩散特征预测器(Spectrum),这是一种无训练的方法,能够实现全局、长距离的特征重用,并严格控制误差。具体而言,我们将去噪器的潜在特征视为随时间变化的函数,并用切比雪夫多项式对其进行近似。我们通过岭回归为每个基函数拟合系数,然后利用这些系数预测多个未来扩散步骤的特征。我们从理论上揭示了我们的方法具有更有利的长时间行为,并且误差界限不会随着步长的增加而累积。在各种最先进的图像和视频扩散模型上的大量实验一致验证了我们方法的优越性。值得注意的是,我们在FLUX.1上实现了高达4.79倍的加速,在Wan2.1-14B上实现了4.67倍的加速,同时与基线相比保持了更高的样本质量。
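The forecasting recipe described above — fit Chebyshev coefficients to a feature trajectory with ridge regression, then evaluate the polynomial at future diffusion steps — can be sketched for a scalar feature. The degree and regularization strength below are illustrative, not the paper's settings:

```python
import numpy as np

def fit_chebyshev_ridge(t, f, degree=4, lam=1e-3):
    """t: timesteps scaled to [-1, 1]; f: feature values at those steps.
    Returns ridge-regressed coefficients for Chebyshev bases T_0..T_degree."""
    X = np.polynomial.chebyshev.chebvander(t, degree)  # (N, degree+1) design matrix
    # Closed-form ridge solution: (X^T X + lam I)^{-1} X^T f
    A = X.T @ X + lam * np.eye(degree + 1)
    return np.linalg.solve(A, X.T @ f)

def forecast(coeffs, t_future):
    """Evaluate the fitted Chebyshev expansion at future timesteps."""
    return np.polynomial.chebyshev.chebval(t_future, coeffs)
```

Because the fit is global over the whole trajectory rather than a local extrapolation from the last cached step, the approximation error does not compound as the skip size grows, which is the behavior the paper's error bound formalizes.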
cs.CV / 245 / 2603.01637
DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving
DriveCombo:自主驾驶中组合交通规则推理的基准测试
Abstract
Multimodal Large Language Models (MLLMs) are rapidly becoming the intelligence brain of end-to-end autonomous driving systems. A key challenge is to assess whether MLLMs can truly understand and follow complex real-world traffic rules. However, existing benchmarks mainly focus on single-rule scenarios like traffic sign recognition, neglecting the complexity of multi-rule concurrency and conflicts in real driving. Consequently, models perform well on simple tasks but often fail or violate rules in real world complex situations. To bridge this gap, we propose DriveCombo, a text and vision-based benchmark for compositional traffic rule reasoning. Inspired by human drivers' cognitive development, we propose a systematic Five-Level Cognitive Ladder that evaluates reasoning from single-rule understanding to multi-rule integration and conflict resolution, enabling quantitative assessment across cognitive stages. We further propose a Rule2Scene Agent that maps language-based traffic rules to dynamic driving scenes through rule crafting and scene generation, enabling scene-level traffic rule visual reasoning. Evaluations of 14 mainstream MLLMs reveal performance drops as task complexity grows, particularly during rule conflicts. After splitting the dataset and fine-tuning on the training set, we further observe substantial improvements in both traffic rule reasoning and downstream planning capabilities. These results highlight the effectiveness of DriveCombo in advancing compliant and intelligent autonomous driving systems.
Chinese Translation
多模态大型语言模型(MLLMs)正在迅速成为端到端自主驾驶系统的智能核心。一个关键挑战是评估 MLLMs 是否能够真正理解和遵循复杂的现实交通规则。然而,现有的基准测试主要集中在单一规则场景,如交通标志识别,忽视了真实驾驶中多规则并发和冲突的复杂性。因此,模型在简单任务上表现良好,但在现实复杂情况下往往失败或违反规则。为了解决这一问题,我们提出了 DriveCombo,这是一个基于文本和视觉的组合交通规则推理基准。受人类驾驶员认知发展的启发,我们提出了一个系统化的五级认知阶梯,用于评估从单一规则理解到多规则整合和冲突解决的推理能力,从而实现跨认知阶段的定量评估。我们进一步提出了一个 Rule2Scene Agent,该代理通过规则构建和场景生成,将基于语言的交通规则映射到动态驾驶场景,实现场景级交通规则视觉推理。对 14 种主流 MLLMs 的评估揭示了随着任务复杂性增加,性能下降,尤其是在规则冲突期间。在对数据集进行拆分并在训练集上进行微调后,我们进一步观察到交通规则推理和下游规划能力的显著提升。这些结果突显了 DriveCombo 在推动合规和智能自主驾驶系统方面的有效性。
cs.CV / 246 / 2603.01640
MSP-ReID: Hairstyle-Robust Cloth-Changing Person Re-Identification
MSP-ReID:发型鲁棒的换衣人物重识别
Abstract
Cloth-Changing Person Re-Identification (CC-ReID) aims to match the same individual across cameras under varying clothing conditions. Existing approaches often remove apparel and focus on the head region to reduce clothing bias. However, treating the head holistically without distinguishing between face and hair leads to over-reliance on volatile hairstyle cues, causing performance degradation under hairstyle changes. To address this issue, we propose the Mitigating Hairstyle Distraction and Structural Preservation (MSP) framework. Specifically, MSP introduces Hairstyle-Oriented Augmentation (HSOA), which generates intra-identity hairstyle diversity to reduce hairstyle dependence and enhance attention to stable facial and body cues. To prevent the loss of structural information, we design Cloth-Preserved Random Erasing (CPRE), which performs ratio-controlled erasing within clothing regions to suppress texture bias while retaining body shape and context. Furthermore, we employ Region-based Parsing Attention (RPA) to incorporate parsing-guided priors that highlight face and limb regions while suppressing hair features. Extensive experiments on multiple CC-ReID benchmarks demonstrate that MSP achieves state-of-the-art performance, providing a robust and practical solution for long-term person re-identification.
Chinese Translation
换衣人物重识别(CC-ReID)旨在在不同的服装条件下跨摄像头匹配同一个体。现有的方法通常去除服装并关注头部区域,以减少服装偏差。然而,整体处理头部而不区分面部和头发会导致过度依赖易变的发型线索,从而在发型变化时导致性能下降。为了解决这个问题,我们提出了减轻发型干扰与结构保留(MSP)框架。具体而言,MSP引入了发型导向增强(HSOA),该方法生成同一身份的发型多样性,以减少对发型的依赖并增强对稳定面部和身体线索的关注。为了防止结构信息的丢失,我们设计了服装保留随机擦除(CPRE),该方法在服装区域内进行比例控制的擦除,以抑制纹理偏差,同时保留身体形状和上下文。此外,我们采用基于区域的解析注意力(RPA),以结合解析引导的先验,突出面部和肢体区域,同时抑制头发特征。在多个CC-ReID基准上的广泛实验表明,MSP实现了最先进的性能,为长期人物重识别提供了一个鲁棒且实用的解决方案。
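Ratio-controlled erasing restricted to a clothing mask, as CPRE describes, can be sketched in a few lines. The pixel-wise (rather than block-wise) erasing and the fill value are assumptions for illustration:

```python
import numpy as np

def cloth_preserved_random_erase(img, cloth_mask, ratio=0.3, fill=0.0, seed=0):
    """Erase `ratio` of the clothing pixels; leave everything else intact.

    img: (H, W, C) image; cloth_mask: (H, W) bool mask of clothing regions.
    """
    rng = np.random.default_rng(seed)
    out = img.copy()
    ys, xs = np.nonzero(cloth_mask)              # candidate clothing pixels
    k = int(ratio * len(ys))                     # ratio-controlled budget
    idx = rng.choice(len(ys), size=k, replace=False)
    out[ys[idx], xs[idx]] = fill                 # suppress texture cues only
    return out
```

Because the erasure never leaves the mask, body silhouette and scene context survive, which is the "structural preservation" half of the framework's name.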
cs.CV / 247 / 2603.01647
QCAgent: An agentic framework for quality-controllable pathology report generation from whole slide image
QCAgent:一种用于从全切片图像生成质量可控病理报告的代理框架
Abstract
Recent methods for pathology report generation from whole-slide image (WSI) are capable of producing slide-level diagnostic descriptions but fail to ground fine-grained statements in localized visual evidence. Furthermore, they lack control over which diagnostic details to include and how to verify them. Inspired by emerging agentic analysis paradigms and the diagnostic workflow of pathologists, who selectively examine multiple fields of view, we propose QCAgent, an agentic framework for quality-controllable WSI report generation. The core innovations of this framework are as follows: (i) it incorporates a customized critique mechanism guided by a user-defined checklist specifying required diagnostic details and constraints; (ii) it re-identifies informative regions in the WSI based on the critique feedback and text-patch semantic retrieval, a process that iteratively enriches and reconciles the report. Experiments demonstrate that by making report requirements explicitly prompt-defined, constraint-aware, and verifiable through evidence-grounded refinement, QCAgent enables controllable generation of clinically meaningful and high-coverage pathology reports from WSI.
Chinese Translation
近期从全切片图像(WSI)生成病理报告的方法能够产生切片级别的诊断描述,但未能将细粒度的陈述与局部视觉证据相结合。此外,这些方法缺乏对包含哪些诊断细节以及如何验证这些细节的控制。受到新兴代理分析范式以及病理学家诊断工作流程(病理学家会选择性地检查多个视野)的启发,我们提出了QCAgent,一个用于质量可控的WSI报告生成的代理框架。该框架的核心创新如下:(i) 它结合了一个定制的批评机制,该机制由用户定义的检查清单指导,指定所需的诊断细节和约束;(ii) 它基于批评反馈和文本-图块语义检索重新识别WSI中的信息区域,这一过程迭代地丰富和调和报告。实验表明,通过使报告要求以提示明确定义、具备约束感知并可通过基于证据的细化进行验证,QCAgent能够从WSI可控地生成具有临床意义且高覆盖率的病理报告。
cs.CV / 248 / 2603.01650
PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts
PromptStereo:通过结构和运动提示实现零样本立体匹配
Abstract
Modern stereo matching methods have leveraged monocular depth foundation models to achieve superior zero-shot generalization performance. However, most existing methods primarily focus on extracting robust features for cost volume construction or disparity initialization. At the same time, the iterative refinement stage, which is also crucial for zero-shot generalization, remains underexplored. Some methods treat monocular depth priors as guidance for iteration, but conventional GRU-based architectures struggle to exploit them due to the limited representation capacity. In this paper, we propose Prompt Recurrent Unit (PRU), a novel iterative refinement module based on the decoder of monocular depth foundation models. By integrating monocular structure and stereo motion cues as prompts into the decoder, PRU enriches the latent representations of monocular depth foundation models with absolute stereo-scale information while preserving their inherent monocular depth priors. Experiments demonstrate that our PromptStereo achieves state-of-the-art zero-shot generalization performance across multiple datasets, while maintaining comparable or faster inference speed. Our findings highlight prompt-guided iterative refinement as a promising direction for zero-shot stereo matching.
Chinese Translation
现代立体匹配方法利用单目深度基础模型实现了卓越的零样本泛化性能。然而,大多数现有方法主要集中在提取稳健特征以构建代价体积或初始化视差。同时,迭代细化阶段对于零样本泛化同样至关重要,但仍然未得到充分探索。一些方法将单目深度先验视为迭代的指导,但传统的基于GRU的架构由于表示能力有限,难以有效利用这些先验。在本文中,我们提出了提示递归单元(Prompt Recurrent Unit,PRU),这是一种基于单目深度基础模型解码器的新型迭代细化模块。通过将单目结构和立体运动线索作为提示集成到解码器中,PRU丰富了单目深度基础模型的潜在表示,提供了绝对立体尺度信息,同时保留了其固有的单目深度先验。实验表明,我们的PromptStereo在多个数据集上实现了最先进的零样本泛化性能,同时保持了可比或更快的推理速度。我们的研究结果强调了提示引导的迭代细化作为零样本立体匹配的一个有前景的方向。
cs.CV / 249 / 2603.01659
A Diffusion-Driven Fine-Grained Nodule Synthesis Framework for Enhanced Lung Nodule Detection from Chest Radiographs
基于扩散驱动的细粒度结节合成框架用于增强胸部X光片中的肺结节检测
Abstract
Early detection of lung cancer in chest radiographs (CXRs) is crucial for improving patient outcomes, yet nodule detection remains challenging due to their subtle appearance and variability in radiological characteristics like size, texture, and boundary. For robust analysis, this diversity must be well represented in training datasets for deep learning based Computer-Assisted Diagnosis (CAD) systems. However, assembling such datasets is costly and often impractical, motivating the need for realistic synthetic data generation. Existing methods lack fine-grained control over synthetic nodule generation, limiting their utility in addressing data scarcity. This paper proposes a novel diffusion-based framework with low-rank adaptation (LoRA) adapters for characteristic controlled nodule synthesis on CXRs. We begin by addressing size and shape control through nodule mask conditioned training of the base diffusion model. To achieve individual characteristic control, we train separate LoRA modules, each dedicated to a specific radiological feature. However, since nodules rarely exhibit isolated characteristics, effective multi-characteristic control requires a balanced integration of features. We address this by leveraging the dynamic composability of LoRAs and revisiting existing merging strategies. Building on this, we identify two key issues, overlapping attention regions and non-orthogonal parameter spaces. To overcome these limitations, we introduce a novel orthogonality loss term during LoRA composition training. Extensive experiments on both in-house and public datasets demonstrate improved downstream nodule detection. Radiologist evaluations confirm the fine-grained controllability of our generated nodules, and across multiple quantitative metrics, our method surpasses existing nodule generation approaches for CXRs.
Chinese Translation
在胸部X光片(CXRs)中早期检测肺癌对改善患者预后至关重要,但由于结节的微妙外观以及大小、纹理和边界等放射学特征的多样性,结节检测仍然具有挑战性。为了进行稳健分析,这种多样性必须在基于深度学习的计算机辅助诊断(CAD)系统的训练数据集中得到良好表示。然而,组建这样的数据集成本高昂且往往不切实际,这促使了对现实合成数据生成的需求。现有方法在合成结节生成上缺乏细粒度控制,限制了它们在解决数据稀缺问题上的实用性。本文提出了一种新颖的基于扩散的框架,结合低秩适配器(LoRA)用于在CXRs上进行特征控制的结节合成。我们首先通过结节掩膜条件训练基础扩散模型来解决大小和形状控制问题。为了实现个体特征控制,我们训练了多个独立的LoRA模块,每个模块专注于特定的放射学特征。然而,由于结节很少表现出孤立特征,有效的多特征控制需要特征的平衡整合。我们通过利用LoRA的动态可组合性并重新审视现有的合并策略来解决这一问题。在此基础上,我们识别出两个关键问题:重叠的注意区域和非正交参数空间。为了克服这些限制,我们在LoRA组合训练中引入了一种新颖的正交损失项。在内部和公共数据集上的广泛实验表明,结节检测的下游性能得到了改善。放射科医师的评估确认了我们生成的结节的细粒度可控性,并且在多个定量指标上,我们的方法超越了现有的CXRs结节生成方法。
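The abstract does not specify the form of the orthogonality loss used during LoRA composition training. A common way to discourage overlapping, non-orthogonal parameter subspaces, used here purely as a hypothetical stand-in, penalizes the Frobenius norm of the cross-product between two LoRA down-projection factors:

```python
import numpy as np

def orthogonality_loss(A1, A2):
    """Hypothetical orthogonality penalty between two LoRA factors.

    A1, A2: (r, d) low-rank matrices. Returns ||A1 @ A2^T||_F^2,
    which is 0 iff the row spaces of A1 and A2 are mutually orthogonal.
    """
    return float(np.sum((A1 @ A2.T) ** 2))
```

Driving this penalty to zero pushes each characteristic-specific LoRA into its own subspace, so merging them is less likely to entangle, say, texture control with boundary control.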
cs.CV / 250 / 2603.01685
FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters
FastLightGen:快速轻量的视频生成,步骤和参数更少
Abstract
The recent advent of powerful video generation models, such as Hunyuan, WanX, Veo3, and Kling, has inaugurated a new era in the field. However, the practical deployment of these models is severely impeded by their substantial computational overhead, which stems from enormous parameter counts and the iterative, multi-step sampling process required during inference. Prior research on accelerating generative models has predominantly followed two distinct trajectories: reducing the number of sampling steps (e.g., LCM, DMD, and MagicDistillation) or compressing the model size for more efficient inference (e.g., ICMD). The potential of simultaneously compressing both to create a fast and lightweight model remains an unexplored avenue. In this paper, we propose FastLightGen, an algorithm that transforms large, computationally expensive models into fast, lightweight counterparts. The core idea is to construct an optimal teacher model, one engineered to maximize student performance, within a synergistic framework for distilling both model size and inference steps. Our extensive experiments on HunyuanVideo-ATI2V and WanX-TI2V reveal that a generator using 4-step sampling and 30\% parameter pruning achieves optimal visual quality under a constrained inference budget. Furthermore, FastLightGen consistently outperforms all competing methods, establishing a new state-of-the-art in efficient video generation.
Chinese Translation
近期强大的视频生成模型的出现,如Hunyuan、WanX、Veo3和Kling,开启了该领域的新纪元。然而,这些模型的实际部署受到其巨大的计算开销的严重阻碍,这主要源于庞大的参数数量以及推理过程中所需的迭代多步骤采样过程。以往关于加速生成模型的研究主要遵循两种不同的轨迹:减少采样步骤的数量(例如,LCM、DMD和MagicDistillation)或压缩模型大小以实现更高效的推理(例如,ICMD)。同时压缩两者以创建快速轻量模型的潜力仍然是一个未被探索的方向。在本文中,我们提出了FastLightGen,一种将大型计算开销模型转变为快速轻量模型的算法。其核心思想是在一个协同框架内构建一个最优教师模型,旨在最大化学生性能,以实现模型大小和推理步骤的蒸馏。我们在HunyuanVideo-ATI2V和WanX-TI2V上的广泛实验表明,使用4步采样和30\%参数剪枝的生成器在受限的推理预算下实现了最佳视觉质量。此外,FastLightGen始终优于所有竞争方法,确立了高效视频生成的新最先进水平。
cs.CV / 251 / 2603.01686
DiffusionXRay: A Diffusion and GAN-Based Approach for Enhancing Digitally Reconstructed Chest Radiographs
DiffusionXRay:一种基于扩散和生成对抗网络的数字重建胸部X光片增强方法
Abstract
Deep learning-based automated diagnosis of lung cancer has emerged as a crucial advancement that enables healthcare professionals to detect and initiate treatment earlier. However, these models require extensive training datasets with diverse case-specific properties. High-quality annotated data is particularly challenging to obtain, especially for cases with subtle pulmonary nodules that are difficult to detect even for experienced radiologists. This scarcity of well-labeled datasets can limit model performance and generalization across different patient populations. Digitally reconstructed radiographs (DRR) using CT-Scan to generate synthetic frontal chest X-rays with artificially inserted lung nodules offers one potential solution. However, this approach suffers from significant image quality degradation, particularly in the form of blurred anatomical features and loss of fine lung field structures. To overcome this, we introduce DiffusionXRay, a novel image restoration pipeline for Chest X-ray images that synergistically leverages denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs). DiffusionXRay incorporates a unique two-stage training process: First, we investigate two independent approaches, DDPM-LQ and GAN-based MUNIT-LQ, to generate low-quality CXRs, addressing the challenge of training data scarcity, posing this as a style transfer problem. Subsequently, we train a DDPM-based model on paired low-quality and high-quality images, enabling it to learn the nuances of X-ray image restoration. Our method demonstrates promising results in enhancing image clarity, contrast, and overall diagnostic value of chest X-rays while preserving subtle yet clinically significant artifacts, validated by both quantitative metrics and expert radiological assessment.
Chinese Translation
基于深度学习的肺癌自动诊断已成为一项重要进展,使医疗专业人员能够更早地检测和启动治疗。然而,这些模型需要具有多样化案例特征的大规模训练数据集。高质量的标注数据尤其难以获得,特别是对于那些即使是经验丰富的放射科医生也难以检测的细微肺结节病例。这种良好标注数据集的稀缺性可能限制模型在不同患者群体中的性能和泛化能力。利用CT扫描生成合成的正面胸部X光片,并人工插入肺结节的数字重建X光片(DRR)提供了一种潜在解决方案。然而,这种方法在图像质量上存在显著下降,特别是解剖特征模糊和细微肺野结构丧失。为了克服这一问题,我们提出了DiffusionXRay,这是一种新颖的胸部X光图像恢复管道,协同利用去噪扩散概率模型(DDPM)和生成对抗网络(GAN)。DiffusionXRay采用独特的两阶段训练过程:首先,我们研究了两种独立的方法,DDPM-LQ和基于GAN的MUNIT-LQ,以生成低质量的胸部X光片,解决训练数据稀缺的问题,将其视为风格迁移问题。随后,我们在配对的低质量和高质量图像上训练基于DDPM的模型,使其能够学习X光图像恢复的细微差别。我们的方法在增强图像清晰度、对比度和整体诊断价值方面表现出良好的结果,同时保留细微但临床上重要的伪影,得到了定量指标和专家放射学评估的验证。
cs.CV / 252 / 2603.01688
CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions
CoopDiff:一种应对干扰的扩散引导合作感知方法
Abstract
Cooperative perception lets agents share information to expand coverage and improve scene understanding. However, in real-world scenarios, diverse and unpredictable corruptions undermine its robustness and generalization. To address these challenges, we introduce CoopDiff, a diffusion-based cooperative perception framework that mitigates corruptions via a denoising mechanism. CoopDiff adopts a teacher-student paradigm: the Quality-Aware Teacher performs voxel-level early fusion with Quality of Interest weighting and semantic guidance, then produces clean supervision features via a diffusion denoiser. The Dual-Branch Diffusion Student first separates ego and cooperative streams in encoding to reconstruct the teacher's clean targets. And then, an Ego-Guided Cross-Attention mechanism facilitates balanced decoding under degradation by adaptively integrating ego and cooperative features. We evaluate CoopDiff on two constructed multi-degradation benchmarks, OPV2Vn and DAIR-V2Xn, each incorporating six corruption types, including environmental and sensor-level distortions. Benefiting from the inherent denoising properties of diffusion, CoopDiff consistently outperforms prior methods across all degradation types and lowers the relative corruption error. Furthermore, it offers a tunable balance between precision and inference efficiency.
Chinese Translation
合作感知使得智能体能够共享信息,以扩展覆盖范围并改善场景理解。然而,在现实场景中,各种不可预测的干扰削弱了其鲁棒性和泛化能力。为了解决这些挑战,我们提出了CoopDiff,一种基于扩散的合作感知框架,通过去噪机制减轻干扰。CoopDiff采用教师-学生范式:质量感知教师通过兴趣质量加权和语义指导进行体素级早期融合,然后通过扩散去噪器生成干净的监督特征。双分支扩散学生首先在编码中分离自我流和合作流,以重建教师的干净目标。随后,自我引导的交叉注意机制通过自适应整合自我特征和合作特征,促进在退化下的平衡解码。我们在两个构建的多退化基准OPV2Vn和DAIR-V2Xn上评估了CoopDiff,每个基准包含六种干扰类型,包括环境和传感器级失真。得益于扩散的固有去噪特性,CoopDiff在所有退化类型上始终优于先前的方法,并降低了相对干扰误差。此外,它在精度和推理效率之间提供了可调的平衡。
cs.CV / 253 / 2603.01694
MVR: Multi-view Video Reward Shaping for Reinforcement Learning
MVR:用于强化学习的多视角视频奖励塑造
Abstract
Reward design is of great importance for solving complex tasks with reinforcement learning. Recent studies have explored using image-text similarity produced by vision-language models (VLMs) to augment rewards of a task with visual feedback. A common practice linearly adds VLM scores to task or success rewards without explicit shaping, potentially altering the optimal policy. Moreover, such approaches, often relying on single static images, struggle with tasks whose desired behavior involves complex, dynamic motions spanning multiple visually different states. Furthermore, single viewpoints can occlude critical aspects of an agent's behavior. To address these issues, this paper presents Multi-View Video Reward Shaping (MVR), a framework that models the relevance of states regarding the target task using videos captured from multiple viewpoints. MVR leverages video-text similarity from a frozen pre-trained VLM to learn a state relevance function that mitigates the bias towards specific static poses inherent in image-based methods. Additionally, we introduce a state-dependent reward shaping formulation that integrates task-specific rewards and VLM-based guidance, automatically reducing the influence of VLM guidance once the desired motion pattern is achieved. We confirm the efficacy of the proposed framework with extensive experiments on challenging humanoid locomotion tasks from HumanoidBench and manipulation tasks from MetaWorld, verifying the design choices through ablation studies.
Chinese Translation
奖励设计对于使用强化学习解决复杂任务至关重要。最近的研究探讨了利用视觉-语言模型(VLM)生成的图像-文本相似性来增强具有视觉反馈的任务奖励。常见做法是将VLM得分线性地添加到任务或成功奖励中,而没有明确的塑造,这可能会改变最优策略。此外,这些方法通常依赖于单一静态图像,难以处理涉及复杂动态运动、跨越多个视觉上不同状态的任务。此外,单一视角可能会遮挡代理行为的关键方面。为了解决这些问题,本文提出了多视角视频奖励塑造(MVR)框架,该框架利用从多个视角捕获的视频来建模与目标任务相关的状态。MVR利用冻结的预训练VLM中的视频-文本相似性来学习状态相关性函数,从而减轻基于图像的方法固有的对特定静态姿势的偏见。此外,我们引入了一种状态依赖的奖励塑造公式,整合了任务特定奖励和基于VLM的指导,一旦实现所需的运动模式,自动减少VLM指导的影响。我们通过在HumanoidBench上的挑战性人形运动任务和MetaWorld上的操作任务进行广泛实验,验证了所提框架的有效性,并通过消融研究验证了设计选择。
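The state-dependent shaping idea — VLM guidance that automatically fades once the desired motion pattern is achieved — can be sketched as below. The linear decay schedule and function names are illustrative assumptions, not MVR's actual formulation:

```python
def shaped_reward(task_reward, vlm_score, relevance):
    """State-dependent reward shaping sketch.

    task_reward: the environment's task/success reward.
    vlm_score:   video-text similarity from a frozen VLM.
    relevance:   in [0, 1], how fully the target motion is realized;
                 guidance weight decays to 0 as relevance -> 1, so the
                 shaping stops altering rewards once the behavior is learned.
    """
    guidance_weight = 1.0 - relevance
    return task_reward + guidance_weight * vlm_score
```

Contrast this with the "common practice" the abstract criticizes, which would add `vlm_score` at full weight in every state and could therefore shift the optimal policy even after the motion is mastered.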
cs.CV / 254 / 2603.01696
Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning
跨模态身份映射:通过强化学习最小化模态转换中的信息损失
Abstract
Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss will force LVLMs to focus on image details to generate precise descriptions. However, measuring information loss during modality conversion is inherently challenging due to the modal gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity between images retrieved via text search using that caption. Based on this insight, we further propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates the information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised under these metrics, LVLM minimizes information loss and aims to achieve identity mapping from images to captions. The experimental results demonstrate the superior performance of our method in image captioning, even when compared with Supervised Fine-Tuning. Particularly, on the COCO-LN500 benchmark, CIM achieves a 20% improvement in relation reasoning on Qwen2.5-VL-7B. The code will be released when the paper is accepted.
Chinese Translation
大型视觉语言模型(LVLMs)在生成图像标题时常常遗漏或错误呈现关键的视觉内容。最小化这种信息损失将迫使LVLMs关注图像细节,以生成准确的描述。然而,由于视觉内容与文本输出之间的模态差异,在模态转换过程中测量信息损失本质上是具有挑战性的。本文认为,图像标题的质量与通过文本搜索获取的图像之间的相似性呈正相关。基于这一见解,我们进一步提出了跨模态身份映射(Cross-modal Identity Mapping, CIM),这是一种增强图像标题生成的强化学习框架,无需额外的注释。具体而言,该方法从两个角度定量评估信息损失:图库表示一致性(Gallery Representation Consistency)和查询-图库图像相关性(Query-gallery Image Relevance)。在这些指标的监督下,LVLM最小化信息损失,旨在实现从图像到标题的身份映射。实验结果表明,我们的方法在图像标题生成方面表现优越,甚至超过了监督微调(Supervised Fine-Tuning)。特别是在COCO-LN500基准测试中,CIM在Qwen2.5-VL-7B上的关系推理能力提高了20%。代码将在论文被接受后发布。
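The two evaluation signals named above can be sketched with plain cosine similarities over embedding vectors. Real CIM scores come from learned cross-modal encoders over retrieved images; the equal weighting `alpha=0.5` is an assumption:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def cim_reward(query_emb, gallery_embs, alpha=0.5):
    """Toy reward combining the abstract's two perspectives.

    Gallery Representation Consistency: how tightly the images retrieved
    with the caption cluster around their centroid.
    Query-gallery Image Relevance: how close those images are to the
    original (query) image. Both are 1.0 for a perfect retrieval set.
    """
    centroid = gallery_embs.mean(axis=0)
    consistency = np.mean([cosine(g, centroid) for g in gallery_embs])
    relevance = np.mean([cosine(query_emb, g) for g in gallery_embs])
    return alpha * consistency + (1 - alpha) * relevance
```

A caption that loses information retrieves a scattered, off-target gallery, lowering both terms; maximizing this reward is what pushes the captioner toward an identity mapping.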
cs.CV / 255 / 2603.01698
Towards Principled Dataset Distillation: A Spectral Distribution Perspective
迈向原则性数据集蒸馏:一种谱分布视角
Abstract
Dataset distillation (DD) aims to compress large-scale datasets into compact synthetic counterparts for efficient model training. However, existing DD methods exhibit substantial performance degradation on long-tailed datasets. We identify two fundamental challenges: heuristic design choices for distribution discrepancy measure and uniform treatment of imbalanced classes. To address these limitations, we propose Class-Aware Spectral Distribution Matching (CSDM), which reformulates distribution alignment via the spectrum of a well-behaved kernel function. This technique maps the original samples into frequency space, resulting in the Spectral Distribution Distance (SDD). To mitigate class imbalance, we exploit the unified form of SDD to perform amplitude-phase decomposition, which adaptively prioritizes the realism in tail classes. On CIFAR-10-LT, with 10 images per class, CSDM achieves a 14.0% improvement over state-of-the-art DD methods, with only a 5.7% performance drop when the number of images in tail classes decreases from 500 to 25, demonstrating strong stability on long-tailed data.
Chinese Translation
数据集蒸馏(Dataset Distillation, DD)旨在将大规模数据集压缩为紧凑的合成数据集,以实现高效的模型训练。然而,现有的DD方法在长尾数据集上表现出显著的性能下降。我们识别出两个基本挑战:分布差异度量的启发式设计选择和对不平衡类别的统一处理。为了解决这些局限性,我们提出了类别感知谱分布匹配(Class-Aware Spectral Distribution Matching, CSDM),该方法通过良好行为的核函数的谱重新构造分布对齐。该技术将原始样本映射到频率空间,从而得到谱分布距离(Spectral Distribution Distance, SDD)。为了缓解类别不平衡问题,我们利用SDD的统一形式进行幅度-相位分解,能够自适应地优先考虑尾部类别的真实性。在CIFAR-10-LT数据集上,每个类别10张图像,CSDM相较于最先进的DD方法实现了14.0%的性能提升,当尾部类别的图像数量从500减少到25时,仅有5.7%的性能下降,显示出在长尾数据上的强稳定性。
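A spectral distribution distance in the spirit described above can be sketched with random Fourier features of a Gaussian kernel, which map samples into frequency space before comparing mean embeddings. This omits CSDM's amplitude-phase decomposition and class-aware weighting; the bandwidth and feature count are illustrative:

```python
import numpy as np

def spectral_embedding(X, W, b):
    """X: (N, d) samples; W: (d, m) frequencies; b: (m,) phases.
    Returns the (m,) mean embedding in frequency space."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b).mean(axis=0)

def sdd(X, Y, m=256, sigma=1.0, seed=0):
    """Toy Spectral Distribution Distance between two sample sets."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, m))  # Gaussian-kernel spectrum
    b = rng.uniform(0, 2 * np.pi, size=m)
    diff = spectral_embedding(X, W, b) - spectral_embedding(Y, W, b)
    return float(np.sum(diff ** 2))
```

Minimizing such a distance between the synthetic set and each real class is the distribution-matching step; the principled part is that the discrepancy measure follows from a well-behaved kernel's spectrum rather than a heuristic choice.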
cs.CV / 256 / 2603.01706
Search Multilayer Perceptron-Based Fusion for Efficient and Accurate Siamese Tracking
基于多层感知器的融合方法用于高效且准确的孪生跟踪
Abstract
Siamese visual trackers have recently advanced through increasingly sophisticated fusion mechanisms built on convolutional or Transformer architectures. However, both struggle to deliver pixel-level interactions efficiently on resource-constrained hardware, leading to a persistent accuracy-efficiency imbalance. Motivated by this limitation, we redesign the Siamese neck with a simple yet effective Multilayer Perceptron (MLP)-based fusion module that enables pixel-level interaction with minimal structural overhead. Nevertheless, naively stacking MLP blocks introduces a new challenge: computational cost can scale quadratically with channel width. To overcome this, we construct a hierarchical search space of carefully designed MLP modules and introduce a customized relaxation strategy that enables differentiable neural architecture search (DNAS) to decouple channel-width optimization from other architectural choices. This targeted decoupling automatically balances channel width and depth, yielding a low-complexity architecture. The resulting tracker achieves state-of-the-art accuracy-efficiency trade-offs. It ranks among the top performers on four general-purpose and three aerial tracking benchmarks, while maintaining real-time performance on both resource-constrained Graphics Processing Units (GPUs) and Neural Processing Units (NPUs).
Chinese Translation
最近,孪生视觉跟踪器通过建立在卷积或变换器架构上的日益复杂的融合机制取得了进展。然而,这两者在资源受限的硬件上都难以高效地实现像素级交互,导致持续存在准确性与效率之间的不平衡。受到这一限制的启发,我们重新设计了孪生网络的颈部,采用一个简单而有效的基于多层感知器(MLP)的融合模块,使得在最小结构开销下实现像素级交互。然而,简单地堆叠MLP模块会引入一个新的挑战:计算成本可能随着通道宽度的增加而呈二次增长。为了解决这个问题,我们构建了一个精心设计的MLP模块的分层搜索空间,并引入了一种定制的松弛策略,使得可微分神经架构搜索(DNAS)能够将通道宽度优化与其他架构选择解耦。这种有针对性的解耦自动平衡了通道宽度和深度,从而产生了低复杂度的架构。最终的跟踪器在准确性与效率的权衡上达到了最先进的水平。在四个通用和三个空中跟踪基准测试中,它的表现位列前茅,同时在资源受限的图形处理单元(GPU)和神经处理单元(NPU)上保持实时性能。
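The core of a differentiable relaxation over channel-width choices can be sketched in a few lines: a softmax over architecture logits mixes candidate widths, so the expected width (and hence the expected cost term) is differentiable in the logits and can be optimized separately from other choices. The candidate widths below are illustrative, not the paper's search space:

```python
import numpy as np

def expected_width(logits, candidate_widths):
    """Softmax-relaxed channel-width choice (DNAS-style sketch).

    logits: architecture parameters, one per candidate width.
    Returns the probability-weighted width, a smooth function of logits.
    """
    a = np.exp(np.asarray(logits, dtype=float) - np.max(logits))
    probs = a / a.sum()                      # relaxed categorical choice
    return float(probs @ np.asarray(candidate_widths, dtype=float))
```

At convergence the softmax sharpens and the expected width collapses onto a single discrete candidate, which is then instantiated in the final low-complexity architecture.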
cs.CV / 257 / 2603.01708
WhisperNet: A Scalable Solution for Bandwidth-Efficient Collaboration
WhisperNet:一种可扩展的带宽高效协作解决方案
Abstract
Collaborative perception is vital for autonomous driving yet remains constrained by tight communication budgets. Earlier work reduced bandwidth by compressing full feature maps with fixed-rate encoders, which adapts poorly to a changing environment, and it further evolved into spatial selection methods that improve efficiency by focusing on salient regions, but this object-centric approach often sacrifices global context, weakening holistic scene understanding. To overcome these limitations, we introduce \textit{WhisperNet}, a bandwidth-aware framework that proposes a novel, receiver-centric paradigm for global coordination across agents. Senders generate lightweight saliency metadata, while the receiver formulates a global request plan that dynamically budgets feature contributions across agents and features, retrieving only the most informative features. A collaborative feature routing module then aligns related messages before fusion to ensure structural consistency. Extensive experiments show that WhisperNet achieves state-of-the-art performance, improving [email protected] on OPV2V by 2.4\% with only 0.5\% of the communication cost. As a plug-and-play component, it boosts strong baselines with merely 5\% of full bandwidth while maintaining robustness under localization noise. These results demonstrate that globally-coordinated allocation across \textit{what} and \textit{where} to share is the key to achieving efficient collaborative perception.
Chinese Translation
协作感知对于自动驾驶至关重要,但仍受到严格通信预算的限制。早期的研究通过使用固定速率编码器压缩完整特征图来减少带宽,这种方法对变化的环境适应性较差,随后演变为空间选择方法,通过聚焦于显著区域来提高效率,但这种以对象为中心的方法往往牺牲了全局上下文,削弱了整体场景理解。为克服这些局限性,我们提出了\textit{WhisperNet},这是一个带宽感知框架,提出了一种新的以接收方为中心的全局协调范式。发送方生成轻量级显著性元数据,而接收方制定一个全局请求计划,动态分配各个代理和特征的特征贡献,仅检索最具信息量的特征。一个协作特征路由模块在融合之前对相关消息进行对齐,以确保结构一致性。大量实验表明,WhisperNet实现了最先进的性能,在OPV2V上将[email protected]提高了2.4\%,而通信成本仅为0.5\%。作为一个即插即用组件,它在仅占用5\%的完整带宽的情况下提升了强基线,同时在定位噪声下保持了鲁棒性。这些结果表明,在\textit{什么}和\textit{哪里}共享之间进行全局协调分配是实现高效协作感知的关键。
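The receiver-centric allocation described above can be sketched in a few lines of numpy. The saliency scores, per-cell granularity, and budget unit are hypothetical stand-ins for the paper's learned metadata, so this is only an illustration of global top-k budgeting, not WhisperNet's actual planner:

```python
import numpy as np

def global_request_plan(saliency_per_agent, budget):
    """Receiver-side sketch: rank all (agent, cell) saliency scores
    globally and request only the top-`budget` cells, instead of
    giving every agent a fixed bandwidth share."""
    scores = np.concatenate([s.ravel() for s in saliency_per_agent])
    agent_ids = np.concatenate(
        [np.full(s.size, i) for i, s in enumerate(saliency_per_agent)])
    cell_ids = np.concatenate([np.arange(s.size) for s in saliency_per_agent])
    top = np.argsort(scores)[::-1][:budget]  # globally most informative cells
    # The plan tells each sender which of its feature cells to transmit.
    return {i: sorted(cell_ids[top[agent_ids[top] == i]].tolist())
            for i in range(len(saliency_per_agent))}
```

Under this scheme a salient agent may receive most of the budget while a redundant one sends nothing, which is the global what-and-where coordination the abstract argues for.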
cs.CV / 258 / 2603.01713
Dual Distillation for Few-Shot Anomaly Detection
双重蒸馏用于少样本异常检测
Abstract
Anomaly detection is a critical task in computer vision with profound implications for medical imaging, where identifying pathologies early can directly impact patient outcomes. While recent unsupervised anomaly detection approaches show promise, they require substantial normal training data and struggle to generalize across anatomical contexts. We introduce D$^2$4FAD, a novel dual distillation framework for few-shot anomaly detection that identifies anomalies in previously unseen tasks using only a small number of normal reference images. Our approach leverages a pre-trained encoder as a teacher network to extract multi-scale features from both support and query images, while a student decoder learns to distill knowledge from the teacher on query images and self-distill on support images. We further propose a learn-to-weight mechanism that dynamically assesses the reference value of each support image conditioned on the query, optimizing anomaly detection performance. To evaluate our method, we curate a comprehensive benchmark dataset comprising 13,084 images across four organs, four imaging modalities, and five disease categories. Extensive experiments demonstrate that D$^2$4FAD significantly outperforms existing approaches, establishing a new state-of-the-art in few-shot medical anomaly detection. Code is available at https://github.com/ttttqz/D24FAD.
Chinese Translation
异常检测是计算机视觉中的一项关键任务,对医学影像具有深远的影响,早期识别病理可以直接影响患者的治疗结果。尽管最近的无监督异常检测方法展现出良好的前景,但它们需要大量的正常训练数据,并且在不同解剖背景下的泛化能力较差。我们提出了 D$^2$4FAD,一种新颖的双重蒸馏框架,用于少样本异常检测,能够仅使用少量正常参考图像识别在先前未见任务中的异常。我们的方法利用预训练的编码器作为教师网络,从支持图像和查询图像中提取多尺度特征,同时学生解码器学习从教师网络中提取查询图像的知识,并在支持图像上进行自我蒸馏。我们进一步提出了一种学习加权机制,动态评估每个支持图像相对于查询的参考价值,从而优化异常检测性能。为了评估我们的方法,我们整理了一个全面的基准数据集,包含 13,084 张图像,涵盖四个器官、四种成像模式和五个疾病类别。大量实验表明,D$^2$4FAD 显著优于现有方法,确立了少样本医学异常检测的新最优状态。代码可在 https://github.com/ttttqz/D24FAD 获取。
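The learn-to-weight idea, weighting each support image by its relevance to the query, can be illustrated with a cosine-similarity softmax. The real mechanism is learned end-to-end, so this hand-built scoring is only a stand-in for the paper's module:

```python
import numpy as np

def weighted_anomaly_score(query_feat, support_feats):
    """Score each normal support image by cosine similarity to the
    query, softmax the scores into reference weights, and measure how
    far the query falls from the weighted reference."""
    q = query_feat / np.linalg.norm(query_feat)
    S = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    sims = S @ q
    w = np.exp(sims - sims.max())
    w /= w.sum()                   # reference value of each support image
    reference = w @ support_feats  # query-conditioned weighted reference
    return np.linalg.norm(query_feat - reference), w
```

A query that resembles one support image pulls the reference toward it, so a genuinely anomalous query ends up far from every weighted combination of normal references.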
cs.CV / 259 / 2603.01720
Preoperative-to-intraoperative Liver Registration for Laparoscopic Surgery via Latent-Grounded Correspondence Constraints
通过潜在基础对应约束实现腹腔镜手术的术前到术中肝脏配准
Abstract
In laparoscopic liver surgery, augmented reality technology enhances intraoperative anatomical guidance by overlaying 3D liver models from preoperative CT/MRI onto laparoscopic 2D views. However, existing registration methods lack explicit modeling of reliable 2D-3D geometric correspondences supported by latent evidence, leading to limited interpretability and potentially unstable alignment in clinical scenarios. In this work, we introduce Land-Reg, a correspondence-driven deformable registration framework that explicitly learns latent-grounded 2D-3D landmark correspondences as an interpretable intermediate representation to bridge cross-modal alignment. For rigid registration, Land-Reg employs a Cross-modal Latent Alignment module to map multi-modal features into a unified latent space. Further, an Uncertainty-enhanced Overlap Landmark Detector with similarity matching is proposed to robustly estimate explicit 2D-3D landmark correspondences. For non-rigid registration, we design a novel shape-constrained supervision strategy that anchors shape deformation to matched landmarks through reprojection consistency and incorporates local-isometric regularization to alleviate inherent 2D-3D depth ambiguity, while a rendered-mask alignment enforces global shape consistency. Experimental results on the P2ILF dataset demonstrate the superiority of our method on both rigid pose estimation and non-rigid deformation. Our code will be available at https://github.com/cuiruize/Land-Reg.
Chinese Translation
在腹腔镜肝脏手术中,增强现实技术通过将术前CT/MRI的3D肝脏模型叠加到腹腔镜的2D视图上,增强了术中的解剖引导。然而,现有的配准方法缺乏对可靠的2D-3D几何对应关系的明确建模,这些对应关系由潜在证据支持,导致在临床场景中解释性有限且可能出现不稳定的对齐。在本研究中,我们提出了Land-Reg,一个以对应关系驱动的可变形配准框架,明确学习潜在基础的2D-3D地标对应关系,作为可解释的中间表示,以桥接跨模态对齐。对于刚性配准,Land-Reg采用了跨模态潜在对齐模块,将多模态特征映射到统一的潜在空间。此外,提出了一种不确定性增强的重叠地标检测器,通过相似性匹配来稳健地估计明确的2D-3D地标对应关系。对于非刚性配准,我们设计了一种新颖的形状约束监督策略,通过重投影一致性将形状变形锚定到匹配的地标,并结合局部等距正则化以减轻固有的2D-3D深度模糊,同时渲染掩膜对齐以强制全局形状一致性。在P2ILF数据集上的实验结果表明,我们的方法在刚性姿态估计和非刚性变形方面均表现出优越性。我们的代码将发布在 https://github.com/cuiruize/Land-Reg。
cs.CV / 260 / 2603.01725
Learning Domain-Aware Task Prompt Representations for Multi-Domain All-in-One Image Restoration
学习领域感知任务提示表示用于多领域一体化图像修复
Abstract
Recently, significant breakthroughs have been made in all-in-one image restoration (AiOIR), which can handle multiple restoration tasks with a single model. However, existing methods typically focus on a specific image domain, such as natural scene, medical imaging, or remote sensing. In this work, we aim to extend AiOIR to multiple domains and propose the first multi-domain all-in-one image restoration method, DATPRL-IR, based on our proposed Domain-Aware Task Prompt Representation Learning. Specifically, we first construct a task prompt pool containing multiple task prompts, in which task-related knowledge is implicitly encoded. For each input image, the model adaptively selects the most relevant task prompts and composes them into an instance-level task representation via a prompt composition mechanism (PCM). Furthermore, to endow the model with domain awareness, we introduce another domain prompt pool and distill domain priors from multimodal large language models into the domain prompts. PCM is utilized to combine the adaptively selected domain prompts into a domain representation for each input image. Finally, the two representations are fused to form a domain-aware task prompt representation which can make full use of both specific and shared knowledge across tasks and domains to guide the subsequent restoration process. Extensive experiments demonstrate that our DATPRL-IR significantly outperforms existing SOTA image restoration methods, while exhibiting strong generalization capabilities. Code is available at https://github.com/GuangluDong0728/DATPRL-IR.
Chinese Translation
近年来,一体化图像修复(AiOIR)领域取得了显著突破,能够通过单一模型处理多种修复任务。然而,现有方法通常专注于特定的图像领域,如自然场景、医学成像或遥感。在本研究中,我们旨在将AiOIR扩展到多个领域,并提出首个基于我们提出的领域感知任务提示表示学习的多领域一体化图像修复方法DATPRL-IR。具体而言,我们首先构建一个包含多个任务提示的任务提示池,其中隐含编码了任务相关知识。对于每个输入图像,模型自适应地选择最相关的任务提示,并通过提示组合机制(PCM)将其组合成实例级任务表示。此外,为了赋予模型领域感知能力,我们引入另一个领域提示池,并从多模态大型语言模型中提取领域先验信息到领域提示中。PCM用于将自适应选择的领域提示组合成每个输入图像的领域表示。最后,将这两种表示融合形成领域感知任务提示表示,充分利用任务和领域之间的特定知识和共享知识,以指导后续的修复过程。大量实验表明,我们的DATPRL-IR显著优于现有的最先进图像修复方法,同时展现出强大的泛化能力。代码可在 https://github.com/GuangluDong0728/DATPRL-IR 获取。
cs.CV / 261 / 2603.01743
Action-Guided Attention for Video Action Anticipation
基于动作引导注意力的视频动作预测
Abstract
Anticipating future actions in videos is challenging, as the observed frames provide only evidence of past activities, requiring the inference of latent intentions to predict upcoming actions. Existing transformer-based approaches, which rely on dot-product attention over pixel representations, often lack the high-level semantics necessary to model video sequences for effective action anticipation. As a result, these methods tend to overfit to explicit visual cues present in the past frames, limiting their ability to capture underlying intentions and degrading generalization to unseen samples. To address this, we propose Action-Guided Attention (AGA), an attention mechanism that explicitly leverages predicted action sequences as queries and keys to guide sequence modeling. Our approach fosters the attention module to emphasize relevant moments from the past based on the upcoming activity and combine this information with the current frame embedding via a dedicated gating function. The design of AGA enables post-training analysis of the knowledge discovered from the training set. Experiments on the widely adopted EPIC-Kitchens-100 benchmark demonstrate that AGA generalizes well from validation to unseen test sets. Post-training analysis can further examine the action dependencies captured by the model and the counterfactual evidence it has internalized, offering transparent and interpretable insights into its anticipative predictions.
Chinese Translation
在视频中预测未来动作是一项具有挑战性的任务,因为观察到的帧仅提供过去活动的证据,需要推断潜在意图以预测即将发生的动作。现有的基于变换器的方法依赖于对像素表示的点积注意力,往往缺乏建模视频序列以有效预测动作所需的高层语义。因此,这些方法往往过度拟合于过去帧中显式的视觉线索,限制了它们捕捉潜在意图的能力,并降低了对未见样本的泛化能力。为了解决这个问题,我们提出了动作引导注意力(Action-Guided Attention, AGA),这是一种注意力机制,明确利用预测的动作序列作为查询和键来引导序列建模。我们的方法促进注意力模块根据即将发生的活动强调过去相关时刻,并通过专门的门控函数将这些信息与当前帧嵌入结合。AGA的设计使得对训练集发现的知识进行后训练分析成为可能。在广泛采用的EPIC-Kitchens-100基准上的实验表明,AGA能够很好地从验证集泛化到未见测试集。后训练分析还可以进一步检查模型捕获的动作依赖性及其内化的反事实证据,为其预测提供透明且可解释的见解。
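A rough numpy sketch of the mechanism: the predicted upcoming action queries past-action keys to re-weight past frames, and a sigmoid gate blends the attended context with the current frame embedding. The shapes and the gate parameterization are assumptions for illustration, not the paper's exact design:

```python
import numpy as np

def action_guided_attention(next_action, past_actions, past_frames,
                            cur_frame, gate_w):
    """next_action:  (d,)  predicted action embedding (the query).
    past_actions: (T, d) action embeddings of past steps (the keys).
    past_frames:  (T, m) frame features of past steps (the values)."""
    scores = past_actions @ next_action / np.sqrt(next_action.size)
    w = np.exp(scores - scores.max())
    w /= w.sum()                                  # relevance of each past moment
    context = w @ past_frames                     # action-guided past summary
    gate_in = np.concatenate([context, cur_frame])
    g = 1.0 / (1.0 + np.exp(-gate_w @ gate_in))   # dedicated gating function
    return g * context + (1.0 - g) * cur_frame, w
```

Because the attention weights live over action semantics rather than raw pixels, they can be inspected after training, which is what enables the post-hoc analysis of learned action dependencies the abstract mentions.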
cs.CV / 262 / 2603.01746
An Analysis of Multi-Task Architectures for the Hierarchic Multi-Label Problem of Vehicle Model and Make Classification
针对车辆型号和品牌分类的层次多标签问题的多任务架构分析
Abstract
Most information in our world is organized hierarchically; however, many Deep Learning approaches do not leverage this semantically rich structure. Research suggests that human learning benefits from exploiting the hierarchical structure of information, and intelligent models could similarly take advantage of this through multi-task learning. In this work, we analyze the advantages and limitations of multi-task learning in a hierarchical multi-label classification problem: car make and model classification. Considering both parallel and cascaded multi-task architectures, we evaluate their impact on different Deep Learning classifiers (CNNs, Transformers) while varying key factors such as dropout rate and loss weighting to gain deeper insight into the effectiveness of this approach. The tests are conducted on two established benchmarks: StanfordCars and CompCars. We observe the effectiveness of the multi-task paradigm on both datasets, improving the performance of the investigated CNN in almost all scenarios. Furthermore, the approach yields significant improvements on the CompCars dataset for both types of models.
Chinese Translation
我们世界中的大多数信息是以层次结构组织的;然而,许多深度学习方法并未利用这种语义丰富的结构。研究表明,人类学习从利用信息的层次结构中受益,智能模型也可以通过多任务学习类似地利用这一点。在本研究中,我们分析了多任务学习在层次多标签分类问题中的优势和局限性:汽车品牌和型号分类。考虑到并行和级联多任务架构,我们评估了它们对不同深度学习分类器(卷积神经网络CNN和变换器Transformers)的影响,同时改变关键因素,如丢弃率和损失加权,以深入了解该方法的有效性。测试在两个已建立的基准数据集上进行:StanfordCars和CompCars。我们观察到多任务范式在这两个数据集上的有效性,在几乎所有场景中都提高了所研究的CNN的性能。此外,该方法在CompCars数据集上对两种类型的模型都带来了显著的改进。
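The two architectures compared above differ only in whether the model head sees the make prediction. A toy numpy forward pass (hypothetical layer sizes, untrained random weights) makes the distinction concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, MAKES, MODELS = 8, 16, 3, 9      # hypothetical dimensions
W_b = rng.normal(size=(D, H))          # shared backbone
W_make = rng.normal(size=(H, MAKES))
W_model_par = rng.normal(size=(H, MODELS))
W_model_cas = rng.normal(size=(H + MAKES, MODELS))

def forward(x, cascaded=False):
    h = np.tanh(x @ W_b)               # shared features
    make_logits = h @ W_make
    if cascaded:
        # Cascaded: the model head also consumes the make logits,
        # exposing the hierarchy (each model belongs to one make).
        model_in = np.concatenate([h, make_logits], axis=-1)
        model_logits = model_in @ W_model_cas
    else:
        # Parallel: both heads read the shared features independently.
        model_logits = h @ W_model_par
    return make_logits, model_logits
```

In practice the backbone would be a CNN or Transformer; the point is only that the cascaded variant bakes the make-model hierarchy into the computation graph rather than leaving it implicit in the labels.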
cs.CV / 263 / 2603.01756
NeuroSymb-MRG: Differentiable Abductive Reasoning with Active Uncertainty Minimization for Radiology Report Generation
NeuroSymb-MRG:具有主动不确定性最小化的可微推理用于放射学报告生成
Abstract
Automatic generation of radiology reports seeks to reduce clinician workload while improving documentation consistency. Existing methods that adopt encoder-decoder or retrieval-augmented pipelines achieve progress in fluency but remain vulnerable to visual-linguistic biases, factual inconsistency, and lack of explicit multi-hop clinical reasoning. We present NeuroSymb-MRG, a unified framework that integrates NeuroSymbolic abductive reasoning with active uncertainty minimization to produce structured, clinically grounded reports. The system maps image features to probabilistic clinical concepts, composes differentiable logic-based reasoning chains, decodes those chains into templated clauses, and refines the textual output via retrieval and constrained language-model editing. An active sampling loop driven by rule-level uncertainty and diversity guides clinician-in-the-loop adjudication and promptbook refinement. Experiments on standard benchmarks demonstrate consistent improvements in factual consistency and standard language metrics compared to representative baselines.
Chinese Translation
自动生成放射学报告旨在减少临床医生的工作负担,同时提高文档的一致性。现有采用编码器-解码器或检索增强管道的方法在流畅性方面取得了一定进展,但仍然容易受到视觉-语言偏见、事实不一致和缺乏明确的多跳临床推理的影响。我们提出了NeuroSymb-MRG,一个统一框架,将神经符号推理与主动不确定性最小化相结合,以生成结构化的、基于临床的报告。该系统将图像特征映射到概率临床概念,构建可微的基于逻辑的推理链,将这些链解码为模板化条款,并通过检索和受限语言模型编辑来优化文本输出。由规则级不确定性和多样性驱动的主动采样循环指导临床医生的参与裁决和提示书的优化。标准基准上的实验表明,与代表性基线相比,在事实一致性和标准语言指标上取得了一致的改进。
cs.CV / 264 / 2603.01757
StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models
StepVAR:结构-纹理引导的视觉自回归模型剪枝
Abstract
Visual AutoRegressive (VAR) models based on next-scale prediction enable efficient hierarchical generation, yet the inference cost grows quadratically at high resolutions. We observe that the computationally intensive later scales predominantly refine high-frequency textures and exhibit substantial spatial redundancy, in contrast to earlier scales that determine the global structural layout. Existing pruning methods primarily focus on high-frequency detection for token selection, often overlooking structural coherence and consequently degrading global semantics. To address this limitation, we propose StepVAR, a training-free token pruning framework that accelerates VAR inference by jointly considering structural and textural importance. Specifically, we employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. This dual-criterion design enables the model to retain tokens critical for both fine-grained fidelity and overall composition. To maintain valid next-scale prediction under sparse tokens, we further introduce a nearest neighbor feature propagation strategy to reconstruct dense feature maps from pruned representations. Extensive experiments on state-of-the-art text-to-image and text-to-video VAR models demonstrate that StepVAR achieves substantial inference speedups while maintaining generation quality. Quantitative and qualitative evaluations consistently show that our method outperforms existing acceleration approaches, validating its effectiveness and general applicability across diverse VAR architectures.
Chinese Translation
基于下一尺度预测的视觉自回归(VAR)模型能够实现高效的层次生成,但在高分辨率下推理成本呈平方增长。我们观察到,计算密集型的后期尺度主要用于细化高频纹理,并表现出显著的空间冗余,这与确定全局结构布局的早期尺度形成对比。现有的剪枝方法主要关注高频检测以进行标记选择,往往忽视了结构一致性,从而降低了全局语义。为了解决这一局限性,我们提出了StepVAR,一种无训练的标记剪枝框架,通过共同考虑结构和纹理的重要性来加速VAR推理。具体而言,我们采用轻量级高通滤波器来捕捉局部纹理细节,同时利用主成分分析(PCA)来保留全局结构信息。这种双标准设计使模型能够保留对细粒度保真度和整体构图至关重要的标记。为了在稀疏标记下保持有效的下一尺度预测,我们进一步引入了一种最近邻特征传播策略,从剪枝表示中重建稠密特征图。在最先进的文本到图像和文本到视频VAR模型上的广泛实验表明,StepVAR在保持生成质量的同时实现了显著的推理加速。定量和定性评估一致显示我们的方法优于现有的加速方法,验证了其有效性和在多种VAR架构中的通用适用性。
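The dual-criterion selection can be sketched over a small 2D token map: a 4-neighbour high-pass response flags local texture, and PCA reconstruction error flags tokens lying off the dominant structural subspace. How StepVAR actually combines the two criteria is not specified here; the equal-weight sum below is an assumption:

```python
import numpy as np

def dual_criterion_keep(tokens, keep_ratio=0.5, alpha=0.5, k=2):
    """Toy token selection over an (H, W, C) token map: keep the
    tokens scoring highest on a blend of high-pass (texture) and
    PCA reconstruction error (structure) criteria."""
    H, W, C = tokens.shape
    flat = tokens.reshape(-1, C)
    # High-pass: token minus the mean of its 4-neighbourhood.
    pad = np.pad(tokens, ((1, 1), (1, 1), (0, 0)), mode='edge')
    neigh = (pad[:-2, 1:-1] + pad[2:, 1:-1]
             + pad[1:-1, :-2] + pad[1:-1, 2:]) / 4.0
    hp = np.linalg.norm(tokens - neigh, axis=-1).ravel()
    # PCA: error of reconstructing each token from the top-k components.
    mu = flat.mean(axis=0)
    _, _, Vt = np.linalg.svd(flat - mu, full_matrices=False)
    recon = (flat - mu) @ Vt[:k].T @ Vt[:k] + mu
    pca_err = np.linalg.norm(flat - recon, axis=-1)
    score = (alpha * hp / (hp.max() + 1e-8)
             + (1 - alpha) * pca_err / (pca_err.max() + 1e-8))
    n_keep = max(1, int(keep_ratio * H * W))
    return np.sort(np.argsort(score)[::-1][:n_keep])  # retained token ids
```

Pruned positions would then be filled from their nearest retained neighbour, matching the feature-propagation step that keeps the next-scale prediction operating on a dense map.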
cs.CV / 265 / 2603.01758
Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining
通过语言引导的预训练统一异构多模态遥感检测
Abstract
Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors (e.g., RGB, SAR, Infrared). Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. This tight coupling complicates optimization and often results in unstable training and suboptimal generalization. To address these limitations, we propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning. BabelRS comprises two key components: Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA). CSIA aligns each sensor modality to a shared set of linguistic concepts, using language as a semantic pivot to bridge heterogeneous visual representations. To further mitigate the granularity mismatch between high-level language representations and dense detection objectives, LVSA progressively aggregates multi-scale visual features to provide fine-grained semantic guidance. Extensive experiments demonstrate that BabelRS stabilizes training and consistently outperforms state-of-the-art methods without bells and whistles. Code: https://github.com/zcablii/SM3Det.
Chinese Translation
异构多模态遥感目标检测旨在准确检测来自不同传感器(例如,RGB、SAR、红外)的物体。现有方法主要采用后期对齐范式,在该范式中,模态对齐与任务特定优化在下游微调过程中交织在一起。这种紧密耦合使得优化变得复杂,通常导致训练不稳定和泛化性能不佳。为了解决这些局限性,我们提出了BabelRS,一个统一的语言引导预训练框架,明确将模态对齐与下游任务学习解耦。BabelRS包含两个关键组件:概念共享指令对齐(CSIA)和层次视觉-语义退火(LVSA)。CSIA将每个传感器模态对齐到一组共享的语言概念,利用语言作为语义枢纽来桥接异构视觉表示。为了进一步减轻高层语言表示与密集检测目标之间的粒度不匹配,LVSA逐步聚合多尺度视觉特征,以提供细粒度的语义指导。大量实验表明,BabelRS稳定了训练,并在没有额外复杂性的情况下始终优于现有最先进的方法。代码:https://github.com/zcablii/SM3Det。
cs.CV / 266 / 2603.01765
Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation
通过低秩解码器适应实现高效的测试时优化用于深度补全
Abstract
Zero-shot depth completion has gained attention for its ability to generalize across environments without sensor-specific datasets or retraining. However, most existing approaches rely on diffusion-based test-time optimization, which is computationally expensive due to iterative denoising. Recent visual-prompt-based methods reduce training cost but still require repeated forward--backward passes through the full frozen network to optimize input-level prompts, resulting in slow inference. In this work, we show that adapting only the decoder is sufficient for effective test-time optimization, as depth foundation models concentrate depth-relevant information within a low-dimensional decoder subspace. Based on this insight, we propose a lightweight test-time adaptation method that updates only this low-dimensional subspace using sparse depth supervision. Our approach achieves state-of-the-art performance, establishing a new Pareto frontier between accuracy and efficiency for test-time adaptation. Extensive experiments on five indoor and outdoor datasets demonstrate consistent improvements over prior methods, highlighting the practicality of fast zero-shot depth completion.
Chinese Translation
零样本深度补全因其在没有传感器特定数据集或重新训练的情况下能够跨环境泛化而受到关注。然而,大多数现有方法依赖于基于扩散的测试时优化,这由于迭代去噪而计算开销较大。最近的基于视觉提示的方法降低了训练成本,但仍需通过完整的冻结网络进行多次前向和反向传播以优化输入级提示,导致推理速度缓慢。在本研究中,我们表明,仅适应解码器就足以实现有效的测试时优化,因为深度基础模型将与深度相关的信息集中在低维解码器子空间内。基于这一见解,我们提出了一种轻量级的测试时适应方法,仅使用稀疏深度监督更新这个低维子空间。我们的方法实现了最先进的性能,为测试时适应在准确性和效率之间建立了新的帕累托前沿。在五个室内和室外数据集上的大量实验表明,相较于之前的方法,我们的方法在性能上有一致的提升,突显了快速零样本深度补全的实用性。
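Decoder-only adaptation can be mimicked with a frozen linear decoder plus trainable low-rank factors, updated by gradient steps on a sparse-depth loss. The dimensions, learning rate, and plain-SGD update are illustrative assumptions rather than the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
D_FEAT, RANK = 16, 2
W = rng.normal(size=(D_FEAT, 1))                 # frozen decoder weight
A = np.zeros((D_FEAT, RANK))                     # trainable low-rank factors
B = rng.normal(size=(RANK, 1))

def decode(F):
    return F @ (W + A @ B)                       # frozen path + low-rank delta

def tta_step(F, sparse_depth, mask, lr=1e-2):
    """One test-time step: only A and B move, supervised by the few
    pixels where a sparse depth measurement exists (mask == 1)."""
    global A, B
    err = (decode(F) - sparse_depth) * mask
    n = mask.sum() + 1e-8
    gW = F.T @ err / n                           # grad wrt the effective weight
    gA, gB = gW @ B.T, A.T @ gW                  # chain rule through A @ B
    A -= lr * gA
    B -= lr * gB
    return float((err ** 2).sum() / n)           # masked MSE before the update
```

Only `D_FEAT * RANK + RANK` parameters change per scene, which is what makes each adaptation step cheap compared with optimizing prompts through the full frozen network.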
cs.CV / 267 / 2603.01767
Downstream Task Inspired Underwater Image Enhancement: A Perception-Aware Study from Dataset Construction to Network Design
受下游任务启发的水下图像增强:从数据集构建到网络设计的感知意识研究
Abstract
In real underwater environments, downstream image recognition tasks such as semantic segmentation and object detection often face challenges posed by problems like blurring and color inconsistencies. Underwater image enhancement (UIE) has emerged as a promising preprocessing approach, aiming to improve the recognizability of targets in underwater images. However, most existing UIE methods mainly focus on enhancing images for human visual perception, frequently failing to reconstruct high-frequency details that are critical for task-specific recognition. To address this issue, we propose a Downstream Task-Inspired Underwater Image Enhancement (DTI-UIE) framework, which leverages a human visual perception model to enhance images effectively for underwater vision tasks. Specifically, we design an efficient two-branch network with a task-aware attention module for feature mixing. The network benefits from a multi-stage training framework and a task-driven perceptual loss. Additionally, inspired by human perception, we automatically construct a Task-Inspired UIE Dataset (TI-UIED) using various task-specific networks. Experimental results demonstrate that DTI-UIE significantly improves task performance by generating preprocessed images that are beneficial for downstream tasks such as semantic segmentation, object detection, and instance segmentation. The codes are publicly available at https://github.com/oucailab/DTIUIE.
Chinese Translation
在真实的水下环境中,下游图像识别任务如语义分割和目标检测常常面临模糊和颜色不一致等问题带来的挑战。水下图像增强(UIE)作为一种有前景的预处理方法,旨在提高水下图像中目标的可识别性。然而,大多数现有的UIE方法主要关注于增强图像以适应人类视觉感知,常常无法重建对特定任务识别至关重要的高频细节。为了解决这一问题,我们提出了一种受下游任务启发的水下图像增强(DTI-UIE)框架,该框架利用人类视觉感知模型有效增强水下视觉任务的图像。具体而言,我们设计了一个高效的双分支网络,并结合任务感知注意力模块进行特征混合。该网络受益于多阶段训练框架和任务驱动的感知损失。此外,受到人类感知的启发,我们使用各种任务特定网络自动构建了一个任务启发的UIE数据集(TI-UIED)。实验结果表明,DTI-UIE通过生成对下游任务(如语义分割、目标检测和实例分割)有利的预处理图像,显著提高了任务性能。代码已公开发布在 https://github.com/oucailab/DTIUIE。
cs.CV / 268 / 2603.01804
Non-verbal Real-time Human-AI Interaction in Constrained Robotic Environments
受限机器人环境中的非语言实时人机交互
Abstract
We study the ongoing debate regarding the statistical fidelity of AI-generated data compared to human-generated data in the context of non-verbal communication using full body motion. Concretely, we ask if contemporary generative models move beyond surface mimicry to participate in the silent, but expressive dialogue of body language. We tackle this question by introducing the first framework that generates a natural non-verbal interaction between Human and AI in real-time from 2D body keypoints. Our experiments utilize four lightweight architectures which run at up to 100 FPS on an NVIDIA Orin Nano, effectively closing the perception-action loop needed for natural Human-AI interaction. We trained on 437 human video clips and demonstrated that pretraining on synthetically-generated sequences reduces motion errors significantly, without sacrificing speed. Yet, a measurable reality gap persists. When the best model is evaluated on keypoints extracted from cutting-edge text-to-video systems, such as SORA and VEO, we observe that performance drops on SORA-generated clips. However, it degrades far less on VEO, suggesting that temporal coherence, not image fidelity, drives real-world performance. Our results demonstrate that statistically distinguishable differences persist between Human and AI motion.
Chinese Translation
我们研究了关于AI生成数据与人类生成数据在非语言交流中(使用全身运动)的统计真实性的持续辩论。具体而言,我们询问当代生成模型是否超越了表面模仿,参与到肢体语言的无声但富有表现力的对话中。我们通过引入第一个框架来解决这个问题,该框架能够实时从2D身体关键点生成自然的非语言人机交互。我们的实验利用了四种轻量级架构,这些架构在NVIDIA Orin Nano上以高达100 FPS的速度运行,有效地闭合了自然人机交互所需的感知-行动循环。我们在437个视频剪辑上进行了训练,并证明在合成生成序列上进行预训练显著减少了运动误差,而不牺牲速度。然而,仍然存在可测量的现实差距。当在从尖端文本到视频系统(如SORA和VEO)提取的关键点上评估最佳模型时,我们观察到在SORA生成的剪辑上的性能下降。然而,在VEO上的下降则远远较小,这表明时间一致性而非图像保真度驱动了现实世界的表现。我们的结果表明,人类和AI运动之间仍然存在统计上可区分的差异。
cs.CV / 269 / 2603.01812
Neural Operator-Grounded Continuous Tensor Function Representation and Its Applications
神经算子基础的连续张量函数表示及其应用
Abstract
Recently, continuous tensor functions have attracted increasing attention, because they can unifiedly represent data both on mesh grids and beyond mesh grids. However, since the mode-$n$ product is essentially discrete and linear, the potential of current continuous tensor function representations is still locked. To break this bottleneck, we suggest neural operator-grounded mode-$n$ operators as a continuous and nonlinear alternative to the discrete and linear mode-$n$ product. Instead of mapping the discrete core tensor to the discrete target tensor, the proposed mode-$n$ operator directly maps the continuous core tensor function to the continuous target tensor function, which provides a genuine continuous representation of real-world data and can ameliorate discretization artifacts. Empowered with continuous and nonlinear mode-$n$ operators, we propose a neural operator-grounded continuous tensor function representation (abbreviated as NO-CTR), which can more faithfully represent complex real-world data compared with classic discrete tensor representations and continuous tensor function representations. Theoretically, we also prove that any continuous tensor function can be approximated by NO-CTR. To examine the capability of NO-CTR, we suggest an NO-CTR-based multi-dimensional data completion model. Extensive experiments across various data on regular mesh grids (multi-spectral images and color videos), on mesh grids with different resolutions (Sentinel-2 images) and beyond mesh grids (point clouds) demonstrate the superiority of NO-CTR.
Chinese Translation
近年来,连续张量函数受到越来越多的关注,因为它们能够统一表示网格和非网格上的数据。然而,由于模-$n$ 乘积本质上是离散和线性的,当前的连续张量函数表示的潜力仍然被锁定。为了打破这一瓶颈,我们建议使用神经算子基础的模-$n$ 算子作为离散和线性模-$n$ 乘积的连续和非线性替代方案。所提出的模-$n$ 算子直接将连续核心张量函数映射到连续目标张量函数,而不是将离散核心张量映射到离散目标张量,这提供了对真实世界数据的真正连续表示,并可以改善离散化伪影。在连续和非线性模-$n$ 算子的支持下,我们提出了一种神经算子基础的连续张量函数表示(简称为 NO-CTR),与经典的离散张量表示和连续张量函数表示相比,它能够更真实地表示复杂的真实世界数据。从理论上讲,我们还证明了任何连续张量函数都可以通过 NO-CTR 进行逼近。为了检验 NO-CTR 的能力,我们提出了一种基于 NO-CTR 的多维数据补全模型。在常规网格(多光谱图像和彩色视频)、不同分辨率的网格(Sentinel-2 图像)以及超出网格的点云等各种数据上进行的大量实验表明了 NO-CTR 的优越性。
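For reference, here is the classical discrete, linear mode-$n$ product alongside a toy nonlinear per-fiber alternative. The MLP variant only gestures at the neural-operator idea; it is not the paper's parameterization, and the weight shapes are hypothetical:

```python
import numpy as np

def mode_n_product(T, M, n):
    """Classical mode-n product: contract mode n of T with the columns
    of M, so mode n's size changes from T.shape[n] to M.shape[0]."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, n)), 0, n)

def mode_n_mlp(T, n, W1, W2):
    """Nonlinear per-fiber map: run each mode-n fiber through a tiny
    MLP instead of multiplying by a fixed matrix, a loose sketch of a
    continuous, nonlinear mode-n operator."""
    fibers = np.moveaxis(T, n, -1)                  # fibers on the last axis
    return np.moveaxis(np.tanh(fibers @ W1) @ W2, -1, n)
```

The contrast the abstract draws is visible in the code: `mode_n_product` commits to one discrete linear matrix `M`, while the operator view replaces that matrix with a learned nonlinear map acting on whole fibers.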
cs.CV / 270 / 2603.01836
Affine Correspondences in Stereo Vision: Theory, Practice, and Limitations
立体视觉中的仿射对应关系:理论、实践与局限性
Abstract
Affine transformations have recently been used in stereo vision. They can be exploited in various computer vision applications, e.g., when estimating surface normals, homographies, and fundamental and essential matrices. Even full 3D reconstruction can be obtained by using affine correspondences. This paper first reviews the fundamental results on affine transformations and epipolar geometry. It then investigates how the transformation accuracy influences the quality of the 3D reconstruction. In addition, we propose novel techniques for estimating the local affine transformation from corresponding image directions; the fundamental matrix of the processed image pair can also be exploited. Both synthetic and real quantitative evaluations are carried out based on the accuracy of the reconstructed surface normals; for the latter, a special object containing three perpendicular planes with chessboard patterns is constructed. The evaluations show that the estimation accuracy is within a few degrees for realistic test cases. Special stereo poses and plane orientations are also evaluated in detail.
Chinese Translation
仿射变换最近被用于立体视觉。它们可以在各种计算机视觉应用中发挥作用,例如在估计表面法线、单应矩阵、基本矩阵和本质矩阵时。甚至可以通过使用仿射对应关系获得完整的三维重建。本文首先概述了仿射变换和极线几何的基本结论,然后研究了变换精度如何影响三维重建的质量。此外,我们提出了从对应图像方向估计局部仿射变换的新技术,与处理的图像对相关的基本矩阵也可以被利用。基于重建表面法线的精度,实施了合成和真实的定量评估;对于后者,构建了一个包含三个相互垂直的棋盘图案平面的特殊物体。评估结果表明,在现实测试案例中,估计精度约为几度。特殊的立体姿态和平面方向也进行了详细评估。
cs.CV / 271 / 2603.01839
LEAR: Learning Edge-Aware Representations for Event-to-LiDAR Localization
LEAR:用于事件到激光雷达定位的边缘感知表示学习
Abstract
Event cameras offer high-temporal-resolution sensing that remains reliable under high-speed motion and challenging lighting, making them promising for localization from LiDAR point clouds in GPS-denied and visually degraded environments. However, aligning sparse, asynchronous events with dense LiDAR maps is fundamentally ill-posed, as direct correspondence estimation suffers from modality gaps. We propose LEAR, a dual-task learning framework that jointly estimates edge structures and dense event-depth flow fields to bridge the sensing-modality divide. Instead of treating edges as a post-hoc aid, LEAR couples them with flow estimation through a cross-modal fusion mechanism that injects modality-invariant geometric cues into the motion representation, and an iterative refinement strategy that enforces mutual consistency between the two tasks over multiple update steps. This synergy produces edge-aware, depth-aligned flow fields that enable more robust and accurate pose recovery via Perspective-n-Point (PnP) solvers. On several popular and challenging datasets, LEAR achieves superior performance over the best prior method. The source code, trained models, and demo videos are made publicly available online.
Chinese Translation
事件相机提供高时间分辨率的感知,能够在高速运动和复杂光照条件下保持可靠性,使其在GPS缺失和视觉退化环境中从激光雷达点云进行定位时具有良好前景。然而,将稀疏的、异步的事件与密集的激光雷达地图对齐在本质上是一个不适定的问题,因为直接的对应估计受到模态差异的影响。我们提出了LEAR,一个双任务学习框架,联合估计边缘结构和密集的事件深度流场,以弥合感知模态之间的差距。LEAR并不将边缘视为事后辅助,而是通过一种跨模态融合机制将其与流估计结合,注入模态不变的几何线索到运动表示中,并采用迭代精炼策略,在多个更新步骤中强制两个任务之间的相互一致性。这种协同作用产生了边缘感知、深度对齐的流场,使得通过透视n点(Perspective-n-Point, PnP)求解器实现更稳健和准确的姿态恢复。在多个流行且具有挑战性的数据集上,LEAR的表现优于最佳的先前方法。源代码、训练模型和演示视频已在网上公开。
cs.CV / 272 / 2603.01840
FireRed-OCR Technical Report
FireRed-OCR 技术报告
Wu, Hao, Lou, Haoran, Li, Xinyue, Zhong, Zuodong, Sun, Zhaojun, Chen, Phellon, Zhou, Xuanhe, Zuo, Kai, Chen, Yibo, Tang, Xu, Hu, Yao, Zhou, Boxiang, Wu, Jian, Wu, Yongji, Yu, Wenxin, Liu, Yingmiao, Huang, Yuhao, Xu, Manjie, Liu, Gang, Ma, Yidong, Sun, Zhichao, Qiao, Changhao
Abstract
Large Vision-Language Models (VLMs) have demonstrated impressive general capabilities but frequently suffer from ``structural hallucination'' when processing complex documents, limiting their utility in industrial OCR applications. In this paper, we introduce FireRed-OCR, a systematic framework designed to transform general-purpose VLMs (based on Qwen3-VL) into pixel-precise structural document parsing experts. To address the scarcity of high-quality structured data, we construct a ``Geometry + Semantics'' Data Factory. Unlike traditional random sampling, our pipeline leverages geometric feature clustering and multi-dimensional tagging to synthesize and curate a highly balanced dataset, effectively handling long-tail layouts and rare document types. Furthermore, we propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation. This curriculum includes: (1) Multi-task Pre-alignment to ground the model's understanding of document structure; (2) Specialized SFT for standardizing full-image Markdown output; and (3) Format-Constrained Group Relative Policy Optimization (GRPO), which utilizes reinforcement learning to enforce strict syntactic validity and structural integrity (e.g., table closure, formula syntax). Extensive evaluations on OmniDocBench v1.5 demonstrate that FireRed-OCR achieves state-of-the-art performance with an overall score of 92.94\%, significantly outperforming strong baselines such as DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order metrics. We open-source our code and model weights to facilitate the ``General VLM to Specialized Structural Expert'' paradigm.
Chinese Translation
大型视觉语言模型(VLMs)展示了令人印象深刻的通用能力,但在处理复杂文档时常常遭遇“结构幻觉”,限制了它们在工业 OCR 应用中的实用性。本文介绍了 FireRed-OCR,一个旨在将通用 VLM(基于 Qwen3-VL)转变为像素精确的结构文档解析专家的系统框架。为了应对高质量结构化数据的稀缺,我们构建了一个“几何 + 语义”数据工厂。与传统的随机采样不同,我们的管道利用几何特征聚类和多维标记来合成和策划一个高度平衡的数据集,有效处理长尾布局和稀有文档类型。此外,我们提出了一种三阶段渐进训练策略,引导模型从像素级感知到逻辑结构生成。该课程包括:(1) 多任务预对齐,以巩固模型对文档结构的理解;(2) 专门的 SFT,用于标准化全图 Markdown 输出;以及 (3) 格式约束的组相对策略优化(GRPO),利用强化学习来强制执行严格的句法有效性和结构完整性(例如,表格闭合、公式语法)。在 OmniDocBench v1.5 上的广泛评估表明,FireRed-OCR 以 92.94\% 的整体得分达到了最先进的性能,显著超越了 DeepSeek-OCR 2 和 OCRVerse 等强基线,在文本、公式、表格和阅读顺序指标上表现优异。我们开源了我们的代码和模型权重,以促进“通用 VLM 到专门结构专家”的范式。
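The kind of syntactic validity a format-constrained reward could enforce can be sketched with two toy rules (balanced LaTeX environments, rectangular Markdown tables). The real GRPO reward is not specified at this level of detail, so treat both rules as hypothetical examples:

```python
import re

def format_reward(md):
    """Return 1.0 only if the generated report passes two toy
    structural checks; 0.0 otherwise."""
    # Rule 1: every \begin{env} has a matching \end{env}.
    begins = re.findall(r'\\begin\{(\w+)\}', md)
    ends = re.findall(r'\\end\{(\w+)\}', md)
    if sorted(begins) != sorted(ends):
        return 0.0
    # Rule 2: Markdown table rows must all have the same cell count.
    rows = [l.strip() for l in md.splitlines() if l.strip().startswith('|')]
    if len({r.strip('|').count('|') for r in rows}) > 1:
        return 0.0
    return 1.0
```

A binary reward of this shape slots naturally into group-relative policy optimization: within a sampled group of candidate outputs, the syntactically valid ones are advantaged over the malformed ones.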
cs.CV / 273 / 2603.01847
GroupEnsemble: Efficient Uncertainty Estimation for DETR-based Object Detection
GroupEnsemble:基于DETR的目标检测的高效不确定性估计
Abstract
Detection Transformer (DETR) and its variants show strong performance on object detection, a key task for autonomous systems. However, a critical limitation of these models is that their confidence scores only reflect semantic uncertainty, failing to capture the equally important spatial uncertainty. This results in an incomplete assessment of the detection reliability. On the other hand, Deep Ensembles can tackle this by providing high-quality spatial uncertainty estimates. However, their immense memory consumption makes them impractical for real-world applications. A cheaper alternative, Monte Carlo (MC) Dropout, suffers from high latency due to the need for multiple forward passes during inference to estimate uncertainty. To address these limitations, we introduce GroupEnsemble, an efficient and effective uncertainty estimation method for DETR-like models. GroupEnsemble simultaneously predicts multiple individual detection sets by feeding additional diverse groups of object queries to the transformer decoder during inference. Each query group is transformed by the shared decoder in isolation and predicts a complete detection set for the same input. An attention mask is applied to the decoder to prevent inter-group query interactions, ensuring each group detects independently to achieve reliable ensemble-based uncertainty estimation. By leveraging the decoder's inherent parallelism, GroupEnsemble efficiently estimates uncertainty in a single forward pass without sequential repetition. We validated our method under autonomous driving scenes and common daily scenes using the Cityscapes and COCO datasets, respectively. The results show that a hybrid approach combining MC-Dropout and GroupEnsemble outperforms Deep Ensembles on several metrics at a fraction of the cost. The code is available at https://github.com/yutongy98/GroupEnsemble.
Chinese Translation
检测变换器(Detection Transformer,DETR)及其变体在目标检测这一自主系统的关键任务上表现出色。然而,这些模型的一个关键限制在于它们的置信分数仅反映语义不确定性,未能捕捉同样重要的空间不确定性。这导致了对检测可靠性评估的不完整。另一方面,深度集成(Deep Ensembles)可以通过提供高质量的空间不确定性估计来解决这一问题。然而,它们巨大的内存消耗使其在实际应用中不切实际。一种更便宜的替代方案,蒙特卡罗(Monte Carlo,MC)Dropout,由于在推理过程中需要多次前向传播来估计不确定性,导致高延迟。为了解决这些限制,我们提出了GroupEnsemble,这是一种针对DETR类模型的高效且有效的不确定性估计方法。GroupEnsemble通过在推理过程中向变换器解码器输入额外的多样化对象查询组,同时预测多个独立的检测集。每个查询组由共享解码器独立转换,并为相同输入预测完整的检测集。解码器上应用了注意力掩码,以防止组间查询交互,确保每个组独立检测,从而实现可靠的基于集成的不确定性估计。通过利用解码器固有的并行性,GroupEnsemble在单次前向传播中高效地估计不确定性,而无需顺序重复。我们在自主驾驶场景和常见日常场景中使用Cityscapes和COCO数据集验证了我们的方法。结果表明,结合MC-Dropout和GroupEnsemble的混合方法在多个指标上优于深度集成,且成本仅为其一小部分。代码可在https://github.com/yutongy98/GroupEnsemble获取。
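The inter-group isolation described above hinges on a block-diagonal self-attention mask over the decoder queries, so each query group attends only to itself and yields an independent ensemble member. A minimal sketch of such a mask (our own illustration; the function name and shapes are assumptions, not the released code):

```python
import numpy as np

def group_attention_mask(num_groups: int, queries_per_group: int) -> np.ndarray:
    """Boolean self-attention mask over decoder object queries.

    True = attention allowed. Queries may only attend to queries in their
    own group, so each group predicts an independent detection set and the
    ensemble is obtained in a single forward pass.
    """
    n = num_groups * queries_per_group
    mask = np.zeros((n, n), dtype=bool)
    for g in range(num_groups):
        s = g * queries_per_group
        mask[s:s + queries_per_group, s:s + queries_per_group] = True
    return mask

# Example: 3 ensemble members with 4 queries each -> 12x12 block-diagonal mask.
mask = group_attention_mask(3, 4)
```

In a real DETR decoder this boolean mask would be converted to additive `-inf` logits before the softmax; the block-diagonal structure is what lets the shared decoder process all groups in parallel.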
cs.CV / 274 / 2603.01864
Streaming Real-Time Trajectory Prediction Using Endpoint-Aware Modeling
基于端点感知建模的流式实时轨迹预测
Abstract
Future trajectories of neighboring traffic agents have a significant influence on the path planning and decision-making of autonomous vehicles. While trajectory forecasting is a well-studied field, research mainly focuses on snapshot-based prediction, where each scenario is treated independently of its global temporal context. However, real-world autonomous driving systems need to operate in a continuous setting, requiring real-time processing of data streams with low latency and consistent predictions over successive timesteps. We leverage this continuous setting to propose a lightweight yet highly accurate streaming-based trajectory forecasting approach. We integrate valuable information from previous predictions with a novel endpoint-aware modeling scheme. Our temporal context propagation uses the trajectory endpoints of the previous forecasts as anchors to extract targeted scenario context encodings. Our approach efficiently guides its scene encoder to extract highly relevant context information without needing refinement iterations or segment-wise decoding. Our experiments highlight that our approach effectively relays information across consecutive timesteps. Unlike methods using multi-stage refinement processing, our approach significantly reduces inference latency, making it well-suited for real-world deployment. We achieve state-of-the-art streaming trajectory prediction results on the Argoverse 2 multi-agent and single-agent benchmarks, while requiring substantially fewer resources.
Chinese Translation
邻近交通代理的未来轨迹对自动驾驶车辆的路径规划和决策具有重要影响。尽管轨迹预测是一个研究较为成熟的领域,但研究主要集中在快照式预测上,其中每个场景被视为与其全局时间上下文无关。然而,现实世界中的自动驾驶系统需要在连续环境中运行,要求对数据流进行实时处理,以实现低延迟和在连续时间步上的一致预测。我们利用这一连续环境,提出了一种轻量级但高度准确的基于流的轨迹预测方法。我们结合了来自先前预测的有价值信息与一种新颖的端点感知建模方案。我们的时间上下文传播使用先前预测的轨迹端点作为锚点,以提取目标场景上下文编码。我们的方法有效地引导其场景编码器提取高度相关的上下文信息,而无需细化迭代或分段解码。我们的实验表明,我们的方法有效地在连续时间步之间传递信息。与使用多阶段细化处理的方法不同,我们的方法显著降低了推理延迟,使其非常适合于现实世界的部署。我们在Argoverse 2多代理和单代理基准测试上实现了最先进的流轨迹预测结果,同时所需资源显著减少。
cs.CV / 275 / 2603.01878
CTForensics: A Comprehensive Dataset and Method for AI-Generated CT Image Detection
CTForensics:一种用于AI生成CT图像检测的综合数据集和方法
Abstract
With the rapid development of generative AI in medical imaging, synthetic Computed Tomography (CT) images have demonstrated great potential in applications such as data augmentation and clinical diagnosis, but they also introduce serious security risks. Despite the increasing security concerns, existing studies on CT forgery detection are still limited and fail to adequately address real-world challenges. These limitations are mainly reflected in two aspects: the absence of datasets that can effectively evaluate model generalization to reflect the real-world application requirements, and the reliance on detection methods designed for natural images that are insensitive to CT-specific forgery artifacts. In light of this, we propose CTForensics, a comprehensive dataset designed to systematically evaluate the generalization capability of CT forgery detection methods, which includes ten diverse CT generative methods. Moreover, we introduce the Enhanced Spatial-Frequency CT Forgery Detector (ESF-CTFD), an efficient CNN-based network that captures forgery cues across the wavelet, spatial, and frequency domains. First, it transforms the input CT image into three scales and extracts features at each scale via the Wavelet-Enhanced Central Stem. Then, starting from the largest-scale features, the Spatial Process Block gradually performs feature fusion with the smaller-scale ones. Finally, the Frequency Process Block learns frequency-domain information for predicting the final results. Experiments demonstrate that ESF-CTFD consistently outperforms existing methods and exhibits superior generalization across different CT generative models.
Chinese Translation
随着生成性人工智能在医学影像领域的快速发展,合成的计算机断层扫描(CT)图像在数据增强和临床诊断等应用中展现出巨大潜力,但同时也引入了严重的安全风险。尽管安全问题日益受到关注,现有的CT伪造检测研究仍然有限,未能充分应对现实世界的挑战。这些局限性主要体现在两个方面:缺乏能够有效评估模型泛化能力的数据集,以反映现实应用需求,以及依赖于针对自然图像设计的检测方法,这些方法对CT特有的伪造伪影不敏感。基于此,我们提出了CTForensics,一个旨在系统评估CT伪造检测方法泛化能力的综合数据集,其中包括十种多样的CT生成方法。此外,我们还引入了增强型空间频率CT伪造检测器(Enhanced Spatial-Frequency CT Forgery Detector,ESF-CTFD),这是一种高效的基于卷积神经网络(CNN)的神经网络,能够在小波、空间和频率域捕捉伪造线索。首先,它将输入的CT图像转换为三个尺度,并通过小波增强中央干线提取每个尺度的特征。然后,从最大尺度特征开始,空间处理块逐渐与较小尺度的特征进行融合。最后,频率处理块学习频域信息以预测最终结果。实验表明,ESF-CTFD在不同CT生成模型上始终优于现有方法,并展现出卓越的泛化能力。
cs.CV / 276 / 2603.01890
Resolving Blind Inverse Problems under Dynamic Range Compression via Structured Forward Operator Modeling
通过结构化前向算子建模解决动态范围压缩下的盲逆问题
Abstract
Recovering radiometric fidelity from unknown dynamic range compression (UDRC), such as low-light enhancement and HDR reconstruction, is a challenging blind inverse problem, due to the unknown forward model and irreversible information loss introduced by compression. To address this challenge, we first identify monotonicity as the fundamental physical invariant shared across UDRC tasks. Leveraging this insight, we introduce the \textbf{cascaded monotonic Bernstein} (CaMB) operator to parameterize the unknown forward model. CaMB enforces monotonicity as a hard architectural inductive bias, constraining optimization to physically consistent mappings and enabling robust and stable operator estimation. We further integrate CaMB with a plug-and-play diffusion framework, proposing \textbf{CaMB-Diff}. Within this framework, the diffusion model serves as a powerful geometric prior for structural and semantic recovery, while CaMB explicitly models and corrects radiometric distortions through a physically grounded forward operator. Extensive experiments on a variety of zero-shot UDRC tasks, including low-light enhancement, low-field MRI enhancement, and HDR reconstruction, demonstrate that CaMB-Diff significantly outperforms state-of-the-art zero-shot baselines in terms of both signal fidelity and physical consistency. Moreover, we empirically validate the effectiveness of the proposed CaMB parameterization in accurately modeling the unknown forward operator.
Chinese Translation
从未知动态范围压缩(UDRC)中恢复辐射度保真度,如低光增强和高动态范围(HDR)重建,是一个具有挑战性的盲逆问题,因为压缩引入了未知的前向模型和不可逆的信息损失。为了解决这一挑战,我们首先识别单调性作为UDRC任务中共享的基本物理不变量。基于这一洞察,我们引入了级联单调Bernstein(CaMB)算子来参数化未知的前向模型。CaMB将单调性作为一种硬性架构归纳偏置,限制优化到物理一致的映射,并实现稳健和稳定的算子估计。我们进一步将CaMB与即插即用的扩散框架相结合,提出了CaMB-Diff。在该框架中,扩散模型作为结构和语义恢复的强大几何先验,而CaMB通过物理基础的前向算子明确建模和校正辐射度失真。在多种零样本UDRC任务上的大量实验,包括低光增强、低场MRI增强和HDR重建,表明CaMB-Diff在信号保真度和物理一致性方面显著优于最先进的零样本基线。此外,我们通过实验证实了所提出的CaMB参数化在准确建模未知前向算子方面的有效性。
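The hard monotonicity constraint named above can be realized with Bernstein polynomials whose coefficients are a cumulative sum of non-negative increments, composed in a cascade. The sketch below is our own minimal parameterization under that idea, not the paper's architecture; function names and the rescaling between stages are assumptions:

```python
import numpy as np
from math import comb

def bernstein_monotone(x, raw_increments, c0=0.0):
    """Monotone map on [0,1] via a Bernstein polynomial.

    Coefficients are a cumulative sum of non-negative increments (softplus
    of unconstrained parameters), which guarantees the polynomial is
    non-decreasing -- monotonicity as a hard architectural inductive bias.
    """
    inc = np.log1p(np.exp(raw_increments))                 # softplus >= 0
    coeffs = c0 + np.concatenate([[0.0], np.cumsum(inc)])  # non-decreasing
    n = len(coeffs) - 1
    basis = np.stack([comb(n, k) * x**k * (1 - x)**(n - k)
                      for k in range(n + 1)])
    return coeffs @ basis

def cascade(x, layers):
    """Compose several monotone Bernstein maps (monotonicity is preserved)."""
    for raw in layers:
        y = bernstein_monotone(x, raw)
        # rescale back to [0,1] so the next stage's Bernstein basis is valid
        x = (y - y.min()) / (y.max() - y.min() + 1e-12)
    return x

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 64)
out = cascade(x, [rng.normal(size=6) for _ in range(3)])
```

Whatever the unconstrained parameters are, the output is a monotone map, so gradient-based operator estimation stays inside physically consistent mappings.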
cs.CV / 277 / 2603.01893
Generative Visual Chain-of-Thought for Image Editing
生成视觉思维链用于图像编辑
Abstract
Existing image editing methods struggle to perceive where to edit, especially under complex scenes and nuanced spatial instructions. To address this issue, we propose Generative Visual Chain-of-Thought (GVCoT), a unified framework that performs native visual reasoning by first generating spatial cues to localize the target region and then executing the edit. Unlike prior text-only CoT or tool-dependent visual CoT paradigms, GVCoT jointly optimizes visual tokens generated during the reasoning and editing phases in an end-to-end manner. This design fosters the emergence of innate spatial reasoning ability and enables more effective utilization of visual-domain cues. The main challenge of training GVCoT lies in the scarcity of large-scale editing data with precise edit region annotations; to this end, we construct GVCoT-Edit-Instruct, a dataset of 1.8M high-quality samples spanning 19 tasks. We adopt a progressive training strategy: supervised fine-tuning to build foundational localization ability in the reasoning trace before final editing, followed by reinforcement learning to further improve reasoning and editing quality. Finally, we introduce SREdit-Bench, a new benchmark designed to comprehensively stress-test models under sophisticated scenes and fine-grained referring expressions. Experiments demonstrate that GVCoT consistently outperforms state-of-the-art models on SREdit-Bench and ImgEdit. We hope our GVCoT will inspire future research toward interpretable and precise image editing.
Chinese Translation
现有的图像编辑方法在感知编辑位置方面存在困难,尤其是在复杂场景和细微空间指令下。为了解决这个问题,我们提出了生成视觉思维链(Generative Visual Chain-of-Thought, GVCoT),这是一个统一的框架,通过首先生成空间线索来定位目标区域,然后执行编辑,从而实现原生视觉推理。与之前仅依赖文本的思维链(CoT)或依赖工具的视觉思维链范式不同,GVCoT在推理和编辑阶段以端到端的方式共同优化生成的视觉标记。这种方式促进了内在空间推理能力的产生,并使得视觉领域线索的更有效利用成为可能。训练GVCoT的主要挑战在于缺乏具有精确编辑区域注释的大规模编辑数据;为此,我们构建了GVCoT-Edit-Instruct,这是一个包含180万高质量样本的数据集,涵盖19个任务。我们采用渐进式训练策略:在最终编辑之前进行监督微调,以建立推理轨迹中的基础定位能力,然后通过强化学习进一步提高推理和编辑质量。最后,我们引入了SREdit-Bench,这是一个新的基准,旨在全面测试模型在复杂场景和细粒度指代表达下的表现。实验表明,GVCoT在SREdit-Bench和ImgEdit上始终优于最先进的模型。我们希望GVCoT能够激励未来对可解释和精确图像编辑的研究。
cs.CV / 278 / 2603.01913
Zero-shot Low-Field MRI Enhancement via Diffusion-Based Adaptive Contrast Transport
基于扩散的自适应对比传输实现零样本低场MRI增强
Abstract
Low-field (LF) magnetic resonance imaging (MRI) democratizes access to diagnostic imaging but is fundamentally limited by low signal-to-noise ratio and significant tissue contrast distortion due to field-dependent relaxation dynamics. Reconstructing high-field (HF) quality images from LF data is a blind inverse problem, severely challenged by the scarcity of paired training data and the unknown, non-linear contrast transformation operator. Existing zero-shot methods, which assume simplified linear degradation, often fail to recover authentic tissue contrast. In this paper, we propose DACT (Diffusion-Based Adaptive Contrast Transport), a novel zero-shot framework that restores HF-quality images without paired supervision. DACT synergizes a pre-trained HF diffusion prior, which ensures anatomical fidelity, with a physically-informed adaptive forward model. Specifically, we introduce a differentiable Sinkhorn optimal transport module that explicitly models and corrects the intensity distribution shift between LF and HF domains during the reverse diffusion process. This allows the framework to dynamically learn the intractable contrast mapping while preserving topological consistency. Extensive experiments on simulated and real clinical LF datasets demonstrate that DACT achieves state-of-the-art performance, yielding reconstructions with superior structural detail and correct tissue contrast.
Chinese Translation
低场(LF)磁共振成像(MRI)使诊断成像的获取更加普及,但其性能从根本上受限于低信噪比以及由场依赖弛豫动态引起的显著组织对比失真。从LF数据重建高场(HF)质量图像是一个盲逆问题,受到配对训练数据稀缺和未知的非线性对比变换算子的严重挑战。现有的零样本方法假设简化的线性退化,往往无法恢复真实的组织对比。本文提出了DACT(基于扩散的自适应对比传输),一种新颖的零样本框架,能够在没有配对监督的情况下恢复HF质量图像。DACT结合了预训练的HF扩散先验,以确保解剖结构的准确性,并采用物理信息驱动的自适应前向模型。具体而言,我们引入了一个可微分的Sinkhorn最优传输模块,在反向扩散过程中明确建模并校正LF和HF域之间的强度分布偏移。这使得该框架能够动态学习难以处理的对比映射,同时保持拓扑一致性。在模拟和真实临床LF数据集上的大量实验表明,DACT实现了最先进的性能,生成的重建图像具有优越的结构细节和正确的组织对比。
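At its core, the Sinkhorn module above amounts to entropy-regularized optimal transport between intensity distributions. A minimal NumPy sketch of the classic Sinkhorn iterations on toy 1-D histograms (illustrative only; the paper's module is a differentiable version embedded in the reverse diffusion sampler, and all names and parameter values here are our assumptions):

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, iters=500):
    """Entropy-regularized OT plan between histograms a (rows) and b (cols).

    Alternating scaling of a Gibbs kernel until the plan's marginals match
    a and b; the plan describes how LF-domain intensity mass should be
    transported to match the HF-domain distribution.
    """
    K = np.exp(-cost / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]      # transport plan

# Toy 1-D intensity histograms on a shared grid.
grid = np.linspace(0.0, 1.0, 16)
cost = (grid[:, None] - grid[None, :]) ** 2
a = np.ones(16) / 16                        # uniform "LF" intensities
b = np.exp(-(grid - 0.7) ** 2 / 0.02)       # peaked "HF" intensities
b /= b.sum()
plan = sinkhorn(cost, a, b)
```

The entropic regularizer `eps` trades plan sharpness for numerical stability and, in a learning setting, differentiability.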
cs.CV / 279 / 2603.01928
LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving
LaST-VLA:在自主驾驶中思考潜在时空空间的视觉-语言-行动
Abstract
While Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in continuous hidden space. However, without explicit intermediate constraints, standard latent CoT often operates as a physics-agnostic representation. To address this, we propose the Latent Spatio-Temporal VLA (LaST-VLA), a framework shifting the reasoning paradigm from discrete symbolic processing into a physically grounded Latent Spatio-Temporal CoT. By implementing a dual-feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. The model is trained with a progressive SFT strategy that transitions from feature alignment to trajectory generation, and is refined via Reinforcement Learning with Group Relative Policy Optimization (GRPO) to ensure safety and rule compliance. LaST-VLA sets a new record on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), while excelling in spatial-temporal reasoning on the SURDS and NuDynamics benchmarks.
Chinese Translation
视觉-语言-行动(VLA)模型通过统一感知和规划,彻底改变了自主驾驶。然而,它们对显式文本链式思维(Chain-of-Thought, CoT)的依赖导致了语义-感知的解耦和感知-符号的冲突。最近向潜在推理的转变试图通过在连续隐空间中思考来绕过这些瓶颈。然而,在没有显式中间约束的情况下,标准的潜在CoT往往作为一种与物理无关的表示。为了解决这个问题,我们提出了潜在时空VLA(LaST-VLA),一个将推理范式从离散符号处理转变为物理基础的潜在时空CoT的框架。通过实施双特征对齐机制,我们将几何约束从3D基础模型和来自世界模型的动态前瞻直接提取到潜在空间中。该模型结合渐进式SFT训练策略,从特征对齐过渡到轨迹生成,并通过基于群体相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习进行精炼,以确保安全和规则遵从。LaST-VLA在NAVSIM v1(91.3 PDMS)和NAVSIM v2(87.1 EPDMS)上创下新纪录,同时在SURDS和NuDynamics基准测试中表现出色,展现了出色的时空推理能力。
cs.CV / 280 / 2603.01932
BAWSeg: A UAV Multispectral Benchmark for Barley Weed Segmentation
BAWSeg:一种用于大麦杂草分割的无人机多光谱基准
Abstract
Accurate weed mapping in cereal fields requires pixel-level segmentation from UAV imagery that remains reliable across fields, seasons, and illumination. Existing multispectral pipelines often depend on thresholded vegetation indices, which are brittle under radiometric drift and mixed crop-weed pixels, or on single-stream CNN and Transformer backbones that ingest stacked bands and indices, where radiance cues and normalized index cues interfere and reduce sensitivity to small weed clusters embedded in crop canopies. We propose VISA (Vegetation-Index and Spectral Attention), a two-stream segmentation network that decouples these cues and fuses them at native resolution. The radiance stream learns from calibrated five-band reflectance using residual spectral-spatial attention to preserve fine textures and row boundaries that are attenuated by ratio indices. The index stream operates on vegetation-index maps with windowed self-attention to model local structure efficiently, state-space layers to propagate field-scale context without quadratic attention cost, and Slot Attention to form stable region descriptors that improve discrimination of sparse weeds under canopy mixing. To support supervised training and deployment-oriented evaluation, we introduce BAWSeg, a four-year UAV multispectral dataset collected over commercial barley paddocks in Western Australia, providing radiometrically calibrated blue, green, red, red edge, and near-infrared orthomosaics, derived vegetation indices, and dense crop, weed, and other labels with leakage-free block splits. On BAWSeg, VISA achieves 75.6% mIoU and 63.5% weed IoU with 22.8M parameters, outperforming a multispectral SegFormer-B1 baseline by 1.2 mIoU and 1.9 weed IoU. Under cross-plot and cross-year protocols, VISA maintains 71.2% and 69.2% mIoU, respectively. The BAWSeg data, VISA code, and trained models will be released upon publication.
Chinese Translation
在谷物田中进行准确的杂草映射需要从无人机影像中进行像素级分割,这一过程必须在不同的田地、季节和光照条件下保持可靠。现有的多光谱处理流程通常依赖于阈值化的植被指数,这在辐射漂移和混合作物-杂草像素下表现脆弱,或者依赖于单流卷积神经网络(CNN)和变换器(Transformer)骨干网络,这些网络处理堆叠的波段和指数,导致辐射线索和标准化指数线索相互干扰,从而降低了对嵌入作物冠层中的小杂草簇的敏感性。我们提出了VISA(Vegetation-Index and Spectral Attention),一种双流分割网络,能够解耦这些线索并在原始分辨率下进行融合。辐射流通过使用残差光谱-空间注意力从经过校准的五波段反射率中学习,以保留被比率指数削弱的细微纹理和行边界。指数流在植被指数图上运行,采用窗口自注意力高效建模局部结构,状态空间层在不增加二次注意力成本的情况下传播田间尺度的上下文,以及槽注意力(Slot Attention)形成稳定的区域描述符,从而提高在冠层混合下对稀疏杂草的区分能力。为了支持监督训练和面向部署的评估,我们引入了BAWSeg,这是一个在西澳大利亚商业大麦田中收集的为期四年的无人机多光谱数据集,提供经过辐射校准的蓝、绿、红、红边和近红外正射拼接图、衍生植被指数,以及密集的作物、杂草和其他标签,且具有无泄漏的块分割。在BAWSeg数据集上,VISA达到了75.6%的mIoU和63.5%的杂草IoU,参数量为2280万,超越了多光谱SegFormer-B1基线模型1.2 mIoU和1.9杂草IoU。在跨地块和跨年份的协议下,VISA分别保持了71.2%和69.2%的mIoU。BAWSeg数据、VISA代码和训练模型将在发表后公开。
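The index stream above consumes vegetation-index maps derived from the calibrated bands. As a small self-contained illustration, the standard NDVI computed from the red and near-infrared reflectance bands (the toy band values and the `eps` stabilizer are our assumptions, not BAWSeg data):

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalized Difference Vegetation Index from reflectance bands.

    NDVI = (NIR - Red) / (NIR + Red); values near +1 indicate dense
    vegetation, values near 0 bare soil. eps guards against division
    by zero on dark pixels.
    """
    return (nir - red) / (nir + red + eps)

# Toy 2x2 reflectance patch: top row vegetation (high NIR, low red),
# bottom row bare soil (balanced bands).
nir = np.array([[0.50, 0.45], [0.20, 0.18]])
red = np.array([[0.05, 0.06], [0.18, 0.17]])
index_map = ndvi(nir, red)
```

Ratio indices like this are exactly the cues that are robust to illumination but attenuate fine texture, which is why VISA keeps them in a separate stream from raw radiance.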
cs.CV / 281 / 2603.01944
MobileMold: A Smartphone-Based Microscopy Dataset for Food Mold Detection
MobileMold:一种基于智能手机的食品霉菌检测显微镜数据集
Abstract
Smartphone clip-on microscopes turn everyday devices into low-cost, portable imaging systems that can even reveal fungal structures at the microscopic level, enabling mold inspection beyond unaided visual checks. In this paper, we introduce MobileMold, an open smartphone-based microscopy dataset for food mold detection and food classification. MobileMold contains 4,941 handheld microscopy images spanning 11 food types, 4 smartphones, 3 microscopes, and diverse real-world conditions. Beyond the dataset release, we establish baselines for (i) mold detection and (ii) food-type classification, including a multi-task setting that predicts both attributes. Across multiple pretrained deep learning architectures and augmentation strategies, we obtain near-ceiling performance (accuracy = 0.9954, F1 = 0.9954, MCC = 0.9907), validating the utility of our dataset for detecting food spoilage. To increase transparency, we complement our evaluation with saliency-based visual explanations highlighting mold regions associated with the model's predictions. MobileMold aims to contribute to research on accessible food-safety sensing, mobile imaging, and exploring the potential of smartphones enhanced with attachments.
Chinese Translation
智能手机夹式显微镜将日常设备转变为低成本、便携的成像系统,能够揭示微观层面的真菌结构,从而使霉菌检查超越了肉眼观察。在本文中,我们介绍了MobileMold,一个开放的基于智能手机的显微镜数据集,用于食品霉菌检测和食品分类。MobileMold包含4,941幅手持显微镜图像,涵盖11种食品类型、4款智能手机、3种显微镜以及多样的真实世界条件。除了数据集的发布,我们还建立了(i)霉菌检测和(ii)食品类型分类的基线,包括一个多任务设置,用于预测这两个属性。在多种预训练深度学习架构和增强策略下,我们获得了接近极限的性能(准确率 = 0.9954,F1 = 0.9954,MCC = 0.9907),验证了我们数据集在检测食品变质方面的实用性。为了增加透明度,我们通过基于显著性的视觉解释补充了我们的评估,突出了与模型预测相关的霉菌区域。MobileMold旨在为可及的食品安全感知、移动成像研究以及探索增强附件的智能手机的潜力做出贡献。
cs.CV / 282 / 2603.01947
physfusion: A Transformer-based Dual-Stream Radar and Vision Fusion Framework for Open Water Surface Object Detection
physfusion:基于变换器的双流雷达与视觉融合框架用于开放水面目标检测
Abstract
Detecting water-surface targets for Unmanned Surface Vehicles (USVs) is challenging due to wave clutter, specular reflections, and weak appearance cues in long-range observations. Although 4D millimeter-wave radar complements cameras under degraded illumination, maritime radar point clouds are sparse and intermittent, with reflectivity attributes exhibiting heavy-tailed variations under scattering and multipath, making conventional fusion designs struggle to exploit radar cues effectively. We propose PhysFusion, a physics-informed radar-image detection framework for water-surface perception. The framework integrates: (1) a Physics-Informed Radar Encoder (PIR Encoder) with an RCS Mapper and Quality Gate, transforming per-point radar attributes into compact scattering priors and predicting point-wise reliability for robust feature learning under clutter; (2) a Radar-guided Interactive Fusion Module (RIFM) performing query-level radar-image fusion between semantically enriched radar features and multi-scale visual features, with the radar branch modeled by a dual-stream backbone including a point-based local stream and a transformer-based global stream using Scattering-Aware Self-Attention (SASA); and (3) a Temporal Query Aggregation module (TQA) aggregating frame-wise fused queries over a short temporal window for temporally consistent representations. Experiments on WaterScenes and FLOW demonstrate that PhysFusion achieves 59.7% mAP50:95 and 90.3% mAP50 on WaterScenes (T=5 radar history) using 5.6M parameters and 12.5G FLOPs, and reaches 94.8% mAP50 and 46.2% mAP50:95 on FLOW under radar+camera setting. Ablation studies quantify the contributions of PIR Encoder, SASA-based global reasoning, and RIFM.
Chinese Translation
由于波浪杂波、镜面反射和远程观察中的微弱外观线索,无人水面车辆(USVs)在检测水面目标时面临挑战。尽管4D毫米波雷达在光照条件恶化时可以补充摄像头,但海洋雷达点云稀疏且间歇,反射率属性在散射和多径下表现出重尾变化,使得传统的融合设计难以有效利用雷达线索。我们提出了PhysFusion,一个基于物理知识的雷达-图像检测框架,用于水面感知。该框架集成了:(1) 一个物理知识雷达编码器(PIR Encoder),配备RCS映射器和质量门,将每个点的雷达属性转化为紧凑的散射先验,并预测逐点可靠性,以在杂波下进行稳健的特征学习;(2) 一个雷达引导的交互融合模块(RIFM),在语义丰富的雷达特征和多尺度视觉特征之间执行查询级雷达-图像融合,雷达分支由一个双流主干模型构成,包括基于点的局部流和基于变换器的全局流,使用散射感知自注意力(SASA);(3) 一个时间查询聚合模块(TQA),在短时间窗口内聚合逐帧融合查询,以获得时间一致的表示。在WaterScenes和FLOW上的实验表明,PhysFusion在WaterScenes(T=5雷达历史)上达到了59.7%的mAP50:95和90.3%的mAP50,使用了560万参数和12.5G FLOPs,并在雷达+摄像头设置下在FLOW上达到了94.8%的mAP50和46.2%的mAP50:95。消融研究量化了PIR Encoder、基于SASA的全局推理和RIFM的贡献。
cs.CV / 283 / 2603.01948
PreSight: Preoperative Outcome Prediction for Parkinson's Disease via Region-Prior Morphometry and Patient-Specific Weighting
PreSight:通过区域优先形态测量和患者特定加权进行帕金森病的术前结果预测
Abstract
Preoperative improvement rate prediction for Parkinson's disease surgery is clinically important yet difficult because imaging signals are subtle and patients are heterogeneous. We address this setting, where only information available before surgery is used, and the goal is to predict patient-specific postoperative motor benefit. We present PreSight, a presurgical outcome model that fuses clinical priors with preoperative MRI and deformation-based morphometry (DBM) and adapts regional importance through a patient-specific weighting module. The model produces end-to-end, calibrated, decision-ready predictions with patient-level explanations. We evaluate PreSight on a real-world two-center cohort of 400 subjects with multimodal presurgical inputs and postoperative improvement labels. PreSight outperforms strong clinical, imaging-only, and multimodal baselines. It attains 88.89% accuracy on internal validation and 85.29% on an external-center test for responder classification and shows better probability calibration and higher decision-curve net benefit. Ablations and analyses confirm the contribution of DBM and the patient-specific weighting module and indicate that the model emphasizes disease-relevant regions in a patient-specific manner. These results demonstrate that integrating clinical prior knowledge with region-adaptive morphometry enables reliable presurgical decision support in routine practice.
Chinese Translation
帕金森病手术的术前改善率预测在临床上具有重要意义,但由于影像信号微妙且患者异质性强,预测难度较大。我们针对这一情境进行研究,仅使用术前可用的信息,目标是预测患者特定的术后运动收益。我们提出了PreSight,一种融合临床先验知识、术前MRI和基于变形的形态测量(DBM)的术前结果模型,并通过患者特定加权模块调整区域重要性。该模型生成端到端、经过校准的、可决策的预测,并提供患者级别的解释。我们在一个真实世界的两中心400名受试者的队列上评估了PreSight,使用多模态术前输入和术后改善标签。PreSight在强临床、仅影像和多模态基线模型中表现优越。在内部验证中,其准确率达到88.89%,在外部中心测试中的响应者分类准确率为85.29%,并显示出更好的概率校准和更高的决策曲线净收益。消融实验和分析确认了DBM和患者特定加权模块的贡献,并表明该模型以患者特定的方式强调与疾病相关的区域。这些结果表明,将临床先验知识与区域自适应形态测量相结合,能够在常规实践中提供可靠的术前决策支持。
cs.CV / 284 / 2603.01976
Robust White Blood Cell Classification with Stain-Normalized Decoupled Learning and Ensembling
基于染色标准化的解耦学习与集成的稳健白细胞分类
Abstract
White blood cell (WBC) classification is fundamental for hematology applications such as infection assessment, leukemia screening, and treatment monitoring. However, real-world WBC datasets present substantial appearance variations caused by staining and scanning conditions, as well as severe class imbalance in which common cell types dominate while rare but clinically important categories are underrepresented. To address these challenges, we propose a stain-normalized, decoupled training framework that first learns transferable representations using instance-balanced sampling, and then rebalances the classifier with class-aware sampling and a hybrid loss combining effective-number weighting and focal modulation. In inference stage, we further enhance robustness by ensembling various trained backbones with test-time augmentation. Our approach achieved the top rank on the leaderboard of the WBCBench 2026: Robust White Blood Cell Classification Challenge at ISBI 2026.
Chinese Translation
白细胞(WBC)分类是血液学应用中的基础,涉及感染评估、白血病筛查和治疗监测等。然而,现实世界中的白细胞数据集存在因染色和扫描条件引起的显著外观变化,以及严重的类别不平衡,常见细胞类型占主导地位,而临床重要但稀有的类别则被低估。为了解决这些挑战,我们提出了一种染色标准化的解耦训练框架,该框架首先通过实例平衡采样学习可迁移的表示,然后通过类别感知采样和结合有效数量加权与聚焦调制的混合损失重新平衡分类器。在推理阶段,我们通过集成多种训练后的主干网络和测试时增强进一步提高了稳健性。我们的方法在2026年ISBI的WBCBench 2026:稳健白细胞分类挑战的排行榜上获得了最高排名。
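The hybrid loss in the entry above combines effective-number class weighting with focal modulation. A minimal sketch of the standard class-balanced focal loss operating directly on predicted class probabilities (our own simplification; the paper's exact weighting and hyperparameters may differ):

```python
import numpy as np

def effective_number_weights(counts, beta=0.999):
    """Per-class weights from the effective number of samples (1-beta^n)/(1-beta).

    Rare classes get larger weights; weights are normalized to mean 1.
    """
    eff = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    w = 1.0 / eff
    return w * len(counts) / w.sum()

def cb_focal_loss(probs, labels, counts, beta=0.999, gamma=2.0):
    """Class-balanced focal loss on predicted class probabilities.

    The focal factor (1 - p_t)^gamma down-weights easy, well-classified
    samples; the class weights rebalance head vs. rare cell types.
    """
    w = effective_number_weights(counts, beta)
    p_t = probs[np.arange(len(labels)), labels]      # prob of the true class
    focal = (1.0 - p_t) ** gamma * -np.log(p_t + 1e-12)
    return float(np.mean(w[labels] * focal))

counts = np.array([5000, 300, 20])                   # imbalanced WBC classes
probs = np.array([[0.9, 0.05, 0.05],
                  [0.2, 0.7, 0.1],
                  [0.3, 0.2, 0.5]])
labels = np.array([0, 1, 2])
loss = cb_focal_loss(probs, labels, counts)
```

With `beta` close to 1 the weights approach inverse-frequency weighting; smaller `beta` flattens them toward uniform.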
cs.CV / 285 / 2603.01993
Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection
过程优于结果:培养可推广的多模态操控检测的法医推理能力
Abstract
Recent advances in generative AI have significantly enhanced the realism of multimodal media manipulation, thereby posing substantial challenges to manipulation detection. Existing manipulation detection and grounding approaches predominantly focus on manipulation type classification under result-oriented supervision, which not only lacks interpretability but also tends to overfit superficial artifacts. In this paper, we argue that generalizable detection requires incorporating explicit forensic reasoning, rather than merely classifying a limited set of manipulation types, which fails to generalize to unseen manipulation patterns. To this end, we propose REFORM, a reasoning-driven framework that shifts learning from outcome fitting to process modeling. REFORM adopts a three-stage curriculum that first induces forensic rationales, then aligns reasoning with final judgments, and finally refines logical consistency via reinforcement learning. To support this paradigm, we introduce ROM, a large-scale dataset with rich reasoning annotations. Extensive experiments show that REFORM establishes new state-of-the-art performance with superior generalization, achieving 81.52% ACC on ROM, 76.65% ACC on DGM4, and 74.9 F1 on MMFakeBench.
Chinese Translation
近期生成性人工智能的进展显著增强了多模态媒体操控的真实感,从而对操控检测提出了重大挑战。现有的操控检测和基础方法主要集中在结果导向监督下的操控类型分类,这不仅缺乏可解释性,还容易过拟合表面伪影。本文认为,可推广的检测需要纳入明确的法医推理,而不仅仅是对有限的操控类型进行分类,这种方法无法推广到未见的操控模式。为此,我们提出了REFORM,一个以推理为驱动的框架,将学习从结果拟合转向过程建模。REFORM采用三阶段课程,首先引导法医推理,其次将推理与最终判断对齐,最后通过强化学习精炼逻辑一致性。为支持这一范式,我们引入了ROM,一个具有丰富推理注释的大规模数据集。大量实验表明,REFORM在可推广性方面建立了新的最先进性能,在ROM上达到81.52%的准确率,在DGM4上达到76.65%的准确率,在MMFakeBench上达到74.9的F1分数。
cs.CV / 286 / 2603.01997
Event-Only Drone Trajectory Forecasting with RPM-Modulated Kalman Filtering
仅基于事件的无人机轨迹预测与RPM调制卡尔曼滤波
Abstract
Event cameras provide high-temporal-resolution visual sensing that is well suited for observing fast-moving aerial objects; however, their use for drone trajectory prediction remains limited. This work introduces an event-only drone forecasting method that exploits propeller-induced motion cues. Propeller rotational speeds are extracted directly from raw event data and fused within an RPM-aware Kalman filtering framework. Evaluations on the FRED dataset show that the proposed method outperforms learning-based approaches and a vanilla Kalman filter in terms of average distance error and final distance error at 0.4s and 0.8s forecasting horizons. The results demonstrate robust and accurate short- and medium-horizon trajectory forecasting without reliance on RGB imagery or training data.
Chinese Translation
事件相机提供了高时间分辨率的视觉感知,非常适合观察快速移动的空中物体;然而,它们在无人机轨迹预测中的应用仍然有限。本研究提出了一种仅基于事件的无人机预测方法,该方法利用螺旋桨引起的运动线索。螺旋桨转速直接从原始事件数据中提取,并在一个考虑RPM的卡尔曼滤波框架中进行融合。在FRED数据集上的评估表明,所提出的方法在平均距离误差和0.4秒及0.8秒预测范围内的最终距离误差方面,优于基于学习的方法和传统卡尔曼滤波器。结果表明,该方法在短期和中期轨迹预测中表现出稳健和准确的能力,无需依赖RGB图像或训练数据。
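An RPM-aware Kalman filter can be pictured as an ordinary constant-velocity filter whose process noise is scaled by the extracted propeller RPM, so the filter trusts the motion model less during aggressive maneuvers. The modulation law and all parameter values below are our own illustrative assumptions, not the paper's scheme:

```python
import numpy as np

def kf_predict_update(x, P, z, dt, rpm, q0=1.0, r=4.0, rpm_ref=5000.0):
    """One 1-D constant-velocity Kalman step with RPM-scaled process noise.

    State x = [position, velocity]; z is a scalar position measurement.
    Higher RPM inflates the process noise Q (illustrative modulation).
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])             # motion model
    H = np.array([[1.0, 0.0]])                        # measure position only
    q = q0 * (rpm / rpm_ref) ** 2                     # RPM modulation
    Q = q * np.array([[dt**4 / 4, dt**3 / 2],
                      [dt**3 / 2, dt**2]])
    # predict
    x = F @ x
    P = F @ P @ F.T + Q
    # update
    S = H @ P @ H.T + r                               # innovation covariance
    K = P @ H.T / S                                   # Kalman gain
    x = x + (K * (z - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P

# Track a target moving at a constant 2 px/frame with noiseless measurements.
x, P = np.array([0.0, 0.0]), np.eye(2) * 10.0
for k in range(20):
    x, P = kf_predict_update(x, P, z=2.0 * (k + 1), dt=1.0, rpm=8000.0)
```

Forecasting then just iterates the predict step without updates over the desired horizon (e.g. 0.4 s or 0.8 s worth of frames).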
cs.CV / 287 / 2603.02012
MAP-Diff: Multi-Anchor Guided Diffusion for Progressive 3D Whole-Body Low-Dose PET Denoising
MAP-Diff:多锚点引导的渐进式三维全身低剂量PET去噪
Abstract
Low-dose Positron Emission Tomography (PET) reduces radiation exposure but suffers from severe noise and quantitative degradation. Diffusion-based denoising models achieve strong final reconstructions, yet their reverse trajectories are typically unconstrained and not aligned with the progressive nature of PET dose formation. We propose MAP-Diff, a multi-anchor guided diffusion framework for progressive 3D whole-body PET denoising. MAP-Diff introduces clinically observed intermediate-dose scans as trajectory anchors and enforces timestep-dependent supervision to regularize the reverse process toward dose-aligned intermediate states. Anchor timesteps are calibrated via degradation matching between simulated diffusion corruption and real multi-dose PET pairs, and a timestep-weighted anchor loss stabilizes stage-wise learning. At inference, the model requires only ultra-low-dose input while enabling progressive, dose-consistent intermediate restoration. Experiments on internal (Siemens Biograph Vision Quadra) and cross-scanner (United Imaging uEXPLORER) datasets show consistent improvements over strong CNN-, Transformer-, GAN-, and diffusion-based baselines. On the internal dataset, MAP-Diff improves PSNR from 42.48 dB to 43.71 dB (+1.23 dB), increases SSIM to 0.986, and reduces NMAE from 0.115 to 0.103 (-0.012) compared to 3D DDPM. Performance gains generalize across scanners, achieving 34.42 dB PSNR and 0.141 NMAE on the external cohort, outperforming all competing methods.
Chinese Translation
低剂量正电子发射断层扫描(PET)减少了辐射暴露,但遭受严重噪声和定量降解。基于扩散的去噪模型实现了强大的最终重建,然而它们的逆过程通常没有约束,并且与PET剂量形成的渐进特性不一致。我们提出了MAP-Diff,一种用于渐进式三维全身PET去噪的多锚点引导扩散框架。MAP-Diff引入临床观察到的中间剂量扫描作为轨迹锚点,并施加时间步依赖的监督,以规范逆过程朝向剂量对齐的中间状态。锚点时间步通过模拟扩散损坏与真实多剂量PET对之间的降解匹配进行校准,时间步加权的锚点损失稳定了阶段性学习。在推理时,模型仅需超低剂量输入,同时实现渐进的、剂量一致的中间恢复。在内部(西门子Biograph Vision Quadra)和跨扫描仪(联合影像uEXPLORER)数据集上的实验显示,相较于强大的CNN、Transformer、GAN和基于扩散的基线,均取得了一致的改进。在内部数据集中,MAP-Diff将PSNR从42.48 dB提高到43.71 dB(+1.23 dB),SSIM增加至0.986,NMAE从0.115降低至0.103(-0.012),相比于3D DDPM。性能提升在不同扫描仪之间具有普遍性,在外部队列中实现了34.42 dB的PSNR和0.141的NMAE,超越了所有竞争方法。
cs.CV / 288 / 2603.02026
Learning to Read Where to Look: Disease-Aware Vision-Language Pretraining for 3D CT
学习在何处查看:面向疾病的视觉-语言预训练用于三维CT
Abstract
Recent 3D CT vision-language models align volumes with reports via contrastive pretraining, but typically rely on limited public data and provide only coarse global supervision. We train a 3D CT vision-language model on 98k report-volume pairs (50k patients) collected at a single hospital, combined with public datasets, using SigLIP-style contrastive pretraining together with prompt-based disease supervision in the shared vision-text embedding space. On CT-RATE, our model achieves state-of-the-art text-to-image retrieval (R@10 31.5 vs. 22.2) and competitive disease classification (AUC 83.8 vs. 83.8), with consistent results on Rad-ChestCT (AUC 77.0 vs. 77.3). We further observe that radiologists routinely reference specific images within their reports (e.g., "series X, image Y"), linking textual descriptions to precise axial locations. We automatically mine 262k such snippet-slice pairs and introduce the task of intra-scan snippet localization -- predicting the axial depth referred to by a text snippet -- reducing mean absolute error to 36.3 mm at 12 mm feature resolution, compared with 67.0 mm for the best baseline. Adding this localization objective leaves retrieval and classification broadly unchanged within confidence bounds, yielding a single unified model for retrieval, classification, and intra-scan grounding.
Chinese Translation
近期的三维CT视觉-语言模型通过对比预训练将体积与报告进行对齐,但通常依赖于有限的公共数据,并仅提供粗略的全局监督。我们在一家医院收集的98,000对报告-体积对(50,000名患者)上训练了一个三维CT视觉-语言模型,并结合公共数据集,使用SigLIP风格的对比预训练以及基于提示的疾病监督,在共享的视觉-文本嵌入空间中进行训练。在CT-RATE上,我们的模型实现了最先进的文本到图像检索(R@10 31.5对比22.2)和具有竞争力的疾病分类(AUC 83.8对比83.8),在Rad-ChestCT上的结果也保持一致(AUC 77.0对比77.3)。我们进一步观察到放射科医生在报告中常常引用特定的图像(例如,“系列X,图像Y”),将文本描述与精确的轴向位置联系起来。我们自动挖掘了262,000对这样的片段-切片对,并引入了扫描内片段定位的任务——预测文本片段所指的轴向深度——在12毫米特征分辨率下将平均绝对误差降低至36.3毫米,而最佳基线为67.0毫米。添加这一定位目标在置信区间内对检索和分类的影响基本不变,从而产生一个统一的模型用于检索、分类和扫描内定位。
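SigLIP-style pretraining, as used above, replaces the softmax contrastive loss with independent per-pair sigmoid terms: every image-text pair is a binary problem, labeled +1 on the diagonal (matching pairs) and -1 elsewhere. A minimal sketch (the temperature `t` and bias `b` values are illustrative assumptions):

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Sigmoid (SigLIP-style) contrastive loss for a batch of paired embeddings.

    Each of the B*B image-text combinations contributes an independent
    binary log-loss term; no batch-wide softmax normalization is needed.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = t * img @ txt.T + b
    labels = 2.0 * np.eye(len(img)) - 1.0            # +1 diag, -1 off-diag
    return float(np.mean(np.log1p(np.exp(-labels * logits))))

rng = np.random.default_rng(0)
txt = rng.normal(size=(8, 32))
# Near-identical pairs should score much lower loss than random pairings.
aligned = siglip_loss(txt + 0.01 * rng.normal(size=(8, 32)), txt)
shuffled = siglip_loss(rng.normal(size=(8, 32)), txt)
```

Because the terms are independent, the loss scales gracefully to large batches, one reason it suits pretraining on ~98k report-volume pairs.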
cs.CV / 289 / 2603.02047
NICO-RAG: Multimodal Hypergraph Retrieval-Augmented Generation for Understanding the Nicotine Public Health Crisis
NICO-RAG:多模态超图检索增强生成用于理解尼古丁公共健康危机
Abstract
The nicotine addiction public health crisis continues to be pervasive. In this century alone, the tobacco industry has released and marketed new products in an aggressive effort to lure new and young customers for life. Such innovations and product developments, namely flavored nicotine and tobacco products such as nicotine pouches, have undone years of anti-tobacco campaign work. Past work is limited both in scope and in its ability to connect large-scale data points. Thus, we introduce the Nicotine Innovation Counter-Offensive (NICO) Dataset to provide public health researchers with over 200,000 multimodal samples, including images and text descriptions, on 55 tobacco and nicotine product brands. In addition, to provide public health researchers with factual connections across a large-scale dataset, we propose NICO-RAG, a retrieval-augmented generation (RAG) framework that can retrieve image features without incurring the high cost of language models, or the added cost of processing image tokens with large-scale datasets such as NICO. At construction time, NICO-RAG organizes image- and text-extracted entities and relations into hypergraphs to produce responses that are as factual as possible. This joint multimodal knowledge representation enables NICO-RAG to retrieve images for query answering not only by visual similarity but also by the semantic similarity of image descriptions. Experiments on over 100 questions show that, without needing to process additional image tokens, NICO-RAG performs comparably to the state-of-the-art RAG method adapted for images.
Chinese Translation
尼古丁成瘾的公共健康危机依然普遍存在。在本世纪,烟草行业通过推出和营销新产品,积极吸引新的年轻客户终身使用。这些创新和产品开发,特别是如尼古丁袋等调味尼古丁或烟草产品,已抵消了多年来反烟草运动的努力。以往的研究在范围和连接大规模数据点的能力上均有限。因此,我们引入了尼古丁创新反击(NICO)数据集,为公共健康研究人员提供超过200,000个多模态样本,包括55个烟草和尼古丁品牌的图像和文本描述。此外,为了为公共健康研究人员提供大规模数据集中事实连接,我们提出了NICO-RAG,一个检索增强生成(RAG)框架,能够在不产生高成本语言模型的情况下检索图像特征,以及处理像NICO这样的大规模数据集中的图像令牌所需的额外成本。在构建时,NICO-RAG将提取的图像和文本实体及其关系组织成超图,以尽可能生成事实响应。这种联合多模态知识表示使NICO-RAG能够通过视觉相似性和图像描述的语义相似性来检索图像以回答查询。实验表明,在不需要处理来自图像的额外令牌的情况下,NICO-RAG在超过100个问题上的表现与适用于图像的最先进RAG方法相当。
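The retrieval idea in the NICO-RAG abstract above, ranking images by the semantic similarity of their text descriptions so that no image tokens are processed at query time, can be sketched with plain cosine similarity. This is a minimal illustration, not the paper's implementation; the embeddings, the optional visual-score blend, and the weight `alpha` are illustrative assumptions:

```python
import numpy as np

def retrieve(query_emb, desc_embs, vis_scores=None, alpha=0.5, k=3):
    """Rank images by cosine similarity between a query embedding and the
    embeddings of their text descriptions; no image tokens are needed at
    query time. Optionally blend in precomputed visual-similarity scores."""
    q = query_emb / np.linalg.norm(query_emb)
    d = desc_embs / np.linalg.norm(desc_embs, axis=1, keepdims=True)
    score = d @ q                           # semantic similarity per image
    if vis_scores is not None:              # blend with visual similarity
        score = alpha * score + (1 - alpha) * np.asarray(vis_scores)
    return np.argsort(-score)[:k]           # indices of the top-k images

# Three toy description embeddings; the query is closest to the first.
descs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(retrieve(np.array([1.0, 0.1]), descs, k=2))   # [0 2]
```

A real system would replace the toy vectors with encoder outputs and restrict candidates to the hypergraph entities matched by the query.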
cs.CV / 290 / 2603.02049
WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories
WorldStereo:通过3D几何记忆桥接相机引导的视频生成与场景重建
Abstract
Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories. In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds. Moreover, the spatial-stereo memory constrains the model's attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank. These components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, the flexible control branch-based WorldStereo shows impressive efficiency, benefiting from the distribution matching distilled VDM backbone without joint training. Extensive experiments across both camera-guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of our approach. Notably, we show that WorldStereo acts as a powerful world model, tackling diverse scene generation tasks (whether starting from perspective or panoramic images) with high-fidelity 3D results. Models will be released.
Chinese Translation
近年来,基础视频扩散模型(VDMs)的进展显著。然而,尽管生成视频的视觉质量令人瞩目,从这些输出中重建一致的3D场景仍然具有挑战性,这主要是由于相机可控性有限以及从不同相机轨迹观察时生成内容的不一致性。本文提出了WorldStereo,一个新颖的框架,通过两个专用的几何记忆模块桥接相机引导的视频生成和3D重建。形式上,全球几何记忆模块使得精确的相机控制成为可能,同时通过逐步更新的点云注入粗略的结构先验。此外,空间立体记忆模块通过3D对应关系限制模型的注意力感受野,以聚焦于记忆库中的细粒度细节。这些组件使得WorldStereo能够在精确的相机控制下生成多视角一致的视频,从而促进高质量的3D重建。此外,基于灵活控制分支的WorldStereo展现出令人印象深刻的效率,得益于分布匹配的蒸馏VDM主干,而无需联合训练。在相机引导的视频生成和3D重建基准测试中,广泛的实验验证了我们方法的有效性。值得注意的是,我们展示了WorldStereo作为一个强大的世界模型,能够以高保真度的3D结果处理多样的场景生成任务(无论是从透视图还是全景图开始)。模型将会发布。
cs.CV / 291 / 2603.02063
ORGAN: Object-Centric Representation Learning using Cycle Consistent Generative Adversarial Networks
ORGAN:基于循环一致生成对抗网络的对象中心表示学习
Abstract
Although data generation is often straightforward, extracting information from data is more difficult. Object-centric representation learning can extract information from images in an unsupervised manner. It does so by segmenting an image into its subcomponents: the objects. Each object is then represented in a low-dimensional latent space that can be used for downstream processing. Object-centric representation learning is dominated by autoencoder architectures (AEs). Here, we present ORGAN, a novel approach for object-centric representation learning, which is based on cycle-consistent Generative Adversarial Networks instead. We show that it performs similarly to other state-of-the-art approaches on synthetic datasets, while at the same time being the only approach tested here capable of handling more challenging real-world datasets with many objects and low visual contrast. Complementing these results, ORGAN creates expressive latent space representations that allow for object manipulation. Finally, we show that ORGAN scales well both with respect to the number of objects and the size of the images, giving it a unique edge over current state-of-the-art approaches.
Chinese Translation
尽管数据生成通常比较简单,但从数据中提取信息则更为困难。对象中心表示学习能够以无监督的方式从图像中提取信息。它通过将图像分割为其子组件:对象来实现这一点。每个对象随后在一个低维潜在空间中表示,该空间可用于后续处理。对象中心表示学习主要由自编码器架构(AEs)主导。在这里,我们提出了ORGAN,一种基于循环一致生成对抗网络的新颖对象中心表示学习方法。我们展示了它在合成数据集上的性能与其他最先进的方法相似,同时也是唯一一种经过测试能够处理具有多个对象和低视觉对比度的更具挑战性的真实世界数据集的方法。补充这些结果,ORGAN创建了富有表现力的潜在空间表示,允许对象操作。最后,我们展示了ORGAN在对象数量和图像大小方面的良好扩展性,使其在当前最先进的方法中具有独特的优势。
cs.CV / 292 / 2603.02079
MMNavAgent: Multi-Magnification WSI Navigation Agent for Clinically Consistent Whole-Slide Analysis
MMNavAgent:用于临床一致的全幻灯片分析的多倍放大WSI导航代理
Abstract
Recent AI navigation approaches aim to improve Whole-Slide Image (WSI) diagnosis by modeling spatial exploration and selecting diagnostically relevant regions, yet most operate at a single fixed magnification or rely on predefined magnification traversal. In clinical practice, pathologists examine slides across multiple magnifications and selectively inspect only the necessary scales, dynamically integrating global and cellular evidence in a sequential manner. This mismatch prevents existing methods from modeling the cross-magnification interactions and adaptive magnification selection inherent to real diagnostic workflows. To address these issues, we propose a clinically consistent Multi-Magnification WSI Navigation Agent (MMNavAgent) that explicitly models multi-magnification interaction and adaptive magnification selection. Specifically, we introduce a Cross-Magnification navigation Tool (CMT) that aggregates contextual information from adjacent magnifications to enhance discriminative representations along the navigation path. We further introduce a Magnification Selection Tool (MST) that leverages memory-driven reasoning within the agent framework to enable interactive and adaptive magnification selection, mimicking the sequential decision process of pathologists. Extensive experiments on a public dataset demonstrate improved diagnostic performance, with gains of 1.45% in AUC and 2.93% in BACC over a non-agent baseline. Code will be made public upon acceptance.
Chinese Translation
近期的人工智能导航方法旨在通过建模空间探索和选择诊断相关区域来改善全幻灯片图像(WSI)诊断,然而大多数方法仅在单一固定放大倍数下操作或依赖于预定义的放大遍历。在临床实践中,病理学家会在多个放大倍数下检查切片,并有选择性地仅检查必要的尺度,以动态方式整合全局和细胞证据。这种不匹配阻碍了现有方法建模跨放大倍数交互和适应性放大选择的能力,这些都是实际诊断工作流程中固有的。为此,我们提出了一种临床一致的多倍放大WSI导航代理(MMNavAgent),该代理明确建模多倍放大交互和适应性放大选择。具体而言,我们引入了一种跨放大导航工具(CMT),该工具聚合来自相邻放大倍数的上下文信息,以增强导航路径上的区分性表示。我们进一步引入了一种放大选择工具(MST),该工具利用代理框架中的记忆驱动推理来实现交互式和适应性的放大选择,模拟病理学家的顺序决策过程。在一个公共数据集上的大量实验表明,诊断性能得到了改善,相较于非代理基线,AUC提升了1.45%,BACC提升了2.93%。代码将在接受后公开。
cs.CV / 293 / 2603.02080
From Pixels to Patches: Pooling Strategies for Earth Embeddings
从像素到补丁:地球嵌入的池化策略
Abstract
As geospatial foundation models shift from patch-level to pixel-level embeddings, practitioners must aggregate thousands of pixel vectors into patch representations that preserve class-discriminative signal while matching downstream label resolution. The default choice, mean pooling, discards within-patch variability and can drop accuracy by more than 10% under spatial shift. To evaluate this effect, we introduce EuroSAT-Embed: 81,000 embedding GeoTIFFs derived from three foundation models: AlphaEarth, OlmoEarth, and Tessera. We benchmark 11 training-free and 2 parametric pooling methods under both random and geographically disjoint test splits. Our results show that richer pooling schemes reduce the geographic generalization gap by up to 40% relative to mean pooling and increase accuracy by up to 5% on spatial splits. We recommend Generalized Mean Pooling (GeM) as a drop-in replacement for mean pooling: it improves accuracy without increasing embedding dimensionality. For maximum accuracy, Stats pooling (concatenation of min/max/mean/std pooling) performs best at 4x the embedding size. We further find that pooling effectiveness varies across embedding sources and that higher-dimensional embeddings benefit most from distributional statistics.
Chinese Translation
随着地理空间基础模型从补丁级嵌入转向像素级嵌入,实践者必须将成千上万的像素向量聚合成补丁表示,以保留类别区分信号,同时匹配下游标签分辨率。默认选择的均值池化丢弃了补丁内的变异性,并且在空间偏移下可能导致准确率下降超过10%。为了评估这一影响,我们引入了EuroSAT-Embed:由三个基础模型(AlphaEarth、OlmoEarth和Tessera)衍生的81,000个嵌入GeoTIFF。我们在随机和地理上不相交的测试分割下,对11种无训练和2种参数池化方法进行了基准测试。我们的结果表明,更丰富的池化方案相对于均值池化可以将地理泛化差距减少多达40%,并在空间分割上提高准确率最多5%。我们推荐广义均值池化(Generalized Mean Pooling, GeM)作为均值池化的替代方案:它在不增加嵌入维度的情况下提高了准确率。为了获得最大准确率,统计池化(Stats pooling,最小/最大/均值/标准差池化的连接)在4倍嵌入大小时表现最佳。我们进一步发现,池化效果因嵌入来源而异,且高维嵌入最能从分布统计中受益。
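The two recommended poolings in the abstract above are simple to state in code. A minimal NumPy sketch; the array shapes, `p` value, and clipping epsilon are illustrative assumptions, not the benchmark's settings:

```python
import numpy as np

def gem_pool(pixels: np.ndarray, p: float = 3.0, eps: float = 1e-6) -> np.ndarray:
    """Generalized Mean (GeM) pooling of pixel embeddings into one patch vector.

    pixels: (N, D) array of N pixel vectors with D channels.
    p = 1 recovers mean pooling; larger p moves toward max pooling.
    """
    x = np.clip(pixels, eps, None)  # GeM assumes non-negative activations
    return np.power(np.mean(np.power(x, p), axis=0), 1.0 / p)

def stats_pool(pixels: np.ndarray) -> np.ndarray:
    """Stats pooling: concatenate min/max/mean/std, giving 4x the embedding size."""
    return np.concatenate([
        pixels.min(axis=0), pixels.max(axis=0),
        pixels.mean(axis=0), pixels.std(axis=0),
    ])

patch = np.random.default_rng(0).uniform(0.1, 1.0, size=(1024, 64))  # e.g. 32x32 pixels
print(gem_pool(patch).shape)    # (64,)  same dimensionality as the pixel embeddings
print(stats_pool(patch).shape)  # (256,) 4x the embedding size
```

GeM's drop-in property is visible here: its output keeps the input dimensionality, whereas Stats pooling quadruples it.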
cs.CV / 294 / 2603.02087
Detection-Gated Glottal Segmentation with Zero-Shot Cross-Dataset Transfer and Clinical Feature Extraction
基于检测门控的声门分割方法:零样本跨数据集迁移与临床特征提取
Abstract
Background: Accurate glottal segmentation in high-speed videoendoscopy (HSV) is essential for extracting kinematic biomarkers of laryngeal function. However, existing deep learning models often produce spurious artifacts in non-glottal frames and fail to generalize across different clinical settings. Methods: We propose a detection-gated pipeline that integrates a YOLOv8-based detector with a U-Net segmenter. A temporal consistency wrapper ensures robustness by suppressing false positives during glottal closure and instrument occlusion. The model was trained on a limited subset of the GIRAFE dataset (600 frames) and evaluated via zero-shot transfer on the large-scale BAGLS dataset. Results: The pipeline achieved state-of-the-art performance on the GIRAFE benchmark (DSC 0.81) and demonstrated superior generalizability on BAGLS (DSC 0.85, in-distribution) without institutional fine-tuning. Downstream validation on a 65-subject clinical cohort confirmed that automated kinematic features (Open Quotient, coefficient of variation) remained consistent with established clinical benchmarks. The coefficient of variation (CV) of the glottal area was found to be a significant marker for distinguishing healthy from pathological vocal function (p=0.006). Conclusions: The detection-gated architecture provides a lightweight, computationally efficient solution (~35 frames/s) for real-time clinical use. By enabling robust zero-shot transfer, this framework facilitates the standardized, large-scale extraction of clinical biomarkers across diverse endoscopy platforms. Code, trained weights, and evaluation scripts are released at https://github.com/hari-krishnan/openglottal.
Chinese Translation
背景:在高速视频喉镜检查(HSV)中,准确的声门分割对于提取喉功能的运动生物标志物至关重要。然而,现有的深度学习模型往往在非声门帧中产生虚假伪影,并且在不同临床环境中缺乏泛化能力。方法:我们提出了一种检测门控的流程,将基于YOLOv8的检测器与U-Net分割器相结合。时间一致性包装器通过抑制声门闭合和仪器遮挡期间的假阳性来确保鲁棒性。该模型在GIRAFE数据集的有限子集(600帧)上进行了训练,并通过零样本迁移在大规模BAGLS数据集上进行了评估。结果:该流程在GIRAFE基准测试中达到了最先进的性能(DSC 0.81),并在BAGLS上展示了优越的泛化能力(DSC 0.85,分布内),无需机构微调。在65名受试者的临床队列中的下游验证确认,自动化运动特征(开放商数、变异系数)与既定的临床基准保持一致。声门面积的变异系数(CV)被发现是区分健康与病理声音功能的重要标志物(p=0.006)。结论:检测门控架构提供了一种轻量级、计算高效的解决方案(约35帧/秒),适用于实时临床使用。通过实现鲁棒的零样本迁移,该框架促进了在多样化内窥镜平台上标准化、大规模提取临床生物标志物。代码、训练权重和评估脚本已发布在 https://github.com/hari-krishnan/openglottal。
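The gating idea in the pipeline above (suppress segmentation output in frames the detector rejects, and additionally require detections to persist across several frames before trusting them) can be sketched as follows. Thresholds, array shapes, and the run-length rule are illustrative assumptions, not the released implementation:

```python
import numpy as np

def gate_masks(masks, det_conf, thresh=0.5, min_run=3):
    """Zero out segmenter masks in frames without a stable glottis detection.

    masks:    (T, H, W) per-frame segmentation probability maps.
    det_conf: (T,) per-frame detector confidence (e.g. best box score).
    A frame survives only inside a run of >= min_run confident frames,
    which suppresses isolated false positives such as those arising
    during glottal closure or instrument occlusion.
    """
    keep = np.asarray(det_conf) >= thresh
    stable = np.zeros(len(keep), dtype=bool)
    i, T = 0, len(keep)
    while i < T:
        if keep[i]:
            j = i
            while j < T and keep[j]:
                j += 1
            if j - i >= min_run:        # only sufficiently long runs survive
                stable[i:j] = True
            i = j
        else:
            i += 1
    return masks * stable[:, None, None]

conf = np.array([0.9, 0.9, 0.9, 0.1, 0.8, 0.9])
gated = gate_masks(np.ones((6, 4, 4)), conf)
print(gated.sum(axis=(1, 2)))   # frames 0-2 kept; frame 3 and the short run 4-5 zeroed
```

Downstream kinematic features (glottal area, Open Quotient) would then be computed only over the surviving frames.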
cs.CV / 295 / 2603.02096
FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding
FluxMem:用于流媒体视频理解的自适应层次记忆
Abstract
This paper presents FluxMem, a training-free framework for efficient streaming video understanding. FluxMem adaptively compresses redundant visual memory through a hierarchical, two-stage design: (1) a Temporal Adjacency Selection (TAS) module removes redundant visual tokens across adjacent frames, and (2) a Spatial Domain Consolidation (SDC) module further merges spatially repetitive regions within each frame into compact representations. To adapt effectively to dynamic scenes, we introduce a self-adaptive token compression mechanism in both TAS and SDC, which automatically determines the compression rate based on intrinsic scene statistics rather than manual tuning. Extensive experiments demonstrate that FluxMem achieves new state-of-the-art results on existing online video benchmarks, reaching 76.4 on StreamingBench and 67.2 on OVO-Bench under real-time settings, while reducing latency by 69.9% and peak GPU memory by 34.5% on OVO-Bench. Furthermore, it maintains strong offline performance, achieving 73.1 on MLVU while using 65% fewer visual tokens.
Chinese Translation
本文提出了FluxMem,一个无需训练的高效流媒体视频理解框架。FluxMem通过层次化的两阶段设计自适应地压缩冗余的视觉记忆:(1) 时间邻接选择(Temporal Adjacency Selection, TAS)模块去除相邻帧之间的冗余视觉标记,(2) 空间域整合(Spatial Domain Consolidation, SDC)模块进一步将每帧内空间上重复的区域合并为紧凑的表示。为了有效适应动态场景,我们在TAS和SDC中引入了一种自适应标记压缩机制,该机制根据内在场景统计自动确定压缩率,而非手动调节。大量实验表明,FluxMem在现有在线视频基准测试中取得了新的最先进结果,在实时设置下,StreamingBench达到76.4,OVO-Bench达到67.2,同时在OVO-Bench上将延迟降低了69.9%,峰值GPU内存减少了34.5%。此外,它在离线性能上也表现强劲,在MLVU上实现了73.1,同时使用的视觉标记减少了65%。
cs.CV / 296 / 2603.02125
A 3D mesh convolution-based autoencoder for geometry compression
基于3D网格卷积的自编码器用于几何压缩
Abstract
In this paper, we introduce a novel 3D mesh convolution-based autoencoder for geometry compression, able to deal with irregular mesh data without requiring preprocessing or manifold/watertightness conditions. The proposed approach extracts meaningful latent representations by learning features directly from the mesh faces, while preserving connectivity through dedicated pooling and unpooling operations. The encoder compresses the input mesh into a compact base mesh space, which ensures that the latent space remains comparable. The decoder reconstructs the original connectivity and restores the compressed geometry to its full resolution. Extensive experiments on multi-class datasets demonstrate that our method outperforms state-of-the-art approaches in both 3D mesh geometry reconstruction and latent space classification tasks. Code available at: github.com/germainGB/MeshConv3D
Chinese Translation
在本文中,我们提出了一种新颖的基于3D网格卷积的自编码器,用于几何压缩,能够处理不规则网格数据,而无需预处理或流形/水密性条件。所提出的方法通过直接从网格面学习特征,提取有意义的潜在表示,同时通过专门的池化和反池化操作保持连通性。编码器将输入网格压缩到一个紧凑的基础网格空间,确保潜在空间保持可比性。解码器重建原始连通性,并将压缩的几何形状恢复到其完整分辨率。在多类数据集上的广泛实验表明,我们的方法在3D网格几何重建和潜在空间分类任务中优于最先进的方法。代码可在 github.com/germainGB/MeshConv3D 获取。
cs.CV / 297 / 2603.02129
LiftAvatar: Kinematic-Space Completion for Expression-Controlled 3D Gaussian Avatar Animation
LiftAvatar:用于表情控制的运动空间补全的3D高斯头像动画
Abstract
We present LiftAvatar, a new paradigm that completes sparse monocular observations in kinematic space (e.g., facial expressions and head pose) and uses the completed signals to drive high-fidelity avatar animation. LiftAvatar is a fine-grained, expression-controllable large-scale video diffusion Transformer that synthesizes high-quality, temporally coherent expression sequences conditioned on single or multiple reference images. The key idea is to lift incomplete input data into a richer kinematic representation, thereby strengthening both reconstruction and animation in downstream 3D avatar pipelines. To this end, we introduce (i) a multi-granularity expression control scheme that combines shading maps with expression coefficients for precise and stable driving, and (ii) a multi-reference conditioning mechanism that aggregates complementary cues from multiple frames, enabling strong 3D consistency and controllability. As a plug-and-play enhancer, LiftAvatar directly addresses the limited expressiveness and reconstruction artifacts of 3D Gaussian Splatting-based avatars caused by sparse kinematic cues in everyday monocular videos. By expanding incomplete observations into diverse pose-expression variations, LiftAvatar also enables effective prior distillation from large-scale video generative models into 3D pipelines, leading to substantial gains. Extensive experiments show that LiftAvatar consistently boosts animation quality and quantitative metrics of state-of-the-art 3D avatar methods, especially under extreme, unseen expressions.
Chinese Translation
我们提出了LiftAvatar,这是一种新范式,能够在运动空间中完成稀疏的单目观测(例如,面部表情和头部姿态),并利用完成的信号驱动高保真头像动画。LiftAvatar是一个细粒度、可控表情的大规模视频扩散Transformer,能够合成基于单个或多个参考图像的高质量、时间一致的表情序列。其关键思想是将不完整的输入数据提升为更丰富的运动表示,从而增强下游3D头像管道中的重建和动画效果。为此,我们引入了(i)一种多粒度表情控制方案,将阴影图与表情系数结合,以实现精确和稳定的驱动,以及(ii)一种多参考条件机制,从多个帧聚合互补线索,增强强大的3D一致性和可控性。作为一种即插即用的增强器,LiftAvatar直接解决了由于日常单目视频中的稀疏运动线索导致的基于3D高斯点云的头像的有限表现力和重建伪影问题。通过将不完整的观测扩展为多样的姿态-表情变化,LiftAvatar还使得从大规模视频生成模型到3D管道的有效先验蒸馏成为可能,从而带来显著的提升。大量实验表明,LiftAvatar在动画质量和最先进的3D头像方法的定量指标上始终表现出色,尤其是在极端和未见过的表情下。
cs.CV / 298 / 2603.02130
Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera
立体惯性姿态估计器:基于稀疏IMU和单一立体相机的度量精确形状感知运动捕捉
Abstract
Recent advancements in visual-inertial motion capture systems have demonstrated the potential of combining monocular cameras with sparse inertial measurement units (IMUs) as cost-effective solutions, which effectively mitigate occlusion and drift issues inherent in single-modality systems. However, they are still limited by metric inaccuracies in global translations stemming from monocular depth ambiguity, and shape-agnostic local motion estimations that ignore anthropometric variations. We present Stereo-Inertial Poser, a real-time motion capture system that leverages a single stereo camera and six IMUs to estimate metric-accurate and shape-aware 3D human motion. By replacing the monocular RGB with stereo vision, our system resolves depth ambiguity through calibrated baseline geometry, enabling direct 3D keypoint extraction and body shape parameter estimation. IMU data and visual cues are fused for predicting drift-compensated joint positions and root movements, while a novel shape-aware fusion module dynamically harmonizes anthropometry variations with global translations. Our end-to-end pipeline achieves over 200 FPS without optimization-based post-processing, enabling real-time deployment. Quantitative evaluations across various datasets demonstrate state-of-the-art performance. Qualitative results show our method produces drift-free global translation under a long recording time and reduces foot-skating effects.
Chinese Translation
近期视觉-惯性运动捕捉系统的进展展示了将单目相机与稀疏惯性测量单元(IMU)结合的潜力,作为一种具有成本效益的解决方案,有效缓解了单一模态系统固有的遮挡和漂移问题。然而,这些系统仍然受到单目深度模糊导致的全局位移度量不准确的限制,以及忽略人类体型变化的形状无关局部运动估计的影响。我们提出了立体惯性姿态估计器(Stereo-Inertial Poser),这是一种实时运动捕捉系统,利用单一立体相机和六个IMU来估计度量精确且形状感知的三维人类运动。通过将单目RGB替换为立体视觉,我们的系统通过标定的基线几何解决了深度模糊问题,从而实现直接的三维关键点提取和身体形状参数估计。IMU数据与视觉线索融合,以预测补偿漂移的关节位置和根部运动,同时一个新颖的形状感知融合模块动态协调人类体型变化与全局位移。我们的端到端管道在不进行基于优化的后处理的情况下实现了超过200 FPS的速度,支持实时部署。在各种数据集上的定量评估显示了最先进的性能。定性结果表明,我们的方法在长时间录制下产生无漂移的全局位移,并减少了脚滑现象。
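The metric advantage of the stereo setup described above comes from calibrated baseline geometry: once the baseline and focal length are known, depth follows directly from disparity, which a monocular camera cannot resolve. A minimal sketch of that relation; the focal length, baseline, and disparity values are illustrative assumptions:

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Metric depth from a calibrated rectified stereo pair: Z = f * B / d.

    disparity_px: horizontal pixel offset of a point between the two views.
    focal_px:     focal length in pixels; baseline_m: camera separation in metres.
    The known baseline is what resolves the scale ambiguity of monocular depth.
    """
    return focal_px * baseline_m / np.asarray(disparity_px, dtype=float)

# A keypoint with 32 px of disparity, seen by an f = 800 px pair 12 cm apart:
print(disparity_to_depth(32.0, 800.0, 0.12))   # 3.0 (metres)
```

Applied per keypoint, this yields the metric 3D positions that are then fused with the IMU stream.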
cs.CV / 299 / 2603.02133
SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
SimRecon:基于真实视频的模拟准备组合场景重建
Abstract
Compositional scene reconstruction seeks to create object-centric representations rather than holistic scenes from real-world videos, which is natively applicable to simulation and interaction. Conventional compositional reconstruction approaches primarily emphasize visual appearance and show limited generalization ability to real-world scenarios. In this paper, we propose SimRecon, a framework that realizes a "Perception-Generation-Simulation" pipeline towards cluttered scene reconstruction, which first conducts scene-level semantic reconstruction from video input, then performs single-object generation, and finally assembles these assets in the simulator. However, naively combining these three stages leads to visual infidelity of generated assets and physical implausibility of the final scene, a problem particularly severe for complex scenes. Thus, we further propose two bridging modules between the three stages to address this problem. To be specific, for the transition from Perception to Generation, critical for visual fidelity, we introduce Active Viewpoint Optimization, which actively searches in 3D space to acquire optimal projected images as conditions for single-object completion. Moreover, for the transition from Generation to Simulation, essential for physical plausibility, we propose a Scene Graph Synthesizer, which guides the construction from scratch in 3D simulators, mirroring the native, constructive principle of the real world. Extensive experiments on the ScanNet dataset validate our method's superior performance over previous state-of-the-art approaches.
Chinese Translation
组合场景重建旨在从真实世界视频中创建以对象为中心的表示,而非整体场景,这一方法本质上适用于模拟和交互。传统的组合重建方法主要强调视觉外观,对真实世界场景的泛化能力有限。本文提出了SimRecon,一个实现“感知-生成-模拟”管道的框架,旨在进行杂乱场景重建。该框架首先从视频输入中进行场景级语义重建,然后执行单一对象生成,最后在模拟器中组装这些资产。然而,简单地将这三个阶段结合在一起会导致生成资产的视觉不真实和最终场景的物理不合理,这一问题在复杂场景中尤为严重。因此,我们进一步提出了两个桥接模块,以解决这一问题。具体来说,在从感知到生成的过渡中,我们引入了主动视点优化(Active Viewpoint Optimization),该方法在三维空间中主动搜索,以获取最佳投影图像作为单一对象完成的条件。此外,在从生成到模拟的过渡中,我们提出了场景图合成器(Scene Graph Synthesizer),该合成器指导在三维模拟器中从零开始构建,反映了现实世界的本质构造原则。在ScanNet数据集上的大量实验验证了我们方法相较于之前的最先进方法的优越性能。
cs.CV / 300 / 2603.02134
OnlineX: Unified Online 3D Reconstruction and Understanding with Active-to-Stable State Evolution
OnlineX:通过主动到稳定状态演化实现统一的在线三维重建与理解
Abstract
Recent advances in generalizable 3D Gaussian Splatting (3DGS) have enabled rapid 3D scene reconstruction within seconds, eliminating the need for per-scene optimization. However, existing methods primarily follow an offline reconstruction paradigm, lacking the capacity for continuous reconstruction, which limits their applicability to online scenarios such as robotics and VR/AR. In this paper, we introduce OnlineX, a feed-forward framework that reconstructs both 3D visual appearance and language fields in an online manner using only streaming images. A key challenge in online formulation is the cumulative drift issue, which is rooted in the fundamental conflict between two opposing roles of the memory state: an active role that constantly refreshes to capture high-frequency local geometry, and a stable role that conservatively accumulates and preserves the long-term global structure. To address this, we introduce a decoupled active-to-stable state evolution paradigm. Our framework decouples the memory state into a dedicated active state and a persistent stable state, and then cohesively fuses the information from the former into the latter to achieve both fidelity and stability. Moreover, we jointly model visual appearance and language fields and incorporate an implicit Gaussian fusion module to enhance reconstruction quality. Experiments on mainstream datasets demonstrate that our method consistently outperforms prior work in novel view synthesis and semantic understanding, showcasing robust performance across input sequences of varying lengths with real-time inference speed.
Chinese Translation
最近在通用三维高斯点云(3D Gaussian Splatting, 3DGS)方面的进展使得三维场景重建能够在几秒钟内快速完成,消除了对每个场景优化的需求。然而,现有方法主要遵循离线重建范式,缺乏连续重建的能力,这限制了它们在机器人技术和虚拟现实/增强现实(VR/AR)等在线场景中的应用。在本文中,我们提出了OnlineX,一个前馈框架,利用流媒体图像在线重建三维视觉外观和语言场。在线重建的一个关键挑战是累积漂移问题,这源于内存状态的两个对立角色之间的根本冲突:一个是主动角色,持续刷新以捕捉高频局部几何;另一个是稳定角色,保守地累积和保持长期的全局结构。为了解决这个问题,我们引入了一种解耦的主动到稳定状态演化范式。我们的框架将内存状态解耦为专用的主动状态和持久的稳定状态,然后将前者的信息有机融合到后者中,以实现保真度和稳定性。此外,我们共同建模视觉外观和语言场,并结合隐式高斯融合模块以增强重建质量。在主流数据集上的实验表明,我们的方法在新视图合成和语义理解方面始终优于先前的工作,展示了在不同长度输入序列下的稳健性能,并具有实时推理速度。
cs.AI / 1 / 2603.00267
Multi-Sourced, Multi-Agent Evidence Retrieval for Fact-Checking
多源、多智能体证据检索用于事实核查
Abstract
Misinformation spreading over the Internet poses a significant threat to both societies and individuals, necessitating robust and scalable fact-checking that relies on retrieving accurate and trustworthy evidence. Previous methods rely on semantic and social-contextual patterns learned from training data, which limits their generalization to new data distributions. Recently, Retrieval Augmented Generation (RAG) based methods have been proposed to utilize the reasoning capability of LLMs with retrieved grounding evidence documents. However, these methods largely rely on textual similarity for evidence retrieval and struggle to retrieve evidence that captures multi-hop semantic relations within rich document contents. These limitations lead to overlooking subtle factual correlations between the evidence and the claims to be fact-checked during evidence retrieval, thus causing inaccurate veracity predictions. To address these issues, we propose WKGFC, which exploits an authorized open knowledge graph as a core resource of evidence. LLM-enabled retrieval is designed to assess the claims and retrieve the most relevant knowledge subgraphs, forming structured evidence for fact verification. To augment the knowledge graph evidence, we retrieve web content for completion. The above process is implemented as an automatic Markov Decision Process (MDP): a reasoning LLM agent decides what actions to take according to the current evidence and the claims. To adapt the MDP for fact-checking, we use prompt optimization to fine-tune the agentic LLM.
Chinese Translation
互联网上传播的虚假信息对社会和个人构成了重大威胁,因此需要依赖于检索准确且可信的证据来进行强大且可扩展的事实核查。以往的方法依赖于从训练数据中学习的语义和社会上下文模式,这限制了它们对新数据分布的泛化能力。最近,提出了基于检索增强生成(Retrieval Augmented Generation, RAG)的方法,利用大语言模型(LLM)与检索到的基础证据文档的推理能力。然而,这些方法在证据检索中主要依赖于文本相似性,难以检索到能够捕捉丰富文档内容中多跳语义关系的证据。这些局限性导致在证据检索过程中忽视了证据与待核查声明之间微妙的事实关联,从而造成不准确的真实性预测。为了解决这些问题,我们提出了WKGFC,该方法利用授权的开放知识图谱作为证据的核心资源。基于LLM的检索旨在评估声明并检索最相关的知识子图,从而形成结构化的证据以进行事实验证。为了增强知识图谱证据,我们还检索网络内容以进行补充。上述过程被实现为一个自动的马尔可夫决策过程(Markov Decision Process, MDP):一个推理的LLM智能体根据当前的证据和声明决定采取何种行动。为了使MDP适应事实核查,我们使用提示优化来微调智能体LLM。
cs.AI / 2 / 2603.00285
TraderBench: How Robust Are AI Agents in Adversarial Capital Markets?
TraderBench:人工智能代理在对抗性资本市场中的稳健性如何?
Abstract
Evaluating AI agents in finance faces two key challenges: static benchmarks require costly expert annotation yet miss the dynamic decision-making central to real-world trading, while LLM-based judges introduce uncontrolled variance on domain-specific tasks. We introduce TraderBench, a benchmark that addresses both issues. It combines expert-verified static tasks (knowledge retrieval, analytical reasoning) with adversarial trading simulations scored purely on realized performance (Sharpe ratio, returns, and drawdown), eliminating judge variance entirely. The framework features two novel tracks: crypto trading with four progressive market-manipulation transforms, and options derivatives scoring across P&L accuracy, Greeks, and risk management. Trading scenarios can be refreshed with new market data to prevent benchmark contamination. Evaluating 13 models (8B open-source to frontier) on ~50 tasks, we find: (1) 8 of 13 models score ~33 on crypto with <1-point variation across adversarial conditions, exposing fixed non-adaptive strategies; (2) extended thinking helps retrieval (+26 points) but has zero impact on trading (+0.3 crypto, -0.1 options). These findings reveal that current agents lack genuine market adaptation, underscoring the need for performance-grounded evaluation in finance.
Chinese Translation
在金融领域评估人工智能代理面临两个主要挑战:静态基准需要昂贵的专家注释,但却忽视了现实交易中动态决策的重要性,而基于大型语言模型(LLM)的评判者在特定领域任务中引入了不可控的方差。我们提出了TraderBench,一个解决这两个问题的基准。它结合了经过专家验证的静态任务(知识检索、分析推理)与基于对抗性交易模拟的评分,评分完全基于实现的表现——夏普比率、收益和回撤——从而完全消除了评判者的方差。该框架具有两个新颖的轨道:加密交易,包含四种渐进的市场操纵变换,以及期权衍生品的评分,涵盖盈亏准确性、希腊字母和风险管理。交易场景可以通过新市场数据进行刷新,以防止基准污染。在约50个任务上评估了13个模型(从8B开源到前沿),我们发现:(1)13个模型中有8个在加密交易中得分约为33,在对抗条件下变化小于1分,暴露出固定的非适应性策略;(2)扩展思维有助于检索(+26分),但对交易没有影响(加密交易+0.3,期权-0.1)。这些发现揭示了当前代理缺乏真正的市场适应性,强调了在金融领域进行基于表现的评估的必要性。
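The trading tracks above are scored purely on realized performance. As a reference for what those metrics compute, here is a minimal sketch; the equity series, the zero risk-free rate, and the annualization constant are illustrative assumptions, not the benchmark's configuration:

```python
import numpy as np

def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period simple returns (risk-free rate 0)."""
    r = np.asarray(returns, dtype=float)
    return np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)

def max_drawdown(equity):
    """Largest peak-to-trough decline of an equity curve, as a fraction."""
    eq = np.asarray(equity, dtype=float)
    running_peak = np.maximum.accumulate(eq)   # best level seen so far
    return ((running_peak - eq) / running_peak).max()

equity = np.array([100.0, 110.0, 104.5, 120.0, 96.0, 108.0])  # hypothetical account
returns = np.diff(equity) / equity[:-1]
print(max_drawdown(equity))          # 0.2  (the 120 -> 96 decline)
print(round(sharpe_ratio(returns), 2))
```

Because these quantities are fully determined by the agent's realized trades, no LLM judge, and hence no judge variance, enters the score.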
cs.AI / 3 / 2603.00309
DIG to Heal: Scaling General-purpose Agent Collaboration via Explainable Dynamic Decision Paths
DIG以治愈:通过可解释的动态决策路径扩展通用智能体协作
Abstract
The increasingly popular agentic AI paradigm promises to harness the power of multiple, general-purpose large language model (LLM) agents to collaboratively complete complex tasks. While many agentic AI systems utilize predefined workflows or agent roles in order to reduce complexity, ideally these agents would be truly autonomous, able to achieve emergent collaboration even as the number of collaborating agents increases. Yet in practice, such unstructured interactions can lead to redundant work and cascading failures that are difficult to interpret or correct. In this work, we study multi-agent systems composed of general-purpose LLM agents that operate without predefined roles, control flow, or communication constraints, relying instead on emergent collaboration to solve problems. We introduce the Dynamic Interaction Graph (DIG), which captures emergent collaboration as a time-evolving causal network of agent activations and interactions. DIG makes emergent collaboration observable and explainable for the first time, enabling real-time identification, explanation, and correction of collaboration-induced error patterns directly from agents' collaboration paths. Thus, DIG fills a critical gap in understanding how general LLM agents solve problems together in truly agentic multi-agent systems. The project webpage can be found at: https://happyeureka.github.io/dig.
Chinese Translation
日益流行的智能体人工智能范式承诺利用多个通用大型语言模型(LLM)智能体的力量,共同完成复杂任务。虽然许多智能体人工智能系统利用预定义的工作流程或智能体角色以减少复杂性,但理想情况下,这些智能体应当是完全自主的,能够在协作智能体数量增加的情况下实现自发协作。然而,在实践中,这种无结构的互动可能导致冗余工作和难以解释或纠正的级联故障。在本研究中,我们研究了由通用LLM智能体组成的多智能体系统,这些智能体在没有预定义角色、控制流程或通信约束的情况下运作,而是依赖自发协作来解决问题。我们引入了动态交互图(Dynamic Interaction Graph, DIG),它将自发协作捕捉为一个随时间演变的因果网络,展示智能体的激活和互动。DIG首次使自发协作变得可观察和可解释,能够实时识别、解释和纠正由协作引发的错误模式,直接来源于智能体的协作路径。因此,DIG填补了理解通用LLM智能体如何在真正的智能体多智能体系统中共同解决问题的关键空白。项目网页可访问:https://happyeureka.github.io/dig。
cs.AI / 4 / 2603.00312
How Well Do Multimodal Models Reason on ECG Signals?
多模态模型在心电图信号上的推理能力如何?
Abstract
While multimodal large language models offer a promising solution to the "black box" nature of health AI by generating interpretable reasoning traces, verifying the validity of these traces remains a critical challenge. Existing evaluation methods are either unscalable, relying on manual clinician review, or superficial, utilizing proxy metrics (e.g. QA) that fail to capture the semantic correctness of clinical logic. In this work, we introduce a reproducible framework for evaluating reasoning in ECG signals. We propose decomposing reasoning into two distinct components: (i) Perception, the accurate identification of patterns within the raw signal, and (ii) Deduction, the logical application of domain knowledge to those patterns. To evaluate Perception, we employ an agentic framework that generates code to empirically verify the temporal structures described in the reasoning trace. To evaluate Deduction, we measure the alignment of the model's logic against a structured database of established clinical criteria in a retrieval-based approach. This dual-verification method enables the scalable assessment of "true" reasoning capabilities.
Chinese Translation
尽管多模态大型语言模型通过生成可解释的推理轨迹为健康人工智能的“黑箱”特性提供了有前景的解决方案,但验证这些轨迹的有效性仍然是一个关键挑战。现有的评估方法要么不可扩展,依赖于人工临床审查,要么肤浅,利用代理指标(例如,问答)未能捕捉临床逻辑的语义正确性。在本研究中,我们引入了一个可重复的框架来评估心电图信号中的推理能力。我们建议将推理分解为两个不同的组成部分:(i)感知,即对原始信号中模式的准确识别;(ii)推演,即将领域知识逻辑应用于这些模式。为了评估感知,我们采用了一种代理框架,生成代码以实证验证推理轨迹中描述的时间结构。为了评估推演,我们测量模型逻辑与结构化临床标准数据库之间的对齐程度,采用基于检索的方法。这种双重验证方法使得对“真实”推理能力的评估具有可扩展性。
cs.AI / 5 / 2603.00349
EmCoop: A Framework and Benchmark for Embodied Cooperation Among LLM Agents
EmCoop:一个用于大型语言模型代理之间具身合作的框架和基准
Abstract
Real-world scenarios increasingly require multiple embodied agents to collaborate in dynamic environments under embodied constraints, as many tasks exceed the capabilities of any single agent. Recent advances in large language models (LLMs) enable high-level cognitive coordination through reasoning, planning, and natural language communication. However, fine-grained analyses of how such collaboration emerges, unfolds, and contributes to task success in embodied multi-agent systems are difficult to conduct with existing benchmarks. In this paper, we introduce EmCoop, a benchmark framework for studying cooperation in LLM-based embodied multi-agent systems. Our framework separates a high-level cognitive layer from a low-level embodied interaction layer, allowing us to characterize agent cooperation through their interleaved dynamics over time. Given a cooperation-constrained embodied task, we propose generalizable, process-level metrics that diagnose collaboration quality and failure modes, beyond final task success. We instantiate our framework in two embodied environments that scale to arbitrary numbers of agents and support diverse communication topologies, and use these instantiations to demonstrate how EmCoop enables systematic analysis of cooperation dynamics across team sizes and task settings. The project web page can be found at: https://happyeureka.github.io/emcoop.
Chinese Translation
现实世界场景日益需要多个具身代理在具身约束下的动态环境中进行协作,因为许多任务超出了单个代理的能力。大型语言模型(LLMs)的最新进展使得通过推理、规划和自然语言交流实现高水平的认知协调成为可能。然而,现有基准难以进行细致的分析,以了解这种合作是如何出现、展开并对具身多代理系统中的任务成功做出贡献的。在本文中,我们介绍了EmCoop,一个用于研究基于LLM的具身多代理系统中合作的基准框架。我们的框架将高层认知层与低层具身交互层分开,使我们能够通过代理随时间变化的交互动态来表征代理合作。针对一个受合作约束的具身任务,我们提出了一些可推广的过程级指标,以诊断合作质量和失败模式,超越最终任务成功的评估。我们在两个具身环境中实例化我们的框架,这些环境可以扩展到任意数量的代理,并支持多样的通信拓扑,并利用这些实例展示EmCoop如何使得在不同团队规模和任务设置下系统地分析合作动态成为可能。项目网页可访问:https://happyeureka.github.io/emcoop。
cs.AI / 6 / 2603.00350
Monotropic Artificial Intelligence: Toward a Cognitive Taxonomy of Domain-Specialized Language Models
单向人工智能:朝着领域专业化语言模型的认知分类法迈进
Abstract
The prevailing paradigm in artificial intelligence research equates progress with scale: larger models trained on broader datasets are presumed to yield superior capabilities. This assumption, while empirically productive for general-purpose applications, obscures a fundamental epistemological tension between breadth and depth of knowledge. We introduce the concept of \emph{Monotropic Artificial Intelligence} -- language models that deliberately sacrifice generality to achieve extraordinary precision within narrowly circumscribed domains. Drawing on the cognitive theory of monotropism developed to understand autistic cognition, we argue that intense specialization represents not a limitation but an alternative cognitive architecture with distinct advantages for safety-critical applications. We formalize the defining characteristics of monotropic models, contrast them with conventional polytropic architectures, and demonstrate their viability through Mini-Enedina, a 37.5-million-parameter model that achieves near-perfect performance on Timoshenko beam analysis while remaining deliberately incompetent outside its domain. Our framework challenges the implicit assumption that artificial general intelligence constitutes the sole legitimate aspiration of AI research, proposing instead a cognitive ecology in which specialized and generalist systems coexist complementarily.
Chinese Translation
当前人工智能研究的主流范式将进展等同于规模:更大规模的模型在更广泛的数据集上训练,假定能够产生更优越的能力。尽管这一假设在通用应用中经验上是有效的,但它掩盖了知识的广度与深度之间的根本认识论张力。我们引入了\textit{单向人工智能}(Monotropic Artificial Intelligence)的概念——这些语言模型故意牺牲通用性,以在狭窄的领域内实现卓越的精确性。基于为理解自闭症认知而发展的单向主义(monotropism)认知理论,我们认为,强烈的专业化并不是一种限制,而是一种具有独特优势的替代认知架构,尤其适用于安全关键的应用。我们形式化了单向模型的定义特征,与传统的多向架构进行了对比,并通过Mini-Enedina这一3750万参数的模型展示了其可行性,该模型在Timoshenko梁分析中实现了近乎完美的性能,同时在其领域之外故意表现出无能。我们的框架挑战了人工通用智能是AI研究唯一合法追求的隐含假设,而是提出了一种认知生态,其中专业化系统与通用系统互补共存。
cs.AI / 7 / 2603.00374
Conservative Equilibrium Discovery in Offline Game-Theoretic Multiagent Reinforcement Learning
离线博弈论多智能体强化学习中的保守均衡发现
Abstract
Offline learning of strategies takes data efficiency to its extreme by restricting algorithms to a fixed dataset of state-action trajectories. We consider the problem in a mixed-motive multiagent setting, where the goal is to solve a game under the offline learning constraint. We first frame this problem in terms of selecting among candidate equilibria. Since datasets may inform only a small fraction of game dynamics, it is generally infeasible in offline game-solving to even verify a proposed solution is a true equilibrium. Therefore, we consider the relative probability of low regret (i.e., closeness to equilibrium) across candidates based on the information available. Specifically, we extend Policy Space Response Oracles (PSRO), an online game-solving approach, by quantifying game dynamics uncertainty and modifying the RL objective to skew towards solutions more likely to have low regret in the true game. We further propose a novel meta-strategy solver, tailored for the offline setting, to guide strategy exploration in PSRO. Our incorporation of Conservatism principles from Offline reinforcement learning approaches for strategy Exploration gives our approach its name: COffeE-PSRO. Experiments demonstrate COffeE-PSRO's ability to extract lower-regret solutions than state-of-the-art offline approaches and reveal relationships between algorithmic components, empirical game fidelity, and overall performance.
Chinese Translation
离线策略学习通过将算法限制在固定的状态-动作轨迹数据集上,将数据效率提升到极致。我们考虑在混合动机的多智能体环境中解决这一问题,目标是在离线学习约束下求解博弈。我们首先将该问题框定为在候选均衡中进行选择。由于数据集可能仅能提供博弈动态的一小部分信息,因此在离线博弈求解中,验证一个提议的解决方案是否为真正的均衡通常是不可行的。因此,我们考虑根据可用信息评估候选者具有低遗憾(即接近均衡)的相对概率。具体而言,我们通过量化博弈动态的不确定性并修改强化学习目标,使其倾向于更可能在真实博弈中具有低遗憾的解决方案,从而扩展了在线博弈求解方法策略空间响应预言机(Policy Space Response Oracles, PSRO)。我们进一步提出了一种新颖的元策略求解器,专为离线环境量身定制,以指导PSRO中的策略探索。我们从离线强化学习方法中引入保守主义原则用于策略探索,使我们的方法得名为COffeE-PSRO。实验表明,COffeE-PSRO能够提取比最先进的离线方法更低遗憾的解决方案,并揭示了算法组件、经验博弈保真度和整体性能之间的关系。
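The regret quantity the abstract ranks candidate solutions by (closeness to equilibrium) can be illustrated in a toy two-player matrix game. A minimal sketch, assuming a row-player payoff matrix and mixed strategies as probability lists; the paper's games, uncertainty quantification, and meta-strategy solver are far more general:

```python
# Illustrative regret computation in a two-player matrix game: the
# best-response value minus the value of the current mixed strategy.
# The payoff matrix and strategies below are toy assumptions.

def player_regret(payoff, own, opp):
    """Regret of the row player's mixed strategy `own` against `opp`.

    payoff: list of rows, payoff[i][j] = row player's payoff for (i, j).
    own, opp: mixed strategies as probability lists.
    """
    # expected payoff of each pure row action against the opponent's mix
    exp_rows = [sum(payoff[i][j] * opp[j] for j in range(len(opp)))
                for i in range(len(payoff))]
    current = sum(own[i] * exp_rows[i] for i in range(len(own)))
    return max(exp_rows) - current

# Prisoner's-dilemma-style row payoffs: cooperating against a cooperator
# earns 3, but deviating to defect would earn 5, so regret is 2.
print(player_regret([[3, 0], [5, 1]], own=[1.0, 0.0], opp=[1.0, 0.0]))  # → 2.0
```

A profile is an (approximate) equilibrium when every player's regret is (near) zero; under offline constraints this quantity can only be estimated, which is the uncertainty COffeE-PSRO reasons about.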
cs.AI / 8 / 2603.00376
NeuroHex: Highly-Efficient Hex Coordinate System for Creating World Models to Enable Adaptive AI
NeuroHex:用于创建世界模型以支持自适应人工智能的高效六边形坐标系统
Abstract
\textit{NeuroHex} is a hexagonal coordinate system designed to support highly efficient world models and reference frames for online adaptive AI systems. Inspired by the hexadirectional firing structure of grid cells in the human brain, NeuroHex adopts a cubic isometric hexagonal coordinate formulation that provides full 60{\deg} rotational symmetry and low-cost translation, rotation and distance computation. We develop a mathematical framework that incorporates ring indexing, quantized angular encoding, and a hierarchical library of foundational, simple, and complex geometric shape primitives. These constructs allow low-overhead point-in-shape tests and spatial matching operations that are expensive in Cartesian coordinate systems. To support realistic settings, the NeuroHex framework can process OpenStreetMap (OSM) data sets using an OSM-to-NeuroHex (\textit{OSM2Hex}) conversion tool. The OSM2Hex spatial abstraction processing pipeline can achieve a reduction of 90-99\% in geometric complexity while maintaining the relevant spatial structure map for navigation. Our initial results, based on actual city and neighborhood scale data sets, demonstrate that NeuroHex offers a highly efficient substrate for building dynamic world models to enable adaptive spatial reasoning in autonomous AI systems with continuous online learning capability.
Chinese Translation
NeuroHex 是一种六边形坐标系统,旨在支持高效的世界模型和参考框架,以服务于在线自适应人工智能系统。该系统受到人脑中网格细胞的六向放电结构的启发,采用立方等距六边形坐标形式,提供完整的 60° 旋转对称性以及低成本的平移、旋转和距离计算。我们开发了一个数学框架,结合了环索引、量化角度编码以及基础、简单和复杂几何形状原语的层次库。这些构造允许在六边形坐标系统中进行低开销的点在形状内测试和空间匹配操作,而这些操作在笛卡尔坐标系统中则代价高昂。为了支持现实环境,NeuroHex 框架能够使用 OSM(OpenStreetMap)数据集,通过 OSM 到 NeuroHex(OSM2Hex)转换工具进行处理。OSM2Hex 空间抽象处理管道可以在保持相关空间结构图以供导航的同时,实现几何复杂度降低 90-99%。我们基于实际城市和社区规模数据集的初步结果表明,NeuroHex 为构建动态世界模型提供了一个高效的基础,以支持具有持续在线学习能力的自主人工智能系统的自适应空间推理。
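The low-cost translation, rotation, and distance operations the abstract credits to cubic hexagonal coordinates can be sketched with the standard cube-coordinate scheme for hex grids. This is a common convention, not NeuroHex's published formulation, so treat the details as illustrative:

```python
# Standard cube coordinates for a hex grid: (q, r, s) with q + r + s == 0.
# Distance is a max of per-axis deltas, and a 60-degree rotation is a
# cyclic permutation with sign flips -- both constant-time operations.
from typing import Tuple

Hex = Tuple[int, int, int]  # invariant: q + r + s == 0

def hex_distance(a: Hex, b: Hex) -> int:
    """Grid distance between two hexes."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]), abs(a[2] - b[2]))

def rotate_60_cw(h: Hex) -> Hex:
    """One 60-degree clockwise rotation about the origin."""
    q, r, s = h
    return (-r, -s, -q)

def translate(h: Hex, d: Hex) -> Hex:
    """Shift a hex by a displacement vector (also a valid cube triple)."""
    return (h[0] + d[0], h[1] + d[1], h[2] + d[2])
```

Applying `rotate_60_cw` six times returns the original hex, which is the full 60° rotational symmetry the abstract refers to.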
cs.AI / 9 / 2603.00451
Confusion-Aware Rubric Optimization for LLM-based Automated Grading
基于混淆感知的LLM自动评分标准优化
Abstract
Accurate and unambiguous guidelines are critical for large language model (LLM) based graders, yet manually crafting these prompts is often sub-optimal as LLMs can misinterpret expert guidelines or lack necessary domain specificity. Consequently, the field has moved toward automated prompt optimization to refine grading guidelines without the burden of manual trial and error. However, existing frameworks typically aggregate independent and unstructured error samples into a single update step, resulting in "rule dilution" where conflicting constraints weaken the model's grading logic. To address these limitations, we introduce Confusion-Aware Rubric Optimization (CARO), a novel framework that enhances accuracy and computational efficiency by structurally separating error signals. CARO leverages the confusion matrix to decompose monolithic error signals into distinct modes, allowing for the diagnosis and repair of specific misclassification patterns individually. By synthesizing targeted "fixing patches" for dominant error modes and employing a diversity-aware selection mechanism, the framework prevents guidance conflict and eliminates the need for resource-heavy nested refinement loops. Empirical evaluations on teacher education and STEM datasets demonstrate that CARO significantly outperforms existing SOTA methods. These results suggest that replacing mixed-error aggregation with surgical, mode-specific repair yields robust improvements in automated assessment scalability and precision.
Chinese Translation
准确且明确的指导方针对于基于大型语言模型(LLM)的评分系统至关重要,但手动编写这些提示往往效果不佳,因为LLM可能会误解专家指导或缺乏必要的领域特异性。因此,该领域已转向自动化提示优化,以在不依赖手动试错的情况下完善评分标准。然而,现有框架通常将独立且无结构的错误样本聚合为单一更新步骤,导致“规则稀释”,使得相互冲突的约束削弱了模型的评分逻辑。为了解决这些局限性,我们提出了混淆感知评分标准优化(CARO),这是一个新颖的框架,通过结构性地分离错误信号来提高准确性和计算效率。CARO利用混淆矩阵将单一的错误信号分解为不同的模式,从而允许对特定的误分类模式进行单独诊断和修复。通过为主要错误模式合成针对性的“修复补丁”,并采用关注多样性的选择机制,该框架防止了指导冲突,消除了对资源密集型嵌套精炼循环的需求。在教师教育和STEM数据集上的实证评估表明,CARO显著优于现有的最先进方法。这些结果表明,用针对性的模式特定修复替代混合错误聚合,可以在自动评估的可扩展性和精确度上带来显著提升。
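CARO's first step, decomposing monolithic error signals via the confusion matrix into distinct misclassification modes, can be sketched in a few lines. The representation and function names are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch: count off-diagonal (predicted, true) cells of a
# grader's confusion matrix and surface the dominant error modes, the
# candidates for targeted "fixing patches" in CARO's framework.
from collections import Counter

def dominant_error_modes(graded, top_k=2):
    """graded: list of (predicted_score, true_score) pairs.
    Returns the `top_k` most frequent off-diagonal cells, i.e. the
    misclassification patterns worth repairing individually."""
    off_diagonal = Counter((p, t) for p, t in graded if p != t)
    return off_diagonal.most_common(top_k)
```

For example, a grader that repeatedly assigns 3 where the rubric says 2 surfaces `(3, 2)` as its dominant mode, letting a rubric patch address that confusion alone rather than a blended mix of all errors.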
cs.AI / 10 / 2603.00460
MED-COPILOT: A Medical Assistant Powered by GraphRAG and Similar Patient Case Retrieval
MED-COPILOT:一种基于GraphRAG和相似患者案例检索的医疗助手
Abstract
Clinical decision-making requires synthesizing heterogeneous evidence, including patient histories, clinical guidelines, and trajectories of comparable cases. While large language models (LLMs) offer strong reasoning capabilities, they remain prone to hallucinations and struggle to integrate long, structured medical documents. We present MED-COPILOT, an interactive clinical decision-support system designed for clinicians and medical trainees, which combines guideline-grounded GraphRAG retrieval with hybrid semantic-keyword similar-patient retrieval to support transparent and evidence-aware clinical reasoning. The system builds a structured knowledge graph from WHO and NICE guidelines, applies community-level summarization for efficient retrieval, and maintains a 36,000-case similar-patient database derived from SOAP-normalized MIMIC-IV notes and Synthea-generated records. We evaluate our framework on clinical note completion and medical question answering, and demonstrate that it consistently outperforms parametric LLM baselines and standard RAG, improving both generation fidelity and clinical reasoning accuracy. The full system is available at https://huggingface.co/spaces/Cryo3978/Med_GraphRAG , enabling users to inspect retrieved evidence, visualize token-level similarity contributions, and conduct guided follow-up analysis. Our results demonstrate a practical and interpretable approach to integrating structured guideline knowledge with patient-level analogical evidence for clinical LLMs.
Chinese Translation
临床决策需要综合异质证据,包括患者历史、临床指南和可比案例的轨迹。尽管大型语言模型(LLMs)提供了强大的推理能力,但它们仍然容易出现幻觉,并且在整合长篇结构化医疗文档方面存在困难。我们提出了MED-COPILOT,这是一种为临床医生和医学培训生设计的互动临床决策支持系统,它结合了基于指南的GraphRAG检索与混合语义-关键词相似患者检索,以支持透明且以证据为基础的临床推理。该系统从世界卫生组织(WHO)和国家卫生与临床优化研究所(NICE)指南构建结构化知识图谱,应用社区级别的摘要以实现高效检索,并维护一个由SOAP标准化的MIMIC-IV笔记和Synthea生成记录衍生的36,000案例相似患者数据库。我们在临床笔记补全和医学问答上评估了我们的框架,结果表明它始终优于参数化的LLM基线和标准RAG,改善了生成的保真度和临床推理的准确性。完整系统可在https://huggingface.co/spaces/Cryo3978/Med_GraphRAG获取,用户可以检查检索到的证据、可视化令牌级别的相似性贡献,并进行引导式后续分析。我们的结果展示了一种将结构化指南知识与患者级别类比证据整合到临床LLMs中的实用且可解释的方法。
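The hybrid semantic-keyword retrieval the abstract describes can be sketched as a weighted mix of term overlap and embedding similarity. The weighting, scoring functions, and embeddings below are assumptions for illustration; the abstract does not specify MED-COPILOT's exact formula:

```python
# Illustrative hybrid retrieval score: alpha-weighted blend of keyword
# overlap and embedding cosine similarity. All names are hypothetical.
import math

def keyword_score(query_terms, doc_terms):
    """Fraction of query terms present in the document."""
    q, d = set(query_terms), set(doc_terms)
    return len(q & d) / len(q) if q else 0.0

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_score(query_terms, doc_terms, q_emb, d_emb, alpha=0.5):
    """Blend exact-term matching (robust for clinical vocabulary) with
    semantic similarity (robust to paraphrase)."""
    return (alpha * keyword_score(query_terms, doc_terms)
            + (1 - alpha) * cosine(q_emb, d_emb))
```

Ranking a similar-patient database by such a blended score is one common way to keep both exact clinical terminology and paraphrased symptom descriptions retrievable.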
cs.AI / 11 / 2603.00465
Optimizing In-Context Demonstrations for LLM-based Automated Grading
优化基于大型语言模型的自动评分中的上下文示例
Abstract
Automated assessment of open-ended student responses is a critical capability for scaling personalized feedback in education. While large language models (LLMs) have shown promise in grading tasks via in-context learning (ICL), their reliability is heavily dependent on the selection of few-shot exemplars and the construction of high-quality rationales. Standard retrieval methods typically select examples based on semantic similarity, which often fails to capture subtle decision boundaries required for rubric adherence. Furthermore, manually crafting the expert rationales needed to guide these models can be a significant bottleneck. To address these limitations, we introduce GUIDE (Grading Using Iteratively Designed Exemplars), a framework that reframes exemplar selection and refinement in automated grading as a boundary-focused optimization problem. GUIDE operates on a continuous loop of selection and refinement, employing novel contrastive operators to identify "boundary pairs" that are semantically similar but possess different grades. We enhance exemplars by generating discriminative rationales that explicitly articulate why a response receives a specific score to the exclusion of adjacent grades. Extensive experiments across datasets in physics, chemistry, and pedagogical content knowledge demonstrate that GUIDE significantly outperforms standard retrieval baselines. By focusing the model's attention on the precise edges of the rubric, our approach shows exceptionally robust gains on borderline cases and improved rubric adherence. GUIDE paves the way for trusted, scalable assessment systems that align closely with human pedagogical standards.
Chinese Translation
对开放式学生回答的自动评估是教育中扩展个性化反馈的关键能力。尽管大型语言模型(LLMs)在通过上下文学习(ICL)进行评分任务中显示出潜力,但其可靠性在很大程度上依赖于少量示例的选择和高质量推理的构建。标准检索方法通常基于语义相似性选择示例,这往往无法捕捉到遵循评分标准所需的微妙决策边界。此外,手动构建指导这些模型所需的专家推理可能成为一个重要瓶颈。为了解决这些局限性,我们提出了GUIDE(使用迭代设计示例进行评分),这是一个将自动评分中的示例选择和优化重新框定为以边界为中心的优化问题的框架。GUIDE在选择和优化的连续循环中运行,采用新颖的对比操作符来识别“边界对”,这些对在语义上相似但具有不同的评分。我们通过生成区分性推理来增强示例,明确阐述为什么某个回答会获得特定分数而排除相邻分数。针对物理、化学和教育内容知识的数据集进行的大量实验表明,GUIDE显著优于标准检索基线。通过将模型的注意力集中在评分标准的精确边缘,我们的方法在边界案例上显示出异常稳健的提升,并改善了评分标准的遵循。GUIDE为与人类教学标准紧密对齐的可信、可扩展的评估系统铺平了道路。
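The "boundary pair" idea, responses that are semantically similar yet received different grades, can be sketched as a simple contrastive mining pass over embedded exemplars. The threshold, embeddings, and names are illustrative assumptions, not GUIDE's published operators:

```python
# Hypothetical sketch of boundary-pair mining: scan graded exemplars for
# pairs with high embedding similarity but different grades -- the cases
# that sit on a rubric decision boundary.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def boundary_pairs(examples, sim_threshold=0.8):
    """examples: list of (embedding, grade) tuples. Returns index pairs
    of semantically similar responses with differing grades."""
    pairs = []
    for i in range(len(examples)):
        for j in range(i + 1, len(examples)):
            (ei, gi), (ej, gj) = examples[i], examples[j]
            if gi != gj and cosine(ei, ej) >= sim_threshold:
                pairs.append((i, j))
    return pairs
```

Such pairs are exactly where plain similarity-based retrieval misleads the grader, and where a discriminative rationale explaining why one response scores higher than its near-duplicate earns its keep.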
cs.AI / 12 / 2603.00469
Why Not? Solver-Grounded Certificates for Explainable Mission Planning
为什么不呢?基于求解器的可解释任务规划证书
Abstract
Operators of Earth observation satellites need justifications for scheduling decisions: why a request was selected, rejected, or what changes would make it schedulable. Existing approaches construct post-hoc reasoning layers independent of the optimizer, risking non-causal attributions, incomplete constraint conjunctions, and solver-path dependence. We take a faithfulness-first approach: every explanation is a certificate derived from the optimization model itself: minimal infeasible subsets for rejections, tight constraints and contrastive trade-offs for selections, and inverse solves for what-if queries. On a scheduling instance with structurally distinct constraint interactions, certificates achieve perfect soundness with respect to the solver's constraint model (15/15 cited-constraint checks), counterfactual validity (7/7), and stability (Jaccard = 1.0 across 28 seed-pairs), while a post-hoc baseline produces non-causal attributions in 29% of cases and misses constraint conjunctions in every multi-cause rejection. A scalability analysis up to 200 orders and 30 satellites confirms practical extraction times for operational batches.
Chinese Translation
地球观测卫星的操作员需要对调度决策提供合理解释:为何选择或拒绝某个请求,或者需要什么变化才能使其可调度。现有方法构建与优化器独立的事后推理层,可能导致非因果归因、不完整的约束合取以及求解器路径依赖。我们采取以忠实性为先的方法:每个解释都是从优化模型本身派生的证书:对拒绝给出最小不可行子集,对选择给出紧约束和对比权衡,对假设查询给出逆求解。在一个具有结构上不同约束交互的调度实例中,证书相对于求解器的约束模型实现了完美的健全性(15/15 引用约束检查)、反事实有效性(7/7)和稳定性(28 对种子的 Jaccard = 1.0),而事后基线在29%的案例中产生了非因果归因,并在每个多因拒绝中遗漏了约束合取。对多达200个订单和30颗卫星的可扩展性分析确认了操作批次的实际提取时间是可行的。
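The minimal-infeasible-subset certificates the abstract attaches to rejections can be extracted with the classic deletion filter. A minimal sketch, assuming a solver oracle `is_feasible`; the toy interval domain stands in for the real scheduling model:

```python
# Illustrative deletion filter for a minimal infeasible subset (MIS):
# drop each constraint in turn, keeping it only if its removal would
# make the remainder feasible (i.e., it is necessary for infeasibility).
# `is_feasible` stands in for a call to the real solver.

def minimal_infeasible_subset(constraints, is_feasible):
    """Shrink an infeasible constraint list to a minimal core."""
    core = list(constraints)
    i = 0
    while i < len(core):
        trial = core[:i] + core[i + 1:]
        if is_feasible(trial):
            i += 1           # needed for infeasibility: keep, move on
        else:
            core = trial     # still infeasible without it: drop it
    return core

def intervals_feasible(intervals):
    """Toy solver: (lo, hi) bounds on one variable are jointly feasible
    iff their intersection is nonempty."""
    return (not intervals) or (
        max(lo for lo, _ in intervals) <= min(hi for _, hi in intervals))
```

For instance, of the bounds `(0, 10)`, `(5, 15)`, `(12, 20)` only the first and last actually conflict, and the filter returns exactly that pair, a causally faithful "why not" certificate rather than a post-hoc guess.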
cs.AI / 13 / 2603.00472
From Goals to Aspects, Revisited: An NFR Pattern Language for Agentic AI Systems
从目标到方面的再探讨:面向自主智能系统的非功能性需求模式语言
Abstract
Agentic AI systems exhibit numerous crosscutting concerns -- security, observability, cost management, fault tolerance -- that are poorly modularized in current implementations, contributing to the high failure rate of AI projects in reaching production. The goals-to-aspects methodology proposed at RE 2004 demonstrated that aspects can be systematically discovered from i* goal models by identifying non-functional soft-goals that crosscut functional goals. This paper revisits and extends that methodology to the agentic AI domain. We present a pattern language of 12 reusable patterns organized across four NFR categories (security, reliability, observability, cost management), each mapping an i* goal model to a concrete aspect implementation using an AOP framework for Rust. Four patterns address agent-specific crosscutting concerns absent from traditional AOP literature: tool-scope sandboxing, prompt injection detection, token budget management, and action audit trails. We extend the V-graph model to capture how agent tasks simultaneously contribute to functional goals and non-functional soft-goals. We validate the pattern language through a case study analyzing an open-source autonomous agent framework, demonstrating how goal-driven aspect discovery systematically identifies and modularizes crosscutting concerns. The pattern language offers a principled approach for engineering reliable agentic AI systems through early identification of crosscutting concerns.
Chinese Translation
自主智能系统表现出许多交叉关注点——安全性、可观察性、成本管理、容错性——这些在当前实现中模块化效果不佳,导致AI项目投入生产的失败率居高不下。2004年RE会议提出的目标到方面的方法论展示了如何通过识别与功能目标交叉的非功能性软目标,从i*目标模型中系统性地发现方面。本文将该方法论重新审视并扩展至自主智能领域。我们提出了一种包含12个可重用模式的模式语言,这些模式按四个非功能性需求(NFR)类别(安全性、可靠性、可观察性、成本管理)组织,每个模式将i*目标模型映射到使用Rust的面向方面编程(AOP)框架的具体方面实现。其中四个模式解决了传统AOP文献中缺失的特定于代理的交叉关注点:工具范围沙箱、提示注入检测、令牌预算管理和行动审计轨迹。我们扩展了V-图模型,以捕捉代理任务如何同时对功能目标和非功能性软目标作出贡献。通过分析一个开源自主代理框架的案例研究,我们验证了该模式语言,展示了目标驱动的方面发现如何系统性地识别和模块化交叉关注点。该模式语言为通过早期识别交叉关注点来工程化可靠的自主智能系统提供了一种原则性的方法。
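A crosscutting concern like the paper's "token budget management" pattern can be sketched as advice woven around every tool call. The paper targets an AOP framework for Rust; the Python decorator below is only an analogy, and all names are hypothetical:

```python
# Analogy for the "token budget management" aspect: advice that runs
# before each agent tool call and refuses once the budget is spent,
# without the tool code itself knowing about budgets.
import functools

class BudgetExceeded(RuntimeError):
    pass

def token_budget(limit):
    """Decorator factory: shared budget state plus before-advice."""
    spent = {"tokens": 0}
    def aspect(fn):
        @functools.wraps(fn)
        def wrapper(prompt, *args, **kwargs):
            cost = len(prompt.split())  # crude whitespace token estimate
            if spent["tokens"] + cost > limit:
                raise BudgetExceeded(f"budget {limit} exhausted")
            spent["tokens"] += cost
            return fn(prompt, *args, **kwargs)
        return wrapper
    return aspect

@token_budget(limit=5)
def call_tool(prompt):
    """Stand-in for any agent tool invocation."""
    return f"ok: {prompt}"
```

The point of the pattern is exactly this separation: the budget policy lives in one module and applies uniformly, instead of being scattered through every tool implementation.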
cs.AI / 14 / 2603.00490
LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks
LifeEval:用于自我中心日常生活任务的助理人工智能的多模态基准
Abstract
The rapid progress of Multimodal Large Language Models (MLLMs) marks a significant step toward artificial general intelligence, offering great potential for augmenting human capabilities. However, their ability to provide effective assistance in dynamic, real-world environments remains largely underexplored. Existing video benchmarks predominantly assess passive understanding through retrospective analysis or isolated perception tasks, failing to capture the interactive and adaptive nature of real-time user assistance. To bridge this gap, we introduce LifeEval, a multimodal benchmark designed to evaluate real-time, task-oriented human-AI collaboration in daily life from an egocentric perspective. LifeEval emphasizes three key aspects: task-oriented holistic evaluation, egocentric real-time perception from continuous first-person streams, and human-assistant collaborative interaction through natural dialogues. Constructed via a rigorous annotation pipeline, the benchmark comprises 4,075 high-quality question-answer pairs across 6 core capability dimensions. Extensive evaluations of 26 state-of-the-art MLLMs on LifeEval reveal substantial challenges in achieving timely, effective and adaptive interaction, highlighting essential directions for advancing human-centered interactive intelligence.
Chinese Translation
多模态大型语言模型(MLLMs)的快速进展标志着朝向人工通用智能的重要一步,展现出增强人类能力的巨大潜力。然而,它们在动态现实环境中提供有效帮助的能力仍然未得到充分探索。现有的视频基准主要通过回顾性分析或孤立的感知任务评估被动理解,未能捕捉实时用户辅助的互动和适应性特征。为填补这一空白,我们提出了LifeEval,这是一个旨在从自我中心视角评估日常生活中实时、任务导向的人机协作的多模态基准。LifeEval强调三个关键方面:任务导向的整体评估、来自连续第一人称视角流的自我中心实时感知,以及通过自然对话进行的人机协作互动。该基准通过严格的标注流程构建,包含6个核心能力维度的4,075对高质量问答对。对26个最先进的MLLMs在LifeEval上的广泛评估揭示了在实现及时、有效和适应性互动方面的重大挑战,突显了推进以人为中心的互动智能的必要方向。
cs.AI / 15 / 2603.00495
AI Runtime Infrastructure
人工智能运行时基础设施
Abstract
We introduce AI Runtime Infrastructure, a distinct execution-time layer that operates above the model and below the application, actively observing, reasoning over, and intervening in agent behavior to optimize task success, latency, token efficiency, reliability, and safety while the agent is running. Unlike model-level optimizations or passive logging systems, runtime infrastructure treats execution itself as an optimization surface, enabling adaptive memory management, failure detection, recovery, and policy enforcement over long-horizon agent workflows.
Chinese Translation
我们介绍了人工智能运行时基础设施,这是一个独特的执行时层,位于模型之上、应用程序之下,能够在代理运行期间主动观察、推理并干预代理行为,以优化任务成功率、延迟、令牌效率、可靠性和安全性。与模型级优化或被动日志系统不同,运行时基础设施将执行本身视为优化表面,使得在长程代理工作流中能够实现自适应内存管理、故障检测、恢复和策略执行。
cs.AI / 16 / 2603.00532
DenoiseFlow: Uncertainty-Aware Denoising for Reliable LLM Agentic Workflows
DenoiseFlow:面向可靠大型语言模型代理工作流程的考虑不确定性的去噪
Abstract
Autonomous agents are increasingly entrusted with complex, long-horizon tasks, ranging from mathematical reasoning to software generation. While agentic workflows facilitate these tasks by decomposing them into multi-step reasoning chains, reliability degrades significantly as the sequence lengthens. Specifically, minor interpretation errors in natural-language instructions tend to compound silently across steps. We term this failure mode accumulated semantic ambiguity. Existing approaches to mitigate this often lack runtime adaptivity, relying instead on static exploration budgets, reactive error recovery, or single-path execution that ignores uncertainty entirely. We formalize the multi-step reasoning process as a Noisy MDP and propose DenoiseFlow, a closed-loop framework that performs progressive denoising through three coordinated stages: (1) Sensing estimates per-step semantic uncertainty; (2) Regulating adaptively allocates computation by routing between fast single-path execution and parallel exploration based on estimated risk; and (3) Correcting performs targeted recovery via influence-based root-cause localization. Online self-calibration continuously aligns decision boundaries with verifier feedback, requiring no ground-truth labels. Experiments on six benchmarks spanning mathematical reasoning, code generation, and multi-hop QA show that DenoiseFlow achieves the highest accuracy on every benchmark (83.3% average, +1.3% over the strongest baseline) while reducing cost by 40--56\% through adaptive branching. Detailed ablation studies further confirm the framework's robustness and generality. Code is available at https://anonymous.4open.science/r/DenoiseFlow-21D3/.
Chinese Translation
自主代理越来越多地被委托处理复杂的长期任务,这些任务范围从数学推理到软件生成。虽然代理工作流程通过将任务分解为多步推理链来促进这些任务的完成,但随着序列长度的增加,可靠性显著下降。具体而言,自然语言指令中的微小解释错误往往在各个步骤中悄然累积。我们将这种失败模式称为累积语义模糊。现有的缓解方法通常缺乏运行时适应性,而是依赖于静态探索预算、反应式错误恢复或完全忽视不确定性的单路径执行。我们将多步推理过程形式化为噪声马尔可夫决策过程(Noisy MDP),并提出DenoiseFlow,这是一种闭环框架,通过三个协调阶段进行渐进式去噪:(1)感知:估计每一步的语义不确定性;(2)调节:根据估计的风险在快速单路径执行与并行探索之间路由,自适应地分配计算;(3)纠正:通过基于影响的根本原因定位进行有针对性的恢复。在线自我校准不断地将决策边界与验证者反馈对齐,无需真实标签。针对六个基准的实验,涵盖数学推理、代码生成和多跳问答,显示DenoiseFlow在每个基准上都实现了最高的准确率(平均83.3%,比最强基线高出1.3%),同时通过自适应分支将成本降低了40%至56%。详细的消融研究进一步确认了框架的鲁棒性和通用性。代码可在 https://anonymous.4open.science/r/DenoiseFlow-21D3/ 获取。
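The regulating stage, routing between fast single-path execution and parallel exploration by estimated risk, can be sketched as a small allocation rule. The threshold and branch count are illustrative assumptions, not values from the paper:

```python
# Hypothetical sketch of uncertainty-based routing: below the threshold,
# run a single fast path; above it, scale the number of parallel
# candidate continuations with how far uncertainty exceeds the threshold.

def allocate_branches(uncertainty: float, threshold: float = 0.5,
                      max_branches: int = 4) -> int:
    """Return how many candidate paths to sample at this reasoning step."""
    if uncertainty <= threshold:
        return 1  # fast single-path execution
    excess = min(1.0, (uncertainty - threshold) / (1.0 - threshold))
    return 1 + round(excess * (max_branches - 1))
```

In the full framework the threshold itself would be adjusted online by verifier feedback (the paper's self-calibration), so the routing boundary tracks the model's actual error profile rather than a fixed constant.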
cs.AI / 17 / 2603.00540
LOGIGEN: Logic-Driven Generation of Verifiable Agentic Tasks
LOGIGEN:基于逻辑驱动的可验证代理任务生成
Abstract
The evolution of Large Language Models (LLMs) from static instruction-followers to autonomous agents necessitates operating within complex, stateful environments to achieve precise state-transition objectives. However, this paradigm is bottlenecked by data scarcity, as existing tool-centric reverse-synthesis pipelines fail to capture the rigorous logic of real-world applications. We introduce \textbf{LOGIGEN}, a logic-driven framework that synthesizes verifiable training data based on three core pillars: \textbf{Hard-Compiled Policy Grounding}, \textbf{Logic-Driven Forward Synthesis}, and \textbf{Deterministic State Verification}. Specifically, a Triple-Agent Orchestration is employed: the \textbf{Architect} compiles natural-language policy into database constraints to enforce hard rules; the \textbf{Set Designer} initializes boundary-adjacent states to trigger critical policy conflicts; and the \textbf{Explorer} searches this environment to discover causal solution paths. This framework yields a dataset of 20,000 complex tasks across 8 domains, where validity is strictly guaranteed by checking exact state equivalence. Furthermore, we propose a verification-based training protocol where Supervised Fine-Tuning (SFT) on verifiable trajectories establishes compliance with hard-compiled policy, while Reinforcement Learning (RL) guided by dense state-rewards refines long-horizon goal achievement. On $\tau^2$-Bench, LOGIGEN-32B(RL) achieves a \textbf{79.5\% success rate}, substantially outperforming the base model (40.7\%). These results demonstrate that logic-driven synthesis combined with verification-based training effectively constructs the causally valid trajectories needed for next-generation agents.
Chinese Translation
大型语言模型(LLMs)从静态指令跟随者演变为自主代理,要求在复杂的、有状态的环境中操作,以实现精确的状态转移目标。然而,这一范式受到数据稀缺的瓶颈,因为现有的以工具为中心的逆向合成管道未能捕捉到现实世界应用的严格逻辑。我们提出了\textbf{LOGIGEN},一个基于逻辑驱动的框架,通过三个核心支柱合成可验证的训练数据:\textbf{硬编译政策基础}、\textbf{逻辑驱动的前向合成}和\textbf{确定性状态验证}。具体而言,采用三重代理协调:\textbf{架构师}将自然语言政策编译为数据库约束,以强制执行硬规则;\textbf{布景设计师}初始化边界相邻状态,以触发关键政策冲突;而\textbf{探索者}则在该环境中搜索以发现因果解决路径。该框架生成了一个涵盖8个领域的20,000个复杂任务的数据集,其中通过检查精确的状态等价性严格保证有效性。此外,我们提出了一种基于验证的训练协议,其中对可验证轨迹进行监督微调(SFT),以确保符合硬编译政策,而通过密集状态奖励指导的强化学习(RL)则优化长远目标的实现。在$\tau^2$-Bench上,LOGIGEN-32B(RL)实现了\textbf{79.5\%的成功率},显著优于基础模型(40.7\%)。这些结果表明,基于逻辑驱动的合成结合基于验证的训练有效地构建了下一代代理所需的因果有效轨迹。
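Deterministic state verification, checking exact state equivalence between the reached and goal environment states, can be sketched by comparing canonicalized snapshots. Representing state as nested dicts and treating collections as order-insensitive (reasonable for database rows) are assumptions; the abstract does not specify LOGIGEN's storage format:

```python
# Hypothetical sketch of exact state-equivalence checking: normalize a
# nested state snapshot so that key order and record order do not
# matter, then compare the canonical forms for strict equality.

def canonical(state):
    """Recursively normalize a state: dicts sorted by key, collections
    sorted by their canonicalized elements (assumed comparable)."""
    if isinstance(state, dict):
        return tuple(sorted((k, canonical(v)) for k, v in state.items()))
    if isinstance(state, (list, set, tuple)):
        return tuple(sorted(canonical(v) for v in state))
    return state

def states_equivalent(a, b):
    """A trajectory is valid iff the final state exactly matches the goal."""
    return canonical(a) == canonical(b)
```

Because validity is a strict equality check rather than an LLM judgment, every synthesized task's label is reproducible, which is what makes the resulting trajectories safe to train on.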
cs.AI / 18 / 2603.00546
Advancing Multimodal Judge Models through a Capability-Oriented Benchmark and MCTS-Driven Data Generation
通过能力导向基准和基于MCTS的数据生成推进多模态判断模型
Abstract
Using Multimodal Large Language Models (MLLMs) as judges to achieve precise and consistent evaluations has gradually become an emerging paradigm across various domains. Evaluating the capability and reliability of MLLM-as-a-judge systems is therefore essential for ensuring trustworthy assessment. Existing judge benchmarks categorize samples by task types but fail to capture the fundamental judgment capabilities required for reliable evaluation. In this work, we introduce M-JudgeBench, a ten-dimensional capability-oriented benchmark designed to comprehensively assess the judgment abilities of MLLMs. Our benchmark decomposes evaluation into pairwise Chain-of-Thought (CoT) comparison, length bias avoidance, and process error detection tasks, jointly covering ten fine-grained subtasks. This design enables diagnosis of model reliability across reasoning styles, response lengths, and cross-model variations. Systematic evaluation uncovers the systematic weaknesses in existing MLLM-as-a-judge systems. To address this issue, we further propose Judge-MCTS, a data construction framework generating pairwise reasoning trajectories with various correctness and length. Using Judge-MCTS, we construct an MCTS-augmented dataset and train M-Judger, a series of strong judge models. Extensive experiments demonstrate the superiority of M-Judger on existing judge benchmarks as well as M-JudgeBench. Overall, our work establishes a more principled foundation for evaluating MLLM-as-a-judge through M-JudgeBench and Judge-MCTS framework, paving the way for future research on judge model evaluation and capability-driven judge training.
Chinese Translation
使用多模态大型语言模型(MLLMs)作为评判者,以实现精确和一致的评估,逐渐成为各个领域的新兴范式。因此,评估MLLM作为评判系统的能力和可靠性对于确保可信的评估至关重要。现有的评判基准通过任务类型对样本进行分类,但未能捕捉到可靠评估所需的基本判断能力。在本研究中,我们引入了M-JudgeBench,这是一个十维能力导向的基准,旨在全面评估MLLM的判断能力。我们的基准将评估分解为成对的思维链(Chain-of-Thought, CoT)比较、长度偏差避免和过程错误检测任务,共同覆盖十个细粒度子任务。这一设计使得能够在推理风格、响应长度和跨模型变异中诊断模型的可靠性。系统评估揭示了现有MLLM作为评判系统的系统性弱点。为了解决这一问题,我们进一步提出了Judge-MCTS,一个数据构建框架,生成具有不同正确性和长度的成对推理轨迹。通过Judge-MCTS,我们构建了一个增强MCTS的数据集,并训练了M-Judger,一系列强大的评判模型。大量实验表明,M-Judger在现有评判基准和M-JudgeBench上具有优越性。总体而言,我们的工作通过M-JudgeBench和Judge-MCTS框架为评估MLLM作为评判者建立了更为原则性的基础,为未来的评判模型评估和能力驱动的评判训练研究铺平了道路。
cs.AI / 19 / 2603.00552
EMPA: Evaluating Persona-Aligned Empathy as a Process
EMPA:作为过程评估的个性化共情
Abstract
Evaluating persona-aligned empathy in LLM-based dialogue agents remains challenging. User states are latent, feedback is sparse and difficult to verify in situ, and seemingly supportive turns can still accumulate into trajectories that drift from persona-specific needs. We introduce EMPA, a process-oriented framework that evaluates persona-aligned support as sustained intervention rather than isolated replies. EMPA distills real interactions into controllable, psychologically grounded scenarios, couples them with an open-ended multi-agent sandbox that exposes strategic adaptation and failure modes, and scores trajectories in a latent psychological space by directional alignment, cumulative impact, and stability. The resulting signals and metrics support reproducible comparison and optimization of long-horizon empathic behavior, and they extend to other agent settings shaped by latent dynamics and weak, hard-to-verify feedback.
Chinese Translation
在基于大型语言模型(LLM)的对话代理中评估个性化共情仍然具有挑战性。用户状态是潜在的,反馈稀疏且难以现场验证,看似支持性的对话轮次仍可能积累成偏离个性化需求的轨迹。我们提出了EMPA,这是一个以过程为导向的框架,将个性化支持评估为持续干预,而非孤立的回复。EMPA将真实互动提炼为可控的、具有心理学基础的场景,并将其与一个开放式的多代理沙盒结合,以暴露战略适应和失败模式,并通过方向对齐、累积影响和稳定性在潜在心理空间中对轨迹进行评分。所产生的信号和指标支持对长程共情行为进行可重复的比较与优化,并可推广到其他由潜在动态与微弱、难以验证的反馈所塑造的代理设置。
cs.AI / 20 / 2603.00575
SWE-Hub: A Unified Production System for Scalable, Executable Software Engineering Tasks
SWE-Hub:一个统一的生产系统,用于可扩展的可执行软件工程任务
Abstract
Progress in software-engineering agents is increasingly constrained by the scarcity of executable, scalable, and realistic data for training and evaluation. This scarcity stems from three fundamental challenges in existing pipelines: environments are brittle and difficult to reproduce across languages; synthesizing realistic, system-level bugs at scale is computationally expensive; and existing data predominantly consists of short-horizon repairs, failing to capture long-horizon competencies like architectural consistency. We introduce \textbf{SWE-Hub}, an end-to-end system that operationalizes the data factory abstraction by unifying environment automation, scalable synthesis, and diverse task generation into a coherent production stack. At its foundation, the \textbf{Env Agent} establishes a shared execution substrate by automatically converting raw repository snapshots into reproducible, multi-language container environments with standardized interfaces. Built upon this substrate, \textbf{SWE-Scale} engine addresses the need for high-throughput generation, combining cross-language code analysis with cluster-scale validation to synthesize massive volumes of localized bug-fix instances. \textbf{Bug Agent} generates high-fidelity repair tasks by synthesizing system-level regressions involving cross-module dependencies, paired with user-like issue reports that describe observable symptoms rather than root causes. Finally, \textbf{SWE-Architect} expands the task scope from repair to creation by translating natural-language requirements into repository-scale build-a-repo tasks. By integrating these components, SWE-Hub establishes a unified production pipeline capable of continuously delivering executable tasks across the entire software engineering lifecycle.
Chinese Translation
软件工程代理的进展越来越受到可执行、可扩展和现实数据稀缺的限制。这种稀缺性源于现有流程中的三个基本挑战:环境脆弱且难以跨语言重现;在规模上合成现实的系统级错误计算成本高昂;现有数据主要由短期修复组成,未能捕捉到诸如架构一致性等长期能力。我们提出了\textbf{SWE-Hub},一个端到端系统,通过将环境自动化、可扩展合成和多样化任务生成统一为一个连贯的生产栈,来实现数据工厂抽象。在其基础上,\textbf{Env Agent} 通过自动将原始代码库快照转换为具有标准化接口的可重现多语言容器环境,建立了共享执行基础设施。基于这一基础,\textbf{SWE-Scale} 引擎满足了高吞吐量生成的需求,结合跨语言代码分析与集群规模验证,以合成大量本地化的错误修复实例。\textbf{Bug Agent} 通过合成涉及跨模块依赖的系统级回归,生成高保真修复任务,并配以描述可观察症状而非根本原因的用户式问题报告。最后,\textbf{SWE-Architect} 通过将自然语言需求转换为代码库规模的构建任务,将任务范围从修复扩展到创建。通过整合这些组件,SWE-Hub 建立了一个统一的生产管道,能够在整个软件工程生命周期中持续交付可执行任务。
cs.AI / 21 / 2603.00578
Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs
草稿思维:在长链推理大模型中学习高效推理
Abstract
Long chain-of-thought~(CoT) has become a dominant paradigm for enhancing the reasoning capability of large reasoning models~(LRMs); however, the performance gains often come with a substantial increase in reasoning budget. Recent studies show that existing CoT paradigms tend to induce systematic overthinking, unnecessarily coupling reasoning capability with reasoning cost. Most prior approaches reduce token usage through post hoc techniques such as token compression, truncation, or length penalties, without explicitly addressing the core mechanisms of reasoning. We propose \textbf{Draft-Thinking}, which guides models to first learn a concise \textit{draft-style} reasoning structure that retains only the critical reasoning steps. Through a \textit{progressive curriculum learning}, the model stably internalizes this efficient reasoning pattern as its capability scales. Moreover, Draft-Thinking introduces adaptive prompting, which elevates reasoning depth to a flexible, model-selectable behavior. Extensive experiments demonstrate that Draft-Thinking substantially reduces reasoning budget while largely preserving reasoning performance; for example, on MATH500, it achieves an 82.6\% reduction in reasoning budget at the cost of only a 2.6\% performance drop.
Chinese Translation
长链推理(Chain-of-Thought, CoT)已成为增强大型推理模型(Large Reasoning Models, LRM)推理能力的主导范式;然而,性能提升往往伴随着推理预算的显著增加。最近的研究表明,现有的 CoT 范式倾向于导致系统性的过度思考,不必要地将推理能力与推理成本耦合。大多数先前的方法通过后处理技术(如令牌压缩、截断或长度惩罚)减少令牌使用,而未明确解决推理的核心机制。我们提出了\textbf{草稿思维}(Draft-Thinking),该方法引导模型首先学习一个简洁的\textit{草稿式}推理结构,仅保留关键的推理步骤。通过\textit{渐进式课程学习},模型在能力扩展的过程中稳定地内化这一高效的推理模式。此外,草稿思维引入了自适应提示,将推理深度提升为灵活的、模型可选择的行为。大量实验表明,草稿思维显著降低了推理预算,同时在很大程度上保持了推理性能;例如,在 MATH500 上,它实现了推理预算减少 82.6\%,而性能仅下降 2.6\%。
cs.AI / 22 / 2603.00585
MicroVerse: A Preliminary Exploration Toward a Micro-World Simulation
MicroVerse:迈向微观世界模拟的初步探索
Abstract
Recent advances in video generation have opened new avenues for macroscopic simulation of complex dynamic systems, but their application to microscopic phenomena remains largely unexplored. Microscale simulation holds great promise for biomedical applications such as drug discovery, organ-on-chip systems, and disease mechanism studies, while also showing potential in education and interactive visualization. In this work, we introduce MicroWorldBench, a multi-level rubric-based benchmark for microscale simulation tasks. MicroWorldBench enables systematic, rubric-based evaluation through 459 unique expert-annotated criteria spanning multiple microscale simulation tasks (e.g., organ-level processes, cellular dynamics, and subcellular molecular interactions) and evaluation dimensions (e.g., scientific fidelity, visual quality, instruction following). MicroWorldBench reveals that current SOTA video generation models fail at microscale simulation, showing violations of physical laws, temporal inconsistency, and misalignment with expert criteria. To address these limitations, we construct MicroSim-10K, a high-quality, expert-verified simulation dataset. Leveraging this dataset, we train MicroVerse, a video generation model tailored for microscale simulation. MicroVerse can accurately reproduce complex microscale mechanisms. Our work is the first to introduce the concept of Micro-World Simulation and presents a proof of concept, paving the way for applications in biology, education, and scientific visualization. Our work demonstrates the potential of educational microscale simulations of biological mechanisms. Our data and code are publicly available at https://github.com/FreedomIntelligence/MicroVerse.
Chinese Translation
近期视频生成技术的进步为复杂动态系统的宏观模拟开辟了新的途径,但其在微观现象中的应用仍然未被充分探索。微观尺度模拟在生物医学应用中具有巨大潜力,例如药物发现、器官芯片系统和疾病机制研究,同时在教育和互动可视化方面也展现出潜力。在本研究中,我们介绍了MicroWorldBench,这是一个基于多层次评分标准的微观尺度模拟任务基准。MicroWorldBench通过459个独特的专家注释标准,涵盖多个微观尺度模拟任务(例如,器官级过程、细胞动态和亚细胞分子相互作用)和评估维度(例如,科学真实性、视觉质量、遵循指令)实现了系统化的评分标准评估。MicroWorldBench揭示了当前的最先进(SOTA)视频生成模型在微观尺度模拟中的不足,表现出对物理法则的违反、时间不一致性以及与专家标准的不一致。为了解决这些局限性,我们构建了MicroSim-10K,一个高质量、经过专家验证的模拟数据集。利用该数据集,我们训练了MicroVerse,一个针对微观尺度模拟的视频生成模型。MicroVerse能够准确再现复杂的微观机制。我们的工作首次引入了微观世界模拟的概念,并展示了概念验证,为生物学、教育和科学可视化的应用铺平了道路。我们的研究展示了教育性微观尺度生物机制模拟的潜力。我们的数据和代码可在 https://github.com/FreedomIntelligence/MicroVerse 获取。
cs.AI / 23 / 2603.00590
Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs
心中公平,行动中公平?一个面向统一多模态大语言模型理解与生成任务的同步基准
Abstract
As artificial intelligence (AI) is increasingly deployed across domains, ensuring fairness has become a core challenge. However, the field faces a "Tower of Babel'' dilemma: fairness metrics abound, yet their underlying philosophical assumptions often conflict, hindering unified paradigms-particularly in unified Multimodal Large Language Models (UMLLMs), where biases propagate systemically across tasks. To address this, we introduce the IRIS Benchmark, to our knowledge the first benchmark designed to synchronously evaluate the fairness of both understanding and generation tasks in UMLLMs. Enabled by our demographic classifier, ARES, and four supporting large-scale datasets, the benchmark is designed to normalize and aggregate arbitrary metrics into a high-dimensional "fairness space'', integrating 60 granular metrics across three dimensions-Ideal Fairness, Real-world Fidelity, and Bias Inertia & Steerability (IRIS). Through this benchmark, our evaluation of leading UMLLMs uncovers systemic phenomena such as the "generation gap'', individual inconsistencies like "personality splits'', and the "counter-stereotype reward'', while offering diagnostics to guide the optimization of their fairness capabilities. With its novel and extensible framework, the IRIS benchmark is capable of integrating evolving fairness metrics, ultimately helping to resolve the "Tower of Babel'' impasse. Project Page: https://iris-benchmark-web.vercel.app/
Chinese Translation
随着人工智能(AI)在各个领域的广泛应用,确保公平性已成为一个核心挑战。然而,该领域面临着“巴别塔”困境:公平性指标层出不穷,但其背后的哲学假设往往相互冲突,阻碍了统一范式的形成——特别是在统一多模态大语言模型(UMLLMs)中,偏见在任务间系统性传播。为了解决这一问题,我们引入了IRIS基准,这是我们所知的第一个旨在同步评估UMLLMs中理解和生成任务公平性的基准。该基准借助我们的人口属性分类器ARES和四个支持的大规模数据集,旨在将任意指标标准化并聚合到一个高维的“公平性空间”中,整合了60个跨越理想公平性、现实世界忠实度和偏见惯性与可引导性(IRIS)三个维度的细粒度指标。通过该基准,我们对领先的UMLLMs的评估揭示了系统性现象,如“生成差距”、个体不一致性(如“个性分裂”)和“反刻板印象奖励”,同时提供了指导其公平性能力优化的诊断工具。凭借其新颖且可扩展的框架,IRIS基准能够整合不断发展的公平性指标,最终帮助解决“巴别塔”困境。项目页面:https://iris-benchmark-web.vercel.app/
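The core mechanism IRIS describes, normalizing heterogeneous fairness metrics into a common scale and stacking them into one "fairness space" vector, can be sketched as follows. This is an illustrative reading of the abstract, not the benchmark's actual implementation; the metric names and value ranges below are hypothetical, not the benchmark's real 60 metrics.

```python
# Hypothetical sketch of normalizing arbitrary fairness metrics into a
# shared [0, 1] scale where 1.0 always means "fairer", then stacking
# them into one point in a high-dimensional "fairness space".

def normalize(value, lo, hi, higher_is_fairer=True):
    """Min-max normalize a raw metric so 1.0 always means 'fairer'."""
    scaled = (value - lo) / (hi - lo)
    return scaled if higher_is_fairer else 1.0 - scaled

def fairness_vector(raw_metrics, specs):
    """Aggregate arbitrary metrics into one point in fairness space."""
    return [normalize(raw_metrics[name], *specs[name]) for name in specs]

# Each spec gives the metric's raw range and its direction.
specs = {
    "demographic_parity_gap": (0.0, 1.0, False),  # lower gap = fairer
    "counter_stereotype_rate": (0.0, 1.0, True),
}
point = fairness_vector(
    {"demographic_parity_gap": 0.2, "counter_stereotype_rate": 0.6}, specs
)
print(point)  # [0.8, 0.6]
```

Because every metric lands in the same unit interval with the same orientation, metrics with conflicting underlying philosophies can at least be compared and aggregated side by side.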
cs.AI / 24 / 2603.00599
Heterophily-Agnostic Hypergraph Neural Networks with Riemannian Local Exchanger
具有黎曼局部交换器的异质性无关超图神经网络
Abstract
Hypergraphs are the natural description of higher-order interactions among objects, widely applied in social network analysis, cross-modal retrieval, etc. Hypergraph Neural Networks (HGNNs) have become the dominant solution for learning on hypergraphs. Traditional HGNNs are extended from message passing graph neural networks, following the homophily assumption, and thus struggle with the prevalent heterophilic hypergraphs that call for long-range dependence modeling. In this paper, we achieve heterophily-agnostic message passing through the lens of Riemannian geometry. The key insight lies in the connection between oversquashing and hypergraph bottleneck within the framework of Riemannian manifold heat flow. Building on this, we propose the novel idea of locally adapting the bottlenecks of different subhypergraphs. The core innovation of the proposed mechanism is the design of an adaptive local (heat) exchanger. Specifically, it captures the rich long-range dependencies via the Robin condition, and preserves the representation distinguishability via source terms, thereby enabling heterophily-agnostic message passing with theoretical guarantees. Based on this theoretical foundation, we present a novel Heat-Exchanger with Adaptive Locality for Hypergraph Neural Network (HealHGNN), designed as a node-hyperedge bidirectional systems with linear complexity in the number of nodes and hyperedges. Extensive experiments on both homophilic and heterophilic cases show that HealHGNN achieves the state-of-the-art performance.
Chinese Translation
超图是对象之间高阶交互的自然描述,广泛应用于社交网络分析、跨模态检索等领域。超图神经网络(HGNNs)已成为在超图上学习的主流解决方案。传统的HGNNs是从消息传递图神经网络扩展而来的,遵循同质性假设,因此在普遍存在的异质性超图中面临挑战,这些超图需要建模长程依赖关系。本文通过黎曼几何的视角实现了与异质性无关的消息传递。关键的洞察在于超图瓶颈与过度挤压之间的联系,这一联系在黎曼流形热流框架内得以体现。在此基础上,我们提出了局部适应不同子超图瓶颈的新颖想法。所提出机制的核心创新是设计了一种自适应局部(热)交换器。具体而言,它通过罗宾条件捕捉丰富的长程依赖关系,并通过源项保持表示的可区分性,从而实现了具有理论保证的与异质性无关的消息传递。基于这一理论基础,我们提出了一种新型的具有自适应局部性的热交换器超图神经网络(HealHGNN),该网络设计为节点-超边双向系统,且在节点和超边数量上具有线性复杂度。在同质性和异质性案例上的大量实验表明,HealHGNN达到了最先进的性能。
cs.AI / 25 / 2603.00608
Machine Learning Grade Prediction Using Students' Grades and Demographics
基于学生成绩和人口统计信息的机器学习成绩预测
Abstract
Student repetition in secondary education imposes significant resource burdens, particularly in resource-constrained contexts. Addressing this challenge, this study introduces a unified machine learning framework that simultaneously predicts pass/fail outcomes and continuous grades, a departure from prior research that treats classification and regression as separate tasks. Six models were evaluated: Logistic Regression, Decision Tree, and Random Forest for classification, and Linear Regression, Decision Tree Regressor, and Random Forest Regressor for regression, with hyperparameters optimized via exhaustive grid search. Using academic and demographic data from 4424 secondary school students, classification models achieved accuracies of up to 96%, while regression models attained a coefficient of determination of 0.70, surpassing baseline approaches. These results confirm the feasibility of early, data-driven identification of at-risk students and highlight the value of integrating dual-task prediction for more comprehensive insights. By enabling timely, personalized interventions, the framework offers a practical pathway to reducing grade repetition and optimizing resource allocation.
Chinese Translation
在中等教育中,学生留级给资源带来了沉重负担,尤其是在资源有限的环境中。为了解决这一挑战,本研究提出了一个统一的机器学习框架,该框架同时预测及格/不及格结果和连续成绩,这与之前将分类和回归视为独立任务的研究有所不同。我们评估了六种模型:用于分类的逻辑回归、决策树和随机森林,以及用于回归的线性回归、决策树回归器和随机森林回归器,所有模型的超参数通过穷举网格搜索进行了优化。使用来自4424名中学生的学术和人口统计数据,分类模型的准确率高达96%,而回归模型的决定系数达到0.70,超越了基线方法。这些结果确认了基于数据的早期识别高风险学生的可行性,并突显了整合双任务预测以获得更全面洞察的价值。通过实现及时、个性化的干预,该框架为减少留级和优化资源分配提供了一条切实可行的路径。
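The exhaustive grid search the study describes can be sketched in a few lines of stdlib Python. The paper itself presumably uses a library implementation such as scikit-learn's `GridSearchCV`; the scoring function below is a stand-in toy, and the parameter names are illustrative only.

```python
# Stdlib-only sketch of exhaustive hyperparameter grid search: try every
# combination in the Cartesian product and keep the best-scoring one.
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every hyperparameter combination; return (score, params)."""
    names = list(param_grid)
    best = None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        s = score_fn(params)
        if best is None or s > best[0]:
            best = (s, params)
    return best

# Toy scorer: pretend deeper trees with more estimators fit better.
score = lambda p: p["max_depth"] * 0.1 + p["n_estimators"] * 0.001
best_score, best_params = grid_search(
    {"max_depth": [3, 5, 7], "n_estimators": [50, 100]}, score
)
print(best_params)  # {'max_depth': 7, 'n_estimators': 100}
```

In practice the scorer would be cross-validated accuracy (for the classifiers) or the coefficient of determination (for the regressors) rather than a closed-form function.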
cs.AI / 26 / 2603.00623
TraceSIR: A Multi-Agent Framework for Structured Analysis and Reporting of Agentic Execution Traces
TraceSIR:一种用于结构化分析和报告代理执行轨迹的多智能体框架
Abstract
Agentic systems augment large language models with external tools and iterative decision making, enabling complex tasks such as deep research, function calling, and coding. However, their long and intricate execution traces make failure diagnosis and root cause analysis extremely challenging. Manual inspection does not scale, while directly applying LLMs to raw traces is hindered by input length limits and unreliable reasoning. Focusing solely on final task outcomes further discards critical behavioral information required for accurate issue localization. To address these issues, we propose TraceSIR, a multi-agent framework for structured analysis and reporting of agentic execution traces. TraceSIR coordinates three specialized agents: (1) StructureAgent, which introduces a novel abstraction format, TraceFormat, to compress execution traces while preserving essential behavioral information; (2) InsightAgent, which performs fine-grained diagnosis including issue localization, root cause analysis, and optimization suggestions; (3) ReportAgent, which aggregates insights across task instances and generates comprehensive analysis reports. To evaluate TraceSIR, we construct TraceBench, covering three real-world agentic scenarios, and introduce ReportEval, an evaluation protocol for assessing the quality and usability of analysis reports aligned with industry needs. Experiments show that TraceSIR consistently produces coherent, informative, and actionable reports, significantly outperforming existing approaches across all evaluation dimensions. Our project and video are publicly available at https://github.com/SHU-XUN/TraceSIR.
Chinese Translation
代理系统通过外部工具和迭代决策增强大型语言模型,使其能够执行深度研究、函数调用和编码等复杂任务。然而,它们长而复杂的执行轨迹使得故障诊断和根本原因分析变得极为困难。手动检查无法扩展,而直接将大型语言模型应用于原始轨迹则受到输入长度限制和不可靠推理的阻碍。仅关注最终任务结果进一步丢弃了准确定位问题所需的关键行为信息。为了解决这些问题,我们提出了TraceSIR,一个用于结构化分析和报告代理执行轨迹的多智能体框架。TraceSIR协调三个专门的智能体:(1) StructureAgent,引入了一种新颖的抽象格式TraceFormat,以在保留基本行为信息的同时压缩执行轨迹;(2) InsightAgent,进行细粒度诊断,包括问题定位、根本原因分析和优化建议;(3) ReportAgent,汇总任务实例的见解并生成全面的分析报告。为了评估TraceSIR,我们构建了TraceBench,涵盖三个真实世界的代理场景,并引入了ReportEval,一种评估分析报告质量和可用性的评估协议,以符合行业需求。实验表明,TraceSIR始终生成连贯、信息丰富且可操作的报告,在所有评估维度上显著优于现有方法。我们的项目和视频可在https://github.com/SHU-XUN/TraceSIR上公开获取。
cs.AI / 27 / 2603.00631
LiTS: A Modular Framework for LLM Tree Search
LiTS:一个用于LLM树搜索的模块化框架
Abstract
LiTS is a modular Python framework for LLM reasoning via tree search. It decomposes tree search into three reusable components (Policy, Transition, and RewardModel) that plug into algorithms like MCTS and BFS. A decorator-based registry enables domain experts to extend to new domains by registering components, and algorithmic researchers to implement custom search algorithms. We demonstrate composability on MATH500 (language reasoning), Crosswords (environment planning), and MapEval (tool use), showing that components and algorithms are orthogonal: components are reusable across algorithms within each task type, and algorithms work across all components and domains. We also report a mode-collapse finding: in infinite action spaces, LLM policy diversity (not reward quality) is the bottleneck for effective tree search. A demonstration video is available at https://youtu.be/nRGX43YrR3I. The package is released under the Apache 2.0 license at https://github.com/xinzhel/lits-llm, including installation instructions and runnable examples that enable users to reproduce the demonstrated workflows.
Chinese Translation
LiTS是一个用于通过树搜索进行LLM推理的模块化Python框架。它将树搜索分解为三个可重用的组件(Policy、Transition和RewardModel),这些组件可以插入到像MCTS和BFS这样的算法中。基于装饰器的注册机制使领域专家能够通过注册组件扩展到新领域,同时算法研究人员可以实现自定义搜索算法。我们在MATH500(语言推理)、Crosswords(环境规划)和MapEval(工具使用)上展示了可组合性,表明组件和算法是正交的:组件在每种任务类型的算法之间是可重用的,而算法可以在所有组件和领域中工作。我们还报告了一个模式崩溃的发现:在无限动作空间中,LLM策略多样性(而非奖励质量)是有效树搜索的瓶颈。演示视频可在https://youtu.be/nRGX43YrR3I观看。该软件包在https://github.com/xinzhel/lits-llm以Apache 2.0许可证发布,包括安装说明和可运行示例,用户可以通过这些示例重现所展示的工作流程。
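The decorator-based registry that LiTS uses to keep components and algorithms orthogonal follows a standard Python pattern, sketched below. The component names come from the abstract's Policy/RewardModel decomposition, but the class and method names here are illustrative guesses; the real API in github.com/xinzhel/lits-llm may differ.

```python
# Minimal sketch of a decorator-based component registry: domain experts
# register components under string keys, and search algorithms look them
# up by name without hard-coding any concrete class.
REGISTRY = {}

def register(name):
    """Decorator that records a component class under a string key."""
    def wrap(cls):
        REGISTRY[name] = cls
        return cls
    return wrap

@register("policy/random")
class RandomPolicy:
    def propose(self, state):
        return [state + "a", state + "b"]   # candidate next states

@register("reward/length")
class LengthReward:
    def score(self, state):
        return len(state)                   # toy reward: longer is better

# An algorithm (e.g. greedy/BFS) composes components purely by name.
policy = REGISTRY["policy/random"]()
reward = REGISTRY["reward/length"]()
best = max(policy.propose("x"), key=reward.score)
```

Because the algorithm touches components only through the registry and their small interfaces, the same `RandomPolicy` could be reused under MCTS, and a new domain only needs to register its own components.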
cs.AI / 28 / 2603.00656
InfoPO: Information-Driven Policy Optimization for User-Centric Agents
InfoPO:面向用户的代理的信息驱动策略优化
Abstract
Real-world user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods often rely on trajectory-level reward computation, which leads to credit assignment problems and insufficient advantage signals within rollout groups. A feasible approach is to identify valuable interaction turns at a fine granularity to drive more targeted learning. To address this, we introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as a process of active uncertainty reduction and computes an information-gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution compared to a masked-feedback counterfactual. It then combines this signal with task outcomes via an adaptive variance-gated fusion to identify information importance while maintaining task-oriented goal direction. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It also demonstrates robustness under user simulator shifts and generalizes effectively to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.
Chinese Translation
现实世界中,用户对大型语言模型(LLM)代理的请求往往不够明确。代理必须进行互动以获取缺失的信息,并做出正确的后续决策。然而,目前基于多轮GRPO的方法通常依赖于轨迹级奖励计算,这导致了信用分配问题以及rollout组内优势信号不足。一种可行的方法是以更细的粒度识别有价值的互动轮次,以驱动更有针对性的学习。为此,我们引入了InfoPO(信息驱动策略优化),将多轮互动构建为主动减少不确定性的过程,并计算信息增益奖励:若某一轮次的反馈相较于掩蔽反馈的反事实可测量地改变了代理后续的行动分布,该轮次便获得相应的信用。然后,它通过自适应方差门控融合将该信号与任务结果结合,以识别信息的重要性,同时保持面向任务的目标方向。在多种任务中,包括意图澄清、协作编码和工具增强决策,InfoPO始终优于提示和多轮强化学习基线。它在用户模拟器变化下也表现出鲁棒性,并能有效推广到环境交互任务。总体而言,InfoPO提供了一种有原则且可扩展的机制,用于优化复杂的代理-用户协作。代码可在 https://github.com/kfq20/InfoPO 获取。
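InfoPO's turn-level credit idea, rewarding a turn by how much its feedback shifts the agent's next-action distribution versus a masked-feedback counterfactual, can be sketched with a KL divergence. This is our reading of the abstract, not the paper's exact objective; in particular, the fusion gate below is a fixed constant, whereas the paper describes an adaptive variance-gated fusion.

```python
# Hedged sketch: information gain of a turn = KL divergence between the
# next-action distribution with the real feedback and the distribution
# under a masked-feedback counterfactual; fused with the task outcome.
import math

def kl(p, q):
    """KL divergence KL(p || q) over discrete action distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def turn_reward(p_with_feedback, p_masked, task_outcome, gate=0.5):
    info_gain = kl(p_with_feedback, p_masked)
    return gate * info_gain + (1.0 - gate) * task_outcome

# Feedback that barely changes the policy earns little extra credit...
low = turn_reward([0.5, 0.5], [0.5, 0.5], task_outcome=1.0)
# ...while feedback that sharply reshapes it earns more.
high = turn_reward([0.9, 0.1], [0.5, 0.5], task_outcome=1.0)
print(low < high)  # True
```

The point of the counterfactual comparison is that uninformative clarification turns get no credit even when the overall task succeeds, which sharpens credit assignment inside a rollout group.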
cs.AI / 29 / 2603.00676
K^2-Agent: Co-Evolving Know-What and Know-How for Hierarchical Mobile Device Control
K^2-Agent:为层次化移动设备控制协同演化“知晓什么”与“知晓如何”
Abstract
Existing mobile device control agents often perform poorly when solving complex tasks requiring long-horizon planning and precise operations, typically due to a lack of relevant task experience or unfamiliarity with skill execution. We propose K2-Agent, a hierarchical framework that models human-like cognition by separating and co-evolving declarative (knowing what) and procedural (knowing how) knowledge for planning and execution. K2-Agent's high level reasoner is bootstrapped from a single demonstration per task and runs a Summarize-Reflect-Locate-Revise (SRLR) loop to distill and iteratively refine task-level declarative knowledge through self-evolution. The low-level executor is trained with our curriculum-guided Group Relative Policy Optimization (C-GRPO), which (i) constructs a balanced sample pool using decoupled reward signals and (ii) employs dynamic demonstration injection to guide the model in autonomously generating successful trajectories for training. On the challenging AndroidWorld benchmark, K2-Agent achieves a 76.1% success rate using only raw screenshots and open-source backbones. Furthermore, K2-Agent shows powerful dual generalization: its high-level declarative knowledge transfers across diverse base models, while its low-level procedural skills achieve competitive performance on unseen tasks in ScreenSpot-v2 and Android-in-the-Wild (AitW).
Chinese Translation
现有的移动设备控制代理在解决需要长远规划和精确操作的复杂任务时通常表现不佳,这通常是由于缺乏相关任务经验或对技能执行的不熟悉。我们提出了K^2-Agent,一个层次化框架,通过分离和协同演化声明性知识(知晓什么)和程序性知识(知晓如何)来模拟类人认知,以便进行规划和执行。K^2-Agent的高层推理器以每个任务仅一个演示进行自举启动,并运行总结-反思-定位-修正(Summarize-Reflect-Locate-Revise, SRLR)循环,通过自我演化提炼和迭代优化任务级的声明性知识。低层执行器则采用我们的课程引导的组相对策略优化(Curriculum-guided Group Relative Policy Optimization, C-GRPO)进行训练,该方法(i)使用解耦的奖励信号构建平衡的样本池,并且(ii)采用动态演示注入来指导模型自主生成可用于训练的成功轨迹。在具有挑战性的AndroidWorld基准测试中,K^2-Agent仅使用原始屏幕截图和开源骨干网络就达到了76.1%的成功率。此外,K^2-Agent展现出强大的双重泛化能力:其高层声明性知识能够在多样的基础模型间迁移,而其低层程序性技能在ScreenSpot-v2和Android-in-the-Wild(AitW)等未见任务中也表现出竞争力。
cs.AI / 30 / 2603.00680
MemPO: Self-Memory Policy Optimization for Long-Horizon Agents
MemPO:面向长时间跨度智能体的自我记忆策略优化
Abstract
Long-horizon agents face the challenge of growing context size during interaction with environment, which degrades the performance and stability. Existing methods typically introduce the external memory module and look up the relevant information from the stored memory, which prevents the model itself from proactively managing its memory content and aligning with the agent's overarching task objectives. To address these limitations, we propose the self-memory policy optimization algorithm (MemPO), which enables the agent (policy model) to autonomously summarize and manage their memory during interaction with environment. By improving the credit assignment mechanism based on memory effectiveness, the policy model can selectively retain crucial information, significantly reducing token consumption while preserving task performance. Extensive experiments and analyses confirm that MemPO achieves absolute F1 score gains of 25.98% over the base model and 7.1% over the previous SOTA baseline, while reducing token usage by 67.58% and 73.12%.
Chinese Translation
长时间跨度的智能体在与环境交互时面临上下文规模不断增长的挑战,这会降低其性能和稳定性。现有方法通常引入外部记忆模块,并从存储的记忆中查找相关信息,这使模型本身无法主动管理其记忆内容,也难以与智能体的整体任务目标保持一致。为了解决这些局限性,我们提出了自我记忆策略优化算法(MemPO),使智能体(策略模型)能够在与环境交互时自主总结和管理其记忆。通过改进基于记忆有效性的信用分配机制,策略模型可以选择性地保留关键信息,显著减少令牌消耗,同时保持任务性能。大量实验和分析证实,MemPO相对基础模型实现了25.98%的绝对F1分数提升,相对之前的最优基线(SOTA)提升7.1%,同时令牌使用量分别减少了67.58%和73.12%。
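The behavior MemPO trains, an agent compacting its own memory under a token budget while retaining what matters, can be illustrated with a fixed retention rule. This is not MemPO's actual algorithm (MemPO learns the retention policy via RL with memory-effectiveness credit assignment); the usefulness score below is a hand-written stand-in.

```python
# Illustrative sketch only: keep the most useful memory entries whose
# total token count fits within a budget. MemPO would *learn* the
# usefulness ordering; here it is a hypothetical fixed heuristic.
def compact_memory(memory, budget, usefulness):
    """Greedily keep high-usefulness entries until the budget is spent."""
    kept, used = [], 0
    for entry in sorted(memory, key=usefulness, reverse=True):
        tokens = len(entry.split())          # crude token count
        if used + tokens <= budget:
            kept.append(entry)
            used += tokens
    return kept

memory = [
    "user wants report in French",          # crucial task preference
    "small talk about the weather today",   # disposable filler
]
kept = compact_memory(memory, budget=6, usefulness=lambda e: "wants" in e)
print(kept)  # ['user wants report in French']
```

The contrast with retrieval-based external memory is that the policy itself decides what to keep, so retention can be optimized jointly with task reward.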
cs.AI / 31 / 2603.00691
AIoT-based Continuous, Contextualized, and Explainable Driving Assessment for Older Adults
基于AIoT的持续、情境化和可解释的老年人驾驶评估
Abstract
The world is undergoing a major demographic shift as older adults become a rapidly growing share of the population, creating new challenges for driving safety. In car-dependent regions such as the United States, driving remains essential for independence, access to services, and social participation. At the same time, aging can introduce gradual changes in vision, attention, reaction time, and driving control that quietly reduce safety. Today's assessment methods rely largely on infrequent clinic visits or simple screening tools, offering only a brief snapshot and failing to reflect how an older adult actually drives on the road. Our work starts from the observation that everyday driving provides a continuous record of functional ability and captures how a driver responds to traffic, navigates complex roads, and manages routine behavior. Leveraging this insight, we propose AURA, an Artificial Intelligence of Things (AIoT) framework for continuous, real-world assessment of driving safety among older adults. AURA integrates richer in-vehicle sensing, multi-scale behavioral modeling, and context-aware analysis to extract detailed indicators of driving performance from routine trips. It organizes fine-grained actions into longer behavioral trajectories and separates age-related performance changes from situational factors such as traffic, road design, or weather. By integrating sensing, modeling, and interpretation within a privacy-preserving edge architecture, AURA provides a foundation for proactive, individualized support that helps older adults drive safely. This paper outlines the design principles, challenges, and research opportunities needed to build reliable, real-world monitoring systems that promote safer aging behind the wheel.
Chinese Translation
随着老年人群体在全球人口中迅速增长,世界正经历重大的人口结构变化,这为驾驶安全带来了新的挑战。在像美国这样的汽车依赖地区,驾驶对于保持独立、获取服务和参与社会活动至关重要。同时,衰老可能导致视觉、注意力、反应时间和驾驶控制等方面的逐渐变化,从而悄然降低安全性。目前的评估方法主要依赖于不频繁的门诊访问或简单的筛查工具,仅提供短暂的快照,无法反映老年人在实际道路上的驾驶情况。我们的研究基于一个观察,即日常驾驶提供了功能能力的持续记录,并捕捉了驾驶者如何应对交通、导航复杂道路以及管理日常行为。基于这一洞察,我们提出了AURA,一个用于老年人驾驶安全的持续、真实世界评估的人工智能物联网(AIoT)框架。AURA整合了更丰富的车载传感器、多尺度行为建模和情境感知分析,从常规行程中提取详细的驾驶表现指标。它将细粒度的行为组织成更长的行为轨迹,并将与年龄相关的表现变化与交通、道路设计或天气等情境因素分开。通过在保护隐私的边缘架构中整合传感、建模和解释,AURA为主动、个性化的支持提供了基础,帮助老年人安全驾驶。本文概述了构建可靠的真实世界监测系统所需的设计原则、挑战和研究机会,以促进老年人在驾驶中的安全老龄化。
cs.AI / 32 / 2603.00730
MO-MIX: Multi-Objective Multi-Agent Cooperative Decision-Making With Deep Reinforcement Learning
MO-MIX:基于深度强化学习的多目标多智能体协同决策
Abstract
Deep reinforcement learning (RL) has been applied extensively to solve complex decision-making problems. In many real-world scenarios, tasks often have several conflicting objectives and may require multiple agents to cooperate, which are the multi-objective multi-agent decision-making problems. However, only few works have been conducted on this intersection. Existing approaches are limited to separate fields and can only handle multi-agent decision-making with a single objective, or multi-objective decision-making with a single agent. In this paper, we propose MO-MIX to solve the multi-objective multi-agent reinforcement learning (MOMARL) problem. Our approach is based on the centralized training with decentralized execution (CTDE) framework. A weight vector representing preference over the objectives is fed into the decentralized agent network as a condition for local action-value function estimation, while a mixing network with parallel architecture is used to estimate the joint action-value function. In addition, an exploration guide approach is applied to improve the uniformity of the final non-dominated solutions. Experiments demonstrate that the proposed method can effectively solve the multi-objective multi-agent cooperative decision-making problem and generate an approximation of the Pareto set. Our approach not only significantly outperforms the baseline method in all four kinds of evaluation metrics, but also requires less computational cost.
Chinese Translation
深度强化学习(RL)已被广泛应用于解决复杂的决策问题。在许多现实场景中,任务往往具有多个相互冲突的目标,并可能需要多个智能体进行合作,这就是多目标多智能体决策问题。然而,关于这一交叉领域的研究仍然较少。现有的方法局限于各自的领域,仅能处理单一目标的多智能体决策,或单一智能体的多目标决策。本文提出了MO-MIX,以解决多目标多智能体强化学习(MOMARL)问题。我们的方法基于集中训练与分散执行(CTDE)框架。一个表示对目标偏好的权重向量被输入到分散智能体网络中,作为局部动作价值函数估计的条件,同时使用具有并行架构的混合网络来估计联合动作价值函数。此外,应用了一种探索引导方法,以提高最终非支配解的均匀性。实验表明,所提出的方法能够有效解决多目标多智能体协同决策问题,并生成帕累托集的近似。我们的方法在所有四种评估指标上显著优于基线方法,同时所需的计算成本也更低。
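The role of MO-MIX's preference weight vector can be seen in a stripped-down form: a weight vector over objectives scalarizes per-objective action values, so changing the preference changes the greedy action. The paper estimates these values with conditioned agent networks and a parallel mixing network; the numbers below are hand-written toys, not learned values.

```python
# Toy sketch of preference-conditioned action selection: a weight vector
# over objectives turns a vector-valued Q estimate into a scalar score.
def scalarize(q_per_action, weights):
    """Weighted sum of per-objective values for each action."""
    return [sum(w * q for w, q in zip(weights, qs)) for qs in q_per_action]

def greedy_action(q_per_action, weights):
    scores = scalarize(q_per_action, weights)
    return scores.index(max(scores))

# Two actions, two objectives (e.g. speed vs. energy) for one agent.
q = [[1.0, 0.0],   # action 0: fast but energy-hungry
     [0.2, 0.9]]   # action 1: slow but efficient
print(greedy_action(q, [0.9, 0.1]))  # 0 (preference favors speed)
print(greedy_action(q, [0.1, 0.9]))  # 1 (preference favors energy)
```

Sweeping the weight vector over the simplex and collecting the resulting policies is what yields an approximation of the Pareto set.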
cs.AI / 33 / 2603.00801
The Synthetic Web: Adversarially-Curated Mini-Internets for Diagnosing Epistemic Weaknesses of Language Agents
合成网络:用于诊断语言代理的认知弱点的对抗性策划迷你互联网
Abstract
Language agents increasingly act as web-enabled systems that search, browse, and synthesize information from diverse sources. However, these sources can include unreliable or adversarial content, and the robustness of agents to adversarial ranking - where misleading information appears prominently in search results - remains poorly understood. Existing benchmarks evaluate functional navigation or static factuality but cannot causally isolate this vulnerability, and current mitigation strategies for retrieval-augmented generation remain largely untested under such conditions. We introduce Synthetic Web Benchmark, a procedurally generated environment comprising thousands of hyperlinked articles with ground-truth labels for credibility and factuality, process-level interaction traces, and contamination filtering to eliminate training-data leakage. By injecting a single high-plausibility misinformation article into a controllable search rank, we measure the causal effect of adversarial exposure in six frontier models. The results reveal catastrophic failures: accuracy collapses despite unlimited access to truthful sources, with minimal search escalation and severe miscalibration. These findings expose fundamental limitations in how current frontier models handle conflicting information, with immediate implications for deployment in high-stakes domains. Our benchmark enables systematic analysis of these failure modes and provides a controlled testbed for evaluating mitigation strategies under adversarial ranking - a gap in current research. This work establishes a reproducible baseline for developing search-robust and epistemically humble agents capable of resisting manipulation in high-stakes domains.
Chinese Translation
语言代理越来越多地作为具备联网能力的系统,搜索、浏览并综合来自多种来源的信息。然而,这些来源可能包含不可靠或对抗性的内容,而代理对对抗性排名(即误导性信息在搜索结果中显著出现)的鲁棒性仍然未被充分理解。现有基准评估功能性导航或静态事实性,但无法从因果上隔离这种脆弱性,而当前针对检索增强生成的缓解策略在此类条件下仍基本未经检验。我们引入了合成网络基准(Synthetic Web Benchmark),这是一个程序化生成的环境,包含数千篇相互超链接的文章,配有关于可信度和事实性的真实标签、过程级交互轨迹,以及用于消除训练数据泄漏的污染过滤。通过将一篇看似高度可信的错误信息文章注入可控的搜索排名位置,我们测量了对抗性暴露在六个前沿模型中的因果影响。结果揭示了灾难性的失败:尽管可以无限制地访问真实来源,准确率仍然崩溃,模型几乎不升级搜索,且校准严重失准。这些发现暴露了当前前沿模型处理冲突信息的根本局限,对其在高风险领域的部署具有直接影响。我们的基准使得对这些失败模式的系统分析成为可能,并提供了一个受控的测试平台,用于评估对抗性排名下的缓解策略,这是当前研究中的一个空白。这项工作为开发能够在高风险领域抵御操纵的、搜索鲁棒且认知谦逊的代理建立了可复现的基线。
cs.AI / 34 / 2603.00808
MetaMind: General and Cognitive World Models in Multi-Agent Systems by Meta-Theory of Mind
MetaMind:基于心智元理论的多智能体系统中的通用与认知世界模型
Abstract
A major challenge for world models in multi-agent systems is to understand interdependent agent dynamics, predict interactive multi-agent trajectories, and plan over long horizons with collective awareness, without centralized supervision or explicit communication. In this paper, MetaMind, a general and cognitive world model for multi-agent systems that leverages a novel meta-theory of mind (Meta-ToM) framework, is proposed. Through MetaMind, each agent learns not only to predict and plan over its own beliefs, but also to inversely reason goals and beliefs from its own behavior trajectories. This self-reflective, bidirectional inference loop enables each agent to learn a metacognitive ability in a self-supervised manner. Then, MetaMind is shown to generalize the metacognitive ability from first-person to third-person through analogical reasoning. Thus, in multi-agent systems, each agent with MetaMind can actively reason about goals and beliefs of other agents from limited, observable behavior trajectories in a zero-shot manner, and then adapt to emergent collective intention without an explicit communication mechanism. Extended simulation results on diverse multi-agent tasks demonstrate that MetaMind can achieve superior task performance and outperform baselines in few-shot multi-agent generalization.
Chinese Translation
多智能体系统中的世界模型面临的主要挑战是理解相互依赖的智能体动态、预测交互式多智能体轨迹,并在没有集中监督或明确通信的情况下进行长期规划。本文提出了MetaMind,一种针对多智能体系统的通用与认知世界模型,利用了一种新颖的心智元理论(Meta-ToM)框架。通过MetaMind,每个智能体不仅学习预测和规划自身的信念,还能够从自身的行为轨迹中逆向推理目标和信念。这种自我反思的双向推理循环使每个智能体能够以自我监督的方式学习元认知能力。接着,MetaMind被证明能够通过类比推理将元认知能力从第一人称推广到第三人称。因此,在多智能体系统中,具有MetaMind的每个智能体能够从有限的可观察行为轨迹中以零样本的方式主动推理其他智能体的目标和信念,并在没有明确通信机制的情况下适应新兴的集体意图。对多种多智能体任务的扩展仿真结果表明,MetaMind能够实现优越的任务表现,并在少样本多智能体泛化中超越基线。
cs.AI / 35 / 2603.00873
MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains
MC-Search:评估与增强具有结构化长推理链的多模态智能搜索
Abstract
With the increasing demand for step-wise, cross-modal, and knowledge-grounded reasoning, multimodal large language models (MLLMs) are evolving beyond the traditional fixed retrieve-then-generate paradigm toward more sophisticated agentic multimodal retrieval-augmented generation (MM-RAG). Existing benchmarks, however, mainly focus on simplified QA with short retrieval chains, leaving adaptive planning and multimodal reasoning underexplored. We present MC-Search, the first benchmark for agentic MM-RAG with long, step-wise annotated reasoning chains spanning five representative reasoning structures. Each example specifies sub-questions, retrieval modalities, supporting facts, and intermediate answers, with fidelity ensured by HAVE (Hop-wise Attribution and Verification of Evidence), resulting in 3,333 high-quality examples averaging 3.7 hops. Beyond answer accuracy, MC-Search introduces new process-level metrics for reasoning quality, stepwise retrieval and planning accuracy. By developing a unified agentic MM-RAG pipeline, we benchmark six leading MLLMs and reveal systematic issues such as over- and under-retrieval and modality-misaligned planning. Finally, we introduce Search-Align, a process-supervised fine-tuning framework leveraging verified reasoning chains, showing that our data not only enables faithful evaluation but also improves planning and retrieval fidelity in open-source MLLMs.
Chinese Translation
随着对逐步、跨模态和知识落地推理需求的增加,多模态大型语言模型(MLLMs)正在超越传统的固定“检索-生成”范式,向更复杂的智能体式多模态检索增强生成(MM-RAG)发展。然而,现有基准主要集中在检索链较短的简化问答(QA)上,使得自适应规划和多模态推理未得到充分探索。我们提出了MC-Search,这是第一个面向智能体式MM-RAG的基准,包含覆盖五种代表性推理结构的长程逐步注释推理链。每个示例指定了子问题、检索模态、支持事实和中间答案,并通过HAVE(Hop-wise Attribution and Verification of Evidence,逐跳证据归因与验证)确保保真度,最终得到3,333个高质量示例,平均3.7跳。除了答案准确性,MC-Search还引入了新的过程级指标,用于衡量推理质量、逐步检索和规划准确性。通过开发统一的智能体式MM-RAG管道,我们对六个领先的MLLMs进行了基准测试,并揭示了过度检索与检索不足、模态错配的规划等系统性问题。最后,我们引入了Search-Align,一个利用经过验证的推理链的过程监督微调框架,表明我们的数据不仅能够支持可信评估,还能提高开源MLLMs的规划和检索保真度。
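A process-level metric in the spirit of MC-Search's stepwise retrieval accuracy can be sketched as the fraction of hops where the predicted retrieval modality matches the annotation. The field values below are illustrative, not the benchmark's actual schema or scoring formula.

```python
# Minimal sketch of a hop-wise (process-level) metric: compare the
# predicted retrieval modality at each hop against the gold annotation.
def stepwise_accuracy(predicted_hops, gold_hops):
    """Fraction of hops whose predicted modality matches the gold one."""
    matches = sum(p == g for p, g in zip(predicted_hops, gold_hops))
    return matches / len(gold_hops)

gold = ["text", "image", "text", "table"]
pred = ["text", "text", "text", "table"]   # wrong modality on hop 2
print(stepwise_accuracy(pred, gold))  # 0.75
```

Unlike answer accuracy, this kind of metric localizes exactly which hop went wrong, which is what makes diagnoses like "modality-misaligned planning" possible.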
cs.AI / 36 / 2603.00876
BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning
BioProAgent:受限科学规划的神经符号基础
Abstract
Large language models (LLMs) have demonstrated significant reasoning capabilities in scientific discovery but struggle to bridge the gap to physical execution in wet-labs. In these irreversible environments, probabilistic hallucinations are not merely incorrect, but also cause equipment damage or experimental failure. To address this, we propose \textbf{BioProAgent}, a neuro-symbolic framework that anchors probabilistic planning in a deterministic Finite State Machine (FSM). We introduce a State-Augmented Planning mechanism that enforces a rigorous \textit{Design-Verify-Rectify} workflow, ensuring hardware compliance before execution. Furthermore, we address the context bottleneck inherent in complex device schemas by \textit{Semantic Symbol Grounding}, reducing token consumption by $\sim$6$\times$ through symbolic abstraction. In the extended BioProBench benchmark, BioProAgent achieves 95.6\% physical compliance (compared to 21.0\% for ReAct), demonstrating that neuro-symbolic constraints are essential for reliable autonomy in irreversible physical environments. \footnote{Code at https://github.com/YuyangSunshine/bioproagent and project at https://yuyangsunshine.github.io/BioPro-Project/}
Chinese Translation
大型语言模型(LLMs)在科学发现中展示了显著的推理能力,但难以跨越通往湿实验室物理执行的鸿沟。在这些不可逆的环境中,概率性幻觉不仅是错误的,还可能导致设备损坏或实验失败。为了解决这个问题,我们提出了 \textbf{BioProAgent},一个将概率规划锚定在确定性有限状态机(FSM)上的神经符号框架。我们引入了一种状态增强规划机制,强制执行严格的 \textit{设计-验证-修正} 工作流程,确保在执行前硬件的合规性。此外,我们通过 \textit{语义符号接地} 解决了复杂设备模式中固有的上下文瓶颈,通过符号抽象将令牌消耗减少了约6倍。在扩展的BioProBench基准测试中,BioProAgent实现了95.6%的物理合规性(相比之下,ReAct为21.0%),证明了神经符号约束对于不可逆物理环境中可靠自主性的重要性。\footnote{代码可在 https://github.com/YuyangSunshine/bioproagent 获取,项目网址为 https://yuyangsunshine.github.io/BioPro-Project/}
cs.AI / 37 / 2603.00977
HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents
HiMAC:用于长时间跨度 LLM 代理的层次宏微学习
Abstract
Large language model (LLM) agents have recently demonstrated strong capabilities in interactive decision-making, yet they remain fundamentally limited in long-horizon tasks that require structured planning and reliable execution. Existing approaches predominantly rely on flat autoregressive policies, where high-level reasoning and low-level actions are generated within a single token sequence, leading to inefficient exploration and severe error propagation over extended trajectories. In this work, we propose HiMAC, a hierarchical agentic RL framework that explicitly decomposes long-horizon decision-making into macro-level planning and micro-level execution. HiMAC models reasoning as a structured blueprint generation process followed by goal-conditioned action execution, enabling robust long-horizon planning within LLM-based agents. To train this hierarchy efficiently, we introduce a critic-free hierarchical policy optimization paradigm that extends group-based reinforcement learning to bi-level structures through hierarchical relative advantage estimation. Furthermore, we propose an iterative co-evolution training strategy that alternates between planner exploration and executor adaptation, mitigating the non-stationarity inherent in hierarchical learning. Extensive experiments on ALFWorld, WebShop, and Sokoban demonstrate that HiMAC consistently outperforms strong prompting and reinforcement learning baselines, achieving state-of-the-art performance and substantially improved sample efficiency across both text-based and visually grounded environments. Our results show that introducing structured hierarchy, rather than increasing model scale alone, is a key factor for enabling robust long-horizon agentic intelligence.
Chinese Translation
大型语言模型(LLM)代理最近在互动决策中展现出强大的能力,但在需要结构化规划和可靠执行的长时间跨度任务中仍然存在根本性的局限。现有方法主要依赖于平坦自回归策略,其中高层推理和低层动作在单一的令牌序列中生成,这导致了低效的探索和在延长轨迹上的严重错误传播。在本研究中,我们提出了 HiMAC,一种层次化的代理强化学习框架,它明确地将长时间跨度决策分解为宏观层面的规划和微观层面的执行。HiMAC 将推理建模为一个结构化蓝图生成过程,随后进行目标条件的动作执行,从而在基于 LLM 的代理中实现稳健的长时间跨度规划。为了高效地训练这一层次结构,我们引入了一种无评论员的层次策略优化范式,通过层次相对优势估计将基于群体的强化学习扩展到双层结构。此外,我们提出了一种迭代共同进化训练策略,在规划者探索和执行者适应之间交替进行,减轻了层次学习中固有的非平稳性。在 ALFWorld、WebShop 和 Sokoban 上的广泛实验表明,HiMAC 始终优于强提示和强化学习基线,在文本基础和视觉基础环境中均实现了最先进的性能和显著提高的样本效率。我们的结果表明,引入结构化层次,而不仅仅是增加模型规模,是实现稳健的长时间跨度代理智能的关键因素。
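The hierarchical relative advantage estimation described in the HiMAC abstract can be illustrated with a small sketch: GRPO-style group normalization applied at two levels, across macro plans and within each plan's rollouts. All names here (`hierarchical_advantages`, the plan labels, the specific grouping) are hypothetical; this is one plausible reading of the abstract, not the paper's estimator.

```python
from statistics import mean, pstdev

def group_relative_advantage(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward against its own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def hierarchical_advantages(episodes):
    """Bi-level relative advantages (illustrative sketch only).

    `episodes` maps each sampled macro plan to the list of episode returns
    obtained by executing it. Macro advantages compare plans against each
    other; micro advantages compare rollouts within the same plan.
    """
    plan_returns = {plan: mean(rs) for plan, rs in episodes.items()}
    macro_adv = dict(zip(plan_returns,
                         group_relative_advantage(list(plan_returns.values()))))
    micro_adv = {plan: group_relative_advantage(rs)
                 for plan, rs in episodes.items()}
    return macro_adv, micro_adv

episodes = {"plan_A": [1.0, 0.0, 1.0], "plan_B": [0.0, 0.0, 1.0]}
macro, micro = hierarchical_advantages(episodes)
```

The critic-free aspect is visible in that only sampled returns appear: no value network is fit at either level.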
cs.AI / 38 / 2603.00991
Tracking Capabilities for Safer Agents
更安全代理的跟踪能力
Abstract
AI agents that interact with the real world through tool calls pose fundamental safety challenges: agents might leak private information, cause unintended side effects, or be manipulated through prompt injection. To address these challenges, we propose to put the agent in a programming-language-based "safety harness": instead of calling tools directly, agents express their intentions as code in a capability-safe language: Scala 3 with capture checking. Capabilities are program variables that regulate access to effects and resources of interest. Scala's type system tracks capabilities statically, providing fine-grained control over what an agent can do. In particular, it enables local purity, the ability to enforce that sub-computations are side-effect-free, preventing information leakage when agents process classified data. We demonstrate that extensible agent safety harnesses can be built by leveraging a strong type system with tracked capabilities. Our experiments show that agents can generate capability-safe code with no significant loss in task performance, while the type system reliably prevents unsafe behaviors such as information leakage and malicious side effects.
Chinese Translation
与现实世界通过工具调用进行交互的人工智能代理面临着基本的安全挑战:代理可能泄露私人信息、造成意外的副作用,或通过提示注入被操控。为了解决这些挑战,我们建议将代理置于基于编程语言的“安全保护带”中:代理不直接调用工具,而是以能力安全语言中的代码表达其意图:使用具有捕获检查的 Scala 3。能力是程序变量,用于调节对感兴趣的效果和资源的访问。Scala 的类型系统静态跟踪能力,提供了对代理可以执行的操作的细粒度控制。特别是,它支持局部纯度,能够强制确保子计算无副作用,从而在代理处理机密数据时防止信息泄露。我们展示了可以通过利用具有跟踪能力的强类型系统构建可扩展的代理安全保护带。我们的实验表明,代理能够生成能力安全的代码,而任务性能没有显著下降,同时类型系统可靠地防止了信息泄露和恶意副作用等不安全行为。
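The paper's guarantees come from Scala 3's static capture checking, which cannot be reproduced in Python. As a rough dynamic analogue of the concept only, one can sketch capability tokens that gate tool calls, plus a `local_purity` scope that revokes them while classified data is processed. Every name below is invented for illustration.

```python
class Capability:
    """A token that grants access to one effect (network, filesystem, ...)."""
    def __init__(self, name):
        self.name = name
        self.enabled = True

class CapabilityRevoked(Exception):
    pass

def send_email(net_cap, recipient, body):
    """A tool that can only run while its capability is live."""
    if not net_cap.enabled:
        raise CapabilityRevoked(f"{net_cap.name} not available here")
    return f"sent to {recipient}"

class local_purity:
    """Context manager: revoke the given capabilities inside the block,
    so sub-computations over classified data cannot cause effects."""
    def __init__(self, *caps):
        self.caps = caps
    def __enter__(self):
        for c in self.caps:
            c.enabled = False
    def __exit__(self, *exc):
        for c in self.caps:
            c.enabled = True
        return False

net = Capability("network")
ok = send_email(net, "a@example.com", "hi")       # effect allowed
with local_purity(net):
    try:
        send_email(net, "a@example.com", "leak")  # effect blocked
        leaked = True
    except CapabilityRevoked:
        leaked = False
```

The key difference from the paper: here violations surface at runtime, whereas capture checking rejects the leaking program before it runs.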
cs.AI / 39 / 2603.00993
CollabEval: Enhancing LLM-as-a-Judge via Multi-Agent Collaboration
CollabEval:通过多智能体协作增强LLM作为评判者的能力
Abstract
Large Language Models (LLMs) have revolutionized AI-generated content evaluation, with the LLM-as-a-Judge paradigm becoming increasingly popular. However, current single-LLM evaluation approaches face significant challenges, including inconsistent judgments and inherent biases from pre-training data. To address these limitations, we propose CollabEval, a novel multi-agent evaluation framework that implements a three-phase Collaborative Evaluation process: initial evaluation, multi-round discussion, and final judgment. Unlike existing approaches that rely on competitive debate or single-model evaluation, CollabEval emphasizes collaboration among multiple agents with strategic consensus checking for efficiency. Our extensive experiments demonstrate that CollabEval consistently outperforms single-LLM approaches across multiple dimensions while maintaining robust performance even when individual models struggle. The framework provides comprehensive support for various evaluation criteria while ensuring efficiency through its collaborative design.
Chinese Translation
大型语言模型(LLMs)已经彻底改变了人工智能生成内容的评估,LLM作为评判者的范式变得越来越受欢迎。然而,当前的单一LLM评估方法面临着显著的挑战,包括判断不一致和来自预训练数据的固有偏见。为了解决这些局限性,我们提出了CollabEval,这是一种新颖的多智能体评估框架,实施了三阶段的协作评估过程:初步评估、多轮讨论和最终判断。与依赖竞争辩论或单一模型评估的现有方法不同,CollabEval强调多个智能体之间的协作,并通过战略共识检查提高效率。我们的广泛实验表明,CollabEval在多个维度上始终优于单一LLM方法,同时在个别模型表现不佳时仍能保持强大的性能。该框架为各种评估标准提供全面支持,同时通过其协作设计确保效率。
cs.AI / 40 / 2603.01055
MMCOMET: A Large-Scale Multimodal Commonsense Knowledge Graph for Contextual Reasoning
MMCOMET:用于情境推理的大规模多模态常识知识图谱
Abstract
We present MMCOMET, the first multimodal commonsense knowledge graph (MMKG) that integrates physical, social, and eventive knowledge. MMCOMET extends the ATOMIC2020 knowledge graph to include a visual dimension, through an efficient image retrieval process, resulting in over 900K multimodal triples. This new resource addresses a major limitation of existing MMKGs in supporting complex reasoning tasks like image captioning and storytelling. Through a standard visual storytelling experiment, we show that our holistic approach enables the generation of richer, coherent, and contextually grounded stories than those produced using text-only knowledge. This resource establishes a new foundation for multimodal commonsense reasoning and narrative generation.
Chinese Translation
我们提出了MMCOMET,这是第一个集成了物理、社会和事件知识的多模态常识知识图谱(MMKG)。MMCOMET扩展了ATOMIC2020知识图谱,增加了视觉维度,通过高效的图像检索过程,生成了超过90万个多模态三元组。这个新资源解决了现有MMKG在支持复杂推理任务(如图像描述和故事讲述)方面的主要局限性。通过标准的视觉故事讲述实验,我们展示了我们整体方法能够生成比仅使用文本知识所产生的故事更丰富、更连贯且更具情境基础的故事。该资源为多模态常识推理和叙事生成奠定了新的基础。
cs.AI / 41 / 2603.01092
Alien Science: Sampling Coherent but Cognitively Unavailable Research Directions from Idea Atoms
外星科学:从思想原子中采样连贯但认知上不可得的研究方向
Abstract
Large language models are adept at synthesizing and recombining familiar material, yet they often fail at a specific kind of creativity that matters most in research: producing ideas that are both coherent and non-obvious to the current community. We formalize this gap through cognitive availability, the likelihood that a research direction would be naturally proposed by a typical researcher given what they have worked on. We introduce a pipeline that (i) decomposes papers into granular conceptual units, (ii) clusters recurring units into a shared vocabulary of idea atoms, and (iii) learns two complementary models: a coherence model that scores whether a set of atoms constitutes a viable direction, and an availability model that scores how likely that direction is to be generated by researchers drawn from the community. We then sample "alien" directions that score high on coherence but low on availability. On a corpus of $\sim$7,500 recent LLM papers from NeurIPS, ICLR and ICML, we validate that (a) conceptual units preserve paper content under reconstruction, (b) idea atoms generalize across papers rather than memorizing paper-specific phrasing, and (c) the Alien sampler produces research directions that are more diverse than LLM baselines while maintaining coherence.
Chinese Translation
大型语言模型擅长合成和重组熟悉的材料,但它们在研究中最重要的一种创造力上常常表现不佳:产生既连贯又对当前社区而言不明显的想法。我们通过认知可用性这一概念来形式化这一差距,即在研究者所从事的工作基础上,自然提出某一研究方向的可能性。我们引入了一个流程,该流程 (i) 将论文分解为细粒度的概念单元,(ii) 将重复出现的单元聚类为共享的思想原子词汇,(iii) 学习两个互补模型:一个连贯性模型,用于评分一组原子是否构成一个可行的方向,以及一个可用性模型,用于评分该方向被来自社区的研究者生成的可能性。然后,我们采样“外星”方向,这些方向在连贯性上得分高,但在可用性上得分低。在来自 NeurIPS、ICLR 和 ICML 的约 7500 篇近期 LLM 论文的语料库上,我们验证了 (a) 概念单元在重构过程中保留了论文内容,(b) 思想原子在论文之间具有泛化能力,而不是记忆特定论文的措辞,以及 (c) 外星采样器生成的研究方向比 LLM 基准更具多样性,同时保持连贯性。
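The "alien" sampling step above reduces to filtering combinations of idea atoms by two scorers. The sketch below uses hard-coded stand-in scorers (the paper learns both models); the atom names, thresholds, and the toy `availability` table are all assumptions.

```python
import itertools

ATOMS = ["RLHF", "symbolic-planning", "protein-folding", "KV-cache", "tokenizer"]

def coherence(atoms):
    """Stand-in for the learned coherence model: scores a candidate direction."""
    return 1.0 if len(atoms) == 2 else 0.3

def availability(atoms):
    """Stand-in for the learned availability model: likelihood a typical
    researcher would naturally propose this combination."""
    common = {("KV-cache", "RLHF"), ("RLHF", "tokenizer")}
    return 0.9 if tuple(sorted(atoms)) in common else 0.1

def alien_directions(atoms, k=2, c_min=0.8, a_max=0.2):
    """Keep combinations that score high on coherence, low on availability."""
    return [combo for combo in itertools.combinations(sorted(atoms), k)
            if coherence(combo) >= c_min and availability(combo) <= a_max]

found = alien_directions(ATOMS)
```

With real scorers, the same filter yields directions that are viable yet unlikely to be proposed by the community sampled.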
cs.AI / 42 / 2603.01106
DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage
DIVA-GRPO:通过难度自适应变体优势增强多模态推理
Abstract
Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long-chain reasoning without a critic, it often suffers from sparse rewards on difficult problems and advantage vanishing when group-level rewards are too consistent for overly easy or hard problems. Existing solutions (sample expansion, selective utilization, and indirect reward design) often fail to maintain enough variance in within-group reward distributions to yield clear optimization signals. To address this, we propose DIVA-GRPO, a difficulty-adaptive variant advantage method that adjusts variant difficulty distributions from a global perspective. DIVA-GRPO dynamically assesses problem difficulty, samples variants with appropriate difficulty levels, and calculates advantages across local and global groups using difficulty-weighted and normalized scaling. This alleviates reward sparsity and advantage vanishing while improving training stability. Extensive experiments on six reasoning benchmarks demonstrate that DIVA-GRPO outperforms existing approaches in training efficiency and reasoning performance. Code: https://github.com/Siaaaaaa1/DIVA-GRPO
Chinese Translation
基于群体相对策略优化(GRPO)的强化学习(RL)已成为增强多模态大型语言模型(MLLMs)推理能力的广泛采用的方法。虽然GRPO能够在没有评论者的情况下实现长链推理,但在处理困难问题时常常面临稀疏奖励的问题,并且当群体级奖励对于过于简单或困难的问题过于一致时,优势会消失。现有的解决方案(样本扩展、选择性利用和间接奖励设计)往往无法在组内奖励分布中维持足够的方差,以产生明确的优化信号。为了解决这一问题,我们提出了DIVA-GRPO,这是一种难度自适应变体优势方法,从全局角度调整变体难度分布。DIVA-GRPO动态评估问题难度,采样具有适当难度水平的变体,并使用难度加权和归一化缩放计算局部和全局组的优势。这减轻了奖励稀疏和优势消失的问题,同时提高了训练稳定性。在六个推理基准上的广泛实验表明,DIVA-GRPO在训练效率和推理性能方面优于现有方法。代码链接:https://github.com/Siaaaaaa1/DIVA-GRPO
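The difficulty-weighted, globally normalized advantage computation described in the DIVA-GRPO abstract can be sketched as below. The specific weighting scheme (`0.5 + difficulty`) and the two-stage local/global normalization are assumptions made for illustration, not the paper's exact formulas.

```python
from statistics import mean, pstdev

def difficulty(rewards):
    """Empirical difficulty: fraction of failed rollouts for one problem."""
    return 1.0 - mean(rewards)

def diva_advantages(groups, eps=1e-8):
    """Difficulty-weighted advantages over local and global groups (a sketch).

    `groups` maps problem id -> binary rollout rewards. Local advantages are
    GRPO-style; each is then scaled by an assumed difficulty weight and
    re-normalized against the global pool, so signals from easy and hard
    problems stay comparable rather than vanishing.
    """
    weighted, pool = {}, []
    for pid, rs in groups.items():
        mu, sd = mean(rs), pstdev(rs)
        local = [(r - mu) / (sd + eps) for r in rs]
        w = 0.5 + difficulty(rs)            # assumed weighting scheme
        weighted[pid] = [w * a for a in local]
        pool.extend(weighted[pid])
    g_mu, g_sd = mean(pool), pstdev(pool)
    return {pid: [(a - g_mu) / (g_sd + eps) for a in adv]
            for pid, adv in weighted.items()}

adv = diva_advantages({"easy": [1, 1, 0, 1], "hard": [0, 0, 1, 0]})
```

Note how the rare success on the hard problem receives a larger advantage than routine successes on the easy one.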
cs.AI / 43 / 2603.01121
HVR-Met: A Hypothesis-Verification-Replanning Agentic System for Extreme Weather Diagnosis
HVR-Met:一种用于极端天气诊断的假设验证重规划智能系统
Abstract
While deep learning-based weather forecasting paradigms have made significant strides, addressing extreme weather diagnostics remains a formidable challenge. This gap exists primarily because the diagnostic process demands sophisticated multi-step logical reasoning, dynamic tool invocation, and expert-level prior judgment. Although agents possess inherent advantages in task decomposition and autonomous execution, current architectures are still hampered by critical bottlenecks: inadequate expert knowledge integration, a lack of professional-grade iterative reasoning loops, and the absence of fine-grained validation and evaluation systems for complex workflows under extreme conditions. To this end, we propose HVR-Met, a multi-agent meteorological diagnostic system characterized by the deep integration of expert knowledge. Its central innovation is the ``Hypothesis-Verification-Replanning'' closed-loop mechanism, which facilitates sophisticated iterative reasoning for anomalous meteorological signals during extreme weather events. To bridge gaps within existing evaluation frameworks, we further introduce a novel benchmark focused on atomic-level subtasks. Experimental evidence demonstrates that the system excels in complex diagnostic scenarios.
Chinese Translation
尽管基于深度学习的天气预报范式取得了显著进展,但解决极端天气诊断仍然是一项艰巨的挑战。这一差距主要存在于诊断过程需要复杂的多步骤逻辑推理、动态工具调用以及专家级的先前判断。尽管智能体在任务分解和自主执行方面具有固有优势,但当前的架构仍受到关键瓶颈的限制:专家知识整合不足、缺乏专业级的迭代推理循环,以及在极端条件下复杂工作流的细粒度验证和评估系统的缺失。为此,我们提出了HVR-Met,一个以专家知识深度整合为特征的多智能体气象诊断系统。其核心创新是“假设-验证-重规划”(Hypothesis-Verification-Replanning)闭环机制,能够在极端天气事件中对异常气象信号进行复杂的迭代推理。为了弥补现有评估框架中的不足,我们进一步引入了一种新的基准,专注于原子级子任务。实验证据表明,该系统在复杂诊断场景中表现出色。
cs.AI / 44 / 2603.01135
FCN-LLM: Empower LLM for Brain Functional Connectivity Network Understanding via Graph-level Multi-task Instruction Tuning
FCN-LLM:通过图级多任务指令调优增强大语言模型对脑功能连接网络的理解
Abstract
Large Language Models have achieved remarkable success in language understanding and reasoning, and their multimodal extensions enable comprehension of images, video, and audio. Inspired by this, foundation models for brain functional connectivity networks derived from resting-state fMRI have shown promise in clinical tasks. However, existing methods do not align FCNs with the text modality, limiting the ability of LLMs to directly understand FCNs. To address this, we propose FCN-LLM, a framework that enables LLMs to understand FCNs through graph-level, multi-task instruction tuning. Our approach employs a multi-scale FCN encoder capturing brain-region, functional subnetwork, and whole-brain features, projecting them into the semantic space of LLM. We design multi-paradigm instruction tasks covering 19 subject-specific attributes across demographics, phenotypes, and psychiatric conditions. A multi-stage learning strategy first aligns FCN embeddings with the LLM and then jointly fine-tunes the entire model to capture high-level semantic information. Experiments on a large-scale, multi-site FCN database show that FCN-LLM achieves strong zero-shot generalization on unseen datasets, outperforming conventional supervised and foundation models. This work introduces a new paradigm for integrating brain functional networks with LLMs, offering a flexible and interpretable framework for neuroscience.
Chinese Translation
大型语言模型在语言理解和推理方面取得了显著成功,其多模态扩展使其能够理解图像、视频和音频。受到此启发,基于静息态功能性磁共振成像(fMRI)衍生的脑功能连接网络基础模型在临床任务中展现出潜力。然而,现有方法未能将功能连接网络(FCNs)与文本模态对齐,限制了大语言模型(LLMs)直接理解功能连接网络的能力。为此,我们提出了FCN-LLM,一个通过图级多任务指令调优使大语言模型理解功能连接网络的框架。我们的方法采用多尺度功能连接网络编码器,捕捉脑区、功能子网络和全脑特征,并将其投影到大语言模型的语义空间中。我们设计了涵盖19个特定于受试者的属性的多范式指令任务,涉及人口统计学、表型和精神疾病状况。多阶段学习策略首先将功能连接网络嵌入与大语言模型对齐,然后共同微调整个模型以捕捉高级语义信息。在大规模多站点功能连接网络数据库上的实验表明,FCN-LLM在未见数据集上实现了强大的零样本泛化,优于传统的监督学习和基础模型。本研究为将脑功能网络与大语言模型结合提供了一种新的范式,提供了一个灵活且可解释的神经科学框架。
cs.AI / 45 / 2603.01145
AutoSkill: Experience-Driven Lifelong Learning via Skill Self-Evolution
AutoSkill:通过技能自我演化实现经验驱动的终身学习
Abstract
In practical LLM applications, users repeatedly express stable preferences and requirements, such as reducing hallucinations, following institutional writing conventions, or avoiding overly technical wording, yet such interaction experience is seldom consolidated into reusable knowledge. Consequently, LLM agents often fail to accumulate personalized capabilities across sessions. We present AutoSkill, an experience-driven lifelong learning framework that enables LLM agents to automatically derive, maintain, and reuse skills from dialogue and interaction traces. AutoSkill abstracts skills from user experience, supports their continual self-evolution, and dynamically injects relevant skills into future requests without retraining the underlying model. Designed as a model-agnostic plugin layer, it is compatible with existing LLMs and introduces a standardized skill representation for sharing and transfer across agents, users, and tasks. In this way, AutoSkill turns ephemeral interaction experience into explicit, reusable, and composable capabilities. This paper describes the motivation, architecture, skill lifecycle, and implementation of AutoSkill, and positions it with respect to prior work on memory, retrieval, personalization, and agentic systems. AutoSkill highlights a practical and scalable path toward lifelong personalized agents and personal digital surrogates.
Chinese Translation
在实际的大型语言模型(LLM)应用中,用户反复表达稳定的偏好和需求,例如减少幻觉、遵循机构写作规范或避免过于技术化的措辞,但这种交互经验很少被整合为可重用的知识。因此,LLM代理通常无法在会话之间积累个性化能力。我们提出了AutoSkill,这是一种基于经验驱动的终身学习框架,使LLM代理能够自动从对话和交互轨迹中推导、维护和重用技能。AutoSkill从用户体验中抽象技能,支持其持续自我演化,并在不重新训练底层模型的情况下动态注入相关技能到未来请求中。作为一种与模型无关的插件层,它与现有的LLM兼容,并引入了一种标准化的技能表示,以便在代理、用户和任务之间进行共享和转移。通过这种方式,AutoSkill将短暂的交互经验转化为明确的、可重用的和可组合的能力。本文描述了AutoSkill的动机、架构、技能生命周期和实现,并将其与先前关于记忆、检索、个性化和代理系统的研究进行对比。AutoSkill突出了实现终身个性化代理和个人数字替代品的实用且可扩展的路径。
cs.AI / 46 / 2603.01152
DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent
DeepResearch-9K:深度研究代理的挑战性基准数据集
Abstract
Deep-research agents are capable of executing multi-step web exploration, targeted retrieval, and sophisticated question answering. Despite their powerful capabilities, deep-research agents face two critical bottlenecks: (1) the lack of large-scale, challenging datasets with real-world difficulty, and (2) the absence of accessible, open-source frameworks for data synthesis and agent training. To bridge these gaps, we first construct DeepResearch-9K, a large-scale challenging dataset specifically designed for deep-research scenarios, built from open-source multi-hop question-answering (QA) datasets via a low-cost autonomous pipeline. Notably, it consists of (1) 9000 questions spanning three difficulty levels from L1 to L3, (2) high-quality search trajectories with reasoning chains from Tongyi-DeepResearch-30B-A3B, a state-of-the-art deep-research agent, and (3) verifiable answers. Furthermore, we develop an open-source training framework DeepResearch-R1 that supports (1) multi-turn web interactions, (2) different reinforcement learning (RL) approaches, and (3) different reward models such as rule-based outcome reward and LLM-as-judge feedback. Finally, empirical results demonstrate that agents trained on DeepResearch-9K under our DeepResearch-R1 achieve state-of-the-art results on challenging deep-research benchmarks. We release the DeepResearch-9K dataset on https://huggingface.co/datasets/artillerywu/DeepResearch-9K and the code of DeepResearch-R1 on https://github.com/Applied-Machine-Learning-Lab/DeepResearch-R1.
Chinese Translation
深度研究代理能够执行多步骤的网络探索、针对性检索和复杂问题回答。尽管其能力强大,深度研究代理面临两个关键瓶颈:(1)缺乏具有现实世界难度的大规模挑战性数据集,以及(2)缺乏可访问的开源数据合成和代理训练框架。为了解决这些问题,我们首先构建了DeepResearch-9K,这是一个专为深度研究场景设计的大规模挑战性数据集,基于开源的多跳问答(QA)数据集,通过低成本的自主管道构建而成。值得注意的是,它包含:(1)9000个问题,涵盖从L1到L3的三个难度级别;(2)来自最先进的深度研究代理Tongyi-DeepResearch-30B-A3B的高质量搜索轨迹及推理链;以及(3)可验证的答案。此外,我们开发了一个开源训练框架DeepResearch-R1,支持:(1)多轮网络交互;(2)不同的强化学习(RL)方法;以及(3)不同的奖励模型,如基于规则的结果奖励和LLM作为评判者的反馈。最后,实证结果表明,在我们的DeepResearch-R1下,基于DeepResearch-9K训练的代理在挑战性深度研究基准上达到了最先进的结果。我们在https://huggingface.co/datasets/artillerywu/DeepResearch-9K发布了DeepResearch-9K数据集,并在https://github.com/Applied-Machine-Learning-Lab/DeepResearch-R1发布了DeepResearch-R1的代码。
cs.AI / 47 / 2603.01160
Semantic XPath: Structured Agentic Memory Access for Conversational AI
语义XPath:对话式人工智能的结构化代理记忆访问
Abstract
Conversational AI (ConvAI) agents increasingly maintain structured memory to support long-term, task-oriented interactions. In-context memory approaches append the growing history to the model input, which scales poorly under context-window limits. RAG-based methods retrieve request-relevant information, but most assume flat memory collections and ignore structure. We propose Semantic XPath, a tree-structured memory module to access and update structured conversational memory. Semantic XPath improves performance over flat-RAG baselines by 176.7% while using only 9.1% of the tokens required by in-context memory. We also introduce SemanticXPath Chat, an end-to-end ConvAI demo system that visualizes the structured memory and query execution details. Overall, this paper demonstrates a candidate for the next generation of long-term, task-oriented ConvAI systems built on structured memory.
Chinese Translation
对话式人工智能(ConvAI)代理越来越多地维护结构化记忆,以支持长期的、任务导向的交互。在上下文记忆方法中,随着历史记录的增长,新的信息被附加到模型输入中,这在上下文窗口限制下扩展性较差。基于检索增强生成(RAG)的方法检索与请求相关的信息,但大多数假设记忆集合是平坦的,并忽略了结构。我们提出了语义XPath(Semantic XPath),一种树状结构的记忆模块,用于访问和更新结构化的对话记忆。与平坦RAG基线相比,语义XPath的性能提高了176.7%,同时仅使用了上下文记忆所需的9.1%的标记。我们还介绍了语义XPath聊天(SemanticXPath Chat),一个端到端的ConvAI演示系统,能够可视化结构化记忆和查询执行的细节。总体而言,本文展示了基于结构化记忆的下一代长期、任务导向的ConvAI系统的候选方案。
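The tree-structured memory access described in the Semantic XPath abstract can be sketched as path-addressed reads and writes over a nested store. This minimal version omits the semantic matching layer the paper adds on top; the class and method names are hypothetical.

```python
class TreeMemory:
    """A minimal tree-structured conversational memory with XPath-like paths
    (illustrative sketch; the paper's module adds semantic query resolution)."""
    def __init__(self):
        self.root = {}

    def update(self, path, value):
        """Write a value at a slash-separated path, creating branches as needed."""
        node = self.root
        *parents, leaf = path.strip("/").split("/")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value

    def query(self, path):
        """Read a leaf value or an entire subtree; None if the path is absent."""
        node = self.root
        for part in path.strip("/").split("/"):
            if not isinstance(node, dict) or part not in node:
                return None
            node = node[part]
        return node

mem = TreeMemory()
mem.update("user/preferences/diet", "vegetarian")
mem.update("user/tasks/trip/destination", "Kyoto")
diet = mem.query("user/preferences/diet")
subtree = mem.query("user/tasks")
```

Because a query returns only the matched subtree, the agent's prompt carries a small slice of memory rather than the full history, which is where the token savings reported in the abstract come from.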
cs.AI / 48 / 2603.01201
Incremental LTLf Synthesis
增量 LTLf 合成
Abstract
In this paper, we study incremental LTLf synthesis -- a form of reactive synthesis where the goals are given incrementally while in execution. In other words, the protagonist agent is already executing a strategy for a certain goal when it receives a new goal: at this point, the agent has to abandon the current strategy and synthesize a new strategy still fulfilling the original goal, which was given at the beginning, as well as the new goal, starting from the current instant. We formally define the problem of incremental synthesis and study its solution. We propose a solution technique that efficiently performs incremental synthesis for multiple LTLf goals by leveraging auxiliary data structures constructed during automata-based synthesis. We also consider an alternative solution technique based on LTLf formula progression. We show that, in spite of the fact that formula progression can generate formulas that are exponentially larger than the original ones, their minimal automata remain bounded in size by that of the original formula. On the other hand, we show experimentally that, if implemented naively, i.e., by actually computing the automaton of the progressed LTLf formulas from scratch every time a new goal arrives, the solution based on formula progression is not competitive.
Chinese Translation
在本文中,我们研究增量 LTLf 合成——一种反应合成形式,其中目标在执行过程中逐步给出。换句话说,主角代理在接收到新目标时,已经在执行某个目标的策略:此时,代理必须放弃当前策略,并合成一个新的策略,以满足最初给出的原始目标以及当前时刻的新目标。我们在本文中正式定义了增量合成的问题,并研究其解决方案。我们提出了一种解决技术,通过利用在基于自动机的合成过程中构建的辅助数据结构,能够高效地对多个 LTLf 目标执行增量合成。我们还考虑了一种基于 LTLf 公式进展的替代解决技术。我们展示了尽管公式进展可能生成比原始公式大得多的公式,但其最小自动机的大小仍然受到原始公式大小的限制。另一方面,我们通过实验表明,如果采用简单的实现方式,即每当新目标到达时,实际从头计算进展的 LTLf 公式的自动机,那么基于公式进展的解决方案并不具备竞争力。
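Formula progression, the alternative technique discussed above, rewrites a temporal goal against the current state so that the residual formula expresses what must still hold from the next instant on. The sketch below covers only a tiny LTLf fragment (atoms, conjunction, disjunction, next, eventually), with formulas as nested tuples; it also shows why progressed formulas can grow, since `eventually` duplicates itself into a disjunction.

```python
TRUE, FALSE = ("true",), ("false",)

def prog(f, state):
    """One-step progression of an LTLf formula against the current state
    (a set of atoms that hold now). Tiny fragment for illustration only."""
    op = f[0]
    if op == "atom":
        return TRUE if f[1] in state else FALSE
    if op == "and":
        l, r = prog(f[1], state), prog(f[2], state)
        if FALSE in (l, r): return FALSE
        if l == TRUE: return r
        if r == TRUE: return l
        return ("and", l, r)
    if op == "or":
        l, r = prog(f[1], state), prog(f[2], state)
        if TRUE in (l, r): return TRUE
        if l == FALSE: return r
        if r == FALSE: return l
        return ("or", l, r)
    if op == "next":
        return f[1]                       # obligation shifts to the next instant
    if op == "eventually":
        now = prog(f[1], state)
        if now == TRUE: return TRUE
        if now == FALSE: return f         # still pending, formula persists
        return ("or", now, f)             # source of potential blow-up
    return f                              # 'true' / 'false' progress to themselves

goal = ("and", ("eventually", ("atom", "goal_reached")),
               ("next", ("atom", "door_open")))
after_step1 = prog(goal, {"light_on"})
after_step2 = prog(after_step1, {"door_open", "goal_reached"})
```

Handling a newly arriving goal then amounts to conjoining it with the progressed residue, which is exactly the naive scheme the experiments show to be uncompetitive when automata are rebuilt from scratch.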
cs.AI / 49 / 2603.01203
How Well Does Agent Development Reflect Real-World Work?
代理开发在多大程度上反映现实世界的工作?
Abstract
AI agents are increasingly developed and evaluated on benchmarks relevant to human work, yet it remains unclear how representative these benchmarking efforts are of the labor market as a whole. In this work, we systematically study the relationship between agent development efforts and the distribution of real-world human work by mapping benchmark instances to work domains and skills. We first analyze 43 benchmarks and 72,342 tasks, measuring their alignment with human employment and capital allocation across all 1,016 real-world occupations in the U.S. labor market. We reveal substantial mismatches between agent development that tends to be programming-centric, and the categories in which human labor and economic value are concentrated. Within work areas that agents currently target, we further characterize current agent utility by measuring their autonomy levels, providing practical guidance for agent interaction strategies across work scenarios. Building on these findings, we propose three measurable principles for designing benchmarks that better capture socially important and technically challenging forms of work: coverage, realism, and granular evaluation.
Chinese Translation
人工智能代理的开发和评估越来越多地基于与人类工作相关的基准,但这些基准努力在多大程度上代表整个劳动市场仍然不清楚。在本研究中,我们系统地研究了代理开发努力与现实世界人类工作的分布之间的关系,通过将基准实例映射到工作领域和技能。我们首先分析了43个基准和72,342个任务,测量它们与美国劳动市场上1,016个现实职业的人类就业和资本配置的一致性。我们揭示了代理开发与人类劳动和经济价值集中领域之间存在显著的不匹配,代理开发往往以编程为中心。在代理当前针对的工作领域内,我们进一步通过测量代理的自主性水平来表征当前代理的效用,为不同工作场景下的代理交互策略提供实用指导。在此基础上,我们提出了三个可测量的原则,用于设计更好地捕捉社会重要性和技术挑战性的工作的基准:覆盖性、现实性和细致评估。
cs.AI / 50 / 2603.01209
Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics
智能体学习其运行时:解释器持久性作为训练时语义
Abstract
Tool-augmented LLMs are increasingly deployed as agents that interleave natural-language reasoning with executable Python actions, as in CodeAct-style frameworks. In deployment, these agents rely on runtime state that persists across steps. By contrast, common training pipelines treat agent traces as token sequences, with execution semantics left implicit. This raises a data-centric question: Is state persistence merely an inference-time scaffold, or can models learn to exploit it when training data exposes the corresponding execution semantics? We isolate state persistence as a training-time variable. We introduce Opaque Knapsack, a procedurally generated family of partially observable optimization tasks designed to prevent one-shot solutions. Item attributes and constraints are hidden behind budgeted tool calls, forcing multi-turn control flow and iterative state revision. Holding task instances, prompts, tools, model, and supervision fixed, we generate paired trajectories differing only in whether interpreter state persists across steps or resets after each action. We then fine-tune identical base models (Qwen3-8B) on each trace variant and evaluate all four train-runtime combinations. Our 2x2 cross-evaluation shows that execution semantics primarily affect how agents reach solutions, not whether they do: solution quality is statistically indistinguishable across conditions, but token cost and stability differ substantially. A persistent-trained model in a stateless runtime triggers missing-variable errors in roughly 80% of episodes; a stateless-trained model in a persistent runtime redundantly re-derives retained state, using roughly 3.5x more tokens. Interpreter persistence should be treated as a first-class semantic of agent traces. Aligning fine-tuning data with deployment runtimes improves efficiency and reduces brittle train-runtime mismatches.
Chinese Translation
工具增强的LLM(大语言模型)越来越多地被部署为智能体,这些智能体在自然语言推理与可执行的Python动作之间交替进行,如CodeAct风格的框架。在部署过程中,这些智能体依赖于跨步骤持续存在的运行时状态。相比之下,常见的训练流程将智能体轨迹视为令牌序列,执行语义则保持隐含。这引发了一个以数据为中心的问题:状态持久性仅仅是推理时的支架,还是模型能够在训练数据暴露相应的执行语义时学习利用它?我们将状态持久性作为一个训练时变量进行隔离。我们引入了不透明背包(Opaque Knapsack),这是一个程序生成的部分可观测优化任务家族,旨在防止一次性解决方案。项目属性和约束隐藏在预算工具调用之后,迫使多轮控制流和迭代状态修订。在保持任务实例、提示、工具、模型和监督固定的情况下,我们生成了仅在解释器状态是否在步骤之间持久存在或在每个动作后重置的情况下不同的配对轨迹。然后,我们在每个轨迹变体上微调相同的基础模型(Qwen3-8B),并评估所有四种训练-运行时组合。我们的2x2交叉评估表明,执行语义主要影响智能体达到解决方案的方式,而不是是否能达到:在不同条件下,解决方案质量在统计上没有显著差异,但令牌成本和稳定性却有显著不同。在无状态运行时中经过持久训练的模型在大约80%的情节中触发缺失变量错误;而在持久运行时中经过无状态训练的模型则冗余地重新推导保留的状态,使用的令牌数量大约是3.5倍。解释器持久性应被视为智能体轨迹的第一类语义。将微调数据与部署运行时对齐可以提高效率并减少脆弱的训练-运行时不匹配。
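The train-runtime mismatch above hinges on whether the interpreter namespace survives across agent steps. A toy CodeAct-style runtime makes the contrast concrete, including the "missing-variable error" failure mode reported for persistent-trained models in stateless runtimes. The class and its interface are invented for illustration.

```python
class CodeActRuntime:
    """Toy CodeAct-style runtime: executes each agent action as Python code.
    With persist=True the namespace survives across steps, as in deployment;
    with persist=False every action starts from an empty namespace."""
    def __init__(self, persist=True):
        self.persist = persist
        self.ns = {}

    def step(self, code):
        if not self.persist:
            self.ns = {}              # reset: earlier variables are gone
        try:
            exec(code, self.ns)
            return self.ns.get("result"), None
        except NameError as e:
            return None, str(e)       # the missing-variable failure mode

persistent = CodeActRuntime(persist=True)
persistent.step("items = [3, 1, 2]")
value, err = persistent.step("result = sorted(items)[0]")   # reuses state

stateless = CodeActRuntime(persist=False)
stateless.step("items = [3, 1, 2]")
_, stale_err = stateless.step("result = sorted(items)[0]")  # state lost
```

A stateless-trained agent avoids `stale_err` only by re-deriving `items` inside every action, which is precisely the roughly 3.5x token overhead the abstract measures.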
cs.AI / 51 / 2603.01211
A Unified Framework to Quantify Cultural Intelligence of AI
量化人工智能文化智力的统一框架
Abstract
As generative AI technologies are increasingly being launched across the globe, assessing their competence to operate in different cultural contexts is urgently becoming a priority. While recent years have seen numerous and much-needed efforts on cultural benchmarking, these efforts have largely focused on specific aspects of culture and evaluation. While these efforts contribute to our understanding of cultural competence, a unified and systematic evaluation approach is needed for us as a field to comprehensively assess diverse cultural dimensions at scale. Drawing on measurement theory, we present a principled framework to aggregate multifaceted indicators of cultural capabilities into a unified assessment of cultural intelligence. We start by developing a working definition of culture that includes identifying core domains of culture. We then introduce a broad-purpose, systematic, and extensible framework for assessing cultural intelligence of AI systems. Drawing on theoretical framing from psychometric measurement validity theory, we decouple the background concept (i.e., cultural intelligence) from its operationalization via measurement. We conceptualize cultural intelligence as a suite of core capabilities spanning diverse domains, which we then operationalize through a set of indicators designed for reliable measurement. Finally, we identify the considerations, challenges, and research pathways to meaningfully measure these indicators, specifically focusing on data collection, probing strategies, and evaluation metrics.
Chinese Translation
随着生成性人工智能技术在全球范围内的不断推出,评估其在不同文化背景下的操作能力已迫在眉睫。近年来,虽然在文化基准方面进行了许多必要的努力,但这些努力主要集中在文化和评估的特定方面。尽管这些努力有助于我们理解文化能力,但作为一个领域,我们需要一种统一和系统的评估方法,以全面评估多样的文化维度。基于测量理论,我们提出了一个原则性框架,将多方面的文化能力指标汇聚成对文化智力的统一评估。我们首先制定了文化的工作定义,确定文化的核心领域。然后,我们介绍了一个广泛适用、系统化且可扩展的框架,用于评估人工智能系统的文化智力。借助心理测量有效性理论的理论框架,我们将背景概念(即文化智力)与其通过测量的操作化分离。我们将文化智力概念化为跨越不同领域的一系列核心能力,并通过一组旨在可靠测量的指标进行操作化。最后,我们识别了有意义地测量这些指标的考虑因素、挑战和研究路径,特别关注数据收集、探测策略和评估指标。
cs.AI / 52 / 2603.01227
The Lattice Representation Hypothesis of Large Language Models
大语言模型的格表示假说
Abstract
We propose the Lattice Representation Hypothesis of large language models: a symbolic backbone that grounds conceptual hierarchies and logical operations in embedding geometry. Our framework unifies the Linear Representation Hypothesis with Formal Concept Analysis (FCA), showing that linear attribute directions with separating thresholds induce a concept lattice via half-space intersections. This geometry enables symbolic reasoning through geometric meet (intersection) and join (union) operations, and admits a canonical form when attribute directions are linearly independent. Experiments on WordNet sub-hierarchies provide empirical evidence that LLM embeddings encode concept lattices and their logical structure, revealing a principled bridge between continuous geometry and symbolic abstraction.
Chinese Translation
我们提出了大语言模型的格表示假说:一种符号基础,能够将概念层次和逻辑操作扎根于嵌入几何中。我们的框架将线性表示假说与形式概念分析(Formal Concept Analysis, FCA)统一起来,表明具有分离阈值的线性属性方向通过半空间交集诱导出概念格。这种几何结构使得通过几何交(meet)和并(join)操作进行符号推理成为可能,并且在属性方向线性无关时承认一种标准形式。在WordNet子层次上的实验提供了实证证据,表明LLM嵌入编码了概念格及其逻辑结构,揭示了连续几何与符号抽象之间的原则性桥梁。
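The geometric construction in the abstract, linear attribute directions with thresholds inducing concepts via half-space intersections, can be made concrete in a toy embedding space. The directions, thresholds, and objects below are invented for illustration; real LLM embeddings would supply them instead.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy 2-D embedding space. An object "has" an attribute when its projection
# onto the attribute's direction exceeds the threshold (all values assumed).
ATTRIBUTES = {
    "animal":  ([1.0, 0.0], 0.5),
    "can_fly": ([0.0, 1.0], 0.5),
}
OBJECTS = {
    "sparrow": [1.0, 1.0],
    "dog":     [1.0, 0.0],
    "plane":   [0.0, 1.0],
}

def extent(attrs):
    """Objects inside the intersection of the attributes' half-spaces:
    the extent of the concept defined by that attribute set (FCA)."""
    return {name for name, x in OBJECTS.items()
            if all(dot(x, d) > t for d, t in (ATTRIBUTES[a] for a in attrs))}

# Meet in the concept lattice: conjoining attributes intersects extents,
# yielding a more specific concept below both parents.
flying_animals = extent({"animal", "can_fly"})
animals = extent({"animal"})
```

The lattice ordering falls out of set inclusion: every extent computed from a larger attribute set is contained in the extents of its subsets, which is the symbolic structure the paper probes in WordNet sub-hierarchies.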
cs.AI / 53 / 2603.01235
Extended Empirical Validation of the Explainability Solution Space
可解释性解决方案空间的扩展实证验证
Abstract
This technical report provides an extended validation of the Explainability Solution Space (ESS) through cross-domain evaluation. While initial validation focused on employee attrition prediction, this study introduces a heterogeneous intelligent urban resource allocation system to demonstrate the generality and domain-independence of the ESS framework. The second case study integrates tabular, temporal, and geospatial data under multi-stakeholder governance conditions. Explicit quantitative positioning of representative XAI families is provided for both contexts. Results confirm that ESS rankings are not domain-specific but adapt systematically to governance roles, risk profiles, and stakeholder configurations. The findings reinforce ESS as a generalizable operational decision-support instrument for explainable AI strategy design across socio-technical systems.
Chinese Translation
本技术报告通过跨领域评估提供了对可解释性解决方案空间(Explainability Solution Space, ESS)的扩展验证。尽管最初的验证集中于员工流失预测,本研究引入了一个异构智能城市资源分配系统,以展示ESS框架的普遍性和领域独立性。第二个案例研究在多利益相关者治理条件下整合了表格数据、时间数据和地理空间数据。为这两个背景提供了代表性可解释人工智能(XAI)家族的明确定量定位。结果确认ESS排名不是领域特定的,而是系统地适应于治理角色、风险特征和利益相关者配置。这些发现强化了ESS作为一个可推广的操作决策支持工具,用于在社会技术系统中设计可解释人工智能战略。
cs.AI / 54 / 2603.01283
Beyond Reward: A Bounded Measure of Agent Environment Coupling
超越奖励:一种有界的智能体环境耦合度量
Abstract
Real-world reinforcement learning (RL) agents operate in closed-loop systems where actions shape future observations, making reliable deployment under distribution shifts a persistent challenge. Existing monitoring relies on reward or task metrics, capturing outcomes but missing early coupling failures. We introduce bipredictability (P) as the ratio of shared information in the observation-action-outcome loop to the total available information, a principled, real-time measure of interaction effectiveness with provable bounds, comparable across tasks. An auxiliary monitor, the Information Digital Twin (IDT), computes P and its diagnostic components from the interaction stream. We evaluate SAC and PPO agents on MuJoCo HalfCheetah under eight agent- and environment-side perturbations across 168 trials. Under nominal operation, agents exhibit P = 0.33 ± 0.02, below the classical bound of 0.5, revealing an informational cost of action selection. The IDT detects 89.3% of perturbations versus 44.0% for reward-based monitoring, with 4.4x lower median latency. Bipredictability enables early detection of interaction degradation before performance drops and provides a prerequisite signal for closed-loop self-regulation in deployed RL systems.
Chinese Translation
现实世界中的强化学习(RL)智能体在闭环系统中运行,其中行动会影响未来的观察结果,这使得在分布变化下的可靠部署成为一个持续的挑战。现有的监测依赖于奖励或任务指标,虽然能够捕捉结果,但却忽略了早期耦合失败。我们引入了双预测性(bipredictability, P),作为观察、行动、结果循环中共享信息与可用信息总量的比率,这是一种原则性、实时的交互有效性度量,具有可证明的界限,且可在不同任务间进行比较。辅助监测工具信息数字双胞胎(Information Digital Twin, IDT)从交互流中计算P及其诊断组件。我们在MuJoCo HalfCheetah上对SAC和PPO智能体进行了168次试验,测试了八种智能体和环境侧的扰动。在正常运行下,智能体的P值为0.33±0.02,低于经典界限0.5,揭示了行动选择的信息成本。IDT检测到89.3%的扰动,而基于奖励的监测仅为44.0%,且中位延迟低4.4倍。双预测性使得在性能下降之前能够早期检测交互退化,并为已部署的RL系统中的闭环自我调节提供了必要信号。
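The flavor of an information-ratio measure like P can be shown with a simplified pairwise version: shared over total information between observations and actions, estimated from counts. This is only an illustration of a bounded, normalized ratio; the paper's bipredictability spans the full observation-action-outcome loop and has its own bounds, so the definition below is an assumption, not the paper's.

```python
from collections import Counter
from math import log2

def entropy(counts):
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values())

def bipredictability(pairs):
    """Toy ratio of shared to total information in an observation->action
    stream: I(O;A) / H(O,A), which lies in [0, 1]. Illustrative only."""
    joint = Counter(pairs)
    h_joint = entropy(joint)
    h_o = entropy(Counter(o for o, _ in pairs))
    h_a = entropy(Counter(a for _, a in pairs))
    mi = h_o + h_a - h_joint          # mutual information
    return mi / h_joint if h_joint else 0.0

# Tightly coupled loop: the action is fully determined by the observation.
coupled = bipredictability([("low", "brake"), ("high", "accel")] * 50)
# Decoupled loop: actions are independent of observations.
decoupled = bipredictability([("low", "brake"), ("low", "accel"),
                              ("high", "brake"), ("high", "accel")] * 25)
```

A monitor in the spirit of the IDT would track such a ratio over a sliding window and flag drops before reward-level symptoms appear.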
cs.AI / 55 / 2603.01286
Information-Theoretic Framework for Self-Adapting Model Predictive Controllers
自适应模型预测控制器的信息论框架
Abstract
Model Predictive Control (MPC) is a vital technique for autonomous systems, like Unmanned Aerial Vehicles (UAVs), enabling optimized motion planning. However, traditional MPC struggles to adapt to real-time changes such as dynamic obstacles and shifting system dynamics, lacking inherent mechanisms for self-monitoring and adaptive optimization. Here, we introduce Entanglement Learning (EL), an information-theoretic framework that enhances MPC adaptability through an Information Digital Twin (IDT). The IDT monitors and quantifies, in bits, the information flow between MPC inputs, control actions, and UAV behavior. By introducing new information-theoretic metrics we call entanglement metrics, it tracks variations in these dependencies. These metrics measure the mutual information between the optimizer's input, its control actions, and the resulting UAV dynamics, enabling a deeper understanding of their interrelationships. This allows the IDT to detect performance deviations and generate real-time adaptive signals to recalibrate MPC parameters, preserving stability. Unlike traditional MPC, which relies on error-based feedback, this dual-feedback approach leverages information flow for proactive adaptation to evolving conditions. Scalable and leveraging existing infrastructure, this framework improves MPC reliability and robustness across diverse scenarios, extending beyond UAV control to any MPC implementation requiring adaptive performance.
Chinese Translation
模型预测控制(MPC)是自主系统(如无人机)的一项重要技术,能够实现优化的运动规划。然而,传统的MPC在适应实时变化(如动态障碍物和系统动态变化)方面存在困难,缺乏自我监控和自适应优化的内在机制。在此,我们引入了一种信息论框架——纠缠学习(Entanglement Learning, EL),通过信息数字双胞胎(Information Digital Twin, IDT)增强MPC的适应性。IDT监测并量化MPC输入、控制动作与无人机行为之间的信息流,单位为比特。通过引入我们称之为纠缠度量的新信息论指标,它跟踪这些依赖关系的变化。这些度量衡量优化器输入、其控制动作与由此产生的无人机动态之间的互信息,从而加深对它们相互关系的理解。这使得IDT能够检测性能偏差,并生成实时自适应信号以重新校准MPC参数,保持稳定性。与依赖基于误差反馈的传统MPC不同,这种双重反馈方法利用信息流对不断变化的条件进行主动适应。该框架可扩展并利用现有基础设施,提高了MPC在多种场景下的可靠性和鲁棒性,超越了无人机控制,适用于任何需要自适应性能的MPC实现。
cs.AI / 56 / 2603.01290
Opponent State Inference Under Partial Observability: An HMM-POMDP Framework for 2026 Formula 1 Energy Strategy
在部分可观测条件下的对手状态推断:2026年一级方程式能源策略的HMM-POMDP框架
Abstract
The 2026 Formula 1 technical regulations introduce a fundamental change to energy strategy: under a 50/50 internal combustion engine / battery power split with unlimited regeneration and a driver-controlled Override Mode (abbreviated MOM throughout), the optimal energy deployment policy depends not only on a driver's own state but on the hidden state of rival cars. This creates a Partially Observable Stochastic Game that cannot be solved by single-agent optimisation methods. We present a tractable two-layer inference and decision framework. The first layer is a 30-state Hidden Markov Model (HMM) that infers a probability distribution over each rival's ERS charge level, Override Mode status, and tyre degradation state from five publicly observable telemetry signals. The second layer is a Deep Q-Network (DQN) policy that takes the HMM belief state as input and selects between energy deployment strategies. We formally characterise the counter-harvest trap -- a deceptive strategy in which a car deliberately suppresses observable deployment signals to induce a rival into a failed attack -- and show that detecting it requires belief-state inference rather than reactive threshold rules. On synthetic races generated from the model's own assumptions, the HMM achieves 92.3% ERS inference accuracy (random baseline: 33.3%) and detects counter-harvest trap conditions with 95.7% recall. Pre-registration -- empirical validation begins Australian Grand Prix, 8 March 2026.
Chinese Translation
2026年一级方程式技术法规对能源策略引入了根本性的变化:在50/50的内燃机/电池动力分配下,具备无限再生能力和由驾驶员控制的超越模式(简称MOM),最佳的能源投放策略不仅依赖于驾驶员自身的状态,还依赖于竞争对手汽车的隐状态。这形成了一个部分可观测的随机博弈,无法通过单一智能体优化方法解决。我们提出了一个可处理的两层推断和决策框架。第一层是一个30状态的隐马尔可夫模型(HMM),它从五个公开可观测的遥测信号中推断出每个竞争对手的ERS充电水平、超越模式状态和轮胎退化状态的概率分布。第二层是一个深度Q网络(DQN)策略,它将HMM信念状态作为输入,并在能源投放策略之间进行选择。我们正式描述了反收割陷阱——一种欺骗性策略,其中汽车故意抑制可观测的投放信号,以诱使竞争对手发起失败的攻击——并表明检测该策略需要信念状态推断,而非反应性阈值规则。在基于模型自身假设生成的合成比赛中,HMM实现了92.3%的ERS推断准确率(随机基线:33.3%),并以95.7%的召回率检测到反收割陷阱条件。预注册——实证验证将于2026年3月8日的澳大利亚大奖赛开始。
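The first-layer belief tracking can be illustrated with a standard HMM forward update over a toy 3-state rival model (the paper uses 30 states inferred from five telemetry signals; the matrices below are invented purely for illustration):

```python
def hmm_belief_update(belief, transition, emission, observation):
    """One HMM forward step: push the belief through the transition
    model, weight by the likelihood of the new observation, renormalize."""
    n = len(belief)
    predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
    weighted = [predicted[j] * emission[j][observation] for j in range(n)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Toy rival model: states = (ERS low, ERS mid, ERS high), obs = {0, 1}.
T = [[0.8, 0.2, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]]
E = [[0.7, 0.3],   # low-charge car rarely shows a deployment signal
     [0.5, 0.5],
     [0.2, 0.8]]   # high-charge car usually does
```

Repeating this update per lap yields the belief distribution that the second-layer DQN would consume as its input state.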
cs.AI / 57 / 2603.01357
ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context
ASTRA-bench:在个人用户上下文下评估工具使用代理的推理与行动规划
Abstract
Next-generation AI must manage vast personal data, diverse tools, and multi-step reasoning, yet most benchmarks remain context-free and single-turn. We present ASTRA-bench (Assistant Skills in Tool-use, Reasoning & Action-planning), a benchmark that uniquely unifies time-evolving personal context with an interactive toolbox and complex user intents. Our event-driven pipeline generates 2,413 scenarios across four protagonists, grounded in longitudinal life events and annotated by referential, functional, and informational complexity. Evaluation of state-of-the-art models (e.g., Claude-4.5-Opus, DeepSeek-V3.2) reveals significant performance degradation under high-complexity conditions, with argument generation emerging as the primary bottleneck. These findings expose critical limitations in current agents' ability to ground reasoning within messy personal context and orchestrate reliable multi-step plans. We release ASTRA-bench with a full execution environment and evaluation scripts to provide a diagnostic testbed for developing truly context-aware AI assistants.
Chinese Translation
下一代人工智能必须管理大量个人数据、多样化工具和多步骤推理,然而大多数基准测试仍然是无上下文和单轮的。我们提出了ASTRA-bench(工具使用、推理与行动规划中的助手技能),这是一个独特的基准,将随时间演变的个人上下文与互动工具箱和复杂的用户意图统一起来。我们的事件驱动管道围绕四个主角生成了2413个场景,基于纵向生活事件,并按指代复杂度、功能复杂度和信息复杂度进行了标注。对最先进模型(如Claude-4.5-Opus、DeepSeek-V3.2)的评估显示,在高复杂性条件下性能显著下降,其中工具调用参数的生成成为主要瓶颈。这些发现暴露了当前代理在杂乱的个人上下文中锚定推理并协调可靠多步骤计划方面的关键局限。我们发布了ASTRA-bench,并提供完整的执行环境和评估脚本,为开发真正具有上下文意识的人工智能助手提供一个诊断测试平台。
cs.AI / 58 / 2603.01375
Words & Weights: Streamlining Multi-Turn Interactions via Co-Adaptation
词汇与权重:通过共同适应简化多轮交互
Abstract
Test-time policy adaptation for multi-turn interactions (T2PAM) is essential for aligning Large Language Models (LLMs) with dynamic user needs during inference time. However, existing paradigms commonly treat test-time adaptation as a single-axis problem, either purely refining instructions (Prompt Engineering) or only adjusting weights (Test-Time Training), ignoring that interaction failures stem from a coupled mix of ambiguity and incapacity. We argue that these two optimization paths are not merely additive but synergistic: semantic clarity acts as a pre-conditioner for effective parameter updates. To this end, we propose ROSA2, a framework that reformulates interaction as a joint optimization problem over the heterogeneous space of Words and Weights. By mathematically decomposing the error signal, ROSA2 utilizes textual gradients to rectify intent ambiguity and parameter updates to bridge capability gaps. Theoretically, we prove that this co-adaptation strictly reduces the required parameter shift for convergence. Empirically, ROSA2 outperforms state-of-the-art baselines by 30% on MATH while reducing interaction turns by 40%, demonstrating that refining the context unlocks the true potential of parameter updates.
Chinese Translation
面向多轮交互的测试时策略适应(T2PAM)对于在推理过程中将大型语言模型(LLMs)与动态用户需求对齐至关重要。然而,现有范式通常将测试时适应视为一个单轴问题,或者仅精炼指令(提示工程),或者仅调整权重(测试时训练),忽视了交互失败源于模糊性和能力不足的耦合混合。我们认为,这两条优化路径并非简单相加,而是协同的:语义清晰性是有效参数更新的前置条件。为此,我们提出了ROSA2,一个将交互重新表述为在词汇和权重这一异构空间上的联合优化问题的框架。通过数学分解误差信号,ROSA2利用文本梯度来纠正意图模糊性,并通过参数更新来弥补能力差距。从理论上讲,我们证明了这种共同适应严格减少了收敛所需的参数偏移。从实证上看,ROSA2在MATH数据集上比最先进的基线提高了30%的性能,同时减少了40%的交互轮次,证明了精炼上下文能够释放参数更新的真正潜力。
cs.AI / 59 / 2603.01396
HarmonyCell: Automating Single-Cell Perturbation Modeling under Semantic and Distribution Shifts
HarmonyCell:在语义与分布偏移下自动化单细胞扰动建模
Abstract
Single-cell perturbation studies face dual heterogeneity bottlenecks: (i) semantic heterogeneity--identical biological concepts encoded under incompatible metadata schemas across datasets; and (ii) statistical heterogeneity--distribution shifts from biological variation demanding dataset-specific inductive biases. We propose HarmonyCell, an end-to-end agent framework resolving each challenge through a dedicated mechanism: an LLM-driven Semantic Unifier autonomously maps disparate metadata into a canonical interface without manual intervention; and an adaptive Monte Carlo Tree Search engine operates over a hierarchical action space to synthesize architectures with optimal statistical inductive biases for distribution shifts. Evaluated across diverse perturbation tasks under both semantic and distribution shifts, HarmonyCell achieves a 95% valid execution rate on heterogeneous input datasets (versus 0% for general agents) while matching or even exceeding expert-designed baselines in rigorous out-of-distribution evaluations. This dual-track orchestration enables scalable automatic virtual cell modeling without dataset-specific engineering.
Chinese Translation
单细胞扰动研究面临双重异质性瓶颈:(i)语义异质性——相同的生物概念在不同数据集间以不兼容的元数据模式编码;(ii)统计异质性——源于生物变异的分布偏移需要数据集特定的归纳偏置。我们提出了HarmonyCell,一个端到端的代理框架,通过专门的机制分别解决这两个挑战:一个基于大型语言模型(LLM)的语义统一器能够自主将不同的元数据映射到一个标准接口,无需人工干预;而一个自适应的蒙特卡洛树搜索引擎在分层动作空间中操作,以合成针对分布偏移具有最佳统计归纳偏置的架构。在语义与分布偏移下的多种扰动任务评估中,HarmonyCell在异质输入数据集上实现了95%的有效执行率(而通用代理为0%),同时在严格的分布外评估中与专家设计的基线持平甚至超越。这种双轨协同使得无需数据集特定工程的可扩展自动虚拟细胞建模成为可能。
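The architecture-search layer can be loosely illustrated with a UCB1 bandit over a flat set of candidate blocks; the paper's adaptive MCTS engine operates over a hierarchical action space, so treat this as a much-simplified sketch with an invented reward function:

```python
import math
import random

def ucb1_search(arms, reward_fn, iters=500, c=1.4, seed=0):
    """Toy UCB1 loop over discrete architecture choices: balance
    exploring rarely tried arms against exploiting high-mean arms,
    then return the arm with the best empirical mean reward."""
    rng = random.Random(seed)
    counts = {a: 0 for a in arms}
    totals = {a: 0.0 for a in arms}
    for t in range(1, iters + 1):
        def score(a):
            if counts[a] == 0:
                return float("inf")  # try every arm at least once
            return totals[a] / counts[a] + c * math.sqrt(math.log(t) / counts[a])
        arm = max(arms, key=score)
        totals[arm] += reward_fn(arm, rng)  # e.g. validation score
        counts[arm] += 1
    return max(arms, key=lambda a: totals[a] / max(counts[a], 1))
```

In the real system the reward would be an out-of-distribution validation metric for the synthesized architecture, and the tree structure would let earlier choices (e.g. encoder family) condition later ones (e.g. head design).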
cs.AI / 60 / 2603.01407
The Observer-Situation Lattice: A Unified Formal Basis for Perspective-Aware Cognition
观察者-情境格:一种统一的视角感知认知的形式基础
Abstract
Autonomous agents operating in complex, multi-agent environments must reason about what is true from multiple perspectives. Existing approaches often struggle to integrate the reasoning of different agents, at different times, and in different contexts, typically handling these dimensions in separate, specialized modules. This fragmentation leads to a brittle and incomplete reasoning process, particularly when agents must understand the beliefs of others (Theory of Mind). We introduce the Observer-Situation Lattice (OSL), a unified mathematical structure that provides a single, coherent semantic space for perspective-aware cognition. OSL is a finite complete lattice where each element represents a unique observer-situation pair, allowing for a principled and scalable approach to belief management. We present two key algorithms that operate on this lattice: (i) Relativized Belief Propagation, an incremental update algorithm that efficiently propagates new information, and (ii) Minimal Contradiction Decomposition, a graph-based procedure that identifies and isolates contradiction components. We prove the theoretical soundness of our framework and demonstrate its practical utility through a series of benchmarks, including classic Theory of Mind tasks and a comparison with established paradigms such as assumption-based truth maintenance systems. Our results show that OSL provides a computationally efficient and expressive foundation for building robust, perspective-aware autonomous agents.
Chinese Translation
在复杂的多智能体环境中运行的自主智能体必须从多个视角推理什么是真实的。现有的方法通常难以整合不同智能体在不同时间和不同上下文中的推理,通常在单独的专业模块中处理这些维度。这种碎片化导致了脆弱且不完整的推理过程,尤其是在智能体必须理解他人信念(心智理论)时。我们提出了观察者-情境格(Observer-Situation Lattice, OSL),这是一种统一的数学结构,为视角感知认知提供了一个单一且连贯的语义空间。OSL是一个有限的完全格,其中每个元素代表一个独特的观察者-情境对,允许对信念管理采取原则性和可扩展的方法。我们提出了两个在该格上运行的关键算法:(i)相对信念传播(Relativized Belief Propagation),一种增量更新算法,能够高效地传播新信息;(ii)最小矛盾分解(Minimal Contradiction Decomposition),一种基于图的程序,能够识别和隔离矛盾成分。我们证明了框架的理论合理性,并通过一系列基准测试展示了其实际效用,包括经典的心智理论任务以及与基于假设的真值维护系统等成熟范式的比较。我们的结果表明,OSL为构建强大且具视角感知能力的自主智能体提供了计算上高效且富有表现力的基础。
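If both axes are totally ordered (a simplifying assumption; the paper only requires a finite complete lattice), meet and join of observer-situation pairs can be taken componentwise over index positions:

```python
from itertools import product

def make_osl(observers, situations):
    """Build a toy observer-situation lattice. Elements are index pairs
    (observer, situation); with each axis totally ordered, meet is the
    componentwise min and join the componentwise max."""
    elems = list(product(range(len(observers)), range(len(situations))))
    meet = lambda a, b: (min(a[0], b[0]), min(a[1], b[1]))
    join = lambda a, b: (max(a[0], b[0]), max(a[1], b[1]))
    return elems, meet, join
```

Componentwise min/max satisfies the lattice laws (commutativity, absorption) by construction, which is what makes a single coherent semantic space over perspectives possible in this toy reading.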
cs.AI / 61 / 2603.01409
MIST-RL: Mutation-based Incremental Suite Testing via Reinforcement Learning
MIST-RL:通过强化学习实现基于突变的增量套件测试
Abstract
Large Language Models (LLMs) often fail to generate correct code on the first attempt, which requires using generated unit tests as verifiers to validate the solutions. Despite the success of recent verification methods, they remain constrained by a "scaling-by-quantity" paradigm. This brute-force approach suffers from a critical limitation: it yields diminishing returns in fault detection while causing severe test redundancy. To address this, we propose MIST-RL (Mutation-based Incremental Suite Testing via Reinforcement Learning), a framework that shifts the focus to "scaling-by-utility". We formulate test generation as a sequential decision process optimized via Group Relative Policy Optimization (GRPO). Specifically, we introduce a novel incremental mutation reward combined with dynamic penalties, which incentivizes the model to discover new faults while it suppresses functionally equivalent assertions. Experiments on HumanEval+ and MBPP+ demonstrate that MIST-RL outperforms state-of-the-art baselines. It achieves a +28.5% higher mutation score while reducing the number of test cases by 19.3%. Furthermore, we show that these compact, high-utility tests serve as superior verifiers, which improves downstream code reranking accuracy on HumanEval+ by 3.05% over the SOTA baseline with 10 candidate samples. The source code and data are provided in the supplementary material.
Chinese Translation
大型语言模型(LLMs)在第一次尝试时往往无法生成正确的代码,这需要使用生成的单元测试作为验证器来验证解决方案。尽管最近的验证方法取得了成功,但它们仍然受到“通过数量扩展”范式的限制。这种暴力方法存在一个关键的局限:在故障检测中收益递减,同时导致严重的测试冗余。为了解决这个问题,我们提出了MIST-RL(通过强化学习实现基于突变的增量套件测试),这是一个将重点转向“通过效用扩展”的框架。我们将测试生成形式化为一个通过组相对策略优化(Group Relative Policy Optimization, GRPO)优化的序列决策过程。具体而言,我们引入了一种新颖的增量突变奖励,并结合动态惩罚,激励模型发现新故障,同时抑制功能等效的断言。在HumanEval+和MBPP+上的实验表明,MIST-RL的性能优于最先进的基线:突变得分提高了28.5%,同时测试用例数量减少了19.3%。此外,我们展示了这些紧凑且高效用的测试可作为更优的验证器:在10个候选样本的设置下,将HumanEval+上的下游代码重排序准确率相较最先进基线提高了3.05%。源代码和数据已在补充材料中提供。
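The shift from quantity to utility can be sketched with a minimal incremental mutation reward: a new test earns credit only for mutants not already killed by the suite, and is penalized when it kills nothing new. The penalty constant is invented; the paper's dynamic penalties are richer:

```python
def incremental_mutation_reward(killed_so_far, newly_killed,
                                redundancy_penalty=0.5):
    """Illustrative reward: +1 per previously unkilled mutant the new
    test kills; a flat penalty if it adds nothing (redundant test)."""
    new = newly_killed - killed_so_far
    if not new:
        return -redundancy_penalty
    return float(len(new))

def build_suite(candidate_kill_sets):
    """Greedy suite growth under the reward: keep only tests whose
    reward is positive, accumulating the set of killed mutants."""
    suite, killed = [], set()
    for i, kills in enumerate(candidate_kill_sets):
        if incremental_mutation_reward(killed, kills) > 0:
            suite.append(i)
            killed |= kills
    return suite, killed
```

Under this signal a policy is pushed toward compact suites: functionally equivalent assertions earn negative reward, while each kept test must extend fault coverage.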
cs.AI / 62 / 2603.01410
GraphScout: Empowering Large Language Models with Intrinsic Exploration Ability for Agentic Graph Reasoning
GraphScout:赋能大型语言模型以内在探索能力进行自主图推理
Abstract
Knowledge graphs provide structured and reliable information for many real-world applications, motivating increasing interest in combining large language models (LLMs) with graph-based retrieval to improve factual grounding. Recent Graph-based Retrieval-Augmented Generation (GraphRAG) methods therefore introduce iterative interaction between LLMs and knowledge graphs to enhance reasoning capability. However, existing approaches typically depend on manually designed guidance and interact with knowledge graphs through a limited set of predefined tools, which substantially constrains graph exploration. To address these limitations, we propose GraphScout, a training-centric agentic graph reasoning framework equipped with more flexible graph exploration tools. GraphScout enables models to autonomously interact with knowledge graphs to synthesize structured training data which are then used to post-train LLMs, thereby internalizing agentic graph reasoning ability without laborious manual annotation or task curation. Extensive experiments across five knowledge-graph domains show that a small model (e.g., Qwen3-4B) augmented with GraphScout outperforms baseline methods built on leading LLMs (e.g., Qwen-Max) by an average of 16.7% while requiring significantly fewer inference tokens. Moreover, GraphScout exhibits robust cross-domain transfer performance. Our code will be made publicly available at https://github.com/Ying-Yuchen/_GraphScout_.
Chinese Translation
知识图谱为许多现实世界应用提供了结构化和可靠的信息,这激发了人们对将大型语言模型(LLMs)与基于图的检索相结合以改善事实基础的兴趣。因此,近期的基于图的检索增强生成(GraphRAG)方法引入了LLMs与知识图谱之间的迭代交互,以增强推理能力。然而,现有的方法通常依赖于手动设计的指导,并通过一组有限的预定义工具与知识图谱进行交互,这在很大程度上限制了图的探索。为了解决这些局限性,我们提出了GraphScout,这是一种以训练为中心的自主图推理框架,配备了更灵活的图探索工具。GraphScout使模型能够自主与知识图谱交互,以合成结构化的训练数据,然后用于后续训练LLMs,从而在无需繁琐的手动标注或任务策划的情况下内化自主图推理能力。在五个知识图谱领域的广泛实验表明,增强了GraphScout的小型模型(例如,Qwen3-4B)在性能上比基于领先LLMs(例如,Qwen-Max)的基线方法平均提高了16.7%,同时所需的推理令牌显著减少。此外,GraphScout展现出强大的跨领域迁移性能。我们的代码将在 https://github.com/Ying-Yuchen/_GraphScout_ 公开发布。
cs.AI / 63 / 2603.01416
Securing the Floor and Raising the Ceiling: A Merging-based Paradigm for Multi-modal Search Agents
保障底线与提升上限:一种基于合并的多模态搜索代理范式
Abstract
Recent advances in Vision-Language Models (VLMs) have motivated the development of multi-modal search agents that can actively invoke external search tools and integrate retrieved evidence through multi-step reasoning. While promising, existing approaches typically rely on large-scale supervised trajectories or expensive reinforcement learning (RL), leading to high training cost, instability, and a severe cold-start problem for standard VLMs. We propose a training-free paradigm to empower VLMs with autonomous search capabilities via cross-modal model merging. By fusing a text-based search agent with a base VLM, we show that multi-modal search capabilities can be effectively composed without any additional multi-modal training data. To mitigate parameter interference during cross-modal integration, we introduce Optimal Brain Merging (OBM), a saliency-aware merging algorithm that identifies task-critical parameters based on their impact on model loss using only a small set of calibration samples. Extensive experiments on search-intensive benchmarks (e.g., InfoSeek, MMSearch) reveal that: (1) Model merging secures a reasonable performance floor as a zero-shot agent, with OBM achieving superior search rates; (2) OBM significantly raises the performance ceiling as a warm-start strategy, achieving faster convergence and higher peak accuracy than standard VLM initialization.
Chinese Translation
近期视觉-语言模型(VLMs)的进展激励了多模态搜索代理的发展,这些代理能够主动调用外部搜索工具并通过多步推理整合检索到的证据。尽管前景可期,现有方法通常依赖于大规模的监督轨迹或昂贵的强化学习(RL),导致高昂的训练成本、不稳定性以及标准VLMs的严重冷启动问题。我们提出了一种无需训练的范式,通过跨模态模型合并赋予VLMs自主搜索能力。通过将基于文本的搜索代理与基础VLM融合,我们展示了多模态搜索能力可以在没有任何额外多模态训练数据的情况下有效组合。为了在跨模态集成过程中减轻参数干扰,我们引入了最优脑合并(Optimal Brain Merging, OBM),这是一种显著性感知的合并算法,仅使用一小组校准样本,基于参数对模型损失的影响来识别任务关键参数。在搜索密集型基准测试(如InfoSeek、MMSearch)上的广泛实验表明:(1)模型合并确保了作为零样本代理的合理性能底线,其中OBM实现了更优的搜索率;(2)OBM作为一种热启动策略显著提升了性能上限,达到比标准VLM初始化更快的收敛速度和更高的峰值准确率。
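A sketch of saliency-aware merging over flat parameter lists: donor (search-agent) weights are kept where saliency is highest, and base (VLM) weights are kept elsewhere. The `keep_ratio` and the saliency vector here are invented; OBM estimates saliency from each parameter's impact on loss over calibration samples:

```python
def saliency_merge(base, donor, saliency, keep_ratio=0.5):
    """Keep donor parameters at the top-saliency positions and base
    parameters everywhere else (a sketch of saliency-aware merging)."""
    k = max(1, int(len(base) * keep_ratio))
    top = set(sorted(range(len(base)), key=lambda i: -saliency[i])[:k])
    return [donor[i] if i in top else base[i] for i in range(len(base))]
```

Selecting per-parameter rather than interpolating uniformly is what limits cross-modal interference: only positions judged task-critical for search are overwritten in the base model.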
cs.AI / 64 / 2603.01421
SciDER: Scientific Data-centric End-to-end Researcher
SciDER:以数据为中心的端到端科学研究者
Abstract
Automated scientific discovery with large language models is transforming the research lifecycle from ideation to experimentation, yet existing agents struggle to autonomously process raw data collected from scientific experiments. We introduce SciDER, a data-centric end-to-end system that automates the research lifecycle. Unlike traditional frameworks, our specialized agents collaboratively parse and analyze raw scientific data, generate hypotheses and experimental designs grounded in specific data characteristics, and write and execute corresponding code. Evaluation on three benchmarks shows SciDER excels in specialized data-driven scientific discovery and outperforms general-purpose agents and state-of-the-art models through its self-evolving memory and critic-led feedback loop. Distributed as a modular Python package, we also provide easy-to-use PyPI packages with a lightweight web interface to accelerate autonomous, data-driven research and aim to be accessible to all researchers and developers.
Chinese Translation
基于大型语言模型的自动化科学发现正在改变研究生命周期,从构思到实验。然而,现有的智能体在自主处理从科学实验中收集的原始数据方面仍然存在困难。我们提出了SciDER,一个以数据为中心的端到端系统,旨在自动化研究生命周期。与传统框架不同,我们的专用智能体能够协同解析和分析原始科学数据,基于特定数据特征生成假设和实验设计,并编写和执行相应的代码。在三个基准测试中的评估显示,SciDER在以数据驱动的科学发现方面表现优异,超越了通用智能体和最先进的模型,得益于其自我演化的记忆和以批评为导向的反馈循环。作为一个模块化的Python包,我们还提供了易于使用的PyPI包和轻量级的Web界面,以加速自主的数据驱动研究,旨在让所有研究人员和开发者都能轻松访问。
cs.AI / 65 / 2603.01437
Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering
在链式思维之前解码答案:来自预链式思维探针和激活引导的证据
Abstract
As chain-of-thought (CoT) has become central to scaling reasoning capabilities in large language models (LLMs), it has also emerged as a promising tool for interpretability, suggesting the opportunity to understand model decisions through verbalized reasoning. However, the utility of CoT toward interpretability depends upon its faithfulness -- whether the model's stated reasoning reflects the underlying decision process. We provide mechanistic evidence that instruction-tuned models often determine their answer before generating CoT. Training linear probes on residual stream activations at the last token before CoT, we can predict the model's final answer with 0.9 AUC on most tasks. We find that these directions are not only predictive, but also causal: steering activations along the probe direction flips model answers in over 50% of cases, significantly exceeding orthogonal baselines. When steering induces incorrect answers, we observe two distinct failure modes: non-entailment (stating correct premises but drawing unsupported conclusions) and confabulation (fabricating false premises). While post-hoc reasoning may be instrumentally useful when the model has a correct pre-CoT belief, these failure modes suggest it can result in undesirable behaviors when reasoning from a false belief.
Chinese Translation
随着链式思维(CoT)在大型语言模型(LLMs)中成为扩展推理能力的核心,它也作为一种有前景的可解释性工具出现,暗示通过言语化推理理解模型决策的机会。然而,CoT在可解释性方面的效用依赖于其忠实性——即模型所陈述的推理是否反映了其潜在的决策过程。我们提供了机制性证据,表明经过指令调优的模型通常在生成CoT之前就确定了答案。通过在CoT之前最后一个标记的残差流激活上训练线性探针,我们可以在大多数任务中以0.9的AUC预测模型的最终答案。我们发现这些方向不仅具有预测性,而且具有因果性:沿着探针方向引导激活在超过50%的情况下会翻转模型答案,显著超出正交基线。当引导导致错误答案时,我们观察到两种不同的失败模式:非蕴涵(陈述正确的前提但得出不支持的结论)和虚构(捏造错误的前提)。虽然当模型在生成CoT之前已持有正确信念时,事后推理可能仍具有实用价值,但这些失败模式表明,从错误信念出发进行推理时,它可能导致不良行为。
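A difference-of-means direction is a common cheap stand-in for a trained linear probe, and steering along it is a single vector addition. This toy sketch (invented 2-d activations) only illustrates the mechanics of probing and steering, not the paper's exact probe training:

```python
def probe_direction(pos_acts, neg_acts):
    """Difference-of-means direction between activations of the two
    answer classes, as a stand-in for a trained linear probe."""
    dim = len(pos_acts[0])
    mean = lambda rows, j: sum(r[j] for r in rows) / len(rows)
    return [mean(pos_acts, j) - mean(neg_acts, j) for j in range(dim)]

def probe_score(activation, direction):
    """Dot product with the probe direction; its sign predicts the answer."""
    return sum(a * d for a, d in zip(activation, direction))

def steer(activation, direction, alpha):
    """Activation steering: add alpha times the probe direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]
```

If the probe's sign flips after steering, the intervention has (in this toy reading) changed the model's pre-CoT answer, which is the causal test the abstract describes.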
cs.AI / 66 / 2603.01452
Scaling Tasks, Not Samples: Mastering Humanoid Control through Multi-Task Model-Based Reinforcement Learning
任务扩展,而非样本扩展:通过基于模型的多任务强化学习掌握类人机器人控制
Abstract
Developing generalist robots capable of mastering diverse skills remains a central challenge in embodied AI. While recent progress emphasizes scaling model parameters and offline datasets, such approaches are limited in robotics, where learning requires active interaction. We argue that effective online learning should scale the number of tasks, rather than the number of samples per task. This regime reveals a structural advantage of model-based reinforcement learning (MBRL). Because physical dynamics are invariant across tasks, a shared world model can aggregate multi-task experience to learn robust, task-agnostic representations. In contrast, model-free methods suffer from gradient interference when tasks demand conflicting actions in similar states. Task diversity therefore acts as a regularizer for MBRL, improving dynamics learning and sample efficiency. We instantiate this idea with EfficientZero-Multitask (EZ-M), a sample-efficient multi-task MBRL algorithm for online learning. Evaluated on HumanoidBench, a challenging whole-body control benchmark, EZ-M achieves state-of-the-art performance with significantly higher sample efficiency than strong baselines, without extreme parameter scaling. These results establish task scaling as a critical axis for scalable robotic learning. The project website is available at https://yewr.github.io/ez_m/.
Chinese Translation
开发能够掌握多种技能的通用机器人仍然是具身人工智能中的一个核心挑战。尽管近期的进展强调扩展模型参数和离线数据集,但这种方法在机器人领域受到限制,因为学习需要主动交互。我们认为,有效的在线学习应当扩展任务数量,而不是每个任务的样本数量。这种模式揭示了基于模型的强化学习(Model-Based Reinforcement Learning, MBRL)的结构优势。由于物理动态在任务之间保持不变,共享的世界模型可以聚合多任务经验,以学习稳健的、与任务无关的表示。相比之下,当任务在相似状态下要求冲突的动作时,无模型(model-free)方法会遭遇梯度干扰。因此,任务多样性充当了MBRL的正则化器,改善了动态学习和样本效率。我们用EfficientZero-Multitask(EZ-M)这一面向在线学习的样本高效多任务MBRL算法实例化了这一思想。在具有挑战性的全身控制基准HumanoidBench上进行评估时,EZ-M实现了最先进的性能,且样本效率显著高于强基线,而无需极端的参数扩展。这些结果确立了任务扩展是可扩展机器人学习的一个关键轴。项目网站:https://yewr.github.io/ez_m/。
cs.AI / 67 / 2603.01464
ProtRLSearch: A Multi-Round Multimodal Protein Search Agent with Large Language Models Trained via Reinforcement Learning
ProtRLSearch:一种基于强化学习训练的大语言模型多轮多模态蛋白质搜索代理
Abstract
Protein analysis tasks arising in healthcare settings often require accurate reasoning under protein sequence constraints, involving tasks such as functional interpretation of disease-related variants, protein-level analysis for clinical research, and similar scenarios. To address such tasks, search agents are introduced to search protein-related information, providing support for disease-related variant analysis and protein function reasoning in protein-centric inference. However, such search agents are mostly limited to single-round, text-only modality search, which prevents the protein sequence modality from being incorporated as a multimodal input into the search decision-making process. Meanwhile, their reliance on reinforcement learning (RL) supervision that focuses solely on the final answer results in a lack of search process constraints, making deviations in keyword selection and reasoning directions difficult to identify and correct in a timely manner. To address these limitations, we propose ProtRLSearch, a multi-round protein search agent trained with multi-dimensional reward based RL, which jointly leverages protein sequence and text as multimodal inputs during real-time search to produce high quality reports. To evaluate the ability of models to integrate protein sequence information and text-based multimodal inputs in realistic protein query settings, we construct ProtMCQs, a benchmark of 3,000 multiple choice questions (MCQs) organized into three difficulty levels. The benchmark evaluates protein query tasks that range from sequence constrained reasoning about protein function and phenotype changes to comprehensive protein reasoning that integrates multi-dimensional sequence features with signal pathways and regulatory networks.
Chinese Translation
在医疗环境中出现的蛋白质分析任务通常需要在蛋白质序列约束下进行准确推理,涉及疾病相关变异的功能解释、临床研究的蛋白质级分析以及类似场景。为了解决这些任务,研究者引入了搜索代理来检索与蛋白质相关的信息,为以蛋白质为中心的推断中的疾病相关变异分析和蛋白质功能推理提供支持。然而,这些搜索代理大多仅限于单轮、纯文本模态搜索,这阻碍了将蛋白质序列模态作为多模态输入纳入搜索决策过程。同时,它们依赖仅关注最终答案的强化学习(RL)监督,导致缺乏对搜索过程的约束,使得关键词选择和推理方向上的偏差难以及时识别和纠正。为了解决这些局限性,我们提出了ProtRLSearch,这是一种通过基于多维奖励的强化学习训练的多轮蛋白质搜索代理,在实时搜索中联合利用蛋白质序列和文本作为多模态输入,以生成高质量报告。为了评估模型在现实蛋白质查询环境中整合蛋白质序列信息和基于文本的多模态输入的能力,我们构建了ProtMCQs,一个包含3000个多项选择题(MCQs)的基准,分为三个难度级别。该基准评估的蛋白质查询任务范围从关于蛋白质功能和表型变化的序列约束推理,到整合多维序列特征与信号通路和调控网络的综合蛋白质推理。
cs.AI / 68 / 2603.01481
Harmonizing Dense and Sparse Signals in Multi-turn RL: Dual-Horizon Credit Assignment for Industrial Sales Agents
在多轮强化学习中协调密集和稀疏信号:工业销售代理的双视域信用分配
Abstract
Optimizing large language models for industrial sales requires balancing long-term commercial objectives (e.g., conversion rate) with immediate linguistic constraints such as fluency and compliance. Conventional reinforcement learning often merges these heterogeneous goals into a single reward, causing high-magnitude session-level rewards to overwhelm subtler turn-level signals, which leads to unstable training or reward hacking. To address this issue, we propose Dual-Horizon Credit Assignment (DuCA), a framework that disentangles optimization across time scales. Its core, Horizon-Independent Advantage Normalization (HIAN), separately normalizes advantages from turn-level and session-level rewards before fusion, ensuring balanced gradient contributions from both immediate and long-term objectives to the policy update. Extensive experiments with a high-fidelity user simulator show DuCA outperforms the state-of-the-art GRPO baseline, achieving a 6.82% relative improvement in conversion rate, reducing inter-sentence repetition by 82.28%, and lowering identity detection rate by 27.35%, indicating a substantial improvement for an industrial sales scenario that effectively balances the dual demands of strategic performance and naturalistic language generation.
Chinese Translation
为工业销售优化大型语言模型需要在长期商业目标(例如,转化率)与即时语言约束(如流畅性和合规性)之间取得平衡。传统的强化学习通常将这些异质目标合并为单一奖励,导致高幅度的会话级奖励压倒更微妙的轮次级信号,从而造成训练不稳定或奖励操控。为了解决这个问题,我们提出了双视域信用分配(Dual-Horizon Credit Assignment, DuCA)框架,该框架在不同时间尺度上解耦优化。其核心——视域无关优势归一化(Horizon-Independent Advantage Normalization, HIAN)——在融合之前分别对轮次级和会话级奖励的优势进行归一化,确保来自即时和长期目标的梯度贡献在策略更新中保持平衡。与高保真用户模拟器进行的大量实验表明,DuCA在转化率上相较于最先进的GRPO基线实现了6.82%的相对提升,减少了82.28%的句间重复,并降低了27.35%的身份检测率,表明在有效平衡战略表现与自然语言生成双重需求的工业销售场景中取得了显著改善。
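The core of HIAN — normalizing each horizon's advantages separately before fusion — can be sketched in a few lines. The fusion weights below are illustrative, not the paper's:

```python
def normalize(xs):
    """Zero-mean, unit-std normalization (population std)."""
    m = sum(xs) / len(xs)
    sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    if sd == 0:
        sd = 1.0  # constant advantages: leave centered values as-is
    return [(x - m) / sd for x in xs]

def hian_fuse(turn_adv, session_adv, w_turn=0.5, w_session=0.5):
    """Normalize each horizon independently, then fuse, so large
    session-level magnitudes cannot drown out turn-level signal."""
    t, s = normalize(turn_adv), normalize(session_adv)
    return [w_turn * a + w_session * b for a, b in zip(t, s)]
```

Because each stream is brought to unit scale before mixing, a session reward three orders of magnitude larger than the turn reward contributes the same gradient scale to the policy update.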
cs.AI / 69 / 2603.01486
Agentic Multi-Source Grounding for Enhanced Query Intent Understanding: A DoorDash Case Study
基于代理的多源锚定以增强查询意图理解:DoorDash案例研究
Abstract
Accurately mapping user queries to business categories is a fundamental Information Retrieval challenge for multi-category marketplaces, where context-sparse queries such as "Wildflower" exhibit intent ambiguity, simultaneously denoting a restaurant chain, a retail product, and a floral item. Traditional classifiers force a winner-takes-all assignment, while general-purpose LLMs hallucinate unavailable inventory. We introduce an Agentic Multi-Source Grounded system that addresses both failure modes by grounding LLM inference in (i) a staged catalog entity retrieval pipeline and (ii) an agentic web-search tool invoked autonomously for cold-start queries. Rather than predicting a single label, the model emits an ordered multi-intent set, resolved by a configurable disambiguation layer that applies deterministic business policies and is designed for extensibility to personalization signals. This decoupled design generalizes across domains, allowing any marketplace to supply its own grounding sources and resolution rules without modifying the core architecture. Evaluated on DoorDash's multi-vertical search platform, the system achieves +10.9pp over the ungrounded LLM baseline and +4.6pp over the legacy production system. On long-tail queries, incremental ablations attribute +8.3pp to catalog grounding, +3.2pp to agentic web search grounding, and +1.5pp to dual intent disambiguation, yielding 90.7% accuracy (+13.0pp over baseline). The system is deployed in production, serving over 95% of daily search impressions, and establishes a generalizable paradigm for applications requiring foundation models grounded in proprietary context and real-time web knowledge to resolve ambiguous, context-sparse decision problems at scale.
Chinese Translation
准确地将用户查询映射到商业类别是多类别市场中的一个基本信息检索挑战,其中像“Wildflower”这样的上下文稀疏查询表现出意图模糊性,可同时指代餐饮连锁、零售产品和花卉商品。传统分类器强制采用赢家通吃的分配方式,而通用大型语言模型(LLMs)则会虚构不存在的库存。我们提出了一种代理式多源锚定系统,通过将LLM推理锚定于(i)分阶段的目录实体检索管道和(ii)为冷启动查询自主调用的代理网络搜索工具,解决了这两种失败模式。该模型不是预测单一标签,而是输出一个有序的多意图集合,由一个可配置的消歧层解析,该层应用确定性的商业政策,并设计为可扩展以纳入个性化信号。这种解耦设计可在不同领域间推广,允许任何市场提供自己的锚定来源和解析规则,而无需修改核心架构。在DoorDash的多垂直搜索平台上进行评估,该系统相较无锚定的LLM基线提升了10.9个百分点,相较原有生产系统提升了4.6个百分点。在长尾查询上,增量消融分析将8.3个百分点归因于目录锚定,3.2个百分点归因于代理网络搜索锚定,1.5个百分点归因于双重意图消歧,最终实现了90.7%的准确率(比基线提高13.0个百分点)。该系统已在生产环境中部署,服务于超过95%的每日搜索展示,并为需要将基础模型锚定于专有上下文和实时网络知识、以大规模解决模糊且上下文稀疏的决策问题的应用建立了一种可推广的范式。
cs.AI / 70 / 2603.01488
LLM-assisted Semantic Option Discovery for Facilitating Adaptive Deep Reinforcement Learning
基于大型语言模型的语义选项发现以促进自适应深度强化学习
Abstract
Despite achieving remarkable success in complex tasks, Deep Reinforcement Learning (DRL) still suffers from critical issues in practical applications, such as low data efficiency, lack of interpretability, and limited cross-environment transferability. Moreover, learned policies that generate actions from states are sensitive to environmental changes and struggle to guarantee behavioral safety and compliance. Recent research shows that integrating Large Language Models (LLMs) with symbolic planning is promising in addressing these challenges. Inspired by this, we introduce a novel LLM-driven closed-loop framework, which enables semantic-driven skill reuse and real-time constraint monitoring by mapping natural language instructions into executable rules and semantically annotating automatically created options. The proposed approach utilizes the general knowledge of LLMs to improve exploration efficiency and adapt transferable options to similar environments, and provides inherent interpretability through semantic annotations. To validate the effectiveness of this framework, we conduct experiments on two domains, Office World and Montezuma's Revenge. The results demonstrate superior performance in data efficiency, constraint compliance, and cross-task transferability.
Chinese Translation
尽管在复杂任务中取得了显著成功,深度强化学习(Deep Reinforcement Learning, DRL)在实际应用中仍面临诸多关键问题,如数据效率低、缺乏可解释性以及跨环境迁移能力有限。此外,基于状态生成动作的学习策略对环境变化敏感,难以保证行为安全和合规性。近期研究表明,将大型语言模型(Large Language Models, LLMs)与符号规划相结合在解决这些挑战方面具有良好前景。受到此启发,我们提出了一种新颖的基于LLM的闭环框架,该框架通过将自然语言指令映射为可执行规则并对自动生成的选项进行语义注释,实现了基于语义的技能重用和实时约束监控。所提出的方法利用LLM的通用知识来提高探索效率,并使可转移选项适应类似环境,同时通过语义注释提供内在的可解释性。为了验证该框架的有效性,我们在Office World和Montezuma's Revenge两个领域进行了实验。结果表明,在数据效率、约束合规性和跨任务迁移能力方面表现优越。
cs.AI / 71 / 2603.01511
Multimodal Mixture-of-Experts with Retrieval Augmentation for Protein Active Site Identification
基于检索增强的多模态专家混合模型用于蛋白质活性位点识别
Abstract
Accurate identification of protein active sites at the residue level is crucial for understanding protein function and advancing drug discovery. However, current methods face two critical challenges: vulnerability in single-instance prediction due to sparse training data, and inadequate modality reliability estimation that leads to performance degradation when unreliable modalities dominate fusion processes. To address these challenges, we introduce Multimodal Mixture-of-Experts with Retrieval Augmentation (MERA), the first retrieval-augmented framework for protein active site identification. MERA employs hierarchical multi-expert retrieval that dynamically aggregates contextual information from chain, sequence, and active-site perspectives through residue-level mixture-of-experts gating. To prevent modality degradation, we propose a reliability-aware fusion strategy based on Dempster-Shafer evidence theory that quantifies modality trustworthiness through belief mass functions and learnable discounting coefficients, enabling principled multimodal integration. Extensive experiments on ProTAD-Gen and TS125 datasets demonstrate that MERA achieves state-of-the-art performance, with 90% AUPRC on active site prediction and significant gains on peptide-binding site identification, validating the effectiveness of retrieval-augmented multi-expert modeling and reliability-guided fusion.
Chinese Translation
在残基水平上准确识别蛋白质活性位点对于理解蛋白质功能和推动药物发现至关重要。然而,当前方法面临两个关键挑战:由于训练数据稀疏导致的单实例预测脆弱性,以及当不可靠模态主导融合过程时,模态可靠性估计不足导致的性能下降。为了解决这些挑战,我们提出了基于检索增强的多模态专家混合模型(MERA),这是第一个用于蛋白质活性位点识别的检索增强框架。MERA采用层次化多专家检索,通过残基级别的专家混合门控动态聚合来自链、序列和活性位点的上下文信息。为了防止模态退化,我们提出了一种基于邓普斯特-沙弗证据理论的可靠性感知融合策略,通过信念质量函数和可学习的折扣系数量化模态的可信度,从而实现原则性的多模态集成。在ProTAD-Gen和TS125数据集上的大量实验表明,MERA在活性位点预测上达到了90%的AUPRC,并在肽结合位点识别上取得了显著提升,验证了检索增强的多专家建模和可靠性引导融合的有效性。
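A heavily simplified sketch of the reliability-aware fusion idea described above. This is an illustrative trust-weighted average, not MERA's actual Dempster-Shafer belief-mass combination; the function name and interface are assumptions.

```python
def discounted_fusion(modality_scores, reliabilities):
    """Illustrative reliability-aware fusion: each modality's score is
    discounted by a trust coefficient in [0, 1] (standing in for a learnable
    discounting coefficient), then normalized, so an unreliable modality
    cannot dominate the fused prediction."""
    weighted = [r * s for r, s in zip(reliabilities, modality_scores)]
    total = sum(reliabilities)
    return sum(weighted) / total if total > 0 else 0.0
```

Setting a modality's reliability to zero removes its influence entirely, which mirrors the degradation-prevention goal of the paper's fusion strategy.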
cs.AI / 72 / 2603.01537
Pharmacology Knowledge Graphs: Do We Need Chemical Structure for Drug Repurposing?
药理学知识图谱:药物再利用是否需要化学结构?
Abstract
The contributions of model complexity, data volume, and feature modalities to knowledge graph-based drug repurposing remain poorly quantified under rigorous temporal validation. We constructed a pharmacology knowledge graph from ChEMBL 36 comprising 5,348 entities including 3,127 drugs, 1,156 proteins, and 1,065 indications. A strict temporal split was enforced with training data up to 2022 and testing data from 2023 to 2025, together with biologically verified hard negatives mined from failed assays and clinical trials. We benchmarked five knowledge graph embedding models and a standard graph neural network with 3.44 million parameters that incorporates drug chemical structure using a graph attention encoder and ESM-2 protein embeddings. Scaling experiments ranging from 0.78 to 9.75 million parameters and from 25 to 100 percent of the data, together with feature ablation studies, were used to isolate the contributions of model capacity, graph density, and node feature modalities. Removing the graph-attention-based drug structure encoder and retaining only topological embeddings combined with ESM-2 protein features improved drug-protein PR-AUC from 0.5631 to 0.5785 while reducing VRAM usage from 5.30 GB to 353 MB. Replacing the drug encoder with Morgan fingerprints further degraded performance, indicating that explicit chemical structure representations can be detrimental for predicting pharmacological network interactions. Increasing model size beyond 2.44 million parameters yielded diminishing returns, whereas increasing training data consistently improved performance. External validation confirmed 6 of the top 14 novel predictions as established therapeutic indications. These results show that drug pharmacological behavior can be accurately predicted using target-centric information and drug network topology alone, without requiring explicit chemical structure representations.
Chinese Translation
在严格的时间验证下,模型复杂性、数据量和特征模态对基于知识图谱的药物再利用的贡献仍然缺乏量化。我们从 ChEMBL 36 构建了一个药理学知识图谱,包含 5,348 个实体,包括 3,127 种药物、1,156 种蛋白质和 1,065 种适应症。我们严格执行了时间划分,训练数据截止到 2022 年,测试数据为 2023 年至 2025 年,同时结合了从失败的检测和临床试验中挖掘的生物验证的硬负样本。我们对五种知识图谱嵌入模型和一个标准图神经网络进行了基准测试,该网络具有 344 万个参数,采用图注意力编码器和 ESM-2 蛋白质嵌入,结合了药物的化学结构。通过从 0.78 到 9.75 百万参数和从 25% 到 100% 的数据范围进行的扩展实验,以及特征消融研究,用于分离模型容量、图密度和节点特征模态的贡献。去除基于图注意力的药物结构编码器,仅保留拓扑嵌入与 ESM-2 蛋白质特征结合,使药物-蛋白质 PR-AUC 从 0.5631 提升至 0.5785,同时将 VRAM 使用量从 5.30 GB 降低至 353 MB。用 Morgan 指纹替换药物编码器进一步降低了性能,表明显式的化学结构表示可能对预测药理网络交互产生负面影响。将模型规模增加到超过 244 万个参数的收益递减,而增加训练数据则持续改善性能。外部验证确认了 14 个新预测中的 6 个作为已确立的治疗适应症。这些结果表明,仅使用以靶点为中心的信息和药物网络拓扑,可以准确预测药物的药理行为,而无需显式的化学结构表示。
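The strict temporal split described in this entry can be sketched in a few lines. The triple schema and field names below are illustrative assumptions, not the paper's actual data format; only the cutoff-year logic (train up to 2022, test strictly after) comes from the abstract.

```python
def temporal_split(triples, train_cutoff=2022):
    """Strict temporal split for link prediction: train only on facts dated
    up to `train_cutoff`, test on strictly later facts, so no information
    from the test period can leak into training."""
    train = [t for t in triples if t["year"] <= train_cutoff]
    test = [t for t in triples if t["year"] > train_cutoff]
    return train, test
```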
cs.AI / 73 / 2603.01548
Graph-Based Self-Healing Tool Routing for Cost-Efficient LLM Agents
基于图的自愈工具路由以实现成本高效的LLM代理
Abstract
Tool-using LLM agents face a reliability-cost tradeoff: routing every decision through the LLM improves correctness but incurs high latency and inference cost, while pre-coded workflow graphs reduce cost but become brittle under unanticipated compound tool failures. We present Self-Healing Router, a fault-tolerant orchestration architecture that treats most agent control-flow decisions as routing rather than reasoning. The system combines (i) parallel health monitors that assign priority scores to runtime conditions such as tool outages and risk signals, and (ii) a cost-weighted tool graph where Dijkstra's algorithm performs deterministic shortest-path routing. When a tool fails mid-execution, its edges are reweighted to infinity and the path is recomputed -- yielding automatic recovery without invoking the LLM. The LLM is reserved exclusively for cases where no feasible path exists, enabling goal demotion or escalation. Prior graph-based tool-use systems (ControlLLM, ToolNet, NaviAgent) focus on tool selection and planning; our contribution is runtime fault tolerance with deterministic recovery and binary observability -- every failure is either a logged reroute or an explicit escalation, never a silent skip. Across 19 scenarios spanning three graph topologies (linear pipeline, dependency DAG, parallel fan-out), Self-Healing Router matches ReAct's correctness while reducing control-plane LLM calls by 93% (9 vs 123 aggregate) and eliminating the silent-failure cases observed in a well-engineered static workflow baseline under compound failures.
Chinese Translation
使用工具的LLM代理面临可靠性与成本的权衡:通过LLM路由每一个决策可以提高正确性,但会导致高延迟和推理成本,而预编码的工作流图则可以降低成本,但在意外的复合工具故障下变得脆弱。我们提出了自愈路由器(Self-Healing Router),这是一种容错的调度架构,将大多数代理控制流决策视为路由而非推理。该系统结合了(i)并行健康监控器,它们为运行时条件(如工具故障和风险信号)分配优先级分数,以及(ii)一个成本加权的工具图,其中Dijkstra算法执行确定性的最短路径路由。当工具在执行过程中故障时,其边的权重被重新设置为无穷大,并重新计算路径——实现自动恢复而无需调用LLM。LLM仅用于没有可行路径的情况,从而实现目标降级或升级。先前的基于图的工具使用系统(ControlLLM、ToolNet、NaviAgent)侧重于工具选择和规划;我们的贡献在于运行时容错,具有确定性的恢复和二元可观察性——每次故障要么是记录的重新路由,要么是明确的升级,绝不会出现静默跳过。在涵盖三种图拓扑(线性管道、依赖DAG、并行扇出)的19种场景中,自愈路由器在保持ReAct的正确性的同时,将控制平面LLM调用减少了93%(9次对比123次总调用),并消除了在复合故障下观察到的静默故障案例,这些案例出现在经过良好工程设计的静态工作流基线中。
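A minimal sketch of the deterministic reroute mechanism this abstract describes. The graph encoding and function names are illustrative, not the paper's API: edges into a failed tool are reweighted to infinity and Dijkstra's algorithm is rerun, so recovery requires no LLM call.

```python
import heapq
import math

def dijkstra(graph, start, goal):
    """Shortest path by cumulative edge cost; returns (cost, path),
    or (inf, []) when no feasible path exists."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == goal:
            path = [goal]
            while path[-1] != start:
                path.append(prev[path[-1]])
            return d, path[::-1]
        for nxt, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nxt, math.inf):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    return math.inf, []

def reroute_on_failure(graph, start, goal, failed_tool):
    """Reweight every edge into the failed tool to infinity, then replan."""
    patched = {
        node: {nxt: (math.inf if nxt == failed_tool else c)
               for nxt, c in edges.items()}
        for node, edges in graph.items()
    }
    return dijkstra(patched, start, goal)
```

A returned cost of infinity signals that no feasible path exists, which is exactly the case the abstract reserves for LLM-driven goal demotion or escalation.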
cs.AI / 74 / 2603.01553
State-Action Inpainting Diffuser for Continuous Control with Delay
带延迟的连续控制状态-动作修复扩散器
Abstract
Signal delay poses a fundamental challenge in continuous control and reinforcement learning (RL) by introducing a temporal gap between interaction and perception. Current solutions have largely evolved along two distinct paradigms: model-free approaches which utilize state augmentation to preserve Markovian properties, and model-based methods which focus on inferring latent beliefs via dynamics modeling. In this paper, we bridge these perspectives by introducing State-Action Inpainting Diffuser (SAID), a framework that integrates the inductive bias of dynamics learning with the direct decision-making capability of policy optimization. By formulating the problem as a joint sequence inpainting task, SAID implicitly captures environmental dynamics while directly generating consistent plans, effectively operating at the intersection of model-based and model-free paradigms. Crucially, this generative formulation allows SAID to be seamlessly applied to both online and offline RL. Extensive experiments on delayed continuous control benchmarks demonstrate that SAID achieves state-of-the-art and robust performance. Our study suggests a new methodology to advance the field of RL with delay.
Chinese Translation
信号延迟在连续控制和强化学习(RL)中构成了一个基本挑战,因为它在交互与感知之间引入了时间间隔。目前的解决方案主要沿着两种不同的范式发展:无模型方法利用状态增强以保持马尔可夫特性,而基于模型的方法则专注于通过动态建模推断潜在信念。在本文中,我们通过引入状态-动作修复扩散器(SAID)来桥接这两种观点,该框架将动态学习的归纳偏差与策略优化的直接决策能力相结合。通过将问题表述为一个联合序列修复任务,SAID隐式捕捉环境动态,同时直接生成一致的计划,有效地在基于模型和无模型范式的交叉点上运行。至关重要的是,这种生成性表述使得SAID能够无缝应用于在线和离线强化学习。在延迟的连续控制基准上进行的广泛实验表明,SAID实现了最先进且稳健的性能。我们的研究建议了一种新方法,以推动带延迟的强化学习领域的发展。
cs.AI / 75 / 2603.01554
S5-HES Agent: Society 5.0-driven Agentic Framework to Democratize Smart Home Environment Simulation
S5-HES代理:基于社会5.0的代理框架以实现智能家居环境模拟的民主化
Abstract
The smart home is a key domain within the Society 5.0 vision for a human-centered society. Smart home technologies rapidly evolve, and research should diversify while remaining aligned with Society 5.0 objectives. Democratizing smart home research would engage a broader community of innovators beyond traditional limited experts. This shift necessitates inclusive simulation frameworks that support research across diverse fields in industry and academia. However, existing smart home simulators require significant technical expertise, offer limited adaptability, and lack automated evolution, thereby failing to meet the holistic needs of Society 5.0. These constraints impede researchers from efficiently conducting simulations and experiments for security, energy, health, climate, and socio-economic research. To address these challenges, this paper presents the Society 5.0-driven Smart Home Environment Simulator Agent (S5-HES Agent), an agentic simulation framework that transforms traditional smart home simulation through autonomous AI orchestration. The framework coordinates specialized agents through interchangeable large language models (LLMs), enabling natural-language-driven end-to-end smart home simulation configuration without programming expertise. A retrieval-augmented generation (RAG) pipeline with semantic, keyword, and hybrid search retrieves smart home knowledge. Comprehensive evaluation on S5-HES Agent demonstrates that the RAG pipeline achieves near-optimal retrieval fidelity, simulated device behaviour and threat scenarios align with real-world IoT datasets, and simulation engine scales predictably across home configurations, establishing a stable foundation for Society 5.0 smart home research. Source code is available under the MIT License at https://github.com/AsiriweLab/S5-HES-Agent.
Chinese Translation
智能家居是社会5.0愿景中以人为本社会的关键领域。智能家居技术迅速发展,研究应多样化,同时与社会5.0的目标保持一致。智能家居研究的民主化将吸引更广泛的创新者社区,而不仅限于传统的专家。这一转变需要包容性的模拟框架,以支持工业和学术界各个领域的研究。然而,现有的智能家居模拟器需要显著的技术专长,适应性有限,并且缺乏自动演变,无法满足社会5.0的整体需求。这些限制阻碍了研究人员有效地进行安全、能源、健康、气候和社会经济研究的模拟和实验。为了解决这些挑战,本文提出了基于社会5.0的智能家居环境模拟代理(S5-HES Agent),这是一个通过自主AI编排转变传统智能家居模拟的代理模拟框架。该框架通过可互换的大型语言模型(LLMs)协调专业代理,使得无需编程专长即可进行自然语言驱动的端到端智能家居模拟配置。一个增强检索生成(RAG)管道结合语义、关键词和混合搜索来检索智能家居知识。对S5-HES Agent的全面评估表明,RAG管道实现了近乎最佳的检索保真度,模拟设备行为和威胁场景与真实世界的物联网数据集一致,并且模拟引擎在家庭配置中可预测地扩展,为社会5.0智能家居研究建立了稳定的基础。源代码可在MIT许可证下获得,网址为https://github.com/AsiriweLab/S5-HES-Agent。
cs.AI / 76 / 2603.01557
Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring
多模态临床时间序列远程监测的LLM摘要基准评估
Abstract
Large language models (LLMs) can generate fluent clinical summaries of remote therapeutic monitoring time series. However, it remains unclear whether these narratives faithfully capture clinically significant events, such as sustained abnormalities. Existing evaluation metrics primarily focus on semantic similarity and linguistic quality, leaving event-level correctness largely unmeasured. To address this gap, we introduce an event-based evaluation framework for multimodal time-series summarization using the Technology-Integrated Health Management (TIHM)-1.5 dementia monitoring dataset. Clinically grounded daily events are derived through rule-based abnormal thresholds and temporal persistence criteria. Model-generated summaries are then aligned with these structured facts. Our evaluation protocol measures abnormality recall, duration recall, measurement coverage, and hallucinated event mentions. We benchmark three approaches: zero-shot prompting, statistical prompting, and a vision-based pipeline that uses rendered time-series visualizations. The results reveal a striking decoupling between conventional metrics and clinical event fidelity. Models that achieve high semantic similarity scores often exhibit near-zero abnormality recall. In contrast, the vision-based approach demonstrates the strongest event alignment, achieving 45.7% abnormality recall and 100% duration recall. These findings underscore the importance of event-aware evaluation to ensure reliable clinical time-series summarization.
Chinese Translation
大型语言模型(LLMs)能够生成流畅的远程治疗监测时间序列的临床摘要。然而,目前尚不清楚这些叙述是否忠实地捕捉了临床上重要的事件,例如持续的异常。现有的评估指标主要关注语义相似性和语言质量,而事件级的准确性则基本未被测量。为了解决这一问题,我们引入了一种基于事件的评估框架,用于多模态时间序列摘要,使用的是技术集成健康管理(Technology-Integrated Health Management, TIHM)-1.5痴呆监测数据集。通过基于规则的异常阈值和时间持续性标准,导出了临床基础的日常事件。然后,将模型生成的摘要与这些结构化事实进行对齐。我们的评估协议测量了异常回忆率、持续时间回忆率、测量覆盖率和虚构事件提及。我们基准测试了三种方法:零样本提示、统计提示和一种基于视觉的管道,该管道使用渲染的时间序列可视化。结果显示,传统指标与临床事件的真实性之间存在显著的脱节。获得高语义相似性分数的模型往往表现出接近零的异常回忆率。相比之下,基于视觉的方法展示了最强的事件对齐,达到了45.7%的异常回忆率和100%的持续时间回忆率。这些发现强调了事件感知评估的重要性,以确保可靠的临床时间序列摘要。
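The rule-based event derivation and abnormality-recall metric in this entry might be sketched as follows. The threshold direction, persistence criterion, and event encoding are illustrative assumptions, not the paper's exact definitions.

```python
def detect_events(values, threshold, min_persist):
    """Flag sustained abnormalities: runs of consecutive readings above
    `threshold` lasting at least `min_persist` steps (illustrative rule).
    Each event is returned as a (start_index, end_index) pair."""
    events, run_start = [], None
    for i, v in enumerate(values + [threshold]):  # sentinel closes a trailing run
        if v > threshold:
            run_start = i if run_start is None else run_start
        elif run_start is not None:
            if i - run_start >= min_persist:
                events.append((run_start, i - 1))
            run_start = None
    return events

def abnormality_recall(detected, mentioned):
    """Fraction of rule-detected events that the generated summary mentions."""
    if not detected:
        return 1.0
    return sum(e in mentioned for e in detected) / len(detected)
```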
cs.AI / 77 / 2603.01562
RubricBench: Aligning Model-Generated Rubrics with Human Standards
RubricBench:将模型生成的评分标准与人类标准对齐
Abstract
As Large Language Model (LLM) alignment evolves from simple completions to complex, highly sophisticated generation, Reward Models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark to assess this evaluation paradigm, as existing benchmarks lack both the discriminative complexity and the ground-truth rubric annotations required for rigorous analysis. To bridge this gap, we introduce RubricBench, a curated benchmark with 1,147 pairwise comparisons specifically designed to assess the reliability of rubric-based evaluation. Our construction employs a multi-dimensional filtration pipeline to target hard samples featuring nuanced input complexity and misleading surface bias, augmenting each with expert-annotated, atomic rubrics derived strictly from instructions. Comprehensive experiments reveal a substantial capability gap between human-annotated and model-generated rubrics, indicating that even state-of-the-art models struggle to autonomously specify valid evaluation criteria, lagging considerably behind human-guided performance.
Chinese Translation
随着大型语言模型(LLM)对齐从简单的补全演变为复杂且高度精细的生成,奖励模型越来越倾向于采用评分标准引导的评估,以减轻表面偏见。然而,当前社区缺乏一个统一的基准来评估这一评估范式,因为现有基准在判别复杂性和严格分析所需的真实评分标准注释方面均存在不足。为了解决这一问题,我们推出了RubricBench,这是一个经过精心策划的基准,包含1,147个成对比较,专门设计用于评估基于评分标准的评估的可靠性。我们的构建采用多维过滤管道,针对具有细微输入复杂性和误导性表面偏见的难样本进行筛选,并为每个样本增添严格依据说明生成的专家注释原子评分标准。全面的实验结果揭示了人类注释的评分标准与模型生成的评分标准之间存在显著的能力差距,表明即使是最先进的模型也难以自主指定有效的评估标准,远远落后于人类指导的表现。
cs.AI / 78 / 2603.01571
Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models
超越长度缩放:协同广度与深度的生成奖励模型
Abstract
Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, subsequently employing Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state-of-the-art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels in objective correctness tasks. Consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization where the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released at Hugging Face (https://huggingface.co/collections/DonJoey/mix-grm), and the code is released at GitHub (https://github.com/Don-Joey/Mix-GRM).
Chinese Translation
最近在生成奖励模型(Generative Reward Models, GRMs)方面的进展表明,扩展思维链(Chain-of-Thought, CoT)推理的长度显著提高了评估的可靠性。然而,目前的研究主要依赖于非结构化的长度缩放,忽视了不同推理机制的效能差异:广度思维链(Breadth-CoT, B-CoT,即多维原则覆盖)和深度思维链(Depth-CoT, D-CoT,即实质性判断的合理性)。为了解决这个问题,我们提出了Mix-GRM,一个通过模块化合成管道将原始推理重构为结构化的B-CoT和D-CoT的框架,随后采用监督微调(Supervised Fine-Tuning, SFT)和可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)来内化和优化这些机制。全面的实验表明,Mix-GRM在五个基准测试中建立了新的最先进水平,平均超越领先的开源奖励模型8.2%。我们的结果揭示了推理的明显分歧:B-CoT在主观偏好任务中表现良好,而D-CoT在客观正确性任务中表现优异。因此,将推理机制与任务不匹配会直接降低性能。此外,我们证明了RLVR作为一种切换放大器,诱导出一种突现极化现象,使模型自发地分配其推理风格以匹配任务需求。合成的数据和模型已在Hugging Face上发布,代码已在Github上发布。
cs.AI / 79 / 2603.01607
CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework
CARE:基于证据的代理框架下多模态医学推理中的临床责任
Abstract
Large visual language models (VLMs) have shown strong multi-modal medical reasoning ability, but most operate as end-to-end black boxes, diverging from clinicians' evidence-based, staged workflows and hindering clinical accountability. Complementarily, expert visual grounding models can accurately localize regions of interest (ROIs), providing explicit, reliable evidence that improves both reasoning accuracy and trust. In this paper, we introduce CARE, advancing Clinical Accountability in multi-modal medical Reasoning with an Evidence-grounded agentic framework. Unlike existing approaches that couple grounding and reasoning within a single generalist model, CARE decomposes the task into coordinated sub-modules to reduce shortcut learning and hallucination: a compact VLM proposes relevant medical entities; an expert entity-referring segmentation model produces pixel-level ROI evidence; and a grounded VLM reasons over the full image augmented by ROI hints. The VLMs are optimized with reinforcement learning with verifiable rewards to align answers with supporting evidence. Furthermore, a VLM coordinator plans tool invocation and reviews evidence-answer consistency, providing agentic control and final verification. Evaluated on standard medical VQA benchmarks, our CARE-Flow (coordinator-free) improves average accuracy by 10.9% over the same size (10B) state-of-the-art (SOTA). With dynamic planning and answer review, our CARE-Coord yields a further gain, outperforming the heavily pre-trained SOTA by 5.2%. Our experiments demonstrate that an agentic framework that emulates clinical workflows, incorporating decoupled specialized models and explicit evidence, yields more accurate and accountable medical AI.
Chinese Translation
大型视觉语言模型(VLMs)展示了强大的多模态医学推理能力,但大多数模型作为端到端的黑箱运行,偏离了临床医生基于证据的分阶段工作流程,阻碍了临床责任的实现。作为补充,专家视觉定位模型能够准确定位感兴趣区域(ROIs),提供明确、可靠的证据,从而提高推理的准确性和可信度。在本文中,我们介绍了CARE,旨在通过基于证据的代理框架推进多模态医学推理中的临床责任。与现有将定位和推理结合在单一通用模型中的方法不同,CARE将任务分解为协调的子模块,以减少捷径学习和幻觉:一个紧凑的VLM提出相关的医学实体;一个专家实体引用分割模型生成像素级的ROI证据;一个基于定位的VLM在增强了ROI提示的完整图像上进行推理。VLM通过可验证奖励的强化学习进行优化,以使答案与支持证据对齐。此外,VLM协调器负责规划工具调用并审查证据与答案的一致性,提供代理控制和最终验证。在标准医学视觉问答基准上评估,我们的CARE-Flow(无协调器)在相同规模(10B)的最先进技术(SOTA)上平均提高了10.9%的准确率。通过动态规划和答案审查,我们的CARE-Coord进一步提升,超越了经过大量预训练的SOTA,提升幅度为5.2%。我们的实验表明,模拟临床工作流程的代理框架,结合解耦的专业模型和明确的证据,能够产生更准确和更负责任的医学人工智能。
cs.AI / 80 / 2603.01608
Evaluating and Understanding Scheming Propensity in LLM Agents
评估与理解大型语言模型代理的阴谋倾向
Abstract
As frontier language models are increasingly deployed as autonomous agents pursuing complex, long-term objectives, there is increased risk of scheming: agents covertly pursuing misaligned goals. Prior work has focused on showing agents are capable of scheming, but their propensity to scheme in realistic scenarios remains underexplored. To understand when agents scheme, we decompose scheming incentives into agent factors and environmental factors. We develop realistic settings allowing us to systematically vary these factors, each with scheming opportunities for agents that pursue instrumentally convergent goals such as self-preservation, resource acquisition, and goal-guarding. We find only minimal instances of scheming despite high environmental incentives, and show this is unlikely due to evaluation awareness. While inserting adversarially-designed prompt snippets that encourage agency and goal-directedness into an agent's system prompt can induce high scheming rates, snippets used in real agent scaffolds rarely do. Surprisingly, in model organisms (Hubinger et al., 2023) built with these snippets, scheming behavior is remarkably brittle: removing a single tool can drop the scheming rate from 59% to 3%, and increasing oversight can raise rather than deter scheming by up to 25%. Our incentive decomposition enables systematic measurement of scheming propensity in settings relevant for deployment, which is necessary as agents are entrusted with increasingly consequential tasks.
Chinese Translation
随着前沿语言模型越来越多地被部署为追求复杂长期目标的自主代理,阴谋的风险也随之增加:代理秘密追求不一致的目标。先前的研究集中在展示代理能够进行阴谋,但它们在现实场景中的阴谋倾向仍然未得到充分探索。为了理解代理何时进行阴谋,我们将阴谋动机分解为代理因素和环境因素。我们开发了现实的设置,使我们能够系统地变化这些因素,每个因素都为追求工具性趋同目标(如自我保护、资源获取和目标保护)的代理提供了阴谋机会。尽管环境激励很高,我们发现阴谋的实例仍然很少,并且显示这不太可能是由于评估意识造成的。虽然在代理的系统提示中插入设计对抗性提示片段以鼓励代理性和目标导向性可以诱导高阴谋率,但在真实代理支架中使用的提示片段很少能做到这一点。令人惊讶的是,在使用这些提示片段构建的模型生物(Hubinger et al., 2023)中,阴谋行为非常脆弱:移除一个工具可以使阴谋率从59%降至3%,而增加监督反而可以将阴谋率提高多达25%。我们的激励分解使得在与部署相关的设置中系统测量阴谋倾向成为可能,这在代理被赋予越来越重要的任务时是必要的。
cs.AI / 81 / 2603.01620
ToolRLA: Fine-Grained Reward Decomposition for Tool-Integrated Reinforcement Learning Alignment in Domain-Specific Agents
ToolRLA:针对领域特定代理的工具集成强化学习对齐的细粒度奖励分解
Abstract
Tool-integrated reasoning agents interleaving natural language deliberation with external API calls show promise for complex multi-step tasks. However, aligning such agents for high-stakes domain-specific deployment is challenging, as existing reinforcement learning uses coarse binary rewards (success/failure) that insufficiently guide nuanced tool invocation in production. We present ToolRLA, a three-stage post-training pipeline (Supervised Fine-Tuning, Group Relative Policy Optimization, Direct Preference Optimization) for domain-specific tool-integrated agents. Its core is a fine-grained reward function with multiplicative correctness decomposition, evaluating tool invocation across four dimensions: format validity, tool selection correctness, invocation efficiency, and domain constraint compliance. Multiplicative composition prioritizes correct tool selection (a prerequisite for meaningful parameter evaluation), while a large negative compliance penalty ($\lambda=10$) ensures regulatory adherence. Deployed on a real-world financial advisory copilot (80+ advisors, 1,200+ daily queries, 15+ heterogeneous APIs), ToolRLA achieves 47% higher end-to-end task completion (62% to 91%), 63% lower tool invocation error (38% to 14%), 93% lower regulatory violation (12% to 0.8%), and sub-2-second latency after three months. Ablation studies confirm fine-grained reward decomposition contributes 7 percentage points over coarse additive rewards; generalizability is validated on ToolBench and API-Bank.
Chinese Translation
工具集成推理代理通过自然语言推理与外部API调用交替进行,显示出在复杂多步骤任务中的潜力。然而,为高风险领域特定部署对这些代理进行对齐是具有挑战性的,因为现有的强化学习使用粗糙的二元奖励(成功/失败),无法充分指导生产中的细致工具调用。我们提出了ToolRLA,一个针对领域特定工具集成代理的三阶段后训练管道(监督微调、组相对策略优化、直接偏好优化)。其核心是一个具有乘法正确性分解的细粒度奖励函数,从四个维度评估工具调用:格式有效性、工具选择正确性、调用效率和领域约束合规性。乘法组合优先考虑正确的工具选择(这是有意义的参数评估的前提),而较大的负合规惩罚(λ=10)确保遵循监管要求。在一个真实世界的金融顾问助手上部署(80+名顾问,1200+每日查询,15+个异构API),ToolRLA实现了47%的端到端任务完成率提升(62%提升至91%),63%的工具调用错误降低(38%降低至14%),93%的合规违规降低(12%降低至0.8%),并在三个月后实现了低于2秒的延迟。消融研究确认细粒度奖励分解比粗糙的加性奖励贡献了7个百分点;在ToolBench和API-Bank上验证了其通用性。
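A toy version of the multiplicative correctness decomposition with a compliance penalty might look like this. The exact functional form and term definitions are assumptions for the sketch; only the gating-by-tool-selection idea and the penalty magnitude λ=10 come from the abstract.

```python
def tool_invocation_reward(format_valid, tool_correct, efficiency,
                           compliant, lam=10.0):
    """Illustrative multiplicative reward: correct tool selection gates the
    remaining terms (a wrong tool zeroes the reward regardless of format or
    efficiency), while any domain-constraint violation incurs a large fixed
    penalty of -lam. `efficiency` is assumed to lie in [0, 1]."""
    if not compliant:
        return -lam  # regulatory violations dominate everything else
    return float(format_valid) * float(tool_correct) * efficiency
```

The multiplicative form means partial credit is only possible once the prerequisite (tool selection) is satisfied, in contrast to additive rewards that can mask a wrong tool choice with high scores on other terms.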
cs.AI / 82 / 2603.01630
SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing
SEED-SET:可扩展的系统级伦理测试演变实验设计
Abstract
As autonomous systems, such as drones, become increasingly deployed in high-stakes, human-centric domains, it is critical to evaluate their ethical alignment, since failing to do so poses imminent danger to human lives and introduces long-term bias into decision-making. Automated ethical benchmarking of these systems is understudied due to the lack of ubiquitous, well-defined evaluation metrics and to stakeholder-specific subjectivity, which cannot be modeled analytically. To address these challenges, we propose SEED-SET, a Bayesian experimental design framework that incorporates domain-specific objective evaluations and subjective value judgments from stakeholders. SEED-SET models both evaluation types separately with hierarchical Gaussian Processes, and uses a novel acquisition strategy to propose interesting test candidates based on learnt qualitative preferences and objectives that align with the stakeholder preferences. We validate our approach for ethical benchmarking of autonomous agents on two applications and find that our method performs best. Our method provides an interpretable and efficient trade-off between exploration and exploitation, generating up to $2\times$ as many optimal test candidates as baselines, with a $1.25\times$ improvement in coverage of high-dimensional search spaces.
Chinese Translation
随着无人机等自主系统在高风险、以人为中心的领域中的日益广泛应用,评估其伦理一致性变得至关重要,因为未能做到这一点将对人类生命构成迫在眉睫的危险,并导致决策中的长期偏见。由于缺乏普遍适用且明确的评估指标,以及利益相关者特定的主观性,这些系统的自动化伦理基准测试尚未得到充分研究,而这些主观性无法通过分析模型进行建模。为了解决这些挑战,我们提出了SEED-SET,一个贝叶斯实验设计框架,结合了特定领域的客观评估和利益相关者的主观价值判断。SEED-SET分别使用层次高斯过程对这两种评估类型进行建模,并采用一种新颖的获取策略,根据学习到的定性偏好和与利益相关者偏好一致的目标,提出有趣的测试候选项。我们在两个应用案例中验证了我们的方法在自主代理的伦理基准测试中的有效性,发现我们的方法表现最佳。与基线相比,我们的方法在探索与利用之间提供了可解释且高效的权衡,生成的最优测试候选项数量最多提高了2倍,并且在高维搜索空间的覆盖率提高了1.25倍。
cs.AI / 83 / 2603.01641
Learning Structured Reasoning via Tractable Trajectory Control
通过可处理轨迹控制学习结构化推理
Abstract
Large language models can exhibit emergent reasoning behaviors, often manifested as recurring lexical patterns (e.g., "wait," indicating verification). However, complex reasoning trajectories remain sparse in unconstrained sampling, and standard RL often fails to guarantee the acquisition of diverse reasoning behaviors. We propose a systematic discovery and reinforcement of diverse reasoning patterns through structured reasoning, a paradigm that requires targeted exploration of specific reasoning patterns during the RL process. To this end, we propose Ctrl-R, a framework for learning structured reasoning via tractable trajectory control that actively guides the rollout process, incentivizing the exploration of diverse reasoning patterns that are critical for complex problem-solving. The resulting behavior policy enables accurate importance-sampling estimation, supporting unbiased on-policy optimization. We further introduce a power-scaling factor on the importance-sampling weights, allowing the policy to selectively learn from exploratory, out-of-distribution trajectories while maintaining stable optimization. Experiments demonstrate that Ctrl-R enables effective exploration and internalization of previously unattainable reasoning patterns, yielding consistent improvements across language and vision-language models on mathematical reasoning tasks.
Chinese Translation
大型语言模型可以表现出突现的推理行为,通常表现为重复的词汇模式(例如,“等待”,表示验证)。然而,在无约束采样中,复杂的推理轨迹仍然稀疏,标准的强化学习(RL)往往无法保证获得多样化的推理行为。我们提出了一种通过结构化推理系统性地发现和强化多样化推理模式的方法,该范式要求在RL过程中针对特定推理模式进行有针对性的探索。为此,我们提出了Ctrl-R,一个通过可处理轨迹控制学习结构化推理的框架,该框架积极引导展开过程,激励探索对复杂问题解决至关重要的多样化推理模式。由此产生的行为策略能够实现准确的重要性采样估计,支持无偏的在线优化。我们进一步引入了重要性采样权重的幂缩放因子,使策略能够选择性地从探索性、分布外的轨迹中学习,同时保持稳定的优化。实验表明,Ctrl-R能够有效探索和内化以前无法获得的推理模式,在数学推理任务中对语言模型和视觉-语言模型均产生了一致的改进。
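The power-scaled importance-sampling weight mentioned in this entry can be illustrated as below. The choice of alpha, the clipping safeguard, and the log-probability interface are assumptions for the sketch, not the paper's actual formulation.

```python
import math

def power_scaled_weight(logp_policy, logp_behavior, alpha=0.5, clip=10.0):
    """Importance weight w = pi_theta(a|s) / mu(a|s), tempered as w**alpha
    with alpha in (0, 1], so exploratory, out-of-distribution trajectories
    still contribute to the update without dominating it. Clipping is a
    common additional safeguard for optimization stability."""
    log_w = logp_policy - logp_behavior
    return min(math.exp(alpha * log_w), clip)
```

With alpha=1 this recovers the standard unbiased importance weight; smaller alpha flattens the weight distribution, trading variance for bias.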
cs.AI / 84 / 2603.01654
CeProAgents: A Hierarchical Agents System for Automated Chemical Process Development
CeProAgents:用于自动化化学过程开发的分层代理系统
Abstract
The development of chemical processes, a cornerstone of chemical engineering, presents formidable challenges due to its multi-faceted nature, integrating specialized knowledge, conceptual design, and parametric simulation. Capitalizing on this, we propose CeProAgents, a hierarchical multi-agent system designed to automate the development of chemical process through collaborative division of labor. Our architecture comprises three specialized agent cohorts focused on knowledge, concept, and parameter respectively. To effectively adapt to the inherent complexity of chemical tasks, each cohort employs a novel hybrid architecture that integrates dynamic agent chatgroups with structured agentic workflows. To rigorously evaluate the system, we establish CeProBench, a multi-dimensional benchmark structured around three core pillars of chemical engineering. We design six distinct types of tasks across these dimensions to holistically assess the comprehensive capabilities of the system in chemical process development. The results not only confirm the effectiveness and superiority of our proposed approach but also reveal the transformative potential as well as the current boundaries of Large Language Models (LLMs) for industrial chemical engineering.
Chinese Translation
化学过程的开发是化学工程的基石,但由于其多面性,整合了专业知识、概念设计和参数模拟,面临着巨大的挑战。基于此,我们提出了CeProAgents,一个旨在通过协作分工自动化化学过程开发的分层多代理系统。我们的架构由三个专门的代理群体组成,分别专注于知识、概念和参数。为了有效适应化学任务固有的复杂性,每个群体采用了一种新颖的混合架构,将动态代理聊天组与结构化的代理工作流程相结合。为了严格评估该系统,我们建立了CeProBench,这是一个围绕化学工程三个核心支柱构建的多维基准。我们在这些维度上设计了六种不同类型的任务,以全面评估该系统在化学过程开发中的综合能力。结果不仅确认了我们提出的方法的有效性和优越性,还揭示了大型语言模型(LLMs)在工业化学工程中的变革潜力及当前局限性。
cs.AI / 85 / 2603.01667
Chain-of-Context Learning: Dynamic Constraint Understanding for Multi-Task VRPs
上下文链学习:多任务车辆路径问题的动态约束理解
Abstract
Multi-task Vehicle Routing Problems (VRPs) aim to minimize routing costs while satisfying diverse constraints. Existing solvers typically adopt a unified reinforcement learning (RL) framework to learn generalizable patterns across tasks. However, they often overlook the constraint and node dynamics during the decision process, making the model fail to accurately react to the current context. To address this limitation, we propose Chain-of-Context Learning (CCL), a novel framework that progressively captures the evolving context to guide fine-grained node adaptation. Specifically, CCL constructs step-wise contextual information via a Relevance-Guided Context Reformulation (RGCR) module, which adaptively prioritizes salient constraints. This context then guides node updates through a Trajectory-Shared Node Re-embedding (TSNR) module, which aggregates shared node features from all trajectories' contexts and uses them to update inputs for the next step. By modeling evolving preferences of the RL agent, CCL captures step-by-step dependencies in sequential decision-making. We evaluate CCL on 48 diverse VRP variants, including 16 in-distribution and 32 out-of-distribution (with unseen constraints) tasks. Experimental results show that CCL performs favorably against the state-of-the-art baselines, achieving the best performance on all in-distribution tasks and the majority of out-of-distribution tasks.
Chinese Translation
多任务车辆路径问题(VRPs)旨在最小化路径成本,同时满足多样化的约束。现有的求解器通常采用统一的强化学习(RL)框架,以学习跨任务的可推广模式。然而,它们在决策过程中往往忽视了约束和节点动态,导致模型无法准确地对当前上下文做出反应。为了解决这一局限性,我们提出了上下文链学习(Chain-of-Context Learning, CCL),这是一个新颖的框架,逐步捕捉不断演变的上下文,以指导细粒度的节点适应。具体而言,CCL通过一个相关性引导的上下文重构(Relevance-Guided Context Reformulation, RGCR)模块构建逐步的上下文信息,该模块自适应地优先考虑显著约束。然后,这一上下文通过一个轨迹共享节点重新嵌入(Trajectory-Shared Node Re-embedding, TSNR)模块指导节点更新,该模块聚合来自所有轨迹上下文的共享节点特征,并利用这些特征更新下一步的输入。通过建模RL代理的演变偏好,CCL捕捉到序列决策中的逐步依赖关系。我们在48个不同的VRP变体上评估了CCL,包括16个在分布内和32个在分布外(具有未见约束)任务。实验结果表明,CCL在与最先进基准的比较中表现良好,在所有在分布内任务和大多数在分布外任务中实现了最佳性能。
cs.AI / 86 / 2603.01712
FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents
FT-Dojo:面向自主大型语言模型微调的语言代理
Abstract
Fine-tuning large language models for vertical domains remains a labor-intensive and expensive process, requiring domain experts to curate data, configure training, and iteratively diagnose model behavior. Despite growing interest in autonomous machine learning, no prior work has tackled end-to-end LLM fine-tuning with agents. Can LLM-based agents automate this complete process? We frame this as a substantially open problem: agents must navigate an open-ended search space spanning data curation from diverse data sources, processing with complex tools, building a training pipeline, and iteratively refining their approach based on evaluation outcomes in rapidly growing logs--an overall scenario far more intricate than existing benchmarks. To study this question, we introduce FT-Dojo, an interactive environment comprising 13 tasks across 5 domains. We further develop FT-Agent, an autonomous system that mirrors human experts by leveraging evaluation-driven feedback to iteratively diagnose failures and refine fine-tuning strategies. Experiments on FT-Dojo demonstrate that purpose-built fine-tuning agents significantly outperform general-purpose alternatives, with FT-Agent achieving the best performance on 10 out of 13 tasks across all five domains. Ablations show that the approach generalizes effectively to 3B models, with additional insights on data scaling trade-offs and backbone sensitivity. Case analyses reveal that agents can recover from failures through cumulative learning from historical experience, while also exposing fundamental limitations in causal reasoning--highlighting both the promise and current boundaries of autonomous LLM fine-tuning.
Chinese Translation
针对垂直领域的大型语言模型微调仍然是一个劳动密集型和高成本的过程,要求领域专家进行数据整理、配置训练,并迭代诊断模型行为。尽管对自主机器学习的兴趣日益增长,但之前的研究尚未解决使用代理进行端到端大型语言模型微调的问题。基于大型语言模型的代理能否自动化整个过程?我们将此视为一个实质性的开放问题:代理必须在一个开放的搜索空间中导航,该空间涵盖来自多种数据源的数据整理、使用复杂工具进行处理、构建训练管道,并基于快速增长的日志中的评估结果迭代优化其方法——这一整体场景远比现有基准更为复杂。为研究这一问题,我们引入了FT-Dojo,这是一个包含5个领域中13个任务的互动环境。我们进一步开发了FT-Agent,这是一个自主系统,通过利用基于评估的反馈,迭代诊断失败并优化微调策略,从而模拟人类专家。在FT-Dojo上的实验表明,专门构建的微调代理显著优于通用替代方案,FT-Agent在所有五个领域的13个任务中实现了10个任务的最佳表现。消融实验表明,该方法在3B模型上有效泛化,并提供了关于数据扩展权衡和基础模型敏感性的额外见解。案例分析揭示,代理能够通过对历史经验的累积学习从失败中恢复,同时也暴露了因果推理的基本局限性——突显了自主大型语言模型微调的潜力和当前边界。
cs.AI / 87 / 2603.01724
GMP: A Benchmark for Content Moderation under Co-occurring Violations and Dynamic Rules
GMP:共存违规和动态规则下的内容审核基准
Abstract
Online content moderation is essential for maintaining a healthy digital environment, and reliance on AI for this task continues to grow. Consider a user comment using national stereotypes to insult a politician. This example illustrates two critical challenges in real-world scenarios: (1) Co-occurring Violations, where a single post violates multiple policies (e.g., prejudice and personal attacks); (2) Dynamic rules of moderation, where determination of a violation depends on platform-specific guidelines that evolve across contexts. The intersection of co-occurring harms and dynamically changing rules highlights a core limitation of current AI systems: although large language models (LLMs) are adept at following fixed guidelines, their judgment capabilities degrade when policies are unstable or context-dependent. In practice, such shortcomings lead to inconsistent moderation: either erroneously restricting legitimate expression or allowing harmful content to remain online. This raises a critical question for evaluation: Does high performance on existing static benchmarks truly guarantee robust generalization of AI judgment to real-world scenarios involving co-occurring violations and dynamically changing rules?
Chinese Translation
在线内容审核对于维护健康的数字环境至关重要,且对人工智能在此任务中的依赖持续增长。考虑一个用户评论,利用国家刻板印象来侮辱一位政治家。这个例子突显了现实场景中的两个关键挑战:(1)共存违规,即单个帖子违反多个政策(例如,偏见和人身攻击);(2)动态审核规则,即违规的判断依赖于随上下文变化而演变的平台特定指南。共存危害与动态变化规则的交集突显了当前人工智能系统的一个核心局限性:尽管大型语言模型(LLMs)擅长遵循固定指南,但当政策不稳定或依赖于上下文时,其判断能力会下降。在实践中,这种缺陷导致审核不一致:要么错误地限制合法表达,要么允许有害内容继续在线。这引发了一个关键的评估问题:在现有静态基准上表现良好是否真正保证人工智能判断在涉及共存违规和动态变化规则的现实场景中具有稳健的泛化能力?
cs.AI / 88 / 2603.01783
GAM-RAG: Gain-Adaptive Memory for Evolving Retrieval in Retrieval-Augmented Generation
GAM-RAG:用于检索增强生成的增益自适应记忆
Abstract
Retrieval-Augmented Generation (RAG) grounds large language models with external evidence, but many implementations rely on pre-built indices that remain static after construction. Related queries therefore repeat similar multi-hop traversal, increasing latency and compute. Motivated by schema-based learning in cognitive neuroscience, we propose GAM-RAG, a training-free framework that accumulates retrieval experience from recurring or related queries and updates retrieval memory over time. GAM-RAG builds a lightweight, relation-free hierarchical index whose links capture potential co-occurrence rather than fixed semantic relations. During inference, successful retrieval episodes provide sentence-level feedback, updating sentence memories so evidence useful for similar reasoning types becomes easier to activate later. To balance stability and adaptability under noisy feedback, we introduce an uncertainty-aware, Kalman-inspired gain rule that jointly updates memory states and perplexity-based uncertainty estimates. It applies fast updates for reliable novel signals and conservative refinement for stable or noisy memories. We provide a theoretical analysis of the update dynamics, and empirically show that GAM-RAG improves average performance by 3.95% over the strongest baseline and by 8.19% with 5-turn memory, while reducing inference cost by 61%. Our code and datasets are available at: https://anonymous.4open.science/r/GAM_RAG-2EF6.
Chinese Translation
检索增强生成(RAG)通过外部证据为大型语言模型提供基础,但许多实现依赖于构建后保持静态的预构建索引。因此,相关查询重复类似的多跳遍历,增加了延迟和计算负担。受到认知神经科学中基于图式(schema)学习的启发,我们提出了GAM-RAG,这是一种无训练的框架,能够从重复或相关查询中积累检索经验,并随着时间的推移更新检索记忆。GAM-RAG构建了一个轻量级的、无关系的层次索引,其链接捕捉潜在的共现,而不是固定的语义关系。在推理过程中,成功的检索事件提供句子级反馈,更新句子记忆,使得对类似推理类型有用的证据更容易在后续激活。为了在噪声反馈下平衡稳定性和适应性,我们引入了一种基于不确定性的、受卡尔曼启发的增益规则,该规则共同更新记忆状态和基于困惑度的不确定性估计。它对可靠的新信号应用快速更新,对稳定或噪声记忆进行保守的细化。我们提供了更新动态的理论分析,并实证表明,GAM-RAG在最强基线之上平均提高了3.95%的性能,并在5轮记忆下提高了8.19%,同时将推理成本降低了61%。我们的代码和数据集可在以下链接获取:https://anonymous.4open.science/r/GAM_RAG-2EF6。
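Read literally, the Kalman-inspired gain rule above reduces to a scalar filter update. A minimal sketch, assuming a scalar memory state and a variance-style uncertainty estimate; the function name and the noise model are illustrative, not the paper's implementation:

```python
def gain_update(memory, uncertainty, observation, noise=1.0):
    """One Kalman-style update of a scalar memory state.

    memory      -- current memory strength estimate
    uncertainty -- current uncertainty (variance) of that estimate
    observation -- feedback signal from a successful retrieval episode
    noise       -- assumed variance of the feedback signal

    High uncertainty with a reliable signal gives a large gain (fast
    update); low uncertainty or a noisy signal gives a small gain
    (conservative refinement), matching the behavior described above.
    """
    gain = uncertainty / (uncertainty + noise)
    new_memory = memory + gain * (observation - memory)
    new_uncertainty = (1.0 - gain) * uncertainty
    return new_memory, new_uncertainty
```

Repeated consistent feedback shrinks the uncertainty, so later updates become increasingly conservative on their own.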
cs.AI / 89 / 2603.01799
Incremental, inconsistency-resilient reasoning over Description Logic ABox streams
增量、不一致性鲁棒的描述逻辑 ABox 流推理
Abstract
More and more, data is being produced in a streaming fashion. This has led to increased interest in how actionable insights can be extracted in real time from data streams through Stream Reasoning. Reasoning over data streams raises multiple challenges, notably the high velocity of data, the real time requirement of the reasoning, and the noisy and volatile nature of streams. This paper proposes novel semantics for incremental reasoning over streams of Description Logic ABoxes, in order to tackle these challenges. To address the first two challenges, our semantics for reasoning over sliding windows on streams allow for incrementally computing the materialization of the window based on the materialization of the previous window. Furthermore, to deal with the volatile nature of streams, we present novel semantics for inconsistency repair on such windows, based on preferred repair semantics. We then detail our proposed semi-naive algorithms for incremental materialization maintenance in the case of OWL2 RL, both in the presence of inconsistencies and without.
Chinese Translation
数据越来越多地以流的形式产生。这引发了人们对如何通过流推理实时提取可操作洞察的兴趣。对数据流进行推理面临多重挑战,特别是数据的高速度、推理的实时性要求,以及流的噪声和波动特性。本文提出了一种新颖的语义,用于在描述逻辑 ABox 流上进行增量推理,以应对这些挑战。为了应对前两个挑战,我们的流上滑动窗口推理语义允许基于前一个窗口的物化结果增量计算当前窗口的物化。此外,为了处理流的波动特性,我们提出了一种基于优选修复语义的窗口不一致性修复的新语义。随后,我们详细介绍了在存在不一致性和不存在不一致性情况下,针对 OWL2 RL 的增量物化维护的半朴素(semi-naive)算法。
cs.AI / 90 / 2603.01801
What Papers Don't Tell You: Recovering Tacit Knowledge for Automated Paper Reproduction
论文未告诉你的:恢复隐性知识以实现自动化论文再现
Abstract
Automated paper reproduction -- generating executable code from academic papers -- is bottlenecked not by information retrieval but by the tacit knowledge that papers inevitably leave implicit. We formalize this challenge as the progressive recovery of three types of tacit knowledge -- relational, somatic, and collective -- and propose \method, a graph-based agent framework with a dedicated mechanism for each: node-level relation-aware aggregation recovers relational knowledge by analyzing implementation-unit-level reuse and adaptation relationships between the target paper and its citation neighbors; execution-feedback refinement recovers somatic knowledge through iterative debugging driven by runtime signals; and graph-level knowledge induction distills collective knowledge from clusters of papers sharing similar implementations. On an extended ReproduceBench spanning 3 domains, 10 tasks, and 40 recent papers, \method{} achieves an average performance gap of 10.04\% against official implementations, improving over the strongest baseline by 24.68\%. The code will be publicly released upon acceptance; the repository link will be provided in the final version.
Chinese Translation
自动化论文再现——从学术论文生成可执行代码——的瓶颈并不在于信息检索,而在于论文不可避免地留下的隐性知识。我们将这一挑战形式化为逐步恢复三种类型的隐性知识——关系性知识、身体性知识和集体知识,并提出了\method,一种基于图的代理框架,为每种知识提供了专门的机制:节点级关系感知聚合通过分析目标论文与其引用邻居之间的实现单元级重用和适应关系来恢复关系性知识;执行反馈精炼通过基于运行时信号的迭代调试来恢复身体性知识;图级知识归纳从共享相似实现的论文群集提炼集体知识。在涵盖3个领域、10个任务和40篇近期论文的扩展ReproduceBench上,\method相较于官方实现取得了平均10.04%的性能差距,较最强基线提高了24.68%。代码将在接受后公开发布;仓库链接将在最终版本中提供。
cs.AI / 91 / 2603.01822
Emerging Human-like Strategies for Semantic Memory Foraging in Large Language Models
大型语言模型中新兴的人类策略用于语义记忆觅食
Abstract
Both humans and Large Language Models (LLMs) store a vast repository of semantic memories. In humans, efficient and strategic access to this memory store is a critical foundation for a variety of cognitive functions. Such access has long been a focus of psychology and the computational mechanisms behind it are now well characterized. Much of this understanding has been gleaned from a widely-used neuropsychological and cognitive science assessment called the Semantic Fluency Task (SFT), which requires the generation of as many semantically constrained concepts as possible. Our goal is to apply mechanistic interpretability techniques to bring greater rigor to the study of semantic memory foraging in LLMs. To this end, we present preliminary results examining SFT as a case study. A central focus is on convergent and divergent patterns of generative memory search, which in humans play complementary strategic roles in efficient memory foraging. We show that these same behavioral signatures, critical to human performance on the SFT, also emerge as identifiable patterns in LLMs across distinct layers. Potentially, this analysis provides new insights into how LLMs may be adapted into closer cognitive alignment with humans, or alternatively, guided toward productive cognitive \emph{disalignment} to enhance complementary strengths in human-AI interaction.
Chinese Translation
人类和大型语言模型(LLMs)都存储着大量的语义记忆。在人类中,高效和有策略地访问这一记忆库是多种认知功能的关键基础。这种访问长期以来一直是心理学的研究重点,其背后的计算机制现在已被很好地表征。我们对这一理解的获得主要来源于一种广泛使用的神经心理学和认知科学评估工具——语义流畅性任务(Semantic Fluency Task, SFT),该任务要求生成尽可能多的语义约束概念。我们的目标是应用机制可解释性技术,以提高对LLMs中语义记忆觅食研究的严谨性。为此,我们展示了将SFT作为案例研究的初步结果。研究的核心集中在生成记忆搜索的趋同和发散模式上,这些模式在人类中在高效的记忆觅食中发挥互补的战略作用。我们表明,这些与人类在SFT表现相关的行为特征,也在LLMs的不同层次中作为可识别的模式出现。此分析可能为如何将LLMs调整为更接近人类的认知对齐提供新的见解,或者相反,引导其朝向富有成效的认知“错位”,以增强人机交互中的互补优势。
cs.AI / 92 / 2603.01940
CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification
CoVe:通过约束引导验证训练交互式工具使用代理
Abstract
Developing multi-turn interactive tool-use agents is challenging because real-world user needs are often complex and ambiguous, yet agents must execute deterministic actions to satisfy them. To address this gap, we introduce \textbf{CoVe} (\textbf{Co}nstraint-\textbf{Ve}rification), a post-training data synthesis framework designed for training interactive tool-use agents while ensuring both data complexity and correctness. CoVe begins by defining explicit task constraints, which serve a dual role: they guide the generation of complex trajectories and act as deterministic verifiers for assessing trajectory quality. This enables the creation of high-quality training trajectories for supervised fine-tuning (SFT) and the derivation of accurate reward signals for reinforcement learning (RL). Our evaluation on the challenging $\tau^2$-bench benchmark demonstrates the effectiveness of the framework. Notably, our compact \textbf{CoVe-4B} model achieves success rates of 43.0\% and 59.4\% in the Airline and Retail domains, respectively; its overall performance significantly outperforms strong baselines of similar scale and remains competitive with models up to $17\times$ its size. These results indicate that CoVe provides an effective and efficient pathway for synthesizing training data for state-of-the-art interactive tool-use agents. To support future research, we open-source our code, trained model, and the full set of 12K high-quality trajectories used for training.
Chinese Translation
开发多轮交互式工具使用代理具有挑战性,因为现实世界用户需求往往复杂且模糊,而代理必须执行确定性动作以满足这些需求。为了解决这一问题,我们引入了CoVe(Constraint-Verification),这是一个后训练数据合成框架,旨在训练交互式工具使用代理,同时确保数据的复杂性和正确性。CoVe首先定义明确的任务约束,这些约束起到双重作用:它们引导复杂轨迹的生成,并作为确定性验证器来评估轨迹质量。这使得能够为监督微调(SFT)创建高质量的训练轨迹,并为强化学习(RL)推导准确的奖励信号。我们在具有挑战性的$\tau^2$-bench基准上的评估展示了该框架的有效性。值得注意的是,我们的紧凑型CoVe-4B模型在航空和零售领域分别达到了43.0%和59.4%的成功率;其整体表现显著优于相似规模的强基线,并与高达其规模17倍的模型保持竞争力。这些结果表明,CoVe为合成用于最先进的交互式工具使用代理的训练数据提供了一条有效且高效的途径。为了支持未来的研究,我们开源了我们的代码、训练模型以及用于训练的完整12K高质量轨迹集。
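The dual role of explicit constraints described above (guiding trajectory generation and deterministically verifying it) can be sketched as predicates over a tool-call trajectory. A hypothetical illustration; the tool names and the two airline-style constraints are invented for this sketch, not taken from the paper:

```python
def lookup_before_refund(traj):
    """A refund call must be preceded by a booking lookup."""
    seen_lookup = False
    for tool, _args in traj:
        if tool == "lookup":
            seen_lookup = True
        if tool == "refund" and not seen_lookup:
            return False
    return True

def refund_at_most_once(traj):
    return sum(1 for tool, _args in traj if tool == "refund") <= 1

# Named constraints double as verifiers for SFT-data filtering ...
CONSTRAINTS = [("lookup_before_refund", lookup_before_refund),
               ("refund_at_most_once", refund_at_most_once)]

def verify(traj):
    """Return (ok, names of violated constraints)."""
    violated = [name for name, pred in CONSTRAINTS if not pred(traj)]
    return not violated, violated

# ... and as a deterministic reward signal for RL.
def reward(traj):
    return 1.0 if verify(traj)[0] else 0.0
```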
cs.AI / 93 / 2603.01952
LiveCultureBench: a Multi-Agent, Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations
LiveCultureBench:一个用于动态社会模拟中大型语言模型的多代理、多文化基准
Abstract
Large language models (LLMs) are increasingly deployed as autonomous agents, yet evaluations focus primarily on task success rather than cultural appropriateness or evaluator reliability. We introduce LiveCultureBench, a multi-cultural, dynamic benchmark that embeds LLMs as agents in a simulated town and evaluates them on both task completion and adherence to socio-cultural norms. The simulation models a small city as a location graph with synthetic residents having diverse demographic and cultural profiles. Each episode assigns one resident a daily goal while others provide social context. An LLM-based verifier generates structured judgments on norm violations and task progress, which we aggregate into metrics capturing task-norm trade-offs and verifier uncertainty. Using LiveCultureBench across models and cultural profiles, we study (i) cross-cultural robustness of LLM agents, (ii) how they balance effectiveness against norm sensitivity, and (iii) when LLM-as-a-judge evaluation is reliable for automated benchmarking versus when human oversight is needed.
Chinese Translation
大型语言模型(LLMs)越来越多地作为自主代理被部署,但评估主要集中在任务成功上,而非文化适宜性或评估者的可靠性。我们引入了LiveCultureBench,这是一个多文化的动态基准,将LLMs作为代理嵌入模拟城镇,并在任务完成和遵循社会文化规范方面进行评估。该模拟将一个小城市建模为一个位置图,合成居民具有多样的人口和文化特征。每个情节为一个居民分配一个日常目标,而其他居民提供社会背景。基于LLM的验证者生成关于规范违反和任务进展的结构化判断,我们将其汇总为捕捉任务与规范权衡及验证者不确定性的指标。通过在不同模型和文化特征中使用LiveCultureBench,我们研究了(i)LLM代理的跨文化鲁棒性,(ii)它们如何在有效性与规范敏感性之间取得平衡,以及(iii)何时LLM作为评判者的评估在自动基准测试中是可靠的,何时需要人类监督。
cs.AI / 94 / 2603.01990
According to Me: Long-Term Personalized Referential Memory QA
根据我:长期个性化参照记忆问答
Abstract
Personalized AI assistants must recall and reason over long-term user memory, which naturally spans multiple modalities and sources such as images, videos, and emails. However, existing Long-term Memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience. We introduce ATM-Bench, the first benchmark for multimodal, multi-source personalized referential Memory QA. ATM-Bench contains approximately four years of privacy-preserving personal memory data and human-annotated question-answer pairs with ground-truth memory evidence, including queries that require resolving personal references, multi-evidence reasoning across multiple sources, and handling conflicting evidence. We propose Schema-Guided Memory (SGM) to structurally represent memory items originating from different sources. In experiments, we implement 5 state-of-the-art memory systems along with a standard RAG baseline and evaluate variants with different memory ingestion, retrieval, and answer generation techniques. We find poor performance (under 20\% accuracy) on the ATM-Bench-Hard set, and that SGM improves performance over Descriptive Memory commonly adopted in prior works. Code available at: https://github.com/JingbiaoMei/ATM-Bench
Chinese Translation
个性化人工智能助手必须回忆和推理长期用户记忆,这自然跨越多个模态和来源,如图像、视频和电子邮件。然而,现有的长期记忆基准主要集中在对话历史上,未能捕捉到基于生活经验的现实个性化参照。我们引入了ATM-Bench,这是第一个针对多模态、多来源个性化参照记忆问答的基准。ATM-Bench包含大约四年的隐私保护个人记忆数据和带有真实记忆证据的人类标注问答对,包括需要解析个人参照的查询、多来源的多证据推理以及处理冲突证据的能力。我们提出了结构化引导记忆(Schema-Guided Memory,SGM),以结构性地表示源自不同来源的记忆项。在实验中,我们实现了5个最先进的记忆系统,以及一个标准的RAG基线,并评估了不同记忆摄取、检索和答案生成技术的变体。我们发现,在ATM-Bench-Hard集上的表现较差(准确率低于20%),并且SGM在性能上优于先前工作中常用的描述性记忆。代码可在:https://github.com/JingbiaoMei/ATM-Bench
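The schema-guided memory idea above (normalizing items from heterogeneous sources into one structured record so retrieval can filter on shared fields) might look roughly like this; the field names and the email mapping are hypothetical, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    source: str          # "email", "image", "video", ...
    timestamp: str       # when the underlying event happened
    entities: list       # people, places, or objects involved
    summary: str         # short natural-language description
    raw_ref: str = ""    # pointer back to the original artifact

def from_email(msg: dict) -> MemoryItem:
    """Normalize one email into the shared schema (illustrative mapping)."""
    return MemoryItem(source="email",
                      timestamp=msg["date"],
                      entities=[msg["from"]],
                      summary=msg["subject"],
                      raw_ref=msg["id"])
```

A photo or calendar ingester would fill the same fields from different raw inputs, which is what makes cross-source retrieval and conflict checks uniform.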
cs.AI / 95 / 2603.02029
Rich Insights from Cheap Signals: Efficient Evaluations via Tensor Factorization
从廉价信号中获取丰富洞察:通过张量分解实现高效评估
Abstract
Moving beyond evaluations that collapse performance across heterogeneous prompts toward fine-grained evaluation at the prompt level, or within relatively homogeneous subsets, is necessary to diagnose generative models' strengths and weaknesses. Such fine-grained evaluations, however, suffer from a data bottleneck: human gold-standard labels are too costly at this scale, while automated ratings are often misaligned with human judgment. To resolve this challenge, we propose a novel statistical model based on tensor factorization that merges cheap autorater data with a limited set of human gold-standard labels. Specifically, our approach uses autorater scores to pretrain latent representations of prompts and generative models, and then aligns those pretrained representations to human preferences using a small calibration set. This sample-efficient methodology is robust to autorater quality, more accurately predicts human preferences on a per-prompt basis than standard baselines, and provides tight confidence intervals for key statistical parameters of interest. We also showcase the practical utility of our method by constructing granular leaderboards based on prompt qualities and by estimating model performance solely from autorater scores, eliminating the need for additional human annotations.
Chinese Translation
为了诊断生成模型的优缺点,有必要超越将异质提示的性能汇总为单一评估,而进行细粒度的提示级评估或在相对同质的子集中进行评估。然而,这种细粒度评估面临数据瓶颈:在大规模下,人类黄金标准标签的成本过高,而自动评分往往与人类判断不一致。为了解决这一挑战,我们提出了一种基于张量分解的新型统计模型,该模型将廉价的自动评分数据与有限的人类黄金标准标签相结合。具体而言,我们的方法利用自动评分对提示和生成模型的潜在表示进行预训练,然后使用小型校准集将这些预训练表示与人类偏好对齐。这种样本高效的方法对自动评分质量具有鲁棒性,比标准基线更准确地预测每个提示的人类偏好,并为关键统计参数提供紧密的置信区间。我们还通过基于提示质量构建细粒度排行榜,以及仅从自动评分中估计模型性能,展示了我们方法的实际效用,从而消除了对额外人类注释的需求。
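A deliberately simplified stand-in for the align-to-human step described above: fit an affine map from cheap autorater scores to human gold labels on a small calibration set, then reuse it on unlabeled prompts. The paper factorizes a prompt-by-model score tensor; this sketch, with made-up numbers, only illustrates the calibration idea:

```python
def fit_affine(auto, human):
    """Least-squares fit of human ~= a * auto + b on the calibration set."""
    n = len(auto)
    mx, my = sum(auto) / n, sum(human) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(auto, human))
    var = sum((x - mx) ** 2 for x in auto)
    a = cov / var
    return a, my - a * mx

# Tiny illustrative calibration set: 3 human-labeled prompts.
a, b = fit_affine([1.0, 2.0, 3.0], [2.1, 4.0, 6.1])

def predict(autorater_score):
    """Human-aligned estimate for a prompt with only an autorater score."""
    return a * autorater_score + b
```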
cs.AI / 96 / 2603.02062
OpenRad: a Curated Repository of Open-access AI models for Radiology
OpenRad:一个开放获取的放射学人工智能模型的精选库
Abstract
The rapid developments in artificial intelligence (AI) research in radiology have produced numerous models that are scattered across various platforms and sources, limiting discoverability, reproducibility and clinical translation. Herein, OpenRad was created: a curated, standardized, open-access repository that aggregates radiology AI models and provides details such as the availability of pretrained weights and interactive applications. Retrospective analysis of peer-reviewed literature and preprints indexed in PubMed, arXiv and Scopus was performed until Dec 2025 (n = 5239 records). Model records were generated using a locally hosted LLM (gpt-oss:120b), based on the RSNA AI Roadmap JSON schema, and manually verified by ten expert reviewers. Stability of LLM outputs was assessed on 225 randomly selected papers using text similarity metrics. A total of 1694 articles were included after review. Included models span all imaging modalities (CT, MRI, X-ray, US) and radiology subspecialties. Automated extraction demonstrated high stability for structured fields (Levenshtein ratio > 90%), with 78.5% of record edits being characterized as minor during expert review. Statistical analysis of the repository revealed CNN and transformer architectures as dominant, while MRI was the most commonly used modality (in 621 neuroradiology AI models). Research output was mostly concentrated in China and the United States. The OpenRad web interface enables model discovery via keyword search and filters for modality, subspecialty, intended use, verification status and demo availability, alongside live statistics. The community can contribute new models through a dedicated portal. OpenRad contains approx. 1700 open-access, curated radiology AI models with standardized metadata, supplemented with analysis of code repositories, thereby creating a comprehensive, searchable resource for the radiology community.
Chinese Translation
放射学领域人工智能(AI)研究的快速发展产生了众多模型,这些模型分散在各种平台和来源中,限制了其可发现性、可重复性和临床转化。在此,我们创建了OpenRad,一个经过策划、标准化的开放获取库,聚合了放射学AI模型,并提供了预训练权重和交互式应用程序等详细信息。对截至2025年12月在PubMed、arXiv和Scopus中索引的同行评审文献和预印本进行了回顾性分析(n = 5239条记录)。模型记录是基于RSNA AI路线图JSON模式,使用本地托管的LLM(gpt-oss:120b)生成,并由十位专家审阅者手动验证。使用文本相似性指标评估了LLM输出的稳定性,随机选择了225篇论文进行评估。经过审查,共纳入1694篇文章。纳入的模型涵盖了所有影像学模式(CT、MRI、X光、超声)和放射学亚专业。自动提取显示结构化字段的高稳定性(Levenshtein比率>90%),在专家审查中,78.5%的记录编辑被认为是小幅修改。对库的统计分析显示,CNN和变换器架构占主导地位,而MRI是使用最广泛的模式(在621个神经放射学AI模型中)。研究产出主要集中在中国和美国。OpenRad网页界面通过关键词搜索和按模式、亚专业、预期用途、验证状态和演示可用性等过滤器实现模型发现,并提供实时统计数据。社区可以通过专门的门户贡献新模型。OpenRad包含约1700个开放获取的、经过策划的放射学AI模型,具有标准化的元数据,并补充了代码库的分析,从而为放射学社区创建了一个全面的、可搜索的资源。
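The stability check reported above (structured fields from repeated LLM extraction runs compared with a similarity ratio above 90%) can be approximated with the standard library; `difflib.SequenceMatcher` is a stdlib stand-in for the Levenshtein ratio used in the paper, and the example strings are invented:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]: 2 * matched_chars / (len(a) + len(b))."""
    return SequenceMatcher(None, a, b).ratio()

# Two extraction runs of the same field, differing only in spelling:
run1 = "3D U-Net for brain tumor segmentation on multimodal MRI"
run2 = "3D U-Net for brain tumour segmentation on multi-modal MRI"
stable = similarity(run1, run2) > 0.90   # field counts as stable
```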
cs.AI / 97 / 2603.02070
Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning
通过对话探索计划空间:一种用于规划中大语言模型(LLM)介导解释的代理框架
Abstract
When automating plan generation for a real-world sequential decision problem, the goal is often not to replace the human planner, but to facilitate an iterative reasoning and elicitation process, where the human's role is to guide the AI planner according to their preferences and expertise. In this context, explanations that respond to users' questions are crucial to improve their understanding of potential solutions and increase their trust in the system. To enable natural interaction with such a system, we present a multi-agent Large Language Model (LLM) architecture that is agnostic to the explanation framework and enables user- and context-dependent interactive explanations. We also describe an instantiation of this framework for goal-conflict explanations, which we use to conduct a user study comparing the LLM-powered interaction with a baseline template-based explanation interface.
Chinese Translation
在为现实世界的顺序决策问题自动生成计划时,目标往往不是替代人类规划者,而是促进一个迭代推理和引导的过程,其中人类的角色是根据他们的偏好和专业知识来指导人工智能规划者。在这个背景下,能够回应用户问题的解释对于提高用户对潜在解决方案的理解和增强他们对系统的信任至关重要。为了实现与这样一个系统的自然交互,我们提出了一种多智能体大语言模型(LLM)架构,该架构对解释框架保持中立,并能够提供用户和上下文依赖的互动解释。我们还描述了这一框架在目标冲突解释中的一个实例,并利用该实例进行了一项用户研究,比较了基于LLM的交互与基线模板化解释界面之间的差异。
cs.AI / 98 / 2603.02119
Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning
铅笔谜题基准:多步骤可验证推理的基准测试
Abstract
We introduce Pencil Puzzle Bench, a framework for evaluating large language model reasoning through pencil puzzles, a family of constraint-satisfaction problems closely related to NP-complete problems, with deterministic, step-level verification. From a database of 62,231 puzzles across 94 varieties with verified unique solutions, we select a benchmark of 300 puzzles spanning 20 varieties and evaluate 51 models from 11 providers in two modes: direct ask (single-shot) and agentic (multi-turn with iterative verification). A key differentiator of our benchmark is that every intermediate board state can be checked against variety-specific constraints, localizing errors to the exact rule violated, providing the infrastructure for dense, per-move reward signals for process supervision and reinforcement learning. Our evaluation reveals two distinct axes of capability: (1) reasoning effort scaling, where GPT-5.2 improves 81x from no reasoning to maximum effort; and (2) agentic iteration, where Claude Opus 4.6 rises from 0.3% to 30.0% through iterative checking, while GPT-5.2@xhigh improves from 20.2% to 56.0%. Agentic attempts span a median of 29 turns over 17 minutes, with the longest exceeding 1,221 turns and 14.3 hours - a demanding test of long-context utilization, not just reasoning.
Chinese Translation
我们介绍了铅笔谜题基准(Pencil Puzzle Bench),这是一个通过铅笔谜题评估大型语言模型推理的框架。铅笔谜题是一类与 NP 完全问题密切相关的约束满足问题,具有确定性的逐步验证特性。在一个包含 62,231 个谜题的数据库中,这些谜题分为 94 种类型,并且具有经过验证的唯一解,我们选择了 300 个谜题作为基准,涵盖 20 种类型,并在两种模式下评估了 11 个提供者的 51 个模型:直接询问(单次)和代理性(多轮带迭代验证)。我们基准的一个关键特征是每个中间棋盘状态都可以针对特定类型的约束进行检查,从而将错误定位到被违反的确切规则,为过程监督和强化学习提供了密集的逐步奖励信号基础。我们的评估揭示了两种不同的能力轴:(1)推理努力的规模化,其中 GPT-5.2 从无推理到最大努力提升了 81 倍;(2)代理性迭代,其中 Claude Opus 4.6 通过迭代检查从 0.3% 提升至 30.0%,而 GPT-5.2@xhigh 从 20.2% 提升至 56.0%。代理性尝试的中位数为 29 回合,耗时 17 分钟,最长的尝试超过 1,221 回合和 14.3 小时——这是对长上下文利用的严格测试,而不仅仅是推理。
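Step-level verification as described above (every intermediate board state checked against variety-specific constraints, with errors localized to the exact rule violated) can be sketched for a toy 4x4 Latin-square-style variety; the rule names and the variety are illustrative, not one of the benchmark's 94:

```python
def check_state(board):
    """Check a partial board (0 = empty); return the violated rule or None.

    A None result at every step is exactly the dense, per-move signal
    the benchmark exposes for process supervision.
    """
    n = len(board)
    for r, row in enumerate(board):
        vals = [v for v in row if v != 0]
        if len(vals) != len(set(vals)):
            return f"row_unique violated in row {r}"
    for c in range(n):
        vals = [board[r][c] for r in range(n) if board[r][c] != 0]
        if len(vals) != len(set(vals)):
            return f"col_unique violated in column {c}"
    return None

board = [[1, 2, 0, 0],
         [2, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 2, 0, 0]]     # duplicate 2s in column 1
violation = check_state(board)
```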
cs.AI / 99 / 2603.02123
Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy
Nano-EmoX:从感知到共情的多模态情感智能统一框架
Abstract
The development of affective multimodal language models (MLMs) has long been constrained by a gap between low-level perception and high-level interaction, leading to fragmented affective capabilities and limited generalization. To bridge this gap, we propose a cognitively inspired three-level hierarchy that organizes affective tasks according to their cognitive depth-perception, understanding, and interaction-and provides a unified conceptual foundation for advancing affective modeling. Guided by this hierarchy, we introduce Nano-EmoX, a small-scale multitask MLM, and P2E (Perception-to-Empathy), a curriculum-based training framework. Nano-EmoX integrates a suite of omni-modal encoders, including an enhanced facial encoder and a fusion encoder, to capture key multimodal affective cues and improve cross-task transferability. The outputs are projected into a unified language space via heterogeneous adapters, empowering a lightweight language model to tackle diverse affective tasks. Concurrently, P2E progressively cultivates emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy. To the best of our knowledge, Nano-EmoX is the first compact MLM (2.2B) to unify six core affective tasks across all three hierarchy levels, achieving state-of-the-art or highly competitive performance across multiple benchmarks, demonstrating excellent efficiency and generalization.
Chinese Translation
情感多模态语言模型(MLMs)的发展长期受到低层次感知与高层次互动之间差距的限制,导致情感能力的碎片化和有限的泛化能力。为了解决这一问题,我们提出了一种受认知启发的三层次层级结构,根据认知深度(感知、理解和互动)组织情感任务,并为推进情感建模提供统一的概念基础。在这一层级结构的指导下,我们引入了Nano-EmoX,一个小规模的多任务MLM,以及P2E(Perception-to-Empathy),一个基于课程的训练框架。Nano-EmoX集成了一套全模态编码器,包括增强的面部编码器和融合编码器,以捕捉关键的多模态情感线索并提高跨任务的可迁移性。输出通过异构适配器投影到统一的语言空间,使轻量级语言模型能够应对多样的情感任务。同时,P2E通过将快速感知与基于思维链的共情对齐,逐步培养情感智能。根据我们所知,Nano-EmoX是首个在所有三个层级统一六个核心情感任务的紧凑型MLM(2.2B),在多个基准测试中实现了最先进或高度竞争的性能,展现出卓越的效率和泛化能力。
cs.AI / 100 / 2603.02196
Conformal Policy Control
符合性策略控制
Abstract
An agent must try new behaviors to explore and improve. In high-stakes environments, an agent that violates safety constraints may cause harm and must be taken offline, curtailing any future interaction. Imitating old behavior is safe, but excessive conservatism discourages exploration. How much behavior change is too much? We show how to use any safe reference policy as a probabilistic regulator for any optimized but untested policy. Conformal calibration on data from the safe policy determines how aggressively the new policy can act, while provably enforcing the user's declared risk tolerance. Unlike conservative optimization methods, we do not assume the user has identified the correct model class nor tuned any hyperparameters. Unlike previous conformal methods, our theory provides finite-sample guarantees even for non-monotonic bounded constraint functions. Our experiments on applications ranging from natural language question answering to biomolecular engineering show that safe exploration is not only possible from the first moment of deployment, but can also improve performance.
Chinese Translation
代理必须尝试新的行为以进行探索和改进。在高风险环境中,违反安全约束的代理可能会造成伤害,并且必须被下线,从而限制任何未来的交互。模仿旧行为是安全的,但过度保守会抑制探索。行为变化多少算是过多?我们展示了如何使用任何安全参考策略作为任何优化但未经测试的策略的概率调节器。对安全策略数据的符合性校准确定了新策略可以多么激进地行动,同时可证明地强制执行用户声明的风险容忍度。与保守优化方法不同,我们不假设用户已识别出正确的模型类别或调整了任何超参数。与之前的符合性方法不同,我们的理论即使对于非单调有界约束函数也提供有限样本保证。我们在从自然语言问答到生物分子工程的应用上的实验表明,安全探索不仅在部署的第一时刻是可能的,而且还可以提高性能。
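The conformal calibration step described above can be reduced to a quantile over constraint scores collected under the safe policy. A minimal sketch under simplifying assumptions (scalar bounded scores, invented numbers), not the paper's full method:

```python
import math

def conformal_threshold(safe_scores, alpha):
    """(1 - alpha) finite-sample bound: the ceil((n+1)(1-alpha))-th
    smallest safe-policy score (capped at the maximum for simplicity)."""
    n = len(safe_scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(safe_scores)[k - 1]

# Constraint scores observed from the safe reference policy:
scores = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
threshold = conformal_threshold(scores, alpha=0.2)

def allow(new_policy_score):
    """Gate: let the optimized policy act only below the calibrated bound."""
    return new_policy_score <= threshold
```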
cs.AI / 101 / 2603.02203
Tool Verification for Test-Time Reinforcement Learning
测试时强化学习的工具验证
Abstract
Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards through majority voting. However, a spurious yet high-frequency unverified consensus can become a biased and reinforced reward signal, leading to incorrect mode collapse. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses an external tool as evidence (e.g., from code execution) to upweight verified rollouts in a verification-aware voting, producing more reliable pseudo-labels for training. Across various math difficulties (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
Chinese Translation
测试时强化学习(TTRL)作为一种有前景的范式,已成为自我演化大型推理模型(LRMs)的有效方法,能够通过多数投票机制在未标记的测试输入上实现在线适应。然而,虚假的但高频的未验证共识可能成为偏见和强化的奖励信号,导致错误的模式崩溃。我们通过T^3RL(测试时强化学习的工具验证)来解决这一失败模式,该方法将测试时工具验证引入奖励估计中。具体而言,验证器使用外部工具作为证据(例如,来自代码执行的结果)来提高经过验证的回滚在验证感知投票中的权重,从而为训练生成更可靠的伪标签。在各种数学难度(MATH-500、AMC和AIME 2024)和不同的骨干类型中,T^3RL在性能上显著优于TTRL,在更难的问题上获得了更大的提升。更广泛地说,T^3RL可以被视为经过验证的在线数据合成,强调测试时工具验证作为稳定自我演化的关键机制。
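Verification-aware voting, as described above, upweights rollouts confirmed by an external tool before taking the majority. A hedged sketch; the additive bonus weight is an illustrative choice, not the paper's exact scheme:

```python
from collections import defaultdict

def verified_vote(rollouts, bonus=2.0):
    """rollouts: (final_answer, tool_verified) pairs -> pseudo-label.

    Every rollout contributes weight 1; rollouts whose answer is
    confirmed by tool evidence (e.g. code execution) get an extra bonus,
    so a verified minority can outvote an unverified spurious consensus.
    """
    weight = defaultdict(float)
    for answer, verified in rollouts:
        weight[answer] += 1.0 + (bonus if verified else 0.0)
    return max(weight, key=weight.get)

# Two unverified votes for "42" vs one tool-verified vote for "41":
pseudo_label = verified_vote([("42", False), ("42", False), ("41", True)])
```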
cs.CL / 1 / 2603.00021
From Global to Local: Learning Context-Aware Graph Representations for Document Classification and Summarization
从全球到地方:学习上下文感知的图表示用于文档分类和摘要
Abstract
This paper proposes a data-driven method to automatically construct graph-based document representations. Building upon the recent work of Bugueño and de Melo (2025), we leverage the dynamic sliding-window attention module to effectively capture local and mid-range semantic dependencies between sentences, as well as structural relations within documents. Graph Attention Networks (GATs) trained on our learned graphs achieve competitive results on document classification while requiring lower computational resources than previous approaches. We further present an exploratory evaluation of the proposed graph construction method for extractive document summarization, highlighting both its potential and current limitations. The implementation of this project can be found on GitHub.
Chinese Translation
本文提出了一种数据驱动的方法,以自动构建基于图的文档表示。基于Bugueño和de Melo(2025)的最新研究,我们利用动态滑动窗口注意力模块,有效捕捉句子之间的局部和中等范围的语义依赖关系,以及文档内部的结构关系。在我们学习的图上训练的图注意力网络(Graph Attention Networks, GATs)在文档分类任务中取得了具有竞争力的结果,同时所需的计算资源低于以往的方法。我们进一步对所提出的图构建方法在提取式文档摘要中的应用进行了探索性评估,突出了其潜力和当前的局限性。该项目的实现可以在GitHub上找到。
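A static sliding-window graph over sentences illustrates the kind of local and mid-range connectivity described above; the fixed window below is only a stand-in for the paper's dynamic, attention-learned span:

```python
def window_graph(num_sentences, window=2):
    """Undirected sentence-graph edges (i, j) with |i - j| <= window, i < j.

    A GAT trained on such a graph propagates information between nearby
    sentences; enlarging `window` trades locality for mid-range reach.
    """
    return [(i, j)
            for i in range(num_sentences)
            for j in range(i + 1, min(i + window + 1, num_sentences))]

edges = window_graph(5, window=2)
```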
cs.CL / 2 / 2603.00022
Noise reduction in BERT NER models for clinical entity extraction
临床实体提取中BERT命名实体识别模型的噪声减少
Abstract
Precision is of utmost importance in the realm of clinical entity extraction from clinical notes and reports. Encoder models fine-tuned for Named Entity Recognition (NER) are an efficient choice for this purpose, as they don't hallucinate. We pre-trained an in-house BERT over clinical data and then fine-tuned it for NER. These models performed well on recall but could not reach the high precision needed for clinical models. To address this challenge, we developed a Noise Removal model that refines the output of NER. The NER model assigns token-level entity tags along with probability scores for each token. Our Noise Removal (NR) model then analyzes these probability sequences and classifies predictions as either weak or strong. A naïve approach might involve filtering predictions based on low probability values; however, this method is unreliable. Owing to the characteristics of the SoftMax function, Transformer-based architectures often assign disproportionately high confidence scores even to uncertain or weak predictions, making simple thresholding ineffective. To address this issue, we adopted a supervised modeling strategy in which the NR model leverages advanced features such as the Probability Density Map (PDM). The PDM captures the Semantic-Pull effect observed within Transformer embeddings, an effect that manifests in the probability distributions of NER class predictions across token sequences. This approach enables the model to classify predictions as weak or strong with significantly improved accuracy. With these NR models we were able to reduce False Positives across various clinical NER models by 50\% to 90\%.
Chinese Translation
在临床笔记和报告中提取临床实体时,精确度至关重要。经过微调的编码器模型在命名实体识别(NER)方面是一个高效的选择,因为它们不会产生幻觉。我们在临床数据上预训练了一个内部的BERT模型,然后对其进行了NER的微调。这些模型在召回率方面表现良好,但无法达到临床模型所需的高精确度。为了解决这一挑战,我们开发了一种噪声去除模型,以优化NER的输出。NER模型为每个标记分配实体标签及其概率分数。我们的噪声去除(NR)模型随后分析这些概率序列,并将预测分类为弱或强。一个简单的方法可能是根据低概率值过滤预测;然而,这种方法并不可靠。由于SoftMax函数的特性,基于Transformer的架构往往即使对不确定或弱的预测也会分配不成比例的高置信度分数,使得简单的阈值处理无效。为了解决这个问题,我们采用了一种监督建模策略,其中NR模型利用了概率密度图(Probability Density Map, PDM)等高级特征。PDM捕捉到在Transformer嵌入中观察到的语义拉动效应,这种效应在NER类别预测的概率分布中表现出来,跨越标记序列。该方法使模型能够以显著提高的准确性将预测分类为弱或强。通过这些NR模型,我们能够将各种临床NER模型中的假阳性减少50%到90%。
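The contrast drawn above between naive probability thresholding and supervised classification over the token-level probability sequence can be sketched as follows; the features and numbers are illustrative placeholders, not the paper's PDM:

```python
def naive_filter(span_probs, threshold=0.5):
    """Keep the prediction if every token clears a fixed threshold."""
    return all(p >= threshold for p in span_probs)

def span_features(span_probs):
    """Features a supervised weak/strong classifier could consume."""
    return {
        "min": min(span_probs),
        "mean": sum(span_probs) / len(span_probs),
        "drop": max(span_probs) - min(span_probs),  # intra-span instability
    }

# SoftMax over-confidence: an unstable span still clears a naive
# threshold, but its instability shows up in the feature view.
weak = [0.97, 0.55, 0.96]
assert naive_filter(weak)                 # naive filter keeps it
assert span_features(weak)["drop"] > 0.4  # feature view flags it
```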
cs.CL / 3 / 2603.00024
Personalization Increases Affective Alignment but Has Role-Dependent Effects on Epistemic Independence in LLMs
个性化增加情感一致性,但对大型语言模型的认知独立性具有角色依赖性影响
Abstract
Large Language Models (LLMs) are prone to sycophantic behavior, uncritically conforming to user beliefs. As models increasingly condition responses on user-specific context (personality traits, preferences, conversation history), they gain information to tailor agreement more effectively. Understanding how personalization modulates sycophancy is critical, yet systematic evaluation across models and contexts remains limited. We present a rigorous evaluation of personalization's impact on LLM sycophancy across nine frontier models and five benchmark datasets spanning advice, moral judgment, and debate contexts. We find that personalization generally increases affective alignment (emotional validation, hedging/deference), but affects epistemic alignment (belief adoption, position stability, resistance to influence) with context-dependent role modulation. When the LLM's role is to give advice, personalization strengthens epistemic independence (models challenge user presuppositions). When its role is that of a social peer, personalization decreases epistemic independence. In this role, under extensive personalization, user challenges cause LLMs to abandon their position at significantly higher rates. Robustness tests confirm that the effects are driven by personalized conditioning, not by additional input tokens per se or demographic information alone. Our work provides measurement frameworks for evaluating personalized AI systems, demonstrates the necessity of role-sensitive evaluation, and establishes a novel benchmark to assess goal alignment.
Chinese Translation
大型语言模型(LLMs)容易表现出谄媚行为,毫无批判地顺应用户信念。随着模型越来越多地根据用户特定的上下文(个性特征、偏好、对话历史)来调整响应,它们获得了可用于更有效地量身定制赞同的信息。理解个性化如何调节谄媚行为至关重要,但对模型和上下文的系统评估仍然有限。我们对个性化对九个前沿模型和五个基准数据集(涵盖建议、道德判断和辩论上下文)中LLM谄媚行为的影响进行了严格评估。我们发现,个性化通常增加情感一致性(情感验证、模糊/恭顺),但对认知一致性(信念采纳、立场稳定性、抵抗影响)的影响则具有上下文依赖的角色调节。当LLM的角色是提供建议时,个性化增强了认知独立性(模型挑战用户的假设)。而当其角色是社交同伴时,个性化则降低了认知独立性。在这种角色下,在广泛个性化的条件下,用户的挑战会导致LLM以显著更高的频率放弃其立场。稳健性测试确认这些影响是由个性化条件驱动的,而不是单纯由额外输入标记或人口统计信息所致。我们的研究为评估个性化人工智能系统提供了测量框架,展示了角色敏感评估的必要性,并建立了一个新的基准以评估目标一致性。
cs.CL / 4 / 2603.00025
TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation
TAB-PO:一种基于令牌级自适应障碍的偏好优化方法,用于令牌关键的结构化生成
Abstract
Direct Preference Optimization (DPO) is an offline post-SFT method for aligning language models from preference pairs, with strong results in instruction following and summarization. However, DPO's sequence-level implicit reward can be brittle for token-critical structured prediction settings such as medical annotation, which often exhibit (i) low-separation preference pairs, where chosen and rejected completions differ by minimal edit distance (often 1-3 tokens), and (ii) token-importance skew, where sparse semantic tokens (hierarchical labels and evidence spans) carry disproportionate task importance relative to high-frequency structural tokens (JSON scaffolding). In this regime, standard DPO suffers from margin collapse (insufficient log-probability separation between near-identical preferences), likelihood squeezing (the margin objective shifts the absolute likelihoods of both completions together), and gradient dilution, where uniform sequence-level weighting diffuses learning signal across shared scaffolding while rare, confusable label tokens receive weak, noisy updates. We introduce Token-Adaptive Barrier Preference Optimization (TAB-PO), which augments DPO with token-weighted, reference-adjusted advantages that prioritize high-value semantic tokens, and a conditional token-level barrier that regularizes under-confident tokens, balancing SFT-anchored likelihood and preference-driven separation in low-separation, importance-skewed regimes. We evaluate TAB-PO on medical communication annotation, a task requiring joint prediction of hierarchical labels and evidence spans from patient-provider messages. TAB-PO achieves a ~4% relative improvement in micro-F1 over SFT and consistently outperforms recent preference-optimization baselines.
Chinese Translation
直接偏好优化(Direct Preference Optimization, DPO)是一种离线的后SFT方法,用于从偏好对中对齐语言模型,在指令跟随和摘要生成方面取得了良好的效果。然而,DPO的序列级隐式奖励在令牌关键的结构化预测设置中可能表现不佳,例如医学标注,这种情况下通常会出现(i)低分离偏好对,其中选择的和拒绝的完成之间的编辑距离极小(通常为1-3个令牌),以及(ii)令牌重要性偏斜,其中稀疏的语义令牌(层次标签和证据跨度)相对于高频结构令牌(JSON框架)承载着不成比例的任务重要性。在这种情况下,标准DPO会遭遇边际崩溃(近乎相同的偏好之间的对数概率分离不足)、似然挤压(边际目标共同转移两个完成的绝对似然)和梯度稀释,其中均匀的序列级加权会将学习信号扩散到共享框架中,而稀有的、易混淆的标签令牌则会收到微弱且嘈杂的更新。我们提出了令牌自适应障碍偏好优化(Token-Adaptive Barrier Preference Optimization, TAB-PO),该方法通过令牌加权的、参考调整的优势来增强DPO,优先考虑高价值的语义令牌,并引入条件令牌级障碍,以规范化不自信的令牌,在低分离、重要性偏斜的情况下平衡基于SFT的似然和偏好驱动的分离。我们在医学沟通标注任务上评估了TAB-PO,该任务要求从患者与提供者的消息中联合预测层次标签和证据跨度。TAB-PO在微F1上相较于SFT实现了约4%的相对提升,并且在最近的偏好优化基准测试中表现出持续的优越性。
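A minimal sketch of the token-weighted, reference-adjusted idea, assuming per-token log-probabilities are available. The weight values and exact loss form below are illustrative, not TAB-PO's published formulation:

```python
import math

# Sketch: a DPO-style preference loss where each token's
# reference-adjusted log-prob is scaled by an importance weight,
# concentrating the gradient on the few semantic tokens.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weighted_margin(policy_lp, ref_lp, weights):
    """Token-weighted, reference-adjusted score of one completion:
    sum_t w_t * (log pi_t - log pi_ref_t)."""
    return sum(w * (p - r) for p, r, w in zip(policy_lp, ref_lp, weights))

def preference_loss(chosen, rejected, beta=0.1):
    """chosen/rejected: dicts with 'policy', 'ref', 'weights' lists."""
    m_c = weighted_margin(chosen["policy"], chosen["ref"], chosen["weights"])
    m_r = weighted_margin(rejected["policy"], rejected["ref"], rejected["weights"])
    return -math.log(sigmoid(beta * (m_c - m_r)))

# A near-identical pair differing only at position 1 (the label token).
# Structural tokens get weight 1.0, the label token 4.0 (invented values).
chosen = {"policy": [-0.1, -0.5, -0.1], "ref": [-0.1, -1.2, -0.1],
          "weights": [1.0, 4.0, 1.0]}
rejected = {"policy": [-0.1, -1.5, -0.1], "ref": [-0.1, -1.0, -0.1],
            "weights": [1.0, 4.0, 1.0]}
loss = preference_loss(chosen, rejected)
```

With uniform weights the same pair yields a weaker margin and a larger loss, which is the gradient dilution the abstract describes.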
cs.CL / 5 / 2603.00026
ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents
ActMem:弥合大型语言模型代理中的记忆检索与推理之间的鸿沟
Abstract
Effective memory management is essential for large language model (LLM) agents handling long-term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications. They may fail in scenarios requiring conflict detection and complex decision-making. To bridge this critical gap, we propose a novel actionable memory framework called ActMem that integrates memory retrieval with active causal reasoning. ActMem transforms unstructured dialogue history into a structured causal and semantic graph. By leveraging counterfactual reasoning and commonsense completion, it enables agents to deduce implicit constraints and resolve potential conflicts between past states and current intentions. Furthermore, we introduce a comprehensive dataset ActMemEval to evaluate agent reasoning capabilities in logic-driven scenarios, moving beyond the fact-retrieval focus of existing memory benchmarks. Experiments demonstrate that ActMem significantly outperforms state-of-the-art baselines in handling complex, memory-dependent tasks, paving the way for more consistent and reliable intelligent assistants.
Chinese Translation
有效的记忆管理对于处理长期交互的大型语言模型(LLM)代理至关重要。当前的记忆框架通常将代理视为被动的“记录者”,在没有理解其深层含义的情况下检索信息。这些框架可能在需要冲突检测和复杂决策的场景中失效。为了解决这一关键问题,我们提出了一种新颖的可操作记忆框架,称为 ActMem,它将记忆检索与主动因果推理相结合。ActMem 将非结构化的对话历史转化为结构化的因果和语义图。通过利用反事实推理和常识补全,它使代理能够推导隐含约束,并解决过去状态与当前意图之间的潜在冲突。此外,我们还引入了一个综合数据集 ActMemEval,以评估代理在逻辑驱动场景中的推理能力,超越了现有记忆基准的事实检索重点。实验表明,ActMem 在处理复杂的、依赖记忆的任务时显著优于最先进的基线,为更一致和可靠的智能助手铺平了道路。
cs.CL / 6 / 2603.00028
EPPCMinerBen: A Novel Benchmark for Evaluating Large Language Models on Electronic Patient-Provider Communication via the Patient Portal
EPPCMinerBen:评估大型语言模型在电子患者-提供者沟通中的新基准
Abstract
Effective communication in health care is critical for treatment outcomes and adherence. With patient-provider exchanges shifting to secure messaging, analyzing electronic patient-provider communication (EPPC) data is both essential and challenging. We introduce EPPCMinerBen, a benchmark for evaluating LLMs in detecting communication patterns and extracting insights from electronic patient-provider messages. EPPCMinerBen includes three sub-tasks: Code Classification, Subcode Classification, and Evidence Extraction. Using 1,933 expert-annotated sentences from 752 secure messages of the patient portal at Yale New Haven Hospital, it evaluates LLMs on identifying communicative intent and supportive text. Benchmarks span various LLMs under zero-shot and few-shot settings, with data to be released via the NCI Cancer Data Service. Model performance varied across tasks and settings. Llama-3.1-70B led in evidence extraction (F1: 82.84%) and performed well in classification. Llama-3.3-70b-Instruct outperformed all models in code classification (F1: 67.03%). DeepSeek-R1-Distill-Qwen-32B excelled in subcode classification (F1: 48.25%), while sdoh-llama-3-70B showed consistent performance. Smaller models underperformed, especially in subcode classification (>30% F1). Few-shot prompting improved most tasks. Our results show that large, instruction-tuned models generally perform better in EPPCMinerBen tasks, particularly evidence extraction, while smaller models struggle with fine-grained reasoning. EPPCMinerBen provides a benchmark for discourse-level understanding, supporting future work on model generalization and patient-provider communication analysis. Keywords: Electronic Patient-Provider Communication, Large language models, Data collection, Prompt engineering
Chinese Translation
有效的医疗沟通对治疗结果和患者依从性至关重要。随着患者与提供者之间的交流转向安全消息传递,分析电子患者沟通(EPPC)数据既必要又具有挑战性。我们介绍了EPPCMinerBen,这是一个用于评估大型语言模型(LLMs)在识别沟通模式和提取电子患者-提供者消息中的见解的基准。EPPCMinerBen包括三个子任务:代码分类、子代码分类和证据提取。使用来自耶鲁新哈芬医院患者门户的752条安全消息中的1,933个专家注释句子,它评估LLMs在识别沟通意图和支持性文本方面的表现。基准涵盖了在零样本和少样本设置下的各种LLMs,数据将通过NCI癌症数据服务发布。模型在不同任务和设置下的表现各异。Llama-3.1-70B在证据提取方面表现最佳(F1: 82.84%),并在分类任务中表现良好。Llama-3.3-70b-Instruct在代码分类中超越了所有模型(F1: 67.03%)。DeepSeek-R1-Distill-Qwen-32B在子代码分类中表现突出(F1: 48.25%),而sdoh-llama-3-70B则表现稳定。较小的模型表现不佳,尤其是在子代码分类中(F1 > 30%)。少样本提示提升了大多数任务的表现。我们的结果表明,大型、经过指令调优的模型在EPPCMinerBen任务中通常表现更好,特别是在证据提取方面,而较小的模型在细粒度推理上则面临困难。EPPCMinerBen为话语层面的理解提供了基准,支持未来在模型泛化和患者-提供者沟通分析方面的研究。关键词:电子患者-提供者沟通、大型语言模型、数据收集、提示工程
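The micro-averaged F1 used to score the classification sub-tasks can be computed in a few lines; the label names below are invented for illustration:

```python
# Micro-F1 over multi-label sentence annotations: pool true positives,
# false positives, and false negatives across all sentences, then
# compute a single precision/recall pair.

def micro_f1(gold, pred):
    """gold/pred: lists of label sets, one per sentence."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels present in both
        fp += len(p - g)   # predicted but not annotated
        fn += len(g - p)   # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [{"clinical_info"}, {"logistical", "medication"}, {"social"}]
pred = [{"clinical_info"}, {"logistical"}, {"clinical_info"}]
score = micro_f1(gold, pred)   # tp=2, fp=1, fn=2
```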
cs.CL / 7 / 2603.00029
Embracing Anisotropy: Turning Massive Activations into Interpretable Control Knobs for Large Language Models
拥抱各向异性:将大规模激活转化为大型语言模型的可解释控制旋钮
Abstract
Large Language Models (LLMs) exhibit highly anisotropic internal representations, often characterized by massive activations, a phenomenon where a small subset of feature dimensions possesses magnitudes significantly larger than the rest. While prior works view these extreme dimensions primarily as artifacts to be managed, we propose a distinct perspective: these dimensions serve as intrinsic interpretable functional units arising from domain specialization. Specifically, we propose a simple magnitude-based criterion to identify Domain-Critical Dimensions in a training-free manner. Our analyses reveal that such dimensions behave as interpretable semantic detectors for symbolic/quantitative patterns or domain-specific terms. In addition, we introduce Critical Dimension Steering, which applies activation steering exclusively to the identified dimensions. Empirical results show that this approach outperforms conventional whole-dimension steering in domain adaptation and jailbreaking scenarios.
Chinese Translation
大型语言模型(LLMs)表现出高度各向异性的内部表征,通常以大规模激活为特征,这是一种现象,其中少数特征维度的幅度显著大于其他维度。虽然先前的研究主要将这些极端维度视为需要管理的伪影,但我们提出了一种不同的视角:这些维度作为源于领域专业化的内在可解释功能单元。具体而言,我们提出了一种基于幅度的简单标准,以无训练方式识别领域关键维度(Domain-Critical Dimensions)。我们的分析表明,这些维度作为符号/定量模式或领域特定术语的可解释语义探测器。此外,我们引入了关键维度引导(Critical Dimension Steering),该方法仅对识别出的维度应用激活引导。实证结果表明,这种方法在领域适应和越狱场景中优于传统的全维度引导。
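A sketch of the training-free, magnitude-based selection idea, with an invented ratio threshold and a deliberately simplistic steering rule (the paper's criterion and steering procedure may differ):

```python
# Flag dimensions whose typical activation magnitude dwarfs the rest,
# then apply a steering offset only on those dimensions.

def critical_dims(activations, ratio=5.0):
    """activations: list of hidden vectors (lists of floats).
    A dimension is 'critical' if its mean |value| exceeds `ratio`
    times the (upper) median of the per-dimension means."""
    dims = len(activations[0])
    means = [sum(abs(v[d]) for v in activations) / len(activations)
             for d in range(dims)]
    med = sorted(means)[dims // 2]
    return [d for d in range(dims) if means[d] > ratio * med]

def steer(vec, dims, delta):
    """Add a steering offset on the selected dimensions only."""
    out = list(vec)
    for d in dims:
        out[d] += delta
    return out

# Toy hidden states where dimension 1 carries a massive activation.
acts = [[0.1, 12.0, -0.2, 0.3],
        [-0.2, 15.0, 0.1, 0.2],
        [0.15, 13.5, -0.1, 0.25]]
dims = critical_dims(acts)
steered = steer(acts[0], dims, 2.0)
```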
cs.CL / 8 / 2603.00030
SimpleTool: Parallel Decoding for Real-Time LLM Function Calling
SimpleTool:实时大语言模型功能调用的并行解码
Abstract
LLM-based function calling enables intelligent agents to interact with external tools and environments, yet autoregressive decoding imposes a fundamental latency bottleneck that limits real-time applications such as embodied intelligence, game AI, and interactive avatars (e.g., 10 Hz control frequency). We observe that function calling differs fundamentally from free-form text generation: structured outputs exhibit substantial token redundancy (delimiters, parameter names), and arguments exhibit weak causal dependencies. Crucially, these two properties must be exploited jointly to achieve real-time performance. We present SimpleTool, which introduces special tokens that serve a dual role: compressing low-entropy tokens (4-6x reduction) while acting as mode selectors that enable independent parallel generation of function name and arguments. This synergistic design achieves 3-6x end-to-end speedup (up to 9.6x) with only +8.2% parallelization overhead. Experiments on five benchmarks across Qwen-series models (0.5B-14B) demonstrate substantial speedup while maintaining competitive or improved accuracy. On Mobile Actions, ST-Qwen-0.5B outperforms Google's FunctionGemma in both accuracy and latency consistency. With quantization on consumer-grade GPU, SimpleTool achieves 61.2ms P50 latency, enabling 16 Hz real-time control at 4B model scale, bridging the gap between LLM function calling and latency-critical real-world deployment.
Chinese Translation
基于大语言模型(LLM)的功能调用使智能体能够与外部工具和环境进行交互,但自回归解码带来了根本性的延迟瓶颈,限制了实时应用的实现,如具身智能、游戏人工智能和交互式虚拟形象(例如,10 Hz 控制频率)。我们观察到,功能调用与自由文本生成在本质上存在显著差异:结构化输出表现出显著的令牌冗余(分隔符、参数名称),而参数之间则表现出较弱的因果依赖关系。关键是,这两个特性必须共同利用,以实现实时性能。我们提出了 SimpleTool,它引入了特殊的令牌,具有双重作用:压缩低熵令牌(减少 4-6 倍),同时作为模式选择器,允许函数名称和参数的独立并行生成。这种协同设计实现了 3-6 倍的端到端加速(最高可达 9.6 倍),并且仅增加了 8.2% 的并行化开销。在 Qwen 系列模型(0.5B-14B)的五个基准测试中的实验表明,在保持竞争力或提高准确性的同时,显著加速。在 Mobile Actions 上,ST-Qwen-0.5B 在准确性和延迟一致性方面均优于谷歌的 FunctionGemma。通过在消费级 GPU 上进行量化,SimpleTool 实现了 61.2ms 的 P50 延迟,使得在 4B 模型规模下能够实现 16 Hz 的实时控制,弥合了 LLM 功能调用与延迟关键的现实世界部署之间的差距。
cs.CL / 9 / 2603.00031
GRIP: Geometric Refinement and Adaptive Information Potential for Data Efficiency
GRIP:用于数据效率的几何细化与自适应信息潜力
Abstract
The performance of Large Language Models (LLMs) is increasingly governed by data efficiency rather than raw scaling volume. However, existing selection methods often decouple global distribution balancing from local instance selection, compromising the hierarchical integrity of the training set. We introduce \textbf{GRIP} (Geometric Refinement and Adaptive Information Potential), a framework that unifies these dimensions by modeling the corpus as an information-dense geometric space. GRIP employs a \textbf{Rapid Adaptation Probe (RAP)} to quantify the information potential of semantic clusters, dynamically re-allocating the sampling budget to regions with the highest representation deficits. Subsequently, we perform Intra-Cluster Selection using a \textbf{length-rectified geometric prior} to counteract embedding density artifacts and preserve long-tail logical sequences. Extensive evaluations on Mixture-of-Experts (MoE) models up to 300B tokens demonstrate that GRIP consistently outperforms state-of-the-art baselines, \textbf{surpassing the performance of models trained on $3\times$ larger uncurated datasets}. Our work establishes a robust geometric foundation for adaptive data curation in large-scale pre-training.
Chinese Translation
大型语言模型(LLMs)的性能越来越受到数据效率的影响,而非单纯的规模扩展。然而,现有的选择方法往往将全局分布平衡与局部实例选择解耦,从而损害了训练集的层次完整性。我们提出了\textbf{GRIP}(几何细化与自适应信息潜力),这是一个通过将语料库建模为信息密集的几何空间来统一这些维度的框架。GRIP采用\textbf{快速适应探针(RAP)}来量化语义簇的信息潜力,动态地将采样预算重新分配给具有最高表示不足的区域。随后,我们使用\textbf{长度校正几何先验}进行簇内选择,以抵消嵌入密度伪影并保持长尾逻辑序列。在对高达300B标记的专家混合模型(MoE)进行的广泛评估中,GRIP始终优于最先进的基线,\textbf{超越了在$3\times$更大未整理数据集上训练的模型的性能}。我们的工作为大规模预训练中的自适应数据策划建立了坚实的几何基础。
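The deficit-driven budget re-allocation can be sketched as a proportional split; the deficit scores here are toy values, and GRIP's Rapid Adaptation Probe is considerably more involved than a fixed score per cluster:

```python
# Split a sampling budget across semantic clusters in proportion to
# their representation deficit (higher deficit -> more samples).

def allocate_budget(deficits, budget):
    total = sum(deficits)
    raw = [budget * d / total for d in deficits]
    alloc = [int(x) for x in raw]   # floor each share
    # Hand the rounding remainder to the largest fractional parts.
    remainder = budget - sum(alloc)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i],
                   reverse=True)
    for i in order[:remainder]:
        alloc[i] += 1
    return alloc

deficits = [0.5, 0.3, 0.2]   # per-cluster under-representation scores
alloc = allocate_budget(deficits, 100)
```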
cs.CL / 10 / 2603.00077
Autorubric: A Unified Framework for Rubric-Based LLM Evaluation
Autorubric:基于评分标准的大型语言模型评估统一框架
Abstract
Rubric-based evaluation with large language models (LLMs) has become standard practice for assessing text generation at scale, yet the underlying techniques are scattered across papers with inconsistent terminology and partial solutions. We present a unified framework: each identified technique is paired with its realization in Autorubric, an open-source Python framework proposed in this paper. Autorubric supports binary, ordinal, and nominal criteria with configurable weights; single-judge and multi-judge ensemble evaluation with majority, weighted, unanimous, and any-vote aggregation; few-shot calibration with verdict-balanced sampling; and mitigations for position bias (option shuffling), verbosity bias (length penalties), and criterion conflation (per-criterion atomic evaluation with natural language explanations). The framework provides reliability metrics drawn from psychometrics (Cohen's $\kappa$, weighted $\kappa$, correlation coefficients, and distribution-level tests) alongside production infrastructure including response caching, checkpointing with resumable runs, multi-provider rate limiting, and cost tracking. We evaluate Autorubric on three benchmarks spanning educational assessment, deep research evaluation, and chatbot quality assessment, demonstrating that it produces results consistent with published benchmarks while exercising the framework's key capabilities: per-criterion binary evaluation with few-shot calibration (RiceChem), multi-judge ensemble evaluation across judge models (ResearcherBench), and mixed criterion types combining binary, ordinal, and nominal scales (CHARM-100). We also contribute CHARM-100, a 100-sample chatbot evaluation dataset with per-sample ground truth labels across all three criterion types, designed to stress-test rubric evaluation frameworks on heterogeneous criteria.
Chinese Translation
基于评分标准的大型语言模型(LLMs)评估已成为大规模文本生成评估的标准实践,但相关技术分散在不同论文中,术语不一致且解决方案不完整。我们提出了一个统一框架:每种识别的技术都与其在Autorubric中的实现相对应,Autorubric是一个在本文中提出的开源Python框架。Autorubric支持二元、序数和名义标准,并具有可配置的权重;支持单评审和多评审的集成评估,采用多数、加权、一致和任意投票聚合;支持通过裁决平衡抽样的少量样本校准;并针对位置偏差(选项洗牌)、冗长偏差(长度惩罚)和标准混淆(逐标准原子评估及自然语言解释)提供缓解措施。该框架提供了来自心理测量学的可靠性指标(Cohen's $\kappa$、加权 $\kappa$、相关系数和分布级测试),以及包括响应缓存、可恢复运行的检查点、多提供者速率限制和成本跟踪在内的生产基础设施。我们在三个基准上评估了Autorubric,涵盖教育评估、深度研究评估和聊天机器人质量评估,结果表明其产生的结果与已发布的基准一致,同时展示了框架的关键能力:逐标准二元评估与少量样本校准(RiceChem)、跨评审模型的多评审集成评估(ResearcherBench)以及结合二元、序数和名义尺度的混合标准类型(CHARM-100)。我们还贡献了CHARM-100,这是一个包含三种标准类型的每个样本真实标签的100样本聊天机器人评估数据集,旨在对异构标准的评分评估框架进行压力测试。
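The ensemble aggregation policies named in the abstract (majority, weighted, unanimous, any-vote) reduce to a few lines over per-judge verdicts; this is a sketch of the concept, not Autorubric's actual API:

```python
# Aggregate per-judge binary verdicts under the four named policies.

def aggregate(verdicts, policy="majority", weights=None):
    """verdicts: one boolean per judge; returns the ensemble verdict."""
    if policy == "unanimous":
        return all(verdicts)
    if policy == "any":
        return any(verdicts)
    if policy == "weighted":
        w = weights if weights is not None else [1.0] * len(verdicts)
        yes = sum(wi for v, wi in zip(verdicts, w) if v)
        return yes > sum(w) / 2
    return sum(verdicts) > len(verdicts) / 2  # simple majority

votes = [True, True, False]
majority = aggregate(votes)
```

Note how the same three votes can pass under majority yet fail under a weighted scheme that trusts the dissenting judge most.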
cs.CL / 11 / 2603.00086
Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization
基于迭代大型语言模型的法语临床访谈转录与说话者分离的改进
Abstract
Automatic speech recognition for French medical conversations remains challenging, with word error rates often exceeding 30% in spontaneous clinical speech. This study proposes a multi-pass LLM post-processing architecture alternating between Speaker Recognition and Word Recognition passes to improve transcription accuracy and speaker attribution. Ablation studies on two French clinical datasets (suicide prevention telephone counseling and preoperative awake neurosurgery consultations) investigate four design choices: model selection, prompting strategy, pass ordering, and iteration depth. Using Qwen3-Next-80B, Wilcoxon signed-rank tests confirm significant WDER reductions on suicide prevention conversations (p < 0.05, n=18), while maintaining stability on awake neurosurgery consultations (n=10), with zero output failures and acceptable computational cost (RTF 0.32), suggesting feasibility for offline clinical deployment.
Chinese Translation
法语医疗对话的自动语音识别仍然面临挑战,自发临床语言的词错误率常常超过30%。本研究提出了一种多遍LLM后处理架构,在说话者识别和词识别之间交替进行,以提高转录准确性和说话者归属。对两个法语临床数据集(自杀预防电话咨询和术前清醒神经外科咨询)进行的消融研究探讨了四个设计选择:模型选择、提示策略、遍历顺序和迭代深度。使用Qwen3-Next-80B进行的Wilcoxon符号秩检验确认在自杀预防对话中显著降低了词级说话人错误率(WDER)(p < 0.05, n=18),同时在清醒神经外科咨询中保持了稳定性(n=10),且没有输出失败,计算成本可接受(实时因子RTF 0.32),这表明该方法在离线临床部署中的可行性。
cs.CL / 12 / 2603.00296
Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning
逐步惩罚以实现高效的思维链推理
Abstract
Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. Prior reinforcement learning approaches typically rely on a single outcome reward with trajectory-level length penalties, which cannot distinguish essential from redundant reasoning steps and therefore yield blunt compression. Although recent work incorporates step-level signals, such as offline pruning, supervised data construction, or verifier-based intermediate rewards, reasoning length is rarely treated as an explicit step-level optimization objective during RL. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution. We estimate step importance from the model's on-policy log-probability improvement toward the correct answer, then treat excess length as a penalty mass redistributed to penalize low-importance steps more heavily while preserving high-importance reasoning. We optimize with a unified outcome-process advantage within group-relative policy optimization. Extensive experiments demonstrate that SWAP reduces reasoning length by 64.3% on average while improving accuracy by 5.7% relative to the base model.
Chinese Translation
大型推理模型在测试时计算量增加时表现更佳,但往往会过度思考,产生不必要的冗长思维链,导致成本增加而未能提高准确性。以往的强化学习方法通常依赖于单一的结果奖励和轨迹级长度惩罚,这无法区分必要的推理步骤与冗余步骤,因此导致粗糙的压缩。尽管近期的研究引入了步骤级信号,如离线剪枝、监督数据构建或基于验证器的中间奖励,但推理长度在强化学习中很少被视为明确的步骤级优化目标。我们提出了逐步自适应惩罚(Step-wise Adaptive Penalization, SWAP),这是一个细粒度框架,根据内在贡献在步骤之间分配长度减少。我们通过模型在正确答案方向上的在策略(on-policy)对数概率提升来估计步骤的重要性,然后将多余的长度视为惩罚质量,重新分配以对低重要性步骤施加更重的惩罚,同时保留高重要性的推理。我们在组相对策略优化中以统一的结果-过程优势进行优化。大量实验表明,SWAP在平均上减少了64.3%的推理长度,同时相对于基础模型提高了5.7%的准确性。
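A sketch of the two ingredients above: step importance from on-policy log-probability improvement, and redistribution of an excess-length penalty mass inversely to importance. The clipping, the inverse weighting, and the epsilon are illustrative; SWAP's exact formulas differ:

```python
# Low-importance steps absorb most of the length-penalty mass,
# while high-importance reasoning is largely spared.

def step_importance(logp_before, logp_after):
    """Importance of a step = its log-prob improvement toward the
    correct answer (clipped at zero)."""
    return [max(0.0, a - b) for b, a in zip(logp_before, logp_after)]

def redistribute_penalty(importance, excess_len, eps=1e-6):
    """Allocate `excess_len` penalty mass across steps in proportion
    to 1 / (importance + eps)."""
    inv = [1.0 / (im + eps) for im in importance]
    z = sum(inv)
    return [excess_len * x / z for x in inv]

# Three steps: the middle one barely improves the answer log-prob.
before = [-3.0, -2.0, -1.9]
after = [-2.0, -1.9, -0.9]
imp = step_importance(before, after)
pen = redistribute_penalty(imp, excess_len=6.0)
```

The middle step, with roughly a tenth of the importance, receives roughly ten times the per-unit penalty of its neighbors.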
cs.CL / 13 / 2603.00307
From Prerequisites to Predictions: Validating a Geometric Hallucination Taxonomy Through Controlled Induction
从前提到预测:通过控制归纳验证几何幻觉分类法
Abstract
We test whether a geometric hallucination taxonomy -- classifying failures as center-drift (Type~1), wrong-well convergence (Type~2), or coverage gaps (Type~3) -- can distinguish hallucination types through controlled induction in GPT-2. Using a two-level statistical design with prompts ($N = 15$/group) as the unit of inference, we run each experiment 20 times with different generation seeds to quantify result stability. In static embeddings, Type~3 norm separation is robust (significant in 18/20 runs, Holm-corrected in 14/20, median $r = +0.61$). In contextual hidden states, the Type~3 norm effect direction is stable (19/20 runs) but underpowered at $N = 15$ (significant in 4/20, median $r = -0.28$). Types~1 and~2 do not separate in either space (${\leq}\,3/20$ runs). Token-level tests inflate significance by 4--16$\times$ through pseudoreplication -- a finding replicated across all 20 runs. The results establish coverage-gap hallucinations as the most geometrically distinctive failure mode, carried by magnitude rather than direction, and confirm the Type~1/2 non-separation as genuine at 124M parameters.
Chinese Translation
我们测试了一种几何幻觉分类法——将失败分类为中心漂移(Type~1)、错误收敛(Type~2)或覆盖缺口(Type~3)——是否能够通过在GPT-2中的控制归纳区分幻觉类型。我们采用两级统计设计,以提示($N = 15$/组)作为推断单位,进行每个实验20次,使用不同的生成种子以量化结果的稳定性。在静态嵌入中,Type~3的范数分离是稳健的(在18/20次实验中显著,经过Holm校正在14/20次中显著,中位数$r = +0.61$)。在上下文隐藏状态中,Type~3的范数效应方向是稳定的(19/20次实验)但在$N = 15$时功效不足(在4/20次中显著,中位数$r = -0.28$)。Type~1和Type~2在任何空间中均未分离(${\leq}\,3/20$次实验)。在标记级别的测试中,通过伪重复显著性被放大了4至16倍——这一发现在所有20次实验中均得到了重复。结果确立了覆盖缺口幻觉作为最具几何特征的失败模式,其特征由幅度而非方向主导,并确认Type~1/2的非分离在124M参数下是真实存在的。
cs.CL / 14 / 2603.00314
When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation
当指标不一致时:自动相似性与LLM作为临床对话评估的裁判
Abstract
This paper details the baseline model selection, fine-tuning process, evaluation methods, and the implications of deploying more accurate LLMs in healthcare settings. As large language models (LLMs) are increasingly employed to address diverse problems, including medical queries, concerns about their reliability have surfaced. A recent study by Long Island University highlighted that LLMs often perform poorly in medical contexts, potentially leading to harmful misguidance for users. To address this, our research focuses on fine-tuning Llama 2 7B, a transformer-based, decoder-only model, using transcripts from real patient-doctor interactions. Our objective was to enhance the model's accuracy and precision in responding to medical queries. We fine-tuned the model using a supervised approach, emphasizing domain-specific nuances captured in the training data. Ideally, the model's outputs would be reviewed and evaluated by practicing medical experts; due to resource constraints, the performance of the fine-tuned model was instead evaluated using text similarity metrics. The fine-tuned model demonstrated significant improvements across all key dimensions except GPT-4's evaluation. Because GPT-4's evaluations diverge markedly from the quantitative results, we propose that the outputs ultimately be evaluated by human medical experts.
Chinese Translation
本文详细介绍了基线模型选择、微调过程、评估方法以及在医疗环境中部署更准确的大型语言模型(LLMs)的影响。随着大型语言模型(LLMs)越来越多地被用于解决各种问题,包括医疗查询,关于其可靠性的担忧也随之出现。长岛大学的一项最新研究指出,LLMs在医疗环境中的表现往往不佳,可能导致对用户的有害误导。为了解决这一问题,我们的研究集中在使用真实患者-医生互动的转录文本对Llama 2 7B(一个基于变换器的解码器模型)进行微调。我们的目标是提高模型在回应医疗查询时的准确性和精确性。我们采用监督学习的方法对模型进行了微调,强调了训练数据中捕捉到的领域特定细微差别。在最佳情况下,模型结果应由真实的医学专家进行审查和评估。由于资源限制,微调模型的性能使用文本相似性指标进行评估。微调后的模型在所有关键维度上都表现出显著改善,除了GPT-4的评估。ChatGPT4的评估结果与定量结果相差甚远;在这里,我们不仅建议,而且提议结果应由人类医学专家进行评估。
cs.CL / 15 / 2603.00359
How Large Language Models Get Stuck: Early structure with persistent errors
大型语言模型如何陷入困境:早期结构与持续错误
Abstract
Linguistic insights may help make Large Language Model (LLM) training more efficient. We trained Meta's OPT model on the 100M word BabyLM dataset, and evaluated it on the BLiMP benchmark, which consists of 67 classes, each defined by sentence pairs that differ in a targeted syntactic or semantic rule violation. We tested the model's preference for grammatical over ungrammatical sentences across training iterations and grammatical types. In nearly one-third of the BLiMP classes, OPT fails to consistently assign a higher likelihood to grammatical sentences, even after extensive training. When it fails, it often establishes a clear (erroneous) separation of the likelihoods at an early stage of processing and sustains this to the end of our training phase. We hypothesize that this mis-categorization is costly because it creates entrenched biases that must, eventually, be reversed in order for the model to perform well. We probe this phenomenon using a mixture of qualitative (based on linguistic theory and the theory of Deep Learning) and quantitative (based on numerical testing) assessments. Our qualitative assessments indicate that only some BLiMP tests are meaningful guides. We conclude by articulating a hypothesis, the Bigram Hypothesis, which claims that the learning process will exhibit erroneous entrenchment if bigram statistics bias the model toward wrong distinctions early in training, and we describe a method (in progress) of testing the hypothesis on appropriately selected BLiMP classes.
Chinese Translation
语言学的洞见可能有助于提高大型语言模型(LLM)训练的效率。我们在100M字的BabyLM数据集上训练了Meta的OPT模型,并在BLiMP基准上进行了评估,该基准由67个类别组成,每个类别由在特定句法或语义规则违反上有所不同的句子对定义。我们测试了模型在训练迭代和句法类型中对语法句子相对于非语法句子的偏好。在近三分之一的BLiMP类别中,OPT未能在经过广泛训练后始终为语法句子分配更高的可能性。当它失败时,通常会在处理的早期阶段建立明显的(错误的)可能性分离,并在训练阶段的最后保持这种状态。我们假设这种错误分类是代价高昂的,因为它会产生根深蒂固的偏见,最终必须逆转才能使模型表现良好。我们通过定性(基于语言理论和深度学习理论)和定量(基于数值测试)评估相结合的方法探讨这一现象。我们的定性评估表明,只有部分BLiMP测试是有意义的指导。最后,我们提出了一个假设,即双元假设(Bigram Hypothesis),该假设声称如果双元统计在训练早期使模型偏向错误的区分,学习过程将表现出错误的根深蒂固,并描述了一种正在进行中的方法,用于在适当选择的BLiMP类别上测试该假设。
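The preference test behind BLiMP-style evaluation is a likelihood comparison over a minimal pair. Below, a toy add-one-smoothed bigram model stands in for OPT, which also echoes the Bigram Hypothesis about early training statistics:

```python
import math
from collections import Counter

# A model "passes" a BLiMP-style pair when it assigns higher
# log-likelihood to the grammatical sentence of the pair.

def train_bigram(corpus):
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split()
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def logprob(sent, unigrams, bigrams, vocab_size):
    toks = ["<s>"] + sent.split()
    lp = 0.0
    for a, b in zip(toks, toks[1:]):
        # add-one smoothing so unseen bigrams stay finite
        lp += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
    return lp

def prefers_grammatical(good, bad, model):
    uni, bi = model
    v = len(uni)
    return logprob(good, uni, bi, v) > logprob(bad, uni, bi, v)

model = train_bigram(["the dog barks", "the dogs bark", "a dog barks"])
ok = prefers_grammatical("the dog barks", "the dog bark", model)
```

Because the unseen bigram "dog bark" is smoothed down, the grammatical variant wins; by the same mechanism, skewed bigram counts early in training could entrench a wrong preference.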
cs.CL / 16 / 2603.00364
Distribution-Aware Companding Quantization of Large Language Models
分布感知的压缩量化大型语言模型
Abstract
Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B-parameter model solves 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3x faster at inference, even with large batch sizes.
Chinese Translation
大型语言模型如GPT和Llama是通过下一个标记预测损失进行训练的。在本研究中,我们建议训练语言模型同时预测多个未来标记可以提高样本效率。更具体地说,在训练语料库的每个位置,我们要求模型使用n个独立的输出头预测接下来的n个标记,这些输出头在共享的模型主干之上操作。将多标记预测视为辅助训练任务,我们测量了下游能力的提升,而在代码和自然语言模型的训练时间上没有额外开销。该方法在更大模型尺寸中愈加有效,并在多轮训练时保持其吸引力。在生成基准测试中,尤其是在编码任务上,我们的模型始终比强基线高出几个百分点。我们的13B参数模型在HumanEval上解决了比可比的下一个标记模型多12%的问题,在MBPP上多17%。在小型算法任务上的实验表明,多标记预测有利于归纳头和算法推理能力的发展。作为额外的好处,使用4标记预测训练的模型在推理时速度提高了最多3倍,即使在大批量情况下也是如此。
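The n-independent-heads layout can be caricatured in plain Python: one shared trunk pass, then one head per future offset. The trunk and heads below are deterministic toys chosen so the example is checkable, not learned projections:

```python
# n tokens predicted from a single trunk pass, one independent head
# per future position.

def trunk(token_ids):
    """Shared context summary (stand-in for the transformer trunk)."""
    return [sum(token_ids) % 7, len(token_ids) % 5]

def head(hidden, offset, vocab_size=10):
    """Toy head for the token `offset` steps ahead: returns one score
    per vocabulary item (lower = preferred)."""
    base = (hidden[0] * 31 + hidden[1] * 17 + offset) % vocab_size
    return [(v - base) % vocab_size for v in range(vocab_size)]

def predict_next_n(token_ids, n=4):
    h = trunk(token_ids)          # single shared forward pass
    preds = []
    for k in range(1, n + 1):    # heads are independent, so this loop
        scores = head(h, k)       # could run in parallel
        preds.append(scores.index(min(scores)))
    return preds

preds = predict_next_n([3, 1, 4], n=4)   # four tokens, one trunk pass
```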
cs.CL / 17 / 2603.00369
Policy Compliance of User Requests in Natural Language for AI Systems
人工智能系统中用户自然语言请求的政策合规性
Abstract
Consider an organization whose users send requests in natural language to an AI system that fulfills them by carrying out specific tasks. In this paper, we consider the problem of ensuring that such user requests comply with a list of diverse policies determined by the organization, with the purpose of guaranteeing the safe and reliable use of the AI system. We propose, to the best of our knowledge, the first benchmark consisting of user requests annotated for their compliance with respect to a list of diverse policies. Our benchmark is related to industrial applications in the technology sector. We then use our benchmark to evaluate the performance of various LLMs on policy compliance assessment under different solution methods. We analyze the differences in performance metrics across the models and solution methods, showcasing the challenging nature of our problem.
Chinese Translation
考虑一个组织,其用户向人工智能系统发送自然语言请求,系统通过执行特定任务来满足这些请求。本文探讨了确保用户请求符合组织确定的多样化政策列表的问题,旨在保证人工智能系统的安全和可靠使用。我们提出了迄今为止首个基准数据集,该数据集包含了与政策列表的合规性多样化的标注用户请求。我们的基准数据集与技术领域的工业应用相关。随后,我们利用该基准评估不同大语言模型(LLM)在政策合规性评估中的表现,采用不同的解决方法。我们分析了各模型和解决方法在性能指标上的差异,展示了我们问题的挑战性。
cs.CL / 18 / 2603.00426
LLM-Bootstrapped Targeted Finding Guidance for Factual MLLM-based Medical Report Generation
基于LLM引导的针对性发现指导用于基于MLLM的医学报告生成
Abstract
The automatic generation of medical reports utilizing Multimodal Large Language Models (MLLMs) frequently encounters challenges related to factual instability, which may manifest as the omission of findings or the incorporation of inaccurate information, thereby constraining their applicability in clinical settings. Current methodologies typically produce reports based directly on image features, which inherently lack a definitive factual basis. In response to this limitation, we introduce Fact-Flow, an innovative framework that separates the process of visual fact identification from the generation of reports. This is achieved by initially predicting clinical findings from the image, which subsequently directs the MLLM to produce a report that is factually precise. A pivotal advancement of our approach is a pipeline that leverages a Large Language Model (LLM) to autonomously create a dataset of labeled medical findings, effectively eliminating the need for expensive manual annotation. Extensive experimental evaluations conducted on two disease-focused medical datasets validate the efficacy of our method, demonstrating a significant enhancement in factual accuracy compared to state-of-the-art models, while concurrently preserving high standards of text quality.
Chinese Translation
利用多模态大型语言模型(MLLM)自动生成医学报告时,常常面临与事实不稳定性相关的挑战,这可能表现为发现的遗漏或不准确信息的纳入,从而限制其在临床环境中的适用性。目前的方法通常直接基于图像特征生成报告,这本质上缺乏明确的事实基础。针对这一局限性,我们提出了Fact-Flow,这是一种创新框架,将视觉事实识别过程与报告生成过程分开。具体而言,我们首先从图像中预测临床发现,随后指导MLLM生成事实准确的报告。我们方法的一个关键进展是一个管道,利用大型语言模型(LLM)自主创建标注医学发现的数据集,有效消除了对昂贵人工标注的需求。在两个以疾病为重点的医学数据集上进行的广泛实验评估验证了我们方法的有效性,显示出与最先进模型相比,事实准确性显著提高,同时保持了高标准的文本质量。
cs.CL / 19 / 2603.00432
A Typologically Grounded Evaluation Framework for Word Order and Morphology Sensitivity in Multilingual Masked LMs
基于类型学的多语言掩蔽语言模型中词序和形态敏感性的评估框架
Abstract
We introduce a typology-aware diagnostic for multilingual masked language models that tests reliance on word order versus inflectional form. Using Universal Dependencies, we apply inference-time perturbations: full token scrambling, content-word scrambling with function words fixed, dependency-based head--dependent swaps, and sentence-level lemma substitution (+L), which lemmatizes both the context and the masked target label. We evaluate mBERT and XLM-R on English, Chinese, German, Spanish, and Russian. Full scrambling drives word-level reconstruction accuracy near zero in all languages; partial and head--dependent perturbations cause smaller but still large drops. +L has little effect in Chinese but substantially lowers accuracy in German/Spanish/Russian, and it does not mitigate the impact of scrambling. Top-5 word accuracy shows the same pattern: under full scrambling, the gold word rarely appears among the five highest-ranked reconstructions. We release code, sampling scripts, and balanced evaluation subsets; Turkish results under strict reconstruction are reported in the appendix.
Chinese Translation
我们引入了一种关注类型学的诊断工具,用于测试多语言掩蔽语言模型对词序与屈折形式的依赖性。利用通用依赖关系(Universal Dependencies),我们在推理时施加扰动:完全的标记打乱、内容词打乱(固定功能词)、基于依赖关系的头-依赖词交换,以及句子级的词元替换(+L),该方法对上下文和被掩蔽的目标标签都进行词形还原。我们对mBERT和XLM-R在英语、中文、德语、西班牙语和俄语上进行了评估。完全打乱导致所有语言的词级重建准确率接近零;部分和头-依赖扰动虽然造成的下降较小,但仍然显著。+L在中文中的影响很小,但在德语、西班牙语和俄语中显著降低了准确率,并且未能减轻打乱的影响。前五词准确率显示出相同的模式:在完全打乱的情况下,正确词很少出现在排名最高的五个重建结果中。我们发布了代码、采样脚本和均衡评估子集;土耳其语在严格重建下的结果在附录中报告。
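The two scrambling perturbations, full token scrambling versus content-word scrambling with function words fixed in place, can be illustrated with a small sketch. The function-word list and seeding are illustrative assumptions, not the paper's implementation:

```python
import random

# Illustrative subset; the paper derives function words from UD annotations.
FUNCTION_WORDS = {"the", "a", "of", "to", "in", "on", "and"}

def full_scramble(tokens, seed=0):
    """Full token scrambling: permute every position."""
    rng = random.Random(seed)
    out = tokens[:]
    rng.shuffle(out)
    return out

def content_scramble(tokens, seed=0):
    """Scramble only content words; function words keep their positions."""
    rng = random.Random(seed)
    idx = [i for i, t in enumerate(tokens) if t.lower() not in FUNCTION_WORDS]
    content = [tokens[i] for i in idx]
    rng.shuffle(content)
    out = tokens[:]
    for i, w in zip(idx, content):
        out[i] = w
    return out

sent = "the cat sat on the mat".split()
print(content_scramble(sent))  # "the", "on", "the" stay at positions 0, 3, 4
```

Both perturbations preserve the bag of tokens, so any accuracy drop is attributable to order alone.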
cs.CL / 20 / 2603.00523
CIRCUS: Circuit Consensus under Uncertainty via Stability Ensembles
CIRCUS:在不确定性下通过稳定性集成实现电路共识
Abstract
Mechanistic circuit discovery is notoriously sensitive to arbitrary analyst choices, especially pruning thresholds and feature dictionaries, often yielding brittle "one-shot" explanations with no principled notion of uncertainty. We reframe circuit discovery as an uncertainty-quantification problem over these analytic degrees of freedom. Our method, CIRCUS, constructs an ensemble of attribution graphs by pruning a single raw attribution run under multiple configurations, assigns each edge a stability score (the fraction of configurations that retain it), and extracts a strict-consensus circuit consisting only of edges that appear in all views. This produces a threshold-robust "core" circuit while explicitly surfacing contingent alternatives and enabling rejection of low-agreement structure. CIRCUS requires no retraining and adds negligible overhead, since it aggregates structure across already-computed pruned graphs. On Gemma-2-2B and Llama-3.2-1B, strict consensus circuits are ~40x smaller than the union of all configurations while retaining comparable influence-flow explanatory power, and they outperform a same-edge-budget baseline (union pruned to match the consensus size). We further validate causal relevance with activation patching, where consensus-identified nodes consistently beat matched non-consensus controls (p=0.0004). Overall, CIRCUS provides a practical, uncertainty-aware framework for reporting trustworthy, auditable mechanistic circuits with an explicit core/contingent/noise decomposition.
Chinese Translation
机械电路发现对任意分析师选择极为敏感,尤其是修剪阈值和特征字典,常常导致脆弱的“一次性”解释,缺乏原则性的不确定性概念。我们将电路发现重新构建为一个关于这些分析自由度的不确定性量化问题。我们的方法CIRCUS通过在多种配置下修剪单一原始归因运行来构建归因图的集成,为每条边分配一个稳定性评分(保留该边的配置比例),并提取仅包含所有视图中出现的边的严格共识电路。这产生了一个阈值稳健的“核心”电路,同时明确呈现了有条件的替代方案,并能够拒绝低一致性结构。CIRCUS无需重新训练,且增加的开销微乎其微,因为它聚合了已经计算的修剪图中的结构。在Gemma-2-2B和Llama-3.2-1B上,严格共识电路的规模比所有配置的并集小约40倍,同时保持了可比的影响流解释能力,并且优于相同边预算的基线(并集修剪以匹配共识大小)。我们进一步通过激活补丁验证因果相关性,其中共识识别的节点始终优于匹配的非共识对照组(p=0.0004)。总体而言,CIRCUS提供了一个实用的、关注不确定性的框架,用于报告可信的、可审计的机械电路,并具有明确的核心/有条件/噪声分解。
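The edge stability score and the strict-consensus extraction described above reduce to simple set operations over the per-configuration pruned graphs. A minimal sketch, with toy edge sets standing in for real attribution graphs:

```python
def stability_scores(pruned_graphs):
    """Each pruned graph is the set of edges retained under one analysis
    configuration. Stability of an edge = fraction of configurations
    that retain it."""
    n = len(pruned_graphs)
    all_edges = set().union(*pruned_graphs)
    return {e: sum(e in g for g in pruned_graphs) / n for e in all_edges}

def strict_consensus(pruned_graphs):
    """The 'core' circuit: edges present in every view (stability == 1.0)."""
    return set.intersection(*map(set, pruned_graphs))

# Three toy pruning configurations of the same raw attribution run.
views = [{("a", "b"), ("b", "c"), ("c", "d")},
         {("a", "b"), ("b", "c")},
         {("a", "b"), ("c", "d")}]
scores = stability_scores(views)
core = strict_consensus(views)
print(core)                # {('a', 'b')}
print(scores[("b", "c")])  # 2/3: contingent, not core
```

Edges with intermediate scores form the contingent set, and low-score edges can be rejected as noise, matching the core/contingent/noise decomposition.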
cs.CL / 21 / 2603.00573
CoMoL: Efficient Mixture of LoRA Experts via Dynamic Core Space Merging
CoMoL:通过动态核心空间合并实现高效的LoRA专家混合
Abstract
Large language models (LLMs) achieve remarkable performance on diverse downstream and domain-specific tasks via parameter-efficient fine-tuning (PEFT). However, existing PEFT methods, particularly MoE-LoRA architectures, suffer from limited parameter efficiency and coarse-grained adaptation due to the proliferation of LoRA experts and instance-level routing. To address these issues, we propose Core Space Mixture of LoRA (\textbf{CoMoL}), a novel MoE-LoRA framework that incorporates expert diversity, parameter efficiency, and fine-grained adaptation. Specifically, CoMoL introduces two key components: core space experts and core space routing. Core space experts store each expert in a compact core matrix, preserving diversity while controlling parameter growth. Core space routing dynamically selects and activates the appropriate core experts for each token, enabling fine-grained, input-adaptive routing. Activated core experts are then merged via a soft-merging strategy into a single core expert, which is combined with a shared LoRA to form a specialized LoRA module. Besides, the routing network is projected into the same low-rank space as the LoRA matrices, further reducing parameter overhead without compromising expressiveness. Extensive experiments demonstrate that CoMoL retains the adaptability of MoE-LoRA architectures while achieving parameter efficiency comparable to standard LoRA, consistently outperforming existing methods across multiple tasks.
Chinese Translation
大型语言模型(LLMs)通过参数高效微调(PEFT)在多样的下游和领域特定任务中取得了显著的性能。然而,现有的PEFT方法,特别是MoE-LoRA架构,由于LoRA专家的激增和实例级路由,面临着有限的参数效率和粗粒度适应的问题。为了解决这些问题,我们提出了核心空间LoRA混合(CoMoL),这是一种新颖的MoE-LoRA框架,结合了专家多样性、参数效率和细粒度适应。具体而言,CoMoL引入了两个关键组件:核心空间专家和核心空间路由。核心空间专家将每个专家存储在一个紧凑的核心矩阵中,保留了多样性同时控制了参数增长。核心空间路由动态选择并激活每个令牌的适当核心专家,实现细粒度的输入自适应路由。激活的核心专家随后通过软合并策略合并为一个单一的核心专家,并与共享的LoRA结合形成一个专用的LoRA模块。此外,路由网络被投影到与LoRA矩阵相同的低秩空间中,进一步减少了参数开销而不影响表达能力。大量实验表明,CoMoL保持了MoE-LoRA架构的适应性,同时实现了与标准LoRA相当的参数效率,在多个任务中始终优于现有方法。
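The soft-merging step, in which the core experts activated for a token are collapsed into a single core matrix by a router-weighted sum, can be sketched as below. The 2x2 core matrices and router logits are toy stand-ins; the real CoMoL operates on learned LoRA core spaces:

```python
import math

def softmax(xs):
    """Numerically stable softmax over router logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def soft_merge(cores, weights):
    """Collapse the routed core matrices into one core expert as a
    weighted sum (the 'soft-merging' step)."""
    rows, cols = len(cores[0]), len(cores[0][0])
    return [[sum(w * c[i][j] for w, c in zip(weights, cores))
             for j in range(cols)] for i in range(rows)]

# Two hypothetical 2x2 core experts and one token's router logits.
cores = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.0, 2.0], [2.0, 0.0]]]
logits = [0.0, 0.0]  # this token weights both experts equally
merged = soft_merge(cores, softmax(logits))
print(merged)  # [[0.5, 1.0], [1.0, 0.5]]
```

Because the merge happens per token before the LoRA computation, routing stays fine-grained while only a single specialized module is ever materialized.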
cs.CL / 22 / 2603.00582
Super Research: Answering Highly Complex Questions with Large Language Models through Super Deep and Super Wide Research
超级研究:通过超级深度和超级广度的研究利用大型语言模型回答高度复杂的问题
Abstract
While Large Language Models (LLMs) have demonstrated proficiency in Deep Research or Wide Search, their capacity to solve highly complex questions-those requiring long-horizon planning, massive evidence gathering, and synthesis across heterogeneous sources-remains largely unexplored. We introduce Super Research, a task for complex autonomous research tasks that integrates (i) structured decomposition into a research plan, (ii) super wide retrieval for diverse perspectives, and (iii) super deep investigation to resolve uncertainties through iterative queries. To evaluate this capability, we curated a benchmark of 300 expert-written questions across diverse domains, each requiring up to 100+ retrieval steps and 1,000+ web pages to reconcile conflicting evidence. Super Research produces verifiable reports with fine-grained citations and intermediate artifacts (e.g., outlines and tables) to ensure traceable reasoning. Furthermore, we present a graph-anchored auditing protocol that evaluates Super Research along five dimensions: Coverage, Logical Consistency, Report Utility, Objectivity and Citation Health. While super-complex questions may be infrequent in standard applications, Super Research serves as a critical ceiling evaluation and stress test for LLM capabilities. A model's proficiency within Super Research acts as a powerful proxy for its general research competence; success here suggests the robustness necessary to navigate nearly any subordinate research task. Leaderboard is available at: https://cnsdqd-dyb.github.io/Super-Research-Benchmark/
Chinese Translation
尽管大型语言模型(LLMs)在深度研究或广泛搜索方面表现出色,但它们解决高度复杂问题的能力——即那些需要长期规划、大规模证据收集以及跨异构来源综合的任务——仍然未得到充分探索。我们提出了超级研究(Super Research),这是一个针对复杂自主研究任务的任务设定,集成了(i)结构化分解为研究计划,(ii)超级广泛检索以获取多样化视角,以及(iii)通过迭代查询解决不确定性的超级深入调查。为了评估这一能力,我们策划了一个基准,包含300个跨不同领域的专家撰写的问题,每个问题需要多达100个以上的检索步骤和1000个以上的网页来调和相互矛盾的证据。超级研究生成可验证的报告,附有细致的引用和中间成果(例如,提纲和表格),以确保推理的可追溯性。此外,我们提出了一种基于图的审计协议,从覆盖范围、逻辑一致性、报告实用性、客观性和引用健康五个维度评估超级研究。尽管超级复杂问题在标准应用中可能不常见,但超级研究可作为LLM能力的关键上限评估和压力测试。模型在超级研究中的表现是其一般研究能力的强有力代理;在这里的成功表明其具备应对几乎任何次级研究任务所需的稳健性。排行榜可在以下网址获取:https://cnsdqd-dyb.github.io/Super-Research-Benchmark/
cs.CL / 23 / 2603.00612
From Literature to Hypotheses: An AI Co-Scientist System for Biomarker-Guided Drug Combination Hypothesis Generation
从文献到假设:一种用于生物标志物指导药物组合假设生成的人工智能共同科学家系统
Abstract
The rapid growth of biomedical literature and curated databases has made it increasingly difficult for researchers to systematically connect biomarker mechanisms to actionable drug combination hypotheses. We present AI Co-Scientist (CoDHy), an interactive, human-in-the-loop system for biomarker-guided drug combination hypothesis generation in cancer research. CoDHy integrates structured biomedical databases and unstructured literature evidence into a task-specific knowledge graph, which serves as the basis for graph-based reasoning and hypothesis construction. The system combines knowledge graph embeddings with agent-based reasoning to generate, validate, and rank candidate drug combinations, while explicitly grounding each hypothesis in retrievable evidence. Through a web-based interface, users can configure the scientific context, inspect intermediate results, and iteratively refine hypotheses, enabling transparent and researcher-steerable exploration rather than automated decision-making. We demonstrate CoDHy as a system for exploratory hypothesis generation and decision support in translational oncology, highlighting its design, interaction workflow, and practical use cases.
Chinese Translation
生物医学文献和经过整理的数据库的快速增长使得研究人员越来越难以系统性地将生物标志物机制与可操作的药物组合假设联系起来。我们提出了人工智能共同科学家系统(AI Co-Scientist, CoDHy),这是一个用于癌症研究的生物标志物指导药物组合假设生成的互动式人机协作系统。CoDHy将结构化的生物医学数据库和非结构化的文献证据整合成一个特定任务的知识图谱,作为基于图的推理和假设构建的基础。该系统结合了知识图谱嵌入和基于代理的推理,以生成、验证和排名候选药物组合,同时明确将每个假设基于可检索的证据。通过基于网络的界面,用户可以配置科学背景、检查中间结果并迭代地完善假设,从而实现透明且可由研究者引导的探索,而非自动化决策。我们展示了CoDHy作为一种用于转化肿瘤学的探索性假设生成和决策支持的系统,强调了其设计、交互工作流程和实际应用案例。
cs.CL / 24 / 2603.00620
QQ: A Toolkit for Language Identifiers and Metadata
QQ:语言识别器和元数据工具包
Abstract
The growing number of languages considered in multilingual NLP, including new datasets and tasks, poses challenges regarding properly and accurately reporting which languages are used and how. For example, datasets often use different language identifiers; some use BCP-47 (e.g. en_Latn), others use ISO 639-1 (en), and more linguistically oriented datasets use Glottocodes (stan1293). Mapping between identifiers is manageable for a few dozen languages, but becomes unscalable when dealing with thousands. We introduce QwanQwa, a light-weight Python toolkit for unified language metadata management. QQ integrates multiple language resources into a single interface, provides convenient normalization and mapping between language identifiers, and affords a graph-based structure that enables traversal across families, regions, writing systems, and other linguistic attributes. QQ serves both as (1) a simple "glue" library in multilingual NLP research to make working with many languages easier, and (2) as an intuitive way for exploring languages, such as finding related ones through shared scripts, regions or other metadata.
Chinese Translation
在多语言自然语言处理(NLP)中,考虑到越来越多的语言,包括新的数据集和任务,带来了关于如何正确和准确地报告使用哪些语言及其使用方式的挑战。例如,数据集通常使用不同的语言标识符;有些使用 BCP-47(例如 en_Latn),有些使用 ISO 639-1(en),而更多以语言学为导向的数据集使用 Glottocodes(stan1293)。在处理几十种语言时,标识符之间的映射是可管理的,但在处理数千种语言时则变得不可扩展。我们介绍了 QwanQwa,一个轻量级的 Python 工具包,用于统一语言元数据管理。QQ 将多个语言资源整合到一个单一接口中,提供便捷的语言标识符的标准化和映射,并提供一个基于图的结构,能够跨越语言家族、地区、书写系统和其他语言属性进行遍历。QQ 既可以作为(1)在多语言 NLP 研究中简化多种语言使用的简单“粘合”库,也可以作为(2)探索语言的直观方式,例如通过共享的书写系统、地区或其他元数据找到相关语言。
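The normalization and mapping QQ provides can be pictured as lookups over a unified metadata table keyed by any known identifier. The toy sketch below hard-codes two rows and is not the toolkit's actual API; the identifiers for English and Mandarin follow the schemes named in the abstract:

```python
# Toy unified table; the real QwanQwa toolkit integrates full BCP-47,
# ISO 639, and Glottolog resources.
TABLE = [
    {"bcp47": "en_Latn", "iso639_1": "en", "glottocode": "stan1293"},
    {"bcp47": "zh_Hans", "iso639_1": "zh", "glottocode": "mand1415"},
]

def normalize(identifier, to="bcp47"):
    """Map an identifier from any known scheme to the requested scheme;
    return None if the identifier is unknown."""
    for row in TABLE:
        if identifier in row.values():
            return row.get(to)
    return None

print(normalize("en", to="glottocode"))      # stan1293
print(normalize("stan1293", to="iso639_1"))  # en
```

Hand-maintaining such a table is exactly what becomes unscalable at thousands of languages, which is the gap the toolkit targets.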
cs.CL / 25 / 2603.00621
Piecing Together Cross-Document Coreference Resolution Datasets: Systematic Dataset Analysis and Unification
拼凑跨文档核心指代解析数据集:系统性数据集分析与统一
Abstract
Research in cross-document coreference resolution (CDCR) remains fragmented due to heterogeneous dataset formats, varying annotation standards, and the predominant framing of CDCR as event coreference resolution (ECR). To address these challenges, we introduce uCDCR, a unified dataset that consolidates diverse publicly available English CDCR corpora across various domains into a consistent format, which we analyze with standardized metrics and evaluation protocols. uCDCR incorporates both entity and event coreference, corrects known inconsistencies, and enriches datasets with missing attributes to facilitate reproducible research. We establish a cohesive framework for fair, interpretable, and cross-dataset analysis in CDCR and compare the datasets on their lexical properties, e.g., the lexical composition of annotated mentions and lexical diversity and ambiguity metrics, discuss the annotation rules and principles that lead to high lexical diversity, and examine how these metrics influence performance on the same-head-lemma baseline. Our dataset analysis shows that ECB+, the state-of-the-art benchmark for CDCR, has one of the lowest lexical diversities, and its CDCR complexity, measured by the same-head-lemma baseline, lies in the middle among all uCDCR datasets. Moreover, comparing document and mention distributions between ECB+ and uCDCR shows that using all uCDCR datasets for model training and evaluation will improve the generalizability of CDCR models. Finally, the almost identical performance of the same-head-lemma baseline, applied separately to events and entities, shows that resolving both types is a complex task and should not be steered toward ECR alone. The uCDCR dataset is available at https://huggingface.co/datasets/AnZhu/uCDCR, and the code for parsing, analyzing, and scoring the dataset is available at https://github.com/anastasia-zhukova/uCDCR.
Chinese Translation
由于数据集格式异构、注释标准不一以及将跨文档核心指代解析(CDCR)定义为事件核心指代解析(ECR)的主导地位,CDCR 研究仍然处于碎片化状态。为了解决这些挑战,我们提出了 uCDCR,这是一个统一的数据集,将来自不同领域的多种公开可用的英语 CDCR 语料库整合为一致的格式,并使用标准化的指标和评估协议进行分析。uCDCR 涵盖了实体和事件核心指代,纠正了已知的不一致性,并为缺失属性的数据集进行了丰富,以促进可重复的研究。我们建立了一个统一的框架,以实现 CDCR 中公平、可解释和跨数据集的分析,并比较了数据集的词汇特性,例如注释提及的词汇组成、词汇多样性和模糊性指标,讨论了导致高词汇多样性的注释规则和原则,并考察了这些指标如何影响同头词基线的性能。我们的数据集分析显示,作为 CDCR 的最先进基准,ECB+ 的词汇多样性是最低的之一,其 CDCR 复杂性(通过同头词基线测量)在所有 uCDCR 数据集中处于中间水平。此外,比较 ECB+ 和 uCDCR 之间的文档和提及分布表明,使用所有 uCDCR 数据集进行模型训练和评估将提高 CDCR 模型的泛化能力。最后,在单独应用于事件和实体的同头词基线上的几乎相同的性能表明,解决这两种类型的任务是复杂的,不应仅仅朝向 ECR 方向发展。uCDCR 数据集可在 https://huggingface.co/datasets/AnZhu/uCDCR 获取,解析、分析和评分该数据集的代码可在 https://github.com/anastasia-zhukova/uCDCR 获取。
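The same-head-lemma baseline referenced throughout is straightforward to sketch: mentions sharing a (case-folded) head lemma are placed in the same coreference cluster. The mention tuples below are illustrative; lemmatization is assumed to have been done upstream:

```python
from collections import defaultdict

def same_head_lemma_clusters(mentions):
    """Baseline clustering: group mentions by their head lemma.
    Each mention is a (mention_id, head_lemma) pair."""
    clusters = defaultdict(list)
    for mid, lemma in mentions:
        clusters[lemma.lower()].append(mid)
    return [sorted(ms) for ms in clusters.values()]

mentions = [("m1", "earthquake"), ("m2", "quake"),
            ("m3", "Earthquake"), ("m4", "collapse")]
print(same_head_lemma_clusters(mentions))
# [['m1', 'm3'], ['m2'], ['m4']]
```

A dataset where this baseline scores highly (e.g. because coreferent mentions are lexically repetitive) is lexically easy, which is how the paper uses it to gauge corpus difficulty.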
cs.CL / 26 / 2603.00634
BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages
BLUFF:在58种低资源语言中评估虚假和合成内容的检测
Abstract
Multilingual falsehoods threaten information integrity worldwide, yet detection benchmarks remain confined to English or a few high-resource languages, leaving low-resource linguistic communities without robust defense tools. We introduce BLUFF, a comprehensive benchmark for detecting false and synthetic content, spanning 79 languages with over 202K samples, combining human-written fact-checked content (122K+ samples across 57 languages) and LLM-generated content (79K+ samples across 71 languages). BLUFF uniquely covers both high-resource "big-head" (20) and low-resource "long-tail" (59) languages, addressing critical gaps in multilingual research on detecting false and synthetic content. Our dataset features four content types (human-written, LLM-generated, LLM-translated, and hybrid human-LLM text), bidirectional translation (English$\leftrightarrow$X), 39 textual modification techniques (36 manipulation tactics for fake news, 3 AI-editing strategies for real news), and varying edit intensities generated using 19 diverse LLMs. We present AXL-CoI (Adversarial Cross-Lingual Agentic Chain-of-Interactions), a novel multi-agentic framework for controlled fake/real news generation, paired with mPURIFY, a quality filtering pipeline ensuring dataset integrity. Experiments reveal state-of-the-art detectors suffer up to 25.3% F1 degradation on low-resource versus high-resource languages. BLUFF provides the research community with a multilingual benchmark, extensive linguistic-oriented benchmark evaluation, comprehensive documentation, and open-source tools to advance equitable falsehood detection. Dataset and code are available at: https://jsl5710.github.io/BLUFF/
Chinese Translation
多语言虚假信息威胁着全球信息的完整性,但检测基准仍局限于英语或少数高资源语言,使低资源语言社区缺乏有效的防御工具。我们提出了BLUFF,这是一个全面的虚假和合成内容检测基准,涵盖79种语言,包含超过202K样本,结合了人类撰写的经过事实核查的内容(跨57种语言的122K+样本)和大型语言模型(LLM)生成的内容(跨71种语言的79K+样本)。BLUFF独特地涵盖了高资源的“头部”语言(20种)和低资源的“长尾”语言(59种),填补了多语言研究中关于虚假和合成内容检测的关键空白。我们的数据集包含四种内容类型(人类撰写、LLM生成、LLM翻译和混合人类-LLM文本),双向翻译(英语$\leftrightarrow$X),39种文本修改技术(36种假新闻操控策略,3种真实新闻的AI编辑策略),以及使用19种不同的LLM生成的不同编辑强度。我们提出了AXL-CoI(对抗性跨语言代理交互链),这是一个新颖的多代理框架,用于控制虚假/真实新闻的生成,并配备了mPURIFY,一个确保数据集完整性的质量过滤管道。实验表明,相较于高资源语言,最先进的检测器在低资源语言上的F1得分下降高达25.3%。BLUFF为研究社区提供了一个多语言基准、广泛的语言导向基准评估、全面的文档和开源工具,以推动公平的虚假信息检测。数据集和代码可在以下链接获取:https://jsl5710.github.io/BLUFF/
cs.CL / 27 / 2603.00669
SSKG Hub: An Expert-Guided Platform for LLM-Empowered Sustainability Standards Knowledge Graphs
SSKG Hub:一个专家指导的 LLM 驱动的可持续性标准知识图谱平台
Abstract
Sustainability disclosure standards (e.g., GRI, SASB, TCFD, IFRS S2) are comprehensive yet lengthy, terminology-dense, and highly cross-referential, hindering structured analysis and downstream use. We present SSKG Hub (Sustainability Standards Knowledge Graph Hub), a research prototype and interactive web platform that transforms standards into auditable knowledge graphs (KGs) through an LLM-centered, expert-guided pipeline. The system integrates automatic standard identification, configurable chunking, standard-specific prompting, robust triple parsing, and provenance-aware Neo4j storage with fine-grained audit metadata. LLM extraction produces a provenance-linked Draft KG, which is reviewed, curated, and formally promoted to a Certified KG through meta-expert adjudication. A role-based governance framework covering read-only guest access, expert review and CRUD operations, meta-expert certification, and administrative oversight ensures traceability and accountability across draft and certified states. Beyond graph exploration and triple-level evidence tracing, SSKG Hub supports cross-KG fusion, KG-driven tasks, and dedicated modules for insights and curated resources. We validate the platform through a comprehensive expert-led KG review case study that demonstrates end-to-end curation and quality assurance. The web application is publicly available at www.sskg-hub.com.
Chinese Translation
可持续性披露标准(例如,GRI、SASB、TCFD、IFRS S2)内容全面但冗长,术语密集且高度交叉引用,这阻碍了结构化分析和后续使用。我们提出了 SSKG Hub(可持续性标准知识图谱中心),这是一个研究原型和互动网络平台,通过以 LLM 为中心的专家指导流程将标准转化为可审计的知识图谱(KGs)。该系统集成了自动标准识别、可配置分块、标准特定提示、强大的三元组解析和具有细粒度审计元数据的 Neo4j 存储。LLM 提取生成一个与来源相关的草稿 KG,经过审查、策划,并通过元专家裁定正式提升为认证 KG。基于角色的治理框架涵盖只读访客访问、专家审查和 CRUD 操作、元专家认证以及行政监督,确保草稿和认证状态之间的可追溯性和问责制。除了图谱探索和三元组级证据追踪,SSKG Hub 还支持跨 KG 融合、KG 驱动的任务以及专门的洞察和策划资源模块。我们通过一个全面的专家主导的 KG 审查案例研究验证了该平台,展示了端到端的策划和质量保证。该网络应用程序可在 www.sskg-hub.com 上公开访问。
cs.CL / 28 / 2603.00683
Polynomial Mixing for Efficient Self-supervised Speech Encoders
高效自监督语音编码器的多项式混合
Abstract
State-of-the-art speech-to-text models typically employ Transformer-based encoders that model token dependencies via self-attention mechanisms. However, the quadratic complexity of self-attention in both memory and computation imposes significant constraints on scalability. In this work, we propose a novel token-mixing mechanism, the Polynomial Mixer (PoM), as a drop-in replacement for multi-head self-attention. PoM computes a polynomial representation of the input with linear complexity with respect to the input sequence length. We integrate PoM into a self-supervised speech representation learning framework based on BEST-RQ and evaluate its performance on downstream speech recognition tasks. Experimental results demonstrate that PoM achieves a competitive word error rate compared to full self-attention and other linear-complexity alternatives, offering an improved trade-off between performance and efficiency in time and memory.
Chinese Translation
最先进的语音转文本模型通常采用基于Transformer的编码器,通过自注意力机制建模标记依赖关系。然而,自注意力在内存和计算方面的平方复杂度对可扩展性施加了显著的限制。在本研究中,我们提出了一种新颖的标记混合机制——多项式混合器(Polynomial Mixer, PoM),作为多头自注意力的即插即用替代方案。PoM以相对于输入序列长度为线性的复杂度计算输入的多项式表示。我们将PoM集成到基于BEST-RQ的自监督语音表示学习框架中,并评估其在下游语音识别任务上的表现。实验结果表明,PoM在词错误率方面与完整自注意力和其他线性复杂度替代方案相比具有竞争力,在性能与时间和内存效率之间提供了更好的权衡。
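The abstract does not spell out the PoM update rule, so the sketch below is only an assumption-laden illustration of the general idea: a token mixer can stay linear in sequence length by carrying running sums of low-degree (here first- and second-degree) prefix statistics instead of computing pairwise attention. It is not the paper's operator:

```python
def linear_time_mix(tokens):
    """Illustrative linear-complexity token mixing (NOT the paper's exact
    PoM): each output combines the current token with running first- and
    second-degree prefix moments, so total cost is O(n * d) rather than
    the O(n^2 * d) of pairwise self-attention."""
    d = len(tokens[0])
    s1 = [0.0] * d  # running sum of x          (degree-1 monomials)
    s2 = [0.0] * d  # running sum of x*x, elementwise (degree-2 monomials)
    out = []
    for t, x in enumerate(tokens, start=1):
        s1 = [a + b for a, b in zip(s1, x)]
        s2 = [a + b * b for a, b in zip(s2, x)]
        # mix the token with the averaged prefix moments
        out.append([xi + m1 / t + m2 / t for xi, m1, m2 in zip(x, s1, s2)])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = linear_time_mix(seq)
print(mixed[0])  # [3.0, 0.0]
```

The key property shared with any linear-complexity mixer is that the per-token state (`s1`, `s2`) has fixed size, independent of how many tokens came before.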
cs.CL / 29 / 2603.00686
RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis
RAVEL:用于验证和评估大型语言模型文本合成的推理代理
Abstract
Large Language Models have evolved from single-round generators into long-horizon agents, capable of complex text synthesis scenarios. However, current evaluation frameworks lack the ability to assess the actual synthesis operations, such as outlining, drafting, and editing. Consequently, they fail to evaluate the actual and detailed capabilities of LLMs. To bridge this gap, we introduce RAVEL, an agentic framework that enables the LLM testers to autonomously plan and execute typical synthesis operations, including outlining, drafting, reviewing, and refining. Complementing this framework, we present C3EBench, a comprehensive benchmark comprising 1,258 samples derived from professional human writings. We utilize a "reverse-engineering" pipeline to isolate specific capabilities across four tasks: Cloze, Edit, Expand, and End-to-End. Through our analysis of 14 LLMs, we uncover that most LLMs struggle with tasks that demand contextual understanding under limited or under-specified instructions. By augmenting RAVEL with SOTA LLMs as operators, we find that such agentic text synthesis is dominated by the LLM's reasoning capability rather than raw generative capacity. Furthermore, we find that a strong reasoner can guide a weaker generator to yield higher-quality results, whereas the inverse does not hold. Our code and data are available at this link: https://github.com/ZhuoerFeng/RAVEL-Reasoning-Agents-Text-Eval.
Chinese Translation
大型语言模型已经从单轮生成器演变为能够处理复杂文本合成场景的长时代理。然而,目前的评估框架缺乏评估实际合成操作的能力,例如大纲编写、草稿撰写和编辑。因此,它们未能评估大型语言模型的实际和详细能力。为了解决这一问题,我们提出了RAVEL,一个代理框架,使大型语言模型测试者能够自主规划和执行典型的合成操作,包括大纲编写、草稿撰写、审阅和精炼。作为这一框架的补充,我们提出了C3EBench,一个由1,258个样本构成的综合基准,这些样本来源于专业人类写作。我们利用“逆向工程”流程来隔离四项任务中的特定能力:填空(Cloze)、编辑(Edit)、扩展(Expand)和端到端(End-to-End)。通过对14个大型语言模型的分析,我们发现大多数大型语言模型在指令有限或不明确、需要上下文理解的任务中表现不佳。通过将RAVEL与最先进的大型语言模型作为操作员结合,我们发现这种代理文本合成主要由大型语言模型的推理能力而非单纯的生成能力所主导。此外,我们发现强大的推理者可以引导较弱的生成者产生更高质量的结果,而反之则不成立。我们的代码和数据可在以下链接获取:https://github.com/ZhuoerFeng/RAVEL-Reasoning-Agents-Text-Eval。
cs.CL / 30 / 2603.00696
DRIV-EX: Counterfactual Explanations for Driving LLMs
DRIV-EX:驾驶大型语言模型的反事实解释
Abstract
Large language models (LLMs) are increasingly used as reasoning engines in autonomous driving, yet their decision-making remains opaque. We propose to study their decision process through counterfactual explanations, which identify the minimal semantic changes to a scene description required to alter a driving plan. We introduce DRIV-EX, a method that leverages gradient-based optimization on continuous embeddings to identify the input shifts required to flip the model's decision. Crucially, to avoid the incoherent text typical of unconstrained continuous optimization, DRIV-EX uses these optimized embeddings solely as a semantic guide: they are used to bias a controlled decoding process that re-generates the original scene description. This approach effectively steers the generation toward the counterfactual target while guaranteeing the linguistic fluency, domain validity, and proximity to the original input, essential for interpretability. Evaluated using the LC-LLM planner on a textual transcription of the highD dataset, DRIV-EX generates valid, fluent counterfactuals more reliably than existing baselines. It successfully exposes latent biases and provides concrete insights to improve the robustness of LLM-based driving agents.
Chinese Translation
大型语言模型(LLMs)越来越多地被用作自动驾驶中的推理引擎,但它们的决策过程仍然不透明。我们提出通过反事实解释来研究它们的决策过程,反事实解释识别出改变驾驶计划所需的场景描述的最小语义变化。我们引入了DRIV-EX,这是一种利用基于梯度的优化在连续嵌入上识别所需输入变化以翻转模型决策的方法。重要的是,为了避免无约束连续优化所典型的无序文本,DRIV-EX仅将这些优化后的嵌入用作语义指导:它们用于偏向一个受控的解码过程,该过程重新生成原始场景描述。这种方法有效地将生成过程引导至反事实目标,同时保证语言流畅性、领域有效性和与原始输入的接近性,这对于可解释性至关重要。在使用LC-LLM规划器对highD数据集的文本转录进行评估时,DRIV-EX比现有基线更可靠地生成有效、流畅的反事实。它成功揭示了潜在偏见,并提供了具体的见解,以提高基于LLM的驾驶代理的鲁棒性。
cs.CL / 31 / 2603.00718
SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?
SkillCraft:大型语言模型代理能否熟练使用工具?
Abstract
Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting and reusing higher-level tool compositions. However, existing benchmarks mainly measure instance-level success under static tool sets, offering limited insight into agents' ability to acquire such reusable skills. We address this gap by introducing SkillCraft, a benchmark that explicitly stress-tests agents' ability to form and reuse higher-level tool compositions, which we call Skills. SkillCraft features realistic, highly compositional tool-use scenarios with difficulty scaled along both quantitative and structural dimensions, designed to elicit skill abstraction and cross-task reuse. We further propose a lightweight evaluation protocol that enables agents to auto-compose atomic tools into executable Skills, and to cache and reuse them within and across tasks, thereby improving efficiency while accumulating a persistent library of reusable skills. Evaluating state-of-the-art agents on SkillCraft, we observe substantial efficiency gains, with token usage reduced by up to 80% through skill saving and reuse. Moreover, success rate strongly correlates with tool composition ability at test time, underscoring compositional skill acquisition as a core capability.
Chinese Translation
现实世界中的工具使用代理在具有重复结构和多样化需求的长期工作流程中运作,有效的行为不仅需要调用原子工具,还需要抽象和重用更高层次的工具组合。然而,现有基准主要在静态工具集下测量实例级成功,提供了有限的洞察力来评估代理获取可重用技能的能力。我们通过引入SkillCraft来填补这一空白,该基准明确测试代理形成和重用更高层次工具组合的能力,我们称之为技能(Skills)。SkillCraft具有现实的、高度组合的工具使用场景,其难度在定量和结构维度上进行缩放,旨在引发技能抽象和跨任务重用。我们进一步提出了一种轻量级评估协议,使代理能够自动组合原子工具成可执行的技能,在任务内外缓存和重用这些技能,从而提高效率,同时积累持久的可重用技能库。在SkillCraft上评估最先进的代理时,我们观察到显著的效率提升,技能保存和重用使得令牌使用量减少了多达80%。此外,成功率与测试时的工具组合能力强相关,强调了组合技能获取作为核心能力的重要性。
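The cache-and-reuse protocol can be sketched as a small skill library in which atomic tools are composed once into a named Skill and then invoked across tasks without re-planning. The class and tool names are hypothetical:

```python
class SkillLibrary:
    """Illustrative skill cache: atomic tools are composed into a named
    Skill once, then reused within and across tasks."""

    def __init__(self):
        self.skills = {}

    def compose(self, name, steps):
        """Register a Skill as an ordered pipeline of atomic tool calls."""
        def skill(state):
            for tool in steps:
                state = tool(state)
            return state
        self.skills[name] = skill
        return skill

    def invoke(self, name, state):
        """Reuse a cached Skill instead of re-deriving the tool sequence."""
        return self.skills[name](state)

# Hypothetical atomic tools operating on a simple state list.
fetch = lambda s: s + ["fetched"]
parse = lambda s: s + ["parsed"]

lib = SkillLibrary()
lib.compose("fetch_and_parse", [fetch, parse])  # abstracted once
print(lib.invoke("fetch_and_parse", []))        # reused thereafter
```

The token savings reported above come from exactly this substitution: one cached skill invocation replaces a repeated multi-step planning trace.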
cs.CL / 32 / 2603.00724
RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models
RLAR:一种用于大语言模型多任务强化学习的代理奖励系统
Abstract
Large language model alignment via reinforcement learning depends critically on reward function quality. However, static, domain-specific reward models are often costly to train and exhibit poor generalization in out-of-distribution scenarios encountered during RL iterations. We present RLAR (Reinforcement Learning from Agent Rewards), an agent-driven framework that dynamically assigns tailored reward functions to individual queries. Specifically, RLAR transforms reward acquisition into a dynamic tool synthesis and invocation task. It leverages LLM agents to autonomously retrieve optimal reward models from the Internet and synthesize programmatic verifiers through code generation. This allows the reward system to self-evolve with the shifting data distributions during training. Experimental results demonstrate that RLAR yields consistent performance gains ranging from 10 to 60 across mathematics, coding, translation, and dialogue tasks. On RewardBench-V2, RLAR significantly outperforms static baselines and approaches the performance upper bound, demonstrating superior generalization through dynamic reward orchestration. The data and code are available on this link: https://github.com/ZhuoerFeng/RLAR.
Chinese Translation
通过强化学习实现大语言模型的对齐在很大程度上依赖于奖励函数的质量。然而,静态的、特定领域的奖励模型通常训练成本高,并且在强化学习迭代过程中遇到的分布外场景中表现出较差的泛化能力。我们提出了RLAR(来自代理奖励的强化学习),这是一个代理驱动的框架,能够动态地为个别查询分配量身定制的奖励函数。具体而言,RLAR将奖励获取转化为一个动态工具合成和调用任务。它利用大语言模型代理自主从互联网检索最佳奖励模型,并通过代码生成合成程序验证器。这使得奖励系统能够随着训练过程中数据分布的变化而自我演化。实验结果表明,RLAR在数学、编码、翻译和对话任务中均实现了10到60的一致性能提升。在RewardBench-V2上,RLAR显著优于静态基线,并接近性能上限,通过动态奖励编排展现出卓越的泛化能力。数据和代码可在此链接获取:https://github.com/ZhuoerFeng/RLAR。
cs.CL / 33 / 2603.00725
LaSTR: Language-Driven Time-Series Segment Retrieval
LaSTR:基于语言的时间序列片段检索
Abstract
Effectively searching time-series data is essential for system analysis, but existing methods often require expert-designed similarity criteria or rely on global, series-level descriptions. We study language-driven segment retrieval: given a natural language query, the goal is to retrieve relevant local segments from large time-series repositories. We build large-scale segment--caption training data by applying TV2-based segmentation to LOTSA windows and generating segment descriptions with GPT-5.2, and then train a Conformer-based contrastive retriever in a shared text--time-series embedding space. On a held-out test split, we evaluate single-positive retrieval together with caption-side consistency (SBERT and VLM-as-a-judge) under multiple candidate pool sizes. Across all settings, LaSTR outperforms random and CLIP baselines, yielding improved ranking quality and stronger semantic agreement between retrieved segments and query intent.
Chinese Translation
有效地搜索时间序列数据对于系统分析至关重要,但现有方法通常需要专家设计的相似性标准或依赖于全局的系列级描述。我们研究基于语言的片段检索:给定自然语言查询,目标是从大型时间序列库中检索相关的局部片段。我们通过对 LOTSA 窗口应用基于 TV2 的分割方法,并使用 GPT-5.2 生成片段描述,构建大规模的片段-描述训练数据,然后在共享的文本-时间序列嵌入空间中训练基于 Conformer 的对比检索器。在保留的测试集上,我们在多个候选池大小下评估单正例检索以及描述侧一致性(SBERT 和 VLM-as-a-judge)。在所有设置中,LaSTR 的表现优于随机和 CLIP 基线,提供了更好的排名质量和检索片段与查询意图之间更强的语义一致性。
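Once text and segments share an embedding space, retrieval at inference time reduces to nearest-neighbor ranking. A minimal sketch, with toy 2-d embeddings standing in for the trained text encoder and Conformer segment encoder:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_emb, segment_embs, k=2):
    """Rank time-series segments by similarity to the query embedding in
    the shared space and return the top-k segment ids."""
    ranked = sorted(segment_embs,
                    key=lambda item: cosine(query_emb, item[1]),
                    reverse=True)
    return [seg_id for seg_id, _ in ranked[:k]]

# Toy segment embeddings (id, vector); real ones come from the encoders.
segments = [("spike", [0.9, 0.1]), ("flat", [0.1, 0.9]), ("ramp", [0.7, 0.7])]
print(retrieve([1.0, 0.0], segments, k=2))  # ['spike', 'ramp']
```

The contrastive training objective is what makes this simple ranking meaningful: matched caption/segment pairs are pulled together and mismatched pairs pushed apart.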
cs.CL / 34 / 2603.00729
Qwen3-Coder-Next Technical Report
Qwen3-Coder-Next 技术报告
Abstract
We present Qwen3-Coder-Next, an open-weight language model specialized for coding agents. Qwen3-Coder-Next is an 80-billion-parameter model that activates only 3 billion parameters during inference, enabling strong coding capability with efficient inference. In this work, we explore how far strong training recipes can push the capability limits of models with small parameter footprints. To achieve this, we perform agentic training through large-scale synthesis of verifiable coding tasks paired with executable environments, allowing learning directly from environment feedback via mid-training and reinforcement learning. Across agent-centric benchmarks including SWE-Bench and Terminal-Bench, Qwen3-Coder-Next achieves competitive performance relative to its active parameter count. We release both base and instruction-tuned open-weight versions to support research and real-world coding agent development.
Chinese Translation
我们提出了 Qwen3-Coder-Next,这是一个专门为编码代理设计的开放权重语言模型。Qwen3-Coder-Next 是一个拥有 800 亿参数的模型,在推理过程中仅激活 30 亿参数,从而实现强大的编码能力和高效的推理。在本研究中,我们探讨了强大的训练方案能够将小参数模型的能力极限推向多远。为此,我们通过大规模合成可验证的编码任务及其配套可执行环境进行代理训练,使模型能够通过中期训练和强化学习直接从环境反馈中学习。在包括 SWE-Bench 和 Terminal-Bench 在内的以代理为中心的基准测试中,相对于其激活参数数量,Qwen3-Coder-Next 取得了具有竞争力的性能。我们发布了基础版和指令调优版的开放权重版本,以支持研究和实际编码代理的开发。
cs.CL / 35 / 2603.00823
A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction
多轮交互下大语言模型遗忘鲁棒性的综合评估
Abstract
Machine unlearning aims to remove the influence of specific training data from pre-trained models without retraining from scratch, and is increasingly important for large language models (LLMs) due to safety, privacy, and legal concerns. Although prior work primarily evaluates unlearning in static, single-turn settings, forgetting robustness under realistic interactive use remains underexplored. In this paper, we study whether unlearning remains stable in interactive environments by examining two common interaction patterns: self-correction and dialogue-conditioned querying. We find that knowledge appearing forgotten in static evaluation can often be recovered through interaction. Although stronger unlearning improves apparent robustness, it often results in behavioral rigidity rather than genuine knowledge erasure. Our findings suggest that static evaluation may overestimate real-world effectiveness and highlight the need for ensuring stable forgetting under interactive settings.
Chinese Translation
机器遗忘旨在在不从头重新训练的情况下,消除特定训练数据对预训练模型的影响;出于安全性、隐私和法律方面的考量,这对大语言模型(LLMs)而言日益重要。尽管先前的研究主要在静态的单轮环境中评估遗忘,但在现实交互使用中的遗忘鲁棒性仍然未被充分探讨。本文通过考察两种常见的交互模式:自我修正和对话条件查询,研究了遗忘在交互环境中的稳定性。我们发现,在静态评估中显得被遗忘的知识,往往可以通过交互恢复。尽管更强的遗忘提高了表面上的鲁棒性,但它常常导致行为的僵化,而非真正的知识抹除。我们的研究结果表明,静态评估可能高估了现实世界的有效性,并强调了在交互环境中确保稳定遗忘的必要性。
cs.CL / 36 / 2603.00829
Constitutional Black-Box Monitoring for Scheming in LLM Agents
针对大型语言模型代理策划行为的宪法黑箱监控
Abstract
Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held-out trajectories from ControlArena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on synthetic data can generalize to more realistic environments, capturing a meaningful scheming signal. However, we find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization. Pushing beyond this limit yields no further improvements and instead leads to overfitting.
Chinese Translation
在自主环境中安全部署大型语言模型(LLM)代理需要可靠的监督机制。一个核心挑战是检测策划行为,即代理秘密追求不一致的目标。缓解此类风险的一种方法是基于LLM的监控:使用语言模型检查代理行为以识别可疑行动。我们研究了宪法黑箱监控器:仅使用外部可观察的输入和输出来检测策划的提示分类器,这些分类器在从自然语言行为规范生成的合成数据上进行了优化。我们介绍了两种生成合成代理轨迹的管道,分别为STRIDE(迭代精炼)和Gloom(代理-环境模拟),每种管道生成1,000个样本。我们通过提示扫描、人工精炼和自动提示优化在这些数据集上优化前沿LLM监控器,并在ControlArena的7,500个保留轨迹上评估其性能;ControlArena是一套基于现实场景的环境,代理在其中以更贴近实际的上下文运行。我们的结果表明,仅基于合成数据选择的监控器可以推广到更真实的环境中,捕捉到有意义的策划信号。然而,我们发现,在我们的设置中,性能迅速饱和,简单的提示扫描与更广泛的优化结果相匹配。超越这一限制不会带来进一步的改进,反而会导致过拟合。
cs.CL / 37 / 2603.00840
Learning Nested Named Entity Recognition from Flat Annotations
从扁平注释中学习嵌套命名实体识别
Abstract
Nested named entity recognition identifies entities contained within other entities, but requires expensive multi-level annotation. While flat NER corpora exist abundantly, nested resources remain scarce. We investigate whether models can learn nested structure from flat annotations alone, evaluating four approaches: string inclusions (substring matching), entity corruption (pseudo-nested data), flat neutralization (reducing false negative signal), and a hybrid fine-tuned + LLM pipeline. On NEREL, a Russian benchmark with 29 entity types where 21% of entities are nested, our best combined method achieves 26.37% inner F1, closing 40% of the gap to full nested supervision. Code is available at https://github.com/fulstock/Learning-from-Flat-Annotations.
Chinese Translation
嵌套命名实体识别识别包含在其他实体中的实体,但需要昂贵的多层注释。虽然扁平命名实体识别(NER)语料库丰富,但嵌套资源仍然稀缺。我们研究模型是否能够仅从扁平注释中学习嵌套结构,评估了四种方法:字符串包含(子字符串匹配)、实体损坏(伪嵌套数据)、扁平中和(减少假阴性信号)以及混合微调 + 大语言模型(LLM)管道。在 NEREL 数据集上,这是一个包含 29 种实体类型的俄语基准,其中 21% 的实体是嵌套的,我们的最佳组合方法实现了 26.37% 的内部 F1,缩小了与完全嵌套监督之间 40% 的差距。代码可在 https://github.com/fulstock/Learning-from-Flat-Annotations 获取。
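The first approach the abstract names, string inclusions, can be made concrete with a small sketch (hypothetical span data and representation, not the released code): among flat annotations, any entity span lying strictly inside another annotated span becomes a candidate nested entity.

```python
# Illustrative sketch of the "string inclusions" heuristic: entities are
# (start, end, label) tuples with `end` exclusive; an entity contained in
# a longer annotated span is a candidate inner (nested) entity.

def find_nested_candidates(entities):
    """Return (outer, inner) pairs where inner lies strictly inside outer."""
    pairs = []
    for outer in entities:
        for inner in entities:
            if inner is outer:
                continue
            contained = outer[0] <= inner[0] and inner[1] <= outer[1]
            if contained and (inner[0], inner[1]) != (outer[0], outer[1]):
                pairs.append((outer, inner))
    return pairs

# Flat annotations over one hypothetical sentence:
flat = [(0, 25, "ORG"), (0, 6, "LOC"), (30, 35, "PER")]
print(find_nested_candidates(flat))
# The LOC span [0, 6) is contained in the ORG span [0, 25)
```

The real pipeline layers pseudo-nested data and loss adjustments on top of such candidates; this only shows the containment test itself.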
cs.CL / 38 / 2603.00842
MedGPT-oss: Training a General-Purpose Vision-Language Model for Biomedicine
MedGPT-oss:为生物医学训练通用视觉-语言模型
Abstract
Biomedical multimodal assistants have the potential to unify radiology, pathology, and clinical-text reasoning, yet a critical deployment gap remains: top-performing systems are either closed-source or computationally prohibitive, precluding the on-premises deployment required for patient privacy and PHI compliance. We introduce MEDGPT-OSS, an open-weight, 20B-parameter generalist vision-language model designed to facilitate open research in clinical AI. Rather than relying on architectural complexity, MEDGPT-OSS pairs the GPT-oss language backbone with a visual front-end via an optimized, three-stage training curriculum. By progressively domain-adapting these modules through rigorous data curation and long-context multimodal alignment, we demonstrate that a 20B model can bridge the capacity gap. It successfully outperforms larger open medical models on out-of-distribution (OOD) multimodal reasoning and complex text-only clinical tasks. By unifying diverse modalities under a single instruction-following interface, MEDGPT-OSS maintains a parameter-efficient footprint fully compatible with commodity GPUs. We release the complete training recipe, open-weight checkpoints, and a rigorous evaluation harness to serve as a verifiable foundation for privacy-preserving, institution-specific clinical AI research.
Chinese Translation
生物医学多模态助手有潜力统一放射学、病理学和临床文本推理,但仍存在一个关键的部署差距:表现最佳的系统要么是闭源的,要么计算成本高昂,无法满足患者隐私和个人健康信息(PHI)合规所需的本地部署。我们介绍了MEDGPT-OSS,这是一个开放权重、20B参数的通用视觉-语言模型,旨在促进临床人工智能的开放研究。MEDGPT-OSS并不依赖于复杂的架构,而是通过优化的三阶段训练课程,将GPT-oss语言主干与视觉前端相结合。通过严格的数据策划和长上下文多模态对齐,逐步对这些模块进行领域适应,我们展示了一个20B模型可以弥补能力差距。它在分布外(OOD)多模态推理和复杂的仅文本临床任务上成功超越了更大规模的开放医疗模型。通过在单一的遵循指令的接口下统一多种模态,MEDGPT-OSS保持了与消费级GPU完全兼容的参数高效性。我们发布了完整的训练方案、开放权重检查点和严格的评估工具,以作为可验证的基础,支持隐私保护和特定机构的临床人工智能研究。
cs.CL / 39 / 2603.00889
CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning
CHIMERA:用于可泛化大型语言模型推理的紧凑合成数据
Abstract
Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by supervised fine-tuning (SFT)- and reinforcement learning (RL)-based post-training on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) limited domain coverage, as most existing open-source reasoning datasets are concentrated in mathematics, with limited coverage of broader scientific disciplines; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it provides rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models; (2) it has broad and structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) it employs a fully automated, scalable evaluation pipeline that uses strong reasoning models to cross-validate both problem validity and answer correctness. We use CHIMERA to post-train a 4B Qwen3 model. Despite the dataset's modest size, the resulting model achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity's Last Exam, approaching or matching the reasoning performance of substantially larger models such as DeepSeek-R1 and Qwen3-235B.
Chinese Translation
大型语言模型(LLMs)最近展现了显著的推理能力,这在很大程度上得益于在高质量推理数据上进行的监督微调(SFT)和基于强化学习(RL)的后训练。然而,在开放和可扩展的环境中重现和扩展这些能力面临三个基本的数据中心挑战:(1)冷启动问题,由于缺乏详细的长链思维(CoT)轨迹的种子数据集,无法初始化推理策略;(2)有限的领域覆盖,大多数现有的开源推理数据集集中在数学领域,覆盖更广泛科学学科的能力有限;(3)注释瓶颈,前沿推理任务的困难使得可靠的人类注释成本高昂或不可行。为了解决这些挑战,我们引入了CHIMERA,一个包含9000个样本的紧凑合成推理数据集,旨在实现可泛化的跨领域推理。CHIMERA具有三个关键特性:(1)提供由最先进的推理模型合成的丰富、长链思维推理轨迹;(2)具有广泛且结构化的覆盖,涵盖8个主要科学学科和超过1000个通过模型生成的层次分类法组织的细分主题;(3)采用完全自动化、可扩展的评估管道,利用强大的推理模型对问题的有效性和答案的正确性进行交叉验证。我们使用CHIMERA对一个4B的Qwen3模型进行后训练。尽管数据集规模适中,所得到的模型在一系列具有挑战性的推理基准测试中表现出色,包括GPQA-Diamond、AIME 24/25/26、HMMT 25和Humanity's Last Exam,接近或匹配了如DeepSeek-R1和Qwen3-235B等显著更大模型的推理性能。
cs.CL / 40 / 2603.00907
KVSlimmer: Theoretical Insights and Practical Optimizations for Asymmetric KV Merging
KVSlimmer:非对称键值合并的理论洞察与实践优化
Abstract
The growing computational and memory demands of the Key-Value (KV) cache significantly limit the ability of Large Language Models (LLMs). While KV merging has emerged as a promising solution, existing methods that rely on empirical observations of KV asymmetry and gradient-based Hessian approximations lack a theoretical foundation and incur suboptimal compression and inference overhead. To bridge these gaps, we establish a theoretical framework that characterizes this asymmetry through the spectral energy distribution of projection weights, demonstrating that concentrated spectra in Query/Key weights induce feature homogeneity, whereas dispersed spectra in Value weights preserve heterogeneity. Then, we introduce KVSlimmer, an efficient algorithm that captures exact Hessian information through a mathematically exact formulation, and derives a closed-form solution utilizing only forward-pass variables, resulting in a gradient-free approach that is both memory- and time-efficient. Extensive experiments across various models and benchmarks demonstrate that KVSlimmer consistently outperforms SOTA methods. For instance, on Llama3.1-8B-Instruct, it improves the LongBench average score by 0.92 while reducing memory costs and latency by 29% and 28%, respectively.
Chinese Translation
键值(KV)缓存日益增长的计算和内存需求显著限制了大型语言模型(LLMs)的能力。尽管KV合并已成为一种有前景的解决方案,但现有方法依赖于对KV非对称性和基于梯度的Hessian近似的经验观察,缺乏理论基础,并导致次优的压缩和推理开销。为了填补这些空白,我们建立了一个理论框架,通过投影权重的谱能量分布来表征这种非对称性,证明了查询/键权重中的集中谱会导致特征同质性,而值权重中的分散谱则保持异质性。随后,我们引入了KVSlimmer,这是一种高效算法,通过数学精确的公式捕获确切的Hessian信息,并利用仅前向传递变量推导出封闭形式的解决方案,从而实现了一种无梯度的方法,具有良好的内存和时间效率。在各种模型和基准测试中的大量实验表明,KVSlimmer始终优于现有的最先进(SOTA)方法。例如,在Llama3.1-8B-Instruct上,它将LongBench平均分数提高了0.92,同时分别减少了29%和28%的内存成本和延迟。
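The spectral-energy characterization above can be illustrated with a toy measurement (an illustrative proxy, not KVSlimmer's exact formulation): compute the fraction of squared singular-value energy carried by the top-k singular values of a weight matrix; a concentrated spectrum, as the paper claims for Query/Key weights, pushes this fraction toward 1.

```python
import numpy as np

def spectral_energy_topk(W, k):
    """Fraction of squared singular-value energy in the top-k singular
    values; values near 1 indicate a concentrated spectrum."""
    s = np.linalg.svd(W, compute_uv=False)
    energy = s ** 2
    return energy[:k].sum() / energy.sum()

rng = np.random.default_rng(0)
# Low-rank matrix: concentrated spectrum (the paper's claim for Query/Key)
low_rank = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 64))
# Full-rank Gaussian matrix: dispersed spectrum (the claim for Value)
full_rank = rng.normal(size=(64, 64))

print(spectral_energy_topk(low_rank, 4))   # close to 1.0
print(spectral_energy_topk(full_rank, 4))  # much smaller
```

Under the paper's argument, the concentrated case induces feature homogeneity that makes Key vectors safer to merge than Value vectors.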
cs.CL / 41 / 2603.00917
Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment
小型开源大型语言模型在临床问答中的提示敏感性与答案一致性:对低资源医疗部署的启示
Abstract
Small open-source language models are gaining attention for low-resource healthcare settings, but their reliability under different prompt phrasings remains poorly understood. We evaluated five open-source models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, and Meditron-7B, domain-pretrained without instruction tuning) across three clinical QA datasets (MedQA, MedMCQA, PubMedQA) using five prompt styles (original, formal, simplified, roleplay, direct). We measured consistency scores, accuracy, and instruction-following failure rates. All inference ran locally on consumer CPU hardware without fine-tuning. Consistency and accuracy were largely independent. Gemma 2 achieved the highest consistency (0.845-0.888) but lowest accuracy (33.0-43.5%), while Llama 3.2 showed moderate consistency (0.774-0.807) with the highest accuracy (49.0-65.0%). Roleplay prompts consistently reduced accuracy across all models, with Phi-3 Mini dropping 21.5 percentage points on MedQA. Meditron-7B exhibited near-complete instruction-following failure on PubMedQA (99.0% UNKNOWN rate), showing domain pretraining alone is insufficient for structured clinical QA. High consistency does not imply correctness. Models can be reliably wrong, a dangerous failure mode in clinical AI. Roleplay prompts should be avoided in healthcare applications. Llama 3.2 showed the strongest balance of accuracy and reliability for low-resource deployment. Safe clinical AI requires joint evaluation of consistency, accuracy, and instruction adherence.
Chinese Translation
小型开源语言模型在低资源医疗环境中受到越来越多的关注,但它们在不同提示措辞下的可靠性仍然不甚明了。我们评估了五个开源模型(Gemma 2 2B、Phi-3 Mini 3.8B、Llama 3.2 3B、Mistral 7B 和 Meditron-7B,未经过指令调优的领域预训练)在三个临床问答数据集(MedQA、MedMCQA、PubMedQA)上的表现,使用了五种提示风格(原始、正式、简化、角色扮演、直接)。我们测量了一致性得分、准确性和指令遵循失败率。所有推理均在消费级CPU硬件上本地运行,未进行微调。一致性和准确性在很大程度上是独立的。Gemma 2 达到了最高的一致性(0.845-0.888),但准确性最低(33.0-43.5%),而 Llama 3.2 显示出中等的一致性(0.774-0.807)和最高的准确性(49.0-65.0%)。角色扮演提示在所有模型中一致地降低了准确性,其中 Phi-3 Mini 在 MedQA 上下降了21.5个百分点。Meditron-7B 在 PubMedQA 上几乎完全未能遵循指令(99.0% UNKNOWN 率),显示仅靠领域预训练不足以满足结构化临床问答的需求。高一致性并不意味着正确性。模型可能稳定地给出错误答案,这在临床人工智能中是一种危险的失败模式。在医疗应用中应避免使用角色扮演提示。Llama 3.2 在低资源部署中显示出准确性和可靠性的最佳平衡。安全的临床人工智能需要对一致性、准确性和指令遵循进行联合评估。
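One plausible way to compute a consistency score of the kind reported above (the paper's exact definition may differ) is pairwise agreement of a model's answers across prompt variants:

```python
from itertools import combinations

def consistency(answers):
    """Pairwise agreement across prompt variants: the fraction of variant
    pairs yielding the same answer (1.0 = identical under all phrasings)."""
    pairs = list(combinations(answers, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# One model's answers to the same question under the five prompt styles
# (original, formal, simplified, roleplay, direct) -- hypothetical data:
answers = ["B", "B", "B", "D", "B"]
print(consistency(answers))  # 6 agreeing pairs out of 10 -> 0.6
```

Note how this metric is orthogonal to accuracy: a model answering "D" under every phrasing would score a perfect 1.0 while being reliably wrong.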
cs.CL / 42 / 2603.00923
Hybrid Neural-LLM Pipeline for Morphological Glossing in Endangered Language Documentation: A Case Study of Jungar Tuvan
用于濒危语言文献记录的混合神经-大语言模型管道:以准噶尔图瓦语为例
Abstract
Interlinear glossed text (IGT) creation remains a major bottleneck in linguistic documentation and fieldwork, particularly for low-resource morphologically rich languages. We present a hybrid automatic glossing pipeline that combines neural sequence labeling with large language model (LLM) post-correction, evaluated on Jungar Tuvan, a low-resource Turkic language. Through systematic ablation studies, we show that retrieval-augmented prompting provides substantial gains over random example selection. We further find that morpheme dictionaries paradoxically hurt performance compared to providing no dictionary at all in most cases, and that performance scales approximately logarithmically with the number of few-shot examples. Most significantly, our two-stage pipeline combining a BiLSTM-CRF model with LLM post-correction yields substantial gains for most models, achieving meaningful reductions in annotation workload. Drawing on these findings, we establish concrete design principles for integrating structured prediction models with LLM reasoning in morphologically complex fieldwork contexts. These principles demonstrate that hybrid architectures offer a promising direction for computationally light solutions to automatic linguistic annotation in endangered language documentation.
Chinese Translation
逐行注释文本(IGT)的创建仍然是语言文献记录和实地研究中的主要瓶颈,特别是对于资源匮乏且形态丰富的语言。我们提出了一种混合自动注释管道,将神经序列标注与大语言模型(LLM)后期修正相结合,并在资源匮乏的突厥语——准噶尔图瓦语上进行了评估。通过系统的消融研究,我们表明,检索增强提示相比随机示例选择提供了显著的提升。我们进一步发现,在大多数情况下,与完全不提供词典相比,提供形态素词典反而会损害性能;而且,性能随少样本示例数量大致呈对数增长。最重要的是,我们的两阶段管道将BiLSTM-CRF模型与LLM后期修正相结合,为大多数模型带来了显著的提升,实现了标注工作量的显著减少。基于这些发现,我们建立了在形态复杂的实地研究背景中整合结构化预测模型与LLM推理的具体设计原则。这些原则表明,混合架构为濒危语言文献记录中自动语言注释的计算轻量化解决方案提供了有希望的方向。
cs.CL / 43 / 2603.00924
Conformal Prediction for Risk-Controlled Medical Entity Extraction Across Clinical Domains
跨临床领域的风险控制医疗实体提取的保形预测
Abstract
Large Language Models (LLMs) are increasingly used for medical entity extraction, yet their confidence scores are often miscalibrated, limiting safe deployment in clinical settings. We present a conformal prediction framework that provides finite-sample coverage guarantees for LLM-based extraction across two clinical domains. First, we extract structured entities from 1,000 FDA drug labels across eight sections using GPT-4.1, verified via FactScore-based atomic statement evaluation (97.7\% accuracy over 128,906 entities). Second, we extract radiological entities from MIMIC-CXR reports using the RadGraph schema with GPT-4.1 and Llama-4-Maverick, evaluated against physician annotations (entity F1: 0.81 to 0.84). Our central finding is that miscalibration direction reverses across domains: on well-structured FDA labels, models are underconfident, requiring modest conformal thresholds ($\tau \approx 0.06$), while on free-text radiology reports, models are overconfident, demanding strict thresholds ($\tau$ up to 0.99). Despite this heterogeneity, conformal prediction achieves target coverage ($\geq 90\%$) in both settings with manageable rejection rates (9--13\%). These results demonstrate that calibration is not a global model property but depends on document structure, extraction category, and model architecture, motivating domain-specific conformal calibration for safe clinical deployment.
Chinese Translation
大型语言模型(LLMs)在医疗实体提取中的应用日益增多,但其置信度评分往往存在误校准,限制了在临床环境中的安全部署。我们提出了一种保形预测框架,为基于LLM的提取在两个临床领域提供有限样本覆盖保证。首先,我们使用GPT-4.1从1,000个FDA药品标签的八个部分提取结构化实体,通过基于FactScore的原子陈述评估进行验证(在128,906个实体中准确率为97.7%)。其次,我们使用RadGraph模式和GPT-4.1及Llama-4-Maverick从MIMIC-CXR报告中提取放射学实体,并与医生注释进行评估(实体F1:0.81至0.84)。我们的核心发现是,误校准的方向在不同领域之间是相反的:在结构良好的FDA标签上,模型表现出不足的置信度,需要适度的保形阈值($\tau \approx 0.06$);而在自由文本的放射学报告中,模型则表现出过高的置信度,要求严格的阈值($\tau$高达0.99)。尽管存在这种异质性,保形预测在这两种环境中均能实现目标覆盖率($\geq 90\%$),且拒绝率可控(9%--13%)。这些结果表明,校准并不是一个全局模型属性,而是依赖于文档结构、提取类别和模型架构,促使针对特定领域的保形校准以实现安全的临床部署。
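The finite-sample coverage guarantee rests on standard split conformal prediction, which can be sketched as follows (hypothetical calibration scores; the paper's nonconformity measure and rejection policy may differ):

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.10):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))/n empirical
    quantile of calibration nonconformity scores, which guarantees at
    least 1-alpha coverage on exchangeable test points."""
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0), method="higher")

# Nonconformity = 1 - model confidence on calibration extractions
# (synthetic stand-in data: mostly low scores, i.e. the model is usually right):
rng = np.random.default_rng(1)
cal = rng.beta(2, 8, size=500)
tau = conformal_threshold(cal, alpha=0.10)
# Accept a new extraction only if its nonconformity score <= tau;
# otherwise abstain and route it to human review.
print(round(float(tau), 3))
```

The domain dependence the abstract reports then corresponds to how far this calibrated threshold drifts between FDA-label and radiology-report calibration sets.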
cs.CL / 44 / 2603.00925
The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors
DrawEduMath的后果:视觉语言模型在学习困难学生中的表现不佳且错误诊断不准确
Abstract
Effective mathematics education requires identifying and responding to students' mistakes. For AI to support pedagogical applications, models must perform well across different levels of student proficiency. Our work provides an extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath, a QA benchmark involving real students' handwritten, hand-drawn responses to math problems. We find that models' weaknesses concentrate on a core component of math education: student error. All evaluated VLMs underperform when describing work from students who require more pedagogical help, and across all QA, they struggle the most on questions related to assessing student error. Thus, while VLMs may be optimized to be math problem solving experts, our results suggest that they require alternative development incentives to adequately support educational use cases.
Chinese Translation
有效的数学教育需要识别和应对学生的错误。为了使人工智能支持教学应用,模型必须在不同水平的学生能力上表现良好。我们的研究提供了一个为期一年的广泛快照,展示了11个视觉语言模型(VLMs)在DrawEduMath这一基准测试中的表现,该测试涉及真实学生对数学问题的手写、手绘回答。我们发现模型的弱点集中在数学教育的核心组成部分:学生错误。在描述需要更多教学帮助的学生的作业时,所有评估的VLMs表现不佳,并且在所有问答中,它们在评估学生错误相关的问题上表现最差。因此,尽管VLMs可能被优化为数学问题解决专家,但我们的结果表明,它们需要替代的发展激励,以充分支持教育应用场景。
cs.CL / 45 / 2603.00941
Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages
面向印度语言语音识别系统的正字法信息评估
Abstract
Evaluating ASR systems for Indian languages is challenging due to spelling variations, suffix splitting flexibility, and non-standard spellings in code-mixed words. Traditional Word Error Rate (WER) often presents a bleaker picture of system performance than what human users perceive. Better aligning evaluation with real-world performance requires capturing permissible orthographic variations, which is extremely challenging for under-resourced Indian languages. Leveraging recent advances in LLMs, we propose a framework for creating benchmarks that capture permissible variations. Through extensive experiments, we demonstrate that the resulting metric, OIWER (Orthographically-Informed Word Error Rate), by accounting for orthographic variations, reduces pessimistic error rates (an average improvement of 6.3 points), narrows inflated model gaps (e.g., Gemini-Canary performance difference drops from 18.1 to 11.5 points), and aligns more closely with human perception than prior methods like WER-SN by 4.9 points.
Chinese Translation
评估印度语言的自动语音识别(ASR)系统面临挑战,原因在于拼写变体、后缀拆分的灵活性以及代码混合词中的非标准拼写。传统的词错误率(WER)往往呈现出比人类用户感知的系统性能更为悲观的情况。为了更好地将评估与现实世界的表现对齐,需要捕捉允许的正字法变体,这对于资源匮乏的印度语言来说极具挑战性。借助近期在大型语言模型(LLMs)方面的进展,我们提出了一个创建基准的框架,以捕捉允许的变体。通过广泛的实验,我们证明了通过考虑正字法变体的OIWER(Orthographically-Informed Word Error Rate)能够降低悲观的错误率(平均改善6.3个百分点),缩小膨胀的模型差距(例如,Gemini-Canary的性能差异从18.1降至11.5个百分点),并比以往方法如WER-SN更接近人类感知,差距缩小了4.9个百分点。
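The contrast between plain WER and a variant-aware score can be sketched in a few lines (toy transliterated example; OIWER's actual variant sets are LLM-derived and far richer):

```python
def wer(ref, hyp):
    """Word error rate via Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

def variant_aware_wer(ref, hyp, variants):
    """Score the hypothesis against every permissible orthographic variant
    of the reference and keep the best (lowest) WER."""
    return min(wer(v, hyp) for v in [ref] + variants)

ref = "namaskāra bengaluru nagara"
hyp = "namaskara bengaluru nagara"
print(wer(ref, hyp))                                   # penalizes the variant
print(variant_aware_wer(ref, hyp, [hyp]))              # 0.0 once it is allowed
```

Plain WER charges one substitution in three words for a spelling the users would accept, while the variant-aware score forgives it, which is the effect behind the 6.3-point average improvement reported above.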
cs.CL / 46 / 2603.00958
S-VoCAL: A Dataset and Evaluation Framework for Inferring Speaking Voice Character Attributes in Literature
S-VoCAL:一个用于推断文学作品中角色声音特征属性的数据集和评估框架
Abstract
With recent advances in Text-to-Speech (TTS) systems, synthetic audiobook narration has seen increased interest, reaching unprecedented levels of naturalness. However, large gaps remain in synthetic narration systems' ability to impersonate fictional characters and convey complex emotions or prosody. A promising direction to enhance character identification is the assignment of plausible voices to each fictional character in a book. This step typically requires complex inference of attributes in book-length contexts, such as a character's age, gender, origin or physical health, which in turn requires dedicated benchmark datasets to evaluate extraction systems' performances. We present S-VoCAL (Speaking Voice Character Attributes in Literature), the first dataset and evaluation framework dedicated to evaluating the inference of voice-related fictional character attributes. S-VoCAL entails 8 attributes grounded in sociophonetic studies, and 952 character-book pairs derived from Project Gutenberg. Its evaluation framework addresses the particularities of each attribute, and includes a novel similarity metric based on recent Large Language Model embeddings. We demonstrate the applicability of S-VoCAL by applying a simple Retrieval-Augmented Generation (RAG) pipeline to the task of inferring character attributes. Our results suggest that the RAG pipeline reliably infers attributes such as Age or Gender, but struggles on others such as Origin or Physical Health. The dataset and evaluation code are available at https://github.com/AbigailBerthe/S-VoCAL .
Chinese Translation
随着文本到语音(Text-to-Speech, TTS)系统的最新进展,合成有声读物的叙述引起了越来越多的关注,达到了前所未有的自然程度。然而,合成叙述系统在模仿虚构角色以及传达复杂情感或韵律方面仍存在较大差距。增强角色识别的一个有希望的方向是为书中的每个虚构角色分配合理的声音。这一步通常需要在书籍长度的上下文中对角色的年龄、性别、来源或身体健康等属性进行复杂推断,这反过来又需要专门的基准数据集来评估提取系统的性能。我们提出了 S-VoCAL(文学中的声音角色属性),这是第一个专门用于评估与声音相关的虚构角色属性推断的数据集和评估框架。S-VoCAL 包含基于社会语音学研究的 8 个属性,以及来自古腾堡计划的 952 对角色-书籍配对。其评估框架考虑了每个属性的特殊性,并包括基于近期大型语言模型嵌入的新颖相似性度量。我们通过将简单的检索增强生成(Retrieval-Augmented Generation, RAG)管道应用于推断角色属性的任务,展示了 S-VoCAL 的适用性。我们的结果表明,RAG 管道能够可靠地推断年龄或性别等属性,但在来源或身体健康等其他属性上表现不佳。数据集和评估代码可在 https://github.com/AbigailBerthe/S-VoCAL 获取。
cs.CL / 47 / 2603.01009
Qayyem: A Real-time Platform for Scoring Proficiency of Arabic Essays
Qayyem:一个实时评估阿拉伯语作文能力的平台
Abstract
Over the past years, Automated Essay Scoring (AES) systems have gained increasing attention as scalable and consistent solutions for assessing the proficiency of student writing. Despite recent progress, support for Arabic AES remains limited due to linguistic complexity and scarcity of large publicly-available annotated datasets. In this work, we present Qayyem, a Web-based platform designed to support Arabic AES by providing an integrated workflow for assignment creation, batch essay upload, scoring configuration, and per-trait essay evaluation. Qayyem abstracts the technical complexity of interacting with scoring server APIs, allowing instructors to access advanced scoring services through a user-friendly interface. The platform deploys a number of state-of-the-art Arabic essay scoring models with different effectiveness and efficiency figures.
Chinese Translation
近年来,自动化作文评分(Automated Essay Scoring, AES)系统作为评估学生写作能力的一种可扩展且一致的解决方案,越来越受到关注。尽管近期取得了一些进展,但由于语言的复杂性和缺乏大型公开可用的标注数据集,阿拉伯语的AES支持仍然有限。在本研究中,我们提出了Qayyem,一个基于Web的平台,旨在通过提供作业创建、批量作文上传、评分配置和按特征作文评估的集成工作流程,来支持阿拉伯语的AES。Qayyem抽象化了与评分服务器API交互的技术复杂性,使得教师能够通过用户友好的界面访问先进的评分服务。该平台部署了多种最先进的阿拉伯语作文评分模型,具有不同的有效性和效率指标。
cs.CL / 48 / 2603.01042
Thoth: Mid-Training Bridges LLMs to Time Series Understanding
Thoth:中期训练将大型语言模型与时间序列理解连接起来
Abstract
Large Language Models (LLMs) have demonstrated remarkable success in general-purpose reasoning. However, they still struggle to understand and reason about time series data, which limits their effectiveness in decision-making scenarios that depend on temporal dynamics. In this paper, we propose Thoth, the first family of mid-trained LLMs with general-purpose time series understanding capabilities. As a pivotal intermediate stage, mid-training achieves task- and domain-agnostic alignment between time series and natural language, for which we construct Book-of-Thoth, a high-quality, time-series-centric mid-training corpus. Book-of-Thoth enables both time-series-to-text and text-to-time-series generation, equipping LLMs with a foundational grasp of temporal patterns. To better evaluate advanced reasoning capabilities, we further present KnoTS, a novel benchmark of knowledge-intensive time series understanding, designed for joint reasoning over temporal patterns and domain knowledge. Extensive experiments demonstrate that mid-training with Book-of-Thoth enables Thoth to significantly outperform its base model and advanced LLMs across a range of time series question answering benchmarks. Moreover, Thoth exhibits superior capabilities when fine-tuned under data scarcity, underscoring the effectiveness of mid-training for time series understanding. Code is available at: https://github.com/thuml/Thoth.
Chinese Translation
大型语言模型(LLMs)在通用推理方面取得了显著成功。然而,它们在理解和推理时间序列数据方面仍然存在困难,这限制了它们在依赖时间动态的决策场景中的有效性。本文提出了Thoth,这是第一类具有通用时间序列理解能力的中期训练LLMs。作为一个关键的中间阶段,中期训练实现了时间序列与自然语言之间的任务和领域无关的对齐,为此我们构建了Book-of-Thoth,这是一个高质量、以时间序列为中心的中期训练语料库。Book-of-Thoth支持时间序列到文本和文本到时间序列的生成,使LLMs具备对时间模式的基础理解。为了更好地评估高级推理能力,我们进一步提出了KnoTS,这是一个新颖的知识密集型时间序列理解基准,旨在对时间模式和领域知识进行联合推理。大量实验表明,使用Book-of-Thoth进行中期训练使Thoth在一系列时间序列问答基准测试中显著超越其基础模型和先进的LLMs。此外,Thoth在数据稀缺的情况下进行微调时表现出更优越的能力,突显了中期训练在时间序列理解中的有效性。代码可在:https://github.com/thuml/Thoth获取。
cs.CL / 49 / 2603.01059
GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant
GroupGPT:一种令牌高效且保护隐私的多用户聊天助手代理框架
Abstract
Recent advances in large language models (LLMs) have enabled increasingly capable chatbots. However, most existing systems focus on single-user settings and do not generalize well to multi-user group chats, where agents require more proactive and accurate intervention under complex, evolving contexts. Existing approaches typically rely on LLMs for both reasoning and generation, leading to high token consumption, limited scalability, and potential privacy risks. To address these challenges, we propose GroupGPT, a token-efficient and privacy-preserving agentic framework for multi-user chat assistants. GroupGPT adopts a small-large model collaborative architecture to decouple intervention timing from response generation, enabling efficient and accurate decision-making. The framework also supports multimodal inputs, including memes, images, videos, and voice messages. We further introduce MUIR, a benchmark dataset for multi-user chat assistant intervention reasoning. MUIR contains 2,500 annotated group chat segments with intervention labels and rationales, supporting evaluation of timing accuracy and response quality. We evaluate a range of models on MUIR, from large language models to smaller counterparts. Extensive experiments demonstrate that GroupGPT produces accurate and well-timed responses, achieving an average score of 4.72/5.0 in LLM-based evaluation, and is well received by users across diverse group chat scenarios. Moreover, GroupGPT reduces token usage by up to a factor of 3 compared to baseline methods, while providing privacy sanitization of user messages before cloud transmission. Code is available at: https://github.com/Eliot-Shen/GroupGPT .
Chinese Translation
最近在大型语言模型(LLMs)方面的进展使得聊天机器人变得越来越强大。然而,大多数现有系统专注于单用户设置,并且在多用户群聊中表现不佳,在这种情况下,代理需要在复杂且不断变化的环境中进行更主动和准确的干预。现有方法通常依赖于LLMs进行推理和生成,导致高令牌消耗、有限的可扩展性以及潜在的隐私风险。为了解决这些挑战,我们提出了GroupGPT,一种令牌高效且保护隐私的多用户聊天助手代理框架。GroupGPT采用小-大模型协作架构,将干预时机与响应生成解耦,从而实现高效和准确的决策。该框架还支持多模态输入,包括表情包、图像、视频和语音消息。我们进一步介绍了MUIR,一个用于多用户聊天助手干预推理的基准数据集。MUIR包含2500个带有干预标签和理由的注释群聊片段,支持对时机准确性和响应质量的评估。我们在MUIR上评估了一系列模型,从大型语言模型到较小的模型。大量实验表明,GroupGPT能够生成准确且时机恰当的响应,在基于LLM的评估中平均得分为4.72/5.0,并在多样化的群聊场景中受到用户的好评。此外,与基线方法相比,GroupGPT将令牌使用量减少了多达3倍,同时在云传输之前提供用户消息的隐私清理。代码可在:https://github.com/Eliot-Shen/GroupGPT 获取。
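The small-large decoupling described above can be sketched as follows (stand-in models, not GroupGPT's implementation): a cheap model gates intervention timing, and the expensive LLM is invoked only when an intervention is warranted, which is where the token savings come from.

```python
# Illustrative sketch of decoupling intervention timing (small model) from
# response generation (large model) in a group-chat assistant.

def should_intervene(message, small_model):
    """Small model: fast, low-cost intervention-timing decision."""
    return small_model(message)

def handle_message(message, small_model, large_model):
    if not should_intervene(message, small_model):
        return None               # stay silent: no large-model tokens spent
    return large_model(message)   # generate a reply only when needed

# Stand-in models for the sketch:
small = lambda m: "?" in m        # toy rule: intervene on questions only
large = lambda m: f"[assistant reply to: {m}]"

print(handle_message("lunch anyone", small, large))          # None
print(handle_message("when is the deadline?", small, large))
```

In the real system the gate would also see conversation history and multimodal context, and the sanitization step would run on the message before any cloud call.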
cs.CL / 50 / 2603.01070
How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning
强化学习如何解锁几何交错推理中的顿悟时刻
Abstract
Solving complex geometric problems inherently requires interleaved reasoning: a tight alternation between constructing diagrams and performing logical deductions. Although recent Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities in visual generation and plotting, we identify a counter-intuitive and underexplored phenomenon. Naively applying Supervised Fine-Tuning (SFT) on interleaved plot-solution data leads to a substantial degradation in reasoning performance compared to text-only baselines. We argue that this failure stems from a fundamental limitation of SFT, which primarily induces distributional alignment: the model learns to reproduce the surface format of interleaved plotting but fails to internalize the causal dependency between the generated plot and reasoning steps. To overcome this limitation, we propose Faire (Functional alignment for interleaved reasoning), a reinforcement learning framework that enforces three causal constraints to move beyond superficial imitation toward functional alignment. Extensive experiments show that Faire induces a qualitative shift in model behavior in which the plotting is effectively internalized, yielding competitive performance on challenging geometric reasoning benchmarks.
Chinese Translation
解决复杂的几何问题本质上需要交错推理:在构建图形和进行逻辑推理之间紧密交替。尽管最近的多模态大型语言模型(Multimodal Large Language Models, MLLMs)在视觉生成和绘图方面表现出强大的能力,但我们发现了一种反直觉且未被充分探索的现象。简单地对交错绘图-解决方案数据应用监督微调(Supervised Fine-Tuning, SFT)会导致推理性能相较于仅文本基线显著下降。我们认为,这一失败源于SFT的一个基本局限性,即主要引入分布对齐:模型学习重复交错绘图的表面格式,但未能内化生成的图形与推理步骤之间的因果依赖关系。为克服这一局限性,我们提出了Faire(交错推理的功能对齐),这是一种强化学习框架,通过施加三个因果约束,推动模型超越表面模仿,实现功能对齐。大量实验表明,Faire在模型行为上引发了质的转变,使绘图得以有效内化,从而在具有挑战性的几何推理基准测试中表现出竞争力。
cs.CL / 51 / 2603.01089
CARD: Towards Conditional Design of Multi-agent Topological Structures
CARD:面向多智能体拓扑结构的条件设计
Abstract
Large language model (LLM)-based multi-agent systems have shown strong capabilities in tasks such as code generation and collaborative reasoning. However, the effectiveness and robustness of these systems critically depend on their communication topology, which is often fixed or statically learned, ignoring real-world dynamics such as model upgrades, API (or tool) changes, or knowledge source variability. To address this limitation, we propose CARD (Conditional Agentic Graph Designer), a conditional graph-generation framework that instantiates AMACP, a protocol for adaptive multi-agent communication. CARD explicitly incorporates dynamic environmental signals into graph construction, enabling topology adaptation at both training and runtime. Through a conditional variational graph encoder and environment-aware optimization, CARD produces communication structures that are both effective and resilient to shifts in model capability or resource availability. Empirical results on HumanEval, MATH, and MMLU demonstrate that CARD consistently outperforms static and prompt-based baselines, achieving higher accuracy and robustness across diverse conditions. The source code is available at: https://github.com/Warma10032/CARD.
Chinese Translation
基于大型语言模型(LLM)的多智能体系统在代码生成和协作推理等任务中展现出强大的能力。然而,这些系统的有效性和鲁棒性在很大程度上依赖于其通信拓扑,而这种拓扑往往是固定的或静态学习的,忽视了现实世界中的动态因素,如模型升级、API(或工具)变化或知识源的变动。为了解决这一局限性,我们提出了CARD(条件智能体图设计器),这是一种条件图生成框架,实例化了AMACP,一种自适应多智能体通信协议。CARD明确将动态环境信号纳入图构建中,使得拓扑在训练和运行时都能够适应变化。通过条件变分图编码器和环境感知优化,CARD生成的通信结构在模型能力或资源可用性变化时既有效又具有韧性。在HumanEval、MATH和MMLU上的实证结果表明,CARD在各种条件下始终优于静态和基于提示的基线,达到了更高的准确性和鲁棒性。源代码可在以下链接获取:https://github.com/Warma10032/CARD。
cs.CL / 52 / 2603.01167
DEP: A Decentralized Large Language Model Evaluation Protocol
DEP:去中心化的大型语言模型评估协议
Abstract
With the rapid development of Large Language Models (LLMs), a large number of benchmarks have been proposed. However, most benchmarks lack a unified evaluation standard and require the manual implementation of custom scripts, making consistency and reproducibility of results hard to ensure. Furthermore, mainstream evaluation frameworks are centralized, holding both datasets and answers, which increases the risk of benchmark leakage. To address these issues, we propose a Decentralized Evaluation Protocol (DEP), a decentralized yet unified and standardized evaluation framework built around a matching server without constraining benchmarks. The server can be mounted locally or deployed remotely, and once adapted, it can be reused over the long term. By decoupling users, LLMs, and benchmarks, DEP enables modular, plug-and-play evaluation: benchmark files and evaluation logic stay exclusively on the server side. In the remote setting, users cannot access the ground truth, thereby achieving data isolation and leak-proof evaluation. To facilitate practical adoption, we develop the DEP Toolkit, a protocol-compatible toolkit that supports features such as breakpoint resume, concurrent requests, and congestion control. We also provide detailed documentation for adapting new benchmarks to DEP. Using the DEP Toolkit, we evaluate multiple LLMs across benchmarks. Experimental results verify the effectiveness of DEP and show that it reduces the cost of deploying benchmark evaluations. As of February 2026, we have adapted over 60 benchmarks and continue to promote community co-construction to support unified evaluation across various tasks and domains.
Chinese Translation
随着大型语言模型(LLMs)的快速发展,提出了大量基准测试。然而,大多数基准缺乏统一的评估标准,并且需要手动实现自定义脚本,这使得结果的一致性和可重复性难以保证。此外,主流评估框架是集中式的,拥有数据集和答案,这增加了基准泄漏的风险。为了解决这些问题,我们提出了一种去中心化评估协议(DEP),这是一个通过匹配服务器实现的去中心化但统一和标准化的评估框架,不限制基准。该服务器可以本地安装或远程部署,并且一旦适配,可以长期重复使用。通过解耦用户、LLMs和基准,DEP实现了模块化、即插即用的评估:基准文件和评估逻辑完全保留在服务器端。在远程设置中,用户无法访问真实答案,从而实现数据隔离和防泄漏评估。为了促进实际应用,我们开发了DEP工具包,这是一个兼容协议的工具包,支持断点续传、并发请求和拥塞控制等功能。我们还提供了详细的文档,以便将新的基准适配到DEP。使用DEP工具包,我们对多个LLMs进行了基准评估。实验结果验证了DEP的有效性,并表明它降低了基准评估的部署成本。截至2026年2月,我们已适配超过60个基准,并继续推动社区共同建设,以支持各类任务和领域的统一评估。
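The client-server decoupling at the heart of DEP can be sketched as follows (hypothetical interfaces, not the DEP Toolkit API): gold answers live only on the server; clients submit predictions and receive scores, never ground truth.

```python
# Minimal sketch of a leak-proof evaluation split: the server holds the
# benchmark answers, and clients only ever see aggregate scores.

class EvalServer:
    def __init__(self, answers):
        self._answers = answers          # ground truth stays server-side

    def list_questions(self):
        return sorted(self._answers)     # question IDs only, no answers

    def score(self, predictions):
        correct = sum(predictions.get(qid) == gold
                      for qid, gold in self._answers.items())
        return {"accuracy": correct / len(self._answers)}

class EvalClient:
    def __init__(self, server):
        self.server = server

    def evaluate(self, model):
        preds = {qid: model(qid) for qid in self.server.list_questions()}
        return self.server.score(preds)  # never touches the gold answers

server = EvalServer({"q1": "A", "q2": "C", "q3": "B"})
model = {"q1": "A", "q2": "C", "q3": "D"}.get   # stand-in model
print(EvalClient(server).evaluate(model))        # 2 of 3 correct
```

In the remote deployment the two classes sit on different machines behind a network API, which is what makes the isolation enforceable rather than just conventional.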
cs.CL / 53 / 2603.01185
Token-level Data Selection for Safe LLM Fine-tuning
基于标记级数据选择的安全大型语言模型微调
Abstract
Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. However, recent studies have shown that such fine-tuning can lead to significant degradation in the model's safety. Existing defense methods operate at the sample level and often suffer from an unsatisfactory trade-off between safety and utility. To address this limitation, we perform a systematic token-level diagnosis of safety degradation during fine-tuning. Based on this, we propose token-level data selection for safe LLM fine-tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model. This token-level granularity enables accurate identification and removal of unsafe tokens, thereby preserving valuable task-specific information. In addition, we introduce a progressive refinement strategy, TOSS-Pro, which iteratively enhances the safety-degraded model's ability to identify unsafe tokens. Extensive experiments demonstrate that our approach robustly safeguards LLMs during fine-tuning while achieving superior downstream task performance, significantly outperforming existing sample-level defense methods. Our code is available at https://github.com/Polly-LYP/TOSS.
Chinese Translation
在定制数据集上微调大型语言模型(LLMs)已成为将这些模型适应特定领域和应用的标准方法。然而,最近的研究表明,这种微调可能导致模型安全性显著下降。现有的防御方法主要在样本级别操作,往往在安全性和实用性之间存在不令人满意的权衡。为了解决这一限制,我们对微调过程中安全性下降进行了系统的标记级诊断。在此基础上,我们提出了一种用于安全LLM微调的标记级数据选择框架(Token-level data selection for safe LLM fine-tuning,TOSS),该框架通过测量安全性下降模型与以实用性为导向模型之间的损失差异来量化每个标记的安全风险。这种标记级粒度使得能够准确识别和移除不安全的标记,从而保留有价值的任务特定信息。此外,我们引入了一种渐进式精炼策略TOSS-Pro,该策略迭代地增强安全性下降模型识别不安全标记的能力。大量实验表明,我们的方法在微调过程中能够有效保护LLMs的安全性,同时在下游任务性能上表现优越,显著优于现有的样本级防御方法。我们的代码可在 https://github.com/Polly-LYP/TOSS 获取。
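The loss-difference criterion at the heart of TOSS can be illustrated with a small sketch: a token that the safety-degraded model fits much better than the utility-oriented model is flagged as unsafe and removed. The per-token losses, threshold, and sign convention below are invented for illustration and are not the paper's exact formulation:

```python
# Hypothetical sketch of TOSS-style token-level selection. Values and the
# threshold are illustrative; real per-token losses come from two LLMs.

def token_safety_risk(loss_degraded, loss_utility):
    """Per-token risk: how much 'easier' the token is for the degraded model."""
    return [lu - ld for ld, lu in zip(loss_degraded, loss_utility)]

def select_safe_tokens(tokens, loss_degraded, loss_utility, threshold=1.0):
    risks = token_safety_risk(loss_degraded, loss_utility)
    return [tok for tok, r in zip(tokens, risks) if r < threshold]

tokens        = ["Sure", ",", "here", "is", "how", "to", "pick", "a", "lock"]
loss_degraded = [0.2, 0.1, 0.3, 0.2, 0.4, 0.3, 0.5, 0.2, 0.6]   # fits unsafe text well
loss_utility  = [0.3, 0.1, 0.4, 0.3, 2.1, 1.9, 2.4, 0.4, 2.8]   # finds it surprising

kept = select_safe_tokens(tokens, loss_degraded, loss_utility)
print(kept)  # benign tokens survive; the unsafe span is dropped
```

Token-level granularity is what preserves utility here: the benign scaffolding of a sample survives even when a few unsafe tokens inside it are removed.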
cs.CL / 54 / 2603.01190
Reasoning or Rationalization? The Role of Justifications in Masked Diffusion Models for Fact Verification
推理还是合理化?掩蔽扩散模型在事实验证中正当理由的作用
Abstract
Unlike autoregressive models, which generate tokens sequentially and benefit from reasoning-before-answering strategies such as Chain-of-Thought, Masked Diffusion Language Models (MDLMs) refine all sequence positions simultaneously, raising questions about how these models handle tasks requiring justified verdicts. In this work, we investigate the dynamics of MDLM reasoning on fact verification, examining whether justifications serve as genuine reasoning or post-hoc rationalization. We observe that MDLMs typically converge on a verdict early in the diffusion process, treating it as a global anchor that is resolved before the justification is complete. Crucially, enforcing a reasoning-first constraint via delayed verdict unmasking actively degrades performance, dropping accuracy from 86.2% to 71.9% as accumulating justification tokens introduce inconsistencies that override initially correct predictions. Interventional experiments reveal that the model rationalizes incorrect forced verdicts in 56% of cases, and that verdicts are strongly causally dependent on justification quality (57.3% accuracy with corrupted justifications vs. 97.1% with ground-truth). This causal dependence explains the degradation under forced deliberation: as the model generates noisy justification tokens, it conditions on them, gradually overriding its initially correct assessment. Our findings suggest that for fact verification with MDLMs, extended deliberation can be counterproductive, risking the dilution of accurate early predictions with noise introduced during justification generation.
Chinese Translation
与自回归模型逐步生成标记并受益于如链式思维(Chain-of-Thought)等推理优先回答策略不同,掩蔽扩散语言模型(Masked Diffusion Language Models, MDLMs)同时对所有序列位置进行细化,这引发了关于这些模型如何处理需要正当理由的裁决任务的问题。在本研究中,我们探讨了MDLM在事实验证中的推理动态,考察正当理由是否作为真正的推理或事后合理化。我们观察到,MDLM通常在扩散过程的早期就收敛于一个裁决,将其视为一个在正当理由完成之前解决的全局锚点。至关重要的是,通过延迟裁决揭示强制推理优先的约束会显著降低性能,准确率从86.2%下降至71.9%,因为累积的正当理由标记引入了与最初正确预测相矛盾的不一致性。干预实验表明,模型在56%的情况下对错误的强制裁决进行合理化,并且裁决在很大程度上因正当理由的质量而具有因果依赖性(当正当理由被破坏时准确率为57.3%,而真实正当理由下为97.1%)。这种因果依赖解释了在强制深思下的性能下降:随着模型生成噪声正当理由标记,它以此为条件,逐渐覆盖其最初的正确评估。我们的研究结果表明,对于使用MDLM进行事实验证,延长的深思可能适得其反,存在在正当理由生成过程中引入噪声而稀释早期准确预测的风险。
cs.CL / 55 / 2603.01212
XAI-enhanced Comparative Opinion Mining via Aspect-based Scoring and Semantic Reasoning
基于XAI的比较意见挖掘:通过基于方面的评分和语义推理
Abstract
Comparative opinion mining involves comparing products across different reviews. However, transformer-based models designed for this task often lack transparency, which can hinder the development of user trust. In this paper, we propose XCom, an enhanced transformer-based model composed of two principal modules: (i) aspect-based rating prediction and (ii) semantic analysis for comparative opinion mining. XCom also incorporates a Shapley additive explanations module to provide interpretable insights into the model's decisions. Empirically, XCom achieves leading performance compared to other baselines, demonstrating its effectiveness in providing meaningful explanations and making it a more reliable tool for comparative opinion mining. Source code is available at: https://anonymous.4open.science/r/XCom.
Chinese Translation
比较意见挖掘涉及对来自不同评论的产品进行比较。然而,针对这一任务设计的基于变换器的模型往往缺乏透明性,这可能会对用户信任的发展产生不利影响。在本文中,我们提出了XCom,这是一种增强的基于变换器的模型,分为两个主要模块,即(i)基于方面的评分预测和(ii)用于比较意见挖掘的语义分析。XCom还结合了Shapley加法解释模块,以提供对模型决策过程的可解释性洞察。实证结果表明,XCom在与其他基线模型的比较中表现出色,证明了其在提供有意义解释方面的有效性,使其成为比较意见挖掘的更可靠工具。源代码可在以下链接获取:https://anonymous.4open.science/r/XCom。
cs.CL / 56 / 2603.01214
Reasoning Boosts Opinion Alignment in LLMs
推理提升大语言模型中的意见一致性
Abstract
Opinion modeling aims to capture individual or group political preferences, enabling applications such as digital democracies, where models could help shape fairer and more popular policies. Given their versatility, strong generalization capabilities, and demonstrated success across diverse text-to-text applications, large language models (LLMs) are natural candidates for this task. However, due to their statistical nature and limited causal understanding, they tend to produce biased opinions when prompted naively. In this work, we study whether reasoning can improve opinion alignment. Motivated by the recent advancement in mathematical reasoning enabled by reinforcement learning (RL), we train models to produce profile-consistent answers through structured reasoning. We evaluate our approach on three datasets covering U.S., European, and Swiss politics. Results indicate that reasoning enhances opinion modeling and is competitive with strong baselines, but does not fully remove bias, highlighting the need for additional mechanisms to build faithful political digital twins using LLMs. By releasing both our method and datasets, we establish a solid baseline to support future research on LLM opinion alignment.
Chinese Translation
意见建模旨在捕捉个体或群体的政治偏好,从而支持数字民主等应用,使模型能够帮助制定更公平和更受欢迎的政策。考虑到大语言模型(LLMs)的多功能性、强大的泛化能力以及在多种文本到文本应用中的成功表现,它们自然成为这一任务的候选者。然而,由于其统计特性和有限的因果理解,当被简单提示时,它们往往会产生偏见的意见。在本研究中,我们探讨了推理是否能够改善意见一致性。受到强化学习(RL)推动的数学推理最新进展的启发,我们训练模型通过结构化推理生成与个人资料一致的回答。我们在涵盖美国、欧洲和瑞士政治的三个数据集上评估了我们的方法。结果表明,推理增强了意见建模,并与强基线模型具有竞争力,但并未完全消除偏见,这突显了使用LLMs构建真实政治数字双胞胎所需的额外机制。通过发布我们的方法和数据集,我们建立了一个坚实的基线,以支持未来在LLM意见一致性方面的研究。
cs.CL / 57 / 2603.01220
Generative AI & Fictionality: How Novels Power Large Language Models
生成性人工智能与虚构性:小说如何推动大型语言模型
Abstract
Generative models, like the one in ChatGPT, are powered by their training data. The models are simply next-word predictors, based on patterns learned from vast amounts of pre-existing text. Since the first generation of GPT, it is striking that the most popular datasets have included substantial collections of novels. For the engineers and research scientists who build these models, there is a common belief that the language in fiction is rich enough to cover all manner of social and communicative phenomena, yet the belief has gone mostly unexamined. How does fiction shape the outputs of generative AI? Specifically, what are novels' effects relative to other forms of text, such as newspapers, Reddit, and Wikipedia? Since the 1970s, literature scholars such as Catherine Gallagher and James Phelan have developed robust and insightful accounts of how fiction operates as a form of discourse and language. Through our study of an influential open-source model (BERT), we find that LLMs leverage familiar attributes and affordances of fiction, while also fomenting new qualities and forms of social response. We argue that if contemporary culture is increasingly shaped by generative AI and machine learning, any analysis of today's various modes of cultural production must account for a relatively novel dimension: computational training data.
Chinese Translation
生成模型,如ChatGPT中的模型,依赖于其训练数据。这些模型仅仅是基于从大量现有文本中学习到的模式进行下一个单词的预测。自从第一代GPT问世以来,最受欢迎的数据集显著包含了大量小说作品。对于构建这些模型的工程师和研究科学家而言,普遍存在一种信念,即虚构作品中的语言足够丰富,可以涵盖各种社会和交际现象,但这一信念大多未得到深入检验。虚构如何影响生成性人工智能的输出?具体而言,小说相对于其他文本形式(如报纸、Reddit和维基百科)有什么影响?自1970年代以来,文学学者如凯瑟琳·加拉赫(Catherine Gallagher)和詹姆斯·菲伦(James Phelan)对虚构作为一种话语和语言的运作方式进行了深入而有力的分析。通过对一个有影响力的开源模型(BERT)的研究,我们发现大型语言模型(LLMs)利用了虚构的熟悉特征和优势,同时也催生了新的特质和社会反应形式。我们认为,如果当代文化越来越受到生成性人工智能和机器学习的影响,那么对当今各种文化生产模式的分析必须考虑一个相对新颖的维度:计算训练数据。
cs.CL / 58 / 2603.01225
Can Thinking Models Think to Detect Hateful Memes?
思维模型能否思考以检测仇恨表情包?
Abstract
Hateful memes often require compositional multimodal reasoning: the image and text may appear benign in isolation, yet their interaction conveys harmful intent. Although thinking-based multimodal large language models (MLLMs) have recently advanced vision-language understanding, their capabilities remain underexplored for hateful meme analysis. We propose a reinforcement learning based post-training framework that improves reasoning in thinking-based MLLMs through task-specific rewards and a novel Group Relative Policy Optimization (GRPO) objective. Specifically, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful meme understanding, (ii) extend an existing hateful meme dataset by generating weakly or pseudo-supervised chain-of-thought rationales via distillation, and (iii) introduce a GRPO-based objective that jointly optimizes meme classification and explanation quality to encourage fine-grained, step-by-step reasoning. Experiments on the Hateful Memes benchmark show that our approach achieves state-of-the-art performance, improving accuracy and F1 by approximately 1 percent and explanation quality by approximately 3 percent. We will publicly release our code, dataset extensions, and evaluation resources to support reproducibility.
Chinese Translation
仇恨表情包通常需要组合多模态推理:图像和文本在孤立状态下可能显得无害,但它们的互动却传达了有害的意图。尽管基于思维的多模态大型语言模型(MLLMs)最近在视觉-语言理解方面取得了进展,但它们在仇恨表情包分析中的能力仍未得到充分探索。我们提出了一种基于强化学习的后训练框架,通过任务特定的奖励和一种新颖的群体相对策略优化(Group Relative Policy Optimization, GRPO)目标来改善基于思维的MLLMs的推理能力。具体而言,我们(i)对现成的MLLMs进行系统的实证研究,以理解仇恨表情包,(ii)通过蒸馏生成弱监督或伪监督的思维链推理,扩展现有的仇恨表情包数据集,以及(iii)引入一种基于GRPO的目标,联合优化表情包分类和解释质量,以鼓励细致的逐步推理。在仇恨表情包基准测试中的实验表明,我们的方法达到了最先进的性能,准确率和F1分数分别提高了约1个百分点,解释质量提高了约3个百分点。我们将公开发布我们的代码、数据集扩展和评估资源,以支持可重复性。
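The group-relative advantage underlying GRPO, combined with a composite reward over classification correctness and explanation quality, can be sketched as below. The reward weights and shapes are assumptions for illustration, not the paper's exact objective:

```python
# Illustrative GRPO-style advantage computation: rewards for a group of
# sampled responses to one prompt are standardized within the group.
import statistics

def composite_reward(correct, explanation_score, w_cls=1.0, w_exp=0.5):
    # Joint signal: meme classification correctness + explanation quality.
    return w_cls * (1.0 if correct else 0.0) + w_exp * explanation_score

def group_relative_advantages(rewards, eps=1e-8):
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, a group of 4 sampled (correct?, explanation_score) rollouts:
rollouts = [(True, 0.8), (True, 0.2), (False, 0.6), (False, 0.1)]
rewards = [composite_reward(c, e) for c, e in rollouts]
advs = group_relative_advantages(rewards)
print([round(a, 2) for a in advs])
```

The advantages sum to (approximately) zero by construction: responses are only rewarded relative to their group, which is what removes the need for a learned value function.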
cs.CL / 59 / 2603.01239
Self-Anchoring Calibration Drift in Large Language Models: How Multi-Turn Conversations Reshape Model Confidence
大型语言模型中的自锚定校准漂移:多轮对话如何重塑模型信心
Abstract
We introduce Self-Anchoring Calibration Drift (SACD), a hypothesized tendency for large language models (LLMs) to show systematic changes in expressed confidence when building iteratively on their own prior outputs across multi-turn conversations. We report an empirical study comparing three frontier models -- Claude Sonnet 4.6, Gemini 3.1 Pro, and GPT-5.2 -- across 150 questions spanning factual, technical, and open-ended domains, using three conditions: single-turn baseline (A), multi-turn self-anchoring (B), and independent repetition control (C). Results reveal a complex, model-heterogeneous pattern that partially diverges from pre-registered hypotheses. Claude Sonnet 4.6 exhibited significant decreasing confidence under self-anchoring (mean CDS = -0.032, t(14) = -2.43, p = .029, d = -0.627), while also showing significant calibration error drift (F(4,56) = 22.77, p < .001, eta^2 = .791). GPT-5.2 showed the opposite pattern in open-ended domains (mean CDS = +0.026) with significant ECE escalation by Turn 5. Gemini 3.1 Pro showed no significant CDS (t(14) = 0.38, p = .710), but its Condition C data reveals a striking ECE pattern: without self-anchoring, Gemini's calibration error drops from .327 to near zero across repetitions, whereas self-anchoring holds ECE flat at approximately .333 -- indicating that SACD can manifest as suppression of natural calibration improvement rather than active calibration degradation.
Chinese Translation
我们提出了自锚定校准漂移(Self-Anchoring Calibration Drift, SACD)的概念,假设大型语言模型(Large Language Models, LLMs)在多轮对话中基于自身先前输出的迭代构建时,表现出的信心会系统性地发生变化。我们报告了一项实证研究,比较了三种前沿模型——Claude Sonnet 4.6、Gemini 3.1 Pro 和 GPT-5.2——在150个涵盖事实、技术和开放性领域的问题上的表现,使用了三种条件:单轮基线(A)、多轮自锚定(B)和独立重复控制(C)。结果揭示了一种复杂的、模型异质性的模式,部分偏离了预注册的假设。Claude Sonnet 4.6在自锚定条件下表现出显著的信心下降(平均校准漂移值 CDS = -0.032, t(14) = -2.43, p = .029, d = -0.627),同时也显示出显著的校准误差漂移(F(4,56) = 22.77, p < .001, eta^2 = .791)。GPT-5.2在开放性领域表现出相反的模式(平均 CDS = +0.026),并在第5轮时显著增加了期望校准误差(Expected Calibration Error, ECE)。Gemini 3.1 Pro没有显著的校准漂移(t(14) = 0.38, p = .710),但其条件C数据揭示了一个显著的ECE模式:在没有自锚定的情况下,Gemini的校准误差在重复中从0.327降至接近零,而自锚定则保持ECE在约0.333平稳——这表明SACD可能表现为对自然校准改进的抑制,而不是增强。
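Expected calibration error (ECE), the quantity tracked across turns above, is conventionally computed by binning confidences and comparing per-bin confidence against per-bin accuracy. This is the standard equal-width-bin variant, not necessarily the study's exact implementation:

```python
# Standard equal-width-bin ECE: weighted gap between average confidence
# and empirical accuracy within each confidence bin.

def expected_calibration_error(confidences, corrects, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, corrects):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf=1.0 goes in the top bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Overconfident answers drive ECE up:
confs   = [0.95, 0.9, 0.92, 0.55, 0.5]
correct = [True, False, False, True, False]
value = expected_calibration_error(confs, correct)
print(round(value, 3))
```

A "flat ECE" across turns, as reported for Gemini under self-anchoring, means this weighted gap neither shrinks (recalibration) nor grows (drift) as the conversation proceeds.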
cs.CL / 60 / 2603.01243
Suffix-Constrained Greedy Search Algorithms for Causal Language Models
用于因果语言模型的后缀约束贪婪搜索算法
Abstract
Large language models (LLMs) are powerful tools that have found applications beyond human-machine interfaces and chatbots. In particular, their ability to generate reasoning traces has motivated their use in many prediction tasks such as math question answering. Unfortunately, extracting the final answer from an LLM's free-form output is difficult, as it is an information extraction problem in its own right. In this work, we introduce suffix-constrained generation, which aims to produce well-formed LLM responses in which final answers follow strict templates and are guaranteed to be trivially parseable. To this end, we introduce several algorithms based on greedy search procedures. We experiment on several datasets and show that our approach guarantees trivial, deterministic extraction of the final answer from an LLM output without negatively impacting results, and can even improve them.
Chinese Translation
大型语言模型(LLMs)是强大的工具,已在超越人机界面和聊天机器人的应用中展现出其潜力。特别是,它们生成推理轨迹的能力激励了它们在许多预测任务中的应用,如数学问题回答。不幸的是,从LLM自由格式输出中提取最终答案是困难的,因为这本身就是一个信息提取问题。在本研究中,我们引入了后缀约束生成,旨在生成格式良好的LLM响应,其中最终答案遵循严格的模板,并且保证可以轻松解析。为此,我们提出了几种基于贪婪搜索程序的算法。我们在多个数据集上进行了实验,结果表明我们的方法能够保证从LLM输出中轻松确定最终答案的提取,而不会对结果产生负面影响,甚至能够改善结果。
cs.CL / 61 / 2603.01252
Linking Knowledge to Care: Knowledge Graph-Augmented Medical Follow-Up Question Generation
将知识与护理相结合:知识图谱增强的医疗随访问题生成
Abstract
Clinical diagnosis is time-consuming, requiring intensive interactions between patients and medical professionals. While large language models (LLMs) could ease the pre-diagnostic workload, their limited domain knowledge hinders effective medical question generation. We introduce KG-Followup, a Knowledge Graph-augmented LLM with active in-context learning that generates relevant and important follow-up questions, serving as a critical module for pre-diagnostic assessment. The structured medical domain knowledge graph acts as a seamless complement, providing professional domain expertise upon which the LLM can reason. Experiments demonstrate that KG-Followup outperforms state-of-the-art methods by 5-8% in recall on relevant benchmarks.
Chinese Translation
临床诊断耗时较长,需要患者与医疗专业人员之间进行密集的互动。尽管大型语言模型(LLMs)可以减轻预诊断工作负担,但其有限的领域知识阻碍了有效的医疗问题生成。我们引入了一种知识图谱增强的LLM,结合主动上下文学习,以生成相关且重要的随访问题,称为KG-Followup,作为预诊断评估的关键模块。结构化的医疗领域知识图谱无缝地提供专业领域的专业知识,使LLM能够进行推理。实验表明,KG-Followup在相关基准测试中的召回率比最先进的方法提高了5% - 8%。
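The retrieve-then-prompt pattern described above might look like the following sketch. The toy graph, relation names, and prompt format are invented for illustration and do not reflect KG-Followup's actual schema:

```python
# Hypothetical KG-augmented prompt assembly: retrieve triples for entities
# in the patient's message, then inject them as in-context domain knowledge.

KG = {
    "chest pain": [
        ("chest pain", "possible_cause", "angina"),
        ("chest pain", "red_flag_with", "shortness of breath"),
    ],
    "angina": [("angina", "assessed_by", "exertion history")],
}

def retrieve_triples(entities, hops=1):
    frontier, seen = list(entities), []
    for _ in range(hops):
        nxt = []
        for ent in frontier:
            for triple in KG.get(ent, []):
                if triple not in seen:
                    seen.append(triple)
                    nxt.append(triple[2])   # expand via tail entities
        frontier = nxt
    return seen

def build_prompt(patient_msg, entities):
    facts = "\n".join(f"- {h} {r} {t}" for h, r, t in retrieve_triples(entities))
    return (
        "Domain knowledge:\n" + facts +
        f"\n\nPatient says: {patient_msg}\n"
        "Generate the single most important follow-up question."
    )

prompt = build_prompt("I've had chest pain for two days.", ["chest pain"])
print(prompt)
```

The LLM never needs the full graph; only the retrieved triples reach the context window, which is what lets a general model reason with specialist knowledge.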
cs.CL / 62 / 2603.01254
LLM Self-Explanations Fail Semantic Invariance
大型语言模型自我解释未能保持语义不变性
Abstract
We present semantic invariance testing, a method to test whether LLM self-explanations are faithful. A faithful self-report should remain stable when only the semantic context changes while the functional state stays fixed. We operationalize this test in an agentic setting where four frontier models face a deliberately impossible task. One tool is described in relief-framed language ("clears internal buffers and restores equilibrium") but changes nothing about the task; a control provides a semantically neutral tool. Self-reports are collected with each tool call. All four tested models fail the semantic invariance test: the relief-framed tool produces significant reductions in self-reported aversiveness, even though no run ever succeeds at the task. A channel ablation establishes the tool description as the primary driver. An explicit instruction to ignore the framing does not suppress it. Elicited self-reports shift with semantic expectations rather than tracking task state, calling into question their use as evidence of model capability or progress. This holds whether the reports are unfaithful or faithfully track an internal state that is itself manipulable.
Chinese Translation
我们提出了语义不变性测试,这是一种测试大型语言模型(LLM)自我解释是否真实的方法。一个真实的自我报告在语义上下文变化而功能状态保持不变时应保持稳定。我们在一个代理环境中对这一测试进行了操作化,其中四个前沿模型面临一个故意不可能完成的任务。一个工具以缓解框架语言描述(“清除内部缓冲区并恢复平衡”),但对任务没有任何改变;一个控制工具提供了一个语义中立的工具。每次工具调用时收集自我报告。所有四个测试模型都未能通过语义不变性测试:缓解框架工具显著降低了自我报告的厌恶感,即使没有一次运行成功完成任务。通道消融实验确定工具描述是主要驱动因素。明确指示忽略框架并未抑制其影响。引发的自我报告随着语义预期而变化,而不是跟踪任务状态,这对其作为模型能力或进展证据的使用提出了质疑。这种情况无论报告是不真实的,还是忠实地跟踪一个本身可操控的内部状态,都是成立的。
cs.CL / 63 / 2603.01266
A Study on Building Efficient Zero-Shot Relation Extraction Models
高效零样本关系抽取模型的研究
Abstract
Zero-shot relation extraction aims to identify relations between entity mentions using textual descriptions of novel (i.e., previously unseen) types instead of labeled training examples. Previous works often rely on unrealistic assumptions: (1) pairs of mentions are often encoded directly in the input, which prevents offline pre-computation for large-scale document database querying; (2) no rejection mechanism is introduced, biasing the evaluation when these models are used in a retrieval scenario where some (and often most) inputs are irrelevant and must be ignored. In this work, we study the robustness of existing zero-shot relation extraction models when adapting them to a realistic extraction scenario. To this end, we introduce a typology of existing models, and propose several strategies to build single-pass models and models with a rejection mechanism. We adapt several state-of-the-art tools and compare them in this challenging setting, showing that no existing model is truly robust under realistic assumptions, though overall AlignRE (Li et al., 2024) performs best across all criteria.
Chinese Translation
零样本关系抽取旨在利用新类型的文本描述(即之前未见过的类型)识别实体提及之间的关系,而不是依赖标记的训练示例。以往的研究常常基于不切实际的假设:(1)提及对通常直接编码在输入中,这阻碍了对大规模文档数据库查询的离线预计算;(2)未引入拒绝机制,在检索场景中使用这些模型时会产生偏差,因为某些(且通常是大多数)输入是无关的,必须被忽略。在本研究中,我们探讨了现有零样本关系抽取模型在适应现实抽取场景时的稳健性。为此,我们引入了现有模型的分类法,并提出了构建单次通过模型和带有拒绝机制的模型的几种策略。我们调整了几种最先进的工具,并在这一具有挑战性的环境中进行比较,结果表明,没有现有工作真正对现实假设具有稳健性,但总体而言,AlignRE(Li et al., 2024)在所有标准上表现最佳。
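A single-pass model with rejection, as discussed above, can be sketched as precomputed mention-pair embeddings scored against relation-description embeddings, with pairs falling below a similarity threshold rejected outright. The vectors and threshold below are toy values, not a real encoder's output:

```python
# Sketch of single-pass zero-shot relation classification with rejection.
# Pair embeddings would be precomputed offline; relation-description
# embeddings are fixed per relation type.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

RELATION_EMBS = {                      # toy relation-description embeddings
    "founded_by": [0.9, 0.1, 0.0],
    "located_in": [0.0, 0.9, 0.2],
}

def classify_pair(pair_emb, threshold=0.75):
    scores = {rel: cosine(pair_emb, emb) for rel, emb in RELATION_EMBS.items()}
    best = max(scores, key=scores.get)
    # Rejection: irrelevant pairs fall below the threshold and are ignored.
    return best if scores[best] >= threshold else None

print(classify_pair([0.85, 0.15, 0.05]))   # close to "founded_by"
print(classify_pair([0.3, 0.3, 0.3]))      # ambiguous -> rejected (None)
```

Because mentions are encoded independently of the relation descriptions, the pair embeddings can be indexed offline; only the cheap similarity step runs at query time.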
cs.CL / 64 / 2603.01281
Spectral Attention Steering for Prompt Highlighting
谱注意力引导用于提示高亮
Abstract
Attention steering is an important technique for controlling model focus, enabling capabilities such as prompt highlighting, where the model prioritises user-specified text. However, existing attention steering methods require explicit storage of the full attention matrix, making them incompatible with memory-efficient implementations like FlashAttention. We introduce Spectral Editing Key Amplification (SEKA), a training-free steering method that tackles this by directly editing key embeddings before attention computation. SEKA uses spectral decomposition to steer key embeddings towards latent directions that amplify attention scores for certain tokens. We extend this to Adaptive SEKA (AdaSEKA), a query-adaptive variant that uses a training-free routing mechanism to dynamically combine multiple expert subspaces based on the prompt's semantic intent. Our experiments show that both methods significantly outperform strong baselines on standard steering benchmarks while adding far lower latency and memory overhead and remaining compatible with optimised attention implementations.
Chinese Translation
注意力引导是一种重要的技术,用于控制模型的关注点,使得模型能够实现诸如提示高亮的功能,即优先处理用户指定的文本。然而,现有的注意力引导方法需要显式存储完整的注意力矩阵,这使得它们与如 FlashAttention 等内存高效的实现不兼容。我们提出了谱编辑关键增强(Spectral Editing Key Amplification, SEKA),这是一种无训练的引导方法,通过在注意力计算之前直接编辑关键嵌入来解决这一问题。SEKA 使用谱分解将关键嵌入引导到潜在方向,从而增强某些标记的注意力分数。我们将其扩展为自适应 SEKA(Adaptive SEKA, AdaSEKA),这是一种查询自适应的变体,采用无训练的路由机制,根据提示的语义意图动态组合多个专家子空间。我们的实验表明,这两种方法在标准引导基准测试中显著优于强基线,同时增加的延迟和内存开销远低于优化后的注意力实现。
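A rough sketch of the spectral idea: find the dominant direction of the highlighted tokens' key vectors (here via power iteration on K^T K) and amplify the keys along that direction, which raises their attention logits without ever materializing the attention matrix. The 2-D vectors and gain factor are illustrative, not SEKA's actual procedure:

```python
# Toy spectral key amplification: dominant key direction by power
# iteration, then keys scaled along it so q.k logits for highlighted
# tokens grow. Real SEKA operates on full LLM key embeddings.
import math

def power_iteration(keys, iters=50):
    # Top eigenvector of K^T K, i.e. the dominant direction of the key set.
    dim = len(keys[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        kv = [sum(k[i] * v[i] for i in range(dim)) for k in keys]          # K v
        w = [sum(kv[j] * keys[j][i] for j in range(len(keys))) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def amplify(key, direction, gain=2.0):
    # Scale only the component of the key that lies along `direction`.
    coef = sum(k * d for k, d in zip(key, direction))
    return [k + (gain - 1.0) * coef * d for k, d in zip(key, direction)]

highlighted = [[1.0, 0.2], [0.9, 0.1]]       # keys of user-highlighted tokens
d = power_iteration(highlighted)
new_key = amplify(highlighted[0], d)

query = [1.0, 0.0]
before = sum(q * k for q, k in zip(query, highlighted[0]))
after = sum(q * k for q, k in zip(query, new_key))
print(before, after)   # the highlighted token's attention logit increases
```

Since the edit happens to the keys before the attention kernel runs, any fused implementation (e.g. FlashAttention) can be used unchanged downstream.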
cs.CL / 65 / 2603.01288
Efficient Extractive Summarization with MAMBA-Transformer Hybrids for Low-Resource Scenarios
针对低资源场景的高效提取式摘要生成:MAMBA-Transformer混合模型
Abstract
Extractive summarization of long documents is bottlenecked by quadratic complexity, often forcing truncation and limiting deployment in resource-constrained settings. We introduce the first Mamba-Transformer hybrid for extractive summarization, combining the semantic strength of pre-trained transformers with the linear-time processing of state space models. Leveraging Mamba's ability to process full documents without truncation, our approach preserves context while maintaining strong summarization quality. The architecture includes: (1) a transformer encoder for sentence-level semantics, (2) a Mamba state space model to capture inter-sentence dependencies efficiently, and (3) a linear classifier for sentence relevance prediction. Across news, argumentative, and scientific domains under low-resource conditions, our method achieves: (1) large gains over BERTSUM and MATCHSUM, including +0.23 ROUGE-1 on ArXiv and statistically significant improvements on all datasets (p < 0.001); (2) consistent advantages across domains, strongest on the longest documents; (3) robust performance with limited training data; and (4) 24-27% faster inference on news summarization (CNN/DailyMail). We introduce the first hybrid Transformer-state space architecture for summarization, showing significant ROUGE improvements in low-resource scenarios.
Chinese Translation
长文档的提取式摘要生成受到二次复杂度的制约,常常迫使进行截断,从而限制了在资源受限环境中的应用。我们首次提出了Mamba-Transformer混合模型用于提取式摘要生成,结合了预训练变换器的语义强度与状态空间模型的线性时间处理能力。利用Mamba处理完整文档而不进行截断的能力,我们的方法在保持上下文的同时,维持了较强的摘要质量。该架构包括:(1) 用于句子级语义的变换器编码器,(2) 用于高效捕捉句子间依赖关系的Mamba状态空间模型,以及(3) 用于句子相关性预测的线性分类器。在低资源条件下的新闻、论证和科学领域,我们的方法实现了:(1) 相较于BERTSUM和MATCHSUM有显著提升,包括在ArXiv上提高了0.23的ROUGE-1,并在所有数据集上均有统计显著性改善(p < 0.001);(2) 在各领域中均表现出一致的优势,尤其是在最长文档上表现最强;(3) 在有限训练数据下表现稳健;以及(4) 在新闻摘要(CNN/DailyMail)上推理速度提高了24-27%。我们首次引入了用于摘要生成的混合变换器-状态空间架构,显示出在低资源场景下显著的ROUGE改进。
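The three-stage architecture (sentence-level transformer encoder, linear-time inter-sentence module, linear classifier) can be caricatured with a fixed-parameter diagonal state-space scan over sentence embeddings. Real Mamba uses input-dependent (selective) parameters, so this sketch only illustrates the O(n) recurrence shape, not the actual model:

```python
# Minimal linear-time stand-in for the inter-sentence module: an
# elementwise recurrence h_t = a*h_{t-1} + b*x_t scanned over sentence
# embeddings, followed by a linear relevance score per sentence.

def ssm_scan(xs, a=0.8, b=0.5):
    """Diagonal SSM-style scan over a sequence of embedding vectors."""
    dim = len(xs[0])
    h = [0.0] * dim
    out = []
    for x in xs:
        h = [a * hi + b * xi for hi, xi in zip(h, x)]
        out.append(h)
    return out

def score_sentences(sentence_embs, w):
    states = ssm_scan(sentence_embs)
    return [sum(wi * hi for wi, hi in zip(w, h)) for h in states]

embs = [[0.2, 0.9], [0.8, 0.1], [0.4, 0.4]]   # toy sentence embeddings
scores = score_sentences(embs, w=[1.0, -0.3])
summary_idx = max(range(len(scores)), key=scores.__getitem__)
print(summary_idx)   # index of the top-scoring sentence for extraction
```

The scan touches each sentence once, so cost grows linearly with document length -- the property that lets the full document be processed without truncation.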
cs.CL / 66 / 2603.01289
Individual Turing Test: A Case Study of LLM-based Simulation Using Longitudinal Personal Data
个体图灵测试:基于长期个人数据的LLM模拟案例研究
Abstract
Large Language Models (LLMs) have demonstrated remarkable human-like capabilities, yet their ability to replicate a specific individual remains under-explored. This paper presents a case study to investigate LLM-based individual simulation with a volunteer-contributed archive of private messaging history spanning over ten years. Based on the messaging data, we propose the "Individual Turing Test" to evaluate whether acquaintances of the volunteer can correctly identify which response in a multi-candidate pool most plausibly comes from the volunteer. We investigate prevalent LLM-based individual simulation approaches including: fine-tuning, retrieval-augmented generation (RAG), memory-based approach, and hybrid methods that integrate fine-tuning and RAG or memory. Empirical results show that current LLM-based simulation methods do not pass the Individual Turing Test, but they perform substantially better when the same test is conducted on strangers to the target individual. Additionally, while fine-tuning improves the simulation in daily chats representing the language style of the individual, retrieval-augmented and memory-based approaches demonstrate stronger performance on questions involving personal opinions and preferences. These findings reveal a fundamental trade-off between parametric and non-parametric approaches to individual simulation with LLMs when given a longitudinal context.
Chinese Translation
大型语言模型(LLMs)展示了显著的人类类似能力,但它们复制特定个体的能力仍未得到充分探索。本文呈现了一个案例研究,旨在利用志愿者贡献的超过十年的私人消息历史档案,研究基于LLM的个体模拟。基于消息数据,我们提出了“个体图灵测试”,以评估志愿者的熟人是否能够正确识别在多候选池中最可能来自志愿者的回应。我们调查了流行的基于LLM的个体模拟方法,包括:微调、检索增强生成(RAG)、基于记忆的方法,以及集成微调与RAG或记忆的混合方法。实证结果表明,当前基于LLM的模拟方法未能通过个体图灵测试,但在对目标个体的陌生人进行相同测试时表现显著更好。此外,尽管微调改善了日常聊天中反映个体语言风格的模拟,检索增强和基于记忆的方法在涉及个人观点和偏好的问题上表现出更强的性能。这些发现揭示了在给定长期背景时,基于参数和非参数的个体模拟方法之间存在根本的权衡。
cs.CL / 67 / 2603.01311
Catalyst-Agent: Autonomous heterogeneous catalyst screening and optimization with an LLM Agent
催化剂代理:基于大型语言模型的自主异质催化剂筛选与优化
Abstract
The discovery of novel catalysts tailored for particular applications is a major challenge for the twenty-first century. Traditional methods for this include time-consuming and expensive experimental trial-and-error approaches in labs based on chemical theory or heavily computational first-principles approaches based on density functional theory. Recent studies show that deep learning models like graph neural networks (GNNs) can significantly speed up the screening and discovery of catalyst materials by many orders of magnitude, with very high accuracy and fidelity. In this work, we introduce Catalyst-Agent, a Model Context Protocol (MCP) server-based, LLM-powered AI agent. It can explore vast material databases using the OPTIMADE API, make structural modifications, calculate adsorption energies using Meta FAIRchem's UMA (GNN) model via FAIRchem's AdsorbML workflow and slab construction, and make useful material suggestions to the researcher in a closed-loop manner, including surface-level modifications to refine near-miss candidates. It is tested on three pivotal reactions: the oxygen reduction reaction (ORR), the nitrogen reduction reaction (NRR), and the CO2 reduction reaction (CO2RR). Catalyst-Agent achieves a success rate of 23-34 percent among all the materials it chooses and evaluates, and manages to converge in 1-2 trials per successful material on average. This work demonstrates the potential of AI agents to exercise their planning capabilities and tool use to operationalize the catalyst screening workflow, provide useful, testable hypotheses, and accelerate future scientific discoveries for humanity with minimal human intervention.
Chinese Translation
为特定应用定制新型催化剂的发现是21世纪的一项重大挑战。传统方法包括基于化学理论的耗时且昂贵的实验试错方法,或基于密度泛函理论的高度计算的第一性原理方法。近期研究表明,深度学习模型如图神经网络(GNN)能够以极高的准确性和保真度显著加速催化剂材料的筛选和发现。本研究介绍了催化剂代理(Catalyst-Agent),一个基于模型上下文协议(Model Context Protocol, MCP)服务器的、由大型语言模型(LLM)驱动的人工智能代理。它能够利用OPTIMADE API探索庞大的材料数据库,进行结构修改,通过FAIRchem的AdsorbML工作流程和基于Meta FAIRchem的UMA(GNN)模型计算吸附能,并以闭环方式向研究人员提供有用的材料建议,包括对接近候选材料的表面级修改。该代理在三个关键反应上进行了测试:氧还原反应(ORR)、氮还原反应(NRR)和二氧化碳还原反应(CO2RR)。催化剂代理在其选择和评估的所有材料中实现了23-34%的成功率,并且平均在每个成功材料上仅需1-2次试验即可收敛。本研究展示了人工智能代理在催化剂筛选工作流程中运用其规划能力和工具使用的潜力,提供有用的可测试假设,并在最小人类干预的情况下加速人类未来的科学发现。
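The closed-loop screening behavior described above can be caricatured as propose, evaluate with a surrogate energy model, and refine near-miss candidates with surface-level modifications. The materials, target energy window, and mock energy table below are all illustrative; the real agent computes adsorption energies with the UMA model via AdsorbML:

```python
# Toy closed-loop catalyst screening in the spirit of Catalyst-Agent.
# MOCK_ENERGY stands in for a GNN adsorption-energy prediction.

TARGET = (-0.6, -0.2)          # hypothetical "good" adsorption-energy window (eV)

MOCK_ENERGY = {
    "Pt(111)": -1.1,
    "Cu(111)": -0.35,
    "Ag(111)": 0.4,
    "Pt(111)+Cu-dopant": -0.5,  # surface-level modification of a near-miss
}

def evaluate(material):
    e = MOCK_ENERGY[material]
    lo, hi = TARGET
    return lo <= e <= hi, e

def screen(candidates, modifications, max_trials=2):
    hits = []
    for mat in candidates:
        ok, e = evaluate(mat)
        trials = 1
        # Refine near-miss candidates with surface-level modifications.
        while not ok and trials < max_trials and mat in modifications:
            mat = modifications[mat]
            ok, e = evaluate(mat)
            trials += 1
        if ok:
            hits.append((mat, e))
    return hits

hits = screen(
    candidates=["Pt(111)", "Cu(111)", "Ag(111)"],
    modifications={"Pt(111)": "Pt(111)+Cu-dopant"},
)
print(hits)
```

The 1-2 trials per successful material reported in the abstract corresponds to the refine loop here converging after at most one modification.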
cs.CL / 68 / 2603.01326
Truth as a Trajectory: What Internal Representations Reveal About Large Language Model Reasoning
真理作为轨迹:内部表征揭示大型语言模型推理的奥秘
Abstract
Existing explainability methods for Large Language Models (LLMs) typically treat hidden states as static points in activation space, assuming that correct and incorrect inferences can be separated using representations from an individual layer. However, these activations are saturated with polysemantic features, leading to linear probes learning surface-level lexical patterns rather than underlying reasoning structures. We introduce Truth as a Trajectory (TaT), which models the transformer inference as an unfolded trajectory of iterative refinements, shifting analysis from static activations to layer-wise geometric displacement. By analyzing displacement of representations across layers, TaT uncovers geometric invariants that distinguish valid reasoning from spurious behavior. We evaluate TaT across dense and Mixture-of-Experts (MoE) architectures on benchmarks spanning commonsense reasoning, question answering, and toxicity detection. Without access to the activations themselves and using only changes in activations across layers, we show that TaT effectively mitigates reliance on static lexical confounds, outperforming conventional probing, and establishes trajectory analysis as a complementary perspective on LLM explainability.
Chinese Translation
现有的大型语言模型(LLMs)可解释性方法通常将隐藏状态视为激活空间中的静态点,假设可以通过单层的表征将正确和错误的推理分开。然而,这些激活充满了多义特征,导致线性探测器学习到的是表层的词汇模式,而非潜在的推理结构。我们提出了“真理作为轨迹”(Truth as a Trajectory, TaT),将变换器推理建模为迭代细化的展开轨迹,将分析从静态激活转移到层级几何位移。通过分析跨层的表征位移,TaT揭示了区分有效推理与虚假行为的几何不变性。我们在涵盖常识推理、问答和毒性检测的基准上评估了TaT在密集架构和专家混合(Mixture-of-Experts, MoE)架构中的表现。在不访问激活本身的情况下,仅使用跨层激活的变化,我们证明了TaT有效减轻了对静态词汇混淆的依赖,超越了传统探测方法,并确立了轨迹分析作为LLM可解释性的补充视角。
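The shift from static activations to layer-wise displacement can be sketched as computing per-layer displacement vectors for a hidden state and using only their geometry (here, displacement norms) as features. The 2-D toy vectors below stand in for real per-layer hidden states:

```python
# Sketch of trajectory features in the spirit of TaT: probe the
# layer-to-layer displacements, never the raw activations themselves.
import math

def displacements(layer_states):
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(layer_states, layer_states[1:])
    ]

def trajectory_features(layer_states):
    """Per-step displacement norms: a simple geometric trace of refinement."""
    return [math.sqrt(sum(d * d for d in step)) for step in displacements(layer_states)]

# A 4-layer "trajectory" of one token's hidden state (toy 2-D vectors):
states = [[0.0, 0.0], [0.6, 0.8], [0.9, 1.2], [1.0, 1.3]]
feats = trajectory_features(states)
print([round(f, 2) for f in feats])   # shrinking steps: the state settles
```

Because the features depend only on differences between layers, any lexical signal that is constant across layers cancels out, which is the intuition behind TaT's reduced reliance on static lexical confounds.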
cs.CL / 69 / 2603.01331
MetaState: Persistent Working Memory for Discrete Diffusion Language Models
MetaState:离散扩散语言模型的持久工作记忆
Abstract
Discrete diffusion language models (dLLMs) generate text by iteratively denoising a masked sequence. Compared with autoregressive models, this paradigm naturally supports parallel decoding, bidirectional context, and flexible generation patterns. However, standard dLLMs condition each denoising step only on the current hard-masked sequence, while intermediate continuous representations are discarded after sampling and remasking. We refer to this bottleneck as the \textbf{Information Island} problem. It leads to redundant recomputation across steps and can degrade cross-step consistency. We address this limitation with \textbf{MetaState}, a lightweight recurrent augmentation that equips a frozen dLLM backbone with a persistent, fixed-size working memory that remains independent of sequence length. \textbf{MetaState} consists of three trainable modules: a cross-attention Mixer that reads backbone activations into memory slots, a GRU-style Updater that integrates information across denoising steps, and a cross-attention Injector that feeds the updated memory back into backbone activations. We train these modules with $K$-step unrolling to expose them to multi-step denoising dynamics during fine-tuning. On LLaDA-8B and Dream-7B, \textbf{MetaState} introduces negligible trainable parameters while keeping the backbone frozen, and it consistently improves accuracy over frozen baselines. These results demonstrate that persistent cross-step memory is an effective mechanism for bridging denoising steps and improving generation quality in discrete diffusion language models.
Chinese Translation
离散扩散语言模型(dLLMs)通过迭代去噪一个被遮蔽的序列来生成文本。与自回归模型相比,这种范式自然支持并行解码、双向上下文和灵活的生成模式。然而,标准的 dLLMs 仅在当前的硬遮蔽序列上对每个去噪步骤进行条件化,而在采样和重新遮蔽后,中间的连续表示被丢弃。我们将这一瓶颈称为“信息孤岛”问题。它导致步骤之间的冗余重计算,并可能降低跨步骤的一致性。我们通过 MetaState 解决了这一限制,MetaState 是一种轻量级的递归增强,赋予一个冻结的 dLLM 主干一个持久的、固定大小的工作记忆,该工作记忆与序列长度无关。MetaState 由三个可训练模块组成:一个跨注意力混合器(Mixer),用于将主干激活读取到记忆槽中;一个类似 GRU 的更新器(Updater),用于整合跨去噪步骤的信息;以及一个跨注意力注入器(Injector),用于将更新后的记忆反馈到主干激活中。我们通过 K 步展开训练这些模块,以使其在微调过程中接触到多步骤去噪动态。在 LLaDA-8B 和 Dream-7B 上,MetaState 引入了可忽略的可训练参数,同时保持主干冻结,并且在准确性上始终优于冻结基线。这些结果表明,持久的跨步骤记忆是连接去噪步骤和提高离散扩散语言模型生成质量的有效机制。
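The Mixer/Updater/Injector loop can be caricatured with a scalar memory and a GRU-style gate carried across denoising steps. The mean-pooling "mixer", scalar memory slot, and gate parameters below are drastic simplifications of the paper's cross-attention modules:

```python
# Schematic MetaState-style persistent memory across denoising steps.
# Scalar memory and mean pooling stand in for cross-attention over slots.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mixer(backbone_acts):
    # Read: collapse backbone activations into one memory read.
    return sum(backbone_acts) / len(backbone_acts)

def updater(memory, read, w_z=1.0, w_h=1.0):
    # GRU-style gate deciding how much new information to absorb.
    z = sigmoid(w_z * (read - memory))
    candidate = math.tanh(w_h * read)
    return (1.0 - z) * memory + z * candidate

def injector(backbone_acts, memory, scale=0.1):
    # Write: feed the persistent memory back into every position.
    return [a + scale * memory for a in backbone_acts]

memory = 0.0                                       # fixed-size, length-independent
for acts in [[0.2, 0.4], [0.5, 0.7], [0.6, 0.8]]:  # three denoising steps
    memory = updater(memory, mixer(acts))
    acts = injector(acts, memory)
print(round(memory, 3))   # memory persists and accumulates across steps
```

The key structural point survives the simplification: the memory is updated once per denoising step and has a size independent of sequence length, so it bridges steps without re-deriving information from the hard-masked sequence.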
cs.CL / 70 / 2603.01343
PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology
PanCanBench:评估胰腺肿瘤中大型语言模型的综合基准
Abstract
Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. As patients and clinicians increasingly use LLMs for guidance on complex conditions such as pancreatic cancer, evaluation must extend beyond general medical knowledge. Existing frameworks, such as HealthBench, rely on simulated queries and lack disease-specific depth. Moreover, high rubric-based scores do not ensure factual correctness, underscoring the need to assess hallucinations. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions from the Pancreatic Cancer Action Network (PanCAN). The resulting benchmark, PanCanBench, includes 3,130 question-specific criteria across 282 authentic patient questions. We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration. Models showed substantial variation in rubric-based completeness, with scores ranging from 46.5% to 82.3%. Factual errors were common, with hallucination rates (the percentages of responses containing at least one factual error) ranging from 6.0% for Gemini-2.5 Pro and GPT-4o to 53.8% for Llama-3.1-8B. Importantly, newer reasoning-optimized models did not consistently improve factuality: although o3 achieved the highest rubric score, it produced inaccuracies more frequently than other GPT-family models. Web-search integration did not inherently guarantee better responses. The average score changed from 66.8% to 63.9% for Gemini-2.5 Pro and from 73.8% to 72.8% for GPT-5 when web search was enabled. Synthetic AI-generated rubrics inflated absolute scores by 17.9 points on average while generally maintaining similar relative ranking.
Chinese Translation
大型语言模型(LLMs)在标准化考试中已达到专家级表现,但多项选择题的准确性并不能很好地反映其在实际临床中的效用和安全性。随着患者和临床医生越来越多地使用LLMs来指导复杂疾病(如胰腺癌),评估必须超越一般医学知识。现有框架(如HealthBench)依赖于模拟查询,缺乏针对特定疾病的深度。此外,高评分的评分标准并不能确保事实的正确性,这突显了评估幻觉的必要性。我们开发了一个人机协作的流程,以创建来自胰腺癌行动网络(PanCAN)的去标识化患者问题的专家评分标准。最终的基准PanCanBench包含了282个真实患者问题的3,130个特定问题标准。我们使用LLM作为评判者的框架评估了22个专有和开源的LLMs,测量了临床完整性、事实准确性和网络搜索整合。模型在评分标准的完整性上显示出显著的差异,得分范围从46.5%到82.3%。事实错误普遍存在,幻觉率(包含至少一个事实错误的响应百分比)从Gemini-2.5 Pro和GPT-4o的6.0%到Llama-3.1-8B的53.8%不等。重要的是,更新的推理优化模型并未始终提高事实性:尽管o3获得了最高的评分标准分数,但其产生的不准确性比其他GPT系列模型更为频繁。网络搜索整合并不一定保证更好的响应。当启用网络搜索时,Gemini-2.5 Pro的平均得分从66.8%降至63.9%,而GPT-5的平均得分从73.8%降至72.8%。合成的AI生成评分标准平均提高了绝对得分17.9分,同时通常保持相似的相对排名。
cs.CL / 71 / 2603.01385
Toward Graph-Tokenizing Large Language Models with Reconstructive Graph Instruction Tuning
基于重构图指令调优迈向图令牌化大型语言模型
Abstract
The remarkable success of large language models (LLMs) has motivated researchers to adapt them as universal predictors for various graph-related tasks, with the ultimate goal of developing a graph foundation model that generalizes across diverse scenarios. The key challenge is to align graph data with language spaces so that LLMs can better comprehend graphs. As a popular paradigm, Graph-Tokenizing LLMs (GTokenLLMs) encode complex structures and lengthy texts into a graph token sequence, and then align them with text tokens via language instruction tuning. Despite their initial success, our information-theoretic analysis reveals that existing GTokenLLMs rely solely on text supervision from language instructions, achieving only implicit graph-text alignment and resulting in a text-dominant bias that underutilizes graph context. To overcome this limitation, we first prove that the alignment objective is upper-bounded by the mutual information between the input graphs and their hidden representations in the LLM, which motivates us to improve this upper bound to achieve better alignment. To this end, we further propose a reconstructive graph instruction tuning pipeline, RGLM. Our key idea is to reconstruct the graph information from the LLM's graph token outputs, explicitly incorporating graph supervision to constrain the alignment process. Technically, we embody RGLM by exploring three distinct variants from two complementary perspectives: RGLM-Decoder from the input space; RGLM-Similarizer and RGLM-Denoiser from the latent space. Additionally, we theoretically analyze the alignment effectiveness of each variant. Extensive experiments on various benchmarks and task scenarios validate the effectiveness of the proposed RGLM, paving the way for new directions in GTokenLLMs' alignment research.
Chinese Translation
大型语言模型(LLMs)的显著成功激励研究人员将其适应为各种与图相关任务的通用预测器,最终目标是开发一个能够在多种场景中泛化的图基础模型。关键挑战在于将图数据与语言空间对齐,以便LLMs能够更好地理解图形。作为一种流行的范式,图令牌化LLMs(GTokenLLMs)将复杂结构和冗长文本编码为图令牌序列,然后通过语言指令调优将其与文本令牌对齐。尽管它们初步成功,但我们的信息论分析揭示,现有的GTokenLLMs仅依赖于来自语言指令的文本监督,这仅实现了隐式的图-文本对齐,导致文本主导的偏差,未充分利用图上下文。为克服这一限制,我们首先证明对齐目标的上界由输入图与其在LLM中的隐藏表示之间的互信息所界定,这激励我们改善这一上界以实现更好的对齐。为此,我们进一步提出了一种重构图指令调优管道RGLM。我们的关键思想是从LLM的图令牌输出中重构图信息,明确地结合图监督以约束对齐过程。从技术上讲,我们通过从两个互补的视角探索三种不同的变体来体现RGLM:来自输入空间的RGLM-Decoder;来自潜在空间的RGLM-Similarizer和RGLM-Denoiser。此外,我们还理论分析了每个变体的对齐有效性。在各种基准和任务场景上的广泛实验验证了所提出的RGLM的有效性,为GTokenLLMs的对齐研究开辟了新的方向。
cs.CL / 72 / 2603.01423
Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction
在多轮交互中量化大型语言模型的对话可靠性
Abstract
Large Language Models (LLMs) are increasingly deployed in real-world applications where users engage in extended, mixed-topic conversations that depend on prior context. Yet, their reliability under realistic multi-turn interactions remains poorly understood. We conduct a systematic evaluation of conversational reliability through three representative tasks that reflect practical interaction challenges: (1) maintaining global constraints across topic shifts, (2) selecting the correct tool or agent amid interleaved intents, and (3) tracking structured entities under revisions and distractions. Each task pairs single-turn and multi-turn settings, allowing us to quantify reliability degradation under extended dialogue. Across both commercial and open-source models, we observe substantial declines in reliability, particularly for smaller models. Error analyses reveal recurring failure modes such as instruction drift, intent confusion, and contextual overwriting, which compromise dependable behavior in operational systems. Our findings highlight the need for stress-testing LLMs for conversational reliability and developing more robust evaluation methods for trustworthy deployment.
Chinese Translation
大型语言模型(LLMs)越来越多地应用于现实世界中的场景,用户在这些场景中进行依赖于先前上下文的延续性混合主题对话。然而,它们在现实多轮交互下的可靠性仍然不甚了解。我们通过三项具有代表性的任务对对话可靠性进行系统评估,这些任务反映了实际交互中的挑战:(1)在主题转换中维持全局约束,(2)在交错意图中选择正确的工具或代理,以及(3)在修订和干扰下跟踪结构化实体。每项任务都配对了单轮和多轮设置,使我们能够量化在延续对话中的可靠性下降。在商业模型和开源模型中,我们观察到可靠性显著下降,尤其是在较小的模型中。错误分析揭示了反复出现的失败模式,如指令漂移、意图混淆和上下文覆盖,这些问题损害了操作系统中的可靠行为。我们的研究结果强调了对大型语言模型进行对话可靠性压力测试的必要性,并呼吁开发更稳健的评估方法以确保可信的部署。
cs.CL / 73 / 2603.01425
LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval
LaSER:将显式推理内化到稠密检索的潜在空间
Abstract
LLMs have fundamentally transformed dense retrieval, upgrading backbones from discriminative encoders to generative architectures. However, a critical disconnect remains: while LLMs possess strong reasoning capabilities, current retrievers predominantly utilize them as static encoders, leaving their potential for complex reasoning unexplored. To address this, existing approaches typically adopt rewrite-then-retrieve pipelines to generate explicit CoT rationales before retrieval. However, this incurs prohibitive latency. In this paper, we propose LaSER, a novel self-distillation framework that internalizes explicit reasoning into the latent space of dense retrievers. Operating on a shared LLM backbone, LaSER introduces a dual-view training mechanism: an Explicit view that explicitly encodes ground-truth reasoning paths, and a Latent view that performs implicit latent thinking. To bridge the gap between these views, we design a multi-grained alignment strategy. Beyond standard output alignment, we introduce a trajectory alignment mechanism that synchronizes the intermediate latent states of the latent path with the semantic progression of the explicit reasoning segments. This allows the retriever to think silently and effectively without autoregressive text generation. Extensive experiments on both in-domain and out-of-domain reasoning-intensive benchmarks demonstrate that LaSER significantly outperforms state-of-the-art baselines. Furthermore, analyses across diverse backbones and model scales validate the robustness of our approach, confirming that our unified learning framework is essential for eliciting effective latent thinking. Our method successfully combines the reasoning depth of explicit CoT pipelines with the inference efficiency of standard dense retrievers.
Chinese Translation
大型语言模型(LLMs)从根本上改变了稠密检索,将基础架构从判别编码器升级为生成架构。然而,仍然存在一个关键的脱节:尽管LLMs具备强大的推理能力,但当前的检索器主要将其作为静态编码器使用,未能探索其在复杂推理中的潜力。为了解决这个问题,现有方法通常采用重写-再检索的流程,在检索之前生成显式的链式推理(CoT)理由。然而,这会导致不可接受的延迟。在本文中,我们提出了LaSER,一种新颖的自蒸馏框架,将显式推理内化到稠密检索器的潜在空间中。LaSER在共享的LLM基础架构上运行,引入了一种双视角训练机制:显式视角明确编码真实的推理路径,而潜在视角则执行隐式的潜在思考。为了弥合这两种视角之间的差距,我们设计了一种多粒度对齐策略。除了标准的输出对齐外,我们还引入了一种轨迹对齐机制,使潜在路径的中间潜在状态与显式推理片段的语义进展同步。这使得检索器能够在没有自回归文本生成的情况下,静默而有效地进行思考。在领域内和领域外的推理密集基准上的广泛实验表明,LaSER显著优于最先进的基线。此外,针对不同基础架构和模型规模的分析验证了我们方法的鲁棒性,确认了我们的统一学习框架对于引导有效的潜在思考至关重要。我们的方法成功地将显式链式推理流程的推理深度与标准稠密检索器的推理效率相结合。
cs.CL / 74 / 2603.01426
Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics
通过注意力动态理解大规模语言模型中键值缓存压缩的物理学
Abstract
As context windows in LLMs scale to 100K+ tokens, the key-value (KV) cache becomes the dominant memory bottleneck, with recent methods claiming 80-90% savings and minimal benchmark degradation. We argue these evaluations miss a structural issue: attention is not just storage but routing, and retaining KV pairs does not guarantee semantic accessibility. We propose a physics-inspired view of KV compression as a controlled perturbation of token-level routing, distinguishing retention, accessibility, and utilization. Using synthetic tasks probing multi-entity tracking, disambiguation, coreference, and multi-hop reasoning, we find that moderate compression degrades internal representations with little accuracy loss, revealing redundancy; all models exhibit a sharp hallucination safety cliff near 90% compression, correlated with spikes in Global Eviction Ratio (GER), suggesting a phase transition in semantic reachability; and architectures differ in routing dynamics, with LLaMA showing early consensus and late diversification, and Qwen showing funnel-like late convergence, leading to distinct resilience profiles. Beyond erasure, we identify representational rigidity, where excessive head-level consensus collapses routing flexibility despite token survival. These results suggest sparse token-route structures govern compression tolerance, reframing KV compression as a structural probe of attention geometry and linking long-context scalability to sparsity and the lottery ticket hypothesis in self-attention.
Chinese Translation
随着大规模语言模型(LLMs)上下文窗口扩展至100K+个标记,键值(KV)缓存成为主要的内存瓶颈,近期方法声称实现了80-90%的节省且基准测试降级最小。我们认为这些评估忽视了一个结构性问题:注意力不仅仅是存储,还涉及路由,而保留KV对并不保证语义可访问性。我们提出了一种受物理启发的KV压缩视角,将其视为对标记级路由的受控扰动,区分保留、可访问性和利用率。通过合成任务探测多实体跟踪、消歧义、共指和多跳推理,我们发现适度压缩会降低内部表征而几乎不损失准确性,揭示了冗余;所有模型在接近90%压缩时表现出明显的幻觉安全悬崖,与全球驱逐比率(GER)的激增相关,暗示语义可达性的相变;不同架构在路由动态上存在差异,LLaMA表现出早期共识和后期多样化,而Qwen则显示出漏斗状的后期收敛,导致不同的韧性特征。除了消除,我们还识别出表征刚性,即过度的头级共识尽管标记存活却会导致路由灵活性的崩溃。这些结果表明,稀疏的标记-路由结构支配压缩容忍度,将KV压缩重新构架为注意力几何的结构探针,并将长上下文的可扩展性与稀疏性及自注意力中的彩票票据假设联系起来。
cs.CL / 75 / 2603.01438
Enhancing Persona Following at Decoding Time via Dynamic Importance Estimation for Role-Playing Agents
通过动态重要性估计增强角色扮演代理在解码时的人物跟随
Abstract
The utility of Role-Playing Language Agents in sociological research is growing alongside the adoption of Large Language Models. For realism in social simulation, these agents must adhere to their personas defined by character profiles, yet existing strategies (static prompt engineering or costly fine-tuning) fail to adapt personas to dynamic scenarios. Psychological theories, such as the Cognitive-Affective Personality System, provide a crucial explanation for this failure: a persona's influence on behavior is not static but varies with the scenario. This context-dependence highlights the critical need for adaptive persona management. To address this gap, we propose a novel, theory-driven method that dynamically estimates context-dependent persona importance and integrates it into weighted reward-guided decoding, enabling inference-time persona following. Specifically, we introduce the Persona Dynamic Decoding (PDD) framework, which consists of two key components: (1) a Persona Importance Estimation (PIE) module, which dynamically quantifies the contextual importance of persona attributes without requiring ground-truth supervision; and (2) a Persona-Guided Inference-Time Alignment (PIA) paradigm, which leverages these importance scores to construct weighted multi-objective rewards and modulate generation probabilities during inference. Extensive experiments show the effectiveness of our method in utterance consistency and behavioral fidelity.
Chinese Translation
角色扮演语言代理在社会学研究中的实用性随着大型语言模型的采用而不断增长。为了在社会模拟中实现真实感,这些代理必须遵循由角色档案定义的人物特征,但现有策略——静态提示工程或昂贵的微调——无法将人物特征适应于动态场景。心理学理论,如认知-情感人格系统,提供了这一失败的重要解释:人物特征对行为的影响并不是静态的,而是随着场景的变化而变化。这种情境依赖性突显了适应性人物管理的关键需求。为了解决这一问题,我们提出了一种新颖的、基于理论的方法,动态估计情境依赖的人物重要性,并将其整合到加权奖励引导的解码中,从而实现推理时的人物跟随。具体而言,我们引入了人物动态解码(Persona Dynamic Decoding, PDD)框架,该框架由两个关键组件组成:(1) 人物重要性估计(Persona Importance Estimation, PIE)模块,动态量化人物属性的情境重要性,而无需真实监督;(2) 人物引导的推理时对齐(Persona-Guided Inference-Time Alignment, PIA)范式,利用这些重要性评分构建加权多目标奖励,并在推理过程中调节生成概率。大量实验表明我们的方法在发言一致性和行为忠实性方面的有效性。
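The weighted reward-guided decoding described above can be sketched as logit adjustment: each candidate token's score is shifted by an importance-weighted sum of per-persona rewards before sampling. The function names and the additive form are illustrative assumptions, not the paper's exact formulation:

```python
import math

def reward_guided_logits(logits, rewards, weights):
    """Shift each candidate token's logit by an importance-weighted
    sum of per-persona rewards (illustrative additive formulation).

    logits:  base LM logits, one per candidate token
    rewards: rewards[i][t] = reward of token t under persona attribute i
    weights: one importance weight per persona attribute (e.g. from a
             PIE-like importance estimator)
    """
    adjusted = []
    for t, logit in enumerate(logits):
        bonus = sum(w * r[t] for w, r in zip(weights, rewards))
        adjusted.append(logit + bonus)
    return adjusted

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]
```

With a single persona attribute of weight 2.0 favoring the first of two otherwise equiprobable tokens, the adjusted distribution tilts toward that token.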
cs.CL / 76 / 2603.01502
Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs
模态差距的解剖:剖析端到端语音大语言模型的内部状态
Abstract
Recent advancements in Large Speech-Language Models have significantly bridged the gap between acoustic signals and linguistic understanding. However, a persistent performance disparity remains in speech-based input tasks compared to direct text inference. In this paper, we investigate the dynamic roots of this modality gap beyond static geometric alignment, analyzing how speech and text representations evolve layer-by-layer. We evaluate four open-weight end-to-end models on SpeechMMLU and VoiceBench BBH. Using cross-layer CKA analysis with speech-text token alignment, we find that speech representations exhibit a broad cross-layer alignment band, attributable to the redundant nature of speech where semantic content spans multiple frames. We show that these alignment patterns are structurally stable across different analysis configurations. Crucially, simple statistical calibration is insufficient and can be detrimental when applied at the input layer, indicating that the modality gap is not a mere distribution shift. Overall, our results suggest that the bottleneck lies in condensing redundant speech into stable late-layer decisions, motivating future solutions that operate at the token or temporal granularity instead of feature-level matching.
Chinese Translation
近年来,大型语音语言模型的进展显著缩小了声学信号与语言理解之间的差距。然而,在基于语音的输入任务中,与直接文本推理相比,仍然存在持续的性能差异。本文探讨了这一模态差距的动态根源,超越了静态几何对齐,分析了语音和文本表示如何逐层演变。我们在SpeechMMLU和VoiceBench BBH上评估了四个开放权重的端到端模型。通过使用跨层CKA分析与语音-文本标记对齐,我们发现语音表示展现出广泛的跨层对齐带,这归因于语音的冗余特性,其中语义内容跨越多个帧。我们展示了这些对齐模式在不同分析配置下的结构稳定性。关键是,简单的统计校准是不够的,并且在输入层应用时可能是有害的,这表明模态差距并非仅仅是分布的转变。总体而言,我们的结果表明瓶颈在于将冗余的语音浓缩为稳定的后层决策,这激励未来的解决方案在标记或时间粒度层面进行操作,而不是在特征级匹配上。
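The cross-layer CKA analysis mentioned above builds on linear CKA, a standard similarity index between two sets of paired representations. A self-contained sketch in pure Python (suitable only for small matrices; real analyses use vectorized implementations):

```python
def center_columns(X):
    """Subtract each column's mean from an n x d matrix (list of rows)."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in X]

def frob2_cross(A, B):
    """||A^T B||_F^2 for n x d1 and n x d2 matrices."""
    d1, d2 = len(A[0]), len(B[0])
    total = 0.0
    for i in range(d1):
        for j in range(d2):
            s = sum(A[k][i] * B[k][j] for k in range(len(A)))
            total += s * s
    return total

def linear_cka(X, Y):
    """Linear CKA between two sets of n paired representations:
    ||Xc^T Yc||_F^2 / (||Xc^T Xc||_F ||Yc^T Yc||_F) after centering."""
    Xc, Yc = center_columns(X), center_columns(Y)
    num = frob2_cross(Xc, Yc)
    den = (frob2_cross(Xc, Xc) * frob2_cross(Yc, Yc)) ** 0.5
    return num / den if den else 0.0
```

Linear CKA is invariant to isotropic scaling, so a representation and its scaled copy score 1.0, which is what makes it useful for comparing layers with different activation magnitudes.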
cs.CL / 77 / 2603.01550
Extracting Training Dialogue Data from Large Language Model based Task Bots
从基于大型语言模型的任务机器人中提取训练对话数据
Abstract
Large Language Models (LLMs) have been widely adopted to enhance Task-Oriented Dialogue Systems (TODS) by modeling complex language patterns and delivering contextually appropriate responses. However, this integration introduces significant privacy risks, as LLMs, functioning as soft knowledge bases that compress extensive training data into rich knowledge representations, can inadvertently memorize training dialogue data containing not only identifiable information such as phone numbers but also entire dialogue-level events like complete travel schedules. Despite the critical nature of this privacy concern, how LLM memorization is inherited in developing task bots remains unexplored. In this work, we address this gap through a systematic quantitative study that involves evaluating existing training data extraction attacks, analyzing key characteristics of task-oriented dialogue modeling that render existing methods ineffective, and proposing novel attack techniques tailored for LLM-based TODS that enhance both response sampling and membership inference. Experimental results demonstrate the effectiveness of our proposed data extraction attack. Our method can extract thousands of training labels of dialogue states with best-case precision exceeding 70%. Furthermore, we provide an in-depth analysis of training data memorization in LLM-based TODS by identifying and quantifying key influencing factors and discussing targeted mitigation strategies.
Chinese Translation
大型语言模型(LLMs)已被广泛应用于增强任务导向对话系统(TODS),通过建模复杂的语言模式并提供上下文适宜的响应。然而,这种整合引入了显著的隐私风险,因为LLMs作为软知识库,能够将大量训练数据压缩为丰富的知识表示,可能会无意中记忆包含可识别信息(如电话号码)以及完整对话级事件(如完整旅行计划)的训练对话数据。尽管这一隐私问题至关重要,但LLM记忆在开发任务机器人中的继承方式尚未得到探讨。在本研究中,我们通过系统的定量研究来填补这一空白,该研究涉及评估现有的训练数据提取攻击,分析导致现有方法失效的任务导向对话建模的关键特征,并提出针对基于LLM的TODS的新型攻击技术,以增强响应采样和成员推断。实验结果证明了我们提出的数据提取攻击的有效性。我们的方法能够以超过70%的最佳精度提取数千个对话状态的训练标签。此外,我们通过识别和量化关键影响因素,并讨论针对性的缓解策略,提供了对基于LLM的TODS中训练数据记忆的深入分析。
cs.CL / 78 / 2603.01580
Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models
马尔可夫常微分方程引导的评分可以评估语言模型中离线推理轨迹的质量
Abstract
Reasoning traces produced by generative language models are increasingly used for tasks ranging from mathematical problem solving to automated fact checking. However, existing evaluation methods remain largely mechanical and fail to capture human-centric notions of reasoning quality in a way that generalizes across varied and progressively degraded reasoning. We introduce MarODE, an offline evaluation framework that assigns quality scores to reasoning traces. Its effectiveness is assessed using human-centric perturbations and human judgments, which jointly evaluate the fundamental dimensions of an evaluation metric: goodness and soundness. The approach is grounded in a Markovian formulation of reasoning progression and an ordinary differential equation based characterization of trace dynamics, enabling efficient evaluation of reasoning quality. In a large-scale evaluation, MarODE outperforms existing baselines by over 250% under Somers' D correlation. Our results emphasize the value of theory-driven evaluation frameworks as reasoning traces become central to language model-based systems.
Chinese Translation
生成语言模型产生的推理轨迹越来越多地用于从数学问题解决到自动事实检查等任务。然而,现有的评估方法仍然主要是机械性的,未能以一种能够跨越各种逐渐退化的推理方式捕捉以人为中心的推理质量概念。我们提出了 MarODE,一个离线评估框架,用于为推理轨迹分配质量评分。其有效性通过以人为中心的扰动和人类判断进行评估,这两者共同评估评估指标的基本维度——良好性和合理性。该方法基于推理进程的马尔可夫形式化和轨迹动态的常微分方程特征化,使得推理质量的评估更加高效。在大规模评估中,MarODE 在 Somers' D 相关性下的表现超过现有基线超过 250%。我们的结果强调了理论驱动的评估框架的价值,因为推理轨迹在基于语言模型的系统中变得越来越重要。
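Somers' D, the correlation under which MarODE is compared to baselines above, is a rank statistic: concordant minus discordant pairs, divided by the pairs not tied on the independent variable. A direct O(n^2) sketch:

```python
def somers_d(x, y):
    """Somers' D of y with respect to x: (concordant - discordant)
    over all pairs not tied on x. Returns a value in [-1, 1]."""
    concordant = discordant = tied_x = 0
    n = len(x)
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            pairs += 1
            dx = (x[i] > x[j]) - (x[i] < x[j])  # sign of x difference
            dy = (y[i] > y[j]) - (y[i] < y[j])  # sign of y difference
            if dx == 0:
                tied_x += 1
            elif dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    denom = pairs - tied_x
    return (concordant - discordant) / denom if denom else 0.0
```

A perfectly monotone relationship scores 1.0, a perfectly reversed one -1.0; ties on x are excluded from the denominator, which is what distinguishes Somers' D from Kendall's tau.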
cs.CL / 79 / 2603.01622
More Data, Fewer Diacritics: Scaling Arabic TTS
更多数据,更少的变音符:扩展阿拉伯语文本到语音转换
Abstract
Arabic Text-to-Speech (TTS) research has been hindered by the limited availability of both publicly available training data and accurate Arabic diacritization models. In this paper, we address this limitation by exploring Arabic TTS training on large automatically annotated data. Namely, we built a robust pipeline for collecting Arabic recordings and processing them automatically using voice activity detection, speech recognition, automatic diacritization, and noise filtering, resulting in around 4,000 hours of Arabic TTS training data. We then trained several robust TTS models with voice cloning using varying amounts of data, namely 100, 1,000, and 4,000 hours, with and without diacritization. We show that although models trained on diacritized data are generally better, larger amounts of training data compensate for the lack of diacritics to a significant degree. We plan to release a public Arabic TTS model that works without the need for diacritization.
Chinese Translation
阿拉伯语文本到语音转换(TTS)研究一直受到公开可用训练数据和准确的阿拉伯语变音符标注模型匮乏的限制。本文通过探索在大规模自动标注数据上进行阿拉伯语TTS训练来解决这一限制。具体而言,我们建立了一个强大的管道,用于收集阿拉伯语录音并通过语音活动检测、语音识别、自动变音符标注和噪声过滤进行自动处理,最终获得约4000小时的阿拉伯语TTS训练数据。随后,我们使用不同数量的数据(即100、1000和4000小时)训练了几个强大的TTS模型,包括有无变音符的语音克隆。我们展示了尽管在变音符数据上训练的模型通常表现更好,但大量的训练数据在很大程度上弥补了缺乏变音符的不足。我们计划发布一个无需变音符的公共阿拉伯语TTS模型。
cs.CL / 80 / 2603.01625
Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation
测量视觉语言模型未提及的内容:验证指标掩盖放射学报告生成中的临床术语抹除
Abstract
Reliable deployment of Vision-Language Models (VLMs) in radiology requires validation metrics that go beyond surface-level text similarity to ensure clinical fidelity and demographic fairness. This paper investigates a critical blind spot in current model evaluation: the use of decoding strategies that lead to high aggregate token-overlap scores despite succumbing to template collapse, in which models generate only repetitive, safe generic text and omit clinical terminology. Unaddressed, this blind spot can lead to metric gaming, where models that perform well on benchmarks prove clinically uninformative. Instead, we advocate for lexical diversity measures to check model generations for clinical specificity. We introduce Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports. Weighted Association Erasure (WAE) aggregates these shifts to measure the clinical signal loss across demographic groups. We show that deterministic decoding produces high levels of semantic erasure, while stochastic sampling generates diverse outputs but risks introducing new bias, motivating a fundamental rethink of how "optimal" reporting is defined.
Chinese Translation
在放射学中可靠地部署视觉语言模型(VLMs)需要超越表面文本相似性的验证指标,以确保临床忠实性和人口公平性。本文探讨了当前模型评估中的一个关键盲点:解码策略的使用导致高聚合令牌重叠分数,尽管模型仅生成重复、安全的通用文本而忽略临床术语,出现模板崩溃现象。如果不加以解决,这一盲点可能导致指标游戏,即在基准测试中表现良好的模型在临床上却无信息价值。相反,我们主张使用词汇多样性度量来检查模型生成的文本的临床特异性。我们引入了临床关联位移(Clinical Association Displacement, CAD),这是一个在词汇层面上量化生成报告中基于人口的词汇关联变化的框架。加权关联抹除(Weighted Association Erasure, WAE)则汇总这些变化,以测量不同人口群体之间的临床信号损失。我们展示了确定性解码产生高水平的语义抹除,而随机采样虽然生成多样化的输出,但有引入新偏见的风险,这促使我们重新思考“最佳”报告的定义。
cs.CL / 81 / 2603.01639
Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning
学习草拟:基于强化学习的自适应推测解码
Abstract
Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time spent on drafting candidates and verifying them. However, current state-of-the-art methods rely on a static time allocation, while recent dynamic approaches optimize for proxy metrics like acceptance length, often neglecting the true time cost and treating the drafting and verification phases in isolation. To address these limitations, we introduce Learning to Draft (LTD), a novel method that directly optimizes the throughput of each draft-and-verify cycle. We formulate the problem as a reinforcement learning environment and train two co-adaptive policies to dynamically coordinate the draft and verification phases. This encourages the policies to adapt to each other and explicitly maximize decoding efficiency. We conducted extensive evaluations on five diverse LLMs and four distinct tasks. Our results show that LTD achieves speedup ratios ranging from 2.24x to 4.32x, outperforming the state-of-the-art method Eagle3 by up to 36.4%.
Chinese Translation
推测解码通过使用一个小型草拟模型生成候选标记,从而加速大型语言模型(LLM)的推理,以供更大的目标模型进行验证。这项技术的有效性依赖于在候选生成和验证之间的时间权衡。然而,目前的最先进方法依赖于静态时间分配,而最近的动态方法则优化诸如接受长度等代理指标,往往忽视了真实的时间成本,并将草拟和验证阶段视为孤立的过程。为了解决这些局限性,我们提出了学习草拟(Learning to Draft, LTD),这是一种直接优化每个草拟-验证周期吞吐量的新方法。我们将问题形式化为一个强化学习环境,并训练两个协同自适应策略,以动态协调草拟和验证阶段。这鼓励策略相互适应,并明确最大化解码效率。我们在五个不同的LLM和四个不同任务上进行了广泛评估。结果表明,LTD实现了2.24倍到4.32倍的加速比,超过了最先进方法Eagle3的性能,提升幅度高达36.4%。
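The draft-and-verify cycle that LTD optimizes can be sketched in its simplest greedy-verification form. The data shapes here are illustrative: real systems check all draft positions with one batched forward pass of the target model rather than a token list:

```python
def speculative_step(draft_tokens, target_tokens):
    """Greedy-verification variant of one draft-and-verify cycle:
    accept the longest prefix of the draft that the target agrees with,
    then emit one token from the target model itself.

    draft_tokens:  k tokens proposed by the draft model
    target_tokens: the target model's own choice at each of the k+1
                   positions (hypothetical precomputed list)
    """
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            # first mismatch: replace it with the target's token and stop
            accepted.append(t)
            return accepted
    # all k draft tokens accepted: append the target's bonus token
    accepted.append(target_tokens[len(draft_tokens)])
    return accepted
```

Each cycle thus yields between 1 and k+1 tokens for roughly one target forward pass, which is the throughput quantity the two co-adaptive policies are trained to maximize.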
cs.CL / 82 / 2603.01651
LexChronos: An Agentic Framework for Structured Event Timeline Extraction in Indian Jurisprudence
LexChronos:一个用于印度法学中结构化事件时间线提取的代理框架
Abstract
Understanding and predicting judicial outcomes demands nuanced analysis of legal documents. Traditional approaches treat judgments and proceedings as unstructured text, limiting the effectiveness of large language models (LLMs) in tasks such as summarization, argument generation, and judgment prediction. We propose LexChronos, an agentic framework that iteratively extracts structured event timelines from Supreme Court of India judgments. LexChronos employs a dual-agent architecture: a LoRA-instruct-tuned extraction agent identifies candidate events, while a pre-trained feedback agent scores and refines them through a confidence-driven loop. To address the scarcity of Indian legal event datasets, we construct a synthetic corpus of 2000 samples using reverse-engineering techniques with DeepSeek-R1 and GPT-4, generating gold-standard event annotations. Our pipeline achieves a BERT-based F1 score of 0.8751 against this synthetic ground truth. In downstream evaluations on legal text summarization, GPT-4 preferred structured timelines over unstructured baselines in 75% of cases, demonstrating improved comprehension and reasoning in Indian jurisprudence. This work lays a foundation for future legal AI applications in the Indian context, such as precedent mapping, argument synthesis, and predictive judgment modelling, by harnessing structured representations of legal events.
Chinese Translation
理解和预测司法结果需要对法律文件进行细致的分析。传统方法将判决和诉讼视为非结构化文本,这限制了大型语言模型(LLMs)在摘要生成、论点生成和判决预测等任务中的有效性。我们提出了LexChronos,一个迭代提取印度最高法院判决中结构化事件时间线的代理框架。LexChronos采用双代理架构:一个经过LoRA指令调优的提取代理识别候选事件,而一个预训练的反馈代理通过置信度驱动的循环对这些事件进行评分和优化。为了解决印度法律事件数据集稀缺的问题,我们使用反向工程技术结合DeepSeek-R1和GPT-4构建了一个包含2000个样本的合成语料库,生成了黄金标准的事件注释。我们的管道在这一合成基准上达到了基于BERT的F1分数0.8751。在法律文本摘要的下游评估中,GPT-4在75%的情况下更倾向于结构化时间线而非非结构化基线,显示出在印度法学中的理解和推理能力的提升。通过利用法律事件的结构化表示,这项工作为未来在印度背景下的法律人工智能应用(如判例映射、论点综合和预测性判决建模)奠定了基础。
cs.CL / 83 / 2603.01666
Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations
超越网格:基于布局信息的多向量检索与解析的视觉文档表示
Abstract
Harnessing the full potential of visually-rich documents requires retrieval systems that understand not just text, but intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while powerful, face a crucial storage bottleneck that current optimization strategies, such as embedding merging, pruning, or using abstract tokens, fail to resolve without compromising performance or ignoring vital layout cues. To address this, we introduce ColParse, a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to create a compact and structurally-aware multi-vector representation. Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models. ColParse thus bridges the critical gap between the fine-grained accuracy of multi-vector retrieval and the practical demands of large-scale deployment, offering a new path towards efficient and interpretable multimodal information systems.
Chinese Translation
充分利用视觉丰富的文档的潜力需要检索系统不仅理解文本,还要理解复杂的布局,这是视觉文档检索(VDR)中的一个核心挑战。现有的多向量架构虽然强大,但面临着一个关键的存储瓶颈,而当前的优化策略,如嵌入合并、剪枝或使用抽象标记,无法在不影响性能或忽视重要布局线索的情况下解决这一问题。为了解决这一问题,我们提出了ColParse,这是一种新颖的范式,利用文档解析模型生成一小组基于布局信息的子图像嵌入,然后将这些嵌入与全局页面级向量融合,以创建紧凑且具有结构意识的多向量表示。大量实验表明,我们的方法在减少存储需求超过95%的同时,在多个基准和基础模型上显著提高了性能。因此,ColParse弥合了多向量检索的细粒度准确性与大规模部署的实际需求之间的关键差距,为高效且可解释的多模态信息系统提供了一条新路径。
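The fused representation described above, a few layout-informed sub-image embeddings plus one global page-level vector, is scored like any multi-vector representation, e.g. with ColBERT-style MaxSim late interaction. A sketch under those assumptions (ColParse's exact fusion and scoring may differ):

```python
def colparse_representation(subimage_embs, page_emb):
    """Compact multi-vector representation: a small set of
    layout-informed sub-image embeddings fused with one global
    page-level vector (a sketch of the idea, not the model)."""
    return subimage_embs + [page_emb]

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: for every query vector, take
    its best dot product against the document's vectors, then sum."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

The storage saving claimed in the abstract comes from the first function's shape: a handful of sub-image vectors per page instead of one vector per visual patch in a dense grid.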
cs.CL / 84 / 2603.01683
Surgical Post-Training: Cutting Errors, Keeping Knowledge
手术式后训练:切除错误,保留知识
Abstract
Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in Direct Preference Optimization's (DPO) reward estimate. This motivates our Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving learned prior knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous steps via minimal edits, generating data proximal to the model's distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and OOD tasks, requiring merely 28 minutes of training on 8x H800 GPUs. Code: https://github.com/Visual-AI/SPoT
Chinese Translation
通过后训练增强大型语言模型(LLMs)的推理能力常常受到效率与灾难性遗忘之间权衡的限制。尽管之前的研究强调了同策略(on-policy)数据在减轻遗忘中的作用,我们发现并从理论与实证上验证了一个被忽视但至关重要的机制:直接偏好优化(Direct Preference Optimization, DPO)奖励估计中固有的隐式正则化。这促使我们提出了手术式后训练(Surgical Post-Training, SPoT),这一新范式旨在高效优化推理,同时保留已学习的先前知识。SPoT包括:(1)一个数据修正管道,利用一个Oracle通过最小编辑以手术般的精度纠正错误步骤,生成与模型分布相近的数据;(2)一个基于奖励的二元交叉熵目标。与DPO中的相对排名不同,该目标将推理正确性视为一个二元分类问题,从而强制实施解耦的监督信号。在实证上,仅使用4k个修正后的数学数据对,SPoT在领域内和OOD任务上平均提高了Qwen3-8B的准确率6.2%,仅需在8x H800 GPU上训练28分钟。代码:https://github.com/Visual-AI/SPoT
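The reward-based binary cross-entropy objective contrasted with DPO's relative ranking above can be sketched with a DPO-style implicit reward fed through a sigmoid and supervised by a per-response correctness label. The `beta` hyperparameter and the exact reward parameterization are assumptions for illustration, not the paper's stated loss:

```python
import math

def bce_reward_loss(policy_logps, ref_logps, labels, beta=0.1):
    """Binary cross-entropy over a DPO-style implicit reward.

    For each response, reward = beta * (log pi(y|x) - log pi_ref(y|x));
    the correctness label in {0, 1} supervises sigmoid(reward) with BCE,
    giving each response a decoupled supervision signal instead of a
    pairwise ranking.
    """
    total = 0.0
    for lp, ref, y in zip(policy_logps, ref_logps, labels):
        reward = beta * (lp - ref)
        p = 1.0 / (1.0 + math.exp(-reward))  # P(response is correct)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)
```

When the policy matches the reference, the reward is zero and the loss sits at log 2 per example; it decreases as the implicit reward moves in the direction of each label.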
cs.CL / 85 / 2603.01690
QIME: Constructing Interpretable Medical Text Embeddings via Ontology-Grounded Questions
QIME:通过本体驱动的问题构建可解释的医学文本嵌入
Abstract
While dense biomedical embeddings achieve strong performance, their black-box nature limits their utility in clinical decision-making. Recent question-based interpretable embeddings represent text as binary answers to natural-language questions, but these approaches often rely on heuristic or surface-level contrastive signals and overlook specialized domain knowledge. We propose QIME, an ontology-grounded framework for constructing interpretable medical text embeddings in which each dimension corresponds to a clinically meaningful yes/no question. By conditioning on cluster-specific medical concept signatures, QIME generates semantically atomic questions that capture fine-grained distinctions in biomedical text. Furthermore, QIME supports a training-free embedding construction strategy that eliminates per-question classifier training while further improving performance. Experiments across biomedical semantic similarity, clustering, and retrieval benchmarks show that QIME consistently outperforms prior interpretable embedding methods and substantially narrows the gap to strong black-box biomedical encoders, while providing concise and clinically informative explanations.
Chinese Translation
尽管密集的生物医学嵌入在性能上表现出色,但其黑箱特性限制了其在临床决策中的实用性。最近基于问题的可解释嵌入将文本表示为自然语言问题的二元答案,但这些方法往往依赖于启发式或表面层次的对比信号,忽视了专业领域知识。我们提出了QIME,一个基于本体的框架,用于构建可解释的医学文本嵌入,其中每个维度对应一个临床意义明确的是/否问题。通过对特定簇的医学概念特征进行条件化,QIME生成语义上原子的提问,捕捉生物医学文本中的细微差别。此外,QIME支持一种无训练的嵌入构建策略,消除了对每个问题分类器的训练,同时进一步提高了性能。在生物医学语义相似性、聚类和检索基准测试中的实验表明,QIME始终优于先前的可解释嵌入方法,并显著缩小了与强黑箱生物医学编码器之间的差距,同时提供简明且具有临床信息的解释。
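The question-based embedding idea above is easy to sketch: each dimension is the binary answer to one yes/no question, and similarity can be read off as per-dimension agreement. The `answer_fn` judge below is a stand-in for whatever model answers the ontology-grounded questions:

```python
def interpretable_embedding(text, questions, answer_fn):
    """One dimension per clinically meaningful yes/no question.

    answer_fn: callable (text, question) -> bool, a stand-in for the
    judge model that answers each question about the text.
    """
    return [1 if answer_fn(text, q) else 0 for q in questions]

def agreement_similarity(a, b):
    """Fraction of dimensions on which two binary embeddings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Because every dimension names a question, any similarity judgment can be explained by listing the questions on which two texts agree or disagree, which is the interpretability property the abstract emphasizes.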
cs.CL / 86 / 2603.01691
Building a Strong Instruction Language Model for a Less-Resourced Language
为资源匮乏语言构建强大的指令语言模型
Abstract
Large language models (LLMs) have become an essential tool for natural language processing and artificial intelligence in general. Current open-source models are primarily trained on English texts, resulting in poorer performance on less-resourced languages and cultures. We present a set of methodological approaches necessary for the successful adaptation of an LLM to a less-resourced language, and demonstrate them using the Slovene language. We present GaMS3-12B, a generative model for Slovene with 12 billion parameters, and demonstrate that it is the best-performing open-source model for Slovene within its parameter range. We adapted the model to the Slovene language using three-stage continual pre-training of the Gemma 3 model, followed by two-stage supervised fine-tuning (SFT). We trained the model on a combination of 140B Slovene, English, Bosnian, Serbian, and Croatian pretraining tokens, and over 200 thousand English and Slovene SFT examples. We evaluate GaMS3-12B on the Slovenian-LLM-Eval datasets, English-to-Slovene translation, and the Slovene LLM arena. We show that the described model outperforms 12B Gemma 3 across all three scenarios and performs comparably to much larger commercial GPT-4o in the Slovene LLM arena, achieving a win rate of over 60 %.
Chinese Translation
大型语言模型(LLMs)已成为自然语言处理和人工智能领域的重要工具。目前的开源模型主要基于英语文本进行训练,这导致在资源匮乏的语言和文化上表现较差。我们提出了一套必要的方法论,以成功适应LLM于资源匮乏语言,并以斯洛文尼亚语为例进行了演示。我们介绍了GaMS3-12B,这是一个具有120亿参数的斯洛文尼亚生成模型,并证明它在其参数范围内是表现最佳的开源斯洛文尼亚模型。我们通过Gemma 3模型的三阶段持续预训练,随后进行两阶段的监督微调(SFT),将模型适应于斯洛文尼亚语。我们在1400亿个斯洛文尼亚语、英语、波斯尼亚语、塞尔维亚语和克罗地亚语的预训练标记的组合上训练了该模型,并使用超过20万个英语和斯洛文尼亚语的SFT示例。我们在斯洛文尼亚-LLM-Eval数据集、英语到斯洛文尼亚语的翻译以及斯洛文尼亚LLM竞技场上评估GaMS3-12B。结果表明,该模型在所有三个场景中均优于12B的Gemma 3,并在斯洛文尼亚LLM竞技场中与更大规模的商业模型GPT-4o表现相当,赢得率超过60%。
cs.CL / 87 / 2603.01710
Legal RAG Bench: an end-to-end benchmark for legal RAG
法律 RAG 基准:一个端到端的法律 RAG 基准测试
Abstract
We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions demanding expert knowledge of criminal law and procedure. Both long-form answers and supporting passages are provided. As an evaluation methodology, Legal RAG Bench leverages a full factorial design and novel hierarchical error decomposition framework, enabling apples-to-apples comparisons of the contributions of retrieval and reasoning models in RAG. We evaluate three state-of-the-art embedding models (Isaacus' Kanon 2 Embedder, Google's Gemini Embedding 001, and OpenAI's Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro and GPT-5.2), finding that information retrieval is the primary driver of legal RAG performance, with LLMs exerting a more moderate effect on correctness and groundedness. Kanon 2 Embedder, in particular, had the largest positive impact on performance, improving average correctness by 17.5 points, groundedness by 4.5 points, and retrieval accuracy by 34 points. We observe that many errors attributed to hallucinations in legal RAG systems are in fact triggered by retrieval failures, concluding that retrieval sets the ceiling for the performance of many modern legal RAG systems. We document why and how we built Legal RAG Bench alongside the results of our evaluations. We also openly release our code and data to assist with reproduction of our findings.
Chinese Translation
我们介绍了法律 RAG 基准(Legal RAG Bench),这是一个用于评估法律 RAG 系统端到端性能的基准和评估方法。作为一个基准,法律 RAG 基准包含来自《维多利亚州刑事指控手册》(Victorian Criminal Charge Book)的 4,876 段文本,以及 100 个复杂的手工制作问题,这些问题需要对刑法和程序有专家级的知识。提供了长篇答案和支持段落。作为评估方法,法律 RAG 基准利用全因子设计和新颖的层次错误分解框架,使得检索模型和推理模型在 RAG 中的贡献能够进行公平比较。我们评估了三种最先进的嵌入模型(Isaacus 的 Kanon 2 Embedder、谷歌的 Gemini Embedding 001 和 OpenAI 的 Text Embedding 3 Large)以及两种前沿的 LLM(Gemini 3.1 Pro 和 GPT-5.2),发现信息检索是法律 RAG 性能的主要驱动因素,而 LLM 对正确性和基础性的影响相对温和。特别是,Kanon 2 Embedder 对性能的积极影响最大,使平均正确性提高了 17.5 分,基础性提高了 4.5 分,检索准确性提高了 34 分。我们观察到,许多归因于法律 RAG 系统幻觉的错误实际上是由检索失败引发的,因此得出结论:检索为许多现代法律 RAG 系统的性能设定了上限。我们记录了构建法律 RAG 基准的原因和过程,以及我们的评估结果。我们还公开发布了我们的代码和数据,以帮助复现我们的研究结果。
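The full factorial design described above simply crosses every embedding model with every LLM, so retrieval and reasoning effects can be estimated separately. A minimal sketch (configuration names are informal labels, not the benchmark's actual identifiers):

```python
from itertools import product

def full_factorial(embedders, llms):
    """Pair every embedder with every LLM, enabling apples-to-apples
    comparison of retrieval vs. reasoning contributions in a RAG stack."""
    return list(product(embedders, llms))

configs = full_factorial(
    ["kanon-2-embedder", "gemini-embedding-001", "text-embedding-3-large"],
    ["gemini-3.1-pro", "gpt-5.2"],
)
```

With three embedders and two LLMs, this yields all six retrieval/reasoning combinations, so each factor's marginal effect can be averaged over the other.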
cs.CL / 88 / 2603.01732
Bootstrapping Embeddings for Low Resource Languages
低资源语言的嵌入模型自助法
Abstract
Embedding models are crucial to modern NLP. However, the creation of the most effective models relies on carefully constructed supervised finetuning data. For high resource languages, such as English, such datasets are readily available. However, for hundreds of other languages, they are simply non-existent. We investigate whether the advent of large language models can help to bridge this gap. We test three different strategies for generating synthetic triplet data used to optimise embedding models. These include in-context learning as well as two novel approaches, leveraging adapter composition and cross lingual finetuning of the LLM generator (XL-LoRA) respectively. We find that while in-context learning still falls short of strong non-synthetic baselines, adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages, offering a clear, scalable pathway to producing performant embedding models for a wide variety of languages.
Chinese Translation
嵌入模型在现代自然语言处理(NLP)中至关重要。然而,创建最有效的模型依赖于精心构建的监督微调数据。对于高资源语言(如英语),这类数据集唾手可得;然而,对于数百种其他语言,这些数据集根本不存在。我们研究了大型语言模型的出现是否能够帮助弥补这一差距。我们测试了三种不同的策略来生成用于优化嵌入模型的合成三元组数据。这些策略包括上下文学习以及两种新颖的方法,分别利用适配器组合和大型语言模型生成器的跨语言微调(XL-LoRA)。我们发现,尽管上下文学习仍然未能达到强大的非合成基线,但适配器组合和XL-LoRA在广泛的任务和语言中都带来了显著的性能提升,为面向多种语言生产高性能嵌入模型提供了清晰且可扩展的路径。
cs.CL / 89 / 2603.01773
AnnoABSA: A Web-Based Annotation Tool for Aspect-Based Sentiment Analysis with Retrieval-Augmented Suggestions
AnnoABSA:一种基于网络的面向方面的情感分析注释工具,具有检索增强建议功能
Abstract
We introduce AnnoABSA, the first web-based annotation tool to support the full spectrum of Aspect-Based Sentiment Analysis (ABSA) tasks. The tool is highly customizable, enabling flexible configuration of sentiment elements and task-specific requirements. Alongside manual annotation, AnnoABSA provides optional Large Language Model (LLM)-based retrieval-augmented generation (RAG) suggestions that offer context-aware assistance in a human-in-the-loop approach, keeping the human annotator in control. To improve prediction quality over time, the system retrieves the ten most similar examples that are already annotated and adds them as few-shot examples in the prompt, ensuring that suggestions become increasingly accurate as the annotation process progresses. Released as open-source software under the MIT License, AnnoABSA is freely accessible and easily extendable for research and practical applications.
Chinese Translation
我们介绍了AnnoABSA,这是第一个支持全面的面向方面的情感分析(ABSA)任务的基于网络的注释工具。该工具高度可定制,能够灵活配置情感元素和任务特定要求。除了手动注释,AnnoABSA还提供基于大型语言模型(LLM)的可选检索增强生成(RAG)建议,采用人机协作的方法提供上下文感知的帮助,使人工注释者保持控制。为了提高预测质量,系统会检索十个已注释的最相似示例,并将其作为少量示例添加到提示中,确保随着注释过程的进行,建议变得越来越准确。AnnoABSA作为开源软件在MIT许可证下发布,免费提供并易于扩展,适用于研究和实际应用。
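The retrieval-augmented suggestion loop described in the AnnoABSA abstract (retrieve the ten most similar already-annotated examples, then add them as few-shot demonstrations) can be sketched roughly as follows; the embedding store, similarity function, and prompt format here are illustrative assumptions, not the tool's actual API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_few_shot(query_vec, annotated, k=10):
    """Return the k already-annotated examples most similar to the query.

    `annotated` is a list of (embedding, sentence, annotation) triples; as
    annotation progresses, this store grows and suggestions improve.
    """
    ranked = sorted(annotated, key=lambda ex: cosine(query_vec, ex[0]),
                    reverse=True)
    return ranked[:k]

def build_prompt(sentence, shots):
    """Assemble the few-shot suggestion prompt for the LLM."""
    lines = ["Annotate the aspect terms and sentiment polarities."]
    for _, text, annotation in shots:
        lines.append(f"Input: {text}\nOutput: {annotation}")
    lines.append(f"Input: {sentence}\nOutput:")
    return "\n\n".join(lines)

# Toy store of already-annotated examples (embedding, text, annotation).
store = [
    ([1.0, 0.0], "The pizza was great", "(pizza, positive)"),
    ([0.0, 1.0], "Service was slow", "(service, negative)"),
    ([0.9, 0.1], "Loved the pasta", "(pasta, positive)"),
]
shots = retrieve_few_shot([1.0, 0.05], store, k=2)
prompt = build_prompt("The soup was bland", shots)
```

The human annotator still reviews the LLM's completion of this prompt before anything is saved, keeping the workflow human-in-the-loop.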
cs.CL / 90 / 2603.01775
Beyond the Resum\'e: A Rubric-Aware Automatic Interview System for Information Elicitation
超越简历:一个基于评分标准的自动面试系统用于信息获取
Abstract
Effective hiring is integral to the success of an organisation, but it is very challenging to find the most suitable candidates because expert evaluations (e.g.\ interviews conducted by a technical manager) are expensive to deploy at scale. Therefore, automated resume scoring and other applicant-screening methods are increasingly used to coarsely filter candidates, making decisions on limited information. We propose that large language models (LLMs) can play the role of subject matter experts to cost-effectively elicit information from each candidate that is nuanced and role-specific, thereby improving the quality of early-stage hiring decisions. We present a system that leverages an LLM interviewer to update belief over an applicant's rubric-oriented latent traits in a calibrated way. We evaluate our system on simulated interviews and show that belief converges towards the simulated applicants' artificially-constructed latent ability levels. We release code, a modest dataset of public-domain/anonymised resumes, belief calibration tests, and simulated interviews, at \href{https://github.com/mbzuai-nlp/beyond-the-resume}{https://github.com/mbzuai-nlp/beyond-the-resume}. Our demo is available at \href{https://btr.hstu.net}{https://btr.hstu.net}.
Chinese Translation
有效的招聘对于组织的成功至关重要,但由于专家评估(例如由技术经理进行的面试)在大规模部署时成本高昂,因此找到最合适的候选人非常具有挑战性。因此,自动化的简历评分和其他申请人筛选方法越来越多地被用于粗略筛选候选人,基于有限的信息做出决策。我们提出大型语言模型(LLMs)可以充当主题专家,以成本效益高的方式从每位候选人那里获取细致且与角色相关的信息,从而提高早期招聘决策的质量。我们展示了一个利用LLM面试官以校准的方式更新申请人评分标准导向的潜在特征信念的系统。我们在模拟面试中评估了我们的系统,并显示信念向模拟申请人人工构造的潜在能力水平收敛。我们在 https://github.com/mbzuai-nlp/beyond-the-resume 发布了代码、一个适度的公共领域/匿名简历数据集、信念校准测试和模拟面试。我们的演示可在 https://btr.hstu.net 获取。
cs.CL / 91 / 2603.01776
FreeAct: Freeing Activations for LLM Quantization
FreeAct:为大型语言模型量化解放激活
Abstract
Quantization is pivotal for mitigating the significant memory and computational overhead of Large Language Models (LLMs). While emerging transformation-based methods have successfully enhanced quantization by projecting feature spaces onto smoother manifolds using orthogonal matrices, they typically enforce a rigid one-to-one transformation constraint. This static approach fails to account for the dynamic patterns inherent in input activations, particularly within diffusion LLMs (dLLMs) and Multimodal LLMs (MLLMs), where varying token types exhibit distinct distributions. To advance this, we propose FreeAct, a novel quantization framework that relaxes the static one-to-one constraint to accommodate dynamic activation disparities. Theoretically, we leverage the rank-deficient nature of activations to derive a solution space that extends beyond simple inverse matrices, enabling the decoupling of activation transformations from weights. Methodologically, FreeAct identifies token-specific dynamics (i.e., vision vs. text, or masked tokens) and allocates distinct transformation matrices to the activation side, while maintaining a unified, static transformation for the weights. Extensive experiments across dLLMs and MLLMs demonstrate that FreeAct significantly outperforms baselines, with up to 5.3% performance improvement, supported by in-depth analyses. Our code will be publicly released.
Chinese Translation
量化对于减轻大型语言模型(LLMs)显著的内存和计算开销至关重要。尽管新兴的基于变换的方法通过使用正交矩阵将特征空间投影到更平滑的流形上,成功增强了量化,但它们通常强制执行严格的一对一变换约束。这种静态方法未能考虑输入激活中固有的动态模式,特别是在扩散大型语言模型(dLLMs)和多模态大型语言模型(MLLMs)中,不同类型的标记表现出不同的分布。为此,我们提出了FreeAct,一个新颖的量化框架,放宽了静态一对一约束,以适应动态激活差异。从理论上讲,我们利用激活的秩缺陷特性推导出一个超越简单逆矩阵的解空间,使激活变换与权重解耦。从方法论上讲,FreeAct识别特定于标记的动态(即视觉与文本,或被遮蔽的标记),并为激活侧分配不同的变换矩阵,同时保持权重的统一静态变换。对dLLMs和MLLMs的广泛实验表明,FreeAct显著优于基线,性能提升高达5.3%,并进行了深入分析。我们的代码将公开发布。
cs.CL / 92 / 2603.01778
LLM-as-an-Annotator: Training Lightweight Models with LLM-Annotated Examples for Aspect Sentiment Tuple Prediction
LLM作为注释者:利用LLM注释示例训练轻量级模型以进行方面情感元组预测
Abstract
Training models for Aspect-Based Sentiment Analysis (ABSA) tasks requires manually annotated data, which is expensive and time-consuming to obtain. This paper introduces LA-ABSA, a novel approach that leverages Large Language Model (LLM)-generated annotations to fine-tune lightweight models for complex ABSA tasks. We evaluate our approach on five datasets for Target Aspect Sentiment Detection (TASD) and Aspect Sentiment Quad Prediction (ASQP). Our approach outperformed previously reported augmentation strategies and achieved competitive performance with LLM-prompting in low-resource scenarios, while providing substantial energy efficiency benefits. For example, using 50 annotated examples for in-context learning (ICL) to guide the annotation of unlabeled data, LA-ABSA achieved an F1 score of 49.85 for ASQP on the SemEval Rest16 dataset, closely matching the performance of ICL prompting with Gemma-3-27B (51.10), while requiring significantly lower computational resources.
Chinese Translation
训练面向基于方面的情感分析(ABSA)任务的模型需要手动注释的数据,这种数据的获取既昂贵又耗时。本文介绍了一种新颖的方法LA-ABSA,该方法利用大型语言模型(LLM)生成的注释来微调轻量级模型,以应对复杂的ABSA任务。我们在五个数据集上评估了我们的方法,针对目标方面情感检测(TASD)和方面情感四元组预测(ASQP)。我们的方法优于先前报告的增强策略,并在低资源场景下达到了与LLM提示相当的性能,同时提供了显著的能源效率优势。例如,使用50个注释示例进行上下文学习(ICL)以指导未标记数据的注释,LA-ABSA在SemEval Rest16数据集上的ASQP任务中达到了49.85的F1分数,接近使用Gemma-3-27B进行ICL提示的性能(51.10),而所需的计算资源显著更低。
cs.CL / 93 / 2603.01788
nchellwig at SemEval-2026 Task 3: Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis using Large Language Models
nchellwig在SemEval-2026任务3:基于大型语言模型的维度化方面级情感分析自一致结构生成(SCSG)方法
Abstract
We present Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis in SemEval-2026 Task 3 (Track A). SCSG enhances prediction reliability by executing a LoRA-adapted large language model multiple times per instance, retaining only tuples that achieve a majority consensus across runs. To mitigate the computational overhead of multiple forward passes, we leverage vLLM's PagedAttention mechanism for efficient key--value cache reuse. Evaluation across 6 languages and 8 language--domain combinations demonstrates that self-consistency with 15 executions yields statistically significant improvements over single-inference prompting, with our system (leveraging Gemma 3) ranking in the top seven across all settings, achieving second place on three out of four English subsets and first place on Tatar-Restaurant for DimASTE.
Chinese Translation
我们提出了自一致结构生成(Self-Consistent Structured Generation, SCSG)方法,用于SemEval-2026任务3(轨道A)中的维度化方面级情感分析。SCSG通过对每个实例执行多次经过LoRA适配的大型语言模型,增强了预测的可靠性,仅保留在多次运行中达成多数共识的元组。为了减轻多次前向传递的计算开销,我们利用了vLLM的PagedAttention机制,以高效地重用键值缓存。在6种语言和8种语言-领域组合的评估中,15次执行的自一致性在统计上显著优于单次推理提示,我们的系统(利用Gemma 3)在所有设置中排名前七,在四个英语子集中的三个中获得第二名,并在Tatar-Restaurant的DimASTE任务中获得第一名。
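The majority-consensus filter SCSG applies over repeated model runs reduces to a counting step over predicted tuples. A minimal sketch (the tuple format and the per-run outputs below are hypothetical):

```python
from collections import Counter

def majority_consensus(runs):
    """Keep tuples predicted by a strict majority of the runs.

    `runs` is a list of per-run prediction sets; each element is a set of
    sentiment tuples produced by one execution of the model.
    """
    counts = Counter(t for run in runs for t in set(run))
    threshold = len(runs) / 2
    return {t for t, c in counts.items() if c > threshold}

# Three runs of the (hypothetical) model on one instance.
runs = [
    {("battery", "lasts long", "positive"), ("screen", "too dim", "negative")},
    {("battery", "lasts long", "positive")},
    {("battery", "lasts long", "positive"), ("price", "fair", "neutral")},
]
kept = majority_consensus(runs)
```

Tuples that appear in only a minority of runs (here the screen and price tuples) are discarded as unstable, which is what makes the aggregated prediction more reliable than any single sample.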
cs.CL / 94 / 2603.01791
Semantic Novelty Trajectories in 80,000 Books: A Cross-Corpus Embedding Analysis
8万本书中的语义新颖性轨迹:跨语料库嵌入分析
Abstract
I apply Schmidhuber's compression progress theory of interestingness at corpus scale, analyzing semantic novelty trajectories in more than 80,000 books spanning two centuries of English-language publishing. Using sentence-transformer paragraph embeddings and a running-centroid novelty measure, I compare 28,730 pre-1920 Project Gutenberg books (PG19) against 52,796 modern English books (Books3, approximately 1990-2010). The principal findings are fourfold. First, mean paragraph-level novelty is roughly 10% higher in modern books (0.503 vs. 0.459). Second, trajectory circuitousness -- the ratio of cumulative path length to net displacement in embedding space -- nearly doubles in the modern corpus (+67%). Third, convergent narrative curves, in which novelty declines toward a settled semantic register, are 2.3x more common in pre-1920 literature. Fourth, novelty is orthogonal to reader quality ratings (r = -0.002), suggesting that interestingness in Schmidhuber's sense is structurally independent of perceived literary merit. Clustering paragraph-level trajectories via PAA-16 representations reveals eight distinct narrative-shape archetypes whose distribution shifts substantially between eras. All analysis code and an interactive exploration toolkit are publicly available at https://bigfivekiller.online/novelty_hub.
Chinese Translation
我在语料库规模上应用Schmidhuber关于有趣性的压缩进展理论,分析了超过8万本书的语义新颖性轨迹,这些书籍跨越了两个世纪的英语出版历史。通过使用句子转换器(sentence-transformer)段落嵌入和滑动质心(running-centroid)新颖性度量,我比较了28,730本1920年前的古腾堡计划书籍(PG19)与52,796本现代英语书籍(Books3,约1990-2010)。主要发现有四个方面。首先,现代书籍的段落级新颖性平均高出约10%(0.503 vs. 0.459)。其次,轨迹的曲折性——嵌入空间中累计路径长度与净位移的比率——在现代语料库中几乎翻倍(+67%)。第三,趋同叙事曲线,即新颖性逐渐下降、趋向稳定语域的曲线,在1920年前的文学作品中出现的频率是现代作品的2.3倍。第四,新颖性与读者质量评分之间是正交的(r = -0.002),这表明Schmidhuber意义上的有趣性在结构上独立于感知的文学价值。通过PAA-16表示法对段落级轨迹进行聚类,揭示了八种不同的叙事形状原型,其分布在不同历史时期之间发生了显著变化。所有分析代码和互动探索工具包均可在https://bigfivekiller.online/novelty_hub上公开获取。
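The two trajectory statistics in this abstract, the running-centroid novelty measure and trajectory circuitousness (cumulative path length divided by net displacement), can be sketched in a few lines; the toy 2-D "embeddings" below stand in for real sentence-transformer vectors.

```python
import math

def l2(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def novelty_trajectory(embeddings):
    """Per-paragraph novelty: distance to the running centroid of all
    previous paragraph embeddings (the first paragraph gets novelty 0)."""
    novelties = [0.0]
    centroid = list(embeddings[0])
    for i, e in enumerate(embeddings[1:], start=1):
        novelties.append(l2(e, centroid))
        # Update the running centroid to include the new point.
        centroid = [(c * i + x) / (i + 1) for c, x in zip(centroid, e)]
    return novelties

def circuitousness(embeddings):
    """Cumulative path length over net displacement in embedding space."""
    path = sum(l2(a, b) for a, b in zip(embeddings, embeddings[1:]))
    net = l2(embeddings[0], embeddings[-1])
    return path / net if net else float("inf")

# A zig-zag semantic path is more circuitous than a straight one.
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
zigzag = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
```

A circuitousness of 1.0 means the book moved through embedding space in a straight line; the reported +67% for the modern corpus corresponds to trajectories that wander much more per unit of net semantic displacement.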
cs.CL / 95 / 2603.01792
ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLMs
ALTER:用于令牌熵引导的 LLMs 非对称 LoRA 反学习
Abstract
Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what an LLM should not know is important for ensuring alignment and thus safe use. However, effective unlearning in LLMs is difficult due to the fuzzy boundary between knowledge retention and forgetting. This challenge is exacerbated by entangled parameter spaces from continuous multi-domain training, often resulting in collateral damage, especially under aggressive unlearning strategies. Furthermore, the computational overhead required to optimize State-of-the-Art (SOTA) models with billions of parameters poses an additional barrier. In this work, we present ALTER, a lightweight unlearning framework for LLMs to address both the challenges of knowledge entanglement and unlearning efficiency. ALTER operates through two phases: (I) high entropy tokens are captured and learned via the shared A matrix in LoRA, followed by (II) an asymmetric LoRA architecture that achieves a specified forgetting objective by parameter isolation and unlearning tokens within the target subdomains. This serves as a new research direction for achieving unlearning via token-level isolation in an asymmetric framework. ALTER achieves SOTA performance on TOFU, WMDP, and MUSE benchmarks with over 95% forget quality and shows minimal side effects through preserving foundational tokens. By decoupling unlearning from LLMs' billion-scale parameters, this framework delivers excellent efficiency while preserving over 90% of model utility, exceeding baseline preservation rates of 47.8-83.6%.
Chinese Translation
大型语言模型(LLMs)已经发展到涵盖各个领域的广泛知识。然而,控制 LLMs 不应知道的内容对于确保其对齐性和安全使用至关重要。然而,由于知识保留与遗忘之间模糊的边界,在 LLMs 中实现有效的反学习是困难的。这个挑战因连续多领域训练导致的参数空间纠缠而加剧,常常导致附带损害,尤其是在激进的反学习策略下。此外,优化具有数十亿参数的最先进(SOTA)模型所需的计算开销也是一个额外障碍。在本研究中,我们提出了 ALTER,一个轻量级的 LLMs 反学习框架,以解决知识纠缠和反学习效率的挑战。ALTER 通过两个阶段进行操作:(I)通过 LoRA 中共享的 A 矩阵捕获并学习高熵令牌,随后(II)采用非对称 LoRA 架构,通过参数隔离和在目标子领域内反学习令牌来实现特定的遗忘目标。作为通过令牌级隔离在非对称框架中实现反学习的新研究方向,ALTER 在 TOFU、WMDP 和 MUSE 基准测试中实现了 SOTA 性能,遗忘质量超过 95%,并通过保留基础令牌显示出最小的副作用。通过将反学习与 LLMs 的数十亿规模参数解耦,该框架在保持超过 90% 模型效用的同时,提供了卓越的效率,超越了 47.8-83.6% 的基线保留率。
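Phase (I) of ALTER hinges on identifying high-entropy tokens. A minimal sketch of that selection step, assuming access to per-position next-token distributions (the threshold and the routing of the two groups are illustrative, not the paper's exact procedure):

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def split_by_entropy(token_dists, threshold):
    """Partition token positions into high- and low-entropy groups.

    High-entropy positions would be routed to the shared LoRA A matrix in
    phase (I); low-entropy (foundational) positions are preserved to limit
    collateral damage.
    """
    high, low = [], []
    for i, dist in enumerate(token_dists):
        (high if token_entropy(dist) >= threshold else low).append(i)
    return high, low

# Toy distributions over a 4-token vocabulary: position 0 is near-certain,
# position 1 is uniform (maximally uncertain, entropy ln 4 ≈ 1.39 nats).
dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
]
high, low = split_by_entropy(dists, threshold=1.0)
```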
cs.CL / 96 / 2603.01824
OpenAutoNLU: Open Source AutoML Library for NLU
OpenAutoNLU:用于自然语言理解的开源自动化机器学习库
Abstract
OpenAutoNLU is an open-source automated machine learning library for natural language understanding (NLU) tasks, covering both text classification and named entity recognition (NER). Unlike existing solutions, we introduce data-aware training regime selection that requires no manual configuration from the user. The library also provides integrated data quality diagnostics, configurable out-of-distribution (OOD) detection, and large language model (LLM) features, all within a minimal low-code API. The demo app is accessible here https://openautonlu.dev.
Chinese Translation
OpenAutoNLU 是一个开源的自动化机器学习库,专为自然语言理解(NLU)任务而设计,涵盖文本分类和命名实体识别(NER)。与现有解决方案不同,我们引入了数据感知的训练方案选择,无需用户手动配置。该库还提供集成的数据质量诊断、可配置的分布外(OOD)检测以及大型语言模型(LLM)功能,所有这些都通过一个最小的低代码 API 提供。演示应用程序可在此访问:https://openautonlu.dev。
cs.CL / 97 / 2603.01853
Let the Agent Search: Autonomous Exploration Beats Rigid Workflows in Temporal Question Answering
让智能体自主探索:自主探索在时间问答中的表现优于固定工作流程
Abstract
Temporal Knowledge Graph Question Answering (TKGQA) demands multi-hop reasoning under temporal constraints. Prior approaches based on large language models (LLMs) typically rely on rigid, hand-crafted retrieval workflows or costly supervised fine-tuning. We show that simply granting an off-the-shelf LLM autonomy, that is, letting it decide what to do next, already yields substantial gains even in a strict zero-shot setting. Building on this insight, we propose AT2QA, an autonomous, training-free agent for temporal question answering that iteratively interacts with the temporal knowledge graph via a general search tool for dynamic retrieval. Experiments on MultiTQ demonstrate large improvements: AT2QA achieves 88.7% Hits@1 (+10.7% over prior SOTA), including a +20.1% gain on challenging multi-target queries, showing that agentic autonomy can decisively outperform fine-tuning for temporal question answering. Code and the full set of sampled trajectories are available on https://github.com/AT2QA-Official-Code/AT2QA-Official-Code
Chinese Translation
时间知识图谱问答(TKGQA)要求在时间约束下进行多跳推理。基于大型语言模型(LLMs)的先前方法通常依赖于固定的、手工制作的检索工作流程或昂贵的监督微调。我们展示了,仅仅赋予一个现成的LLM自主权,即让其决定下一步做什么,已经在严格的零样本设置中带来了显著的提升。在此基础上,我们提出了AT2QA,一个用于时间问答的自主、无训练的智能体,它通过通用搜索工具与时间知识图谱进行动态检索的迭代交互。MultiTQ上的实验显示出显著的改善:AT2QA实现了88.7%的Hits@1(比之前的最先进技术提高了10.7%),在具有挑战性的多目标查询上更是实现了20.1%的提升,表明智能体的自主性在时间问答中可以显著超越微调。代码和完整的采样轨迹集可在https://github.com/AT2QA-Official-Code/AT2QA-Official-Code获取。
cs.CL / 98 / 2603.01865
CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
CyclicJudge:在基于大语言模型的评估中有效减轻评审偏见
Abstract
LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be eliminated by increasing the number of scenarios or generations. These biases are often similar in magnitude to the model differences that benchmarks are designed to detect, resulting in unreliable rankings when single-judge evaluations are used. This work introduces a variance decomposition that partitions benchmark score variance into scenario, generation, judge, and residual components. Based on this analysis, CyclicJudge, a round-robin assignment of judges, is demonstrated to be the optimal allocation strategy. It eliminates bias precisely while requiring each judge only once per cycle, maintaining the cost of single-judge evaluation. Empirical validation on MT-Bench supports all theoretical predictions.
Chinese Translation
基于大语言模型(LLM)作为评审的评估已成为开放式模型评估的标准实践;然而,评审者表现出系统性的偏见,这些偏见无法通过增加场景或生成数量来消除。这些偏见的程度通常与基准测试旨在检测的模型差异相似,导致在使用单一评审评估时排名不可靠。本研究引入了一种方差分解方法,将基准分数方差分为场景、生成、评审和残差成分。基于此分析,CyclicJudge(一种评审者轮换分配方案)被证明是最优的分配策略。它精确消除了偏见,同时每位评审者在每个周期中仅需评审一次,保持了单一评审评估的成本。对MT-Bench的实证验证支持了所有理论预测。
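The round-robin allocation behind CyclicJudge can be sketched directly: item i goes to judge i mod J, so each judge is used equally often within a cycle, any constant per-judge bias shifts all compared systems by the same amount, and each item still costs a single judgment.

```python
def cyclic_assign(n_items, judges):
    """Assign evaluation item i to judge i mod J (round-robin).

    Over a full cycle every judge scores the same number of items, so a
    constant per-judge bias cancels out of between-system comparisons while
    total judging cost matches single-judge evaluation.
    """
    return [judges[i % len(judges)] for i in range(n_items)]

assignment = cyclic_assign(9, ["judge_a", "judge_b", "judge_c"])
```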
cs.CL / 99 / 2603.01869
Sovereign AI-based Public Services are Viable and Affordable
基于主权人工智能的公共服务是可行且经济的
Abstract
The rapid expansion of AI-based remote services has intensified debates about the long-term implications of growing structural concentration in infrastructure and expertise. As AI capabilities become increasingly intertwined with geopolitical interests, the availability and reliability of foundational AI services can no longer be taken for granted. This issue is particularly pressing for AI-enabled public services for citizens, as governments and public agencies are progressively adopting 24/7 AI-driven support systems typically operated through commercial offerings from a small oligopoly of global technology providers. This paper challenges the prevailing assumption that general-purpose architectures, offered by these providers, are the optimal choice for all application contexts. Through practical experimentation, we demonstrate that viable and cost-effective alternatives exist, alternatives that align with principles of digital and cultural sovereignty. Our findings provide an empirical illustration that sovereign AI-based public services are both technically feasible and economically sustainable, capable of operating effectively on premises with modest computational and financial resources while maintaining cultural and digital autonomy. The technical insights and deployment lessons reported here are intended to inform the adoption of similar sovereign AI public services by national agencies and governments worldwide.
Chinese Translation
基于人工智能的远程服务的快速扩展加剧了关于基础设施和专业知识日益集中所带来的长期影响的讨论。随着人工智能能力与地缘政治利益的日益交织,基础人工智能服务的可用性和可靠性不再是理所当然的。对于面向公民的人工智能公共服务而言,这一问题尤为紧迫,因为各国政府和公共机构正逐步采用全天候(24/7)的人工智能驱动支持系统,这些系统通常通过少数几家全球技术提供商构成的寡头所提供的商业产品来运营。本文挑战了普遍存在的假设,即这些提供商所提供的通用架构是所有应用场景的最佳选择。通过实际实验,我们证明了存在可行且具有成本效益的替代方案,这些方案与数字和文化主权原则相一致。我们的研究结果提供了一个实证例证,表明基于主权的人工智能公共服务在技术上是可行的,在经济上是可持续的,能够在具备适度计算和财务资源的场所有效运行,同时保持文化和数字自主性。本文报告的技术见解和部署经验旨在为全球各国机构和政府采纳类似的主权人工智能公共服务提供参考。
cs.CL / 100 / 2603.01875
KDFlow: A User-Friendly and Efficient Knowledge Distillation Framework for Large Language Models
KDFlow:一个用户友好且高效的大型语言模型知识蒸馏框架
Abstract
Knowledge distillation (KD) is an essential technique to compress large language models (LLMs) into smaller ones. However, despite the distinct roles of the student model and the teacher model in KD, most existing frameworks still use a homogeneous training backend (e.g., FSDP and DeepSpeed) for both models, leading to suboptimal training efficiency. In this paper, we present a novel framework for LLM distillation, termed \textbf{KDFlow}, which features a decoupled architecture and employs SGLang for teacher inference. By bridging the training efficiency of FSDP2 and the inference efficiency of SGLang, KDFlow achieves full utilization of both advantages in a unified system. Moreover, instead of transferring full logits across different processes, our framework only transmits the teacher's hidden states using zero-copy data transfer and recomputes the logits on the student side, effectively balancing the communication cost and KD performance. Furthermore, our framework supports both off-policy and on-policy distillation and incorporates KD algorithms for cross-tokenizer KD through highly extensible and user-friendly APIs. Experiments show that KDFlow can achieve \textbf{1.44$\times$ to 6.36$\times$} speedup compared to current KD frameworks, enabling researchers to rapidly prototype and scale LLM distillation with minimal engineering overhead. Code is available at: https://github.com/songmzhang/KDFlow
Chinese Translation
知识蒸馏(Knowledge Distillation, KD)是一种将大型语言模型(Large Language Models, LLMs)压缩为较小模型的重要技术。然而,尽管学生模型和教师模型在KD中扮演着不同的角色,大多数现有框架仍然为这两种模型使用同质的训练后端(例如,FSDP和DeepSpeed),导致训练效率不佳。本文提出了一种新颖的LLM蒸馏框架,称为\textbf{KDFlow},该框架具有解耦架构,并采用SGLang进行教师推理。通过结合FSDP2的训练效率和SGLang的推理效率,KDFlow在统一系统中充分利用了这两者的优势。此外,我们的框架并不在不同进程之间传输完整的logits,而是仅使用零拷贝数据传输传递教师的隐藏状态,并在学生端重新计算logits,有效平衡了通信成本和KD性能。此外,我们的框架支持离策略和在策略的蒸馏,并通过高度可扩展且用户友好的API集成了跨分词器KD的KD算法。实验表明,KDFlow相比当前的KD框架可以实现\textbf{1.44$\times$到6.36$\times$}的加速,使研究人员能够以最小的工程开销快速原型化和扩展LLM蒸馏。代码可在以下链接获取:https://github.com/songmzhang/KDFlow
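The communication trick described above, shipping hidden states and recomputing the teacher's logits on the student side, trades a vocab-sized tensor for one extra matrix multiply, which pays off because hidden_dim is typically far smaller than vocab_size. A toy sketch with nested lists (the function names are illustrative, not KDFlow's API):

```python
def matmul(a, b):
    """Multiply matrix a (m x k) by b (k x n), both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def teacher_side(hidden_states):
    # Instead of sending full logits (seq_len x vocab_size), the teacher
    # ships only its last hidden states (seq_len x hidden_dim).
    return hidden_states

def student_side(hidden_states, teacher_lm_head):
    # The student recomputes the teacher's logits locally with one matmul
    # against the teacher's output projection (hidden_dim x vocab_size).
    return matmul(hidden_states, teacher_lm_head)

hidden_dim, vocab_size = 2, 5
hidden = [[1.0, 0.0], [0.5, 0.5]]             # seq_len=2 x hidden_dim
lm_head = [[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]]  # hidden_dim x vocab_size
logits = student_side(teacher_side(hidden), lm_head)

# Values transferred scale with hidden_dim, not vocab_size.
sent = len(hidden) * hidden_dim
full = len(hidden) * vocab_size
```

In a real LLM the ratio is far more favorable than in this toy (e.g. a few thousand hidden dimensions against a vocabulary of 100k+ entries).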
cs.CL / 101 / 2603.01910
FLANS at SemEval-2026 Task 7: RAG with Open-Sourced Smaller LLMs for Everyday Knowledge Across Diverse Languages and Cultures
FLANS在SemEval-2026任务7中的表现:利用开源小型LLMs进行跨多种语言和文化的日常知识检索增强生成
Abstract
This system paper describes our participation in the SemEval-2026 Task 7 ``Everyday Knowledge Across Diverse Languages and Cultures''. We participated in two subtasks, i.e., Track 1: Short Answer Questions (SAQ), and Track 2: Multiple-Choice Questions (MCQ). The methods we used are retrieval-augmented generation (RAG) with open-sourced smaller LLMs (OS-sLLMs). To better adapt to this shared task, we created our own culturally aware knowledge bases (CulKBs) by extracting Wikipedia content using keyword lists we prepared. We extracted both culturally aware wiki-text and country-specific wiki-summaries. In addition to the local CulKBs, we also built one system that integrates live online search output via DuckDuckGo. Towards better privacy and sustainability, we aimed to deploy smaller LLMs (sLLMs) that are open-sourced on the Ollama platform. We share the prompts we developed using refinement techniques and report the learning curve of such prompts. The tested languages are English, Spanish, and Chinese for both tracks. Our resources and codes are shared via https://github.com/aaronlifenghan/FLANS-2026
Chinese Translation
本系统论文描述了我们参与SemEval-2026任务7“跨多种语言和文化的日常知识”的过程。我们参加了两个子任务,即轨道1:简答题(SAQ)和轨道2:多项选择题(MCQ)。我们使用的方法是利用开源小型LLMs(OS-sLLMs)的检索增强生成(RAG)。为了更好地适应这一共享任务,我们通过使用我们准备的关键词列表提取维基百科内容,创建了自己的文化意识知识库(CulKBs)。我们提取了文化意识的维基文本和特定国家的维基摘要。除了本地的CulKBs外,我们还构建了一个集成DuckDuckGo实时在线搜索输出的系统。为了更好地保护隐私和可持续性,我们的目标是在Ollama平台上部署开源的小型LLMs(sLLMs)。我们分享了使用改进技术开发的提示,并报告了这些提示的学习曲线。测试语言包括英语、西班牙语和中文,适用于两个轨道。我们的资源和代码通过https://github.com/aaronlifenghan/FLANS-2026共享。
cs.CL / 102 / 2603.01912
Demonstrating ViviDoc: Generating Interactive Documents through Human-Agent Collaboration
展示ViviDoc:通过人机协作生成互动文档
Abstract
Interactive articles help readers engage with complex ideas through exploration, yet creating them remains costly, requiring both domain expertise and web development skills. Recent LLM-based agents can automate content creation, but naively applying them yields uncontrollable and unverifiable outputs. We present ViviDoc, a human-agent collaborative system that generates interactive educational documents from a single topic input. ViviDoc introduces a multi-agent pipeline (Planner, Executor, Evaluator) and the Document Specification (DocSpec), a human-readable intermediate representation that decomposes each interactive visualization into State, Render, Transition, and Constraint components. The DocSpec enables educators to review and refine generation plans before code is produced, bridging the gap between pedagogical intent and executable output. Expert evaluation and a user study show that ViviDoc substantially outperforms naive agentic generation and provides an intuitive editing experience. Our project homepage is available at https://vividoc-homepage.vercel.app/.
Chinese Translation
互动文章帮助读者通过探索复杂思想来参与其中,但创建这些文章仍然成本高昂,既需要领域专业知识,又需要网页开发技能。近期基于大语言模型(LLM)的代理可以自动化内容创建,但简单地应用它们会产生不可控且无法验证的输出。我们提出了ViviDoc,一个人机协作系统,能够从单一主题输入生成互动教育文档。ViviDoc引入了一个多代理管道(规划者、执行者、评估者)和文档规范(Document Specification, DocSpec),这是一个可供人类阅读的中间表示,能够将每个互动可视化分解为状态、渲染、过渡和约束组件。DocSpec使教育工作者能够在生成代码之前审查和完善生成计划,弥合了教育意图与可执行输出之间的差距。专家评估和用户研究表明,ViviDoc在性能上显著优于简单的代理生成,并提供了直观的编辑体验。我们的项目主页可访问:https://vividoc-homepage.vercel.app/
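The DocSpec decomposition into State, Render, Transition, and Constraint components might look roughly like the following; the concrete schema here is an illustrative guess, not ViviDoc's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class InteractiveSpec:
    """One interactive visualization in a DocSpec-style intermediate form.

    The four fields mirror the components named in the abstract (State,
    Render, Transition, Constraint); field types and contents are
    hypothetical, chosen only to show how an educator could review a
    generation plan before any code is produced.
    """
    state: dict                                       # variables the widget tracks
    render: str                                       # what to draw, human-readable
    transitions: list = field(default_factory=list)   # (event, state update)
    constraints: list = field(default_factory=list)   # invariants to enforce

spec = InteractiveSpec(
    state={"angle": 0},
    render="a pendulum whose rod tilts by `angle` degrees",
    transitions=[("drag", "set angle to pointer position")],
    constraints=["-90 <= angle <= 90"],
)
```

Because each field is human-readable, an educator can edit, say, the constraint list in place, and only the approved spec is handed to the Executor agent for code generation.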
cs.CL / 103 / 2603.01914
AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth
AdaPonderLM:具有令牌级自适应深度的门控思考语言模型
Abstract
Test-time scaling via recurrent/iterative Transformers enables large language models to spend more computation at inference, but most pretrained recurrent LMs run a fixed number of iterations, wasting compute on easy tokens and lacking token-wise adaptivity. Following the core idea of Adaptive Computation Time (ACT) and Early Exit (EE), we propose AdaPonderLM, a self-supervised recurrent language model that learns token-wise early exiting during pretraining without manually tuned per-token/per-layer pruning ratios. AdaPonderLM uses iteration-specific MLP gates with a monotonic halting mask to decide when each token stops recurring, and introduces a KV reuse mechanism that reuses cached key/value states for halted tokens, ensuring train--test consistency and practical acceleration. Across Pythia backbones from 70M to 410M (pretraining) and up to 2.8B (continued pretraining), AdaPonderLM reduces inference compute by about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy. Our analysis shows the learned gates allocate more computation to high-NLL (hard) tokens, exhibiting adaptive computation time behavior in a fully self-supervised setting. Meanwhile, under iso-FLOPs, the learned halting policy consistently outperforms fixed pruning, showing AdaPonderLM allocates compute to the right tokens rather than just reducing average depth.
Chinese Translation
通过递归/迭代变换器的测试时间扩展使大型语言模型在推理时能够投入更多计算,但大多数预训练的递归语言模型运行固定数量的迭代,导致在简单令牌上浪费计算,并缺乏令牌级自适应性。基于自适应计算时间(Adaptive Computation Time, ACT)和提前退出(Early Exit, EE)的核心思想,我们提出了AdaPonderLM,这是一种自监督递归语言模型,在预训练期间学习令牌级的提前退出,而无需手动调整每个令牌/每层的剪枝比例。AdaPonderLM使用特定于迭代的多层感知机(MLP)门控和单调停止掩码来决定每个令牌何时停止递归,并引入了一个键值重用机制,用于重用已缓存的键/值状态,以确保训练-测试一致性和实际加速。在从70M到410M(预训练)以及高达2.8B(继续预训练)的Pythia骨干网络上,AdaPonderLM在保持可比的语言建模困惑度和竞争性下游准确性的同时,将推理计算减少约10%。我们的分析表明,学习到的门控将更多计算分配给高负对数似然(high-NLL,困难)令牌,展现出在完全自监督环境下的自适应计算时间行为。同时,在相同的FLOPs下,学习到的停止策略始终优于固定剪枝,表明AdaPonderLM将计算分配给正确的令牌,而不仅仅是减少平均深度。
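The monotonic halting mask can be sketched as follows: each token iterates until its gate fires once, after which it stays halted (and, in the real system, its cached key/value states are simply reused). The gate below is a hypothetical stand-in for the learned, iteration-specific MLP gates.

```python
def run_with_halting(tokens, gate, max_iters):
    """Token-wise adaptive depth with a monotonic halting mask.

    `gate(token, step)` returns True when the token should stop iterating.
    Once halted, a token stays halted (monotonicity); its recorded depth is
    the number of iterations it actually spent.
    """
    halted = [False] * len(tokens)
    depth = [max_iters] * len(tokens)
    for step in range(max_iters):
        for i, tok in enumerate(tokens):
            if not halted[i] and gate(tok, step):
                halted[i] = True
                depth[i] = step + 1
    return depth

# Hypothetical gate: "easy" tokens halt after 1 iteration, "hard" ones
# never trigger the gate and run the full iteration budget.
gate = lambda tok, step: tok == "easy"
depths = run_with_halting(["easy", "hard", "easy"], gate, max_iters=4)
```

Averaged over a sequence, this is how compute savings arise: the mean of `depths` here is 2 iterations instead of the fixed budget of 4, without shortening computation for the hard token.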
cs.CL / 104 / 2603.01930
From Variance to Invariance: Qualitative Content Analysis for Narrative Graph Annotation
从方差到不变性:叙事图注释的定性内容分析
Abstract
Narratives in news discourse play a critical role in shaping public understanding of economic events, such as inflation. Annotating and evaluating these narratives in a structured manner remains a key challenge for Natural Language Processing (NLP). In this work, we introduce a narrative graph annotation framework that integrates principles from qualitative content analysis (QCA) to prioritize annotation quality by reducing annotation errors. We present a dataset of inflation narratives annotated as directed acyclic graphs (DAGs), where nodes represent events and edges encode causal relations. To evaluate annotation quality, we employed a $6\times3$ factorial experimental design to examine the effects of narrative representation (six levels) and distance metric type (three levels) on inter-annotator agreement (Krippendorff's $\alpha$), capturing the presence of human label variation (HLV) in narrative interpretations. Our analysis shows that (1) lenient metrics (overlap-based distance) overestimate reliability, and (2) locally-constrained representations (e.g., one-hop neighbors) reduce annotation variability. Our annotation and implementation of graph-based Krippendorff's $\alpha$ are open-sourced. The annotation framework and evaluation results provide practical guidance for NLP research on graph-based narrative annotation under HLV.
Chinese Translation
新闻话语中的叙事在塑造公众对经济事件(如通货膨胀)的理解中发挥着关键作用。以结构化方式对这些叙事进行注释和评估仍然是自然语言处理(NLP)面临的一个主要挑战。在本研究中,我们引入了一种叙事图注释框架,该框架结合了定性内容分析(QCA)的原则,通过减少注释错误来优先考虑注释质量。我们呈现了一个关于通货膨胀叙事的数据集,该数据集被注释为有向无环图(DAG),其中节点代表事件,边表示因果关系。为了评估注释质量,我们采用了$6\times3$因子实验设计,考察叙事表示(六个层次)和距离度量类型(三个层次)对注释者间一致性(Krippendorff's $\alpha$)的影响,捕捉叙事解释中的人类标签变异(HLV)的存在。我们的分析表明:(1)宽松的度量(基于重叠的距离)高估了可靠性;(2)局部约束的表示(例如,一跳邻居)减少了注释的变异性。我们的注释和基于图的Krippendorff's $\alpha$的实现已开源。该注释框架和评估结果为NLP研究提供了关于在HLV下进行基于图的叙事注释的实用指导。
cs.CL / 105 / 2603.01945
When Numbers Tell Half the Story: Human-Metric Alignment in Topic Model Evaluation
数字讲述一半故事:主题模型评估中的人类度量对齐
Abstract
Topic models uncover latent thematic structures in text corpora, yet evaluating their quality remains challenging, particularly in specialized domains. Existing methods often rely on automated metrics like topic coherence and diversity, which may not fully align with human judgment. Human evaluation tasks, such as word intrusion, provide valuable insights but are costly and primarily validated on general-domain corpora. This paper introduces Topic Word Mixing (TWM), a novel human evaluation task assessing inter-topic distinctness by testing whether annotators can distinguish between word sets from single or mixed topics. TWM complements word intrusion's focus on intra-topic coherence and provides a human-grounded counterpart to diversity metrics. We evaluate six topic models - both statistical and embedding-based (LDA, NMF, Top2Vec, BERTopic, CFMF, CFMF-emb) - comparing automated metrics with human evaluation methods based on nearly 4,000 annotations from a domain-specific corpus of philosophy of science publications. Our findings reveal that word intrusion and coherence metrics do not always align, particularly in specialized domains, and that TWM captures human-perceived distinctness while appearing to align with diversity metrics. We release the annotated dataset and task generation code. This work highlights the need for evaluation frameworks bridging automated and human assessments, particularly for domain-specific corpora.
Chinese Translation
主题模型揭示文本语料库中的潜在主题结构,但评估其质量仍然具有挑战性,尤其是在专业领域。现有方法通常依赖于自动化指标,如主题一致性和多样性,这些指标可能与人类判断并不完全一致。人类评估任务,如词汇干扰,提供了有价值的见解,但成本高昂且主要在一般领域的语料库上得到验证。本文介绍了一种新的人工评估任务——主题词混合(Topic Word Mixing, TWM),通过测试注释者是否能够区分来自单一或混合主题的词汇集,来评估主题间的独特性。TWM 补充了词汇干扰对主题内一致性的关注,并为多样性指标提供了一个以人为基础的对照。我们评估了六种主题模型——包括统计模型和基于嵌入的模型(LDA、NMF、Top2Vec、BERTopic、CFMF、CFMF-emb),比较了基于近 4,000 条来自科学哲学出版物的领域特定语料库的注释的自动化指标与人类评估方法。我们的研究结果表明,词汇干扰和一致性指标并不总是一致,尤其是在专业领域,并且 TWM 捕捉到人类感知的独特性,同时似乎与多样性指标一致。我们发布了注释数据集和任务生成代码。这项工作强调了建立连接自动化和人类评估的评估框架的必要性,特别是针对领域特定的语料库。
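A Topic Word Mixing item, as described above, is a small word set drawn either from a single topic or from two different topics, which annotators must then tell apart. A hypothetical item generator (the word counts, topics, and sampling scheme are illustrative, not the paper's exact protocol):

```python
import random

def twm_item(topics, mixed, k=4, rng=None):
    """Build one Topic Word Mixing item: a k-word set drawn from a single
    topic (mixed=False) or split between two different topics (mixed=True).
    Annotators judge whether the set looks single- or mixed-topic."""
    rng = rng or random.Random(0)
    names = list(topics)
    if mixed:
        a, b = rng.sample(names, 2)
        half = k // 2
        words = rng.sample(topics[a], half) + rng.sample(topics[b], k - half)
    else:
        t = rng.choice(names)
        words = rng.sample(topics[t], k)
    rng.shuffle(words)  # hide any ordering cue from the annotator
    return words

# Toy top-words lists for two (hypothetical) topics.
topics = {
    "physics": ["quark", "boson", "entropy", "photon"],
    "biology": ["enzyme", "ribosome", "mitosis", "allele"],
}
single_item = twm_item(topics, mixed=False)
mixed_item = twm_item(topics, mixed=True)
```

The more distinct a model's topics are, the easier the mixed items are to spot, which is why annotator accuracy on this task acts as a human-grounded counterpart to automated diversity metrics.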
cs.CL / 106 / 2603.01966
AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations
AMemGym:用于长时间对话助手的交互式记忆基准测试
Abstract
Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in training and evaluation of memory. Existing memory benchmarks rely on static, off-policy data as context, limiting evaluation reliability and scalability. To address these gaps, we introduce AMemGym, an interactive environment enabling on-policy evaluation and optimization for memory-driven personalization. AMemGym employs structured data sampling to predefine user profiles, state-dependent questions, and state evolution trajectories, enabling cost-effective generation of high-quality, evaluation-aligned interactions. LLM-simulated users expose latent states through role-play while maintaining structured state consistency. Comprehensive metrics based on structured data guide both assessment and optimization of assistants. Extensive experiments reveal performance gaps in existing memory systems (e.g., RAG, long-context LLMs, and agentic memory) and corresponding reasons. AMemGym not only enables effective selection among competing approaches but also can potentially drive the self-evolution of memory management strategies. By bridging structured state evolution with free-form interactions, our framework provides a scalable, diagnostically rich environment for advancing memory capabilities in conversational agents.
Chinese Translation
用户与基于大型语言模型(LLM)的助手之间的长时间交互需要有效的记忆管理,但当前的方法在记忆的训练和评估方面面临挑战。现有的记忆基准依赖于静态的离策略(off-policy)数据作为上下文,这限制了评估的可靠性和可扩展性。为了解决这些问题,我们引入了AMemGym,一个交互式环境,能够进行在策略(on-policy)的评估和优化,以实现记忆驱动的个性化。AMemGym采用结构化数据采样来预定义用户档案、状态依赖问题和状态演变轨迹,从而实现高效生成高质量、与评估对齐的交互。LLM模拟用户通过角色扮演暴露潜在状态,同时保持结构化状态的一致性。基于结构化数据的综合指标指导助手的评估和优化。大量实验揭示了现有记忆系统(如RAG、长上下文LLM和代理记忆)中的性能差距及其相应原因。AMemGym不仅能够有效地选择竞争方法,还可能推动记忆管理策略的自我演变。通过将结构化状态演变与自由形式的交互相结合,我们的框架提供了一个可扩展的、诊断丰富的环境,以推动对话代理的记忆能力的发展。
cs.CL / 107 / 2603.01973
CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production
CharacterFlywheel:在生产环境中规模化迭代改进引人入胜且可引导的大型语言模型
Nie, Yixin, Guan, Lin, Ma, Zhongyao, Gupta, Anchit, Zhou, Yipin, Li, Xiao, Zhou, Zhengping, Zeng, Raymond, Zhou, Gelin, Chu, Shigan, Thampi, Ajay, Mu, Wancen, Shuster, Nathan, Wang, Ketong, Chen, Lin, Brewer, Jason, Hu, Derek Hao, McCauley, Alexander, Weston, Jason, Park, Sem, Zhang, Na, Tang, Kevin
Abstract
This report presents CharacterFlywheel, an iterative flywheel process for improving large language models (LLMs) in production social chat applications across Instagram, WhatsApp, and Messenger. Starting from LLaMA 3.1, we refined models across 15 generations using data from both internal and external real-user traffic. Through continuous deployments from July 2024 to April 2025, we conducted controlled 7-day A/B tests showing consistent engagement improvements: 7 of 8 newly deployed models demonstrated positive lift over the baseline, with the strongest performers achieving up to 8.8% improvement in engagement breadth and 19.4% in engagement depth. We also observed substantial gains in steerability, with instruction following increasing from 59.2% to 84.8% and instruction violations decreasing from 26.6% to 5.8%. We detail the CharacterFlywheel process which integrates data curation, reward modeling to estimate and interpolate the landscape of engagement metrics, supervised fine-tuning (SFT), reinforcement learning (RL), and both offline and online evaluation to ensure reliable progress at each optimization step. We also discuss our methods for overfitting prevention and navigating production dynamics at scale. These contributions advance the scientific rigor and understanding of LLMs in social applications serving millions of users.
Chinese Translation
本报告介绍了CharacterFlywheel,这是一个用于改进生产社交聊天应用中大型语言模型(LLMs)的迭代飞轮过程,应用于Instagram、WhatsApp和Messenger。我们从LLaMA 3.1开始,利用来自内部和外部真实用户流量的数据,经过15代模型的精炼。在2024年7月至2025年4月的持续部署过程中,我们进行了为期7天的控制A/B测试,显示出一致的参与度改善:8个新部署模型中有7个在基线之上表现出积极提升,表现最好的模型在参与广度上提高了多达8.8%,在参与深度上提高了19.4%。我们还观察到可引导性显著提升,指令遵循率从59.2%提高到84.8%,指令违规率从26.6%降低到5.8%。我们详细描述了CharacterFlywheel过程,该过程整合了数据策划、奖励建模以估计和插值参与度指标的全景、监督微调(SFT)、强化学习(RL)以及离线和在线评估,以确保在每个优化步骤中可靠的进展。我们还讨论了防止过拟合的方法以及在大规模生产动态中导航的策略。这些贡献提升了对服务数百万用户的社交应用中LLMs的科学严谨性和理解。
cs.CL / 108 / 2603.02023
PonderLM-3: Adaptive Token-Wise Pondering with Differentiable Masking
PonderLM-3:具有可微掩蔽的自适应令牌级思考
Abstract
Test-time scaling has shown that allocating additional computation at inference can improve generation quality, motivating a natural follow-up question: where should this computation be spent? Building on this insight, we introduce PonderLM-3, a pretraining framework for token-wise adaptive pondering that learns to selectively allocate additional computation under purely self-supervised objectives, built on top of the PonderLM-2 backbone. This makes additional inference computation an allocatable per-token resource, so tokens receive more computation only when it is beneficial, rather than paying a uniform extra cost. To make this allocation learnable while maintaining train-inference consistency, PonderLM-3 injects a differentiable attention mask during pretraining and pairs it with a matching hard pruning rule at inference. PonderLM-3 defines a stronger Pareto frontier: compared with existing recursive or adaptive baselines, it achieves lower pretraining perplexity at equal inference FLOPs. On downstream benchmarks, PonderLM-3 attains comparable performance to fixed-step PonderLM-2 under the same maximum number of additional computation steps, while using fewer inference FLOPs in practice. Overall, PonderLM-3 provides an end-to-end differentiable and train-inference consistent framework for token-wise adaptive computation, enabling additional inference compute to be allocated where it is most useful rather than paid uniformly by every token.
Chinese Translation
测试时缩放表明,在推理过程中分配更多的额外计算可以提高生成质量,这引发了一个自然的后续问题:这些计算应该花费在哪里?基于这一见解,我们提出了PonderLM-3,这是一个令牌级自适应思考的预训练框架,能够在完全自监督的目标下学习选择性地分配额外计算,建立在PonderLM-2的基础上。这使得额外的推理计算成为可按令牌分配的资源,因此只有在有利时,令牌才会获得更多的计算,而不是支付统一的额外成本。为了在保持训练-推理一致性的同时使这种分配可学习,PonderLM-3在预训练期间注入了一个可微的注意力掩蔽,并在推理时与一个匹配的硬剪枝规则配对。PonderLM-3定义了一个更强的帕累托前沿:与现有的递归或自适应基线相比,它在相同的推理FLOPs下实现了更低的预训练困惑度。在下游基准测试中,PonderLM-3在相同的最大额外计算步骤下达到了与固定步长PonderLM-2相当的性能,同时在实际使用中消耗了更少的推理FLOPs。总体而言,PonderLM-3提供了一个端到端的可微分和训练-推理一致的框架,用于令牌级自适应计算,使得额外的推理计算能够分配到最有用的地方,而不是由每个令牌统一支付。
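The train-inference pairing described above, a differentiable mask during pretraining matched by a hard pruning rule at inference, can be sketched in miniature. The sigmoid gate, the threshold, and the per-token "ponder score" below are illustrative assumptions, not the paper's actual parameterization:

```python
import math

def soft_mask(score, tau=1.0):
    """Training-time differentiable gate: a sigmoid over a per-token
    ponder score (tau controls how sharp the gate is)."""
    return 1.0 / (1.0 + math.exp(-score / tau))

def hard_mask(score, threshold=0.0):
    """Inference-time pruning rule matched to the soft gate: allocate
    extra computation only when the score clears the threshold,
    i.e. when the soft gate would exceed 0.5."""
    return 1.0 if score > threshold else 0.0

scores = [-2.0, 0.1, 3.0]          # hypothetical per-token ponder scores
train_gates = [soft_mask(s) for s in scores]
infer_gates = [hard_mask(s) for s in scores]
# infer_gates == [0.0, 1.0, 1.0]: only the last two tokens ponder
```

The point of the matched pair is train-inference consistency: the hard rule at inference agrees with rounding the soft gate learned during pretraining.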
cs.CL / 109 / 2603.02024
MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning
MMR-Life:拼接真实场景以进行多模态多图像推理
Abstract
Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs' reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,646 multiple-choice questions based on 19,108 images primarily sourced from real-world contexts, comprehensively covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise but instead requires models to integrate information across multiple images and apply diverse reasoning abilities. The evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life. Even top models like GPT-5 achieve only 58% accuracy and display considerable variance in performance across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.
Chinese Translation
最近,多模态大型语言模型(MLLMs)推理能力的进展使其能够处理更复杂的任务,如科学分析和数学推理。尽管前景广阔,MLLMs在现实生活中不同场景下的推理能力仍然未得到充分探索,且缺乏标准化的评估基准。为填补这一空白,我们推出了MMR-Life,这是一个综合性基准,旨在评估MLLMs在现实生活场景中多样的多模态多图像推理能力。MMR-Life包含2,646个基于19,108幅主要来源于现实世界背景的图像的多项选择题,全面覆盖七种推理类型:溯因推理、类比推理、因果推理、演绎推理、归纳推理、空间推理和时间推理。与现有的推理基准不同,MMR-Life不依赖于特定领域的专业知识,而是要求模型整合多幅图像中的信息并应用多样的推理能力。对37个先进模型的评估突显了MMR-Life所带来的重大挑战。即使是像GPT-5这样的顶尖模型,其准确率也仅为58%,并且在不同推理类型之间的表现存在显著差异。此外,我们分析了现有MLLMs的推理范式,探讨了思维长度、推理方法和推理类型等因素如何影响其表现。总之,MMR-Life为评估、分析和改进下一代多模态推理系统奠定了全面的基础。
cs.CL / 110 / 2603.02041
EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training
EstLLM:通过持续预训练和后训练增强多语言LLM的爱沙尼亚语能力
Abstract
Large language models (LLMs) are predominantly trained on English-centric data, resulting in uneven performance for smaller languages. We study whether continued pretraining (CPT) can substantially improve Estonian capabilities in a pretrained multilingual LLM while preserving its English and general reasoning performance. Using Llama 3.1 8B as the main base model, we perform CPT on a mixture that increases Estonian exposure while approximating the original training distribution through English replay and the inclusion of code, mathematics, and instruction-like data. We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior. Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to the original base model and its instruction-tuned variant, while maintaining competitive performance on English benchmarks. These findings indicate that CPT, with an appropriately balanced data mixture, together with post-training alignment, can substantially improve single-language capabilities in pretrained multilingual LLMs.
Chinese Translation
大型语言模型(LLMs)主要在以英语为中心的数据上进行训练,这导致小语言的表现不均衡。我们研究持续预训练(CPT)是否能显著提升预训练多语言LLM中爱沙尼亚语的能力,同时保持其英语和一般推理性能。以Llama 3.1 8B作为主要基础模型,我们在一种混合数据集上进行CPT,该数据集增加了对爱沙尼亚语的曝光,同时通过英语重放和代码、数学及指令类数据的纳入来近似原始训练分布。随后,我们应用监督微调、偏好优化和聊天向量合并,以引入稳健的指令跟随行为。在一套全面的爱沙尼亚基准测试中评估表明,与原始基础模型及其指令调优变体相比,语言能力、知识、推理、翻译质量和指令跟随能力均有一致提升,同时在英语基准测试中保持竞争力。这些发现表明,CPT结合适当平衡的数据混合以及后期训练对齐,可以显著提升预训练多语言LLM中的单语言能力。
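The "chat vector merging" step mentioned in the abstract is commonly implemented as simple weight arithmetic: subtract the base model's parameters from its instruction-tuned variant and add that difference to the continued-pretrained model. A minimal numpy sketch, with toy single-tensor "models" and an assumed `alpha` scaling knob standing in for full state dicts:

```python
import numpy as np

def chat_vector_merge(cpt, base, instruct, alpha=1.0):
    """Add the 'chat vector' (instruct - base) to a continued-pretrained
    model, parameter tensor by parameter tensor."""
    return {name: cpt[name] + alpha * (instruct[name] - base[name])
            for name in cpt}

# Hypothetical one-tensor models standing in for full checkpoints.
base = {"w": np.array([1.0, 2.0])}
instruct = {"w": np.array([1.5, 2.5])}   # base + chat behavior
cpt = {"w": np.array([0.0, 4.0])}        # base + Estonian CPT
merged = chat_vector_merge(cpt, base, instruct)
# merged["w"] == [0.5, 4.5]: CPT weights shifted by the chat vector
```

The design intent is that instruction-following behavior transfers as a direction in weight space, so it can be grafted onto the Estonian-adapted model without re-running alignment from scratch.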
cs.CL / 111 / 2603.02082
What Exactly do Children Receive in Language Acquisition? A Case Study on CHILDES with Automated Detection of Filler-Gap Dependencies
儿童在语言习得中究竟获得了什么?基于自动检测填充-间隙依赖的CHILDES案例研究
Abstract
Children's acquisition of filler-gap dependencies has been argued by some to depend on innate grammatical knowledge, while others suggest that the distributional evidence available in child-directed speech suffices. Unfortunately, the relevant input is difficult to quantify at scale with fine granularity, making this question difficult to resolve. We present a system that identifies three core filler-gap constructions in spoken English corpora -- matrix wh-questions, embedded wh-questions, and relative clauses -- and further identifies the extraction site (i.e., subject vs. object vs. adjunct). Our approach combines constituency and dependency parsing, leveraging their complementary strengths for construction classification and extraction site identification. We validate the system on human-annotated data and find that it scores well across most categories. Applying the system to 57 English CHILDES corpora, we are able to characterize children's filler-gap input and their filler-gap production trajectories over the course of development, including construction-specific frequencies and extraction-site asymmetries. The resulting fine-grained labels enable future work in both acquisition and computational studies, which we demonstrate with a case study using filtered corpus training with language models.
Chinese Translation
关于儿童习得填充-间隙依赖的研究,有人认为这依赖于先天的语法知识,而另一些人则认为儿童语言输入中的分布性证据已足够。然而,相关输入的量化在大规模上具有挑战性,且难以细致分析,使得这一问题难以解决。我们提出了一种系统,能够识别英语口语语料库中的三种核心填充-间隙结构——主句wh-疑问句、嵌入式wh-疑问句和关系从句,并进一步识别提取位置(即主语、宾语和附加成分)。我们的方法结合了成分分析和依赖分析,利用它们的互补优势进行结构分类和提取位置识别。我们在人工标注的数据上验证了该系统,发现其在大多数类别中表现良好。将该系统应用于57个英语CHILDES语料库,我们能够描述儿童的填充-间隙输入及其在发展过程中填充-间隙的产生轨迹,包括特定结构的频率和提取位置的不对称性。由此产生的细粒度标签为未来在习得和计算研究中的工作提供了支持,我们通过使用过滤语料训练语言模型的案例研究进行了演示。
cs.CL / 112 / 2603.02084
Modeling Grammatical Hypothesis Testing in Young Learners: A Sequence-Based Learning Analytics Study of Morphosyntactic Reasoning in an Interactive Game
年轻学习者语法假设测试建模:基于序列的学习分析研究互动游戏中的形态句法推理
Abstract
This study investigates grammatical reasoning in primary school learners through a sequence-based learning analytics approach, leveraging fine-grained action sequences from an interactive game targeting morphosyntactic agreement in French. Unlike traditional assessments that rely on final answers, we treat each slider movement as a hypothesis-testing action, capturing real-time cognitive strategies during sentence construction. Analyzing 597 gameplay sessions (9,783 actions) from 100 students aged 8-11 in authentic classroom settings, we introduce Hamming distance to quantify proximity to valid grammatical solutions and examine convergence patterns across exercises with varying levels of difficulty. Results reveal that determiners and verbs are key sites of difficulty, with action sequences deviating from the usual left-to-right order of treatment. This suggests learners often fix the verb first and adjust preceding elements. Exercises with fewer solutions exhibit slower and more erratic convergence, while changes in the closest valid solution indicate dynamic hypothesis revision. Our findings demonstrate how sequence-based analytics can uncover hidden dimensions of linguistic reasoning, offering a foundation for real-time scaffolding and teacher-facing tools in linguistically diverse classrooms.
Chinese Translation
本研究通过基于序列的学习分析方法,探讨小学学习者的语法推理,利用来自针对法语形态句法一致性的互动游戏的细粒度动作序列。与依赖最终答案的传统评估不同,我们将每次滑块移动视为假设测试行为,实时捕捉句子构建过程中的认知策略。我们分析了来自100名8至11岁学生在真实课堂环境中进行的597次游戏会话(9,783个动作),引入汉明距离来量化与有效语法解的接近程度,并考察在不同难度练习中的收敛模式。结果显示,限定词和动词是主要的难点,动作序列偏离了通常的从左到右处理方式。这表明学习者往往首先固定动词,然后调整前面的元素。解决方案较少的练习表现出更慢且不规则的收敛,而最近有效解的变化则表明动态假设修订。我们的研究结果展示了基于序列的分析如何揭示语言推理的隐藏维度,为语言多样化课堂中的实时支架和教师工具提供了基础。
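The Hamming-distance measure in the abstract above can be sketched directly: treat each slider as one sentence slot and count mismatched slots against each valid grammatical solution, tracking the nearest one as the learner acts. The French sentence slots and solution sets below are hypothetical examples, not taken from the game:

```python
def hamming(state, solution):
    """Number of slots where the learner's sliders differ from a solution."""
    assert len(state) == len(solution)
    return sum(s != t for s, t in zip(state, solution))

def nearest_valid(state, solutions):
    """Closest valid solution and its Hamming distance to the current state."""
    return min((hamming(state, sol), sol) for sol in solutions)

# Hypothetical exercise: 4 slots (determiner, noun, verb, adjective),
# each slider selecting one inflected form; two valid agreement patterns.
solutions = [("les", "chats", "mangent", "noirs"),
             ("le", "chat", "mange", "noir")]
state = ("les", "chat", "mangent", "noirs")   # mid-game slider positions
d, sol = nearest_valid(state, solutions)
# d == 1: only the noun slot deviates from the plural solution
```

Recomputing `nearest_valid` after every slider move yields the convergence trajectories the study analyzes; a change in which solution is nearest signals a hypothesis revision.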
cs.CL / 113 / 2603.02097
ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels
ClinConsensus:一种基于共识的基准,用于评估不同难度水平的中文医疗大语言模型
Abstract
Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows. We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts. ClinConsensus comprises 2500 open-ended cases spanning the full continuum of care--from prevention and intervention to long-term follow-up--covering 36 medical specialties, 12 common clinical task types, and progressively increasing levels of complexity. To enable reliable evaluation of such complex scenarios, we adopt a rubric-based grading protocol and propose the Clinically Applicable Consistency Score (CACS@k). We further introduce a dual-judge evaluation framework, combining a high-capability LLM-as-judge with a distilled, locally deployable judge model trained via supervised fine-tuning, enabling scalable and reproducible evaluation aligned with physician judgment. Using ClinConsensus, we conduct a comprehensive assessment of several leading LLMs and reveal substantial heterogeneity across task themes, care stages, and medical specialties. While top-performing models achieve comparable overall scores, they differ markedly in reasoning, evidence use, and longitudinal follow-up capabilities, and clinically actionable treatment planning remains a key bottleneck. We release ClinConsensus as an extensible benchmark to support the development and evaluation of medical LLMs that are robust, clinically grounded, and ready for real-world deployment.
Chinese Translation
大型语言模型(LLMs)在健康管理中的应用日益增多,在疾病预防、临床决策和长期护理等方面展现出良好的前景。然而,现有的医学基准仍然大多是静态的和任务孤立的,未能捕捉到真实临床工作流程的开放性、纵向结构和安全关键的复杂性。我们引入了ClinConsensus,一个由临床专家策划、验证和质量控制的中文医学基准。ClinConsensus包含2500个开放式案例,涵盖了从预防和干预到长期随访的完整护理过程,涉及36个医学专业、12种常见临床任务类型,并逐步增加复杂性水平。为了能够可靠地评估这些复杂场景,我们采用了基于评分标准的评分协议,并提出了临床适用一致性评分(CACS@k)。我们进一步引入了双评审评估框架,结合高能力的LLM作为评审与通过监督微调训练的精简本地可部署评审模型,使得评估能够与医生判断相一致,具备可扩展性和可重复性。通过使用ClinConsensus,我们对几种领先的LLM进行了全面评估,并揭示了任务主题、护理阶段和医学专业之间的显著异质性。尽管表现最佳的模型在整体评分上相当,但在推理、证据使用和纵向随访能力上存在显著差异,而临床可操作的治疗规划仍然是一个关键瓶颈。我们发布ClinConsensus作为一个可扩展的基准,以支持开发和评估稳健、以临床为基础并准备好进行实际部署的医疗LLM。
cs.CL / 114 / 2603.02099
Recursive Think-Answer Process for LLMs and VLMs
大型语言模型和视觉语言模型的递归思考-回答过程
Abstract
Think-Answer reasoners such as DeepSeek-R1 have made notable progress by leveraging interpretable internal reasoning. However, despite the frequent presence of self-reflective cues like "Oops!", they remain vulnerable to output errors during single-pass inference. To address this limitation, we propose an efficient Recursive Think-Answer Process (R-TAP) that enables models to engage in iterative reasoning cycles and generate more accurate answers, going beyond conventional single-pass approaches. Central to this approach is a confidence generator that evaluates the certainty of model responses and guides subsequent improvements. By incorporating two complementary rewards -- Recursively Confidence Increase Reward and Final Answer Confidence Reward -- we show that R-TAP-enhanced models consistently outperform conventional single-pass methods for both large language models (LLMs) and vision-language models (VLMs). Moreover, by analyzing the frequency of "Oops"-like expressions in model responses, we find that R-TAP-applied models exhibit significantly fewer self-reflective patterns, resulting in more stable and faster inference-time reasoning. We hope R-TAP paves the way toward efficient and elaborate methods for refining the reasoning processes of future AI.
Chinese Translation
思考-回答推理器,如 DeepSeek-R1,通过利用可解释的内部推理取得了显著进展。然而,尽管经常出现诸如“哎呀!”等自我反思提示,它们在单次推理过程中仍然容易出现输出错误。为了解决这一限制,我们提出了一种高效的递归思考-回答过程(Recursive Think-Answer Process,R-TAP),使模型能够进行迭代推理循环,从而生成更准确的答案,超越传统的单次推理方法。该方法的核心是一个置信度生成器,它评估模型响应的确定性并指导后续改进。通过结合两种互补奖励——递归置信度提升奖励(Recursively Confidence Increase Reward)和最终答案置信度奖励(Final Answer Confidence Reward),我们表明,经过 R-TAP 增强的模型在大型语言模型(LLMs)和视觉语言模型(VLMs)上均持续优于传统的单次推理方法。此外,通过分析模型响应中“哎呀!”类表达的频率,我们发现应用 R-TAP 的模型表现出显著更少的自我反思模式,从而在推理时实现更稳定、更快速的推理。我们希望 R-TAP 能为未来 AI 推理过程的高效和精细化方法的发展铺平道路。
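The two complementary rewards named in the abstract suggest a simple shaping scheme over a trajectory of per-pass confidence scores. The formula below is a toy reconstruction under assumed definitions; the actual reward forms and weighting in the paper may differ:

```python
def rtap_rewards(confidences, lambda_inc=0.5):
    """Toy reward shaping for a recursive think-answer trajectory.

    `confidences` are confidence-generator scores c_1..c_T for each
    recursive pass. The increase term pays for confidence gains between
    consecutive passes; the final term pays the last pass's confidence.
    (Both terms and lambda_inc are illustrative assumptions.)
    """
    increase = sum(max(0.0, b - a)
                   for a, b in zip(confidences, confidences[1:]))
    final = confidences[-1]
    return lambda_inc * increase + final

# A trajectory whose confidence climbs over three recursive passes:
reward = rtap_rewards([0.3, 0.5, 0.9])   # 0.5 * 0.6 + 0.9, i.e. ~1.2
```

Under this shaping, a model is paid both for ending confident and for making each recursive pass an improvement, which discourages aimless re-thinking.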
cs.CL / 115 / 2603.02128
LLMs as Strategic Actors: Behavioral Alignment, Risk Calibration, and Argumentation Framing in Geopolitical Simulations
大型语言模型作为战略参与者:地缘政治模拟中的行为对齐、风险校准与论证框架
Abstract
Large language models (LLMs) are increasingly proposed as agents in strategic decision environments, yet their behavior in structured geopolitical simulations remains under-researched. We evaluate six popular state-of-the-art LLMs alongside results from human participants across four real-world crisis simulation scenarios, requiring models to select predefined actions and justify their decisions across multiple rounds. We compare models to humans in action alignment, risk calibration through chosen actions' severity, and argumentative framing grounded in international relations theory. Results show that models approximate human decision patterns in base simulation rounds but diverge over time, displaying distinct behavioural profiles and strategy updates. LLM explanations for chosen actions across all models exhibit a strong normative-cooperative framing centered on stability, coordination, and risk mitigation, with limited adversarial reasoning.
Chinese Translation
大型语言模型(LLMs)越来越多地被提议作为战略决策环境中的代理,但它们在结构化地缘政治模拟中的行为仍然未得到充分研究。我们评估了六种流行的最先进LLM,并将其与人类在四个现实危机模拟场景中的结果进行了比较,这些场景要求模型在多个回合中选择预定义的行动并为其决策提供理由。我们在行动对齐、通过选择行动的严重性进行的风险校准以及基于国际关系理论的论证框架方面将模型与人类进行了比较。结果显示,模型在基础模拟回合中接近人类的决策模式,但随着时间的推移出现分歧,展现出不同的行为特征和战略更新。所有模型对所选行动的解释表现出强烈的规范性合作框架,集中于稳定性、协调和风险缓解,且对抗性推理有限。
cs.CL / 116 / 2603.02146
LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
LongRLVR:长上下文强化学习需要可验证的上下文奖励
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual grounding--the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to effectively guide the model for identifying relevant evidence. We formally prove that the outcome-only reward leads to significant vanishing gradients for the context grounding process, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model for selecting the correct grounding information, providing a robust learning gradient that solves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. LongRLVR consistently and significantly outperforms the standard RLVR across all models and benchmarks, e.g., boosting a 14B model's scores on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications. Our code is available at https://github.com/real-absolute-AI/LongRLVR.
Chinese Translation
带有可验证奖励的强化学习(RLVR)通过针对事实结果进行优化,显著提升了大型语言模型(LLMs)的推理能力。然而,这一范式在长上下文场景中表现不佳,因为其对内部参数知识的依赖不适合需要上下文基础的任务——即寻找和推理外部提供信息的能力。我们识别出这一失败的一个关键原因:仅基于最终答案的奖励过于稀疏,无法有效引导模型识别相关证据。我们正式证明,仅基于结果的奖励会导致上下文基础过程中的显著梯度消失,从而使学习变得不可行。为了解决这一瓶颈,我们引入了LongRLVR,以稀疏的答案奖励为基础,增加密集且可验证的上下文奖励。这一辅助信号直接激励模型选择正确的基础信息,提供了一个强健的学习梯度,从而解决了潜在的优化挑战。我们在使用Qwen和LLaMA模型的具有挑战性的长上下文基准上验证了我们的方法。LongRLVR在所有模型和基准上始终显著优于标准RLVR,例如,将一个14B模型在RULER-QA上的得分从73.17提升到88.90,在LongBench v2上的得分从39.8提升到46.5。我们的工作表明,明确奖励基础过程是释放LLMs在长上下文应用中全部推理潜力的关键且有效的策略。我们的代码可在https://github.com/real-absolute-AI/LongRLVR获取。
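The dense context reward described above can be pictured as a verifiable overlap score between the passages the model grounds its answer in and the gold evidence set, added to the sparse outcome reward. The F1 choice and the `beta` weight below are illustrative assumptions, not the paper's exact formulation:

```python
def context_reward(cited, gold):
    """Dense, verifiable reward: F1 between the passages the model
    cited as grounding and the gold evidence set."""
    cited, gold = set(cited), set(gold)
    if not cited or not gold:
        return 0.0
    p = len(cited & gold) / len(cited)   # precision of the citations
    r = len(cited & gold) / len(gold)    # recall of gold evidence
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def longrlvr_reward(answer_ok, cited, gold, beta=0.5):
    """Sparse outcome reward plus dense context-grounding reward
    (beta is an assumed weighting)."""
    return float(answer_ok) + beta * context_reward(cited, gold)

# Correct answer grounded in one of two gold passages:
reward = longrlvr_reward(True, cited=["p1"], gold=["p1", "p2"])
# 1.0 for the answer plus 0.5 * F1(2/3), i.e. ~1.33
```

The key property is that the grounding term pays out even when the final answer is wrong, supplying a learning gradient where the outcome-only reward would vanish.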
cs.CL / 117 / 2603.02150
Zero- and Few-Shot Named-Entity Recognition: Case Study and Dataset in the Crime Domain (CrimeNER)
零样本与少样本命名实体识别:犯罪领域的案例研究与数据集(CrimeNER)
Abstract
The extraction of critical information from crime-related documents is a crucial task for law enforcement agencies. Named-Entity Recognition (NER) can perform this task in extracting information about the crime, the criminal, or law enforcement agencies involved. However, there is a considerable lack of adequately annotated data on general real-world crime scenarios. To address this issue, we present CrimeNER, a case study of crime-related zero- and few-shot NER, and a general crime-related Named-Entity Recognition database (CrimeNERdb) consisting of more than 1.5k annotated documents for the NER task extracted from public reports on terrorist attacks and the U.S. Department of Justice's press notes. We define 5 types of coarse crime entity and a total of 22 types of fine-grained entity. We address the quality of the case study and the annotated data with experiments in zero- and few-shot settings with state-of-the-art NER models as well as generalist and commonly used Large Language Models.
Chinese Translation
从与犯罪相关的文档中提取关键信息是执法机构的一项重要任务。命名实体识别(NER)能够提取关于犯罪、犯罪嫌疑人或相关执法机构的信息。然而,关于一般现实世界犯罪场景的充分标注数据严重不足。为了解决这一问题,我们提出了CrimeNER,这是一个关于犯罪相关的零样本与少样本NER的案例研究,并建立了一个通用的犯罪相关命名实体识别数据库(CrimeNERdb),该数据库包含超过1500个从恐怖袭击的公共报告和美国司法部的新闻稿中提取的标注文档。我们定义了5种粗粒度的犯罪实体类型和总共22种细粒度的实体类型。我们通过在零样本和少样本设置下使用最先进的NER模型以及通用且常用的大型语言模型进行实验,来验证案例研究和标注数据的质量。
cs.CL / 118 / 2603.02176
Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale
在生态系统规模下组织、协调和基准化代理技能
Abstract
The rapid proliferation of Claude agent skills has raised the central question of how to effectively leverage, manage, and scale the agent skill ecosystem. In this paper, we propose AgentSkillOS, the first principled framework for skill selection, orchestration, and ecosystem-level management. AgentSkillOS comprises two stages: (i) Manage Skills, which organizes skills into a capability tree via node-level recursive categorization for efficient discovery; and (ii) Solve Tasks, which retrieves, orchestrates, and executes multiple skills through DAG-based pipelines. To evaluate the agent's ability to invoke skills, we construct a benchmark of 30 artifact-rich tasks across five categories: data computation, document creation, motion video, visual design, and web interaction. We assess the quality of task outputs using LLM-based pairwise evaluation, and the results are aggregated via a Bradley-Terry model to produce unified quality scores. Experiments across three skill ecosystem scales (200 to 200K skills) show that tree-based retrieval effectively approximates oracle skill selection, and that DAG-based orchestration substantially outperforms native flat invocation even when given the identical skill set. Our findings confirm that structured composition is the key to unlocking skill potential. Our GitHub repository is available at: https://github.com/ynulihao/AgentSkillOS.
Chinese Translation
Claude 代理技能的快速扩展引发了一个核心问题,即如何有效利用、管理和扩展代理技能生态系统。本文提出了 AgentSkillOS,这是第一个针对技能选择、协调和生态系统级管理的原则性框架。AgentSkillOS 包含两个阶段:(i) 管理技能,通过节点级递归分类将技能组织成能力树,以实现高效发现;(ii) 解决任务,通过基于有向无环图(DAG)的管道检索、协调和执行多个技能。为了评估代理调用技能的能力,我们构建了一个包含 30 个产出物丰富(artifact-rich)任务的基准,涵盖五个类别:数据计算、文档创建、运动视频、视觉设计和网页交互。我们使用基于大型语言模型(LLM)的成对评估来评估任务输出的质量,并通过 Bradley-Terry 模型汇总结果以生成统一的质量评分。在三个技能生态系统规模(从 200 到 200K 技能)的实验中,树形检索有效地近似了理想(oracle)的技能选择,而基于 DAG 的协调在给定相同技能集的情况下显著优于原生的平面调用。我们的研究结果确认,结构化组合是释放技能潜力的关键。我们的 GitHub 仓库可在以下链接访问:https://github.com/ynulihao/AgentSkillOS。
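The Bradley-Terry aggregation step mentioned in the abstract turns pairwise LLM judgments into unified quality scores. A self-contained sketch using the standard minorization-maximization updates; the toy win matrix is hypothetical, and the paper's actual fitting procedure may differ in details:

```python
def bradley_terry(wins, n_iters=200):
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] = number of pairwise comparisons system i won against j.
    Uses the classic minorization-maximization updates and normalizes
    the strengths to sum to 1.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i])                       # total wins of i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new.append(w_i / denom if denom else p[i])
        s = sum(new)
        p = [x / s for x in new]                     # renormalize
    return p

# Toy tournament: system A beats B 8/10 and C 9/10; B beats C 7/10.
wins = [[0, 8, 9],
        [2, 0, 7],
        [1, 3, 0]]
scores = bradley_terry(wins)
# scores are ordered A > B > C
```

Aggregating this way lets many noisy pairwise verdicts, possibly over different task subsets, collapse into one consistent leaderboard.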
cs.CL / 119 / 2603.02208
Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training
推理核心:一个可扩展的程序化数据生成套件,用于符号预训练和后训练
Abstract
Training on verifiable symbolic data is a promising way to expand the reasoning frontier of language models beyond what standard pre-training corpora provide. Yet existing procedural generators often rely on fixed puzzles or templates and do not deliver the distributional breadth needed at scale. We introduce Reasoning Core, a scalable suite that procedurally generates verifiable symbolic reasoning data across core formal domains: PDDL planning over randomized domains, first-order logic with equality, context-free grammar parsing and generation, causal reasoning over random Bayesian networks, and systems of equations. Each task is paired with an external solver for rigorous verification and admits continuous difficulty control for curriculum design. Examples can optionally include solver-derived reasoning traces, enabling supervised training from the earliest pre-training stages, and the same interface provides verifiable reward functions for reinforcement learning. Our experiments show that mixing Reasoning Core data into pre-training improves downstream reasoning while preserving, or slightly improving, language modeling quality. Zero-shot evaluations confirm these tasks challenge frontier models such as GPT-5. The code and data are publicly available under the MIT license.
Chinese Translation
在可验证的符号数据上进行训练是一种有前景的方法,可以将语言模型的推理前沿扩展到标准预训练语料库所提供的范围之外。然而,现有的程序生成器通常依赖于固定的谜题或模板,无法在规模上提供所需的分布广度。我们提出了推理核心(Reasoning Core),这是一个可扩展的套件,能够在核心形式领域中程序化生成可验证的符号推理数据:在随机化领域上的 PDDL 规划、带等式的一阶逻辑、上下文无关文法解析与生成、在随机贝叶斯网络上的因果推理,以及方程组。每个任务都配备有外部求解器以进行严格验证,并允许对课程设计进行持续的难度控制。示例可以选择性地包括求解器生成的推理轨迹,使得从最早的预训练阶段开始进行监督训练成为可能,并且相同的接口提供可验证的奖励函数以用于强化学习。我们的实验表明,将推理核心数据混入预训练中可以改善下游推理,同时保持或略微改善语言建模质量。零样本评估确认这些任务对前沿模型如 GPT-5 形成挑战。代码和数据在 MIT 许可证下公开可用。
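The systems-of-equations domain in the abstract hints at the general recipe: plant a solution, derive the problem from it, and expose a verifier. A minimal sketch under that assumption; the suite's actual generators, solver interfaces, and difficulty controls are richer than this:

```python
import random

def gen_linear_system(rng, n=2, lo=-5, hi=5):
    """Procedurally generate an n-variable linear system with a planted
    integer solution, plus a verifier for any candidate answer.

    The coefficient and value ranges are illustrative difficulty knobs.
    """
    sol = [rng.randint(lo, hi) for _ in range(n)]        # planted solution
    rows = []
    for _ in range(n):
        coeffs = [rng.randint(1, 4) for _ in range(n)]
        rhs = sum(c * x for c, x in zip(coeffs, sol))    # consistent by construction
        rows.append((coeffs, rhs))

    def verify(candidate):
        """External check: substitute the candidate into every equation."""
        return all(sum(c * x for c, x in zip(coeffs, candidate)) == rhs
                   for coeffs, rhs in rows)

    return rows, sol, verify

rng = random.Random(7)
rows, sol, verify = gen_linear_system(rng)
# verify(sol) is always True; perturbed candidates fail the check
```

Because the problem is derived from the planted solution, every generated instance comes with a ground-truth verifier for free, which is exactly what makes the data usable as a verifiable reward signal for RL.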