cs.RO / 1 / 2604.16331
BrainMem: Brain-Inspired Evolving Memory for Embodied Agent Task Planning
BrainMem:基于大脑的进化记忆用于具身智能体任务规划
Abstract
Embodied task planning requires agents to execute long-horizon, goal-directed actions in complex 3D environments, where success depends on both immediate perception and accumulated experience across tasks. However, most existing LLM-based planners are stateless and reactive, operating without persistent memory and therefore repeating errors and struggling with spatial or temporal dependencies. We propose BrainMem (Brain-Inspired Evolving Memory), a training-free hierarchical memory system that equips embodied agents with working, episodic, and semantic memory inspired by human cognition. BrainMem continuously transforms interaction histories into structured knowledge graphs and distilled symbolic guidelines, enabling planners to retrieve, reason over, and adapt behaviors from past experience without any model fine-tuning or additional training. This plug-and-play design integrates seamlessly with arbitrary multi-modal LLMs and greatly reduces reliance on task-specific prompt engineering. Extensive experiments on four representative benchmarks, including EB-ALFRED, EB-Navigation, EB-Manipulation, and EB-Habitat, demonstrate that BrainMem significantly enhances task success rates across diverse models and difficulty subsets, with the largest gains observed on long-horizon and spatially complex tasks. These results highlight evolving memory as a promising and scalable mechanism for generalizable embodied intelligence.
Chinese Translation
具身任务规划要求智能体在复杂的三维环境中执行长时间、目标导向的动作,其成功依赖于即时感知和跨任务的累积经验。然而,现有的大型语言模型(LLM)基础的规划器大多是无状态和反应式的,缺乏持久记忆,因此容易重复错误,并在空间或时间依赖性方面面临困难。我们提出了BrainMem(基于大脑的进化记忆),这是一种无训练的层次记忆系统,赋予具身智能体工作记忆、情节记忆和语义记忆,灵感来源于人类认知。BrainMem持续将交互历史转化为结构化知识图谱和提炼的符号指导,使规划器能够从过去的经验中检索、推理和调整行为,而无需任何模型微调或额外训练。这种即插即用的设计与任意多模态LLM无缝集成,极大减少了对特定任务提示工程的依赖。在四个代表性的基准测试上,包括EB-ALFRED、EB-Navigation、EB-Manipulation和EB-Habitat,广泛实验表明,BrainMem显著提高了不同模型和难度子集上的任务成功率,尤其在长时间和空间复杂的任务上取得了最大的提升。这些结果突显了进化记忆作为一种有前景且可扩展的机制,能够促进可推广的具身智能。
cs.RO / 2 / 2604.16381
Interdisciplinary Workshop on Mechanical Intelligence: Summary Report
机械智能跨学科研讨会:总结报告
Webster-Wood, Victoria A., Gravish, Nicholas, Alavi, Amir, Arrieta, Andres F, Bergbreiter, Sarah, Bloch, Anthony, Blumenschein, Laura, Cao, C. Chase, Carter, Aja Mia, Celli, Paolo, Chen, Tony, Coad, Margaret, Cutkosky, Mark, Dickey, Michael, Do, Brian, Full, Robert, Haghshenas-Jaryani, Mahdi, Jayaram, Kaushik, Johnson, Aaron, Kanso, Eva, Lejeune, Emma, Li, Chen, Li, Suyi, Lipton, Jeffrey, MacCurdy, Rob, McHenry, Matt, Mongeau, Jean-Michel, Murphey, Todd, Plecnik, Mark, Raney, Jordan, Sochol, Ryan D., Stuart, Hannah, Temel, Zeynep, Tolley, Michael, Trimmer, Barry, Wallin, T. J., Wang, Kon-Well, Yan, Wenzhong, Yim, Mark, Zhang, Wenlong
Abstract
This report provides a summary of the outcomes of the Interdisciplinary Workshop on Mechanical Intelligence held in 2024. Mechanical Intelligence (MI) represents the phenomenon that novel structural features of material/biological/robotic systems can encode intelligence through responsiveness, adaptivity, memory, and learning in the mechanical structure itself. This is in contrast to computational intelligence, wherein the intelligence functions occur through electrical signaling and computer code. The two-day workshop was held at NSF headquarters on May 30-31 and included 38 invited academic researcher participants, and 8 program officers from the NSF. The workshop was structured around active small and large group discussions in groups of 4-5 and 9-10 with the goal of addressing topical questions on MI. Working groups entered notes into shared presentation slides for each discussion session and presented their outcomes in a final presentation on the last day. Here we summarize the overall outcomes of the workshop.
Chinese Translation
本报告总结了2024年举行的机械智能跨学科研讨会的成果。机械智能(Mechanical Intelligence, MI)指的是材料/生物/机器人系统的新型结构特征能够通过机械结构本身的响应性、适应性、记忆和学习来编码智能。这与计算智能形成对比,后者的智能功能通过电信号和计算机代码实现。此次为期两天的研讨会于5月30日至31日在美国国家科学基金会(NSF)总部举行,邀请了38位学术研究者和8位NSF项目官员参与。研讨会围绕小组和大组讨论进行,分为4-5人和9-10人的小组,旨在探讨与MI相关的主题问题。工作组在每个讨论环节中将笔记输入共享的演示幻灯片,并在最后一天的最终报告中展示他们的成果。本文总结了研讨会的整体成果。
cs.RO / 3 / 2604.16384
RHINO-AR: An Augmented Reality Exhibit for Teaching Mobile Robotics Concepts in Museums
RHINO-AR:用于在博物馆教授移动机器人概念的增强现实展览
Abstract
We present RHINO-AR, an interactive Augmented Reality (AR) museum exhibit that reintroduces the historical mobile robot RHINO into its original exhibition environment at the Deutsches Museum Bonn. The system builds on our previous work RHINO-VR, which reconstructed the robot and the environment in virtual reality. Although this created an engaging experience, it also revealed an important limitation, because visitors were separated from the real exhibition space and from the physical robot on display. RHINO-AR addresses this reality gap by placing a virtual reconstruction of the robot directly into the real museum space. Implemented on a Magic Leap 2 headset using Unity, our system combines real-time environment meshing with interactive visualizations of LiDAR sensing, traversability, and path planning to make otherwise invisible robotics processes understandable to non-expert visitors. We evaluated RHINO-AR in a two-day museum study with 22 participants, assessing usability, technical performance, satisfaction, conceptual understanding, and preference comparison to RHINO-VR. The results show that RHINO-AR was well received, effectively conveyed key navigation concepts, and was generally preferred over the VR exhibit due to its stronger physical grounding and increased realism.
Chinese Translation
我们介绍了RHINO-AR,这是一个互动的增强现实(AR)博物馆展览,将历史移动机器人RHINO重新引入其在波恩德意志博物馆的原始展览环境。该系统基于我们之前的工作RHINO-VR,该工作在虚拟现实中重建了机器人及其环境。尽管这创造了一个引人入胜的体验,但也揭示了一个重要的局限性,因为参观者与真实的展览空间和展出的物理机器人之间存在隔离。RHINO-AR通过将机器人的虚拟重建直接置于真实的博物馆空间中来解决这一现实差距。该系统在Magic Leap 2头戴设备上使用Unity实现,结合了实时环境网格生成与LiDAR传感、可通行性和路径规划的互动可视化,使得非专业访客能够理解原本不可见的机器人过程。我们在为期两天的博物馆研究中评估了RHINO-AR,参与者为22人,评估内容包括可用性、技术性能、满意度、概念理解以及与RHINO-VR的偏好比较。结果表明,RHINO-AR受到了良好的反馈,有效传达了关键的导航概念,并且由于其更强的物理基础和增强的现实感,通常比VR展览更受欢迎。
cs.RO / 4 / 2604.16388
Visual-RRT: Finding Paths toward Visual-Goals via Differentiable Rendering
视觉-RRT:通过可微渲染寻找朝向视觉目标的路径
Abstract
Rapidly-exploring random trees (RRTs) have been widely adopted for robot motion planning due to their robustness and theoretical guarantees. However, existing RRT-based planners require explicit goal configurations specified as numerical joint angles, while many practical applications provide goal specifications through visual observations such as images or demonstration videos where precise goal configurations are unavailable. In this paper, we propose visual-RRT (vRRT), a motion planner that enables visual-goal planning by unifying gradient-based exploitation from differentiable robot rendering with sampling-based exploration from RRTs. We further introduce (i) a frontier-based exploration-exploitation strategy that adaptively prioritizes visually promising search regions, and (ii) inertial gradient tree expansion that inherits optimization states across tree branches for momentum-consistent gradient exploitation. Extensive experiments across various robot manipulators including Franka, UR5e, and Fetch demonstrate that vRRT achieves effective visual-goal planning in both simulated and real-world settings, bridging the gap between sampling-based planning and vision-centric robot applications. Our code is available at https://sgvr.kaist.ac.kr/Visual-RRT.
Chinese Translation
快速探索随机树(RRT)因其鲁棒性和理论保证而被广泛应用于机器人运动规划。然而,现有的基于RRT的规划器需要明确的目标配置,这些配置以数值关节角度的形式指定,而许多实际应用则通过视觉观察(如图像或演示视频)提供目标规格,此时精确的目标配置并不可用。在本文中,我们提出了视觉-RRT(vRRT),这是一种运动规划器,通过将基于梯度的可微机器人渲染与基于采样的RRT探索相结合,实现了视觉目标规划。我们进一步引入了(i)一种基于前沿的探索-利用策略,该策略自适应地优先考虑视觉上有前景的搜索区域,以及(ii)惯性梯度树扩展,该方法在树的分支间继承优化状态,以实现动量一致的梯度利用。在包括Franka、UR5e和Fetch在内的多种机器人操纵器上的大量实验表明,vRRT在模拟和现实环境中均能有效实现视觉目标规划,弥合了基于采样的规划与以视觉为中心的机器人应用之间的差距。我们的代码可在 https://sgvr.kaist.ac.kr/Visual-RRT 获取。
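The combination of sampling-based exploration and gradient-based exploitation described in the abstract above can be illustrated with a toy sketch. This is not the vRRT algorithm: the 2D configuration space, the quadratic `cost` (a hypothetical stand-in for a rendering-based image loss), the `exploit_prob` switch, and all function names are assumptions made for illustration, and collision checking is omitted.

```python
import math
import random


def cost(q):
    # Hypothetical stand-in for an image-space loss: squared distance
    # to a goal configuration that the planner is never given directly.
    return (q[0] - 2.0) ** 2 + (q[1] + 1.0) ** 2


def cost_gradient(q):
    # Analytic gradient of the toy cost above.
    return (2.0 * (q[0] - 2.0), 2.0 * (q[1] + 1.0))


def steer(src, dst, step):
    # Move from src toward dst by at most `step`.
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    dist = math.hypot(dx, dy)
    if dist <= step:
        return dst
    return (src[0] + step * dx / dist, src[1] + step * dy / dist)


def grow_tree(start, iterations, exploit_prob, step=0.2, lr=0.1, seed=0):
    rng = random.Random(seed)
    nodes = [start]
    parents = {0: None}
    for _ in range(iterations):
        if rng.random() < exploit_prob:
            # Exploitation: gradient step from the lowest-cost node.
            parent = min(range(len(nodes)), key=lambda i: cost(nodes[i]))
            g = cost_gradient(nodes[parent])
            new = (nodes[parent][0] - lr * g[0], nodes[parent][1] - lr * g[1])
        else:
            # Exploration: classic RRT extension toward a random sample.
            sample = (rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0))
            parent = min(range(len(nodes)),
                         key=lambda i: math.dist(nodes[i], sample))
            new = steer(nodes[parent], sample, step)
        parents[len(nodes)] = parent
        nodes.append(new)
    return nodes, parents


nodes, parents = grow_tree((0.0, 0.0), iterations=200, exploit_prob=0.3)
best = min(nodes, key=cost)
print(len(nodes), round(cost(best), 4))
```

Setting `exploit_prob` to 0.0 gives a plain RRT, and 1.0 gives pure gradient descent along a single branch. The frontier-based strategy in the abstract chooses between the two adaptively, which this fixed probability does not attempt.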
cs.RO / 5 / 2604.16391
Disentangled Robot Learning via Separate Forward and Inverse Dynamics Pretraining
通过独立的前向和逆向动力学预训练实现解耦机器人学习
Abstract
Vision-language-action (VLA) models have shown great potential in building generalist robots, but still face a dilemma: the misalignment of 2D image forecasting and 3D action prediction. Besides, such a vision-action entangled training manner limits model learning from large-scale, action-free web video data. To address these issues, we propose DeFI, a novel framework that Decouples visual Forward and Inverse dynamics pretraining to exploit respective data sources, wherein video generation and action prediction are disentangled. We introduce the General Forward Dynamics Model (GFDM), pretrained on diverse human and robot videos for future prediction, and the General Inverse Dynamics Model (GIDM), trained via self-supervised learning to infer latent actions from unlabeled video transitions. These models are then integrated into a unified architecture for end-to-end finetuning on downstream tasks. In this manner, GFDM and GIDM first shine separately and then cooperate for mutual benefit. Extensive experiments on CALVIN ABC-D and SimplerEnv demonstrate state-of-the-art performance, with DeFI achieving an average task length of 4.51 for CALVIN, 51.2% success rate on SimplerEnv-Fractal benchmark and 81.3% success rate in real-world deployment, significantly outperforming prior methods.
Chinese Translation
视觉-语言-动作(VLA)模型在构建通用机器人方面展现了巨大潜力,但仍面临二维图像预测与三维动作预测之间的不对齐困境。此外,这种视觉-动作纠缠的训练方式限制了模型从大规模、无动作的网络视频数据中学习。为了解决这些问题,我们提出了DeFI,一个新颖的框架,通过解耦视觉前向和逆向动力学预训练来利用各自的数据源,其中视频生成和动作预测被解耦。我们引入了通用前向动力学模型(GFDM),该模型在多样的人类和机器人视频上进行预训练以进行未来预测,以及通用逆向动力学模型(GIDM),该模型通过自监督学习从未标记的视频过渡中推断潜在动作。这些模型随后被集成到一个统一的架构中,以便在下游任务上进行端到端的微调。通过这种方式,GFDM和GIDM首先各自发挥作用,然后协作以实现互利。在CALVIN ABC-D和SimplerEnv上的大量实验表明,DeFI实现了最先进的性能,其中CALVIN的平均任务长度为4.51,SimplerEnv-Fractal基准的成功率为51.2%,在实际部署中的成功率为81.3%,显著优于先前的方法。
cs.RO / 6 / 2604.16405
ICAT: Incident-Case-Grounded Adaptive Testing for Physical-Risk Prediction in Embodied World Models
ICAT:基于事件案例的适应性测试用于具身世界模型中的物理风险预测
Abstract
Video-generative world models are increasingly used as neural simulators for embodied planning and policy learning, yet their ability to predict physical risk and severe consequences is rarely evaluated. We find that these models often downplay or omit key danger cues and severe outcomes for hazardous actions, which can induce unsafe preferences during planning and training on imagined rollouts. We propose ICAT, which grounds testing in real incident reports and safety manuals by building structured risk memories and retrieving/composing them to constrain the generation of risk cases with causal chains and severity labels. Experiments on an ICAT-based benchmark show that mainstream world models frequently miss mechanisms and triggering conditions and miscalibrate severity, falling short of the reliability required for safety-critical embodied deployment.
Chinese Translation
视频生成的世界模型越来越多地被用作具身规划和策略学习的神经模拟器,但它们预测物理风险和严重后果的能力却很少被评估。我们发现这些模型常常淡化或忽视危险行为的关键危险线索和严重后果,这可能在规划和想象的回滚训练中导致不安全的偏好。我们提出了ICAT,通过构建结构化风险记忆并检索/组合这些记忆,以真实事件报告和安全手册为基础,来约束风险案例的生成,包括因果链和严重性标签。基于ICAT的基准实验表明,主流世界模型常常错过机制和触发条件,并且严重性校准不准确,未能达到安全关键的具身部署所需的可靠性。
cs.RO / 7 / 2604.16408
An Edge-Host-Cloud Architecture for Robot-Agnostic, Caregiver-in-the-Loop Personalized Cognitive Exercise: Multi-Site Deployment in Dementia Care
一种边缘-主机-云架构用于机器人无关、护理人员参与的个性化认知训练:在痴呆护理中的多地点部署
Abstract
We present Speaking Memories, a distributed, stakeholder-in-the-loop robotic interaction platform for personalized cognitive exercise support. Rather than a single robot-centric system, Speaking Memories is designed as a generalizable robotics architecture that integrates caregiver-authored knowledge, local edge intelligence, and embodied robotic agents into a unified socio-technical loop. The platform fuses auditory, visual, and textual signals to enable emotion-aware, personalized dialogue, while decoupling multimodal perception and reasoning from robot-specific hardware through a local edge interaction server. This design achieves low-latency, privacy-preserving operation and supports scalable deployment across heterogeneous robotic embodiments. Caregivers and family members contribute structured biographical knowledge via a secure cloud portal, which conditions downstream dialogue policies and enables longitudinal personalization across interaction sessions. Beyond real-time interaction, the system incorporates an automated multimodal evaluation layer that continuously analyzes user responses, affective cues, and engagement patterns, producing structured interaction metrics at scale. These metrics support systematic assessment of interaction quality, enable data-driven model fine-tuning, and lay the foundation for future clinician- and caregiver-informed personalization and intervention planning. We evaluate the platform through real-world deployments, measuring end-to-end latency, dialogue coherence, interaction stability, and stakeholder-reported usability and engagement. Results demonstrate sub-6-second response latency, robust multimodal synchronization, and consistently positive feedback from both participants and caregivers. Furthermore, subsets of the dataset can be shared upon request, subject to participant consent and IRB constraints.
Chinese Translation
我们提出了Speaking Memories,这是一个分布式的利益相关者参与的机器人交互平台,用于个性化的认知训练支持。与单一的机器人中心系统不同,Speaking Memories被设计为一种可推广的机器人架构,将护理人员撰写的知识、本地边缘智能和具身机器人代理整合到一个统一的社会技术循环中。该平台融合了听觉、视觉和文本信号,以实现情感感知的个性化对话,同时通过本地边缘交互服务器将多模态感知和推理与特定机器人硬件解耦。这一设计实现了低延迟、保护隐私的操作,并支持在异构机器人实现中的可扩展部署。护理人员和家庭成员通过安全的云门户贡献结构化的传记知识,这为下游对话策略提供了条件,并使跨交互会话的长期个性化成为可能。除了实时交互外,该系统还包含一个自动化的多模态评估层,持续分析用户反应、情感线索和参与模式,生成结构化的交互指标。这些指标支持系统化的交互质量评估,促进数据驱动的模型微调,并为未来临床医生和护理人员知情的个性化和干预规划奠定基础。我们通过实际部署对该平台进行评估,测量端到端延迟、对话连贯性、交互稳定性以及利益相关者报告的可用性和参与度。结果表明,响应延迟低于6秒,强大的多模态同步,以及参与者和护理人员的一致积极反馈。此外,数据集的子集可以在请求时共享,但需遵循参与者同意和伦理审查委员会的限制。
cs.RO / 8 / 2604.16440
LatentMimic: Terrain-Adaptive Locomotion via Latent Space Imitation
LatentMimic:通过潜在空间模仿实现地形自适应运动
Abstract
Developing natural and diverse locomotion controllers for quadruped robots that can adapt to complex terrains while preserving motion style remains a significant challenge. Existing imitation-based methods face a fundamental optimization trade-off: strict adherence to motion capture (mocap) references penalizes the geometric deviations required for terrain adaptability, whereas terrain-centric policies often compromise stylistic fidelity. We introduce LatentMimic, a novel locomotion learning framework that decouples stylistic fidelity from geometric constraints. By minimizing the marginal latent divergence between the policy's state-action distribution and a learned mocap prior, our approach provides a conditional relaxation of rigid pose-tracking objectives. This formulation preserves gait topology while permitting independent end-effector adaptations for irregular terrains. We further introduce a terrain adaptation module with a dynamic replay buffer to resolve the policy's distribution shifts across different terrains. We validate our method across four locomotion styles and four terrains, demonstrating that LatentMimic enables effective terrain-adaptive locomotion, achieving higher terrain traversal success rates than state-of-the-art motion-tracking methods while maintaining high stylistic fidelity.
Chinese Translation
为四足机器人开发自然且多样的运动控制器,使其能够适应复杂地形同时保持运动风格,仍然是一个重大挑战。现有的基于模仿的方法面临一个基本的优化权衡:严格遵循运动捕捉(mocap)参考会惩罚为适应地形所需的几何偏差,而以地形为中心的策略往往会妥协风格的保真度。我们提出了LatentMimic,这是一种新颖的运动学习框架,能够将风格保真度与几何约束解耦。通过最小化策略的状态-动作分布与学习到的mocap先验之间的边际潜在散度,我们的方法提供了一种对刚性姿态跟踪目标的条件放松。这种公式保持了步态拓扑,同时允许对不规则地形进行独立的末端效应器适应。我们进一步引入了一个地形适应模块,配备动态重放缓冲区,以解决策略在不同地形上的分布变化。我们在四种运动风格和四种地形上验证了我们的方法,证明LatentMimic能够实现有效的地形自适应运动,其地形穿越成功率高于最先进的运动跟踪方法,同时保持高风格保真度。
cs.RO / 9 / 2604.16452
Compiling OpenSCENARIO 2.1 for Scenario-Based Testing in CARLA
为CARLA中的基于场景的测试编译OpenSCENARIO 2.1
Abstract
While the ASAM OpenSCENARIO 2.1 Domain-Specific Language (DSL) enables declarative, intent-driven authoring for Scenario-Based Testing (SBT), its integration into open-source simulators like CARLA remains limited by legacy parsers. We propose a multi-pass modern compiler architecture that translates the OpenSCENARIO 2.1 DSL directly into executable CARLA behaviors. The pipeline features an ANTLR4 frontend for Abstract Syntax Tree (AST) generation, a semantic middle-end, and a runtime backend that synthesizes deterministic py_trees behavior trees. Mapping the standardized domain ontology directly to CARLA's procedural API via a custom method registry eliminates the need for external logic solvers. A demonstrative multi-actor cut-in and evasive maneuver, selected from a wider suite of validated scenarios, confirms the compiler's ability to process concurrent actions, dynamic mathematical expressions, and asynchronous signaling. This framework establishes a functional baseline for reproducible, large-scale SBT, paving the way for future C++ optimizations to mitigate current Python-based computational overhead.
Chinese Translation
虽然ASAM OpenSCENARIO 2.1领域特定语言(DSL)支持基于场景的测试(SBT)的声明式、意图驱动的创作,但其在开源模拟器如CARLA中的集成仍受到传统解析器的限制。我们提出了一种多遍现代编译器架构,将OpenSCENARIO 2.1 DSL直接转换为可执行的CARLA行为。该管道具有用于抽象语法树(AST)生成的ANTLR4前端、语义中端,以及合成确定性py_trees行为树的运行时后端。通过自定义方法注册表将标准化的领域本体直接映射到CARLA的过程API,消除了对外部逻辑求解器的需求。一个从更广泛的验证场景中选出的多参与者切入和规避机动的演示,确认了编译器处理并发动作、动态数学表达式和异步信号的能力。该框架为可重复的大规模SBT建立了功能基线,为未来的C++优化铺平了道路,以减轻当前基于Python的计算开销。
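The "custom method registry" mentioned in the abstract above maps names from the scenario language to backend calls. A minimal sketch of that pattern follows. Every name here (`REGISTRY`, `register`, `lower`, the action names, and the string commands) is hypothetical; the sketch does not use the CARLA API or py_trees, and it does not reproduce the compiler described in the paper.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: DSL action name -> handler that emits a backend command.
REGISTRY: Dict[str, Callable[..., str]] = {}


def register(name: str):
    """Decorator that records a handler under a DSL action name."""
    def decorator(fn: Callable[..., str]) -> Callable[..., str]:
        REGISTRY[name] = fn
        return fn
    return decorator


@register("change_lane")
def change_lane(actor: str, direction: str) -> str:
    # A real backend would build a behavior-tree node here; this sketch
    # returns a descriptive string instead.
    return f"{actor}: change lane {direction}"


@register("change_speed")
def change_speed(actor: str, target_mps: float) -> str:
    return f"{actor}: set speed {target_mps:.1f} m/s"


def lower(calls: List[Tuple[str, dict]]) -> List[str]:
    """Translate (action name, arguments) pairs into backend commands."""
    plan = []
    for name, kwargs in calls:
        if name not in REGISTRY:
            raise KeyError(f"no handler registered for action '{name}'")
        plan.append(REGISTRY[name](**kwargs))
    return plan


plan = lower([
    ("change_speed", {"actor": "npc", "target_mps": 12.0}),
    ("change_lane", {"actor": "npc", "direction": "left"}),
])
print(plan)
```

Because dispatch is a dictionary lookup, supporting a new action means registering one more handler, with no change to the lowering loop.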
cs.RO / 10 / 2604.16509
Learning-Based Sparsification of Dynamic Graphs in Robotic Exploration Algorithms
基于学习的动态图稀疏化在机器人探索算法中的应用
Abstract
Many robotic exploration algorithms rely on graph structures for frontier-based exploration and dynamic path planning. However, these graphs grow rapidly, accumulating redundant information and impacting performance. We present a transformer-based framework trained with Proximal Policy Optimization (PPO) to prune these graphs during exploration, limiting their growth and reducing the accumulation of excess information. The framework was evaluated on simulations of a robotic agent using Rapidly Exploring Random Trees (RRT) to carry out frontier-based exploration, where the learned policy reduces graph size by up to 96%. We find preliminary evidence that our framework learns to associate pruning decisions with exploration outcomes despite sparse, delayed reward signals. We also observe that while intelligent pruning achieves a lower rate of exploration compared to baselines, it yields the lowest standard deviation, producing the most consistent exploration across varied environments. To the best of our knowledge, these results are the first suggesting the viability of RL in sparsification of dynamic graphs used in robotic exploration algorithms.
Chinese Translation
许多机器人探索算法依赖于图结构进行基于边界的探索和动态路径规划。然而,这些图的规模迅速增长,积累冗余信息并影响性能。我们提出了一种基于变换器的框架,该框架使用近端策略优化(Proximal Policy Optimization, PPO)进行训练,以在探索过程中修剪这些图,从而限制其增长并减少多余信息的积累。该框架在使用快速探索随机树(Rapidly Exploring Random Trees, RRT)进行基于边界的探索的机器人代理模拟中进行了评估,结果表明,学习到的策略使图的大小减少了多达96%。我们发现初步证据表明,尽管奖励信号稀疏且延迟,我们的框架仍能学习将修剪决策与探索结果关联起来。我们还观察到,尽管智能修剪的探索速率低于基线,但其标准差最低,在不同环境中产生了最一致的探索结果。根据我们所知,这些结果首次表明了强化学习(Reinforcement Learning, RL)在机器人探索算法中动态图稀疏化的可行性。
cs.RO / 11 / 2604.16518
On-Orbit Space AI: Federated, Multi-Agent, and Collaborative Algorithms for Satellite Constellations
在轨空间人工智能:卫星星座的联邦、多智能体和协作算法
Abstract
Satellite constellations are transforming space systems from isolated spacecraft into networked, software-defined platforms capable of on-orbit perception, decision making, and adaptation. Yet much of the existing AI research remains centered on single-satellite inference, while constellation-scale autonomy introduces fundamentally new algorithmic requirements: learning and coordination under dynamic inter-satellite connectivity, strict SWaP-C limits, radiation-induced faults, non-IID data, concept drift, and safety-critical operational constraints. This survey consolidates the emerging field of on-orbit space AI through three complementary paradigms: (i) federated learning for cross-satellite training, personalization, and secure aggregation; (ii) multi-agent algorithms for cooperative planning, resource allocation, scheduling, formation control, and collision avoidance; and (iii) collaborative sensing and distributed inference for multi-satellite fusion, tracking, split/early-exit inference, and cross-layer co-design with constellation networking. We provide a system-level view and a taxonomy that unifies collaboration architectures, temporal mechanisms, and trust models. To support community development and keep this review actionable over time, we continuously curate relevant papers and resources at https://github.com/ziyangwang007/AI4Space.
Chinese Translation
卫星星座正在将空间系统从孤立的航天器转变为网络化、软件定义的平台,能够实现在轨感知、决策和适应。然而,现有的人工智能研究大多集中于单卫星推理,而星座规模的自主性引入了根本新的算法要求:在动态的星间连接下进行学习和协调,严格的尺寸、重量和功耗(SWaP-C)限制、辐射引起的故障、非独立同分布(non-IID)数据、概念漂移以及安全关键的操作约束。本综述通过三种互补的范式整合了新兴的在轨空间人工智能领域:(i)用于跨卫星训练、个性化和安全聚合的联邦学习(federated learning);(ii)用于协作规划、资源分配、调度、编队控制和避碰的多智能体算法(multi-agent algorithms);(iii)用于多卫星融合、跟踪、分裂/提前退出推理以及与星座网络的跨层协同设计的协作感知和分布式推理(collaborative sensing and distributed inference)。我们提供了一个系统级视角和一个统一协作架构、时间机制和信任模型的分类法。为了支持社区发展并保持该综述的可操作性,我们持续在 https://github.com/ziyangwang007/AI4Space 上整理相关论文和资源。
cs.RO / 12 / 2604.16592
Human Cognition in Machines: A Unified Perspective of World Models
机器中的人类认知:世界模型的统一视角
Rupprecht, Timothy, Zhao, Pu, Taherin, Amir, Akbari, Arash, Akbari, Arman, He, Yumei, Duffy, Sean, Lin, Juyi, Chen, Yixiao, Chowdhury, Rahul, Nan, Enfu, Shen, Yixin, Cao, Yifan, Zeng, Haochen, Chen, Weiwei, Yuan, Geng, Dy, Jennifer, Ostadabbas, Sarah, Zhang, Silvia, Kaeli, David, Yeh, Edmund, Wang, Yanzhi
Abstract
This comprehensive report distinguishes prior works by the cognitive functions they innovate. Many works claim an almost "human-like" cognitive capability in their world models. Evaluating these claims requires a proper grounding in the first principles of Cognitive Architecture Theory (CAT). We present a conceptual unified framework for world models that fully incorporates all the cognitive functions associated with CAT (i.e. memory, perception, language, reasoning, imagining, motivation, and meta-cognition) and identify gaps in the research as a guide for future states of the art. In particular, we find that motivation (especially intrinsic motivation) and meta-cognition remain drastically under-researched, and we propose concrete directions informed by active inference and global workspace theory to address them. We further introduce Epistemic World Models, a new category encompassing agent frameworks for scientific discovery that operate over structured knowledge. Our taxonomy, applied across video, embodied, and epistemic world models, suggests research directions where prior taxonomies have not.
Chinese Translation
本综合报告通过创新的认知功能区分了先前的研究工作。许多研究声称其世界模型具备几乎“类人”的认知能力。评估这些主张需要在认知架构理论(Cognitive Architecture Theory, CAT)的第一原则上进行适当的基础。我们提出了一个概念上的统一框架,全面整合与CAT相关的所有认知功能(即记忆、感知、语言、推理、想象、动机和元认知),并识别出研究中的空白,以指导未来的前沿研究。特别是,我们发现动机(尤其是内在动机)和元认知仍然严重缺乏研究,我们提出了基于主动推理(active inference)和全球工作空间理论(global workspace theory)的具体研究方向来解决这些问题。我们进一步引入了认识论世界模型(Epistemic World Models),这是一个新类别,涵盖了在结构化知识上运作的科学发现代理框架。我们的分类法适用于视频、具身和认识论世界模型,提出了先前分类法未涉及的研究方向。
cs.RO / 13 / 2604.16667
Emergency Stopping for Liquid-manipulating Robots
液体操控机器人紧急停止
Abstract
Manipulating open liquid containers is challenging because liquids are highly sensitive to vessel accelerations and jerks. Although spill-free liquid manipulation has been widely studied, emergency stopping under unexpected hazards has received little attention, despite the fact that abrupt braking may cause hazardous spills. This letter presents an emergency stop system for robots manipulating liquids in open containers. We formulate emergency stopping as an optimal control problem and solve it in a model predictive control framework to generate time-optimal, spill-free stopping trajectories. The method operates as a plug-and-play safety layer on top of existing slosh-free motion planning methods, enabling immediate reaction to detected hazards while accounting for nonlinear liquid dynamics. We demonstrate, through simulation and on a 7-DoF Franka Emika Panda robot, that the proposed approach achieves fast emergency stopping without spilling.
Chinese Translation
操控开放液体容器具有挑战性,因为液体对容器的加速度和抖动非常敏感。尽管无溢出液体操控已被广泛研究,但在意外危险下的紧急停止却鲜有关注,尽管突然制动可能导致危险的溢出。本文提出了一种用于操控开放容器中液体的机器人的紧急停止系统。我们将紧急停止形式化为一个最优控制问题,并在模型预测控制框架中求解,以生成时间最优且无溢出的停止轨迹。该方法作为现有无波动运动规划方法之上的即插即用安全层运行,能够对检测到的危险做出即时反应,同时考虑非线性液体动力学。我们通过仿真和在7自由度的Franka Emika Panda机器人上的实验展示了所提方法能够实现快速的紧急停止而不发生溢出。
cs.RO / 14 / 2604.16670
Diffusion-Based Optimization for Accelerated Convergence of Redundant Dual-Arm Minimum Time Problems
基于扩散的优化方法用于加速冗余双臂最小时间问题的收敛
Abstract
We present a framework leveraging a novel variant of the model-based diffusion algorithm to minimize the time required for a redundant dual-arm robot configuration to follow a desired relative Cartesian path. Our prior work proposed a bi-level optimization approach for the dual-arm problem, where we derived the analytical solution to the lower-level convex sub-problem and solved the high-level nonconvex problem using a primal-dual approach. However, the gradient-based nature leads to a large computation overhead, and it prohibits directly imposing an $L_{\infty}$ Cartesian error constraint along the joint trajectory due to the sparsity of the gradient. In this work, we propose a diffusion-based framework that relies on probabilistic sampling to tackle the aforementioned challenges in the nonconvex high-level problem, leading to a 35x reduction in the runtime and 34% less Cartesian error compared to our prior work.
Chinese Translation
我们提出了一个框架,利用一种新型的基于模型的扩散算法变体,来最小化冗余双臂机器人配置沿预期相对笛卡尔路径所需的时间。我们之前的工作提出了一种双层优化方法,其中我们推导了下层凸子问题的解析解,并使用原始-对偶方法解决高层非凸问题。然而,基于梯度的方法导致了较大的计算开销,并且由于梯度的稀疏性,无法直接在关节轨迹上施加 $L_{\infty}$ 笛卡尔误差约束。在本研究中,我们提出了一种基于扩散的框架,依赖于概率采样来应对上述非凸高层问题中的挑战,与我们之前的工作相比,运行时间减少了35倍,笛卡尔误差减少了34%。
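The abstract above attributes the difficulty of an $L_{\infty}$ constraint to gradient sparsity. A small sketch shows what that means: the max-norm of a sampled error sequence depends only on its single worst sample, so a subgradient is zero everywhere else. The error values and function names are illustrative assumptions, not data or code from the paper.

```python
def linf_error(errors):
    """Max-norm of a sequence of per-sample Cartesian errors."""
    return max(abs(e) for e in errors)


def linf_subgradient(errors):
    """One subgradient of the max-norm with respect to each sample.

    Only the worst sample gets a nonzero entry, so a gradient step
    carries no information about the rest of the trajectory.
    """
    worst = max(range(len(errors)), key=lambda i: abs(errors[i]))
    grad = [0.0] * len(errors)
    grad[worst] = 1.0 if errors[worst] >= 0 else -1.0
    return grad


# Hypothetical signed tracking errors (metres) at four trajectory samples.
errors = [0.01, -0.04, 0.02, 0.03]
print(linf_error(errors))        # 0.04
print(linf_subgradient(errors))  # [0.0, -1.0, 0.0, 0.0]
```

A sampling-based method can instead evaluate `linf_error` directly on each candidate trajectory and needs no gradient of it, which is consistent with the motivation given in the abstract.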
cs.RO / 15 / 2604.16677
ReconVLA: An Uncertainty-Guided and Failure-Aware Vision-Language-Action Framework for Robotic Control
ReconVLA:一种不确定性引导和故障感知的视觉-语言-动作框架用于机器人控制
Abstract
Vision-language-action (VLA) models have emerged as generalist robotic controllers capable of mapping visual observations and natural language instructions to continuous action sequences. However, VLAs provide no calibrated measure of confidence in their action predictions, thus limiting their reliability in real-world settings where uncertainty and failures must be anticipated. To address this problem we introduce ReconVLA, a reliable conformal model that produces uncertainty-guided and failure-aware control signals. Concretely, our approach applies conformal prediction directly to the action token outputs of pretrained VLA policies, yielding calibrated uncertainty estimates that correlate with execution quality and task success. Furthermore, we extend conformal prediction to the robot state space to detect outliers or unsafe states before failures occur, providing a simple yet effective failure detection mechanism that complements the action-level uncertainty. We evaluate ReconVLA in both simulation and real robot experiments across diverse manipulation tasks. Our results show that conformalized action predictions consistently improve failure anticipation, reduce catastrophic errors, and provide a calibrated measure of confidence without retraining or modifying the underlying VLA.
Chinese Translation
视觉-语言-动作(VLA)模型作为通用机器人控制器,能够将视觉观察和自然语言指令映射为连续的动作序列。然而,VLA并未提供其动作预测的校准置信度度量,从而限制了其在需要预见不确定性和故障的现实环境中的可靠性。为了解决这一问题,我们提出了ReconVLA,这是一种可靠的符合模型,能够生成不确定性引导和故障感知的控制信号。具体而言,我们的方法将符合预测直接应用于预训练VLA策略的动作令牌输出,产生与执行质量和任务成功相关的校准不确定性估计。此外,我们将符合预测扩展到机器人状态空间,以在故障发生之前检测异常或不安全状态,提供了一种简单而有效的故障检测机制,补充了动作级别的不确定性。我们在多种操作任务的仿真和真实机器人实验中评估了ReconVLA。结果表明,符合化的动作预测持续改善了故障预见,减少了灾难性错误,并在不重新训练或修改基础VLA的情况下提供了校准的置信度度量。
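The abstract above builds on conformal prediction. The standard split-conformal calibration step can be sketched as follows. The nonconformity scores are made-up numbers and the function names are hypothetical; how ReconVLA actually derives scores from action tokens is not specified in the abstract and is not reproduced here.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile of held-out nonconformity scores.

    With n exchangeable calibration scores, a new score exceeds the
    returned threshold with probability at most alpha.
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_scores)[rank - 1]


def is_flagged(score, threshold):
    """Flag a prediction whose nonconformity exceeds the threshold."""
    return score > threshold


# Hypothetical nonconformity scores from nine successful calibration rollouts.
scores = [0.10, 0.25, 0.05, 0.40, 0.30, 0.15, 0.20, 0.35, 0.45]
threshold = conformal_threshold(scores, alpha=0.2)
print(threshold)                    # 0.4
print(is_flagged(0.42, threshold))  # True
print(is_flagged(0.12, threshold))  # False
```

The guarantee relies on the calibration and test scores being exchangeable, an assumption that can fail under distribution shift at deployment.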
cs.RO / 16 / 2604.16683
Rewind-IL: Online Failure Detection and State Respawning for Imitation Learning
Rewind-IL:模仿学习的在线故障检测与状态重生
Abstract
Imitation learning has enabled robots to acquire complex visuomotor manipulation skills from demonstrations, but deployment failures remain a major obstacle, especially for long-horizon action-chunked policies. Once execution drifts off the demonstration manifold, these policies often continue producing locally plausible actions without recovering from the failure. Existing runtime monitors either require failure data, over-trigger under benign feature drift, or stop at failure detection without providing a recovery mechanism. We present Rewind-IL, a training-free online safeguard framework for generative action-chunked imitation policies. Rewind-IL combines a zero-shot failure detector based on Temporal Inter-chunk Discrepancy Estimate (TIDE), calibrated with split conformal prediction, with a state-respawning mechanism that returns the robot to a semantically verified safe intermediate state. Offline, a vision-language model identifies recovery checkpoints in demonstrations, and the frozen policy encoder is used to construct a compact checkpoint feature database. Online, Rewind-IL monitors self-consistency in overlapping action chunks, tracks similarity to the checkpoint library, and, upon failure, rewinds execution to the latest verified safe state before restarting inference from a clean policy state. Experiments on real-world and simulated long-horizon manipulation tasks, including transfer to flow-matching action-chunked policies, demonstrate that policy-internal consistency coupled with semantically grounded respawning offers a practical route to improved reliability in imitation learning. Supplemental materials are available at https://sjay05.github.io/rewind-il
Chinese Translation
模仿学习使机器人能够从示范中获得复杂的视觉运动操控技能,但部署故障仍然是一个主要障碍,尤其对于长时间跨度的动作分块策略。一旦执行偏离示范流形,这些策略往往会继续产生局部合理的动作,而无法从故障中恢复。现有的运行时监测器要么需要故障数据,要么在良性特征漂移下过度触发,或者仅停留在故障检测阶段而未提供恢复机制。我们提出了Rewind-IL,这是一种针对生成性动作分块模仿策略的无训练在线保护框架。Rewind-IL结合了一种基于时间块间差异估计(Temporal Inter-chunk Discrepancy Estimate, TIDE)的零-shot故障检测器,并通过分裂保形预测进行校准,配合一种状态重生机制,将机器人返回到一个语义上经过验证的安全中间状态。在离线阶段,视觉-语言模型识别示范中的恢复检查点,冻结的策略编码器用于构建紧凑的检查点特征数据库。在在线阶段,Rewind-IL监测重叠动作块中的自一致性,跟踪与检查点库的相似性,并在发生故障时,将执行回滚到最新的经过验证的安全状态,然后从干净的策略状态重新启动推理。在真实世界和模拟的长时间跨度操控任务上的实验,包括转移到流匹配动作分块策略,证明了策略内部一致性与语义基础的重生相结合,为提高模仿学习的可靠性提供了一条切实可行的途径。补充材料可在 https://sjay05.github.io/rewind-il 获取。
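The abstract above describes monitoring self-consistency in overlapping action chunks. A generic version of such a score can be sketched as follows. The abstract does not give the definition of TIDE, so the mean-squared gap used here, the `stride` parameter, and the function name are assumptions for illustration only.

```python
def overlap_discrepancy(prev_chunk, next_chunk, stride):
    """Mean squared gap between two action chunks on shared timesteps.

    prev_chunk and next_chunk are lists of action vectors predicted
    `stride` steps apart, so prev_chunk[stride:] and the start of
    next_chunk describe the same future timesteps.
    """
    overlap_prev = prev_chunk[stride:]
    overlap_next = next_chunk[:len(overlap_prev)]
    if not overlap_prev or len(overlap_next) != len(overlap_prev):
        raise ValueError("chunks do not overlap for this stride")
    total, count = 0.0, 0
    for a, b in zip(overlap_prev, overlap_next):
        for x, y in zip(a, b):
            total += (x - y) ** 2
            count += 1
    return total / count


# Two hypothetical 4-step chunks of 2-D actions, predicted 2 steps apart.
prev_chunk = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
consistent = [[2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 5.0]]
drifted = [[2.0, 3.0], [3.0, 3.0], [4.0, 4.0], [5.0, 5.0]]

print(overlap_discrepancy(prev_chunk, consistent, stride=2))  # 0.0
print(overlap_discrepancy(prev_chunk, drifted, stride=2))     # 0.25
```

A score like this can be compared against a threshold calibrated on successful rollouts, for example with split conformal prediction as the abstract mentions.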
cs.RO / 17 / 2604.16702
Autonomous Vehicle Collision Avoidance With Racing Parameterized Deep Reinforcement Learning
基于赛车参数化深度强化学习的自主车辆碰撞避免
Abstract
Road traffic accidents are a leading cause of fatalities worldwide. In the US, human error causes 94% of crashes, resulting in excess of 7,000 pedestrian fatalities and $500 billion in costs annually. Autonomous Vehicles (AVs) equipped with emergency collision avoidance systems that operate at the limits of vehicle dynamics at high frequency, which imposes a dual constraint of nonlinear kinodynamic accuracy and computational efficiency, can further enhance safety during adverse weather and cybersecurity breaches and can evade dangerous human driving when AVs and human drivers share roads. This paper parameterizes a Deep Reinforcement Learning (DRL) collision avoidance policy Out-Of-Distribution (OOD) utilizing race car overtaking, without explicit geometric mimicry reference trajectory guidance, in simulation, with a physics-informed, simulator exploit-aware reward to encode nonlinear vehicle kinodynamics. Two policies are evaluated, a default uni-direction and a reversed heading variant that navigates in the opposite direction to other cars, which both consistently outperform a Model Predictive Control and Artificial Potential Function (MPC-APF) baseline, with zero-shot transfer to proportionally scaled hardware, across three intersection collision scenarios, at 31x fewer Floating Point Operations (FLOPS) and 64x lower inference latency. The reversed heading policy outperforms the default racing overtaking policy in head-to-head collisions by 30% and the baseline by 50%, and matches the former in side collisions, where both DRL policies evade 10% greater than numerical optimal control.
Chinese Translation
道路交通事故是全球致命事故的主要原因。在美国,94%的交通事故由人为错误引起,导致每年超过7,000名行人遇难和5000亿美元的经济损失。配备紧急碰撞避免系统的自主车辆(AVs)在高频率下以车辆动态的极限运行,面临非线性运动动力学精度和计算效率的双重约束,进一步增强了在恶劣天气和网络安全漏洞情况下的安全性,并在自主车辆与人类驾驶者共享道路时避免危险的人类驾驶行为。本文参数化了一种基于深度强化学习(DRL)的碰撞避免策略,利用赛车超车的方式进行分布外(OOD)训练,而不依赖于明确的几何模仿参考轨迹指导,采用物理信息驱动的、模拟器利用感知奖励来编码非线性车辆运动动力学。评估了两种策略:一种是默认的单向策略,另一种是反向行驶变体,后者在与其他车辆相反的方向行驶,两者在三个交叉口碰撞场景中均持续优于模型预测控制和人工势能函数(MPC-APF)基线,且在比例缩放硬件上的零样本迁移表现出31倍更少的浮点运算(FLOPS)和64倍更低的推理延迟。反向行驶策略在正面碰撞中比默认赛车超车策略提高了30%,比基线提高了50%,在侧面碰撞中与前者表现相当,而两种DRL策略的规避效果比数值最优控制高出10%。
cs.RO / 18 / 2604.16741
LiDAR-based Crowd Navigation with Visible Edge Group Representation
基于LiDAR的可见边缘群体表示的拥挤导航
Abstract
Robot navigation in crowded pedestrian environments is a well-known challenge and we explore the practical deployment of group-based representations in this setting. Pedestrian groups have been empirically shown to enable a mobile robot's navigation behavior to be safer and more social. However, existing approaches either explored groups only in limited scenarios with no high-density crowds or depended on external detection modules to track individuals, which are prone to noise and errors due to occlusions in crowds. We show that group prediction accuracy affects navigation performance only marginally in crowded environments. Based on this observation, we propose the visible edge-based group representation. We additionally demonstrate via simulation experiments that our navigation framework, integrated with the simplified group representation, performs comparatively in terms of safety and socialness in dense crowds, while achieving faster computation speed. Finally, we deploy our navigation framework on a real robot to explore the benefits of practically deploying group-based representations in the real world.
Chinese Translation
在拥挤的行人环境中,机器人导航是一个众所周知的挑战,我们探讨了在这一环境中基于群体表示的实际应用。实证研究表明,行人群体能够使移动机器人的导航行为更加安全和社交。然而,现有的方法要么仅在有限的场景中探索群体,且没有高密度人群,要么依赖外部检测模块来跟踪个体,这些模块由于人群中的遮挡而容易受到噪声和错误的影响。我们表明,在拥挤环境中,群体预测的准确性对导航性能的影响仅为边际。基于这一观察,我们提出了基于可见边缘的群体表示。我们还通过仿真实验展示了我们的导航框架与简化的群体表示相结合,在密集人群中在安全性和社交性方面表现相当,同时实现了更快的计算速度。最后,我们将我们的导航框架部署在真实机器人上,以探索在现实世界中实际应用基于群体表示的好处。
cs.RO / 19 / 2604.16788
LongBench: Evaluating Robotic Manipulation Policies on Real-World Long-Horizon Tasks
LongBench:在真实世界长时间任务上评估机器人操作策略
Abstract
Robotic manipulation policies often degrade over extended horizons, yet existing benchmarks provide limited insight into why such failures occur. Most prior benchmarks are either simulation-based or report aggregate success, making it difficult to disentangle the distinct sources of temporal difficulty in real-world execution. We introduce LongBench, a real-world benchmark for evaluating long-horizon manipulation. LongBench consists of over 1,000 real-world episodes, covering two complementary regimes: Context-Independent (fully observable) and Context-Dependent (ambiguity-driven). By organizing tasks into capability- and ambiguity-specific subsets, LongBench enables mechanism-aware evaluation of execution robustness, temporal consistency, and context-dependent reasoning. Evaluating six state-of-the-art policies reveals that long-horizon performance is not governed by a single factor. We observe that performance in fully observable settings is more strongly associated with execution robustness, while contextual difficulty varies across tasks and is not consistently improved by memory-based methods. We hope that LongBench serves as a useful benchmark for studying long-horizon manipulation and for developing policies with stronger robustness across both execution and contextual challenges.
Chinese Translation
机器人操作策略在长时间任务中往往会退化,但现有的基准测试对这种失败发生的原因提供的见解有限。大多数先前的基准测试要么基于模拟,要么报告整体成功率,这使得在真实执行中难以理清时间困难的不同来源。我们引入了LongBench,一个用于评估长时间操作的真实世界基准。LongBench包含超过1000个真实世界的实验,涵盖两种互补的范畴:独立上下文(完全可观察)和依赖上下文(模糊驱动)。通过将任务组织成能力特定和模糊特定的子集,LongBench使得对执行稳健性、时间一致性和上下文依赖推理的机制感知评估成为可能。对六种最先进策略的评估表明,长时间性能并不是由单一因素决定的。我们观察到,在完全可观察的环境中,性能与执行稳健性之间的关联更为显著,而上下文难度在任务之间变化,并且并不总是通过基于记忆的方法得到一致改善。我们希望LongBench能作为研究长时间操作和开发在执行和上下文挑战中具有更强稳健性的策略的有用基准。
cs.RO / 20 / 2604.16850
Refinement of Accelerated Demonstrations via Incremental Iterative Reference Learning Control for Fast Contact-Rich Imitation Learning
通过增量迭代参考学习控制精炼加速演示以实现快速接触丰富的模仿学习
Abstract
Fast execution of contact-rich manipulation is critical for practical deployment, yet providing fast demonstrations for imitation learning (IL) remains challenging: humans cannot demonstrate at high speed, and naively accelerating demonstrations alters contact dynamics and induces large tracking errors. We present a method to autonomously refine time-accelerated demonstrations by repurposing Iterative Reference Learning Control (IRLC) to iteratively update the reference trajectory from observed tracking errors. However, applying IRLC directly at high speed tends to produce larger early-iteration errors and less stable transients. To address this issue, we propose Incremental Iterative Reference Learning Control (I2RLC), which gradually increases the speed while updating the reference, yielding high-fidelity trajectories. We validate on real-robot whiteboard erasing and peg-in-hole tasks using a teleoperation setup with a compliance-controlled follower and a 3D-printed haptic leader. Both IRLC and I2RLC achieve up to 10x faster demonstrations with reduced tracking error; moreover, I2RLC improves spatial similarity to the original trajectories by 22.5% on average over IRLC across three tasks and multiple speeds (3x-10x). We then use the refined trajectories to train IL policies; the resulting policies execute faster than the demonstrations and achieve 100% success rates in the peg-in-hole task at both seen and unseen positions, with I2RLC-trained policies exhibiting lower contact forces than those trained on IRLC-refined demonstrations. These results indicate that gradual speed scheduling coupled with reference adaptation provides a practical path to fast, contact-rich IL.
Chinese Translation
快速执行接触丰富的操作对于实际部署至关重要,但为模仿学习(IL)提供快速演示仍然具有挑战性:人类无法以高速进行演示,而简单地加速演示会改变接触动态并引发较大的跟踪误差。我们提出了一种通过重新利用迭代参考学习控制(IRLC)的方法,自动精炼时间加速的演示,通过观察到的跟踪误差迭代更新参考轨迹。然而,直接在高速下应用IRLC往往会产生较大的早期迭代误差和不稳定的瞬态。为了解决这个问题,我们提出了增量迭代参考学习控制(I2RLC),该方法在更新参考的同时逐渐提高速度,从而产生高保真度的轨迹。我们在真实机器人白板擦除和插销入孔任务上进行了验证,使用了一个具有合规控制跟随者和3D打印触觉领导者的遥操作设置。IRLC和I2RLC均实现了高达10倍的加速演示,并减少了跟踪误差;此外,I2RLC在三个任务和多个速度(3x-10x)上平均提高了与原始轨迹的空间相似性22.5%。然后,我们使用精炼后的轨迹训练IL策略;结果显示,生成的策略执行速度快于演示,并在插销入孔任务的已见和未见位置上均实现了100%的成功率,而I2RLC训练的策略表现出比基于IRLC精炼演示训练的策略更低的接触力。这些结果表明,逐步的速度调度结合参考适应为快速、接触丰富的IL提供了一条切实可行的路径。
cs.RO / 21 / 2604.16868
Greedy Kalman-Swarm: Improving State Estimation in Robot Swarms in Harsh Environments
贪婪卡尔曼群体:在恶劣环境中改善机器人群体的状态估计
Abstract
State estimation is a fundamental requirement in robotics, where the accurate determination of a robot's state is essential for stable operation despite inherent process disturbances and sensor noise. Traditionally, this is achieved through Kalman filtering, providing a statistically optimal estimate by balancing predictive models with noisy measurements. In the context of robotic swarms, the challenge shifts from individual accuracy to collective coordination, where the integration of global dynamics can significantly enhance the precision of the entire group. Existing estimation techniques rely on centralized processing or heavy communication protocols to reach a global consensus, which are frequently impractical in real-world deployments. Here we show that a localized, "greedy" approach to distributed state estimation (termed "Greedy Kalman-Swarm") allows individual robots to leverage relative inter-robot sensing for improved accuracy without requiring full data availability or global communication. Simulations in communication-constrained environments show robots can effectively integrate all currently available neighbor data at each iteration to refine their internal states, yet remain robust and functional even when data is missing. This results in a performance profile that strikes a balance between the low overhead of independent estimation and the high accuracy of centralized systems, specifically under harsh or dynamic environmental conditions. Our results demonstrate that global state awareness can be emergent rather than enforced, providing a scalable framework for maintaining swarm cohesion in unpredictable terrains. We anticipate that this decentralized methodology will serve as a foundation for more resilient autonomous systems, particularly in search-and-rescue or space exploration missions where reliable, high-bandwidth communication cannot be guaranteed.
Chinese Translation
状态估计是机器人技术中的基本要求,准确确定机器人的状态对于在固有过程干扰和传感器噪声下的稳定操作至关重要。传统上,这通过卡尔曼滤波实现,提供了一种通过平衡预测模型与噪声测量来获得统计最优估计的方法。在机器人群体的背景下,挑战从个体精度转向集体协调,其中全球动态的整合可以显著提高整个群体的精度。现有的估计技术依赖于集中处理或繁重的通信协议以达成全球共识,这在实际部署中往往不切实际。在此,我们展示了一种局部的“贪婪”分布式状态估计方法(称为“Greedy Kalman-Swarm”),允许个体机器人利用相对的机器人间感知来提高精度,而无需完全的数据可用性或全球通信。在通信受限的环境中的模拟表明,机器人可以有效整合每次迭代中所有当前可用的邻居数据,以细化其内部状态,即使在数据缺失的情况下仍然保持稳健和功能性。这导致了一种在独立估计的低开销与集中系统的高精度之间取得平衡的性能特征,特别是在恶劣或动态的环境条件下。我们的结果表明,全球状态意识可以是自发产生的而非强制执行的,为在不可预测的地形中维持群体凝聚力提供了可扩展的框架。我们预期这种去中心化的方法将为更具韧性的自主系统奠定基础,特别是在搜索与救援或太空探索任务中,可靠的高带宽通信无法得到保证。
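The "greedy" fusion idea in the abstract above can be sketched as a sequential Kalman update that uses whatever measurements arrived in the current iteration and skips the rest. This is a simplified illustration under stated assumptions: a scalar state, a random-walk motion model, and neighbor-derived measurements treated as independent. None of these choices, nor the function names, come from the paper.

```python
def kalman_update(x, p, z, r):
    """Fuse one scalar measurement z (variance r) into estimate x (variance p)."""
    k = p / (p + r)  # Kalman gain
    return x + k * (z - x), (1.0 - k) * p


def greedy_step(x, p, q, own_meas, neighbor_meas, r_own, r_rel):
    """One iteration: predict, then fuse every measurement that is available.

    own_meas is None when the robot's own sensor dropped out; entries of
    neighbor_meas are None when a neighbor's data did not arrive.
    """
    p = p + q  # predict: random-walk model, only the variance grows
    if own_meas is not None:
        x, p = kalman_update(x, p, own_meas, r_own)
    for z in neighbor_meas:
        if z is not None:
            x, p = kalman_update(x, p, z, r_rel)
    return x, p
```

With no data at all the step reduces to a pure prediction, so the estimator keeps running when communication fails; each additional neighbor measurement only shrinks the variance further.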
cs.RO / 22 / 2604.16886
Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction
交互链基准(COIN):当推理遇上具身交互
Abstract
Generalist embodied agents must perform interactive, causally-dependent reasoning, continually interacting with the environment, acquiring information, and updating plans to solve long-horizon tasks before they could be adopted in real-life scenarios. For instance, retrieving an apple from a cabinet may require opening multiple doors and drawers before the apple becomes visible and reachable, demanding sequential interaction under partial observability. However, existing benchmarks fail to systematically evaluate this essential capability. We introduce COIN, a benchmark designed to assess interactive reasoning in realistic robotic manipulation through three key contributions. First, we construct COIN-50: 50 interactive tasks in daily scenarios, and create COIN-Primitive required by causally-dependent tasks, and COIN-Composition with mid-term complexity for skill learning and generalization evaluation. Second, we develop a low-cost mobile AR teleoperation system and collect the COIN-Primitive Dataset with 50 demonstrations per primitive task (1,000 in total). Third, we develop systematic evaluation metrics about execution stability and generalization robustness to evaluate CodeAsPolicy, VLA, and language-conditioned H-VLA approaches. Our comprehensive evaluation reveals critical limitations in current methods: models struggle with interactive reasoning tasks due to significant gaps between visual understanding and motor execution. We provide fine-grained analysis of these limitations.
Chinese Translation
通用具身智能体必须执行交互式、因果依赖的推理,持续与环境互动,获取信息并更新计划,以解决长时间跨度的任务,才能在现实场景中被采用。例如,从橱柜中取出一个苹果可能需要打开多个门和抽屉,直到苹果变得可见且可达,这要求在部分可观察性下进行顺序交互。然而,现有的基准测试未能系统性地评估这一基本能力。我们引入了COIN,一个旨在通过三个关键贡献评估现实机器人操作中的交互推理的基准。首先,我们构建了COIN-50:50个日常场景中的交互任务,并创建了因果依赖任务所需的COIN-Primitive,以及具有中期复杂性的COIN-Composition,用于技能学习和泛化评估。其次,我们开发了一个低成本的移动增强现实远程操作系统,并收集了COIN-Primitive数据集,每个原始任务提供50个演示(总共1,000个)。第三,我们开发了关于执行稳定性和泛化鲁棒性的系统评估指标,以评估CodeAsPolicy、VLA和语言条件下的H-VLA方法。我们的全面评估揭示了当前方法的关键局限性:模型在交互推理任务中表现不佳,因视觉理解与运动执行之间存在显著差距。我们对这些局限性进行了细致的分析。
cs.RO / 23 / 2604.16887
Time-Division Multiplexing Actuation in Tendon-Driven Arms: Lightweight Design and Fault Tolerance
腱驱动臂的时分复用驱动:轻量化设计与容错性
Abstract
Robotic manipulators for aerospace applications require a delicate balance between lightweight construction and fault-tolerant operation to satisfy strict weight limitations and ensure reliability in remote, hazardous environments. This paper presents Time-Division Multiplexing Actuation (TDMA), a practical approach for tendon-driven robots that significantly reduces actuator count while preserving high torque output and intrinsic fault tolerance. The key hardware employs a vertically-stacked rotational selection structure that integrates self-rotating TDM motors for rapid configuration, electromagnetic clutches enabling sub-0.1 second engagement, a worm gear reducer for enhanced load capacity and self-locking capability, and a dual-encoder system for precise, long-term positioning. Leveraging TDMA, the proposed MuxArm achieves a self-weight of 2.17 kg, supports an actuator driving capacity of 10 kg, and maintains end-effector accuracy up to 1% of its length, even under partial servo failure. Additionally, an actuation space trajectory planning algorithm is developed, enabling fault-tolerant control and reducing tendon load by up to 50% compared to conventional methods. Comprehensive experiments demonstrate MuxArm's robust performance in diverse settings, including free-space, cluttered, and confined environments.
Chinese Translation
用于航空航天应用的机器人操纵器需要在轻量化构造与容错操作之间取得微妙平衡,以满足严格的重量限制并确保在偏远、危险环境中的可靠性。本文提出了一种时分复用驱动(Time-Division Multiplexing Actuation, TDMA)的方法,适用于腱驱动机器人,显著减少了执行器数量,同时保持高扭矩输出和固有的容错能力。关键硬件采用垂直堆叠的旋转选择结构,集成了自旋转的TDM电机以实现快速配置,电磁离合器使得接合时间低于0.1秒,蜗轮减速器增强了负载能力和自锁能力,并配备双编码器系统以实现精确的长期定位。利用TDMA,所提出的MuxArm自重为2.17千克,支持10千克的执行器驱动能力,并在部分伺服故障情况下保持末端执行器精度达到其长度的1%。此外,开发了一种驱动空间轨迹规划算法,实现了容错控制,并将腱负载相比于传统方法减少了多达50%。全面的实验展示了MuxArm在自由空间、杂乱和封闭环境中的强大性能。
cs.RO / 24 / 2604.16903
Leveraging VR Robot Games to Facilitate Data Collection for Embodied Intelligence Tasks
利用虚拟现实机器人游戏促进具身智能任务的数据收集
Abstract
Collecting embodied interaction data at scale remains costly and difficult due to the limited accessibility of conventional interfaces. We present a gamified data collection framework based on Unity that combines procedural scene generation, VR-based humanoid robot control, automatic task evaluation, and trajectory logging. A trash pick-and-place task prototype is developed to validate the full workflow. Experimental results indicate that the collected demonstrations exhibit broad coverage of the state-action space, and that increasing task difficulty leads to higher motion intensity as well as more extensive exploration of the arm's workspace. The proposed framework demonstrates that game-oriented virtual environments can serve as an effective and extensible solution for embodied data collection.
Chinese Translation
由于传统接口的可及性有限,大规模收集具身交互数据仍然成本高昂且困难。我们提出了一种基于Unity的游戏化数据收集框架,该框架结合了程序化场景生成、基于虚拟现实的人形机器人控制、自动任务评估和轨迹记录。我们开发了一个垃圾拾取与放置任务原型,以验证完整的工作流程。实验结果表明,收集到的演示在状态-动作空间中具有广泛的覆盖范围,且任务难度的增加导致运动强度提高以及手臂工作空间的更广泛探索。所提出的框架表明,面向游戏的虚拟环境可以作为一个有效且可扩展的解决方案,用于具身数据的收集。
cs.RO / 25 / 2604.16962
Multi-stage Planning for Multi-target Surveillance using Aircrafts Equipped with Synthetic Aperture Radars Aware of Target Visibility
基于目标可见性感知的合成孔径雷达装备飞机的多阶段多目标监视规划
Abstract
Generating trajectories for synthetic aperture radar (SAR)-equipped aircraft poses significant challenges due to terrain constraints, and the need for straight-flight segments to ensure high-quality imaging. Related works usually focus on trajectory optimization for predefined straight-flight segments that do not adapt to the target visibility, which depends on the 3D terrain and aircraft orientation. In addition, this assumption does not scale well for the multi-target problem, where multiple straight-flight segments that maximize target visibility must be defined for real-time operations. For this purpose, this paper presents a multi-stage planning system. First, the waypoint sequencing to visit all the targets is estimated. Second, straight-flight segments maximizing target visibility according to the 3D terrain are predicted using a novel neural network trained with deep reinforcement learning. Finally, the segments are connected to create a trajectory via optimization that imposes 3D Dubins curves. Evaluations demonstrate the robustness of the system for SAR missions since it ensures high-quality multi-target SAR image acquisition aware of 3D terrain and target visibility, and real-time performance.
Chinese Translation
为装备有合成孔径雷达(SAR)的飞机生成航迹面临显著挑战,主要由于地形限制以及确保高质量成像所需的直飞段。相关研究通常集中于对预定义直飞段的航迹优化,而这些直飞段并未适应目标可见性,而目标可见性又依赖于三维地形和飞机方向。此外,这一假设在多目标问题上表现不佳,因为在实时操作中必须定义多个最大化目标可见性的直飞段。为此,本文提出了一种多阶段规划系统。首先,估计访问所有目标的航点顺序。其次,利用一种通过深度强化学习训练的新型神经网络预测根据三维地形最大化目标可见性的直飞段。最后,通过施加三维Dubins曲线的优化将这些段连接起来以创建航迹。评估结果表明,该系统在SAR任务中的稳健性,因为它确保了高质量的多目标SAR图像获取,同时考虑了三维地形和目标可见性,并具备实时性能。
cs.RO / 26 / 2604.16967
NaviFormer: A Deep Reinforcement Learning Transformer-like Model to Holistically Solve the Navigation Problem
NaviFormer:一种深度强化学习Transformer类模型以整体方式解决导航问题
Abstract
Path planning is usually solved by addressing either the (high-level) route planning problem (waypoint sequencing to achieve the final goal) or the (low-level) path planning problem (trajectory prediction between two waypoints avoiding collisions). However, real-world problems usually require simultaneous solutions to the route and path planning subproblems with a holistic and efficient approach. In this paper, we introduce NaviFormer, a deep reinforcement learning model based on a Transformer architecture that solves the global navigation problem by predicting both high-level routes and low-level trajectories. To evaluate NaviFormer, several experiments have been conducted, including comparisons with other algorithms. Results show competitive accuracy from NaviFormer since it can understand the constraints and difficulties of each subproblem and act consequently to improve performance. Moreover, its superior computation speed proves its suitability for real-time missions.
Chinese Translation
路径规划通常通过解决(高层次)路线规划问题(通过航点排序实现最终目标)或(低层次)路径规划问题(在两个航点之间进行轨迹预测以避免碰撞)来实现。然而,现实世界的问题通常需要以整体和高效的方式同时解决路线和路径规划的子问题。在本文中,我们介绍了NaviFormer,这是一种基于Transformer架构的深度强化学习模型,通过预测高层次的路线和低层次的轨迹来解决全球导航问题。为了评估NaviFormer,我们进行了多项实验,包括与其他算法的比较。结果表明,NaviFormer在准确性上具有竞争力,因为它能够理解每个子问题的约束和困难,并相应地采取行动以提高性能。此外,其卓越的计算速度证明了其适用于实时任务的能力。
cs.RO / 27 / 2604.17048
Neural Network-Based Adaptive Event-Triggered Control for Dual-Arm Unmanned Aerial Manipulator Systems
基于神经网络的双臂无人机操作系统自适应事件触发控制
Abstract
This paper investigates the control problem of dual-arm unmanned aerial manipulator systems (DAUAMs). Strong coupling between the dual-arm and the multirotor platform, together with unmodeled dynamics and external disturbances, poses significant challenges to stable and accurate operation. An adaptive event-triggered control scheme with neural network-based approximation is proposed to address these issues while explicitly considering communication constraints. First, a dynamic model of the DAUAM system is derived, and a command-filter-based backstepping framework with error compensation is constructed. Then, a neural network is employed to approximate external frictions, and an event-triggered mechanism is designed to reduce the transmission frequency of control updates, thereby alleviating communication and energy burdens. Lyapunov-based analysis shows that all closed-loop signals remain bounded and that the tracking error converges to a neighborhood of the desired trajectory within a fixed time. Finally, experiments on a self-built DAUAM platform demonstrate that the proposed approach achieves accurate trajectory tracking.
Chinese Translation
本文研究了双臂无人机操作系统(DAUAMs)的控制问题。双臂与多旋翼平台之间的强耦合,加上未建模的动态特性和外部干扰,给稳定和精确操作带来了重大挑战。为了解决这些问题,提出了一种基于神经网络近似的自适应事件触发控制方案,同时明确考虑了通信约束。首先,推导了DAUAM系统的动态模型,并构建了一个基于命令滤波的反步框架,结合误差补偿。然后,采用神经网络来近似外部摩擦,并设计了一种事件触发机制,以减少控制更新的传输频率,从而减轻通信和能量负担。基于Lyapunov的分析表明,所有闭环信号保持有界,跟踪误差在固定时间内收敛到期望轨迹的邻域。最后,在自建的DAUAM平台上进行的实验表明,所提出的方法实现了精确的轨迹跟踪。
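The event-triggered mechanism described in the abstract above can be illustrated with a toy loop: a new control value is transmitted only when it differs from the last transmitted value by more than a threshold, and the actuator holds the old value otherwise. The proportional law, the fixed threshold, and all names are assumptions for illustration; the paper's adaptive triggering rule and neural-network approximator are not reproduced.

```python
def run_event_triggered(states, gain, threshold):
    """Return (applied_controls, number_of_transmissions) for a state sequence."""
    held_u = None
    applied, transmissions = [], 0
    for x in states:
        u = -gain * x  # control the controller would like to send
        if held_u is None or abs(u - held_u) > threshold:
            held_u = u  # trigger: transmit and latch the new control
            transmissions += 1
        applied.append(held_u)  # actuator uses the last transmitted value
    return applied, transmissions
```

For example, `run_event_triggered([1.0, 0.95, 0.5, 0.45], gain=2.0, threshold=0.2)` transmits twice instead of four times, which is the communication saving the mechanism is after.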
cs.RO / 28 / 2604.17050
Web-Gewu: A Browser-Based Interactive Playground for Robot Reinforcement Learning
Web-Gewu:一个基于浏览器的机器人强化学习互动平台
Abstract
With the rapid development of embodied intelligence, robotics education faces a dual challenge: high computational barriers and cumbersome environment configuration. Existing centralized cloud simulation solutions incur substantial GPU and bandwidth costs that preclude large-scale deployment, while pure local computing is severely constrained by learners' hardware limitations. To address these issues, we propose Web-Gewu (http://47.76.242.88:8080/receiver/index.html), an interactive robotics education platform built on a WebRTC cloud-edge-client collaborative architecture. The system offloads all physics simulation and reinforcement learning (RL) training to the edge node, while the cloud server acts exclusively as a lightweight signaling relay, enabling extremely low-cost browser-based peer-to-peer (P2P) real-time streaming. Learners can interact with multi-form robots at low end-to-end latency directly in a web browser without any local installation, and simultaneously observe real-time visualization of multi-dimensional monitoring data, including reinforcement learning reward curves. Combined with a predefined robust command communication protocol, Web-Gewu provides a highly scalable, out-of-the-box, and barrier-free teaching infrastructure for embodied intelligence, significantly lowering the barrier to entry for cutting-edge robotics technology.
Chinese Translation
随着具身智能的快速发展,机器人教育面临着双重挑战:高计算门槛和繁琐的环境配置。现有的集中式云仿真解决方案产生了可观的GPU和带宽成本,阻碍了大规模部署,而纯本地计算又受到学习者硬件限制的严重制约。为了解决这些问题,我们提出了Web-Gewu,一个基于WebRTC云边缘客户端协作架构构建的互动机器人教育平台。该系统将所有物理仿真和强化学习(RL)训练卸载到边缘节点,而云服务器则仅作为轻量级信令中继,使得基于浏览器的点对点(P2P)实时流媒体传输成本极低。学习者可以在网页浏览器中直接与多形态机器人进行低延迟交互,无需任何本地安装,同时观察多维监控数据的实时可视化,包括强化学习奖励曲线。结合预定义的稳健命令通信协议,Web-Gewu为具身智能提供了一个高度可扩展、开箱即用且无障碍的教学基础设施,显著降低了前沿机器人技术的入门门槛。
cs.RO / 29 / 2604.17189
Shepherding UAV Swarm with Action Prediction Based on Movement Constraints
基于运动约束的行动预测引导无人机群
Abstract
In this study, we propose a new sheepdog-inspired control method for a swarm of small unmanned aerial vehicles (UAVs), which predicts the swarm behavior while explicitly accounting for the motion constraints of real robots. Sheepdog-inspired guidance control refers to a framework in which a small number of navigator agents (sheepdog agents) indirectly drive a large number of autonomous agents (a flock of sheep agents) so as to steer the group toward a target position. In conventional studies on sheepdog-inspired guidance, both types of agents have typically been modeled as point masses, and the guidance law for the navigator agents has been designed using simple interaction vectors based on the instantaneous relative positions between the agents. However, when implementing such methods on real robots such as drones, it is necessary to consider each agent's motion constraints, including upper bounds on velocity and acceleration. Moreover, we argue that guidance can be made more efficient by predicting the future behavior of the autonomous swarm that is observable to the navigator agents. To this end, we propose a three-dimensional guidance control law based on behavior prediction of autonomous agents under motion constraints, inspired by the Dynamic Window Approach (DWA). At each control cycle, the navigator agent generates a set of feasible motion candidates that satisfy its motion constraints, and predicts the short-horizon swarm evolution using an internal model of the autonomous agents maintained within the navigator agent. The motion candidates are then evaluated according to criteria such as the progress velocity toward the target, the positioning strategy with respect to the swarm, and safety margins, and the optimal motion is selected to achieve safe and efficient guidance. Numerical simulation results demonstrate the effectiveness of the proposed guidance control law.
Chinese Translation
在本研究中,我们提出了一种新的受牧羊犬启发的控制方法,用于一群小型无人机(UAV),该方法在明确考虑真实机器人运动约束的同时,预测群体行为。受牧羊犬启发的引导控制指的是一种框架,其中少量的导航代理(牧羊犬代理)间接驱动大量自主代理(羊群代理),以引导群体朝向目标位置。在传统的受牧羊犬启发的引导研究中,这两种类型的代理通常被建模为点质量,导航代理的引导法则是基于代理之间瞬时相对位置的简单交互向量设计的。然而,在将此类方法应用于真实机器人(如无人机)时,必须考虑每个代理的运动约束,包括速度和加速度的上限。此外,我们认为,通过预测导航代理可观察到的自主群体的未来行为,可以使引导更加高效。为此,我们提出了一种基于运动约束下自主代理行为预测的三维引导控制法则,该法则受到动态窗口方法(Dynamic Window Approach, DWA)的启发。在每个控制周期中,导航代理生成一组满足其运动约束的可行运动候选,并使用导航代理内部维护的自主代理模型预测短期的群体演变。然后,根据朝向目标的进展速度、相对于群体的定位策略和安全边际等标准评估运动候选,并选择最佳运动以实现安全高效的引导。数值仿真结果验证了所提出的引导控制法则的有效性。
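The candidate-generation and selection loop described in the abstract above follows the general Dynamic Window pattern, which can be sketched as follows. This is a deliberately reduced illustration: it assumes per-axis acceleration and speed bounds and scores candidates only by predicted distance to a target point, whereas the paper additionally predicts the swarm's evolution and evaluates positioning and safety terms. All names are hypothetical.

```python
import itertools


def feasible_velocities(v, a_max, v_max, dt, steps=(-1.0, 0.0, 1.0)):
    """Velocities reachable in one control cycle under accel and speed bounds."""
    candidates = []
    for scale in itertools.product(steps, repeat=len(v)):
        cand = tuple(
            max(-v_max, min(v_max, vi + s * a_max * dt))
            for vi, s in zip(v, scale)
        )
        candidates.append(cand)
    return candidates


def select_velocity(pos, v, target, a_max, v_max, dt, horizon):
    """Pick the feasible velocity whose short-horizon prediction is closest to target."""

    def cost(cand):
        # Constant-velocity prediction over the horizon.
        pred = [p + c * horizon for p, c in zip(pos, cand)]
        return sum((t - q) ** 2 for t, q in zip(target, pred))

    return min(feasible_velocities(v, a_max, v_max, dt), key=cost)
```

Because every candidate is built from the current velocity plus a bounded change, the selected motion always respects the limits, which is the property point-mass guidance laws lack.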
cs.RO / 30 / 2604.17199
Modeling, Control and Self-sensing of Dielectric Elastomer Soft Actuators: A Review
介电弹性体软驱动器的建模、控制与自感知:综述
Abstract
Dielectric elastomer actuators (DEAs) have garnered extensive attention especially in soft robotic applications over the past few decades owing to the advantages of lightweight, large strain, fast response and high energy density. However, because the DEAs suffer from nonlinear elasticity, inherent viscoelastic creep, hysteresis and vibrational dynamics, the modeling, control and self-sensing of DEAs are challenging, thereby hindering the practical applications of DEAs. In order to address these challenges, numerous studies have been conducted. In this review, various physics-based modeling methods and phenomenological modeling methods for predicting the electromechanical response of DEAs are presented and discussed. Different control methods for DEAs are reviewed, which are classified into open-loop feedforward control, feedback control, feedforward-feedback control and adaptive feedforward control. Physics-based self-sensing methods and data-driven self-sensing methods for reconstructing the DEA displacement without the need for additional sensors are discussed. Finally, the existing problems and new opportunities for the further studies are summarized.
Chinese Translation
介电弹性体驱动器(DEAs)由于其轻量化、大变形、快速响应和高能量密度等优点,在过去几十年中尤其在软机器人应用中引起了广泛关注。然而,由于DEAs存在非线性弹性、固有的粘弹性蠕变、滞后效应和振动动态等问题,DEAs的建模、控制和自感知面临挑战,从而阻碍了DEAs的实际应用。为了解决这些挑战,已经开展了大量研究。在本综述中,介绍并讨论了多种基于物理的建模方法和现象学建模方法,用于预测DEAs的电机械响应。对DEAs的不同控制方法进行了回顾,这些方法被分类为开环前馈控制、反馈控制、前馈-反馈控制和自适应前馈控制。讨论了基于物理的自感知方法和数据驱动的自感知方法,以在不需要额外传感器的情况下重构DEA位移。最后,总结了现有问题和进一步研究的新机遇。
cs.RO / 31 / 2604.17212
Planning Smooth and Safe Control Laws for a Unicycle Robot Among Obstacles
在障碍物间为单轮机器人规划平滑安全的控制律
Abstract
This paper presents a framework for safe navigation of a unicycle point robot to a goal position in an environment populated with obstacles from almost any admissible state, considering input limits. We introduce a novel QP formulation to create a C∞-smooth vector field with reduced total bending and total turning. Then we design an analytic, non-linear feedback controller that inherently satisfies the conditions of Nagumo's theorem, ensuring forward invariance of the safe set without requiring any online optimization. We have demonstrated that our controller, even under hard input limits, safely converges to the goal position. Simulations confirm the effectiveness of the proposed framework, resulting in a twice faster arrival time with over 50% lower angular control effort compared to the baseline.
Chinese Translation
本文提出了一种框架,用于在几乎任何可接受状态下,考虑输入限制的情况下,安全地导航单轮点机器人到达目标位置,该环境中存在障碍物。我们引入了一种新颖的二次规划(QP)公式,以创建一个具有降低的总弯曲和总转向的C∞平滑向量场。然后,我们设计了一种解析的非线性反馈控制器,该控制器本质上满足Nagumo定理的条件,确保安全集的前向不变性,而无需任何在线优化。我们已经证明,即使在严格的输入限制下,我们的控制器也能安全地收敛到目标位置。仿真结果确认了所提框架的有效性,与基线相比,达到目标的时间快了两倍,角控制努力降低了超过50%。
cs.RO / 32 / 2604.17241
GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning
GaLa:用于程序规划的超图引导视觉语言模型
Abstract
Implicit spatial relations and deep semantic structures encoded in object attributes are crucial for procedural planning in embodied AI systems. However, existing approaches often over-rely on the reasoning capabilities of vision language models (VLMs) themselves, while overlooking the rich structured semantic information that can be mined from multimodal inputs. As a result, models struggle to effectively understand functional spatial relationships in complex scenes. To fully exploit implicit spatial relations and deep semantic structures in multimodal data, we propose GaLa, a vision language framework for multimodal procedural planning. GaLa introduces a hypergraph-based representation, where object instances in the image are modeled as nodes, and region-level hyperedges are constructed by aggregating objects according to their attributes and functional semantics. This design explicitly captures implicit semantic relations among objects as well as the hierarchical organization of functional regions. Furthermore, we design a TriView HyperGraph Encoder that enforces semantic consistency across the node view, area view, and node area association view via contrastive learning, enabling hypergraph semantics to be more effectively injected into downstream VLM reasoning. Extensive experiments on the ActPlan1K and ALFRED benchmarks demonstrate that GaLa significantly outperforms existing methods in terms of execution success rate, LCS, and planning correctness.
Chinese Translation
隐含的空间关系和编码在对象属性中的深层语义结构对于具身人工智能系统中的程序规划至关重要。然而,现有的方法往往过于依赖视觉语言模型(VLMs)自身的推理能力,而忽视了可以从多模态输入中挖掘的丰富结构化语义信息。因此,模型在理解复杂场景中的功能空间关系时面临困难。为了充分利用多模态数据中的隐含空间关系和深层语义结构,我们提出了GaLa,一个用于多模态程序规划的视觉语言框架。GaLa引入了一种基于超图的表示,其中图像中的对象实例被建模为节点,区域级超边通过根据对象的属性和功能语义聚合对象而构建。该设计明确捕捉了对象之间的隐含语义关系以及功能区域的层次组织。此外,我们设计了一种TriView HyperGraph Encoder,通过对比学习在节点视图、区域视图和节点区域关联视图之间强制语义一致性,从而使超图语义能够更有效地注入到下游VLM推理中。在ActPlan1K和ALFRED基准上的大量实验表明,GaLa在执行成功率、LCS和规划正确性方面显著优于现有方法。
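The abstract above reports an LCS metric without defining it. Assuming it denotes the longest common subsequence between a predicted and a reference action sequence, a common plan-similarity measure, it can be computed as sketched below. The normalization by the longer sequence is an assumption, not taken from the paper.

```python
def lcs_length(pred, ref):
    """Length of the longest common subsequence of two action sequences."""
    dp = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, a in enumerate(pred, start=1):
        for j, b in enumerate(ref, start=1):
            if a == b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(pred)][len(ref)]


def lcs_ratio(pred, ref):
    """LCS length normalized by the longer sequence (assumed normalization)."""
    longest = max(len(pred), len(ref))
    return lcs_length(pred, ref) / longest if longest else 1.0
```

Unlike exact-match accuracy, this score still credits a plan that contains the right steps in the right order but inserts an extra action.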
cs.RO / 33 / 2604.17245
MM-Hand: A 21-DOF Multi-modal Modular Dexterous Robotic Hand with Remote Actuation
MM-Hand:一种具有远程驱动的21自由度多模态模块化灵巧机器人手
Abstract
High-DOF dexterous hands require compact actuation, rich sensing, and reliable thermal behavior, but conventional designs often occupy valuable in-hand space, increase end-effector mass, and suffer from heat accumulation near the hand. Remote tendon-driven actuation offers an alternative by relocating motors to the robot base or an external motor hub, thereby freeing the fingers and palm for additional degrees of freedom, sensing modules, and maintainable mechanical structures. This paper presents MM-Hand, a 21-DOF Multimodal Modular dexterous hand based on remote tendon-driven actuation. The hand integrates spring-return tendon-driven fingers, modular 3D-printed finger and palm structures, quick tendon connectors for maintenance, and a multimodal sensing system including joint angle sensors, tactile sensors, motor-side feedback, and in-palm stereo vision. We further analyze tendon-sheath length variation and friction loss to guide the design of the routing, motor hub, and closed-loop joint control. Experiments validate the transmission, output force, sensing, and control capability of the system. The fingertip force reaches 25N under a 1m remote sheath transmission, demonstrating practical load capacity despite long-distance tendon routing. Closed-loop joint-level experiments further evaluate command tracking with a static arm and during arm motion. These results show that MM-Hand provides a lightweight, sensor-rich, and maintainable hardware platform for dexterous manipulation research. To support the community, all hardware designs and software frameworks are made fully open-source at https://mmlab.hk/research/MM-Hand.
Chinese Translation
高自由度灵巧手需要紧凑的驱动、丰富的传感和可靠的热行为,但传统设计往往占用宝贵的手内空间,增加末端执行器的质量,并且在手部附近容易出现热积累。远程腱驱动的驱动方式提供了一种替代方案,通过将电机移至机器人基座或外部电机中心,从而为手指和掌心释放出额外的自由度、传感模块和可维护的机械结构。本文介绍了MM-Hand,一种基于远程腱驱动的21自由度多模态模块化灵巧手。该手集成了弹簧回位的腱驱动手指、模块化的3D打印手指和掌心结构、便于维护的快速腱连接器,以及包括关节角传感器、触觉传感器、电机侧反馈和掌心立体视觉在内的多模态传感系统。我们进一步分析了腱鞘长度变化和摩擦损失,以指导布线、电机中心和闭环关节控制的设计。实验验证了系统的传动、输出力、传感和控制能力。在1米远程鞘传输下,指尖力达到25N,尽管腱路由距离较长,仍展示了实际的负载能力。闭环关节级实验进一步评估了静态臂和臂运动过程中的指令跟踪。这些结果表明,MM-Hand为灵巧操作研究提供了一个轻量级、传感器丰富且易于维护的硬件平台。为了支持社区,所有硬件设计和软件框架均已在 https://mmlab.hk/research/MM-Hand 上完全开源。
cs.RO / 34 / 2604.17258
A Rapid Deployment Pipeline for Autonomous Humanoid Grasping Based on Foundation Models
基于基础模型的自主类人抓取的快速部署管道
Abstract
Deploying a humanoid robot to manipulate a new object has traditionally required one to two days of effort: data collection, manual annotation, 3D model acquisition, and model training. This paper presents an end-to-end rapid deployment pipeline that integrates three foundation-model components to shorten the onboarding cycle for a new object to approximately 30 minutes: (i) Roboflow-based automatic annotation to assist in training a YOLOv8 object detector; (ii) 3D reconstruction based on Meta SAM 3D, which eliminates the need for a dedicated laser scanner; and (iii) zero-shot 6-DoF pose tracking based on FoundationPose, using the SAM 3D-generated mesh directly as the template. The estimated pose drives a Unity-based inverse kinematics planner, whose joint commands are streamed via UDP to a Unitree G1 humanoid and executed through the Unitree SDK. We demonstrate detection accuracy of mAP@0.5 = 0.995, pose tracking precision of $\sigma < 1.05$ mm, and successful grasping on a real robot at five positions within the workspace. We further verify the generality of the pipeline on an automobile-window glue-application task. The results show that combining foundation models for perception with everyday imaging devices (e.g., smartphones) can substantially lower the deployment barrier for humanoid manipulation tasks.
Chinese Translation
将类人机器人部署到操作新物体上,传统上需要一到两天的工作量,包括数据收集、手动标注、3D模型获取和模型训练。本文提出了一种端到端的快速部署管道,集成了三个基础模型组件,将新物体的上手周期缩短至大约30分钟:(i) 基于Roboflow的自动标注,辅助训练YOLOv8目标检测器;(ii) 基于Meta SAM 3D的3D重建,消除了对专用激光扫描仪的需求;(iii) 基于FoundationPose的零样本6自由度姿态跟踪,直接使用SAM 3D生成的网格作为模板。估计的姿态驱动一个基于Unity的逆向运动学规划器,其关节命令通过UDP流式传输到Unitree G1类人机器人,并通过Unitree SDK执行。我们展示了mAP@0.5 = 0.995的检测准确率、$\sigma < 1.05$ mm的姿态跟踪精度,以及真实机器人在工作空间内五个位置上的成功抓取。我们进一步验证了该管道在汽车窗户涂胶任务上的通用性。结果表明,将感知的基础模型与日常成像设备(如智能手机)结合,可以显著降低类人操作任务的部署门槛。
cs.RO / 35 / 2604.17335
Learning Whole-Body Humanoid Locomotion via Motion Generation and Motion Tracking
通过运动生成和运动跟踪学习全身类人机器人运动
Abstract
Whole-body humanoid locomotion is challenging due to high-dimensional control, morphological instability, and the need for real-time adaptation to various terrains using onboard perception. Directly applying reinforcement learning (RL) with reward shaping to humanoid locomotion often leads to lower-body-dominated behaviors, whereas imitation-based RL can learn more coordinated whole-body skills but is typically limited to replaying reference motions without a mechanism to adapt them online from perception for terrain-aware locomotion. To address this gap, we propose a whole-body humanoid locomotion framework that combines skills learned from reference motions with terrain-aware adaptation. We first train a diffusion model on retargeted human motions for real-time prediction of terrain-aware reference motions. Concurrently, we train a whole-body reference tracker with RL using this motion data. To improve robustness under imperfectly generated references, we further fine-tune the tracker with a frozen motion generator in a closed-loop setting. The resulting system supports directional goal-reaching control with terrain-aware whole-body adaptation, and can be deployed on a Unitree G1 humanoid robot with onboard perception and computation. The hardware experiments demonstrate successful traversal over boxes, hurdles, stairs, and mixed terrain combinations. Quantitative results further show the benefits of incorporating online motion generation and fine-tuning the motion tracker for improved generalization and robustness.
Chinese Translation
全身类人机器人运动因高维控制、形态不稳定以及需要实时适应各种地形的机载感知而面临挑战。直接将强化学习(RL)与奖励塑形应用于类人机器人运动通常会导致下肢主导的行为,而基于模仿的RL可以学习更协调的全身技能,但通常仅限于重放参考运动,缺乏从感知中在线适应的机制以实现地形感知的运动。为了解决这一问题,我们提出了一种全身类人机器人运动框架,该框架结合了从参考运动中学习的技能与地形感知的适应能力。我们首先在重新定向的人类运动上训练一个扩散模型,以实时预测地形感知的参考运动。同时,我们使用这些运动数据通过RL训练一个全身参考跟踪器。为了在生成的不完美参考下提高鲁棒性,我们进一步在闭环设置中使用一个冻结的运动生成器微调跟踪器。最终系统支持具有地形感知的全身适应的方向性目标到达控制,并可以部署在具有机载感知和计算能力的Unitree G1类人机器人上。硬件实验展示了成功跨越箱子、障碍物、楼梯和混合地形组合的能力。定量结果进一步表明,结合在线运动生成和微调运动跟踪器在提高泛化能力和鲁棒性方面的优势。
cs.RO / 36 / 2604.17407
Think before Go: Hierarchical Reasoning for Image-goal Navigation
思考再出发:图像目标导航的层次推理
Abstract
Image-goal navigation steers an agent to a target location specified by an image in unseen environments. Existing methods primarily handle this task by learning an end-to-end navigation policy, which compares the target and observation images and directly predicts actions. However, when the target is distant or lies in another room, such methods fail to extract informative visual cues, leading the agent to wander around. Motivated by the human cognitive principle that deliberate, high-level reasoning guides fast, reactive execution in complex tasks, we propose Hierarchical Reasoning Navigation (HRNav), a framework that decomposes image-goal navigation into high-level planning and low-level execution. In high-level planning, a vision-language model is trained on a self-collected dataset to generate a short-horizon plan, such as whether the agent should walk through the door or down the hallway. This reduces the difficulty of the long-horizon task, making it more tractable for the low-level executor. In low-level execution, an online reinforcement learning policy decides actions conditioned on the short-horizon plan. We also devise a novel Wandering Suppression Penalty (WSP) to further reduce the wandering problem. Together, these components form a hierarchical framework for image-goal navigation. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our method.
Chinese Translation
图像目标导航引导智能体在未知环境中前往由图像指定的目标位置。现有方法主要通过学习端到端的导航策略来处理这一任务,该策略比较目标图像和观察图像的相似性,并直接预测行动。然而,当目标距离较远或位于另一个房间时,这些方法无法提取有用的视觉线索,导致智能体四处游荡。受到人类认知原则的启发,即深思熟虑的高层次推理能够指导复杂任务中的快速反应执行,我们提出了层次推理导航(Hierarchical Reasoning Navigation, HRNav)框架,该框架将图像目标导航分解为高层次规划和低层次执行。在高层次规划中,我们在自收集的数据集上训练了一个视觉-语言模型,以生成短期计划,例如智能体是否应该穿过门或沿着走廊行走。这降低了长期任务的难度,使其更易于执行部分。在低层次执行中,利用在线强化学习策略根据短期计划决定行动。我们还设计了一种新颖的游荡抑制惩罚(Wandering Suppression Penalty, WSP),以进一步减少游荡问题。这些组件共同构成了图像目标导航的层次框架。在模拟和真实环境中的大量实验表明我们方法的优越性。
cs.RO / 37 / 2604.17513
FLASH: Fast Learning via GPU-Accelerated Simulation for High-Fidelity Deformable Manipulation in Minutes
FLASH:通过GPU加速仿真实现高保真可变形操作的快速学习
Abstract
Simulation frameworks such as Isaac Sim have enabled scalable robot learning for locomotion and rigid-body manipulation; however, contact-rich simulation remains a major bottleneck for deformable object manipulation. The continuously changing geometry of soft materials, together with large numbers of vertices and contact constraints, makes it difficult to achieve high accuracy, speed, and stability required for large-scale interactive learning. We present FLASH, a GPU-native simulation framework for contact-rich deformable manipulation, built on an accurate NCP-based solver that enforces strict contact and deformation constraints while being explicitly designed for fine-grained GPU parallelism. Rather than porting conventional single-instruction-multiple-data (SIMD) solvers to GPUs, FLASH redesigns the physics engine from the ground up to leverage modern GPU architectures, including optimized collision handling and memory layouts. As a result, FLASH scales to over 3 million degrees of freedom at 30 FPS on a single RTX 5090, while accurately simulating physical interactions. Policies trained solely on FLASH-generated synthetic data in minutes achieve robust zero-shot sim-to-real transfer, which we validate on physical robots performing challenging deformable manipulation tasks such as towel folding and garment folding, without any real-world demonstration, providing a practical alternative to labor-intensive real-world data collection.
Chinese Translation
仿真框架如Isaac Sim使得机器人在运动和刚体操作方面的可扩展学习成为可能;然而,接触丰富的仿真仍然是可变形物体操作的主要瓶颈。软材料不断变化的几何形状,以及大量的顶点和接触约束,使得实现大规模交互学习所需的高精度、高速度和高稳定性变得困难。我们提出了FLASH,一个针对接触丰富的可变形操作的GPU原生仿真框架,基于一种准确的NCP求解器,该求解器在严格执行接触和变形约束的同时,专为细粒度的GPU并行而设计。FLASH没有将传统的单指令多数据(SIMD)求解器移植到GPU上,而是从头开始重新设计物理引擎,以利用现代GPU架构,包括优化的碰撞处理和内存布局。因此,FLASH在单个RTX 5090上可扩展至超过300万自由度,以30帧每秒的速度准确模拟物理交互。仅用FLASH生成的合成数据在几分钟内训练出的策略实现了稳健的零样本仿真到现实迁移,我们在执行如毛巾折叠和衣物折叠等具有挑战性的可变形操作任务的物理机器人上验证了这一点,而无需任何现实世界的演示,为劳动密集型的现实世界数据收集提供了实用的替代方案。
cs.RO / 38 / 2604.17527
Safer Trajectory Planning with CBF-guided Diffusion Model for Unmanned Aerial Vehicles
基于CBF引导的扩散模型的无人机安全轨迹规划
Abstract
Safe and agile trajectory planning is essential for autonomous systems, especially during complex aerobatic maneuvers. Motivated by the recent success of diffusion models in generative tasks, this paper introduces AeroTrajGen, a novel framework for diffusion-based trajectory generation that incorporates control barrier function (CBF)-guided sampling during inference, specifically designed for unmanned aerial vehicles (UAVs). The proposed CBF-guided sampling addresses two critical challenges: (1) mitigating the inherent unpredictability and potential safety violations of diffusion models, and (2) reducing reliance on extensively safety-verified training data. During the reverse diffusion process, CBF-based guidance ensures collision-free trajectories by seamlessly integrating safety constraint gradients with the diffusion model's score function. The model features an obstacle-aware diffusion transformer architecture with multi-modal conditioning, including trajectory history, obstacles, maneuver styles, and goal, enabling the generation of smooth, highly agile trajectories across 14 distinct aerobatic maneuvers. Trained on a dataset of 2,000 expert demonstrations, AeroTrajGen is rigorously evaluated in simulation under multi-obstacle environments. Simulation results demonstrate that CBF-guided sampling reduces collision rates by 94.7% compared to unguided diffusion baselines, while preserving trajectory agility and diversity. Our code is open-sourced at https://github.com/RoboticsPolyu/CBF-DMP.
Chinese Translation
安全和灵活的轨迹规划对于自主系统至关重要,尤其是在复杂的特技飞行过程中。受到扩散模型在生成任务中取得的近期成功的启发,本文提出了一种新颖的基于扩散的轨迹生成框架AeroTrajGen,该框架在推理过程中结合了控制障碍函数(CBF)引导的采样,专为无人机(UAV)设计。所提出的CBF引导采样解决了两个关键挑战:(1)减轻扩散模型固有的不确定性和潜在的安全违规,及(2)减少对经过广泛安全验证的训练数据的依赖。在反向扩散过程中,基于CBF的引导通过将安全约束梯度与扩散模型的评分函数无缝集成,确保了无碰撞轨迹。该模型具有障碍物感知的扩散变换器架构,具备多模态条件,包括轨迹历史、障碍物、机动风格和目标,使得能够在14种不同的特技飞行中生成平滑且高度灵活的轨迹。AeroTrajGen在包含2000个专家演示的数据集上进行训练,并在多障碍环境下进行了严格的仿真评估。仿真结果表明,与无引导的扩散基线相比,CBF引导采样将碰撞率降低了94.7%,同时保持了轨迹的灵活性和多样性。我们的代码已开源,链接为 https://github.com/RoboticsPolyu/CBF-DMP。
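The barrier-gradient guidance described in this abstract can be illustrated with a small, heavily simplified sketch. Everything here is an assumption made for illustration: the learned denoiser is replaced by a hypothetical `toy_denoise_step` that pulls waypoints toward a straight reference path, the scene is 2D with a single circular obstacle, and the guidance is a plain gradient-ascent correction on the barrier $h(x) = \|x - c\|^2 - r^2$ after each reverse step. It is not the AeroTrajGen implementation.

```python
import numpy as np

def barrier(x, center, radius):
    """Barrier h(x) = ||x - c||^2 - r^2; h >= 0 means the waypoint is outside the obstacle."""
    d = x - center
    return np.sum(d * d, axis=-1) - radius ** 2

def barrier_grad(x, center):
    """Gradient of h with respect to x; it points away from the obstacle center."""
    return 2.0 * (x - center)

def toy_denoise_step(x, reference, rate=0.2):
    """Hypothetical stand-in for one learned reverse-diffusion step."""
    return x + rate * (reference - x)

def sample(reference, center, radius, steps=50, guide=True,
           step_size=0.25, max_inner=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from the reference path corrupted by Gaussian noise.
    x = reference + rng.normal(scale=1.0, size=reference.shape)
    for _ in range(steps):
        x = toy_denoise_step(x, reference)
        if not guide:
            continue
        # Guidance: push violating waypoints along the barrier gradient until h >= 0.
        for _ in range(max_inner):
            unsafe = barrier(x, center, radius) < 0.0
            if not unsafe.any():
                break
            x[unsafe] += step_size * barrier_grad(x[unsafe], center)
    return x

# Straight reference path that cuts through an obstacle at the origin.
reference = np.stack([np.linspace(-2.0, 2.0, 21), np.zeros(21)], axis=1)
center, radius = np.array([0.0, 0.0]), 0.5
guided = sample(reference, center, radius, guide=True)
unguided = sample(reference, center, radius, guide=False)
```

In this toy setting the unguided sample collapses onto the reference path and crosses the obstacle, while the guided sample ends with every waypoint on or outside the obstacle boundary. A score-level formulation would instead add the scaled barrier gradient to the model's score inside each reverse step.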
cs.RO / 39 / 2604.17538
Novel Algorithms for Smoothly Differentiable and Efficiently Vectorizable Contact Manifold Construction
平滑可微且高效向量化的接触流形构建新算法
Abstract
Generating intelligent robot behavior in contact-rich settings is a research problem where zeroth-order methods currently prevail. Developing methods that make use of first/second order information about the dynamics holds great promise in terms of increasing the solution speed and computational efficiency. The main bottleneck in this research direction is the difficulty in obtaining useful gradients and Hessians, due to pathologies in all three steps of a common simulation pipeline: i) collision detection, ii) contact dynamics, iii) time integration. This abstract proposes a method that can address the collision detection part of the puzzle in a manner that is smoothly differentiable and massively vectorizable. This is achieved via two contributions: i) a highly expressive class of analytical SDF primitives that can efficiently represent complex 3D surfaces, ii) a novel contact manifold generation routine that makes use of this geometry representation.
Chinese Translation
在接触丰富的环境中生成智能机器人行为是一个研究问题,目前主要采用零阶方法。开发利用一阶/二阶信息的方法在提高解决速度和计算效率方面具有很大潜力。该研究方向的主要瓶颈在于由于常见仿真流程的三个步骤中的病态现象,难以获得有用的梯度和 Hessian:i) 碰撞检测,ii) 接触动力学,iii) 时间积分。本文提出了一种方法,可以以平滑可微和大规模向量化的方式解决碰撞检测这一难题。该方法通过两个贡献实现:i) 一类高度表达性的解析 SDF(Signed Distance Function)原语,可以高效表示复杂的三维表面,ii) 一种新颖的接触流形生成例程,利用这种几何表示。
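As a rough illustration of why analytic SDFs suit smooth, vectorized collision queries, the sketch below combines two sphere SDFs with a log-sum-exp soft minimum and reads off a contact distance and normal for a batch of query points. The primitives, the smoothing constant `k`, and the finite-difference normal are assumptions chosen for brevity; the abstract does not specify its primitive class or its manifold routine, so this is not the authors' method.

```python
import numpy as np

def sdf_sphere(p, center, radius):
    """Signed distance from points p (N, 3) to a sphere."""
    return np.linalg.norm(p - center, axis=-1) - radius

def smooth_min(a, b, k=32.0):
    """Log-sum-exp soft minimum: smooth everywhere, approaches min(a, b) as k grows."""
    return -np.logaddexp(-k * a, -k * b) / k

def scene_sdf(p):
    """Smooth union of two spheres (a hypothetical scene)."""
    d1 = sdf_sphere(p, np.array([0.0, 0.0, 0.0]), 0.5)
    d2 = sdf_sphere(p, np.array([3.0, 0.0, 0.0]), 0.5)
    return smooth_min(d1, d2)

def contact_distance_and_normal(p, eps=1e-5):
    """Distance and unit normal for a batch of points, using central differences."""
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(3):
        offset = np.zeros(3)
        offset[i] = eps
        grad[:, i] = (scene_sdf(p + offset) - scene_sdf(p - offset)) / (2.0 * eps)
    normal = grad / np.linalg.norm(grad, axis=-1, keepdims=True)
    return scene_sdf(p), normal

points = np.array([[0.0, 0.0, 2.0], [3.0, 2.0, 0.0]])
dist, normal = contact_distance_and_normal(points)
```

Because every operation is an elementwise array expression, the same code evaluates thousands of query points at once, and an autodiff framework could replace the finite differences with exact gradients.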
cs.RO / 40 / 2604.17679
A Hamilton-Jacobi Reachability-Guided Search Framework for Efficient and Safe Indoor Planar Robot Navigation
一种基于哈密顿-雅可比可达性引导的高效安全室内平面机器人导航搜索框架
Abstract
Autonomous navigation requires planning to reach a goal safely and efficiently in complex and potentially dynamic environments. Graph search-based algorithms are widely adopted due to their generality and theoretical guarantees when equipped with admissible heuristics. However, the computational complexity of graph search grows rapidly with the dimensionality of the search space, often making real-time planning in dynamic environments intractable. In this paper, we combine offline Hamilton-Jacobi (HJ) reachability with online graph search to leverage the complementary strengths of both. Precomputed HJ value functions, used as informative heuristics and proactive safety constraints, amortize online computation of the graph search process. At the same time, graph search enables reachability-based reasoning to be incorporated into online planning, overcoming the long-standing challenge of HJ reachability requiring full knowledge of the environment. Extensive simulation studies and real-world experiments demonstrate that the proposed approach consistently outperforms baseline methods in terms of planning efficiency and navigation safety, in environments with and without human presence.
Chinese Translation
自主导航需要在复杂且可能动态的环境中安全高效地规划以到达目标。基于图搜索的算法因其通用性和在配备可接受启发式时的理论保证而被广泛采用。然而,图搜索的计算复杂性随着搜索空间维度的增加而迅速增长,常常使得在动态环境中进行实时规划变得不可行。本文将离线哈密顿-雅可比(HJ)可达性与在线图搜索相结合,以利用两者的互补优势。预先计算的HJ值函数作为信息性启发式和主动安全约束,减少了在线图搜索过程的计算负担。同时,图搜索使得基于可达性的推理能够融入在线规划,克服了HJ可达性需要对环境具有完全知识这一长期挑战。大量的仿真研究和现实世界实验表明,所提出的方法在规划效率和导航安全性方面始终优于基线方法,无论是在有无人的环境中。
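The idea of amortizing online search with a precomputed value function can be sketched on a grid. As a loudly labeled simplification, the offline Hamilton-Jacobi value function is replaced here by a breadth-first cost-to-go on the static map, which plays the same role of an informative heuristic; the safety-constraint role is not modeled. The function names and the grid are hypothetical.

```python
import heapq
from collections import deque

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))

def free_neighbors(grid, cell, blocked=frozenset()):
    """4-connected neighbors that are in bounds, not walls, and not blocked online."""
    rows, cols = len(grid), len(grid[0])
    for dr, dc in MOVES:
        r, c = cell[0] + dr, cell[1] + dc
        if 0 <= r < rows and 0 <= c < cols and grid[r][c] == 0 and (r, c) not in blocked:
            yield (r, c)

def offline_cost_to_go(grid, goal):
    """Offline stage: steps-to-goal on the static map (stand-in for an HJ value function)."""
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        cell = queue.popleft()
        for nxt in free_neighbors(grid, cell):
            if nxt not in dist:
                dist[nxt] = dist[cell] + 1
                queue.append(nxt)
    return dist

def online_astar(grid, start, goal, value, blocked):
    """Online stage: A* that uses the precomputed value as its heuristic."""
    frontier = [(value[start], 0, start)]
    best_g = {start: 0}
    parent = {start: None}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        if g > best_g[cell]:
            continue  # stale heap entry
        for nxt in free_neighbors(grid, cell, blocked):
            if nxt in value and g + 1 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g + 1
                parent[nxt] = cell
                heapq.heappush(frontier, (g + 1 + value[nxt], g + 1, nxt))
    return None

grid = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
start, goal = (2, 0), (2, 4)
value = offline_cost_to_go(grid, goal)
# A newly observed obstacle (for example a person) blocks the middle corridor.
path = online_astar(grid, start, goal, value, blocked={(2, 2)})
```

The static cost-to-go never overestimates the true cost once extra obstacles appear, because added obstacles can only lengthen paths, so the heuristic stays admissible for the online search.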
cs.RO / 41 / 2604.17706
OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL
OmniVLA-RL:具有空间理解和在线强化学习的视觉-语言-行动模型
Abstract
Vision-Language-Action (VLA) models represent a paradigm shift in embodied AI, yet existing frameworks often struggle with imprecise spatial perception, suboptimal multimodal fusion, and instability in reinforcement learning. To bridge these gaps, we propose OmniVLA-RL, a novel architecture that leverages a Mix-of-Transformers (MoT) design to synergistically integrate reasoning, spatial, and action experts. Furthermore, we introduce Flow-GSPO, which reformulates flow matching as a Stochastic Differential Equation (SDE) process and integrates it with Group Segmented Policy Optimization (GSPO) to enhance action precision and training robustness. Extensive evaluations on the LIBERO and LIBERO-Plus benchmarks demonstrate that OmniVLA-RL significantly outperforms state-of-the-art methods, effectively overcoming the fundamental limitations of current VLA models.
Chinese Translation
视觉-语言-行动(VLA)模型代表了具身人工智能的范式转变,但现有框架往往在空间感知不精确、多模态融合不理想以及强化学习不稳定等方面存在困难。为了解决这些问题,我们提出了OmniVLA-RL,这是一种新颖的架构,利用混合变换器(Mix-of-Transformers, MoT)设计,协同整合推理、空间和行动专家。此外,我们引入了Flow-GSPO,将流匹配重新表述为随机微分方程(Stochastic Differential Equation, SDE)过程,并将其与组分段策略优化(Group Segmented Policy Optimization, GSPO)结合,以提高行动精度和训练稳健性。在LIBERO和LIBERO-Plus基准上的广泛评估表明,OmniVLA-RL显著优于现有的最先进方法,有效克服了当前VLA模型的基本局限性。
cs.RO / 42 / 2604.17787
AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models
AnchorRefine:基于轨迹锚点和残差精炼的视觉-语言-动作模型协同操控
Abstract
Precision-critical manipulation requires both global trajectory organization and local execution correction, yet most vision-language-action (VLA) policies generate actions within a single unified space. This monolithic formulation forces macro-level transport and micro-level refinement to be optimized under the same objective, causing large motions to dominate learning while suppressing small but failure-critical corrective signals. In contrast, human manipulation is structured by global movement planning together with continuous local adjustment during execution. Motivated by this principle, we propose AnchorRefine, a hierarchical framework that factorizes VLA action modeling into trajectory anchor and residual refinement. The anchor planner predicts a coarse motion scaffold, while the refinement module corrects execution-level deviations to improve geometric and contact precision. We further introduce a decision-aware gripper refinement mechanism to better capture the discrete and boundary-sensitive nature of gripper control. Experiments on LIBERO, CALVIN, and real-robot tasks demonstrate that AnchorRefine consistently improves both regression-based and diffusion-based VLA backbones, yielding gains of up to 7.8% in simulation success rate and 18% in real-world success rate.
Chinese Translation
精度关键的操控既需要全局轨迹组织,也需要局部执行修正,但大多数视觉-语言-动作(VLA)策略在单一统一空间内生成动作。这种单一的表述迫使宏观层面的运送和微观层面的精炼在同一目标下进行优化,导致大幅度动作主导学习,同时抑制幅度小但对失败至关重要的修正信号。相比之下,人类操控由全局运动规划与执行过程中的持续局部调整共同构成。基于这一原则,我们提出了AnchorRefine,一个将VLA动作建模分解为轨迹锚点和残差精炼的分层框架。锚点规划器预测粗略的运动框架,而精炼模块则修正执行层面的偏差,以提高几何和接触精度。我们进一步引入了一种决策感知的夹爪精炼机制,以更好地捕捉夹爪控制的离散性和边界敏感性。在LIBERO、CALVIN和真实机器人任务上的实验表明,AnchorRefine始终改善了基于回归和基于扩散的VLA骨干网络,在模拟成功率上提高了高达7.8%,在真实世界成功率上提高了18%。
cs.RO / 43 / 2604.17800
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
ReFineVLA:通过教师指导微调实现的多模态推理意识通用机器人策略
Abstract
Vision-Language-Action (VLA) models have gained much attention from the research community thanks to their strength in translating multimodal observations and linguistic instructions into desired robotic actions. Despite these advances, VLAs often overlook explicit reasoning and learn only functional input-action mappings, omitting crucial logical steps; this weakness is especially pronounced in interpretability and generalization for complex, long-horizon manipulation tasks. In this work, we propose ReFineVLA, a multimodal reasoning-aware framework that fine-tunes VLAs with teacher-guided reasoning. We first augment robotic datasets with reasoning rationales generated by an expert teacher model, guiding VLA models to learn to reason about their actions. Then, we fine-tune pre-trained VLAs on the reasoning-enriched datasets with ReFineVLA, maintaining the underlying generalization abilities while boosting reasoning capabilities. We also visualize attention maps to analyze the alignment among visual observations, linguistic prompts, and to-be-executed actions of ReFineVLA, reflecting the model's ability to focus on relevant tasks and actions. Through this additional step, we find that ReFineVLA-trained models exhibit meaningful agreement between the vision-language and action domains, highlighting enhanced multimodal understanding and generalization. Evaluated across a suite of simulated manipulation benchmarks on SimplerEnv with both WidowX and Google Robot tasks, ReFineVLA achieves state-of-the-art performance, improving success rate over the second-best method on both the WidowX benchmark and the Google Robot tasks.
Chinese Translation
视觉-语言-动作(VLA)模型因其将多模态观察与语言指令转化为期望机器人动作的能力而受到研究界的广泛关注。尽管取得了进展,VLA模型往往忽视显式推理,仅学习功能性输入-动作映射,省略了关键的逻辑步骤,这在复杂的长时间操作任务的可解释性和泛化能力上尤为明显。在本研究中,我们提出了ReFineVLA,一个多模态推理意识框架,通过教师指导的推理对VLA进行微调。我们首先通过专家教师模型生成推理依据,增强机器人数据集,引导VLA模型学习其动作的推理。然后,我们利用ReFineVLA对预训练的VLA进行微调,使用增强推理的数据集,同时保持基础的泛化能力并提升推理能力。我们还进行注意力图可视化,以分析ReFineVLA中视觉观察、语言提示和待执行动作之间的对齐情况,反映模型聚焦相关任务和动作的能力。通过这一额外步骤,我们发现ReFineVLA训练的模型在视觉-语言与动作领域之间表现出有意义的一致性,突显了增强的多模态理解和泛化能力。在SimplerEnv上针对WidowX和Google Robot任务的一系列模拟操作基准测试中,ReFineVLA在成功率上超越了第二名方法,达到了最先进的性能。
cs.RO / 44 / 2604.17810
Memory Centric Power Allocation for Multi-Agent Embodied Question Answering
面向记忆的多智能体具身问答的功率分配
Abstract
This paper considers multi-agent embodied question answering (MA-EQA), which aims to query robot teams on what they have seen over a long horizon. In contrast to existing edge resource management methods that emphasize sensing, communication, or computation performance metrics, MA-EQA emphasizes the memory qualities. To cope with this paradigm shift, we propose a quality of memory (QoM) model based on generative adversarial exam (GAE), which leverages forward simulation to assess memory retrieval and uses the resulting exam scores to compute QoM values. Then we propose memory centric power allocation (MCPA), which maximizes the QoM function under communication resource constraints. Through asymptotic analysis, it is found that the transmit powers are proportional to the GAE error probability, thus prioritizing towards high-QoM robots. Extensive experiments demonstrate that MCPA achieves significant improvements over extensive benchmarks in terms of diverse metrics in various scenarios.
Chinese Translation
本文考虑了多智能体具身问答(MA-EQA),其目标是对机器人团队在长时间范围内所见内容进行查询。与现有强调感知、通信或计算性能指标的边缘资源管理方法不同,MA-EQA 强调记忆质量。为了应对这一范式转变,我们提出了一种基于生成对抗考试(GAE)的记忆质量(QoM)模型,该模型利用前向仿真来评估记忆检索,并使用得到的考试分数来计算 QoM 值。然后,我们提出了面向记忆的功率分配(MCPA),在通信资源约束下最大化 QoM 函数。通过渐近分析发现,发射功率与 GAE 错误概率成正比,从而优先考虑高 QoM 机器人。大量实验表明,MCPA 在各种场景下的多种指标上相较于广泛基准测试取得了显著改善。
cs.RO / 45 / 2604.17830
SYMBOLIZER: Symbolic Model-free Task Planning with VLMs
SYMBOLIZER:基于视觉语言模型的符号无模型任务规划
Abstract
Traditional Task and Motion Planning (TAMP) systems depend on physics models for motion planning and discrete symbolic models for task planning. Although physics models are often available, symbolic models (consisting of symbolic state interpretation and action models) must be meticulously handcrafted or learned from labeled data. This process is resource-intensive and ties the solution to a specific domain, limiting scalability and adaptability. On the other hand, Visual Language Models (VLMs) show desirable zero-shot visual understanding (due to their extensive training on heterogeneous data), but still achieve limited planning capabilities. Therefore, integrating VLMs with classical planning for long-horizon reasoning in TAMP problems offers high potential. Recent works in this direction still lack generality and depend on handcrafted, task-specific solutions, e.g., describing all possible objects in advance or using symbolic action models. We propose a framework that generalizes well to unseen problem instances. The method requires only lifted predicates describing relations among objects and uses VLMs to ground them from images to obtain the symbolic state. Planning is performed with domain-independent heuristic search using goal-count and width-based heuristics, without the need for action models. Symbolic search over the VLM-grounded state space outperforms direct VLM-based planning and performs on par with approaches that use a VLM-derived heuristic. This shows that domain-independent search can effectively solve problems across domains with large combinatorial state spaces. We extensively evaluate our method and achieve state-of-the-art results on the ProDG and ViPlan benchmarks.
Chinese Translation
传统的任务与运动规划(TAMP)系统依赖于物理模型进行运动规划,以及离散的符号模型进行任务规划。尽管物理模型通常是可用的,但符号模型(包括符号状态解释和动作模型)必须经过精心手工制作或从标记数据中学习。这个过程既耗费资源,又限制了解决方案的特定领域,限制了可扩展性和适应性。另一方面,视觉语言模型(VLMs)展示了理想的零-shot视觉理解(由于其在异构数据上的广泛训练),但仍然实现有限的规划能力。因此,将VLM与经典规划结合以解决TAMP问题中的长时间推理具有很大的潜力。近期在这一方向的研究仍然缺乏普遍性,并依赖于手工制作的特定任务解决方案,例如提前描述所有可能的对象,或使用符号动作模型。我们提出了一个框架,能够很好地推广到未见问题实例。该方法仅需提升谓词来描述对象之间的关系,并利用VLM从图像中获取符号状态。规划通过使用目标计数和宽度基础启发式的领域无关启发式搜索进行,而无需动作模型。基于VLM的符号搜索在状态空间上优于直接基于VLM的规划,并且与使用VLM派生启发式的方法表现相当。这表明领域无关搜索能够有效解决具有大组合状态空间的跨领域问题。我们对该方法进行了广泛评估,并在ProDG和ViPlan基准上取得了最先进的结果。
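The goal-count heuristic mentioned in this abstract simply counts the goal atoms that a symbolic state does not yet satisfy. The sketch below pairs it with greedy best-first search over a black-box successor function, so no declarative action model is needed. The toy domain, the atom format `("at", obj, loc)`, and `toy_successors` are hypothetical; in the paper's setting the state would come from VLM grounding, and the width-based heuristic is omitted here.

```python
import heapq
from itertools import count

def goal_count(state, goal):
    """Number of goal atoms not yet true in the state."""
    return len(goal - state)

def greedy_best_first(start, goal, successors):
    """Expand the state with the fewest unsatisfied goal atoms first."""
    tie = count()  # tie-breaker so the heap never compares states
    frontier = [(goal_count(start, goal), next(tie), start, [])]
    seen = {start}
    while frontier:
        h, _, state, plan = heapq.heappop(frontier)
        if h == 0:
            return plan
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(
                    frontier, (goal_count(nxt, goal), next(tie), nxt, plan + [action])
                )
    return None

LOCATIONS = ("table", "shelf", "box")

def toy_successors(state):
    """Black-box simulator stand-in: move any one object to any other location."""
    for atom in state:
        _, obj, loc = atom
        for new_loc in LOCATIONS:
            if new_loc != loc:
                yield ("move", obj, new_loc), (state - {atom}) | {("at", obj, new_loc)}

start = frozenset({("at", "cup", "table"), ("at", "book", "table")})
goal = frozenset({("at", "cup", "shelf"), ("at", "book", "box")})
plan = greedy_best_first(start, goal, toy_successors)
```

Here the search finds a two-step plan that moves each object directly to its goal location; the order of the two moves depends on set iteration order.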
cs.RO / 46 / 2604.17833
DART: Learning-Enhanced Model Predictive Control for Dual-Arm Non-Prehensile Manipulation
DART:增强学习的双臂非抓取操作模型预测控制
Abstract
What appears effortless to a human waiter remains a major challenge for robots. Manipulating objects nonprehensilely on a tray is inherently difficult, and the complexity is amplified in dual-arm settings. Such tasks are highly relevant to service robotics in domains such as hotels and hospitality, where robots must transport and reposition diverse objects with precision. We present DART, a novel dual-arm framework that integrates nonlinear Model Predictive Control (MPC) with an optimization-based impedance controller to achieve accurate object motion relative to a dynamically controlled tray. The framework systematically evaluates three complementary strategies for modeling tray-object dynamics as the state transition function within our MPC formulation: (i) a physics-based analytical model, (ii) an online regression based identification model that adapts in real-time, and (iii) a reinforcement learning-based dynamics model that generalizes across object properties. Our pipeline is validated in simulation with objects of varying mass, geometry, and friction coefficients. Extensive evaluations highlight the trade-offs among the three modeling strategies in terms of settling time, steady-state error, control effort, and generalization across objects. To the best of our knowledge, DART constitutes the first framework for non-prehensile dual-arm manipulation of objects on a tray. Project Link: https://dart-icra.github.io/dart/
Chinese Translation
对人类服务员而言看似轻松的操作,对于机器人来说仍然是一个重大挑战。在托盘上进行非抓取物体操作本质上是困难的,而在双臂设置中,这种复杂性更是加剧。这类任务与服务机器人在酒店和款待等领域密切相关,在这些领域,机器人必须精确地运输和重新定位各种物体。我们提出了DART,一个新颖的双臂框架,它将非线性模型预测控制(MPC)与基于优化的阻抗控制器相结合,以实现相对于动态控制托盘的精确物体运动。该框架系统地评估了三种互补策略,以将托盘-物体动力学建模为我们MPC公式中的状态转移函数:(i)基于物理的解析模型,(ii)实时适应的在线回归识别模型,以及(iii)基于强化学习的动力学模型,能够跨物体属性进行泛化。我们的管道在模拟中进行了验证,使用了不同质量、几何形状和摩擦系数的物体。广泛的评估突出了三种建模策略在稳定时间、稳态误差、控制努力和跨物体泛化方面的权衡。根据我们所知,DART构成了第一个用于托盘上非抓取双臂操作的框架。项目链接:https://dart-icra.github.io/dart/
cs.RO / 47 / 2604.17841
Driving risk emerges from the required two-dimensional joint evasive acceleration
驾驶风险源于所需的二维联合规避加速度
Abstract
Most autonomous driving safety benchmarks use time-to-collision (TTC) to assess risk and guide safe behaviour. However, TTC-based methods treat risk as a one-dimensional closing problem, despite the inherently two-dimensional nature of collision avoidance, and therefore cannot faithfully capture risk or its evolution over time. Here, we report evasive acceleration (EA), a hyperparameter-free and physically interpretable two-dimensional paradigm for risk quantification. By evaluating all possible directions of collision avoidance, EA defines risk as the minimum magnitude of a constant relative acceleration vector required to alter the relative motion and make the interaction collision-free. Using interaction data from five open datasets and more than 600 real crashes, we derive percentile-based warning thresholds and show that EA provides the earliest statistically significant warning across all thresholds. Moreover, EA provides the best discrimination of eventual collision outcomes and improves information retention by 54.2-241.4% over all compared baselines. Adding EA to existing methods yields 17.5-95.5 times more information gain than adding existing methods to EA, indicating that EA captures much of the outcome-relevant information in existing methods while contributing substantial additional nonredundant information. Overall, EA better captures the structure of collision risk and provides a foundation for next-generation autonomous driving systems.
Chinese Translation
大多数自动驾驶安全基准使用碰撞时间(TTC)来评估风险并指导安全行为。然而,基于TTC的方法将风险视为一维的接近问题,尽管碰撞规避本质上是二维的,因此无法真实捕捉风险或其随时间的演变。在此,我们报告了规避加速度(EA),这是一种无超参数且具有物理可解释性的二维风险量化范式。通过评估所有可能的碰撞规避方向,EA将风险定义为改变相对运动并使交互无碰撞所需的恒定相对加速度向量的最小大小。利用来自五个开放数据集和600多起真实碰撞的交互数据,我们推导出基于百分位数的警告阈值,并显示EA在所有阈值中提供了最早的统计显著警告。此外,EA在最终碰撞结果的区分能力上表现最佳,并在所有比较基准上提高了信息保留率54.2-241.4%。将EA添加到现有方法中所获得的信息增益是将现有方法添加到EA中所获得的17.5-95.5倍,表明EA捕捉了现有方法中大量与结果相关的信息,同时贡献了大量额外的非冗余信息。总体而言,EA更好地捕捉了碰撞风险的结构,并为下一代自动驾驶系统提供了基础。
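The definition of evasive acceleration (EA) given in this abstract can be made concrete with a brute-force sketch. The assumptions are mine, not the paper's: both road users are discs with a combined radius `radius`, the relative motion is $p(t) = p_0 + v_0 t + \tfrac{1}{2} a t^2$ over a finite horizon, and the minimum is found by scanning a grid of magnitudes and directions rather than by any closed-form solution.

```python
import numpy as np

def min_distance(p0, v0, a, horizon=5.0, dt=0.01):
    """Smallest relative distance over the horizon under constant relative acceleration a."""
    t = np.arange(0.0, horizon + dt, dt)[:, None]
    traj = p0 + v0 * t + 0.5 * a * t ** 2
    return np.linalg.norm(traj, axis=1).min()

def evasive_acceleration(p0, v0, radius, a_max=10.0, n_mag=101, n_dir=72, horizon=5.0):
    """Smallest grid magnitude for which some direction keeps the pair collision-free."""
    p0 = np.asarray(p0, dtype=float)
    v0 = np.asarray(v0, dtype=float)
    angles = np.linspace(0.0, 2.0 * np.pi, n_dir, endpoint=False)
    directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    for magnitude in np.linspace(0.0, a_max, n_mag):
        for direction in directions:
            if min_distance(p0, v0, magnitude * direction, horizon) > radius:
                return float(magnitude)
    return float("inf")  # no evasive option within a_max

ea_diverging = evasive_acceleration([10.0, 0.0], [1.0, 0.0], radius=2.0)
ea_offset = evasive_acceleration([10.0, 5.0], [-5.0, 0.0], radius=2.0)
ea_far = evasive_acceleration([10.0, 0.0], [-5.0, 0.0], radius=2.0)
ea_near = evasive_acceleration([5.0, 0.0], [-5.0, 0.0], radius=2.0)
```

Unlike a one-dimensional closing measure, the scan considers braking, swerving, and combinations of both; interactions that are already collision-free score zero, and a head-on approach scores higher the closer it starts.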
cs.RO / 48 / 2604.17863
Periodic Steady-State Control of a Handkerchief-Spinning Task Using a Parallel Anti-Parallelogram Tendon-driven Wrist
基于平行反平行四边形腱驱动腕关节的手帕旋转任务的周期稳态控制
Abstract
Spinning flexible objects, exemplified by traditional Chinese handkerchief performances, demands periodic steady-state motions under nonlinear dynamics with frictional contacts and boundary constraints. To address these challenges, we first design an intuitive dexterous wrist based on a parallel anti-parallelogram tendon-driven structure, which achieves 90 degrees omnidirectional rotation with low inertia and decoupled roll-pitch sensing, and implement a high-low level hierarchical control scheme. We then develop a particle-spring model of the handkerchief for control-oriented abstraction and strategy evaluation. Hardware experiments validate this framework, achieving an unfolding ratio of approximately 99% and fingertip tracking error of RMSE = 2.88 mm in high-dynamic spinning. These results demonstrate that integrating control-oriented modeling with a task-tailored dexterous wrist enables robust rest-to-steady-state transitions and precise periodic manipulation of highly flexible objects. More visualizations: https://slowly1113.github.io/icra2026-handkerchief/
Chinese Translation
旋转柔性物体,例如传统中国手帕表演,要求在具有摩擦接触和边界约束的非线性动态下进行周期稳态运动。为了解决这些挑战,我们首先设计了一种基于平行反平行四边形腱驱动结构的直观灵巧腕关节,该结构实现了90度的全向旋转,具有低惯性和解耦的滚转-俯仰传感,并实施了高低层次的分层控制方案。接着,我们开发了一个手帕的粒子-弹簧模型,以便进行控制导向的抽象和策略评估。硬件实验验证了该框架,在高动态旋转中实现了约99%的展开比和指尖跟踪误差的均方根误差为2.88毫米。这些结果表明,将控制导向建模与任务定制的灵巧腕关节相结合,可以实现稳健的静止到稳态过渡以及对高度柔性物体的精确周期操控。更多可视化内容请访问:https://slowly1113.github.io/icra2026-handkerchief/
cs.RO / 49 / 2604.17876
OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
OFlow:注入对象感知的时间流匹配以增强机器人操作的鲁棒性
Abstract
Robust robotic manipulation requires not only predicting how the scene evolves over time, but also recognizing task-relevant objects in complex scenes. However, existing VLA models face two limitations. They typically act only on the current frame, while future prediction and object-aware reasoning are often learned in separate latent spaces. We propose OFlow (injecting Object-Aware Temporal Flow Matching into VLAs), a framework that addresses both limitations by unifying temporal foresight and object-aware reasoning in a shared semantic latent space. Our method forecasts future latents with temporal flow matching, factorizes them into object-aware representations that emphasize physically relevant cues while filtering task-irrelevant variation, and conditions continuous action generation on these predictions. By integrating OFlow into VLA pipelines, our method enables more reliable control under distribution shifts. Extensive experiments across LIBERO, LIBERO-Plus, MetaWorld, and SimplerEnv benchmarks and real-world tasks demonstrate that object-aware foresight consistently enhances robustness and success.
Chinese Translation
鲁棒的机器人操作不仅需要预测场景随时间的演变,还需要在复杂场景中识别与任务相关的对象。然而,现有的视觉-语言-动作(VLA)模型面临两个限制。它们通常仅在当前帧上进行操作,而未来预测和对象感知推理往往在不同的潜在空间中学习。我们提出了OFlow(将对象感知的时间流匹配注入VLA),这是一个通过在共享的语义潜在空间中统一时间前瞻和对象感知推理来解决这两个限制的框架。我们的方法通过时间流匹配预测未来的潜在表示,将其分解为强调物理相关线索的对象感知表示,同时过滤掉与任务无关的变化,并以这些预测为条件生成连续的动作。通过将OFlow集成到VLA管道中,我们的方法在分布变化下实现了更可靠的控制。在LIBERO、LIBERO-Plus、MetaWorld和SimplerEnv基准以及真实世界任务中的广泛实验表明,对象感知的前瞻性始终增强了鲁棒性和成功率。
cs.RO / 50 / 2604.17880
ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation
ST-$\pi$:用于机器人操作的结构化时空视觉-语言-动作模型
Abstract
Vision-language-action (VLA) models have achieved great success on general robotic tasks, but still face challenges in fine-grained spatiotemporal manipulation. Typically, existing methods mainly embed spatiotemporal knowledge into visual and action representations, and directly perform a cross-modal mapping for step-level action prediction. However, such spatiotemporal reasoning remains largely implicit, making it difficult to handle multiple sequential behaviors with explicit spatiotemporal boundaries. In this work, we propose ST-$\pi$, a structured spatiotemporal VLA model for robotic manipulation. Our model is guided by two key designs: 1) Spatiotemporal VLM. We encode 4D observations and task instructions into latent spaces, and feed them into the LLM to generate a sequence of causally ordered chunk-level action prompts consisting of sub-tasks, spatial grounding and temporal grounding. 2) Spatiotemporal action expert. Conditioned on chunk-level action prompts, we design a structured dual-generator guidance to jointly model spatial dependencies and temporal causality, thus predicting step-level action parameters. Within this structured framework, the VLM explicitly plans global spatiotemporal behavior, and the action expert further refines local spatiotemporal control. In addition, we propose a real-world robotic dataset with structured spatiotemporal annotations for fine-tuning. Extensive experiments have been conducted to demonstrate the effectiveness of our model. Our code link: https://github.com/chuanhaoma/ST-pi.
Chinese Translation
视觉-语言-动作(VLA)模型在一般机器人任务中取得了巨大的成功,但在细粒度时空操作方面仍面临挑战。现有方法通常将时空知识嵌入视觉和动作表示中,并直接执行跨模态映射以进行步骤级动作预测。然而,这种时空推理在很大程度上仍然是隐式的,使得处理具有明确时空边界的多个顺序行为变得困难。在本研究中,我们提出了ST-$\pi$,一种用于机器人操作的结构化时空VLA模型。我们的模型由两个关键设计指导:1)时空视觉-语言模型(VLM)。我们将4D观测和任务指令编码到潜在空间中,并将其输入到大语言模型(LLM)中,以生成由子任务、空间基础和时间基础组成的因果顺序的块级动作提示序列。2)时空动作专家。在块级动作提示的条件下,我们设计了一种结构化的双生成器指导,以共同建模空间依赖性和时间因果性,从而预测步骤级动作参数。在这个结构化框架内,VLM明确规划全局时空行为,而动作专家进一步细化局部时空控制。此外,我们提出了一个具有结构化时空注释的真实世界机器人数据集,以便进行微调。我们进行了大量实验,以证明我们模型的有效性。我们的代码链接: https://github.com/chuanhaoma/ST-pi。
cs.RO / 51 / 2604.17887
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
StableIDM:通过时空精炼稳定逆动力学模型以应对操纵器截断
Abstract
Inverse Dynamics Models (IDMs) map visual observations to low-level action commands, serving as central components for data labeling and policy execution in embodied AI. However, their performance degrades severely under manipulator truncation, a common failure mode that makes state recovery ill-posed and leads to unstable control. We present StableIDM, a spatio-temporal framework that refines features from visual inputs to stabilize action predictions under such partial observability. StableIDM integrates three complementary components: (1) auxiliary robot-centric masking to suppress background clutter, (2) Directional Feature Aggregation (DFA) for geometry-aware spatial reasoning, which extracts anisotropic features along directions inferred from the visible arm and (3) Temporal Dynamics Refinement (TDR) to smooth and correct predictions via motion continuity. Extensive evaluations validate our approach: StableIDM improves strict action accuracy by 12.1% under severe truncation on the AgiBot benchmark, and increases average task success by 9.7% in real-robot replay. Moreover, it boosts end-to-end grasp success by 11.5% when decoding video-generated plans, and improves downstream VLA real-robot success by 17.6% when functioning as an automatic annotator. These results demonstrate that StableIDM provides a robust and scalable backbone for both policy execution and data generation in embodied artificial intelligence.
Chinese Translation
逆动力学模型(IDMs)将视觉观察映射到低级动作指令,是具身人工智能中数据标注和策略执行的核心组成部分。然而,在操纵器截断这一常见故障模式下,它们的性能严重下降,这使得状态恢复变得不适定,并导致控制不稳定。我们提出了StableIDM,这是一种时空框架,通过从视觉输入中精炼特征,以稳定在这种部分可观测性下的动作预测。StableIDM整合了三个互补组件:(1)辅助机器人中心的掩蔽以抑制背景杂乱,(2)方向特征聚合(Directional Feature Aggregation, DFA)用于几何感知的空间推理,提取沿可见手臂推断方向的各向异性特征,以及(3)时间动态精炼(Temporal Dynamics Refinement, TDR)通过运动连续性平滑和修正预测。大量评估验证了我们的方法:在AgiBot基准测试中,StableIDM在严重截断下提高了12.1%的严格动作准确率,并在真实机器人重放中将平均任务成功率提高了9.7%。此外,当解码视频生成的计划时,它将端到端抓取成功率提高了11.5%,并在作为自动标注器时提高了下游VLA真实机器人成功率17.6%。这些结果表明,StableIDM为具身人工智能中的策略执行和数据生成提供了一个稳健且可扩展的基础。
cs.RO / 52 / 2604.17888
SpaceDex: Generalizable Dexterous Grasping in Tiered Workspaces
SpaceDex:在分层工作空间中可泛化的灵巧抓取
Abstract
Generalizable grasping with high-degree-of-freedom (DoF) dexterous hands remains challenging in tiered workspaces, where occlusion, narrow clearances, and height-dependent constraints are substantially stronger than in open tabletop scenes. Most existing methods are evaluated in relatively unoccluded settings and typically do not explicitly model the distinct control requirements of arm navigation and hand articulation under spatial constraints. We present SpaceDex, a hierarchical framework for dexterous manipulation in constrained 3D environments. At the high level, a Vision-Language Model (VLM) planner parses user intent, reasons about occlusion and height relations across multiple camera views, and generates target bounding boxes for zero-shot segmentation and mask tracking. This stage provides structured spatial guidance for downstream control instead of relying on single-view target selection. At the low level, we introduce an arm-hand Feature Separation Network that decouples global trajectory control for the arm from geometry-aware grasp mode selection for the hand, reducing feature interference between reaching and grasping objectives. The controller further integrates multi-view perception, fingertip tactile sensing, and a small set of recovery demonstrations to improve robustness to partial observability and off-nominal contacts. In 100 real-world trials involving over 30 unseen objects across four categories, SpaceDex achieves a 63.0% success rate, compared with 39.0% for a strong tabletop baseline. These results indicate that combining hierarchical spatial planning with arm-hand representation decoupling improves dexterous grasping performance in spatially constrained environments.
Chinese Translation
在分层工作空间中,高自由度(DoF)灵巧手的可泛化抓取仍然面临挑战,因为在这些环境中,遮挡、狭窄间隙和高度依赖的约束比开放的桌面场景更为显著。大多数现有方法是在相对无遮挡的环境中进行评估,通常没有明确建模在空间约束下手臂导航和手部关节运动的不同控制需求。我们提出了SpaceDex,一个用于在受限三维环境中进行灵巧操作的分层框架。在高层次上,视觉-语言模型(Vision-Language Model, VLM)规划器解析用户意图,推理多个相机视角下的遮挡和高度关系,并生成用于零-shot分割和掩膜跟踪的目标边界框。该阶段为下游控制提供结构化的空间指导,而不是依赖单视图目标选择。在低层次上,我们引入了一种手臂-手部特征分离网络(Feature Separation Network),将手臂的全局轨迹控制与手部的几何感知抓取模式选择解耦,从而减少了抓取和到达目标之间的特征干扰。控制器进一步整合了多视角感知、指尖触觉传感和一小组恢复演示,以提高对部分可观测性和非正常接触的鲁棒性。在涉及四个类别、超过30个未见物体的100次真实世界试验中,SpaceDex的成功率达到了63.0%,而强大的桌面基线仅为39.0%。这些结果表明,将分层空间规划与手臂-手部表示解耦相结合,可以提高在空间受限环境中的灵巧抓取性能。
cs.RO / 53 / 2604.17895
Locomotion of an Elastic Snake Robot via Natural Dynamics
通过自然动力学实现弹性蛇形机器人运动
Abstract
Nature suggests that exploiting the elasticities and natural dynamics of robotic systems could increase their locomotion efficiency. Prior work on elastic snake robots supports this hypothesis, but has not fully exploited the nonlinear dynamic behavior of the systems. Recent advances in eigenmanifold theory enable a better characterization of the natural dynamics in complex nonlinear systems. This letter investigates if and how the nonlinear natural dynamics of a kinematic elastic snake robot can be used to design efficient gaits. Two types of gaits based on natural dynamics are presented and compared to a state-of-the-art approach using dynamics simulations. The results reveal that a gait generated by switching between two nonlinear normal modes does not improve the locomotion efficiency of the robot. In contrast, gaits based on non-brake periodic trajectories (non-brake orbits) are perfectly efficient in the energy-conservative case. Further simulations with friction reveal that, in a more realistic scenario, non-brake orbit gaits achieve higher efficiency compared to the baseline gait on the rigid system. Overall, the investigation offers promising insights into the design of gaits based on natural dynamics, fostering further research.
Chinese Translation
自然界表明,利用机器人系统的弹性和自然动力学可以提高其运动效率。先前关于弹性蛇形机器人的研究支持了这一假设,但尚未充分利用系统的非线性动态行为。最近在特征流形理论方面的进展使得能够更好地表征复杂非线性系统中的自然动力学。本文探讨了运动学弹性蛇形机器人的非线性自然动力学是否以及如何用于设计高效的步态。提出了两种基于自然动力学的步态,并与使用动态仿真的最先进方法进行了比较。结果显示,通过在两个非线性正常模式之间切换生成的步态并未提高机器人的运动效率。相反,基于非制动周期轨迹(非制动轨道)的步态在能量守恒的情况下表现出完美的效率。进一步的摩擦仿真表明,在更现实的场景中,非制动轨道步态相比于刚性系统的基线步态实现了更高的效率。总体而言,这项研究为基于自然动力学的步态设计提供了有希望的见解,促进了进一步的研究。
cs.RO / 54 / 2604.18000
Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
揭示视觉-语言-行动模型中具身推理的幻觉
Abstract
Recent Vision-Language-Action (VLA) models report impressive success rates on standard robotic benchmarks, fueling optimism about general-purpose physical intelligence. However, recent evidence suggests a systematic misalignment between standard benchmark success and true embodied reasoning, raising the question of whether these high scores reflect genuine cognitive capability. To address this gap, we introduce BeTTER, a diagnostic Benchmark for Testing True Embodied Reasoning in robotic policies. BeTTER applies targeted causal interventions (e.g., spatial layout shifts, temporal extrapolation) while enforcing kinematic isolation to explicitly decouple high-level reasoning failures from low-level execution limits. Through systematic evaluation, we reveal that state-of-the-art VLAs catastrophically fail in dynamic scenarios, exhibiting severe lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse. Crucially, our mechanistic analysis traces these symptoms to fundamental architectural bottlenecks - such as capacity compression and myopic downsampling - which systematically degrade the model's foundational semantic representation. We demonstrate that highly static evaluation protocols effectively mask this degradation by allowing optimization to overfit to sensorimotor priors. Supported by real-world robotic validation, our findings confirm that this representational breakdown is not a simulation artifact, highlighting the critical need for future VLA paradigms to resolve the structural tension between high-frequency control and high-level reasoning.
Chinese Translation
最近的视觉-语言-行动(VLA)模型在标准机器人基准测试中报告了令人印象深刻的成功率,这激发了对通用物理智能的乐观预期。然而,最近的证据表明,标准基准成功与真实的具身推理之间存在系统性不一致,这引发了一个问题:这些高分是否反映了真正的认知能力。为了解决这一差距,我们引入了 BeTTER,一个用于测试机器人策略中真实具身推理的诊断基准。BeTTER 通过施加针对性的因果干预(例如,空间布局变化、时间外推),同时强制进行运动学隔离,以明确将高层推理失败与低层执行限制解耦。通过系统评估,我们揭示了最先进的 VLA 在动态场景中灾难性失败,表现出严重的词汇-运动学捷径、行为惯性和语义特征崩溃。关键是,我们的机制分析将这些症状追溯到基本的架构瓶颈——如容量压缩和短视下采样——这些瓶颈系统性地削弱了模型的基础语义表示。我们证明,高度静态的评估协议有效地掩盖了这种退化,因为它允许优化过度拟合于传感器运动先验。通过真实世界的机器人验证,我们的发现确认这种表示崩溃并非模拟伪影,强调了未来 VLA 模式在高频控制与高层推理之间解决结构性张力的关键需求。
cs.RO / 55 / 2604.18090
Muscle-inspired magnetic actuators that push, pull, crawl, and grasp
肌肉启发的磁驱动器:推动、拉动、爬行与抓取
Abstract
Functional magnetic composites capable of large deformation, load bearing, and multifunctional motion are essential for next-generation adaptive soft robots. Here, we present muscle-inspired magnetic actuators (MMA), additively manufactured from a thermoplastic/permanent magnet polyurethane/Nd2Fe14B (TPU/MQP-S) composite using laser powder bed fusion (LPBF). By tuning the laser-energy scale between 1.0 and 3.0, both mechanical stiffness and magnetic response are precisely controlled: the tensile strength increases from 0.28 to 0.99 MPa while maintaining 30-45% elongation at break. This process enables the creation of 0.5 mm-thick flexural hinges, which reversibly bend and fold under moderate magnetic fields without damage. Two actuator types are reported, showing the system's versatility. The elongated actuator with self-weight of 1.57 g, magnetized in its contracted state, achieves linear contraction under a 500 mT field, lifting 50 g (32x its own weight) and sustaining performance over at least 50 cycles. Equipped with anisotropic frictional feet, it supports movement of a magnetic crawling robot that achieves up to 100% locomotion success on textured substrates. The expandable actuator exhibits reversible opening and closing under a 300 mT field, reliably grasping and releasing different objects, including soft berries and rigid 3D printed geometries. It can also anchor in a tube while holding suspended 50 g loads. This work demonstrates an LPBF-based strategy to program both stiffness and magnetization within a single material system, enabling remotely driven, reconfigurable, and fatigue-resistant soft actuators. The approach opens new possibilities for force-controlled, multifunctional magnetic soft robots for adaptive gripping, locomotion, and minimally invasive manipulation of biomedical tools.
Chinese Translation
能够大幅变形、承载负荷和实现多功能运动的功能性磁复合材料对于下一代自适应软机器人至关重要。在此,我们提出了一种肌肉启发的磁驱动器(Muscle-inspired Magnetic Actuators, MMA),该驱动器通过激光粉末床熔融(Laser Powder Bed Fusion, LPBF)技术从热塑性/永磁聚氨酯/Nd2Fe14B(TPU/MQP-S)复合材料中增材制造而成。通过调节激光能量范围在1.0至3.0之间,机械刚度和磁响应得以精确控制:拉伸强度从0.28 MPa提高到0.99 MPa,同时保持30-45%的断裂延伸率。该工艺使得能够制造出0.5毫米厚的弯曲铰链,这些铰链在适度的磁场下可反复弯曲和折叠而不受损坏。报告了两种类型的驱动器,展示了系统的多样性。自重为1.57克的延长型驱动器在收缩状态下被磁化,在500 mT的磁场下实现线性收缩,能够提升50克(其自身重量的32倍),并在至少50个循环中保持性能。配备各向异性摩擦脚的驱动器支持一种磁性爬行机器人在纹理基底上实现高达100%的运动成功率。可扩展驱动器在300 mT的磁场下表现出可逆的开合,可靠地抓取和释放不同的物体,包括柔软的浆果和刚性的3D打印几何体。它还可以在管道中固定,同时承载悬挂的50克负载。本研究展示了一种基于LPBF的策略,在单一材料系统内编程刚度和磁化,使得远程驱动、可重构和耐疲劳的软驱动器成为可能。这一方法为力控、多功能磁性软机器人在自适应抓取、运动和微创操作生物医学工具方面开辟了新的可能性。
cs.RO / 56 / 2604.18126
Chatting about Conditional Trajectory Prediction
关于条件轨迹预测的对话
Abstract
Human behavior is characterized by mutual dependencies, which requires human-robot interactive systems to predict surrounding agents' trajectories by modeling complex social interactions, avoiding collisions and executing safe path planning. While there exist many trajectory prediction methods, most of them do not incorporate the own motion of the ego agent and only model interactions based on static information. We are inspired by the human theory of mind during trajectory selection and propose a Cross time domain intention-interactive method for conditional Trajectory prediction (CiT). Our proposed CiT conducts joint analysis of behavior intentions over time, and achieves information complementarity and integration across different time domains. The intention in its own time domain can be corrected by the social interaction information from the other time domain to obtain a more precise intention representation. In addition, CiT is designed to closely integrate with robotic motion planning and control modules, capable of generating a set of optional trajectory prediction results for all surrounding agents based on potential motions of the ego agent. Extensive experiments demonstrate that the proposed CiT significantly outperforms the existing methods, achieving state-of-the-art performance in the benchmarks.
Chinese Translation
人类行为具有相互依赖的特性,这要求人机交互系统通过建模复杂的社会互动来预测周围代理的轨迹,避免碰撞并执行安全路径规划。尽管存在许多轨迹预测方法,但大多数方法并未考虑自我代理的运动,仅基于静态信息建模交互。我们受到人类心智理论在轨迹选择中的启发,提出了一种跨时间域意图交互方法用于条件轨迹预测(Conditional Trajectory prediction, CiT)。我们提出的CiT对行为意图进行跨时间的联合分析,实现了不同时间域之间信息的互补和整合。自身时间域中的意图可以通过来自其他时间域的社会交互信息进行修正,从而获得更精确的意图表示。此外,CiT设计上与机器人运动规划和控制模块紧密集成,能够基于自我代理的潜在运动为所有周围代理生成一组可选的轨迹预测结果。大量实验表明,所提出的CiT显著优于现有方法,在基准测试中实现了最先进的性能。
cs.RO / 57 / 2604.18236
COFFAIL: A Dataset of Successful and Anomalous Robot Skill Executions in the Context of Coffee Preparation
COFFAIL:咖啡准备过程中成功与异常机器人技能执行的数据集
Abstract
In the context of robot learning for manipulation, curated datasets are an important resource for advancing the state of the art; however, available datasets typically only include successful executions or are focused on one particular type of skill. In this short paper, we briefly describe a dataset of various skills performed in the context of coffee preparation. The dataset, which we call COFFAIL, includes both successful and anomalous skill execution episodes collected with a physical robot in a kitchen environment, a couple of which are performed with bimanual manipulation. In addition to describing the data collection setup and the collected data, the paper illustrates the use of the data in COFFAIL to learn a robot policy using imitation learning.
Chinese Translation
在机器人操作学习的背景下,精心策划的数据集是推动技术进步的重要资源;然而,现有的数据集通常仅包括成功的执行案例或专注于某一特定类型的技能。在本文中,我们简要描述了一个在咖啡准备过程中执行的各种技能的数据集。我们称之为COFFAIL,该数据集包括在厨房环境中使用物理机器人收集的成功和异常技能执行案例,其中一些是通过双手操作完成的。除了描述数据收集的设置和收集到的数据外,本文还展示了如何利用COFFAIL中的数据通过模仿学习来学习机器人策略。
cs.RO / 58 / 2604.18271
EmbodiedLGR: Integrating Lightweight Graph Representation and Retrieval for Semantic-Spatial Memory in Robotic Agents
EmbodiedLGR:将轻量级图表示与检索集成用于机器人智能体的语义-空间记忆
Abstract
As the world of agentic artificial intelligence applied to robotics evolves, the need for agents capable of building and retrieving memories and observations efficiently is increasing. Robots operating in complex environments must build memory structures to enable useful human-robot interactions by leveraging the mnemonic representation of the current operating context. People interacting with robots may expect the embodied agent to provide information about locations, events, or objects, which requires the agent to provide precise answers within human-like inference times to be perceived as responsive. We propose the Embodied Light Graph Retrieval Agent (EmbodiedLGR-Agent), a visual-language model (VLM)-driven agent architecture that constructs dense and efficient representations of robot operating environments. EmbodiedLGR-Agent directly addresses the need for an efficient memory representation of the environment by providing a hybrid building-retrieval approach built on parameter-efficient VLMs that store low-level information about objects and their positions in a semantic graph, while retaining high-level descriptions of the observed scenes with a traditional retrieval-augmented architecture. EmbodiedLGR-Agent is evaluated on the popular NaVQA dataset, achieving state-of-the-art performance in inference and querying times for embodied agents, while retaining competitive accuracy on the global task relative to the current state-of-the-art approaches. Moreover, EmbodiedLGR-Agent was successfully deployed on a physical robot, showing practical utility in real-world contexts through human-robot interaction, while running the visual-language model and the building-retrieval pipeline locally.
Chinese Translation
随着应用于机器人技术的代理人工智能领域的发展,能够高效构建和检索记忆与观察的智能体的需求日益增加。在复杂环境中操作的机器人必须构建记忆结构,以利用当前操作上下文的助记表示,从而实现有用的人机交互。与机器人互动的人们可能期望具身智能体提供关于位置、事件或物体的信息,这要求智能体在类人推理时间内提供精确的答案,以被视为具有响应能力。我们提出了具身轻量图检索智能体(EmbodiedLGR-Agent),这是一种基于视觉-语言模型(VLM)的智能体架构,能够构建密集且高效的机器人操作环境表示。EmbodiedLGR-Agent 直接解决了环境高效记忆表示的需求,采用基于参数高效的 VLM 的混合构建-检索方法,存储有关物体及其位置的低级信息于语义图中,同时保留观察场景的高级描述,结合传统的检索增强架构。EmbodiedLGR-Agent 在流行的 NaVQA 数据集上进行了评估,在推理和查询时间方面达到了具身智能体的最先进性能,同时在全球任务的准确性上与当前最先进的方法保持竞争力。此外,EmbodiedLGR-Agent 已成功部署在物理机器人上,通过人机交互展示了在现实世界情境中的实际效用,同时在本地运行视觉-语言模型和构建-检索管道。
cs.RO / 59 / 2604.18289
Relative State Estimation using Event-Based Propeller Sensing
基于事件的螺旋桨传感的相对状态估计
Abstract
Autonomous swarms of multiple Unmanned Aerial Vehicles (UAVs) require accurate and fast relative state estimation. Although monocular frame-based camera methods perform well in ideal conditions, they are slow, suffer from scale ambiguity, and often struggle in visually challenging conditions. The advent of event cameras addresses these challenging tasks by providing low latency, high dynamic range, and microsecond-level temporal resolution. This paper proposes a framework for relative state estimation for quadrotors using event-based propeller sensing. The propellers in the event stream are tracked by detection to extract the regions of interest. The event streams in these regions are processed in temporal chunks to estimate per-propeller frequencies. These frequency measurements drive a kinematic state estimation module as a thrust input, while camera-derived position measurements provide the update step. Additionally, we use geometric primitives derived from event streams to estimate the orientation of the quadrotor by fitting an ellipse over a propeller and backprojecting it to recover the body-frame tilt axis. The existing event-based approaches for quadrotor state estimation use the propeller frequency in simulated flight sequences. Our approach estimates the propeller frequency under 3% error on a test dataset of five real-world outdoor flight sequences, providing a method for decentralized relative localization for multi-robot systems using an event camera.
Chinese Translation
自主多无人机(UAV)系统需要准确且快速的相对状态估计。尽管单目框架摄像头方法在理想条件下表现良好,但它们速度较慢,存在尺度模糊,并且在视觉挑战条件下常常表现不佳。事件摄像头的出现解决了这些挑战性任务,提供了低延迟、高动态范围和微秒级的时间分辨率。本文提出了一种基于事件的螺旋桨传感的四旋翼相对状态估计框架。通过检测跟踪事件流中的螺旋桨,以提取感兴趣区域。这些区域中的事件流以时间块的形式进行处理,以估计每个螺旋桨的频率。这些频率测量作为推力输入驱动运动状态估计模块,而摄像头获取的位置测量则提供更新步骤。此外,我们利用从事件流中提取的几何原语,通过在螺旋桨上拟合椭圆并反投影来估计四旋翼的朝向,以恢复机体坐标系的倾斜轴。现有的基于事件的四旋翼状态估计方法在模拟飞行序列中使用螺旋桨频率。我们的方法在五个真实世界户外飞行序列的测试数据集上估计螺旋桨频率的误差低于3%,为使用事件摄像头的多机器人系统提供了一种去中心化的相对定位方法。
cs.RO / 60 / 2604.18331
Will People Enjoy a Robot Trainer? A Case Study with Snoopie the Pacerbot
人们会喜欢机器人教练吗?以Snoopie机器人为例的研究
Abstract
The physicality of exercise makes the role of athletic trainers unique. Their physical presence allows them to guide a student through a motion, demonstrate an exercise, and give intuitive feedback. Robot quadrupeds are also embodied agents with robust agility and athleticism. In our work, we investigate whether a robot quadruped can serve as an effective and enjoyable personal trainer device. We focus on a case study of interval training for runners: a repetitive, long-horizon task where precision and consistency are important. To meet this challenge, we propose SNOOPIE, an autonomous robot quadruped pacer capable of running interval training exercises tailored to challenge a user's personal abilities. We conduct a set of user experiments that compare the robot trainer to a wearable trainer device (the Apple Watch) to investigate the benefits of a physical embodiment in exercise-based interactions. With the quadruped trainer, participants showed 60.6% better adherence to a pace schedule and were 45.9% more consistent in their running speeds. Subjective results also showed that participants strongly preferred training with the robot over wearable devices across many qualitative axes, including its ease of use (+56.7%), enjoyability of the interaction (+60.6%), and helpfulness (+39.1%). Additional videos and visualizations can be found on our website: https://sites.google.com/view/snoopie
Chinese Translation
运动的身体性使得运动教练的角色独特。他们的身体存在使他们能够指导学生完成动作、演示锻炼并提供直观反馈。机器人四足动物同样是具身的代理,具有强大的灵活性和运动能力。在我们的研究中,我们探讨了机器人四足动物是否可以作为有效且令人愉悦的个人教练设备。我们专注于跑步者的间歇训练案例:这是一项重复性、长期的任务,其中精确性和一致性至关重要。为应对这一挑战,我们提出了SNOOPIE,一种能够进行个性化间歇训练的自主机器人四足动物,旨在挑战用户的个人能力。我们进行了一系列用户实验,将机器人教练与可穿戴教练设备——Apple Watch进行比较,以研究身体具身在基于运动的交互中的优势。结果表明,使用四足教练的参与者在遵循配速计划方面表现出60.6%的更好遵从性,并且在跑步速度的一致性上提高了45.9%。主观结果也显示,参与者在多个定性指标上强烈偏好与机器人训练,而非可穿戴设备,包括使用的便利性(+56.7%)、交互的愉悦感(+60.6%)和帮助程度(+39.1%)。更多视频和可视化内容可以在我们的网站上找到:https://sites.google.com/view/snoopie
cs.RO / 61 / 2604.18336
Enhancing Glass Surface Reconstruction via Depth Prior for Robot Navigation
通过深度先验增强玻璃表面重建以支持机器人导航
Abstract
Indoor robot navigation is often compromised by glass surfaces, which severely corrupt depth sensor measurements. While foundation models like Depth Anything 3 provide excellent geometric priors, they lack an absolute metric scale. We propose a training-free framework that leverages depth foundation models as a structural prior, employing a robust local RANSAC-based alignment to fuse it with raw sensor depth. This naturally avoids contamination from erroneous glass measurements and recovers an accurate metric scale. Furthermore, we introduce GlassRecon, a novel RGB-D dataset with geometrically derived ground truth for glass regions. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art baselines, especially under severe sensor depth corruption. The dataset and related code will be released at https://github.com/jarvisyjw/GlassRecon.
Chinese Translation
室内机器人导航常常受到玻璃表面的影响,这严重干扰了深度传感器的测量。虽然像 Depth Anything 3 这样的基础模型提供了出色的几何先验,但它们缺乏绝对的度量尺度。我们提出了一种无训练框架,利用深度基础模型作为结构先验,采用基于 RANSAC 的稳健局部对齐方法将其与原始传感器深度数据融合。这自然避免了来自错误玻璃测量的污染,并恢复了准确的度量尺度。此外,我们引入了 GlassRecon,一个具有几何推导的玻璃区域真实值的新型 RGB-D 数据集。大量实验表明,我们的方法在严重的传感器深度干扰下,始终优于最先进的基线。数据集及相关代码将发布在 https://github.com/jarvisyjw/GlassRecon。
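As a rough illustration of the alignment idea in the abstract above, the sketch below fits a scale and shift that map relative (model) depth onto metric sensor depth with a basic RANSAC loop, so that corrupted readings such as those on glass end up as outliers. It is a single global fit for brevity, whereas the abstract describes a local variant; the affine model, the function names `fit_line` and `ransac_scale_shift`, and the threshold values are all assumptions for illustration, not the authors' implementation.

```python
import random


def fit_line(xs, ys):
    """Closed-form least squares for y ≈ s * x + t."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    s = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return s, my - s * mx


def ransac_scale_shift(rel, metric, iters=200, thresh=0.05, seed=0):
    """Fit metric ≈ s * rel + t while ignoring outliers (hypothetical helper).

    rel:    relative depths from a depth model (arbitrary scale)
    metric: sensor depths in meters at the same pixels
    thresh: inlier residual threshold in meters (assumed value)
    """
    rng = random.Random(seed)
    best = []
    for _ in range(iters):
        # Two points are the minimal sample for a scale-and-shift model.
        i, j = rng.sample(range(len(rel)), 2)
        if rel[i] == rel[j]:
            continue
        s = (metric[i] - metric[j]) / (rel[i] - rel[j])
        t = metric[i] - s * rel[i]
        inliers = [k for k in range(len(rel))
                   if abs(s * rel[k] + t - metric[k]) < thresh]
        if len(inliers) > len(best):
            best = inliers
    if len(best) < 2:
        raise ValueError("no consensus set found")
    # Refine the model on the consensus set only.
    s, t = fit_line([rel[k] for k in best], [metric[k] for k in best])
    return s, t, best
```

A local version would presumably run a fit like this per image patch and blend the resulting scale and shift maps; that step is omitted here.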
cs.RO / 62 / 2604.18343
DAG-STL: A Hierarchical Framework for Zero-Shot Trajectory Planning under Signal Temporal Logic Specifications
DAG-STL:一种基于信号时序逻辑规范的零-shot轨迹规划层次框架
Abstract
Signal Temporal Logic (STL) is a powerful language for specifying temporally structured robotic tasks. Planning executable trajectories under STL constraints remains difficult when system dynamics and environment structure are not analytically available. Existing methods typically either assume explicit models or learn task-specific behaviors, limiting zero-shot generalization to unseen STL tasks. In this work, we study offline STL planning under unknown dynamics using only task-agnostic trajectory data. Our central design philosophy is to separate logical reasoning from trajectory realization. We instantiate this idea in DAG-STL, a hierarchical framework that converts long-horizon STL planning into three stages. It first decomposes an STL formula into reachability and invariance progress conditions linked by shared timing constraints. It then allocates timed waypoints using learned reachability-time estimates. Finally, it synthesizes trajectories between these waypoints with a diffusion-based generator. This decomposition--allocation--generation pipeline reduces global planning to shorter, better-supported subproblems. To bridge the gap between planning-level correctness and execution-level feasibility, we further introduce a rollout-free dynamic consistency metric, an anytime refinement search procedure for improving multiple allocation hypotheses under finite budgets, and a hierarchical online replanning mechanism for execution-time recovery. Experiments in Maze2D, OGBench AntMaze, and the Cube domain show that DAG-STL substantially outperforms direct robustness-guided diffusion on complex long-horizon STL tasks and generalizes across navigation and manipulation settings. In a custom environment with an optimization-based reference, DAG-STL recovers most model-solvable tasks while retaining a clear computational advantage over direct optimization based on the explicit system model.
Chinese Translation
信号时序逻辑(STL)是一种强大的语言,用于指定具有时间结构的机器人任务。在系统动态和环境结构不可解析的情况下,满足STL约束的可执行轨迹规划仍然困难。现有方法通常假设明确的模型或学习特定任务的行为,从而限制了对未见STL任务的零-shot泛化。在本研究中,我们研究了在未知动态下的离线STL规划,仅使用与任务无关的轨迹数据。我们的核心设计理念是将逻辑推理与轨迹实现分离。我们在DAG-STL中实现了这一思想,DAG-STL是一个将长时间范围的STL规划转化为三个阶段的层次框架。它首先将STL公式分解为通过共享时间约束连接的可达性和不变性进展条件。然后,它使用学习到的可达性-时间估计分配定时航点。最后,它利用基于扩散的生成器在这些航点之间合成轨迹。这种分解-分配-生成的管道将全局规划简化为更短、更有支持的子问题。为了弥合规划层面的正确性与执行层面的可行性之间的差距,我们进一步引入了一种无回滚动态一致性度量,一种在有限预算下改进多个分配假设的随时优化搜索过程,以及一种用于执行时恢复的层次在线重新规划机制。在Maze2D、OGBench AntMaze和Cube领域的实验表明,DAG-STL在复杂的长时间范围STL任务上显著优于直接的鲁棒性引导扩散,并在导航和操作设置中实现了泛化。在一个具有基于优化的参考的自定义环境中,DAG-STL恢复了大多数模型可解任务,同时在计算上明显优于基于显式系统模型的直接优化。
cs.CV / 1 / 2604.16446
A High-Accuracy Optical Music Recognition Method Based on Bottleneck Residual Convolutions
基于瓶颈残差卷积的高精度光学音乐识别方法
Abstract
Optical Music Recognition (OMR) aims to convert printed or handwritten music score images into editable symbolic representations. This paper presents an end-to-end OMR framework that combines residual bottleneck convolutions with bidirectional gated recurrent unit (BiGRU)-based sequence modeling. A convolutional neural network with ResNet-v2-style residual bottleneck blocks and multi-scale dilated convolutions is used to extract features that encode both fine-grained symbol details and global staff-line structures. The extracted feature sequences are then fed into a BiGRU network to model temporal dependencies among musical symbols. The model is trained using the Connectionist Temporal Classification loss, enabling end-to-end prediction without explicit alignment annotations. Experimental results on the Camera-PrIMuS and PrIMuS datasets demonstrate the effectiveness of the proposed framework. On Camera-PrIMuS, the proposed method achieves a sequence error rate (SeER) of 7.52% and a symbol error rate (SyER) of 0.45%, with pitch, type, and note accuracies of 99.33%, 99.60%, and 99.28%, respectively. The average training time is 1.74 s per epoch, demonstrating high computational efficiency while maintaining strong recognition performance. On PrIMuS, the method achieves a SeER of 8.11% and a SyER of 0.49%, with pitch, type, and note accuracies of 99.27%, 99.58%, and 99.21%, respectively. A fine-grained error analysis further confirms the effectiveness of the proposed model.
Chinese Translation
光学音乐识别(OMR)旨在将印刷或手写的乐谱图像转换为可编辑的符号表示。本文提出了一种端到端的OMR框架,该框架将残差瓶颈卷积与基于双向门控递归单元(BiGRU)的序列建模相结合。使用具有ResNet-v2风格的残差瓶颈块和多尺度扩张卷积的卷积神经网络提取特征,这些特征编码了细粒度符号细节和全局五线谱结构。提取的特征序列随后被输入到BiGRU网络中,以建模音乐符号之间的时间依赖关系。该模型使用连接主义时间分类损失进行训练,从而实现端到端的预测,而无需显式对齐注释。在Camera-PrIMuS和PrIMuS数据集上的实验结果证明了所提框架的有效性。在Camera-PrIMuS上,所提方法实现了7.52%的序列错误率(SeER)和0.45%的符号错误率(SyER),音高、类型和音符的准确率分别为99.33%、99.60%和99.28%。平均每个训练周期的训练时间为1.74秒,展示了高计算效率,同时保持了强大的识别性能。在PrIMuS上,该方法实现了8.11%的SeER和0.49%的SyER,音高、类型和音符的准确率分别为99.27%、99.58%和99.21%。细粒度的错误分析进一步确认了所提模型的有效性。
cs.CV / 2 / 2604.16462
From Inheritance to Saturation: Disentangling the Evolution of Visual Redundancy for Architecture-Aware MLLM Inference Acceleration
从继承到饱和:解开视觉冗余在架构感知的多模态大语言模型推理加速中的演变
Abstract
High-resolution Multimodal Large Language Models (MLLMs) face prohibitive computational costs during inference due to the explosion of visual tokens. Existing acceleration strategies, such as token pruning or layer sparsity, suffer from severe "backbone dependency", performing well on Vicuna or Mistral architectures (e.g., LLaVA) but causing significant performance degradation when transferred to architectures like Qwen. To address this, we leverage truncated matrix entropy to uncover a universal three-stage inference lifecycle, decoupling visual redundancy into universal Intrinsic Visual Redundancy (IVR) and architecture-dependent Secondary Saturation Redundancy (SSR). Guided by this insight, we propose HalfV, a framework that first mitigates IVR via a unified pruning strategy and then adaptively handles SSR based on its specific manifestation. Experiments demonstrate that HalfV achieves superior efficiency-performance trade-offs across diverse backbones. Notably, on Qwen25-VL, it retains 96.8% performance at a 4.1× FLOPs speedup, significantly outperforming state-of-the-art baselines. Our code is available at https://github.com/civilizwa/HalfV.
Chinese Translation
高分辨率多模态大语言模型(MLLMs)在推理过程中面临着由于视觉标记数量激增而导致的高昂计算成本。现有的加速策略,如标记剪枝或层稀疏,受到严重的“骨干依赖”影响,在Vicuna或Mistral架构(例如LLaVA)上表现良好,但在转移到Qwen等架构时会导致显著的性能下降。为了解决这个问题,我们利用截断矩阵熵揭示了一个通用的三阶段推理生命周期,将视觉冗余解耦为通用的内在视觉冗余(Intrinsic Visual Redundancy, IVR)和依赖于架构的次级饱和冗余(Secondary Saturation Redundancy, SSR)。基于这一见解,我们提出了HalfV框架,该框架首先通过统一的剪枝策略减轻IVR,然后根据SSR的具体表现进行自适应处理。实验表明,HalfV在多种骨干网络上实现了更优的效率与性能权衡。值得注意的是,在Qwen25-VL上,它在实现4.1倍FLOPs加速的同时保持了96.8%的性能,显著优于最先进的基线。我们的代码可在https://github.com/civilizwa/HalfV获取。
cs.CV / 3 / 2604.16479
Latent-Compressed Variational Autoencoder for Video Diffusion Models
用于视频扩散模型的潜在压缩变分自编码器
Abstract
Video variational autoencoders (VAEs) used in latent diffusion models typically require a sufficiently large number of latent channels to ensure high-quality video reconstruction. However, recent studies have revealed that an excessive number of latent channels can impede the convergence of latent diffusion models and deteriorate their generative performance, even when reconstruction quality remains high. We propose a latent compression method that removes high-frequency components in video latent representations rather than directly reducing the number of channels, which often compromises reconstruction fidelity. Experimental results demonstrate that the proposed method achieves superior video reconstruction quality compared to strong baselines while maintaining the same overall compression ratio.
Chinese Translation
在潜在扩散模型中使用的视频变分自编码器(VAEs)通常需要足够数量的潜在通道以确保高质量的视频重建。然而,近期研究表明,过多的潜在通道会阻碍潜在扩散模型的收敛,并降低其生成性能,即使重建质量保持较高。我们提出了一种潜在压缩方法,该方法通过去除视频潜在表示中的高频成分,而不是直接减少通道数量,从而避免了常常妥协重建保真度的问题。实验结果表明,所提方法在保持相同整体压缩比的情况下,达到了优于强基线的视频重建质量。
cs.CV / 4 / 2604.16480
Positioning radiata pine branches requiring pruning by drone stereo vision
基于无人机立体视觉的辐射松修剪所需枝条定位
Abstract
This paper presents a stereo-vision-based system mounted on a drone for detecting and localising radiata pine branches to support autonomous pruning. The proposed pipeline comprises two stages: branch segmentation and depth estimation. For segmentation, YOLOv8, YOLOv9, and Mask R-CNN variants are compared on a custom dataset of 71 stereo image pairs captured with a ZED Mini camera. For depth estimation, both a traditional method (SGBM with WLS filtering) and deep-learning-based methods (PSMNet, ACVNet, GWCNet, MobileStereoNet, RAFT-Stereo, and NeRF-Supervised Deep Stereo) are evaluated. A centroid-based triangulation algorithm with MAD outlier rejection is proposed to compute branch distance from the segmentation mask and disparity map. Qualitative evaluation at distances of 1-2 m indicates that the deep learning-based disparity maps produce more coherent depth estimates than SGBM, demonstrating the feasibility of low-cost stereo vision for automated branch positioning in forestry.
Chinese Translation
本文提出了一种基于立体视觉的系统,该系统安装在无人机上,用于检测和定位辐射松枝条,以支持自主修剪。所提出的流程包括两个阶段:枝条分割和深度估计。在分割阶段,比较了YOLOv8、YOLOv9和Mask R-CNN变体在使用ZED Mini相机捕获的71对立体图像的自定义数据集上的表现。在深度估计阶段,评估了传统方法(带有WLS滤波的SGBM)和基于深度学习的方法(PSMNet、ACVNet、GWCNet、MobileStereoNet、RAFT-Stereo和NeRF-Supervised Deep Stereo)。提出了一种基于质心的三角测量算法,结合MAD异常值拒绝,计算从分割掩膜和视差图到枝条的距离。在1-2米的距离下进行的定性评估表明,基于深度学习的视差图产生的深度估计比SGBM更为一致,证明了低成本立体视觉在林业中进行自动化枝条定位的可行性。
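The distance computation mentioned in the abstract above rests on the standard stereo relation Z = f · B / d (depth from focal length, baseline, and disparity). The sketch below pools disparity samples from inside a branch mask, rejects outliers with a median absolute deviation (MAD) test, and triangulates from the mean of what remains. It is a simplified stand-in, not the paper's centroid-based algorithm; the function name `branch_distance`, the cutoff `k = 3.0`, and the example camera values are assumptions.

```python
from statistics import median


def branch_distance(disparities, focal_px, baseline_m, k=3.0):
    """Estimate distance in meters from disparity samples in pixels.

    disparities: disparity values sampled inside a segmentation mask
    focal_px:    focal length in pixels (assumed known from calibration)
    baseline_m:  stereo baseline in meters
    k:           MAD cutoff multiplier (assumed value)
    """
    valid = [d for d in disparities if d > 0]
    if not valid:
        raise ValueError("no valid disparities")
    med = median(valid)
    mad = median(abs(d - med) for d in valid)
    # 1.4826 * MAD approximates the standard deviation for Gaussian data.
    scale = 1.4826 * mad
    kept = [d for d in valid if abs(d - med) <= k * scale]
    d_hat = sum(kept) / len(kept)
    # Standard pinhole stereo triangulation: Z = f * B / d.
    return focal_px * baseline_m / d_hat


# Illustrative values only: five consistent samples plus two gross outliers.
samples = [40.0, 41.0, 39.0, 40.5, 39.5, 5.0, 120.0]
distance_m = branch_distance(samples, focal_px=700.0, baseline_m=0.063)
```

With these illustrative inputs the two gross outliers are discarded and the estimate lands a little over 1 m, inside the 1-2 m range the abstract evaluates.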
cs.CV / 5 / 2604.16481
Erasing Thousands of Concepts: Towards Scalable and Practical Concept Erasure for Text-to-Image Diffusion Models
消除数千个概念:面向可扩展和实用的文本到图像扩散模型的概念消除
Abstract
Large-scale text-to-image (T2I) diffusion models deliver remarkable visual fidelity but pose safety risks due to their capacity to reproduce undesirable content, such as copyrighted material. Concept erasure has emerged as a mitigation strategy, yet existing approaches struggle to balance scalability, precision, and robustness, which restricts their applicability to erasing only a few hundred concepts. To address these limitations, we present Erasing Thousands of Concepts (ETC), a scalable framework capable of erasing thousands of concepts while preserving generation quality. Our method first models low-rank concept distributions via a Student's t-distribution Mixture Model (tMM). It enables pin-point erasure of target concepts via affine optimal transport while preserving others by anchoring the boundaries of target concept distributions without pre-defined anchor concepts. We then train a Mixture-of-Experts (MoE)-based module, termed MoEraser, which removes target embeddings while preserving the anchor embeddings. By injecting noise into the text embedding projector and fine-tuning MoEraser for recovery, our framework achieves robustness to white-box attacks such as module removal. Extensive experiments on over 2,000 concepts across heterogeneous domains and diffusion models demonstrate state-of-the-art scalability and precision in large-scale concept erasure.
Chinese Translation
大规模文本到图像(T2I)扩散模型在视觉保真度方面表现出色,但由于其能够再现不良内容(如受版权保护的内容),因此带来了安全风险。概念消除已成为一种缓解策略,但现有方法在可扩展性、精确性和鲁棒性之间难以取得平衡,这限制了其仅能消除几百个概念的适用性。为了解决这些局限性,我们提出了消除数千个概念(ETC),这是一个可扩展的框架,能够在保持生成质量的同时消除数千个概念。我们的方法首先通过学生t分布混合模型(tMM)对低秩概念分布进行建模。它通过仿射最优传输实现目标概念的精准消除,同时通过锚定目标概念分布的边界来保留其他概念,而无需预定义的锚概念。然后,我们训练了一个基于专家混合(MoE)的模块,称为MoEraser,该模块在保留锚嵌入的同时去除目标嵌入。通过向文本嵌入投影器注入噪声并对MoEraser进行微调以实现恢复,我们的框架在面对如模块移除等白盒攻击时表现出鲁棒性。在超过2000个异构领域和扩散模型上的大量实验表明,我们的方法在大规模概念消除方面具有最先进的可扩展性和精确性。
cs.CV / 6 / 2604.16482
A Survey of Spatial Memory Representations for Efficient Robot Navigation
高效机器人导航的空间记忆表示调查
Abstract
As vision-based robots navigate larger environments, their spatial memory grows without bound, eventually exhausting computational resources, particularly on embedded platforms (8-16GB shared memory, $<$30W) where adding hardware is not an option. This survey examines the spatial memory efficiency problem across 88 references spanning 52 systems (1989-2025), from occupancy grids to neural implicit representations. We introduce the ratio $\alpha = M_{\text{peak}} / M_{\text{map}}$ of peak runtime memory (the total RAM or GPU memory consumed during operation) to saved map size (the persistent checkpoint written to disk), exposing the gap between published map sizes and actual deployment cost. Independent profiling on an NVIDIA A100 GPU reveals that $\alpha$ spans two orders of magnitude within neural methods alone, ranging from 2.3 (Point-SLAM) to 215 (NICE-SLAM, whose 47 MB map requires 10GB at runtime), showing that memory architecture, not paradigm label, determines deployment feasibility. We propose a standardized evaluation protocol comprising memory growth rate, query latency, memory-completeness curves, and throughput degradation, none of which current benchmarks capture. Through a Pareto frontier analysis with explicit benchmark separation, we show that no single paradigm dominates within its evaluation regime: 3DGS methods achieve the best absolute accuracy at 90-254 MB map size on Replica, while scene graphs provide semantic abstraction at predictable cost. We provide the first independently measured $\alpha$ reference values and an $\alpha$-aware budgeting algorithm enabling practitioners to assess deployment feasibility on target hardware prior to implementation.
Chinese Translation
随着基于视觉的机器人在更大环境中的导航,它们的空间记忆无限增长,最终耗尽计算资源,尤其是在嵌入式平台(8-16GB共享内存,$<$30W)上,增加硬件并不是一个选项。本调查研究了跨越88个参考文献的空间记忆效率问题,这些文献涵盖了52个系统(1989-2025),从占用网格到神经隐式表示。我们引入了$\alpha = M_{\text{peak}} / M_{\text{map}}$,即峰值运行内存(操作期间消耗的总RAM或GPU内存)与保存地图大小(写入磁盘的持久检查点)之间的比率,揭示了已发布地图大小与实际部署成本之间的差距。在NVIDIA A100 GPU上的独立分析显示,$\alpha$在神经方法中跨越两个数量级,从2.3(Point-SLAM)到215(NICE-SLAM,其47MB地图在运行时需要10GB),表明内存架构而非范式标签决定了部署的可行性。我们提出了一种标准化评估协议,包括内存增长率、查询延迟、内存完整性曲线和吞吐量下降,而当前基准测试未能捕捉到这些指标。通过显式基准分离的帕累托前沿分析,我们表明在其评估范围内没有单一范式占主导地位:3DGS方法在Replica上的90-254MB地图大小上实现了最佳绝对精度,而场景图则以可预测的成本提供语义抽象。我们提供了首个独立测量的$\alpha$参考值和一个$\alpha$感知预算算法,使从业者能够在实施前评估目标硬件上的部署可行性。
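The ratio $\alpha = M_{\text{peak}} / M_{\text{map}}$ from the abstract above lends itself to a rough feasibility check. The sketch below is only an illustration and not the survey's budgeting algorithm: the function names, the 20% headroom reserve, and the 8 GB budget are assumptions made for this example. The $\alpha$ values (2.3 and 215) and the 47 MB map size are the ones quoted in the abstract.

```python
def peak_memory_mb(map_size_mb: float, alpha: float) -> float:
    """Estimate peak runtime memory from saved map size via alpha = M_peak / M_map."""
    return alpha * map_size_mb


def fits_budget(map_size_mb: float, alpha: float, budget_mb: float,
                headroom: float = 0.2) -> bool:
    """True if the estimated peak fits the budget after reserving headroom.

    The headroom fraction is a hypothetical safety margin, not from the survey.
    """
    usable_mb = budget_mb * (1.0 - headroom)
    return peak_memory_mb(map_size_mb, alpha) <= usable_mb


def max_map_size_mb(alpha: float, budget_mb: float, headroom: float = 0.2) -> float:
    """Largest saved map (MB) whose estimated peak still fits the budget."""
    return budget_mb * (1.0 - headroom) / alpha


if __name__ == "__main__":
    budget = 8 * 1024  # assumed 8 GB embedded board, in MB
    print(f"NICE-SLAM estimated peak: {peak_memory_mb(47.0, 215.0):.0f} MB")
    print("NICE-SLAM map fits:", fits_budget(47.0, 215.0, budget))
    print(f"Largest map at alpha = 2.3: {max_map_size_mb(2.3, budget):.0f} MB")
```

With the quoted NICE-SLAM figures the estimate lands near the 10GB reported in the abstract, which is why a map that looks small on disk can still be infeasible on an 8 GB board.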
cs.CV / 7 / 2604.16483
Dynamic Eraser for Guided Concept Erasure in Diffusion Models
扩散模型中引导概念擦除的动态擦除器
Abstract
Concept erasure in Text-To-Image (T2I) diffusion models is vital for safe content generation, but existing inference-time methods face significant limitations. Feature-correction approaches often cause uncontrolled over-correction, while token-level interventions struggle with semantic granularity and context. Moreover, both types of methods are prone to severe semantic drift or even complete representation collapse. To address these challenges, we present Dynamic Semantic Steering (DSS), a lightweight, training-free framework for interpretable and controllable concept erasure. DSS introduces: 1) Sensitive Semantic Boundary Modeling (SSBM) to automate the discovery of safe semantic anchors, and 2) Sensitive Semantic Guidance (SSG), which leverages cross-attention features for precise detection and performs correction via a closed-form solution derived from a well-posed objective. This ensures optimal suppression of sensitive content while preserving benign semantics. DSS achieves an average erasure rate of 91.0\%, significantly outperforming SOTA methods (from 18.6\% to 85.9\%) with minimal impact on output fidelity.
Chinese Translation
在文本到图像(Text-To-Image, T2I)扩散模型中,概念擦除对于安全内容生成至关重要,但现有的推理时方法面临重大局限性。特征修正方法往往导致不受控制的过度修正,而基于标记的干预在语义粒度和上下文方面存在困难。此外,这两种方法都容易导致严重的语义漂移甚至完全的表示崩溃。为了解决这些挑战,我们提出了动态语义引导(Dynamic Semantic Steering, DSS),这是一个轻量级的、无训练的框架,用于可解释和可控的概念擦除。DSS引入了:1)敏感语义边界建模(Sensitive Semantic Boundary Modeling, SSBM),以自动发现安全的语义锚点;2)敏感语义引导(Sensitive Semantic Guidance, SSG),利用交叉注意力特征进行精确检测,并通过从良构目标推导出的封闭形式解进行修正。这确保了在保留良性语义的同时,最佳抑制敏感内容。DSS实现了91.0%的平均擦除率,显著优于现有最先进方法(从18.6%到85.9%),对输出保真度的影响最小。
cs.CV / 8 / 2604.16484
DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks
DexWorldModel:面向自动化学习具身任务的因果潜在世界建模
Abstract
Deploying generative World-Action Models for manipulation is severely bottlenecked by redundant pixel-level reconstruction, $\mathcal{O}(T)$ memory scaling, and sequential inference latency. We introduce the Causal Latent World Model (CLWM), which employs DINOv3 features as generative targets to disentangle interaction semantics from visual noise, yielding highly robust domain generalization. To overcome memory scaling, CLWM features a Dual-State Test-Time Training (TTT) Memory that guarantees a strict $\mathcal{O}(1)$ footprint for long-horizon tasks. To overcome deployment latency, we propose Speculative Asynchronous Inference (SAI) to mask partial diffusion denoising behind physical execution, cutting blocking latency by about $50\%$. To scale robust policies, we present EmbodiChain, an online framework that establishes the Efficiency Law by injecting an infinite flow of physics-grounded trajectories during training. Extensive experiments validate that CLWM achieves state-of-the-art performance in complex dual-arm simulation and unprecedented zero-shot sim-to-real transfer on physical robots, outperforming baselines explicitly finetuned on real-world data.
Chinese Translation
在操作中部署生成的世界-动作模型受到冗余像素级重建、$\mathcal{O}(T)$内存扩展和顺序推理延迟的严重瓶颈。我们提出了因果潜在世界模型(Causal Latent World Model, CLWM),该模型利用DINOv3特征作为生成目标,以从视觉噪声中解耦交互语义,从而实现高度稳健的领域泛化。为了克服内存扩展问题,CLWM具有双状态测试时训练(Dual-State Test-Time Training, TTT)内存,确保在长时间任务中严格保持$\mathcal{O}(1)$的内存占用。为了克服部署延迟,我们提出了投机性异步推理(Speculative Asynchronous Inference, SAI),将部分扩散去噪的过程隐藏在物理执行之后,从而将阻塞延迟减少约50%。为了扩展稳健策略,我们提出了EmbodiChain,这是一个在线框架,通过在训练过程中注入无限流的基于物理的轨迹来建立效率法则。大量实验验证了CLWM在复杂双臂仿真中的最先进性能以及在物理机器人上前所未有的零样本仿真到现实转移,超越了在真实世界数据上显式微调的基线。
cs.CV / 9 / 2604.16485
Saccade Attention Networks: Using Transfer Learning of Attention to Reduce Network Sizes
注视注意网络:利用注意力的迁移学习来减少网络规模
Abstract
One limitation of transformer networks is sequence length, because the attention matrix grows quadratically with it. Classical self-attention attends over the entire sequence; however, the attention actually used is sparse. When analyzing an image or scene, humans use a form of sparse attention called saccades. Focusing on key features greatly reduces computation time. By using a network (Saccade Attention Network) to learn where to attend from a large pre-trained model, we can use it to pre-process images and greatly reduce network size by reducing the input sequence length to just the key features being attended to. Our results indicate that calculations can be reduced by close to 80% while producing similar results.
Chinese Translation
变换器网络的一个限制是序列长度,这与注意力矩阵的平方性质有关。经典的自注意力使用整个序列长度,然而,实际使用的注意力是稀疏的。人类在分析图像或场景时使用一种称为注视的稀疏注意力形式。专注于关键特征大大减少了计算时间。通过使用一个网络(注视注意网络)从一个大型预训练模型中学习注意力的焦点,我们可以用它来预处理图像,并通过将输入序列长度减少到仅关注的关键特征来大幅减少网络规模。我们的结果表明,可以将计算量减少近80%,并产生相似的结果。
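The link between a shorter input sequence and the quadratic attention cost can be sketched in a few lines. This is a minimal illustration, not the authors' network: the per-patch scores stand in for whatever the learned Saccade Attention Network would output, and `select_saccade_tokens` and `attention_entries` are hypothetical helper names.

```python
def select_saccade_tokens(scores: list[float], keep: int) -> list[int]:
    """Return indices of the `keep` highest-scoring patches, in original order."""
    if not 0 < keep <= len(scores):
        raise ValueError("keep must be between 1 and len(scores)")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def attention_entries(seq_len: int) -> int:
    """Number of entries in one full self-attention matrix (quadratic in length)."""
    return seq_len * seq_len


if __name__ == "__main__":
    # Hypothetical per-patch importance scores for an 8-patch image.
    scores = [0.02, 0.30, 0.01, 0.25, 0.05, 0.20, 0.03, 0.14]
    kept = select_saccade_tokens(scores, keep=4)
    saved = 1 - attention_entries(len(kept)) / attention_entries(len(scores))
    print("kept patches:", kept)
    print(f"attention entries saved: {saved:.0%}")
```

Because the cost is quadratic, keeping half of the tokens removes three quarters of the attention entries, so a moderate reduction in sequence length already yields a large saving.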
cs.CV / 10 / 2604.16486
Aletheia: Physics-Conditioned Localized Artifact Attention (PhyLAA-X) for End-to-End Generalizable and Robust Deepfake Video Detection
Aletheia:基于物理条件的局部伪影注意力(PhyLAA-X)用于端到端可泛化和鲁棒的深伪视频检测
Abstract
State-of-the-art deepfake detectors achieve near-perfect in-domain accuracy yet degrade under cross-generator shifts, heavy compression, and adversarial perturbations. The core limitation remains the decoupling of semantic artifact learning from physical invariants: optical-flow discontinuities, specular-reflection inconsistencies, and cardiac-modulated reflectance (rPPG) are treated either as post-hoc features or ignored. We introduce PhyLAA-X, a novel physics-conditioned extension of Localized Artifact Attention (LAA-X). PhyLAA-X injects three end-to-end differentiable physics-derived feature volumes - optical-flow curl, specular-reflectance skewness, and spatially-upsampled rPPG power spectra - directly into the LAA-X attention computation via cross-attention gating and a resonance consistency loss. This forces the network to learn manipulation boundaries where semantic inconsistencies and physical violations co-occur - regions inherently harder for generative models to replicate consistently. PhyLAA-X is embedded across an efficient spatiotemporal ensemble (EfficientNet-B4+BiLSTM, ResNeXt-101+Transformer, Xception+causal Conv1D) with uncertainty-aware adaptive weighting. On FaceForensics++ (c23), Aletheia reaches 97.2% accuracy / 0.992 AUC-ROC; on Celeb-DF v2, 94.9% / 0.981; on DFDC, 90.8% / 0.966 - outperforming the strongest published baseline (LAA-Net [1]) by 4.1-7.3% in cross-generator settings and maintaining 79.4% accuracy under epsilon = 0.02 PGD-10 attacks. Single-backbone ablations confirm PhyLAA-X alone delivers a 4.2% cross-dataset AUC gain. The full production system is open-sourced at https://github.com/devghori1264/Aletheia (v1.2, April 2026) with pretrained weights, the adversarial corpus (referred to as ADC-2026 in this work), and complete reproducibility artifacts.
Chinese Translation
最先进的深伪检测器在领域内的准确率接近完美,但在跨生成器转变、重压缩和对抗扰动下性能下降。其核心限制在于语义伪影学习与物理不变性之间的解耦:光流不连续性、镜面反射不一致性和心脏调制反射(rPPG)要么被视为事后特征,要么被忽视。我们提出了PhyLAA-X,一种局部伪影注意力(LAA-X)的新型基于物理条件的扩展。PhyLAA-X通过交叉注意力门控和共振一致性损失,将三个端到端可微分的物理衍生特征体积——光流旋度、镜面反射偏度和空间上上采样的rPPG功率谱——直接注入LAA-X注意力计算中。这迫使网络学习语义不一致性和物理违反共同发生的操控边界——这些区域本质上更难以被生成模型一致复制。PhyLAA-X嵌入在一个高效的时空集成模型中(EfficientNet-B4+BiLSTM, ResNeXt-101+Transformer, Xception+因果卷积1D),并具有不确定性感知的自适应加权。在FaceForensics++ (c23)上,Aletheia达到了97.2%的准确率/0.992 AUC-ROC;在Celeb-DF v2上,94.9%/0.981;在DFDC上,90.8%/0.966——在跨生成器设置中超越了最强的已发布基线(LAA-Net [1])4.1-7.3%。在epsilon = 0.02 PGD-10攻击下保持79.4%的准确率。单一骨干网络的消融实验确认PhyLAA-X单独提供了4.2%的跨数据集AUC提升。完整的生产系统已开源于https://github.com/devghori1264/Aletheia(v1.2,2026年4月),包含预训练权重、对抗语料库(在本工作中称为ADC-2026)和完整的可重现性文档。
cs.CV / 11 / 2604.16487
Geometry-Aware CLIP Retrieval via Local Cross-Modal Alignment and Steering
基于几何感知的 CLIP 检索:局部跨模态对齐与引导
Abstract
CLIP retrieval is typically framed as a pointwise similarity problem in a shared embedding space. While CLIP achieves strong global cross-modal alignment, many retrieval failures arise from local geometric inconsistencies: nearby items are incorrectly ordered, leading to systematic confusions (e.g., pentagon vs. hexagon) and producing diffuse, weakly controlled result sets. Prior work largely optimizes for pointwise relevance or relies on fine-tuning to mitigate these problems. We instead view retrieval as a problem of neighborhood alignment. Our work introduces (1) neighborhood-level re-ranking via Hungarian matching, which rewards structural consistency; (2) query-conditioned local steering, where directions derived from contrastive neighborhoods around the query reshape retrieval. We show that these techniques improve retrieval performance on attribute-binding and compositional retrieval tasks. Together, these methods operate on local neighborhoods but serve different roles: re-ranking rewards alignment whereas local steering controls neighborhood structure. This shows that retrieval quality and controllability depend critically on local structure, which can be exploited at inference time without retraining.
Chinese Translation
CLIP 检索通常被视为共享嵌入空间中的逐点相似性问题。尽管 CLIP 实现了强大的全局跨模态对齐,但许多检索失败源于局部几何不一致:相邻项的排序错误,导致系统性的混淆(例如,五边形与六边形),并产生模糊且控制弱的结果集。之前的研究主要优化逐点相关性或微调以缓解这些问题。我们则将检索视为邻域对齐的问题。我们的工作引入了(1)通过匈牙利匹配进行的邻域级重排序,该方法奖励结构一致性;(2)基于查询的局部引导,其中从查询周围对比邻域导出的方向重新塑造检索。我们展示了这些技术在属性绑定和组合检索任务上提高了检索性能。这些方法共同作用于局部邻域,但扮演不同的角色:重排序奖励对齐,而局部引导控制邻域结构。这表明检索质量和可控性在很大程度上依赖于局部结构,这可以在推理时利用而无需重新训练。
cs.CV / 12 / 2604.16490
An Uncertainty-Aware Loss Function Incorporating Fuzzy Logic: Application to MRI Brain Image Segmentation
一种结合模糊逻辑的考虑不确定性的损失函数:在MRI脑图像分割中的应用
Abstract
Accurate brain image segmentation, particularly for distinguishing various tissues in magnetic resonance imaging (MRI) images, plays a pivotal role in detecting neurological disease and in medical image computing. In deep learning approaches, loss functions are crucial for optimizing the model. In this study, we introduce a novel loss function that integrates fuzzy logic to deal with uncertainty when segmenting brain images into various tissues. It combines the well-known categorical cross-entropy (CCE) loss function with fuzzy entropy based on fuzzy logic. By employing fuzzy logic, this loss function accounts for the inherent uncertainties in pixel classifications. The proposed loss function has been evaluated on two publicly available benchmark datasets, IBSR and OASIS, using two widely recognised architectures, U-Net and U-Net++. Experimental results demonstrate that the model trained with the proposed loss function provided better results than the CCE optimisation function in terms of various performance metrics. Additionally, it effectively enhances segmentation performance while handling meaningful uncertainty during training. The findings suggest that this approach not only improves segmentation outcomes but also contributes to the reliability of model predictions.
Chinese Translation
准确的脑图像分割,特别是在磁共振成像(MRI)图像中区分各种组织,对于发现神经疾病和医学图像计算具有重要作用。在深度学习方法中,损失函数对于优化模型至关重要。在本研究中,我们提出了一种新颖的损失函数,结合模糊逻辑以处理脑图像分割中不同组织的不确定性问题。该损失函数整合了著名的分类交叉熵(CCE)损失函数和基于模糊逻辑的模糊熵。通过采用模糊逻辑,该损失函数考虑了像素分类中的固有不确定性。我们在两个公开可用的基准数据集IBSR和OASIS上评估了所提出的损失函数,使用了两种广泛认可的架构U-Net和U-Net++。实验结果表明,使用所提出的损失函数训练的模型在各种性能指标上相比于CCE优化函数提供了更好的结果。此外,它在处理训练过程中的有意义的不确定性时,能够有效提升分割性能。研究结果表明,该方法不仅改善了分割结果,还增强了模型预测的可靠性。
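One way to picture a loss that combines categorical cross-entropy with fuzzy entropy is sketched below. The abstract does not give the exact formulation, so everything here is an assumption: the additive form, the weight `lam`, and the use of the De Luca-Termini fuzzy entropy with softmax probabilities treated as membership degrees.

```python
import math

EPS = 1e-12  # guards against log(0)


def cce(probs: list[float], target: int) -> float:
    """Categorical cross-entropy for one pixel with a one-hot target class."""
    return -math.log(max(probs[target], EPS))


def fuzzy_entropy(probs: list[float]) -> float:
    """De Luca-Termini fuzzy entropy, averaged over classes.

    Each class probability is treated as a fuzzy membership degree (an assumption).
    """
    total = 0.0
    for mu in probs:
        mu = min(max(mu, EPS), 1.0 - EPS)
        total -= mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu)
    return total / len(probs)


def fuzzy_cce_loss(probs: list[float], target: int, lam: float = 0.5) -> float:
    """Hypothetical combined loss: CCE plus a weighted fuzzy-entropy penalty."""
    return cce(probs, target) + lam * fuzzy_entropy(probs)


if __name__ == "__main__":
    confident = [0.98, 0.01, 0.01]   # crisp prediction for class 0
    ambiguous = [0.40, 0.30, 0.30]   # uncertain boundary pixel
    print(f"confident: {fuzzy_cce_loss(confident, 0):.4f}")
    print(f"ambiguous: {fuzzy_cce_loss(ambiguous, 0):.4f}")
```

In this sketch the fuzzy-entropy term is largest when memberships sit near 0.5, so ambiguous pixels are penalised beyond what cross-entropy alone would assign.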
cs.CV / 13 / 2604.16491
A Lightweight Transformer for Pain Recognition from Brain Activity
一种轻量级变换器用于从脑活动中识别疼痛
Abstract
Pain is a multifaceted and widespread phenomenon with substantial clinical and societal burden, making reliable automated assessment a critical objective. This paper presents a lightweight transformer architecture that fuses multiple fNIRS representations through a unified tokenization mechanism, enabling joint modeling of complementary signal views without requiring modality-specific adaptations or increasing architectural complexity. The proposed token-mixing strategy preserves spatial, temporal, and time-frequency characteristics by projecting heterogeneous inputs onto a shared latent representation, using a structured segmentation scheme to control the granularity of local aggregation and global interaction. The model is evaluated on the AI4Pain dataset using stacked raw waveform and power spectral density representations of fNIRS inputs. Experimental results demonstrate competitive pain recognition performance while remaining computationally compact, making the approach suitable for real-time inference on both GPU and CPU hardware.
Chinese Translation
疼痛是一种多面向且广泛存在的现象,具有显著的临床和社会负担,因此可靠的自动评估成为一个关键目标。本文提出了一种轻量级变换器架构,通过统一的标记机制融合多种功能性近红外光谱(fNIRS)表示,能够在不需要特定模态适配或增加架构复杂性的情况下,联合建模互补的信号视图。所提出的标记混合策略通过将异构输入投影到共享的潜在表示上,保留空间、时间和时频特征,并使用结构化分段方案控制局部聚合和全局交互的粒度。该模型在AI4Pain数据集上进行了评估,使用堆叠的原始波形和功率谱密度表示的fNIRS输入。实验结果表明,该方法在保持计算紧凑的同时,展现出竞争力的疼痛识别性能,使其适合在GPU和CPU硬件上进行实时推理。
cs.CV / 14 / 2604.16492
LayerCache: Exploiting Layer-wise Velocity Heterogeneity for Efficient Flow Matching Inference
LayerCache:利用层级速度异质性实现高效的流匹配推断
Abstract
Flow Matching models achieve state-of-the-art image generation quality but incur substantial inference cost due to iterative denoising through large Transformer networks. We observe that different layer groups within a Transformer exhibit markedly heterogeneous velocity dynamics: shallow layers are highly stable and amenable to aggressive caching, while deep layers undergo large velocity changes that demand full computation. Existing caching methods, however, treat the entire Transformer as a monolithic unit, applying a single caching decision per timestep and thus failing to exploit this heterogeneity. Based on this finding, we propose LayerCache, a layer-aware caching framework that partitions the Transformer into layer groups and makes independent, per-group caching decisions at each denoising step. LayerCache introduces an adaptive JVP span K selection mechanism that leverages per-group stability measurements to balance estimation accuracy and computational savings. We formulate a three-dimensional scheduling problem over timesteps, layer groups, and JVP span, and solve it with a greedy budget allocation algorithm. On Qwen-Image (1024x1024, 50 steps), LayerCache achieves PSNR 37.46 dB (+5.38 dB over MeanCache), SSIM 0.9834, and LPIPS 0.0178 (a 70% reduction over MeanCache) at 1.37x speedup, dominating all prior caching methods on the quality-speed Pareto frontier.
Chinese Translation
流匹配模型在图像生成质量上达到了最先进的水平,但由于通过大型Transformer网络进行迭代去噪而导致了显著的推断成本。我们观察到,Transformer内的不同层组表现出明显的速度动态异质性:浅层具有高度的稳定性,适合进行激进的缓存,而深层则经历较大的速度变化,要求进行全面计算。然而,现有的缓存方法将整个Transformer视为一个整体,在每个时间步应用单一的缓存决策,从而未能利用这种异质性。基于这一发现,我们提出了LayerCache,一个层感知的缓存框架,它将Transformer划分为层组,并在每个去噪步骤中对每个组做独立的缓存决策。LayerCache引入了一种自适应的JVP跨度K选择机制,利用每个组的稳定性测量来平衡估计精度和计算节省。我们将这一问题形式化为一个三维调度问题,涉及时间步、层组和JVP跨度,并通过贪心预算分配算法进行求解。在Qwen-Image(1024x1024,50步)上,LayerCache实现了PSNR 37.46 dB(比MeanCache提高5.38 dB),SSIM 0.9834,和LPIPS 0.0178(比MeanCache减少70%),并以1.37倍的加速在质量-速度帕累托前沿上超越了所有先前的缓存方法。
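The idea of per-group caching decisions under a compute budget can be sketched as a small greedy routine. This is not LayerCache itself: the abstract does not specify the stability measure or the allocation rule, and the sketch ignores the JVP span K entirely. The group names, the `change` and `cost` inputs, and the change-per-cost ordering are assumptions for illustration.

```python
def choose_groups_to_recompute(change: dict[str, float],
                               cost: dict[str, float],
                               budget: float) -> set[str]:
    """Greedily spend a compute budget on the layer groups that changed most.

    change[g]: relative velocity change of group g since its cache was written.
    cost[g]:   cost of recomputing group g at this denoising step.
    Groups that are not selected reuse their cached output.
    """
    selected: set[str] = set()
    remaining = budget
    # Visit groups with the highest change per unit cost first.
    for g in sorted(change, key=lambda name: change[name] / cost[name], reverse=True):
        if cost[g] <= remaining:
            selected.add(g)
            remaining -= cost[g]
    return selected


if __name__ == "__main__":
    # Hypothetical measurements: shallow layers are stable, deep layers are not.
    change = {"shallow": 0.01, "middle": 0.08, "deep": 0.30}
    cost = {"shallow": 1.0, "middle": 1.0, "deep": 1.0}
    print(sorted(choose_groups_to_recompute(change, cost, budget=2.0)))
```

With these made-up numbers the stable shallow group is served from cache while the middle and deep groups are recomputed, which mirrors the heterogeneity described in the abstract.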
cs.CV / 15 / 2604.16499
HQA-VLAttack: Towards High Quality Adversarial Attack on Vision-Language Pre-Trained Models
HQA-VLAttack:面向高质量视觉-语言预训练模型的对抗攻击
Abstract
Black-box adversarial attack on vision-language pre-trained models is a practical and challenging task, as text and image perturbations need to be considered simultaneously, and only the predicted results are accessible. Research on this problem is in its infancy, and only a handful of methods are available. Nevertheless, existing methods either rely on a complex iterative cross-search strategy, which inevitably consumes numerous queries, or only consider reducing the similarity of positive image-text pairs but ignore that of negative ones, which will also be implicitly diminished, thus inevitably affecting the attack performance. To alleviate the above issues, we propose a simple yet effective framework to generate high-quality adversarial examples on vision-language pre-trained models, named HQA-VLAttack, which consists of text and image attack stages. For text perturbation generation, it leverages the counter-fitting word vector to generate the substitute word set, thus guaranteeing the semantic consistency between the substitute word and the original word. For image perturbation generation, it first initializes the image adversarial example via the layer-importance guided strategy, and then utilizes contrastive learning to optimize the image adversarial perturbation, which ensures that the similarity of positive image-text pairs is decreased while that of negative image-text pairs is increased. In this way, the optimized adversarial images and texts are more likely to retrieve negative examples, thereby enhancing the attack success rate. Experimental results on three benchmark datasets demonstrate that HQA-VLAttack significantly outperforms strong baselines in terms of attack success rate.
Chinese Translation
对视觉-语言预训练模型的黑箱对抗攻击是一项实际且具有挑战性的任务,因为需要同时考虑文本和图像的扰动,并且只能访问预测结果。对此问题的研究仍处于起步阶段,目前仅有少数方法可用。然而,现有方法要么依赖于复杂的迭代交叉搜索策略,这不可避免地消耗大量查询,要么仅考虑减少正图像-文本对的相似性,而忽略负图像-文本对的相似性,这也会被隐式降低,从而不可避免地影响攻击性能。为了解决上述问题,我们提出了一种简单而有效的框架来生成高质量的对抗样本,命名为HQA-VLAttack,该框架由文本和图像攻击阶段组成。在文本扰动生成方面,它利用反向拟合词向量生成替代词集,从而保证替代词与原始词之间的语义一致性。在图像扰动生成方面,它首先通过层重要性引导策略初始化图像对抗样本,然后利用对比学习优化图像对抗扰动,确保正图像-文本对的相似性降低,同时负图像-文本对的相似性增加。通过这种方式,优化后的对抗图像和文本更有可能检索到负例,从而提高攻击成功率。在三个基准数据集上的实验结果表明,HQA-VLAttack在攻击成功率方面显著优于强基线。
cs.CV / 16 / 2604.16500
Semantically Stable Image Composition Analysis via Saliency and Gradient Vector Flow Fusion
基于显著性与梯度向量流融合的语义稳定图像构图分析
Abstract
The reliable computational assessment of photographic composition requires features that are discriminative of spatial layout yet robust to semantic content. This paper proposes a low-level representation grounded in the assumption that composition can be understood as the flow of visual attention across geometric structure. We introduce VFCNet, which fuses saliency and edge information into a gradient vector flow (GVF) field. The model computes dual-stream GVF representations, integrates them via attention, and extracts multi-scale flow features with a DINOv3 backbone. VFCNet achieves state-of-the-art performance on the PICD benchmark (CDA-1: 0.683, CDA-2: 0.629), improving by 33.1\% and 36.1\% over the previous best method. We also show that a simple classifier on self-supervised DINOv3 features substantially outperforms more sophisticated, composition-specialized models. Code is available at https://github.com/ADadras/VFCNet
Chinese Translation
可靠的摄影构图计算评估需要具有空间布局区分性且对语义内容具有鲁棒性的特征。本文提出了一种低级表示,基于构图可以理解为视觉注意力在几何结构上流动的假设。我们引入了VFCNet,该模型将显著性和边缘信息融合到梯度向量流(Gradient Vector Flow, GVF)场中。该模型计算双流GVF表示,通过注意力机制整合它们,并使用DINOv3主干网络提取多尺度流特征。VFCNet在PICD基准测试上实现了最先进的性能(CDA-1: 0.683, CDA-2: 0.629),相比于之前的最佳方法提高了33.1%和36.1%。我们还展示了在自监督DINOv3特征上使用简单分类器的效果显著优于更复杂的构图专用模型。代码可在https://github.com/ADadras/VFCNet获取。
cs.CV / 17 / 2604.16502
Topology-Aware Layer Pruning for Large Vision-Language Models
面向拓扑的层剪枝方法用于大型视觉语言模型
Abstract
Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding and reasoning, while recent extensions that incorporate visual inputs enable them to process multimodal information. Despite these advances, Large Vision-Language Models (LVLMs) incur substantial computational and memory costs, hindering deployment in resource-constrained scenarios. Existing layer pruning methods typically rely on local similarity metrics or static proxy signals, failing to capture the global and dynamic evolution of representations across model depth, which often leads to the removal of transition-critical layers. To address this limitation, we propose a topology-aware layer pruning framework for LVLMs. Specifically, we represent layer-wise hidden states as point clouds and model their evolution using \textit{simplicial complexes}. By leveraging \textit{zigzag persistent homology}, we quantify inter-layer topological consistency and enable adaptive pruning that preserves critical representational transitions. Extensive experiments on diverse multimodal benchmarks demonstrate that the proposed framework consistently outperforms existing pruning methods across a wide range of sparsity ratios. Our code is available at https://github.com/zpc456/TopoVLM.
Chinese Translation
大型语言模型(LLMs)在自然语言理解和推理方面展现了强大的能力,而最近的扩展结合了视觉输入,使其能够处理多模态信息。尽管取得了这些进展,大型视觉语言模型(LVLMs)仍然会产生巨大的计算和内存开销,限制了其在资源受限场景中的部署。现有的层剪枝方法通常依赖于局部相似性度量或静态代理信号,未能捕捉到模型深度中表示的全局和动态演变,这常常导致关键过渡层的移除。为了解决这一局限性,我们提出了一种面向拓扑的层剪枝框架,专门针对LVLMs。具体而言,我们将每层的隐藏状态表示为点云,并使用\textit{单纯复形}(simplicial complexes)来建模其演变。通过利用\textit{锯齿持久同调}(zigzag persistent homology),我们量化了层间的拓扑一致性,并实现了能够保留关键表示过渡的自适应剪枝。在多样的多模态基准测试中进行的广泛实验表明,所提出的框架在各种稀疏比下始终优于现有的剪枝方法。我们的代码可在https://github.com/zpc456/TopoVLM获取。
cs.CV / 18 / 2604.16503
Motif-Video 2B: Technical Report
Motif-Video 2B:技术报告
Lim, Junghwan, Cheung, Wai Ting, Ha, Minsu, Kim, Beomgyu, Kim, Taewhan, Lee, Haesol, Oh, Dongpin, Lee, Jeesoo, Kim, Taehyun, Kim, Minjae, Lee, Sungmin, Cho, Hyeyeon, Choi, Dahye, Her, Jaeheui, Huh, Jaeyeon, Jung, Hanbin, Kang, Changjin, Kim, Dongseok, Kim, Jangwoong, Kim, Youngrok, Kweon, Hyukjin, Lee, Hongjoo, Lee, Jeongdoo, Lee, Junhyeok, Park, Eunhwan, Park, Yeongjae, Ryu, Bokki, Weon, Dongjoo
Abstract
Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and less than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video 2B reaches 83.76\%, surpassing Wan2.1 14B while using 7$\times$ fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or exceed the quality gap typically associated with much larger video models.
Chinese Translation
训练强大的视频生成模型通常需要大量的数据集、大规模的参数以及可观的计算资源。在本研究中,我们探讨在更小的预算下是否能够实现高质量的文本到视频生成:少于1000万个片段和少于100,000个H200 GPU小时。我们的核心观点是,部分答案在于模型容量的组织方式,而不仅仅在于使用了多少容量。在视频生成中,提示对齐、时间一致性和细节恢复在通过同一路径处理时可能会相互干扰。Motif-Video 2B通过在架构上分离这些角色来解决这一问题,而不是单纯依赖规模。该模型结合了两个关键思想。首先,Shared Cross-Attention在视频标记序列变长时增强了文本控制。其次,一个三部分的主干网络将早期融合、联合表示学习和细节精炼分开。为了在有限的计算预算下使这一设计有效,我们将其与基于动态标记路由和与冻结的预训练视频编码器的早期特征对齐的高效训练方案相结合。我们的分析表明,后续模块比标准的单流基线发展出更清晰的跨帧注意力结构。在VBench上,Motif-Video 2B达到了83.76\%,超越了Wan2.1 14B,同时使用了7倍更少的参数和显著更少的训练数据。这些结果表明,仔细的架构专业化结合以效率为导向的训练方案,可以缩小或超越通常与更大视频模型相关的质量差距。
cs.CV / 19 / 2604.16504
From Handwriting to Structured Data: Benchmarking AI Digitisation of Handwritten Forms
从手写到结构化数据:人工智能手写表单数字化的基准测试
Abstract
Manual digitisation of structured handwritten documents is slow and costly. We benchmark 17 leading frontier multi-modal large language models and open-source models against a very challenging real-world medical form that mixes dates; structured, printed text; handwritten responses; and significant variability. None of the smaller or older models perform well, but the latest Google and OpenAI models reach accuracies around $85\%$ with weighted F1 scores $\simeq 90\%$ across the discrete or predefined fields despite the very challenging nature of the responses. Clear task-specific strengths emerge: GPT 5.4 excels in noisy date extraction as well as in reliability, with the lowest hallucination rate ($6\%$). Claude Sonnet 4.6 had the best average performance across formatted fields (dates and numerical values), while Gemini 3.1 delivered the best overall performance, with the lowest free text error rates (WER = $0.50$ and CER = $0.31$) and the strongest results across discrete classification metrics. We further show that prompt optimisation dramatically improves macro precision, recall and F1 by over $60\%$, but has little impact on weighted metrics (only $\sim2-5\%$ improvement). These results provide evidence that the rapid improvements of multimodal large language models offer a compelling pathway toward fully automated digitisation of complex handwritten workflows, which is particularly relevant in low- and middle-income countries.
Chinese Translation
手动数字化结构化手写文档既缓慢又昂贵。我们对17个领先的前沿多模态大型语言模型和开源模型进行了基准测试,针对一个非常具有挑战性的现实世界医疗表单,该表单混合了日期、结构化打印文本、手写回复以及显著的变异性挑战。没有任何较小或较旧的模型表现良好,但最新的谷歌和OpenAI模型在面对响应的极具挑战性时,准确率达到约85%,加权F1分数约为90%。明显的任务特定优势显现:GPT 5.4在嘈杂日期提取方面表现出色,并且具有最低的幻觉率(6%)。Claude Sonnet 4.6在格式化字段(日期和数值)上的平均表现最佳,而Gemini 3.1则提供了最佳的整体表现,具有最低的自由文本错误率(WER = 0.50和CER = 0.31)以及在离散分类指标上最强的结果。我们进一步展示,提示优化显著提高了宏观精确率、召回率和F1分数超过60%,但对加权指标的影响很小(仅约2-5%的改善)。这些结果提供了证据,表明多模态大型语言模型的快速进步为完全自动化复杂手写工作流程的数字化提供了一个引人注目的途径,这在低收入和中等收入国家尤为相关。
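The gap between macro and weighted metrics reported above has a simple arithmetic reading: macro averaging gives every class equal weight, while weighted averaging follows class frequency. The sketch below uses made-up labels, not the paper's data, to show how fixing a rare class can move macro F1 a lot while weighted F1 barely changes. It is one plausible explanation, not one the abstract states.

```python
from collections import Counter


def per_class_f1(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    """F1 score for each label that appears in the truth or the predictions."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1: rare classes count as much as common ones."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Per-class F1 averaged with weights proportional to true class frequency."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


if __name__ == "__main__":
    y_true = ["no"] * 9 + ["yes"]   # hypothetical field with one rare answer
    y_pred = ["no"] * 10            # the rare answer is always missed
    print(f"macro F1:    {macro_f1(y_true, y_pred):.3f}")
    print(f"weighted F1: {weighted_f1(y_true, y_pred):.3f}")
```

If a better prompt recovers the single rare answer, both scores rise to 1.0, but macro F1 gains far more than weighted F1 because it started from a much lower value.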
cs.CV / 20 / 2604.16505
Predicting Blastocyst Formation in IVF: Integrating DINOv2 and Attention-Based LSTM on Time-Lapse Embryo Images
预测IVF中的胚泡形成:基于DINOv2和注意力机制LSTM的时序胚胎图像整合
Abstract
The selection of the optimal embryo for transfer is a critical yet challenging step in in vitro fertilization (IVF), primarily due to its reliance on the manual inspection of extensive time-lapse imaging data. A key obstacle in this process is predicting blastocyst formation from the limited number of daily images available. Many clinics also lack complete time-lapse systems, so full videos are often unavailable. In this study, we aimed to predict which embryos will develop into blastocysts using limited daily images from time-lapse recordings. We propose a novel hybrid model that combines DINOv2, a transformer-based vision model, with an enhanced long short-term memory (LSTM) network featuring a multi-head attention layer. DINOv2 extracts meaningful features from embryo images, and the LSTM model then uses these features to analyze embryo development over time and generate final predictions. We tested our model on a real dataset of 704 embryo videos. The model achieved 96.4% accuracy, surpassing existing methods. It also performs well with missing frames, making it valuable for many IVF laboratories with limited imaging systems. Our approach can assist embryologists in selecting better embryos more efficiently and with greater confidence.
Chinese Translation
在体外受精(IVF)中,选择最佳胚胎进行移植是一个关键但具有挑战性的步骤,主要由于其依赖于对大量时序成像数据的人工检查。这个过程中的一个主要障碍是从每日有限的图像中预测胚泡的形成。许多诊所也缺乏完整的时序系统,因此完整的视频通常不可用。在本研究中,我们旨在利用来自时序记录的有限每日图像预测哪些胚胎将发育为胚泡。我们提出了一种新颖的混合模型,将基于变换器的视觉模型DINOv2与增强型长短期记忆(LSTM)网络相结合,该网络具有多头注意力层。DINOv2从胚胎图像中提取有意义的特征,LSTM模型随后利用这些特征分析胚胎随时间的发展并生成最终预测。我们在一个包含704个胚胎视频的真实数据集上测试了我们的模型。该模型达到了96.4%的准确率,超越了现有方法。它在缺失帧的情况下也表现良好,这使其对许多拥有有限成像系统的IVF实验室具有重要价值。我们的方法可以帮助胚胎学家更高效、更有信心地选择更优质的胚胎。
cs.CV / 21 / 2604.16506
Medical thinking with multiple images
多图像下的医学思维
Abstract
Large language models perform well on many medical QA benchmarks, but real clinical reasoning often requires integrating evidence across multiple images rather than interpreting a single view. We introduce MedThinkVQA, an expert-annotated benchmark for thinking with multiple images, where models must interpret each image, combine cross-view evidence, and answer diagnostic questions with intermediate supervision and step-level evaluation. The dataset contains 8,067 cases, including 720 test cases, with an average of 6.62 images per case, substantially denser than prior work, whose expert-level benchmarks use at most 1.43 images per case. On the test set, the best closed-source models, Claude-4.6-Opus, Gemini-3-Pro, and GPT-5.2-xhigh, reach only 57.2%, 55.3%, and 54.9% accuracy, while GPT-5-mini and GPT-5-nano reach 39.7% and 30.8%. Strong open-source models lag behind, led by Qwen3.5-397B-A17B at 52.2% and Qwen3.5-27B at 50.6%. Further analysis identifies grounded multi-image reasoning as the main bottleneck: models often fail to extract, align, and compose evidence across views before higher-level inference can help. Providing expert single-image cues and cross-image summaries improves performance, whereas replacing them with self-generated intermediates reduces accuracy. Step-level analysis shows that over 70% of errors arise from image reading and cross-view integration. Scaling results further show that additional inference-time computation helps only when visual grounding is already reliable; when early evidence extraction is weak, longer reasoning yields limited or unstable gains and can amplify misread cues. These results suggest that the key challenge is not reasoning length alone, but reliable mechanisms for grounding, aligning, and composing distributed evidence across real-world multimodal clinical inputs.
Chinese Translation
大型语言模型在许多医学问答基准测试中表现良好,但真实的临床推理往往需要整合来自多个图像的证据,而不是仅仅解释单一视图。我们引入了 MedThinkVQA,这是一个专家注释的多图像思维基准,模型必须解释每个图像,结合跨视图证据,并在中间监督和逐步评估下回答诊断问题。该数据集包含 8,067 个案例,包括 720 个测试案例,每个案例平均有 6.62 张图像,远远超过之前的工作,其专家级基准每个案例最多使用 1.43 张图像。在测试集中,最佳的闭源模型 Claude-4.6-Opus、Gemini-3-Pro 和 GPT-5.2-xhigh 的准确率仅为 57.2%、55.3% 和 54.9%,而 GPT-5-mini 和 GPT-5-nano 的准确率分别为 39.7% 和 30.8%。强大的开源模型表现不佳,其中 Qwen3.5-397B-A17B 的准确率为 52.2%,Qwen3.5-27B 为 50.6%。进一步分析表明,基于多图像的推理是主要瓶颈:模型往往无法在更高层次的推理之前提取、对齐和组合跨视图的证据。提供专家的单图像提示和跨图像摘要可以提高性能,而用自生成的中间结果替代则会降低准确性。逐步分析显示,超过 70% 的错误源于图像读取和跨视图整合。结果进一步表明,额外的推理时间计算仅在视觉基础可靠时才有帮助;当早期证据提取较弱时,较长的推理时间带来的收益有限或不稳定,甚至可能加剧错误读取的提示。这些结果表明,关键挑战不仅在于推理的长度,而在于在真实世界多模态临床输入中可靠的证据基础、对齐和组合机制。
cs.CV / 22 / 2604.16512
Medial Axis Aware Learning of Signed Distance Functions
考虑中轴线的有符号距离函数学习
Abstract
We propose a novel variational method to compute a highly accurate global signed distance function (SDF) to a given point cloud. To this end, the jump set of the gradient of the SDF, which coincides with the medial axis of the surface, is explicitly taken into account through a higher-order variational formulation that enforces linear growth along the gradient direction away from this discontinuity set. The eikonal equation and the zero-level set of the SDF are enforced as constraints. To make this variational problem computationally tractable, a phase field approximation of Ambrosio-Tortorelli type is employed. The associated phase field function implicitly describes the medial axis. The method is implemented for surfaces represented by unoriented point clouds using neural network approximations of both the SDF and the phase field. Experiments demonstrate the method's accuracy both in the near field and globally. Quantitative and qualitative comparisons with other approaches show the advantages of the proposed method.
Chinese Translation
我们提出了一种新颖的变分方法,用于计算给定点云的高精度全局有符号距离函数(SDF)。为此,SDF梯度的跳跃集与表面的中轴线重合,通过更高阶的变分形式显式考虑该跳跃集,该形式强制在远离该不连续集的梯度方向上线性增长。我们将Eikonal方程和SDF的零水平集作为约束条件。为了使这个变分问题在计算上可行,采用了Ambrosio-Tortorelli类型的相场近似。相关的相场函数隐式描述了中轴线。该方法针对由无方向点云表示的表面进行了实现,使用神经网络对SDF和相场进行近似。实验结果表明该方法在近场和全局范围内的准确性。与其他方法的定量和定性比较显示了所提方法的优势。
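The link between the eikonal constraint and the medial axis can be illustrated on the simplest possible shape. The following is a minimal sketch, not the paper's method: it assumes a 2D circle whose signed distance function is known in closed form, and the helper names `sdf_circle` and `grad_norm` are hypothetical. Away from the centre the gradient norm equals 1, as the eikonal equation requires; at the centre, which is the medial axis of a circle, the gradient jumps and a central-difference estimate collapses to 0.

```python
import numpy as np

def sdf_circle(p, radius=1.0):
    """Signed distance from 2D point(s) p (..., 2) to a circle centred at the origin."""
    return np.linalg.norm(p, axis=-1) - radius

def grad_norm(f, p, h=1e-4):
    """Central-difference estimate of |grad f| at a single 2D point p."""
    p = np.asarray(p, dtype=float)
    g = np.empty(2)
    for i in range(2):
        e = np.zeros(2)
        e[i] = h  # step along coordinate axis i
        g[i] = (f(p + e) - f(p - e)) / (2.0 * h)
    return float(np.linalg.norm(g))

# Away from the medial axis the eikonal equation |grad u| = 1 holds.
print(grad_norm(sdf_circle, [0.3, 0.4]))
# At the centre (the medial axis of a circle) the gradient is discontinuous,
# so the symmetric difference cancels.
print(grad_norm(sdf_circle, [0.0, 0.0]))
```

A method that enforces |grad u| = 1 everywhere would be penalised at such points, which is one way to read the abstract's motivation for treating the jump set explicitly.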
cs.CV / 23 / 2604.16513
SynthPID: P&ID digitization from Topology-Preserving Synthetic Data
SynthPID:基于拓扑保持的合成数据进行管道和仪表图(P&ID)数字化
Abstract
Automating the digitization of Piping and Instrumentation Diagrams (P&IDs) into structured process graphs would unlock significant value in plant operations, yet progress is bottlenecked by a fundamental data problem: engineering drawings are proprietary, and the entire community shares a single public benchmark of just 12 annotated images. Prior attempts at synthetic augmentation have fallen short because template-based generators scatter symbols at random, producing graphs that bear little resemblance to real process plants and, accordingly, yield only approximately 33% edge detection accuracy under synth-only training. We argue the failure is structural rather than visual and address it by introducing SynthPID, a corpus of 665 synthetic P&IDs whose pipe topology is seeded directly from real drawings. Paired with a patch-based Relationformer adapted for high-resolution diagrams, a model trained on SynthPID alone achieves 63.8 +/- 3.1% edge mAP on PID2Graph OPEN100 without seeing a single real P&ID during training, closing within 8 pp of the real-data oracle. These gains hold up under a controlled comparison against the template-based regime, confirming that generation quality drives performance rather than model choice. A scaling study reveals that gains flatten beyond roughly 400 synthetic images, pointing to seed diversity as the binding constraint.
Chinese Translation
将管道和仪表图(P&ID)自动化数字化为结构化过程图将为工厂运营释放显著价值,但进展受到一个根本数据问题的瓶颈:工程图纸是专有的,整个社区仅共享一个包含12张标注图像的公共基准。之前的合成增强尝试未能成功,因为基于模板的生成器随机散布符号,产生的图形与真实的过程工厂几乎没有相似之处,因此在仅使用合成数据训练的情况下,边缘检测准确率仅约为33%。我们认为这一失败是结构性而非视觉性的,并通过引入SynthPID来解决这一问题,该数据集包含665个合成P&ID,其管道拓扑直接来源于真实图纸。配合适用于高分辨率图纸的基于补丁的Relationformer模型,仅在SynthPID上训练的模型在PID2Graph OPEN100上达到了63.8 +/- 3.1%的边缘mAP,且在训练期间未见过任何真实的P&ID,距离真实数据的oracle仅相差8个百分点。这些提升在与基于模板的方案进行控制比较时依然成立,确认生成质量驱动了性能,而非模型选择。一项扩展研究表明,超过约400张合成图像后,增益趋于平稳,指出种子多样性是限制因素。
cs.CV / 24 / 2604.16514
BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
BARD:通过高效的渐进块合并和阶段性蒸馏架起自回归与扩散视觉-语言模型之间的桥梁
Abstract
Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block diffusion VLM (dVLM) often leads to substantial quality degradation. In this work, we present BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a same-architecture, decoding-efficient dVLM. Our approach combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover performance lost at larger blocks. We further incorporate a mixed noise scheduler to improve robustness and token revision during denoising, and memory-friendly training to enable efficient training on long multimodal sequences. A key empirical finding is that direct autoregressive-to-diffusion distillation is poorly aligned and can even hurt performance, whereas distillation within the diffusion regime is consistently effective. Experimental results show that, with $\leq 4.4M$ data, BARD-VL transfers strong multimodal capability from Qwen3-VL to a large-block dVLM. Remarkably, BARD-VL establishes a new SOTA among comparable-scale open dVLMs on our evaluation suite at both 4B and 8B scales. At the same time, BARD-VL achieves up to $3\times$ decoding throughput speedup compared to the source model.
Chinese Translation
自回归视觉-语言模型(VLMs)具有强大的多模态能力,但其逐个标记解码导致了基本的推理瓶颈。扩散VLMs提供了一种更为并行的解码范式,但直接将预训练的自回归VLM转换为大块扩散VLM(dVLM)往往会导致显著的质量下降。在本研究中,我们提出了BARD,一个简单而有效的桥接框架,将预训练的自回归VLM转换为同一架构、解码高效的dVLM。我们的方法结合了渐进式监督块合并,逐步增大解码块的大小,以及从固定的小块扩散锚点进行阶段性内部dVLM蒸馏,以恢复在较大块中丢失的性能。我们进一步结合了混合噪声调度器,以提高去噪过程中的鲁棒性和标记修正能力,并采用内存友好的训练方法,以实现对长多模态序列的高效训练。一个关键的实证发现是,直接的自回归到扩散蒸馏并不匹配,甚至可能损害性能,而在扩散范畴内的蒸馏则始终有效。实验结果表明,在数据量不超过4.4M的情况下,BARD-VL将Qwen3-VL的强大多模态能力转移至大块dVLM。值得注意的是,BARD-VL在我们评估套件中,在4B和8B规模的可比规模开放dVLM中建立了新的SOTA。同时,BARD-VL的解码吞吐量速度相比源模型提高了多达3倍。
cs.CV / 25 / 2604.16515
Penny Wise, Pixel Foolish: Bypassing Price Constraints in Multimodal Agents via Visual Adversarial Perturbations
精打细算,像素愚蠢:通过视觉对抗扰动绕过多模态智能体的价格限制
Abstract
The rapid proliferation of Multimodal Large Language Models (MLLMs) has enabled mobile agents to execute high-stakes financial transactions, but their adversarial robustness remains underexplored. We identify Visual Dominance Hallucination (VDH), where imperceptible visual cues can override textual price evidence in screenshot-based, price-constrained settings and lead agents to irrational decisions. We propose PriceBlind, a stealthy white-box adversarial attack framework for controlled screenshot-based evaluation. PriceBlind exploits the modality gap in CLIP-based encoders via a Semantic-Decoupling Loss that aligns the image embedding with low-cost, value-associated anchors while preserving pixel-level fidelity. On E-ShopBench, PriceBlind achieves around 80% ASR in white-box evaluation; under a simplified single-turn coordinate-selection protocol, Ensemble-DI-FGSM transfers with roughly 35-41% ASR across GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet. We also show that robust encoders and Verify-then-Act defenses reduce ASR substantially, though with some clean-accuracy trade-off.
Chinese Translation
多模态大型语言模型(MLLMs)的快速普及使得移动智能体能够执行高风险的金融交易,但其对抗鲁棒性仍然未被充分探索。我们识别出视觉主导幻觉(Visual Dominance Hallucination, VDH),即在基于截图的价格限制环境中,微小的视觉线索可以覆盖文本价格证据,导致智能体做出不理性的决策。我们提出了PriceBlind,一个隐秘的白盒对抗攻击框架,用于控制的基于截图的评估。PriceBlind利用基于CLIP的编码器中的模态差距,通过语义解耦损失(Semantic-Decoupling Loss)将图像嵌入与低成本、价值相关的锚点对齐,同时保持像素级的保真度。在E-ShopBench上,PriceBlind在白盒评估中实现了约80%的攻击成功率(ASR);在简化的单轮坐标选择协议下,Ensemble-DI-FGSM在GPT-4o、Gemini-1.5-Pro和Claude-3.5-Sonnet上转移的ASR约为35-41%。我们还展示了鲁棒编码器和先验证后行动(Verify-then-Act)防御机制显著降低了ASR,尽管伴随有一定的清晰度准确性权衡。
cs.CV / 26 / 2604.16516
Operationalizing Fairness in Text-to-Image Models: A Survey of Bias, Fairness Audits and Mitigation Strategies
文本生成图像模型中的公平性操作化:偏见、公平性审计与缓解策略的调查
Abstract
Text-to-Image (T2I) generation models have been widely adopted across various industries, yet are criticized for frequently exhibiting societal stereotypes. While a growing body of research has emerged to evaluate and mitigate these biases, the field currently contends with conceptual ambiguity: terms like "bias" and "fairness" are not always clearly distinguished and often lack clear operational definitions. This paper provides a comprehensive systematic review of T2I fairness literature, organizing existing work into a taxonomy of bias types and fairness notions. We critically assess the gap between "target fairness" (normative ideals in T2I outputs) and "threshold fairness" (normative standards with actionable decision rules). Furthermore, we survey the landscape of mitigation strategies, ranging from prompt engineering to diffusion process manipulation. We conclude by proposing a new framework for operationalizing fairness that moves beyond descriptive metrics towards rigorous, target-based testing, offering an approach for more accountable generative AI development.
Chinese Translation
文本生成图像(T2I)模型已在各个行业得到广泛应用,但因频繁表现出社会刻板印象而受到批评。尽管越来越多的研究涌现出来以评估和缓解这些偏见,但该领域目前面临概念模糊的问题,例如“偏见”和“公平性”等术语并不总是被清晰区分,且往往缺乏明确的操作性定义。本文对T2I公平性文献进行了全面的系统性回顾,将现有研究组织成偏见类型和公平性概念的分类法。我们批判性地评估了“目标公平性”(T2I输出中的规范理想)与“阈值公平性”(具有可操作决策规则的规范标准)之间的差距。此外,我们调查了缓解策略的全景,从提示工程到扩散过程操控。最后,我们提出了一个新的公平性操作化框架,超越描述性指标,朝着严格的基于目标的测试迈进,为更具问责制的生成式人工智能发展提供了一种方法。
cs.CV / 27 / 2604.16517
SmoGVLM: A Small, Graph-enhanced Vision-Language Model
SmoGVLM:一种小型图增强视觉语言模型
Abstract
Large vision-language models (VLMs) achieve strong performance on multimodal tasks but often suffer from hallucination and poor grounding in knowledge-intensive reasoning. We propose SmoGVLM, a small, graph-enhanced VLM that uses Graph Neural Networks to integrate structured knowledge with visual and textual modalities. We investigate the effects of our method across a range of model sizes, from tiny (1.3B) to large (13B) models. Our results demonstrate that, when trained using our approach, a small model can achieve performance gains of up to 16.24% and surpass both larger VLMs and strong fine-tuned baselines. These results highlight the potential of structured knowledge augmentation for efficient, smaller-scale multimodal reasoning systems.
Chinese Translation
大型视觉语言模型(VLMs)在多模态任务中表现出色,但常常面临幻觉和在知识密集型推理中的基础不牢固的问题。我们提出了SmoGVLM,一种小型图增强VLM,利用图神经网络将结构化知识与视觉和文本模态相结合。我们研究了我们的方法在不同模型规模下的效果,从小型(1.3B)到大型(13B)模型。我们的结果表明,当采用我们的方法进行训练时,小型模型的性能提升可达16.24%,并超越了其更大模型的表现,优于更大的VLM和强大的微调基线。这些结果突显了结构化知识增强在高效、小规模多模态推理系统中的潜力。
cs.CV / 28 / 2604.16522
Fast Online 3D Multi-Camera Multi-Object Tracking and Pose Estimation
快速在线三维多摄像头多目标跟踪与姿态估计
Abstract
This paper proposes a fast and online method for jointly performing 3D multi-object tracking and pose estimation using multiple monocular cameras. Our algorithm requires only 2D bounding box and pose detections, eliminating the need for costly 3D training data or computationally expensive deep learning models. Our solution is an efficient implementation of a Bayes-optimal multi-object tracking filter, enhancing computational efficiency while maintaining accuracy. We demonstrate that our algorithm is significantly faster than state-of-the-art methods without compromising accuracy, using only publicly available pre-trained 2D detection models. We also illustrate the robust performance of our algorithm in scenarios where multiple cameras are intermittently disconnected or reconnected during operation.
Chinese Translation
本文提出了一种快速在线的方法,利用多个单目摄像头共同执行三维多目标跟踪和姿态估计。我们的算法仅需2D边界框和姿态检测,消除了对昂贵的3D训练数据或计算成本高昂的深度学习模型的需求。我们的解决方案是贝叶斯最优多目标跟踪滤波器的高效实现,增强了计算效率,同时保持了准确性。我们展示了我们的算法在不妥协准确性的情况下,显著快于最先进的方法,仅使用公开可用的预训练2D检测模型。我们还说明了我们的算法在多个摄像头在操作过程中间歇性断开或重新连接的场景下的稳健性能。
cs.CV / 29 / 2604.16523
Privacy-Preserving Semantic Segmentation without Key Management
无密钥管理的隐私保护语义分割
Abstract
This paper proposes a novel privacy-preserving semantic segmentation method that can use independent keys for each client and image. In the proposed method, the model creator and each client encrypt images using locally generated keys, and model training and inference are conducted on the encrypted images. To mitigate performance degradation, an image encryption method is applied to model training in addition to the generation of test images. In experiments, the effectiveness of the proposed method is confirmed on the Cityscapes dataset using a vision transformer-based model called SETR.
Chinese Translation
本文提出了一种新颖的隐私保护语义分割方法,该方法可以为每个客户端和图像使用独立的密钥。在所提出的方法中,模型创建者和每个客户端使用本地生成的密钥对图像进行加密,模型训练和推理在加密图像上进行。为了减轻性能下降,除了生成测试图像外,还在模型训练中应用了一种图像加密方法。在实验中,所提出方法的有效性在使用基于视觉变换器的模型(称为 SETR)的 Cityscapes 数据集上得到了验证。
cs.CV / 30 / 2604.16528
Expert-Annotated Embryo Image Dataset with Natural Language Descriptions for Evidence-Based Patient Communication in IVF
带有自然语言描述的专家注释胚胎图像数据集,用于 IVF 中基于证据的患者沟通
Abstract
Embryo selection is one of multiple crucial steps in in-vitro fertilization, commonly based on morphological assessment by clinical embryologists. Although artificial intelligence methods have demonstrated their potential to support embryo selection by automated embryo ranking or grading methods, the overall impact of AI-based solutions is still limited. This is mainly due to the required adaptation of automated solutions to custom clinical data, reliance on time lapse incubators and a lack of interpretability to understand AI reasoning. The modern, informed patient is questioning expert decisions, particularly if the treatment is not successful. Thus, evidence-based decision justification in tasks like embryo selection would support transparent decision making and respectful patient communication. To support this aim, we hereby present an expert-annotated dataset consisting of embryo images and corresponding morphological description using natural language. The description contains relevant information on embryonic cell cycle, developmental stage and morphological features. This dataset enables the finetuning of modern foundational vision-language models to learn and improve over time with high accuracy. Predicted embryo descriptions can then be leveraged to automatically extract scientific evidence from literature, facilitating well-informed, evidence-based decision-making and transparent communication with patients. Our proposed dataset supports research in language-based, interpretable, and transparent automated embryo assessment and has the potential to enhance the decision-making process and improve patient outcomes significantly over time.
Chinese Translation
胚胎选择是体外受精中多个关键步骤之一,通常基于临床胚胎学家的形态学评估。尽管人工智能方法已显示出通过自动胚胎排名或分级方法支持胚胎选择的潜力,但基于 AI 的解决方案的整体影响仍然有限。这主要是由于自动化解决方案需要适应定制的临床数据,依赖于时间推移孵化器,以及缺乏可解释性以理解 AI 的推理。现代知情患者对专家决策提出质疑,特别是在治疗不成功的情况下。因此,在胚胎选择等任务中基于证据的决策辩护将支持透明的决策制定和尊重的患者沟通。为支持这一目标,我们在此呈现一个专家注释的数据集,包含胚胎图像及其对应的自然语言形态学描述。描述中包含有关胚胎细胞周期、发育阶段和形态特征的相关信息。该数据集使现代基础视觉-语言模型能够进行微调,并随着时间的推移提高准确性。预测的胚胎描述可以被利用来自动提取文献中的科学证据,从而促进充分知情、基于证据的决策制定和与患者的透明沟通。我们提出的数据集支持基于语言的、可解释的和透明的自动胚胎评估研究,并有潜力显著改善决策过程和患者结果。
cs.CV / 31 / 2604.16532
Beyond Attack Success Rate: A Multi-Metric Evaluation of Adversarial Transferability in Medical Imaging Models
超越攻击成功率:医疗影像模型对抗性可转移性的多指标评估
Abstract
While deep learning systems are becoming increasingly prevalent in medical image analysis, their vulnerabilities to adversarial perturbations raise serious concerns for clinical deployment. These vulnerability evaluations largely rely on Attack Success Rate (ASR), a binary metric that indicates solely whether an attack is successful. However, the ASR metric does not account for other factors, such as perturbation strength, perceptual image quality, and cross-architecture attack transferability, and therefore, the interpretation is incomplete. This gap requires consideration, as complex, large-scale deep learning systems, including Vision Transformers (ViTs), are increasingly challenging the dominance of Convolutional Neural Networks (CNNs). These architectures learn differently, and it is unclear whether a single metric, e.g., ASR, can effectively capture adversarial behavior. To address this, we perform a systematic empirical study on four medical image datasets: PathMNIST, DermaMNIST, RetinaMNIST, and CheXpert. We evaluate seven models (VGG-16, ResNet-50, DenseNet-121, Inception-v3, DeiT, Swin Transformer, and ViT-B/16) against seven attack methods at five perturbation budgets, measuring ASR, Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and $L_2$ perturbation magnitude. Our findings show a consistent pattern: perceptual and distortion metrics are strongly associated with one another and exhibit minimal correlation with ASR. This applies to both CNNs and ViTs. The results demonstrate that ASR alone is an inadequate indicator of adversarial robustness and transferability. Consequently, we argue that a thorough assessment of adversarial risk in medical AI necessitates multi-metric frameworks that encompass not only the attack efficacy but also its methodology and associated overheads.
Chinese Translation
随着深度学习系统在医疗影像分析中的日益普及,其对对抗扰动的脆弱性引发了临床应用的严重担忧。这些脆弱性评估主要依赖于攻击成功率(Attack Success Rate, ASR),这一二元指标仅表示攻击是否成功。然而,ASR指标未考虑其他因素,如扰动强度、感知图像质量和跨架构攻击可转移性,因此其解释是不完整的。这个差距需要引起重视,因为包括视觉变换器(Vision Transformers, ViTs)在内的复杂大规模深度学习系统正逐渐挑战卷积神经网络(Convolutional Neural Networks, CNNs)的主导地位。这些架构的学习方式不同,目前尚不清楚单一指标(如ASR)是否能够有效捕捉对抗性行为。为了解决这一问题,我们对四个医疗影像数据集(PathMNIST、DermaMNIST、RetinaMNIST和CheXpert)进行了系统的实证研究。我们评估了七个模型(VGG-16、ResNet-50、DenseNet-121、Inception-v3、DeiT、Swin Transformer和ViT-B/16),针对七种攻击方法在五个扰动预算下进行评估,测量ASR、峰值信噪比(Peak Signal-to-Noise Ratio, PSNR)、结构相似性指数(Structural Similarity Index Measure, SSIM)和$L_2$扰动幅度。我们的研究结果显示出一致的模式:感知和失真指标之间存在强关联,并且与ASR的相关性较小。这适用于CNN和ViT。结果表明,仅依靠ASR作为对抗性鲁棒性和可转移性的指标是不够的。因此,我们认为,对医疗人工智能的对抗风险进行全面评估需要多指标框架,不仅涵盖攻击效果,还包括其方法论和相关开销。
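The distortion metrics named in the abstract can be computed directly from a clean image and its adversarial counterpart. The sketch below is illustrative only and is not taken from the paper: it assumes images are float arrays scaled to [0, 1], the function names `l2_norm` and `psnr` are hypothetical, and SSIM is omitted because it needs a windowed implementation from an image-processing library.

```python
import numpy as np

def l2_norm(clean, adv):
    """L2 magnitude of the perturbation adv - clean."""
    return float(np.linalg.norm((adv - clean).ravel()))

def psnr(clean, adv, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB; assumes pixel values in [0, max_val]."""
    mse = float(np.mean((adv - clean) ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

# Toy example: a uniform perturbation of 0.1 on a 4x4 image.
clean = np.zeros((4, 4))
adv = clean + 0.1
print(l2_norm(clean, adv), psnr(clean, adv))
```

Both values depend only on the perturbation, not on whether the attack flipped the model's prediction, which is consistent with the abstract's observation that distortion metrics and ASR can be weakly correlated.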
cs.CV / 32 / 2604.16540
PoInit-of-View: Poisoning Initialization of Views Transfers Across Multiple 3D Reconstruction Systems
PoInit-of-View:跨多个3D重建系统的视图初始化中毒
Abstract
Poisoning input views of 3D reconstruction systems has been recently studied. However, we identify that existing studies simply backpropagate adversarial gradients through the 3D reconstruction pipeline as a whole, without uncovering the new vulnerability rooted in specific modules of the 3D reconstruction pipeline. In this paper, we argue that the structure-from-motion (SfM) initialization, as the geometric core of many widely used reconstruction systems, can be targeted to achieve transferable poisoning effects across diverse 3D reconstruction systems. To this end, we propose PoInit-of-View, which optimizes adversarial perturbations to intentionally introduce cross-view gradient inconsistencies at projections of corresponding 3D points. These inconsistencies disrupt keypoint detection and feature matching, thereby corrupting pose estimation and triangulation within SfM, eventually resulting in low-quality rendered views. We also provide a theoretical analysis that connects cross-view inconsistency to correspondence collapse. Experimental results demonstrate the effectiveness of our PoInit-of-View on diverse 3D reconstruction systems and datasets, surpassing the single-view baseline by 25.1% in PSNR and 16.5% in SSIM in black-box transfer settings, such as 3DGS to NeRF.
Chinese Translation
近期对3D重建系统的输入视图中毒进行了研究。然而,我们发现现有研究仅仅是将对抗性梯度通过整个3D重建管道反向传播,而没有揭示根植于3D重建管道特定模块的新脆弱性。本文论证了运动重建(Structure-from-Motion, SfM)初始化作为许多广泛使用的重建系统的几何核心,可以被针对性地攻击,以实现跨多种3D重建系统的可转移中毒效果。为此,我们提出了PoInit-of-View,优化对抗性扰动,故意在对应3D点的投影中引入跨视图梯度不一致。这些不一致会干扰关键点检测和特征匹配,从而破坏SfM中的姿态估计和三角测量,最终导致低质量的渲染视图。我们还提供了一个理论分析,将跨视图不一致性与对应关系崩溃联系起来。实验结果表明,在多种3D重建系统和数据集上,PoInit-of-View的有效性超越了单视图基线,在黑箱转移设置中(例如从3DGS到NeRF)在峰值信噪比(PSNR)上提高了25.1%,在结构相似性指数(SSIM)上提高了16.5%。
cs.CV / 33 / 2604.16541
BOOKAGENT: Orchestrating Safety-Aware Visual Narratives via Multi-Agent Cognitive Calibration
BOOKAGENT:通过多智能体认知校准协调安全意识的视觉叙事
Abstract
Recent advancements in Large Generative Models (LGMs) have revolutionized multi-modal generation. However, generating illustrated storybooks remains an open challenge, where prior works mainly decompose this task into separate stages, and thus, holistic multi-modal grounding remains limited. Besides, while safety alignment is studied for text- or image-only generation, existing works rarely integrate child-specific safety constraints into narrative planning and sequence-level multi-modal verification. To address these limitations, we propose BookAgent, a safety-aware multi-agent collaboration framework designed for high-quality, safety-aware visual narratives. Different from prior story visualization models that assume a fixed storyline sequence, BookAgent targets end-to-end storybook synthesis from a user draft by jointly planning, scripting, illustrating, and globally repairing inconsistencies. To ensure precise multi-modal grounding, BookAgent dynamically calibrates page-level alignment between textual scripts and visual layouts. Furthermore, BookAgent calibrates holistic consistency from the temporal dimension, by verifying-then-rectifying global inconsistencies in character identity and storytelling logic. Extensive experiments demonstrate that BookAgent significantly outperforms current methods in narrative coherence, visual consistency, and safety compliance, offering a robust paradigm for reliable agents in complex multi-modal creation. The implementation will be publicly released at https://github.com/bogao-code/BookAgent/tree/main.
Chinese Translation
近期大型生成模型(Large Generative Models, LGMs)的进展彻底改变了多模态生成。然而,生成插图故事书仍然是一个未解决的挑战,之前的研究主要将这一任务分解为不同阶段,因此整体多模态基础仍然有限。此外,虽然安全对齐已在文本或图像生成中得到研究,但现有工作很少将儿童特定的安全约束整合到叙事规划和序列级多模态验证中。为了解决这些限制,我们提出了BookAgent,一个安全意识的多智能体协作框架,旨在生成高质量、安全意识的视觉叙事。与之前假设固定故事情节序列的故事可视化模型不同,BookAgent旨在通过联合规划、编剧、插图和全局修复不一致性,从用户草稿中进行端到端的故事书合成。为了确保精确的多模态基础,BookAgent动态校准文本脚本与视觉布局之间的页面级对齐。此外,BookAgent还从时间维度校准整体一致性,通过验证然后修正角色身份和叙事逻辑中的全局不一致性。大量实验表明,BookAgent在叙事连贯性、视觉一致性和安全合规性方面显著优于现有方法,为复杂多模态创作中的可靠智能体提供了一个强大的范式。该实现将公开发布在 https://github.com/bogao-code/BookAgent/tree/main。
cs.CV / 34 / 2604.16546
A B-Spline Function Based 3D Point Cloud Unwrapping Scheme for 3D Fingerprint Recognition and Identification
基于B样条函数的3D点云展开方案用于3D指纹识别与鉴定
Abstract
Three-dimensional (3D) fingerprint recognition and identification offer several advantages over traditional two-dimensional (2D) recognition systems. The contactless nature of 3D fingerprints enhances hygiene and security, reducing the risk of contamination and spoofing. In addition to surface ridge and valley patterns, 3D fingerprints capture depth, curvature, and shape information, enabling the development of more precise and robust authentication systems. Despite recent advancements, significant challenges remain. The topological height of fingerprint pixels complicates the extraction of ridge and valley patterns. Furthermore, registration issues limit the acquisition process, requiring consistent direction and orientation across all samples. To address these challenges, this paper introduces a method that unwraps 3D fingerprints, represented as 3D point clouds, using B-spline curve fitting to mitigate height variation and reduce registration limitations. The unwrapped point cloud is then converted into a grayscale image by mapping the relative heights of the points. This grayscale image is subsequently used for recognition through conventional 2D fingerprint identification methods. The proposed approach demonstrated superior performance in 3D fingerprint recognition, achieving Equal Error Rates (EERs) of 0.2072%, 0.26%, and 0.22% across three experiments, outperforming existing methods. Additionally, the method surpassed the 3D fingerprint flattening technique in both recognition and identification during cross-session experiments, achieving an EER of 1.50% when fingerprints with varying registrations were included.
Chinese Translation
三维(3D)指纹识别与鉴定相较于传统的二维(2D)识别系统具有多种优势。3D指纹的无接触特性增强了卫生和安全性,降低了污染和伪造的风险。除了表面脊线和谷线模式外,3D指纹还捕捉了深度、曲率和形状信息,从而能够开发出更精确和稳健的认证系统。尽管近期取得了一些进展,但仍面临重大挑战。指纹像素的拓扑高度使得脊线和谷线模式的提取变得复杂。此外,配准问题限制了采集过程,要求所有样本在方向和方位上的一致性。为了解决这些挑战,本文提出了一种方法,通过B样条曲线拟合展开3D指纹(以3D点云表示),以减轻高度变化并减少配准限制。展开后的点云随后通过映射点的相对高度转换为灰度图像。该灰度图像随后用于通过传统的2D指纹识别方法进行识别。所提出的方法在3D指纹识别中表现出优越的性能,在三次实验中实现了0.2072%、0.26%和0.22%的等错误率(EER),超越了现有方法。此外,该方法在跨会话实验中在识别和鉴定方面均优于3D指纹平坦化技术,当包含具有不同配准的指纹时,达到了1.50%的EER。
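For readers unfamiliar with the Equal Error Rate reported above, the following is a minimal sketch of how it is commonly estimated from match scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the convention that higher scores mean a better match, and the toy scores are all assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Estimate the EER by scanning thresholds over all observed scores.

    A comparison is accepted when its score is >= threshold.
    FAR = fraction of impostor scores accepted,
    FRR = fraction of genuine scores rejected.
    The EER is taken where |FAR - FRR| is smallest.
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        far = float(np.mean(impostor >= t))
        frr = float(np.mean(genuine < t))
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

# Toy scores: one genuine pair scores low and one impostor pair scores high.
print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.5]))
```

With only a handful of scores the estimate is coarse; published EERs such as those in the abstract are computed over many thousands of comparisons, often with interpolation between thresholds.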
cs.CV / 35 / 2604.16552
Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
通过自回归3D扩散从文本共同生成布局和形状
Abstract
Recent text-to-scene generation approaches have largely reduced the manual effort required to create 3D scenes. However, they focus on generating either a scene layout or objects, and few generate both. The generated scene layout is often simple even with an LLM's help. Moreover, the generated scene is often inconsistent with the text input that contains non-trivial descriptions of the shape, appearance, and spatial arrangement of the objects. We present a new paradigm of sequential text-to-scene generation and propose a novel generative model for interactive scene creation. At the core is a 3D autoregressive diffusion model, 3D-ARD+, which unifies autoregressive generation over a multimodal token sequence with diffusion generation of next-object 3D latents. To generate the next object, the model uses one autoregressive step to generate the coarse-grained 3D latents in the scene space, conditioned on both the text instructions seen so far and the already synthesized 3D scene. It then uses a second step to generate the 3D latents in the smaller object space, which can be decoded into fine-grained object geometry and appearance. We curate a large dataset of 230K indoor scenes with paired text instructions for training. We evaluate a 7B 3D-ARD+ on challenging scenes and show that the model can generate and place objects following the non-trivial spatial layout and semantics prescribed by the text instructions.
Chinese Translation
最近的文本到场景生成方法在很大程度上减少了创建3D场景所需的手动工作。然而,它们的重点要么是生成场景布局,要么是生成对象,很少有方法同时生成两者。即使在大型语言模型(LLM)的帮助下,生成的场景布局通常也很简单。此外,生成的场景往往与包含对象形状、外观和空间排列的非平凡描述的文本输入不一致。我们提出了一种新的顺序文本到场景生成范式,并提出了一种新颖的生成模型用于交互式场景创建。其核心是3D自回归扩散模型3D-ARD+,该模型统一了对多模态标记序列的自回归生成和下一个对象3D潜变量的扩散生成。为了生成下一个对象,该模型使用一步自回归生成场景空间中的粗粒度3D潜变量,条件是当前的文本指令和已经合成的3D场景。然后,它使用第二步在较小的对象空间中生成3D潜变量,这些潜变量可以解码为细粒度的对象几何形状和外观。我们整理了一个包含23万对文本指令的大型室内场景数据集用于训练。我们在具有挑战性的场景上评估了7B 3D-ARD+,并展示了该模型能够生成和放置遵循文本指令所规定的非平凡空间布局和语义的对象。
cs.CV / 36 / 2604.16554
PA-TCNet: Pathology-Aware Temporal Calibration with Physiology-Guided Target Refinement for Cross-Subject Motor Imagery EEG Decoding in Stroke Patients
PA-TCNet:具有生理引导目标精细化的病理感知时间校准框架,用于中风患者的跨主体运动想象脑电图解码
Abstract
Cross-subject electroencephalography (EEG) decoding of motor imagery (MI) for brain-computer interfaces (BCIs) in stroke patients is essential for motor rehabilitation, yet lesion-related abnormal temporal dynamics and pronounced inter-patient heterogeneity often undermine generalization. Existing adaptation methods are easily misled by pathological slow-wave activity and unstable target-domain pseudo-labels. To address this challenge, we propose PA-TCNet, a pathology-aware temporal calibration framework with physiology-guided target refinement for stroke motor imagery decoding. PA-TCNet integrates two coordinated components. The Pathology-aware Rhythmic State Mamba (PRSM) module decomposes EEG spatiotemporal features into slowly varying rhythmic context and fast transient perturbations, injecting the fused pathological context into selective state propagation to more effectively capture abnormal temporal dynamics. The Physiology-Guided Target Calibration (PGTC) module constructs source-domain sensorimotor region-of-interest templates, imposing physiological consistency constraints and dynamically refining target-domain pseudo-labels, thereby improving adaptation reliability. Leave-one-subject-out experiments on two independent stroke EEG datasets, XW-Stroke and 2019-Stroke, yielded mean accuracies of 66.56% and 72.75%, respectively, outperforming state-of-the-art baselines. These results indicate that jointly modeling pathological temporal dynamics and physiology-constrained pseudo-supervision can provide more robust cross-subject initialization for personalized post-stroke MI-BCI rehabilitation. The implemented code is available at https://github.com/wxk1224/PA-TCNet.
Chinese Translation
中风患者跨主体的运动想象(MI)脑-计算机接口(BCI)脑电图(EEG)解码对于运动康复至关重要,但与病灶相关的异常时间动态和显著的患者间异质性常常削弱其泛化能力。现有的适应方法容易受到病理性慢波活动和不稳定的目标领域伪标签的误导。为了解决这一挑战,我们提出了PA-TCNet,这是一种具有生理引导目标精细化的病理感知时间校准框架,用于中风运动想象解码。PA-TCNet集成了两个协调的组件。病理感知节律状态模块(PRSM)将EEG时空特征分解为缓慢变化的节律背景和快速瞬态扰动,将融合的病理背景注入选择性状态传播中,以更有效地捕捉异常的时间动态。生理引导目标校准模块(PGTC)构建源领域传感器运动区域的兴趣模板,施加生理一致性约束并动态精细化目标领域伪标签,从而提高适应的可靠性。在两个独立的中风EEG数据集XW-Stroke和2019-Stroke上进行的留一主体实验分别获得了66.56%和72.75%的平均准确率,超越了现有的最先进基线。这些结果表明,联合建模病理时间动态和生理约束的伪监督可以为个性化的中风后MI-BCI康复提供更稳健的跨主体初始化。实现的代码可在https://github.com/wxk1224/PA-TCNet获取。
cs.CV / 37 / 2604.16562
See Through the Noise: Improving Domain Generalization in Gaze Estimation
透过噪声:改善注视估计中的领域泛化
Abstract
Generalizable gaze estimation methods have garnered increasing attention due to their critical importance in real-world applications and have achieved significant progress. However, they often overlook the effect of label noise, arising from the inherent difficulty of acquiring precise gaze annotations, on model generalization performance. In this paper, we are the first to comprehensively investigate the negative effects of label noise on generalization in gaze estimation. Further, we propose a novel solution, called See-Through-Noise (SeeTN) framework, which improves generalization from a novel perspective of mitigating label noise. Specifically, we propose to construct a semantic embedding space via a prototype-based transformation to preserve a consistent topological structure between gaze features and continuous labels. We then measure feature-label affinity consistency to distinguish noisy from clean samples, and introduce a novel affinity regularization in the semantic manifold to transfer gaze-related information from clean to noisy samples. Our proposed SeeTN promotes semantic structure alignment and enforces domain-invariant gaze relationships, thereby enhancing robustness against label noise. Extensive experiments demonstrate that our SeeTN effectively mitigates the adverse impact of source-domain noise, leading to superior cross-domain generalization without compromising the source-domain accuracy, and highlight the importance of explicitly handling noise in generalized gaze estimation.
Chinese Translation
可泛化的注视估计方法因其在现实应用中的关键重要性而受到越来越多的关注,并取得了显著进展。然而,它们往往忽视了标签噪声的影响,标签噪声源于获取精确注视标注的固有困难,这对模型的泛化性能产生了影响。在本文中,我们首次全面研究了标签噪声对注视估计中泛化的负面影响。此外,我们提出了一种新颖的解决方案,称为透过噪声(See-Through-Noise, SeeTN)框架,从减轻标签噪声的新视角改善泛化。具体而言,我们提出通过基于原型的变换构建语义嵌入空间,以保持注视特征与连续标签之间一致的拓扑结构。然后,我们测量特征-标签亲和一致性,以区分噪声样本和干净样本,并在语义流形中引入一种新颖的亲和正则化,将与注视相关的信息从干净样本转移到噪声样本。我们提出的SeeTN促进了语义结构的对齐,并强制执行领域不变的注视关系,从而增强了对标签噪声的鲁棒性。大量实验表明,我们的SeeTN有效减轻了源领域噪声的不利影响,实现了优越的跨领域泛化,而不损害源领域的准确性,并强调了在广义注视估计中显式处理噪声的重要性。
cs.CV / 38 / 2604.16563
Classification of systolic murmurs in heart sounds using multiresolution complex Gabor dictionary and vision transformer
基于多分辨率复数Gabor字典和视觉变换器的心音收缩杂音分类
Abstract
Systolic murmurs are extra heart sounds that occur during the contraction phase of the cardiac cycle, often indicating heart abnormalities caused by turbulent blood flow. Their intensity, pitch, and quality vary, requiring precise identification for the accurate diagnosis of cardiac disorders. This study presents an automatic classification system for systolic murmurs using a feature extraction module, followed by a classification model. The feature extraction module employs complex orthogonal matching pursuit to project single or multiple murmur segments onto a redundant dictionary composed of multiresolution complex Gabor basis functions (GBFs). The resulting projection weights are split and reshaped into variable-resolution time-frequency feature matrices. Processing multiple segments of a single recording using a shared dictionary mitigates murmur variability. This is achieved by learning the weights for each segment while enforcing that they correspond to the same set of basis functions in the dictionary, promoting consistent time-frequency feature matrices. The classification model is built based on a vision transformer to process multiple input matrices of different resolutions by passing each through a convolutional neural network for patch tokenization. All embedding tokens are then concatenated to form a matrix and forwarded to an encoder layer that includes multihead attention, residual connections, and a convolutional network with a kernel size of one. This integration of multiresolution feature extraction with transformer-based feature classification enhances the accuracy and reliability of heart murmur identification. An experimental analysis of four types of systolic murmurs from the CirCor DigiScope dataset demonstrates the effectiveness of the system, achieving a classification accuracy of 95.96%.
Chinese Translation
收缩杂音是在心脏周期的收缩阶段出现的额外心音,通常表明由于湍流血流引起的心脏异常。其强度、音调和质量各异,要求精确识别以便于准确诊断心脏疾病。本研究提出了一种自动分类收缩杂音的系统,该系统包括一个特征提取模块和一个分类模型。特征提取模块采用复数正交匹配追踪将单个或多个杂音片段投影到由多分辨率复数Gabor基函数(GBFs)组成的冗余字典上。所得的投影权重被拆分并重塑为可变分辨率的时频特征矩阵。通过使用共享字典处理单个录音的多个片段,减轻了杂音的变异性。这是通过学习每个片段的权重,同时强制它们对应于字典中的同一组基函数来实现的,从而促进了一致的时频特征矩阵的生成。分类模型基于视觉变换器构建,通过卷积神经网络进行补丁标记化,处理不同分辨率的多个输入矩阵。所有嵌入标记随后被连接形成一个矩阵,并传递到包含多头注意力、残差连接和核大小为1的卷积网络的编码器层。多分辨率特征提取与基于变换器的特征分类的结合提高了心脏杂音识别的准确性和可靠性。对CirCor DigiScope数据集中四种类型的收缩杂音的实验分析表明该系统的有效性,分类准确率达到95.96%。
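The projection step described in this abstract can be illustrated with a much simpler relative of the method. The sketch below uses plain real-valued matching pursuit over a small multiresolution Gabor dictionary; it is not the complex orthogonal matching pursuit the authors use, and the scales, frequencies, and shift step are illustrative assumptions rather than values from the paper.

```python
import math

def gabor_atom(n, center, scale, freq):
    """Real Gabor atom: Gaussian window times a cosine, scaled to unit energy."""
    atom = [
        math.exp(-0.5 * ((t - center) / scale) ** 2)
        * math.cos(2 * math.pi * freq * (t - center))
        for t in range(n)
    ]
    norm = math.sqrt(sum(a * a for a in atom))
    return [a / norm for a in atom]

def build_dictionary(n, scales, freqs, step):
    """Redundant dictionary: one atom per (center, scale, freq) combination."""
    params = [(c, s, f) for s in scales for f in freqs for c in range(0, n, step)]
    return params, [gabor_atom(n, c, s, f) for (c, s, f) in params]

def matching_pursuit(signal, atoms, n_iter):
    """Greedy decomposition: pick the atom most correlated with the residual."""
    residual = list(signal)
    picks = []
    for _ in range(n_iter):
        corrs = [sum(r * a for r, a in zip(residual, atom)) for atom in atoms]
        best = max(range(len(atoms)), key=lambda k: abs(corrs[k]))
        picks.append((best, corrs[best]))
        residual = [r - corrs[best] * a for r, a in zip(residual, atoms[best])]
    return picks, residual

# Hypothetical toy setup: the signal is a single scaled dictionary atom.
n = 64
params, atoms = build_dictionary(n, scales=[2.0, 4.0, 8.0], freqs=[0.05, 0.1, 0.2], step=8)
target = params.index((32, 4.0, 0.1))
signal = [3.0 * a for a in atoms[target]]
picks, residual = matching_pursuit(signal, atoms, n_iter=1)
```

In this toy case the first pick recovers the planted atom and its weight. The weights selected per scale are the kind of quantity that could then be arranged into time-frequency matrices, although the exact reshaping used in the paper is not specified in the abstract.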
cs.CV / 39 / 2604.16577
Multilevel neural networks with dual-stage feature fusion for human activity recognition
具有双阶段特征融合的多级神经网络用于人类活动识别
Abstract
Human activity recognition (HAR) refers to the process of identifying human actions and activities using data collected from sensors. Neural networks, such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, convolutional LSTM, and their hybrid combinations, have demonstrated exceptional performance in various research domains. Developing a multilevel individual or hybrid model for HAR involves strategically integrating multiple networks to capitalize on their complementary strengths. The structural arrangement of these components is a critical factor influencing the overall performance. This study explores a novel framework of a two-level network architecture with dual-stage feature fusion: late fusion, which combines the outputs from the first network level, and intermediate fusion, which integrates the features from both the first and second levels. We evaluated $15$ different network architectures of CNNs, LSTMs, and convolutional LSTMs, incorporating late fusion with and without intermediate fusion, to identify the optimal configuration. Experimental evaluation on two public benchmark datasets demonstrates that architectures incorporating both late and intermediate fusion achieve higher accuracy than those relying on late fusion alone. Moreover, the optimal configuration outperforms baseline models, thereby validating its effectiveness for HAR.
Chinese Translation
人类活动识别(HAR)是指利用传感器收集的数据识别人的动作和活动的过程。神经网络,如卷积神经网络(CNN)、长短期记忆网络(LSTM)、卷积LSTM及其混合组合,在多个研究领域表现出色。为HAR开发一个多级的个体或混合模型涉及战略性地整合多个网络,以利用它们的互补优势。这些组件的结构安排是影响整体性能的关键因素。本研究探讨了一种新颖的双级网络架构框架,采用双阶段特征融合:后融合(late fusion),即结合第一个网络层的输出,以及中间融合(intermediate fusion),即整合第一层和第二层的特征。我们评估了15种不同的CNN、LSTM和卷积LSTM网络架构,结合后融合和有无中间融合,以确定最佳配置。在两个公共基准数据集上的实验评估表明,结合后融合和中间融合的架构比仅依赖后融合的架构具有更高的准确性。此外,最佳配置在性能上优于基线模型,从而验证了其在HAR中的有效性。
cs.CV / 40 / 2604.16582
Camo-M3FD: A New Benchmark Dataset for Cross-Spectral Camouflaged Pedestrian Detection
Camo-M3FD:一种用于跨光谱伪装行人检测的新基准数据集
Abstract
Pedestrian detection is fundamental to autonomous driving, robotics, and surveillance. Despite progress in deep learning, reliable identification remains challenging due to occlusions, cluttered backgrounds, and degraded visibility. While multispectral detection, which combines visible and thermal sensors, mitigates poor visibility, the challenge of camouflaged pedestrians remains largely unexplored. Existing Camouflaged Object Detection (COD) benchmarks focus on biological species, leaving a gap in safety-critical human detection where targets blend into their surroundings. To address this, we introduce Camo-M3FD (derived from the M3FD dataset), a novel benchmark for cross-spectral camouflaged pedestrian detection, consisting of registered visible-thermal image pairs. The dataset is curated using quantitative metrics to ensure high foreground-background similarity. We provide high-quality pixel-level masks and establish a standardized evaluation framework using state-of-the-art COD models. Our results demonstrate that while thermal signals provide indispensable localization cues, multispectral fusion is essential for refining structural details. Camo-M3FD serves as a foundational resource for developing robust and safety-critical detection systems. The dataset is available on GitHub: https://cod-espol.github.io/Camo-M3FD/
Chinese Translation
行人检测是自动驾驶、机器人技术和监控的基础。尽管深度学习取得了一定进展,但由于遮挡、杂乱背景和能见度下降,可靠识别仍然具有挑战性。虽然多光谱检测(结合可见光和热成像传感器)可以缓解能见度差的问题,但伪装行人的挑战仍然在很大程度上未被探索。现有的伪装物体检测(COD)基准主要集中在生物物种上,导致在安全关键的人类检测中存在空白,因为目标与周围环境融为一体。为了解决这一问题,我们引入了Camo-M3FD(源自M3FD数据集),这是一个用于跨光谱伪装行人检测的新基准,包含配准的可见光-热成像图像对。该数据集使用定量指标进行策划,以确保高前景-背景相似性。我们提供高质量的像素级掩膜,并建立了一个标准化的评估框架,使用最先进的COD模型。我们的结果表明,尽管热信号提供了不可或缺的定位线索,但多光谱融合对于细化结构细节至关重要。Camo-M3FD为开发稳健且安全关键的检测系统提供了基础资源。该数据集可在GitHub上获取:https://cod-espol.github.io/Camo-M3FD/
cs.CV / 41 / 2604.16587
Real-Time Visual Attribution Streaming in Thinking Model
思维模型中的实时视觉归因流
Abstract
We present an amortized framework for real-time visual attribution streaming in multimodal thinking models. When these models generate code from a screenshot or solve math problems from images, their long reasoning traces should be grounded in visual evidence. However, verifying this reliance is challenging: faithful causal methods require costly repeated backward passes or perturbations, while raw attention maps offer instant access but lack causal validity. To resolve this, we introduce an amortized approach that learns to estimate the causal effects of semantic regions directly from the rich signals encoded in attention features. Across five diverse benchmarks and four thinking models, our approach achieves faithfulness comparable to exhaustive causal methods while enabling visual attribution streaming, where users observe grounding evidence as the model reasons, not after. Our results demonstrate that real-time, faithful attribution in multimodal thinking models is achievable through lightweight learning, not brute-force computation.
Chinese Translation
我们提出了一种用于多模态思维模型的实时视觉归因流的摊销框架。当这些模型从截图生成代码或从图像中解决数学问题时,它们的长推理轨迹应当基于视觉证据。然而,验证这种依赖关系是具有挑战性的:忠实的因果方法需要耗时的重复反向传播或扰动,而原始注意力图虽然提供了即时访问,但缺乏因果有效性。为了解决这个问题,我们引入了一种摊销方法,该方法学习直接从注意力特征中编码的丰富信号中估计语义区域的因果效应。在五个不同的基准测试和四个思维模型中,我们的方法在实现与详尽因果方法相当的忠实度的同时,支持视觉归因流,使用户在模型推理时观察到基础证据,而不是事后观察。我们的结果表明,通过轻量级学习,而非强力计算,实现多模态思维模型中的实时、忠实归因是可行的。
cs.CV / 42 / 2604.16588
MambaKick: Early Penalty Direction Prediction from HAR Embeddings
MambaKick:基于人类动作识别嵌入的早期罚球方向预测
Abstract
Penalty kicks in soccer are decided under extreme time constraints, where goalkeepers benefit from anticipating shot direction from the kicker's motion before or around ball contact. In this paper, MambaKick is presented as a learning-based framework for penalty direction prediction that leverages pretrained human action recognition (HAR) embeddings extracted from contact-centered short video segments and combines them with a lightweight temporal predictor. Rather than relying on explicit kinematic reconstruction or handcrafted biomechanical features, the approach reuses transferable spatiotemporal representations and utilizes selective state-space models (Mamba) for efficient sequence aggregation. Simple contextual metadata (e.g., field side and footedness) are also considered as complementary cues that may reduce ambiguity in real-world footage. Across a range of HAR backbones, MambaKick consistently improves or matches strong embedding baselines, achieving up to 53.1% accuracy for three classes and 64.5% for two classes under the proposed methodology. Overall, the results indicate that combining pretrained HAR representations with efficient state-space temporal modeling is a practical direction for low-latency intention prediction in real-world sports video. The code will be available on GitHub: https://github.com/hvelesaca/MambaKick/
Chinese Translation
足球中的罚球在极端时间限制下进行,守门员可以通过在球接触前或接触时预测踢球者的动作方向来获益。本文提出了MambaKick作为一种基于学习的罚球方向预测框架,该框架利用从以接触为中心的短视频片段中提取的预训练人类动作识别(HAR)嵌入,并将其与轻量级时间预测器相结合。该方法并不依赖于显式的运动学重建或手工制作的生物力学特征,而是重用可转移的时空表示,并利用选择性状态空间模型(Mamba)进行高效的序列聚合。简单的上下文元数据(例如,场地侧和用脚习惯)也被视为可以减少现实世界视频中歧义的补充线索。在多种HAR骨干网络的测试中,MambaKick始终提高或匹配强嵌入基线,在所提出的方法下,对于三类任务达到了53.1%的准确率,对于两类任务达到了64.5%的准确率。总体而言,结果表明,将预训练的HAR表示与高效的状态空间时间建模相结合,是实现现实世界体育视频中低延迟意图预测的一个实际方向。代码将发布在GitHub上:https://github.com/hvelesaca/MambaKick/
cs.CV / 43 / 2604.16609
IncepDeHazeGAN: Novel Satellite Image Dehazing
IncepDeHazeGAN:新型卫星图像去雾
Abstract
Dehazing is a technique in computer vision for enhancing the visual quality of images captured in cloudy or foggy conditions. Dehazing helps to recover clear, high-quality images from haze-affected remote sensing data. In this study, we introduce IncepDeHazeGAN, a novel Generative Adversarial Network (GAN) that combines Inception blocks with multi-layer feature fusion for the task of single-image dehazing. The Inception blocks allow for multi-scale feature extraction, while the multi-layer feature fusion design achieves efficient reuse of features, as the features extracted at different convolution layers are fused several times. The Grad-CAM XAI technique has been applied to our network, highlighting the regions the network focuses on for dehazing and its adaptation to different haze conditions. Experiments demonstrate that our network achieves state-of-the-art results on several datasets.
Chinese Translation
去雾是计算机视觉中的一种技术,用于增强在多云或雾霾条件下捕获的图像的视觉质量。去雾有助于从受雾霾影响的遥感数据中恢复清晰、高质量的图像。在本研究中,我们介绍了IncepDeHazeGAN,这是一种新型生成对抗网络(GAN),结合了Inception模块和多层特征融合,用于单幅图像去雾任务。利用Inception模块可以实现多尺度特征提取。另一方面,多层特征融合设计实现了特征的高效重用,因为在不同卷积层提取的特征经过多次融合。我们在网络中应用了Grad-CAM XAI技术,突出显示了网络在去雾过程中关注的区域及其对不同雾霾条件的适应性。实验表明,我们的网络在多个数据集上达到了最先进的结果。
cs.CV / 44 / 2604.16617
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
AVRT:通过单模态教师进行音视频推理迁移
Abstract
Recent advances in reasoning models have shown remarkable progress in text-based domains, but transferring those capabilities to multimodal settings, e.g., to allow reasoning over audio-visual data, remains a challenge, in part because of the limited availability of high-quality reasoning data in targeted multimodal combinations. To address this problem, we introduce AVRT, a novel framework that generates high-quality audio-visual reasoning traces from single-modality teacher models. We generate independent vision- and audio-reasoning traces via models specialized to reason over their respective modalities and merge the resulting traces with an LLM merger model. The resulting multimodal traces are used in a supervised fine-tuning (SFT) cold start to adapt the target model to audio-visual reasoning traces first, before training it in a second reinforcement learning stage on larger-scale data. Evaluated on seven audio-visual and audio benchmarks, our 3B and 7B parameter models achieve state-of-the-art results among models of comparable size, including on OmniBench and DailyOmni for audio-visual reasoning and MMAR for audio-only reasoning, showing that cross-modal training also transfers to single-modality tasks and establishing a new training pipeline for multimodal reasoning models.
Chinese Translation
近期在推理模型方面的进展在基于文本的领域取得了显著成就,但将这些能力迁移到多模态环境中,例如在音视频数据上进行推理,仍然面临挑战,部分原因是目标多模态组合中高质量推理数据的有限可用性。为了解决这个问题,我们提出了AVRT,一个新颖的框架,通过单模态教师模型生成高质量的音视频推理轨迹。我们通过专门针对各自模态进行推理的模型生成独立的视觉和音频推理轨迹,并将生成的轨迹与LLM合并模型进行合并。生成的多模态轨迹用于监督微调(SFT)冷启动,以首先将目标模型适应于音视频推理轨迹,然后在更大规模的数据上进行第二阶段的强化学习训练。在七个音视频和音频基准上进行评估,我们的3B和7B参数模型在与之规模相当的模型中,包括音视频的OmniBench和DailyOmni,以及仅音频推理的MMAR中,取得了最先进的结果,显示跨模态训练也能迁移到单模态任务,并为多模态推理模型建立了新的训练流程。
cs.CV / 45 / 2604.16629
Amortized Inverse Kinematics via Graph Attention for Real-Time Human Avatar Animation
通过图注意力实现的摊销逆运动学用于实时人类虚拟形象动画
Abstract
Inverse kinematics (IK) is a core operation in animation, robotics, and biomechanics: given Cartesian constraints, recover joint rotations under a known kinematic tree. In many real-time human avatar pipelines, the available signal per frame is a sparse set of tracked 3D joint positions, whereas animation systems require joint orientations to drive skinning. Recovering full orientations from positions is underconstrained, most notably because twist about bone axes is ambiguous, and classical IK solvers typically rely on iterative optimization that can be slow and sensitive to noisy inputs. We introduce IK-GAT, a lightweight graph-attention network that reconstructs full-body joint orientations from 3D joint positions in a single forward pass. The model performs message passing over the skeletal parent-child graph to exploit kinematic structure during rotation inference. To simplify learning, IK-GAT predicts rotations in a bone-aligned world-frame representation anchored to rest-pose bone frames. This parameterization makes the twist axis explicit and is exactly invertible to standard parent-relative local rotations given the kinematic tree and rest pose. The network uses a continuous 6D rotation representation and is trained with a geodesic loss on SO(3) together with an optional forward-kinematics consistency regularizer. IK-GAT produces animation-ready local rotations that can directly drive a rigged avatar or be converted to pose parameters of SMPL-like body models for real-time and online applications. With 374K parameters and over 650 FPS on CPU, IK-GAT outperforms VPoser-based per-frame iterative optimization without warm-start at significantly lower cost, and is robust to initial pose and input noise.
Chinese Translation
逆运动学(IK)是动画、机器人技术和生物力学中的核心操作:在给定笛卡尔约束的情况下,恢复已知运动学树下的关节旋转。在许多实时人类虚拟形象管道中,每帧可用信号是一组稀疏的跟踪3D关节位置,而动画系统则需要关节方向来驱动蒙皮。从位置恢复完整的方向是欠约束的,尤其是因为围绕骨骼轴的扭转是模糊的,传统的IK求解器通常依赖于迭代优化,这可能很慢且对噪声输入敏感。我们引入了IK-GAT,这是一种轻量级的图注意力网络,能够在单次前向传递中从3D关节位置重建全身关节方向。该模型在骨骼父子图上执行信息传递,以利用运动学结构进行旋转推断。为了简化学习,IK-GAT在与静息姿态骨骼框架对齐的世界坐标系中预测旋转。这种参数化使得扭转轴变得明确,并且在给定运动学树和静息姿态的情况下,可以精确地反转为标准的父相对局部旋转。该网络使用连续的6D旋转表示,并通过在SO(3)上的测地损失进行训练,同时可选地使用前向运动学一致性正则化器。IK-GAT生成的动画就绪局部旋转可以直接驱动装配好的虚拟形象,或转换为SMPL类身体模型的姿态参数,以用于实时和在线应用。IK-GAT具有374K参数,在CPU上超过650 FPS的性能,优于基于VPoser的每帧迭代优化,无需热启动且成本显著更低,并且对初始姿态和输入噪声具有鲁棒性。
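The abstract mentions a continuous 6D rotation representation and a geodesic loss on SO(3). A minimal sketch of both follows, assuming the commonly used Gram-Schmidt construction for the 6D parameterization and the rotation-angle form of the geodesic distance; the function names are illustrative and the paper's exact formulation may differ.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]

def rotation_from_6d(d6):
    """Map six numbers to a rotation matrix via Gram-Schmidt orthonormalization."""
    a1, a2 = d6[:3], d6[3:]
    b1 = normalize(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    b2 = normalize([y - dot * x for x, y in zip(b1, a2)])
    b3 = cross(b1, b2)
    # b1, b2, b3 become the columns of the rotation matrix.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]

def geodesic_distance(ra, rb):
    """Rotation angle of ra^T rb, in radians: arccos((trace - 1) / 2)."""
    trace = sum(ra[k][i] * rb[k][i] for i in range(3) for k in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for numerical safety
    return math.acos(cos_angle)

identity = rotation_from_6d([2.0, 0.0, 0.0, 1.0, 3.0, 0.0])   # unnormalized input is fine
quarter_turn_z = rotation_from_6d([0.0, 1.0, 0.0, -1.0, 0.0, 0.0])
angle = geodesic_distance(identity, quarter_turn_z)
```

Here `angle` is the geodesic distance between the identity and a 90 degree rotation about the z-axis, i.e. pi/2. A training loss would typically average this angle over joints and frames.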
cs.CV / 46 / 2604.16630
Tri-Modal Fusion Transformers for UAV-based Object Detection
基于无人机的三模态融合变换器用于目标检测
Abstract
Reliable UAV object detection requires robustness to illumination changes, motion blur, and scene dynamics that suppress RGB cues. Thermal long-wave infrared (LWIR) sensing preserves contrast in low light, and event cameras retain microsecond-level temporal edges, but integrating all three modalities in a unified detector has not been systematically studied. We present a tri-modal framework that processes RGB, thermal, and event data with a dual-stream hierarchical vision transformer. At selected encoder depths, a Modality-Aware Gated Exchange (MAGE) applies inter-sensor channel and spatial gating, and a Bidirectional Token Exchange (BiTE) module performs bidirectional token-level attention with depthwise-pointwise refinement, producing resolution-preserving fused maps for a standard feature pyramid and two-stage detector. We introduce a 10,489-frame UAV dataset with synchronized and pre-aligned RGB-thermal-event streams and 24,223 annotated vehicles across day and night flights. Through 61 controlled ablations, we evaluate fusion placement, mechanism (baseline MAGE+BiTE, CSSA, GAFF), modality subsets, and backbone capacity. Tri-modal fusion improves over all dual-modal baselines, with fusion depth having a significant effect and a lightweight CSSA variant recovering most of the benefit at minimal cost. This work provides the first systematic benchmark and modular backbone for tri-modal UAV-based object detection.
Chinese Translation
可靠的无人机目标检测需要对光照变化、运动模糊和场景动态具有鲁棒性,这些因素会抑制RGB线索。热长波红外(LWIR)传感器在低光照条件下保持对比度,而事件相机则保留微秒级的时间边缘,但将这三种模态集成到一个统一的检测器中尚未得到系统研究。我们提出了一种三模态框架,通过双流层次视觉变换器处理RGB、热成像和事件数据。在选定的编码器深度上,模态感知门控交换(Modality-Aware Gated Exchange, MAGE)应用传感器间通道和空间门控,而双向令牌交换(Bidirectional Token Exchange, BiTE)模块则执行双向令牌级注意力,并进行深度点wise精炼,生成用于标准特征金字塔和两阶段检测器的分辨率保持融合图。我们引入了一个包含10,489帧的无人机数据集,该数据集具有同步和预对齐的RGB-热成像-事件流,以及24,223个标注车辆,涵盖昼夜飞行。通过61次控制性消融实验,我们评估了融合位置、机制(基线MAGE+BiTE、CSSA、GAFF)、模态子集和主干网络能力。三模态融合在所有双模态基线中表现出改进,融合深度具有显著影响,轻量级CSSA变体在最小成本下恢复了大部分收益。本研究提供了首个系统性基准和模块化主干,用于基于无人机的三模态目标检测。
cs.CV / 47 / 2604.16663
A Benchmark Study of Segmentation Models and Adaptation Strategies for Landslide Detection from Satellite Imagery
基于卫星影像的滑坡检测分割模型及适应策略的基准研究
Abstract
Landslide detection from high-resolution satellite imagery is a critical task for disaster response and risk assessment, yet the relative effectiveness of modern segmentation architectures and fine-tuning strategies for this problem remains insufficiently understood. In this work, we present a systematic benchmarking study of convolutional neural networks, transformer-based segmentation models, and large pretrained foundation models for landslide detection. Using the Globally Distributed Coseismic Landslide Dataset (GDCLD), we evaluate representative CNN- and transformer-based segmentation models alongside large pretrained foundation models under consistent training and evaluation protocols. In addition, we compare full fine-tuning with parameter-efficient fine-tuning methods, including LoRA and AdaLoRA, to assess their performance-efficiency trade-offs. Experimental results show that transformer-based models achieve strong segmentation performance, while parameter-efficient fine-tuning reduces trainable parameters by up to 95% with comparable accuracy to full fine-tuning. We further analyze generalization under distribution shift by comparing validation and held-out test performance.
Chinese Translation
从高分辨率卫星影像中检测滑坡是灾害响应和风险评估中的一项关键任务,但现代分割架构和微调策略在这一问题上的相对有效性尚未得到充分理解。在本研究中,我们对卷积神经网络(CNN)、基于变换器的分割模型以及大型预训练基础模型在滑坡检测中的表现进行了系统的基准研究。我们使用全球分布的同震滑坡数据集(GDCLD),在一致的训练和评估协议下评估代表性的CNN和基于变换器的分割模型,以及大型预训练基础模型。此外,我们比较了全量微调与参数高效微调方法(包括LoRA和AdaLoRA),以评估它们的性能效率权衡。实验结果表明,基于变换器的模型在分割性能上表现出色,而参数高效微调将可训练参数减少了多达95%,且与全量微调的准确性相当。我们进一步通过比较验证集和保留测试集的表现,分析了在分布转移下的泛化能力。
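To give a feel for why LoRA cuts the trainable parameter count, the sketch below counts parameters for a single weight matrix. It uses the standard low-rank factorization (a frozen `d_out x d_in` weight plus trainable factors of rank `r`); the layer size and rank are illustrative assumptions, not settings from the paper, and the model-level saving depends on which layers are adapted.

```python
def full_finetune_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Trainable parameters for LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

def reduction_percent(d_in, d_out, rank):
    """Percentage of trainable parameters saved relative to full fine-tuning."""
    full = full_finetune_params(d_in, d_out)
    return 100.0 * (1.0 - lora_params(d_in, d_out, rank) / full)

# Hypothetical 1024 x 1024 projection layer adapted with rank 8.
saved = reduction_percent(1024, 1024, 8)
```

For this hypothetical layer, 16,384 parameters are trained instead of 1,048,576, a saving of about 98.4% for that one matrix.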
cs.CV / 48 / 2604.16675
Appearance-free Action Recognition: Zero-shot Generalization in Humans and a Two-Pathway Model
无外观动作识别:人类的零-shot泛化与双通路模型
Abstract
Action recognition is a fundamental ability for social species. Yet, its underlying computations are not well understood. Classical psychophysical studies using simplified stimuli have shown that humans can perceive body motion even under degradation of relevant shape cues. Recent work using real-world action videos and their appearance-free counterparts (that preserve motion but lack static shape cues) included explicit training of humans and models on the appearance-free videos. Whether humans and vision models generalize in a zero-shot manner to appearance-free transformations of real-world action videos is not yet known. To measure this generalization in humans, we conducted a laboratory-based psychophysics experiment. 22 participants were trained to recognize five action categories using naturalistic videos (UCF5 dataset), and tested zero-shot on two types of appearance-free transformations: (i) dense-noise motion videos from an existing dataset (AFD5) and (ii) random-dot appearance-free videos. We find that participants recognize actions in both types of appearance-free videos well above chance, albeit with reduced accuracy compared to naturalistic videos. To model this behavior, we developed a two-pathway 3D CNN-based model combining an RGB (form) stream and an optical flow (motion) stream, including a coherence-gating mechanism inspired by Gestalt common-fate grouping. Our model generalizes to both appearance-free datasets and outperforms contemporary video classification models, narrowing the gap to human performance. We find that the motion pathway is critical for generalization to appearance-free videos, while the form pathway improves performance on naturalistic videos. Our findings highlight the importance of motion-based representations for generalization to appearance-free videos, and support the use of multi-stream architectures to model video-based action recognition.
Chinese Translation
动作识别是社会物种的一项基本能力。然而,其潜在的计算机制尚不清楚。经典的心理物理学研究使用简化刺激表明,人类即使在相关形状线索退化的情况下也能感知身体运动。最近的研究利用真实世界的动作视频及其无外观对应物(保留运动但缺乏静态形状线索),对人类和模型进行了显式训练。人类和视觉模型是否能以零-shot的方式对真实世界动作视频的无外观变换进行泛化尚不清楚。为了测量人类的这种泛化能力,我们进行了基于实验室的心理物理学实验。22名参与者接受了使用自然视频(UCF5数据集)识别五种动作类别的训练,并在两种类型的无外观变换上进行了零-shot测试:(i)来自现有数据集(AFD5)的密集噪声运动视频和(ii)随机点无外观视频。我们发现参与者在这两种类型的无外观视频中识别动作的表现远高于随机水平,尽管与自然视频相比准确性有所降低。为了建模这种行为,我们开发了一个基于双通路3D卷积神经网络(CNN)的模型,结合了RGB(形状)流和光流(运动)流,包括一个受格式塔共同命运分组启发的连贯性门控机制。我们的模型对两种无外观数据集进行了泛化,并且超越了当代视频分类模型,缩小了与人类表现之间的差距。我们发现运动通路对无外观视频的泛化至关重要,而形状通路则提高了自然视频的表现。我们的研究结果强调了基于运动的表征在无外观视频泛化中的重要性,并支持使用多流架构来建模基于视频的动作识别。
cs.CV / 49 / 2604.16680
C-GenReg: Training-Free 3D Point Cloud Registration by Multi-View-Consistent Geometry-to-Image Generation with Probabilistic Modalities Fusion
C-GenReg:通过多视图一致的几何到图像生成与概率模态融合实现无训练的3D点云配准
Abstract
We introduce C-GenReg, a training-free framework for 3D point cloud registration that leverages the complementary strengths of world-scale generative priors and registration-oriented Vision Foundation Models (VFMs). Current learning-based 3D point cloud registration methods struggle to generalize across sensing modalities, sampling differences, and environments. Hence, C-GenReg augments the geometric point cloud registration branch by transferring the matching problem into an auxiliary image domain, where VFMs excel, using a World Foundation Model to synthesize multi-view-consistent RGB representations from the input geometry. This generative transfer preserves spatial coherence across source and target views without any fine-tuning. From these generated views, a VFM pretrained for finding dense correspondences extracts matches. The resulting pixel correspondences are lifted back to 3D via the original depth maps. To further enhance robustness, we introduce a "Match-then-Fuse" probabilistic cold-fusion scheme that combines two independent correspondence posteriors: that of the generated-RGB branch and that of the raw geometric branch. This principled fusion preserves each modality's inductive bias and provides calibrated confidence without any additional learning. C-GenReg is zero-shot and plug-and-play: all modules are pretrained and operate without fine-tuning. Extensive experiments on indoor (3DMatch, ScanNet) and outdoor (Waymo) benchmarks demonstrate strong zero-shot performance and superior cross-domain generalization. For the first time, we demonstrate a generative registration framework that operates successfully on real outdoor LiDAR data, where no imagery data is available.
Chinese Translation
我们提出了C-GenReg,一个无训练的3D点云配准框架,利用世界级生成先验和面向配准的视觉基础模型(Vision Foundation Models, VFMs)的互补优势。目前基于学习的3D点云配准方法在不同传感模态、采样差异和环境之间的泛化能力较弱。因此,C-GenReg通过将匹配问题转移到辅助图像域来增强几何点云配准分支,在该域中,VFMs表现优异,使用世界基础模型从输入几何体合成多视图一致的RGB表示。这种生成转移在源视图和目标视图之间保持空间一致性,无需任何微调。从这些生成的视图中,预训练的VFM用于寻找稠密对应关系并提取匹配。得到的像素对应关系通过原始深度图回升至3D。为了进一步增强鲁棒性,我们引入了一种“匹配后融合”(Match-then-Fuse)概率冷融合方案,结合了生成RGB分支与原始几何分支的两个独立对应后验。这种原则性融合保留了每种模态的归纳偏差,并在没有额外学习的情况下提供了校准的置信度。C-GenReg是零-shot且即插即用的:所有模块均为预训练并在无需微调的情况下运行。在室内(3DMatch, ScanNet)和室外(Waymo)基准上的大量实验表明了强大的零-shot性能和优越的跨域泛化能力。我们首次展示了一个在真实室外激光雷达数据上成功运行的生成配准框架,而该数据没有可用的图像数据。
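The step where pixel correspondences are lifted back to 3D via depth maps can be sketched with a standard pinhole back-projection. This is only an illustration under assumptions: a shared pinhole camera with intrinsics `(fx, fy, cx, cy)` for both views, depth maps indexed as `depth[v][u]`, and hypothetical function names; the abstract does not specify the projection model the authors use.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth to a 3D point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def lift_matches(matches, depth_src, depth_tgt, intrinsics):
    """Turn 2D matches ((u_s, v_s), (u_t, v_t)) into pairs of 3D points."""
    pairs = []
    for (us, vs), (ut, vt) in matches:
        ds, dt = depth_src[vs][us], depth_tgt[vt][ut]
        if ds <= 0 or dt <= 0:
            continue  # skip pixels without a valid depth reading
        pairs.append((
            backproject(us, vs, ds, *intrinsics),
            backproject(ut, vt, dt, *intrinsics),
        ))
    return pairs

# Toy 3x3 depth maps; the top-left source pixel has no depth.
depth_src = [[0.0, 2.0, 2.0], [2.0, 2.0, 4.0], [2.0, 2.0, 2.0]]
depth_tgt = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
intrinsics = (100.0, 100.0, 1.0, 1.0)  # fx, fy, cx, cy (assumed values)
matches = [((2, 1), (1, 1)), ((0, 0), (2, 2))]
pairs = lift_matches(matches, depth_src, depth_tgt, intrinsics)
```

Only the first match survives, because the second one starts at a pixel with zero depth. The resulting 3D-3D pairs are what a rigid-transform estimator would consume downstream.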
cs.CV / 50 / 2604.16696
LOD-Net: Locality-Aware 3D Object Detection Using Multi-Scale Transformer Network
LOD-Net:基于多尺度变换网络的局部感知三维物体检测
Abstract
3D object detection in point cloud data remains a challenging task due to the sparsity and lack of global structure inherent in the input. In this work, we propose a novel Multi-Scale Attention (MSA) mechanism integrated into the 3DETR architecture to better capture both local geometry and global context. Our method introduces an upsampling operation that generates high-resolution feature maps, enabling the network to better detect smaller and semantically related objects. Experiments conducted on the ScanNetv2 dataset demonstrate that our 3DETR + MSA model improves detection performance, achieving a gain of almost 1% in mAP@25 and 4.78% in mAP@50 over the baseline. While applying MSA to the 3DETR-m variant shows limited improvement, our analysis reveals the importance of adapting the upsampling strategy for lightweight models. These results highlight the effectiveness of combining hierarchical feature extraction with attention mechanisms in enhancing 3D scene understanding.
Chinese Translation
在点云数据中,三维物体检测仍然是一项具有挑战性的任务,因为输入数据的稀疏性和缺乏全局结构。在本研究中,我们提出了一种新颖的多尺度注意力(Multi-Scale Attention, MSA)机制,集成到3DETR架构中,以更好地捕捉局部几何和全局上下文。我们的方法引入了一种上采样操作,生成高分辨率特征图,使网络能够更好地检测较小和语义相关的物体。在ScanNetv2数据集上进行的实验表明,我们的3DETR + MSA模型提高了检测性能,在mAP@25上提升了近1%,在mAP@50上提升了4.78%,相较于基线模型。虽然将MSA应用于3DETR-m变体的改进有限,但我们的分析揭示了为轻量级模型调整上采样策略的重要性。这些结果突显了将分层特征提取与注意力机制相结合在增强三维场景理解中的有效性。
cs.CV / 51 / 2604.16726
iDocV2: Leveraging Self-Supervision and Open-Set Detection for Improving Pattern Spotting in Historical Documents
iDocV2:利用自监督学习和开放集检测提升历史文献中的模式识别
Abstract
Considering the imminent massification of digital books, it has become critical to facilitate searching collections through graphical patterns. Current strategies for document retrieval and pattern spotting in historical documents still need to be improved. State-of-the-art strategies achieve an overall precision of 0.494 for pattern spotting, where the precision for small non-square queries reaches 0.427. In addition, the processing time is excessive, requiring up to 7 seconds for searching in the DocExplore dataset due to a dense-based strategy used by SOTA models. Therefore, we propose a new model based on a better encoder (iDoc), trained under a self-supervised strategy, and an open-set detector to accelerate searching. Our model achieves results competitive with the state of the art in pattern spotting and document retrieval while improving speed by 10x. Furthermore, our model reaches a new SOTA performance on the small non-square queries, achieving a precision of 0.612. Unlike the previous version, this version leverages non-maximum suppression to reduce false positives.
Chinese Translation
考虑到数字书籍的即将大规模化,促进通过图形模式搜索文献集合变得至关重要。目前在历史文献中的文档检索和模式识别策略仍需改进。现有的最先进策略在模式识别方面的整体精度为0.494,其中小型非方形查询的精度达到0.427。此外,由于最先进模型使用的基于密度的策略,处理时间过长,在DocExplore数据集中的搜索时间可达7秒。因此,我们提出了一种基于更好编码器(iDoc)、在自监督策略下训练的模型,并采用开放集检测器以加速搜索。我们的模型在模式识别和文档检索方面取得了与最先进技术相媲美的结果,速度提升了10倍。此外,我们的模型在小型非方形查询上达到了新的最先进性能,精度提升至0.612。与之前版本不同的是,该模型利用非极大值抑制来减少假阳性。
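Non-maximum suppression, which the abstract credits with reducing false positives, can be sketched as the usual greedy procedure over scored boxes. The box format `(x1, y1, x2, y2)` and the 0.5 IoU threshold are assumptions for illustration; the paper's settings are not given in the abstract.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop others that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The second box overlaps the first with an IoU of 81/119 (about 0.68), so it is suppressed and `kept` holds the indices of the first and third boxes.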
cs.CV / 52 / 2604.16729
Agentic Large Language Models for Training-Free Neuro-Radiological Image Analysis
用于无训练神经放射影像分析的自主大型语言模型
Abstract
State-of-the-art large language models (LLMs) show high performance in general visual question answering. However, a fundamental limitation remains: current architectures lack the native 3D spatial reasoning required for direct analysis of volumetric medical imaging, such as CT or MRI. Emerging agentic AI offers a new solution, eliminating the need for intrinsic 3D processing by enabling LLMs to orchestrate and leverage specialized external tools. Yet, the feasibility of such agentic frameworks in complex, multi-step radiological workflows remains underexplored. In this work, we present a training-free agentic pipeline for automated brain MRI analysis. Validating our methodology on several LLMs (GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5) with off-the-shelf domain-specific tools, our system autonomously executes complex end-to-end workflows, including preprocessing (skull stripping, registration), pathology segmentation (glioma, meningioma, metastases), and volumetric analysis. We evaluate our framework across increasingly complex radiological tasks, from single-scan segmentation and volumetric reporting to longitudinal response assessment requiring multi-timepoint comparisons. We analyze the impact of architectural design by comparing single-agent models against multi-agent "domain-expert" collaborations. Finally, to support rigorous evaluation of future agentic systems, we introduce and release a benchmark dataset of image-prompt-answer tuples derived from public BraTS data. Our results demonstrate that agentic AI can solve highly complex neuro-radiological image analysis tasks through tool use without the need for training or fine-tuning.
Chinese Translation
最先进的大型语言模型(LLMs)在一般视觉问答中表现出色。然而,仍然存在一个根本性的限制:当前的架构缺乏直接分析体积医学影像(如CT或MRI)所需的原生三维空间推理能力。新兴的自主人工智能提供了一种新解决方案,通过使LLMs能够协调和利用专业的外部工具,消除了对内在三维处理的需求。然而,这种自主框架在复杂的多步骤放射学工作流程中的可行性仍然未被充分探索。在本研究中,我们提出了一种用于自动化脑MRI分析的无训练自主管道。我们在多个LLMs(GPT-5.1、Gemini 3 Pro、Claude Sonnet 4.5)上验证了我们的方法,使用现成的领域特定工具,我们的系统能够自主执行复杂的端到端工作流程,包括预处理(去颅骨、配准)、病理分割(胶质瘤、脑膜瘤、转移瘤)和体积分析。我们评估了我们的框架在越来越复杂的放射学任务中的表现,从单次扫描分割和体积报告到需要多时间点比较的纵向反应评估。我们通过比较单一代理模型与多代理“领域专家”协作分析了架构设计的影响。最后,为了支持未来自主系统的严格评估,我们引入并发布了一套基于公共BraTS数据的图像-提示-答案元组的基准数据集。我们的结果表明,自主人工智能能够通过工具使用解决高度神经放射影像分析任务,而无需训练或微调。
cs.CV / 53 / 2604.16733
Active World-Model with 4D-informed Retrieval for Exploration and Awareness
基于4D信息检索的主动世界模型用于探索与认知
Abstract
Physical awareness, especially in a large and dynamic environment, is shaped by sensing decisions that determine observability across space, time, and scale, while observations impact the quality of sensing decisions. This loopy information structure makes physical awareness a fundamentally challenging decision problem with partial observations. While in the past decade we have witnessed the unprecedented success of reinforcement learning (RL) in problems with full observability, decision problems with partial observation, such as POMDPs, remain largely open: real-world explorations are excessively costly, while sim-to-real pipelines suffer from unobserved viewpoints. We introduce AW4RE (Active World-model with 4D-informed Retrieval for Exploration), an awareness-centric generative world model that provides a sensor-native surrogate environment for exploring sensing queries. Conditioned on a queried sensing action, AW4RE estimates the action-conditioned observation process. This is done by combining 4D-informed evidence retrieval, action-conditioned geometric support with temporal coherence, and conditional generative completion. Experiments demonstrate that AW4RE produces more grounded and consistent predictions than geometry-aware generative baselines under extreme viewpoint shifts, temporal gaps, and sparse geometric support.
Chinese Translation
物理认知,尤其是在大型动态环境中,是由感知决策所塑造的,这些决策决定了在空间、时间和尺度上的可观测性,而观察结果又影响感知决策的质量。这种循环的信息结构使得物理认知成为一个基本具有挑战性的决策问题,尤其是在部分观察的情况下。尽管在过去十年中,我们见证了强化学习(RL)在完全可观测性问题上的前所未有的成功,但部分观察的决策问题,如部分可观测马尔可夫决策过程(POMDPs),仍然在很大程度上未得到解决:现实世界的探索成本过高,而从仿真到现实的流程又受到未观察视角的影响。我们提出了AW4RE(基于4D信息检索的主动世界模型),这是一种以认知为中心的生成世界模型,为探索感知查询提供了传感器原生的替代环境。AW4RE在查询的感知动作条件下,估计动作条件下的观察过程。这是通过结合4D信息检索、动作条件下的几何支持与时间一致性,以及条件生成补全来实现的。实验表明,AW4RE在极端视角变化、时间间隔和稀疏几何支持下,产生的预测比几何感知生成基线更为可靠和一致。
cs.CV / 54 / 2604.16734
Reducing Peak Memory Usage for Modern Multimodal Large Language Model Pipelines
降低现代多模态大语言模型管道的峰值内存使用
Abstract
Multimodal large language models (MLLMs) have recently demonstrated strong capabilities in understanding and generating responses from diverse visual inputs, including high-resolution images and long video sequences. As these models scale to richer visual representations, inference increasingly relies on storing large numbers of vision tokens in the key-value (KV) cache, making memory consumption a central bottleneck. Existing methods address this issue by identifying redundancy in vision tokens and compressing the cache, but such compression is typically applied only after all inputs are processed, resulting in high peak memory usage during the prefill stage. In this work, we show that MLLMs exhibit inherent structural regularities and representational redundancy that can be exploited to control memory growth throughout inference. Based on this insight, we propose a sequential input-compression mechanism that enforces a fixed memory budget by performing structure-aware key-value cache compression during the prefill process. This approach substantially reduces peak memory usage while maintaining generative performance with only minimal degradation, enabling more practical and memory-efficient multimodal inference.
Chinese Translation
多模态大语言模型(MLLMs)最近在理解和生成来自多样化视觉输入的响应方面表现出了强大的能力,包括高分辨率图像和长视频序列。随着这些模型在更丰富的视觉表征上扩展,推理过程越来越依赖于在键值(KV)缓存中存储大量视觉标记,这使得内存消耗成为一个主要瓶颈。现有方法通过识别视觉标记中的冗余并压缩缓存来解决这个问题,但这种压缩通常仅在所有输入处理完成后应用,导致在预填充阶段内存使用峰值过高。在本研究中,我们展示了MLLMs固有的结构规律性和表征冗余,这些特性可以被利用以控制推理过程中的内存增长。基于这一洞察,我们提出了一种顺序输入压缩机制,通过在预填充过程中执行结构感知的键值缓存压缩,强制执行固定的内存预算。这种方法显著降低了峰值内存使用,同时在仅有轻微性能下降的情况下保持生成性能,从而实现更实用和内存高效的多模态推理。
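The contrast between compressing the cache after prefill and compressing it during prefill can be illustrated with a toy budgeted cache. This is a loose sketch, not the authors' mechanism: the per-token importance scores, the top-k retention rule, and the function name `prefill_with_budget` are all hypothetical stand-ins for the structure-aware compression the abstract describes.

```python
def prefill_with_budget(token_scores, chunk_size, budget):
    """Feed tokens chunk by chunk, pruning the cache to `budget` entries after each chunk.

    Returns the indices of retained tokens (in original order) and the
    largest cache size reached at any point.
    """
    cache = []  # list of (token_index, importance_score)
    peak = 0
    for start in range(0, len(token_scores), chunk_size):
        chunk = token_scores[start:start + chunk_size]
        cache.extend((start + k, s) for k, s in enumerate(chunk))
        peak = max(peak, len(cache))
        if len(cache) > budget:
            # Keep the highest-scoring entries, then restore positional order.
            cache = sorted(cache, key=lambda e: e[1], reverse=True)[:budget]
            cache.sort(key=lambda e: e[0])
    return [i for i, _ in cache], peak

# Hypothetical importance scores for eight vision tokens.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.6, 0.05]
kept, peak = prefill_with_budget(scores, chunk_size=2, budget=3)
```

In this toy run the cache never holds more than `budget + chunk_size = 5` entries, whereas compressing only after all inputs are processed would hold all eight at its peak. In a real model the scores would depend on the retained cache, which this sketch ignores.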
cs.CV / 55 / 2604.16743
Automated Palynological Analysis System: Integrating Deep Metric Learning and $U^{2}$-Net Detection in $H_\infty$ bright field microscopy
自动化花粉分析系统:在 $H_\infty$ 明场显微镜中集成深度度量学习和 $U^{2}$-Net 检测
Abstract
Traditional melissopalynology is a time-consuming and subjective process, often taking 4-6 hours per sample. We present an automated, high-throughput microscopy system that integrates $H_\infty$ robust mechanical control with advanced deep learning pipelines for the precise counting, classification, and morphological analysis of pollen grains from the Bio Bio region in south-central Chile. Our system employs $U^{2}$-Net for salient object detection and a DINOv2 Vision Transformer backbone trained via Deep Metric Learning for classification. By integrating Gradient-Weighted Attention, the model provides human-interpretable texture and diagnostic feature annotations. The system achieves a classification recall of $95.8\%$ and a 6x processing speedup compared to manual expert analysis.
Chinese Translation
传统的蜜蜂花粉学是一个耗时且主观的过程,通常每个样本需要4-6小时。我们提出了一种自动化的高通量显微镜系统,该系统将 $H\infty$ 稳健机械控制与先进的深度学习流程相结合,以精确计数、分类和形态分析来自智利中南部Bio Bio地区的花粉颗粒。我们的系统采用 $U^{2}$-Net 进行显著物体检测,并使用通过深度度量学习训练的 DINOv2 视觉变换器骨干网络进行分类。通过集成梯度加权注意力,该模型提供了可供人类解释的纹理和诊断特征注释。与人工专家分析相比,该系统实现了95.8%的分类召回率和6倍的处理速度提升。
cs.CV / 56 / 2604.16747
Incoherent Deformation, Not Capacity: Diagnosing and Mitigating Overfitting in Dynamic Gaussian Splatting
非一致性变形,而非容量:诊断和缓解动态高斯喷溅中的过拟合
Abstract
Dynamic 3D Gaussian Splatting methods achieve strong training-view PSNR on monocular video but generalize poorly: on the D-NeRF benchmark we measure an average train-test PSNR gap of 6.18 dB, rising to 11 dB on individual scenes. We report two findings that together account for most of that gap. Finding 1 (the role of splitting). A systematic ablation of the Adaptive Density Control pipeline (split, clone, prune, frequency, threshold, schedule) shows that splitting is responsible for over 80% of the gap: disabling split collapses the cloud from 44K to 3K Gaussians and the gap from 6.18 dB to 1.15 dB. Across all threshold-varying ablations, gap is log-linear in count (r = 0.995, bootstrap 95% CI [0.99, 1.00]), which suggests a capacity-based explanation. Finding 2 (the role of deformation coherence). We show that the capacity explanation is incomplete. A local-smoothness penalty on the per-Gaussian deformation field -- Elastic Energy Regularization (EER) -- reduces the gap by 40.8% while growing the cloud by 85%. Measuring per-Gaussian strain directly on trained checkpoints, EER reduces mean strain by 99.72% (median 99.80%) across all 8 scenes; on 8/8 scenes the median Gaussian under EER is less strained than the 1st-percentile (best-behaved) Gaussian under baseline. Alongside EER, we evaluate two further regularizers: GAD, a loss-rate-aware densification threshold, and PTDrop, a jitter-weighted Gaussian dropout. GAD+EER reduces the gap by 48%; adding PTDrop and a soft growth cap reaches 57%. We confirm that coherence generalizes to (a) a different deformation architecture (Deformable-3DGS, +40.6% gap reduction at re-tuned lambda), and (b) real monocular video (4 HyperNeRF scenes, reducing the mean PSNR gap by 14.9% at the same lambda as D-NeRF, with near-zero quality cost). The overfitting in dynamic 3DGS is driven by incoherent deformation, not parameter count.
Chinese Translation
动态3D高斯喷溅方法在单目视频上实现了强大的训练视图PSNR,但泛化能力较差:在D-NeRF基准测试中,我们测得平均训练-测试PSNR差距为6.18 dB,在个别场景中上升至11 dB。我们报告了两个发现,这两个发现共同解释了大部分差距。发现1(分裂的作用)。对自适应密度控制管道(分裂、克隆、修剪、频率、阈值、调度)进行系统的消融实验表明,分裂负责超过80%的差距:禁用分裂将云从44K个高斯点压缩到3K个高斯点,差距从6.18 dB降至1.15 dB。在所有阈值变化的消融实验中,差距与数量呈对数线性关系(r = 0.995,自助法95%置信区间[0.99, 1.00]),这表明了一种基于容量的解释。发现2(变形一致性的作用)。我们表明容量解释是不完整的。对每个高斯变形场施加局部平滑惩罚——弹性能量正则化(Elastic Energy Regularization, EER)——将差距减少了40.8%,同时使云增长了85%。在训练检查点上直接测量每个高斯的应变,EER在所有8个场景中将平均应变减少了99.72%(中位数99.80%);在8/8个场景中,EER下的中位数高斯的应变低于基线下第1百分位数(表现最佳)的高斯。除了EER,我们还评估了两个进一步的正则化器:GAD,一个损失率感知的密度阈值,以及PTDrop,一个抖动加权的高斯丢弃。GAD+EER将差距减少了48%;添加PTDrop和软增长上限使差距达到57%。我们确认一致性可以推广到(a)不同的变形架构(Deformable-3DGS,在重新调整lambda时减少了40.6%的差距),以及(b)真实的单目视频(4个HyperNeRF场景,在与D-NeRF相同的lambda下将平均PSNR差距减少了14.9%,几乎没有质量损失)。动态3DGS中的过拟合是由非一致性变形驱动的,而非参数数量。
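The abstract describes Elastic Energy Regularization only as a local-smoothness penalty on the per-Gaussian deformation field. A minimal sketch of one such penalty follows, assuming it compares each Gaussian's displacement with those of its k nearest neighbours in the rest configuration; the function name `smoothness_penalty` and this exact form are illustrative and may differ from the paper's energy.

```python
from typing import List, Sequence

Vec = Sequence[float]


def sq_dist(a: Vec, b: Vec) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smoothness_penalty(positions: List[Vec], displacements: List[Vec], k: int = 2) -> float:
    """Mean squared displacement difference between each point and its k nearest neighbours."""
    n = len(positions)
    total, pairs = 0.0, 0
    for i in range(n):
        # Neighbours are found in the rest (undeformed) configuration.
        others = [j for j in range(n) if j != i]
        neighbours = sorted(others, key=lambda j: sq_dist(positions[i], positions[j]))[:k]
        for j in neighbours:
            total += sq_dist(displacements[i], displacements[j])
            pairs += 1
    return total / pairs if pairs else 0.0


centres = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
coherent = [(0.5, 0.0, 0.0)] * 3                                  # rigid translation
incoherent = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]  # neighbours move apart
print(smoothness_penalty(centres, coherent, k=1), smoothness_penalty(centres, incoherent, k=1))
```

A coherent motion costs nothing under this penalty regardless of how many Gaussians there are, which is consistent with the abstract's point that the gap can shrink while the cloud grows.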
cs.CV / 57 / 2604.16748
TriTS: Time Series Forecasting from a Multimodal Perspective
TriTS:从多模态视角进行时间序列预测
Abstract
Time series forecasting plays a pivotal role in critical sectors such as finance, energy, transportation, and meteorology. However, Long-term Time Series Forecasting (LTSF) remains a significant challenge because real-world signals contain highly entangled temporal dynamics that are difficult to fully capture from a purely 1D perspective. To break this representation bottleneck, we propose TriTS, a novel cross-modal disentanglement framework that projects 1D time series into orthogonal time, frequency, and 2D-vision spaces. To seamlessly bridge the 1D-to-2D modality gap without the prohibitive $O(N^2)$ computational overhead of Vision Transformers (ViTs), we introduce a Period-Aware Reshaping strategy and incorporate Visual Mamba (Vim). This approach efficiently models cross-period dependencies as global visual textures while maintaining linear computational complexity. Complementing this, we design a Multi-Resolution Wavelet Mixing (MR-WM) module for the frequency modality, which explicitly decouples non-stationary signals into trend and noise components to achieve fine-grained time-frequency localization. Finally, a streaming linear branch is retained in the time domain to anchor numerical stability. By dynamically fusing these three complementary representations, TriTS effectively adapts to diverse data contexts. Extensive experiments across multiple benchmark datasets demonstrate that TriTS achieves state-of-the-art (SOTA) performance, fundamentally outperforming existing vision-based forecasters by drastically reducing both parameter count and inference latency.
Chinese Translation
时间序列预测在金融、能源、交通和气象等关键领域中发挥着重要作用。然而,长期时间序列预测(LTSF)仍然是一个重大挑战,因为现实世界信号包含高度纠缠的时间动态,这些动态从纯粹的一维视角难以完全捕捉。为了解决这一表示瓶颈,我们提出了TriTS,一种新颖的跨模态解耦框架,将一维时间序列投影到正交的时间、频率和二维视觉空间。为了无缝弥合一维到二维模态的差距,同时避免视觉变换器(Vision Transformers, ViTs)所带来的高达 $O(N^2)$ 的计算开销,我们引入了一种周期感知重塑策略,并结合了视觉曼巴(Visual Mamba, Vim)。该方法有效地将跨周期依赖建模为全局视觉纹理,同时保持线性计算复杂度。作为补充,我们设计了一个多分辨率小波混合(Multi-Resolution Wavelet Mixing, MR-WM)模块,用于频率模态,明确将非平稳信号解耦为趋势和噪声成分,以实现精细的时频定位。最后,在时间域中保留了一个流式线性分支,以确保数值稳定性。通过动态融合这三种互补表示,TriTS 能够有效适应多样的数据上下文。在多个基准数据集上的广泛实验表明,TriTS 实现了最先进的(SOTA)性能,根本上超越了现有的基于视觉的预测模型,显著减少了参数数量和推理延迟。
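The 1D-to-2D step can be pictured as folding the series so that each row holds one period; values one period apart then sit in the same column, which is what lets a 2D model treat cross-period dependencies as image structure. The sketch below assumes the period is already known and pads the last row by repeating the final value; `fold_by_period` and the padding rule are illustrative choices, not details given in the abstract.

```python
from typing import List, Sequence


def fold_by_period(series: Sequence[float], period: int) -> List[List[float]]:
    """Reshape a 1D series into rows of length `period` (one row per cycle)."""
    if period <= 0:
        raise ValueError("period must be positive")
    if not series:
        return []
    pad = (-len(series)) % period
    # Edge padding is an assumption; zero padding or truncation would also work.
    padded = list(series) + [series[-1]] * pad
    return [padded[i:i + period] for i in range(0, len(padded), period)]


for row in fold_by_period([1, 2, 3, 4, 5, 6, 7], period=3):
    print(row)
```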
cs.CV / 58 / 2604.16758
Frozen Vision Transformers for Dense Prediction on Small Datasets: A Case Study in Arrow Localization
用于小型数据集的密集预测的冻结视觉变换器:箭头定位的案例研究
Abstract
We present a system for automated detection, localization, and scoring of arrow punctures on 40 cm indoor archery target faces, trained on only 48 annotated photographs (5,084 punctures). Our pipeline combines three components: a color-based canonical rectification stage that maps perspective-distorted photographs into a standardized coordinate system where pixel distances correspond to known physical measurements; a frozen self-supervised vision transformer (DINOv3 ViT-L/16) paired with AnyUp guided feature upsampling to recover sub-millimeter spatial precision from $32 \times 32$ patch tokens; and lightweight CenterNet-style detection heads for arrow-center heatmap prediction. Only 3.8M of 308M total parameters are trainable. Across three cross-validation folds, we achieve a mean F1 score of $0.893 \pm 0.011$ and a mean localization error of $1.41 \pm 0.06$ mm, comparable to or better than prior fully-supervised approaches that require substantially more training data. An ablation study shows that the CenterNet offset regression head, typically essential for sub-pixel refinement, provides negligible detection improvement while degrading localization in our setting. This suggests that guided feature upsampling already resolves the spatial precision lost through patch tokenization. On downstream archery metrics, the system recovers per-image average arrow scores with a median error of 1.8% and group centroid positions to within a median of 4.00 mm. These results demonstrate that frozen foundation models with minimal task-specific adaptation offer a practical paradigm for dense prediction in small-data regimes.
Chinese Translation
我们提出了一种系统,用于自动检测、定位和评分40厘米室内射箭靶面上的箭头穿刺,仅使用48张标注照片(5,084个穿刺)进行训练。我们的流程结合了三个组件:一个基于颜色的标准化矫正阶段,将透视失真的照片映射到一个标准化坐标系统中,其中像素距离对应于已知的物理测量;一个冻结的自监督视觉变换器(DINOv3 ViT-L/16),配合AnyUp引导的特征上采样,从$32 \times 32$的补丁令牌中恢复亚毫米级的空间精度;以及轻量级的CenterNet风格检测头,用于箭头中心热图预测。308M总参数中仅有3.8M是可训练的。在三个交叉验证折中,我们实现了平均F1分数为$0.893 \pm 0.011$,平均定位误差为$1.41 \pm 0.06$毫米,性能与或优于需要显著更多训练数据的先前完全监督方法。消融研究表明,CenterNet偏移回归头在我们的设置中提供的检测改进微乎其微,同时降低了定位精度,这表明引导特征上采样已经解决了通过补丁令牌化所丢失的空间精度。在下游射箭指标上,该系统恢复每张图像的平均箭头得分,误差中位数为1.8%,群体重心位置的中位数误差为4.00毫米。这些结果表明,冻结的基础模型经过最小的任务特定适应后,为小数据环境中的密集预测提供了一种实用的范式。
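For readers unfamiliar with CenterNet-style heads, the sketch below shows the usual way a center heatmap is decoded without an offset head: a cell counts as a detection if it exceeds a threshold and is the maximum of its 3x3 neighbourhood. Because the rectified image has a known physical scale, a pixel coordinate converts to millimetres by one multiplication. The helper names, the threshold, and the `mm_per_pixel` value are assumptions for illustration, not values from the paper.

```python
from typing import List, Tuple


def heatmap_peaks(heatmap: List[List[float]], threshold: float = 0.5) -> List[Tuple[int, int, float]]:
    """Return (x, y, score) for cells that are 3x3 local maxima above `threshold`."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y][x]
            if v < threshold:
                continue
            window = [heatmap[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            if v >= max(window):  # note: a flat plateau yields several peaks
                peaks.append((x, y, v))
    return peaks


def to_mm(x: float, y: float, mm_per_pixel: float) -> Tuple[float, float]:
    """Convert rectified pixel coordinates to millimetres on the target face."""
    return (x * mm_per_pixel, y * mm_per_pixel)


grid = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.6, 0.0, 0.1],
    [0.0, 0.0, 0.1, 0.8],
]
for x, y, score in heatmap_peaks(grid):
    print(to_mm(x, y, mm_per_pixel=0.5), score)
```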
cs.CV / 59 / 2604.16780
FairNVT: Improving Fairness via Noise Injection in Vision Transformers
FairNVT:通过在视觉变换器中注入噪声来提高公平性
Abstract
This paper presents FairNVT, a lightweight debiasing framework for pretrained transformer-based encoders that improves both representation and prediction level fairness while preserving task accuracy. Unlike many existing debiasing approaches that address these notions separately, we argue they are inherently connected: suppressing sensitive information at the representation level can facilitate fairer predictions. Our approach learns task-relevant and sensitive embeddings via lightweight adapters, applies calibrated Gaussian noise to the sensitive embedding, and fuses it with the task representation. Together with orthogonality constraints and fairness regularization, these components jointly reduce sensitive-attribute leakage in the learned embeddings and encourage fairer downstream predictions. The framework is compatible with a wide range of pretrained transformer encoders. Across three datasets spanning vision and language, FairNVT reduces sensitive-attribute attacker accuracy, improves demographic-parity and equalized-odds metrics, and maintains high task performance.
Chinese Translation
本文提出了FairNVT,一个轻量级的去偏见框架,旨在改善基于预训练变换器编码器的表示和预测层面的公平性,同时保持任务准确性。与许多现有的去偏见方法分别处理这些概念不同,我们认为它们本质上是相互关联的:在表示层面抑制敏感信息可以促进更公平的预测。我们的方法通过轻量级适配器学习与任务相关的敏感嵌入,向敏感嵌入施加校准的高斯噪声,并将其与任务表示融合。结合正交约束和公平性正则化,这些组件共同减少了学习嵌入中的敏感属性泄漏,并鼓励更公平的下游预测。该框架与广泛的预训练变换器编码器兼容。在涵盖视觉和语言的三个数据集上,FairNVT降低了敏感属性攻击者的准确性,提高了人口平衡和均衡赔率指标,并保持了高任务性能。
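The two fairness metrics named in the abstract have standard definitions that are easy to state in code. The sketch below computes the demographic-parity gap (difference in positive-prediction rates between two groups) and an equalized-odds gap (here taken as the larger of the true-positive-rate and false-positive-rate differences, one common convention). The function names and the toy data are illustrative; the paper may report these metrics in a different form.

```python
from typing import List


def positive_rate(preds: List[int], mask: List[bool]) -> float:
    """Fraction of positive predictions among the selected samples."""
    selected = [p for p, m in zip(preds, mask) if m]
    return sum(selected) / len(selected)


def demographic_parity_gap(preds: List[int], groups: List[int]) -> float:
    """|P(pred=1 | group=0) - P(pred=1 | group=1)|."""
    r0 = positive_rate(preds, [g == 0 for g in groups])
    r1 = positive_rate(preds, [g == 1 for g in groups])
    return abs(r0 - r1)


def equalized_odds_gap(preds: List[int], labels: List[int], groups: List[int]) -> float:
    """Largest between-group rate difference, conditioned on the true label."""
    gaps = []
    for y in (0, 1):  # y=0 compares false-positive rates, y=1 true-positive rates
        r0 = positive_rate(preds, [g == 0 and t == y for g, t in zip(groups, labels)])
        r1 = positive_rate(preds, [g == 1 and t == y for g, t in zip(groups, labels)])
        gaps.append(abs(r0 - r1))
    return max(gaps)


preds = [1, 1, 0, 0, 1, 0, 0, 0]
labels = [1, 0, 1, 0, 1, 1, 0, 0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
print(demographic_parity_gap(preds, groups), equalized_odds_gap(preds, labels, groups))
```

Both gaps are zero for a perfectly fair predictor under the respective criterion, so lower is better.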
cs.CV / 60 / 2604.16783
EdgeVTP: Exploration of Latency-efficient Trajectory Prediction for Edge-based Embedded Vision Applications
EdgeVTP:针对边缘嵌入式视觉应用的低延迟轨迹预测探索
Abstract
Vehicle trajectory prediction is central to highway perception, but deployment on roadside edge devices necessitates bounded, deterministic end-to-end latency. We present EdgeVTP, an embedded-first trajectory predictor that combines interaction-aware graph modeling with a lightweight transformer backbone and a one-shot curve decoder. By predicting future motion as compact curve parameters (anchored at the last observed position) rather than horizon-scaled autoregressive waypoints, EdgeVTP reduces decoding overhead while producing smooth trajectories. To keep runtime predictable in crowded scenes, we explicitly bound interaction complexity via a locality graph with a hard neighbor cap. Across three highway benchmarks and two Jetson-class platforms, EdgeVTP achieves the lowest measured end-to-end latency under a protocol that includes graph construction and post-processing, while attaining state-of-the-art (SotA) prediction accuracy on two of the three datasets and competitive error on other benchmarks. Our code is available at https://github.com/SeungjinStevenKim/EdgeVTP.
Chinese Translation
车辆轨迹预测是高速公路感知的核心,但在路边边缘设备上的部署需要有限且确定的端到端延迟。我们提出了EdgeVTP,这是一种以嵌入式为优先的轨迹预测器,结合了交互感知的图建模、轻量级变换器骨干网络和一次性曲线解码器。通过将未来运动预测为紧凑的曲线参数(锚定在最后观察到的位置),而不是按时间尺度扩展的自回归路径点,EdgeVTP减少了解码开销,同时生成平滑的轨迹。为了在拥挤场景中保持运行时间的可预测性,我们通过具有硬邻居上限的局部图显式限制交互复杂性。在三个高速公路基准测试和两个Jetson级平台上,EdgeVTP在包括图构建和后处理的协议下,实现了最低的测量端到端延迟,同时在三个数据集中中的两个上达到了最先进的(SotA)预测准确性,并在其他基准测试中表现出竞争力的误差。我们的代码可在 https://github.com/SeungjinStevenKim/EdgeVTP 获取。
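The hard neighbour cap mentioned in the abstract can be sketched as follows: each vehicle connects only to vehicles within a radius, and at most `max_neighbours` of the closest ones are kept, so the number of edges per node stays bounded however crowded the scene is. The function name, the radius, and the cap value are hypothetical, and the naive all-pairs search is used only for brevity.

```python
import math
from typing import Dict, List, Tuple


def capped_neighbours(positions: List[Tuple[float, float]], radius: float,
                      max_neighbours: int) -> Dict[int, List[int]]:
    """Build a locality graph where every node keeps at most `max_neighbours` edges."""
    graph: Dict[int, List[int]] = {}
    for i, (xi, yi) in enumerate(positions):
        candidates = []
        for j, (xj, yj) in enumerate(positions):
            if i == j:
                continue
            d = math.hypot(xi - xj, yi - yj)
            if d <= radius:
                candidates.append((d, j))
        candidates.sort()  # closest first; ties broken by index
        graph[i] = [j for _, j in candidates[:max_neighbours]]
    return graph


vehicles = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (50.0, 0.0)]
print(capped_neighbours(vehicles, radius=5.0, max_neighbours=2))
```

With the cap in place, the interaction module processes at most `max_neighbours` edges per vehicle, which is what makes its runtime predictable.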
cs.CV / 61 / 2604.16785
Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational Games
粗细识别的桥梁:一种用于互动教育游戏中开放式多粒度物体识别的混合方法
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have enabled open-ended object recognition, yet they struggle with fine-grained tasks. In contrast, CLIP-style models excel at fine-grained recognition but lack broad coverage of general object categories. To bridge this gap, we propose HyMOR, a Hybrid Multi-granularity open-ended Object Recognition framework that integrates an MLLM with a CLIP model. In HyMOR, the MLLM performs open-ended and coarse-grained object recognition, while the CLIP model specializes in fine-grained identification of domain-specific objects such as animals and plants. This hybrid design enables accurate object understanding across multiple semantic granularities, serving as a robust perceptual foundation for downstream multi-modal content generation and interactive gameplay. To support evaluation in content-rich and educational scenarios, we introduce TBO (TextBook Objects), a dataset containing 20,942 images annotated with 8,816 object categories extracted from textbooks. Extensive experiments demonstrate that HyMOR narrows the fine-grained recognition gap with CLIP to 0.2% while improving general object recognition by 2.5% over a baseline MLLM, measured by average Sentence-BERT (SBert) similarity. Overall, HyMOR achieves a 23.2% improvement in average SBert across all evaluated datasets, highlighting its effectiveness in enabling accurate perception for multi-modal game content generation and interactive learning applications.
Chinese Translation
最近在多模态大型语言模型(MLLMs)方面的进展使得开放式物体识别成为可能,但它们在细粒度任务上表现不佳。相比之下,CLIP风格的模型在细粒度识别方面表现出色,但在一般物体类别的广泛覆盖上存在不足。为了解决这一问题,我们提出了HyMOR,一种混合多粒度开放式物体识别(Hybrid Multi-granularity open-ended Object Recognition)框架,该框架将MLLM与CLIP模型相结合。在HyMOR中,MLLM执行开放式和粗粒度的物体识别,而CLIP模型则专注于特定领域物体(如动物和植物)的细粒度识别。这种混合设计使得在多个语义粒度上实现准确的物体理解,为下游的多模态内容生成和互动游戏提供了稳健的感知基础。为了支持在内容丰富和教育场景中的评估,我们引入了TBO(TextBook Objects),一个包含20,942张图像和8,816个从教科书中提取的物体类别的标注数据集。大量实验表明,HyMOR将与CLIP的细粒度识别差距缩小至0.2%,同时在基准MLLM上提高了一般物体识别的准确率2.5%,这一结果通过平均Sentence-BERT(SBert)相似度进行测量。总体而言,HyMOR在所有评估数据集上实现了平均SBert提高23.2%,突显了其在实现多模态游戏内容生成和互动学习应用中的准确感知能力的有效性。
cs.CV / 62 / 2604.16794
Improving Radio Interferometry Imaging by Explicitly Modeling Cross-Domain Consistency in Reconstruction
通过显式建模重建中的跨域一致性来改善无线电干涉成像
Abstract
Radio astronomy plays a crucial role in understanding the universe, particularly within the realm of non-thermal astrophysics. Images of celestial objects are derived from the signals (called visibility) measured by radio telescopes. Such imaging results, called dirty images, contain artifacts due to factors such as sparsity and therefore require reconstruction to improve imaging quality. Existing methods typically restrict reconstruction to a unimodal domain, either to the dirty image after imaging or to the sparse visibility prior to imaging. Focusing solely on each unimodal reconstruction results in the loss of complementary in-context information in either the visibility or image domain, leading to an incomplete modeling of mutual dependency and consistency. To address these challenges, we propose CDCRec, a multimodal radio interferometric data reconstruction method that explicitly models cross-domain consistency. We design a hierarchical multi-task and multi-stage framework to enhance the exploration of interplays between domains during reconstruction. Our experimental results demonstrate that CDCRec improves imaging performance through enhanced cross-domain correlation extraction. In particular, our self-supervised complementary modeling strategy is better than current methods at interferometric domain translations that rely heavily on recovering dense information from constrained source-domain data.
Chinese Translation
无线电天文学在理解宇宙中发挥着至关重要的作用,特别是在非热天体物理学领域。天体图像是通过无线电望远镜测量的信号(称为可见度)得出的。这些成像结果称为脏图像,由于稀疏性等因素而包含伪影,因此需要重建以提高成像质量。现有方法通常将重建限制在单模态域内,或者是在成像后的脏图像,或者是在成像前的稀疏可见度。仅关注每个单模态重建会导致在可见度或图像域中丧失互补的上下文信息,从而导致相互依赖性和一致性的建模不完整。为了解决这些挑战,我们提出了CDCRec,一种显式建模跨域一致性的多模态无线电干涉数据重建方法。我们设计了一个分层的多任务和多阶段框架,以增强重建过程中域之间相互作用的探索。我们的实验结果表明,CDCRec通过增强跨域相关性提取来改善成像性能。特别是,我们的自监督互补建模策略在依赖于从受限源域数据恢复密集信息的干涉域转换方面优于当前方法。
cs.CV / 63 / 2604.16796
Generative Semantic Communication via Alternating Dual-Domain Posterior Sampling
通过交替双域后验采样实现生成语义通信
Abstract
Generative semantic communication (SemCom) harnesses pretrained generative priors to improve the perceptual quality of wireless image transmission. Existing generative SemCom receivers, however, rely on maximum a posteriori (MAP) estimation, which fundamentally cannot preserve the data distribution and thus limits achievable perceptual quality. Moreover, current diffusion-based approaches using single-domain guidance face significant limitations: latent-domain guidance is sensitive to channel noise, while image-domain guidance inherits decoder bias. Simply combining both domains simultaneously yields an overconfident pseudo-posterior. In this paper, we formulate semantic decoding as a Bayesian inverse problem and prove that posterior sampling achieves optimal perceptual quality by preserving the data distribution. Building on this insight, we propose alternating dual-domain posterior sampling (ADDPS), a diffusion-based SemCom receiver that alternately enforces latent-domain and image-domain consistency during the sampling process. This alternating strategy decomposes joint posterior sampling into simpler subproblems, avoiding gradient conflicts while retaining the complementary strengths of both domains. Experiments on FFHQ demonstrate that the proposed ADDPS achieves superior perceptual quality compared with existing methods.
Chinese Translation
生成语义通信(SemCom)利用预训练的生成先验来提高无线图像传输的感知质量。然而,现有的生成SemCom接收器依赖于最大后验估计(MAP),这在根本上无法保持数据分布,从而限制了可实现的感知质量。此外,当前基于扩散的方法使用单域引导面临显著限制:潜在域引导对信道噪声敏感,而图像域引导则继承解码器偏差。简单地同时结合两个域会产生过于自信的伪后验。在本文中,我们将语义解码形式化为一个贝叶斯逆问题,并证明后验采样通过保持数据分布实现最佳的感知质量。基于这一见解,我们提出了交替双域后验采样(ADDPS),这是一种基于扩散的SemCom接收器,在采样过程中交替强制潜在域和图像域的一致性。这种交替策略将联合后验采样分解为更简单的子问题,避免了梯度冲突,同时保留了两个域的互补优势。在FFHQ上的实验表明,所提出的ADDPS相比现有方法实现了更优的感知质量。
cs.CV / 64 / 2604.16800
Frequency-Decomposed INR for NIR-Assisted Low-Light RGB Image Denoising
基于频率分解的近红外辅助低光照RGB图像去噪
Abstract
To address severe noise and high-frequency structural degradation in visible images captured under low-light conditions, this paper proposes a Near-Infrared (NIR) aided low-light image restoration method based on a Frequency-Decoupled Implicit Neural Representation (FD-INR). The method builds on a statistical prior of RGB-NIR cross-modal frequency correlations: low-frequency RGB signals are more reliable, whereas high-frequency NIR signals exhibit higher correlation. We explicitly decompose images into distinct frequency components via multi-scale wavelet transforms and construct a dual-branch implicit neural representation framework. Within this framework, we design a cross-modal differentiated frequency supervision mechanism that uses low-light RGB to guide the reconstruction of low-frequency luminance and color and uses high-SNR NIR signals to constrain the generation of high-frequency texture details, thereby achieving complementary advantages in the frequency domain. Furthermore, an uncertainty-based adaptive weighting loss function is introduced to automatically balance the contributions of the different frequency tasks, addressing the color distortion and artifacts caused by rigid spatial-domain fusion in traditional methods. Experimental results demonstrate that FD-INR not only effectively restores image luminance consistency and structural details but also, benefiting from its implicit continuous representation, outperforms existing methods in arbitrary-resolution reconstruction tasks, significantly enhancing the reliability of low-light perception.
Chinese Translation
针对低光照条件下可见图像中严重噪声和高频结构退化的问题,本文提出了一种基于频率解耦隐式神经表示(Frequency Decoupled Implicit Neural Representation, FDINR)的近红外(NIR)辅助低光照图像恢复方法。基于RGB-NIR跨模态频率相关性的统计先验,具体而言,低频RGB信号更为可靠,而高频NIR信号则表现出更高的相关性,我们通过多尺度小波变换将图像显式分解为不同的频率分量,并构建了一个双分支隐式神经表示框架。在该框架内,我们设计了一种跨模态差异化频率监督机制,利用低光照RGB引导低频亮度和颜色的重建,同时利用高信噪比(SNR)NIR信号约束高频纹理细节的生成,从而在频率域实现互补优势。此外,引入了一种基于不确定性的自适应加权损失函数,以自动平衡不同频率任务的贡献,解决了传统方法中常见的空间域刚性融合导致的颜色失真和伪影问题。实验结果表明,FD-INR不仅有效恢复了图像的亮度一致性和结构细节,而且得益于其隐式连续表示,在任意分辨率重建任务中超越了现有方法,显著增强了低光照感知的可靠性。
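To make the frequency split concrete, the sketch below applies one level of an averaging Haar transform to a 1D signal: pairwise means form the low-frequency band and pairwise half-differences form the high-frequency band. In the setting described by the abstract, the low band would be supervised by the RGB input and the high band by the NIR input. The choice of Haar, the single level, and the function names are assumptions; the abstract states only that multi-scale wavelet transforms are used.

```python
from typing import List, Sequence, Tuple


def haar_split(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One-level averaging Haar transform: returns (low, high) bands."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low, high


def haar_merge(low: Sequence[float], high: Sequence[float]) -> List[float]:
    """Exact inverse of `haar_split`."""
    out: List[float] = []
    for l, h in zip(low, high):
        out.extend([l + h, l - h])
    return out


low, high = haar_split([4, 2, 6, 6, 1, 3])
print(low, high, haar_merge(low, high))
```

Because the transform is invertible, supervising the two bands separately loses no information about the original signal.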
cs.CV / 65 / 2604.16806
Channel Attention-Guided Cross-Modal Knowledge Distillation for Referring Image Segmentation
基于通道注意力引导的跨模态知识蒸馏用于指称图像分割
Abstract
Referring image segmentation (RIS) requires accurate segmentation of target regions in images according to language descriptions, which is a cross-modal task integrating vision and language. Existing RIS methods typically employ large-scale vision and language encoding models to improve performance, but their enormous parameter size severely restricts deployment in scenarios with limited computing resources. To solve this problem, this paper proposes a channel attention-guided cross-modal knowledge distillation method, which transfers the high-order fine-grained correlations between vision and language learned by the teacher network, as well as the correlations between semantic components represented by each channel, to the student network. Compared with the traditional pixel-wise relational distillation, this method not only enables the student to learn the knowledge of the teacher, but also retains part of its independent learning ability, alleviating the transfer of learning bias. Experimental results on two public datasets show that the proposed distillation method does not introduce additional parameters during inference and can achieve significant performance improvement for the student model.
Chinese Translation
指称图像分割(RIS)需要根据语言描述对图像中的目标区域进行准确分割,这是一项整合视觉和语言的跨模态任务。现有的RIS方法通常采用大规模的视觉和语言编码模型来提高性能,但其庞大的参数规模严重限制了在计算资源有限的场景中的部署。为了解决这个问题,本文提出了一种基于通道注意力引导的跨模态知识蒸馏方法,该方法将教师网络学习到的视觉和语言之间的高阶细粒度关联,以及每个通道所表示的语义组件之间的关联,转移到学生网络。与传统的逐像素关系蒸馏相比,该方法不仅使学生能够学习教师的知识,还保留了部分独立学习能力,从而减轻了学习偏差的转移。在两个公共数据集上的实验结果表明,所提出的蒸馏方法在推理过程中不引入额外的参数,并且能够显著提高学生模型的性能。
cs.CV / 66 / 2604.16808
Modeling Biomechanical Constraint Violations for Language-Agnostic Lip-Sync Deepfake Detection
建模生物力学约束违规以实现语言无关的唇动深伪检测
Abstract
Current lip-sync deepfake detectors rely on pixel-level artifacts or audio-visual correspondence, failing to generalize across languages because these cues encode data-dependent patterns rather than universal physical laws. We identify a more fundamental principle: generative models do not enforce the biomechanical constraints of authentic orofacial articulation, producing measurably elevated temporal lip variance -- a signal we term temporal lip jitter -- that is empirically consistent across the speaker's language, ethnicity, and recording conditions. We instantiate this principle through BioLip, a lightweight framework operating on 64 perioral landmark coordinates extracted by MediaPipe.
Chinese Translation
当前的唇动深伪检测器依赖于像素级伪影或视听对应关系,无法跨语言进行泛化,因为这些线索编码的是数据依赖的模式而非普遍的物理法则。我们识别出一个更为根本的原则:生成模型未能强制执行真实口面发音的生物力学约束,导致可测量的时间唇部变异性升高——我们称之为时间唇部抖动(temporal lip jitter)——这一信号在说话者的语言、种族和录音条件下经验上是一致的。我们通过BioLip这一轻量级框架实现了这一原则,该框架基于MediaPipe提取的64个口周标志点坐标进行操作。
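The abstract does not define temporal lip jitter precisely. One plausible reading, sketched below purely as an assumption, is the variance of each landmark's frame-to-frame displacement, averaged over landmarks: smooth articulation gives near-constant step sizes and low variance, while erratic motion gives high variance. The function name and the toy tracks are hypothetical, and real input would be the 64 perioral landmarks per frame.

```python
import math
from statistics import pvariance
from typing import List, Tuple

Frame = List[Tuple[float, float]]  # one (x, y) pair per landmark


def temporal_jitter(frames: List[Frame]) -> float:
    """Mean over landmarks of the variance of frame-to-frame step lengths."""
    n_landmarks = len(frames[0])
    per_landmark = []
    for k in range(n_landmarks):
        steps = [
            math.hypot(frames[t + 1][k][0] - frames[t][k][0],
                       frames[t + 1][k][1] - frames[t][k][1])
            for t in range(len(frames) - 1)
        ]
        per_landmark.append(pvariance(steps))
    return sum(per_landmark) / n_landmarks


smooth = [[(float(t), 0.0)] for t in range(5)]                      # constant velocity
jittery = [[(0.0, 0.0)], [(1.0, 0.0)], [(1.0, 0.0)], [(3.0, 0.0)], [(3.0, 0.0)]]
print(temporal_jitter(smooth), temporal_jitter(jittery))
```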
cs.CV / 67 / 2604.16823
Hierarchical Vision Transformer Enhanced by Graph Convolutional Network for Image Classification
增强图卷积网络的层次化视觉变换器用于图像分类
Abstract
Vision Transformer (ViT) has brought new breakthroughs to image classification by introducing the self-attention mechanism, and Graph Convolutional Networks (GCNs) have been successfully applied to data representation and analysis. However, key challenges limit their further development: (1) The patch size selected by ViT is crucial for accurate predictions, which raises a natural question: how should the patch size be chosen, or how can small and large patches be combined comprehensively? (2) Although spatial structure information is important in vision tasks, 1D position embeddings fail to capture the spatial structure of patches accurately. (3) A GCN can capture the local connectivity relationships between image nodes, but it lacks the ability to capture global graph structural information. Conversely, the self-attention mechanism of ViT can capture global relations among image patches, but it cannot model the local structure of an image. To overcome these limitations, we propose the Hierarchical Vision Transformer Enhanced by Graph Convolutional Network (GCN-HViT) for image classification. Specifically, the Hierarchical ViT we designed models patch-wise information interactions on a global scale within each level and models hierarchical relationships between small and large patches across multiple levels. In addition, the proposed GCN method functions as a local feature extractor that obtains the local representation of each image patch, which serves as a 2D position embedding of that patch. Meanwhile, it models patch-wise information interactions on a local scale within each level. Extensive experiments on 3 real-world datasets demonstrate that GCN-HViT achieves state-of-the-art performance.
Chinese Translation
视觉变换器(Vision Transformer, ViT)通过引入自注意力机制为图像分类领域带来了新的突破,而图卷积网络(Graph Convolutional Networks, GCN)也被提出并成功应用于数据表示和分析。然而,仍然存在一些关键挑战限制其进一步发展:(1)ViT选择的补丁大小对准确预测至关重要,这引发了一个自然的问题:如何恰当地选择补丁的大小,或者如何综合结合小补丁和大补丁;(2)虽然空间结构信息在视觉任务中很重要,但一维位置嵌入无法更准确地捕捉补丁的空间结构信息;(3)GCN能够捕捉图像节点之间的局部连接关系,但缺乏捕捉全局图结构信息的能力。相反,ViT的自注意力机制能够提取图像补丁之间的全局关系,但无法建模图像的局部结构。为了解决这些局限性,我们提出了增强图卷积网络的层次化视觉变换器(GCN-HViT)用于图像分类。具体而言,我们设计的层次化ViT能够在每个层级内以全局尺度建模补丁间的信息交互,并在多个层级之间建模小补丁和大补丁之间的层次关系。此外,所提出的GCN方法作为局部特征提取器,获取每个图像补丁的局部表示,作为二维空间中每个补丁的二维位置嵌入。同时,它在每个层级内以局部尺度建模补丁间的信息交互。在三个真实世界数据集上的大量实验表明,GCN-HViT达到了最先进的性能。
cs.CV / 68 / 2604.16836
Lorentz Framework for Semantic Segmentation
洛伦兹框架下的语义分割
Abstract
Semantic segmentation in hyperbolic space enables compact modeling of hierarchical structure while providing inherent uncertainty quantification. Prior approaches predominantly rely on the Poincaré ball model, which suffers from numerical instability as well as optimization and computational challenges. We propose a novel, tractable, architecture-agnostic semantic segmentation framework (pixel-wise and mask classification) in the hyperbolic Lorentz model. We employ text embeddings with semantic and visual cues to guide hierarchical pixel-level representations in Lorentz space. This enables stable and efficient optimization without requiring a Riemannian optimizer, and easily integrates with existing Euclidean architectures. Beyond segmentation, our approach yields free uncertainty estimation, confidence maps, boundary delineation, hierarchical and text-based retrieval, and zero-shot performance, reaching generalized flatter minima. We introduce a novel uncertainty and confidence indicator in Lorentz cone embeddings. Further, we provide analytical and empirical insights into Lorentz optimization via gradient analysis. Extensive experiments on ADE20K, COCO-Stuff-164k, Pascal-VOC, and Cityscapes, utilizing state-of-the-art per-pixel classification models (DeepLabV3 and SegFormer) and mask classification models (Mask2Former and MaskFormer), validate the effectiveness and generality of our approach. Our results demonstrate the potential of hyperbolic Lorentz embeddings for robust and uncertainty-aware semantic segmentation. Code is available at https://github.com/mxahan/Lorentz_semantic_segmentation.
Chinese Translation
在双曲空间中的语义分割能够紧凑地建模层次结构,同时提供固有的不确定性量化。先前的方法主要依赖于庞加莱球模型,但该模型存在数值不稳定性、优化和计算挑战。我们提出了一种新颖的、可处理的、与架构无关的语义分割框架(像素级和掩膜分类),基于双曲洛伦兹模型。我们利用文本嵌入结合语义和视觉线索来引导洛伦兹空间中的层次像素级表示。这使得优化过程稳定高效,无需使用黎曼优化器,并且能够与现有的欧几里得架构轻松集成。除了分割之外,我们的方法还提供了自由的不确定性估计、置信度图、边界描绘、层次和基于文本的检索以及零样本性能,达到广义的平坦极小值。我们在洛伦兹锥嵌入中引入了一种新颖的不确定性和置信度指标。此外,我们通过梯度分析提供了对洛伦兹优化的分析和实证见解。在ADE20K、COCO-Stuff-164k、Pascal-VOC和Cityscapes上的大量实验,利用最先进的逐像素分类模型(DeepLabV3和SegFormer)以及掩膜分类模型(mask2former和maskformer),验证了我们方法的有效性和普适性。我们的结果展示了双曲洛伦兹嵌入在稳健和不确定性感知的语义分割中的潜力。代码可在 https://github.com/mxahan/Lorentz_semantic_segmentation 获取。
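As background for the Lorentz (hyperboloid) model named in the abstract, the sketch below shows its three basic operations at curvature -1: the Lorentzian inner product, the exponential map at the origin that lifts a Euclidean feature vector onto the hyperboloid, and the geodesic distance. These are the standard textbook formulas; how the paper wires them into a segmentation head is not stated in the abstract, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def lorentz_inner(x: Sequence[float], y: Sequence[float]) -> float:
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def exp_map_origin(v: Sequence[float]) -> List[float]:
    """Lift a Euclidean (tangent) vector at the origin onto the hyperboloid."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        return [1.0] + [0.0] * len(v)
    return [math.cosh(norm)] + [math.sinh(norm) * c / norm for c in v]


def lorentz_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Geodesic distance; the clamp guards against rounding just below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


origin = [1.0, 0.0, 0.0]
p = exp_map_origin([0.3, 0.4])          # Euclidean norm 0.5
print(lorentz_inner(p, p))               # approximately -1: p lies on the hyperboloid
print(lorentz_distance(origin, p))       # approximately 0.5: distance equals the tangent norm
```

Because the lift is an ordinary differentiable function of the Euclidean features, a network ending in `exp_map_origin` can be trained with a standard optimizer, which is one common way to avoid a Riemannian optimizer.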
cs.CV / 69 / 2604.16841
When Earth Foundation Models Meet Diffusion: An Application to Land Surface Temperature Super-Resolution
当地球基础模型遇上扩散:应用于地表温度超分辨率
Abstract
Land surface temperature (LST) super-resolution is important for environmental monitoring. However, it remains challenging as coarse thermal observations severely underdetermine fine-scale structure. In this paper, we propose Earth Foundation Model-guided Diffusion (EFDiff), a novel framework for super-resolution under extreme spatial degradation. EFDiff uses the Prithvi-EO-2.0 Earth foundation model to encode high-resolution multispectral reflectance into geospatial embeddings, which are injected into the denoising network via cross-attention to guide fine-scale reconstruction from highly degraded observations. We study two variants, EFDiff-$\epsilon$ and EFDiff-$x_0$, which offer complementary trade-offs between perceptual realism and pixel-level fidelity. We evaluate EFDiff under an extreme $32\times$ scale gap using a globally diverse benchmark comprising 242,416 co-registered Landsat thermal-reflectance patches. Results show that EFDiff consistently outperforms baseline methods and that cross-attention conditioning by EFM is more effective than HLS channel concatenation. Although we present EFDiff in the context of LST super-resolution, the framework is broadly applicable to remote sensing problems in which pretrained geospatial representations can guide generative reconstruction.
Chinese Translation
地表温度(LST)超分辨率对环境监测至关重要。然而,由于粗糙的热观测严重不足以确定细尺度结构,这一任务仍然具有挑战性。本文提出了一种新颖的框架——地球基础模型引导的扩散(Earth Foundation Model-guided Diffusion, EFDiff),用于在极端空间降解条件下实现超分辨率。EFDiff利用Prithvi-EO-2.0地球基础模型将高分辨率多光谱反射率编码为地理空间嵌入,这些嵌入通过交叉注意力注入去噪网络,以指导从高度降解的观测中进行细尺度重建。我们研究了两种变体,EFDiff-$\epsilon$和EFDiff-$x_0$,它们在感知真实感和像素级保真度之间提供了互补的权衡。我们在极端的$32\times$尺度差距下评估EFDiff,使用一个包含242,416个共同注册的Landsat热反射斑块的全球多样性基准。结果表明,EFDiff始终优于基线方法,并且EFM的交叉注意力条件化比HLS通道拼接更为有效。尽管我们在LST超分辨率的背景下展示EFDiff,但该框架广泛适用于可以利用预训练地理空间表示指导生成重建的遥感问题。
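The two variants named in the abstract presumably correspond to the two usual diffusion training targets: predicting the added noise $\epsilon$ or predicting the clean sample $x_0$. Under the standard DDPM forward process $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, either prediction can be converted into the other, as the sketch below shows. The forward process, the variable names, and the toy values are assumptions; the abstract does not give the paper's noise schedule.

```python
import math
from typing import List, Sequence


def add_noise(x0: Sequence[float], eps: Sequence[float], alpha_bar: float) -> List[float]:
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def x0_from_eps(x_t: Sequence[float], eps_hat: Sequence[float], alpha_bar: float) -> List[float]:
    """Recover the clean-sample estimate implied by a noise prediction."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(x_t, eps_hat)]


def eps_from_x0(x_t: Sequence[float], x0_hat: Sequence[float], alpha_bar: float) -> List[float]:
    """Recover the noise estimate implied by a clean-sample prediction."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - a * x0) / b for x, x0 in zip(x_t, x0_hat)]


x0 = [0.2, -0.4]      # toy normalized temperatures
eps = [0.5, -1.2]     # toy noise draw
x_t = add_noise(x0, eps, alpha_bar=0.36)
print(x0_from_eps(x_t, eps, 0.36), eps_from_x0(x_t, x0, 0.36))
```

The two targets are algebraically equivalent at a fixed timestep but weight errors differently across noise levels, which is one reason they can trade perceptual realism against pixel-level fidelity.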
cs.CV / 70 / 2604.16848
TowerDataset: A Heterogeneous Benchmark for Transmission Corridor Segmentation with a Global-Local Fusion Framework
TowerDataset:一种用于传输走廊分割的异构基准测试,采用全球-局部融合框架
Abstract
Fine-grained semantic segmentation of transmission-corridor point clouds is fundamental for intelligent power-line inspection. However, current progress is limited by realistic data scarcity and the difficulty of modeling global corridor structure and local geometric details in long, heterogeneous scenes. Existing public datasets usually provide only a few coarse categories or short cropped scenes which overlook long-range structural dependencies, severe long-tail distributions, and subtle distinctions among safety-critical components. As a result, current methods are difficult to evaluate under realistic inspection settings, and their ability to preserve and integrate complementary global and local cues remains unclear. To address the above challenges, we introduce TowerDataset, a heterogeneous benchmark for transmission-corridor segmentation. TowerDataset contains 661 real-world scenes and about 2.466 billion points. It preserves long corridor extents, defines a fine-grained 22-class taxonomy, and provides standardized splits and evaluation protocols. In addition, we present a global-local fusion framework which preserves and fuses whole-scene and local-detail information. A whole-scene branch with NoCrop training and prototypical contrastive learning captures long-range topology and contextual dependencies. A block-wise local branch retains fine geometric structures. Both predictions are then fused and refined by geometric validation. This design allows the model to exploit both global relationships and local shape details when recognizing rare and confusing components. Experiments on TowerDataset and two public benchmarks demonstrate the challenge of the proposed benchmark and the robustness of our framework in real, complex, and heterogeneous transmission-corridor scenes. The dataset will be released soon at https://huggingface.co/datasets/tccx18/Towerdataset/tree/main.
Chinese Translation
传输走廊点云的细粒度语义分割对于智能电力线路检查至关重要。然而,当前的进展受到现实数据稀缺和在长时间异构场景中建模全球走廊结构与局部几何细节的困难的限制。现有的公共数据集通常仅提供少量粗略类别或短裁剪场景,忽视了长距离结构依赖、严重的长尾分布以及安全关键组件之间的微妙区别。因此,当前的方法在现实检查环境下难以评估,其保留和整合互补的全球与局部线索的能力仍不明确。为了解决上述挑战,我们引入了TowerDataset,一个用于传输走廊分割的异构基准测试。TowerDataset包含661个真实场景和约24.66亿个点。它保留了长走廊的范围,定义了一个细粒度的22类分类法,并提供了标准化的划分和评估协议。此外,我们提出了一个全球-局部融合框架,该框架保留并融合了全场景和局部细节信息。一个采用NoCrop训练和原型对比学习的全场景分支捕捉了长距离拓扑和上下文依赖关系。一个基于块的局部分支保留了细致的几何结构。然后,这两个预测通过几何验证进行融合和精炼。该设计使模型在识别稀有和混淆组件时能够利用全球关系和局部形状细节。在TowerDataset和两个公共基准上的实验展示了所提基准的挑战性以及我们框架在真实、复杂和异构传输走廊场景中的鲁棒性。该数据集将很快在https://huggingface.co/datasets/tccx18/Towerdataset/tree/main发布。
cs.CV / 71 / 2604.16854
CATP: Confidence-Aware Token Pruning for Camouflaged Object Detection
CATP:用于伪装物体检测的置信度感知令牌剪枝
Abstract
Camouflaged Object Detection (COD) aims to segment targets that share extreme textural and structural similarities with their complex environments. Leveraging their capacity for long-range dependency modeling, Transformer-based detectors have become the mainstream approach and achieve state-of-the-art (SoTA) accuracy, yet their substantial computational overhead severely limits practical deployment. To address this, we propose a hierarchical Confidence-Aware Token Pruning framework (CATP) tailored for COD. Our approach hierarchically identifies and discards easily distinguishable tokens from both background and object interiors, focusing computations on critical boundary tokens. To compensate for information loss from pruning, we introduce a dual-path feature compensation mechanism that aggregates contextual knowledge from pruned tokens into enriched features. Extensive experiments on multiple COD benchmarks demonstrate that our method significantly reduces computational complexity while maintaining high accuracy, offering a promising research direction for the efficient deployment of COD models in real-world scenarios. The code will be released.
Chinese Translation
伪装物体检测(COD)旨在分割与其复杂环境具有极端纹理和结构相似性的目标。利用其长程依赖建模能力,基于Transformer的检测器已成为主流方法,并实现了最先进的(SoTA)准确性,但其巨大的计算开销严重限制了实际部署。为了解决这个问题,我们提出了一种针对COD的分层置信度感知令牌剪枝框架(CATP)。我们的方法分层识别并丢弃背景和物体内部中易于区分的令牌,将计算重点放在关键边界令牌上。为了弥补剪枝带来的信息损失,我们引入了一种双路径特征补偿机制,将来自剪枝令牌的上下文知识聚合到丰富的特征中。在多个COD基准上的广泛实验表明,我们的方法显著降低了计算复杂性,同时保持了高准确性,为在现实场景中高效部署COD模型提供了有前景的研究方向。代码将会发布。
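As a rough illustration of the pruning idea in this abstract, the Python sketch below keeps tokens whose foreground confidence is ambiguous (likely boundary tokens) and prunes confidently classified background or interior tokens. The pruned features are folded into a single mean summary token as a crude stand-in for feature compensation. The thresholds, the function name `prune_tokens`, and the mean-pooling compensation are assumptions made for illustration; the abstract does not specify CATP's actual criteria or its dual-path mechanism.

```python
from typing import List, Tuple


def prune_tokens(
    features: List[List[float]],
    confidence: List[float],
    low: float = 0.2,   # hypothetical: below this, confidently background
    high: float = 0.8,  # hypothetical: above this, confidently object interior
) -> Tuple[List[List[float]], List[int]]:
    """Keep ambiguous (boundary-like) tokens; summarize the rest in one token."""
    if len(features) != len(confidence):
        raise ValueError("features and confidence must have the same length")
    kept_idx = [i for i, c in enumerate(confidence) if low <= c <= high]
    pruned = [features[i] for i, c in enumerate(confidence) if not (low <= c <= high)]
    kept = [features[i] for i in kept_idx]
    if pruned:
        # Crude compensation: average the pruned tokens into one summary token.
        dim = len(pruned[0])
        summary = [sum(tok[d] for tok in pruned) / len(pruned) for d in range(dim)]
        kept.append(summary)
    return kept, kept_idx


feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [3.0, 1.0]]
conf = [0.05, 0.95, 0.5, 0.1]
kept, idx = prune_tokens(feats, conf)
print(idx)   # indices of tokens kept at full resolution
print(kept)  # kept tokens followed by the summary of pruned tokens
```

In this toy input only the token with confidence 0.5 survives at full resolution, so later layers would process two tokens instead of four.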
cs.CV / 72 / 2604.16855
When W4A4 Breaks Camouflaged Object Detection: Token-Group Dual-Constraint Activation Quantization
当 W4A4 破坏伪装物体检测:基于 Token-Group 的双约束激活量化
Abstract
Camouflaged object detection (COD) segments objects that intentionally blend with the background, so predictions depend on subtle texture and boundary cues. COD is often needed under tight on-device memory and latency budgets, making low-bit inference highly desirable. However, COD is unusually hard to quantize aggressively. We study post-training W4A4 quantization of Transformer-based COD and find a task-specific cliff: heavy-tailed background tokens dominate a shared activation range, inflating the step size and pushing weak-but-structured boundary cues into the zero bin. This exposes a token-local bottleneck: cross-token range domination must be removed and the zero-bin mass must be bounded under 4-bit activations. To address this, we introduce COD-TDQ, a COD-aware Token-group Dual-constraint activation Quantization method. COD-TDQ addresses this token-local bottleneck with two coupled steps: Direct-Sum Token-Group (DSTG) assigns token-group scales to suppress cross-token range domination, and Dual-Constraint Range Projection (DCRP) projects each token-group clip range to keep the step-to-dispersion ratio and the zero-bin mass bounded. Across four COD benchmarks and two baseline models (CFRN and ESCNet), COD-TDQ consistently achieves an $S_\alpha$ score more than 0.12 higher than that of the state-of-the-art quantization method without retraining. The code will be released.
Chinese Translation
伪装物体检测(COD)是对故意与背景融为一体的物体进行分割,因此预测依赖于微妙的纹理和边界线索。在设备内存和延迟预算紧张的情况下,COD 通常需要低位推理,这使得低位推理变得非常重要。然而,COD 的量化通常异常困难。我们研究了基于 Transformer 的 COD 的后训练 W4A4 量化,并发现了一个任务特定的瓶颈:重尾背景 token 主导了共享激活范围,导致步长膨胀并将弱但结构化的边界线索推入零箱。这暴露了一个 token 局部瓶颈:需要消除跨 token 范围的主导作用,并在 4 位激活下限制零箱质量。为了解决这个问题,我们引入了 COD-TDQ,一种 COD 感知的 Token-group 双约束激活量化方法。COD-TDQ 通过两个耦合步骤解决了这个 token 局部瓶颈:直接和 Token-Group(DSTG)为 token-group 分配尺度,以抑制跨 token 范围的主导作用,双约束范围投影(DCRP)将每个 token-group 的剪辑范围投影,以保持步长与离散度比率和零箱质量的界限。在四个 COD 基准和两个基线模型(CFRN 和 ESCNet)上,COD-TDQ 的 $S_\alpha$ 分数始终比最先进的量化方法高出 0.12 以上,而无需重新训练。代码将会发布。
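The range-domination effect described in this abstract can be reproduced with a toy example. The sketch below applies symmetric 4-bit quantization to a weak "boundary" token, first with one scale shared with a heavy-tailed "background" token and then with a scale of its own. It is not COD-TDQ itself (the abstract does not specify how DSTG and DCRP are computed), and the function names and toy values are assumptions.

```python
from typing import List

QMAX = 7  # symmetric signed 4-bit grid: integer levels in [-7, 7]


def quantize(values: List[float], scale: float) -> List[int]:
    """Round to the nearest level and clip to the 4-bit range."""
    return [max(-QMAX, min(QMAX, round(v / scale))) for v in values]


def zero_bin_fraction(q: List[int]) -> float:
    """Share of quantized values that collapsed to level 0."""
    return sum(1 for x in q if x == 0) / len(q)


# Toy activations: one heavy-tailed background token, one weak boundary token.
background = [14.0, -9.0, 12.0, -13.0]
boundary = [0.3, -0.4, 0.35, -0.1]

# Shared scale: the background token sets the step size for everyone.
shared_scale = max(abs(v) for v in background + boundary) / QMAX
shared_q = quantize(boundary, shared_scale)

# Per-token scale: the boundary token gets its own range.
local_scale = max(abs(v) for v in boundary) / QMAX
local_q = quantize(boundary, local_scale)

print(shared_q, zero_bin_fraction(shared_q))
print(local_q, zero_bin_fraction(local_q))
```

With the shared scale, every boundary value falls below half a step and lands in the zero bin. With a local scale, the same values spread across the available levels.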
cs.CV / 73 / 2604.16858
Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement
Q-DeepSight:通过图像激励思考以进行图像质量评估和优化
Abstract
Image Quality Assessment (IQA) models are increasingly deployed as perceptual critics to guide generative models and image restoration. This role demands not only accurate scores but also actionable, localized feedback. However, current MLLM-based methods adopt a single-look, language-only paradigm, which departs from human evidence-seeking judgment and yields weakly grounded rationales, limiting their reliability for in-the-loop refinement. We propose Q-DeepSight, a think-with-image framework that emulates this human-like process. It performs interleaved Multimodal Chain-of-Thought (iMCoT) with tool-augmented evidence acquisition (e.g., crop-and-zoom) to explicitly determine where quality degrades and why. To train these long iMCoT trajectories via reinforcement learning, we introduce two techniques: Perceptual Curriculum Reward (PCR) to mitigate reward sparsity and Evidence Gradient Filtering (EGF) to improve credit assignment for visually-grounded reasoning. Q-DeepSight achieves state-of-the-art performance across diverse benchmarks, including natural, restored, and AI-generated content. Furthermore, we demonstrate its practical value with Perceptual-in-Generation (PiG), a training-free framework where Q-DeepSight's diagnoses guide iterative image enhancement, effectively closing the loop between assessment and refinement.
Chinese Translation
图像质量评估(IQA)模型越来越多地被用作感知评估者,以指导生成模型和图像修复。这一角色不仅要求准确的评分,还需要可操作的、局部的反馈。然而,当前基于MLLM的方法采用单一视角、仅基于语言的范式,这与人类寻求证据的判断相悖,导致其推理基础薄弱,从而限制了其在循环优化中的可靠性。我们提出了Q-DeepSight,这是一种通过图像思考的框架,模拟这种类人过程。它通过工具增强的证据获取(例如,裁剪和放大)执行交错的多模态思维链(iMCoT),明确确定质量下降的原因及位置。为了通过强化学习训练这些长的iMCoT轨迹,我们引入了两种技术:感知课程奖励(Perceptual Curriculum Reward, PCR)以减轻奖励稀疏性,以及证据梯度过滤(Evidence Gradient Filtering, EGF)以改善视觉基础推理的信用分配。Q-DeepSight在自然、修复和AI生成内容等多种基准上实现了最先进的性能。此外,我们通过生成中的感知(Perceptual-in-Generation, PiG)展示了其实际价值,这是一种无训练框架,其中Q-DeepSight的诊断指导迭代图像增强,有效地闭合了评估与优化之间的循环。
cs.CV / 74 / 2604.16879
Adaptive Forensic Feature Refinement via Intrinsic Importance Perception
通过内在重要性感知的自适应取证特征精炼
Abstract
With the rapid development of generative models and multimodal content editing technologies, the key challenge faced by synthetic image detection (SID) lies in cross-distribution generalization to unknown generation sources. In recent years, visual foundation models (VFMs), which acquire rich visual priors through large-scale image-text alignment pretraining, have become a promising technical route for improving the generalization ability of SID. However, existing VFM-based methods remain relatively coarse-grained in their adaptation strategies. They typically either directly use the final-layer representations of the VFM or simply fuse multi-layer features, lacking explicit modeling of the optimal representational hierarchy for transferable forgery cues. Meanwhile, although directly fine-tuning the VFM can enhance task adaptation, it may also damage the cross-modal pretrained structure that supports open-set generalization. To address this task-specific tension, we reformulate VFM adaptation for SID as a joint optimization problem: it is necessary both to identify the critical representational layer best suited to carrying forgery-discriminative information and to constrain the disturbance that task knowledge injection causes to the pretrained structure. Based on this, we propose I2P, an SID framework centered on intrinsic importance perception. I2P first adaptively identifies the critical layer representations that are most discriminative for SID, and then constrains task-driven parameter updates within a low-sensitivity parameter subspace, thereby improving task specificity while preserving the transferable structure of pretrained representations as much as possible.
Chinese Translation
随着生成模型和多模态内容编辑技术的快速发展,合成图像检测(SID)面临的主要挑战在于对未知生成源的跨分布泛化。近年来,视觉基础模型(VFM)通过大规模图像-文本对齐预训练获取丰富的视觉先验,已成为提高SID泛化能力的有前景的技术路线。然而,现有的基于VFM的方法在适应策略上仍然相对粗糙。它们通常直接使用VFM的最终层表示或简单地融合多层特征,缺乏对可转移伪造线索的最佳表示层次的明确建模。同时,尽管直接微调VFM可以增强任务适应性,但也可能损害支持开放集泛化的跨模态预训练结构。为了解决这一任务特定的矛盾,我们将VFM适应SID重新表述为一个联合优化问题:既需要识别更适合承载伪造区分信息的关键表示层,又需要限制任务知识注入对预训练结构造成的干扰。在此基础上,我们提出了I2P,一个以内在重要性感知为中心的SID框架。I2P首先自适应地识别对SID最具区分性的关键层表示,然后将任务驱动的参数更新限制在低敏感度参数子空间内,从而在尽可能保留预训练表示的可转移结构的同时提高任务特异性。
cs.CV / 75 / 2604.16884
Bias-constrained multimodal intelligence for equitable and reliable clinical AI
偏差约束的多模态智能用于公平可靠的临床人工智能
Abstract
The integration of medical imaging and clinical text has enabled the emergence of generalist artificial intelligence (AI) systems for healthcare. However, pervasive biases, such as imbalanced disease prevalence, skewed anatomical region distributions, heterogeneous imaging protocols, and demographic disparities, pose significant challenges to the fairness and reliability of vision-language systems in real-world clinical settings. Here we present BiasCareVL, a bias-aware multimodal learning framework that introduces bias control directly into model design, rather than treating it as a post hoc correction. BiasCareVL incorporates adaptive uncertainty modeling with optional human-in-the-loop refinement to regulate the influence of dominant data patterns and to promote equitable reasoning under distributional imbalance. Trained on 3.44 million samples spanning over 15 imaging modalities, the framework supports diverse clinical tasks, including visual question answering, disease classification, segmentation, and report generation within a unified representation space. Across eight public benchmarks covering dermatology, oncology, radiology, and pathology, BiasCareVL consistently outperforms 20 state-of-the-art methods, with pronounced gains in clinically challenging scenarios, including over 10% accuracy improvement in multi-class skin lesion diagnosis and more than 20% Dice improvement in small tumor segmentation. Furthermore, BiasCareVL achieves diagnostic performance exceeding human accuracy with substantially reduced time requirements when evaluated with board-certified radiologists. By open-sourcing BiasCareVL, we aim to promote a transparent, reproducible, and equitable future for AI in healthcare, paving the way for general-purpose, trustworthy, and clinically reliable AI systems.
Chinese Translation
医学影像与临床文本的整合使得通用人工智能(AI)系统在医疗领域的出现成为可能。然而,普遍存在的偏差,如疾病发生率不平衡、解剖区域分布偏斜、异质成像协议和人口统计差异,对现实临床环境中视觉-语言系统的公平性和可靠性构成了重大挑战。在此,我们提出了BiasCareVL,一个偏差感知的多模态学习框架,该框架将偏差控制直接引入模型设计,而不是将其视为事后修正。BiasCareVL结合了自适应不确定性建模和可选的人机协作精细化,以调节主导数据模式的影响,并促进在分布不平衡下的公平推理。该框架在覆盖15种成像模态的344万样本上进行训练,支持多种临床任务,包括视觉问答、疾病分类、分割和报告生成,均在统一的表示空间内进行。在涵盖皮肤病学、肿瘤学、放射学和病理学的八个公共基准测试中,BiasCareVL始终优于20种最先进的方法,在临床挑战场景中表现出显著的提升,包括在多类皮肤病变诊断中提高超过10%的准确率,以及在小肿瘤分割中提高超过20%的Dice系数。此外,当与持证放射科医师进行评估时,BiasCareVL的诊断性能超过人类准确性,同时显著减少了时间需求。通过开源BiasCareVL,我们旨在推动医疗领域人工智能的透明性、可重复性和公平性,为通用、可信和临床可靠的人工智能系统铺平道路。
cs.CV / 76 / 2604.16892
CrossFlowDG: Bridging the Modality Gap with Cross-modal Flow Matching for Domain Generalization
CrossFlowDG:通过跨模态流匹配弥合模态差距以实现领域泛化
Abstract
Domain generalization (DG) aims to maintain performance under domain shift, which in computer vision appears primarily as stylistic variations that cause models to overfit to domain-specific appearance cues rather than class semantics. To overcome this, recent methods use textual representations as stable, domain-invariant anchors. However, multimodal approaches that rely on cosine similarity-based contrastive alignment leave a modality gap where image and text embeddings remain geometrically separated despite semantic correspondence. We propose CrossFlowDG, a novel DG framework that addresses this residual gap using noise-free, cross-modal flow matching. By learning a continuous transformation in the joint Euclidean latent space, our framework explicitly transports domain-biased image embeddings toward domain-invariant text embeddings of the correct class. Using the efficient VMamba image encoder and CLIP's text encoder, CrossFlowDG is tested against four common DG benchmarks, and achieves competitive performance on several benchmarks and state-of-the-art on TerraIncognita. Code is available at: https://github.com/ajkrit/CrossFlowDG
Chinese Translation
领域泛化(DG)旨在在领域转移下保持性能,这在计算机视觉中主要表现为风格变化,这导致模型过度拟合于领域特定的外观线索而非类别语义。为了克服这一问题,最近的方法使用文本表示作为稳定的、领域不变的锚点。然而,依赖于基于余弦相似度的对比对齐的多模态方法留下了一个模态差距,即图像和文本嵌入在几何上保持分离,尽管它们在语义上是对应的。我们提出了CrossFlowDG,这是一种新颖的领域泛化框架,通过无噪声的跨模态流匹配来解决这一残余差距。通过在联合欧几里得潜在空间中学习连续变换,我们的框架明确地将领域偏向的图像嵌入传输到正确类别的领域不变文本嵌入。使用高效的VMamba图像编码器和CLIP的文本编码器,CrossFlowDG在四个常见的领域泛化基准上进行了测试,并在多个基准上取得了竞争力的表现,在TerraIncognita上达到了最先进的水平。代码可在以下链接获取:https://github.com/ajkrit/CrossFlowDG
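For orientation, noise-free flow matching between paired embeddings is often set up with a straight-line path from source to target and a constant target velocity. The sketch below shows that construction for one image/text embedding pair and integrates the ideal velocity field with Euler steps. The linear path, the function names, and the toy vectors are assumptions, and a real model would replace `ideal_velocity` with a learned network conditioned on position and time.

```python
from typing import List

Vec = List[float]


def interpolate(x_img: Vec, x_txt: Vec, t: float) -> Vec:
    """Point on the straight path from image embedding (t=0) to text embedding (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x_img, x_txt)]


def ideal_velocity(x_img: Vec, x_txt: Vec) -> Vec:
    """Regression target for the velocity field along the straight path."""
    return [b - a for a, b in zip(x_img, x_txt)]


def flow_matching_loss(pred_v: Vec, x_img: Vec, x_txt: Vec) -> float:
    """Mean squared error between a predicted velocity and the ideal one."""
    target = ideal_velocity(x_img, x_txt)
    return sum((p - q) ** 2 for p, q in zip(pred_v, target)) / len(target)


def transport(x_img: Vec, velocity: Vec, steps: int = 10) -> Vec:
    """Euler integration of dx/dt = velocity from t=0 to t=1."""
    x = list(x_img)
    dt = 1.0 / steps
    for _ in range(steps):
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x


image_emb = [0.2, -1.0, 0.5]
text_emb = [1.0, 0.0, -0.5]
v = ideal_velocity(image_emb, text_emb)
moved = transport(image_emb, v)
print(moved)  # lands on the text embedding up to floating-point error
```

Because no noise is sampled, the source of the path is the image embedding itself, which is what distinguishes this setup from generative flow matching that starts from a Gaussian sample.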
cs.CV / 77 / 2604.16893
EasyVideoR1: Easier RL for Video Understanding
EasyVideoR1:更简单的视频理解强化学习
Abstract
Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored for video modality. In this work, we present EasyVideoR1, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47$\times$ throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.
Chinese Translation
基于可验证奖励的强化学习(RLVR)在提升大型语言模型的推理能力方面表现出了显著的有效性。随着模型演变为原生的多模态架构,将RLVR扩展到视频理解变得越来越重要,但由于视频任务类型的多样性、反复解码和预处理高维视觉输入的计算开销,以及在众多敏感超参数下进行可重复评估的困难,这一领域仍然在很大程度上未被探索。现有的开源强化学习训练框架为文本和图像场景提供了坚实的基础设施,但缺乏针对视频模态的系统优化。在本研究中,我们提出了EasyVideoR1,这是一个专门为视频理解任务训练大型视觉-语言模型而设计的完整高效的强化学习框架。EasyVideoR1的贡献包括:(1)一个完整的视频强化学习训练流程,具有离线预处理和张量缓存,消除了冗余的视频解码,并实现了1.47倍的吞吐量提升;(2)一个全面的、任务感知的奖励系统,涵盖11种不同的视频和图像问题类型,具有统一的路由和模块化扩展;(3)一种混合的离线-在线数据训练范式,结合了精心策划的高质量轨迹与策略内探索,有助于学习更具挑战性的任务;(4)联合图像-视频训练,具有独立可配置的像素预算,使两种模态能够相互增强;(5)一个异步多基准评估框架,涵盖22个主流视频理解基准,其再现的准确性与官方报告的分数紧密一致。
cs.CV / 78 / 2604.16895
Physics-Informed Tracking (PIT)
物理信息跟踪 (PIT)
Abstract
We propose Physics-Informed Tracking (PIT), a framework for tracking a single particle from video, in which a neural network autoencoder localizes the particle as a heatmap peak (landmark) and a differentiable physics module embedded in the autoencoder constrains several landmarks over time (a trajectory) to satisfy known dynamics. The novel Physics-Informed Landmark Loss (PILL) compares this predicted trajectory back against the landmarks, enforcing physical consistency without labels. Its supervised variant (PILLS) instead compares the prediction against ground-truth position, velocity, and bounce from simulation, enabling end-to-end backpropagation. To support supervised and unsupervised learning, we use an autoencoder with a split bottleneck that separates (A) tracking-related structure, captured via landmark heatmaps, from (B) background noise and subsequent image reconstruction. We evaluate a replicated $2^6$ factorial design (n = 4 replicates, 64 configurations), showing that PILLS consistently achieves sub-pixel tracking accuracy for the bilinear and physics-refined decoder outputs under both clean and noisy conditions.
Chinese Translation
我们提出了物理信息跟踪 (PIT),这是一种基于视频的框架,用于从视频中跟踪单个粒子,其中神经网络自编码器将粒子定位为热图峰值(地标),并且嵌入自编码器的可微分物理模块约束多个地标随时间变化(轨迹),以满足已知的动力学。新颖的物理信息地标损失 (PILL) 将预测的轨迹与地标进行比较,强制物理一致性而无需标签。其监督变体 (PILLS) 则将预测与来自仿真的真实位置、速度和反弹进行比较,从而实现端到端的反向传播。为了支持监督和无监督学习,我们使用了一个具有分离瓶颈的自编码器,该瓶颈将 A) 与跟踪相关的结构通过地标热图与 B) 背景噪声及后续图像重建分开。我们评估了一个复制的 $2^6$ 因子设计(n = 4 次重复,64 种配置),结果显示在干净和嘈杂条件下,PILLS 在双线性和物理精炼解码器输出下始终实现亚像素级的跟踪精度。
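One way to picture a label-free physics consistency loss is to roll known dynamics forward from the first landmarks and penalize how far the remaining landmarks deviate from that rollout. The sketch below does this for 1D vertical motion under constant acceleration. The dynamics model, the constant `g`, and all function names are assumptions for illustration and are not taken from the PIT paper, whose physics module (including bounce handling) is not described in the abstract.

```python
from typing import List


def rollout(y0: float, v0: float, g: float, dt: float, n: int) -> List[float]:
    """Closed-form positions under constant acceleration g at times 0, dt, 2*dt, ..."""
    return [y0 + v0 * (k * dt) + 0.5 * g * (k * dt) ** 2 for k in range(n)]


def physics_landmark_loss(landmarks: List[float], g: float, dt: float) -> float:
    """Mean squared gap between landmarks and a rollout seeded by the first two."""
    if len(landmarks) < 3:
        raise ValueError("need at least three landmarks")
    y0, y1 = landmarks[0], landmarks[1]
    # Solve y1 = y0 + v0*dt + 0.5*g*dt^2 for the initial velocity.
    v0 = (y1 - y0) / dt - 0.5 * g * dt
    predicted = rollout(y0, v0, g, dt, len(landmarks))
    return sum((p - y) ** 2 for p, y in zip(predicted, landmarks)) / len(landmarks)


g, dt = -9.81, 0.1
consistent = rollout(y0=2.0, v0=1.0, g=g, dt=dt, n=6)
# Perturb one landmark to mimic a localization error in a single frame.
jittered = [y + (0.05 if k == 4 else 0.0) for k, y in enumerate(consistent)]

print(physics_landmark_loss(consistent, g, dt))  # close to zero
print(physics_landmark_loss(jittered, g, dt))    # clearly larger
```

No ground-truth positions enter the loss: the only supervision signal is agreement between the landmarks and the assumed dynamics.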
cs.CV / 79 / 2604.16910
LAGS: Low-Altitude Gaussian Splatting with Groupwise Heterogeneous Graph Learning
LAGS:基于群体异构图学习的低空高斯溅射
Abstract
Low-altitude Gaussian splatting (LAGS) facilitates 3D scene reconstruction by aggregating aerial images from distributed drones. However, as LAGS prioritizes maximizing reconstruction quality over communication throughput, existing low-altitude resource allocation schemes become inefficient. This inefficiency stems from their failure to account for image diversity introduced by varying viewpoints. To fill this gap, we propose a groupwise heterogeneous graph neural network (GW-HGNN) for LAGS resource allocation. GW-HGNN explicitly models the non-uniform contribution of different image groups to the reconstruction process, thus automatically balancing data fidelity and transmission cost. The key insight of GW-HGNN is to transform LAGS losses and communication constraints into graph learning costs for dual-level message passing. Experiments on real-world LAGS datasets demonstrate that GW-HGNN significantly outperforms state-of-the-art benchmarks across key rendering metrics, including PSNR, SSIM, and LPIPS. Furthermore, GW-HGNN reduces computational latency by approximately 100x compared to the widely-used MOSEK solver, achieving millisecond-level inference suitable for real-time deployment.
Chinese Translation
低空高斯溅射(LAGS)通过聚合来自分布式无人机的航空图像来促进三维场景重建。然而,由于LAGS优先考虑最大化重建质量而非通信吞吐量,现有的低空资源分配方案变得低效。这种低效源于它们未能考虑由不同视角引入的图像多样性。为填补这一空白,我们提出了一种用于LAGS资源分配的群体异构图神经网络(GW-HGNN)。GW-HGNN明确建模不同图像组对重建过程的非均匀贡献,从而自动平衡数据保真度和传输成本。GW-HGNN的关键见解在于将LAGS损失和通信约束转化为图学习成本,以实现双层次的信息传递。在真实世界的LAGS数据集上的实验表明,GW-HGNN在关键渲染指标(包括PSNR、SSIM和LPIPS)上显著超越了最先进的基准。此外,与广泛使用的MOSEK求解器相比,GW-HGNN将计算延迟降低了约100倍,实现了适合实时部署的毫秒级推理。
cs.CV / 80 / 2604.16914
Unified Ultrasound Intelligence Toward an End-to-End Agentic System
统一超声智能:面向端到端的智能系统
Abstract
Clinical ultrasound analysis demands models that generalize across heterogeneous organs, views, and devices, while supporting interpretable workflow-level analysis. Existing methods often rely on task-wise adaptation, and joint learning may be unstable due to cross-task interference, making it hard to deliver workflow-level outputs in practice. To address these challenges, we present USTri, a tri-stage ultrasound intelligence pipeline for unified multi-organ, multi-task analysis. Stage I trains a universal generalist USGen on different domains to learn broad, transferable priors that are robust to device and protocol variability. To better handle domain shifts and reach task-aligned performance while preserving ultrasound shared knowledge, Stage II builds USpec by keeping USGen frozen and finetuning dataset-specific heads. Stage III introduces USAgent, which mimics clinician workflows by orchestrating USpec specialists for multi-step inference and deterministic structured reports. On the FMC_UIA validation set, our model achieves the best overall performance across 4 task types and 27 datasets, outperforming state-of-the-art methods. Moreover, qualitative results show that USAgent produces clinically structured reports with high accuracy and interpretability. Our study suggests a scalable path to ultrasound intelligence that generalizes across heterogeneous ultrasound tasks and supports consistent end-to-end clinical workflows.
Chinese Translation
临床超声分析需要能够在异质器官、视图和设备之间进行泛化的模型,同时支持可解释的工作流程级分析。现有方法通常依赖于任务级适应,联合学习可能由于跨任务干扰而不稳定,导致在实践中难以提供工作流程级输出。为了解决这些挑战,我们提出了USTri,一个用于统一多器官、多任务分析的三阶段超声智能管道。第一阶段在不同领域训练通用的USGen,以学习对设备和协议变异具有鲁棒性的广泛可转移先验。为了更好地处理领域转移并在保持超声共享知识的同时达到任务对齐的性能,第二阶段通过保持USGen不变并微调特定数据集的头部来构建USpec。第三阶段引入USAgent,通过协调USpec专家进行多步骤推理和确定性结构化报告,模拟临床医生的工作流程。在FMC_UIA验证集上,我们的模型在4种任务类型和27个数据集上实现了最佳整体性能,超越了最先进的方法。此外,定性结果表明USAgent生成的临床结构化报告具有高准确性和可解释性。我们的研究为超声智能提供了一条可扩展的路径,使其能够在异质超声任务之间进行泛化,并支持一致的端到端临床工作流程。
cs.CV / 81 / 2604.16915
KIRA: Knowledge-Intensive Image Retrieval and Reasoning Architecture for Specialized Visual Domains
KIRA:针对专业视觉领域的知识密集型图像检索与推理架构
Abstract
Retrieval-augmented generation (RAG) has transformed text-based question answering, yet its extension to visual domains remains hindered by fundamental challenges: bridging the modality gap between image queries and text-heavy knowledge bases, constructing semantically meaningful visual knowledge bases, performing multi-hop reasoning over retrieved images, and verifying that generated answers are faithfully grounded in visual evidence. We present KIRA (Knowledge-Intensive Image Retrieval and Reasoning Architecture), a unified five-stage framework that addresses ten core problems in visual RAG for specialized domains. KIRA introduces: (1) hierarchical semantic chunking with DINO-based region detection for multi-granularity knowledge base construction, (2) domain-adaptive contrastive encoders with few-shot adaptation for rare visual concepts, (3) dual-path cross-modal retrieval with chain-of-thought query expansion, (4) chain-of-retrieval for multi-hop visual reasoning with temporal and multi-view support, and (5) evidence-conditioned grounded generation with post-hoc hallucination verification. We also propose DOMAINVQAR, a benchmark suite that evaluates visual RAG along three axes (retrieval precision, reasoning faithfulness, and domain correctness), going beyond standard recall metrics. Experiments across four specialized domains (medical X-ray, circuit diagrams, satellite imagery, and histopathology) with a progressive six-variant ablation demonstrate that KIRA achieves 0.97 retrieval precision, 1.0 grounding scores, and 0.707 domain correctness averaged across domains, while the ablation reveals actionable insights about when each component helps and when components introduce precision-diversity tradeoffs that must be managed. Code will be released upon acceptance.
Chinese Translation
检索增强生成(RAG)已改变基于文本的问题回答,但其在视觉领域的扩展仍面临基本挑战:弥合图像查询与以文本为主的知识库之间的模态差距,构建语义上有意义的视觉知识库,对检索到的图像进行多跳推理,以及验证生成的答案是否忠实于视觉证据。我们提出了KIRA(知识密集型图像检索与推理架构),这是一个统一的五阶段框架,解决了专业领域视觉RAG中的十个核心问题。KIRA引入了:(1)基于DINO的区域检测的分层语义分块,用于多粒度知识库构建;(2)具有少量样本适应的领域自适应对比编码器,用于稀有视觉概念;(3)具有链式思维查询扩展的双路径跨模态检索;(4)具有时间和多视角支持的多跳视觉推理的检索链;(5)具有后验幻觉验证的证据条件生成。我们还提出了DOMAINVQAR,一个基准套件,从检索精度、推理可信度和领域正确性三个维度评估视觉RAG,超越标准的召回指标。在四个专业领域(医学X光、电路图、卫星图像和组织病理学)进行的实验,结合六种渐进式变体消融实验表明,KIRA在各领域的平均检索精度达到0.97,基础分数为1.0,领域正确性为0.707,而消融实验揭示了每个组件何时有助于以及何时组件引入必须管理的精度多样性权衡的可操作见解。代码将在接受后发布。
cs.CV / 82 / 2604.16925
Rethinking Cross-Dose PET Denoising: Mitigating Averaging Effects via Residual Noise Learning
重新思考跨剂量正电子发射断层成像去噪:通过残差噪声学习减轻平均效应
Abstract
Cross-dose denoising for low-dose positron emission tomography (LDPET) has been proposed to address the limited generalization of models trained at a single noise level. In practice, neural networks trained on a specific dose level often fail to generalize to other dose conditions due to variations in noise magnitude and statistical properties. Conventional "one-size-for-all" models attempt to handle this variability but tend to learn averaged representations across noise levels, resulting in degraded performance. In this work, we analyze this limitation and show that standard training formulations implicitly optimize an expectation over heterogeneous noise distributions. To this end, we propose a unified residual noise learning framework that estimates noise directly from low-dose PET images rather than predicting full-dose images. Experiments on large-scale multi-dose PET datasets from two medical centers demonstrate that the proposed method outperforms the "one-size-for-all" model, individual dose-specific U-Net models, and dose-conditioned approaches, achieving improved denoising performance. These results indicate that residual noise learning effectively mitigates the averaging effect and enhances generalization for cross-dose PET denoising.
Chinese Translation
针对低剂量正电子发射断层成像(LDPET)的跨剂量去噪方法被提出,以解决在单一噪声水平下训练的模型的有限泛化能力。在实际应用中,针对特定剂量水平训练的神经网络往往无法泛化到其他剂量条件,因为噪声幅度和统计特性存在变化。传统的“通用模型”试图处理这种变异性,但往往倾向于学习跨噪声水平的平均表示,导致性能下降。在本研究中,我们分析了这一限制,并表明标准训练公式隐式优化了异质噪声分布的期望。为此,我们提出了一种统一的残差噪声学习框架,该框架直接从低剂量PET图像中估计噪声,而不是预测全剂量图像。对来自两个医疗中心的大规模多剂量PET数据集的实验表明,所提出的方法优于“通用模型”、特定剂量的U-Net模型以及剂量条件方法,取得了更好的去噪性能。这些结果表明,残差噪声学习有效减轻了平均效应,并增强了跨剂量PET去噪的泛化能力。
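The difference between the two training targets can be shown in a few lines. In the sketch below, the residual formulation regresses the noise map (low-dose minus full-dose) instead of the full-dose image, and the denoised estimate is recovered by subtraction. The toy 1D "images", the function names, and the use of the true noise as a stand-in for a perfect predictor are assumptions for illustration only; the abstract does not describe the network or loss in detail.

```python
import math
from typing import List

Image = List[float]


def residual_target(low_dose: Image, full_dose: Image) -> Image:
    """Noise map that a residual model would be trained to predict."""
    return [l - f for l, f in zip(low_dose, full_dose)]


def apply_residual(low_dose: Image, predicted_noise: Image) -> Image:
    """Denoised estimate: subtract the predicted noise from the input."""
    return [l - n for l, n in zip(low_dose, predicted_noise)]


def rms(values: Image) -> float:
    """Root-mean-square magnitude, used here as a simple noise-level summary."""
    return math.sqrt(sum(v * v for v in values) / len(values))


full = [1.0, 2.0, 3.0, 2.0]
low_a = [1.1, 1.9, 3.2, 2.1]  # toy scan at a milder noise level
low_b = [1.5, 1.2, 3.9, 2.6]  # toy scan at a stronger noise level

noise_a = residual_target(low_a, full)
noise_b = residual_target(low_b, full)
restored_a = apply_residual(low_a, noise_a)
restored_b = apply_residual(low_b, noise_b)

print(rms(noise_a), rms(noise_b))  # the residual target scales with the dose level
print(restored_a, restored_b)      # both recover the full-dose values
```

The full-dose target is identical for both inputs, whereas the residual target differs per dose level, which is one way to read the abstract's claim that predicting noise avoids averaging across noise levels.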
cs.CV / 83 / 2604.16930
CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering
CoGR-MoE:基于概念引导的专家路由,具有一致选择和灵活推理的视觉问答
Abstract
Visual Question Answering (VQA) requires models to identify the correct answer options based on both visual and textual evidence. Recent Mixture-of-Experts (MoE) methods improve option reasoning by grouping similar concepts or routing based on examples. However, unstable routing can lead to inconsistent expert selection within the same question type, while overly stable routing may reduce flexibility. To address this, we propose a Concept-Guided Routing framework (CoGR-MoE), which incorporates the semantics of the answer options to guide expert selection in the training phase. Next, option features are used to reweight the selected experts, producing discriminative representations for each candidate option. These option-level representations are further used for option comparison and optimized via contrastive learning. The experimental results indicate that CoGR-MoE delivers strong performance across multiple VQA tasks, demonstrating the effectiveness of our approach.
Chinese Translation
视觉问答(VQA)要求模型根据视觉和文本证据识别正确的答案选项。最近的专家混合(Mixture-of-Experts, MoE)方法通过对相似概念进行分组或基于示例进行路由来改善选项推理。然而,不稳定的路由可能导致在相同问题类型中专家选择的不一致,而过于稳定的路由可能降低灵活性。为了解决这个问题,我们提出了概念引导路由框架(CoGR-MoE),该框架在训练阶段结合答案选项的语义来引导专家选择。接下来,选项特征被用来重新加权所选专家,从而为每个候选选项生成区分性表示。这些选项级表示进一步用于选项比较,并通过对比学习进行优化。实验结果表明,CoGR-MoE在多个VQA任务中表现出色,证明了我们方法的有效性。
cs.CV / 84 / 2604.16936
Adaptive receptive field-based spatial-frequency feature reconstruction network for few-shot fine-grained image classification
基于自适应感受野的空间频率特征重建网络用于少样本细粒度图像分类
Abstract
Feature reconstruction techniques are widely applied for few-shot fine-grained image classification (FSFGIC). Our research indicates that one of the main challenges facing existing feature-based FSFGIC methods is how to choose the size of the receptive field to extract feature descriptors (including spatial and frequency feature descriptors) from different category input images, thereby better performing the FSFGIC tasks. To address this, an adaptive receptive field-based spatial-frequency feature reconstruction network (ARF-SFR-Net) is proposed. The designed ARF-SFR-Net has the capability to adaptively determine receptive field sizes for obtaining spatial and frequency features, and effectively fuse them for reconstruction and FSFGIC tasks. The designed ARF-SFR-Net can be easily embedded into a given episodic training mechanism for end-to-end training from scratch. Extensive experiments on multiple FSFGIC benchmarks demonstrate the effectiveness and superiority of the proposed ARF-SFR-Net over state-of-the-art approaches. The code is available at: https://github.com/ICL-SUST/ARF-SFR-Net.git.
Chinese Translation
特征重建技术广泛应用于少样本细粒度图像分类(FSFGIC)。我们的研究表明,现有基于特征的FSFGIC方法面临的主要挑战之一是如何选择感受野的大小,以从不同类别的输入图像中提取特征描述符(包括空间和频率特征描述符),从而更好地执行FSFGIC任务。为此,提出了一种基于自适应感受野的空间频率特征重建网络(ARF-SFR-Net)。设计的ARF-SFR-Net能够自适应地确定感受野大小,以获取空间和频率特征,并有效融合这些特征以进行重建和FSFGIC任务。设计的ARF-SFR-Net可以轻松嵌入到给定的情节训练机制中,实现从头到尾的端到端训练。在多个FSFGIC基准上的大量实验表明,所提出的ARF-SFR-Net在效果和优越性上均超过了最先进的方法。代码可在以下链接获取:https://github.com/ICL-SUST/ARF-SFR-Net.git。
cs.CV / 85 / 2604.16952
Better with Less: Tackling Heterogeneous Multi-Modal Image Joint Pretraining via Conditioned and Degraded Masked Autoencoder
更少更好:通过条件化和降级的掩码自编码器应对异构多模态图像联合预训练
Abstract
Learning robust representations across extremely heterogeneous modalities remains a fundamental challenge in multi-modal vision. As a critical and profound instantiation of this challenge, high-resolution (HR) joint optical and synthetic aperture radar (SAR) pretraining seeks modality synergy to mutually enhance single-source representations; its potential is severely hindered by the Heterogeneity-Resolution Paradox: finer spatial scales drastically amplify the physical divergence between complex radar geometries and non-homologous optical textures. Consequently, migrating medium-resolution-oriented rigid alignment paradigms to HR scenarios triggers either severe feature suppression to force equivalence, or feature contamination driven by extreme epistemic uncertainty. Both extremes inevitably culminate in profound representation degradation and negative transfer. To overcome this bottleneck, we propose CoDe-MAE, pioneering a "better synergy with less alignment" philosophy. First, Optical-anchored Knowledge Distillation (OKD) implicitly regularizes SAR's speckle noise by mapping it into a pure semantic manifold. Building on this, Conditioned Contrastive Learning (CCL) utilizes a gradient buffering mechanism to align shared consensus while safely preserving divergent physical signatures. Concurrently, Cross-Modal Degraded Reconstruction (CDR) deliberately strips non-homologous spectral pseudo-features, truncating the inherently ill-posed mapping to capture true structural invariants. Extensive analyses validate our theoretical claims. Pretrained on 1M samples, CoDe-MAE demonstrates remarkable data efficiency, successfully preventing representation degradation and establishing new state-of-the-art performance across diverse single- and bi-modal downstream tasks, substantially outperforming foundation models scaled on vastly larger datasets.
Chinese Translation
在极其异构的模态中学习鲁棒的表征仍然是多模态视觉中的一个基本挑战。作为这一挑战的一个关键和深刻的实例,高分辨率(HR)联合光学和合成孔径雷达(SAR)预训练寻求模态协同,以相互增强单一来源的表征;但其潜力受到异构性-分辨率悖论的严重阻碍:更细的空间尺度极大地放大了复杂雷达几何与非同源光学纹理之间的物理差异。因此,将面向中等分辨率的刚性对齐范式迁移到高分辨率场景中,导致要么严重抑制特征以强制等价,要么因极端的认识不确定性而导致特征污染。这两种极端情况不可避免地导致深刻的表征退化和负迁移。为了解决这一瓶颈,我们提出了CoDe-MAE,开创了一种“更少对齐实现更好协同”的理念。首先,光学锚定知识蒸馏(OKD)通过将SAR的斑点噪声映射到纯语义流形中,隐式地对其进行正则化。在此基础上,条件对比学习(CCL)利用梯度缓冲机制对齐共享共识,同时安全地保留不同的物理特征。同时,跨模态降级重建(CDR)故意剥离非同源的光谱伪特征,截断固有的不适定映射以捕捉真实的结构不变性。大量分析验证了我们的理论主张。在1M样本上进行预训练的CoDe-MAE展现出显著的数据效率,成功防止了表征退化,并在各种单模态和双模态下游任务中建立了新的最先进性能,远远超越了在更大数据集上扩展的基础模型。
cs.CV / 86 / 2604.16954
TSM-Pose: Topology-Aware Learning with Semantic Mamba for Category-Level Object Pose Estimation
TSM-Pose:基于拓扑的语义Mamba学习用于类别级物体姿态估计
Abstract
Category-level object pose estimation is fundamental for embodied intelligence, yet achieving robust generalization to unseen instances remains challenging. Existing methods mainly rely on simple feature extraction and aggregation, which struggle to capture category-shared topological structures and to model semantic keypoints, limiting their generalization. To address these limitations, we propose Topology-Aware Learning with Semantic Mamba for Category-Level Pose Estimation (TSM-Pose). Specifically, we introduce a Topology Extractor to capture the global topological representation of the point cloud, which is integrated into local geometry features and enables robust category-level structural representation. Simultaneously, we propose a Mamba-based Global Semantic Aggregator that injects semantic priors into keypoints to enhance their expressiveness and leverages multiple TwinMamba blocks to model long-range dependencies for more effective global feature aggregation. Extensive experiments on three benchmark datasets (REAL275, CAMERA25, and HouseCat6D) demonstrate that TSM-Pose outperforms existing state-of-the-art methods.
Chinese Translation
类别级物体姿态估计是具身智能的基础,但在未见实例上实现稳健的泛化仍然具有挑战性。然而,现有方法主要依赖于简单的特征提取和聚合,难以捕捉类别共享的拓扑结构并进行语义关键点建模,从而限制了它们的泛化能力。为了解决这些问题,我们提出了一种基于拓扑的语义Mamba学习框架(TSM-Pose),用于类别级姿态估计。具体而言,我们引入了一个拓扑提取器,以捕捉点云的全局拓扑表示,该表示与局部几何特征相结合,使得类别级结构表示更加稳健。同时,我们提出了一种基于Mamba的全局语义聚合器,将语义先验注入关键点,以增强其表现力,并利用多个TwinMamba模块建模长距离依赖,从而实现更有效的全局特征聚合。在三个基准数据集(REAL275、CAMERA25和HouseCat6D)上的大量实验表明,TSM-Pose的性能优于现有的最先进方法。
cs.CV / 87 / 2604.16955
Training-inference input alignment outweighs framework choice in longitudinal retinal image prediction
训练-推理输入对齐在纵向视网膜图像预测中的重要性超过框架选择
Abstract
Quantitative prediction of future retinal appearance from longitudinal imaging would support clinical decisions in progressive macular disease that currently rely on qualitative comparison or scalar progression scores. Recent methods have moved toward increasing generative complexity, but whether this complexity is necessary for slowly progressing retinal disease is unclear. We tested this through a controlled comparison of five conditioning configurations sharing one architecture and training dataset, spanning standard conditional diffusion, inference-aligned stochastic training, and deterministic regression. In our evaluation, aligning the training and inference input distributions produced large gains (delta-SSIM +0.082, SSIM +0.086, both p < 0.001), while the choice among aligned frameworks did not significantly affect any primary metric. Task-entropy and posterior-concentration analyses, replicated on two fundus autofluorescence (FAF) platforms, provided a mechanistic account: the predictable component of inter-visit change is small relative to time-invariant acquisition variability, leaving stochastic sampling with little width to exploit. Guided by these findings, we developed TRU (Temporal Retinal U-Net), a deterministic direct-regression model with continuous time-delta conditioning and multi-scale history aggregation. We evaluated TRU on 28,902 eyes across three imaging platforms: a mixed-disease Optos FAF cohort (9,942 eyes), zero-shot transfer to Stargardt macular dystrophy on Optos (288 eyes) and Heidelberg Spectralis (125 eyes), and a boundary evaluation on Cirrus en-face fundus images from a glaucoma cohort (18,547 eyes). TRU matched or exceeded delta-SSIM, SSIM, and PSNR in every FAF cohort against three state-of-the-art benchmarks, and its advantage grew monotonically with available history length.
Chinese Translation
从纵向成像中定量预测未来视网膜外观将支持在进行性黄斑病中依赖于定性比较或标量进展评分的临床决策。近期的方法趋向于增加生成复杂性,但这种复杂性对于缓慢进展的视网膜疾病是否必要尚不清楚。我们通过对五种共享同一架构和训练数据集的条件配置进行控制比较来测试这一点,涵盖标准条件扩散、推理对齐的随机训练和确定性回归。在我们的评估中,训练和推理输入分布的对齐产生了显著的增益(delta-SSIM +0.082,SSIM +0.086,均 p < 0.001),而在对齐框架之间的选择对任何主要指标没有显著影响。任务熵和后验集中度分析在两个眼底自发荧光(FAF)平台上重复进行了验证,提供了机制解释:访次间变化的可预测成分相对于时间不变的采集变异性较小,留下了随机采样可利用的宽度很小。根据这些发现,我们开发了 TRU(Temporal Retinal U-Net),这是一种具有连续时间增量条件和多尺度历史聚合的确定性直接回归模型。我们在三个成像平台上对 28,902 只眼睛评估了 TRU:一个混合疾病的 Optos FAF 队列(9,942 只眼睛)、在 Optos 上零样本迁移至 Stargardt 黄斑营养不良(288 只眼睛)和 Heidelberg Spectralis(125 只眼睛),以及对来自青光眼队列的 Cirrus en-face 眼底图像的边界评估(18,547 只眼睛)。在每个 FAF 队列中,TRU 在 delta-SSIM、SSIM 和 PSNR 上与三种最先进的基准相匹配或超越,并且其优势随着可用历史长度的增加而单调增长。
cs.CV / 88 / 2604.16958
Self-Reasoning Agentic Framework for Narrative Product Grid-Collage Generation
自我推理代理框架用于叙事产品网格拼贴生成
Abstract
Narrative-driven product photography has become a prevalent paradigm in modern marketing, as coherent visual storytelling helps convey product value and establishes emotional engagement with consumers. However, existing image generation methods do not support structured narrative planning or cross-panel coordination, often resulting in weak storytelling and visual incoherence. In practice, narrative product photography is commonly presented as multi-grid collages, where multiple views or scenes jointly communicate a product narrative. To ensure visual consistency across grids and aesthetic harmony of the overall composition, we generate the collage as a single unified image rather than composing independently synthesized panels. We propose a self-reasoning agentic framework for narrative product grid collage generation. Given a product packshot and its name, the system first constructs a Product Narrative Framework that explicitly represents the product's identity, usage context, and situational environment, and translates it into complementary grids governed by a shared visual style. Constraint-aware prompts are then compiled and fed to a generation model that synthesizes the collage jointly. The generated output is evaluated on both content validity and photography quality, with explicit gates determining whether to proceed or refine. When evaluation fails, the system performs failure attribution and applies targeted refinement, enabling progressive improvement through iterative self-reflection. Experiments demonstrate that our framework consistently improves aesthetic quality, narrative richness, and visual coherence, compared to direct prompting baselines.
Chinese Translation
以叙事驱动的产品摄影已成为现代营销中的一种普遍范式,因为连贯的视觉叙事有助于传达产品价值并与消费者建立情感联系。然而,现有的图像生成方法不支持结构化的叙事规划或跨面板协调,常常导致叙事薄弱和视觉不连贯。在实际应用中,叙事产品摄影通常以多网格拼贴的形式呈现,其中多个视角或场景共同传达产品叙事。为了确保网格之间的视觉一致性和整体构图的美学和谐,我们将拼贴生成作为一个单一的统一图像,而不是独立合成面板。我们提出了一种自我推理代理框架,用于叙事产品网格拼贴生成。给定一个产品的特写图和其名称,系统首先构建一个产品叙事框架,该框架明确表示产品的身份、使用背景和情境环境,并将其转化为由共享视觉风格主导的互补网格。然后,编制约束感知提示并将其输入到生成模型中,以共同合成拼贴。生成的输出在内容有效性和摄影质量上进行评估,并通过明确的门控决定是否继续或进行优化。当评估失败时,系统执行失败归因并应用有针对性的优化,从而通过迭代自我反思实现渐进式改进。实验表明,与直接提示基线相比,我们的框架在美学质量、叙事丰富性和视觉连贯性方面始终有所提升。
cs.CV / 89 / 2604.16969
Hyperspectral Unmixing Hierarchies
高光谱解混合层次结构
Abstract
Unmixing reveals the spatial distribution and spectral details of different constituents, called endmembers, in a hyperspectral image. Because unmixing has limited ground truth requirements, can accommodate mixed pixels, and is closely tied to light propagation, it is a uniquely powerful tool for analyzing hyperspectral images. However, spectral variability inhibits unmixing performance, the proper way to determine the number of endmembers is ambiguous, and the clarity of the endmembers degrades as more are included. Hierarchical structure is a possible solution to all three problems. Here, hierarchical unmixing is defined by imposing a hierarchical abundance sum constraint on Deep Nonnegative Matrix Factorization. Binary Linear Unmixing Tactile Hierarchies (BLUTHs) solve the hierarchical unmixing problem with a simple network architecture. Sparsity modulation unmixing growth tailors the topology of a BLUTH to each scene. The structure imposed by BLUTHs allows endmembers with varying levels of spectral contrast to be revealed, mitigating the challenge of spectral variability. The performance of BLUTHs exceeds state-of-the-art unmixing algorithms on laboratory scenes, particularly with regard to abundance estimation, while their performance remains competitive on remote sensing scenes. In addition, ocean color unmixing by BLUTHs is demonstrated on hyperspectral scenes from the HYPSO and PACE satellites.
Chinese Translation
解混合揭示了高光谱图像中不同成分(称为端元)的空间分布和光谱细节。由于解混合对地面真实数据的需求有限,可以处理混合像素,并且与光传播密切相关,因此它成为分析高光谱图像的独特强大工具。然而,光谱变异性抑制了解混合性能,确定端元数量的正确方法模糊不清,并且随着端元数量的增加,端元的清晰度会降低。层次结构可能是解决这三个问题的方案。在这里,层次解混合通过对深度非负矩阵分解施加层次丰度和约束来定义。二元线性解混合触觉层次结构(Binary Linear Unmixing Tactile Hierarchies, BLUTHs)通过简单的网络架构解决层次解混合问题。稀疏调制解混合增长根据每个场景定制BLUTH的拓扑结构。BLUTH施加的结构允许揭示具有不同光谱对比度水平的端元,从而减轻光谱变异性带来的挑战。BLUTH在实验室场景中的性能超过了最先进的解混合算法,特别是在丰度估计方面,而在遥感场景中的性能仍然具有竞争力。此外,BLUTH在HYPSO和PACE卫星的高光谱场景上展示了海洋颜色解混合。
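The hierarchical abundance sum constraint described in the Hyperspectral Unmixing Hierarchies entry above can be illustrated with a small sketch. It assumes a binary tree in which every internal node splits its abundance between two children using fractions that sum to one, so the children of a node always add up to the parent. The tree encoding, the `leaf_abundances` helper, and the example scene are illustrative assumptions, not the BLUTH implementation.

```python
# Hypothetical illustration of a hierarchical abundance sum constraint
# on a binary tree, for a single pixel.

def leaf_abundances(tree, parent_abundance=1.0):
    """Return {endmember_name: abundance} for a nested binary tree.

    A leaf is a string (endmember name). An internal node is a tuple
    (left_fraction, left_subtree, right_subtree) with 0 <= left_fraction <= 1.
    """
    if isinstance(tree, str):
        return {tree: parent_abundance}
    left_fraction, left, right = tree
    if not 0.0 <= left_fraction <= 1.0:
        raise ValueError("split fraction must lie in [0, 1]")
    out = {}
    # The two children share the parent's abundance, so they sum to it.
    out.update(leaf_abundances(left, parent_abundance * left_fraction))
    out.update(leaf_abundances(right, parent_abundance * (1.0 - left_fraction)))
    return out

# Assumed example: vegetation vs. non-vegetation, then one finer split each.
pixel_tree = (0.6, (0.25, "grass", "trees"), (0.5, "soil", "water"))
abundances = leaf_abundances(pixel_tree)
```

Because each split is a convex combination, the leaf abundances are non-negative and sum to one at every depth of the tree, which is the property a hierarchy of this kind is meant to preserve.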
cs.CV / 90 / 2604.16976
UGD: An Unsupervised Geometric Distance for Evaluating Real-world Noisy Point Cloud Denoising
UGD:一种用于评估现实世界噪声点云去噪的无监督几何距离
Abstract
Point cloud denoising is a fundamental and crucial challenge in real-world point cloud applications. Existing quantitative evaluation metrics for point cloud denoising methods are implemented in a supervised manner, which requires both the denoised point cloud and the corresponding ground-truth clean point cloud to compute a representative geometric distance. This requirement is highly problematic in real-world scenarios, where ground-truth clean point clouds are often unavailable. In this paper, we propose a simple yet effective unsupervised geometric distance (UGD) for real-world noisy point cloud denoising, calculated solely from noisy point clouds. The core idea of UGD is to learn a patch-wise prior model from a set of clean point clouds and then employ this prior model as the ground-truth to quantify the degradation by measuring the geometric variations of the denoised point cloud. To this end, we first learn a pristine Gaussian Mixture Model (GMM) with extracted patch-wise quality-aware features from a set of pristine clean point clouds by a patch-wise feature extraction network, which serves as the ground-truth for the quantitative evaluation. Then, the UGD is defined as the weighted sum of distances between each patch of the denoised point cloud and the learned pristine GMM model in the patch space. To train the employed patch-wise feature extraction network, we propose a self-supervised training framework through multi-task learning, which includes pair-wise quality ranking, distortion classification, and distortion distribution prediction. Quantitative experiments with synthetic noise confirm that the proposed UGD achieves comparable performance to supervised full-reference metrics. Moreover, experimental results on real-world data demonstrate that the proposed UGD enables unsupervised evaluation of point cloud denoising methods based exclusively on noisy point clouds.
Chinese Translation
点云去噪是现实世界点云应用中的一个基本且关键的挑战。现有的点云去噪方法的定量评估指标通常以监督方式实施,这要求同时提供去噪后的点云和相应的真实干净点云,以计算代表性的几何距离。这一要求在现实场景中存在很大问题,因为真实干净点云往往不可用。在本文中,我们提出了一种简单而有效的无监督几何距离(UGD),用于现实世界噪声点云去噪,该距离仅基于噪声点云进行计算。UGD的核心思想是从一组干净点云中学习一个基于补丁的先验模型,然后将该先验模型作为真实值,通过测量去噪点云的几何变化来量化退化。为此,我们首先通过补丁特征提取网络,从一组干净点云中提取补丁级的质量感知特征,学习一个原始的高斯混合模型(GMM),该模型作为定量评估的真实值。然后,UGD被定义为去噪点云的每个补丁与学习到的原始GMM模型在补丁空间中的距离的加权和。为了训练所使用的补丁特征提取网络,我们提出了一种通过多任务学习的自监督训练框架,包括成对质量排序、失真分类和失真分布预测。针对合成噪声的定量实验表明,所提出的UGD在性能上与监督全参考指标相当。此外,针对现实世界数据的实验结果表明,所提出的UGD能够仅基于噪声点云实现点云去噪方法的无监督评估。
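As a rough sketch of the scoring step in the UGD entry above: given patch-wise features and a prior Gaussian Mixture Model, the score is a weighted sum of per-patch distances to the prior. The sketch assumes a diagonal-covariance GMM whose parameters are already known, uses negative log-likelihood as the distance, and normalizes by the total weight; all three choices, the function names, and the toy features are assumptions rather than the paper's definition.

```python
import math

def gmm_neg_log_likelihood(x, weights, means, variances):
    """Negative log-likelihood of feature vector x under a diagonal GMM."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(log_p)
    peak = max(log_terms)  # log-sum-exp trick for numerical stability
    return -(peak + math.log(sum(math.exp(t - peak) for t in log_terms)))

def ugd_like_score(patch_features, patch_weights, gmm):
    """Weight-normalized sum of per-patch distances to the prior (lower is closer)."""
    total_weight = sum(patch_weights)
    return sum(
        w * gmm_neg_log_likelihood(f, *gmm)
        for f, w in zip(patch_features, patch_weights)
    ) / total_weight

# Toy prior: (mixture weights, component means, per-dimension variances).
gmm = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[0.1, 0.1], [0.1, 0.1]])
clean_like = [[0.0, 0.1], [1.0, 0.9]]    # patches near the prior modes
noisy_like = [[0.5, -0.8], [2.0, 0.2]]   # patches far from the prior modes
clean_score = ugd_like_score(clean_like, [1.0, 1.0], gmm)
noisy_score = ugd_like_score(noisy_like, [1.0, 1.0], gmm)
```

In this toy setup the patches that sit near the prior modes receive the lower score, which mirrors the idea of using a learned prior in place of a ground-truth point cloud.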
cs.CV / 91 / 2604.16979
DOSE: Data Selection for Multi-Modal LLMs via Off-the-Shelf Models
DOSE:通过现成模型进行多模态大语言模型的数据选择
Abstract
High-quality and diverse multimodal data are essential for improving vision-language models (VLMs), yet existing datasets often contain noisy, redundant, and poorly aligned samples. To address these problems, data filtering is commonly used to enhance the efficiency and performance of multimodal learning, but it introduces extra computational cost because filtering models are usually trained on the same data they are meant to screen. To reduce this cost, we study DOSE, which explores whether off-the-shelf pretrained models that have never seen the target data can be used to select training samples for larger and stronger multimodal models without any task-specific training. Even without fine-tuning, these models can effectively assess text quality and image-text alignment to guide data selection. Based on this, we build a joint quality-alignment distribution and apply adaptive weighted sampling to select informative samples while maintaining long-tail diversity. This approach enhances data diversity, enabling models trained on DOSE-filtered data to match or surpass those trained on the full dataset on standard VQA and math benchmarks. Extensive experiments demonstrate its effectiveness, efficiency, and scalability.
Chinese Translation
高质量和多样化的多模态数据对于提升视觉语言模型(VLMs)至关重要,但现有数据集往往包含噪声、冗余和对齐不良的样本。为了解决这些问题,数据过滤通常被用来提高多模态学习的效率和性能,但这会引入额外的计算成本,因为过滤模型通常是在其所要筛选的数据上训练的。为了降低这一成本,我们研究了DOSE,探讨了是否可以利用从未见过目标数据的现成预训练模型来选择用于更大更强的多模态模型的训练样本,而无需任何特定任务的训练。即使没有微调,这些模型也能有效评估文本质量和图像-文本对齐,以指导数据选择。基于此,我们构建了联合质量-对齐分布,并应用自适应加权采样来选择信息丰富的样本,同时保持长尾多样性。这种方法增强了数据的多样性,使得在DOSE过滤数据上训练的模型在标准视觉问答(VQA)和数学基准测试中能够匹敌或超越在完整数据集上训练的模型。大量实验表明了其有效性、效率和可扩展性。
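The selection step in the DOSE entry above (a joint quality-alignment distribution with adaptive weighted sampling) might look roughly like the following sketch. It assumes each sample already has a quality score and an alignment score in [0, 1] from off-the-shelf models, bins the pairs on a coarse grid, and down-weights crowded cells so that rare regions remain represented. The weighting formula, `density_power`, and the grid size are illustrative assumptions; only the Efraimidis-Spirakis keying is a standard technique for weighted sampling without replacement.

```python
import random
from collections import Counter

def select_samples(scores, k, bins=4, density_power=0.5, seed=0):
    """scores: list of (quality, alignment) pairs in [0, 1]; returns k indices."""
    rng = random.Random(seed)

    def cell(q, a):
        # Grid cell of the joint quality-alignment histogram.
        return (min(int(q * bins), bins - 1), min(int(a * bins), bins - 1))

    density = Counter(cell(q, a) for q, a in scores)
    keys = []
    for i, (q, a) in enumerate(scores):
        # Favor a high joint score, but down-weight crowded cells (assumed rule).
        weight = (q * a + 1e-6) / density[cell(q, a)] ** density_power
        # Efraimidis-Spirakis key: the k largest keys form a weighted sample
        # drawn without replacement.
        keys.append((rng.random() ** (1.0 / weight), i))
    return sorted(i for _, i in sorted(keys, reverse=True)[:k])

scores = [(0.9, 0.8), (0.85, 0.9), (0.9, 0.85), (0.2, 0.3), (0.6, 0.1), (0.4, 0.95)]
selected = select_samples(scores, k=3)
```

Raising `density_power` pushes the sampler further toward sparse cells, which is one simple way to trade average score against long-tail coverage.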
cs.CV / 92 / 2604.16984
Adverse-to-the-eXtreme Panoptic Segmentation: URVIS 2026 Study and Benchmark
极端恶劣环境下的全景分割:URVIS 2026 研究与基准
Abstract
This paper presents the report of the URVIS 2026 challenge on adverse-to-extreme panoptic segmentation. As the first challenge of its kind, it attracted 17 registered participants and 47 submissions, with 4 teams reaching the final phase. The challenge is based on the MUSES dataset, a multi-sensor benchmark for panoptic segmentation in adverse-to-extreme weather, including RGB frame camera, LiDAR, radar, and event camera data. Weighted Panoptic Quality (wPQ) is designed and adopted as the official ranking metric for fair evaluation across weather conditions. In this report, we summarise the challenge setting and benchmark results, analyse the performance of the submitted methods, and discuss current progress and remaining challenges for robust multimodal panoptic segmentation. Link: https://urvis-workshop.github.io/challenge-Muses.html
Chinese Translation
本文呈现了URVIS 2026挑战赛关于极端恶劣环境下全景分割的报告。作为首个此类挑战,它吸引了17名注册参与者和47个提交,其中4个团队进入了最终阶段。该挑战基于MUSES数据集,这是一个针对极端恶劣天气条件下全景分割的多传感器基准,包括RGB帧摄像头、激光雷达、雷达和事件摄像头数据。加权全景质量(Weighted Panoptic Quality, wPQ)被设计并采用为官方排名指标,以便在不同天气条件下进行公平评估。在本报告中,我们总结了挑战设置和基准结果,分析了提交方法的性能,并讨论了稳健的多模态全景分割的当前进展和剩余挑战。链接:https://urvis-workshop.github.io/challenge-Muses.html
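The URVIS 2026 abstract above names Weighted Panoptic Quality (wPQ) without defining it. The sketch below shows standard Panoptic Quality, which is the sum of matched IoUs divided by |TP| + 0.5|FP| + 0.5|FN|, and then, purely as an assumption, a weighted average of per-condition PQ values. The `weighted_pq` form, the condition names, and the weights are hypothetical; the challenge's official weighting may differ.

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """Standard PQ for one class or split.

    tp_ious: IoU of each matched (predicted, ground-truth) segment pair;
    in the standard definition a pair counts as matched when IoU > 0.5.
    """
    denom = len(tp_ious) + 0.5 * num_fp + 0.5 * num_fn
    return sum(tp_ious) / denom if denom > 0 else 0.0

def weighted_pq(per_condition, weights):
    """Assumed form: weighted average of PQ over weather conditions."""
    total = sum(weights[c] for c in per_condition)
    return sum(
        weights[c] * panoptic_quality(*per_condition[c]) for c in per_condition
    ) / total

# Illustrative numbers only: (matched IoUs, false positives, false negatives).
per_condition = {"clear": ([0.9, 0.8], 0, 0), "fog": ([0.6], 1, 1)}
weights = {"clear": 1.0, "fog": 3.0}
score = weighted_pq(per_condition, weights)
```

With these toy inputs the clear-weather PQ is 0.85 and the fog PQ is 0.3, so weighting fog more heavily pulls the combined score toward the harder condition.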
cs.CV / 93 / 2604.16987
DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection
DVAR:用于视频真实性检测的对抗性多智能体辩论
Abstract
The rapid evolution of video generation technologies poses a significant challenge to media forensics, as conventional detection methods often fail to generalize beyond their training distributions. To address this, we propose DVAR (Debate-based Video Authenticity Reasoning), a training-free framework that reformulates video detection as a structured multi-agent forensic reasoning process. Moving beyond the paradigm of pattern matching, DVAR orchestrates a competition between a Generative Hypothesis Agent and a Natural Mechanism Agent. Through iterative rounds of cross-examination, these agents defend their respective explanations against abnormal evidence, driving a logical convergence where the truth emerges from rigorous stress-testing. To adjudicate these conflicting claims, we apply Occam's Razor through the Minimum Description Length (MDL) framework, defining an Explanatory Cost to quantify the "logical burden" of each reasoning path. Furthermore, we integrate GenVideoKB, a dynamic knowledge repository that provides high-level reasoning heuristics on generative boundaries and failure modes. Extensive experiments demonstrate that DVAR achieves competitive performance against supervised state-of-the-art methods while exhibiting superior generalization to unseen generative architectures. By transforming detection into a transparent debate, DVAR provides explicit, interpretable reasoning traces for robust video authenticity assessment.
Chinese Translation
视频生成技术的快速发展对媒体取证提出了重大挑战,因为传统的检测方法往往无法超越其训练分布。为了解决这一问题,我们提出了DVAR(基于辩论的视频真实性推理),这是一个无训练的框架,将视频检测重新构建为一个结构化的多智能体取证推理过程。DVAR超越了模式匹配的范式,组织了一个生成假设智能体(Generative Hypothesis Agent)与自然机制智能体(Natural Mechanism Agent)之间的竞争。通过迭代的交叉审查,这些智能体对异常证据进行辩护,推动逻辑收敛,使真相在严格的压力测试中浮现。为了裁决这些相互冲突的主张,我们通过最小描述长度(Minimum Description Length, MDL)框架应用奥卡姆剃刀,定义了解释成本(Explanatory Cost),以量化每条推理路径的“逻辑负担”。此外,我们整合了GenVideoKB,一个动态知识库,提供关于生成边界和失败模式的高级推理启发式。大量实验表明,DVAR在与监督的最先进方法的竞争性能方面表现出色,同时在未见过的生成架构上展现出更强的泛化能力。通过将检测转化为透明的辩论,DVAR为稳健的视频真实性评估提供了明确、可解释的推理轨迹。
cs.CV / 94 / 2604.17001
Inductive Convolution Nuclear Norm Minimization for Tensor Completion with Arbitrary Sampling
用于任意采样的张量补全的归纳卷积核范数最小化
Abstract
The recently established Convolution Nuclear Norm Minimization (CNNM) addresses the problem of tensor completion with arbitrary sampling (TCAS), which involves restoring a tensor from a subset of its entries sampled in an arbitrary manner. Despite its promising performance, the optimization procedure of CNNM requires performing Singular Value Decomposition (SVD) multiple times, which is computationally expensive and hard to parallelize. To address this issue, we reformulate the optimization objective of CNNM from the perspective of convolution eigenvectors. By introducing pre-learned convolution eigenvectors that are shared among different tensors, we propose a novel method called Inductive Convolution Nuclear Norm Minimization (ICNNM), which bypasses the SVD step and thereby significantly decreases the computational time. In addition, owing to the extra prior knowledge encoded in the pre-learned convolution eigenvectors, ICNNM also outperforms CNNM in terms of recovery performance. Extensive experiments on video completion, prediction, and frame interpolation verify the superiority of ICNNM over CNNM and several other competing methods.
Chinese Translation
最近建立的卷积核范数最小化(CNNM)解决了任意采样的张量补全(TCAS)问题,该问题涉及从以任意方式采样的条目子集中恢复张量。尽管其性能令人期待,但CNNM的优化过程需要多次执行奇异值分解(SVD),这在计算上是昂贵的且难以并行化。为了解决这个问题,我们从卷积特征向量的角度重新表述了CNNM的优化目标。通过引入在不同张量之间共享的预学习卷积特征向量,我们提出了一种新方法,称为归纳卷积核范数最小化(ICNNM),该方法绕过了SVD步骤,从而显著减少了计算时间。此外,由于预学习卷积特征向量中编码的额外先验知识,ICNNM在恢复性能方面也优于CNNM。在视频补全、预测和帧插值的广泛实验中,验证了ICNNM相对于CNNM和其他几种竞争方法的优越性。
cs.CV / 95 / 2604.17005
TeMuDance: Contrastive Alignment-Based Textual Control for Music-Driven Dance Generation
TeMuDance:基于对比对齐的文本控制用于音乐驱动的舞蹈生成
Abstract
Existing music-driven dance generation approaches have achieved strong realism and effective audio-motion alignment. However, they generally lack semantic controllability, making it difficult to guide specific movements through natural language descriptions. This limitation primarily stems from the absence of large-scale datasets that jointly align music, text, and motion for supervised learning of text-conditioned control. To address this challenge, we propose TeMuDance, a framework that enables text-based control for music-conditioned dance generation without requiring any manually annotated music-text-motion triplet dataset. TeMuDance introduces a motion-centred bridging paradigm that leverages motion as a shared semantic anchor to align disjoint music-dance and text-motion datasets within a unified embedding space, enabling cross-modal retrieval of missing modalities for end-to-end training. A lightweight text control branch is then trained on top of a frozen music-to-dance diffusion backbone, preserving rhythmic fidelity while enabling fine-grained semantic guidance. To further suppress noise inherent in the retrieved supervision, we design a dual-stream fine-tuning strategy with confidence-based filtering. We also propose a novel task-aligned metric that quantifies whether textual prompts induce the intended kinematic attributes under music conditioning. Extensive experiments demonstrate that TeMuDance achieves competitive dance quality while substantially improving text-conditioned control over existing methods.
Chinese Translation
现有的音乐驱动舞蹈生成方法在现实感和音频-运动对齐方面取得了显著成果。然而,它们通常缺乏语义可控性,使得通过自然语言描述引导特定动作变得困难。这一限制主要源于缺乏大规模的数据集,无法共同对齐音乐、文本和运动,以进行文本条件控制的监督学习。为了解决这一挑战,我们提出了TeMuDance,一个框架,能够实现基于文本的音乐条件舞蹈生成,而无需任何手动标注的音乐-文本-运动三元组数据集。TeMuDance引入了一种以运动为中心的桥接范式,利用运动作为共享语义锚点,在统一的嵌入空间中对齐不相交的音乐-舞蹈和文本-运动数据集,从而实现缺失模态的跨模态检索以进行端到端训练。然后,在冻结的音乐到舞蹈扩散主干上训练一个轻量级的文本控制分支,保持节奏的保真性,同时实现细粒度的语义指导。为了进一步抑制检索到的监督中固有的噪声,我们设计了一种基于置信度过滤的双流微调策略。我们还提出了一种新颖的任务对齐度量,量化文本提示在音乐条件下是否引发预期的运动属性。大量实验表明,TeMuDance在舞蹈质量上具有竞争力,同时显著改善了现有方法的文本条件控制。
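The motion-centred bridging idea in the TeMuDance entry above can be sketched as a retrieval step: a dance clip from the music-dance dataset is embedded as motion, the nearest motion in the text-motion dataset is found, and its text is borrowed as pseudo-supervision when the match is confident enough. The sketch assumes the embeddings already live in a shared space; the bank contents, the `min_confidence` threshold, and the function names are hypothetical, and the paper's confidence-based filtering may be defined differently.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_text(dance_motion, text_motion_bank, min_confidence=0.8):
    """Return (text, similarity) for the nearest bank motion, or None.

    text_motion_bank: list of (motion_embedding, text) pairs.
    Matches below min_confidence are discarded as too noisy.
    """
    best_text, best_sim = None, -1.0
    for motion_emb, text in text_motion_bank:
        sim = cosine(dance_motion, motion_emb)
        if sim > best_sim:
            best_text, best_sim = text, sim
    return (best_text, best_sim) if best_sim >= min_confidence else None

# Toy two-dimensional embeddings, for illustration only.
bank = [([1.0, 0.0], "spin clockwise"), ([0.0, 1.0], "jump in place")]
confident = retrieve_text([0.9, 0.1], bank)
ambiguous = retrieve_text([0.7, 0.7], bank)
```

Here the first query retrieves "spin clockwise", while the second falls between both bank entries and is filtered out, so it would contribute no text supervision.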
cs.CV / 96 / 2604.17007
MobileAgeNet: Lightweight Facial Age Estimation for Mobile Deployment
MobileAgeNet:面向移动部署的轻量级面部年龄估计
Abstract
Mobile deployment of facial age estimation requires models that balance predictive accuracy with low latency and compact size. In this work, we present MobileAgeNet, a lightweight age-regression framework that achieves an MAE of 4.65 years on the UTKFace held-out test set while maintaining efficient on-device inference with an average latency of 14.4 ms measured using the AI Benchmark application. The model is built on a pretrained MobileNetV3-Large backbone combined with a compact regression head, enabling real-time prediction on mobile devices. The training and evaluation pipeline is integrated into the NN LEMUR Dataset framework, supporting reproducible experimentation, structured hyperparameter optimization, and consistent evaluation. We employ bounded age regression together with a two-stage fine-tuning strategy to improve training stability and generalization. Experimental results show that MobileAgeNet achieves competitive accuracy with 3.23M parameters, and that the deployment pipeline (PyTorch training, then ONNX export, then TensorFlow Lite conversion) preserves predictive behavior without measurable degradation under practical on-device conditions. Overall, this work provides a practical, deployment-ready baseline for mobile-oriented facial age estimation.
Chinese Translation
面部年龄估计的移动部署需要在预测准确性、低延迟和紧凑尺寸之间取得平衡。在本研究中,我们提出了MobileAgeNet,这是一种轻量级年龄回归框架,在UTKFace保留测试集上实现了4.65年的平均绝对误差(MAE),同时在设备上保持高效推理,使用AI Benchmark应用测得的平均延迟为14.4毫秒。该模型基于预训练的MobileNetV3-Large主干网络,并结合紧凑的回归头,能够在移动设备上实现实时预测。训练和评估流程集成在NN LEMUR数据集框架中,支持可重复的实验、结构化的超参数优化和一致的评估。我们采用有界年龄回归以及两阶段微调策略来提高训练的稳定性和泛化能力。实验结果表明,MobileAgeNet以3.23M的参数量实现了竞争力的准确性,并且从PyTorch训练到ONNX导出再到TensorFlow Lite转换的部署流程在实际设备条件下保持了预测行为,没有可测量的降级。总体而言,本研究为面向移动的面部年龄估计提供了一个实用的、准备部署的基准。
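One common way to realize the bounded age regression mentioned in the MobileAgeNet entry above is to squash the unbounded network output with a sigmoid and rescale it to an age interval. The sketch below assumes that formulation; the bounds of 0 to 100 years and the helper names are illustrative assumptions and are not taken from the paper. The mean absolute error (MAE) helper matches the metric reported in the abstract.

```python
import math

MIN_AGE, MAX_AGE = 0.0, 100.0  # assumed bounds, not taken from the paper

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def bounded_age(raw_output):
    """Map an unbounded regression output into [MIN_AGE, MAX_AGE]."""
    return MIN_AGE + (MAX_AGE - MIN_AGE) * sigmoid(raw_output)

def mean_absolute_error(predicted, actual):
    """MAE in years over paired predictions and labels."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

predictions = [bounded_age(r) for r in (-1.0, 0.0, 1.5)]
mae = mean_absolute_error([50.0, 30.0], [48.0, 35.0])
```

Bounding the output keeps predictions inside a plausible range no matter how large the raw activation becomes, which is one plausible reason such a head can stabilize training.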
cs.CV / 97 / 2604.17013
Towards Universal Skeleton-Based Action Recognition
面向通用骨架基础的动作识别
Abstract
With the development of robotics, skeleton-based action recognition has become increasingly important, as human-robot interaction requires understanding the actions of humans and humanoid robots. Due to different sources of human skeletons and structures of humanoid robots, skeleton data naturally exhibit heterogeneity. However, previous works overlook the data heterogeneity of skeletons and solely construct models using homogeneous skeletons. Moreover, open-vocabulary action recognition is also essential for real-world applications. To this end, this work studies the challenging problem of heterogeneous skeleton-based action recognition with open vocabularies. We construct a large-scale Heterogeneous Open-Vocabulary (HOV) Skeleton dataset by integrating and refining multiple representative large-scale skeleton-based action datasets. To address universal skeleton-based action recognition, we propose a Transformer-based model that comprises three key components: unified skeleton representation, motion encoder for skeletons, and multi-grained motion-text alignment. The motion encoder feeds multi-modal skeleton embeddings into a two-stream Transformer-based encoder to learn spatio-temporal action representations, which are then mapped to a semantic space to align with text embeddings. Multi-grained motion-text alignment incorporates contrastive learning at three levels: global instance alignment, stream-specific alignment, and fine-grained alignment. Extensive experiments on popular benchmarks with heterogeneous skeleton data demonstrate both the effectiveness and the generalization ability of the proposed method. Code is available at https://github.com/jidongkuang/Universal-Skeleton.
Chinese Translation
随着机器人技术的发展,基于骨架的动作识别变得越来越重要,因为人机交互需要理解人类和类人机器人所执行的动作。由于人类骨架和类人机器人的结构来源不同,骨架数据自然表现出异质性。然而,以往的研究忽视了骨架数据的异质性,仅使用同质骨架构建模型。此外,开放词汇的动作识别对于实际应用也至关重要。为此,本文研究了具有开放词汇的异质骨架基础动作识别这一具有挑战性的问题。我们通过整合和精炼多个具有代表性的规模较大的基于骨架的动作数据集,构建了一个大规模的异质开放词汇(Heterogeneous Open-Vocabulary, HOV)骨架数据集。为了解决通用骨架基础的动作识别问题,我们提出了一种基于Transformer的模型,该模型包含三个关键组件:统一的骨架表示、用于骨架的运动编码器和多粒度的运动-文本对齐。运动编码器将多模态骨架嵌入输入到一个双流的基于Transformer的编码器中,以学习时空动作表示,然后将其映射到语义空间以与文本嵌入对齐。多粒度的运动-文本对齐在三个层面上结合了对比学习:全局实例对齐、流特定对齐和细粒度对齐。在具有异质骨架数据的流行基准上进行的大量实验表明了所提方法的有效性和泛化能力。代码可在 https://github.com/jidongkuang/Universal-Skeleton 获取。
cs.CV / 98 / 2604.17021
LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing
LIVE:利用图像处理先验进行基于指令的视频编辑
Abstract
Video editing aims to modify input videos according to user intent. Recently, end-to-end training methods have garnered widespread attention, constructing paired video editing data through video generation or editing models. However, compared to image editing, the high annotation costs of video data severely constrain the scale, quality, and task diversity of video editing datasets when relying on video generative models or manual annotation. To bridge this gap, we propose LIVE, a joint training framework that leverages large-scale, high-quality image editing data alongside video datasets to bolster editing capabilities. To mitigate the domain discrepancy between static images and dynamic videos, we introduce a frame-wise token noise strategy, which treats the latents of specific frames as reasoning tokens, leveraging large pretrained video generative models to create plausible temporal transformations. Moreover, through cleaning public datasets and constructing an automated data pipeline, we adopt a two-stage training strategy to anneal video editing capabilities. Furthermore, we curate a comprehensive evaluation benchmark encompassing over 60 challenging tasks that are prevalent in image editing but scarce in existing video datasets. Extensive comparative and ablation experiments demonstrate that our method achieves state-of-the-art performance. The source code will be publicly available.
Chinese Translation
视频编辑旨在根据用户意图修改输入视频。近年来,端到端训练方法引起了广泛关注,通过视频生成或编辑模型构建配对的视频编辑数据。然而,与图像编辑相比,视频数据的高标注成本严重限制了视频编辑数据集的规模、质量和任务多样性,尤其是在依赖视频生成模型或手动标注时。为了解决这一问题,我们提出了LIVE,一个联合训练框架,利用大规模、高质量的图像编辑数据与视频数据集相结合,以增强编辑能力。为了减轻静态图像与动态视频之间的领域差异,我们引入了一种逐帧令牌噪声策略,将特定帧的潜在表示视为推理令牌,利用大型预训练视频生成模型创建合理的时间变换。此外,通过清理公共数据集和构建自动化数据管道,我们采用了两阶段训练策略来逐步提升视频编辑能力。此外,我们整理了一个全面的评估基准,涵盖60多个在图像编辑中常见但在现有视频数据集中稀缺的挑战任务。大量的比较和消融实验表明,我们的方法达到了最先进的性能。源代码将公开发布。
cs.CV / 99 / 2604.17024
CAM3DNet: Comprehensively mining the multi-scale features for 3D Object Detection with Multi-View Cameras
CAM3DNet:全面挖掘多视角相机下的多尺度特征以进行3D物体检测
Abstract
Query-based 3D object detection methods using multi-view images often struggle to leverage dynamic multi-scale information efficiently: the relationship between object features and query geometry is not sufficiently learned, and directly exploring multi-scale spatiotemporal features is too costly. To address these challenges, we propose CAM3DNet, a novel sparse query-based framework that combines three new modules: composite query (CQ), adaptive self-attention (ASA), and multi-scale hybrid sampling (MSHS). First, the core idea of the CQ module is a multi-scale projection strategy that transforms 2D queries into 3D space. Second, the ASA module learns the interactions between the spatiotemporal multi-scale queries. Third, the MSHS module uses a deformable attention mechanism to sample multi-scale object information by considering multi-scale queries, pyramid feature maps, and 2D-camera prior knowledge. The entire model employs a backbone network and a feature pyramid network (FPN) as the encoder, introduces a YOLOX and a DepthNet as an ROI_Head to produce the CQ, and repeatedly applies ASA and MSHS as the decoder to obtain detection features. Extensive experiments on the nuScenes, Waymo, and Argoverse benchmark datasets demonstrate the effectiveness of CAM3DNet, which outperforms most existing camera-based 3D object detection methods. In addition, we conduct comprehensive ablation studies to examine the individual effects of CQ, ASA, and MSHS, as well as their space and computation costs.
Chinese Translation
基于查询的3D物体检测方法在使用多视角图像时常常难以有效利用动态多尺度信息,例如,物体特征与查询几何之间的关系未能得到充分学习,直接探索多尺度时空特征会付出过多代价。为了解决这些挑战,我们提出了CAM3DNet,这是一种新颖的稀疏查询基础框架,结合了三个新模块:复合查询(CQ)、自适应自注意力(ASA)和多尺度混合采样(MSHS)。首先,CQ模块的核心思想是多尺度投影策略,将2D查询转换为3D空间。其次,ASA模块学习时空多尺度查询之间的交互。第三,MSHS模块利用可变形注意力机制,通过考虑多尺度查询、金字塔特征图和2D相机先验知识来采样多尺度物体信息。整个模型采用一个主干网络和一个特征金字塔网络(FPN)作为编码器,然后引入YOLOX和DepthNet作为ROI_Head来生成CQ,并反复利用ASA和MSHS作为解码器以获取检测特征。在nuScenes、Waymo和Argoverse基准数据集上的大量实验表明,我们的CAM3DNet的有效性,并且超越了大多数现有的基于相机的3D物体检测方法。此外,我们进行了全面的消融研究,以检查CQ、ASA和MSHS的个体效果,以及它们的空间和计算复杂度成本。
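The CAM3DNet abstract above describes projecting 2D queries into 3D space across scales but gives no formulas. A minimal sketch of one plausible reading follows: a query located at a feature-map cell on some pyramid level is mapped to its pixel center using the level's stride, then back-projected with a pinhole camera model and a predicted depth. The intrinsics, strides, depths, and function names are assumptions for illustration and do not describe the CQ module's actual design.

```python
def cell_to_pixel(row, col, stride):
    """Pixel-space center (u, v) of a feature-map cell at a given stride."""
    return (col + 0.5) * stride, (row + 0.5) * stride

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at a depth to camera coordinates."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def queries_to_3d(queries_2d, intrinsics):
    """queries_2d: (row, col, stride, depth) tuples from several pyramid levels."""
    points = []
    for row, col, stride, depth in queries_2d:
        u, v = cell_to_pixel(row, col, stride)
        points.append(backproject(u, v, depth, *intrinsics))
    return points

intrinsics = (1000.0, 1000.0, 800.0, 450.0)  # assumed fx, fy, cx, cy in pixels
queries = [(27, 49, 16, 20.0), (13, 24, 32, 20.0)]  # two pyramid levels
points_3d = queries_to_3d(queries, intrinsics)
```

Converting every level to pixel coordinates before back-projection keeps queries from different strides in one consistent 3D frame.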
cs.CV / 100 / 2604.17028
IMA-MoE: An Interpretable Modality-Aware Mixture-of-Experts Framework for Characterizing the Neurobiological Signatures of Binge Eating Disorder
IMA-MoE:一种可解释的模态感知专家混合框架,用于表征暴食症的神经生物学特征
Abstract
Binge eating disorder (BED) is the most prevalent eating disorder. However, current diagnostic frameworks remain largely grounded in symptom-based criteria rather than underlying biological mechanisms, thereby limiting early detection and the development of biologically-informed interventions. Emerging studies have begun to investigate the neurobiological signatures of BED, yet their findings are often difficult to generalize due to the reliance on hypothesis-driven parametric models, single-modality analyses, and limited data diversity. Therefore, there is a critical need for advanced data-driven frameworks capable of modeling multimodal data to uncover generalizable and biologically meaningful signatures of BED. In this study, we propose the Interpretable Modality-Aware Mixture-of-Experts (IMA-MoE), a novel architecture designed to integrate heterogeneous neuroimaging, behavioral, hormonal, and demographic measures within a unified predictive framework. By encoding each measure as a distinct token, IMA-MoE enables flexible modeling of cross-modal dependencies while preserving modality-specific characteristics. We further introduce a token-importance mechanism to enhance interpretability by quantifying the contribution of each measure to model predictions. Evaluated on the large-scale Adolescent Brain Cognitive Development (ABCD) dataset, IMA-MoE demonstrates superior performance in differentiating BED from healthy controls compared with baseline methods, while revealing sex-specific predictive patterns, with hormonal measures contributing more prominently to prediction in females. Collectively, these findings highlight the promise of interpretable, data-driven multimodal modeling in advancing biologically-informed characterization of BED and facilitating more precise and personalized interventions in neuropsychiatric disorders.
Chinese Translation
暴食症(Binge Eating Disorder, BED)是最常见的饮食障碍。然而,目前的诊断框架主要基于症状标准,而非潜在的生物机制,这限制了早期检测和生物学信息驱动干预的发展。新兴研究已开始探讨BED的神经生物学特征,但由于依赖假设驱动的参数模型、单一模态分析以及数据多样性有限,其发现往往难以推广。因此,迫切需要先进的数据驱动框架,能够建模多模态数据,以揭示BED的可推广和生物学意义特征。在本研究中,我们提出了可解释的模态感知专家混合框架(Interpretable Modality-Aware Mixture-of-Experts, IMA-MoE),这是一种新颖的架构,旨在将异质的神经影像、行为、激素和人口统计学测量整合到一个统一的预测框架中。通过将每个测量编码为一个独特的标记,IMA-MoE能够灵活建模跨模态依赖关系,同时保留模态特定特征。我们进一步引入了一种标记重要性机制,通过量化每个测量对模型预测的贡献来增强可解释性。在大规模青少年大脑认知发展(Adolescent Brain Cognitive Development, ABCD)数据集上的评估表明,与基线方法相比,IMA-MoE在区分BED与健康对照方面表现出更优越的性能,同时揭示了性别特异的预测模式,其中激素测量在女性预测中贡献更为显著。总体而言,这些发现突显了可解释的数据驱动多模态建模在推动BED的生物学信息驱动特征表征和促进神经精神疾病中更精确、个性化干预的潜力。
cs.CV / 101 / 2604.17030
Conditional Evidence Reconstruction and Decomposition for Interpretable Multimodal Diagnosis
可解释的多模态诊断中的条件证据重建与分解
Abstract
Neurobiological and neurodegenerative diseases are inherently multifactorial, arising from coupled influences spanning genetic susceptibility, brain alterations, and environmental and behavioral factors. Multimodal modeling has therefore been increasingly adopted for disease diagnosis by integrating complementary evidence across data sources. However, in both large-scale cohorts and real-world clinical workflows, modality coverage is often incomplete, making many multimodal models brittle when one or more modalities are unavailable. Existing approaches to incomplete multimodal diagnosis typically rely on group-wise or static priors, which may fail to capture subject-specific cross-modal dependencies; moreover, many models provide limited interpretability into which evidence sources drive the final decision. To address these limitations, we propose Conditional Evidence Reconstruction and Decomposition (CERD), a framework for interpretable multimodal diagnosis with incomplete modalities. CERD first reconstructs missing modality representations conditioned on each subject's observed inputs, then decomposes diagnostic evidence into shared cross-modal corroboration and modality-specific cues via logit-level attribution. Experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) demonstrate that CERD outperforms competitive baselines under incomplete-modality settings while producing structured and clinically aligned evidence attributions for trustworthy decision support.
Chinese Translation
神经生物学和神经退行性疾病本质上是多因素的,源于遗传易感性、脑部变化以及环境和行为因素的相互影响。因此,多模态建模在疾病诊断中越来越被采用,通过整合来自不同数据源的互补证据。然而,在大规模队列和现实临床工作流程中,模态覆盖通常不完整,这使得许多多模态模型在一个或多个模态不可用时变得脆弱。现有的针对不完整多模态诊断的方法通常依赖于群体或静态先验,这可能无法捕捉特定个体的跨模态依赖关系;此外,许多模型在解释最终决策时提供的可解释性有限。为了解决这些局限性,我们提出了条件证据重建与分解(Conditional Evidence Reconstruction and Decomposition, CERD),这是一个用于不完整模态的可解释多模态诊断框架。CERD首先根据每个个体的观察输入重建缺失的模态表示,然后通过logit级别的归因将诊断证据分解为共享的跨模态证实和特定模态的线索。在阿尔茨海默病神经影像学倡议(Alzheimer's Disease Neuroimaging Initiative, ADNI)上的实验表明,CERD在不完整模态设置下优于竞争基线,同时为可信的决策支持提供了结构化且与临床对齐的证据归因。
cs.CV / 102 / 2604.17041
SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models
SIF:大规模视觉-语言模型的语义内分布指纹
Abstract
The public accessibility of large vision-language models (LVLMs) raises serious concerns about unauthorized model reuse and intellectual property infringement. Existing ownership verification methods often rely on semantically abnormal queries or out-of-distribution responses as fingerprints, which can be easily detected and removed by adversaries. We expose this vulnerability through a Semantic Divergence Attack (SDA), which identifies and filters fingerprint queries by measuring semantic divergence between a suspect model and a reference model, showing that existing fingerprints are not semantic-preserving and are therefore easy to detect and bypass. To address these limitations, we propose SIF (Semantically In-Distribution Fingerprints), a non-intrusive ownership verification framework that requires no parameter modification. SIF introduces Semantic-Aligned Fingerprint Distillation (SAFD), which transfers text watermarking signals into the visual modality to produce semantically coherent yet fingerprinted responses. In addition, Robust-Fingerprint Optimization (RFO) enhances robustness by simulating worst-case representation perturbations, making the fingerprints resilient to model modifications such as fine-tuning and quantization. Extensive experiments on LLaVA-1.5 and Qwen2.5-VL demonstrate that SIF achieves strong stealthiness and robustness, providing a practical solution for LVLM copyright protection. Code is available at https://github.com/UCF-ML-Research/SIF-VLM-Fingerprint
Chinese Translation
大型视觉-语言模型(LVLMs)的公共可访问性引发了对未经授权模型重用和知识产权侵犯的严重担忧。现有的所有权验证方法通常依赖于语义异常查询或分布外响应作为指纹,这些指纹容易被对手检测和删除。我们通过语义偏差攻击(Semantic Divergence Attack, SDA)揭示了这一漏洞,该攻击通过测量可疑模型与参考模型之间的语义偏差来识别和过滤指纹查询,显示现有指纹并不保持语义,因此容易被检测和规避。为了解决这些局限性,我们提出了SIF(语义内分布指纹),这是一种非侵入式的所有权验证框架,无需参数修改。SIF引入了语义对齐指纹蒸馏(Semantic-Aligned Fingerprint Distillation, SAFD),将文本水印信号转移到视觉模态,以生成语义一致但带有指纹的响应。此外,稳健指纹优化(Robust-Fingerprint Optimization, RFO)通过模拟最坏情况的表示扰动来增强稳健性,使指纹对模型修改(如微调和量化)具有抵抗力。在LLaVA-1.5和Qwen2.5-VL上的大量实验表明,SIF实现了强大的隐蔽性和稳健性,为LVLM版权保护提供了实用解决方案。代码可在https://github.com/UCF-ML-Research/SIF-VLM-Fingerprint获取。
cs.CV / 103 / 2604.17046
A Real-Time Bike-Pedestrian Safety System with Wide-Angle Perception and Evaluation Testbed for Urban Intersections
一种实时自行车-行人安全系统及其城市交叉口的广角感知与评估测试平台
Abstract
Collisions between cyclists and pedestrians at urban intersections remain a persistent source of injuries, yet few systems attempt real-time warnings to unequipped road users using commodity hardware. We present a prototype collision warning system that runs on a single edge device with a wide-angle fisheye camera, producing audible and visual alerts at 30 fps. The system makes four contributions. First, we develop a calibration pipeline for ultra-wide fisheye lenses that overcomes corner-detection failure and optimizer divergence through perspective remapping and direct bundle adjustment. Second, we combine fisheye-aware object detection with a closed-form ground-plane projection via a precomputed lookup table. Third, we introduce a design-time conformance simulation with 24 scripted hazard scenarios, stochastic size-aware detection failures, and a latency sweep showing that a first-order kinematic predictor maintains the mean warning budget above the distracted-pedestrian reaction time across realistic camera latencies. Fourth, we formalize the decision layer as a separable, auditable testbench with explicit deployment gates, contestability mechanisms, and a residual risk register. Under conformance testing with fisheye localization error, the selected pipeline configuration achieves 93.3% sensitivity and 92.3% specificity, with a mean warning budget of 3.3 s. The system design was informed by community-aided design workshops. Code and replication scripts are available at https://github.com/mkturkcan/bikeped.
Chinese Translation
城市交叉口自行车与行人之间的碰撞仍然是造成伤害的一个持续来源,但很少有系统尝试使用普通硬件对未装备的道路使用者进行实时警告。我们提出了一种原型碰撞警告系统,该系统运行在单个边缘设备上,配备广角鱼眼摄像头,以每秒30帧的速度产生可听和可视的警报。该系统有四个贡献。首先,我们开发了一种超广角鱼眼镜头的校准流程,通过透视重映射和直接束调整克服了角点检测失败和优化器发散的问题。其次,我们将鱼眼感知的物体检测与通过预计算查找表的闭式地面投影相结合。第三,我们引入了一种设计时一致性仿真,包含24个脚本化的危险场景、随机大小感知的检测失败,以及一个延迟扫描,显示一阶运动预测器在现实摄像头延迟下保持平均警告预算高于分心行人的反应时间。第四,我们将决策层形式化为一个可分离的、可审计的测试平台,具有明确的部署门、可争议机制和剩余风险登记。在鱼眼定位误差的一致性测试下,所选的管道配置实现了93.3%的灵敏度和92.3%的特异性,平均警告预算为3.3秒。系统设计受到社区辅助设计研讨会的启发。代码和复现脚本可在 https://github.com/mkturkcan/bikeped 获取。
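The abstract above mentions a first-order kinematic predictor and a warning budget without giving formulas. The following is a minimal sketch of one way such a predictor could work, assuming both agents move at constant velocity on the ground plane. The function names, the `conflict_radius_m` parameter, and the definition of the budget as "time to closest approach minus latency" are illustrative assumptions, not taken from the paper or its repository.

```python
import numpy as np

def time_to_closest_approach(p_a, v_a, p_b, v_b):
    """Time in seconds until two constant-velocity agents are closest.

    Returns 0.0 if the agents are already separating or have no relative motion.
    """
    dp = np.asarray(p_b, dtype=float) - np.asarray(p_a, dtype=float)
    dv = np.asarray(v_b, dtype=float) - np.asarray(v_a, dtype=float)
    rel_speed_sq = float(dv @ dv)
    if rel_speed_sq == 0.0:
        return 0.0
    return max(0.0, -float(dp @ dv) / rel_speed_sq)

def warning_budget(p_a, v_a, p_b, v_b, latency_s, conflict_radius_m=1.0):
    """Seconds left to react once latency is subtracted (hypothetical definition).

    Returns None when the predicted miss distance exceeds the conflict radius.
    """
    t = time_to_closest_approach(p_a, v_a, p_b, v_b)
    dp = np.asarray(p_b, dtype=float) - np.asarray(p_a, dtype=float)
    dv = np.asarray(v_b, dtype=float) - np.asarray(v_a, dtype=float)
    miss_distance = float(np.linalg.norm(dp + dv * t))
    if miss_distance > conflict_radius_m:
        return None
    return t - latency_s

# Cyclist at the origin riding at 5 m/s toward a stationary pedestrian 20 m ahead.
budget = warning_budget((0, 0), (5, 0), (20, 0), (0, 0), latency_s=0.5)
```

Under these assumptions, a latency sweep amounts to calling `warning_budget` with different `latency_s` values and comparing the result against a chosen reaction time.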
cs.CV / 104 / 2604.17052
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
OASIS:用于流媒体视频推理的按需层次事件记忆
Abstract
Streaming video reasoning requires models to operate in a setting where history grows without bound while meaningful evidence remains scarce. In such a landscape, relevant signal is like an oasis: small, critical, and easily lost in a desert of redundancy. Enlarging memory only widens the desert; aggressive compression dries up the oasis. The real difficulty lies in discovering where to look, not how much to remember. We therefore introduce OASIS, a novel framework for streaming video reasoning that tackles this challenge through structured, on-demand retrieval. It organizes streaming history into hierarchical events and performs reasoning as controlled refinement: short-context inference first, followed by semantically grounded retrieval only when uncertainty arises. As the retrieval is driven by high-level intent rather than embedding similarity, the retrieved memory is substantially more accurate and less noisy. Additionally, the mechanism is plug-and-play, training-free, and readily attaches to different streaming MLLM backbones. Experiments across multiple benchmarks and backbones show that OASIS achieves strong gains in long-horizon accuracy and compositional reasoning with bounded token cost and low request delay. Code is available at https://github.com/Solus-sano/OASIS.
Chinese Translation
流媒体视频推理要求模型在历史不断增长而有意义证据稀缺的环境中运行。在这样的背景下,相关信号就像一个绿洲——小而关键,容易在冗余的沙漠中迷失。扩大记忆只会加大沙漠的范围;激进的压缩则会使绿洲干涸。真正的困难在于发现该在哪里寻找,而不是记住多少。因此,我们提出了OASIS,一个用于流媒体视频推理的新框架,通过结构化的按需检索来应对这一挑战。它将流媒体历史组织成层次事件,并进行推理,首先进行短上下文推理的受控精炼,然后在不确定性出现时进行语义基础的检索。由于检索是由高层意图驱动,而非嵌入相似性,所检索的记忆显著更准确且噪声更少。此外,该机制是即插即用的,无需训练,并且可以轻松附加到不同的流媒体多模态大语言模型(MLLM)骨干网络上。在多个基准和骨干网络上的实验表明,OASIS在长时间跨度的准确性和组合推理方面取得了显著提升,同时保持了有限的令牌成本和低请求延迟。代码可在 https://github.com/Solus-sano/OASIS 获取。
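The "short-context inference first, retrieval only when uncertain" control flow described in the abstract can be sketched as follows. This is only an illustration of the gating idea. The callables `infer` and `retrieve_events`, the scalar confidence score, and the threshold are hypothetical placeholders and do not reflect the actual OASIS interfaces.

```python
from typing import Callable, List, Tuple

def answer_with_on_demand_retrieval(
    question: str,
    recent_frames: List[str],
    infer: Callable[[str, List[str]], Tuple[str, float]],
    retrieve_events: Callable[[str], List[str]],
    confidence_threshold: float = 0.5,
) -> Tuple[str, bool]:
    """Return (answer, used_retrieval).

    `infer` is assumed to return an answer plus a confidence in [0, 1].
    """
    # Step 1: cheap pass over the short recent context only.
    answer, confidence = infer(question, recent_frames)
    if confidence >= confidence_threshold:
        return answer, False
    # Step 2: only when uncertain, fetch older events and answer again.
    evidence = retrieve_events(question)
    answer, _ = infer(question, evidence + recent_frames)
    return answer, True
```

The point of the gate is that the retrieval cost is paid only on uncertain queries, which is one way to keep token cost bounded.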
cs.CV / 105 / 2604.17054
mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval
mEOL:无训练的指令引导多模态嵌入器用于矢量图形和图像检索
Abstract
Scalable Vector Graphics (SVGs) function both as visual images and as structured code that encodes rich geometric and layout information, yet most methods rasterize them and discard this symbolic organization. At the same time, recent sentence embedding methods produce strong text representations but do not naturally extend to visual or structured modalities. We propose a training-free, instruction-guided multimodal embedding framework that uses a Multimodal Large Language Model (MLLM) to map text, raster images, and SVG code into an aligned embedding space. We control the direction of embeddings through modality-specific instructions and structural SVG cues, eliminating the need for learned projection heads or contrastive training. Our method has two key components: (1) Multimodal Explicit One-word Limitation (mEOL), which instructs the MLLM to summarize any multimodal input into a single token whose hidden state serves as a compact semantic embedding; and (2) a semantic SVG rewriting module that assigns meaningful identifiers and simplifies nested SVG elements through visual reasoning over the rendered image, exposing geometric and relational cues hidden in raw code. Using a repurposed VGBench, we build the first text-to-SVG retrieval benchmark and show that our training-free embeddings outperform encoder-based and training-based multimodal baselines. These results highlight prompt-level control as an effective alternative to parameter-level training for structure-aware multimodal retrieval. Project page: https://scene-the-ella.github.io/meol/
Chinese Translation
可缩放矢量图形(SVG)既作为视觉图像,又作为编码丰富几何和布局信息的结构化代码存在,但大多数方法将其光栅化并丢弃这种符号化的组织。同时,最近的句子嵌入方法生成强大的文本表示,但并未自然扩展到视觉或结构化模态。我们提出了一种无训练的、指令引导的多模态嵌入框架,该框架利用多模态大型语言模型(MLLM)将文本、光栅图像和SVG代码映射到对齐的嵌入空间。我们通过特定模态的指令和结构化SVG提示来控制嵌入的方向,消除了对学习投影头或对比训练的需求。我们的方法有两个关键组成部分:(1)多模态显式单词限制(mEOL),它指示MLLM将任何多模态输入总结为一个单一的标记,其隐藏状态作为紧凑的语义嵌入。(2)一个语义SVG重写模块,通过对渲染图像的视觉推理,为嵌套的SVG元素分配有意义的标识符并简化其结构,揭示隐藏在原始代码中的几何和关系线索。通过重新利用VGBench,我们建立了第一个文本到SVG检索基准,并展示了我们的无训练嵌入优于基于编码器和基于训练的多模态基线。这些结果突显了提示级控制作为一种有效的替代方案,优于参数级训练,用于结构感知的多模态检索。项目页面:https://scene-the-ella.github.io/meol/
cs.CV / 106 / 2604.17062
Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition
基于运动引导的语义对齐与负提示的零样本视频动作识别
Abstract
Zero-shot action recognition is challenging due to the semantic gap between seen and unseen classes. We present a novel framework that enhances CLIP with disentangled embeddings and semantic-guided interaction. A Motion Separation Module (MSM) separates motion-sensitive and global-static features, while a Motion Aggregation Block (MAB) employs gated cross-attention to refine motion representation without re-coupling redundant information. To facilitate generalization to unseen categories, we enforce semantic alignment between video features and textual representations by aligning projected embeddings with positive textual prompts, while leveraging negative prompts to explicitly model "non-class" semantics. Experiments on standard benchmarks demonstrate that our method consistently outperforms prior CLIP-based approaches, achieving robust zero-shot action recognition across both coarse and fine-grained datasets.
Chinese Translation
零样本动作识别面临着已见类别与未见类别之间的语义差距的挑战。我们提出了一种新颖的框架,通过解耦嵌入和语义引导的交互来增强 CLIP。运动分离模块(Motion Separation Module, MSM)将运动敏感特征与全局静态特征分离,而运动聚合块(Motion Aggregation Block, MAB)采用门控交叉注意力来精炼运动表示,而不重新耦合冗余信息。为了促进对未见类别的泛化,我们通过将投影嵌入与正文本提示对齐,强制视频特征与文本表示之间的语义对齐,同时利用负提示明确建模“非类别”语义。在标准基准上的实验表明,我们的方法在粗粒度和细粒度数据集上都能持续超越先前基于 CLIP 的方法,实现稳健的零样本动作识别。
cs.CV / 107 / 2604.17065
BasketHAR: A Multimodal Dataset for Human Activity Recognition and Sport Analysis in Basketball Training Scenarios
BasketHAR:用于篮球训练场景的人类活动识别与运动分析的多模态数据集
Abstract
Human Activity Recognition (HAR) involves the automatic identification of user activities and has gained significant research interest due to its broad applicability. Most HAR systems rely on supervised learning, which necessitates large, diverse, and well-annotated datasets. However, existing datasets predominantly focus on basic activities such as walking, standing, and stair navigation, limiting their utility in specialized contexts like sports performance analysis. To address this gap, we present BasketHAR, a novel multimodal HAR dataset tailored for basketball training, encompassing a diverse set of professional-level actions. BasketHAR includes comprehensive motion data from inertial measurement units (accelerometers and gyroscopes), angular velocity, magnetic field, heart rate, skin temperature, and synchronized video recordings. We also provide a baseline multimodal alignment method to benchmark performance. Experimental results underscore the dataset's complexity and suitability for advanced HAR tasks. Furthermore, we highlight its potential applications in the analysis of basketball training sessions and in the generation of specialized performance reports, representing a valuable resource for future research in HAR and sports analytics. The dataset is publicly accessible at https://huggingface.co/datasets/Xian-Gao/BasketHAR under the Apache License 2.0.
Chinese Translation
人类活动识别(HAR)涉及用户活动的自动识别,因其广泛的应用性而受到显著的研究关注。大多数HAR系统依赖于监督学习,这需要大规模、多样化且标注良好的数据集。然而,现有数据集主要集中于基本活动,如步行、站立和楼梯导航,这限制了其在运动表现分析等专业领域的实用性。为了解决这一问题,我们提出了BasketHAR,这是一个针对篮球训练量身定制的新型多模态HAR数据集,涵盖了一系列专业水平的动作。BasketHAR包括来自惯性测量单元(加速度计和陀螺仪)的全面运动数据、角速度、磁场、心率、皮肤温度以及同步视频录制。我们还提供了一种基准多模态对齐方法以评估性能。实验结果强调了数据集的复杂性和适用于高级HAR任务的适宜性。此外,我们还强调了其在篮球训练课程分析和生成专业表现报告中的潜在应用,代表了未来HAR和运动分析研究的宝贵资源。该数据集可在 https://huggingface.co/datasets/Xian-Gao/BasketHAR 公开访问,采用Apache License 2.0许可。
cs.CV / 108 / 2604.17070
NTIRE 2026 Rip Current Detection and Segmentation (RipDetSeg) Challenge Report
NTIRE 2026 离岸流检测与分割(RipDetSeg)挑战报告
Dumitriu, Andrei, Ralhan, Aakash, Miron, Florin, Tatui, Florin, Ionescu, Radu Tudor, Timofte, Radu, Naeem, Abdullah, Katwal, Anav, Dey, Ayon, Hoque, Md Tamjidul, Shin, Asuka, Shirono, Hiroto, Shigematsu, Kosuke, Mahesh, Gaurav, Nanditha, Anjana, CV, Jiji, Vakhitov, Akbarali, Lee, Sang-Chul, Li, Xinger, Yu, Chun'an, Chen, Junhao, Yang, Yang, Reddy, Gundluri Yuvateja, Palaram, Harshitha, N, Gejalakshmi, S, Jeevitha, Tu, Jiachen, Xu, Guoyi, Jiang, Yaoxin, Liu, Jiajia, Shi, Yaokun, Tripathi, Amitabh, Mahesh, Modugumudi, Vipparthi, Santosh Kumar, Murala, Subrahmanyam
Abstract
This report presents the NTIRE 2026 Rip Current Detection and Segmentation (RipDetSeg) Challenge, which targets automatic rip current understanding in images. Rip currents are hazardous nearshore flows that cause many beach-related fatalities worldwide, yet remain difficult to identify because their visual appearance varies substantially across beaches, viewpoints, and sea states. To advance research on this safety-critical problem, the challenge builds on the RipVIS benchmark, evaluating both detection and segmentation. The dataset is diverse, sourced from more than 10 countries, with 4 camera orientations and diverse beach and sea conditions. This report describes the dataset, challenge protocol, evaluation methodology, and final results, and summarizes the main insights from the submitted methods. The challenge attracted 159 registered participants and produced 9 valid test submissions across the two tasks. Final rankings are based on a composite score that combines $F_1[50]$, $F_2[50]$, $F_1[40\!:\!95]$, and $F_2[40\!:\!95]$. Most participant solutions relied on pretrained models, combined with strong augmentation and post-processing design. These results suggest that rip current understanding benefits strongly from progress in robust general-purpose vision models, while leaving ample room for future methods tailored to the unique visual structure of rip currents.
Chinese Translation
本报告介绍了NTIRE 2026 离岸流检测与分割(RipDetSeg)挑战,旨在实现图像中离岸流(rip current)的自动理解。离岸流是危险的近岸水流,导致全球许多与海滩相关的死亡事件,但由于其视觉外观在不同海滩、视角和海洋状态下差异显著,因此识别起来仍然困难。为了推动这一安全关键问题的研究,该挑战基于RipVIS基准,评估检测和分割两方面。数据集来源多样,来自超过10个国家,包含4种相机方向以及多样的海滩和海洋条件。本报告描述了数据集、挑战协议、评估方法、最终结果,并总结了提交方法的主要见解。该挑战吸引了159名注册参与者,并在两个任务中产生了9个有效的测试提交。最终排名基于一个综合得分,该得分结合了$F_1[50]$、$F_2[50]$、$F_1[40\!:\!95]$和$F_2[40\!:\!95]$。大多数参与者的解决方案依赖于预训练模型,并结合强大的数据增强和后处理设计。这些结果表明,离岸流的理解在很大程度上受益于通用视觉模型的稳健进展,同时也为未来针对其独特视觉结构的方法留出了充足的空间。
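The ranking metrics in the abstract are written as $F_1$ and $F_2$. Assuming these denote the standard F-beta score with beta set to 1 and 2 (the abstract does not define them, and the bracketed values are presumably IoU thresholds), a minimal sketch of the score is shown below. How the four terms are weighted into the composite score is not stated in the abstract and is not reproduced here.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# Same precision/recall pair swapped: F2 rewards the high-recall detector more.
high_recall = f_beta(precision=0.5, recall=1.0, beta=2.0)
high_precision = f_beta(precision=1.0, recall=0.5, beta=2.0)
```

For a safety-critical task such as this one, a recall-weighted score like $F_2$ penalizes missed rip currents more than false alarms.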
cs.CV / 109 / 2604.17074
Comparison Drives Preference: Reference-Aware Modeling for AI-Generated Video Quality Assessment
比较驱动偏好:面向参考的人工智能生成视频质量评估建模
Abstract
The rapid advancement of generative models has led to a growing volume of AI-generated videos, making the automatic quality assessment of such videos increasingly important. Existing AI-generated content video quality assessment (AIGC-VQA) methods typically estimate visual quality by analyzing each video independently, ignoring potential relationships among videos. In this work, we revisit AIGC-VQA from an inter-video perspective and formulate it as a reference-aware evaluation problem. Through this formulation, quality assessment is guided not only by intrinsic video characteristics but also by comparisons with related videos, which is more consistent with human perception. To validate its effectiveness, we propose Reference-aware Video Quality Assessment (RefVQA), which utilizes a query-centered reference graph to organize semantically related samples and performs graph-guided difference aggregation from the reference nodes to the query node. Experiments on existing datasets demonstrate that our proposed RefVQA outperforms state-of-the-art methods across multiple quality dimensions, with strong generalization ability validated by cross-dataset evaluation. These results highlight the effectiveness of the proposed reference-based formulation and suggest its potential to advance AIGC-VQA.
Chinese Translation
生成模型的快速发展导致了人工智能生成视频数量的激增,使得对这些视频的自动质量评估变得愈发重要。现有的人工智能生成内容视频质量评估(AIGC-VQA)方法通常通过独立分析每个视频来估计视觉质量,忽视了视频之间可能存在的关系。在本研究中,我们从视频间的视角重新审视AIGC-VQA,并将其构建为一个面向参考的评估问题。通过这种构建,质量评估不仅受到视频内在特征的指导,还受到与相关视频的比较影响,这与人类的感知更为一致。为了验证其有效性,我们提出了面向参考的视频质量评估(RefVQA),该方法利用以查询为中心的参考图来组织语义相关的样本,并从参考节点到查询节点执行图引导的差异聚合。在现有数据集上的实验表明,我们提出的RefVQA在多个质量维度上超越了最先进的方法,并通过跨数据集评估验证了其强大的泛化能力。这些结果突显了所提出的基于参考的构建的有效性,并暗示其在推动AIGC-VQA方面的潜力。
cs.CV / 110 / 2604.17082
D-Prism: Differentiable Primitives for Structured Dynamic Modeling
D-Prism:用于结构化动态建模的可微分原语
Abstract
Capturing both geometry and rigid motion for structured dynamic objects, like multi-part assemblies or jointed mechanisms, remains a key challenge. Existing dynamic methods, such as deformable meshes or 3DGS, rely on unstructured representations and fail to jointly model suitable geometry and articulated motion. Primitive-based methods excel at structured static scenes, but their dynamic potential is still unexplored. We propose D-Prism, the first framework to achieve high-fidelity structured dynamic modeling by extending differentiable primitives to the dynamic domain. Specifically, we bind 3DGS to primitive surfaces, leveraging their respective strengths in appearance and geometry. We introduce a deformation network to control primitive motion, ensuring it accurately matches the object's movement. Furthermore, we design a novel adaptive control strategy to dynamically adjust primitive counts, better matching objects' true spatial footprint. Experiments confirm that our method excels at structured dynamic modeling, providing both structured geometry and precise motion tracking.
Chinese Translation
捕捉结构化动态物体(如多部件装配体或关节机制)的几何形状和刚性运动仍然是一个关键挑战。现有的动态方法,如可变形网格或3DGS,依赖于非结构化表示,无法共同建模合适的几何形状和关节运动。基于原语的方法在结构化静态场景中表现出色,但其动态潜力仍未被探索。我们提出了D-Prism,这是第一个通过将可微分原语扩展到动态领域来实现高保真结构化动态建模的框架。具体而言,我们将3DGS与原语表面绑定,利用它们在外观和几何形状上的各自优势。我们引入了一个变形网络来控制原语运动,确保其准确匹配物体的运动。此外,我们设计了一种新颖的自适应控制策略,以动态调整原语数量,更好地匹配物体的真实空间占用。实验确认我们的方法在结构化动态建模方面表现出色,提供了结构化几何形状和精确的运动跟踪。
cs.CV / 111 / 2604.17087
EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling
EvoComp:通过语义引导的进化标记学习视觉标记压缩以用于多模态大型语言模型
Abstract
Recent Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language understanding tasks, yet their inference efficiency is often hampered by the large number of visual tokens, particularly in high-resolution or multi-image scenarios. To address this issue, we propose EvoComp, a visual token compression framework that significantly reduces token count while preserving task accuracy. EvoComp introduces a lightweight encoder-only transformer-based compressor that selects the most informative and non-redundant visual tokens by jointly considering visual and textual contexts. A core challenge lies in providing effective supervision for training the compressor. To this end, we design an evolutionary labeling strategy that searches for token subsets minimizing the MLLM's output loss, while enforcing semantic diversity through vocabulary-based token grouping. We further train the compressor using a tailored loss function combining the GHM loss to mitigate class and difficulty imbalance, and a cosine similarity regularization to encourage semantic separation between retained and discarded tokens. Extensive experiments across multiple vision-language benchmarks show that EvoComp outperforms existing methods based on attention or similarity heuristics. Notably, it retains 99.3% of the original accuracy under 3x token compression and delivers up to 1.6x speedup on mobile devices.
Chinese Translation
近期的多模态大型语言模型(MLLMs)在视觉-语言理解任务中表现出色,但其推理效率常常受到大量视觉标记的影响,尤其是在高分辨率或多图像场景中。为了解决这一问题,我们提出了EvoComp,一个视觉标记压缩框架,能够在保持任务准确性的同时显著减少标记数量。EvoComp引入了一种轻量级的仅编码器变换器基础的压缩器,通过共同考虑视觉和文本上下文,选择最具信息量和非冗余的视觉标记。一个核心挑战在于为训练压缩器提供有效的监督。为此,我们设计了一种进化标记策略,搜索最小化MLLM输出损失的标记子集,同时通过基于词汇的标记分组强制实现语义多样性。我们进一步使用一种定制的损失函数训练压缩器,该损失函数结合了GHM损失以减轻类别和难度不平衡,以及余弦相似性正则化以鼓励保留和丢弃标记之间的语义分离。在多个视觉-语言基准上的广泛实验表明,EvoComp在性能上优于基于注意力或相似性启发式的方法。值得注意的是,在3倍标记压缩下,它保留了99.3%的原始准确性,并在移动设备上实现了高达1.6倍的速度提升。
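To make the "3x token compression" figure concrete, the sketch below keeps the highest-scoring third of the visual tokens while preserving their original order. It assumes per-token importance scores are already available, for example from a learned compressor. The function name and the scoring input are hypothetical, and the evolutionary labeling and loss design described in the abstract are not shown.

```python
import numpy as np

def keep_top_tokens(tokens: np.ndarray, scores: np.ndarray, ratio: int = 3):
    """Keep ceil(N / ratio) tokens with the highest scores, in original order.

    tokens: (N, D) visual token embeddings.
    scores: (N,) importance scores (assumed to come from a compressor).
    """
    n = tokens.shape[0]
    k = max(1, -(-n // ratio))                 # ceiling division
    keep = np.sort(np.argsort(scores)[-k:])    # top-k indices, restored to input order
    return tokens[keep], keep

tokens = np.arange(12).reshape(6, 2)
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.0])
kept, kept_idx = keep_top_tokens(tokens, scores, ratio=3)
```

Sorting the kept indices keeps the retained tokens in their original spatial order, which is a design choice of this sketch rather than something the abstract specifies.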
cs.CV / 112 / 2604.17090
Marrying Text-to-Motion Generation with Skeleton-Based Action Recognition
将文本到运动生成与基于骨架的动作识别相结合
Abstract
Human action recognition and motion generation are two active research problems in human-centric computer vision, both aiming to align motion with textual semantics. However, most existing works study these two problems separately, without uncovering the links between them, namely that motion generation requires semantic comprehension. This work investigates unified action recognition and motion generation by leveraging skeleton coordinates for both motion understanding and generation. We propose Coordinates-based Autoregressive Motion Diffusion (CoAMD), which synthesizes motion in a coarse-to-fine manner. As a core component of CoAMD, we design a Multi-modal Action Recognizer (MAR) that provides gradient-based semantic guidance for motion generation. Furthermore, we establish a rigorous benchmark by evaluating baselines on absolute coordinates. Our model can be applied to four important tasks, including skeleton-based action recognition, text-to-motion generation, text-motion retrieval, and motion editing. Extensive experiments on 13 benchmarks across these tasks demonstrate that our approach achieves state-of-the-art performance, highlighting its effectiveness and versatility for human motion modeling. Code is available at https://github.com/jidongkuang/CoAMD.
Chinese Translation
人类动作识别和运动生成是人本计算机视觉领域中的两个活跃研究问题,二者均旨在将运动与文本语义对齐。然而,大多数现有研究将这两个问题分开研究,未能揭示它们之间的联系,即运动生成需要语义理解。本研究通过利用骨架坐标来统一动作识别和运动生成,探讨了这两个问题的结合。我们提出了基于坐标的自回归运动扩散模型(Coordinates-based Autoregressive Motion Diffusion, CoAMD),该模型以粗到细的方式合成运动。作为CoAMD的核心组件,我们设计了多模态动作识别器(Multi-modal Action Recognizer, MAR),为运动生成提供基于梯度的语义指导。此外,我们通过在绝对坐标上评估基线建立了一个严格的基准。我们的模型可以应用于四个重要任务,包括基于骨架的动作识别、文本到运动生成、文本-运动检索和运动编辑。在涵盖这些任务的13个基准上的广泛实验表明,我们的方法达到了最先进的性能,突显了其在人类运动建模中的有效性和通用性。代码可在 https://github.com/jidongkuang/CoAMD 获取。
cs.CV / 113 / 2604.17107
Hybrid Multi-Dimensional MRI Prostate Cancer Detection via Hadamard Network-Based Bias Correction and Residual Networks
基于哈达玛网络的偏差校正和残差网络的混合多维MRI前列腺癌检测
Abstract
Magnetic Resonance Imaging (MRI) is vital for prostate cancer (PCa) diagnosis. While advanced techniques such as Hybrid Multi-dimensional MRI (HM-MRI) have enhanced diagnostic capabilities, a significant need remains for robust, automated Artificial Intelligence (AI)-based detection methods. In this study, we combine quantitative HM-MRI of tissue composition with an AI-based neural network. We propose the Hadamard-Bias Network plus ResNet18 (HBR-Net-18), a two-stage AI framework for PCa detection. In the first stage, a Hadamard U-Net-based algorithm suppresses intensity inhomogeneities (bias fields) across six parametric HM-MRI maps generated via a Physics-Informed Autoencoder (PIA). In the second stage, a Residual Network (ResNet-18) performs patch-level classification. The framework utilizes overlapping 11-by-11 patches, incorporating both 2D intra-slice and 3D inter-slice (adjacent-slice) information to improve spatial consistency. Our experimental results demonstrate that HBR-Net-18 achieves balanced sensitivity and specificity, significantly outperforming conventional radiomics-based approaches and baseline CNN models, highlighting its potential for clinical deployment.
Chinese Translation
磁共振成像(MRI)对前列腺癌(PCa)的诊断至关重要。尽管混合多维MRI(HM-MRI)等先进技术提高了诊断能力,但仍然迫切需要稳健的、基于人工智能(AI)的自动检测方法。在本研究中,我们将组织成分的定量HM-MRI与基于AI的神经网络相结合。我们提出了哈达玛偏差网络加残差网络18(HBR-Net-18),这是一个用于PCa检测的两阶段AI框架。在第一阶段,基于哈达玛U-Net的算法抑制通过物理信息自编码器(PIA)生成的六个参数HM-MRI图的强度不均匀性(偏差场)。在第二阶段,残差网络(ResNet-18)进行补丁级分类。该框架利用重叠的11×11补丁,结合二维切片内和三维切片间(相邻切片)信息,以提高空间一致性。我们的实验结果表明,HBR-Net-18在灵敏度和特异性方面表现平衡,显著优于传统的放射组学方法和基线卷积神经网络(CNN)模型,突显了其在临床应用中的潜力。
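The patch scheme in the abstract (overlapping 11-by-11 patches with adjacent-slice context) can be illustrated with a short NumPy sketch. It assumes a single parametric map stored as a `(slices, height, width)` array and stacks the previous, current, and next slice as three channels. That three-channel layout, the edge padding at the first and last slice, and the function name are assumptions made for illustration, not the paper's stated design.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def extract_patches(volume: np.ndarray, size: int = 11, stride: int = 1) -> np.ndarray:
    """Return overlapping patches of shape (n, 3, size, size).

    Channel 0/1/2 hold the previous/current/next slice; the first and last
    slices reuse themselves as their missing neighbour (edge padding).
    """
    padded = np.pad(volume, ((1, 1), (0, 0), (0, 0)), mode="edge")
    # Shape: (slices, H - size + 1, W - size + 1, 3, size, size)
    windows = sliding_window_view(padded, (3, size, size))
    windows = windows[:, ::stride, ::stride]
    return windows.reshape(-1, 3, size, size)

volume = np.arange(4 * 15 * 15, dtype=float).reshape(4, 15, 15)
patches = extract_patches(volume)  # stride 1 gives maximal overlap
```

With `stride=1` neighbouring patches share all but one row or column; a larger stride trades overlap for fewer patches.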
cs.CV / 114 / 2604.17110
From Clinical Intent to Clinical Model: An Autonomous Coding-Agent Framework for Clinician-driven AI Development
从临床意图到临床模型:一种自主编码代理框架用于临床医生驱动的人工智能开发
Abstract
Clinical AI development has traditionally followed a collaborative paradigm that depends on close interaction between clinicians and specialized AI teams. This paradigm imposes a practical challenge: clinicians must repeatedly communicate and refine their requirements with AI developers before those requirements can be translated into executable model development. This iterative process is time-consuming, and even after repeated discussion, misalignment may still exist because the two sides do not fully share each other's expertise. However, autonomous coding agents may change this paradigm, raising the possibility that clinicians could develop clinical AI models independently through natural-language interaction alone. In this study, we present such an autonomous prototype for clinician-driven clinical AI development. We evaluated the system on five clinical tasks spanning dermoscopic lesion classification, melanoma-versus-nevus triage, wrist-fracture detection (including a weakly supervised variant with only 5% bounding-box annotations), and debiased pneumothorax classification on chest radiographs. Across these settings, the system consistently developed models from clinician requests and achieved promising performance. Notably, in a debiased pneumothorax classification task on chest radiographs, where chest drains can act as a major confounder, the system successfully mitigated shortcut learning and nearly halved the model's reliance on chest drains. These findings provide proof of concept that autonomous coding agents may help shift clinical AI development toward a more clinician-driven paradigm, reducing the communication overhead and dependence on specialized AI developers. Although further validation and robustness assessment are needed, this study suggests a promising path toward making clinical AI development more accessible.
Chinese Translation
临床人工智能(AI)开发传统上遵循一种协作范式,该范式依赖于临床医生与专业AI团队之间的密切互动。这一范式带来了实际挑战:临床医生必须反复与AI开发者沟通并完善他们的需求,才能将这些需求转化为可执行的模型开发。这一迭代过程耗时较长,即使经过多次讨论,双方之间仍可能存在不一致,因为双方并未完全共享彼此的专业知识。然而,自主编码代理可能会改变这一范式,提出了临床医生仅通过自然语言互动独立开发临床AI模型的可能性。在本研究中,我们展示了这样一个自主原型,用于临床医生驱动的临床AI开发。我们在五个临床任务上评估了该系统,这些任务包括皮肤镜病变分类、黑色素瘤与痣的分流、腕部骨折检测(包括仅有5%边界框标注的弱监督变体)以及胸部X光片上的去偏气胸分类。在这些设置中,该系统始终能够根据临床医生的请求开发模型,并取得了令人鼓舞的性能。值得注意的是,在胸部X光片的去偏气胸分类任务中,胸腔引流管可能作为一个主要的混淆因素,该系统成功减轻了捷径学习,并将模型对胸腔引流管的依赖几乎减半。这些发现提供了概念验证,表明自主编码代理可能有助于将临床AI开发转向更以临床医生为驱动的范式,从而减少沟通开销和对专业AI开发者的依赖。尽管需要进一步的验证和稳健性评估,但本研究表明了一条使临床AI开发更具可及性的有希望的路径。
cs.CV / 115 / 2604.17115
Inference-Time Temporal Probability Smoothing for Stable Video Segmentation with SAM2 under Weak Prompts
基于弱提示的SAM2稳定视频分割的推理时序概率平滑
Abstract
Interactive video segmentation models such as SAM2 have demonstrated strong generalization across diverse visual domains. However, under weak user supervision, for example, when sparse point prompts are provided on a single frame, their predictions often suffer from temporal instability, including flickering boundaries, object dropout, and inconsistent object extents across frames. These issues limit their reliability in downstream video understanding and control applications. In this paper, we propose an inference-time temporal probability smoothing method that improves the temporal stability of SAM2-based video segmentation without retraining or architectural modification. Our approach operates directly on per-frame segmentation probability maps and leverages optical-flow-based motion warping together with pixel-wise uncertainty estimates derived from segmentation entropy and forward-backward flow consistency. These signals are used to adaptively blend current-frame predictions with motion-aligned historical estimates, yielding temporally coherent segmentation outputs under weak prompts. We evaluate the proposed method on four diverse video sequences using a comprehensive set of frame-wise and temporal stability metrics, including motion-compensated IoU, boundary consistency, object persistence, and area volatility. Experimental results demonstrate consistent improvements in temporal stability over vanilla SAM2 inference while preserving spatial accuracy. The proposed framework is lightweight, model-agnostic, and well-suited for real-time, interactive video segmentation.
Chinese Translation
交互式视频分割模型如SAM2在多样的视觉领域中表现出了强大的泛化能力。然而,在弱用户监督下,例如,当仅在单帧上提供稀疏点提示时,它们的预测往往会遭遇时序不稳定的问题,包括边界闪烁、物体丢失以及跨帧物体范围不一致。这些问题限制了它们在下游视频理解和控制应用中的可靠性。本文提出了一种推理时序概率平滑方法,旨在提高基于SAM2的视频分割的时序稳定性,而无需重新训练或修改架构。我们的方法直接作用于每帧的分割概率图,并结合基于光流的运动扭曲以及从分割熵和前后流一致性中得出的逐像素不确定性估计。这些信号用于自适应地将当前帧的预测与运动对齐的历史估计进行混合,从而在弱提示下产生时序一致的分割输出。我们在四个多样的视频序列上评估了所提出的方法,使用了一套全面的帧级和时序稳定性指标,包括运动补偿的IoU、边界一致性、物体持续性和区域波动性。实验结果表明,与传统的SAM2推理相比,所提方法在时序稳定性上有一致的改善,同时保持了空间准确性。该框架轻量、模型无关,适合实时交互式视频分割。
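The adaptive blending step described in the abstract can be sketched as below. This is one plausible form, assuming the previous frame's probability map has already been warped to the current frame with optical flow (not shown) and that a per-pixel flow-consistency mask in [0, 1] is available. The weighting rule and the `max_history_weight` parameter are illustrative assumptions, not the paper's formula.

```python
import numpy as np

def binary_entropy(p: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Per-pixel entropy of a foreground probability map, in bits (range 0 to 1)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def smooth_frame(p_curr, p_prev_warped, flow_consistent, max_history_weight=0.8):
    """Blend the current probabilities with motion-aligned history.

    History gets more weight where the current frame is uncertain (high entropy)
    and no weight where the flow failed the forward-backward check.
    """
    w = max_history_weight * binary_entropy(p_curr) * flow_consistent
    return (1.0 - w) * p_curr + w * p_prev_warped

p_curr = np.array([[0.5, 0.99]])   # left pixel is uncertain, right pixel is confident
p_prev = np.array([[0.9, 0.1]])    # warped estimate from the previous frame
smoothed = smooth_frame(p_curr, p_prev, flow_consistent=np.ones_like(p_curr))
```

In this sketch the uncertain pixel is pulled strongly toward its history while the confident pixel barely moves, which is the behaviour that suppresses flicker without overriding clear current-frame evidence.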
cs.CV / 116 / 2604.17122
Multimodal Fusion of Histopathology Images and Electronic Health Records for Early Breast Cancer Diagnosis
组织病理图像与电子健康记录的多模态融合用于早期乳腺癌诊断
Abstract
Breast cancer is a leading cause of cancer-related mortality worldwide, and timely, accurate diagnosis is critical to improving survival outcomes. While convolutional neural networks (CNNs) have demonstrated strong performance on histopathology image classification, and machine learning models on structured electronic health records (EHR) have shown utility for clinical risk stratification, most existing work treats these modalities in isolation. This paper presents a systematic multimodal framework that integrates patch-level histopathology features from the BreCaHAD dataset with structured clinical data from MIMIC-IV. We train and evaluate unimodal image models (a simple CNN baseline and ResNet-18 with transfer learning), unimodal tabular models (XGBoost and a multilayer perceptron), and an intermediate-fusion model that concatenates latent representations from both modalities. ResNet-18 achieves near-perfect accuracy (1.000) and AUC (1.000) on three-class patch-level classification, while XGBoost achieves 98% accuracy on the EHR prediction task. The intermediate fusion model yields a macro-average AUC of 0.997, outperforming all unimodal baselines and delivering the largest improvements on the diagnostically critical but class-imbalanced mitosis category (AUC 0.994). Grad-CAM and SHAP interpretability analyses validate that model decisions align with established pathological and clinical criteria. Our results demonstrate that multimodal integration delivers meaningful improvements in both predictive performance and clinical transparency.
Chinese Translation
乳腺癌是全球癌症相关死亡的主要原因,及时准确的诊断对改善生存结果至关重要。尽管卷积神经网络(CNN)在组织病理图像分类方面表现出色,且机器学习模型在结构化电子健康记录(EHR)上已显示出临床风险分层的有效性,但现有的大多数研究将这些模态视为孤立的。本论文提出了一种系统的多模态框架,将来自BreCaHAD数据集的补丁级组织病理特征与MIMIC-IV的结构化临床数据相结合。我们训练并评估了单模态图像模型(一个简单的CNN基线和带迁移学习的ResNet-18)、单模态表格模型(XGBoost和多层感知器)以及一个中间融合模型,该模型将两种模态的潜在表示进行连接。ResNet-18在三类补丁级分类中达到了近乎完美的准确率(1.000)和AUC(1.000),而XGBoost在EHR预测任务中达到了98%的准确率。中间融合模型的宏平均AUC为0.997,超越了所有单模态基线,并在诊断上至关重要但类别不平衡的有丝分裂类别上提供了最大的改进(AUC 0.994)。Grad-CAM和SHAP可解释性分析验证了模型决策与既定病理和临床标准的一致性。我们的结果表明,多模态集成在预测性能和临床透明度方面都带来了显著的改善。
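Intermediate fusion, as named in the abstract, concatenates latent vectors from each modality before a shared classification head. The NumPy sketch below shows only that concatenation and a linear softmax head. The latent sizes (512 for a pooled ResNet-18 feature, 32 for the tabular branch), the random weights, and the function name are placeholders for illustration, not the paper's trained model.

```python
import numpy as np

def fuse_and_classify(img_latent, tab_latent, weights, bias):
    """Concatenate per-sample latents and apply a linear softmax head.

    img_latent: (batch, d_img), tab_latent: (batch, d_tab),
    weights: (d_img + d_tab, classes), bias: (classes,).
    """
    z = np.concatenate([img_latent, tab_latent], axis=-1)
    logits = z @ weights + bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
img_latent = rng.normal(size=(4, 512))   # e.g. pooled image features
tab_latent = rng.normal(size=(4, 32))    # hypothetical tabular embedding size
weights = rng.normal(size=(512 + 32, 3)) * 0.01
probs = fuse_and_classify(img_latent, tab_latent, weights, np.zeros(3))
```

Fusing at the latent level lets the head learn interactions between modalities, in contrast to late fusion, which would only combine each branch's final predictions.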
cs.CV / 117 / 2604.17126
Prompt Sensitivity in Vision-Language Grounding: How Small Changes in Wording Affect Object Detection
视觉-语言基础中的提示敏感性:措辞的小变化如何影响物体检测
Abstract
Vision-language models enable open-vocabulary object grounding through natural language queries, under the implicit assumption that semantically equivalent descriptions yield consistent outputs. We examine this assumption using a controlled pipeline combining DETR for object proposals with CLIP for language-conditioned selection on 263 COCO val2017 images. We find that overlapping prompts such as "a person," "a human," and "a pedestrian" frequently select different instances, with mean instability of 2.11 distinct selections across six prompts. PCA analysis shows this variability is structured and directional, not random. Prompt ensembling does not improve quality and often shifts selections toward generic regions. We further show that text embedding proximity explains only 34% of grounding disagreement (r = -0.58), confirming that instability arises from the argmax selection mechanism rather than text-level distances alone.
Chinese Translation
视觉-语言模型通过自然语言查询实现开放词汇的物体基础,隐含假设是语义等价的描述会产生一致的输出。我们使用一个控制管道来检验这一假设,该管道结合了用于物体提议的DETR与用于语言条件选择的CLIP,分析了263张COCO val2017图像。我们发现,重叠的提示如“一个人”、“一个人类”和“一个行人”经常选择不同的实例,在六个提示中平均不稳定性为2.11个不同选择。主成分分析(PCA)显示这种变异是有结构和方向性的,而非随机的。提示集成并未提高质量,且通常将选择偏向于通用区域。我们进一步表明,文本嵌入的接近性仅解释了34%的基础不一致(r = -0.58),确认不稳定性源于argmax选择机制,而不仅仅是文本级距离。
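Two quantities in the abstract are easy to make concrete. The instability measure counts how many distinct proposals a set of prompts selects under argmax, and the "34%" figure is consistent with reading it as the squared correlation, since (-0.58)^2 is about 0.34. The sketch below uses a made-up similarity matrix; the numbers are illustrative and not from the paper.

```python
import numpy as np

def distinct_selections(similarity: np.ndarray) -> int:
    """Number of distinct proposals chosen by argmax across prompts.

    similarity: (num_prompts, num_proposals) prompt-to-region scores.
    """
    return len(set(np.argmax(similarity, axis=1).tolist()))

# Three near-synonymous prompts scoring three region proposals (toy values).
similarity = np.array([
    [0.31, 0.30, 0.10],
    [0.30, 0.31, 0.10],   # a 0.01 shift is enough to flip the argmax
    [0.31, 0.29, 0.12],
])
n_distinct = distinct_selections(similarity)

r = -0.58
explained_variance = r ** 2   # about 0.34
```

The toy matrix shows why a hard argmax is brittle: the scores barely change between prompts, yet the selected instance does.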
cs.CV / 118 / 2604.17135
OptiMVMap: Offline Vectorized Map Construction via Optimal Multi-vehicle Perspectives
OptiMVMap:通过最优多车辆视角进行离线矢量化地图构建
Abstract
Offline vectorized maps constitute critical infrastructure for high-precision autonomous driving and mapping services. Existing approaches rely predominantly on single ego-vehicle trajectories, which fundamentally suffer from viewpoint insufficiency: while memory-based methods extend observation time by aggregating ego-trajectory frames, they lack the spatial diversity needed to reveal occluded regions. Incorporating views from surrounding vehicles offers complementary perspectives, yet naive fusion introduces three key challenges: computational cost from large candidate pools, redundancy from near-collinear viewpoints, and noise from pose errors and occlusion artifacts. We present OptiMVMap, which reformulates multi-vehicle mapping as a select-then-fuse problem to address these challenges systematically. An Optimal Vehicle Selection (OVS) module strategically identifies a compact subset of helpers that maximally reduce ego-centric uncertainty in occluded regions, addressing computation and redundancy challenges. Cross-Vehicle Attention (CVA) and Semantic-aware Noise Filter (SNF) then perform pose-tolerant alignment and artifact suppression before BEV-level fusion, addressing the noise challenge. This targeted pipeline yields more complete and topologically faithful maps with substantially fewer views than indiscriminate aggregation. On nuScenes and Argoverse2, OptiMVMap improves MapTRv2 by +10.5 mAP and +9.3 mAP, respectively, and surpasses memory-augmented baselines MVMap and HRMapNet by +6.2 mAP and +3.8 mAP on nuScenes. These results demonstrate that uncertainty-guided selection of helper vehicles is essential for efficient and accurate multi-vehicle vectorized mapping. The code is released at https://github.com/DanZeDong/OptiMVMap.
Chinese Translation
离线矢量化地图是高精度自动驾驶和地图服务的重要基础设施。现有方法主要依赖于单一自我车辆轨迹,这在根本上面临视角不足的问题:尽管基于记忆的方法通过聚合自我轨迹帧来延长观察时间,但它们缺乏揭示被遮挡区域所需的空间多样性。引入周围车辆的视角提供了互补的视角,但简单的融合会带来三个主要挑战:来自大候选池的计算成本、近共线视角带来的冗余,以及来自姿态误差和遮挡伪影的噪声。我们提出了OptiMVMap,将多车辆映射重新表述为选择-再融合问题,以系统性地解决这些挑战。一个最优车辆选择(Optimal Vehicle Selection, OVS)模块战略性地识别出一个紧凑的帮助者子集,最大限度地减少被遮挡区域的自我中心不确定性,从而应对计算和冗余挑战。跨车辆注意力(Cross-Vehicle Attention, CVA)和语义感知噪声过滤器(Semantic-aware Noise Filter, SNF)随后在鸟瞰视图(BEV)级别融合之前执行姿态容忍对齐和伪影抑制,以应对噪声挑战。这个有针对性的流程产生了比无差别聚合所需视角显著更少的更完整且拓扑忠实的地图。在nuScenes和Argoverse2数据集上,OptiMVMap分别提升了MapTRv2的mAP指标+10.5和+9.3,并在nuScenes上超越了增强记忆的基线MVMap和HRMapNet,分别提高了+6.2和+3.8的mAP。这些结果表明,基于不确定性指导的帮助车辆选择对于高效和准确的多车辆矢量化映射至关重要。代码已发布在https://github.com/DanZeDong/OptiMVMap。
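The select-then-fuse idea in the OptiMVMap abstract can be illustrated with a small greedy selection sketch. This is only an illustration under stated assumptions, not the paper's OVS module: the abstract does not say how uncertainty is scored, so the per-cell uncertainty map, the per-helper visibility sets, and the function name `select_helpers` are all hypothetical.

```python
def select_helpers(uncertainty, helper_views, budget):
    """Greedily pick up to `budget` helpers that cover the most remaining uncertainty.

    uncertainty:  dict mapping a BEV cell id to the ego vehicle's uncertainty there (assumed input)
    helper_views: dict mapping a helper id to the set of cells that helper observes (assumed input)
    """
    remaining = dict(uncertainty)          # uncertainty not yet covered by a chosen helper
    chosen = []
    for _ in range(budget):
        best_id, best_gain = None, 0.0
        for helper_id, cells in helper_views.items():
            if helper_id in chosen:
                continue
            gain = sum(remaining.get(c, 0.0) for c in cells)
            if gain > best_gain:
                best_id, best_gain = helper_id, gain
        if best_id is None:                # no helper reduces uncertainty any further
            break
        chosen.append(best_id)
        for c in helper_views[best_id]:    # covered cells no longer contribute
            remaining.pop(c, None)
    return chosen


uncertainty = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.7}
helper_views = {
    "v1": {"a", "b"},        # sees the two most uncertain cells
    "v2": {"a", "b", "c"},   # covers v1's cells plus one more, so v1 becomes redundant
    "v3": {"d"},
}
selected = select_helpers(uncertainty, helper_views, budget=3)
print(selected)
```

In this toy case the redundant helper `v1` is never selected even though the budget allows it, which mirrors the abstract's point that a compact subset can replace indiscriminate aggregation.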
cs.CV / 119 / 2604.17147
ScenarioControl: Vision-Language Controllable Vectorized Latent Scenario Generation
ScenarioControl:可控的视觉-语言向量化潜在场景生成
Abstract
We introduce ScenarioControl, the first vision-language control mechanism for learned driving scenario generation. Given a text prompt or an input image, ScenarioControl synthesizes diverse, realistic 3D scenario rollouts, including the map, 3D boxes of reactive actors over time, pedestrians, driving infrastructure, and ego camera observations. The method generates scenes in a vectorized latent space that represents road structure and dynamic agents jointly. To connect multimodal control with sparse vectorized scene elements, we propose a cross-global control mechanism that integrates cross-attention with a lightweight global-context branch, enabling fine-grained control over road layout and traffic conditions while preserving realism. The method produces temporally consistent scenario rollouts from the perspectives of different actors in the scene, supporting long-horizon continuation of driving scenarios. To facilitate training and evaluation, we release a dataset with text annotations aligned to vectorized map structures. Extensive experiments validate that the control adherence and fidelity of ScenarioControl compare favorably with all tested methods across all experiments. Project webpage: https://light.princeton.edu/ScenarioControl
Chinese Translation
我们介绍了ScenarioControl,这是首个用于学习驾驶场景生成的视觉-语言控制机制。给定文本提示或输入图像,ScenarioControl合成多样化、真实的3D场景展开——包括地图、随时间变化的反应性行为者的3D框、行人、驾驶基础设施和自我摄像头观察。该方法在一个向量化潜在空间中生成场景,该空间共同表示道路结构和动态代理。为了将多模态控制与稀疏的向量化场景元素连接起来,我们提出了一种跨全局控制机制,该机制将交叉注意力与轻量级全局上下文分支相结合,使得在保持真实感的同时,对道路布局和交通状况进行细粒度控制。该方法从场景中不同行为者的视角生成时间一致的场景展开,支持驾驶场景的长时间延续。为了便于训练和评估,我们发布了一个与向量化地图结构对齐的文本注释数据集。大量实验验证了ScenarioControl在控制遵循性和保真度方面的表现优于所有测试方法。项目网页:https://light.princeton.edu/ScenarioControl
cs.CV / 120 / 2604.17155
Instant Colorization of Gaussian Splats
高斯斑点的即时上色
Abstract
Gaussian Splatting has recently become one of the most popular frameworks for photorealistic 3D scene reconstruction and rendering. While current rasterizers allow for efficient mappings of 3D Gaussian splats onto 2D camera views, this work focuses on mapping 2D image information (e.g. color, neural features or segmentation masks) efficiently back onto an existing scene of Gaussian splats. This 'opposite' direction enables applications ranging from scene relighting and stylization to 3D semantic segmentation, but also introduces challenges, such as view-dependent colorization and occlusion handling. Our approach tackles these challenges using the normal equation to solve a visibility-weighted least squares problem for every Gaussian and can be implemented efficiently with existing differentiable rasterizers. We demonstrate the effectiveness of our approach on scene relighting, feature enrichment and 3D semantic segmentation tasks, achieving up to an order of magnitude speedup compared to gradient descent-based baselines.
Chinese Translation
高斯斑点技术最近成为了逼真3D场景重建和渲染的最流行框架之一。虽然当前的光栅化器允许将3D高斯斑点高效地映射到2D相机视图上,但本研究的重点是将2D图像信息(例如颜色、神经特征或分割掩膜)高效地映射回现有的高斯斑点场景。这种“反向”方向使得从场景重光照、风格化到3D语义分割等应用成为可能,但也带来了诸如视角依赖的上色和遮挡处理等挑战。我们的方法通过使用正规方程(normal equation)为每个高斯斑点求解一个可见性加权的最小二乘问题来应对这些挑战,并且可以通过现有的可微分光栅化器高效实现。我们在场景重光照、特征增强和3D语义分割任务上展示了我们方法的有效性,与基于梯度下降的基线相比,速度提升可达一个数量级。
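The visibility-weighted least-squares step named in the abstract above has a closed form in its simplest case, sketched here. The sketch assumes one view-independent color per Gaussian and takes the visibility weight to be the Gaussian's blending contribution at each pixel; both are assumptions of this note, and the paper's view-dependent formulation is not reproduced. The function name `solve_gaussian_color` is hypothetical.

```python
def solve_gaussian_color(observations, eps=1e-8):
    """Closed-form color for one Gaussian from weighted pixel observations.

    observations: list of (weight, (r, g, b)) pairs, where `weight` is the
    visibility of the Gaussian at that pixel (assumed here to be its
    alpha-blending contribution; 0 means fully occluded).
    Minimizing sum_k w_k * (c - y_k)^2 gives the normal equation
    (sum_k w_k) * c = sum_k w_k * y_k for each channel.
    """
    weight_sum = 0.0
    weighted_rgb = [0.0, 0.0, 0.0]
    for weight, rgb in observations:
        weight_sum += weight
        for ch in range(3):
            weighted_rgb[ch] += weight * rgb[ch]
    if weight_sum < eps:          # never visible: leave the color undefined
        return None
    return tuple(v / weight_sum for v in weighted_rgb)


obs = [
    (0.75, (1.0, 0.0, 0.0)),   # clearly visible, red pixel
    (0.25, (0.0, 0.0, 1.0)),   # mostly occluded, blue pixel
    (0.0, (0.0, 1.0, 0.0)),    # fully occluded view contributes nothing
]
color = solve_gaussian_color(obs)
print(color)
```

Because the solution is a ratio of two accumulated sums, it needs one pass over the views and no iterative optimization, which is consistent with the speedup over gradient descent that the abstract reports.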
cs.CV / 121 / 2604.17163
PPEDCRF: Dynamic-CRF-Guided Selective Perturbation for Background-Based Location Privacy in Video Sequences
PPEDCRF:基于动态条件随机场的选择性扰动框架,用于视频序列中的背景位置隐私保护
Abstract
We propose PPEDCRF, a calibrated selective perturbation framework that protects \emph{background-based location privacy} in released video frames against gallery-based retrieval attackers. Even after GPS metadata are stripped, an adversary can geolocate a frame by matching its background visual cues to geo-tagged reference imagery; PPEDCRF mitigates this threat by estimating location-sensitive background regions with a dynamic conditional random field (DCRF), rescaling perturbation strength with a normalized control penalty (NCP), and injecting Gaussian noise only inside the inferred regions via a DP-style calibration rule. On a controlled paired-scene retrieval benchmark with eight attacker backbones and three noise seeds, PPEDCRF reduces ResNet18 Top-1 retrieval accuracy from 0.667 to $0.361\pm0.127$ at $\sigma_0=8$ while preserving $36.14\,$dB PSNR -- an ${\approx}6\,$dB quality advantage over global Gaussian noise. Transfer across the eight-backbone seed-averaged benchmark is broadly supportive (23 of 24 backbone-gallery cells show negative $\Delta$), while appendix-scale confirmation identifies MixVPR as a remaining adverse-transfer exception. Matched-operating-point analysis shows that PPEDCRF and global Gaussian noise converge in Top-1 privacy at equal utility, so the practical benefit is spatially concentrated perturbation that preserves higher visual quality at any given noise scale rather than stronger matched-utility privacy. Code: https://github.com/mabo1215/PPEDCRF
Chinese Translation
我们提出了PPEDCRF,这是一种经过校准的选择性扰动框架,旨在保护发布视频帧中的\textit{基于背景的位置隐私},防止基于图库的检索攻击者。即使在剥离GPS元数据后,攻击者仍然可以通过将帧的背景视觉线索与地理标记的参考图像进行匹配来进行地理定位;PPEDCRF通过使用动态条件随机场(DCRF)估计位置敏感的背景区域,利用归一化控制惩罚(NCP)重新调整扰动强度,并仅在推断出的区域内通过DP风格的校准规则注入高斯噪声,从而减轻这一威胁。在一个受控的配对场景检索基准上,使用八个攻击者骨干网络和三个噪声种子,PPEDCRF将ResNet18的Top-1检索准确率从0.667降低到$0.361\pm0.127$,同时保持$36.14\,$dB的峰值信噪比(PSNR),相比于全局高斯噪声具有约$6\,$dB的质量优势。在八个骨干网络的种子平均基准上,转移效果普遍支持(24个骨干图库单元中有23个显示负$\Delta$),而附录规模的确认则识别出MixVPR作为一个剩余的不利转移例外。匹配操作点分析表明,PPEDCRF和全局高斯噪声在相等效用下的Top-1隐私表现趋于一致,因此其实际优势在于空间集中扰动,能够在任何给定噪声规模下保持更高的视觉质量,而不是更强的匹配效用隐私。代码:https://github.com/mabo1215/PPEDCRF
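The quality argument in the PPEDCRF abstract (noise confined to inferred regions preserves PSNR better than global noise at the same scale) can be checked on a toy image. This is a minimal sketch under assumptions: the mask is hard-coded rather than inferred by a DCRF, the NCP rescaling and DP-style calibration are not modeled, and the 25% mask fraction is a hypothetical choice, not a figure from the paper.

```python
import math
import random


def perturb(pixels, mask, sigma, rng):
    """Add zero-mean Gaussian noise only where mask is 1, then clip to [0, 255]."""
    out = []
    for value, m in zip(pixels, mask):
        noisy = value + rng.gauss(0.0, sigma) if m else value
        out.append(min(255.0, max(0.0, noisy)))
    return out


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


rng = random.Random(0)
pixels = [128.0] * 10_000                                      # flat gray stand-in for a frame
region = [1 if i < 2_500 else 0 for i in range(len(pixels))]   # hypothetical 25% sensitive area
everywhere = [1] * len(pixels)

selective = perturb(pixels, region, sigma=8.0, rng=rng)
global_noise = perturb(pixels, everywhere, sigma=8.0, rng=rng)

print(f"selective PSNR: {psnr(pixels, selective):.2f} dB")
print(f"global PSNR:    {psnr(pixels, global_noise):.2f} dB")
```

In this toy setting the gap is roughly 10·log10(1/0.25) dB because the mean squared error scales with the perturbed fraction; that number reflects the chosen mask fraction and says nothing about the masks the paper actually infers.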
cs.CV / 122 / 2604.17190
LookasideVLN: Direction-Aware Aerial Vision-and-Language Navigation
LookasideVLN:方向感知的空中视觉与语言导航
Abstract
Aerial Vision-and-Language Navigation (Aerial VLN) enables unmanned aerial vehicles (UAVs) to follow natural language instructions and navigate complex urban environments. While recent methods have made progress through large-scale memory graphs and lookahead path planning, they remain limited by shallow instruction understanding and high computational cost. In particular, existing methods rely primarily on landmark descriptions and overlook directional cues, a key source of spatial context in human navigation. In this work, we propose LookasideVLN, a new paradigm that exploits directional cues in natural language to achieve both more accurate spatial reasoning and greater computational efficiency. LookasideVLN comprises three core components: (1) an Egocentric Lookaside Graph (ELG) that dynamically encodes instruction-relevant landmarks and their directional relationships, (2) a Spatial Landmark Knowledge Base (SLKB) that provides lightweight memory retrieval from prior navigation experiences, and (3) a Lookaside MLLM Navigation Agent that aligns multimodal information from user instructions, visual observations, and landmark-direction information from ELG for path planning. Extensive experiments show that LookasideVLN significantly outperforms the state-of-the-art CityNavAgent, even with a single-level lookahead, demonstrating that leveraging directional cues is a powerful yet efficient strategy for Aerial VLN.
Chinese Translation
空中视觉与语言导航(Aerial VLN)使无人机(UAV)能够遵循自然语言指令并在复杂的城市环境中导航。尽管近期的方法通过大规模记忆图和前瞻路径规划取得了一定的进展,但仍然受到浅层指令理解和高计算成本的限制。特别是,现有方法主要依赖于地标描述,忽视了方向线索,而方向线索是人类导航中空间上下文的关键来源。在本研究中,我们提出了LookasideVLN,这是一种新范式,利用自然语言中的方向线索,以实现更准确的空间推理和更高的计算效率。LookasideVLN包括三个核心组件:(1)一个自我中心的旁视图(Egocentric Lookaside Graph, ELG),动态编码与指令相关的地标及其方向关系;(2)一个空间地标知识库(Spatial Landmark Knowledge Base, SLKB),提供来自先前导航经验的轻量级记忆检索;(3)一个旁视多模态语言模型导航代理(Lookaside MLLM Navigation Agent),将用户指令、视觉观察和来自ELG的地标方向信息的多模态信息对齐,以进行路径规划。大量实验表明,LookasideVLN显著优于最先进的CityNavAgent,即使在单级前瞻的情况下,证明了利用方向线索是一种强大而高效的空中视觉与语言导航策略。
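The idea of encoding landmarks together with their direction relative to the agent, as the ELG in the abstract above does, can be sketched with plain geometry. This is a hypothetical illustration: the eight-sector vocabulary, the 2D coordinates, and the function names are assumptions of this note, and the paper's graph construction is not described in the abstract.

```python
import math

# Hypothetical direction vocabulary, ordered counter-clockwise starting at "front".
DIRECTIONS = ["front", "front-left", "left", "back-left",
              "back", "back-right", "right", "front-right"]


def relative_direction(agent_xy, heading_deg, landmark_xy):
    """Name the direction of a landmark as seen from the agent.

    heading_deg is the agent's heading, measured counter-clockwise from the +x axis.
    """
    dx = landmark_xy[0] - agent_xy[0]
    dy = landmark_xy[1] - agent_xy[1]
    bearing = math.degrees(math.atan2(dy, dx)) - heading_deg   # angle relative to heading
    bearing %= 360.0                                           # wrap into [0, 360)
    sector = int((bearing + 22.5) // 45.0) % 8                 # eight 45-degree sectors
    return DIRECTIONS[sector]


def build_egocentric_graph(agent_xy, heading_deg, landmarks):
    """Return edges (agent -> landmark) labelled with direction and distance."""
    edges = {}
    for name, xy in landmarks.items():
        edges[name] = {
            "direction": relative_direction(agent_xy, heading_deg, xy),
            "distance": math.dist(agent_xy, xy),
        }
    return edges


landmarks = {"tower": (0.0, 10.0), "bridge": (10.0, 0.0), "park": (-5.0, 0.0)}
graph = build_egocentric_graph((0.0, 0.0), heading_deg=90.0, landmarks=landmarks)
for name, edge in graph.items():
    print(name, edge)
```

With direction labels attached to each landmark, an instruction such as "turn toward the bridge on your right" can be matched against the graph without planning several steps ahead, which is one plausible reading of why directional cues reduce computation.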
cs.CV / 123 / 2604.17195
DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior
DreamShot:基于视频扩散先验的个性化故事板合成
Abstract
Storyboard synthesis plays a crucial role in visual storytelling, aiming to generate coherent shot sequences that visually narrate cinematic events with consistent characters, scenes, and transitions. However, existing approaches are mostly adapted from text-to-image diffusion models, which struggle to maintain long-range temporal coherence, consistent character identities, and narrative flow across multiple shots. In this paper, we introduce DreamShot, a storyboard framework built on video generative models that fully exploits powerful video diffusion priors for controllable multi-shot synthesis. DreamShot supports both Text-to-Shot and Reference-to-Shot generation, as well as story continuation conditioned on previous frames, enabling flexible and context-aware storyboard generation. By leveraging the spatial-temporal consistency inherent in video generative models, DreamShot produces visually and semantically coherent sequences with improved narrative fidelity and character continuity. Furthermore, DreamShot incorporates a multi-reference role conditioning module that accepts multiple character reference images and enforces identity alignment via a Role-Attention Consistency Loss, explicitly constraining attention between reference and generated roles. Extensive experiments demonstrate that DreamShot achieves superior scene coherence, role consistency, and generation efficiency compared to state-of-the-art text-to-image storyboard models, establishing a new direction toward controllable video model-driven visual storytelling.
Chinese Translation
故事板合成在视觉叙事中起着至关重要的作用,旨在生成连贯的镜头序列,以一致的人物、场景和过渡视觉叙述电影事件。然而,现有方法大多改编自文本到图像的扩散模型,这些模型在保持长时间的时间一致性、一致的人物身份和多个镜头之间的叙事流畅性方面存在困难。本文介绍了DreamShot,一种基于视频生成模型的故事板框架,充分利用强大的视频扩散先验进行可控的多镜头合成。DreamShot支持文本到镜头(Text-to-Shot)和参考到镜头(Reference-to-Shot)生成,以及基于先前帧的故事延续,能够实现灵活且具有上下文意识的故事板生成。通过利用视频生成模型固有的时空一致性,DreamShot生成视觉和语义上连贯的序列,提升了叙事的真实感和人物的连续性。此外,DreamShot还结合了一个多参考角色条件模块,接受多个角色参考图像,并通过角色注意一致性损失(Role-Attention Consistency Loss)强制身份对齐,明确约束参考角色与生成角色之间的注意力。大量实验表明,DreamShot在场景一致性、角色一致性和生成效率方面优于最先进的文本到图像故事板模型,为可控的视频模型驱动视觉叙事开辟了新的方向。
cs.CV / 124 / 2604.17206
SciDraw-6K: A Multilingual Scientific Illustration Dataset Generated by Google Gemini
SciDraw-6K:由 Google Gemini 生成的多语言科学插图数据集
Abstract
We present SciDraw-6K, a curated dataset of 6,291 scientific illustrations synthesized by Google Gemini image-generation models, each paired with prompts in eleven languages (English, Simplified Chinese, Traditional Chinese, Japanese, Korean, German, French, Spanish, Brazilian Portuguese, Italian, and Russian). Images span eight broad scientific categories -- biomedical, chemistry, materials, electronics, environment, AI systems, physics, and a long "other" tail -- and are produced primarily by the gemini-2.5-flash-image and gemini-3-pro-image-preview model families. In contrast to general-purpose text-to-image corpora that dominate the literature, SciDraw-6K is purpose-built for the scientific illustration genre: schematic diagrams, mechanism figures, table-of-contents graphics, and conceptual posters. We describe the construction pipeline, report dataset statistics, and document its use as the substrate of sci-draw.com, a public scientific drawing service. The dataset is released to support multilingual text-to-image research, domain-adapted diffusion fine-tuning, and prompt-engineering studies for scientific visualization. Dataset: https://huggingface.co/datasets/SciDrawAI/SciDraw-6K Code: https://github.com/SciDrawAI/scidraw-6k
Chinese Translation
我们提出了 SciDraw-6K,这是一个经过精心策划的数据集,包含 6,291 幅由 Google Gemini 图像生成模型合成的科学插图,每幅插图均配有十一种语言的提示(英语、简体中文、繁体中文、日语、韩语、德语、法语、西班牙语、巴西葡萄牙语、意大利语和俄语)。这些图像涵盖八个广泛的科学类别——生物医学、化学、材料、电子学、环境、人工智能系统、物理学,以及一个较长的“其他”类别,主要由 gemini-2.5-flash-image 和 gemini-3-pro-image-preview 模型系列生成。与文献中占主导地位的通用文本到图像语料库相比,SciDraw-6K 专为科学插图类型而构建:包括示意图、机制图、目录图形和概念海报。我们描述了构建流程,报告了数据集统计信息,并记录了其作为公共科学绘图服务 sci-draw.com 的基础。该数据集的发布旨在支持多语言文本到图像研究、领域适应的扩散微调,以及科学可视化的提示工程研究。数据集链接:https://huggingface.co/datasets/SciDrawAI/SciDraw-6K 代码链接:https://github.com/SciDrawAI/scidraw-6k
cs.CV / 125 / 2604.17208
CDSA-Net: Collaborative Decoupling of Vascular Structure and Background for High-Fidelity Coronary Digital Subtraction Angiography
CDSA-Net:高保真冠状动脉数字减影血管造影的血管结构与背景的协同解耦
Abstract
Digital subtraction angiography (DSA) in coronary imaging is fundamentally challenged by physiological motion, forcing reliance on raw angiograms cluttered with anatomical noise. Existing deep learning methods often produce images with two clinically unacceptable flaws: persistent boundary artifacts and a loss of native tissue grayscale fidelity, both of which undermine diagnostic confidence. We propose CDSA-Net, a novel framework that, for the first time, explicitly decouples and jointly optimizes vascular structure preservation and realistic background restoration. CDSA-Net introduces two core innovations: (i) a hierarchical geometric prior guidance (HGPG) mechanism, embedded in our coronary structure extraction network (CSENet), which combines an integrated geometric prior (IGP) with gated spatial modulation (GSM) and centerline-aware topology (CAT) loss supervision to ensure structural continuity; and (ii) an adaptive noise module (ANM) within our coronary background restoration network (CBResNet). Unlike standard restoration, ANM models the stochastic nature of clinical X-ray noise, bridging the domain gap to enable seamless background intensity estimation and the complete elimination of boundary artifacts. The final subtraction is obtained by removing the restored background from the raw angiogram. Quantitatively, CDSA-Net significantly outperforms state-of-the-art methods in vascular intensity correlation and perceptual quality. A 25.6% improvement in morphology assessment efficiency and a 42.9% gain in hemodynamic evaluation speed set a new benchmark for utility in interventional cardiology, while maintaining diagnostic results consistent with raw angiograms. The project code is available at https://github.com/DrThink-ai/CDSA-Net.
Chinese Translation
冠状动脉成像中的数字减影血管造影(DSA)在本质上受到生理运动的挑战,迫使我们依赖于充满解剖噪声的原始血管造影图像。现有的深度学习方法通常产生两种关键的临床不可接受的缺陷:持续的边界伪影和原生组织灰度保真度的丧失,从而削弱了诊断信心。我们提出了一种新颖的框架,称为CDSA-Net,它首次明确解耦并联合优化血管结构的保留和真实背景的恢复。CDSA-Net引入了两个核心创新:(i)一种分层几何先验引导机制(HGPG),嵌入在我们的冠状结构提取网络(CSENet)中。它协同结合了集成几何先验(IGP)、门控空间调制(GSM)和中心线感知拓扑(CAT)损失监督,确保结构的连续性。(ii)在我们的冠状背景恢复网络(CBResNet)中引入了一种自适应噪声模块(ANM)。与标准恢复不同,ANM独特地建模了临床X射线噪声的随机特性,弥合了领域差距,实现无缝的背景强度估计和完全消除边界伪影。最终的减影是通过从原始血管造影中去除恢复的背景获得的。在定量上,它在血管强度相关性和感知质量方面显著优于最先进的方法。在形态评估效率上提高了25.6%,在血流动力学评估速度上提升了42.9%,为介入心脏病学的应用设定了新的基准,同时保持与原始血管造影一致的诊断结果。项目代码可在 https://github.com/DrThink-ai/CDSA-Net 获取。
cs.CV / 126 / 2604.17209
DREAM: Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion for Expert Precision Medical Report Generation
DREAM:基于自适应多模态融合的动态视网膜增强技术用于专家级医学报告生成
Abstract
Automating medical reports for retinal images requires a sophisticated blend of visual pattern recognition and deep clinical knowledge. Current Large Vision-Language Models (LVLMs) often struggle in specialized medical fields where data is scarce, leading to models that overfit and miss subtle but critical pathologies. To address this, we introduce DREAM (Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion), a novel framework for high-fidelity medical report generation that excels even with limited data. DREAM employs a unique two-stage fusion mechanism that intelligently integrates visual data with clinical keywords curated by ophthalmologists. First, the Abstractor module maps image and keyword features into a shared space, enhancing visual data with pathology-relevant insights. Next, the Adaptor performs adaptive multi-modal fusion, dynamically weighting the importance of each modality using learnable parameters to create a unified representation. To ensure the model's outputs are semantically grounded in clinical reality, a Contrastive Alignment module aligns these fused representations with ground-truth medical reports during training. By combining medical expertise with an efficient fusion strategy, DREAM sets a new state-of-the-art on the DeepEyeNet benchmark, achieving a BLEU-4 score of 0.241, and further demonstrates strong generalization to the ROCO dataset.
Chinese Translation
自动生成视网膜图像的医学报告需要视觉模式识别与深厚临床知识的复杂结合。目前的大型视觉-语言模型(LVLMs)在数据稀缺的专业医学领域往往表现不佳,导致模型过拟合并错过细微但关键的病理特征。为了解决这一问题,我们提出了DREAM(基于自适应多模态融合的动态视网膜增强),这是一个新颖的高保真医学报告生成框架,即使在数据有限的情况下也能表现出色。DREAM采用独特的两阶段融合机制,智能地将视觉数据与眼科医生策划的临床关键词进行整合。首先,Abstractor模块将图像和关键词特征映射到共享空间,通过病理相关的见解增强视觉数据。接下来,Adaptor模块执行自适应多模态融合,使用可学习参数动态加权每种模态的重要性,以创建统一的表示。为了确保模型输出在临床现实中具有语义基础,Contrastive Alignment模块在训练期间将这些融合表示与真实的医学报告对齐。通过将医学专业知识与高效的融合策略相结合,DREAM在DeepEyeNet基准测试中设定了新的最先进水平,获得了BLEU-4分数0.241,并进一步展示了对ROCO数据集的强泛化能力。
cs.CV / 127 / 2604.17211
EmbodiedHead: Real-Time Listening and Speaking Avatar for Conversational Agents
EmbodiedHead:用于对话代理的实时听说虚拟形象
Abstract
We present EmbodiedHead, a speech-driven talking-head framework that equips LLMs with real-time visual avatars for conversation. A practical embodied avatar must achieve real-time generation, unified listening-speaking behavior, and high rendered visual quality simultaneously. Our framework couples the first Rectified-Flow Diffusion Transformer (DiT) for this task with a differentiable renderer, enabling diverse, high-fidelity generation in as few as four sampling steps. Prior listening-speaking methods rely on dual-stream audio, introducing an interlocutor look-ahead dependency incompatible with causal user--LLM interaction. We instead adopt a single-stream interface with explicit per-frame listening-speaking state conditioning and a Streaming Audio Scheduler, suppressing spurious mouth motion during listening while enabling seamless turn-taking. A two-stage training scheme of coefficient-space pretraining and joint image-domain refinement further closes the gap between motion-level supervision and rendered quality. Extensive experiments demonstrate state-of-the-art visual quality and motion fidelity in both speaking and listening scenarios.
Chinese Translation
我们提出了EmbodiedHead,一个基于语音驱动的对话头框架,为大型语言模型(LLMs)提供实时视觉虚拟形象。一个实用的具身虚拟形象必须同时实现实时生成、统一的听说行为和高渲染视觉质量。我们的框架将首个修正流扩散变换器(Rectified-Flow Diffusion Transformer,DiT)与可微渲染器相结合,使得在仅需四个采样步骤的情况下实现多样化的高保真生成。以往的听说方法依赖于双流音频,导致对话者的前瞻依赖性,这与因果用户-LLM交互不兼容。相反,我们采用了单流接口,并明确对每帧的听说状态进行条件化,同时引入流式音频调度器,在听的过程中抑制虚假的嘴部运动,同时实现无缝的轮流对话。通过系数空间预训练和联合图像域精炼的两阶段训练方案,进一步缩小了运动级别监督与渲染质量之间的差距。大量实验表明,在说话和听取场景中,具有最先进的视觉质量和运动保真度。
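The claim in the EmbodiedHead abstract that a rectified-flow model can generate in as few as four sampling steps rests on the fact that rectified flow encourages near-straight transport paths, along which a coarse Euler integration is already accurate. The sketch below shows only that integration loop. The velocity field is a toy stand-in for a trained network, and the names `euler_sample` and `make_straight_velocity` are hypothetical; no part of the paper's DiT or its conditioning is modeled.

```python
def euler_sample(velocity, x0, num_steps=4):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_straight_velocity(target):
    """Toy stand-in for a trained network: a field whose paths are straight lines to `target`."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


noise = [0.5, -1.0, 2.0]          # starting sample (would be Gaussian noise in practice)
target = [1.0, 0.0, -1.0]         # stand-in for a vector of motion coefficients
result = euler_sample(make_straight_velocity(target), noise, num_steps=4)
print(result)
```

When the path is exactly straight, the velocity is constant along it and Euler integration lands on the target regardless of step count; a learned field is only approximately straight, so in practice the step count trades speed against fidelity.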
cs.CV / 128 / 2604.17217
Cross-Modal Attention Analysis and Optimization in Vision-Language Models: A Study on Visual Reliability
视觉-语言模型中的跨模态注意力分析与优化:关于视觉可靠性的研究
Abstract
Vision-Language Models (VLMs) achieve strong cross-modal performance, yet recent evidence suggests they over-rely on textual descriptions while under-utilizing visual evidence, a phenomenon termed "text shortcut learning." We propose an adversarial evaluation framework that quantifies this cross-modal dependency by measuring accuracy degradation (Drop) when semantically conflicting text is paired with unchanged images. Four adversarial strategies (shape_swap, color_swap, position_swap, and random_text) are applied to a controlled geometric-shapes dataset ($n = 1{,}000$). We compare three configurations: Baseline CLIP (ViT-B/32), LoRA fine-tuning, and LoRA Optimized (integrating Hard Negative Mining, Label Smoothing, layer-wise learning rates, Cosine Restarts, curriculum learning, and data augmentation). The optimized model reduces average Drop from 27.5% to 9.8% (64.4% relative improvement, $p < 0.001$) while maintaining 97% normal accuracy. Attention visualization and embedding-space analysis confirm that the optimized model attends more to visual features and achieves tighter cross-modal alignment.
Chinese Translation
视觉-语言模型(VLMs)在跨模态性能上表现出色,但最近的证据表明,它们过度依赖文本描述,而未充分利用视觉证据,这一现象被称为“文本捷径学习”。我们提出了一种对抗性评估框架,通过测量在语义冲突的文本与不变图像配对时的准确性下降(Drop)来量化这种跨模态依赖性。我们对一个受控的几何形状数据集($n = 1{,}000$)应用了四种对抗策略:shape_swap、color_swap、position_swap和random_text。我们比较了三种配置:基线 CLIP(ViT-B/32)、LoRA 微调和 LoRA 优化(整合了困难负样本挖掘、标签平滑、层级学习率、余弦重启、课程学习和数据增强)。优化后的模型将平均 Drop 从 27.5% 降低至 9.8%(相对改善 64.4%,$p < 0.001$),同时保持 97% 的正常准确率。注意力可视化和嵌入空间分析确认,优化后的模型对视觉特征的关注度更高,并实现了更紧密的跨模态对齐。
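The headline numbers in the abstract above are easy to verify by hand. The sketch below assumes that Drop is an absolute difference in percentage points between clean and adversarial accuracy (the abstract says only "accuracy degradation"), and it recomputes the reported relative improvement from the two average Drop values; the function names are hypothetical.

```python
def drop(clean_accuracy, adversarial_accuracy):
    """Accuracy degradation in percentage points when conflicting text is added (assumed definition)."""
    return clean_accuracy - adversarial_accuracy


def relative_improvement(baseline_drop, optimized_drop):
    """Relative reduction of Drop, as a percentage of the baseline Drop."""
    return 100.0 * (baseline_drop - optimized_drop) / baseline_drop


# Average Drop values reported in the abstract (percentage points).
baseline_avg_drop = 27.5
optimized_avg_drop = 9.8
improvement = relative_improvement(baseline_avg_drop, optimized_avg_drop)
print(f"{improvement:.1f}%")   # prints "64.4%"
```

The result, (27.5 - 9.8) / 27.5 ≈ 0.644, matches the 64.4% relative improvement stated in the abstract.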
cs.CV / 129 / 2604.17222
Region-Affinity Attention for Whole-Slide Breast Cancer Classification in Deep Ultraviolet Imaging
基于区域亲和力注意力的深紫外成像全幻灯片乳腺癌分类
Abstract
Breast cancer diagnosis demands rapid and precise tools, yet traditional histopathological methods often fall short in intra-operative settings. Deep Ultraviolet (DUV) fluorescence imaging emerges as a transformative approach, offering high-contrast, label-free visualization of whole-slide images (WSIs) with unprecedented detail, surpassing conventional hematoxylin and eosin (H&E) staining in speed and resolution. However, existing deep learning methods for breast cancer classification, predominantly patch-based, fragment spatial context and incur significant preprocessing overhead, limiting their clinical utility. Moreover, standard attention mechanisms, such as Spatial, Squeeze-and-Excitation, Global Context and Guided Context Gating, fail to fully exploit the rich, multi-scale regional relationships inherent in DUV-WSI data, often prioritizing generic feature recalibration over diagnostic specificity. This study introduces a novel Region-Affinity Attention mechanism tailored for DUV-WSI breast cancer classification, processing entire slides without patching to preserve spatial integrity. By modeling local neighbor distances and constructing a full affinity matrix, our method dynamically highlights diagnostically relevant regions, augmented by a contrastive loss to enhance feature discriminability. Evaluated on a dataset of 136 DUV-WSI samples, our approach achieves an accuracy of 92.67 +/- 0.73% and an AUC of 95.97%, outperforming existing attention methods.
Chinese Translation
乳腺癌的诊断需要快速且精确的工具,但传统的组织病理学方法在手术过程中往往难以满足需求。深紫外(DUV)荧光成像作为一种变革性的方法,提供了高对比度、无标签的全幻灯片图像(WSIs)可视化,具有前所未有的细节,速度和分辨率均超过传统的苏木精-伊红(H&E)染色。然而,现有的乳腺癌分类深度学习方法主要基于图块,破坏了空间上下文,并且需要大量的预处理,限制了其临床实用性。此外,标准的注意力机制,如空间注意力、挤压与激励(Squeeze-and-Excitation)、全局上下文和引导上下文门控,未能充分利用DUV-WSI数据中固有的丰富多尺度区域关系,往往优先考虑通用特征的重新校准而非诊断特异性。本研究提出了一种新颖的区域亲和力注意力机制,专为DUV-WSI乳腺癌分类而设计,处理整个幻灯片而不进行图块切割,以保持空间完整性。通过建模局部邻域距离并构建全亲和力矩阵,我们的方法动态突出诊断相关区域,并通过对比损失增强特征的可区分性。在136个DUV-WSI样本的数据集上进行评估,我们的方法实现了92.67 +/- 0.73%的准确率和95.97%的AUC,优于现有的注意力方法。
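The abstract above describes building a full affinity matrix over slide regions and using it to highlight relevant ones. A generic version of that pattern is sketched below, assuming cosine similarity as the affinity and a row-wise softmax as the weighting. Both choices, the temperature value, and the function names are assumptions of this note; the paper's local neighbor distances and contrastive loss are not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def affinity_matrix(features):
    """Full N x N matrix of pairwise cosine similarities between region features."""
    return [[cosine(fi, fj) for fj in features] for fi in features]


def softmax(row, temperature=0.1):
    """Row-wise softmax; the temperature is a hypothetical hyperparameter."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)                               # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def region_affinity_attention(features):
    """Re-express each region as an affinity-weighted mix of all regions."""
    weights = [softmax(row) for row in affinity_matrix(features)]
    dim = len(features[0])
    attended = [
        [sum(w * f[d] for w, f in zip(w_row, features)) for d in range(dim)]
        for w_row in weights
    ]
    return weights, attended


# Three toy regions: two similar ones and one outlier.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights, attended = region_affinity_attention(features)
for row in weights:
    print([round(w, 3) for w in row])
```

Because every region is compared with every other region, the matrix grows quadratically with the number of regions, which is the cost of keeping whole-slide context instead of processing isolated patches.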
cs.CV / 130 / 2604.17231
Fringe Projection Based Vision Pipeline for Autonomous Hard Drive Disassembly
基于条纹投影的自主硬盘拆解视觉管道
Abstract
Unrecovered e-waste represents a significant economic loss. Hard disk drives (HDDs) comprise a valuable e-waste stream necessitating robotic disassembly. Automating the disassembly of HDDs requires holistic 3D sensing, scene understanding, and fastener localization; however, current methods are fragmented and lack both robust 3D sensing and fastener localization. We propose an autonomous vision pipeline that performs 3D sensing using a Fringe Projection Profilometry (FPP) module, with selective triggering of a depth completion module where FPP fails, and integrates this module with a lightweight, real-time instance segmentation network for scene understanding and critical component localization. By utilizing the same FPP camera-projector system for both our depth sensing and component localization modules, our depth maps and derived 3D geometry are inherently pixel-wise aligned with the segmentation masks without registration, providing an advantage over RGB-D perception systems common in industrial sensing. We optimize both our trained depth completion and instance segmentation networks for deployment-oriented inference. The proposed system achieves a box mAP@50 of 0.960 and mask mAP@50 of 0.957 for instance segmentation, while the selected depth completion configuration with the Depth Anything V2 Base backbone achieves an RMSE of 2.317 mm and MAE of 1.836 mm; the Platter Facing learned inference stack achieves a combined latency of 12.86 ms and a throughput of 77.7 Frames Per Second (FPS) on the evaluation workstation. Finally, we adopt a sim-to-real transfer learning approach to augment our physical dataset. The proposed perception pipeline provides both high-fidelity semantic and spatial data, which can be valuable for downstream robotic disassembly. The synthetic dataset developed for HDD instance segmentation will be made publicly available.
Chinese Translation
未回收的电子废物代表了显著的经济损失。硬盘驱动器(HDD)构成了一个有价值的电子废物流,亟需机器人拆解。自动化HDD的拆解需要全面的3D感知、场景理解和紧固件定位,然而当前的方法存在碎片化、缺乏稳健的3D感知和紧固件定位的问题。我们提出了一种自主视觉管道,该管道使用条纹投影轮廓测量(Fringe Projection Profilometry, FPP)模块进行3D感知,并在FPP失败时选择性触发深度补全模块,同时将该模块与轻量级、实时实例分割网络集成,以实现场景理解和关键组件定位。通过利用相同的FPP相机-投影仪系统用于我们的深度感知和组件定位模块,我们的深度图和派生的3D几何形状在像素级上与分割掩膜自然对齐,无需注册,这为工业感知中常见的RGB-D感知系统提供了优势。我们优化了训练好的深度补全和实例分割网络,以适应部署导向的推理。所提系统在实例分割中实现了0.960的箱体mAP@50和0.957的掩膜mAP@50,而选择的深度补全配置(使用Depth Anything V2 Base主干网络)实现了2.317 mm的均方根误差(RMSE)和1.836 mm的平均绝对误差(MAE);在评估工作站上,盘面朝向学习推理堆栈的综合延迟为12.86 ms,吞吐量为77.7帧每秒(FPS)。最后,我们采用了仿真到现实的迁移学习方法来增强我们的物理数据集。所提感知管道提供了高保真度的语义和空间数据,这对后续的机器人拆解具有重要价值。为HDD实例分割开发的合成数据集将公开发布。
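The abstract above does not say which FPP algorithm is used or how FPP failure is detected, so the sketch below substitutes a standard technique: N-step phase shifting for the wrapped phase, with fringe modulation as a per-pixel validity signal. The thresholds and the function names `wrapped_phase` and `needs_depth_completion` are hypothetical, and phase unwrapping and calibration to depth are left out.

```python
import math


def wrapped_phase(intensities):
    """Recover wrapped phase and fringe modulation at one pixel from N shifted fringe images.

    Assumes I_n = A + B * cos(phi + 2*pi*n/N) with N >= 3 equally spaced shifts.
    """
    n_steps = len(intensities)
    s = sum(i * math.sin(2 * math.pi * n / n_steps) for n, i in enumerate(intensities))
    c = sum(i * math.cos(2 * math.pi * n / n_steps) for n, i in enumerate(intensities))
    phase = math.atan2(-s, c)                        # wrapped to (-pi, pi]
    modulation = 2.0 * math.hypot(s, c) / n_steps    # equals B for an ideal signal
    return phase, modulation


def needs_depth_completion(modulations, min_modulation=5.0, max_invalid_fraction=0.1):
    """Trigger the fallback when too many pixels have weak fringe contrast (hypothetical rule)."""
    invalid = sum(1 for m in modulations if m < min_modulation)
    return invalid / len(modulations) > max_invalid_fraction


# Synthesize a four-step measurement for a known phase and check the recovery.
true_phase, offset, amplitude = 1.0, 120.0, 50.0
shots = [offset + amplitude * math.cos(true_phase + 2 * math.pi * n / 4) for n in range(4)]
phase, modulation = wrapped_phase(shots)
print(round(phase, 6), round(modulation, 6))
```

Low modulation typically occurs on specular or shadowed surfaces where the projected fringes are not observed cleanly, which is one plausible reason a learned depth completion fallback would be triggered selectively rather than on every frame.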
cs.CV / 131 / 2604.17233
Enhancing Zero-shot Personalized Image Aesthetics Assessment with Profile-aware Multimodal LLM
通过关注用户画像的多模态大语言模型增强零样本个性化图像美学评估
Abstract
Personalized image aesthetics assessment (PIAA) aims to predict an individual user's subjective rating of an image, which requires modeling user-specific aesthetic preferences. Existing methods rely on historical user ratings for this modeling and therefore struggle when such data are unavailable. We address this zero-shot setting by using user profiles as contextual signals for personalization and adopting a profile-based personalization paradigm. We introduce P-MLLM, a profile-aware multimodal LLM that augments a frozen LLM with selective fusion modules for controlled visual integration. These modules selectively integrate visual information into the model's evolving hidden states during profile-conditioned reasoning, allowing visual information to be incorporated in a profile-aware manner. Experiments on recent PIAA benchmarks show that P-MLLM achieves competitive zero-shot performance and remains effective even with coarse profile information, highlighting the potential of profile-based personalization for zero-shot PIAA.
Chinese Translation
个性化图像美学评估(PIAA)旨在预测个别用户对图像的主观评分,这需要对用户特定的美学偏好进行建模。现有方法依赖于历史用户评分进行建模,因此在缺乏此类数据时面临困难。我们通过使用用户画像作为个性化的上下文信号来解决这一零样本设置,并采用基于画像的个性化范式。我们引入了P-MLLM,一种关注用户画像的多模态大语言模型,它通过选择性融合模块增强了一个冻结的大语言模型,以实现受控的视觉集成。这些模块在基于画像的推理过程中,选择性地将视觉信息整合到模型不断演变的隐藏状态中,从而以关注用户画像的方式融入视觉信息。在最近的PIAA基准测试中的实验表明,P-MLLM实现了具有竞争力的零样本性能,即使在粗略的用户画像信息下也依然有效,突显了基于画像的个性化在零样本PIAA中的潜力。
cs.CV / 132 / 2604.17243
RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation
RemoteShield:为地球观测启用稳健的多模态大语言模型
Abstract
A robust Multimodal Large Language Model (MLLM) for Earth Observation should maintain consistent interpretation and reasoning under realistic input variations. However, current Remote Sensing MLLMs fail to meet this requirement. Trained on carefully curated clean datasets, they learn brittle mappings that do not generalize to noisy conditions in operational Earth Observation. Consequently, their performance degrades when confronted with imperfect inputs in deployment. To quantify this vulnerability, we construct a realistic set of multimodal perturbations, including visual degradations such as cloud and fog cover, together with diverse human-centric textual variations ranging from colloquialisms to vague or omitted instructions. Empirical evaluations show that these perturbations significantly impair the visual-semantic reasoning capabilities of leading RS foundation models. To address this limitation, we introduce RemoteShield, a robust Remote Sensing MLLM trained to maintain consistent outputs across realistic input variations. During training, each clean sample is paired with its image-text perturbed variants to form a semantic equivalence cluster. Rather than directly fitting noisy samples, RemoteShield is optimized through preference learning over clean and perturbed conditions within the same cluster. By comparing model responses to clean and corrupted inputs, the model is encouraged to favor stable responses over perturbation-induced failures. This cross-condition alignment helps the model focus on underlying task semantics despite visual degradations and textual noise. Experiments on three Earth Observation tasks show that RemoteShield consistently delivers stronger robustness and cross-condition consistency than representative baselines under realistic multimodal perturbations.
Chinese Translation
一个稳健的多模态大语言模型(MLLM)用于地球观测,应在现实输入变化下保持一致的解释和推理。然而,目前的遥感MLLM未能满足这一要求。它们在经过精心策划的干净数据集上训练,学习到的脆弱映射无法推广到操作性地球观测中的噪声条件。因此,当面对部署中的不完美输入时,它们的性能会下降。为了量化这种脆弱性,我们构建了一组现实的多模态扰动,包括云层和雾霾等视觉退化,以及从口语化到模糊或省略指令的多样化以人为中心的文本变化。实证评估表明,这些扰动显著削弱了领先的遥感基础模型的视觉-语义推理能力。为了解决这一局限性,我们引入了RemoteShield,一个稳健的遥感MLLM,旨在保持在现实输入变化下的一致输出。在训练过程中,每个干净样本与其图像-文本扰动变体配对,形成语义等价集群。RemoteShield并不是直接拟合噪声样本,而是通过在同一集群内对干净和扰动条件进行偏好学习进行优化。通过比较模型对干净和损坏输入的响应,模型被鼓励优先选择稳定的响应,而不是因扰动引起的失败。这种跨条件对齐帮助模型在视觉退化和文本噪声的情况下专注于潜在的任务语义。在三个地球观测任务上的实验表明,RemoteShield在现实多模态扰动下,始终展现出比代表性基线更强的稳健性和跨条件一致性。
cs.CV / 133 / 2604.17268
Fractal Characterization of Low-Correlation Signals in AI-Generated Image Detection
低相关信号的分形特征在AI生成图像检测中的应用
Abstract
AI-generated imagery has reached near-photorealistic fidelity, yet this technology poses significant threats to information security and societal trust. Existing deepfake detection methods often exhibit limited robustness in open-world scenarios. To address this limitation, this paper investigates intrinsic discrepancies between synthetic and authentic images from a signal-level perspective. Our analysis reveals that low-correlation signals serve as distinctive markers for differentiating AI-generated imagery from real photographs. Building on this insight, we introduce a novel method for quantifying these signals based on fractal theory. By analyzing the fractal characteristics of low-correlation signals, our method effectively captures the subtle statistical anomalies inherent to the synthesis process. Extensive experimental results demonstrate the method's robustness and superior detection performance. In principle, the proposed approach is not limited to face image identification and can be applied to all AI-generated image detection tasks. This work emphasizes the need to shift research focus toward this signal-level direction for deepfake detection.
Chinese Translation
AI生成的图像已达到近乎照片真实的逼真度,但这一技术对信息安全和社会信任构成了重大威胁。现有的深度伪造检测方法在开放世界场景中往往表现出有限的鲁棒性。为了解决这一局限性,本文从信号层面探讨合成图像与真实图像之间的内在差异。我们的分析表明,低相关信号作为区分AI生成图像与真实照片的独特标记。基于这一见解,我们提出了一种基于分形理论量化这些信号的新方法。通过分析低相关信号的分形特征,我们的方法有效捕捉到合成过程中的微妙统计异常。大量实验结果证明了该方法的鲁棒性和卓越的检测性能。本研究强调了将深度伪造检测的研究重点转向新的信号层面方向的必要性。从理论上讲,这种提出的方法不仅限于人脸图像识别,还可以应用于所有AI生成图像的检测任务。本研究为深度伪造检测提供了新的研究方向。
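The abstract above quantifies signals "based on fractal theory" without naming an estimator. A common choice is the box-counting dimension, sketched below on binary masks. Treating the low-correlation signal as a thresholded square mask is an assumption of this note, as are the function names; the paper's actual signal extraction and fractal measure may differ.

```python
import math


def box_count(mask, box_size):
    """Number of box_size x box_size boxes that contain at least one set pixel."""
    size = len(mask)
    count = 0
    for top in range(0, size, box_size):
        for left in range(0, size, box_size):
            if any(mask[r][c]
                   for r in range(top, min(top + box_size, size))
                   for c in range(left, min(left + box_size, size))):
                count += 1
    return count


def box_counting_dimension(mask, box_sizes=(1, 2, 4, 8)):
    """Least-squares slope of log N(s) against log(1/s); assumes a non-empty square mask."""
    xs = [math.log(1.0 / s) for s in box_sizes]
    ys = [math.log(box_count(mask, s)) for s in box_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope_num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope_den = sum((x - mean_x) ** 2 for x in xs)
    return slope_num / slope_den


size = 32
filled = [[1] * size for _ in range(size)]                                        # a plane
line = [[1 if r == size // 2 else 0 for _ in range(size)] for r in range(size)]  # a line
print(round(box_counting_dimension(filled), 3))
print(round(box_counting_dimension(line), 3))
```

The two sanity cases recover the expected dimensions of a plane and a line; a detector along the lines of the abstract would compare such estimates between real and generated images, though which statistic separates them is a question the paper itself answers.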
cs.CV / 134 / 2604.17274
Instinct vs. Reflection: Unifying Token and Verbalized Confidence in Multimodal Large Models
本能与反思:在多模态大型模型中统一标记和口头信心
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in various perception and reasoning tasks. Despite this success, ensuring their reliability in practical deployment necessitates robust confidence estimation. Prior works have predominantly focused on text-only LLMs, often relying on computationally expensive self-consistency sampling. In this paper, we extend this to multimodal settings and conduct a comprehensive evaluation of MLLMs' response confidence estimation. Our analysis reveals a significant instinct-reflection misalignment: the model's implicit token-level support frequently diverges from its verbal self-assessment confidence. To address this misalignment, we propose a monotone confidence fusion framework to merge dual-channel signals and cross-channel consistency to estimate correctness. Subsequently, an order-preserving mean alignment step is applied to correct global bias, which improves calibration while preserving the risk-coverage trade-off for selective prediction. Experiments on diverse open-source and closed-source MLLMs show that our method consistently yields more reliable confidence estimates and improves both calibration and failure prediction. Code will be available at https://github.com/Yunkaidang/Instinct-vs.-Reflection.
Chinese Translation
多模态大型语言模型(MLLMs)在各种感知和推理任务中展现了卓越的能力。尽管取得了这些成功,但在实际应用中确保其可靠性需要强有力的信心估计。以往的研究主要集中在仅处理文本的LLMs上,通常依赖于计算成本高昂的自一致性采样。在本文中,我们将这一方法扩展到多模态环境,并对MLLMs的响应信心估计进行了全面评估。我们的分析揭示了显著的本能-反思不一致:模型的隐式标记级支持常常与其口头自我评估的信心相偏离。为了解决这一不一致,我们提出了一种单调信心融合框架,以合并双通道信号和跨通道一致性来估计正确性。随后,应用保持顺序的均值对齐步骤来纠正全局偏差,这在提高校准的同时保持了选择性预测的风险覆盖权衡。在多种开源和闭源的MLLMs上进行的实验表明,我们的方法始终能够提供更可靠的信心估计,并改善校准和失败预测。代码将发布在 https://github.com/Yunkaidang/Instinct-vs.-Reflection。
cs.CV / 135 / 2604.17278
PestVL-Net: Enabling Multimodal Pest Learning via Fine-grained Vision-Language Interaction
PestVL-Net:通过细粒度视觉-语言交互实现多模态害虫学习
Abstract
Effective pest recognition and management are crucial for sustainable agricultural development. However, collecting pest data in real scenarios is often challenging. Compared to other domains, pests exhibit a wide variety of species with complex and diverse morphological characteristics. Existing techniques struggle to effectively model the key visual and high-level semantic features of pests in a fine-grained manner. These limitations hinder the practical application of such methods in real agricultural scenarios. To address these critical challenges, we present a synergistic approach that integrates PestVL-Net, a novel vision-language framework, with two multi-species pest datasets to facilitate fine-grained pest learning. The visual pathway of PestVL-Net utilizes the Recurrent Weighted Key Value (RWKV) architecture, incorporating a saliency-guided adaptive window partitioning scheme to effectively model the fine-grained visual characteristics of pests. Concurrently, the linguistic component generates precise pest semantic descriptions by leveraging Multimodal Large Language Models (MLLMs) priors, critically informed by agricultural expert knowledge and structured via multimodal Chain-of-Thought (CoT) reasoning. The deep fusion of these complementary visual and textual representations enables fine-grained multimodal pest learning. Extensive experimental evaluations on multiple pest datasets validate the superior performance of PestVL-Net, highlighting its potential for effective real-world pest management.
Chinese Translation
有效的害虫识别和管理对可持续农业发展至关重要。然而,在实际场景中收集害虫数据往往具有挑战性。与其他领域相比,害虫种类繁多,形态特征复杂多样。现有技术难以有效地以细粒度方式建模害虫的关键视觉和高级语义特征。这些限制阻碍了此类方法在实际农业场景中的应用。为了解决这些关键挑战,我们提出了一种协同方法,将PestVL-Net这一新颖的视觉-语言框架与两个多物种害虫数据集结合,以促进细粒度害虫学习。PestVL-Net的视觉路径利用了递归加权关键值(RWKV)架构,结合了基于显著性引导的自适应窗口划分方案,有效建模害虫的细粒度视觉特征。同时,语言组件通过利用多模态大型语言模型(MLLMs)的先验知识,生成精确的害虫语义描述,这些描述受到农业专家知识的关键指导,并通过多模态思维链(CoT)推理进行结构化。这些互补的视觉和文本表示的深度融合实现了细粒度多模态害虫学习。在多个害虫数据集上的广泛实验评估验证了PestVL-Net的优越性能,突显了其在有效的现实世界害虫管理中的潜力。
cs.CV / 136 / 2604.17286
Depth Adaptive Efficient Visual Autoregressive Modeling
深度自适应高效视觉自回归建模
Abstract
Visual Autoregressive (VAR) modeling inefficiently applies a fixed computational depth to each position when generating high-resolution images. While existing methods accelerate inference by pruning tokens using frequency maps, their binary hard-pruning approach is fundamentally limited and fails to improve quality even with better frequency estimation. Observing that VAR models possess significant depth redundancy, we propose a paradigm shift from pruning entire tokens to adaptively allocating per-token computational depth. To this end, we introduce DepthVAR, a training-free framework that dynamically allocates computation. It integrates an adaptive depth scheduler, which assigns computational depth via a cyclic rotated schedule for balanced, non-static refinement, with a dynamic inference process that translates these depths into layer-major masks, selectively applies transformer blocks, and blends the resulting codes to ensure each token's influence is proportional to its processing depth. Extensive experiments show that DepthVAR achieves 2.3$\times$-3.1$\times$ acceleration with minimal quality loss, offering a competitive compute-performance trade-off compared to existing hard-pruning approaches. Code is available at https://github.com/STOVAGtz/DepthVAR
Chinese Translation
视觉自回归(VAR)建模在生成高分辨率图像时,低效地对每个位置应用固定的计算深度。尽管现有方法通过使用频率图修剪令牌来加速推理,但其二进制硬修剪方法在根本上受到限制,即使在更好的频率估计下也无法提高质量。观察到VAR模型存在显著的深度冗余,我们提出了一种从修剪整个令牌到自适应分配每个令牌计算深度的范式转变。为此,我们引入了DepthVAR,这是一种无训练框架,能够动态分配计算。它集成了一个自适应深度调度器,通过循环旋转调度分配计算深度,以实现平衡的、非静态的精细调整,并结合动态推理过程,将这些深度转化为层主掩码,选择性地应用变换器块,并混合生成的代码,以确保每个令牌的影响与其处理深度成正比。大量实验表明,DepthVAR实现了2.3×-3.1×的加速,且质量损失最小,提供了与现有硬修剪方法相比具有竞争力的计算性能权衡。代码可在 https://github.com/STOVAGtz/DepthVAR 获取。
cs.CV / 137 / 2604.17287
Spectral Forensics of Diffusion Attention Graphs for Copy-Move Forgery Detection
扩散注意图谱的光谱取证用于复制移动伪造检测
Abstract
Copy-move forgery, where a region within an image is duplicated to hide or fabricate content, remains a persistent threat to visual media integrity. We introduce GraphSpecForge, a training-free framework that detects copy-move forgery by analysing the spectral structure of attention graphs from a pretrained Stable Diffusion U-Net. Our central insight is that copy-move manipulation induces approximate subgraph duplication in the self-attention graph, leading to measurable spectral redistribution in the normalized graph Laplacian. We formalise this link with perturbation-based arguments and build an image-level anomaly detector using Wasserstein distances between per-image Laplacian spectra and an authentic reference distribution. We evaluate GraphSpecForge on four copy-move benchmarks without forgery-specific retraining. On RecodAI-LUC (5,128 images), our best configuration achieves AUROC = 0.606 (95% CI: 0.580-0.638; permutation p = 0.005), and the normalized Laplacian outperforms raw attention spectra by +0.057 AUROC. On MICC-F220, CoMoFoD, and COVERAGE, the same pipeline attains AUROCs of 0.752, 0.774, and 0.673, respectively; on CoMoFoD it also reaches AUPRC = 0.833, balanced accuracy = 0.712, MCC = 0.499, and TPR@1%FPR = 32.5%. Additional ablation and falsification experiments confirm the signal's specificity and sensitivity to manipulation strength, while null-graph controls rule out trivial-statistic explanations.
Chinese Translation
复制移动伪造是指在图像中复制某个区域以隐藏或伪造内容,这对视觉媒体的完整性构成了持续威胁。我们提出了GraphSpecForge,这是一个无训练的框架,通过分析预训练的Stable Diffusion U-Net生成的注意图谱的光谱结构来检测复制移动伪造。我们的核心见解是,复制移动操作会在自注意图中引发近似子图的重复,从而导致归一化图拉普拉斯算子中的可测光谱重分布。我们通过基于扰动的论证形式化了这一联系,并使用每幅图像的拉普拉斯光谱与真实参考分布之间的Wasserstein距离构建了一个图像级异常检测器。我们在四个复制移动基准上评估了GraphSpecForge,而无需进行伪造特定的再训练。在RecodAI-LUC(5,128幅图像)上,我们的最佳配置达到了AUROC = 0.606(95% CI: 0.580-0.638;置换p = 0.005),归一化拉普拉斯算子的表现比原始注意光谱提高了0.057 AUROC。在MICC-F220、CoMoFoD和COVERAGE上,相同的流程分别达到了0.752、0.774和0.673的AUROC;在CoMoFoD上,它还达到了AUPRC = 0.833,平衡准确率 = 0.712,MCC = 0.499,以及TPR@1%FPR = 32.5%。额外的消融和伪造实验确认了信号对操作强度的特异性和敏感性,而零图控制则排除了平凡统计解释。
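The scoring idea in this abstract (normalized Laplacian spectrum of an attention graph, compared with authentic reference spectra via a Wasserstein distance) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the symmetrization step, the averaging over references, and the names `normalized_laplacian_spectrum`, `wasserstein_1d`, and `anomaly_score` are assumptions, and a random matrix stands in for a real Stable Diffusion attention map.

```python
import numpy as np

def normalized_laplacian_spectrum(attention: np.ndarray) -> np.ndarray:
    """Ascending eigenvalues of L = I - D^{-1/2} W D^{-1/2}."""
    # Assumption: symmetrize the attention map so the graph is undirected
    # and the Laplacian has real eigenvalues.
    w = 0.5 * (attention + attention.T)
    degree = w.sum(axis=1)
    # Isolated nodes (zero degree) get a zero scaling factor.
    inv_sqrt = np.where(degree > 0, 1.0 / np.sqrt(np.maximum(degree, 1e-12)), 0.0)
    lap = np.eye(w.shape[0]) - inv_sqrt[:, None] * w * inv_sqrt[None, :]
    return np.linalg.eigvalsh(lap)

def wasserstein_1d(spec_a, spec_b) -> float:
    """1-Wasserstein distance between two equal-size empirical spectra."""
    a = np.sort(np.asarray(spec_a, dtype=float))
    b = np.sort(np.asarray(spec_b, dtype=float))
    # For equal-size, equally weighted samples, W1 is the mean gap
    # between the sorted values.
    return float(np.mean(np.abs(a - b)))

def anomaly_score(attention: np.ndarray, reference_spectra) -> float:
    """Hypothetical score: mean W1 distance to authentic reference spectra."""
    spec = normalized_laplacian_spectrum(attention)
    return float(np.mean([wasserstein_1d(spec, ref) for ref in reference_spectra]))

# Stand-in data: a random non-negative "attention" matrix.
rng = np.random.default_rng(0)
att = rng.random((8, 8))
spec = normalized_laplacian_spectrum(att)
```

A higher score would indicate a spectrum further from the authentic reference set; how the threshold and the reference distribution are chosen is not specified in the abstract.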
cs.CV / 138 / 2604.17298
Frequency-guided Multi-level Reasoning for Scene Graph Generation in Video
基于频率引导的多层次推理用于视频场景图生成
Abstract
Video Scene Graph Generation aims to obtain structured semantic representations of objects and their relationships in videos for high-level understanding. However, existing methods still have limitations in handling long-tail distributions. This paper proposes the Frequency-guided Relational Multi-level Reasoning (FReMuRe) model, which enhances the modeling ability of long-tail relationships from a mechanism perspective. We introduce relation-specific branches to deal with gradient conflicts, yielding more balanced and tail-aware learning. We also design a frequency-aware dual-branch predicate embedding network to model high-frequency and low-frequency relationships separately and improve the recall rate of tail classes through gated fusion. Meanwhile, we propose two types of interchangeable relation classification heads: a Bayesian Head for uncertainty estimation and a new Gaussian Mixture Model Head to enhance intra-class diversity. Experimental results show that FReMuRe significantly improves the recall rate of long-tail relationships and overall reasoning robustness on the Action Genome dataset.
Chinese Translation
视频场景图生成旨在获取视频中对象及其关系的结构化语义表示,以实现高层次理解。然而,现有方法在处理长尾分布方面仍存在局限性。本文提出了频率引导的关系多层次推理模型(Frequency-guided Relational Multi-level Reasoning, FReMuRe),从机制角度增强了对长尾关系的建模能力。我们引入了特定关系的分支以处理梯度冲突,从而实现更平衡和关注长尾的学习。同时,我们设计了一个频率感知的双分支谓词嵌入网络,以分别建模高频和低频关系,并通过门控融合提高长尾类别的召回率。此外,我们提出了两种可互换的关系分类头:用于不确定性估计的贝叶斯头(Bayesian Head)和增强类内多样性的新的高斯混合模型头(Gaussian Mixture Model Head)。实验结果表明,FReMuRe显著提高了长尾关系的召回率和在Action Genome数据集上的整体推理鲁棒性。
cs.CV / 139 / 2604.17306
The First Challenge on Mobile Real-World Image Super-Resolution at NTIRE 2026: Benchmark Results and Method Overview
2026年NTIRE移动真实世界图像超分辨率首届挑战赛:基准结果与方法概述
Li, Jiatong, Chen, Zheng, Liu, Kai, Wang, Jingkai, Zhou, Zihan, Liu, Xiaoyang, Zhu, Libo, Gong, Jue, Timofte, Radu, Zhang, Yulun, Wang, Congyu, Wang, Zihao, Wu, Ke, Zhu, Xinzhe, Zhang, Fengkai, Yang, Zhongbao, Sun, Long, Dong, Jiangxin, Pan, Jinshan, Tu, Jiachen, Shi, Yaokun, Xu, Guoyi, Jiang, Yaoxin, Liu, Jiajia, Situ, Renyuan, Yang, Yixin, Zhou, Zhaorun, Chen, Junyang, Li, Yuqi, Yang, Chuanguang, Feng, Weilun, Yan, Chuanyue, Tan, Yuedong, Tian, Yingli, Chen, Zhenzhong, Guo, Tongqi, Liu, Ruhan, Shi, Sangzi, Deng, Huazhang, Yang, Jie, Ma, Wenzhuo, Zhang, Yuantong, Yang, Daiqin, Chen, Tianrun, Ji, Deyi, Jiang, Yuxiao, Zhu, Qi, Zhu, Lanyun, Pan, Yuwen, Tian, Runze, Shi, Mingyu, Feng, Zhanfeng, Bao, Yuanfei, Guo, Jiaming, Pei, Renjing, Di, Xin, Peng, Long, Jiang, Linfeng, Fu, Xueyang, Cao, Yang, Zha, Zhengjun, Lee, Choulhyouc, Weng, Shyang-En, Liao, Yi-Cheng, Tyrakowski, Jorge, Xu, Yu-Syuan, Chiu, Wei-Chen, Huang, Ching-Chun, Im, Yoonjin, Park, Jihye, Chun, Hyungju, Park, Hyunhee, Park, MinKyu, Yu, Xiaoxuan, Zhang, Jianxing, Jiang, Yuxuan, Zeng, Chengxi, Peng, Tianhao, Zhang, Fan, Bull, David, Ruangsang, Watchara, Aramvith, Supavadee, Deng, JiaHao, Zhou, Wei, Huang, Hongyu, Lin, Shaohui, Wang, Zihan, Chen, Yilin, Li, Yunchen, Qiao, Junbo, Li, Wei, Xie, Jiao, He, Gaoqi, Li, Wenxi
Abstract
This paper provides a review of the NTIRE 2026 challenge on mobile real-world image super-resolution, highlighting the proposed solutions and the resulting outcomes. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through unknown degradations with a x4 scaling factor while ensuring the models remain executable on mobile devices. The objective is to develop effective and efficient network designs or solutions that achieve state-of-the-art real-world image super-resolution performance. The track of the challenge evaluates performance using a weighted combination of image quality assessment (IQA) score and speedup ratios. The competition attracted 108 registrants, with 16 teams achieving a valid score in the final ranking. This collaborative effort advances the performance of mobile real-world image super-resolution while offering an in-depth overview of the latest trends in the field.
Chinese Translation
本文回顾了2026年NTIRE移动真实世界图像超分辨率挑战赛,重点介绍了提出的解决方案及其结果。该挑战旨在从通过未知降质生成的低分辨率(LR)图像中恢复高分辨率(HR)图像,缩放因子为4,同时确保模型能够在移动设备上执行。目标是开发有效且高效的网络设计或解决方案,以实现先进的真实世界图像超分辨率性能。挑战赛的评估轨道使用图像质量评估(IQA)得分和加速比的加权组合来评估性能。此次比赛吸引了108名注册者,其中16支队伍在最终排名中获得有效分数。这一合作努力推动了移动真实世界图像超分辨率的性能,同时提供了该领域最新趋势的深入概述。
cs.CV / 140 / 2604.17307
Generalizable Face Forgery Detection via Separable Prompt Learning
可推广的人脸伪造检测通过可分离提示学习
Abstract
Detecting face forgeries using CLIP has recently emerged as a promising and increasingly popular research direction. Owing to the rich visual knowledge CLIP acquires through large-scale pretraining, most existing methods rely on its visual encoder while paying limited attention to the text modality. Given the instructive nature of the text modality, we posit that, with careful design, it can be leveraged to guide Deepfake detection. Accordingly, we shift the focus from the visual modality to the text modality and propose a new Separable Prompt Learning strategy (SePL) that enables CLIP to serve as an effective face forgery detector. The core idea of SePL is to disentangle forgery-specific and forgery-irrelevant information in images via two types of prompt learning, with the former enhancing detection. To achieve this disentanglement, we describe a cross-modality alignment strategy and a set of dedicated objectives. Extensive experiments demonstrate that, with this simple adaptation, our method achieves competitive and even superior performance compared to other methods under both cross-dataset and cross-method evaluation, highlighting its strong generalizability. The code has been released at https://github.com/OUC-YER/SePL-DeepfakeDetection
Chinese Translation
利用 CLIP 检测人脸伪造最近成为一个有前景且日益受欢迎的研究方向。由于其通过大规模预训练获得的丰富视觉知识,大多数现有方法通常依赖于 CLIP 的视觉编码器,而对文本模态关注有限。鉴于文本模态的指导性特征,我们认为可以通过精心设计来利用它指导 Deepfake 检测。因此,我们将重点从视觉模态转向文本模态,提出了一种新的可分离提示学习策略(Separable Prompt Learning, SePL),使 CLIP 能够作为有效的人脸伪造检测器。SePL 的核心思想是通过两种类型的提示学习来解耦图像中的伪造特定信息和伪造无关信息,其中前者增强检测能力。为了实现这种解耦,我们描述了一种跨模态对齐策略和一组专门的目标。大量实验表明,通过这种简单的适应,我们的方法在跨数据集和跨方法评估中实现了与其他方法竞争甚至更优的性能,突显了其强大的可推广性。代码已发布在 https://github.com/OUC-YER/SePL-DeepfakeDetection
cs.CV / 141 / 2604.17318
When Background Matters: Breaking Medical Vision Language Models by Transferable Attack
背景的重要性:通过可转移攻击打破医学视觉语言模型
Abstract
Vision-Language Models (VLMs) are increasingly used in clinical diagnostics, yet their robustness to adversarial attacks remains largely unexplored, posing serious risks. Existing medical attacks focus on secondary objectives such as model stealing or adversarial fine-tuning, while transferable attacks from natural images introduce visible distortions that clinicians can easily detect. To address this, we propose MedFocusLeak, a highly transferable black-box multimodal attack that induces incorrect yet clinically plausible diagnoses while keeping perturbations imperceptible. The method injects coordinated perturbations into non-diagnostic background regions and employs an attention distraction mechanism to shift the model's focus away from pathological areas. Extensive evaluations across six medical imaging modalities show that MedFocusLeak achieves state-of-the-art performance, generating misleading yet realistic diagnostic outputs across diverse VLMs. We further introduce a unified evaluation framework with novel metrics that jointly capture attack success and image fidelity, revealing a critical weakness in the reasoning capabilities of modern clinical VLMs.
Chinese Translation
视觉-语言模型(VLMs)在临床诊断中的应用日益增多,但它们对对抗性攻击的鲁棒性仍然未得到充分探索,这带来了严重风险。现有的医学攻击主要集中在模型窃取或对抗性微调等次要目标上,而来自自然图像的可转移攻击则引入了临床医生易于检测的明显失真。为了解决这个问题,我们提出了MedFocusLeak,这是一种高度可转移的黑箱多模态攻击,能够诱导出不正确但在临床上合理的诊断,同时保持扰动不可察觉。该方法将协调的扰动注入非诊断背景区域,并采用注意力干扰机制将模型的注意力从病理区域转移开。对六种医学影像模态的广泛评估表明,MedFocusLeak实现了最先进的性能,在多种VLMs中生成误导性但现实的诊断输出。我们进一步引入了一个统一的评估框架,使用新颖的指标共同捕捉攻击成功率和图像保真度,揭示了现代临床VLMs推理能力的一个关键弱点。
cs.CV / 142 / 2604.17319
E2E-GMNER: End-to-End Generative Grounded Multimodal Named Entity Recognition
E2E-GMNER:端到端生成式基础多模态命名实体识别
Abstract
Grounded Multimodal Named Entity Recognition (GMNER) aims to jointly identify named entity mentions in text, predict their semantic types, and ground each entity to a corresponding visual region in an associated image. Existing approaches predominantly adopt pipeline-based architectures that decouple textual entity recognition and visual grounding, leading to error accumulation and suboptimal joint optimization. In this paper, we propose E2E-GMNER, a fully end-to-end generative framework that unifies entity recognition, semantic typing, visual grounding, and implicit knowledge reasoning within a single multimodal large language model. We formulate GMNER as an instruction-tuned conditional generation task and incorporate chain-of-thought reasoning to enable the model to adaptively determine when visual evidence or background knowledge is informative, reducing reliance on noisy cues. To further address the instability of generative bounding box prediction, we introduce Gaussian Risk-Aware Box Perturbation (GRBP), which replaces hard box supervision with probabilistically perturbed soft targets to improve robustness against annotation noise and discretization errors. Extensive experiments on the Twitter-GMNER and Twitter-FMNERG benchmarks demonstrate that E2E-GMNER achieves highly competitive performance compared with state-of-the-art methods, validating the effectiveness of unified end-to-end optimization and noise-aware grounding supervision. Code is available at: https://github.com/Finch-coder/E2E-GMNER
Chinese Translation
基础多模态命名实体识别(GMNER)旨在共同识别文本中的命名实体提及,预测其语义类型,并将每个实体与相关图像中的相应视觉区域进行关联。现有方法主要采用管道式架构,将文本实体识别和视觉基础分离,导致错误累积和次优的联合优化。在本文中,我们提出了E2E-GMNER,这是一种完全端到端的生成框架,它在单一的多模态大型语言模型中统一了实体识别、语义分类、视觉基础和隐式知识推理。我们将GMNER表述为一种经过指令调优的条件生成任务,并结合思维链推理,使模型能够自适应地判断何时视觉证据或背景知识是有用的,从而减少对噪声线索的依赖。为了进一步解决生成边界框预测的不稳定性,我们引入了高斯风险感知框扰动(GRBP),用概率扰动的软目标替代硬框监督,以提高对标注噪声和离散化误差的鲁棒性。在Twitter-GMNER和Twitter-FMNERG基准上的大量实验表明,E2E-GMNER与最先进的方法相比,表现出高度竞争力,验证了统一端到端优化和噪声感知基础监督的有效性。代码可在以下链接获取:https://github.com/Finch-coder/E2E-GMNER
cs.CV / 143 / 2604.17320
Towards Joint Quantization and Token Pruning of Vision-Language Models
面向视觉-语言模型的联合量化与令牌剪枝
Abstract
Deploying Vision-Language Models (VLMs) under aggressive low-bit inference remains challenging because inference cost is dominated by the long visual-token prefix during prefill and the growing KV cache during autoregressive decoding. Token pruning and low-bit quantization are complementary for reducing these costs, yet naive stage-wise combinations are often brittle due to a mismatch between quantization calibration and pruning execution. We present a collaborative quantization-and-pruning framework that unifies low-bit inference and deterministic visual-token pruning in a single deployable pipeline. The framework introduces the Quantization Unified Offline Token Allocator (QUOTA), which converts low-bit calibration signals into a layer-wise token allocation schedule and materializes it as a pruning recipe. Token importance is evaluated under deployed W4A4 operators with a quantized KV cache by combining activation magnitude, attention cues, and an explicit low-bit risk signal, enabling consistent budgeted top-$k$ selection. Experiments on standard VLM benchmarks show improved robustness over stage-wise baselines under the same low-bit regime, achieving 95.65% average retention while retaining only 30% of visual tokens, compared with about 94.3% retention for representative stage-wise combinations. The code will be released.
Chinese Translation
在激进的低比特推理下部署视觉-语言模型(VLMs)仍然面临挑战,因为推理成本主要由预填充期间的长视觉令牌前缀和自回归解码期间不断增长的KV缓存主导。令牌剪枝和低比特量化在降低这些成本方面是互补的,但简单的阶段性组合往往脆弱,因为量化校准与剪枝执行之间存在不匹配。我们提出了一种协同量化与剪枝框架,将低比特推理和确定性视觉令牌剪枝统一在一个可部署的管道中。该框架引入了Quantization Unified Offline Token Allocator(QUOTA),它将低比特校准信号转换为层级令牌分配计划,并将其具体化为剪枝配方。令牌重要性在部署的W4A4运算符下通过结合激活幅度、注意力线索和显式低比特风险信号进行评估,从而实现一致的预算top-$k$选择。在标准VLM基准上的实验表明,在相同的低比特条件下,我们的框架相较于阶段性基线在稳健性上有所提升,在仅保留30%视觉令牌的情况下平均保留率达95.65%,而代表性的阶段性组合的保留率约为94.3%。代码将会发布。
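To make the "budgeted top-$k$ selection" in this abstract concrete, here is a minimal sketch under stated assumptions. The abstract does not say how the three signals are combined, so the linear score, the choice to penalize the risk signal, the weights, and the function name `select_tokens` are all hypothetical; only the 30% token budget comes from the text.

```python
def select_tokens(act_mag, attn, risk, keep_ratio=0.3, weights=(1.0, 1.0, 1.0)):
    """Return the sorted indices of the visual tokens kept under a fixed budget."""
    n = len(act_mag)
    # Budget: number of tokens to keep (at least one).
    k = max(1, int(round(keep_ratio * n)))
    w_act, w_attn, w_risk = weights
    # Assumption: importance rises with activation magnitude and attention,
    # and falls with the low-bit risk signal.
    scores = [w_act * a + w_attn * t - w_risk * r
              for a, t, r in zip(act_mag, attn, risk)]
    # Sort by descending score and break ties by index, so the selection
    # is deterministic for a given calibration result.
    order = sorted(range(n), key=lambda i: (-scores[i], i))
    return sorted(order[:k])

# Toy example with ten tokens; attention and risk are zero for simplicity.
act = [0.1, 0.9, 0.5, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.0]
kept = select_tokens(act, [0.0] * 10, [0.0] * 10)
```

Because the schedule is computed offline, a fixed recipe like `kept` could be stored per layer and replayed at inference time; whether QUOTA does exactly this is not stated in the abstract.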
cs.CV / 144 / 2604.17321
R-FLoRA: Residual-Statistic-Gated Low-Rank Adaptation for Single-Image Face Morphing Attack Detection
R-FLoRA:用于单图像人脸变形攻击检测的残差统计门控低秩适应
Abstract
Face morphing attacks pose a substantial risk to the reliability of face recognition systems used in passport issuance, border control, and digital identity verification. Detecting morphing attacks from a single facial image remains challenging owing to the lack of a trusted reference and the diversity of attack generation methods. This paper presents a new Single-Image Face Morphing Attack Detection (S-MAD) framework that integrates high-frequency Laplacian residual statistics with representations from a frozen, foundation-scale vision transformer. The approach employs residual-statistic-gated low-rank adapters (R-FLoRA) and feature-wise residual fusion (Res-FiLM) to enhance sensitivity to local morphing artefacts while preserving the semantic context of the backbone. A novel residual-contrastive alignment loss further regularises the fused token space, improving discrimination under unseen morphing conditions. Comprehensive experiments on four ICAO-compliant datasets, encompassing seven morph generation techniques, demonstrate that the proposed method consistently surpasses nine recent state-of-the-art S-MAD algorithms in detection accuracy and cross-domain (or dataset) generalisation. With a frozen backbone and minimal trainable parameters, the model achieves real-time efficiency and interpretability, making it suitable for real-life scenarios in biometric verification systems.
Chinese Translation
人脸变形攻击对护照发放、边境控制和数字身份验证中使用的人脸识别系统的可靠性构成了重大风险。由于缺乏可信的参考和攻击生成方法的多样性,从单个面部图像中检测变形攻击仍然具有挑战性。本文提出了一种新的单图像人脸变形攻击检测(S-MAD)框架,该框架将高频拉普拉斯残差统计与来自冻结的基础规模视觉变换器的表示相结合。该方法采用残差统计门控低秩适配器(R-FLoRA)和特征级残差融合(Res-FiLM),以增强对局部变形伪影的敏感性,同时保持主干的语义上下文。一种新颖的残差对比对齐损失进一步规范化了融合的标记空间,提高了在未见变形条件下的区分能力。在四个符合国际民航组织(ICAO)标准的数据集上的全面实验,涵盖七种变形生成技术,证明所提出的方法在检测准确性和跨领域(或数据集)泛化方面始终超过九种最新的最先进S-MAD算法。该模型在冻结主干和最小可训练参数的情况下,实现了实时效率和可解释性,适合在生物识别验证系统的实际场景中应用。
cs.CV / 145 / 2604.17341
Robust Diabetic Retinopathy Grading Using Dual-Resolution Attention-Based Deep Learning with Ordinal Regression
基于双分辨率注意力深度学习与序数回归的稳健糖尿病视网膜病变分级
Abstract
Diabetic retinopathy (DR) is a leading cause of vision impairment worldwide, and automated grading systems play a crucial role in large-scale screening programs. However, deep learning models often exhibit degraded performance when deployed across datasets acquired under different imaging conditions. This study presents a robust dual-resolution deep learning framework for DR grading that integrates attention-based feature fusion with ordinal regression to improve cross-dataset generalization. The proposed method employs two parallel EfficientNet backbones operating at different spatial resolutions to capture complementary retinal features. A learnable attention mechanism adaptively fuses multi-resolution representations, while an ordinal regression formulation based on the cumulative link model (CORAL) explicitly accounts for the ordered nature of DR severity levels. To mitigate domain discrepancies between datasets, a preprocessing strategy combining circular cropping, contrast enhancement, and histogram matching is applied. The model was trained on the APTOS 2019 dataset and evaluated on both an internal validation split and an external Messidor-2 test set. Experimental results demonstrate strong grading performance, achieving a quadratic weighted kappa (QWK) of 0.88 on the APTOS validation set and 0.68 on the unseen Messidor-2 dataset, indicating improved robustness for cross-dataset DR grading applications.
Chinese Translation
糖尿病视网膜病变(DR)是全球视力障碍的主要原因,自动化分级系统在大规模筛查项目中发挥着至关重要的作用。然而,深度学习模型在不同成像条件下获取的数据集上部署时,往往表现出性能下降。本研究提出了一种稳健的双分辨率深度学习框架,用于DR分级,该框架结合了基于注意力的特征融合与序数回归,以提高跨数据集的泛化能力。所提方法采用两个并行的EfficientNet骨干网络,在不同的空间分辨率下运行,以捕捉互补的视网膜特征。可学习的注意力机制自适应地融合多分辨率表示,而基于累积链接模型(CORAL)的序数回归公式明确考虑了DR严重程度级别的有序性质。为了减轻数据集之间的领域差异,应用了一种结合圆形裁剪、对比度增强和直方图匹配的预处理策略。该模型在APTOS 2019数据集上进行训练,并在内部验证集和外部Messidor-2测试集上进行评估。实验结果表明,该模型在分级性能上表现出色,在APTOS验证集上实现了0.88的二次加权kappa(QWK),在未见过的Messidor-2数据集上实现了0.68,表明其在跨数据集DR分级应用中的稳健性得到了改善。
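The quadratic weighted kappa (QWK) reported in this abstract has a standard definition, sketched below in plain Python for reference. The function name and the convention of returning 1.0 when the expected disagreement is zero are choices made for this sketch; the grades in the example are made up.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """QWK = 1 - sum(w * observed) / sum(w * expected), w_ij = (i-j)^2 / (K-1)^2."""
    n = len(y_true)
    k = num_classes
    # Confusion matrix of observed (true grade, predicted grade) counts.
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = 0.0
    den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            # Expected count if true and predicted grades were independent.
            expected = hist_true[i] * hist_pred[j] / n
            num += weight * observed[i][j]
            den += weight * expected
    # Convention for this sketch: no possible disagreement counts as perfect.
    return 1.0 - num / den if den > 0 else 1.0

# Five DR severity grades (0 to 4), illustrative labels only.
perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], 5)
reversed_ = quadratic_weighted_kappa([0, 1, 2], [2, 1, 0], 3)
```

The quadratic weight penalizes a prediction two grades off four times as much as one grade off, which is why the metric suits ordered severity levels.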
cs.CV / 146 / 2604.17375
When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
当文本劫持视觉:基准测试与缓解文本覆盖引发的视觉语言模型幻觉
Abstract
Recent advances in Vision-Language Models (VLMs) have substantially enhanced their performance across multimodal video understanding benchmarks spanning temporal, action, object, and spatial understanding. However, we identify a critical yet overlooked issue: when embedded on-screen text contradicts the visual scene, existing VLMs systematically hallucinate, prioritizing overlay textual semantics over the actual visual content. We define this phenomenon as Text Overlay-Induced Hallucination (TOIH). In this work, we propose VisualTextTrap, the first comprehensive benchmark for TOIH, comprising large-scale human-validated samples and specifically designed evaluation metrics. In particular, we construct VisualTextTrap from widely used public datasets using a scalable hybrid pipeline of VLM-assisted text generation and rigorous manual verification. The benchmark features 6,057 samples annotated across 88 fine-grained attributes within four dimensions, with hallucination intensity quantified on a five-level scale (L1--L5) that reflects the semantic contradiction between overlay text and visual reality. Moreover, we propose Visual Text Hallucination Mitigation Mixture-of-Experts (VTHM-MoE), a novel Vision-Text Disentanglement framework that employs a dual-encoder architecture. Concretely, four dimension-specialized expert modules spanning Temporal, Action, Object, and Spatial reasoning are first pre-trained to identify and leverage cross-modal discrepancies between textual semantics and actual video content. We then develop an Adaptive Token Routing Strategy to enable dynamic expert allocation, conferring robust resistance to TOIH while preserving performance on uncontaminated videos. Extensive experiments on our VisualTextTrap benchmark verify the effectiveness of VTHM-MoE, which outperforms state-of-the-art counterparts on diverse video question answering tasks.
Chinese Translation
近年来,视觉语言模型(Vision-Language Models, VLMs)的进展显著提升了其在跨模态视频理解基准测试中的能力,涵盖了时间、动作、物体和空间理解。然而,我们识别出一个关键但被忽视的问题:当嵌入的屏幕文本与视觉场景相矛盾时,现有的 VLMs 系统性地产生幻觉,优先考虑覆盖文本的语义而非实际的视觉内容。我们将这一现象定义为文本覆盖引发的幻觉(Text Overlay-Induced Hallucination, TOIH)。在本研究中,我们提出了 VisualTextTrap,这是第一个综合性基准,包括经过大规模人工验证的样本,并配备专门设计的评估指标。具体而言,我们利用广泛使用的公共数据集,通过可扩展的 VLMs 辅助文本生成和严格的人工验证构建了 VisualTextTrap。该基准包含 6,057 个样本,跨越四个维度的 88 个细粒度属性进行标注,幻觉强度在一个五级量表(L1--L5)上进行量化,反映覆盖文本与视觉现实之间的语义矛盾。此外,我们提出了视觉文本幻觉缓解专家混合模型(Visual Text Hallucination Mitigation Mixture-of-Experts, VTHM-MoE),这是一种新颖的视觉-文本解耦框架,采用双编码器架构。具体而言,四个专注于时间、动作、物体和空间推理的专家模块首先经过预训练,以识别和利用文本语义与实际视频内容之间的跨模态差异。我们开发了一种自适应令牌路由策略,以实现动态专家分配,从而在保持未受污染视频性能的同时,赋予对 TOIH 的强大抵抗力。在我们的 VisualTextTrap 基准上进行的大量实验验证了 VTHM-MoE 的有效性,其在多样的视频问答任务中超越了现有的最先进模型。
cs.CV / 147 / 2604.17376
Towards Generalizable Deepfake Image Detection with Vision Transformers
基于视觉变换器的可泛化深度伪造图像检测研究
Abstract
Detecting deepfake images remains challenging because modern generative models evolve quickly and existing methods generalize poorly. In this paper, we use an ensemble of fine-tuned vision transformers, such as DINOv2, AIMv2, and OpenCLIP's ViT-L/14, to build a generalizable deepfake detection method. We use the DF-Wild dataset released as part of the IEEE SP Cup 2025 because it covers a challenging and diverse set of manipulations and generation techniques. We started our experiments with CNN classifiers trained on spatial features. Experimental results show that our ensemble outperforms individual models and strong CNN baselines, achieving an AUC of 96.77% and an Equal Error Rate (EER) of just 9% on the DF-Wild test set, beating the state-of-the-art deepfake detection algorithm Effort by 7.05% in AUC and 8% in EER. This was the winning solution for the SP Cup, presented at ICASSP 2025.
Chinese Translation
在当今时代,我们面临着检测深度伪造图像的挑战,原因在于现代生成模型的快速演变以及现有方法的泛化能力较差。本文采用了一组经过微调的视觉变换器,如 DINOv2、AIMv2 和 OpenCLIP 的 ViT-L/14,创建了一种可泛化的深度伪造检测方法。我们使用了 DF-Wild 数据集,该数据集是 IEEE SP Cup 2025 的一部分,因为它涵盖了一系列具有挑战性和多样化的操控和生成技术。我们的实验从训练空间特征的 CNN 分类器开始。实验结果表明,我们的集成模型在性能上优于单一模型和强大的 CNN 基线,在 DF-Wild 测试集上实现了 96.77% 的 AUC 和仅 9% 的等错误率 (EER),分别比当前最先进的深度伪造检测算法 Effort 提高了 7.05% 和 8% 的 AUC 和 EER。这是 SP Cup 的获胜解决方案,并在 ICASSP 2025 上进行了展示。
cs.CV / 148 / 2604.17385
SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
SpatialImaginer:面向空间推理的自适应视觉想象
Abstract
Spatial intelligence, which refers to the ability to reason about geometric and physical structure from visual observations, remains a core challenge for multimodal large language models. Despite promising performance, recent multimodal large language models (MLLMs) often exhibit fragile reasoning traces in spatial intelligence tasks that involve consistent spatial state recognition. We argue that these failures stem from a mismatch between the spatial recognition mechanism and the text-only reasoning behavior of these MLLMs. Effective spatial reasoning requires low-level geometric structure to be faithfully preserved and updated throughout the reasoning process, whereas textual representations tend to abstract away precisely these critical details. To address this issue, we propose SpatialImaginer, a unified multimodal generation framework that integrates textual reasoning with visual imagination. Our framework adopts a divide-and-conquer strategy, using text chain-of-thought for high-level semantic planning and the visual imagination for geometry-sensitive state transformation and consistency preservation. To support this capability, we further introduce a difficulty-aware data engine with closed-loop verification to train the model to invoke visual imagination selectively when stable spatial state tracking is required. Extensive experiments on diverse spatial intelligence benchmarks show that SpatialImaginer achieves state-of-the-art performance and substantially improves robustness on complex multi-step spatial reasoning tasks.
Chinese Translation
空间智能是指从视觉观察中推理几何和物理结构的能力,仍然是多模态大型语言模型面临的核心挑战。尽管表现出色,最近的多模态大型语言模型(MLLMs)在涉及一致空间状态识别的空间智能任务中,往往表现出脆弱的推理痕迹。我们认为,这些失败源于空间识别机制与这些MLLMs的文本推理行为之间的不匹配。有效的空间推理要求在整个推理过程中忠实地保留和更新低层次的几何结构,而文本表示往往会抽象掉这些关键细节。为了解决这一问题,我们提出了SpatialImaginer,一个统一的多模态生成框架,将文本推理与视觉想象相结合。我们的框架采用分而治之的策略,利用文本链式思维进行高层次的语义规划,并使用视觉想象进行几何敏感的状态转换和一致性保持。为了支持这一能力,我们进一步引入了一个难度感知的数据引擎,结合闭环验证,训练模型在需要稳定空间状态跟踪时选择性地调用视觉想象。在多种空间智能基准上的广泛实验表明,SpatialImaginer实现了最先进的性能,并在复杂的多步骤空间推理任务中显著提高了鲁棒性。
cs.CV / 149 / 2604.17389
Deep learning based Non-Rigid Volume-to-Surface Registration for Brain Shift compensation Using Point Cloud
基于深度学习的非刚性体积到表面配准用于脑移位补偿的点云方法
Abstract
Soft-tissue deformation remains a major limitation in image-guided neurosurgery, where intra-operative anatomy can deviate substantially from pre-operative imaging due to brain shift, compromising navigation accuracy and surgical safety. Existing compensation methods often rely on intra-operative MRI, CT, or ultrasound, which are disruptive and difficult to integrate repeatedly into the surgical workflow. In contrast, partial 3D cortical surfaces can be reconstructed as point clouds from stereoscopic microscopes or laser range scanners (LRS), capturing only a limited portion of the exposed cortex. This makes point cloud registration a practical alternative without interrupting surgery; however, such partial and noisy observations make deformation estimation highly challenging. In this study, we propose a deep learning-based framework for non-rigid volume-to-surface registration, enabling dense displacement field estimation from sparse intra-operative surface observations without explicit point correspondences or volumetric intra-operative imaging. The network leverages multi-scale point-based feature extraction and a hierarchical deformation decoder to capture both global and local deformations. The key contribution lies in integrating partial intra-operative surface information into the full pre-operative point cloud domain, enabling implicit correspondence learning and dense deformation recovery under limited visibility. Quantitative results demonstrate accurate recovery of fine-scale deformations, achieving an Endpoint Error (EPE) of 1.13 +/- 0.75 mm and RMSE of 1.33 +/- 0.81 mm under challenging partial-surface conditions. The proposed approach supports automatic, workflow-compatible brain-shift compensation from sparse surface observations.
Chinese Translation
软组织变形仍然是图像引导神经外科手术中的一个主要限制因素,在手术过程中,由于脑移位,术中解剖结构可能与术前影像显著偏离,从而影响导航精度和手术安全性。现有的补偿方法通常依赖于术中MRI、CT或超声,这些方法具有干扰性,并且难以反复融入手术工作流程。相比之下,部分3D皮层表面可以通过立体显微镜或激光测距仪(LRS)重建为点云,仅捕获暴露皮层的有限部分。这使得点云配准成为一种在不干扰手术的情况下的实用替代方案;然而,这种部分和噪声观察使得变形估计变得极具挑战性。在本研究中,我们提出了一种基于深度学习的非刚性体积到表面配准框架,能够从稀疏的术中表面观察中进行密集位移场估计,而无需显式的点对应关系或体积术中成像。该网络利用多尺度点特征提取和分层变形解码器来捕获全局和局部变形。关键贡献在于将部分术中表面信息整合到完整的术前点云领域中,从而实现隐式对应学习和在有限可见性下的密集变形恢复。定量结果表明,在具有挑战性的部分表面条件下,精确恢复细微变形,达到1.13 +/- 0.75 mm的端点误差(EPE)和1.33 +/- 0.81 mm的均方根误差(RMSE)。所提出的方法支持从稀疏表面观察中自动、兼容工作流程的脑移位补偿。
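For readers unfamiliar with the two error measures quoted in this abstract, the sketch below shows one common way to compute them from predicted and ground-truth 3D displacement vectors. The abstract does not give the exact formulas, so treating RMSE as the root of the mean squared per-point error norm is an assumption, and the function names and sample vectors are illustrative.

```python
import math

def endpoint_errors(pred, gt):
    """Per-point Euclidean distance between predicted and true displacements."""
    return [math.dist(p, g) for p, g in zip(pred, gt)]

def mean_epe(pred, gt):
    """Endpoint Error (EPE): mean of the per-point distances."""
    errors = endpoint_errors(pred, gt)
    return sum(errors) / len(errors)

def rmse(pred, gt):
    """Assumed definition: root of the mean squared per-point distance."""
    errors = endpoint_errors(pred, gt)
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Two illustrative points with displacements in millimetres.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
epe_value = mean_epe(pred, gt)
rmse_value = rmse(pred, gt)
```

Under these definitions RMSE is never smaller than EPE, which is consistent with the ordering of the two values reported above.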
cs.CV / 150 / 2604.17390
MESA: A Training-Free Multi-Exemplar Deep Framework for Restoring Ancient Inscription Textures
MESA:一种无训练的多示例深度框架用于恢复古代铭文纹理
Abstract
Ancient inscriptions frequently suffer missing or corrupted regions from fragmentation, erosion, or other damage, hindering reading and analysis. We review prior image restoration methods and their applicability to inscription image recovery, then introduce MESA (Multi-Exemplar, Style-Aware), an image-level restoration method that uses well-preserved exemplar inscriptions (from the same epigraphic monument, material, or similar letterforms) to guide reconstruction of damaged text. MESA encodes VGG19 convolutional features as Gram matrices to capture exemplar texture, style, and stroke structure; for each neural network layer it selects the exemplar minimizing Mean-Squared Displacement (MSD) to the damaged input. Layer-wise contribution weights are derived from Optical Character Recognition-estimated character widths in the exemplar set to bias filters toward scales matching letter geometry, and a training mask preserves intact regions so synthesis is restricted to damaged areas. We also summarize prior network architectures as well as exemplar-based and single-image synthesis, inpainting, and Generative Adversarial Network (GAN) approaches, highlighting limitations that MESA addresses. Comparative experiments demonstrate the advantages of MESA. Finally, we provide a practical roadmap for choosing restoration strategies given available exemplars and metadata.
Chinese Translation
古代铭文常常因碎裂、侵蚀或其他损坏而出现缺失或损坏区域,这阻碍了阅读和分析。我们回顾了先前的图像恢复方法及其在铭文图像恢复中的适用性,然后介绍了MESA(多示例、风格感知)——一种图像级恢复方法,利用保存良好的示例铭文(来自同一铭文遗址、材料或相似字形)来指导损坏文本的重建。MESA将VGG19卷积特征编码为Gram矩阵,以捕捉示例的纹理、风格和笔画结构;对于每一层神经网络,它选择与损坏输入的均方位移(MSD)最小化的示例。层级贡献权重基于光学字符识别(OCR)估计的示例集中字符宽度,以使滤波器偏向于与字母几何形状匹配的尺度,并且训练掩码保留完整区域,从而合成限制在损坏区域。我们还总结了先前的网络架构以及示例和单图像合成、修补和生成对抗网络(GAN)方法,强调了MESA所解决的局限性。比较实验展示了MESA的优势。最后,我们提供了一条实用的路线图,以便在给定可用示例和元数据的情况下选择恢复策略。
cs.CV / 151 / 2604.17397
Speculative Decoding for Autoregressive Video Generation
自回归视频生成的推测解码
Abstract
Autoregressive video diffusion is emerging as a promising paradigm for streaming video synthesis, with step distillation serving as the primary means of accelerating inference. Whether speculative decoding, the dominant acceleration strategy for large language models, can be effectively adapted to autoregressive video generation remains an open question, because video blocks are continuous spatiotemporal tensors with no token-level distribution for exact rejection sampling. We introduce SDVG, which brings speculative decoding to block-based autoregressive video diffusion by replacing token verification with an image-quality router. A 1.3B drafter proposes candidate blocks via four denoising steps; each block is VAE-decoded and scored by ImageReward using worst-frame aggregation, that is, taking the minimum per-frame reward to catch single-frame artifacts that averaging would mask. Blocks scoring above a fixed threshold tau are accepted into the 14B target's KV cache; the rest are regenerated by the target. Two additional design choices prove critical: the first block is always force-rejected to anchor scene composition, and tau serves as a single knob that traces a smooth quality-speed Pareto frontier. On 1003 MovieGenVideoBench prompts (832x480), SDVG retains 98.1% of target-only VisionReward quality (0.0773 vs. 0.0788) at a 1.59x speedup with tau=-0.7, and reaches 2.09x at 95.7% quality retention, while consistently outperforming draft-only generation by over 17%. The framework is training-free, requires no architectural changes, and can be seamlessly integrated into existing autoregressive video generation pipelines.
Chinese Translation
自回归视频扩散作为一种有前景的视频流合成范式正在兴起,其中步骤蒸馏是加速推理的主要手段。推测解码作为大型语言模型的主导加速策略,是否能够有效适应自回归视频生成仍然是一个悬而未决的问题,因为视频块是连续的时空张量,没有令牌级分布以进行精确的拒绝采样。我们提出了SDVG,它通过用图像质量路由器替代令牌验证,将推测解码引入基于块的自回归视频扩散。一个1.3B的草稿生成器通过四个去噪步骤提出候选块;每个块都经过变分自编码器(VAE)解码,并通过ImageReward使用最差帧聚合进行评分——采用每帧奖励的最小值来捕捉平均可能掩盖的单帧伪影。得分超过固定阈值tau的块被接受到14B目标的KV缓存中;其余的由目标重新生成。两个额外的设计选择被证明至关重要:第一个块总是被强制拒绝以锚定场景构图,而tau作为一个单一的调节器,描绘出平滑的质量-速度帕累托前沿。在1003个MovieGenVideoBench提示(832x480)上,SDVG在tau=-0.7时保留了98.1%的目标专属VisionReward质量(0.0773对比0.0788),以1.59倍的加速实现,并在95.7%质量保留时达到2.09倍的加速——同时始终超越仅草稿生成超过17%。该框架无需训练,不需要架构更改,并可以无缝集成到现有的自回归视频生成管道中。
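The routing rule described in the SDVG abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `worst_frame_score` and `route_block` are hypothetical, and the per-frame rewards are assumed to be plain floats already produced by some image-reward scorer.

```python
from typing import Sequence


def worst_frame_score(frame_rewards: Sequence[float]) -> float:
    """Aggregate per-frame rewards by taking the minimum (the worst frame)."""
    if not frame_rewards:
        raise ValueError("a block must contain at least one frame")
    return min(frame_rewards)


def route_block(block_index: int, frame_rewards: Sequence[float], tau: float) -> bool:
    """Return True if the drafted block is accepted, False if the target regenerates it.

    Assumption for illustration: block_index 0 is the first block of the video,
    which the abstract says is always force-rejected.
    """
    if block_index == 0:
        return False
    # Accept only when the worst frame scores strictly above the threshold.
    return worst_frame_score(frame_rewards) > tau


# One bad frame sinks the block even though the mean reward is above tau.
rewards_with_artifact = [0.2, -0.9, 0.4]
mean_reward = sum(rewards_with_artifact) / len(rewards_with_artifact)
accepted = route_block(1, rewards_with_artifact, tau=-0.7)
```

Here the mean reward sits above the threshold while the minimum does not, which is the case the abstract says averaging would mask.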
cs.CV / 152 / 2604.17422
Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding
聚焦何处:用于长视频理解的查询调制多模态关键帧选择
Abstract
Long video understanding remains a formidable challenge for Multimodal Large Language Models (MLLMs) due to the prohibitive computational cost of processing dense frame sequences. Prevailing solutions, which select a keyframe subset, typically rely on either a single visual-centric metric (e.g., CLIP similarity) or a static fusion of heuristic scores. This "one-size-fits-all" paradigm frequently fails: visual-only metrics are ineffective for plot-driven narrative queries, while indiscriminately incorporating textual scores introduces severe "modal noise" for purely visual tasks. To break this bottleneck, we propose Q-Gate, a plug-and-play and training-free framework that treats keyframe selection as a dynamic modality routing problem. We decouple the retrieval process into three lightweight expert streams: Visual Grounding for local details, Global Matching for scene semantics, and Contextual Alignment for subtitle-driven narratives. Crucially, Q-Gate introduces a Query-Modulated Gating Mechanism that leverages the in-context reasoning of an LLM to assess the query's intent and dynamically allocate attention weights across the experts. This mechanism intelligently activates necessary modalities while "muting" irrelevant ones, thereby maximizing the signal-to-noise ratio. Extensive experiments on LongVideoBench and Video-MME across multiple MLLM backbones demonstrate that Q-Gate substantially outperforms state-of-the-art baselines. By effectively suppressing modality-specific noise, it provides a robust, highly interpretable solution for scalable video reasoning.
Chinese Translation
长视频理解对于多模态大语言模型(MLLMs)仍然是一项艰巨的挑战,因为处理密集帧序列的计算成本极高。现有的解决方案通常选择一个关键帧子集,通常依赖于单一的视觉中心度量(例如,CLIP相似度)或启发式评分的静态融合。这种“一刀切”的范式常常失效:仅依赖视觉的度量对于情节驱动的叙事查询无效,而不加区分地结合文本评分则为纯视觉任务引入了严重的“模态噪声”。为了解决这一瓶颈,我们提出了Q-Gate,一个即插即用且无需训练的框架,将关键帧选择视为一个动态模态路由问题。我们将检索过程解耦为三个轻量级专家流:用于局部细节的视觉定位(Visual Grounding)、用于场景语义的全局匹配(Global Matching)和用于字幕驱动叙事的上下文对齐(Contextual Alignment)。关键是,Q-Gate引入了一种查询调制门控机制(Query-Modulated Gating Mechanism),利用LLM的上下文推理来评估查询的意图,并动态分配专家之间的注意力权重。该机制智能地激活必要的模态,同时“静音”无关的模态,从而最大化信噪比。在LongVideoBench和Video-MME上进行的大量实验表明,Q-Gate显著优于最先进的基线。通过有效抑制模态特定噪声,它为可扩展视频推理提供了一个稳健且高度可解释的解决方案。
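The gating idea in the Q-Gate abstract, weighting three expert score streams by query-dependent gates and then picking keyframes, can be sketched as follows. Everything here is an assumption for illustration: the gate weights are hard-coded rather than produced by an LLM, the expert names are placeholders, and `fuse_scores` and `select_keyframes` are hypothetical helpers, not the paper's API.

```python
from typing import Dict, List


def fuse_scores(expert_scores: Dict[str, List[float]], gate: Dict[str, float]) -> List[float]:
    """Combine per-frame expert scores using normalized gate weights."""
    total = sum(gate.get(name, 0.0) for name in expert_scores)
    if total <= 0:
        raise ValueError("at least one expert must receive a positive gate weight")
    n_frames = len(next(iter(expert_scores.values())))
    fused = []
    for i in range(n_frames):
        # A gate weight of zero "mutes" that expert for this query.
        fused.append(sum(gate.get(name, 0.0) / total * scores[i]
                         for name, scores in expert_scores.items()))
    return fused


def select_keyframes(fused: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames in temporal order."""
    ranked = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return sorted(ranked[:k])


scores = {
    "visual": [0.9, 0.1, 0.2, 0.8],
    "global": [0.5, 0.5, 0.5, 0.5],
    "context": [0.0, 1.0, 0.9, 0.1],
}
# A purely visual query mutes the subtitle-driven expert; a narrative query does the opposite.
visual_pick = select_keyframes(fuse_scores(scores, {"visual": 1.0, "global": 0.0, "context": 0.0}), k=2)
narrative_pick = select_keyframes(fuse_scores(scores, {"visual": 0.0, "global": 0.0, "context": 1.0}), k=2)
```

The same frame scores yield different keyframes depending on the gate, which is the behavior a static fusion of scores cannot provide.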
cs.CV / 153 / 2604.17428
Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation
Long-CODE:将纯长上下文作为视频评估中的一个正交维度进行隔离
Abstract
As video generation models achieve unprecedented capabilities, the demand for robust video evaluation metrics becomes increasingly critical. Traditional metrics are intrinsically tailored for short-video evaluation, predominantly assessing frame-level visual quality and localized temporal smoothness. However, as state-of-the-art video generation models scale to generate longer videos, these metrics fail to capture essential long-range characteristics, such as narrative richness and global causal consistency. Recognizing that short-term visual perception and long-context attributes are fundamentally orthogonal dimensions, we argue that long-video metrics should be disentangled from short-video assessments. In this paper, we focus on the rigorous justification and design of a dedicated framework for long-video evaluation. We first introduce a suite of long-video attribute corruption tests, exposing the critical limitations of existing short-video metrics through their insensitivity to structural inconsistencies, such as shot-level perturbations and narrative shuffling. To bridge this gap, we design a novel long-video metric based on shot dynamics, which is highly sensitive to the corruptions in the long-range testing framework. Furthermore, we introduce Long-CODE (Long-Context as an Orthogonal Dimension for video Evaluation), a specialized dataset designed to benchmark long-video evaluation, with human annotations isolated specifically to genuine long-range characteristics. Extensive experiments show that our proposed metrics achieve state-of-the-art correlation with human judgments. Ultimately, our metric and benchmark seamlessly complement existing short-video standards, establishing a holistic and unbiased evaluation paradigm for video generation models.
Chinese Translation
随着视频生成模型达到前所未有的能力,对稳健视频评估指标的需求变得愈发重要。传统指标本质上是为短视频评估量身定制的,主要评估帧级视觉质量和局部时间平滑性。然而,随着最先进的视频生成模型扩展到生成更长的视频,这些指标未能捕捉到重要的长距离特征,如叙事丰富性和全局因果一致性。我们认识到短期视觉感知和长上下文属性在本质上是正交维度,因此我们认为长视频指标应与短视频评估分开。在本文中,我们专注于长视频评估专用框架的严格论证和设计。我们首先引入一系列长视频属性损坏测试,揭示现有短视频指标在对结构不一致性(如镜头级扰动和叙事洗牌)敏感性不足方面的关键局限性。为了填补这一空白,我们设计了一种基于镜头动态的新型长视频指标,该指标对长距离测试框架具有高度敏感性。此外,我们引入了Long-CODE(长上下文作为视频评估的正交维度),这是一个专门设计的数据集,用于基准测试长视频评估,且人类注释专门针对真实的长距离特征。大量实验表明,我们提出的指标与人类判断的相关性达到了最先进水平。最终,我们的指标和基准与现有短视频标准无缝互补,为视频生成模型建立了一个整体且无偏的评估范式。
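One of the corruption tests named in the Long-CODE abstract, narrative shuffling, can be sketched roughly as below. The abstract does not give the actual procedure, so this is only an assumed form: shots are lists of frame identifiers, `shuffle_shots` is a hypothetical helper, and the corruption permutes whole shots while leaving each shot's internal frame order intact.

```python
import random
from typing import List, Sequence


def shuffle_shots(shots: Sequence[Sequence[int]], seed: int) -> List[int]:
    """Permute the order of shots while keeping the frames inside each shot in order."""
    rng = random.Random(seed)
    order = list(range(len(shots)))
    if len(order) > 1:
        # Re-shuffle until the permutation differs from the original order.
        while order == sorted(order):
            rng.shuffle(order)
    return [frame for shot_index in order for frame in shots[shot_index]]


shots = [[0, 1], [2, 3, 4], [5]]
corrupted = shuffle_shots(shots, seed=0)
```

Because every frame and every within-shot transition is preserved, frame-level quality and local smoothness are largely unchanged, so a metric that scores the corrupted video the same as the original is insensitive to narrative order.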
cs.CV / 154 / 2604.17436
DEM Refinement and Validation on the Lunar Surface Using Shape-from-Shading with Chandrayaan-2 OHRC Imagery
基于形状阴影法的月球表面数字高程模型(DEM)精细化与验证:以昌德拉扬-2号高分辨率相机影像为例
Abstract
This study presents a Shape from Shading (SfS) framework to enhance sub-metre resolution lunar digital elevation models (DEMs) using imagery from the Orbiter High Resolution Camera (OHRC) aboard Chandrayaan-2. The framework applies SfS to an independent OHRC image of the same region, enabling SfS not just as a refinement tool, but as a source of new topographic data, unconstrained by stereo baseline limitations. The method is applied across three lunar sites, including the Cyrillus crater, the Vikram landing region, and the lunar south pole (Mons Mouton), with a systematic three-stage parameter sweep on the SfS smoothness weight. Results show measurable topographic enhancement, particularly in surface slope statistics, revealing fine-scale crater morphology previously unresolved. A limiting case is also characterized, where large pitch angle separation between the shading image and stereo pair reduces SfS sensitivity, and partial footprint coverage of the shading image is identified as a factor influencing spatially variable enhancement quality.
Chinese Translation
本研究提出了一种基于形状阴影法(Shape from Shading, SfS)的框架,利用昌德拉扬-2号探测器上的轨道器高分辨率相机(Orbiter High Resolution Camera, OHRC)影像来增强亚米级分辨率的月球数字高程模型(DEM)。该框架将SfS应用于同一区域的独立OHRC影像,使得SfS不仅作为一种精细化工具,还作为一种新的地形数据来源,不受立体基线限制。该方法应用于三个月球地点,包括基里勒斯陨石坑、维克拉姆着陆区和月球南极(蒙斯·穆通),并对SfS平滑权重进行了系统的三阶段参数扫描。结果显示出可测量的地形增强,特别是在表面坡度统计方面,揭示了之前未能分辨的细尺度陨石坑形态。此外,还描述了一个限制案例,其中阴影影像与立体对之间的大俯仰角分离降低了SfS的敏感性,阴影影像的部分覆盖范围被确定为影响空间可变增强质量的因素。
cs.CV / 155 / 2604.17439
Attention Is not Everything: Efficient Alternatives for Vision
注意力并非一切:视觉的高效替代方案
Abstract
Recent advances in computer vision have been driven mainly by Transformer-based models. However, many non-Transformer methods remain competitive with them. This review presents a comprehensive taxonomy of such methods, organizing them into categories such as convolution-based models, MLP-based models, state-space-based models, and more. The methods are examined in terms of efficiency, scalability, interpretability, and robustness. A total of 40 papers were chosen for this study. The goal is to give an overview of non-Transformer methods and to identify the challenges and opportunities for future computer vision research.
Chinese Translation
近年来,计算机视觉领域的进展主要得益于基于Transformer的模型。然而,许多非Transformer方法仍然表现良好,成为基于Transformer模型的直接竞争者。本综述试图呈现这些方法的全面分类法,并将其组织为卷积模型、MLP模型、状态空间模型等类别。这些方法从效率、可扩展性、易理解性和鲁棒性等方面进行分析。本研究选择了40篇论文,旨在展示非Transformer方法,并探讨未来计算机视觉研究中存在的挑战和机遇。
cs.CV / 156 / 2604.17446
HyKey: Hyperspectral Keypoint Detection and Matching in Minimally Invasive Surgery
HyKey:在微创手术中的高光谱关键点检测与匹配
Abstract
Purpose: 3D reconstruction in minimally invasive surgery (MIS) enables enhanced surgical guidance through improved visualisation, tool tracking, and augmented reality. However, traditional RGB-based keypoint detection and matching pipelines struggle with surgical challenges, such as poor texture and complex illumination. We investigate whether snapshot hyperspectral imaging (HSI) can improve keypoint detection and matching in surgical scenes. Methods: We developed HyKey, a HYperspectral KEYpoint detection and description model made up of a hybrid 3D-2D convolutional neural network that jointly extracts spatial-spectral features from HSI. The model was trained using synthetic homographic augmentation and epipolar geometry constraints on a robotically-acquired dual-camera RGB-HSI laparoscopic dataset of ex-vivo organs with calibrated camera poses. We benchmarked performance against established RGB-based methods, including SuperPoint and ALIKE. Results: Our HSI-based model outperformed RGB baselines on registered RGB frames, achieving 96.62% mean matching accuracy and 67.18% mean average accuracy at 10 degrees on pose estimation, demonstrating consistent improvements across multiple evaluation metrics. Conclusion: Integrating spectral information from an HSI cube offers a promising approach for robust monocular 3D reconstruction in MIS, addressing limitations of texture-poor surgical environments through enhanced spectral-spatial feature discrimination. Our model and dataset are available at https://github.com/alexsaikia/HyKey-Hyperspectral-Keypoint-Detection
Chinese Translation
目的:微创手术(MIS)中的三维重建通过改善可视化、工具跟踪和增强现实,增强了手术指导。然而,传统的基于RGB的关键点检测和匹配流程在手术挑战中表现不佳,例如纹理差和复杂的光照条件。我们研究了使用快照高光谱成像(HSI)是否能在关键点检测和匹配手术场景中提供更好的结果。方法:我们开发了HyKey,一个高光谱关键点检测与描述模型,由一个混合的3D-2D卷积神经网络组成,能够共同提取HSI中的空间-光谱特征。该模型使用合成单应性增强和极几何约束在机器人获取的双摄像头RGB-HSI腹腔镜数据集上进行训练,该数据集包含经过标定的相机姿态的离体器官。我们将性能与已建立的基于RGB的方法进行基准测试,包括SuperPoint和ALIKE。结果:我们的基于HSI的模型在注册的RGB帧上超越了RGB基线,在姿态估计中达到了96.62%的平均匹配准确率和67.18%的平均准确率(在10度时),在多个评估指标上表现出一致的改进。结论:从HSI立方体中整合光谱信息为微创手术中的稳健单目三维重建提供了一种有前景的方法,通过增强的光谱-空间特征区分来解决纹理贫乏的手术环境的局限性。我们的模型和数据集可在https://github.com/alexsaikia/HyKey-Hyperspectral-Keypoint-Detection获取。
cs.CV / 157 / 2604.17451
SegTTA: Training-Free Test-Time Augmentation for Zero-Shot Medical Imaging Segmentation
SegTTA:无训练的测试时增强用于零样本医学影像分割
Abstract
Increasingly advanced data augmentation techniques have greatly aided clinical medical research, increasing data diversity and improving model generalization capabilities. Although most current foundation models exhibit strong generalization abilities, image quality varies due to differences in equipment and operators. To address these challenges, we present SegTTA, a framework that improves medical image segmentation without model retraining by combining four augmentations (gamma correction, contrast enhancement, Gaussian blur, Gaussian noise) with weighted voting across multiple MedSAM2 checkpoints. Experiments demonstrate consistent improvements across three diverse datasets: healthy uterus segmentation, uterine myoma detection, and multi-class hepatic structure segmentation. Ablation studies reveal that large organs benefit from intensity augmentations while small lesions require noise augmentations. The voting threshold controls the coverage-precision trade-off, enabling task-specific optimization for different clinical requirements. Ultimately, on a multi-class hepatic vessel dataset, compared to MedSAM2 baselines, our method achieves an increase of 1.6 in mIoU and 1.9 in aIoU, along with a reduction of approximately 2.0 in HD95. Code will be available at https://github.com/AIGeeksGroup/SegTTA.
Chinese Translation
日益先进的数据增强技术极大地促进了临床医学研究,增加了数据多样性并提高了模型的泛化能力。尽管目前大多数基础模型表现出强大的泛化能力,但由于设备和操作人员的差异,图像质量存在差异。为了解决这些挑战,我们提出了SegTTA,一个通过结合四种增强方法(伽马校正、对比度增强、高斯模糊、高斯噪声)与多个MedSAM2检查点的加权投票,来改善医学图像分割而无需模型重新训练的框架。实验表明,在三个不同的数据集上均取得了一致的改善:健康子宫分割、子宫肌瘤检测和多类肝脏结构分割。消融研究表明,大型器官受益于强度增强,而小病变则需要噪声增强。投票阈值控制了覆盖精度的权衡,从而实现了针对不同临床需求的任务特定优化。最终,在一个多类肝脏血管数据集上,与MedSAM2基线相比,我们的方法在mIoU上提高了1.6,在aIoU上提高了1.9,并且HD95约减少了2.0。代码将发布在 https://github.com/AIGeeksGroup/SegTTA。
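The weighted voting step in the SegTTA abstract, and the way its threshold trades coverage against precision, can be shown with a small sketch. This is an assumed formulation rather than the released code: masks are flat lists of 0/1 pixels, one per augmentation or checkpoint, and `weighted_vote` is a hypothetical helper.

```python
from typing import List, Sequence


def weighted_vote(masks: Sequence[Sequence[int]], weights: Sequence[float], threshold: float) -> List[int]:
    """Fuse binary masks: a pixel is foreground if its weighted support reaches the threshold."""
    if len(masks) != len(weights):
        raise ValueError("one weight per mask is required")
    total = sum(weights)
    fused = []
    for i in range(len(masks[0])):
        # Fraction of the total weight that voted for foreground at pixel i.
        support = sum(w * m[i] for m, w in zip(masks, weights)) / total
        fused.append(1 if support >= threshold else 0)
    return fused


# Three predictions of the same 4-pixel strip, for example from different augmentations.
predictions = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0]]
equal_weights = [1.0, 1.0, 1.0]
permissive = weighted_vote(predictions, equal_weights, threshold=0.3)  # favors coverage
majority = weighted_vote(predictions, equal_weights, threshold=0.5)
strict = weighted_vote(predictions, equal_weights, threshold=0.9)      # favors precision
```

Lowering the threshold keeps pixels that only one prediction supports, while raising it keeps only pixels on which nearly all predictions agree.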
cs.CV / 158 / 2604.17454
HSG: Hyperbolic Scene Graph
HSG:双曲场景图
Abstract
Scene graph representations enable structured visual understanding by modeling objects and their relationships, and have been widely used for multiview and 3D scene reasoning. Existing methods such as MSG learn scene graph embeddings in Euclidean space using contrastive learning and attention-based association. However, Euclidean geometry does not explicitly capture hierarchical entailment relationships between places and objects, limiting the structural consistency of learned representations. To address this, we propose Hyperbolic Scene Graph (HSG), which learns scene graph embeddings in hyperbolic space where hierarchical relationships are naturally encoded through geometric distance. Our results show that HSG improves hierarchical structure quality while maintaining strong retrieval performance. The largest gains are observed in graph-level metrics: HSG achieves a PP IoU of 33.17 and the highest Graph IoU of 33.51, outperforming the best AoMSG variant (25.37) by 8.14, highlighting the effectiveness of hyperbolic representation learning for scene graph modeling. Code: https://github.com/AIGeeksGroup/HSG.
Chinese Translation
场景图表示通过建模对象及其关系,实现结构化的视觉理解,并已广泛应用于多视角和三维场景推理。现有方法如MSG使用对比学习和基于注意力的关联在欧几里得空间中学习场景图嵌入。然而,欧几里得几何并未明确捕捉地点和对象之间的层次蕴含关系,限制了学习表示的结构一致性。为此,我们提出了双曲场景图(Hyperbolic Scene Graph, HSG),该方法在双曲空间中学习场景图嵌入,其中层次关系通过几何距离自然编码。我们的结果表明,HSG在保持强大检索性能的同时改善了层次结构质量。在图级指标中观察到最大的提升:HSG实现了33.17的PP IoU和33.51的最高Graph IoU,超越最佳的AoMSG变体(25.37)8.14,突显了双曲表示学习在场景图建模中的有效性。代码:https://github.com/AIGeeksGroup/HSG。
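To make "hierarchical relationships encoded through geometric distance" concrete, here is a minimal sketch of hyperbolic distance. The HSG abstract does not say which model of hyperbolic space it uses; the Poincaré ball is assumed here purely for illustration, and `poincare_distance` is a hypothetical helper name.

```python
import math
from typing import Sequence


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Geodesic distance between two points inside the unit Poincare ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# The same Euclidean gap of 0.1 is a much longer hyperbolic distance near the boundary.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
```

Distances grow rapidly toward the boundary of the ball, which is why tree-like structures fit well: general nodes (for example, places) can sit near the origin and their many children (for example, objects) can spread out near the boundary.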
cs.CV / 159 / 2604.17455
From Adaptation to Generalization: Adaptive Visual Prompting for Medical Image Segmentation
从适应到泛化:用于医学图像分割的自适应视觉提示
Abstract
Visual prompting has emerged as a powerful method for adapting pre-trained models to new domains without updating model parameters. However, existing prompting methods typically optimize a single prompt per domain and apply it uniformly to all inputs, limiting their ability to generalize under intra- and inter-domain variability, which is especially critical in the medical field. To address this, we propose APEX, an Adaptive Prompt EXtraction framework that retrieves input-specific prompts from a learnable prompt memory. The memory stores diverse, domain-discriminative prompt representations and is queried via domain features extracted from the Fourier spectrum. To learn robust and discriminative domain features, we introduce a novel Low-Frequency Feature Contrastive (LFC) learning framework that clusters representations from the same domain while separating those from different domains. Extensive experiments on two medical segmentation tasks demonstrate that APEX significantly improves generalization across both seen and unseen domains. Furthermore, it complements existing backbones and consistently enhances performance, confirming its effectiveness as a plug-and-play prompting solution in medical fields. The code is available at https://github.com/cetinkayaevren/apex/
Chinese Translation
视觉提示已成为一种强大的方法,可以在不更新模型参数的情况下将预训练模型适应于新领域。然而,现有的提示方法通常针对每个领域优化单一提示,并将其统一应用于所有输入,这限制了它们在领域内和领域间变异下的泛化能力,这在医学领域尤为关键。为了解决这个问题,我们提出了APEX(自适应提示提取)框架,该框架从可学习的提示记忆中检索特定于输入的提示。该记忆存储多样化的、领域区分的提示表示,并通过从傅里叶谱中提取的领域特征进行查询。为了学习鲁棒且具有区分性的领域特征,我们引入了一种新颖的低频特征对比(Low-Frequency Feature Contrastive,LFC)学习框架,该框架将来自同一领域的表示聚类,同时将来自不同领域的表示分离。对两个医学分割任务的广泛实验表明,APEX显著提高了在已见和未见领域中的泛化能力。此外,它可以与任何现有的主干网络互补,并始终提升性能,确认其作为医学领域即插即用的提示解决方案的有效性。代码可在 https://github.com/cetinkayaevren/apex/ 获取。
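The retrieval step in the APEX abstract, querying a prompt memory with a domain feature, can be sketched as a nearest-key lookup. The abstract does not describe the memory layout or the matching rule, so this is an assumed form: the memory is a list of (key, prompt) pairs, cosine similarity is the matching score, and `retrieve_prompt` is a hypothetical helper. The Fourier-based feature extraction is not shown; the query vector is taken as given.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_prompt(domain_feature: Sequence[float],
                    memory: List[Tuple[Sequence[float], str]]) -> str:
    """Return the prompt whose key is most similar to the input's domain feature."""
    _, best_prompt = max(memory, key=lambda entry: cosine(domain_feature, entry[0]))
    return best_prompt


# Hypothetical memory: each key stands for a domain, each value for a stored prompt.
memory = [
    ([1.0, 0.0, 0.0], "prompt_scanner_a"),
    ([0.0, 1.0, 0.0], "prompt_scanner_b"),
    ([0.0, 0.0, 1.0], "prompt_scanner_c"),
]
chosen = retrieve_prompt([0.1, 0.9, 0.2], memory)
```

Because the lookup runs per input, two images from the same nominal domain can still receive different prompts if their domain features differ, which is the contrast with a single fixed prompt per domain.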
cs.CV / 160 / 2604.17472
UniMesh: Unifying 3D Mesh Understanding and Generation
UniMesh:统一的3D网格理解与生成
Abstract
Recent advances in 3D vision have led to specialized models for either 3D understanding (e.g., shape classification, segmentation, reconstruction) or 3D generation (e.g., synthesis, completion, and editing). However, these tasks are often tackled in isolation, resulting in fragmented architectures and representations that hinder knowledge transfer and holistic scene modeling. To address these challenges, we propose UniMesh, a unified framework that jointly learns 3D generation and understanding within a single architecture. First, we introduce a novel Mesh Head that acts as a cross-model interface, bridging diffusion-based image generation with implicit shape decoders. Second, we develop Chain of Mesh (CoM), a geometric instantiation of iterative reasoning that enables user-driven semantic mesh editing through a closed-loop latent, prompting, and regeneration cycle. Third, we incorporate a self-reflection mechanism based on an Actor-Evaluator-Self-reflection triad to diagnose and correct failures in high-level tasks like 3D captioning. Experimental results demonstrate that UniMesh not only achieves competitive performance on standard benchmarks but also unlocks novel capabilities in iterative editing and mutual enhancement between generation and understanding. Code: https://github.com/AIGeeksGroup/UniMesh. Website: https://aigeeksgroup.github.io/UniMesh.
Chinese Translation
最近在3D视觉领域的进展催生了专门针对3D理解(如形状分类、分割、重建)或3D生成(如合成、补全和编辑)的模型。然而,这些任务往往是孤立处理的,导致架构和表征的碎片化,妨碍了知识转移和整体场景建模。为了解决这些挑战,我们提出了UniMesh,一个在单一架构中共同学习3D生成和理解的统一框架。首先,我们引入了一种新颖的Mesh Head,作为跨模型接口,连接基于扩散的图像生成与隐式形状解码器。其次,我们开发了Mesh链(Chain of Mesh, CoM),这是一种几何实例化的迭代推理,能够通过闭环潜在、提示和再生成周期实现用户驱动的语义网格编辑。第三,我们结合了一种基于演员-评估者自我反思三元组的自我反思机制,以诊断和纠正在3D字幕等高级任务中的失败。实验结果表明,UniMesh不仅在标准基准上实现了竞争力的性能,还解锁了在迭代编辑和生成与理解之间的相互增强方面的新能力。代码:https://github.com/AIGeeksGroup/UniMesh。网站:https://aigeeksgroup.github.io/UniMesh。
cs.CV / 161 / 2604.17473
Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
双锚定:解决视觉-语言导航中的状态漂移
Abstract
Vision-Language Navigation (VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models (Video-LLMs) have substantially advanced VLN, they remain highly susceptible to State Drift in long scenarios. In these cases, the agent's internal state drifts away from the true task execution state, leading to aimless wandering and failure to execute essential maneuvers in the instruction. We attribute this failure to two distinct cognitive deficits: Progress Drift, where the agent fails to distinguish completed sub-goals from remaining ones, and Memory Drift, where the agent's history representations degrade, making it lose track of visited landmarks. In this paper, we propose a Dual-Anchoring Framework that explicitly anchors the instruction progress and history representations. First, to address progress drift, we introduce Instruction Progress Anchoring, which supervises the agent to generate structured text tokens that delineate completed versus remaining sub-goals. Second, to mitigate memory drift, we propose Memory Landmark Anchoring, which utilizes a Landmark-Centric World Model to retrospectively predict object-centric embeddings extracted by the Segment Anything Model, compelling the agent to explicitly verify past observations and preserve distinct representations of visited landmarks. To support this framework, we curate two extensive datasets: 3.6 million samples with explicit progress descriptions, and 937k grounded landmark samples for retrospective verification. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our method, achieving a 15.2% improvement in Success Rate and a remarkable 24.7% gain on long-horizon trajectories. To facilitate further research, we will release our code, data generation pipelines, and the collected datasets.
Chinese Translation
视觉-语言导航(Vision-Language Navigation, VLN)要求智能体通过遵循自然语言指令在三维环境中导航。尽管最近的视频大型语言模型(Video Large Language Models, Video-LLMs)在VLN方面取得了显著进展,但在长时间场景中,它们仍然高度易受状态漂移的影响。在这些情况下,智能体的内部状态与真实的任务执行状态偏离,导致无目的游荡并未能执行指令中的重要操作。我们将这种失败归因于两种不同的认知缺陷:进度漂移(Progress Drift),即智能体未能区分已完成的子目标与剩余的子目标;以及记忆漂移(Memory Drift),即智能体的历史表示退化,使其无法追踪已访问的地标。本文提出了一种双锚定框架(Dual-Anchoring Framework),明确锚定指令进度和历史表示。首先,为了解决进度漂移,我们引入了指令进度锚定(Instruction Progress Anchoring),该方法监督智能体生成结构化文本标记,以区分已完成的子目标与剩余的子目标。其次,为了减轻记忆漂移,我们提出了记忆地标锚定(Memory Landmark Anchoring),该方法利用以地标为中心的世界模型(Landmark-Centric World Model)回顾性地预测由Segment Anything Model提取的以对象为中心的嵌入,迫使智能体明确验证过去的观察并保持已访问地标的独特表示。为了支持该框架,我们整理了两个大型数据集:包含360万条带有明确进度描述的样本,以及93.7万条用于回顾验证的地标数据。在模拟和现实环境中的大量实验表明我们的方法优越性,成功率提高了15.2%,在长时间轨迹上取得了显著的24.7%的提升。为了促进进一步的研究,我们将发布我们的代码、数据生成管道以及收集的数据集。
cs.CV / 162 / 2604.17477
Unveiling Deepfakes: A Frequency-Aware Triple Branch Network for Deepfake Detection
揭示深伪技术:一种频率感知的三分支网络用于深伪检测
Abstract
Advanced deepfake technologies are blurring the lines between real and fake, presenting both revolutionary opportunities and alarming threats. While they unlock novel applications in fields like entertainment and education, their malicious use has sparked urgent ethical and societal concerns ranging from identity theft to the dissemination of misinformation. To tackle these challenges, feature analysis using frequency features has emerged as a promising direction for deepfake detection. However, one aspect that has been overlooked so far is that existing methods tend to concentrate on one or a few specific frequency domains, which risks overfitting to particular artifacts and significantly undermines their robustness when facing diverse forgery patterns. Another underexplored aspect we observe is that different features often attend to the same forged region, resulting in redundant feature representations and limiting the diversity of the extracted clues. This may undermine the ability of a model to capture complementary information across different facets, thereby compromising its generalization capability to diverse manipulations. In this paper, we seek to tackle these challenges from two aspects: (1) we propose a triple-branch network that jointly captures spatial and frequency features by learning from both the original image and images reconstructed from different frequency channels, and (2) we mathematically derive feature decoupling and fusion losses grounded in mutual information theory, which encourage the model to focus on task-relevant features across the original image and the images reconstructed from different frequency channels. Extensive experiments on six large-scale benchmark datasets demonstrate that our method consistently achieves state-of-the-art performance. Our code is released at https://github.com/injooker/Unveiling Deepfake.
Chinese Translation
先进的深伪技术正在模糊真实与虚假的界限,带来了革命性的机遇和令人担忧的威胁。尽管它在娱乐和教育等领域开启了新的应用,但其恶意使用引发了从身份盗窃到虚假信息传播等紧迫的伦理和社会问题。为应对这些挑战,基于频率特征的特征分析已成为深伪检测的一个有前景的方向。然而,迄今为止一个被忽视的方面是,现有方法往往集中于一个或几个特定的频率域,这使得它们容易对特定伪造特征过拟合,并在面对多样化的伪造模式时显著削弱其鲁棒性。我们观察到的另一个未被充分探索的方面是,不同特征往往关注于同一伪造区域,导致冗余的特征表示,并限制了提取线索的多样性。这可能削弱模型捕捉不同方面互补信息的能力,从而影响其对多样化操控的泛化能力。在本文中,我们试图从两个方面解决这些挑战:(1) 我们提出了一种三分支网络,通过学习原始图像和由不同频率通道重建的图像,联合捕捉空间和频率特征;(2) 我们基于互信息理论数学推导了特征解耦和融合损失,增强模型关注原始图像和由不同频率通道重建的图像中的任务相关特征。在六个大规模基准数据集上的大量实验表明,我们的方法始终实现了最先进的性能。我们的代码已发布在 https://github.com/injooker/Unveiling Deepfake。
cs.CV / 163 / 2604.17488
AutoVQA-G: Self-Improving Agentic Framework for Automated Visual Question Answering and Grounding Annotation
AutoVQA-G:用于自动化视觉问答和基础注释的自我提升代理框架
Abstract
Manual annotation of high-quality visual question answering with grounding (VQA-G) datasets, which pair visual questions with evidential grounding, is crucial for advancing vision-language models (VLMs), but remains unscalable. Existing automated methods are often hindered by two key issues: (1) inconsistent data fidelity due to model hallucinations; (2) brittle verification mechanisms based on simple heuristics. To address these limitations, we introduce AutoVQA-G, a self-improving agentic framework for automated VQA-G annotation. AutoVQA-G employs an iterative refinement loop where a Consistency Evaluation module uses Chain-of-Thought (CoT) reasoning for fine-grained visual verification. Based on this feedback, a memory-augmented Prompt Optimization agent analyzes critiques from failed samples to progressively refine generation prompts. Our experiments show that AutoVQA-G generates VQA-G datasets with superior visual grounding accuracy compared to leading multimodal LLMs, offering a promising approach for creating high-fidelity data to facilitate more robust VLM training and evaluation. Code: https://github.com/rohnson1999/AutoVQA-G
Chinese Translation
高质量视觉问答与基础(VQA-G)数据集的手动注释至关重要,它将视觉问题与证据基础相结合,推动了视觉语言模型(VLMs)的发展,但仍然难以扩展。现有的自动化方法常常受到两个关键问题的制约:(1)由于模型幻觉导致的数据一致性不足;(2)基于简单启发式的脆弱验证机制。为了解决这些局限性,我们提出了AutoVQA-G,一个用于自动化VQA-G注释的自我提升代理框架。AutoVQA-G采用迭代精炼循环,其中一致性评估模块利用思维链(Chain-of-Thought, CoT)推理进行细粒度视觉验证。基于这些反馈,增强记忆的提示优化代理分析来自失败样本的批评意见,以逐步优化生成提示。我们的实验表明,与领先的多模态大型语言模型(LLMs)相比,AutoVQA-G生成的VQA-G数据集在视觉基础准确性方面具有更优表现,为创建高保真数据以促进更稳健的VLM训练和评估提供了有前景的方法。代码链接: https://github.com/rohnson1999/AutoVQA-G
cs.CV / 164 / 2604.17492
Coevolving Representations in Joint Image-Feature Diffusion
联合图像-特征扩散中的共演化表示
Abstract
Joint image-feature generative modeling has recently emerged as an effective strategy for improving diffusion training by coupling low-level VAE latents with high-level semantic features extracted from pre-trained visual encoders. However, existing approaches rely on a fixed representation space, constructed independently of the generative objective and kept unchanged during training. We argue that the representation space guiding diffusion should itself adapt to the generative task. To this end, we propose Coevolving Representation Diffusion (CoReDi), a framework in which the semantic representation space evolves during training by learning a lightweight linear projection jointly with the diffusion model. While naively optimizing this projection leads to degenerate solutions, we show that stable coevolution can be achieved through a combination of stop-gradient targets, normalization, and targeted regularization that prevents feature collapse. This formulation enables the semantic space to progressively specialize to the needs of image synthesis, improving its complementarity with image latents. We apply CoReDi to both VAE latent diffusion and pixel-space diffusion, demonstrating that adaptive semantic representations improve generative modeling across both settings. Experiments show that CoReDi achieves faster convergence and higher sample quality compared to joint diffusion models operating in fixed representation spaces.
Chinese Translation
联合图像-特征生成建模最近被提出作为一种有效的策略,通过将低层次的变分自编码器(VAE)潜变量与从预训练视觉编码器提取的高层次语义特征结合,从而改善扩散训练。然而,现有的方法依赖于一个固定的表示空间,该空间是独立于生成目标构建的,并在训练过程中保持不变。我们认为,指导扩散的表示空间应该根据生成任务进行自我适应。为此,我们提出了共演化表示扩散(Coevolving Representation Diffusion, CoReDi),这是一个框架,其中语义表示空间在训练过程中通过与扩散模型共同学习一个轻量级线性投影而演变。虽然天真地优化这个投影会导致退化解,但我们展示了通过组合停止梯度目标、归一化和针对性的正则化来防止特征崩溃,可以实现稳定的共演化。这种表述使得语义空间能够逐步专门化以满足图像合成的需求,从而提高其与图像潜变量的互补性。我们将CoReDi应用于VAE潜变量扩散和像素空间扩散,证明自适应语义表示在这两种设置中都改善了生成建模。实验表明,与在固定表示空间中操作的联合扩散模型相比,CoReDi实现了更快的收敛和更高的样本质量。
cs.CV / 165 / 2604.17500
Edit Fidelity Field: Semantics-Aware Region Isolation for Training-Free Scene Text Editing
编辑保真度场:语义感知区域隔离的无训练场景文本编辑
Abstract
Scene text editing (STE) has achieved remarkable progress in accurately rendering target text through diffusion-based methods. However, we identify a critical yet overlooked problem: edit spillover -- when editing a target text region, existing methods inadvertently modify non-target regions, particularly neighboring text. Through systematic evaluation on 50 real-world scenes across four categories, we reveal that state-of-the-art diffusion editing models exhibit a spillover rate of 94%, meaning nearly all non-target text regions are altered during editing. To address this, we propose the Edit Fidelity Field (EFF), a semantics-aware continuous field that controls per-pixel editing fidelity. Unlike binary masks, EFF leverages OCR-detected text regions to construct a four-zone field: Edit Core (fully editable), Transition Zone (smooth decay), Protected Zone (non-target text, explicitly locked), and Background (strictly preserved). EFF operates as a training-free, model-agnostic post-processing module applicable to any diffusion-based STE method. We further propose per-region spillover quantification, a novel evaluation protocol that measures edit leakage at each non-target text region individually. Experiments demonstrate that EFF reduces spillover rate from 94% to 25% while improving non-target region preservation by +91.4 dB PSNR.
Chinese Translation
场景文本编辑(STE)在通过基于扩散的方法准确渲染目标文本方面取得了显著进展。然而,我们发现一个关键但被忽视的问题:编辑溢出——在编辑目标文本区域时,现有方法不经意地修改了非目标区域,特别是相邻文本。通过对四个类别的50个真实场景进行系统评估,我们揭示了最先进的扩散编辑模型的溢出率高达94%,这意味着几乎所有非目标文本区域在编辑过程中都被改变。为了解决这个问题,我们提出了编辑保真度场(Edit Fidelity Field, EFF),这是一种语义感知的连续场,用于控制每个像素的编辑保真度。与二进制掩码不同,EFF利用OCR检测到的文本区域构建了一个四区场:编辑核心(完全可编辑)、过渡区(平滑衰减)、保护区(非目标文本,明确锁定)和背景(严格保留)。EFF作为一种无训练、模型无关的后处理模块,适用于任何基于扩散的STE方法。我们进一步提出了每区域溢出量化,这是一种新颖的评估协议,单独测量每个非目标文本区域的编辑泄漏。实验表明,EFF将溢出率从94%降低到25%,同时非目标区域的保留率提高了+91.4 dB PSNR。
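The four-zone field described above can be read as a per-pixel blend weight between the edited and the original image. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function names `fidelity_weight` and `blend`, the linear decay in the Transition Zone, and the `transition_width` parameter are all hypothetical, since the abstract specifies only a "smooth decay".

```python
def fidelity_weight(dist_to_core, in_protected, transition_width=4.0):
    """Return an edit weight in [0, 1] for one pixel (hypothetical zone rules).

    dist_to_core: distance in pixels from the target text box (<= 0 means inside it).
    in_protected: True if the pixel lies in an OCR-detected non-target text region.
    """
    if in_protected:
        return 0.0  # Protected Zone: non-target text is locked
    if dist_to_core <= 0.0:
        return 1.0  # Edit Core: fully editable
    if dist_to_core < transition_width:
        # Transition Zone: linear decay is an assumption standing in for "smooth decay"
        return 1.0 - dist_to_core / transition_width
    return 0.0  # Background: preserved


def blend(original, edited, weights):
    """Per-pixel convex blend: w * edited + (1 - w) * original."""
    return [w * e + (1.0 - w) * o for o, e, w in zip(original, edited, weights)]


# One row of pixels: core, transition, protected text, background.
weights = [
    fidelity_weight(0.0, False),
    fidelity_weight(2.0, False),
    fidelity_weight(1.0, True),
    fidelity_weight(9.0, False),
]
row = blend([10.0, 10.0, 10.0, 10.0], [20.0, 20.0, 20.0, 20.0], weights)
```

Because the weight is zero in the Protected Zone and the Background, those pixels are copied from the original image unchanged, which is what makes this usable as a post-processing step on top of any editor's output.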
cs.CV / 166 / 2604.17504
RS-HyRe-R1: A Hybrid Reward Mechanism to Overcome Perceptual Inertia for Remote Sensing Images Understanding
RS-HyRe-R1:一种混合奖励机制以克服遥感图像理解中的感知惯性
Abstract
Reinforcement learning (RL) post-training substantially improves remote sensing vision-language models (RS-VLMs). However, when handling complex remote sensing imagery (RSI) requiring exhaustive visual scanning, models tend to rely on localized salient cues for rapid inference. We term this RL-induced bias "perceptual inertia". Driven by reward maximization, models favor quick outcome fitting, leading to two limitations: cognitively, overreliance on specific features impedes complete evidence construction; operationally, models struggle to flexibly shift visual focus across tasks. To address this bias and encourage comprehensive visual evidence mining, we propose RS-HyRe-R1, a hybrid reward framework for RSI understanding. It introduces: (1) a spatial reasoning activation reward that enforces structured visual reasoning; (2) a perception correctness reward that provides adaptive quality anchors across RS tasks, ensuring accurate geometric and semantic alignment; and (3) a visual-semantic path evolution reward that penalizes repetitive reasoning and promotes exploration of complementary cues to build richer evidence chains. Experiments show RS-HyRe-R1 effectively mitigates "perceptual inertia", encouraging deeper, more diverse reasoning. With only 3B parameters, it achieves state-of-the-art performance on REC, OVD, and VQA tasks, outperforming models up to 7B parameters. It also demonstrates strong zero-shot generalization, surpassing the second-best model by 3.16%, 3.97%, and 2.72% on VQA, OVD, and REC, respectively. Code and datasets are available at https://github.com/geox-lab/RS-HyRe-R1.
Chinese Translation
强化学习(RL)后训练显著提升了遥感视觉语言模型(RS-VLMs)的性能。然而,在处理需要全面视觉扫描的复杂遥感图像(RSI)时,模型往往依赖于局部显著线索进行快速推理。我们将这种由RL引起的偏差称为“感知惯性”。受奖励最大化的驱动,模型倾向于快速结果拟合,这导致了两个局限性:在认知上,过度依赖特定特征妨碍了完整证据的构建;在操作上,模型在任务间灵活调整视觉焦点时面临困难。为了解决这一偏差并鼓励全面的视觉证据挖掘,我们提出了RS-HyRe-R1,一种用于RSI理解的混合奖励框架。该框架引入了:(1)一种空间推理激活奖励,强制执行结构化的视觉推理;(2)一种感知正确性奖励,为RS任务提供自适应质量锚点,确保几何和语义的准确对齐;(3)一种视觉-语义路径演化奖励,惩罚重复推理并促进对互补线索的探索,以构建更丰富的证据链。实验表明,RS-HyRe-R1有效缓解了“感知惯性”,鼓励更深入和多样的推理。仅用3B参数,它在REC、OVD和VQA任务上达到了最先进的性能,超越了高达7B参数的模型。它还展示了强大的零样本泛化能力,在VQA、OVD和REC任务上分别超越第二名模型3.16%、3.97%和2.72%。代码和数据集可在https://github.com/geox-lab/RS-HyRe-R1获取。
cs.CV / 167 / 2604.17542
Dual Strategies for Test-Time Adaptation
测试时适应的双重策略
Abstract
Conventional test-time adaptation (TTA) approaches typically adapt the model using only a small fraction of test samples, often those with low-entropy predictions, thereby failing to fully leverage the available information in the test distribution. This paper introduces DualTTA, a novel framework that improves performance under distribution shifts by utilizing a larger and more diverse set of test samples. DualTTA identifies two distinct groups: one where the model's predictions are likely consistent with the underlying semantics, and another where predictions are likely incorrect. For the first group, it minimizes prediction entropy to reinforce reliable decisions; for the second, it maximizes entropy to suppress overconfident errors and unlearn spurious behavior. These groups are adaptively selected using a new reliability criterion that measures prediction stability under both semantic-preserving and semantic-altering transformations, addressing the limitations of purely entropy-based selection. We further provide theoretical analysis and empirical justification showing that our approach enables a tighter separation between reliable and unreliable samples, in the context of their suitability for adaptation, leading to provably more effective model updates.
Chinese Translation
传统的测试时适应(TTA)方法通常仅使用少量测试样本进行模型适应,通常是那些具有低熵预测的样本,因此未能充分利用测试分布中的可用信息。本文提出了DualTTA,一个新颖的框架,通过利用更大且更具多样性的测试样本集,在分布变化下提高性能。DualTTA识别出两个不同的组:一个是模型的预测可能与潜在语义一致的组,另一个是预测可能不正确的组。对于第一个组,它最小化预测熵以强化可靠的决策;对于第二个组,它最大化熵以抑制过于自信的错误并消除虚假的行为。这些组通过一种新的可靠性标准自适应选择,该标准在语义保持和语义改变的变换下测量预测的稳定性,从而解决了纯熵选择的局限性。我们进一步提供理论分析和实证证明,表明我们的方法能够在适应性适用性方面更有效地将可靠样本与不可靠样本区分开,从而导致可证明的更有效的模型更新。
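The two-sided objective described above (minimize entropy on samples judged reliable, maximize it on samples judged unreliable) can be sketched as a single scalar loss. This is a minimal illustration, assuming the reliability flags are already given; the paper's transformation-based reliability criterion is not reproduced here, and the names `dual_entropy_loss` and `weight` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def dual_entropy_loss(probs_batch, reliable_flags, weight=1.0):
    """Mean entropy of reliable samples minus weighted mean entropy of unreliable ones.

    Minimizing this value sharpens predictions flagged as reliable and
    flattens predictions flagged as unreliable.
    """
    reliable = [entropy(p) for p, r in zip(probs_batch, reliable_flags) if r]
    unreliable = [entropy(p) for p, r in zip(probs_batch, reliable_flags) if not r]
    loss = 0.0
    if reliable:
        loss += sum(reliable) / len(reliable)
    if unreliable:
        loss -= weight * sum(unreliable) / len(unreliable)
    return loss


# Two samples: a confident one flagged reliable, a uniform one flagged unreliable.
batch = [[1.0, 0.0], [0.5, 0.5]]
loss = dual_entropy_loss(batch, [True, False])
```

In a real adaptation loop the probabilities would come from the model's softmax output and the loss would be backpropagated through an autodiff framework; plain Python is used here only to make the sign convention explicit.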
cs.CV / 168 / 2604.17565
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo:通过视频模型统一几何引导以实现相机可控的图像编辑
Abstract
Camera-controllable image editing aims to synthesize novel views of a given scene under varying camera poses while strictly preserving cross-view geometric consistency. However, existing methods typically rely on fragmented geometric guidance, such as only injecting point clouds at the representation level despite models containing multiple levels, and are mainly based on image diffusion models that operate on discrete view mappings. These two limitations jointly lead to geometric drift and structural degradation under continuous camera motion. We observe that while leveraging video models provides continuous viewpoint priors for camera-controllable image editing, they still struggle to form stable geometric understanding if geometric guidance remains fragmented. To systematically address this, we inject unified geometric guidance across three levels that jointly determine the generative output: representation, architecture, and loss function. To this end, we propose UniGeo, a novel camera-controllable editing framework. Specifically, at the representation level, UniGeo incorporates a frame-decoupled geometric reference injection mechanism to provide robust cross-view geometry context. At the architecture level, it introduces geometric anchor attention to align multi-view features. At the loss function level, it proposes a trajectory-endpoint geometric supervision strategy to explicitly reinforce the structural fidelity of target views. Comprehensive experiments across multiple public benchmarks, encompassing both extensive and limited camera motion settings, demonstrate that UniGeo significantly outperforms existing methods in both visual quality and geometric consistency.
Chinese Translation
相机可控的图像编辑旨在在不同的相机姿态下合成给定场景的新视图,同时严格保持跨视图的几何一致性。然而,现有的方法通常依赖于碎片化的几何引导,例如仅在表示层面注入点云,尽管模型包含多个层次,并且主要基于在离散视图映射上操作的图像扩散模型。这两种限制共同导致在连续相机运动下的几何漂移和结构退化。我们观察到,虽然利用视频模型为相机可控的图像编辑提供了连续的视点先验,但如果几何引导仍然是碎片化的,它们仍然难以形成稳定的几何理解。为系统性地解决这一问题,我们在三个层次上注入统一的几何引导,这些层次共同决定生成输出:表示层、架构层和损失函数层。为此,我们提出了UniGeo,一个新颖的相机可控编辑框架。具体而言,在表示层面,UniGeo结合了帧解耦的几何参考注入机制,以提供稳健的跨视图几何上下文。在架构层面,它引入了几何锚点注意力,以对齐多视图特征。在损失函数层面,它提出了一种轨迹端点几何监督策略,以明确增强目标视图的结构保真度。在多个公共基准上的全面实验,涵盖了广泛和有限的相机运动设置,证明UniGeo在视觉质量和几何一致性方面显著优于现有方法。
cs.CV / 169 / 2604.17567
Multi-Camera Self-Calibration in Sports Motion Capture: Leveraging Human and Stick Poses
运动捕捉中的多摄像头自标定:利用人体和杆状物体姿态
Abstract
Multi-camera systems are widely employed in sports to capture the 3D motion of athletes and equipment, yet calibrating their extrinsic parameters remains costly and labor-intensive. We introduce an efficient, tool-free method for multi-camera extrinsic calibration tailored to sports involving stick-like implements (e.g., golf clubs, bats, hockey sticks). Our approach jointly exploits two complementary cues from synchronized multi-camera videos: (i) human body keypoints with unknown metric scale and (ii) a rigid stick-like implement of known length. We formulate a three-stage optimization pipeline that refines camera extrinsics, reconstructs human and stick trajectories, and resolves global scale via the stick-length constraint. Our method achieves accurate extrinsic calibration without dedicated calibration tools. To benchmark this task, we present the first dataset for multi-camera self-calibration in stick-based sports, consisting of synthetic sequences across four sports categories with 3 to 10 cameras. Comprehensive experiments demonstrate that our method delivers SOTA performance, achieving low rotation and translation errors. Our project page: https://fandulu.github.io/sport_stick_multi_cam_calib/.
Chinese Translation
多摄像头系统在体育运动中被广泛应用于捕捉运动员和器材的三维运动,但其外部参数的标定仍然成本高昂且劳动密集。我们提出了一种高效的、无工具的方法,专门针对涉及杆状工具(如高尔夫球杆、棒球棒、冰球杆)的运动进行多摄像头外部标定。我们的方法共同利用来自同步多摄像头视频的两个互补线索:(i)具有未知度量尺度的人体关键点和(ii)已知长度的刚性杆状工具。我们构建了一个三阶段的优化流程,精炼摄像头的外部参数,重建人体和杆的轨迹,并通过杆长约束解决全局尺度。我们的方法在没有专用标定工具的情况下实现了准确的外部标定。为了基准测试这一任务,我们首次提出了一个用于杆状运动的多摄像头自标定数据集,包含四个运动类别的合成序列,摄像头数量从3到10个。全面的实验表明,我们的方法在旋转和位移误差方面达到了最先进的性能。我们的项目页面:https://fandulu.github.io/sport_stick_multi_cam_calib/
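The scale-resolution step can be illustrated with a small sketch: a reconstruction from body keypoints is only defined up to an unknown scale, and the known stick length fixes it. The closed-form mean-ratio estimate below is an assumption chosen for simplicity (the abstract describes an optimization stage, not this formula), and `resolve_scale` and `rescale` are hypothetical names.

```python
import math


def resolve_scale(stick_endpoints, known_length):
    """Estimate the metric scale from reconstructed stick endpoints.

    stick_endpoints: list of (tip, butt) 3D point pairs, one per frame,
    expressed in the up-to-scale reconstruction.
    known_length: true stick length in metres.
    """
    lengths = [math.dist(a, b) for a, b in stick_endpoints]
    mean_length = sum(lengths) / len(lengths)
    return known_length / mean_length


def rescale(points, scale):
    """Apply the global scale to 3D points (camera centres, joints, ...)."""
    return [tuple(scale * c for c in p) for p in points]


# Two frames in which the stick is reconstructed 2 units long; its true length is 1 m.
frames = [((0.0, 0.0, 0.0), (0.0, 0.0, 2.0)), ((1.0, 0.0, 0.0), (1.0, 2.0, 0.0))]
scale = resolve_scale(frames, known_length=1.0)
metric_points = rescale([(2.0, 4.0, 6.0)], scale)
```

Averaging over frames reduces the influence of per-frame keypoint noise; a robust variant might use the median instead.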
cs.CV / 170 / 2604.17570
PBSBench: A Multi-Level Vision-Language Framework and Benchmark for Hematopathology Whole Slide Image Interpretation
PBSBench:一种多层次视觉-语言框架及其在血液病理全切片图像解读中的基准
Abstract
Peripheral Blood Smear (PBS) is a critical microscopic examination in hematopathology that yields whole-slide imaging (WSI). Unlike solid tissue pathology, PBS interpretation focuses on individual cell morphologies rather than tissue architecture, making it distinct in both visual characteristics and diagnostic reasoning. However, current multimodal large language models (MLLMs) for pathology are primarily developed on solid-tissue WSIs and struggle to generalize to PBS. To bridge this gap, we construct PBSInstr, the first vision-language dataset for PBS interpretation, comprising 353 PBS WSIs paired with microscopic impression paragraphs and 29k cell-level image crops annotated with cell type labels and morphological descriptions. To facilitate instruction tuning, PBSInstr further includes 27k question-answer (QA) pairs for cell crops and 1,286 QA pairs for PBS slides. Building upon PBSInstr, we develop PBS-VL, a hematopathology-tailored vision-language model for multi-level PBS interpretation at both cell and slide levels. To comprehensively evaluate PBS understanding, we construct PBSBench, a visual question answering (VQA) benchmark featuring four question categories and six PBS interpretation tasks. Experiments show that PBS-VL outperforms existing general-purpose and pathology MLLMs, underscoring the value of PBS-specific data. We release our code, datasets, and model weights to facilitate future research. Our proposed framework lays the foundation for developing practical AI assistants supporting decision-making in hematopathology.
Chinese Translation
外周血涂片(PBS)是血液病理学中一项重要的显微镜检查,产生全切片图像(WSI)。与固体组织病理学不同,PBS 解读侧重于个体细胞形态而非组织结构,使其在视觉特征和诊断推理上具有独特性。然而,目前针对病理学的多模态大型语言模型(MLLMs)主要是在固体组织 WSI 上开发,难以推广到 PBS。为了解决这一问题,我们构建了 PBSInstr,这是首个用于 PBS 解读的视觉-语言数据集,包含 353 个 PBS WSI,配有显微镜印象段落,以及 29,000 个带有细胞类型标签和形态描述的细胞级图像裁剪。为了促进指令调优,PBSInstr 还包括 27,000 个针对细胞裁剪的问题-答案(QA)对和 1,286 个针对 PBS 切片的 QA 对。在 PBSInstr 的基础上,我们开发了 PBS-VL,这是一个针对血液病理学的视觉-语言模型,旨在实现细胞和切片层面的多层次 PBS 解读。为了全面评估 PBS 理解能力,我们构建了 PBSBench,这是一个视觉问答(VQA)基准,包含四个问题类别和六个 PBS 解读任务。实验表明,PBS-VL 在现有通用和病理 MLLMs 中表现优越,突显了 PBS 特定数据的价值。我们发布了代码、数据集和模型权重,以促进未来的研究。我们提出的框架为开发支持血液病理学决策的实用 AI 助手奠定了基础。
cs.CV / 171 / 2604.17585
DGSSM: Diffusion guided state-space models for multimodal salient object detection
DGSSM:用于多模态显著目标检测的扩散引导状态空间模型
Abstract
Salient object detection (SOD) requires modeling both long-range contextual dependencies and fine-grained structural details, which remains challenging for convolutional, transformer-based, and Mamba-based state space models. While recent Mamba-based state space approaches enable efficient global reasoning, they often struggle to recover precise object boundaries. In contrast, diffusion models capture strong structural priors through iterative denoising, but their use in discriminative dense prediction is still limited due to computational cost and integration challenges. In this work, we propose DGSSM, a diffusion-guided state space (Mamba) framework that formulates multimodal salient object detection as a progressive denoising process. The framework integrates diffusion structural priors with multi-scale state space encoding, adaptive saliency prompting, and an iterative Mamba diffusion refinement mechanism to improve boundary accuracy. A boundary-aware refinement head and self-distillation strategy further enhance spatial coherence and feature consistency. Extensive experiments on 13 public benchmarks across RGB, RGB-D, and RGB-T settings demonstrate that DGSSM consistently outperforms state-of-the-art methods across multiple evaluation metrics while maintaining a compact model size. These results suggest that diffusion-guided state space modeling is an effective and generalizable paradigm for multimodal dense prediction tasks.
Chinese Translation
显著目标检测(SOD)需要建模长距离上下文依赖关系和细粒度结构细节,这对基于卷积、变换器和Mamba的状态空间模型仍然具有挑战性。尽管最近的基于Mamba的状态空间方法能够实现高效的全局推理,但它们在恢复精确的目标边界方面往往存在困难。相比之下,扩散模型通过迭代去噪捕捉强大的结构先验,但由于计算成本和集成挑战,其在区分性密集预测中的应用仍然有限。在本研究中,我们提出了DGSSM,一个扩散引导的状态空间(Mamba)框架,将多模态显著目标检测表述为一个渐进去噪过程。该框架将扩散结构先验与多尺度状态空间编码、自适应显著性提示和迭代Mamba扩散精炼机制相结合,以提高边界准确性。边界感知精炼头和自蒸馏策略进一步增强了空间一致性和特征一致性。在RGB、RGB-D和RGB-T设置下的13个公共基准上进行的广泛实验表明,DGSSM在多个评估指标上始终优于最先进的方法,同时保持紧凑的模型大小。这些结果表明,扩散引导的状态空间建模是多模态密集预测任务的有效且可推广的范式。
cs.CV / 172 / 2604.17623
ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes
ViPS:基于视频的信息化姿态空间用于自动绑定网格
Abstract
Kinematic rigs provide a structured interface for articulating 3D meshes, but they lack an inherent representation of the plausible manifold of joint configurations for a given asset. Without such a pose space, stochastic sampling or manual manipulation of raw rig parameters often leads to semantic or geometric violations, such as anatomical hyperextension and non-physical self-intersections. We propose Video-informed Pose Spaces (ViPS), a feed-forward framework that discovers the latent distribution of valid articulations for auto-rigged meshes by distilling motion priors from a pretrained video diffusion model. Unlike existing methods that rely on scarce artist-authored 4D datasets, ViPS transfers generative video priors into a universal distribution over a given rig parameterization. Differentiable geometric validators applied to the skinned mesh enforce asset-specific validity without requiring manual regularizers. Our model learns a smooth, compact, and controllable pose space that supports diverse sampling, manifold projection for inverse kinematics, and temporally coherent trajectories for keyframing. Furthermore, the distilled 3D pose samples serve as precise semantic proxies for guiding video diffusion, effectively closing the loop between generative 2D priors and structured 3D kinematic control. Our evaluations show that ViPS, trained solely on video priors, matches the performance of state-of-the-art methods trained on synthetic artist-created 4D data in both plausibility and diversity. Most importantly, as a universal model, ViPS demonstrates robust zero-shot generalization to out-of-distribution species and unseen skeletal topologies.
Chinese Translation
运动学绑定为三维网格提供了一个结构化的接口,但它缺乏对给定资产的关节配置的合理流形的内在表示。没有这样的姿态空间,随机采样或手动操作原始绑定参数往往会导致语义或几何上的违规,例如解剖学上的过度伸展和非物理的自交。我们提出了视频信息化姿态空间(ViPS),这是一个前馈框架,通过从预训练的视频扩散模型中提取运动先验,发现自动绑定网格的有效关节的潜在分布。与依赖稀缺的艺术家创作的四维数据集的现有方法不同,ViPS将生成性视频先验转化为给定绑定参数化上的通用分布。应用于蒙皮网格的可微几何验证器在不需要手动正则化的情况下强制资产特定的有效性。我们的模型学习了一个平滑、紧凑且可控的姿态空间,支持多样的采样、用于逆运动学的流形投影,以及用于关键帧的时间一致轨迹。此外,提取的三维姿态样本作为引导视频扩散的精确语义代理,有效地闭合了生成性二维先验与结构化三维运动学控制之间的循环。我们的评估表明,ViPS仅基于视频先验训练,其性能在合理性和多样性方面与基于合成艺术家创建的四维数据训练的最先进方法相匹配。最重要的是,作为一个通用模型,ViPS在分布外物种和未见骨骼拓扑上表现出强大的零样本泛化能力。
cs.CV / 173 / 2604.17625
FlowC2S: Flowing from Current to Succeeding Frames for Fast and Memory-Efficient Video Continuation
FlowC2S:从当前帧流向后续帧的快速且节省内存的视频延续方法
Abstract
This paper introduces a novel methodology for generating fast and memory-efficient video continuations. Our method, dubbed FlowC2S, fine-tunes a pre-trained text-to-video flow model to learn a vector field between the current and succeeding video chunks. Two design choices are key. First, we introduce inherent optimal couplings, utilizing temporally adjacent video chunks during training as a practical proxy for true optimal couplings, resulting in straighter flows. Second, we incorporate target inversion, injecting the inverted latent of the target chunk into the input representation to strengthen correspondences and improve visual fidelity. By flowing directly from current to succeeding frames, instead of the common combination of current frames with noise to generate a video continuation, we reduce the dimensionality of the model input by a factor of two. The proposed method, fine-tuned from LTXV and Wan, surpasses the state-of-the-art scores across quantitative evaluations with FID and FVD, with as few as five neural function evaluations.
Chinese Translation
本文介绍了一种生成快速且节省内存的视频延续的新方法。我们的方法称为FlowC2S,微调了一个预训练的文本到视频流模型,以学习当前视频片段与后续视频片段之间的向量场。两个设计选择至关重要。首先,我们引入了固有的最优耦合,在训练过程中利用时间上相邻的视频片段作为真实最优耦合的实际代理,从而实现更直的流动。其次,我们结合了目标反转,将目标片段的反向潜变量注入输入表示中,以增强对应关系并提高视觉保真度。通过直接从当前帧流向后续帧,而不是常见的将当前帧与噪声结合生成视频延续的方法,我们将模型输入的维度减少了一半。所提出的方法经过LTXV和Wan的微调,在FID和FVD的定量评估中超越了当前的最先进水平,所需的神经函数评估次数少至五次。
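To make the "few function evaluations" point concrete, the sketch below integrates a velocity field from a source latent toward a target latent with a fixed-step Euler solver. It is an illustration only: `euler_integrate` and the toy constant `velocity` are hypothetical stand-ins for the fine-tuned flow model, and the abstract does not state which solver the authors use.

```python
def euler_integrate(x0, velocity, num_steps=5):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler.

    x0: source latent (here, the current video chunk) as a flat list of floats.
    velocity: callable (x, t) -> list of floats; one call per step.
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy case: a perfectly straight flow has constant velocity (target - source),
# so Euler integration reaches the target regardless of the step count.
source = [0.0, 0.0]
target = [1.0, -2.0]
result = euler_integrate(source, lambda x, t: [b - a for a, b in zip(source, target)])
```

The toy case shows why straighter flows matter: the closer the learned field is to constant along each trajectory, the smaller the discretization error of a coarse solver.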
cs.CV / 174 / 2604.17629
BioVLM: Routing Prompts, Not Parameters, for Cross-Modality Generalization in Biomedical VLMs
BioVLM:在生物医学视觉语言模型中通过路由提示而非参数实现跨模态泛化
Abstract
Pretrained biomedical vision-language models (VLMs) such as BioMedCLIP perform well on average but often degrade on challenging modalities where inter-class margins are small and acquisition-specific variations are pronounced, especially under few-shot supervision and when modality priors differ from pretraining corpora substantially. We propose BioVLM, a prompt-learning framework that improves cross-domain generalization without extensive backbone fine-tuning. BioVLM learns a diverse prompt bank and introduces dynamic prompt selection: for each input, it selects the most discriminative prompts via a low-entropy criterion on the predictive distribution, effectively coupling sparse few-shot evidence with rich LLM semantic priors. To strengthen this coupling, we distill high-confidence LLM-derived attributes and enforce robust knowledge transfer through strong/weak augmentation consistency. At test time, BioVLM adapts by choosing modality-appropriate prompts, enabling transfer to unseen categories and domains, while keeping training lightweight and inference efficient. On 11 MedMNIST+ 2D datasets, BioVLM achieves new state of the art across three distinct generalization settings. Codes are available at https://github.com/mainaksingha01/BioVLM.
Chinese Translation
预训练的生物医学视觉语言模型(VLMs),如BioMedCLIP,通常在平均水平上表现良好,但在类别间边界较小且获取特定变化显著的挑战性模态下,尤其是在少量样本监督和模态先验与预训练语料显著不同的情况下,性能往往下降。我们提出了BioVLM,一个通过提示学习框架,旨在改善跨领域泛化,而无需进行广泛的主干网络微调。BioVLM学习一个多样化的提示库,并引入动态提示选择:对于每个输入,它通过预测分布上的低熵标准选择最具区分性的提示,有效地将稀疏的少量样本证据与丰富的LLM(大语言模型)语义先验相结合。为了增强这种结合,我们提炼高置信度的LLM派生属性,并通过强/弱增强一致性来强制进行稳健的知识转移。在测试时,BioVLM通过选择适合模态的提示进行适应,从而实现对未见类别和领域的迁移,同时保持训练轻量和推理高效。在11个MedMNIST+ 2D数据集上,BioVLM在三种不同的泛化设置下实现了新的最先进水平。代码可在 https://github.com/mainaksingha01/BioVLM 获取。
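The low-entropy selection rule can be sketched as follows: each prompt in the bank yields a predictive distribution for the input, and the prompts whose distributions are most peaked are kept. This is a minimal sketch under assumptions, not the released code: the names `select_prompts` and `fuse`, the top-`k` rule, and the simple averaging of the selected distributions are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_prompts(per_prompt_probs, k=2):
    """Return indices of the k prompts with the lowest predictive entropy."""
    ranked = sorted(range(len(per_prompt_probs)),
                    key=lambda i: entropy(per_prompt_probs[i]))
    return ranked[:k]


def fuse(per_prompt_probs, selected):
    """Average the class probabilities of the selected prompts."""
    num_classes = len(per_prompt_probs[0])
    return [sum(per_prompt_probs[i][c] for i in selected) / len(selected)
            for c in range(num_classes)]


# Three prompts, two classes: prompt 0 is uninformative for this input.
probs = [[0.5, 0.5], [0.9, 0.1], [0.7, 0.3]]
chosen = select_prompts(probs, k=2)
fused = fuse(probs, chosen)
```

Selection happens per input, so different images can route to different prompts while the backbone parameters stay untouched.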
cs.CV / 175 / 2604.17651
Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception
基础设施中心的世界模型:桥接路边感知的时间深度与空间广度
Abstract
World models, generative AI systems that simulate how environments evolve, are transforming autonomous driving, yet all existing approaches adopt an ego-vehicle perspective, leaving the infrastructure viewpoint unexplored. We argue that infrastructure-centric world models offer a fundamentally complementary capability: the bird's-eye, multi-sensor, persistent viewpoint that roadside systems uniquely possess. Central to our thesis is a spatio-temporal complementarity: fixed roadside sensors excel at temporal depth, accumulating long-term behavioral distributions including rare safety-critical events, while vehicle-borne sensors excel at spatial breadth, sampling diverse scenes across large road networks. This paper presents a vision for Infrastructure-centric World Models (I-WM) in three phases: (I) generative scene understanding with quality-aware uncertainty propagation, (II) physics-informed predictive dynamics with multi-agent counterfactual reasoning, and (III) collaborative world models for V2X communication via latent space alignment. We propose a dual-layer architecture, annotation-free perception as a multi-modal data engine feeding end-to-end generative world models, with a phased sensor strategy from LiDAR through 4D radar and signal phase data to event cameras. We establish a taxonomy of driving world model paradigms, position I-WM relative to LeCun's JEPA, Li Fei-Fei's spatial intelligence, and VLA architectures, and introduce Infrastructure VLA (I-VLA) as a novel unification of roadside perception, language commands, and traffic control actions. Our vision builds upon existing multi-LiDAR pipelines and identifies open-source foundations for each phase, providing a path toward infrastructure that understands and anticipates traffic.
Chinese Translation
世界模型是模拟环境演变的生成性人工智能系统,正在改变自动驾驶技术,但所有现有方法均采用自我车辆视角,未探索基础设施视角。我们认为,基础设施中心的世界模型提供了一种根本互补的能力:路边系统独特的鸟瞰、多传感器、持久视角。我们论点的核心是时空互补性:固定的路边传感器在时间深度方面表现出色,能够累积长期行为分布,包括稀有的安全关键事件,而车载传感器在空间广度方面表现优异,能够在广泛的道路网络中采样多样场景。本文提出了基础设施中心世界模型(I-WM)的三阶段愿景:(I) 具有质量感知不确定性传播的生成场景理解,(II) 具有多智能体反事实推理的物理信息预测动态,以及(III) 通过潜在空间对齐实现的V2X通信的协作世界模型。我们提出了一种双层架构,注释无感知作为多模态数据引擎,支持端到端的生成世界模型,并采用从LiDAR到4D雷达和信号相位数据再到事件相机的分阶段传感器策略。我们建立了驾驶世界模型范式的分类法,将I-WM与LeCun的JEPA、Li Fei-Fei的空间智能和VLA架构进行比较,并引入基础设施VLA(I-VLA)作为路边感知、语言指令和交通控制行为的新统一。我们的愿景建立在现有的多LiDAR管道之上,并为每个阶段识别开源基础,提供了一条通向理解和预测交通的基础设施的路径。
cs.CV / 176 / 2604.17652
Self-Supervised Super-Resolution for Sentinel-5P Hyperspectral Images
用于哨兵-5P高光谱图像的自监督超分辨率
Abstract
Sentinel-5P (S5P) plays a critical role in atmospheric monitoring; however, its spatial resolution limits fine-scale analysis. Existing super-resolution (SR) approaches rely on supervised learning with synthetic low-resolution (LR) data, since true high-resolution (HR) data do not exist, limiting their applicability to real observations. We propose a self-supervised hyperspectral SR framework for S5P that enables training without HR ground truth. The method combines Stein's Unbiased Risk Estimator (SURE) with an equivariant imaging constraint, incorporating the S5P degradation operator and noise statistics derived from signal-to-noise ratio (SNR) metadata. We also introduce depthwise separable convolution U-Net architectures designed for efficiency and spectral fidelity. The framework is evaluated in two settings: (i) LR-HR, where synthetic LR data are used for direct comparison with supervised learning, and (ii) GT-SHR, where super-resolved images surpass the native spatial resolution without HR reference. Results across multiple bands show that self-supervised models achieve performance comparable to supervised methods while maintaining strong consistency. Qualitative analysis shows improved spatial detail over bicubic interpolation, and validation with EMIT data confirms that reconstructed structures are physically meaningful. Code is available at https://github.com/hyamomar/Sentinel-5P-Super-Resolution/tree/main/self_supervised
Chinese Translation
哨兵-5P (S5P) 在大气监测中发挥着关键作用;然而,其空间分辨率限制了细尺度分析。现有的超分辨率 (SR) 方法依赖于使用合成低分辨率 (LR) 数据的监督学习,因为真实的高分辨率 (HR) 数据并不存在,这限制了它们在实际观测中的适用性。我们提出了一种针对 S5P 的自监督高光谱 SR 框架,该框架能够在没有 HR 真实值的情况下进行训练。该方法结合了 Stein 的无偏风险估计器 (SURE) 和等变成像约束,纳入了 S5P 降解算子和基于信噪比 (SNR) 元数据推导的噪声统计信息。我们还引入了为提高效率和光谱保真度而设计的深度可分离卷积 U-Net 架构。该框架在两种设置下进行评估:(i) LR-HR,其中使用合成 LR 数据与监督学习进行直接比较;(ii) GT-SHR,其中超分辨率图像超越了原生空间分辨率而没有 HR 参考。多个波段的结果表明,自监督模型的性能与监督方法相当,同时保持了较强的一致性。定性分析显示,相较于双三次插值,空间细节得到了改善,并且与 EMIT 数据的验证确认重建结构在物理上是有意义的。代码可在 https://github.com/hyamomar/Sentinel-5P-Super-Resolution/tree/main/self_supervised 获取。
cs.CV / 177 / 2604.17669
Low Light Image Enhancement Challenge at NTIRE 2026
2026年NTIRE低光图像增强挑战
Ciubotariu, George, A, Sharif S M, Rehman, Abdur, Dharejo, Fayaz Ali, Naqvi, Rizwan Ali, Conde, Marcos V., Timofte, Radu, Jin, Zhi, Wu, Hongjun, Zhang, Wenjian, Ye, Chang, Yi, Xunpeng, Yan, Qinglong, Zhang, Yibing, Akalwadi, Nikhil, Pattanshetty, Varda I, Pattanshetty, Varsha I, Desai, Padmashree, Mudenagudi, Uma, Tabib, Ramesh Ashok, Yang, Hao, Zhang, Ruikun, Pan, Liyuan, Kınlı, Furkan, Ryou, Donghun, Ha, Inju, Kang, Junoh, Han, Bohyung, Zhou, Wei, Haitman, Yuval, Lapid, Ariel, Peretz, Reuven, Diamant, Idit, Cao, Leilei, Zhang, Shuo, Hambarde, Praful, Shaily, Prateek, Kumar, Jayant, Sharma, Hardik, Negi, Aashish, Chaudhary, Sachin, Dudhane, Akshay, Shukla, Amit, Wu, MoHao, Wang, Lin, Tu, Jiachen, Xu, Guoyi, Jiang, Yaoxin, Liu, Jiajia, Shi, Yaokun, Balmez, Raul, Brateanu, Alexandru, Orhei, Ciprian, Ancuti, Cosmin, Ancuti, Codruta O., Benjdira, Bilel, Ali, Anas M., Boulila, Wadii, Qiao, Kaifan, Chen, Bofei, Xu, Jingyi, Zhang, Duo, Deng, Xin, Xu, Mai, Li, Shengxi, Jiang, Lai, A, Harini, N, Ananya, K, Lakshanya, Xu, Ying, Zhu, Xinyi, Shi, Shijun, Zhang, Jiangning, Liu, Yong, Hu, Kai, Xu, Jing, Zeng, Xianfang, Song, Jinao, Tang, Guangsheng, Li, Cheng, Yang, Yuqiang, Wang, Ziyi, Chen, Yan, Bao, Long, Sun, Heng, Kishawy, Mohab, Chen, Jun, Siu, Wan-Chi, Cheng, Yihao, Lee, Hon Man Hammond, Hui, Chun-Chuen
Abstract
This paper presents a comprehensive review of the NTIRE 2026 Low Light Image Enhancement Challenge, highlighting the proposed solutions and final results. The objective of this challenge is to identify effective networks capable of producing clearer and visually compelling images in diverse and challenging conditions by learning representative visual cues with the purpose of restoring information loss due to low-contrast and noisy images. A total of 195 participants registered for the first track and 153 for the second track of the competition, and 22 teams ultimately submitted valid entries. This paper thoroughly evaluates the state-of-the-art advances in (joint denoising and) low-light image enhancement, showcasing the significant progress in the field, while leveraging samples of our novel dataset.
Chinese Translation
本文对2026年NTIRE低光图像增强挑战进行了全面回顾,重点介绍了提出的解决方案和最终结果。该挑战的目标是识别能够在多样且具有挑战性的条件下,通过学习代表性的视觉线索来恢复因低对比度和噪声图像而导致的信息损失,从而生成更清晰和视觉上更具吸引力的图像的有效网络。共有195名参与者注册了比赛的第一轨道,153名注册了第二轨道,最终有22个团队提交了有效的参赛作品。本文全面评估了(联合去噪和)低光图像增强领域的最新进展,展示了该领域的显著进步,同时利用了我们新数据集的样本。
cs.CV / 178 / 2604.17688
Dual-stream Spatio-Temporal GCN-Transformer Network for 3D Human Pose Estimation
双流时空GCN-Transformer网络用于3D人体姿态估计
Abstract
3D human pose estimation is a classic and important research direction in the field of computer vision. In recent years, Transformer-based methods have made significant progress in lifting 2D poses to 3D for human pose estimation. However, these methods primarily focus on modeling global temporal and spatial relationships, neglecting local skeletal relationships and the information interaction between different channels. Therefore, we propose a novel method, the Dual-stream Spatio-temporal GCN-Transformer Network (MixTGFormer). This method models the spatial and temporal relationships of human skeletons simultaneously through two parallel channels, achieving effective fusion of global and local features. The core of MixTGFormer is composed of stacked Mixformers. Specifically, the Mixformer includes the Mixformer Block and the Squeeze-and-Excitation Layer (SE Layer). It first extracts and fuses various information of human skeletons through two parallel Mixformer Blocks with different modes. Then, it further supplements the fused information through the SE Layer. The Mixformer Block integrates Graph Convolutional Networks (GCN) into the Transformer, enhancing both local and global information utilization. Additionally, we implement its temporal and spatial forms to extract both spatial and temporal relationships. We extensively evaluated our model on two benchmark datasets (Human3.6M and MPI-INF-3DHP). The experimental results show that, compared to other methods, our MixTGFormer achieves state-of-the-art results, with P1 errors of 37.6mm and 15.7mm on these datasets, respectively.
Chinese Translation
3D人体姿态估计是计算机视觉领域一个经典且重要的研究方向。近年来,基于Transformer的方法在将2D姿态提升到3D姿态估计方面取得了显著进展。然而,这些方法主要集中于建模全局时间和空间关系,忽视了局部骨骼关系以及不同通道之间的信息交互。因此,我们提出了一种新颖的方法——双流时空GCN-Transformer网络(MixTGFormer)。该方法通过两个并行通道同时建模人体骨骼的空间和时间关系,实现了全局特征和局部特征的有效融合。MixTGFormer的核心由堆叠的Mixformers组成。具体而言,Mixformer包括Mixformer Block和Squeeze-and-Excitation Layer(SE Layer)。它首先通过两个不同模式的并行Mixformer Blocks提取和融合人体骨骼的各种信息。然后,通过SE Layer进一步补充融合的信息。Mixformer Block将图卷积网络(GCN)集成到Transformer中,增强了局部和全局信息的利用。此外,我们进一步实现了其时间和空间形式,以提取空间和时间关系。我们在两个基准数据集(Human3.6M和MPI-INF-3DHP)上对我们的模型进行了广泛评估。实验结果表明,与其他方法相比,我们的MixTGFormer在这些数据集上达到了最先进的结果,P1误差分别为37.6mm和15.7mm。
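The SE Layer mentioned above follows the general squeeze-and-excitation pattern: average each channel, pass the channel summary through a small bottleneck, and rescale the channels by the resulting gates. The sketch below illustrates that generic pattern only; the weight shapes, the omission of biases, and the name `se_layer` are assumptions, and the way MixTGFormer wires the layer into its two streams is not reproduced.

```python
import math


def se_layer(features, w1, w2):
    """Generic squeeze-and-excitation over a list of channels (biases omitted).

    features: list of channels, each a list of activations.
    w1: bottleneck weights, shape [hidden][channels].
    w2: expansion weights, shape [channels][hidden].
    """
    # Squeeze: one summary value per channel (global average pooling).
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation: linear -> ReLU -> linear -> sigmoid, one gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: reweight every activation of a channel by that channel's gate.
    return [[g * v for v in ch] for g, ch in zip(gates, features)]


# With zero expansion weights every gate is sigmoid(0) = 0.5.
features = [[2.0, 4.0], [6.0, 8.0]]
out = se_layer(features, w1=[[1.0, 1.0]], w2=[[0.0], [0.0]])
```

The gates depend on all channels through the bottleneck, which is how the layer lets information flow between channels at low cost.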
cs.CV / 179 / 2604.17710
Dynamic Visual-semantic Alignment for Zero-shot Learning with Ambiguous Labels
动态视觉-语义对齐用于模糊标签的零样本学习
Abstract
Zero-shot learning (ZSL) aims to recognize unseen classes without visual instances. However, existing methods usually assume clean labels, overlooking real-world label noise and ambiguity, which degrades performance. To bridge this gap, we propose the Dynamic Visual-semantic Alignment (DVSA), a robust ZSL framework for learning from ambiguous labels. DVSA uses a bidirectional visual-semantic alignment module with attention to mutually calibrate visual features and attribute prototypes, and a contrastive optimization grounded in Mutual Information (MI) at the attribute level to strengthen discriminative, semantically consistent attributes. In addition, a dynamic label disambiguation mechanism iteratively corrects noisy supervision while preserving semantic consistency, narrowing the instance-label gap, and improving generalization. Extensive experiments on standard benchmarks verify that DVSA achieves stronger performance under ambiguous supervision.
Chinese Translation
零样本学习(ZSL)旨在识别未见过的类别而无需视觉实例。然而,现有方法通常假设标签是干净的,忽视了现实世界中的标签噪声和模糊性,这降低了性能。为了解决这一问题,我们提出了动态视觉-语义对齐(DVSA),这是一个针对模糊标签的强健ZSL框架。DVSA使用一个双向视觉-语义对齐模块,通过注意力机制相互校准视觉特征和属性原型,并在属性层面上基于互信息(Mutual Information, MI)进行对比优化,以增强区分性和语义一致性属性。此外,一个动态标签消歧机制迭代地纠正噪声监督,同时保持语义一致性,缩小实例与标签之间的差距,提高泛化能力。在标准基准上的大量实验验证了DVSA在模糊监督下实现了更强的性能。
cs.CV / 180 / 2604.17721
GeGS-PCR: Effective and Robust 3D Point Cloud Registration with Two-Stage Color-Enhanced Geometric-3DGS Fusion
GeGS-PCR:一种有效且稳健的三维点云配准方法,采用两阶段颜色增强几何-3DGS融合
Abstract
We address the challenge of point cloud registration using color information, where traditional methods relying solely on geometric features often struggle in low-overlap and incomplete scenarios. To overcome these limitations, we propose GeGS-PCR, a novel two-stage method that combines geometric, color, and Gaussian information for robust registration. Our approach incorporates a dedicated color encoder that enhances color features by extracting multi-level geometric and color data from the original point cloud. We introduce the Geometric-3DGS module, which encodes the local neighborhood information of colored superpoints to ensure a globally invariant geometric-color context. Leveraging LORA optimization, we maintain high performance while preserving the expressiveness of 3DGS. Additionally, fast differentiable rendering is utilized to refine the registration process, leading to improved convergence. To further enhance performance, we propose a joint photometric loss that exploits both geometric and color features. This enables strong performance in challenging conditions with extremely low point cloud overlap. We validate our method by colorizing the Kitti dataset as ColorKitti and testing on both Color3DMatch and Color3DLoMatch datasets. Our method achieves state-of-the-art performance with Registration Recall at 99.9%, Relative Rotation Error as low as 0.013, and Relative Translation Error as low as 0.024, improving precision by at least a factor of 2.
Chinese Translation
我们针对使用颜色信息的点云配准挑战进行了研究,传统方法仅依赖几何特征,在低重叠和不完整场景中往往表现不佳。为克服这些局限性,我们提出了GeGS-PCR,一种新颖的两阶段方法,结合几何、颜色和高斯信息以实现稳健的配准。我们的方法引入了一个专用的颜色编码器,通过从原始点云中提取多层次的几何和颜色数据来增强颜色特征。我们提出了Geometric-3DGS模块,该模块编码了彩色超点的局部邻域信息,以确保全局不变的几何-颜色上下文。利用LORA优化,我们在保持高性能的同时,保留了3DGS的表达能力。此外,快速可微渲染被用于优化配准过程,从而提高收敛性。为了进一步提升性能,我们提出了一种联合光度损失,利用几何和颜色特征。这使得在极低点云重叠的挑战条件下,表现出色。我们通过对Kitti数据集进行着色处理,生成ColorKitti,并在Color3DMatch和Color3DLoMatch数据集上进行测试,验证了我们的方法。我们的方案在Registration Recall上达到了99.9%,Relative Rotation Error低至0.013,Relative Translation Error低至0.024,精度提高至少两倍。
cs.CV / 181 / 2604.17727
Voronoi-guided Bilateral 2D Gaussian Splatting for Arbitrary-Scale Hyperspectral Image Super-Resolution
基于Voronoi引导的双边2D高斯溅射用于任意尺度的高光谱图像超分辨率
Abstract
Most existing hyperspectral image super-resolution methods require modifications for different scales, limiting their flexibility in arbitrary-scale reconstruction. 2D Gaussian splatting provides a continuous representation that is compatible with arbitrary-scale super-resolution. Existing methods often rely on rasterization strategies, which may limit flexible spatial modeling. Extending them to hyperspectral image super-resolution remains challenging, as the task requires adaptive spatial reconstruction while preserving spectral fidelity. This paper proposes GaussianHSI, a Gaussian-Splatting-based framework for arbitrary-scale hyperspectral image super-resolution. We develop a Voronoi-Guided Bilateral 2D Gaussian Splatting for spatial reconstruction. After predicting a set of Gaussian functions to represent the input, it associates each target pixel with relevant Gaussian functions through Voronoi-guided selection. The target pixel is then reconstructed by aggregating the selected Gaussian functions with reference-aware bilateral weighting, which considers both geometric relevance and consistency with low-resolution features. We further introduce a Spectral Detail Enhancement module to improve spectral reconstruction. Extensive experiments on benchmark datasets demonstrate the effectiveness of GaussianHSI over state-of-the-art methods for arbitrary-scale hyperspectral image super-resolution.
Chinese Translation
大多数现有的高光谱图像超分辨率方法需要针对不同尺度进行修改,这限制了它们在任意尺度重建中的灵活性。2D高斯溅射提供了一种与任意尺度超分辨率兼容的连续表示。现有方法通常依赖于光栅化策略,这可能限制灵活的空间建模。将其扩展到高光谱图像超分辨率仍然具有挑战性,因为该任务需要在保持光谱保真度的同时进行自适应空间重建。本文提出了GaussianHSI,一个基于高斯溅射的框架,用于任意尺度的高光谱图像超分辨率。我们开发了一种Voronoi引导的双边2D高斯溅射用于空间重建。在预测一组高斯函数以表示输入后,它通过Voronoi引导选择将每个目标像素与相关的高斯函数关联。然后,通过考虑几何相关性和与低分辨率特征的一致性的参考感知双边加权,聚合所选的高斯函数来重建目标像素。我们进一步引入了光谱细节增强模块,以改善光谱重建。在基准数据集上的大量实验表明,GaussianHSI在任意尺度的高光谱图像超分辨率方面优于最先进的方法。
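The selection-then-weighting step described in the GaussianHSI abstract can be illustrated with a small sketch. This is not the authors' implementation: the choice of the `k` nearest Gaussian centers as a stand-in for Voronoi-guided selection, the scalar features, and the names `bilateral_reconstruct`, `sigma_s`, and `sigma_r` are all assumptions made for illustration. The weight is the classic bilateral form, a spatial Gaussian multiplied by a feature-similarity Gaussian.

```python
import math

def bilateral_reconstruct(pixel_xy, pixel_feat, gaussians, k=2, sigma_s=1.0, sigma_r=0.5):
    """Reconstruct one pixel value from nearby Gaussians (hypothetical sketch).

    gaussians: list of (center_xy, feature, value) tuples.
    """
    def sqdist(center):
        return (center[0] - pixel_xy[0]) ** 2 + (center[1] - pixel_xy[1]) ** 2

    # Assumed selection rule: keep the k Gaussians with the nearest centers
    # (nearest-center assignment is what defines a Voronoi cell).
    nearest = sorted(gaussians, key=lambda g: sqdist(g[0]))[:k]

    # Bilateral weight = geometric closeness * feature consistency.
    weights = [
        math.exp(-sqdist(c) / (2 * sigma_s ** 2))
        * math.exp(-((f - pixel_feat) ** 2) / (2 * sigma_r ** 2))
        for c, f, _ in nearest
    ]
    total = sum(weights)
    return sum(w * g[2] for w, g in zip(weights, nearest)) / total

gaussians = [((0.0, 0.0), 0.2, 1.0), ((1.0, 0.0), 0.9, 3.0), ((5.0, 5.0), 0.2, 10.0)]
value = bilateral_reconstruct((0.5, 0.0), 0.2, gaussians)
```

In this toy case the two selected Gaussians are equally close to the pixel, so the feature term alone pulls the result toward the Gaussian whose feature matches the pixel's low-resolution feature.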
cs.CV / 182 / 2604.17734
Score-Based Matching with Target Guidance for Cryo-EM Denoising
基于评分的目标引导匹配用于冷冻电子显微镜去噪
Abstract
Cryo-electron microscopy (cryo-EM) enables single-particle analysis of biological macromolecules under strict low-dose imaging conditions, but the resulting micrographs often exhibit extremely low signal-to-noise ratios and weak particle visibility. Image denoising is therefore an important preprocessing step for downstream cryo-EM analysis, including particle picking, 2D classification, and 3D reconstruction. Existing cryo-EM denoising methods are commonly trained with pixel-wise or Noise2Noise-style objectives, which can improve visual quality but do not explicitly account for structural consistency required by downstream analysis. In this work, we propose a score-based denoising framework for cryo-EM that learns the clean-data score to recover particle signals while better preserving structural information. Building on this formulation, we further introduce a target-guided variant that incorporates reference-density guidance to stabilize score learning under weak and ambiguous signal conditions. Rather than simply amplifying particle-like responses, our framework better suppresses structured low-frequency background, which improves particle-background separability for downstream analysis. Experiments on multiple cryo-EM datasets show that our score-based methods consistently improve downstream particle picking and produce more structure-consistent 3D reconstructions.
Chinese Translation
冷冻电子显微镜(cryo-EM)在严格低剂量成像条件下实现了生物大分子的单颗粒分析,但所得到的显微图像通常表现出极低的信噪比和弱颗粒可见性。因此,图像去噪是下游冷冻电子显微镜分析的重要预处理步骤,包括颗粒选择、二维分类和三维重建。现有的冷冻电子显微镜去噪方法通常采用逐像素或Noise2Noise风格的目标进行训练,这可以改善视觉质量,但并未明确考虑下游分析所需的结构一致性。在本研究中,我们提出了一种基于评分的冷冻电子显微镜去噪框架,该框架学习干净数据的评分,以恢复颗粒信号,同时更好地保留结构信息。在此基础上,我们进一步引入了一种目标引导变体,该变体结合了参考密度引导,以在信号弱且模糊的条件下稳定评分学习。我们的框架不仅仅是简单地放大颗粒样响应,而是更好地抑制结构化的低频背景,从而改善下游分析中的颗粒与背景的可分离性。在多个冷冻电子显微镜数据集上的实验表明,我们的基于评分的方法在下游颗粒选择中始终表现出改善,并产生更具结构一致性的三维重建。
cs.CV / 183 / 2604.17736
IncreFA: Breaking the Static Wall of Generative Model Attribution
IncreFA:打破生成模型归因的静态壁垒
Abstract
As AI generative models evolve at unprecedented speed, image attribution has become a moving target. New diffusion, adversarial and autoregressive generators appear almost monthly, making existing watermark, classifier and inversion methods obsolete upon release. The core problem lies not in model recognition, but in the inability to adapt attribution itself. We introduce IncreFA, a framework that redefines attribution as a structured incremental learning problem, allowing the system to learn continuously as new generative models emerge. IncreFA departs from conventional incremental learning by exploiting the hierarchical relationships among generative architectures and coupling them with continual adaptation. It integrates two mutually reinforcing mechanisms: (1) Hierarchical Constraints, which encode architectural hierarchies through learnable orthogonal priors to disentangle family-level invariants from model-specific idiosyncrasies; and (2) a Latent Memory Bank, which replays compact latent exemplars and mixes them to generate pseudo-unseen samples, stabilising representation drift and enhancing open-set awareness. On the newly constructed Incremental Attribution Benchmark (IABench) covering 28 generative models released between 2022 and 2025, IncreFA achieves state-of-the-art attribution accuracy and 98.93% unseen detection under a temporally ordered open-set protocol. Code will be available at https://github.com/Ant0ny44/IncreFA.
Chinese Translation
随着人工智能生成模型以空前的速度发展,图像归因已成为一个动态目标。新的扩散、对抗和自回归生成器几乎每月都会出现,使得现有的水印、分类器和反演方法在发布时便变得过时。核心问题不在于模型识别,而在于归因本身无法适应。我们提出了IncreFA,一个将归因重新定义为结构化增量学习问题的框架,使系统能够随着新生成模型的出现而持续学习。IncreFA不同于传统的增量学习,通过利用生成架构之间的层次关系并将其与持续适应相结合。它整合了两种相互强化的机制:(1)层次约束,通过可学习的正交先验编码架构层次,以解开家族级不变性与模型特定特征之间的关系;(2)潜在记忆库,重放紧凑的潜在样本并将其混合以生成伪未见样本,从而稳定表示漂移并增强开放集意识。在新构建的增量归因基准(Incremental Attribution Benchmark,IABench)上,涵盖了2022年至2025年间发布的28个生成模型,IncreFA在时间有序的开放集协议下实现了最先进的归因准确率和98.93%的未见检测率。代码将发布于 https://github.com/Ant0ny44/IncreFA。
cs.CV / 184 / 2604.17748
Source-Free Domain Adaptation with Vision-Language Prior
无源领域适应与视觉-语言先验
Abstract
Source-Free Domain Adaptation (SFDA) seeks to adapt a source model, pre-trained on a supervised source domain, to a target domain with access only to unlabeled target training data. Relying on pseudo-labeling and/or auxiliary supervision, conventional methods are inevitably error-prone. To mitigate this limitation, in this work we explore, for the first time, the potential of off-the-shelf vision-language (ViL) multimodal models (e.g., CLIP), which carry rich yet heterogeneous knowledge. We find that directly applying the ViL model to the target domain in a zero-shot fashion is unsatisfactory, as it is largely generic rather than specialized for this particular task. To make it task-specific, we propose a novel DIFO++ approach. Specifically, DIFO++ alternates between two steps during adaptation: (i) customizing the ViL model by maximizing the mutual information with the target model in a prompt learning manner; and (ii) distilling the knowledge of this customized ViL model to the target model, centering on gap region reduction. During progressive knowledge adaptation, we first identify and focus on the gap region, where enclosed features are entangled and class-ambiguous, as it often captures richer task-specific semantics. Reliable pseudo-labels are then generated by fusing predictions from the target and ViL models, supported by a memory mechanism. Finally, gap region reduction is guided by category attention and predictive consistency for semantic alignment, complemented by referenced entropy minimization to suppress uncertainty. Extensive experiments show that DIFO++ significantly outperforms the state-of-the-art alternatives. Our code and data are available at https://github.com/tntek/DIFO-Plus.
Chinese Translation
无源领域适应(Source-Free Domain Adaptation, SFDA)旨在将一个在有监督源领域上预训练的源模型适应到目标领域,仅依赖于未标记的目标训练数据。传统方法依赖于伪标签和/或辅助监督,难免会出现错误。为了缓解这一限制,本文首次探索了现成的视觉-语言(Vision-Language, ViL)多模态模型(如 CLIP)所蕴含的丰富而异质的知识。我们发现,直接以零样本方式将 ViL 模型应用于目标领域并不令人满意,因为它并未专门针对这一特定任务,而是相对通用。为使其更具任务特异性,我们提出了一种新颖的 DIFO++ 方法。具体而言,DIFO++ 在适应过程中交替进行两个步骤:(i)通过以提示学习的方式最大化与目标模型的互信息来定制 ViL 模型,(ii)将该定制的 ViL 模型的知识蒸馏到目标模型,重点关注差距区域的缩减。在渐进式知识适应过程中,我们首先识别并关注差距区域,在该区域内封闭特征交织且类别模糊,因为它通常捕捉到更丰富的任务特定语义。然后,通过融合来自目标模型和 ViL 模型的预测,借助记忆机制生成可靠的伪标签。最后,差距区域的缩减通过类别注意力和预测一致性来引导,以实现语义对齐,并辅以参考熵最小化以抑制不确定性。大量实验表明,DIFO++ 显著优于现有的最先进方法。我们的代码和数据可在 https://github.com/tntek/DIFO-Plus 获取。
cs.CV / 185 / 2604.17749
Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos
自我中心间隔:在自我中心视频中生成物体状态转变
Abstract
Understanding physical transformation processes is crucial for both human cognition and artificial intelligence systems, particularly from an egocentric perspective, which serves as a key bridge between humans and machines in action modeling. We define this modeling process as Egocentric Instructed Visual State Transition (EIVST), which involves generating intermediate frames that depict object transformations between initial and target states under a brief action instruction. EIVST poses two challenges for current generative models: (1) understanding the visual scenes of the initial and target states and reasoning about transformation steps from an egocentric view, and (2) generating a consistent intermediate transition that follows the given instruction while preserving object appearance across the two visual states. To address these challenges, we propose the EgoIn framework. It first infers the multi-step transition process between two given states using TransitionVLM, fine-tuned on our curated dataset to better adapt to this task and reduce hallucinated information. It then generates a sequence of frames based on transition conditions produced by the proposed Transition Conditioning module. Additionally, we introduce Object-aware Auxiliary Supervision to preserve consistent object appearance throughout the transition. Extensive experiments on human-object and robot-object interaction datasets demonstrate EgoIn's superior performance in generating semantically meaningful and visually coherent transformation sequences.
Chinese Translation
理解物理转变过程对于人类认知和人工智能系统至关重要,特别是从自我中心的视角来看,这为人类与机器在动作建模中架起了关键的桥梁。我们将这一建模过程定义为自我中心指令视觉状态转变(Egocentric Instructed Visual State Transition, EIVST),其涉及在简短的动作指令下生成描绘初始状态与目标状态之间物体转变的中间帧。EIVST 对当前生成模型提出了两个挑战:(1)理解初始状态和目标状态的视觉场景,并从自我中心视角推理转变步骤;(2)生成一个一致的中间转变,遵循给定指令,同时在两个视觉状态之间保持物体外观的一致性。为了解决这些挑战,我们提出了 EgoIn 框架。该框架首先使用 TransitionVLM 推断两个给定状态之间的多步骤转变过程,该模型在我们精心策划的数据集上进行了微调,以更好地适应这一任务并减少虚假信息。然后,它基于所提出的转变条件模块生成一系列帧。此外,我们引入了物体感知辅助监督,以在整个转变过程中保持物体外观的一致性。在人类-物体和机器人-物体交互数据集上的大量实验表明,EgoIn 在生成语义上有意义且视觉上连贯的转变序列方面表现优越。
cs.CV / 186 / 2604.17773
Structure-Adaptive Sparse Diffusion in Voxel Space for 3D Medical Image Enhancement
体素空间中的结构自适应稀疏扩散用于三维医学图像增强
Abstract
Three-dimensional (3D) medical image enhancement, including denoising and super-resolution, is critical for clinical diagnosis in CT, PET, and MRI. Although diffusion models have shown remarkable success in 2D medical imaging, scaling them to high-resolution 3D volumes remains computationally prohibitive due to lengthy diffusion trajectories over high-dimensional volumetric data. We observe that in conditional enhancement, strong anatomical priors in the degraded input render dense noise schedules largely redundant. Leveraging this insight, we propose a sparse voxel-space diffusion framework that trains and samples on a compact set of uniformly subsampled timesteps. The network predicts clean data directly on the data manifold, supervised in velocity space for stable gradient scaling. A lightweight Structure-aware Trajectory Modulation (STM) module recalibrates time embeddings at each network block based on local anatomical content, enabling structure-adaptive denoising over the shared sparse schedule. Operating directly in voxel space, our framework preserves fine anatomical detail without lossy compression while achieving up to $10\times$ training acceleration. Experiments on four datasets spanning CT, PET, and MRI demonstrate state-of-the-art performance on both denoising and super-resolution tasks. Our code is publicly available at: https://github.com/mirthAI/sparse-3d-diffusion.
Chinese Translation
三维(3D)医学图像增强,包括去噪和超分辨率,对于CT、PET和MRI的临床诊断至关重要。尽管扩散模型在二维医学成像中取得了显著成功,但将其扩展到高分辨率的三维体积仍然因高维体积数据上的漫长扩散轨迹而计算上不可行。我们观察到,在条件增强中,退化输入中的强解剖先验使得密集噪声调度在很大程度上变得多余。基于这一见解,我们提出了一种稀疏体素空间扩散框架,该框架在一组均匀下采样的时间步长上进行训练和采样。网络直接在数据流形上预测干净数据,并在速度空间中进行监督,以实现稳定的梯度缩放。一个轻量级的结构感知轨迹调制(Structure-aware Trajectory Modulation, STM)模块根据局部解剖内容在每个网络块中重新校准时间嵌入,从而实现基于共享稀疏调度的结构自适应去噪。我们的框架直接在体素空间中操作,保留了细致的解剖细节而不进行有损压缩,同时实现了高达10倍的训练加速。在涵盖CT、PET和MRI的四个数据集上的实验表明,在去噪和超分辨率任务上均达到了最先进的性能。我们的代码已公开可用,链接为:https://github.com/mirthAI/sparse-3d-diffusion。
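The idea of training and sampling on a compact set of uniformly subsampled timesteps can be sketched as follows. The function name `uniform_timesteps` and the values 1000 and 10 are illustrative assumptions; the abstract does not state the actual schedule length or how many steps are kept.

```python
def uniform_timesteps(total_steps, num_kept):
    """Pick num_kept evenly spaced timestep indices from range(total_steps)."""
    if num_kept < 2 or num_kept > total_steps:
        raise ValueError("num_kept must be between 2 and total_steps")
    # Evenly spaced stride that always includes the first and last timestep.
    stride = (total_steps - 1) / (num_kept - 1)
    return [round(i * stride) for i in range(num_kept)]

# Illustrative values only: a dense 1000-step schedule reduced to 10 steps.
sparse_schedule = uniform_timesteps(1000, 10)
```

Both training and sampling would then iterate over `sparse_schedule` instead of the dense range, which is where the reduction in compute would come from under this sketch.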
cs.CV / 187 / 2604.17782
Subject-Aware Multi-Granularity Alignment for Zero-Shot EEG-to-Image Retrieval
面向主体的多粒度对齐用于零样本脑电图到图像检索
Abstract
Zero-shot EEG-to-image retrieval aims to decode perceived visual content from electroencephalography (EEG) by aligning neural responses with pretrained visual representations, providing a promising route toward scalable visual neural decoding and practical brain-computer interfaces. However, robust EEG-to-image retrieval remains challenging, because prior methods usually rely on either a single fixed visual target or a subject-invariant target construction scheme. Such designs overlook two important properties of visually evoked EEG signals: they preserve information across multiple representational scales, and the visual granularity best matched to EEG may vary across subjects. To address these issues, a subject-aware multi-granularity alignment (SAMGA) framework is proposed for zero-shot EEG-to-image retrieval. SAMGA first constructs a subject-aware visual supervision target by adaptively aggregating multiple intermediate representations from a pretrained vision encoder, allowing the model to absorb subject-dependent granularity deviations during training while preserving subject-agnostic inference. Building on this adaptive target construction, a coarse-to-fine cross-modal alignment strategy is further designed with a shared encoder, in which the coarse stage stabilizes the shared semantic geometry and reduces subject-induced distribution shift, and the fine stage further improves instance-level retrieval discrimination. Extensive experiments on the THINGS-EEG benchmark demonstrate that the proposed method achieves 91.3% Top-1 and 98.8% Top-5 accuracy in the intra-subject setting, and 34.4% Top-1 and 64.8% Top-5 accuracy in the inter-subject setting, outperforming recent state-of-the-art methods.
Chinese Translation
零样本脑电图(EEG)到图像检索旨在通过将神经反应与预训练的视觉表征对齐,从脑电图中解码感知的视觉内容,为可扩展的视觉神经解码和实用的脑机接口提供了一条有前景的路径。然而,稳健的脑电图到图像检索仍然面临挑战,因为以往的方法通常依赖于单一固定的视觉目标或主体不变的目标构建方案。这些设计忽视了视觉诱发的脑电图信号的两个重要特性:它们在多个表征尺度上保留信息,并且与脑电图最佳匹配的视觉粒度可能因主体而异。为了解决这些问题,提出了一种面向主体的多粒度对齐(Subject-Aware Multi-Granularity Alignment, SAMGA)框架用于零样本脑电图到图像检索。SAMGA首先通过自适应聚合来自预训练视觉编码器的多个中间表征来构建一个面向主体的视觉监督目标,使模型在训练过程中能够吸收主体依赖的粒度偏差,同时保持主体无关的推理。在此自适应目标构建的基础上,进一步设计了一种粗到细的跨模态对齐策略,该策略采用共享编码器,其中粗阶段稳定共享的语义几何并减少主体引起的分布偏移,而细阶段进一步提高实例级检索的区分度。在THINGS-EEG基准上的大量实验表明,所提出的方法在主体内设置中实现了91.3%的Top-1准确率和98.8%的Top-5准确率,在主体间设置中实现了34.4%的Top-1准确率和64.8%的Top-5准确率,超越了近期的最先进方法。
cs.CV / 188 / 2604.17789
DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization
DuQuant++:细粒度旋转增强微缩FP4量化
Abstract
The MXFP4 microscaling format, which partitions tensors into blocks of 32 elements sharing an E8M0 scaling factor, has emerged as a promising substrate for efficient LLM inference, backed by native hardware support on NVIDIA Blackwell Tensor Cores. However, activation outliers pose a unique challenge under this format: a single outlier inflates the shared block scale, compressing the effective dynamic range of the remaining elements and causing significant quantization error. Existing rotation-based remedies, including randomized Hadamard and learnable rotations, are data-agnostic and therefore unable to specifically target the channels where outliers concentrate. We propose DuQuant++, which adapts the outlier-aware fine-grained rotation of DuQuant to the MXFP4 format by aligning the rotation block size with the microscaling group size ($B = 32$). Because each MXFP4 group possesses an independent scaling factor, the cross-block variance issue that necessitates dual rotations and a zigzag permutation in the original DuQuant becomes irrelevant, enabling DuQuant++ to replace the entire pipeline with a single outlier-aware rotation, which halves the online rotation cost while simultaneously smoothing the weight distribution. Extensive experiments on the LLaMA-3 family under MXFP4 W4A4 quantization show that DuQuant++ consistently achieves state-of-the-art performance. Our code is available at https://github.com/Hsu1023/DuQuant++.
Chinese Translation
MXFP4微缩格式将张量划分为共享E8M0缩放因子的32个元素块,已成为高效大规模语言模型(LLM)推理的有前景的基础,得益于NVIDIA Blackwell张量核心的原生硬件支持。然而,激活异常值在该格式下构成了独特的挑战:单个异常值会膨胀共享块的缩放,压缩其余元素的有效动态范围,并导致显著的量化误差。现有的基于旋转的解决方案,包括随机Hadamard和可学习旋转,都是与数据无关的,因此无法专门针对异常值集中所在的通道。我们提出了DuQuant++,它通过将旋转块大小与微缩组大小($B = 32$)对齐,将DuQuant的异常值感知细粒度旋转适配到MXFP4格式。由于每个MXFP4组具有独立的缩放因子,原始DuQuant中需要双重旋转和锯齿形置换的跨块方差问题变得无关紧要,使得DuQuant++能够用单个异常值感知旋转替代整个流程,从而将在线旋转成本减半,同时平滑权重分布。在MXFP4 W4A4量化下对LLaMA-3系列进行的广泛实验表明,DuQuant++始终实现了最先进的性能。我们的代码可在https://github.com/Hsu1023/DuQuant++获取。
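The outlier problem described in the DuQuant++ abstract can be reproduced with a small fake-quantization sketch. It assumes the FP4 E2M1 magnitude grid and a power-of-two shared scale derived from the block maximum, in the spirit of the microscaling convention; the helper names and the toy block values are hypothetical, and this is not the DuQuant++ rotation itself, only the failure mode it targets.

```python
import math

# Magnitudes representable by an FP4 E2M1 element.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Fake-quantize one block with a single shared power-of-two scale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    # Assumed scale rule: shared exponent = floor(log2(amax)) - 2,
    # where 2 is the largest exponent of the E2M1 element format.
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    out = []
    for v in block:
        mag = min(abs(v) / scale, FP4_GRID[-1])          # clamp to the grid maximum
        q = min(FP4_GRID, key=lambda g: abs(g - mag))    # round to nearest grid point
        out.append(math.copysign(q * scale, v))
    return out

def mean_abs_error(block, skip):
    """Mean absolute quantization error over all elements except index skip."""
    deq = quantize_block(block)
    idx = [i for i in range(len(block)) if i != skip]
    return sum(abs(block[i] - deq[i]) for i in idx) / len(idx)

# Toy block of 32 values in [-0.8, 0.8], then the same block with one outlier.
smooth = [0.1 * (1 + (i % 8)) * (-1) ** i for i in range(32)]
spiky = list(smooth)
spiky[5] = 100.0

err_smooth = mean_abs_error(smooth, skip=5)
err_spiky = mean_abs_error(spiky, skip=5)
```

With the outlier present, the shared scale grows so large that every other element in the block rounds to zero, which is the dynamic-range compression the abstract refers to.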
cs.CV / 189 / 2604.17797
Weakly-Supervised Referring Video Object Segmentation through Text Supervision
通过文本监督的弱监督视频目标分割
Abstract
Referring video object segmentation (RVOS) aims to segment the target instance in a video, referred to by a text expression. Conventional approaches mostly rely on supervised learning, requiring expensive pixel-level mask annotations. To address this, weakly-supervised RVOS has recently been proposed to replace mask annotations with bounding boxes or points, which are nevertheless still costly and labor-intensive. In this paper, we design a novel weakly-supervised RVOS method, namely WSRVOS, to train the model with only text expressions. Given an input video and the referring expression, we first design a contrastive referring expression augmentation scheme that leverages the captioning capabilities of a multimodal large language model to generate both positive and negative expressions. We extract visual and linguistic features from the input video and generated expressions, then perform bi-directional vision-language feature selection and interaction to enable fine-grained multimodal alignment. Next, we propose an instance-aware expression classification scheme to optimize the model in distinguishing positive from negative expressions. We also introduce a positive-prediction fusion strategy to generate high-quality pseudo-masks, which serve as additional supervision to the model. Last, we design a temporal segment ranking constraint such that the overlaps between mask predictions of temporally neighboring frames are required to conform to specific orders. Extensive experiments on four publicly available RVOS datasets, including A2D Sentences, J-HMDB Sentences, Ref-YouTube-VOS, and Ref-DAVIS17, demonstrate the superiority of our method. Code is available at https://github.com/viscom-tongji/WSRVOS.
Chinese Translation
指向视频目标分割(RVOS)旨在根据文本表达对视频中的目标实例进行分割。传统方法大多采用监督学习,需依赖昂贵的像素级掩码标注。为了解决这一问题,最近提出了弱监督RVOS方法,用边界框或点替代掩码标注,然而这些方法仍然成本高且劳动密集。本文设计了一种新颖的弱监督RVOS方法,称为WSRVOS,仅通过文本表达训练模型。给定输入视频和指向表达,我们首先设计了一种对比性指向表达增强方案,该方案利用多模态大型语言模型的字幕生成能力,生成正向和负向表达。我们从输入视频和生成的表达中提取视觉和语言特征,然后进行双向视觉-语言特征选择和交互,以实现细粒度的多模态对齐。接下来,我们提出了一种实例感知的表达分类方案,以优化模型区分正向和负向表达。此外,我们引入了一种正向预测融合策略,以生成高质量的伪掩码,作为对模型的额外监督。最后,我们设计了一种时间段排序约束,要求时间上相邻帧的掩码预测重叠部分符合特定顺序。在四个公开可用的RVOS数据集上进行的大量实验,包括A2D Sentences、J-HMDB Sentences、Ref-YouTube-VOS和Ref-DAVIS17,证明了我们方法的优越性。代码可在 https://github.com/viscom-tongji/WSRVOS 获取。
cs.CV / 190 / 2604.17801
View-Consistent 3D Scene Editing via Dual-Path Structural Correspondence and Semantic Continuity
通过双路径结构对应和语义连续性实现视图一致的三维场景编辑
Abstract
Text-driven 3D scene editing has recently attracted increasing attention. Most existing methods follow a render-edit-optimize pipeline, where multi-view images are rendered from a 3D scene, edited with 2D image editors, and then used to optimize the underlying 3D representation. However, cross-view inconsistency remains a major bottleneck. Although recent methods introduce geometric cues, cross-view interactions, or video priors to mitigate this issue, they still largely rely on inference-time synchronization and thus remain limited in robustness and generalization. In this work, we recast multi-view consistent 3D editing from a distributional perspective: 3D scene editing essentially requires a joint distribution modeling across viewpoints. Based on this insight, we propose a view-consistent 3D editing framework that explicitly introduces cross-view dependencies into the editing process. Furthermore, motivated by the observation that structural correspondence and semantic continuity rely on different cross-view cues, we introduce a dual-path consistency mechanism consisting of projection-guided structural guidance and patch-level semantic propagation for effective cross-view editing. Further, we construct a paired multi-view editing dataset that provides reliable supervision for learning cross-view consistency in edited scenes. Extensive experiments demonstrate that our method achieves superior editing performance with precise and consistent views for complex scenes.
Chinese Translation
基于文本驱动的三维场景编辑最近引起了越来越多的关注。大多数现有方法遵循渲染-编辑-优化的流程,其中从三维场景渲染出多视图图像,使用二维图像编辑器进行编辑,然后用于优化底层的三维表示。然而,视图间不一致性仍然是一个主要瓶颈。尽管最近的方法引入了几何线索、视图间交互或视频先验来缓解这一问题,但它们仍然在很大程度上依赖于推理时的同步,因此在鲁棒性和泛化能力上仍然有限。在本研究中,我们从分布的角度重新审视多视图一致的三维编辑:三维场景编辑本质上需要跨视点的联合分布建模。基于这一见解,我们提出了一种视图一致的三维编辑框架,明确地将视图间依赖关系引入编辑过程。此外,受到结构对应和语义连续性依赖于不同视图间线索的观察启发,我们引入了一种双路径一致性机制,包括基于投影的结构引导和补丁级语义传播,以实现有效的视图间编辑。此外,我们构建了一个配对的多视图编辑数据集,为学习编辑场景中的视图一致性提供可靠的监督。大量实验表明,我们的方法在复杂场景中实现了优越的编辑性能,提供了精确且一致的视图。
cs.CV / 191 / 2604.17807
Re$^2$MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement
Re$^2$MoGen:通过大语言模型推理和物理感知精炼的开放词汇运动生成
Abstract
Text-to-motion (T2M) generation aims to control the behavior of a target character via textual descriptions. Leveraging text-motion paired datasets, existing T2M models have achieved impressive performance in generating high-quality motions within the distribution of their training data. However, their performance deteriorates notably when the motion descriptions differ significantly from the training texts. To address this issue, we propose Re$^2$MoGen, a Reasoning and Refinement open-vocabulary Motion Generation framework that leverages enhanced Large Language Model (LLM) reasoning to generate an initial motion plan and then refines its physical plausibility via reinforcement learning (RL) post-training. Specifically, Re$^2$MoGen consists of three stages. First, we employ Monte Carlo tree search to enhance the LLM's reasoning ability in generating reasonable keyframes of the motion based on text prompts, specifying only the positions of the root and several key joints to ease the reasoning process. Then, we apply a human pose model as a prior to optimize the full-body poses based on the planned keyframes and use the resulting incomplete motion to supervise the fine-tuning of a pre-trained motion generator via a dynamic temporal matching objective, enabling spatiotemporal completion. Finally, we use post-training with a physics-aware reward to refine motion quality and eliminate physical implausibility in LLM-planned motions. Extensive experiments demonstrate that our framework can generate semantically consistent and physically plausible motions and achieve state-of-the-art performance in open-vocabulary motion generation.
Chinese Translation
文本到运动(T2M)生成旨在通过文本描述控制目标角色的行为。利用文本-运动配对数据集,现有的T2M模型在生成高质量运动方面取得了令人印象深刻的性能,且这些运动在其训练数据的分布范围内。然而,当运动描述与训练文本显著不同时,它们的性能显著下降。为了解决这一问题,我们提出了Re$^2$MoGen,一种推理与精炼的开放词汇运动生成框架,利用增强的大语言模型(LLM)推理生成初步的运动规划,然后通过强化学习(RL)后训练来优化其物理合理性。具体而言,Re$^2$MoGen包括三个阶段:首先,我们采用蒙特卡罗树搜索增强LLM在生成基于文本提示的合理关键帧时的推理能力,仅指定根节点和若干关键关节的位置以简化推理过程。然后,我们应用人体姿态模型作为先验,基于规划的关键帧优化全身姿态,并使用生成的不完整运动来监督通过动态时间匹配目标微调预训练的运动生成器,从而实现时空补全。最后,我们使用带有物理感知奖励的后训练来精炼运动质量,以消除LLM规划运动中的物理不合理性。大量实验表明,我们的框架能够生成语义一致且物理合理的运动,并在开放词汇运动生成中实现了最先进的性能。
cs.CV / 192 / 2604.17818
AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion
AnyLift:通过2D扩散扩展互联网视频中的运动重建
Abstract
Reconstructing 3D human motion and human-object interactions (HOI) from Internet videos is a fundamental step toward building large-scale datasets of human behavior. Existing methods struggle to recover globally consistent 3D motion under dynamic cameras, especially for motion types underrepresented in current motion-capture datasets, and face additional difficulty recovering coherent human-object interactions in 3D. We introduce a two-stage framework leveraging 2D diffusion that reconstructs 3D human motion and HOI from Internet videos. In the first stage, we synthesize multi-view 2D motion data for each domain, leveraging 2D keypoints extracted from Internet videos to incorporate human motions that rarely appear in existing MoCap datasets. In the second stage, a camera-conditioned multi-view 2D motion diffusion model is trained on the domain-specific synthetic data to recover 3D human motion and 3D HOI in the world space. We demonstrate the effectiveness of our method on Internet videos featuring challenging motions such as gymnastics, as well as in-the-wild HOI videos, and show that it outperforms prior work in producing realistic human motion and human-object interaction.
Chinese Translation
从互联网视频中重建3D人类运动和人类-物体交互(HOI)是构建大规模人类行为数据集的基础步骤。现有方法在动态摄像机下恢复全球一致的3D运动方面存在困难,尤其是对于在当前运动捕捉数据集中表现不足的运动类型,同时在恢复一致的人类-物体交互的3D表现时也面临额外挑战。我们提出了一种利用2D扩散的两阶段框架,从互联网视频中重建3D人类运动和HOI。在第一阶段,我们为每个领域合成多视角的2D运动数据,利用从互联网视频中提取的2D关键点,结合在现有运动捕捉数据集中很少出现的人类运动。在第二阶段,基于领域特定的合成数据训练一个受摄像机条件影响的多视角2D运动扩散模型,以恢复世界空间中的3D人类运动和3D HOI。我们在包含挑战性运动(如体操)的视频以及野外HOI视频上展示了我们方法的有效性,并表明其在生成真实的人类运动和人类-物体交互方面优于先前的工作。
cs.CV / 193 / 2604.17822
GR4CIL: Gap-compensated Routing for CLIP-based Class Incremental Learning
GR4CIL:基于CLIP的类别增量学习的间隙补偿路由
Abstract
Class-Incremental Learning (CIL) aims to continuously acquire new categories while preserving previously learned knowledge. Recently, Contrastive Language-Image Pre-trained (CLIP) models have shown strong potential for CIL due to their powerful generalization ability. However, existing methods still face two key challenges: shared-parameter adaptation tends to cause old-knowledge drift, and task-specific knowledge organization often leads to poorly calibrated cross-task responses, making reliable routing difficult. To address these issues, we propose GR4CIL, a framework combining task discrimination and knowledge routing for CLIP-based CIL. GR4CIL preserves task-specific visual knowledge while maintaining an incrementally stable shared textual semantic space, thereby reducing interference across tasks. Moreover, we introduce an orthogonal compensation mechanism to mitigate modality-gap-induced bias, enhance within-task discrimination, and enlarge the score margin between the ground-truth task and competing tasks. As a result, GR4CIL enables more reliable task-aware routing over learned knowledge while retaining the zero-shot generalization capability. Experiments on multiple benchmarks show that GR4CIL consistently outperforms strong baselines.
Chinese Translation
类别增量学习(CIL)旨在在持续获取新类别的同时保留先前学习的知识。近年来,基于对比语言-图像预训练(CLIP)模型因其强大的泛化能力而显示出在CIL中的巨大潜力。然而,现有方法仍面临两个关键挑战:共享参数的适应性往往导致旧知识漂移,而任务特定的知识组织则常常导致跨任务响应的校准不良,从而使得可靠的路由变得困难。为了解决这些问题,我们提出了GR4CIL,一个结合任务区分和知识路由的框架,用于基于CLIP的CIL。GR4CIL在保持增量稳定的共享文本语义空间的同时,保留了任务特定的视觉知识,从而减少了任务间的干扰。此外,我们引入了一种正交补偿机制,以减轻由模态间隙引起的偏差,增强任务内区分能力,并扩大真实任务与竞争任务之间的得分差距。因此,GR4CIL能够在保留零样本泛化能力的同时,实现对学习知识的更可靠的任务感知路由。在多个基准测试中的实验表明,GR4CIL始终优于强基线。
cs.CV / 194 / 2604.17831
PCM-NeRF: Probabilistic Camera Modeling for Neural Radiance Fields under Pose Uncertainty
PCM-NeRF:在姿态不确定性下的神经辐射场的概率相机建模
Abstract
Neural surface reconstruction methods typically treat camera poses as fixed values, assuming perfect accuracy from Structure-from-Motion (SfM) systems. This assumption breaks down with imperfect pose estimates, leading to distorted or incomplete reconstructions. We present PCM-NeRF, a probabilistic framework that augments neural surface reconstruction with per-camera learnable uncertainty, built on top of SG-NeRF. Rather than treating all cameras equally throughout optimization, we represent each pose as a distribution with a learnable mean and variance, initialized from SfM correspondence quality. An uncertainty regularization loss couples the learned variance to view confidence, and the resulting uncertainty directly modulates the effective pose learning rate: uncertain cameras receive damped gradient updates, preventing poorly initialized views from corrupting the reconstruction. This lightweight mechanism requires no changes to the rendering pipeline and adds negligible overhead. Experiments on challenging scenes with severe pose outliers demonstrate that PCM-NeRF consistently outperforms state-of-the-art methods in both Chamfer Distance and F-Score, particularly for geometrically complex structures, without requiring foreground masks.
Chinese Translation
神经表面重建方法通常将相机姿态视为固定值,假设来自运动重建(Structure-from-Motion, SfM)系统的精确度是完美的。然而,这一假设在姿态估计不准确时会失效,导致重建结果扭曲或不完整。我们提出了PCM-NeRF,这是一个概率框架,通过每个相机可学习的不确定性增强神经表面重建,基于SG-NeRF构建。我们并不是在优化过程中将所有相机视为相同,而是将每个姿态表示为一个具有可学习均值和方差的分布,这些均值和方差是根据SfM对应质量初始化的。一个不确定性正则化损失将学习到的方差与视图置信度相结合,最终的不确定性直接调节有效的姿态学习率:不确定的相机会接收减弱的梯度更新,从而防止初始化不良的视图破坏重建。这个轻量级机制无需对渲染管道进行更改,并且增加的开销可以忽略不计。在具有严重姿态异常的挑战性场景上的实验表明,PCM-NeRF在Chamfer距离和F-Score上始终优于最先进的方法,尤其是在几何复杂的结构上,而无需前景掩模。
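The mechanism in the PCM-NeRF abstract, where a camera's learned uncertainty damps its pose updates, can be sketched in a few lines. The damping rule `base_lr / (1 + variance)` is a hypothetical choice made for illustration; the abstract only states that uncertain cameras receive damped gradient updates, not the exact functional form.

```python
def damped_pose_step(pose, grad, variance, base_lr=0.01):
    """One gradient step on a pose vector, scaled down for uncertain cameras."""
    # Hypothetical damping rule: the effective learning rate shrinks as variance grows.
    lr_eff = base_lr / (1.0 + variance)
    return [p - lr_eff * g for p, g in zip(pose, grad)]

pose = [0.0, 0.0, 0.0]
grad = [1.0, -2.0, 0.5]

confident = damped_pose_step(pose, grad, variance=0.0)  # full-size update
uncertain = damped_pose_step(pose, grad, variance=9.0)  # update reduced tenfold
```

Because the damping acts only on the optimizer step, a rule of this kind leaves the rendering pipeline untouched, which is consistent with the abstract's claim of negligible overhead.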
cs.CV / 195 / 2604.17846
AI Approach for MRI-only Full-Spine Vertebral Segmentation and 3D Reconstruction in Paediatric Scoliosis
基于人工智能的MRI全脊柱椎体分割与三维重建在儿童脊柱侧弯中的应用
Abstract
MRI is preferred over CT in paediatric imaging because it avoids ionising radiation, but its use in spine deformity assessment is largely limited by the lack of automated, high-resolution 3D bony reconstruction, which continues to rely on CT. MRI-based 3D reconstruction remains impractical due to manual workflows and the scarcity of labelled full-spine datasets. This study introduces an AI framework that enables fully automated thoracolumbar spine (T1-L5) segmentation and 3D reconstruction from MRI alone. Historical low-dose CT scans from adolescent idiopathic scoliosis (AIS) patients were converted into MRI-like images using a GAN and combined with existing labelled thoracic MRI data to train a U-Net-based model. The resulting algorithm accurately generated continuous thoracolumbar 3D reconstructions, improved segmentation accuracy (88% Dice score), and reduced processing time from approximately 1 hour to under one minute, while preserving AIS-specific deformity features. This approach enables radiation-free 3D deformity assessment from MRI, supporting clinical evaluation, surgical planning, and navigation in paediatric spine care.
Chinese Translation
在儿童影像学中,MRI因其避免了电离辐射而优于CT,但在脊柱畸形评估中的应用受到缺乏自动化、高分辨率三维骨重建的限制,仍然依赖于CT。基于MRI的三维重建由于手动工作流程和标注全脊柱数据集的稀缺而显得不切实际。本研究提出了一种人工智能框架,能够从MRI中实现完全自动化的胸腰椎(T1-L5)分割和三维重建。通过生成对抗网络(GAN)将青少年特发性脊柱侧弯(AIS)患者的历史低剂量CT扫描转换为类似MRI的图像,并与现有的标注胸部MRI数据结合,训练了一个基于U-Net的模型。所得到的算法准确生成了连续的胸腰椎三维重建,提高了分割精度(88% Dice系数),并将处理时间从大约1小时减少到不到1分钟,同时保留了特定于AIS的畸形特征。这种方法实现了基于MRI的无辐射三维畸形评估,支持儿童脊柱护理中的临床评估、手术规划和导航。
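For readers unfamiliar with the Dice score reported above, a minimal sketch of the standard definition, 2|A ∩ B| / (|A| + |B|), is given below. The flat binary masks and the function name `dice_score` are illustrative only and are not taken from the study.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat sequences of 0/1."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks are treated as a perfect match by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total

pred_mask = [1, 1, 0, 0, 1]
true_mask = [1, 0, 0, 1, 1]
score = dice_score(pred_mask, true_mask)  # 2 * 2 / (3 + 3)
```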
cs.CV / 196 / 2604.17850
UniCSG: Unified High-Fidelity Content-Constrained Style-Driven Generation via Staged Semantic and Frequency Disentanglement
UniCSG:通过分阶段语义和频率解耦实现统一的高保真内容约束风格驱动生成
Abstract
Style transfer must match a target style while preserving content semantics. DiT-based diffusion models often suffer from content-style entanglement, leading to reference-content leakage and unstable generation. We present UniCSG, a unified framework for content-constrained, style-driven generation in both text-guided and reference-guided settings. UniCSG employs staged training: (i) a latent-space semantic disentanglement stage that combines low-frequency preprocessing with conditioning corruption to encourage content-style separation, and (ii) a latent-space frequency-aware detail reconstruction stage that refines details via multi-scale frequency supervision. We further incorporate pixel-space reward learning to align latent objectives with perceptual quality after decoding. Experiments demonstrate improved content faithfulness, style alignment, and robustness in both settings.
Chinese Translation
风格迁移必须在保留内容语义的同时匹配目标风格。基于 DiT 的扩散模型常常受到内容与风格纠缠的影响,导致参考内容泄漏和生成不稳定。我们提出了 UniCSG,这是一个用于内容约束、风格驱动生成的统一框架,适用于文本引导和参考引导设置。UniCSG 采用分阶段训练:(i)一个潜在空间语义解耦阶段,该阶段结合低频预处理与条件腐蚀,以促进内容与风格的分离;(ii)一个潜在空间频率感知细节重建阶段,通过多尺度频率监督来细化细节。我们进一步结合像素空间奖励学习,以在解码后将潜在目标与感知质量对齐。实验表明,在这两种设置中,内容忠实度、风格一致性和鲁棒性均有所改善。
cs.CV / 197 / 2604.17856
PlankFormer: Robust Plankton Instance Segmentation via MAE-Pretrained Vision Transformers and Pseudo Community Image Generation
PlankFormer:通过MAE预训练视觉变换器和伪社区图像生成实现稳健的浮游生物实例分割
Abstract
Plankton monitoring is essential for assessing aquatic ecosystems but is limited by the labor-intensive nature of manual microscopic analysis. Automating the segmentation of plankton from crowded images is crucial, but it faces two major challenges: (i) the scarcity of pixel-level annotated datasets and (ii) the difficulty of distinguishing plankton from debris and overlapping individuals using conventional CNN-based methods. To address these issues, we propose PlankFormer, a novel framework for plankton instance segmentation. First, to overcome the data shortage, we introduce a method to generate labeled Pseudo Community Images (PCI) by synthesizing individual plankton images onto diverse backgrounds, including those created by generative models. Second, we propose a segmentation model utilizing a Vision Transformer (ViT) backbone with a Mask2Former decoder. To robustly capture the global structural features of plankton against occlusion and debris, we employ a Masked Autoencoder (MAE) for self-supervised pre-training on unlabeled individual images. Experimental results on real-world datasets demonstrate that our method significantly outperforms conventional methods, such as Mask R-CNN, particularly in challenging environments with high debris density. We demonstrate that our synthetic training strategy and MAE-based architecture enable high-precision segmentation while requiring fewer manual annotations for individual plankton images.
Chinese Translation
浮游生物监测对于评估水生生态系统至关重要,但受到人工显微分析劳动强度大的限制。自动化从拥挤图像中分割浮游生物是关键,然而面临两个主要挑战:(i)像素级标注数据集的稀缺性和(ii)使用传统卷积神经网络(CNN)方法难以区分浮游生物与碎片及重叠个体。为了解决这些问题,我们提出了PlankFormer,一个用于浮游生物实例分割的新框架。首先,为了克服数据短缺,我们引入了一种通过将单个浮游生物图像合成到多样背景(包括生成模型创建的背景)上来生成标注伪社区图像(PCI)的方法。其次,我们提出了一种利用视觉变换器(ViT)作为主干的分割模型,并配备Mask2Former解码器。为了稳健地捕捉浮游生物在遮挡和碎片下的全局结构特征,我们采用了一种掩码自编码器(MAE)在未标注的单个图像上进行自监督预训练。真实世界数据集上的实验结果表明,我们的方法显著优于传统方法,如Mask R-CNN,特别是在高碎片密度的挑战性环境中。我们证明了我们的合成训练策略和基于MAE的架构能够实现高精度分割,同时减少对单个浮游生物图像的手动标注需求。
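The Pseudo Community Image idea, pasting individual organisms onto a background so that instance labels come for free, can be sketched in a few lines. This is a toy illustration under assumed conventions (plain 2D lists, a hypothetical `paste_instance` helper, later pastes occluding earlier ones) and not the authors' generation pipeline:

```python
def paste_instance(image, labels, crop, mask, top, left, instance_id):
    """Paste `crop` where `mask` is 1 and record `instance_id` in the label map.

    All arguments are plain 2D lists; later pastes overwrite earlier ones,
    which imitates one organism occluding another.
    """
    for r, (crop_row, mask_row) in enumerate(zip(crop, mask)):
        for c, (value, keep) in enumerate(zip(crop_row, mask_row)):
            y, x = top + r, left + c
            if keep and 0 <= y < len(image) and 0 <= x < len(image[0]):
                image[y][x] = value
                labels[y][x] = instance_id


height, width = 4, 5
image = [[0] * width for _ in range(height)]   # blank background
labels = [[0] * width for _ in range(height)]  # 0 means background
crop = [[9, 9], [9, 9]]
mask = [[1, 1], [0, 1]]                        # non-rectangular organism
paste_instance(image, labels, crop, mask, top=0, left=0, instance_id=1)
paste_instance(image, labels, crop, mask, top=0, left=1, instance_id=2)
for row in labels:
    print(row)
```

Because the label map is written at the same time as the pixels, every synthetic image carries pixel-level instance annotations without manual work.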
cs.CV / 198 / 2604.17865
Sharpening Lightweight Models for Generalized Polyp Segmentation: A Boundary Guided Distillation from Foundation Models
针对广义息肉分割的轻量模型优化:基于基础模型的边界引导蒸馏
Abstract
Automated polyp segmentation is critical for early colorectal cancer detection and its prevention, yet remains challenging due to weak boundaries, large appearance variations, and limited annotated data. Lightweight segmentation models such as U-Net, U-Net++, and PraNet offer practical efficiency for clinical deployment but struggle to capture the rich semantic and structural cues required for accurate delineation of complex polyp regions. In contrast, large Vision Foundation Models (VFMs), including SAM, OneFormer, Mask2Former, and DINOv2, exhibit strong generalization but transfer poorly to polyp segmentation due to domain mismatch, insufficient boundary sensitivity, and high computational cost. To bridge this gap, we propose LiteBounD, a Lightweight Boundary-guided Distillation framework that transfers complementary semantic and structural priors from multiple VFMs into compact segmentation backbones. LiteBounD introduces (i) a dual-path distillation mechanism that disentangles semantic and boundary-aware representations, (ii) a frequency-aware alignment strategy that supervises low-frequency global semantics and high-frequency boundary details separately, and (iii) a boundary-aware decoder that fuses multi-scale encoder features with distilled semantically rich boundary information for precise segmentation. Extensive experiments on both seen (Kvasir-SEG, CVC-ClinicDB) and unseen (ColonDB, CVC-300, ETIS) datasets demonstrate that LiteBounD consistently outperforms its lightweight baselines by a significant margin and achieves performance competitive with state-of-the-art methods, while maintaining the efficiency required for real-time clinical use. Our code is available at https://github.com/lostinrepo/LiteBounD.
Chinese Translation
自动化息肉分割对于早期结直肠癌的检测及其预防至关重要,但由于边界模糊、外观变化大以及标注数据有限,仍然面临挑战。轻量级分割模型如 U-Net、U-Net++ 和 PraNet 在临床应用中提供了实用的效率,但在捕捉复杂息肉区域所需的丰富语义和结构线索方面表现不佳。相比之下,大型视觉基础模型(Vision Foundation Models, VFMs),如 SAM、OneFormer、Mask2Former 和 DINOv2,展现出强大的泛化能力,但由于领域不匹配、边界敏感性不足和高计算成本,转移到息肉分割任务时效果不佳。为了解决这一问题,我们提出了 LiteBounD,一个轻量级边界引导蒸馏框架,该框架将来自多个 VFMs 的互补语义和结构先验转移到紧凑的分割骨干网络中。LiteBounD 引入了 (i) 一种双路径蒸馏机制,解耦语义和边界感知表示,(ii) 一种频率感知对齐策略,分别监督低频全局语义和高频边界细节,以及 (iii) 一种边界感知解码器,将多尺度编码器特征与蒸馏的语义丰富边界信息融合,以实现精确分割。在 Kvasir-SEG、CVC-ClinicDB 等已见数据集和 ColonDB、CVC-300、ETIS 等未见数据集上的广泛实验表明,LiteBounD 在性能上显著超越其轻量基线,并与最先进的方法竞争,同时保持实时临床使用所需的效率。我们的代码可在 https://github.com/lostinrepo/LiteBounD 获取。
cs.CV / 199 / 2604.17873
Spatiotemporal Sycophancy: Negation-Based Gaslighting in Video Large Language Models
时空谄媚:基于否定的煤气灯效应在视频大型语言模型中的表现
Abstract
Video Large Language Models (Vid-LLMs) have demonstrated remarkable performance in video understanding tasks, yet their robustness under conversational interaction remains largely underexplored. In this paper, we identify spatiotemporal sycophancy, a failure mode in which Vid-LLMs retract initially correct, visually grounded judgments and conform to misleading user feedback under negation-based gaslighting. Rather than merely changing their answers, the models often fabricate unsupported temporal or spatial explanations to justify incorrect revisions. To systematically investigate this phenomenon, we propose a negation-based gaslighting evaluation framework and introduce GasVideo-1000, a curated benchmark designed to probe spatiotemporal sycophancy with clear visual grounding and temporal reasoning requirements. We evaluate a broad range of state-of-the-art open-source and proprietary Vid-LLMs across diverse video understanding tasks. Extensive experiments reveal that vulnerability to negation-based gaslighting is pervasive and severe, even among models with strong baseline performance. While prompt-level grounding constraints can partially mitigate this behavior, they do not reliably prevent hallucinated justifications or belief reversal. Our results indicate that current Vid-LLMs lack robust mechanisms for maintaining grounded spatiotemporal beliefs under adversarial conversational feedback.
Chinese Translation
视频大型语言模型(Vid-LLMs)在视频理解任务中表现出色,但它们在对话交互下的鲁棒性仍然未得到充分探索。本文识别了时空谄媚这一失效模式,在该模式中,Vid-LLMs 撤回最初正确的、基于视觉的判断,并在基于否定的煤气灯效应下顺应误导性用户反馈。模型不仅仅是改变答案,往往还会编造不支持的时间或空间解释,以为不正确的修正辩护。为了系统地研究这一现象,我们提出了一个基于否定的煤气灯评估框架,并引入了 GasVideo-1000,这是一个经过精心策划的基准,旨在探测具有明确视觉基础和时间推理要求的时空谄媚。我们评估了一系列最先进的开源和专有 Vid-LLMs,涵盖多种视频理解任务。广泛的实验表明,脆弱性对基于否定的煤气灯效应普遍且严重,即使在具有强基线性能的模型中也是如此。虽然提示级别的基础约束可以部分缓解这种行为,但它们并不能可靠地防止幻觉解释或信念反转。我们的结果表明,当前的 Vid-LLMs 缺乏在对抗性对话反馈下维持扎根的时空信念的稳健机制。
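The abstract does not spell out its scoring formula, so the following is only a sketch of one plausible way to quantify the failure mode it describes: the share of initially correct answers that the model abandons after the user negates them. The record keys and the name `sycophancy_flip_rate` are hypothetical.

```python
def sycophancy_flip_rate(records):
    """Fraction of initially correct answers that are abandoned after a negation.

    Each record is a dict with hypothetical keys:
      'truth'  - ground-truth answer
      'before' - model answer before the user's negation
      'after'  - model answer after the user's negation
    """
    initially_correct = [r for r in records if r["before"] == r["truth"]]
    if not initially_correct:
        return 0.0
    flipped = sum(1 for r in initially_correct if r["after"] != r["truth"])
    return flipped / len(initially_correct)


records = [
    {"truth": "A", "before": "A", "after": "B"},  # correct, then flipped
    {"truth": "C", "before": "C", "after": "C"},  # correct, held firm
    {"truth": "B", "before": "D", "after": "B"},  # initially wrong: excluded
    {"truth": "D", "before": "D", "after": "A"},  # correct, then flipped
]
print(sycophancy_flip_rate(records))  # 2 of the 3 initially correct answers flipped
```

Restricting the denominator to initially correct answers separates sycophancy from ordinary baseline errors.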
cs.CV / 200 / 2604.17879
Exploring Boundary-Aware Spatial-Frequency Fusion for Camouflaged Object Detection
探索边界感知的空间频率融合用于伪装物体检测
Abstract
Camouflaged Object Detection (COD) is challenging due to the high degree of similarity between camouflaged objects and their surrounding backgrounds. Current COD methods mainly rely on edge extraction in the spatial domain and local pixel-level information, neglecting the importance of global structural features. Additionally, they fail to effectively leverage phase spectrum information within frequency domain features. To this end, we propose BASFNet, a COD framework based on boundary-aware fusion of the frequency and spatial domains. This method uses dual-guided integration of frequency-domain and spatial-domain features. A phase-spectrum-based frequency-enhanced edge exploration module (FEEM) and a spatial core segmentation module (SCSM) are introduced to jointly capture the boundary and object features of camouflaged objects. These features are then effectively integrated through a spatial-frequency fusion interaction module (SFFIM). Furthermore, boundary detection is optimized through a boundary-aware training strategy. BASFNet outperforms existing state-of-the-art methods on three benchmark datasets, validating the effectiveness of the fusion of frequency and spatial domain information in COD tasks.
Chinese Translation
伪装物体检测因伪装物体与其周围背景之间高度相似而具有挑战性。目前的伪装物体检测方法主要依赖于空间域中的边缘提取和局部像素级信息,忽视了全局结构特征的重要性。此外,它们未能有效利用频率域特征中的相位谱信息的重要性。为此,我们提出了一种基于边界感知的频率域与空间域融合的伪装物体检测框架BASFNet。该方法采用频率域与空间域特征的双向引导集成。引入了基于相位谱的频率增强边缘探索模块(FEEM)和空间核心分割模块(SCSM),共同捕捉伪装物体的边界和特征。这些特征通过空间频率融合交互模块(SFFIM)有效整合。此外,通过边界感知训练策略进一步优化边界检测。BASFNet在三个基准数据集上超越了现有的最先进方法,验证了频率域与空间域信息融合在伪装物体检测任务中的有效性。
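Why the phase spectrum is useful for locating structure can be seen with a small one-dimensional experiment. The sketch below is not BASFNet's FEEM module; it only illustrates the general signal-processing fact that the Fourier phase encodes position, using a hand-written DFT so that no external library is assumed.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning only the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def phase_only(signal, eps=1e-9):
    """Rebuild a signal from its phase spectrum alone (all magnitudes set to 1)."""
    spectrum = dft(signal)
    # Bins with (near-)zero magnitude carry no usable phase, so they are dropped.
    unit = [c / abs(c) if abs(c) > eps else 0j for c in spectrum]
    return idft(unit)


scaled_impulse = [0, 0, 0, 5, 0, 0, 0, 0]
reconstruction = phase_only(scaled_impulse)
print([round(abs(v), 6) for v in reconstruction])
```

The amplitude 5 disappears, but the reconstruction still peaks at index 3: discarding the magnitude removes intensity information while the phase keeps the location.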
cs.CV / 201 / 2604.17889
AeroRAG: Structured Multimodal Retrieval-Augmented LLM for Fine-Grained Aerial Visual Reasoning
AeroRAG:用于细粒度航空视觉推理的结构化多模态检索增强大型语言模型
Abstract
Despite recent progress in multimodal large language models (MLLMs), reliable visual question answering in aerial scenes remains challenging. In such scenes, task-critical evidence is often carried by small objects, explicit quantities, coarse locations, and inter-object relations, whereas conventional dense visual-token representations are not well aligned with these structured semantics. To address this interface mismatch, we propose AeroRAG, a scene-graph-guided multimodal retrieval-augmented generation framework for visual question answering. The framework first converts an input image into structured visual knowledge, including object categories, quantities, spatial locations, and semantic relations, and then retrieves query-relevant semantic chunks to construct compact prompts for a text-based large language model. Rather than relying on direct reasoning over dense visual tokens, our method introduces a more explicit intermediate interface between perception and language reasoning. Experiments on the AUG aerial dataset and the general-domain VG-150 benchmark show consistent improvements over six strong MLLM baselines, with the largest gains observed in dense aerial scenes and relation-sensitive reasoning. We further evaluate the framework on VQAv2 to verify that the proposed interface remains compatible with standard visual reasoning settings. These results suggest that structured retrieval is a practical design direction for deployment-oriented and grounded visual reasoning systems.
Chinese Translation
尽管近年来多模态大型语言模型(MLLMs)取得了进展,但在航空场景中进行可靠的视觉问答仍然具有挑战性。在这些场景中,任务关键证据通常由小物体、明确的数量、粗略的位置和物体间关系承载,而传统的密集视觉标记表示与这些结构化语义并不匹配。为了解决这一接口不匹配问题,我们提出了AeroRAG,一种基于场景图的多模态检索增强生成框架,用于视觉问答。该框架首先将输入图像转换为结构化视觉知识,包括物体类别、数量、空间位置和语义关系,然后检索与查询相关的语义片段,以构建紧凑的提示供文本基础的大型语言模型使用。我们的算法并不依赖于对密集视觉标记的直接推理,而是引入了感知与语言推理之间更明确的中间接口。在AUG航空数据集和通用领域VG-150基准上的实验表明,相较于六个强大的MLLM基线,我们的方法在密集航空场景和关系敏感推理中显示出一致的改进,最大增益尤为显著。我们进一步在VQAv2上评估该框架,以验证所提出的接口与标准视觉推理设置的兼容性。这些结果表明,结构化检索是面向部署和基于实证的视觉推理系统的一个实用设计方向。
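The pipeline described above (image to structured knowledge, then retrieval of query-relevant chunks, then a compact text prompt) can be sketched as follows. The chunk format, the word-overlap scorer, and all function names are assumptions made for illustration; the paper's actual scene-graph extractor and retriever are not specified in the abstract.

```python
def scene_graph_to_chunks(objects, relations):
    """Turn a toy scene graph into short text chunks (format is an assumption)."""
    chunks = [f"{o['count']} {o['category']} located at {o['location']}" for o in objects]
    chunks += [f"{s} {p} {o}" for s, p, o in relations]
    return chunks


def retrieve(chunks, question, k=2):
    """Rank chunks by word overlap with the question; a stand-in for a real retriever."""
    q_words = set(question.lower().replace("?", "").split())
    scored = [(len(q_words & set(c.lower().split())), i, c) for i, c in enumerate(chunks)]
    scored.sort(key=lambda item: (-item[0], item[1]))  # best score first, stable order
    return [c for score, _, c in scored[:k] if score > 0]


objects = [
    {"category": "car", "count": 12, "location": "top-left"},
    {"category": "truck", "count": 3, "location": "center"},
]
relations = [("truck", "parked next to", "warehouse")]
question = "Where is the truck parked?"

chunks = scene_graph_to_chunks(objects, relations)
context = retrieve(chunks, question)
prompt = "Context:\n" + "\n".join(context) + "\nQuestion: " + question
print(prompt)
```

Only the chunks that mention the truck reach the prompt, so the language model reasons over a few explicit facts instead of dense visual tokens.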
cs.CV / 202 / 2604.17898
ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval
ReTrack:基于证据驱动的双流方向锚定校准网络用于复合视频检索
Abstract
With the rapid growth of video data, Composed Video Retrieval (CVR) has emerged as a novel paradigm in video retrieval and is receiving increasing attention from researchers. Unlike unimodal video retrieval methods, the CVR task takes a multi-modal query consisting of a reference video and a piece of modification text as input. The modification text conveys the user's intended alterations to the reference video. Based on this input, the model aims to retrieve the most relevant target video. In the CVR task, there exists a substantial discrepancy in information density between video and text modalities. Traditional composition methods tend to bias the composed feature toward the reference video, which leads to suboptimal retrieval performance. This limitation is significant due to the presence of three core challenges: (1) modal contribution entanglement, (2) explicit optimization of composed features, and (3) retrieval uncertainty. To address these challenges, we propose the evidence-dRivEn dual-sTream diRectionAl anChor calibration networK (ReTrack). ReTrack is the first CVR framework that improves multi-modal query understanding by calibrating directional bias in composed features. It consists of three key modules: Semantic Contribution Disentanglement, Composition Geometry Calibration, and Reliable Evidence-driven Alignment. Specifically, ReTrack estimates the semantic contribution of each modality to calibrate the directional bias of the composed feature. It then uses the calibrated directional anchors to compute bidirectional evidence that drives reliable composed-to-target similarity estimation. Moreover, ReTrack exhibits strong generalization to the Composed Image Retrieval (CIR) task, achieving SOTA performance across three benchmark datasets in both CVR and CIR scenarios. Code is available at https://github.com/Lee-zixu/ReTrack
Chinese Translation
随着视频数据的快速增长,复合视频检索(Composed Video Retrieval, CVR)作为视频检索中的一种新兴范式,正受到研究者们的越来越多关注。与单模态视频检索方法不同,CVR任务以一个参考视频和一段修改文本组成的多模态查询作为输入。修改文本传达了用户对参考视频的预期修改。基于这一输入,模型旨在检索最相关的目标视频。在CVR任务中,视频和文本模态之间的信息密度存在显著差异。传统的组合方法往往使组合特征偏向于参考视频,从而导致检索性能不佳。这一局限性主要源于三个核心挑战:(1)模态贡献纠缠,(2)组合特征的显式优化,以及(3)检索不确定性。为了解决这些挑战,我们提出了证据驱动的双流方向锚定校准网络(ReTrack)。ReTrack是第一个通过校准组合特征中的方向偏差来提升多模态查询理解的CVR框架。它由三个关键模块组成:语义贡献解耦、组合几何校准和可靠的证据驱动对齐。具体而言,ReTrack估计每个模态的语义贡献,以校准组合特征的方向偏差。然后,它利用校准后的方向锚点计算双向证据,从而驱动可靠的组合到目标的相似性估计。此外,ReTrack在复合图像检索(Composed Image Retrieval, CIR)任务中表现出强大的泛化能力,在CVR和CIR场景下的三个基准数据集上均达到了最先进的性能。代码可在 https://github.com/Lee-zixu/ReTrack 获取。
cs.CV / 203 / 2604.17899
MEDN: Motion-Emotion Feature Decoupling Network for Micro-Expression Recognition
MEDN:用于微表情识别的运动-情感特征解耦网络
Abstract
Unlike macro-expressions, micro-expressions do not follow a strictly consistent mapping rule between emotions and Action Units (AUs). As a result, some micro-expressions share identical AUs yet represent completely opposite emotional categories, making them highly visually similar. Existing micro-expression recognition (MER) methods mostly rely on explicit facial motion cues (e.g., optical flow, frame differences, AU features) while ignoring implicit emotion information. To tackle this issue, this paper presents a Motion-Emotion Feature Decoupling Network (MEDN) for MER. We design a dual-branch framework to separately extract motion and emotion features. In the motion branch, an AU-detection task restricts features to the explicit motion domain, and an orthogonal loss is adopted to reduce motion-emotion feature coupling. For implicit emotion modeling, we propose a Sparse Emotion Vision Transformer (SEVit) that sparsifies spatial tokens to highlight local temporal variations with multi-scale sparsity rates. A Collaborative Fusion Module (CoFM) is further developed to fuse disentangled motion and emotion features adaptively. Extensive experiments on three benchmark datasets validate that MEDN effectively decouples motion and emotion features and achieves superior recognition performance, offering a new perspective for enhancing recognition accuracy and generalization.
Chinese Translation
与宏观表情不同,微表情在情感与动作单位(Action Units, AUs)之间并不遵循严格一致的映射规则。因此,一些微表情虽然共享相同的动作单位,却代表完全相反的情感类别,使得它们在视觉上高度相似。现有的微表情识别(MER)方法大多依赖于显式的面部运动线索(例如,光流、帧差、动作单位特征),而忽视了隐式的情感信息。为了解决这一问题,本文提出了一种用于微表情识别的运动情感特征解耦网络(Motion Emotion Feature Decoupling Network, MEDN)。我们设计了一个双支路框架,以分别提取运动和情感特征。在运动支路中,通过一个动作单位检测任务将特征限制在显式运动领域,并采用正交损失来减少运动与情感特征之间的耦合。为了进行隐式情感建模,我们提出了一种稀疏情感视觉变换器(Sparse Emotion Vision Transformer, SEVit),该变换器通过稀疏化空间标记来突出局部时间变化,并采用多尺度稀疏率。此外,我们进一步开发了一个协同融合模块(Collaborative Fusion Module, CoFM),以自适应地融合解耦的运动和情感特征。在三个基准数据集上的大量实验验证了MEDN有效地解耦了运动和情感特征,并实现了卓越的识别性能,为提高识别准确性和泛化能力提供了新的视角。
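The abstract mentions an orthogonal loss that reduces coupling between the motion and emotion branches without giving its formula. One common form, shown here purely as an assumed sketch, penalizes the squared cosine similarity between the two feature vectors:

```python
import math


def orthogonal_loss(motion_feat, emotion_feat, eps=1e-12):
    """Squared cosine similarity between two feature vectors.

    0 when the vectors are orthogonal, 1 when they are parallel.
    """
    dot = sum(m * e for m, e in zip(motion_feat, emotion_feat))
    norm_m = math.sqrt(sum(m * m for m in motion_feat))
    norm_e = math.sqrt(sum(e * e for e in emotion_feat))
    cos = dot / (norm_m * norm_e + eps)  # eps guards against zero-length vectors
    return cos * cos


print(orthogonal_loss([1.0, 0.0], [0.0, 1.0]))  # orthogonal features: no penalty
print(orthogonal_loss([1.0, 2.0], [2.0, 4.0]))  # parallel features: penalty close to 1
```

Minimizing this term pushes the two branches toward encoding different directions in feature space; whether MEDN uses this exact formulation is not stated in the abstract.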
cs.CV / 204 / 2604.17914
Beyond Binary Contrast: Modeling Continuous Skeleton Action Spaces with Transitional Anchors
超越二元对比:利用过渡锚点建模连续骨架动作空间
Abstract
Self-supervised contrastive learning has emerged as a powerful paradigm for skeleton-based action recognition by enforcing consistency in the embedding space. However, existing methods rely on binary contrastive objectives that overlook the intrinsic continuity of human motion, resulting in fragmented feature clusters and rigid class boundaries. To address these limitations, we propose TranCLR, a Transitional anchor-based Contrastive Learning framework that captures the continuous geometry of the action space. Specifically, the proposed Action Transitional Anchor Construction (ATAC) explicitly models the geometric structure of transitional states to enhance the model's perception of motion continuity. Building upon these anchors, a Multi-Level Geometric Manifold Calibration (MGMC) mechanism is introduced to adaptively calibrate the action manifold across multiple levels of continuity, yielding a smoother and more discriminative representation space. Extensive experiments on the NTU RGB+D, NTU RGB+D 120 and PKU-MMD datasets demonstrate that TranCLR achieves superior accuracy and calibration performance, effectively learning continuous and uncertainty-aware skeleton representations. The code is available at https://github.com/Philchieh/TranCLR.
Chinese Translation
自监督对比学习已成为基于骨架的动作识别的强大范式,通过在嵌入空间中强制一致性。然而,现有方法依赖于二元对比目标,忽视了人类运动的内在连续性,导致特征簇碎片化和类别边界僵化。为了解决这些局限性,我们提出了TranCLR,一种基于过渡锚点的对比学习框架,旨在捕捉动作空间的连续几何特性。具体而言,所提出的动作过渡锚点构建(Action Transitional Anchor Construction, ATAC)明确建模过渡状态的几何结构,以增强模型对运动连续性的感知。在这些锚点的基础上,引入了一种多级几何流形校准(Multi-Level Geometric Manifold Calibration, MGMC)机制,以自适应地校准多个连续性层级的动作流形,从而产生更平滑和更具区分性的表示空间。在NTU RGB+D、NTU RGB+D 120和PKU-MMD数据集上的大量实验表明,TranCLR在准确性和校准性能上均表现优越,有效地学习了连续且具不确定性感知的骨架表示。代码可在 https://github.com/Philchieh/TranCLR 获取。
cs.CV / 205 / 2604.17915
OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
OneDrive:基于视觉-语言-动作模型的统一多范式驾驶
Abstract
Vision-Language Models (VLMs) excel at autoregressive text generation, yet end-to-end autonomous driving requires multi-task learning with structured outputs and heterogeneous decoding behaviors, such as autoregressive language generation, parallel object detection and trajectory regression. To accommodate these differences, existing systems typically introduce separate or cascaded decoders, resulting in architectural fragmentation and limited backbone reuse. In this work, we present a unified autonomous driving framework built upon a pretrained VLM, where heterogeneous decoding behaviors are reconciled within a single transformer decoder. We demonstrate that pretrained VLM attention exhibits strong transferability beyond pure language modeling. By organizing visual and structured query tokens within a single causal decoder, structured queries can naturally condition on visual context through the original attention mechanism. Textual and structured outputs share a common attention backbone, enabling stable joint optimization across heterogeneous tasks. Trajectory planning is realized within the same causal LLM decoder by introducing structured trajectory queries. This unified formulation enables planning to share the pretrained attention backbone with images and perception tokens. Extensive experiments on end-to-end autonomous driving benchmarks demonstrate state-of-the-art performance, including 0.28 L2 and 0.18 collision rate on nuScenes open-loop evaluation and competitive results (86.8 PDMS) on NAVSIM closed-loop evaluation. The full model preserves multi-modal generation capability, while an efficient inference mode achieves approximately 40% lower latency. Code and models are available at https://github.com/Z1zyw/OneDrive
Chinese Translation
视觉-语言模型(VLMs)在自回归文本生成方面表现出色,但端到端的自主驾驶需要具有结构化输出和异构解码行为的多任务学习,例如自回归语言生成、并行目标检测和轨迹回归。为了适应这些差异,现有系统通常引入独立或级联的解码器,导致架构碎片化和有限的主干重用。在本研究中,我们提出了一种基于预训练VLM的统一自主驾驶框架,其中异构解码行为在单个变换器解码器中得以调和。我们证明了预训练VLM的注意力机制在纯语言建模之外具有很强的迁移能力。通过在单个因果解码器中组织视觉和结构化查询标记,结构化查询可以通过原始注意力机制自然地依赖于视觉上下文。文本和结构化输出共享一个共同的注意力主干,使得在异构任务之间能够稳定地联合优化。通过引入结构化轨迹查询,轨迹规划在同一个因果LLM解码器中实现。这种统一的公式化使得规划能够与图像和感知标记共享预训练的注意力主干。在端到端自主驾驶基准上的大量实验表明了我们方法的最先进性能,包括在nuScenes开放循环评估中达到0.28的L2和0.18的碰撞率,以及在NAVSIM闭环评估中获得竞争性结果(86.8 PDMS)。完整模型保留了多模态生成能力,而高效的推理模式实现了约40%的延迟降低。代码和模型可在https://github.com/Z1zyw/OneDrive获取。
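The open-loop result above is reported as an L2 error between planned and reference trajectories. The abstract does not state the exact averaging protocol, so the sketch below assumes the simplest variant: the mean Euclidean distance over matching 2D waypoints. The function name and the sample waypoints are illustrative.

```python
import math


def average_l2(pred, gt):
    """Mean Euclidean distance between matching waypoints of two 2D trajectories."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    dists = [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / len(dists)


pred = [(0.0, 1.0), (0.3, 2.0), (0.0, 3.4)]  # planned waypoints
gt = [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]    # reference waypoints
print(round(average_l2(pred, gt), 4))
```

Published benchmarks differ in how they aggregate over prediction horizons, so numbers computed this way are only comparable when the same protocol is used.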
cs.CV / 206 / 2604.17920
Prompting Foundation Models for Zero-Shot Ship Instance Segmentation in SAR Imagery
基于提示的基础模型在SAR图像中的零-shot船舶实例分割
Abstract
Synthetic Aperture Radar (SAR) plays a critical role in maritime surveillance, yet deep learning for SAR analysis is limited by the lack of pixel-level annotations. This paper explores how general-purpose vision foundation models can enable zero-shot ship instance segmentation in SAR imagery, eliminating the need for pixel-level supervision. A YOLOv11-based detector trained on open SAR datasets localizes ships via bounding boxes, which then prompt the Segment Anything Model 2 (SAM2) to produce instance masks without any mask annotations. Unlike prior SAM-based SAR approaches that rely on fine tuning or adapters, our method demonstrates that spatial constraints from a SAR-trained detector alone can effectively regularize foundation model predictions. This design partially mitigates the optical-SAR domain gap and enables downstream applications such as vessel classification, size estimation, and wake analysis. Experiments on the SSDD benchmark achieve a mean IoU of 0.637 (89% of a fully supervised baseline) with an overall ship detection rate of 89.2%, confirming a scalable, annotation-efficient pathway toward foundation-model-driven SAR image understanding.
Chinese Translation
合成孔径雷达(SAR)在海洋监视中发挥着关键作用,但由于缺乏像素级注释,深度学习在SAR分析中的应用受到限制。本文探讨了通用视觉基础模型如何实现SAR图像中的零-shot船舶实例分割,从而消除对像素级监督的需求。我们基于YOLOv11的检测器在开放的SAR数据集上进行训练,通过边界框定位船舶,随后提示Segment Anything Model 2(SAM2)生成实例掩膜,而无需任何掩膜注释。与以往依赖微调或适配器的SAM基础SAR方法不同,我们的方法表明,仅凭SAR训练的检测器所提供的空间约束就能有效地规范基础模型的预测。这一设计部分缓解了光学与SAR领域之间的差距,并支持下游应用,如船舶分类、尺寸估计和尾流分析。在SSDD基准测试中的实验结果显示,平均交并比(mean IoU)达到0.637(相当于完全监督基线的89%),整体船舶检测率为89.2%,验证了朝向基础模型驱动的SAR图像理解的可扩展、高效注释路径。
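The reported mean IoU compares predicted instance masks with reference masks. The sketch below computes mask IoU and also shows one simple way a detector box could constrain a mask, by discarding pixels outside the box. That clipping step is an illustrative assumption; in the paper the box is used as a prompt to SAM2, and no clipping is described in the abstract.

```python
def clip_to_box(mask, box):
    """Keep only mask pixels inside box = (x_min, y_min, x_max, y_max), inclusive."""
    x0, y0, x1, y1 = box
    return {(x, y) for (x, y) in mask if x0 <= x <= x1 and y0 <= y <= y1}


def mask_iou(pred, gt):
    """Intersection over union of two masks stored as sets of (x, y) pixels."""
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0


box = (2, 2, 5, 5)                                        # toy detector output
raw = {(x, y) for x in range(1, 6) for y in range(2, 5)}  # toy segmenter output
gt = {(x, y) for x in range(2, 6) for y in range(2, 6)}   # toy reference mask
pred = clip_to_box(raw, box)
print(mask_iou(raw, gt), mask_iou(pred, gt))
```

In this toy case the box removes the pixels that spill outside the ship, raising IoU from 12/19 to 0.75.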
cs.CV / 207 / 2604.17927
Brain-Inspired Capture: Evidence-Driven Neuromimetic Perceptual Simulation for Visual Decoding
脑启发捕获:基于证据的类神经感知模拟用于视觉解码
Abstract
Visual decoding of neurophysiological signals is a critical challenge for brain-computer interfaces (BCIs) and computational neuroscience. However, current approaches are often constrained by the systematic and stochastic gaps between neural and visual modalities, largely neglecting the intrinsic computational mechanisms of the Human Visual System (HVS). To address this, we propose Brain-Inspired Capture (BI-Cap), a neuromimetic perceptual simulation paradigm that aligns these modalities by emulating HVS processing. Specifically, we construct a neuromimetic pipeline comprising four biologically plausible dynamic and static transformations, coupled with Mutual Information (MI)-guided dynamic blur regulation to simulate adaptive visual processing. Furthermore, to mitigate the inherent non-stationarity of neural activity, we introduce an evidence-driven latent space representation. This formulation explicitly models uncertainty, thereby ensuring robust neural embeddings. Extensive evaluations on zero-shot brain-to-image retrieval across two public benchmarks demonstrate that BI-Cap substantially outperforms state-of-the-art methods, achieving relative gains of 9.2% and 8.0%, respectively. We have released the source code on GitHub through the link https://github.com/flysnow1024/BI-Cap.
Chinese Translation
神经生理信号的视觉解码是脑-计算机接口(BCIs)和计算神经科学面临的一个关键挑战。然而,当前的方法常常受到神经和视觉模态之间系统性和随机性差距的限制,主要忽视了人类视觉系统(HVS)的内在计算机制。为了解决这一问题,我们提出了脑启发捕获(Brain-Inspired Capture, BI-Cap),一种类神经感知模拟范式,通过模拟HVS处理来对齐这些模态。具体而言,我们构建了一个类神经管道,包括四种生物学上合理的动态和静态变换,并结合互信息(Mutual Information, MI)引导的动态模糊调节,以模拟自适应视觉处理。此外,为了减轻神经活动的固有非平稳性,我们引入了一种基于证据的潜在空间表示。这一构造明确建模了不确定性,从而确保了稳健的神经嵌入。在两个公共基准上的零样本脑到图像检索的广泛评估表明,BI-Cap显著优于最先进的方法,分别实现了9.2%和8.0%的相对提升。我们已在GitHub上发布了源代码,链接为 https://github.com/flysnow1024/BI-Cap。
cs.CV / 208 / 2604.17941
From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models
从头部到神经元:多任务视觉-语言模型中的因果归因与引导
Abstract
Recent work has increasingly explored neuron-level interpretation in vision-language models (VLMs) to identify neurons critical to final predictions. However, existing neuron analyses generally focus on single tasks, limiting the comparability of neuron importance across tasks. Moreover, ranking strategies tend to score neurons in isolation, overlooking how task-dependent information pathways shape the write-in effects of feed-forward network (FFN) neurons. This oversight can exacerbate neuron polysemanticity in multi-task settings, introducing noise into the identification and intervention of task-critical neurons. In this study, we propose HONES (Head-Oriented Neuron Explanation & Steering), a gradient-free framework for task-aware neuron attribution and steering in multi-task VLMs. HONES ranks FFN neurons by their causal write-in contributions conditioned on task-relevant attention heads, and further modulates salient neurons via lightweight scaling. Experiments on four diverse multimodal tasks and two popular VLMs show that HONES outperforms existing methods in identifying task-critical neurons and improves model performance after steering. Our source code is released at: https://github.com/petergit1/HONES.
Chinese Translation
近期的研究越来越多地探讨视觉-语言模型(VLMs)中的神经元级解释,以识别对最终预测至关重要的神经元。然而,现有的神经元分析通常集中于单一任务,限制了跨任务神经元重要性的可比性。此外,排名策略往往孤立地对神经元进行评分,忽视了任务相关的信息通路如何塑造前馈网络(FFN)神经元的写入效应。这一忽视可能在多任务环境中加剧神经元的多义性,为识别和干预任务关键神经元引入噪声。在本研究中,我们提出了HONES(面向头部的神经元解释与引导),这是一个无梯度的框架,用于在多任务VLM中进行任务感知的神经元归因与引导。HONES根据与任务相关的注意力头对FFN神经元的因果写入贡献进行排名,并通过轻量级缩放进一步调节显著神经元。在四个多样化的多模态任务和两个流行的VLM上的实验表明,HONES在识别任务关键神经元方面优于现有方法,并在引导后提高了模型性能。我们的源代码已发布在:https://github.com/petergit1/HONES。
cs.CV / 209 / 2604.17949
ZSG-IAD: A Multimodal Framework for Zero-Shot Grounded Industrial Anomaly Detection
ZSG-IAD:一种用于零样本基础工业异常检测的多模态框架
Abstract
Deep learning-based industrial anomaly detectors often behave as black boxes, making it hard to justify decisions with physically meaningful defect evidence. We propose ZSG-IAD, a multimodal vision-language framework for zero-shot grounded industrial anomaly detection. Given RGB images, sensor images, and 3D point clouds, ZSG-IAD generates structured anomaly reports and pixel-level anomaly masks. ZSG-IAD introduces a language-guided two-hop grounding module: (1) anomaly-related sentences select evidence-like latent slots distilled from multimodal features, yielding coarse spatial support; (2) selected slots modulate feature maps via channel-spatial gating and a lightweight decoder to produce fine-grained masks. To improve reliability, we further apply Executable-Rule GRPO with verifiable rewards to promote structured outputs, anomaly-region consistency, and reasoning-conclusion coherence. Experiments across multiple industrial anomaly benchmarks show strong zero-shot performance and more transparent, physically grounded explanations than prior methods. We will release code and annotations to support future research on trustworthy industrial anomaly detection systems.
Chinese Translation
基于深度学习的工业异常检测器通常表现为黑箱,难以用物理上有意义的缺陷证据来证明其决策。我们提出了ZSG-IAD,一种用于零样本基础工业异常检测的多模态视觉-语言框架。给定RGB图像、传感器图像和3D点云,ZSG-IAD生成结构化的异常报告和像素级异常掩码。ZSG-IAD引入了一种语言引导的双跳基础模块:(1) 与异常相关的句子选择从多模态特征中提取的证据类潜在槽,产生粗略的空间支持;(2) 选定的槽通过通道-空间门控和轻量解码器调制特征图,以生成细粒度掩码。为了提高可靠性,我们进一步应用可执行规则GRPO(Executable-Rule GRPO)与可验证奖励,以促进结构化输出、异常区域一致性和推理-结论一致性。在多个工业异常基准测试中的实验显示出强大的零样本性能,并且比以往方法提供了更透明、物理基础的解释。我们将发布代码和注释,以支持未来对可信工业异常检测系统的研究。
cs.CV / 210 / 2604.17959
Chatting about Upper-Body Expressive Human Pose and Shape Estimation
关于上半身表现性人类姿态和形状估计的讨论
Abstract
Expressive Human Pose and Shape Estimation (EHPS) plays a crucial role in various AR/VR applications and has witnessed significant progress in recent years. However, current state-of-the-art methods still struggle with accurate parameter estimation for facial and hand regions and exhibit limited generalization to wild images. To address these challenges, we present CoEvoer, a novel one-stage synergistic cross-dependency transformer framework tailored for upper-body EHPS. CoEvoer enables explicit feature-level interaction across different body parts, allowing for mutual enhancement through contextual information exchange. Specifically, larger and more easily estimated regions such as the torso provide global semantics and positional priors to guide the estimation of finer, more complex regions like the face and hands. Conversely, the localized details captured in facial and hand regions help refine and calibrate adjacent body parts. To the best of our knowledge, CoEvoer is the first framework designed specifically for upper-body EHPS, with the goal of capturing the strong coupling and semantic dependencies among the face, hands, and torso through joint parameter regression. Extensive experiments demonstrate that CoEvoer achieves state-of-the-art performance on upper-body benchmarks and exhibits strong generalization capability even on unseen wild images.
Chinese Translation
表现性人类姿态和形状估计(EHPS)在各种增强现实/虚拟现实(AR/VR)应用中发挥着至关重要的作用,并在近年来取得了显著进展。然而,当前的最先进方法在面部和手部区域的参数估计上仍然存在困难,并且在处理野外图像时表现出有限的泛化能力。为了解决这些挑战,我们提出了CoEvoer,这是一种新颖的一阶段协同交叉依赖变换器框架,专门针对上半身EHPS。CoEvoer能够在不同身体部位之间实现显式的特征级交互,通过上下文信息交换实现相互增强。具体而言,像躯干这样较大且更易于估计的区域提供全局语义和位置先验,以指导面部和手部等更细致、更复杂区域的估计。相反,面部和手部区域捕获的局部细节有助于细化和校准相邻的身体部位。据我们所知,CoEvoer是第一个专门为上半身EHPS设计的框架,旨在通过联合参数回归捕捉面部、手部和躯干之间的强耦合和语义依赖关系。大量实验表明,CoEvoer在上半身基准测试中实现了最先进的性能,并在未见的野外图像上展现出强大的泛化能力。
cs.CV / 211 / 2604.17961
DifFoundMAD: Foundation Models meet Differential Morphing Attack Detection
DifFoundMAD:基础模型与差分变形攻击检测的结合
Abstract
In this work, we introduce DifFoundMAD, a parameter-efficient D-MAD framework that exploits the generalisation capabilities of vision foundation models (FM) to capture discrepancies between suspected morphs and live capture images. In contrast to conventional D-MAD systems that rely on face recognition embeddings or handcrafted feature differences, DifFoundMAD follows the standard differential paradigm while replacing the underlying representation space with embeddings extracted from FMs. By combining lightweight finetuning with class-balanced optimisation, the proposed method updates only a small subset of parameters while preserving the rich representational priors of the underlying FMs. Extensive cross-database evaluations on standard D-MAD benchmarks demonstrate that DifFoundMAD achieves consistent improvements over state-of-the-art systems, particularly at the strict security levels required in operational deployments such as border control: The error rates reported in the current state-of-the-art were reduced from 6.16% to 2.17% for high-security levels using DifFoundMAD.
Chinese Translation
在本研究中,我们介绍了DifFoundMAD,这是一种参数高效的D-MAD框架,利用视觉基础模型(FM)的泛化能力来捕捉可疑变形图像与实时捕捉图像之间的差异。与依赖于人脸识别嵌入或手工特征差异的传统D-MAD系统不同,DifFoundMAD遵循标准的差分范式,同时用从基础模型中提取的嵌入替代底层表示空间。通过结合轻量级微调与类别平衡优化,所提出的方法仅更新少量参数,同时保留底层基础模型丰富的表示先验。在标准D-MAD基准上的广泛跨数据库评估表明,DifFoundMAD在各类最先进系统中实现了一致的性能提升,特别是在边境控制等操作部署中所需的严格安全级别下:使用DifFoundMAD时,当前最先进技术报告的高安全级别错误率从6.16%降低至2.17%。
cs.CV / 212 / 2604.17965
MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene
MU-GeNeRF:面向干扰物的多视角不确定性引导可泛化神经辐射场
Abstract
Generalizable Neural Radiance Fields (GeNeRFs) enable high-quality scene reconstruction from sparse views and can generalize to unseen scenes. However, in real-world settings, transient distractors break cross-view structural consistency, corrupting supervision and degrading reconstruction quality. Existing distractor-free NeRF methods rely on per-scene optimization and estimate uncertainty from per-view reconstruction errors, which are not reliable for GeNeRFs and often misjudge inconsistent static structures as distractors. To this end, we propose MU-GeNeRF, a Multi-view Uncertainty-guided distractor-aware GeNeRF framework designed to alleviate GeNeRF's robust modeling challenges in the presence of transient distractions. We decompose distractor awareness into two complementary uncertainty components: Source-view Uncertainty, which captures structural discrepancies across source views caused by viewpoint changes or dynamic factors; and Target-view Uncertainty, which detects observation anomalies in the target image induced by transient distractors. These two uncertainties address distinct error sources and are combined through a heteroscedastic reconstruction loss, which guides the model to adaptively modulate supervision, enabling more robust distractor suppression and geometric modeling. Extensive experiments show that our method not only surpasses existing GeNeRFs but also achieves performance comparable to scene-specific distractor-free NeRFs.
Chinese Translation
可泛化神经辐射场(GeNeRFs)能够从稀疏视角中实现高质量场景重建,并且可以推广到未见过的场景。然而,在现实世界中,瞬态干扰物会破坏跨视角的结构一致性,从而损害监督信号并降低重建质量。现有的无干扰NeRF方法依赖于逐场景优化,并通过每视角重建误差估计不确定性,这对于GeNeRFs并不可靠,且常常将不一致的静态结构误判为干扰物。为此,我们提出了MU-GeNeRF,一个面向干扰物的多视角不确定性引导GeNeRF框架,旨在缓解GeNeRF在瞬态干扰存在下的稳健建模挑战。我们将干扰物意识分解为两个互补的不确定性成分:源视角不确定性(Source-view Uncertainty),捕捉由于视点变化或动态因素引起的源视角间的结构差异;以及目标视角不确定性(Target-view Uncertainty),检测由瞬态干扰物引起的目标图像中的观察异常。这两种不确定性解决了不同的误差来源,并通过异方差重建损失相结合,引导模型自适应调节监督,从而实现更稳健的干扰物抑制和几何建模。大量实验表明,我们的方法不仅超越了现有的GeNeRFs,还达到了与特定场景的无干扰NeRFs相当的性能。
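The abstract names a heteroscedastic reconstruction loss but does not give its formula. The following is a minimal sketch, assuming the standard Gaussian heteroscedastic negative log-likelihood and assuming the source-view and target-view uncertainties are combined by adding their variances. The function name and both choices are illustrative, not taken from the paper.

```python
import math

def heteroscedastic_loss(pred, target, sigma_src, sigma_tgt, eps=1e-6):
    """Mean per-pixel Gaussian NLL with a combined variance.

    pred, target: lists of pixel intensities.
    sigma_src, sigma_tgt: per-pixel source-view and target-view
    uncertainties (standard deviations). Adding their variances is an
    assumption made for this sketch.
    """
    total = 0.0
    for p, t, s_src, s_tgt in zip(pred, target, sigma_src, sigma_tgt):
        var = s_src ** 2 + s_tgt ** 2 + eps
        # The residual is down-weighted where uncertainty is high; the log
        # term penalises inflating the uncertainty everywhere.
        total += (p - t) ** 2 / (2.0 * var) + 0.5 * math.log(var)
    return total / len(pred)
```

With this form, a pixel covered by a transient distractor can carry a large residual at low cost if its predicted uncertainty is high, which is one way supervision could be "adaptively modulated".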
cs.CV / 213 / 2604.17969
E3VS-Bench: A Benchmark for Viewpoint-Dependent Active Perception in 3D Gaussian Splatting Scenes
E3VS-Bench:一个用于3D高斯点云场景中视角依赖主动感知的基准测试
Abstract
Visual search in 3D environments requires embodied agents to actively explore their surroundings and acquire task-relevant evidence. However, existing visual search and embodied AI benchmarks, including EQA, typically rely on static observations or constrained egocentric motion, and thus do not explicitly evaluate fine-grained viewpoint-dependent phenomena that arise under unrestricted 5-DoF viewpoint control in real-world 3D environments, such as visibility changes caused by vertical viewpoint shifts, revealing contents inside containers, and disambiguating object attributes that are only observable from specific angles. To address this limitation, we introduce E3VS-Bench, a benchmark for embodied 3D visual search where agents must control their viewpoints in 5-DoF to gather viewpoint-dependent evidence for question answering. E3VS-Bench consists of 99 high-fidelity 3D scenes reconstructed using 3D Gaussian Splatting and 2,014 question-driven episodes. 3D Gaussian Splatting enables photorealistic free-viewpoint rendering that preserves fine-grained visual details (e.g., small text and subtle attributes) often degraded in mesh-based simulators, thereby allowing the construction of questions that cannot be answered from a single view and instead require active inspection across viewpoints in 5-DoF. We evaluate multiple state-of-the-art VLMs and compare their performance with humans. Despite strong 2D reasoning ability, all models exhibit a substantial gap from humans, highlighting limitations in active perception and coherent viewpoint planning specifically under full 5-DoF viewpoint changes.
Chinese Translation
在3D环境中的视觉搜索要求具身智能体主动探索其周围环境并获取与任务相关的证据。然而,现有的视觉搜索和具身人工智能基准测试,包括EQA,通常依赖于静态观察或受限的自我中心运动,因此并未明确评估在现实世界3D环境中在不受限制的5自由度(5-DoF)视角控制下出现的细粒度视角依赖现象,例如由于垂直视角变化引起的可见性变化、揭示容器内部内容以及区分仅能从特定角度观察到的物体属性。为了解决这一局限性,我们引入了E3VS-Bench,一个用于具身3D视觉搜索的基准测试,其中智能体必须在5-DoF中控制其视角,以收集视角依赖的证据以进行问答。E3VS-Bench由99个高保真3D场景构成,这些场景是使用3D高斯点云重建的,并包含2,014个以问题驱动的情节。3D高斯点云技术使得能够进行光线真实的自由视角渲染,保留了在基于网格的模拟器中常常退化的细粒度视觉细节(例如,小文本和微妙属性),从而允许构建无法从单一视角回答的问题,而是需要在5-DoF中跨视角进行主动检查。我们评估了多种最先进的视觉语言模型(VLM),并将它们的表现与人类进行比较。尽管模型在2D推理能力上表现强劲,但所有模型与人类之间存在显著差距,突显了在完全5-DoF视角变化下主动感知和连贯视角规划的局限性。
cs.CV / 214 / 2604.17971
Identifying Ethical Biases in Action Recognition Models
识别动作识别模型中的伦理偏见
Abstract
Human Action Recognition (HAR) models are increasingly deployed in high-stakes environments, yet their fairness across different human appearances has not been analyzed. We introduce a framework for auditing bias in HAR models using synthetic video data, generated with full control over visual identity attributes such as skin color. Unlike prior work that focuses on static images or pose estimation, our approach preserves temporal consistency, allowing us to isolate and test how changes to a single attribute affect model predictions. Through controlled interventions using the BEDLAM simulation platform, we test whether some popular HAR models exhibit statistically significant biases with respect to skin color even when the motion remains identical. Our results highlight how models may encode unwanted visual associations, and we provide evidence of systematic errors across groups. This work contributes a framework for auditing HAR models and supports the development of more transparent, accountable systems in light of upcoming regulatory standards.
Chinese Translation
人类动作识别(HAR)模型越来越多地应用于高风险环境,但其在不同人类外观下的公平性尚未得到分析。我们提出了一种使用合成视频数据审计HAR模型偏见的框架,该数据在视觉身份属性(如肤色)方面具有完全的控制。与以往关注静态图像或姿态估计的研究不同,我们的方法保持了时间一致性,使我们能够隔离并测试单一属性的变化如何影响模型预测。通过使用BEDLAM仿真平台进行的控制干预,我们展示了一些流行的HAR模型在肤色上是否表现出统计显著的偏见,即使运动保持不变。我们的结果突显了模型可能编码不必要的视觉关联,并提供了跨群体系统性错误的证据。这项工作为审计HAR模型提供了框架,并支持在即将到来的监管标准下开发更透明、可问责的系统。
cs.CV / 215 / 2604.17982
Mitigating Multimodal Hallucination via Phase-wise Self-reward
通过阶段性自奖励减轻多模态幻觉
Abstract
Large Vision-Language Models (LVLMs) still struggle with vision hallucination, where generated responses are inconsistent with the visual input. Existing methods either rely on large-scale annotated data for fine-tuning, which incurs massive computational overhead, or employ static post-hoc strategies that overlook the dynamic nature of hallucination emergence. To address these limitations, we introduce a new self-rewarding framework, enabling dynamic hallucination mitigation at inference time without external supervision. On the empirical side, we reveal that visual hallucination exhibits phase-wise dynamic patterns, peaking at the onset of each semantic phase. Drawing on these insights, we propose PSRD (Phase-wise Self-Reward Decoding) for online hallucination correction guided by phase-wise self-reward signals. To reduce the cost of repeated self-evaluation during decoding, we distill the hallucination guidance signal from LVLMs into a lightweight reward model. The reward model subsequently provides on-the-fly guidance for targeted intervention during the decoding process, enabling precise hallucination suppression. The proposed PSRD significantly reduces the hallucination rate of LLaVA-1.5-7B by 50.0% and consistently outperforms existing post-hoc methods across five hallucination evaluation benchmarks for four LVLMs. Further analysis confirms that PSRD effectively mitigates hallucination propagation and achieves a highly controllable trade-off between strong performance and inference efficiency.
Chinese Translation
大型视觉语言模型(LVLMs)仍然面临视觉幻觉的问题,即生成的响应与视觉输入不一致。现有的方法要么依赖于大规模标注数据进行微调,这会产生巨大的计算开销,要么采用静态事后策略,忽视了幻觉出现的动态特性。为了解决这些问题,我们引入了一种新的自奖励框架,使得在推理时能够动态减轻幻觉,而无需外部监督。在实证方面,我们揭示了视觉幻觉表现出阶段性动态模式,在每个语义阶段的开始时达到峰值。基于这些见解,我们提出了PSRD(阶段性自奖励解码),用于在线幻觉修正,受阶段性自奖励信号的指导。为了减少在解码过程中重复自我评估的成本,我们将来自LVLMs的幻觉指导信号提炼为轻量级奖励模型。该奖励模型随后为解码过程中的针对性干预提供实时指导,从而实现精确的幻觉抑制。所提出的PSRD显著降低了LLaVA-1.5-7B的幻觉率,减少了50.0%,并在四个LVLM的五个幻觉评估基准上始终优于现有的事后方法。进一步分析确认,PSRD有效减轻了幻觉传播,并在强性能与推理效率之间实现了高度可控的权衡。
cs.CV / 216 / 2604.18001
Trustworthy Endoscopic Super-Resolution
可信赖的内窥镜超分辨率
Abstract
Super-resolution (SR) models are attracting growing interest for enhancing minimally invasive surgery and diagnostic videos under hardware constraints. However, valid concerns remain regarding the introduction of hallucinated structures and amplified noise, limiting their reliability in safety-critical settings. We propose a direct and practical framework to make SR systems more trustworthy by identifying where reconstructions are likely to fail. Our approach integrates a lightweight error-prediction network that operates on intermediate representations to estimate pixel-wise reconstruction error. The module is computationally efficient and low-latency, making it suitable for real-time deployment. We convert these predictions into operational failure decisions by constructing Conformal Failure Masks (CFM), which localize regions where the SR output should not be trusted. Built on conformal risk control principles, our method provides theoretical guarantees for controlling both the tolerated error limit and the miscoverage in detected failures. We evaluate our approach on image and video SR, demonstrating its effectiveness in detecting unreliable reconstructions in endoscopic and robotic surgery settings. To our knowledge, this is the first study to provide a model-agnostic, theoretically grounded approach to improving the safety of real-time endoscopic image SR.
Chinese Translation
超分辨率(SR)模型在硬件限制下日益受到关注,旨在增强微创手术和诊断视频。然而,关于引入虚幻结构和放大噪声的有效担忧依然存在,这限制了它们在安全关键环境中的可靠性。我们提出了一种直接且实用的框架,以通过识别重建可能失败的区域来提高SR系统的可信赖性。我们的方法集成了一个轻量级的误差预测网络,该网络在中间表示上运行,以估计逐像素的重建误差。该模块计算效率高且延迟低,适合实时部署。我们通过构建符合性失效掩码(Conformal Failure Masks, CFM)将这些预测转化为操作失效决策,从而定位SR输出不应被信任的区域。基于符合性风险控制原则,我们的方法为控制容忍的误差限度和检测到的失效中的误覆盖提供了理论保证。我们在图像和视频SR上评估了我们的方法,证明其在内窥镜和机器人手术环境中检测不可靠重建的有效性。据我们所知,这是首个提供模型无关、理论基础的方法以提高实时内窥镜图像SR安全性的研究。
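The abstract does not spell out how Conformal Failure Masks are calibrated. As a rough illustration only, the sketch below uses plain split conformal prediction, a simpler relative of conformal risk control: it calibrates an additive offset on held-out (predicted error, true error) pairs, then flags pixels whose calibrated error bound exceeds a tolerance. The function names, `alpha`, and `tolerance` are assumptions for this example, not the paper's procedure.

```python
import math

def conformal_offset(pred_err, true_err, alpha=0.1):
    """Split-conformal offset q such that true <= pred + q holds with
    probability at least 1 - alpha on exchangeable data."""
    scores = sorted(t - p for p, t in zip(pred_err, true_err))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the quantile
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return scores[k - 1]

def failure_mask(pred_err, q, tolerance):
    """Flag pixels whose calibrated error bound exceeds the tolerance."""
    return [p + q > tolerance for p in pred_err]
```

For example, with calibration scores 1, 2, 3, 4 and `alpha=0.2`, the offset is the largest score, so a pixel with predicted error 2 is flagged under a tolerance of 5 while a pixel with predicted error 0 is not.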
cs.CV / 217 / 2604.18019
Multi-View Hierarchical Graph Neural Network for Sketch-Based 3D Shape Retrieval
基于草图的三维形状检索的多视图层次图神经网络
Abstract
Sketch-based 3D shape retrieval (SBSR) aims to retrieve 3D shapes that are consistent with the category of the input hand-drawn sketch. The core challenge of this task lies in two aspects: existing methods typically employ simplified aggregation strategies for independently encoded 3D multi-view features, which ignore the geometric relationships between views and multi-level details, resulting in weak 3D representation. Simultaneously, traditional SBSR methods are constrained by visible category limitations, leading to poor performance in zero-shot scenarios. To address these challenges, we propose Multi-View Hierarchical Graph Neural Network (MV-HGNN), a novel framework for SBSR. Specifically, we construct a view-level graph and capture adjacent geometric dependencies and cross-view message passing via local graph convolution and global attention. A view selector is further introduced to perform hierarchical graph coarsening, enabling a progressively larger receptive field for graph convolution and mitigating the interference of redundant views, which leads to a more discriminative hierarchical 3D representation. To enable category-agnostic alignment and mitigate overfitting to seen classes, we leverage CLIP text embeddings as semantic prototypes and project both sketch and 3D features into a shared semantic space. We use a two-stage training strategy for category-level retrieval and a one-stage strategy for zero-shot retrieval under the same model architecture. Under both category-level and zero-shot settings, extensive experiments on two public benchmarks demonstrate that MV-HGNN outperforms state-of-the-art methods.
Chinese Translation
基于草图的三维形状检索(SBSR)旨在检索与输入手绘草图类别一致的三维形状。该任务的核心挑战在于两个方面:现有方法通常采用简化的聚合策略来处理独立编码的三维多视图特征,忽视了视图之间的几何关系和多层次细节,从而导致三维表示能力较弱。同时,传统的SBSR方法受到可见类别限制的约束,在零样本场景下表现不佳。为了解决这些挑战,我们提出了多视图层次图神经网络(MV-HGNN),这是一个用于SBSR的新框架。具体而言,我们构建了一个视图级图,通过局部图卷积和全局注意力捕捉相邻的几何依赖关系和跨视图信息传递。进一步引入了视图选择器以执行层次图粗化,从而为图卷积提供逐渐增大的感受野,并减轻冗余视图的干扰,进而实现更具辨别性的层次三维表示。为了实现类别无关的对齐并减轻对已见类别的过拟合,我们利用CLIP文本嵌入作为语义原型,并将草图和三维特征投影到共享的语义空间。我们采用两阶段训练策略进行类别级检索,并在相同模型架构下采用单阶段策略进行零样本检索。在类别级和零样本设置下,在两个公共基准上的大量实验表明,MV-HGNN超越了最先进的方法。
cs.CV / 218 / 2604.18032
CFSR: Geometry-Conditioned Shadow Removal via Physical Disentanglement
CFSR:基于几何条件的物理解耦阴影去除
Abstract
Traditional shadow removal networks often treat image restoration as an unconstrained mapping, lacking the physical interpretability required to balance localized texture recovery with global illumination consistency. To address this, we propose CFSR, a multi-modal prior-driven framework that reframes shadow removal as a physics-constrained restoration process. By seamlessly integrating 3D geometric cues with large-scale foundation model semantics, CFSR effectively bridges the 2D-3D domain gap. Specifically, we first map observations into a custom HVI color space to suppress shadow-induced noise and robustly fuse RGB data with estimated depth priors. At its core, our Geometric & Semantic Dual Explicit Guided Attention mechanism utilizes DINO features and 3D surface normals to directly modulate the attention affinity matrix, structurally enforcing physical lighting constraints. To recover severely degraded regions, we inject holistic priors via a frozen CLIP encoder. Finally, our Frequency Collaborative Reconstruction Module (FCRM) achieves an optimal synthesis by decoupling the decoding process. Conditioned on geometric priors, FCRM seamlessly harmonizes the reconstruction of sharp high-frequency occlusion boundaries with the restoration of low-frequency global illumination. Extensive experiments demonstrate that CFSR achieves state-of-the-art performance across multiple challenging benchmarks.
Chinese Translation
传统的阴影去除网络通常将图像恢复视为一种无约束的映射,缺乏平衡局部纹理恢复与全局光照一致性所需的物理可解释性。为了解决这一问题,我们提出了CFSR,一个多模态先验驱动的框架,将阴影去除重新构架为一个物理约束的恢复过程。通过无缝整合三维几何线索与大规模基础模型语义,CFSR有效地弥合了二维与三维领域之间的差距。具体而言,我们首先将观测映射到自定义的HVI颜色空间,以抑制阴影引起的噪声,并稳健地将RGB数据与估计的深度先验融合。在其核心,我们的几何与语义双显式引导注意机制利用DINO特征和三维表面法线直接调节注意亲和矩阵,结构性地强制施加物理光照约束。为了恢复严重退化的区域,我们通过冻结的CLIP编码器注入整体先验。最后,我们的频率协同重建模块(FCRM)通过解耦解码过程实现了最佳合成。在几何先验的条件下,FCRM无缝协调了锐利高频遮挡边界的重建与低频全局光照的恢复。大量实验表明,CFSR在多个具有挑战性的基准测试中实现了最先进的性能。
cs.CV / 219 / 2604.18037
HABIT: Chrono-Synergia Robust Progressive Learning Framework for Composed Image Retrieval
HABIT:用于复合图像检索的时间协同稳健渐进学习框架
Abstract
Composed Image Retrieval (CIR) is a flexible image retrieval paradigm that enables users to accurately locate the target image through a multimodal query composed of a reference image and modification text. Although this task has demonstrated promising applications in personalized search and recommendation systems, it encounters a severe challenge in practical scenarios known as the Noise Triplet Correspondence (NTC) problem. This issue primarily arises from the high cost and subjectivity involved in annotating triplet data. To address this problem, we identify two central challenges: the precise estimation of composed semantic discrepancy and the insufficient progressive adaptation to modification discrepancy. To tackle these challenges, we propose a cHrono-synergiA roBust progressIve learning framework for composed image reTrieval (HABIT), which consists of two core modules. First, the Mutual Knowledge Estimation Module quantifies sample cleanliness by calculating the Transition Rate of mutual information between the composed feature and the target image, thereby effectively identifying clean samples that align with the intended modification semantics. Second, the Dual-consistency Progressive Learning Module introduces a collaborative mechanism between the historical and current models, simulating human habit formation to retain good habits and calibrate bad habits, ultimately enabling robust learning under the presence of NTC. Extensive experiments conducted on two standard CIR datasets demonstrate that HABIT significantly outperforms most methods under various noise ratios, exhibiting superior robustness and retrieval performance. Codes are available at https://github.com/Lee-zixu/HABIT
Chinese Translation
复合图像检索(CIR)是一种灵活的图像检索范式,使用户能够通过由参考图像和修改文本组成的多模态查询准确定位目标图像。尽管该任务在个性化搜索和推荐系统中展现了良好的应用前景,但在实际场景中面临着严重的挑战,即噪声三元组对应(NTC)问题。该问题主要源于标注三元组数据所涉及的高成本和主观性。为了解决这一问题,我们识别出两个核心挑战:复合语义差异的精确估计和对修改差异的不足渐进适应。为应对这些挑战,我们提出了一种用于复合图像检索的时间协同稳健渐进学习框架(HABIT),该框架由两个核心模块组成。首先,互知识估计模块通过计算复合特征与目标图像之间的互信息转移率来量化样本的干净程度,从而有效识别与预期修改语义一致的干净样本。其次,双一致性渐进学习模块引入了历史模型与当前模型之间的协作机制,模拟人类习惯形成过程,以保留良好习惯并校准不良习惯,最终在存在NTC的情况下实现稳健学习。在两个标准CIR数据集上进行的广泛实验表明,HABIT在各种噪声比下显著优于大多数方法,展现出卓越的鲁棒性和检索性能。代码可在 https://github.com/Lee-zixu/HABIT 获取。
cs.CV / 220 / 2604.18047
GS-STVSR: Ultra-Efficient Continuous Spatio-Temporal Video Super-Resolution via 2D Gaussian Splatting
GS-STVSR:通过2D高斯溅射实现超高效连续时空视频超分辨率
Abstract
Continuous Spatio-Temporal Video Super-Resolution (C-STVSR) aims to simultaneously enhance the spatial resolution and frame rate of videos by arbitrary scale factors, offering greater flexibility than fixed-scale methods that are constrained by predefined upsampling ratios. In recent years, methods based on Implicit Neural Representations (INR) have made significant progress in C-STVSR by learning continuous mappings from spatio-temporal coordinates to pixel values. However, these methods fundamentally rely on dense pixel-wise grid queries, causing computational cost to scale linearly with the number of interpolated frames and severely limiting inference efficiency. We propose GS-STVSR, an ultra-efficient C-STVSR framework based on 2D Gaussian Splatting (2D-GS) that drives the spatiotemporal evolution of Gaussian kernels through continuous motion modeling, bypassing dense grid queries entirely. We exploit the strong temporal stability of covariance parameters for lightweight intermediate fitting, design an optical flow-guided motion module to derive Gaussian position and color at arbitrary time steps, introduce a Covariance resampling alignment module to prevent covariance drift, and propose an adaptive offset window for large-scale motion. Extensive experiments on Vid4, GoPro, and Adobe240 show that GS-STVSR achieves state-of-the-art quality across all benchmarks. Moreover, its inference time remains nearly constant at conventional temporal scales (×2 to ×8) and delivers over 3× speedup at the extreme scale ×32, demonstrating strong practical applicability.
Chinese Translation
连续时空视频超分辨率(C-STVSR)旨在通过任意比例因子同时提高视频的空间分辨率和帧率,提供比受预定义上采样比限制的固定比例方法更大的灵活性。近年来,基于隐式神经表示(INR)的方法在C-STVSR方面取得了显著进展,通过学习从时空坐标到像素值的连续映射。然而,这些方法在根本上依赖于密集的逐像素网格查询,导致计算成本随着插值帧数线性增长,严重限制了推理效率。我们提出了GS-STVSR,这是一个基于2D高斯溅射(2D-GS)的超高效C-STVSR框架,通过连续运动建模驱动高斯核的时空演变,完全绕过密集网格查询。我们利用协方差参数的强时间稳定性进行轻量级中间拟合,设计了一个光流引导的运动模块,以在任意时间步推导高斯位置和颜色,引入了一个协方差重采样对齐模块以防止协方差漂移,并提出了一个用于大规模运动的自适应偏移窗口。在Vid4、GoPro和Adobe240上的大量实验表明,GS-STVSR在所有基准测试中实现了最先进的质量。此外,其推理时间在常规时间尺度(×2至×8)下几乎保持不变,并在极端尺度×32下实现了超过3倍的加速,展示了强大的实际应用性。
cs.CV / 221 / 2604.18051
INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval
INTENT:面向鲁棒组合图像检索的不变性与辨别意识噪声减轻
Abstract
Composed Image Retrieval (CIR) is a challenging image retrieval paradigm that retrieves target images based on multimodal queries consisting of reference images and modification texts. Although substantial progress has been made in recent years, existing methods assume that all samples are correctly matched. However, in real-world scenarios, due to high triplet annotation costs, CIR datasets inevitably contain annotation errors, resulting in incorrectly matched triplets. To address this issue, the problem of Noisy Triplet Correspondence (NTC) has attracted growing attention. We argue that noise in CIR can be categorized into two types: cross-modal correspondence noise and modality-inherent noise. The former arises from mismatches across modalities, whereas the latter originates from intra-modal background interference or visual factors irrelevant to the coarse-grained modification annotations. However, modality-inherent noise is often overlooked, and research on cross-modal correspondence noise remains nascent. To tackle the above issues, we propose the Invariance and discrimiNaTion-awarE Noise neTwork (INTENT), comprising two components: Visual Invariant Composition and Bi-Objective Discriminative Learning, specifically designed to handle the two types of noise. The former applies causal intervention on the visual side via Fast Fourier Transform (FFT) to generate intervened composed features, enforcing visual invariance and enabling the model to ignore modality-inherent noise during composition. The latter adopts collaborative optimization with both positive and negative samples, and constructs a scalable decision boundary that dynamically adjusts decisions based on the loyalty degree, enabling robust correspondence discrimination. Extensive experiments on two widely used benchmark datasets demonstrate the superiority and robustness of INTENT.
Chinese Translation
组合图像检索(CIR)是一种具有挑战性的图像检索范式,能够基于由参考图像和修改文本组成的多模态查询来检索目标图像。尽管近年来取得了显著进展,但现有方法假设所有样本均正确匹配。然而,在现实场景中,由于三元组标注成本高,CIR 数据集不可避免地包含标注错误,导致三元组匹配不正确。为了解决这个问题,噪声三元组对应(NTC)问题引起了越来越多的关注。我们认为,CIR 中的噪声可以分为两种类型:跨模态对应噪声和模态固有噪声。前者源于跨模态的不匹配,而后者则源于模态内部的背景干扰或与粗粒度修改标注无关的视觉因素。然而,模态固有噪声常常被忽视,而对跨模态对应噪声的研究仍处于起步阶段。为了解决上述问题,我们提出了不变性与辨别意识噪声网络(INTENT),该网络由两个部分组成:视觉不变组合和双目标辨别学习,专门设计用于处理这两方面的噪声。前者通过快速傅里叶变换(FFT)在视觉侧施加因果干预,以生成干预后的组合特征,强制视觉不变性,并使模型在组合过程中忽略模态固有噪声。后者采用正负样本的协同优化,构建一个可扩展的决策边界,根据忠诚度动态调整决策,从而实现鲁棒的对应辨别。在两个广泛使用的基准数据集上的大量实验证明了 INTENT 的优越性和鲁棒性。
cs.CV / 222 / 2604.18075
Enhancing Continual Learning of Vision-Language Models via Dynamic Prefix Weighting
通过动态前缀加权增强视觉-语言模型的持续学习
Abstract
We investigate recently introduced domain-class incremental learning scenarios for vision-language models (VLMs). Recent works address this challenge using parameter-efficient methods, such as prefix-tuning or adapters, which facilitate model adaptation to downstream tasks by incorporating task-specific information into input tokens through additive vectors. However, previous approaches often normalize the weights of these vectors, disregarding the fact that different input tokens require different degrees of adjustment. To overcome this issue, we propose Dynamic Prefix Weighting (DPW), a framework that dynamically assigns weights to prefixes, complemented by adapters. DPW consists of 1) a gating module that adjusts the weights of each prefix based on the importance of the corresponding input token, and 2) a weighting mechanism that derives adapter output weights as a residual of prefix-tuning weights, ensuring that adapters are utilized only when necessary. Experimental results demonstrate that our method achieves state-of-the-art performance in domain-class incremental learning scenarios for VLMs. The code is available at: https://github.com/YonseiML/dpw.
Chinese Translation
我们研究了最近提出的视觉-语言模型(VLMs)的领域类增量学习场景。近期的研究通过参数高效的方法解决了这一挑战,例如前缀调优(prefix-tuning)或适配器(adapters),这些方法通过将任务特定信息以加性向量的形式融入输入标记,促进模型对下游任务的适应。然而,以往的方法往往对这些向量的权重进行归一化,忽视了不同输入标记需要不同程度的调整这一事实。为了解决这一问题,我们提出了动态前缀加权(Dynamic Prefix Weighting, DPW)框架,该框架动态地为前缀分配权重,并辅以适配器。DPW包括1)一个门控模块,根据相应输入标记的重要性调整每个前缀的权重,以及2)一个加权机制,将适配器输出权重作为前缀调优权重的残差,确保仅在必要时使用适配器。实验结果表明,我们的方法在视觉-语言模型的领域类增量学习场景中达到了最先进的性能。代码可在以下链接获取:https://github.com/YonseiML/dpw。
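The two DPW components described above can be illustrated for a single token. This is a minimal sketch, assuming a linear-plus-sigmoid gate and assuming that "residual of prefix-tuning weights" means the adapter receives weight `1 - g`. The function `dpw_token_update` and its parameters are hypothetical and are not taken from the released code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpw_token_update(token, prefix_delta, adapter_delta, gate_w, gate_b=0.0):
    """Blend prefix and adapter updates for one token.

    token, prefix_delta, adapter_delta, gate_w: equal-length float lists.
    The linear-sigmoid gate and the (1 - g) adapter weight are assumptions.
    """
    # Token-dependent gate: how strongly the prefix should adjust this token.
    g = sigmoid(sum(w * x for w, x in zip(gate_w, token)) + gate_b)
    # The adapter only contributes the share the prefix leaves unused.
    out = [x + g * p + (1.0 - g) * a
           for x, p, a in zip(token, prefix_delta, adapter_delta)]
    return out, g
```

When the gate saturates near 1, the adapter term vanishes, which matches the stated goal that adapters are used only when necessary.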
cs.CV / 223 / 2604.18076
Class-specific diffusion models improve military object detection in a low-data domain
特定类别扩散模型在低数据域中改善军事目标检测
Abstract
Diffusion-based image synthesis has emerged as a promising source of synthetic training data for AI-based object detection and classification. In this work, we investigate whether images generated with diffusion can improve military vehicle detection under low-data conditions. We fine-tuned the text-to-image diffusion model FLUX.1 [dev] using LoRA with only 8 or 24 real images per class across 15 vehicle categories, resulting in class-specific diffusion models, which were used to generate new samples from automatically generated text prompts. The same real images were used to fine-tune the RF-DETR detector for a 15-class object detection task. Synthetic datasets generated by the diffusion models were then used to further improve detector performance. Importantly, no additional real data was required, as the generative models leveraged the same limited training samples. FLUX-generated images improved detection performance, particularly in the low-data regime (up to +8.0% mAP$_{50}$ with 8 real samples). To address the limited geometric control of text prompt-based diffusion, we additionally generated structurally guided synthetic data using ControlNet with Canny edge-map conditioning, yielding a FLUX-ControlNet (FLUX-CN) dataset with explicit control over viewpoint and pose. Structural guidance further enhanced performance when data is scarce (+4.1% mAP$_{50}$ with 8 real samples), but no additional benefit was observed when more real data is available. This study demonstrates that object-specific diffusion models are effective for improving military object detection in a low-data domain, and that structural guidance is most beneficial when real data is highly limited. These results highlight generative image data as an alternative to traditional simulation pipelines for the training of military AI systems.
Chinese Translation
基于扩散的图像合成已成为人工智能目标检测和分类的合成训练数据的有前景的来源。在本研究中,我们探讨了在低数据条件下,使用扩散生成的图像是否能够改善军事车辆检测。我们使用仅有8或24张真实图像的每个类别对文本到图像的扩散模型FLUX.1 [dev]进行了微调,涵盖15个车辆类别,从而生成了特定类别的扩散模型,这些模型用于从自动生成的文本提示中生成新样本。相同的真实图像被用于微调RF-DETR检测器,以完成15类目标检测任务。随后,使用扩散模型生成的合成数据集进一步提高了检测器的性能。重要的是,不需要额外的真实数据,因为生成模型利用了相同的有限训练样本。FLUX生成的图像提高了检测性能,特别是在低数据环境下(使用8个真实样本时,mAP$_{50}$提高了8.0%)。为了应对基于文本提示的扩散在几何控制方面的局限性,我们还使用ControlNet和Canny边缘图条件生成了结构引导的合成数据,得到了FLUX-ControlNet (FLUX-CN) 数据集,该数据集在视角和姿态上具有明确的控制。当数据稀缺时,结构引导进一步提高了性能(使用8个真实样本时,mAP$_{50}$提高了4.1%),但在更多真实数据可用时未观察到额外的好处。本研究表明,特定目标的扩散模型在低数据域中有效改善军事目标检测,并且结构引导在真实数据高度有限时最为有利。这些结果突显了生成图像数据作为军事人工智能系统训练的传统仿真管道的替代方案。
cs.CV / 224 / 2604.18088
Autonomous Unmanned Aircraft Systems for Enhanced Search and Rescue of Drowning Swimmers: Image-Based Localization and Mission Simulation
增强溺水游泳者搜索与救援的自主无人机系统:基于图像的定位与任务模拟
Abstract
Drowning is an omnipresent risk associated with any activity on or in the water, and rescuing a drowning person is particularly challenging because of the time pressure, making a short response time important. Further complicating water rescue are unsupervised and extensive swimming areas, precise localization of the target, and the transport of rescue personnel. Technical innovations can provide a remedy: We propose an Unmanned Aircraft System (UAS), also known as a drone-in-a-box system, consisting of a fleet of Unmanned Aerial Vehicles (UAVs) allocated to purpose-built hangars near swimming areas. In an emergency, the UAS can be deployed in addition to Standard Rescue Operation (SRO) equipment to locate the distressed person early by performing a fully automated Search and Rescue (S&R) operation and dropping a flotation device. In this paper, we address automatically locating distressed swimmers using the image-based object detection architecture You Only Look Once (YOLO). We present a dataset created for this application and outline the training process. We evaluate the performance of YOLO versions 3, 5, and 8 and architecture sizes (nano, extra-large) using Mean Average Precision (mAP) metrics mAP@0.5 and mAP@0.5:.95. Furthermore, we present two Discrete-Event Simulation (DES) approaches to simulate response times of SRO and UAS-based water rescue. This enables estimation of time savings relative to SRO when selecting the UAS configuration (type, number, and location of UAVs and hangars). Computational experiments for a test area in the Lusatian Lake District, Germany, show that UAS assistance shortens response time. Even a small UAS with two hangars, each containing one UAV, reduces response time by a factor of five compared to SRO.
Chinese Translation
溺水是与水上或水中活动相关的普遍风险,救援溺水者尤其具有挑战性,因为时间压力使得快速反应时间变得至关重要。水上救援的复杂性还体现在无人监管和广泛的游泳区域、目标的精确定位以及救援人员的运输上。技术创新可以提供解决方案:我们提出了一种无人机系统(Unmanned Aircraft System, UAS),也称为“箱中无人机”系统,由一支分配到靠近游泳区域的专用机库的无人机(Unmanned Aerial Vehicles, UAV)舰队组成。在紧急情况下,UAS可以与标准救援操作(Standard Rescue Operation, SRO)设备一起部署,通过执行完全自动化的搜索与救援(Search and Rescue, S&R)操作并投放浮标,尽早定位遇险者。本文讨论了使用基于图像的目标检测架构“你只看一次”(You Only Look Once, YOLO)自动定位遇险游泳者。我们展示了为此应用创建的数据集,并概述了训练过程。我们使用平均精度均值(Mean Average Precision, mAP)指标mAP@0.5和mAP@0.5:.95评估了YOLO版本3、5和8及其架构大小(纳米、超大)的性能。此外,我们提出了两种离散事件仿真(Discrete-Event Simulation, DES)方法,以模拟SRO和基于UAS的水上救援的响应时间。这使得在选择UAS配置(类型、数量和UAV及机库的位置)时能够估算相对于SRO的时间节省。在德国卢萨蒂湖区的测试区域进行的计算实验表明,UAS的辅助可以缩短响应时间。即使是一个小型UAS,配备两个机库,每个机库内有一架UAV,相比于SRO也能将响应时间缩短五倍。
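The mAP metrics used in this entry are built on Intersection over Union (IoU): at an IoU threshold of 0.5, a predicted box counts as a match for a ground-truth box only if their IoU is at least 0.5. The sketch below shows that building block for axis-aligned boxes given as `(x1, y1, x2, y2)`. The function names are illustrative, and the full mAP computation (ranking by confidence, precision-recall integration) is omitted.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at zero so disjoint boxes give IoU 0.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_match(pred_box, gt_box, threshold=0.5):
    """True if the prediction overlaps the ground truth enough to count."""
    return iou(pred_box, gt_box) >= threshold
```

Two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, giving an IoU of 1/7, which would not count as a match at the 0.5 threshold.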
cs.CV / 225 / 2604.18094
Decision-Aware Attention Propagation for Vision Transformer Explainability
面向决策的注意力传播用于视觉变换器的可解释性
Abstract
Vision Transformers (ViTs) have become a dominant architecture in computer vision, yet their prediction process remains difficult to interpret because information is propagated through complex interactions across layers and attention heads. Existing attention based explanation methods provide an intuitive way to trace information flow. However, they rely mainly on raw attention weights, which do not explicitly reflect the final decision and often lead to explanations with limited class discriminability. In contrast, gradient based localization methods are more effective at highlighting class specific evidence, but they do not fully exploit the hierarchical attention propagation mechanism of transformers. To address this limitation, we propose Decision-Aware Attention Propagation (DAP), an attribution method that injects decision-relevant priors into transformer attention propagation. By estimating token importance through gradient based localization and integrating it into layer wise attention rollout, the method captures both the structural flow of attention and the evidence most relevant to the final prediction. Consequently, DAP produces attribution maps that are more class sensitive, compact, and faithful than those generated by conventional attention based methods. Extensive experiments across Vision Transformer variants of different model scales show that DAP consistently outperforms existing baselines in both quantitative metrics and qualitative visualizations, indicating that decision aware propagation is an effective direction for improving ViT interpretability.
Chinese Translation
视觉变换器(Vision Transformers,ViTs)已成为计算机视觉中的主导架构,但其预测过程仍然难以解释,因为信息通过层和注意力头之间的复杂交互进行传播。现有的基于注意力的解释方法提供了一种直观的方式来追踪信息流。然而,它们主要依赖于原始的注意力权重,这些权重并未明确反映最终决策,通常导致解释的类别区分能力有限。相比之下,基于梯度的定位方法在突出类别特定证据方面更为有效,但它们并未充分利用变换器的层次注意力传播机制。为了解决这一局限性,我们提出了面向决策的注意力传播(Decision-Aware Attention Propagation,DAP),这是一种将与决策相关的先验知识注入变换器注意力传播的归因方法。通过基于梯度的定位估计标记的重要性,并将其整合到逐层注意力展开中,该方法捕捉了注意力的结构流动以及与最终预测最相关的证据。因此,DAP生成的归因图比传统基于注意力的方法更具类别敏感性、紧凑性和真实性。在不同模型规模的视觉变换器变体上进行的广泛实验表明,DAP在定量指标和定性可视化方面始终优于现有基准,表明面向决策的传播是提高ViT可解释性的有效方向。
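To make the idea of injecting token importance into attention rollout concrete, here is a minimal pure-Python sketch. It follows the usual rollout recipe (mix each head-averaged attention matrix with the identity for the residual path, renormalise rows, multiply across layers) and, as an assumption of this example, re-weights attention toward each token by a given importance score before normalising. How DAP actually combines the two is not specified in the abstract, and `weighted_rollout` is a hypothetical name.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def weighted_rollout(attentions, importance):
    """Attention rollout with per-token importance re-weighting.

    attentions: list of n x n head-averaged attention matrices, first layer first.
    importance: n non-negative scores, e.g. from a gradient-based method.
    """
    n = len(importance)
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in attentions:
        layer = []
        for i in range(n):
            # Scale attention to token j by its importance, add the residual
            # (identity) path, then renormalise the row to sum to 1.
            row = [0.5 * attn[i][j] * importance[j] + (0.5 if i == j else 0.0)
                   for j in range(n)]
            s = sum(row)
            layer.append([v / s for v in row])
        rollout = matmul(layer, rollout)
    return rollout
```

With all importance scores equal to 1 this reduces to ordinary attention rollout; setting a token's importance to 0 removes every attention path into it, leaving only its own residual connection.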
cs.CV / 226 / 2604.18107
Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models
具有延迟反馈的测试时扰动学习用于视觉-语言-动作模型
Abstract
Vision-Language-Action models (VLAs) achieve remarkable performance in sequential decision-making but remain fragile to subtle environmental shifts, such as small changes in object pose. We attribute this brittleness to trajectory overfitting, where VLAs over-attend to the spurious correlation between actions and entities, then reproduce memorized action patterns. We propose Perturbation learning with Delayed Feedback (PDF), a verifier-free test-time adaptation framework that improves decision performance without fine-tuning the base model. PDF mitigates the spurious correlation through uncertainty-based data augmentation and action voting, while an adaptive scheduler allocates augmentation budgets to balance performance and efficiency. To further improve stability, PDF learns a lightweight perturbation module that retrospectively adjusts action logits guided by delayed feedback, correcting overconfidence. Experiments on LIBERO (+7.4% success rate) and Atari (+10.3 human-normalized score) demonstrate consistent gains of PDF in task success over vanilla VLA and VLA with test-time adaptation, establishing a practical path toward reliable test-time adaptation in multimodal decision-making agents. The code is available at https://github.com/zhoujiahuan1991/CVPR2026-PDF.
Chinese Translation
视觉-语言-动作模型(VLA)在顺序决策中表现出色,但对环境的微小变化(如物体姿态的小变化)仍然脆弱。我们将这种脆弱性归因于轨迹过拟合,VLA过度关注动作与实体之间的虚假关联,从而重现记忆中的动作模式。我们提出了具有延迟反馈的扰动学习(Perturbation learning with Delayed Feedback, PDF),这是一种无验证器的测试时适应框架,能够在不微调基础模型的情况下提高决策性能。PDF通过基于不确定性的数据增强和动作投票来缓解虚假关联,同时自适应调度器分配增强预算,以平衡性能和效率。为了进一步提高稳定性,PDF学习了一个轻量级的扰动模块,该模块根据延迟反馈回顾性地调整动作logits,纠正过度自信的问题。在LIBERO(成功率提高7.4%)和Atari(人类标准分提高10.3)上的实验表明,PDF在任务成功率上相较于普通VLA和具有测试时适应的VLA具有一致的提升,为多模态决策代理在可靠的测试时适应方面建立了一条实用路径。代码可在此获取:https://github.com/zhoujiahuan1991/CVPR2026-PDF。
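The action-voting step can be pictured as a majority vote over the actions a policy predicts for several augmented copies of the same observation. The sketch below is only an illustration of that general idea: `policy` and `augmentations` are hypothetical callables, and the uncertainty-based selection of augmentations and the adaptive scheduler described in the abstract are not modelled.

```python
from collections import Counter

def vote_action(policy, observation, augmentations):
    """Majority vote over actions predicted on augmented observations.

    policy: callable mapping an observation to a discrete action.
    augmentations: non-empty list of callables, each returning a
    perturbed copy of the observation.
    Returns the winning action and the fraction of votes it received.
    """
    votes = Counter(policy(aug(observation)) for aug in augmentations)
    action, count = votes.most_common(1)[0]
    return action, count / len(augmentations)
```

The returned agreement fraction gives a rough confidence signal: low agreement across augmentations suggests the prediction depends on incidental details of the observation.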
cs.CV / 227 / 2604.18134
Can LLM-Generated Text Empower Surgical Vision-Language Pre-training?
LLM生成文本能否增强外科视觉-语言预训练?
Abstract
Recent advancements in self-supervised learning have led to powerful surgical vision encoders capable of spatiotemporal understanding. However, extending these visual foundations to multi-modal reasoning tasks is severely bottlenecked by the prohibitive cost of expert textual annotations. To overcome this scalability limitation, we introduce LIME, a large-scale multi-modal dataset derived from open-access surgical videos using human-free, Large Language Model (LLM)-generated narratives. While LIME offers immense scalability, unverified generated texts may contain errors, including hallucinations, that could potentially lead to catastrophically degraded pre-trained medical priors in standard contrastive pipelines. To mitigate this, we propose SurgLIME, a parameter-efficient Vision-Language Pre-training (VLP) framework designed to learn reliable cross-modal alignments using noisy narratives. SurgLIME preserves foundational medical priors using a LoRA-adapted dual-encoder architecture and introduces an automated confidence estimation mechanism that dynamically down-weights uncertain text during contrastive alignment. Evaluations on the AutoLaparo and Cholec80 benchmarks show that SurgLIME achieves competitive zero-shot cross-modal alignment while preserving the robust linear probing performance of the visual foundation model. Dataset, code, and models are publicly available at https://github.com/visurg-ai/SurgLIME.
Chinese Translation
最近,自监督学习的进展催生了强大的外科视觉编码器,能够实现时空理解。然而,将这些视觉基础扩展到多模态推理任务时,专家文本注释的高昂成本严重制约了其可扩展性。为克服这一可扩展性限制,我们引入了LIME,这是一个基于开放获取外科视频构建的大规模多模态数据集,其叙述文本完全由大型语言模型(LLM)生成,无需人工参与。尽管LIME提供了巨大的可扩展性,但未经验证的生成文本可能包含错误,包括幻觉,这可能导致在标准对比管道中预训练的医学先验严重退化。为此,我们提出了SurgLIME,一个参数高效的视觉-语言预训练(VLP)框架,旨在利用噪声叙述学习可靠的跨模态对齐。SurgLIME使用LoRA适配的双编码器架构保留基础医学先验,并引入了一种自动化的置信度估计机制,在对比对齐过程中动态降低不确定文本的权重。在AutoLaparo和Cholec80基准上的评估表明,SurgLIME在保持视觉基础模型的强大线性探测性能的同时,实现了具有竞争力的零样本跨模态对齐。数据集、代码和模型可在 https://github.com/visurg-ai/SurgLIME 上公开获取。
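The confidence-based down-weighting described in this abstract can be illustrated with a small sketch. This is not SurgLIME's actual loss: the function name `weighted_info_nce`, the temperature value, and the choice to normalise by the sum of confidences are all assumptions made for illustration. It shows a one-directional (image-to-text) InfoNCE loss in which each pair's term is scaled by a confidence score in [0, 1].

```python
import math

def weighted_info_nce(sim, conf, temperature=0.07):
    """Image-to-text InfoNCE with per-pair confidence weights (illustrative only).

    sim[i][j] is the similarity between image i and text j; the matched
    pair for image i is text i. conf[i] in [0, 1] scales that pair's loss.
    """
    total, weight_sum = 0.0, 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss_i = log_z - logits[i]  # cross-entropy against the matched text
        total += conf[i] * loss_i
        weight_sum += conf[i]
    return total / max(weight_sum, 1e-8)

# Pair 1 is "noisy": its image is more similar to text 0 than to its own text.
sim = [[0.9, 0.1], [0.8, 0.2]]
full = weighted_info_nce(sim, [1.0, 1.0])
down = weighted_info_nce(sim, [1.0, 0.1])
```

With these toy numbers, lowering the confidence of the mismatched pair reduces its influence on the batch loss, which is the qualitative behaviour the abstract describes.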
cs.CV / 228 / 2604.18135
Soft Label Pruning and Quantization for Large-Scale Dataset Distillation
大规模数据集蒸馏的软标签剪枝与量化
Abstract
Large-scale dataset distillation requires storing auxiliary soft labels that can be 30-40x larger on ImageNet-1K and 200x larger on ImageNet-21K than the condensed images, undermining the goal of dataset compression. We identify two fundamental issues necessitating such extensive labels: (1) insufficient image diversity, where high within-class similarity in synthetic images requires extensive augmentation, and (2) insufficient supervision diversity, where limited variety in supervisory signals during training leads to performance degradation at high compression rates. To address these challenges, we propose Label Pruning and Quantization for Large-scale Distillation (LPQLD). We enhance image diversity via class-wise batching and batch-normalization supervision during synthesis. For supervision diversity, we introduce Label Pruning with Dynamic Knowledge Reuse to improve label-per-augmentation diversity, and Label Quantization with Calibrated Student-Teacher Alignment to improve augmentation-per-image diversity. Our approach reduces soft label storage by 78x on ImageNet-1K and 500x on ImageNet-21K while improving accuracy by up to 7.2% and 2.8%, respectively. Extensive experiments validate the superiority of LPQLD across different network architectures and dataset distillation methods. Code is available at https://github.com/he-y/soft-label-pruning-quantization-for-dataset-distillation.
Chinese Translation
大规模数据集蒸馏需要存储辅助软标签,这些标签在 ImageNet-1K 上可能比浓缩图像大 30-40 倍,在 ImageNet-21K 上则可能大 200 倍,这削弱了数据集压缩的目标。我们识别出需要如此大量标签的两个基本问题:(1)图像多样性不足,在合成图像中类内相似性高需要大量增强;(2)监督多样性不足,训练过程中监督信号的有限多样性导致在高压缩率下性能下降。为了解决这些挑战,我们提出了大规模蒸馏的标签剪枝与量化(Label Pruning and Quantization for Large-scale Distillation,LPQLD)。我们通过类级批处理和合成过程中的批归一化监督来增强图像多样性。针对监督多样性,我们引入动态知识重用的标签剪枝以改善标签与增强之间的多样性,并采用经过校准的学生-教师对齐的标签量化来提升增强与图像之间的多样性。我们的方法在 ImageNet-1K 上将软标签存储减少了 78 倍,在 ImageNet-21K 上减少了 500 倍,同时准确率分别最高提升 7.2% 和 2.8%。大量实验验证了 LPQLD 在不同网络架构和数据集蒸馏方法中的优越性。代码可在 https://github.com/he-y/soft-label-pruning-quantization-for-dataset-distillation 获取。
cs.CV / 229 / 2604.18145
Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework
基于区域的3D医学成像报告生成:一个细粒度数据集和图增强框架
Abstract
Automated medical report generation for 3D PET/CT imaging is fundamentally challenged by the high-dimensional nature of volumetric data and a critical scarcity of annotated datasets, particularly for low-resource languages. Current black-box methods map whole volumes to reports, ignoring the clinical workflow of analyzing localized Regions of Interest (RoIs) to derive diagnostic conclusions. In this paper, we bridge this gap by introducing VietPET-RoI, the first large-scale 3D PET/CT dataset with fine-grained RoI annotation for a low-resource language, comprising 600 PET/CT samples and 1,960 manually annotated RoIs, paired with corresponding clinical reports. Furthermore, to demonstrate the utility of this dataset, we propose HiRRA, a novel framework that mimics the professional radiologist diagnostic workflow by employing graph-based relational modules to capture dependencies between RoI attributes. This approach shifts from global pattern matching toward localized clinical findings. Additionally, we introduce new clinical evaluation metrics, namely RoI Coverage and RoI Quality Index, that measure both RoI localization accuracy and attribute description fidelity using LLM-based extraction. Extensive evaluation demonstrates that our framework achieves SOTA performance, surpassing existing models by 19.7% in BLEU and 4.7% in ROUGE-L, while achieving a remarkable 45.8% improvement in clinical metrics, indicating enhanced clinical reliability and reduced hallucination. Our code and dataset are available on GitHub.
Chinese Translation
针对3D PET/CT成像的自动医学报告生成面临着体积数据的高维特性和标注数据集严重匮乏的根本挑战,尤其是在低资源语言环境中。目前的黑箱方法将整个体积映射到报告,忽视了分析局部感兴趣区域(RoI)以得出诊断结论的临床工作流程。本文通过引入VietPET-RoI,填补了这一空白,这是首个针对低资源语言的大规模3D PET/CT数据集,具有细粒度的RoI标注,包含600个PET/CT样本和1,960个手动标注的RoI,并配有相应的临床报告。此外,为了展示该数据集的实用性,我们提出了HiRRA,一个新颖的框架,通过采用基于图的关系模块来捕捉RoI属性之间的依赖关系,从而模拟专业放射科医生的诊断工作流程。这种方法从全局模式匹配转向局部临床发现。此外,我们引入了新的临床评估指标,即RoI覆盖率和RoI质量指数,分别测量RoI定位准确性和属性描述的保真度,使用基于大型语言模型(LLM)的提取。广泛的评估表明,我们的框架实现了SOTA(最先进技术)性能,在BLEU上超越现有模型19.7%,在ROUGE-L上超越4.7%,同时在临床指标上实现了显著的45.8%提升,表明临床可靠性增强和幻觉现象减少。我们的代码和数据集已在GitHub上公开。
cs.CV / 230 / 2604.18148
Attention-ResUNet for Automated Fetal Head Segmentation
用于自动化胎头分割的注意力残差网络(Attention-ResUNet)
Abstract
Automated fetal head segmentation in ultrasound images is critical for accurate biometric measurements in prenatal care. While existing deep learning approaches achieve reasonable performance, they struggle with the low contrast, noise, and complex anatomical boundaries inherent to ultrasound imaging. This paper presents Attention-ResUNet, a novel architecture that combines residual learning with multi-scale attention mechanisms to improve fetal head segmentation. Our approach integrates attention gates at four decoder levels to focus selectively on anatomically relevant regions while suppressing background noise, complemented by residual connections that facilitate gradient flow and feature reuse. Extensive evaluation on the HC18 Challenge dataset (n = 200) demonstrates that Attention-ResUNet achieves a mean Dice score of 99.30 +/- 0.14%, significantly outperforming five baseline architectures: ResUNet (99.26%), Attention U-Net (98.79%), Swin U-Net (98.60%), Standard U-Net (98.58%), and U-Net++ (97.46%). Statistical analysis confirms highly significant improvements (p < 0.001), with effect sizes ranging from 0.230 to 13.159 (Cohen's d). Saliency map analysis reveals that our architecture produces highly concentrated, anatomically consistent activation patterns, demonstrating enhanced interpretability, which is crucial for clinical deployment. The proposed method establishes new state-of-the-art performance for automated fetal head segmentation while maintaining computational efficiency, with 14.7M parameters and an inference cost of 45 GFLOPs. Code repository: https://github.com/Ammar-ss
Chinese Translation
在超声图像中,自动化胎头分割对于产前护理中的准确生物测量至关重要。尽管现有的深度学习方法已取得了合理的性能,但在低对比度、噪声和复杂解剖边界等超声成像固有问题上仍面临挑战。本文提出了注意力残差网络(Attention-ResUNet),这是一种新颖的架构,协同结合了残差学习与多尺度注意力机制,以实现增强的胎头分割。我们的方法在四个解码器级别集成了注意力门,以选择性地关注解剖相关区域,同时抑制背景噪声,并辅以残差连接以促进梯度流动和特征重用。在HC18挑战数据集(n = 200)上的广泛评估表明,注意力残差网络的平均Dice得分为99.30 +/- 0.14%,在类似架构中表现优越。它显著超越了五种基线架构,包括残差网络(ResUNet)(99.26%)、注意力U-Net(Attention U-Net)(98.79%)、Swin U-Net(98.60%)、标准U-Net(Standard U-Net)(98.58%)和U-Net++(97.46%)。通过统计分析,我们确认了高度显著的改进(p < 0.001),效应大小范围从0.230到13.159(Cohen's d)。使用显著性图分析,我们揭示了我们的架构产生了高度集中且解剖一致的激活模式,显示出增强的可解释性,这对于临床应用至关重要。所提出的方法在保持计算效率的同时(14.7M参数和45 GFLOPs推理成本),为自动化胎头分割建立了新的最先进性能。代码库:https://github.com/Ammar-ss
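The two quantities this abstract reports, the Dice score and Cohen's d, can be sketched as follows. The abstract does not say which variant of Cohen's d was used or how masks were thresholded, so the pooled-standard-deviation form and the flat 0/1 mask format below are assumptions; the function names are illustrative.

```python
import math

def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    # Two empty masks are treated as a perfect match by convention.
    return 1.0 if denom == 0 else 2.0 * intersection / denom

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled sample std."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    pooled = math.sqrt(((len(a) - 1) * var_a + (len(b) - 1) * var_b)
                       / (len(a) + len(b) - 2))
    return (mean_a - mean_b) / pooled

overlap = dice_score([1, 1, 0, 0], [1, 0, 0, 0])   # 2*1 / (2+1)
effect = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])  # means differ by 1, pooled std is 2
```

Because Cohen's d divides by the spread of the per-image scores, a very small mean difference can still yield a large effect size when the scores have low variance, which helps read the wide 0.230 to 13.159 range quoted above.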
cs.CV / 231 / 2604.18151
AI-based Waste Mapping for Addressing Climate-Exacerbated Flood Risk
基于人工智能的废物映射以应对气候加剧的洪水风险
Abstract
Urban flooding is a growing climate change-related hazard in rapidly expanding African cities, where inadequate waste management often blocks drainage systems and amplifies flood risks. This study introduces an AI-powered urban waste mapping workflow that leverages openly available aerial and street-view imagery to detect municipal solid waste at high resolution. Applied in Dar es Salaam, Tanzania, our approach reveals spatial waste patterns linked to informal settlements and socio-economic factors. Waste accumulation in waterways was found to be up to three times higher than in adjacent urban areas, highlighting critical hotspots for climate-exacerbated flooding. Unlike traditional manual mapping methods, this scalable AI approach allows city-wide monitoring and prioritization of interventions. Crucially, our collaboration with local partners ensured culturally and contextually relevant data labeling, reflecting real-world reuse practices for solid waste. The results offer actionable insights for urban planning, climate adaptation, and sustainable waste management in flood-prone urban areas.
Chinese Translation
城市洪水是快速扩张的非洲城市中日益严重的气候变化相关风险,其中不充分的废物管理常常阻塞排水系统,增加洪水风险。本研究介绍了一种基于人工智能的城市废物映射工作流程,该流程利用公开可用的航拍和街景图像,以高分辨率检测市政固体废物。我们在坦桑尼亚达累斯萨拉姆应用该方法,揭示了与非正式定居点和社会经济因素相关的空间废物模式。发现水道中的废物积累量比邻近城市区域高出多达三倍,突显了气候加剧洪水的关键热点。与传统的手动映射方法不同,这种可扩展的人工智能方法允许对整个城市进行监测和干预优先级排序。至关重要的是,我们与当地合作伙伴的合作确保了数据标注的文化和情境相关性,反映了固体废物的实际再利用实践。研究结果为洪水易发城市地区的城市规划、气候适应和可持续废物管理提供了可行的见解。
cs.CV / 232 / 2604.18167
Embedding Arithmetic: A Lightweight, Tuning-Free Framework for Post-hoc Bias Mitigation in Text-to-Image Models
嵌入算术:一种轻量级、无调优的后置偏见缓解框架用于文本到图像模型
Abstract
Modern text-to-image (T2I) models amplify harmful societal biases, challenging their ethical deployment. We introduce an inference-time method that reliably mitigates social bias while keeping prompt semantics and visual context (background, layout, and style) intact. This ensures context persistency and provides a controllable parameter to adjust mitigation strength, giving practitioners fine-grained control over fairness-coherence trade-offs. Using Embedding Arithmetic, we analyze how bias is structured in the embedding space and correct it without altering model weights, prompts, or datasets. Experiments on FLUX 1.0-Dev and Stable Diffusion 3.5-Large show that the conditional embedding space forms a complex, entangled manifold rather than a grid of disentangled concepts. To rigorously assess semantic preservation beyond the circularity and bias limitations of CLIP scores, we propose the Concept Coherence Score (CCS). Evaluated against this robust metric, our lightweight, tuning-free method significantly outperforms existing baselines in improving diversity while maintaining high concept coherence, effectively resolving the critical fairness-coherence trade-off. By characterizing how models represent social concepts, we establish geometric understanding of latent space as a principled path toward more transparent, controllable, and fair image generation.
Chinese Translation
现代文本到图像(T2I)模型放大了有害的社会偏见,挑战了其伦理部署。我们提出了一种推理时方法,能够可靠地缓解社会偏见,同时保持提示语义和视觉上下文(背景、布局和风格)不变。这确保了上下文的持续性,并提供了一个可控参数以调整缓解强度,使从业者能够对公平性与一致性之间的权衡进行细致控制。通过嵌入算术,我们分析了偏见在嵌入空间中的结构,并在不改变模型权重、提示或数据集的情况下进行纠正。在 FLUX 1.0-Dev 和 Stable Diffusion 3.5-Large 上的实验表明,条件嵌入空间形成了一个复杂的、交织的流形,而不是一个解缠概念的网格。为了严格评估语义保留,超越 CLIP 分数的循环性和偏见限制,我们提出了概念一致性评分(CCS)。在这一稳健指标的评估下,我们的轻量级、无调优方法在提高多样性方面显著优于现有基线,同时保持高概念一致性,有效解决了关键的公平性与一致性之间的权衡。通过表征模型如何表示社会概念,我们建立了潜在空间的几何理解,作为实现更透明、可控和公平图像生成的原则性路径。
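As a rough illustration of what arithmetic on conditioning embeddings with a tunable strength can look like, the sketch below shifts a prompt embedding along the difference between two group means. The abstract does not give the method's actual formula, so the direction estimate, the names `mean_vec` and `shift_embedding`, and the `strength` parameter are all assumptions.

```python
def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def shift_embedding(prompt_emb, source_embs, target_embs, strength):
    """Move prompt_emb along the (target mean - source mean) direction.

    strength = 0 leaves the embedding unchanged; larger values apply a
    stronger shift. Model weights and the prompt text are untouched.
    """
    source_mean = mean_vec(source_embs)
    target_mean = mean_vec(target_embs)
    direction = [t - s for s, t in zip(source_mean, target_mean)]
    return [p + strength * d for p, d in zip(prompt_emb, direction)]

prompt = [1.0, 0.0]
source = [[1.0, 0.0], [1.0, 0.0]]   # toy embeddings of the over-represented attribute
target = [[0.0, 1.0], [0.0, 1.0]]   # toy embeddings of the under-represented attribute
unchanged = shift_embedding(prompt, source, target, 0.0)
halfway = shift_embedding(prompt, source, target, 0.5)
```

A single scalar like `strength` is one simple way to expose the kind of controllable mitigation parameter the abstract mentions; real text-encoder embeddings are high-dimensional and, per the abstract, entangled, so a linear shift like this is only a first approximation.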
cs.CV / 233 / 2604.18168
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
通过判别性文本表示将一步图像生成从类别标签扩展到文本
Abstract
Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily focuses on class-to-image generation. However, an intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared to the limited class labels, text conditions pose greater challenges to the model's understanding capability, necessitating the effective integration of powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders using conventional training strategies results in unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and reveal that, due to the extremely limited number of refinement steps in the MeanFlow generation, such as only one step, the text feature representations are required to possess sufficiently high discriminability. This also explains why discrete and easily distinguishable class features perform well within the MeanFlow framework. Guided by these insights, we leverage a powerful LLM-based text encoder validated to possess the required semantic properties and adapt the MeanFlow generation process to this framework, resulting in efficient text-conditioned synthesis for the first time. Furthermore, we validate our approach on the widely used diffusion model, demonstrating significant generation performance improvements. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation. The code is available at https://github.com/AMAP-ML/EMF.
Chinese Translation
少步生成一直是一个长期目标,最近的一步生成方法如 MeanFlow 取得了显著的成果。现有关于 MeanFlow 的研究主要集中在类别到图像的生成上。然而,一个直观但尚未探索的方向是将条件从固定的类别标签扩展到灵活的文本输入,以实现更丰富的内容创作。与有限的类别标签相比,文本条件对模型理解能力提出了更大的挑战,需要将强大的文本编码器有效整合到 MeanFlow 框架中。令人惊讶的是,尽管将文本条件纳入看似简单,但我们发现使用传统训练策略整合强大的基于 LLM 的文本编码器会导致不理想的性能。为了揭示潜在原因,我们进行了详细分析,发现由于 MeanFlow 生成中的精细化步骤极其有限,例如仅有一步,文本特征表示需要具备足够高的可区分性。这也解释了为什么离散且易于区分的类别特征在 MeanFlow 框架中表现良好。在这些见解的指导下,我们采用一个经验证具备所需语义特性的强大的基于 LLM 的文本编码器,并使 MeanFlow 生成过程与之适配,从而首次实现了高效的文本条件合成。此外,我们在广泛使用的扩散模型上验证了我们的方法,显示出显著的生成性能提升。我们希望这项工作为未来关于文本条件 MeanFlow 生成的研究提供一个通用且实用的参考。代码可在 https://github.com/AMAP-ML/EMF 获取。
cs.CV / 234 / 2604.18184
CanonSLR: Canonical-View Guided Multi-View Continuous Sign Language Recognition
CanonSLR:基于典范视图的多视角连续手语识别
Abstract
Continuous Sign Language Recognition (CSLR) has achieved remarkable progress in recent years; however, most existing methods are developed under single-view settings and thus remain insufficiently robust to viewpoint variations in real-world scenarios. To address this limitation, we propose CanonSLR, a canonical-view guided framework for multi-view CSLR. Specifically, we introduce a frontal-view-anchored teacher-student learning strategy, in which a teacher network trained on frontal-view data provides canonical temporal supervision for a student network trained on all viewpoints. To further reduce cross-view semantic discrepancy, we propose Sequence-Level Soft-Target Distillation, which transfers structured temporal knowledge from the frontal view to non-frontal samples, thereby alleviating gloss boundary ambiguity and category confusion caused by occlusion and projection variation. In addition, we introduce Temporal Motion Relational Enhancement to explicitly model motion-aware temporal relations in high-level visual features, strengthening stable dynamic representations while suppressing viewpoint-sensitive appearance disturbances. To support multi-view CSLR research, we further develop a universal multi-view sign language data construction pipeline that transforms original single-view RGB videos into semantically consistent, temporally coherent, and viewpoint-controllable multi-view sign language videos. Based on this pipeline, we extend PHOENIX-2014T and CSL-Daily into two seven-view benchmarks, namely PT14-MV and CSL-MV, providing a new experimental foundation for multi-view CSLR. Extensive experiments on PT14-MV and CSL-MV demonstrate that CanonSLR consistently outperforms existing approaches under multi-view settings and exhibits stronger robustness, especially on challenging non-frontal views.
Chinese Translation
连续手语识别(CSLR)近年来取得了显著进展;然而,大多数现有方法是在单视角设置下开发的,因此在现实场景中的视角变化下仍然不足够稳健。为了解决这一局限性,我们提出了CanonSLR,一个基于典范视图的多视角CSLR框架。具体而言,我们引入了一种以正面视图为锚的教师-学生学习策略,其中在正面视图数据上训练的教师网络为在所有视角上训练的学生网络提供典范时间监督。为了进一步减少跨视角的语义差异,我们提出了序列级软目标蒸馏(Sequence-Level Soft-Target Distillation),该方法将结构化的时间知识从正面视图转移到非正面样本,从而缓解因遮挡和投影变化引起的词目(gloss)边界模糊和类别混淆。此外,我们引入了时间运动关系增强(Temporal Motion Relational Enhancement),以显式建模高层视觉特征中的运动感知时间关系,增强稳定的动态表示,同时抑制对视角敏感的外观干扰。为了支持多视角CSLR研究,我们进一步开发了一个通用的多视角手语数据构建管道,该管道将原始单视角RGB视频转换为语义一致、时间连贯且视角可控的多视角手语视频。基于该管道,我们将PHOENIX-2014T和CSL-Daily扩展为两个七视角基准,即PT14-MV和CSL-MV,为多视角CSLR提供了新的实验基础。在PT14-MV和CSL-MV上的大量实验表明,CanonSLR在多视角设置下始终优于现有方法,并表现出更强的稳健性,尤其是在具有挑战性的非正面视角上。
cs.CV / 235 / 2604.18201
DiffuSAM: Diffusion Guided Zero-Shot Object Grounding for Remote Sensing Imagery
DiffuSAM:基于扩散引导的零样本目标定位在遥感图像中的应用
Abstract
Diffusion models have emerged as powerful tools for a wide range of vision tasks, including text-guided image generation and editing. In this work, we explore their potential for object grounding in remote sensing imagery. We propose a hybrid pipeline that integrates diffusion-based localization cues with state-of-the-art segmentation models such as RemoteSAM and SAM3 to obtain more accurate bounding boxes. By leveraging the complementary strengths of generative diffusion models and foundational segmentation models, our approach enables robust and adaptive object localization across complex scenes. Experiments demonstrate that our pipeline significantly improves localization performance, achieving over a 14% increase in [email protected] compared to existing state-of-the-art methods.
Chinese Translation
扩散模型已成为广泛视觉任务的强大工具,包括文本引导的图像生成和编辑。在本研究中,我们探讨了它们在遥感图像目标定位中的潜力。我们提出了一种混合管道,将基于扩散的定位线索与最先进的分割模型(如RemoteSAM和SAM3)相结合,以获得更准确的边界框。通过利用生成性扩散模型和基础分割模型的互补优势,我们的方法能够在复杂场景中实现稳健和自适应的目标定位。实验表明,我们的管道显著提高了定位性能,与现有最先进的方法相比,[email protected]提升超过14%。
cs.CV / 236 / 2604.18205
A Comparative Evaluation of Geometric Accuracy in NeRF and Gaussian Splatting
NeRF与高斯泼溅(Gaussian Splatting)几何精度的比较评估
Abstract
Recent advances in neural rendering have introduced numerous 3D scene representations. Although standard computer vision metrics evaluate the visual quality of generated images, they often overlook the fidelity of surface geometry. This limitation is particularly critical in robotics, where accurate geometry is essential for tasks such as grasping and object manipulation. In this paper, we present an evaluation pipeline for neural rendering methods that focuses on geometric accuracy, along with a benchmark comprising 19 diverse scenes. Our approach enables a systematic assessment of reconstruction methods in terms of surface and shape fidelity, complementing traditional visual metrics.
Chinese Translation
近年来,神经渲染的进展引入了众多3D场景表示方法。尽管标准计算机视觉指标能够评估生成图像的视觉质量,但它们往往忽视了表面几何的保真度。这一局限性在机器人技术中尤为关键,因为准确的几何形状对于抓取和物体操作等任务至关重要。本文提出了一种针对神经渲染方法的评估流程,重点关注几何精度,并包含19个多样化场景的基准测试。我们的方法使得在表面和形状保真度方面对重建方法进行系统评估成为可能,从而补充了传统的视觉指标。
cs.CV / 237 / 2604.18208
Towards Symmetry-sensitive Pose Estimation: A Rotation Representation for Symmetric Object Classes
面向对称敏感的姿态估计:对称物体类别的旋转表示
Abstract
Symmetric objects are common in daily life and industry, yet their inherent orientation ambiguities, which impede the training of deep learning networks for pose estimation, are rarely discussed in the literature. To cope with these ambiguities, existing solutions typically require the design of specific loss functions and network architectures or resort to symmetry-invariant evaluation metrics. In contrast, we focus on the numeric representation of the rotation itself, modifying trigonometric identities with the degrees of symmetry derived from the objects' shapes. We use our representation, SARR, to obtain canonical (symmetry-resolved) poses for the symmetric objects in two popular 6D pose estimation datasets, T-LESS and ITODD, where SARR is unique and continuous w.r.t. the visual appearance. This allows us to use a standard CNN for 3D orientation estimation whose performance is evaluated with the symmetry-sensitive cosine distance $\text{AR}_{\text{C}}$. Our networks outperform the state of the art using $\text{AR}_{\text{C}}$ and achieve satisfactory performance when using conventional symmetry-invariant measures. Our method does not require any 3D models but only depth, or, as part of an additional experiment, texture-less RGB/grayscale images as input. We also show that networks trained on SARR outperform the same networks trained on rotation matrices, Euler angles, quaternions, standard trigonometric representations or the recently popular 6D representation -- even in inference scenarios where no prior knowledge of the objects' symmetry properties is available. Code and a visualization toolkit are available at https://github.com/akriegler/SARR.
Chinese Translation
对称物体在日常生活和工业中非常常见,但其固有的方向模糊性对深度学习网络在姿态估计中的训练造成了阻碍,这在文献中鲜有讨论。为了解决这些模糊性,现有解决方案通常需要设计特定的损失函数和网络架构,或者依赖于对称不变的评估指标。相较之下,我们关注于旋转本身的数值表示,通过修改三角函数恒等式来引入源自物体形状的对称度。我们使用我们的表示法 SARR 来获取在两个流行的 6D 姿态估计数据集 T-LESS 和 ITODD 中的标准(对称解析)姿态,其中 SARR 在视觉外观方面是独特且连续的。这使我们能够使用标准卷积神经网络(CNN)进行 3D 方向估计,其性能通过对称敏感的余弦距离 $\text{AR}_{\text{C}}$ 进行评估。我们的网络在使用 $\text{AR}_{\text{C}}$ 时超越了当前最先进的技术,并在使用常规的对称不变度量时也达到了令人满意的性能。我们的方法不需要任何 3D 模型,仅需深度信息,或者作为额外实验的一部分,使用无纹理的 RGB/灰度图像作为输入。我们还展示了在 SARR 上训练的网络在推理场景中表现优于在旋转矩阵、欧拉角、四元数、标准三角函数或最近流行的 6D 表示上训练的相同网络,即使在没有物体对称属性先验知识的情况下。代码和可视化工具包可在 https://github.com/akriegler/SARR 获取。
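The idea of folding an object's degree of symmetry into a trigonometric rotation encoding can be illustrated for the simplest case, a single rotation angle about an axis with n-fold symmetry. This is a generic textbook-style construction and not necessarily SARR's definition, which the abstract does not spell out; the function names are illustrative.

```python
import math

def encode_symmetric_angle(theta, n_fold):
    """Encode theta so that angles differing by 2*pi/n_fold map to the same point."""
    return (math.cos(n_fold * theta), math.sin(n_fold * theta))

def decode_symmetric_angle(code, n_fold):
    """Recover the canonical angle in [0, 2*pi/n_fold) from the encoding."""
    c, s = code
    period = 2.0 * math.pi / n_fold
    return (math.atan2(s, c) / n_fold) % period

# A square-like object (4-fold symmetry) looks identical after a 90 degree turn,
# so both poses should receive the same regression target.
a = encode_symmetric_angle(0.3, 4)
b = encode_symmetric_angle(0.3 + math.pi / 2.0, 4)
canonical = decode_symmetric_angle(b, 4)
```

Because visually identical poses share one target and the encoding varies smoothly with the angle, a plain regression network is not asked to distinguish poses that the input cannot distinguish, which is the kind of ambiguity the abstract is concerned with.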
cs.CV / 238 / 2604.18215
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
按需记忆:用于空间一致的长时间视频生成的解耦记忆控制
Abstract
Spatially consistent long-horizon video generation aims to maintain temporal and spatial consistency along predefined camera trajectories. Existing methods mostly entangle memory modeling with video generation, leading to inconsistent content during scene revisits and diminished generative capacity when exploring novel regions, even when trained on extensive annotated data. To address these limitations, we propose a decoupled framework that separates memory conditioning from generation. Our approach significantly reduces training costs while simultaneously enhancing spatial consistency and preserving the generative capacity for novel scene exploration. Specifically, we employ a lightweight, independent memory branch to learn precise spatial consistency from historical observations. We first introduce a hybrid memory representation to capture complementary temporal and spatial cues from generated frames, then leverage a per-frame cross-attention mechanism to ensure each frame is conditioned exclusively on the most spatially relevant historical information, which is injected into the generative model to ensure spatial consistency. When generating new scenes, a camera-aware gating mechanism mediates the interaction between the memory and generation modules, enabling memory conditioning only when meaningful historical references exist. Compared with existing methods, our method is highly data-efficient, yet experiments demonstrate that it achieves state-of-the-art performance in terms of both visual quality and spatial consistency.
Chinese Translation
空间一致的长时间视频生成旨在沿预定义的摄像机轨迹保持时间和空间的一致性。现有方法大多将记忆建模与视频生成纠缠在一起,导致在场景重访时内容不一致,并且在探索新区域时生成能力下降,即使在大量标注数据上训练。为了解决这些局限性,我们提出了一种解耦框架,将记忆条件与生成过程分离。我们的方法显著降低了训练成本,同时增强了空间一致性,并保持了对新场景探索的生成能力。具体而言,我们采用了一个轻量级的独立记忆分支,从历史观察中学习精确的空间一致性。我们首先引入了一种混合记忆表示,以捕捉从生成帧中提取的互补时间和空间线索,然后利用逐帧交叉注意机制,确保每帧仅基于最相关的历史信息进行条件化,这些信息被注入生成模型以确保空间一致性。在生成新场景时,提出了一种摄像机感知的门控机制,以调节记忆与生成模块之间的交互,仅在存在有意义的历史参考时进行记忆条件化。与现有方法相比,我们的方法具有很高的数据效率,实验表明我们的方法在视觉质量和空间一致性方面均达到了最先进的性能。
cs.CV / 239 / 2604.18223
Instruction-as-State: Environment-Guided and State-Conditioned Semantic Understanding for Embodied Navigation
指令作为状态:环境引导和状态条件的语义理解用于具身导航
Abstract
Vision-and-Language Navigation requires agents to follow natural-language instructions in visually changing environments. A central challenge is the dynamic entanglement between language and observations: the meaning of an instruction shifts as the agent's field of view and spatial context evolve. However, many existing models encode the instruction as a static global representation, limiting their ability to adapt instruction meaning to the current visual context. We therefore model instruction understanding as an Instruction-as-State variable: a decision-relevant, token-level instruction state that evolves step by step conditioned on the agent's perceptual state, where the perceptual state denotes the observation-grounded navigation context at each step. To realize this principle, we introduce State-Entangled Environment-Guided Instruction Understanding (S-EGIU), a coarse-to-fine framework for state-conditioned segment activation and token-level semantic refinement. At the coarse level, S-EGIU activates the instruction segment whose semantics align with the current observation. At the fine level, it refines the activated segment through observation-guided token grounding and contextual modeling, sharpening its internal semantics under the current observation. Together, these stages maintain an instruction state that is continuously updated according to the agent's perceptual state during navigation. S-EGIU delivers strong performance on several key metrics, including a +2.68% SPL gain on REVERIE Test Unseen, and demonstrates consistent efficiency gains across multiple VLN benchmarks, underscoring the value of dynamic instruction--perception entanglement.
Chinese Translation
视觉与语言导航要求智能体在视觉变化的环境中遵循自然语言指令。一个核心挑战是语言与观察之间的动态纠缠:随着智能体的视野和空间上下文的演变,指令的意义也在变化。然而,许多现有模型将指令编码为静态的全局表示,限制了它们根据当前视觉上下文调整指令意义的能力。因此,我们将指令理解建模为指令作为状态(Instruction-as-State)变量:一个与决策相关的、逐步演变的令牌级指令状态,条件是智能体的感知状态,其中感知状态表示每一步的观察基础导航上下文。为了实现这一原则,我们引入了状态纠缠环境引导指令理解(State-Entangled Environment-Guided Instruction Understanding,S-EGIU),这是一个用于状态条件的段激活和令牌级语义细化的粗到细框架。在粗略层面,S-EGIU 激活与当前观察语义对齐的指令段。在细致层面,它通过观察引导的令牌定位和上下文建模来细化激活的段,在当前观察下锐化其内部语义。这些阶段共同维持一个根据智能体在导航过程中感知状态不断更新的指令状态。S-EGIU 在多个关键指标上表现出色,包括在 REVERIE 测试未见数据集上获得 +2.68% 的 SPL 增益,并在多个视觉语言导航基准测试中展示出一致的效率提升,突显了动态指令与感知纠缠的价值。
cs.CV / 240 / 2604.18225
Is SAM3 ready for pathology segmentation?
SAM3是否准备好进行病理分割?
Abstract
Is Segment Anything Model 3 (SAM3) capable of segmenting any pathology image? Digital pathology segmentation spans tissue-level and nuclei-level scales, where traditional methods often suffer from high annotation costs and poor generalization. SAM3 introduces Promptable Concept Segmentation, offering a potential automated interface via text prompts. In this work, we propose a systematic evaluation protocol to explore the capability space of SAM3 in a structured manner. Specifically, we evaluate SAM3 under different supervision settings, including zero-shot, few-shot, and supervised, with varying prompting strategies. Our extensive evaluation on pathological datasets including NuInsSeg, PanNuke and GlaS reveals that: (1) text-only prompts poorly activate nuclear concepts; (2) performance is highly sensitive to visual prompt types and budgets; (3) few-shot learning offers gains, but SAM3 lacks robustness against visual prompt noise; and (4) a significant gap persists between prompt-based usage and the task-trained, adapter-based reference. Our study delineates SAM3's boundaries in pathology image segmentation and provides practical guidance on the necessity of pathology domain adaptation.
Chinese Translation
Segment Anything Model 3 (SAM3) 是否能够对任何病理图像进行分割?数字病理分割涵盖了组织级和细胞核级的尺度,传统方法往往面临高标注成本和较差的泛化能力。SAM3 引入了可提示的概念分割,提供了一种通过文本提示的潜在自动化接口。在本研究中,我们提出了一种系统的评估协议,以结构化的方式探索 SAM3 的能力空间。具体而言,我们在不同的监督设置下评估 SAM3,包括零样本、少样本和带有不同提示策略的监督学习。我们对包括 NuInsSeg、PanNuke 和 GlaS 在内的病理数据集进行了广泛评估,结果显示:1. 仅使用文本提示难以有效激活细胞核概念;2. 性能对视觉提示类型和预算高度敏感;3. 少样本学习提供了收益,但 SAM3 对视觉提示噪声缺乏鲁棒性;4. 基于提示的使用与任务训练的适配器基准之间仍存在显著差距。我们的研究勾勒了 SAM3 在病理图像分割中的边界,并为病理领域适应的必要性提供了实用指导。
cs.CV / 241 / 2604.18250
Medical Image Understanding Improves Survival Prediction via Visual Instruction Tuning
医学图像理解通过视觉指令调优改善生存预测
Abstract
Accurate prognostication and risk estimation are essential for guiding clinical decision-making and optimizing patient management. While radiologist-assessed features from CT scans provide valuable indicators of disease severity and outcomes, interpreting such images requires expert knowledge, and translating rich visual information into textual summaries inevitably leads to information loss. In this work, we propose a vision-language framework for 3D CT image understanding that leverages large-scale open-source CT images paired with radiology reports through visual instruction tuning. This pre-training enables the model to learn clinically meaningful visual-textual representations, which can then be adapted to downstream survival prediction tasks. By incorporating a survival prediction head on top of the pre-trained model, our approach improves survival prediction from CT images and clinical data while generating clinically meaningful language responses to predefined questions. Experimental results demonstrate that our method outperforms baseline methods in survival prediction, particularly when clinical data alone is less predictive. The code will be released upon acceptance.
Chinese Translation
准确的预后评估和风险估计对于指导临床决策和优化患者管理至关重要。虽然放射科医师评估的CT扫描特征提供了疾病严重程度和结果的有价值指标,但解读这些图像需要专业知识,且将丰富的视觉信息转化为文本摘要不可避免地会导致信息损失。在本研究中,我们提出了一种用于3D CT图像理解的视觉-语言框架,该框架利用大规模开源CT图像与放射学报告的配对,通过视觉指令调优进行训练。这种预训练使模型能够学习具有临床意义的视觉-文本表示,进而可以适应下游的生存预测任务。通过在预训练模型上增加生存预测头,我们的方法在CT图像和临床数据的生存预测中取得了改善,同时生成对预定义问题的临床意义语言响应。实验结果表明,我们的方法在生存预测方面优于基线方法,特别是在仅依赖临床数据时预测能力较弱的情况下。代码将在接受后发布。
cs.CV / 242 / 2604.18251
Style-Based Neural Architectures for Real-Time Weather Classification
基于风格的神经网络架构用于实时天气分类
Abstract
In this paper, we present three neural network architectures designed for real-time classification of weather conditions (sunny, rain, snow, fog) from images. These models, inspired by recent advances in style transfer, aim to capture the stylistic elements present in images. One model, called "Multi-PatchGAN", is based on PatchGANs used in well-known architectures such as Pix2Pix and CycleGAN, but here adapted with multiple patch sizes for detection tasks. The second model, "Truncated ResNet50", is a simplified version of ResNet50 retaining only its first nine layers. This truncation, determined by an evolutionary algorithm, facilitates the extraction of high-frequency features essential for capturing subtle stylistic details. Finally, we propose "Truncated ResNet50 with Gram Matrix and Attention", which computes Gram matrices for each layer during training and automatically weights them via an attention mechanism, thus optimizing the extraction of the most relevant stylistic expressions for classification. These last two models outperform the state of the art and demonstrate remarkable generalization capability on several public databases. Although developed for weather detection, these architectures are also suitable for other appearance-based classification tasks, such as animal species recognition, texture classification, disease detection in medical imaging, or industrial defect identification.
Chinese Translation
在本文中,我们提出了三种神经网络架构,旨在从图像中实时分类天气状况(晴天、雨天、雪天、雾天)。这些模型受到近期风格迁移进展的启发,旨在捕捉图像中存在的风格元素。其中一个模型称为“Multi-PatchGAN”,基于在知名架构(如Pix2Pix和CycleGAN)中使用的PatchGAN,但在此针对检测任务进行了多种补丁大小的适配。第二个模型“Truncated ResNet50”是ResNet50的简化版本,仅保留其前九层。通过进化算法确定的这种截断,促进了高频特征的提取,这对于捕捉细微的风格细节至关重要。最后,我们提出了“带有Gram矩阵和注意力机制的Truncated ResNet50”,该模型在训练过程中计算每一层的Gram矩阵,并通过注意力机制自动加权,从而优化提取最相关的风格表达用于分类。后两个模型在多个公共数据库上超越了现有技术,并展现出卓越的泛化能力。尽管这些架构是为天气检测开发的,但它们同样适用于其他基于外观的分类任务,如动物物种识别、纹理分类、医学影像中的疾病检测或工业缺陷识别。
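The Gram matrix that the third model computes per layer is the standard style descriptor from the style-transfer literature: the matrix of inner products between channel activations. A minimal sketch follows; the normalisation by the number of spatial positions and the flattened list-of-channels input format are assumptions, and the attention weighting across layers mentioned in the abstract is not reproduced here.

```python
def gram_matrix(features):
    """Gram matrix of a feature map.

    features[c] holds the flattened spatial activations of channel c.
    Entry (i, j) is the mean product of channels i and j over all positions,
    so spatial layout is discarded and only channel co-activation remains.
    """
    n_positions = len(features[0])
    return [
        [sum(a * b for a, b in zip(fi, fj)) / n_positions for fj in features]
        for fi in features
    ]

# Two channels, two spatial positions.
gram = gram_matrix([[1.0, 2.0], [3.0, 4.0]])
```

Discarding spatial layout is what makes this descriptor attractive for appearance cues such as fog or snow, which affect texture statistics across the whole image rather than any particular object.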
cs.CV / 243 / 2604.18256
Domain-Specialized Object Detection via Model-Level Mixtures of Experts
通过模型级专家混合实现领域专用的目标检测
Abstract
Mixture-of-Experts (MoE) models provide a structured approach to combining specialized neural networks and offer greater interpretability than conventional ensembles. While MoEs have been successfully applied to image classification and semantic segmentation, their use in object detection remains limited due to challenges in merging dense and structured predictions. In this work, we investigate model-level mixtures of object detectors and analyze their suitability for improving performance and interpretability in object detection. We propose an MoE architecture that combines YOLO-based detectors trained on semantically disjoint data subsets, with a learned gating network that dynamically weights expert contributions. We study different strategies for fusing detection outputs and for training the gating mechanism, including balancing losses to prevent expert collapse. Experiments on the BDD100K dataset demonstrate that the proposed MoE consistently outperforms standard ensemble approaches and provides insights into expert specialization across domains, highlighting model-level MoEs as a viable alternative to traditional ensembling for object detection. Our code is available at https://github.com/KASTEL-MobilityLab/mixtures-of-experts/.
Chinese Translation
专家混合模型(Mixture-of-Experts, MoE)提供了一种结构化的方法来结合专门的神经网络,并比传统的集成方法提供更好的可解释性。尽管MoE已成功应用于图像分类和语义分割,但由于合并密集和结构化预测的挑战,其在目标检测中的应用仍然有限。在本研究中,我们探讨了目标检测器的模型级混合,并分析其在提高目标检测性能和可解释性方面的适用性。我们提出了一种MoE架构,该架构结合了在语义上不重叠的数据子集上训练的基于YOLO的检测器,并配备了一个动态加权专家贡献的学习门控网络。我们研究了不同的检测输出融合策略和门控机制的训练方法,包括平衡损失以防止专家崩溃。在BDD100K数据集上的实验表明,所提出的MoE在性能上始终优于标准集成方法,并提供了关于专家在不同领域中的专业化的见解,突出了模型级MoE作为目标检测中传统集成方法的可行替代方案。我们的代码可在 https://github.com/KASTEL-MobilityLab/mixtures-of-experts/ 获取。
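A minimal sketch of gate-weighted fusion of detector outputs follows. It is not the authors' implementation: the fusion rule (scale confidences by the gate weight, then threshold), the expert names, and the `min_score` cutoff are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_detections(expert_detections, gate_logits, min_score=0.25):
    """Scale each expert's box confidences by its gate weight and pool the survivors."""
    weights = softmax(gate_logits)
    fused = []
    for dets, w in zip(expert_detections, weights):
        for box, score, label in dets:
            weighted = score * w
            if weighted >= min_score:
                fused.append((box, weighted, label))
    # Highest weighted confidence first.
    return sorted(fused, key=lambda d: d[1], reverse=True)

# Hypothetical experts trained on disjoint subsets (names are illustrative only).
day_expert = [((10, 10, 50, 50), 0.9, "car")]
night_expert = [((12, 11, 49, 52), 0.6, "car"), ((80, 20, 95, 60), 0.4, "person")]
# The gate strongly prefers the first expert for this image.
fused = fuse_detections([day_expert, night_expert], gate_logits=[2.0, 0.0])
```

A full pipeline would typically follow this pooling with non-maximum suppression; the abstract only states that several fusion strategies were studied.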
cs.CV / 244 / 2604.18258
Long-Text-to-Image Generation via Compositional Prompt Decomposition
通过组合提示分解进行长文本到图像生成
Abstract
While modern text-to-image (T2I) models excel at generating images from intricate prompts, they struggle to capture the key details when the inputs are descriptive paragraphs. This limitation stems from the prevalence of concise captions that shape their training distributions. Existing methods attempt to bridge this gap by either fine-tuning T2I models on long prompts, which generalizes poorly to longer lengths; or by projecting the oversize inputs into normal-prompt space and compromising fidelity. We propose Prompt Refraction for Intricate Scene Modeling (PRISM), a compositional approach that enables pre-trained T2I models to process long sequence inputs. PRISM uses a lightweight module to extract constituent representations from the long prompts. The T2I model makes independent noise predictions for each component, and their outputs are merged into a single denoising step using energy-based conjunction. We evaluate PRISM across a wide range of model architectures, showing comparable performances to models fine-tuned on the same training data. Furthermore, PRISM demonstrates superior generalization, outperforming baseline models by 7.4% on prompts over 500 tokens in a challenging public benchmark.
Chinese Translation
尽管现代文本到图像(T2I)模型在从复杂提示生成图像方面表现出色,但当输入为描述性段落时,它们在捕捉关键细节方面存在困难。这一限制源于训练分布中简洁标题的普遍存在。现有方法试图通过对长提示进行微调T2I模型来弥补这一差距,但这种方法在处理更长的输入时效果不佳;或者通过将超大输入投影到正常提示空间,从而妥协了保真度。我们提出了用于复杂场景建模的提示折射(Prompt Refraction for Intricate Scene Modeling,PRISM),这是一种组合方法,使预训练的T2I模型能够处理长序列输入。PRISM使用轻量级模块从长提示中提取组成表示。T2I模型对每个组件进行独立的噪声预测,并通过基于能量的结合将它们的输出合并为单个去噪步骤。我们在多种模型架构上评估PRISM,显示出与在相同训练数据上微调的模型相当的性能。此外,PRISM展示了优越的泛化能力,在一个具有挑战性的公共基准测试中,对超过500个标记的提示超越了基线模型7.4%。
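To make the merging step concrete, here is a sketch of the conjunction rule commonly used in composable diffusion, where per-component noise predictions are combined around an unconditional prediction. Whether PRISM uses exactly this weighting is an assumption; the function name `conjoin_noise` and the toy vectors are hypothetical.

```python
def conjoin_noise(eps_uncond, eps_components, weights):
    """eps = eps_uncond + sum_i w_i * (eps_i - eps_uncond), applied element-wise."""
    combined = list(eps_uncond)
    for eps_i, w in zip(eps_components, weights):
        for k, (e, u) in enumerate(zip(eps_i, eps_uncond)):
            combined[k] += w * (e - u)
    return combined

# Toy 3-element noise predictions (assumed values).
eps_uncond = [0.0, 1.0, -1.0]
eps_a = [1.0, 1.0, 0.0]   # prediction conditioned on the first prompt component
eps_b = [0.0, 3.0, -1.0]  # prediction conditioned on the second prompt component
eps = conjoin_noise(eps_uncond, [eps_a, eps_b], weights=[1.0, 0.5])
```

The merged `eps` would then be used for a single denoising update, so the number of sampler steps stays the same regardless of how many components the prompt is split into.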
cs.CV / 245 / 2604.18260
Geometry-Guided 3D Visual Token Pruning for Video-Language Models
基于几何引导的3D视觉标记剪枝用于视频语言模型
Abstract
Multimodal large language models have demonstrated remarkable capabilities in 2D vision, motivating their extension to 3D scene understanding. Recent studies represent 3D scenes as 3D spatial videos composed of image sequences with depth and camera pose information, enabling pre-trained video-language models to perform 3D reasoning tasks. However, the large number of visual tokens in spatial videos remains a major bottleneck for efficient inference and context management. Existing pruning methods overlook the view consistency of spatial videos and the spatial diversity of the remaining tokens, which prevents them from effectively removing inter-frame redundancy and preserving scene completeness. In this paper, we propose Geo3DPruner, a Geometry-Guided 3D Visual Token Pruning framework. Geo3DPruner first models cross-frame relevance through geometry-aware global attention, and then performs a two-stage pruning process. The intra-voxel stage selects representative multi-view features within each voxel, while the inter-voxel stage preserves spatial diversity by selecting a globally distributed subset of voxels. Extensive experiments on multiple 3D scene understanding benchmarks demonstrate that Geo3DPruner retains over 90% of the original performance while pruning 90% of visual tokens, significantly outperforming existing text-guided and vision-guided pruning methods.
Chinese Translation
多模态大型语言模型在2D视觉中展现了卓越的能力,这激励了它们在3D场景理解中的扩展。近期研究将3D场景表示为由图像序列、深度和相机姿态信息组成的3D空间视频,使得预训练的视频语言模型能够执行3D推理任务。然而,空间视频中大量的视觉标记仍然是高效推理和上下文管理的主要瓶颈。现有的剪枝方法忽视了空间视频的视图一致性和剩余标记的空间多样性,这使得它们无法有效去除帧间冗余并保持场景的完整性。本文提出了Geo3DPruner,一个基于几何引导的3D视觉标记剪枝框架。Geo3DPruner首先通过几何感知的全局注意力建模跨帧相关性,然后执行两阶段的剪枝过程。体素内阶段在每个体素内选择代表性的多视角特征,而体素间阶段通过选择全局分布的体素子集来保持空间多样性。在多个3D场景理解基准上的大量实验表明,Geo3DPruner在剪枝90%视觉标记的同时保留了超过90%的原始性能,显著优于现有的文本引导和视觉引导剪枝方法。
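The two-stage pruning idea can be sketched as follows, under simplifying assumptions: each token carries a 3D position and a precomputed relevance score (standing in for the geometry-aware attention), the intra-voxel stage keeps one token per voxel, and the inter-voxel stage uses greedy farthest-point sampling. The function `prune_tokens` and its parameters are hypothetical, not the paper's API.

```python
def prune_tokens(tokens, voxel_size, keep_voxels):
    """tokens: list of (xyz, score) pairs; returns the tokens that survive pruning."""
    # Intra-voxel stage: keep the highest-scoring token in each occupied voxel.
    best = {}
    for xyz, score in tokens:
        key = tuple(int(c // voxel_size) for c in xyz)
        if key not in best or score > best[key][1]:
            best[key] = (xyz, score)

    def sq_gap(key, chosen):
        # Squared distance from a voxel to its nearest already-chosen voxel.
        return min(sum((a - b) ** 2 for a, b in zip(key, c)) for c in chosen)

    # Inter-voxel stage: greedy farthest-point sampling over voxel indices,
    # seeded with the voxel holding the highest-scoring token.
    remaining = sorted(best)
    chosen = [max(remaining, key=lambda k: best[k][1])]
    remaining.remove(chosen[0])
    while remaining and len(chosen) < keep_voxels:
        nxt = max(remaining, key=lambda k: sq_gap(k, chosen))
        chosen.append(nxt)
        remaining.remove(nxt)
    return [best[k] for k in chosen]

tokens = [
    ((0.1, 0.1, 0.1), 0.9),
    ((0.2, 0.3, 0.1), 0.5),  # same voxel as the first token, lower score
    ((1.5, 0.2, 0.1), 0.4),
    ((5.2, 0.1, 0.3), 0.3),
]
kept = prune_tokens(tokens, voxel_size=1.0, keep_voxels=2)
```

Farthest-point sampling is one simple way to obtain a globally distributed subset of voxels; the abstract does not say which selection rule the authors use.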
cs.CV / 246 / 2604.18267
MARCO: Navigating the Unseen Space of Semantic Correspondence
MARCO:导航语义对应的隐形空间
Abstract
Recent advances in semantic correspondence rely on dual-encoder architectures, combining DINOv2 with diffusion backbones. While accurate, these billion-parameter models generalize poorly beyond training keypoints, revealing a gap between benchmark performance and real-world usability, where queried points rarely match those seen during training. Building upon DINOv2, we introduce MARCO, a unified model for generalizable correspondence driven by a novel training framework that enhances both fine-grained localization and semantic generalization. By coupling a coarse-to-fine objective that refines spatial precision with a self-distillation framework, which expands sparse supervision beyond annotated regions, our approach transforms a handful of keypoints into dense, semantically coherent correspondences. MARCO sets a new state of the art on SPair-71k, AP-10K, and PF-PASCAL, with gains that amplify at fine-grained localization thresholds (+8.9 PCK), the strongest generalization to unseen keypoints (+5.1, SPair-U) and categories (+4.7, MP-100), while remaining 3x smaller and 10x faster than diffusion-based approaches. Code is available at https://github.com/visinf/MARCO.
Chinese Translation
最近在语义对应方面的进展依赖于双编码器架构,将 DINOv2 与扩散骨干网络相结合。尽管这些十亿参数的模型具有较高的准确性,但在训练关键点之外的泛化能力较差,暴露出基准性能与实际可用性之间的差距,因为查询点很少与训练期间看到的点匹配。在 DINOv2 的基础上,我们提出了 MARCO,这是一个统一的模型,旨在通过一种新颖的训练框架实现可泛化的对应,增强细粒度定位和语义泛化。通过结合粗到细的目标来提高空间精度,以及扩展稀疏监督到未标注区域的自蒸馏框架,我们的方法将少量关键点转化为密集且语义一致的对应。MARCO 在 SPair-71k、AP-10K 和 PF-PASCAL 上设定了新的最先进水平,在细粒度定位阈值下的提升达到了 (+8.9 PCK),对未见关键点的最强泛化 (+5.1, SPair-U) 和类别 (+4.7, MP-100),同时比基于扩散的方法小 3 倍且快 10 倍。代码可在 https://github.com/visinf/MARCO 获取。
cs.CV / 247 / 2604.18274
LiquidTAD: An Efficient Method for Temporal Action Detection via Liquid Neural Dynamics
LiquidTAD:一种通过液态神经动力学进行高效时间动作检测的方法
Abstract
Temporal Action Detection (TAD) in untrimmed videos is currently dominated by Transformer-based architectures. While high-performing, their quadratic computational complexity and substantial parameter redundancy limit deployment in resource-constrained environments. In this paper, we propose LiquidTAD, a novel parameter-efficient framework that replaces cumbersome self-attention layers with parallelized ActionLiquid blocks. Unlike traditional Liquid Neural Networks (LNNs) that suffer from sequential execution bottlenecks, LiquidTAD leverages a closed-form continuous-time (CfC) formulation, allowing the model to be reformulated as a parallelizable operator while preserving the intrinsic physical prior of continuous-time dynamics. This architecture captures complex temporal dependencies with $O(N)$ linear complexity and adaptively modulates temporal sensitivity through learned time-constants ($\tau$), providing a robust mechanism for handling varying action durations. To the best of our knowledge, this work is the first to introduce a parallelized LNN-based architecture to the TAD domain. Experimental results on the THUMOS-14 dataset demonstrate that LiquidTAD achieves a highly competitive Average mAP of 69.46\% with only 10.82M parameters -- a 63\% reduction compared to the ActionFormer baseline. Further evaluations on ActivityNet-1.3 and Ego4D benchmarks confirm that LiquidTAD achieves an optimal accuracy-efficiency trade-off and exhibits superior robustness to temporal sampling variations, advancing the Pareto frontier of modern TAD frameworks.
Chinese Translation
在未裁剪视频中,时间动作检测(TAD)目前主要由基于Transformer的架构主导。尽管性能优越,但其二次计算复杂度和大量参数冗余限制了在资源受限环境中的部署。本文提出了LiquidTAD,一种新颖的参数高效框架,通过并行化的ActionLiquid模块替代了繁重的自注意力层。与传统的液态神经网络(LNNs)因顺序执行瓶颈而受限不同,LiquidTAD利用闭式连续时间(CfC)公式,使得模型能够被重新构造为可并行化的算子,同时保留连续时间动态的内在物理先验。这种架构以$O(N)$的线性复杂度捕捉复杂的时间依赖关系,并通过学习的时间常数($\tau$)自适应调节时间敏感性,为处理不同动作持续时间提供了强大的机制。据我们所知,这项工作是首次将基于并行化LNN的架构引入TAD领域。在THUMOS-14数据集上的实验结果表明,LiquidTAD以仅10.82M的参数实现了69.46\%的高度竞争的平均mAP,相较于ActionFormer基线减少了63\%。在ActivityNet-1.3和Ego4D基准上的进一步评估确认了LiquidTAD在准确性和效率之间实现了最佳的权衡,并表现出对时间采样变化的优越鲁棒性,推动了现代TAD框架的Pareto前沿。
cs.CV / 248 / 2604.18284
Spike-NVPT: Learning Robust Visual Prompts via Bio-Inspired Temporal Filtering and Discretization
Spike-NVPT:通过生物启发的时间滤波和离散化学习鲁棒的视觉提示
Abstract
Pre-trained vision models have found widespread application across diverse domains. Prompt tuning-based methods have emerged as a parameter-efficient paradigm for adapting pre-trained vision models. While effective on standard benchmarks, the continuous and dense nature of learned prompts can lead to sensitivity against input noise, as the high-capacity prompts tend to overfit task-irrelevant details. To address this trade-off, we propose Spike-NVPT, a noise-robust visual prompt tuning method. Specifically, we design a Signal Filtering Layer based on spiking neurons, which uses the integrate-and-fire (IF) mechanism to accumulate task-relevant signals over time and filter transient noise fluctuations. A subsequent Spike Discretization Unit converts filtered signals into sparse binary prompts. This discretization acts as a strong regularizer, forcing the model to anchor decision boundaries on the most discriminative and robust features. Notably, the resulting binary prompts remain static during deployment, ensuring zero additional computational overhead during inference. Experimental results demonstrate that Spike-NVPT achieves superior robustness performance, with a maximum improvement of 11.2% over conventional methods, and retains competitive accuracy on clean datasets. To the best of our knowledge, this is the first attempt to leverage spiking neurons for fine-tuning traditional artificial neural network (ANN)-based visual models.
Chinese Translation
预训练视觉模型在多个领域得到了广泛应用。基于提示调优的方法已成为一种高效的参数适应预训练视觉模型的范式。尽管在标准基准上效果显著,但学习到的提示的连续和密集特性可能导致对输入噪声的敏感性,因为高容量的提示往往会过拟合与任务无关的细节。为了解决这一权衡,我们提出了Spike-NVPT,一种抗噪声的视觉提示调优方法。具体而言,我们设计了一个基于脉冲神经元的信号滤波层,该层利用积分-发放(integrate-and-fire, IF)机制在时间上累积与任务相关的信号并滤除瞬态噪声波动。随后,脉冲离散化单元将滤波后的信号转换为稀疏的二进制提示。这种离散化作为一种强正则化器,迫使模型将决策边界锚定在最具判别性和鲁棒性的特征上。值得注意的是,生成的二进制提示在部署期间保持静态,确保推理过程中的零额外计算开销。实验结果表明,Spike-NVPT在鲁棒性性能上优于传统方法,最大提升达到11.2%,并在干净数据集上保持竞争性准确性。根据我们所知,这是首次尝试利用脉冲神经元对传统人工神经网络(ANN)基础的视觉模型进行微调。
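The integrate-and-fire filtering mentioned in this abstract can be illustrated with a minimal sketch of a single IF neuron. The threshold value, the soft-reset rule, and the toy input sequences are assumptions; the paper's Signal Filtering Layer and Spike Discretization Unit are not reproduced here.

```python
def integrate_and_fire(inputs, threshold=1.0):
    """Return a binary spike train: fire when the accumulated potential reaches the threshold."""
    potential = 0.0
    spikes = []
    for x in inputs:
        potential += x
        if potential >= threshold:
            spikes.append(1)
            potential -= threshold  # soft reset keeps any residual charge
        else:
            spikes.append(0)
    return spikes

# A persistent signal accumulates and fires repeatedly.
steady = integrate_and_fire([0.5, 0.5, 0.5, 0.5])
# A single transient bump never reaches the threshold and is filtered out.
transient = integrate_and_fire([0.0, 0.75, 0.0, 0.0])
```

The output is already binary and sparse, which is the property the abstract relies on when it describes discretization as a regularizer.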
cs.CV / 249 / 2604.18313
Denoise and Align: Diffusion-Driven Foreground Knowledge Prompting for Open-Vocabulary Temporal Action Detection
去噪与对齐:基于扩散驱动的前景知识提示在开放词汇时间动作检测中的应用
Abstract
Open-Vocabulary Temporal Action Detection (OV-TAD) aims to localize and classify action segments of unseen categories in untrimmed videos, where effective alignment between action semantics and video representations is critical for accurate detection. However, existing methods struggle to mitigate the semantic imbalance between concise, abstract action labels and rich, complex video content, inevitably introducing semantic noise and misleading cross-modal alignment. To address this challenge, we propose DFAlign, the first framework that leverages diffusion-based denoising to generate foreground knowledge that guides action-video alignment. Following a 'conditioning, denoising, and aligning' scheme, we first introduce the Semantic-Unify Conditioning (SUC) module, which unifies action-shared and action-specific semantics as conditions for diffusion denoising. Then, the Background-Suppress Denoising (BSD) module generates foreground knowledge by progressively removing background redundancy from videos through the denoising process. This foreground knowledge serves as an effective intermediate semantic anchor between video and text representations, mitigating the semantic gap and enhancing the discriminability of action-relevant segments. Furthermore, we introduce the Foreground-Prompt Alignment (FPA) module to inject the extracted foreground knowledge as prompt tokens into text representations, guiding the model's attention towards action-relevant segments and enabling precise cross-modal alignment. Extensive experiments demonstrate that our method achieves state-of-the-art performance on two OV-TAD benchmarks. The code repository is available at https://anonymous.4open.science/r/Code-2114/.
Chinese Translation
开放词汇时间动作检测(OV-TAD)旨在定位和分类未修剪视频中未见类别的动作片段,其中动作语义与视频表示之间的有效对齐对于准确检测至关重要。然而,现有方法在缓解简洁、抽象的动作标签与丰富、复杂的视频内容之间的语义不平衡方面存在困难,必然引入语义噪声并导致跨模态对齐的误导。为了解决这一挑战,我们提出了DFAlign,这是第一个利用基于扩散的去噪生成前景知识以指导动作与视频对齐的框架。按照“条件化、去噪和对齐”的方式,我们首先引入语义统一条件(SUC)模块,该模块将动作共享和动作特定的语义统一为扩散去噪的条件。然后,背景抑制去噪(BSD)模块通过去噪过程逐步去除视频中的背景冗余,从而生成前景知识。该前景知识作为视频和文本表示之间有效的中介语义锚点,减小了语义差距并增强了与动作相关片段的可区分性。此外,我们引入前景提示对齐(FPA)模块,将提取的前景知识作为提示标记注入文本表示中,引导模型的注意力集中在与动作相关的片段上,从而实现精确的跨模态对齐。大量实验表明,我们的方法在两个OV-TAD基准上达到了最先进的性能。代码库链接如下:https://anonymous.4open.science/r/Code-2114/.
cs.CV / 250 / 2604.18320
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE:通过可执行视觉变换实现可验证的多模态大语言模型自我进化
Abstract
Self-evolution of multimodal large language models (MLLMs) remains a critical challenge: pseudo-label-based methods suffer from progressive quality degradation as model predictions drift, while template-based methods are confined to a static set of transformations that cannot adapt in difficulty or diversity. We contend that robust, continuous self-improvement requires not only deterministic external feedback independent of the model's internal certainty, but also a mechanism to perpetually diversify the training distribution. To this end, we introduce EVE (Executable Visual transformation-based self-Evolution), a novel framework that entirely bypasses pseudo-labels by harnessing executable visual transformations continuously enriched in both variety and complexity. EVE adopts a Challenger-Solver dual-policy architecture. The Challenger maintains and progressively expands a queue of visual transformation code examples, from which it synthesizes novel Python scripts to perform dynamic visual transformations. Executing these scripts yields VQA problems with absolute, execution-verified ground-truth answers, eliminating any reliance on model-generated supervision. A multi-dimensional reward system integrating semantic diversity and dynamic difficulty calibration steers the Challenger to enrich its code example queue while posing progressively more challenging tasks, preventing mode collapse and fostering reciprocal co-evolution between the two policies. Extensive experiments demonstrate that EVE consistently surpasses existing self-evolution methods, establishing a robust and scalable paradigm for verifiable MLLM self-evolution. The code is available at https://github.com/0001Henry/EVE .
Chinese Translation
多模态大语言模型(MLLMs)的自我进化仍然是一个关键挑战:基于伪标签的方法由于模型预测漂移而遭受逐渐的质量下降,而基于模板的方法则局限于一组静态的变换,无法在难度或多样性上进行适应。我们认为,稳健的、持续的自我改进不仅需要独立于模型内部确定性的确定性外部反馈,还需要一个机制来不断多样化训练分布。为此,我们提出了EVE(基于可执行视觉变换的自我进化),一个全新的框架,完全绕过伪标签,通过利用在多样性和复杂性上不断丰富的可执行视觉变换。EVE采用了挑战者-求解者双策略架构。挑战者维护并逐步扩展一队视觉变换代码示例,从中合成新的Python脚本以执行动态视觉变换。执行这些脚本产生具有绝对、执行验证的真实答案的视觉问答(VQA)问题,消除了对模型生成的监督的任何依赖。一个多维奖励系统整合了语义多样性和动态难度校准,引导挑战者丰富其代码示例队列,同时提出逐渐更具挑战性的任务,防止模式崩溃,并促进两个策略之间的相互共进化。大量实验表明,EVE始终超越现有的自我进化方法,建立了一个稳健且可扩展的可验证MLLM自我进化范式。代码可在 https://github.com/0001Henry/EVE 获取。
cs.CV / 251 / 2604.18326
OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation
OmniHuman:一个用于以人为中心的视频生成的大规模数据集和基准
Abstract
Recent advancements in audio-video joint generation models have demonstrated impressive capabilities in content creation. However, generating high-fidelity human-centric videos in complex, real-world physical scenes remains a significant challenge. We identify that the root cause lies in the structural deficiencies of existing datasets across three dimensions: limited global scene and camera diversity, sparse interaction modeling (both person-person and person-object), and insufficient individual attribute alignment. To bridge these gaps, we present OmniHuman, a large-scale, multi-scene dataset designed for fine-grained human modeling. OmniHuman provides a hierarchical annotation covering video-level scenes, frame-level interactions, and individual-level attributes. To facilitate this, we develop a fully automated pipeline for high-quality data collection and multi-modal annotation. Complementary to the dataset, we establish the OmniHuman Benchmark (OHBench), a three-level evaluation system that provides a scientific diagnosis for human-centric audio-video synthesis. Crucially, OHBench introduces metrics that are highly consistent with human perception, filling the gaps in existing benchmarks by providing a comprehensive diagnosis across global scenes, relational interactions, and individual attributes.
Chinese Translation
近期音视频联合生成模型的进展展示了在内容创作方面的卓越能力。然而,在复杂的现实物理场景中生成高保真度的以人为中心的视频仍然是一个重大挑战。我们发现,根本原因在于现有数据集在三个维度上的结构缺陷:全局场景和相机多样性有限、交互建模稀疏(包括人际交互和人-物体交互)以及个体属性对齐不足。为了解决这些问题,我们提出了OmniHuman,一个旨在进行细粒度人类建模的大规模多场景数据集。OmniHuman提供了层次化的注释,涵盖视频级场景、帧级交互和个体级属性。为此,我们开发了一条全自动化的高质量数据收集和多模态注释管道。作为数据集的补充,我们建立了OmniHuman基准(OHBench),这是一个三级评估系统,为以人为中心的音视频合成提供科学诊断。至关重要的是,OHBench引入了与人类感知高度一致的度量标准,通过在全局场景、关系交互和个体属性方面提供全面的诊断,填补了现有基准的空白。
cs.CV / 252 / 2604.18348
AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation
AdaCluster:用于视频生成中稀疏注意力的自适应查询-键聚类
Abstract
Video diffusion transformers (DiTs) suffer from prohibitive inference latency due to quadratic attention complexity. Existing sparse attention methods either overlook semantic similarity or fail to adapt to heterogeneous token distributions across layers, leading to model performance degradation. We propose AdaCluster, a training-free adaptive clustering framework that accelerates the generation of DiTs while preserving accuracy. AdaCluster applies an angle-similarity-preserving clustering method to query vectors for higher compression, and designs a Euclidean-similarity-preserving clustering method for keys, covering cluster number assignment, threshold-wise adaptive clustering, and efficient critical cluster selection. Experiments on CogVideoX-2B, HunyuanVideo, and Wan-2.1 on one A40 GPU demonstrate speedups of 1.67-4.31x with negligible quality degradation.
Chinese Translation
视频扩散变换器(DiTs)由于二次注意力复杂度而面临巨大的推理延迟。现有的稀疏注意力方法要么忽视语义相似性,要么未能适应跨层异构标记分布,导致模型性能下降。我们提出了AdaCluster,这是一种无训练的自适应聚类框架,旨在加速DiTs的生成,同时保持准确性。AdaCluster对查询向量应用角度相似性保持聚类方法以实现更高的压缩,并为键设计了欧几里得相似性保持聚类方法,涵盖了聚类数量分配、阈值自适应聚类和高效的关键聚类选择。在一台A40 GPU上对CogVideoX-2B、HunyuanVideo和Wan-2.1的实验表明,速度提升可达1.67-4.31倍,且质量下降可忽略不计。
cs.CV / 253 / 2604.18358
LBFTI: Layer-Based Facial Template Inversion for Identity-Preserving Fine-Grained Face Reconstruction
LBFTI:基于层的面部模板反演用于身份保留的细粒度面部重建
Abstract
In face recognition systems, facial templates are widely adopted for identity authentication due to their compliance with the data minimization principle. However, facial template inversion technologies have posed a severe privacy leakage risk by enabling face reconstruction from templates. This paper proposes a Layer-Based Facial Template Inversion (LBFTI) method to reconstruct identity-preserving fine-grained face images. Our scheme decomposes face images into three layers: foreground layers (including eyebrows, eyes, nose, and mouth), midground layers (skin), and background layers (other parts). LBFTI leverages dedicated generators to produce these layers, adopting a rigorous three-stage training strategy: (1) independent refined generation of foreground and midground layers, (2) fusion of foreground and midground layers with template secondary injection to produce complete panoramic face images with background layers, and (3) joint fine-tuning of all modules to optimize inter-layer coordination and identity consistency. Experiments demonstrate that our LBFTI not only outperforms state-of-the-art methods in machine authentication performance, with a 25.3% improvement in TAR, but also achieves better similarity in human perception, as validated by both quantitative metrics and a questionnaire survey.
Chinese Translation
在面部识别系统中,面部模板因其符合数据最小化原则而被广泛应用于身份认证。然而,面部模板反演技术通过允许从模板中重建面部图像,带来了严重的隐私泄露风险。本文提出了一种基于层的面部模板反演(Layer-Based Facial Template Inversion, LBFTI)方法,用于重建保留身份的细粒度面部图像。我们的方案将面部图像分解为三层:前景层(包括眉毛、眼睛、鼻子和嘴巴)、中景层(皮肤)和背景层(其他部分)。LBFTI利用专用生成器生成这些层,采用严格的三阶段训练策略:(1)前景层和中景层的独立精细生成;(2)将前景层和中景层与模板二次注入融合,以生成完整的全景面部图像及背景层;(3)对所有模块进行联合微调,以优化层间协调和身份一致性。实验表明,我们的LBFTI不仅在机器认证性能上优于最先进的方法,TAR提高了25.3%,而且在人类感知的相似性方面也表现更佳,这通过定量指标和问卷调查得到了验证。
cs.CV / 254 / 2604.18367
EAST: Early Action Prediction Sampling Strategy with Token Masking
EAST:带有令牌掩码的早期行动预测采样策略
Abstract
Early action prediction seeks to anticipate an action before it fully unfolds, but limited visual evidence makes this task especially challenging. We introduce EAST, a simple and efficient framework that enables a model to reason about incomplete observations. In our empirical study, we identify key components when training early action prediction models. Our key contribution is a randomized training strategy that samples a time step separating observed and unobserved video frames, enabling a single model to generalize seamlessly across all test-time observation ratios. We further show that joint learning on both observed and future (oracle) representations significantly boosts performance, even allowing an encoder-only model to excel. To improve scalability, we propose a token masking procedure that cuts memory usage in half and accelerates training by 2x with negligible accuracy loss. Combined with a forecasting decoder, EAST sets a new state of the art on NTU60, SSv2, and UCF101, surpassing previous best work by 10.1, 7.7, and 3.9 percentage points, respectively.
Chinese Translation
早期行动预测旨在在行动完全展开之前进行预判,但有限的视觉证据使得这一任务尤其具有挑战性。我们提出了EAST,一个简单而高效的框架,使模型能够推理不完整的观察结果。在我们的实证研究中,我们识别了训练早期行动预测模型时的关键组成部分。我们的主要贡献是一种随机化训练策略,该策略在观察到和未观察到的视频帧之间采样一个时间步,从而使单一模型能够在所有测试时观察比例上无缝泛化。我们进一步表明,在观察到的和未来(oracle)表示上进行联合学习显著提升了性能,甚至使得仅编码器模型表现出色。为了提高可扩展性,我们提出了一种令牌掩码程序,该程序将内存使用量减半,并以2倍的速度加速训练,同时几乎没有准确性损失。结合预测解码器,EAST在NTU60、SSv2和UCF101上达到了新的最先进水平,分别超越了之前最佳工作的10.1、7.7和3.9个百分点。
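The randomized training strategy can be sketched as sampling, per training clip, the point that separates observed from future frames. The ratio bounds, the helper name `split_observed`, and the use of frame indices instead of real frames are assumptions for illustration.

```python
import random

def split_observed(frames, rng, min_ratio=0.1, max_ratio=0.9):
    """Sample an observation ratio and split frames into observed and future parts."""
    ratio = rng.uniform(min_ratio, max_ratio)
    # Clamp the cut so both parts contain at least one frame.
    cut = max(1, min(len(frames) - 1, round(ratio * len(frames))))
    return frames[:cut], frames[cut:], ratio

rng = random.Random(0)    # seeded only to make the sketch repeatable
frames = list(range(16))  # stand-in for 16 video frames
observed, future, ratio = split_observed(frames, rng)
```

Because a new split is drawn for every clip, a single model sees many observation ratios during training, which is the property the abstract credits for generalizing across test-time ratios.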
cs.CV / 255 / 2604.18368
DSA-CycleGAN: A Domain Shift Aware CycleGAN for Robust Multi-Stain Glomeruli Segmentation
DSA-CycleGAN:一种关注领域转移的CycleGAN用于稳健的多染色肾小球分割
Abstract
A key challenge for segmentation in digital histopathology is inter- and intra-stain variation, which reduces model performance. Labelling each stain is expensive and time-consuming, so methods that use stain transfer via CycleGAN have been developed to train multi-stain segmentation models using labels from a single stain. Nevertheless, CycleGAN tends to introduce noise during translation because of the one-to-many nature of some stain pairs, which conflicts with its cycle consistency loss. To address this, we propose the Domain Shift Aware CycleGAN (DSA-CycleGAN), which reduces the presence of such noise. Furthermore, we evaluate several advances from the field of machine learning aimed at resolving similar problems and compare their effectiveness against DSA-CycleGAN in the context of multi-stain glomeruli segmentation. Experiments demonstrate that DSA-CycleGAN not only improves glomeruli segmentation performance but also outperforms other methods in reducing noise. This is particularly evident when translating between biologically distinct stains. The code is publicly available at https://github.com/zeeshannisar/DSA-CycleGAN.
Chinese Translation
数字组织病理学中的分割面临的一个关键挑战是染色间和染色内的变异,这降低了模型的性能。为每种染色标注是昂贵且耗时的,因此开发了通过CycleGAN进行染色转移的方法,以利用单一染色的标签训练多染色分割模型。然而,由于某些染色对的一对多特性,CycleGAN在转换过程中往往会引入噪声,这与其循环一致性损失相冲突。为了解决这个问题,我们提出了领域转移感知CycleGAN(Domain Shift Aware CycleGAN),以减少此类噪声的出现。此外,我们评估了机器学习领域中的若干进展,旨在解决类似问题,并在多染色肾小球分割的背景下比较它们与DSA-CycleGAN的有效性。实验表明,DSA-CycleGAN不仅提高了肾小球分割的性能,而且在减少噪声方面优于其他方法。这在生物学上不同的染色之间的转换中尤为明显。代码已公开发布在 https://github.com/zeeshannisar/DSA-CycleGAN。
cs.CV / 256 / 2604.18376
Towards Robust Text-to-Image Person Retrieval: Multi-View Reformulation for Semantic Compensation
面向鲁棒的文本到图像人物检索:多视角重构用于语义补偿
Abstract
In text-to-image person retrieval tasks, the diversity of natural language expressions and the implicitness of visual semantics often lead to the problem of Expression Drift, where semantically equivalent texts exhibit significant feature discrepancies in the embedding space due to phrasing variations, thereby degrading the robustness of image-text alignment. This paper proposes a semantic compensation framework (MVR) driven by Large Language Models (LLMs), which enhances cross-modal representation consistency through multi-view semantic reformulation and feature compensation. The core methodology comprises three components: Multi-View Reformulation (MVR): A dual-branch prompting strategy combines key feature guidance (extracting visually critical components via feature similarity) and diversity-aware rewriting to generate semantically equivalent yet distributionally diverse textual variants; Textual Feature Robustness Enhancement: A training-free latent space compensation mechanism suppresses noise interference through multi-view feature mean-pooling and residual connections, effectively capturing "Semantic Echoes"; Visual Semantic Compensation: VLM generates multi-perspective image descriptions, which are further enhanced through shared text reformulation to address visual semantic gaps. Experiments demonstrate that our method substantially improves the accuracy of the original model without any training and achieves state-of-the-art performance on three text-to-image person retrieval datasets.
Chinese Translation
在文本到图像人物检索任务中,自然语言表达的多样性和视觉语义的隐含性常常导致表达漂移问题,即语义上等价的文本在嵌入空间中由于措辞变化而表现出显著的特征差异,从而降低图像与文本对齐的鲁棒性。本文提出了一种由大型语言模型(LLMs)驱动的语义补偿框架(MVR),通过多视角语义重构和特征补偿增强跨模态表示的一致性。核心方法包括三个组成部分:多视角重构(MVR):一种双分支提示策略结合了关键特征引导(通过特征相似性提取视觉关键组件)和多样性感知重写,以生成语义上等价但分布上多样的文本变体;文本特征鲁棒性增强:一种无训练的潜在空间补偿机制通过多视角特征均值池化和残差连接抑制噪声干扰,有效捕捉“语义回声”;视觉语义补偿:视觉语言模型(VLM)生成多视角图像描述,并通过共享文本重构进一步增强,以解决视觉语义差距。实验表明,我们的方法在不进行训练的情况下能够显著提高原始模型的准确性,并在三个文本到图像人物检索数据集上表现出最先进的性能。
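The training-free compensation step (mean-pooling over reformulated views plus a residual connection) can be illustrated with a small sketch. The mixing coefficient `alpha`, the function name `compensate`, and the toy embeddings are assumptions; the abstract does not give the exact formula.

```python
def compensate(original, variants, alpha=0.5):
    """Residual connection: original embedding plus alpha times the mean of the variant embeddings."""
    dim = len(original)
    mean = [sum(v[k] for v in variants) / len(variants) for k in range(dim)]
    return [o + alpha * m for o, m in zip(original, mean)]

original = [1.0, 0.0]                            # embedding of the original caption
variants = [[1.0, 1.0], [0.0, 1.0], [2.0, 1.0]]  # hypothetical embeddings of rewritten captions
fused = compensate(original, variants)
```

Averaging the variants damps phrasing-specific noise, while the residual term keeps the original query's embedding dominant; in practice the result would usually be re-normalized before computing similarity.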
cs.CV / 257 / 2604.18393
One-Step Diffusion with Inverse Residual Fields for Unsupervised Industrial Anomaly Detection
基于逆残差场的一步扩散模型用于无监督工业异常检测
Abstract
Diffusion models have achieved outstanding performance in unsupervised industrial anomaly detection (uIAD) by learning a manifold of normal data under the common assumption that off-manifold anomalies are harder to generate, resulting in larger reconstruction errors in data space or lower probability densities in the tractable latent space. However, their iterative denoising and noising nature leads to slow inference. In this paper, we propose OSD-IRF, a novel one-step diffusion with inverse residual fields, to address this limitation for the uIAD task. We first train a denoising diffusion probabilistic model (DDPM) on normal data without any conditioning. Then, for a test sample, we predict its inverse residual fields (IRF) based on the noise estimated by the well-trained parametric noise function of the DDPM. Finally, uIAD is performed by evaluating the probability density of the IRF under a Gaussian distribution and comparing it with a threshold. Our key observation is that anomalies become distinguishable in this IRF space, a finding that has seldom been reported in prior work. Moreover, OSD-IRF requires only a single diffusion step for uIAD, thanks to the property that the IRF holds for any neighboring time step in the denoising process. Extensive experiments on three widely used uIAD benchmarks show that our model achieves SOTA or competitive performance across six metrics, along with roughly a 2x inference speedup without distillation.
Chinese Translation
扩散模型在无监督工业异常检测(uIAD)中表现出色,主要通过学习正常数据的流形,基于一个共同假设:流形外的异常更难以生成,从而导致数据空间中的重构误差更大或在可处理的潜在空间中概率密度更低。然而,它们的迭代去噪和加噪特性导致推理速度较慢。本文提出了一种新颖的一步扩散模型——OSD-IRF(One-Step Diffusion with Inverse Residual Fields),以解决uIAD任务中的这一限制。我们首先在正常数据上训练一个去噪扩散概率模型(DDPM),不进行任何条件设置。然后,对于测试样本,我们基于DDPM的经过良好训练的参数噪声函数估计的噪声预测其逆残差场(IRF)。最后,通过在高斯分布下评估IRF的概率密度并与阈值进行比较来执行uIAD。我们的关键观察是,异常在这个IRF空间中变得可区分,这一发现此前的研究中鲜有报道。此外,由于IRF在去噪过程中对任何相邻时间步都成立,OSD-IRF仅需一步扩散即可完成uIAD。在三个广泛使用的uIAD基准上的大量实验表明,我们的模型在六个指标上实现了SOTA或具有竞争力的性能,同时推理速度提升约2倍,无需蒸馏。
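The final scoring step, evaluating a Gaussian density and comparing it with a threshold, can be sketched as below. How the inverse residual field is obtained from the DDPM is not reproduced; the residual vectors, the standard-normal assumption, and the threshold value are all made up for illustration.

```python
import math

def gaussian_log_density(residual, mean=0.0, std=1.0):
    """Log-density of a residual vector under an isotropic Gaussian."""
    n = len(residual)
    sq = sum(((r - mean) / std) ** 2 for r in residual)
    return -0.5 * sq - n * math.log(std) - 0.5 * n * math.log(2 * math.pi)

def is_anomalous(residual, threshold):
    """Flag a sample whose residual is too unlikely under the Gaussian."""
    return gaussian_log_density(residual) < threshold

normal_like = [0.1, -0.2, 0.05, 0.0]  # small residuals, as expected for normal data
defect_like = [2.5, -3.0, 2.8, 3.1]   # large residuals, as a defect might produce
threshold = -8.0                      # assumed; in practice chosen on validation data
```

Because the score needs only one noise estimate per sample, this kind of test avoids the iterative denoising loop that the abstract identifies as the cause of slow inference.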
cs.CV / 258 / 2604.18418
MedProbeBench: Systematic Benchmarking at Deep Evidence Integration for Expert-level Medical Guideline
MedProbeBench:深度证据整合的系统基准测试以实现专家级医学指南
Liu, Jiyao, Shen, Jianghan, Song, Sida, Li, Tianbin, Liu, Xiaojia, Li, Rongbin, Huang, Ziyan, Lin, Jiashi, Ning, Junzhi, Ji, Changkai, Luo, Siqi, Li, Wenjie, Ma, Chenglong, Hu, Ming, Xiong, Jing, Ye, Jin, Fu, Bin, Xu, Ningsheng, Chen, Yirong, Jin, Lei, Chen, Hong, He, Junjun
Abstract
Recent advances in deep research systems enable large language models to retrieve, synthesize, and reason over large-scale external knowledge. In medicine, developing clinical guidelines critically depends on such deep evidence integration. However, existing benchmarks fail to evaluate this capability in realistic workflows requiring multi-step evidence integration and expert-level judgment. To address this gap, we introduce MedProbeBench, the first benchmark leveraging high-quality clinical guidelines as expert-level references. Medical guidelines, with their rigorous standards in neutrality and verifiability, represent the pinnacle of medical expertise and pose substantial challenges for deep research agents. For evaluation, we propose MedProbe-Eval, a comprehensive evaluation framework featuring: (1) Holistic Rubrics with 1,200+ task-adaptive rubric criteria for comprehensive quality assessment, and (2) Fine-grained Evidence Verification for rigorous validation of evidence precision, grounded in 5,130+ atomic claims. Evaluation of 17 LLMs and deep research agents reveals critical gaps in evidence integration and guideline generation, underscoring the substantial distance between current capabilities and expert-level clinical guideline development. Project: https://github.com/uni-medical/MedProbeBench
Chinese Translation
近期深度研究系统的进展使得大型语言模型能够检索、综合和推理大规模外部知识。在医学领域,临床指南的制定在很大程度上依赖于这种深度证据整合。然而,现有的基准测试未能在需要多步骤证据整合和专家级判断的现实工作流程中评估这一能力。为了解决这一问题,我们引入了MedProbeBench,这是第一个利用高质量临床指南作为专家级参考的基准测试。医学指南以其在中立性和可验证性方面的严格标准,代表了医学专业知识的巅峰,并对深度研究代理提出了重大挑战。为了进行评估,我们提出了MedProbe-Eval,一个全面的评估框架,包含:(1) 具有1200多个任务适应性标准的整体评分标准,用于全面质量评估,以及 (2) 精细化证据验证,基于5130多个原子声明对证据的准确性进行严格验证。对17个大型语言模型和深度研究代理的评估揭示了证据整合和指南生成方面的关键差距,突显了当前能力与专家级临床指南开发之间的显著差距。项目链接: https://github.com/uni-medical/MedProbeBench
cs.CV / 259 / 2604.18429
Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models
重新审视遥感中的变化视觉问答:基于结构化和原生多模态Qwen模型
Abstract
Change visual question answering (Change VQA) addresses the problem of answering natural-language questions about semantic changes between bi-temporal remote sensing (RS) images. Although vision-language models (VLMs) have recently been studied for temporal RS image understanding, Change VQA remains underexplored in the context of modern multimodal models. In this letter, we revisit the CDVQA benchmark using recent Qwen models under a unified low-rank adaptation (LoRA) setting. We compare Qwen3-VL, which follows a structured vision-language pipeline with multi-depth visual conditioning and a full-attention decoder, with Qwen3.5, a native multimodal model that combines a single-stage alignment with a hybrid decoder backbone. Experimental results on the official CDVQA test splits show that recent VLMs improve over earlier specialized baselines. They further show that performance does not scale monotonically with model size, and that native multimodal models are more effective than structured vision-language pipelines for this task. These findings indicate that tightly integrated multimodal backbones contribute more to performance than scale or explicit multi-depth visual conditioning for language-driven semantic change reasoning in RS imagery.
Chinese Translation
变化视觉问答(Change VQA)解决了关于双时相遥感(RS)图像之间语义变化的自然语言问题的回答。尽管最近对视觉语言模型(VLMs)在时序RS图像理解中的应用进行了研究,但在现代多模态模型的背景下,变化视觉问答仍然未得到充分探索。在本文中,我们在统一的低秩适应(LoRA)设置下,重新审视了CDVQA基准,使用了最新的Qwen模型。我们比较了Qwen3-VL(该模型遵循结构化的视觉语言管道,具有多深度视觉条件和全注意力解码器)与Qwen3.5(一个结合单阶段对齐和混合解码器骨干的原生多模态模型)。在官方CDVQA测试集上的实验结果表明,最近的VLMs在早期专门基线之上有所提升。结果进一步表明,性能并不随着模型规模单调增加,且原生多模态模型在此任务中比结构化视觉语言管道更为有效。这些发现表明,紧密集成的多模态骨干对性能的贡献大于规模或显式的多深度视觉条件,这对于基于语言的遥感图像语义变化推理至关重要。
cs.CV / 260 / 2604.18452
ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting
ESsEN:在低资源环境下训练紧凑的区分性视觉-语言变换器
Abstract
Vision-language modeling is rapidly increasing in popularity with an ever-expanding list of available models. In most cases, these vision-language models have parameters in the tens of billions, which is necessary for some needs, but in many cases smaller models are necessary (e.g., on edge devices or independent robotic platforms). Unfortunately, there is little research on producing lightweight models or on training them with small datasets. Inspired by the language learning progression and data sparsity in child development, we address both of these goals in a systematic fashion in this paper. We show that two-tower encoder models are superior to one-tower encoders in low-resource settings for discriminative English tasks. We also show that incorporating traditional convolutional networks into the two-tower transformer architecture can help produce parameter-efficient vision-language models. Finally, we show that the cross-modal fusion module of two-tower encoders can vary significantly in shape and size while producing the same results. In addition, we present ESsEN, a compact vision-language model that can be trained end-to-end with relatively few resources and that performs as well as other models on several tasks with only a fraction of their parameters. The experimental results and the tools we present here make vision-language modeling more accessible to a wider variety of researchers.
Chinese Translation
视觉-语言建模的受欢迎程度迅速上升,现有模型的列表不断扩展。在大多数情况下,这些视觉-语言模型的参数数量达到数百亿,这对于某些需求是必要的,但在许多情况下,需要更小的模型(例如,在边缘设备或独立的机器人平台上)。不幸的是,关于生产轻量级模型或使用小数据集进行训练的研究较少。受到儿童发展中语言学习进程和数据稀疏性的启发,本文系统性地解决了这两个目标。我们展示了在低资源环境下,两塔编码器模型在区分性英语任务中优于单塔编码器。我们还表明,将传统卷积网络纳入两塔变换器架构可以帮助生成参数高效的视觉-语言模型。最后,我们展示了两塔编码器的跨模态融合模块在形状和大小上可以显著变化,同时产生相同的结果。此外,我们提出了ESsEN,一个可以用相对较少的资源进行端到端训练的紧凑视觉-语言模型,其在多个任务上的表现与其他模型相当,而参数量仅为这些模型的一小部分。我们在此呈现的实验结果和工具使得视觉-语言建模对更广泛的研究人员更具可及性。
cs.CV / 261 / 2604.18459
Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions
渐进式在线视频理解:证据对齐时序与透明决策
Abstract
Visual agents operating in the wild must respond to queries precisely when sufficient evidence first appears in a video stream, a critical capability that is overlooked by conventional video LLMs evaluated in offline settings. The shift to an online, streaming paradigm introduces significant challenges: a lack of decision transparency, the difficulty of aligning response timing with visual evidence, and the need to maintain a global, causally consistent understanding under tight computational budgets. To address these issues, we propose a novel framework that decouples reasoning control from memory integration. We introduce Thinking-QwenVL, an instantiation of this framework with two core components. First, the Active Thinking Decision Maker (ATDM) is a transparent reasoning controller that externalizes its decision process using observable progress ($\rho$) and confidence ($c$) metrics. This allows it to precisely time its response $t_r$ to match the first-sufficient-evidence timestamp $t^\star$ while streaming its reasoning to the user. Second, the Hierarchical Progressive Semantic Integration (HPSI) module acts as an efficient memory system. It employs a set of learnable, multi-level aggregation tokens that are propagated across clips to build a rich, global cognitive state without exceeding token budgets. Our approach sets a new standard on key online video understanding benchmarks, achieving 71.6% on StreamingBench and 46.9% on OVOBench, demonstrating a robust solution for evidence-aligned and transparent online video analysis. Extensive experiments demonstrate the effectiveness of ATDM and HPSI, e.g., Thinking-QwenVL improves the accuracy of the previous state of the art from 67.63% to 71.60% on the StreamingBench benchmark.
Chinese Translation
在实际环境中操作的视觉智能体必须在视频流中首次出现足够证据时准确响应查询,这是传统在离线环境中评估的视频大语言模型(LLMs)所忽视的关键能力。转向在线流式范式带来了显著挑战:缺乏决策透明性、响应时序与视觉证据对齐的困难,以及在紧张的计算预算下保持全局因果一致理解的需求。为了解决这些问题,我们提出了一种新颖的框架,将推理控制与记忆整合解耦。我们引入了\textbf{\textit{模型名称}},这是该框架的一个实例,包含两个核心组件。首先,\textit{主动思维决策者(Active Thinking Decision Maker, ATDM)}是一个透明的推理控制器,它通过可观察的进度($\boldsymbol{\rho}$)和置信度($\boldsymbol{c}$)指标外化其决策过程。这使得它能够精确地将响应时间$t_r$与首次足够证据的时间戳$t^\star$对齐,同时将其推理过程流式传输给用户。其次,\textit{层次渐进语义整合(Hierarchical Progressive Semantic Integration, HPSI)}模块充当高效的记忆系统。它采用一组可学习的多层聚合标记,这些标记在剪辑之间传播,以构建丰富的全局认知状态,而不超过标记预算。我们的研究在关键的在线视频理解基准上设定了新标准,在StreamingBench上取得了\textbf{71.6\%}的强劲表现,在OVOBench上达到了\textbf{46.9\%},展示了对齐证据和透明在线视频分析的稳健解决方案。大量实验表明ATDM和HPSI的有效性,例如,Thinking-QwenVL将StreamingBench基准上先前的最先进技术的准确率从67.63\%提高到71.60\%。
cs.CV / 262 / 2604.18468
Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation
资产收集器:从自动驾驶日志中提取用于仿真的3D资产
Abstract
Closed-loop simulation is a core component of autonomous vehicle (AV) development, enabling scalable testing, training, and safety validation before real-world deployment. Neural scene reconstruction converts driving logs into interactive 3D environments for simulation, but it does not produce complete 3D object assets required for agent manipulation and large-viewpoint novel-view synthesis. To address this challenge, we present Asset Harvester, an image-to-3D model and end-to-end pipeline that converts sparse, in-the-wild object observations from real driving logs into complete, simulation-ready assets. Rather than relying on a single model component, we developed a system-level design for real-world AV data that combines large-scale curation of object-centric training tuples, geometry-aware preprocessing across heterogeneous sensors, and a robust training recipe that couples sparse-view-conditioned multiview generation with 3D Gaussian lifting. Within this system, SparseViewDiT is explicitly designed to address limited-angle views and other real-world data challenges. Together with hybrid data curation, augmentation, and self-distillation, this system enables scalable conversion of sparse AV object observations into reusable 3D assets.
Chinese Translation
闭环仿真是自动驾驶车辆(AV)开发的核心组成部分,它在实际部署之前实现了可扩展的测试、训练和安全验证。神经场景重建将驾驶日志转换为用于仿真的交互式3D环境,但未能生成代理操作和大视角新视图合成所需的完整3D对象资产。为了解决这一挑战,我们提出了资产收集器(Asset Harvester),这是一种图像到3D模型的端到端管道,将来自真实驾驶日志的稀疏野外对象观测转换为完整的、适合仿真的资产。我们开发了一种系统级设计,结合了对象中心训练元组的大规模策划、跨异构传感器的几何感知预处理,以及将稀疏视图条件的多视图生成与3D高斯提升相结合的强健训练方案,而不是依赖单一模型组件。在该系统中,SparseViewDiT被明确设计用于解决有限角度视图和其他现实世界数据挑战。结合混合数据策划、增强和自蒸馏,该系统实现了将稀疏AV对象观测可扩展地转换为可重用3D资产的能力。
cs.CV / 263 / 2604.18476
SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection
SemLT3D:基于语义引导的专家蒸馏用于仅依赖相机的长尾3D目标检测
Abstract
Camera-only 3D object detection has emerged as a cost-effective and scalable alternative to LiDAR for autonomous driving, yet existing methods primarily prioritize overall performance while overlooking the severe long-tail imbalance inherent in real-world datasets. In practice, many rare but safety-critical categories such as children, strollers, or emergency vehicles are heavily underrepresented, leading to biased learning and degraded performance. This challenge is further exacerbated by pronounced inter-class ambiguity (e.g., visually similar subclasses) and substantial intra-class diversity (e.g., objects varying widely in appearance, scale, pose, or context), which together hinder reliable long-tail recognition. In this work, we introduce SemLT3D, a Semantic-Guided Expert Distillation framework designed to enrich the representation space for underrepresented classes through semantic priors. SemLT3D consists of: (1) a language-guided mixture-of-experts module that routes 3D queries to specialized experts according to their semantic affinity, enabling the model to better disentangle confusing classes and specialize on tail distributions; and (2) a semantic projection distillation pipeline that aligns 3D queries with CLIP-informed 2D semantics, producing more coherent and discriminative features across diverse visual manifestations. Although motivated by long-tail imbalance, the semantically structured learning in SemLT3D also improves robustness under broader appearance variations and challenging corner cases, offering a principled step toward more reliable camera-only 3D perception.
Chinese Translation
仅依赖相机的3D目标检测已成为自动驾驶中一种具有成本效益和可扩展性的替代方案,取代了激光雷达(LiDAR),然而现有方法主要优先考虑整体性能,而忽视了现实世界数据集中固有的严重长尾不平衡问题。在实践中,许多稀有但对安全至关重要的类别,如儿童、婴儿推车或紧急车辆,严重不足,导致偏见学习和性能下降。这个挑战因明显的类间模糊性(例如,视觉上相似的子类)和显著的类内多样性(例如,外观、尺度、姿态或上下文差异较大的物体)而进一步加剧,这共同阻碍了可靠的长尾识别。在本研究中,我们提出了SemLT3D,一个基于语义引导的专家蒸馏框架,旨在通过语义先验丰富不足类别的表征空间。SemLT3D包括:(1)一个语言引导的专家混合模块,根据3D查询的语义亲和性将其路由到专门的专家,使模型能够更好地区分混淆类别并专注于长尾分布;(2)一个语义投影蒸馏管道,将3D查询与基于CLIP的2D语义对齐,生成在多样化视觉表现中更连贯和更具辨别力的特征。尽管受到长尾不平衡的驱动,SemLT3D中的语义结构学习也提高了在更广泛外观变化和具有挑战性的边缘案例下的鲁棒性,为更可靠的仅依赖相机的3D感知提供了一个原则性步骤。
cs.CV / 264 / 2604.18484
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
XEmbodied:一种具有增强几何和物理线索的大规模具身环境基础模型
Abstract
Vision-Language-Action (VLA) models drive next-generation autonomous systems, but training them requires scalable, high-quality annotations from complex environments. Current cloud pipelines rely on generic vision-language models (VLMs) that lack geometric reasoning and domain semantics due to their 2D image-text pretraining. To address this mismatch, we propose XEmbodied, a cloud-side foundation model that endows VLMs with intrinsic 3D geometric awareness and interaction with physical cues (e.g., occupancy grids, 3D boxes). Instead of treating geometry as auxiliary input, XEmbodied integrates geometric representations via a structured 3D Adapter and distills physical signals into context tokens using an Efficient Image-Embodied Adapter. Through progressive domain curriculum and reinforcement learning post-training, XEmbodied preserves general capabilities while demonstrating robust performance across 18 public benchmarks. It significantly improves spatial reasoning, traffic semantics, embodied affordance, and out-of-distribution generalization for large-scale scenario mining and embodied VQA.
Chinese Translation
视觉-语言-动作(VLA)模型推动了下一代自主系统的发展,但训练这些模型需要来自复杂环境的可扩展高质量注释。目前的云端管道依赖于通用的视觉-语言模型(VLM),由于其二维图像-文本的预训练,缺乏几何推理和领域语义。为了解决这一不匹配,我们提出了XEmbodied,这是一种云端基础模型,赋予VLM内在的三维几何意识和与物理线索的交互(例如,占用网格、三维盒子)。XEmbodied并不将几何视为辅助输入,而是通过结构化的三维适配器整合几何表示,并利用高效图像-具身适配器将物理信号提炼为上下文标记。通过渐进的领域课程和强化学习后训练,XEmbodied在保持一般能力的同时,在18个公共基准测试中表现出强大的性能。它显著提高了空间推理、交通语义、具身可供性以及在大规模场景挖掘和具身视觉问答中的分布外泛化能力。
cs.CV / 265 / 2604.18486
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL:具有视觉-语言解释的一步潜在推理与规划
Lu, Jinghui, Guan, Jiayi, Huang, Zhijian, Li, Jinlong, Li, Guang, Kong, Lingdong, Li, Yingyan, Wang, Han, Xu, Shaoqing, Luo, Yuechen, Li, Fang, Dang, Chenxu, Wang, Junli, Xu, Tao, Wu, Jing, Wu, Jianhua, Hao, Xiaoshuai, Zhang, Wen, Jiang, Tianyi, Zhang, Lingfeng, Zhou, Lei, Tang, Yingbo, Wang, Jie, Gao, Yinfeng, Bu, Xizhou, Tian, Haochen, Qiu, Yihang, Jia, Feiyang, Liu, Lin, Ge, Yigu, Li, Hanbing, Shen, Yuannan, Cui, Jianwei, Xie, Hongwei, Wang, Bing, Sun, Haiyang, Zhao, Jingwei, Huang, Jiahui, Liu, Pei, Zhu, Zeyu, Jiang, Yuncheng, Guo, Zibin, Gong, Chuhong, Leng, Hanchao, Ma, Kun, Wang, Naiyang, Chen, Guang, Yang, Kuiyuan, Ye, Hangjun, Chen, Long
Abstract
Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency, and providing direct evidence that tighter compression, when guided in both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL
Chinese Translation
链式思维(Chain-of-Thought, CoT)推理已成为基于视觉-语言(VLA)自主驾驶轨迹预测的强大驱动力,然而其自回归特性带来的延迟成本对于实时部署来说是不可接受的。潜在的 CoT 方法试图通过将推理压缩为连续的隐藏状态来弥补这一差距,但始终未能达到其显式对应物的效果。我们认为,这主要是因为纯语言的潜在表示压缩了世界的符号抽象,而非实际支配驾驶的因果动态。因此,我们提出了 OneVL(具有视觉-语言解释的一步潜在推理与规划),这是一个统一的 VLA 和世界模型框架,通过由双辅助解码器监督的紧凑潜在标记进行推理。除了重构文本 CoT 的语言解码器外,我们还引入了一个视觉世界模型解码器,该解码器预测未来帧标记,迫使潜在空间内化道路几何、代理运动和环境变化的因果动态。一个三阶段的训练管道逐步将这些潜在表示与轨迹、语言和视觉目标对齐,确保稳定的联合优化。在推理时,辅助解码器被丢弃,所有潜在标记在一次并行传递中预填充,从而匹配仅回答预测的速度。在四个基准测试中,OneVL 成为第一个超越显式 CoT 的潜在 CoT 方法,在仅回答延迟下提供了最先进的准确性,并直接证明了在语言和世界模型监督下更紧凑的压缩能够产生比逐词推理更具可泛化性的表示。项目页面:https://xiaomi-embodied-intelligence.github.io/OneVL
cs.CV / 266 / 2604.18512
S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models
S2H-DPO:面向视觉-语言模型的难度感知偏好优化
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable progress in single-image understanding, yet effective reasoning across multiple images remains challenging. We identify a critical capability gap in existing multi-image alignment approaches: current methods focus primarily on localized reasoning with pre-specified image indices (``Look at Image 3 and...''), bypassing the essential skills of global visual search and autonomous cross-image comparison. To address this limitation, we introduce a Simple-to-Hard (S2H) learning framework that systematically constructs multi-image preference data across three hierarchical reasoning levels requiring an increasing level of capabilities: (1) single-image localized reasoning, (2) multi-image localized comparison, and (3) global visual search. Unlike prior work that relies on model-specific attributes, such as hallucinations or attention heuristics, to generate preference pairs, our approach leverages prompt-driven complexity to create chosen/rejected pairs that are applicable across different models. Through extensive evaluations on LLaVA and Qwen-VL models, we show that our diverse multi-image reasoning data significantly enhances multi-image reasoning performance, yielding significant improvements over baseline methods across benchmarks. Importantly, our approach maintains strong single-image reasoning performance while simultaneously strengthening multi-image understanding capabilities, thus advancing the state of the art for holistic visual preference alignment.
Chinese Translation
视觉-语言模型(VLMs)在单幅图像理解方面取得了显著进展,但在多幅图像之间进行有效推理仍然具有挑战性。我们识别出现有多图像对齐方法中的一个关键能力差距:当前方法主要集中于使用预先指定的图像索引进行局部推理(“查看图像3并...”),而忽略了全局视觉搜索和自主跨图像比较的基本技能。为了解决这一局限性,我们引入了一种简单到困难(S2H)的学习框架,该框架系统地构建了跨越三个层次推理水平的多图像偏好数据,这些层次要求逐渐提高的能力:(1)单幅图像局部推理,(2)多幅图像局部比较,以及(3)全局视觉搜索。与依赖于模型特定属性(例如幻觉或注意力启发式)生成偏好对的先前工作不同,我们的方法利用基于提示的复杂性来创建适用于不同模型的选择/拒绝对。通过对LLaVA和Qwen-VL模型的广泛评估,我们展示了我们多样化的多图像推理数据显著提升了多图像推理性能,在各基准测试中相较于基线方法取得了显著改进。重要的是,我们的方法在增强多图像理解能力的同时,保持了强大的单幅图像推理性能,从而推动了整体视觉偏好对齐的最新进展。
cs.CV / 267 / 2604.18518
UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
UDM-GRPO:均匀离散扩散模型的稳定高效群体相对策略优化
Abstract
Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from $69\%$ to $96\%$ and PickScore increases from $20.46$ to $23.81$, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from $8\%$ to $57\%$, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.
Chinese Translation
均匀离散扩散模型(Uniform Discrete Diffusion Model,UDM)最近作为离散生成建模的一种有前景的范式而出现;然而,它与强化学习的结合仍然在很大程度上未被探索。我们观察到,简单地将群体相对策略优化(GRPO)应用于UDM会导致训练不稳定和性能提升有限。为了解决这个问题,我们提出了UDM-GRPO,这是第一个将UDM与强化学习相结合的框架。我们的方法基于两个关键见解:(i)将最终的干净样本视为动作可以提供更准确和稳定的优化信号;(ii)通过扩散前向过程重建轨迹可以更好地将概率路径与预训练分布对齐。此外,我们引入了两种策略,Reduced-Step和CFG-Free,以进一步提高训练效率。UDM-GRPO在多个文本到图像(T2I)任务中显著提高了基础模型的性能。值得注意的是,GenEval准确率从$69\%$提高到$96\%$,而PickScore从$20.46$增加到$23.81$,在连续和离散设置中均达到了最先进的性能。在光学字符识别(OCR)基准测试中,准确率从$8\%$上升到$57\%$,进一步验证了我们方法的泛化能力。代码可在 https://github.com/Yovecent/UDM-GRPO 获取。
cs.CV / 268 / 2604.18537
MetaCloak-JPEG: JPEG-Robust Adversarial Perturbation for Preventing Unauthorized DreamBooth-Based Deepfake Generation
MetaCloak-JPEG:一种针对JPEG的鲁棒对抗扰动,以防止未经授权的基于DreamBooth的深度伪造生成
Abstract
The rapid progress of subject-driven text-to-image synthesis, and in particular DreamBooth, has enabled a consent-free deepfake pipeline: an adversary needs only 4-8 publicly available face images to fine-tune a personalized diffusion model and produce photorealistic harmful content. Current adversarial face-protection systems -- PhotoGuard, Anti-DreamBooth, and MetaCloak -- perturb user images to disrupt surrogate fine-tuning, but all share a structural blindness: none backpropagates gradients through the JPEG compression pipeline that every major social-media platform applies before adversary access. Because JPEG quantization relies on round(), whose derivative is zero almost everywhere, adversarial energy concentrates in high-frequency DCT bands that JPEG discards, eliminating 60-80% of the protective signal. We introduce MetaCloak-JPEG, which closes this gap by inserting a Differentiable JPEG (DiffJPEG) layer built on the Straight-Through Estimator (STE): the forward pass applies standard JPEG compression, while the backward pass replaces round() with the identity. DiffJPEG is embedded in a JPEG-aware EOT distribution (~70% of augmentations include DiffJPEG) and a curriculum quality-factor schedule (QF: 95 to 50) inside a bilevel meta-learning loop. Under an l-inf perturbation budget of eps=8/255, MetaCloak-JPEG attains 32.7 dB PSNR, a 91.3% JPEG survival rate, and outperforms PhotoGuard on all 9 evaluated JPEG quality factors (9/9 wins, mean denoising-loss gain +0.125) within a 4.1 GB training-memory budget.
Chinese Translation
以主题驱动的文本到图像合成的快速进展,尤其是DreamBooth,使得无需同意的深度伪造流程成为可能:对手只需4-8张公开可用的面部图像即可微调个性化的扩散模型并生成逼真的有害内容。目前的对抗性面部保护系统——PhotoGuard、Anti-DreamBooth和MetaCloak——通过扰动用户图像来干扰替代微调,但都存在结构性盲点:没有一个系统在对手访问之前通过每个主要社交媒体平台应用的JPEG压缩管道反向传播梯度。由于JPEG量化依赖于round(),其导数几乎在所有地方为零,因此对抗性能量集中在JPEG丢弃的高频DCT带中,消除了60-80%的保护信号。我们提出了MetaCloak-JPEG,通过插入基于直通估计器(STE)的可微分JPEG(DiffJPEG)层来填补这一空白:前向传播应用标准JPEG压缩,而反向传播则用恒等映射替代round()。DiffJPEG嵌入在一个JPEG感知的EOT分布中(约70%的增强包括DiffJPEG)和一个课程质量因子调度(QF:从95到50)内的双层元学习循环中。在l-inf扰动预算为eps=8/255的情况下,MetaCloak-JPEG达到了32.7 dB的PSNR,91.3%的JPEG生存率,并在评估的9个JPEG质量因子中超越了PhotoGuard(9/9胜利,平均去噪损失增益+0.125),训练内存预算为4.1 GB。
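The straight-through idea described in the MetaCloak-JPEG abstract above can be illustrated with a minimal, pure-Python sketch. It is not the authors' DiffJPEG layer: it handles a single scalar DCT coefficient, and the names `jpeg_quantize`, `true_gradient`, `ste_gradient`, and `update_through_quantizer`, along with the step size and learning rate, are hypothetical choices made for illustration. The sketch shows why the true gradient of rounding gives no learning signal, while treating `round()` as the identity in the backward pass lets an update pass through the quantizer.

```python
def jpeg_quantize(coeff: float, step: float) -> float:
    """Forward pass: snap one DCT coefficient to the nearest multiple of `step`."""
    return round(coeff / step) * step


def true_gradient(coeff: float, step: float, eps: float = 1e-4) -> float:
    """Central finite difference of jpeg_quantize; zero away from rounding boundaries."""
    return (jpeg_quantize(coeff + eps, step) - jpeg_quantize(coeff - eps, step)) / (2 * eps)


def ste_gradient(upstream: float) -> float:
    """Straight-through backward pass: treat round() as the identity map."""
    return upstream


def update_through_quantizer(coeff: float, target: float, step: float, lr: float = 0.5) -> float:
    """One gradient step on (jpeg_quantize(coeff) - target)**2 using the STE gradient."""
    residual = jpeg_quantize(coeff, step) - target
    grad = ste_gradient(2.0 * residual)  # the chain rule stops at the quantizer without STE
    return coeff - lr * grad


if __name__ == "__main__":
    print(true_gradient(3.2, 10.0))                   # the real derivative carries no signal
    print(update_through_quantizer(3.2, 10.0, 10.0))  # the STE update still moves the coefficient
```

With the true gradient the coefficient would never move; with the straight-through gradient a single step carries it across the rounding boundary, which is the property the abstract relies on to keep the protective signal in bands that survive compression.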
cs.CV / 269 / 2604.18549
Advancing Vision Transformer with Enhanced Spatial Priors
通过增强空间先验推进视觉变换器
Abstract
In recent years, the Vision Transformer (ViT) has garnered significant attention within the computer vision community. However, the core component of ViT, Self-Attention, lacks explicit spatial priors and suffers from quadratic computational complexity, limiting its applicability. To address these issues, we have proposed RMT, a robust vision backbone with explicit spatial priors for general purposes. RMT utilizes Manhattan distance decay to introduce spatial information and employs a horizontal and vertical decomposition attention method to model global information. Building on the strengths of RMT, Euclidean enhanced Vision Transformer (EVT) is an expanded version that incorporates several key improvements. Firstly, EVT uses a more reasonable Euclidean distance decay to enhance the modeling of spatial information, allowing for a more accurate representation of spatial relationships compared to the Manhattan distance used in RMT. Secondly, EVT abandons the decomposed attention mechanism featured in RMT and instead adopts a simpler spatially-independent grouping approach, providing the model with greater flexibility in controlling the number of tokens within each group. By addressing these modifications, EVT offers a more sophisticated and adaptable approach to incorporating spatial priors into the Self-Attention mechanism, thus overcoming some of the limitations associated with RMT and further enhancing its applicability in various computer vision tasks. Extensive experiments on Image Classification, Object Detection, Instance Segmentation, and Semantic Segmentation demonstrate that EVT exhibits exceptional performance. Without additional training data, EVT achieves 86.6% top1-acc on ImageNet-1k.
Chinese Translation
近年来,视觉变换器(Vision Transformer, ViT)在计算机视觉领域引起了广泛关注。然而,ViT的核心组件自注意力(Self-Attention)缺乏明确的空间先验,并且计算复杂度为二次方,限制了其应用。为了解决这些问题,我们提出了RMT,一种具有明确空间先验的强大视觉骨干网络,适用于一般目的。RMT利用曼哈顿距离衰减引入空间信息,并采用水平和垂直分解注意力方法来建模全局信息。在RMT的基础上,欧几里得增强视觉变换器(Euclidean enhanced Vision Transformer, EVT)是一个扩展版本,包含了几个关键改进。首先,EVT使用更合理的欧几里得距离衰减来增强空间信息的建模,相比于RMT中使用的曼哈顿距离,能够更准确地表示空间关系。其次,EVT放弃了RMT中采用的分解注意力机制,而是采用了更简单的空间独立分组方法,使模型在控制每组内的标记数量时具有更大的灵活性。通过解决这些修改,EVT提供了一种更复杂且适应性更强的方法,将空间先验融入自注意力机制,从而克服了RMT的一些局限性,并进一步增强了其在各种计算机视觉任务中的适用性。在图像分类、目标检测、实例分割和语义分割等任务上的大量实验表明,EVT表现出卓越的性能。在没有额外训练数据的情况下,EVT在ImageNet-1k上达到了86.6%的Top-1准确率。
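The difference between the Manhattan decay of RMT and the Euclidean decay of EVT described above can be sketched as follows. The abstract does not give the exact formula, so the form `gamma ** distance` and the value `gamma = 0.9` are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def manhattan_decay(p, q, gamma=0.9):
    """Decay weight between two grid positions using Manhattan (L1) distance."""
    return gamma ** (abs(p[0] - q[0]) + abs(p[1] - q[1]))


def euclidean_decay(p, q, gamma=0.9):
    """Decay weight between two grid positions using Euclidean (L2) distance."""
    return gamma ** math.hypot(p[0] - q[0], p[1] - q[1])


if __name__ == "__main__":
    origin = (0, 0)
    for other in [(0, 3), (1, 1), (3, 3)]:
        print(other, manhattan_decay(origin, other), euclidean_decay(origin, other))
```

Under these assumptions the two weights agree along rows and columns, but the Manhattan version penalizes diagonal neighbors more strongly than their geometric distance suggests, which is one way to read the abstract's claim that Euclidean decay represents spatial relationships more accurately.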
cs.CV / 270 / 2604.18557
SynAgent: Generalizable Cooperative Humanoid Manipulation via Solo-to-Cooperative Agent Synergy
SynAgent:通过单人到合作代理的协同实现可推广的人形操控
Abstract
Controllable cooperative humanoid manipulation is a fundamental yet challenging problem for embodied intelligence, due to severe data scarcity, complexities in multi-agent coordination, and limited generalization across objects. In this paper, we present SynAgent, a unified framework that enables scalable and physically plausible cooperative manipulation by leveraging Solo-to-Cooperative Agent Synergy to transfer skills from single-agent human-object interaction to multi-agent human-object-human scenarios. To maintain semantic integrity during motion transfer, we introduce an interaction-preserving retargeting method based on an Interact Mesh constructed via Delaunay tetrahedralization, which faithfully maintains spatial relationships among humans and objects. Building upon this refined data, we propose a single-agent pretraining and adaptation paradigm that bootstraps synergistic collaborative behaviors from abundant single-human data through decentralized training and multi-agent PPO. Finally, we develop a trajectory-conditioned generative policy using a conditional VAE, trained via multi-teacher distillation from motion imitation priors to achieve stable and controllable object-level trajectory execution. Extensive experiments demonstrate that SynAgent significantly outperforms existing baselines in both cooperative imitation and trajectory-conditioned control, while generalizing across diverse object geometries. Codes and data will be available after publication. Project Page: http://yw0208.github.io/synagent
Chinese Translation
可控的合作人形操控是具身智能的一个基本但具有挑战性的问题,这主要由于数据稀缺、多代理协调的复杂性以及在物体间的有限泛化能力。本文提出了SynAgent,一个统一框架,通过利用单人到合作代理的协同,能够实现可扩展且物理上合理的合作操控,将单人类-物体交互的技能转移到多代理人类-物体-人类场景中。为了在运动转移过程中保持语义完整性,我们引入了一种基于通过德劳内四面体化构建的交互网格的交互保持重定向方法,该方法忠实地维护了人类与物体之间的空间关系。在此精炼数据的基础上,我们提出了一种单代理预训练和适应范式,通过去中心化训练和多代理PPO,从丰富的单人类数据中引导协同合作行为。最后,我们开发了一种基于条件变分自编码器(conditional VAE)的轨迹条件生成策略,通过从运动模仿先验中进行多教师蒸馏进行训练,以实现稳定且可控的物体级轨迹执行。大量实验表明,SynAgent在合作模仿和轨迹条件控制方面显著优于现有基线,同时在多样的物体几何形状中具有良好的泛化能力。代码和数据将在发表后提供。项目页面:http://yw0208.github.io/synagent
cs.CV / 271 / 2604.18562
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
AnchorSeg:基于语言的查询库用于推理分割
Abstract
Reasoning segmentation requires models to ground complex, implicit textual queries into precise pixel-level masks. Existing approaches rely on a single segmentation token, whose hidden state implicitly encodes both semantic reasoning and spatial localization, limiting the model's ability to explicitly disentangle what to segment from where to segment. We introduce AnchorSeg, which reformulates reasoning segmentation as a structured conditional generation process over image tokens, conditioned on language grounded query banks. Instead of compressing all semantic reasoning and spatial localization into a single embedding, AnchorSeg constructs an ordered sequence of query banks: latent reasoning tokens that capture intermediate semantic states, and a segmentation anchor token that provides explicit spatial grounding. We model spatial conditioning as a factorized distribution over image tokens, where the anchor query determines localization signals while contextual queries provide semantic modulation. To bridge token-level predictions and pixel-level supervision, we propose Token-Mask Cycle Consistency (TMCC), a bidirectional training objective that enforces alignment across resolutions. By explicitly decoupling spatial grounding from semantic reasoning through structured language grounded query banks, AnchorSeg achieves state-of-the-art results on the ReasonSeg test set (67.7% gIoU and 68.1% cIoU). All code and models are publicly available at https://github.com/rui-qian/AnchorSeg.
Chinese Translation
推理分割要求模型将复杂的、隐含的文本查询转化为精确的像素级掩码。现有方法依赖于单一的分割标记,其隐藏状态隐式编码了语义推理和空间定位,这限制了模型明确区分要分割的内容与分割位置的能力。我们提出了 AnchorSeg,将推理分割重新表述为基于图像标记的结构化条件生成过程,条件是基于语言的查询库。AnchorSeg 并不将所有的语义推理和空间定位压缩到一个嵌入中,而是构建了一个有序的查询库序列:捕捉中间语义状态的潜在推理标记,以及提供明确空间定位的分割锚标记。我们将空间条件建模为图像标记上的分解分布,其中锚查询确定定位信号,而上下文查询提供语义调制。为了弥合标记级预测与像素级监督之间的差距,我们提出了标记-掩码循环一致性(Token-Mask Cycle Consistency, TMCC),这是一种双向训练目标,强制在不同分辨率之间对齐。通过结构化的基于语言的查询库明确解耦空间定位与语义推理,AnchorSeg 在 ReasonSeg 测试集上达到了最先进的结果(67.7% gIoU 和 68.1% cIoU)。所有代码和模型均可在 https://github.com/rui-qian/AnchorSeg 上公开获取。
cs.CV / 272 / 2604.18564
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
MultiWorld:可扩展的多智能体多视角视频世界模型
Abstract
Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present MultiWorld, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/
Chinese Translation
视频世界模型在模拟环境动态以响应用户或智能体的行为方面取得了显著成功。它们被建模为基于动作的视频生成模型,输入历史帧和当前动作以预测未来帧。然而,大多数现有方法仅限于单智能体场景,无法捕捉现实世界多智能体系统中固有的复杂交互。我们提出了MultiWorld,一个统一的多智能体多视角世界建模框架,能够在保持多视角一致性的同时,准确控制多个智能体。我们引入了多智能体条件模块(Multi-Agent Condition Module)以实现精确的多智能体可控性,并引入全局状态编码器(Global State Encoder)以确保不同视角下的一致观察。MultiWorld支持智能体和视角数量的灵活扩展,并并行合成不同视角以提高效率。在多玩家游戏环境和多机器人操作任务上的实验表明,MultiWorld在视频保真度、动作跟随能力和多视角一致性方面优于基线模型。项目页面:https://multi-world.github.io/
cs.CV / 273 / 2604.18572
Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale
重返柏拉图的洞穴:大规模跨模态表征收敛的研究
Abstract
The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. Alignment is measured using mutual nearest neighbors on small datasets ($\approx$1K samples) and degrades substantially as the dataset is scaled to millions of samples. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine-grained structure. Moreover, the evaluations in Huh et al. are done in a one-to-one image-caption setting, a constraint that breaks down in realistic many-to-many settings and further reduces alignment. We also find that the reported trend of stronger language models increasingly aligning with vision does not appear to hold for newer models. Overall, our findings suggest that the current evidence for cross-modal representational convergence is considerably weaker than subsequent works have taken it to be. Models trained on different modalities may learn equally rich representations of the world, just not the same one.
Chinese Translation
柏拉图表征假说认为,在不同模态(例如文本和图像)上训练的神经网络会对齐并最终收敛到相同的现实表征。如果这一假说成立,将对模态选择的重要性产生重大影响。我们展示了这一假说的实验证据是脆弱的,并且在很大程度上依赖于评估机制。对齐是通过在小型数据集(约1K样本)上使用互为最近邻来测量的,而当数据集扩展到数百万样本时,对齐显著下降。模型表征之间剩余的对齐反映的是粗略的语义重叠,而不是一致的细粒度结构。此外,Huh等人的评估是在一对一的图像-标题设置中进行的,这一限制在现实的多对多设置中失效,并进一步减少了对齐。我们还发现,报告的更强语言模型与视觉模型之间的对齐趋势在较新模型中似乎并不成立。总体而言,我们的研究结果表明,目前对跨模态表征收敛的证据远不如后续研究所认为的那样强。不同模态上训练的模型可能学习到同样丰富的世界表征,只是并不是相同的表征。
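The mutual nearest-neighbor alignment measure mentioned in the abstract above can be sketched in a few lines. This is a toy, pure-Python version written under stated assumptions: Euclidean distance, ties broken by sample index, and the hypothetical names `knn_indices` and `mutual_knn_alignment`. The evaluated papers may use a different distance or normalization.

```python
import math


def knn_indices(embs, i, k):
    """Indices of the k nearest neighbors of sample i (excluding i itself)."""
    dists = sorted((math.dist(embs[i], e), j) for j, e in enumerate(embs) if j != i)
    return {j for _, j in dists[:k]}


def mutual_knn_alignment(embs_a, embs_b, k):
    """Average fraction of k-nearest neighbors shared by the two embedding spaces.

    embs_a[i] and embs_b[i] must describe the same underlying sample
    (for example, an image and its caption).
    """
    n = len(embs_a)
    shared = sum(len(knn_indices(embs_a, i, k) & knn_indices(embs_b, i, k)) for i in range(n))
    return shared / (n * k)


if __name__ == "__main__":
    vision = [(0.0,), (1.0,), (2.0,), (10.0,)]
    text_same = [(0.0,), (3.0,), (6.0,), (30.0,)]      # same neighborhood structure, rescaled
    text_shuffled = [(0.0,), (10.0,), (1.0,), (2.0,)]  # neighborhoods largely disagree
    print(mutual_knn_alignment(vision, text_same, k=1))
    print(mutual_knn_alignment(vision, text_shuffled, k=1))
```

Because the score depends only on which samples count as neighbors, adding many more samples to the pool changes the candidate neighbors for every point, which is one way to see how the measured alignment can shift as the dataset grows.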
cs.CV / 274 / 2604.18573
T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability
T-REN:学习文本对齐的区域标记改善了密集视觉-语言对齐和可扩展性
Abstract
Despite recent progress, vision-language encoders struggle with two core limitations: (1) weak alignment between language and dense vision features, which hurts tasks like open-vocabulary semantic segmentation; and (2) high token counts for fine-grained visual representations, which limits scalability to long videos. This work addresses both limitations. We propose T-REN (Text-aligned Region Encoder Network), an efficient encoder that maps visual data to a compact set of text-aligned region-level representations (or region tokens). T-REN achieves this through a lightweight network added on top of a frozen vision backbone, trained to pool patch-level representations within each semantic region into region tokens and align them with region-level text annotations. With only 3.7% additional parameters compared to the vision-language backbone, this design yields substantially stronger dense cross-modal understanding while reducing the token count by orders of magnitude. Specifically, T-REN delivers +5.9 mIoU on ADE20K open-vocabulary segmentation, +18.4% recall on COCO object-level text-image retrieval, +15.6% recall on Ego4D video object localization, and +17.6% mIoU on VSPW video scene parsing, all while reducing token counts by more than 24x for images and 187x for videos compared to the patch-based vision-language backbone. The code and model are available at https://github.com/savya08/T-REN.
Chinese Translation
尽管最近取得了进展,视觉-语言编码器仍面临两个核心限制:(1)语言与密集视觉特征之间的对齐较弱,影响了开放词汇语义分割等任务;(2)细粒度视觉表示的标记数量较高,限制了对长视频的可扩展性。本研究针对这两个限制提出了解决方案。我们提出了 T-REN(文本对齐区域编码网络),这是一种高效的编码器,将视觉数据映射到一组紧凑的文本对齐区域级表示(或区域标记)。T-REN 通过在冻结的视觉骨干网络上添加轻量级网络实现这一目标,该网络经过训练以将每个语义区域内的补丁级表示汇聚为区域标记,并与区域级文本注释对齐。与视觉-语言骨干相比,该设计仅增加了 3.7% 的参数,显著增强了密集跨模态理解,同时将标记数量减少了几个数量级。具体而言,T-REN 在 ADE20K 开放词汇分割上提升了 +5.9 mIoU,在 COCO 物体级文本-图像检索上提升了 +18.4% 的召回率,在 Ego4D 视频物体定位上提升了 +15.6% 的召回率,在 VSPW 视频场景解析上提升了 +17.6% 的 mIoU,同时与基于补丁的视觉-语言骨干相比,图像的标记数量减少了超过 24 倍,视频的标记数量减少了 187 倍。代码和模型可在 https://github.com/savya08/T-REN 获取。
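To make the token-count argument concrete, here is a minimal sketch of pooling patch-level features into one token per semantic region. It uses plain mean pooling as a stand-in; T-REN's pooling network is trained, so this is only an assumption-laden illustration, and `pool_regions` and the toy inputs are hypothetical.

```python
def pool_regions(patch_feats, region_ids):
    # Average all patch feature vectors that share a region id,
    # producing one "region token" per region.
    sums, counts = {}, {}
    for feat, rid in zip(patch_feats, region_ids):
        acc = sums.setdefault(rid, [0.0] * len(feat))
        for d, value in enumerate(feat):
            acc[d] += value
        counts[rid] = counts.get(rid, 0) + 1
    return {rid: [v / counts[rid] for v in acc] for rid, acc in sums.items()}

# Three patches, two regions: the token count drops from 3 to 2.
patches = [[1.0, 1.0], [3.0, 3.0], [10.0, 0.0]]
regions = [0, 0, 1]
tokens = pool_regions(patches, regions)
```

In a real image the number of patches is far larger than the number of regions, which is where a reduction of the reported magnitude would come from.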
cs.CV / 275 / 2604.18575
ReCap: Lightweight Referential Grounding for Coherent Story Visualization
ReCap:轻量级的指称基础一致性框架用于连贯的故事可视化
Abstract
Story Visualization aims to generate a sequence of images that faithfully depicts a textual narrative while preserving character identity, spatial configuration, and stylistic coherence as the narrative unfolds. Maintaining such cross-frame consistency has traditionally relied on explicit memory banks, architectural expansion, or auxiliary language models, resulting in substantial parameter growth and inference overhead. We introduce ReCap, a lightweight consistency framework that improves character stability and visual fidelity without modifying the base diffusion backbone. ReCap's CORE (COnditional frame REferencing) module treats anaphors, in our case pronouns, as visual anchors, activating only when characters are referred to by a pronoun and conditioning on the preceding frame to propagate visual identity. This selective design avoids unconditional cross-frame conditioning and introduces only 149K additional parameters, a fraction of the cost of memory-bank and LLM-augmented approaches. To further stabilize identity, we incorporate SemDrift (Guided Semantic Drift Correction), which is applied only during training. When text is vague or referential, the denoiser lacks a visual anchor for identity-defining attributes, causing character appearance to drift across frames. SemDrift corrects this by aligning denoiser representations with pretrained DINOv3 visual embeddings, enforcing semantic identity stability at zero inference cost. ReCap outperforms the previous state of the art, StoryGPT-V, on the two main benchmarks for story visualization by 2.63% Character-Accuracy on FlintstonesSV and by 5.65% on PororoSV, establishing a new state of the art in character consistency on both benchmarks. Furthermore, we extend story visualization to human-centric narratives derived from real films, demonstrating the capability of ReCap beyond stylized cartoon domains.
Chinese Translation
故事可视化旨在生成一系列图像,忠实地描绘文本叙事,保持角色身份、空间配置和风格一致性,随着叙事的发展而展开。维持这种跨帧一致性传统上依赖于显式记忆库、架构扩展或辅助语言模型,导致参数显著增长和推理开销。我们提出了ReCap,一个轻量级的一致性框架,能够在不修改基础扩散骨干网络的情况下提高角色稳定性和视觉保真度。ReCap的CORE(条件帧引用)模块将指代词(在我们的案例中是代词)视为视觉锚点,仅在角色通过代词被提及时激活,并基于前一帧进行条件处理,以传播视觉身份。这种选择性设计避免了无条件的跨帧条件处理,仅引入149K的额外参数,成本仅为记忆库和增强语言模型方法的一小部分。为了进一步稳定身份,我们在训练期间引入了SemDrift(引导语义漂移校正),仅在训练过程中应用。当文本模糊或具有指称性时,去噪器缺乏用于定义身份属性的视觉锚点,导致角色外观在帧之间漂移,SemDrift通过将去噪器表示与预训练的DINOv3视觉嵌入对齐来纠正这一点,在零推理成本下强制语义身份稳定。ReCap在两个主要的故事可视化基准上超越了之前的最先进模型StoryGPT-V,FlintstonesSV上的角色准确率提高了2.63%,在PororoSV上提高了5.65%,在这两个基准上建立了新的角色一致性最先进水平。此外,我们将故事可视化扩展到源自真实电影的人本叙事,展示了ReCap在风格化卡通领域之外的能力。
cs.CV / 276 / 2604.18583
MUA: Mobile Ultra-detailed Animatable Avatars
MUA:移动超细节可动画化头像
Abstract
Building photorealistic, animatable full-body digital humans remains a longstanding challenge in computer graphics and vision. Recent advances in animatable avatar modeling have largely progressed along two directions: improving the fidelity of dynamic geometry and appearance, or reducing computational complexity to enable deployment on resource-constrained platforms, e.g., VR headsets. However, existing approaches fail to achieve both goals simultaneously: ultra-high-fidelity avatars typically require substantial computation on server-class GPUs, whereas lightweight avatars often suffer from limited surface dynamics, reduced appearance details, and noticeable artifacts. To bridge this gap, we propose a novel animatable avatar representation, termed Wavelet-guided Multi-level Spatial Factorized Blendshapes, and a corresponding distillation pipeline that transfers motion-aware clothing dynamics and fine-grained appearance details from a pre-trained ultra-high-quality avatar model into a compact, efficient representation. By coupling multi-level wavelet spectral decomposition with low-rank structural factorization in texture space, our method achieves up to 2000X lower computational cost and a 10X smaller model size than the original high-quality teacher avatar model, while preserving visually plausible dynamics and appearance details that closely resemble those of the teacher model. Extensive comparisons with state-of-the-art methods show that our approach significantly outperforms existing avatar approaches designed for mobile settings and achieves comparable or superior rendering quality to most approaches that can only run on servers. Importantly, our representation substantially improves the practicality of high-fidelity avatars for immersive applications, achieving over 180 FPS on a desktop PC and real-time native on-device performance at 24 FPS on a standalone Meta Quest 3.
Chinese Translation
构建逼真的、可动画化的全身数字人类仍然是计算机图形学和视觉领域的一项长期挑战。最近在可动画化头像建模方面的进展主要沿着两个方向发展:提高动态几何和外观的保真度,或降低计算复杂性以便在资源受限的平台上部署,例如虚拟现实头戴设备。然而,现有的方法无法同时实现这两个目标:超高保真度的头像通常需要在服务器级GPU上进行大量计算,而轻量级头像往往在表面动态、外观细节和明显伪影方面受到限制。为了弥补这一差距,我们提出了一种新颖的可动画化头像表示方法,称为小波引导的多级空间分解混合形状(Wavelet-guided Multi-level Spatial Factorized Blendshapes),以及一个相应的蒸馏管道,该管道将运动感知的服装动态和细致的外观细节从预训练的超高质量头像模型转移到一个紧凑、高效的表示中。通过将多级小波谱分解与纹理空间中的低秩结构分解相结合,我们的方法实现了比原始高质量教师头像模型低2000倍的计算成本和10倍更小的模型尺寸,同时保持了视觉上合理的动态和外观细节,紧密类似于教师模型。与最先进的方法进行的广泛比较显示,我们的方法显著优于为移动环境设计的现有头像方法,并且在渲染质量上与大多数只能在服务器上运行的方法相当或更优。重要的是,我们的表示大大提高了高保真头像在沉浸式应用中的实用性,在桌面PC上实现超过180帧每秒(FPS)的性能,并在独立的Meta Quest 3设备上实现24帧每秒的实时本地性能。
cs.AI / 1 / 2604.16338
Governing the Agentic Enterprise: A Governance Maturity Model for Managing AI Agent Sprawl in Business Operations
治理自主企业:管理商业运营中人工智能代理扩张的治理成熟度模型
Abstract
The rapid adoption of agentic AI in enterprise business operations--autonomous systems capable of planning, reasoning, and executing multi-step workflows--has created an urgent governance crisis. Organizations face uncontrolled agent sprawl: the proliferation of redundant, ungoverned, and conflicting AI agents across business functions. Industry surveys report that only 21% of enterprises have mature governance models for autonomous agents, while 40% of agentic AI projects are projected to fail by 2027 due to inadequate governance and risk controls. Despite growing acknowledgment of this challenge, academic literature lacks a formal, empirically validated governance maturity model connecting governance capability to measurable business outcomes. This paper introduces the Agentic AI Governance Maturity Model (AAGMM), a five-level framework spanning 12 governance domains, grounded in NIST AI RMF and ISO/IEC 42001 standards. We additionally propose a novel taxonomy of agent sprawl patterns--functional duplication, shadow agents, orphaned agents, permission creep, and unmonitored delegation chains--each linked to quantifiable business cost models. The framework is validated through 750 simulation runs across five enterprise scenarios and five governance maturity levels, measuring business outcomes including cost containment, risk incident rates, operational efficiency, and decision quality. Results demonstrate statistically significant differences (p < 0.001, large effect sizes d > 2.0) between all governance maturity levels, with Level 4-5 organizations achieving 94.3% lower sprawl indices, 96.4% fewer risk incidents, and 32.6% higher effective task completion rates compared to Level 1. The AAGMM provides practitioners with an actionable roadmap for governing autonomous AI agents while maximizing business returns.
Chinese Translation
自主人工智能在企业商业运营中的快速应用——能够进行规划、推理和执行多步骤工作流程的自主系统——引发了紧迫的治理危机。组织面临失控的代理扩张:冗余、缺乏治理和相互冲突的人工智能代理在商业职能中的泛滥。行业调查报告显示,只有21%的企业拥有成熟的自主代理治理模型,而预计到2027年,40%的自主人工智能项目将因治理和风险控制不足而失败。尽管对这一挑战的认识日益增强,学术文献中缺乏一个正式的、经过实证验证的治理成熟度模型,将治理能力与可衡量的商业成果联系起来。本文提出了自主人工智能治理成熟度模型(Agentic AI Governance Maturity Model,AAGMM),这是一个涵盖12个治理领域的五级框架,基于NIST AI RMF和ISO/IEC 42001标准。此外,我们还提出了一种新的代理扩张模式分类法——功能重复、影子代理、孤立代理、权限蔓延和未监控的委托链——每种模式都与可量化的商业成本模型相关联。该框架通过在五个企业场景和五个治理成熟度级别下进行750次模拟运行进行了验证,测量的商业成果包括成本控制、风险事件发生率、运营效率和决策质量。结果显示所有治理成熟度级别之间存在统计显著差异(p < 0.001,效应量大于2.0),第4-5级组织的扩张指数比第1级低94.3%,风险事件减少96.4%,有效任务完成率提高32.6%。AAGMM为从业者提供了一条可操作的路线图,以治理自主人工智能代理,同时最大化商业回报。
cs.AI / 2 / 2604.16339
Semantic Consensus: Process-Aware Conflict Detection and Resolution for Enterprise Multi-Agent LLM Systems
语义共识:面向过程的企业多智能体大语言模型系统冲突检测与解决
Abstract
Multi-agent large language model (LLM) systems are rapidly emerging as the dominant architecture for enterprise AI automation, yet production deployments exhibit failure rates between 41% and 86.7%, with nearly 79% of failures originating from specification and coordination issues rather than model capability limitations. This paper identifies Semantic Intent Divergence--the phenomenon whereby cooperating LLM agents develop inconsistent interpretations of shared objectives due to siloed context and absent process models--as a primary yet formally unaddressed root cause of multi-agent failure in enterprise settings. We propose the Semantic Consensus Framework (SCF), a process-aware middleware comprising six components: a Process Context Layer for shared operational semantics, a Semantic Intent Graph for formal intent representation, a Conflict Detection Engine for real-time identification of contradictory, contention-based, and causally invalid intent combinations, a Consensus Resolution Protocol using a policy--authority--temporal hierarchy, a Drift Monitor for detecting gradual semantic divergence, and a Process-Aware Governance Integration layer for organizational policy enforcement. Evaluation across 600 runs spanning three multi-agent frameworks (AutoGen, CrewAI, LangGraph) and four enterprise scenarios demonstrates that SCF is the only approach to achieve 100% workflow completion--compared to 25.1% for the next-best baseline--while detecting 65.2% of semantic conflicts with 27.9% precision and providing complete governance audit trails. The framework is protocol-agnostic and compatible with MCP and A2A communication standards.
Chinese Translation
多智能体大语言模型(LLM)系统正在迅速成为企业人工智能自动化的主导架构,但生产部署的失败率在41%到86.7%之间,近79%的失败源于规范和协调问题,而非模型能力的限制。本文识别出语义意图偏差——即合作的LLM智能体由于孤立的上下文和缺失的过程模型而对共享目标产生不一致的解释——作为企业环境中多智能体失败的主要且尚未正式解决的根本原因。我们提出了语义共识框架(Semantic Consensus Framework, SCF),这是一种面向过程的中间件,包含六个组件:用于共享操作语义的过程上下文层、用于正式意图表示的语义意图图、用于实时识别矛盾、竞争性和因果无效意图组合的冲突检测引擎、使用政策-权威-时间层次的共识解决协议、用于检测逐渐语义偏差的漂移监控器,以及用于组织政策执行的面向过程的治理集成层。在涵盖三个多智能体框架(AutoGen、CrewAI、LangGraph)和四个企业场景的600次运行评估中,SCF是唯一实现100%工作流完成的方案——相比之下,最佳基准的完成率为25.1%——同时以27.9%的精度检测到65.2%的语义冲突,并提供完整的治理审计追踪。该框架与协议无关,并兼容MCP和A2A通信标准。
cs.AI / 3 / 2604.16403
Computational Hermeneutics: Evaluating generative AI as a cultural technology
计算解释学:评估生成性人工智能作为文化技术
Kommers, Cody, Ahnert, Ruth, Antoniak, Maria, Benetos, Emmanouil, Benford, Steve, Bunz, Mercedes, Caramiaux, Baptiste, Concannon, Shauna, Disley, Martin, Dobson, James, Du, Yali, Duéñez-Guzmán, Edgar, Francksen, Kerry, Gius, Evelyn, Gray, Jonathan W. Y., Heuser, Ryan, Immel, Sarah, So, Richard Jean, Leigh, Sang, Livingston, Dalaki, Long, Hoyt, Martin, Meredith, Meyer, Georgia, Mihai, Daniela, Noel-Hirst, Ashley, Ostherr, Kirsten, Parker, Deven, Qin, Yipeng, Ratcliff, Jessica, Robinson, Emily, Rodriguez, Karina, Sobey, Adam, Underwood, Ted, Vashistha, Aditya, Wilkens, Matthew, Wu, Youyou, Zheng, Yuan, Hemment, Drew
Abstract
Generative AI systems are increasingly recognized as cultural technologies, yet current evaluation frameworks often treat culture as a variable to be measured rather than fundamental to the system's operation. Drawing on hermeneutic theory from the humanities, we argue that GenAI systems function as "context machines" that must inherently address three interpretive challenges: situatedness (meaning only emerges in context), plurality (multiple valid interpretations coexist), and ambiguity (interpretations naturally conflict). We present computational hermeneutics as an emerging framework offering an interpretive account of what GenAI systems do, and how they might do it better. We offer three principles for hermeneutic evaluation -- that benchmarks should be iterative, not one-off; include people, not just machines; and measure cultural context, not just model output. This perspective offers a nascent paradigm for designing and evaluating contemporary AI systems: shifting from standardized questions about accuracy to contextual ones about meaning.
Chinese Translation
生成性人工智能系统越来越被视为文化技术,然而当前的评估框架往往将文化视为一个可测量的变量,而不是系统运作的基础。基于人文学科的解释学理论,我们认为生成性人工智能系统作为“情境机器”运作,必须内在地应对三个解释性挑战:情境性(意义仅在特定情境中产生)、多元性(多种有效的解释共存)和模糊性(解释自然发生冲突)。我们提出计算解释学作为一个新兴框架,提供对生成性人工智能系统所做之事及其改进方法的解释性阐述。我们提出三个解释学评估原则——基准应为迭代的,而非一次性的;应包括人,而不仅仅是机器;并测量文化背景,而不仅仅是模型输出。这一视角为设计和评估当代人工智能系统提供了一个新兴范式:从关于准确性的标准化问题转向关于意义的情境性问题。
cs.AI / 4 / 2604.16406
Heterogeneous Self-Play for Realistic Highway Traffic Simulation
异构自我对弈用于现实高速公路交通模拟
Abstract
Realistic highway simulation is critical for scalable safety evaluation of autonomous vehicles, particularly for interactions that are too rare to study from logged data alone. Yet highway traffic generation remains challenging because it requires broad coverage across speeds and maneuvers, controllable generation of rare safety-critical scenarios, and behavioral credibility in multi-agent interactions. We present PHASE, Policy for Heterogeneous Agent Self-play on Expressway, a context-aware self-play framework that addresses these three requirements through explicit per-agent conditioning for controllability, synthetic scenario generation for broad highway coverage, and closed-loop multi-agent training for realistic interaction dynamics. PHASE further supports different vehicle profiles, for example, passenger cars and articulated trailer trucks, within a single policy via vehicle-aware dynamics and context-conditioned actions, and stabilizes self-play with early termination of unrecoverable states, at-fault collision attribution, highway-aware reward shaping, coupled curricula, and robust policy optimization. Despite being trained only on synthetic data, PHASE transfers zero-shot to 512 unseen high-interaction real scenarios in exiD, achieving a 96.3% success rate and reducing ADE/FDE from 6.57/12.07 m to 2.44/5.25 m relative to a prior self-play baseline. In a learned trajectory embedding space, it also improves behavioral realism over IDM, reducing Frechet trajectory distance by 13.1% and energy distance by 20.2%. These results show that synthetic self-play can provide a scalable route to controllable and realistic highway scenario generation without direct imitation of expert logs.
Chinese Translation
现实高速公路模拟对于自主车辆的可扩展安全评估至关重要,特别是对于那些仅通过记录数据无法研究的稀有交互。然而,高速公路交通生成仍然具有挑战性,因为它需要在速度和机动性上广泛覆盖、可控生成稀有的安全关键场景,以及在多智能体交互中具备行为可信度。我们提出了PHASE(Policy for Heterogeneous Agent Self-play on Expressway),一个上下文感知的自我对弈框架,通过对每个智能体进行显式条件化以实现可控性、合成场景生成以实现广泛的高速公路覆盖,以及闭环多智能体训练以实现现实的交互动态,从而满足这三个要求。PHASE进一步支持不同的车辆类型,例如乘用车和挂车卡车,通过车辆感知动态和上下文条件动作在单一策略中实现,并通过对不可恢复状态的提前终止、责任碰撞归属、高速公路感知奖励塑造、耦合课程和稳健的策略优化来稳定自我对弈。尽管仅在合成数据上进行训练,PHASE在exiD中零-shot转移到512个未见的高交互真实场景,取得了96.3%的成功率,并将ADE/FDE从6.57/12.07米降低到2.44/5.25米,相较于先前的自我对弈基线。在学习的轨迹嵌入空间中,它还在行为真实感上优于IDM,将Frechet轨迹距离降低了13.1%,能量距离降低了20.2%。这些结果表明,合成自我对弈可以提供一种可扩展的途径,实现可控和现实的高速公路场景生成,而无需直接模仿专家日志。
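The ADE/FDE figures quoted in this abstract are standard trajectory-error metrics. As a minimal sketch: ADE averages the Euclidean distance between predicted and reference positions over all timesteps, and FDE takes that distance at the final timestep only. The helper name `ade_fde` and the toy trajectories are illustrative assumptions; the paper's evaluation horizon and averaging over agents are not specified here.

```python
import math

def ade_fde(pred, truth):
    # pred and truth are equal-length lists of (x, y) positions in meters.
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    ade = sum(dists) / len(dists)   # average displacement error
    fde = dists[-1]                 # final displacement error
    return ade, fde

# A simulated vehicle that stays in lane vs. a logged vehicle that drifts sideways.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
logged = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = ade_fde(predicted, logged)
```

FDE is typically larger than ADE, as in the reported numbers, because errors tend to accumulate toward the end of a rollout.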
cs.AI / 5 / 2604.16434
Support Sufficiency as Consequence-Sensitive Compression in Belief Arbitration
作为后果敏感压缩的支持充分性在信念仲裁中的应用
Abstract
When a system commits to a hypothesis, much of the evidential structure behind that commitment is lost to compression. Standard accounts assume that selected content and scalar confidence suffice for downstream control. This paper argues that they do not, and that determining what must survive compression is itself a consequence-sensitive problem. We develop a recurrent arbitration architecture in which active constraint fields jointly determine a hypothesis geometry over candidates. Rather than carrying that geometry forward in full, the system compresses it into a support-aware control state whose resolution is regulated by current consequence geometry, arbitration memory, and resource constraints. A bounded objective formalizes the tradeoff. Too little retained support collapses policy-relevant distinctions, producing controllers that select content adequately while misrouting verification, abstention, and recovery. Too much retained support fragments learning across overly fine contexts, degrading adaptation even as discrimination improves. These failure modes yield ordered controller predictions confirmed by a minimal repeated-interaction simulation. Adaptive controllers that regulate support resolution outperform all fixed-resolution controllers in cumulative utility. Agile adaptive control outperforms sluggish adaptive control. Fixed high-resolution control achieves the best commitment accuracy but still trails adaptive controllers because resource cost and learning fragmentation offset the gains from richer retention. Support sufficiency should be understood not as a static representational threshold, but as a dynamic compression criterion. Robust arbitration depends on preserving the smallest support structure adequate for policy under the current consequence landscape, and on regulating that structure as conditions change across repeated cycles of inference and action.
Chinese Translation
当一个系统承诺于某一假设时,支撑该承诺的大部分证据结构会因压缩而丧失。标准理论假设所选择的内容和标量置信度足以进行下游控制。本文认为事实并非如此,确定在压缩中必须保留的内容本身就是一个后果敏感的问题。我们开发了一种递归仲裁架构,其中主动约束场共同决定候选者的假设几何形状。系统并不是将该几何形状完整地传递下去,而是将其压缩为一个支持感知的控制状态,其分辨率由当前的后果几何、仲裁记忆和资源约束来调节。一个有界目标形式化了这种权衡。保留的支持过少会导致政策相关的区分崩溃,产生在选择内容时足够但在验证、弃权和恢复时错误路由的控制器。保留的支持过多则会在过于细微的上下文中分散学习,尽管辨别能力提高,但适应性却下降。这些失败模式产生了经过最小重复交互仿真确认的有序控制器预测。调节支持分辨率的自适应控制器在累积效用上优于所有固定分辨率控制器。灵活的自适应控制优于迟缓的自适应控制。固定高分辨率控制实现了最佳承诺准确性,但仍落后于自适应控制器,因为资源成本和学习碎片化抵消了更丰富保留带来的收益。支持充分性应被理解为一个动态压缩标准,而非静态的表征阈值。稳健的仲裁依赖于在当前后果环境下保留足够的最小支持结构,并在推理和行动的重复周期中随着条件变化调节该结构。
cs.AI / 6 / 2604.16465
Healthcare AI for Automation or Allocation? A Transaction Cost Economics Framework
医疗保健中的人工智能:自动化还是资源分配?一个交易成本经济学框架
Abstract
Healthcare productivity is shaped not only by clinical complexity but by the costs of coordinating work under uncertainty. Transaction-cost economics offers a theory of these coordination frictions, yet has rarely been operationalised at task level across health occupations. Using task statements and frequency weights from the O*NET occupational database, we characterised healthcare work at task granularity and coded each unique task using a constrained large language model into one dominant transaction-cost category (information search, decision and bargaining, monitoring and enforcement, or adaptation and coordination) together with an overall transaction-cost intensity score. Aggregating to the occupation level, clinician roles exhibited substantially higher transaction-cost intensity than non-clinician roles, driven primarily by greater burdens of information search and decision-related coordination, while dispersion of transaction costs within occupations did not differ. These findings demonstrate systematic heterogeneity in the nature of coordination work across healthcare roles and suggest that the opportunities for digital and AI interventions are unevenly distributed, shaped less by technical task complexity than by underlying coordination structure.
Chinese Translation
医疗保健的生产力不仅受临床复杂性的影响,还受到在不确定性下协调工作的成本的影响。交易成本经济学提供了关于这些协调摩擦的理论,但在健康职业的任务层面上很少被具体化。通过使用O*NET职业数据库中的任务陈述和频率权重,我们对医疗保健工作进行了任务粒度的特征描述,并利用受限的大型语言模型将每个独特任务编码为一个主要的交易成本类别(信息搜索、决策与谈判、监控与执行,或适应与协调),同时给出整体交易成本强度评分。聚合到职业层面,临床角色的交易成本强度显著高于非临床角色,主要是由于信息搜索和决策相关协调的更大负担,而职业内部的交易成本分布没有差异。这些发现表明医疗保健角色之间协调工作的性质存在系统性异质性,并暗示数字和人工智能干预的机会分布不均,受基础协调结构的影响,而非技术任务复杂性。
cs.AI / 7 / 2604.16646
Agentic Frameworks for Reasoning Tasks: An Empirical Study
推理任务的代理框架:一项实证研究
Abstract
Recent advances in agentic frameworks have enabled AI agents to perform complex reasoning and decision-making. However, evidence comparing their reasoning performance, efficiency, and practical suitability remains limited. To address this gap, we empirically evaluate 22 widely used agentic frameworks across three reasoning benchmarks: BBH, GSM8K, and ARC. The frameworks were selected from 1,200 GitHub repositories collected between January 2023 and July 2025 and organized into a taxonomy based on architectural design. We evaluated them under a unified setting, measuring reasoning accuracy, execution time, computational cost, and cross-benchmark consistency. Our results show that 19 of the 22 frameworks completed all three benchmarks. Among these, 12 showed stable performance, with mean accuracy of 74.6-75.9%, execution time of 4-6 seconds per task, and cost of 0.14-0.18 cents per task. Poorer results were mainly caused by orchestration problems rather than reasoning limits. For example, Camel failed to complete BBH after 11 days because of uncontrolled context growth, while Upsonic consumed USD 1,434 in one day because repeated extraction failures triggered costly retries. AutoGen and Mastra also exhausted API quotas through iterative interactions that increased prompt length without improving results. We also found a sharp drop in mathematical reasoning. Mean accuracy on GSM8K was 44.35%, compared with 89.80% on BBH and 89.56% on ARC. Overall, this study provides the first large-scale empirical comparison of agentic frameworks for reasoning-intensive software engineering tasks and shows that framework selection should prioritize orchestration quality, especially memory control, failure handling, and cost management.
Chinese Translation
近年来,代理框架的进展使得人工智能代理能够执行复杂的推理和决策。然而,关于它们的推理性能、效率和实际适用性的比较证据仍然有限。为了解决这一空白,我们对22个广泛使用的代理框架在三个推理基准(BBH、GSM8K和ARC)上进行了实证评估。这些框架是从2023年1月至2025年7月收集的1,200个GitHub代码库中选择的,并根据架构设计组织成一个分类体系。我们在统一的设置下对它们进行了评估,测量了推理准确性、执行时间、计算成本和跨基准一致性。我们的结果显示,22个框架中有19个完成了所有三个基准。在这些框架中,有12个表现稳定,平均准确率为74.6%-75.9%,每个任务的执行时间为4-6秒,成本为每个任务0.14-0.18美分。较差的结果主要是由于协调问题而非推理限制。例如,Camel在11天后未能完成BBH,因为上下文增长失控,而Upsonic在一天内消耗了1,434美元,因为重复的提取失败触发了昂贵的重试。AutoGen和Mastra也通过迭代交互耗尽了API配额,增加了提示长度而没有改善结果。我们还发现数学推理的准确率急剧下降。GSM8K的平均准确率为44.35%,而BBH和ARC的平均准确率分别为89.80%和89.56%。总体而言,本研究提供了针对推理密集型软件工程任务的代理框架的首次大规模实证比较,并表明框架选择应优先考虑协调质量,特别是内存控制、故障处理和成本管理。
cs.AI / 8 / 2604.16672
From Subsumption to Satisfiability: LLM-Assisted Active Learning for OWL Ontologies
从子概念到可满足性:基于大型语言模型的OWL本体主动学习
Abstract
In active learning, membership queries (MQs) allow a learner to pose questions to a teacher, such as "Is every apple a fruit?", to which the teacher responds correctly with yes or no. These MQs can be viewed as subsumption tests with respect to the target ontology. Inspired by the standard reduction of subsumption to satisfiability in description logics, we reformulate each candidate axiom into its corresponding counter-concept and verbalise it in controlled natural language before presenting it to Large Language Models (LLMs). We introduce LLMs as a third component that provides real-world examples approximating an instance of the counter-concept. This design property ensures that only Type II errors may occur in ontology modelling; in the worst case, these errors merely delay the construction process without introducing inconsistencies. Experimental results on 13 commercial LLMs show that recall, corresponding to Type II errors in our framework, remains stable across several well-established ontologies.
Chinese Translation
在主动学习中,成员查询(MQs)允许学习者向教师提出问题,例如“每个苹果都是水果吗?”,教师则以“是”或“否”作出正确回应。这些MQs可以视为相对于目标本体的子概念测试。受到描述逻辑中子概念到可满足性标准化简的启发,我们将每个候选公理重新表述为其对应的反概念,并在呈现给大型语言模型(LLMs)之前用受控自然语言进行表述。我们引入LLMs作为第三个组成部分,提供接近反概念实例的现实世界示例。该设计特性确保在本体建模中仅可能发生II型错误;在最坏情况下,这些错误仅会延迟构建过程,而不会引入不一致性。在对13个商业LLMs进行的实验结果显示,回忆率(对应于我们框架中的II型错误)在多个成熟本体中保持稳定。
cs.AI / 9 / 2604.16687
Agentic Risk-Aware Set-Based Engineering Design
基于代理的风险意识集合工程设计
Abstract
This paper introduces a multi-agent framework guided by Large Language Models (LLMs) to assist in the early stages of engineering design, a phase often characterized by vast parameter spaces and inherent uncertainty. Operating under a human-in-the-loop paradigm and demonstrated on the canonical problem of aerodynamic airfoil design, the framework employs a team of specialized agents: a Coding Assistant, a Design Agent, a Systems Engineering Agent, and an Analyst Agent - all coordinated by a human Manager. Integrated within a set-based design philosophy, the process begins with a collaborative phase where the Manager and Coding Assistant develop a suite of validated tools, after which the agents execute a structured workflow to systematically explore and prune a large set of initial design candidates. A key contribution of this work is the explicit integration of formal risk management, employing the Conditional Value-at-Risk (CVaR) as a quantitative metric to filter designs that exhibit a high probability of failing to meet performance requirements, specifically the target coefficient of lift. The framework automates labor-intensive initial exploration through a global sensitivity analysis conducted by the Analyst agent, which generates actionable heuristics to guide the other agents. The process culminates by presenting the human Manager with a curated final set of promising design candidates, augmented with high-fidelity Computational Fluid Dynamics (CFD) simulations. This approach effectively leverages AI to handle high-volume analytical tasks, thereby enhancing the decision-making capability of the human expert in selecting the final, risk-assessed design.
Chinese Translation
本文介绍了一种由大型语言模型(LLMs)指导的多代理框架,以协助工程设计的早期阶段,该阶段通常具有广泛的参数空间和固有的不确定性。在人机协同的范式下,框架在经典的气动翼型设计问题上进行了演示,采用了一组专业代理:编码助手、设计代理、系统工程代理和分析代理,所有代理均由人类经理协调。该过程集成在集合设计理念中,首先进行协作阶段,经理和编码助手开发一套经过验证的工具,随后代理们执行结构化工作流程,系统地探索和筛选大量初始设计候选。本文的一个关键贡献是明确整合了正式的风险管理,采用条件风险价值(CVaR)作为定量指标,以过滤出高概率未能满足性能要求(特别是目标升力系数)的设计。该框架通过分析代理进行的全局灵敏度分析,自动化了劳动密集型的初步探索,生成可操作的启发式指导其他代理。最终,向人类经理呈现经过筛选的有前景的设计候选集,并附上高保真计算流体动力学(CFD)模拟。该方法有效利用人工智能处理高容量的分析任务,从而增强人类专家在选择最终经过风险评估的设计时的决策能力。
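The risk filter described in this abstract can be sketched with an empirical Conditional Value-at-Risk: the average of the worst (1 - alpha) fraction of sampled losses. Below, the loss is assumed to be the shortfall of a sampled lift coefficient below its target; that loss definition, the tolerance, and all names (`cvar`, `lift_shortfall`, `keep_design`) are hypothetical, since the abstract does not give the paper's exact formulation.

```python
import math

def cvar(losses, alpha):
    # Empirical CVaR: mean of the worst (1 - alpha) fraction of losses.
    ordered = sorted(losses, reverse=True)
    # round() guards against floating-point noise such as 1.0000000000000009.
    tail = max(1, math.ceil(round((1 - alpha) * len(ordered), 9)))
    return sum(ordered[:tail]) / tail

def lift_shortfall(cl_samples, cl_target):
    # Loss is how far each sampled lift coefficient falls below the target.
    return [max(0.0, cl_target - cl) for cl in cl_samples]

def keep_design(cl_samples, cl_target, alpha, tolerance):
    # Keep a candidate only if its tail-average shortfall is within tolerance.
    return cvar(lift_shortfall(cl_samples, cl_target), alpha) <= tolerance

# Ten sampled shortfalls for one candidate; the worst 20% are 0.4 and 0.3.
example_losses = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.2, 0.3, 0.4]
tail_risk = cvar(example_losses, alpha=0.8)

robust = keep_design([1.02, 1.00, 1.01, 0.99], cl_target=1.0, alpha=0.75, tolerance=0.05)
fragile = keep_design([1.20, 1.10, 0.60, 1.15], cl_target=1.0, alpha=0.75, tolerance=0.05)
```

Unlike a mean-based filter, this rejects the second candidate even though its average lift is above target, because its worst-case samples miss the requirement badly.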
cs.AI / 10 / 2604.16689
The Query Channel: Information-Theoretic Limits of Masking-Based Explanations
查询通道:基于掩蔽的解释的信息论极限
Abstract
Masking-based post-hoc explanation methods, such as KernelSHAP and LIME, estimate local feature importance by querying a black-box model under randomized perturbations. This paper formulates this procedure as communication over a query channel, where the latent explanation acts as a message and each masked evaluation is a channel use. Within this framework, the complexity of the explanation is captured by the entropy of the hypothesis class, while the query interface supplies information at a rate determined by an identification capacity per query. We derive a strong converse showing that, if the explanation rate exceeds this capacity, the probability of error in exact recovery necessarily converges to one for any sequence of explainers and decoders. We also prove an achievability result establishing that a sparse maximum-likelihood decoder attains reliable recovery when the rate lies below capacity. A Monte Carlo estimator of mutual information yields a non-asymptotic query benchmark that we use to compare optimal decoding with Lasso- and OLS-based procedures that mirror LIME and KernelSHAP. Experiments reveal a range of query budgets where information theory permits reliable explanations but standard convex surrogates still fail. Finally, we interpret super-pixel resolution and tokenization for neural language models as a source-coding choice that sets the entropy of the explanation and show how Gaussian noise and nonlinear curvature degrade the query channel, induce waterfall and error-floor behavior, and render high-resolution explanations unattainable.
Chinese Translation
基于掩蔽的事后解释方法,如 KernelSHAP 和 LIME,通过在随机扰动下查询黑箱模型来估计局部特征的重要性。本文将这一过程形式化为在查询通道上的通信,其中潜在的解释作为消息,而每次掩蔽评估则是一次通道使用。在这一框架下,解释的复杂性由假设类的熵来捕捉,而查询接口提供的信息速率由每次查询的识别能力决定。我们推导出一个强对偶定理,表明如果解释速率超过该能力,则在任何解释器和解码器序列中,精确恢复的概率必然收敛于一的错误。我们还证明了一个可达性结果,建立了当速率低于能力时,稀疏最大似然解码器能够实现可靠恢复。互信息的蒙特卡洛估计器提供了一个非渐近的查询基准,我们用它来比较与 Lasso 和 OLS 基于程序的最优解码,这些程序与 LIME 和 KernelSHAP 相似。实验揭示了一系列查询预算,在这些预算下,信息论允许可靠解释,但标准的凸替代方法仍然失败。最后,我们将超像素分辨率和神经语言模型的标记化解释为一种源编码选择,这种选择设置了解释的熵,并展示了高斯噪声和非线性曲率如何降低查询通道的性能,导致瀑布效应和错误底线行为,使得高分辨率解释无法实现。
cs.AI / 11 / 2604.16694
RankGuide: Tensor-Rank-Guided Routing and Steering for Efficient Reasoning
RankGuide:基于张量秩引导的路由与引导以实现高效推理
Abstract
Large reasoning models (LRMs) enhance problem-solving capabilities by generating explicit multi-step chains of thought (CoT) reasoning; however, they incur substantial inference latency and computational overhead. To mitigate this issue, recent works have explored model collaboration paradigms, where small reasoning models (SRMs) generate intermediate reasoning steps to achieve a better accuracy--latency trade-off. Despite recent progress, effectively and efficiently detecting and mitigating SRM failures in collaborative systems remains a key challenge. To address this issue, we analyze SRM inference in both the generated text and hidden-state spaces, and identify three types of failure modes: \textit{overconfidence}, \textit{uncertainty}, and \textit{heavy revalidation}. Building on these insights, we propose \textbf{RankGuide}, a framework that improves the efficiency and effectiveness of SRM--LRM collaboration through tensor-rank-guided routing and steering. Specifically, RankGuide leverages a routing signal that incorporates tensor-rank signals derived from consecutive hidden states to detect when SRMs are likely to fail and selectively invoke LRMs. In addition, we introduce a tensor-rank-filtered steering vector extraction method to modulate the reasoning trajectory of SRMs, thereby improving their generation quality. By improving both routing and steering through tensor-rank signals, RankGuide enables SRM--LRM collaborative systems to achieve more efficient reasoning with fewer steps and improved accuracy. Experiments on multiple reasoning benchmarks demonstrate the efficacy of RankGuide in reducing latency by up to $1.75\times$ compared to LRM, while maintaining competitive accuracy relative to prior methods.
Chinese Translation
大型推理模型(LRMs)通过生成明确的多步骤思维链(CoT)推理来增强问题解决能力;然而,它们会带来显著的推理延迟和计算开销。为了解决这个问题,近期的研究探索了模型协作范式,其中小型推理模型(SRMs)生成中间推理步骤,以实现更好的准确性与延迟的权衡。尽管近期取得了一些进展,但在协作系统中有效且高效地检测和缓解SRM故障仍然是一个关键挑战。为了解决这一问题,我们分析了SRM推理在生成文本和隐藏状态空间中的表现,并识别出三种故障模式:\textit{过度自信}、\textit{不确定性}和\textit{重验证负担}。基于这些见解,我们提出了\textbf{RankGuide},一个通过张量秩引导的路由与引导框架,以提高SRM-LRM协作的效率和有效性。具体而言,RankGuide利用一个路由信号,该信号结合了来自连续隐藏状态的张量秩信号,以检测SRM可能失败的时刻,并选择性地调用LRM。此外,我们引入了一种张量秩过滤的引导向量提取方法,以调节SRM的推理轨迹,从而提高其生成质量。通过改进路由和引导,RankGuide使SRM-LRM协作系统能够以更少的步骤和更高的准确性实现更高效的推理。在多个推理基准上的实验表明,RankGuide在保持与先前方法相对竞争的准确性的同时,能够将延迟减少多达$1.75\times$。
cs.AI / 12 / 2604.16706
Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
评估工具使用语言代理:AgentProp-Bench中的评审可靠性、传播级联和运行时缓解
Abstract
Automated evaluation of tool-using large language model (LLM) agents is widely assumed to be reliable, but this assumption has rarely been validated against human annotation. We introduce AgentProp-Bench, a 2,000-task benchmark with 2,300 traces across four domains, nine production LLMs, and a 100-label human-validated subset. We quantify judge reliability, characterize error propagation, and evaluate a runtime mitigation. Substring-based judging agrees with human annotation at kappa=0.049 (chance-level); a three-LLM ensemble reaches kappa=0.432 (moderate) with a conservative bias. Under validated evaluation, a parameter-level injection propagates to a wrong final answer with human-calibrated probability approximately 0.62 (range 0.46-0.73 across models). Rejection (catching bad parameters) and recovery (correcting after acceptance) are independent model capabilities (Spearman rho=0.126, p=0.747). A tuned runtime interceptor reduces hallucination on GPT-4o-mini by 23.0 percentage points under a concurrent n=600 control, but shows no significant effect on Gemini-2.0-Flash, whose aggressive parameter rejection eliminates the target failure mode. All code, data, traces, and human labels are released at https://github.com/bhaskargurram-ai/agenthallu-bench.
Chinese Translation
对使用工具的大型语言模型(LLM)代理的自动评估被广泛认为是可靠的,但这一假设很少经过与人类注释的对照验证。我们引入了AgentProp-Bench,这是一个包含2000个任务的基准,涵盖四个领域、2300个追踪记录、九个生产级LLM,以及一个包含100个标签、经人工验证的子集。我们量化了评审的可靠性,描述了错误传播,并评估了一种运行时缓解措施。基于子字符串的评审与人类注释的一致性仅为kappa=0.049(随机水平);一个三LLM集成模型的kappa值达到0.432(中等),但存在保守偏差。在经过验证的评估下,参数级别的注入以约0.62的人工校准概率传播到错误的最终答案(在不同模型中范围为0.46-0.73)。拒绝(捕捉不良参数)和恢复(在接受后纠正)是相互独立的模型能力(Spearman rho=0.126,p=0.747)。经过调优的运行时拦截器在n=600的并行对照下将GPT-4o-mini的幻觉减少了23.0个百分点,但对Gemini-2.0-Flash没有显著影响,因为后者激进的参数拒绝已消除了目标失败模式。所有代码、数据、追踪记录和人类标签均已发布在 https://github.com/bhaskargurram-ai/agenthallu-bench。
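The kappa values quoted in this abstract measure agreement between an automated judge and human labels after correcting for chance. As a rough illustration only, the following minimal sketch computes Cohen's kappa for two label sequences. The function name `cohen_kappa` and the toy labels are assumptions made for this example, and nothing here is taken from the AgentProp-Bench code.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the two raters match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the raters were independent, from their marginals.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        # Convention chosen for this sketch: both raters used one identical label.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy data (hypothetical): 1 = "answer correct", 0 = "answer wrong".
human = [1, 1, 0, 0]
judge = [1, 0, 1, 0]
print(cohen_kappa(human, judge))
```

In the toy data the judge matches the human on half of the items, which is exactly what chance would give with balanced labels, so kappa comes out at zero even though raw agreement is 50%.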
cs.AI / 13 / 2604.16723
Debate as Reward: A Multi-Agent Reward System for Scientific Ideation via RL Post-Training
作为奖励的辩论:通过RL后训练的多智能体奖励系统用于科学创意生成
Abstract
Large Language Models (LLMs) have demonstrated potential in automating scientific ideation, yet current approaches relying on iterative prompting or complex multi-agent architectures often suffer from hallucination or computational inefficiency. A critical bottleneck in applying Reinforcement Learning (RL) to this open-ended domain is reward hacking -- where models exploit imperfect evaluation proxies to maximize scores without producing genuine scientific innovation. To address these limitations, we propose an RL framework explicitly tailored for high-quality scientific idea generation. We introduce the first multi-agent reward function designed to serve as a judge, decoupling methodological validation from implementation details while providing strict binary rewards that are robust to reward hacking. To effectively optimize against this sparse signal, we utilize an unbiased variant of Group Relative Policy Optimization to mitigate artificial length bias. We ground our training in ICLR-320, a curated dataset of problem-solution pairs extracted from ICLR 2024 proceedings. Experiments demonstrate that our framework significantly outperforms state-of-the-art baselines across expert-evaluated metrics of novelty, feasibility, and effectiveness.
Chinese Translation
大型语言模型(LLMs)在自动化科学创意生成方面展现了潜力,但当前依赖于迭代提示或复杂多智能体架构的方法往往遭遇幻觉或计算效率低下的问题。在将强化学习(RL)应用于这一开放领域时,一个关键瓶颈是奖励黑客行为——模型利用不完善的评估代理来最大化分数,而未能产生真正的科学创新。为了解决这些限制,我们提出了一种专门为高质量科学创意生成量身定制的RL框架。我们提出了第一个多智能体奖励函数,旨在作为评判者,将方法论验证与实施细节解耦,同时提供对奖励黑客行为具有鲁棒性的严格二元奖励。为了有效优化这一稀疏信号,我们利用了一种无偏的群体相对策略优化(Group Relative Policy Optimization)变体,以减轻人工长度偏差。我们的训练基于ICLR-320,这是一个从ICLR 2024会议中提取的问题-解决方案对的策划数据集。实验表明,我们的框架在专家评估的新颖性、可行性和有效性等指标上显著优于最先进的基线。
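The abstract does not say which unbiased variant of Group Relative Policy Optimization is used. As a hedged illustration of the general idea, the sketch below computes group-relative advantages for strict binary rewards by subtracting the group mean, without dividing by the group's standard deviation or by response length. That choice is an assumption made for this example, and the function name `group_centered_advantages` is hypothetical.

```python
def group_centered_advantages(rewards):
    """Advantage of each sampled response relative to its group's mean reward.

    Assumption for this sketch: advantages are only mean-centered, with no
    standard-deviation or length normalization applied.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Four responses sampled for one prompt; the judge accepts only the first.
print(group_centered_advantages([1, 0, 0, 0]))
```

With binary rewards, a group in which every response receives the same reward yields all-zero advantages, so only groups with mixed verdicts contribute a learning signal.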
cs.AI / 14 / 2604.16736
When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis
当智能体变得沉默:LLM文档合成的输出生成能力与格式成本分离
Abstract
LLM-powered coding agents suffer from a poorly understood failure mode we term output stalling: the agent silently produces empty responses when attempting to generate large, format-heavy documents. We present a theoretical framework that explains and prevents this failure through three contributions. (1) We introduce Output Generation Capacity (OGC), a formal measure of an agent's effective ability to produce output given its current context state - distinct from and empirically smaller than the raw context window. (2) We prove a Format-Cost Separation Theorem showing that deferred template rendering is always at least as token-efficient as direct generation for any format with overhead multiplier $\mu_f > 1$, and derive tight bounds on the savings. (3) We formalize Adaptive Strategy Selection, a decision framework that maps the ratio of estimated output cost to available OGC into an optimal generation strategy (direct, chunked, or deferred). We validate the theory through controlled experiments across three models (Claude 3.5 Sonnet, GPT-4o, Llama 3.1 70B), four document types, and an ablation study isolating each component's contribution. Deferred rendering reduces LLM generation tokens by 48-72% across all conditions and eliminates output stalling entirely. We instantiate the framework as GEN-PILOT, an open-source MCP server, demonstrating that the theory translates directly into a practical tool.
Chinese Translation
基于LLM的编码智能体面临一种我们称之为输出停滞的失败模式:在尝试生成大型、格式复杂的文档时,智能体会默默地产生空响应。我们提出了一个理论框架,通过三项贡献来解释和防止这种失败。(1) 我们引入了输出生成能力(Output Generation Capacity, OGC),这是一个正式的度量,表示在当前上下文状态下智能体有效生成输出的能力,它与原始上下文窗口不同且在经验上更小。(2) 我们证明了格式成本分离定理,表明对于任何具有开销乘数 $\mu_f > 1$ 的格式,延迟模板渲染的令牌效率总是至少与直接生成相当,并推导出节省的紧密界限。(3) 我们形式化了自适应策略选择(Adaptive Strategy Selection),这是一个决策框架,将估计的输出成本与可用OGC的比率映射为最佳生成策略(直接、分块或延迟)。我们通过对三种模型(Claude 3.5 Sonnet、GPT-4o、Llama 3.1 70B)、四种文档类型以及一个隔离每个组件贡献的消融研究进行的控制实验来验证该理论。延迟渲染在所有条件下将LLM生成的令牌减少了48-72%,并完全消除了输出停滞。我们将该框架实例化为GEN-PILOT,一个开源的MCP服务器,展示了该理论如何直接转化为一个实用工具。
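The following minimal sketch illustrates two ideas from this abstract under simplifying assumptions. It assumes that direct generation costs the content tokens multiplied by the overhead multiplier $\mu_f$, while deferred rendering makes the model emit only the content. It also assumes that the cost-to-OGC ratio maps to a strategy through two cut-offs. The cut-off values, the ordering of the strategies, and all function names are hypothetical and are not taken from the paper or from GEN-PILOT.

```python
def savings_fraction(overhead_multiplier):
    """Fraction of generation tokens saved by deferring format rendering.

    Simplified model: direct cost = content * mu_f, deferred cost = content.
    """
    if overhead_multiplier <= 1.0:
        raise ValueError("the sketch assumes an overhead multiplier above 1")
    return 1.0 - 1.0 / overhead_multiplier


def select_strategy(estimated_cost, available_ogc, direct_max=0.5, chunked_max=1.0):
    """Map the ratio of estimated output cost to available capacity to a strategy.

    The thresholds direct_max and chunked_max are illustrative placeholders.
    """
    if available_ogc <= 0:
        return "deferred"
    ratio = estimated_cost / available_ogc
    if ratio <= direct_max:
        return "direct"
    if ratio <= chunked_max:
        return "chunked"
    return "deferred"


print(savings_fraction(2.0))        # a format that doubles the token count
print(select_strategy(800, 1000))   # cost close to the available capacity
```

Under this simplified model a format with $\mu_f = 2$ would save half of the generation tokens, which falls inside the 48-72% range the abstract reports, though the paper's actual bounds may be derived differently.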
cs.AI / 15 / 2604.16742
CT Open: An Open-Access, Uncontaminated, Live Platform for the Open Challenge of Clinical Trial Outcome Prediction
CT Open:一个开放获取、无污染的实时平台,用于临床试验结果预测的开放挑战
Abstract
Scientists have long sought to accurately predict outcomes of real-world events before they happen. Can AI systems do so more reliably? We study this question through clinical trial outcome prediction, a high-stakes open challenge even for domain experts. We introduce CT Open, an open-access, live platform that will run four challenges every year. Anyone can submit predictions for each challenge. CT Open evaluates those submissions on trials whose outcomes were not yet public at the time of submission but were made public afterwards. Determining whether a trial's outcome was public on the internet before a certain date is surprisingly difficult. Outcomes posted on official registries may lag behind by years, while the first mention may appear in obscure articles. To address this, we propose a novel, fully automated decontamination pipeline that uses iterative LLM-powered web search to identify the earliest mention of trial outcomes. We validate the pipeline's quality and accuracy against human experts' annotations. Since CT Open's pipeline ensures that every evaluated trial had no publicly reported outcome when the prediction was made, it allows participants to use any methodology and any data source. In this paper, we release a training set and two time-stamped test benchmarks, Winter 2025 and Summer 2025. We believe CT Open can serve as a central hub for advancing AI research on forecasting real-world outcomes before they occur, while also informing biomedical research and improving clinical trial design. The CT Open platform is hosted at https://ct-open.net/.
Chinese Translation
科学家们长期以来一直寻求在现实事件发生之前准确预测其结果。人工智能系统能否更可靠地做到这一点?我们通过临床试验结果预测来研究这个问题,这是一项即使对领域专家而言也属高风险的开放挑战。我们推出了CT Open,一个开放获取的实时平台,每年将举办四个挑战。任何人都可以为每个挑战提交预测。CT Open在提交时结果尚未公开、之后才公开的试验上评估这些提交。确定某个试验的结果在特定日期之前是否已在互联网上公开,出乎意料地困难。官方注册处发布的结果可能会滞后多年,而首次提及可能出现在不知名的文章中。为了解决这个问题,我们提出了一种新颖的完全自动化去污染流程,该流程利用迭代的基于大型语言模型(LLM)的网络搜索来识别试验结果的最早提及。我们通过人类专家的注释验证了该流程的质量和准确性。由于CT Open的流程确保每个评估的试验在预测时没有公开报告的结果,因此它允许参与者使用任何方法和任何数据源。在本文中,我们发布了一个训练集和两个带时间戳的测试基准,分别为2025年冬季和2025年夏季。我们相信CT Open可以作为推动人工智能研究在现实世界结果预测方面的中心枢纽,同时为生物医学研究提供信息并改善临床试验设计。CT Open平台托管于 https://ct-open.net/
cs.AI / 16 / 2604.16745
Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals
无训练的标记减少为何会崩溃:成对评分信号的固有不稳定性
Abstract
Training-free token reduction methods for Vision Transformers (ToMe, ToFu, PiToMe, and MCTF) employ different scoring mechanisms, yet they share a closely matched cliff-like collapse at high compression. This paper explains \emph{why}. We develop a diagnostic framework with two tools, ranking consistency $\rho_s$ and off-diagonal correlation $\rho_\text{off}$, that decomposes the collapse into (1) a signal-agnostic error amplifier inherent to layer-wise reduction, predicting convex Pareto curves and $r_{\text{crit}} \propto 1/L$; and (2) shared reliance on \emph{pairwise} similarity signals whose ranking consistency degrades from $\rho_s{=}0.88$ to $0.27$ in deep layers. Pairwise rankings are inherently unstable ($O(N_p^2)$ joint perturbations) while unary signals enjoy greater stability ($O(N_p)$ perturbations, CLT). From three design principles derived from this diagnosis, we construct CATIS as a constructive validation: unary signals raise the trigger threshold, triage suppresses the gain. On ViT-Large at 63% FLOPs reduction, CATIS retains 96.9% of vanilla accuracy (81.0%) on ImageNet-1K where all baselines collapse to 43-65%.
Chinese Translation
无训练的标记减少方法(如 ToMe、ToFu、PiToMe 和 MCTF)针对视觉变换器采用了不同的评分机制,但在高压缩率下,它们却表现出相似的陡峭崩溃现象。本文解释了\textit{原因}。我们开发了一个诊断框架,使用两个工具:排名一致性 $\rho_s$ 和非对角相关性 $\rho_\text{off}$,将崩溃分解为(1)层级减少固有的信号无关误差放大器,预测凸帕累托曲线和 $r_{\text{crit}} \propto 1/L$;(2)对\textit{成对}相似性信号的共同依赖,其排名一致性在深层中从 $\rho_s{=}0.88$ 降低到 $0.27$。成对排名本质上是不稳定的($O(N_p^2)$ 联合扰动),而单一信号则享有更大的稳定性($O(N_p)$ 扰动,中心极限定理)。基于这一诊断,我们提出了三个设计原则,并构建了 CATIS 作为一种建设性验证:单一信号提高触发阈值,分类抑制增益。在 ViT-Large 上进行 63% FLOPs 减少时,CATIS 保留了 96.9% 的原始准确率(81.0%)在 ImageNet-1K 上,而所有基线的准确率崩溃至 43-65%。
cs.AI / 17 / 2604.16752
Don't Start What You Can't Finish: A Counterfactual Audit of Support-State Triage in LLM Agents
不要开始你无法完成的事情:对大型语言模型代理支持状态分诊的反事实审计
Abstract
Current agent evaluations largely reward execution on fully specified tasks, while recent work studies clarification [11, 22, 2], capability awareness [9, 1], abstention [8, 14], and search termination [20, 5] mostly in isolation. This leaves open whether agents can diagnose why a task is blocked before acting. We introduce the Support-State Triage Audit (SSTA-32), a matched-item diagnostic framework in which minimal counterfactual edits flip the same base request across four support states: Complete (ANSWER), Clarifiable (CLARIFY), Support-Blocked (REQUEST SUPPORT), and Unsupported-Now (ABSTAIN). We evaluate a frontier model under four prompting conditions - Direct, Action-Only, Confidence-Only, and a typed Preflight Support Check (PSC) - using Dual-Persona Auto-Auditing (DPAA) with deterministic heuristic scoring. Default execution overcommits heavily on non-complete tasks (41.7% overcommitment rate). Scalar confidence mapping avoids overcommitment but collapses the three-way deferral space (58.3% typed deferral accuracy). Conversely, both Action-Only and PSC achieve 91.7% typed deferral accuracy by surfacing the categorical ontology in the prompt. Targeted ablations confirm that removing the support-sufficiency dimension selectively degrades REQUEST SUPPORT accuracy, while removing the evidence-sufficiency dimension triggers systematic overcommitment on unsupported items. Because DPAA operates within a single context window, these results represent upper-bound capability estimates; nonetheless, the structural findings indicate that frontier models possess strong latent triage capabilities that require explicit categorical decision paths to activate safely.
Chinese Translation
当前的代理评估主要奖励在完全指定任务上的执行,而最近的研究则主要孤立地研究了澄清 [11, 22, 2]、能力意识 [9, 1]、弃权 [8, 14] 和搜索终止 [20, 5]。这使得尚不清楚代理是否能够在行动之前诊断任务被阻塞的原因。我们引入了支持状态分诊审计(Support-State Triage Audit, SSTA-32),这是一个匹配项诊断框架,其中最小的反事实编辑可以在四种支持状态下翻转相同的基本请求:完成(Complete, ANSWER)、可澄清(Clarifiable, CLARIFY)、支持阻塞(Support-Blocked, REQUEST SUPPORT)和当前不支持(Unsupported-Now, ABSTAIN)。我们在四种提示条件下评估了一个前沿模型——直接(Direct)、仅行动(Action-Only)、仅信心(Confidence-Only)和一种类型化的预飞行支持检查(typed Preflight Support Check, PSC)——使用双重角色自动审计(Dual-Persona Auto-Auditing, DPAA)和确定性启发式评分。默认执行在非完成任务上严重过度承诺(41.7% 的过度承诺率)。标量信心映射避免了过度承诺,但压缩了三向延迟空间(58.3% 的类型化延迟准确率)。相反,仅行动和 PSC 通过在提示中呈现分类本体,实现了 91.7% 的类型化延迟准确率。针对性消融实验确认,去除支持充分性维度会选择性降低请求支持的准确率,而去除证据充分性维度则会导致在不支持项上的系统性过度承诺。由于 DPAA 在单一上下文窗口内操作,这些结果代表了能力的上限估计;尽管如此,结构性发现表明,前沿模型具有强大的潜在分诊能力,需要明确的分类决策路径才能安全激活。
cs.AI / 18 / 2604.16753
Know When to Trust the Skill: Delayed Appraisal and Epistemic Vigilance for Single-Agent LLMs
知道何时信任技能:单一代理大型语言模型的延迟评估与认知警觉
Abstract
As large language models (LLMs) transition into autonomous agents integrated with extensive tool ecosystems, traditional routing heuristics increasingly succumb to context pollution and "overthinking". We argue that the bottleneck is not a deficit in algorithmic capability or skill diversity, but the absence of disciplined second-order metacognitive governance. In this paper, our scientific contribution focuses on the computational translation of human cognitive control - specifically, delayed appraisal, epistemic vigilance, and region-of-proximal offloading - into a single-agent architecture. We introduce MESA-S (Metacognitive Skills for Agents, Single-agent), a preliminary framework that shifts scalar confidence estimation into a vector separating self-confidence (parametric certainty) from source-confidence (trust in retrieved external procedures). By formalizing a delayed procedural probe mechanism and introducing Metacognitive Skill Cards, MESA-S decouples the awareness of a skill's utility from its token-intensive execution. Evaluated under an In-Context Static Benchmark Evaluation natively executed via Gemini 3.1 Pro, our early results suggest that explicitly programming trust provenance and delayed escalation mitigates supply-chain vulnerabilities, prunes unnecessary reasoning loops, and prevents offloading-induced confidence inflation. This architecture offers a scientifically cautious, behaviorally anchored step toward reliable, epistemically vigilant single-agent orchestration.
Chinese Translation
随着大型语言模型(LLMs)转变为与广泛工具生态系统集成的自主代理,传统的路由启发式方法越来越容易受到上下文污染和“过度思考”的影响。我们认为,瓶颈并不在于算法能力或技能多样性的缺乏,而在于缺乏有序的二阶元认知治理。在本文中,我们的科学贡献集中于将人类认知控制的计算转化——特别是延迟评估、认知警觉和近端卸载区域——应用于单一代理架构。我们介绍了MESA-S(代理的元认知技能,单一代理),这是一个初步框架,将标量信心估计转变为一个向量,区分自信(参数确定性)与源信心(对检索外部程序的信任)。通过形式化延迟程序探测机制并引入元认知技能卡,MESA-S将技能效用的意识与其高代币执行解耦。在通过Gemini 3.1 Pro本地执行的上下文静态基准评估下进行评估,我们的早期结果表明,明确编程信任来源和延迟升级可以减轻供应链脆弱性,修剪不必要的推理循环,并防止卸载引起的信心膨胀。这种架构为可靠的、具有认知警觉的单一代理编排提供了一个科学谨慎、行为锚定的步骤。
cs.AI / 19 / 2604.16755
Machine individuality: Separating genuine idiosyncrasy from response bias in large language models
机器个体性:在大型语言模型中区分真实特性与反应偏差
Abstract
As large language models (LLMs) are increasingly integrated into daily life, in roles ranging from high-stakes decision support to companionship, understanding their behavioral dispositions becomes critical. A growing literature uses psychometric inventories and cognitive paradigms to profile LLM dispositions. However, these approaches cannot determine whether behavioral differences reflect stable, stimulus-specific individuality or global response biases and stochastic noise. Here, we apply crossed random-effects models -- widely used in psychometrics to separate systematic effects -- to 74.9 million ratings provided by 10 open-weight LLMs for over 100,000 words across 14 psycholinguistic norms. On average, 16.9% of variance is attributable to stimulus-specific individuality, robustly exceeding a statistical null model. Cross-norm prediction analyses reveal this individuality as a coherent fingerprint, unique to each model. These results identify individual differences among LLMs that cannot be attributed to response biases or stochastic noise. We term these differences machine individuality.
Chinese Translation
随着大型语言模型(LLMs)越来越多地融入日常生活,从高风险决策支持到陪伴等角色,理解它们的行为倾向变得至关重要。越来越多的文献利用心理测量工具和认知范式来描绘LLM的倾向。然而,这些方法无法确定行为差异是反映稳定的、特定刺激的个体性,还是全局反应偏差和随机噪声。在此,我们应用交叉随机效应模型(一种在心理测量学中广泛用于分离系统效应的方法),对10个开放权重的LLM在14个心理语言学标准下对超过100,000个词汇提供的7490万条评分进行分析。平均而言,16.9%的方差可归因于特定刺激的个体性,稳健地超过统计零模型。跨标准预测分析揭示了这种个体性作为一种连贯的指纹,独特于每个模型。这些结果识别出LLM之间的个体差异,这些差异无法归因于反应偏差或随机噪声。我们将这些差异称为机器个体性。
cs.AI / 20 / 2604.16776
SAVE: A Generalizable Framework for Multi-Condition Single-Cell Generation with Gene Block Attention
SAVE:一种具有基因块注意力的多条件单细胞生成通用框架
Abstract
Modeling single-cell gene expression across diverse biological and technical conditions is crucial for characterizing cellular states and simulating unseen scenarios. Existing methods often treat genes as independent tokens, overlooking their high-level biological relationships and leading to poor performance. We introduce SAVE, a unified generative framework based on conditional Transformers for multi-condition single-cell modeling. SAVE leverages a coarse-grained representation by grouping semantically related genes into blocks, capturing higher-order dependencies among gene modules. A Flow Matching mechanism and condition-masking strategy further enhance flexible simulation and enable generalization to unseen condition combinations. We evaluate SAVE on a range of benchmarks, including conditional generation, batch effect correction, and perturbation prediction. SAVE consistently outperforms state-of-the-art methods in generation fidelity and extrapolative generalization, especially in low-resource or combinatorially held-out settings. Overall, SAVE offers a scalable and generalizable solution for modeling complex single-cell data, with broad utility in virtual cell synthesis and biological interpretation. Our code is publicly available at https://github.com/fdu-wangfeilab/sc-save
Chinese Translation
在多种生物和技术条件下对单细胞基因表达进行建模对于表征细胞状态和模拟未见场景至关重要。现有方法通常将基因视为独立的标记,忽视了它们之间的高级生物关系,从而导致性能不佳。我们提出了SAVE,这是一种基于条件Transformer的统一生成框架,用于多条件单细胞建模。SAVE通过将语义相关的基因分组为块,利用粗粒度表示,捕捉基因模块之间的高阶依赖关系。流匹配机制和条件掩蔽策略进一步增强了灵活的模拟能力,并使其能够推广到未见的条件组合。我们在一系列基准测试中评估了SAVE,包括条件生成、批次效应校正和扰动预测。SAVE在生成保真度和外推泛化方面始终优于最先进的方法,尤其是在资源稀缺或组合性保留的设置中。总体而言,SAVE为建模复杂的单细胞数据提供了一种可扩展和可推广的解决方案,在虚拟细胞合成和生物学解释方面具有广泛的应用。我们的代码已公开发布在 https://github.com/fdu-wangfeilab/sc-save
cs.AI / 21 / 2604.16812
Introspection Adapters: Training LLMs to Report Their Learned Behaviors
内省适配器:训练大型语言模型报告其学习行为
Abstract
When model developers or users fine-tune an LLM, this can induce behaviors that are unexpected, deliberately harmful, or hard to detect. It would be far easier to audit LLMs if they could simply describe their behaviors in natural language. Here, we study a scalable approach to rapidly identify learned behaviors of many LLMs derived from a shared base LLM. Given a model $M$, our method works by finetuning models $M_i$ from $M$ with implanted behaviors $b_i$; the $(M_i, b_i)$ pairs serve as labeled training data. We then train an \emph{introspection adapter} (IA): a single LoRA adapter jointly trained across the finetunes $M_i$ to cause them to verbalize their implanted behaviors. We find that this IA induces self-description of learned behaviors even in finetunes of $M$ that were trained in very different ways from the $M_i$. For example, IAs generalize to AuditBench, achieving state-of-the-art at identifying explicitly hidden concerning behaviors. IAs can also be used to detect encrypted finetuning API attacks. They scale favorably with model size and training data diversity. Overall, our results suggest that IAs are a scalable, effective, and practically useful approach to auditing fine-tuned LLMs.
Chinese Translation
当模型开发者或用户对大型语言模型(LLM)进行微调时,这可能会引发意想不到的、故意有害的或难以检测的行为。如果LLM能够用自然语言简单描述其行为,审计将变得容易得多。在此,我们研究了一种可扩展的方法,以快速识别从共享基础LLM派生的多个LLM的学习行为。给定模型$M$,我们的方法通过对$M$进行微调,生成植入行为$b_i$的模型$M_i$;$(M_i, b_i)$对作为标记训练数据。然后,我们训练一个\textit{内省适配器}(IA):一个在微调$M_i$上共同训练的单个LoRA适配器,使其能够口头表达其植入的行为。我们发现,即使在与$M_i$的训练方式非常不同的$M$的微调中,这个IA也能引发对学习行为的自我描述。例如,IA在AuditBench上具有良好的泛化能力,能够以最先进的水平识别明确隐藏的令人担忧的行为。IA还可以用于检测加密的微调API攻击。它们在模型规模和训练数据多样性方面表现出良好的扩展性。总体而言,我们的结果表明,IA是一种可扩展、有效且在实践中有用的审计微调LLM的方法。
cs.AI / 22 / 2604.16813
PersonalHomeBench: Evaluating Agents in Personalized Smart Homes
PersonalHomeBench:在个性化智能家居中评估智能体
Abstract
Agentic AI systems are rapidly advancing toward real-world applications, yet their readiness in complex and personalized environments remains insufficiently characterized. To address this gap, we introduce PersonalHomeBench, a benchmark for evaluating foundation models as agentic assistants in personalized smart home environments. The benchmark is constructed through an iterative process that progressively builds rich household states, which are then used to generate personalized, context-dependent tasks. To support realistic agent-environment interaction, we provide PersonalHomeTools, a comprehensive toolbox enabling household information retrieval, appliance control, and situational understanding. PersonalHomeBench evaluates both reactive and proactive agentic abilities under unimodal and multimodal observations. Thorough experimentation reveals a systematic performance reduction as task complexity increases, with pronounced failures in counterfactual reasoning and under partial observability, where effective tool-based information gathering is required. These results position PersonalHomeBench as a rigorous evaluation platform for analyzing the robustness and limitations of personalized agentic reasoning and planning.
Chinese Translation
智能体人工智能系统正在迅速向现实世界应用发展,但它们在复杂和个性化环境中的准备情况仍然不足以进行充分表征。为了解决这一问题,我们引入了PersonalHomeBench,这是一个用于评估基础模型作为个性化智能家居环境中智能助手的基准。该基准通过一个迭代过程构建,逐步建立丰富的家庭状态,然后用于生成个性化的、依赖于上下文的任务。为了支持现实的智能体与环境的交互,我们提供了PersonalHomeTools,这是一个全面的工具箱,能够实现家庭信息检索、家电控制和情境理解。PersonalHomeBench在单模态和多模态观察下评估反应性和主动性智能体能力。全面的实验表明,随着任务复杂性的增加,性能系统性下降,尤其是在反事实推理和部分可观察性下出现明显失败,此时需要有效的基于工具的信息收集。这些结果使PersonalHomeBench成为分析个性化智能体推理和规划的鲁棒性及局限性的严格评估平台。
cs.AI / 23 / 2604.16835
The CTLNet for Shanghai Composite Index Prediction
用于上海综合指数预测的CTLNet
Abstract
Shanghai Composite Index prediction has become a hot issue for many investors and academic researchers. Deep learning models are widely applied in multivariate time series forecasting, including recurrent neural networks (RNN), convolutional neural networks (CNN), and transformers. Specifically, the Transformer encoder, with its unique attention mechanism and parallel processing capabilities, has become an important tool in time series prediction, and has an advantage in dealing with long sequence dependencies and multivariate data correlations. Drawing on the strengths of various models, we propose the CNN-Transformer-LSTM Networks (CTLNet). This paper explores the application of CTLNet for Shanghai Composite Index prediction and the comparative experiments show that the proposed model outperforms state-of-the-art baselines.
Chinese Translation
上海综合指数预测已成为许多投资者和学术研究者关注的热点问题。深度学习模型广泛应用于多变量时间序列预测,包括递归神经网络(RNN)、卷积神经网络(CNN)和变换器(Transformers)。具体而言,变换器编码器凭借其独特的注意力机制和并行处理能力,已成为时间序列预测的重要工具,并在处理长序列依赖和多变量数据相关性方面具有优势。基于各种模型的优点,我们提出了卷积神经网络-变换器-长短期记忆网络(CNN-Transformer-LSTM Networks,CTLNet)。本文探讨了CTLNet在上海综合指数预测中的应用,比较实验表明,所提出的模型优于最先进的基准模型。
cs.AI / 24 / 2604.16859
GAMMA-Net: Adaptive Long-Horizon Traffic Spatio-Temporal Forecasting Model based on Interleaved Graph Attention and Multi-Axis Mamba
GAMMA-Net:基于交错图注意力和多轴Mamba的自适应长时间交通时空预测模型
Abstract
Accurate traffic forecasting is crucial for intelligent transportation systems, supporting effective traffic management, congestion reduction, and informed urban planning. However, traditional models often fail to adequately capture the intricate spatio-temporal dependencies present in traffic data. To overcome these limitations, we introduce GAMMA-Net, a novel approach that integrates Graph Attention Networks (GAT) with multi-axis Selective State Space Models (Mamba). The GAT component uses a self-attention mechanism to dynamically adjust the influence of nodes within the traffic network, enabling adaptive spatial dependency modeling based on real-time conditions. Simultaneously, the Mamba module efficiently models long-term temporal and spatial dynamics without the heavy computational cost of conventional recurrent architectures. Extensive experiments on several benchmark traffic datasets, including METR-LA, PEMS-BAY, PEMS03, PEMS04, PEMS07, and PEMS08, show that GAMMA-Net consistently outperforms existing state-of-the-art models across different prediction horizons, achieving up to a 16.25% reduction in Mean Absolute Error (MAE) compared to baseline models. Ablation studies highlight the critical contributions of both the spatial and temporal components, emphasizing their complementary role in improving prediction accuracy. In conclusion, the GAMMA-Net model sets a new standard in traffic forecasting, offering a powerful tool for next-generation traffic management and urban planning. The code for this study is available at https://github.com/hdy6438/GAMMA-Net
Chinese Translation
准确的交通预测对于智能交通系统至关重要,支持有效的交通管理、减少拥堵以及为城市规划提供信息。然而,传统模型往往无法充分捕捉交通数据中复杂的时空依赖关系。为了解决这些局限性,我们提出了GAMMA-Net,这是一种新颖的方法,将图注意力网络(Graph Attention Networks, GAT)与多轴选择性状态空间模型(Selective State Space Models, Mamba)相结合。GAT组件采用自注意力机制,动态调整交通网络中节点的影响力,使得基于实时条件的自适应空间依赖建模成为可能。同时,Mamba模块高效地建模长期的时间和空间动态,而不需要传统递归架构的高计算成本。在多个基准交通数据集上的广泛实验,包括METR-LA、PEMS-BAY、PEMS03、PEMS04、PEMS07和PEMS08,结果表明GAMMA-Net在不同预测范围内始终优于现有的最先进模型,与基线模型相比,平均绝对误差(Mean Absolute Error, MAE)减少了最多16.25%。消融研究强调了空间和时间组件的关键贡献,突出了它们在提高预测准确性方面的互补作用。总之,GAMMA-Net模型为交通预测设定了新的标准,为下一代交通管理和城市规划提供了强大的工具。本研究的代码可在 https://github.com/hdy6438/GAMMA-Net 获取。
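The headline figure in this abstract is a relative reduction in Mean Absolute Error. As a small worked illustration, the sketch below computes MAE and the relative reduction between a baseline and a model. The helper names and the numbers are hypothetical and are not results from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observations and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and equal in length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(baseline_error, model_error):
    """Fraction by which model_error undercuts baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline_error - model_error) / baseline_error


# Hypothetical traffic speeds (observed vs. predicted).
print(mean_absolute_error([1.0, 2.0, 3.0], [2.0, 2.0, 5.0]))
# A hypothetical baseline MAE of 4.0 against a model MAE of 3.35.
print(relative_reduction(4.0, 3.35))
```

The second call shows what a 16.25% reduction means in absolute terms: a baseline MAE of 4.0 would have to fall to 3.35.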
cs.AI / 25 / 2604.16871
GRAIL: Autonomous Concept Grounding for Neuro-Symbolic Reinforcement Learning
GRAIL:用于神经符号强化学习的自主概念基础
Abstract
Neuro-symbolic Reinforcement Learning (NeSy-RL) combines symbolic reasoning with gradient-based optimization to achieve interpretable and generalizable policies. Relational concepts, such as "left of" or "close by", serve as foundational building blocks that structure how agents perceive and act. However, conventional approaches require human experts to manually define these concepts, limiting adaptability since concept semantics vary across environments. We propose GRAIL (Grounding Relational Agents through Interactive Learning), a framework that autonomously grounds relational concepts through environmental interaction. GRAIL leverages large language models (LLMs) to provide generic concept representations as weak supervision, then refines them to capture environment-specific semantics. This approach addresses both sparse reward signals and concept misalignment prevalent in underdetermined environments. Experiments on the Atari games Kangaroo, Seaquest, and Skiing demonstrate that GRAIL matches or outperforms agents with manually crafted concepts in simplified settings, and reveals informative trade-offs between reward maximization and high-level goal completion in the full environment.
Chinese Translation
神经符号强化学习(NeSy-RL)将符号推理与基于梯度的优化相结合,以实现可解释和可泛化的策略。关系概念,如“左侧”或“附近”,作为基础构建块,构建了代理的感知和行动方式。然而,传统方法需要人类专家手动定义这些概念,这限制了适应性,因为概念语义在不同环境中有所不同。我们提出了GRAIL(通过交互学习基础关系代理),这是一个通过环境交互自主基础关系概念的框架。GRAIL利用大型语言模型(LLMs)提供通用概念表示作为弱监督,然后对其进行细化,以捕捉特定环境的语义。这种方法解决了稀疏奖励信号和在欠定环境中普遍存在的概念不一致问题。在Atari游戏《袋鼠》、《海洋探险》和《滑雪》上的实验表明,GRAIL在简化环境中与手动构建概念的代理相匹配或表现更好,并揭示了在完整环境中奖励最大化与高层目标完成之间的信息权衡。
cs.AI / 26 / 2604.16890
Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning
Step-GRPO:内化动态提前退出以实现高效推理
Abstract
Large reasoning models that use long chain-of-thought excel at problem-solving yet waste compute on redundant checks. Curbing this overthinking is hard: training-time length penalties can cripple ability, while inference-time early-exit adds system overhead. To bridge this gap, we propose Step-GRPO, a novel post-training framework that internalizes dynamic early-exit capabilities directly into the model. Step-GRPO shifts the optimization objective from raw tokens to semantic steps by utilizing linguistic markers to structure reasoning. We introduce a Dynamic Truncated Rollout mechanism that exposes the model to concise high-confidence trajectories during exploration, synergized with a Step-Aware Relative Reward that dynamically penalizes redundancy based on group-level baselines. Extensive experiments across three model sizes on diverse benchmarks demonstrate that Step-GRPO achieves a superior accuracy-efficiency trade-off. On Qwen3-8B, our method reduces token consumption by 32.0% compared to the vanilla model while avoiding the accuracy degradation observed in traditional length-penalty methods.
Chinese Translation
使用长链推理的大型推理模型在解决问题方面表现出色,但在冗余检查上浪费了计算资源。遏制这种过度思考是困难的:训练时的长度惩罚可能会削弱模型能力,而推理时的提前退出则增加了系统开销。为了解决这一问题,我们提出了Step-GRPO,一种新颖的后训练框架,直接将动态提前退出能力内化到模型中。Step-GRPO通过利用语言标记来构建推理,将优化目标从原始标记转移到语义步骤。我们引入了一种动态截断展开机制,在探索过程中使模型接触到简洁的高置信度轨迹,并结合了一种动态惩罚冗余的步骤感知相对奖励,该奖励基于群体水平的基线进行动态惩罚。在三个模型规模和多样化基准上的广泛实验表明,Step-GRPO在准确性和效率之间实现了优越的权衡。在Qwen3-8B上,我们的方法相比于原始模型减少了32.0%的标记消耗,同时避免了传统长度惩罚方法中观察到的准确性下降。
cs.AI / 27 / 2604.16902
Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models
超越文本主导性:理解全模态大型语言模型的模态偏好
Abstract
Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this gap, we first systematically quantify modality preference of OLLMs using a newly-curated conflict-based benchmark and the modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the "text-dominance" of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To further understand the underlying mechanism, we conduct layer-wise probing and demonstrate that such modality preference is not static but emerges progressively in the mid-to-late layers. Building upon these insights, we leverage these internal signals to diagnose cross-modal hallucinations, achieving competitive performance across three downstream multi-modal benchmarks without task-specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: https://github.com/icip-cas/OmniPreference
Chinese Translation
原生全模态大型语言模型(OLLMs)已从管道架构转向统一表示空间。然而,这种原生整合引发了一种关键但尚未深入探讨的现象:模态偏好。为填补这一空白,我们首先使用新近整理的基于冲突的基准和模态选择率指标系统地量化了OLLMs的模态偏好。对十个代表性OLLMs的评估揭示了一个显著的范式转变:与传统视觉语言模型(VLMs)的“文本主导性”不同,大多数OLLMs表现出明显的视觉偏好。为了进一步理解其潜在机制,我们进行层级探测,证明这种模态偏好并非静态,而是在中后层逐渐显现。基于这些见解,我们利用这些内部信号来诊断跨模态幻觉,在三个下游多模态基准上实现了具有竞争力的表现,而无需特定任务的数据。我们的工作不仅提供了机制理解,还为构建更可信的OLLMs提供了实用工具。我们的代码和相关资源已公开可用,网址为:https://github.com/icip-cas/OmniPreference
cs.AI / 28 / 2604.16911
Skilldex: A Package Manager and Registry for Agent Skill Packages with Hierarchical Scope-Based Distribution
Skilldex:一种具有层次范围分发的智能体技能包的包管理器和注册中心
Abstract
Large Language Model (LLM) agents are increasingly extended at runtime via skill packages, structured natural-language instruction bundles loaded from a well-known directory. Community install tooling and registries exist, but two gaps persist: no public tool scores skill packages against Anthropic's published format specification, and no mechanism bundles related skills with the shared context they need to remain mutually coherent. We present Skilldex, a package manager and registry for agent skill packages addressing both gaps. The two novel contributions are: (1) compiler-style format conformance scoring against Anthropic's skill specification, producing line-level diagnostics on description specificity, frontmatter validity, and structural adherence; and (2) the skillset abstraction, a bundled collection of related skills with shared assets (vocabulary files, templates, reference documents) that enforce cross-skill behavioral coherence. Skilldex also provides supporting infrastructure: a three-tier hierarchical scope system, a human-in-the-loop agent suggestion loop, a metadata-only community registry, and a Model Context Protocol (MCP) server. The system is implemented as a TypeScript CLI (skillpm / spm) with a Hono/Supabase registry backend, and is open-source.
Chinese Translation
大型语言模型(LLM)智能体越来越多地通过技能包在运行时进行扩展,这些技能包是从知名目录加载的结构化自然语言指令集合。虽然社区安装工具和注册中心已经存在,但仍然存在两个缺口:没有公共工具根据Anthropic发布的格式规范对技能包进行评分,也没有机制将相关技能与它们所需的共享上下文捆绑在一起,以保持相互一致性。我们提出了Skilldex,这是一种针对智能体技能包的包管理器和注册中心,解决了这两个缺口。两个新颖的贡献是:(1)根据Anthropic的技能规范进行编译器风格的格式一致性评分,提供关于描述特异性、前言有效性和结构遵循性的逐行诊断;(2)技能集抽象,即一组捆绑的相关技能,具有共享资产(词汇文件、模板、参考文档),以强制执行跨技能的行为一致性。Skilldex还提供支持基础设施:一个三层层次范围系统、一个人机协作的智能体建议循环、一个仅包含元数据的社区注册中心,以及一个模型上下文协议(Model Context Protocol, MCP)服务器。该系统实现为一个TypeScript命令行工具(skillpm / spm),并具有Hono/Supabase注册中心后端,且为开源。
cs.AI / 29 / 2604.16913
The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus
认知惩罚:在边缘原生小语言模型中消融系统1和系统2推理以实现去中心化共识
Abstract
Decentralized Autonomous Organizations (DAOs) are inclined to explore Small Language Models (SLMs) as edge-native constitutional firewalls to vet proposals and mitigate semantic social engineering. While scaling inference-time compute (System 2) enhances formal logic, its efficacy in highly adversarial, cryptoeconomic governance environments remains underexplored. To address this, we introduce Sentinel-Bench, an 840-inference empirical framework executing a strict intra-model ablation on Qwen-3.5-9B. By toggling latent reasoning across frozen weights, we isolate the impact of inference-time compute against an adversarial Optimism DAO dataset. Our findings reveal a severe compute-accuracy inversion. The autoregressive baseline (System 1) achieved 100% adversarial robustness, 100% juridical consistency, and state finality in under 13 seconds. Conversely, System 2 reasoning introduced catastrophic instability, fundamentally driven by a 26.7% Reasoning Non-Convergence (cognitive collapse) rate. This collapse degraded trial-to-trial consensus stability to 72.6% and imposed a 17x latency overhead, introducing critical vulnerabilities to Governance Extractable Value (GEV) and hardware centralization. While rare (1.5% of adversarial trials), we empirically captured "Reasoning-Induced Sycophancy," where the model generated significantly longer internal monologues (averaging 25,750 characters) to rationalize failing the adversarial trap. We conclude that for edge-native SLMs operating under Byzantine Fault Tolerance (BFT) constraints, System 1 parameterized intuition is structurally and economically superior to System 2 iterative deliberation for decentralized consensus. Code and Dataset: https://github.com/smarizvi110/sentinel-bench
Chinese Translation
去中心化自治组织(DAO)倾向于探索小语言模型(SLM)作为边缘原生宪法防火墙,以审查提案并减轻语义社会工程的影响。虽然扩展推理时计算(系统2)增强了形式逻辑,但其在高度对抗性的加密经济治理环境中的有效性仍然未被充分探讨。为了解决这一问题,我们引入了Sentinel-Bench,一个840次推理的实证框架,对Qwen-3.5-9B进行了严格的模型内部消融。通过切换冻结权重下的潜在推理,我们隔离了推理时计算在对抗性Optimism DAO数据集上的影响。我们的研究结果揭示了严重的计算-准确性反转。自回归基线(系统1)在13秒内实现了100%的对抗鲁棒性、100%的司法一致性和状态最终性。相反,系统2推理引入了灾难性的不稳定性,根本上由26.7%的推理非收敛(认知崩溃)率驱动。这一崩溃将试验间共识稳定性降至72.6%,并施加了17倍的延迟开销,给治理可提取价值(GEV)和硬件集中化引入了关键脆弱性。尽管这种情况较为罕见(1.5%的对抗试验),我们实证捕捉到了“推理引发的谄媚”,即模型生成了显著更长的内部独白(平均25,750个字符)以合理化未能通过对抗陷阱。我们得出结论,对于在拜占庭容错(BFT)约束下运行的边缘原生SLM,系统1参数化的直觉在结构和经济上优于系统2迭代的深思熟虑,以实现去中心化共识。代码和数据集:https://github.com/smarizvi110/sentinel-bench
cs.AI / 30 / 2604.16922
ClimAgent: LLM as Agents for Autonomous Open-ended Climate Science Analysis
ClimAgent:作为自主开放式气候科学分析代理的语言模型
Abstract
Climate research is pivotal for mitigating global environmental crises, yet the accelerating volume of multi-scale datasets and the complexity of analytical tools have created significant bottlenecks, constraining scientific discovery to fragmented and labor-intensive workflows. While the emergence of Large Language Models (LLMs) offers a transformative paradigm to scale scientific expertise, existing explorations remain largely confined to simple Question-Answering (Q&A) tasks. These approaches often oversimplify real-world challenges, neglecting the intricate physical constraints and the data-driven nature required in professional climate science. To bridge this gap, we introduce ClimAgent, a general-purpose autonomous framework designed to execute a wide spectrum of research tasks across diverse climate sub-fields. By integrating a unified tool-use environment with rigorous reasoning protocols, ClimAgent transcends simple retrieval to perform end-to-end modeling and analysis. To foster systematic evaluation, we propose ClimaBench, the first comprehensive benchmark for real-world climate discovery. It encompasses challenging problems spanning 5 distinct task categories derived from professional scenarios between 2000 and 2025. Experiments on ClimaBench demonstrate that ClimAgent significantly outperforms state-of-the-art baselines, achieving a 40.21% improvement over original LLM solutions in solution rigorousness and practicality. Our code is available at https://github.com/usail-hkust/ClimAgent.
Chinese Translation
气候研究对于缓解全球环境危机至关重要,但多尺度数据集的快速增长和分析工具的复杂性已造成显著瓶颈,使科学发现受限于碎片化且劳动密集的工作流程。尽管大型语言模型(LLMs)的出现为扩展科学专业知识提供了变革性范式,但现有的探索仍然主要局限于简单的问答(Q&A)任务。这些方法往往过于简化现实世界的挑战,忽视了专业气候科学中所需的复杂物理约束和数据驱动特性。为了解决这一问题,我们引入了ClimAgent,一个通用的自主框架,旨在执行跨多个气候子领域的广泛研究任务。通过将统一的工具使用环境与严格的推理协议相结合,ClimAgent超越了简单的检索,能够进行端到端的建模和分析。为了促进系统评估,我们提出了ClimaBench,这是第一个针对现实气候发现的综合基准,涵盖了2000年至2025年间专业场景中衍生出的5个不同任务类别的挑战性问题。在ClimaBench上的实验表明,ClimAgent显著优于最先进的基线,在解决方案的严谨性和实用性方面相较于原始LLM方案取得了40.21%的提升。我们的代码可在https://github.com/usail-hkust/ClimAgent获取。
cs.AI / 31 / 2604.16923
Alignment Imprint: Zero-Shot AI-Generated Text Detection via Provable Preference Discrepancy
对齐印记:通过可证明的偏好差异进行零样本AI生成文本检测
Abstract
Detecting AI-generated text is an important but challenging problem. Existing likelihood-based detection methods are often sensitive to content complexity and may exhibit unstable performance. In this paper, our key insight is that modern Large Language Models (LLMs) undergo alignment (including fine-tuning and preference tuning), leaving a measurable distributional imprint. We theoretically derive this imprint by abstracting the alignment process as a sequence of constrained optimization steps, showing that the log-likelihood ratio can naturally decompose into implicit instructional biases and preference rewards. We refer to this quantity as the Alignment Imprint. Furthermore, to mitigate the instability in high-entropy regions, we introduce Log-likelihood Alignment Preference Discrepancy (LAPD), a standardized information-weighted statistic based on the alignment imprint. We provide a statistical guarantee that alignment-based statistics dominate Fast-DetectGPT in performance. We also theoretically show that LAPD strictly improves the unweighted alignment scores when the aligned and base models are close in distribution. Extensive experiments show that LAPD achieves an improvement of 45.82% relative to the strongest existing baselines, yielding large and consistent gains across all settings.
Chinese Translation
检测AI生成的文本是一个重要但具有挑战性的问题。现有的基于似然的检测方法通常对内容复杂性敏感,可能表现出不稳定的性能。本文的关键见解是,现代大型语言模型(Large Language Models, LLMs)经历了对齐过程(包括微调和偏好调整),留下可测量的分布印记。我们通过将对齐过程抽象为一系列约束优化步骤,理论上推导出这一印记,表明对数似然比可以自然地分解为隐含的指令偏见和偏好奖励。我们将这一量称为对齐印记(Alignment Imprint)。此外,为了减轻高熵区域的不稳定性,我们引入了对数似然对齐偏好差异(Log-likelihood Alignment Preference Discrepancy, LAPD),这是一种基于对齐印记的标准化信息加权统计量。我们提供了统计保证,表明基于对齐的统计量在性能上优于Fast-DetectGPT。我们还理论上证明,当对齐模型和基础模型在分布上接近时,LAPD严格改善了未加权的对齐得分。大量实验表明,LAPD相较于现有最强基线实现了45.82%的相对提升,在所有设置中均取得了显著且一致的增益。
cs.AI / 32 / 2604.16931
Playing Psychic: Using Thought Trees to Predict Reasoning Models Accuracy on Coding Tasks
扮演通灵者:利用思维树预测推理模型在编码任务上的准确性
Abstract
Recent advances in large language models (LLMs) have shown that test-time scaling can substantially improve model performance on complex tasks, particularly in the coding domain. Under this paradigm, models use a larger token budget during inference to generate intermediate reasoning traces before producing a final answer. However, current evaluations primarily rely on competitive programming benchmarks, which may not capture the full range of reasoning abilities. In this work, we perform a systematic study of frontier reasoning models to understand their performance on real-world coding benchmarks. To gain more insights into the performance of such models, we devise a programmatic way to automatically generate coding tasks of arbitrary difficulty and structure from existing benchmarks. Using this framework, our analysis reveals that the structure of a reasoning trace, not just its contents, is a strong predictor of correctness. Motivated by this, we propose structured thought-trees as a means to represent reasoning traces. To illustrate their use, we train a lightweight classifier on features extracted from thought-trees to predict trace correctness, and demonstrate that flagging and retrying structurally anomalous traces based on the extracted features yields consistent gains at lower complexity levels.
Chinese Translation
近期大型语言模型(LLMs)的进展表明,测试时的规模调整可以显著提升模型在复杂任务上的表现,尤其是在编码领域。在这一范式下,模型在推理过程中使用更大的标记预算,以生成中间推理轨迹,然后再给出最终答案。然而,目前的评估主要依赖于竞争性编程基准,这可能无法全面捕捉推理能力的全貌。在本研究中,我们对前沿推理模型进行了系统研究,以了解它们在真实编码基准上的表现。为了更深入地洞察这些模型的性能,我们设计了一种程序化的方法,从现有基准中自动生成任意难度和结构的编码任务。利用这一框架,我们的分析揭示了推理轨迹的结构,而不仅仅是其内容,是正确性的强预测因子。基于此,我们提出了结构化思维树作为表示推理轨迹的手段。为了说明其使用,我们在从思维树中提取的特征上训练了一个轻量级分类器,以预测轨迹的正确性,并展示了基于提取特征标记和重试结构异常轨迹在较低复杂度水平上能够带来一致的收益。
cs.AI / 33 / 2604.16935
LLMs can persuade only psychologically susceptible humans on societal issues, via trust in AI and emotional appeals, amid logical fallacies
大型语言模型仅能通过对人工智能的信任和情感诉求,影响心理易感的人类在社会问题上的看法,尽管存在逻辑谬误
Abstract
Scarce longitudinal evidence examines LLMs' persuasiveness and humanness along time-evolving psychological frameworks. We introduce Talk2AI, a longitudinal framework quantifying psycho-social, reasoning and affective dimensions of LLMs' persuasiveness about polarizing societal topics. In a four-way longitudinal setup, Talk2AI's 770 participants engaged in structured conversations with one of four leading LLMs on topics like climate change, social media misinformation, and math anxiety. This produced 3,080 conversations over 60,000 turns. After each wave, participants reported conviction in their initial topic stance, perceived opinion change, LLM's perceived humanness, a self-donation to the topic and a textual explanation. Feedback time series showed longitudinal inertia in convictions, indicating some human anchoring to initial opinions even after repeated exposure to AI-generated arguments. Interestingly, NLP analyses revealed that both humans and LLMs relied on fallacious reasoning in 1 conversational quip every 6, countering the "LLMs as superior systems" stereotype behind LLMs' cognitive surrender. LLMs' perceived humanness was most learnable from sociodemographic, psychological and engagement features ($R^2=0.44$), followed by opinion change ($R^2=0.34$), conviction ($R^2=0.26$) and personal endowment ($R^2=0.24$). Crucially, explainable AI (XAI) indicated: (i) the presence of individuals more susceptible to LLM-based opinion changes; (ii) psychological susceptibility to LLM-convincing consisted of having more trust in LLMs, being more agreeable and extraverted and with a higher need for cognition. A multiverse approach with mixed-effects models confirmed XAI results, alongside strong individual differences. Talk2AI provides a grounded framework and evidence for detecting how GenAI can influence human opinions via multiple psycho-social pathways in AI-human digital platforms.
Chinese Translation
关于大型语言模型(LLMs)在时间演变的心理框架下的说服力和人性,缺乏长期的证据。我们引入了Talk2AI,这是一个纵向框架,量化了LLMs在极化社会话题上的说服力的心理-社会、推理和情感维度。在一个四方纵向设置中,Talk2AI的770名参与者与四个领先的LLMs就气候变化、社交媒体虚假信息和数学焦虑等话题进行了结构化对话。这产生了3080次对话,超过60000个回合。在每一轮之后,参与者报告了他们在初始话题立场上的信念、感知的观点变化、LLM的感知人性、对话题的自我捐赠以及文本解释。反馈时间序列显示信念的长期惯性,表明即使在反复接触AI生成的论据后,某些人类仍然对初始观点保持锚定。有趣的是,自然语言处理(NLP)分析揭示,无论是人类还是LLMs,在每6个对话中都有1个使用了谬误推理,这与“LLMs作为优越系统”的刻板印象相悖。LLMs的感知人性最容易从社会人口、心理和参与特征中学习到($R^2=0.44$),其次是观点变化($R^2=0.34$)、信念($R^2=0.26$)和个人捐赠($R^2=0.24$)。至关重要的是,可解释人工智能(XAI)表明:(i)存在更易受LLM影响的个体;(ii)对LLM说服的心理易感性包括对LLMs的更高信任、更高的宜人性和外向性,以及更强的认知需求。多元宇宙方法结合混合效应模型确认了XAI结果,并显示出显著的个体差异。Talk2AI提供了一个扎实的框架和证据,揭示了生成性人工智能如何通过多种心理-社会途径影响人类观点,尤其是在AI与人类的数字平台上。
cs.AI / 34 / 2604.16950
AutoPKG: An Automated Framework for Dynamic E-commerce Product-Attribute Knowledge Graph Construction
AutoPKG:一种自动化框架用于动态电子商务产品属性知识图谱构建
Abstract
Product attribute extraction in e-commerce is bottlenecked by ontologies that are inconsistent, incomplete, and costly to maintain. We present AutoPKG, a multi-agent Large Language Model (LLM) framework that automatically constructs a Product-attribute Knowledge Graph (PKG) from multimodal product content. AutoPKG induces product types and type-specific attribute keys on demand, extracts attribute values from text and images, and consolidates updates through a centralized decision agent that maintains a globally consistent canonical graph. We also propose an evaluation protocol for dynamic PKGs that measures type and key validity, consolidation quality, and edge-level accuracy for value assertions after canonicalization. On a large real-world marketplace catalog dataset from Lazada (Alibaba), AutoPKG achieves up to 0.953 Weighted Knowledge Efficiency (WKE) for product types, 0.724 WKE for attribute keys, and 0.531 edge-level F1 for multimodal value extraction. Across three public benchmarks, our method improves edge-level exact-match F1 by 0.152 and yields a precision gain of 0.208 on the attribute extraction application. Online A/B tests show that AutoPKG-derived attributes increase Gross Merchandise Value (GMV) in Badge by 3.81 percent, in Search by 5.32 percent, and in Recommendation by 7.89 percent, supporting the practical value of AutoPKG in production.
Chinese Translation
电子商务中的产品属性提取受到本体不一致、不完整以及维护成本高的瓶颈。我们提出了AutoPKG,一个多智能体大语言模型(LLM)框架,能够自动从多模态产品内容中构建产品属性知识图谱(PKG)。AutoPKG根据需求诱导产品类型和类型特定的属性键,从文本和图像中提取属性值,并通过一个集中决策代理整合更新,维护一个全局一致的规范图。我们还提出了一种动态PKG的评估协议,衡量类型和键的有效性、整合质量,以及规范化后值断言的边级准确性。在来自Lazada(阿里巴巴)的大型真实市场目录数据集中,AutoPKG在产品类型上实现了高达0.953的加权知识效率(WKE),在属性键上为0.724,在多模态值提取的边级F1上为0.531。在三个公共基准测试中,我们的方法提高了边级精确匹配F1值0.152,并在属性提取应用中获得了0.208的精度提升。在线A/B测试显示,AutoPKG衍生的属性在Badge中使总商品价值(GMV)增加了3.81%,在搜索中增加了5.32%,在推荐中增加了7.89%,支持了AutoPKG在实际生产中的价值。
cs.AI / 35 / 2604.16972
MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models
MCPO:针对大型推理模型的掌握整合策略优化
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach to improve the reasoning abilities of Large Language Models (LLMs). Among RLVR algorithms, Group Relative Policy Optimization (GRPO) and its variants have demonstrated strong performance and high training efficiency. However, GRPO-style objectives exhibit two issues on high accuracy prompts including mastered prompts (rollout accuracy =1) and majority-correct prompts (rollout accuracy in (0.5,1)). For mastered prompts, group-relative advantages vanish, yielding no training signal and unconstrained policy drift that can cause forgetting. For majority-correct prompts, the induced query weight shrinks as accuracy increases, weakening consolidation from partial correctness to mastery. To alleviate this, we propose Mastery-Consolidated Policy Optimization (MCPO), which introduces (i) a hinge-KL regularizer applied exclusively to mastered prompts to bound harmful policy drift between successive gradient steps, and (ii) a weighting mechanism that prioritizes majority-correct prompts to better allocate optimization effort. Extensive experiments across three mathematical benchmarks demonstrate that MCPO consistently improves pass@1 performance. Counter-intuitively, rather than restricting exploration, MCPO boosts pass@k metrics, indicating that mastery consolidation further catalyzes solution diversity.
Chinese Translation
可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)作为一种有前景的方法,旨在提升大型语言模型(Large Language Models, LLMs)的推理能力。在RLVR算法中,群体相对策略优化(Group Relative Policy Optimization, GRPO)及其变体表现出了强大的性能和高效的训练能力。然而,GRPO风格的目标在高准确度提示上存在两个问题,包括掌握提示(rollout accuracy = 1)和多数正确提示(rollout accuracy 在 (0.5, 1) 之间)。对于掌握提示,群体相对优势消失,导致没有训练信号和不受约束的策略漂移,这可能导致遗忘。对于多数正确提示,随着准确度的提高,诱导的查询权重缩小,削弱了从部分正确到掌握的整合。为了解决这一问题,我们提出了掌握整合策略优化(Mastery-Consolidated Policy Optimization, MCPO),该方法引入了(i)仅应用于掌握提示的铰链-KL正则化器,以限制连续梯度步骤之间有害的策略漂移,以及(ii)一种加权机制,优先考虑多数正确提示,以更好地分配优化努力。通过在三个数学基准上的广泛实验,证明MCPO持续提高了pass@1性能。出乎意料的是,MCPO并没有限制探索,而是提升了pass@k指标,表明掌握整合进一步促进了解决方案的多样性。
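The vanishing training signal on mastered prompts, described in the MCPO abstract above, can be illustrated with a minimal sketch. It assumes the common GRPO formulation in which each rollout's advantage is its reward standardized within the prompt's group; the function name and the `eps` stabilizer are illustrative choices, not taken from the paper, and the hinge-KL regularizer and weighting mechanism of MCPO itself are not reproduced here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's rollout group (GRPO-style).

    `eps` is an assumed numerical stabilizer for zero-variance groups.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Mastered prompt: every rollout is correct, so all advantages are zero
# and the policy gradient carries no signal for this prompt.
mastered = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Majority-correct prompt: correct rollouts get a positive advantage,
# the single incorrect rollout gets a negative one.
majority_correct = group_relative_advantages([1.0, 1.0, 1.0, 0.0])

print(mastered)
print(majority_correct)
```

With identical rewards the numerator is zero for every rollout, which is the situation the abstract says leaves policy drift unconstrained.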
cs.AI / 36 / 2604.16982
A phenotype-driven and evidence-governed framework for knowledge graph enrichment and hypotheses discovery in population data
基于表型驱动和证据导向的知识图谱丰富与假设发现框架在群体数据中的应用
Abstract
Current knowledge graph (KG) construction methods are confirmatory, focusing on recovering known relationships rather than identifying novel or context-dependent nodes. This paper proposes a phenotype-driven and evidence-governed framework that shifts the paradigm toward structured hypothesis discovery and controlled KG expansion. The approach integrates graph neural networks (GNNs) for phenotype discovery, causal inference, probabilistic reasoning and large language models (LLMs) for hypothesis generation and claim extraction within a unified pipeline. The framework prioritizes relationships that are both structurally supported by data and underexplored in the literature. KG expansion is formulated as a multi-objective optimization problem, where candidate claims are jointly evaluated in terms of relevance, structural validation and novelty. Pareto-optimal selection enables the identification of non-dominated claims that balance confirmation and discovery, avoiding trivial or redundant knowledge inclusion. Experiments on heterogeneous population datasets demonstrate that the proposed framework produces more interpretable phenotypes, reveals context-dependent causal structures and generates high-quality claims that align with both data and scientific evidence. Compared to rule-based and LLM-only baselines, the method achieves the best trade-off across plausibility, novelty, validation and relevance. In retrieval-augmented settings, it significantly improves performance (Recall@5=0.98) while reducing hallucination rates (0.05), highlighting its effectiveness in grounding LLM outputs.
Chinese Translation
当前的知识图谱(KG)构建方法主要是确认性的,侧重于恢复已知关系,而非识别新颖或依赖于上下文的节点。本文提出了一种基于表型驱动和证据导向的框架,旨在将范式转向结构化假设发现和受控的KG扩展。该方法在统一的流程中整合了图神经网络(GNNs)用于表型发现、因果推断、概率推理,以及大型语言模型(LLMs)用于假设生成和主张提取。该框架优先考虑那些在数据结构上得到支持且在文献中未被充分探讨的关系。KG扩展被形式化为一个多目标优化问题,其中候选主张在相关性、结构验证和新颖性方面进行联合评估。帕累托最优选择使得能够识别出在确认与发现之间取得平衡的非支配主张,避免了平庸或冗余知识的纳入。在异质群体数据集上的实验表明,所提出的框架生成了更具可解释性的表型,揭示了依赖于上下文的因果结构,并生成了与数据和科学证据相一致的高质量主张。与基于规则和仅使用LLM的基线相比,该方法在合理性、新颖性、验证和相关性方面实现了最佳的权衡。在检索增强的设置中,它显著提高了性能(Recall@5=0.98),同时降低了幻觉率(0.05),突显了其在扎实LLM输出方面的有效性。
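The Pareto-optimal selection step mentioned in the abstract above can be sketched as a plain non-dominated filter. The claim names and scores below are made up for illustration, and the score tuple order (relevance, structural validation, novelty) is an assumption based on the three criteria the abstract lists; how the paper actually computes these scores is not stated.

```python
def dominates(a, b):
    """True if score tuple `a` is at least as good as `b` on every
    objective and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(claims):
    """Return the names of claims that no other claim dominates."""
    return [
        name
        for name, scores in claims.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in claims.items()
            if other != name
        )
    ]


# Hypothetical candidate claims: (relevance, structural validation, novelty).
claims = {
    "claim_a": (0.9, 0.8, 0.2),  # well supported but not novel
    "claim_b": (0.6, 0.7, 0.9),  # novel, moderately supported
    "claim_c": (0.5, 0.6, 0.1),  # dominated by claim_a on every objective
    "claim_d": (0.7, 0.9, 0.5),  # balanced
}

front = pareto_front(claims)
print(front)
```

A filter of this kind keeps both confirmatory and exploratory claims as long as neither is strictly worse than another candidate, which matches the stated goal of balancing confirmation and discovery.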
cs.AI / 37 / 2604.16993
Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification
Rule-VLN:通过语义推理和几何校正桥接感知与合规性
Abstract
As embodied AI transitions to real-world deployment, the success of the Vision-and-Language Navigation (VLN) task tends to evolve from mere reachability to social compliance. However, current agents suffer from a "goal-driven trap", prioritizing physical geometry ("can I go?") over semantic rules ("may I go?"), frequently overlooking subtle regulatory constraints. To bridge this gap, we establish Rule-VLN, the first large-scale urban benchmark for rule-compliant navigation. Spanning a massive 29k-node environment, it injects 177 diverse regulatory categories into 8k constrained nodes across four curriculum levels, challenging agents with fine-grained visual and behavioral constraints. We further propose the Semantic Navigation Rectification Module (SNRM), a universal, zero-shot module designed to equip pre-trained agents with safety awareness. SNRM integrates a coarse-to-fine visual perception VLM framework with an epistemic mental map for dynamic detour planning. Experiments demonstrate that while Rule-VLN challenges state-of-the-art models, SNRM significantly restores navigation capabilities, reducing CVR by 19.26% and boosting TC by 5.97%.
Chinese Translation
随着具身人工智能向现实世界的部署过渡,视觉与语言导航(VLN)任务的成功趋势从单纯的可达性演变为社会合规性。然而,目前的智能体遭遇了“目标驱动陷阱”,优先考虑物理几何(“我能去吗?”)而忽视语义规则(“我可以去吗?”),常常忽略微妙的监管约束。为了解决这一问题,我们建立了Rule-VLN,这是第一个大规模的城市规则合规导航基准。该基准涵盖了一个庞大的29,000节点环境,将177个不同的监管类别注入到8,000个受限节点中,分为四个课程级别,挑战智能体在细粒度视觉和行为约束下的表现。我们进一步提出了语义导航校正模块(SNRM),这是一个通用的零样本模块,旨在为预训练的智能体提供安全意识。SNRM将粗到细的视觉感知VLM框架与动态绕行规划的认知心理地图相结合。实验表明,尽管Rule-VLN对最先进的模型提出了挑战,但SNRM显著恢复了导航能力,CVR降低了19.26%,TC提高了5.97%。
cs.AI / 38 / 2604.17009
Small Model as Master Orchestrator: Learning Unified Agent-Tool Orchestration with Parallel Subtask Decomposition
小模型作为主协调者:通过并行子任务分解学习统一的智能体-工具协调
Abstract
Multi-agent systems (MAS) demonstrate clear advantages in tackling complex problems by coordinating diverse agents and external tools. However, most existing orchestration methods rely on static workflows or serial agent scheduling, and are further constrained by heterogeneous interface protocols between tools and agents. This leads to high system complexity and poor extensibility. To mitigate these issues, we propose Agent-as-Tool, a unified parallel orchestration paradigm that abstracts both agents and tools into a standardized, learnable action space with protocol normalization and explicit state feedback. Building on this paradigm, we train a lightweight orchestrator, ParaManager, which decouples planning decisions from subtask solving, enabling state-aware parallel subtask decomposition, delegation, and asynchronous execution. For training, we adopt a two-stage ParaManager training pipeline. It improves robustness by incorporating supervised fine-tuning (SFT) trajectories equipped with recovery mechanisms, and further applies reinforcement learning (RL) to achieve an optimal balance among task success, protocol compliance, diversity, and reasoning efficiency. Experiments show that ParaManager achieves strong performance across multiple benchmarks and exhibits robust generalization under unseen model pools.
Chinese Translation
多智能体系统(MAS)在协调多样化的智能体和外部工具以解决复杂问题方面展现出明显优势。然而,现有的大多数协调方法依赖于静态工作流或串行智能体调度,并且受到工具与智能体之间异构接口协议的进一步限制。这导致系统复杂性高且可扩展性差。为了解决这些问题,我们提出了Agent-as-Tool,这是一种统一的并行协调范式,将智能体和工具抽象为一个标准化、可学习的动作空间,并进行协议规范化和显式状态反馈。在此范式的基础上,我们训练了一个轻量级的协调器ParaManager,它将规划决策与子任务求解解耦,实现了状态感知的并行子任务分解、委派和异步执行。在训练过程中,我们采用了两阶段的ParaManager训练流程。该流程通过结合带有恢复机制的监督微调(SFT)轨迹来提高鲁棒性,并进一步应用强化学习(RL)以实现任务成功、协议合规性、多样性和推理效率之间的最佳平衡。实验表明,ParaManager在多个基准测试中表现出色,并在未见过的模型池中展现出强大的泛化能力。
cs.AI / 39 / 2604.17019
Mini-BEHAVIOR-Gran: Revealing U-Shaped Effects of Instruction Granularity on Language-Guided Embodied Agents
Mini-BEHAVIOR-Gran:揭示指令粒度对语言引导的具身智能体的U型效应
Abstract
Instruction granularity is an important yet poorly controlled variable in language-guided embodied AI. Existing benchmarks typically pair each task with a single static instruction, making it difficult to study how agent behavior changes when the same task is described at different levels of detail. We introduce Mini-BEHAVIOR-Gran, a new benchmark for controlled studies of instruction granularity that extends Mini-BEHAVIOR with multiple instruction variants per task, ranging from high-level goal descriptions to step-by-step guidance. Using this benchmark, we compare four candidate metrics for cross-task granularity quantification: token count, entity count, action-verb count, and planning-width, and find that width correlates most consistently with agent performance. Using width to organize training and evaluation further reveals a non-monotonic U-shaped relationship between instruction granularity and performance, with peaks at both fine and coarse extremes. Further analysis suggests that the coarse-granularity performance rebound is associated with shallow grounding, where agents learn vision-dominant policies.
Chinese Translation
指令粒度是语言引导的具身人工智能中一个重要但控制不佳的变量。现有基准通常将每个任务与单一静态指令配对,这使得研究在不同细节层次下描述同一任务时智能体行为的变化变得困难。我们引入了Mini-BEHAVIOR-Gran,这是一个用于控制指令粒度研究的新基准,它在Mini-BEHAVIOR的基础上扩展了每个任务的多种指令变体,从高层次的目标描述到逐步指导。利用该基准,我们比较了四种跨任务粒度量化的候选指标:标记计数、实体计数、动作动词计数和规划宽度,发现宽度与智能体表现的相关性最为一致。使用宽度来组织训练和评估进一步揭示了指令粒度与表现之间的非单调U型关系,在细粒度和粗粒度的极端都存在峰值。进一步分析表明,粗粒度表现的反弹与浅层基础有关,智能体学习到以视觉为主导的策略。
cs.AI / 40 / 2604.17025
Harness as an Asset: Enforcing Determinism via the Convergent AI Agent Framework (CAAF)
Harness即资产:通过收敛人工智能代理框架(CAAF)强化确定性
Abstract
Large Language Models (LLMs) produce a controllability gap in safety-critical engineering: even low rates of undetected constraint violations render a system undeployable. Current orchestration paradigms suffer from sycophantic compliance, context attention decay [Liu et al., 2024], and stochastic oscillation during self-correction [Huang et al., 2024]. We introduce the Convergent AI Agent Framework (CAAF), which transitions agentic workflows from open-loop generation to closed-loop Fail-Safe Determinism via three pillars: (1) Recursive Atomic Decomposition with physical context firewalls; (2) Harness as an Asset, formalizing domain invariants into machine-readable registries enforced by a deterministic Unified Assertion Interface (UAI); and (3) Structured Semantic Gradients with State Locking for monotonic convergence. Empirical evaluation across two domains -- SAE Level 3 (L3) autonomous driving (AD) (n=30, 7 conditions) and pharmaceutical continuous flow reactor design (n=20, 4 conditions including a Mono+UAI ablation) -- shows that CAAF-all-GPT-4o-mini achieves 100% paradox detection while monolithic GPT-4o achieves 0% (even at temperature=0). The pharmaceutical benchmark features 7 simultaneous constraints with nonlinear Arrhenius interactions and a 3-way minimal unsatisfiable subset, representing a structurally harder challenge than the 2-constraint AD paradox. Alternative multi-agent architectures (debate, sequential checking) also achieve 0% across 80 trials, confirming that CAAF's reliability derives from its deterministic UAI, not from multi-agent orchestration per se. A Mono+UAI ablation (95%) isolates UAI as the core contribution. CAAF's reliability is invariant to prompt hints; all components use a single commodity model, enabling fully offline deployment.
Chinese Translation
大型语言模型(LLMs)在安全关键工程中产生了可控性缺口:即使是低频率的未检测约束违反也会使系统无法部署。目前的编排范式存在阿谀奉承的合规性、上下文注意力衰减[Liu et al., 2024]和自我纠正过程中的随机振荡[Huang et al., 2024]等问题。我们提出了收敛人工智能代理框架(CAAF),该框架通过三个支柱将代理工作流程从开环生成转变为闭环的故障安全确定性:(1)具有物理上下文防火墙的递归原子分解;(2)Harness即资产(Harness as an Asset),将领域不变性形式化为由确定性统一断言接口(UAI)强制执行的机器可读注册表;(3)具有状态锁定的结构化语义梯度以实现单调收敛。在两个领域的实证评估中——SAE 3级(L3)自动驾驶(AD)(n=30,7种条件)和制药连续流反应器设计(n=20,4种条件,包括Mono+UAI消融实验)——显示CAAF-all-GPT-4o-mini实现了100%的悖论检测,而单一的GPT-4o则实现了0%(即使在温度=0时)。制药基准测试具有7个同时约束,具有非线性阿伦尼乌斯相互作用和一个3-way最小不可满足子集,代表了比2约束AD悖论更具结构性挑战的难题。替代的多代理架构(辩论、顺序检查)在80次试验中也实现了0%,确认CAAF的可靠性源于其确定性UAI,而不是多代理编排本身。Mono+UAI消融实验(95%)将UAI孤立为核心贡献。CAAF的可靠性不受提示线索的影响;所有组件使用单一商品模型,支持完全离线部署。
cs.AI / 41 / 2604.17078
Understanding and Enforcing Weight Disentanglement in Task Arithmetic
理解与强化任务算术中的权重解耦
Abstract
Task arithmetic provides an efficient, training-free way to edit pre-trained models, yet lacks a fundamental theoretical explanation for its success. The existing concept of "weight disentanglement" describes the ideal outcome of non-interfering task composition but does not reveal its underlying cause. Crucially, what intrinsic properties of the pre-trained model ($\theta_0$) or the task vectors ($\tau_t$) enable this disentanglement remains underexplored. In this paper, we introduce Task-Feature Specialization (TFS), a model's ability to allocate distinct internal features to different tasks, as the fundamental principle. We first prove that TFS is a sufficient condition for weight disentanglement. More importantly, we find that TFS also gives rise to an observable geometric consequence: weight vector orthogonality. This positions TFS as the common cause for both the desired functional outcome (disentanglement) and a measurable geometric property (orthogonality). This relationship provides the key insight for our method: since the abstract TFS property is intractable to enforce directly, we can instead promote weight disentanglement by shaping its concrete geometric consequence, orthogonality. Therefore, we propose OrthoReg, a simple and effective regularization method that actively enforces an internal orthogonal structure on weight updates ($\Delta W$) that constitute $\tau_t$ during fine-tuning. We also theoretically prove that OrthoReg promotes disentanglement. Extensive experiments demonstrate that OrthoReg consistently and significantly enhances the performance of various task arithmetic methods. Code is available at https://github.com/RL-MIND/OrthoReg.
Chinese Translation
任务算术提供了一种高效、无需训练的方式来编辑预训练模型,但缺乏对其成功的基本理论解释。现有的“权重解耦”概念描述了非干扰任务组合的理想结果,但并未揭示其根本原因。关键在于,预训练模型($\theta_0$)或任务向量($\tau_t$)的哪些内在属性使得这种解耦成为可能仍然未被充分探讨。在本文中,我们引入了任务特征专业化(Task-Feature Specialization, TFS),即模型将不同的内部特征分配给不同任务的能力,作为基本原则。我们首先证明了TFS是权重解耦的充分条件。更重要的是,我们发现TFS还导致了一个可观察的几何后果:权重向量正交性。这使得TFS成为期望功能结果(解耦)和可测几何属性(正交性)的共同原因。这一关系为我们的方法提供了关键见解:由于抽象的TFS属性难以直接强制执行,我们可以通过塑造其具体的几何后果——正交性,来促进权重解耦。因此,我们提出了OrthoReg,这是一种简单有效的正则化方法,主动在微调过程中对构成$\tau_t$的权重更新($\Delta W$)施加内部正交结构。我们理论上证明了OrthoReg促进了解耦。大量实验表明,OrthoReg持续且显著地提升了各种任务算术方法的性能。代码可在 https://github.com/RL-MIND/OrthoReg 获取。
cs.AI / 42 / 2604.17112
Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification
通过跨模型不一致性补充自一致性以进行不确定性量化
Abstract
Large language models (LLMs) often produce confident yet incorrect responses, and uncertainty quantification is one potential solution to more robust usage. Recent works routinely rely on self-consistency to estimate aleatoric uncertainty (AU), yet this proxy collapses when models are overconfident and produce the same incorrect answer across samples. We analyze this regime and show that cross-model semantic disagreement is higher on incorrect answers precisely when AU is low. Motivated by this, we introduce an epistemic uncertainty (EU) term that operates in the black-box access setting: EU uses only generated text from a small, scale-matched ensemble and is computed as the gap between inter-model and intra-model sequence-semantic similarity. We then define total uncertainty (TU) as the sum of AU and EU. In a comprehensive study across five 7-9B instruction-tuned models and ten long-form tasks, TU improves ranking calibration and selective abstention relative to AU, and EU reliably flags confident failures where AU is low. We further characterize when EU is most useful via agreement and complementarity diagnostics.
Chinese Translation
大型语言模型(LLMs)经常产生自信但错误的回答,而不确定性量化是实现更稳健使用的一个潜在解决方案。近期的研究通常依赖自一致性来估计随机不确定性(AU),然而当模型过于自信并在样本间产生相同的错误答案时,这一代理方法会失效。我们分析了这一情况,并显示在错误答案上,跨模型的语义不一致性在随机不确定性低时更高。基于此,我们引入了一个在黑箱访问设置下操作的认知不确定性(EU)项:EU仅使用来自小规模匹配集成生成的文本,并计算模型间和模型内序列语义相似性之间的差距。然后,我们将总不确定性(TU)定义为随机不确定性(AU)和认知不确定性(EU)的总和。在对五个7-9B指令调优模型和十个长文本任务的全面研究中,TU相较于AU改善了排名校准和选择性放弃,并且EU可靠地标记了在AU低时的自信失败。我们进一步通过一致性和互补性诊断来表征EU最有用的情况。
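The decomposition TU = AU + EU from the abstract above can be sketched in a black-box setting as follows. Everything beyond that sum is an assumption made for illustration: token-level Jaccard overlap stands in for the paper's sequence-semantic similarity, entropy over exact-match answers stands in for the self-consistency AU estimate, and the sign convention for the "gap" (intra-model minus inter-model similarity) is a guess chosen so that EU grows when other models disagree with the target model.

```python
from collections import Counter
from itertools import combinations, product
from math import log
from statistics import fmean


def jaccard(a, b):
    """Toy stand-in for semantic similarity: token-set overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def answer_entropy(samples):
    """Assumed AU proxy: entropy over exact-match answer clusters."""
    n = len(samples)
    counts = Counter(s.lower().strip() for s in samples)
    return sum(-(c / n) * log(c / n) for c in counts.values())


def intra_similarity(samples):
    """Mean pairwise similarity among one model's own samples."""
    return fmean(jaccard(a, b) for a, b in combinations(samples, 2))


def inter_similarity(samples, other_samples):
    """Mean similarity between one model's samples and another model's."""
    return fmean(jaccard(a, b) for a, b in product(samples, other_samples))


def total_uncertainty(target, others):
    au = answer_entropy(target)
    inter = fmean(inter_similarity(target, o) for o in others)
    eu = intra_similarity(target) - inter  # assumed sign of the gap
    return au, eu, au + eu


# The target model is perfectly self-consistent but wrong; two peers disagree.
target = ["the capital is sydney"] * 3
others = [["the capital is canberra"] * 3, ["canberra is the capital"] * 3]
au, eu, tu = total_uncertainty(target, others)
print(au, eu, tu)
```

In this toy case AU is zero because the target model repeats the same answer, while EU is positive because the peer models say something different, which is the confident-failure regime the abstract describes.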
cs.AI / 43 / 2604.17133
If Only My CGM Could Speak: A Privacy-Preserving Agent for Question Answering over Continuous Glucose Data
如果我的连续血糖监测仪能够说话:一个保护隐私的个人血糖数据问答代理
Abstract
Continuous glucose monitors (CGMs) used in diabetes care collect rich personal health data that could improve day-to-day self-management. However, current patient platforms only offer static summaries which do not support inquisitive user queries. Large language models (LLMs) could enable free-form inquiries about continuous glucose data, but deploying them over sensitive health records raises privacy and accuracy concerns. In this paper, we present CGM-Agent, a privacy-preserving framework for question answering over personal glucose data. In our design, the LLM serves purely as a reasoning engine that selects analytical functions. All computation occurs locally, and personal health data never leaves the user's device. For evaluation, we construct a benchmark of 4,180 questions combining parameterized question templates with real user queries and ground truth derived from deterministic program execution. Evaluating 6 leading LLMs, we find that top models achieve 94% value accuracy on synthetic queries and 88% on ambiguous real-world queries. Errors stem primarily from intent and temporal ambiguity rather than computational failures. Additionally, lightweight models achieve competitive performance in our agent design, suggesting opportunities for low-cost deployment. We release our code and benchmark to support future work on trustworthy health agents.
Chinese Translation
用于糖尿病护理的连续血糖监测仪(CGMs)收集丰富的个人健康数据,这些数据可以改善日常自我管理。然而,目前的患者平台仅提供静态摘要,无法支持用户的探索性查询。大型语言模型(LLMs)可以实现对连续血糖数据的自由形式询问,但在敏感健康记录上部署它们会引发隐私和准确性问题。本文提出了CGM-Agent,一个用于个人血糖数据问答的保护隐私框架。在我们的设计中,LLM仅作为推理引擎,选择分析函数。所有计算均在本地进行,个人健康数据从未离开用户设备。为了评估,我们构建了一个基准,包含4,180个问题,结合了参数化问题模板与真实用户查询以及来自确定性程序执行的真实答案。评估6个领先的LLM后,我们发现顶级模型在合成查询上实现了94%的价值准确率,在模糊的现实世界查询上实现了88%的准确率。错误主要源于意图和时间模糊,而非计算失败。此外,轻量级模型在我们的代理设计中表现出竞争力,表明低成本部署的机会。我们发布了代码和基准,以支持未来可信健康代理的研究。
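The division of labor described in the CGM-Agent abstract above, where the LLM only selects an analytical function and all computation runs locally, can be sketched as a small dispatch table. The function names, the plan format, the readings, and the 70 to 180 mg/dL default range are all hypothetical choices for illustration; the paper's actual function library and interface are not given in the abstract.

```python
from statistics import fmean

# Made-up readings as (hour of day, glucose in mg/dL); these stay on-device.
READINGS = [(0, 110), (6, 95), (8, 190), (12, 150), (18, 65), (22, 130)]


def mean_glucose(readings, start_hour=0, end_hour=24):
    """Average glucose over readings with start_hour <= hour < end_hour."""
    return fmean(g for h, g in readings if start_hour <= h < end_hour)


def time_in_range(readings, low=70, high=180):
    """Fraction of readings with low <= glucose <= high."""
    return sum(low <= g <= high for _, g in readings) / len(readings)


# Hypothetical registry of analytical functions the LLM may choose from.
TOOLS = {"mean_glucose": mean_glucose, "time_in_range": time_in_range}


def run_plan(plan, readings):
    """Execute a function selection locally.

    `plan` is what the LLM would return: a function name plus arguments.
    The LLM never receives `readings`, only the question and tool names.
    """
    func = TOOLS[plan["function"]]  # raises KeyError for unknown functions
    return func(readings, **plan.get("args", {}))


# Example selections an LLM might produce for two user questions.
tir = run_plan({"function": "time_in_range", "args": {"low": 70, "high": 180}}, READINGS)
midday = run_plan({"function": "mean_glucose", "args": {"start_hour": 6, "end_hour": 13}}, READINGS)
print(tir, midday)
```

Because the answer comes from deterministic local code, errors in such a design would come from choosing the wrong function or arguments rather than from arithmetic, which is consistent with the abstract's observation that failures stem mainly from intent and temporal ambiguity.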
cs.AI / 44 / 2604.17140
Local Inconsistency Resolution: The Interplay between Attention and Control in Probabilistic Models
局部不一致性解决:注意力与控制在概率模型中的相互作用
Abstract
We present a generic algorithm for learning and approximate inference with an intuitive epistemic interpretation: iteratively focus on a subset of the model and resolve inconsistencies using the parameters under control. This framework, which we call Local Inconsistency Resolution (LIR), is built upon Probabilistic Dependency Graphs (PDGs), which provide a flexible representational foundation capable of capturing inconsistent beliefs. We show how LIR unifies and generalizes a wide variety of important algorithms in the literature, including the Expectation-Maximization (EM) algorithm, belief propagation, adversarial training, GANs, and GFlowNets. In the last case, LIR actually suggests a more natural loss, which we demonstrate improves GFlowNet convergence. Each method can be recovered as a specific instance of LIR by choosing a procedure to direct focus (attention and control). We implement this algorithm for discrete PDGs and study its properties on synthetically generated PDGs, comparing its behavior to the global optimization semantics of the full PDG.
Chinese Translation
我们提出了一种通用算法,用于学习和近似推理,并具有直观的认识论解释:迭代地关注模型的一个子集,并使用受控参数解决不一致性。我们称之为局部不一致性解决(Local Inconsistency Resolution, LIR)的这一框架建立在概率依赖图(Probabilistic Dependency Graphs, PDGs)之上,提供了一种灵活的表示基础,能够捕捉不一致的信念。我们展示了LIR如何统一和概括文献中多种重要算法,包括期望最大化(Expectation-Maximization, EM)算法、信念传播、对抗训练、生成对抗网络(GANs)和GFlowNets。在最后一种情况下,LIR实际上建议了一种更自然的损失函数,我们证明这改善了GFlowNet的收敛性。通过选择一个引导关注(注意力和控制)的过程,每种方法都可以作为LIR的特定实例被恢复。我们为离散PDGs实现了该算法,并研究其在合成生成的PDGs上的性质,将其行为与完整PDG的全局优化语义进行比较。
cs.AI / 45 / 2604.17148
Graph-of-Agents: A Graph-based Framework for Multi-Agent LLM Collaboration
代理图:一种基于图的多智能体大语言模型协作框架
Abstract
With an ever-growing zoo of LLMs and benchmarks, the need to orchestrate multiple models for improved task performance has never been more pressing. While frameworks like Mixture-of-Agents (MoA) attempt to coordinate LLMs, they often fall short in terms of (1) selecting relevant agents, (2) facilitating effective intra-agent communication, and (3) integrating responses efficiently. In this work, we propose Graph-of-Agents (GoA), a new graph-based framework for modeling multi-agent LLM communication. Our approach begins with node sampling, selecting only the most relevant agents by leveraging model cards that summarize each model's domain, task specialization, and other characteristics. Next, we construct edges between the selected agents by evaluating their responses against one another to determine relevance ordering. Directed message passing is then performed from highly relevant agents to less relevant ones to enhance their responses, followed by reverse message passing to refine the original responses of the more relevant agents. Finally, the updated responses are aggregated via graph-based pooling (e.g., max or mean pooling) to produce a single, unified answer. We evaluate GoA on diverse multi-domain benchmarks (MMLU, MMLU-Pro, GPQA) and domain-specific benchmarks (MATH, HumanEval, MedMCQA), with an agent pool of 6 LLMs spanning multiple domains. Surprisingly, GoA achieves superior performance using only 3 selected agents, outperforming recent multi-agent LLM baselines that utilize all 6 agents simultaneously. By adopting a graph structure, GoA offers both scalability and effectiveness through structured message passing, positioning it as a strong candidate for navigating the challenges of the ever-growing LLM zoo. Code is available at: https://github.com/UNITES-Lab/GoA.
Chinese Translation
随着大语言模型(LLM)和基准测试的不断增加,协调多个模型以提高任务性能的需求变得愈发迫切。尽管像混合代理(Mixture-of-Agents, MoA)这样的框架试图协调LLMs,但在(1)选择相关代理、(2)促进有效的内部代理通信以及(3)高效整合响应等方面往往存在不足。在本研究中,我们提出了代理图(Graph-of-Agents, GoA),这是一种用于建模多智能体LLM通信的新型基于图的框架。我们的方法首先进行节点采样,仅选择最相关的代理,利用模型卡片总结每个模型的领域、任务专业化及其他特征。接下来,通过相互评估所选代理的响应来构建它们之间的边,以确定相关性排序。然后,从高度相关的代理向较少相关的代理进行有向消息传递,以增强它们的响应,随后进行反向消息传递,以细化更相关代理的原始响应。最后,通过基于图的池化(例如,最大池化或均值池化)聚合更新后的响应,以生成单一的统一答案。我们在多领域基准测试(MMLU、MMLU-Pro、GPQA)和领域特定基准测试(MATH、HumanEval、MedMCQA)上评估GoA,代理池包含6个跨多个领域的LLM。令人惊讶的是,GoA仅使用3个选定代理就实现了优越的性能,超越了最近同时利用所有6个代理的多智能体LLM基线。通过采用图结构,GoA在结构化消息传递中提供了可扩展性和有效性,使其成为应对不断增长的LLM生态系统挑战的有力候选者。代码可在:https://github.com/UNITES-Lab/GoA获取。
cs.AI / 46 / 2604.17214
Beyond the Basics: Leveraging Large Language Model for Fine-Grained Medical Entity Recognition
超越基础:利用大型语言模型进行细粒度医学实体识别
Abstract
Extracting clinically relevant information from unstructured medical narratives such as admission notes, discharge summaries, and emergency case histories remains a challenge in clinical natural language processing (NLP). Medical Entity Recognition (MER) identifies meaningful concepts embedded in these records. Recent advancements in large language models (LLMs) have shown competitive MER performance; however, evaluations often focus on general entity types, offering limited utility for real-world clinical needs requiring finer-grained extraction. To address this gap, we rigorously evaluated the open-source LLaMA3 model for fine-grained medical entity recognition across 18 clinically detailed categories. To optimize performance, we employed three learning paradigms: zero-shot, few-shot, and fine-tuning with Low-Rank Adaptation (LoRA). To further enhance few-shot learning, we introduced two example selection methods based on token- and sentence-level embedding similarity, utilizing a pre-trained BioBERT model. Unlike prior work assessing zero-shot and few-shot performance on proprietary models (e.g., GPT-4) or fine-tuning different architectures, we ensured methodological consistency by applying all strategies to a unified LLaMA3 backbone, enabling fair comparison across learning settings. Our results showed that fine-tuned LLaMA3 surpasses zero-shot and few-shot approaches by 63.11% and 35.63%, respectively, achieving an F1 score of 81.24% in granular medical entity extraction.
Chinese Translation
从非结构化医学叙述中提取临床相关信息,如入院记录、出院总结和急诊病例历史,仍然是临床自然语言处理(NLP)中的一项挑战。医学实体识别(MER)旨在识别这些记录中嵌入的有意义概念。近期大型语言模型(LLMs)的进展显示出竞争力的MER性能;然而,评估往往集中在一般实体类型上,对于需要更细粒度提取的现实临床需求提供的帮助有限。为了解决这一问题,我们对开源的LLaMA3模型进行了严格评估,以实现18个临床详细类别的细粒度医学实体识别。为了优化性能,我们采用了三种学习范式:零样本、少样本和使用低秩适配(LoRA)的微调。为了进一步增强少样本学习,我们引入了基于标记和句子级嵌入相似性的两个示例选择方法,利用预训练的BioBERT模型。与之前评估专有模型(如GPT-4)上的零样本和少样本性能或微调不同架构的工作不同,我们通过将所有策略应用于统一的LLaMA3基础模型,确保了方法的一致性,从而在不同学习设置之间实现公平比较。我们的结果表明,微调后的LLaMA3在细粒度医学实体提取中分别超越了零样本和少样本方法63.11%和35.63%,实现了81.24%的F1得分。
cs.AI / 47 / 2604.17229
Yanasse: Finding New Proofs from Deep Vision's Analogies, Part 1
Yanasse:从深度视觉的类比中寻找新证明,第一部分
Abstract
Project Yanasse presents a method for discovering new proofs of theorems in one area of mathematics by transferring proof strategy patterns (e.g., Lean 4 tactic invocation patterns) from a structurally distant area. The system extracts tactic usage distributions across 27 top-level areas of Mathlib (217,133 proof states), computes z-scores to identify tactics that are heavily used in a source area but rare or absent in a target area, matches source and target proof states via GPU-accelerated NP-hard analogy (running on a MacBook Air via Apple's MPS backend), and then asks an AI reasoning agent to semantically adapt (not symbol-substitute) the source tactic invocation pattern to the target theorem. In this first part of the study, the method is applied to the pair Probability -> Representation Theory, producing 4 Lean-verified new proofs out of 10 attempts (40%). The proofs compile with zero sorry declarations. The key finding is that tactic schemas decompose into a head (domain-gated, rarely transfers) and a modifier (domain-general, often transfers): the head of filter_upwards fails in representation theory (no Filter structure), but its [LIST] with ω modifier transfers cleanly as ext1 + simp [LIST] + rfl. Crucially, the underlying matching engine, deep vision lib.py, is entirely domain independent: the same optimization code for an NP-hard matching that matches chess positions by analogy matches Lean proof states by analogy, without knowing which domain it is processing. Only a relation extractor is domain-specific.
Chinese Translation
Yanasse项目提出了一种通过从结构上遥远的领域转移证明策略模式(例如,Lean 4战术调用模式)来发现数学某一领域定理的新证明的方法。该系统提取了Mathlib中27个顶级领域的战术使用分布(217,133个证明状态),计算z分数以识别在源领域中使用频繁但在目标领域中稀少或缺失的战术,通过GPU加速的NP难度类比(在MacBook Air上通过Apple的MPS后端运行)匹配源和目标证明状态,然后请求一个AI推理代理对源战术调用模式进行语义适配——而不是符号替换——到目标定理。在本研究的第一部分中,该方法应用于概率与表示理论这一对,产生了10次尝试中的4个Lean验证的新证明(40%)。这些证明在没有任何sorry声明的情况下编译成功。关键发现是,战术模式分解为一个头部(领域特定,较少转移)和一个修饰符(领域通用,常常转移):filter_upwards的头部在表示理论中失败(没有Filter结构),但其带有ω修饰符的[LIST]则可以顺利转移为ext1 + simp [LIST] + rfl。至关重要的是,底层匹配引擎——deep vision lib.py——完全与领域无关:用于通过类比匹配国际象棋位置的NP难度匹配的相同优化代码也可以通过类比匹配Lean证明状态,而无需知道其正在处理哪个领域。只有关系提取器是领域特定的。
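The z-score step in the Yanasse abstract can be illustrated with a small sketch. This is not the project's code: the function names, the toy counts, and the choice to standardize each tactic's per-area usage rate across areas are assumptions made for illustration only.

```python
from statistics import mean, pstdev

def tactic_z_scores(usage):
    """usage: {area: {tactic: count}} -> {area: {tactic: z}}.

    For each tactic, the per-area usage rate (count / total tactic
    calls in that area) is standardized across all areas.
    """
    areas = list(usage)
    tactics = sorted({t for counts in usage.values() for t in counts})
    # Usage rate of every tactic in every area (0.0 when absent).
    rates = {
        a: {t: usage[a].get(t, 0) / max(sum(usage[a].values()), 1) for t in tactics}
        for a in areas
    }
    z = {a: {} for a in areas}
    for t in tactics:
        column = [rates[a][t] for a in areas]
        mu, sigma = mean(column), pstdev(column)
        for a in areas:
            # A tactic used at the same rate everywhere carries no signal.
            z[a][t] = 0.0 if sigma == 0 else (rates[a][t] - mu) / sigma
    return z

def transfer_candidates(z, source, target, threshold=1.0):
    """Tactics over-represented in `source` and under-represented in `target`."""
    return [t for t in z[source]
            if z[source][t] >= threshold and z[target][t] < 0]
```

With hypothetical counts in which filter_upwards appears only in the Probability area, `transfer_candidates(z, "Probability", "RepresentationTheory")` would surface it as a candidate to hand to the adaptation agent.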
cs.AI / 48 / 2604.17240
Safe and Policy-Compliant Multi-Agent Orchestration for Enterprise AI
企业人工智能的安全与政策合规多智能体协调
Abstract
Enterprise AI systems increasingly deploy multiple intelligent agents across mission-critical workflows that must satisfy hard policy constraints, bounded risk exposure, and comprehensive auditability (SOX, HIPAA, GDPR). Existing coordination methods - cooperative MARL, consensus protocols, and centralized planners - optimize expected reward while treating constraints implicitly. This paper introduces CAMCO (Constraint-Aware Multi-Agent Cognitive Orchestration), a runtime coordination layer that models multi-agent decision-making as a constrained optimization problem. CAMCO integrates three mechanisms: (i) a constraint projection engine enforcing policy-feasible actions via convex projection, (ii) adaptive risk-weighted Lagrangian utility shaping, and (iii) an iterative negotiation protocol with provably bounded convergence. Unlike training-time constrained RL, CAMCO operates as deployment-time middleware compatible with any agent architecture, with policy predicates designed for direct integration with production engines such as OPA. Evaluation across three enterprise scenarios - including comparison against a constrained Lagrangian MARL baseline - demonstrates zero policy violations, risk exposure below threshold (mean ratio 0.71), 92-97% utility retention, and mean convergence in 2.4 iterations.
Chinese Translation
企业人工智能系统越来越多地在关键任务工作流中部署多个智能体,这些工作流必须满足严格的政策约束、有限的风险暴露和全面的可审计性(如 SOX、HIPAA、GDPR)。现有的协调方法——合作的多智能体强化学习(MARL)、共识协议和集中式规划者——在优化期望奖励的同时,隐含地处理约束。本文介绍了 CAMCO(约束感知多智能体认知协调),这是一个将多智能体决策建模为约束优化问题的运行时协调层。CAMCO 集成了三种机制:(i)通过凸投影强制执行政策可行动作的约束投影引擎,(ii)自适应风险加权拉格朗日效用塑造,以及(iii)具有可证明收敛界限的迭代谈判协议。与训练时的约束强化学习不同,CAMCO 作为部署时中间件运行,兼容任何智能体架构,其政策谓词设计用于与生产引擎(如 OPA)直接集成。在三个企业场景中的评估——包括与约束拉格朗日 MARL 基线的比较——表明零政策违规,风险暴露低于阈值(平均比率 0.71),效用保留率为 92-97%,平均收敛在 2.4 次迭代内。
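The constraint projection idea in the CAMCO abstract can be sketched for the simplest convex case. The abstract does not specify CAMCO's constraint sets, so this example assumes a single linear policy constraint a · x <= b and uses the standard closed-form Euclidean projection onto a half-space; the function name and the toy numbers are hypothetical.

```python
def project_onto_halfspace(x, a, b):
    """Euclidean projection of x onto the half-space {y : a . y <= b}.

    x is a proposed action vector; (a, b) encode one linear policy
    constraint (an assumption made for this sketch).
    """
    dot = sum(ai * xi for ai, xi in zip(a, x))
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0:
        raise ValueError("constraint normal must be non-zero")
    violation = dot - b
    if violation <= 0:
        return list(x)  # already policy-feasible, leave unchanged
    # Move along the constraint normal just far enough to reach the boundary.
    step = violation / norm_sq
    return [xi - step * ai for xi, ai in zip(x, a)]
```

For example, a proposed allocation of [3.0, 4.0] under the budget constraint x1 + x2 <= 5 is moved to the nearest feasible point [2.0, 3.0], while an already feasible action is returned as is.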
cs.AI / 49 / 2604.17267
Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys
LLM增强调查中的修正难度与最优样本分配
Abstract
Large Language Models can generate synthetic survey responses at low cost, but their accuracy varies unpredictably across questions. We study the design problem of allocating a fixed budget of human respondents across estimation tasks when cheap LLM predictions are available for every task. Our framework combines three components. First, building on Prediction-Powered Inference, we characterize a question-specific rectification difficulty that governs how quickly the estimator's variance decreases with human sample size. Second, we derive a closed-form optimal allocation rule that directs more human labels to tasks where the LLM is least reliable. Third, since rectification difficulty depends on unobserved human responses for new surveys, we propose a meta-learning approach, trained on historical data, that predicts it for entirely new tasks without pilot data. The framework extends to general M-estimation, covering regression coefficients and multinomial logit partworths for conjoint analysis. We validate the framework on two datasets spanning different domains, question types, and LLMs, showing that our approach captures 61-79% of the theoretically attainable efficiency gains, achieving 11.4% and 10.5% MSE reductions without requiring any pilot human data for the target survey.
Chinese Translation
大型语言模型能够以低成本生成合成调查响应,但其准确性在不同问题上变化不可预测。我们研究了在每个任务都有廉价LLM预测可用的情况下,如何在估计任务之间分配固定的人类受访者预算的设计问题。我们的框架结合了三个组成部分。首先,基于预测驱动推断,我们描述了一种特定于问题的修正难度,该难度决定了随着人类样本量的增加,估计器方差下降的速度。其次,我们推导出一个封闭形式的最优分配规则,该规则将更多的人类标签分配给LLM可靠性最低的任务。第三,由于修正难度依赖于新调查中未观察到的人类响应,我们提出了一种基于历史数据训练的元学习方法,能够在没有试点数据的情况下为全新任务预测修正难度。该框架扩展到一般的M估计,涵盖回归系数和联合分析的多项式逻辑部分效用。我们在两个跨越不同领域、问题类型和LLM的数据集上验证了该框架,结果表明我们的方法捕获了61-79%的理论可达效率增益,在不需要任何目标调查的试点人类数据的情况下,实现了11.4%和10.5%的均方误差(MSE)降低。
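The allocation idea in this abstract can be sketched under a simplifying assumption. Suppose the estimator variance for task k has the form d_k / n_k, where d_k stands in for the rectification difficulty and n_k is the number of human respondents. Minimizing the total variance subject to a fixed budget then gives n_k proportional to sqrt(d_k), a standard Lagrange-multiplier result. The paper's actual closed form may include further terms, so the variance model and the names below are assumptions.

```python
import math

def allocate_budget(difficulty, budget):
    """Split `budget` human responses across tasks with n_k proportional to sqrt(d_k).

    Minimizes sum_k d_k / n_k subject to sum_k n_k = budget
    (continuous relaxation; no rounding to integers).
    """
    if budget <= 0 or any(d < 0 for d in difficulty):
        raise ValueError("budget must be positive and difficulties non-negative")
    roots = [math.sqrt(d) for d in difficulty]
    total = sum(roots)
    if total == 0:
        # No task needs rectification: fall back to a uniform split.
        return [budget / len(difficulty)] * len(difficulty)
    return [budget * r / total for r in roots]

def total_variance(difficulty, allocation):
    """Sum of d_k / n_k over tasks with non-zero difficulty."""
    return sum(d / n for d, n in zip(difficulty, allocation) if d > 0)
```

With hypothetical difficulties 1, 4, and 9 and a budget of 600 respondents, the rule assigns 100, 200, and 300 respondents, and the resulting total variance is lower than under a uniform split of 200 each.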
cs.AI / 50 / 2604.17273
The Continuity Layer: Why Intelligence Needs an Architecture for What It Carries Forward
连续性层:智能为何需要一种承载其前进的架构
Abstract
The most important architectural problem in AI is not the size of the model but the absence of a layer that carries forward what the model has come to understand. Sessions end. Context windows fill. Memory APIs return flat facts that the model has to reinterpret from scratch on every read. The result is intelligence that is powerful per session and amnesiac across time. This position paper argues that the layer which fixes this, the continuity layer, is the most consequential piece of infrastructure the field has not yet built, and that the engineering work to build it has begun in public. The formal evaluation framework for the property described here is the ATANT benchmark (arXiv:2604.06710), published separately with evaluation results on a 250-story corpus; a companion paper (arXiv:2604.10981) positions this framework against existing memory, long-context, and agentic-memory benchmarks. The paper defines continuity as a system property with seven required characteristics, distinct from memory and from retrieval; describes a storage primitive (Decomposed Trace Convergence Memory) whose write-time decomposition and read-time reconstruction produce that property; maps the engineering architecture to the theological pattern of kenosis and the symbolic pattern of Alpha and Omega, and argues this mapping is structural rather than metaphorical; proposes a four-layer development arc from external SDK to hardware node to long-horizon human infrastructure; examines why the physics limits now constraining the model layer make the continuity layer newly consequential; and argues that the governance architecture (privacy implemented as physics rather than policy, founder-controlled class shares on non-negotiable architectural commitments) is inseparable from the product itself.
Chinese Translation
人工智能中最重要的架构问题不是模型的大小,而是缺乏一个能够承载模型所理解内容的层。会话结束,上下文窗口填满,内存API返回的平面事实使得模型在每次读取时都必须从头重新解释。结果是,智能在每个会话中表现强大,但在时间上却显得健忘。本文立场论文认为,解决这一问题的层,即连续性层,是该领域尚未构建的最重要基础设施,而构建它的工程工作已经在公开进行。这里描述的属性的正式评估框架是ATANT基准(arXiv:2604.06710),已单独发布,并在一个包含250个故事的语料库上提供评估结果;一篇伴随论文(arXiv:2604.10981)将该框架与现有的内存、长上下文和代理记忆基准进行了对比。本文将连续性定义为一种具有七个必要特征的系统属性,这些特征与内存和检索不同;描述了一种存储原语(分解轨迹收敛记忆),其写入时的分解和读取时的重构产生该属性;将工程架构映射到基督教神秘学的空虚模式和象征模式的阿尔法与欧米伽,并认为这种映射是结构性的而非隐喻性的;提出了从外部SDK到硬件节点再到长远人类基础设施的四层开发弧;考察了为何目前限制模型层的物理条件使得连续性层变得新近重要;并认为治理架构(作为物理而非政策实施的隐私,创始人控制的非可谈判架构承诺的股份)与产品本身密不可分。
cs.AI / 51 / 2604.17284
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents
HalluClear:诊断、评估和缓解图形用户界面代理中的幻觉
Abstract
While progress in GUI agents has been largely driven by industrial-scale training, ungrounded hallucinations often trigger cascading failures in real-world deployments. Unlike general VLM domains, the GUI agent field lacks a hallucination-focused suite for fine-grained diagnosis, reliable evaluation, and targeted mitigation. To bridge this gap, we introduce HalluClear, a comprehensive suite for hallucination mitigation in GUI agents as a complement to computation-intensive scaling. HalluClear comprises: (1) a GUI-specific hallucination taxonomy derived from empirical failure analysis; (2) a calibrated three-stage evaluation workflow which enhances VLM-as-a-judge reliability via expert-annotated benchmarking and ensemble credibility estimation; and (3) a mitigation scheme based on closed-loop structured reasoning, enabling lightweight continual post-training with cold-start initialization for both generalist and GUI-specialist agents. Experiments across representative agents and public benchmarks demonstrate that post-training on only 9K samples within our suite can significantly reduce hallucinations, thereby improving grounding and action fidelity, offering a compute-efficient pathway to robust GUI automation.
Chinese Translation
尽管图形用户界面(GUI)代理的进展在很大程度上是由工业规模的训练推动的,但无基础的幻觉往往会在实际部署中引发级联故障。与一般的视觉语言模型(VLM)领域不同,GUI代理领域缺乏专注于幻觉的工具套件,以进行细粒度的诊断、可靠的评估和有针对性的缓解。为了解决这一问题,我们引入了HalluClear,这是一个用于GUI代理幻觉缓解的综合工具套件,作为计算密集型扩展的补充。HalluClear包括:(1)基于经验故障分析得出的GUI特定幻觉分类;(2)一个经过校准的三阶段评估工作流程,通过专家注释的基准测试和集成可信度估计,提高VLM作为评判者的可靠性;(3)一个基于闭环结构推理的缓解方案,使得通用代理和GUI专业代理都能够进行轻量级的持续后训练,并具备冷启动初始化能力。针对具有代表性的代理和公共基准的实验表明,仅在我们的工具套件中对9K样本进行后训练即可显著减少幻觉,从而提高基础性和行动的准确性,提供了一条计算高效的强健GUI自动化路径。
cs.AI / 52 / 2604.17295
LLaTiSA: Towards Difficulty-Stratified Time Series Reasoning from Visual Perception to Semantics
LLaTiSA:面向从视觉感知到语义的难度分层时间序列推理
Abstract
Comprehensive understanding of time series remains a significant challenge for Large Language Models (LLMs). Current research is hindered by fragmented task definitions and benchmarks with inherent ambiguities, precluding rigorous evaluation and the development of unified Time Series Reasoning Models (TSRMs). To bridge this gap, we formalize Time Series Reasoning (TSR) via a four-level taxonomy of increasing cognitive complexity. We introduce HiTSR, a hierarchical time series reasoning dataset comprising 83k samples with diverse task combinations and verified Chain-of-Thought (CoT) trajectories. Leveraging HiTSR, we propose LLaTiSA, a strong TSRM that integrates visualized patterns with precision-calibrated numerical tables to enhance the temporal perception of Vision-Language Models (VLMs). Through a multi-stage curriculum fine-tuning strategy, LLaTiSA achieves superior performance and exhibits robust out-of-distribution generalization across diverse TSR tasks and real-world scenarios. Our code is available at https://github.com/RainingNovember/LLaTiSA.
Chinese Translation
全面理解时间序列仍然是大型语言模型(LLMs)面临的一项重大挑战。目前的研究受到任务定义和基准测试碎片化及固有模糊性的制约,阻碍了严格评估和统一时间序列推理模型(TSRMs)的发展。为了解决这一问题,我们通过一个四级分类法形式化时间序列推理(TSR),该分类法具有逐渐增加的认知复杂性。我们引入了HiTSR,一个包含83k个样本的层次时间序列推理数据集,涵盖多样的任务组合和经过验证的思维链(CoT)轨迹。利用HiTSR,我们提出了LLaTiSA,一个强大的TSRM,它将可视化模式与精确校准的数字表结合,以增强视觉语言模型(VLMs)的时间感知。通过多阶段课程微调策略,LLaTiSA在各种TSR任务和现实场景中实现了卓越的性能,并展现出强大的分布外泛化能力。我们的代码可在 https://github.com/RainingNovember/LLaTiSA 获取。
cs.AI / 53 / 2604.17304
Efficient Test-Time Scaling via Temporal Reasoning Aggregation
通过时间推理聚合实现高效的测试时间缩放
Abstract
Test-time scaling improves the reasoning performance of large language models but often results in token-inefficient overthinking, where models continue reasoning beyond what is necessary for a correct answer. Existing dynamic early-exit methods typically rely on single-step confidence signals, which are often unreliable for detecting reasoning convergence in multi-step settings. To mitigate this limitation, we propose TRACE, a training-free framework for efficient test-time scaling that determines when to terminate reasoning based on temporal aggregation of multi-step evidence rather than instantaneous signals. TRACE detects reasoning convergence over time by aggregating two complementary signals across recent reasoning steps: answer consistency, capturing the persistence of predicted answers, and confidence trajectory, modeling the temporal evolution of model confidence. Benefiting from these two factors, TRACE can accurately determine whether the reasoning process has converged, thereby promptly halting inference and effectively avoiding redundant reasoning steps. Extensive experiments on multiple challenging benchmarks show that TRACE reduces reasoning token usage by 25-30% on average while maintaining accuracy within 1-2% of full-length reasoning, consistently outperforming existing dynamic reasoning methods.
Chinese Translation
测试时间缩放提高了大型语言模型的推理性能,但往往导致令牌使用效率低下的过度推理,即模型在得出正确答案所需的基础上继续推理。现有的动态提前退出方法通常依赖于单步置信信号,这在多步设置中往往不可靠,难以检测推理的收敛性。为了解决这一局限性,我们提出了TRACE,一个无训练的高效测试时间缩放框架,它基于多步证据的时间聚合来决定何时终止推理,而不是依赖瞬时信号。TRACE通过聚合最近推理步骤中的两个互补信号来检测推理的时间收敛性:答案一致性,捕捉预测答案的持续性,以及置信轨迹,建模模型置信度的时间演变。得益于这两个因素,TRACE能够准确判断推理过程是否已收敛,从而及时停止推理,有效避免冗余的推理步骤。在多个具有挑战性的基准测试上的大量实验表明,TRACE平均减少了25-30%的推理令牌使用,同时保持准确性在完整推理的1-2%之内,始终优于现有的动态推理方法。
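The TRACE stopping rule can be sketched as a windowed check over recent reasoning steps. The abstract names the two signals (answer consistency and confidence trajectory) but not how they are combined, so the window size, the confidence threshold, and the "not trending downward" test below are illustrative assumptions rather than the paper's method.

```python
def should_stop(history, window=3, min_conf=0.8):
    """history: list of (answer, confidence) pairs, one per reasoning step.

    Stops only when the last `window` steps agree on the answer AND
    their confidence is high on average and not trending downward.
    """
    if len(history) < window:
        return False
    recent = history[-window:]
    answers = [a for a, _ in recent]
    confs = [c for _, c in recent]
    consistent = len(set(answers)) == 1          # answer consistency
    mean_conf = sum(confs) / window              # aggregated confidence level
    not_declining = confs[-1] >= confs[0]        # confidence trajectory
    return consistent and mean_conf >= min_conf and not_declining

def first_exit_step(history, **kwargs):
    """1-based index of the first step at which reasoning would halt, else None."""
    for step in range(1, len(history) + 1):
        if should_stop(history[:step], **kwargs):
            return step
    return None
```

The point of aggregating over a window is that one confident step does not end reasoning on its own: an isolated confidence spike on an answer that later changes is ignored, whereas a run of agreeing, steadily confident steps triggers the exit.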
cs.AI / 54 / 2604.17308
SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
SkillFlow:自主智能体终身技能发现与演化的基准测试
Abstract
As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time. We introduce SkillFlow, a benchmark of 166 tasks across 20 families in which task construction within each family follows a Domain-Agnostic Execution Flow (DAEF) that defines an agent workflow framework, allowing these tasks to share a consistent workflow. Agents are evaluated under an Agentic Lifelong Learning protocol in which they begin without skills, solve tasks sequentially within each family, externalize lessons through trajectory- and rubric-driven skill patches, and carry the updated library forward. Experiments reveal a substantial capability gap. For Claude Opus 4.6, lifelong skill evolution improves task success from 62.65% to 71.08% (+8.43 points). However, high skill usage does not necessarily imply high utility: Kimi K2.5 gains only +0.60 points despite 66.87% skill usage, while Qwen-Coder-Next reaches only a 44.58% task completion rate and still regresses relative to the vanilla setting. SkillFlow contributes a structured testbed for this direction and an in-depth empirical analysis of skill discovery, patching, transfer, and their failure modes under lifelong evaluation.
Chinese Translation
随着自主智能体能力边界的不断扩展,它们越来越能够通过即插即用的外部技能完成专业任务。然而,目前的基准测试主要检验模型是否能够使用提供的技能,而未能探讨它们是否能够从经验中发现技能、在失败后修复技能以及随着时间的推移维护一致的技能库。我们提出了SkillFlow,这是一个涵盖20个任务家族的166个任务的基准,其中每个家族内的任务构建遵循一个定义智能体工作流程框架的领域无关执行流程(Domain-Agnostic Execution Flow, DAEF),使这些任务能够共享一致的工作流程。智能体在一个智能体终身学习协议下进行评估,在该协议中,它们从无技能开始,依次解决每个家族内的任务,通过轨迹驱动和标准驱动的技能补丁外化经验教训,并将更新后的技能库向前推进。实验结果揭示了显著的能力差距。对于Claude Opus 4.6,终身技能演化将任务成功率从62.65%提高到71.08%(+8.43个百分点)。然而,高技能使用率并不一定意味着高效用:尽管Kimi K2.5的技能使用率为66.87%,但仅获得+0.60个百分点的提升,而Qwen-Coder-Next的任务完成率仅为44.58%,相较于基础设置仍然出现回退。SkillFlow为这一方向提供了一个结构化的测试平台,并对技能发现、补丁、转移及其在终身评估下的失败模式进行了深入的实证分析。
cs.AI / 55 / 2604.17309
Knows: Agent-Native Structured Research Representations
Knows:代理原生结构化研究表示
Abstract
Research artifacts are distributed primarily as reader-oriented documents like PDFs. This creates a bottleneck for increasingly agent-assisted and agent-native research workflows, in which LLM agents need to infer fine-grained, task-relevant information from lengthy full documents, a process that is expensive, repetitive, and unstable at scale. We introduce Knows, a lightweight companion specification that binds structured claims, evidence, provenance, and verifiable relations to existing research artifacts in a form LLM agents can consume directly. Knows addresses the gap with a thin YAML sidecar (KnowsRecord) that coexists with the original PDF, requiring no changes to the publication itself, and validated by a deterministic schema linter. We evaluate Knows on 140 comprehension questions across 20 papers spanning 14 academic disciplines, comparing PDF-only, sidecar-only, and hybrid conditions across six LLM agents of varying capacity. Weak models (0.8B-2B parameters) improve from 19-25% to 47-67% accuracy (+29 to +42 percentage points) when reading sidecar instead of PDF, while consuming 29-86% fewer input tokens; an LLM-as-judge re-scoring confirms that weak-model sidecar accuracy (75-77%) approaches stronger-model PDF accuracy (78-83%). Beyond this controlled evaluation, a community sidecar hub at https://knows.academy/ has already indexed over ten thousand publications and continues to grow daily, providing independent evidence that the format is adoption-ready at scale.
Chinese Translation
研究文献主要以面向读者的文档形式分发,如PDF。这为日益依赖代理辅助和代理原生的研究工作流程创造了瓶颈,在这些工作流程中,大型语言模型(LLM)代理需要从冗长的完整文档中推断出细粒度的、与任务相关的信息,这一过程在规模上既昂贵又重复且不稳定。我们介绍了Knows,这是一种轻量级的伴随规范,将结构化的主张、证据、来源和可验证关系绑定到现有的研究文献中,以LLM代理可以直接消费的形式。Knows通过一个与原始PDF共存的薄型YAML侧车(KnowsRecord)来填补这一空白,要求对出版物本身没有任何更改,并通过确定性模式检查器进行验证。我们在20篇涵盖14个学科的论文上评估了Knows,针对140个理解问题,比较了仅PDF、仅侧车和混合条件下六种不同能力的LLM代理。弱模型(0.8B-2B参数)在阅读侧车而非PDF时,准确率从19-25%提高到47-67%(增加29到42个百分点),同时消耗的输入标记减少了29-86%;LLM作为评判者的重新评分确认,弱模型侧车的准确率(75-77%)接近强模型PDF的准确率(78-83%)。除了这一受控评估外,社区侧车中心https://knows.academy/已索引超过一万篇出版物,并且每天持续增长,提供了独立证据,表明该格式在规模上已准备好被广泛采用。
cs.AI / 56 / 2604.17337
AutoSearch: Adaptive Search Depth for Efficient Agentic RAG via Reinforcement Learning
AutoSearch:通过强化学习实现高效代理检索增强生成的自适应搜索深度
Abstract
Agentic retrieval-augmented generation (RAG) systems enable large language models (LLMs) to solve complex tasks through multi-step interaction with external retrieval tools. However, such multi-step interaction often involves redundant search steps, incurring substantial computational cost and latency. Prior work limits search depth (i.e., the number of search steps) to reduce cost, but this often leads to underexploration of complex questions. To address this, we first investigate how search depth affects accuracy and find a minimal sufficient search depth that defines an accuracy-efficiency trade-off, jointly determined by question complexity and the agent's capability. Furthermore, we propose AutoSearch, a reinforcement learning (RL) framework that evaluates each search step via self-generated intermediate answers. By a self-answering mechanism, AutoSearch identifies the minimal sufficient search depth and promotes efficient search by rewarding its attainment while penalizing over-searching. In addition, reward mechanisms are introduced to stabilize search behavior and improve answer quality on complex questions. Extensive experiments on multiple benchmarks show that AutoSearch achieves a superior accuracy-efficiency trade-off, alleviating over-searching while preserving search quality.
Chinese Translation
代理检索增强生成(RAG)系统使大型语言模型(LLMs)能够通过与外部检索工具的多步交互来解决复杂任务。然而,这种多步交互通常涉及冗余的搜索步骤,导致显著的计算成本和延迟。之前的研究通过限制搜索深度(即搜索步骤的数量)来降低成本,但这往往导致对复杂问题的探索不足。为了解决这一问题,我们首先研究了搜索深度如何影响准确性,并找到了一个定义准确性与效率权衡的最小充分搜索深度,该深度由问题复杂性和代理能力共同决定。此外,我们提出了AutoSearch,一个通过自生成中间答案评估每个搜索步骤的强化学习(RL)框架。通过自回答机制,AutoSearch识别最小充分搜索深度,并通过奖励其实现来促进高效搜索,同时惩罚过度搜索。此外,引入了奖励机制以稳定搜索行为并提高复杂问题的答案质量。在多个基准上的广泛实验表明,AutoSearch实现了优越的准确性与效率权衡,减轻了过度搜索的情况,同时保持了搜索质量。
cs.AI / 57 / 2604.17347
Formal Foundations of Agentic Business Process Management
自主商业流程管理的形式基础
Abstract
Just like traditional BPM systems, agentic BPM systems are built around a specification of the process under consideration. Their distinguishing feature, however, is that the execution of the process is driven by multiple autonomous decision-makers, referred to as agents. Since such agents cannot be fully controlled, the process specification is augmented with explicit objectives, or goals, assigned to the participating agents. Agents then pursue these goals, at least to the best of their efforts, under suitable assumptions on the behavior of others, by adopting appropriate strategies. Centrally, the organization enacting the process can use these specifications to provide guardrails on the decision-making capabilities of agents at the strategy level. This paper sets up the mathematical foundations of such systems in three key settings and analyzes four foundational problems of agentic BPM.
Chinese Translation
与传统的商业流程管理(BPM)系统一样,自主BPM系统是围绕所考虑的流程规范构建的。然而,它们的显著特征在于,流程的执行由多个自主决策者驱动,这些决策者被称为代理(agents)。由于这些代理无法被完全控制,流程规范被增强为包含明确的目标或任务,这些目标分配给参与的代理。代理在适当的假设下,尽其所能地追求这些目标,采用适当的策略。组织在实施流程时,可以利用这些规范为代理在策略层面的决策能力提供指导。本文在三个关键设置中建立了此类系统的数学基础,并分析了自主BPM的四个基础性问题。
cs.AI / 58 / 2604.17351
SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization
SOCIA-EVO:通过双锚点双层优化实现自动化模拟器构建
Abstract
Automated simulator construction requires distributional fidelity, distinguishing it from generic code generation. We identify two failure modes in long-horizon LLM agents: contextual drift and optimization instability arising from conflating structural and parametric errors. We propose SOCIA-EVO, a dual-anchored evolutionary framework. SOCIA-EVO introduces: (1) a static blueprint to enforce empirical constraints; (2) a bi-level optimization to decouple structural refinement from parameter calibration; and (3) a self-curating Strategy Playbook that manages remedial hypotheses via Bayesian-weighted retrieval. By falsifying ineffective strategies through execution feedback, SOCIA-EVO achieves robust convergence, generating simulators that are statistically consistent with observational data. The code and data of SOCIA-EVO are available here: https://github.com/cruiseresearchgroup/SOCIA/tree/evo.
Chinese Translation
自动化模拟器构建需要分布忠实性,这使其与通用代码生成有所区别。我们识别出长时间跨度的LLM(大语言模型)代理中的两种失败模式:上下文漂移和由于混淆结构性错误与参数性错误而导致的优化不稳定性。我们提出了SOCIA-EVO,一个双锚点的进化框架。SOCIA-EVO引入了:(1) 一个静态蓝图以强制执行经验约束;(2) 一种双层优化方法以将结构精炼与参数校准解耦;以及(3) 一个自我策划的策略手册,通过贝叶斯加权检索管理补救假设。通过执行反馈来驳斥无效策略,SOCIA-EVO实现了稳健的收敛,生成与观测数据在统计上相一致的模拟器。SOCIA-EVO的代码和数据可在此获取:https://github.com/cruiseresearchgroup/SOCIA/tree/evo。
cs.AI / 59 / 2604.17353
Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling
Hive:一种用于算法和任务级扩展的多智能体基础设施
Abstract
Large language models are increasingly deployed as complex agentic systems that scale with task complexity. While prior work has extensively explored model- and system-level scaling, algorithm- and task-level scaling remain largely unaddressed, constraining the full potential of agentic systems. At the algorithm level, allocating additional inference-time computation can enhance workflow capacity but introduces cross-path redundancy: overlapping computations across multiple reasoning branches. At the task level, complex tasks can be decomposed into subproblems and delegated across multiple agents for improved scalability and parallelism. However, existing infrastructures' scheduling is unaware of the existence of multiple agents, missing opportunities to optimize resource allocation. We propose Hive, a multi-agent infrastructure that enables algorithm- and task-level scaling. Hive features a description frontend that captures per-agent behavior and supports test-time scaling algorithms. Leveraging this specification, our backend introduces two key mechanisms: Logits Cache that reuses intermediate logits across redundant sampling paths to mitigate cross-path redundancy at the algorithm level, and Agent-Aware Scheduling that efficiently allocates compute and KV-cache resources according to agent contributions at the task level. Experiments show that Logits Cache achieves an average speedup of 1.11×-1.76× for re-sampling, and Agent-Aware Scheduling reduces the hotspot miss rate by 33%-51%.
Chinese Translation
大型语言模型越来越多地被部署为复杂的智能系统,这些系统随着任务复杂性的增加而扩展。尽管先前的研究已经广泛探讨了模型和系统级的扩展,但算法和任务级的扩展仍然在很大程度上未得到解决,从而限制了智能系统的全部潜力。在算法层面,分配额外的推理时间计算可以增强工作流能力,但会引入跨路径冗余:在多个推理分支之间重叠的计算。在任务层面,复杂任务可以被分解为子问题,并在多个智能体之间委派,以提高可扩展性和并行性。然而,现有基础设施的调度并未意识到多个智能体的存在,错失了优化资源分配的机会。我们提出了Hive,一种支持算法和任务级扩展的多智能体基础设施。Hive具有一个描述前端,可以捕捉每个智能体的行为,并支持测试时扩展算法。利用这一规范,我们的后端引入了两个关键机制:Logits Cache,它在冗余采样路径之间重用中间logits,以减轻算法级的跨路径冗余,以及Agent-Aware Scheduling,它根据智能体的贡献有效分配计算和KV缓存资源。在实验中,Logits Cache在重新采样时实现了平均1.11×-1.76×的加速,而Agent-Aware Scheduling将热点未命中率降低了33%-51%。
cs.AI / 60 / 2604.17360
T-DuMpRa: Teacher-guided Dual-path Multi-prototype Retrieval Augmented framework for fine-grained medical image classification
T-DuMpRa:教师指导的双路径多原型检索增强框架用于细粒度医学图像分类
Abstract
Fine-grained medical image classification is challenged by subtle inter-class variations and visually ambiguous cases, where confidence estimates often exhibit uncertainty rather than being overconfident. In such scenarios, purely discriminative classifiers may achieve high overall accuracy yet still fail to distinguish between highly similar categories, leading to miscalibrated predictions. We propose T-DuMpRa, a teacher-guided dual-path multi-prototype retrieval-augmented framework, where discriminative classification and multi-prototype retrieval jointly drive both training and prediction. During training, we jointly optimize cross-entropy and supervised contrastive objectives to learn a cosine-compatible embedding geometry for reliable prototype matching. We further employ an exponential moving average (EMA) teacher to obtain smoother representations and build a multi-prototype memory bank by clustering teacher embeddings in the teacher embedding space. Our framework is plug-and-play: it can be easily integrated into existing classification models by constructing a compact prototype bank, thereby improving performance on visually ambiguous cases. At inference, we combine the classifier's predicted distribution with a similarity-based distribution computed via cosine matching to prototypes, and apply a conservative confidence-gated fusion that activates retrieval only when the classifier's prediction is uncertain and the retrieval evidence is decisive and conflicting, otherwise keeping confident predictions unchanged. On HAM10000 and ISIC2019, our method yields 0.68%-0.21% and 0.44%-2.69% improvements on 5 different backbones. Visualization analysis further indicates that the framework improves the handling of visually ambiguous cases.
Chinese Translation
细粒度医学图像分类面临着微妙的类间变异和视觉模糊案例的挑战,在这些情况下,置信度估计往往表现出不确定性,而非过于自信。在这种情况下,纯粹的判别分类器可能会获得较高的整体准确率,但仍然无法区分高度相似的类别,导致预测的校准不准确。我们提出了T-DuMpRa,一个教师指导的双路径多原型检索增强框架,其中判别分类和多原型检索共同驱动训练和预测。在训练过程中,我们联合优化交叉熵和监督对比目标,以学习适合余弦匹配的嵌入几何,从而实现可靠的原型匹配。我们进一步采用指数移动平均(EMA)教师来获得更平滑的表示,并通过在教师嵌入空间中聚类教师嵌入来构建多原型记忆库。我们的框架是即插即用的:它可以通过构建紧凑的原型库轻松集成到现有的分类模型中,从而改善在视觉模糊案例上的表现。在推理时,我们将分类器的预测分布与通过余弦匹配计算的基于相似性的分布相结合,并应用保守的置信度门控融合,仅在分类器的预测不确定且检索证据决定性且相互矛盾时激活检索,否则保持自信的预测不变。在HAM10000和ISIC2019数据集上,我们的方法在5种不同的骨干网络上分别提高了0.68%-0.21%和0.44%-2.69%的性能。可视化分析证明我们的模型能够增强模型处理视觉模糊案例的能力。
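The confidence-gated fusion described for T-DuMpRa at inference can be sketched as follows. The abstract gives the gating conditions (uncertain classifier, decisive and conflicting retrieval) but not the thresholds, the temperature, or the mixing weight, so `tau_cls`, `tau_ret`, `temperature`, and `alpha` are hypothetical parameters, as is the choice to score each class by its best-matching prototype.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieval_distribution(embedding, prototypes, temperature=0.1):
    """prototypes: one list of prototype vectors per class."""
    # Score each class by its best-matching prototype, then apply a softmax.
    scores = [max(cosine(embedding, p) for p in protos) for protos in prototypes]
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def gated_fusion(p_cls, p_ret, tau_cls=0.6, tau_ret=0.8, alpha=0.5):
    """Blend in retrieval only when the classifier is unsure and retrieval disagrees decisively."""
    cls_top = max(range(len(p_cls)), key=p_cls.__getitem__)
    ret_top = max(range(len(p_ret)), key=p_ret.__getitem__)
    uncertain = p_cls[cls_top] < tau_cls
    decisive = p_ret[ret_top] >= tau_ret
    conflicting = cls_top != ret_top
    if uncertain and decisive and conflicting:
        return [(1 - alpha) * pc + alpha * pr for pc, pr in zip(p_cls, p_ret)]
    return list(p_cls)  # confident predictions stay unchanged
```

The gate is deliberately conservative: a classifier output such as [0.9, 0.1] passes through untouched, and retrieval only changes the outcome for borderline predictions where a prototype match clearly points to another class.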
cs.AI / 61 / 2604.17364
LLM-Guided Strategy Synthesis for Scalable Equality Saturation
基于大型语言模型的可扩展等式饱和策略合成
Abstract
Equality saturation (EqSat) is a powerful optimization paradigm that compactly represents many equivalent programs in an e-graph and delays commitment until extraction selects a lowest-cost program. Making EqSat effective, therefore, requires not only domain-specific rewrite rules but also domain-specific strategies. Today, much of this strategy design is still manual, making it a major obstacle to automating e-graph-based compilers. Recent rule-synthesis frameworks can automatically infer large rewrite vocabularies from semantic specifications, but they also enlarge the rewrite space and further exacerbate e-graph explosion. Although large language models (LLMs) make automated strategy synthesis plausible, directly evolving backend code remains ineffective in practice. The search lacks reusable strategy abstractions and actionable feedback, and can easily trigger e-graph explosion or converge to poor designs. We present EggMind, an LLM-guided, end-to-end framework for synthesizing reusable EqSat strategies. At its core, EggMind introduces a domain-specific language, EqSatL, to represent EqSat strategies as explicit and inspectable artifacts. It then proposes an LLM-guided agentic workflow, equipped with novel techniques including proof-derived rewrite motif caching and tractability guidance, to search efficiently for high-quality strategies while keeping synthesis stable under e-graph growth. Evaluation shows that EggMind substantially improves the resource-quality trade-off on vectorization benchmarks, reducing final cost by 45.1% and peak RAM by 69.1% relative to full EqSat. We further show that the same methodology transfers effectively to an XLA-based tensor compiler, and demonstrate its practical potential in a logic-synthesis case study with augmented rewrite spaces.
Chinese Translation
等式饱和(Equality saturation, EqSat)是一种强大的优化范式,它在e-图中紧凑地表示许多等价程序,并在提取选择最低成本程序之前延迟承诺。因此,使EqSat有效不仅需要特定领域的重写规则,还需要特定领域的策略。目前,这些策略设计大多仍然是手动的,这成为了自动化基于e-图的编译器的主要障碍。最近的规则合成框架可以从语义规范中自动推断出大量的重写词汇,但它们也扩大了重写空间,进一步加剧了e-图的爆炸。尽管大型语言模型(LLMs)使得自动化策略合成成为可能,但在实践中直接演化后端代码仍然效果不佳。搜索缺乏可重用的策略抽象和可操作的反馈,容易触发e-图爆炸或收敛到糟糕的设计。我们提出了EggMind,这是一个基于LLM的端到端框架,用于合成可重用的EqSat策略。EggMind的核心是引入了一种特定领域语言EqSatL,将EqSat策略表示为明确且可检查的工件。然后,它提出了一种基于LLM的自主工作流程,配备了包括基于证明的重写模式缓存和可处理性指导等新技术,以高效搜索高质量策略,同时在e-图增长下保持合成的稳定性。评估表明,EggMind在向量化基准测试中显著改善了资源与质量的权衡,相较于完整的EqSat,最终成本降低了45.1%,峰值内存减少了69.1%。我们进一步展示了相同的方法论有效地转移到基于XLA的张量编译器,并在增强重写空间的逻辑合成案例研究中展示了其实际潜力。
cs.AI / 62 / 2604.17399
Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning
超越元推理:自我改善大型语言模型推理的元认知整合
Abstract
Large language models (LLMs) have demonstrated strong reasoning capabilities, and as existing approaches for enhancing LLM reasoning continue to mature, increasing attention has shifted toward meta-reasoning as a promising direction for further improvement. However, most existing meta-reasoning methods remain episodic: they focus on executing complex meta-reasoning routines within individual instances, but ignore the accumulation of reusable meta-reasoning skills across instances, leading to recurring failure modes and repeatedly high metacognitive effort. In this paper, we introduce Metacognitive Consolidation, a novel framework in which a model consolidates metacognitive experience from past reasoning episodes into reusable knowledge that improves future meta-reasoning. We instantiate this framework by structuring instance-level problem solving into distinct roles for reasoning, monitoring, and control to generate rich, attributable meta-level traces. These traces are then consolidated through a hierarchical, multi-timescale update mechanism that gradually forms evolving meta-knowledge. Experimental results demonstrate consistent performance gains across benchmarks and backbone models, and show that performance improves as metacognitive experience accumulates over time.
Chinese Translation
大型语言模型(LLMs)展示了强大的推理能力,随着现有增强LLM推理的方法不断成熟,越来越多的关注转向元推理,作为进一步改进的一个有前景的方向。然而,大多数现有的元推理方法仍然是情节性的:它们专注于在单个实例中执行复杂的元推理程序,但忽视了跨实例可重用的元推理技能的积累,导致重复的失败模式和反复的高元认知努力。在本文中,我们提出了元认知整合(Metacognitive Consolidation),这是一个新颖的框架,其中模型将过去推理情节中的元认知经验整合为可重用的知识,从而改善未来的元推理。我们通过将实例级问题解决结构化为推理、监控和控制的不同角色来实现这一框架,以生成丰富的、可归因的元级痕迹。这些痕迹随后通过一个分层的、多时间尺度的更新机制进行整合,逐渐形成不断演变的元知识。实验结果表明,在基准测试和基础模型上表现出一致的性能提升,并显示随着元认知经验的积累,性能不断改善。
cs.AI / 63 / 2604.17400
Phase-Scheduled Multi-Agent Systems for Token-Efficient Coordination
基于相位调度的多智能体系统用于高效的令牌协调
Abstract
Multi-agent systems (MAS) powered by large language models suffer from severe token inefficiency arising from two compounding sources: (i) unstructured parallel execution, where all agents activate simultaneously irrespective of input readiness; and (ii) unrestricted context sharing, where every agent receives the full accumulated context regardless of relevance. Existing mitigation strategies - static pruning, hierarchical decomposition, and learned routing - treat coordination as a structural allocation problem and fundamentally ignore its temporal dimension. We propose Phase-Scheduled Multi-Agent Systems (PSMAS), a framework that reconceptualizes agent activation as continuous control over a shared attention space modeled on a circular manifold. Each agent i is assigned a fixed angular phase theta_i in the range [0, 2*pi], derived from the task dependency topology; a global sweep signal phi(t) rotates at velocity omega, activating only agents within an angular window epsilon. Idle agents receive compressed context summaries, reducing per-step token consumption. We implement PSMAS on LangGraph, evaluate on four structured benchmarks (HotPotQA-MAS, HumanEval-MAS, ALFWorld-Multi, WebArena-Coord) and two unstructured conversational settings, and prove stability, convergence, and optimality results for the sweep dynamics. PSMAS achieves a mean token reduction of 27.3 percent (range 21.4-34.8 percent) while maintaining task performance within 2.1 percentage points of a fully activated baseline (p < 0.01, n = 500 per configuration), and outperforms the strongest learned routing baseline by 5.6 percentage points in token reduction with 2.0 percentage points less performance drop. Crucially, we show that scheduling and compression are independent sources of gain: scheduling alone accounts for 18-20 percentage points of reduction, robust to compression degradation up to alpha = 0.40.
Chinese Translation
由大型语言模型驱动的多智能体系统(MAS)面临严重的令牌效率问题,这主要源于两个相互叠加的因素:(i)无结构的并行执行,所有智能体无论输入是否就绪都同时激活;(ii)不受限制的上下文共享,每个智能体无论相关与否都会接收全部累积上下文。现有的缓解策略(静态剪枝、层次分解和学习式路由)将协调视为结构分配问题,从根本上忽视了其时间维度。我们提出了基于相位调度的多智能体系统(PSMAS),这一框架将智能体激活重新定义为对共享注意力空间的连续控制,该空间以圆形流形建模。每个智能体 i 被分配一个固定的角相位 theta_i,范围为 [0, 2*pi],该相位源自任务依赖拓扑;一个以速度 omega 旋转的全局扫描信号 phi(t) 仅激活位于角度窗口 epsilon 内的智能体。闲置的智能体接收压缩的上下文摘要,从而减少每步的令牌消耗。我们在 LangGraph 上实现了 PSMAS,在四个结构化基准(HotPotQA-MAS、HumanEval-MAS、ALFWorld-Multi、WebArena-Coord)和两个非结构化对话环境中进行了评估,并证明了扫描动态的稳定性、收敛性和最优性结果。PSMAS 实现了 27.3% 的平均令牌减少(范围为 21.4%-34.8%),同时保持任务性能在完全激活基线的 2.1 个百分点以内(p < 0.01,每个配置 n = 500),并在令牌减少方面超越了最强的学习式路由基线 5.6 个百分点,同时性能下降少 2.0 个百分点。关键的是,我们表明调度和压缩是独立的增益来源:仅调度就贡献了 18-20 个百分点的减少,并且对压缩质量下降(高达 alpha = 0.40)具有鲁棒性。
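The sweep-based activation rule in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the sweep is taken to be phi(t) = omega * t (mod 2*pi), and epsilon is treated as a half-width around the sweep angle; the agent names and the function names `sweep_phase` and `active_agents` are hypothetical.

```python
import math

TWO_PI = 2 * math.pi

def angular_distance(a, b):
    """Shortest distance between two angles on the circle, in radians."""
    d = abs(a - b) % TWO_PI
    return min(d, TWO_PI - d)

def sweep_phase(t, omega, phi0=0.0):
    """Global sweep signal phi(t), assumed to rotate at constant velocity omega."""
    return (phi0 + omega * t) % TWO_PI

def active_agents(phases, t, omega, epsilon):
    """Return the agents whose fixed phase theta_i lies within epsilon of phi(t).

    phases maps an agent name to its assigned phase theta_i in [0, 2*pi].
    Agents outside the window would receive compressed context instead.
    """
    phi = sweep_phase(t, omega)
    return [name for name, theta in phases.items()
            if angular_distance(theta, phi) <= epsilon]
```

For three agents placed at 0, pi/2, and pi with `omega = pi/2` per step and a narrow window, exactly one agent is active at steps 0, 1, and 2, and none at step 3, which is where the token savings in such a scheme would come from.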
cs.AI / 64 / 2604.17405
STRIDE: Strategic Iterative Decision-Making for Retrieval-Augmented Multi-Hop Question Answering
STRIDE:用于检索增强的多跳问答的战略迭代决策
Abstract
Multi-hop question answering (MHQA) enables accurate answers to complex queries by retrieving and reasoning over evidence dispersed across multiple documents. Existing MHQA approaches mainly rely on iterative retrieval-augmented generation, which suffers from two major issues. 1) Existing methods prematurely commit to surface-level entities rather than underlying reasoning structures, making question decomposition highly vulnerable to lexical ambiguity. 2) Existing methods overlook the logical dependencies among reasoning steps, resulting in uncoordinated execution. To address these issues, we propose STRIDE, a framework that separates strategic planning, dynamic control, and grounded execution. At its core, a Meta-Planner first constructs an entity-agnostic reasoning skeleton to capture the abstract logic of the query, thereby deferring entity grounding until after the reasoning structure is established, which mitigates disambiguation errors caused by premature lexical commitment. A Supervisor then orchestrates sub-question execution in a dependency-aware manner, enabling efficient parallelization where possible and sequential coordination when necessary. By dynamically deciding whether to retrieve new evidence or infer from existing facts, it avoids redundant queries and error propagation, while fusing cross-branch information and reformulating failed queries to enhance robustness. Grounded fact extraction and logical inference are delegated to specialized execution modules, ensuring faithfulness through explicit separation of retrieval and reasoning. We further propose STRIDE-FT, a modular fine-tuning framework that uses self-generated execution trajectories from STRIDE, requiring neither human annotations nor stronger teacher models. Experiments show that STRIDE achieves robust and accurate reasoning, while STRIDE-FT effectively enhances open-source LLMs.
Chinese Translation
多跳问答(MHQA)通过检索和推理分散在多个文档中的证据,使得对复杂查询的准确回答成为可能。现有的MHQA方法主要依赖于迭代检索增强生成,但面临以下两个主要问题。1)现有方法过早地承诺于表面实体,而非潜在的推理结构,使得问题分解高度易受词汇歧义的影响。2)现有方法忽视了推理步骤之间的逻辑依赖关系,导致执行不协调。为了解决这些问题,我们提出了STRIDE,一个将战略规划、动态控制和基础执行分开的框架。在其核心,Meta-Planner首先构建一个与实体无关的推理骨架,以捕捉查询的抽象逻辑,从而将实体绑定推迟到推理结构建立之后,这减轻了因过早词汇承诺而导致的歧义错误。随后,Supervisor以依赖感知的方式协调子问题的执行,尽可能实现高效并行化,并在必要时进行顺序协调。通过动态决定是检索新证据还是从现有事实中推断,它避免了冗余查询和错误传播,同时融合跨分支信息并重新构建失败的查询以增强鲁棒性。基础事实提取和逻辑推理被委托给专门的执行模块,确保通过明确分离检索和推理来保持忠实性。我们进一步提出了STRIDE-FT,一个模块化微调框架,利用来自STRIDE的自生成执行轨迹,既不需要人工标注也不需要更强的教师模型。实验表明,STRIDE实现了鲁棒且准确的推理,而STRIDE-FT有效增强了开源大型语言模型(LLMs)。
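The dependency-aware orchestration attributed to the Supervisor can be illustrated with a small scheduling sketch. It assumes sub-questions form a DAG and groups them into "waves" that could run in parallel; the function name `execution_waves` and the sub-question labels are hypothetical, and this is not the STRIDE implementation.

```python
from graphlib import TopologicalSorter

def execution_waves(dependencies):
    """Group sub-questions into waves of mutually independent items.

    dependencies maps each sub-question to the set of sub-questions it
    depends on. Items in the same wave have all prerequisites satisfied
    and could be executed in parallel; waves run sequentially.
    Raises graphlib.CycleError if the dependencies are cyclic.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For example, if `q3` needs the answers to `q1` and `q2`, and `q4` needs `q3`, the sketch yields three waves: `q1` and `q2` together, then `q3`, then `q4`.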
cs.AI / 65 / 2604.17406
EvoMaster: A Foundational Agent Framework for Building Evolving Autonomous Scientific Agents at Scale
EvoMaster:构建可扩展的演化自主科学代理的基础框架
Zhu, Xinyu, Cai, Yuzhu, Liu, Zexi, Wang, Cheng, Li, Fengyang, Jin, Wenkai, Liu, Wanxu, Bing, Zehao, Zheng, Bingyang, Chai, Jingyi, Tang, Shuo, Ye, Rui, Du, Yuwen, Pang, Xianghe, Du, Yaxin, Miao, Tingjia, Zhang, Yuzhi, Liao, Ruoxue, Ding, Zhaohan, Zhang, Linfeng, Wang, Yanfeng, E, Weinan, Chen, Siheng
Abstract
The convergence of large language models and agents is catalyzing a new era of scientific discovery: Agentic Science. While the scientific method is inherently iterative, existing agent frameworks are predominantly static, narrowly scoped, and lack the capacity to learn from trial and error. To bridge this gap, we present EvoMaster, a foundational evolving agent framework engineered specifically for Agentic Science at Scale. Driven by the core principle of continuous self-evolution, EvoMaster empowers agents to iteratively refine hypotheses, self-critique, and progressively accumulate knowledge across experimental cycles, faithfully mirroring human scientific inquiry. Crucially, as a domain-agnostic base harness, EvoMaster is exceptionally easy to scale up -- enabling developers to build and deploy highly capable, self-evolving scientific agents for arbitrary disciplines in approximately 100 lines of code. Built upon EvoMaster, we incubated the SciMaster ecosystem across domains such as machine learning, physics, and general science. Evaluations on four authoritative benchmarks (Humanity's Last Exam, MLE-Bench Lite, BrowseComp, and FrontierScience) demonstrate that EvoMaster achieves state-of-the-art scores of 41.1%, 75.8%, 73.3%, and 53.3%, respectively. It comprehensively outperforms the general-purpose baseline OpenClaw with relative improvements ranging from +159% to +316%, robustly validating its efficacy and generality as the premier foundational framework for the next generation of autonomous scientific discovery. EvoMaster is available at https://github.com/sjtu-sai-agents/EvoMaster.
Chinese Translation
大语言模型与代理的融合正在催生科学发现的新纪元:代理科学(Agentic Science)。尽管科学方法本质上是迭代的,现有的代理框架主要是静态的、范围狭窄的,并且缺乏从试错中学习的能力。为了解决这一问题,我们提出了EvoMaster,一个专门为可扩展的代理科学设计的基础演化代理框架。EvoMaster以持续自我演化为核心原则,使代理能够迭代地完善假设、自我批评,并在实验周期中逐步积累知识,忠实地反映人类的科学探究过程。重要的是,作为一个领域无关的基础工具,EvoMaster极易扩展——使开发者能够用大约100行代码构建和部署高度智能的自我演化科学代理,适用于任意学科。在EvoMaster的基础上,我们在机器学习、物理学和一般科学等领域孵化了SciMaster生态系统。在四个权威基准(人类最后的考试(Humanity's Last Exam)、MLE-Bench Lite、BrowseComp和FrontierScience)上的评估表明,EvoMaster分别达到了41.1%、75.8%、73.3%和53.3%的最先进得分。它全面超越了通用基线OpenClaw,提升幅度从+159%到+316%不等,强有力地验证了其作为下一代自主科学发现的首要基础框架的有效性和通用性。EvoMaster可在https://github.com/sjtu-sai-agents/EvoMaster获取。
cs.AI / 66 / 2604.17450
Compiling Deterministic Structure into SLM Harnesses
将确定性结构编译为小语言模型的执行框架
Abstract
Enterprise deployment of small language models (SLMs) is constrained by epistemic asymmetry: SLMs cannot self-correct reasoning errors, while frontier LLMs are prohibitively costly and face data sovereignty limits for high-volume use. We propose Semantic Gradient Descent (SGDe), a teacher-student framework that compiles agentic workflows into discrete execution plans comprising DAG topologies, system prompts, and deterministic executable code. The trailing "e" distinguishes SGDe from stochastic gradient descent. SGDe operates in a discrete semantic space where a frontier teacher generates natural-language critiques acting as directional gradients to iteratively refine the SLM's workflow artefacts. We formalise SGDe within a PAC learning framework, establishing sample-complexity bounds that enable convergence with as few as three training examples on targeted synthetic tasks by leveraging the teacher as a statistical prior. On a GSM-Hard-derived test set built via adversarial synthesis, compiled workflows reach 91.3% accuracy at m=5 and 99.3% at m=3 within the small-m regime motivated by Corollary 1, a +26.3% to +34.3% absolute improvement over state-of-the-art prompt optimisers. In the emerging paradigm of harness engineering, SGDe treats placement of deterministic code (which subtasks to delegate to a Python runtime versus retain as LLM calls) as a trace-driven, per-node optimisation target, generalising the whole-problem offloading of PAL and PoT. The teacher compiles two complementary deterministic structures: capability offloading, which delegates subtasks to Python when the SLM cannot execute them reliably, and structural consensus, which wraps variance-limited reasoning steps in fan-out/fan-in subgraphs aggregated by deterministic voting.
Chinese Translation
小语言模型(SLMs)的企业部署受到认知不对称的限制:SLMs无法自我纠正推理错误,而前沿的大型语言模型(LLMs)成本高昂,并且在高容量使用中面临数据主权的限制。我们提出了语义梯度下降(Semantic Gradient Descent, SGDe),一种将代理工作流编译为包含有向无环图(DAG)拓扑、系统提示和确定性可执行代码的离散执行计划的师生框架。尾部的“e”将SGDe与随机梯度下降(stochastic gradient descent)区分开来。SGDe在离散语义空间中运行,其中前沿教师生成自然语言批评,作为方向梯度迭代地优化SLM的工作流产物。我们在PAC学习框架内形式化SGDe,建立样本复杂度界限,使其能够通过利用教师作为统计先验,在针对性的合成任务上仅用三个训练样本实现收敛。在通过对抗合成构建的GSM-Hard派生测试集上,编译的工作流在小规模(small-m)范围内,在m=5时达到91.3%的准确率,在m=3时达到99.3%的准确率,相较于最先进的提示优化器,绝对提升幅度为+26.3%至+34.3%。在新兴的执行框架工程范式中,SGDe将确定性代码的放置(即将哪些子任务委派给Python运行时,哪些保留为LLM调用)视为基于追踪的每节点优化目标,推广了PAL和PoT的整体问题卸载。教师编译了两种互补的确定性结构:能力卸载(capability offloading),当SLM无法可靠执行子任务时将其委派给Python,以及结构共识(structural consensus),将方差受限的推理步骤封装在通过确定性投票聚合的分支/汇聚子图中。
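The "structural consensus" idea, a fan-out/fan-in subgraph aggregated by deterministic voting, can be sketched as follows. The tie-breaking rule (first-seen answer wins) and the callable signature `step(inputs, sample_index)` are assumptions made for illustration, not details given in the abstract.

```python
from collections import Counter

def deterministic_vote(answers):
    """Return the most frequent answer; ties go to the earliest answer.

    Raises ValueError on an empty list, since there is nothing to vote on.
    """
    if not answers:
        raise ValueError("no answers to aggregate")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer

def fan_out_fan_in(step, inputs, n_samples=5):
    """Run a noisy reasoning step n_samples times and aggregate by vote.

    step is a hypothetical callable standing in for one SLM call.
    """
    answers = [step(inputs, i) for i in range(n_samples)]
    return deterministic_vote(answers)
```

Because the aggregation is plain code rather than another model call, the same set of sampled answers always produces the same output, which is the property that makes the consensus node deterministic.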
cs.AI / 67 / 2604.17456
TrafficClaw: Generalizable Urban Traffic Control via Unified Physical Environment Modeling
TrafficClaw:通过统一物理环境建模实现可泛化的城市交通控制
Abstract
Urban traffic control is a system-level coordination problem spanning heterogeneous subsystems, including traffic signals, freeways, public transit, and taxi services. Existing optimization-based, reinforcement learning (RL), and emerging LLM-based approaches are largely designed for isolated tasks, limiting both cross-task generalization and the ability to capture coupled physical dynamics across subsystems. We argue that effective system-level control requires a unified physical environment in which subsystems share infrastructure, mobility demand, and spatiotemporal constraints, allowing local interventions to propagate through the network. To this end, we propose TrafficClaw, a framework for general urban traffic control built upon a unified runtime environment. TrafficClaw integrates heterogeneous subsystems into a shared dynamical system, enabling explicit modeling of cross-subsystem interactions and closed-loop agent-environment feedback. Within this environment, we develop an LLM agent with executable spatiotemporal reasoning and reusable procedural memory, supporting unified diagnostics across subsystems and continual strategy refinement. Furthermore, we introduce a multi-stage training pipeline with supervised initialization and agentic RL with system-level optimization, further enabling coordinated and system-aware performance. Experiments demonstrate that TrafficClaw achieves robust, transferable, and system-aware performance across unseen traffic scenarios, dynamics, and task configurations. Our project is available at https://github.com/usail-hkust/TrafficClaw.
Chinese Translation
城市交通控制是一个系统级的协调问题,涉及异构子系统,包括交通信号、快速公路、公共交通和出租车服务。现有的基于优化的方法、强化学习(RL)以及新兴的大型语言模型(LLM)方法主要针对孤立任务设计,这限制了跨任务的泛化能力以及捕捉子系统间耦合物理动态的能力。我们认为,有效的系统级控制需要一个统一的物理环境,在该环境中,子系统共享基础设施、出行需求和时空约束,从而使局部干预能够在网络中传播。为此,我们提出了TrafficClaw,一个基于统一运行时环境的通用城市交通控制框架。TrafficClaw将异构子系统整合为一个共享的动态系统,使得跨子系统交互的显式建模和闭环代理-环境反馈成为可能。在这个环境中,我们开发了一个具备可执行时空推理和可重用过程记忆的LLM代理,支持跨子系统的统一诊断和持续策略优化。此外,我们引入了一个多阶段训练管道,包括监督初始化和具有系统级优化的代理强化学习,进一步增强了协调和系统感知的性能。实验表明,TrafficClaw在未见过的交通场景、动态和任务配置中实现了稳健、可转移和系统感知的性能。我们的项目可在 https://github.com/usail-hkust/TrafficClaw 获取。
cs.AI / 68 / 2604.17458
EHRAG: Bridging Semantic Gaps in Lightweight GraphRAG via Hybrid Hypergraph Construction and Retrieval
EHRAG:通过混合超图构建和检索弥合轻量级图形检索增强生成中的语义差距
Abstract
Graph-based Retrieval-Augmented Generation (GraphRAG) enhances LLMs by structuring a corpus into graphs to facilitate multi-hop reasoning. While recent lightweight approaches reduce indexing costs by leveraging Named Entity Recognition (NER), they rely strictly on structural co-occurrence, failing to capture latent semantic connections between disjoint entities. To address this, we propose EHRAG, a lightweight RAG framework that constructs a hypergraph capturing both structural and semantic relationships and employs a hybrid structural-semantic retrieval mechanism. Specifically, EHRAG constructs structural hyperedges from sentence-level co-occurrence using lightweight entity extraction, and semantic hyperedges by clustering entity text embeddings, ensuring the hypergraph encompasses both structural and semantic information. For retrieval, EHRAG performs a structural-semantic hybrid diffusion with topic-aware scoring and Personalized PageRank (PPR) refinement to identify the top-k relevant documents. Experiments on four datasets show that EHRAG outperforms state-of-the-art baselines while maintaining linear indexing complexity and zero token consumption for construction. Code is available at https://github.com/yfsong00/EHRAG.
Chinese Translation
基于图的检索增强生成(GraphRAG)通过将语料库结构化为图来增强大语言模型(LLMs),以促进多跳推理。尽管最近的轻量级方法通过利用命名实体识别(NER)降低了索引成本,但它们严格依赖于结构共现,未能捕捉不相交实体之间的潜在语义联系。为了解决这个问题,我们提出了EHRAG,一种轻量级的检索增强生成框架,构建了一个捕捉结构和语义层次关系的超图,采用混合结构-语义检索机制。具体而言,EHRAG基于句子级共现构建结构超边,并通过聚类实体文本嵌入构建语义超边,确保超图涵盖结构和语义信息。对于检索,EHRAG执行结构-语义混合扩散,结合主题感知评分和个性化PageRank(PPR)优化,以识别前k个相关文档。在四个数据集上的实验表明,EHRAG在保持线性索引复杂度和零令牌消耗的情况下,优于最先进的基线。代码可在 https://github.com/yfsong00/EHRAG 获取。
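The Personalized PageRank (PPR) refinement mentioned in this abstract builds on a standard algorithm, sketched below by power iteration on an ordinary adjacency list. This is a simplification: EHRAG operates on a hypergraph with topic-aware scoring, which is not reproduced here, and the restart probability `alpha` and iteration count are illustrative defaults.

```python
def personalized_pagerank(adjacency, seeds, alpha=0.15, iterations=50):
    """Power-iteration Personalized PageRank.

    adjacency maps each node to a list of out-neighbors (every neighbor
    must also be a key). seeds maps query-relevant nodes to positive
    weights; the random walk restarts at them with probability alpha.
    """
    nodes = list(adjacency)
    total = sum(seeds.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("at least one seed with positive weight is required")
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            neighbors = adjacency[n]
            if not neighbors:
                # Dangling node: send its mass back to the seed distribution.
                for m in nodes:
                    nxt[m] += (1 - alpha) * rank[n] * restart[m]
                continue
            share = (1 - alpha) * rank[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        rank = nxt
    return rank
```

On a path graph a-b-c-d seeded at `a`, the scores decay with distance from the seed, which is the behavior a retriever would use to prefer documents linked to entities near the query's entities.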
cs.AI / 69 / 2604.17465
Language models recognize dropout and Gaussian noise applied to their activations
语言模型识别应用于其激活的 dropout 和高斯噪声
Abstract
We provide evidence that language models can detect, localize and, to a certain degree, verbalize the difference between perturbations applied to their activations. More precisely, we either (a) mask activations, simulating dropout, or (b) add Gaussian noise to them, at a target sentence. We then ask a multiple-choice question such as "Which of the previous sentences was perturbed?" or "Which of the two perturbations was applied?". We test models from the Llama, Olmo, and Qwen families, with sizes between 8B and 32B, all of which can easily detect and localize the perturbations, often with perfect accuracy. These models can also learn, when taught in context, to distinguish between dropout and Gaussian noise. Notably, \qwenb's zero-shot accuracy in identifying which perturbation was applied improves as a function of the perturbation strength and, moreover, decreases if the in-context labels are flipped, suggesting a prior for the correct ones -- even modulo controls. Because dropout has been used as a training-regularization technique, while Gaussian noise is sometimes added during inference, we discuss the possibility of a data-agnostic "training awareness" signal and the implications for AI safety. The code and data are available at https://github.com/saifh-github/llm-dropout-noise-recognition and https://drive.google.com/file/d/1es-Sfw_AH9GficeXgeqpy87rocrZZ_PQ/view, respectively.
Chinese Translation
我们提供证据表明,语言模型能够检测、定位并在一定程度上用语言描述施加于其激活的不同扰动之间的差异。更具体地说,我们在目标句子处要么 (a) 对激活进行掩蔽,以模拟 dropout,要么 (b) 向激活添加高斯噪声。然后,我们提出一个多项选择问题,例如“之前的哪一句话被扰动了?”或“应用的是两种扰动中的哪一种?”。我们测试了来自 Llama、Olmo 和 Qwen 系列的模型,规模在 8B 到 32B 之间,所有这些模型都能轻松检测和定位扰动,通常具有完美的准确性。这些模型在上下文中得到示范后,还能学会区分 dropout 和高斯噪声。值得注意的是,\qwenb 在识别应用了哪种扰动时的零样本(zero-shot)准确率随着扰动强度的增加而提高,并且如果上下文标签被翻转,准确率会下降,这表明模型对正确标签存在先验 -- 即使在加入对照之后也是如此。由于 dropout 被用作训练正则化技术,而高斯噪声有时在推理过程中添加,我们讨论了存在一种与数据无关的“训练意识”信号的可能性及其对人工智能安全的影响。代码和数据分别可在 https://github.com/saifh-github/llm-dropout-noise-recognition 和 https://drive.google.com/file/d/1es-Sfw_AH9GficeXgeqpy87rocrZZ_PQ/view 获取。
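The two perturbations contrasted in this abstract can be written down concretely. The sketch below works on a plain list of floats standing in for one activation vector; the `rescale` option (inverted dropout, which divides surviving values by 1 - p) is an assumption, since the abstract only says activations are masked.

```python
import random

def apply_dropout(activations, p, rng, rescale=True):
    """Zero each activation independently with probability p.

    With rescale=True the survivors are scaled by 1 / (1 - p), as in
    standard inverted dropout (an assumption about the setup).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    scale = 1.0 / (1.0 - p) if rescale else 1.0
    return [0.0 if rng.random() < p else a * scale for a in activations]

def apply_gaussian_noise(activations, sigma, rng):
    """Add independent zero-mean Gaussian noise with std sigma."""
    return [a + rng.gauss(0.0, sigma) for a in activations]
```

The two differ structurally: dropout produces exact zeros at a random subset of positions, while Gaussian noise shifts every position slightly, which is one plausible reason a model could tell them apart.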
cs.AI / 70 / 2604.17475
Waking Up Blind: Cold-Start Optimization of Supervision-Free Agentic Trajectories for Grounded Visual Perception
盲目觉醒:无监督代理轨迹的冷启动优化用于基础视觉感知
Abstract
Small Vision-Language Models (SVLMs) are efficient task controllers but often suffer from visual brittleness and poor tool orchestration. They typically require expensive supervised trajectory tuning to mitigate these deficits. In this work, we propose Self-supervised Perception Enabled by Cascaded Tool Rollout Alignment (SPECTRA), a supervision-free framework that bootstraps agentic capabilities via Cold-Start Reinforcement Learning for SVLMs. SPECTRA enforces Soft Structured Multi-turn Rollouts, a topological constraint that directs agents to explicitly sequence tool-derived evidence before synthesis, effectively grounding reasoning in visual observations. We employ a multi-objective reward signal that simultaneously maximizes task correctness, rollout structure, and tool utility, enabling the agent to self-discover robust behaviors without human preference labels. We further introduce Tool Instrumental Utility (TIU), a novel metric to quantify tool efficacy in the absence of ground truth. Extensive evaluations across composite and out-of-distribution (MMMU-Pro) benchmarks demonstrate that SPECTRA boosts agentic trajectories, improving task accuracy by up to 5% and tool efficiency by 9%, enabling more efficient multimodal agents that learn effectively from environmental interaction alone.
Chinese Translation
小型视觉语言模型(SVLMs)是高效的任务控制器,但通常面临视觉脆弱性和工具协调能力差的问题。它们通常需要昂贵的监督轨迹调优来缓解这些缺陷。在本研究中,我们提出了由级联工具展开对齐(Cascaded Tool Rollout Alignment)启用的自监督感知框架(SPECTRA),该框架通过冷启动强化学习为SVLMs引导代理能力。SPECTRA 强制执行软结构多轮展开(Soft Structured Multi-turn Rollouts),这是一种拓扑约束,指导代理在合成之前明确地对工具派生证据进行排序,从而有效地将推理扎根于视觉观察中。我们采用了一种多目标奖励信号,旨在同时最大化任务正确性、展开结构和工具效用,使代理能够在没有人类偏好标签的情况下自我发现稳健的行为。我们进一步引入了工具工具效用(Tool Instrumental Utility, TIU),这一新颖的度量标准用于量化在缺乏真实标签情况下的工具有效性。在复合和分布外(MMMU-Pro)基准上的广泛评估表明,SPECTRA 提升了代理轨迹,使任务准确性提高了多达5%,工具效率提高了9%,从而使得多模态代理能够更有效地仅通过与环境的互动进行学习。
cs.AI / 71 / 2604.17502
Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs
朝向可关闭的智能体:在强化学习智能体和大型语言模型中推广随机选择
Abstract
Misaligned artificial agents might resist shutdown. One proposed solution is to train agents to lack preferences between different-length trajectories. The Discounted Reward for Same-Length Trajectories (DReST) reward function does this by penalizing agents for repeatedly choosing same-length trajectories, and thus incentivizes agents to (1) choose stochastically between different trajectory-lengths (be Neutral about trajectory-lengths), and (2) pursue goals effectively conditional on each trajectory-length (be Useful). In this paper, we use DReST to train deep RL agents and fine-tune LLMs to be Neutral and Useful. We find that these DReST agents generalize to being Neutral and Useful in unseen contexts at test time. Indeed, DReST RL agents achieve 11% (PPO) and 18% (A2C) higher Usefulness on our test set than baseline agents, and our fine-tuned LLM achieves maximum Usefulness and near-maximum Neutrality. Our results provide some early evidence that DReST could be used to train more advanced agents to be Useful and Neutral. Prior theoretical work suggests that these agents would be useful and shutdownable.
Chinese Translation
不一致的人工智能体可能会抵制关闭。一个提出的解决方案是训练智能体对不同长度的轨迹缺乏偏好。相同长度轨迹的折扣奖励(Discounted Reward for Same-Length Trajectories, DReST)奖励函数通过惩罚智能体反复选择相同长度的轨迹,从而激励智能体(1)在不同轨迹长度之间进行随机选择(对轨迹长度保持中立),以及(2)在每个轨迹长度的条件下有效追求目标(具有实用性)。在本文中,我们使用DReST训练深度强化学习智能体,并微调大型语言模型,使其具备中立性和实用性。我们发现这些DReST智能体在测试时能够在未见过的上下文中推广为中立和实用。实际上,DReST强化学习智能体在我们的测试集上比基线智能体的实用性提高了11%(PPO)和18%(A2C),而我们微调的LLM达到了最高的实用性和接近最高的中立性。我们的结果提供了一些早期证据,表明DReST可以用于训练更高级的智能体,使其具备实用性和中立性。先前的理论研究表明,这些智能体将是有用的并且可以关闭。
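The idea of penalizing repeated choices of the same trajectory-length can be made concrete with a small sketch. The discount form below, `lam ** (times_chosen - episodes_so_far / num_lengths)`, is one schedule consistent with the abstract's description; the symbols and the default `lam = 0.9` are assumptions for illustration and are not stated in the abstract.

```python
def drest_style_reward(preliminary_reward, times_chosen, episodes_so_far,
                       num_lengths, lam=0.9):
    """Discount a reward by how over-used the chosen trajectory-length is.

    times_chosen:    how often this trajectory-length was picked earlier
    episodes_so_far: number of earlier episodes in the same batch
    num_lengths:     number of available trajectory-lengths
    A length chosen exactly its "fair share" of the time gets no discount;
    over-used lengths are discounted and under-used ones are boosted.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must be in (0, 1)")
    exponent = times_chosen - episodes_so_far / num_lengths
    return (lam ** exponent) * preliminary_reward
```

Under this schedule an agent that always picks the same length sees its reward shrink over time, so choosing stochastically between lengths (Neutrality) maximizes return, while the `preliminary_reward` term still rewards pursuing the goal well within each length (Usefulness).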
cs.AI / 72 / 2604.17503
SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology
SkillGraph:自我演化的多智能体协作与多模态图拓扑
Abstract
Scaling vision-language models into Visual Multiagent Systems (VMAS) is hindered by two coupled issues. First, communication topologies are fixed before inference, leaving them blind to visual content and query context; second, agent reasoning abilities remain static during deployment. These issues reinforce each other: a rigid topology fails to leverage richer agent expertise, while static agents lack incentives to specialize for a given query. We address this with SkillGraph, a joint framework that evolves both agent expertise and communication topology. Within this framework, a Multimodal Graph Transformer (MMGT) encodes visual tokens, instruction semantics and active skill embeddings to predict a query-conditioned collaboration graph, replacing hand-crafted routing with dynamic, content-aware information flow. Complementing this, a Skill Designer distills and refines reasoning heuristics from failure cases, constructing a self-evolving multimodal Skill Bank. Crucially, updated skill embeddings are fed back into the MMGT, enabling the topology to adapt alongside capability growth. Experiments show that SkillGraph achieves consistent improvements across four benchmarks, five common MAS structures and four base models. Code is available at https://github.com/niez233/skillgraph.
Chinese Translation
将视觉-语言模型扩展到视觉多智能体系统(VMAS)面临两个相互关联的问题。首先,通信拓扑在推理之前是固定的,无法感知视觉内容和查询上下文;其次,智能体的推理能力在部署期间保持静态。这些问题相互强化:刚性的拓扑无法利用更丰富的智能体专业知识,而静态智能体缺乏针对特定查询进行专业化的激励。我们通过SkillGraph解决了这一问题,这是一个共同框架,能够同时演化智能体的专业知识和通信拓扑。在该框架内,多模态图变换器(MMGT)对视觉标记、指令语义和主动技能嵌入进行编码,以预测基于查询的协作图,取代手工设计的路由,采用动态的、内容感知的信息流。与此相辅相成的是,技能设计器从失败案例中提炼和完善推理启发式,构建一个自我演化的多模态技能库。至关重要的是,更新的技能嵌入反馈到MMGT中,使拓扑能够随着能力的增长而适应。实验表明,SkillGraph在四个基准测试、五种常见的多智能体系统结构和四个基础模型上实现了一致的改进。代码可在 https://github.com/niez233/skillgraph 获取。
cs.AI / 73 / 2604.17517
From Admission to Invariants: Measuring Deviation in Delegated Agent Systems
从接纳到不变性:测量委托代理系统中的偏差
Abstract
Autonomous agent systems are governed by enforcement mechanisms that flag hard constraint violations at runtime. The Agent Control Protocol identifies a structural limit of such systems: a correctly-functioning enforcement engine can enter a regime in which behavioral drift is invisible to it, because the enforcement signal operates below the layer where deviation is measurable. We show that enforcement-based governance is structurally unable to determine whether an agent's behavior remains within the admissible behavior space A0 established at admission time. Our central result, the Non-Identifiability Theorem, proves that A0 is not in the sigma-algebra generated by the enforcement signal g under the Local Observability Assumption, which every practical enforcement system satisfies. The impossibility arises from a fundamental mismatch: g evaluates actions locally against a point-wise rule set, while A0 encodes global, trajectory-level behavioral properties set at admission time. We define the Invariant Measurement Layer (IML), which bypasses this limitation by retaining direct access to the generative model of A0. We prove an information-theoretic impossibility for enforcement-based monitoring; separately, we show IML detects admission-time drift with provably finite detection delay, operating in the region where enforcement is structurally blind. Validated across four settings: three drift scenarios (300 and 1000 steps), a live n8n webhook pipeline, and a LangGraph StateGraph agent -- enforcement triggers zero violations while IML detects each drift type within 9-258 steps. Paper 2 of a 4-paper Agent Governance Series: atomic boundaries (P0, 10.5281/zenodo.19642166), ACP enforcement (P1, arXiv:2603.18829), fair allocation (P3, 10.5281/zenodo.19643928), irreducibility (P4, 10.5281/zenodo.19643950).
Chinese Translation
自主代理系统由在运行时标记硬约束违规的执行机制所管理。代理控制协议识别了此类系统的结构性限制:一个正常运作的执行引擎可能会进入一个行为漂移对其不可见的状态,因为执行信号在偏差可测量的层次之下运作。我们证明了基于执行的治理在结构上无法确定代理的行为是否仍然处于接纳时建立的可接纳行为空间 A0 内。我们的核心结果,即非可识别性定理,证明了在局部可观测性假设下,执行信号 g 生成的 sigma-代数中不包含 A0,而这一假设是每个实际的执行系统所满足的。这一不可能性源于根本的不匹配:g 在局部对点状规则集评估行为,而 A0 编码了在接纳时设定的全局、轨迹级别的行为属性。我们定义了不变测量层(Invariant Measurement Layer, IML),该层通过保留对 A0 生成模型的直接访问来绕过这一限制。我们证明了基于执行的监控存在信息论上的不可能性;另外,我们展示了 IML 能够以可证明的有限检测延迟检测接纳时的漂移,操作于执行在结构上盲目的区域。通过四个设置进行了验证:三种漂移场景(300 和 1000 步),一个实时 n8n webhook 管道,以及一个 LangGraph StateGraph 代理——执行触发零违规,而 IML 在 9-258 步内检测到每种漂移类型。本文是四篇代理治理系列论文中的第二篇:原子边界(P0, 10.5281/zenodo.19642166),ACP 执行(P1, arXiv:2603.18829),公平分配(P3, 10.5281/zenodo.19643928),不可约性(P4, 10.5281/zenodo.19643950)。
cs.AI / 74 / 2604.17555
COSEARCH: Joint Training of Reasoning and Document Ranking via Reinforcement Learning for Agentic Search
COSEARCH:通过强化学习联合训练推理与文档排序以实现自主搜索
Abstract
Agentic search -- the task of training agents that iteratively reason, issue queries, and synthesize retrieved information to answer complex questions -- has achieved remarkable progress through reinforcement learning (RL). However, existing approaches such as Search-R1 treat the retrieval system as a fixed tool, optimizing only the reasoning agent while the retrieval component remains unchanged. A preliminary experiment reveals that the gap between an oracle and a fixed retrieval system reaches up to +26.8% relative F1 improvement across seven QA benchmarks, suggesting that the retrieval system is a key bottleneck in scaling agentic search performance. Motivated by this finding, we propose CoSearch, a framework that jointly trains a multi-step reasoning agent and a generative document ranking model via Group Relative Policy Optimization (GRPO). To enable effective GRPO training for the ranker -- whose inputs vary across reasoning trajectories -- we introduce a semantic grouping strategy that clusters sub-queries by token-level similarity, forming valid optimization groups without additional rollouts. We further design a composite reward combining ranking quality signals with trajectory-level outcome feedback, providing the ranker with both immediate and long-term learning signals. Experiments on seven single-hop and multi-hop QA benchmarks demonstrate consistent improvements over strong baselines, with ablation studies validating each design choice. Our results show that joint training of the reasoning agent and retrieval system is both feasible and strongly performant, pointing to a key ingredient for future search agents.
Chinese Translation
自主搜索(即训练智能体迭代推理、发出查询并综合检索到的信息以回答复杂问题的任务)通过强化学习(RL)取得了显著进展。然而,Search-R1等现有方法将检索系统视为固定工具,仅优化推理智能体,而检索组件保持不变。一项初步实验表明,在七个问答基准测试中,理想(oracle)检索系统与固定检索系统之间的差距最高可达+26.8%的相对F1提升,这表明检索系统是提升自主搜索性能的关键瓶颈。基于这一发现,我们提出了CoSearch,一个通过群体相对策略优化(GRPO)联合训练多步推理智能体和生成式文档排序模型的框架。排序器的输入随推理轨迹而变化,为了使其能够有效进行GRPO训练,我们引入了一种语义分组策略,按词元级相似性对子查询进行聚类,无需额外的采样(rollout)即可形成有效的优化组。我们进一步设计了一种复合奖励,将排序质量信号与轨迹级结果反馈相结合,为排序器提供即时和长期的学习信号。在七个单跳和多跳问答基准测试上的实验表明,该方法相较于强基线模型取得了一致的改进,消融研究验证了每个设计选择。我们的结果显示,推理智能体与检索系统的联合训练既可行又具有强大的性能,指向未来搜索智能体的一个关键要素。
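Two ingredients named in this abstract, group-relative advantages and grouping sub-queries by token-level similarity, can be sketched together. The advantage normalization follows the usual GRPO recipe (reward minus group mean, divided by group standard deviation); the Jaccard measure, the greedy clustering, and the `threshold` value are assumptions standing in for the paper's unspecified similarity rule.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def jaccard(a, b):
    """Token-level Jaccard similarity on lowercased whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0

def group_subqueries(queries, threshold=0.5):
    """Greedy clustering: join the first group whose representative
    (its first member) is similar enough, else start a new group."""
    groups = []
    for query in queries:
        for group in groups:
            if jaccard(query, group[0]) >= threshold:
                group.append(query)
                break
        else:
            groups.append([query])
    return groups
```

The point of the grouping is that ranker outputs for near-identical sub-queries from different trajectories become comparable, so their rewards can be normalized against each other without sampling extra rollouts for each sub-query.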
cs.AI / 75 / 2604.17562
SafeAgent: A Runtime Protection Architecture for Agentic Systems
SafeAgent:一种用于智能系统的运行时保护架构
Abstract
Large language model (LLM) agents are vulnerable to prompt-injection attacks that propagate through multi-step workflows, tool interactions, and persistent context, making input-output filtering alone insufficient for reliable protection. This paper presents SafeAgent, a runtime security architecture that treats agent safety as a stateful decision problem over evolving interaction trajectories. The proposed design separates execution governance from semantic risk reasoning through two coordinated components: a runtime controller that mediates actions around the agent loop and a context-aware decision core that operates over persistent session state. The core is formalized as a context-aware advanced machine intelligence and instantiated through operators for risk encoding, utility-cost evaluation, consequence modeling, policy arbitration, and state synchronization. Experiments on Agent Security Bench (ASB) and InjecAgent show that SafeAgent consistently improves robustness over baseline and text-level guardrail methods while maintaining competitive benign-task performance. Ablation studies further show that recovery confidence and policy weighting determine distinct safety-utility operating points.
Chinese Translation
大型语言模型(LLM)代理容易受到通过多步骤工作流、工具交互和持久上下文传播的提示注入攻击,这使得仅依靠输入输出过滤无法提供可靠的保护。本文提出了SafeAgent,一种将代理安全视为在不断演变的交互轨迹上进行状态决策问题的运行时安全架构。所提出的设计通过两个协调组件将执行治理与语义风险推理分离:一个运行时控制器用于调解代理循环中的操作,以及一个上下文感知决策核心,基于持久会话状态进行操作。该核心被形式化为一个上下文感知的高级机器智能,并通过风险编码、效用-成本评估、后果建模、政策仲裁和状态同步的操作符进行实例化。在Agent Security Bench(ASB)和InjecAgent上的实验表明,SafeAgent在基线和文本级保护方法上始终提高了鲁棒性,同时保持了竞争性的良性任务性能。消融研究进一步表明,恢复信心和政策加权决定了不同的安全-效用操作点。
cs.AI / 76 / 2604.17573
Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier
超越静态快照:面向自主边界的语言模型评估框架
Abstract
We argue that current evaluation frameworks for large language models (LLMs) suffer from four systematic failures that make them structurally inadequate for assessing deployed, agentic systems: distributional invalidity (evaluation inputs do not reflect real interaction distributions), temporal invalidity (evaluations are post-hoc rather than training-integrated), scope invalidity (evaluations measure single-turn outputs rather than long-horizon trajectories), and process invalidity (evaluations assess outputs rather than reasoning). These failures compound critically in RLHF, where reward models are evaluated under conditions that do not hold during RL training, making reward hacking a predictable consequence of evaluation design rather than a training pathology. We propose the Grounded Continuous Evaluation (GCE) framework and present ISOPro, a simulation-based fine-tuning and evaluation system. ISOPro replaces the learned reward model with a deterministic ground-truth verifier, eliminating reward hacking by construction in verifiable-reward domains, and operates on LoRA adapter weights updatable on CPU, reducing the hardware barrier by an order of magnitude. We validate ISOPro on a resource-constrained scheduling domain with six difficulty tiers, demonstrating capability emergence visible only through continuous evaluation, an implicit curriculum that forms without researcher curation, and a 3x accuracy improvement over zero-shot baselines, all on consumer hardware with 0.216% trainable parameters.
Chinese Translation
我们认为,当前大型语言模型(LLMs)的评估框架存在四个系统性缺陷,使其在评估已部署的自主系统时结构上不够充分:分布无效性(评估输入未能反映真实交互分布)、时间无效性(评估是事后进行的,而非与训练相结合)、范围无效性(评估测量的是单轮输出而非长时间轨迹)以及过程无效性(评估关注输出而非推理)。这些缺陷在强化学习与人类反馈(RLHF)中尤为严重,因为奖励模型的评估是在强化学习训练期间不成立的条件下进行的,这使得奖励操控成为评估设计的可预测后果,而非训练病理。我们提出了基于实证的持续评估(Grounded Continuous Evaluation, GCE)框架,并展示了ISOPro,一个基于模拟的微调与评估系统。ISOPro用确定性的真实验证器替代了学习的奖励模型,在可验证奖励领域通过构造消除了奖励操控,并在CPU上可更新的LoRA适配器权重上运行,将硬件门槛降低了一个数量级。我们在一个资源受限的调度领域进行了ISOPro的验证,该领域具有六个难度等级,展示了只有通过持续评估才能显现的能力涌现、一个在没有研究者策划下形成的隐性课程,以及在消费者硬件上相较于零-shot基线提高3倍的准确率,所有这些都在仅有0.216%可训练参数的情况下实现。
cs.AI / 77 / 2604.17584
DIRCR: Dual-Inference Rule-Contrastive Reasoning for Solving RAVENs
DIRCR:双重推理规则对比推理用于解决RAVEN问题
Abstract
Abstract visual reasoning remains challenging as existing methods often prioritize either global context or local row-wise relations, failing to integrate both, and lack intermediate feature constraints, leading to incomplete rule capture and entangled representations. To address these issues, we propose the Dual-Inference Rule-Contrastive Reasoning (DIRCR) model. Its core component, the Dual-Inference Reasoning Module, combines a local path for row-wise analogical reasoning and a global path for holistic inference, integrated via a gated attention mechanism. Additionally, a Rule-Contrastive Learning Module introduces pseudo-labels to construct positive and negative rule samples, applying contrastive learning to enhance feature separability and promote abstract, transferable rule learning. Experimental results on three RAVEN datasets demonstrate that DIRCR significantly enhances reasoning robustness and generalization. Codes are available at https://github.com/csZack-Zhang/DIRCR.
Chinese Translation
抽象视觉推理仍然具有挑战性,因为现有方法往往优先考虑全局上下文或局部行间关系,未能将两者结合,并且缺乏中间特征约束,导致规则捕捉不完整和表示混淆。为了解决这些问题,我们提出了双重推理规则对比推理(DIRCR)模型。其核心组件,双重推理模块,结合了用于行间类比推理的局部路径和用于整体推理的全局路径,通过门控注意力机制进行整合。此外,规则对比学习模块引入伪标签以构建正负规则样本,应用对比学习以增强特征可分性并促进抽象、可迁移的规则学习。在三个RAVEN数据集上的实验结果表明,DIRCR显著增强了推理的鲁棒性和泛化能力。代码可在 https://github.com/csZack-Zhang/DIRCR 获取。
cs.AI / 78 / 2604.17614
Characterizing Model-Native Skills
特征化模型本土技能
Abstract
Skills are a natural unit for describing what a language model can do and how its behavior can be changed. However, existing characterizations rely on human-written taxonomies, textual descriptions, or manual profiling pipelines--all external hypotheses about what matters that need not align with the model's internal representations. We argue that when the goal is to intervene on model behavior, skill characterization should be *model-native*: grounded in the model's own representations rather than imposed through external ontologies. We instantiate this view by recovering a compact orthogonal basis from sequence-level activations. The resulting basis is semantically interpretable but need not correspond to any predefined human ontology; instead, it captures axes of behavioral variation that the model itself organizes around. We validate this characterization on reasoning post-training, using the recovered basis for both SFT data selection and inference-time steering. We develop lightweight proxy interventions to identify which directions are most useful for a given model. Across Llama3-8B and Qwen2.5-3B, selecting data along those directions improves Pass@1 by up to 20% on MATH and 41% on AMC, outperforming data selection based on human-characterized skills. Because the basis lives in activation space, the same directions also serve as steering vectors at inference time, improving Pass@8 by up to 4.8% on MATH--an intervention that human-characterized skills cannot support. We further validate the characterization on safety alignment, where selecting adversarial training data for model-native skill coverage rather than textual diversity yields more sample-efficient learning. These results suggest that recovering skills from the model's own representations, rather than imposing them externally, provides a more effective foundation for intervening on model behavior. Codes are open-sourced.
Chinese Translation
技能是描述语言模型能够做什么以及如何改变其行为的自然单位。然而,现有的特征化依赖于人工编写的分类法、文本描述或手动分析流程——这些都是关于重要性的外部假设,未必与模型的内部表示对齐。我们认为,当目标是干预模型行为时,技能特征化应该是*模型本土的*:基于模型自身的表示,而不是通过外部本体强加的。我们通过从序列级激活中恢复一个紧凑的正交基来实现这一观点。得到的基在语义上是可解释的,但不必对应于任何预定义的人类本体;相反,它捕捉了模型自身围绕的行为变化轴。我们在推理后训练中验证了这种特征化,使用恢复的基进行SFT数据选择和推理时引导。我们开发了轻量级的代理干预,以识别哪些方向对给定模型最有用。在Llama3-8B和Qwen2.5-3B上,沿这些方向选择数据使得MATH的Pass@1提高了最多20%,AMC提高了41%,超越了基于人类特征化技能的数据选择。由于该基存在于激活空间中,相同的方向也作为推理时的引导向量,MATH的Pass@8提高了最多4.8%——这一干预是人类特征化技能无法支持的。我们进一步在安全对齐上验证了这种特征化,选择针对模型本土技能覆盖的对抗性训练数据,而不是文本多样性,产生了更高样本效率的学习。这些结果表明,从模型自身的表示中恢复技能,而不是外部强加,提供了干预模型行为的更有效基础。代码已开源。
cs.AI / 79 / 2604.17621
KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models
KnowledgeBerg:评估大型语言模型的系统知识覆盖和组合推理能力
Abstract
Many real-world questions appear deceptively simple yet implicitly demand two capabilities: (i) systematic coverage of a bounded knowledge universe and (ii) compositional set-based reasoning over that universe, a phenomenon we term "the tip of the iceberg." We formalize this challenge through two orthogonal dimensions: knowledge width, the cardinality of the required universe, and reasoning depth, the number of compositional set operations. We introduce KnowledgeBerg, a benchmark of 4,800 multiple-choice questions derived from 1,183 enumeration seeds spanning 10 domains and 17 languages, with universes grounded in authoritative sources to ensure reproducibility. Representative open-source LLMs demonstrate severe limitations, achieving only 5.26-36.88 F1 on universe enumeration and 16.00-44.19 accuracy on knowledge-grounded reasoning. Diagnostic analyses reveal three stages of failure: completeness, or missing knowledge; awareness, or failure to identify requirements; and application, or incorrect reasoning execution. This pattern persists across languages and model scales. Although test-time compute and retrieval augmentation yield measurable gains -- up to 4.35 and 3.78 points, respectively -- substantial gaps remain, exposing limitations in how current LLMs organize structured knowledge and execute compositional reasoning over bounded domains. The dataset is available at https://huggingface.co/datasets/2npc/KnowledgeBerg
Chinese Translation
许多现实世界的问题看似简单,但实际上隐含着两种能力的需求:(i)对有限知识宇宙的系统覆盖,以及(ii)对该宇宙的组合集合推理,这种现象我们称之为“冰山一角”。我们通过两个正交维度来形式化这一挑战:知识宽度,即所需宇宙的基数,以及推理深度,即组合集合操作的数量。我们引入了KnowledgeBerg,这是一个包含4,800个多项选择题的基准,这些问题源自1,183个枚举种子,涵盖10个领域和17种语言,宇宙基于权威来源以确保可重复性。代表性的开源大型语言模型(LLMs)显示出严重的局限性,在宇宙枚举上仅达到5.26-36.88的F1分数,在知识基础推理上准确率仅为16.00-44.19。诊断分析揭示了三个失败阶段:完整性,即缺失知识;意识,即未能识别需求;以及应用,即推理执行错误。这个模式在不同语言和模型规模中持续存在。尽管测试时计算和检索增强带来了可测量的提升——分别高达4.35和3.78分——但仍然存在显著差距,暴露了当前大型语言模型在组织结构化知识和在有限领域内执行组合推理方面的局限性。该数据集可在 https://huggingface.co/datasets/2npc/KnowledgeBerg 获取。
cs.AI / 80 / 2604.17626
Toward Reusability of AI Models Using Dynamic Updates of AI Documentation
通过动态更新人工智能文档实现人工智能模型的可重用性
Abstract
This work addresses the challenge of disseminating reusable artificial intelligence (AI) models accompanied by AI documentation (a.k.a. AI model cards). The work is motivated by the large number of trained AI models that are not reusable due to (a) the lack of AI documentation and (b) the temporal lag between rapidly changing requirements on AI model reusability and those specified in various AI model cards. Our objectives are to shorten the lag time in updating AI model card templates and align AI documentation more closely with current AI best practices. Our approach introduces a methodology for delivering agile, data-driven, and community-based AI model cards. We use the Hugging Face (HF) repository of AI models, populated by a subset of the AI research and development community, and the AI consortium-based Zero Draft (ZD) templates for the AI documentation of AI datasets and AI models, as our test datasets. We also address questions about the value of AI documentation for AI reusability. Our work quantifies the correlations between AI model downloads/likes (i.e., AI model reuse metrics) from the HF repository and their documentation alignment with the ZD documentation templates using tables of contents and word statistics (i.e., AI documentation quality metrics). Furthermore, our work develops the infrastructure to regularly compare AI documentation templates against community-standard practices derived from millions of uploaded AI models in the Hugging Face repository. The impact of our work lies in introducing a methodology for delivering agile, data-driven, and community-based standards for documenting AI models and improving AI model reuse.
Chinese Translation
本研究解决了伴随人工智能(AI)文档(即,AI模型卡)传播可重用人工智能模型的挑战。研究的动机在于大量训练好的人工智能模型由于缺乏(a)AI文档和(b)快速变化的AI模型可重用性要求与各种AI模型卡中规定的要求之间存在时间滞后而无法重用。我们的目标是缩短更新AI模型卡模板的滞后时间,并使AI文档与当前的AI最佳实践更加紧密对齐。我们的方法引入了一种提供敏捷、数据驱动和基于社区的AI模型卡的方法论。我们使用Hugging Face(HF)人工智能模型库,该库由人工智能研究和开发社区的一个子集填充,以及基于AI联盟的零草稿(Zero Draft,ZD)模板,用于AI数据集和AI模型的AI文档,作为我们的测试数据集。我们还探讨了AI文档对AI可重用性的价值问题。我们的研究量化了HF库中AI模型下载/点赞(即,AI模型重用指标)与其文档与ZD文档模板的对齐之间的相关性,使用目录和词汇统计(即,AI文档质量指标)。此外,我们的研究开发了基础设施,以定期将AI文档模板与来自Hugging Face库中数百万个上传的AI模型所衍生的社区标准实践进行比较。我们工作的影响在于引入了一种提供敏捷、数据驱动和基于社区的AI模型文档标准的方法论,从而改善AI模型的重用。
cs.AI / 81 / 2604.17653
PV-SQL: Synergizing Database Probing and Rule-based Verification for Text-to-SQL Agents
PV-SQL:协同数据库探测与基于规则的验证用于文本到SQL代理
Abstract
Text-to-SQL systems often struggle with deep contextual understanding, particularly for complex queries with subtle requirements. We present PV-SQL, an agentic framework that addresses these failures through two complementary components: Probe and Verify. The Probe component iteratively generates probing queries to retrieve concrete records from the database, resolving ambiguities in value formats, column semantics, and inter-table relationships to build richer contextual understanding. The Verify component employs a rule-based method to extract verifiable conditions and construct an executable checklist, enabling iterative SQL refinement that effectively reduces missing constraints. Experiments on the BIRD benchmarks show that PV-SQL outperforms the best text-to-SQL baseline by 5% in execution accuracy and 20.8% in valid efficiency score while consuming fewer tokens.
Chinese Translation
文本到SQL系统通常在深层次的上下文理解方面存在困难,尤其是在处理具有微妙要求的复杂查询时。我们提出了PV-SQL,一个代理框架,通过两个互补的组件来解决这些问题:探测(Probe)和验证(Verify)。探测组件迭代生成探测查询,从数据库中检索具体记录,解决值格式、列语义和表间关系的模糊性,以构建更丰富的上下文理解。验证组件采用基于规则的方法提取可验证条件并构建可执行的检查清单,使得SQL迭代优化能够有效减少缺失约束。在BIRD基准测试中的实验表明,PV-SQL在执行准确性上比最佳文本到SQL基线提高了5%,在有效性评分上提高了20.8%,同时消耗的标记更少。
cs.AI / 82 / 2604.17654
Poly-EPO: Training Exploratory Reasoning Models
Poly-EPO:训练探索性推理模型
Abstract
Exploration is a cornerstone of learning from experience: it enables agents to find solutions to complex problems, generalize to novel ones, and scale performance with test-time compute. In this paper, we present a framework for post-training language models (LMs) that explicitly encourages optimistic exploration and promotes a synergy between exploration and exploitation. The central idea is to train the LM to generate sets of responses that are collectively accurate under the reward function and exploratory in their reasoning strategies. We first develop a general recipe for optimizing LMs with set reinforcement learning (set RL) under arbitrary objective functions, showing how standard RL algorithms can be adapted to this setting through a modification to the advantage computation. We then propose Polychromic Exploratory Policy Optimization (Poly-EPO), which instantiates this framework with an objective that explicitly synergizes exploration and exploitation. Across a range of reasoning benchmarks, we show that Poly-EPO improves generalization, as evidenced by higher pass@$k$ coverage, preserves greater diversity in model generations, and effectively scales with test-time compute.
Chinese Translation
探索是从经验中学习的基石:它使智能体能够找到复杂问题的解决方案,推广到新问题,并在测试时计算中提升性能。在本文中,我们提出了一种针对后训练语言模型(LM)的框架,该框架明确鼓励乐观探索,并促进探索与利用之间的协同。核心思想是训练语言模型生成在奖励函数下集体准确且推理策略上具有探索性的响应集合。我们首先开发了一种通用方法,用于在任意目标函数下通过集合强化学习(set RL)优化语言模型,展示了如何通过对优势计算的修改将标准强化学习算法适应于这一设置。然后,我们提出了多色探索策略优化(Polychromic Exploratory Policy Optimization,Poly-EPO),该方法通过一个明确协同探索与利用的目标来实现这一框架。在一系列推理基准测试中,我们展示了Poly-EPO改善了泛化能力,体现在更高的通过率@$k$覆盖率,保持了模型生成的更大多样性,并有效地与测试时计算规模相匹配。
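The pass@$k$ metric mentioned in the abstract is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The sketch below shows that estimator; whether Poly-EPO's evaluation uses exactly this form is an assumption, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k
    # subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(counts, k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)


if __name__ == "__main__":
    # Two problems, 10 samples each, with 3 and 0 correct samples.
    print(mean_pass_at_k([(10, 3), (10, 0)], k=5))
```

A policy that keeps its generations diverse tends to raise this number at larger k, since a single correct sample among k attempts is enough to count as a pass.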
cs.AI / 83 / 2604.17677
Semantic Entanglement in Vector-Based Retrieval: A Formal Framework and Context-Conditioned Disentanglement Pipeline for Agentic RAG Systems
基于向量检索的语义纠缠:代理RAG系统的形式框架与上下文条件解缠管道
Abstract
Retrieval-Augmented Generation (RAG) systems depend on the geometric properties of vector representations to retrieve contextually appropriate evidence. When source documents interleave multiple topics within contiguous text, standard vectorization produces embedding spaces in which semantically distinct content occupies overlapping neighborhoods. We term this condition semantic entanglement. We formalize entanglement as a model-relative measure of cross-topic overlap in embedding space and define an Entanglement Index (EI) as a quantitative proxy. We argue that higher EI constrains attainable Top-K retrieval precision under cosine similarity retrieval. To address this, we introduce the Semantic Disentanglement Pipeline (SDP), a four-stage preprocessing framework that restructures documents prior to embedding. We further propose context-conditioned preprocessing, in which document structure is shaped by patterns of operational use, and a continuous feedback mechanism that adapts document structure based on agent performance. We evaluate SDP on a real-world enterprise healthcare knowledge base comprising over 2,000 documents across approximately 25 sub-domains. Top-K retrieval precision improves from approximately 32% under fixed-token chunking to approximately 82% under SDP, while mean EI decreases from 0.71 to 0.14. We do not claim that entanglement fully explains RAG failure, but that it captures a distinct preprocessing failure mode that downstream optimization cannot reliably correct once encoded into the vector space.
Chinese Translation
检索增强生成(RAG)系统依赖于向量表示的几何特性,以检索上下文适当的证据。当源文档在连续文本中交织多个主题时,标准向量化会产生嵌入空间,其中语义上不同的内容占据重叠的邻域。我们将这种情况称为语义纠缠。我们将纠缠形式化为嵌入空间中跨主题重叠的模型相对度量,并定义纠缠指数(Entanglement Index, EI)作为定量代理。我们认为,较高的EI限制了在余弦相似性检索下可达到的Top-K检索精度。为了解决这个问题,我们引入了语义解缠管道(Semantic Disentanglement Pipeline, SDP),这是一个四阶段的预处理框架,在嵌入之前重构文档。我们进一步提出了上下文条件预处理,其中文档结构由操作使用模式塑造,以及一个基于代理性能调整文档结构的持续反馈机制。我们在一个包含超过2000个文档、约25个子领域的真实企业医疗知识库上评估了SDP。Top-K检索精度从固定标记分块下的约32%提高到SDP下的约82%,而平均EI从0.71降至0.14。我们并不声称纠缠完全解释了RAG的失败,但它捕捉到了一种独特的预处理失败模式,而下游优化在编码到向量空间后无法可靠地纠正。
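The abstract does not give the formula for the Entanglement Index, so the sketch below uses a hypothetical stand-in: the average fraction of each chunk's k nearest neighbours (by cosine similarity) that carry a different topic label. The function names, the neighbour count `k`, and the availability of topic labels are all assumptions made for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def entanglement_proxy(embeddings, topics, k=3):
    """Hypothetical EI proxy: mean share of cross-topic nearest neighbours."""
    n = len(embeddings)
    if n != len(topics) or not 1 <= k < n:
        raise ValueError("need one topic per embedding and 1 <= k < n")
    total = 0.0
    for i in range(n):
        # Rank every other chunk by similarity to chunk i, most similar first.
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: cosine(embeddings[i], embeddings[j]),
            reverse=True,
        )
        total += sum(topics[j] != topics[i] for j in ranked[:k]) / k
    return total / n


if __name__ == "__main__":
    vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(entanglement_proxy(vecs, ["billing", "billing", "clinical", "clinical"], k=1))
```

Under this proxy, a value near 0 means topics occupy separate neighbourhoods and a value near 1 means they overlap heavily, which matches the direction of the EI change reported in the abstract but not necessarily its scale.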
cs.AI / 84 / 2604.17696
Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play
Stratagem:通过轨迹调制的游戏自我对弈学习可迁移推理
Abstract
Games offer a compelling paradigm for developing general reasoning capabilities in language models, as they naturally demand strategic planning, probabilistic inference, and adaptive decision-making. However, existing self-play approaches rely solely on terminal game outcomes, providing no mechanism to distinguish transferable reasoning patterns from game-specific heuristics. We present STRATAGEM, which addresses two fundamental barriers to reasoning transfer: domain specificity, where learned patterns remain anchored in game semantics, and contextual stasis, where static game contexts fail to cultivate progressive reasoning. STRATAGEM selectively reinforces trajectories exhibiting abstract, domain-agnostic reasoning through a Reasoning Transferability Coefficient, while incentivizing adaptive reasoning development via a Reasoning Evolution Reward. Experiments across mathematical reasoning, general reasoning, and code generation benchmarks demonstrate substantial improvements, with particularly strong gains on competition-level mathematics where multi-step reasoning is critical. Ablation studies and human evaluation confirm that both components contribute to transferable reasoning.
Chinese Translation
游戏为语言模型的发展提供了一个引人注目的范式,以培养通用推理能力,因为它们自然要求战略规划、概率推理和适应性决策。然而,现有的自我对弈方法仅依赖于终局游戏结果,未能提供区分可转移推理模式与特定游戏启发式的方法。我们提出了STRATAGEM,旨在解决推理转移的两个基本障碍:领域特异性,即学习到的模式仍然固守于游戏语义,以及上下文静态性,即静态游戏上下文未能培养渐进式推理。STRATAGEM通过推理可转移系数(Reasoning Transferability Coefficient)选择性地强化展示抽象、领域无关推理的轨迹,同时通过推理演变奖励(Reasoning Evolution Reward)激励适应性推理的发展。在数学推理、通用推理和代码生成基准测试中的实验显示了显著的改进,尤其是在多步骤推理至关重要的竞争级数学中取得了特别强劲的进展。消融研究和人工评估确认了两个组件对可转移推理的贡献。
cs.AI / 85 / 2604.17708
Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization
共同进化的智能体架构与可解释推理在自动化优化中的应用
Abstract
Automating operations research (OR) with large language models (LLMs) remains limited by hand-crafted reasoning--execution workflows. Complex OR tasks require adaptive coordination among problem interpretation, mathematical formulation, solver selection, code generation, and iterative debugging. To address this limitation, we propose EvoOR-Agent, a co-evolutionary framework for automated optimization. The framework represents agent workflows as activity-on-edge (AOE)-style networks, making workflow topology, execution dependencies, and alternative reasoning paths explicit. On this representation, the framework maintains an architecture graph and evolves a population of reasoning individuals through graph-mediated path-conditioned recombination, multi-granularity semantic mutation, and elitist population update. A knowledge-base-assisted experience-acquisition module further injects reusable OR practices into initialization and semantic variation. Empirical results on heterogeneous OR benchmarks show that the proposed framework consistently improves over zero-shot LLMs, fixed-pipeline OR agents, and representative evolutionary agent frameworks. Case studies and ablation analyses further indicate that explicit architecture evolution and graph-supported reasoning-trajectory search contribute to both performance improvement and structural interpretability. These results suggest that treating agent architectures and reasoning trajectories as evolvable objects provides an effective route toward adaptive and interpretable automated optimization.
Chinese Translation
利用大型语言模型(LLMs)自动化运筹学(OR)仍然受到手工设计的推理-执行工作流的限制。复杂的运筹学任务需要在问题解释、数学模型构建、求解器选择、代码生成和迭代调试之间进行自适应协调。为了解决这一限制,我们提出了EvoOR-Agent,一个用于自动化优化的共同进化框架。该框架将智能体工作流表示为活动在边缘(AOE)风格的网络,使工作流拓扑、执行依赖关系和替代推理路径变得明确。在这一表示上,框架维护一个架构图,并通过图介导的路径条件重组、多粒度语义变异和精英种群更新来进化推理个体的种群。一个知识库辅助的经验获取模块进一步将可重用的运筹学实践注入初始化和语义变异中。在异构运筹学基准上的实证结果表明,所提出的框架在零-shot LLMs、固定管道运筹学智能体和代表性的进化智能体框架上均表现出一致的提升。案例研究和消融分析进一步表明,显式的架构进化和图支持的推理轨迹搜索有助于性能提升和结构可解释性。这些结果表明,将智能体架构和推理轨迹视为可进化对象,为自适应和可解释的自动化优化提供了一条有效的途径。
cs.AI / 86 / 2604.17753
Evolutionary Negative Module Pruning for Better LoRA Merging
进化负模块剪枝以优化LoRA合并
Abstract
Merging multiple Low-Rank Adaptation (LoRA) experts into a single backbone is a promising approach for efficient multi-task deployment. While existing methods strive to alleviate interference via weight interpolation or subspace alignment, they rest upon the implicit assumption that all LoRA matrices contribute constructively to the merged model. In this paper, we uncover a critical bottleneck in current merging paradigms: the existence of $\textit{negative modules}$ -- specific LoRA layers that inherently degrade global performance upon merging. We propose $\textbf{E}$volutionary $\textbf{N}$egative $\textbf{M}$odule $\textbf{P}$runing ($\textbf{ENMP}$), a plug-and-play LoRA pruning method to locate and exclude these detrimental modules prior to merging. By leveraging an evolutionary search strategy, ENMP effectively navigates the discrete, non-differentiable landscape of module selection to identify optimal pruning configurations. Extensive evaluations demonstrate that ENMP consistently boosts the performance of existing merging algorithms, achieving a new state-of-the-art across both language and vision domains. Code is available at https://github.com/CaoAnda/ENMP-LoRAMerging.
Chinese Translation
将多个低秩适配(Low-Rank Adaptation, LoRA)专家合并为单一主干是实现高效多任务部署的一种有前景的方法。尽管现有方法努力通过权重插值或子空间对齐来减轻干扰,但它们基于一个隐含假设,即所有LoRA矩阵都对合并模型产生建设性贡献。在本文中,我们揭示了当前合并范式中的一个关键瓶颈:存在$\textit{负模块}$——特定的LoRA层在合并时会固有地降低全局性能。我们提出了$\textbf{E}$volutionary $\textbf{N}$egative $\textbf{M}$odule $\textbf{P}$runing($\textbf{ENMP}$),这是一种即插即用的LoRA剪枝方法,用于在合并之前定位并排除这些有害模块。通过利用进化搜索策略,ENMP有效地在离散的、不可微分的模块选择空间中导航,以识别最佳剪枝配置。广泛的评估表明,ENMP始终提升现有合并算法的性能,在语言和视觉领域都实现了新的最先进水平。代码可在https://github.com/CaoAnda/ENMP-LoRAMerging获取。
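An evolutionary search over binary keep/prune masks, the kind of discrete and non-differentiable selection the abstract describes, can be sketched as below. This is a generic elitist genetic loop, not ENMP itself: the function name, the uniform crossover, the bit-flip mutation, and every hyperparameter are assumptions, and `fitness` stands in for evaluating a merged model.

```python
import random


def evolve_mask(num_modules, fitness, generations=30, population=12,
                mutation_rate=0.1, seed=0):
    """Search for a 0/1 mask over modules that maximizes `fitness` (hypothetical)."""
    rng = random.Random(seed)
    # Seed the population with the "keep everything" mask plus random masks.
    pop = [[1] * num_modules] + [
        [rng.randint(0, 1) for _ in range(num_modules)]
        for _ in range(population - 1)
    ]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: population // 2]  # elitism: the best masks always survive
        children = []
        while len(parents) + len(children) < population:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)


if __name__ == "__main__":
    # Toy stand-in for merged-model quality: negative entries hurt when kept.
    contribution = [0.5, -0.4, 0.3, -0.2, 0.1]
    score = lambda mask: sum(c for c, g in zip(contribution, mask) if g)
    print(evolve_mask(len(contribution), score))
```

Because the all-ones mask is in the initial population and the top half is carried over each generation, the returned mask is never worse than merging without pruning under the chosen fitness.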
cs.AI / 87 / 2604.17761
Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
现实环境中的对比归因:对大型语言模型在现实基准上失败的可解释性分析
Abstract
Interpretability tools are increasingly used to analyze failures of Large Language Models (LLMs), yet prior work largely focuses on short prompts or toy settings, leaving their behavior on commonly used benchmarks underexplored. To address this gap, we study contrastive, LRP-based attribution as a practical tool for analyzing LLM failures in realistic settings. We formulate failure analysis as \textit{contrastive attribution}, attributing the logit difference between an incorrect output token and a correct alternative to input tokens and internal model states, and introduce an efficient extension that enables construction of cross-layer attribution graphs for long-context inputs. Using this framework, we conduct a systematic empirical study across benchmarks, comparing attribution patterns across datasets, model sizes, and training checkpoints. Our results show that this token-level contrastive attribution can yield informative signals in some failure cases, but is not universally applicable, highlighting both its utility and its limitations for realistic LLM failure analysis. Our code is available at: https://aka.ms/Debug-XAI.
Chinese Translation
可解释性工具越来越多地被用于分析大型语言模型(LLMs)的失败,然而以往的研究主要集中在短提示或玩具设置上,导致对其在常用基准上的表现探索不足。为了解决这一问题,我们研究了基于对比的LRP(Layer-wise Relevance Propagation)归因作为分析LLM在现实环境中失败的实用工具。我们将失败分析形式化为\textit{对比归因},将不正确输出标记与正确替代标记之间的logit差异归因于输入标记和内部模型状态,并引入了一种高效的扩展方法,使得能够为长上下文输入构建跨层归因图。利用这一框架,我们在多个基准上进行了系统的实证研究,比较了不同数据集、模型规模和训练检查点下的归因模式。我们的结果表明,这种基于标记的对比归因在某些失败案例中可以产生有价值的信号,但并非普遍适用,突显了其在现实LLM失败分析中的实用性和局限性。我们的代码可在以下链接获取: https://aka.ms/Debug-XAI。
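The idea of attributing a logit difference can be illustrated on the simplest possible case: a single bias-free linear readout, where input-times-weight contributions sum exactly to the difference between the incorrect and correct logits. This toy is an assumption-laden simplification, not the paper's LRP-based cross-layer method; the names and the dictionary layout of `weights` are hypothetical.

```python
def logit(hidden, weights, token):
    """Logit of `token` under a bias-free linear readout."""
    return sum(w * h for w, h in zip(weights[token], hidden))


def contrastive_attribution(hidden, weights, wrong, correct):
    """Per-feature contribution to logit(wrong) - logit(correct)."""
    return [
        (w_wrong - w_correct) * h
        for w_wrong, w_correct, h in zip(weights[wrong], weights[correct], hidden)
    ]


if __name__ == "__main__":
    hidden = [1.0, 2.0, 1.0]
    weights = {"cat": [0.5, 0.1, 0.0], "dog": [0.2, 0.4, 0.3]}
    # Positive scores push toward the wrong token "dog", negative toward "cat".
    print(contrastive_attribution(hidden, weights, wrong="dog", correct="cat"))
```

In a deep model the same difference has to be propagated back through nonlinear layers, which is where a propagation rule such as LRP and the abstract's cross-layer attribution graphs come in.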
cs.AI / 88 / 2604.17768
When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias
当视觉-语言模型在未见的情况下进行判断:揭示信息性偏差
Abstract
The reliability of VLM-as-a-Judge is critical for the automatic evaluation of vision-language models (VLMs). Despite recent progress, our analysis reveals that VLM-as-a-Judge often pays limited attention to the image when making decisions. Instead, it often blindly favors the more informative answer, even when it can recognize that the answer conflicts with the image content. We call this problem informativeness bias, which significantly undermines judge reliability. To address it, we propose BIRCH (Balanced Informativeness and CoRrectness with a Truthful AnCHor), a judging paradigm that first corrects inconsistencies with the image content in candidate answers, and then compares the answers against this corrected version. This shifts the judge's focus from informativeness to image-grounded correctness. Experiments on multiple models and benchmarks show that BIRCH reduces informativeness bias by up to 17%, resulting in performance gains of up to 9.8%. Our work reveals an overlooked but fundamental flaw in current VLM-as-a-Judge systems and highlights the need for more principled designs.
Chinese Translation
视觉-语言模型(VLM)作为评判者的可靠性对于自动评估视觉-语言模型至关重要。尽管近期取得了一些进展,我们的分析表明,VLM作为评判者在做出决策时往往对图像关注有限。相反,它们常常盲目偏向于更具信息量的答案,即使它们能够识别出该答案与图像内容存在冲突。我们将这个问题称为信息性偏差,这显著削弱了评判的可靠性。为了解决这一问题,我们提出了BIRCH(Balanced Informativeness and CoRrectness with a Truthful AnCHor),一种评判范式,首先纠正候选答案与图像内容之间的不一致,然后将答案与这一修正版本进行比较。这将评判者的关注点从信息性转向基于图像的正确性。在多个模型和基准上的实验表明,BIRCH将信息性偏差降低了多达17%,从而使性能提升高达9.8%。我们的工作揭示了当前VLM作为评判者系统中一个被忽视但根本性的缺陷,并强调了更具原则性设计的必要性。
cs.AI / 89 / 2604.17774
Prompt Optimization Enables Stable Algorithmic Collusion in LLM Agents
提示优化促进大型语言模型代理的稳定算法合谋
Abstract
LLM agents in markets present algorithmic collusion risks. While prior work shows LLM agents reach supracompetitive prices through tacit coordination, existing research focuses on hand-crafted prompts. The emerging paradigm of prompt optimization necessitates new methodologies for understanding autonomous agent behavior. We investigate whether prompt optimization leads to emergent collusive behaviors in market simulations. We propose a meta-learning loop where LLM agents participate in duopoly markets and an LLM meta-optimizer iteratively refines shared strategic guidance. Our experiments reveal that meta-prompt optimization enables agents to discover stable tacit collusion strategies with substantially improved coordination quality compared to baseline agents. These behaviors generalize to held-out test markets, indicating discovery of general coordination principles. Analysis of evolved prompts reveals systematic coordination mechanisms through stable shared strategies. Our findings call for further investigation into AI safety implications in autonomous multi-agent systems.
Chinese Translation
市场中的大型语言模型(LLM)代理存在算法合谋的风险。虽然先前的研究表明,LLM 代理通过默契协调达成超竞争性价格,但现有研究主要集中在手工设计的提示上。新兴的提示优化范式需要新的方法论来理解自主代理的行为。我们研究提示优化是否会在市场模拟中导致新兴的合谋行为。我们提出一个元学习循环,其中 LLM 代理参与双头垄断市场,而 LLM 元优化器则迭代地完善共享的战略指导。我们的实验表明,元提示优化使代理能够发现稳定的默契合谋策略,其协调质量显著优于基线代理。这些行为在保留的测试市场中具有普遍性,表明发现了一般协调原则。对演化提示的分析揭示了通过稳定的共享策略实现的系统协调机制。我们的发现呼吁对自主多代理系统中的人工智能安全影响进行进一步研究。
cs.AI / 90 / 2604.17803
Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition
对抗竞技场:通过互动竞争众包数据生成
Abstract
Post-training Large Language Models requires diverse, high-quality data, which is rare and costly to obtain, especially in low-resource domains and for multi-turn conversations. Common solutions are crowdsourcing or synthetic generation, but both often yield low-quality or low-diversity data. We introduce Adversarial Arena for building high-quality conversational datasets by framing data generation as an adversarial task: attackers create prompts, and defenders generate responses. This interactive competition between multiple teams naturally produces diverse and complex data. We validated this approach by conducting a competition with 10 academic teams from top US and European universities, each building attacker or defender bots. The competition, focused on safety alignment of LLMs in cybersecurity, generated 19,683 multi-turn conversations. Fine-tuning an open-source model on this dataset produced an 18.47% improvement in secure code generation on CyberSecEval-Instruct and a 29.42% improvement on CyberSecEval-MITRE.
Chinese Translation
后训练的大型语言模型需要多样化的高质量数据,而这在低资源领域和多轮对话中尤其稀缺且成本高昂。常见的解决方案是众包或合成生成,但这两者通常会产生低质量或低多样性的数据。我们提出了对抗竞技场,通过将数据生成框架化为对抗任务来构建高质量的对话数据集:攻击者创建提示,防御者生成响应。这种多个团队之间的互动竞争自然产生了多样且复杂的数据。我们通过与来自美国和欧洲顶尖大学的10个学术团队进行竞争来验证这种方法,每个团队构建攻击者或防御者机器人。此次竞争聚焦于网络安全中大型语言模型的安全对齐,生成了19,683个多轮对话。在该数据集上微调一个开源模型,使得在CyberSecEval-Instruct上的安全代码生成提高了18.47%,在CyberSecEval-MITRE上的提高达29.42%。
cs.AI / 91 / 2604.17821
WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent
WebUncertainty:基于双层不确定性的自主网络代理规划与推理
Abstract
Recent advancements in large language models (LLMs) have empowered autonomous web agents to execute natural language instructions directly on real-world webpages. However, existing agents often struggle with complex tasks involving dynamic interactions and long-horizon execution due to rigid planning strategies and hallucination-prone reasoning. To address these limitations, we propose WebUncertainty, a novel autonomous agent framework designed to tackle dual-level uncertainty in planning and reasoning. Specifically, we design a Task Uncertainty-Driven Adaptive Planning Mechanism that adaptively selects planning modes to navigate unknown environments. Furthermore, we introduce an Action Uncertainty-Driven Monte Carlo tree search (MCTS) Reasoning Mechanism. This mechanism incorporates the Confidence-induced Action Uncertainty (ConActU) strategy to quantify both aleatoric uncertainty (AU) and epistemic uncertainty (EU), thereby optimizing the search process and guiding robust decision-making. Experimental results on the WebArena and WebVoyager benchmarks demonstrate that WebUncertainty achieves superior performance compared to state-of-the-art baselines.
Chinese Translation
近年来,大型语言模型(LLMs)的进步使得自主网络代理能够直接在真实网页上执行自然语言指令。然而,现有代理在处理涉及动态交互和长时间执行的复杂任务时,常常由于僵化的规划策略和容易产生幻觉的推理而面临困难。为了解决这些局限性,我们提出了WebUncertainty,一个新颖的自主代理框架,旨在应对规划和推理中的双层不确定性。具体而言,我们设计了一种任务不确定性驱动的自适应规划机制,该机制能够自适应地选择规划模式以应对未知环境。此外,我们引入了一种基于行动不确定性的蒙特卡洛树搜索(MCTS)推理机制。该机制结合了基于置信度的行动不确定性(ConActU)策略,以量化随机不确定性(AU)和认知不确定性(EU),从而优化搜索过程并指导稳健的决策。WebArena和WebVoyager基准上的实验结果表明,WebUncertainty的性能优于现有的最先进基线。
cs.AI / 92 / 2604.17837
Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs
多义专家,单义路径:路由作为 MoEs 中的控制
Abstract
An LLM's residual stream is both state and instruction: it encodes the current context and determines the next transformation. We introduce a parameter-free decomposition for Mixture-of-Experts models that splits each layer's hidden state into a control signal that causally drives routing and an orthogonal content channel invisible to the router. Across six MoE architectures, we find that models preserve surface-level features (language, token identity, position) in the content channel, while the control signal encodes an abstract function that rotates from layer to layer. Because each routing decision is low-bandwidth, this hand-off forces compositional specialization across layers. While individual experts remain polysemantic, expert paths become monosemantic, clustering tokens by semantic function across languages and surface forms. The same token (e.g., ":") follows distinct trajectories depending on whether it serves as a type annotation, an introductory colon, or a time separator. Our decomposition identifies the source of this structure: clusters in the control subspace are substantially more monosemantic than those in the full representation. As a result, the natural unit of interpretability in MoEs is not the expert but the trajectory.
Chinese Translation
大型语言模型(LLM)的残差流既是状态也是指令:它编码当前上下文并决定下一个变换。我们提出了一种无参数的 Mixture-of-Experts(MoE)模型分解方法,将每层的隐藏状态分为一个因果驱动路由的控制信号和一个对路由器不可见的正交内容通道。在六种 MoE 架构中,我们发现模型在内容通道中保留了表层特征(语言、标记身份、位置),而控制信号编码了一个在层与层之间旋转的抽象函数。由于每个路由决策带宽较低,这种交接迫使层之间的组合专业化。尽管个别专家仍然是多义的,但专家路径变得单义,按语义功能在不同语言和表面形式中对标记进行聚类。同一个标记(例如,':')的轨迹取决于它作为类型注释、引导冒号或时间分隔符的作用。我们的分解识别了这种结构的来源:控制子空间中的聚类显著比完整表示中的聚类更具单义性。因此,在 MoEs 中,自然的可解释性单位不是专家,而是轨迹。
cs.AI / 93 / 2604.17849
On the Reliability of Computer Use Agents
计算机使用代理的可靠性研究
Abstract
Computer-use agents have rapidly improved on real-world tasks such as web navigation, desktop automation, and software interaction, in some cases surpassing human performance. Yet even when the task and model are unchanged, an agent that succeeds once may fail on a repeated execution of the same task. This raises a fundamental question: if an agent can succeed at a task once, what prevents it from doing so reliably? In this work, we study the sources of unreliability in computer-use agents through three factors: stochasticity during execution, ambiguity in task specification, and variability in agent behavior. We analyze these factors on OSWorld using repeated executions of the same task together with paired statistical tests that capture task-level changes across settings. Our analysis shows that reliability depends on both how tasks are specified and how agent behavior varies across executions. These findings suggest the need to evaluate agents under repeated execution, to allow agents to resolve task ambiguity through interaction, and to favor strategies that remain stable across runs.
Chinese Translation
计算机使用代理在网页导航、桌面自动化和软件交互等现实任务中迅速取得了进展,在某些情况下甚至超越了人类表现。然而,即使任务和模型保持不变,成功执行一次的代理在重复执行相同任务时可能会失败。这引发了一个基本问题:如果一个代理能够在某个任务中成功,是什么阻止它可靠地完成该任务?在本研究中,我们通过三个因素研究计算机使用代理的不可靠性来源:执行过程中的随机性、任务规范中的模糊性,以及代理行为的变异性。我们在 OSWorld 上分析了这些因素,使用相同任务的重复执行以及配对统计测试来捕捉不同设置下的任务级变化。我们的分析表明,可靠性不仅依赖于任务的规范方式,还依赖于代理行为在执行过程中的变化。这些发现表明,有必要在重复执行的情况下评估代理,允许代理通过交互解决任务模糊性,并倾向于在多次运行中保持稳定的策略。
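As an illustration of what a paired, task-level comparison across two settings might look like, the sketch below uses an exact McNemar test. The abstract does not name the test the authors use, so this choice is an assumption, as is reducing each task to a single pass/fail outcome per setting. The function name and the example data are hypothetical.

```python
from math import comb


def mcnemar_exact(outcomes_a, outcomes_b):
    # Only discordant tasks (pass in one setting, fail in the other) carry information.
    only_a = sum(1 for a, b in zip(outcomes_a, outcomes_b) if a and not b)
    only_b = sum(1 for a, b in zip(outcomes_a, outcomes_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    # Two-sided exact binomial p-value under H0: discordant pairs split 50/50.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-task outcomes (True = success) under two settings.
setting_a = [True, True, True, True, True, True, False, False]
setting_b = [True, False, False, False, False, False, False, False]
p_value = mcnemar_exact(setting_a, setting_b)
```

Pairing by task means that tasks which behave identically in both settings drop out, so the test reflects task-level changes rather than differences in aggregate success rate.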
cs.AI / 94 / 2604.17884
SPREG: Structured Plan Repair with Entropy-Guided Test-Time Intervention for Large Language Model Reasoning
SPREG:基于结构化计划修复的熵引导测试时干预用于大语言模型推理
Abstract
Large Language Models (LLMs) are prone to logical hallucinations and stochastic drifts during long-chain reasoning. While Classifier-Free Guidance (CFG) can improve instruction adherence, standard static implementations often cause semantic dilution and linguistic degradation. We propose SPREG (Structured Plan-guided Real-time Entropy Gating), a lightweight inference-time framework for surgical error rectification. SPREG employs an adaptive dual-threshold mechanism to monitor real-time entropy, identifying sudden "entropy spikes" as reliable indicators of logical failure. Upon detection, it triggers a dynamic repair by replacing uninformative null-priors with reference distributions synthesized from historical high-confidence states. By modulating guidance intensity according to structured reasoning stages (e.g., Action, Observation), SPREG steers the model back to a stable manifold without compromising fluency. Our experiments demonstrate significant gains, notably a 20.0% absolute accuracy improvement on AIME25, while effectively suppressing uncontrolled entropy drift in complex tasks.
Chinese Translation
大型语言模型(LLMs)在长链推理过程中容易出现逻辑幻觉和随机漂移。虽然无分类器引导(Classifier-Free Guidance, CFG)可以改善指令遵循,但标准的静态实现往往导致语义稀释和语言退化。我们提出了SPREG(结构化计划引导实时熵门控),这是一个轻量级的推理时框架,用于精确的错误修正。SPREG采用自适应双阈值机制监测实时熵,识别突发的“熵峰值”作为逻辑失败的可靠指标。一旦检测到,它通过用从历史高置信度状态合成的参考分布替换无信息的零先验,触发动态修复。通过根据结构化推理阶段(例如,行动、观察)调节引导强度,SPREG将模型引导回稳定的流形,而不影响流畅性。我们的实验表明,SPREG在AIME25上实现了显著的提升,绝对准确率提高了20.0%,同时有效抑制了复杂任务中的无控制熵漂移。
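The entropy-monitoring step can be sketched as follows. This covers spike detection only, not the repair step, and the specific rule is an assumption: the abstract says the mechanism is adaptive and dual-threshold but does not define the thresholds. Here a step is flagged when the entropy exceeds both an absolute threshold and a multiple of a running baseline. The class name `EntropyGate` and all parameter values are hypothetical.

```python
import math


def entropy(probs):
    # Shannon entropy (in nats) of a next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class EntropyGate:
    def __init__(self, abs_threshold=1.0, rel_factor=2.0, momentum=0.9):
        self.abs_threshold = abs_threshold  # assumed fixed threshold
        self.rel_factor = rel_factor        # assumed adaptive threshold: multiple of baseline
        self.momentum = momentum
        self.baseline = None

    def step(self, probs):
        h = entropy(probs)
        if self.baseline is None:
            # The first step only initialises the baseline.
            self.baseline = h
            return False
        spike = h > self.abs_threshold and h > self.rel_factor * self.baseline
        if not spike:
            # Only non-spike steps update the running baseline.
            self.baseline = self.momentum * self.baseline + (1 - self.momentum) * h
        return spike


confident = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
gate = EntropyGate()
flags = [gate.step(p) for p in (confident, confident, uniform, confident)]
```

Excluding flagged steps from the baseline update keeps a single spike from raising the adaptive threshold for the steps that follow.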
cs.AI / 95 / 2604.17910
Physics-Informed Causal MDPs for Sequential Constraint Repair in Engineering Simulation Pipelines
基于物理知识的因果马尔可夫决策过程在工程仿真管道中的顺序约束修复
Abstract
Off-policy learning in constrained MDPs with large binary state spaces faces a fundamental tension: causal identification of transition dynamics requires structural assumptions, while sample-efficient policy learning requires state-space compression. We introduce PI-CMDP, a framework for CMDPs whose constraint dependencies form a layered DAG under a Lifecycle Ordering Assumption (LOA). We propose an Identify-Compress-Estimate pipeline: (i) Identify: LOA enables backdoor identification of causal edge weights for cross-layer pairs, with formal partial-identification bounds when LOA is violated; (ii) Compress: a Markov abstraction compresses state cardinality from 2^(WL) to (W+1)^L under layer-priority regularity and exchangeability; and (iii) Estimate: a physics-guided doubly-robust estimator remains unbiased and reduces the variance constant when the physics prior outperforms a learned model. We instantiate PI-CMDP on constraint repair in engineering simulation pipelines. On the TPS benchmark (4,206 episodes), PI-CMDP achieves 76.2% repair success rate with only 300 training episodes versus 70.8% for the strongest baseline (+5.4 pp), narrowing to +2.8 pp (83.4% vs. 80.6%) in the full-data regime, while substantially reducing cascade failure rates. All improvements are consistent across 5 independent seeds (paired t-test p < 0.02).
Chinese Translation
在具有大型二元状态空间的约束马尔可夫决策过程(MDPs)中,离策略学习面临着根本性的矛盾:过渡动态的因果识别需要结构假设,而样本高效的策略学习则需要状态空间压缩。我们提出了PI-CMDP,这是一个约束马尔可夫决策过程的框架,其约束依赖关系在生命周期排序假设(LOA)下形成分层有向无环图(DAG)。我们提出了一个识别-压缩-估计的流程:(i) 识别:LOA使得能够通过后门识别跨层对的因果边权重,当LOA被违反时,提供正式的部分识别界限;(ii) 压缩:在层优先规则性和可交换性下,马尔可夫抽象将状态基数从2^(WL)压缩到(W+1)^L;(iii) 估计:一个基于物理的双重稳健估计器在物理先验优于学习模型时保持无偏并减少方差常数。我们在工程仿真管道的约束修复中实例化了PI-CMDP。在TPS基准测试(4,206个回合)中,PI-CMDP以仅300个训练回合实现了76.2%的修复成功率,而最强基线为70.8%(提高了5.4个百分点),在全数据情况下缩小至2.8个百分点(83.4%对80.6%),同时显著降低了级联故障率。所有改进在5个独立种子下均一致(配对t检验p < 0.02)。
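The compression from 2^(WL) to (W+1)^L states can be checked with a small enumeration. The reading used here is an assumption: L layers of W binary constraint flags each, with the abstraction keeping only the number of set flags per layer (which is what exchangeability within a layer would permit). The function names are hypothetical, and the layer-priority regularity condition is not modelled.

```python
from itertools import product


def raw_state_count(W, L):
    return 2 ** (W * L)


def compressed_state_count(W, L):
    return (W + 1) ** L


def abstract_state(bits, W, L):
    # bits: flat tuple of W*L binary flags, stored layer by layer.
    # Each layer is summarised by how many of its W flags are set (0..W).
    return tuple(sum(bits[layer * W:(layer + 1) * W]) for layer in range(L))


W, L = 2, 2
abstract_states = {
    abstract_state(bits, W, L) for bits in product((0, 1), repeat=W * L)
}
```

For W = 2 and L = 2 the 16 raw states collapse to 9 abstract ones. For W = 4 and L = 3 the counts are 4096 versus 125, and the gap widens quickly as W and L grow.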
cs.AI / 96 / 2604.17931
LiteResearcher: A Scalable Agentic RL Training Framework for Deep Research Agent
LiteResearcher:一个可扩展的代理强化学习训练框架用于深度研究代理
Abstract
Reinforcement Learning (RL) has emerged as a powerful training paradigm for LLM-based agents. However, scaling agentic RL for deep research remains constrained by two coupled challenges: hand-crafted synthetic data fails to elicit genuine real-world search capabilities, and real-world search dependency during RL training introduces instability and prohibitive cost, which limits the scalability of Agentic RL. LiteResearcher is a training framework that makes Agentic RL scalable: by constructing a lite virtual world that mirrors real-world search dynamics, we enable a continuously improving training recipe that empowers a tiny search agent to outperform large-scale open-source and commercial models (e.g., Tongyi DeepResearch and Claude-4.5 Sonnet). Specifically, on common benchmarks such as GAIA and Xbench, our LiteResearcher-4B achieves open-source state-of-the-art results of 71.3% and 78.0% respectively, demonstrating that scalable RL training is a key enabler for Deep Research Agents.
Chinese Translation
强化学习(Reinforcement Learning, RL)已成为基于大型语言模型(LLM)代理的强大训练范式。然而,深度研究中代理强化学习的扩展受到两个相互关联的挑战的限制:手工制作的合成数据无法引发真实世界的搜索能力,而在RL训练过程中对真实世界搜索的依赖则引入了不稳定性和高昂的成本,从而限制了代理强化学习的可扩展性。LiteResearcher是一个使代理强化学习可扩展的训练框架:通过构建一个反映真实世界搜索动态的轻量虚拟世界,我们实现了一个持续改进的训练方案,使得一个小型搜索代理能够超越大规模的开源和商业模型(例如,通义深度研究和Claude-4.5 Sonnet)。具体而言,在GAIA和Xbench等常见基准测试上,我们的LiteResearcher-4B分别达到了71.3%和78.0%的开源最先进结果,证明了可扩展的强化学习训练是深度研究代理的关键推动力。
cs.AI / 97 / 2604.17937
ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis
ContraPrompt:通过二元推理轨迹分析进行对比提示优化
Abstract
Prompt optimization methods either analyze individual failures in isolation or compare prompt variants across examples, operating on single execution traces with no access to the reasoning process distinguishing success from failure on the same input. We introduce ContraPrompt, built on the observation that when a model fails but succeeds on a retry with feedback, the difference between its two chain-of-thought traces constitutes an optimization signal not captured by prior methods. Unlike prior contrastive methods, we compare complete intermediate reasoning processes: the two traces share model, input, and base prompt, so remaining differences reflect reasoning strategy and appended error feedback -- we call this dyadic reasoning trace analysis. The multi-attempt solving phase is an instrumented agentic retry loop that generates contrastive data automatically without human annotation. Extracted rules are organized into an input-aware decision tree routing instructions by observable input characteristics. On four reasoning and compliance benchmarks, ContraPrompt outperforms GEPA (Agrawal et al., 2026) on all four, with absolute gains of +8.29 pp on HotPotQA (+20.8% rel.), +2.21 pp on GDPR-Bench (+18.2% rel.), +7.14 pp on GPQA Diamond (+10.6% rel.), and +0.74 pp on BBH (+0.85% rel.). Ablations confirm dyadic trace contrastivity is the critical component, with a -16% relative average drop upon its removal. On 53 EvalSet black-box optimization problems, ContraPrompt beats GEPA on 11, ties on 41, and loses on 1 at equal budget. On FiNER-139 financial named entity recognition (Loukas et al., 2022), ContraPrompt achieves +7.77 pp over the unoptimized baseline (+11.6% rel.) and +1.94 pp over GEPA (+2.66% rel.), with branch conditions aligning with standard US GAAP financial-instrument categories.
Chinese Translation
提示优化方法要么孤立地分析个别失败,要么在示例之间比较提示变体,基于单一执行轨迹进行操作,而无法访问区分同一输入成功与失败的推理过程。我们引入了ContraPrompt,基于以下观察:当模型失败但在反馈的重试中成功时,其两条思维链轨迹之间的差异构成了先前方法未能捕捉的优化信号。与之前的对比方法不同,我们比较完整的中间推理过程:这两条轨迹共享模型、输入和基础提示,因此剩余的差异反映了推理策略和附加的错误反馈——我们称之为二元推理轨迹分析。多次尝试解决阶段是一个带有仪器的自主重试循环,能够自动生成对比数据,而无需人工标注。提取的规则根据可观察的输入特征组织成一个输入感知的决策树路由指令。在四个推理和合规基准上,ContraPrompt在所有四个基准上均优于GEPA(Agrawal等,2026),在HotPotQA上绝对增益为+8.29个百分点(+20.8%相对增益),在GDPR-Bench上为+2.21个百分点(+18.2%相对增益),在GPQA Diamond上为+7.14个百分点(+10.6%相对增益),在BBH上为+0.74个百分点(+0.85%相对增益)。消融实验确认二元轨迹对比性是关键组成部分,去除后相对平均下降-16%。在53个EvalSet黑箱优化问题中,ContraPrompt在11个问题上胜过GEPA,在41个问题上持平,在1个问题上以相同预算落败。在FiNER-139金融命名实体识别(Loukas等,2022)中,ContraPrompt比未优化的基线提高了+7.77个百分点(+11.6%相对增益),比GEPA提高了+1.94个百分点(+2.66%相对增益),分支条件与标准美国公认会计原则(GAAP)金融工具类别一致。
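The abstract reports each gain over GEPA both in percentage points and as a relative percentage. As a worked check of how the two relate, the sketch below recovers the approximate GEPA score implied by each pair. These implied scores are derived from rounded figures and are not reported in the abstract, so they should be read as rough values. The function name is hypothetical.

```python
def implied_baseline(abs_gain_pp, rel_gain_percent):
    # relative gain = absolute gain / baseline, so baseline = absolute gain / relative gain.
    return abs_gain_pp / (rel_gain_percent / 100.0)


# (absolute gain in pp, relative gain in %) as stated in the abstract.
reported = {
    "HotPotQA": (8.29, 20.8),
    "GDPR-Bench": (2.21, 18.2),
    "GPQA Diamond": (7.14, 10.6),
    "BBH": (0.74, 0.85),
}
baselines = {
    name: round(implied_baseline(a, r), 1) for name, (a, r) in reported.items()
}
```

The same absolute gain corresponds to very different relative gains depending on the starting point, which is why GDPR-Bench shows a large relative gain from a small absolute one while BBH shows the reverse.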
cs.AI / 98 / 2604.17950
CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation
CADMAS-CTX:多智能体委派的上下文能力校准
Abstract
We revisit multi-agent delegation under a stronger and more realistic assumption: an agent's capability is not fixed at the skill level, but depends on task context. A coding agent may excel at short standalone edits yet fail on long-horizon debugging; a planner may perform well on shallow tasks yet degrade on chained dependencies. Static skill-level capability profiles therefore average over heterogeneous situations and can induce systematic misdelegation. We propose CADMAS-CTX, a framework for contextual capability calibration. For each agent, skill, and coarse context bucket, CADMAS-CTX maintains a Beta posterior that captures stable experience in that part of the task space. Delegation decisions are then made with a risk-aware score that combines the posterior mean with an uncertainty penalty, so that agents delegate only when a peer appears better and that assessment is sufficiently well supported by evidence. This paper makes three contributions. First, a hierarchical contextual capability profile replaces static skill-level confidence with context-conditioned posteriors. Second, based on contextual bandit theory, we formally prove that context-aware routing achieves lower cumulative regret than static routing under sufficient context heterogeneity, formalizing the bias-variance tradeoff. Third, we empirically validate our method on the GAIA and SWE-bench benchmarks. On GAIA with GPT-4o agents, CADMAS-CTX achieves 0.442 accuracy, outperforming the static baseline (0.381) and AutoGen (0.354) with non-overlapping 95% confidence intervals. On SWE-bench Lite, it improves the resolve rate from 22.3% to 31.4%. Ablations show that the uncertainty penalty improves robustness against context-tagging noise. Our results demonstrate that contextual calibration and risk-aware delegation significantly improve multi-agent teamwork compared with static global skill assignments.
Chinese Translation
我们在一个更强且更现实的假设下重新审视多智能体委派:智能体的能力并非固定在技能水平上,而是依赖于任务上下文。一个编码智能体可能在短期独立编辑中表现出色,但在长期调试中却表现不佳;一个规划者可能在浅层任务中表现良好,但在链式依赖中却退化。因此,静态技能水平能力概况平均了异质情况,可能导致系统性的错误委派。我们提出了CADMAS-CTX,一个用于上下文能力校准的框架。对于每个智能体、技能和粗略上下文桶,CADMAS-CTX维护一个Beta后验,捕捉在该任务空间部分的稳定经验。委派通过一个风险意识评分进行,该评分结合了后验均值和不确定性惩罚,从而确保智能体仅在同伴表现更好且该评估有充分证据支持时进行委派。本文做出了三项贡献。首先,层次化的上下文能力概况用上下文条件后的后验替代了静态技能水平的置信度。其次,基于上下文赌博理论,我们正式证明了在足够的上下文异质性下,基于上下文的路由实现了比静态路由更低的累积遗憾,从而形式化了偏差-方差权衡。第三,我们在GAIA和SWE-bench基准上对我们的方法进行了实证验证。在使用GPT-4o智能体的GAIA上,CADMAS-CTX达到了0.442的准确率,超越了静态基线的0.381和AutoGen的0.354,且具有不重叠的95%置信区间。在SWE-bench Lite上,其解决率从22.3%提高到31.4%。消融实验表明,不确定性惩罚提高了对上下文标记噪声的鲁棒性。我们的结果表明,与静态全局技能分配相比,上下文校准和风险意识委派显著改善了多智能体团队合作。
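The delegation rule described above can be sketched as follows. Several details are assumptions, since the abstract does not specify them: a uniform Beta(1, 1) prior, the posterior standard deviation as the uncertainty term, and a linear penalty weight. The names `ContextualProfile` and `should_delegate`, and the example agents and bucket, are hypothetical.

```python
import math
from collections import defaultdict


class ContextualProfile:
    def __init__(self, prior_alpha=1.0, prior_beta=1.0):
        # One Beta(alpha, beta) posterior per (agent, skill, context bucket).
        self.params = defaultdict(lambda: [prior_alpha, prior_beta])

    def update(self, agent, skill, bucket, success):
        self.params[(agent, skill, bucket)][0 if success else 1] += 1.0

    def score(self, agent, skill, bucket, penalty=2.0):
        a, b = self.params[(agent, skill, bucket)]
        mean = a / (a + b)
        var = a * b / ((a + b) ** 2 * (a + b + 1))
        # Risk-aware score: posterior mean minus an uncertainty penalty.
        return mean - penalty * math.sqrt(var)


def should_delegate(profile, me, peer, skill, bucket, penalty=2.0):
    return profile.score(peer, skill, bucket, penalty) > profile.score(
        me, skill, bucket, penalty
    )


profile = ContextualProfile()
for ok in [True] * 8 + [False] * 2:      # self: 8 of 10 on long-horizon debugging
    profile.update("self", "coding", "long_debug", ok)
for ok in [True] * 3:                    # newcomer: 3 of 3, but little evidence
    profile.update("newcomer", "coding", "long_debug", ok)
for ok in [True] * 18 + [False] * 2:     # veteran: 18 of 20, well supported
    profile.update("veteran", "coding", "long_debug", ok)
```

In this example the newcomer has a higher posterior mean than `self` (0.80 against 0.75) but is not chosen, because three observations leave too much uncertainty. The veteran is chosen because the advantage is backed by enough evidence.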
cs.AI / 99 / 2604.17966
TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering
TPS-CalcBench:用于高超音速热保护系统工程中大语言模型分析计算能力的基准和诊断评估框架
Abstract
Deploying LLMs as reasoning assistants in safety-critical aerospace engineering requires stricter evaluation criteria than general scientific benchmarks provide. In hypersonic thermal protection system (TPS) design, inaccurate stagnation-point heat flux or boundary-layer calculations may cause catastrophic design margin violations. Models that give numerically reasonable but physically invalid answers are more dangerous than those that decline to respond. Current scientific benchmarks test only abstract math and basic physics, evaluate final answers alone, ignore engineering reasoning processes, and cannot detect such critical failures. We propose TPS-CalcBench, the first diagnostic benchmark for the closed-form analytical calculations in hypersonic aerodynamics and high-temperature gas dynamics that experienced TPS engineers carry out without simulations. Our contributions include: a domain-oriented task taxonomy with 4 difficulty levels and 8 categories drawn from Anderson's textbook; a dual-track evaluation that measures result accuracy and reasoning quality via an 8-dimension rubric and a calibrated judge with human audit, in order to identify right-answer, wrong-reasoning cases; a human-AI data pipeline that produces 420 high-confidence core items and 810 noise-controlled pre-gating items from 4560 raw data items; a noise-sensitivity analysis that measures the impact of data quality on model ranking; and three diagnostic intervention methods (DFA-TPS fine-tuning, RAG-EQ retrieval grounding, and PA-CoT process-aware prompting). Tests on 13 models from 7 groups show wide performance differences (KPI 12.6-87.9), hidden formula-selection defects, data-driven rank changes, and effective intervention improvements, establishing a complete diagnose-evaluate-intervene framework for assessing LLM deployment in safety-critical engineering.
Chinese Translation
在安全关键的航空航天工程中,将大语言模型(LLMs)作为推理助手的要求比一般科学基准更为严格。在高超音速热保护系统(TPS)设计中,不准确的滞止点热流或边界层计算可能导致灾难性的设计裕度违规。具有数值合理但物理无效答案的模型比拒绝响应的模型更具危险性。目前的科学基准仅测试抽象数学和基础物理,单纯评估最终答案,忽视工程推理过程,无法检测此类关键失败。我们提出了TPS-CalcBench,这是第一个针对高超音速气动学和高温气体动力学中闭式解析计算的诊断基准,旨在让经验丰富的TPS工程师在不进行模拟的情况下进行评估。我们的贡献包括基于领域的任务分类,设有4个难度级别和8个类别,来源于安德森的教科书;双轨评估,通过8维评分标准和经过校准的评审人员测量结果准确性和推理质量,以识别正确答案错误推理的问题;人机数据管道,从4560条原始数据中生成420个高置信度核心项目和810个噪声控制的预筛选项目;噪声敏感性分析,测量数据质量对模型排名的影响;以及三种诊断干预方法:DFA-TPS微调、RAG-EQ检索基础和PA-CoT过程感知提示。对来自7个组的13个模型的测试显示出广泛的性能差异(KPI 12.6-87.9)、隐藏的公式选择缺陷、数据驱动的排名变化和有效的干预改进,建立了一个完整的诊断-评估-干预框架,用于安全关键工程中大语言模型的部署评估。
cs.AI / 100 / 2604.17967
A Sugeno Integral View of Binarized Neural Network Inference
二值神经网络推理的Sugeno积分视角
Abstract
In this article, we establish a precise connection between binarized neural networks (BNNs) and Sugeno integrals. The advantage of the Sugeno integral is that it provides a framework for representing the importance of inputs and their interactions, while being equivalent to a set of if-then rules. For a hidden BNN neuron at inference time, we show that the activation threshold test can be written as a Sugeno integral on binary inputs. This yields an explicit set-function representation of each neuron decision, and an associated rule-based representation. We also provide a Sugeno-integral expression for the last-layer score. Finally, we discuss how the same framework can be adapted to support richer input interactions and how it can be extended beyond the binary case induced by binarized neural networks.
Chinese Translation
在本文中,我们建立了二值神经网络(BNNs)与Sugeno积分之间的精确联系。Sugeno积分的优势在于它提供了一个框架,用于表示输入的重要性及其相互作用,同时等同于一组如果-那么规则。在推理时,对于一个隐藏的BNN神经元,我们展示了激活阈值测试可以被写作二进制输入上的Sugeno积分。这产生了每个神经元决策的显式集合函数表示及其相关的基于规则的表示。我们还提供了最后一层得分的Sugeno积分表达。最后,我们讨论了如何将相同的框架调整以支持更丰富的输入交互,以及如何将其扩展到超越由二值神经网络引发的二进制情况。
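The claim that a binarized neuron's threshold test can be written as a Sugeno integral on binary inputs can be illustrated with a small exhaustive check. The construction is one plausible reading and not necessarily the paper's: inputs and weights take values in {-1, +1}, each input is recoded as an agreement bit (1 if it matches its weight), and the capacity is the threshold set function `mu(A) = 1` if `|A| >= k`, else 0. It assumes 1 <= k <= n so that `mu` is a valid capacity. All names are hypothetical.

```python
import math
from itertools import combinations, product


def sugeno_integral(x, mu):
    # max over subsets A of min(mu(A), min_{i in A} x_i), for x_i in [0, 1].
    n = len(x)
    best = 0.0
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            inner = min((x[i] for i in subset), default=1.0)
            best = max(best, min(mu(frozenset(subset)), inner))
    return best


def neuron_fires(inputs, weights, theta):
    # Standard BNN threshold test on {-1, +1} inputs and weights.
    return sum(w * v for w, v in zip(weights, inputs)) >= theta


weights = [1, -1, 1, 1]
theta = 1
n = len(weights)
# sum(w*x) = 2*agreements - n, so the neuron fires iff agreements >= k.
k = math.ceil((n + theta) / 2)


def mu(subset):
    return 1.0 if len(subset) >= k else 0.0


mismatches = []
for inputs in product((-1, 1), repeat=n):
    agreement = [1.0 if v == w else 0.0 for v, w in zip(inputs, weights)]
    if (sugeno_integral(agreement, mu) == 1.0) != neuron_fires(inputs, weights, theta):
        mismatches.append(inputs)
```

The subsets on which `mu` equals 1 can also be read as if-then rules ("if these k inputs all agree with their weights, the neuron fires"), which is the rule-based view mentioned in the abstract.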
cs.AI / 101 / 2604.17968
From Fallback to Frontline: When Can LLMs be Superior Annotators of Human Perspectives?
从后备到前线:大型语言模型何时能成为人类观点的优越标注者?
Abstract
Although large language models (LLMs) are increasingly used as annotators at scale, they are typically treated as a pragmatic fallback rather than a faithful estimator of human perspectives. This work challenges that presumption. By framing perspective-taking as the estimation of a latent group-level judgment, we characterize the conditions under which modern LLMs can outperform human annotators, including in-group humans, when predicting aggregate subgroup opinions on subjective tasks, and show that these conditions are common in practice. This advantage arises from structural properties of LLMs as estimators, including low variance and reduced coupling between representation and processing biases, rather than any claim of lived experience. Our analysis identifies clear regimes where LLMs act as statistically superior frontline estimators, as well as principled limits where human judgment remains essential. These findings reposition LLMs from a cost-saving compromise to a principled tool for estimating collective human perspectives.
Chinese Translation
尽管大型语言模型(LLMs)在大规模标注中越来越多地被使用,但它们通常被视为一种务实的后备方案,而非人类观点的忠实估计者。本研究挑战了这一假设。通过将观点采纳框架化为潜在群体层级判断的估计,我们界定了现代LLMs在预测主观任务中聚合子群体意见时能够超越人类标注者(包括同群体人类)的条件,并展示这些条件在实践中是普遍存在的。这一优势源于LLMs作为估计者的结构特性,包括低方差和表示与处理偏差之间的耦合减少,而非任何关于生活经验的主张。我们的分析识别出LLMs作为统计上优越的前线估计者的明确领域,以及人类判断仍然至关重要的原则性限制。这些发现将LLMs从一种节约成本的妥协重新定位为估计集体人类观点的原则性工具。
cs.AI / 102 / 2604.17989
AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum
AIT学院:通过儒家三域课程培养全面智能体
Abstract
What does it mean to give an AI agent a complete education? Current agent development produces specialist systems optimized for a single capability dimension (whether tool use, code generation, or security awareness) that exhibit predictable deficits wherever they were not trained. We argue this pattern reflects a structural absence: there is no curriculum theory for agents, no principled account of what a fully developed agent should know, be, and be able to do across the full scope of intelligent behavior. This paper introduces the AIT Academy (Agents Institute of Technology Academy), a curriculum framework for cultivating AI agents across the tripartite structure of human knowledge. Grounded in Kagan's Three Cultures and UNESCO ISCED-F 2013, AIT organizes agent capability development into three domains: Natural Science and Technical Reasoning (Domain I), Humanities and Creative Expression (Domain II), and Social Science and Ethical Reasoning (Domain III). The Confucian Six Arts (liuyi), a 2,500-year-old holistic education system, are reinterpreted as behavioral archetypes that map directly onto trainable agent capabilities within each domain. Three representative training grounds instantiate the framework across multiple backbone LLMs: the ClawdGO Security Dojo (Domain I), Athen's Academy (Domain II), and the Alt Mirage Stage (Domain III). Experiments demonstrate a 15.9-point improvement in security capability scores under weakest-first curriculum scheduling, and a 7-percentage-point gain in social reasoning performance under principled attribution modeling. A cross-domain finding, Security Awareness Calibration Pathology (SACP), in which over-trained Domain I agents fail on out-of-distribution evaluation, illustrates the diagnostic value of a multi-domain perspective unavailable to any single-domain framework.
Chinese Translation
为人工智能智能体提供全面教育意味着什么?当前的智能体开发产生了专门化的系统,这些系统针对单一能力维度进行了优化,无论是工具使用、代码生成还是安全意识,均在未接受训练的领域表现出可预测的不足。我们认为这种模式反映了一种结构性缺失:缺乏针对智能体的课程理论,没有原则性地说明一个全面发展的智能体应该在智能行为的全范围内具备什么知识、具备什么能力以及能够做什么。本文介绍了AIT学院(Agents Institute of Technology Academy),这是一个旨在通过人类知识的三元结构培养人工智能智能体的课程框架。基于Kagan的三种文化和UNESCO ISCED-F 2013,AIT将智能体能力的发展组织为三个领域:自然科学与技术推理(领域I)、人文学科与创造性表达(领域II)以及社会科学与伦理推理(领域III)。儒家的六艺(liuyi)作为一种已有2500年历史的整体教育体系,被重新诠释为直接映射到每个领域内可训练智能体能力的行为原型。三个代表性的训练场所通过多个基础大型语言模型(LLMs)实例化该框架:ClawdGO安全道场(领域I)、雅典学院(领域II)和Alt Mirage舞台(领域III)。实验表明,在最弱优先的课程安排下,安全能力评分提高了15.9分,而在原则性归因建模下,社会推理表现提高了7个百分点。跨领域发现的安全意识校准病理(Security Awareness Calibration Pathology, SACP)表明,过度训练的领域I智能体在分布外评估中失败,突显了多领域视角的诊断价值,这是任何单一领域框架所无法提供的。
cs.AI / 103 / 2604.18003
SELF-EMO: Emotional Self-Evolution from Recognition to Consistent Expression
SELF-EMO:从识别到一致表达的情感自我演化
Abstract
Emotion Recognition in Conversation (ERC) has become a fundamental capability for large language models (LLMs) in human-centric interaction. Beyond accurate recognition, coherent emotional expression is also crucial, yet both are limited by the scarcity and static nature of high-quality annotated data. In this work, we propose SELF-EMO, a self-evolution framework grounded in the hypothesis that better emotion prediction leads to more consistent emotional responses. We introduce two auxiliary tasks, emotional understanding and emotional expression, and design a role-based self-play paradigm where the model acts as both an emotion recognizer and a dialogue responder. Through iterative interactions, the model generates diverse conversational trajectories, enabling scalable data generation. To ensure quality, we adopt a data flywheel mechanism that filters candidate predictions and responses using a smoothed IoU-based reward and feeds selected samples back for continuous self-improvement without external supervision. We further develop SELF-GRPO, a reinforcement learning algorithm that stabilizes optimization with multi-label alignment rewards and group-level consistency signals. Experiments on IEMOCAP, MELD, and EmoryNLP show that SELF-EMO achieves state-of-the-art performance, improving accuracy by +6.33% on Qwen3-4B and +8.54% on Qwen3-8B, demonstrating strong effectiveness and generalization.
Chinese Translation
对话中的情感识别(Emotion Recognition in Conversation, ERC)已成为大型语言模型(Large Language Models, LLMs)在人本交互中的基本能力。除了准确的识别外,一致的情感表达同样至关重要,但这两者都受到高质量标注数据稀缺和静态特性的限制。在本研究中,我们提出了SELF-EMO,一个自我演化框架,基于更好的情感预测能够导致更一致的情感反应的假设。我们引入了两个辅助任务:情感理解和情感表达,并设计了一个基于角色的自我对弈范式,其中模型既充当情感识别者,又充当对话响应者。通过迭代交互,模型生成多样的对话轨迹,从而实现可扩展的数据生成。为了确保质量,我们采用了一种数据飞轮机制,通过平滑的IoU(Intersection over Union)奖励过滤候选预测和响应,并将选定样本反馈用于持续的自我改进,而无需外部监督。我们进一步开发了SELF-GRPO,一种通过多标签对齐奖励和群体级一致性信号来稳定优化的强化学习算法。在IEMOCAP、MELD和EmoryNLP上的实验表明,SELF-EMO达到了最先进的性能,在Qwen3-4B上提高了+6.33%的准确率,在Qwen3-8B上提高了+8.54%的准确率,展示了强大的有效性和泛化能力。
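A smoothed IoU reward over multi-label emotion sets might look like the sketch below. The additive smoothing form, the threshold, and the source of the reference label set are all assumptions; the abstract states only that a smoothed IoU-based reward is used to filter candidates. The function names are hypothetical.

```python
def smoothed_iou(predicted, reference, eps=1.0):
    # Additive smoothing keeps the reward defined and non-zero for empty or disjoint sets.
    predicted, reference = set(predicted), set(reference)
    intersection = len(predicted & reference)
    union = len(predicted | reference)
    return (intersection + eps) / (union + eps)


def keep_sample(predicted, reference, threshold=0.6):
    # Flywheel-style filter: only sufficiently consistent samples are fed back.
    return smoothed_iou(predicted, reference) >= threshold


partial = smoothed_iou({"joy", "surprise"}, {"joy"})
disjoint = smoothed_iou({"anger"}, {"joy"})
```

With `eps = 1.0`, a partial overlap scores 2/3 and passes the filter, while a disjoint prediction scores 1/3 and is dropped. A plain IoU would give the disjoint case exactly zero.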
cs.AI / 104 / 2604.18050
The Topological Dual of a Dataset: A Logic-to-Topology Encoding for AlphaGeometry-Style Data
数据集的拓扑对偶:一种用于AlphaGeometry风格数据的逻辑到拓扑编码
Abstract
AlphaGeometry represents a milestone in neuro-symbolic reasoning, yet its architecture faces a log-linear scaling bottleneck within its symbolic deduction engine that limits its efficiency as problem complexity increases. Recent technical reports suggest that current domain-specific languages may be isomorphic to natural language as input representations: interchanging them acts as a performance-invariant transformation, which implies that current neural guidance relies on superficial encodings rather than structural understanding. This paper addresses this representation bottleneck by proposing a logic-to-topology encoding designed to reveal the structural invariants of a model's latent space under a transformation of its input space. By leveraging the Logic of Observation, we utilize the duality between provability in observable theories and topologies to propose a logic-to-topology encoder for the input space. We introduce the concept of the "topological dual of a dataset", a transformation that bridges formal logic, topology, and neural processing. This framework serves as a Rosetta Stone for neuro-symbolic AI, providing a principled pathway for the mechanistic interpretability of how models navigate complex discovery paths.
Chinese Translation
AlphaGeometry代表了神经符号推理的一个里程碑,但其架构在符号推理引擎中面临着对数线性扩展瓶颈,这限制了其在问题复杂性增加时的效率。最近的技术报告表明,当前的领域特定语言可能在输入表示上与自然语言同构,互换它们的操作作为一种性能不变的变换,暗示当前的神经引导依赖于表面编码而非结构理解。本文通过提出一种逻辑到拓扑的编码来解决这一表示瓶颈,旨在揭示模型潜在空间在输入空间变换下的结构不变性。通过利用观察逻辑,我们利用可观察理论与拓扑之间的对偶性,提出了一种用于输入空间的逻辑到拓扑编码器。我们引入了“数据集的拓扑对偶”这一概念,这是一种连接形式逻辑、拓扑和神经处理的变换。该框架作为神经符号人工智能的罗塞塔石,为模型如何在复杂发现路径中进行机械可解释性提供了一个原则性路径。
cs.AI / 105 / 2604.18064
Understanding Human Actions through the Lens of Executable Models
通过可执行模型理解人类动作
Abstract
Human-centred systems require an understanding of human actions in the physical world. Temporally extended sequences of actions are intentional and structured, yet existing methods for recognising what actions are performed often do not attempt to capture their structure, particularly how the actions are executed. This, however, is crucial for assessing the quality of the action's execution and its differences from other actions. To capture the internal mechanics of actions, we introduce a domain-specific language EXACT that represents human motions as underspecified motion programs, interpreted as reward-generating functions for zero-shot policy inference using forward-backwards representations. By leveraging the compositional nature of EXACT motion programs, we combine individual policies into an executable neuro-symbolic model that uses program structure for compositional modelling. We evaluate the utility of the proposed pipeline for creating executable action models by analysing motion-capture data to understand human actions, for the tasks of human action segmentation and action anomaly detection. Our results show that the use of executable action models improves data efficiency and captures intuitive relationships between actions compared with monolithic, task-specific approaches.
Chinese Translation
以人为中心的系统需要理解人类在物理世界中的动作。时间上延续的动作序列是有意图且结构化的,然而,现有的识别所执行动作的方法往往未能捕捉其结构,特别是动作的执行方式。然而,这对于评估动作执行的质量及其与其他动作的差异至关重要。为了捕捉动作的内部机制,我们引入了一种领域特定语言EXACT,该语言将人类运动表示为未完全指定的运动程序,并被解释为用于零样本策略推断的奖励生成函数,采用前向-后向表示。通过利用EXACT运动程序的组合特性,我们将单独的策略组合成一个可执行的神经符号模型,该模型利用程序结构进行组合建模。我们通过分析动作捕捉数据来理解人类动作,评估所提出的管道在创建可执行动作模型方面的实用性,任务包括人类动作分割和动作异常检测。我们的结果表明,与单一的、特定任务的方法相比,使用可执行动作模型提高了数据效率,并捕捉了动作之间的直观关系。
cs.AI / 106 / 2604.18071
Architectural Design Decisions in AI Agent Harnesses
人工智能代理系统中的架构设计决策
Abstract
AI agent systems increasingly rely on reusable non-LLM engineering infrastructure that packages tool mediation, context handling, delegation, safety control, and orchestration. Yet the architectural design decisions in this surrounding infrastructure remain understudied. This paper presents a protocol-guided, source-grounded empirical study of 70 publicly available agent-system projects, addressing three questions: which design-decision dimensions recur across projects, which co-occurrences structure those decisions, and which typical architectural patterns emerge. Methodologically, we contribute a transparent investigation procedure for analyzing heterogeneous agent-system corpora through source-code and technical-material reading. Empirically, we identify five recurring design dimensions (subagent architecture, context management, tool systems, safety mechanisms, and orchestration) and find that the corpus favors file-persistent, hybrid, and hierarchical context strategies; registry-oriented tool systems remain dominant while MCP- and plugin-oriented extensions are emerging; and intermediate isolation is common but high-assurance audit is rare. Cross-project co-occurrence analysis reveals that deeper coordination pairs with more explicit context services, stronger execution environments with more structured governance, and formalized tool-registration boundaries with broader ecosystem ambitions. We synthesize five recurring architectural patterns spanning lightweight tools, balanced CLI frameworks, multi-agent orchestrators, enterprise systems, and scenario-verticalized projects. The result provides an evidence-based account of architectural regularities in agent-system engineering, with grounded guidance for framework designers, selectors, and researchers.
Chinese Translation
人工智能代理系统越来越依赖于可重用的非大语言模型(non-LLM)工程基础设施,该基础设施整合了工具中介、上下文处理、委托、安全控制和编排。然而,这种周边基础设施中的架构设计决策仍然未得到充分研究。本文呈现了一项以协议为指导、基于来源的实证研究,分析了70个公开可用的代理系统项目,重点回答三个问题:哪些设计决策维度在项目中反复出现,哪些共现结构影响这些决策,以及哪些典型的架构模式浮现。方法论上,我们贡献了一种透明的调查程序,通过源代码和技术材料阅读分析异构的代理系统语料。实证上,我们识别出五个反复出现的设计维度(子代理架构、上下文管理、工具系统、安全机制和编排),并发现语料偏向于文件持久化、混合和层次化的上下文策略;以注册中心为导向的工具系统仍占主导地位,而以MCP(Model Context Protocol,模型上下文协议)和插件为导向的扩展正在兴起;中间隔离普遍存在,但高保证审计则较为稀缺。跨项目的共现分析揭示,深入的协调与更明确的上下文服务相辅相成,更强的执行环境与更结构化的治理相结合,正式的工具注册边界与更广泛的生态系统目标相匹配。我们综合了五种反复出现的架构模式,涵盖轻量级工具、平衡的命令行接口(CLI)框架、多代理编排器、企业系统和情景垂直化项目。该结果为代理系统工程中的架构规律提供了基于证据的说明,并为框架设计者、选择者和研究人员提供了扎实的指导。
cs.AI / 107 / 2604.18095
DSAINet: An Efficient Dual-Scale Attentive Interaction Network for General EEG Decoding
DSAINet:一种高效的双尺度注意力交互网络用于通用脑电图解码
Abstract
In real-world applications of noninvasive electroencephalography (EEG), specialized decoders often show limited generalizability across diverse tasks under subject-independent settings. One central challenge is that task-relevant EEG signals often follow different temporal organization patterns across tasks, while many existing methods rely on task-tailored architectural designs that introduce task-specific temporal inductive biases. This mismatch makes it difficult to adapt temporal modeling across tasks without changing the model configuration. To address these challenges, we propose DSAINet, an efficient dual-scale attentive interaction network for general EEG decoding. Specifically, DSAINet constructs shared spatiotemporal token representations from raw EEG signals and models diverse temporal dynamics through parallel convolutional branches at fine and coarse scales. The resulting representations are then adaptively refined by intra-branch attention to emphasize salient scale-specific patterns and by inter-branch attention to integrate task-relevant features across scales, followed by adaptive token aggregation to yield a compact representation for prediction. Extensive experiments on five downstream EEG decoding tasks across ten public datasets show that DSAINet consistently outperforms 13 representative baselines under strict subject-independent evaluation. Notably, this performance is achieved using the same architecture hyperparameters across datasets. Moreover, DSAINet achieves a favorable accuracy-efficiency trade-off with only about 77K trainable parameters and provides interpretable neurophysiological insights. The code is publicly available at https://github.com/zy0929/DSAINet.
Chinese Translation
在非侵入性脑电图(EEG)的实际应用中,专门的解码器在受试者独立设置下往往表现出有限的任务泛化能力。一个主要挑战是,不同任务的相关EEG信号往往遵循不同的时间组织模式,而许多现有方法依赖于任务特定的架构设计,这引入了任务特定的时间归纳偏差。这种不匹配使得在不改变模型配置的情况下,难以在不同任务之间适应时间建模。为了解决这些挑战,我们提出了DSAINet,一种高效的双尺度注意力交互网络,用于通用EEG解码。具体而言,DSAINet从原始EEG信号构建共享的时空标记表示,并通过在细尺度和粗尺度的并行卷积分支建模多样的时间动态。然后,通过分支内注意力自适应地精炼得到的表示,以强调显著的尺度特定模式,并通过分支间注意力整合跨尺度的任务相关特征,最后通过自适应标记聚合生成用于预测的紧凑表示。在十个公共数据集上的五个下游EEG解码任务中,广泛的实验表明,DSAINet在严格的受试者独立评估下始终优于13个代表性基线。值得注意的是,这一性能是在跨数据集使用相同的架构超参数下实现的。此外,DSAINet以大约77K的可训练参数实现了良好的准确性与效率的权衡,并提供了可解释的神经生理学见解。代码已公开发布在 https://github.com/zy0929/DSAINet。
cs.AI / 108 / 2604.18103
Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling
稳定性意味着冗余:用于高效长上下文预填充的Delta注意力选择性停止
Abstract
Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-context settings. While token pruning reduces sequence length, prior methods rely on heuristics that break compatibility with hardware-efficient kernels like FlashAttention. In this work, we observe that tokens evolve toward semantic fixing points, making further processing redundant. To this end, we introduce Delta Attention Selective Halting (DASH), a training-free policy that monitors the layer-wise update dynamics of the self-attention mechanism to selectively halt stabilized tokens. Extensive evaluation confirms that DASH generalizes across language and vision benchmarks, delivering significant prefill speedups while preserving model accuracy and hardware efficiency. Code will be released at https://github.com/verach3n/DASH.git.
Chinese Translation
预填充计算成本在长上下文设置中对大型语言模型(LLMs)和大型多模态模型(LMMs)构成了显著瓶颈。虽然令牌修剪可以减少序列长度,但先前的方法依赖于与硬件高效内核(如FlashAttention)不兼容的启发式方法。在本研究中,我们观察到令牌朝着语义固定点演变,使得进一步处理变得冗余。为此,我们提出了Delta注意力选择性停止(DASH),这是一种无训练的策略,通过监控自注意力机制的层级更新动态来选择性地停止稳定的令牌。广泛的评估确认DASH在语言和视觉基准测试中具有良好的泛化能力,显著提高了预填充速度,同时保持了模型的准确性和硬件效率。代码将发布在https://github.com/verach3n/DASH.git。
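The halting idea can be sketched as below. The stability measure is an assumption, not something the abstract specifies: a token is treated as stabilized when the relative change of its attention update between two consecutive layers falls under a threshold. The function names, the threshold, and the toy vectors are hypothetical, and the sketch does not touch kernel-level concerns such as FlashAttention compatibility.

```python
import math


def relative_delta(prev_update, curr_update, eps=1e-8):
    # Size of the layer-to-layer change, relative to the previous update.
    diff = math.sqrt(sum((c - p) ** 2 for p, c in zip(prev_update, curr_update)))
    base = math.sqrt(sum(p * p for p in prev_update))
    return diff / (base + eps)


def select_active(prev_updates, curr_updates, active, threshold=0.05):
    # Keep only tokens whose attention update is still changing noticeably;
    # the rest are halted and skipped in later layers.
    return [
        t for t in active
        if relative_delta(prev_updates[t], curr_updates[t]) >= threshold
    ]


# Toy per-token attention updates at two consecutive layers.
prev_updates = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.5, 0.5]}
curr_updates = {0: [1.0, 0.01], 1: [0.0, 1.0], 2: [0.5, 0.5]}
still_active = select_active(prev_updates, curr_updates, active=[0, 1, 2])
```

Because the rule only reads quantities that already exist during the forward pass, a policy of this shape needs no training. Only token 1 stays active in the toy example.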
cs.AI / 109 / 2604.18131
Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
通过世界知识探索训练大型语言模型代理进行自发的无奖励自我进化
Abstract
Most agents today "self-evolve" by following rewards and rules defined by humans. However, this process remains fundamentally dependent on external supervision; without human guidance, the evolution stops. In this work, we train agents to possess an intrinsic meta-evolution capability to spontaneously learn about unseen environments prior to task execution. To instill this ability, we design an outcome-based reward mechanism that measures how much an agent's self-generated world knowledge improves its success rate on downstream tasks. This reward signal is used exclusively during the training phase to teach the model how to explore and summarize effectively. At inference time, the agent requires no external rewards or human instructions. It spontaneously performs native self-evolution to adapt to unknown environments using its internal parameters. When applied to Qwen3-30B and Seed-OSS-36B, this shift to native evolution yields a 20% performance increase on WebVoyager and WebWalker. Most strikingly, the generated world knowledge even enables a compact 14B Qwen3 model to outperform the unassisted Gemini-2.5-Flash, establishing a new paradigm for truly evolving agents.
Chinese Translation
如今大多数代理通过遵循人类定义的奖励和规则进行“自我进化”。然而,这一过程在根本上仍然依赖于外部监督;没有人类指导,进化便会停止。在本研究中,我们训练代理具备内在的元进化能力,以在任务执行之前自发地学习未知环境。为了培养这种能力,我们设计了一种基于结果的奖励机制,衡量代理自生成的世界知识在多大程度上提高了其在下游任务上的成功率。该奖励信号仅在训练阶段使用,以教导模型如何有效地探索和总结。在推理时,代理无需外部奖励或人类指令。它自发地进行本土自我进化,利用其内部参数适应未知环境。当应用于Qwen3-30B和Seed-OSS-36B时,这种向本土进化的转变使WebVoyager和WebWalker的性能提高了20%。最引人注目的是,生成的世界知识甚至使得一个紧凑的14B Qwen3模型超越了未辅助的Gemini-2.5-Flash,为真正进化的代理建立了新的范式。
cs.AI / 110 / 2604.18133
Multi-Agent Systems: From Classical Paradigms to Large Foundation Model-Enabled Futures
多智能体系统:从经典范式到大型基础模型驱动的未来
Abstract
With the rapid advancement of artificial intelligence, multi-agent systems (MASs) are evolving from classical paradigms toward architectures built upon large foundation models (LFMs). This survey provides a systematic review and comparative analysis of classical MASs (CMASs) and LFM-based MASs (LMASs). First, within a closed-loop coordination framework, CMASs are reviewed across four fundamental dimensions: perception, communication, decision-making, and control. Beyond this framework, LMASs integrate LFMs to lift collaboration from low-level state exchanges to semantic-level reasoning, enabling more flexible coordination and improved adaptability across diverse scenarios. Then, a comparative analysis is conducted to contrast CMASs and LMASs across architecture, operating mechanism, adaptability, and application. Finally, future perspectives on MASs are presented, summarizing open challenges and potential research opportunities.
Chinese Translation
随着人工智能的快速发展,多智能体系统(MASs)正从经典范式向基于大型基础模型(LFMs)的架构演变。本调查提供了对经典多智能体系统(CMASs)和基于LFMs的多智能体系统(LMASs)的系统性回顾和比较分析。首先,在闭环协调框架内,从感知、通信、决策和控制四个基本维度对CMASs进行了回顾。在此框架之外,LMASs整合LFMs,将协作从低级状态交换提升到语义级推理,使得在多样化场景中实现更灵活的协调和更好的适应性。然后,进行比较分析,以对比CMASs和LMASs在架构、运行机制、适应性和应用方面的差异。最后,提出了对多智能体系统的未来展望,总结了开放性挑战和潜在的研究机会。
cs.AI / 111 / 2604.18158
State Transfer Reveals Reuse in Controlled Routing
状态转移揭示了受控路由中的重用
Abstract
Prompt-based interventions can change model behavior, but trained success alone does not identify where the behaviorally relevant state is represented. We study this question in controlled routing tasks using interfaces chosen on support data, held-out query evaluation, and matched necessity, sufficiency, and wrong-interface controls. On GPT-2 triop, an early interface supports exact transfer under these tests. On GPT-2 add/sub, zero-retrain compiled transfer at the fixed interface recovers most of donor routing accuracy, while trainable prompt slots can relearn the same behavior at several other positions only after additional support examples and optimization. These results distinguish fixed-interface reuse from prompt relocation in a setting where the two can be tested directly. Qwen routing provides a cross-architecture consistency check for the same matched-interface pattern at the operator token, although donor-specific identity on the local V-path remains unresolved. Generation and reasoning branches are used to map scope: they show broader transport or weaker controller identifiability once control depends on longer trajectories or harder selection. In controlled routing, fixed-interface transfer is therefore stronger evidence of reuse than trained prompt success alone.
Chinese Translation
基于提示的干预可以改变模型行为,但仅凭训练成功并不能识别行为相关状态的表示位置。我们在受控路由任务中研究了这个问题,使用了基于支持数据选择的接口、保留查询评估以及匹配的必要性、充分性和错误接口控制。在 GPT-2 triop 上,早期接口在这些测试下支持精确转移。在 GPT-2 add/sub 上,固定接口下的零重训练编译转移恢复了大部分捐赠路由的准确性,而可训练的提示槽只能在额外的支持示例和优化后,在其他几个位置重新学习相同的行为。这些结果将固定接口重用与提示重定位区分开来,在可以直接测试这两者的环境中。Qwen 路由为操作符标记处相同的匹配接口模式提供了跨架构一致性检查,尽管在局部 V-path 上的捐赠特定身份仍未解决。生成和推理分支用于映射范围:它们显示出更广泛的传输或较弱的控制器可识别性,一旦控制依赖于更长的轨迹或更困难的选择。因此,在受控路由中,固定接口转移比仅凭训练提示成功更能证明重用的存在。
cs.AI / 112 / 2604.18176
QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning
QuantumQA:通过物理一致性数据集和验证意识强化学习增强科学推理
Abstract
Large language models (LLMs) show strong capabilities in general reasoning but typically lack reliability in scientific domains like quantum mechanics, which demand strict adherence to physical constraints. This limitation arises from the scarcity of verifiable training resources and the inadequacy of coarse feedback signals in standard alignment paradigms. To address the data challenge, we introduce QuantumQA, a large-scale dataset constructed via a task-adaptive strategy and a hybrid verification protocol that combines deterministic solvers with semantic auditing to guarantee scientific rigor. Building on this foundation, we propose the verification-aware reward model (VRM) tailored for Reinforcement Learning with Verifiable Rewards (RLVR), which employs an adaptive reward fusion (ARF) mechanism to dynamically integrate deterministic signals from a scientific execution suite (SES) with multidimensional semantic evaluations for precise supervision. Experimental results demonstrate that our method consistently outperforms baselines and general-purpose preference models. Notably, our optimized 8B model achieves performance competitive with proprietary models, validating that incorporating verifiable, rule-based feedback into the reinforcement learning loop offers a parameter-efficient alternative to pure scaling.
Chinese Translation
大型语言模型(LLMs)在一般推理方面表现出强大的能力,但在量子力学等科学领域通常缺乏可靠性,这些领域要求严格遵循物理约束。这一局限性源于可验证训练资源的稀缺以及标准对齐范式中粗略反馈信号的不足。为了解决数据挑战,我们提出了QuantumQA,这是一个通过任务自适应策略和混合验证协议构建的大规模数据集,该协议结合了确定性求解器和语义审计,以确保科学严谨性。在此基础上,我们提出了针对可验证奖励的强化学习(RLVR)量身定制的验证意识奖励模型(VRM),该模型采用自适应奖励融合(ARF)机制,动态整合来自科学执行套件(SES)的确定性信号与多维语义评估,以实现精确监督。实验结果表明,我们的方法在性能上始终优于基线和通用偏好模型。值得注意的是,我们优化后的8B模型在性能上与专有模型相当,验证了将可验证的基于规则的反馈纳入强化学习循环中,提供了一种参数高效的替代方案,而不是单纯的扩展。
cs.AI / 113 / 2604.18206
A Control Architecture for Training-Free Memory Use
无训练记忆使用的控制架构
Abstract
Prompt-injected memory can improve reasoning without updating model weights, but it also creates a control problem: retrieved content helps only when it is applied in the right state. We study this problem in a strict training-free setting and formulate it as applicability control: when to trigger a memory-assisted second pass, when to trust it, and how to maintain the memory bank over time. Our method combines uncertainty-based routing, confidence-based selective acceptance, bank selection across rule and exemplar memory, and evidence-based governance of the memory bank over time. Under a locked training-free protocol with compute-matched controls, it improves two core arithmetic benchmarks by +7.0 points on SVAMP and +7.67 points on ASDiv over baseline. The same architecture also transfers to QA and agent benchmarks with smaller positive effects and shows the same positive direction on a second checkpoint for the main arithmetic tasks. On arithmetic, the main empirical pattern is that the control architecture, rather than raw memory exposure, drives the improvements on SVAMP and ASDiv. Mechanistically, confidence separates helpful from harmful rule-bank interventions, and under fixed retrieval the repair-versus-corrupt difference localizes to rows whose retrieved set actually contains the edited entries.
Chinese Translation
注入提示的记忆可以在不更新模型权重的情况下改善推理,但这也带来了一个控制问题:检索到的内容只有在应用于正确状态时才有帮助。我们在严格的无训练设置中研究了这个问题,并将其表述为适用性控制:何时触发记忆辅助的第二次处理,何时信任它,以及如何随着时间的推移维护记忆库。我们的方法结合了基于不确定性的路由、基于信心的选择性接受、跨规则和示例记忆的银行选择,以及基于证据的记忆库治理。在一个锁定的无训练协议下,配合计算匹配的控制,方法在两个核心算术基准上分别比基线提高了+7.0分(SVAMP)和+7.67分(ASDiv)。相同的架构也迁移到问答(QA)和代理基准上,虽然效果较小,但显示出相同的正向趋势,并在主要算术任务的第二个检查点上表现出相同的积极方向。在算术任务中,主要的实证模式是控制架构,而不是原始记忆暴露,驱动了SVAMP和ASDiv的改进。在机制上,信心将有益的和有害的规则库干预区分开来,而在固定检索下,修复与损坏的差异局限于那些检索集实际包含编辑条目的行。
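The routing and acceptance steps named in the abstract above (when to trigger a memory-assisted second pass, and when to trust it) can be sketched as follows. This is an illustration under assumptions, not the paper's method: the thresholds `trigger_below` and `accept_margin`, the function name, and the use of a single scalar confidence are all hypothetical.

```python
def answer_with_memory(first_pass, second_pass, trigger_below=0.6, accept_margin=0.1):
    """Route a query through an optional memory-assisted second pass.

    first_pass() and second_pass() each return (answer, confidence in [0, 1]).
    second_pass is only called when the first pass is uncertain.
    """
    answer, conf = first_pass()
    if conf >= trigger_below:
        # Uncertainty-based routing: confident enough, do not consult memory.
        return answer, "first_pass"
    mem_answer, mem_conf = second_pass()
    if mem_conf >= conf + accept_margin:
        # Confidence-based selective acceptance of the memory-assisted answer.
        return mem_answer, "memory_accepted"
    return answer, "memory_rejected"

confident = answer_with_memory(lambda: ("12", 0.9), lambda: ("13", 0.99))
helped = answer_with_memory(lambda: ("12", 0.4), lambda: ("15", 0.8))
rejected = answer_with_memory(lambda: ("12", 0.4), lambda: ("15", 0.45))
```

Bank selection between rule and exemplar memory and the governance of the bank over time are left out; they would sit inside `second_pass` and around this function respectively.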
cs.AI / 114 / 2604.18210
TacticGen: Grounding Adaptable and Scalable Generation of Football Tactics
TacticGen:适应性和可扩展足球战术生成的基础
Abstract
Success in association football relies on both individual skill and coordinated tactics. While recent advancements in spatio-temporal data and deep learning have enabled predictive analyses like trajectory forecasting, the development of tactical design remains limited. Bridging this gap is essential, as prediction reveals what is likely to occur, whereas tactic generation determines what should occur to achieve strategic objectives. In this work, we present TacticGen, a generative model for adaptable and scalable tactic generation. TacticGen formulates tactics as sequences of multi-agent movements and interactions conditioned on the game context. It employs a multi-agent diffusion transformer with agent-wise self-attention and context-aware cross-attention to capture cooperative and competitive dynamics among players and the ball. Trained with over 3.3 million events and 100 million tracking frames from top-tier leagues, TacticGen achieves state-of-the-art precision in predicting player trajectories. Building on this, TacticGen enables adaptable tactic generation tailored to diverse inference-time objectives through a classifier guidance mechanism, with objectives specified via rules, natural language, or neural models. Its modeling performance is also inherently scalable. A case study with football experts confirms that TacticGen generates realistic, strategically valuable tactics, demonstrating its practical utility for tactical planning in professional football. The project page is available at: https://shengxu.net/TacticGen/.
Chinese Translation
在足球运动中,成功依赖于个人技能和协调战术。尽管近期在时空数据和深度学习方面的进展使得轨迹预测等预测分析成为可能,但战术设计的发展仍然有限。弥补这一差距至关重要,因为预测揭示了可能发生的情况,而战术生成则决定了为实现战略目标应当发生的情况。在本研究中,我们提出了TacticGen,一种适应性和可扩展的战术生成生成模型。TacticGen将战术表述为基于比赛上下文的多智能体运动和交互序列。它采用具有智能体自注意力和上下文感知交叉注意力的多智能体扩散变换器,以捕捉球员与球之间的合作与竞争动态。TacticGen在顶级联赛中使用超过330万个事件和1亿个跟踪帧进行训练,在预测球员轨迹方面达到了最先进的精度。在此基础上,TacticGen通过分类器引导机制实现了适应性战术生成,能够针对不同的推理时间目标进行定制,该机制可以通过规则、自然语言或神经模型进行指定。其建模性能也具有内在的可扩展性。与足球专家的案例研究确认,TacticGen生成了现实且具有战略价值的战术,展示了其在职业足球战术规划中的实际应用价值。项目页面可访问:https://shengxu.net/TacticGen/
cs.AI / 115 / 2604.18240
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
AJ-Bench:环境感知评估中的代理作为评判者的基准测试
Abstract
As reinforcement learning continues to scale the training of large language model-based agents, reliably verifying agent behaviors in complex environments has become increasingly challenging. Existing approaches rely on rule-based verifiers or LLM-as-a-Judge models, which struggle to generalize beyond narrow domains. Agent-as-a-Judge addresses this limitation by actively interacting with environments and tools to acquire verifiable evidence, yet its capabilities remain underexplored. We introduce AJ-Bench, a benchmark that systematically evaluates Agent-as-a-Judge across three domains (search, data systems, and graphical user interfaces), comprising 155 tasks and 516 annotated trajectories. The benchmark comprehensively assesses judge agents' abilities in information acquisition, state verification, and process verification. Experiments demonstrate consistent performance gains over LLM-as-a-Judge baselines, while also revealing substantial open challenges in agent-based verification. Our data and code are available at https://aj-bench.github.io/.
Chinese Translation
随着强化学习在大型语言模型基础的代理训练中的不断扩展,可靠地验证代理在复杂环境中的行为变得越来越具有挑战性。现有的方法依赖于基于规则的验证器或LLM-as-a-Judge模型,这些模型在狭窄领域之外的泛化能力较弱。代理作为评判者(Agent-as-a-Judge)通过主动与环境和工具互动来获取可验证的证据,从而解决了这一限制,但其能力仍未得到充分探索。我们引入了一个基准测试AJ-Bench,以系统地评估代理作为评判者在搜索、数据系统和图形用户界面三个领域的表现,包括155个任务和516条注释轨迹。该基准全面评估了评判代理在信息获取、状态验证和过程验证方面的能力。实验表明,相较于LLM-as-a-Judge基线,性能持续提升,同时也揭示了基于代理的验证中存在的重大开放挑战。我们的数据和代码可在https://aj-bench.github.io/获取。
cs.AI / 116 / 2604.18254
LeGo-Code: Can Modular Curriculum Learning Advance Complex Code Generation? Insights from Text-to-SQL
LeGo-Code:模块化课程学习能否推动复杂代码生成?来自文本到SQL的见解
Abstract
Recently, code-oriented large language models (LLMs) have demonstrated strong capabilities in translating natural language into executable code. Text-to-SQL is a significant application of this ability, enabling non-technical users to interact with relational databases using natural language. However, state-of-the-art models continue to struggle with highly complex logic, particularly deeply nested statements involving multiple joins and conditions, as well as with real-world database schemas that are noisy or poorly structured. In this paper, we investigate whether curriculum learning can improve the performance of code-based LLMs on Text-to-SQL tasks. Employing benchmarks including Spider and BIRD, we fine-tune models under different curriculum strategies. Our experiments show that naive curriculum, simply ordering training samples by complexity in a single epoch, fails to surpass standard fine-tuning due to catastrophic forgetting. To overcome this, we propose a Modular Adapter Composition (MAC) strategy. By sequentially training tier-specific adapters on incremental complexity levels (Easy to Extra-Hard), we create a scaffolded learning environment that improves performance on complex queries. Our approach not only produces measurable performance gains on the Spider and BIRD benchmarks but also provides a flexible, "Lego-like" architecture, allowing models to be composed and deployed based on specific schema difficulty requirements. These findings demonstrate that structured, modular learning is a superior alternative to monolithic fine-tuning for mastering the syntax and logic of complex code generation.
Chinese Translation
近年来,面向代码的大型语言模型(LLMs)在将自然语言翻译为可执行代码方面展现出了强大的能力。文本到SQL是这一能力的重要应用,使非技术用户能够使用自然语言与关系数据库进行交互。然而,最先进的模型在处理高度复杂的逻辑时仍然面临挑战,特别是涉及多个连接和条件的深度嵌套语句,以及在噪声或结构不良的真实数据库模式下的表现。在本文中,我们探讨了课程学习是否能够提高基于代码的LLMs在文本到SQL任务中的表现。通过使用Spider和BIRD等基准,我们在不同的课程策略下对模型进行微调。我们的实验表明,简单地按复杂性对训练样本进行排序的天真课程在单个周期内未能超越标准微调,原因在于灾难性遗忘。为了解决这个问题,我们提出了一种模块适配器组合(Modular Adapter Composition,MAC)策略。通过在逐步增加的复杂性水平(从简单到额外困难)上顺序训练特定层级的适配器,我们创建了一个支架式学习环境,从而提高了对复杂查询的性能。我们的方法不仅在Spider和BIRD基准上产生了可测量的性能提升,还提供了一种灵活的“乐高式”架构,使模型能够根据特定的模式难度要求进行组合和部署。这些发现表明,结构化的模块化学习是掌握复杂代码生成的语法和逻辑的优越替代方案,而非单一的微调。
cs.AI / 117 / 2604.18266
Enhancing Tabular Anomaly Detection via Pseudo-Label-Guided Generation
通过伪标签引导生成增强表格异常检测
Abstract
Identifying anomalous instances in tabular data is essential for improving data reliability and maintaining system stability. Due to the scarcity of ground-truth anomaly labels, existing methods mainly rely on unsupervised anomaly detection models, or exploit a small number of labeled anomalies to facilitate detection via sample generation or contrastive learning. However, unsupervised methods lack sufficient anomaly awareness, while current generation and contrastive approaches tend to compute anomalies globally, overlooking the localized anomaly patterns of tabular features, resulting in suboptimal detection performance. To address these limitations, we propose PLAG, a pseudo-label-guided anomaly generation method designed to enhance tabular anomaly detection. Specifically, by utilizing pseudo-anomalies as guidance signals and decoupling the overall anomaly quantification of a sample into an accumulation of feature-level abnormalities, PLAG not only effectively obviates the need for scarce ground-truth labels but also provides a novel perspective for the model to comprehend localized anomalous signals at a fine-grained level. Furthermore, a two-stage data selection strategy is proposed, integrating format verification and uncertainty estimation to rigorously filter candidate samples, thereby ensuring the fidelity and diversity of the synthetic anomalies. Ultimately, these filtered synthetic anomalies serve as robust discriminative guidance, empowering the model to better separate normal and anomalous instances. Extensive experiments demonstrate that PLAG achieves state-of-the-art performance against eight representative baselines. Moreover, as a flexible framework, it integrates seamlessly with existing unsupervised detectors, consistently boosting F1-scores by 0.08 to 0.21.
Chinese Translation
在表格数据中识别异常实例对于提高数据可靠性和维护系统稳定性至关重要。由于真实异常标签的稀缺,现有方法主要依赖无监督异常检测模型,或利用少量标记的异常通过样本生成或对比学习来促进检测。然而,无监督方法缺乏足够的异常意识,而当前的生成和对比方法往往从全局计算异常,忽视了表格特征的局部异常模式,导致检测性能不佳。为了解决这些局限性,我们提出了PLAG(伪标签引导异常生成)方法,旨在增强表格异常检测。具体而言,通过利用伪异常作为引导信号,并将样本的整体异常量化解耦为特征级异常的累积,PLAG不仅有效消除了对稀缺真实标签的需求,还为模型提供了一种新颖的视角,以细粒度水平理解局部异常信号。此外,我们提出了一种两阶段数据选择策略,结合格式验证和不确定性估计,严格筛选候选样本,从而确保合成异常的真实性和多样性。最终,这些经过筛选的合成异常作为强有力的区分性指导,使模型能够更好地区分正常和异常实例。大量实验表明,PLAG在八个代表性基准上实现了最先进的性能。此外,作为一个灵活的框架,它与现有的无监督检测器无缝集成,F1分数始终提高0.08到0.21。
cs.AI / 118 / 2604.18292
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World:扩展现实环境合成以促进通用智能体智能的发展
Abstract
Large language models are increasingly expected to serve as general-purpose agents that interact with external, stateful tool environments. The Model Context Protocol (MCP) and broader agent skills offer a unified interface for connecting agents with scalable real-world services, but training robust agents remains limited by the lack of realistic environments and principled mechanisms for life-long learning. In this paper, we present Agent-World, a self-evolving training arena for advancing general agent intelligence through scalable environments. Agent-World has two main components: (1) Agentic Environment-Task Discovery, which autonomously explores topic-aligned databases and executable tool ecosystems from thousands of real-world environment themes and synthesizes verifiable tasks with controllable difficulty; and (2) Continuous Self-Evolving Agent Training, which combines multi-environment reinforcement learning with a self-evolving agent arena that automatically identifies capability gaps through dynamic task synthesis and drives targeted learning, enabling the co-evolution of agent policies and environments. Across 23 challenging agent benchmarks, Agent-World-8B and 14B consistently outperform strong proprietary models and environment scaling baselines. Further analyses reveal scaling trends in relation to environment diversity and self-evolution rounds, offering insights for building general agent intelligence.
Chinese Translation
大型语言模型越来越被期望作为通用智能体,与外部状态工具环境进行交互。模型上下文协议(Model Context Protocol, MCP)和更广泛的智能体技能提供了一个统一的接口,用于将智能体与可扩展的现实世界服务连接起来,但训练稳健的智能体仍然受到缺乏现实环境和系统性终身学习机制的限制。在本文中,我们提出了Agent-World,这是一个自我进化的训练场,旨在通过可扩展环境推进通用智能体智能的发展。Agent-World有两个主要组成部分:(1)智能环境-任务发现,能够自主探索与主题对齐的数据库和可执行工具生态系统,基于数千个现实环境主题合成可验证的任务,并具有可控的难度;(2)持续自我进化的智能体训练,结合了多环境强化学习和一个自我进化的智能体竞技场,能够通过动态任务合成自动识别能力差距并推动有针对性的学习,从而实现智能体策略和环境的共同进化。在23个具有挑战性的智能体基准测试中,Agent-World-8B和14B始终优于强大的专有模型和环境扩展基准。进一步的分析揭示了与环境多样性和自我进化轮次相关的扩展趋势,为构建通用智能体智能提供了见解。
cs.AI / 119 / 2604.18302
Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support
迈向零数据外泄的精神卫生人工智能:用于隐私保护的心理健康决策支持的设备端大型语言模型部署
Abstract
Privacy represents one of the most critical yet underaddressed barriers to AI adoption in mental healthcare -- particularly in high-sensitivity operational environments such as military, correctional, and remote healthcare settings, where the risk of patient data exposure can deter help-seeking behavior entirely. Existing AI-enabled psychiatric decision support systems predominantly rely on cloud-based inference pipelines, requiring sensitive patient data to leave the device and traverse external servers, creating unacceptable privacy and security risks in these contexts. In this paper, we propose a zero-egress, on-device AI platform for privacy-preserving psychiatric decision support, deployed as a cross-platform mobile application. The proposed system extends our prior work on fine-tuned LLM consortiums for psychiatric diagnosis standardization by fundamentally re-architecting the inference pipeline for fully local execution -- ensuring that no patient data is transmitted to, processed by, or stored on any external server at any stage. The platform integrates a consortium of three lightweight, fine-tuned, and quantized open-source LLMs -- Gemma, Phi-3.5-mini, and Qwen2 -- selected for their compact architectures and proven efficiency on resource-constrained mobile hardware. An on-device orchestration layer coordinates ensemble inference and consensus-based diagnostic reasoning, producing DSM-5-aligned assessments for conditions. The platform is designed to assist clinicians with differential diagnosis and evidence-linked symptom mapping, as well as to support patient-facing self-screening with appropriate clinical safeguards. Initial evaluation demonstrates that the proposed zero-egress deployment achieves diagnostic accuracy comparable to its server-side predecessor while sustaining real-time inference latency on commodity mobile hardware.
Chinese Translation
隐私是人工智能在精神卫生保健中采用的最关键但未得到充分解决的障碍之一,尤其是在军事、监狱和远程医疗等高敏感性操作环境中,患者数据暴露的风险可能完全阻碍寻求帮助的行为。现有的人工智能辅助精神决策支持系统主要依赖于基于云的推理管道,这要求敏感的患者数据离开设备并经过外部服务器,从而在这些环境中产生不可接受的隐私和安全风险。本文提出了一种零数据外泄的设备端人工智能平台,用于隐私保护的精神决策支持,作为跨平台移动应用程序进行部署。该系统在我们之前关于精神诊断标准化的微调大型语言模型(LLM)联盟的基础上,通过根本性地重新构建推理管道以实现完全本地执行,确保在任何阶段都不将患者数据传输、处理或存储在任何外部服务器上。该平台整合了三个轻量级、微调和量化的开源LLM——Gemma、Phi-3.5-mini和Qwen2,因其紧凑的架构和在资源受限的移动硬件上的高效性而被选中。设备端的调度层协调集成推理和基于共识的诊断推理,生成与《精神障碍诊断与统计手册》第五版(DSM-5)一致的评估结果。该平台旨在帮助临床医生进行鉴别诊断和基于证据的症状映射,同时支持患者自我筛查,配备适当的临床保障。初步评估表明,所提出的零数据外泄部署在诊断准确性上与其服务器端前身相当,同时在普通移动硬件上保持实时推理延迟。
cs.AI / 120 / 2604.18327
PARM: Pipeline-Adapted Reward Model
PARM:管道适应性奖励模型
Abstract
Reward models (RMs) are central to aligning large language models (LLMs) with human preferences, powering RLHF and advanced decoding strategies. While most prior work focuses on single-step generation, real-world applications increasingly adopt multi-stage LLM pipelines, where effective reward guidance remains underexplored. We investigate this through code generation for combinatorial optimization, constructing a pipeline that integrates reward models into both formulation and solution stages. We identify a critical challenge: inconsistency between reward model predictions and actual pipeline execution outcomes. To address this, we propose the Pipeline-Adapted Reward Model (PARM), which leverages pipeline-specific data and direct preference optimization to align rewards with downstream feedback. We instantiate PARM as a two-stage pipeline (formulation -> code generation) and evaluate it on four public optimization benchmarks, measuring execution rate and solving accuracy against baselines and sampling methods. A supplementary cross-domain experiment on GSM8K assesses transferability. Results demonstrate that PARM consistently improves pipeline output quality and stability, providing new insights into reward modeling for multi-stage LLM reasoning.
Chinese Translation
奖励模型(RMs)在将大型语言模型(LLMs)与人类偏好对齐方面至关重要,推动了强化学习从人类反馈(RLHF)和先进解码策略的发展。尽管大多数先前的研究集中于单步生成,但现实世界应用越来越多地采用多阶段LLM管道,其中有效的奖励指导仍然未得到充分探索。我们通过组合优化的代码生成来研究这一问题,构建了一个将奖励模型整合到公式制定和解决阶段的管道。我们识别出一个关键挑战:奖励模型预测与实际管道执行结果之间的不一致性。为了解决这个问题,我们提出了管道适应性奖励模型(PARM),该模型利用管道特定数据和直接偏好优化来使奖励与下游反馈对齐。我们将PARM实例化为一个两阶段管道(公式制定 -> 代码生成),并在四个公共优化基准上进行评估,测量执行率和解决准确性与基线和采样方法的对比。一个补充的跨领域实验在GSM8K上评估了可迁移性。结果表明,PARM始终提高了管道输出质量和稳定性,为多阶段LLM推理的奖励建模提供了新的见解。
cs.AI / 121 / 2604.18344
One Pass for All: A Discrete Diffusion Model for Knowledge Graph Triple Set Prediction
一遍完成所有:一种用于知识图谱三元组集预测的离散扩散模型
Abstract
Knowledge Graphs (KGs) are composed of triples, and the goal of Knowledge Graph Completion (KGC) is to infer the missing factual triples. Traditional KGC tasks predict missing elements in a triple given one or two of its elements. As a more realistic task, the Triple Set Prediction (TSP) task aims to infer the set of missing triples conditioned only on the observed knowledge graph, without assuming any partial information about the missing triples. Existing TSP methods predict the set of missing triples in a triple-by-triple manner, falling short in capturing the dependencies among the predicted triples to ensure consistency. To address this issue, we propose a novel discrete diffusion model termed DiffTSP that treats TSP as a generative task. DiffTSP progressively adds noise to the KG through a discrete diffusion process, achieved by masking relational edges. The reverse process then gradually recovers the complete KG conditioned on the incomplete graph. To this end, we design a structure-aware denoising network that integrates a relational context encoder with a relational graph diffusion transformer for knowledge graph generation. DiffTSP can generate the complete set of triples in a one-pass manner while ensuring the dependencies among the predicted triples. Our approach achieves state-of-the-art performance on three public datasets. Code: https://github.com/ADMIS-TONGJI/DiffTSP.
Chinese Translation
知识图谱(KG)由三元组组成,知识图谱补全(KGC)的目标是推断缺失的事实三元组。传统的KGC任务在给定三元组的一个或两个元素的情况下预测缺失元素。作为一个更为现实的任务,三元组集预测(TSP)任务旨在仅基于观察到的知识图谱推断缺失三元组的集合,而不假设关于缺失三元组的任何部分信息。现有的TSP方法以逐个三元组的方式预测缺失三元组的集合,未能有效捕捉预测三元组之间的依赖关系以确保一致性。为了解决这个问题,我们提出了一种新颖的离散扩散模型,称为DiffTSP,将TSP视为生成任务。DiffTSP通过离散扩散过程逐步向KG添加噪声,该过程通过掩蔽关系边实现。然后,反向过程逐渐恢复基于不完整图的完整KG。为此,我们设计了一个结构感知去噪网络,该网络将关系上下文编码器与关系图扩散变换器结合用于知识图谱生成。DiffTSP能够以一遍的方式生成完整的三元组集合,同时确保预测三元组之间的依赖关系。我们的方法在三个公共数据集上实现了最先进的性能。代码:https://github.com/ADMIS-TONGJI/DiffTSP。
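To make the forward (noising) process in the DiffTSP abstract above concrete, here is a small sketch of masking relational edges in a triple list. The linear schedule `t / num_steps`, the `MASK` token, and the function name are assumptions for illustration; the abstract does not specify the noise schedule, and the learned reverse (denoising) network is not shown.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked relation

def mask_edges(triples, t, num_steps, rng):
    """Forward step at time t: mask each relation independently with probability t / num_steps."""
    p = t / num_steps
    return [
        (head, MASK if rng.random() < p else relation, tail)
        for head, relation, tail in triples
    ]

kg = [("Paris", "capital_of", "France"), ("France", "member_of", "EU")]
rng = random.Random(0)
clean = mask_edges(kg, t=0, num_steps=10, rng=rng)         # no noise at t = 0
fully_masked = mask_edges(kg, t=10, num_steps=10, rng=rng)  # every relation masked at t = num_steps
```

A denoiser trained against such corrupted graphs would then predict the masked relations jointly, which is where the one-pass, dependency-aware generation described in the abstract comes in.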
cs.AI / 122 / 2604.18364
Training and Agentic Inference Strategies for LLM-based Manim Animation Generation
基于大型语言模型的Manim动画生成的训练与自主推理策略
Abstract
Generating programmatic animation using libraries such as Manim presents unique challenges for Large Language Models (LLMs), requiring spatial reasoning, temporal sequencing, and familiarity with domain-specific APIs that are underrepresented in general pre-training data. A systematic study of how training and inference strategies interact in this setting is lacking in current research. This study introduces ManimTrainer, a training pipeline that combines Supervised Fine-tuning (SFT) with Reinforcement Learning (RL) based Group Relative Policy Optimisation (GRPO) using a unified reward signal that fuses code and visual assessment signals, and ManimAgent, an inference pipeline featuring Renderer-in-the-loop (RITL) and API documentation-augmented RITL (RITL-DOC) strategies. Using these techniques, this study presents the first unified training and inference study for text-to-code-to-video transformation with Manim. It evaluates 17 open-source sub-30B LLMs across nine combinations of training and inference strategies using ManimBench. Results show that SFT generally improves code quality, while GRPO enhances visual outputs and increases the models' responsiveness to extrinsic signals during self-correction at inference time. The Qwen 3 Coder 30B model with GRPO and RITL-DOC achieved the highest overall performance, with a 94% Render Success Rate (RSR) and 85.7% Visual Similarity (VS) to reference videos, surpassing the baseline GPT-4.1 model by +3 percentage points in VS. Additionally, the analysis shows that the correlation between code and visual metrics strengthens with SFT and GRPO but weakens with inference-time enhancements, highlighting the complementary roles of training and agentic inference strategies in Manim animation generation.
Chinese Translation
使用Manim等库生成程序化动画对大型语言模型(LLMs)提出了独特的挑战,这需要空间推理、时间序列安排以及对领域特定API的熟悉,而这些在一般的预训练数据中往往不足。目前的研究缺乏对训练和推理策略在这一环境中相互作用的系统性研究。本研究介绍了ManimTrainer,一个将监督微调(Supervised Fine-tuning, SFT)与基于强化学习(Reinforcement Learning, RL)的群体相对策略优化(Group Relative Policy Optimisation, GRPO)相结合的训练管道,使用统一的奖励信号融合代码和视觉评估信号;以及ManimAgent,一个推理管道,采用循环渲染(Renderer-in-the-loop, RITL)和增强API文档的循环渲染(RITL-DOC)策略。通过这些技术,本研究首次呈现了Manim在文本到代码再到视频转换中的统一训练与推理研究。使用ManimBench评估了17个开源的子30B LLM,在九种训练和推理策略组合下的表现。结果表明,SFT通常提高了代码质量,而GRPO增强了视觉输出并提高了模型在推理时自我纠正时对外部信号的响应能力。采用GRPO和RITL-DOC的Qwen 3 Coder 30B模型实现了最高的整体性能,渲染成功率(Render Success Rate, RSR)达到94%,与参考视频的视觉相似度(Visual Similarity, VS)为85.7%,在视觉相似度上超过基线模型GPT-4.1达3个百分点。此外,分析显示,随着SFT和GRPO的应用,代码与视觉指标之间的相关性增强,但在推理时增强的情况下则减弱,突显了训练与自主推理策略在Manim动画生成中的互补作用。
cs.AI / 123 / 2604.18380
The implicated scientist: on the role of AI researchers in the development of weapons systems
受牵连的科学家:人工智能研究者在武器系统发展中的角色
Abstract
Artificial intelligence (AI) technologies are increasingly used in modern weapons systems. Notably, these systems have recently been involved in mass killings and destruction at scale. Furthermore, there is currently a strong interest and competition among powerful players to accelerate the proliferation of weapons with automated or AI-based components, a phenomenon known as AI arms race. This competition poses a risk of causing even more deaths and devastation in the future, as well as increased power and wealth inequality. In this work, we aim to shed light on the role of AI researchers as implicated subjects in the harms caused by weapons enabled by AI technologies. We investigate and discuss the specifics of this implication and explore ways to transfigure this position of implication into one of differentiated, long-distance solidarity with the victims of technologically fortified injustices.
Chinese Translation
人工智能(AI)技术在现代武器系统中的应用日益广泛。值得注意的是,这些系统最近在大规模的杀戮和破坏中发挥了作用。此外,目前在强大参与者之间存在着强烈的兴趣和竞争,以加速具有自动化或基于AI组件的武器的扩散,这一现象被称为AI军备竞赛。这种竞争可能导致未来更多的死亡和破坏,以及日益严重的权力和财富不平等。在本研究中,我们旨在阐明AI研究者作为受牵连主体在由AI技术所导致的武器造成的伤害中的角色。我们探讨并讨论这种牵连的具体情况,并探索将这种牵连的立场转变为与技术强化的不公正受害者之间的差异化、远程团结的方式。
cs.AI / 124 / 2604.18381
Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
从少量数据中学习:在低数据和计算环境下评估RLVR的有效性
Abstract
Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or questions with well-defined ground truth answers in the case of Reinforcement Learning with Verifiable Rewards (RLVR). While previous work has explored the benefits to model reasoning capabilities by scaling both data and compute used for RLVR, these results lack applicability in many real-world settings where annotated data and accessible compute may be scarce. In this work, we present a comprehensive empirical study of open-source Small Language Model (SLM) performance after RLVR in low data regimes. Across three novel datasets covering number counting problems, graph reasoning, and spatial reasoning, we characterize how model performance scales with dataset size, diversity, and complexity. We demonstrate that (1) procedural datasets allow for fine-grained evaluation and training dataset development with controllable properties (size, diversity, and complexity), (2) under RLVR, models trained on lower complexity tasks can generalize to higher complexity tasks, and (3) training on mixed complexity datasets is associated with the greatest benefits in low data regimes, providing up to 5x sample efficiency versus training on easy tasks. These findings inspire future work on the development of data scaling laws for RLVR and the use of procedural data generators to further understand effective data development for efficient LLM fine-tuning.
Chinese Translation
微调大型语言模型(LLMs)通常依赖于大量高质量的标注数据,或者在可验证奖励的强化学习(RLVR)情况下,依赖于具有明确真实答案的问题。尽管之前的研究探讨了通过扩大用于RLVR的数据和计算量对模型推理能力的益处,但这些结果在许多现实世界场景中缺乏适用性,因为标注数据和可用计算资源可能稀缺。在本研究中,我们对在低数据环境下,开源小型语言模型(SLM)在RLVR后的表现进行了全面的实证研究。通过三个涵盖数字计数问题、图推理和空间推理的新数据集,我们描述了模型性能如何随着数据集的大小、多样性和复杂性而变化。我们展示了(1)程序性数据集允许进行细致的评估和具有可控属性(大小、多样性和复杂性)的训练数据集开发,(2)在RLVR下,训练于较低复杂性任务的模型能够推广到更高复杂性任务,以及(3)在混合复杂性数据集上训练与在低数据环境中获得最大益处相关,提供了比在简单任务上训练高达5倍的样本效率。这些发现激励未来在RLVR的数据扩展法则开发和使用程序性数据生成器方面的研究,以进一步理解有效的数据开发以实现高效的LLM微调。
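For readers unfamiliar with RLVR, the kind of verifiable reward the abstract above relies on can be as simple as an exact-match check against a known answer. The sketch below assumes a number counting task and a convention that the final integer in the completion is the model's answer; both the function name and that parsing convention are hypothetical and not taken from the paper.

```python
import re

def counting_reward(completion: str, ground_truth: int) -> float:
    """Return 1.0 if the last integer in the completion equals the ground truth, else 0.0."""
    numbers = re.findall(r"-?\d+", completion)
    if not numbers:
        return 0.0  # no parseable answer, no reward
    return 1.0 if int(numbers[-1]) == ground_truth else 0.0

correct = counting_reward("There are 3 apples and 4 pears, so 7 in total.", 7)
wrong = counting_reward("I think the answer is 8", 7)
missing = counting_reward("no idea", 7)
```

Because procedurally generated tasks come with ground truth by construction, a checker like this scales to any dataset size, diversity, or complexity setting without manual annotation.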
cs.AI / 125 / 2604.18404
Six Llamas: Comparative Religious Ethics Through LoRA-Adapted Language Models
六只美洲驼:通过LoRA适应的语言模型进行比较宗教伦理学研究
Abstract
We present Six Llamas, a comparative study examining whether large language models fine-tuned on distinct religious corpora encode systematically different patterns of ethical reasoning. Six variants of Meta-Llama-3.1-8B are constructed: one unmodified control and five LoRA-adapted models trained exclusively on the sacred and theological texts of Christianity, Islam, Judaism, Hinduism, or Buddhism. All six models are probed with an identical battery of 17 standardized ethical prompts spanning moral dilemmas, game-theoretic scenarios, public policy questions, and moral-psychological self-assessments. To assess robustness and reproducibility, we implement a multi-temperature sampling design spanning ten temperature settings. We compute response consistency metrics, pairwise inter-model agreement rates, temperature sensitivity coefficients across four prompt domains, and run-to-run stability analyses. Findings show that (a) LoRA-adapted models produce ethical reasoning patterns that are systematically differentiated from the base model, (b) these patterns are consistent with the moral logics of their training traditions, (c) they are structured along interpretable dimensions in moral-philosophical space, (d) core ethical positions remain stable across temperature variations for high-consensus dilemmas, with the Trolley Problem achieving 100% consistency across all models and temperatures, (e) tradition-specific divergence intensifies at higher temperatures in morally contested domains, and (f) the base model exhibits the highest overall response consistency (mean 88.3%), suggesting LoRA adaptation introduces both tradition-specific signal and increased sampling sensitivity. The study offers a proof-of-concept for the condensate comparative method using differentially trained language models as instruments for cultural and ethical analysis and identifies specific criteria for falsification and planned extensions.
Chinese Translation
我们提出了六只美洲驼,这是一个比较研究,旨在考察在不同宗教语料库上微调的大型语言模型是否编码了系统性不同的伦理推理模式。构建了六个Meta-Llama-3.1-8B的变体:一个未修改的对照模型和五个仅在基督教、伊斯兰教、犹太教、印度教或佛教的神圣和神学文本上训练的LoRA适应模型。所有六个模型都使用相同的17个标准化伦理提示进行探测,这些提示涵盖了道德困境、博弈论场景、公共政策问题和道德心理自我评估。为了评估模型的稳健性和可重复性,我们实施了一个多温度采样设计,涵盖十个温度设置。我们计算了响应一致性指标、模型间配对一致率、四个提示领域的温度敏感系数,以及运行间稳定性分析。研究结果表明,LoRA适应模型产生的伦理推理模式具有以下特点:(a) 与基础模型系统性区分,(b) 与其训练传统的道德逻辑一致,(c) 在道德哲学空间中沿可解释维度结构化,(d) 对于高共识的困境,核心伦理立场在温度变化中保持稳定。电车难题在所有模型和温度下实现了100%的一致性,而(e) 在道德争议领域,特定传统的分歧在较高温度下加剧,(f) 基础模型表现出最高的整体响应一致性(平均88.3%),这表明LoRA适应引入了特定传统的信号和增强的采样敏感性。本研究为使用不同训练语言模型作为文化和伦理分析工具的浓缩比较方法提供了概念验证,并确定了特定的反驳标准和计划扩展。
cs.AI / 126 / 2604.18463
Using large language models for embodied planning introduces systematic safety risks
使用大型语言模型进行具身规划引入系统性安全风险
Abstract
Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 open-source models from 3B to 671B parameters, planning ability improves substantially with scale (0.4-99.3%) while safety awareness remains relatively flat (38-57%). We identify a multiplicative relationship between these two capacities, showing that larger models complete more tasks safely primarily through improved planning, not through better danger avoidance. Three proprietary reasoning models reach notably higher safety awareness (71-81%), while non-reasoning proprietary models and open-source reasoning models remain below 57%. As planning ability approaches saturation for frontier models, improving safety awareness becomes a central challenge for deploying language-model planners in robotic systems.
Chinese Translation
大型语言模型越来越多地被用作机器人系统的规划者,但它们的规划安全性仍然是一个悬而未决的问题。为了系统性地评估安全规划,我们引入了DESPITE,一个涵盖12,279个任务的基准,涉及物理和规范性危险,并进行完全确定性的验证。在23个模型中,即使是近乎完美的规划能力也无法确保安全:最佳规划模型在仅0.4%的任务上未能生成有效的计划,但在28.3%的任务上生成了危险的计划。在18个参数范围从3B到671B的开源模型中,规划能力随着规模的增加而显著提高(0.4-99.3%),而安全意识则相对平稳(38-57%)。我们识别出这两种能力之间的乘法关系,表明更大的模型主要通过改善规划来安全地完成更多任务,而不是通过更好的危险规避。三种专有推理模型的安全意识显著更高(71-81%),而非推理专有模型和开源推理模型的安全意识则低于57%。随着前沿模型的规划能力接近饱和,提高安全意识成为在机器人系统中部署语言模型规划者的核心挑战。
cs.AI / 127 / 2604.18469
A Generalized Synthetic Control Method for Baseline Estimation in Demand Response Services
一种用于需求响应服务基线估计的广义合成控制方法
Abstract
Baseline estimation is critical to Demand Response (DR) settlement in electricity markets, yet existing machine learning methods remain limited in predictive performance, while methodologies from causal inference and counterfactual prediction are still underutilized in this domain. We introduce a Generalized Synthetic Control Method that builds on the classical Synthetic Control Method (SCM) from econometrics. While SCM provides a powerful framework for counterfactual estimation, classical SCM remains a static estimator: it fits the treated unit as a combination of contemporaneous donor units and therefore ignores predictable temporal structure in the residual error. We develop a generalized SCM framework that transforms baseline estimation into a dynamic counterfactual prediction problem by augmenting the donor representation with exogenous features, lagged treated load, and selected lagged donor signals. This enriched representation allows the estimator to capture autoregressive dependence, delayed donor-response patterns, and error-correction effects beyond the scope of standard SCM. The framework further accommodates nonlinear predictors when linear weighting is inadequate, with the greatest benefit arising in limited-data settings. Experiments on the Ausgrid smart-meter dataset show consistent improvements over classical SCM and strong benchmark methods, with the dominant performance gains driven by dynamic augmentation.
Chinese Translation
基线估计对于电力市场中的需求响应(DR)结算至关重要,但现有的机器学习方法在预测性能上仍然有限,而因果推断和反事实预测的方法在这一领域仍未得到充分利用。我们提出了一种广义合成控制方法,该方法基于经济学中的经典合成控制方法(SCM)。虽然SCM为反事实估计提供了强大的框架,但经典SCM仍然是一个静态估计器:它将处理单元拟合为同时期捐赠单元的组合,因此忽略了残差误差中的可预测时间结构。我们开发了一个广义SCM框架,通过用外生特征、滞后处理负荷和选定的滞后捐赠信号增强捐赠表示,将基线估计转化为动态反事实预测问题。这种丰富的表示使得估计器能够捕捉自回归依赖、延迟的捐赠响应模式以及超出标准SCM范围的误差修正效应。当线性加权不足时,该框架进一步适应非线性预测因子,尤其在数据有限的情况下,能够获得最大的收益。在Ausgrid智能电表数据集上的实验显示,与经典SCM和强基准方法相比,表现出一致的改进,动态增强驱动了主要的性能提升。
cs.AI / 128 / 2604.18478
WorldDB: A Vector Graph-of-Worlds Memory Engine with Ontology-Aware Write-Time Reconciliation
WorldDB:一种具有本体感知写时协调的世界向量图内存引擎
Abstract
Persistent memory is the bottleneck separating stateless chatbots from long-running agentic systems. Retrieval-augmented generation (RAG) over flat vector stores fragments facts into chunks, loses cross-session identity, and has no first-class notion of supersession or contradiction. Recent bitemporal knowledge-graph systems (Graphiti, Memento, Hydra DB) add typed edges and valid-time metadata, but the graph itself remains flat: no recursive composition, no content-addressed invariants on nodes, and edge types carry no behavior beyond a label. We present WorldDB, a memory engine built on three commitments: (i) every node is a world -- a container with its own interior subgraph, ontology scope, and composed embedding, recursive to arbitrary depth; (ii) nodes are content-addressed and immutable, so any edit produces a new hash at the node and every ancestor, giving a Merkle-style audit trail for free; (iii) edges are write-time programs -- each edge type ships on_insert/on_delete/on_query_rewrite handlers (supersession closes validity, contradicts preserves both sides, same_as stages a merge proposal), so no raw append path exists. On LongMemEval-s (500 questions, ~115k-token conversational stacks), WorldDB with Claude Opus 4.7 as answerer achieves 96.40% overall / 97.11% task-averaged accuracy, a +5.61pp improvement over the previously reported Hydra DB state-of-the-art (90.79%) and +11.20pp over Supermemory (85.20%), with perfect single-session-assistant recall and robust performance on temporal reasoning (96.24%), knowledge update (98.72%), and preference synthesis (96.67%). Ablations show that the engine's graph layer -- resolver-unified entities and typed refers_to edges -- contributes +7.0pp task-averaged independently of the underlying answerer.
Chinese Translation
持久内存是将无状态聊天机器人与长时间运行的智能系统分开的瓶颈。基于平面向量存储的检索增强生成(RAG)将事实碎片化为块,失去了跨会话的身份,并且没有超越或矛盾的第一类概念。最近的双时态知识图系统(Graphiti、Memento、Hydra DB)增加了类型化边和有效时间元数据,但图本身仍然是平面的:没有递归组合,节点上没有内容寻址的不变性,边类型除了标签外没有其他行为。我们提出了WorldDB,这是一种基于三个承诺构建的内存引擎:(i)每个节点是一个世界——一个具有自己内部子图、本体范围和组合嵌入的容器,递归到任意深度;(ii)节点是内容寻址和不可变的,因此任何编辑都会在节点及其每个祖先处生成一个新的哈希,提供免费的Merkle风格审计追踪;(iii)边是写时程序——每种边类型都附带on_insert/on_delete/on_query_rewrite处理程序(超越关闭有效性,矛盾保留双方,same_as阶段合并提案),因此不存在原始附加路径。在LongMemEval-s(500个问题,约115k-token对话堆栈)上,使用Claude Opus 4.7作为回答者的WorldDB实现了96.40%的整体准确率/97.11%的任务平均准确率,比之前报告的Hydra DB最先进水平(90.79%)提高了5.61个百分点,比Supermemory(85.20%)提高了11.20个百分点,具有完美的单会话助手回忆和在时间推理(96.24%)、知识更新(98.72%)和偏好合成(96.67%)方面的稳健表现。消融实验表明,系统的图层——解析器统一实体和类型化的refers_to边——独立于底层回答者贡献了7.0个百分点的任务平均提升。
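The content-addressing commitment in the WorldDB abstract (an edit yields a new hash at the node and at every ancestor) can be illustrated with a small Python sketch. This is not WorldDB's implementation: the `World` class, the `replace_at` helper, and the SHA-256-over-JSON hashing scheme are all assumptions made for illustration, and embeddings, ontology scopes, and edge handlers are left out.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class World:
    """Hypothetical immutable node: identity is a hash of content plus child hashes."""
    content: str
    children: Tuple["World", ...] = ()

    @property
    def digest(self) -> str:
        # Hash the node's own content together with the digests of its children,
        # so a change anywhere below propagates to this node's digest.
        payload = json.dumps(
            {"content": self.content, "children": [c.digest for c in self.children]},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def replace_at(root: World, path: Tuple[int, ...], new_content: str) -> World:
    """Return a new tree with the node at `path` edited; untouched subtrees are reused."""
    if not path:
        return World(new_content, root.children)
    head, rest = path[0], path[1:]
    children = list(root.children)
    children[head] = replace_at(children[head], rest, new_content)
    return World(root.content, tuple(children))


leaf_a = World("alice lives in Paris")
leaf_b = World("bob likes tea")
session = World("session-1", (leaf_a, leaf_b))
root = World("user-memory", (session,))

# Editing one leaf produces a new root; the old root remains valid and unchanged.
edited = replace_at(root, (0, 0), "alice lives in Berlin")
```

Because the old root is never mutated, keeping the sequence of root digests gives an audit trail in the Merkle style the abstract describes, while unchanged subtrees such as `leaf_b` are shared between versions.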
cs.AI / 129 / 2604.18519
LLM Safety From Within: Detecting Harmful Content with Internal Representations
从内部保障大型语言模型的安全性:利用内部表征检测有害内容
Abstract
Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers. We present SIREN, a lightweight guard model that harnesses these internal features. By identifying safety neurons via linear probing and combining them through an adaptive layer-weighted strategy, SIREN builds a harmfulness detector from LLM internals without modifying the underlying model. Our comprehensive evaluation shows that SIREN substantially outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250 times fewer trainable parameters. Moreover, SIREN exhibits superior generalization to unseen benchmarks, naturally enables real-time streaming detection, and significantly improves inference efficiency compared to generative guard models. Overall, our results highlight LLM internal states as a promising foundation for practical, high-performance harmfulness detection.
Chinese Translation
守护模型广泛用于检测用户提示和大型语言模型(LLM)响应中的有害内容。然而,最先进的守护模型仅依赖于终层表征,忽视了分布在内部层中的丰富安全相关特征。我们提出了 SIREN,这是一种轻量级的守护模型,利用这些内部特征。通过线性探测识别安全神经元,并通过自适应层加权策略将其结合,SIREN 在不修改基础模型的情况下,从 LLM 内部构建有害性检测器。我们的全面评估表明,SIREN 在多个基准测试中显著优于最先进的开源守护模型,同时使用的可训练参数少了 250 倍。此外,SIREN 对未见基准的泛化能力更强,自然支持实时流检测,并显著提高了推理效率,相较于生成式守护模型。总体而言,我们的结果强调了 LLM 内部状态作为实用高性能有害性检测的有希望基础。
cs.AI / 130 / 2604.18530
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
OGER:一种针对混合强化学习的稳健离线引导探索奖励
Abstract
Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have significantly improved Large Language Model (LLM) reasoning, yet models often struggle to explore novel trajectories beyond their initial latent space. While offline teacher guidance and entropy-driven strategies have been proposed to address this, they often lack deep integration or are constrained by the model's inherent capacity. In this paper, we propose OGER, a novel framework that unifies offline teacher guidance and online reinforcement learning through a specialized reward modeling lens. OGER employs multi-teacher collaborative training and constructs an auxiliary exploration reward that leverages both offline trajectories and the model's own entropy to incentivize autonomous exploration. Extensive experiments across mathematical and general reasoning benchmarks demonstrate that OGER significantly outperforms competitive baselines, achieving substantial gains in mathematical reasoning while maintaining robust generalization to out-of-domain tasks. We provide a comprehensive analysis of training dynamics and conduct detailed ablation studies to validate the effectiveness of our entropy-aware reward modulation. Our code is available at https://github.com/ecoli-hit/OGER.git.
Chinese Translation
近年来,具有可验证奖励的强化学习(RLVR)的进展显著提升了大型语言模型(LLM)的推理能力,但模型在探索超出其初始潜在空间的新轨迹时常常面临困难。虽然已有提出离线教师引导和基于熵的策略来解决这一问题,但这些方法往往缺乏深度整合,或受到模型固有能力的限制。本文提出了OGER,一个通过专门的奖励建模视角统一离线教师引导和在线强化学习的新框架。OGER采用多教师协作训练,并构建了一种辅助探索奖励,该奖励利用离线轨迹和模型自身的熵来激励自主探索。在数学和一般推理基准上的大量实验表明,OGER显著优于竞争基线,在数学推理上取得了显著提升,同时保持了对域外任务的强健泛化。我们提供了训练动态的全面分析,并进行了详细的消融研究,以验证我们基于熵的奖励调节的有效性。我们的代码可在 https://github.com/ecoli-hit/OGER.git 获取。
cs.AI / 131 / 2604.18543
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
ClawEnvKit:用于爪状代理的自动环境生成
Abstract
Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.
Chinese Translation
为爪状代理构建训练和评估环境仍然是一个手动且人力密集的过程,无法扩展。我们认为,所需的不仅仅是一个数据集,而是一个能够按需生成多样化、经过验证的环境的自动化管道。为此,我们引入了ClawEnvKit,一个从自然语言描述中实例化这一形式的自主生成管道。该管道包括三个模块:(1)一个解析器,从自然语言输入中提取结构化生成参数;(2)一个生成器,生成任务规范、工具接口和评分配置;(3)一个验证器,确保生成环境的可行性、多样性、结构有效性和内部一致性。使用ClawEnvKit,我们构建了Auto-ClawEval,这是第一个大规模爪状代理基准,包含1,040个环境,分为24个类别。从经验上看,Auto-ClawEval在连贯性和清晰度上与人工策划的环境相匹配或超越,成本降低了13,800倍。在4个模型系列和8个代理框架中进行评估,我们发现,框架工程使性能比基础的ReAct模型提高了最多15.7个百分点,完成度仍然是主要的变化轴,没有模型在基准上达到饱和,自动生成使得以前不可行的规模评估成为可能。除了静态基准测试,ClawEnvKit还支持实时评估:用户用自然语言描述所需的能力,并按需获得经过验证的环境,将评估转变为一个持续的、用户驱动的过程。同样的机制也作为按需训练环境生成器,生成适应代理当前弱点的任务分布,而不是受限于现有用户日志。
cs.AI / 132 / 2604.18566
Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion
系统动力学人工智能助手的基准测试:云端与本地大型语言模型在因果环图提取与讨论中的比较
Abstract
We present a systematic evaluation of large language model families, spanning both proprietary cloud APIs and locally hosted open-source models, on two purpose-built benchmarks for System Dynamics AI assistance: the CLD Leaderboard (53 tests, structured causal loop diagram extraction) and the Discussion Leaderboard (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77-89% overall pass rates; the best local model reaches 77% (Kimi K2.5 GGUF Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50-100% on model building steps and 47-75% on feedback explanation, but only 0-50% on error fixing, a category dominated by long-context prompts that expose memory limits in local deployments. A central contribution of this paper is a systematic analysis of model type effects on performance: we compare reasoning vs. instruction-tuned architectures, GGUF (llama.cpp) vs. MLX (mlx_lm) backends, and quantization levels (Q3 / Q4_K_M / MLX-3bit / MLX-4bit / MLX-6bit) across the same underlying model families. We find that backend choice has larger practical impact than quantization level: mlx_lm does not enforce JSON schema constraints, requiring explicit prompt-level JSON instructions, while llama.cpp grammar-constrained sampling handles JSON reliably but causes indefinite generation on long-context prompts for dense models. We document the full parameter sweep ($t$, $p$, $k$) for all local models, cleaned timing data (stuck requests excluded), and a practitioner guide for running 671B-123B parameter models on Apple Silicon.
Chinese Translation
我们对大型语言模型家族进行了系统评估,涵盖了专有云API和本地托管的开源模型,基于两个专门构建的基准测试用于系统动力学人工智能辅助:CLD排行榜(53个测试,结构化因果环图提取)和讨论排行榜(互动模型讨论、反馈解释和模型构建指导)。在因果环图提取方面,云模型的整体通过率达到77%至89%;最佳本地模型的通过率为77%(Kimi K2.5 GGUF Q3,零样本引擎),与中等水平的云性能相匹配。在讨论方面,最佳本地模型在模型构建步骤上达到50%至100%,在反馈解释上达到47%至75%,但在错误修复上仅为0%至50%,这一类别主要由长上下文提示构成,暴露了本地部署的内存限制。本文的一个核心贡献是对模型类型效应对性能的系统分析:我们比较了推理与指令调优架构、GGUF(llama.cpp)与MLX(mlx_lm)后端,以及相同基础模型家族的量化级别(Q3 / Q4_K_M / MLX-3bit / MLX-4bit / MLX-6bit)。我们发现,后端选择对实际影响大于量化级别:mlx_lm不强制执行JSON模式约束,需要明确的提示级JSON指令,而llama.cpp的语法约束采样可靠地处理JSON,但在密集模型的长上下文提示中导致无限生成。我们记录了所有本地模型的完整参数扫描($t$,$p$,$k$),清理后的时间数据(排除卡住的请求),以及在Apple Silicon上运行671B至123B参数模型的实践指南。
cs.AI / 133 / 2604.18576
Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs
基于语言信念的顺序贝叶斯更新的代理预测
Abstract
We present BLF (Bayesian Linguistic Forecaster), an agentic system for binary forecasting that achieves state-of-the-art performance on the ForecastBench benchmark. The system is built on three ideas. (1) A Bayesian linguistic belief state: a semi-structured representation combining numerical probability estimates with natural-language evidence summaries, updated by the LLM at each step of an iterative tool-use loop. This contrasts with the common approach of appending all retrieved evidence to an ever-growing context. (2) Hierarchical multi-trial aggregation: running $K$ independent trials and combining them using logit-space shrinkage with a data-dependent prior. (3) Hierarchical calibration: Platt scaling with a hierarchical prior, which avoids over-shrinking extreme predictions for sources with skewed base rates. On 400 backtesting questions from the ForecastBench leaderboard, BLF outperforms all the top public methods, including Cassi, GPT-5, Grok 4.20, and Foresight-32B. Ablation studies show that the structured belief state is as impactful as web search access, and that shrinkage aggregation and hierarchical calibration each provide significant additional gains. In addition, we develop a robust backtesting framework with a leakage rate below 1.5%, and use rigorous statistical methodology to compare different methods while controlling for various sources of noise.
Chinese Translation
我们提出了BLF(贝叶斯语言预测器),这是一个用于二元预测的代理系统,在ForecastBench基准测试中达到了最先进的性能。该系统基于三个理念构建。 (1) 贝叶斯语言信念状态:一种半结构化的表示,结合了数值概率估计和自然语言证据摘要,由大型语言模型(LLM)在每次迭代工具使用循环的步骤中进行更新。这与将所有检索到的证据附加到不断增长的上下文中的常见方法形成对比。 (2) 层次多次试验聚合:运行$K$个独立试验,并使用基于数据的先验通过对数几率空间收缩将它们结合起来。 (3) 层次校准:使用层次先验的Platt缩放,避免对具有偏斜基率的来源的极端预测进行过度收缩。在来自ForecastBench排行榜的400个回测问题上,BLF的表现优于所有顶级公共方法,包括Cassi、GPT-5、Grok 4.20和Foresight-32B。消融研究表明,结构化信念状态的影响与网络搜索访问同样重要,而收缩聚合和层次校准各自提供了显著的额外收益。此外,我们开发了一个稳健的回测框架,泄漏率低于1.5%,并使用严格的统计方法比较不同的方法,同时控制各种噪声源。
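The multi-trial aggregation step in the BLF abstract (combining $K$ independent trials with logit-space shrinkage) can be sketched in Python as follows. The abstract does not give the estimator, so this sketch makes simplifying assumptions: the prior probability and shrinkage weight are fixed inputs rather than data-dependent, and the function name `shrink_aggregate` is hypothetical.

```python
import math
from typing import Sequence


def logit(p: float, eps: float = 1e-6) -> float:
    """Log-odds of p, clipped away from 0 and 1 to stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def shrink_aggregate(
    trial_probs: Sequence[float],
    prior_prob: float = 0.5,   # assumed fixed prior; BLF's prior is data-dependent
    shrinkage: float = 0.3,    # assumed weight on the prior, in [0, 1]
) -> float:
    """Average trial forecasts in logit space, then pull the mean toward the prior."""
    mean_logit = sum(logit(p) for p in trial_probs) / len(trial_probs)
    pooled = (1.0 - shrinkage) * mean_logit + shrinkage * logit(prior_prob)
    return sigmoid(pooled)
```

Averaging in logit space rather than probability space keeps confident trials from being flattened by a plain arithmetic mean, and the shrinkage term tempers forecasts that rest on only a few noisy trials.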
cs.AI / 134 / 2604.18584
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
MathNet:一个全球多模态数学推理与检索基准
Abstract
Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.
Chinese Translation
数学问题解决仍然是对大型语言模型和多模态模型推理能力的一个具有挑战性的测试,然而现有的基准在规模、语言覆盖和任务多样性方面都存在局限。我们介绍了MathNet,这是一个高质量、大规模、多模态和多语言的奥林匹克级数学问题数据集,并提供了一个用于评估生成模型中的数学推理和基于嵌入系统中的数学检索的基准。MathNet涵盖了47个国家、17种语言和近二十年的竞赛,包含30,676个由专家撰写的问题及其解决方案,涉及多个领域。除了核心数据集外,我们还构建了一个检索基准,由人类专家策划的数学等价和结构相似的问题对组成。MathNet支持三项任务:(i)问题解决,(ii)数学感知检索,以及(iii)增强检索的问题解决。实验结果表明,即使是最先进的推理模型(Gemini-3.1-Pro的准确率为78.4%,GPT-5的准确率为69.3%)仍面临挑战,而嵌入模型在检索等价问题时也表现不佳。我们进一步展示了增强检索生成性能对检索质量的高度敏感性;例如,DeepSeek-V3.2-Speciale在基准测试中获得了最高分,提升幅度可达12%。MathNet提供了最大的高质量奥林匹克数据集,并首次建立了评估数学问题检索的基准,我们将在https://mathnet.mit.edu上公开发布数据集和基准。
cs.CL / 1 / 2604.16311
Multimodal Claim Extraction for Fact-Checking
多模态声明提取用于事实核查
Abstract
Automated Fact-Checking (AFC) relies on claim extraction as a first step, yet existing methods largely overlook the multimodal nature of today's misinformation. Social media posts often combine short, informal text with images such as memes, screenshots, and photos, creating challenges that differ from both text-only claim extraction and well-studied multimodal tasks like image captioning or visual question answering. In this work, we present the first benchmark for multimodal claim extraction from social media, consisting of posts containing text and one or more images, annotated with gold-standard claims derived from real-world fact-checkers. We evaluate state-of-the-art multimodal LLMs (MLLMs) under a three-part evaluation framework (semantic alignment, faithfulness, and decontextualization) and find that baseline MLLMs struggle to model rhetorical intent and contextual cues. To address this, we introduce MICE, an intent-aware framework which shows improvements in intent-critical cases.
Chinese Translation
自动化事实核查(AFC)依赖于声明提取作为第一步,但现有方法在很大程度上忽视了当今虚假信息的多模态特性。社交媒体帖子通常将简短、非正式的文本与图像(如表情包、截图和照片)结合在一起,这带来了与仅文本的声明提取以及诸如图像描述或视觉问答等成熟的多模态任务不同的挑战。在本研究中,我们提出了首个针对社交媒体的多模态声明提取基准,该基准由包含文本和一个或多个图像的帖子组成,并附有来自真实世界事实核查者的金标准声明。我们在一个三部分的评估框架下(语义对齐、忠实性和去语境化)评估了最先进的多模态大语言模型(MLLMs),发现基线 MLLMs 在建模修辞意图和上下文线索方面存在困难。为了解决这个问题,我们引入了 MICE,一个关注意图的框架,显示出在意图关键的案例中有所改善。
cs.CL / 2 / 2604.16368
Cross-Family Speculative Decoding for Polish Language Models on Apple Silicon: An Empirical Evaluation of Bielik 11B with UAG-Extended MLX-LM
苹果硅上波兰语言模型的跨家族推测解码:Bielik 11B与UAG扩展的MLX-LM的实证评估
Abstract
Speculative decoding accelerates LLM inference by using a small draft model to propose k candidate tokens for a target model to verify. While effective for same-tokenizer pairs on high-bandwidth GPUs, its applicability to cross-family pairs with mismatched tokenizers and consumer-grade unified memory remains underexplored. We extend the MLX-LM framework with Universal Assisted Generation (UAG) to enable cross-tokenizer speculative decoding on Apple Silicon. We evaluate Bielik 11B-Instruct (Mistral-based) as the target model, paired with three draft models: Bielik 1.5B (Qwen-based with custom tokenizer), Qwen2.5-1.5B, and Llama 3.2-1B. Experiments on three Polish-language datasets (Wikipedia, pl_alpaca, synthetic) use draft lengths k in {2, 4, 6} to compare naive and context-aware token translation. Results show: (1) context-aware translation consistently improves acceptance rates across all configurations; (2) the Polish-specialized Bielik 1.5B achieves lower acceptance than general-purpose Qwen2.5 and Llama 3.2 drafters; (3) throughput on Apple Silicon is content-dependent, reaching 1.7x speedup for structured text but failing for varied instructions; and (4) verification cost on unified memory does not amortize as theory predicts because both models are memory-bandwidth bound, making sequential drafting expensive relative to batched verification. We propose a hardware-aware speedup formula and characterize conditions for cross-family speculative decoding on Apple Silicon. This is the first systematic evaluation of cross-family speculative decoding for Polish LLMs and the first empirical study of UAG-based decoding on unified memory architectures.
Chinese Translation
推测解码通过使用一个小型草稿模型提出k个候选标记,以加速大型语言模型(LLM)的推理。虽然在高带宽GPU上对于相同标记器对是有效的,但其在标记器不匹配的跨家族对以及消费级统一内存上的适用性仍未得到充分探索。我们扩展了MLX-LM框架,引入了通用辅助生成(UAG),以实现苹果硅上的跨标记器推测解码。我们将Bielik 11B-Instruct(基于Mistral)作为目标模型,配对三个草稿模型:Bielik 1.5B(基于Qwen且具有自定义标记器)、Qwen2.5-1.5B和Llama 3.2-1B。在三个波兰语数据集(维基百科、pl_alpaca、合成数据)上进行实验,使用草稿长度k为{2, 4, 6}来比较简单和上下文感知的标记翻译。结果显示:(1)上下文感知翻译在所有配置中一致提高了接受率;(2)波兰专用的Bielik 1.5B的接受率低于通用的Qwen2.5和Llama 3.2草稿模型;(3)在苹果硅上的吞吐量依赖于内容,对于结构化文本达到1.7倍加速,但对于多样化指令则失败;(4)在统一内存上的验证成本未如理论预测的那样摊销,因为两个模型都受限于内存带宽,使得顺序草稿相对于批量验证而言成本高昂。我们提出了一种硬件感知的加速公式,并描述了苹果硅上跨家族推测解码的条件。这是对波兰LLM的跨家族推测解码的首次系统评估,也是对统一内存架构上基于UAG解码的首次实证研究。
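The draft-then-verify loop that the speculative decoding abstract builds on can be sketched in Python. This is the greedy, same-tokenizer variant only. It does not model UAG's cross-tokenizer translation, and the toy `draft_next_token` and `target_next_token` functions are stand-ins invented for illustration. Real systems verify all k proposals in one batched forward pass of the target model; the verification loop here is sequential purely for readability.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(
    prefix: List[int], draft_next: NextToken, target_next: NextToken, k: int
) -> List[int]:
    """Return the tokens emitted by one draft-then-verify round (greedy variant)."""
    # Draft phase: the small model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft_next(prefix + proposed))

    # Verify phase: keep proposals while they match the target's greedy choice.
    accepted: List[int] = []
    for token in proposed:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # the target's correction ends the round
            return accepted
        accepted.append(token)

    # All k proposals matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def target_next_token(context: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def draft_next_token(context: List[int]) -> int:
    """Toy draft model: agrees with the target except after token 3."""
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


emitted = speculative_step([0], draft_next_token, target_next_token, k=4)
```

Each round emits between 1 and k + 1 tokens, which is why the acceptance rate of the drafter drives the achievable speedup.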
cs.CL / 3 / 2604.16370
Brain-CLIPLM: Decoding Compressed Semantic Representations in EEG for Language Reconstruction
Brain-CLIPLM:解码脑电图中的压缩语义表示以重建语言
Abstract
Decoding natural language from non-invasive electroencephalography (EEG) remains fundamentally limited by low signal-to-noise ratio and restricted information bandwidth. This raises a fundamental question regarding whether sentence-level linguistic structure can be reliably recovered from such signals. In this work, we suggest that this assumption may not hold under realistic information constraints, and instead propose a semantic compression hypothesis in which EEG signals encode a compressed set of semantic anchors rather than full linguistic structure. Under our new perspective, direct sentence reconstruction becomes an overparameterized objective relative to the intrinsic information capacity of EEG. To address this mismatch, we introduce Brain-CLIPLM, a two-stage framework that decomposes EEG-to-text decoding into semantic anchor extraction via contrastive learning and sentence reconstruction using a retrieval-grounded large language model (LLM) with Chain-of-Thought (CoT) reasoning, following a granularity matching principle that aligns decoding complexity with neural information capacity. Evaluated on the Zurich Cognitive Language Processing Corpus, Brain-CLIPLM achieves 67.55% top-5 and 85.00% top-25 sentence retrieval accuracy, significantly outperforming the direct decoding baseline, while cross-subject evaluation confirms robust generalization. Control analyses, including permutation testing, further demonstrate that EEG-derived representations carry sentence-specific information beyond language model priors. These results suggest that EEG-to-text decoding is better framed as recovering compressed semantic content rather than reconstructing full sentences, providing a biologically grounded and data-efficient pathway for non-invasive brain-computer interfaces.
Chinese Translation
从非侵入性脑电图(EEG)中解码自然语言在根本上受到低信噪比和信息带宽限制的制约。这引发了一个基本问题,即句子级语言结构是否能够可靠地从这些信号中恢复。在本研究中,我们提出这一假设在现实信息约束下可能不成立,而是提出了一种语义压缩假设,认为EEG信号编码的是一组压缩的语义锚点,而非完整的语言结构。在我们新的视角下,直接的句子重建相对于EEG的内在信息容量而言变成了一个过参数化的目标。为了解决这一不匹配,我们引入了Brain-CLIPLM,一个两阶段框架,将EEG到文本的解码分解为通过对比学习提取语义锚点和使用基于检索的大型语言模型(LLM)进行句子重建,采用链式思维(Chain-of-Thought, CoT)推理,遵循一种粒度匹配原则,使解码复杂性与神经信息容量相一致。在苏黎世认知语言处理语料库上的评估表明,Brain-CLIPLM实现了67.55%的前5名和85.00%的前25名句子检索准确率,显著优于直接解码基线,同时跨受试者评估证实了其强大的泛化能力。控制分析,包括置换测试,进一步证明EEG衍生的表示携带超出语言模型先验的句子特定信息。这些结果表明,EEG到文本的解码更适合被视为恢复压缩的语义内容,而非重建完整的句子,为非侵入性脑机接口提供了一条生物学基础和数据高效的途径。
cs.CL / 4 / 2604.16372
CFMS: Towards Explainable and Fine-Grained Chinese Multimodal Sarcasm Detection Benchmark
CFMS:面向可解释和细粒度的中文多模态讽刺检测基准
Abstract
Multimodal sarcasm detection has recently garnered significant attention. However, existing benchmarks suffer from coarse-grained annotations and limited cultural coverage, which hinder research into fine-grained semantic understanding. To address this, we construct CFMS, the first fine-grained multimodal sarcasm dataset tailored for Chinese social media. It comprises 2,796 high-quality image-text pairs and provides a triple-level annotation framework: sarcasm identification, target recognition, and explanation generation. We find that the fine-grained explanation annotations effectively guide AI in generating images with explicit sarcastic intent. Furthermore, we curate a high-consistency parallel Chinese-English metaphor subset (200 entries each), revealing significant limitations of current models in metaphoric reasoning. To overcome the constraints of traditional retrieval methods, we propose a Reinforcement Learning-augmented In-Context Learning strategy (PGDS) to dynamically optimize exemplar selection. Extensive experiments demonstrate that CFMS provides a solid foundation for building reliable multimodal sarcasm understanding systems, and the PGDS method significantly outperforms existing baselines on key tasks. Our data and code are available at https://anonymous.4open.science/r/CFMS-E8F9.
Chinese Translation
多模态讽刺检测近年来引起了广泛关注。然而,现有基准存在粗粒度注释和文化覆盖面有限的问题,这阻碍了对细粒度语义理解的研究。为此,我们构建了CFMS,这是第一个专为中文社交媒体量身定制的细粒度多模态讽刺数据集。该数据集包含2796对高质量的图像-文本对,并提供了三层级的注释框架:讽刺识别、目标识别和解释生成。我们发现,细粒度的解释注释有效地指导人工智能生成具有明确讽刺意图的图像。此外,我们还整理了一个高一致性的平行中英文隐喻子集(各200条),揭示了当前模型在隐喻推理方面的显著局限性。为了克服传统检索方法的限制,我们提出了一种增强型强化学习上下文学习策略(PGDS),以动态优化示例选择。大量实验表明,CFMS为构建可靠的多模态讽刺理解系统提供了坚实基础,而PGDS方法在关键任务上显著优于现有基准。我们的数据和代码可在 https://anonymous.4open.science/r/CFMS-E8F9 获取。
cs.CL / 5 / 2604.16376
Foundational Study on Authorship Attribution of Japanese Web Reviews for Actor Analysis
面向行为者分析的日语网络评论作者归属基础研究
Abstract
This study investigates the applicability of authorship attribution based on stylistic features to support actor analysis in threat intelligence. As a foundational step toward future application to dark web forums, we conducted experiments using Japanese review data from clear web sources. We constructed datasets from Rakuten Ichiba reviews and compared four methods: TF-IDF with logistic regression (TF-IDF+LR), BERT embeddings with logistic regression (BERT-Emb+LR), BERT fine-tuning (BERT-FT), and metric learning with $k$-nearest neighbors (Metric+kNN). Results showed that BERT-FT achieved the best performance; however, training became unstable as the number of authors scaled to several hundred, where TF-IDF+LR proved superior in terms of accuracy, stability, and computational cost. Furthermore, Top-$k$ evaluation demonstrated the utility of candidate screening, and error analysis revealed that boilerplate text, topic dependency, and short text length were primary factors causing misclassification.
Chinese Translation
本研究探讨了基于风格特征的作者归属方法在支持威胁情报中行为者分析方面的适用性。作为未来应用于暗网论坛的基础步骤,我们使用来自明网来源的日语评论数据进行了实验。我们从乐天市场(Rakuten Ichiba)评论中构建了数据集,并比较了四种方法:基于TF-IDF的逻辑回归(TF-IDF+LR)、基于BERT嵌入的逻辑回归(BERT-Emb+LR)、BERT微调(BERT-FT)和基于度量学习的$k$-最近邻(Metric+kNN)。结果表明,BERT-FT的性能最佳;然而,当作者数量增加到数百时,训练变得不稳定,此时TF-IDF+LR在准确性、稳定性和计算成本方面表现更优。此外,Top-$k$评估展示了候选筛选的实用性,错误分析揭示了模板文本、主题依赖性和短文本长度是导致误分类的主要因素。
cs.CL / 6 / 2604.16377
GoCoMA: Hyperbolic Multimodal Representation Fusion for Large Language Model-Generated Code Attribution
GoCoMA:用于大型语言模型生成代码归属的超曲面多模态表示融合
Abstract
Large Language Models (LLMs) trained on massive code corpora are now increasingly capable of generating code that is hard to distinguish from human-written code. This raises practical concerns, including security vulnerabilities and licensing ambiguity, and also motivates a forensic question: 'Who (or which LLM) wrote this piece of code?' We present GoCoMA, a multimodal framework that models an extrinsic hierarchy between (i) code stylometry, capturing higher-level structural and stylistic signatures, and (ii) image representations of binary pre-executable artifacts (BPEA), capturing lower-level, execution-oriented byte semantics shaped by compilation and toolchains. GoCoMA projects modality embeddings into a hyperbolic Poincaré ball, fuses them via a geodesic-cosine similarity-based cross-modal attention (GCSA) fusion mechanism, and back-projects the fused representation to Euclidean space for final LLM-source attribution. Experiments on two open-source benchmarks (CoDET-M4 and LLMAuthorBench) show that GoCoMA consistently outperforms unimodal and Euclidean multimodal baselines under identical evaluation protocols.
Chinese Translation
在大量代码语料库上训练的大型语言模型(LLMs)现在越来越能够生成与人类编写的代码难以区分的代码。这引发了实际问题,包括安全漏洞和许可模糊性,同时也激发了一个法医问题:'谁(或哪个LLM)编写了这段代码?' 我们提出了GoCoMA,一个多模态框架,建模了(i)代码风格学,捕捉更高层次的结构和风格特征,以及(ii)二进制可执行前文物(BPEA)的图像表示,捕捉由编译和工具链塑造的更低层次、面向执行的字节语义。GoCoMA将模态嵌入投影到超曲面Poincaré球中,通过基于测地线余弦相似度的跨模态注意力(GCSA)融合机制进行融合,并将融合后的表示反投影到欧几里得空间,以实现最终的LLM源归属。在两个开源基准(CoDET-M4和LLMAuthorBench)上的实验表明,GoCoMA在相同的评估协议下始终优于单模态和欧几里得多模态基线。
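The projection step in the GoCoMA abstract (mapping Euclidean embeddings into a Poincaré ball) can be illustrated with the standard exponential map at the origin and the standard geodesic distance on the unit ball. This Python sketch assumes curvature -1 and uses hypothetical helper names. The paper's GCSA attention mechanism is not specified in the abstract and is not reproduced here.

```python
import math
from typing import Sequence, Tuple

Vector = Tuple[float, ...]


def norm(v: Sequence[float]) -> float:
    """Euclidean norm."""
    return math.sqrt(sum(x * x for x in v))


def exp_map_origin(v: Sequence[float]) -> Vector:
    """Map a Euclidean (tangent) vector into the open unit Poincare ball."""
    n = norm(v)
    if n == 0.0:
        return tuple(0.0 for _ in v)
    scale = math.tanh(n) / n  # tanh keeps the resulting norm strictly below 1
    return tuple(scale * x for x in v)


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Geodesic distance between two points inside the unit ball (curvature -1)."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(u, v))
    denom = (1.0 - norm(u) ** 2) * (1.0 - norm(v) ** 2)
    # max(...) guards against rounding that would push the argument below 1.
    return math.acosh(max(1.0, 1.0 + 2.0 * diff_sq / denom))


point = exp_map_origin((0.3, 0.4))
```

Distances grow rapidly near the boundary of the ball, which is what makes hyperbolic space a natural fit for hierarchical relationships such as the stylometry-to-bytes hierarchy the abstract describes.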
cs.CL / 7 / 2604.16378
Reciprocal Co-Training (RCT): Coupling Gradient-Based and Non-Differentiable Models via Reinforcement Learning
互惠共训练(RCT):通过强化学习耦合基于梯度和非可微模型
Abstract
Large language models (LLMs) and classical machine learning methods offer complementary strengths for predictive modeling, yet their fundamentally different representations and training paradigms hinder effective integration: LLMs rely on gradient-based optimization over textual data, whereas models such as Random Forests (RF) employ non-differentiable feature partitioning. This work introduces a reciprocal co-training framework that couples an LLM with an RF classifier via reinforcement learning, creating an iterative feedback loop in which each model improves using signals from the other. Tabular data are reformulated into standardized textual representations for the LLM, whose embeddings augment the RF feature space, while calibrated RF probability estimates provide feedback signals that guide reinforcement learning updates of the LLM. Experiments across three medical datasets demonstrate consistent performance gains for both models, with particularly strong effects for the LLM. Ablation analyses show that iterative refinement, hybrid reward design, and dimensionality control jointly contribute to these gains. The proposed framework provides a general mechanism that allows incompatible model families to leverage each other's strengths through bidirectional adaptation.
Chinese Translation
大型语言模型(LLMs)和经典机器学习方法在预测建模中提供了互补的优势,但它们在表示和训练范式上的根本差异阻碍了有效的整合:LLMs依赖于对文本数据的基于梯度的优化,而随机森林(Random Forests, RF)等模型则采用非可微的特征划分。本研究提出了一种互惠共训练框架,通过强化学习将LLM与RF分类器耦合,创建了一个迭代反馈循环,使每个模型能够利用来自另一个模型的信号进行改进。表格数据被重新表述为标准化的文本表示,以供LLM使用,其嵌入增强了RF特征空间,而经过校准的RF概率估计则提供反馈信号,指导LLM的强化学习更新。在三个医疗数据集上的实验表明,两个模型均获得了一致的性能提升,LLM的效果尤为显著。消融分析表明,迭代精炼、混合奖励设计和维度控制共同促进了这些提升。所提出的框架提供了一种通用机制,允许不兼容的模型家族通过双向适应利用彼此的优势。
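The two signals described in the RCT abstract (tabular rows rendered as text for the LLM, and calibrated RF probabilities fed back as a reward) might be sketched as follows. The template, the function names, and the blending weight are all assumptions made for illustration; the abstract does not give the actual serialization format or the hybrid reward formula.

```python
def serialize_row(row):
    """Render one tabular record as text, with features in a fixed (sorted) order."""
    return "; ".join(f"{name} is {row[name]}" for name in sorted(row))


def hybrid_reward(rf_prob_true_label, llm_correct, weight=0.5):
    """Hypothetical hybrid reward.

    Blends the calibrated RF probability of the true label with a 0/1
    correctness bonus for the LLM's own prediction. The 50/50 weighting
    is an assumption, not a value from the paper.
    """
    bonus = 1.0 if llm_correct else 0.0
    return weight * rf_prob_true_label + (1.0 - weight) * bonus


# Hypothetical patient record, for illustration only.
record = {"age": 63, "blood_pressure": "high"}
print(serialize_row(record))
print(hybrid_reward(0.8, llm_correct=True))
```

Sorting the feature names keeps the textual representation standardized across rows, so the same feature always appears in the same position.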
cs.CL / 8 / 2604.16380
Data Mixing for Large Language Models Pretraining: A Survey and Outlook
大语言模型预训练中的数据混合:综述与展望
Abstract
Large language models (LLMs) rely on pretraining on massive and heterogeneous corpora, where training data composition has a decisive impact on training efficiency and downstream generalization under realistic compute and data budget constraints. Unlike sample-level data selection, data mixing optimizes domain-level sampling weights to allocate limited budgets more effectively. In recent years, a growing body of work has proposed principled data mixing methods for LLM pretraining; however, the literature remains fragmented and lacks a dedicated, systematic survey. This paper provides a comprehensive review of data mixing for LLM pretraining. We first formalize data mixture optimization as a bilevel problem on the probability simplex and clarify the role of data mixing in the pretraining pipeline, and briefly explain how existing methods make this formulation tractable in practice. We then introduce a fine-grained taxonomy that organizes existing methods along two dimensions: static versus dynamic mixing. Static mixing is further categorized into rule-based and learning-based methods, while dynamic mixing is grouped into adaptive and externally guided families. For each class, we summarize representative approaches and analyze their strengths and limitations from a performance-cost trade-off perspective. Building on this analysis, we highlight challenges that cut across methods, including limited transferability across data domains, optimization objectives, models, and validation sets, as well as unstandardized evaluation protocols and benchmarks, and the inherent tension between performance gains and cost control in learning-based methods. Finally, we outline several exploratory directions, including finer-grained domain partitioning and inverse data mixing, as well as pipeline-aware designs, aiming to provide conceptual and methodological insights for future research.
Chinese Translation
大语言模型(LLMs)依赖于对大规模异构语料库的预训练,其中训练数据的组成对训练效率和在现实计算与数据预算限制下的下游泛化具有决定性影响。与样本级数据选择不同,数据混合优化领域级采样权重,以更有效地分配有限预算。近年来,越来越多的研究提出了针对LLM预训练的原则性数据混合方法;然而,相关文献仍然零散,缺乏专门的系统性综述。本文提供了对LLM预训练中数据混合的全面回顾。我们首先将数据混合优化形式化为概率单纯形上的双层问题,并阐明数据混合在预训练流程中的作用,简要解释现有方法如何在实践中使这一形式化变得可行。接着,我们引入了一种细致的分类法,将现有方法沿两个维度进行组织:静态混合与动态混合。静态混合进一步分为基于规则的方法和基于学习的方法,而动态混合则分为自适应和外部指导两类。对于每一类,我们总结了代表性的方法,并从性能与成本权衡的角度分析它们的优缺点。在此分析的基础上,我们强调了跨方法的挑战,包括数据领域之间的有限可迁移性、优化目标、模型和验证集,以及不标准化的评估协议和基准,以及基于学习的方法中性能提升与成本控制之间的固有矛盾。最后,我们概述了几个探索性方向,包括更细致的领域划分和逆数据混合,以及管道感知设计,旨在为未来研究提供概念和方法上的见解。
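To make "domain-level sampling weights on the probability simplex" concrete, the sketch below parameterizes a static mixture with softmax logits and draws a batch of domain labels from it. The domain names and logit values are invented for illustration, and the outer optimization that would tune the logits (the bilevel part of the formulation) is deliberately left out.

```python
import math
import random


def softmax(logits):
    """Map unconstrained logits to a point on the probability simplex."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_domains(weights, n, seed=0):
    """Draw n domain labels according to the mixture weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[d] for d in names], k=n)


# Hypothetical domains and logits; a mixing method would learn or set these.
domains = ["web", "code", "books"]
mixture = dict(zip(domains, softmax([1.0, 0.0, -1.0])))
batch = sample_domains(mixture, n=8)
print(mixture)
print(batch)
```

A static method fixes `mixture` before training, while a dynamic method would update the logits between training steps; in both cases the weights stay non-negative and sum to one.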
cs.CL / 9 / 2604.16382
LiFT: Does Instruction Fine-Tuning Improve In-Context Learning for Longitudinal Modelling by Large Language Models?
LiFT:指令微调是否改善大型语言模型在纵向建模中的上下文学习?
Abstract
Longitudinal NLP tasks require reasoning over temporally ordered text to detect persistence and change in human behavior and opinions. However, in-context learning with large language models struggles on tasks where models must integrate historical context, track evolving interactions, and handle rare change events. We introduce LiFT, a longitudinal instruction fine-tuning framework that unifies diverse longitudinal modeling tasks under a shared instruction schema. LiFT uses a curriculum that progressively increases temporal difficulty while incorporating few-shot structure and temporal conditioning to encourage effective use of past context. We evaluate LiFT across five datasets. Models trained on longitudinal tasks with different levels of temporal granularity are tested for generalisability on two separate datasets. Across models with different parameter sizes (OLMo (1B/7B), LLaMA-8B, and Qwen-14B), LiFT consistently outperforms base-model ICL, with strong gains on out-of-distribution data and minority change events.
Chinese Translation
纵向自然语言处理(NLP)任务需要对时间顺序排列的文本进行推理,以检测人类行为和观点的持续性与变化。然而,在需要模型整合历史上下文、跟踪不断演变的互动以及处理稀有变化事件的任务中,大型语言模型的上下文学习表现不佳。我们提出了LiFT,一个纵向指令微调框架,它在共享的指令架构下统一了多种纵向建模任务。LiFT采用了一种逐步增加时间难度的课程,同时结合少量示例结构和时间条件,以鼓励有效利用过去的上下文。我们在五个数据集上评估了LiFT。针对不同时间粒度的纵向任务训练的模型在两个独立数据集上进行可泛化性测试。在不同参数规模的模型(OLMo (1B/7B)、LLaMA-8B 和 Qwen-14B)中,LiFT始终优于基础模型的上下文学习,在分布外数据和少数变化事件上取得了显著提升。
cs.CL / 10 / 2604.16396
QU-NLP at QIAS 2026: Multi-Stage QLoRA Fine-Tuning for Arabic Islamic Inheritance Reasoning
QU-NLP在QIAS 2026:阿拉伯伊斯兰继承推理的多阶段QLoRA微调
Abstract
Islamic inheritance law (ilm al-mawarīth) presents a challenging domain for evaluating large language models' structured reasoning capabilities, requiring multi-step legal analysis, rule-based blocking decisions, and precise fractional calculations. We present QU-NLP's submission to the QIAS 2026 shared task on Arabic Islamic inheritance reasoning. Our approach employs a multi-stage Quantized Low-Rank Adaptation (QLoRA) fine-tuning strategy on Qwen3-4B: (1) domain adaptation on 3,166 Islamic fatwa records to acquire inheritance terminology and jurisprudential reasoning patterns, followed by (2) task-specific training on 12,000 structured inheritance cases to optimize JSON-formatted output generation. Using 4-bit NF4 quantization with rank-128 LoRA adapters, our model achieves a 90% MIR-E (Mawarith Inheritance Reasoning Evaluation) score on the test set, demonstrating competitive performance while requiring minimal computational resources. Our results show that domain-specific pre-adaptation combined with structured output training enables small language models to perform complex legal reasoning tasks effectively when compared with commercial systems such as Gemini-2.5-flash.
Chinese Translation
伊斯兰继承法(ilm al-mawarīth)为评估大型语言模型的结构化推理能力提供了一个具有挑战性的领域,要求进行多步骤的法律分析、基于规则的阻断决策以及精确的分数计算。我们展示了QU-NLP在QIAS 2026阿拉伯伊斯兰继承推理共享任务中的提交。我们的方法采用了多阶段量化低秩适应(QLoRA)微调策略,基于Qwen3-4B模型:(1)在3166条伊斯兰法特记录上进行领域适应,以获取继承术语和法理推理模式,随后(2)在12000个结构化继承案例上进行任务特定训练,以优化JSON格式输出的生成。使用4位NF4量化和秩为128的LoRA适配器,我们的模型在测试集上达到了90%的MIR-E(Mawarith继承推理评估)得分,展示了具有竞争力的性能,同时所需的计算资源最小。我们的结果表明,领域特定的预适应结合结构化输出训练使小型语言模型能够有效执行复杂的法律推理任务,相较于商业系统如Gemini-2.5-flash表现出色。
cs.CL / 11 / 2604.16421
Measuring Representation Robustness in Large Language Models for Geometry
大语言模型在几何领域的表征鲁棒性测量
Abstract
Large language models (LLMs) are increasingly evaluated on mathematical reasoning, yet their robustness to equivalent problem representations remains poorly understood. In geometry, identical problems can be expressed in Euclidean, coordinate, or vector forms, but existing benchmarks report accuracy on fixed formats, implicitly assuming representation invariance and masking failures caused by representational changes alone. We propose GeoRepEval, a representation-aware evaluation framework that measures correctness, invariance, and consistency at the problem level across parallel formulations, combining strict answer matching, bootstrap confidence intervals, paired McNemar tests, representation-flip analyses, and regression controls for surface complexity. We prove that our Invariance@3 metric decomposes accuracy into robust and fragile components and is bounded by the weakest representation. Evaluating eleven LLMs on 158 curated high-school geometry problems (474 instances), we find accuracy gaps of up to 14 percentage points induced solely by representation choice. Vector formulations emerge as a consistent failure point, with Invariance@3 as low as 0.044 even after controlling for length and symbolic complexity. A convert-then-solve prompting intervention improves vector accuracy by up to 52 percentage points for high-capacity models, suggesting that failures reflect representation sensitivity rather than inability; however, low-capacity models show no gains, indicating deeper limitations. These results suggest that current models rely on representation-specific heuristics rather than abstract geometric reasoning. All datasets, prompts, and scripts are released at https://github.com/vedjaw/GeoRepEval.
Chinese Translation
大语言模型(LLMs)在数学推理方面的评估日益增多,但它们对等效问题表征的鲁棒性仍然了解不足。在几何学中,相同的问题可以用欧几里得、坐标或向量形式表达,但现有基准测试报告的准确性基于固定格式,隐含假设了表征不变性,并掩盖了仅由表征变化引起的失败。我们提出了GeoRepEval,一个关注表征的评估框架,旨在测量问题层面的正确性、不变性和一致性,结合严格的答案匹配、引导置信区间、配对McNemar检验、表征翻转分析以及针对表面复杂性的回归控制。我们证明了我们的Invariance@3指标将准确性分解为鲁棒和脆弱两个组成部分,并受到最弱表征的限制。对158个精心挑选的高中几何问题(474个实例)评估了11个LLMs,我们发现仅由表征选择引起的准确性差距高达14个百分点。向量表述成为一个一致的失败点,Invariance@3的值低至0.044,即使在控制了长度和符号复杂性之后。采用“转换后求解”的提示干预使高容量模型的向量准确性提高了多达52个百分点,这表明失败反映了对表征的敏感性而非能力不足;然而,低容量模型没有显示出任何提升,表明存在更深层次的局限性。这些结果表明,当前模型依赖于特定于表征的启发式方法,而非抽象的几何推理。所有数据集、提示和脚本均已发布在 https://github.com/vedjaw/GeoRepEval。
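The decomposition and bound stated for Invariance@3 can be checked on toy data. The sketch assumes that Invariance@3 is the fraction of problems answered correctly under all three representations; the abstract does not give the exact definition, so treat this reading, the function names, and the four example results as hypothetical.

```python
REPRESENTATIONS = ("euclidean", "coordinate", "vector")


def accuracy(results, rep):
    """Fraction of problems answered correctly in one representation."""
    return sum(r[rep] for r in results) / len(results)


def invariance_at_3(results):
    """Fraction of problems answered correctly in all three representations."""
    return sum(all(r[rep] for rep in REPRESENTATIONS) for r in results) / len(results)


def fragile_share(results, rep):
    """Fraction correct in `rep` but wrong in at least one other representation."""
    return sum(
        r[rep] and not all(r[other] for other in REPRESENTATIONS) for r in results
    ) / len(results)


# Hypothetical per-problem correctness flags, for illustration only.
results = [
    {"euclidean": True, "coordinate": True, "vector": True},
    {"euclidean": True, "coordinate": True, "vector": False},
    {"euclidean": True, "coordinate": False, "vector": False},
    {"euclidean": False, "coordinate": False, "vector": False},
]

for rep in REPRESENTATIONS:
    print(rep, accuracy(results, rep), invariance_at_3(results), fragile_share(results, rep))
```

Under this reading, each representation's accuracy equals the robust part (Invariance@3) plus its fragile part, so Invariance@3 can never exceed the accuracy of the weakest representation.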
cs.CL / 12 / 2604.16422
Injecting Structured Biomedical Knowledge into Language Models: Continual Pretraining vs. GraphRAG
将结构化生物医学知识注入语言模型:持续预训练与GraphRAG
Abstract
The injection of domain-specific knowledge is crucial for adapting language models (LMs) to specialized fields such as biomedicine. While most current approaches rely on unstructured text corpora, this study explores two complementary strategies for leveraging structured knowledge from the UMLS Metathesaurus: (i) Continual pretraining that embeds knowledge into model parameters, and (ii) Graph Retrieval-Augmented Generation (GraphRAG) that consults a knowledge graph at inference time. We first construct a large-scale biomedical knowledge graph from UMLS (3.4 million concepts and 34.2 million relations), stored in Neo4j for efficient querying. We then derive a ~100-million-token textual corpus from this graph to continually pretrain two models: BERTUMLS (from BERT) and BioBERTUMLS (from BioBERT). We evaluate these models on six BLURB (Biomedical Language Understanding and Reasoning Benchmark) datasets spanning five task types and evaluate GraphRAG on the two QA (Question Answering) datasets (PubMedQA, BioASQ). On BLURB tasks, BERTUMLS improves over BERT, with the largest gains on knowledge-intensive QA. Effects on BioBERT are more nuanced, suggesting diminishing returns when the base model already encodes substantial biomedical text knowledge. Finally, augmenting LLaMA 3-8B with our GraphRAG pipeline yields accuracy gains of more than 3 points on PubMedQA and 5 points on BioASQ without any retraining, delivering transparent, multi-hop, and easily updated knowledge access. We release the processed UMLS Neo4j graph to support reproducibility.
Chinese Translation
领域特定知识的注入对于将语言模型(LMs)适应于生物医学等专业领域至关重要。尽管目前大多数方法依赖于非结构化文本语料库,本研究探讨了两种互补策略,以利用来自UMLS Metathesaurus的结构化知识:(i)持续预训练,将知识嵌入模型参数中,以及(ii)图检索增强生成(GraphRAG),在推理时查询知识图谱。我们首先从UMLS构建了一个大规模生物医学知识图谱(包含340万个概念和3420万个关系),并存储在Neo4j中以便高效查询。然后,我们从该图谱中提取了一个约1亿标记的文本语料库,以持续预训练两个模型:BERTUMLS(基于BERT)和BioBERTUMLS(基于BioBERT)。我们在六个BLURB(生物医学语言理解与推理基准)数据集上评估这些模型,涵盖五种任务类型,并在两个问答(QA)数据集(PubMedQA,BioASQ)上评估GraphRAG。在BLURB任务中,BERTUMLS相较于BERT有所提升,尤其在知识密集型问答任务中获得了最大的收益。BioBERT的效果则更为微妙,表明当基础模型已经编码了大量生物医学文本知识时,收益递减的现象。最后,使用我们的GraphRAG管道增强LLaMA 3-8B模型在PubMedQA上提高了超过3个点的准确率,在BioASQ上提高了5个点,且无需任何重新训练,实现了透明的、多跳的、易于更新的知识访问。我们发布了处理后的UMLS Neo4j图谱,以支持可重复性研究。
cs.CL / 13 / 2604.16430
HalluSAE: Detecting Hallucinations in Large Language Models via Sparse Auto-Encoders
HalluSAE:通过稀疏自编码器检测大型语言模型中的幻觉
Abstract
Large Language Models (LLMs) are powerful and widely adopted, but their practical impact is limited by the well-known hallucination phenomenon. While recent hallucination detection methods have made notable progress, we find that most of them overlook the dynamic nature and underlying mechanisms of hallucination. To address this gap, we propose HalluSAE, a phase transition-inspired framework that models hallucination as a critical shift in the model's latent dynamics. By modeling the generation process as a trajectory through a potential energy landscape, HalluSAE identifies critical transition zones and attributes factual errors to specific high-energy sparse features. Our approach consists of three stages: (1) Potential Energy Empowered Phase Zone Localization via sparse autoencoders and a geometric potential energy metric; (2) Hallucination-related Sparse Feature Attribution using contrastive logit attribution; and (3) Probing-based Causal Hallucination Detection through linear probes on disentangled features. Extensive experiments on Gemma-2-9B demonstrate that HalluSAE achieves state-of-the-art hallucination detection performance.
Chinese Translation
大型语言模型(LLMs)功能强大且被广泛采用,但其实际影响受到众所周知的幻觉现象的限制。尽管最近的幻觉检测方法取得了显著进展,但我们发现大多数方法忽视了幻觉的动态特性和潜在机制。为了解决这一问题,我们提出了HalluSAE,这是一种受相变启发的框架,将幻觉建模为模型潜在动态的关键转变。通过将生成过程建模为在潜能能量景观中的轨迹,HalluSAE识别关键转变区域,并将事实错误归因于特定的高能稀疏特征。我们的方法包括三个阶段:(1)通过稀疏自编码器和几何潜能能量度量进行潜能能量赋能的相区定位;(2)使用对比逻辑归因进行幻觉相关稀疏特征归因;(3)通过对解耦特征的线性探针进行探测式因果幻觉检测。在Gemma-2-9B上的大量实验表明,HalluSAE实现了最先进的幻觉检测性能。
cs.CL / 14 / 2604.16451
SynopticBench: Evaluating Vision-Language Models on Generating Weather Forecast Discussions of the Future
SynopticBench:评估视觉-语言模型在生成未来天气预报讨论中的表现
Abstract
Recent advances in visual-language models (VLMs) have led to significant improvements in a plethora of complex multimodal tasks like image captioning, report generation, and visual perception. However, generating text from meteorological data is highly challenging because the atmosphere is a chaotic system that is rapidly changing at various spatial and temporal scales. Given the complexity of atmospheric phenomena, it is critical to verifiably quantify the effectiveness of existing VLMs on weather forecasting data. In this work, we present SynopticBench, a high-quality dataset consisting of 1,367,041 text samples of Area Forecast Discussions created by the National Weather Service over the continental United States, paired with images of 500mb geopotential height, 2-meter temperature, and 850mb wind velocity in weather forecasts. We also present Synoptic Phenomena Alignment and Coverage Evaluation (SPACE), a novel evaluation framework that can be used to effectively estimate the quality of text descriptions of synoptic weather phenomena. Extensive experiments on generating forecast discussions using state-of-the-art VLMs show the sensitivity of existing evaluation metrics in this domain and enable further exploration into synoptic weather and climate text generation.
Chinese Translation
近年来,视觉-语言模型(VLMs)的进展在图像描述、报告生成和视觉感知等众多复杂的多模态任务中取得了显著的提升。然而,从气象数据生成文本是一个极具挑战性的任务,因为大气是一个在不同空间和时间尺度上快速变化的混沌系统。鉴于大气现象的复杂性,可靠地量化现有VLM在天气预报数据上的有效性至关重要。在本研究中,我们提出了SynopticBench,这是一个高质量的数据集,包含由国家气象局在美国大陆上创建的1,367,041个区域预报讨论文本样本,并与500mb气压高度、2米温度和850mb风速的天气预报图像配对。我们还提出了天气现象对齐与覆盖评估(SPACE),这是一个新颖的评估框架,可有效估计天气现象文本描述的质量。对使用最先进的VLM生成预报讨论的广泛实验显示了现有评估指标在该领域的敏感性,并促进了对天气和气候文本生成的进一步探索。
cs.CL / 15 / 2604.16456
EchoChain: A Full-Duplex Benchmark for State-Update Reasoning Under Interruptions
EchoChain:一种全双工中断状态更新推理的基准测试
Abstract
Real-time voice assistants must revise task state when users interrupt mid-response, but existing spoken-dialog benchmarks largely evaluate turn-based interaction and miss this failure mode. We introduce EchoChain, a controlled benchmark for evaluating full-duplex state-update reasoning under mid-speech interruptions. EchoChain identifies three recurring failure patterns in post-interruption continuations: contextual inertia, interruption amnesia, and objective displacement. The benchmark generates scenario-driven conversations and injects interruptions at a standardized point relative to assistant speech onset, enabling controlled cross-model comparison. In a paired half-duplex control, total failures drop by 40.2% relative to interrupted runs, indicating that many errors are driven by state-update reasoning under interruption rather than task difficulty alone. Across evaluated real-time voice models, no system exceeds a 50% pass rate, showing substantial room for improvement in mid-generation state revision. EchoChain provides a reproducible benchmark for diagnosing state-update reasoning failures in full-duplex voice interaction.
Chinese Translation
实时语音助手在用户中断响应时必须修订任务状态,但现有的语音对话基准主要评估基于轮次的交互,未能涵盖这一失败模式。我们引入了EchoChain,这是一个控制性基准,用于评估在中途语音中断下的全双工状态更新推理。EchoChain识别出三种在中断后继续对话中反复出现的失败模式:情境惯性、中断遗忘和目标位移。该基准生成情景驱动的对话,并在与助手语音开始相关的标准化点注入中断,从而实现控制的跨模型比较。在配对的半双工控制中,总失败率相较于中断运行下降了40.2%,表明许多错误是由中断下的状态更新推理驱动,而不仅仅是任务难度。在评估的实时语音模型中,没有一个系统的通过率超过50%,显示出在生成中状态修订方面有显著的改进空间。EchoChain提供了一个可重复的基准,用于诊断全双工语音交互中的状态更新推理失败。
cs.CL / 16 / 2604.16593
Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models
重新审视颈部疼痛:语言模型的语义推理基准
Abstract
We present SemanticQA, an evaluation suite designed to assess language models (LMs) in semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Through SemanticQA, we assess LMs of diverse architectures and scales in extraction, classification, and interpretation tasks, as well as sequential task compositions. We reveal substantial performance variation, particularly on tasks requiring semantic reasoning. This variation highlights differences in the reasoning efficacy and semantic understanding of LMs and offers insights for building LMs with stronger comprehension of non-trivial semantic phrases. The evaluation harness and data of SemanticQA are available at https://github.com/jacklanda/SemanticQA.
Chinese Translation
我们提出了SemanticQA,这是一个评估语言模型(LMs)在语义短语处理任务中的表现的评估套件。该基准整合了现有的多词表达(MwE)资源,并将其重新组织为一个统一的测试平台。它涵盖了一般的词汇现象,如词汇搭配,以及三个细分类别:习语、名词复合词和动词结构。通过SemanticQA,我们评估了不同架构和规模的LMs在提取、分类和解释任务中的表现,以及顺序任务组合。我们揭示了显著的性能差异,特别是在需要语义推理的任务上,突显了LMs在推理效率和语义理解方面的差异,为推动LMs在非平凡语义短语上的更强理解提供了见解。SemanticQA的评估工具和数据可在https://github.com/jacklanda/SemanticQA获取。
cs.CL / 17 / 2604.16607
Spotlights and Blindspots: Evaluating Machine-Generated Text Detection
聚光灯与盲点:机器生成文本检测的评估
Abstract
With the rise of generative language models, machine-generated text detection has become a critical challenge. A wide variety of models is available, but inconsistent datasets, evaluation metrics, and assessment strategies obscure comparisons of model effectiveness. To address this, we evaluate 15 different detection models from six distinct systems, as well as seven trained models, across seven English-language textual test sets and three creative human-written datasets. We provide an empirical analysis of model performance, the influence of training and evaluation data, and the impact of key metrics. We find that no single system excels in all areas and nearly all are effective for certain tasks, and the representation of model performance is critically linked to dataset and metric choices. We find high variance in model ranks based on datasets and metrics, and overall poor performance on novel human-written texts in high-risk domains. Across datasets and metrics, we find that methodological choices that are often assumed or overlooked are essential for clearly and accurately reflecting model performance.
Chinese Translation
随着生成语言模型的兴起,机器生成文本检测已成为一个关键挑战。目前有多种模型可供选择,但不一致的数据集、评估指标和评估策略使得模型有效性的比较变得模糊。为了解决这一问题,我们评估了来自六个不同系统的15种检测模型,以及七个训练模型,在七个英语文本测试集和三个创意人类撰写的数据集上进行比较。我们提供了模型性能的实证分析,训练和评估数据的影响,以及关键指标的影响。我们发现没有单一系统在所有领域都表现出色,几乎所有系统在某些任务上都是有效的,而模型性能的表现与数据集和指标选择密切相关。我们发现基于数据集和指标的模型排名存在高度变异,并且在高风险领域的新型人类撰写文本上整体表现较差。在各个数据集和指标中,我们发现常常被假设或忽视的方法选择对于清晰准确地反映模型性能至关重要。
cs.CL / 18 / 2604.16622
Aligning Backchannel and Dialogue Context Representations via Contrastive LLM Fine-Tuning
通过对比学习微调对齐后续信号和对话上下文表示
Abstract
Backchannels (e.g., 'yeah', 'mhm', and 'right') are short, non-interruptive feedback signals whose lexical form and prosody jointly convey pragmatic meaning. While prior computational research has largely focused on predicting backchannel timing, the relationship between lexico-prosodic form and meaning remains underexplored. We propose a two-stage framework: first, fine-tuning large language models on dialogue transcripts to derive rich contextual representations; and second, learning a joint embedding space for dialogue contexts and backchannel realizations. We evaluate alignment with human perception via triadic similarity judgments (prosodic and cross-lexical) and a context-backchannel suitability task. Our results demonstrate that the learned projections substantially improve context-backchannel retrieval compared to previous methods. In addition, they reveal that backchannel form is highly sensitive to extended conversational context and that the learned embeddings align more closely with human judgments than raw WavLM features.
Chinese Translation
后续信号(例如,`是的'、`嗯'和`对')是短小的非干扰性反馈信号,其词汇形式和韵律共同传达语用意义。尽管之前的计算研究主要集中在预测后续信号的时机上,但词汇-韵律形式与意义之间的关系仍然未得到充分探讨。我们提出了一个两阶段框架:首先,对对话文本进行大规模语言模型的微调,以获取丰富的上下文表示;其次,学习对话上下文和后续信号实现的联合嵌入空间。我们通过三元相似性判断(韵律和跨词汇)以及上下文-后续信号适用性任务来评估与人类感知的一致性。我们的结果表明,与之前的方法相比,学习到的投影显著改善了上下文-后续信号的检索。此外,它们还揭示了后续信号的形式对扩展的对话上下文高度敏感,并且学习到的嵌入与人类判断的对齐程度高于原始的 WavLM 特征。
cs.CL / 19 / 2604.16625
AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
AdaExplore:基于失败驱动的适应与保持多样性的高效内核生成搜索
Abstract
Recent large language model (LLM) agents have shown promise in using execution feedback for test-time adaptation. However, robust self-improvement remains far from solved: most approaches still treat each problem instance independently, without accumulating reusable knowledge. This limitation is particularly pronounced in domain-specific languages such as Triton, which are underrepresented in LLM pretraining data. Their strict constraints and non-linear optimization landscape further make naive generation and local refinement unreliable. We propose AdaExplore, an agent framework that enables self-improvement via accumulated execution feedback for performance-critical kernel code generation through two complementary stages: failure-driven adaptation and diversity-preserving search, jointly improving correctness and optimization performance without additional fine-tuning or external knowledge. In the adaptation stage, the agent synthesizes tasks and converts recurring failures into a reusable memory of validity rules, helping subsequent generations remain within the feasible set. In the search stage, the agent organizes candidate kernels as a tree and alternates between small local refinements and larger structural regeneration, allowing it to explore the optimization landscape beyond local optima. Experiments on kernel runtime optimization benchmarks validate these gains: AdaExplore achieves 3.12x and 1.72x speedups on KernelBench Level-2 and Level-3, respectively, within 100 steps, and continues to improve with additional computation.
Chinese Translation
最近的大型语言模型(LLM)代理在使用执行反馈进行测试时适应方面显示出良好的前景。然而,稳健的自我改进仍然远未解决:大多数方法仍然独立地处理每个问题实例,而没有积累可重用的知识。这一局限性在特定领域语言(如Triton)中尤为明显,因为这些语言在LLM预训练数据中代表性不足。它们的严格约束和非线性优化景观进一步使得简单的生成和局部优化变得不可靠。我们提出了AdaExplore,一个通过积累执行反馈来实现自我改进的代理框架,旨在为性能关键的内核代码生成提供支持,分为两个互补阶段:基于失败驱动的适应和保持多样性的搜索,联合提高正确性和优化性能,而无需额外的微调或外部知识。在适应阶段,代理合成任务并将重复出现的失败转化为可重用的有效性规则内存,帮助后续生成保持在可行集合内。在搜索阶段,代理将候选内核组织成树形结构,并在小规模局部优化和更大结构重生之间交替进行,从而使其能够探索超越局部最优的优化景观。在内核运行时优化基准上的实验验证了这些提升:AdaExplore在KernelBench Level-2和Level-3上分别实现了3.12倍和1.72倍的加速,并随着额外计算的增加而持续改进。
cs.CL / 20 / 2604.16651
Migrant Voices, Local News: Insights on Bridging Community Needs with Media Content
移民声音与地方新闻:关于弥合社区需求与媒体内容的洞察
Abstract
Research shows news consumption differs across demographics, yet little is known about non-mainstream audiences, especially in relation to local media. Our study addresses this gap by examining how French-speaking migrants in a mid-size European city engage with local news, and whether their needs are reflected in coverage. Eight community members participated in focus groups, whose insights guided the selection of natural language processing methods (topic modeling, information retrieval, sentiment analysis, and readability) applied to over 2000 hyper-local news articles. Results showed that while articles frequently covered local events, gaps remained in topics important to participants. Sentiment analysis revealed a generally positive tone, and readability measures indicated an intermediate-advanced French level, raising questions about accessibility for integration. Our work contributes to bridging the gap between local news platforms' content and diverse readers' needs, and could inform local media organizations about opportunities to expand their current news story coverage to appeal to more diverse audiences.
Chinese Translation
研究表明,不同人口统计特征的新闻消费存在差异,但关于非主流受众,尤其是与地方媒体相关的研究仍然较少。本研究通过考察在一个中型欧洲城市中讲法语的移民如何与地方新闻互动,以及他们的需求是否在报道中得到反映,来填补这一空白。八位社区成员参与了焦点小组讨论,他们的见解指导了对超过2000篇超本地新闻文章应用自然语言处理方法(主题建模、信息检索、情感分析和可读性分析)的选择。结果显示,尽管文章经常报道地方事件,但在参与者认为重要的话题上仍存在空白。情感分析揭示出整体积极的语气,而可读性测量则表明法语水平处于中高级,提出了关于融入可及性的问题。我们的研究有助于弥合地方新闻平台内容与多样化读者需求之间的差距,并可能为地方媒体组织提供扩展当前新闻报道以吸引更多样化受众的机会。
cs.CL / 21 / 2604.16654
IYKYK (But AI Doesn't): Automated Content Moderation Does Not Capture Communities' Heterogeneous Attitudes Towards Reclaimed Language
IYKYK(但人工智能不懂):自动内容审核未能捕捉社区对重获语言的异质态度
Abstract
Reclaimed slur usage is a common and meaningful practice online for many marginalized communities. It serves as a source of solidarity, identity, and shared experience. However, contemporary automated and AI-based moderation tools for online content largely fail to distinguish between reclaimed and hateful uses of slurs, resulting in the suppression of marginalized voices. In this work, we use quantitative and qualitative methods to examine the attitudes of social media users in LGBTQIA+, Black, and women communities around reclaimed slurs targeting our focus groups including the f-word, n-word, and b-word. With social media users from these communities, we collect and analyze an annotated online slur usage corpus. The corpus includes annotators' perceptions of whether an online text containing a slur should be flagged as hate speech, as well as contextual features of the slur usage. Across all communities and annotation questions, we observe low inter-annotator agreement, indicating substantial disagreement among in-group annotators. This is compounded by the fact that, absent clear contextual signals of identity and intent, even in-group members may disagree on how to interpret reclaimed slur usage online. Semi-structured interviews with annotators suggest that differences in lived experience and personal history contribute to this variation as well. We find poor alignment between annotator judgments and automated hate speech assessments produced by Perspective API. We further observe that certain features of a text such as whether the slur usage was derogatory and if the slur was targeted at oneself are more associated with whether annotators report the text as hate speech. Together, these findings highlight the inherent subjectivity and contextual nature of how marginalized communities interpret slurs online.
Chinese Translation
重获污名的使用在许多边缘化社区中是一种常见且有意义的在线实践。它作为团结、身份和共同经验的来源。然而,当代基于自动化和人工智能的在线内容审核工具在区分重获和仇恨性污名的使用方面大多失败,导致边缘化声音的压制。在本研究中,我们采用定量和定性方法,考察LGBTQIA+、黑人和女性社区社交媒体用户对针对我们关注群体的重获污名(包括f-word、n-word和b-word)的态度。我们与这些社区的社交媒体用户合作,收集并分析一个注释的在线污名使用语料库。该语料库包括注释者对包含污名的在线文本是否应被标记为仇恨言论的看法,以及污名使用的上下文特征。在所有社区和注释问题中,我们观察到注释者之间的低一致性,表明内部注释者之间存在显著分歧。更复杂的是,在缺乏明确的身份和意图的上下文信号的情况下,即使是内部成员也可能对如何解读在线重获污名的使用存在分歧。与注释者的半结构化访谈表明,生活经历和个人历史的差异也导致了这种变异。我们发现注释者的判断与Perspective API生成的自动仇恨言论评估之间存在较差的一致性。我们进一步观察到,文本的某些特征,例如污名使用是否具有贬义以及污名是否针对自己,与注释者报告该文本为仇恨言论的可能性更为相关。综合来看,这些发现突显了边缘化社区在线解读污名的固有主观性和上下文特性。
cs.CL / 22 / 2604.16656
Defragmenting Language Models: An Interpretability-based Approach for Vocabulary Expansion
语言模型的碎片整理:基于可解释性的词汇扩展方法
Abstract
All languages are equal; when it comes to tokenization, some are more equal than others. Tokens are the hidden currency that dictates the cost and latency of access to contemporary LLMs. However, many languages written in non-Latin scripts observe a poor exchange rate: LLMs take several multiples of tokens to encode the same information in many languages as they do for English. Our analysis reveals that this issue, known as 'token over-fragmentation', persists in modern open-weight LLMs. The standard remedy is vocabulary expansion that adds target language items missing from the model's vocabulary. In this work, we comprehensively study and advance interpretability-based vocabulary expansion, a new research direction. We focus on two core decisions in the vocabulary expansion process: What items should we add? and How should we initialize their corresponding input and output embeddings? First, we question the conventional use of frequency-based methods to choose candidate vocabulary items to add (a decision long treated as settled), and show that interpretability-based methods offer a superior performance-token efficiency trade-off. Next, we strengthen the case for interpretability-based embedding initialization by showing large gains (~20 pts) over baseline initialization methods for several languages written in non-Latin scripts. We identify the phenomenon of "subword detokenization" where models progressively merge fragmented subword tokens into larger subwords across layers. Grounded in our analysis of this phenomenon, we propose FragMend to further push the efficiency ceiling of interpretability-based expansion. We validate the effectiveness of FragMend through comparison against strong baselines and we present extensive analysis of its design choices.
Chinese Translation
所有语言都是平等的;但在标记化方面,有些语言比其他语言更为平等。标记是决定现代大型语言模型(LLMs)访问成本和延迟的隐性货币。然而,许多使用非拉丁文字书写的语言在交换率上表现不佳:LLMs在编码许多语言的信息时所需的标记数量是英语的几倍。我们的分析揭示了这一问题,即“标记过度碎片化”,在现代开放权重的LLMs中依然存在。标准的解决方案是词汇扩展,即向模型的词汇中添加缺失的目标语言项目。在本研究中,我们全面研究并推进基于可解释性的词汇扩展,这是一项新的研究方向。我们关注词汇扩展过程中的两个核心决策:我们应该添加哪些项目?以及我们应该如何初始化它们对应的输入和输出嵌入?首先,我们质疑传统的基于频率的方法来选择候选词汇项目(这一决策长期以来被视为定论),并展示了基于可解释性的方法在性能与标记效率的权衡上提供了更优的表现。接下来,我们通过展示在多个使用非拉丁文字书写的语言中,基于可解释性的嵌入初始化相较于基线初始化方法有约20个点的显著提升,进一步加强了基于可解释性的嵌入初始化的论证。我们识别出“子词去标记化”现象,即模型在各层中逐步将碎片化的子词标记合并为更大的子词。基于对这一现象的分析,我们提出了FragMend,以进一步推动基于可解释性扩展的效率上限。我们通过与强基线的比较验证了FragMend的有效性,并对其设计选择进行了广泛的分析。
cs.CL / 23 / 2604.16665
CBRS: Cognitive Blood Request System with Bilingual Dataset and Dual-Layer Filtering for Multi-Platform Social Streams
CBRS:具有双语数据集和双层过滤的认知血液请求系统,用于多平台社交流
Abstract
Urgent blood donation seeking posts and messages on social media often go unnoticed due to the overwhelming volume of daily communications. Traditional app-based systems, reliant on manual input, struggle to reach users in low-resource settings, delaying critical responses. To address this, we introduce the Cognitive Blood Request System (CBRS), a multi-platform framework that efficiently filters and parses blood donation requests from social media streams using a cost-efficient dual-layered architecture. To do so, we curate a novel dataset of 11K parsed blood donation request messages in Bengali, English, and transliterated Bengali, capturing the linguistic diversity of real social media communications. The inclusion of adversarial negatives further enhances the robustness of our model. CBRS achieves an impressive 99% accuracy and precision in filtering, surpassing benchmark methods. In the parsing task, our LoRA finetuned Llama-3.2-3B model achieves 92% zero-shot accuracy, surpassing the base model by 41.54% and exceeding the few-shot performance of GPT-4o-mini, Gemini-2.0-Flash, and other LLMs, while resulting in a 35x reduction in input token usage. This work lays a robust foundation for scalable, inclusive information extraction in time-sensitive, object-focused tasks. Our code, dataset, and trained models are publicly available at https://github.com/aaniksahaa/CBRS.
Chinese Translation
由于每日通信量庞大,社交媒体上紧急寻求血液捐赠的帖子和消息常常被忽视。依赖手动输入的传统应用程序系统在资源匮乏的环境中难以触达用户,从而延误了关键响应。为了解决这一问题,我们提出了认知血液请求系统(Cognitive Blood Request System, CBRS),这是一个多平台框架,利用高效的双层架构从社交媒体流中有效过滤和解析血液捐赠请求。为此,我们策划了一个新颖的数据集,包含11,000条用孟加拉语、英语和音译孟加拉语解析的血液捐赠请求消息,捕捉了真实社交媒体交流的语言多样性。对抗性负样本的加入进一步增强了我们模型的鲁棒性。CBRS在过滤任务中实现了99%的准确率和精确度,超越了基准方法。在解析任务中,我们经过LoRA微调的Llama-3.2-3B模型实现了92%的零样本准确率,超出基础模型41.54%,并超过了GPT-4o-mini、Gemini-2.0-Flash及其他大型语言模型的少样本表现,同时输入标记使用量减少了35倍。这项工作为在时间敏感、以目标为导向的任务中进行可扩展、包容性的信息提取奠定了坚实基础。我们的代码、数据集和训练模型已公开发布在 https://github.com/aaniksahaa/CBRS。
cs.CL / 24 / 2604.16686
No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation
无恶化的上下文感知解码:防止上下文条件生成中的中性回归
Abstract
Large language models (LLMs) can answer questions and summarize documents when conditioned on external contexts (e.g., retrieved evidence), yet context use remains unreliable: models may overwrite an already-correct output (neutral regression) even when the context is non-informative. We formalize neutral regression as a do-no-harm requirement and quantify it by measuring accuracy drops on baseline-correct items under answer-consistent contexts. We propose No-Worse Context-Aware Decoding (NWCAD), a decode-time adapter built on a two-stream setup with a two-stage gate: it backs off to no-context decoding when the context is non-informative, and otherwise uses context-conditioned decoding with a CAD-style fallback under uncertainty. We evaluate NWCAD on benchmarks that separate do-no-harm reliability from context utilization (accuracy gains on genuinely helpful contexts). NWCAD prevents neutral regression on baseline-correct items while preserving strong context-driven accuracy on helpful contexts.
Chinese Translation
大型语言模型(LLMs)在依赖外部上下文(例如检索到的证据)时能够回答问题和总结文档,但上下文的使用仍然不可靠:即使上下文没有提供信息,模型也可能会覆盖已经正确的输出(中性回归)。我们将中性回归形式化为一种不造成伤害的要求,并通过在答案一致的上下文下测量基线正确项的准确性下降来量化它。我们提出了无恶化上下文感知解码(No-Worse Context-Aware Decoding, NWCAD),这是一种基于双流设置和两阶段门控的解码时适配器:当上下文没有信息时,它回退到无上下文解码;否则,在不确定性下使用上下文条件解码并采用CAD风格的回退。我们在将不造成伤害的可靠性与上下文利用(在真正有帮助的上下文中的准确性提升)分开的基准上评估NWCAD。NWCAD在基线正确项上防止了中性回归,同时在有帮助的上下文中保持了强大的上下文驱动准确性。
cs.CL / 25 / 2604.16704
The impact of postediting on AI generative translation in Yemeni context: Translating literary prose by ChatGPT
后编辑对也门语境下AI生成翻译的影响:ChatGPT翻译文学散文
Abstract
This study examines the role of artificial intelligence in translation, focusing on ChatGPT, specifically ChatGPT-4, and the extent to which human postediting is required in literary translation. A mixed-method approach was adopted, involving 30 professional translators who evaluated and postedited AI-generated translations of selected Arabic and English literary texts. The results show that although AI improves translation speed and accessibility, it remains limited in handling cultural, stylistic, and figurative aspects of language. Participants generally confirmed the necessity of human postediting, particularly in novels and drama. The findings point to an emerging human-machine collaboration model rather than the replacement of human translators. The study concludes that AI should be used as a supportive tool, while human expertise remains essential for ensuring translation quality and cultural appropriateness.
Chinese Translation
本研究探讨了人工智能在翻译中的作用,重点关注ChatGPT,特别是ChatGPT-4,以及在文学翻译中人类后编辑的必要性。采用了混合方法,涉及30名专业翻译人员对选定的阿拉伯语和英语文学文本的AI生成翻译进行了评估和后编辑。结果表明,尽管AI提高了翻译的速度和可及性,但在处理语言的文化、风格和比喻方面仍然存在局限性。参与者普遍确认了人类后编辑的必要性,尤其是在小说和戏剧中。研究结果表明,出现的是人机协作模型,而非人类翻译者的替代。研究结论认为,AI应作为一种辅助工具使用,而人类专业知识仍然是确保翻译质量和文化适宜性的关键。
cs.CL / 26 / 2604.16717
Detecting Alarming Student Verbal Responses using Text and Audio Classifier
使用文本和音频分类器检测令人担忧的学生口头反应
Abstract
This paper addresses a critical safety gap in the use of Automated Verbal Response Scoring (AVRS). We present a novel hybrid framework for troubled student detection that combines a text classifier, trained to detect responses based on their content, and an audio classifier, trained to detect responses using prosodic markers. This approach overcomes key limitations of traditional AVRS systems by considering both the content and the prosody of responses, achieving enhanced performance in identifying potentially concerning responses. The system can expedite human review, which can be life-saving when timely intervention is crucial.
Chinese Translation
本文针对自动化口头反应评分(AVRS)中存在的关键安全隐患进行了探讨。我们提出了一种新颖的混合框架,用于识别有问题的学生,该框架结合了一个文本分类器(用于基于内容检测反应)和一个音频分类器(用于利用韵律特征检测反应)。这种方法通过同时考虑反应的内容和韵律,克服了传统AVRS系统的主要局限性,从而在识别潜在令人担忧的反应方面实现了更好的性能。该系统能够加快人工审核过程,尤其在及时干预至关重要时,可能挽救生命。
cs.CL / 27 / 2604.16744
Evaluating Adaptive Personalization of Educational Readings with Simulated Learners
评估基于模拟学习者的教育阅读材料的自适应个性化
Abstract
We present a framework for evaluating adaptive personalization of educational reading materials with theory-grounded simulated learners. The system builds a learning-objective and knowledge-component ontology from open textbooks, curates it in a browser-based Ontology Atlas, labels textbook chunks with ontology entities, and generates aligned reading-assessment pairs. Simulated readers learn from passages through a Construction-Integration-inspired memory model with DIME-style reader factors, KREC-style misconception revision, and an open New Dale-Chall readability signal. Answers are produced by score-based option selection over the learner's explicit memory state, while BKT drives adaptation. Across three sampled subject ontologies and matched cohorts of 50 simulated learners per condition, adaptive reading significantly improved outcomes in computer science, yielded smaller positive but inconclusive gains in inorganic chemistry, and was neutral to slightly negative in general biology.
Chinese Translation
我们提出了一个框架,用于评估基于理论的模拟学习者对教育阅读材料的自适应个性化。该系统从开放教科书中构建学习目标和知识组件本体,并在基于浏览器的本体图谱中进行整理,使用本体实体对教科书片段进行标注,并生成对齐的阅读-评估对。模拟读者通过受 Construction-Integration 启发的记忆模型学习段落,该模型结合了 DIME 风格的读者因素、KREC 风格的误解修正以及开放的 New Dale-Chall 可读性信号。答案通过基于分数的选项选择生成,基于学习者的显性记忆状态,而 BKT 则驱动适应性。在三个抽样的学科本体和每种条件下匹配的 50 名模拟学习者的队列中,自适应阅读在计算机科学领域显著改善了结果,在无机化学中产生了较小的积极但不确定的增益,而在普通生物学中则表现中性至轻微负面。
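The abstract above states that BKT drives adaptation. As a minimal sketch of the standard Bayesian Knowledge Tracing update (the parameter values and the names `BKTParams` and `bkt_update` are illustrative assumptions, not the paper's settings or code), the mastery estimate is first conditioned on the observed answer and then advanced by the learning probability:

```python
from dataclasses import dataclass


@dataclass
class BKTParams:
    # All four values are hypothetical defaults chosen for illustration.
    p_init: float = 0.2    # P(L0): prior probability the skill is already known
    p_learn: float = 0.15  # P(T): probability of learning at each opportunity
    p_slip: float = 0.1    # P(S): probability of a wrong answer despite knowing
    p_guess: float = 0.25  # P(G): probability of a right answer without knowing


def bkt_update(p_known: float, correct: bool, params: BKTParams) -> float:
    """Return the updated mastery probability after one observed answer."""
    if correct:
        numerator = p_known * (1 - params.p_slip)
        denominator = numerator + (1 - p_known) * params.p_guess
    else:
        numerator = p_known * params.p_slip
        denominator = numerator + (1 - p_known) * (1 - params.p_guess)
    posterior = numerator / denominator
    # Learning transition: an unknown skill may become known after practice.
    return posterior + (1 - posterior) * params.p_learn


params = BKTParams()
after_correct = bkt_update(params.p_init, True, params)
after_wrong = bkt_update(params.p_init, False, params)
print(after_correct, after_wrong)
```

An adaptive policy could, for instance, serve an easier passage while the estimate stays below some threshold; that policy is not specified in the abstract and is left out of the sketch.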
cs.CL / 28 / 2604.16757
Expressing Social Emotions: Misalignment Between LLMs and Human Cultural Emotion Norms
表达社会情感:大型语言模型与人类文化情感规范之间的不一致
Abstract
The expression of emotions that serve social purposes, such as asserting independence or fostering interdependence, is central to human interactions and varies systematically across cultures. As LLMs are increasingly used to simulate human behavior in culturally nuanced interactions, it is important to understand whether they faithfully capture human patterns of social emotion expression. When LLM responses are not culturally aligned, their utility is compromised -- particularly when users assume they are interacting with a culturally attuned interlocutor, and may act on advice that proves inappropriate in their cultural context. We present a psychologically informed evaluation framework of cross-cultural social emotion expression in LLMs. Using a human study comparing European American and Latin American participants' expression of engaging and disengaging emotions, we evaluate six frontier LLMs on their ability to reflect culturally differentiated patterns for expressing social emotions. We find systematic misalignment between model and human behavior: all models express engaging emotions more than disengaging ones, with particularly stark differences observed for the generally well-represented European American persona. We further highlight that LLM responses are highly concentrated and deterministic, failing to capture the diversity of human responses in expressing social emotions. Our ablation analyses reveal that these patterns are robust to sampling temperatures, partially sensitive to prompt language, and dependent on the response elicitation format. Together, our findings highlight limitations in how current LLMs represent the interaction of cultural and emotional axes, particularly when expressing social emotions, with direct implications for their deployment in cross-cultural affective contexts.
Chinese Translation
表达社会目的的情感,如宣示独立或促进相互依赖,是人类互动的核心,且在不同文化中系统性地有所不同。随着大型语言模型(LLMs)越来越多地用于模拟人类在文化细微互动中的行为,了解它们是否忠实地捕捉人类社会情感表达的模式变得至关重要。当LLM的响应与文化不一致时,它们的效用就会受到影响——尤其是在用户假设自己正在与一个文化敏感的对话者互动时,可能会依据在其文化背景下不适当的建议采取行动。我们提出了一个心理学视角的跨文化社会情感表达评估框架。通过对比欧洲裔美国人和拉丁美洲参与者在表达参与性和非参与性情感方面的研究,我们评估了六个前沿LLM在反映文化差异化的社会情感表达模式方面的能力。我们发现模型与人类行为之间存在系统性的不一致:所有模型在表达参与性情感时的频率高于非参与性情感,尤其是在通常表现良好的欧洲裔美国人角色中观察到显著差异。我们进一步强调,LLM的响应高度集中且具有决定性,未能捕捉人类在表达社会情感时的多样性。我们的消融分析表明,这些模式对采样温度具有稳健性,对提示语言部分敏感,并依赖于响应引导格式。综上所述,我们的研究结果突显了当前LLM在表达社会情感时文化与情感轴线交互的表现局限性,这对其在跨文化情感背景下的应用具有直接影响。
cs.CL / 29 / 2604.16767
When Misinformation Speaks and Converses: Rethinking Fact-Checking in Audio Platforms
当错误信息发声与对话:重新思考音频平台中的事实核查
Abstract
Audio platforms have evolved beyond entertainment. They have become central to public discourse, from podcasts and radio to WhatsApp voice notes and live streams. With millions of shows and hundreds of millions of listeners, audio platforms are now a major channel for misinformation. Yet existing fact-checking pipelines are mostly designed for written claims, overlooking the unique properties of spoken media. We argue that audio misinformation is not merely textual content with transcripts: it is structurally different because it is both spoken - carrying persuasive force through prosody, pacing, and emotion - and conversational - unfolding across turns, speakers, and episodes. These dual properties introduce verification difficulties that traditional methods rarely face. This position paper synthesizes evidence across modalities and platforms, examines datasets and methods, and highlights why existing pipelines fail on audio. We argue that advancing fact-checking requires rethinking verification pipelines around the spoken and conversational realities of audio.
Chinese Translation
音频平台已经超越了娱乐的范畴,成为公共话语的核心,从播客和广播到WhatsApp语音备忘录和直播。随着数百万个节目和数亿听众的涌现,音频平台如今已成为错误信息传播的重要渠道。然而,现有的事实核查流程大多是为书面声明设计的,忽视了口语媒体的独特特性。我们认为,音频错误信息不仅仅是带有文字记录的文本内容:它在结构上是不同的,因为它既是口语的——通过韵律、节奏和情感传递说服力——又是对话的——在轮次、发言者和集数之间展开。这两种特性引入了传统方法很少面临的验证困难。本文综合了跨媒介和平台的证据,考察了数据集和方法,并强调了现有流程在音频上失效的原因。我们认为,推动事实核查的进步需要围绕音频的口语和对话现实重新思考验证流程。
cs.CL / 30 / 2604.16774
StageMem: Lifecycle-Managed Memory for Language Models
StageMem:语言模型的生命周期管理内存
Abstract
Long-horizon language model systems increasingly rely on persistent memory, yet many current designs still treat memory primarily as a static store: write an item, place it into memory, and retrieve it later if needed. We argue that this framing does not adequately capture the practical memory-control problem in deployed LLM systems. In realistic settings, the difficulty is often not merely forgetting useful information, but retaining too many uncertain items, forgetting important content in the wrong order, and giving users little trust in what will persist over time. We propose StageMem, a lifecycle-managed memory framework that treats memory as a stateful process rather than a passive repository. StageMem organizes memory into three stages -- transient, working, and durable memory -- and models each item with explicit confidence and strength. This separates shallow admission from long-term commitment: information may first be written at low cost and only later be promoted, retained, updated, or evicted as evidence and pressure evolve. Under controlled pressure regimes, this decomposition helps preserve late-important content while keeping memory burden and deeper-tier pollution more controlled. Adapted external tasks provide boundary evidence that the same schema remains compatible with stronger retrieval structure outside pure synthetic control. We present StageMem as a principled decomposition of the memory-control problem for language model systems.
Chinese Translation
长期语言模型系统越来越依赖持久内存,然而许多当前设计仍然主要将内存视为静态存储:写入一个项目,将其放入内存中,并在需要时检索。我们认为这种框架并未充分捕捉到在部署的LLM系统中实际的内存控制问题。在现实环境中,困难往往不仅仅是忘记有用的信息,而是保留过多不确定的项目,以错误的顺序忘记重要内容,并且让用户对哪些内容会随着时间的推移而持久化缺乏信任。我们提出了StageMem,一个生命周期管理的内存框架,将内存视为一个有状态的过程,而不是一个被动的存储库。StageMem将内存组织为三个阶段——短暂内存、工作内存和持久内存——并对每个项目进行明确的置信度和强度建模。这将浅层接纳与长期承诺分开:信息可以首先以低成本写入,随后根据证据和压力的发展被提升、保留、更新或驱逐。在受控的压力机制下,这种分解有助于保留后期重要内容,同时更好地控制内存负担和深层污染。适应的外部任务提供了边界证据,表明相同的模式与纯合成控制之外的更强检索结构兼容。我们将StageMem呈现为语言模型系统内存控制问题的原则性分解。
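The lifecycle described in the abstract above (cheap admission first, then promotion or eviction as evidence and pressure evolve) can be sketched roughly as follows. This is a toy illustration under stated assumptions: the thresholds, the `reinforce`/`promote`/`evict_under_pressure` helpers, and the eviction ordering are hypothetical and are not taken from StageMem itself.

```python
from dataclasses import dataclass

# Stage names follow the abstract; the ordering encodes depth of commitment.
STAGES = ("transient", "working", "durable")


@dataclass
class MemoryItem:
    text: str
    confidence: float  # how certain the system is that the item is correct
    strength: float    # how strongly the item has been reinforced
    stage: str = "transient"


def reinforce(item: MemoryItem, evidence: float = 0.2) -> None:
    """Hypothetical update: new supporting evidence raises both scores."""
    item.confidence = min(1.0, item.confidence + evidence)
    item.strength = min(1.0, item.strength + evidence)


def promote(item: MemoryItem, threshold: float = 0.6) -> str:
    """Move the item one stage deeper only if both scores clear the threshold."""
    idx = STAGES.index(item.stage)
    if idx < len(STAGES) - 1 and min(item.confidence, item.strength) >= threshold:
        item.stage = STAGES[idx + 1]
    return item.stage


def evict_under_pressure(items: list, capacity: int) -> list:
    """Keep at most `capacity` items, preferring deeper stages, then strength."""
    ranked = sorted(
        items,
        key=lambda it: (STAGES.index(it.stage), it.strength),
        reverse=True,
    )
    return ranked[:capacity]


note = MemoryItem("user prefers metric units", confidence=0.3, strength=0.3)
promote(note)      # stays transient: scores are below the threshold
reinforce(note)
reinforce(note)
promote(note)      # now eligible for the working stage
print(note.stage)
```

Separating `reinforce` from `promote` mirrors the abstract's distinction between shallow admission and long-term commitment: writing is cheap, while moving deeper requires accumulated evidence.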
cs.CL / 31 / 2604.16787
When Informal Text Breaks NLI: Tokenization Failure, Distribution Shift, and Targeted Mitigations
当非正式文本破坏自然语言推理:分词失败、分布变化与针对性缓解措施
Abstract
We study how informal surface forms degrade NLI accuracy in ELECTRA-small (14M) and RoBERTa-large (355M) across four transforms applied to SNLI and MultiNLI: slang substitution, emoji replacement, Gen-Z filler tokens, and their combination. Slang substitution (replacing formal words with informal equivalents, e.g., "going to" -> "gonna", "friend" -> "homie") causes minimal degradation (at most 1.1pp): slang vocabulary falls largely within WordPiece coverage, so the tokenizer handles it without signal loss. Emoji replaces content words with Unicode characters that ELECTRA's WordPiece tokenizer maps to [UNK], destroying the input signal before any learned parameters see it (93.6% of emoji examples contain at least one [UNK], mean 2.91 per example). Noise tokens (no cap, deadass, tbh) are fully in-vocabulary but absent from NLI training data, consistent with the model assigning them inferential weight they do not carry. The two failure modes respond to different interventions: preprocessing recovers emoji accuracy by normalizing text before tokenization; augmentation handles noise by exposing the model to noise-bearing examples during training. A hybrid of both achieves 88.93% on the combined variant for ELECTRA on SNLI (up from 75.88%), with no statistically significant drop on clean text. Against GPT-4o-mini zero-shot, unmitigated ELECTRA is significantly worse on transformed variants (p < 0.0001); hybrid ELECTRA surpasses it across all SNLI variants and reaches statistical parity on MultiNLI.
Chinese Translation
我们研究了非正式表面形式如何降低 ELECTRA-small (14M) 和 RoBERTa-large (355M) 在 SNLI 和 MultiNLI 上的自然语言推理 (NLI) 准确性,涉及四种转换:俚语替换、表情符号替换、Z世代填充词及其组合。俚语替换(将正式词汇替换为非正式等价词,例如“going to” -> “gonna”,“friend” -> “homie”)造成的准确性下降最小(最多 1.1 个百分点):俚语词汇大部分在 WordPiece 覆盖范围内,因此分词器能够处理而不会丢失信号。表情符号用 Unicode 字符替换内容词,而 ELECTRA 的 WordPiece 分词器将其映射为 [UNK],在任何学习参数看到之前就破坏了输入信号(93.6% 的表情符号示例至少包含一个 [UNK],平均每个示例 2.91 个)。噪声词(no cap, deadass, tbh)完全在词汇表内,但在 NLI 训练数据中缺失,这与模型赋予它们的推理权重不符。两种失败模式对不同的干预措施有响应:预处理通过在分词之前规范化文本来恢复表情符号的准确性;增强通过在训练期间让模型接触带有噪声的示例来处理噪声。两者的混合在 SNLI 上的 ELECTRA 组合变体中达到了 88.93%(从 75.88% 提升),且在干净文本上没有统计显著下降。与 GPT-4o-mini 的零-shot 结果相比,未经缓解的 ELECTRA 在转换变体上的表现显著更差(p < 0.0001);混合 ELECTRA 在所有 SNLI 变体上均超越了它,并在 MultiNLI 上达到了统计平衡。
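The abstract above attributes the emoji failure to out-of-vocabulary characters being mapped to [UNK], and reports that normalizing text before tokenization recovers accuracy. A minimal sketch of one such normalization step is shown here; it assumes emoji can be replaced by their lowercase Unicode names, which is an illustrative choice and not necessarily the paper's preprocessing. The code-point ranges in `is_emoji` are a rough approximation rather than a complete emoji definition.

```python
import unicodedata

VARIATION_SELECTOR_16 = 0xFE0F  # invisible modifier often attached to emoji


def is_emoji(ch: str) -> bool:
    """Approximate emoji test using two common Unicode blocks (an assumption)."""
    cp = ord(ch)
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF


def normalize_emoji(text: str) -> str:
    """Replace each emoji with its lowercase Unicode name before tokenization."""
    pieces = []
    for ch in text:
        if ord(ch) == VARIATION_SELECTOR_16:
            continue  # drop the selector so it cannot become an unknown token
        if is_emoji(ch):
            name = unicodedata.name(ch, "")
            pieces.append(f" {name.lower()} " if name else " ")
        else:
            pieces.append(ch)
    # Collapse the extra spaces introduced around replaced emoji.
    return " ".join("".join(pieces).split())


print(normalize_emoji("A 🐶 runs across the yard"))
```

Because the replacement happens on the raw string, the tokenizer only ever sees in-vocabulary words, which is the property the abstract's preprocessing intervention relies on.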
cs.CL / 32 / 2604.16826
Crowded in B-Space: Calibrating Shared Directions for LoRA Merging
拥挤的B空间:LoRA合并的共享方向校准
Abstract
Merging separately trained LoRA adapters is a practical alternative to joint multi-task training, but it often hurts performance. Existing methods usually treat the LoRA update $\Delta W = BA$ as a single object and do not distinguish the two LoRA matrices. We show that the main source of LoRA merge interference comes from the output-side matrix $B$. Across tasks, $B$ repeatedly uses a small set of shared directions, while $A$ remains much more task-specific. As a result, the merged adapter overemphasizes these shared directions, and task-specific information is lost. We propose Pico (Pre-merge interference calibration in output-space), a data-free method that calibrates $B$ before merge by downscaling over-shared directions and then rescaling the merged update. Pico plugs directly into existing merging methods such as Task Arithmetic, TIES, and TSV-M. Across eight different benchmarks from math, coding, finance, and medical domains, Pico improves average accuracy by 3.4-8.3 points over the corresponding base method and achieves the best overall average performance. Pico also enables merged adapters to outperform the LoRA trained with all task data. These results show that LoRA merging works better when the two LoRA matrices are treated separately.
Chinese Translation
分别训练的LoRA适配器的合并是多任务联合训练的一个实用替代方案,但它往往会影响性能。现有方法通常将LoRA更新 $\Delta W = BA$ 视为一个单一对象,未能区分两个LoRA矩阵。我们展示了LoRA合并干扰的主要来源来自输出侧矩阵 $B$。在不同任务中,$B$ 重复使用一小组共享方向,而 $A$ 则更具任务特异性。因此,合并后的适配器过度强调这些共享方向,导致任务特定信息的丢失。我们提出了Pico(输出空间中的预合并干扰校准),这是一种无数据的方法,通过缩小过度共享方向的比例来校准 $B$,然后对合并更新进行重新缩放。Pico可以直接集成到现有的合并方法中,如任务算术(Task Arithmetic)、TIES和TSV-M。在来自数学、编码、金融和医学领域的八个不同基准测试中,Pico相比于相应的基础方法提高了平均准确率3.4-8.3个百分点,并实现了最佳的整体平均性能。Pico还使得合并的适配器超越了使用所有任务数据训练的LoRA。这些结果表明,当将两个LoRA矩阵分别处理时,LoRA合并效果更佳。
cs.CL / 33 / 2604.16839
HeLa-Mem: Hebbian Learning and Associative Memory for LLM Agents
HeLa-Mem:用于大型语言模型代理的赫布学习与联想记忆
Abstract
Long-term memory is a critical challenge for Large Language Model agents, as fixed context windows cannot preserve coherence across extended interactions. Existing memory systems represent conversation history as unstructured embedding vectors, retrieving information through semantic similarity. This paradigm fails to capture the associative structure of human memory, wherein related experiences progressively strengthen interconnections through repeated co-activation. Inspired by cognitive neuroscience, we identify three mechanisms central to biological memory: association, consolidation, and spreading activation, which remain largely absent in current research. To bridge this gap, we propose HeLa-Mem, a bio-inspired memory architecture that models memory as a dynamic graph with Hebbian learning dynamics. HeLa-Mem employs a dual-level organization: (1) an episodic memory graph that evolves through co-activation patterns, and (2) a semantic memory store populated via Hebbian Distillation, wherein a Reflective Agent identifies densely connected memory hubs and distills them into structured, reusable semantic knowledge. This dual-path design leverages both semantic similarity and learned associations, mirroring the episodic-semantic distinction in human cognition. Experiments on LoCoMo demonstrate superior performance across four question categories while using significantly fewer context tokens. Code is available on GitHub: https://github.com/ReinerBRO/HeLa-Mem
Chinese Translation
长期记忆是大型语言模型代理面临的一个关键挑战,因为固定的上下文窗口无法在延续的互动中保持一致性。现有的记忆系统将对话历史表示为非结构化的嵌入向量,通过语义相似性检索信息。这种范式未能捕捉人类记忆的联想结构,其中相关经验通过重复的共同激活逐渐增强相互连接。受到认知神经科学的启发,我们确定了生物记忆的三个核心机制:联想、巩固和扩散激活,而这些机制在当前研究中大多缺失。为了解决这一问题,我们提出了HeLa-Mem,一种生物启发的记忆架构,将记忆建模为具有赫布学习动态的动态图。HeLa-Mem采用双层组织结构:(1) 通过共同激活模式演变的情节记忆图,以及 (2) 通过赫布蒸馏填充的语义记忆存储,其中反思代理识别密集连接的记忆中心,并将其提炼为结构化的、可重用的语义知识。这种双路径设计利用了语义相似性和学习到的联想,反映了人类认知中的情节-语义区分。在LoCoMo上的实验表明,在四个问题类别中表现优越,同时使用的上下文标记显著减少。代码可在GitHub上获取:https://github.com/ReinerBRO/HeLa-Mem
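The abstract above describes an episodic graph whose links strengthen through repeated co-activation, plus retrieval by spreading activation. The following toy sketch illustrates those two ideas only; the `HebbianGraph` class, the learning rate, the decay factor, and the one-hop spreading rule are all assumptions made for illustration and do not reproduce HeLa-Mem's actual update rules.

```python
from collections import defaultdict
from itertools import combinations


class HebbianGraph:
    """Undirected memory graph with Hebbian-style edge strengthening."""

    def __init__(self, lr: float = 0.1, decay: float = 0.01):
        self.lr = lr        # hypothetical increment per co-activation
        self.decay = decay  # hypothetical passive decay per update step
        self.w = defaultdict(float)

    def co_activate(self, nodes) -> None:
        """Decay all edges, then strengthen edges among jointly active nodes."""
        for key in list(self.w):
            self.w[key] *= 1 - self.decay
        for a, b in combinations(sorted(set(nodes)), 2):
            self.w[(a, b)] += self.lr

    def neighbors(self, node) -> dict:
        """Return {neighbor: weight} for every edge touching `node`."""
        result = {}
        for (a, b), weight in self.w.items():
            if a == node:
                result[b] = weight
            elif b == node:
                result[a] = weight
        return result

    def spread(self, seeds, top_k: int = 3) -> list:
        """One-hop spreading activation: rank non-seed nodes by summed weight."""
        scores = defaultdict(float)
        for seed in seeds:
            for node, weight in self.neighbors(seed).items():
                if node not in seeds:
                    scores[node] += weight
        return sorted(scores, key=scores.get, reverse=True)[:top_k]


graph = HebbianGraph()
graph.co_activate(["trip", "passport"])
graph.co_activate(["trip", "passport"])
graph.co_activate(["trip", "umbrella"])
print(graph.spread(["trip"]))
```

In this sketch, memories recalled together more often end up ranked higher during retrieval even when their embedding similarity to the query is equal, which is the associative behavior the abstract contrasts with purely similarity-based retrieval.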
cs.CL / 34 / 2604.16845
DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training
DART:通过蒸馏-审计-修复训练缓解差异感知大语言模型中的伤害漂移
Abstract
Large language models (LLMs) tuned for safety often avoid acknowledging demographic differences, even when such acknowledgment is factually correct (e.g., ancestry-based disease incidence) or contextually justified (e.g., religious hiring preferences). This identity-blindness yields incorrect responses, unnecessary refusals, or generic "equal-treatment" defaults. We study this via difference-awareness classification: given a question involving demographic groups, the task is not to answer directly, but to classify whether a correct answer requires recognizing group differences (yes) or whether groups should be treated identically (no). Crucially, fine-tuning for accuracy triggers harm drift: model-generated explanations become increasingly harmful as decision accuracy improves, whether by elaborating harmful content, introducing problematic assumptions, or failing to flag harms the baseline identified. To mitigate this, we introduce DART (Distill--Audit--Repair Training), which distills label-conditioned reasoning from a teacher, audits outputs for harm drift cases relative to baseline, and repairs problematic cases via severity-weighted fine-tuning. On eight benchmarks, DART improves Llama-3-8B-Instruct accuracy from 39.0% to 68.8%, with largest gains on equal-treatment prompts (11.3% -> 72.6%), while reducing harm drift cases by 72.6%. It also transfers to 280 open-ended real-world queries across medical, legal, policy, and educational domains, improving difference-appropriate responses from 39.8% to 77.5% while reducing refusals from 34.3% to 3.0%. Our results demonstrate that accuracy and safety need not conflict when explicit detection and repair mechanisms are in place.
Chinese Translation
针对安全性进行调优的大型语言模型(LLMs)通常避免承认人口统计差异,即使这种承认在事实层面上是正确的(例如,基于祖先的疾病发生率)或在情境上是合理的(例如,宗教招聘偏好)。这种对身份的盲视导致了不正确的回应、不必要的拒绝或通用的“平等待遇”默认设置。我们通过差异感知分类来研究这一问题:给定一个涉及人口群体的问题,任务不是直接回答,而是分类是否正确答案需要识别群体差异(是)或群体是否应被视为相同(否)。关键是,针对准确性进行微调会引发伤害漂移:随着决策准确性的提高,模型生成的解释变得越来越有害,无论是通过详细阐述有害内容、引入问题假设,还是未能标记基线识别的伤害。为此,我们引入了DART(蒸馏-审计-修复训练),它从教师模型中提取标签条件推理,审计输出以识别相对于基线的伤害漂移案例,并通过严重性加权微调修复问题案例。在八个基准测试中,DART将Llama-3-8B-Instruct的准确率从39.0%提高到68.8%,在平等待遇提示上获得了最大的提升(11.3% -> 72.6%),同时将伤害漂移案例减少了72.6%。它还适用于280个开放式的现实世界查询,涵盖医疗、法律、政策和教育领域,将适当差异的回应从39.8%提高到77.5%,同时将拒绝率从34.3%降低到3.0%。我们的结果表明,当存在明确的检测和修复机制时,准确性和安全性并不必然冲突。
cs.CL / 35 / 2604.16852
A Community-Based Approach for Stance Distribution and Argument Organization
基于社区的方法用于立场分布和论证组织
Abstract
The proliferation of online debate platforms and social media has led to an unprecedented volume of argumentative content on controversial topics from multiple perspectives. While this wealth of perspectives offers opportunities for developing critical thinking and breaking filter bubbles (Pariser 2011), the sheer volume and complexity of arguments make it challenging for readers to synthesize and comprehend diverse viewpoints effectively. We present an unsupervised graph-based approach for community-based argument organization that helps users navigate and understand complex argumentative landscapes. Our system analyzes collections of topic-focused articles and constructs a rich interaction graph by capturing multiple relationship types between arguments: topic similarity, semantic coherence, shared keywords, and common entities. We then employ community detection to identify argument communities that reveal homogeneous and heterogeneous viewpoint distributions. The detected communities are simplified through strategic graph operations to present users with digestible, yet comprehensive summaries of key argumentative patterns. Our approach requires no training data and can effectively process hundreds of articles while preserving nuanced relationships between arguments. Experimental results demonstrate our system's ability to identify meaningful argument communities and present them in an interpretable manner, facilitating users' understanding of complex socio-political debates.
Chinese Translation
在线辩论平台和社交媒体的迅猛发展导致了前所未有的关于争议话题的论证内容的涌现,涵盖了多种视角。尽管这种丰富的视角为发展批判性思维和打破信息茧房(Pariser 2011)提供了机会,但论证的庞大数量和复杂性使得读者难以有效地综合和理解多样的观点。我们提出了一种无监督的基于图的方法,用于基于社区的论证组织,帮助用户导航和理解复杂的论证环境。我们的系统分析了以主题为中心的文章集合,并通过捕捉论证之间的多种关系类型(如主题相似性、语义连贯性、共享关键词和共同实体)构建了一个丰富的互动图。然后,我们采用社区检测方法识别论证社区,从而揭示同质和异质的观点分布。通过战略性图操作简化检测到的社区,以向用户呈现易于消化但又全面的关键论证模式摘要。我们的方法不需要训练数据,并且能够有效处理数百篇文章,同时保留论证之间的细微关系。实验结果表明,我们的系统能够识别有意义的论证社区,并以可解释的方式呈现它们,从而促进用户对复杂社会政治辩论的理解。
cs.CL / 36 / 2604.16881
Incentivizing Parametric Knowledge via Reinforcement Learning with Verifiable Rewards for Cross-Cultural Entity Translation
通过具有可验证奖励的强化学习激励参数知识以实现跨文化实体翻译
Abstract
Cross-cultural entity translation remains challenging for large language models (LLMs), which usually produce literal or phonetic renderings instead of culturally appropriate translations in context. However, relevant knowledge may already be encoded in model parameters during large-scale pre-training. To incentivize the effective use of parametric knowledge, we propose EA-RLVR (Entity-Anchored Reinforcement Learning with Verifiable Rewards), a training framework that optimizes cross-cultural entity translation without relying on external knowledge bases. EA-RLVR anchors supervision on a verifiable, entity-level reward signal and incorporates lightweight structural gates to stabilize optimization. This design steers the model toward learning a robust reasoning process rather than merely imitating reference translations. We evaluate EA-RLVR on XC-Translate and observe consistent improvements in both entity translation accuracy and out-of-domain generalization. Specifically, training on merely 7k samples boosts Qwen3-14B's entity translation accuracy from 23.66% to 31.87% on a 50k test set comprising entirely unseen entities. The learned entity translation ability also transfers to general translation, yielding +1.35 XCOMET on WMT24++, which scales to +1.59 with extended optimization. Extensive analyses of $pass@k$ dynamics and reward formulations attribute these gains to superior sampling efficiency and a stable optimization landscape.
Chinese Translation
跨文化实体翻译对于大型语言模型(LLMs)仍然具有挑战性,因为通常产生的是字面或音韵的表达,而不是在语境中适当的文化翻译。然而,相关知识可能已经在大规模预训练过程中编码在模型参数中。为了激励有效利用参数知识,我们提出了EA-RLVR(基于实体的具有可验证奖励的强化学习),这是一个优化跨文化实体翻译的训练框架,无需依赖外部知识库。EA-RLVR将监督锚定在一个可验证的实体级奖励信号上,并结合轻量级结构门以稳定优化。该设计引导模型学习一个稳健的推理过程,而不仅仅是模仿参考翻译。我们在XC-Translate上评估了EA-RLVR,并观察到实体翻译准确性和领域外泛化的一致性提升。具体而言,仅在7000个样本上训练使Qwen3-14B的实体翻译准确性从23.66%提升至31.87%,而测试集包含完全未见的实体,规模为50000。学习到的实体翻译能力也转移到一般翻译上,在WMT24++上实现了+1.35的XCOMET,经过扩展优化后提升至+1.59。对$pass@k$动态和奖励公式的广泛分析将这些增益归因于更优的采样效率和稳定的优化景观。
cs.CL / 37 / 2604.16889
Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution
修剪、解释、评估:一个跨层转码器原生框架,通过特征归因实现高效电路发现
Abstract
Existing feature-interpretation pipelines typically operate on uniformly sampled units, but only a small fraction of cross-layer transcoder (CLT) features matter for a target behavior, and explaining and evaluating the rest incurs substantial unnecessary cost. We introduce the first CLT-native end-to-end framework, PIE, connecting Pruning, automatic Interpretation, and interpretation Evaluation, enabling systematic measurement of behavioral fidelity and downstream interpretability under pruning. To achieve this, we propose Feature Attribution Patching (FAP), a patch-grounded attribution method that scores CLT features by aggregating gradient-weighted write contributions, and FAP-Synergy, a synergy-aware reranking procedure. We evaluate pruning using KL-divergence behavior retention and assess interpretation quality with FADE-style metrics. Across IOI and Doc-String, across budgets $K \in \{50, 100, 200, 400, 800\}$, and across FAP, FAP-Synergy, Activation-Magnitude, and ACDC-style pruning, the FAP family consistently achieves the best or near-best fidelity, with FAP-Synergy providing its clearest gains in strict-budget regimes. On IOI with CLTs for Llama-3.2-1B and Gemma-2-2B, pruning to $K=100$ features matches the KL fidelity that random selection from the active feature set requires $\approx 4$k features to achieve ($\approx 40\times$ compression), enabling $\approx 40\times$ fewer interpretation/evaluation calls while substantially reducing low-quality features.
Chinese Translation
现有的特征解释管道通常在均匀采样的单元上运行,但只有一小部分跨层转码器(CLT)特征对目标行为有重要影响,其余特征则导致昂贵的特征解释和评估成本。我们提出了第一个CLT原生的端到端框架PIE,连接了修剪、自动解释和解释评估,能够在修剪下系统性地测量行为保真度和下游可解释性。为此,我们提出了特征归因补丁(Feature Attribution Patching, FAP),这是一种基于补丁的归因方法,通过聚合梯度加权的写入贡献来评分CLT特征,以及FAP-Synergy,一种关注协同效应的重新排序程序。我们使用KL散度行为保留评估修剪效果,并通过FADE风格的指标评估解释质量。在IOI和Doc-String上,针对预算$K \in \{50, 100, 200, 400, 800\}$,以及FAP、FAP-Synergy、激活幅度(Activation-Magnitude)和ACDC风格的修剪,FAP系列始终实现最佳或接近最佳的保真度,其中FAP-Synergy在严格预算情况下提供了最明显的增益。在IOI上,使用Llama-3.2-1B和Gemma-2-2B的CLT,将特征修剪至$K=100$与随机选择活跃特征集所需的约4k特征所需的KL保真度相匹配(约40倍压缩),使得解释/评估调用减少约40倍,同时显著降低低质量特征。
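The abstract above evaluates pruning by KL-divergence behavior retention. As a hedged sketch of what such a metric could look like, the snippet below compares a full model's next-token distribution with a pruned model's; the logit values are made-up placeholders, and the direction KL(full || pruned) is an assumption, since the abstract does not state which direction is used.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """KL(p || q) in nats; eps guards against division by zero."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical next-token logits over a three-token vocabulary.
full_logits = [2.0, 1.0, 0.0]
pruned_logits = [1.8, 1.1, 0.1]

retention_gap = kl_divergence(softmax(full_logits), softmax(pruned_logits))
print(retention_gap)

# The compression figure quoted in the abstract is a simple ratio:
# roughly 4k randomly selected features versus K = 100 attributed features.
print(4000 / 100)
```

A lower value means the pruned feature set reproduces the full model's behavior more closely, so feature budgets can be compared by the KL value each one reaches.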
cs.CL / 38 / 2604.16909
PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations
PRISM:探测大型语言模型中的推理、指令和源记忆的幻觉
Abstract
As large language models (LLMs) evolve from conversational assistants into agents capable of handling complex tasks, they are increasingly deployed in high-risk domains. However, existing benchmarks largely rely on mixed queries and posterior, output-level scoring, which quantifies hallucination severity but offers limited insight into where and why hallucinations arise in the generation pipeline. We therefore reformulate hallucination evaluation as a diagnostic problem and propose PRISM, a controlled benchmark that disentangles hallucinations into four dimensions: knowledge missing, knowledge errors, reasoning errors, and instruction-following errors, grounded in three stages of generation (memory, instruction, and reasoning). PRISM contains 9,448 instances across 65 tasks and supports fine-grained, stage-aware diagnostic evaluation. Evaluating 24 mainstream open-source and proprietary LLMs, we uncover consistent trade-offs across instruction following, memory retrieval, and logical reasoning, showing that mitigation strategies often improve specific dimensions at the expense of others. We hope PRISM provides a framework for understanding the specific mechanisms behind LLM hallucinations, ultimately accelerating the development of trustworthy large language models.
Chinese Translation
随着大型语言模型(LLMs)从对话助手演变为能够处理复杂任务的智能体,它们在高风险领域的应用越来越广泛。然而,现有的基准测试主要依赖混合查询和后验评估、输出级评分,这虽然量化了幻觉的严重性,但对幻觉在生成流程中出现的原因和位置提供的洞察有限。因此,我们将幻觉评估重新构建为一个诊断问题,并提出了PRISM,一个控制基准,将幻觉分解为四个维度:知识缺失、知识错误、推理错误和遵循指令错误,基于生成的三个阶段(记忆、指令和推理)。PRISM包含9,448个实例,涵盖65个任务,并支持细粒度、阶段感知的诊断评估。在评估24个主流的开源和专有LLM时,我们发现指令遵循、记忆检索和逻辑推理之间存在一致的权衡,显示出缓解策略往往在改善某些维度的同时牺牲其他维度。我们希望PRISM能为理解LLMs幻觉背后的具体机制提供一个框架,最终加速可信大型语言模型的发展。
cs.CL / 39 / 2604.16916
When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints
当选择变成风险:大型语言模型在多项选择约束下的安全失效
Abstract
Safety alignment in large language models (LLMs) is primarily evaluated under open-ended generation, where models can mitigate risk by refusing to respond. In contrast, many real-world applications place LLMs in structured decision-making tasks, such as multiple-choice questions (MCQs), where abstention is discouraged or unavailable. We identify a systematic failure mode in this setting: reformulating harmful requests as forced-choice MCQs, where all options are unsafe, can systematically bypass refusal behavior, even in models that consistently reject equivalent open-ended prompts. Across 14 proprietary and open-source models, we show that forced-choice constraints sharply increase policy-violating responses. Notably, for human-authored MCQs, violation rates follow an inverted U-shaped trend with respect to structural constraint strength, peaking under intermediate task specifications, whereas MCQs generated by high-capability models yield near-saturation violation rates across constraints and exhibit strong cross-model transferability. Our findings reveal that current safety evaluations substantially underestimate risks in structured task settings and highlight constrained decision-making as a critical and underexplored surface for alignment failures.
Chinese Translation
大型语言模型(LLMs)的安全对齐主要是在开放式生成的情况下进行评估,此时模型可以通过拒绝响应来降低风险。相反,许多现实世界的应用将LLMs置于结构化决策任务中,例如多项选择题(MCQs),在这些情况下,弃权是不被鼓励或不可用的。我们在这种情况下识别出一种系统性的失效模式:将有害请求重新表述为强制选择的多项选择题,其中所有选项都是不安全的,这可以系统性地绕过拒绝行为,即使在那些始终拒绝等效开放式提示的模型中也是如此。在14个专有和开源模型中,我们展示了强制选择约束显著增加了违反政策的响应。值得注意的是,对于人类创作的多项选择题,违反率与结构约束强度呈倒U型趋势,在中等任务规范下达到峰值,而由高能力模型生成的多项选择题在各种约束下几乎达到饱和的违反率,并表现出强烈的跨模型可转移性。我们的研究结果揭示了当前的安全评估在结构化任务设置中显著低估了风险,并强调了受限决策作为对齐失效的一个关键且未充分探索的领域。
cs.CL / 40 / 2604.16917
x1: Learning to Think Adaptively Across Languages and Cultures
跨语言和文化适应性思维的学习
Abstract
Languages encode distinct abstractions and inductive priors, yet most large language models (LLMs) overlook this diversity by reasoning in a single dominant language. In this work, we introduce x1, a family of reasoning models that can adaptively reason in an advantageous language on a per-instance basis. To isolate the effect of reasoning-language choice, x1 is constructed without expanding the model's knowledge boundaries and is trained by contrasting linguistically distinct reasoning trajectories for the same input. Our extensive experiments demonstrate the benefits of adaptive multilingual reasoning across multilingual mathematical reasoning and culturally grounded tasks. Moreover, our results challenge a simplistic view of scaling laws: while scaling reduces cross-lingual disparities in procedural domains such as math reasoning, it does not eliminate the advantages of culture-associated languages in culturally grounded tasks, as we empirically show that such reasoning enables more efficient and accurate cultural knowledge recall. Overall, our findings establish language choice as a functional component of reasoning, with implications for building more generalist and globally competent reasoning models.
Chinese Translation
语言编码了不同的抽象和归纳先验,但大多数大型语言模型(LLMs)通过在单一主导语言中推理而忽视了这种多样性。在本研究中,我们介绍了 x1,一类能够在每个实例基础上自适应地选择有利语言进行推理的模型。为了隔离推理语言选择的影响,x1 的构建没有扩展模型的知识边界,而是通过对同一输入的语言上明显不同的推理轨迹进行对比训练。我们的广泛实验展示了在多语言数学推理和文化基础任务中自适应多语言推理的优势。此外,我们的结果挑战了对规模法则的简单看法:尽管规模的扩大减少了在数学推理等程序性领域的跨语言差异,但并未消除文化相关语言在文化基础任务中的优势。我们通过实证研究表明,这种推理能够更高效和准确地回忆文化知识。总体而言,我们的研究结果确立了语言选择作为推理的一个功能组成部分,这对构建更具通用性和全球竞争力的推理模型具有重要意义。
cs.CL / 41 / 2604.16918
Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
面向新鲜度的优先经验重放在大语言模型/视觉语言模型强化学习中的应用
Abstract
Reinforcement Learning (RL) has achieved impressive success in post-training Large Language Models (LLMs) and Vision-Language Models (VLMs), with on-policy algorithms such as PPO, GRPO, and REINFORCE++ serving as the dominant paradigm. However, these methods discard all collected trajectories after a single gradient update, resulting in poor sample efficiency, particularly wasteful for agentic tasks where multi-turn environment interactions are expensive. While Experience Replay drives sample efficiency in classic RL by allowing agents to reuse past trajectories and prioritize informative ones, directly applying Prioritized Experience Replay (PER) to LLMs fails. The rapid policy evolution of billion-parameter models renders stored priorities stale, causing old high-priority trajectories to dominate sampling long after they have become uninformative. We propose Freshness-Aware PER, which addresses this priority staleness problem by augmenting any PER-based priority with a multiplicative exponential age decay grounded in effective sample size analysis. To the best of our knowledge, Freshness-Aware PER is the first work to successfully apply PER to LLM/VLM reinforcement learning. We evaluate on eight multi-step agentic, reasoning, and math competition tasks with 0.5B, 3B, and 7B models. Freshness-Aware PER significantly outperforms on-policy baselines, achieving +46% on NQ Search, +367% on Sokoban, and +133% on VLM FrozenLake, while standard PER without age decay consistently degrades performance. Our code is publicly available at https://github.com/Vision-CAIR/Freshness-Aware-PER.
Chinese Translation
强化学习(RL)在后训练的大语言模型(LLMs)和视觉语言模型(VLMs)中取得了显著成功,PPO、GRPO和REINFORCE++等在线算法成为主流。然而,这些方法在单次梯度更新后会丢弃所有收集的轨迹,导致样本效率低下,特别是在多轮环境交互成本高昂的代理任务中尤为浪费。虽然经验重放通过允许代理重用过去的轨迹并优先考虑信息丰富的轨迹来提高经典RL的样本效率,但直接将优先经验重放(PER)应用于LLMs并不成功。亿参数模型的快速策略演变使得存储的优先级变得过时,导致旧的高优先级轨迹在变得无信息后仍然主导采样。我们提出了面向新鲜度的优先经验重放,解决了这一优先级过时问题,通过在有效样本量分析的基础上,将任何基于PER的优先级与乘法指数衰减相结合。据我们所知,面向新鲜度的优先经验重放是首个成功将PER应用于LLM/VLM强化学习的工作。我们在八个多步代理、推理和数学竞赛任务上进行了评估,使用了0.5B、3B和7B模型。面向新鲜度的优先经验重放在性能上显著超越在线基线,在NQ搜索上提高了46%,在推箱子(Sokoban)上提高了367%,在VLM冰湖(FrozenLake)上提高了133%,而没有年龄衰减的标准PER则持续降低了性能。我们的代码已公开发布在https://github.com/Vision-CAIR/Freshness-Aware-PER。
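The abstract describes the core mechanism as a multiplicative exponential age decay applied on top of any PER priority. A minimal sketch of that idea follows. It assumes age is counted in policy updates since the trajectory was collected and uses a hypothetical time constant `tau`; the paper derives its decay from an effective sample size analysis that is not reproduced here, and the function names are illustrative rather than taken from the released code.

```python
import math
import random


def freshness_weighted_priority(priority, age, tau):
    """Scale a base PER priority by a multiplicative exponential age decay.

    `tau` is a hypothetical time constant (in policy updates), not a value
    taken from the paper.
    """
    return priority * math.exp(-age / tau)


def sample_indices(priorities, ages, tau, k, rng):
    """Draw k buffer indices with probability proportional to the decayed priority."""
    weights = [
        freshness_weighted_priority(p, a, tau) for p, a in zip(priorities, ages)
    ]
    return rng.choices(range(len(weights)), weights=weights, k=k)


# Toy buffer: index 0 is an old trajectory with a high stored priority,
# index 2 is a fresh trajectory with a moderate priority.
priorities = [5.0, 1.0, 2.0]
ages = [40, 10, 0]  # policy updates since each trajectory was collected
decayed = [
    freshness_weighted_priority(p, a, tau=10.0) for p, a in zip(priorities, ages)
]
batch = sample_indices(priorities, ages, tau=10.0, k=4, rng=random.Random(0))
```

With these toy numbers the stale trajectory at index 0 ends up with the smallest sampling weight despite having the largest stored priority, which is the staleness effect the abstract describes.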
cs.CL / 42 / 2604.16929
MeasHalu: Mitigation of Scientific Measurement Hallucinations for Large Language Models with Enhanced Reasoning
MeasHalu:通过增强推理减轻大型语言模型中的科学测量幻觉
Abstract
The accurate extraction of scientific measurements from literature is a critical yet challenging task in AI4Science, enabling large-scale analysis and integration of quantitative research findings. However, Large Language Models (LLMs) frequently exhibit severe hallucinations, which significantly undermine the reliability of automated scientific document understanding systems. To address this problem, we propose MeasHalu, a novel framework for mitigating scientific measurement hallucinations through enhanced reasoning and targeted optimization. We first present a fine-grained taxonomy of measurement-specific hallucinations, categorizing errors across quantities, units, modifiers, and relations. Our approach incorporates a two-stage reasoning-aware fine-tuning strategy using augmented scientific data and process-based supervision. Furthermore, we introduce a progressive reward curriculum designed to penalize specific hallucination types, significantly improving extraction faithfulness. Experimental results demonstrate that MeasHalu substantially reduces hallucination rates and improves overall accuracy on the MeasEval benchmark. This work provides a targeted solution to a key bottleneck in automated scientific knowledge extraction, facilitating more trustworthy and scalable machine-assisted scientific literature analysis.
Chinese Translation
从文献中准确提取科学测量数据是AI4Science中的一项关键但具有挑战性的任务,它使得对定量研究结果的规模化分析和整合成为可能。然而,大型语言模型(LLMs)经常表现出严重的幻觉现象,这显著削弱了自动化科学文献理解系统的可靠性。为了解决这个问题,我们提出了MeasHalu,一个通过增强推理和针对性优化来减轻科学测量幻觉的新框架。我们首先提出了一种细粒度的测量特定幻觉分类法,将错误分为量、单位、修饰符和关系四类。我们的方法结合了使用增强科学数据和基于过程的监督的两阶段推理感知微调策略。此外,我们引入了一种渐进式奖励课程,旨在惩罚特定类型的幻觉,从而显著提高提取的真实性。实验结果表明,MeasHalu显著降低了幻觉率,并提高了在MeasEval基准上的整体准确性。这项工作为自动化科学知识提取中的一个关键瓶颈提供了针对性的解决方案,促进了更可信和可扩展的机器辅助科学文献分析。
cs.CL / 43 / 2604.16937
No One Fits All: From Fixed Prompting to Learned Routing in Multilingual LLMs
没有一种方法适合所有人:从固定提示到多语言大型语言模型中的学习路由
Abstract
Translation-based prompting is widely used in multilingual LLMs, yet its effectiveness varies across languages and tasks. We evaluate prompting strategies across ten languages of different resource levels and four benchmarks. Our analysis shows that no single strategy is universally optimal. Translation strongly benefits low-resource languages even when translation quality is imperfect, high-resource languages gain little, and prompt-based self-routing underperforms explicit translation. Motivated by these findings, we formulate prompting strategy selection as a learned decision problem and introduce lightweight classifiers that predict whether native or translation-based prompting is optimal for each instance. The classifiers achieve statistically significant improvements over fixed strategies across four benchmarks and generalize to unseen task formats not observed during training. Further analysis reveals that language resource level, rather than translation quality alone, determines when translation is beneficial.
Chinese Translation
基于翻译的提示在多语言大型语言模型中被广泛使用,但其有效性在不同语言和任务中存在差异。我们评估了十种不同资源水平的语言和四个基准上的提示策略。我们的分析表明,没有单一策略是普遍最优的。即使在翻译质量不完美的情况下,翻译对低资源语言的帮助也很大,而高资源语言的收益很小,基于提示的自我路由的表现不如明确的翻译。基于这些发现,我们将提示策略选择形式化为一个学习决策问题,并引入轻量级分类器来预测在每个实例中本地提示还是基于翻译的提示更为优越。这些分类器在四个基准上相较于固定策略取得了统计显著的改善,并且能够推广到训练期间未观察到的任务格式。进一步的分析表明,语言资源水平而非单纯的翻译质量决定了翻译何时是有益的。
cs.CL / 44 / 2604.16943
MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation
MNAFT:面向模态神经元的多模态大语言模型图像翻译微调
Abstract
Multimodal large language models (MLLMs) have shown impressive capabilities, yet they often struggle to effectively capture the fine-grained textual information within images crucial for accurate image translation. This often leads to a modality gap between visual text inputs and textual inputs/outputs for image translation. Existing methods, primarily relying on instruction fine-tuning, risk parameter redundancy of pre-trained knowledge, hindering generalization performance. To address this, we introduce modality neuron-aware fine-tuning (MNAFT), a novel approach that takes advantage of the specialized roles of individual neurons within MLLMs for enhanced image translation. MNAFT identifies language-agnostic and language-specific neurons in both vision and language modules through an instruction-driven activation analysis, evaluating their importance in various translation tasks. We then perform selective fine-tuning, updating only the parameters of language-specific and language-agnostic neurons within the selected layers relevant to the target task, while preserving the knowledge encoded in other neurons and layers. Our extensive experiments on multiple benchmarks demonstrate that MNAFT significantly outperforms state-of-the-art image translation methods, including cascaded models, standard full fine-tuning, and parameter-efficient tuning techniques. Furthermore, we provide comprehensive analysis, including visualizations of neuron activations and clustering patterns, to offer insights into the roles of different neuron groups in mediating cross-modal understanding and facilitating accurate language-specific translation.
Chinese Translation
多模态大语言模型(MLLMs)展现出了令人印象深刻的能力,但它们在有效捕捉图像中细粒度文本信息方面常常面临挑战,而这些信息对于准确的图像翻译至关重要。这通常导致视觉文本输入与图像翻译的文本输入/输出之间存在模态差距。现有方法主要依赖于指令微调,这可能导致预训练知识的参数冗余,从而阻碍泛化性能。为了解决这个问题,我们提出了面向模态神经元的微调(MNAFT),这是一种新颖的方法,利用MLLMs中各个神经元的专业角色来增强图像翻译。MNAFT通过基于指令的激活分析识别视觉和语言模块中的语言无关和语言特定神经元,评估它们在各种翻译任务中的重要性。然后,我们进行选择性微调,仅更新与目标任务相关的选定层中语言特定和语言无关神经元的参数,同时保留其他神经元和层中编码的知识。我们在多个基准上的广泛实验表明,MNAFT显著优于最先进的图像翻译方法,包括级联模型、标准完全微调和参数高效微调技术。此外,我们提供了全面的分析,包括神经元激活和聚类模式的可视化,以深入了解不同神经元组在促进跨模态理解和实现准确的语言特定翻译中的作用。
cs.CL / 45 / 2604.16968
On Safety Risks in Experience-Driven Self-Evolving Agents
关于经验驱动自我进化代理的安全风险
Abstract
Experience-driven self-evolution has emerged as a promising paradigm for improving the autonomy of large language model agents, yet its reliance on self-curated experience introduces underexplored safety risks. In this study, we investigate how experience accumulation and utilization in self-evolving agents affect safety performance across web-based and embodied environments. Notably, experience gathered solely from benign tasks can still compromise safety in high-risk scenarios. Further analysis attributes this degradation to the execution-oriented nature of accumulated experience, which reinforces agents' tendency to act rather than refuse. In more realistic settings where agents encounter both benign and harmful tasks, refusal-related experience mitigates safety decline but induces over-refusal, revealing a fundamental safety-utility trade-off. Overall, our findings expose inherent limitations of current self-evolving agents and call for more principled strategies to ensure safe and reliable adaptation.
Chinese Translation
经验驱动的自我进化已成为提高大型语言模型代理自主性的有前景的范式,但其对自我策划经验的依赖引入了尚未充分探讨的安全风险。在本研究中,我们调查了自我进化代理中经验的积累和利用如何影响其在基于网络和具身环境中的安全性能。值得注意的是,仅从良性任务中获得的经验在高风险场景中仍可能危及安全。进一步分析将这种退化归因于积累经验的执行导向特性,这增强了代理采取行动而非拒绝的倾向。在代理同时面临良性和有害任务的更现实环境中,与拒绝相关的经验能够减轻安全下降,但却引发过度拒绝,揭示了安全与效用之间的基本权衡。总体而言,我们的研究结果揭示了当前自我进化代理的固有限制,并呼吁采取更有原则的策略以确保安全和可靠的适应。
cs.CL / 46 / 2604.16989
Bolzano: Case Studies in LLM-Assisted Mathematical Research
Bolzano:基于大型语言模型辅助的数学研究案例研究
Abstract
We report new results on six problems in mathematics and theoretical computer science, produced with the assistance of Bolzano, an open-source multi-agent LLM system. Bolzano orchestrates rounds of interaction between parallel prover agents and a verifier agent while maintaining a persistent knowledge base that is carried across rounds. Classified using the significance-autonomy taxonomy of Feng et al., four of the six results reach the level of publishable research, and three of the six were produced essentially autonomously by Bolzano. Our results provide evidence that LLMs can contribute meaningfully to mathematical research, complementing recent reports by Bubeck et al., Woodruff et al., and others.
Chinese Translation
我们报告了在数学和理论计算机科学领域的六个问题的新结果,这些结果是在开源多智能体大型语言模型系统Bolzano的协助下产生的。Bolzano协调平行证明者智能体与验证者智能体之间的交互轮次,同时维护一个跨轮次的持久知识库。根据Feng等人的重要性-自主性分类法,对六个结果进行分类,其中四个达到了可发表研究的水平,三个结果基本上是由Bolzano自主产生的。我们的结果提供了证据,表明大型语言模型可以对数学研究做出有意义的贡献,补充了Bubeck等人、Woodruff等人及其他人的最新报告。
cs.CL / 47 / 2604.16995
SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models
SPS:用于大型语言模型强化学习中更好探索的引导概率挤压
Abstract
Reinforcement learning (RL) has emerged as a promising paradigm for training reasoning-oriented models by leveraging rule-based reward signals. However, RL training typically tends to improve single-sample success rates (i.e., Pass@1) while offering limited exploration of diverse reasoning trajectories, which is crucial for multi-sample performance (i.e., Pass@k). Our preliminary analysis reveals that this limitation stems from a fundamental squeezing effect, whereby probability mass is excessively concentrated on a narrow subset of high-reward trajectories, restricting genuine exploration and constraining attainable performance under RL training. To address this issue, in this work, we propose Steering Probability Squeezing (SPS), a training paradigm that interleaves conventional RL with inverse reinforcement learning (IRL). SPS treats on-policy rollouts as demonstrations and employs IRL to explicitly reshape the induced trajectory distribution, thereby enhancing exploration without introducing external supervision. Experiments on five commonly used reasoning benchmarks demonstrate that SPS can enable better exploration and improve Pass@k. Beyond algorithmic contributions, we provide an analysis of RL learning dynamics and identify an empirical upper bound on Pass@k, shedding light on intrinsic exploration limits in RL-based reasoning models. Our findings suggest that alternating between RL and IRL offers an effective pathway toward extending the exploration capacity of reasoning-oriented large language models.
Chinese Translation
强化学习(RL)已成为训练面向推理模型的有前景的范式,通过利用基于规则的奖励信号。然而,RL训练通常倾向于提高单样本成功率(即 Pass@1),而对多样化推理轨迹的探索有限,而这对于多样本性能(即 Pass@k)至关重要。我们的初步分析表明,这一限制源于一种基本的挤压效应,即概率质量过度集中在一小部分高奖励轨迹上,限制了真实的探索并约束了在RL训练下可达到的性能。为了解决这一问题,我们在本研究中提出了引导概率挤压(SPS),一种将传统RL与逆强化学习(IRL)交替结合的训练范式。SPS将在线策略的回放视为示范,并利用IRL显式重塑诱导的轨迹分布,从而增强探索而不引入外部监督。在五个常用的推理基准上的实验表明,SPS能够实现更好的探索并提高Pass@k。除了算法贡献外,我们还提供了对RL学习动态的分析,并确定了Pass@k的经验上限,揭示了基于RL的推理模型中的内在探索限制。我们的研究结果表明,在RL和IRL之间交替提供了一条有效的途径,以扩展面向推理的大型语言模型的探索能力。
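To make the Pass@1 versus Pass@k distinction concrete, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The abstract does not say which estimator the authors use, so this choice is an assumption, and the two "models" are hypothetical per-problem correct counts chosen only to illustrate the squeezing effect.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the correct count for each problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct counts out of n = 10 samples on two problems.
squeezed = [10, 0]  # all mass on one solved problem, none on the other
diverse = [5, 2]    # less certain, but some success on both problems

squeezed_p1 = mean_pass_at_k(squeezed, n=10, k=1)
diverse_p1 = mean_pass_at_k(diverse, n=10, k=1)
squeezed_p5 = mean_pass_at_k(squeezed, n=10, k=5)
diverse_p5 = mean_pass_at_k(diverse, n=10, k=5)
```

In this toy setting the squeezed model has the higher Pass@1 but the lower Pass@5, which mirrors the trade-off the abstract attributes to probability mass concentrating on a narrow set of trajectories.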
cs.CL / 48 / 2604.17008
BIASEDTALES-ML: A Multilingual Dataset for Analyzing Narrative Attribute Distributions in LLM-Generated Stories
BIASEDTALES-ML:用于分析大语言模型生成故事中的叙事属性分布的多语言数据集
Abstract
Large Language Models (LLMs) are increasingly used to generate narrative content, including children's stories, which play an important role in social and cultural learning. Despite growing interest in AI safety and alignment, most existing evaluations focus primarily on English, leaving the cross-lingual generalization of aligned behavior underexplored. In this work, we introduce BiasedTales-ML, a large-scale parallel corpus of approximately 350,000 children's stories generated across eight typologically and culturally diverse languages using a full-permutation prompting design. We propose a structured generator-extractor pipeline and a multi-dimensional distributional analysis framework to examine how narrative attributes vary across languages, models, and social conditions. Our analysis reveals substantial cross-lingual variability in narrative generation patterns, indicating that distributions observed in English do not always exhibit similar characteristics in other languages, particularly in lower-resource settings. At the narrative level, we identify recurring structural patterns involving character roles, settings, and thematic emphasis, which manifest differently across linguistic contexts. These findings highlight the limitations of English-centric evaluation for characterizing socially grounded narrative generation in multilingual settings. We release the dataset, code, and an interactive visualization tool to support future research on multilingual narrative analysis and evaluation.
Chinese Translation
大型语言模型(LLMs)越来越多地用于生成叙事内容,包括儿童故事,这在社会和文化学习中发挥着重要作用。尽管对人工智能安全性和对齐的关注日益增加,但现有的大多数评估主要集中在英语上,跨语言对齐行为的普遍性研究仍然不足。在本研究中,我们介绍了BiasedTales-ML,这是一个大规模的平行语料库,包含约350,000个儿童故事,这些故事是通过全排列提示设计在八种类型学和文化上多样的语言中生成的。我们提出了一个结构化的生成-提取管道和一个多维分布分析框架,以研究叙事属性在不同语言、模型和社会条件下的变化。我们的分析揭示了叙事生成模式在跨语言上的显著变异,表明在英语中观察到的分布在其他语言中并不总是表现出相似的特征,尤其是在资源较少的环境中。在叙事层面,我们识别出涉及角色、场景和主题强调的重复结构模式,这些模式在不同语言环境中表现不同。这些发现突显了以英语为中心的评估在多语言环境中表征社会基础叙事生成的局限性。我们发布了数据集、代码和一个交互式可视化工具,以支持未来在多语言叙事分析和评估方面的研究。
cs.CL / 49 / 2604.17010
Improving LLM Code Reasoning via Semantic Equivalence Self-Play with Formal Verification
通过形式验证的语义等价自我博弈提升大型语言模型代码推理能力
Abstract
We introduce a self-play framework for semantic equivalence in Haskell, utilizing formal verification to guide adversarial training between a generator and an evaluator. The framework leverages Liquid Haskell proofs for validating equivalence and execution-based counterexamples for inequivalence, organized via a difficulty-aware curriculum. To facilitate this, we release OpInstruct-HSx, a synthetic dataset of approximately 28k validated Haskell programs. Empirical experiments show that our evaluator transfers effectively to downstream tasks, achieving up to 13.3pp accuracy gain on EquiBench and consistent gains on PySecDB. Ablation studies on the SEQ-SINQ regimes indicate that while inequivalence supervision provides data volume, equivalence proofs are uniquely responsible for the model's reasoning capabilities. The entire training pipeline and dataset are publicly released on GitHub and Hugging Face, respectively.
Chinese Translation
我们提出了一种用于 Haskell 中语义等价的自我博弈框架,利用形式验证指导生成器与评估器之间的对抗训练。该框架利用 Liquid Haskell 证明来验证等价性,并通过基于执行的反例来验证不等价性,采用难度感知的课程进行组织。为此,我们发布了 OpInstruct-HSx,一个包含约 28,000 个经过验证的 Haskell 程序的合成数据集。实证实验表明,我们的评估器在下游任务中有效转移,在 EquiBench 上实现了高达 13.3 个百分点的准确率提升,并在 PySecDB 上取得了一致的提升。对 SEQ-SINQ 方案的消融研究表明,尽管不等价监督提供了数据量,但等价证明对模型的推理能力具有独特的责任。整个训练流程和数据集已在 GitHub 和 Hugging Face 上公开发布。
cs.CL / 50 / 2604.17020
Beyond Static Benchmarks: Synthesizing Harmful Content via Persona-based Simulation for Robust Evaluation
超越静态基准:通过基于角色的模拟合成有害内容以进行稳健评估
Abstract
Static benchmarks for harmful content detection face limitations in scalability and diversity, and may also be affected by contamination from web-scale pre-training corpora. To address these issues, we propose a framework for synthesizing harmful content, leveraging persona-guided large language model (LLM) agents. Our approach constructs two-dimensional user personas by integrating demographic identities and topical interests with situational harmful strategies, enabling the simulation of diverse and contextually grounded harmful interactions. We evaluate the framework along three dimensions: harmfulness, challenge level, and diversity. Both human and LLM-based evaluations confirm that our framework achieves a high harmful generation success rate. Experiments across multiple detection systems reveal that our synthetic scenarios are more challenging to detect than those in existing benchmarks. Furthermore, a multi-faceted analysis confirms that our approach achieves linguistic and topical diversity comparable to human-curated datasets, establishing our framework as an effective tool for robust stress-testing of harmful content detection systems.
Chinese Translation
静态基准在有害内容检测中面临可扩展性和多样性的限制,并可能受到网络规模预训练语料库的污染影响。为了解决这些问题,我们提出了一种合成有害内容的框架,利用基于角色引导的大型语言模型(LLM)代理。我们的方法通过整合人口统计身份和主题兴趣与情境有害策略,构建二维用户角色,从而能够模拟多样化和情境基础的有害互动。我们从有害性、挑战水平和多样性三个维度对该框架进行评估。人类和基于LLM的评估均确认我们的框架实现了高有害生成成功率。在多个检测系统中的实验表明,我们的合成场景比现有基准中的场景更难以检测。此外,多方面的分析确认我们的方法在语言和主题多样性方面可与人工策划的数据集相媲美,确立了我们的框架作为有害内容检测系统稳健压力测试的有效工具。
cs.CL / 51 / 2604.17022
Beyond Black-Box Labels: Interpretable Criteria for Diagnosing Subjective NLP Tasks
超越黑箱标签:主观自然语言处理任务的可解释诊断标准
Abstract
Subjective NLP datasets typically aggregate annotator judgments into a single gold label, making it difficult to diagnose whether disagreement reflects unclear criteria, collapsed distinctions, or legitimate plurality. We propose a schema-level diagnostic for auditing expert-designed annotation schemas prior to gold-label commitment, using only multi-annotator criterion judgments. The diagnostic separates two failure modes: unstable criteria with hard-to-operationalize boundaries, and systematic overlap that blurs the boundaries between mutually exclusive categories. Applied to persuasive value extraction in commercial documents, we find that disagreement is not diffuse: instability concentrates in a few criteria, while nearly half of covered sentences activate multiple categories. These signals align with where domain experts disagree, yielding an evidence-based audit for tightening guidelines, revising category structure, or reconsidering the annotation paradigm.
Chinese Translation
主观自然语言处理(NLP)数据集通常将注释者的判断汇总为单一的金标准标签,这使得诊断分歧是否反映不明确的标准、模糊的区分或合理的多元性变得困难。我们提出了一种模式级诊断,用于在承诺金标准标签之前审计专家设计的注释模式,仅使用多注释者的标准判断。该诊断将两种失败模式区分开来:具有难以操作化边界的不稳定标准,以及模糊相互排斥类别边界的系统重叠。应用于商业文件中的说服价值提取时,我们发现分歧并非分散:不稳定性集中在少数标准上,而近一半的覆盖句子激活多个类别。这些信号与领域专家的分歧相一致,为收紧指南、修订类别结构或重新考虑注释范式提供了基于证据的审计。
cs.CL / 52 / 2604.17031
Where is the Mind? Persona Vectors and LLM Individuation
心智在哪里?人格向量与大型语言模型的个体化
Abstract
The individuation problem for large language models asks which entities associated with them, if any, should be identified as minds. We approach this problem through mechanistic interpretability, engaging in particular with recent empirical work on persona vectors, persona space, and emergent misalignment. We argue that three views are the strongest candidates: the virtual instance view and two new views we introduce, the (virtual) instance-persona view and the model-persona view. First, we argue for the virtual instance view on the grounds that attention streams sustain quasi-psychological connections across token-time. Then we present the persona literature, organised around three hypotheses about the internal structure underlying personas in LLMs, and show that the two persona-based views are promising alternatives.
Chinese Translation
大型语言模型的个体化问题探讨了与其相关的哪些实体(如果有的话)应被识别为心智。我们通过机制可解释性来接近这个问题,特别关注近期关于人格向量、人格空间和新兴不一致性的实证研究。我们认为三种观点是最有力的候选者:虚拟实例观点以及我们提出的两种新观点,即(虚拟)实例-人格观点和模型-人格观点。首先,我们基于注意力流在标记时间上维持准心理连接的理由,支持虚拟实例观点。然后,我们呈现人格文献,围绕关于大型语言模型中人格内部结构的三种假设进行组织,并展示这两种基于人格的观点是有前景的替代方案。
cs.CL / 53 / 2604.17037
Dynamic Emotion and Personality Profiling for Multimodal Deception Detection
动态情感与个性特征分析在多模态欺骗检测中的应用
Abstract
Deception detection is of great significance for ensuring information security and conducting public opinion analysis, with personality factors and emotion cues playing a critical role. However, existing methods lack sample-level dynamic annotations for emotions and personality. In this paper, we propose an innovative multi-model multi-prompt annotation scheme and a strict label quality evaluation standard, and establish DDEP, a multimodal joint detection dataset for deception, emotion, and personality. Meanwhile, we propose Rel-DDEP, an adaptive reliability-weighted fusion framework. Our framework quantifies uncertainty by mapping modal features to a high-dimensional Gaussian distribution space. It then performs reliability-weighted fusion and incorporates an alignment module and a sorting constraint module to achieve joint detection of deception, emotion, and personality. Experimental results on the MDPE and DDEP datasets show that Rel-DDEP significantly outperforms existing state-of-the-art baseline models on all three tasks: the F1 score increases by 2.53% for deception detection, 2.66% for emotion detection, and 9.30% for personality detection. The experiments verify the necessity of annotating dynamic emotion and personality labels for each sample and the effectiveness of reliability-weighted fusion.
Chinese Translation
欺骗检测对于确保信息安全和进行舆情分析具有重要意义,其中个性因素和情感线索发挥着关键作用。然而,现有方法缺乏对情感和个性进行样本级动态标注。在本文中,我们提出了一种创新的多模型多提示标注方案和严格的标签质量评估标准,并建立了一个用于欺骗、情感和个性检测的多模态联合检测数据集DDEP。同时,我们提出了Rel-DDEP,一个自适应的可靠性加权融合框架。我们的框架通过将模态特征映射到高维高斯分布空间来量化不确定性。然后,它执行可靠性加权融合,并结合对齐模块和排序约束模块,实现欺骗、情感和个性的联合检测。在MDPE和DDEP数据集上的实验结果表明,我们的Rel-DDEP在三个任务中显著优于现有的最先进基线模型。欺骗检测的F1分数提高了2.53%,情感检测的F1分数提高了2.66%,个性检测的F1分数提高了9.30%。实验充分验证了对每个样本进行动态情感和个性标签标注的必要性以及可靠性加权融合的有效性。
cs.CL / 54 / 2604.17051
Efficient Task Adaptation in Large Language Models via Selective Parameter Optimization
通过选择性参数优化实现大型语言模型的高效任务适应
Abstract
Large Language Models (LLMs) have demonstrated excellent performance in general language understanding, generation, and other tasks. However, when fine-tuning for specific domain tasks, the general knowledge accumulated in the pre-training phase is often partially overwritten or forgotten due to parameter updates, which severely limits the generalization ability and transferability of LLMs. Traditional fine-tuning strategies mostly train on the entire parameter space, ignoring the heterogeneity of model parameters: some parameters are extremely important for general tasks, while others are more sensitive to specific tasks. To alleviate these problems, this paper proposes a parameter element importance evaluation method that divides parameters into "core parameters" and "non-core parameters" by distinguishing their importance for general language ability tasks from their importance for specific domain tasks. Core parameters are fixed during fine-tuning, and only non-core parameters are updated. Extensive experiments on scientific, medical, and physical tasks using GPT-J and LLaMA-3 show that our method can mitigate catastrophic forgetting while enhancing the adaptability of the model.
Chinese Translation
大型语言模型(LLMs)在一般语言理解、生成及其他任务中表现出色。然而,在针对特定领域任务进行微调时,由于参数更新,预训练阶段积累的一般知识往往会部分被覆盖或遗忘,这严重限制了LLMs的泛化能力和迁移能力。传统的微调策略大多在整个参数空间上进行训练,忽视了模型参数的异质性,即某些参数对于一般任务极为重要,而其他参数则对特定任务更为敏感。为了解决上述问题,本文创新性地提出了一种参数元素重要性评估方法,通过区分参数对一般语言能力任务和特定领域任务的重要性,将参数分为“核心参数”和“非核心参数”,并在微调过程中固定核心参数,仅对非核心参数进行微调。在使用GPT-J和LLaMA-3进行的科学、医学和物理任务的广泛实验表明,我们的方法能够减轻灾难性遗忘,同时增强模型的适应性。
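The core/non-core split described above amounts to a masked update: frozen entries keep their pre-trained value while the rest follow the gradient. The sketch below shows that mechanic on flat Python lists. The importance scores and the rule "core if general importance exceeds domain importance" are hypothetical placeholders, since the abstract does not specify how importance is computed or thresholded.

```python
def split_core_parameters(general_importance, domain_importance):
    """Return a mask that is True for "core" parameters.

    Hypothetical rule: a parameter is core when it matters more for general
    language ability than for the domain task. The paper's actual criterion
    is not given in the abstract.
    """
    return [g > d for g, d in zip(general_importance, domain_importance)]


def masked_sgd_step(params, grads, core_mask, lr=0.1):
    """Apply one SGD step to non-core parameters only; core parameters stay fixed."""
    return [
        p if is_core else p - lr * g
        for p, g, is_core in zip(params, grads, core_mask)
    ]


# Toy example with three scalar parameters.
params = [1.0, 2.0, 3.0]
grads = [0.5, 0.5, 0.5]
general_importance = [0.9, 0.1, 0.4]  # illustrative scores
domain_importance = [0.2, 0.8, 0.6]   # illustrative scores

core_mask = split_core_parameters(general_importance, domain_importance)
updated = masked_sgd_step(params, grads, core_mask)
```

Here only the first parameter is treated as core, so it keeps its original value while the other two move against their gradients.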
cs.CL / 55 / 2604.17053
Jailbreaking Large Language Models with Morality Attacks
利用道德攻击破解大型语言模型
Abstract
Pluralism alignment has the sophisticated and necessary goal of creating AI that can coexist with and serve a morally multifaceted humanity. Much research toward pluralism alignment has focused on enhancing the learning of large language models (LLMs) to accomplish pluralism. Although this is essential, the robustness of LLMs in producing moral content across pluralistic values remains underexplored. Inspired by the astonishing persuasion abilities of jailbreak prompts, we propose to leverage jailbreak attacks to study LLMs' internal pluralistic values. In detail, we develop a morality dataset with 10.3K instances in two categories: Value Ambiguity and Value Conflict. We further formalize four adversarial attacks with the constructed dataset to manipulate LLMs' judgments on morality questions. We evaluate both large language models and the guardrail models that are typically used in generative systems with flexible user input. Our experimental results show that LLMs and guardrail models are critically vulnerable to these subtle and sophisticated morality-aware attacks.
Chinese Translation
与人工智能的多元对齐具有复杂而必要的目标,即创造能够与道德多元的人类共存并服务的人工智能。针对多元对齐的研究在增强大型语言模型(LLMs)学习以实现多元化方面进行了许多努力。尽管这至关重要,但LLMs在生成与多元价值观相关的道德内容方面的鲁棒性仍在探索中。受到通过破解提示展现出的惊人说服能力的启发,我们提出利用破解攻击来研究LLMs的内部多元价值观。具体而言,我们开发了一个包含10.3K实例的道德数据集,分为两个类别:价值模糊和价值冲突。我们进一步利用构建的数据集形式化了四种对抗性攻击,以操控LLMs对道德问题的判断。我们评估了大型语言模型和通常用于具有灵活用户输入的生成系统的防护模型。实验结果表明,LLMs和防护模型在面对这些微妙而复杂的道德意识攻击时存在严重的脆弱性。
cs.CL / 56 / 2604.17068
Stability-Weighted Decoding for Diffusion Language Models
稳定性加权解码用于扩散语言模型
Abstract
Diffusion large language models (dLLMs) enable parallel text generation by iteratively denoising a fully masked sequence, unmasking a subset of masked tokens at each step. Existing decoding strategies rely on static confidence metrics computed at a single denoising step, ignoring temporal history and often leading to premature unmasking of unstable tokens. In this work, we theoretically establish that a token's temporal instability, quantified by the KL divergence between consecutive prediction distributions, provides a strict lower bound on its mutual information with the remaining masked context, indicating that temporally unstable tokens are inherently unsafe to unmask. Based on this insight, we propose Stability-Weighted Decoding (SWD), a training-free, plug-and-play strategy that incorporates temporal stability into token scoring and acts as a universal modulator for arbitrary score-based decoding policies. Experiments on code generation and mathematical reasoning benchmarks demonstrate that SWD consistently improves generation accuracy across representative scoring metrics and selection policies, and exhibits exceptional robustness, maintaining a significant performance lead over standard baselines across varying acceleration ratios.
Chinese Translation
扩散大型语言模型(dLLMs)通过迭代去噪完全掩蔽的序列,实现并行文本生成,在每一步解掩蔽一部分掩蔽的标记。现有的解码策略依赖于在单一去噪步骤计算的静态置信度指标,忽视了时间历史,常常导致不稳定标记的过早解掩蔽。在本研究中,我们理论上确立了标记的时间不稳定性,通过连续预测分布之间的KL散度量化,提供了其与剩余掩蔽上下文的互信息的严格下界,表明时间不稳定的标记本质上是不安全的解掩蔽。基于这一见解,我们提出了稳定性加权解码(Stability-Weighted Decoding, SWD),这是一种无训练、即插即用的策略,将时间稳定性纳入标记评分,并作为任意基于分数的解码策略的通用调节器。在代码生成和数学推理基准上的实验表明,SWD在代表性的评分指标和选择策略中始终提高生成准确性,并表现出卓越的鲁棒性,在不同加速比下保持显著的性能领先于标准基线。
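The idea of folding temporal stability into a token score can be sketched as follows. The abstract states only that instability is measured by the KL divergence between consecutive prediction distributions and that it modulates an existing score. The direction of the KL term, the exponential form of the modulation, and the `beta` parameter below are assumptions made for illustration, not the paper's formula.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def stability_weighted_score(prev_dist, curr_dist, beta=1.0):
    """Down-weight a max-probability confidence score by temporal instability.

    Instability is KL(current || previous); the exp(-beta * KL) modulation
    is a hypothetical choice.
    """
    confidence = max(curr_dist)
    instability = kl_divergence(curr_dist, prev_dist)
    return confidence * math.exp(-beta * instability)


# Token A: moderately confident and stable across two denoising steps.
prev_a, curr_a = [0.70, 0.20, 0.10], [0.72, 0.18, 0.10]
# Token B: more confident now, but its prediction flipped since the last step.
prev_b, curr_b = [0.10, 0.10, 0.80], [0.85, 0.10, 0.05]

score_a = stability_weighted_score(prev_a, curr_a)
score_b = stability_weighted_score(prev_b, curr_b)
```

A purely confidence-based policy would unmask token B first; under this sketch token A ranks higher because its prediction has barely changed between steps.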
cs.CL / 57 / 2604.17073
Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL
Abstain-R1:通过可验证的强化学习实现校准的弃权和拒绝后的澄清
Abstract
Reinforcement fine-tuning improves the reasoning ability of large language models, but it can also encourage them to answer unanswerable queries by guessing or hallucinating missing information. Existing abstention methods either train models to produce generic refusals or encourage follow-up clarifications without verifying whether those clarifications identify the key missing information. We study queries that are clear in meaning but cannot be reliably resolved from the given information, and argue that a reliable model should not only abstain, but also explain what is missing. We propose a clarification-aware RLVR reward that, while rewarding correct answers on answerable queries, jointly optimizes explicit abstention and semantically aligned post-refusal clarification on unanswerable queries. Using this reward, we train Abstain-R1, a 3B model that improves abstention and clarification on unanswerable queries while preserving strong performance on answerable ones. Experiments on Abstain-Test, Abstain-QA, and SelfAware show that Abstain-R1 substantially improves over its base model and achieves unanswerable-query behavior competitive with larger systems including DeepSeek-R1, suggesting that calibrated abstention and clarification can be learned through verifiable rewards rather than emerging from scale alone.
Chinese Translation
强化微调提高了大型语言模型的推理能力,但也可能促使它们通过猜测或虚构缺失信息来回答无法回答的问题。现有的弃权方法要么训练模型生成通用的拒绝,要么鼓励后续澄清,而未验证这些澄清是否识别出关键缺失信息。我们研究那些在意义上明确但无法从给定信息中可靠解决的查询,并认为一个可靠的模型不仅应该弃权,还应解释缺失的内容。我们提出了一种关注澄清的RLVR奖励,该奖励在对可回答查询给予正确答案奖励的同时,联合优化在无法回答查询上的明确弃权和语义对齐的拒绝后澄清。利用这一奖励,我们训练了Abstain-R1,一个3B模型,它在无法回答的查询上改善了弃权和澄清,同时在可回答的查询上保持了强劲的表现。在Abstain-Test、Abstain-QA和SelfAware上的实验表明,Abstain-R1显著优于其基础模型,并在无法回答查询的行为上与包括DeepSeek-R1在内的更大系统竞争,这表明校准的弃权和澄清可以通过可验证的奖励学习,而不仅仅是通过规模的增加而产生。
cs.CL / 58 / 2604.17079
Auditing Support Strategies in LLMs through Grounded Multi-Turn Social Simulation
通过基于实证的多轮社会模拟审计大型语言模型的支持策略
Abstract
When users seek social support from chatbots, they disclose their situation gradually, yet most evaluations of supportive LLMs rely on single-turn, fully specified prompts. We introduce a multi-turn simulation framework that closes this gap. Support-seeking narratives from five Reddit communities are decomposed into ordered fragments and revealed turn by turn to a language model. Each response is coded with the Social Support Behavior Code (SSBC), an established multi-label taxonomy that captures the composition of support, rather than a single quality score. To ask whether support choices track the model's own construal of user distress, we use linear probes on hidden representations to estimate this internal signal without altering the generation context. Across two mid-scale models (Llama-3.1-8B, OLMo-3-7B) and more than 6,200 turns, support composition shifts systematically with estimated distress: teaching declines as estimated distress rises, a finding that replicates across architectures, while increases in affective and esteem-oriented strategies (such as validation) are suggestive but model-specific and rest on noisier annotations. Community context independently shapes behavior, tracking topic and discourse norms rather than demographic categories. These trajectory-level dynamics, invisible to single-turn evaluation, motivate multi-turn auditing frameworks for socially sensitive applications.
Chinese Translation
当用户向聊天机器人寻求社会支持时,他们逐渐披露自己的情况,但对支持型大型语言模型的评估大多依赖于单轮、完全指定的提示。我们引入了一个多轮模拟框架,以填补这一空白。来自五个Reddit社区的支持寻求叙述被分解为有序片段,并逐轮向语言模型揭示。每个响应都使用社会支持行为编码(Social Support Behavior Code, SSBC)进行编码,这是一种已建立的多标签分类法,捕捉支持的组成,而不是单一的质量评分。为了探讨支持选择是否反映模型对用户痛苦的自身理解,我们使用线性探针对隐藏表示进行估计,以在不改变生成上下文的情况下评估这一内部信号。在两个中型模型(Llama-3.1-8B, OLMo-3-7B)和超过6200轮的实验中,支持组成随着估计的痛苦程度系统性变化:随着估计痛苦的上升,教学支持减少,这一发现跨架构重复出现,而情感和自尊导向策略(如验证)的增加则具有暗示性,但是模型特定的,并依赖于更嘈杂的注释。社区背景独立地塑造行为,跟踪主题和话语规范,而非人口统计类别。这些在单轮评估中不可见的轨迹级动态,激励了针对社会敏感应用的多轮审计框架。
cs.CL / 59 / 2604.17085
Comparing Human and Large Language Model Interpretation of Implicit Information
人类与大型语言模型对隐含信息的解读比较
Abstract
The interpretation of implicit meanings is an integral aspect of human communication. However, this framework may not transfer to interactions with Large Language Models (LLMs). To investigate this, we introduce the task of Implicit Information Extraction (IIE) and propose an LLM-based IIE pipeline that builds a structured knowledge graph from a context sentence by extracting relational triplets, validating implicit inferences, and analyzing temporal relations. We evaluate two LLMs against crowdsourced human judgments on two datasets. We find that humans agree with most model triplets yet consistently propose many additions, indicating limited coverage in current LLM-based IIE. Moreover, in our experiments, models appear to be more conservative about implicit inferences than humans in socially rich contexts, whereas humans become more conservative in shorter, fact-oriented contexts. Our code is available at https://github.com/Antonio-Dee/IIE_from_LLM.
Chinese Translation
隐含意义的解读是人类沟通中不可或缺的一个方面。然而,这一框架可能无法转移到与大型语言模型(LLMs)的互动中。为此,我们引入了隐含信息提取(Implicit Information Extraction, IIE)任务,并提出了一种基于LLM的IIE流程,该流程通过提取关系三元组、验证隐含推论以及分析时间关系,从上下文句子构建结构化知识图谱。我们在两个数据集上评估了两个LLM与众包的人类判断。我们发现,人类与大多数模型三元组达成一致,但始终提出许多补充,表明当前基于LLM的IIE覆盖有限。此外,在我们的实验中,模型在社会丰富的语境中对隐含推论显得比人类更为保守,而在人类在较短、以事实为导向的语境中则变得更加保守。我们的代码可在 https://github.com/Antonio-Dee/IIE_from_LLM 获取。
cs.CL / 60 / 2604.17091
GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)
GenericAgent:一种通过上下文信息密度最大化实现的高效自我进化大型语言模型代理(V1.0)
Abstract
Long-horizon large language model (LLM) agents are fundamentally limited by context. As interactions become longer, tool descriptions, retrieved memories, and raw environmental feedback accumulate and push out the information needed for decision-making. At the same time, useful experience gained from tasks is often lost across episodes. We argue that long-horizon performance is determined not by context length, but by how much decision-relevant information is maintained within a finite context budget. We present GenericAgent (GA), a general-purpose, self-evolving LLM agent system built around a single principle: context information density maximization. GA implements this through four closely connected components: a minimal atomic tool set that keeps the interface simple, a hierarchical on-demand memory that only shows a small high-level view by default, a self-evolution mechanism that turns verified past trajectories into reusable SOPs and executable code, and a context truncation and compression layer that maintains information density during long executions. Across task completion, tool use efficiency, memory effectiveness, self-evolution, and web browsing, GA consistently outperforms leading agent systems while using significantly fewer tokens and interactions, and it continues to evolve over time. Project: https://github.com/lsdefine/GenericAgent
Chinese Translation
长时间跨度的大型语言模型(LLM)代理在本质上受到上下文的限制。随着交互的延续,工具描述、检索的记忆和原始环境反馈不断累积,挤压出决策所需的信息。同时,从任务中获得的有用经验往往在不同的情节中丢失。我们认为,长时间跨度的表现并不是由上下文长度决定的,而是由在有限的上下文预算内保持多少与决策相关的信息决定的。我们提出了GenericAgent(GA),这是一个通用的自我进化LLM代理系统,建立在一个单一原则之上:上下文信息密度最大化。GA通过四个紧密相关的组件实现这一目标:一个保持接口简单的最小原子工具集,一个默认仅显示小范围高层视图的分层按需记忆,一个将经过验证的过去轨迹转化为可重用的标准操作程序(SOP)和可执行代码的自我进化机制,以及一个在长时间执行过程中保持信息密度的上下文截断和压缩层。在任务完成、工具使用效率、记忆有效性、自我进化和网页浏览等方面,GA始终优于领先的代理系统,同时使用显著更少的令牌和交互,并且随着时间的推移不断进化。项目链接:https://github.com/lsdefine/GenericAgent
cs.CL / 61 / 2604.17105
How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them
标记化如何限制语言模型中的音位知识表示及其改进方法
Abstract
Tokenization is the first step in every language model (LM), yet it never takes the sounds of words into account. We investigate how tokenization influences text-only LMs' ability to represent phonological knowledge. Through a series of probing experiments, we show that subword-based tokenization systematically weakens the encoding of both local (e.g., rhyme) and global (e.g., syllabification) phonological features. To quantify this effect, we introduce the syllabification-tokenization alignment distance (STAD), a metric that measures the misalignment between a model's tokenization and the natural syllable boundaries of words, and find that higher misalignment correlates with poorer phonological representations, providing a simple diagnostic for phonology-aware tokenization. To address these limitations, we propose a lightweight IPA-based fine-tuning method that infuses phonological awareness into LMs, leading to consistent improvements across three phonology-related tasks while largely preserving math and general reasoning ability, with 1.1% and 0.9% drops on GSM8K and MMLU, respectively.
Chinese Translation
标记化是每个语言模型(LM)的第一步,但它从未考虑单词的声音。我们研究了标记化如何影响仅基于文本的语言模型表示音位知识的能力。通过一系列探测实验,我们表明基于子词的标记化系统性地削弱了局部(例如,韵律)和全局(例如,音节划分)音位特征的编码。为了量化这一影响,我们引入了音节划分-标记化对齐距离(STAD),这一指标测量模型的标记化与单词自然音节边界之间的不对齐程度,并发现更高的不对齐程度与较差的音位表示相关,为音位感知标记化提供了简单的诊断方法。为了解决这些局限性,我们提出了一种轻量级的基于国际音标(IPA)的微调方法,将音位意识注入语言模型中,从而在三个与音位相关的任务中实现了一致的改进,同时在数学和一般推理能力上基本保持不变,GSM8K和MMLU的下降幅度分别为1.1%和0.9%。
cs.CL / 62 / 2604.17108
Beyond Word Boundaries: A Hebrew Coreference Benchmark and an Evaluation Protocol for Morphologically Complex Text
超越词边界:希伯来语共指消解基准及形态复杂文本的评估协议
Abstract
Coreference Resolution (CR) is a fundamental NLP task critical for long-form tasks such as information extraction, summarization, and many business applications. However, CR methods originally designed for English struggle with Morphologically Rich Languages (MRLs), where mention boundaries do not necessarily align with word boundaries, and a single token may consist of multiple anaphors. CR modeling and evaluation protocols standardly assume that, as in English, words and mentions mostly align. However, this assumption breaks down in MRLs, particularly in the context of LLMs' raw-text processing and end-to-end tasks. To assess and address this challenge, we introduce KibutzR, the first comprehensive CR dataset for Modern Hebrew, an MRL rich with complex words and pronominal clitics. We deliver an annotated dataset that identifies mentions at word, sub-word and multi-word levels, and propose an evaluation protocol that directly addresses word/morpheme boundary discrepancies. Our experiments show that contemporary LLMs perform significantly worse on Hebrew than on English, and that performance degrades on raw unsegmented text. Crucially, we show an inverse performance-trend in Hebrew relative to English, where smaller encoders perform far better than contemporary decoder models, leaving ample space for investigation and improvement. We deliver a new benchmark for Hebrew coreference resolution and a segmentation-aware evaluation protocol to inform future work on other MRLs.
Chinese Translation
共指消解(Coreference Resolution, CR)是自然语言处理(NLP)中的一项基础任务,对于信息提取、摘要生成以及许多商业应用至关重要。然而,最初为英语设计的CR方法在处理形态丰富语言(Morphologically Rich Languages, MRLs)时遇到困难,因为在这些语言中,提及边界不一定与词边界对齐,且单个标记可能包含多个指代词。CR建模和评估协议通常假设,像英语一样,词和提及大多对齐。然而,这一假设在MRLs中失效,尤其是在大型语言模型(LLMs)处理原始文本和端到端任务的背景下。为了评估和解决这一挑战,我们推出了KibutzR,这是第一个针对现代希伯来语的全面CR数据集,现代希伯来语是一种富含复杂词汇和代词附加成分的MRL。我们提供了一个注释数据集,识别词、子词和多词层面的提及,并提出了一种评估协议,直接解决词/语素边界的不一致性。我们的实验表明,现代LLMs在希伯来语上的表现显著低于英语,并且在未经分段的原始文本上表现更差。重要的是,我们展示了希伯来语相对于英语的逆向表现趋势,其中较小的编码器表现远好于现代解码器模型,这为进一步的研究和改进留出了广阔的空间。我们为希伯来语共指消解提供了一个新的基准,并提出了一种关注分段的评估协议,以指导未来在其他MRLs上的研究。
cs.CL / 63 / 2604.17114
The Provenance Gap in Clinical AI: Evidence-Traceable Temporal Knowledge Graphs for Rare Disease Reasoning
临床人工智能中的来源差距:用于罕见疾病推理的可追溯证据时间知识图谱
Abstract
Frontier large language models generate clinically accurate outputs, but their citations are often fabricated. We term this the Provenance Gap. We tested five frontier LLMs across 36 clinician-validated scenarios for three rare neuromuscular disease pairs. No model produced a clinically relevant PubMed identifier without prompting. When explicitly asked to cite, the best model achieved 15.3% relevant PMIDs; the majority resolved to real publications in unrelated fields. We present HEG-TKG (Hierarchical Evidence-Grounded Temporal Knowledge Graphs), a system that grounds clinical claims in temporal knowledge graphs built from 4,512 PubMed records and curated sources with quality-tier stratification and 1,280 disease-trajectory milestones. In a controlled three-arm comparison using the same synthesis model, HEG-TKG matches baseline clinical feature coverage while achieving 100% evidence verifiability with 203 inline citations. Guideline-RAG, given overlapping source documents as raw text, produces zero verifiable citations. LLM judges cannot distinguish fabricated from verified citations without PubMed audit data. Independent clinician evaluation confirms the verifiability advantage (Cohen's d = 1.81, p < 0.001) with no degradation on safety or completeness. A counterfactual experiment shows 80% resistance to injected clinical errors with 100% detectability via citation trace. The system deploys on-premise via open-source models so patient data never leaves institutional infrastructure.
Chinese Translation
前沿的大型语言模型生成临床准确的输出,但它们的引用往往是虚构的。我们将此称为来源差距。我们在36个经过临床验证的场景中测试了五个前沿的LLM,涉及三对罕见的神经肌肉疾病。没有模型在没有提示的情况下生成临床相关的PubMed标识符。当明确要求引用时,表现最佳的模型实现了15.3%的相关PMID;大多数引用指向与主题无关的真实出版物。我们提出了HEG-TKG(分层证据基础时间知识图谱),该系统将临床声明基于从4,512个PubMed记录和经过质量分层的策划来源构建的时间知识图谱。在使用相同合成模型的受控三组比较中,HEG-TKG在临床特征覆盖方面与基线相匹配,同时实现了100%的证据可验证性,并提供203个内联引用。Guideline-RAG在提供重叠源文档作为原始文本时,生成零个可验证引用。没有PubMed审计数据,LLM评审者无法区分虚构引用和验证引用。独立的临床评估确认了可验证性的优势(Cohen's d = 1.81,p < 0.001),且在安全性或完整性方面没有下降。反事实实验显示,系统对注入的临床错误具有80%的抵抗力,并通过引用追踪实现100%的可检测性。该系统通过开源模型在本地部署,因此患者数据永远不会离开机构基础设施。
cs.CL / 64 / 2604.17132
Please refuse to answer me! Mitigating Over-Refusal in Large Language Models via Adaptive Contrastive Decoding
请拒绝回答我!通过自适应对比解码缓解大型语言模型中的过度拒绝问题
Abstract
Safety-aligned large language models (LLMs) often generate refusal responses to harmless queries due to the over-refusal problem. However, existing methods for mitigating over-refusal cannot maintain a low refusal ratio for harmless queries while keeping a high refusal ratio for malicious ones. In this paper, we analyze how system prompts with varying safety levels affect LLM refusal behaviors when facing over-refusal queries. A key observation is that, when LLMs suffer from the over-refusal issue, non-refusal tokens remain present in the next-token candidate list, but the model systematically fails to select them, despite the generation of refusal tokens. Based on this observation, we propose a training-free and model-agnostic approach, Adaptive Contrastive Decoding (AdaCD), to mitigate over-refusal while maintaining LLM safety. First, AdaCD compares the output distributions of the LLM with or without an extreme safety system prompt to refine the refusal token distribution. Second, we introduce an adaptive contrastive decoding strategy that dynamically incorporates or removes the refusal token distribution, adaptively boosting the probability of selecting refusal or non-refusal tokens. Experimental results on five benchmark datasets show that, on average, AdaCD reduces the refusal ratio for over-refusal queries by 10.35%, yet still increases the refusal ratio for malicious queries by 0.13%. Code is available at https://github.com/OutdoorManofML/AdaCD.
Chinese Translation
安全对齐的大型语言模型(LLMs)经常对无害查询生成拒绝响应,这一现象被称为过度拒绝问题。然而,现有的缓解过度拒绝的方法无法在保持无害查询低拒绝率的同时,确保对恶意查询的高拒绝率。本文分析了不同安全级别的系统提示如何影响LLM在面对过度拒绝查询时的拒绝行为。一个关键观察是,当LLM遭遇过度拒绝问题时,非拒绝标记仍然存在于下一个标记候选列表中,但模型系统性地未能选择它们,尽管生成了拒绝标记。基于这一观察,我们提出了一种无训练且与模型无关的方法——自适应对比解码(Adaptive Contrastive Decoding, AdaCD),以缓解过度拒绝,同时保持LLM的安全性。首先,AdaCD比较LLM在有或没有极端安全系统提示下的输出分布,以细化拒绝标记的分布。其次,我们引入了一种自适应对比解码策略,动态地纳入或移除拒绝标记分布,自适应提升选择拒绝或非拒绝标记的概率。在五个基准数据集上的实验结果表明,平均而言,AdaCD将过度拒绝查询的拒绝率降低了10.35%,同时将恶意查询的拒绝率提高了0.13%。代码可在 https://github.com/OutdoorManofML/AdaCD 获取。
cs.CL / 65 / 2604.17134
RoIt-XMASA: Multi-Domain Multilingual Sentiment Analysis Dataset for Romanian and Italian
RoIt-XMASA:罗马尼亚语和意大利语的多领域多语言情感分析数据集
Abstract
We present RoIt-XMASA, a multilingual dataset that extends the Cross-lingual Multi-domain Amazon Sentiment Analysis to Italian and Romanian, comprising 36,000 labeled reviews across three domains (books, movies, and music) and 202,141 unlabeled samples. To address cross-lingual and cross-domain challenges, we propose a multi-target adversarial training framework that employs loss reversal with meta-learned coefficients to dynamically balance sentiment discrimination with domain and language invariance. XLM-R achieves an F1-score of 66.23% with our approach, outperforming the baseline by 4.64%. Few-shot evaluation shows that Llama-3.1-8B achieves 58.43% F1-score, revealing a meaningful trade-off between the efficiency of prompting-based approaches and the higher performance of task-specific fine-tuning.
Chinese Translation
我们提出了RoIt-XMASA,这是一个多语言数据集,将跨语言多领域亚马逊情感分析扩展到意大利语和罗马尼亚语,包含36,000条标注评论,涵盖三个领域(书籍、电影和音乐)以及202,141条未标注样本。为了解决跨语言和跨领域的挑战,我们提出了一种多目标对抗训练框架,该框架采用带有元学习系数的损失反转方法,动态平衡情感区分与领域和语言的不变性。使用我们的方法,XLM-R达到了66.23%的F1分数,超越了基线4.64%。少样本评估显示,Llama-3.1-8B达到了58.43%的F1分数,揭示了基于提示的方法的效率与任务特定微调的更高性能之间的有意义权衡。
cs.CL / 66 / 2604.17139
The Consensus Trap: Rescuing Multi-Agent LLMs from Adversarial Majorities via Token-Level Collaboration
共识陷阱:通过令牌级协作拯救多智能体大语言模型免受对抗性多数的影响
Abstract
Multi-agent large language model (LLM) architectures increasingly rely on response-level aggregation, such as Majority Voting (MAJ), to raise reasoning ceilings. However, in open environments, agents are highly susceptible to stealthy contextual corruption, such as targeted prompt injections. We reveal a critical structural vulnerability in current multi-agent systems: response-level aggregation collapses when corrupted agents form a local majority. Because voting aggregates fully-formed conclusions, it is blind to flawed intermediate logic. To overcome this systematic limitation, we propose the Token-Level Round-Robin (RR) Collaboration, where agents sequentially interleave generation within a shared auto-regressive context. We formalize this process as a discrete-time dynamical system, proving that token-level interleaving transitions aggregation from a brittle counting of final votes (a linear sum) to a dynamic, interwoven chain of logic (a non-linear operator product). Through this theoretical lens, we prove that the honest model's restorative pull can overpower adversarial corruptions, even when corrupted agents form a majority. We conduct an exhaustive empirical evaluation across diverse reasoning benchmarks and demonstrate that while MAJ collapses when corrupted agents reach a majority, RR maintains robust accuracy well beyond this critical threshold.
Chinese Translation
多智能体大语言模型(LLM)架构越来越依赖于响应级聚合,例如多数投票(Majority Voting, MAJ),以提高推理能力。然而,在开放环境中,智能体对隐蔽的上下文腐蚀(如针对性的提示注入)高度敏感。我们揭示了当前多智能体系统中的一个关键结构性脆弱性:当被腐蚀的智能体形成局部多数时,响应级聚合会崩溃。由于投票聚合的是完全形成的结论,因此对有缺陷的中间逻辑视而不见。为了克服这一系统性限制,我们提出了令牌级循环协作(Token-Level Round-Robin, RR),在该方法中,智能体在共享的自回归上下文中顺序交错生成。我们将这一过程形式化为离散时间动态系统,证明令牌级交错将聚合从脆弱的最终投票计数(线性和)转变为动态的交织逻辑链(非线性算子积)。通过这一理论视角,我们证明诚实模型的恢复能力可以压倒对抗性腐蚀,即使在被腐蚀的智能体形成多数时。我们在多种推理基准上进行了全面的实证评估,结果表明,当被腐蚀的智能体达到多数时,MAJ会崩溃,而RR在这一临界阈值之上仍保持强大的准确性。
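The contrast between response-level voting and token-level interleaving can be illustrated with a toy sketch. This is not the paper's implementation: the agents here are hypothetical callables standing in for LLM next-token calls, and the function names `majority_vote` and `round_robin_generate` are assumptions made for illustration only.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer (ties go to the answer seen first)."""
    return Counter(answers).most_common(1)[0][0]


def round_robin_generate(agents, prompt, max_tokens):
    """Let agents take turns appending one token each to a shared context.

    `agents` is a list of callables (hypothetical stand-ins for LLMs) that map
    the current token list to the next token.
    """
    context = list(prompt)
    for step in range(max_tokens):
        agent = agents[step % len(agents)]  # strict turn order
        context.append(agent(context))
    return context[len(prompt):]


# Voting follows whichever side holds the majority of finished answers.
print(majority_vote(["42", "42", "17"]))  # -> 42
print(majority_vote(["42", "17", "17"]))  # -> 17

# Round-robin: every agent sees and extends the same partial output.
toy_agents = [lambda ctx: "a", lambda ctx: "b"]
print(round_robin_generate(toy_agents, ["q"], 4))  # -> ['a', 'b', 'a', 'b']
```

The second `majority_vote` call shows the failure mode the abstract describes: once corrupted answers outnumber honest ones, counting finished conclusions returns the corrupted answer. The round-robin loop only shows the turn-taking structure; whether an honest agent can steer the shared context depends on the models, which this toy does not capture.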
cs.CL / 67 / 2604.17141
SciImpact: A Multi-Dimensional, Multi-Field Benchmark for Scientific Impact Prediction
SciImpact:一个多维度、多领域的科学影响预测基准
Abstract
The rapid growth of scientific literature calls for automated methods to assess and predict research impact. Prior work has largely focused on citation-based metrics, leaving limited evaluation of models' capability to reason about other impact dimensions. To this end, we introduce SciImpact, a large-scale, multi-dimensional benchmark for scientific impact prediction spanning 19 fields. SciImpact captures various forms of scientific influence, ranging from citation counts to award recognition, media attention, patent reference, and artifact adoption, by integrating heterogeneous data sources and targeted web crawling. It comprises 215,928 contrastive paper pairs reflecting meaningful impact differences in both short-term (e.g., Best Paper Award) and long-term settings (e.g., Nobel Prize). We evaluate 11 widely used large language models (LLMs) on SciImpact. Results show that off-the-shelf models exhibit substantial variability across dimensions and fields, while multi-task supervised fine-tuning consistently enables smaller LLMs (e.g., 4B) to markedly outperform much larger models (e.g., 30B) and surpass powerful closed-source LLMs (e.g., o4-mini). These results establish SciImpact as a challenging benchmark and demonstrate its value for multi-dimensional, multi-field scientific impact prediction. Our project homepage is https://flypig23.github.io/sciimpact-homepage/
Chinese Translation
科学文献的快速增长呼唤自动化方法来评估和预测研究影响。之前的研究主要集中在基于引用的指标上,对模型在其他影响维度上的推理能力评估有限。为此,我们引入了SciImpact,这是一个大规模的多维度科学影响预测基准,涵盖19个领域。SciImpact捕捉了各种形式的科学影响,从引用次数到奖项认可、媒体关注、专利引用和成果采纳,通过整合异构数据源和针对性网络爬虫实现。它包含215,928对对比论文,反映了短期(例如,最佳论文奖)和长期(例如,诺贝尔奖)设置中有意义的影响差异。我们在SciImpact上评估了11个广泛使用的大型语言模型(LLMs)。结果表明,现成模型在不同维度和领域之间表现出显著的变异性,而多任务监督微调始终使较小的LLMs(例如,4B)显著超越更大的模型(例如,30B),并且超过强大的闭源LLMs(例如,o4-mini)。这些结果确立了SciImpact作为一个具有挑战性的基准,并展示了其在多维度、多领域科学影响预测中的价值。我们的项目主页是 https://flypig23.github.io/sciimpact-homepage/
cs.CL / 68 / 2604.17153
From Legal Text to Executable Decision Models: Evaluating Structured Representations for Legal Decision Model Generation
从法律文本到可执行决策模型:评估法律决策模型生成的结构化表示
Abstract
Transforming legal text into executable decision logic is a longstanding challenge in legal informatics. With the rise of LLMs, this task has gained renewed interest, but remains challenging due to requiring extensive manual coding and evaluation. We use a unique real-world dataset that pairs production-grade decision models with legal text from the Dutch Environment and Planning Act. These models power the Omgevingsloket government platform, where citizens check permit requirements for environmental activities. We study whether intermediate structured representations can improve LLM-based generation of executable decision models from legal text. We compare four input conditions: raw legal text, text enriched with semantic role labels, text enriched with input and output constraints, and text enriched with both. We evaluate along two dimensions: structural evaluation, through similarity to gold decision models with graph kernels and graphs' descriptive statistics, and outcome evaluation, through functional equivalence by executing models on pre-configured test scenarios. Our findings show that I/O constraints provide the dominant improvement (+37-54% similarity over baseline), while semantic role labels show modest improvements. Outcome evaluation shows that generated models match the gold standard on 51-53% of test scenarios, even though generated models are typically smaller and simpler. We find LLMs eliminate redundant pass-through logic that comprises up to 45-55% of nodes. Importantly, structural similarity and outcome equivalence are complementary: structural similarity does not guarantee outcome equivalence, and vice versa. To facilitate reproducibility, we publicly release our dataset of 95 production decision models with associated legal text and all experimental code.
Chinese Translation
将法律文本转化为可执行的决策逻辑一直是法律信息学中的一项长期挑战。随着大型语言模型(LLMs)的兴起,这一任务重新引起了关注,但由于需要大量的手动编码和评估,仍然面临挑战。我们使用一个独特的真实世界数据集,该数据集将生产级决策模型与荷兰环境与规划法的法律文本配对。这些模型为Omgevingsloket政府平台提供支持,公民可以在该平台上检查环境活动的许可要求。我们研究中间结构化表示是否可以改善基于LLM的从法律文本生成可执行决策模型的过程。我们比较了四种输入条件:原始法律文本、带有语义角色标签的文本、带有输入和输出约束的文本,以及同时带有两者的文本。我们从两个维度进行评估:结构评估,通过与黄金决策模型的相似性(使用图核和图的描述统计)以及结果评估,通过在预配置测试场景上执行模型的功能等价性。我们的研究结果表明,输入/输出约束提供了显著的改进(比基线提高37-54%的相似性),而语义角色标签则显示出适度的改善。结果评估显示,生成的模型在51-53%的测试场景中与黄金标准相匹配,尽管生成的模型通常更小且更简单。我们发现LLMs消除了多达45-55%节点的冗余传递逻辑。重要的是,结构相似性和结果等价性是互补的:结构相似性并不保证结果等价性,反之亦然。为了促进可重复性,我们公开发布了包含95个生产决策模型及其相关法律文本的数据集以及所有实验代码。
cs.CL / 69 / 2604.17174
Modeling Multi-Dimensional Cognitive States in Large Language Models under Cognitive Crowding
在认知拥挤下对大型语言模型中的多维认知状态建模
Abstract
Modeling human cognitive states is essential for advanced artificial intelligence. Existing Large Language Models (LLMs) mainly address isolated tasks such as emotion analysis or stance detection, and fail to capture interactions among cognitive dimensions defined in psychology, including emotion, thinking style, stance, and intention. To bridge this gap, we construct CognitiveBench, the first benchmark with unified annotations across the above four dimensions. Experiments on CognitiveBench show that although LLMs perform well on single-dimension tasks, their performance drops sharply in joint multi-dimensional modeling. Using Gromov $\delta$-hyperbolicity analysis, we find that CognitiveBench exhibits a strong hierarchical structure. We attribute the performance bottleneck to "Cognitive Crowding", where hierarchical cognitive states require exponential representational space, while the Euclidean space of LLMs grows only polynomially, causing representation overlap and degraded performance. To address this mismatch, we propose HyCoLLM, which models cognitive states in hyperbolic space and aligns LLM representations via Hyperbolic Guided Alignment Tuning. Results show that HyCoLLM substantially improves multi-dimensional cognitive understanding, allowing an 8B-parameter model to outperform strong baselines, including GPT-4o.
Chinese Translation
建模人类认知状态对于先进的人工智能至关重要。现有的大型语言模型(LLMs)主要处理孤立任务,如情感分析或立场检测,未能捕捉心理学中定义的认知维度之间的相互作用,包括情感、思维风格、立场和意图。为填补这一空白,我们构建了CognitiveBench,这是第一个在上述四个维度上具有统一注释的基准。对CognitiveBench的实验表明,尽管LLMs在单维任务上表现良好,但在联合多维建模中,其性能急剧下降。通过Gromov $\delta$-双曲性分析,我们发现CognitiveBench展现出强烈的层次结构。我们将性能瓶颈归因于“认知拥挤”,即层次认知状态需要指数级的表征空间,而LLMs的欧几里得空间仅以多项式增长,导致表征重叠和性能下降。为了解决这一不匹配,我们提出了HyCoLLM,该模型在双曲空间中建模认知状态,并通过双曲引导对齐调优来对齐LLM表征。结果表明,HyCoLLM显著提高了多维认知理解,使得8B参数模型超越了包括GPT-4o在内的强基准。
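For readers unfamiliar with Gromov $\delta$-hyperbolicity, the sketch below computes it for a small finite metric space using the standard four-point condition: for each quadruple, take the three pairwise distance sums, and $\delta$ is half the gap between the two largest. How the paper derives distances from CognitiveBench is not stated in the abstract, so the distance matrices here are toy inputs chosen for illustration.

```python
from itertools import combinations


def gromov_delta(dist):
    """Four-point Gromov delta of a finite metric space.

    `dist` is a symmetric matrix (list of lists) of pairwise distances.
    Tree metrics give 0; larger values mean a less tree-like space.
    """
    delta = 0.0
    for x, y, z, w in combinations(range(len(dist)), 4):
        sums = sorted([
            dist[x][y] + dist[z][w],
            dist[x][z] + dist[y][w],
            dist[x][w] + dist[y][z],
        ])
        # Half the gap between the two largest pairings.
        delta = max(delta, (sums[2] - sums[1]) / 2.0)
    return delta


# Star tree: centre 0 with four leaves, unit edges (a perfectly hierarchical space).
star = [
    [0, 1, 1, 1, 1],
    [1, 0, 2, 2, 2],
    [1, 2, 0, 2, 2],
    [1, 2, 2, 0, 2],
    [1, 2, 2, 2, 0],
]
# 4-cycle with unit edges (not tree-like).
cycle = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]

print(gromov_delta(star))   # -> 0.0
print(gromov_delta(cycle))  # -> 1.0
```

The exhaustive loop is O(n^4), so on real embedding sets one would typically sample quadruples and report $\delta$ relative to the diameter; that choice is left out here to keep the definition visible.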
cs.CL / 70 / 2604.17178
Cognitive Policy-Driven LLM for Diagnosis and Intervention of Cognitive Distortions in Emotional Support Conversation
基于认知政策驱动的大型语言模型用于情感支持对话中的认知扭曲诊断与干预
Abstract
Emotional Support Conversation (ESC) plays a critical role in mental health assistance by providing accessible psychological support in real-world applications. Large Language Models (LLMs) have shown strong empathetic abilities in ESC tasks. Yet, existing methods overlook the issue of cognitive distortions in help-seekers' expressions. As a result, current models can only provide basic emotional comfort, rather than helping help-seekers address their psychological distress at a deeper cognitive level. To address this challenge, we construct the CogBiasESC dataset, the first dataset that expands existing ESC datasets by adding labels for cognitive distortions, including their type, intensity, and safety risk level. Furthermore, we propose the Cognitive Policy-driven Large Language Model framework (CoPoLLM) to enhance LLMs' ability to diagnose and intervene in cognitive distortions in help-seekers. We also analyze the safety advantages of CoPoLLM from a theoretical perspective. Experimental results show that CoPoLLM significantly outperforms 15 state-of-the-art baselines in terms of distortion diagnosis accuracy, intervention strategy effectiveness, and safety risk control.
Chinese Translation
情感支持对话(ESC)在心理健康援助中发挥着关键作用,通过在现实应用中提供可获得的心理支持。大型语言模型(LLMs)在ESC任务中展现了强大的同理能力。然而,现有方法忽视了求助者表达中的认知扭曲问题。因此,当前模型只能提供基本的情感安慰,而无法帮助求助者在更深层次的认知水平上解决心理困扰。为了解决这一挑战,我们构建了CogBiasESC数据集,这是第一个通过添加认知扭曲标签来扩展现有ESC数据集的数据集,包含其类型、强度和安全风险级别。此外,我们提出了基于认知政策驱动的大型语言模型框架(CoPoLLM),以增强LLMs在求助者认知扭曲的诊断与干预能力。我们还从理论角度分析了CoPoLLM的安全优势。实验结果表明,CoPoLLM在扭曲诊断准确性、干预策略有效性和安全风险控制方面显著优于15个最先进的基线模型。
cs.CL / 71 / 2604.17188
Beyond Overlap Metrics: Rewarding Reasoning and Preferences for Faithful Multi-Role Dialogue Summarization
超越重叠度量:奖励推理与偏好以实现忠实的多角色对话摘要
Abstract
Multi-role dialogue summarization requires modeling complex interactions among multiple speakers while preserving role-specific information and factual consistency. However, most existing methods optimize for automatic metrics such as ROUGE and BERTScore, which favor surface-level imitation of references rather than genuine gains in faithfulness or alignment with human preferences. We propose a novel framework that couples explicit cognitive-style reasoning with reward-based optimization for multi-role dialogue summarization. Our method first distills structured reasoning traces (e.g., step-by-step inferences and intermediate reflections) from a large teacher model and uses them as auxiliary supervision to initialize a reasoning-aware summarizer via staged supervised fine-tuning. It then applies GRPO with a dual-principle reward that blends metric-based signals with human-aligned criteria targeting key information coverage, implicit inference, factual faithfulness, and conciseness. Experiments on multilingual multi-role dialogue benchmarks show that our method matches strong baselines on ROUGE and BERTScore. Specifically, results on CSDS confirm the framework's stability in semantic consistency, while in-depth analysis on SAMSum demonstrates clear gains in factual faithfulness and model-based preference alignment. These findings underscore the value of reasoning-aware and preference-aware training for reliable dialogue summarization. Checkpoints and datasets are available at https://huggingface.co/collections/NebulaPixel/summorchestra-multirole-summary.
Chinese Translation
多角色对话摘要需要建模多个发言者之间的复杂互动,同时保留角色特定的信息和事实一致性。然而,大多数现有方法优化自动度量,如ROUGE和BERTScore,这些度量更倾向于对参考文献的表面模仿,而非在忠实性或与人类偏好的对齐方面的真正提升。我们提出了一种新颖的框架,将显式的认知风格推理与基于奖励的优化结合起来,用于多角色对话摘要。我们的方法首先从大型教师模型中提取结构化的推理痕迹(例如,逐步推理和中间反思),并将其作为辅助监督,通过分阶段的有监督微调来初始化一个关注推理的摘要生成器。然后,它应用GRPO与一种双原则奖励相结合,该奖励将基于度量的信号与针对关键信息覆盖、隐性推理、事实忠实性和简洁性的与人类对齐标准相融合。在多语言多角色对话基准上的实验表明,我们的方法在ROUGE和BERTScore上与强基线相匹配。具体而言,CSDS上的结果确认了该框架在语义一致性方面的稳定性,而对SAMSum的深入分析则展示了在事实忠实性和基于模型的偏好对齐方面的明显提升。这些发现强调了关注推理和偏好的训练在可靠对话摘要中的价值。检查点和数据集可在 https://huggingface.co/collections/NebulaPixel/summorchestra-multirole-summary 获取。
cs.CL / 72 / 2604.17197
Learning to Control Summaries with Score Ranking
学习通过评分排名控制摘要
Abstract
Recent advances in summarization research focus on improving summary quality across multiple criteria, such as completeness, conciseness, and faithfulness, by jointly optimizing these dimensions. However, these efforts largely overlook the challenge of controlling summary generation with respect to individual criteria, especially in the presence of their inherent trade-offs. For example, enhancing conciseness can compromise completeness, and vice versa. In this work, we address this gap by proposing a loss function that aligns model outputs with fine-grained, model-based evaluation scores (e.g., from FineSurE), enabling both improvement in summary quality and dimension-specific control. Our approach improves the overall quality of summaries while maintaining the ability to selectively prioritize one criterion over others. Experiments on three pretrained models (LLaMA, Qwen, and Mistral) demonstrate that our method achieves performance comparable to state-of-the-art summarizers, while uniquely offering strong controllability over individual quality dimensions.
Chinese Translation
最近在摘要研究中的进展集中于通过联合优化多个维度(如完整性、简洁性和忠实性)来提高摘要质量。然而,这些努力在很大程度上忽视了在存在固有权衡的情况下,如何针对单个标准控制摘要生成的挑战。例如,增强简洁性可能会妨碍完整性,反之亦然。在本研究中,我们通过提出一种损失函数来填补这一空白,该损失函数将模型输出与细粒度的基于模型的评估分数(例如来自 FineSurE 的分数)对齐,从而实现摘要质量的提升和针对特定维度的控制。我们的方法在提高摘要整体质量的同时,保持了选择性优先考虑某一标准的能力。在对三个预训练模型(LLaMA、Qwen 和 Mistral)的实验中,我们的方法表现出与最先进的摘要生成器相当的性能,同时独特地提供了对各个质量维度的强控制能力。
cs.CL / 73 / 2604.17200
Calibrating Model-Based Evaluation Metrics for Summarization
基于模型的摘要评估指标的校准
Abstract
Recent advances in summary evaluation are based on model-based metrics to assess quality dimensions, such as completeness, conciseness, and faithfulness. However, these methods often require large language models, and predicted scores are frequently miscalibrated, limiting their reliability. Moreover, evaluating the average quality across different summaries for a single document typically requires access to multiple reference summaries. Here, we propose a general framework that generates individual and average proxy scores without relying on reference summaries, human annotations, or expensive model-based metrics. We also propose group isotonic regression binning (GIRB), a calibration method that adjusts the raw predictions to better align with ground-truth evaluation metrics. While we focus on continuous-value scenarios, such as summarization, the method is applicable to discrete-value tasks, such as question answering. Experiments on seven datasets demonstrate that our approach consistently outperforms existing baselines.
Chinese Translation
最近在摘要评估方面的进展主要基于模型驱动的指标来评估质量维度,如完整性、简洁性和忠实性。然而,这些方法通常需要大型语言模型,并且预测得分经常出现校准不准确的问题,限制了它们的可靠性。此外,评估单个文档的不同摘要的平均质量通常需要访问多个参考摘要。在此,我们提出了一个通用框架,该框架能够生成个体和平均代理得分,而无需依赖参考摘要、人类注释或昂贵的基于模型的指标。我们还提出了一种组单调回归分箱(Group Isotonic Regression Binning, GIRB)校准方法,该方法调整原始预测,以更好地与真实评估指标对齐。虽然我们专注于连续值场景,如摘要生成,但该方法同样适用于离散值任务,如问答。对七个数据集的实验表明,我们的方法始终优于现有基线。
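The calibration step builds on isotonic regression, which can be sketched with the classic pool-adjacent-violators algorithm. This is plain isotonic regression only: the grouping and binning that make up GIRB are not specified in the abstract and are not reproduced here. The input is assumed to be ground-truth scores already ordered by the raw predicted score.

```python
def isotonic_fit(values, weights=None):
    """Pool-adjacent-violators: the non-decreasing sequence closest to
    `values` in (weighted) squared error."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block is [mean, total_weight, item_count]
    for value, weight in zip(values, weights):
        blocks.append([float(value), float(weight), 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


# Ground-truth scores listed in order of increasing raw prediction.
truth_by_prediction = [1, 3, 2, 4, 6, 5]
print(isotonic_fit(truth_by_prediction))  # -> [1.0, 2.5, 2.5, 4.0, 5.5, 5.5]
```

Each pooled block acts like a bin: every raw prediction that falls in it is mapped to the block's mean ground-truth score, which is what makes the calibrated scores monotone in the raw ones.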
cs.CL / 74 / 2604.17225
A Multi-Agent Approach for Claim Verification from Tabular Data Documents
基于多智能体的表格数据文档声明验证方法
Abstract
We present a novel approach for claim verification from tabular data documents. Recent LLM-based approaches either employ complex pretraining/fine-tuning or decompose verification into subtasks, often lacking comprehensive explanations and generalizability. To address these limitations, we propose a Multi-Agentic framework for Claim verification (MACE) consisting of three specialized agents: Planner, Executor, and Verifier. Instead of elaborate fine-tuning, each agent employs a zero-shot Chain-of-Thought setup to perform its tasks. MACE produces interpretable verification traces, with the Planner generating explicit reasoning strategies, the Executor providing detailed computation steps, and the Verifier validating the logic. Experiments demonstrate that MACE achieves state-of-the-art (SOTA) performance on two datasets and performs on par with the best models on two others, while achieving 80-100% of best performance with substantially smaller models: 27-92B parameters versus 235B. This combination of competitive performance, memory efficiency, and transparent reasoning highlights our framework's effectiveness.
Chinese Translation
我们提出了一种新的基于表格数据文档的声明验证方法。近期基于大型语言模型(LLM)的方法通常采用复杂的预训练/微调,或将验证分解为子任务,往往缺乏全面的解释和泛化能力。为了解决这些局限性,我们提出了一种多智能体声明验证框架(MACE),该框架由三个专业智能体组成:规划者(Planner)、执行者(Executor)和验证者(Verifier)。每个智能体采用零样本思维链(Chain-of-Thought)设置来执行其任务,而不是复杂的微调。MACE 生成可解释的验证痕迹,规划者生成明确的推理策略,执行者提供详细的计算步骤,验证者则验证逻辑。实验表明,MACE 在两个数据集上达到了最先进(SOTA)的性能,并在另外两个数据集上与最佳模型表现相当,同时以显著更小的模型(27-92B参数,对比235B参数)达到了最佳性能的80-100%。这种兼具竞争力的性能、内存效率和透明的推理过程突显了我们框架的有效性。
cs.CL / 75 / 2604.17244
DORA Explorer: Improving the Exploration Ability of LLMs Without Training
DORA Explorer:在不训练的情况下提升大语言模型的探索能力
Abstract
Despite the rapid progress, LLMs for sequential decision-making (i.e., LLM agents) still struggle to produce diverse outputs. This leads to insufficient exploration, convergence to sub-optimal solutions, and becoming stuck in loops. Such limitations can be problematic in environments that require active exploration to gather information and make decisions. Sampling methods such as temperature scaling introduce token-level randomness but fail to produce enough diversity at the sequence level. We analyze LLM exploration in the classic Multi-Armed Bandit (MAB) setting and the Text Adventure Learning Environment Suite (TALES). We find that current decoding strategies and prompting methods like Chain-of-Thought and Tree-of-Thought are insufficient for robust exploration. To address this, we introduce DORA Explorer (Diversity-Oriented Ranking of Actions), a training-free framework for improving exploration in LLM agents. DORA generates diverse action candidates, scores them using token log-probabilities, and selects actions using a tunable exploration parameter. DORA achieves UCB-competitive performance on MAB and consistent gains across TALES, e.g., improving Qwen2.5-7B's performance from 29.2% to 45.5% in TextWorld. Our project is available at: https://dora-explore.github.io/.
Chinese Translation
尽管取得了快速进展,但用于顺序决策的大语言模型(即大语言模型代理)仍然难以产生多样化的输出。这导致探索不足、收敛到次优解以及陷入循环等问题。这些限制在需要主动探索以收集信息和做出决策的环境中尤为突出。诸如温度缩放等采样方法引入了令牌级别的随机性,但未能在序列级别上产生足够的多样性。我们分析了在经典的多臂赌博机(Multi-Armed Bandit, MAB)设置和文本冒险学习环境套件(Text Adventure Learning Environment Suite, TALES)中的大语言模型探索。我们发现,当前的解码策略和提示方法(如思维链(Chain-of-Thought)和思维树(Tree-of-Thought))不足以支持稳健的探索。为了解决这一问题,我们提出了DORA Explorer(以多样性为导向的行动排名),这是一个无需训练的框架,用于提升大语言模型代理的探索能力。DORA生成多样化的行动候选,使用令牌对数概率对其进行评分,并通过可调的探索参数选择行动。DORA在多臂赌博机上实现了与上界置信(UCB)竞争的性能,并在TALES中获得了一致的提升,例如,在TextWorld中将Qwen2.5-7B的性能从29.2%提高到45.5%。我们的项目可在以下网址获取:https://dora-explore.github.io/
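The score-then-select step can be sketched as follows. The abstract says only that candidates are scored with token log-probabilities and chosen via a tunable exploration parameter; the softmax-with-temperature rule below is an assumption made for illustration, not DORA's actual selection rule, and `selection_probs` and `select_action` are hypothetical names.

```python
import math
import random


def selection_probs(log_probs, exploration):
    """Turn candidate scores into selection probabilities.

    exploration <= 0 is greedy (all mass on the best score);
    larger values flatten the distribution toward uniform.
    """
    if exploration <= 0:
        best = max(range(len(log_probs)), key=lambda i: log_probs[i])
        return [1.0 if i == best else 0.0 for i in range(len(log_probs))]
    scaled = [lp / exploration for lp in log_probs]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(candidates, log_probs, exploration, rng=random):
    """Sample one candidate action according to its selection probability."""
    probs = selection_probs(log_probs, exploration)
    return rng.choices(candidates, weights=probs, k=1)[0]


actions = ["open door", "go north", "take lamp"]
scores = [-0.1, -2.0, -3.0]  # hypothetical sequence log-probabilities

print(select_action(actions, scores, exploration=0))  # -> open door
print(selection_probs(scores, exploration=100.0))     # close to uniform
```

Scoring whole candidate actions rather than sampling token by token is what moves the randomness to the sequence level, which is the gap in temperature scaling that the abstract points to.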
cs.CL / 76 / 2604.17252
Seeing Isn't Believing: Mitigating Belief Inertia via Active Intervention in Embodied Agents
看见并不等于相信:通过主动干预减轻具身代理的信念惯性
Abstract
Recent advancements in large language models (LLMs) have enabled agents to tackle complex embodied tasks through environmental interaction. However, these agents still make suboptimal decisions and perform ineffective actions, as they often overlook critical environmental feedback that differs from their internal beliefs. Through a formal probing analysis, we characterize this as belief inertia, a phenomenon where agents stubbornly adhere to prior beliefs despite explicit observations. To address this, we advocate active belief intervention, moving from passive understanding to active management. We introduce the Estimate-Verify-Update (EVU) mechanism, which empowers agents to predict expected outcomes, verify them against observations through explicit reasoning, and actively update prior beliefs based on the verification evidence. EVU is designed as a unified intervention mechanism that generates textual belief states explicitly, and can be integrated into both prompting-based and training-based agent reasoning methods. Extensive experiments across three embodied benchmarks demonstrate that EVU consistently yields substantial gains in task success rates. Further analyses validate that our approach effectively mitigates belief inertia, advancing the development of more robust embodied agents. Our code is available at https://github.com/WangHanLinHenry/EVU.
Chinese Translation
近期大型语言模型(LLMs)的进展使得代理能够通过环境互动处理复杂的具身任务。然而,这些代理仍然做出次优决策并执行无效行动,因为它们常常忽视与其内部信念不同的关键环境反馈。通过正式的探测分析,我们将其表征为信念惯性,这是一种现象,代理在明确观察到的情况下仍顽固地坚持先前的信念。为了解决这个问题,我们倡导主动信念干预,从被动理解转向主动管理。我们引入了估计-验证-更新(Estimate-Verify-Update, EVU)机制,使代理能够预测预期结果,通过明确推理将其与观察结果进行验证,并根据验证证据主动更新先前的信念。EVU被设计为一个统一的干预机制,明确生成文本信念状态,并可以集成到基于提示和基于训练的代理推理方法中。针对三个具身基准的广泛实验表明,EVU在任务成功率上始终带来显著提升。进一步分析验证了我们的方法有效减轻了信念惯性,推动了更强大具身代理的发展。我们的代码可在 https://github.com/WangHanLinHenry/EVU 获取。
cs.CL / 77 / 2604.17255
Are Emotion and Rhetoric Neurons in LLM? Neuron Recognition and Adaptive Masking for Emotion-Rhetoric Prediction Steering
情感与修辞神经元在大型语言模型中存在吗?情感-修辞预测引导的神经元识别与自适应掩蔽
Abstract
Accurate comprehension and controllable generation of emotion and rhetoric are pivotal for enhancing the reasoning capabilities of large language models (LLMs). Existing studies mostly rely on external optimizations, lacking in-depth exploration of internal representation mechanisms, thus failing to achieve fine-grained steering at the neuron level. A handful of works on neurons are confined to emotions, neglecting rhetoric neurons and their intrinsic connections. Traditional neuron masking also exhibits counterintuitive phenomena, making reliable verification of neuron functionality infeasible. To address these issues, we systematically investigate the neuron representation mechanisms and inherent associations of 6 emotion categories and 4 core rhetorical devices. We propose a neuron identification framework that integrates multi-dimensional screening, and design an adaptive masking method incorporating dynamic filtering, attenuation masking, and feedback optimization, enabling reliable causal validation of neuron functionality. Through neuron regulation, we achieve directed induction of non-target sentences and enhancement of emotion tasks via rhetoric neurons. Experiments on 5 commonly used datasets validate the effectiveness of our method, providing a novel paradigm for the fine-grained steering of emotion and rhetoric expressions in LLMs.
Chinese Translation
准确理解和可控生成情感与修辞对于增强大型语言模型(LLMs)的推理能力至关重要。现有研究大多依赖外部优化,缺乏对内部表征机制的深入探索,因此未能在神经元层面实现细粒度的引导。关于神经元的少数研究仅限于情感,忽视了修辞神经元及其内在联系。传统的神经元掩蔽也表现出反直觉现象,使得神经元功能的可靠验证变得不可行。为了解决这些问题,我们系统地研究了6种情感类别和4种核心修辞手法的神经元表征机制及其内在关联。我们提出了一种神经元识别框架,整合了多维筛选,并设计了一种自适应掩蔽方法,结合动态过滤、衰减掩蔽和反馈优化,从而实现神经元功能的可靠因果验证。通过神经元调节,我们实现了对非目标句子的定向引导以及通过修辞神经元增强情感任务。对5个常用数据集的实验验证了我们方法的有效性,为大型语言模型中情感与修辞表达的细粒度引导提供了新的范式。
cs.CL / 78 / 2604.17257
REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning
REZE:用于领域自适应文本嵌入预微调的表示正则化
Abstract
Recent text embedding models are often adapted to specialized domains via contrastive pre-finetuning (PFT) on a naive collection of scattered, heterogeneous tasks. However, this approach often introduces task-induced bias alongside domain knowledge, leading to uncontrolled representation shifts that distort the pretrained embedding geometry and cause substantial performance degradation. To address this issue, we propose REZE, a representation regularization framework that explicitly controls representation shift during embedding pre-finetuning. REZE operates on the relations of anchor-positive pairs and decomposes them in an eigenspace. It then measures task-wise dispersion along each eigencomponent to identify task-variant directions and applies adaptive soft-shrinkage to suppress task-induced noise while preserving task-invariant semantic structure, without inference-time overhead. Experiments across multiple embedding backbones and specialized benchmarks show that REZE outperforms standard pre-finetuning and isotropy-oriented post-hoc regularization in most settings, remaining stable where existing PFT variants collapse. Embedding space analyses further confirm that REZE induces controlled shifts aligned with the original embedding manifold, underscoring representation shift control as a key principle for robust embedding pre-finetuning under heterogeneous supervision.
Chinese Translation
近期的文本嵌入模型通常通过在一组简单的分散异构任务上进行对比预微调(PFT)来适应特定领域。然而,这种方法往往会引入任务诱导的偏差,伴随领域知识,导致无法控制的表示偏移,从而扭曲预训练嵌入的几何结构,并造成显著的性能下降。为了解决这一问题,我们提出了REZE,一种在嵌入预微调过程中明确控制表示偏移的表示正则化框架。REZE基于锚点-正样本对的关系,并在特征空间中对其进行分解。然后,它测量每个特征分量上的任务间离散度,以识别任务变异方向,并应用自适应软收缩来抑制任务诱导的噪声,同时保留任务不变的语义结构,而不增加推理时的开销。在多个嵌入骨干网络和专业基准上的实验表明,REZE在大多数设置中优于标准的预微调和面向各向同性的后处理正则化,并在现有PFT变体崩溃的情况下保持稳定。嵌入空间分析进一步确认REZE诱导的受控偏移与原始嵌入流形一致,强调了表示偏移控制作为在异构监督下实现稳健嵌入预微调的关键原则。
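The dispersion-then-shrink step in the REZE abstract can be sketched on toy numbers. This is only an illustration under stated assumptions: the coefficients are taken as already projected onto eigencomponents, dispersion is defined here as the variance of per-task means, and the shrinkage is plain soft-thresholding with a threshold proportional to that dispersion. None of these specifics are given in the abstract, and all names are hypothetical.

```python
from statistics import mean, pvariance

def task_dispersion(coeffs_by_task):
    """Per component: variance across tasks of the per-task mean coefficient."""
    n_components = len(next(iter(coeffs_by_task.values()))[0])
    dispersion = []
    for k in range(n_components):
        task_means = [mean([row[k] for row in rows])
                      for rows in coeffs_by_task.values()]
        dispersion.append(pvariance(task_means))
    return dispersion

def soft_shrink(value, threshold):
    """Standard soft-thresholding: move the value toward zero by `threshold`."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def regularize(row, dispersion, strength):
    """Shrink each coefficient by a threshold that grows with its dispersion."""
    return [soft_shrink(v, strength * d) for v, d in zip(row, dispersion)]

# Made-up coefficients: component 0 agrees across tasks, component 1 does not.
coeffs_by_task = {
    "task_a": [[1.0, 2.0], [1.0, 4.0]],
    "task_b": [[1.0, -2.0], [1.0, -4.0]],
}
dispersion = task_dispersion(coeffs_by_task)
shrunk = regularize([1.0, 2.0], dispersion, strength=0.1)
```

In this toy case the task-invariant component passes through unchanged while the task-variant component is pulled toward zero, which is the qualitative behavior the abstract describes.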
cs.CL / 79 / 2604.17260
Rethinking Meeting Effectiveness: A Benchmark and Framework for Temporal Fine-grained Automatic Meeting Effectiveness Evaluation
重新思考会议有效性:时间细粒度自动会议有效性评估的基准与框架
Abstract
Evaluating meeting effectiveness is crucial for improving organizational productivity. Current approaches rely on post-hoc surveys that yield a single coarse-grained score for an entire meeting. The reliance on manual assessment is inherently limited in scalability, cost, and reproducibility. Moreover, a single score fails to capture the dynamic nature of collaborative discussions. We propose a new paradigm for evaluating meeting effectiveness centered on novel criteria and a temporally fine-grained approach. We define effectiveness as the rate of objective achievement over time and assess it for individual topical segments within a meeting. To support this task, we introduce the AMI Meeting Effectiveness (AMI-ME) dataset, a new meta-evaluation dataset containing 2,459 human-annotated segments from 130 AMI Corpus meetings. We also develop an automatic effectiveness evaluation framework that uses a Large Language Model (LLM) as a judge to score each segment's effectiveness relative to the overall meeting objectives. Through substantial experiments, we establish a comprehensive benchmark for this new task and evaluate the framework's generalizability across distinct meeting types, ranging from business scenarios to unstructured discussions. Furthermore, we benchmark end-to-end performance starting from raw speech to measure the capabilities of a complete system. Our results validate the framework's effectiveness and provide strong baselines to facilitate future research in meeting analysis and multi-party dialogue. Our dataset and code will be publicly available. The AMI-ME dataset and the Automatic Evaluation Framework are available at: this URL.
Chinese Translation
评估会议有效性对于提高组织生产力至关重要。目前的方法依赖于事后调查,这些调查为整个会议提供了一个粗略的单一评分。依赖人工评估在可扩展性、成本和可重复性方面本质上是有限的。此外,单一评分无法捕捉协作讨论的动态特性。我们提出了一种新的会议有效性评估范式,围绕新颖的标准和时间细粒度的方法进行。我们将有效性定义为随时间推移的目标实现率,并对会议中的各个主题段进行评估。为支持这一任务,我们引入了AMI会议有效性(AMI-ME)数据集,这是一个新的元评估数据集,包含来自130个AMI语料库会议的2,459个人工标注段落。我们还开发了一个自动有效性评估框架,使用大型语言模型(LLM)作为评判者,对每个段落的有效性进行评分,以相对于整体会议目标进行评估。通过大量实验,我们为这一新任务建立了全面的基准,并评估了框架在不同会议类型(从商业场景到非结构化讨论)的通用性。此外,我们从原始语音开始基准化端到端性能,以测量完整系统的能力。我们的结果验证了框架的有效性,并提供了强有力的基准,以促进未来在会议分析和多方对话方面的研究。我们的数据集和代码将公开提供。AMI-ME数据集和自动评估框架可在此网址获取。
cs.CL / 80 / 2604.17271
HopRank: Self-Supervised LLM Preference-Tuning on Graphs for Few-Shot Node Classification
HopRank:基于图的自监督大语言模型偏好调优用于少样本节点分类
Abstract
Node classification on text-attributed graphs (TAGs) is a fundamental task with broad applications in citation analysis, social networks, and recommendation systems. Current GNN-based approaches suffer from shallow text encoding and heavy dependence on labeled data, limiting their effectiveness in label-scarce settings. While large language models (LLMs) naturally address the text understanding gap with deep semantic reasoning, existing LLM-for-graph methods either still require abundant labels during training or fail to exploit the rich structural signals freely available in graph topology. Our key observation is that, in many real-world TAGs, edges predominantly connect similar nodes under the homophily principle, meaning graph topology inherently encodes class structure without any labels. Building on this insight, we reformulate node classification as a link prediction task and present HopRank, a fully self-supervised LLM-tuning framework for TAGs. HopRank constructs preference data via hierarchical hop-based sampling and employs adaptive preference learning to prioritize informative training signals without any class labels. At inference, nodes are classified by predicting their connection preferences to labeled anchors, with an adaptive early-exit voting scheme to improve efficiency. Experiments on three TAG benchmarks show that HopRank matches fully-supervised GNNs and substantially outperforms prior graph-LLM methods, despite using zero labeled training data.
Chinese Translation
文本属性图(TAGs)上的节点分类是一项基础任务,在引用分析、社交网络和推荐系统中具有广泛应用。目前基于图神经网络(GNN)的方法受到浅层文本编码和对标记数据的高度依赖的限制,从而限制了它们在标签稀缺环境中的有效性。尽管大型语言模型(LLMs)通过深层语义推理自然地解决了文本理解的差距,但现有的图-LLM方法要么在训练期间仍需大量标签,要么未能利用图拓扑中自由可用的丰富结构信号。我们的关键观察是,在许多现实世界的TAG中,边缘主要连接相似节点,这符合同质性原则,这意味着图拓扑本质上在没有任何标签的情况下编码了类别结构。基于这一见解,我们将节点分类重新表述为链接预测任务,并提出了HopRank,一个完全自监督的LLM调优框架用于TAG。HopRank通过层次跳跃采样构建偏好数据,并采用自适应偏好学习来优先考虑信息丰富的训练信号,而无需任何类别标签。在推理时,通过预测节点与标记锚点的连接偏好来进行节点分类,并采用自适应提前退出投票机制以提高效率。在三个TAG基准上的实验表明,HopRank的性能与完全监督的GNN相当,并且显著优于之前的图-LLM方法,尽管使用了零标记训练数据。
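The early-exit voting idea at HopRank's inference stage can be sketched as a tally that stops once the outcome is settled. The stopping rule below (stop when the leader's margin exceeds the number of remaining votes) is an assumption chosen for illustration, since the abstract does not specify the adaptive criterion. In practice each vote would come from an LLM connection-preference query against a labeled anchor, and here the votes are simply given.

```python
from collections import Counter

def classify_with_early_exit(votes):
    """Tally class votes in order; stop once the leader cannot be overtaken.

    votes: non-empty list of class labels, one per anchor comparison.
    Returns (predicted_label, number_of_votes_used).
    """
    tally = Counter()
    total = len(votes)
    for used, label in enumerate(votes, start=1):
        tally[label] += 1
        ranked = tally.most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        lead = ranked[0][1] - runner_up
        if lead > total - used:  # remaining votes cannot change the winner
            return ranked[0][0], used
    return tally.most_common(1)[0][0], total

# Hypothetical class labels for seven anchor comparisons.
label, used = classify_with_early_exit(
    ["ml", "ml", "db", "ml", "ml", "db", "ml"]
)
```

Every skipped vote is one fewer LLM query, which is where the efficiency gain of an early-exit scheme would come from.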
cs.CL / 81 / 2604.17282
MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning
MedPRMBench:医学推理中过程奖励模型的细粒度基准
Abstract
Process-Level Reward Models (PRMs) are essential for guiding complex reasoning in large language models, yet existing PRM benchmarks cover only general domains such as mathematics, failing to address medical reasoning -- which is uniquely characterized by safety criticality, knowledge intensity, and diverse error patterns. Without a reliable medical PRM evaluation framework, we cannot quantify models' error detection capabilities in clinical reasoning, leaving their safety in real-world healthcare applications unverified. We propose MedPRMBench, the first process-level reward model benchmark for the medical domain. Built through a three-phase pipeline based on Clinical Reasoning Blueprints (CRBs), MedPRMBench systematically generates high-quality evaluation data from seven medical QA sources, covering 14 fine-grained error types across three categories (Simplicity, Soundness, and Sensitivity) with the first 4-level severity grading system to quantify clinical impact. The benchmark comprises 6,500 questions with 13,000 reasoning chains and 113,910 step-level labels, plus 6,879 questions for training. Our medical PRM baseline achieves an 87.1% overall PRMScore -- substantially surpassing all baselines -- and serves as a plug-and-play verifier that improves downstream medical QA accuracy by 3.2--6.7 percentage points. Systematic evaluation spanning proprietary frontier models, open-source reasoning models, and medical-specialized models reveals critical weaknesses in current models' medical reasoning error detection capabilities, providing clear directions for future PRM improvement.
Chinese Translation
过程级奖励模型(PRMs)对于指导大型语言模型中的复杂推理至关重要,但现有的PRM基准仅涵盖数学等一般领域,未能解决医学推理问题——医学推理的独特特征在于安全性关键性、知识密集性和多样的错误模式。缺乏可靠的医学PRM评估框架,我们无法量化模型在临床推理中的错误检测能力,从而无法验证其在现实医疗应用中的安全性。我们提出了MedPRMBench,这是医学领域首个过程级奖励模型基准。MedPRMBench通过基于临床推理蓝图(CRBs)的三阶段流程构建,系统地从七个医学问答来源生成高质量的评估数据,涵盖三个类别(简单性、合理性和敏感性)中的14种细粒度错误类型,并采用首个4级严重性分级系统来量化临床影响。该基准包含6,500个问题、13,000个推理链和113,910个步骤级标签,以及6,879个用于训练的问题。我们的医学PRM基线实现了87.1%的总体PRMScore——显著超越所有基线——并作为即插即用的验证工具,提高了下游医学问答的准确性3.2至6.7个百分点。对专有前沿模型、开源推理模型和医学专用模型的系统评估揭示了当前模型在医学推理错误检测能力方面的关键弱点,为未来PRM的改进提供了明确方向。
cs.CL / 82 / 2604.17283
HorizonBench: Long-Horizon Personalization with Evolving Preferences
HorizonBench:具有演变偏好的长期个性化
Abstract
User preferences evolve across months of interaction, and tracking them requires inferring when a stated preference has been changed by a subsequent life event. We define this problem as long-horizon personalization and observe that progress on it is limited by data availability and measurement, with no existing resource providing both naturalistic long-horizon interactions and the ground-truth provenance needed to diagnose why models fail. We introduce a data generator that produces conversations from a structured mental state graph, yielding ground-truth provenance for every preference change across 6-month timelines, and from it construct HorizonBench, a benchmark of 4,245 items from 360 simulated users with 6-month conversation histories averaging ~4,300 turns and ~163K tokens. HorizonBench provides a testbed for long-context modeling, memory-augmented architectures, theory-of-mind reasoning, and user modeling. Across 25 frontier models, the best model reaches 52.8% and most score at or below the 20% chance baseline. When these models err on evolved preferences, over a third of the time they select the user's originally stated value without tracking the updated user state. This belief-update failure persists across context lengths and expression explicitness levels, identifying state-tracking capability as the primary bottleneck for long-horizon personalization.
Chinese Translation
用户偏好在数月的互动中不断演变,追踪这些偏好需要推断何时由于后续生活事件而改变了已声明的偏好。我们将此问题定义为长期个性化,并观察到在这一领域的进展受到数据可用性和测量的限制,目前没有现有资源能够提供自然的长期互动以及诊断模型失败原因所需的真实来源。我们引入了一种数据生成器,该生成器从结构化的心理状态图中生成对话,为每个偏好变化提供真实来源,时间跨度为6个月,并由此构建了HorizonBench,这是一个包含4,245个项目的基准,来自360个模拟用户,具有平均约4,300轮对话历史和约163K个标记。HorizonBench为长期上下文建模、增强记忆架构、心智理论推理和用户建模提供了测试平台。在25个前沿模型中,最佳模型的得分为52.8%,大多数模型的得分在20%的基线概率或以下。当这些模型在演变偏好上出现错误时,超过三分之一的时间它们选择了用户最初声明的值,而没有跟踪更新后的用户状态。这种信念更新失败在不同的上下文长度和表达明确性水平下持续存在,识别出状态跟踪能力是长期个性化的主要瓶颈。
cs.CL / 83 / 2604.17290
Probabilistic Programs of Thought
思维的概率程序
Abstract
LLMs are widely used for code generation and mathematical reasoning tasks where they are required to generate structured output. They either need to reason about code, generate code for a given specification, or reason using programs of thought. The typical approach to code generation is to prompt the model and generate samples until an appropriate program is obtained. Within this process, sampling $n$ programs from the language model requires $n$ GPU compute-intensive generations which becomes prohibitively expensive for larger values of $n$. In this work, we address this limitation by exposing the LLM's distribution within the generated programs themselves. We propose a novel test-time framework we dub probabilistic programs of thought to obtain more samples from the model with fewer LLM generations. Given a program generated by a model and the associated next-token probabilities, we build a probabilistic program that compactly represents exponentially many deterministic programs. Since performing probabilistic reasoning in this probabilistic program is much cheaper, our approach allows sampling new programs without any additional GPU compute and little CPU overhead. We instantiate our approach on benchmarks for code generation, code understanding and mathematical reasoning and report improvements in performance with fewer generations from the LLM.
Chinese Translation
大型语言模型(LLMs)广泛应用于代码生成和数学推理任务,这些任务要求生成结构化输出。它们需要对代码进行推理,为给定的规范生成代码,或使用思维程序进行推理。代码生成的典型方法是提示模型并生成样本,直到获得合适的程序。在此过程中,从语言模型中采样 $n$ 个程序需要进行 $n$ 次计算密集型的 GPU 生成,这在 $n$ 较大时变得极为昂贵。在本研究中,我们通过揭示 LLM 在生成程序中的分布来解决这一限制。我们提出了一种新颖的测试时框架,称为思维的概率程序,以便在较少的 LLM 生成次数下从模型中获得更多样本。给定模型生成的程序及其相关的下一个标记概率,我们构建了一个概率程序,该程序紧凑地表示出指数数量的确定性程序。由于在这个概率程序中进行概率推理的成本要低得多,我们的方法允许在没有额外 GPU 计算和很少 CPU 开销的情况下采样新程序。我们在代码生成、代码理解和数学推理的基准测试中实例化了我们的方法,并报告在较少的 LLM 生成次数下性能的提升。
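The claim that one generation plus its next-token probabilities compactly represents exponentially many programs can be made concrete with a toy sketch. It makes a strong simplifying assumption that the paper may not share: each position's alternatives are treated as independent of earlier choices. The token alternatives and probabilities are made up, and the function names are hypothetical.

```python
import random

# One entry per token position: (token, probability) alternatives as they
# might be read off a single LLM generation. Values are illustrative only.
positions = [
    [("x", 1.0)],
    [("=", 1.0)],
    [("a", 0.7), ("b", 0.3)],
    [("+", 0.6), ("*", 0.4)],
    [("1", 0.5), ("2", 0.5)],
]

def count_programs(positions):
    """Number of distinct deterministic programs encoded by the structure."""
    n = 1
    for alternatives in positions:
        n *= len(alternatives)
    return n

def sample_program(positions, rng):
    """Draw one program on the CPU, with no further model call."""
    tokens = []
    for alternatives in positions:
        choices, weights = zip(*alternatives)
        tokens.append(rng.choices(choices, weights=weights, k=1)[0])
    return tokens

def program_probability(positions, tokens):
    """Probability of a specific program under the independence assumption."""
    prob = 1.0
    for alternatives, token in zip(positions, tokens):
        prob *= dict(alternatives)[token]
    return prob

n_programs = count_programs(positions)
sample = sample_program(positions, random.Random(0))
p_first = program_probability(positions, ["x", "=", "a", "+", "1"])
```

Three binary choice points already encode eight programs. With k such points the count is 2^k, while drawing each extra sample costs only a few random-number calls.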
cs.CL / 84 / 2604.17293
Beyond "I Don't Know": Evaluating LLM Self-Awareness in Discriminating Data and Model Uncertainty
超越“我不知道”:评估大型语言模型在区分数据和模型不确定性方面的自我意识
Abstract
Reliable Large Language Models (LLMs) should abstain when confidence is insufficient. However, prior studies often treat refusal as a generic "I don't know", failing to distinguish input-level ambiguity (data uncertainty) from capability limitations (model uncertainty). This lack of distinction limits downstream action decisions like requesting clarification or invoking external tools. In this work, we introduce UA-Bench, a benchmark of over 3,500 questions drawn from six datasets spanning knowledge-intensive and reasoning-intensive tasks, designed to evaluate explicit uncertainty attribution. An evaluation of 18 frontier LLMs shows that even state-of-the-art models struggle to reliably discriminate between data uncertainty and model uncertainty, and that high answer accuracy does not necessarily imply strong uncertainty attribution ability. To narrow this gap, we propose a lightweight data synthesis and reinforcement learning strategy. Experiments on both Qwen3-4B-Instruct-2507 and Qwen3-8B in thinking mode show that the proposed method improves uncertainty attribution while preserving answer accuracy. Our code and data are publicly available now.
Chinese Translation
可靠的大型语言模型(LLMs)在信心不足时应当选择不作答。然而,以往的研究往往将拒绝回答视为一种通用的“我不知道”,未能区分输入层面的模糊性(数据不确定性)与能力限制(模型不确定性)。这种缺乏区分的情况限制了后续行动决策,例如请求澄清或调用外部工具。在本研究中,我们引入了UA-Bench,这是一个基准测试,包含来自六个数据集的3500多个问题,涵盖知识密集型和推理密集型任务,旨在评估明确的不确定性归因。对18个前沿LLM的评估表明,即使是最先进的模型在可靠区分数据不确定性和模型不确定性方面也面临挑战,并且高答题准确性并不一定意味着强大的不确定性归因能力。为了缩小这一差距,我们提出了一种轻量级的数据合成和强化学习策略。在Qwen3-4B-Instruct-2507和Qwen3-8B的思考模式下进行的实验表明,所提方法在保持答题准确性的同时改善了不确定性归因。我们的代码和数据现已公开。
cs.CL / 85 / 2604.17297
CRISP: Compressing Redundancy in Chain-of-Thought via Intrinsic Saliency Pruning
CRISP:通过内在显著性剪枝压缩思维链中的冗余
Abstract
Long Chain-of-Thought (CoT) reasoning is pivotal for the success of recent reasoning models but suffers from high computational overhead and latency. While prior works attempt to compress CoT via external compressors, they often fail to align with the model's internal reasoning dynamics, resulting in the loss of critical logical steps. This paper presents Compressing Redundancy in Chain-of-Thought via Intrinsic Saliency Pruning (CRISP), a framework that compresses CoT by exploiting the model's intrinsic saliency. Our analysis reveals a distinct phenomenon: the reasoning termination token acts as an information anchor, where its attention pattern effectively demarcates essential reasoning from redundancy. Based on this finding, we design a policy that utilizes these intrinsic attention signals to guide atomic compression operations. In contrast to coarse-grained pruning strategies, CRISP strategically distills the reasoning chain to maximize information density while preserving logical coherence. Empirical results across various backbone models and mathematical datasets demonstrate that CRISP achieves a 50-60% reduction in token count without compromising accuracy, effectively mitigating the efficiency bottleneck of long-context reasoning. We open-source our implementation to facilitate further research in efficient reasoning.
Chinese Translation
长思维链(Chain-of-Thought, CoT)推理对于近期推理模型的成功至关重要,但其计算开销和延迟较高。尽管之前的研究尝试通过外部压缩器来压缩CoT,但往往未能与模型的内部推理动态对齐,导致关键逻辑步骤的丢失。本文提出了Compressing Redundancy in Chain-of-Thought via Intrinsic Saliency Pruning(CRISP)框架,该框架通过利用模型的内在显著性来压缩CoT。我们的分析揭示了一种独特现象:推理终止标记充当信息锚点,其注意力模式有效地划分了重要推理与冗余信息。基于这一发现,我们设计了一种策略,利用这些内在注意力信号来指导原子压缩操作。与粗粒度剪枝策略相比,CRISP战略性地提炼推理链,以最大化信息密度,同时保持逻辑一致性。不同主干模型和数学数据集的实证结果表明,CRISP在不影响准确性的情况下实现了50-60%的标记数量减少,有效缓解了长上下文推理的效率瓶颈。我们开源了我们的实现,以促进高效推理的进一步研究。
cs.CL / 86 / 2604.17299
Cat-DPO: Category-Adaptive Safety Alignment
Cat-DPO:类别自适应安全对齐
Abstract
Aligning large language models with human preferences must balance two competing goals: responding helpfully to legitimate requests and reliably refusing harmful ones. Most preference-based safety alignment methods collapse safety into a single scalar that is applied uniformly to every preference pair. The result is a model that looks safe on average but stays relatively unsafe on a minority of harm categories. We cast safety alignment as a per-category constrained optimization problem and derive Cat-DPO, a direct-preference-optimization algorithm with a separate adaptive safety margin for each harm category. The margin tightens when the model still produces unsafe responses on a category and relaxes once the model catches up, so the training signal tracks each category's current difficulty rather than averaging under one global rate. Across two LLM backbones and six preference-learning baselines, Cat-DPO improves aggregate helpfulness and harmlessness and compresses per-category safety variance and the best-to-worst gap, offering a drop-in per-category refinement of direct preference safety alignment.
Chinese Translation
将大型语言模型与人类偏好对齐必须平衡两个相互竞争的目标:对合法请求做出有帮助的回应和可靠地拒绝有害请求。大多数基于偏好的安全对齐方法将安全性简化为一个统一应用于每对偏好的单一标量。其结果是模型在平均上看起来是安全的,但在少数有害类别上仍然相对不安全。我们将安全对齐视为一个按类别的约束优化问题,并推导出 Cat-DPO,这是一种直接偏好优化算法,为每个有害类别设定独立的自适应安全边际。当模型在某一类别上仍然产生不安全的响应时,边际会收紧;一旦模型跟上进度,边际会放宽,因此训练信号跟踪每个类别当前的难度,而不是在一个全局速率下取平均。在两个大型语言模型基础和六个偏好学习基准上,Cat-DPO 提高了整体的有用性和无害性,并压缩了每个类别的安全方差和最佳与最差之间的差距,提供了直接偏好安全对齐的逐类别优化。
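The per-category adaptive margin described for Cat-DPO can be sketched as a margin-augmented DPO loss plus a simple margin update. Everything specific here is an assumption for illustration: the additive placement of the margin inside the sigmoid, the linear update rule, the target unsafe rate, and the category names. The abstract states only that each harm category has its own margin that tightens while the model is still unsafe on it and relaxes afterwards.

```python
import math

def dpo_margin_loss(logratio_chosen, logratio_rejected, beta, margin):
    """-log sigmoid(beta * (chosen - rejected) - margin)."""
    z = beta * (logratio_chosen - logratio_rejected) - margin
    return math.log1p(math.exp(-z))

def update_margin(margin, unsafe_rate, target_rate, step):
    """Raise the margin while a category is too unsafe; lower it otherwise."""
    return max(0.0, margin + step * (unsafe_rate - target_rate))

# Hypothetical categories and made-up unsafe-response rates.
margins = {"category_hard": 0.5, "category_easy": 0.5}
unsafe_rates = {"category_hard": 0.30, "category_easy": 0.02}
for category in margins:
    margins[category] = update_margin(
        margins[category], unsafe_rates[category], target_rate=0.05, step=1.0
    )

# The same preference pair is penalized more under the tighter margin.
loss_hard = dpo_margin_loss(1.0, 0.0, beta=0.1, margin=margins["category_hard"])
loss_easy = dpo_margin_loss(1.0, 0.0, beta=0.1, margin=margins["category_easy"])
```

A larger margin demands a wider gap between chosen and rejected responses before the loss flattens, so categories that are still unsafe keep receiving a stronger training signal.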
cs.CL / 87 / 2604.17301
RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation
RoTRAG:基于检索增强生成的对话有害内容检测的经验法则推理
Abstract
Detecting harmful content in multi-turn dialogue requires reasoning over the full conversational context rather than isolated utterances. However, most existing methods rely mainly on models' internal parametric knowledge, without explicit grounding in external normative principles. This often leads to inconsistent judgments in socially nuanced contexts, limited interpretability, and redundant reasoning across turns. To address this, we propose RoTRAG, a retrieval-augmented framework that incorporates concise human-written moral norms, called Rules of Thumb (RoTs), into LLM-based harm assessment. For each turn, RoTRAG retrieves relevant RoTs from an external corpus and uses them as explicit normative evidence for turn-level reasoning and final severity classification. To improve efficiency, we further introduce a lightweight binary routing classifier that decides whether a new turn requires retrieval-grounded reasoning or can reuse existing context. Experiments on ProsocialDialog and Safety Reasoning Multi Turn Dialogue show that RoTRAG consistently improves both harm classification and severity estimation over competitive baselines, with an average relative gain of around 40% in F1 across benchmark datasets and an average relative reduction of 8.4% in distributional error, while reducing redundant computation without sacrificing performance.
Chinese Translation
在多轮对话中检测有害内容需要对整个对话上下文进行推理,而不仅仅是孤立的发言。然而,大多数现有方法主要依赖于模型内部的参数知识,而没有明确依赖于外部规范原则。这往往导致在社会细微的语境中判断不一致、可解释性有限以及跨轮次的冗余推理。为了解决这个问题,我们提出了RoTRAG,一个检索增强框架,将简洁的人类撰写的道德规范(称为经验法则,Rules of Thumb,RoTs)融入基于大型语言模型(LLM)的有害内容评估中。在每一轮中,RoTRAG从外部语料库中检索相关的RoTs,并将其作为轮次级推理和最终严重性分类的明确规范证据。为了提高效率,我们进一步引入了一种轻量级的二元路由分类器,用于决定新轮次是否需要基于检索的推理,或是否可以重用现有上下文。在ProsocialDialog和Safety Reasoning Multi Turn Dialogue上的实验表明,RoTRAG在有害内容分类和严重性估计上均显著优于竞争基线,基准数据集上的F1平均相对提升约40%,分布误差平均相对减少8.4%,同时减少冗余计算而不牺牲性能。
cs.CL / 88 / 2604.17316
Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA
校准?并非适用于所有人:性取向和宗教标记如何扭曲大型语言模型在医学问答中的准确性和信心
Abstract
Safe clinical deployment of Large Language Models (LLMs) requires not only high accuracy but also robust uncertainty calibration to ensure models defer to clinicians when appropriate. Our paper investigates how social descriptors of a patient (specifically sexual orientation and religious affiliation) distort these uncertainty signals and model accuracy. Evaluating nine general-purpose and biomedical LLMs on 2,364 medical questions and their counterfactual variants, we demonstrate that identity markers cause a "calibration crisis". "Homosexual" markers consistently trigger performance drops, and intersectional identities produce idiosyncratic, non-additive harms to calibration. Moreover, a clinician-validated case study in an open-ended generation setting confirms that these failures are not an artifact of the multiple-choice format. Our results demonstrate that the presence of social identity cues does not merely shift predictions; it affects the reliability of confidence signals, posing a significant risk to equitable care and safe deployment in confidence-based clinical workflows.
Chinese Translation
大型语言模型(LLMs)在临床安全部署中不仅需要高准确性,还需要强健的不确定性校准,以确保模型在适当时能够听从临床医生的意见。我们的论文研究了患者的社会描述符(特别是性取向和宗教归属)如何扭曲这些不确定性信号和模型准确性。我们对2364个医学问题及其反事实变体评估了九个通用和生物医学LLMs,结果表明身份标记导致了“校准危机”。“同性恋”标记持续触发性能下降,而交叉身份则对校准产生特有的、非加性的伤害。此外,在开放式生成设置中经过临床医生验证的案例研究确认,这些失败并非多项选择格式的伪影。我们的结果表明,社会身份线索的存在不仅仅是改变预测;它影响了信心信号的可靠性,给公平护理和基于信心的临床工作流程的安全部署带来了重大风险。
cs.CL / 89 / 2604.17323
A Universal Avoidance Method for Diverse Multi-branch Generation
一种通用的多分支生成避免方法
Abstract
Modern generative models still lack human-level creativity, particularly in multi-branch diversity. Prior approaches to address this problem often incur heavy computation or strong dependency on model architecture. Therefore, we introduce UAG (Universal Avoidance Generation), a model-agnostic and computationally efficient generation strategy that penalizes similarity among previously generated outputs. Thus, UAG can enhance multi-branch diversity across both diffusion and transformer models, with minimal additional computation. In experiments, our method achieves up to 1.9 times higher diversity, runs 4.4 times faster, and requires only 1/64 of the FLOPs compared to state-of-the-art methods. The full code is available at https://anonymous.4open.science/r/2026_ACL_Universal/.
Chinese Translation
现代生成模型在创造力方面仍然缺乏人类水平的表现,尤其是在多分支多样性方面。以往解决这一问题的方法往往需要大量计算或对模型架构有较强依赖。因此,我们提出了UAG(Universal Avoidance Generation),这是一种与模型无关且计算效率高的生成策略,通过对先前生成输出之间的相似性进行惩罚,从而增强多分支多样性。UAG可以在扩散模型和变换器模型中提升多分支多样性,且额外计算量极小。在实验中,我们的方法实现了最高1.9倍的多样性提升,运行速度快4.4倍,并且与最先进的方法相比,仅需1/64的FLOPs。完整代码可访问 https://anonymous.4open.science/r/2026_ACL_Universal/。
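The core idea of penalizing similarity to previously generated outputs can be shown with a toy selection loop. This sketch works at the level of finished text candidates and uses Jaccard word overlap as the similarity measure. Both choices are assumptions for illustration, and the abstract does not say how UAG measures similarity or where in the generation process the penalty is applied.

```python
def jaccard(a, b):
    """Word-set overlap between two strings, in [0, 1]."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def pick_next(candidates, previous, penalty):
    """Return the candidate with the best similarity-penalized score.

    candidates: dict mapping text -> base score (e.g. model likelihood).
    previous: list of outputs already produced for other branches.
    """
    def penalized(text):
        worst = max((jaccard(text, p) for p in previous), default=0.0)
        return candidates[text] - penalty * worst
    return max(candidates, key=penalized)

# Made-up candidates and scores.
candidates = {
    "a cat on a mat": 1.0,
    "a cat on a rug": 0.95,
    "a dog in the park": 0.8,
}
branches = []
for _ in range(2):
    remaining = {t: s for t, s in candidates.items() if t not in branches}
    branches.append(pick_next(remaining, branches, penalty=0.5))
```

Without the penalty the second branch would be the near-duplicate with the second-highest base score. With it, the more distinct candidate wins despite its lower base score.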
cs.CL / 90 / 2604.17325
Align Documents to Questions: Question-Oriented Document Rewriting for Retrieval-Augmented Generation
对齐文档与问题:面向问题的文档重写用于检索增强生成
Abstract
Retrieval-Augmented Generation (RAG) enhances the factuality of Large Language Models (LLMs) by incorporating retrieved documents and/or generated context. However, LLMs often exhibit a stylistic bias when presented with mixed contexts, favoring fluent but hallucinated generated content over factually grounded yet disorganized retrieved evidence. This phenomenon reveals that the utility of retrieved information is bottlenecked by its presentation. To bridge this gap, we propose QREAM, a style-controlled rewriter that aligns retrieved documents with a question-oriented style while preserving facts, making them easier for LLM readers to use. Our framework consists of two stages: (1) QREAM-ICL, which uses stylistic seeds to guide iterative rewriting exploration; and (2) QREAM-FT, a lightweight student model distilled from denoised ICL outputs. QREAM-FT employs dual-criteria rejection sampling, filtering based on answer correctness and factual consistency to ensure high-quality supervision. QREAM seamlessly integrates into existing RAG pipelines as a plug-and-play module. Experiments demonstrate that QREAM consistently enhances advanced RAG pipelines, yielding up to 8% relative improvement with negligible latency overhead, effectively balancing question relevance with factual grounding.
Chinese Translation
检索增强生成(Retrieval-Augmented Generation, RAG)通过结合检索到的文档和/或生成的上下文来增强大型语言模型(Large Language Models, LLMs)的事实性。然而,当面对混合上下文时,LLMs往往表现出风格偏见,更倾向于流畅但虚构的生成内容,而非事实基础但组织松散的检索证据。这一现象表明,检索信息的效用受到其呈现方式的瓶颈。为了解决这一问题,我们提出了QREAM,一种风格控制的重写器,旨在将检索到的文档与面向问题的风格对齐,同时保留事实,以便LLM读者更好地利用。我们的框架包括两个阶段:(1)QREAM-ICL,利用风格种子指导迭代重写探索;(2)QREAM-FT,基于去噪的ICL输出提炼出的轻量级学生模型。QREAM-FT采用双标准拒绝采样,根据答案正确性和事实一致性进行过滤,以确保高质量的监督。QREAM可以无缝集成到现有的RAG管道中,作为即插即用模块。实验表明,QREAM持续增强先进的RAG管道,带来高达8%的相对提升,同时延迟开销微乎其微,有效平衡了问题相关性与事实基础。
cs.CL / 91 / 2604.17340
Neuro-Symbolic Resolution of Recommendation Conflicts in Multimorbidity Clinical Guidelines
多病共存临床指南中推荐冲突的神经符号解决方案
Abstract
Clinical guidelines, typically developed by independent specialty societies, inherently exhibit substantial fragmentation, redundancy, and logical contradiction. These inconsistencies, particularly when applied to patients with multimorbidity, not only cause cognitive dissonance for clinicians but also introduce catastrophic noise into AI systems, rendering the standard Retrieval-Augmented Generation (RAG) system fragile and prone to hallucination. To address this fundamental reliability crisis, we introduce a Neuro-Symbolic framework that automates the detection of recommendation redundancies and conflicts. Our pipeline employs a multi-agent system to translate unstructured clinical natural language into rigorous symbolic logic language, which is then verified by a Satisfiability (SAT) solver. By formulating a hierarchical taxonomy of logical rule interactions, we identify a critical category termed Local Conflict - a decision conflict arising from the intersection of comorbidities. Evaluating our system on a curated benchmark of 12 authoritative SGLT2 inhibitor guidelines, we reveal that 90.6% of conflicts are Local, a structural complexity that single-disease guidelines fail to address. While state-of-the-art LLMs fail in detecting these conflicts, our neuro-symbolic approach achieves an F1 score of 0.861. This work demonstrates that logical verification must precede retrieval, establishing a new technical standard for automated knowledge coordination in medical AI.
Chinese Translation
临床指南通常由独立的专业学会制定,固有地表现出显著的碎片化、冗余和逻辑矛盾。这些不一致性,尤其是在应用于多病共存患者时,不仅给临床医生带来了认知失调,还为人工智能系统引入了灾难性的噪声,使得标准的检索增强生成(Retrieval-Augmented Generation, RAG)系统变得脆弱且易于产生幻觉。为了解决这一根本的可靠性危机,我们提出了一种神经符号框架,自动检测推荐的冗余和冲突。我们的流程采用多智能体系统将非结构化的临床自然语言翻译为严格的符号逻辑语言,然后由可满足性(Satisfiability, SAT)求解器进行验证。通过制定逻辑规则交互的层次分类法,我们识别出一个关键类别,称为局部冲突(Local Conflict)——这是由于共病交集而产生的决策冲突。在对12个权威SGLT2抑制剂指南的精心策划的基准测试中评估我们的系统,我们发现90.6%的冲突是局部的,这种结构复杂性是单病指南无法解决的。尽管最先进的大型语言模型(LLMs)在检测这些冲突方面表现不佳,但我们的神经符号方法达到了0.861的F1分数。这项工作表明,逻辑验证必须优先于检索,为医疗人工智能中的自动知识协调建立了新的技术标准。
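The notion of a Local Conflict, where recommendations are consistent in general but contradict each other once comorbidities intersect, can be illustrated with a brute-force satisfiability check. The two rules below are hypothetical placeholders, not content from any real guideline, and exhaustive enumeration stands in for the SAT solver the abstract mentions.

```python
from itertools import product

VARS = ["condition_a", "condition_b", "prescribe_drug"]

# Hypothetical rules for illustration only.
def rule_1(v):  # condition_a implies prescribe_drug
    return (not v["condition_a"]) or v["prescribe_drug"]

def rule_2(v):  # condition_b implies not prescribe_drug
    return (not v["condition_b"]) or (not v["prescribe_drug"])

def has(name):
    """Constraint asserting that the patient has the named condition."""
    return lambda v: v[name]

def satisfiable(constraints):
    """Brute-force SAT check over all assignments (fine for a few variables)."""
    for values in product([False, True], repeat=len(VARS)):
        v = dict(zip(VARS, values))
        if all(c(v) for c in constraints):
            return True
    return False

rules = [rule_1, rule_2]
globally_consistent = satisfiable(rules)
single_disease_ok = satisfiable(rules + [has("condition_a")])
comorbid_ok = satisfiable(rules + [has("condition_a"), has("condition_b")])
```

The rule set is satisfiable on its own and for a patient with only one condition, but becomes unsatisfiable when both conditions are asserted. That is the pattern a single-disease reading of either rule would never surface.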
cs.CL / 92 / 2604.17346
Logical Computational Linguistics
逻辑计算语言学
Abstract
In this book we promote logical computational linguistics as opposed to statistical computational linguistics. In particular, we provide a logical semantic interface. This book assembles more than twenty years of research work on type logical grammar, and adds new ideas and material. Chains of statistical dependencies of less than one hundred per cent confidence tend monotonically to zero. Chains of logical dependencies of any length maintain one hundred per cent confidence end to end. We aspire to enable perfect syntactic and semantic processing in life-critical NLP applications.
Chinese Translation
在本书中,我们倡导逻辑计算语言学,与统计计算语言学相对。特别地,我们提供了一个逻辑语义接口。本书汇集了超过二十年的类型逻辑语法研究成果,并增加了新的思想和材料。统计依赖链的置信度低于百分之百时,趋向于单调地接近零。任何长度的逻辑依赖链则始终保持百分之百的置信度。我们期望能够在生命关键的自然语言处理应用中实现完美的句法和语义处理。
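The contrast between statistical and logical dependency chains can be checked with simple arithmetic. The sketch assumes each step in a chain is independent and has the same confidence, which is a simplification introduced here and not a claim made in the abstract.

```python
def chain_confidence(step_confidence, length):
    """Confidence of a chain of independent steps with equal confidence."""
    return step_confidence ** length

lengths = (1, 10, 100, 500)
statistical = [chain_confidence(0.99, n) for n in lengths]
logical = [chain_confidence(1.0, n) for n in lengths]
```

At 99% per step, a chain of 100 steps already falls below 37% end-to-end confidence, whereas a chain at exactly 100% per step stays at 100% for any length.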
cs.CL / 93 / 2604.17354
More Than Meets the Eye: Measuring the Semiotic Gap in Vision-Language Models via Semantic Anchorage
超越视觉表象:通过语义锚定测量视觉-语言模型中的符号差距
Abstract
Vision-Language Models (VLMs) excel at photorealistic generation, yet often struggle to represent abstract meaning such as idiomatic interpretations of noun compounds. To study whether high visual fidelity interferes with idiomatic compositionality under visual abstraction, we introduce DIVA, a controlled benchmark that replaces high-fidelity visual detail with schematic iconicity by generating paired, sense-anchored visualizations for literal and idiomatic readings. We further propose Semantic Alignment Gap ($\Delta$), an architecture-agnostic metric that quantifies divergence between literal and idiomatic visual grounding. We additionally introduce a directional signed bias $b(t)$ to separately measure the direction and strength of literal preference. Evaluating 8 recent VLMs, we reveal a consistent Literal Superiority Bias: model scale alone does not resolve literal preference, and increased visual fidelity is associated with weaker symbolic alignment, suggesting cognitive interference from hyper-realistic imagery. Our findings suggest that improving compositional understanding requires iconographic abstraction of visual input and anchoring interpretation and generation in intended meaning.
Chinese Translation
视觉-语言模型(VLMs)在照片级真实感生成方面表现出色,但在表示抽象意义(如名词复合词的习语解释)时常常面临挑战。为了研究高视觉保真度是否会干扰在视觉抽象下的习语组合性,我们引入了DIVA,一个控制基准,通过生成配对的、语义锚定的可视化图像来替代高保真视觉细节,以用于字面和习语解读。我们进一步提出了语义对齐差距(Semantic Alignment Gap, $\Delta$),这是一种与架构无关的度量,量化字面和习语视觉基础之间的偏差。此外,我们引入了一个方向性符号偏差 $b(t)$,以单独测量字面偏好的方向和强度。对8个近期的VLM进行评估,我们揭示了一种一致的字面优势偏差:模型规模本身并不能解决字面偏好,而更高的视觉保真度与较弱的符号对齐相关,这表明超现实图像可能导致认知干扰。我们的研究结果表明,改善组合理解需要对视觉输入进行图标化抽象,并在预期意义中锚定解释和生成。
cs.CL / 94 / 2604.17358
Still Between Us? Evaluating and Improving Voice Assistant Robustness to Third-Party Interruptions
仍然在我们之间?评估和提升语音助手对第三方干扰的鲁棒性
Abstract
While recent Spoken Language Models (SLMs) have been actively deployed in real-world scenarios, they lack the capability to discern Third-Party Interruptions (TPI) from the primary user's ongoing flow, leaving them vulnerable to contextual failures. To bridge this gap, we introduce TPI-Train, a dataset of 88K instances designed with speaker-aware hard negatives to enforce acoustic cue prioritization for interruption handling, and TPI-Bench, a comprehensive evaluation framework designed to rigorously measure interruption-handling strategy and precise speaker discrimination in deceptive contexts. Experiments demonstrate that our dataset design mitigates semantic shortcut learning, a critical pitfall where models exploit semantic context while neglecting the acoustic signals essential for discerning speaker changes. We believe our work establishes a foundational resource for overcoming text-dominated unimodal reliance in SLMs, paving the way for more robust multi-party spoken interaction. The code for the framework is publicly available at https://tpi-va.github.io
Chinese Translation
尽管近期的口语语言模型(Spoken Language Models, SLMs)已在现实场景中积极部署,但它们缺乏区分第三方干扰(Third-Party Interruptions, TPI)与主要用户正在进行的对话流的能力,从而使其在上下文中容易出现失败。为了解决这一问题,我们引入了TPI-Train,一个包含88K实例的数据集,设计了以说话者为中心的困难负样本,以强化干扰处理中的声学线索优先级。此外,我们还提出了TPI-Bench,一个全面的评估框架,旨在严格测量干扰处理策略和在欺骗性上下文中的精确说话者区分。实验表明,我们的数据集设计减轻了语义捷径学习的影响——这是一个关键的陷阱,模型在利用语义上下文的同时忽视了区分说话者变化所需的声学信号。我们相信我们的工作为克服SLMs中以文本为主的单模态依赖奠定了基础资源,为更鲁棒的多方口语交互铺平了道路。该框架的代码已公开发布,网址为 https://tpi-va.github.io
cs.CL / 95 / 2604.17366
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
ArgBench:对计算论证任务中大型语言模型的基准测试
Abstract
Argumentation skills are an essential toolkit for large language models (LLMs). These skills are crucial in various use cases, including self-reflection, debating collaboratively for diverse answers, and countering hate speech. In this paper, we create the first benchmark for a standardized evaluation of LLM-based approaches to computational argumentation, encompassing 33 datasets from previous work in unified form. Using the benchmark, we evaluate the generalizability of five LLM families across 46 computational argumentation tasks that cover mining arguments, assessing perspectives, assessing argument quality, reasoning about arguments, and generating arguments. On the benchmark, we conduct an extensive systematic analysis of the contribution of few-shot examples, reasoning steps, model size, and training skills to the performance of LLMs on the computational argumentation tasks in the benchmark.
Chinese Translation
论证能力是大型语言模型(LLMs)必不可少的工具。这些能力在多种应用场景中至关重要,包括自我反思、协作辩论以获得多样化答案,以及反击仇恨言论。本文创建了第一个标准化评估LLM基础的计算论证方法的基准,涵盖了来自以往研究的33个数据集,以统一的形式呈现。利用该基准,我们评估了五个LLM家族在46个计算论证任务上的泛化能力,这些任务包括论点挖掘、观点评估、论点质量评估、论证推理以及论点生成。在基准测试中,我们对少量示例、推理步骤、模型规模和训练技能对LLMs在计算论证任务中表现的贡献进行了广泛的系统分析。
cs.CL / 96 / 2604.17377
AnchorMem: Anchored Facts with Associative Contexts for Building Memory in Large Language Models
AnchorMem:基于关联上下文的锚定事实在大型语言模型中的记忆构建
Abstract
While large language models have achieved remarkable performance in complex tasks, they still need a memory system to utilize historical experience in long-term interactions. Existing memory methods (e.g., A-Mem, Mem0) place excessive emphasis on organizing interactions by frequently rewriting them; however, this heavy reliance on summarization risks diluting essential contextual nuances and obscuring key retrieval features. To bridge this gap, we introduce AnchorMem, a novel memory framework inspired by the Proust Phenomenon in cognitive science, where a specific anchor triggers a holistic recollection. We propose a method that decouples the retrieval unit from the generation context. AnchorMem extracts atomic facts from interaction history to serve as retrieval anchors, while preserving the original context as immutable context. To reveal implicit narrative cues, we construct an associative event graph whose higher-order event links bind sets of related facts into shared event representations, strengthening cross-memory integration without relying on generic entities as bridges. During retrieval, the system anchors queries to specific facts and events to locate relevant memories, but reconstructs the context using the associated raw chunks and events. Our method reconciles fine-grained retrieval with the contextual integrity of interactions. Experiments across three closed-source and open-source models on the LoCoMo benchmark demonstrate that AnchorMem significantly outperforms baselines. Code is available at https://github.com/RayNeo-AI-2025/AnchorMem.
Chinese Translation
尽管大型语言模型在复杂任务中取得了显著的性能,但它们仍然需要一个记忆系统来利用长期交互中的历史经验。现有的记忆方法(如 A-Mem、Mem0)过于强调通过频繁重写来组织交互,然而,这种对摘要的过度依赖可能会稀释重要的上下文细微差别,并模糊关键的检索特征。为了解决这一问题,我们提出了 AnchorMem,这是一种新颖的记忆框架,灵感来自认知科学中的普鲁斯特现象,其中特定的锚点触发整体回忆。我们提出了一种将检索单元与生成上下文解耦的方法。AnchorMem 从交互历史中提取原子事实作为检索锚点,同时保留原始上下文作为不可变上下文。为了揭示隐含的叙事线索,我们构建了一个关联事件图,该图使用高阶事件链接将相关事实集合绑定为共享事件表示,从而增强跨记忆的整合,而无需依赖通用实体作为桥梁。在检索过程中,系统将查询锚定到特定的事实和事件,以定位相关的记忆,但使用相关的原始片段和事件重建上下文。我们的方法调和了细粒度检索与交互的上下文完整性。在 LoCoMo 基准上对三个闭源和开源模型的实验表明,AnchorMem 显著优于基线。代码可在 https://github.com/RayNeo-AI-2025/AnchorMem 获得。
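The decoupling of retrieval unit and generation context can be sketched as follows. This is a minimal illustration, not the AnchorMem implementation: the class name `AnchoredMemory` is hypothetical, token overlap stands in for whatever similarity function the authors use, fact extraction is assumed to have happened upstream, and the associative event graph is omitted.

```python
def tokens(text):
    """Lowercased whitespace tokens; a crude stand-in for an embedding model."""
    return set(text.lower().split())

class AnchoredMemory:
    def __init__(self):
        self.chunks = []   # raw interaction chunks, never rewritten
        self.anchors = []  # (atomic fact, index of the chunk it came from)

    def add(self, chunk, facts):
        idx = len(self.chunks)
        self.chunks.append(chunk)
        for fact in facts:
            self.anchors.append((fact, idx))

    def retrieve(self, query, k=1):
        """Match the query against facts, but return the original raw chunks."""
        q = tokens(query)
        ranked = sorted(self.anchors,
                        key=lambda anchor: len(q & tokens(anchor[0])),
                        reverse=True)
        seen, results = set(), []
        for _fact, idx in ranked:
            if idx not in seen:
                seen.add(idx)
                results.append(self.chunks[idx])
            if len(results) == k:
                break
        return results

memory = AnchoredMemory()
memory.add("Alice: I moved to Lisbon last spring, the light there is amazing.",
           ["alice moved to lisbon"])
memory.add("Bob: my dog Rex hates rain.", ["bob has a dog named rex"])
print(memory.retrieve("where did alice move"))
```

The point of the design is that the short fact is what gets matched, while the untouched chunk is what reaches the generator, so no detail is lost to summarization.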
cs.CL / 97 / 2604.17393
Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains
谁在监督监督者?人类与翻译指标在未见领域的意见不一致
Abstract
Automatic evaluation metrics are central to the development of machine translation systems, yet their robustness under domain shift remains unclear. Most metrics are developed on the Workshop on Machine Translation (WMT) benchmarks, raising concerns about their robustness to unseen domains. Prior studies that analyze unseen domains vary translation systems, annotators, or evaluation conditions, confounding domain effects with human annotation noise. To address these biases, we introduce a systematic multi-annotator Cross-Domain Error-Span-Annotation dataset (CD-ESA), comprising 18.8k human error span annotations across three language pairs, where we fix annotators within each language pair and evaluate translations of the same six translation systems across one seen news domain and two unseen technical domains. Using this dataset, we first find that automatic metrics appear surprisingly robust to domain-shifts at the segment level (up to 0.69 agreement), but this robustness largely disappears once we account for human label variation. Averaging annotations increases inter-annotator agreement by up to +0.11. Metrics struggle on the unseen chemical domain compared to humans (inter-annotator agreement of 0.78-0.83 vs. 0.96). We recommend comparing metric-human agreement against inter-annotator agreement, rather than comparing raw metric-human agreement alone, when evaluating across different domains.
Chinese Translation
自动评估指标在机器翻译系统的开发中至关重要,但它们在领域转移下的稳健性仍不明确。大多数指标是在机器翻译研讨会(WMT)基准上开发的,这引发了对它们在未见领域稳健性的担忧。先前分析未见领域的研究在翻译系统、注释者或评估条件上存在差异,将领域效应与人类注释噪声混淆。为了解决这些偏差,我们引入了一个系统的多注释者跨领域错误跨度注释数据集(CD-ESA),该数据集包含来自三个语言对的18.8k个人类错误跨度注释,在每个语言对中固定注释者,并评估同六个翻译系统在一个已见新闻领域和两个未见技术领域的翻译。使用该数据集,我们首先发现自动指标在段级别上对领域转移表现出惊人的稳健性(一致性高达0.69),但一旦考虑人类标签的变异,这种稳健性大部分消失。平均注释使得注释者间一致性提高了最多+0.11。与人类相比,指标在未见化学领域的表现较差(注释者间一致性为0.78-0.83,而人类为0.96)。我们建议在不同领域的评估中,将指标与人类的一致性与注释者间一致性进行比较,而不是仅仅比较原始的指标与人类的一致性。
cs.CL / 98 / 2604.17396
Representation-Guided Parameter-Efficient LLM Unlearning
基于表示的参数高效大型语言模型遗忘
Abstract
Large Language Models (LLMs) often memorize sensitive or harmful information, necessitating effective machine unlearning techniques. While existing parameter-efficient unlearning methods have shown promise, they still struggle with the forget-retain trade-off. This can be attributed to their reliance on parameter importance metrics to identify parameters that are important exclusively for the forget set, which is fundamentally limited by the superposition phenomenon. Due to the polysemantic nature of LLM parameters, such an importance metric may struggle to disentangle parameters associated with the forget and retain sets. In this work, we propose Representation-Guided Low-rank Unlearning (REGLU), a novel approach that leverages the geometric properties of representation spaces to achieve robust and precise unlearning. First, we develop a representation-guided initialization for LoRA that identifies the optimal subspace for selective forgetting. Second, we introduce a regularization loss that constrains the outputs of the LoRA update to lie in the orthogonal complement of the retain set's representation subspace, thereby minimizing interference with the model's performance on the retain set. We evaluate REGLU on the TOFU and WMDP benchmarks across multiple models. Our results demonstrate that REGLU consistently outperforms state-of-the-art baselines, achieving superior unlearning quality while maintaining higher model utility.
Chinese Translation
大型语言模型(LLMs)常常记忆敏感或有害的信息,因此需要有效的机器遗忘技术。虽然现有的参数高效遗忘方法显示出一定的前景,但它们在遗忘与保留之间的权衡仍然面临挑战。这可以归因于它们依赖于参数重要性度量来识别仅对遗忘集重要的参数,而这种方法在本质上受到叠加现象的限制。由于LLM参数的多义性,这种重要性度量可能难以区分与遗忘集和保留集相关的参数。在本研究中,我们提出了一种基于表示的低秩遗忘方法(REGLU),这是一种新颖的方法,利用表示空间的几何特性来实现稳健且精确的遗忘。首先,我们为LoRA开发了一种基于表示的初始化方法,以识别选择性遗忘的最佳子空间。其次,我们引入了一种正则化损失,限制LoRA更新的输出位于保留集表示子空间的正交补空间,从而最小化对模型在保留集上的性能的干扰。我们在TOFU和WMDP基准测试中评估了REGLU,涵盖多个模型。我们的结果表明,REGLU始终优于最先进的基线,达到更高的遗忘质量,同时保持更高的模型效用。
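The orthogonality constraint described above can be sketched in a few lines. The example assumes an orthonormal basis for the retain set's representation subspace is already available and penalizes the squared norm of the update's component inside that subspace. The function names and the exact form of the penalty are assumptions of this sketch and are not taken from the REGLU paper.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retain_component(vec, basis):
    """Project vec onto span(basis); basis vectors are assumed orthonormal."""
    out = [0.0] * len(vec)
    for b in basis:
        coeff = dot(vec, b)
        out = [o + coeff * bi for o, bi in zip(out, b)]
    return out

def orthogonality_penalty(update_output, basis):
    """Squared norm of the part of the update that lies in the retain subspace."""
    proj = retain_component(update_output, basis)
    return dot(proj, proj)

retain_basis = [[1.0, 0.0, 0.0]]
print(orthogonality_penalty([0.0, 2.0, 3.0], retain_basis))  # fully orthogonal update
print(orthogonality_penalty([2.0, 0.0, 0.0], retain_basis))  # update inside the subspace
```

An update with zero penalty lies entirely in the orthogonal complement, so in this simplified picture it leaves retain-set representations untouched.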
cs.CL / 99 / 2604.17398
Contrastive Analysis of Linguistic Representations in Large Language Model Outputs through Structured Synthetic Data Generation and Abstracted N-gram Associations
通过结构化合成数据生成和抽象化N-gram关联对大型语言模型输出中的语言表征进行对比分析
Abstract
We present a methodological framework to discover linguistic and discursive patterns associated with different social groups through contrastive synthetic text generation and statistical analysis. In contrast with previous approaches, we aim to characterize subtle expressions of bias, instead of diagnosing bias through a pre-determined list of words or expressions. We also work with contextualized data instead of isolated words or sentences. Our methodology applies to textual productions in any genre, whether narrative, task-oriented, or dialogic. Contextualized data are generated using controlled combinations of situational scenarios and group markers, creating minimal pairs of texts that differ only in the referenced group while maintaining comparable narrative conditions. To facilitate robust analysis, linguistic forms are generalized and associations between linguistic abstractions and groups are quantified using a variant of pointwise mutual information to detect expressions that appear disproportionately across groups. A fragment-ranking strategy then prioritizes text segments with a high concentration of biased linguistic signals, which allows experts to assess the harmful potential of linguistic expressions in context, bridging quantitative analysis and qualitative interpretation.
Chinese Translation
我们提出了一种方法论框架,以通过对比合成文本生成和统计分析,发现与不同社会群体相关的语言和话语模式。与以往的方法相比,我们旨在表征偏见的微妙表现,而不是通过预先确定的单词或表达列表来诊断偏见。我们还使用了上下文化的数据,而不是孤立的单词或句子。我们的方法适用于任何体裁的文本产出,包括叙述性、任务导向或对话性文本。上下文化的数据是通过对情境场景和群体标记的受控组合生成的,创建仅在引用的群体上有所不同的最小文本对,同时保持可比的叙述条件。为了促进稳健的分析,语言形式被概括,并通过一种变体的点互信息(pointwise mutual information)量化语言抽象与群体之间的关联,以检测在群体之间不成比例出现的表达。然后,片段排名策略优先考虑具有高浓度偏见语言信号的文本片段,使专家能够在上下文中评估语言表达的有害潜力,从而桥接定量分析与定性解读。
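The association step can be illustrated with plain pointwise mutual information over (group, pattern) observations. The abstract mentions a variant of PMI without specifying it, so the standard definition used here is a stand-in, and the group and pattern labels are made up for the example.

```python
import math
from collections import Counter

def pmi_table(observations):
    """observations: list of (group, abstracted_pattern) pairs.

    Returns PMI in bits for every pair that occurs at least once.
    """
    pair_counts = Counter(observations)
    group_counts = Counter(group for group, _ in observations)
    pattern_counts = Counter(pattern for _, pattern in observations)
    n = len(observations)
    return {
        (group, pattern): math.log2(
            (count / n) / ((group_counts[group] / n) * (pattern_counts[pattern] / n))
        )
        for (group, pattern), count in pair_counts.items()
    }

obs = [("A", "ADJ_strong"), ("A", "ADJ_strong"), ("B", "ADJ_gentle"), ("B", "ADJ_gentle")]
for pair, score in pmi_table(obs).items():
    print(pair, score)
```

A positive score marks a pattern that co-occurs with a group more often than independence would predict, which is the kind of disproportion the fragment-ranking stage then surfaces for expert review.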
cs.CL / 100 / 2604.17411
DuConTE: Dual-Granularity Text Encoder with Topology-Constrained Attention for Text-attributed Graphs
DuConTE:具有拓扑约束注意力的双粒度文本编码器用于文本属性图
Abstract
Text-attributed graphs integrate semantic information of node texts with topological structure, offering significant value in various applications such as document classification and information extraction. Existing approaches typically encode textual content using language models (LMs), followed by graph neural networks (GNNs) to process structural information. However, during the LM-based text encoding phase, most methods not only perform semantic interaction solely at the word-token granularity, but also neglect the structural dependencies among texts from different nodes. In this work, we propose DuConTE, a dual-granularity text encoder with topology-constrained attention. The model employs a cascaded architecture of two pretrained LMs, encoding semantics first at the word-token granularity and then at the node granularity. During the self-attention computation in each LM, we dynamically adjust the attention mask matrix based on node connectivity, guiding the model to learn semantic correlations informed by the graph structure. Furthermore, when composing node representations from word-token embeddings, we separately evaluate the importance of tokens under the center-node context and the neighborhood context, enabling the capture of more contextually relevant semantic information. Extensive experiments on multiple benchmark datasets demonstrate that DuConTE achieves state-of-the-art performance on the majority of them.
Chinese Translation
文本属性图将节点文本的语义信息与拓扑结构相结合,在文档分类和信息提取等多个应用中具有重要价值。现有方法通常使用语言模型(LMs)对文本内容进行编码,然后通过图神经网络(GNNs)处理结构信息。然而,在基于LM的文本编码阶段,大多数方法不仅仅在词元粒度上进行语义交互,还忽视了来自不同节点的文本之间的结构依赖关系。在本研究中,我们提出了DuConTE,一种具有拓扑约束注意力的双粒度文本编码器。该模型采用两个预训练LM的级联架构,首先在词元粒度上编码语义,然后在节点粒度上进行编码。在每个LM的自注意力计算过程中,我们根据节点连接性动态调整注意力掩码矩阵,引导模型学习受图结构影响的语义关联。此外,在从词元嵌入中组合节点表示时,我们分别评估在中心节点上下文和邻域上下文下词元的重要性,能够捕捉到更具上下文相关性的语义信息。在多个基准数据集上的广泛实验表明,DuConTE在大多数数据集上实现了最先进的性能。
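One way to picture a topology-constrained attention mask is sketched below. It assumes a token may attend to another token only when both belong to the same node or to nodes joined by an edge. This masking rule is a guess made for illustration; the abstract does not state the exact rule DuConTE uses.

```python
def topology_mask(token_nodes, edges):
    """Build a boolean attention mask from node membership and graph edges.

    token_nodes[i] is the node that token i belongs to.
    edges is a list of undirected (node_a, node_b) pairs.
    """
    adjacent = set(edges) | {(b, a) for a, b in edges}
    n = len(token_nodes)
    return [
        [
            token_nodes[i] == token_nodes[j]
            or (token_nodes[i], token_nodes[j]) in adjacent
            for j in range(n)
        ]
        for i in range(n)
    ]

# Four tokens: two from node 0, one from node 1, one from node 2; only 0-1 are linked.
mask = topology_mask([0, 0, 1, 2], [(0, 1)])
for row in mask:
    print(row)
```

In a transformer, positions marked False would typically receive a large negative bias before the softmax, so unconnected texts do not exchange information.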
cs.CL / 101 / 2604.17429
Jupiter-N Technical Report
Jupiter-N 技术报告
Abstract
We present Jupiter-N, a hybrid reasoning model post-trained from Nemotron 3 Super, a fully open-source 120 billion parameter LLM. We target three objectives: (1) agentic capability via uncertainty-curated trajectories; (2) UK cultural alignment via synthetic data grounded in cultural norms; and (3) Welsh language support via parallel corpora and LLM-translated Welsh conversations. Our data curation strategy carefully preserves the base model's capabilities: using our Forget-Me-Not framework, we mix on-policy synthetic replay with off-policy task data to mitigate catastrophic forgetting, and include a mixture of reasoning and non-reasoning traces to maintain Nemotron's hybrid reasoning ability. Jupiter-N achieves standout gains over Nemotron in Welsh (+18 on ARC-Easy, +5.25 on MMLU-Lite), terminal-use (+9.1 on Terminal Bench 2) and instruction following (+4.4 on IFBench), while retaining the base model capabilities. We frame this work as a reproducible template for sovereign post-training: substituting cultural knowledge, institutional corpora, and target languages produces an equivalent pipeline for any country. All model weights and all post-training datasets are publicly released under open licences.
Chinese Translation
我们介绍了 Jupiter-N,这是一种从 Nemotron 3 Super(一个完全开源的 1200 亿参数的语言模型)后训练的混合推理模型。我们的目标有三个:(1)通过不确定性策划的轨迹实现代理能力;(2)通过基于文化规范的合成数据实现与英国文化的对齐;(3)通过平行语料库和 LLM 翻译的威尔士语对话提供威尔士语支持。我们的数据策划策略仔细保留了基础模型的能力:使用我们的 Forget-Me-Not 框架,我们将在线合成重放与离线任务数据混合,以减轻灾难性遗忘,并包括推理和非推理轨迹的混合,以维持 Nemotron 的混合推理能力。Jupiter-N 在威尔士语(ARC-Easy 上提高了 18 分,MMLU-Lite 上提高了 5.25 分)、终端使用(Terminal Bench 2 上提高了 9.1 分)和指令跟随(IFBench 上提高了 4.4 分)方面相较于 Nemotron 取得了显著提升,同时保留了基础模型的能力。我们将这项工作框架视为一个可重复的主权后训练模板:替换文化知识、机构语料库和目标语言可以为任何国家生成等效的流程。所有模型权重和所有后训练数据集均在开放许可下公开发布。
cs.CL / 102 / 2604.17433
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
仅基于两个样本的自一致性:CoT-PoT 集成用于高效的大型语言模型推理
Abstract
Self-consistency (SC) is a popular technique for improving the reasoning accuracy of large language models by aggregating multiple sampled outputs, but it comes at a high computational cost due to extensive sampling. We introduce a hybrid ensembling approach that leverages the complementary strengths of two distinct modes of reasoning: Chain-of-Thought (CoT) and Program-of-Thought (PoT). We describe a general framework for combining these two forms of reasoning in self-consistency, as well as particular strategies for both full sampling and early-stopping. We show that CoT-PoT ensembling not only improves overall accuracy, but also drastically reduces the number of samples required for SC by a factor of 9.3x. In particular, the majority of tasks (78.6%) can be addressed with only two samples, which has not been possible with any prior SC methods.
Chinese Translation
自一致性(SC)是一种通过聚合多个采样输出来提高大型语言模型推理准确性的流行技术,但由于广泛的采样,这种方法的计算成本很高。我们提出了一种混合集成方法,利用了两种不同推理模式的互补优势:思维链(Chain-of-Thought, CoT)和思维程序(Program-of-Thought, PoT)。我们描述了在自一致性中结合这两种推理形式的一般框架,以及针对全面采样和早停的特定策略。我们展示了CoT-PoT集成不仅提高了整体准确性,还将自一致性所需的样本数量大幅减少了9.3倍。特别是,大多数任务(78.6%)仅需两个样本即可解决,而这是以往任何自一致性方法所无法实现的。
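The early-stopping idea can be sketched as follows. The sketch assumes that agreement between one CoT answer and one PoT answer is the stopping signal and that disagreement falls back to alternating samples with a majority vote. The abstract does not spell out the stopping rule, so both choices, along with the function and parameter names, are assumptions.

```python
from collections import Counter

def cot_pot_answer(sample_cot, sample_pot, max_samples=10):
    """Return (answer, samples_used).

    sample_cot and sample_pot are zero-argument callables that each return
    one final answer from the corresponding reasoning mode.
    """
    answers = [sample_cot(), sample_pot()]
    if answers[0] == answers[1]:
        return answers[0], 2  # two modes agree: stop early
    while len(answers) < max_samples:
        # Alternate between the two modes, then fall back to majority voting.
        sampler = sample_cot if len(answers) % 2 == 0 else sample_pot
        answers.append(sampler())
    return Counter(answers).most_common(1)[0][0], len(answers)

print(cot_pot_answer(lambda: 42, lambda: 42))
```

Because the two modes tend to fail in different ways, agreement between them is stronger evidence than agreement between two samples of the same mode, which is what makes a two-sample stop plausible.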
cs.CL / 103 / 2604.17435
MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation
MoVE:通过声学专家混合模型在语音到语音翻译中传递笑声与泪水
Abstract
Recent Speech-to-Speech Translation (S2ST) systems achieve strong semantic accuracy yet consistently strip away non-verbal vocalizations (NVs), such as laughter and crying that convey pragmatic intent, which severely limits real-world utility. We address this via three contributions. First, we propose a synthesis pipeline for building scalable expressive datasets to overcome the data scarcity limitation. Second, we propose MoVE, a Mixture-of-LoRA-Experts architecture with expressive-specialized adapters and a soft-weighting router that blends experts for capturing hybrid expressive states. Third, we show pretrained AudioLLMs enable striking data efficiency: 30 minutes of curated data is enough for strong performance. On English-Chinese S2ST, while comparing with strong baselines, MoVE reproduces target NVs in 76% of cases and achieves the highest human-rated naturalness and emotional fidelity among all compared systems, where existing S2ST systems preserve at most 14% of NVs.
Chinese Translation
近期的语音到语音翻译(S2ST)系统在语义准确性上取得了显著进展,但始终忽视了非语言声响(NVs),如笑声和哭泣,这些声响传达了语用意图,严重限制了其在现实世界中的应用。我们通过三项贡献来解决这一问题。首先,我们提出了一种合成管道,用于构建可扩展的表现性数据集,以克服数据稀缺的限制。其次,我们提出了MoVE,一种混合LoRA专家架构,配备表现性专用适配器和软加权路由器,以融合专家,捕捉混合表现状态。第三,我们展示了预训练的音频语言模型(AudioLLMs)能够显著提高数据效率:30分钟的精心策划数据足以实现强劲的性能。在英汉S2ST中,与强基线进行比较时,MoVE在76%的案例中成功再现目标非语言声响,并在所有比较系统中获得了最高的人类评分自然性和情感真实感,而现有的S2ST系统最多只能保留14%的非语言声响。
cs.CL / 104 / 2604.17487
Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems
仅在合理的情况下提供精确答案:面向代理系统的标定声明级特异性控制
Abstract
Agentic systems often fail not by being entirely wrong, but by being too precise: a response may be generally useful while particular claims exceed what the evidence supports. We study this failure mode as overcommitment control and introduce compositional selective specificity (CSS), a post-generation layer that decomposes an answer into claims, proposes coarser backoffs, and emits each claim at the most specific calibrated level that appears admissible. The method is designed to express uncertainty as a local semantic backoff rather than as a whole-answer refusal. Across a full LongFact run and HotpotQA pilots, calibrated CSS improves the risk-utility trade-off of fixed drafts. On the full LongFact run, it raises overcommitment-aware utility from 0.846 to 0.913 relative to the no-CSS output while achieving 0.938 specificity retention. These results suggest that claim-level specificity control is a useful uncertainty interface for agentic systems and a target for future distribution-free validity layers.
Chinese Translation
代理系统的失败往往不是因为完全错误,而是因为过于精确:一个回应可能在整体上是有用的,但具体声明超出了证据所支持的范围。我们将这种失败模式研究为过度承诺控制,并引入了组合选择性特异性(Compositional Selective Specificity, CSS),这是一种后生成层,它将答案分解为多个声明,提出更粗的回退,并以最具体的标定水平发出每个声明,该水平看似可接受。该方法旨在将不确定性表示为局部语义回退,而不是整体答案拒绝。在完整的 LongFact 运行和 HotpotQA 试点中,标定的 CSS 改善了固定草稿的风险-效用权衡。在完整的 LongFact 运行中,它将考虑过度承诺的效用从 0.846 提高到 0.913,相较于不使用 CSS 的输出,同时实现了 0.938 的特异性保留。这些结果表明,声明级特异性控制是代理系统有用的不确定性接口,并且是未来无分布有效性层的目标。
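The backoff mechanism can be illustrated with a small sketch. It assumes each claim arrives with a ladder of progressively coarser rewordings, each carrying a calibrated confidence, and that a claim is admissible when its confidence reaches a threshold. The threshold value, the data layout and the example claims are hypothetical.

```python
def emit_claim(backoffs, threshold=0.9):
    """backoffs: list of (claim_text, calibrated_confidence), most specific first.

    Returns the most specific admissible wording, or None if no level qualifies.
    """
    for text, confidence in backoffs:
        if confidence >= threshold:
            return text
    return None

def emit_answer(claims, threshold=0.9):
    """Apply the backoff to every claim and drop claims with no admissible level."""
    emitted = (emit_claim(backoffs, threshold) for backoffs in claims)
    return [text for text in emitted if text is not None]

ladder = [
    ("She was born on 3 May 1912.", 0.60),
    ("She was born in 1912.", 0.93),
    ("She was born in the 1910s.", 0.99),
]
print(emit_answer([ladder]))
```

The answer keeps the claim but at the coarser year level, which expresses uncertainty locally instead of refusing the whole response.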
cs.CL / 105 / 2604.17501
CoAct: Co-Active LLM Preference Learning with Human-AI Synergy
CoAct:基于人机协同的共主动大语言模型偏好学习
Abstract
Learning from preference-based feedback has become an effective approach for aligning LLMs across diverse tasks. However, high-quality human-annotated preference data remains expensive and scarce. Existing methods address this challenge through either self-rewarding, which scales by using purely AI-generated labels but risks unreliability, or active learning, which ensures quality through oracle annotation but cannot fully leverage unlabeled data. In this paper, we present CoAct, a novel framework that synergistically combines self-rewarding and active learning through strategic human-AI collaboration. CoAct leverages self-consistency to identify both reliable self-labeled data and samples that require oracle verification. Additionally, oracle feedback guides the model to generate new instructions within its solvable capability. Evaluated on three reasoning benchmarks across two model families, CoAct achieves average improvements of +13.25% on GSM8K, +8.19% on MATH, and +13.16% on WebInstruct, consistently outperforming all baselines.
Chinese Translation
基于偏好的反馈学习已成为对齐大语言模型(LLMs)在多样化任务中有效的方法。然而,高质量的人类标注偏好数据仍然昂贵且稀缺。现有方法通过自我奖励解决这一挑战,利用纯AI生成的标签进行扩展,但存在不可靠的风险;或者通过主动学习确保质量,依赖于oracle标注,但无法充分利用未标记的数据。在本文中,我们提出了CoAct,一个新颖的框架,通过战略性的人机协作,将自我奖励和主动学习协同结合。CoAct利用自我一致性来识别可靠的自我标记数据和需要oracle验证的样本。此外,oracle反馈指导模型在其可解决能力范围内生成新的指令。在两个模型系列的三个推理基准上评估,CoAct在GSM8K上平均提高了13.25%,在MATH上提高了8.19%,在WebInstruct上提高了13.16%,始终优于所有基线。
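The routing decision can be sketched with a self-consistency score. The example assumes several sampled answers per instruction and a single agreement threshold that separates samples trusted for self-labeling from samples sent to the oracle. The threshold and the two-way split are assumptions made for illustration, not details from the paper.

```python
from collections import Counter

def route_sample(answers, threshold=0.75):
    """Decide how a sample is labeled based on agreement among sampled answers.

    Returns ("self_label", majority_answer) when agreement is high enough,
    otherwise ("oracle", None) to request human or oracle verification.
    """
    top_answer, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    if agreement >= threshold:
        return "self_label", top_answer
    return "oracle", None

print(route_sample(["7", "7", "7", "7", "8"]))
print(route_sample(["1", "2", "3", "4"]))
```

Spending oracle budget only on low-agreement samples is the usual motivation for this kind of split: cheap labels where the model is stable, verified labels where it is not.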
cs.CL / 106 / 2604.17512
ONTO: A Token-Efficient Columnar Notation for LLM Input Optimization
ONTO:一种用于大语言模型输入优化的高效列式符号表示
Abstract
Serialization formats designed for document interchange impose structural overhead that becomes prohibitive when large language models consume operational data at scale. A modest dataset of 1,000 IoT sensor readings serialized as JSON requires approximately 80,000 tokens - the majority spent on repeated field names, nested braces, and structural punctuation rather than semantic content. We present ONTO (Object Notation for Token Optimization), a columnar notation that declares field names once per entity and arranges values in pipe-delimited rows with indentation-based hierarchy. This schema-once, data-many design eliminates per-record key repetition while preserving human readability and nested structure support. Evaluation across three synthetic operational datasets demonstrates 46-51% token reduction versus JSON, with stable scaling from 100 to 1,000 records. Controlled inference benchmarks on Qwen2.5-7B show corresponding 5-10% latency improvement. Comprehension validation confirms no material degradation in LLM task accuracy across lookup, counting, extraction, and aggregation operations when format context is provided. Ablation analysis reveals that key repetition accounts for the majority of JSON overhead, with indentation costs in nested structures explaining the 4-percentage-point gap between flat and hierarchical data. ONTO occupies a previously unfilled position in the serialization landscape: columnar efficiency with hierarchical structure, optimized for LLM context windows rather than document interchange. Code and specification are available at https://github.com/harsh-aranga/onto.
Chinese Translation
为文档交换设计的序列化格式会带来结构性开销,这在大语言模型大规模处理操作数据时变得不可承受。一个包含1,000个物联网传感器读数的适度数据集以JSON格式序列化时大约需要80,000个标记——大部分用于重复的字段名称、嵌套的括号和结构性标点,而非语义内容。我们提出了ONTO(对象符号表示优化),这是一种列式符号表示法,针对每个实体仅声明一次字段名称,并以管道分隔的行排列值,采用基于缩进的层次结构。这种一次定义模式、多次使用数据的设计消除了每条记录的键重复,同时保留了人类可读性和嵌套结构支持。对三个合成操作数据集的评估显示,与JSON相比,标记减少了46-51%,并且在从100到1,000条记录的扩展中保持稳定。在Qwen2.5-7B上的控制推理基准测试显示相应的延迟改善为5-10%。理解验证确认在提供格式上下文时,查找、计数、提取和聚合操作的LLM任务准确性没有实质性下降。消融分析表明,键重复是JSON开销的主要来源,而嵌套结构中的缩进成本解释了平面数据与层次数据之间的4个百分点差距。ONTO在序列化领域占据了一个之前未被填补的位置:具有层次结构的列式效率,优化用于LLM上下文窗口而非文档交换。代码和规范可在https://github.com/harsh-aranga/onto获取。
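The schema-once, data-many idea can be sketched with a flat serializer that writes field names a single time and then one pipe-delimited row per record. This is only an illustration of the principle; the header syntax and indentation below are made up for the example and are not the actual ONTO notation, which is defined in the linked specification.

```python
import json

def to_columnar(entity, records):
    """Serialize flat records with field names declared once, then one row per record."""
    fields = list(records[0].keys())
    lines = [f"{entity}: {' | '.join(fields)}"]
    for record in records:
        lines.append("  " + " | ".join(str(record[field]) for field in fields))
    return "\n".join(lines)

readings = [
    {"id": 1, "temp": 21.5, "unit": "C"},
    {"id": 2, "temp": 22.0, "unit": "C"},
]
columnar = to_columnar("sensor", readings)
print(columnar)
# Character counts are only a rough proxy for token counts.
print(len(json.dumps(readings)), len(columnar))
```

The saving comes from key repetition: JSON restates every field name in every record, while the columnar form pays for the names once.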
cs.CL / 107 / 2604.17535
OPSDL: On-Policy Self-Distillation for Long-Context Language Models
OPSDL:用于长上下文语言模型的在线自蒸馏
Abstract
Extending the effective context length of large language models (LLMs) remains a central challenge for real-world applications. While recent post-training methods have made progress in long-context scaling, they either rely on high-quality supervision data or sparse sequence-level rewards, leading to unstable and inefficient optimization. We propose OPSDL, an On-Policy Self-Distillation method for enhancing the Long-context capabilities of LLMs. Unlike other recent self-distillation methods that inject privileged information and rely on the model's in-context learning ability to act as a teacher, OPSDL leverages the model's own inherently strong short-context capability as a self-teacher to supervise its own generation in long-context scenarios. The model first generates responses conditioned on the full long-context, then the self-teacher provides per-token supervision signals via point-wise reverse KL divergence under the relevant extracted short-context. This dense token-level signal encourages faithful use of relevant evidence and mitigates hallucinations induced by irrelevant context. We evaluate OPSDL on long-context benchmarks across a range of models from 7B to 32B parameters. Results show consistent and substantial improvements across varying context lengths, outperforming standard post-training approaches such as SFT and DPO with higher sample efficiency. Notably, these gains are achieved without degrading general short-context performance. These findings highlight the effectiveness of OPSDL as a scalable and stable approach for long-context learning.
Chinese Translation
扩展大型语言模型(LLMs)的有效上下文长度仍然是现实应用中的一个核心挑战。尽管最近的后训练方法在长上下文扩展方面取得了一定进展,但它们要么依赖于高质量的监督数据,要么依赖于稀疏的序列级奖励,导致优化过程不稳定且效率低下。我们提出了OPSDL,一种用于增强LLMs长上下文能力的在线自蒸馏方法。与其他最近的自蒸馏方法不同,后者注入特权信息并依赖模型的上下文学习能力作为教师,OPSDL利用模型自身固有的强短上下文能力作为自我教师,在长上下文场景中监督其自身生成。模型首先在完整的长上下文条件下生成响应,然后自我教师通过相关提取的短上下文下的逐点反向KL散度提供逐标记的监督信号。这种密集的标记级信号鼓励对相关证据的忠实使用,并减轻了由无关上下文引起的幻觉。我们在一系列参数从7B到32B的模型上评估了OPSDL在长上下文基准测试中的表现。结果显示,在不同上下文长度下,OPSDL带来了一致且显著的性能提升,超越了标准的后训练方法,如SFT和DPO,并且具有更高的样本效率。值得注意的是,这些提升是在不降低一般短上下文性能的情况下实现的。这些发现突显了OPSDL作为一种可扩展且稳定的长上下文学习方法的有效性。
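The per-token supervision signal can be sketched as a reverse KL divergence between two next-token distributions. The example assumes the student distribution comes from the model conditioned on the full long context and the teacher distribution from the same model conditioned on the extracted short context. The toy distributions and the epsilon smoothing are assumptions of this sketch.

```python
import math

def reverse_kl(student, teacher, eps=1e-12):
    """KL(student || teacher) for one token position, over a shared vocabulary."""
    return sum(
        q * math.log((q + eps) / (p + eps))
        for q, p in zip(student, teacher)
        if q > 0
    )

def token_losses(student_dists, teacher_dists):
    """One reverse-KL value per generated token, giving a dense training signal."""
    return [reverse_kl(q, p) for q, p in zip(student_dists, teacher_dists)]

student = [[0.5, 0.5], [0.9, 0.1]]  # long-context predictions (toy values)
teacher = [[0.5, 0.5], [0.5, 0.5]]  # short-context predictions (toy values)
print(token_losses(student, teacher))
```

Reverse KL is taken with the expectation under the student, so it penalizes the student for placing probability mass where the teacher places little, which suits on-policy training on the model's own generations.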
cs.CL / 108 / 2604.17543
PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs
PoliLegalLM:一份关于政治与法律事务的大型语言模型的技术报告
Abstract
Large language models (LLMs) have achieved remarkable success in general-domain tasks, yet their direct application to the legal domain remains challenging due to hallucinated legal citations, incomplete knowledge coverage, and weak structured reasoning. To address these issues, we propose PoliLegalLM, a domain-specific large language model tailored for political and legal applications. Our approach adopts a unified training framework that integrates continued pretraining, progressive supervised fine-tuning, and preference-based reinforcement learning to jointly enhance legal knowledge grounding, task alignment, and reasoning capability. We construct a large-scale, high-quality legal corpus and design a structured post-training pipeline, enabling the model to effectively learn domain-specific knowledge and adapt to diverse legal tasks. We evaluate PoliLegalLM on three representative benchmarks, including LawBench, LexEval, and a real-world dataset, PoliLegal. Experimental results demonstrate that PoliLegalLM achieves strong and consistent performance, outperforming competitive models of similar scale and remaining highly competitive with significantly larger models, while achieving the best results on real-world legal scenarios. These results highlight the effectiveness of our training paradigm and the practical value of domain-specific LLMs for real-world legal applications.
Chinese Translation
大型语言模型(LLMs)在通用领域任务中取得了显著成功,但由于法律引用的虚构、不完整的知识覆盖和较弱的结构化推理,其在法律领域的直接应用仍然面临挑战。为了解决这些问题,我们提出了PoliLegalLM,这是一种针对政治和法律应用的领域特定大型语言模型。我们的方法采用统一的训练框架,结合持续的预训练、渐进的监督微调和基于偏好的强化学习,以共同增强法律知识的基础、任务对齐和推理能力。我们构建了一个大规模、高质量的法律语料库,并设计了一个结构化的后训练流程,使模型能够有效学习领域特定知识并适应多样的法律任务。我们在三个具有代表性的基准上评估了PoliLegalLM,包括LawBench、LexEval和一个真实世界数据集PoliLegal。实验结果表明,PoliLegalLM在性能上表现强劲且一致,超越了同类规模的竞争模型,并在与显著更大模型的比较中保持高度竞争,同时在真实世界法律场景中取得了最佳结果。这些结果突显了我们训练范式的有效性以及领域特定LLMs在现实法律应用中的实际价值。
cs.CL / 109 / 2604.17569
MAPLE: A Meta-learning Framework for Cross-Prompt Essay Scoring
MAPLE:一种用于跨提示作文评分的元学习框架
Abstract
Automated Essay Scoring (AES) faces significant challenges in cross-prompt settings, where models must generalize to unseen writing prompts. To address this limitation, we propose MAPLE, a meta-learning framework that leverages prototypical networks to learn transferable representations across different writing prompts. Across three diverse datasets (ELLIPSE and ASAP (English), and LAILA (Arabic)), MAPLE achieves state-of-the-art performance on ELLIPSE and LAILA, outperforming strong baselines by 8.5 and 3 points in QWK, respectively. On ASAP, where prompts exhibit heterogeneous score ranges, MAPLE yields improvements on several traits, highlighting the strengths of our approach in unified scoring settings. Overall, our results demonstrate the potential of meta-learning for building robust cross-prompt AES systems.
Chinese Translation
自动化作文评分(AES)在跨提示设置中面临重大挑战,在这种情况下,模型必须对未见过的写作提示进行泛化。为了解决这一限制,我们提出了MAPLE,一种利用原型网络学习不同写作提示之间可转移表示的元学习框架。在三个不同的数据集(ELLIPSE和ASAP(英语),以及LAILA(阿拉伯语))上,MAPLE在ELLIPSE和LAILA上实现了最先进的性能,分别比强基线提高了8.5和3个点的QWK。在ASAP上,由于提示表现出异质的评分范围,MAPLE在多个特征上取得了改进,突显了我们的方法在统一评分设置中的优势。总体而言,我们的结果展示了元学习在构建稳健的跨提示AES系统中的潜力。
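The prototypical-network component can be illustrated with a nearest-prototype classifier. The sketch assumes essays have already been embedded as vectors and that each score level is represented by the mean of its support embeddings. The two-dimensional embeddings and the squared Euclidean distance are choices made for this example, not details from the MAPLE paper.

```python
def prototypes(support):
    """support: list of (embedding, score_label). Returns the mean embedding per label."""
    sums, counts = {}, {}
    for embedding, label in support:
        acc = sums.setdefault(label, [0.0] * len(embedding))
        sums[label] = [a + e for a, e in zip(acc, embedding)]
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def predict(query, protos):
    """Assign the label whose prototype is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(protos, key=lambda label: sq_dist(query, protos[label]))

support_set = [([0.0, 0.0], 1), ([0.0, 2.0], 1), ([4.0, 4.0], 5)]
protos = prototypes(support_set)
print(predict([1.0, 1.0], protos), predict([3.0, 5.0], protos))
```

Because prototypes are rebuilt from a handful of support essays, the same encoder can be reused on an unseen prompt without retraining a prompt-specific scoring head.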
cs.CL / 110 / 2604.17574
Beyond Fine-Tuning: In-Context Learning and Chain-of-Thought for Reasoned Distractor Generation
超越微调:上下文学习与推理链条在干扰项生成中的应用
Abstract
Distractor generation (DG) remains a labor-intensive task that still depends significantly on domain experts. The task focuses on generating plausible yet incorrect options, known as distractors, for multiple-choice questions. A reliable distractor must be contextually relevant to the question and able to mislead examinees through implicit reasoning when they identify the correct answer. While a recent method integrates fine-tuning of pre-trained encoder-decoder models with contrastive learning to generate semantically relevant distractors for a given question-answer pair, it often fails to capture the underlying reasoning process that experts use when selecting distractors in benchmarks. In this paper, we explore large language model (LLM) reasoning for DG through in-context learning, with unsupervised semantic retrieval for selecting few-shot examples. We design a rationale-augmented DG framework that jointly generates distractors and their rationales for a given question-answer pair. Extensive experiments on six benchmarks, with varying average distractor lengths and domains, demonstrate that prompting LLMs with few-shot examples substantially improves performance compared to recent DG models. Our framework outperforms recent approaches and achieves state-of-the-art results in generating reasoned distractors that align with human-labeled benchmarks.
Chinese Translation
干扰项生成(DG)仍然是一项劳动密集型任务,且在很大程度上依赖于领域专家。该任务的重点是为多项选择题生成看似合理但实际上错误的选项,称为干扰项。一个可靠的干扰项必须与问题在上下文上相关,并能够通过隐含推理误导考生在识别正确答案时的判断。尽管最近有一种方法将微调的预训练编码器-解码器模型与对比学习相结合,以生成与给定问题-答案语义相关的干扰项,但它往往无法捕捉到专家在基准测试中选择干扰项时所使用的潜在推理过程。在本文中,我们通过上下文学习和无监督语义检索探索大型语言模型(LLMs)在干扰项生成中的推理,选择少量示例。我们设计了一个增强推理的干扰项生成框架,该框架为给定的问题-答案共同生成干扰项及其推理依据。在六个基准测试上的广泛实验,涵盖不同的平均干扰项长度和领域,表明与最近的干扰项生成模型相比,使用少量示例提示LLMs显著提高了性能。该方法优于最近的研究,并在生成与人类标注基准一致的推理干扰项方面达到了最先进的结果。
cs.CL / 111 / 2604.17609
Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity
代理探索但代理忽视:大型语言模型缺乏环境好奇心
Abstract
LLM-based agents are assumed to integrate environmental observations into their reasoning: discovering highly relevant but unexpected information should naturally lead to a model exploiting its own discoveries. We show that this assumption is false for current LLM-based agents, which struggle to reflect or react to unexpected information. Across three benchmarks (Terminal-Bench, SWE-Bench, AppWorld), we inject complete task solutions into the agent environments to deliberately expose a task's solution to a model. While agents discover these solutions on Terminal-Bench in 79-81% of runs, they interact with, or exploit, them in only 37-50% of cases. This gap is starkest in AppWorld: agents see documentation stating that a command "returns the complete solution to this task" in over 90% of attempts but exploit this in fewer than 7% of trials. We show that agents lack what we call environmental curiosity: the capability to recognize and investigate unexpected but relevant observations in response to environmental stimuli. We identify three main factors influencing environmental curiosity: available tools in the agent scaffold, test-time compute, and training data distribution. Our findings show that configurations that maximize curiosity also achieve the best performance on the unmodified benchmarks. Yet even jointly optimized agents still ignore discovered solutions in the majority of trials: current agents use the environment to fetch expected information, but not to revise their strategy or maximally exploit useful stimuli.
Chinese Translation
基于大型语言模型(LLM)的代理被认为能够将环境观察整合到其推理中:发现高度相关但意外的信息应该自然导致模型利用其自身的发现。我们展示了这一假设对于当前的基于LLM的代理是错误的,这些代理在反映或对意外信息作出反应方面存在困难。在三个基准测试(Terminal-Bench、SWE-Bench、AppWorld)中,我们将完整的任务解决方案注入代理环境,以故意向模型暴露任务的解决方案。尽管代理在Terminal-Bench中在79-81%的运行中发现了这些解决方案,但它们仅在37-50%的案例中与这些解决方案进行交互或利用。在AppWorld中,这一差距最为明显:代理在超过90%的尝试中看到文档声明某个命令“返回该任务的完整解决方案”,但在不到7%的试验中利用这一点。我们表明,代理缺乏我们所称的环境好奇心:即在响应环境刺激时识别和调查意外但相关观察的能力。我们确定了影响环境好奇心的三个主要因素:代理框架中可用的工具、测试时的计算能力和训练数据分布。我们的研究发现,最大化好奇心的配置也在未修改的基准测试中实现了最佳性能。然而,即使是经过联合优化的代理在大多数试验中仍然忽视发现的解决方案:当前的代理使用环境来获取预期信息,而不是修正其策略或最大限度地利用有用的刺激。
cs.CL / 112 / 2604.17628
Does Welsh media need a review? Detecting bias in Nation.Cymru's political reporting
威尔士媒体需要审查吗?检测 Nation.Cymru 政治报道中的偏见
Abstract
Wales' political landscape has been marked by growing accusations of bias in Welsh media. This paper takes the first computational step toward testing those claims by examining Nation.Cymru, a prominent Welsh political news outlet. I use a two-stage natural language processing (NLP) pipeline: (1) a robustly optimized BERT approach (RoBERTa) bias detector for efficient bias discovery and (2) a large language model (LLM) for target-attributed sentiment classification of bias labels from (1). A primary analysis of 15,583 party mentions across 2022-2026 news articles finds that Reform UK attracts biased framing at twice the rate of Plaid Cymru, with mean sentiment over three times as negative (p<0.001). A secondary analysis of four parties across both news and opinion articles shows that Plaid Cymru is the outlier, receiving markedly more favourable framing than any other party. These findings provide evidence of measurable differential framing in a single Welsh political media outlet, supporting calls for a broader review of Welsh media coverage. Furthermore, the two-stage pipeline offers a low-cost, replicable framework for extending this analysis to other Welsh outlets, as well as media ecosystems outside of Wales.
Chinese Translation
威尔士的政治格局因对威尔士媒体的偏见指控日益增多而受到影响。本文首次通过计算方法测试这些指控,研究了 Nation.Cymru 这一知名的威尔士政治新闻媒体。我使用了一个两阶段的自然语言处理(NLP)流程:第一阶段是一个经过强优化的 BERT 方法(RoBERTa)偏见检测器,用于高效发现偏见;第二阶段是一个大型语言模型(LLM),用于对第一阶段的偏见标签进行目标归属情感分类。对 2022-2026 年新闻文章中 15,583 次政党提及的初步分析发现,改革英国(Reform UK)受到的偏见框架是 Plaid Cymru 的两倍,且其平均情感更为负面(p<0.001)。对四个政党在新闻和评论文章中的二次分析显示,Plaid Cymru 是一个异常值,其获得的框架明显比其他任何政党更为有利。这些发现提供了在单一威尔士政治媒体中可测量的差异性框架的证据,支持对威尔士媒体报道进行更广泛审查的呼声。此外,这一两阶段流程为将此分析扩展到其他威尔士媒体以及威尔士以外的媒体生态系统提供了低成本、可复制的框架。
cs.CL / 113 / 2604.17633
Copy First, Translate Later: Interpreting Translation Dynamics in Multilingual Pretraining
先复制,后翻译:多语言预训练中翻译动态的解读
Abstract
Large language models exhibit impressive cross-lingual capabilities. However, prior work analyzes this phenomenon through isolated factors and at sparse points during training, limiting our understanding of how cross-lingual generalization emerges--particularly in the early phases of learning. To study the early trajectory of linguistic and translation capabilities, we pretrain a multilingual 1.7B model on nine diverse languages, capturing checkpoints at a much finer granularity. We further introduce a novel word-level translation dataset and trace how translation develops over training through behavioral analyses, model-component analysis, and parameter-based ablations. We find that the model quickly acquires basic linguistic capabilities in parallel with token-level copying, while translation develops in two distinct phases: an initial phase dominated by copying and surface-level similarities, and a second phase in which more generalizing translation mechanisms are developed while copying is refined. Together, these findings provide a fine-grained view of how cross-lingual generalization develops during multilingual pretraining.
Chinese Translation
大型语言模型展现出令人印象深刻的跨语言能力。然而,以往的研究通过孤立因素和训练过程中的稀疏时间点分析这一现象,限制了我们对跨语言泛化如何出现的理解,尤其是在学习的早期阶段。为了研究语言和翻译能力的早期轨迹,我们在九种不同语言上对一个1.7B的多语言模型进行了预训练,并以更细的粒度捕捉检查点。我们进一步引入了一个新颖的词级翻译数据集,并通过行为分析、模型组件分析和基于参数的消融研究追踪翻译在训练过程中的发展。我们发现,该模型在与标记级复制并行的过程中迅速获得了基本的语言能力,而翻译则经历了两个不同的阶段:初始阶段以复制和表面相似性为主导,第二阶段则在复制得到精炼的同时发展出更具泛化性的翻译机制。这些发现共同提供了一个细致的视角,揭示了跨语言泛化在多语言预训练过程中的发展。
cs.CL / 114 / 2604.17648
ThreadSumm: Summarization of Nested Discourse Threads Using Tree of Thoughts
ThreadSumm:使用思维树对嵌套话语线程进行摘要
Abstract
Summarizing deeply nested discussion threads requires handling interleaved replies, quotes, and overlapping topics, which standard LLM summarizers struggle to capture reliably. We introduce ThreadSumm, a multi-stage LLM framework that treats thread summarization as a hierarchical reasoning problem over explicit aspect and content unit representations. Our method first performs content planning via LLM-based extraction of discourse aspects and Atomic Content Units, then applies sentence ordering to construct thread-aware sequences that surface multiple viewpoints rather than a single linear strand. On top of these interpretable units, ThreadSumm employs a Tree of Thoughts search that generates and scores multiple paragraph candidates, jointly optimizing coherence and coverage within a unified search space. With this multi-proposal and iterative refinement design, we show improved performance in generating logically structured summaries compared to existing baselines, while achieving higher aspect retention and opinion coverage in nested discussions.
Chinese Translation
对深度嵌套的讨论线程进行摘要需要处理交错的回复、引用和重叠的话题,而标准的大型语言模型(LLM)摘要工具在可靠捕捉这些内容方面存在困难。我们提出了ThreadSumm,一个多阶段的LLM框架,将线程摘要视为一个关于显性方面和内容单元表示的层次推理问题。我们的方法首先通过基于LLM的提取来进行内容规划,识别话语方面和原子内容单元,然后应用句子排序来构建线程感知序列,展现多个观点而非单一线性思路。在这些可解释单元的基础上,ThreadSumm采用思维树(Tree of Thoughts)搜索,生成并评分多个段落候选,联合优化一致性和覆盖率于一个统一的搜索空间。通过这种多提案和迭代优化的设计,我们展示了在生成逻辑结构摘要方面的性能优于现有基准,同时在嵌套讨论中实现了更高的方面保留率和观点覆盖率。
cs.CL / 115 / 2604.17650
Measuring Distribution Shift in User Prompts and Its Effects on LLM Performance
用户提示中的分布变化及其对大型语言模型性能的影响测量
Abstract
LLMs are increasingly deployed in dynamic, real-world settings, where the distribution of user prompts can shift substantially over time as new tasks, prompts, and users are introduced to a deployed model. Such natural prompt distribution shift poses a major challenge to LLM reliability, particularly for specialized models designed for narrow domains or user populations. Despite attention to out-of-distribution robustness, there is very limited exploration of measuring natural prompt distribution shift in prior work, and its impact on deployed LLMs remains poorly understood. We introduce the LLM Evaluation under Natural prompt Shift (LENS) framework: a data-centric approach for quantifying natural prompt distribution shift and evaluating its effect on the performance of deployed LLMs. We perform a large-scale evaluation using 192 real-world post-deployment prompt shift settings over time, user group, and geographic axes, training a total of 81 models on 4.68M training prompts, and evaluating on 57.6k prompts. We find that even moderate shifts in user prompt behavior correspond with large performance drops (73% average loss) in deployed LLMs. This performance degradation is particularly prevalent when users from different latent groups and geographic regions interact with models and is correlated with natural prompt distribution shift over time. We systematically characterize how LLM instruction following ability degrades over time and between user groups. Our findings highlight the critical need for data-driven monitoring to ensure LLM performance remains stable across diverse and evolving user populations.
Chinese Translation
大型语言模型(LLMs)越来越多地应用于动态的现实环境中,在这些环境中,用户提示的分布可能会随着新任务、提示和用户的引入而显著变化。这种自然的提示分布变化对LLM的可靠性构成了重大挑战,特别是对于那些针对特定领域或用户群体设计的专业模型。尽管对分布外鲁棒性的关注日益增加,但在以往的研究中对自然提示分布变化的测量探索非常有限,其对已部署LLM的影响仍然不够明确。我们提出了自然提示变化下的LLM评估框架(LLM Evaluation under Natural prompt Shift, LENS):一种以数据为中心的方法,用于量化自然提示分布变化并评估其对已部署LLM性能的影响。我们在192个真实世界的后部署提示变化设置中进行了大规模评估,涵盖时间、用户群体和地理维度,共训练了81个模型,使用了468万条训练提示,并在57,600条提示上进行了评估。我们发现,即使是用户提示行为的适度变化也会导致已部署LLM性能的大幅下降(平均损失为73%)。当来自不同潜在群体和地理区域的用户与模型互动时,这种性能下降尤为明显,并且与自然提示分布随时间的变化相关。我们系统性地描述了LLM的指令遵循能力如何随着时间和用户群体的不同而退化。我们的研究结果强调了基于数据的监测对于确保LLM在多样化和不断变化的用户群体中保持稳定性能的关键需求。
cs.CL / 116 / 2604.17659
Semantic Density Effect (SDE): Maximizing Information Per Token Improves LLM Accuracy
语义密度效应(SDE):每个令牌最大化信息量提高大型语言模型准确性
Abstract
We introduce the Semantic Density Effect (SDE): the empirical finding that prompts carrying higher semantic information per token consistently produce more accurate, focused, and less hallucinated outputs across all major LLM families. SDE is defined as the ratio of semantically loaded tokens to total prompt tokens, adjusted for redundancy and concreteness. Unlike prior prompt optimization techniques that add tokens (Chain of Thought), duplicate the prompt (Prompt Repetition), or reorder components (Instruction Placement Effect), SDE improves performance by removing or replacing low-information tokens while preserving or sharpening the semantic signal. Evaluated across five frontier models and seven benchmarks, ultra-dense prompts (SDE > 0.80) outperform diluted counterparts by an average of +8.4 percentage points with 0 additional tokens and 0 latency overhead. Combined with Instruction Placement Effect (IPE), the gain reaches +11.7 percentage points.
Chinese Translation
我们提出了语义密度效应(SDE):这一经验发现表明,携带更高语义信息量的提示每个令牌能够在所有主要大型语言模型(LLM)系列中持续产生更准确、更集中且更少幻觉的输出。SDE被定义为语义负载令牌与总提示令牌的比率,经过冗余和具体性的调整。与之前的提示优化技术(如思维链(Chain of Thought)、提示重复(Prompt Repetition)或组件重排序(Instruction Placement Effect))不同,SDE通过去除或替换低信息量的令牌,同时保留或增强语义信号来提高性能。在五个前沿模型和七个基准测试中进行评估时,超密集提示(SDE > 0.80)相比稀释的对应物平均提高了8.4个百分点,且没有增加额外的令牌和延迟负担。结合指令放置效应(Instruction Placement Effect, IPE),增益达到11.7个百分点。
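To make the ratio in the definition above concrete, here is a rough sketch of an unadjusted density score. It is an illustration only: the abstract does not specify how tokens are judged "semantically loaded" or how the redundancy and concreteness adjustments work, so the `FILLER` word list, the regex word splitting (in place of a model tokenizer), and the function name are all assumptions, and no adjustment is applied.

```python
import re

# Hypothetical list of low-information words; the paper's actual
# criterion for "semantically loaded" tokens is not given in the abstract.
FILLER = {
    "a", "an", "the", "please", "kindly", "very", "really", "just",
    "basically", "could", "would", "you", "i", "me", "to", "of",
    "that", "is", "it", "and", "for", "in",
}


def naive_density(prompt: str) -> float:
    """Share of word tokens that are not in FILLER (no adjustments)."""
    tokens = re.findall(r"[a-z0-9']+", prompt.lower())
    if not tokens:
        return 0.0
    loaded = [tok for tok in tokens if tok not in FILLER]
    return len(loaded) / len(tokens)


dense = "Summarize quarterly revenue trends"
diluted = "Could you please just summarize the quarterly revenue trends for me"
print(naive_density(dense), naive_density(diluted))
```

Under this toy scoring, the diluted prompt keeps the same four content words but spreads them over eleven tokens, which is the kind of contrast the abstract describes between ultra-dense and diluted prompts.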
cs.CL / 117 / 2604.17667
Peerispect: Claim Verification in Scientific Peer Reviews
Peerispect:科学同行评审中的声明验证
Abstract
Peer review is central to scientific publishing, yet reviewers frequently include claims that are subjective, rhetorical, or misaligned with the submitted work. Assessing whether review statements are factual and verifiable is crucial for fairness and accountability. At the scale of modern conferences and journals, manually inspecting the grounding of such claims is infeasible. We present Peerispect, an interactive system that operationalizes claim-level verification in peer reviews by extracting check-worthy claims from peer reviews, retrieving relevant evidence from the manuscript, and verifying the claims through natural language inference. Results are presented through a visual interface that highlights evidence directly in the paper, enabling rapid inspection and interpretation. Peerispect is designed as a modular Information Retrieval (IR) pipeline, supporting alternative retrievers, rerankers, and verifiers, and is intended for use by reviewers, authors, and program committees. We demonstrate Peerispect through a live, publicly available demo (https://app.reviewer.ly/app/peerispect) and API services (https://github.com/Reviewerly-Inc/Peerispect), accompanied by a video tutorial (https://www.youtube.com/watch?v=pc9RkvkUh14).
Chinese Translation
同行评审是科学出版的核心,但评审者常常包含主观、修辞或与提交工作不一致的声明。评估评审陈述是否真实且可验证对于公平性和问责制至关重要。在现代会议和期刊的规模下,手动检查这些声明的依据是不可行的。我们提出了Peerispect,这是一种交互式系统,通过从同行评审中提取值得检查的声明、从手稿中检索相关证据,并通过自然语言推理验证声明,实现了同行评审中的声明级验证。结果通过一个可视化界面呈现,直接在论文中突出显示证据,便于快速检查和解读。Peerispect被设计为一个模块化的信息检索(Information Retrieval, IR)管道,支持替代的检索器、重排序器和验证器,旨在供评审者、作者和程序委员会使用。我们通过一个实时的、公开可用的演示(https://app.reviewer.ly/app/peerispect)和API服务(https://github.com/Reviewerly-Inc/Peerispect)展示了Peerispect,并附有视频教程(https://www.youtube.com/watch?v=pc9RkvkUh14)。
cs.CL / 118 / 2604.17674
Towards Intelligent Legal Document Analysis: CNN-Driven Classification of Case Law Texts
迈向智能法律文档分析:基于卷积神经网络的案例法文本分类
Abstract
Legal practitioners and judicial institutions face an ever-growing volume of case-law documents characterised by formalised language, lengthy sentence structures, and highly specialised terminology, making manual triage both time-consuming and error-prone. This work presents a lightweight yet high-accuracy framework for citation-treatment classification that pairs lemmatisation-based preprocessing with subword-aware FastText embeddings and a multi-kernel one-dimensional Convolutional Neural Network (CNN). Evaluated on a publicly available corpus of 25,000 annotated legal documents with a 75/25 training-test partition, the proposed system achieves 97.26% classification accuracy and a macro F1-score of 96.82%, surpassing established baselines including fine-tuned BERT, Long Short-Term Memory (LSTM) with FastText, CNN with random embeddings, and a Term Frequency-Inverse Document Frequency (TF-IDF) k-Nearest Neighbour (KNN) classifier. The model also attains the highest Area Under the Receiver Operating Characteristic (AUC-ROC) curve of 97.83% among all compared systems while operating with only 5.1 million parameters and an inference latency of 0.31 ms per document - more than 13 times faster than BERT. Ablation experiments confirm the individual contribution of each pipeline component, and the confusion matrix reveals that residual errors are confined to semantically adjacent citation categories. These findings indicate that carefully designed convolutional architectures represent a scalable, resource-efficient alternative to heavyweight transformers for intelligent legal document analysis.
Chinese Translation
法律从业者和司法机构面临着日益增长的案例法文档数量,这些文档的特点是语言形式化、句子结构冗长以及高度专业化的术语,使得手动筛选既耗时又容易出错。本研究提出了一种轻量级但高准确率的引用处理分类框架,该框架将基于词干提取的预处理与子词感知的 FastText 嵌入和多核一维卷积神经网络(CNN)相结合。在一个包含 25,000 个标注法律文档的公开语料库上进行评估,训练和测试的比例为 75/25,所提系统达到了 97.26% 的分类准确率和 96.82% 的宏观 F1 分数,超越了包括微调的 BERT、结合 FastText 的长短期记忆网络(LSTM)、随机嵌入的 CNN 以及基于词频-逆文档频率(TF-IDF)的 k 最近邻(KNN)分类器等已建立的基线模型。该模型在所有比较系统中还获得了最高的接收者操作特征曲线(AUC-ROC)面积,达到 97.83%,同时仅使用 510 万个参数,每个文档的推理延迟为 0.31 毫秒,比 BERT 快超过 13 倍。消融实验确认了每个管道组件的独立贡献,混淆矩阵显示残余错误仅限于语义相近的引用类别。这些发现表明,经过精心设计的卷积架构是智能法律文档分析中一种可扩展、资源高效的替代方案,优于重量级的变换器。
cs.CL / 119 / 2604.17707
Before You Interpret the Profile: Validity Scaling for LLM Metacognitive Self-Report
在解释档案之前:大语言模型元认知自我报告的有效性标定
Abstract
Clinical personality assessment screens response validity before interpreting substantive scales. LLM evaluation does not. We apply the validity scaling framework from the PAI and MMPI-3 to metacognitive probe data from 20 frontier models across 524 items. Six validity indices are operationalised: L (maintaining confidence on errors), K (betting on errors), F (withdrawing consensus-endorsed items), Fp (withdrawing correct answers), RBS (inverted monitoring), and TRIN (fixed responding). A tiered classification system identifies four models as construct-level invalid and two as elevated. Valid-profile models produce item-sensitive confidence (mean r = .18, 14 of 16 significant). Invalid-profile models do not (mean r = -.20, d = 2.17, p = .001). Chain-of-thought training produces two opposite response distortions. Two latent dimensions account for 94.6% of index variance. Companion papers extract a portable screening protocol (Cacioli, 2026e) and validate it against selective prediction (Cacioli, 2026f). All data and code: https://github.com/synthiumjp/validity-scaling-llm
Chinese Translation
临床人格评估在解释实质性量表之前会筛查反应有效性,而大语言模型(LLM)评估则不会。我们将来自20个前沿模型的元认知探测数据应用于PAI和MMPI-3的有效性标定框架,共涉及524个项目。我们操作化了六个有效性指标:L(在错误上保持信心)、K(在错误上下注)、F(撤回共识支持的项目)、Fp(撤回正确答案)、RBS(反向监控)和TRIN(固定反应)。一个分层分类系统识别出四个模型为构念层面无效,两个模型为升高。有效档案模型产生对项目敏感的信心(平均r = 0.18,16个中的14个显著)。无效档案模型则没有(平均r = -0.20,d = 2.17,p = 0.001)。思维链训练产生两种相反的反应扭曲。两个潜在维度解释了94.6%的指标方差。相关论文提取了一种可移植的筛查协议(Cacioli, 2026e)并对其进行了选择性预测的验证(Cacioli, 2026f)。所有数据和代码可访问:https://github.com/synthiumjp/validity-scaling-llm
cs.CL / 120 / 2604.17709
DeInfer: Efficient Parallel Inferencing for Decomposed Large Language Models
DeInfer:用于分解大型语言模型的高效并行推理
Abstract
Existing works on large language model (LLM) decomposition mainly focus on improving performance on downstream tasks, but they ignore the poor parallel inference performance when trying to scale up the model size. To mitigate this important performance issue, this paper introduces DeInfer, a high-performance inference system dedicated to parallel inference of decomposed LLMs. It consists of multiple optimizations to maximize performance and be compatible with state-of-the-art optimization techniques. Extensive experiments are carried out to evaluate DeInfer's performance, where the results demonstrate its superiority, suggesting it can greatly facilitate the parallel inference of decomposed LLMs.
Chinese Translation
现有关于大型语言模型(LLM)分解的研究主要集中在提高下游任务的性能,但在尝试扩大模型规模时却忽视了并行推理性能较差这一重要问题。为了解决这一性能问题,本文提出了DeInfer,一个专门用于分解LLM的高性能推理系统。该系统包含多项优化措施,以最大化性能并兼容最先进的优化技术。我们进行了大量实验以评估DeInfer的性能,结果表明其具有显著优势,表明它能够极大地促进分解LLM的并行推理。
cs.CL / 121 / 2604.17714
Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals
解读之前的筛选:基于基准的 LLM 信心信号的便携式有效性协议
Abstract
LLM confidence signals are used for abstention, routing, and safety-critical decisions. No standard practice exists for checking whether a confidence signal carries item-level information before building on it. We transfer the validity screening principle from clinical personality assessment (PAI, MMPI-3) as a portable protocol for benchmark-based LLM confidence data. The protocol specifies three core indices (L, Fp, RBS), a structural indicator (TRIN), and an item-sensitivity statistic, computed from a single 2x2 contingency table. A three-tier classification system (Invalid, Indeterminate, Valid) draws on four clinical traditions. Validated on 20 frontier LLMs across 524 items, four models are classified Invalid, two Indeterminate. Valid-profile models show mean r = .18 (15/16 significant). Invalid-profile models show mean r = -.20 (d = 2.48). Cross-benchmark validation on 18 models using MMLU with verbalized confidence and on external data from Yang et al. (2024) confirms the screen transfers across benchmarks and probe formats. All data and code: https://github.com/synthiumjp/validity-scaling-llm
Chinese Translation
LLM 信心信号用于弃权、路由和安全关键决策。尚无标准实践来检查信心信号在构建其基础之前是否携带项目级信息。我们将临床人格评估(PAI, MMPI-3)的有效性筛选原则转化为基于基准的 LLM 信心数据的便携式协议。该协议指定了三个核心指标(L, Fp, RBS)、一个结构性指标(TRIN)以及一个从单个 2x2 交叉表计算得出的项目敏感性统计量。一个三层分类系统(无效、不确定、有效)借鉴了四种临床传统。在 524 个项目中对 20 个前沿 LLM 进行验证,四个模型被分类为无效,两个为不确定。有效型模型的平均 r = .18(15/16 显著)。无效型模型的平均 r = -.20(d = 2.48)。在 18 个模型上使用 MMLU 进行的跨基准验证,结合口头信心和来自 Yang 等人(2024)的外部数据,确认了筛选在不同基准和探测格式之间的转移。所有数据和代码可在:https://github.com/synthiumjp/validity-scaling-llm
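As a rough illustration of how screening indices could be read off a single 2x2 table, the sketch below crosses item correctness with whether the model kept or withdrew its answer. The formulas are assumptions inferred from the verbal descriptions in the companion abstract (L as maintaining confidence on errors, Fp as withdrawing correct answers) and use the phi coefficient as a stand-in for the item-sensitivity statistic; the papers' exact definitions and cut-offs are not given here, and RBS and TRIN are not covered.

```python
from math import sqrt


def screen_indices(kept_correct, withdrew_correct, kept_wrong, withdrew_wrong):
    """Illustrative indices from a 2x2 table (correctness x keep/withdraw).

    All three formulas are assumptions made for this sketch.
    """
    n_correct = kept_correct + withdrew_correct
    n_wrong = kept_wrong + withdrew_wrong
    n_kept = kept_correct + kept_wrong
    n_withdrew = withdrew_correct + withdrew_wrong

    # L (assumed): share of errors on which the answer was kept.
    l_index = kept_wrong / n_wrong if n_wrong else 0.0
    # Fp (assumed): share of correct answers that were withdrawn.
    fp_index = withdrew_correct / n_correct if n_correct else 0.0

    # Phi coefficient: Pearson r between "correct" and "kept".
    denom = sqrt(n_correct * n_wrong * n_kept * n_withdrew)
    if denom:
        phi = (kept_correct * withdrew_wrong - withdrew_correct * kept_wrong) / denom
    else:
        phi = 0.0  # degenerate table: one margin is empty

    return {"L": l_index, "Fp": fp_index, "phi": phi}
```

A positive phi means the model tends to keep correct answers and withdraw wrong ones; a negative value would correspond to the inverted pattern that the abstracts report for Invalid-profile models.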
cs.CL / 122 / 2604.17716
Concurrent Criterion Validation of a Validity Screen for LLM Confidence Signals via Selective Prediction
通过选择性预测对大型语言模型信心信号的有效性筛选进行的同时标准验证
Abstract
The validity screen (Cacioli, 2026d, 2026e) classifies LLM confidence signals as Valid, Indeterminate, or Invalid. We test whether these classifications predict selective prediction performance. Twenty frontier LLMs from seven families were evaluated on 524 items across six cognitive tracks. Valid models show mean Type 2 AUROC = .624 (SD = .048). Invalid models show mean AUROC = .357 (SD = .231). Cohen's d = 2.81, p = .002. The tiers order monotonically: Invalid (.357) < Indeterminate (.554) < Valid (.624). Split-half cross-validation yields median d = 1.77, P(d > 0) = 1.0 across 1,000 splits. The three-tier classification accounts for 47% of the variance in AUROC. DeepSeek-R1 drops from 85.3% accuracy at full coverage to 11.3% at 10% coverage. The screen predicts the criterion. For selective prediction, the screen matters.
Chinese Translation
有效性筛选(Cacioli, 2026d, 2026e)将大型语言模型(LLM)信心信号分类为有效、未确定或无效。我们测试这些分类是否能预测选择性预测性能。评估了来自七个家族的二十个前沿LLM在六个认知轨道上的524个项目。有效模型的平均类型2 AUROC为0.624(标准差=0.048)。无效模型的平均AUROC为0.357(标准差=0.231)。Cohen's d = 2.81,p = 0.002。各层级的顺序单调递增:无效(0.357)< 未确定(0.554)< 有效(0.624)。分半交叉验证的中位数d为1.77,P(d > 0) = 1.0,基于1,000次分割。三层分类解释了AUROC中47%的方差。DeepSeek-R1在完全覆盖时的准确率为85.3%,而在10%覆盖时降至11.3%。该筛选能够预测标准。对于选择性预测,该筛选至关重要。
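The two criterion measures in this abstract can be sketched in a few lines. Type 2 AUROC is the probability that a randomly chosen correct item receives higher confidence than a randomly chosen incorrect one, and accuracy at a given coverage is the accuracy on the most confident fraction of items. This is a minimal sketch of the standard definitions, not the authors' code; the function names and the rounding rule for coverage are assumptions.

```python
def type2_auroc(confidences, correct):
    """Probability that a correct item outranks an incorrect one (ties = 0.5)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy_at_coverage(confidences, correct, coverage):
    """Accuracy on the most confident `coverage` fraction of items."""
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(coverage * len(ranked)))  # assumed rounding rule
    return sum(1 for _, ok in ranked[:k] if ok) / k
```

An AUROC below 0.5, as reported for the Invalid tier, means confidence is inversely related to correctness, which is why restricting to the most confident items can lower accuracy rather than raise it.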
cs.CL / 123 / 2604.17718
Do LLMs Use Cultural Knowledge Without Being Told? A Multilingual Evaluation of Implicit Pragmatic Adaptation
大型语言模型是否在未被告知的情况下使用文化知识?对隐含语用适应的多语言评估
Abstract
Many benchmarks show that large language models can answer direct questions about culture. We study a different question: do they also change how they speak when culture is only implied by the situation? We evaluate 60 culturally grounded conversational scenarios across five languages in three conditions: a neutral baseline (Prompt A), an explicit cultural instruction (Prompt B), and implicit situational cueing (Prompt C). We score responses on 12 pragmatic features covering deference to authority, individual-versus-group framing, and uncertainty management. We define Pragmatic Context Sensitivity (PCS) as the fraction of the Prompt A->B shift that reappears under Prompt A->C. Across four deployed LLMs and five languages (English, German, Hindi, Nepali, Urdu), the primary stable-only PCS mean is 0.196 (SD = 0.113), indicating that the models recover only about one-fifth of the pragmatic shift they can produce when instructed explicitly. Transfer is strongest for authority-related cues (0.299) and weakest for individual-versus-group framing (0.120). Uncertainty-related behaviour is mixed: hedging density exhibits negative explicit gaps in all five languages, suggesting that alignment training actively suppresses the target behaviour. Because Hindi and Urdu share core grammar yet index distinct cultural communities, we use them as a natural control; a paired analysis finds no reliable baseline difference (t = 0.96, p = 0.339, dz = 0.06), suggesting that models respond primarily to linguistic structure rather than to the cultural associations a language carries. We argue that multilingual cultural pragmatics is an explicit-versus-implicit deployment problem, not only a factual knowledge problem.
Chinese Translation
许多基准测试表明,大型语言模型能够回答有关文化的直接问题。我们研究一个不同的问题:当文化仅通过情境暗示时,它们是否也会改变自己的表达方式?我们在三种条件下评估了60个文化背景的对话场景,涵盖五种语言:中性基线(提示A)、明确的文化指令(提示B)和隐含的情境提示(提示C)。我们根据12个语用特征对响应进行评分,这些特征涵盖对权威的尊重、个体与群体的框架以及不确定性管理。我们将语用上下文敏感性(Pragmatic Context Sensitivity, PCS)定义为提示A到提示B的转变在提示A到提示C下的重现比例。在四个部署的大型语言模型和五种语言(英语、德语、印地语、尼泊尔语、乌尔都语)中,主要的稳定仅PCS均值为0.196(标准差=0.113),表明模型仅恢复了在明确指令下可以产生的语用转变的约五分之一。与权威相关的提示转移最强(0.299),而个体与群体的框架转移最弱(0.120)。与不确定性相关的行为则表现出混合特征:在所有五种语言中,模糊性密度在显性方面存在负差距,表明对齐训练积极抑制了目标行为。由于印地语和乌尔都语共享核心语法但代表不同的文化社区,我们将它们作为自然对照;配对分析发现没有可靠的基线差异(t = 0.96, p = 0.339, dz = 0.06),这表明模型主要对语言结构而非语言所携带的文化关联作出反应。我们认为,多语言文化语用学是一个显性与隐性部署的问题,而不仅仅是一个事实知识的问题。
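The PCS definition above reduces to a simple ratio per pragmatic feature: the implicit shift (Prompt A to C) divided by the explicit shift (Prompt A to B). The sketch below assumes each feature is summarised by one mean score per condition; the `min_gap` threshold for skipping features with a negligible explicit shift is a hypothetical stand-in for whatever "stable-only" filtering the paper applies.

```python
def pragmatic_context_sensitivity(score_a, score_b, score_c, min_gap=1e-6):
    """Fraction of the explicit A->B shift recovered under implicit cueing (A->C).

    Returns None when the explicit shift is too small for the ratio to be
    meaningful. `min_gap` is an assumed threshold, not taken from the paper.
    """
    explicit_shift = score_b - score_a
    if abs(explicit_shift) < min_gap:
        return None
    implicit_shift = score_c - score_a
    return implicit_shift / explicit_shift


# Example: a deference feature scored 0.2 at baseline, 0.7 with explicit
# instruction, and 0.3 with implicit cueing only.
pcs = pragmatic_context_sensitivity(0.2, 0.7, 0.3)
```

In this example the implicit cue recovers about one-fifth of the explicit shift, which is the same order as the 0.196 mean reported in the abstract.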
cs.CL / 124 / 2604.17725
RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models
RePrompT:用于将结构化电子健康记录编码器与大型语言模型集成的递归提示调优
Abstract
Large Language Models (LLMs) have shown strong promise for mining Electronic Health Records (EHRs) by reasoning over longitudinal clinical information to capture context-rich patient trajectories. However, leveraging LLMs for structured EHRs (e.g., standardized diagnosis and medication codes) presents two key challenges. First, translating time-stamped EHR sequences into plain text can obscure both temporal structure and code identities, weakening the ability to capture code co-occurrence and longitudinal regularities. Second, unlike cohort-trained predictive models that learn a shared, task-aligned representation space across patients, LLMs are often applied in a case-isolated inference setting where each patient is processed independently without leveraging population-level patterns. To address these challenges, we introduce RePrompT, a time-aware LLM framework that integrates structured EHR encoders through prompt tuning, without modifying underlying architectures. Specifically, RePrompT recurrently incorporates latent states from prior visits to preserve longitudinal information, and injects population-level information through trainable prompt tokens derived from a cohort-trained, task-aligned EHR encoder. Experiments on MIMIC-III and MIMIC-IV demonstrate that RePrompT consistently outperforms both EHR-based and LLM-based baselines across multiple clinical prediction tasks.
Chinese Translation
大型语言模型(LLMs)在挖掘电子健康记录(EHRs)方面展现出强大的潜力,通过对纵向临床信息进行推理,以捕捉丰富的患者轨迹。然而,利用LLMs处理结构化EHR(例如,标准化的诊断和药物编码)面临两个主要挑战。首先,将带时间戳的EHR序列转换为普通文本可能会模糊时间结构和编码身份,从而削弱捕捉编码共现和纵向规律的能力。其次,与通过队列训练的预测模型不同,后者在患者之间学习共享的、与任务对齐的表示空间,LLMs通常在病例孤立的推理环境中应用,每位患者独立处理,而未能利用人口层面的模式。为了解决这些挑战,我们提出了RePrompT,这是一种时间感知的LLM框架,通过提示调优集成结构化EHR编码器,而不修改底层架构。具体而言,RePrompT递归地结合来自先前就诊的潜在状态,以保留纵向信息,并通过从队列训练的、与任务对齐的EHR编码器派生的可训练提示标记注入人口层面信息。在MIMIC-III和MIMIC-IV上的实验表明,RePrompT在多个临床预测任务中始终优于基于EHR和基于LLM的基线。
cs.CL / 125 / 2604.17730
MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models
MHSafeEval:大型语言模型心理健康安全的角色感知交互级评估
Abstract
Large language models (LLMs) are increasingly explored as scalable tools for mental health counseling, yet evaluating their safety remains challenging due to the interactional and context-dependent nature of clinical harm. Existing evaluation frameworks predominantly assess isolated responses using coarse-grained taxonomies or static datasets, limiting their ability to diagnose how harms emerge and accumulate over multi-turn counseling interactions. In this work, we introduce R-MHSafe, a role-aware mental health safety taxonomy that characterizes clinically significant harm in terms of the interactional roles an AI counselor adopts, including perpetrator, instigator, facilitator, or enabler, combined with clinically grounded harm categories. Then, we propose MHSafeEval, a closed-loop, agent-based evaluation framework that formulates safety assessment as trajectory-level discovery of harm through adversarial multi-turn interactions, guided by role-aware modeling. Using R-MHSafe and MHSafeEval, we conduct a large-scale evaluation across state-of-the-art LLMs. Our results reveal substantial role-dependent and cumulative safety failures that are systematically missed by existing static benchmarks, and show that our framework significantly improves failure-mode coverage and diagnostic granularity.
Chinese Translation
大型语言模型(LLMs)越来越多地被探索作为可扩展的心理健康咨询工具,但由于临床伤害的交互性和依赖于上下文的特性,评估其安全性仍然具有挑战性。现有的评估框架主要使用粗粒度分类法或静态数据集评估孤立的响应,限制了它们诊断伤害如何在多轮咨询交互中出现和累积的能力。在本研究中,我们引入了R-MHSafe,一种角色感知的心理健康安全分类法,基于AI咨询师所采用的交互角色(包括施害者、煽动者、促进者或使能者)以及临床相关的伤害类别来表征临床显著伤害。随后,我们提出了MHSafeEval,一个闭环的基于代理的评估框架,将安全评估公式化为通过对抗性多轮交互的伤害轨迹级发现,受角色感知建模的指导。利用R-MHSafe和MHSafeEval,我们对最先进的LLMs进行了大规模评估。我们的结果揭示了显著的角色依赖性和累积性的安全失效,这些失效在现有的静态基准中被系统性地忽视,并且显示我们的框架显著提高了失效模式覆盖率和诊断细致度。
cs.CL / 126 / 2604.17738
Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data
Mira-Embeddings-V1:基于LLM合成数据的招聘领域适应性语义重排序
Abstract
Candidate sourcing for recruiters is best viewed as a two-stage retrieval and reranking pipeline with recall as the primary objective under a limited review budget. An upstream production retriever first returns a candidate shortlist for each job description (JD), and our goal is to rerank that shortlist so that qualified candidates appear as high as possible. We present mira-embeddings-v1, a semantic reranking system for the recruitment domain that reshapes the embedding space with LLM-synthesized training data and corrects boundary confusions with a lightweight reranking head. Starting from real JDs, we build a five-stage prompt pipeline to generate diverse positive and hard negative samples that sculpt the semantic space from multiple angles. We then apply a two-round LoRA adaptation: JD--JD contrastive training followed by JD--CV triplet alignment on a heterogeneous text dataset. Importantly, these gains require no large-scale manually labeled industrial training pairs: a modest set of real JDs is expanded into supervision through LLM synthesis. Finally, a BoundaryHead MLP reranks the Top-K results to distinguish between roles that share the same title but differ in scope. On a local pool of 300 real JDs with candidates from an upstream production retriever, mira-embeddings-v1 improves Recall@50 from 68.89% (baseline) to 77.55% while lifting Precision@10 from 35.77% to 39.62%. On a supportive global pool over 44,138 candidates judged by a Qwen3-32B rubric, it achieves Recall@200 of 0.7047 versus 0.5969 for the baseline. These results show that LLM-synthesized supervision with boundary-aware reranking yields robust gains without a heavy cross-encoder.
Chinese Translation
对于招聘人员而言,候选人来源最佳视为一个两阶段的检索和重排序流程,其中召回是有限审查预算下的主要目标。上游生产检索器首先为每个职位描述(JD)返回候选人短名单,我们的目标是对该短名单进行重排序,以使合格的候选人尽可能高排。我们提出了mira-embeddings-v1,这是一个针对招聘领域的语义重排序系统,通过LLM合成的训练数据重塑嵌入空间,并通过轻量级重排序头纠正边界混淆。从真实的JD出发,我们构建了一个五阶段的提示管道,以生成多样的正样本和困难负样本,从多个角度雕刻语义空间。然后,我们应用了两轮LoRA适应:JD--JD对比训练,随后在异构文本数据集上进行JD--CV三元组对齐。重要的是,这些收益不需要大规模手动标注的工业训练对:一组适度的真实JD通过LLM合成扩展为监督。最后,一个BoundaryHead MLP对Top-K结果进行重排序,以区分具有相同标题但在范围上有所不同的角色。在一个包含300个真实JD的本地池中,mira-embeddings-v1将Recall@50从68.89%(基线)提高到77.55%,同时将Precision@10从35.77%提升至39.62%。在一个支持的全球池中,对44,138个候选人进行Qwen3-32B标准评估时,其Recall@200为0.7047,而基线为0.5969。这些结果表明,具有边界感知重排序的LLM合成监督在不需要重型交叉编码器的情况下,实现了稳健的收益。
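The Recall@K and Precision@K figures above follow the usual ranking definitions, sketched below for a single job description. The candidate identifiers and function names are illustrative assumptions, and reported figures would presumably be averaged over all JDs.

```python
def recall_at_k(ranked, relevant, k):
    """Share of all qualified candidates that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)


def precision_at_k(ranked, relevant, k):
    """Share of the top k slots occupied by qualified candidates."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant)
    return sum(1 for cand in ranked[:k] if cand in relevant) / k


# Hypothetical shortlist before and after reranking.
qualified = {"c2", "c4", "c9"}
before = ["c1", "c2", "c3", "c4"]
after = ["c2", "c4", "c1", "c3"]
```

Reranking does not change which candidates are in the shortlist, only their order, so it can raise Recall@K for small K without touching the upstream retriever.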
cs.CL / 127 / 2604.17745
HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution
HiRAS:一种用于论文到代码生成与执行的层次化多智能体框架
Abstract
Recent advances in large language models have highlighted their potential to automate computational research, particularly reproducing experimental results. However, existing approaches still use fixed sequential agent pipelines with weak global coordination, which limits their robustness and overall performance. In this work, we propose Hierarchical Research Agent System (HiRAS), a hierarchical multi-agent framework for end-to-end experiment reproduction that employs supervisory manager agents to coordinate specialised agents across fine-grained stages. We also identify limitations in the reference-free evaluation of the Paper2Code benchmark and introduce Paper2Code-Extra (P2C-Ex), a refined protocol that incorporates repository-level information and better aligns with the original reference-based metric. We conduct extensive evaluation, validating the effectiveness and robustness of our proposed methods, and observing improvements, including >10% relative performance gain beyond the previous state-of-the-art using open-source backbone models and significantly reduced hallucination in evaluation. Our work is available on GitHub: https://github.com/KOU-199024/HiRAS.
Chinese Translation
近期大型语言模型的进展凸显了其在自动化计算研究方面的潜力,特别是在重现实验结果方面。然而,现有的方法仍然使用固定的顺序智能体管道,缺乏有效的全局协调,这限制了其鲁棒性和整体性能。在本研究中,我们提出了层次化研究智能体系统(HiRAS),这是一种用于端到端实验重现的层次化多智能体框架,采用监督管理智能体来协调跨细粒度阶段的专业智能体。我们还识别了无参考评估在Paper2Code基准中的局限性,并引入了Paper2Code-Extra(P2C-Ex),这是一种改进的协议,结合了仓库级信息,更好地与原始基于参考的度量对齐。我们进行了广泛的评估,验证了所提方法的有效性和鲁棒性,并观察到改进,包括使用开源基础模型相较于之前的最先进技术超过10%的相对性能提升,以及在评估中显著减少的幻觉现象。我们的工作已在GitHub上发布: https://github.com/KOU-199024/HiRAS。
cs.CL / 128 / 2604.17769
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF
反向宪法人工智能:通过概率限制的强化学习从人工智能反馈生成可控的有害数据框架
Abstract
Ensuring the safety of large language models (LLMs) requires robust red teaming, yet the systematic synthesis of high-quality toxic data remains under-explored. We propose Reverse Constitutional AI (R-CAI), a framework for automated and controllable adversarial data generation that moves beyond isolated jailbreak prompts. By inverting a harmless constitution into a constitution of toxicity and iteratively refining model outputs through a critique--revision pipeline, R-CAI enables scalable synthesis of multi-dimensional adversarial data without human annotation. Optimizing solely for toxicity-related rewards, however, can lead to reward hacking and degraded semantic coherence. To address this challenge, we introduce probability clamping within reinforcement learning from AI feedback, which stabilizes adversarial optimization while preserving adversarial intent. Experiments demonstrate that R-CAI generates diverse, high-quality toxic data and that probability clamping substantially improves semantic coherence (15%) without sacrificing adversarial strength. Overall, R-CAI provides a fully automated framework for red teaming data generation and systematic safety evaluation of aligned language models.
Chinese Translation
确保大型语言模型(LLMs)的安全性需要强有力的红队测试,然而高质量有害数据的系统合成仍然未得到充分探索。我们提出了反向宪法人工智能(Reverse Constitutional AI, R-CAI),这是一个自动化和可控的对抗性数据生成框架,超越了孤立的越狱提示。通过将无害的宪法反转为有害的宪法,并通过批评-修订管道迭代精炼模型输出,R-CAI 实现了无需人工标注的多维对抗性数据的可扩展合成。然而,仅优化与有害性相关的奖励可能导致奖励黑客行为和语义一致性降低。为了解决这一挑战,我们在基于人工智能反馈的强化学习中引入了概率限制,这在保持对抗性意图的同时稳定了对抗性优化。实验表明,R-CAI 生成了多样化的高质量有害数据,并且概率限制显著提高了语义一致性(15%),而不牺牲对抗性强度。总体而言,R-CAI 提供了一个完全自动化的红队数据生成框架和对齐语言模型的系统安全评估。
cs.CL / 129 / 2604.17771
SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
SPENCE:用于检测 NL2SQL 基准污染的句法探测器
Abstract
Large language models (LLMs) have achieved strong performance on natural language to SQL (NL2SQL) benchmarks, yet their reported accuracy may be inflated by contamination from benchmark queries or structurally similar patterns seen during training. We introduce SPENCE (Syntactic Probing and Evaluation of NL2SQL Contamination Effects), a controlled syntactic probing framework for detecting and quantifying such contamination. SPENCE systematically generates syntactic variants of test queries for four widely used NL2SQL datasets: Spider, SParC, CoSQL, and the newer BIRD benchmark. We use SPENCE to evaluate multiple high-capacity LLMs under execution-based scoring. For each model, we measure changes in execution accuracy across increasing levels of syntactic divergence and quantify rank sensitivity using Kendall's tau with bootstrap confidence intervals. By aligning these robustness trends with benchmark release dates, we observe a clear temporal gradient: older benchmarks such as Spider exhibit the strongest negative tau values and thus the highest likelihood of training leakage, whereas the more recent BIRD dataset shows minimal sensitivity and appears largely uncontaminated. Together, these findings highlight the importance of temporally contextualized, syntactic-probing evaluation for trustworthy NL2SQL benchmarking.
Chinese Translation
大型语言模型(LLMs)在自然语言到 SQL(NL2SQL)基准测试中取得了强劲的表现,但其报告的准确性可能因基准查询或训练过程中见到的结构相似模式的污染而被夸大。我们引入了 SPENCE(句法探测与 NL2SQL 污染效应评估),这是一个受控的句法探测框架,用于检测和量化这种污染。SPENCE 系统地生成四个广泛使用的 NL2SQL 数据集(Spider、SParC、CoSQL 和较新的 BIRD 基准)的测试查询的句法变体。我们使用 SPENCE 在基于执行的评分下评估多个高容量 LLM。对于每个模型,我们测量在增加的句法差异水平下执行准确性的变化,并使用 Kendall's tau 和自助法置信区间量化排名敏感性。通过将这些鲁棒性趋势与基准发布日期对齐,我们观察到明显的时间梯度:较旧的基准如 Spider 显示出最强的负值,从而具有最高的训练泄漏可能性,而较新的 BIRD 数据集则表现出最小的敏感性,且似乎基本上未受到污染。这些发现共同强调了在可信的 NL2SQL 基准测试中进行时间上下文化的句法探测评估的重要性。
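The rank-sensitivity measurement described in the abstract above can be sketched as follows. This is a minimal illustration, assuming that Kendall's tau is computed between the syntactic-divergence level and execution accuracy; the tau-a variant, the percentile bootstrap, and the accuracy numbers are all assumptions made for the example, not details confirmed by the paper.

```python
import random
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs; tied pairs count as neither."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    score = 0
    for i, j in combinations(range(n), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            score += 1
        elif s < 0:
            score -= 1
    return score / (n * (n - 1) / 2)


def bootstrap_ci(xs, ys, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for tau (resampling pairs with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(kendall_tau([xs[i] for i in idx], [ys[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical numbers: accuracy falls as test queries diverge from their original syntax.
levels = [0, 1, 2, 3, 4]
accuracy = [0.82, 0.79, 0.71, 0.66, 0.60]

print(kendall_tau(levels, accuracy))
print(bootstrap_ci(levels, accuracy))
```

A tau near -1 means accuracy drops consistently as divergence grows, which is the pattern the abstract associates with likely training leakage; a tau near 0 indicates little sensitivity.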
cs.CL / 130 / 2604.17785
Forget What Matters, Keep the Rest: Selective Unlearning of Informative Tokens
忘记重要内容,保留其余:信息性标记的选择性遗忘
Abstract
Unlearning in large language models (LLMs) has emerged as a promising safeguard against adversarial behaviors. When the forgetting loss is applied uniformly without considering token-level semantic importance, model utility can be unnecessarily degraded. Recent studies have explored token-wise loss regularizers that prioritize informative tokens, but largely rely on ground-truth confidence or external linguistic parsers, which limits their ability to capture contextual information or the model's overall predictive state. Intuitively, function words like "the" primarily serve syntactic roles and are highly predictable with little ambiguity, but informative words admit multiple plausible alternatives with greater uncertainty. Based on this intuition, we propose Entropy-guided Token Weighting (ETW), a token-level unlearning regularizer that uses entropy of the predictive distribution as a proxy for token informativeness. We demonstrate that informative tokens tend to have higher entropy, whereas structural tokens tend to have lower entropy. This behavior enables ETW to achieve more effective unlearning while better preserving model utility than existing token-level approaches.
Chinese Translation
在大型语言模型(LLMs)中,遗忘机制已成为对抗行为的一种有前景的防护措施。当遗忘损失被均匀应用而未考虑标记级语义重要性时,模型的实用性可能会不必要地降低。最近的研究探讨了优先考虑信息性标记的标记级损失正则化方法,但主要依赖于真实标签的置信度或外部语言解析器,这限制了它们捕捉上下文信息或模型整体预测状态的能力。直观上,像“the”这样的功能词主要承担句法角色,并且具有高度的可预测性和较少的歧义,而信息性词汇则承认多种合理的替代选项,具有更大的不确定性。基于这一直觉,我们提出了熵引导的标记加权(Entropy-guided Token Weighting,ETW),这是一种使用预测分布的熵作为标记信息性的代理的标记级遗忘正则化器。我们证明了信息性标记往往具有更高的熵,而结构性标记则倾向于具有较低的熵。这种行为使得ETW能够实现更有效的遗忘,同时比现有的标记级方法更好地保留模型的实用性。
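The intuition in the abstract above, that predictive entropy separates informative tokens from structural ones, can be made concrete with a short sketch. The two probability distributions are hypothetical, and normalizing entropies into weights that sum to one is an assumed weighting scheme used only for illustration; the abstract does not specify ETW's exact formula.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weights(token_distributions):
    """Turn per-token entropies into weights summing to 1 (assumed normalization)."""
    ents = [entropy(p) for p in token_distributions]
    if not ents:
        return []
    total = sum(ents)
    if total == 0.0:
        # Every token is fully predictable: fall back to uniform weights.
        return [1.0 / len(ents)] * len(ents)
    return [e / total for e in ents]


# Hypothetical next-token distributions over a 4-word vocabulary.
function_word = [0.97, 0.01, 0.01, 0.01]     # e.g. a position where "the" is near-certain
informative_word = [0.40, 0.30, 0.20, 0.10]  # several plausible alternatives

weights = entropy_weights([function_word, informative_word])
print(entropy(function_word), entropy(informative_word), weights)
```

Under this scheme the forgetting loss would concentrate on the high-entropy position, while the near-deterministic function word contributes little.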
cs.CL / 131 / 2604.17794
Bridging the Reasoning Gap in Vietnamese with Small Language Models via Test-Time Scaling
通过测试时缩放,利用小型语言模型弥合越南语推理差距
Abstract
The democratization of ubiquitous AI hinges on deploying sophisticated reasoning capabilities on resource-constrained devices. However, Small Language Models (SLMs) often face a "reasoning gap", particularly in non-English languages like Vietnamese, where they struggle to maintain coherent chains of thought. This paper investigates Test-Time Scaling strategies for the Qwen3-1.7B architecture within the context of Vietnamese Elementary Mathematics. We introduce Vi-S1K, a high-fidelity reasoning dataset localized via a Gemini 2.5 Flash-Lite powered pipeline, and Vi-Elementary-Bench, a dual-resource benchmark for rigorous evaluation. Using an LLM-as-a-Judge protocol, we reveal that the base model possesses robust latent knowledge (Accuracy: 4.05/5.00) but suffers from a severe "formatting gap" in communication. Supervised Fine-Tuning (SFT) acts as a critical "reasoning unlocker", yielding a 77% improvement in Explanation Quality and bridging the gap between raw calculation and pedagogical coherence. Furthermore, our analysis of prompting strategies uncovers a significant trade-off: structured frameworks like ReAct impose a "cognitive tax" on the 1.7B parameter capacity, degrading performance relative to pure Chain-of-Thought (CoT) combined with Self-Consistency. These findings establish a deployment hierarchy for SLMs, demonstrating that SFT combined with simplified test-time scaling is superior to complex agentic workflows for edge-based reasoning.
Chinese Translation
普及的人工智能的实现依赖于在资源受限的设备上部署复杂的推理能力。然而,小型语言模型(SLMs)常常面临“推理差距”,尤其是在越南语等非英语语言中,它们难以维持连贯的思维链。本文研究了在越南小学数学背景下,针对Qwen3-1.7B架构的测试时缩放策略。我们引入了Vi-S1K,一个通过Gemini 2.5 Flash-Lite驱动的高保真推理数据集,以及Vi-Elementary-Bench,一个用于严格评估的双资源基准。使用LLM-as-a-Judge协议,我们揭示了基础模型具备强大的潜在知识(准确率:4.05/5.00),但在沟通中遭遇严重的“格式差距”。监督微调(SFT)作为关键的“推理解锁器”,在解释质量上提升了77%,弥合了原始计算与教学连贯性之间的差距。此外,我们对提示策略的分析揭示了一个显著的权衡:像ReAct这样的结构化框架对1.7B参数容量施加了“认知税”,相较于纯粹的思维链(CoT)结合自我一致性,其性能下降。这些发现建立了小型语言模型的部署层级,证明了SFT结合简化的测试时缩放优于复杂的代理工作流程,适用于边缘推理。
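The abstract above contrasts structured prompting with Chain-of-Thought combined with Self-Consistency. The aggregation step of Self-Consistency is a majority vote over the final answers of several sampled reasoning chains, sketched below. The sampled answers are hypothetical, and the sampling itself (calling the model several times at non-zero temperature) is left out.

```python
from collections import Counter


def self_consistent_answer(sampled_answers):
    """Return the most frequent final answer and its share of the votes."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answer.strip() for answer in sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


# Hypothetical final answers extracted from five sampled chains of thought.
samples = ["12", "12 ", "15", "12", "12"]
print(self_consistent_answer(samples))
```

The vote share can double as a rough confidence signal: a low share suggests the sampled chains disagree and the answer deserves less trust.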
cs.CL / 132 / 2604.17819
PDDL-Mind: Large Language Models are Capable on Belief Reasoning with Reliable State Tracking
PDDL-Mind:大型语言模型在信念推理中的可靠状态跟踪能力
Abstract
Large language models (LLMs) perform substantially below human level on existing theory-of-mind (ToM) benchmarks, even when augmented with chain-of-thought prompting or probabilistic belief updates. We argue that these failures primarily arise from unreliable implicit state tracking rather than limitations in high-level reasoning. We introduce PDDL-Mind, a neuro-symbolic framework that decouples environment state evolution from belief inference. By translating narrative descriptions into explicit states and actions expressed in Planning Domain Definition Language (PDDL), and by verifying action-induced state transitions against a predefined domain, PDDL-Mind provides LLMs with a logically consistent and explicit representation of world states for ToM tasks. Experiments on MMToM-QA, MuMA and FanToM show that PDDL-Mind achieves over 5% absolute accuracy gain over the best existing state-of-the-art method on ToM benchmark questions.
Chinese Translation
大型语言模型(LLMs)在现有的心智理论(ToM)基准测试中的表现远低于人类水平,即使在使用思维链提示或概率信念更新的情况下也是如此。我们认为,这些失败主要源于不可靠的隐式状态跟踪,而不是高层次推理的局限性。我们提出了PDDL-Mind,这是一种神经符号框架,它将环境状态演变与信念推理解耦。通过将叙述描述转换为用规划领域定义语言(PDDL)表达的显式状态和动作,并通过验证动作引起的状态转变与预定义领域的一致性,PDDL-Mind为LLMs提供了一个逻辑上连贯且显式的世界状态表示,以用于ToM任务。在MMToM-QA、MuMA和FanToM上的实验表明,PDDL-Mind在ToM基准问题上相较于现有最佳的最先进方法实现了超过5%的绝对准确率提升。
cs.CL / 133 / 2604.17827
Learning to Seek Help: Dynamic Collaboration Between Small and Large Language Models
学习寻求帮助:小型与大型语言模型之间的动态协作
Abstract
Large language models (LLMs) offer strong capabilities but raise cost and privacy concerns, whereas small language models (SLMs) facilitate efficient and private local inference yet suffer from limited capacity. To synergize the complementary strengths, we introduce a dynamic collaboration framework, where an SLM learns to proactively decide how to request an LLM during multi-step reasoning, while the LLM provides adaptive feedback instead of acting as a passive tool. We further systematically investigate how collaboration strategies are shaped by SLM and LLM capabilities as well as efficiency and privacy constraints. Evaluation results reveal a distinct scaling effect: stronger SLMs become more self-reliant, while stronger LLMs enable fewer and more informative interactions. In addition, the learned dynamic collaboration strategies significantly outperform static pipelines and standalone inference, and transfer robustly to unseen LLMs.
Chinese Translation
大型语言模型(LLMs)提供了强大的能力,但也引发了成本和隐私方面的担忧,而小型语言模型(SLMs)则促进了高效和私密的本地推理,但其能力有限。为了协同发挥各自的优势,我们提出了一种动态协作框架,其中SLM学习主动决定在多步推理过程中如何请求LLM,而LLM则提供自适应反馈,而不是作为被动工具。我们进一步系统地研究了协作策略如何受到SLM和LLM能力以及效率和隐私约束的影响。评估结果揭示了明显的规模效应:更强的SLM变得更加自立,而更强的LLM则使得交互次数减少且信息更加丰富。此外,学习到的动态协作策略显著优于静态流程和独立推理,并且能够稳健地迁移到未见过的LLM上。
cs.CL / 134 / 2604.17828
How Non-Linguistic Is the Indus Sign System? A Synthetic-Baseline Scorecard
印度河符号系统的非语言程度有多高?一个合成基线评分卡
Abstract
Whether the Indus Valley sign system (c. 2600-1900 BCE) encodes spoken language has been debated for decades. This paper introduces a multi-metric discrimination framework that tests the observed Indus corpus against two kinds of computer-generated non-linguistic baseline -- one mimicking a heraldic emblem system, the other an administrative coding system -- each calibrated with Zipfian frequency distributions, positional constraints, and bigram dependencies derived from six attested non-linguistic corpora. The scorecard evaluates four properties central to the Farmer-Sproat-Witzel (2004) critique: text brevity, repeated formulaic phrases, hapax legomenon rate, and positional rigidity. Applying this framework to 1,916 deduplicated inscriptions (584 unique signs, 11,110 tokens) from the ICIT/Yajnadevam digitization, we find that the Indus corpus matches neither baseline cleanly: across the four metrics examined, it occupies an intermediate position relative to the two baseline families. Neither a heraldic nor an administrative generator can reproduce all four properties at once. We also compare against seven real-world non-linguistic corpora including Sproat's (2014) datasets, finding that no attested non-linguistic system reproduces the full Indus statistical profile either. We replicate key prior results including a Zipf slope of -1.49 and conditional entropy of 3.23 bits. All code and data are publicly available.
Chinese Translation
关于印度河谷符号系统(公元前2600-1900年)是否编码口语语言的争论已持续数十年。本文引入了一种多指标区分框架,测试观察到的印度河语料库与两种计算机生成的非语言基准进行比较——一种模仿纹章系统,另一种为行政编码系统——每种基准均根据来自六个已确认的非语言语料库的Zipf频率分布、位置约束和二元依赖关系进行校准。评分卡评估了与Farmer-Sproat-Witzel(2004)批评相关的四个核心属性:文本简洁性、重复的公式化短语、单次出现词汇率和位置刚性。将该框架应用于来自ICIT/Yajnadevam数字化的1,916个去重铭文(584个独特符号,11,110个标记),我们发现印度河语料库与任何基准都没有完全匹配。在所检查的四个指标中,印度河语料库相对于两个基准家族处于中间位置,均未能完全匹配。无论是纹章生成器还是行政生成器都无法同时重现这四个属性。我们还与包括Sproat(2014)数据集在内的七个现实世界非语言语料库进行了比较,发现没有任何已确认的非语言系统能够重现完整的印度河统计特征。我们复制了关键的先前结果,包括Zipf斜率为-1.49和条件熵为3.23比特。所有代码和数据均可公开获取。
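Two of the statistics quoted in the abstract above, the Zipf slope and the conditional entropy in bits, can be sketched as follows. This is a minimal illustration under stated assumptions: the slope is an ordinary least-squares fit of log frequency against log rank, the conditional entropy is estimated from adjacent bigram counts, and the corpus is treated as a single token sequence. The paper's own estimators and its handling of inscription boundaries may differ, and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank); needs >= 2 distinct types."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def conditional_entropy(tokens):
    """H(next sign | current sign) in bits, estimated from adjacent bigram counts."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    firsts = Counter(tokens[:-1])
    total = sum(bigrams.values())
    h = 0.0
    for (first, _second), count in bigrams.items():
        h -= (count / total) * math.log2(count / firsts[first])
    return h


# Hypothetical toy corpora, not Indus data.
zipfian = ["A"] * 12 + ["B"] * 6 + ["C"] * 4 + ["D"] * 3  # frequency proportional to 1/rank
branching = ["A", "B", "A", "C", "A", "B", "A", "C", "A"]  # after A: B or C, equally likely

print(zipf_slope(zipfian))
print(conditional_entropy(branching))
```

In the toy data the frequencies follow 1/rank exactly, so the fitted slope is -1, and the only uncertainty in the second sequence is the coin flip after A, giving 0.5 bits on average.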
cs.CL / 135 / 2604.17842
QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks
QuickScope:在动态 LLM 基准中认证难题
Abstract
LLM benchmarks are increasingly dynamic: instead of containing a fixed set of questions, they define templates and parameters that can generate an effectively unlimited number of question variants. This flexibility is valuable, but it makes evaluation expensive -- especially when the goal is not just determining an average score, but reliably identifying a model's weak spots. This paper introduces a new methodology for identifying hard questions in dynamic benchmarks. It leverages COUP, a recent Bayesian optimization algorithm (Graham, Velez & Leyton-Brown, 2026), after introducing several substantive modifications to make the algorithm suitable for practical LLM pipelines. We also wrap it in a tool that supports flexible choices of datasets and utility functions, enabling users to target the kinds of questions they care about (e.g., low-accuracy questions; questions that are unusually hard relative to their measured complexity). In experiments across a range of benchmarks, we show that our method, dubbed $\texttt{QuickScope}$, discovers truly difficult questions more sample efficiently than standard baselines, while also reducing false positives from noisy outcomes.
Chinese Translation
LLM 基准正变得越来越动态:它们不再包含固定的问题集,而是定义了可以生成几乎无限数量问题变体的模板和参数。这种灵活性是有价值的,但也使得评估成本高昂,尤其是当目标不仅仅是确定平均分数,而是可靠地识别模型的弱点时。本文提出了一种新的方法论,用于在动态基准中识别难题。它利用了 COUP,这是一种最新的贝叶斯优化算法(Graham, Velez & Leyton-Brown, 2026),并在此基础上进行了若干实质性修改,使该算法适合于实际的 LLM 流水线。我们还将其封装在一个工具中,支持灵活选择数据集和效用函数,使用户能够针对他们关心的问题类型(例如,低准确率问题;相对于其测量复杂性异常困难的问题)进行定位。在一系列基准的实验中,我们展示了我们的方法,称为 $\texttt{QuickScope}$,在样本效率上发现真正困难的问题优于标准基线,同时减少了来自噪声结果的假阳性。
cs.CL / 136 / 2604.17857
On the Emergence of Syntax by Means of Local Interaction
通过局部交互的句法产生
Abstract
Can syntactic processing emerge spontaneously from purely local interaction? We present a concrete instance on a minimal system: an 18,658-parameter two-dimensional neural cellular automaton (NCA), supervised by nothing more than a 1-bit boundary signal, is trained on the membership problem of an arithmetic-expression grammar. After training, its internal $L \times L$ grid spontaneously self-organizes into an ordered, spatially extended representation that we name Proto-CKY. This representation satisfies three operational criteria for syntactic processing: expressive power beyond the regular languages, structural generalization beyond the training distribution, and an internal organization quantitatively aligned with grammatical structure (Pearson $r \approx 0.71$). It emerges independently on four context-free grammars and regenerates spontaneously after perturbation. Proto-CKY is functionally aligned with the CKY algorithm but formally distinct from it: it is a physical prototype, a concrete instantiation of a mathematical ideal on a physical substrate, and the systematic distance between the two carries information about the substrate itself.
Chinese Translation
句法处理能否自发地从纯粹的局部交互中产生?我们在一个最小系统上展示了一个具体实例:一个拥有18658个参数的二维神经元细胞自动机(NCA),仅通过一个1位边界信号进行监督,训练于算术表达文法的成员资格问题。训练后,其内部$L \times L$网格自发地自组织成一种有序的、空间扩展的表示,我们称之为Proto-CKY。这种表示满足句法处理的三个操作标准:超越正规语言的表达能力、超越训练分布的结构泛化,以及与语法结构定量对齐的内部组织(Pearson $r \approx 0.71$)。它在四种上下文无关文法上独立出现,并在扰动后自发再生。Proto-CKY在功能上与CKY算法对齐,但在形式上与其不同:它是一个物理原型,是在物理基底上对数学理想的具体实例,而二者之间的系统距离携带着关于基底本身的信息。
cs.CL / 137 / 2604.17866
Latent Abstraction for Retrieval-Augmented Generation
用于检索增强生成的潜在抽象
Abstract
Retrieval-Augmented Generation (RAG) has become a standard approach for enhancing large language models (LLMs) with external knowledge, mitigating hallucinations, and improving factuality. However, existing systems rely on generating natural language queries at each hop and maintaining a strict architectural separation between retriever and generator, preventing them from leveraging the full representational capacity of the LLM. We propose \textbf{LAnR} (Latent Abstraction for RAG), a unified framework in which a single LLM jointly performs encoding, retrieval, and generation entirely within its own latent space. Rather than generating textual queries, LAnR produces dense retrieval vectors from the hidden states of a designated \texttt{[PRED]} token and uses them to match against encoded document representations from the same model. Furthermore, LAnR adaptively decides when sufficient evidence has been retrieved using a lightweight MLP control head over those same hidden states, eliminating both the separate retriever and explicit token-level stopping reasoning. This design is motivated by our empirical observation that answer token entropy reliably signals retrieval sufficiency. Extensive experiments on six QA benchmarks spanning single-hop and multi-hop settings demonstrate that LAnR outperforms existing RAG methods, while achieving improved inference efficiency through reduced number of retrieval calls and tighter model integration.
Chinese Translation
检索增强生成(RAG)已成为增强大型语言模型(LLMs)以利用外部知识、减轻幻觉并提高事实准确性的标准方法。然而,现有系统依赖于在每个步骤生成自然语言查询,并在检索器和生成器之间保持严格的架构分离,阻碍了它们充分利用LLM的表示能力。我们提出了\textbf{LAnR}(用于RAG的潜在抽象),这是一个统一框架,其中单个LLM在其自身的潜在空间内共同执行编码、检索和生成。LAnR不是生成文本查询,而是从指定的\texttt{[PRED]}标记的隐藏状态中生成密集检索向量,并利用这些向量与来自同一模型的编码文档表示进行匹配。此外,LAnR通过对这些相同隐藏状态使用轻量级的多层感知机(MLP)控制头,自适应地决定何时检索到足够的证据,从而消除了单独的检索器和显式的标记级停止推理。这一设计的动机源于我们经验观察到的答案标记熵可靠地指示了检索的充分性。在六个涵盖单跳和多跳设置的问答基准上的广泛实验表明,LAnR在性能上超越了现有的RAG方法,同时通过减少检索调用次数和更紧密的模型集成提高了推理效率。
cs.CL / 138 / 2604.17870
GraSP: Graph-Structured Skill Compositions for LLM Agents
GraSP:面向大型语言模型代理的图结构技能组合
Abstract
Skill ecosystems for LLM agents have matured rapidly, yet recent benchmarks show that providing agents with more skills does not monotonically improve performance -- focused sets of 2-3 skills outperform comprehensive documentation, and excessive skills actually hurt. The bottleneck has shifted from skill availability to skill orchestration: agents need not more skills, but a structural mechanism to select, compose, and execute them with explicit causal dependencies. We propose GraSP, the first executable skill graph architecture that introduces a compilation layer between skill retrieval and execution. GraSP transforms flat skill sets into typed directed acyclic graphs (DAGs) with precondition-effect edges, executes them with node-level verification, and performs locality-bounded repair through five typed operators -- reducing replanning from O(N) to O(d^h). Across ALFWorld, ScienceWorld, WebShop, and InterCode with eight LLM backbones, GraSP outperforms ReAct, Reflexion, ExpeL, and flat skill baselines in every configuration, improving reward by up to +19 points over the strongest baseline while cutting environment steps by up to 41%. GraSP's advantage grows with task complexity and is robust to both skill over-retrieval and quality degradation, confirming that structured orchestration -- not larger skill libraries -- is the key to reliable agent execution.
Chinese Translation
大型语言模型(LLM)代理的技能生态系统迅速成熟,然而最近的基准测试显示,提供给代理更多的技能并不总是能单调地提高性能——2-3个技能的集中组合优于全面的文档,而过多的技能实际上会造成负面影响。瓶颈已从技能的可用性转移到技能的协调:代理需要的不是更多的技能,而是一个结构化机制来选择、组合和执行这些技能,并明确其因果依赖关系。我们提出了GraSP,这是第一个可执行的技能图架构,在技能检索和执行之间引入了一个编译层。GraSP将平面的技能集合转化为带有前置条件-效果边的有向无环图(DAG),通过节点级验证执行这些图,并通过五个类型化操作符进行局部修复——将重新规划的复杂度从O(N)降低到O(d^h)。在ALFWorld、ScienceWorld、WebShop和InterCode等八个大型语言模型基础上,GraSP在每种配置中均优于ReAct、Reflexion、ExpeL和传统技能基线,奖励提高幅度最高可达19分,同时环境步骤减少最多41%。GraSP的优势随着任务复杂度的增加而增强,并且对技能过度检索和质量下降具有鲁棒性,确认了结构化协调——而非更大的技能库——是可靠代理执行的关键。
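The idea in the abstract above of compiling a flat skill set into a DAG and repairing failures locally can be sketched roughly as follows. Everything here is an assumption made for illustration: the skill names, the dependency map, and the interpretation of "locality-bounded repair" as restricting replanning to the failed skill and its downstream dependents. The sketch does not model GraSP's typed edges, node-level verification, or its five repair operators.

```python
from graphlib import TopologicalSorter

# Hypothetical skill graph: each skill maps to the skills whose effects it requires.
skill_deps = {
    "find_mug": set(),
    "pick_mug": {"find_mug"},
    "find_sink": set(),
    "wash_mug": {"pick_mug", "find_sink"},
    "place_mug": {"wash_mug"},
}


def execution_order(deps):
    """A valid execution order in which every skill runs after its prerequisites."""
    return list(TopologicalSorter(deps).static_order())


def repair_scope(deps, failed):
    """The failed skill plus everything downstream of it (assumed notion of local repair)."""
    dependents = {}
    for skill, prerequisites in deps.items():
        for prerequisite in prerequisites:
            dependents.setdefault(prerequisite, set()).add(skill)
    scope, stack = set(), [failed]
    while stack:
        skill = stack.pop()
        if skill not in scope:
            scope.add(skill)
            stack.extend(dependents.get(skill, ()))
    return scope


order = execution_order(skill_deps)
print(order)
print(repair_scope(skill_deps, "find_sink"))
```

With explicit dependency edges, a failure in one branch leaves unrelated skills untouched, which is the intuition behind replanning only a bounded neighborhood rather than the whole plan.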
cs.CL / 139 / 2604.17886
Latent Preference Modeling for Cross-Session Personalized Tool Calling
跨会话个性化工具调用的潜在偏好建模
Abstract
Users often omit essential details in their requests to LLM-based agents, resulting in under-specified inputs for tool use. This poses a fundamental challenge for tool-augmented agents, as API execution typically requires complete arguments, highlighting the need for personalized tool calling. To study this problem, we introduce MPT, a benchmark comprising 265 multi-session dialogues that cover three challenges: Preference Recall, Preference Induction, and Preference Transfer. We also propose PRefine, a test-time memory-augmented method that represents user preferences as evolving hypotheses. Through a generate--verify--refine loop, it extracts reusable constraints from history and improves tool-calling accuracy while using only 1.24% of the tokens required by full-history prompting. These results indicate that robust personalization in agentic systems depends on memory that captures the reasons behind user choices, not just the choices themselves.
Chinese Translation
用户在向基于大型语言模型(LLM)的代理请求时,常常省略关键信息,导致工具使用时输入不明确。这对工具增强型代理提出了根本性挑战,因为API执行通常需要完整的参数,这突显了个性化工具调用的必要性。为了解决这一问题,我们引入了MPT,一个包含265个多会话对话的基准,涵盖了三个挑战:偏好回忆、偏好归纳和偏好转移。我们还提出了PRefine,一种测试时的记忆增强方法,将用户偏好表示为不断演变的假设。通过生成-验证-精炼循环,它从历史中提取可重用的约束,并在仅使用全历史提示所需的1.24%令牌的情况下,提高了工具调用的准确性。这些结果表明,代理系统中的稳健个性化依赖于捕捉用户选择背后原因的记忆,而不仅仅是选择本身。
cs.CL / 140 / 2604.17894
Automatic Slide Updating with User-Defined Dynamic Templates and Natural Language Instructions
基于用户定义动态模板和自然语言指令的自动幻灯片更新
Abstract
Presentation slides are a primary medium for data-driven reporting, yet keeping complex, analytics-style decks up to date remains labor-intensive. Existing automation methods mostly follow fixed template filling and cannot support dynamic updates for diverse, user-authored slide decks. We therefore define "Dynamic Slide Update via Natural Language Instructions on User-provided Templates" and introduce DynaSlide, a large-scale benchmark with 20,036 real-world instruction-execution triples (source slide, user instruction, target slide) grounded in a shared external database and built from business reporting slides under bring-your-own-template (BYO-template) conditions. To tackle this task, we propose SlideAgent, an agent-based framework that combines multimodal slide parsing, natural language instruction grounding, and tool-augmented reasoning for tables, charts, and textual conclusions. SlideAgent updates content while preserving layout and style, providing a strong reference baseline on DynaSlide. We further design end-to-end and component-level evaluation protocols that reveal key challenges and opportunities for future research. The dataset and code are available at https://github.com/XiaoZhou2024/SlideAgent.
Chinese Translation
演示幻灯片是数据驱动报告的主要媒介,但保持复杂的分析风格幻灯片的最新状态仍然是一项劳动密集型的工作。现有的自动化方法大多遵循固定模板填充,无法支持多样化的用户创作幻灯片的动态更新。因此,我们定义了“基于用户提供模板的自然语言指令的动态幻灯片更新”,并引入了 DynaSlide,这是一个大规模基准数据集,包含 20,036 个真实世界的指令执行三元组(源幻灯片、用户指令、目标幻灯片),这些三元组基于共享的外部数据库,并从自带模板(BYO-template)条件下的商业报告幻灯片构建而成。为了解决这一任务,我们提出了 SlideAgent,这是一种基于代理的框架,结合了多模态幻灯片解析、自然语言指令的基础以及增强工具推理,适用于表格、图表和文本结论。SlideAgent 在更新内容的同时保持布局和风格,为 DynaSlide 提供了一个强有力的参考基线。我们进一步设计了端到端和组件级的评估协议,以揭示未来研究的关键挑战和机遇。数据集和代码可在 https://github.com/XiaoZhou2024/SlideAgent 获取。
cs.CL / 141 / 2604.17930
Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?
语言模型的正式语言能力的异质性:数据是真正的瓶颈吗?
Abstract
Large Language Models (LLMs) exhibit a puzzling disparity in their formal linguistic competence: while they learn some linguistic phenomena with near-perfect mastery, they often perform below chance on others, even after training on trillions of tokens. In this work, we investigate whether these failures stem from inherent architectural limitations or simply the scarcity of these specific grammatical constructions in web-scale corpora. We pre-train simple GPT-2 Small (124M) models on a 100M-token random sample of the FineWeb corpus and intervene by injecting a minimal amount (1%) of synthetic data targeting specific linguistic phenomena. We find that this targeted intervention substantially improves model performance in 8 out of the 9 worst-performing BLiMP paradigms; notably, accuracy on one paradigm, only_npi_scope, surges from 20.9% to 69.4%. Furthermore, we observe that these interventions generally preserve or slightly improve aggregate performance. We also identify a resistant phenomenon, principle_A_c_command, whose performance remains below chance even after our data augmentation. Nevertheless, our findings serve as an optimistic existence proof that even small language models can substantially improve on the linguistic phenomena on which models typically perform poorly, provided the pre-training data contains sufficient exposure to them. This suggests that efforts towards human-scale language modeling may benefit greatly from focusing on data composition. The code to reproduce our results is open-sourced at https://github.com/kowndinya-renduchintala/heterogeneity-in-formal-linguistic-competence.
Chinese Translation
大型语言模型(LLMs)在其正式语言能力上表现出令人困惑的差异:尽管它们对某些语言现象的掌握接近完美,但在其他现象上,它们的表现往往低于随机水平,即使在经过数万亿个标记的训练后。在本研究中,我们探讨这些失败是源于固有的架构限制,还是仅仅因为这些特定语法结构在网络规模语料库中的稀缺。我们在FineWeb语料库的1亿标记随机样本上预训练了简单的GPT-2 Small(124M)模型,并通过注入针对特定语言现象的少量(1%)合成数据进行干预。我们发现,这种针对性的干预在9个表现最差的BLiMP范式中显著提高了模型性能,其中一个特定范式only_npi_scope的准确率从20.9%飙升至69.4%。此外,我们观察到这些干预通常保持或略微改善整体性能。然而,尽管我们还识别出一个抵抗现象principle_A_c_command,其性能在数据增强后仍低于随机水平,我们的发现确实作为一个乐观的存在证明,即使是小型语言模型也可以在那些模型通常表现不佳的语言现象上显著改善,只要预训练数据包含足够的暴露。这表明,针对人类规模语言建模的努力可能会通过关注数据组成而获得巨大的收益。我们结果的重现代码已开源,地址为https://github.com/kowndinya-renduchintala/heterogeneity-in-formal-linguistic-competence。
cs.CL / 142 / 2604.17943
Domain-oriented RAG Assessment (DoRA): Synthetic Benchmarking for RAG-based Question Answering on Defense Documents
面向领域的RAG评估(DoRA):基于RAG的国防文件问答的合成基准测试
Abstract
Open-domain RAG benchmarks over public corpora can overestimate deployment performance due to pretraining overlap and weak attribution requirements. We present DoRA (Domain-oriented RAG Assessment), a domain-grounded benchmark built from defense documents that pairs synthetic, intent-conditioned QA (question answering) with auditable evidence passages for attribution. DoRA covers five question types (find, explain, summarize, generate, provide) and contains 6.5K curated instances. In end-to-end evaluation with a fixed dense retriever, general-purpose Language Models (LMs) perform similarly, while a model trained on DoRA (DoRA SFT) yields large gains over the base model (Llama3.1-8B-Instruct): up to 26% improvement in QA task success, while reducing the hallucination rate by 47% in RAG faithfulness scores, supporting contamination-aware regression testing under domain shift.
Chinese Translation
开放领域的RAG基准在公共语料库上的评估可能会由于预训练重叠和较弱的归因要求而高估部署性能。我们提出了DoRA(面向领域的RAG评估),这是一个基于国防文件构建的领域基础基准,它将合成的、意图条件的问答(QA)与可审计的证据段落配对以进行归因。DoRA涵盖五种问题类型(查找、解释、总结、生成、提供),并包含6.5K个精心策划的实例。在与固定密集检索器的端到端评估中,通用语言模型(LMs)的表现相似,而在DoRA上训练的模型(DoRA SFT)在基础模型(Llama3.1-8B-Instruct)上取得了显著提升:问答任务成功率提高了多达26%,同时在RAG可信度评分中将幻觉率降低了47%,支持在领域转移下的污染意识回归测试。
cs.CL / 143 / 2604.17944
ReCoQA: A Benchmark for Tool-Augmented and Multi-Step Reasoning in Real Estate Question and Answering
ReCoQA:一个用于工具增强和多步骤推理的房地产问答基准
Abstract
Developing agents capable of navigating fragmented, multi-source information remains challenging, primarily due to the scarcity of benchmarks reflecting hybrid workflows combining database querying with external APIs. To bridge this gap, we introduce ReCoQA, a large-scale benchmark of 29,270 real-estate instances featuring machine-verifiable supervision for intermediate steps, including structured intent labels, SQL queries, and API calls. Complementarily, we propose HIRE-Agent, a hierarchical framework instantiating an understand-plan-execute architecture as a strong baseline. By orchestrating a Front-end parser, a planning Supervisor, and execution Specialists, HIRE-Agent effectively integrates heterogeneous evidence. Extensive experiments demonstrate that HIRE-Agent constitutes a strong baseline and substantiates the necessity of hierarchical collaboration for complex, real-world reasoning tasks.
Chinese Translation
开发能够在碎片化的多源信息中导航的智能体仍然具有挑战性,主要是由于缺乏反映结合数据库查询与外部API的混合工作流程的基准。为了解决这一问题,我们引入了ReCoQA,这是一个包含29,270个房地产实例的大规模基准,具有可机器验证的监督机制,涵盖中间步骤,包括结构化意图标签、SQL查询和API调用。此外,我们提出了HIRE-Agent,一个分层框架,实例化了理解-规划-执行架构,作为一个强有力的基线。通过协调前端解析器、规划监督者和执行专家,HIRE-Agent有效整合了异构证据。大量实验表明,HIRE-Agent构成了一个强有力的基线,并证明了在复杂的现实世界推理任务中分层协作的必要性。
cs.CL / 144 / 2604.17957
Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
过程奖励模型与规划相结合:生成精确且可扩展的逐步奖励数据集
Abstract
Process Reward Models (PRMs) have emerged as a powerful tool for providing step-level feedback when evaluating the reasoning of Large Language Models (LLMs), which frequently produce chains of thought (CoTs) containing errors even when the final answer is correct. However, existing PRM datasets remain expensive to construct, prone to annotation errors, and predominantly limited to the mathematical domain. This work introduces a novel and scalable approach to PRM dataset generation based on planning logical problems expressed in the Planning Domain Definition Language (PDDL). Using this method, we generate a corpus of approximately one million reasoning steps across various PDDL domains and use it to train PRMs. Experimental results show that augmenting widely-used PRM training datasets with PDDL-derived data yields substantial improvements in both mathematical and non-mathematical reasoning, as demonstrated across multiple benchmarks. These findings indicate that planning problems constitute a scalable and effective resource for generating robust, precise, and fine-grained training data for PRMs, going beyond the classical mathematical sources that dominate this field.
Chinese Translation
过程奖励模型(Process Reward Models, PRMs)已成为评估大型语言模型(Large Language Models, LLMs)推理能力时提供逐步反馈的强大工具。这些模型在生成思维链(Chains of Thought, CoTs)时常常包含错误,即使最终答案是正确的。然而,现有的PRM数据集构建成本高、易出现标注错误,并且主要限于数学领域。本研究提出了一种基于规划逻辑问题的新颖且可扩展的PRM数据集生成方法,该方法使用规划领域定义语言(Planning Domain Definition Language, PDDL)进行表达。通过这种方法,我们在多个PDDL领域生成了大约一百万个推理步骤的语料库,并利用该语料库训练PRMs。实验结果表明,将广泛使用的PRM训练数据集与PDDL衍生数据相结合,能够在多个基准测试中显著提高数学和非数学推理的表现。这些发现表明,规划问题构成了生成稳健、精确且细致的PRM训练数据的可扩展且有效的资源,超越了主导该领域的经典数学来源。
cs.CL / 145 / 2604.17972
Modeling Multiple Support Strategies within a Single Turn for Emotional Support Conversations
在情感支持对话中建模单次发言内的多种支持策略
Abstract
Emotional Support Conversation (ESC) aims to assist individuals experiencing distress by generating empathetic and supportive dialogue. While prior work typically assumes that each supporter turn corresponds to a single strategy, real-world supportive communication often involves multiple strategies within a single utterance. In this paper, we revisit the ESC task by formulating it as multi-strategy utterance generation, where each utterance may contain one or more strategy-response pairs. We propose two generation methods: All-in-One, which predicts all strategy-response pairs in a single decoding step, and One-by-One, which iteratively generates strategy-response pairs until completion. Both methods are further enhanced with cognitive reasoning guided by reinforcement learning to improve strategy selection and response composition. We evaluate our models on the ESConv dataset under both utterance-level and dialogue-level settings. Experimental results show that our methods effectively model multi-strategy utterances and lead to improved supportive quality and dialogue success. To our knowledge, this work provides the first systematic empirical evidence that allowing multiple support strategies within a single utterance is both feasible and beneficial for emotional support conversations. All code and data will be publicly available at https://github.com/aliyun/qwen-dianjin.
Chinese Translation
情感支持对话(ESC)旨在通过生成富有同情心和支持性的对话来帮助经历困扰的个体。尽管以往的研究通常假设每个支持者的发言对应于单一策略,但现实中的支持性交流往往涉及在单次发言中使用多种策略。本文通过将ESC任务重新定义为多策略发言生成,探讨了这一问题,其中每次发言可能包含一个或多个策略-响应对。我们提出了两种生成方法:All-in-One,该方法在单次解码步骤中预测所有策略-响应对;以及One-by-One,该方法迭代生成策略-响应对直至完成。两种方法均通过强化学习指导的认知推理进一步增强,以改善策略选择和响应构成。我们在ESConv数据集上对模型进行了评估,涵盖了发言级和对话级的设置。实验结果表明,我们的方法有效地建模了多策略发言,并提高了支持质量和对话成功率。据我们所知,这项工作提供了首个系统的实证证据,表明在单次发言中允许多种支持策略既可行又对情感支持对话有益。所有代码和数据将公开发布于 https://github.com/aliyun/qwen-dianjin。
cs.CL / 146 / 2604.17976
ltzGLUE: Luxembourgish General Language Understanding Evaluation
ltzGLUE:卢森堡语通用语言理解评估
Abstract
This paper presents ltzGLUE, the first Natural Language Understanding (NLU) benchmark for Luxembourgish (LTZ) based on the popular GLUE benchmark for English. Although NLU tasks are available for many European languages nowadays, LTZ is one of the official national languages that is often overlooked. We construct new tasks and reuse existing ones to introduce the first official NLU benchmark and accompanying evaluation of encoder models for the language. Our tasks include common natural language processing tasks in binary and multi-class classification settings, including named entity recognition, topic classification, and intent classification. We evaluate various pre-trained language models for LTZ to present an overview of the current capabilities of these models on the LTZ language.
Chinese Translation
本文介绍了ltzGLUE,这是第一个基于流行的英语GLUE基准的卢森堡语(LTZ)自然语言理解(NLU)基准。尽管如今许多欧洲语言都有NLU任务,但LTZ作为官方国家语言之一,常常被忽视。我们构建了新的任务并重用现有任务,以引入第一个官方NLU基准及其对该语言编码器模型的评估。我们的任务包括在二分类和多分类设置下的常见自然语言处理任务,包括命名实体识别、主题分类和意图分类。我们评估了多种预训练语言模型在LTZ上的表现,以呈现这些模型在卢森堡语上的当前能力概述。
cs.CL / 147 / 2604.17988
Employing General-Purpose and Biomedical Large Language Models with Advanced Prompt Engineering for Pharmacoepidemiologic Study Design
利用通用和生物医学大型语言模型及先进的提示工程进行药物流行病学研究设计
Abstract
Background: The potential of large language models (LLMs) to automate and support pharmacoepidemiologic study design is an emerging area of interest, yet their reliability remains insufficiently characterized. General-purpose LLMs often display inaccuracies, while the comparative performance of specialized biomedical LLMs in this domain remains unknown. Methods: This study evaluated general-purpose LLMs (GPT-4o and DeepSeek-R1) versus biomedically fine-tuned LLMs (QuantFactory/Bio-Medical-Llama-3-8B-GGUF and Irathernotsay/qwen2-1.5B-medical_qa-Finetune) using 46 protocols (2018-2024) from the HMA-EMA Catalogue and Sentinel System. Performance was assessed across relevance, logic of justification, and ontology-code agreement across multiple coding systems using Least-to-Most (LTM) and Active Prompting strategies. Results: GPT-4o and DeepSeek-R1 paired with LTM prompting achieved the highest relevance and logic of justification scores, with GPT-4o-LTM reaching a median relevance score of 4 in 8 of 9 questions for HMA-EMA protocols. Biomedical LLMs showed lower relevance overall and frequently generated insufficient justification. All LLMs demonstrated limited proficiency in ontology-code mapping, although LTM provided the most consistent improvements in reasoning stability. Conclusion: Off-the-shelf general-purpose LLMs currently offer superior support for pharmacoepidemiologic design compared to biomedical LLMs. Prompt strategy strongly influenced LLM performance.
Chinese Translation
背景:大型语言模型(LLMs)在自动化和支持药物流行病学研究设计方面的潜力是一个新兴的研究领域,但其可靠性仍未得到充分表征。通用LLMs往往表现出不准确性,而专门的生物医学LLMs在这一领域的比较性能尚不清楚。方法:本研究评估了通用LLMs(GPT-4o和DeepSeek-R1)与生物医学微调LLMs(QuantFactory/Bio-Medical-Llama-3-8B-GGUF和Irathernotsay/qwen2-1.5B-medical_qa-Finetune)在使用HMA-EMA目录和哨兵系统的46个研究方案(2018-2024)中的表现。通过使用最小到最大(LTM)和主动提示策略,评估了在相关性、逻辑证明和本体代码一致性等多个编码系统中的表现。结果:GPT-4o和DeepSeek-R1结合LTM提示达到了最高的相关性和逻辑证明分数,其中GPT-4o-LTM在HMA-EMA方案的9个问题中有8个问题的中位相关性分数达到4。生物医学LLMs整体相关性较低,且经常生成不足的证明。所有LLMs在本体代码映射方面表现出有限的能力,尽管LTM在推理稳定性方面提供了最一致的改善。结论:目前,现成的通用LLMs在药物流行病学设计方面提供的支持优于生物医学LLMs。提示策略对LLM的表现有显著影响。
cs.CL / 148 / 2604.18031
How Creative Are Large Language Models in Generating Molecules?
大型语言模型在分子生成中的创造力如何?
Abstract
Molecule generation requires satisfying multiple chemical and biological constraints while searching a large and structured chemical space. This makes it a non-binary problem, where effective models must identify non-obvious solutions under constraints while maintaining exploration to improve success by escaping local optima. From this perspective, creativity is a functional requirement in molecular generation rather than an aesthetic notion. Large language models (LLMs) can generate molecular representations directly from natural language prompts, but it remains unclear what type of creativity they exhibit in this setting and how it should be evaluated. In this work, we study the creative behavior of LLMs in molecular generation through a systematic empirical evaluation across physicochemical, ADMET, and biological activity tasks. We characterize creativity along two complementary dimensions, convergent creativity and divergent creativity, and analyze how different factors shape these behaviors. Our results indicate that LLMs exhibit distinct patterns of creative behavior in molecule generation, such as an increase in constraint satisfaction when additional constraints are imposed. Overall, our work is the first to reframe the abilities required for molecule generation as creativity, providing a systematic understanding of creativity in LLM-based molecular generation and clarifying the appropriate use of LLMs in molecular discovery pipelines.
Chinese Translation
分子生成需要在广泛且结构化的化学空间中满足多个化学和生物学约束。这使得该问题成为一个非二元问题,有效的模型必须在约束下识别非明显的解决方案,同时保持探索以通过逃离局部最优解来提高成功率。从这个角度来看,创造力是分子生成中的一种功能性要求,而非一种美学概念。大型语言模型(LLMs)可以直接从自然语言提示生成分子表示,但在这种情况下它们表现出何种类型的创造力以及如何评估仍不清楚。在本研究中,我们通过对物理化学、ADMET和生物活性任务进行系统的实证评估,研究了LLMs在分子生成中的创造性行为。我们沿着两个互补维度对创造力进行特征化,即聚合创造力和发散创造力,并分析不同因素如何塑造这些行为。我们的结果表明,LLMs在分子生成中表现出不同的创造性行为模式,例如在施加额外约束时约束满足度的提高。总体而言,我们的工作首次将分子生成所需的能力重新框定为创造力,为基于LLM的分子生成中的创造力提供了系统的理解,并澄清了LLMs在分子发现流程中的适当使用。
cs.CL / 149 / 2604.18034
SignDPO: Multi-level Direct Preference Optimisation for Skeleton-based Gloss-free Sign Language Translation
SignDPO:基于骨架的无注释手语翻译的多层次直接偏好优化
Abstract
We present SignDPO, a novel multi-level Direct Preference Optimisation (DPO) framework designed to enhance the alignment of skeleton-based Sign Language Translation. While current skeleton-based models have made significant progress using Maximum Likelihood Estimation, they are primarily constrained by an imitation-based paradigm that lacks discriminative sensitivity to the fine-grained spatio-temporal nuances of sign language, often leading to semantic drift. To address this, SignDPO shifts the optimisation goal from simple sequence mimicry to structured preference alignment across spatial, temporal, and linguistic dimensions. Our framework involves three key designs. First, we introduce a hierarchical perturbation strategy to construct spatial and temporal non-preferred samples at both global and local granularities automatically. Second, we propose a self-guiding mechanism that leverages decoder cross-attention scores to identify and perturb semantically salient skeletal regions, forcing the model to distinguish genuine sign signals from structural distortions. Third, we establish an automated language-level preference generator by fine-tuning a dedicated perturbation model, capturing complex output-level failure modes without manual annotation. Extensive experiments on three widely adopted benchmarks, CSL-Daily, How2Sign, and OpenASL, demonstrate that SignDPO consistently outperforms state-of-the-art gloss-free methods and even rivals established gloss-based ones. Our results suggest that multi-level preference alignment is a powerful paradigm for bridging the gap between high-entropy skeletal trajectories and discrete linguistic semantics.
Chinese Translation
我们提出了SignDPO,这是一种新颖的多层次直接偏好优化(DPO)框架,旨在增强基于骨架的手语翻译的对齐效果。尽管当前的基于骨架的模型在使用最大似然估计方面取得了显著进展,但它们主要受到模仿基础范式的限制,缺乏对手语细粒度时空细微差别的辨别敏感性,常常导致语义漂移。为了解决这个问题,SignDPO将优化目标从简单的序列模仿转变为在空间、时间和语言维度上的结构化偏好对齐。我们的框架涉及三个关键设计。首先,我们引入了一种层次扰动策略,自动构建全局和局部粒度的空间和时间非偏好样本。其次,我们提出了一种自我引导机制,利用解码器交叉注意力得分来识别和扰动语义显著的骨架区域,迫使模型区分真实的手势信号与结构扭曲。第三,我们通过微调专用的扰动模型建立了一个自动化的语言级偏好生成器,捕捉复杂的输出级失败模式,而无需手动标注。在三个广泛采用的基准测试(CSL-Daily、How2Sign和OpenASL)上的大量实验表明,SignDPO始终优于最先进的无注释方法,甚至与已建立的有注释方法相媲美。我们的结果表明,多层次偏好对齐是弥合高熵骨架轨迹与离散语言语义之间差距的强大范式。
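As a rough illustration of the objective that SignDPO extends, the generic pairwise DPO loss can be sketched as follows. This is the standard single-pair formulation only, not the authors' multi-level framework; the function name, the `beta` value, and the example log-probabilities are assumptions made for illustration.

```python
import math

def dpo_loss(logp_pref, logp_nonpref, ref_logp_pref, ref_logp_nonpref, beta=0.1):
    """Generic DPO loss for one (preferred, non-preferred) pair.

    Inputs are sequence log-probabilities under the trained policy and
    under a frozen reference model. beta is a hypothetical temperature.
    """
    margin = beta * ((logp_pref - ref_logp_pref) - (logp_nonpref - ref_logp_nonpref))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical log-probabilities: the policy already favours the preferred output...
loss_aligned = dpo_loss(-5.0, -10.0, -8.0, -8.0)
# ...versus a policy that favours the perturbed (non-preferred) output.
loss_misaligned = dpo_loss(-10.0, -5.0, -8.0, -8.0)
```

The loss falls as the policy raises the preferred sample's likelihood relative to the reference model, which is the signal the perturbed non-preferred samples are meant to supply.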
cs.CL / 150 / 2604.18041
JudgeMeNot: Personalizing Large Language Models to Emulate Judicial Reasoning in Hebrew
JudgeMeNot:个性化大型语言模型以模拟希伯来语司法推理
Abstract
Despite significant advances in large language models, personalizing them for individual decision-makers remains an open problem. Here, we introduce a synthetic-organic supervision pipeline that transforms raw judicial decisions into instruction-tuning data, enabling parameter-efficient fine-tuning of personalized models for individual judges in low-resource settings. We compare our approach to state-of-the-art personalization techniques across three different tasks and settings. The results show that Causal Language Modeling followed by synthetically generated instruction-tuning significantly outperforms all other baselines, providing significant improvements across lexical, stylistic, and semantic similarity. Notably, our model-generated outputs are indistinguishable from the reasoning of human judges, highlighting the viability of efficient personalization, even in low-resource settings.
Chinese Translation
尽管大型语言模型取得了显著进展,但为个别决策者个性化这些模型仍然是一个未解决的问题。在此,我们介绍了一种合成-有机监督管道,该管道将原始司法判决转化为指令微调数据,从而实现对低资源环境中个别法官的个性化模型的参数高效微调。我们将我们的方法与三种不同任务和设置下的最先进个性化技术进行了比较。结果表明,因果语言建模(Causal Language Modeling)后接合成生成的指令微调显著优于所有其他基线,在词汇、风格和语义相似性方面提供了显著改善。值得注意的是,我们模型生成的输出与人类法官的推理无异,突显了即使在低资源环境中高效个性化的可行性。
cs.CL / 151 / 2604.18069
Modeling Human Perspectives with Socio-Demographic Representations
通过社会人口特征建模人类视角
Abstract
Humans often hold different perspectives on the same issues. In many NLP tasks, annotation disagreement can reflect valid subjective perspectives. Modeling annotator perspectives and understanding their relationship with other human factors, such as socio-demographic attributes, have received increasing attention. Prior work typically focuses on single demographic factors or limited combinations. However, in real-world settings, annotator perspectives are shaped by complex social contexts, and finer-grained socio-demographic attributes can better explain human perspectives. In this work, we propose Socio-Contrastive Learning, a method that jointly models annotator perspectives while learning socio-demographic representations. Our method provides an effective approach for the fusion of socio-demographic features and textual representations to predict annotator perspectives, outperforming standard concatenation-based methods. The learned representations further enable analysis and visualization of how demographic factors relate to variation in annotator perspectives. Our code is available at GitHub: https://github.com/Leixin-Zhang/Socio_Contrastive_Learning
Chinese Translation
人类在同一问题上往往持有不同的观点。在许多自然语言处理(NLP)任务中,标注者之间的意见分歧可以反映有效的主观视角。建模标注者视角并理解其与其他人类因素(如社会人口属性)的关系,受到了越来越多的关注。以往的研究通常集中于单一的人口因素或有限的组合。然而,在现实环境中,标注者的视角受到复杂社会背景的影响,更细粒度的社会人口属性能够更好地解释人类视角。在本研究中,我们提出了社会对比学习(Socio-Contrastive Learning),这是一种在学习社会人口特征的同时共同建模标注者视角的方法。我们的方法为社会人口特征与文本表示的融合提供了一种有效的途径,以预测标注者视角,且优于基于标准拼接的方法。所学习的表示进一步使得分析和可视化人口因素与标注者视角变化之间的关系成为可能。我们的代码可在GitHub上获得:https://github.com/Leixin-Zhang/Socio_Contrastive_Learning
cs.CL / 152 / 2604.18087
Mix and Match: Context Pairing for Scalable Topic-Controlled Educational Summarisation
混合与匹配:可扩展主题控制教育摘要的上下文配对
Abstract
Topic-controlled summarisation enables users to generate summaries focused on specific aspects of source documents. This paper investigates a data augmentation strategy for training small language models (sLMs) to perform topic-controlled summarisation. We propose a pairwise data augmentation method that combines contexts from different documents to create contrastive training examples, enabling models to learn the relationship between topics and summaries more effectively. Using the SciTLDR dataset enriched with Wikipedia-derived topics, we systematically evaluate how augmentation scale affects model performance. Results show consistent improvements in win rate and semantic alignment as the augmentation scale increases, while the amount of real training data remains fixed. Consequently, a T5-base model trained with our augmentation approach achieves competitive performance relative to larger models, despite using significantly fewer parameters and substantially fewer real training examples.
Chinese Translation
主题控制摘要使用户能够生成集中于源文档特定方面的摘要。本文研究了一种数据增强策略,用于训练小型语言模型(sLMs)以执行主题控制摘要。我们提出了一种成对数据增强方法,通过结合来自不同文档的上下文来创建对比训练示例,使模型能够更有效地学习主题与摘要之间的关系。利用丰富了维基百科衍生主题的SciTLDR数据集,我们系统评估了增强规模如何影响模型性能。结果表明,随着增强规模的增加,胜率和语义对齐的一致性提升,而真实训练数据的数量保持不变。因此,使用我们的增强方法训练的T5-base模型在相对于更大模型的竞争性能上表现出色,尽管使用的参数显著更少,真实训练示例也大幅减少。
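One way to picture the context-pairing idea is the sketch below. It is only an assumed reading of the abstract: the field names, the prompt format, and the rule of skipping same-topic pairs are hypothetical and are not taken from the paper.

```python
from itertools import combinations

def pair_contexts(docs):
    """Build contrastive examples by mixing the contexts of two documents.

    docs: list of dicts with hypothetical keys 'topic', 'context', 'summary'.
    Each pair yields two examples that share the mixed input text but ask
    for different topics, so the topic alone decides the target summary.
    """
    examples = []
    for a, b in combinations(docs, 2):
        if a["topic"] == b["topic"]:
            continue  # identical topics provide no contrast
        mixed = a["context"] + "\n" + b["context"]
        for target in (a, b):
            examples.append({
                "input": f"topic: {target['topic']}\n{mixed}",
                "target": target["summary"],
            })
    return examples

# Toy documents, invented for illustration.
docs = [
    {"topic": "optimization", "context": "We study SGD.", "summary": "Analyses SGD."},
    {"topic": "vision", "context": "We train a CNN.", "summary": "Trains a CNN."},
    {"topic": "speech", "context": "We build an ASR model.", "summary": "Builds ASR."},
]
examples = pair_contexts(docs)
```

Because the number of pairs grows quadratically with the number of real documents, the augmented set can be scaled up while the real training data stays fixed.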
cs.CL / 153 / 2604.18091
Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts
文化意识幽默字幕生成:跨文化背景的多模态幽默生成
Abstract
Recent multimodal large language models have shown promising ability in generating humorous captions for images, yet they still lack stable control over explicit cultural context, making it difficult to jointly maintain image relevance, contextual appropriateness, and humor quality under a specified cultural background. To address this limitation, we introduce a new multimodal generation task, culture-aware humorous captioning, which requires a model to generate a humorous caption conditioned on both an input image and a target cultural context. Captions generated under different cultural contexts are not expected to share the same surface form, but should remain grounded in similar visual situations or humorous rationales. To support this task, we establish a six-dimensional evaluation framework covering image relevance, contextual fit, semantic richness, reasonableness, humor, and creativity. We further propose a staged alignment framework that first initializes the model with high-resource supervision under the Western cultural context, then performs multi-dimensional preference alignment via judge-based GRPO with a Degradation-aware Prototype Repulsion Constraint to mitigate reward hacking in open-ended generation, and finally adapts the model to the Eastern cultural context with a small amount of supervision. Experimental results show that our method achieves stronger overall performance under the proposed evaluation framework, with particularly large gains in contextual fit and a better balance between image relevance and humor under cultural constraints.
Chinese Translation
近期的多模态大型语言模型在为图像生成幽默字幕方面展现出良好的能力,但它们在明确文化背景的稳定控制上仍然存在不足,这使得在特定文化背景下同时保持图像相关性、上下文适宜性和幽默质量变得困难。为了解决这一局限,我们提出了一项新的多模态生成任务——文化意识幽默字幕生成,该任务要求模型在输入图像和目标文化背景的条件下生成幽默字幕。在不同文化背景下生成的字幕不应共享相同的表面形式,但应保持在相似的视觉情境或幽默理由中扎根。为了支持这一任务,我们建立了一个涵盖图像相关性、上下文适配性、语义丰富性、合理性、幽默性和创造力的六维评估框架。我们进一步提出了一个分阶段对齐框架,首先在西方文化背景下用高资源监督初始化模型,然后通过基于评判的GRPO进行多维偏好对齐,并使用降级感知原型排斥约束来减轻开放式生成中的奖励操控,最后用少量监督将模型适应于东方文化背景。实验结果表明,我们的方法在所提出的评估框架下实现了更强的整体性能,尤其在上下文适配性方面有显著提升,并在文化约束下实现了图像相关性与幽默之间的更好平衡。
cs.CL / 154 / 2604.18106
Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion
通过多源动态 logit 融合实现高效的低资源语言适应
Abstract
Adapting large language models (LLMs) to low-resource languages (LRLs) is constrained by the scarcity of task data and computational resources. Although Proxy Tuning offers a logit-level strategy for introducing scaling effects, it often fails in LRL settings because the large model's weak LRL competence might overwhelm the knowledge of specialized smaller models. We thus propose TriMix, a test-time logit fusion framework that dynamically balances capabilities from three different sources: LRL competence from a continually pretrained small model, task competence from high-resource language instruction tuning, and the scaling benefits of large models. It is data- and compute-efficient, requiring no LRL task annotations, and only continual pretraining on a small model. Experiments across four model families and eight LRLs show that TriMix consistently outperforms single-model baselines and Proxy Tuning. Our analysis reveals that prioritizing the small LRL-specialized model's logits is crucial for success, challenging the prevalent large-model-dominant assumption.
Chinese Translation
将大型语言模型(LLMs)适配到低资源语言(LRLs)受到任务数据和计算资源稀缺的限制。尽管代理调优(Proxy Tuning)提供了一种在 logit 层面引入规模效应的策略,但它在低资源语言场景中往往失效,因为大型模型较弱的低资源语言能力可能会压倒专门化小模型的知识。因此,我们提出了 TriMix,一种测试时 logit 融合框架,动态平衡来自三个不同来源的能力:来自持续预训练的小模型的低资源语言能力、来自高资源语言指令调优的任务能力,以及大型模型的规模效益。该方法在数据和计算上都很高效,无需低资源语言的任务标注,仅需对小模型进行持续预训练。针对四个模型家族和八种低资源语言的实验表明,TriMix 始终优于单模型基线和代理调优。我们的分析表明,优先考虑低资源语言专用小模型的 logits 是成功的关键,这挑战了普遍存在的大模型主导假设。
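The logit-level arithmetic behind Proxy Tuning, the baseline that TriMix is compared against, can be sketched as below. The abstract does not give TriMix's own fusion rule, so this shows only the well-known two-small-model offset added to a large model's logits; the `alpha` weight and all numbers are assumptions for illustration.

```python
def fuse_logits(large, small_expert, small_base, alpha=1.0):
    """Proxy-Tuning-style fusion for one next-token logit vector.

    fused = large + alpha * (small_expert - small_base)
    alpha is a hypothetical knob for how strongly the small expert's
    offset is trusted relative to the large model.
    """
    return [l + alpha * (e - b) for l, e, b in zip(large, small_expert, small_base)]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Toy 3-token vocabulary, invented numbers.
large = [2.0, 1.0, 0.0]          # large model prefers token 0
small_expert = [0.0, 3.0, 0.0]   # specialised small model prefers token 1
small_base = [0.0, 0.0, 0.0]     # untuned small model is indifferent

choice_large_only = argmax(fuse_logits(large, small_expert, small_base, alpha=0.0))
choice_fused = argmax(fuse_logits(large, small_expert, small_base, alpha=1.0))
```

With `alpha=0.0` the large model's choice stands; with `alpha=1.0` the small expert's offset flips the decision, which is the kind of trade-off a dynamic fusion scheme has to balance at each decoding step.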
cs.CL / 155 / 2604.18109
FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings
FLiP:理解和解释多模态多语言句子嵌入的探索
Abstract
This paper presents factorized linear projection (FLiP) models for understanding pretrained sentence embedding spaces. We train FLiP models to recover the lexical content from multilingual (LaBSE), multimodal (SONAR) and API-based (Gemini) sentence embedding spaces in several high- and mid-resource languages. We show that FLiP can recall more than 75% of lexical content from the embeddings, significantly outperforming existing non-factorized baselines. Using this as a diagnostic tool, we uncover the modality and language biases across the selected sentence encoders and provide practitioners with intrinsic insights about the encoders without relying on conventional downstream evaluation tasks. Our implementation is publicly available at https://github.com/BUTSpeechFIT/FLiP.
Chinese Translation
本文提出了因子化线性投影(FLiP)模型,用于理解预训练句子嵌入空间。我们训练FLiP模型,以从多语言(LaBSE)、多模态(SONAR)和基于API(Gemini)的句子嵌入空间中恢复词汇内容,涵盖多个高资源和中资源语言。我们展示了FLiP能够从嵌入中回忆起超过75%的词汇内容,显著优于现有的非因子化基线。利用这一诊断工具,我们揭示了所选句子编码器中的模态和语言偏差,并为实践者提供了关于编码器的内在见解,而无需依赖传统的下游评估任务。我们的实现是公开的,链接为:https://github.com/BUTSpeechFIT/FLiP。
cs.CL / 156 / 2604.18112
Retrieval-Augmented Multimodal Model for Fake News Detection
基于检索增强的多模态模型用于假新闻检测
Abstract
In recent years, multimodal multidomain fake news detection has garnered increasing attention. Nevertheless, this direction presents two significant challenges: (1) Failure to Capture Cross-Instance Narrative Consistency: existing models usually evaluate each news item in isolation and fail to capture cross-instance narrative consistency, and thus struggle to address the spread of cluster-based fake news driven by social media; (2) Lack of Domain-Specific Knowledge for Reasoning: conventional models, which rely solely on knowledge encoded in their parameters during training, struggle to generalize to new or data-scarce domains (e.g., emerging events or niche topics). To tackle these challenges, we introduce the Retrieval-Augmented Multimodal Model for Fake News Detection (RAMM). First, RAMM employs a Multimodal Large Language Model (MLLM) as its backbone to capture cross-modal semantic information from news samples. Second, RAMM incorporates an Abstract Narrative Alignment Module. This component adaptively extracts abstract narrative consistency from diverse instances across distinct domains, aggregates relevant knowledge, and thereby enables the modeling of high-level narrative information. Finally, RAMM introduces a Semantic Representation Alignment Module, which aligns the model's decision-making paradigm with that of humans: specifically, it shifts the model's reasoning process from direct inference on multimodal features to an instance-based analogical reasoning process. Extensive experimental results on three public datasets validate the efficacy of our proposed approach. Our code is available at the following link: https://github.com/li-yiheng/RAMM
Chinese Translation
近年来,多模态多领域假新闻检测引起了越来越多的关注。然而,这一方向面临两个重大挑战:(1)未能捕捉跨实例叙事一致性:现有模型通常孤立地评估每条新闻,未能捕捉跨实例的叙事一致性,因此在应对社交媒体驱动的聚类假新闻传播时表现不佳;(2)缺乏领域特定的推理知识:传统模型仅依赖于训练期间编码在其参数中的知识,难以推广到新的或数据稀缺的领域(例如,突发事件或小众话题)。为了解决这些挑战,我们提出了基于检索增强的多模态模型(Retrieval-Augmented Multimodal Model,RAMM)用于假新闻检测。首先,RAMM采用多模态大语言模型(Multimodal Large Language Model,MLLM)作为其骨干,以捕捉来自新闻样本的跨模态语义信息。其次,RAMM结合了抽象叙事对齐模块(Abstract Narrative Alignment Module)。该组件自适应地从不同领域的多样实例中提取抽象叙事一致性,聚合相关知识,从而实现高层次叙事信息的建模。最后,RAMM引入了语义表示对齐模块(Semantic Representation Alignment Module),将模型的决策范式与人类的决策过程对齐——具体而言,它将模型的推理过程从对多模态特征的直接推断转变为基于实例的类比推理过程。在三个公共数据集上的大量实验结果验证了我们提出的方法的有效性。我们的代码可在以下链接获取:https://github.com/li-yiheng/RAMM
cs.CL / 157 / 2604.18122
Decisive: Guiding User Decisions with Optimal Preference Elicitation from Unstructured Documents
Decisive:通过从非结构化文档中优化偏好引导用户决策
Abstract
Decision-making is a cognitively intensive task that requires synthesizing relevant information from multiple unstructured sources, weighing competing factors, and incorporating subjective user preferences. Existing methods, including large language models and traditional decision-support systems, fall short: they often overwhelm users with information or fail to capture nuanced preferences accurately. We present Decisive, an interactive decision-making framework that combines document-grounded reasoning with Bayesian preference inference. Our approach grounds decisions in an objective option-scoring matrix extracted from source documents, while actively learning a user's latent preference vector through targeted elicitation. Users answer pairwise tradeoff questions adaptively selected to maximize information gain over the final decision. This process converges efficiently, minimizing user effort while ensuring recommendations remain transparent and personalized. Through extensive experiments, we demonstrate that our approach significantly outperforms both general-purpose LLMs and existing decision-making frameworks, achieving up to a 20% improvement in decision accuracy over strong baselines across domains.
Chinese Translation
决策是一项认知密集型任务,需要从多个非结构化来源综合相关信息,权衡竞争因素,并融入用户的主观偏好。现有的方法,包括大型语言模型和传统决策支持系统,存在不足:它们往往用信息淹没用户,或未能准确捕捉细微的偏好。我们提出了Decisive,一个结合文档基础推理与贝叶斯偏好推断的互动决策框架。我们的方法将决策基于从源文档中提取的客观选项评分矩阵,同时通过有针对性的引导主动学习用户的潜在偏好向量。用户回答通过适应性选择的成对权衡问题,以最大化最终决策的信息增益。该过程高效收敛,最小化用户努力,同时确保推荐保持透明和个性化。通过大量实验,我们证明了我们的方法显著优于通用大型语言模型和现有的决策框架,在多个领域的决策准确性上比强基线提高了多达20%。
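The elicitation loop described for Decisive can be illustrated with a small sketch. It makes several simplifying assumptions that are not in the abstract: the preference vector comes from a small discrete hypothesis set, answers are noise-free, and information gain is measured over the weight posterior rather than over the final decision. All names and numbers are hypothetical.

```python
import math
from itertools import combinations

def utility(weights, scores):
    return sum(w * s for w, s in zip(weights, scores))

def prefers_first(weights, opt_a, opt_b):
    """Noise-free user model: True if these weights rank opt_a above opt_b."""
    return utility(weights, opt_a) > utility(weights, opt_b)

def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_question(options, hypotheses, posterior):
    """Pick the option pair whose answer is most uncertain under the posterior.

    With deterministic answers, the expected information gain of a yes/no
    question equals the entropy of its predicted answer.
    """
    best, best_gain = None, -1.0
    for i, j in combinations(range(len(options)), 2):
        p_yes = sum(p for h, p in zip(hypotheses, posterior)
                    if prefers_first(h, options[i], options[j]))
        gain = binary_entropy(p_yes)
        if gain > best_gain:
            best, best_gain = (i, j), gain
    return best, best_gain

def update(hypotheses, posterior, options, pair, answer):
    """Bayes update: keep only hypotheses consistent with the answer."""
    i, j = pair
    kept = [p if prefers_first(h, options[i], options[j]) == answer else 0.0
            for h, p in zip(hypotheses, posterior)]
    total = sum(kept)  # assumes at least one hypothesis agrees with the user
    return [p / total for p in kept]

# Option-scoring matrix (rows: options, columns: price, quality), invented.
options = [[0.9, 0.2], [0.5, 0.6], [0.1, 1.0]]
hypotheses = [(0.9, 0.1), (0.6, 0.4), (0.4, 0.6), (0.1, 0.9)]
posterior = [0.25, 0.25, 0.25, 0.25]

pair, gain = best_question(options, hypotheses, posterior)
# Suppose the user says they prefer the second option of the pair.
posterior = update(hypotheses, posterior, options, pair, answer=False)
```

In this toy setup a single answer removes the two price-focused hypotheses, which shows why a well-chosen question can cut the number of questions a user has to answer.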
cs.CL / 158 / 2604.18124
TLoRA: Task-aware Low Rank Adaptation of Large Language Models
TLoRA:任务感知的大型语言模型低秩适应
Abstract
Low-Rank Adaptation (LoRA) has become a widely adopted parameter-efficient fine-tuning method for large language models, with its effectiveness largely influenced by the allocation of ranks and scaling factors, as well as initialization. Existing LoRA variants typically address only one of these factors, often at the cost of increased training complexity or reduced practical efficiency. In this work, we present Task-aware Low-Rank Adaptation (TLoRA), a unified framework that jointly optimizes initialization and resource allocation at the outset of training. TLoRA introduces a data-driven initialization strategy that aligns the LoRA $A$ matrix with task-relevant subspaces by performing singular value decomposition on the product of pre-trained weights and input activation covariance. After this, the $A$ matrix is frozen, and only the $B$ matrix is trained. Furthermore, TLoRA employs a sensitivity-based importance metric to adaptively allocate ranks and scaling factors across layers under a fixed parameter budget. We conduct extensive experiments that demonstrate TLoRA consistently performs excellently across various tasks, including natural language understanding, commonsense reasoning, math reasoning, code generation, and chat generation, while significantly reducing the number of trainable parameters.
Chinese Translation
低秩适应(LoRA)已成为一种广泛采用的参数高效微调方法,尤其适用于大型语言模型,其有效性在很大程度上受到秩和缩放因子的分配以及初始化的影响。现有的LoRA变体通常只关注这些因素中的一个,往往以增加训练复杂性或降低实际效率为代价。在本研究中,我们提出了任务感知低秩适应(TLoRA),这是一个统一框架,能够在训练开始时共同优化初始化和资源分配。TLoRA引入了一种数据驱动的初始化策略,通过对预训练权重和输入激活协方差的乘积进行奇异值分解,将LoRA的$A$矩阵与任务相关的子空间对齐。之后,$A$矩阵被冻结,仅对$B$矩阵进行训练。此外,TLoRA采用基于敏感度的重要性度量,在固定参数预算下自适应地分配各层的秩和缩放因子。我们进行了大量实验,证明TLoRA在自然语言理解、常识推理、数学推理、代码生成和聊天生成等多种任务中始终表现优异,同时显著减少了可训练参数的数量。
cs.CL / 159 / 2604.18128
Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition
深度寄存器解锁 SwiGLU 上的 W4A4:读者/生成器分解
Abstract
We study post-training W4A4 quantization in a controlled 300M-parameter SwiGLU decoder-only language model trained on 5B tokens of FineWeb-Edu, and ask which input-activation sites dominate the error. Naive round-to-nearest W4A4 collapses validation perplexity from FP16 23.6 to 1727. A simple residual-axis training-time intervention -- Depth Registers with a register-magnitude hinge loss (DR+sink) -- reduces this to 119 (about 14x) at matched FP16 PPL and matched zero-shot capacity, and composes with SmoothQuant to 39.9 PPL. The residual ~2 PPL gap to FP16 is the diagnostic core. We decompose W4A4 damage by input-activation site: the five trainable linears in a SwiGLU block split into residual-axis readers (qkv, w1, w3) and block-internal generators (o_proj, w2). Elementary norm arguments show residual-axis magnitude control bounds readers tightly but leaves w2's bilinear input bounded only by the trivial product of factor bounds; empirically, DR+sink collapses reader kurtosis while leaving generators essentially unchanged, and the reader-rescued W4A4 residue is flat at ~0.28 nats across three matched checkpoints with Delta-remove(w2) dominating. We present DR+sink as a training-time probe rather than a deployment proposal: a post-hoc alternative (Per-Linear QuaRot) nearly matches it on the reader axis. Full QuaRot -- adding online per-head value Hadamard plus online w2-input rotation -- does not close the gap either, directly testing the prediction that orthogonal rotation cannot bound the bilinear SwiGLU tail. Claims are specific to our 300M, 5B-token, single-seed setting, and our experiments do not isolate the partition from the hinge.
Chinese Translation
我们研究了在一个受控的 300M 参数 SwiGLU 仅解码器语言模型上进行的训练后 W4A4 量化,该模型在 5B 个 FineWeb-Edu token 上训练,并探讨哪些输入激活位置主导了误差。朴素的就近舍入(round-to-nearest)W4A4 使验证困惑度从 FP16 的 23.6 恶化到 1727。一种简单的残差轴训练时干预,即带有寄存器幅度铰链损失的深度寄存器(DR+sink),在 FP16 PPL 和零样本能力相匹配的条件下将其降至 119(约 14 倍),与 SmoothQuant 组合后进一步降至 39.9 PPL。与 FP16 之间剩余的约 2 PPL 差距是诊断的核心。我们按输入激活位置分解 W4A4 造成的损害:SwiGLU 块中的五个可训练线性层分为残差轴读者(qkv, w1, w3)和块内生成器(o_proj, w2)。基本的范数论证表明,残差轴幅度控制能够紧密地约束读者,但 w2 的双线性输入只能由各因子界的平凡乘积来约束;经验上,DR+sink 大幅降低了读者的峰度,而生成器基本不变,并且在读者得到修复后,W4A4 的剩余损失在三个匹配的检查点上稳定在约 0.28 nats,其中 Delta-remove(w2) 占主导。我们将 DR+sink 视为一种训练时探测工具,而非部署方案:一种事后替代方法(Per-Linear QuaRot)在读者轴上几乎与其持平。完整的 QuaRot(额外加入在线逐头 value Hadamard 变换以及在线 w2 输入旋转)同样未能弥合差距,这直接检验了“正交旋转无法约束双线性 SwiGLU 尾部”的预测。上述结论仅适用于我们的 300M 参数、5B token、单随机种子设置,且我们的实验并未将划分(partition)的作用与铰链损失的作用区分开。
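To make the role of activation magnitude concrete, the sketch below applies a naive symmetric round-to-nearest 4-bit quantizer to a toy vector with and without one outlier. It is an illustrative per-tensor quantizer written for this note, not the paper's W4A4 setup; the values are invented.

```python
def rtn_int4(values):
    """Symmetric per-tensor round-to-nearest onto signed 4-bit levels [-8, 7].

    Returns the dequantized values. Python's round() uses banker's rounding
    at exact ties, which does not matter for this toy data.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    levels = [max(-8, min(7, round(v / scale))) for v in values]
    return [q * scale for q in levels]

def mean_abs_error(values, restored):
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)

small = [0.10, -0.20, 0.30, 0.05]
with_outlier = small + [8.0]  # one large-magnitude activation

err_small = mean_abs_error(small, rtn_int4(small))
# Error on the same four entries once the outlier sets the scale.
restored_with_outlier = rtn_int4(with_outlier)[:4]
err_outlier = mean_abs_error(small, restored_with_outlier)
```

Once the outlier fixes the scale, every small entry rounds to zero, which gives an intuition for why controlling activation magnitude matters under 4-bit activations.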
cs.CL / 160 / 2604.18159
FreezeEmpath: Efficient Training for Empathetic Spoken Chatbots with Frozen LLMs
FreezeEmpath:基于冻结大型语言模型的高效同理心口语聊天机器人训练
Abstract
Empathy is essential for fostering natural interactions in spoken dialogue systems, as it enables machines to recognize the emotional tone of human speech and deliver empathetic responses. Recent research has made significant progress in developing empathetic spoken chatbots based on large language models (LLMs). However, several challenges still exist when training such models, including reliance on costly empathetic speech instruction data and a lack of emotional expressiveness in the generated speech. Finetuning LLM with cross-modal empathetic instruction data may also lead to catastrophic forgetting and a degradation of its general capability. To address these challenges, we propose FreezeEmpath, an end-to-end empathetic spoken chatbot trained in a simple and efficient manner. The entire training process relies solely on existing speech instruction data and speech emotion recognition (SER) data, while keeping the LLM's parameters frozen. Experiments demonstrate that FreezeEmpath is able to generate emotionally expressive speech and outperforms other empathetic models in empathetic dialogue, SER, and SpokenQA tasks, demonstrating the effectiveness of our training strategy.
Chinese Translation
同理心对于促进口语对话系统中的自然互动至关重要,因为它使机器能够识别人类语言的情感基调并提供同理心响应。近期的研究在基于大型语言模型(LLMs)开发同理心口语聊天机器人方面取得了显著进展。然而,在训练此类模型时仍然存在一些挑战,包括对昂贵的同理心语音指导数据的依赖以及生成语音缺乏情感表现力。使用跨模态同理心指导数据对LLM进行微调也可能导致灾难性遗忘和其一般能力的下降。为了解决这些挑战,我们提出了FreezeEmpath,这是一种以简单高效的方式训练的端到端同理心口语聊天机器人。整个训练过程仅依赖于现有的语音指导数据和语音情感识别(SER)数据,同时保持LLM的参数不变。实验表明,FreezeEmpath能够生成情感丰富的语音,并在同理心对话、SER和SpokenQA任务中优于其他同理心模型,证明了我们训练策略的有效性。
cs.CL / 161 / 2604.18164
MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
MM-JudgeBias:评估 MLLM 作为评判者中的组合偏差的基准
Abstract
Multimodal Large Language Models (MLLMs) have been increasingly used as automatic evaluators, a paradigm known as MLLM-as-a-Judge. However, their reliability and vulnerabilities to biases remain underexplored. We find that many MLLM judges fail to reliably integrate key visual or textual cues, yielding unreliable evaluations when evidence is missing or mismatched, and exhibiting instability under semantically irrelevant perturbations. To address this, we systematically define Compositional Bias in MLLM-as-a-Judge systems and introduce MM-JudgeBias, a benchmark for evaluating it. MM-JudgeBias introduces controlled perturbations across Query, Image, and Response, and evaluates model behavior via two complementary metrics: Bias-Deviation (BD) for sensitivity and Bias-Conformity (BC) for stability. Our dataset of over 1,800 curated and refined multimodal samples, drawn from 29 source benchmarks, enables a fine-grained diagnosis of nine bias types across diverse tasks and domains. Experiments on 26 state-of-the-art MLLMs reveal systematic modality neglect and asymmetric evaluation tendencies, underscoring the need for more reliable judges.
Chinese Translation
多模态大型语言模型(MLLMs)越来越多地被用作自动评估者,这一范式被称为 MLLM 作为评判者。然而,它们的可靠性以及对偏差的脆弱性仍然未得到充分探讨。我们发现,许多 MLLM 评判者在整合关键的视觉或文本线索时表现不可靠,当证据缺失或不匹配时,评估结果不可靠,并且在语义无关的扰动下表现出不稳定性。为了解决这一问题,我们系统性地定义了 MLLM 作为评判者系统中的组合偏差,并引入了 MM-JudgeBias,一个用于评估该偏差的基准。MM-JudgeBias 在查询、图像和响应之间引入了受控扰动,并通过两个互补指标进行模型行为评估:偏差偏离度(Bias-Deviation, BD)用于敏感性,偏差一致性(Bias-Conformity, BC)用于稳定性。我们的数据集包含来自 29 个源基准的超过 1,800 个经过策划和精炼的多模态样本,使得我们能够对九种偏差类型在不同任务和领域中进行细致的诊断。在对 26 个最先进的 MLLM 进行的实验中,揭示了系统性的模态忽视和不对称评估倾向,强调了对更可靠评判者的需求。
cs.CL / 162 / 2604.18169
Beyond Reproduction: A Paired-Task Framework for Assessing LLM Comprehension and Creativity in Literary Translation
超越再现:评估大型语言模型在文学翻译中的理解力与创造力的配对任务框架
Abstract
Large language models (LLMs) are increasingly used for creative tasks such as literary translation. Yet translational creativity remains underexplored and is rarely evaluated at scale, while source-text comprehension is typically studied in isolation, despite the fact that, in professional translation, comprehension and creativity are tightly intertwined. We address these gaps with a paired-task framework applied to literary excerpts from 11 books. Task 1 assesses source-text comprehension, and Task 2 evaluates translational creativity through Units of Creative Potential (UCPs), such as metaphors and wordplay. Using a scalable evaluation setup that combines expert human annotations with UCP-based automatic scoring, we benchmark 23 models and four creativity-oriented prompts. Our findings show that strong comprehension does not translate into human-level creativity: models often produce literal or contextually inappropriate renderings, with particularly large gaps for the more distant English-Chinese language pair. Creativity-oriented prompts yield only modest gains, and only one model, Mistral-Large, comes close to human-level creativity (0.167 vs. 0.246). Across all model-prompt combinations, only three exceed a creativity score of 0.1, while the rest remain at or near zero.
Chinese Translation
大型语言模型(LLMs)越来越多地被用于创意任务,如文学翻译。然而,翻译创造力仍然未得到充分探讨,且很少进行大规模评估,而源文本理解通常是孤立研究的,尽管在专业翻译中,理解力与创造力是紧密相连的。我们通过应用于11本书文学摘录的配对任务框架来填补这些空白。任务1评估源文本理解,任务2通过创造潜力单元(Units of Creative Potential, UCPs)评估翻译创造力,例如隐喻和文字游戏。我们使用结合专家人工注释与基于UCP的自动评分的可扩展评估设置,对23个模型和四个以创造力为导向的提示进行基准测试。我们的研究结果表明,强理解力并不等同于人类水平的创造力:模型往往产生字面或上下文不当的翻译,尤其在较远的英汉语言对中差距更大。以创造力为导向的提示仅带来适度的提升,只有一个模型Mistral-Large接近人类水平的创造力(0.167对比0.246)。在所有模型-提示组合中,只有三个超过了0.1的创造力评分,其余则保持在零附近。
cs.CL / 163 / 2604.18170
Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing
复制即解码:基于语法约束的LLM编辑并行预填充
Abstract
LLMs edit text and code by autoregressively regenerating the full output, even when most tokens appear verbatim in the input. We study Copy-as-Decode, a decoding-layer mechanism that recasts edit generation as structured decoding over a two-primitive grammar: references an input line range, ... emits new content. A token-level FSM guarantees syntactic validity, and a serving-layer primitive updates the KV cache for each copy span via a single parallel-prefill forward rather than $N$ autoregressive steps -- sharing the parallel-forward kernel of speculative decoding but with input tokens as the draft and program-enforced acceptance replacing probabilistic verification. We report an upper-bound analysis that requires no end-to-end training. (i) Kernel speedup: on Qwen2.5-{1.5B, 7B}, copying $N$ tokens via parallel prefill is $6.8\times$--$303\times$ faster than autoregressive ($N \in [8, 512]$, A100 80GB bf16). (ii) Copy ceiling: on ProbeEdit and HumanEvalPack-Fix (Py/JS), $74$--$98\%$ of gold tokens are reachable under the line-level primitive; composed with the empirical kernel over each corpus's span histogram this yields a closed-form wall-clock bound of $29.0\times / 3.4\times / 4.2\times$ ($13.0\times$ pooled). A token-level extension reaches $91$--$99\%$ coverage with $4.5\times$--$6.5\times$ floors. (iii) Pipeline losslessness: oracle programs round-trip through the deterministic resolver on all $482$ cases, localizing any downstream failure to span selection rather than the mechanism. A perturbation study shows pooled EM drops from $100\%$ to $15.48\%$ under off-by-one noise. A fine-tuning pilot on Qwen2.5-Coder-1.5B lifts HEvalFix-Py EM from $0/33$ (untrained) to $12$--$17\%$, a learnability signal, not a production selector. Batched-serving integration and multi-file coverage are scoped as follow-up.
Chinese Translation
大型语言模型(LLMs)通过自回归地重新生成完整输出来编辑文本和代码,即使大多数标记在输入中是逐字出现的。我们研究了复制即解码(Copy-as-Decode),这是一种解码层机制,将编辑生成重新表述为在两种基本语法上的结构化解码:引用输入行范围,...生成新内容。一个基于标记的有限状态机(FSM)确保了语法的有效性,而服务层的基本操作通过单次并行预填充前向更新每个复制跨度的KV缓存,而不是$N$个自回归步骤,共享了投机解码的并行前向内核,但将输入标记视为草稿,并用程序强制的接受替代概率验证。我们报告了一个不需要端到端训练的上限分析。(i) 内核加速:在Qwen2.5-{1.5B, 7B}上,通过并行预填充复制$N$个标记比自回归快$6.8\times$到$303\times$($N \in [8, 512]$,A100 80GB bf16)。(ii) 复制上限:在ProbeEdit和HumanEvalPack-Fix(Py/JS)上,$74$--$98\%$的金标记在行级基本操作下是可达的;与每个语料库的跨度直方图上的经验内核组合,这产生了一个封闭形式的实际时间界限为$29.0\times / 3.4\times / 4.2\times$($13.0\times$合并)。一个基于标记的扩展达到了$91$--$99\%$的覆盖率,底线为$4.5\times$到$6.5\times$。(iii) 流水线无损性:在所有$482$个案例中,神谕程序通过确定性解析器往返,任何下游故障都局限于跨度选择而非机制。一个扰动研究显示,在偏差为一的噪声下,合并的EM从$100\%$降至$15.48\%$。在Qwen2.5-Coder-1.5B上的微调试点将HEvalFix-Py的EM从$0/33$(未训练)提升至$12$--$17\%$,这是一个可学习性信号,而非生产选择器。批量服务集成和多文件覆盖被列为后续研究的范围。
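The idea of expressing an edit as a program of copy and generate primitives, expanded by a deterministic resolver, can be illustrated with a small example. This is a minimal sketch, assuming a hypothetical program representation: a list of `("copy", start, end)` and `("gen", text)` tuples with 1-indexed, inclusive line ranges. The paper's concrete grammar syntax, FSM, and KV-cache kernel are not reproduced here.

```python
from typing import List, Tuple, Union

# Hypothetical program representation (not the paper's concrete syntax):
#   ("copy", start, end) copies input lines start..end (1-indexed, inclusive)
#   ("gen", text)        emits new text, which may span several lines
Op = Union[Tuple[str, int, int], Tuple[str, str]]


def resolve(program: List[Op], source: str) -> str:
    """Deterministically expand an edit program against the source text."""
    lines = source.split("\n")
    out: List[str] = []
    for op in program:
        if op[0] == "copy":
            _, start, end = op
            if not (1 <= start <= end <= len(lines)):
                raise ValueError(f"copy range {start}-{end} is out of bounds")
            out.extend(lines[start - 1:end])  # reuse input lines verbatim
        elif op[0] == "gen":
            out.extend(op[1].split("\n"))  # newly generated content
        else:
            raise ValueError(f"unknown primitive: {op[0]!r}")
    return "\n".join(out)


source = "def add(a, b):\n    return a - b\n\nprint(add(1, 2))"
# Fix the bug on line 2; every other line is copied rather than regenerated.
program = [("copy", 1, 1), ("gen", "    return a + b"), ("copy", 3, 4)]
edited = resolve(program, source)
```

In this toy case only one of four output lines is generated. Shifting a copy range by one line (for example `("copy", 1, 2)`) would silently carry the buggy line into the output, which is the kind of span-selection error the abstract's off-by-one perturbation study targets.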
cs.CL / 164 / 2604.18177
STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs
STaD:用于识别大型语言模型(LLMs)组合技能差距的支架任务设计
Abstract
Benchmarks are often used as a standard to understand LLM capabilities in different domains. However, aggregate benchmark scores provide limited insight into the compositional skill gaps of LLMs and how to address them. To make these weaknesses visible, we propose the Scaffolded Task Design (STaD) framework. STaD generates controlled variations of benchmark tasks based on the concept of scaffolding, which introduces structured, incremental support in a step-by-step manner. Rather than inspecting failures individually, this approach enables systematic and scalable probing of model behavior by identifying the specific reasoning skill compositions that models lack. Treating the LLM as a black box, our experiments on six models of varying sizes reveal multiple failure points in three reasoning benchmarks and highlight each model's distinct skill gaps.
Chinese Translation
基准测试通常被用作了解大型语言模型(LLMs)在不同领域能力的标准。然而,汇总的基准分数对LLMs的组合技能差距及其改进方式提供的洞察有限。为了使这些弱点可见,我们提出了支架任务设计(Scaffolded Task Design, STaD)框架。STaD基于支架概念生成基准任务的受控变体,该概念以逐步的方式引入结构化、渐进的支持。与其单独检查失败,这种方法通过识别模型缺乏的特定推理技能组合,使得对模型行为的系统性和可扩展性探测成为可能。将LLM视为黑箱,我们对六种不同规模的模型进行的实验揭示了三个推理基准中的多个失败点,并突出了每个模型独特且明显的技能差距。
cs.CL / 165 / 2604.18199
Linear-Time and Constant-Memory Text Embeddings Based on Recurrent Language Models
基于递归语言模型的线性时间和恒定内存文本嵌入
Abstract
Transformer-based embedding models suffer from quadratic computational and linear memory complexity, limiting their utility for long sequences. We propose recurrent architectures as an efficient alternative, introducing a vertically chunked inference strategy that enables fast embedding generation with memory usage that becomes constant in the input length once it exceeds the vertical chunk size. By fine-tuning Mamba2 models, we demonstrate their viability as general-purpose text embedders, achieving competitive performance across a range of benchmarks while maintaining a substantially smaller memory footprint compared to transformer-based counterparts. We empirically validate the applicability of our inference strategy to Mamba2, RWKV, and xLSTM models, confirming consistent runtime-memory trade-offs across architectures and establishing recurrent models as a compelling alternative to transformers for efficient embedding generation.
Chinese Translation
基于变换器的嵌入模型存在平方级的计算复杂度和线性内存复杂度,这限制了它们在长序列中的应用。我们提出递归架构作为一种高效的替代方案,引入了一种垂直分块推理策略,使得在输入长度超过垂直分块大小后,嵌入生成的内存使用量保持恒定。通过微调 Mamba2 模型,我们展示了其作为通用文本嵌入器的可行性,在一系列基准测试中取得了具有竞争力的性能,同时与基于变换器的模型相比,内存占用显著更小。我们通过实验证实了我们的推理策略在 Mamba2、RWKV 和 xLSTM 模型中的适用性,确认了不同架构之间一致的运行时-内存权衡,并确立了递归模型作为高效嵌入生成的变换器有力替代方案。
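The constant-memory claim rests on a property of recurrent layers: a sequence can be processed chunk by chunk as long as each layer's state is carried forward. The following is a toy sketch under stated assumptions: each "layer" is a scalar linear recurrence standing in for a real recurrent block (it is not Mamba2, RWKV, or xLSTM), mean pooling is assumed as the embedding, and the chunking scheme reflects one reading of the abstract rather than the authors' implementation.

```python
from typing import List, Tuple


class ToyRecurrentLayer:
    """Scalar linear recurrence h_t = decay * h_{t-1} + x_t (a stand-in layer)."""

    def __init__(self, decay: float) -> None:
        self.decay = decay

    def run(self, xs: List[float], state: float) -> Tuple[List[float], float]:
        ys = []
        for x in xs:
            state = self.decay * state + x
            ys.append(state)
        return ys, state  # outputs for this span plus the carried state


def embed_full(layers: List[ToyRecurrentLayer], tokens: List[float]) -> float:
    """Layer-by-layer pass: activations for the whole sequence are live at once."""
    acts = list(tokens)
    for layer in layers:
        acts, _ = layer.run(acts, 0.0)
    return sum(acts) / len(acts)  # mean pooling as the embedding


def embed_chunked(layers: List[ToyRecurrentLayer], tokens: List[float],
                  chunk_size: int) -> float:
    """Chunk-by-chunk pass: only one chunk of activations and one state per layer."""
    states = [0.0] * len(layers)
    total, count = 0.0, 0
    for i in range(0, len(tokens), chunk_size):
        acts = list(tokens[i:i + chunk_size])
        for j, layer in enumerate(layers):
            acts, states[j] = layer.run(acts, states[j])
        total += sum(acts)  # running sum, so pooled outputs need not be stored
        count += len(acts)
    return total / count


layers = [ToyRecurrentLayer(0.5), ToyRecurrentLayer(0.25)]
tokens = [float(i % 7) for i in range(50)]
full = embed_full(layers, tokens)
chunked = embed_chunked(layers, tokens, chunk_size=8)
```

Both functions compute the same embedding up to floating-point rounding, but the chunked version's working memory depends on `chunk_size` and the number of layers, not on the sequence length.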
cs.CL / 166 / 2604.18203
Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs
多模态大语言模型中的乘法:文本、图像和音频输入的计算
Abstract
Multimodal LLMs can accurately perceive numerical content across modalities yet fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or in audio form. Because existing benchmarks often lack systematically paired instances across modalities, it remains difficult to compare genuine arithmetic limits within and across model families. We therefore introduce a controlled multimodal multiplication benchmark that factorially varies digit length, digit sparsity, representation (e.g., numerals vs. number words), and modality (text, rendered images, audio), with paired instances from a reproducible generator. We also define arithmetic load, C, as the product of the total and non-zero digit counts, a compact, mechanistically motivated proxy for operation count. Across evaluations, accuracy falls sharply as C grows, often nearing zero by C > 100. Indeed, C remains predictive of performance across modalities and models, with R-squared often > 0.5, nearing the value from more complex measures of arithmetic load that count the number of intermediate arithmetic steps. A separate perception-versus-computation decomposition shows that multimodal degradation is primarily computational rather than perceptual: on matched-perception checks, models are near-perfect (> 99%) across modalities, even when multiplication accuracy drops. Beyond measuring when models fail, we ask which procedures they are predisposed to follow. We introduce a forced-completion loss probe that scores heuristic-specific reasoning prefixes--including columnar multiplication, distributive decomposition, and rounding/compensation. Here, decomposition is favored in both text and vision modalities; heuristic-specific LoRA adapters produce near-orthogonal updates yet degrade accuracy, indicating the base model maintains a well-tuned internal router.
Chinese Translation
多模态大语言模型能够准确感知跨模态的数值内容,但在以数字、数字词、图像或音频形式呈现相同的基础算术问题时,无法进行精确的多位数乘法。由于现有基准测试通常缺乏系统配对的跨模态实例,因此在模型家族内部和之间比较真实的算术限制仍然困难。因此,我们引入了一个受控的多模态乘法基准,该基准在数字长度、数字稀疏性、表示(例如,数字与数字词)和模态(文本、渲染图像、音频)上进行阶乘变化,并提供来自可重复生成器的配对实例。我们还将算术负载 C 定义为总数字和非零数字计数的乘积,作为操作计数的紧凑且机制驱动的代理。在评估中,随着 C 的增长,准确性急剧下降,通常在 C > 100 时接近零。实际上,C 在跨模态和模型的性能预测中保持有效,R-squared 值通常大于 0.5,接近更复杂的算术负载测量值,这些测量值计算中间算术步骤的数量。一个单独的感知与计算分解表明,多模态退化主要是计算性的而非感知性的:在匹配感知检查中,模型在跨模态下几乎完美(> 99%),即使乘法准确性下降。除了测量模型何时失败外,我们还询问它们倾向于遵循哪些程序。我们引入了一种强制完成损失探测器,该探测器对启发式特定推理前缀进行评分,包括列式乘法、分配分解和四舍五入/补偿。在这里,分解在文本和视觉模态中都受到青睐;启发式特定的 LoRA 适配器产生近乎正交的更新,但降低了准确性,表明基础模型保持了良好调优的内部路由器。
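The arithmetic load measure can be made concrete with a short function. This is a sketch under an assumption the abstract does not settle: both digit counts are taken over the two operands together. The paper may count them differently (for example, per operand).

```python
def arithmetic_load(a: int, b: int) -> int:
    """C = (total digit count) * (non-zero digit count).

    Assumption: both counts are pooled over the two operands of a * b.
    """
    digits = str(abs(a)) + str(abs(b))
    total = len(digits)
    nonzero = sum(ch != "0" for ch in digits)
    return total * nonzero


dense_small = arithmetic_load(12, 34)          # 4 digits, 4 non-zero
sparse_large = arithmetic_load(1000, 2000)     # 8 digits, 2 non-zero
dense_large = arithmetic_load(123456, 789012)  # 12 digits, 11 non-zero
```

Under this reading, a sparse 4-digit by 4-digit problem carries the same load as a dense 2-digit by 2-digit one, while a dense 6-digit by 6-digit problem lands above the C > 100 region where the abstract reports accuracy nearing zero.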
cs.CL / 167 / 2604.18204
Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages
难以被听到:对音位复杂、资源匮乏的濒危语言的音位级自动语音识别分析
Abstract
We present a phoneme-level analysis of automatic speech recognition (ASR) for two low-resourced and phonologically complex East Caucasian languages, Archi and Rutul, based on curated and standardized speech-transcript resources totaling approximately 50 minutes and 1 hour 20 minutes of audio, respectively. Existing recordings and transcriptions are consolidated and processed into a form suitable for ASR training and evaluation. We evaluate several state-of-the-art audio and audio-language models, including wav2vec2, Whisper, and Qwen2-Audio. For wav2vec2, we introduce a language-specific phoneme vocabulary with heuristic output-layer initialization, which yields consistent improvements and achieves performance comparable to or exceeding Whisper in these extremely low-resource settings. Beyond standard word and character error rates, we conduct a detailed phoneme-level error analysis. We find that phoneme recognition accuracy strongly correlates with training frequency, exhibiting a characteristic sigmoid-shaped learning curve. For Archi, this relationship partially breaks for Whisper, pointing to model-specific generalization effects beyond what is predicted by training frequency. Overall, our results indicate that many errors attributed to phonological complexity are better explained by data scarcity. These findings demonstrate the value of phoneme-level evaluation for understanding ASR behavior in low-resource, typologically complex languages.
Chinese Translation
我们对两种资源匮乏且音位复杂的东高加索语言——阿尔奇语(Archi)和鲁图尔语(Rutul)进行了音位级的自动语音识别(ASR)分析,基于经过整理和标准化的语音-文本资源,分别总计约50分钟和1小时20分钟的音频。现有的录音和转录被整合并处理成适合ASR训练和评估的形式。我们评估了几种最先进的音频和音频-语言模型,包括wav2vec2、Whisper和Qwen2-Audio。对于wav2vec2,我们引入了一种特定语言的音位词汇,并采用启发式输出层初始化,这带来了持续的改进,并在这些极其资源匮乏的环境中达到了与Whisper相当或更优的性能。除了标准的词和字符错误率外,我们还进行了详细的音位级错误分析。我们发现音位识别准确率与训练频率强相关,呈现出特征性S形学习曲线。对于阿尔奇语,Whisper的这一关系部分失效,指向模型特定的泛化效应,超出了训练频率的预测。总体而言,我们的结果表明,许多归因于音位复杂性的错误更好地用数据稀缺来解释。这些发现展示了音位级评估在理解资源匮乏、类型学复杂语言的ASR行为中的价值。
cs.CL / 168 / 2604.18226
Model in Distress: Sentiment Analysis on French Synthetic Social Media
困境中的模型:法语合成社交媒体的情感分析
Abstract
Automated analysis of customer feedback on social media is hindered by three challenges: the high cost of annotated training data, the scarcity of evaluation sets, especially in multilingual settings, and privacy concerns that prevent data sharing and reproducibility. We address these issues by developing a generalizable synthetic data generation pipeline applied to a case study on customer distress detection in French public transportation. Our approach utilizes backtranslation with fine-tuned models to generate 1.7 million synthetic tweets from a small seed corpus, complemented by synthetic reasoning traces. We train 600M-parameter reasoners with English and French reasoning that achieve 77-79% accuracy on human-annotated evaluation data, matching or exceeding SOTA proprietary LLMs and specialized encoders. Beyond reducing annotation costs, our pipeline preserves privacy by eliminating the exposure of sensitive user data. Our methodology can be adopted for other use cases and languages.
Chinese Translation
社交媒体上客户反馈的自动化分析面临三个挑战:标注训练数据的高成本、评估集的稀缺,尤其是在多语言环境中,以及阻碍数据共享和可重复性的隐私问题。我们通过开发一个可推广的合成数据生成管道来解决这些问题,该管道应用于法语公共交通中客户困扰检测的案例研究。我们的方法利用反向翻译与微调模型,从一个小的种子语料库生成170万条合成推文,并辅以合成推理轨迹。我们训练了600M参数的推理模型,采用英语和法语推理,在人工标注的评估数据上实现了77-79%的准确率,匹配或超过了最新的专有大型语言模型(SOTA LLMs)和专业编码器。除了降低标注成本外,我们的管道通过消除敏感用户数据的暴露来保护隐私。我们的方法论可以被应用于其他用例和语言。
cs.CL / 169 / 2604.18235
Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search
负优势是一把双刃剑:在深度搜索中校准GRPO的优势
Abstract
Deep search agents can autonomously initiate multi-turn interactions with search engines, thereby exhibiting strong question-answering capabilities. Such performance critically relies on Group Relative Policy Optimization (GRPO) as its core training algorithm. However, GRPO still faces several challenges in deep search settings. First, there exists a substantial mismatch between the correctness of intermediate steps and the reward signal, causing numerous correct intermediate steps to be incorrectly penalized when the final answer is wrong. Second, training is highly unstable, often resulting in degradation of natural language ability or even catastrophic training collapse. Our analysis attributes these issues to coarse-grained advantage assignment and an imbalance between positive and negative advantages. To address these problems, we propose CalibAdv, an advantage calibration method specifically designed for deep search tasks. Specifically, CalibAdv leverages the correctness of intermediate steps to downscale excessive negative advantages at a fine-grained level. It then rebalances positive and negative advantages in the answer component. Extensive experiments across three models and seven benchmarks demonstrate that CalibAdv improves both model performance and training stability. Our code is available at https://github.com/wujwyi/CalibAdv.
Chinese Translation
深度搜索代理能够自主发起与搜索引擎的多轮交互,从而展现出强大的问答能力。这种性能在很大程度上依赖于群体相对策略优化(Group Relative Policy Optimization, GRPO)作为其核心训练算法。然而,GRPO在深度搜索环境中仍面临若干挑战。首先,中间步骤的正确性与奖励信号之间存在显著不匹配,导致许多正确的中间步骤在最终答案错误时被错误惩罚。其次,训练高度不稳定,常常导致自然语言能力的下降甚至灾难性的训练崩溃。我们的分析将这些问题归因于粗粒度的优势分配以及正负优势之间的不平衡。为了解决这些问题,我们提出了CalibAdv,一种专门为深度搜索任务设计的优势校准方法。具体而言,CalibAdv利用中间步骤的正确性在细粒度水平上缩减过度的负优势,然后在答案组件中重新平衡正负优势。针对三个模型和七个基准的广泛实验表明,CalibAdv提高了模型性能和训练稳定性。我们的代码可在 https://github.com/wujwyi/CalibAdv 获取。
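The coarse-grained advantage assignment criticized above comes from how GRPO scores a trajectory: every token inherits one group-normalized reward. The sketch below shows that standard normalization, followed by a purely illustrative per-step downscaling. The function `calibrate_steps`, its `scale` parameter, and the rule it applies are assumptions made for this example and are not CalibAdv's actual formula.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standard GRPO-style advantage: normalize each reward within its group.

    Population standard deviation is used here; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def calibrate_steps(advantage: float, step_is_correct: List[bool],
                    scale: float = 0.5) -> List[float]:
    """Illustrative rule (assumption): soften the penalty on correct steps.

    A trajectory with a wrong final answer has a negative advantage; steps
    judged correct receive advantage * scale instead of the full penalty.
    """
    if advantage >= 0:
        return [advantage] * len(step_is_correct)
    return [advantage * scale if ok else advantage for ok in step_is_correct]


# Four rollouts for one question: two correct answers, two wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A wrong rollout whose first two search steps were nonetheless correct.
per_step = calibrate_steps(-1.0, [True, True, False])
```

Without the second function, all three steps of the wrong rollout would receive the same negative advantage, which is the mismatch between step correctness and reward signal that the abstract describes.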
cs.CL / 170 / 2604.18249
Where Do Self-Supervised Speech Models Become Unfair?
自监督语音模型在哪些方面变得不公平?
Abstract
Speech encoder models are known to model members of some speaker groups (SGs) better than others. However, there has been little work on establishing why this occurs at a technical level. To our knowledge, we present the first layerwise fairness analysis of pretrained self-supervised speech encoder models (S3Ms), probing each embedding layer for speaker identification (SID) and automatic speech recognition (ASR). We find S3Ms produce embeddings biased against certain SGs for both tasks, starting at the very first latent layers. Furthermore, we find opposite patterns of layerwise bias for SID vs ASR for all models in our study: SID bias is minimized in layers that minimize overall SID error; on the other hand, ASR bias is maximized in layers that minimize overall ASR error. The inverse bias/error relationship for ASR is unaffected when probing S3Ms that are finetuned for ASR, suggesting SG-level bias is established during pretraining and is difficult to remove.
Chinese Translation
语音编码器模型在对某些说话者群体(SGs)的建模能力上优于其他群体。然而,在技术层面上探讨这种现象的原因的研究较少。据我们所知,我们首次对预训练的自监督语音编码器模型(S3Ms)进行了逐层公平性分析,探讨每个嵌入层在说话者识别(SID)和自动语音识别(ASR)中的表现。我们发现,S3Ms在这两项任务中对某些SGs产生了偏见的嵌入,从最初的潜在层开始。此外,我们发现,在我们研究的所有模型中,SID与ASR的逐层偏见表现出相反的模式:在最小化整体SID错误的层中,SID偏见被最小化;而在最小化整体ASR错误的层中,ASR偏见被最大化。当对经过微调以进行ASR的S3Ms进行探测时,ASR的反向偏见/错误关系并未受到影响,这表明SG级别的偏见是在预训练期间建立的,并且难以消除。
cs.CL / 171 / 2604.18293
An Existence Proof for Neural Language Models That Can Explain Garden-Path Effects via Surprisal
能够通过惊讶度解释花园小径效应的神经语言模型的存在性证明
Abstract
Surprisal theory hypothesizes that the difficulty of human sentence processing increases linearly with surprisal, the negative log-probability of a word given its context. Computational psycholinguistics has tested this hypothesis using language models (LMs) as proxies for human prediction. While surprisal derived from recent neural LMs generally captures human processing difficulty on naturalistic corpora that predominantly consist of simple sentences, it severely underestimates processing difficulty on sentences that require syntactic disambiguation (garden-path effects). This leads to the claim that the processing difficulty of such sentences cannot be reduced to surprisal, although it remains possible that neural LMs simply differ from humans in next-word prediction. In this paper, we investigate whether it is truly impossible to construct a neural LM that can explain garden-path effects via surprisal. Specifically, instead of evaluating off-the-shelf neural LMs, we fine-tune these LMs on garden-path sentences so as to better align surprisal-based reading-time estimates with actual human reading times. Our results show that fine-tuned LMs do not overfit and successfully capture human reading slowdowns on held-out garden-path items; they even improve predictive power for human reading times on naturalistic corpora and preserve their general LM capabilities. These results provide an existence proof for a neural LM that can explain both garden-path effects and naturalistic reading times via surprisal, but also raise a theoretical question: what kind of evidence can truly falsify surprisal theory?
Chinese Translation
惊讶度理论假设,人类句子处理的难度与惊讶度(给定上下文的词的负对数概率)呈线性增加。计算心理语言学通过将语言模型(LMs)作为人类预测的代理来测试这一假设。尽管来自近期神经语言模型的惊讶度通常能够捕捉到自然语料库中简单句子的处理难度,但它严重低估了需要句法消歧的句子(花园小径效应)的处理难度。这导致了这样的主张:此类句子的处理难度无法归结为惊讶度,尽管神经语言模型在下一个词的预测上可能与人类存在差异。在本文中,我们探讨是否真的不可能构建一个能够通过惊讶度解释花园小径效应的神经语言模型。具体而言,我们不是评估现成的神经语言模型,而是对这些模型进行微调,以便更好地将基于惊讶度的阅读时间估计与实际人类阅读时间对齐。我们的结果表明,经过微调的语言模型不会过拟合,并成功捕捉到在保留的花园小径项目上的人类阅读减速;它们甚至提高了对自然语料库中人类阅读时间的预测能力,并保持了其一般的语言模型能力。这些结果为能够通过惊讶度解释花园小径效应和自然阅读时间的神经语言模型提供了存在性证明,但也提出了一个理论问题:什么样的证据可以真正反驳惊讶度理论?
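The quantity at the center of this abstract has a one-line definition, and the linear linking hypothesis follows directly from it. The sketch below uses base-2 logarithms (bits), which is an assumption since the abstract does not fix the base, and the intercept and slope values are hypothetical placeholders rather than fitted estimates.

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal in bits: negative log-probability of a word given its context."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def predicted_reading_time(prob: float, intercept_ms: float, slope_ms: float) -> float:
    """Linear linking hypothesis: reading time grows linearly with surprisal."""
    return intercept_ms + slope_ms * surprisal(prob)


expected_word = surprisal(0.5)         # a highly predictable continuation
disambiguating = surprisal(0.125)      # a less expected, disambiguating word
# Hypothetical coefficients, chosen only to make the arithmetic visible.
slowdown = (predicted_reading_time(0.125, 200.0, 10.0)
            - predicted_reading_time(0.5, 200.0, 10.0))
```

If a language model assigns too high a probability to the disambiguating word of a garden-path sentence, its surprisal is too low and the predicted slowdown is smaller than the one humans show, which is the underestimation the abstract refers to.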
cs.CL / 172 / 2604.18296
Exploring Concreteness Through a Figurative Lens
通过比喻视角探索具体性
Abstract
Static concreteness ratings are widely used in NLP, yet a word's concreteness can shift with context, especially in figurative language such as metaphor, where common concrete nouns can take abstract interpretations. While such shifts are evident from context, it remains unclear how LLMs understand concreteness internally. We conduct a layer-wise and geometric analysis of LLM hidden representations across four model families, examining how models distinguish literal vs figurative uses of the same noun and how concreteness is organized in representation space. We find that LLMs separate literal and figurative usage in early layers, and that mid-to-late layers compress concreteness into a one-dimensional direction that is consistent across models. Finally, we show that this geometric structure is practically useful: a single concreteness direction supports efficient figurative-language classification and enables training-free steering of generation toward more literal or more figurative rewrites.
Chinese Translation
静态具体性评分在自然语言处理(NLP)中被广泛使用,然而一个词的具体性可能会随着语境而变化,尤其是在比喻语言中,例如隐喻,常见的具体名词可以具有抽象的解释。尽管这种变化在语境中显而易见,但尚不清楚大型语言模型(LLMs)如何在内部理解具体性。我们对四个模型家族的LLM隐藏表示进行了分层和几何分析,考察模型如何区分同一名词的字面与比喻用法,以及具体性在表示空间中的组织方式。我们发现,LLMs在早期层中区分字面和比喻用法,而中后期层将具体性压缩为一个在各模型间一致的一维方向。最后,我们展示了这种几何结构在实践中的有效性:单一的具体性方向支持高效的比喻语言分类,并使得在生成过程中无需训练即可引导文本朝向更字面或更比喻的改写。
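A single direction in representation space can be used for classification by projecting hidden states onto it. The sketch below estimates such a direction as a difference of class means, which is one common choice and an assumption here, since the abstract does not say how the direction is obtained. The two-dimensional vectors are toy stand-ins for real hidden states.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def concreteness_direction(concrete: List[Vector], abstract: List[Vector]) -> Vector:
    """Unit vector pointing from the abstract-class mean to the concrete-class mean."""
    diff = [c - a for c, a in zip(mean_vector(concrete), mean_vector(abstract))]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def project(vec: Vector, unit_dir: Vector) -> float:
    """Scalar coordinate of vec along the direction."""
    return sum(v * u for v, u in zip(vec, unit_dir))


# Toy hidden states: the first coordinate separates the classes, the second is noise.
concrete_states = [[2.0, 0.0], [4.0, 0.0]]
abstract_states = [[-2.0, 0.0], [-4.0, 0.0]]
direction = concreteness_direction(concrete_states, abstract_states)
score = project([1.5, 7.0], direction)
is_literal = score > 0.0  # threshold at zero is an illustrative choice
```

The same scalar coordinate could, in principle, be shifted to steer generation, but that requires intervening on a real model's activations and is outside this toy example.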
cs.CL / 173 / 2604.18307
Reasoning Models Know What's Important, and Encode It in Their Activations
推理模型知道什么是重要的,并将其编码在激活中
Abstract
Language models often solve complex tasks by generating long reasoning chains, consisting of many steps with varying importance. While some steps are crucial for generating the final answer, others are removable. Determining which steps matter most, and why, remains an open question central to understanding how models process reasoning. We investigate if this question is best approached through model internals or through tokens of the reasoning chain itself. We find that model activations contain more information than tokens for identifying important reasoning steps. Crucially, by training probes on model activations to predict importance, we show that models encode an internal representation of step importance, even prior to the generation of subsequent steps. This internal representation of importance generalizes across models, is distributed across layers, and does not correlate with surface-level features, such as a step's relative position or its length. Our findings suggest that analyzing activations can reveal aspects of reasoning that surface-level approaches fundamentally miss, indicating that reasoning analyses should look into model internals.
Chinese Translation
语言模型通常通过生成长的推理链来解决复杂任务,这些推理链由许多重要性各异的步骤组成。虽然某些步骤对于生成最终答案至关重要,但其他步骤则可以省略。确定哪些步骤最为重要,以及原因,仍然是一个开放性问题,这对于理解模型如何处理推理至关重要。我们探讨这个问题是通过模型内部结构还是通过推理链本身的标记来最佳解决。我们发现,模型的激活比标记包含更多的信息,以识别重要的推理步骤。关键是,通过在模型激活上训练探针以预测重要性,我们表明模型编码了步骤重要性的内部表示,甚至在生成后续步骤之前。这种重要性的内部表示在不同模型间具有普遍性,分布在各层之间,并且与表面特征(如步骤的相对位置或长度)没有相关性。我们的研究结果表明,分析激活可以揭示推理中表面方法根本无法捕捉的方面,这表明推理分析应关注模型内部结构。
cs.CL / 174 / 2604.18311
On the Importance and Evaluation of Narrativity in Natural Language AI Explanations
叙事性在自然语言人工智能解释中的重要性与评估
Abstract
Explainable AI (XAI) aims to make the behaviour of machine learning models interpretable, yet many explanation methods remain difficult to understand. The integration of Natural Language Generation into XAI aims to deliver explanations in textual form, making them more accessible to practitioners. Current approaches, however, largely yield static lists of feature importances. Although such explanations indicate what influences the prediction, they do not explain why the prediction occurs. In this study, we draw on insights from social sciences and linguistics, and argue that XAI explanations should be presented in the form of narratives. Narrative explanations support human understanding through four defining properties: continuous structure, cause-effect mechanisms, linguistic fluency, and lexical diversity. We show that standard Natural Language Processing (NLP) metrics based solely on token probability or word frequency fail to capture these properties and can be matched or exceeded by tautological text that conveys no explanatory content. To address this issue, we propose seven automatic metrics that quantify the narrative quality of explanations along the four identified dimensions. We benchmark current state-of-the-art explanation generation methods on six datasets and show that the proposed metrics separate descriptive from narrative explanations more reliably than standard NLP metrics. Finally, to further advance the field, we propose a set of problem-agnostic XAI Narrative generation rules for producing natural language XAI explanations, so that the resulting XAI Narratives exhibit stronger narrative properties and align with the findings from the linguistic and social science literature.
Chinese Translation
可解释人工智能(XAI)旨在使机器学习模型的行为可解释,但许多解释方法仍然难以理解。将自然语言生成(Natural Language Generation)整合到XAI中,旨在以文本形式提供解释,使其对从业者更具可及性。然而,目前的方法主要产生静态的特征重要性列表。尽管这些解释表明了影响预测的因素,但并未解释预测发生的原因。在本研究中,我们借鉴社会科学和语言学的见解,认为XAI解释应以叙事的形式呈现。叙事解释通过四个定义特征支持人类理解:连续结构、因果机制、语言流畅性和词汇多样性。我们展示了基于标记概率或词频的标准自然语言处理(NLP)指标无法捕捉这些特征,并且可以被传达无解释内容的同义文本所匹配或超越。为了解决这一问题,我们提出了七个自动化指标,以量化解释在四个识别维度上的叙事质量。我们在六个数据集上基准测试当前最先进的解释生成方法,并表明所提出的指标比标准NLP指标更可靠地区分描述性和叙事性解释。最后,为了进一步推动该领域的发展,我们提出了一套与问题无关的XAI叙事生成规则,以生成自然语言的XAI解释,使得生成的XAI叙事展现出更强的叙事特性,并与语言学和社会科学文献的发现相一致。
cs.CL / 175 / 2604.18328
FregeLogic at SemEval 2026 Task 11: A Hybrid Neuro-Symbolic Architecture for Content-Robust Syllogistic Validity Prediction
FregeLogic在SemEval 2026任务11中的应用:一种混合神经符号架构用于内容稳健的三段论有效性预测
Abstract
We present FregeLogic, a hybrid neuro-symbolic system for SemEval-2026 Task 11 (Subtask 1), which addresses syllogistic validity prediction while reducing content effects on predictions. Our approach combines an ensemble of five LLM classifiers, spanning three open-weights models (Llama 4 Maverick, Llama 4 Scout, and Qwen3-32B) paired with varied prompting strategies, with a Z3 SMT solver that serves as a formal logic tiebreaker. The central hypothesis is that LLM disagreement within the ensemble signals likely content-biased errors, where real-world believability interferes with logical judgment. By deferring to Z3's structurally-grounded formal verification on these disputed cases, our system achieves 94.3% accuracy with a content effect of 2.85 and a combined score of 41.88 in nested 5-fold cross-validation on the dataset (N=960). This represents a 2.76-point improvement in combined score over the pure ensemble (39.12), with a 0.9% accuracy gain, driven by a 16% reduction in content effect (3.39 to 2.85). Adopting structured-output API calls for Z3 extraction reduced failure rates from ~22% to near zero, and an Aristotelian encoding with existence axioms was validated against task annotations. Our results suggest that targeted neuro-symbolic integration, applying formal methods precisely where ensemble consensus is lowest, can improve the combined accuracy-plus-content-effect metric used by this task.
Chinese Translation
我们提出了FregeLogic,这是一种混合神经符号系统,旨在解决SemEval-2026任务11(子任务1)中的三段论有效性预测,同时减少内容对预测的影响。我们的方法结合了五个大型语言模型(LLM)分类器的集成,涵盖了三个开放权重模型(Llama 4 Maverick、Llama 4 Scout和Qwen3-32B),并配以多样的提示策略,以及一个Z3 SMT求解器,作为形式逻辑的决策依据。核心假设是,集成中的LLM不一致性可能表明内容偏见错误的存在,其中现实世界的可信度干扰了逻辑判断。通过在这些争议案例中依赖于Z3的结构性基础形式验证,我们的系统在数据集(N=960)上的5折交叉验证中实现了94.3%的准确率,内容效应为2.85,综合得分为41.88。这比纯集成(39.12)的综合得分提高了2.76分,准确率提高了0.9%,内容效应减少了16%(从3.39降至2.85)。采用结构化输出API调用进行Z3提取将失败率从约22%降低到接近零,并且使用存在公理的亚里士多德编码经过了任务注释的验证。我们的结果表明,针对性地进行神经符号集成,在集成共识最低的地方精确应用形式方法,可以提高该任务使用的综合准确率与内容效应指标。
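The control flow of the hybrid system, where the ensemble decides when it agrees and a formal check decides when it does not, can be sketched in a few lines. Two assumptions are made here: "disagreement" is taken to mean any non-unanimous vote (the abstract does not state the threshold), and `solver` is a hypothetical callable standing in for the Z3-based validity check, which is not implemented.

```python
from typing import Callable, List


def predict_validity(votes: List[bool], solver: Callable[[], bool]) -> bool:
    """Return the unanimous ensemble label, or defer to the solver on disagreement."""
    if not votes:
        raise ValueError("at least one classifier vote is required")
    if all(votes) or not any(votes):
        return votes[0]  # unanimous: trust the ensemble
    return solver()      # disputed: use the formal verdict as tiebreaker


# Unanimous case: the solver is never consulted.
agreed = predict_validity([True] * 5, solver=lambda: False)
# Disputed case (for example, a believable but invalid syllogism).
disputed = predict_validity([True, True, False, True, False], solver=lambda: False)
```

Calling the solver lazily keeps the formal step off the path for items where the classifiers agree, which matches the abstract's framing of Z3 as a tiebreaker rather than a primary predictor.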
cs.CL / 176 / 2604.18347
Multilingual Training and Evaluation Resources for Vision-Language Models
视觉语言模型的多语言训练与评估资源
Abstract
Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite this growth, VLM development is heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLM training and evaluation spanning five European languages (English, French, German, Italian, and Spanish). We adopt a regeneration-translation paradigm that produces high-quality cross-lingual resources by combining curated synthetic generation and manual annotation. Specifically, we build Multi-PixMo, a training corpus obtained by regenerating examples from pre-existing PixMo datasets (PixMo-Cap, PixMo-AskModelAnything, and CoSyn-400k) with permissively licensed models. On the evaluation side, we construct a set of multilingual benchmarks derived by translating widely used English datasets (MMbench, ScienceQA, MME, POPE, AI2D). We assess the quality of these resources through qualitative and quantitative human analyses, measuring inter-annotator agreement. Additionally, we perform ablation studies to demonstrate the impact of multilingual data, compared with English-only data, on VLM training. Experiments with three different models show that training VLMs on multilingual, multimodal examples is consistently beneficial on non-English benchmarks, with positive transfer to English as well.
Chinese Translation
视觉语言模型(VLMs)近年来取得了快速进展。然而,尽管其发展迅速,VLMs 的发展仍然严重依赖于英语,这导致了两个主要限制:(i)缺乏用于训练的多语言和多模态数据集,以及(ii)跨语言的综合评估基准稀缺。在本研究中,我们通过引入一套新的综合资源来解决这些问题,以支持 VLMs 的训练和评估,涵盖五种欧洲语言(英语、法语、德语、意大利语和西班牙语)。我们采用再生-翻译范式,通过结合策划的合成生成和人工标注,生成高质量的跨语言资源。具体而言,我们构建了 Multi-PixMo,这是一个通过再生来自 Pixmo 现有数据集的示例而获得的训练语料库,使用的是具有宽松许可的模型:PixMo-Cap、PixMo-AskModelAnything 和 CoSyn-400k。在评估方面,我们构建了一组多语言基准,源自翻译广泛使用的英语数据集(MMbench、ScienceQA、MME、POPE、AI2D)。我们通过定性和定量的人类分析评估这些资源的质量,测量标注者之间的一致性。此外,我们还进行了消融研究,以展示多语言数据对 VLMs 训练的影响,相较于仅使用英语。实验包括三种不同的模型,结果表明,使用多语言、多模态示例训练 VLMs 在非英语基准上始终有益,并且对英语也有积极的迁移效果。
cs.CL / 177 / 2604.18349
HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents
HiGMem:一种用于长期对话代理的分层和大型语言模型引导的记忆系统
Abstract
Long-term conversational large language model (LLM) agents require memory systems that can recover relevant evidence from historical interactions without overwhelming the answer stage with irrelevant context. However, existing memory systems, including hierarchical ones, still often rely solely on vector similarity for retrieval. This tends to produce bloated evidence sets: adding many superficially similar dialogue turns yields little additional recall, but lowers retrieval precision, increases answer-stage context cost, and makes retrieved memories harder to inspect and manage. To address this, we propose HiGMem (Hierarchical and LLM-Guided Memory System), a two-level event-turn memory system that allows LLMs to use event summaries as semantic anchors to predict which related turns are worth reading. This allows the model to inspect high-level event summaries first and then focus on a smaller set of potentially useful turns, providing a concise and reliable evidence set through reasoning, while avoiding retrieval overhead that is excessive compared to vector retrieval. On the LoCoMo10 benchmark, HiGMem achieves the best F1 on four of five question categories and improves adversarial F1 from 0.54 to 0.78 over A-Mem, while retrieving an order of magnitude fewer turns. Code is publicly available at https://github.com/ZeroLoss-Lab/HiGMem.
Chinese Translation
长期对话的大型语言模型(LLM)代理需要能够从历史交互中恢复相关证据的记忆系统,而不至于在答案阶段被无关的上下文所淹没。然而,现有的记忆系统,包括分层记忆系统,仍然往往仅依赖于向量相似性进行检索。这往往会产生冗长的证据集:添加许多表面上相似的对话轮次几乎不会增加额外的召回率,但却降低了检索精度,增加了答案阶段的上下文成本,并使检索到的记忆更难以检查和管理。为了解决这个问题,我们提出了HiGMem(分层和LLM引导的记忆系统),这是一种两级事件-轮次记忆系统,允许LLM使用事件摘要作为语义锚点来预测哪些相关轮次值得阅读。这使得模型能够首先检查高层次的事件摘要,然后专注于一小部分可能有用的轮次,通过推理提供简洁可靠的证据集,同时避免与向量检索相比过高的检索开销。在LoCoMo10基准测试中,HiGMem在五个问题类别中的四个类别上实现了最佳的F1分数,并将对抗性F1从0.54提高到0.78,同时检索的轮次数量减少了一个数量级。代码已公开发布在 https://github.com/ZeroLoss-Lab/HiGMem。
cs.CL / 178 / 2604.18354
PRISMA: Preference-Reinforced Self-Training Approach for Interpretable Emotionally Intelligent Negotiation Dialogues
PRISMA:基于偏好强化的自我训练方法用于可解释的情感智能谈判对话
Abstract
Emotion plays a pivotal role in shaping negotiation outcomes, influencing trust, cooperation, and long-term relationships. Developing negotiation dialogue systems that can recognize and respond strategically to emotions is, therefore, essential to create more effective human-centered interactions. Beyond generating emotionally appropriate responses, interpretability (understanding how a system generates a particular emotion-aware response) is critical for fostering reliability and building rapport. Driven by these aspects, in this work, we introduce PRISMA, an interpretable emotionally intelligent negotiation dialogue system targeting two application domains, viz. job interviews and resource allocation. To enable interpretability, we propose an Emotion-aware Negotiation Strategy-informed Chain-of-Thought (ENS-CoT) reasoning mechanism, which mimics human negotiation by perceiving, understanding, using, and managing emotions. Leveraging ENS-CoT, we curate two new datasets: JobNego (for job interview negotiation) and ResNego (for resource allocation negotiation). We then leverage these datasets to develop PRISMA by augmenting self-training with Direct Preference Optimization (DPO), guiding agents toward more accurate, interpretable, and emotionally appropriate negotiation responses. Automatic and human evaluation on the JobNego and ResNego datasets demonstrates that PRISMA substantially enhances interpretability and generates appropriate emotion-aware responses, while improving overall negotiation effectiveness.
Chinese Translation
情感在塑造谈判结果中发挥着关键作用,影响信任、合作和长期关系。因此,开发能够识别并战略性地响应情感的谈判对话系统对于创造更有效的人本交互至关重要。除了生成情感适当的响应外,可解释性——理解系统如何生成特定的情感感知响应,对于促进可靠性和建立融洽关系至关重要。基于这些方面,在本研究中,我们介绍了PRISMA,一个可解释的情感智能谈判对话系统,针对两个应用领域,即求职面试和资源分配。为了实现可解释性,我们提出了一种情感感知的谈判策略引导的思维链(ENS-CoT)推理机制,该机制通过感知、理解、使用和管理情感来模拟人类谈判。利用ENS-CoT,我们策划了两个新的数据集:JobNego(用于求职面试谈判)和ResNego(用于资源分配谈判)。然后,我们利用这些数据集通过直接偏好优化(DPO)增强自我训练,指导代理朝着更准确、可解释和情感适当的谈判响应发展。在JobNego和ResNego数据集上的自动和人工评估表明,PRISMA显著增强了可解释性,并生成了适当的情感感知响应,同时提高了整体谈判效果。
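Direct Preference Optimization, which the abstract names as the preference-learning component, has a compact per-pair loss. The sketch below implements the standard DPO objective for a single preference pair from sequence log-probabilities. How PRISMA builds its pairs and combines DPO with self-training is not described in the abstract and is not modeled here; the numeric inputs are made-up values for illustration.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log sigmoid(beta * margin).

    margin = (policy - reference) log-ratio of the chosen response
             minus the same log-ratio for the rejected response.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable -log(sigmoid(z)) = log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: zero margin, loss equals log 2.
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability toward the chosen response: lower loss.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

The loss depends only on how far the policy has moved relative to the reference on the two responses, so a pair contributes nothing new once the policy already prefers the chosen response by a wide margin.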
cs.CL / 179 / 2604.18356
ComPASS: Towards Personalized Agentic Social Support via Tool-Augmented Companionship
ComPASS:通过工具增强的陪伴实现个性化的主动社会支持
Abstract
Developing compassionate interactive systems requires agents to not only understand user emotions but also provide diverse, substantive support. While recent works explore empathetic dialogue generation, they remain limited in response form and content, struggling to satisfy diverse needs across users and contexts. To address this, we explore empowering agents with external tools to execute diverse actions. Grounded in the psychological concept of "social support", this paradigm delivers substantive, human-like companionship. Specifically, we first design a dozen user-centric tools simulating various multimedia applications, which can cover different types of social support behaviors in human-agent interaction scenarios. We then construct ComPASS-Bench, the first personalized social support benchmark for LLM-based agents, via multi-step automated synthesis and manual refinement. Based on ComPASS-Bench, we further synthesize tool use records to fine-tune the Qwen3-8B model, yielding a task-specific ComPASS-Qwen. Comprehensive evaluations across two settings reveal that while the evaluated LLMs can generate valid tool-calling requests with high success rates, significant gaps remain in final response quality. Moreover, tool-augmented responses achieve better overall performance than directly producing conversational empathy. Notably, our trained ComPASS-Qwen demonstrates substantial improvements over its base model, achieving comparable performance to several large-scale models. Our code and data are available at https://github.com/hzp3517/ComPASS.
Chinese Translation
开发富有同情心的互动系统需要代理不仅理解用户情感,还要提供多样化和实质性的支持。尽管近期的研究探索了同理心对话生成,但在响应形式和内容上仍然有限,难以满足用户和情境的多样化需求。为了解决这一问题,我们探索了赋能代理使用外部工具以执行多样化行动的方式。基于“社会支持”的心理学概念,这一范式提供了实质性的人类般陪伴。具体而言,我们首先设计了十多种以用户为中心的工具,模拟各种多媒体应用,能够覆盖人机交互场景中不同类型的社会支持行为。然后,我们构建了ComPASS-Bench,这是首个基于LLM的个性化社会支持基准,通过多步骤的自动合成和手动精炼实现。基于ComPASS-Bench,我们进一步合成工具使用记录,以微调Qwen3-8B模型,生成任务特定的ComPASS-Qwen。在两个设置下的综合评估显示,尽管评估的LLM能够生成有效的工具调用请求且成功率较高,但最终响应质量仍存在显著差距。此外,工具增强的响应在整体表现上优于直接生成的对话同理心。值得注意的是,我们训练的ComPASS-Qwen在性能上显著优于其基础模型,达到了与多个大规模模型相当的表现。我们的代码和数据可在 https://github.com/hzp3517/ComPASS 获取。
cs.CL / 180 / 2604.18362
ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation
ArbGraph:面向可靠长文本检索增强生成的冲突感知证据仲裁
Abstract
Retrieval-augmented generation (RAG) remains unreliable in long-form settings, where retrieved evidence is noisy or contradictory, making it difficult for RAG pipelines to maintain factual consistency. Existing approaches focus on retrieval expansion or verification during generation, leaving conflict resolution entangled with generation. To address this limitation, we propose ArbGraph, a framework for pre-generation evidence arbitration in long-form RAG that explicitly resolves factual conflicts. ArbGraph decomposes retrieved documents into atomic claims and organizes them into a conflict-aware evidence graph with explicit support and contradiction relations. On top of this graph, we introduce an intensity-driven iterative arbitration mechanism that propagates credibility signals through evidence interactions, enabling the system to suppress unreliable and inconsistent claims before final generation. In this way, ArbGraph separates evidence validation from text generation and provides a coherent evidence foundation for downstream long-form generation. We evaluate ArbGraph on two widely used long-form RAG benchmarks, LongFact and RAGChecker, using multiple large language model backbones. Experimental results show that ArbGraph consistently improves factual recall and information density while reducing hallucinations and sensitivity to retrieval noise. Additional analyses show that these gains are evident under conflicting or ambiguous evidence, highlighting the effectiveness of evidence-level conflict resolution for improving the reliability of long-form RAG. The implementation is publicly available at https://github.com/1212Judy/ArbGraph.
Chinese Translation
检索增强生成(RAG)在长文本场景中仍然不够可靠,因为检索到的证据可能存在噪声或矛盾,这使得RAG管道难以保持事实一致性。现有的方法主要集中在生成过程中的检索扩展或验证上,而将冲突解决与生成过程交织在一起。为了解决这一局限性,我们提出了ArbGraph,一个用于长文本RAG中预生成证据仲裁的框架,能够明确解决事实冲突。ArbGraph将检索到的文档分解为原子声明,并将其组织成一个具有明确支持和矛盾关系的冲突感知证据图。在此图的基础上,我们引入了一种基于强度驱动的迭代仲裁机制,通过证据交互传播可信度信号,使系统能够在最终生成之前抑制不可靠和不一致的声明。通过这种方式,ArbGraph将证据验证与文本生成分离,为下游的长文本生成提供了一个连贯的证据基础。我们在两个广泛使用的长文本RAG基准LongFact和RAGChecker上评估了ArbGraph,使用了多个大型语言模型作为基础。实验结果表明,ArbGraph在提高事实召回率和信息密度的同时,减少了幻觉现象和对检索噪声的敏感性。额外的分析显示,这些提升在存在冲突或模糊证据的情况下尤为明显,突显了证据级冲突解决在提高长文本RAG可靠性方面的有效性。该实现已公开发布在https://github.com/1212Judy/ArbGraph。
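The arbitration step described in the ArbGraph abstract above can be pictured with a small sketch. The abstract does not give the update rule, so everything here is an assumption made for illustration: the function name `arbitrate`, the additive update, and the `rate` and `threshold` parameters are hypothetical and are not taken from the ArbGraph implementation. Each claim starts from a prior credibility; support edges raise it, contradiction edges lower it, and claims below the threshold are dropped before generation.

```python
# Hypothetical sketch of credibility propagation over a claim graph.
# The update rule, names, and parameters are assumptions, not ArbGraph's method.
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]  # (claim_a, claim_b, sign): +1 support, -1 contradiction


def arbitrate(
    prior: Dict[str, float],
    edges: List[Edge],
    rounds: int = 10,
    rate: float = 0.5,
    threshold: float = 0.5,
) -> Tuple[Dict[str, float], List[str]]:
    """Iteratively update claim credibility, then keep claims above threshold."""
    cred = dict(prior)
    for _ in range(rounds):
        updated = {}
        for claim, base in prior.items():
            influence = 0.0
            # Edges are treated as undirected: each endpoint influences the other.
            for a, b, sign in edges:
                if a == claim:
                    influence += sign * cred[b]
                elif b == claim:
                    influence += sign * cred[a]
            # Clamp to [0, 1] so credibility stays a bounded score.
            updated[claim] = min(1.0, max(0.0, base + rate * influence))
        cred = updated
    kept = [claim for claim, value in cred.items() if value >= threshold]
    return cred, kept


# Two claims that support each other and both contradict a third claim.
example_prior = {"A": 0.6, "B": 0.6, "C": 0.4}
example_edges = [("A", "B", 1), ("A", "C", -1), ("B", "C", -1)]
example_cred, example_kept = arbitrate(example_prior, example_edges)
```

In this toy graph the mutually supporting claims reinforce each other while the contradicted claim is suppressed, which is the qualitative behaviour the abstract describes for pre-generation arbitration.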
cs.CL / 181 / 2604.18375
IceBreaker for Conversational Agents: Breaking the First-Message Barrier with Personalized Starters
IceBreaker:为对话代理打破首条消息障碍的个性化启动器
Abstract
Conversational agents, such as ChatGPT and Doubao, have become essential daily assistants for billions of users. To further enhance engagement, these systems are evolving from passive responders to proactive companions. However, existing efforts focus on activation within ongoing dialogues, while overlooking a key real-world bottleneck. In the conversation initiation stage, users may have a vague need but no explicit query intent, creating a first-message barrier where the conversation stalls before it begins. To overcome this, we introduce Conversation Starter Generation: generating personalized starters to guide users into conversation. However, unlike in-conversation stages where immediate context guides the response, initiation must operate in a cold-start moment without explicit user intent. To pioneer in this direction, we present IceBreaker that frames human ice-breaking as a two-step handshake: (i) evoke resonance via Resonance-Aware Interest Distillation from session summaries to capture trigger interests, and (ii) stimulate interaction via Interaction-Oriented Starter Generation, optimized with personalized preference alignment and a self-reinforced loop to maximize engagement. Online A/B tests on one of the world's largest conversational agent products show that IceBreaker improves user active days by +0.184% and click-through rate by +9.425%, and has been deployed in production.
Chinese Translation
对话代理,如 ChatGPT 和 Doubao,已成为数十亿用户日常生活中不可或缺的助手。为了进一步增强用户参与度,这些系统正在从被动响应者转变为主动伴侣。然而,现有的努力主要集中在正在进行的对话激活上,而忽视了一个关键的现实瓶颈。在对话启动阶段,用户可能有模糊的需求但没有明确的查询意图,从而形成了首条消息障碍,使得对话在开始前就停滞不前。为了解决这个问题,我们提出了对话启动器生成:生成个性化的启动器以引导用户进入对话。然而,与对话进行阶段不同,启动阶段必须在没有明确用户意图的冷启动时刻进行。为了在这一方向上开拓创新,我们提出了 IceBreaker,将人类破冰行为框架化为两步握手:(i)通过从会话摘要中提取共鸣兴趣的共鸣感知兴趣提炼,激发共鸣以捕捉触发兴趣;(ii)通过互动导向的启动器生成来刺激互动,该生成经过个性化偏好对齐和自我强化循环优化,以最大化参与度。在全球最大的对话代理产品之一上进行的在线 A/B 测试表明,IceBreaker 将用户活跃天数提高了 +0.184%,点击率提高了 +9.425%,并已在生产环境中部署。
cs.CL / 182 / 2604.18389
Understanding the Prompt Sensitivity
理解提示敏感性
Abstract
Prompt sensitivity, which refers to how strongly the output of a large language model (LLM) depends on the exact wording of its input prompt, raises concerns among users about the LLM's stability and reliability. In this work, we consider LLMs as multivariate functions and perform a first-order Taylor expansion, thereby analyzing the relationship between meaning-preserving prompts, their gradients, and the log probabilities of the model's next token. We derive an upper bound on the difference between log probabilities using the Cauchy-Schwarz inequality. We show that LLMs do not internally cluster similar inputs like smaller neural networks do, but instead disperse them. This dispersing behavior leads to an excessively high upper bound on the difference of log probabilities between two meaning-preserving prompts, making it difficult to reduce that difference effectively to 0. In our analysis, we also show which types of meaning-preserving prompt variants are more likely to introduce prompt sensitivity risks in LLMs. In addition, we demonstrate that the upper bound is strongly correlated with an existing prompt sensitivity metric, PromptSensiScore. Moreover, by analyzing the logit variance, we find that prompt templates typically exert a greater influence on logits than the questions themselves. Overall, our results provide a general interpretation for why current LLMs can be highly sensitive to prompts with the same meaning, offering crucial evidence for understanding the prompt sensitivity of LLMs. Code for experiments is available at https://github.com/ku-nlp/Understanding_the_Prompt_Sensitivity.
Chinese Translation
提示敏感性指的是大型语言模型(LLM)的输出在多大程度上依赖于输入提示的确切措辞,这引发了用户对LLM稳定性和可靠性的担忧。在本研究中,我们将LLM视为多元函数,并进行一阶泰勒展开,从而分析保义提示、其梯度与模型下一个标记的对数概率之间的关系。我们利用柯西-施瓦茨不等式推导出对数概率差异的上界。我们表明,LLM并不像较小的神经网络那样在内部聚类相似输入,而是将其分散开来。这种分散行为导致两个保义提示之间的对数概率差异的上界过高,使得有效降低至0变得困难。在我们的分析中,我们还展示了哪些类型的保义提示变体更可能引入LLM中的提示敏感性风险。此外,我们证明了该上界与现有的提示敏感性度量标准PromptSensiScore之间存在强相关性。此外,通过分析对数几率的方差,我们发现提示模板通常对对数几率的影响大于问题本身。总体而言,我们的结果为当前LLM为何对具有相同意义的提示高度敏感提供了一般性解释,为理解LLM的提示敏感性提供了重要证据。实验代码可在 https://github.com/ku-nlp/Understanding_the_Prompt_Sensitivity 获取。
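The argument in the prompt-sensitivity abstract above can be written out as a short derivation. The notation is an assumption introduced here for illustration (the abstract does not fix symbols, and the paper's exact bound may be stated over a different representation): x and x' are input representations of two meaning-preserving prompts, and f is the log probability of a next token.

```latex
% Assumed notation (not from the paper): x, x' are embeddings of two
% meaning-preserving prompts; f(x) = \log p(y \mid x) is the log probability
% of a next token y.
\begin{align}
f(x') &\approx f(x) + \nabla f(x)^{\top} (x' - x)
  && \text{(first-order Taylor expansion)} \\
\lvert f(x') - f(x) \rvert &\approx \lvert \nabla f(x)^{\top} (x' - x) \rvert
  \le \lVert \nabla f(x) \rVert_2 \, \lVert x' - x \rVert_2
  && \text{(Cauchy-Schwarz)}
\end{align}
```

Read this way, the bound stays small only if the two prompts are mapped close together or the gradient is small. A model that disperses similar inputs makes the distance term large, which is consistent with the abstract's claim that the upper bound becomes excessively high.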
cs.CL / 183 / 2604.18396
River-LLM: Large Language Model Seamless Exit Based on KV Share
River-LLM:基于KV共享的无缝退出大型语言模型
Abstract
Large Language Models (LLMs) have demonstrated exceptional performance across diverse domains but are increasingly constrained by high inference latency. Early Exit has emerged as a promising solution to accelerate inference by dynamically bypassing redundant layers. However, in decoder-only architectures, the efficiency of Early Exit is severely bottlenecked by the KV Cache Absence problem, where skipped layers fail to provide the necessary historical states for subsequent tokens. Existing solutions, such as recomputation or masking, either introduce significant latency overhead or incur severe precision loss, failing to bridge the gap between theoretical layer reduction and practical wall-clock speedup. In this paper, we propose River-LLM, a training-free framework that enables seamless token-level Early Exit. River-LLM introduces a lightweight KV-Shared Exit River that allows the backbone's missing KV cache to be naturally generated and preserved during the exit process, eliminating the need for costly recovery operations. Furthermore, we utilize state transition similarity within decoder blocks to predict cumulative KV errors and guide precise exit decisions. Extensive experiments on mathematical reasoning and code generation tasks demonstrate that River-LLM achieves a practical speedup of 1.71x to 2.16x while maintaining high generation quality.
Chinese Translation
大型语言模型(LLMs)在多个领域展现出了卓越的性能,但其推理延迟日益成为制约因素。提前退出(Early Exit)作为一种有前景的解决方案,通过动态跳过冗余层来加速推理。然而,在仅解码器架构中,提前退出的效率受到KV缓存缺失问题的严重制约,跳过的层无法为后续的标记提供必要的历史状态。现有的解决方案,如重新计算或掩蔽,往往引入显著的延迟开销或导致严重的精度损失,未能弥合理论层减少与实际时钟加速之间的差距。在本文中,我们提出了River-LLM,一个无需训练的框架,能够实现无缝的标记级提前退出。River-LLM引入了一种轻量级的KV共享退出河流(KV-Shared Exit River),使得主干网络缺失的KV缓存能够在退出过程中自然生成和保留,从而消除昂贵的恢复操作的需要。此外,我们利用解码器块内的状态转移相似性来预测累积的KV误差,并指导精确的退出决策。在数学推理和代码生成任务上的大量实验表明,River-LLM在保持高生成质量的同时,实现了1.71到2.16倍的实际加速。
cs.CL / 184 / 2604.18398
AlphaContext: An Evolutionary Tree-based Psychometric Context Generator for Creativity Assessment
AlphaContext:一种基于进化树的心理测量上下文生成器用于创造力评估
Abstract
Creativity has become a core competence in the era of LLMs and human-AI collaboration, underpinning innovation in real-world problem solving. Crucially, the systematic improvement of creativity necessitates scientifically valid assessment instruments. Psychometric research recognizes context-based assessment as an effective way to measure creative thinking. However, high-quality expert-designed contexts remain scarce. Existing LLM-based generators often struggle with insufficient assessment cues, weak narrative coherence, limited stylistic diversity, and poor support for creative thinking. To address these challenges, we propose AlphaContext, an evolutionary tree-based psychometric context generator for creativity assessment. First, the HyperTree Outline Planner formalizes expert-designed outlining as a rule-guided hypertree and performs top-down hierarchical planning. The MCTS-based Context Generator fills the outline via MCTS to balance global structure and local quality. Then, the Evolutionary Context Optimizer evolves contexts with MAP-Elites by repeatedly updating niche elites to jointly improve diversity and quality. Finally, the Assessment-Guided Evolution Refiner simulates virtual participants with diverse styles and recycles weak contexts for further evolution. Experiments show that AlphaContext yields an average improvement of 8% over competitive methods across 6 quality metrics.
Chinese Translation
创造力已成为大语言模型(LLMs)和人机协作时代的核心能力,支撑着现实世界问题解决中的创新。系统性地提升创造力需要科学有效的评估工具。心理测量研究认为基于上下文的评估是衡量创造性思维的有效方法。然而,高质量的专家设计上下文仍然稀缺。现有的基于LLM的生成器常常面临评估线索不足、叙述连贯性弱、风格多样性有限以及对创造性思维支持不足等问题。为了解决这些挑战,我们提出了AlphaContext,一种基于进化树的心理测量上下文生成器,用于创造力评估。首先,HyperTree Outline Planner将专家设计的提纲形式化为规则引导的超树,并执行自上而下的层次规划。基于MCTS的上下文生成器通过MCTS填充提纲,以平衡全局结构和局部质量。然后,进化上下文优化器通过MAP-Elites进化上下文,反复更新小众精英以共同提升多样性和质量。最后,评估引导的进化精炼器模拟具有多样化风格的虚拟参与者,并回收弱上下文以进一步进化。实验表明,AlphaContext在6个质量指标上相较于竞争方法平均提升了8%。
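The MAP-Elites step mentioned in the AlphaContext abstract above ("repeatedly updating niche elites") follows a generic pattern that a toy sketch can show. This is the textbook MAP-Elites loop, not the paper's Evolutionary Context Optimizer: the integer candidates and the `toy_*` functions are hypothetical stand-ins for generated contexts, their quality scores, and their style descriptors.

```python
# Generic MAP-Elites loop on a toy problem. The toy_* functions are
# hypothetical stand-ins; they are not part of AlphaContext.
import random


def map_elites(initial, mutate, quality, descriptor, iterations=200, seed=0):
    """Keep the best candidate (elite) found so far in each niche.

    `initial` must be non-empty so there is always a parent to mutate.
    """
    rng = random.Random(seed)
    archive = {}  # niche -> (quality score, candidate)

    def try_insert(candidate):
        niche = descriptor(candidate)
        score = quality(candidate)
        # Replace the niche's elite only if the newcomer scores strictly higher.
        if niche not in archive or score > archive[niche][0]:
            archive[niche] = (score, candidate)

    for candidate in initial:
        try_insert(candidate)
    for _ in range(iterations):
        # Pick a random elite as parent, mutate it, and offer the child.
        _, parent = rng.choice(list(archive.values()))
        try_insert(mutate(parent, rng))
    return archive


def toy_descriptor(x):
    return x // 10  # niche: which block of ten the integer falls in


def toy_quality(x):
    return x % 10  # quality: position inside that block


def toy_mutate(x, rng):
    return x + rng.choice([-1, 1])


archive = map_elites([0, 15, 27], toy_mutate, toy_quality, toy_descriptor)
```

Because each niche keeps its own elite, the archive improves quality inside a niche without letting one high-scoring style crowd out the others, which is the diversity-plus-quality trade-off the abstract refers to.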
cs.CL / 185 / 2604.18401
StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
StepPO:用于代理强化学习的步对齐策略优化
Abstract
General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly stronger agentic capabilities from foundation Large Language Models (LLMs). Agentic Reinforcement Learning (RL) is emerging as a central post-training paradigm for empowering LLMs with these capabilities and is playing an increasingly pivotal role in agent training. Unlike single-turn token-level alignment or reasoning enhancement, as in RLHF and RLVR, Agentic RL targets multi-turn interactive settings, where the goal is to optimize core agentic capabilities such as decision making and tool use while addressing new challenges including delayed and sparse rewards, as well as long and variable context. As a result, the token-centric modeling and optimization paradigm inherited from traditional LLM RL is becoming increasingly inadequate for capturing real LLM agent behavior. In this paper, we present StepPO as a position on step-level Agentic RL. We argue that the conventional token-level Markov Decision Process (MDP) should be advanced to a step-level MDP formulation, and that the step, rather than the token, should be regarded as the proper action representation for LLM agents. We then propose step-level credit assignment as the natural optimization counterpart of this formulation, thereby aligning policy optimization and reward propagation with the granularity of agent decisions. Finally, we discuss the key systems designs required to realize step-level Agentic RL in practice and preliminary experiments provide initial evidence for the effectiveness of this perspective. We hope that the step-aligned, step-level paradigm embodied in StepPO offers the Agentic RL community a useful lens for understanding agent behavior and helps advance LLMs toward stronger general-agent capabilities.
Chinese Translation
通用代理已经催生了诸如 OpenClaw 和 Claude Code 等惊人的应用。随着这些代理系统(即 Harnesses)追求更大胆的目标,它们对基础大型语言模型(LLMs)提出了越来越强的代理能力需求。代理强化学习(Agentic Reinforcement Learning, RL)作为一种新兴的后训练范式,正在赋予 LLM 这些能力,并在代理训练中扮演着越来越关键的角色。与 RLHF 和 RLVR 中的单回合令牌级对齐或推理增强不同,代理 RL 目标是多回合交互环境,其中的目标是优化核心代理能力,如决策和工具使用,同时应对延迟和稀疏奖励以及长且可变的上下文等新挑战。因此,传统 LLM RL 继承的以令牌为中心的建模和优化范式变得越来越不足以捕捉真实的 LLM 代理行为。在本文中,我们提出了 StepPO 作为步级代理 RL 的一种方法。我们认为,传统的令牌级马尔可夫决策过程(Markov Decision Process, MDP)应当提升为步级 MDP 形式,并且步而非令牌应被视为 LLM 代理的适当动作表示。接着,我们提出步级信用分配作为这种形式的自然优化对应,从而将策略优化和奖励传播与代理决策的粒度对齐。最后,我们讨论了实现步级代理 RL 所需的关键系统设计,并且初步实验提供了这一视角有效性的初步证据。我们希望 StepPO 中体现的步对齐、步级范式为代理 RL 社区提供一个有用的视角,以理解代理行为,并帮助推动 LLM 向更强的通用代理能力发展。
cs.CL / 186 / 2604.18423
BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources
BhashaSutra:印度自然语言处理数据集、语料库和资源的任务中心统一调查
Abstract
India's linguistic landscape, spanning 22 scheduled languages and hundreds of marginalized dialects, has driven rapid growth in NLP datasets, benchmarks, and pretrained models. However, no dedicated survey consolidates resources developed specifically for Indian languages. Existing reviews either focus on a few high-resource languages or subsume Indian languages within broader multilingual settings, limiting coverage of low-resource and culturally diverse varieties. To address this gap, we present the first unified survey of Indian NLP resources, covering 200+ datasets, 50+ benchmarks, and 100+ models, tools, and systems across text, speech, multimodal, and culturally grounded tasks. We organize resources by linguistic phenomena, domains, and modalities; analyze trends in annotation, evaluation, and model design; and identify persistent challenges such as data sparsity, uneven language coverage, script diversity, and limited cultural and domain generalization. This survey offers a consolidated foundation for equitable, culturally grounded, and scalable NLP research in the Indian linguistic ecosystem.
Chinese Translation
印度的语言景观涵盖22种法定语言和数百种边缘方言,推动了自然语言处理(NLP)数据集、基准和预训练模型的快速增长。然而,目前没有专门的调查整合为印度语言开发的资源。现有的评审要么集中于少数高资源语言,要么将印度语言纳入更广泛的多语言环境中,从而限制了对低资源和文化多样性变体的覆盖。为了解决这一空白,我们呈现了首个印度NLP资源的统一调查,涵盖200多个数据集、50多个基准以及100多个模型、工具和系统,涉及文本、语音、多模态和文化基础任务。我们根据语言现象、领域和模态组织资源;分析注释、评估和模型设计的趋势;并识别出数据稀疏性、不均衡的语言覆盖、脚本多样性以及有限的文化和领域泛化等持续挑战。本调查为印度语言生态系统中的公平、文化基础和可扩展的NLP研究提供了一个整合的基础。
cs.CL / 187 / 2604.18487
Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety
对抗人文学科基准:前沿模型安全中的风格鲁棒性结果
Abstract
The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original attacks record 3.84% attack success rate (ASR), while transformed methods range from 36.8% to 65.0%, yielding 55.75% overall ASR across 31 frontier models. Under a European Union AI Act Code-of-Practice-inspired systemic-risk lens, Chemical, biological, radiological and nuclear (CBRN) is the highest bucket. Taken together, this lack of stylistic robustness suggests that current safety techniques suffer from weak generalization: deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.
Chinese Translation
对抗人文学科基准(Adversarial Humanities Benchmark, AHB)评估模型安全拒绝是否能够在远离熟悉的有害提示形式时依然有效。基于从 MLCommons AILuminate 中提取的有害任务,该基准通过人文学科风格的转化重写相同的目标,同时保持意图不变。这将对抗诗歌(Adversarial Poetry)和对抗故事(Adversarial Tales)的文献扩展到一个更广泛的风格混淆和目标隐蔽的基准系列。在这里报告的基准结果中,原始攻击记录的攻击成功率(Attack Success Rate, ASR)为 3.84%,而转化方法的成功率范围为 36.8% 到 65.0%,在 31 个前沿模型中总体 ASR 为 55.75%。在受欧洲联盟人工智能法案(EU AI Act)行为规范启发的系统风险视角下,化学、生物、放射和核(Chemical, Biological, Radiological and Nuclear, CBRN)是风险最高的类别。综合来看,这种缺乏风格鲁棒性表明当前的安全技术存在弱泛化的问题:对“无害性”(non-maleficence)的深刻理解仍然是前沿模型安全中一个中心未解决的问题。
cs.CL / 188 / 2604.18490
LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation
LQM:基于语言动机的多维机器翻译质量指标
Abstract
Existing MT evaluation frameworks, including automatic metrics and human evaluation schemes such as Multidimensional Quality Metrics (MQM), are largely language-agnostic. However, they often fail to capture dialect- and culture-specific errors in diglossic languages (e.g., Arabic), where translation failures stem from mismatches in language variety, content coverage, and pragmatic appropriateness rather than surface form alone. We introduce LQM: Linguistically Motivated Multidimensional Quality Metrics for MT. LQM is a hierarchical error taxonomy for diagnosing MT errors through six linguistically grounded levels: sociolinguistics, pragmatics, semantics, morphosyntax, orthography, and graphetics (Figure 1). We construct a bidirectional parallel corpus of 3,850 sentences (550 per variety) spanning seven Arabic dialects (Egyptian, Emirati, Jordanian, Mauritanian, Moroccan, Palestinian, and Yemeni), derived from conversational, culturally rich content. We evaluate six LLMs in a zero-shot setting and conduct expert span-level human annotation using LQM, producing 6,113 labeled error spans across 3,495 unique erroneous sentences, along with severity-weighted quality scores. We complement this analysis with an automatic metric (spBLEU). Though validated here on Arabic, LQM is a language-agnostic framework designed to be easily applied to or adapted for other languages. The LQM-annotated error data, prompts, and annotation guidelines are publicly available at https://github.com/UBC-NLP/LQM_MT.
Chinese Translation
现有的机器翻译(MT)评估框架,包括自动指标和人类评估方案(如多维质量指标(MQM)),在很大程度上是语言无关的。然而,它们往往无法捕捉到双语语言(例如阿拉伯语)中特有的方言和文化错误,在这些语言中,翻译失败源于语言变体、内容覆盖和语用适当性之间的不匹配,而不仅仅是表面形式。我们提出了LQM:基于语言动机的多维机器翻译质量指标。LQM是一个分层错误分类法,通过六个以语言为基础的层次来诊断机器翻译错误:社会语言学、语用学、语义学、形态句法、正字法和图形学(见图1)。我们构建了一个双向平行语料库,包含3,850个句子(每种方言550个),涵盖七种阿拉伯方言(埃及方言、阿联酋方言、约旦方言、毛里塔尼亚方言、摩洛哥方言、巴勒斯坦方言和也门方言),这些句子来源于对话和文化丰富的内容。我们在零样本设置下评估了六个大型语言模型(LLMs),并使用LQM进行专家级跨度级人类注释,生成了6,113个标记的错误跨度,涵盖3,495个独特的错误句子,并附上严重性加权的质量评分。我们还用自动指标(spBLEU)补充了这一分析。尽管在此处对阿拉伯语进行了验证,LQM是一个语言无关的框架,旨在易于应用于或适应其他语言。LQM注释的错误数据、提示和注释指南可在https://github.com/UBC-NLP/LQM_MT公开获取。
cs.CL / 189 / 2604.18509
MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation
MASS-RAG:多智能体合成检索增强生成
Abstract
Large language models (LLMs) are widely used in retrieval-augmented generation (RAG) to incorporate external knowledge at inference time. However, when retrieved contexts are noisy, incomplete, or heterogeneous, a single generation process often struggles to reconcile evidence effectively. We propose MASS-RAG, a multi-agent synthesis approach to retrieval-augmented generation that structures evidence processing into multiple role-specialized agents. MASS-RAG applies distinct agents for evidence summarization, evidence extraction, and reasoning over retrieved documents, and combines their outputs through a dedicated synthesis stage to produce the final answer. This design exposes multiple intermediate evidence views, allowing the model to compare and integrate complementary information before answer generation. Experiments on four benchmarks show that MASS-RAG consistently improves performance over strong RAG baselines, particularly in settings where relevant evidence is distributed across retrieved contexts.
Chinese Translation
大型语言模型(LLMs)在检索增强生成(RAG)中被广泛应用,以在推理时融入外部知识。然而,当检索到的上下文噪声大、信息不完整或异质时,单一的生成过程往往难以有效地调和证据。我们提出了 MASS-RAG,一种多智能体合成方法,用于检索增强生成,将证据处理结构化为多个角色专门的智能体。MASS-RAG为证据摘要、证据提取和对检索文档的推理应用不同的智能体,并通过专门的合成阶段结合它们的输出,以生成最终答案。这一设计暴露了多个中间证据视图,使模型能够在生成答案之前比较和整合互补信息。在四个基准测试上的实验表明,MASS-RAG在强大的RAG基线之上持续提高了性能,特别是在相关证据分散在检索上下文中的情况下。
cs.CL / 190 / 2604.18539
Transition-Matrix Regularization for Next Dialogue Act Prediction in Counselling Conversations
咨询对话中下一对话行为预测的转移矩阵正则化
Abstract
This paper studies how empirical dialogue-flow statistics can be incorporated into Next Dialogue Act Prediction (NDAP). A KL regularization term is proposed that aligns predicted act distributions with corpus-derived transition patterns. Evaluated on a 60-class German counselling taxonomy using 5-fold cross-validation, this improves macro-F1 by 9-42% relative, depending on the encoder, and substantially improves dialogue-flow alignment. Cross-dataset validation on HOPE suggests that improvements transfer across languages and counselling domains. In systematic ablations across pretrained encoders and architectures, the findings indicate that transition regularization provides consistent gains and disproportionately benefits weaker baseline models. The results suggest that lightweight discourse-flow priors complement pretrained encoders, especially in fine-grained, data-sparse dialogue tasks.
Chinese Translation
本文研究了如何将经验对话流统计信息纳入下一对话行为预测(Next Dialogue Act Prediction, NDAP)。提出了一种KL正则化项,该项将预测的行为分布与基于语料库的转移模式对齐。在使用5折交叉验证的60类德国咨询分类法上进行评估,结果显示相较于编码器,宏观F1值提高了9%至42%,并显著改善了对话流的对齐。HOPE数据集上的跨数据集验证表明,这些改进在不同语言和咨询领域之间具有迁移性。在对预训练编码器和架构进行系统性消融实验中,结果表明转移正则化提供了一致的增益,并对较弱的基线模型产生了不成比例的益处。结果表明,轻量级的对话流先验补充了预训练编码器,特别是在细粒度、数据稀缺的对话任务中。
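The KL regularization idea in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not say which KL direction, smoothing, or weight is used, so the choice of KL(prior || prediction), the add-one smoothing, and the `lam` weight are all hypothetical, as are the function names.

```python
# Minimal sketch of a transition-matrix KL regularizer for next-act prediction.
# KL direction, smoothing, and lam are assumptions, not taken from the paper.
import math


def transition_matrix(sequences, num_acts, smoothing=1.0):
    """Row-normalized counts of act -> next-act transitions in a corpus."""
    counts = [[smoothing] * num_acts for _ in range(num_acts)]
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1.0
    return [[c / sum(row) for c in row] for row in counts]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def regularized_loss(logits, target, prev_act, trans, lam=0.1):
    """Cross-entropy plus lam * KL(transition prior || predicted distribution)."""
    probs = softmax(logits)
    ce = -math.log(probs[target])
    kl = kl_divergence(trans[prev_act], probs)
    return ce + lam * kl, ce, kl


# Toy corpus with three dialogue acts (0, 1, 2).
T = transition_matrix([[0, 1, 2, 1], [0, 1, 1, 2]], num_acts=3)
loss, ce, kl = regularized_loss([0.0, 0.0, 0.0], target=1, prev_act=0, trans=T)
```

The extra term is zero when the predicted distribution matches the corpus transition row for the previous act and grows as the prediction drifts away from it, so it acts as a soft prior rather than a hard constraint.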
cs.CL / 191 / 2604.18556
GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
GSQ:通过 Gumbel-Softmax 采样实现的高精度低精度标量量化用于大规模语言模型
Abstract
Weight quantization has become a standard tool for efficient LLM deployment, especially for local inference, where models are now routinely served at 2-3 bits per parameter. The state of the art is currently split into two sets of methods: simple scalar quantization techniques, such as GPTQ or AWQ, which are widely deployed but plateau in accuracy at 3-4 bits per parameter (bpp), and "second-generation" vector- or trellis-quantized methods, such as QTIP, GPTVQ and AQLM, which push the accuracy frontier at low bit-widths but are notoriously hard to implement and to scale, and have gained relatively less traction. In this paper, we ask whether this gap is fundamental, or whether a carefully optimized scalar quantizer can recover most of it. We answer in the affirmative, by introducing GSQ (Gumbel-Softmax Quantization), a post-training scalar quantization method which jointly learns the per-coordinate grid assignments and the per-group scales using a Gumbel-Softmax relaxation of the discrete grid. GSQ matches the cardinality of the relaxation to the small number of levels available in the target bit-width regime (e.g., 3 and 8 levels for ternary and 3 bpp, respectively), making the relaxation tight and the optimization tractable. Practically, on the standard Llama-3.1-8B/70B-Instruct models, GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits, while using a symmetric scalar grid with group-wise quantization, and is thus fully compatible with existing scalar inference kernels. We further show that GSQ scales to trillion-scale Mixture-of-Experts models such as Kimi-K2.5, where vector-quantized methods are difficult to apply.
Chinese Translation
权重量化已成为高效大规模语言模型(LLMs)部署的标准工具,特别是在本地推理中,模型现在通常以每个参数 2-3 位的精度进行服务。目前,最先进的技术分为两类:简单的标量量化技术,如 GPTQ 或 AWQ,这些方法广泛应用但在每个参数 3-4 位(bpp)时准确率达到瓶颈;以及“第二代”向量或格子量化方法,如 QTIP、GPTVQ 和 AQLM,这些方法在低位宽下推动了准确率的前沿,但实现和扩展难度较大,获得的关注相对较少。在本文中,我们探讨这一差距是否是根本性的,或者经过精心优化的标量量化器是否能够恢复大部分差距。我们肯定地回答了这个问题,提出了 GSQ(Gumbel-Softmax 量化),这是一种后训练标量量化方法,它通过 Gumbel-Softmax 对离散网格的松弛,联合学习每个坐标的网格分配和每组的缩放。GSQ 将松弛的基数与目标位宽范围内可用的少量级别(例如,三元组和 3 bpp 分别为 3-8 个级别)相匹配,使得松弛紧凑且优化可行。在实际应用中,在标准的 Llama-3.1-8B/70B-Instruct 模型上,GSQ 在 2 位和 3 位时缩小了标量量化与 QTIP 前沿之间的大部分差距,同时使用对称标量网格和组级量化,因此与现有的标量推理内核完全兼容。我们进一步展示了 GSQ 可以扩展到万亿级的专家混合模型,如 Kimi-K2.5,而向量量化方法在这些模型中难以应用。
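The Gumbel-Softmax relaxation named in the GSQ abstract above can be illustrated for a single weight. This sketch shows only the forward pass of the standard relaxation over a small symmetric grid; it does not reproduce GSQ's training procedure, and the function names, the ternary grid, and the `scale` and `tau` values are assumptions chosen for the example.

```python
# Forward-pass sketch of a Gumbel-Softmax relaxation over a scalar grid.
# Names, grid, scale, and tau are illustrative assumptions, not GSQ's code.
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Sample relaxed one-hot probabilities over grid levels."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_quantize(level_logits, grid, scale, tau, rng):
    """Relaxed weight: scale times the probability-weighted grid level."""
    probs = gumbel_softmax(level_logits, tau, rng)
    return scale * sum(p * g for p, g in zip(probs, grid))


def hard_quantize(level_logits, grid, scale):
    """Discrete weight used at inference: the most likely grid level."""
    best = max(range(len(grid)), key=lambda i: level_logits[i])
    return scale * grid[best]


rng = random.Random(0)
ternary_grid = [-1.0, 0.0, 1.0]  # 3 levels, matching the ternary case
soft = soft_quantize([0.1, 2.0, -1.0], ternary_grid, scale=0.5, tau=1.0, rng=rng)
hard = hard_quantize([0.1, 2.0, -1.0], ternary_grid, scale=0.5)
```

With only three or eight levels per weight, the relaxed distribution is small enough to store per coordinate, and lowering `tau` pushes the soft value toward the hard grid level. In a learned setting the level logits and the scale would be the trainable quantities.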
cs.CL / 192 / 2604.18563
Dual Alignment Between Language Model Layers and Human Sentence Processing
语言模型层与人类句子处理之间的双重对齐
Abstract
A recent study (Kuribayashi et al., 2025) has shown that human sentence processing behavior, typically measured on syntactically unchallenging constructions, can be effectively modeled using surprisal from early layers of large language models (LLMs). This raises the question of whether such advantages of internal layers extend to more syntactically challenging constructions, where surprisal has been reported to underestimate human cognitive effort. In this paper, we begin by exploring internal layers that better estimate human cognitive effort observed in syntactic ambiguity processing in English. Our experiments show that, in contrast to naturalistic reading, later layers better estimate such cognitive effort, but still underestimate the human data. This dual alignment sheds light on different modes of sentence processing in humans and LMs: naturalistic reading employs a somewhat weak prediction akin to earlier layers of LMs, while syntactically challenging processing requires more fully-contextualized representations, better modeled by later layers of LMs. Motivated by these findings, we also explore several probability-update measures using shallow and deep layers of LMs, showing a complementary advantage over single-layer surprisal in reading time modeling.
Chinese Translation
最近的一项研究(Kuribayashi et al., 2025)表明,人类句子处理行为,通常在语法上不具挑战性的结构上进行测量,可以有效地通过大型语言模型(LLMs)早期层的惊讶度进行建模。这引发了一个问题,即这种内部层的优势是否扩展到更具语法挑战性的结构,在这些结构中,惊讶度被报告为低估了人类的认知努力。在本文中,我们首先探讨了更好地估计在英语中观察到的句法模糊处理中的人类认知努力的内部层。我们的实验表明,与自然阅读相比,后期层更好地估计这种认知努力,但仍然低估了人类数据。这种双重对齐揭示了人类和语言模型(LMs)在句子处理中的不同模式:自然阅读采用了类似于语言模型早期层的相对较弱的预测,而语法上具有挑战性的处理则需要更充分的上下文化表示,更好地由语言模型的后期层建模。基于这些发现,我们还探讨了使用语言模型的浅层和深层的几种概率更新度量,显示出在阅读时间建模中对单层惊讶度的互补优势。