cs.RO / 1 / 2603.03380
LiteVLA-Edge: Quantized On-Device Multimodal Control for Embedded Robotics
Abstract
Vision-Language-Action (VLA) models provide a unified framework for perception, language conditioning, and action generation, but many existing systems remain difficult to deploy in embedded robotic settings because of their computational requirements and inference latency. In this paper, we present LiteVLA-Edge, a deployment-oriented VLA pipeline for fully on-device inference on Jetson Orin-class hardware. Our approach combines supervised image-to-action fine-tuning in FP32 with post-training 4-bit GGUF quantization and GPU-accelerated inference through the \texttt{llama.cpp} runtime. Under our deployment configuration, LiteVLA-Edge achieves a mean end-to-end latency of 150.5\,ms (approximately 6.6\,Hz) while operating entirely offline within a ROS~2-integrated perception--reasoning--action pipeline. Rather than introducing a new policy objective, our contribution is a practical systems path for executing compact multimodal control models locally on embedded hardware while preserving modular interfaces between perception, reasoning, and actuation. These results establish timing feasibility for reactive language-conditioned control and provide a reproducible baseline for future task-level evaluation of on-device VLAs in robotics.
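As a quick consistency check on the reported timing, a mean end-to-end latency of 150.5 ms corresponds to roughly 6.6 Hz. A minimal sketch of how such per-cycle statistics could be gathered around an inference call follows; `run_inference` is a placeholder stub, not the paper's API:

```python
import time
import statistics

def run_inference(image, instruction):
    """Stand-in stub for the quantized VLA forward pass (not the paper's API)."""
    time.sleep(0.005)          # placeholder for model latency
    return [0.0] * 7           # e.g., a 7-DoF action vector

def profile_latency(n_trials=10):
    """Mean end-to-end latency (s) and the implied control rate (Hz)."""
    samples = []
    for _ in range(n_trials):
        t0 = time.perf_counter()
        run_inference(image=None, instruction="pick up the cube")
        samples.append(time.perf_counter() - t0)
    mean_s = statistics.mean(samples)
    return mean_s, 1.0 / mean_s

# The paper's numbers are self-consistent: 150.5 ms -> ~6.6 Hz.
assert abs(1.0 / 0.1505 - 6.6) < 0.1
```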
cs.RO / 2 / 2603.03390
Multi-Agent-Based Simulation of Archaeological Mobility in Uneven Landscapes
Abstract
Understanding mobility, movement, and interaction in archaeological landscapes is essential for interpreting past human behavior, transport strategies, and spatial organization, yet such processes are difficult to reconstruct from static archaeological evidence alone. This paper presents a multi-agent-based modeling framework for simulating archaeological mobility in uneven landscapes, integrating realistic terrain reconstruction, heterogeneous agent modeling, and adaptive navigation strategies. The proposed approach combines global path planning with local dynamic adaptation through reinforcement learning, enabling agents to respond efficiently to dynamic obstacles and interactions without costly global replanning. Real-world digital elevation data are processed into high-fidelity three-dimensional environments, preserving slope and terrain constraints that directly influence agent movement. The framework explicitly models diverse agent types, including human groups and animal-based transport systems, each parameterized by empirically grounded mobility characteristics such as load, slope tolerance, and physical dimensions. Two archaeologically inspired use cases demonstrate the applicability of the approach: a terrain-aware pursuit and evasion scenario and a comparative transport analysis involving pack animals and wheeled carts. The results highlight the impact of terrain morphology, visibility, and agent heterogeneity on movement outcomes, while the proposed hybrid navigation strategy provides a computationally efficient and interpretable solution for large-scale, dynamic archaeological simulations.
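The abstract does not specify the agents' mobility cost model; one common, empirically grounded choice for slope-dependent movement in this literature is Tobler's hiking function. The sketch below uses it to turn slope and an assumed `load_factor` parameter into a traversal-time cost:

```python
import math

def tobler_speed(slope):
    """Walking speed in km/h as a function of terrain slope (dh/dx);
    Tobler's hiking function peaks on a slight (-5%) downhill."""
    return 6.0 * math.exp(-3.5 * abs(slope + 0.05))

def traversal_cost(distance_km, slope, load_factor=1.0):
    """Time (hours) to cross a terrain cell; `load_factor` > 1 slows
    heavily loaded agents or pack animals (an assumed parameter)."""
    return distance_km * load_factor / tobler_speed(slope)

# Uphill costs more than flat; a slight downhill is cheapest.
assert traversal_cost(1.0, 0.2) > traversal_cost(1.0, 0.0) > traversal_cost(1.0, -0.05)
```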
cs.RO / 3 / 2603.03452
Impact of Localization Errors on Label Quality for Online HD Map Construction
Abstract
High-definition (HD) maps are crucial for autonomous vehicles, but their creation and maintenance are very costly. This motivates the idea of online HD map construction. To provide a continuous large-scale stream of training data, existing HD maps can be used as labels for onboard sensor data from consumer vehicle fleets. However, compared to current, well-curated HD map perception datasets, this fleet data suffers from localization errors, resulting in distorted map labels. We introduce three kinds of localization errors (Ramp, Gaussian, and Perlin noise) to examine their influence on generated map labels. We train a variant of MapTRv2, a state-of-the-art online HD map construction model, on the Argoverse 2 dataset with various levels of localization errors and assess the degradation of model performance. Since localization errors affect distant labels more severely, but are also less significant to driving performance, we introduce a distance-based map construction metric. Our experiments reveal that localization noise affects model performance significantly. We demonstrate that errors in heading angle exert a more substantial influence than position errors, as angle errors result in a greater distortion of labels as distance to the vehicle increases. Furthermore, we demonstrate that the model benefits from non-distorted ground truth (GT) data and that performance decreases more than linearly with the amount of noisy data. Our study additionally provides a qualitative evaluation of the extent to which localization errors influence the construction of HD maps.
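The claim that heading errors dominate at range follows directly from geometry: rotating the ego frame displaces a label in proportion to its distance, while a translation displaces it by a constant amount. An illustrative check (not the paper's metric):

```python
import math

def label_displacement(distance_m, pos_err_m=0.0, heading_err_rad=0.0):
    """Displacement of a map label at a given range when the ego pose
    carries a position offset and/or a heading offset."""
    # A point at (distance, 0) in the true frame, viewed from a pose
    # perturbed by a translation along x and a rotation about the origin.
    x = distance_m * math.cos(heading_err_rad) + pos_err_m
    y = distance_m * math.sin(heading_err_rad)
    return math.hypot(x - distance_m, y)

# A 1-degree heading error displaces a label at 60 m by about 1 m,
# already exceeding a constant 0.5 m position error at any range.
one_deg = math.radians(1.0)
assert label_displacement(60.0, heading_err_rad=one_deg) > label_displacement(10.0, heading_err_rad=one_deg)
assert label_displacement(60.0, heading_err_rad=one_deg) > label_displacement(60.0, pos_err_m=0.5)
```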
cs.RO / 4 / 2603.03453
Radar-based Pose Optimization for HD Map Generation from Noisy Multi-Drive Vehicle Fleet Data
Abstract
High-definition (HD) maps are important for autonomous driving, but their manual generation and maintenance are very expensive. This motivates the use of an automated map generation pipeline. Fleet vehicles provide sufficient sensors for map generation, but their measurements are less precise, introducing noise into the mapping pipeline. This work focuses on mitigating the localization noise component by aligning raw radar point clouds associated with vehicle poses from different drives and performing pose graph optimization to produce a globally optimized solution across all drives present in the dataset. The improved poses are first used to generate a global radar occupancy map, intended to facilitate precise on-vehicle localization. Through qualitative analysis, we show contrast-rich feature clarity, focusing on omnipresent guardrail posts as the main feature type observable in the map. Second, the improved poses can serve as the basis for an existing lane boundary map generation pipeline, substantially improving the map output compared to its original optimization approach based purely on line detection.
cs.RO / 5 / 2603.03495
Navigating in Uncertain Environments with Heterogeneous Visibility
Abstract
Navigating an environment with uncertain connectivity requires a strategic balance between minimizing the cost of traversal and seeking information to resolve map ambiguities. Unlike previous approaches that rely on local sensing, we utilize a framework where nodes possess varying visibility levels, allowing distant edges to be observed from certain vantage points. We propose a novel heuristic algorithm that balances the cost of detouring to high-visibility locations against the information gained, by optimizing the sum of a custom observation reward and the cost of traversal. We introduce a technique that samples the shortest path over numerous realizations of the environment, which we use to define an edge's utility for observation and to quickly estimate the path with the highest reward. Our approach can be easily adapted to a variety of scenarios by tuning a single hyperparameter that determines the importance of observation. We test our method on a variety of uncertain navigation tasks, including a map based on real-world topographical data. The method demonstrates a lower mean cost of traversal compared to a shortest-path baseline that does not consider observation, and has exponentially lower computational overhead compared to an existing method for balancing observation with path cost minimization.
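The sampling idea can be sketched as follows: draw many realizations of the uncertain map, compute the shortest path in each, and score each uncertain edge by how often it lies on that path. The graph, probabilities, and scoring below are illustrative, not the paper's implementation:

```python
import heapq
import random

def shortest_path(graph, src, dst):
    """Dijkstra on a dict-of-dicts graph; returns (cost, node list)."""
    pq = [(0.0, src, [src])]
    seen = set()
    while pq:
        cost, u, path = heapq.heappop(pq)
        if u in seen:
            continue
        seen.add(u)
        if u == dst:
            return cost, path
        for v, w in graph.get(u, {}).items():
            if v not in seen:
                heapq.heappush(pq, (cost + w, v, path + [v]))
    return float("inf"), []

def edge_utility(base_graph, uncertain, src, dst, n_samples=400, seed=0):
    """Estimate, for each uncertain edge, the fraction of sampled world
    realizations whose shortest path uses it: a proxy for how valuable
    observing that edge is before committing to a route."""
    rng = random.Random(seed)
    counts = {e: 0 for e in uncertain}
    for _ in range(n_samples):
        g = {u: dict(nbrs) for u, nbrs in base_graph.items()}
        for (u, v), p in uncertain.items():
            if rng.random() > p:          # edge absent in this realization
                g[u].pop(v, None)
        _, path = shortest_path(g, src, dst)
        on_path = set(zip(path, path[1:]))
        for e in counts:
            if e in on_path:
                counts[e] += 1
    return {e: c / n_samples for e, c in counts.items()}

# A shortcut present 70% of the time is used whenever it exists.
graph = {"A": {"B": 1.0, "C": 5.0}, "B": {"D": 1.0}, "C": {"D": 1.0}, "D": {}}
util = edge_utility(graph, {("A", "B"): 0.7}, "A", "D")
assert 0.6 < util[("A", "B")] < 0.8
```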
cs.RO / 6 / 2603.03499
Overlapping Domain Decomposition for Distributed Pose Graph Optimization
Abstract
We present ROBO (Riemannian Overlapping Block Optimization), a distributed and parallel approach to multi-robot pose graph optimization (PGO) based on the idea of overlapping domain decomposition. ROBO offers a middle ground between centralized and fully distributed solvers, where the amount of pose information shared between robots at each optimization iteration can be set according to the available communication resources. Sharing additional pose information between neighboring robots effectively creates overlapping optimization blocks in the underlying pose graph, which substantially reduces the number of iterations required to converge. Through extensive experiments on benchmark PGO datasets, we demonstrate the applicability and feasibility of ROBO in different initialization scenarios, using various cost functions, and under different communication regimes. We also analyze the tradeoff between the increased communication and local computation required by ROBO's overlapping blocks and the resulting faster convergence. We show that overlaps with an average inter-robot data cost of only 36 Kb per iteration can converge 3.1$\times$ faster in terms of iterations than state-of-the-art distributed PGO approaches. Furthermore, we develop an asynchronous variant of ROBO that is robust to network delays and suitable for real-world robotic applications.
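ROBO's Riemannian machinery is beyond an abstract, but the core effect of overlap can be reproduced on a toy problem: overlapping block (Schwarz-style) relaxation on a 1-D Poisson system converges in far fewer sweeps than the non-overlapping equivalent. The system size, block layout, and tolerance below are illustrative:

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system (sub-diagonal a, diagonal b,
    super-diagonal c); a[0] and c[-1] are unused."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def schwarz_iterations(n=32, overlap=0, tol=1e-8, max_it=10000):
    """Alternating (multiplicative) Schwarz on the 1-D Poisson system
    A x = b with A = tridiag(-1, 2, -1), b = 1, split into two blocks;
    returns the number of sweeps needed to reach the residual tolerance."""
    b_rhs = [1.0] * n
    x = [0.0] * n
    mid = n // 2
    blocks = [(0, mid + overlap), (mid - overlap, n)]
    for it in range(1, max_it + 1):
        for lo, hi in blocks:
            m = hi - lo
            d = list(b_rhs[lo:hi])
            if lo > 0:
                d[0] += x[lo - 1]    # Dirichlet data from the neighbor block
            if hi < n:
                d[-1] += x[hi]
            x[lo:hi] = thomas([-1.0] * m, [2.0] * m, [-1.0] * m, d)
        r = 0.0                      # residual of the *global* system
        for i in range(n):
            ax = 2 * x[i] - (x[i - 1] if i > 0 else 0.0) - (x[i + 1] if i < n - 1 else 0.0)
            r = max(r, abs(ax - b_rhs[i]))
        if r < tol:
            return it
    return max_it

# Sharing a few rows between the two blocks cuts the sweep count sharply.
assert schwarz_iterations(overlap=4) < schwarz_iterations(overlap=0)
```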
cs.RO / 7 / 2603.03514
Sampling-Based Motion Planning with Scene Graphs Under Perception Constraints
Abstract
It will be increasingly common for robots to operate in cluttered human-centered environments such as homes, workplaces, and hospitals, where the robot is often tasked with maintaining perception constraints, such as monitoring people or multiple objects, for safety and reliability while executing its task. However, existing perception-aware approaches typically focus on low-degree-of-freedom (DoF) systems or only consider a single object in the context of high-DoF robots. This motivates us to consider the problem of perception-aware motion planning for high-DoF robots that accounts for multi-object monitoring constraints. We employ a scene graph representation of the environment, which offers great potential for incorporating long-horizon task and motion planning thanks to its rich semantic and spatial information. However, it does not capture perception-constraint information, such as the viewpoints the user prefers. To address these challenges, we propose MOPS-PRM, a roadmap-based motion planner that integrates the perception cost of observing multiple objects or humans directly into motion planning for high-DoF robots. The perception cost is embedded in each object as part of a scene graph and used to selectively sample configurations for roadmap construction, implicitly enforcing the perception constraints. Our method is extensively validated in both simulated and real-world experiments, achieving an improvement of roughly 36% in the average number of detected objects and a roughly 17% higher track rate relative to other perception-constrained baselines, with comparable planning times and path lengths.
cs.RO / 8 / 2603.03537
Passive Phase-Oriented Impedance Shaping for Rapid Acceleration in Soft Robotic Swimmers
Abstract
Rapid acceleration and burst maneuvers in underwater robots depend less on maintaining precise resonance and more on force--velocity phase alignment during thrust generation. In this work, we investigate constrained-layer damping (CLD) as a passive mechanism for frequency-selective impedance shaping in soft robotic swimmers. Unlike conventional stiffness-tuning approaches, CLD selectively amplifies the dissipative component of bending impedance while preserving storage stiffness, passively shifting the impedance composition toward dissipative dominance as actuation frequency increases. We characterize this behavior through dry impedance measurements, demonstrate that CLD enhances thrust and alters force--motion phase relationships across Strouhal numbers in constrained propulsion tests, and validate that passive impedance shaping yields a nearly five-fold increase in peak acceleration and a three-fold increase in terminal velocity in unconstrained swimming trials. These results establish phase-oriented passive impedance modulation as a simple, control-free pathway for improving transient propulsion in soft robotic systems.
cs.RO / 9 / 2603.03546
Real-time loosely coupled GNSS and IMU integration via Factor Graph Optimization
Abstract
Accurate positioning, navigation, and timing (PNT) is fundamental to the operation of modern technologies and a key enabler of autonomous systems. A very important component of PNT is the Global Navigation Satellite System (GNSS), which enables outdoor positioning. Modern research directions have pushed the performance of GNSS localization to new heights by fusing GNSS measurements with other sensory information, mainly measurements from Inertial Measurement Units (IMUs). In this paper, we propose a loosely coupled architecture to integrate GNSS and IMU measurements using a Factor Graph Optimization (FGO) framework. Because the FGO method can be computationally challenging and is often used as a post-processing method, our focus is on assessing its localization accuracy and service availability while operating in real time in challenging environments (urban canyons). Experimental results on the UrbanNav-HK-MediumUrban-1 dataset show that the proposed approach achieves real-time operation and increased service availability compared to batch FGO methods. While this improvement comes at the cost of reduced positioning accuracy, the paper provides a detailed analysis of the trade-offs between accuracy, availability, and computational efficiency that characterize real-time FGO-based GNSS/IMU fusion.
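In one dimension, a loosely coupled batch FGO reduces to a linear least-squares problem over all poses, with absolute GNSS factors and relative odometry factors. The weights and measurements below are illustrative; the sketch shows how smooth odometry suppresses a GNSS outlier:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fuse_1d(gnss, odom, w_gnss=1.0, w_odom=10.0):
    """Batch 1-D pose fusion: absolute GNSS factors (x_i - z_i) and
    relative odometry factors (x_{i+1} - x_i - d_i), assembled into
    normal equations H x = g and solved in one shot."""
    n = len(gnss)
    H = [[0.0] * n for _ in range(n)]
    g = [0.0] * n
    for i, z in enumerate(gnss):
        H[i][i] += w_gnss
        g[i] += w_gnss * z
    for i, d in enumerate(odom):
        H[i][i] += w_odom
        H[i + 1][i + 1] += w_odom
        H[i][i + 1] -= w_odom
        H[i + 1][i] -= w_odom
        g[i] -= w_odom * d
        g[i + 1] += w_odom * d
    return solve(H, g)

# Smooth odometry pulls the estimate well away from the GNSS outlier
# (9.0 at k=2); the result stays far closer to the true value 2.0.
est = fuse_1d(gnss=[0.0, 1.0, 9.0, 3.0, 4.0], odom=[1.0, 1.0, 1.0, 1.0])
assert abs(est[2] - 2.0) < abs(est[2] - 9.0)
```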
cs.RO / 10 / 2603.03556
Real-time tightly coupled GNSS and IMU integration via Factor Graph Optimization
Abstract
Reliable positioning in dense urban environments remains challenging due to frequent GNSS signal blockage, multipath, and rapidly varying satellite geometry. While factor graph optimization (FGO)-based GNSS-IMU fusion has demonstrated strong robustness and accuracy, most formulations remain offline. In this work, we present a real-time tightly coupled GNSS-IMU FGO method that enables causal state estimation via incremental optimization with fixed-lag marginalization, and we evaluate its performance in a highly urbanized GNSS-degraded environment using the UrbanNav dataset.
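In the linear 1-D case, marginalizing every state older than the newest one reduces fixed-lag smoothing to information-weighted filtering, which conveys the flavor of the marginalization step; the noise parameters below are illustrative, not the paper's:

```python
class FixedLagFuser1D:
    """Causal 1-D sketch: in the linear-Gaussian case, marginalizing all
    states older than the newest one turns fixed-lag smoothing into
    information-weighted filtering. Noise parameters are illustrative."""

    def __init__(self, x0=0.0, info0=100.0):
        self.mean = x0       # current pose estimate
        self.info = info0    # information (inverse variance)

    def predict(self, delta, delta_info=25.0):
        """Fold in a relative (IMU-preintegration-like) factor; the old
        state is marginalized out, so uncertainty accumulates."""
        self.mean += delta
        self.info = 1.0 / (1.0 / self.info + 1.0 / delta_info)

    def correct(self, z, z_info=1.0):
        """Fuse an absolute (GNSS-like) position fix."""
        self.mean = (self.info * self.mean + z_info * z) / (self.info + z_info)
        self.info += z_info

f = FixedLagFuser1D()
for k in range(1, 4):                 # three steps of unit odometry
    f.predict(1.0)
    f.correct(float(k) + 0.3)         # biased fixes pull the estimate up
assert 3.0 < f.mean < 3.3             # between odometry and the fixes
```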
cs.RO / 11 / 2603.03596
MEM: Multi-Scale Embodied Memory for Vision Language Action Models
Abstract
Conventionally, memory in end-to-end robotic learning involves inputting a sequence of past observations into the learned policy. However, in complex multi-stage real-world tasks, the robot's memory must represent past events at multiple levels of granularity: from long-term memory that captures abstracted semantic concepts (e.g., a robot cooking dinner should remember which stages of the recipe are already done) to short-term memory that captures recent events and compensates for occlusions (e.g., a robot remembering the object it wants to pick up once its arm occludes it). In this work, our main insight is that an effective memory architecture for long-horizon robotic control should combine multiple modalities to capture these different levels of abstraction. We introduce Multi-Scale Embodied Memory (MEM), an approach for mixed-modal long-horizon memory in robot policies. MEM combines video-based short-horizon memory, compressed via a video encoder, with text-based long-horizon memory. Together, they enable robot policies to perform tasks that span up to fifteen minutes, like cleaning up a kitchen, or preparing a grilled cheese sandwich. Additionally, we find that memory enables MEM policies to intelligently adapt manipulation strategies in-context.
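A toy version of the two memory scales, with the learned video encoder and summarizer replaced by raw frames and hand-written text milestones (both stand-ins, not the paper's models):

```python
from collections import deque

class MultiScaleMemory:
    """Toy mixed-modal memory: a short rolling window of raw frames
    (short-term) plus an append-only list of text milestones (long-term).
    Frame encoding and summarization are stand-ins for learned models."""

    def __init__(self, short_horizon=8):
        self.short_term = deque(maxlen=short_horizon)   # recent frames
        self.long_term = []                             # text milestones

    def observe(self, frame):
        self.short_term.append(frame)

    def record_milestone(self, text):
        self.long_term.append(text)

    def context(self):
        """What the policy conditions on at each step."""
        return {"frames": list(self.short_term), "log": list(self.long_term)}

mem = MultiScaleMemory(short_horizon=3)
for t in range(10):
    mem.observe(f"frame_{t}")
mem.record_milestone("bread buttered")
mem.record_milestone("pan preheated")
ctx = mem.context()
assert ctx["frames"] == ["frame_7", "frame_8", "frame_9"]   # recency window
assert ctx["log"] == ["bread buttered", "pan preheated"]    # full history
```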
cs.RO / 12 / 2603.03627
Touch2Insert: Zero-Shot Peg Insertion by Touching Intersections of Peg and Hole
Abstract
Reliable insertion of industrial connectors remains a central challenge in robotics, requiring sub-millimeter precision under uncertainty and often without full visual access. Vision-based approaches struggle with occlusion and limited generalization, while learning-based policies frequently fail to transfer to unseen geometries. To address these limitations, we leverage tactile sensing, which captures local surface geometry at the point of contact and thus provides reliable information even under occlusion and across novel connector shapes. Building on this capability, we present \emph{Touch2Insert}, a tactile-based framework for arbitrary peg insertion. Our method reconstructs cross-sectional geometry from high-resolution tactile images and estimates the relative pose of the hole with respect to the peg in a zero-shot manner. By aligning reconstructed shapes through registration, the framework enables insertion from a single contact without task-specific training. To evaluate its performance, we conducted experiments with three diverse connectors in both simulation and real-robot settings. The results indicate that Touch2Insert achieved sub-millimeter pose estimation accuracy for all connectors in simulation, and attained an average success rate of 86.7\% on the real robot, thereby confirming the robustness and generalizability of tactile sensing for real-world robotic connector insertion.
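The registration step can be illustrated in its simplest form: least-squares rigid 2-D alignment with known correspondences. The paper's pipeline operates on reconstructed tactile cross-sections and does not assume correspondences; this is a minimal stand-in:

```python
import math

def register_2d(src, dst):
    """Least-squares rigid 2-D registration with known correspondences:
    returns (theta, tx, ty) mapping src points onto dst points."""
    n = len(src)
    csx = sum(p[0] for p in src) / n
    csy = sum(p[1] for p in src) / n
    cdx = sum(p[0] for p in dst) / n
    cdy = sum(p[1] for p in dst) / n
    s_cos = s_sin = 0.0
    for (x, y), (u, v) in zip(src, dst):
        ax, ay = x - csx, y - csy          # centered source point
        bx, by = u - cdx, v - cdy          # centered target point
        s_cos += ax * bx + ay * by         # dot  -> cosine component
        s_sin += ax * by - ay * bx         # cross -> sine component
    theta = math.atan2(s_sin, s_cos)
    tx = cdx - (csx * math.cos(theta) - csy * math.sin(theta))
    ty = cdy - (csx * math.sin(theta) + csy * math.cos(theta))
    return theta, tx, ty

# Recover a known 30-degree rotation and (0.5, -0.2) offset.
truth = math.radians(30.0)
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
dst = [(x * math.cos(truth) - y * math.sin(truth) + 0.5,
        x * math.sin(truth) + y * math.cos(truth) - 0.2) for x, y in src]
theta, tx, ty = register_2d(src, dst)
assert abs(theta - truth) < 1e-9 and abs(tx - 0.5) < 1e-9 and abs(ty + 0.2) < 1e-9
```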
cs.RO / 13 / 2603.03640
MistyPilot: An Agentic Fast-Slow Thinking LLM Framework for Misty Social Robots
Abstract
With the availability of open APIs in social robots, it has become easier to customize general-purpose tools to meet users' needs. However, interpreting high-level user instructions, selecting and configuring appropriate tools, and executing them reliably remain challenging for users without programming experience. To address these challenges, we introduce MistyPilot, an agentic LLM-driven framework for autonomous tool selection, orchestration, and parameter configuration. MistyPilot comprises two core components: a Physically Interactive Agent (PIA) and a Socially Intelligent Agent (SIA). The PIA enables robust sensor-triggered and tool-driven task execution, while the SIA generates socially intelligent and emotionally aligned dialogue. MistyPilot further integrates a fast-slow thinking paradigm to capture user preferences, reduce latency, and improve task efficiency. To comprehensively evaluate MistyPilot, we contribute five benchmark datasets. Extensive experiments demonstrate the effectiveness of our framework in routing correctness, task completeness, fast-slow thinking retrieval efficiency, tool scalability, and emotion alignment. All code, datasets, and experimental videos will be made publicly available on the project webpage.
cs.RO / 14 / 2603.03695
TreeLoc++: Robust 6-DoF LiDAR Localization in Forests with a Compact Digital Forest Inventory
Abstract
Reliable localization is essential for sustainable forest management, as it allows robots or sensor systems to revisit and monitor the status of individual trees over long periods. In modern forestry, this management is structured around Digital Forest Inventories (DFIs), which encode stems using compact geometric attributes rather than raw data. Despite their central role, DFIs have been overlooked in localization research, and most methods still rely on dense gigabyte-sized point clouds that are costly to store and maintain. To improve upon this, we propose TreeLoc++, a global localization framework that operates directly on DFIs as a discriminative representation, eliminating the need to use the raw point clouds. TreeLoc++ reduces false matches in structurally ambiguous forests and improves the reliability of full 6-DoF pose estimation. It augments coarse retrieval with a pairwise distance histogram that encodes local tree-layout context, subsequently refining candidates via DBH-based filtering and yaw-consistent inlier selection to further reduce mismatches. Furthermore, a constrained optimization leveraging tree geometry jointly estimates roll, pitch, and height, enhancing pose stability and enabling accurate localization without reliance on dense 3D point cloud data. Evaluations on 27 sequences recorded in forests across three datasets and four countries show that TreeLoc++ achieves precise localization with centimeter-level accuracy. We further demonstrate robustness to long-term change by localizing data recorded in 2025 against inventories built from 2023 data, spanning a two-year interval. The system represents 15 sessions spanning 7.98 km of trajectories using only 250KB of map data and outperforms both hand-crafted and learning-based baselines that rely on point cloud maps. This demonstrates the scalability of TreeLoc++ for long-term deployment.
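A pairwise distance histogram of stem positions is, by construction, invariant to rigid motion of the scanner, which is what makes it usable for coarse retrieval. A minimal sketch with illustrative bin settings (not the paper's parameters):

```python
import math

def distance_histogram(stems, bins=8, max_d=20.0):
    """Normalized histogram of pairwise stem distances: a compact,
    rotation- and translation-invariant descriptor of tree layout."""
    hist = [0] * bins
    for i in range(len(stems)):
        for j in range(i + 1, len(stems)):
            d = math.dist(stems[i], stems[j])
            if d < max_d:
                hist[min(int(d / max_d * bins), bins - 1)] += 1
    total = sum(hist) or 1
    return [h / total for h in hist]

def l1(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

# Invariant under a rigid motion of the same stand, but discriminative
# against a differently arranged stand.
stand = [(0, 0), (3, 1), (5, 4), (1, 6), (7, 2)]
moved = [(y + 10.0, -x) for x, y in stand]     # rotate 90 deg, translate
other = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
assert l1(distance_histogram(stand), distance_histogram(moved)) < 1e-9
assert l1(distance_histogram(stand), distance_histogram(other)) > 0.1
```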
cs.RO / 15 / 2603.03701
UrbanHuRo: A Two-Layer Human-Robot Collaboration Framework for the Joint Optimization of Heterogeneous Urban Services
Abstract
In the vision of smart cities, technologies are being developed to enhance the efficiency of urban services and improve residents' quality of life. However, most existing research focuses on optimizing individual services in isolation, without adequately considering reciprocal interactions among heterogeneous urban services that could yield higher efficiency and improved resource utilization. For example, human couriers could collect traffic and air quality data along their delivery routes, while sensing robots could assist with on-demand delivery during peak hours, enhancing both sensing coverage and delivery efficiency. However, the joint optimization of different urban services is challenging due to potentially conflicting objectives and the need for real-time coordination in dynamic environments. In this paper, we propose UrbanHuRo, a two-layer human-robot collaboration framework for joint optimization of heterogeneous urban services, demonstrated through crowdsourced delivery and urban sensing. UrbanHuRo includes two key designs: (i) a scalable distributed MapReduce-based K-submodular maximization module for efficient order dispatch, and (ii) a deep submodular reward reinforcement learning algorithm for sensing route planning. Experimental evaluations on real-world datasets from a food delivery platform demonstrate that UrbanHuRo improves sensing coverage by 29.7% and courier income by 39.2% on average in most settings, while also significantly reducing the number of overdue orders.
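The order-dispatch flavor can be sketched with a plain greedy coverage maximizer under a one-route-per-courier constraint; the paper's K-submodular, MapReduce-distributed formulation is considerably richer, and the route sets below are illustrative:

```python
def greedy_dispatch(candidates, k):
    """Greedy maximization of sensing coverage under a one-route-per-
    courier constraint: a simplified, centralized stand-in for the
    paper's distributed K-submodular dispatch module."""
    covered, chosen, used = set(), [], set()
    for _ in range(k):
        feasible = {a: cells for a, cells in candidates.items()
                    if a.split("_")[0] not in used}
        if not feasible:
            break
        best = max(feasible, key=lambda a: len(feasible[a] - covered))
        if not feasible[best] - covered:
            break                         # no marginal coverage gain left
        chosen.append(best)
        used.add(best.split("_")[0])
        covered |= candidates[best]
    return chosen, covered

# Each key is "courier_route"; values are the grid cells it would sense.
routes = {
    "courier1_routeA": {1, 2, 3},
    "courier1_routeB": {3, 4},
    "courier2_routeA": {4, 5, 6},
    "courier2_routeB": {1, 6},
}
chosen, covered = greedy_dispatch(routes, k=2)
assert covered == {1, 2, 3, 4, 5, 6}
```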
cs.RO / 16 / 2603.03704
Large-Language-Model-Guided State Estimation for Partially Observable Task and Motion Planning
Abstract
Robot planning in partially observable environments, where not all objects are known or visible, is a challenging problem, as it requires reasoning under uncertainty through partially observable Markov decision processes. During the execution of a computed plan, a robot may unexpectedly observe task-irrelevant objects, which are typically ignored by naive planners. In this work, we propose incorporating two types of common-sense knowledge: (1) certain objects are more likely to be found in specific locations; and (2) similar objects are likely to be co-located, while dissimilar objects are less likely to be found together. Manually engineering such knowledge is complex, so we explore leveraging the powerful common-sense reasoning capabilities of large language models (LLMs). Our planning and execution framework, CoCo-TAMP, introduces a hierarchical state estimation that uses LLM-guided information to shape the belief over task-relevant objects, enabling efficient solutions to long-horizon task and motion planning problems. In experiments, CoCo-TAMP achieves an average reduction of 62.7\% in planning and execution time in simulation, and 72.6\% in real-world demonstrations, compared to a baseline that does not incorporate either type of common-sense knowledge.
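Common-sense priors enter naturally as a Bayesian belief over object locations: an LLM-elicited prior concentrates mass on likely locations, and a failed search re-weights it. A minimal sketch (the prior values and `miss_rate` are illustrative assumptions):

```python
def normalize(belief):
    total = sum(belief.values())
    return {loc: p / total for loc, p in belief.items()}

def update_negative(belief, searched, miss_rate=0.05):
    """Bayesian update after searching a location without finding the
    object: P(loc | not seen) is proportional to P(not seen | loc) * P(loc)."""
    return normalize({loc: p * (miss_rate if loc == searched else 1.0)
                      for loc, p in belief.items()})

# An LLM-elicited common-sense prior: a mug is most likely in the
# cabinet. A failed search there shifts mass to the next-best location
# rather than falling back to a uniform guess.
prior = normalize({"cabinet": 0.6, "dishwasher": 0.3, "couch": 0.1})
post = update_negative(prior, "cabinet")
assert max(post, key=post.get) == "dishwasher"
```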
cs.RO / 17 / 2603.03733
X-Loco: Towards Generalist Humanoid Locomotion Control via Synergetic Policy Distillation
Abstract
While recent advances have demonstrated strong performance in individual humanoid skills such as upright locomotion, fall recovery, and whole-body coordination, learning a single policy that masters all these skills remains challenging due to the diverse dynamics and conflicting control objectives involved. To address this, we introduce X-Loco, a framework for training a vision-based generalist humanoid locomotion policy. X-Loco trains multiple oracle specialist policies and adopts synergetic policy distillation with a case-adaptive specialist selection mechanism, which dynamically leverages multiple specialist policies to guide a vision-based student policy. This design enables the student to acquire a broad spectrum of locomotion skills, ranging from fall recovery to terrain traversal and whole-body coordination. To the best of our knowledge, X-Loco is the first framework to demonstrate vision-based humanoid locomotion that jointly integrates upright locomotion, whole-body coordination, and fall recovery, while operating solely under velocity commands without relying on reference motions. Experimental results show that X-Loco achieves superior performance, as demonstrated on tasks such as fall recovery and terrain traversal. Ablation studies further highlight that our framework effectively leverages specialist expertise and enhances learning efficiency.
Chinese Translation
尽管近期的进展在直立步态、跌倒恢复和全身协调等个别类人技能上表现出色,但由于涉及多样的动态和相互冲突的控制目标,学习一个能够掌握所有这些技能的单一策略仍然具有挑战性。为了解决这一问题,我们提出了X-Loco,一个用于训练基于视觉的通用类人步态策略的框架。X-Loco训练多个专家策略,并采用一种协同策略蒸馏方法,结合案例自适应专家选择机制,动态利用多个专家策略来指导基于视觉的学生策略。该设计使得学生能够掌握广泛的步态技能,从跌倒恢复到地形穿越以及全身协调技能。据我们所知,X-Loco是第一个展示基于视觉的类人步态的框架,它将直立步态、全身协调和跌倒恢复联合整合,同时仅在速度指令下操作,而不依赖于参考动作。实验结果表明,X-Loco在跌倒恢复和地形穿越等任务中实现了卓越的性能。消融研究进一步强调我们的框架有效利用专家知识并提高学习效率。
cs.RO / 18 / 2603.03735
Characterization and Correlation of Robotic Snake Scale Friction and Locomotion Speed
机器人蛇鳞摩擦特性与运动速度的表征与关联
Abstract
Snake robots are inspired by the ability of biological snakes to move over rock, grass, leaves, soil, up trees, along pavement and more. Their ability to move in multiple distinct environments is due to their legless locomotion strategy, which combines distinct gaits with a skin that exhibits frictional anisotropy. Designing soft robotic snakes with similar capabilities requires an understanding of how this underlying frictional anisotropy should be created in engineered systems, and how variances in the frictional anisotropy ratio affect locomotion speed and direction on different surfaces. While forward and backward frictional ratios have been characterized for previous scale designs, lateral friction and the associated ratios are often overlooked. In this paper, our contributions include: (i) the development of a novel articulated pseudo-skin design that is modular, easy to construct and has removable or replaceable scales; (ii) experimental measurement of the frictional characteristics of otherwise-identical scales at varying angles of attack (15{\deg}, 25{\deg}, 35{\deg}, 45{\deg}) on different surfaces of interest (grass, bark, smooth surface, carpet); (iii) separate measurements of locomotion speed for each angle and surface. Consequently, while we observed some consistent trends between frictional coefficients and scale angle, aligning with literature and intuition, we were not able to consistently identify expected correlations between frictional ratios and locomotion speed. We conclude that either frictional ratios alone are not sufficient to predict the observed speed of a snake robot, or that specific measurement approaches are required to accurately capture these ratios.
Chinese Translation
蛇形机器人受到生物蛇在岩石、草地、树叶、土壤、树木和人行道等多种环境中移动能力的启发。它们在多种不同环境中移动的能力源于无腿的运动策略,该策略结合了不同的步态和表现出摩擦各向异性的皮肤。设计具有类似能力的软体机器人蛇需要理解如何在工程系统中创造这种基础的摩擦各向异性,以及摩擦各向异性比率的变化如何影响不同表面上的运动速度和方向。虽然先前的鳞片设计已经表征了前向和后向摩擦比,但侧向摩擦及相关比率常常被忽视。本文的贡献包括:(i) 开发了一种新型的关节伪皮肤设计,该设计模块化、易于构建,并具有可拆卸或可更换的鳞片;(ii) 在不同感兴趣的表面(草地、树皮、光滑表面、地毯)上,以不同的攻击角度(15°、25°、35°、45°)对相同鳞片的摩擦特性进行实验测量;(iii) 对每个角度和表面分别测量运动速度。因此,尽管我们观察到摩擦系数与鳞片角度之间存在一些一致的趋势,与文献和直觉相符,但我们未能一致地识别摩擦比与运动速度之间的预期关联。我们得出结论,摩擦比单独不足以预测蛇形机器人的观察速度,或者需要特定的测量方法来准确捕捉这些比率。
cs.RO / 19 / 2603.03740
Whole-Body Safe Control of Robotic Systems with Koopman Neural Dynamics
基于Koopman神经动力学的机器人系统整体安全控制
Abstract
Controlling robots with strongly nonlinear, high-dimensional dynamics remains challenging, as direct nonlinear optimization with safety constraints is often intractable in real time. The Koopman operator offers a way to represent nonlinear systems linearly in a lifted space, enabling the use of efficient linear control. We propose a data-driven framework that learns a Koopman embedding and operator from data, and integrates the resulting linear model with the Safe Set Algorithm (SSA). This allows the tracking and safety constraints to be solved in a single quadratic program (QP), ensuring feasibility and optimality without a separate safety filter. We validate the method on a Kinova Gen3 manipulator and a Go2 quadruped, showing accurate tracking and obstacle avoidance.
Chinese Translation
控制具有强非线性和高维动态的机器人仍然具有挑战性,因为在实时情况下,直接进行带有安全约束的非线性优化往往是不可行的。Koopman算子提供了一种在提升空间中线性表示非线性系统的方法,从而能够使用高效的线性控制。我们提出了一种数据驱动的框架,该框架从数据中学习Koopman嵌入和算子,并将得到的线性模型与安全集算法(Safe Set Algorithm, SSA)相结合。这使得跟踪和安全约束可以在一个单一的二次规划(Quadratic Program, QP)中解决,从而确保可行性和最优性,而无需单独的安全过滤器。我们在Kinova Gen3机械臂和Go2四足机器人上验证了该方法,显示出准确的跟踪和障碍物避让能力。
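The abstract above rests on the Koopman idea that nonlinear dynamics can act linearly in a lifted observable space. The paper learns the embedding and operator from data with a neural network; the sketch below substitutes a fixed polynomial lift on a toy system chosen so the lifted dynamics are exactly linear, and fits the operator by least squares (EDMD-style). The system, lift, and coefficients are all illustrative assumptions.

```python
import numpy as np

def lift(x):  # hand-chosen observable dictionary: (x1, x2, x1^2)
    return np.array([x[0], x[1], x[0] ** 2])

def step(x, a=0.9, b=0.5, c=-0.3):  # true nonlinear dynamics (toy)
    return np.array([a * x[0], b * x[1] + c * x[0] ** 2])

# collect one-step transition data
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
Z = np.array([lift(x) for x in X])
Zn = np.array([lift(step(x)) for x in X])

# least-squares Koopman operator: Zn ≈ Z @ K.T
K = np.linalg.lstsq(Z, Zn, rcond=None)[0].T

# multi-step prediction stays linear in the lifted space
x = np.array([0.7, -0.4])
z = lift(x)
for _ in range(10):
    x = step(x)   # true rollout
    z = K @ z     # purely linear rollout
print(np.max(np.abs(z[:2] - x)))  # near machine precision for this lift
```

With the linear model in hand, tracking and safety constraints can be posed as a single quadratic program, which is the role the Safe Set Algorithm plays in the paper.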
cs.RO / 20 / 2603.03741
HALyPO: Heterogeneous-Agent Lyapunov Policy Optimization for Human-Robot Collaboration
HALyPO:用于人机协作的异构智能体李雅普诺夫策略优化
Abstract
To improve generalization and resilience in human-robot collaboration (HRC), robots must handle the combinatorial diversity of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG) in the learning process-a variational mismatch between decentralized best-response dynamics and centralized cooperative ascent. The resulting learning problem is a general-sum differentiable game, so independent policy-gradient updates can oscillate or diverge without added structure. We propose heterogeneous-agent Lyapunov policy optimization (HALyPO), which establishes formal stability directly in the policy-parameter space by enforcing a per-step Lyapunov decrease condition on a parameter-space disagreement metric. Unlike Lyapunov-based safe RL, which targets state/trajectory constraints in constrained Markov decision processes, HALyPO uses Lyapunov certification to stabilize decentralized policy learning. HALyPO rectifies decentralized gradients via optimal quadratic projections, ensuring monotonic contraction of RG and enabling effective exploration of open-ended interaction spaces. Extensive simulations and real-world humanoid-robot experiments show that this certified stability improves generalization and robustness in collaborative corner cases.
Chinese Translation
为了提高人机协作(HRC)的泛化能力和韧性,机器人必须处理人类行为和情境的组合多样性,这促使了多智能体强化学习(MARL)的发展。然而,机器人与人类之间固有的异质性在学习过程中造成了理性差距(RG)——即去中心化最佳响应动态与中心化合作上升之间的变异不匹配。由此产生的学习问题是一个一般和可微的博弈,因此独立的策略梯度更新可能会在没有额外结构的情况下振荡或发散。我们提出了异构智能体李雅普诺夫策略优化(HALyPO),该方法通过对参数空间不一致度量施加逐步李雅普诺夫下降条件,直接在策略参数空间中建立正式的稳定性。与基于李雅普诺夫的安全强化学习不同,后者针对受限马尔可夫决策过程中的状态/轨迹约束,HALyPO利用李雅普诺夫认证来稳定去中心化策略学习。HALyPO通过最优二次投影修正去中心化梯度,确保RG的单调收缩,并有效探索开放式交互空间。大量的仿真和真实世界的人形机器人实验表明,这种认证的稳定性提高了协作边缘案例中的泛化能力和鲁棒性。
cs.RO / 21 / 2603.03751
Interaction-Aware Whole-Body Control for Compliant Object Transport
用于柔顺物体运输的交互感知全身控制
Abstract
Cooperative object transport in unstructured environments remains challenging for assistive humanoids because strong, time-varying interaction forces can make tracking-centric whole-body control unreliable, especially in close-contact support tasks. This paper proposes a bio-inspired, interaction-oriented whole-body control (IO-WBC) that functions as an artificial cerebellum - an adaptive motor agent that translates upstream (skill-level) commands into stable, physically consistent whole-body behavior under contact. This work structurally separates upper-body interaction execution from lower-body support control, enabling the robot to maintain balance while shaping force exchange in a tightly coupled robot-object system. A trajectory-optimized reference generator (RG) provides a kinematic prior, while a reinforcement learning (RL) policy governs body responses under heavy-load interactions and disturbances. The policy is trained in simulation with randomized payload mass/inertia and external perturbations, and deployed via asymmetric teacher-student distillation so that the student relies only on proprioceptive histories at runtime. Extensive experiments demonstrate that IO-WBC maintains stable whole-body behavior and physical interaction even when precise velocity tracking becomes infeasible, enabling compliant object transport across a wide range of scenarios.
Chinese Translation
在非结构化环境中,协作物体运输对辅助人形机器人仍然具有挑战性,因为强烈且随时间变化的交互力可能使以跟踪为中心的全身控制变得不可靠,尤其是在紧密接触的支持任务中。本文提出了一种生物启发的、面向交互的全身控制(IO-WBC),其功能类似于人工小脑——一种自适应运动代理,能够将上游(技能层级)指令转化为在接触下稳定且物理一致的全身行为。该工作在结构上将上半身的交互执行与下半身的支持控制分离,使机器人能够在紧密耦合的机器人-物体系统中保持平衡,同时塑造力的交换。一个轨迹优化的参考生成器(RG)提供运动学先验,而强化学习(RL)策略则在重载交互和扰动下管理身体反应。该策略在模拟环境中经过随机化的负载质量/惯性和外部扰动进行训练,并通过不对称的师生蒸馏进行部署,使得学生在运行时仅依赖于本体感知历史。大量实验表明,IO-WBC在精确速度跟踪变得不可行时仍能保持稳定的全身行为和物理交互,从而实现广泛场景下柔顺的物体运输。
cs.RO / 22 / 2603.03768
Cognition to Control - Multi-Agent Learning for Human-Humanoid Collaborative Transport
从认知到控制 - 人类与类人机器人协作运输的多智能体学习
Abstract
Effective human-robot collaboration (HRC) requires translating high-level intent into contact-stable whole-body motion while continuously adapting to a human partner. Many vision-language-action (VLA) systems learn end-to-end mappings from observations and instructions to actions, but they often emphasize reactive (System 1-like) behavior and leave under-specified how sustained System 2-style deliberation can be integrated with reliable, low-latency continuous control. This gap is acute in multi-agent HRC, where long-horizon coordination decisions and physical execution must co-evolve under contact, feasibility, and safety constraints. We address this limitation with cognition-to-control (C2C), a three-layer hierarchy that makes the deliberation-to-control pathway explicit: (i) a VLM-based grounding layer that maintains persistent scene referents and infers embodiment-aware affordances/constraints; (ii) a deliberative skill/coordination layer-the System 2 core-that optimizes long-horizon skill choices and sequences under human-robot coupling via decentralized MARL cast as a Markov potential game with a shared potential encoding task progress; and (iii) a whole-body control layer that executes the selected skills at high frequency while enforcing kinematic/dynamic feasibility and contact stability. The deliberative layer is realized as a residual policy relative to a nominal controller, internalizing partner dynamics without explicit role assignment. Experiments on collaborative manipulation tasks show higher success and robustness than single-agent and end-to-end baselines, with stable coordination and emergent leader-follower behaviors.
Chinese Translation
有效的人机协作(HRC)需要将高层意图转化为接触稳定的全身运动,同时持续适应人类伙伴。许多视觉-语言-动作(VLA)系统学习从观察和指令到动作的端到端映射,但它们往往强调反应性(类似系统1)的行为,而对如何将持续的系统2风格的深思熟虑与可靠、低延迟的连续控制相结合则缺乏明确的说明。这一差距在多智能体人机协作中尤为突出,在这里,长期协调决策和物理执行必须在接触、可行性和安全约束下共同演变。我们通过认知到控制(C2C)来解决这一局限性,C2C是一个三层层次结构,使深思熟虑到控制的路径变得明确:(i) 基于VLM的基础层,维护持久的场景指代并推断具身意识的可供性/约束;(ii) 深思熟虑的技能/协调层——系统2核心——通过去中心化的多智能体强化学习(MARL)优化人机耦合下的长期技能选择和序列,该学习被建模为一个以共享势函数编码任务进展的马尔可夫势博弈;(iii) 全身控制层以高频率执行所选技能,同时强制执行运动学/动态可行性和接触稳定性。深思熟虑层作为相对于名义控制器的残差策略实现,内化伙伴动态而无需明确的角色分配。在协作操控任务上的实验显示出比单智能体和端到端基线更高的成功率和鲁棒性,具有稳定的协调和涌现的领导者-跟随者行为。
cs.RO / 23 / 2603.03798
Learning Surgical Robotic Manipulation with 3D Spatial Priors
利用三维空间先验学习外科手术机器人操作
Abstract
Achieving 3D spatial awareness is crucial for surgical robotic manipulation, where precise and delicate operations are required. Existing methods either explicitly reconstruct the surgical scene prior to manipulation, or enhance multi-view features by adding wrist-mounted cameras to supplement the default stereo endoscopes. However, both paradigms suffer from notable limitations: the former easily leads to error accumulation and prevents end-to-end optimization due to its multi-stage nature, while the latter is rarely adopted in clinical practice since wrist-mounted cameras can interfere with the motion of surgical robot arms. In this work, we introduce the Spatial Surgical Transformer (SST), an end-to-end visuomotor policy that empowers surgical robots with 3D spatial awareness by directly exploring 3D spatial cues embedded in endoscopic images. First, we build Surgical3D, a large-scale photorealistic dataset containing 30K stereo endoscopic image pairs with accurate 3D geometry, addressing the scarcity of 3D data in surgical scenes. Based on Surgical3D, we finetune a powerful geometric transformer to extract robust 3D latent representations from stereo endoscopic images. These representations are then seamlessly aligned with the robot's action space via a lightweight multi-level spatial feature connector (MSFC), all within an endoscope-centric coordinate frame. Extensive real-robot experiments demonstrate that SST achieves state-of-the-art performance and strong spatial generalization on complex surgical tasks such as knot tying and ex-vivo organ dissection, representing a significant step toward practical clinical deployment. The dataset and code will be released.
Chinese Translation
实现三维空间意识对于外科手术机器人操作至关重要,因为这类操作需要精确和细致的处理。现有方法要么在操作前显式重建手术场景,要么通过添加腕部安装摄像头来增强多视角特征,以补充默认的立体内窥镜。然而,这两种范式都存在显著的局限性:前者由于其多阶段特性容易导致误差累积,并阻碍端到端优化,而后者由于腕部安装摄像头可能干扰手术机器人手臂的运动,因此在临床实践中很少被采用。在本研究中,我们提出了空间外科变换器(Spatial Surgical Transformer, SST),这是一种端到端的视觉运动策略,通过直接探索嵌入内窥镜图像中的三维空间线索,使外科机器人具备三维空间意识。首先,我们构建了Surgical3D,这是一个包含30,000对具有准确三维几何信息的立体内窥镜图像的大规模照片真实数据集,解决了手术场景中三维数据稀缺的问题。基于Surgical3D,我们微调了一个强大的几何变换器,从立体内窥镜图像中提取稳健的三维潜在表示。这些表示随后通过轻量级多级空间特征连接器(Multi-level Spatial Feature Connector, MSFC)与机器人的动作空间无缝对齐,所有操作均在以内窥镜为中心的坐标框架内进行。大量真实机器人实验表明,SST在复杂的外科任务(如打结和体外器官解剖)上实现了最先进的性能和强大的空间泛化,代表了向实际临床应用迈出的重要一步。数据集和代码将会发布。
cs.RO / 24 / 2603.03836
SkillVLA: Tackling Combinatorial Diversity in Dual-Arm Manipulation via Skill Reuse
SkillVLA:通过技能重用应对双臂操作中的组合多样性
Abstract
Recent progress in vision-language-action (VLA) models has demonstrated strong potential for dual-arm manipulation, enabling complex behaviors and generalization to unseen environments. However, mainstream bimanual VLA formulations largely overlook the critical challenge of combinatorial diversity. Different pairings of single-arm behaviors can induce qualitatively distinct task behaviors, yet existing models do not explicitly account for this structure. We argue that effective bimanual VLAs should support skill reuse - the ability to recombine previously learned single-arm skills across novel left-right pairings - thereby avoiding the need to separately learn every possible combination. Current VLA designs entangle skills across arms, preventing such recomposition and limiting scalability. To address this limitation, we propose SkillVLA, a framework explicitly designed to enable skill reuse in dual-arm manipulation. Extensive experiments demonstrate that SkillVLA substantially improves skill composition, increasing overall success rate from 0% to 51%, and achieves strong performance on cooperative and long-horizon tasks.
Chinese Translation
近期在视觉-语言-动作(VLA)模型方面的进展显示出其在双臂操作中的强大潜力,能够实现复杂行为并对未见环境进行泛化。然而,主流的双手VLA模型在很大程度上忽视了组合多样性这一关键挑战。不同的单臂行为组合可以引发质上不同的任务行为,但现有模型并未明确考虑这种结构。我们认为,有效的双手VLA应支持技能重用——即能够在新的左右配对中重新组合先前学习的单臂技能,从而避免单独学习每一种可能的组合。目前的VLA设计将技能交织在一起,阻碍了这种重新组合,并限制了可扩展性。为了解决这一限制,我们提出了SkillVLA,一个明确旨在实现双臂操作中技能重用的框架。大量实验表明,SkillVLA显著改善了技能组合,将整体成功率从0%提高至51%,并在合作和长时间任务中表现出色。
cs.RO / 25 / 2603.03897
IROSA: Interactive Robot Skill Adaptation using Natural Language
IROSA:基于自然语言的互动机器人技能适应
Abstract
Foundation models have demonstrated impressive capabilities across diverse domains, while imitation learning provides principled methods for robot skill adaptation from limited data. Combining these approaches holds significant promise for direct application to robotics, yet this combination has received limited attention, particularly for industrial deployment. We present a novel framework that enables open-vocabulary skill adaptation through a tool-based architecture, maintaining a protective abstraction layer between the language model and robot hardware. Our approach leverages pre-trained LLMs to select and parameterize specific tools for adapting robot skills without requiring fine-tuning or direct model-to-robot interaction. We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task, showing successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and obstacle avoidance while maintaining safety, transparency, and interpretability.
Chinese Translation
基础模型在多个领域展示了令人印象深刻的能力,而模仿学习则提供了从有限数据中进行机器人技能适应的原则性方法。将这些方法结合起来在机器人技术中具有重要的应用前景,但这一结合在工业部署方面受到的关注有限。我们提出了一种新颖的框架,通过工具基础架构实现开放词汇的技能适应,在语言模型和机器人硬件之间保持保护性抽象层。我们的方法利用预训练的大型语言模型(LLMs)选择和参数化特定工具,以适应机器人技能,而无需微调或直接的模型与机器人交互。我们在一个7自由度的扭矩控制机器人上演示了该框架,执行工业轴承环插入任务,展示了通过自然语言命令成功实现技能适应,包括速度调整、轨迹修正和障碍物避免,同时保持安全性、透明性和可解释性。
cs.RO / 26 / 2603.03942
Lightweight Visual Reasoning for Socially-Aware Robots
面向社会意识机器人的轻量级视觉推理
Abstract
Robots operating in shared human environments must not only navigate, interact, and detect their surroundings, but must also interpret and respond to dynamic, and often unpredictable, human behaviours. Although recent advances have shown promise in enhancing robotic perception and instruction-following using Vision-Language Models (VLMs), they remain limited in addressing the complexities of multimodal human-robot interactions (HRI). Motivated by this challenge, we introduce a lightweight language-to-vision feedback module that closes the loop between an LLM and the vision encoder in VLMs. The module projects image-token hidden states through a gated Multi-Layer Perceptron (MLP) back into the encoder input, prompting a second pass that reinterprets the scene under text context. We evaluate this approach on three robotics-centred tasks: navigation in a simulated environment (Habitat), sequential scene description (Mementos-Robotics), and human-intention recognition (our HRI dataset). Results show that our method improves Qwen 2.5 (7B) by $3.3\%$ (shorter distance), a $+0.057$ description score, and $+2.93\%$ accuracy, with less than $3\%$ extra parameters; Gemma 3 (4B) and LLaVA OV 1.5 (4B) show mixed navigation results but gain $+0.111,+0.055$ and $+10.81\%,+4.79\%$ on the latter two tasks. Code is available at https://github.com/alessioGalatolo/VLM-Reasoning-for-Robotics
Chinese Translation
在共享人类环境中运行的机器人不仅需要导航、互动和检测周围环境,还必须解读和响应动态且常常不可预测的人类行为。尽管最近的进展在利用视觉-语言模型(Vision-Language Models, VLMs)增强机器人感知和指令跟随方面显示出希望,但在应对多模态人机交互(Human-Robot Interaction, HRI)的复杂性方面仍然有限。基于这一挑战,我们提出了一种轻量级的语言到视觉反馈模块,该模块在大型语言模型(Large Language Model, LLM)和VLM中的视觉编码器之间闭合了循环。该模块通过一个门控多层感知器(Gated Multi-Layer Perceptron, MLP)将图像-标记隐藏状态投影回编码器输入,促使第二次传递在文本上下文下重新解读场景。我们在三个以机器人为中心的任务上评估了该方法:在模拟环境(Habitat)中的导航、顺序场景描述(Mementos-Robotics)和人类意图识别(我们的HRI数据集)。结果表明,我们的方法使Qwen 2.5(7B)的导航距离缩短了$3.3\%$,描述分数提高了$+0.057$,准确率提高了$+2.93\%$,且额外参数少于$3\%$;Gemma 3(4B)和LLaVA OV 1.5(4B)在导航结果上表现不一,但在后两项任务上分别提高了$+0.111,+0.055$和$+10.81\%,+4.79\%$。代码可在https://github.com/alessioGalatolo/VLM-Reasoning-for-Robotics获取。
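The feedback module described above projects LLM image-token states through a gated MLP back into the encoder input for a second pass. A minimal numpy sketch of that residual, gated projection follows; the dimensions, initialization scale, and the sigmoid-gated two-layer form are assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_llm, d_enc, n_tokens, d_hidden = 32, 24, 8, 16

# hypothetical weights of the lightweight bypass module
W1 = rng.normal(scale=0.02, size=(d_llm, d_hidden))
W2 = rng.normal(scale=0.02, size=(d_hidden, d_enc))
W_gate = rng.normal(scale=0.02, size=(d_llm, d_enc))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feedback(hidden, enc_in):
    """Second-pass encoder input = first-pass input + gated projection
    of the LLM's image-token hidden states."""
    proj = np.tanh(hidden @ W1) @ W2      # language-conditioned signal
    g = sigmoid(hidden @ W_gate)          # per-feature gate in (0, 1)
    return enc_in + g * proj

hidden = rng.normal(size=(n_tokens, d_llm))   # LLM image-token states
enc_in = rng.normal(size=(n_tokens, d_enc))   # original encoder input
out = feedback(hidden, enc_in)
print(out.shape)
```

Because the update is additive and gated, the module can leave the first-pass input nearly untouched when the text context adds nothing, which is one plausible reason such bypass adapters stay cheap in parameters.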
cs.RO / 27 / 2603.03953
RVN-Bench: A Benchmark for Reactive Visual Navigation
RVN-Bench:反应式视觉导航基准
Abstract
Safe visual navigation is critical for indoor mobile robots operating in cluttered environments. Existing benchmarks, however, often neglect collisions or are designed for outdoor scenarios, making them unsuitable for indoor visual navigation. To address this limitation, we introduce the reactive visual navigation benchmark (RVN-Bench), a collision-aware benchmark for indoor mobile robots. In RVN-Bench, an agent must reach sequential goal positions in previously unseen environments using only visual observations and no prior map, while avoiding collisions. Built on the Habitat 2.0 simulator and leveraging high-fidelity HM3D scenes, RVN-Bench provides large-scale, diverse indoor environments, defines a collision-aware navigation task and evaluation metrics, and offers tools for standardized training and benchmarking. RVN-Bench supports both online and offline learning by offering an environment for online reinforcement learning, a trajectory image dataset generator, and tools for producing negative trajectory image datasets that capture collision events. Experiments show that policies trained on RVN-Bench generalize effectively to unseen environments, demonstrating its value as a standardized benchmark for safe and robust visual navigation. Code and additional materials are available at: https://rvn-bench.github.io/.
Chinese Translation
安全的视觉导航对于在杂乱环境中操作的室内移动机器人至关重要。然而,现有的基准往往忽视碰撞,或是为户外场景设计,使其不适用于室内视觉导航。为了解决这一局限性,我们提出了反应式视觉导航基准(RVN-Bench),这是一个针对室内移动机器人的碰撞感知基准。在RVN-Bench中,智能体必须仅使用视觉观测而不依赖于先前的地图,在之前未见过的环境中依次到达目标位置,同时避免碰撞。RVN-Bench基于Habitat 2.0模拟器,利用高保真度的HM3D场景,提供大规模、多样化的室内环境,定义了一个碰撞感知导航任务及评估指标,并提供标准化训练和基准测试的工具。RVN-Bench支持在线和离线学习,提供在线强化学习的环境、轨迹图像数据集生成器,以及用于生成捕捉碰撞事件的负轨迹图像数据集的工具。实验表明,在RVN-Bench上训练的策略能够有效地推广到未见过的环境,证明了其作为安全和稳健视觉导航标准化基准的价值。代码和其他材料可在:https://rvn-bench.github.io/获取。
cs.RO / 28 / 2603.03957
ArthroCut: Autonomous Policy Learning for Robotic Bone Resection in Knee Arthroplasty
ArthroCut:用于膝关节置换术的机器人骨切除自主策略学习
Abstract
Despite rapid commercialization of surgical robots, their autonomy and real-time decision-making remain limited in practice. To address this gap, we propose ArthroCut, an autonomous policy learning framework that upgrades knee arthroplasty robots from assistive execution to context-aware action generation. ArthroCut fine-tunes a Qwen--VL backbone on a self-built, time-synchronized multimodal dataset from 21 complete cases (23,205 RGB--D pairs), integrating preoperative CT/MR, intraoperative NDI tracking of bones and end effector, RGB--D surgical video, robot state, and textual intent. The method operates on two complementary token families -- Preoperative Imaging Tokens (PIT) to encode patient-specific anatomy and planned resection planes, and Time-Aligned Surgical Tokens (TAST) to fuse real-time visual, geometric, and kinematic evidence -- and emits an interpretable action grammar under grammar/safety-constrained decoding. In bench-top experiments on a knee prosthesis across seven trials, ArthroCut achieves an average success rate of 86% over the six standard resections, significantly outperforming strong baselines trained under the same protocol. Ablations show that TAST is the principal driver of reliability while PIT provides essential anatomical grounding, and their combination yields the most stable multi-plane execution. These results indicate that aligning preoperative geometry with time-aligned intraoperative perception and translating that alignment into tokenized, constrained actions is an effective path toward robust, interpretable autonomy in orthopedic robotic surgery.
Chinese Translation
尽管外科机器人迅速商业化,但其自主性和实时决策能力在实践中仍然有限。为了解决这一问题,我们提出了ArthroCut,一个自主策略学习框架,旨在将膝关节置换术机器人从辅助执行升级为上下文感知的动作生成。ArthroCut在自建的时间同步多模态数据集上微调了Qwen--VL骨干网络,该数据集来自21个完整病例(23,205对RGB--D),整合了术前CT/MR、术中骨骼和末端执行器的NDI跟踪、RGB--D外科视频、机器人状态和文本意图。该方法基于两种互补的标记族进行操作——术前影像标记(Preoperative Imaging Tokens, PIT)用于编码患者特定的解剖结构和计划的切除平面,以及时间对齐外科标记(Time-Aligned Surgical Tokens, TAST)用于融合实时视觉、几何和运动学证据,并在语法/安全约束解码下输出可解释的动作语法。在对膝关节假体的台式实验中,ArthroCut在六个标准切除中实现了86%的平均成功率,显著优于在相同协议下训练的强基线。消融实验表明,TAST是可靠性的主要驱动因素,而PIT提供了必要的解剖基础,两者的结合产生了最稳定的多平面执行。这些结果表明,将术前几何与时间对齐的术中感知对齐,并将这种对齐转化为标记化、受限的动作,是实现骨科机器人手术中稳健、可解释自主性的有效途径。
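The abstract above mentions emitting an interpretable action grammar under grammar/safety-constrained decoding. The toy sketch below shows the general mechanism: at each step the decoder may only pick among actions the grammar allows in the current state. The grammar, action names, and scores are invented for illustration and are not ArthroCut's actual action vocabulary.

```python
# Invented toy grammar: which actions may follow which state.
GRAMMAR = {
    "start":        ["select_plane"],
    "select_plane": ["approach"],
    "approach":     ["cut", "retract"],
    "cut":          ["retract"],
    "retract":      ["select_plane", "done"],
}

def constrained_decode(score_fn, max_steps=10):
    state, seq = "start", []
    for _ in range(max_steps):
        allowed = GRAMMAR[state]
        # pick the highest-scoring action among grammar-legal ones only
        action = max(allowed, key=score_fn)
        seq.append(action)
        if action == "done":
            break
        state = action
    return seq

# stand-in for model scores; "cut" is always tempting but only legal
# after "approach", so the constraint keeps the sequence safe
scores = {"select_plane": 0.2, "approach": 0.5, "cut": 0.9,
          "retract": 0.4, "done": 0.3}
seq = constrained_decode(scores.get)
print(seq)
```

The same masking idea applies at the token level in a real decoder: logits of grammar-illegal tokens are set to negative infinity before sampling.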
cs.RO / 29 / 2603.03960
Structural Action Transformer for 3D Dexterous Manipulation
用于三维灵巧操作的结构化动作变换器
Abstract
Achieving human-level dexterity in robots via imitation learning from heterogeneous datasets is hindered by the challenge of cross-embodiment skill transfer, particularly for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations and fail to handle embodiment heterogeneity. This paper proposes the Structural Action Transformer (SAT), a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective. We reframe each action chunk not as a temporal sequence, but as a variable-length, unordered sequence of joint-wise trajectories. This structural formulation allows a Transformer to natively handle heterogeneous embodiments, treating the joint count as a variable sequence length. To encode structural priors and resolve ambiguity, we introduce an Embodied Joint Codebook that embeds each joint's functional role and kinematic properties. Our model learns to generate these trajectories from 3D point clouds via a continuous-time flow matching objective. We validate our approach by pre-training on large-scale heterogeneous datasets and fine-tuning on simulation and real-world dexterous manipulation tasks. Our method consistently outperforms all baselines, demonstrating superior sample efficiency and effective cross-embodiment skill transfer. This structural-centric representation offers a new path toward scaling policies for high-DoF, heterogeneous manipulators.
Chinese Translation
通过从异构数据集中进行模仿学习,实现机器人在人类水平的灵巧性面临跨体现技能转移的挑战,尤其是对于高自由度(DoF)机器人手。现有方法通常依赖于二维观察和以时间为中心的动作表示,难以捕捉三维空间关系,并且无法处理体现异构性。本文提出了结构化动作变换器(Structural Action Transformer, SAT),一种新的三维灵巧操作策略,通过引入结构中心的视角挑战这一范式。我们将每个动作片段重新定义为一个可变长度、无序的关节轨迹序列,而非时间序列。这种结构化的表述使得变换器能够原生地处理异构体现,将关节数量视为可变序列长度。为了编码结构先验并解决歧义,我们引入了一个体现关节词典(Embodied Joint Codebook),该词典嵌入了每个关节的功能角色和运动学特性。我们的模型通过连续时间流匹配目标,从三维点云中学习生成这些轨迹。我们通过在大规模异构数据集上进行预训练,并在仿真和现实世界的灵巧操作任务上进行微调,验证了我们的方法。我们的方法始终优于所有基线方法,展示了卓越的样本效率和有效的跨体现技能转移。这种结构中心的表示为高自由度异构操控器的策略扩展提供了一条新路径。
cs.RO / 30 / 2603.03977
Right in Time: Reactive Reasoning in Regulated Traffic Spaces
及时反应:受监管交通空间中的反应性推理
Abstract
Exact inference in probabilistic First-Order Logic offers a promising yet computationally costly approach for regulating the behavior of autonomous agents in shared traffic spaces. While prior methods have combined logical and probabilistic data into decision-making frameworks, their application is often limited to pre-flight checks due to the complexity of reasoning across vast numbers of possible universes. In this work, we propose a reactive mission design framework that jointly considers uncertain environmental data and declarative, logical traffic regulations. By synthesizing Probabilistic Mission Design (ProMis) with reactive reasoning facilitated by Reactive Circuits (RC), we enable online, exact probabilistic inference over hybrid domains. Our approach leverages the Frequency of Change inherent in heterogeneous data streams to subdivide inference formulas into memoized, isolated tasks, ensuring that only the specific components affected by new sensor data are re-evaluated. In experiments involving both real-world vessel data and simulated drone traffic in dense urban scenarios, we demonstrate that our approach provides orders of magnitude in speedup over ProMis without reactive paradigms. This allows intelligent transportation systems, such as Unmanned Aircraft Systems (UAS), to actively assert safety and legal compliance during operations rather than relying solely on preparation procedures.
Chinese Translation
在概率一阶逻辑中进行精确推理为调节自主代理在共享交通空间中的行为提供了一种有前景但计算成本高昂的方法。尽管先前的方法已将逻辑和概率数据结合到决策框架中,但由于跨越大量可能宇宙的推理复杂性,其应用通常仅限于飞行前检查。在本研究中,我们提出了一种反应性任务设计框架,该框架共同考虑不确定的环境数据和声明性逻辑交通法规。通过将概率任务设计(Probabilistic Mission Design, ProMis)与由反应电路(Reactive Circuits, RC)促进的反应性推理相结合,我们实现了对混合领域的在线精确概率推理。我们的方法利用异构数据流中固有的变化频率,将推理公式细分为记忆化的孤立任务,确保仅重新评估受到新传感器数据影响的特定组件。在涉及真实船舶数据和密集城市场景中模拟无人机交通的实验中,我们证明了我们的方法在速度上比没有反应性范式的ProMis快几个数量级。这使得智能交通系统,如无人机系统(Unmanned Aircraft Systems, UAS),能够在操作过程中主动确保安全和法律合规,而不仅仅依赖于准备程序。
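The key systems idea in the abstract above is memoizing inference sub-tasks by the data stream that feeds them, so only components touched by new sensor data are re-evaluated. The sketch below captures that caching pattern in a few lines; the stream names, probability functions, and the independence assumption behind the product combination are all invented simplifications, not ProMis semantics.

```python
# Sketch of frequency-of-change memoization: each sub-task is keyed by
# (sub-task name, current value of its stream); cached results are reused
# until that particular stream changes.

class ReactiveInference:
    def __init__(self, subtasks):
        # subtasks: name -> (stream it depends on, probability function)
        self.subtasks = subtasks
        self.cache = {}
        self.evals = 0  # counts actual sub-task evaluations

    def query(self, streams):
        prob = 1.0
        for name, (stream, fn) in self.subtasks.items():
            key = (name, streams[stream])
            if key not in self.cache:       # re-evaluate only on change
                self.cache[key] = fn(streams[stream])
                self.evals += 1
            prob *= self.cache[key]         # independence assumed here
        return prob

inf = ReactiveInference({
    "geofence": ("position", lambda p: 0.9 if p < 10 else 0.2),
    "weather":  ("wind",     lambda w: 0.95 if w < 5 else 0.5),
})
r1 = inf.query({"position": 3, "wind": 2})  # both sub-tasks evaluated
r2 = inf.query({"position": 7, "wind": 2})  # only the geofence sub-task re-runs
print(r1, r2, inf.evals)
```

A fast stream (vehicle position) triggers frequent, cheap re-evaluation of its own sub-task, while slow streams (weather, regulations) stay cached, which is where the claimed orders-of-magnitude speedup would come from.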
cs.RO / 31 / 2603.03978
Map-Agnostic And Interactive Safety-Critical Scenario Generation via Multi-Objective Tree Search
通过多目标树搜索实现无地图依赖的交互式安全关键场景生成
Abstract
Generating safety-critical scenarios is essential for validating the robustness of autonomous driving systems, yet existing methods often struggle to produce collisions that are both realistic and diverse while ensuring explicit interaction logic among traffic participants. This paper presents a novel framework for traffic-flow level safety-critical scenario generation via multi-objective Monte Carlo Tree Search (MCTS). We reframe trajectory feasibility and naturalistic behavior as optimization objectives within a unified evaluation function, enabling the discovery of diverse collision events without compromising realism. A hybrid Upper Confidence Bound (UCB) and Lower Confidence Bound (LCB) search strategy is introduced to balance exploratory efficiency with risk-averse decision-making. Furthermore, our method is map-agnostic and supports interactive scenario generation with each vehicle individually powered by SUMO's microscopic traffic models, enabling realistic agent behaviors in arbitrary geographic locations imported from OpenStreetMap. We validate our approach across four high-risk accident zones in Hong Kong's complex urban environments. Experimental results demonstrate that our framework achieves an 85\% collision failure rate while generating trajectories with superior feasibility and comfort metrics. The resulting scenarios exhibit greater complexity, as evidenced by increased vehicle mileage and CO\(_2\) emissions. Our work provides a principled solution for stress testing autonomous vehicles through the generation of realistic yet infrequent corner cases at traffic-flow level.
Chinese Translation
生成安全关键场景对于验证自动驾驶系统的鲁棒性至关重要,但现有方法往往难以产生既真实又多样的碰撞,同时确保交通参与者之间的明确交互逻辑。本文提出了一种通过多目标蒙特卡洛树搜索(MCTS)进行交通流层面安全关键场景生成的新框架。我们将轨迹可行性和自然行为重新构建为统一评估函数中的优化目标,从而在不妥协现实性的情况下发现多样的碰撞事件。引入了一种混合的上置信界(UCB)和下置信界(LCB)搜索策略,以平衡探索效率与风险规避决策。此外,我们的方法不依赖于地图,并支持交互式场景生成,每辆车均由SUMO的微观交通模型单独驱动,使得在从OpenStreetMap导入的任意地理位置中实现真实的代理行为。我们在香港复杂城市环境中的四个高风险事故区域验证了我们的方法。实验结果表明,我们的框架在生成具有更高可行性和舒适性指标的轨迹时,达到了85%的碰撞失败率。生成的场景展现出更大的复杂性,体现在车辆里程和二氧化碳(CO\(_2\))排放的增加。我们的工作为通过生成真实但不频繁的边缘案例在交通流层面进行自动驾驶汽车的压力测试提供了一个原则性解决方案。
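The hybrid UCB/LCB strategy in the abstract above balances exploration against risk-averse decisions. The snippet below shows the standard confidence-bound formulas and the qualitative behavior: UCB favors under-visited branches during search, while LCB prefers well-tested ones when committing. How the paper interleaves the two inside MCTS, and the exploration constant, are assumptions here.

```python
import math

def ucb(mean, n_node, n_parent, c=1.4):
    """Upper confidence bound: optimistic score for exploration."""
    return mean + c * math.sqrt(math.log(n_parent) / n_node)

def lcb(mean, n_node, n_parent, c=1.4):
    """Lower confidence bound: pessimistic score for risk-averse choices."""
    return mean - c * math.sqrt(math.log(n_parent) / n_node)

# two branches with the same empirical mean but unequal visit counts
stats = {"A": (0.6, 5), "B": (0.6, 50)}  # name -> (mean value, visits)
n_parent = 55

explore = max(stats, key=lambda k: ucb(stats[k][0], stats[k][1], n_parent))
commit = max(stats, key=lambda k: lcb(stats[k][0], stats[k][1], n_parent))
print(explore, commit)
```

The under-visited branch "A" wins under UCB (larger uncertainty bonus), while the heavily visited branch "B" wins under LCB, which is exactly the exploration-versus-caution split the framework exploits.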
cs.RO / 32 / 2603.04029
Self-adapting Robotic Agents through Online Continual Reinforcement Learning with World Model Feedback
通过在线持续强化学习与世界模型反馈实现自适应机器人代理
Abstract
As learning-based robotic controllers are typically trained offline and deployed with fixed parameters, their ability to cope with unforeseen changes during operation is limited. Inspired by biological adaptation, this work presents a framework for online Continual Reinforcement Learning that enables automated adaptation during deployment. Building on DreamerV3, a model-based Reinforcement Learning algorithm, the proposed method leverages world model prediction residuals to detect out-of-distribution events and automatically trigger finetuning. Adaptation progress is monitored using both task-level performance signals and internal training metrics, allowing convergence to be assessed without external supervision or domain knowledge. The approach is validated on a variety of contemporary continuous control problems, including a quadruped robot in high-fidelity simulation and a real-world model vehicle. Relevant metrics and their interpretation are presented and discussed, and the resulting trade-offs are described. The results sketch out how autonomous robotic agents could one day move beyond static training regimes toward adaptive systems capable of self-reflection and self-improvement during operation, just like their biological counterparts.
Chinese Translation
由于基于学习的机器人控制器通常在离线环境中训练并以固定参数部署,因此它们在操作过程中应对不可预见变化的能力有限。本研究受生物启发,提出了一种在线持续强化学习框架,能够在部署过程中实现自动适应。该方法基于DreamerV3,一种基于模型的强化学习算法,利用世界模型预测残差来检测分布外事件并自动触发微调。适应进展通过任务级性能信号和内部训练指标进行监测,使得在没有外部监督和领域知识的情况下评估收敛成为可能。该方法在多种当代连续控制问题上进行了验证,包括高保真仿真中的四足机器人和真实世界的模型车辆。相关指标及其解释被提出和讨论,同时描述了由此产生的权衡。结果勾勒出自主机器人代理如何超越静态训练机制,朝着能够在操作过程中进行自我反思和自我改进的自适应系统发展,正如它们的生物对应物一样。
cs.RO / 33 / 2603.04038
Force-Aware Residual DAgger via Trajectory Editing for Precision Insertion with Impedance Control
基于轨迹编辑的力感知残差DAgger用于阻抗控制下的精确插入
Abstract
Imitation learning (IL) has shown strong potential for contact-rich precision insertion tasks. However, its practical deployment is often hindered by covariate shift and the need for continuous expert monitoring to recover from failures during execution. In this paper, we propose Trajectory Editing Residual Dataset Aggregation (TER-DAgger), a scalable and force-aware human-in-the-loop imitation learning framework that mitigates covariate shift by learning residual policies through optimization-based trajectory editing. First, this approach smoothly fuses policy rollouts with human corrective trajectories, providing consistent and stable supervision. Second, we introduce a force-aware failure anticipation mechanism that triggers human intervention only when discrepancies arise between predicted and measured end-effector forces, significantly reducing the requirement for continuous expert monitoring. Third, all learned policies are executed within a Cartesian impedance control framework, ensuring compliant and safe behavior during contact-rich interactions. Extensive experiments in both simulation and real-world precision insertion tasks show that TER-DAgger improves the average success rate by over 37\% compared to behavior cloning, human-guided correction, retraining, and fine-tuning baselines, demonstrating its effectiveness in mitigating covariate shift and enabling scalable deployment in contact-rich manipulation.
Chinese Translation
模仿学习(IL)在接触丰富的精确插入任务中展现出强大的潜力。然而,其实际应用常常受到协变量转移的阻碍,并且需要持续的专家监控以在执行过程中从失败中恢复。本文提出了一种轨迹编辑残差数据集聚合(TER-DAgger)框架,这是一种可扩展且力感知的人机协作模仿学习框架,通过基于优化的轨迹编辑学习残差策略,从而减轻协变量转移。首先,该方法平滑地将策略执行轨迹(rollouts)与人类纠正轨迹融合,提供一致且稳定的监督。其次,我们引入了一种力感知的失败预期机制,仅在预测和测量的末端执行器力之间出现差异时触发人类干预,显著减少了对持续专家监控的需求。第三,所有学习到的策略都在笛卡尔阻抗控制框架内执行,确保在接触丰富的交互中表现出顺应性和安全性。在模拟和真实世界的精确插入任务中进行的广泛实验表明,与行为克隆、人类引导的纠正、再训练和微调基线相比,TER-DAgger的平均成功率提高了超过37\%,证明了其在减轻协变量转移和实现接触丰富操作的可扩展部署方面的有效性。
cs.RO / 34 / 2603.04050
HE-VPR: Height Estimation Enabled Aerial Visual Place Recognition Against Scale Variance
HE-VPR:针对尺度变化的高度估计支持的空中视觉地点识别
Abstract
In this work, we propose HE-VPR, a visual place recognition (VPR) framework that incorporates height estimation. Our system decouples height inference from place recognition, allowing both modules to share a frozen DINOv2 backbone. Two lightweight bypass adapter branches are integrated into our system. The first estimates the height partition of the query image via retrieval from a compact height database, and the second performs VPR within the corresponding height-specific sub-database. The adaptation design reduces training cost and significantly decreases the search space of the database. We also adopt a center-weighted masking strategy to further enhance the robustness against scale differences. Experiments on two self-collected challenging multi-altitude datasets demonstrate that HE-VPR achieves up to 6.1\% Recall@1 improvement over state-of-the-art ViT-based baselines and reduces memory usage by up to 90\%. These results indicate that HE-VPR offers a scalable and efficient solution for height-aware aerial VPR, enabling practical deployment in GNSS-denied environments. All the code and datasets for this work have been released on https://github.com/hmf21/HE-VPR.
Chinese Translation
在本研究中,我们提出了HE-VPR,一个结合高度估计的视觉地点识别(VPR)框架。我们的系统将高度推断与地点识别解耦,使得两个模块能够共享一个冻结的DINOv2主干网络。我们在系统中集成了两个轻量级的旁路适配器分支。第一个通过从紧凑的高度数据库中检索来估计查询图像的高度分区,第二个在相应的高度特定子数据库中执行VPR。适配设计降低了训练成本,并显著减少了数据库的搜索空间。我们还采用了一种中心加权掩蔽策略,以进一步增强对尺度差异的鲁棒性。在两个自收集的具有挑战性的多高度数据集上的实验表明,HE-VPR在Recall@1上比最先进的基于ViT的基线提高了最高6.1\%,并将内存使用减少了最高90\%。这些结果表明,HE-VPR为高度感知的空中VPR提供了一种可扩展且高效的解决方案,使其能够在GNSS拒止环境中进行实际部署。所有代码和数据集已在https://github.com/hmf21/HE-VPR上发布。
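The two-stage retrieval idea (pick a height partition first, then do place recognition only inside that partition's sub-database) can be sketched as follows. The cosine-similarity retrieval and toy 2-D descriptors are stand-ins for the paper's frozen DINOv2 features:

```python
import numpy as np

def cosine_top1(query, db):
    """Index of the database row most similar to the query (cosine)."""
    db, q = np.asarray(db, float), np.asarray(query, float)
    sims = db @ q / (np.linalg.norm(db, axis=1) * np.linalg.norm(q) + 1e-8)
    return int(np.argmax(sims))

def two_stage_vpr(query, height_db, height_labels, sub_dbs):
    """Stage 1: retrieve the nearest entry in a compact height database
    to choose a height partition. Stage 2: run place retrieval only
    within the matching height-specific sub-database."""
    partition = height_labels[cosine_top1(query, height_db)]
    place = cosine_top1(query, sub_dbs[partition])
    return partition, place

# toy descriptors; a real system would use learned image features
height_db = [[1.0, 0.0], [0.0, 1.0]]
height_labels = ["low", "high"]
sub_dbs = {"low": [[0.0, 1.0], [1.0, 0.0]], "high": [[0.5, 0.5]]}
partition, place = two_stage_vpr([1.0, 0.05], height_db, height_labels, sub_dbs)
```

The memory saving in the abstract comes from the same mechanism: only the selected sub-database is searched, so the second-stage search space shrinks with the number of partitions.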
cs.RO / 35 / 2603.04057
Sim2Sea: Sim-to-Real Policy Transfer for Maritime Vessel Navigation in Congested Waters
Sim2Sea:拥挤水域中海洋船舶导航的仿真到现实策略转移
Abstract
Autonomous navigation in congested maritime environments is a critical capability for a wide range of real-world applications. However, it remains an unresolved challenge due to complex vessel interactions and significant environmental uncertainties. Existing methods often fail in practical deployment due to a substantial sim-to-real gap, which stems from imprecise simulation, inadequate situational awareness, and unsafe exploration strategies. To address these, we propose \textbf{Sim2Sea}, a comprehensive framework designed to bridge simulation and real-world execution. Sim2Sea advances in three key aspects. First, we develop a GPU-accelerated parallel simulator for scalable and accurate maritime scenario simulation. Second, we design a dual-stream spatiotemporal policy that handles complex dynamics and multi-modal perception, augmented with a velocity-obstacle-guided action masking mechanism to ensure safe and efficient exploration. Finally, a targeted domain randomization scheme helps bridge the sim-to-real gap. Simulation results demonstrate that our method achieves faster convergence and safer trajectories than established baselines. In addition, our policy trained purely in simulation successfully transfers zero-shot to a 17-ton unmanned vessel operating in real-world congested waters. These results validate the effectiveness of Sim2Sea in achieving reliable sim-to-real transfer for practical autonomous maritime navigation.
Chinese Translation
在拥挤的海洋环境中进行自主导航是广泛现实应用中的一项关键能力。然而,由于复杂的船舶交互和显著的环境不确定性,这仍然是一个未解决的挑战。现有方法在实际部署中往往失败,原因在于显著的仿真到现实差距,这源于不精确的仿真、不足的情境感知和不安全的探索策略。为了解决这些问题,我们提出了\textbf{Sim2Sea},这是一个旨在弥合仿真与现实执行之间差距的综合框架。Sim2Sea在三个关键方面取得了进展。首先,我们开发了一个GPU加速的并行仿真器,用于可扩展和准确的海洋场景仿真。其次,我们设计了一种双流时空策略,能够处理复杂的动态和多模态感知,并增强了一个基于速度障碍的动作屏蔽机制,以确保安全和高效的探索。最后,一个针对性的领域随机化方案有助于弥合仿真到现实的差距。仿真结果表明,我们的方法在收敛速度和轨迹安全性方面优于已建立的基线。此外,我们在仿真中纯粹训练的策略成功实现了零样本转移到一艘17吨的无人船,在现实的拥挤水域中操作。这些结果验证了Sim2Sea在实现可靠的仿真到现实转移方面的有效性,以支持实际的自主海洋导航。
cs.RO / 36 / 2603.04071
SaFeR: Safety-Critical Scenario Generation for Autonomous Driving Test via Feasibility-Constrained Token Resampling
SaFeR:通过可行性约束的标记重采样生成安全关键场景以测试自动驾驶
Abstract
Safety-critical scenario generation is crucial for evaluating autonomous driving systems. However, existing approaches often struggle to balance three conflicting objectives: adversarial criticality, physical feasibility, and behavioral realism. To bridge this gap, we propose SaFeR: safety-critical scenario generation for autonomous driving test via feasibility-constrained token resampling. We first formulate traffic generation as a discrete next token prediction problem, employing a Transformer-based model as a realism prior to capture naturalistic driving distributions. To capture complex interactions while effectively mitigating attention noise, we propose a novel differential attention mechanism within the realism prior. Building on this prior, SaFeR implements a novel resampling strategy that induces adversarial behaviors within a high-probability trust region to maintain naturalism, while enforcing a feasibility constraint derived from the Largest Feasible Region (LFR). By approximating the LFR via offline reinforcement learning, SaFeR effectively prevents the generation of theoretically inevitable collisions. Closed-loop experiments on the Waymo Open Motion Dataset and nuPlan demonstrate that SaFeR significantly outperforms state-of-the-art baselines, achieving a higher solution rate and superior kinematic realism while maintaining strong adversarial effectiveness.
Chinese Translation
安全关键场景生成对于评估自动驾驶系统至关重要。然而,现有方法往往难以平衡三个相互冲突的目标:对抗性关键性、物理可行性和行为真实感。为了解决这一问题,我们提出了SaFeR:通过可行性约束的标记重采样生成安全关键场景以测试自动驾驶。我们首先将交通生成形式化为一个离散的下一个标记预测问题,采用基于Transformer的模型作为真实感先验,以捕捉自然驾驶分布。为了捕捉复杂的交互,同时有效减轻注意力噪声,我们在真实感先验中提出了一种新颖的差分注意力机制。在此先验的基础上,SaFeR实施了一种新颖的重采样策略,该策略在高概率信任区域内诱导对抗性行为,以维持自然性,同时强制执行源自最大可行区域(Largest Feasible Region, LFR)的可行性约束。通过离线强化学习近似LFR,SaFeR有效防止了理论上不可避免的碰撞。在Waymo开放运动数据集和nuPlan上的闭环实验表明,SaFeR显著优于最先进的基线,达到了更高的解决率和更优的运动学真实感,同时保持了强大的对抗有效性。
cs.RO / 37 / 2603.04073
Swimming Under Constraints: A Safe Reinforcement Learning Framework for Quadrupedal Bio-Inspired Propulsion
在约束下游泳:一种安全的强化学习框架用于四足生物启发推进
Abstract
Bio-inspired aquatic propulsion offers high thrust and maneuverability but is prone to destabilizing forces such as lift fluctuations, which are further amplified by six-degree-of-freedom (6-DoF) fluid coupling. We formulate quadrupedal swimming as a constrained optimization problem that maximizes forward thrust while minimizing destabilizing fluctuations. Our proposed framework, Accelerated Constrained Proximal Policy Optimization with a PID-regulated Lagrange multiplier (ACPPO-PID), enforces constraints with a PID-regulated Lagrange multiplier, accelerates learning via conditional asymmetric clipping, and stabilizes updates through cycle-wise geometric aggregation. Initialized with imitation learning and refined through on-hardware towing-tank experiments, ACPPO-PID produces control policies that transfer effectively to quadrupedal free-swimming trials. Results demonstrate improved thrust efficiency, reduced destabilizing forces, and faster convergence compared with state-of-the-art baselines, underscoring the importance of constraint-aware safe RL for robust and generalizable bio-inspired locomotion in complex fluid environments.
Chinese Translation
生物启发的水下推进提供了高推力和机动性,但容易受到如升力波动等不稳定力的影响,这种影响在六自由度(6-DoF)流体耦合下进一步放大。我们将四足游泳形式化为一个约束优化问题,旨在最大化前向推力,同时最小化不稳定波动。我们提出的框架是带PID调节拉格朗日乘子的加速约束近端策略优化(Accelerated Constrained Proximal Policy Optimization with a PID-regulated Lagrange multiplier,ACPPO-PID),它通过PID调节的拉格朗日乘子施加约束,通过条件非对称裁剪加速学习,并通过逐周期几何聚合稳定更新。该方法以模仿学习初始化,并通过真实硬件的拖曳水池实验进行精炼,ACPPO-PID生成的控制策略能够有效迁移到四足自由游泳试验中。结果表明,与最先进的基线相比,推力效率提高、不稳定力减少、收敛速度更快,强调了约束感知的安全强化学习在复杂流体环境中实现稳健和可推广的生物启发运动的重要性。
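A PID-regulated Lagrange multiplier of the kind named in the abstract can be sketched as below: the multiplier grows while the constraint cost exceeds its limit and is clamped at zero once the constraint is satisfied. The gains and the anti-windup clamp are illustrative assumptions, not the paper's tuning:

```python
class PIDLagrange:
    """PID-regulated Lagrange multiplier for a constrained policy update.
    The error is (constraint_cost - limit): positive while violating."""

    def __init__(self, kp=0.5, ki=0.1, kd=0.05):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_err = 0.0, 0.0

    def update(self, constraint_cost, limit):
        err = constraint_cost - limit
        self.integral = max(0.0, self.integral + err)  # simple anti-windup
        deriv = err - self.prev_err
        self.prev_err = err
        lam = self.kp * err + self.ki * self.integral + self.kd * deriv
        return max(0.0, lam)  # Lagrange multipliers are nonnegative

lam_violate = PIDLagrange().update(constraint_cost=2.0, limit=1.0)  # violating
lam_safe = PIDLagrange().update(constraint_cost=0.5, limit=1.0)     # satisfied
```

In a constrained RL loop, `lam` would weight the constraint term in the policy loss each update, so penalization tracks the degree of violation rather than being fixed.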
cs.RO / 38 / 2603.04118
Modeling and Control of a Pneumatic Soft Robotic Catheter Using Neural Koopman Operators
基于神经Koopman算子的气动软机器人导管建模与控制
Abstract
Catheter-based interventions are widely used for the diagnosis and treatment of cardiac diseases. Recently, robotic catheters have attracted attention for their ability to improve precision and stability over conventional manual approaches. However, accurate modeling and control of soft robotic catheters remain challenging due to their complex, nonlinear behavior. The Koopman operator enables lifting the original system data into a linear "lifted space", offering a data-driven framework for predictive control; however, manually chosen basis functions in the lifted space often oversimplify system behaviors and degrade control performance. To address this, we propose a neural network-enhanced Koopman operator framework that jointly learns the lifted space representation and Koopman operator in an end-to-end manner. Moreover, motivated by the need to minimize radiation exposure during X-ray fluoroscopy in cardiac ablation, we investigate open-loop control strategies using neural Koopman operators to reliably reach target poses without continuous imaging feedback. The proposed method is validated in two experimental scenarios: interactive position control and a simulated cardiac ablation task using an atrium-like cavity. Our approach achieves average errors of 2.1 ± 0.4 mm in position and 4.9 ± 0.6 degrees in orientation, outperforming not only model-based baselines but also other Koopman variants in targeting accuracy and efficiency. These results highlight the potential of the proposed framework for advancing soft robotic catheter systems and improving catheter-based interventions.
Chinese Translation
导管介入技术广泛应用于心脏疾病的诊断和治疗。近年来,机器人导管因其在精度和稳定性方面优于传统手动方法而受到关注。然而,由于软机器人导管的复杂非线性行为,准确的建模和控制仍然具有挑战性。Koopman算子能够将原始系统数据提升到线性“提升空间”,为预测控制提供了一个基于数据的框架;然而,在提升空间中手动选择的基函数往往会过于简化系统行为,从而降低控制性能。为了解决这一问题,我们提出了一种神经网络增强的Koopman算子框架,该框架以端到端的方式共同学习提升空间表示和Koopman算子。此外,鉴于在心脏消融过程中需要最小化X射线透视下的辐射暴露,我们研究了使用神经Koopman算子的开环控制策略,以可靠地达到目标姿态而无需持续的成像反馈。所提方法在两个实验场景中得到了验证:交互式位置控制和使用类似心房的腔体进行的模拟心脏消融任务。我们的方法在位置和方向上的平均误差分别为2.1 ± 0.4毫米和4.9 ± 0.6度,不仅优于基于模型的基准,还在目标到达准确性和效率上超越了其他Koopman变体。这些结果突显了所提框架在推进软机器人导管系统和改善导管介入技术方面的潜力。
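The core Koopman idea, lifting the state so that the dynamics become linear, can be demonstrated with a minimal EDMD-style least-squares fit. The hand-chosen dictionary below stands in for the learned neural encoder, and the toy system is an illustrative assumption:

```python
import numpy as np

def lift(x):
    # hand-chosen dictionary [x, x^2]; a neural encoder would learn this end to end
    return np.array([x, x ** 2])

# trajectory of the system x_{t+1} = 0.9 * x_t; in the lifted coordinates
# [x, x^2] its dynamics are exactly linear: [0.9 x, 0.81 x^2]
xs = [1.5]
for _ in range(20):
    xs.append(0.9 * xs[-1])

Psi = np.stack([lift(x) for x in xs[:-1]])       # (T, 2) lifted states
Psi_next = np.stack([lift(x) for x in xs[1:]])   # (T, 2) lifted next states

# least-squares Koopman matrix K such that Psi_next ≈ Psi @ K (EDMD)
K, *_ = np.linalg.lstsq(Psi, Psi_next, rcond=None)

pred = lift(xs[5]) @ K                  # one-step prediction in lifted space
err = float(abs(pred[0] - xs[6]))       # error in the recovered state
```

Once `K` is fit, multi-step open-loop prediction (as needed when imaging feedback is withheld) is just repeated multiplication by `K` in the lifted space.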
cs.RO / 39 / 2603.04144
HBRB-BoW: A Retrained Bag-of-Words Vocabulary for ORB-SLAM via Hierarchical BRB-KMeans
HBRB-BoW:通过层次化 BRB-KMeans 重新训练的 ORB-SLAM 词汇包
Abstract
In visual simultaneous localization and mapping (SLAM), the quality of the visual vocabulary is fundamental to the system's ability to represent environments and recognize locations. While ORB-SLAM is a widely used framework, its binary vocabulary, trained through the k-majority-based bag-of-words (BoW) approach, suffers from inherent precision loss. The inability of conventional binary clustering to represent subtle feature distributions leads to the degradation of visual words, a problem that is compounded as errors accumulate and propagate through the hierarchical tree structure. To address these structural deficiencies, this paper proposes hierarchical binary-to-real-and-back (HBRB)-BoW, a refined hierarchical binary vocabulary training algorithm. By integrating a global real-valued flow within the hierarchical clustering process, our method preserves high-fidelity descriptor information until the final binarization at the leaf nodes. Experimental results demonstrate that the proposed approach yields a more discriminative and well-structured vocabulary than traditional methods, significantly enhancing the representational integrity of the visual dictionary in complex environments. Furthermore, replacing the default ORB-SLAM vocabulary file with our HBRB-BoW file is expected to improve performance in loop closing and relocalization tasks.
Chinese Translation
在视觉同时定位与地图构建(SLAM)中,视觉词汇的质量对系统表示环境和识别位置的能力至关重要。尽管 ORB-SLAM 是一种广泛使用的框架,但其通过基于 k-多数的词袋(BoW)方法训练的二进制词汇存在固有的精度损失。传统的二进制聚类无法有效表示细微的特征分布,导致视觉词汇的退化,随着错误的累积和在层次树结构中的传播,这一问题愈加严重。为了解决这些结构性缺陷,本文提出了层次化二进制-实数-再二进制(hierarchical binary-to-real-and-back,HBRB)BoW,作为一种改进的层次化二进制词汇训练算法。通过在层次聚类过程中整合全局实值流,我们的方法在最终的叶节点二进制化之前,能够保持高保真度的描述符信息。实验结果表明,所提出的方法比传统方法产生更具区分性和结构良好的词汇,显著增强了复杂环境中视觉词典的表征完整性。此外,使用我们的 HBRB-BoW 文件替换默认的 ORB-SLAM 词汇文件,预计将改善回环闭合和重定位任务的性能。
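The binary-to-real-and-back principle can be sketched as hierarchical k-means that keeps real-valued centroids while clustering and binarizes (threshold 0.5) only at the leaves, where the final visual words are emitted. The branching factor, depth, and toy 4-bit descriptors are illustrative:

```python
import numpy as np

def hbrb_words(X, branching=2, depth=2, seed=0):
    """Hierarchical k-means over binary descriptors with real-valued
    centroids internally; only the leaf centroids are binarized,
    yielding the final binary visual words."""
    rng = np.random.default_rng(seed)

    def kmeans_assign(pts, k, iters=10):
        centers = pts[rng.choice(len(pts), size=k, replace=False)].astype(float)
        for _ in range(iters):
            assign = np.argmin(((pts[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
            for j in range(k):
                if np.any(assign == j):
                    centers[j] = pts[assign == j].mean(axis=0)  # real-valued mean
        return assign

    def build(pts, d):
        if d == 0 or len(pts) < branching:
            return [(pts.mean(axis=0) >= 0.5).astype(np.uint8)]  # binarize at leaf
        assign = kmeans_assign(pts, branching)
        words = []
        for j in range(branching):
            if np.any(assign == j):
                words += build(pts[assign == j], d - 1)
        return words

    return build(np.asarray(X, dtype=float), depth)

# toy 4-bit descriptors; real ORB descriptors are 256-bit
descs = np.array([[0, 0, 0, 1], [0, 0, 1, 1], [1, 1, 0, 0],
                  [1, 1, 1, 0], [0, 0, 0, 0], [1, 1, 0, 1]], dtype=np.uint8)
words = hbrb_words(descs)
```

By contrast, a k-majority scheme would binarize at every internal level, which is where the precision loss the abstract describes accumulates down the tree.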
cs.RO / 40 / 2603.04158
GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning
GarmentPile++:基于可供性驱动的杂乱服装检索与视觉-语言推理
Abstract
Garment manipulation has attracted increasing attention due to its critical role in home-assistant robotics. However, the majority of existing garment manipulation works assume an initial state consisting of only one garment, while piled garments are far more common in real-world settings. To bridge this gap, we propose a novel garment retrieval pipeline that can not only follow language instruction to execute safe and clean retrieval but also guarantee exactly one garment is retrieved per attempt, establishing a robust foundation for the execution of downstream tasks (e.g., folding, hanging, wearing). Our pipeline seamlessly integrates vision-language reasoning with visual affordance perception, fully leveraging the high-level reasoning and planning capabilities of VLMs alongside the generalization power of visual affordance for low-level actions. To enhance the VLM's comprehensive awareness of each garment's state within a garment pile, we employ visual segmentation model (SAM2) to execute object segmentation on the garment pile for aiding VLM-based reasoning with sufficient visual cues. A mask fine-tuning mechanism is further integrated to address scenarios where the initial segmentation results are suboptimal. In addition, a dual-arm cooperation framework is deployed to address cases involving large or long garments, as well as excessive garment sagging caused by incorrect grasping point determination, both of which are strenuous for a single arm to handle. The effectiveness of our pipeline is consistently demonstrated across diverse tasks and varying scenarios in both real-world and simulation environments. Project page: https://garmentpile2.github.io/.
Chinese Translation
服装操作因其在家居助理机器人中的关键作用而受到越来越多的关注。然而,现有的大多数服装操作研究假设初始状态仅包含一件服装,而在现实环境中,堆叠的服装更为常见。为弥合这一差距,我们提出了一种新颖的服装检索管道,该管道不仅能够根据语言指令执行安全、干净的检索,还能确保每次尝试仅检索到一件服装,为下游任务(例如折叠、悬挂、穿着)的执行奠定了坚实基础。我们的管道无缝整合了视觉-语言推理与视觉可供性感知,充分利用了视觉语言模型(VLM)在高层次推理和规划能力方面的优势,以及视觉可供性在低层次动作中的泛化能力。为了增强VLM对服装堆中每件服装状态的全面感知,我们采用视觉分割模型(SAM2)对服装堆进行对象分割,以为基于VLM的推理提供足够的视觉线索。此外,我们还集成了一种掩膜微调机制,以应对初始分割结果不理想的情况。此外,我们部署了双臂协作框架,以解决涉及大型或长款服装的情况,以及由于错误的抓取点确定导致的过度服装下垂问题,这两者对于单臂操作来说都是非常困难的。我们的管道在现实世界和仿真环境中的多种任务和不同场景中均表现出一致的有效性。项目页面:https://garmentpile2.github.io/
cs.RO / 41 / 2603.04166
Learning Hip Exoskeleton Control Policy via Predictive Neuromusculoskeletal Simulation
通过预测神经肌肉骨骼仿真学习髋部外骨骼控制策略
Abstract
Developing exoskeleton controllers that generalize across diverse locomotor conditions typically requires extensive motion-capture data and biomechanical labeling, limiting scalability beyond instrumented laboratory settings. Here, we present a physics-based neuromusculoskeletal learning framework that trains a hip-exoskeleton control policy entirely in simulation, without motion-capture demonstrations, and deploys it on hardware via policy distillation. A reinforcement learning teacher policy is trained using a muscle-synergy action prior over a wide range of walking speeds and slopes through a two-stage curriculum, enabling direct comparison between assisted and no-exoskeleton conditions. In simulation, exoskeleton assistance reduces mean muscle activation by up to 3.4% and mean positive joint power by up to 7.0% on level ground and ramp ascent, with benefits increasing systematically with walking speed. On hardware, the assistance profiles learned in simulation are preserved across matched speed-slope conditions (r: 0.82, RMSE: 0.03 Nm/kg), providing quantitative evidence of sim-to-real transfer without additional hardware tuning. These results demonstrate that physics-based neuromusculoskeletal simulation can serve as a practical and scalable foundation for exoskeleton controller development, substantially reducing experimental burden during the design phase.
Chinese Translation
开发能够在多样化运动条件下泛化的外骨骼控制器通常需要大量的运动捕捉数据和生物力学标注,这限制了其在仪器化实验室环境之外的可扩展性。在此,我们提出了一种基于物理的神经肌肉骨骼学习框架,该框架完全在仿真中训练髋部外骨骼控制策略,无需运动捕捉演示,并通过策略蒸馏将其部署到硬件上。通过两阶段课程,使用肌肉协同动作先验在广泛的步行速度和坡度范围内训练强化学习教师策略,使得能够直接比较辅助和无外骨骼条件。在仿真中,外骨骼辅助在平地行走和坡道上行条件下将平均肌肉激活降低最高3.4%,将平均正关节功率降低最高7.0%,且收益随步行速度的提高而系统性增大。在硬件上,仿真中学习到的辅助特征在匹配的速度-坡度条件下得以保留(相关系数:0.82,均方根误差:0.03 Nm/kg),提供了无额外硬件调试的仿真到现实转移的定量证据。这些结果表明,基于物理的神经肌肉骨骼仿真可以作为外骨骼控制器开发的实用且可扩展的基础,显著减少设计阶段的实验负担。
cs.RO / 42 / 2603.04208
GSeg3D: A High-Precision Grid-Based Algorithm for Safety-Critical Ground Segmentation in LiDAR Point Clouds
GSeg3D:一种用于安全关键环境中LiDAR点云的高精度网格基础地面分割算法
Abstract
Ground segmentation in point cloud data is the process of separating ground points from non-ground points. This task is fundamental for perception in autonomous driving and robotics, where safety and reliable operation depend on the precise detection of obstacles and navigable surfaces. Existing methods often fall short of the high precision required in safety-critical environments, leading to false detections that can compromise decision-making. In this work, we present a ground segmentation approach designed to deliver consistently high precision, supporting the stringent requirements of autonomous vehicles and robotic systems operating in real-world, safety-critical scenarios.
Chinese Translation
点云数据中的地面分割是将地面点与非地面点分离的过程。这一任务对于自动驾驶和机器人技术至关重要,因为安全和可靠的操作依赖于对障碍物和可导航表面的精确检测。现有方法往往无法满足安全关键环境中所需的高精度,导致错误检测,从而影响决策过程。在本研究中,我们提出了一种地面分割方法,旨在提供始终如一的高精度,以支持在现实世界安全关键场景中运行的自动驾驶车辆和机器人系统的严格要求。
cs.RO / 43 / 2603.04222
PRAM-R: A Perception-Reasoning-Action-Memory Framework with LLM-Guided Modality Routing for Adaptive Autonomous Driving
PRAM-R:一种具有LLM引导的模态路由的感知-推理-行动-记忆框架,用于自适应自动驾驶
Abstract
Multimodal perception enables robust autonomous driving but incurs unnecessary computational cost when all sensors remain active. This paper presents PRAM-R, a unified Perception-Reasoning-Action-Memory framework with LLM-Guided Modality Routing for adaptive autonomous driving. PRAM-R adopts an asynchronous dual-loop design: a fast reactive loop for perception and control, and a slow deliberative loop for reasoning-driven modality selection and memory updates. An LLM router selects and weights modalities using environmental context and sensor diagnostics, while a hierarchical memory module preserves temporal consistency and supports long-term adaptation. We conduct a two-stage evaluation: (1) synthetic stress tests for stability analysis and (2) real-world validation on the nuScenes dataset. Synthetic stress tests confirm 87.2% reduction in routing oscillations via hysteresis-based stabilization. Real-world validation on nuScenes shows 6.22% modality reduction with 20% memory recall while maintaining comparable trajectory accuracy to full-modality baselines in complex urban scenarios. Our work demonstrates that LLM-augmented architectures with hierarchical memory achieve efficient, adaptive multimodal perception in autonomous driving.
Chinese Translation
多模态感知能够实现稳健的自动驾驶,但在所有传感器保持激活状态时会产生不必要的计算成本。本文提出了PRAM-R,一个统一的感知-推理-行动-记忆框架,结合了LLM引导的模态路由,用于自适应自动驾驶。PRAM-R采用异步双循环设计:快速反应循环用于感知和控制,慢速深思循环用于基于推理的模态选择和记忆更新。LLM路由器使用环境上下文和传感器诊断来选择和加权模态,而层次记忆模块保持时间一致性并支持长期适应。我们进行了两阶段评估:(1)用于稳定性分析的合成压力测试和(2)在nuScenes数据集上的真实世界验证。合成压力测试确认通过基于迟滞(hysteresis)的稳定化实现了87.2%的路由波动减少。nuScenes上的真实世界验证显示,在复杂城市场景中,模态减少了6.22%,记忆回忆率为20%,同时保持与全模态基线相当的轨迹精度。我们的工作表明,LLM增强的架构与层次记忆结合,实现了自动驾驶中的高效自适应多模态感知。
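Hysteresis-based routing stabilization of the kind evaluated above can be sketched as a switch that only changes modality when a challenger clearly beats the incumbent by a margin. The margin value and score dictionaries are illustrative assumptions:

```python
class HysteresisRouter:
    """Keep the current modality choice unless a challenger's score
    exceeds the incumbent's by more than a fixed margin, suppressing
    oscillation when scores are noisy and close."""

    def __init__(self, initial, margin=0.2):
        self.active, self.margin = initial, margin

    def route(self, scores):
        best = max(scores, key=scores.get)
        if best != self.active and scores[best] - scores[self.active] > self.margin:
            self.active = best  # clear win: switch
        return self.active

router = HysteresisRouter(initial="camera")
a = router.route({"camera": 0.50, "lidar": 0.55})  # within margin: no switch
b = router.route({"camera": 0.40, "lidar": 0.70})  # clear win: switch
```

Without the margin, the first call would already flip to lidar and small score noise could flip it back, which is exactly the oscillation the hysteresis suppresses.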
cs.RO / 44 / 2603.04225
AMP2026: A Multi-Platform Marine Robotics Dataset for Tracking and Mapping
AMP2026:一个用于跟踪与建图的多平台海洋机器人数据集
Abstract
Marine environments present significant challenges for perception and autonomy due to dynamic surfaces, limited visibility, and complex interactions between aerial, surface, and submerged sensing modalities. This paper introduces the Aerial Marine Perception Dataset (AMP2026), a multi-platform marine robotics dataset collected across multiple field deployments designed to support research in two primary areas: multi-view tracking and marine environment mapping. The dataset includes synchronized data from aerial drones, boat-mounted cameras, and submerged robotic platforms, along with associated localization and telemetry information. The goal of this work is to provide a publicly available dataset enabling research in marine perception and multi-robot observation scenarios. This paper describes the data collection methodology, sensor configurations, dataset organization, and intended research tasks supported by the dataset.
Chinese Translation
海洋环境由于动态水面、有限的能见度以及空中、水面和水下传感模态之间的复杂交互,对感知和自主性提出了重大挑战。本文介绍了空中海洋感知数据集(Aerial Marine Perception Dataset,AMP2026),这是一个跨多个实地部署收集的多平台海洋机器人数据集,旨在支持两个主要研究领域:多视角跟踪和海洋环境建图。该数据集包括来自空中无人机、船载摄像头和水下机器人平台的同步数据,以及相关的定位和遥测信息。本文的目标是提供一个公开可用的数据集,以促进海洋感知和多机器人观察场景的研究。本文描述了数据收集方法、传感器配置、数据集组织以及数据集支持的预期研究任务。
cs.RO / 45 / 2603.04249
RoboLight: A Dataset with Linearly Composable Illumination for Robotic Manipulation
RoboLight:用于机器人操作的线性可组合照明数据集
Abstract
In this paper, we introduce RoboLight, the first real-world robotic manipulation dataset capturing synchronized episodes under systematically varied lighting conditions. RoboLight consists of two components. (a) RoboLight-Real contains 2,800 real-world episodes collected in our custom Light Cube setup, a calibrated system equipped with eight programmable RGB LED lights. It includes structured illumination variation along three independently controlled dimensions: color, direction, and intensity. Each dimension is paired with a dedicated task featuring objects of diverse geometries and materials to induce perceptual challenges. All image data are recorded in high-dynamic-range (HDR) format to preserve radiometric accuracy. Leveraging the linearity of light transport, we introduce (b) RoboLight-Synthetic, comprising 196,000 episodes synthesized through interpolation in the HDR image space of RoboLight-Real. In principle, RoboLight-Synthetic can be arbitrarily expanded by refining the interpolation granularity. We further verify the dataset quality through qualitative analysis and real-world policy roll-outs, analyzing task difficulty, distributional diversity, and the effectiveness of synthesized data. We additionally demonstrate three representative use cases of the proposed dataset. The full dataset, along with the system software and hardware design, will be released as open-source to support continued research.
Chinese Translation
在本文中,我们介绍了RoboLight,这是第一个在系统变化的照明条件下捕捉同步回合(episodes)的真实世界机器人操作数据集。RoboLight由两个部分组成。(a) RoboLight-Real包含在我们定制的光立方设置中收集的2,800个真实世界回合,这是一个配备有八个可编程RGB LED灯的校准系统。它包括沿三个独立控制维度的结构化照明变化:颜色、方向和强度。每个维度都与一个专门的任务配对,涉及不同几何形状和材料的物体,以引发感知挑战。所有图像数据均以高动态范围(HDR)格式记录,以保持辐射度准确性。利用光传输的线性特性,我们引入了(b) RoboLight-Synthetic,包含通过在RoboLight-Real的HDR图像空间中插值合成的196,000个回合。从原则上讲,RoboLight-Synthetic可以通过细化插值粒度任意扩展。我们进一步通过定性分析和真实世界策略执行(roll-outs)验证数据集的质量,分析任务难度、分布多样性和合成数据的有效性。此外,我们还展示了该数据集的三个代表性应用案例。完整的数据集以及系统软件和硬件设计将作为开源发布,以支持持续的研究。
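The linearity of light transport that underlies RoboLight-Synthetic can be shown directly: an HDR frame under a mixture of light settings is the weighted sum of frames captured under each individual setting. The toy 2x2 single-channel frames below are illustrative:

```python
import numpy as np

def interpolate_lighting(hdr_images, weights):
    """Synthesize an HDR frame under mixed lighting as a nonnegative
    weighted sum of frames captured under individual light settings,
    relying on the linearity of light transport."""
    imgs = np.stack([np.asarray(im, dtype=np.float64) for im in hdr_images])
    w = np.asarray(weights, dtype=np.float64)[:, None, None]
    return (w * imgs).sum(axis=0)

# toy 2x2 frames: one light illuminates the top-left, the other the bottom-right
red_on = np.array([[2.0, 0.0], [0.0, 0.0]])
blue_on = np.array([[0.0, 0.0], [0.0, 4.0]])
mixed = interpolate_lighting([red_on, blue_on], [0.5, 0.5])
```

This is also why HDR capture matters: the superposition holds for scene radiance, and tone-mapped 8-bit images would break the linearity the synthesis depends on.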
cs.RO / 46 / 2603.04277
VANGUARD: Vehicle-Anchored Ground Sample Distance Estimation for UAVs in GPS-Denied Environments
VANGUARD:在GPS缺失环境中基于车辆锚定的地面采样距离估计方法
Abstract
Autonomous aerial robots operating in GPS-denied or communication-degraded environments frequently lose access to camera metadata and telemetry, leaving onboard perception systems unable to recover the absolute metric scale of the scene. As LLM/VLM-based planners are increasingly adopted as high-level agents for embodied systems, their ability to reason about physical dimensions becomes safety-critical -- yet our experiments show that five state-of-the-art VLMs suffer from spatial scale hallucinations, with median area estimation errors exceeding 50%. We propose VANGUARD, a lightweight, deterministic Geometric Perception Skill designed as a callable tool that any LLM-based agent can invoke to recover Ground Sample Distance (GSD) from ubiquitous environmental anchors: small vehicles detected via oriented bounding boxes, whose modal pixel length is robustly estimated through kernel density estimation and converted to GSD using a pre-calibrated reference length. The tool returns both a GSD estimate and a composite confidence score, enabling the calling agent to autonomously decide whether to trust the measurement or fall back to alternative strategies. On the DOTA~v1.5 benchmark, VANGUARD achieves 6.87% median GSD error on 306~images. Integrated with SAM-based segmentation for downstream area measurement, the pipeline yields 19.7% median error on a 100-entry benchmark -- with 2.6x lower category dependence and 4x fewer catastrophic failures than the best VLM baseline -- demonstrating that equipping agents with deterministic geometric tools is essential for safe autonomous spatial reasoning.
Chinese Translation
在GPS缺失或通信受限的环境中运行的自主空中机器人常常失去对相机元数据和遥测数据的访问,这使得机载感知系统无法恢复场景的绝对度量尺度。随着基于大语言模型(LLM)/视觉语言模型(VLM)的规划者越来越多地被采用为具身系统的高级代理,其对物理尺寸进行推理的能力变得至关重要——然而我们的实验表明,五种最先进的VLM在空间尺度估计上存在幻觉,导致中位面积估计误差超过50%。我们提出了VANGUARD,这是一种轻量级、确定性的几何感知技能,旨在作为可调用工具,供任何基于LLM的代理调用,以从普遍存在的环境锚点中恢复地面采样距离(GSD):通过定向包围框检测的小型车辆,其模态像素长度通过核密度估计稳健地估算,并使用预校准的参考长度转换为GSD。该工具返回GSD估计值和复合置信度得分,使调用代理能够自主决定是否信任该测量或退回到替代策略。在DOTA~v1.5基准测试中,VANGUARD在306张图像上实现了6.87%的中位GSD误差。与基于SAM的分割集成用于下游区域测量时,该管道在100条条目的基准测试中产生了19.7%的中位误差——其类别依赖性降低了2.6倍,灾难性失败减少了4倍,相较于最佳VLM基线,证明为代理配备确定性几何工具对于安全的自主空间推理至关重要。
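The anchor-based GSD recovery can be sketched as follows: take the mode of a kernel density estimate over detected vehicle pixel lengths, then divide a reference real-world vehicle length by that modal pixel length. The Gaussian KDE with Silverman bandwidth and the 4.5 m reference length are illustrative assumptions, not the paper's calibration:

```python
import numpy as np

def gsd_from_vehicle_lengths(pixel_lengths, ref_length_m=4.5):
    """Estimate ground sample distance (metres/pixel) from vehicle
    detections: KDE mode of pixel lengths, then GSD = ref / mode."""
    x = np.asarray(pixel_lengths, dtype=float)
    # Silverman's rule-of-thumb bandwidth for a Gaussian kernel
    h = 1.06 * x.std() * len(x) ** (-1 / 5) + 1e-6
    grid = np.linspace(x.min(), x.max(), 512)
    kde = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2).sum(axis=1)
    mode_px = float(grid[np.argmax(kde)])
    return ref_length_m / mode_px, mode_px

# detections clustered near 45 px, plus two outliers (a truck, a partial box)
lengths = [44, 45, 46, 45, 44, 46, 45, 90, 20]
gsd, mode_px = gsd_from_vehicle_lengths(lengths)
```

Using the KDE mode rather than the mean is what gives robustness to outlier detections; a confidence score could then be derived from, e.g., sample count and KDE peak sharpness, as the abstract's composite score suggests.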
cs.RO / 47 / 2603.04284
OmniPlanner: Universal Exploration and Inspection Path Planning across Robot Morphologies
OmniPlanner:跨越机器人形态的通用探索与检查路径规划
Abstract
Autonomous robotic systems are increasingly deployed for mapping, monitoring, and inspection in complex and unstructured environments. However, most existing path planning approaches remain domain-specific (i.e., either on air, land, or sea), limiting their scalability and cross-platform applicability. This article presents OmniPlanner, a unified planning framework for autonomous exploration and inspection across aerial, ground, and underwater robots. The method integrates volumetric exploration and viewpoint-based inspection, alongside target reach behaviors within a single modular architecture, complemented by a platform abstraction layer that captures morphology-specific sensing, traversability and motion constraints. This enables the same planning strategy to generalize across distinct mobility domains with minimal retuning. The framework is validated through extensive simulation studies and field deployments in underground mines, industrial facilities, forests, submarine bunkers, and structured outdoor environments. Across these diverse scenarios, OmniPlanner demonstrates robust performance, consistent cross-domain generalization, and improved exploration and inspection efficiency compared to representative state-of-the-art baselines.
Chinese Translation
自主机器人系统越来越多地被部署用于复杂和非结构化环境中的地图绘制、监测和检查。然而,现有的大多数路径规划方法仍然是特定领域的(即仅限于空中、陆地或海洋),这限制了它们的可扩展性和跨平台适用性。本文提出了OmniPlanner,一个统一的规划框架,旨在支持空中、地面和水下机器人的自主探索与检查。该方法将体积探索和基于视点的检查整合在一个单一的模块化架构中,同时结合目标到达行为,并配备一个平台抽象层,以捕捉特定形态的传感、可通行性和运动约束。这使得相同的规划策略能够在不同的移动领域中以最小的重新调优进行推广。该框架通过在地下矿山、工业设施、森林、潜艇掩体和结构化户外环境中的广泛仿真研究和现场部署进行了验证。在这些多样化的场景中,OmniPlanner展示了强大的性能、一致的跨领域推广能力,以及相比于代表性的最先进基线的改进探索和检查效率。
cs.RO / 48 / 2603.04301
Compliant In-hand Rolling Manipulation Using Tactile Sensing
基于触觉感知的顺应性手内滚动操作
Abstract
We investigate in-hand rolling manipulation using a multifingered robot hand, where each finger is compliant and equipped with a tactile fingertip providing contact location and wrench information. We derive the equations of motion for compliant quasistatic in-hand rolling manipulation and formulate a fingertip rolling manipulation controller for multiple fingers to achieve a desired object twist within a grasp. The contact mechanics are demonstrated in simulation and the controller is tested on an experimental robot system.
Chinese Translation
我们研究了使用多指机器人手进行的手内滚动操作,其中每个手指都是顺应性的,并配备了提供接触位置和力旋量(wrench)信息的触觉指尖。我们推导了顺应性准静态手内滚动操作的运动方程,并为多个手指制定了指尖滚动操作控制器,以在抓握中实现期望的物体运动旋量(twist)。接触力学在仿真中得到了验证,控制器在实验机器人系统上进行了测试。
cs.RO / 49 / 2603.04305
Perception-Aware Time-Optimal Planning for Quadrotor Waypoint Flight
感知驱动的四旋翼航点飞行时间最优规划
Abstract
Agile quadrotor flight pushes the limits of control, actuation, and onboard perception. While time-optimal trajectory planning has been extensively studied, existing approaches typically neglect the tight coupling between vehicle dynamics, environmental geometry, and the visual requirements of onboard state estimation. As a result, trajectories that are dynamically feasible may fail in closed-loop execution due to degraded visual quality. This paper introduces a unified time-optimal trajectory optimization framework for vision-based quadrotors that explicitly incorporates perception constraints alongside full nonlinear dynamics, rotor actuation limits, aerodynamic effects, camera field-of-view constraints, and convex geometric gate representations. The proposed formulation solves minimum-time lap trajectories for arbitrary racetracks with diverse gate shapes and orientations, while remaining numerically robust and computationally efficient. We derive an information-theoretic position uncertainty metric to quantify visual state-estimation quality and integrate it into the planner through three perception objectives: position uncertainty minimization, sequential field-of-view constraints, and look-ahead alignment. This enables systematic exploration of the trade-offs between speed and perceptual reliability. To accurately track the resulting perception-aware trajectories, we develop a model predictive contouring tracking controller that separates lateral and progress errors. Experiments demonstrate real-world flight speeds up to 9.8 m/s with 0.07 m average tracking error, and closed-loop success rates improved from 55% to 100% on a challenging Split-S course. The proposed system provides a scalable benchmark for studying the fundamental limits of perception-aware, time-optimal autonomous flight.
Chinese Translation
灵活的四旋翼飞行推动了控制、驱动和机载感知的极限。尽管时间最优轨迹规划已被广泛研究,但现有方法通常忽视了飞行器动力学、环境几何和机载状态估计的视觉需求之间的紧密耦合。因此,动态可行的轨迹在闭环执行中可能因视觉质量下降而失败。本文提出了一种统一的时间最优轨迹优化框架,专为基于视觉的四旋翼设计,明确纳入感知约束,同时考虑全非线性动力学、旋翼驱动限制、气动效应、相机视场约束和凸几何门表示。所提出的公式能够为具有多样门形状和方向的任意赛道求解最小时间圈轨迹,同时保持数值稳健性和计算效率。我们推导了一种信息论位置不确定性度量,以量化视觉状态估计质量,并通过三个感知目标将其整合到规划器中:位置不确定性最小化、顺序视场约束和前瞻对齐。这使得系统地探索速度与感知可靠性之间的权衡成为可能。为了准确跟踪所得到的感知驱动轨迹,我们开发了一种模型预测轮廓跟踪控制器,分离横向和进展误差。实验表明,在具有挑战性的Split-S赛道上,实际飞行速度可达9.8 m/s,平均跟踪误差为0.07 m,闭环成功率从55%提高至100%。所提出的系统为研究感知驱动的时间最优自主飞行的基本极限提供了可扩展的基准。
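One of the perception constraints above, the camera field-of-view constraint, reduces to a cone-membership test between the camera's optical axis and the direction to the target (e.g., the next gate). This minimal sketch assumes a symmetric cone; the actual planner embeds such constraints in the trajectory optimization:

```python
import numpy as np

def in_fov(cam_pos, cam_dir, target, half_fov_rad):
    """True if the target lies within the camera's field-of-view cone,
    i.e., the angle between the optical axis and the camera-to-target
    vector is at most the half field-of-view angle."""
    v = np.asarray(target, float) - np.asarray(cam_pos, float)
    d = np.asarray(cam_dir, float)
    cosang = float(v @ d) / (np.linalg.norm(v) * np.linalg.norm(d) + 1e-12)
    return bool(np.arccos(np.clip(cosang, -1.0, 1.0)) <= half_fov_rad)

ahead = in_fov([0, 0, 0], [1, 0, 0], [5, 1, 0], np.deg2rad(45))    # ~11 deg off-axis
behind = in_fov([0, 0, 0], [1, 0, 0], [-5, 0, 0], np.deg2rad(45))  # directly behind
```

In an optimizer this boolean would be written as the smooth inequality cos(angle) >= cos(half_fov), which is differentiable in the camera pose.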
cs.RO / 50 / 2603.04329
Gaussian Mixture-Based Inverse Perception Contract for Uncertainty-Aware Robot Navigation
基于高斯混合的逆感知契约用于不确定性感知的机器人导航
Abstract
Reliable navigation in cluttered environments requires perception outputs that are not only accurate but also equipped with uncertainty sets suitable for safe control. An inverse perception contract (IPC) provides such a connection by mapping perceptual estimates to sets that contain the ground truth with high confidence. Existing IPC formulations, however, instantiate uncertainty as a single ellipsoidal set and rely on deterministic trust scores to guide robot motion. Such a representation cannot capture the multi-modal and irregular structure of fine-grained perception errors, often resulting in over-conservative sets and degraded navigation performance. In this work, we introduce Gaussian Mixture-based Inverse Perception Contract (GM-IPC), which extends IPC to represent uncertainty with unions of ellipsoidal confidence sets derived from Gaussian mixture models. This design moves beyond deterministic single-set abstractions, enabling fine-grained, multi-modal, and non-convex error structures to be captured with formal guarantees. A learning framework is presented that trains GM-IPC to account for probabilistic inclusion, distribution matching, and empty-space penalties, ensuring both validity and compactness of the predicted sets. We further show that the resulting uncertainty characterizations can be leveraged in downstream planning frameworks for real-time safe navigation, enabling less conservative and more adaptive robot motion while preserving safety in a probabilistic manner.
Chinese Translation
在杂乱环境中可靠的导航需要感知输出不仅准确,还需具备适合安全控制的不确定性集合。逆感知契约(IPC)通过将感知估计映射到高置信度包含真实情况的集合,提供了这样的连接。然而,现有的IPC公式将不确定性实例化为单一的椭球集合,并依赖于确定性的信任评分来指导机器人运动。这种表示无法捕捉细粒度感知误差的多模态和不规则结构,常常导致过于保守的集合和降低的导航性能。在本研究中,我们提出了基于高斯混合的逆感知契约(GM-IPC),它扩展了IPC,以通过从高斯混合模型中导出的椭球置信集合的并集来表示不确定性。该设计超越了确定性的单集合抽象,使得细粒度、多模态和非凸的误差结构能够在形式化保证下被捕捉。我们提出了一个学习框架,训练GM-IPC以考虑概率包含、分布匹配和空白空间惩罚,从而确保预测集合的有效性和紧凑性。我们进一步表明,所得到的不确定性表征可以在下游规划框架中用于实时安全导航,从而实现不那么保守且更具适应性的机器人运动,同时以概率方式保持安全性。
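The union-of-ellipsoids uncertainty set at the core of GM-IPC can be illustrated with a short membership test over Gaussian mixture components. This is a hedged sketch, not the paper's implementation: the mixture parameters, the 2-D setting, and the 95% chi-square threshold are illustrative assumptions.

```python
import numpy as np

# 95% chi-square quantile for 2 degrees of freedom (assumed confidence level)
CHI2_95_2DOF = 5.991

def in_union_of_ellipsoids(x, means, covs, thresh=CHI2_95_2DOF):
    """True if x lies in the union of per-component confidence ellipsoids.

    Each mixture component (mean, cov) induces the ellipsoid
    {x : (x - mean)^T cov^{-1} (x - mean) <= thresh}; the GM-IPC-style
    uncertainty set is the union over all components.
    """
    for mu, cov in zip(means, covs):
        d = x - mu
        if d @ np.linalg.solve(cov, d) <= thresh:
            return True
    return False

# Two-component mixture modelling a bimodal perception error (toy values)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [0.1 * np.eye(2), 0.2 * np.eye(2)]

print(in_union_of_ellipsoids(np.array([0.1, 0.0]), means, covs))  # near mode 1
print(in_union_of_ellipsoids(np.array([3.1, 2.9]), means, covs))  # near mode 2
print(in_union_of_ellipsoids(np.array([1.5, 1.5]), means, covs))  # between modes
```

The point between the two modes falls outside the set, showing how the union captures a non-convex error structure that a single ellipsoid would have to over-approximate.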
cs.RO / 51 / 2603.04351
Tendon Force Modeling for Sim2Real Transfer of Reinforcement Learning Policies for Tendon-Driven Robots
用于腱驱动机器人强化学习策略的Sim2Real转移的腱力建模
Abstract
Robots which make use of soft or compliant interactions often leverage tendon-driven actuation, which enables actuators to be placed more flexibly and compliance to be maintained. However, controlling complex tendon systems is challenging. Simulation paired with reinforcement learning (RL) could enable more complex behaviors to be generated. Such methods rely on torque- and force-based simulation rollouts which are limited by the sim-to-real gap, stemming from the actuator and system dynamics, resulting in poor transfer of RL policies onto real robots. To address this, we propose a method to model the tendon forces produced by typical servo motors, focusing specifically on the transfer of RL policies for a tendon-driven finger. Our approach extends existing data-driven techniques by leveraging contextual history and a novel data collection test-bench. This test-bench allows us to capture tendon forces undergoing contact-rich interactions typical of real-world manipulation. We then utilize our force estimation model in a GPU-accelerated tendon force-driven rigid body simulation to train RL-based controllers. Our transformer-based model is capable of predicting tendon forces within 3% of the maximum motor force and is robot-agnostic. By integrating our learned model into simulation, we reduce the sim-to-real gap for test trajectories by 41%. An RL-based controller trained with our model achieves a 50% improvement in fingertip pose tracking tasks on real tendon-driven robotic fingers. This approach is generalizable to different actuators and robot systems, and can enable RL policies to be used widely across tendon systems, advancing the capabilities of dexterous manipulators and soft robots.
Chinese Translation
利用软性或顺应性交互的机器人通常采用腱驱动的驱动方式,这使得驱动器能够更灵活地放置,并保持顺应性。然而,控制复杂的腱系统具有挑战性。结合强化学习(RL)的仿真可以生成更复杂的行为。这类方法依赖于基于扭矩和力的仿真展开,但由于驱动器和系统动态的影响,存在仿真与现实之间的差距,导致RL策略在真实机器人上的转移效果不佳。为了解决这个问题,我们提出了一种建模典型伺服电机产生的腱力的方法,特别关注腱驱动手指的RL策略转移。我们的方法通过利用上下文历史和新颖的数据收集测试平台,扩展了现有的数据驱动技术。该测试平台使我们能够捕捉在真实世界操作中典型的接触丰富交互下的腱力。随后,我们在一个GPU加速的腱力驱动刚体仿真中利用我们的力估计模型来训练基于RL的控制器。我们的基于变换器的模型能够在最大电机力的3%范围内预测腱力,并且与机器人无关。通过将我们学习到的模型集成到仿真中,我们将测试轨迹的仿真与现实差距减少了41%。使用我们模型训练的基于RL的控制器在真实腱驱动机器人手指的指尖姿态跟踪任务中实现了50%的改进。该方法具有普遍适用性,可以推广到不同的驱动器和机器人系统,使得RL策略能够广泛应用于腱系统,推动灵巧操控器和软机器人能力的发展。
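Two ingredients of the tendon-force model above lend themselves to a minimal sketch: the contextual-history input windows and the reported accuracy criterion (worst-case error within 3% of the maximum motor force). The window length, zero-padding, and tolerance check below are assumptions for illustration, not the paper's design.

```python
import numpy as np

def context_windows(signal, history=3):
    """Stack each timestep with its previous samples (zero-padded at the
    start) so a force model can condition on actuation history."""
    padded = np.concatenate([np.zeros(history - 1), signal])
    return np.stack([padded[i:i + history] for i in range(len(signal))])

def within_force_tolerance(pred, true, f_max, tol=0.03):
    """Check the reported criterion: worst-case prediction error below
    `tol` of the maximum motor force."""
    return np.max(np.abs(pred - true)) <= tol * f_max

cmds = np.array([0.0, 0.2, 0.5, 0.9, 1.0])  # toy motor command stream
X = context_windows(cmds)
print(X.shape)  # one history window per timestep
```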
cs.RO / 52 / 2603.04352
A Soft Robotic Demonstration in the Stratosphere
平流层中的软体机器人演示
Abstract
Machines designed for operation in Space, as well as other extreme environments, need to be both resilient and adaptable when mission parameters change. Soft robots offer advantages in adaptability, but most lack resilience to the pressure and temperature extremes found as close as the Stratosphere. Dielectric elastomer actuators overcome some of those limitations when built as solid state compliant capacitors capable of converting electrical energy into mechanical work, but the elastomer resilience limits the device's operating window. Here we present a crosslinking mechanism for silicone elastomers under ultraviolet light using trimethyl(methylcyclopentadienyl)platinum(IV) as a catalyst to react hydrosilane with vinyl groups. The formation of carbon-carbon bonds enables fast processing under UV light and exceptional electro-mechanical performance in dielectric elastomer actuators. The material resilience advantage is demonstrated in controlled experiments at -40 °C and 120 °C, as well as near vacuum, in comparison with state-of-the-art acrylic and silicone chemistries. Fully autonomous systems controlling grippers made with the novel silicone were integrated into payloads for high altitude balloon testing. Two stratospheric balloon missions were carried out and demonstrated DEAs as a viable soft robotic technology under space-like conditions (as high as 23.6 km elevation, at <0.05 atm and -55 °C). The combinations of chemical building blocks and catalyst can be further expanded to address other challenges for silicones, including adhesion and additive manufacturing.
Chinese Translation
为在太空及其他极端环境中操作而设计的机器需要在任务参数变化时具备韧性和适应性。软体机器人在适应性方面具有优势,但大多数在接近平流层的压力和温度极限下缺乏韧性。介电弹性体驱动器作为固态柔性电容器,能够将电能转化为机械功,从而克服了一些限制,但弹性体的韧性限制了设备的工作范围。在此,我们提出了一种在紫外光下对硅橡胶进行交联的机制,使用三甲基(甲基环戊二烯基)铂(IV)作为催化剂,使氢硅烷与乙烯基基团反应。碳-碳键的形成使得在紫外光下能够快速加工,并在介电弹性体驱动器中实现卓越的电机械性能。通过与最先进的丙烯酸和硅橡胶化学相比,在-40°C和120°C的控制实验以及近真空条件下展示了材料韧性的优势。使用新型硅橡胶制造的抓手的全自动系统被集成到高空气球测试的有效载荷中。进行了两次平流层气球任务,证明了介电弹性体驱动器在类似太空条件下(高达23.6公里的高度,低于0.05大气压和-55°C)作为一种可行的软体机器人技术。化学构建块和催化剂的组合可以进一步扩展,以解决硅橡胶的其他挑战,包括粘附性和增材制造。
cs.RO / 53 / 2603.04356
RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots
RoboCasa365:一个用于训练和基准测试通用机器人的大规模仿真框架
Abstract
Recent advances in robot learning have accelerated progress toward generalist robots that can perform everyday tasks in human environments. Yet it remains difficult to gauge how close we are to this vision. The field lacks a reproducible, large-scale benchmark for systematic evaluation. To fill this gap, we present RoboCasa365, a comprehensive simulation benchmark for household mobile manipulation. Built on the RoboCasa platform, RoboCasa365 introduces 365 everyday tasks across 2,500 diverse kitchen environments, with over 600 hours of human demonstration data and over 1,600 hours of synthetically generated demonstration data -- making it one of the most diverse and large-scale resources for studying generalist policies. RoboCasa365 is designed to support systematic evaluations for different problem settings, including multi-task learning, robot foundation model training, and lifelong learning. We conduct extensive experiments on this benchmark with state-of-the-art methods and analyze the impacts of task diversity, dataset scale, and environment variation on generalization. Our results provide new insights into what factors most strongly affect the performance of generalist robots and inform strategies for future progress in the field.
Chinese Translation
近年来,机器人学习的进展加速了朝着能够在人类环境中执行日常任务的通用机器人这一愿景的推进。然而,评估我们距离这一愿景还有多远仍然困难重重。该领域缺乏可重复的大规模基准以进行系统评估。为填补这一空白,我们提出了RoboCasa365,这是一个全面的家庭移动操控仿真基准。RoboCasa365建立在RoboCasa平台之上,涵盖了2,500个多样化厨房环境中的365个日常任务,拥有超过600小时的人类演示数据和超过1,600小时的合成生成演示数据,使其成为研究通用策略的最丰富和大规模的资源之一。RoboCasa365旨在支持不同问题设置的系统评估,包括多任务学习、机器人基础模型训练和终身学习。我们在该基准上使用最先进的方法进行了广泛实验,并分析了任务多样性、数据集规模和环境变化对泛化能力的影响。我们的结果提供了有关哪些因素最强烈影响通用机器人性能的新见解,并为该领域未来的进展提供了策略建议。
cs.RO / 54 / 2603.04363
ManipulationNet: An Infrastructure for Benchmarking Real-World Robot Manipulation with Physical Skill Challenges and Embodied Multimodal Reasoning
ManipulationNet:用于基准测试现实世界机器人操控的基础设施,包含物理技能挑战和具身多模态推理
Chen, Yiting, Kimble, Kenneth, Adelson, Edward H., Asfour, Tamim, Chanrungmaneekul, Podshara, Chitta, Sachin, Chitambar, Yash, Chen, Ziyang, Goldberg, Ken, Kragic, Danica, Li, Hui, Li, Xiang, Li, Yunzhu, Prather, Aaron, Pollard, Nancy, Roa-Garzon, Maximo A., Seney, Robert, Sha, Shuo, Wang, Shihefeng, Xiang, Yu, Zhang, Kaifeng, Zhu, Yuke, Hang, Kaiyu
Abstract
Dexterous manipulation enables robots to purposefully alter the physical world, transforming them from passive observers into active agents in unstructured environments. This capability is the cornerstone of physical artificial intelligence. Despite decades of advances in hardware, perception, control, and learning, progress toward general manipulation systems remains fragmented due to the absence of widely adopted standard benchmarks. The central challenge lies in reconciling the variability of the real world with the reproducibility and authenticity required for rigorous scientific evaluation. To address this, we introduce ManipulationNet, a global infrastructure that hosts real-world benchmark tasks for robotic manipulation. ManipulationNet delivers reproducible task setups through standardized hardware kits, and enables distributed performance evaluation via a unified software client that delivers real-time task instructions and collects benchmarking results. As a persistent and scalable infrastructure, ManipulationNet organizes benchmark tasks into two complementary tracks: 1) the Physical Skills Track, which evaluates low-level physical interaction skills, and 2) the Embodied Reasoning Track, which tests high-level reasoning and multimodal grounding abilities. This design fosters the systematic growth of an interconnected network of real-world abilities and skills, paving the path toward general robotic manipulation. By enabling comparable manipulation research in the real world at scale, this infrastructure establishes a sustainable foundation for measuring long-term scientific progress and identifying capabilities ready for real-world deployment.
Chinese Translation
灵巧操控使机器人能够有目的地改变物理世界,使其从被动观察者转变为非结构化环境中的主动参与者。这一能力是物理人工智能的基石。尽管在硬件、感知、控制和学习方面取得了数十年的进展,但由于缺乏广泛采用的标准基准,通用操控系统的进展仍然支离破碎。核心挑战在于调和现实世界的变异性与严格科学评估所需的可重复性和真实性。为了解决这一问题,我们引入了ManipulationNet,一个全球基础设施,承载机器人操控的现实世界基准任务。ManipulationNet通过标准化的硬件套件提供可重复的任务设置,并通过统一的软件客户端实现分布式性能评估,该客户端提供实时任务指令并收集基准测试结果。作为一个持久且可扩展的基础设施,ManipulationNet将基准任务组织为两个互补的轨道:1)物理技能轨道,评估低级物理交互技能;2)具身推理轨道,测试高级推理和多模态基础能力。这一设计促进了现实世界能力和技能相互关联网络的系统性增长,为通用机器人操控铺平了道路。通过在现实世界中大规模实现可比较的操控研究,该基础设施为衡量长期科学进展和识别准备好进行现实世界部署的能力奠定了可持续的基础。
cs.CV / 1 / 2603.03418
mHC-HSI: Clustering-Guided Hyper-Connection Mamba for Hyperspectral Image Classification
mHC-HSI:基于聚类引导的超连接Mamba用于高光谱图像分类
Abstract
Recently, DeepSeek has invented the manifold-constrained hyper-connection (mHC) approach which has demonstrated significant improvements over the traditional residual connection in deep learning models \cite{xie2026mhc}. Nevertheless, this approach has not been tailor-designed for improving hyperspectral image (HSI) classification. This paper presents a clustering-guided mHC Mamba model (mHC-HSI) for enhanced HSI classification, with the following contributions. First, to improve spatial-spectral feature learning, we design a novel clustering-guided Mamba module, based on the mHC framework, that explicitly learns both spatial and spectral information in HSI. Second, to decompose the complex and heterogeneous HSI into smaller clusters, we design a new implementation of the residual matrix in mHC, which can be treated as soft cluster membership maps, leading to improved explainability of the mHC approach. Third, to leverage the physical spectral knowledge, we divide the spectral bands into physically-meaningful groups and use them as the "parallel streams" in mHC, leading to a physically-meaningful approach with enhanced interpretability. The proposed approach is tested on benchmark datasets in comparison with the state-of-the-art methods, and the results suggest that the proposed model not only improves the accuracy but also enhances the model explainability. Code is available here: https://github.com/GSIL-UCalgary/mHC_HyperSpectral
Chinese Translation
最近,DeepSeek提出了流形约束超连接(mHC)方法,该方法在深度学习模型中相较于传统的残差连接显示出显著的改进 [xie2026mhc]。然而,该方法尚未专门设计用于改善高光谱图像(HSI)分类。本文提出了一种基于聚类引导的mHC Mamba模型(mHC-HSI),以增强HSI分类,具有以下贡献。首先,为了改善空间-光谱特征学习,我们设计了一种新颖的聚类引导Mamba模块,基于mHC框架,明确学习HSI中的空间和光谱信息。其次,为了将复杂且异质的HSI分解为更小的聚类,我们设计了mHC中残差矩阵的新实现,该实现可以视为软聚类成员资格图,从而提高了mHC方法的可解释性。第三,为了利用物理光谱知识,我们将光谱波段划分为具有物理意义的组,并将其作为mHC中的“并行流”,从而形成一种具有物理意义的增强可解释性的方法。所提出的方法在基准数据集上与最先进的方法进行了比较,结果表明所提出的模型不仅提高了准确性,还增强了模型的可解释性。代码可在此获取:https://github.com/GSIL-UCalgary/mHC_HyperSpectral
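The abstract's reading of the mHC residual matrix as soft cluster membership maps can be sketched as a row-wise softmax that mixes per-stream (spectral-group) features. The shapes and the softmax parameterization below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def soft_membership(logits):
    """Row-wise softmax: each pixel gets a probability distribution over
    clusters / parallel streams (interpretable as soft membership maps)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# 4 pixels, 3 spectral-group streams (hypothetical sizes)
logits = np.array([[2.0, 0.1, 0.1],
                   [0.1, 2.0, 0.1],
                   [0.1, 0.1, 2.0],
                   [1.0, 1.0, 1.0]])
M = soft_membership(logits)                # soft cluster membership maps
streams = np.array([[1.0], [2.0], [3.0]])  # one scalar feature per stream
fused = M @ streams                        # membership-weighted recombination
print(np.allclose(M.sum(axis=1), 1.0))     # rows are valid soft assignments
```

The last pixel has uniform membership, so its fused feature is the plain average of the three streams, while confident pixels are dominated by a single stream.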
cs.CV / 2 / 2603.03437
Beyond Accuracy: Evaluating Visual Grounding In Multimodal Medical Reasoning
超越准确性:多模态医学推理中的视觉基础评估
Abstract
Recent work shows that text-only reinforcement learning with verifiable rewards (RLVR) can match or outperform image-text RLVR on multimodal medical VQA benchmarks, suggesting current evaluation protocols may fail to measure causal visual dependence. We introduce a counterfactual evaluation framework using real, blank, and shuffled images across four medical VQA benchmarks: PathVQA, PMC-VQA, SLAKE, and VQA-RAD. Beyond accuracy, we measure Visual Reliance Score (VRS), Image Sensitivity (IS), and introduce Hallucinated Visual Reasoning Rate (HVRR) to detect cases where models generate visual claims despite producing image-invariant answers. Our findings reveal that RLVR improves accuracy while degrading visual grounding: text-only RLVR achieves negative VRS on PathVQA (-0.09), performing better with mismatched images, while image-text RLVR reduces image sensitivity to 39.8% overall despite improving accuracy. On VQA-RAD, both variants achieve 63% accuracy through different mechanisms: text-only RLVR retains 81% performance with blank images, while image-text RLVR shows only 29% image sensitivity. Models generate visual claims in 68-74% of responses, yet 38-43% are ungrounded (HVRR). These findings demonstrate that accuracy-only rewards enable shortcut exploitation, and progress requires grounding-aware evaluation protocols and training objectives that explicitly enforce visual dependence.
Chinese Translation
近期研究表明,使用可验证奖励的纯文本强化学习(RLVR)在多模态医学视觉问答(VQA)基准测试中可以与图像-文本RLVR相当甚至更优,这表明当前的评估协议可能未能测量因果性的视觉依赖。我们引入了一种反事实评估框架,使用真实、空白和打乱的图像,针对四个医学VQA基准:PathVQA、PMC-VQA、SLAKE和VQA-RAD。除了准确性外,我们还测量了视觉依赖评分(VRS)、图像敏感性(IS),并引入了幻觉视觉推理率(HVRR),以检测模型在给出图像不变答案的情况下仍然生成视觉声明的情况。我们的研究结果显示,RLVR提高了准确性,但降低了视觉基础:纯文本RLVR在PathVQA上达到了负VRS(-0.09),在图像不匹配的情况下表现更好,而图像-文本RLVR尽管提高了准确性,但整体图像敏感性降低至39.8%。在VQA-RAD上,两种变体通过不同机制达到了63%的准确性:纯文本RLVR在空白图像上保持了81%的性能,而图像-文本RLVR的图像敏感性仅为29%。模型在68-74%的回答中生成视觉声明,但其中38-43%缺乏视觉依据(HVRR)。这些发现表明,仅依赖准确性的奖励会导致捷径利用,要取得进展,需要关注视觉基础的评估协议和明确强制视觉依赖性的训练目标。
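The three metrics can be instantiated in a few lines. The formulas below (answer-change rate for IS, plain accuracy difference for VRS, claim-conditioned invariance rate for HVRR) are plausible reconstructions from the abstract, not the paper's exact definitions.

```python
def image_sensitivity(ans_real, ans_blank):
    """IS: fraction of questions whose answer changes when the image is blanked."""
    return sum(a != b for a, b in zip(ans_real, ans_blank)) / len(ans_real)

def visual_reliance_score(acc_real, acc_shuffled):
    """VRS: accuracy drop under mismatched images; negative means the model
    does *better* with the wrong image."""
    return acc_real - acc_shuffled

def hallucinated_visual_reasoning_rate(makes_visual_claim, answer_invariant):
    """HVRR: among responses asserting visual evidence, the share whose
    answer does not actually change across image conditions."""
    flags = [inv for claim, inv in zip(makes_visual_claim, answer_invariant) if claim]
    return sum(flags) / len(flags) if flags else 0.0

print(image_sensitivity(["a", "b", "c", "d"], ["a", "b", "x", "y"]))  # 0.5
```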
cs.CV / 3 / 2603.03447
Proact-VL: A Proactive VideoLLM for Real-Time AI Companions
Proact-VL:一种用于实时AI伴侣的主动视频语言模型
Abstract
Proactive and real-time interactive experiences are essential for human-like AI companions, yet face three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both quality and quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming scenarios, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset with three representative scenarios: solo commentary, co-commentary, and user guidance, and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.
Chinese Translation
主动和实时的交互体验对于类人AI伴侣至关重要,但面临三个主要挑战:(1)在连续流输入下实现低延迟推理,(2)自主决定何时响应,以及(3)控制生成内容的质量和数量以满足实时约束。在本研究中,我们通过两个游戏场景(评论员和引导者)实例化AI伴侣,这些场景因其适合自动评估而被选中。我们引入了实时游戏基准(Live Gaming Benchmark),这是一个大型数据集,包含三个代表性场景:单人评论、共同评论和用户引导,并提出了Proact-VL,一个将多模态语言模型塑造成主动、实时交互代理的通用框架,能够实现类人环境感知和交互。大量实验表明,Proact-VL在响应延迟和质量方面表现优越,同时保持强大的视频理解能力,证明了其在实时交互应用中的实用性。
cs.CV / 4 / 2603.03482
Beyond Pixel Histories: World Models with Persistent 3D State
超越像素历史:具有持久3D状态的世界模型
Abstract
Interactive world models continually generate video by responding to a user's actions, enabling open-ended generation capabilities. However, existing models typically lack a 3D representation of the environment, meaning 3D consistency must be implicitly learned from data, and spatial memory is restricted to limited temporal context windows. This results in an unrealistic user experience and presents significant obstacles to downstream tasks such as training agents. To address this, we present PERSIST, a new paradigm of world model which simulates the evolution of a latent 3D scene: environment, camera, and renderer. This allows us to synthesize new frames with persistent spatial memory and consistent geometry. Both quantitative metrics and a qualitative user study show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods, enabling coherent, evolving 3D worlds. We further demonstrate novel capabilities, including synthesising diverse 3D environments from a single image, as well as enabling fine-grained, geometry-aware control over generated experiences by supporting environment editing and specification directly in 3D space. Project page: https://francelico.github.io/persist.github.io
Chinese Translation
交互式世界模型通过响应用户的行为不断生成视频,从而实现开放式生成能力。然而,现有模型通常缺乏环境的3D表示,这意味着3D一致性必须从数据中隐式学习,空间记忆受到有限时间上下文窗口的限制。这导致了不现实的用户体验,并对下游任务(如训练智能体)构成了重大障碍。为了解决这个问题,我们提出了PERSIST,这是一种新的世界模型范式,模拟潜在3D场景的演变:环境、相机和渲染器。这使我们能够合成具有持久空间记忆和一致几何形状的新帧。定量指标和定性用户研究均显示,在空间记忆、3D一致性和长时间稳定性方面,相较于现有方法有显著改善,从而实现连贯且不断演变的3D世界。我们进一步展示了新颖的能力,包括从单一图像合成多样化的3D环境,以及通过直接在3D空间中支持环境编辑和规范,来实现对生成体验的细粒度、几何感知控制。项目页面:https://francelico.github.io/persist.github.io
cs.CV / 5 / 2603.03485
Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion
Phys4D:基于视频扩散的细粒度物理一致性四维建模
Abstract
Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present \textbf{Phys4D}, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts \textbf{a three-stage training paradigm} that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of \textbf{4D world consistency evaluation} that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/
Chinese Translation
近期的视频扩散模型在大规模生成世界模型方面取得了令人瞩目的能力。然而,这些模型在细粒度物理一致性方面常常面临挑战,表现出随时间变化的物理不合理动态。在本研究中,我们提出了Phys4D,一个从视频扩散模型中学习物理一致性四维世界表示的流程。Phys4D采用三阶段训练范式,逐步将以外观驱动的视频扩散模型提升为物理一致的四维世界表示。我们首先通过大规模伪监督预训练启动稳健的几何和运动表示,为四维场景建模奠定基础。然后,我们使用模拟生成的数据进行基于物理的监督微调,强制执行时间一致的四维动态。最后,我们应用基于模拟的强化学习来纠正难以通过显式监督捕捉的残余物理违规。为了评估超越基于外观的指标的细粒度物理一致性,我们引入了一组四维世界一致性评估,探测几何一致性、运动稳定性和长期物理合理性。实验结果表明,与以外观驱动的基线相比,Phys4D在细粒度时空和物理一致性方面显著提升,同时保持强大的生成性能。我们的项目页面可访问 https://sensational-brioche-7657e7.netlify.app/
cs.CV / 6 / 2603.03503
Geographically-Weighted Weakly Supervised Bayesian High-Resolution Transformer for 200m Resolution Pan-Arctic Sea Ice Concentration Mapping and Uncertainty Estimation using Sentinel-1, RCM, and AMSR2 Data
基于地理加权的弱监督贝叶斯高分辨率变换器用于200米分辨率的泛北极海冰浓度制图及不确定性估计,利用Sentinel-1、RCM和AMSR2数据
Abstract
Although high-resolution mapping of pan-Arctic sea ice with reliable corresponding uncertainty is essential for operational sea ice concentration (SIC) charting, it is a difficult task due to key challenges, such as the subtle nature of ice signature features, inexact SIC labels, model uncertainty, and data heterogeneity. This study presents a novel Bayesian High-Resolution Transformer approach for 200 meter resolution pan-Arctic SIC mapping and uncertainty quantification using Sentinel-1, RADARSAT Constellation Mission (RCM), and Advanced Microwave Scanning Radiometer 2 (AMSR2) data. First, to improve small and subtle sea ice feature (e.g., cracks/leads, ponds, and ice floes) extraction, we design a novel high-resolution Transformer model with both global and local modules that can better discern the subtle differences in sea ice patterns. Second, to address low-resolution and inexact SIC labels, we design a geographically-weighted weakly supervised loss function to supervise the model at region level instead of pixel level, and to prioritize pure open water and ice pack signatures while mitigating the impact of ambiguity in the marginal ice zone (MIZ). Third, to improve uncertainty quantification, we design a Bayesian extension of the proposed Transformer model, treating its parameters as random variables to more effectively capture uncertainties. Fourth, to address data heterogeneity, we fuse three different data types (Sentinel-1, RCM, and AMSR2) at decision-level to improve both SIC mapping and uncertainty quantification. The proposed approach is evaluated under pan-Arctic minimum-extent conditions in 2021 and 2025. Results demonstrate that the proposed model achieves 0.70 overall feature detection accuracy using Sentinel-1 data, while also preserving pan-Arctic SIC patterns (Sentinel-1 R\textsuperscript{2} = 0.90 relative to the ARTIST Sea Ice product).
Chinese Translation
尽管高分辨率的泛北极海冰制图及其可靠的不确定性评估对于海冰浓度(SIC)图表的操作至关重要,但由于冰层特征的微妙性、不准确的SIC标签、模型不确定性和数据异质性等关键挑战,这一任务十分困难。本研究提出了一种新颖的贝叶斯高分辨率变换器方法,利用Sentinel-1、RADARSAT Constellation Mission(RCM)和Advanced Microwave Scanning Radiometer 2(AMSR2)数据进行200米分辨率的泛北极SIC制图和不确定性量化。首先,为了改善小型和微妙海冰特征(如裂缝/冰间水道、融池和浮冰)的提取,我们设计了一种新型的高分辨率变换器模型,该模型具有全局和局部模块,能够更好地区分海冰模式中的微妙差异。其次,为了解决低分辨率和不准确的SIC标签问题,我们设计了一种地理加权的弱监督损失函数,以区域级而非像素级来监督模型,并优先考虑纯开阔水域和密集冰区特征,同时减轻边缘冰区(MIZ)中模糊性的影响。第三,为了改善不确定性量化,我们设计了所提变换器模型的贝叶斯扩展,将其参数视为随机变量,以更有效地捕捉不确定性。第四,为了解决数据异质性问题,我们在决策层面融合三种不同的数据类型(Sentinel-1、RCM和AMSR2),以改善SIC制图和不确定性量化。所提方法在2021年和2025年的泛北极最低范围条件下进行了评估。结果表明,所提模型在使用Sentinel-1数据时实现了0.70的整体特征检测准确率,同时保留了泛北极的SIC模式(Sentinel-1 R² = 0.90,相对于ARTIST海冰产品)。
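The geographically-weighted, region-level weak supervision can be sketched as a weighted squared error between a region's mean predicted SIC and its coarse label, with a purity term that up-weights near-pure open water and ice pack and down-weights the ambiguous marginal ice zone. The functional form below is a hypothetical reconstruction, not the paper's loss.

```python
import numpy as np

def region_weak_loss(pred_sic, region_label, geo_weight, purity_gamma=2.0):
    """Region-level weakly supervised loss (sketch).

    pred_sic:     per-pixel SIC predictions in [0, 1] for one region.
    region_label: the region's coarse SIC label in [0, 1].
    geo_weight:   geographic weight assigned to this region.
    The purity term is 1 for pure open water (label ~0) or ice pack
    (label ~1) and 0 in the marginal ice zone (label ~0.5).
    """
    purity = abs(region_label - 0.5) * 2.0
    weight = geo_weight * (purity ** purity_gamma)
    return weight * (pred_sic.mean() - region_label) ** 2

# Pure ice-pack region: fully weighted; MIZ region: fully down-weighted
print(region_weak_loss(np.array([0.8, 0.8]), region_label=1.0, geo_weight=1.0))
print(region_weak_loss(np.array([0.5, 0.5]), region_label=0.5, geo_weight=1.0))
```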
cs.CV / 7 / 2603.03505
PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation
PhyPrompt:基于强化学习的物理可信文本到视频生成提示优化
Abstract
State-of-the-art text-to-video (T2V) generators frequently violate physical laws despite high visual quality. We show this stems from insufficient physical constraints in prompts rather than model limitations: manually adding physics details reliably produces physically plausible videos, but requires expertise and does not scale. We present PhyPrompt, a two-stage reinforcement learning framework that automatically refines prompts for physically realistic generation. First, we fine-tune a large language model on a physics-focused Chain-of-Thought dataset to integrate principles like object motion and force interactions while preserving user intent. Second, we apply Group Relative Policy Optimization with a dynamic reward curriculum that initially prioritizes semantic fidelity, then progressively shifts toward physical commonsense. This curriculum achieves synergistic optimization: PhyPrompt-7B reaches 40.8\% joint success on VideoPhy2 (8.6pp gain), improving physical commonsense by 11pp (55.8\% to 66.8\%) while simultaneously increasing semantic adherence by 4.4pp (43.4\% to 47.8\%). Remarkably, our curriculum exceeds single-objective training on both metrics, demonstrating compositional prompt discovery beyond conventional multi-objective trade-offs. PhyPrompt outperforms GPT-4o (+3.8\% joint) and DeepSeek-V3 (+2.2\%, 100$\times$ larger) using only 7B parameters. The approach transfers zero-shot across diverse T2V architectures (Lavie, VideoCrafter2, CogVideoX-5B) with up to 16.8\% improvement, establishing that domain-specialized reinforcement learning with compositional curricula surpasses general-purpose scaling for physics-aware generation.
Chinese Translation
尽管当前最先进的文本到视频(T2V)生成器在视觉质量上表现出色,但常常违反物理法则。我们表明,这一问题源于提示中物理约束不足,而非模型限制:手动添加物理细节能够可靠地产生物理可信的视频,但需要专业知识且难以扩展。我们提出了PhyPrompt,一个两阶段的强化学习框架,自动优化提示以实现物理真实的生成。首先,我们在一个以物理为重点的链式思维(Chain-of-Thought)数据集上微调大型语言模型,以整合物体运动和力相互作用等原则,同时保留用户意图。其次,我们应用了群体相对策略优化(Group Relative Policy Optimization),采用动态奖励课程,最初优先考虑语义保真度,然后逐步转向物理常识。该课程实现了协同优化:PhyPrompt-7B在VideoPhy2上达到了40.8%的联合成功率(提升8.6个百分点),物理常识提高了11个百分点(从55.8%提升至66.8%),同时语义遵循度也提高了4.4个百分点(从43.4%提升至47.8%)。值得注意的是,我们的课程在这两个指标上均超过了单目标训练,展示了超越传统多目标权衡的组合提示发现能力。PhyPrompt在仅使用7B参数的情况下,超越了GPT-4o(+3.8%联合成功率)和DeepSeek-V3(+2.2%,规模大100倍)。该方法在多种T2V架构(Lavie, VideoCrafter2, CogVideoX-5B)上实现了零样本迁移,提升幅度高达16.8%,证明了具有组合课程的领域专用强化学习在物理感知生成方面超越了通用扩展能力。
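The dynamic reward curriculum can be sketched as a schedule that interpolates the weights of the two reward terms over training, starting semantics-heavy and ending physics-heavy. The linear schedule and the 0.9 starting weight are illustrative assumptions, not values from the paper.

```python
def curriculum_weights(step, total_steps, start_semantic=0.9):
    """Dynamic reward curriculum (sketch): linearly shift weight from
    semantic fidelity toward physical commonsense over training."""
    t = min(step / total_steps, 1.0)
    w_sem = start_semantic * (1.0 - t) + (1.0 - start_semantic) * t
    return w_sem, 1.0 - w_sem

def reward(r_sem, r_phys, step, total_steps):
    """Combined scalar reward under the current curriculum weights."""
    w_sem, w_phys = curriculum_weights(step, total_steps)
    return w_sem * r_sem + w_phys * r_phys

print(curriculum_weights(0, 100))    # semantics-dominated at the start
print(curriculum_weights(100, 100))  # physics-dominated at the end
```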
cs.CV / 8 / 2603.03544
PinCLIP: Large-scale Foundational Multimodal Representation at Pinterest
PinCLIP:Pinterest的大规模基础多模态表示
Abstract
While multi-modal Visual Language Models (VLMs) have demonstrated significant success across various domains, the integration of VLMs into recommendation and retrieval systems remains a challenge, due to issues like training objective discrepancies and serving efficiency bottlenecks. This paper introduces PinCLIP, a large-scale visual representation learning approach developed to enhance retrieval and ranking models at Pinterest by leveraging VLMs to learn image-text alignment. We propose a novel hybrid Vision Transformer architecture that utilizes a VLM backbone and a hybrid fusion mechanism to capture multi-modality content representation at varying granularities. Beyond standard image-to-text alignment objectives, we introduce a neighbor alignment objective to model the cross-fusion of multi-modal representations within the Pinterest Pin-Board graph. Offline evaluations show that PinCLIP outperforms state-of-the-art baselines, such as Qwen, by 20% in multi-modal retrieval tasks. Online A/B testing demonstrates significant business impact, including substantial engagement gains across all major surfaces in Pinterest. Notably, PinCLIP significantly addresses the "cold-start" problem, enhancing fresh content distribution with a 15% Repin increase for organic content and an 8.7% increase in clicks for new Ads.
Chinese Translation
尽管多模态视觉语言模型(VLMs)在多个领域取得了显著成功,但将VLMs整合到推荐和检索系统中仍然面临挑战,主要由于训练目标的不一致性和服务效率瓶颈。本文介绍了PinCLIP,一种大规模视觉表示学习方法,旨在通过利用VLMs学习图像-文本对齐来增强Pinterest的检索和排序模型。我们提出了一种新颖的混合视觉变换器架构,该架构利用VLM骨干网络和混合融合机制,以捕捉不同粒度的多模态内容表示。除了标准的图像-文本对齐目标外,我们还引入了一种邻居对齐目标,以建模Pinterest Pin-Board图中的多模态表示的交叉融合。离线评估表明,PinCLIP在多模态检索任务中比最先进的基线(如Qwen)提高了20%。在线A/B测试显示了显著的商业影响,包括Pinterest所有主要界面上的参与度显著提升。值得注意的是,PinCLIP显著解决了“冷启动”问题,提升了新内容的分发,带来了15%的有机内容Repin增加和8.7%的新广告点击率提升。
cs.CV / 9 / 2603.03564
Modeling Cross-vision Synergy for Unified Large Vision Model
统一大型视觉模型的跨视觉协同建模
Abstract
Recent advances in large vision models (LVMs) have shifted from modality-specific designs toward unified architectures that jointly process images, videos, and 3D data. However, existing unified LVMs primarily pursue functional integration, while overlooking the deeper goal of cross-vision synergy: the ability to reason over complementary priors across visual modalities. To address this, we present PolyV, a unified LVM that achieves cross-vision synergy at both the architectural and training levels. Architecturally, PolyV adopts a sparse Mixture-of-Experts LVM coordinated by a dynamic modality router, allowing each expert to specialize in modality-specific priors while enabling bidirectional interaction and mutual refinement across modalities. Training-wise, a synergy-aware paradigm combines modality-specific pretraining with coarse-to-fine synergy tuning via knowledge distillation and object-/relation-level alignment. Extensive experiments on 10 benchmarks spanning image, video, and 3D understanding, including synergy-focused datasets requiring spatial or temporal priors, demonstrate that PolyV consistently outperforms existing models, achieving over 10% average improvement over its backbone. Overall, PolyV establishes a unified framework for synesthetic visual reasoning, advancing toward truly synergistic LVMs. Project page: https://sqwu.top/PolyV.
Chinese Translation
近期大型视觉模型(LVMs)的进展已从特定模态的设计转向统一架构,能够共同处理图像、视频和3D数据。然而,现有的统一LVMs主要追求功能整合,而忽视了跨视觉协同的更深层目标:在视觉模态之间推理互补先验的能力。为了解决这一问题,我们提出了PolyV,这是一种在架构和训练层面上实现跨视觉协同的统一LVM。在架构上,PolyV采用了由动态模态路由器协调的稀疏专家混合LVM,使每个专家能够专注于特定模态的先验,同时实现模态之间的双向交互和相互精炼。在训练方面,协同感知范式结合了模态特定的预训练与通过知识蒸馏和对象/关系级别对齐进行的粗到细的协同调优。在涵盖图像、视频和3D理解的10个基准测试中进行的广泛实验,包括需要空间或时间先验的协同聚焦数据集,证明PolyV始终优于现有模型,其性能在基准上平均提高超过10%。总体而言,PolyV建立了一个用于联觉视觉推理的统一框架,朝着真正协同的LVMs迈进。项目页面:https://sqwu.top/PolyV。
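The dynamic modality router coordinating PolyV's sparse Mixture-of-Experts can be sketched as a softmax gate that keeps the top-k experts per token and renormalizes their weights. The expert count, feature sizes, and top-k choice below are illustrative assumptions.

```python
import numpy as np

def route(token_feat, gate_W, top_k=2):
    """Dynamic router (sketch): softmax gate over experts, keep the
    top-k per token, renormalize the kept weights to sum to one."""
    logits = gate_W @ token_feat
    z = np.exp(logits - logits.max())      # stable softmax
    probs = z / z.sum()
    idx = np.argsort(probs)[-top_k:]       # indices of the top-k experts
    return idx, probs[idx] / probs[idx].sum()

# 3 experts gating a 2-d token feature (hypothetical sizes)
gate_W = np.array([[2.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 1.0]])
idx, w = route(np.array([1.0, 0.0]), gate_W)
print(sorted(idx.tolist()), float(w.sum()))
```

Each token is then processed only by its selected experts, which is what keeps the mixture sparse while still letting different modalities activate different specialists.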
cs.CV / 10 / 2603.03571
Confidence-aware Monocular Depth Estimation for Minimally Invasive Surgery
用于微创手术的置信度感知单目深度估计
Abstract
Purpose: Monocular depth estimation (MDE) is vital for scene understanding in minimally invasive surgery (MIS). However, endoscopic video sequences are often contaminated by smoke, specular reflections, blur, and occlusions, limiting the accuracy of MDE models. In addition, current MDE models do not output depth confidence, which could be a valuable tool for improving their clinical reliability. Methods: We propose a novel confidence-aware MDE framework featuring three significant contributions: (i) Calibrated confidence targets: an ensemble of fine-tuned stereo matching models is used to capture disparity variance into pixel-wise confidence probabilities; (ii) Confidence-aware loss: Baseline MDE models are optimized with confidence-aware loss functions, utilizing pixel-wise confidence probabilities such that reliable pixels dominate training; and (iii) Inference-time confidence: a confidence estimation head is proposed with two convolution layers to predict per-pixel confidence at inference, enabling assessment of depth reliability. Results: Comprehensive experimental validation across internal and public datasets demonstrates that our framework improves depth estimation accuracy and can robustly quantify the prediction's confidence. On the internal clinical endoscopic dataset (StereoKP), we improve dense depth estimation accuracy by ~8% as compared to the baseline model. Conclusion: Our confidence-aware framework enables improved accuracy of MDE models in MIS, addressing challenges posed by noise and artifacts in pre-clinical and clinical data, and allows MDE models to provide confidence maps that may be used to improve their reliability for clinical applications.
Chinese Translation
目的:单目深度估计(MDE)对于微创手术(MIS)中的场景理解至关重要。然而,内窥镜视频序列常常受到烟雾、镜面反射、模糊和遮挡的影响,从而限制了MDE模型的准确性。此外,当前的MDE模型不输出深度置信度,而这可能是提高其临床可靠性的有价值工具。方法:我们提出了一种新颖的置信度感知MDE框架,具有三项重要贡献:(i)校准的置信度目标:使用一组经过微调的立体匹配模型来捕捉视差方差,生成逐像素的置信度概率;(ii)置信度感知损失:基线MDE模型使用置信度感知的损失函数进行优化,利用逐像素的置信度概率,使得可靠的像素主导训练;(iii)推理时的置信度:提出了一种置信度估计头,包含两个卷积层,在推理时预测逐像素的置信度,从而评估深度的可靠性。结果:在内部和公共数据集上的全面实验验证表明,我们的框架提高了深度估计的准确性,并能够稳健地量化预测的置信度。在内部临床内窥镜数据集(StereoKP)上,与基线模型相比,我们的稠密深度估计准确性提高了约8%。结论:我们的置信度感知框架提高了MDE模型在MIS中的准确性,解决了在临床前和临床数据中噪声和伪影带来的挑战,并使MDE模型能够提供置信度图,从而提高其在临床应用中的可靠性。
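Contribution (ii), the confidence-aware loss, can be sketched as a per-pixel error weighted by confidence and normalized by total confidence, so that reliable pixels dominate training. The L1 form and the normalization below are assumptions, not the paper's exact objective.

```python
import numpy as np

def confidence_weighted_l1(pred, gt, conf, eps=1e-6):
    """Confidence-aware depth loss (sketch): pixel-wise L1 error weighted
    by per-pixel confidence, normalized by total confidence so the loss
    scale does not depend on how many pixels are trusted."""
    return np.sum(conf * np.abs(pred - gt)) / (np.sum(conf) + eps)

pred = np.array([1.0, 2.0])
gt = np.array([1.0, 3.0])
# The erroneous second pixel is ignored when its confidence is zero
print(confidence_weighted_l1(pred, gt, np.array([1.0, 0.0])))
print(confidence_weighted_l1(pred, gt, np.array([0.0, 1.0])))
```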
cs.CV / 11 / 2603.03577
From Local Matches to Global Masks: Novel Instance Detection in Open-World Scenes
从局部匹配到全局掩膜:开放世界场景中的新颖实例检测
Abstract
Detecting and segmenting novel object instances in open-world environments is a fundamental problem in robotic perception. Given only a small set of template images, a robot must locate and segment a specific object instance in a cluttered, previously unseen scene. Existing proposal-based approaches are highly sensitive to proposal quality and often fail under occlusion and background clutter. We propose L2G-Det, a local-to-global instance detection framework that bypasses explicit object proposals by leveraging dense patch-level matching between templates and the query image. Locally matched patches generate candidate points, which are refined through a candidate selection module to suppress false positives. The filtered points are then used to prompt an augmented Segment Anything Model (SAM) with instance-specific object tokens, enabling reliable reconstruction of complete instance masks. Experiments demonstrate improved performance over proposal-based methods in challenging open-world settings.
Chinese Translation
在开放世界环境中检测和分割新颖物体实例是机器人感知中的一个基本问题。给定仅一小组模板图像,机器人必须在一个杂乱且之前未见过的场景中定位和分割特定的物体实例。现有的基于提议的方法对提议质量高度敏感,且在遮挡和背景杂乱的情况下常常失败。我们提出了L2G-Det,一个局部到全局的实例检测框架,通过利用模板与查询图像之间的密集补丁级匹配,绕过了显式的物体提议。局部匹配的补丁生成候选点,这些候选点通过候选选择模块进行精炼,以抑制假阳性。过滤后的点随后用于提示增强的Segment Anything Model (SAM),并使用特定实例的物体标记,从而实现完整实例掩膜的可靠重建。实验表明,在具有挑战性的开放世界环境中,相较于基于提议的方法,性能得到了改善。
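The local-to-global idea, generating candidate points from dense patch-level matches between template and query features, can be sketched roughly as below. Feature dimensions, the cosine-similarity criterion, and the threshold are illustrative assumptions; the paper's candidate selection module and SAM prompting are not reproduced:

```python
import numpy as np

def candidate_points(template_feats, query_feats, query_coords, thresh=0.8):
    """Dense patch-level matching: flag query patches whose best cosine
    similarity to any template patch exceeds `thresh` as candidate points."""
    t = template_feats / np.linalg.norm(template_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sim = q @ t.T                 # (num_query_patches, num_template_patches)
    best = sim.max(axis=1)        # best template match per query patch
    return query_coords[best > thresh]

rng = np.random.default_rng(0)
templates = rng.normal(size=(4, 8))                          # template patch features
query = np.vstack([templates[0], rng.normal(size=(2, 8))])   # patch 0 matches template 0
coords = np.array([[0, 0], [5, 5], [9, 9]])                  # patch locations in the query image
pts = candidate_points(templates, query, coords)
```

The returned points would then prompt the augmented SAM to reconstruct the full instance mask.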
cs.CV / 12 / 2603.03580
An Effective Data Augmentation Method by Asking Questions about Scene Text Images
通过询问场景文本图像的有效数据增强方法
Abstract
Scene text recognition (STR) and handwritten text recognition (HTR) face significant challenges in accurately transcribing textual content from images into machine-readable formats. Conventional OCR models often predict transcriptions directly, which limits detailed reasoning about text structure. We propose a VQA-inspired data augmentation framework that strengthens OCR training through structured question-answering tasks. For each image-text pair, we generate natural-language questions probing character-level attributes such as presence, position, and frequency, with answers derived from ground-truth text. These auxiliary tasks encourage finer-grained reasoning, and the OCR model aligns visual features with textual queries to jointly reason over images and questions. Experiments on WordArt and Esposalles datasets show consistent improvements over baseline models, with significant reductions in both CER and WER. Our code is publicly available at https://github.com/xuyaooo/DataAugOCR.
Chinese Translation
场景文本识别(STR)和手写文本识别(HTR)在将图像中的文本内容准确转录为机器可读格式方面面临重大挑战。传统的光学字符识别(OCR)模型通常直接预测转录结果,这限制了对文本结构的详细推理。我们提出了一种受视觉问答(VQA)启发的数据增强框架,通过结构化的问答任务增强OCR训练。对于每个图像-文本对,我们生成自然语言问题,探讨字符级属性,如存在性、位置和频率,答案来源于真实文本。这些辅助任务鼓励更细粒度的推理,OCR模型将视觉特征与文本查询对齐,以共同推理图像和问题。在WordArt和Esposalles数据集上的实验显示,与基线模型相比,性能有了一致的提升,字符错误率(CER)和词错误率(WER)均显著降低。我们的代码已公开发布在 https://github.com/xuyaooo/DataAugOCR。
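The question-generation step described above, producing character-level questions about presence, position, and frequency from ground-truth text, can be sketched in a few lines; the templates are illustrative, not the authors' exact prompts:

```python
def generate_char_questions(text):
    """Generate VQA-style auxiliary (question, answer) pairs probing
    character-level attributes of a ground-truth transcription."""
    qa = []
    for ch in sorted(set(text)):
        qa.append((f"Does the character '{ch}' appear in the image?", "yes"))
        qa.append((f"How many times does '{ch}' appear?", str(text.count(ch))))
        qa.append((f"At which position does '{ch}' first appear?", str(text.index(ch) + 1)))
    return qa

pairs = generate_char_questions("hello")
```

Each image-text pair thus yields many auxiliary QA tasks that push the OCR model toward finer-grained reasoning about text structure.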
cs.CV / 13 / 2603.03584
Hazard-Aware Traffic Scene Graph Generation
危险感知交通场景图生成
Abstract
Maintaining situational awareness in complex driving scenarios is challenging. It requires continuously prioritizing attention among extensive scene entities and understanding how prominent hazards might affect the ego vehicle. While existing studies excel at detecting specific semantic categories and visually salient regions, they lack the ability to assess safety-relevance. Meanwhile, the generic spatial predicates modeled by existing scene graphs, whether defined for foreground objects only or for all scene entities, are inadequate for driving scenarios. To bridge this gap, we introduce a novel task, Traffic Scene Graph Generation, which captures traffic-specific relations between prominent hazards and the ego vehicle. We propose a novel framework that explicitly uses traffic accident data and depth cues to supplement visual features and semantic information for reasoning. The output traffic scene graphs provide intuitive guidelines that stress prominent hazards by color-coding their severity and annotating their effect mechanism and relative location to the ego vehicle. We create relational annotations on the Cityscapes dataset and evaluate our model on 10 tasks from 5 perspectives. The results of comparative experiments and ablation studies demonstrate our capacity for ego-centric reasoning in hazard-aware traffic scene understanding.
Chinese Translation
在复杂驾驶场景中保持情境意识是一项挑战。这需要在大量场景实体中持续优先关注,并理解显著危险如何影响自我车辆。虽然现有研究在检测特定语义类别和视觉显著区域方面表现出色,但它们缺乏评估安全相关性的能力。同时,现有场景图所建模的通用空间谓词,无论是仅针对前景物体还是针对所有场景实体,都不足以满足驾驶场景的需求。为了解决这一问题,我们引入了一项新任务——交通场景图生成,旨在捕捉显著危险与自我车辆之间的交通特定关系。我们提出了一种新框架,明确利用交通事故数据和深度线索来补充视觉特征和语义信息,以进行推理。输出的交通场景图通过颜色编码其严重性并标注其影响机制和相对位置,提供了强调显著危险的直观指导。我们在Cityscapes数据集上创建了关系注释,并从五个角度评估了我们的模型在十个任务上的表现。比较实验和消融研究的结果证明了我们在危险感知交通场景理解中的自我中心推理能力。
cs.CV / 14 / 2603.03602
DM-CFO: A Diffusion Model for Compositional 3D Tooth Generation with Collision-Free Optimization
DM-CFO:一种用于组合式3D牙齿生成的扩散模型,具有无碰撞优化
Abstract
The automatic design of a 3D tooth model plays a crucial role in dental digitization. However, current approaches face challenges in compositional 3D tooth generation because both the layouts and shapes of missing teeth need to be optimized. In addition, collision conflicts are often omitted in 3D Gaussian-based compositional 3D generation, where objects may intersect with each other due to the absence of explicit geometric information on the object surfaces. Motivated by graph generation through diffusion models and collision detection using 3D Gaussians, we propose an approach named DM-CFO for compositional tooth generation, where the layout of missing teeth is progressively restored during the denoising phase under both text and graph constraints. Then, the Gaussian parameters of each layout-guided tooth and the entire jaw are alternately updated using score distillation sampling (SDS). Furthermore, a regularization term based on the distances between the 3D Gaussians of neighboring teeth and the anchor tooth is introduced to penalize tooth intersections. Experimental results on three tooth-design datasets demonstrate that our approach significantly improves the multiview consistency and realism of the generated teeth compared with existing methods. Project page: https://amateurc.github.io/CF-3DTeeth/.
Chinese Translation
3D牙齿模型的自动设计在牙科数字化中发挥着至关重要的作用。然而,当前的方法在组合式3D牙齿生成中面临挑战,因为缺失牙齿的布局和形状都需要优化。此外,在基于3D高斯的组合式3D生成中,碰撞冲突通常被忽略,因为由于缺乏物体表面的明确几何信息,物体可能会相互交叉。受到通过扩散模型进行图生成和使用3D高斯进行碰撞检测的启发,我们提出了一种名为DM-CFO的方法用于组合式牙齿生成,在去噪阶段,缺失牙齿的布局在文本和图形约束下逐步恢复。然后,使用评分蒸馏采样(SDS)交替更新每个布局引导的牙齿和整个下颌的高斯参数。此外,引入了一个正则化项,基于相邻牙齿与锚定牙齿之间的3D高斯距离,以惩罚牙齿交叉。对三个牙齿设计数据集的实验结果表明,与现有方法相比,我们的方法显著提高了生成牙齿的多视图一致性和真实感。项目页面:https://amateurc.github.io/CF-3DTeeth/
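The collision regularization described above, penalizing neighboring teeth whose 3D Gaussians come too close, can be illustrated with a simple hinge penalty over sphere-approximated Gaussians. The radii and the pairwise formulation are simplifying assumptions, not the paper's exact term:

```python
import numpy as np

def collision_penalty(centers, radii, margin=0.0):
    """Hinge penalty on overlaps between 3D Gaussians approximated as
    spheres: pairs whose center distance falls below the sum of their
    radii (plus an optional margin) are penalized."""
    penalty = 0.0
    n = len(centers)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(centers[i] - centers[j])
            penalty += max(0.0, radii[i] + radii[j] + margin - d)
    return penalty

# Teeth 0 and 1 overlap (distance 1.0 < 0.8 + 0.8); tooth 2 is clear.
centers = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
radii = np.array([0.8, 0.8, 0.8])
p = collision_penalty(centers, radii)
```

Minimizing such a term during SDS updates pushes intersecting teeth apart while leaving well-separated pairs unaffected.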
cs.CV / 15 / 2603.03603
Detection and Identification of Penguins Using Appearance and Motion Features
基于外观和运动特征的企鹅检测与识别
Abstract
In animal facilities, continuous surveillance of penguins is essential yet technically challenging due to their homogeneous visual characteristics, rapid and frequent posture changes, and substantial environmental noise such as water reflections. In this study, we propose a framework that enhances both detection and identification performance by integrating appearance and motion features. For detection, we adapted YOLO11 to process consecutive frames to overcome the lack of temporal consistency in single-frame detectors. This approach leverages motion cues to detect targets even when distinct visual features are obscured. Our evaluation shows that fine-tuning the model with two-frame inputs improves mAP@0.5 from 0.922 to 0.933, outperforming the baseline, and successfully recovers individuals that are indistinguishable in static images. For identification, we introduce a tracklet-based contrastive learning approach applied after tracking. Through qualitative visualization, we demonstrate that the method produces coherent feature embeddings, bringing samples from the same individual closer in the feature space, suggesting the potential for mitigating ID switching.
Chinese Translation
在动物设施中,持续监测企鹅至关重要,但由于其均匀的视觉特征、快速且频繁的姿态变化以及水面反射等显著环境噪声,技术上具有挑战性。在本研究中,我们提出了一种框架,通过整合外观和运动特征来增强检测和识别性能。对于检测,我们调整了YOLO11,以处理连续帧,从而克服单帧检测器缺乏时间一致性的问题。这种方法利用运动线索来检测目标,即使在明显的视觉特征被遮挡时也能有效识别。我们的评估表明,使用双帧输入对模型进行微调后,mAP@0.5从0.922提高至0.933,超越了基线,并成功恢复了在静态图像中难以区分的个体。对于识别,我们引入了一种基于轨迹的对比学习方法,该方法在跟踪后应用。通过定性可视化,我们展示了该方法生成了一致的特征嵌入,使来自同一个体的样本在特征空间中更为接近,表明其在减轻身份切换方面的潜力。
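The two-frame adaptation described above amounts to feeding the detector a pair of consecutive frames; one simple realization is channel-wise stacking, sketched below. The 6-channel input convention is an assumption, and the paper's exact adaptation of YOLO11 may differ:

```python
import numpy as np

def stack_frames(prev_frame, curr_frame):
    """Concatenate two consecutive RGB frames along the channel axis so a
    single-frame detector (expecting HxWxC input) can exploit motion cues."""
    assert prev_frame.shape == curr_frame.shape
    return np.concatenate([prev_frame, curr_frame], axis=-1)

prev = np.zeros((4, 4, 3), dtype=np.uint8)        # dark previous frame
curr = np.full((4, 4, 3), 255, dtype=np.uint8)    # bright current frame
x = stack_frames(prev, curr)                      # 6-channel motion-aware input
```

Regions that change between frames then differ across the two channel groups, giving the network an explicit motion signal even when appearance features are obscured.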
cs.CV / 16 / 2603.03604
Tracking Feral Horses in Aerial Video Using Oriented Bounding Boxes
使用定向边界框在空中视频中追踪野马
Abstract
The social structures of group-living animals such as feral horses are diverse and remain insufficiently understood, even within a single species. To investigate group dynamics, aerial videos are often utilized to track individuals and analyze their movement trajectories, which are essential for evaluating inter-individual interactions and comparing social behaviors. Accurate individual tracking is therefore crucial. In multi-animal tracking, axis-aligned bounding boxes (bboxes) are widely used; however, for aerial top-view footage of entire groups, their performance degrades due to complex backgrounds, small target sizes, high animal density, and varying body orientations. To address this issue, we employ oriented bounding boxes (OBBs), which include rotation angles and reduce unnecessary background. Nevertheless, current OBB detectors such as YOLO-OBB restrict angles within a 180$^{\circ}$ range, making it impossible to distinguish head from tail and often causing sudden 180$^{\circ}$ flips across frames, which severely disrupts continuous tracking. To overcome this limitation, we propose a head-orientation estimation method that crops OBB-centered patches, applies three detectors (head, tail, and head-tail), and determines the final label through IoU-based majority voting. Experiments using 299 test images show that our method achieves 99.3% accuracy, outperforming individual models, demonstrating its effectiveness for robust OBB-based tracking.
Chinese Translation
群居动物如野马的社会结构多样且尚未得到充分理解,即使在同一物种内也是如此。为了研究群体动态,通常利用空中视频来追踪个体并分析其运动轨迹,这对于评估个体间的互动和比较社会行为至关重要。因此,准确的个体追踪显得尤为重要。在多动物追踪中,轴对齐边界框(axis-aligned bounding boxes,bboxes)被广泛使用;然而,对于整个群体的空中俯视镜头,由于复杂的背景、小目标尺寸、高动物密度和变化的身体方向,其性能下降。为了解决这个问题,我们采用定向边界框(oriented bounding boxes,OBBs),它们包含旋转角度并减少不必要的背景。然而,目前的OBB检测器如YOLO-OBB将角度限制在180$^{\circ}$范围内,使得无法区分头部和尾部,并且经常导致跨帧的突然180$^{\circ}$翻转,这严重干扰了连续追踪。为了克服这一限制,我们提出了一种头部方向估计方法,该方法裁剪以OBB为中心的图像块,应用三个检测器(头部、尾部和头尾),并通过基于IoU的多数投票确定最终标签。使用299张测试图像的实验表明,我们的方法达到了99.3%的准确率,超越了单个模型,证明了其在稳健的OBB基础追踪中的有效性。
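The IoU-based majority voting over the three detectors (head, tail, head-tail) can be sketched as follows; the voting and tie-breaking details are illustrative assumptions rather than the paper's exact rule:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def vote_orientation(head_dets, tail_dets, headtail_dets, iou_thresh=0.5):
    """IoU-based majority voting: every head/tail detection that overlaps a
    head-tail reference box above the threshold casts a vote for its label."""
    votes = {"head": 0, "tail": 0}
    for ref in headtail_dets:
        for box in head_dets:
            if iou(box, ref) > iou_thresh:
                votes["head"] += 1
        for box in tail_dets:
            if iou(box, ref) > iou_thresh:
                votes["tail"] += 1
    return max(votes, key=votes.get)

label = vote_orientation(
    head_dets=[(0, 0, 10, 10), (1, 1, 9, 9)],
    tail_dets=[(50, 50, 60, 60)],
    headtail_dets=[(0, 0, 12, 12)],
)
```

Resolving the head-tail ambiguity this way removes the sudden 180° flips that break continuity in angle-limited OBB tracking.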
cs.CV / 17 / 2603.03615
Parallax to Align Them All: An OmniParallax Attention Mechanism for Distributed Multi-View Image Compression
视差对齐:一种用于分布式多视图图像压缩的全视差注意机制
Abstract
Multi-view image compression (MIC) aims to achieve high compression efficiency by exploiting inter-image correlations, playing a crucial role in 3D applications. As a subfield of MIC, distributed multi-view image compression (DMIC) offers performance comparable to MIC while eliminating the need for inter-view information at the encoder side. However, existing methods in DMIC typically treat all images equally, overlooking the varying degrees of correlation between different views during decoding, which leads to suboptimal coding performance. To address this limitation, we propose a novel $\textbf{OmniParallax Attention Mechanism}$ (OPAM), which is a general mechanism for explicitly modeling correlations and aligned features between arbitrary pairs of information sources. Building upon OPAM, we propose a Parallax Multi Information Fusion Module (PMIFM) to adaptively integrate information from different sources. PMIFM is incorporated into both the joint decoder and the entropy model to construct our end-to-end DMIC framework, $\textbf{ParaHydra}$. Extensive experiments demonstrate that $\textbf{ParaHydra}$ is $\textbf{the first DMIC method}$ to significantly surpass state-of-the-art MIC codecs, while maintaining low computational overhead. Performance gains become more pronounced as the number of input views increases. Compared with LDMIC, $\textbf{ParaHydra}$ achieves bitrate savings of $\textbf{19.72\%}$ on WildTrack(3) and up to $\textbf{24.18\%}$ on WildTrack(6), while significantly improving coding efficiency (as much as $\textbf{65}\times$ in decoding and $\textbf{34}\times$ in encoding).
Chinese Translation
多视图图像压缩(MIC)旨在通过利用图像间的相关性实现高压缩效率,在三维应用中发挥着至关重要的作用。作为MIC的一个子领域,分布式多视图图像压缩(DMIC)在不需要编码器端的视图间信息的情况下,提供了与MIC相当的性能。然而,现有的DMIC方法通常对所有图像一视同仁,忽视了在解码过程中不同视图之间的相关性差异,这导致编码性能不理想。为了解决这一局限性,我们提出了一种新颖的全视差注意机制(OmniParallax Attention Mechanism, OPAM),该机制用于显式建模任意信息源对之间的相关性和对齐特征。在OPAM的基础上,我们提出了一个视差多信息融合模块(Parallax Multi Information Fusion Module, PMIFM),以自适应地整合来自不同源的信息。PMIFM被纳入联合解码器和熵模型中,以构建我们的端到端DMIC框架ParaHydra。大量实验表明,ParaHydra是第一个显著超越最先进MIC编解码器的DMIC方法,同时保持低计算开销。随着输入视图数量的增加,性能提升愈加明显。与LDMIC相比,ParaHydra在WildTrack(3)上实现了19.72%的比特率节省,在WildTrack(6)上则高达24.18%,同时显著提高了编码效率(解码速度提高了65倍,编码速度提高了34倍)。
cs.CV / 18 / 2603.03616
LeafInst - Unified Instance Segmentation Network for Fine-Grained Forestry Leaf Phenotype Analysis: A New UAV based Benchmark
LeafInst - 统一实例分割网络用于细粒度林业叶片表型分析:一种新的基于无人机的基准
Abstract
Intelligent forest tree breeding has advanced plant phenotyping, yet existing research largely focuses on large-leaf agricultural crops, with limited attention to fine-grained leaf analysis of sapling trees in open-field environments. Natural scenes introduce challenges including scale variation, illumination changes, and irregular leaf morphology. To address these issues, we collected UAV RGB imagery of field-grown saplings and constructed the Poplar-leaf dataset, containing 1,202 branches and 19,876 pixel-level annotated leaf instances. To our knowledge, this is the first instance segmentation dataset specifically designed for forestry leaves in open-field conditions. We propose LeafInst, a novel segmentation framework tailored for irregular and multi-scale leaf structures. The model integrates an Asymptotic Feature Pyramid Network (AFPN) for multi-scale perception, a Dynamic Asymmetric Spatial Perception (DASP) module for irregular shape modeling, and a dual-residual Dynamic Anomalous Regression Head (DARH) with Top-down Concatenation decoder Feature Fusion (TCFU) to improve detection and segmentation performance. On Poplar-leaf, LeafInst achieves 68.4 mAP, outperforming YOLOv11 by 7.1 percent and MaskDINO by 6.5 percent. On the public PhenoBench benchmark, it reaches 52.7 box mAP, exceeding MaskDINO by 3.4 percent. Additional experiments demonstrate strong generalization and practical utility for large-scale leaf phenotyping.
Chinese Translation
智能森林树木育种推动了植物表型学的发展,但现有研究主要集中在大叶农业作物上,对开放田野环境中幼苗树木的细粒度叶片分析关注较少。自然场景带来了包括尺度变化、光照变化和不规则叶片形态等挑战。为了解决这些问题,我们收集了田间生长的幼苗的无人机RGB图像,并构建了Poplar-leaf数据集,包含1,202个枝条和19,876个像素级标注的叶片实例。据我们所知,这是第一个专门为开放田野条件下的林业叶片设计的实例分割数据集。我们提出了LeafInst,这是一种针对不规则和多尺度叶片结构的新型分割框架。该模型集成了渐近特征金字塔网络(Asymptotic Feature Pyramid Network, AFPN)用于多尺度感知,动态不对称空间感知(Dynamic Asymmetric Spatial Perception, DASP)模块用于不规则形状建模,以及双残差动态异常回归头(Dual-residual Dynamic Anomalous Regression Head, DARH)与自上而下拼接解码器特征融合(Top-down Concatenation decoder Feature Fusion, TCFU)以提高检测和分割性能。在Poplar-leaf数据集上,LeafInst达到了68.4的mAP,超越了YOLOv11 7.1个百分点和MaskDINO 6.5个百分点。在公共PhenoBench基准上,它达到了52.7的框mAP,超过了MaskDINO 3.4个百分点。额外实验表明,该模型在大规模叶片表型分析中具有强大的泛化能力和实际应用价值。
cs.CV / 19 / 2603.03617
RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation
RAGTrack:基于语言感知的检索增强生成RGB-T跟踪
Abstract
RGB-Thermal (RGBT) tracking aims to achieve robust object localization across diverse environmental conditions by fusing visible and thermal infrared modalities. However, existing RGBT trackers rely solely on initial-frame visual information for target modeling, failing to adapt to appearance variations due to the absence of language guidance. Furthermore, current methods suffer from redundant search regions and heterogeneous modality gaps, causing background distraction. To address these issues, we first introduce textual descriptions into RGBT tracking benchmarks. This is accomplished through a pipeline that leverages Multi-modal Large Language Models (MLLMs) to automatically produce textual annotations. Afterwards, we propose RAGTrack, a novel Retrieval-Augmented Generation framework for robust RGBT tracking. To this end, we introduce a Multi-modal Transformer Encoder (MTE) for unified visual-language modeling. Then, we design an Adaptive Token Fusion (ATF) to select target-relevant tokens and perform channel exchanges based on cross-modal correlations, mitigating search redundancies and modality gaps. Finally, we propose a Context-aware Reasoning Module (CRM) to maintain a dynamic knowledge base and employ Retrieval-Augmented Generation (RAG) to enable temporal linguistic reasoning for robust target modeling. Extensive experiments on four RGBT benchmarks demonstrate that our framework achieves state-of-the-art performance across various challenging scenarios. The source code is available at https://github.com/IdolLab/RAGTrack.
Chinese Translation
RGB-热成像(RGBT)跟踪旨在通过融合可见光和热红外模态,在多样的环境条件下实现稳健的目标定位。然而,现有的RGBT跟踪器仅依赖初始帧的视觉信息进行目标建模,无法适应由于缺乏语言指导而导致的外观变化。此外,当前方法还存在冗余搜索区域和异构模态差距的问题,造成背景干扰。为了解决这些问题,我们首先在RGBT跟踪基准中引入文本描述。这是通过一个利用多模态大型语言模型(MLLMs)自动生成文本注释的流程实现的。随后,我们提出了RAGTrack,一种新颖的检索增强生成框架,用于稳健的RGBT跟踪。为此,我们引入了多模态变换器编码器(MTE)以实现统一的视觉-语言建模。然后,我们设计了自适应标记融合(ATF)以选择与目标相关的标记,并基于跨模态相关性进行通道交换,从而减轻搜索冗余和模态差距。最后,我们提出了上下文感知推理模块(CRM)以维护动态知识库,并采用检索增强生成(RAG)来实现稳健的目标建模的时间语言推理。在四个RGBT基准上的大量实验表明,我们的框架在各种具有挑战性的场景中实现了最先进的性能。源代码可在 https://github.com/IdolLab/RAGTrack 获取。
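The retrieval step behind the Context-aware Reasoning Module, looking up past target descriptions from a dynamic knowledge base by embedding similarity, can be sketched as below. This is a generic RAG-style retrieval, not the paper's CRM; the embeddings and descriptions are made up:

```python
import numpy as np

def retrieve(query_emb, knowledge_base, k=2):
    """Return the k most similar entries from a knowledge base of
    (embedding, description) pairs, ranked by cosine similarity."""
    embs = np.stack([e for e, _ in knowledge_base])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = embs @ q
    order = np.argsort(-sims)[:k]          # descending similarity
    return [knowledge_base[i][1] for i in order]

# Toy dynamic knowledge base accumulated over past frames.
kb = [
    (np.array([1.0, 0.0]), "pedestrian in dark clothing"),
    (np.array([0.0, 1.0]), "white car under low light"),
    (np.array([0.9, 0.1]), "pedestrian partially occluded"),
]
top = retrieve(np.array([1.0, 0.05]), kb, k=2)
```

The retrieved descriptions would then condition the language-aware target model, giving the tracker temporal context beyond the initial frame.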
cs.CV / 20 / 2603.03618
CoRe-BT: A Multimodal Radiology-Pathology-Text Benchmark for Robust Brain Tumor Typing
CoRe-BT:一种用于稳健脑肿瘤分型的多模态放射学-病理学-文本基准
Abstract
Accurate brain tumor typing requires integrating heterogeneous clinical evidence, including magnetic resonance imaging (MRI), histopathology, and pathology reports, which are often incomplete at the time of diagnosis. We introduce CoRe-BT, a cross-modal radiology-pathology-text benchmark for brain tumor typing, designed to study robust multimodal learning under missing modality conditions. The dataset comprises 310 patients with multi-sequence brain MRI (T1, T1c, T2, FLAIR), including 95 cases with paired H&E-stained whole-slide pathology images and pathology reports. All cases are annotated with tumor type and grade, and MRI volumes include expert-annotated tumor masks, enabling both region-aware modeling and auxiliary learning tasks. Tumors are categorized into six clinically relevant classes capturing the heterogeneity of common and rare glioma subtypes. We evaluate tumor typing under variable modality availability by comparing MRI-only models with multimodal approaches that incorporate pathology information when present. Baseline experiments demonstrate the feasibility of multimodal fusion and highlight complementary modality contributions across clinically relevant typing tasks. CoRe-BT provides a grounded testbed for advancing multimodal glioma typing and representation learning in realistic scenarios with incomplete clinical data.
Chinese Translation
准确的脑肿瘤分型需要整合异构的临床证据,包括磁共振成像(MRI)、组织病理学和病理报告,而这些证据在诊断时往往是不完整的。我们提出了CoRe-BT,这是一个用于脑肿瘤分型的跨模态放射学-病理学-文本基准,旨在研究在缺失模态条件下的稳健多模态学习。该数据集包含310名患者的多序列脑MRI(T1、T1c、T2、FLAIR),其中包括95例配对的H&E染色全切片病理图像和病理报告。所有病例均标注了肿瘤类型和等级,MRI体积包括专家标注的肿瘤掩膜,支持区域感知建模和辅助学习任务。肿瘤被分为六个临床相关类别,以捕捉常见和罕见胶质瘤亚型的异质性。我们通过比较仅使用MRI的模型与在存在病理信息时结合病理信息的多模态方法,评估在不同模态可用性下的肿瘤分型。基线实验展示了多模态融合的可行性,并强调了在临床相关分型任务中互补模态的贡献。CoRe-BT为推进多模态胶质瘤分型和在不完整临床数据的现实场景中进行表征学习提供了一个扎实的测试平台。
cs.CV / 21 / 2603.03637
Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions
基于图像的提示注入:通过视觉嵌入对抗指令劫持多模态大型语言模型
Abstract
Multimodal Large Language Models (MLLMs) integrate vision and text to power applications, but this integration introduces new vulnerabilities. We study Image-based Prompt Injection (IPI), a black-box attack in which adversarial instructions are embedded into natural images to override model behavior. Our end-to-end IPI pipeline incorporates segmentation-based region selection, adaptive font scaling, and background-aware rendering to conceal prompts from human perception while preserving model interpretability. Using the COCO dataset and GPT-4-turbo, we evaluate 12 adversarial prompt strategies and multiple embedding configurations. The results show that IPI can reliably manipulate the output of the model, with the most effective configuration achieving up to 64\% attack success under stealth constraints. These findings highlight IPI as a practical threat in black-box settings and underscore the need for defenses against multimodal prompt injection.
Chinese Translation
多模态大型语言模型(MLLMs)将视觉与文本结合以推动应用,但这种整合带来了新的脆弱性。我们研究了基于图像的提示注入(IPI),这是一种黑箱攻击,其中对抗指令嵌入到自然图像中以覆盖模型行为。我们的端到端IPI管道结合了基于分割的区域选择、自适应字体缩放和背景感知渲染,以在保持模型可解释性的同时,隐藏提示信息不被人类感知。使用COCO数据集和GPT-4-turbo,我们评估了12种对抗提示策略和多种嵌入配置。结果表明,IPI可以可靠地操控模型的输出,最有效的配置在隐蔽约束下实现了高达64%的攻击成功率。这些发现凸显了IPI在黑箱环境中的实际威胁,并强调了针对多模态提示注入的防御需求。
cs.CV / 22 / 2603.03646
InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions
InfinityStory:具有世界一致性和角色感知镜头过渡的无限视频生成
Elmoghany, Mohamed, Zhao, Liangbing, Shen, Xiaoqian, Mukherjee, Subhojyoti, Zhou, Yang, Wu, Gang, Lai, Viet Dac, Yoon, Seunghyun, Rossi, Ryan, Rashwan, Abdullah, Mathur, Puneet, Manjunatha, Varun, Dangi, Daksh, Nguyen, Chien, Lipka, Nedim, Bui, Trung, Singh, Krishna Kumar, Zhang, Ruiyi, Huang, Xiaolei, Cho, Jaemin, Wang, Yu, Park, Namyong, Tu, Zhengzhong, Chen, Hongjie, Eldardiry, Hoda, Ahmed, Nesreen, Nguyen, Thien, Manocha, Dinesh, Elhoseiny, Mohamed, Dernoncourt, Franck
Abstract
Generating long-form storytelling videos with consistent visual narratives remains a significant challenge in video synthesis. We present a novel framework, dataset, and model that address three critical limitations: background consistency across shots, seamless multi-subject shot-to-shot transitions, and scalability to hour-long narratives. Our approach introduces a background-consistent generation pipeline that maintains visual coherence across scenes while preserving character identity and spatial relationships. We further propose a transition-aware video synthesis module that generates smooth shot transitions for complex scenarios involving multiple subjects entering or exiting frames, going beyond the single-subject limitations of prior work. To support this, we contribute a synthetic dataset of 10,000 multi-subject transition sequences covering underrepresented dynamic scene compositions. On VBench, InfinityStory achieves the highest Background Consistency (88.94), highest Subject Consistency (82.11), and the best overall average rank (2.80), showing improved stability, smoother transitions, and better temporal coherence.
Chinese Translation
生成具有一致视觉叙事的长篇故事视频仍然是视频合成中的一项重大挑战。我们提出了一种新颖的框架、数据集和模型,以解决三个关键限制:镜头间背景一致性、无缝多主体镜头间过渡,以及扩展到小时级叙事的能力。我们的方法引入了一种背景一致的生成管道,能够在保持视觉连贯性的同时,维护角色身份和空间关系。我们进一步提出了一种感知过渡的视频合成模块,能够为涉及多个主体进出画面的复杂场景生成平滑的镜头过渡,超越了以往工作的单一主体限制。为此,我们贡献了一个包含10,000个多主体过渡序列的合成数据集,涵盖了被低估的动态场景组合。在VBench上,InfinityStory实现了最高的背景一致性(88.94)、最高的主体一致性(82.11)以及最佳的整体平均排名(2.80),显示出更好的稳定性、更平滑的过渡和更好的时间连贯性。
cs.CV / 23 / 2603.03648
One-Step Face Restoration via Shortcut-Enhanced Coupling Flow
通过快捷增强耦合流实现一步人脸修复
Abstract
Face restoration has advanced significantly with generative models like diffusion models and flow matching (FM), which learn continuous-time mappings between distributions. However, existing FM-based approaches often start from Gaussian noise, ignoring the inherent dependency between low-quality (LQ) and high-quality (HQ) data, resulting in path crossovers, curved trajectories, and multi-step sampling requirements. To address these issues, we propose Shortcut-enhanced Coupling flow for Face Restoration (SCFlowFR). First, it establishes a \textit{data-dependent coupling} that explicitly models the LQ--HQ dependency, minimizing path crossovers and promoting near-linear transport. Second, we employ conditional mean estimation to obtain a coarse prediction that refines the source anchor to tighten coupling and conditions the velocity field to stabilize large-step updates. Third, a shortcut constraint supervises average velocities over arbitrary time intervals, enabling accurate one-step inference. Experiments demonstrate that SCFlowFR achieves state-of-the-art one-step face restoration quality with inference speed comparable to traditional non-diffusion methods.
Chinese Translation
人脸修复在生成模型(如扩散模型和流匹配(Flow Matching, FM))的推动下取得了显著进展,这些模型学习分布之间的连续时间映射。然而,现有的基于FM的方法通常从高斯噪声开始,忽略了低质量(LQ)和高质量(HQ)数据之间的内在依赖关系,导致路径交叉、曲线轨迹以及多步采样的需求。为了解决这些问题,我们提出了用于人脸修复的快捷增强耦合流(Shortcut-enhanced Coupling flow for Face Restoration, SCFlowFR)。首先,它建立了一种\textit{数据依赖耦合},明确建模LQ与HQ之间的依赖关系,最小化路径交叉并促进近线性传输。其次,我们采用条件均值估计来获得粗略预测,从而细化源锚点以收紧耦合,并对速度场进行条件化以稳定大步更新。第三,快捷约束监督任意时间间隔内的平均速度,实现准确的一步推断。实验表明,SCFlowFR在一步人脸修复质量上达到了最先进水平,其推断速度与传统非扩散方法相当。
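The data-dependent coupling described above can be illustrated with a generic rectified-flow sketch: the path interpolates directly between paired LQ and HQ samples, so the velocity target is their constant displacement and, with a straight path, one Euler step suffices. This omits SCFlowFR's anchor refinement and shortcut constraint:

```python
import numpy as np

def coupled_flow_target(x_lq, x_hq, t):
    """Data-dependent coupling for flow matching: the path
    x_t = (1 - t) * x_lq + t * x_hq runs from a low-quality source to its
    paired high-quality target, so the regression target for the velocity
    field is the constant displacement x_hq - x_lq."""
    x_t = (1.0 - t) * x_lq + t * x_hq
    v_target = x_hq - x_lq
    return x_t, v_target

def one_step_restore(x_lq, velocity_fn):
    """With a straight (near-linear) transport path, a single Euler step
    from t=0 to t=1 recovers the target: x_hq ~= x_lq + v(x_lq, 0)."""
    return x_lq + velocity_fn(x_lq, 0.0)

x_lq = np.array([0.2, 0.4])
x_hq = np.array([1.0, 0.0])
x_t, v = coupled_flow_target(x_lq, x_hq, t=0.5)
restored = one_step_restore(x_lq, lambda x, t: x_hq - x_lq)  # oracle velocity
```

Starting from the LQ image rather than Gaussian noise is what removes the path crossovers that otherwise force multi-step sampling.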
cs.CV / 24 / 2603.03654
Field imaging framework for morphological characterization of aggregates with computer vision: Algorithms and applications
基于计算机视觉的集料形态特征现场成像框架:算法与应用
Abstract
Construction aggregates, including sand and gravel, crushed stone, and riprap, are the core building blocks of the construction industry. State-of-the-practice characterization methods rely mainly on visual inspection and manual measurement, while state-of-the-art aggregate imaging methods are applicable only to regular-sized aggregates under well-controlled conditions. This dissertation addresses these major challenges by developing a field imaging framework for the morphological characterization of aggregates as a multi-scenario solution. For individual and non-overlapping aggregates, a field imaging system was designed and the associated segmentation and volume estimation algorithms were developed. For 2D image analyses of aggregates in stockpiles, an automated 2D instance segmentation and morphological analysis approach was established. For 3D point cloud analyses of aggregate stockpiles, an integrated 3D Reconstruction-Segmentation-Completion (RSC-3D) approach was established, comprising 3D reconstruction from multi-view images, 3D stockpile instance segmentation, and 3D shape completion to predict the unseen sides. First, a 3D reconstruction procedure was developed to obtain high-fidelity 3D models of collected aggregate samples, from which a 3D aggregate particle library was constructed. Next, two datasets were derived from the 3D particle library for 3D learning: a synthetic dataset of aggregate stockpiles with ground-truth instance labels, and a dataset of partial-complete shape pairs developed with varying-view raycasting schemes. A state-of-the-art 3D instance segmentation network and a 3D shape completion network were trained on these datasets, respectively. The application of the integrated approach was demonstrated on real stockpiles and validated against ground truth, showing good performance in capturing and predicting the unseen sides of aggregates.
Chinese Translation
建筑用集料,包括砂石、碎石和护坡石,是建筑行业的核心构建块。现有的表征方法主要依赖于目视检查和手动测量。最先进的集料成像方法存在局限性,仅适用于在良好控制条件下的常规尺寸集料。本文通过开发一个多场景解决方案的现场成像框架,解决了这些主要挑战,以实现集料的形态特征表征。针对单个且不重叠的集料,设计了一个现场成像系统,并开发了相关的分割和体积估计算法。对于堆料中的集料的二维图像分析,建立了一种自动化的二维实例分割和形态分析方法。对于集料堆的三维点云分析,建立了一种集成的三维重建-分割-补全(RSC-3D)方法:从多视角图像进行三维重建程序、三维堆料实例分割以及三维形状补全以预测未见侧面。首先,开发了一种三维重建程序,以获取收集的集料样本的高保真三维模型,并基于此构建了一个三维集料颗粒库。接下来,从三维颗粒库中衍生出两个数据集用于三维学习:一个带有真实实例标签的合成集料堆数据集,以及一个通过不同视角光线投射方案开发的部分完整形状对数据集。分别在这些数据集上训练了最先进的三维实例分割网络和三维形状补全网络。集成方法的应用在真实堆料上得到了验证,并与真实数据进行了对比,显示出在捕捉和预测集料未见侧面方面的良好性能。
cs.CV / 25 / 2603.03657
InEdit-Bench: Benchmarking Intermediate Logical Pathways for Intelligent Image Editing Models
InEdit-Bench:智能图像编辑模型中间逻辑路径的基准测试
Abstract
Multimodal generative models have made significant strides in image editing, demonstrating impressive performance on a variety of static tasks. However, their proficiency typically does not extend to complex scenarios requiring dynamic reasoning, leaving them ill-equipped to model the coherent, intermediate logical pathways that constitute a multi-step evolution from an initial state to a final one. This capacity is crucial for unlocking a deeper level of procedural and causal understanding in visual manipulation. To systematically measure this critical limitation, we introduce InEdit-Bench, the first evaluation benchmark dedicated to reasoning over intermediate pathways in image editing. InEdit-Bench comprises meticulously annotated test cases covering four fundamental task categories: state transition, dynamic process, temporal sequence, and scientific simulation. Additionally, to enable fine-grained evaluation, we propose a set of assessment criteria to evaluate the logical coherence and visual naturalness of the generated pathways, as well as the model's fidelity to specified path constraints. Our comprehensive evaluation of 14 representative image editing models on InEdit-Bench reveals significant and widespread shortcomings in this domain. By providing a standardized and challenging benchmark, we aim for InEdit-Bench to catalyze research and steer development towards more dynamic, reason-aware, and intelligent multimodal generative models.
Chinese Translation
多模态生成模型在图像编辑方面取得了显著进展,在各种静态任务上表现出色。然而,它们的能力通常无法扩展到需要动态推理的复杂场景,使得它们在建模从初始状态到最终状态的连贯中间逻辑路径方面显得力不从心。这种能力对于解锁视觉操作中更深层次的过程和因果理解至关重要。为了系统地衡量这一关键限制,我们引入了InEdit-Bench,这是第一个专门用于评估图像编辑中间路径推理的评估基准。InEdit-Bench包含经过精心注释的测试案例,涵盖四个基本任务类别:状态转移、动态过程、时间序列和科学模拟。此外,为了实现细致的评估,我们提出了一套评估标准,用于评估生成路径的逻辑一致性和视觉自然性,以及模型对指定路径约束的忠实度。我们对14个代表性图像编辑模型在InEdit-Bench上的全面评估揭示了该领域显著且普遍的不足。通过提供一个标准化且具有挑战性的基准,我们希望InEdit-Bench能够催化研究并引导开发更具动态性、推理意识和智能的多模态生成模型。
cs.CV / 26 / 2603.03665
Machine Pareidolia: Protecting Facial Image with Emotional Editing
机器幻象:通过情感编辑保护面部图像
Abstract
The proliferation of facial recognition (FR) systems has raised privacy concerns in the digital realm, as malicious uses of FR models pose a significant threat. Traditional countermeasures, such as makeup style transfer, have suffered from low transferability in black-box settings and limited applicability across various demographic groups, including males and individuals with darker skin tones. To address these challenges, we introduce a novel facial privacy protection method, dubbed \textbf{MAP}, a pioneering approach that employs human emotion modifications to disguise original identities as target identities in facial images. Our method uniquely fine-tunes a score network to learn dual objectives, target identity and human expression, which are jointly optimized through gradient projection to ensure convergence at a shared local optimum. Additionally, we enhance the perceptual quality of protected images by applying local smoothness regularization and optimizing the score matching loss within our network. Empirical experiments demonstrate that our innovative approach surpasses previous baselines, including noise-based, makeup-based, and freeform attribute methods, in both qualitative fidelity and quantitative metrics. Furthermore, MAP proves its effectiveness against an online FR API and shows advanced adaptability in uncommon photographic scenarios.
Chinese Translation
面部识别(FR)系统的普及引发了数字领域的隐私担忧,因为恶意使用FR模型构成了重大威胁。传统的对策,如化妆风格迁移,在黑箱环境中转移能力较低,并且在包括男性和肤色较深个体在内的各种人口群体中的适用性有限。为了解决这些挑战,我们提出了一种新颖的面部隐私保护方法,称为\textbf{MAP},这是一种开创性的方法,利用人类情感的修改将面部图像中的原始身份伪装为目标身份。我们的方法独特地微调了一个评分网络,以学习双重目标,即目标身份和人类表情,这两个目标通过梯度投影共同优化,以确保在共享局部最优解处收敛。此外,我们通过应用局部平滑正则化和优化网络内的评分匹配损失,提升了受保护图像的感知质量。实证实验表明,我们的创新方法在定性保真度和定量指标上均超越了包括基于噪声、基于化妆和自由形式属性方法在内的先前基准。此外,MAP在对抗在线FR API方面证明了其有效性,并在不常见的摄影场景中显示出更强的适应性。
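The joint optimization via gradient projection can be illustrated with a PCGrad-style sketch: when the gradients of the two objectives (target identity and expression) conflict, one gradient's component along the other is removed so the combined update does not undo either objective. The exact projection rule of MAP is not reproduced here:

```python
import numpy as np

def project_conflicting(g1, g2):
    """If g1 and g2 conflict (negative inner product), subtract from g1
    its component along g2, leaving only the part orthogonal to g2."""
    dot = g1 @ g2
    if dot < 0:
        g1 = g1 - (dot / (g2 @ g2)) * g2
    return g1

g_id = np.array([1.0, 1.0])      # gradient of the target-identity objective
g_expr = np.array([-1.0, 0.0])   # expression gradient, conflicting along x
g_proj = project_conflicting(g_id, g_expr)
```

After projection the identity update no longer pushes against the expression objective, which is what lets both converge to a shared local optimum.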
cs.CV / 27 / 2603.03681
EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs
EvoPrune:用于高效多模态大语言模型的早期视觉标记剪枝
Abstract
Multimodal Large Language Models (MLLMs) have shown strong performance in vision-language tasks, but their inference efficiency is severely limited by the exponential growth of visual tokens in complex scenarios such as high-resolution images and videos. Existing visual token pruning methods mainly operate after visual encoding, overlooking the substantial computational cost incurred during the encoding stage. To address this issue, we propose EvoPrune, an early-stage visual token pruning method for MLLMs that performs pruning directly during visual encoding. Specifically, EvoPrune employs a layer-wise pruning strategy guided by token similarity, diversity, and attention-based importance to retain the most informative visual tokens at selected encoding layers. Extensive experiments on image and video benchmarks validate the effectiveness of EvoPrune. In particular, on the VideoMME dataset, EvoPrune achieves 2$\times$ inference speedup with less than 1% performance degradation, demonstrating its potential for latency-sensitive MLLM deployment.
Chinese Translation
多模态大语言模型(MLLMs)在视觉-语言任务中表现出色,但在高分辨率图像和视频等复杂场景中,视觉标记的指数增长严重限制了其推理效率。现有的视觉标记剪枝方法主要在视觉编码之后进行,忽视了编码阶段所产生的巨大的计算成本。为了解决这一问题,我们提出了EvoPrune,一种针对MLLMs的早期视觉标记剪枝方法,该方法在视觉编码过程中直接进行剪枝。具体而言,EvoPrune采用基于标记相似性、多样性和基于注意力的重要性指导的逐层剪枝策略,以保留在选定编码层中最具信息量的视觉标记。在图像和视频基准测试上的大量实验验证了EvoPrune的有效性。特别是在VideoMME数据集上,EvoPrune实现了2倍的推理加速,性能下降不到1%,展示了其在延迟敏感的MLLM部署中的潜力。
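A rough sketch of layer-wise token pruning guided by attention importance and diversity is shown below; the scoring weights and the diversity proxy (dissimilarity to the mean token direction) are illustrative, and EvoPrune's actual criteria differ in detail:

```python
import numpy as np

def prune_tokens(tokens, attn_scores, keep_ratio=0.5, alpha=0.5):
    """Score each visual token by a mix of attention-based importance and
    diversity (dissimilarity to the mean token direction), then keep the
    top-k tokens in their original order."""
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    mean_dir = t.mean(axis=0)
    mean_dir = mean_dir / np.linalg.norm(mean_dir)
    diversity = 1.0 - t @ mean_dir          # redundant tokens score low
    score = alpha * attn_scores + (1 - alpha) * diversity
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(-score)[:k])  # top-k, restored to input order
    return tokens[keep], keep

rng = np.random.default_rng(1)
tokens = rng.normal(size=(8, 16))   # 8 visual tokens from an encoder layer
attn = rng.random(8)                # per-token attention importance
kept, idx = prune_tokens(tokens, attn, keep_ratio=0.5)
```

Applying such a step at selected encoder layers, rather than after encoding, is what saves the encoding-stage compute the abstract highlights.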
cs.CV / 28 / 2603.03692
Error as Signal: Stiffness-Aware Diffusion Sampling via Embedded Runge-Kutta Guidance
错误作为信号:通过嵌入式龙格-库塔引导的刚度感知扩散采样
Abstract
Classifier-Free Guidance (CFG) has established the foundation for guidance mechanisms in diffusion models, showing that well-designed guidance proxies significantly improve conditional generation and sample quality. Autoguidance (AG) has extended this idea, but it relies on an auxiliary network and leaves solver-induced errors unaddressed. In stiff regions, the ODE trajectory changes sharply, where local truncation error (LTE) becomes a critical factor that deteriorates sample quality. Our key observation is that these errors align with the dominant eigenvector, motivating us to leverage the solver-induced error as a guidance signal. We propose Embedded Runge-Kutta Guidance (ERK-Guid), which exploits detected stiffness to reduce LTE and stabilize sampling. We theoretically and empirically analyze stiffness and eigenvector estimators with solver errors to motivate the design of ERK-Guid. Our experiments on both synthetic datasets and the popular benchmark dataset, ImageNet, demonstrate that ERK-Guid consistently outperforms state-of-the-art methods. Code is available at https://github.com/mlvlab/ERK-Guid.
Chinese Translation
无分类器引导(Classifier-Free Guidance, CFG)为扩散模型中的引导机制奠定了基础,表明精心设计的引导代理显著改善了条件生成和样本质量。自引导(Autoguidance, AG)扩展了这一理念,但它依赖于辅助网络,并未解决求解器引起的误差。在刚性区域,常微分方程(ODE)轨迹变化剧烈,局部截断误差(Local Truncation Error, LTE)成为影响样本质量的关键因素。我们的关键观察是,这些误差与主特征向量对齐,这促使我们利用求解器引起的误差作为引导信号。我们提出了嵌入式龙格-库塔引导(Embedded Runge-Kutta Guidance, ERK-Guid),利用检测到的刚度来减少LTE并稳定采样。我们从理论和实证上分析了刚度和特征向量估计器与求解器误差,以推动ERK-Guid的设计。我们在合成数据集和流行基准数据集ImageNet上的实验表明,ERK-Guid始终优于最先进的方法。代码可在https://github.com/mlvlab/ERK-Guid获取。
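The core device of an embedded Runge-Kutta pair is that it yields a local truncation error (LTE) estimate essentially for free: the gap between a higher- and lower-order solution sharing the same stage evaluations. The sketch below uses the minimal Heun/Euler pair on a stiff test ODE; it illustrates the error signal ERK-Guid builds on, not the paper's actual guidance rule, and `heun_step_with_lte` is a hypothetical helper.

```python
import numpy as np

def heun_step_with_lte(f, x, t, dt):
    """One step of the embedded Heun/Euler pair.

    Returns the 2nd-order Heun update plus a local-truncation-error
    estimate: the difference between the Heun and Euler solutions.
    A large LTE flags a stiff region of the trajectory.
    """
    k1 = f(x, t)
    k2 = f(x + dt * k1, t + dt)
    x_euler = x + dt * k1                  # 1st-order solution
    x_heun = x + dt * 0.5 * (k1 + k2)      # 2nd-order solution (same stages)
    lte = x_heun - x_euler                 # embedded error estimate
    return x_heun, lte

# Stiff linear ODE dx/dt = -50 x: the LTE estimate grows with stiffness.
f = lambda x, t: -50.0 * x
x, lte = heun_step_with_lte(f, np.array([1.0]), 0.0, 0.01)
print(float(x[0]), float(abs(lte[0])))  # 0.625 0.125
```

In a diffusion sampler, `f` would be the probability-flow ODE drift, and the per-step `lte` would be the solver-induced error the paper repurposes as a guidance signal.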
cs.CV / 29 / 2603.03710
MPFlow: Multi-modal Posterior-Guided Flow Matching for Zero-Shot MRI Reconstruction
MPFlow:用于零样本 MRI 重建的多模态后验引导流匹配
Abstract
Zero-shot MRI reconstruction relies on generative priors, but single-modality unconditional priors produce hallucinations under severe ill-posedness. In many clinical workflows, complementary MRI acquisitions (e.g. high-quality structural scans) are routinely available, yet existing reconstruction methods lack mechanisms to leverage this additional information. We propose MPFlow, a zero-shot multi-modal reconstruction framework built on rectified flow that incorporates auxiliary MRI modalities at inference time without retraining the generative prior to improve anatomical fidelity. Cross-modal guidance is enabled by our proposed self-supervised pretraining strategy, Patch-level Multi-modal MR Image Pretraining (PAMRI), which learns shared representations across modalities. Sampling is jointly guided by data consistency and cross-modal feature alignment using pre-trained PAMRI, systematically suppressing intrinsic and extrinsic hallucinations. Extensive experiments on HCP and BraTS show that MPFlow matches diffusion baselines on image quality using only 20% of sampling steps while reducing tumor hallucinations by more than 15% (segmentation dice score). This demonstrates that cross-modal guidance enables more reliable and efficient zero-shot MRI reconstruction.
Chinese Translation
零样本 MRI 重建依赖于生成先验,但单模态无条件先验在严重不适定情况下会产生幻觉。在许多临床工作流程中,互补的 MRI 采集(例如高质量结构扫描)是常规可用的,但现有的重建方法缺乏利用这些额外信息的机制。我们提出了 MPFlow,这是一种基于校正流的零样本多模态重建框架,在推理时结合辅助 MRI 模态,而无需重新训练生成先验,以提高解剖学的真实性。我们的自监督预训练策略 Patch-level Multi-modal MR Image Pretraining (PAMRI) 使得跨模态引导成为可能,该策略学习跨模态的共享表示。采样通过数据一致性和使用预训练 PAMRI 的跨模态特征对齐共同引导,系统性地抑制内在和外在的幻觉。在 HCP 和 BraTS 上的广泛实验表明,MPFlow 在图像质量上与扩散基线相匹配,仅使用 20% 的采样步骤,同时将肿瘤幻觉减少超过 15%(分割 Dice 分数)。这表明跨模态引导使得零样本 MRI 重建更加可靠和高效。
cs.CV / 30 / 2603.03711
LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing
LDP-Slicing:通过随机位平面切片实现图像的局部差分隐私
Abstract
Local Differential Privacy (LDP) is the gold standard trust model for privacy-preserving machine learning by guaranteeing privacy at the data source. However, its application to image data has long been considered impractical due to the high dimensionality of pixel space. Canonical LDP mechanisms are designed for low-dimensional data, resulting in severe utility degradation when applied to high-dimensional pixel spaces. This paper demonstrates that this utility loss is not inherent to LDP, but from its application to an inappropriate data representation. We introduce LDP-Slicing, a lightweight, training-free framework that resolves this domain mismatch. Our key insight is to decompose pixel values into a sequence of binary bit-planes. This transformation allows us to apply the LDP mechanism directly to the bit-level representation. To further strengthen privacy and preserve utility, we integrate a perceptual obfuscation module that mitigates human-perceivable leakage and an optimization-based privacy budget allocation strategy. This pipeline satisfies rigorous pixel-level $\varepsilon$-LDP while producing images that retain high utility for downstream tasks. Extensive experiments on face recognition and image classification demonstrate that LDP-Slicing outperforms existing DP/LDP baselines under comparable privacy budgets, with negligible computational overhead.
Chinese Translation
局部差分隐私(LDP)是隐私保护机器学习的金标准信任模型,通过在数据源处保证隐私。然而,由于像素空间的高维性,其在图像数据上的应用长期以来被认为不切实际。经典的LDP机制是为低维数据设计的,因此在应用于高维像素空间时会导致严重的效用下降。本文证明,这种效用损失并非LDP固有的,而是由于其应用于不适当的数据表示。我们提出了LDP-Slicing,这是一种轻量级、无训练的框架,解决了这一领域不匹配的问题。我们的关键见解是将像素值分解为一系列二进制位平面。这一转换使我们能够直接将LDP机制应用于位级表示。为了进一步增强隐私并保持效用,我们集成了一个感知混淆模块,以减轻人类可感知的泄漏,以及基于优化的隐私预算分配策略。该流程在产生高效用图像的同时满足严格的像素级$\varepsilon$-LDP。针对人脸识别和图像分类的广泛实验表明,LDP-Slicing在可比隐私预算下优于现有的DP/LDP基线,且计算开销微乎其微。
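The bit-plane mechanism is concrete enough to sketch: slice an 8-bit pixel into binary planes, apply randomized response (the canonical single-bit LDP mechanism) to each plane, and reassemble. The uniform per-plane budget split below is a simplification; the paper uses an optimization-based allocation, and `privatize_pixel` is a hypothetical helper.

```python
import math
import random

def randomized_response(bit, eps):
    """Keep a bit with prob e^eps/(e^eps+1), else flip it: eps-LDP for one bit."""
    p_keep = math.exp(eps) / (math.exp(eps) + 1.0)
    return bit if random.random() < p_keep else 1 - bit

def privatize_pixel(value, eps_total):
    """Slice an 8-bit pixel into bit-planes and privatize each plane.

    Splitting eps_total evenly over the 8 planes gives eps_total-LDP for
    the whole pixel by sequential composition. (Uniform allocation is a
    simplification of the paper's optimized budget allocation.)
    """
    eps_bit = eps_total / 8.0
    bits = [(value >> k) & 1 for k in range(8)]        # bit-plane slicing
    noisy = [randomized_response(b, eps_bit) for b in bits]
    return sum(b << k for k, b in enumerate(noisy))    # reassemble the pixel

random.seed(0)
print(privatize_pixel(200, eps_total=8.0))  # a noisy value in [0, 255]
```

Note the domain-mismatch point the abstract makes: each randomized-response call acts on a 2-element domain, where LDP noise is mild, instead of on the 256-element pixel domain, where canonical mechanisms destroy utility.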
cs.CV / 31 / 2603.03718
Glass Segmentation with Fusion of Learned and General Visual Features
融合学习与通用视觉特征的玻璃分割
Abstract
Glass surface segmentation from RGB images is a challenging task, since glass as a transparent material distinctly lacks visual characteristics. However, glass segmentation is critical for scene understanding and robotics, as transparent glass surfaces must be identified as solid material. This paper presents a novel architecture for glass segmentation, deploying a dual-backbone design that produces general visual features as well as task-specific learned visual features. General visual features are produced by a frozen DINOv3 vision foundation model, and the task-specific features are generated with a Swin model trained in a supervised manner. The resulting multi-scale feature representations are downsampled with residual Squeeze-and-Excitation Channel Reduction, and fed into a Mask2Former Decoder, producing the final segmentation masks. The architecture was evaluated on four commonly used glass segmentation datasets, achieving state-of-the-art results on several accuracy metrics. The model also has a competitive inference speed compared to the previous state-of-the-art method, and surpasses it when using a lighter DINOv3 backbone variant. The implementation source code and model weights are available at: https://github.com/ojalar/lgnet
Chinese Translation
从RGB图像中进行玻璃表面分割是一项具有挑战性的任务,因为玻璃作为一种透明材料明显缺乏视觉特征。然而,玻璃分割对于场景理解和机器人技术至关重要,因为透明玻璃表面必须被识别为固体材料。本文提出了一种新颖的玻璃分割架构,采用双主干网络生成通用视觉特征和任务特定的学习视觉特征。通用视觉特征由冻结的DINOv3视觉基础模型生成,而任务特定特征则通过以监督方式训练的Swin模型生成。生成的多尺度特征表示经过残差Squeeze-and-Excitation通道缩减进行下采样,并输入到Mask2Former解码器中,生成最终的分割掩膜。该架构在四个常用的玻璃分割数据集上进行了评估,在多个准确性指标上取得了最先进的结果。与之前的最先进方法相比,该模型在推理速度上也具有竞争力,并在使用更轻的DINOv3主干变体时超越了它。实现的源代码和模型权重可在以下链接获取:https://github.com/ojalar/lgnet
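The residual Squeeze-and-Excitation channel reduction step has a standard shape: global-average-pool to a channel descriptor, excite through a small bottleneck MLP with a sigmoid gate, reweight channels residually, then project down with a 1x1 convolution. The numpy sketch below shows that standard pattern; the paper's exact block (reduction ratio, normalization, activation placement) may differ, and all weight names are illustrative.

```python
import numpy as np

def se_channel_reduction(x, w1, w2, w_proj):
    """Residual Squeeze-and-Excitation followed by 1x1 channel reduction.

    x: (C, H, W) feature map. w1: (C//r, C) and w2: (C, C//r) are the
    squeeze/excite weights; w_proj: (C_out, C) is the 1x1 projection.
    """
    squeeze = x.mean(axis=(1, 2))                  # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)         # bottleneck + ReLU
    excite = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid channel gate
    gated = x * excite[:, None, None] + x          # residual channel reweighting
    # A 1x1 convolution is a per-pixel linear map over channels.
    return np.einsum('oc,chw->ohw', w_proj, gated)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8, 8))
w1 = rng.normal(size=(4, 16))
w2 = rng.normal(size=(16, 4))
w_proj = rng.normal(size=(8, 16))
print(se_channel_reduction(x, w1, w2, w_proj).shape)  # (8, 8, 8)
```

The point of such a block here is to compress the concatenated DINOv3 + Swin features to a decoder-friendly width while letting the gate emphasize whichever backbone's channels are informative for a given scene.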
cs.CV / 32 / 2603.03726
QD-PCQA: Quality-Aware Domain Adaptation for Point Cloud Quality Assessment
QD-PCQA:面向点云质量评估的质量感知领域适应
Abstract
No-Reference Point Cloud Quality Assessment (NR-PCQA) still struggles with generalization, primarily due to the scarcity of annotated point cloud datasets. Since the Human Visual System (HVS) drives perceptual quality assessment independently of media types, prior knowledge on quality learned from images can be repurposed for point clouds. This insight motivates adopting Unsupervised Domain Adaptation (UDA) to transfer quality-relevant priors from labeled images to unlabeled point clouds. However, existing UDA-based PCQA methods often overlook key characteristics of perceptual quality, such as sensitivity to quality ranking and quality-aware feature alignment, thereby limiting their effectiveness. To address these issues, we propose a novel Quality-aware Domain adaptation framework for PCQA, termed QD-PCQA. The framework comprises two main components: i) a Rank-weighted Conditional Alignment (RCA) strategy that aligns features under consistent quality levels and adaptively emphasizes misranked samples to reinforce perceptual quality ranking awareness; and ii) a Quality-guided Feature Augmentation (QFA) strategy, which includes quality-guided style mixup, multi-layer extension, and dual-domain augmentation modules to augment perceptual feature alignment. Extensive cross-domain experiments demonstrate that QD-PCQA significantly improves generalization in NR-PCQA tasks. The code is available at https://github.com/huhu-code/QD-PCQA.
Chinese Translation
无参考点云质量评估(NR-PCQA)在泛化能力上仍然面临挑战,主要原因是标注点云数据集的稀缺。由于人类视觉系统(HVS)独立于媒体类型驱动感知质量评估,因此从图像中学习到的质量先验知识可以被重新用于点云。这一见解促使我们采用无监督领域适应(UDA)来将与质量相关的先验从标记图像转移到未标记的点云。然而,现有基于UDA的PCQA方法往往忽视了感知质量的关键特征,例如对质量排名的敏感性和质量感知特征对齐,从而限制了其有效性。为了解决这些问题,我们提出了一种新颖的面向PCQA的质量感知领域适应框架,称为QD-PCQA。该框架包括两个主要组件:i)一种排名加权条件对齐(RCA)策略,该策略在一致的质量水平下对齐特征,并自适应地强调错误排名的样本,以增强感知质量排名意识;ii)一种质量引导特征增强(QFA)策略,其中包括质量引导的风格混合、多层扩展和双领域增强模块,以增强感知特征对齐。广泛的跨领域实验表明,QD-PCQA显著提高了NR-PCQA任务中的泛化能力。代码可在 https://github.com/huhu-code/QD-PCQA 获取。
cs.CV / 33 / 2603.03739
PROSPECT: Unified Streaming Vision-Language Navigation via Semantic--Spatial Fusion and Latent Predictive Representation
PROSPECT:通过语义-空间融合和潜在预测表示实现统一的流式视觉-语言导航
Abstract
Multimodal large language models (MLLMs) have advanced zero-shot end-to-end Vision-Language Navigation (VLN), yet robust navigation requires not only semantic understanding but also predictive modeling of environment dynamics and spatial structure. We propose PROSPECT, a unified streaming navigation agent that couples a streaming Vision-Language-Action (VLA) policy with latent predictive representation learning. PROSPECT uses CUT3R as a streaming 3D foundation spatial encoder to produce long-context, absolute-scale spatial features, and fuses them with SigLIP semantic features via cross-attention. During training, we introduce learnable stream query tokens that query the streaming context and predict next-step 2D and 3D latent features (rather than pixels or explicit modalities), supervised in the latent spaces of frozen SigLIP and CUT3R teachers. The predictive branch shapes internal representations without inference overhead. Experiments on VLN-CE benchmarks and real-robot deployment demonstrate state-of-the-art performance and improved long-horizon robustness under diverse lighting. We will release code for the community soon.
Chinese Translation
多模态大型语言模型(MLLMs)在零样本端到端视觉-语言导航(VLN)方面取得了进展,但稳健的导航不仅需要语义理解,还需要对环境动态和空间结构的预测建模。我们提出了 PROSPECT,一种统一的流式导航代理,它将流式视觉-语言-动作(VLA)策略与潜在预测表示学习相结合。PROSPECT 使用 CUT3R 作为流式 3D 基础空间编码器,以生成长上下文、绝对尺度的空间特征,并通过交叉注意力将其与 SigLIP 语义特征融合。在训练过程中,我们引入了可学习的流查询标记,这些标记查询流式上下文并预测下一步的 2D 和 3D 潜在特征(而不是像素或显式模态),并在冻结的 SigLIP 和 CUT3R 教师的潜在空间中进行监督。预测分支在没有推理开销的情况下塑造内部表示。在 VLN-CE 基准和真实机器人部署上的实验表明,PROSPECT 在多种光照条件下展现了最先进的性能和改进的长时程鲁棒性。我们将很快为社区发布代码。
cs.CV / 34 / 2603.03744
DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation
DAGE:用于高效细粒度几何估计的双流架构
Abstract
Estimating accurate, view-consistent geometry and camera poses from uncalibrated multi-view/video inputs remains challenging - especially at high spatial resolutions and over long sequences. We present DAGE, a dual-stream transformer whose main novelty is to disentangle global coherence from fine detail. A low-resolution stream operates on aggressively downsampled frames with alternating frame/global attention to build a view-consistent representation and estimate cameras efficiently, while a high-resolution stream processes the original images per-frame to preserve sharp boundaries and small structures. A lightweight adapter fuses these streams via cross-attention, injecting global context without disturbing the pretrained single-frame pathway. This design scales resolution and clip length independently, supports inputs up to 2K, and maintains practical inference cost. DAGE delivers sharp depth/pointmaps, strong cross-view consistency, and accurate poses, establishing new state-of-the-art results for video geometry estimation and multi-view reconstruction.
Chinese Translation
从未校准的多视角/视频输入中估计准确的、视图一致的几何形状和相机姿态仍然具有挑战性,尤其是在高空间分辨率和长序列的情况下。我们提出了DAGE,一种双流变换器,其主要创新在于将全局一致性与细节分离。低分辨率流在经过大幅下采样的帧上操作,采用交替的帧/全局注意力机制,以构建视图一致的表示并高效地估计相机,而高分辨率流则逐帧处理原始图像,以保留清晰的边界和小结构。一个轻量级适配器通过交叉注意力融合这些流,注入全局上下文而不干扰预训练的单帧路径。该设计独立地扩展分辨率和剪辑长度,支持高达2K的输入,并保持实用的推理成本。DAGE提供清晰的深度/点图、强大的跨视图一致性和准确的姿态,为视频几何估计和多视角重建建立了新的最先进结果。
cs.CV / 35 / 2603.03749
WSI-INR: Implicit Neural Representations for Lesion Segmentation in Whole-Slide Images
WSI-INR:用于全切片图像病变分割的隐式神经表示
Abstract
Whole-slide images (WSIs) are fundamental for computational pathology, where accurate lesion segmentation is critical for clinical decision making. Existing methods partition WSIs into discrete patches, disrupting spatial continuity and treating multi-resolution views as independent samples, which leads to spatially fragmented segmentation and reduced robustness to resolution variations. To address the issues, we propose WSI-INR, a novel patch-free framework based on Implicit Neural Representations (INRs). WSI-INR models the WSI as a continuous implicit function mapping spatial coordinates directly to tissue semantics features, outputting segmentation results while preserving intrinsic spatial information across the entire slide. In the WSI-INR, we incorporate multi-resolution hash grid encoding to regard different resolution levels as varying sampling densities of the same continuous tissue, achieving a consistent feature representation across resolutions. In addition, by jointly training a shared INR decoder, WSI-INR can capture general priors across different cases. Experimental results showed that WSI-INR maintains robust segmentation performance across resolutions; at Base/4, our resolution-specific optimization improves Dice score by +26.11%, while U-Net and TransUNet decrease by 54.28% and 36.18%, respectively. Crucially, this work enables INRs to segment highly heterogeneous pathological lesions beyond structurally consistent anatomical tissues, offering a fresh perspective for pathological analysis.
Chinese Translation
全切片图像(WSIs)是计算病理学的基础,其中准确的病变分割对临床决策至关重要。现有方法将WSIs划分为离散的补丁,破坏了空间连续性,并将多分辨率视图视为独立样本,这导致空间上分散的分割和对分辨率变化的鲁棒性降低。为了解决这些问题,我们提出了WSI-INR,一种基于隐式神经表示(INRs)的新型无补丁框架。WSI-INR将WSI建模为一个连续的隐式函数,直接将空间坐标映射到组织语义特征,输出分割结果,同时保留整张切片的内在空间信息。在WSI-INR中,我们结合了多分辨率哈希网格编码,将不同的分辨率级别视为同一连续组织的不同采样密度,从而在不同分辨率之间实现一致的特征表示。此外,通过共同训练一个共享的INR解码器,WSI-INR能够捕捉不同案例之间的一般先验。实验结果表明,WSI-INR在不同分辨率下保持了稳健的分割性能;在Base/4下,我们的分辨率特定优化使Dice得分提高了26.11%,而U-Net和TransUNet分别下降了54.28%和36.18%。重要的是,这项工作使INRs能够分割高度异质的病理病变,超越结构一致的解剖组织,为病理分析提供了新的视角。
cs.CV / 36 / 2603.03762
Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding
像专家一样观察:一种知识增强的开放集细粒度视觉理解代理
Abstract
Fine-grained visual understanding is shifting from static classification to knowledge-augmented reasoning, where models must justify as well as recognise. Existing approaches remain limited by closed-set taxonomies and single-label prediction, leading to significant degradation under open-set or context-dependent conditions. We present the Knowledge-Augmented Fine-Grained Reasoning Agent (KFRA), a unified framework that transforms fine-grained perception into evidence-driven reasoning. KFRA operates through a three-stage closed reasoning loop that emulates expert analysis. It first performs open-vocabulary detection and web-scale retrieval to generate category hypotheses. It then conducts discriminative region localisation by aligning textual knowledge with visual evidence through a global-to-local focusing mechanism. Finally, it integrates all multimodal evidence within a large multimodal model to perform interpretable reasoning. Unlike existing agents that treat retrieval and reasoning as independent processes, KFRA establishes a retrieval-grounding coupling that converts retrieved knowledge into spatially grounded evidence for verification. This design enables factual, interpretable, and task-agnostic reasoning across diverse fine-grained scenarios. To evaluate this capability, we construct FGExpertBench, a benchmark designed to assess reasoning depth and cross-task generalisation across six knowledge dimensions. Extensive experiments demonstrate that KFRA consistently surpasses both standalone large multimodal models and current agent frameworks, achieving up to 19 percent improvement in reasoning accuracy and delivering evidence-grounded interpretability in open-set fine-grained visual understanding.
Chinese Translation
细粒度视觉理解正从静态分类转向知识增强的推理,其中模型不仅需要识别,还需要提供理由。现有的方法受到封闭集分类法和单标签预测的限制,在开放集或依赖上下文的条件下表现显著下降。我们提出了知识增强细粒度推理代理(Knowledge-Augmented Fine-Grained Reasoning Agent, KFRA),这是一个将细粒度感知转化为基于证据的推理的统一框架。KFRA通过一个三阶段的封闭推理循环来模拟专家分析。它首先执行开放词汇检测和网络规模检索,以生成类别假设。接着,通过全局到局部的聚焦机制,将文本知识与视觉证据对齐,从而进行区分性区域定位。最后,它在一个大型多模态模型中整合所有多模态证据,以进行可解释的推理。与现有代理将检索和推理视为独立过程不同,KFRA建立了检索与基础的耦合,将检索到的知识转化为空间上有依据的证据以进行验证。这一设计使得在多样的细粒度场景中能够进行事实性、可解释和任务无关的推理。为了评估这一能力,我们构建了FGExpertBench,这是一个旨在评估推理深度和跨任务泛化的基准,涵盖六个知识维度。大量实验表明,KFRA始终超越独立的大型多模态模型和当前的代理框架,在推理准确性上提高了多达19个百分点,并在开放集细粒度视觉理解中提供了基于证据的可解释性。
cs.CV / 37 / 2603.03765
LiDAR Prompted Spatio-Temporal Multi-View Stereo for Autonomous Driving
激光雷达提示的时空多视图立体视觉用于自动驾驶
Abstract
Accurate metric depth is critical for autonomous driving perception and simulation, yet current approaches struggle to achieve high metric accuracy, multi-view and temporal consistency, and cross-domain generalization. To address these challenges, we present DriveMVS, a novel multi-view stereo framework that reconciles these competing objectives through two key insights: (1) Sparse but metrically accurate LiDAR observations can serve as geometric prompts to anchor depth estimation in absolute scale, and (2) deep fusion of diverse cues is essential for resolving ambiguities and enhancing robustness, while a spatio-temporal decoder ensures consistency across frames. Built upon these principles, DriveMVS embeds the LiDAR prompt in two ways: as a hard geometric prior that anchors the cost volume, and as soft feature-wise guidance fused by a triple-cue combiner. Regarding temporal consistency, DriveMVS employs a spatio-temporal decoder that jointly leverages geometric cues from the MVS cost volume and temporal context from neighboring frames. Experiments show that DriveMVS achieves state-of-the-art performance on multiple benchmarks, excelling in metric accuracy, temporal stability, and zero-shot cross-domain transfer, demonstrating its practical value for scalable, reliable autonomous driving systems.
Chinese Translation
准确的度量深度对于自动驾驶感知和仿真至关重要,但当前的方法在实现高度量精度、多视图和时间一致性以及跨领域泛化方面面临挑战。为了解决这些问题,我们提出了DriveMVS,这是一种新颖的多视图立体视觉框架,通过两个关键见解调和这些相互竞争的目标:(1) 稀疏但度量准确的激光雷达观测可以作为几何提示,以在绝对尺度上锚定深度估计;(2) 深度融合多种线索对于解决歧义和增强鲁棒性至关重要,同时时空解码器确保帧间一致性。基于这些原则,DriveMVS以两种方式嵌入激光雷达提示:作为锚定成本体积的硬几何先验,以及通过三重线索组合器融合的软特征指导。关于时间一致性,DriveMVS采用时空解码器,联合利用来自MVS成本体积的几何线索和来自相邻帧的时间上下文。实验表明,DriveMVS在多个基准测试中实现了最先进的性能,在度量精度、时间稳定性和零样本跨领域迁移方面表现出色,展示了其在可扩展、可靠的自动驾驶系统中的实际价值。
cs.CV / 38 / 2603.03769
DMD-augmented Unpaired Neural Schr\"odinger Bridge for Ultra-Low Field MRI Enhancement
DMD增强的无配对神经薛定谔桥用于超低场MRI增强
Abstract
Ultra Low Field (64 mT) brain MRI improves accessibility but suffers from reduced image quality compared to 3 T. As paired 64 mT - 3 T scans are scarce, we propose an unpaired 64 mT $\rightarrow$ 3 T translation framework that enhances realism while preserving anatomy. Our method builds upon the Unpaired Neural Schr\"odinger Bridge (UNSB) with multi-step refinement. To strengthen target distribution alignment, we augment the adversarial objective with DMD2-style diffusion-guided distribution matching using a frozen 3T diffusion teacher. To explicitly constrain global structure beyond patch-level correspondence, we combine PatchNCE with an Anatomical Structure Preservation (ASP) regularizer that enforces soft foreground--background consistency and boundary-aware constraints. Evaluated on two disjoint cohorts, the proposed framework achieves an improved realism--structure trade-off, enhancing distribution-level realism on unpaired benchmarks while increasing structural fidelity on the paired cohort compared to unpaired baselines.
Chinese Translation
超低场(64 mT)脑部MRI提高了可及性,但与3 T相比,图像质量有所下降。由于配对的64 mT - 3 T扫描稀缺,我们提出了一种无配对的64 mT $\rightarrow$ 3 T转换框架,该框架在保留解剖结构的同时增强了真实感。我们的方法基于无配对神经薛定谔桥(UNSB)并进行多步细化。为了增强目标分布的对齐,我们使用冻结的3T扩散教师,结合DMD2风格的扩散引导分布匹配,增强对抗目标。为了明确约束超越补丁级别对应的全局结构,我们将PatchNCE与解剖结构保持(ASP)正则化器结合,该正则化器强制执行软前景背景一致性和边界感知约束。在两个不相交的队列上进行评估时,所提出的框架实现了更好的真实感与结构的权衡,增强了在无配对基准上的分布级别真实感,同时在配对队列上提高了结构保真度,相较于无配对基线表现更佳。
cs.CV / 39 / 2603.03788
Small Object Detection in Complex Backgrounds with Multi-Scale Attention and Global Relation Modeling
复杂背景下的小物体检测:多尺度注意力与全局关系建模
Abstract
Small object detection under complex backgrounds remains a challenging task due to severe feature degradation, weak semantic representation, and inaccurate localization caused by downsampling operations and background interference. Existing detection frameworks are mainly designed for general objects and often fail to explicitly address the unique characteristics of small objects, such as limited structural cues and strong sensitivity to localization errors. In this paper, we propose a multi-level feature enhancement and global relation modeling framework tailored for small object detection. Specifically, a Residual Haar Wavelet Downsampling module is introduced to preserve fine-grained structural details by jointly exploiting spatial-domain convolutional features and frequency-domain representations. To enhance global semantic awareness and suppress background noise, a Global Relation Modeling module is employed to capture long-range dependencies at high-level feature stages. Furthermore, a Cross-Scale Hybrid Attention module is designed to establish sparse and aligned interactions across multi-scale features, enabling effective fusion of high-resolution details and high-level semantic information with reduced computational overhead. Finally, a Center-Assisted Loss is incorporated to stabilize training and improve localization accuracy for small objects. Extensive experiments conducted on the large-scale RGBT-Tiny benchmark demonstrate that the proposed method consistently outperforms existing state-of-the-art detectors under both IoU-based and scale-adaptive evaluation metrics. These results validate the effectiveness and robustness of the proposed framework for small object detection in complex environments.
Chinese Translation
在复杂背景下的小物体检测仍然是一项具有挑战性的任务,这主要是由于下采样操作和背景干扰导致的特征严重退化、语义表示薄弱和定位不准确。现有的检测框架主要针对一般物体设计,往往未能明确解决小物体的独特特征,例如有限的结构线索和对定位误差的强敏感性。本文提出了一种针对小物体检测的多级特征增强与全局关系建模框架。具体而言,引入了一种残差哈尔小波下采样模块,通过联合利用空间域卷积特征和频域表示来保留细粒度的结构细节。为了增强全局语义意识并抑制背景噪声,采用全局关系建模模块以捕捉高层特征阶段的长距离依赖。此外,设计了一种跨尺度混合注意力模块,以建立多尺度特征之间稀疏且对齐的交互,从而有效融合高分辨率细节与高层语义信息,同时减少计算开销。最后,结合中心辅助损失以稳定训练并提高小物体的定位精度。在大规模 RGBT-Tiny 基准上进行的广泛实验表明,所提方法在基于 IoU 和尺度自适应评估指标下均持续优于现有的最先进检测器。这些结果验证了所提框架在复杂环境中进行小物体检测的有效性和鲁棒性。
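The key property of Haar wavelet downsampling is that halving spatial resolution loses no information: a single-level 2D Haar transform turns an (H, W) map into four (H/2, W/2) subbands stacked as channels, with the high-frequency subbands carrying exactly the fine detail a strided convolution would discard. A minimal sketch (the paper's module additionally has a residual convolutional branch, omitted here; `haar_downsample` is a hypothetical helper):

```python
import numpy as np

def haar_downsample(x):
    """Single-level 2D Haar decomposition of an (H, W) map.

    Returns the 4 half-resolution subbands (LL, LH, HL, HH) stacked as
    channels. The transform is invertible, so downsampling this way
    preserves small-object structure.
    """
    a = x[0::2, 0::2]; b = x[0::2, 1::2]   # 2x2 neighborhoods
    c = x[1::2, 0::2]; d = x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0             # low-frequency average
    lh = (a - b + c - d) / 2.0             # horizontal detail
    hl = (a + b - c - d) / 2.0             # vertical detail
    hh = (a - b - c + d) / 2.0             # diagonal detail
    return np.stack([ll, lh, hl, hh])

x = np.arange(16.0).reshape(4, 4)
out = haar_downsample(x)
print(out.shape)  # (4, 2, 2)
```

Invertibility is easy to check: summing the four subbands recovers (twice) the top-left pixel of each 2x2 block, and the other three pixels follow from the remaining sign patterns.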
cs.CV / 40 / 2603.03792
TAP: A Token-Adaptive Predictor Framework for Training-Free Diffusion Acceleration
TAP:用于无训练扩散加速的令牌自适应预测器框架
Abstract
Diffusion models achieve strong generative performance but remain slow at inference due to the need for repeated full-model denoising passes. We present Token-Adaptive Predictor (TAP), a training-free, probe-driven framework that adaptively selects a predictor for each token at every sampling step. TAP uses a single full evaluation of the model's first layer as a low-cost probe to compute proxy losses for a compact family of candidate predictors (instantiated primarily with Taylor expansions of varying order and horizon), then assigns each token the predictor with the smallest proxy error. This per-token "probe-then-select" strategy exploits heterogeneous temporal dynamics, requires no additional training, and is compatible with various predictor designs. TAP incurs negligible overhead while enabling large speedups with little or no perceptual quality loss. Extensive experiments across multiple diffusion architectures and generation tasks show that TAP substantially improves the accuracy-efficiency frontier compared to fixed global predictors and caching-only baselines.
Chinese Translation
扩散模型在生成性能上表现出色,但由于需要重复进行全模型去噪过程,推理速度依然较慢。我们提出了令牌自适应预测器(Token-Adaptive Predictor,TAP),这是一种无训练的、基于探测的框架,能够在每个采样步骤中自适应地为每个令牌选择预测器。TAP使用模型第一层的单次完整评估作为低成本探测,计算一组紧凑候选预测器的代理损失(主要通过不同阶数和视野的泰勒展开实例化),然后为每个令牌分配具有最小代理误差的预测器。这种按令牌的“探测-选择”策略利用了异质的时间动态,不需要额外的训练,并且与各种预测器设计兼容。TAP的开销微乎其微,同时在几乎没有感知质量损失的情况下实现了显著的加速。在多个扩散架构和生成任务上的大量实验表明,与固定全局预测器和仅缓存的基线相比,TAP显著改善了准确性与效率的平衡。
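The per-token "probe-then-select" loop can be sketched directly: build a small family of Taylor-style predictors from cached feature history, score each against a cheap probe, and pick the per-token winner. The probe below is a stand-in array (in the paper it is the model's first-layer evaluation), and `taylor_predictors` / `probe_then_select` are hypothetical helpers.

```python
import numpy as np

def taylor_predictors(h):
    """Candidate predictors from a feature history h: (T, N, D), newest last.

    Finite-difference Taylor extrapolations of increasing order.
    """
    return {
        'order0': h[-1],                          # cache / zero-order hold
        'order1': 2 * h[-1] - h[-2],              # linear extrapolation
        'order2': 3 * h[-1] - 3 * h[-2] + h[-3],  # quadratic extrapolation
    }

def probe_then_select(history, probe):
    """Per token, pick the predictor with the smallest proxy error vs. the probe."""
    preds = taylor_predictors(history)
    names = list(preds)
    errs = np.stack([np.linalg.norm(preds[n] - probe, axis=-1) for n in names])  # (P, N)
    choice = errs.argmin(axis=0)                                                 # (N,)
    out = np.stack([preds[names[c]][i] for i, c in enumerate(choice)])
    return out, [names[c] for c in choice]

rng = np.random.default_rng(0)
hist = rng.normal(size=(3, 5, 4))     # 3 cached steps, 5 tokens, dim 4
probe = 2 * hist[-1] - hist[-2]       # probe consistent with order-1 dynamics
out, chosen = probe_then_select(hist, probe)
print(chosen)  # ['order1', 'order1', 'order1', 'order1', 'order1']
```

Because the probe here matches linear dynamics exactly, every token selects the first-order predictor; with heterogeneous token dynamics, the argmin differs per token, which is the point of the method.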
cs.CV / 41 / 2603.03806
Separators in Enhancing Autoregressive Pretraining for Vision Mamba
分隔符在增强视觉Mamba自回归预训练中的应用
Abstract
The state space model Mamba has recently emerged as a promising paradigm in computer vision, attracting significant attention due to its efficient processing of long sequence tasks. Mamba's inherent causal mechanism renders it particularly suitable for autoregressive pretraining. However, current autoregressive pretraining methods are constrained to short sequence tasks, failing to fully exploit Mamba's prowess in handling extended sequences. To address this limitation, we introduce an innovative autoregressive pretraining method for Vision Mamba that substantially extends the input sequence length. We introduce new \textbf{S}epara\textbf{T}ors for \textbf{A}uto\textbf{R}egressive pretraining to demarcate and differentiate between different images, known as \textbf{STAR}. Specifically, we insert identical separators before each image to demarcate its inception. This strategy enables us to quadruple the input sequence length of Vision Mamba while preserving the original dimensions of the dataset images. Employing this long sequence pretraining technique, our STAR-B model achieved an impressive accuracy of 83.5\% on ImageNet-1k, which is highly competitive in Vision Mamba. These results underscore the potential of our method in enhancing the performance of vision models through improved leveraging of long-range dependencies.
Chinese Translation
状态空间模型Mamba最近作为计算机视觉中的一种有前景的范式出现,由于其高效处理长序列任务而受到广泛关注。Mamba固有的因果机制使其特别适合自回归预训练。然而,当前的自回归预训练方法仅限于短序列任务,未能充分发挥Mamba在处理扩展序列方面的优势。为了解决这一局限性,我们提出了一种创新的视觉Mamba自回归预训练方法,大幅延长输入序列长度。我们引入了新的\textbf{S}epara\textbf{T}ors用于\textbf{A}uto\textbf{R}egressive预训练,以标识和区分不同的图像,称为\textbf{STAR}。具体而言,我们在每个图像前插入相同的分隔符,以标识其起始位置。这一策略使我们能够将视觉Mamba的输入序列长度扩大四倍,同时保持数据集图像的原始维度。采用这种长序列预训练技术,我们的STAR-B模型在ImageNet-1k上达到了83.5\%的令人印象深刻的准确率,这在视觉Mamba中具有很强的竞争力。这些结果强调了我们的方法在通过改善长程依赖关系的利用来提升视觉模型性能方面的潜力。
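The separator mechanism itself is simple sequence packing: concatenate several images' token sequences into one long causal input, prefixing each image with the same (learnable) separator token so the model can tell where one image ends and the next begins. A minimal sketch; `pack_with_separators` is a hypothetical helper, and in practice `sep` would be a trained embedding rather than random.

```python
import numpy as np

def pack_with_separators(images_tokens, sep):
    """Concatenate per-image token sequences into one long causal input,
    inserting the identical separator token before each image."""
    chunks = []
    for toks in images_tokens:          # toks: (L_i, D)
        chunks.append(sep[None, :])     # separator marks the image boundary
        chunks.append(toks)
    return np.concatenate(chunks, axis=0)

rng = np.random.default_rng(0)
sep = rng.normal(size=(8,))                            # shared separator embedding
imgs = [rng.normal(size=(196, 8)) for _ in range(4)]   # 4 images of 196 tokens each
seq = pack_with_separators(imgs, sep)
print(seq.shape)  # (788, 8): 4 * (1 + 196)
```

Packing 4 images per sequence is exactly the "quadruple the input length" configuration the abstract describes, with each image keeping its original token count.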
cs.CV / 42 / 2603.03807
Adaptive Enhancement and Dual-Pooling Sequential Attention for Lightweight Underwater Object Detection with YOLOv10
基于YOLOv10的轻量级水下目标检测的自适应增强与双池序列注意力
Abstract
Underwater object detection constitutes a pivotal endeavor within the realms of marine surveillance and autonomous underwater systems; however, it presents significant challenges due to pronounced visual impairments arising from phenomena such as light absorption, scattering, and diminished contrast. In response to these formidable challenges, this manuscript introduces a streamlined yet robust framework for underwater object detection, grounded in the YOLOv10 architecture. The proposed method integrates a Multi-Stage Adaptive Enhancement module to improve image quality, a Dual-Pooling Sequential Attention (DPSA) mechanism embedded into the backbone to strengthen multi-scale feature representation, and a Focal Generalized IoU Objectness (FGIoU) loss to jointly improve localization accuracy and objectness prediction under class imbalance. Comprehensive experimental evaluations conducted on the RUOD and DUO benchmark datasets substantiate that the proposed DPSA_FGIoU_YOLOv10n attains exceptional performance, achieving mean Average Precision (mAP) scores of 88.9% and 88.0% at IoU threshold 0.5, respectively. In comparison to the baseline YOLOv10n, this represents enhancements of 6.7% for RUOD and 6.2% for DUO, all while preserving a compact model architecture comprising merely 2.8M parameters. These findings validate that the proposed framework establishes an efficacious equilibrium among accuracy, robustness, and real-time operational efficiency, making it suitable for deployment in resource-constrained underwater settings.
Chinese Translation
水下目标检测在海洋监视和自主水下系统领域中占据着重要地位;然而,由于光吸收、散射和对比度降低等现象导致的显著视觉障碍,使其面临重大挑战。针对这些严峻挑战,本文提出了一种基于YOLOv10架构的简化而强大的水下目标检测框架。所提方法集成了多阶段自适应增强模块以改善图像质量,嵌入主干网络的双池序列注意力(Dual-Pooling Sequential Attention, DPSA)机制以增强多尺度特征表示,并采用聚焦广义IoU目标性损失(Focal Generalized IoU Objectness, FGIoU)共同提高定位精度和在类别不平衡下的目标性预测。对RUOD和DUO基准数据集进行的全面实验评估证实,所提出的DPSA_FGIoU_YOLOv10n在IoU阈值为0.5时分别达到了88.9%和88.0%的平均精度(mean Average Precision, mAP)分数。与基线YOLOv10n相比,RUOD和DUO的提升分别为6.7%和6.2%,同时保持了仅包含2.8M参数的紧凑模型架构。这些发现验证了所提出的框架在准确性、鲁棒性和实时操作效率之间建立了有效的平衡,使其适合在资源受限的水下环境中部署。
cs.CV / 43 / 2603.03808
Vector-Quantized Soft Label Compression for Dataset Distillation
用于数据集蒸馏的向量量化软标签压缩
Abstract
Dataset distillation is an emerging technique for reducing the computational and storage costs of training machine learning models by synthesizing a small, informative subset of data that captures the essential characteristics of a much larger dataset. Recent methods pair synthetic samples and their augmentations with soft labels from a teacher model, enabling student models to generalize effectively despite the small size of the distilled dataset. While soft labels are critical for effective distillation, the storage and communication overhead they incur, especially when accounting for augmentations, is often overlooked. In practice, each distilled sample is associated with multiple soft labels, making them the dominant contributor to storage costs, particularly in large-class settings such as ImageNet-1K. In this paper, we present a rigorous analysis of bit requirements across dataset distillation frameworks, quantifying the storage demands of both distilled samples and their soft labels. To address the overhead, we introduce a vector-quantized autoencoder (VQAE) for compressing soft labels, achieving substantial compression while preserving the effectiveness of the distilled data. We validate our method on both vision and language distillation benchmarks. On ImageNet-1K, our proposed VQAE achieves 30--40x additional compression over RDED, LPLD, SRE2L, and CDA baselines while retaining over $90\%$ of their original performance.
Chinese Translation
数据集蒸馏是一种新兴技术,通过合成一个小而信息丰富的数据子集,捕捉更大数据集的基本特征,从而减少训练机器学习模型的计算和存储成本。最近的方法将合成样本及其增强与来自教师模型的软标签配对,使得学生模型能够有效地进行泛化,尽管蒸馏数据集的规模较小。虽然软标签对于有效的蒸馏至关重要,但它们所带来的存储和通信开销,尤其是在考虑增强时,常常被忽视。在实践中,每个蒸馏样本都与多个软标签相关联,使其成为存储成本的主要贡献者,特别是在像ImageNet-1K这样的大类设置中。在本文中,我们对数据集蒸馏框架中的位需求进行了严格分析,量化了蒸馏样本及其软标签的存储需求。为了解决这一开销,我们引入了一种向量量化自编码器(VQAE)来压缩软标签,实现了显著的压缩,同时保持了蒸馏数据的有效性。我们在视觉和语言蒸馏基准上验证了我们的方法。在ImageNet-1K上,我们提出的VQAE在保留超过90%的原始性能的同时,实现了比RDED、LPLD、SRE2L和CDA基线高出30-40倍的额外压缩。
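The storage argument is mechanical: N samples with A augmentations over C classes cost N*A*C floats in raw soft labels, whereas a vector-quantized codebook costs K*C floats plus one small integer code per label. The sketch below uses a tiny k-means quantizer as an illustrative stand-in for the paper's learned VQ autoencoder; `build_codebook` is a hypothetical helper.

```python
import numpy as np

def build_codebook(labels, k, iters=20, seed=0):
    """Tiny k-means vector quantizer over soft-label vectors (N, C).

    Storage drops from N*C floats to K*C codebook floats plus N integer
    codes (an illustrative stand-in for a learned VQ autoencoder).
    """
    rng = np.random.default_rng(seed)
    codebook = labels[rng.choice(len(labels), size=k, replace=False)]
    for _ in range(iters):
        # Assign each label to its nearest codeword.
        d = ((labels[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        codes = d.argmin(axis=1)
        # Move each codeword to the mean of its assigned labels.
        for j in range(k):
            members = labels[codes == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook, codes

rng = np.random.default_rng(0)
soft = rng.dirichlet(np.ones(10), size=256)   # 256 soft labels over 10 classes
codebook, codes = build_codebook(soft, k=16)
print(codebook.shape, codes.shape)  # (16, 10) (256,)
```

Reconstruction is just `codebook[codes]`; at ImageNet-1K scale (C=1000, many augmentations per sample) this code-plus-codebook layout is what makes the 30--40x soft-label compression plausible.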
cs.CV / 44 / 2603.03815
Structure-aware Prompt Adaptation from Seen to Unseen for Open-Vocabulary Compositional Zero-Shot Learning
结构感知的提示适应:从已见到未见的开放词汇组合零样本学习
Abstract
The goal of Open-Vocabulary Compositional Zero-Shot Learning (OV-CZSL) is to recognize attribute-object compositions in the open-vocabulary setting, where compositions of both seen and unseen attributes and objects are evaluated. Recently, prompt tuning methods have demonstrated strong generalization capabilities in the closed setting, where only compositions of seen attributes and objects are evaluated, i.e., Compositional Zero-Shot Learning (CZSL). However, directly applying these methods to OV-CZSL may not be sufficient to generalize to unseen attributes, objects and their compositions, as it is limited to seen attributes and objects. Normally, when faced with unseen concepts, humans adopt analogies with seen concepts that have the similar semantics thereby inferring their meaning (e.g., "wet" and "damp", "shirt" and "jacket"). In this paper, we experimentally show that the distribution of semantically related attributes or objects tends to form consistent local structures in the embedding space. Based on the above structures, we propose Structure-aware Prompt Adaptation (SPA) method, which enables models to generalize from seen to unseen attributes and objects. Specifically, in the training stage, we design a Structure-aware Consistency Loss (SCL) that encourages the local structure's consistency of seen attributes and objects in each iteration. In the inference stage, we devise a Structure-guided Adaptation Strategy (SAS) that adaptively aligns the structures of unseen attributes and objects with those of trained seen attributes and objects with similar semantics. Notably, SPA is a plug-and-play method that can be seamlessly integrated into existing CZSL prompt tuning methods. Extensive experiments on OV-CZSL benchmarks demonstrate that SPA achieves competitive closed-set performance while significantly improving open-vocabulary results.
Chinese Translation
开放词汇组合零样本学习(Open-Vocabulary Compositional Zero-Shot Learning, OV-CZSL)的目标是在开放词汇环境中识别属性-对象组合,其中评估了已见和未见属性及对象的组合。最近,提示调优方法在封闭环境中展示了强大的泛化能力,在该环境中仅评估已见属性和对象的组合,即组合零样本学习(Compositional Zero-Shot Learning, CZSL)。然而,直接将这些方法应用于OV-CZSL可能不足以泛化到未见属性、对象及其组合,因为它们仅限于已见属性和对象。通常,当面临未见概念时,人类会借助与已见概念的类比(例如,“湿”和“潮湿”,“衬衫”和“夹克”)来推断其含义。本文通过实验证明,语义相关属性或对象的分布倾向于在嵌入空间中形成一致的局部结构。基于上述结构,我们提出了结构感知的提示适应(Structure-aware Prompt Adaptation, SPA)方法,使模型能够从已见属性和对象泛化到未见属性和对象。具体而言,在训练阶段,我们设计了一种结构感知一致性损失(Structure-aware Consistency Loss, SCL),鼓励每次迭代中已见属性和对象的局部结构保持一致。在推理阶段,我们设计了一种结构引导适应策略(Structure-guided Adaptation Strategy, SAS),该策略自适应地将未见属性和对象的结构与已训练的具有相似语义的已见属性和对象的结构对齐。值得注意的是,SPA是一种即插即用的方法,可以无缝集成到现有的CZSL提示调优方法中。在OV-CZSL基准上的大量实验表明,SPA在封闭集性能上具有竞争力,同时显著改善了开放词汇结果。
cs.CV / 45 / 2603.03825
From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning
从狭窄视野到全景视野:注意力引导的冷启动重塑多模态推理
Abstract
The cold-start initialization stage plays a pivotal role in training Multimodal Large Reasoning Models (MLRMs), yet its mechanisms remain insufficiently understood. To analyze this stage, we introduce the Visual Attention Score (VAS), an attention-based metric that quantifies how much a model attends to visual tokens. We find that reasoning performance is strongly correlated with VAS (r=0.9616): models with higher VAS achieve substantially stronger multimodal reasoning. Surprisingly, multimodal cold-start fails to elevate VAS, resulting in attention distributions close to the base model, whereas text-only cold-start leads to a clear increase. We term this counter-intuitive phenomenon Lazy Attention Localization. To validate its causal role, we design training-free interventions that directly modulate attention allocation during inference, yielding performance gains of 1$-$2% without any retraining. Building on these insights, we further propose Attention-Guided Visual Anchoring and Reflection (AVAR), a comprehensive cold-start framework that integrates visual-anchored data synthesis, attention-guided objectives, and visual-anchored reward shaping. Applied to Qwen2.5-VL-7B, AVAR achieves an average gain of 7.0% across 7 multimodal reasoning benchmarks. Ablation studies further confirm that each component of AVAR contributes step-wise to the overall gains. The code, data, and models are available at https://github.com/lrlbbzl/Qwen-AVAR.
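An attention-based metric of the kind the abstract describes can be sketched as the average attention mass a model places on visual-token positions. The exact normalization used for VAS is not given here, so the following is only a plausible reading, with assumed tensor shapes:

```python
import numpy as np

def visual_attention_score(attn, visual_mask):
    """Mean attention mass placed on visual tokens.

    attn: array of shape (heads, queries, keys), each row summing to 1.
    visual_mask: boolean array of shape (keys,) marking visual tokens.
    A plausible reading of the Visual Attention Score (VAS); the
    paper's exact layer/head aggregation may differ."""
    attn = np.asarray(attn, dtype=np.float64)
    mass = attn[..., np.asarray(visual_mask, bool)].sum(axis=-1)
    return float(mass.mean())
```

Under uniform attention, the score simply equals the fraction of keys that are visual tokens, which gives a natural baseline to compare models against.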
Chinese Translation
冷启动初始化阶段在训练多模态大规模推理模型(MLRMs)中扮演着关键角色,但其机制仍然不够清晰。为了分析这一阶段,我们引入了视觉注意力评分(Visual Attention Score, VAS),这是一种基于注意力的指标,用于量化模型对视觉标记的关注程度。我们发现推理性能与VAS之间存在强相关性(r=0.9616):VAS较高的模型在多模态推理上表现显著更强。令人惊讶的是,多模态冷启动未能提升VAS,导致注意力分布接近基础模型,而仅文本冷启动则明显提高VAS。我们将这一反直觉现象称为懒惰注意力定位(Lazy Attention Localization)。为了验证其因果作用,我们设计了无训练干预,直接调节推理过程中的注意力分配,获得了1%至2%的性能提升,而无需任何再训练。在此基础上,我们进一步提出了注意力引导的视觉锚定与反思(Attention-Guided Visual Anchoring and Reflection, AVAR),这是一个综合性的冷启动框架,整合了视觉锚定的数据合成、注意力引导的目标和视觉锚定的奖励塑造。应用于Qwen2.5-VL-7B,AVAR在7个多模态推理基准上实现了平均7.0%的提升。消融研究进一步确认AVAR的每个组件逐步贡献于整体提升。代码、数据和模型可在https://github.com/lrlbbzl/Qwen-AVAR获取。
cs.CV / 46 / 2603.03831
Universal Pansharpening Foundation Model
通用全色锐化基础模型
Abstract
Pansharpening generates the high-resolution multi-spectral (MS) image by integrating spatial details from a texture-rich panchromatic (PAN) image and spectral attributes from a low-resolution MS image. Existing methods are predominantly satellite-specific and scene-dependent, which severely limits their generalization across heterogeneous sensors and varied scenes, thereby reducing their real-world practicality. To address these challenges, we present FoundPS, a universal pansharpening foundation model for satellite-agnostic and scene-robust fusion. Specifically, we introduce a modality-interleaved transformer that learns band-wise modal specializations to form reversible spectral affine bases, mapping arbitrary-band MS into a unified latent space via tensor multiplication. Building upon this, we construct a latent diffusion bridge model to progressively evolve latent representations, and incorporate bridge posterior sampling to couple latent diffusion with pixel-space observations, enabling stable and controllable fusion. Furthermore, we devise infinite-dimensional pixel-to-latent interaction mechanisms to comprehensively capture the cross-domain dependencies between PAN observations and MS representations, thereby facilitating complementary information fusion. In addition, to support large-scale training and evaluation, we construct a comprehensive pansharpening benchmark, termed PSBench, consisting of worldwide MS and PAN image pairs from multiple satellites across diverse scenes. Extensive experiments demonstrate that FoundPS consistently outperforms state-of-the-art methods, exhibiting superior generalization and robustness across a wide range of pansharpening tasks.
Chinese Translation
全色锐化通过整合来自纹理丰富的全色(PAN)图像的空间细节和来自低分辨率多光谱(MS)图像的光谱属性,生成高分辨率的多光谱图像。现有方法主要依赖于特定卫星和场景,严重限制了它们在异构传感器和不同场景中的泛化能力,从而降低了其在实际应用中的实用性。为了解决这些挑战,我们提出了FoundPS,一个用于卫星无关和场景鲁棒融合的通用全色锐化基础模型。具体而言,我们引入了一种模态交错变换器,学习带宽模态专业化,以形成可逆的光谱仿射基,通过张量乘法将任意波段的多光谱图像映射到统一的潜在空间。在此基础上,我们构建了一个潜在扩散桥接模型,以逐步演化潜在表示,并结合桥接后验采样,将潜在扩散与像素空间观测耦合,从而实现稳定且可控的融合。此外,我们设计了无限维像素到潜在的交互机制,以全面捕捉全色观测与多光谱表示之间的跨域依赖,从而促进互补信息的融合。此外,为了支持大规模训练和评估,我们构建了一个全面的全色锐化基准,称为PSBench,包含来自多个卫星的全球多光谱和全色图像对,涵盖多种场景。大量实验表明,FoundPS在各种全色锐化任务中始终优于最先进的方法,展现出卓越的泛化能力和鲁棒性。
cs.CV / 47 / 2603.03839
All-in-One Image Restoration via Causal-Deconfounding Wavelet-Disentangled Prompt Network
基于因果去混淆的小波解耦提示网络的全能图像修复
Abstract
Image restoration represents a promising approach for addressing the inherent defects of image content distortion. Standard image restoration approaches suffer from high storage cost and the requirement of a known degradation pattern, including its type and degree, which can rarely be satisfied in dynamic practical scenarios. In contrast, all-in-one image restoration (AiOIR) eliminates multiple degradations within a unified model to circumvent the aforementioned issues. However, according to our causal analysis, we disclose that two significant defects still limit the effectiveness and generalization of AiOIR models: 1) the spurious correlation between non-degradation semantic features and degradation patterns; 2) the biased estimation of degradation patterns. To obtain the true causation between degraded images and restored images, we propose the Causal-deconfounding Wavelet-disentangled Prompt Network (CWP-Net) to perform effective AiOIR. CWP-Net introduces two modules for decoupling, i.e., a wavelet attention module in the encoder and a wavelet attention module in the decoder. These modules explicitly disentangle the degradation and semantic features to tackle the issue of spurious correlation. To address the issue stemming from the biased estimation of degradation patterns, CWP-Net leverages a wavelet prompt block to generate the alternative variable for causal deconfounding. Extensive experiments on two all-in-one settings prove the effectiveness and superior performance of our proposed CWP-Net over state-of-the-art AiOIR methods.
Chinese Translation
图像修复是一种有前景的方法,用于解决图像内容失真的固有缺陷。标准的图像修复方法面临着高存储成本和对已知退化模式(包括类型和程度)的要求,而这些在动态实际场景中几乎无法满足。相比之下,全能图像修复(AiOIR)通过在统一模型中消除多种退化,避免了上述问题。然而,根据我们的因果分析,我们发现两个显著缺陷仍然加剧了AiOIR模型的有效性和泛化能力:1)非退化语义特征与退化模式之间的虚假相关性;2)对退化模式的偏差估计。为了获得退化图像与修复图像之间的真实因果关系,我们提出了因果去混淆小波解耦提示网络(CWP-Net)以实现有效的AiOIR。CWP-Net引入了两个解耦模块,即编码器的小波注意模块和解码器的小波注意模块。这些模块明确解耦了退化特征和语义特征,以解决虚假相关性的问题。为了解决源于退化模式偏差估计的问题,CWP-Net利用小波提示块生成替代变量以进行因果去混淆。在两个全能设置上的大量实验证明了我们提出的CWP-Net相较于最先进的AiOIR方法的有效性和优越性能。
cs.CV / 48 / 2603.03857
DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models
DeepScan:一种无训练框架用于大型视觉语言模型中的视觉基础推理
Abstract
Humans can robustly localize visual evidence and provide grounded answers even in noisy environments by identifying critical cues and then relating them to the full context in a bottom-up manner. Inspired by this, we propose DeepScan, a training-free framework that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning for visually grounded reasoning in Large Vision-Language Models (LVLMs). Unlike existing methods that pursue one-shot localization of complete evidence, Hierarchical Scanning performs local cue exploration and multi-scale evidence extraction to recover evidence in a bottom-up manner, effectively mitigating the impacts of distractive context. Refocusing then optimizes the localized evidence view through collaboration of LVLMs and visual experts. Finally, Evidence-Enhanced Reasoning aggregates multi-granular views via a hybrid evidence memory and yields accurate and interpretable answers. Experimental results demonstrate that DeepScan significantly boosts LVLMs in diverse visual tasks, especially in fine-grained visual understanding. It achieves 90.6% overall accuracy on V* when integrated with Qwen2.5-VL-7B. Moreover, DeepScan provides consistent improvements for LVLMs across various architectures and model scales without additional adaptation cost.
Chinese Translation
人类能够在嘈杂环境中稳健地定位视觉证据并提供有根据的答案,通过识别关键线索并将其以自下而上的方式与完整上下文关联。受到此启发,我们提出了DeepScan,这是一种无训练框架,结合了分层扫描、重新聚焦和增强证据推理,用于大型视觉语言模型(LVLMs)中的视觉基础推理。与现有方法追求一次性定位完整证据不同,分层扫描通过局部线索探索和多尺度证据提取以自下而上的方式恢复证据,有效减轻了干扰上下文的影响。随后,重新聚焦通过LVLMs与视觉专家的协作优化定位的证据视图。最后,增强证据推理通过混合证据记忆聚合多粒度视图,并产生准确且可解释的答案。实验结果表明,DeepScan显著提升了LVLMs在多样视觉任务中的表现,尤其是在细粒度视觉理解方面。当与Qwen2.5-VL-7B集成时,其在V*上的整体准确率达到90.6%。此外,DeepScan在不同架构和模型规模的LVLMs中提供了一致的改进,而无需额外的适应成本。
cs.CV / 49 / 2603.03871
Bridging Human Evaluation to Infrared and Visible Image Fusion
将人类评估与红外与可见图像融合相结合
Abstract
Infrared and visible image fusion (IVIF) integrates complementary modalities to enhance scene perception. Current methods predominantly focus on optimizing handcrafted losses and objective metrics, often resulting in fusion outcomes that do not align with human visual preferences. This challenge is further exacerbated by the ill-posed nature of IVIF, which severely limits its effectiveness in human perceptual environments such as security surveillance and driver assistance systems. To address these limitations, we propose a feedback reinforcement framework that bridges human evaluation to infrared and visible image fusion. To address the lack of human-centric evaluation metrics and data, we introduce the first large-scale human feedback dataset for IVIF, containing multidimensional subjective scores and artifact annotations, and enriched by a fine-tuned large language model with expert review. Based on this dataset, we design a domain-specific reward function and train a reward model to quantify perceptual quality. Guided by this reward, we fine-tune the fusion network through Group Relative Policy Optimization, achieving state-of-the-art performance that better aligns fused images with human aesthetics. Code is available at https://github.com/ALKA-Wind/EVAFusion.
Chinese Translation
红外与可见图像融合(IVIF)整合了互补的模态,以增强场景感知。目前的方法主要集中在优化手工设计的损失函数和客观指标,往往导致融合结果与人类视觉偏好不一致。这一挑战因IVIF的病态特性而进一步加剧,严重限制了其在安全监控和驾驶辅助系统等人类感知环境中的有效性。为了解决这些局限性,我们提出了一种反馈强化框架,将人类评估与红外与可见图像融合相结合。为了解决缺乏以人为中心的评估指标和数据的问题,我们引入了第一个大规模的IVIF人类反馈数据集,该数据集包含多维主观评分和伪影注释,并通过经过微调的大型语言模型与专家评审进行丰富。基于该数据集,我们设计了一个特定领域的奖励函数,并训练了一个奖励模型来量化感知质量。在该奖励的指导下,我们通过组相对策略优化(Group Relative Policy Optimization)微调融合网络,实现了与人类美学更好对齐的最先进性能。代码可在 https://github.com/ALKA-Wind/EVAFusion 获取。
cs.CV / 50 / 2603.03879
Yolo-Key-6D: Single Stage Monocular 6D Pose Estimation with Keypoint Enhancements
Yolo-Key-6D:基于关键点增强的单阶段单目6D姿态估计
Abstract
Estimating the 6D pose of objects from a single RGB image is a critical task for robotics and extended reality applications. However, state-of-the-art multi-stage methods often suffer from high latency, making them unsuitable for real-time use. In this paper, we present Yolo-Key-6D, a novel single-stage, end-to-end framework for monocular 6D pose estimation designed for both speed and accuracy. Our approach enhances a YOLO-based architecture by integrating an auxiliary head that regresses the 2D projections of an object's 3D bounding box corners. This keypoint detection task significantly improves the network's understanding of 3D geometry. For stable end-to-end training, we directly regress rotation using a continuous 9D representation projected to SO(3) via singular value decomposition. On the LINEMOD and LINEMOD-Occluded benchmarks, Yolo-Key-6D achieves competitive accuracy scores of 96.24% and 69.41%, respectively, on the ADD(-S) 0.1d metric, while operating in real time. Our results demonstrate that a carefully designed single-stage method can provide a practical and effective balance of performance and efficiency for real-world deployment.
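The continuous 9D rotation representation mentioned above has a compact numerical core: the network's nine raw outputs are reshaped into a 3x3 matrix and projected onto SO(3) with an SVD. The following is a generic sketch of that projection (the surrounding head layout is an assumption, not the paper's exact architecture):

```python
import numpy as np

def rotation_from_9d(x):
    """Project a raw 9-dimensional regression output onto SO(3)
    via singular value decomposition."""
    m = np.asarray(x, dtype=np.float64).reshape(3, 3)
    u, _, vt = np.linalg.svd(m)
    # Correct the sign so the result is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(u @ vt))
    return u @ np.diag([1.0, 1.0, d]) @ vt
```

The projection is differentiable almost everywhere, which is what makes this representation attractive for stable end-to-end training.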
Chinese Translation
从单个RGB图像中估计物体的6D姿态是机器人技术和扩展现实应用中的一项关键任务。然而,最先进的多阶段方法通常存在高延迟的问题,使其不适合实时使用。本文提出了Yolo-Key-6D,这是一种新颖的单阶段端到端框架,旨在实现单目6D姿态估计,兼顾速度和准确性。我们的方法通过集成一个辅助头部来回归物体3D边界框角点的2D投影,从而增强了基于YOLO的架构。这一关键点检测任务显著提高了网络对3D几何的理解。为了实现稳定的端到端训练,我们直接使用通过奇异值分解投影到SO(3)的连续9D表示来回归旋转。在LINEMOD和LINEMOD-Occluded基准测试中,YOLO-Key-6D分别在ADD(-S) 0.1d指标下达到了96.24%和69.41%的竞争性准确率,同时证明其能够实时运行。我们的结果表明,经过精心设计的单阶段方法能够为实际应用提供性能与效率的有效平衡。
cs.CV / 51 / 2603.03882
UniSync: Towards Generalizable and High-Fidelity Lip Synchronization for Challenging Scenarios
UniSync:面向具有挑战性场景的可泛化高保真唇部同步
Abstract
Lip synchronization aims to generate realistic talking videos that match given audio, which is essential for high-quality video dubbing. However, current methods have fundamental drawbacks: mask-based approaches suffer from local color discrepancies, while mask-free methods struggle with global background texture misalignment. Furthermore, most methods struggle with diverse real-world scenarios such as stylized avatars, face occlusion, and extreme lighting conditions. In this paper, we propose UniSync, a unified framework designed for achieving high-fidelity lip synchronization in diverse scenarios. Specifically, UniSync uses a mask-free pose-anchored training strategy to keep head motion and eliminate synthesis color artifacts, while employing mask-based blending consistent inference to ensure structural precision and smooth blending. Notably, fine-tuning on compact but diverse videos empowers our model with exceptional domain adaptability, handling complex corner cases effectively. We also introduce the RealWorld-LipSync benchmark to evaluate models under real-world demands, which covers diverse application scenarios including both human faces and stylized avatars. Extensive experiments demonstrate that UniSync significantly outperforms state-of-the-art methods, advancing the field towards truly generalizable and production-ready lip synchronization.
Chinese Translation
唇部同步旨在生成与给定音频匹配的逼真对话视频,这对于高质量视频配音至关重要。然而,当前的方法存在根本性缺陷:基于掩膜的方法在局部颜色上存在差异,而无掩膜的方法则在全局背景纹理对齐上遇到困难。此外,大多数方法在多样化的现实场景中表现不佳,例如风格化头像、面部遮挡和极端光照条件。本文提出了UniSync,一个旨在实现多样场景下高保真唇部同步的统一框架。具体而言,UniSync采用无掩膜的姿态锚定训练策略,以保持头部运动并消除合成颜色伪影,同时采用基于掩膜的混合一致推理,以确保结构精度和平滑混合。值得注意的是,在紧凑但多样的视频上进行微调,使我们的模型具备了卓越的领域适应性,能够有效处理复杂的边缘案例。我们还引入了RealWorld-LipSync基准,以评估模型在现实世界需求下的表现,涵盖了包括人脸和风格化头像在内的多样应用场景。大量实验表明,UniSync显著优于最先进的方法,推动该领域朝着真正可泛化和适用于生产的唇部同步迈进。
cs.CV / 52 / 2603.03892
A novel network for classification of cuneiform tablet metadata
一种用于楔形文字平板元数据分类的新型网络
Abstract
In this paper, we present a network structure for classifying metadata of cuneiform tablets. The problem is of practical importance, as the size of the existing corpus far exceeds the number of experts available to analyze it. The task is made difficult, however, by the combination of limited annotated datasets and the high-resolution point-cloud representation of each tablet. To address this, we develop a convolution-inspired architecture that gradually down-scales the point cloud while integrating local neighbor information. The final down-scaled point cloud is then processed by computing neighbors in the feature space to incorporate global information. Our method is compared with the state-of-the-art transformer-based network Point-BERT and consistently obtains the best performance. Source code and datasets will be released upon publication.
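The down-scaling step described above can be sketched as subsampling the point cloud and pooling each kept point's nearest spatial neighbors' features. This is a generic illustration only: the paper's operator is learned, and its sampling scheme is not specified here.

```python
import numpy as np

def downscale_step(points, feats, keep_every=2, k=4):
    """One convolution-inspired down-scaling step: keep a subset of
    points and mean-pool each kept point's k nearest neighbors'
    features (a hypothetical sketch of the mechanism described)."""
    kept = np.arange(0, len(points), keep_every)
    pooled = np.empty((len(kept), feats.shape[1]))
    for i, idx in enumerate(kept):
        d = np.linalg.norm(points - points[idx], axis=1)
        nn = np.argsort(d)[:k]  # includes the point itself
        pooled[i] = feats[nn].mean(axis=0)
    return points[kept], pooled
```

Repeating such a step halves the resolution each time while spreading local geometric context into the surviving points, which is the behavior the abstract attributes to its architecture.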
Chinese Translation
在本文中,我们提出了一种用于分类楔形文字平板元数据的网络结构。该问题具有实际重要性,因为现有语料库的规模远远超过可用来分析它的专家人数。然而,由于有限的标注数据集与每个平板的高分辨率点云表示的结合,使得这一任务变得困难。为了解决这个问题,我们开发了一种受卷积启发的架构,该架构逐步缩小点云,同时整合局部邻域信息。最终缩小的点云通过在特征空间中计算邻居来处理,以包含全局信息。我们的方法与最先进的基于变换器的网络 Point-BERT 进行了比较,并始终获得最佳性能。源代码和数据集将在发表时发布。
cs.CV / 53 / 2603.03903
From Misclassifications to Outliers: Joint Reliability Assessment in Classification
从误分类到异常值:分类中的联合可靠性评估
Abstract
Building reliable classifiers is a fundamental challenge for deploying machine learning in real-world applications. A reliable system should not only detect out-of-distribution (OOD) inputs but also anticipate in-distribution (ID) errors by assigning low confidence to potentially misclassified samples. Yet, most prior work treats OOD detection and failure prediction as separate problems, overlooking their close connection. We argue that reliability requires evaluating them jointly. To this end, we propose a unified evaluation framework that integrates OOD detection and failure prediction, quantified by our new metrics DS-F1 and DS-AURC, where DS denotes double scoring functions. Experiments on the OpenOOD benchmark show that double scoring functions yield classifiers that are substantially more reliable than traditional single scoring approaches. Our analysis further reveals that OOD-based approaches provide notable gains under simple or far-OOD shifts, but only marginal benefits under more challenging near-OOD conditions. Beyond evaluation, we extend the reliable classifier SURE and introduce SURE+, a new approach that significantly improves reliability across diverse scenarios. Together, our framework, metrics, and method establish a new benchmark for trustworthy classification and offer practical guidance for deploying robust models in real-world settings. The source code is publicly available at https://github.com/Intellindust-AI-Lab/SUREPlus.
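A double-scoring decision rule of the kind motivated above can be sketched with two common, generic scores: an energy score to gate OOD inputs and maximum softmax probability (MSP) to flag likely ID errors. These are illustrative stand-ins; the paper's DS-F1/DS-AURC metrics and the SURE+ scoring functions are not reproduced here.

```python
import numpy as np

def double_score(logits, ood_thresh, conf_thresh):
    """Joint decision sketch: energy score gates OOD inputs, MSP
    flags likely misclassifications. Generic choices, not the
    paper's exact scoring functions.
    Returns 'ood', 'reject', or 'accept'."""
    z = np.asarray(logits, dtype=np.float64)
    m = z.max()
    energy = np.log(np.exp(z - m).sum()) + m      # stable logsumexp
    msp = np.exp(z - m).max() / np.exp(z - m).sum()
    if energy < ood_thresh:
        return "ood"
    if msp < conf_thresh:
        return "reject"
    return "accept"
```

Evaluating both decisions jointly, rather than each in isolation, is exactly the gap the proposed framework is meant to close.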
Chinese Translation
构建可靠的分类器是将机器学习应用于现实世界中的一项基本挑战。一个可靠的系统不仅应能够检测到分布外(OOD)输入,还应通过对潜在误分类样本赋予低置信度来预测分布内(ID)错误。然而,大多数先前的研究将OOD检测和失败预测视为独立的问题,忽视了它们之间的紧密联系。我们认为,可靠性需要对这两者进行联合评估。为此,我们提出了一个统一的评估框架,整合了OOD检测和失败预测,通过我们新的指标DS-F1和DS-AURC进行量化,其中DS表示双重评分函数。在OpenOOD基准测试中的实验表明,双重评分函数生成的分类器在可靠性上显著优于传统的单一评分方法。我们的分析进一步揭示,基于OOD的方法在简单或远离OOD的变化下提供了显著的收益,但在更具挑战性的近OOD条件下仅提供了边际效益。除了评估之外,我们扩展了可靠分类器SURE,并引入了SURE+,这是一种在多种场景下显著提高可靠性的新方法。我们的框架、指标和方法共同建立了可信分类的新基准,并为在现实环境中部署稳健模型提供了实用指导。源代码可在https://github.com/Intellindust-AI-Lab/SUREPlus公开获取。
cs.CV / 54 / 2603.03904
Architecture and evaluation protocol for transformer-based visual object tracking in UAV applications
基于变换器的无人机视觉目标跟踪架构与评估协议
Abstract
Object tracking from Unmanned Aerial Vehicles (UAVs) is challenged by platform dynamics, camera motion, and limited onboard resources. Existing visual trackers either lack robustness in complex scenarios or are too computationally demanding for real-time embedded use. We propose a Modular Asynchronous Tracking Architecture (MATA) that combines a transformer-based tracker with an Extended Kalman Filter, integrating ego-motion compensation from sparse optical flow and an object trajectory model. We further introduce a hardware-independent, embedded-oriented evaluation protocol and a new metric called Normalized Time to Failure (NT2F) to quantify how long a tracker can sustain a tracking sequence without external help. Experiments on UAV benchmarks, including an augmented UAV123 dataset with synthetic occlusions, show consistent improvements in Success and NT2F metrics across multiple tracking processing frequencies. A ROS 2 implementation on an Nvidia Jetson AGX Orin confirms that the evaluation protocol more closely matches real-time performance on embedded systems.
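A metric with the stated intent of NT2F can be sketched as the fraction of a sequence survived before the first tracking failure, with failure taken as overlap dropping below a threshold. The paper's exact definition (e.g. how re-detections or multiple runs are handled) is not given here, so treat this as an assumption-laden sketch:

```python
def normalized_time_to_failure(ious, thresh=0.5):
    """Fraction of frames tracked before the first failure, where
    failure is per-frame IoU falling below `thresh`. A plausible
    sketch of NT2F's intent, not necessarily its exact definition."""
    for i, iou in enumerate(ious):
        if iou < thresh:
            return i / len(ious)
    return 1.0
```

Unlike frame-averaged Success, such a metric is sensitive to *when* a tracker fails, which matters for embedded deployments that cannot rely on external re-initialization.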
Chinese Translation
无人机(UAV)上的目标跟踪面临平台动态、相机运动和有限的机载资源等挑战。现有的视觉跟踪器在复杂场景中要么缺乏鲁棒性,要么在实时嵌入式使用中计算需求过高。我们提出了一种模块化异步跟踪架构(Modular Asynchronous Tracking Architecture, MATA),该架构将基于变换器的跟踪器与扩展卡尔曼滤波器(Extended Kalman Filter)相结合,整合了稀疏光流的自运动补偿和目标轨迹模型。我们进一步引入了一种与硬件无关、面向嵌入式的评估协议,以及一种新的度量标准——归一化故障时间(Normalized time to Failure, NT2F),用于量化跟踪器在没有外部帮助的情况下能够维持跟踪序列的时间。对无人机基准测试的实验,包括带有合成遮挡的增强UAV123数据集,显示在多个跟踪处理频率下,成功率和NT2F指标均有一致性提升。在Nvidia Jetson AGX Orin上的ROS 2实现确认了该评估协议与嵌入式系统上的实时性能更为匹配。
cs.CV / 55 / 2603.03907
Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks
细粒度图像美学评估:从相对排名中学习区分性评分
Abstract
Image aesthetic assessment (IAA) has extensive applications in content creation, album management, recommendation systems, and beyond. In such applications, it is commonly necessary to pick out the most aesthetically pleasing image from a series of images with subtle aesthetic variations, a topic we refer to as fine-grained IAA. Unfortunately, state-of-the-art IAA models are typically designed for coarse-grained evaluation, where images with notable aesthetic differences are evaluated independently on an absolute scale. These models are inherently limited in discriminating fine-grained aesthetic differences. To address the dilemma, we contribute FGAesthetics, a fine-grained IAA database with 32,217 images organized into 10,028 series, sourced from diverse categories including Natural, AIGC, and Cropping. Annotations are collected via pairwise comparisons within each series. We also devise Series Refinement and Rank Calibration to ensure the reliability of data and labels. Based on FGAesthetics, we further propose FGAesQ, a novel IAA framework that learns discriminative aesthetic scores from relative ranks through Difference-preserved Tokenization (DiffToken), Comparative Text-assisted Alignment (CTAlign), and Rank-aware Regression (RankReg). FGAesQ enables accurate aesthetic assessment in fine-grained scenarios while still maintaining competitive performance in coarse-grained evaluation. Extensive experiments and comparisons demonstrate the superiority of the proposed method.
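Turning pairwise comparisons within a series into per-image scores can be illustrated with simple win-rate aggregation. This is a minimal stand-in for rank aggregation; the paper's Rank Calibration procedure and learned RankReg head are considerably more involved, and `series_scores` is a hypothetical helper.

```python
from collections import defaultdict

def series_scores(pairwise):
    """Aggregate (winner, loser) pairwise aesthetic judgments within
    one image series into per-image win rates -- a minimal sketch of
    how relative comparisons yield a ranking signal."""
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for winner, loser in pairwise:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return {img: wins[img] / appearances[img] for img in appearances}
```

A model trained on such relative targets learns to separate subtly different images in the same series, which absolute-score regression tends to collapse.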
Chinese Translation
图像美学评估(IAA)在内容创作、相册管理和推荐系统等领域有广泛的应用。在这些应用中,通常需要从一系列具有细微美学差异的图像中挑选出最具美感的图像,这一主题我们称之为细粒度IAA。不幸的是,当前最先进的IAA模型通常是为粗粒度评估而设计的,其中具有显著美学差异的图像在绝对尺度上独立评估。这些模型在区分细粒度美学差异方面固有地受到限制。为了解决这一困境,我们贡献了FGAesthetics,一个包含32,217张图像的细粒度IAA数据库,这些图像被组织成10,028个系列,来源于自然、AIGC和裁剪等多种类别。通过对每个系列内的成对比较收集注释。我们还设计了系列精炼和排名校准,以确保数据和标签的可靠性。基于FGAesthetics,我们进一步提出了FGAesQ,一个新颖的IAA框架,通过差异保留标记化(Difference-preserved Tokenization,DiffToken)、比较文本辅助对齐(Comparative Text-assisted Alignment,CTAlign)和排名感知回归(Rank-aware Regression,RankReg)从相对排名中学习区分性美学评分。FGAesQ能够在细粒度场景中实现准确的美学评估,同时在粗粒度评估中仍保持竞争力。大量实验和比较表明了所提方法的优越性。
cs.CV / 56 / 2603.03930
N-gram Injection into Transformers for Dynamic Language Model Adaptation in Handwritten Text Recognition
将 N-gram 注入变换器以实现手写文本识别中的动态语言模型适应
Abstract
Transformer-based encoder-decoder networks have recently achieved impressive results in handwritten text recognition, partly thanks to their auto-regressive decoder which implicitly learns a language model. However, such networks suffer from a large performance drop when evaluated on a target corpus whose language distribution is shifted from the source text seen during training. To retain recognition accuracy despite this language shift, we propose an external n-gram injection (NGI) for dynamic adaptation of the network's language modeling at inference time. Our method allows switching to an n-gram language model estimated on a corpus close to the target distribution, therefore mitigating bias without any extra training on target image-text pairs. We opt for an early injection of the n-gram into the transformer decoder so that the network learns to fully leverage text-only data at the low additional cost of n-gram inference. Experiments on three handwritten datasets demonstrate that the proposed NGI significantly reduces the performance gap between source and target corpora.
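For contrast with the early injection described above, the most common baseline for combining an auto-regressive decoder with an external n-gram model is log-linear (shallow) fusion of their next-token scores. The sketch below shows that baseline only; it is explicitly *not* the paper's NGI mechanism, which feeds n-gram information into the transformer decoder itself so the network learns to exploit it.

```python
import math

def fuse_scores(decoder_logprobs, ngram_logprobs, lam=0.3):
    """Shallow fusion: log-linear interpolation of decoder and
    external n-gram next-character log-probabilities, then
    renormalization. A generic baseline, not the paper's NGI."""
    floor = math.log(1e-10)  # back-off for characters the n-gram lacks
    fused = {c: lp + lam * ngram_logprobs.get(c, floor)
             for c, lp in decoder_logprobs.items()}
    logz = math.log(sum(math.exp(v) for v in fused.values()))
    return {c: v - logz for c, v in fused.items()}
```

Switching the n-gram table here is what "dynamic adaptation at inference time" amounts to in either scheme: no image-text retraining is needed, only a language model estimated on text close to the target distribution.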
Chinese Translation
基于变换器的编码器-解码器网络在手写文本识别中最近取得了令人瞩目的成果,部分原因在于其自回归解码器隐式地学习了语言模型。然而,当在目标语料库上评估时,这些网络的性能会大幅下降,尤其是当目标语料的语言分布与训练期间看到的源文本发生偏移时。为了在这种语言偏移的情况下保持识别准确性,我们提出了一种外部 N-gram 注入(NGI)方法,以实现网络在推理时的动态语言建模适应。我们的方法允许切换到在接近目标分布的语料库上估计的 N-gram 语言模型,从而在不对目标图像-文本对进行额外训练的情况下减轻偏差。我们选择在变换器解码器中早期注入 N-gram,以便网络能够充分利用仅包含文本的数据,且仅需低额外成本的 N-gram 推理。对三个手写数据集的实验表明,所提出的 NGI 显著缩小了源语料和目标语料之间的性能差距。
cs.CV / 57 / 2603.03935
DISC: Dense Integrated Semantic Context for Large-Scale Open-Set Semantic Mapping
DISC:大规模开放集语义映射的密集集成语义上下文
Abstract
Open-set semantic mapping enables language-driven robotic perception, but current instance-centric approaches are bottlenecked by context-depriving and computationally expensive crop-based feature extraction. To overcome this fundamental limitation, we introduce DISC (Dense Integrated Semantic Context), featuring a novel single-pass, distance-weighted extraction mechanism. By deriving high-fidelity CLIP embeddings directly from the vision transformer's intermediate layers, our approach eliminates the latency and domain-shift artifacts of traditional image cropping, yielding pure, mask-aligned semantic representations. To fully leverage these features in large-scale continuous mapping, DISC is built upon a fully GPU-accelerated architecture that replaces periodic offline processing with precise, on-the-fly voxel-level instance refinement. We evaluate our approach on standard benchmarks (Replica, ScanNet) and a newly generated large-scale-mapping dataset based on Habitat-Matterport 3D (HM3DSEM) to assess scalability across complex scenes in multi-story buildings. Extensive evaluations demonstrate that DISC significantly surpasses current state-of-the-art zero-shot methods in both semantic accuracy and query retrieval, providing a robust, real-time capable framework for robotic deployment. The full source code, data generation and evaluation pipelines will be made available at https://github.com/DFKI-NI/DISC.
Chinese Translation
开放集语义映射使得基于语言的机器人感知成为可能,但当前的实例中心方法受到缺乏上下文和计算开销大的基于裁剪的特征提取的限制。为了解决这一根本性限制,我们提出了DISC(Dense Integrated Semantic Context),其特点是采用一种新颖的单次传递、距离加权的提取机制。通过直接从视觉变换器的中间层推导高保真度的CLIP嵌入,我们的方法消除了传统图像裁剪的延迟和领域转移伪影,产生纯净的、与掩膜对齐的语义表示。为了在大规模连续映射中充分利用这些特征,DISC建立在一个完全GPU加速的架构上,替代了周期性的离线处理,采用精确的即时体素级实例细化。我们在标准基准(Replica,ScanNet)和基于Habitat-Matterport 3D(HM3DSEM)新生成的大规模映射数据集上评估了我们的方法,以评估其在多层建筑复杂场景中的可扩展性。大量评估表明,DISC在语义准确性和查询检索方面显著超越当前最先进的零样本方法,为机器人部署提供了一个强大且具实时能力的框架。完整的源代码、数据生成和评估流程将在https://github.com/DFKI-NI/DISC上公开。
cs.CV / 58 / 2603.03939
Cross-Modal Mapping and Dual-Branch Reconstruction for 2D-3D Multimodal Industrial Anomaly Detection
跨模态映射与双分支重建用于二维-三维多模态工业异常检测
Abstract
Multimodal industrial anomaly detection benefits from integrating RGB appearance with 3D surface geometry, yet existing \emph{unsupervised} approaches commonly rely on memory banks, teacher-student architectures, or fragile fusion schemes, limiting robustness under noisy depth, weak texture, or missing modalities. This paper introduces \textbf{CMDR-IAD}, a lightweight and modality-flexible unsupervised framework for reliable anomaly detection in 2D+3D multimodal as well as single-modality (2D-only or 3D-only) settings. \textbf{CMDR-IAD} combines bidirectional 2D$\leftrightarrow$3D cross-modal mapping to model appearance-geometry consistency with dual-branch reconstruction that independently captures normal texture and geometric structure. A two-part fusion strategy integrates these cues: a reliability-gated mapping anomaly highlights spatially consistent texture-geometry discrepancies, while a confidence-weighted reconstruction anomaly adaptively balances appearance and geometric deviations, yielding stable and precise anomaly localization even in depth-sparse or low-texture regions. On the MVTec 3D-AD benchmark, CMDR-IAD achieves state-of-the-art performance while operating without memory banks, reaching 97.3\% image-level AUROC (I-AUROC), 99.6\% pixel-level AUROC (P-AUROC), and 97.6\% AUPRO. On a real-world polyurethane cutting dataset, the 3D-only variant attains 92.6\% I-AUROC and 92.5\% P-AUROC, demonstrating strong effectiveness under practical industrial conditions. These results highlight the framework's robustness, modality flexibility, and the effectiveness of the proposed fusion strategies for industrial visual inspection. Our source code is available at https://github.com/ECGAI-Research/CMDR-IAD/
Chinese Translation
多模态工业异常检测通过将RGB外观与三维表面几何结合而受益,然而现有的无监督方法通常依赖于记忆库、师生架构或脆弱的融合方案,这限制了在噪声深度、弱纹理或缺失模态下的鲁棒性。本文提出了CMDR-IAD,一个轻量级且模态灵活的无监督框架,旨在实现二维+三维多模态以及单一模态(仅二维或仅三维)环境下的可靠异常检测。CMDR-IAD结合了双向的二维与三维跨模态映射,以建模外观-几何一致性,并通过双分支重建独立捕捉正常纹理和几何结构。一个两部分融合策略整合了这些线索:可靠性门控的映射异常突出空间一致的纹理-几何差异,而置信度加权的重建异常自适应平衡外观和几何偏差,即使在深度稀疏或低纹理区域也能实现稳定而精确的异常定位。在MVTec 3D-AD基准测试中,CMDR-IAD在不使用记忆库的情况下实现了最先进的性能,达到了97.3%的图像级AUROC(I-AUROC)、99.6%的像素级AUROC(P-AUROC)和97.6%的AUPRO。在一个真实的聚氨酯切割数据集上,仅三维变体达到了92.6%的I-AUROC和92.5%的P-AUROC,展示了在实际工业条件下的强大有效性。这些结果突显了该框架的鲁棒性、模态灵活性以及所提出的融合策略在工业视觉检测中的有效性。我们的源代码可在https://github.com/ECGAI-Research/CMDR-IAD/获取。
cs.CV / 59 / 2603.03941
Slice-wise quality assessment of high b-value breast DWI via deep learning-based artifact detection
基于深度学习的高b值乳腺扩散加权成像切片级质量评估
Abstract
Diffusion-weighted imaging (DWI) can support lesion detection and characterization in breast magnetic resonance imaging (MRI); however, high b-value diffusion-weighted acquisitions in particular can be prone to intensity artifacts that affect diagnostic image assessment. This study aims to detect both hyper- and hypointense artifacts on high b-value diffusion-weighted images (b=1500 s/mm2) using deep learning, employing either a binary classification (artifact presence) or a multiclass classification (artifact intensity) approach on a slice-wise dataset. This IRB-approved retrospective study used a single-center dataset comprising n=11806 slices from routine 3T breast MRI examinations performed between 2022 and mid-2023. Three convolutional neural network (CNN) architectures (DenseNet121, ResNet18, and SEResNet50) were trained for binary classification of hyper- and hypointense artifacts. The best performing model (DenseNet121) was applied to an independent holdout test set and was further trained separately for multiclass classification. Evaluation included area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), precision, and recall, as well as analysis of predicted bounding box positions derived from the network's Grad-CAM heatmaps. DenseNet121 achieved AUROCs of 0.92 and 0.94 for hyper- and hypointense artifact detection, respectively, and weighted AUROCs of 0.85 and 0.88 for multiclass classification on single-slice high b-value diffusion-weighted images. A radiologist evaluated bounding box precision on a 1-5 Likert-like scale across 200 slices, yielding mean scores of 3.33±1.04 for hyperintense artifacts and 2.62±0.81 for hypointense artifacts. Hyper- and hypointense artifact detection in a slice-wise breast DWI dataset (b=1500 s/mm2) using CNNs, particularly DenseNet121, appears promising and requires further validation.
Chinese Translation
扩散加权成像(DWI)可以支持乳腺磁共振成像(MRI)中的病变检测和特征描述,然而,特别是高b值扩散加权采集可能会受到强度伪影的影响,从而影响诊断图像评估。本研究旨在利用深度学习检测高b值扩散加权图像(b=1500 s/mm²)上的高强度和低强度伪影,采用切片级数据集进行二分类(伪影存在)或多分类(伪影强度)的方法。本研究为IRB批准的回顾性研究,使用了一个单中心数据集,包含2022年至2023年中期进行的常规3T乳腺MRI检查的11806个切片。我们训练了三种卷积神经网络(CNN)架构(DenseNet121、ResNet18和SEResNet50)用于高强度和低强度伪影的二分类。表现最佳的模型(DenseNet121)应用于独立的保留测试集,并进一步单独训练用于多分类。评估指标包括接收者操作特征曲线下面积(AUROC)、精确度-召回曲线下面积(AUPRC)、精确度和召回率,以及基于网络Grad-CAM热图分析的预测边界框位置。DenseNet121在高强度和低强度伪影检测中分别达到了0.92和0.94的AUROC,并在单切片高b值扩散加权图像的多分类中获得了0.85和0.88的加权AUROC。一名放射科医师在200个切片上使用1-5李克特量表评估边界框的精确度,分别获得高强度伪影的平均得分为3.33±1.04,低强度伪影的平均得分为2.62±0.81。在切片级乳腺DWI MRI数据集(b=1500 s/mm²)中使用CNN(特别是DenseNet121)检测高强度和低强度伪影的效果似乎很有前景,并需要进一步验证。
cs.CV / 60 / 2603.03944
Spatial Causal Prediction in Video
视频中的空间因果预测
Abstract
Spatial reasoning, the ability to understand spatial relations, causality, and dynamic evolution, is central to human intelligence and essential for real-world applications such as autonomous driving and robotics. Existing studies, however, primarily assess models on visible spatio-temporal understanding, overlooking their ability to infer unseen past or future spatial states. In this work, we introduce Spatial Causal Prediction (SCP), a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes. We further construct SCP-Bench, a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions, to support systematic evaluation. Through comprehensive experiments on 23 state-of-the-art models, we reveal substantial gaps between human and model performance, limited temporal extrapolation, and weak causal grounding. We further analyze key factors influencing performance and propose perception-enhancement and reasoning-guided strategies toward advancing spatial causal intelligence. The project page is https://guangstrip.github.io/SCP-Bench.
Chinese Translation
空间推理,即理解空间关系、因果性和动态演变的能力,是人类智能的核心,并且对于自动驾驶和机器人等现实世界应用至关重要。然而,现有研究主要评估模型在可见时空理解上的表现,忽视了它们推断未见的过去或未来空间状态的能力。在本研究中,我们引入了空间因果预测(Spatial Causal Prediction, SCP)这一新任务范式,挑战模型超越观察进行推理,并预测空间因果结果。我们进一步构建了SCP-Bench,这是一个基准数据集,包含2,500个问答对,涵盖1,181个视频,涉及多样的视角、场景和因果方向,以支持系统评估。通过对23个最先进模型的全面实验,我们揭示了人类与模型性能之间的显著差距、有限的时间外推能力和薄弱的因果基础。我们还分析了影响性能的关键因素,并提出了感知增强和推理引导策略,以推动空间因果智能的发展。项目页面为 https://guangstrip.github.io/SCP-Bench。
cs.CV / 61 / 2603.03956
Towards Generalized Multimodal Homography Estimation
面向广义多模态单应性估计
Abstract
Supervised and unsupervised homography estimation methods depend on image pairs tailored to specific modalities to achieve high accuracy. However, their performance deteriorates substantially when applied to unseen modalities. To address this issue, we propose a training data synthesis method that generates unaligned image pairs with ground-truth offsets from a single input image. Our approach renders the image pairs with diverse textures and colors while preserving their structural information. These synthetic data empower the trained model to achieve greater robustness and improved generalization across various domains. Additionally, we design a network to fully leverage cross-scale information and decouple color information from feature representations, thus improving estimation accuracy. Extensive experiments show that our training data synthesis method improves generalization performance. The results also confirm the effectiveness of the proposed network.
Chinese Translation
监督和无监督的单应性估计方法依赖于特定模态的图像对以实现高精度。然而,当应用于未见过的模态时,它们的性能显著下降。为了解决这个问题,我们提出了一种训练数据合成方法,该方法从单个输入图像生成具有真实偏移量的未对齐图像对。我们的方法在保留结构信息的同时,生成具有多样化纹理和颜色的图像对。这些合成数据使得训练模型在各个领域中实现更强的鲁棒性和更好的泛化能力。此外,我们设计了一个网络,以充分利用跨尺度信息,并将颜色信息与特征表示解耦,从而提高估计精度。大量实验表明,我们的训练数据合成方法提高了泛化性能。结果也证实了所提网络的有效性。
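The core synthesis idea — generating an unaligned image pair with known ground-truth offsets from a single image — can be sketched by perturbing the four image corners and solving for the exact homography that realizes the perturbation. The corner-offset parameterization and shift range below are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def homography_from_points(src, dst):
    """Exact 3x3 homography mapping four src points to four dst points
    (direct linear transform, reduced to an 8x8 linear solve)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def synth_pair_offsets(h, w, max_shift=8.0):
    """Sample random corner offsets (the ground-truth label) and return
    the homography that realizes them, plus the corners and offsets."""
    corners = np.array([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]], float)
    offsets = rng.uniform(-max_shift, max_shift, size=(4, 2))
    return homography_from_points(corners, corners + offsets), corners, offsets

def apply_h(H, pts):
    # apply a homography to Nx2 points (homogeneous multiply + dehomogenize)
    p = np.c_[pts, np.ones(len(pts))] @ H.T
    return p[:, :2] / p[:, 2:3]

H, corners, offsets = synth_pair_offsets(128, 128)
print(np.allclose(apply_h(H, corners), corners + offsets))  # → True
```

Warping the source image by `H` (e.g. with OpenCV's `warpPerspective`) then yields the second image of the unaligned pair, with `offsets` as the supervision signal; the texture and color diversification the paper describes would be applied on top.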
cs.CV / 62 / 2603.03961
ProFound: A moderate-sized vision foundation model for multi-task prostate imaging
ProFound:一种中等规模的多任务前列腺影像基础模型
Abstract
Many diagnostic and therapeutic clinical tasks for prostate cancer increasingly rely on multi-parametric MRI. Automating these tasks is challenging because they require expert interpretation, which is difficult to scale in the quantities needed to capitalise on modern deep learning. Although modern automated systems achieve expert-level performance in isolated tasks, their general clinical utility remains limited by the requirement of large task-specific labelled datasets. In this paper, we present ProFound, a domain-specialised vision foundation model for volumetric prostate mpMRI. ProFound is pre-trained using several variants of self-supervised approaches on a diverse, multi-institutional collection of 5,000 patients, with a total of over 22,000 unique 3D MRI volumes (over 1,800,000 2D image slices). We conducted a systematic evaluation of ProFound across a broad spectrum of 11 downstream clinical tasks on over 3,000 independent patients, including prostate cancer detection, Gleason grading, lesion localisation, gland volume estimation, and zonal and surrounding structure segmentation. Experimental results demonstrate that finetuned ProFound consistently outperforms or remains competitive with state-of-the-art specialised models and existing medical vision foundation models trained/finetuned on the same data.
Chinese Translation
许多前列腺癌的诊断和治疗临床任务越来越依赖于多参数磁共振成像(mpMRI)。自动化这些任务具有挑战性,因为它们需要专家的解读,而这种解读难以扩展以利用现代深度学习。尽管现代自动化系统在孤立任务中达到专家级别的性能,但由于需要大量特定任务的标注数据集,其在临床上的普遍实用性仍然受到限制。本文提出了ProFound,一种针对体积前列腺mpMRI的领域专用视觉基础模型。ProFound使用多种自监督方法在一个多机构的多样化数据集中进行预训练,该数据集包含5,000名患者,总计超过22,000个独特的3D MRI体积(超过1,800,000个2D图像切片)。我们在超过3,000名独立患者的广泛临床任务中对ProFound进行了系统评估,包括前列腺癌检测、Gleason分级、病变定位、腺体体积估计、区域及周围结构分割。实验结果表明,经过微调的ProFound在性能上始终优于或与在相同数据上训练/微调的最先进专用模型和现有医学视觉基础模型保持竞争力。
cs.CV / 63 / 2603.03964
BLOCK: An Open-Source Bi-Stage MLLM Character-to-Skin Pipeline for Minecraft
BLOCK:一个用于Minecraft的开源双阶段MLLM角色到皮肤管道
Abstract
We present \textbf{BLOCK}, an open-source bi-stage character-to-skin pipeline that generates pixel-perfect Minecraft skins from arbitrary character concepts. BLOCK decomposes the problem into (i) a \textbf{3D preview synthesis stage} driven by a large multimodal model (MLLM) with a carefully designed prompt-and-reference template, producing a consistent dual-panel (front/back) oblique-view Minecraft-style preview; and (ii) a \textbf{skin decoding stage} based on a fine-tuned FLUX.2 model that translates the preview into a skin atlas image. We further propose \textbf{EvolveLoRA}, a progressive LoRA curriculum (text-to-image $\rightarrow$ image-to-image $\rightarrow$ preview-to-skin) that initializes each phase from the previous adapter to improve stability and efficiency. BLOCK is released with all prompt templates and fine-tuned weights to support reproducible character-to-skin generation.
Chinese Translation
我们提出了\textbf{BLOCK},一个开源的双阶段角色到皮肤管道,能够从任意角色概念生成像素完美的Minecraft皮肤。BLOCK将问题分解为两个阶段:(i) 一个由大型多模态模型(MLLM)驱动的\textbf{3D预览合成阶段},采用精心设计的提示和参考模板,生成一致的双面(正面/背面)斜视Minecraft风格预览;(ii) 基于精细调优的FLUX.2模型的\textbf{皮肤解码阶段},将预览转换为皮肤图集图像。我们进一步提出了\textbf{EvolveLoRA},一个渐进式LoRA课程(文本到图像 $\rightarrow$ 图像到图像 $\rightarrow$ 预览到皮肤),通过从前一个适配器初始化每个阶段,以提高稳定性和效率。BLOCK发布了所有提示模板和精细调优权重,以支持可重复的角色到皮肤生成。
cs.CV / 64 / 2603.03967
UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization
UniRain:基于RAG的数据集蒸馏和多目标重加权优化的统一图像去雨框架
Abstract
Although significant progress has been made in image deraining, most existing methods are developed for only specific types of rain degradation and fail to generalize across diverse real-world rainy scenes. Effectively modeling different rain degradations within a universal framework is therefore important for real-world image deraining. In this paper, we propose UniRain, an effective unified image deraining framework capable of restoring images degraded by rain streak and raindrop under both daytime and nighttime conditions. To better enhance unified model generalization, we construct an intelligent retrieval augmented generation (RAG)-based dataset distillation pipeline that selects high-quality training samples from all public deraining datasets for better mixed training. Furthermore, we incorporate a simple yet effective multi-objective reweighted optimization strategy into the asymmetric mixture-of-experts (MoE) architecture to facilitate consistent performance and improve robustness across diverse scenes. Extensive experiments show that our framework performs favorably against the state-of-the-art models on our proposed benchmarks and multiple public datasets.
Chinese Translation
尽管在图像去雨方面取得了显著进展,但我们注意到大多数现有方法通常仅针对特定类型的雨水退化进行开发,无法在多样化的真实世界雨天场景中进行推广。如何在一个通用框架内有效建模不同的雨水退化,对于现实世界的图像去雨至关重要。本文提出了UniRain,一个有效的统一图像去雨框架,能够在白天和夜间条件下恢复因雨条和雨滴而退化的图像。为了更好地增强统一模型的泛化能力,我们构建了一个基于智能检索增强生成(RAG)的数据集蒸馏管道,从所有公共去雨数据集中选择高质量的训练样本,以实现更好的混合训练。此外,我们将一种简单而有效的多目标重加权优化策略融入不对称专家混合(MoE)架构中,以促进一致的性能并提高在多样场景中的鲁棒性。大量实验表明,我们的框架在我们提出的基准和多个公共数据集上相较于最先进的模型表现良好。
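The multi-objective reweighting idea can be illustrated with a simple scheme that scales each degradation type's loss by the inverse of its running mean, so that no single objective dominates the update. The EMA-based weighting below is a hypothetical stand-in for the paper's actual strategy:

```python
import numpy as np

def reweighted_loss(losses, ema, beta=0.9):
    """Combine per-objective losses (e.g. rain streak vs. raindrop) with
    weights inversely proportional to each objective's running mean."""
    losses = np.asarray(losses, float)
    ema = beta * np.asarray(ema, float) + (1 - beta) * losses  # update EMA
    weights = 1.0 / ema
    weights /= weights.sum()                                   # normalize
    return float(np.sum(weights * losses)), ema

# two objectives with equal history but unequal current loss
total, ema = reweighted_loss([2.0, 1.0], [1.0, 1.0])
print(round(total, 4))  # → 1.4762
```

The returned `ema` would be carried across training steps; an objective that has recently been large receives a smaller weight, pushing the optimizer toward balanced progress across scenes.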
cs.CV / 65 / 2603.03969
Scaling Dense Event-Stream Pretraining from Visual Foundation Models
从视觉基础模型扩展密集事件流预训练
Abstract
Learning versatile, fine-grained representations from irregular event streams is pivotal yet nontrivial, primarily due to the heavy annotation that hinders scalability in dataset size, semantic richness, and application scope. To mitigate this dilemma, we introduce a novel self-supervised pretraining method that distills visual foundation models (VFMs) to push the boundaries of event representation at scale. Specifically, we curate an extensive synchronized image-event collection to amplify cross-modal alignment. Nevertheless, due to inherent mismatches in sparsity and granularity between image-event domains, existing distillation paradigms are prone to semantic collapse in event representations, particularly at high resolutions. To bridge this gap, we propose to extend the alignment objective to semantic structures provided off-the-shelf by VFMs, indicating a broader receptive field and stronger supervision. The key ingredient of our method is a structure-aware distillation loss that grounds higher-quality image-event correspondences for alignment, optimizing dense event representations. Extensive experiments demonstrate that our approach achieves substantial gains on downstream benchmarks, significantly surpassing traditional methods and existing pretraining techniques. This breakthrough manifests in enhanced generalization, superior data efficiency and elevated transferability.
Chinese Translation
从不规则事件流中学习多功能、细粒度的表示至关重要,但并非易事,主要是由于繁重的标注工作阻碍了数据集规模、语义丰富性和应用范围的扩展。为了解决这一难题,我们推出了一种新颖的自监督预训练方法,该方法提炼视觉基础模型(VFMs),以推动事件表示的规模化发展。具体而言,我们策划了一个广泛的同步图像-事件集合,以增强跨模态对齐。然而,由于图像-事件领域之间在稀疏性和粒度上的固有不匹配,现有的蒸馏范式在事件表示中容易出现语义崩溃,特别是在高分辨率下。为了弥补这一差距,我们提出将对齐目标扩展到由VFMs提供的语义结构,这表明了更广泛的感受野和更强的监督。我们方法的关键成分是一个结构感知的蒸馏损失,它为对齐奠定了更高质量的图像-事件对应关系,从而优化密集事件表示。大量实验表明,我们的方法在下游基准测试中取得了巨大飞跃,显著超越了传统方法和现有的预训练技术。这一突破体现在增强的泛化能力、优越的数据效率和提升的迁移能力上。
cs.CV / 66 / 2603.03983
GeoSeg: Training-Free Reasoning-Driven Segmentation in Remote Sensing Imagery
GeoSeg:无训练推理驱动的遥感影像分割
Abstract
Recent advances in MLLMs are reframing segmentation from fixed-category prediction to instruction-grounded localization. While reasoning-based segmentation has progressed rapidly in natural scenes, remote sensing lacks a generalizable solution due to the prohibitive cost of reasoning-oriented data and domain-specific challenges like overhead viewpoints. We present GeoSeg, a zero-shot, training-free framework that bypasses the supervision bottleneck for reasoning-driven remote sensing segmentation. GeoSeg couples MLLM reasoning with precise localization via: (i) bias-aware coordinate refinement to correct systematic grounding shifts and (ii) a dual-route prompting mechanism to fuse semantic intent with fine-grained spatial cues. We also introduce GeoSeg-Bench, a diagnostic benchmark of 810 image--query pairs with hierarchical difficulty levels. Experiments show that GeoSeg consistently outperforms all baselines, with extensive ablations confirming the effectiveness and necessity of each component.
Chinese Translation
近期在多模态大语言模型(MLLMs)方面的进展正在将分割从固定类别预测转变为基于指令的定位。尽管基于推理的分割在自然场景中迅速发展,但由于推理导向数据的高昂成本和特定领域的挑战(如高空视角),遥感领域缺乏可推广的解决方案。我们提出了GeoSeg,一个零样本、无训练的框架,绕过了推理驱动的遥感分割中的监督瓶颈。GeoSeg通过以下方式将MLLM推理与精确定位相结合:(i)偏差感知坐标细化,以纠正系统性定位偏移;(ii)双路提示机制,将语义意图与细粒度空间线索融合。我们还引入了GeoSeg-Bench,这是一个包含810对图像-查询对的诊断基准,具有分层难度级别。实验表明,GeoSeg在所有基线方法中始终表现优异,广泛的消融实验确认了每个组件的有效性和必要性。
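The bias-aware coordinate refinement component — correcting a systematic grounding shift — can be sketched as estimating a mean residual on a small calibration set and subtracting it from subsequent predictions. The calibration procedure shown is an assumption for illustration, not GeoSeg's published algorithm:

```python
import numpy as np

def estimate_grounding_bias(pred_centers, true_centers):
    """Systematic shift = mean residual between predicted and reference
    box centres on a small calibration set (hypothetical procedure)."""
    return np.mean(np.asarray(true_centers, float)
                   - np.asarray(pred_centers, float), axis=0)

def refine(pred_boxes, bias):
    """Apply the estimated (dx, dy) shift to (x1, y1, x2, y2) boxes."""
    return np.asarray(pred_boxes, float) + np.tile(bias, 2)

# two calibration examples whose predictions are consistently off by (-3, +2)
bias = estimate_grounding_bias([[20, 20], [60, 50]], [[23, 18], [63, 48]])
print(bias)                              # → [ 3. -2.]
print(refine([[10, 10, 30, 30]], bias))  # → [[13.  8. 33. 28.]]
```

Because the correction is a constant offset, it only addresses *systematic* grounding shifts, which matches the failure mode the abstract describes.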
cs.CV / 67 / 2603.03985
RIVER: A Real-Time Interaction Benchmark for Video LLMs
RIVER:视频大语言模型的实时交互基准
Abstract
The rapid advancement of multimodal large language models has demonstrated impressive capabilities, yet nearly all operate in an offline paradigm, hindering real-time interactivity. Addressing this gap, we introduce the Real-tIme Video intERaction Bench (RIVER Bench), designed for evaluating online video comprehension. RIVER Bench introduces a novel framework comprising Retrospective Memory, Live-Perception, and Proactive Anticipation tasks, closely mimicking interactive dialogues rather than responding to entire videos at once. We conducted detailed annotations using videos from diverse sources and varying lengths, and precisely defined the real-time interactive format. Evaluations across various model categories reveal that while offline models perform well in single question-answering tasks, they struggle with real-time processing. Addressing the limitations of existing models in online video interaction, especially their deficiencies in long-term memory and future perception, we propose a general improvement method that enables models to interact with users more flexibly in real time. We believe this work will significantly advance the development of real-time interactive video understanding models and inspire future research in this emerging field. Datasets and code are publicly available at https://github.com/OpenGVLab/RIVER.
Chinese Translation
多模态大语言模型的快速发展展现了令人印象深刻的能力,但几乎所有模型都在离线模式下运行,限制了实时交互的可能性。为了解决这一问题,我们引入了实时视频交互基准(RIVER Bench),旨在评估在线视频理解。RIVER Bench引入了一个新颖的框架,包括回顾性记忆、实时感知和主动预期任务,紧密模拟互动对话,而不是一次性响应整个视频。我们使用来自不同来源和不同长度的视频进行了详细的注释,并精确定义了实时交互格式。在对各种模型类别的评估中,我们发现,尽管离线模型在单一问答任务中表现良好,但在实时处理方面却面临挑战。针对现有模型在在线视频交互中的局限性,特别是在长期记忆和未来感知方面的不足,我们提出了一种通用改进方法,使模型能够在实时中更灵活地与用户互动。我们相信,这项工作将显著推动实时交互视频理解模型的发展,并激发未来在这一新兴领域的研究。数据集和代码已公开发布在 https://github.com/OpenGVLab/RIVER。
cs.CV / 68 / 2603.03989
When Visual Evidence is Ambiguous: Pareidolia as a Diagnostic Probe for Vision Models
当视觉证据模糊不清时:作为视觉模型诊断探针的错觉面孔现象
Abstract
When visual evidence is ambiguous, vision models must decide whether to interpret face-like patterns as meaningful. Face pareidolia, the perception of faces in non-face objects, provides a controlled probe of this behavior. We introduce a representation-level diagnostic framework that analyzes detection, localization, uncertainty, and bias across class, difficulty, and emotion in face pareidolia images. Under a unified protocol, we evaluate six models spanning four representational regimes: vision-language models (VLMs; CLIP-B/32, CLIP-L/14, LLaVA-1.5-7B), pure vision classification (ViT), general object detection (YOLOv8), and face detection (RetinaFace). Our analysis reveals three mechanisms of interpretation under ambiguity. VLMs exhibit semantic overactivation, systematically pulling ambiguous non-human regions toward the Human concept, with LLaVA-1.5-7B producing the strongest and most confident over-calls, especially for negative emotions. ViT instead follows an uncertainty-as-abstention strategy, remaining diffuse yet largely unbiased. Detection-based models achieve low bias through conservative priors that suppress pareidolia responses even when localization is controlled. These results show that behavior under ambiguity is governed more by representational choices than score thresholds, and that uncertainty and bias are decoupled: low uncertainty can signal either safe suppression, as in detectors, or extreme over-interpretation, as in VLMs. Pareidolia therefore provides a compact diagnostic and a source of ambiguity-aware hard negatives for probing and improving the semantic robustness of vision-language systems. Code will be released upon publication.
Chinese Translation
当视觉证据模糊不清时,视觉模型必须决定是否将面部特征模式解读为有意义的。面部错觉(face pareidolia)是指在非面部物体中感知面孔,这为这种行为提供了一个受控的探测工具。我们引入了一种表示层面的诊断框架,分析面部错觉图像中的检测、定位、不确定性和偏差,涵盖类别、难度和情感。在统一的协议下,我们评估了六个模型,涵盖四种表示机制:视觉-语言模型(VLMs;CLIP-B/32、CLIP-L/14、LLaVA-1.5-7B)、纯视觉分类(ViT)、一般物体检测(YOLOv8)和面部检测(RetinaFace)。我们的分析揭示了在模糊情况下的三种解读机制。VLMs表现出语义过度激活,系统性地将模糊的非人类区域拉向“人类”概念,其中LLaVA-1.5-7B产生了最强和最自信的过度判断,尤其是在负面情绪的情况下。相比之下,ViT采取了一种不确定性作为弃权的策略,保持模糊但基本无偏。基于检测的模型通过保守的先验实现低偏差,即使在控制定位时也抑制了错觉反应。这些结果表明,在模糊情况下的行为更多地受制于表示选择而非分数阈值,并且不确定性与偏差是解耦的:低不确定性可以表示安全抑制(如检测器)或极端过度解读(如VLMs)。因此,错觉面孔现象提供了一个紧凑的诊断工具和一种意识到模糊的困难负样本来源,用于探测和改善视觉-语言系统的语义鲁棒性。代码将在发表后发布。
cs.CV / 69 / 2603.03991
Weakly Supervised Patch Annotation for Improved Screening of Diabetic Retinopathy
弱监督补丁注释用于改善糖尿病视网膜病变的筛查
Abstract
Diabetic Retinopathy (DR) requires timely screening to prevent irreversible vision loss. However, its early detection remains a significant challenge since subtle pathological manifestations (lesions) are often overlooked due to insufficient annotation. Existing literature primarily focuses on image-level supervision, weakly-supervised localization, and clustering-based representation learning, which fail to systematically annotate unlabeled lesion region(s) for refining the dataset. Expert-driven lesion annotation is labor-intensive and often incomplete, limiting the performance of deep learning models. We introduce Similarity-based Annotation via Feature-space Ensemble (SAFE), a two-stage framework that unifies weak supervision, contrastive learning, and patch-wise embedding inference to systematically expand sparse annotations in the pathology. SAFE preserves fine-grained details of the lesion(s) under partial clinical supervision. In the first stage, a dual-arm Patch Embedding Network learns semantically structured, class-discriminative embeddings from expert-annotated patches. Next, an ensemble of independent embedding spaces extrapolates labels to the unannotated regions based on spatial and semantic proximity. An abstention mechanism ensures a trade-off between highly reliable annotation and noisy coverage. Experimental results demonstrate reliable separation of healthy and diseased patches, achieving up to 0.9886 accuracy. The annotation generated by SAFE substantially improves downstream tasks such as DR classification, demonstrating a substantial increase in the F1-score of the diseased class and a performance gain as high as 0.545 in Area Under the Precision-Recall Curve (AUPRC). Qualitative analysis with explainability confirms that SAFE focuses on clinically relevant lesion patterns, which is further validated by ophthalmologists.
Chinese Translation
糖尿病视网膜病变(Diabetic Retinopathy, DR)需要及时筛查以防止不可逆的视力丧失。然而,由于早期检测面临重大挑战,往往由于注释不足而忽视了细微的病理表现(病变)。现有文献主要集中在图像级监督、弱监督定位和基于聚类的表示学习,这些方法未能系统地注释未标记的病变区域,从而精炼数据集。专家驱动的病变注释劳动强度大且常常不完整,限制了深度学习模型的性能。我们提出了一种基于相似性的特征空间集成注释方法(Similarity-based Annotation via Feature-space Ensemble, SAFE),这是一个两阶段框架,统一了弱监督、对比学习和补丁级嵌入推断,以系统性地扩展病理学中的稀疏注释。SAFE在部分临床监督下保留了病变的细粒度细节。在第一阶段,双臂补丁嵌入网络从专家注释的补丁中学习语义结构化和类别区分的嵌入。接下来,独立嵌入空间的集成根据空间和语义相似性将标签推断到未注释区域。一个弃权机制确保了高可靠性注释与噪声覆盖之间的权衡。实验结果表明,健康和病变补丁的可靠分离,准确率高达0.9886。SAFE生成的注释显著改善了下游任务,如DR分类,显示出病变类别F1分数的显著提高,以及在精确-召回曲线下面积(Area Under the Precision-Recall Curve, AUPRC)中高达0.545的性能提升。定性分析和可解释性确认SAFE关注临床相关的病变模式,并得到了眼科医生的进一步验证。
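The second-stage label extrapolation with abstention can be sketched as a nearest-neighbour vote in embedding space that refuses to label patches whose neighbours disagree. The `k` value and agreement threshold below are illustrative, not SAFE's actual ensemble configuration:

```python
import numpy as np

def propagate_labels(emb_labeled, labels, emb_query, k=3, agree=1.0):
    """Majority vote over the k nearest labelled embeddings; abstain (-1)
    unless the neighbour agreement reaches the `agree` fraction."""
    emb_labeled = np.asarray(emb_labeled, float)
    labels = np.asarray(labels)
    out = []
    for q in np.asarray(emb_query, float):
        d = np.linalg.norm(emb_labeled - q, axis=1)   # distances to labelled set
        nn = labels[np.argsort(d)[:k]]                # k nearest labels
        vals, counts = np.unique(nn, return_counts=True)
        best = counts.argmax()
        out.append(vals[best] if counts[best] / k >= agree else -1)
    return np.array(out)

emb = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], float)  # annotated patches
lab = np.array([0, 0, 1, 1])                             # 0=healthy, 1=lesion
qry = np.array([[0, 0.5], [5, 5.5], [2.5, 3.0]], float)  # unannotated patches
print(propagate_labels(emb, lab, qry, k=2))  # → [ 0  1 -1]
```

The third query sits between the clusters, so its neighbours disagree and the propagation abstains — the trade-off between reliable coverage and noise that the abstract describes.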
cs.CV / 70 / 2603.04002
Discriminative Perception via Anchored Description for Reasoning Segmentation
通过锚定描述实现的判别感知用于推理分割
Abstract
Reasoning segmentation increasingly employs reinforcement learning to generate explanatory reasoning chains that guide Multimodal Large Language Models. While the geometric rewards used in such training are primarily confined to guiding the final localization, they are incapable of discriminating whether the reasoning process remains anchored on the referred region or strays into irrelevant context. Lacking this discriminative guidance, the model's reasoning often devolves into unfocused and verbose chains that ultimately fail to disambiguate and perceive the target in complex scenes. This suggests a need to complement the RL objective with Discriminative Perception, an ability to actively distinguish a target from its context. To realize this, we propose DPAD to compel the model to generate a descriptive caption of the referred object, which is then used to explicitly discriminate by contrasting the caption's semantic relevance to the referred object against the wider context. By optimizing for this discriminative capability, the model is forced to focus on the unique attributes of the target, leading to a more converged and efficient reasoning chain. The descriptive caption also serves as an interpretability rationale that aligns with the segmentation. Experiments on the benchmarks confirm the validity of our approach, delivering substantial performance gains, with the cIoU on ReasonSeg increasing by 3.09% and the reasoning chain length decreasing by approximately 42%. Code is available at https://github.com/mrazhou/DPAD.
Chinese Translation
推理分割越来越多地采用强化学习生成解释性推理链,以指导多模态大型语言模型。尽管这些几何奖励主要用于指导最终的定位,但它们无法区分推理过程是否仍然锚定在所提及区域,或是偏离到无关的上下文中。缺乏这种判别性指导,模型的推理往往退化为不集中且冗长的链条,最终无法在复杂场景中消歧并感知目标。这表明需要用判别感知来补充强化学习目标,即主动区分目标与其上下文的能力。为实现这一点,我们提出了DPAD,迫使模型生成所提及对象的描述性标题,然后通过对比标题与所提及对象的语义相关性与更广泛上下文的关系,进行明确的区分。通过优化这种判别能力,模型被迫关注目标的独特属性,从而形成更为聚焦和高效的推理链。描述性标题还作为与分割相一致的可解释性理由。基准测试的实验确认了我们方法的有效性,带来了显著的性能提升,其中ReasonSeg的cIoU提高了3.09%,推理链长度减少了约42%。代码可在 https://github.com/mrazhou/DPAD 获取。
cs.CV / 71 / 2603.04022
Rethinking the Efficiency and Effectiveness of Reinforcement Learning for Radiology Report Generation
重新思考强化学习在放射学报告生成中的效率与有效性
Abstract
Radiologists highly desire fully automated AI for radiology report generation (R2G), yet existing approaches fall short in clinical utility. Reinforcement learning (RL) holds potential to address these shortcomings, but its adoption in this task remains underexplored. In this paper, we revisit RL in terms of data efficiency and optimization effectiveness for R2G tasks. First, we explore the impact of data quantity and quality on the performance of RL in medical contexts, revealing that data quality plays a more critical role than quantity. To this end, we propose a diagnostic diversity-based data sampling strategy that enables comparable performance with fewer samples. Second, we observe that the majority of tokens in radiology reports are template-like and diagnostically uninformative, whereas the low frequency of clinically critical tokens heightens the risk of being overlooked during optimization. To tackle this, we introduce Diagnostic Token-weighted Policy Optimization (DiTPO), which directly optimizes for clinical accuracy by using a diagnostic F1 score as the reward signal. Unlike standard RL approaches that treat all tokens equally, DiTPO explicitly models the varying importance of different tokens through rule- or gradient-based mechanisms to prioritize clinically relevant content. Extensive experiments on the MIMIC-CXR, IU-Xray, and CheXpert Plus datasets demonstrate that our framework achieves state-of-the-art (SOTA) performance while requiring substantially fewer training samples in RL. Notably, on MIMIC-CXR, our framework attains an F1 score of 0.516 using only 20% of the RL training samples.
Chinese Translation
放射科医生非常希望能够实现完全自动化的人工智能用于放射学报告生成(R2G),然而现有的方法在临床应用上仍显不足。强化学习(RL)有潜力解决这些不足,但其在此任务中的应用仍未得到充分探索。本文重新审视了RL在R2G任务中的数据效率和优化有效性。首先,我们探讨了数据数量和质量对医疗背景下RL性能的影响,揭示了数据质量在其中比数量更为关键。为此,我们提出了一种基于诊断多样性的数据采样策略,使得在样本较少的情况下也能实现可比的性能。其次,我们观察到放射学报告中的大多数标记呈模板化且缺乏诊断信息,而临床关键标记的低频率则增加了在优化过程中被忽视的风险。为了解决这个问题,我们引入了诊断标记加权策略优化(DiTPO),该方法通过使用诊断F1分数作为奖励信号,直接优化临床准确性。与将所有标记视为同等重要的标准RL方法不同,DiTPO通过基于规则或梯度的机制显式建模不同标记的重要性,以优先考虑临床相关内容。在MIMIC-CXR、IU-Xray和CheXpert Plus数据集上的大量实验表明,我们的框架在RL中实现了最先进(SOTA)的性能,同时所需的训练样本显著减少。值得注意的是,在MIMIC-CXR上,我们的框架仅使用20%的RL训练样本便达到了0.516的F1分数。
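The reward signal — a diagnostic F1 score over findings — is straightforward to sketch once findings are represented as sets; the upstream labeller that extracts findings from report text (e.g. a CheXbert-style model) is assumed here, not specified:

```python
def diagnostic_f1(pred_findings, ref_findings):
    """Set-level F1 between findings extracted from a generated report
    and its reference, usable as a scalar RL reward."""
    pred, ref = set(pred_findings), set(ref_findings)
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# reward for one sampled report during RL
reward = diagnostic_f1({"cardiomegaly", "edema"}, {"cardiomegaly", "effusion"})
print(reward)  # → 0.5
```

In DiTPO this scalar would then be distributed unevenly across tokens — clinically informative tokens weighted up, template-like tokens down — rather than applied uniformly as in standard policy optimization.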
cs.CV / 72 / 2603.04024
Volumetric Directional Diffusion: Anchoring Uncertainty Quantification in Anatomical Consensus for Ambiguous Medical Image Segmentation
体积方向扩散:在解剖共识中锚定不确定性量化以应对模糊医学图像分割
Abstract
Equivocal 3D lesion segmentation exhibits high inter-observer variability. Conventional deterministic models ignore this aleatoric uncertainty, producing over-confident masks that obscure clinical risks. Conversely, while generative methods (e.g., standard diffusion) capture sample diversity, recovering complex topology from pure noise frequently leads to severe structural fractures and out-of-distribution anatomical hallucinations. To resolve this fidelity-diversity trade-off, we propose Volumetric Directional Diffusion (VDD). Unlike standard diffusion models that denoise isotropic Gaussian noise, VDD mathematically anchors the generative trajectory to a deterministic consensus prior. By restricting the generative search space to iteratively predict a 3D boundary residual field, VDD accurately explores the fine-grained geometric variations inherent in expert disagreements without risking topological collapse. Extensive validation on three multi-rater datasets (LIDC-IDRI, KiTS21, and ISBI 2015) demonstrates that VDD achieves state-of-the-art uncertainty quantification (significantly improving GED and CI) while remaining highly competitive in segmentation accuracy against deterministic upper bounds. Ultimately, VDD provides clinicians with anatomically coherent uncertainty maps, enabling safer decision-making and mitigating risks in downstream tasks (e.g., radiotherapy planning or surgical margin assessment).
Chinese Translation
模糊的三维病灶分割表现出较高的观察者间变异性。传统的确定性模型忽视了这种随机不确定性,产生过于自信的掩膜,掩盖了临床风险。相反,尽管生成方法(例如,标准扩散)能够捕捉样本多样性,但从纯噪声中恢复复杂拓扑常常导致严重的结构破裂和超出分布的解剖幻觉。为了解决这种保真度与多样性之间的权衡,我们提出了体积方向扩散(Volumetric Directional Diffusion, VDD)。与去噪各向同性高斯噪声的标准扩散模型不同,VDD在数学上将生成轨迹锚定到确定性的共识先验。通过将生成搜索空间限制为迭代预测三维边界残差场,VDD能够准确探索专家意见分歧中固有的细粒度几何变异,而不冒拓扑崩溃的风险。在三个多评估者数据集(LIDC-IDRI、KiTS21和ISBI 2015)上的广泛验证表明,VDD在不确定性量化方面达到了最先进的水平(显著提高了GED和CI),同时在分割准确性上与确定性上限保持高度竞争。最终,VDD为临床医生提供了解剖一致的不确定性图,使得决策更加安全,并降低了下游任务(例如,放射治疗规划或手术边缘评估)中的风险。
cs.CV / 73 / 2603.04037
DQE-CIR: Distinctive Query Embeddings through Learnable Attribute Weights and Target Relative Negative Sampling in Composed Image Retrieval
DQE-CIR:通过可学习属性权重和目标相对负采样实现独特查询嵌入的组合图像检索
Abstract
Composed image retrieval (CIR) addresses the task of retrieving a target image by jointly interpreting a reference image and a modification text that specifies the intended change. Most existing methods are still built upon contrastive learning frameworks that treat the ground truth image as the only positive instance and all remaining images as negatives. This strategy inevitably introduces relevance suppression, where semantically related yet valid images are incorrectly pushed away, and semantic confusion, where different modification intents collapse into overlapping regions of the embedding space. As a result, the learned query representations often lack discriminativeness, particularly at fine-grained attribute modifications. To overcome these limitations, we propose distinctive query embeddings through learnable attribute weights and target relative negative sampling (DQE-CIR), a method designed to learn distinctive query embeddings by explicitly modeling target relative relevance during training. DQE-CIR incorporates learnable attribute weighting to emphasize distinctive visual features conditioned on the modification text, enabling more precise feature alignment between language and vision. Furthermore, we introduce target relative negative sampling, which constructs a target relative similarity distribution and selects informative negatives from a mid-zone region that excludes both easy negatives and ambiguous false negatives. This strategy enables more reliable retrieval for fine-grained attribute changes by improving query discriminativeness and reducing confusion caused by semantically similar but irrelevant candidates.
Chinese Translation
组合图像检索(CIR)旨在通过联合解释参考图像和指定意图变化的修改文本来检索目标图像。现有大多数方法仍基于对比学习框架,将真实图像视为唯一的正实例,而将所有其他图像视为负实例。这种策略不可避免地引入了相关性抑制,即语义相关但有效的图像被错误地推远,以及语义混淆,即不同的修改意图在嵌入空间中重叠。结果是,学习到的查询表示往往缺乏区分性,尤其是在细粒度属性修改方面。为克服这些局限性,我们提出了一种通过可学习属性权重和目标相对负采样(DQE-CIR)实现独特查询嵌入的方法,该方法旨在通过在训练过程中显式建模目标相对相关性来学习独特的查询嵌入。DQE-CIR结合了可学习的属性加权,以强调基于修改文本的独特视觉特征,从而实现语言与视觉之间更精确的特征对齐。此外,我们引入了目标相对负采样,该方法构建目标相对相似性分布,并从中间区域选择信息丰富的负样本,排除易负样本和模糊的假负样本。这一策略通过提高查询的区分性并减少由语义相似但无关候选引起的混淆,从而实现对细粒度属性变化的更可靠检索。
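The mid-zone negative selection can be sketched as a quantile band over candidate-to-target similarities, excluding both easy negatives (low similarity) and likely false negatives (high similarity). The quantile bounds are illustrative choices, not the paper's values:

```python
import numpy as np

def midzone_negatives(target_sims, low_q=0.4, high_q=0.8):
    """Indices of candidates whose similarity to the target lies strictly
    between the low_q and high_q quantiles: hard enough to be informative,
    but below the band where false negatives concentrate."""
    sims = np.asarray(target_sims, float)
    lo, hi = np.quantile(sims, [low_q, high_q])
    return np.where((sims > lo) & (sims < hi))[0]

sims = [0.05, 0.15, 0.40, 0.55, 0.60, 0.92, 0.95]
print(midzone_negatives(sims))  # → [3 4]
```

The very similar candidates (0.92, 0.95) are excluded as probable false negatives — semantically valid matches that a naive contrastive loss would wrongly push away.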
cs.CV / 74 / 2603.04056
Long-Term Visual Localization in Dynamic Benthic Environments: A Dataset, Footprint-Based Ground Truth, and Visual Place Recognition Benchmark
动态底栖环境中的长期视觉定位:数据集、基于足迹的真实值和视觉位置识别基准
Abstract
Long-term visual localization has the potential to reduce cost and improve mapping quality in optical benthic monitoring with autonomous underwater vehicles (AUVs). Despite this potential, long-term visual localization in benthic environments remains understudied, primarily due to the lack of curated datasets for benchmarking. Moreover, limited georeferencing accuracy and image footprints necessitate precise geometric information for accurate ground-truthing. In this work, we address these gaps by presenting a curated dataset for long-term visual localization in benthic environments and a novel method to ground-truth visual localization results for near-nadir underwater imagery. Our dataset comprises georeferenced AUV imagery from five benthic reference sites, revisited over periods up to six years, and includes raw and color-corrected stereo imagery, camera calibrations, and sub-decimeter registered camera poses. To our knowledge, this is the first curated underwater dataset for long-term visual localization spanning multiple sites and photic-zone habitats. Our ground-truthing method estimates 3D seafloor image footprints and links camera views with overlapping footprints, ensuring that ground-truth links reflect shared visual content. Building on this dataset and ground truth, we benchmark eight state-of-the-art visual place recognition (VPR) methods and find that Recall@K is significantly lower on our dataset than on established terrestrial and underwater benchmarks. Finally, we compare our footprint-based ground truth to a traditional location-based ground truth and show that distance-threshold ground-truthing can overestimate VPR Recall@K at sites with rugged terrain and altitude variations. Together, the curated dataset, ground-truthing method, and VPR benchmark provide a stepping stone for advancing long-term visual localization in dynamic benthic environments.
Chinese Translation
长期视觉定位有潜力降低成本并提高自主水下车辆(AUV)在光学底栖监测中的映射质量。尽管有这种潜力,但底栖环境中的长期视觉定位仍然研究不足,主要是由于缺乏用于基准测试的策划数据集。此外,有限的地理参考精度和图像足迹需要精确的几何信息以实现准确的真实值验证。在本研究中,我们通过提出一个用于底栖环境中长期视觉定位的策划数据集和一种新颖的方法来验证近天顶水下图像的视觉定位结果,从而填补这些空白。我们的数据集包含来自五个底栖参考站点的地理参考AUV图像,这些站点在长达六年的时间内进行了重访,并包括原始和色彩校正的立体图像、相机标定和亚分米级注册的相机姿态。据我们所知,这是第一个跨多个站点和光照区栖息地的长期视觉定位策划水下数据集。我们的真实值验证方法估计三维海底图像足迹,并将相机视图与重叠的足迹连接起来,确保真实值链接反映共享的视觉内容。在此数据集和真实值的基础上,我们基准测试了八种最先进的视觉位置识别(VPR)方法,发现我们的数据集上的Recall@K显著低于已建立的陆地和水下基准。最后,我们将基于足迹的真实值与传统的基于位置的真实值进行比较,表明在地形崎岖和高度变化的站点,距离阈值真实值验证可能会高估VPR的Recall@K。总之,策划的数据集、真实值验证方法和VPR基准为推动动态底栖环境中的长期视觉定位提供了一个重要的基础。
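The footprint-based ground-truthing reduces, in the simplest case, to linking image pairs whose seafloor footprints overlap sufficiently. The axis-aligned-rectangle IoU below is a deliberate simplification of the paper's 3D footprint estimation:

```python
def footprint_iou(a, b):
    """IoU of two axis-aligned footprints given as (x1, y1, x2, y2) in a
    local metric frame; real footprints are 3D seafloor polygons."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def link_views(footprints, min_iou=0.2):
    """Ground-truth links = unordered image pairs whose footprints overlap
    by at least min_iou, i.e. pairs that actually share visual content."""
    n = len(footprints)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if footprint_iou(footprints[i], footprints[j]) >= min_iou]

fp = [(0, 0, 2, 2), (1, 0, 3, 2), (10, 10, 12, 12)]
print(link_views(fp))  # → [(0, 1)]
```

Unlike a distance threshold on camera positions, this criterion naturally rejects pairs whose cameras are close but whose views do not overlap — e.g. over rugged terrain or at different altitudes, the failure case the abstract highlights.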
cs.CV / 75 / 2603.04058
TumorFlow: Physics-Guided Longitudinal MRI Synthesis of Glioblastoma Growth
肿瘤流:物理引导的胶质母细胞瘤生长的纵向MRI合成
Abstract
Glioblastoma exhibits diverse, infiltrative, and patient-specific growth patterns that are only partially visible on routine MRI, making it difficult to reliably assess true tumor extent and personalize treatment planning and follow-up. We present a biophysically-conditioned generative framework that synthesizes biologically realistic 3D brain MRI volumes from estimated, spatially continuous tumor-concentration fields. Our approach combines a generative model with tumor-infiltration maps that can be propagated through time using a biophysical growth model, enabling fine-grained control over tumor shape and growth while preserving patient anatomy. This enables us to synthesize consistent tumor growth trajectories directly in the space of real patients, providing interpretable, controllable estimation of tumor infiltration and progression beyond what is explicitly observed in imaging. We evaluate the framework on longitudinal glioblastoma cases and demonstrate that it can generate temporally coherent sequences with realistic changes in tumor appearance and surrounding tissue response. These results suggest that integrating mechanistic tumor growth priors with modern generative modeling can provide a practical tool for patient-specific progression visualization and for generating controlled synthetic data to support downstream neuro-oncology workflows. In longitudinal extrapolation, we achieve a consistent 75% Dice overlap with the biophysical model while maintaining a constant PSNR of 25 in the surrounding tissue. Our code is available at: https://github.com/valentin-biller/lgm.git
Chinese Translation
胶质母细胞瘤表现出多样化、浸润性和患者特异性的生长模式,这些模式在常规MRI中仅部分可见,导致难以可靠地评估肿瘤的真实范围,并个性化治疗计划和随访。我们提出了一种生物物理条件下的生成框架,该框架从估计的空间连续肿瘤浓度场合成生物学上真实的3D脑MRI体积。我们的方法结合了生成模型与肿瘤浸润图,这些图可以通过生物物理生长模型在时间上进行传播,从而在保持患者解剖结构的同时,实现对肿瘤形状和生长的精细控制。这使我们能够在真实患者的空间中直接合成一致的肿瘤生长轨迹,提供可解释、可控的肿瘤浸润和进展的估计,超越成像中明确观察到的内容。我们在纵向胶质母细胞瘤病例上评估了该框架,并证明它能够生成时间上连贯的序列,具有肿瘤外观和周围组织反应的真实变化。这些结果表明,将机制性肿瘤生长先验与现代生成建模相结合,可以为患者特异性的进展可视化提供实用工具,并生成受控的合成数据以支持下游神经肿瘤学工作流程。在纵向外推中,我们与生物物理模型实现了一致的75% Dice重叠,同时在周围组织中保持恒定的PSNR为25。我们的代码可在以下网址获取:https://github.com/valentin-biller/lgm.git
cs.CV / 76 / 2603.04081
Revisiting the Role of Foundation Models in Cell-Level Histopathological Image Analysis under Small-Patch Constraints -- Effects of Training Data Scale and Blur Perturbations on CNNs and Vision Transformers
在小补丁约束下重新审视基础模型在细胞级组织病理图像分析中的作用——训练数据规模和模糊扰动对卷积神经网络和视觉变换器的影响
Abstract
Background and objective: Cell-level pathological image analysis requires working with extremely small image patches (40x40 pixels), far below standard ImageNet resolutions. It remains unclear whether modern deep learning architectures and foundation models can learn robust and scalable representations under this constraint. We systematically evaluated architectural suitability and data-scale effects for small-patch cell classification. Methods: We analyzed 303 colorectal cancer specimens with CD103/CD8 immunostaining, generating 185,432 annotated cell images. Eight task-specific architectures were trained from scratch at multiple data scales (FlagLimit: 256--16,384 samples per class), and three foundation models were evaluated via linear probing and fine-tuning after resizing inputs to 224x224 pixels. Robustness to blur was assessed using pre- and post-resize Gaussian perturbations. Results: Task-specific models improved consistently with increasing data scale, whereas foundation models saturated at moderate sample sizes. A Vision Transformer optimized for small patches (CustomViT) achieved the highest accuracy, outperforming all foundation models with substantially lower inference cost. Blur robustness was comparable across architectures, with no qualitative advantage observed for foundation models. Conclusion: For cell-level classification under extreme spatial constraints, task-specific architectures are more effective and efficient than foundation models once sufficient training data are available. Higher clean accuracy does not imply superior robustness, and large pre-trained models offer limited benefit in the small-patch regime.
Chinese Translation
背景与目的:细胞级病理图像分析需要处理极小的图像补丁(40x40 像素),远低于标准的 ImageNet 分辨率。目前尚不清楚现代深度学习架构和基础模型在这一约束下是否能够学习到稳健且可扩展的表示。我们系统评估了小补丁细胞分类的架构适用性和数据规模效应。方法:我们分析了 303 例结直肠癌标本,采用 CD103/CD8 免疫染色,生成了 185,432 张标注细胞图像。八种特定任务的架构在多个数据规模下(FlagLimit:每类 256--16,384 个样本)从头开始训练,并在将输入调整为 224x224 像素后,通过线性探测和微调评估了三种基础模型。使用调整前后的高斯扰动评估了对模糊的鲁棒性。结果:特定任务模型随着数据规模的增加而持续改进,而基础模型在中等样本规模下饱和。针对小补丁优化的视觉变换器(CustomViT)达到了最高的准确率,超越了所有基础模型,并且推理成本显著较低。各架构的模糊鲁棒性相当,未观察到基础模型的定性优势。结论:在极端空间约束下进行细胞级分类时,特定任务架构在足够的训练数据可用时比基础模型更有效和高效。更高的干净准确率并不意味着更强的鲁棒性,而在小补丁范围内,大型预训练模型的益处有限。
cs.CV / 77 / 2603.04090
EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR
EgoPoseFormer v2:用于增强现实/虚拟现实的精确自我中心人类动作估计
Abstract
Egocentric human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present EgoPoseFormer v2, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables the use of large unlabeled datasets for training. Our model is fully differentiable, introduces identity-conditioned queries, multi-view spatial refinement, causal temporal attention, and supports both keypoints and parametric body representations under a constant compute budget. The auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training. The system follows a teacher-student schema to generate pseudo-labels and guide training with uncertainty distillation, enabling the model to generalize to different environments. On the EgoBody3M benchmark, with a 0.8 ms latency on GPU, our model outperforms two state-of-the-art methods by 12.2% and 19.4% in accuracy, and reduces temporal jitter by 22.2% and 51.7%. Furthermore, our auto-labeling system further improves the wrist MPJPE by 13.1%.
Chinese Translation
自我中心的人类动作估计对于增强现实(AR)和虚拟现实(VR)体验至关重要,但由于自我中心视角下身体覆盖范围有限、频繁遮挡以及标注数据稀缺,这一任务仍然具有挑战性。我们提出了EgoPoseFormer v2,这是一种通过两个关键贡献来应对这些挑战的方法:(1)基于变换器的模型,用于时间一致且空间扎根的身体姿态估计;(2)一种自动标注系统,使得可以利用大量未标注数据集进行训练。我们的模型是完全可微分的,引入了身份条件查询、多视角空间细化、因果时间注意力,并在固定计算预算下支持关键点和参数化身体表示。自动标注系统通过不确定性感知的半监督训练将学习扩展到数千万个未标注帧。该系统遵循教师-学生架构生成伪标签,并通过不确定性蒸馏指导训练,使模型能够在不同环境中进行泛化。在EgoBody3M基准测试中,我们的模型在GPU上具有0.8毫秒的延迟,准确性比两个最先进的方法分别提高了12.2%和19.4%,并将时间抖动减少了22.2%和51.7%。此外,我们的自动标注系统进一步提高了手腕的MPJPE(平均关节位置误差)达13.1%。
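The causal temporal attention mentioned in the abstract restricts each time step to attend only to itself and earlier steps, which is what makes the model usable for online AR/VR tracking. Below is a minimal pure-Python sketch of such masked attention; it is illustrative only and is not the authors' implementation, which additionally uses identity-conditioned queries and multi-view spatial refinement.

```python
import math

def causal_attention(seq):
    """Single-head attention over a sequence of feature vectors,
    masked so that step t only attends to steps <= t (causal)."""
    n = len(seq)
    out = []
    for t in range(n):
        # Scaled dot-product scores against the visible (past + current) steps.
        scores = [sum(a * b for a, b in zip(seq[t], seq[j])) / math.sqrt(len(seq[t]))
                  for j in range(t + 1)]
        # Numerically stable softmax over the visible steps.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        # Weighted sum of the visible value vectors.
        out.append([sum(w * seq[j][d] for j, w in enumerate(weights))
                    for d in range(len(seq[t]))])
    return out
```

Because the mask never looks ahead, the output at step t is unchanged when future frames arrive, which is the property that enables streaming inference at a fixed latency.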
cs.CV / 78 / 2603.04091
CLIP-Guided Multi-Task Regression for Multi-View Plant Phenotyping
基于CLIP指导的多任务回归用于多视角植物表型分析
Abstract
Modeling plant growth dynamics plays a central role in modern agricultural research. However, learning robust predictors from multi-view plant imagery remains challenging due to strong viewpoint redundancy and viewpoint-dependent appearance changes. We propose a level-aware vision-language framework that jointly predicts plant age and leaf count using a single multi-task model built on CLIP embeddings. Our method aggregates rotational views into angle-invariant representations and conditions visual features on lightweight text priors encoding viewpoint level for stable prediction under incomplete or unordered inputs. On the GroMo25 benchmark, our approach reduces mean age MAE from 7.74 to 3.91 and mean leaf-count MAE from 5.52 to 3.08 compared to the GroMo baseline, corresponding to improvements of 49.5% and 44.2%, respectively. The unified formulation simplifies the pipeline by replacing the conventional dual-model setup while improving robustness to missing views. The models and code are available at: https://github.com/SimonWarmers/CLIP-MVP

Chinese Translation
植物生长动态建模在现代农业研究中发挥着核心作用。然而,由于强视角冗余和视角依赖的外观变化,从多视角植物图像中学习稳健的预测器仍然具有挑战性。我们提出了一种层级感知的视觉语言框架,利用基于CLIP嵌入的单一多任务模型共同预测植物年龄和叶片数量。我们的方法将旋转视图聚合为角不变表示,并根据轻量级文本先验条件化视觉特征,以便在不完整或无序输入下实现稳定预测。在GroMo25基准测试中,我们的方法将平均年龄绝对误差(MAE)从7.74降低到3.91,平均叶片数量MAE从5.52降低到3.08,相较于GroMo基线,分别改善了49.5%和44.2%。统一的公式简化了流程,通过替代传统的双模型设置,同时提高了对缺失视图的鲁棒性。模型和代码可在以下链接获取:https://github.com/SimonWarmers/CLIP-MVP
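One simple way to realize the angle-invariant aggregation of rotational views described above is order-invariant mean pooling of per-view embeddings, which is by construction stable under unordered or missing views. The sketch below is an assumption-laden illustration of that idea, not the paper's exact aggregation operator.

```python
def aggregate_views(view_embeddings):
    """Average per-view embedding vectors into an order-invariant
    representation. Missing views are simply absent from the input
    list, so the result is defined for any nonempty, unordered
    subset of rotational views."""
    dim = len(view_embeddings[0])
    n = len(view_embeddings)
    return [sum(v[d] for v in view_embeddings) / n for d in range(dim)]
```

Because the mean is symmetric in its arguments, permuting or dropping views changes only the effective sample size, not the representation's form.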
cs.CV / 79 / 2603.04098
Real Eyes Realize Faster: Gaze Stability and Pupil Novelty for Efficient Egocentric Learning
真实的眼睛更快地识别:注视稳定性与瞳孔新颖性在高效自我中心学习中的应用
Abstract
Always-on egocentric cameras are increasingly used as demonstrations for embodied robotics, imitation learning, and assistive AR, but the resulting video streams are dominated by redundant and low-quality frames. Under the storage and battery constraints of wearable devices, choosing which frames to keep is as important as how to learn from them. We observe that modern eye-tracking headsets provide a continuous, training-free side channel that decomposes into two complementary axes: gaze fixation captures visual stability (quality), while pupil response captures arousal-linked moments (novelty). We operationalize this insight as a Dual-Criterion Frame Curator that first gates frames by gaze quality and then ranks the survivors by pupil-derived novelty. On the Visual Experience Dataset (VEDB), curated frames at 10% budget match the classification performance of the full stream, and naive signal fusion consistently destroys both contributions. The benefit is task-dependent: pupil ranking improves activity recognition, while gaze-only selection already dominates for scene recognition, confirming that the two signals serve genuinely different roles. Our method requires no model inference and operates at capture time, offering a path toward efficient, always-on egocentric data curation.
Chinese Translation
始终开启的自我中心相机在具身机器人、模仿学习和辅助增强现实(AR)中越来越多地被用作演示工具,但由此产生的视频流往往充斥着冗余和低质量的帧。在可穿戴设备的存储和电池限制下,选择保留哪些帧与如何从中学习同样重要。我们观察到现代眼动追踪头戴设备提供了一种连续的、无需训练的侧通道,该通道可分解为两个互补的轴:注视固定捕捉视觉稳定性(质量),而瞳孔反应捕捉与唤醒相关的时刻(新颖性)。我们将这一洞察转化为双标准帧策展器(Dual-Criterion Frame Curator),该策展器首先通过注视质量筛选帧,然后根据瞳孔衍生的新颖性对幸存帧进行排名。在视觉体验数据集(Visual Experience Dataset, VEDB)上,10%预算下策展的帧与完整视频流的分类性能相匹配,而简单的信号融合则始终破坏这两种贡献。其效益依赖于任务:瞳孔排名改善了活动识别,而仅基于注视的选择在场景识别中已占主导地位,确认这两种信号确实发挥着不同的作用。我们的方法无需模型推理,并在采集时运行,为高效的始终开启自我中心数据策展提供了一条路径。
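The two-stage curator described in the abstract (gate by gaze quality, then rank survivors by pupil novelty) can be sketched in a few lines. This is a minimal illustration under assumed scalar signals; the field names and thresholding are hypothetical, not the authors' code.

```python
def curate_frames(frames, budget, quality_threshold):
    """Dual-criterion frame curation.
    frames: list of (frame_id, gaze_stability, pupil_novelty) tuples.
    Stage 1 gates out frames below the gaze-quality threshold;
    stage 2 ranks survivors by pupil-derived novelty and keeps the
    top `budget` frames."""
    survivors = [f for f in frames if f[1] >= quality_threshold]
    ranked = sorted(survivors, key=lambda f: f[2], reverse=True)
    return [f[0] for f in ranked[:budget]]
```

Note the strict ordering of the two criteria: a blurry but "novel" frame is discarded before novelty is ever consulted, which is why naive additive fusion of the two signals behaves differently from this gated scheme.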
cs.CV / 80 / 2603.04099
Efficient Point Cloud Processing with High-Dimensional Positional Encoding and Non-Local MLPs
高维位置编码与非局部多层感知机的高效点云处理
Abstract
Multi-Layer Perceptron (MLP) models are the foundation of contemporary point cloud processing. However, their complex network architectures obscure the source of their strength and limit the application of these models. In this article, we develop a two-stage abstraction and refinement (ABS-REF) view for modular feature extraction in point cloud processing. This view elucidates that whereas the early models focused on ABS stages, the more recent techniques devise sophisticated REF stages to attain performance advantages. Then, we propose a High-dimensional Positional Encoding (HPE) module to explicitly utilize intrinsic positional information, extending the ``positional encoding'' concept from Transformer literature. HPE can be readily deployed in MLP-based architectures and is compatible with transformer-based methods. Within our ABS-REF view, we rethink local aggregation in MLP-based methods and propose replacing the time-consuming local MLP operations used to capture local relationships among neighbors with non-local MLPs for efficient non-local information updates, combined with the proposed HPE for effective local information representation. We leverage our modules to develop HPENets, a suite of MLP networks that follow the ABS-REF paradigm, incorporating a scalable HPE-based REF stage. Extensive experiments on seven public datasets across four different tasks show that HPENets deliver a strong balance between efficiency and effectiveness. Notably, HPENet surpasses PointNeXt, a strong MLP-based counterpart, by 1.1% mAcc, 4.0% mIoU, 1.8% mIoU, and 0.2% Cls. mIoU, with only 50.0%, 21.5%, 23.1%, 44.4% of FLOPs on ScanObjectNN, S3DIS, ScanNet, and ShapeNetPart, respectively. Source code is available at https://github.com/zouyanmei/HPENet_v2.git.
Chinese Translation
多层感知机(MLP)模型是当代点云处理的基础。然而,它们复杂的网络架构掩盖了其优势的来源,并限制了这些模型的应用。本文提出了一种用于点云处理的模块化特征提取的两阶段抽象与精炼(ABS-REF)视角。该视角阐明了早期模型集中于ABS阶段,而近期技术则设计了复杂的REF阶段以获得性能优势。接着,我们提出了一种高维位置编码(HPE)模块,以明确利用内在位置信息,扩展了Transformer文献中的“位置编码”概念。HPE可以方便地部署在基于MLP的架构中,并与基于Transformer的方法兼容。在我们的ABS-REF视角下,我们重新思考了基于MLP方法中的局部聚合,并提出用非局部MLP替代耗时的局部MLP操作,以捕捉邻域间的局部关系。相反,我们使用非局部MLP进行高效的非局部信息更新,并结合所提出的HPE以实现有效的局部信息表示。我们利用这些模块开发了HPENets,这是一套遵循ABS-REF范式的MLP网络,包含可扩展的基于HPE的REF阶段。在四个不同任务的七个公共数据集上的广泛实验表明,HPENets在效率和有效性之间提供了良好的平衡。值得注意的是,HPENet在ScanObjectNN、S3DIS、ScanNet和ShapeNetPart上分别超越了强大的基于MLP的对手PointNeXt,提升了1.1%的mAcc、4.0%的mIoU、1.8%的mIoU和0.2%的Cls.mIoU,同时在这些数据集上仅消耗了50.0%、21.5%、23.1%和44.4%的FLOPs。源代码可在https://github.com/zouyanmei/HPENet_v2.git获取。
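The abstract extends the Transformer notion of positional encoding to an explicit high-dimensional encoding of point positions. A common sinusoidal instantiation of that idea is sketched below; this is an assumed, generic form, and the paper's HPE module may differ in its exact construction.

```python
import math

def positional_encoding(xyz, num_freqs=4):
    """Map a 3D point coordinate to a higher-dimensional vector using
    sin/cos features at geometrically spaced frequencies, one common
    way to realize a 'high-dimensional positional encoding'.
    Output dimensionality: 3 coords * num_freqs * 2."""
    feats = []
    for c in xyz:
        for k in range(num_freqs):
            f = 2.0 ** k  # frequencies 1, 2, 4, 8, ...
            feats.append(math.sin(f * c))
            feats.append(math.cos(f * c))
    return feats
```

Such an encoding injects fine-scale positional detail that a plain MLP on raw coordinates captures poorly, which is why it can stand in for expensive local aggregation.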
cs.CV / 81 / 2603.04113
Understanding Sources of Demographic Predictability in Brain MRI via Disentangling Anatomy and Contrast
通过解构解剖学与对比度理解脑部MRI中人口统计可预测性的来源
Abstract
Demographic attributes such as age, sex, and race can be predicted from medical images, raising concerns about bias in clinical AI systems. In brain MRI, this signal may arise from anatomical variation, acquisition-dependent contrast differences, or both, yet these sources remain entangled in conventional analyses. Without disentangling them, mitigation strategies risk failing to address the underlying causes. We propose a controlled framework based on disentangled representation learning, decomposing brain MRI into anatomy-focused representations that suppress acquisition influence and contrast embeddings that capture acquisition-dependent characteristics. Training predictive models for age, sex, and race on full images, anatomical representations, and contrast-only embeddings allows us to quantify the relative contributions of structure and acquisition to the demographic signal. Across three datasets and multiple MRI sequences, we find that demographic predictability is primarily rooted in anatomical variation: anatomy-focused representations largely preserve the performance of models trained on raw images. Contrast-only embeddings retain a weaker but systematic signal that is dataset-specific and does not generalise across sites. These findings suggest that effective mitigation must explicitly account for the distinct anatomical and acquisition-dependent origins of the demographic signal, ensuring that any bias reduction generalizes robustly across domains.
Chinese Translation
人口统计特征如年龄、性别和种族可以从医学影像中预测,这引发了对临床人工智能系统中偏见的担忧。在脑部MRI中,这一信号可能源于解剖学变异、依赖于获取的对比度差异,或两者的结合,但在传统分析中这些来源仍然交织在一起。如果不加以解构,缓解策略可能无法解决潜在原因。我们提出了一种基于解构表示学习的控制框架,将脑部MRI分解为专注于解剖学的表示,这些表示抑制了获取影响,以及捕捉依赖于获取特征的对比度嵌入。对完整图像、解剖学表示和仅对比度嵌入训练年龄、性别和种族的预测模型,使我们能够量化结构和获取对人口统计信号的相对贡献。在三个数据集和多个MRI序列中,我们发现人口统计可预测性主要根植于解剖学变异:专注于解剖学的表示在很大程度上保留了在原始图像上训练的模型的性能。仅对比度嵌入保留了一个较弱但系统性的信号,该信号是数据集特定的,并且在不同站点之间不具有普遍性。这些发现表明,有效的缓解必须明确考虑人口统计信号的独特解剖学和依赖于获取的起源,确保任何偏见减少在各个领域中都能稳健地推广。
cs.CV / 82 / 2603.04114
Any2Any: Unified Arbitrary Modality Translation for Remote Sensing
Any2Any:统一的任意模态翻译用于遥感
Abstract
Multi-modal remote sensing imagery provides complementary observations of the same geographic scene, yet such observations are frequently incomplete in practice. Existing cross-modal translation methods treat each modality pair as an independent task, resulting in quadratic complexity and limited generalization to unseen modality combinations. We formulate Any-to-Any translation as inference over a shared latent representation of the scene, where different modalities correspond to partial observations of the same underlying semantics. Based on this formulation, we propose Any2Any, a unified latent diffusion framework that projects heterogeneous inputs into a geometrically aligned latent space. Such structure performs anchored latent regression with a shared backbone, decoupling modality-specific representation learning from semantic mapping. Moreover, lightweight target-specific residual adapters are used to correct systematic latent mismatches without increasing inference complexity. To support learning under sparse but connected supervision, we introduce RST-1M, the first million-scale remote sensing dataset with paired observations across five sensing modalities, providing supervision anchors for any-to-any translation. Experiments across 14 translation tasks show that Any2Any consistently outperforms pairwise translation methods and exhibits strong zero-shot generalization to unseen modality pairs. Code and models will be available at https://github.com/MiliLab/Any2Any.
Chinese Translation
多模态遥感影像提供了对同一地理场景的互补观测,但在实际应用中,这些观测往往是不完整的。现有的跨模态翻译方法将每对模态视为独立任务,导致了二次复杂性,并且对未见模态组合的泛化能力有限。我们将任意到任意的翻译形式化为对场景共享潜在表示的推理,其中不同模态对应于相同基础语义的部分观测。基于这一形式化,我们提出了Any2Any,一个统一的潜在扩散框架,将异构输入投影到几何对齐的潜在空间中。这种结构通过共享的主干执行锚定的潜在回归,将模态特定的表示学习与语义映射解耦。此外,轻量级的目标特定残差适配器用于修正系统性的潜在不匹配,而不增加推理复杂性。为了支持在稀疏但连接的监督下进行学习,我们引入了RST-1M,这是第一个百万规模的遥感数据集,包含五种传感模态的配对观测,为任意到任意的翻译提供了监督锚点。在14个翻译任务中的实验表明,Any2Any始终优于成对翻译方法,并在未见模态对上表现出强大的零样本泛化能力。代码和模型将可在 https://github.com/MiliLab/Any2Any 获取。
cs.CV / 83 / 2603.04115
TextBoost: Boosting Scene Text Fidelity in Ultra-low Bitrate Image Compression
TextBoost:在超低比特率图像压缩中提升场景文本的保真度
Abstract
Ultra-low bitrate image compression faces a critical challenge: preserving small-font scene text while maintaining overall visual quality. Region-of-interest (ROI) bit allocation can prioritize text but often degrades global fidelity, leading to a trade-off between local accuracy and overall image quality. Instead of relying on ROI coding, we incorporate auxiliary textual information extracted by OCR and transmitted with negligible overhead, enabling the decoder to leverage this semantic guidance. Our method, TextBoost, operationalizes this idea through three strategic designs: (i) adaptively filtering OCR outputs and rendering them into a guidance map; (ii) integrating this guidance with decoder features in a calibrated manner via an attention-guided fusion block; and (iii) enforcing guidance-consistent reconstruction in text regions with a regularizing loss that promotes natural blending with the scene. Extensive experiments on TextOCR and ICDAR 2015 demonstrate that TextBoost yields up to 60.6% higher text-recognition F1 at comparable Peak Signal-to-Noise Ratio (PSNR) and bits per pixel (bpp), producing sharper small-font text while preserving global image quality and effectively decoupling text enhancement from global rate-distortion optimization.
Chinese Translation
超低比特率图像压缩面临着一个关键挑战:在保持整体视觉质量的同时,保留小字体场景文本。感兴趣区域(ROI)比特分配可以优先考虑文本,但往往会降低全局保真度,导致局部准确性与整体图像质量之间的权衡。我们的方法不依赖于ROI编码,而是结合通过光学字符识别(OCR)提取的辅助文本信息,并以可忽略的开销进行传输,从而使解码器能够利用这一语义指导。我们的方法TextBoost通过三项战略设计实现了这一理念:(i)自适应过滤OCR输出并将其呈现为指导图;(ii)通过注意力引导融合块以校准的方式将该指导与解码器特征结合;(iii)在文本区域内施加一致的重建指导,使用正则化损失促进与场景的自然融合。在TextOCR和ICDAR 2015上的大量实验表明,TextBoost在可比较的峰值信噪比(PSNR)和每像素比特(bpp)下,文本识别F1值提高了高达60.6%,在保持全局图像质量的同时,生成更清晰的小字体文本,并有效地将文本增强与全局率失真优化解耦。
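Step (i) of TextBoost filters OCR outputs and renders them into a guidance map for the decoder. A minimal sketch of that rendering, assuming OCR results arrive as pixel-space boxes with confidences (the box format and threshold are illustrative assumptions, not the paper's exact interface):

```python
def render_guidance_map(ocr_results, height, width, min_conf=0.5):
    """Render filtered OCR detections into a binary guidance map.
    ocr_results: list of (x0, y0, x1, y1, confidence) text boxes.
    Boxes below min_conf are dropped (the 'adaptive filtering' step);
    surviving boxes mark their pixels as text regions (value 1)."""
    gmap = [[0] * width for _ in range(height)]
    for (x0, y0, x1, y1, conf) in ocr_results:
        if conf < min_conf:
            continue  # filter out unreliable OCR detections
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                gmap[y][x] = 1
    return gmap
```

Since the transmitted side information is just the filtered text strings and boxes, its bitrate overhead is negligible relative to the image payload, which is the premise the abstract relies on.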
cs.CV / 84 / 2603.04125
A Baseline Study and Benchmark for Few-Shot Open-Set Action Recognition with Feature Residual Discrimination
基于特征残差区分的少样本开放集动作识别基线研究与基准
Abstract
Few-Shot Action Recognition (FS-AR) has shown promising results but is often limited by a closed-set assumption that fails in real-world open-set scenarios. While Few-Shot Open-Set (FSOS) recognition is well-established for images, its extension to spatio-temporal video data remains underexplored. To address this, we propose an architectural extension based on a Feature-Residual Discriminator (FR-Disc), adapting previous work on skeletal data to the more complex video domain. Extensive experiments on five datasets demonstrate that while common open-set techniques provide only marginal gains, our FR-Disc significantly enhances unknown rejection capabilities without compromising closed-set accuracy, setting a new state-of-the-art for FSOS-AR. The project website, code, and benchmark are available at: https://hsp-iit.github.io/fsosar/.
Chinese Translation
少样本动作识别(FS-AR)已显示出良好的结果,但通常受到封闭集假设的限制,这在现实世界的开放集场景中表现不佳。尽管少样本开放集(FSOS)识别在图像领域已得到充分研究,但其在时空视频数据中的扩展仍然未得到充分探索。为了解决这个问题,我们提出了一种基于特征残差鉴别器(Feature-Residual Discriminator, FR-Disc)的架构扩展,将之前在骨骼数据上的研究适应于更复杂的视频领域。在五个数据集上的广泛实验表明,尽管常见的开放集技术仅提供了边际收益,但我们的FR-Disc显著增强了未知样本的拒绝能力,而不影响封闭集的准确性,为FSOS-AR设定了新的最先进水平。项目网站、代码和基准可在以下链接获取:https://hsp-iit.github.io/fsosar/.
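The core open-set idea behind a feature-residual discriminator can be illustrated with nearest-prototype classification plus residual-based rejection: a query whose feature residual to its closest known class is too large is declared unknown. This sketch is a simplified stand-in, not the FR-Disc architecture itself.

```python
def classify_or_reject(query, prototypes, threshold):
    """Assign `query` to the nearest class prototype; reject as
    'unknown' when the feature residual (distance to that prototype)
    exceeds `threshold`.
    prototypes: dict mapping class name -> prototype feature vector."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(prototypes, key=lambda c: dist2(query, prototypes[c]))
    residual = dist2(query, prototypes[best]) ** 0.5
    return best if residual <= threshold else "unknown"
```

Because rejection happens after, not instead of, closed-set matching, known-class accuracy is unaffected for queries that fall within the threshold, mirroring the abstract's claim of unknown rejection without compromising closed-set accuracy.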
cs.CV / 85 / 2603.04128
Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
Crab$^{+}$:一种可扩展的统一音视频场景理解模型,具有明确的协作机制
Abstract
Developing Audio-Visual Large Language Models (AV-LLMs) for unified scene understanding is pivotal in multimodal intelligence. While instruction tuning enables pre-trained models with multi-task abilities, we observe that conventional multi-task unification methods often suffer from severe negative transfer, where nearly 55% of tasks degrade compared to single-task training. We attribute this phenomenon to audio-visual task heterogeneity, characterized by disparate task granularity and divergent capability demands, which lead to negative interference under joint training. To tackle this, we present Crab$^{+}$, a scalable and unified audio-visual scene understanding model that addresses task heterogeneity through explicit cooperation from both data and model perspectives. On the data side, we introduce AV-UIE v2, a comprehensive Audio-Visual Unified Instruction-tuning dataset with Explicit reasoning processes. It contains approximately 222K samples spanning 17 datasets and 7 tasks, enabling the model to capture cross-task relationships at different levels of granularity. On the model side, we design a unified interface to align heterogeneous task formulations, and propose Interaction-aware LoRA (I-LoRA), which explicitly models inter-task relationships via dynamic routing to coordinate distinct audio-visual interaction patterns, mitigating parameter interference. Extensive experiments show Crab$^{+}$ covers broader tasks than existing unified models while outperforming specialized models on various benchmarks. We successfully reverse the negative transfer trend, achieving positive transfer where multi-task learning surpasses single-task baselines in nearly 88% of tasks. These results hold across diverse AV-LLM paradigms and are validated through in-depth visualization, positioning Crab$^{+}$ as a robust step towards holistic audio-visual scene understanding.
Chinese Translation
开发用于统一场景理解的音视频大型语言模型(AV-LLMs)在多模态智能中至关重要。尽管指令调优使预训练模型具备多任务能力,但我们观察到传统的多任务统一方法往往遭遇严重的负迁移现象,近55%的任务相较于单任务训练出现性能下降。我们将这一现象归因于音视频任务的异质性,其特征在于任务粒度的差异和能力需求的多样性,这在联合训练中导致了负干扰。为了解决这一问题,我们提出了Crab$^{+}$,一种可扩展的统一音视频场景理解模型,通过数据和模型两个方面的明确协作来应对任务异质性。在数据方面,我们引入了AV-UIE v2,一个全面的音视频统一指令调优数据集,包含明确的推理过程。该数据集包含约222K个样本,涵盖17个数据集和7个任务,使模型能够在不同粒度层次上捕捉跨任务关系。在模型方面,我们设计了一个统一接口,以对齐异质任务的表述,并提出了交互感知的LoRA(I-LoRA),通过动态路由明确建模任务间关系,以协调不同的音视频交互模式,减轻参数干扰。大量实验表明,Crab$^{+}$覆盖的任务范围比现有的统一模型更广,同时在各种基准测试中超越了专门模型。我们成功逆转了负迁移趋势,实现了正迁移,在近88%的任务中,多任务学习超越了单任务基线。这些结果在不同的AV-LLM范式中均有效,并通过深入的可视化验证,确立了Crab$^{+}$作为实现全面音视频场景理解的重要一步。
cs.CV / 86 / 2603.04130
Mask-Guided Attention Regulation for Anatomically Consistent Counterfactual CXR Synthesis
基于掩膜引导的注意力调节用于解剖一致的反事实胸部X光合成
Abstract
Counterfactual generation for chest X-rays (CXR) aims to simulate plausible pathological changes while preserving patient-specific anatomy. However, diffusion-based editing methods often suffer from structural drift, where stable anatomical semantics propagate globally through attention and distort non-target regions, and unstable pathology expression, since subtle and localized lesions induce weak and noisy conditioning signals. We present an inference-time attention regulation framework for reliable counterfactual CXR synthesis. An anatomy-aware attention regularization module gates self-attention and anatomy-token cross-attention with organ masks, confining structural interactions to anatomical ROIs and reducing unintended distortions. A pathology-guided module enhances pathology-token cross-attention within target lung regions during early denoising and performs lightweight latent corrections driven by an attention-concentration energy, enabling controllable lesion localization and extent. Extensive evaluations on CXR datasets show improved anatomical consistency and more precise, controllable pathological edits compared with standard diffusion editing, supporting localized counterfactual analysis and data augmentation for downstream tasks.
Chinese Translation
胸部X光(CXR)的反事实生成旨在模拟合理的病理变化,同时保留患者特定的解剖结构。然而,基于扩散的编辑方法常常面临结构漂移的问题,即稳定的解剖语义通过注意力在全局传播并扭曲非目标区域,以及病理表达不稳定的问题,因为细微且局部的病变会引发微弱且嘈杂的条件信号。我们提出了一种推理时的注意力调节框架,用于可靠的反事实CXR合成。一个关注解剖结构的注意力正则化模块通过器官掩膜对自注意力和解剖标记交叉注意力进行调控,将结构交互限制在解剖感兴趣区域(ROI)内,减少意外扭曲。一个病理引导模块在早期去噪过程中增强目标肺区内的病理标记交叉注意力,并通过注意力集中能量驱动轻量级潜在修正,实现可控的病变定位和范围。对CXR数据集的广泛评估显示,与标准扩散编辑相比,解剖一致性得到了改善,病理编辑更加精确且可控,支持局部反事实分析和下游任务的数据增强。
cs.CV / 87 / 2603.04146
LISTA-Transformer Model Based on Sparse Coding and Attention Mechanism and Its Application in Fault Diagnosis
基于稀疏编码和注意力机制的LISTA-Transformer模型及其在故障诊断中的应用
Abstract
Driven by the continuous development of models such as Multi-Layer Perceptron, Convolutional Neural Network (CNN), and Transformer, deep learning has made breakthrough progress in fields such as computer vision and natural language processing, and has been successfully applied in practical scenarios such as image classification and industrial fault diagnosis. However, existing models still have certain limitations in local feature modeling and global dependency capture. Specifically, CNN is limited by local receptive fields, while Transformer has shortcomings in effectively modeling local structures, and both face challenges of high model complexity and insufficient interpretability. To address these issues, we propose the following innovative work: a sparse Transformer based on the Learnable Iterative Shrinkage-Thresholding Algorithm (LISTA-Transformer) that deeply integrates LISTA sparse encoding with a vision Transformer to construct a model architecture with an adaptive local and global feature collaboration mechanism. This method utilizes the continuous wavelet transform to convert vibration signals into time-frequency maps and inputs them into the LISTA-Transformer for more effective feature extraction. On the CWRU dataset, the fault recognition rate of our method reached 98.5%, which is 3.3% higher than traditional methods and exhibits certain superiority over existing Transformer-based approaches.
Chinese Translation
随着多层感知器、卷积神经网络(CNN)和Transformer等模型的持续发展,深度学习在计算机视觉和自然语言处理等领域取得了突破性进展,并成功应用于图像分类和工业故障诊断等实际场景。然而,现有模型在局部特征建模和全局依赖捕获方面仍存在一定的局限性。具体而言,CNN受到局部感受野的限制,而Transformer在有效建模局部结构方面存在不足,且两者都面临模型复杂度高和可解释性不足的挑战。针对上述问题,我们提出了以下创新工作:设计了一种基于可学习迭代收缩阈值算法(LISTA)的稀疏Transformer(LISTA-Transformer),该模型深度融合了LISTA稀疏编码与视觉Transformer,构建了具有自适应局部与全局特征协作机制的模型架构。该方法利用连续小波变换将振动信号转换为时频图,并将其输入到LISTA-Transformer中,以实现更有效的特征提取。在CWRU数据集上,我们的方法的故障识别率达到了98.5%,比传统方法高出3.3%,并在一定程度上优于现有的基于Transformer的方法。
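The LISTA building block named above unrolls the classical ISTA sparse-coding iteration, z ← soft_threshold(W_e·x + S·z, θ), into network layers whose matrices W_e, S and threshold θ are learned. A pure-Python sketch of one such iteration with fixed (unlearned) parameters:

```python
def soft_threshold(x, theta):
    """Soft-thresholding (shrinkage) operator used in ISTA/LISTA:
    shrinks x toward zero by theta, zeroing small values."""
    if x > theta:
        return x - theta
    if x < -theta:
        return x + theta
    return 0.0

def ista_step(z, x, W_e, S, theta):
    """One (L)ISTA iteration: z <- soft_threshold(W_e x + S z, theta).
    W_e maps the input x into code space; S propagates the previous
    code z. In LISTA these are learned; here they are fixed inputs."""
    n = len(z)
    pre = [sum(W_e[i][j] * x[j] for j in range(len(x))) +
           sum(S[i][j] * z[j] for j in range(n)) for i in range(n)]
    return [soft_threshold(p, theta) for p in pre]
```

Stacking a handful of such steps with learned parameters yields sparse codes in far fewer iterations than classical ISTA, which is what makes the hybrid with a Transformer backbone practical.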
cs.CV / 88 / 2603.04163
Degradation-based augmented training for robust individual animal re-identification
基于退化的增强训练用于稳健的个体动物再识别
Abstract
Wildlife re-identification aims to recognise individual animals by matching query images to a database of previously identified individuals, based on their fine-scale unique morphological characteristics. Current state-of-the-art models for multispecies re-identification are based on deep metric learning, representing individual identities by feature vectors in an embedding space, the similarity of which forms the basis for fast automated identity retrieval. Yet very often, the discriminative information of individual wild animals gets significantly reduced due to the presence of several degradation factors in images, leading to reduced retrieval performance and limiting downstream ecological studies. Here, starting by showing that the extent of this performance reduction varies greatly depending on the animal species (18 wild animal datasets), we introduce an augmented training framework for deep feature extractors, in which we apply artificial but diverse degradations to images in the training set. We show that applying this augmented training to only a subset of individuals leads to overall improved re-identification performance under the same type of degradations, even for individuals not seen during training. The introduction of diverse degradations during training leads to a gain of up to 8.5% Rank-1 accuracy on a dataset of real-world degraded animal images, selected using human re-ID expert annotations provided here for the first time. Our work is the first to systematically study image degradation in wildlife re-identification, while introducing all the necessary benchmarks, publicly available code and data, enabling further research on this topic.
Chinese Translation
野生动物再识别旨在基于个体细微而独特的形态特征,通过将查询图像与先前已识别个体的数据库进行匹配来识别个体动物。目前多物种再识别的最先进模型基于深度度量学习,通过嵌入空间中的特征向量表示个体身份,其相似性构成了快速自动身份检索的基础。然而,由于图像中存在多种退化因素,个体野生动物的区分信息往往显著减少,导致检索性能下降,限制了后续生态研究。在此,我们首先展示了这种性能下降的程度在不同动物物种(18个野生动物数据集)之间差异很大。我们引入了一种用于深度特征提取器的增强训练框架,在训练集中对图像应用人工但多样的退化。我们表明,仅对部分个体应用这种增强训练,在相同类型的退化下,整体再识别性能得到了提升,即使对于在训练中未见过的个体也是如此。在训练过程中引入多样化的退化,使得在使用人类再识别专家注释(首次提供)选择的真实世界退化动物图像数据集上,Rank-1准确率提高了多达8.5%。我们的工作首次系统地研究了野生动物再识别中的图像退化,同时引入了所有必要的基准、公开可用的代码和数据,为该主题的进一步研究提供了支持。
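The key training trick above, applying degradations to the images of only a subset of individuals, can be sketched as a dataset-level augmentation step. Names and the selection scheme below are illustrative assumptions, not the authors' pipeline.

```python
import random

def augment_subset(dataset, degrade, fraction=0.3, seed=0):
    """Apply a degradation function to the images of a randomly chosen
    subset of individuals, keeping the clean images as well.
    dataset: list of (individual_id, image) pairs.
    degrade: callable that returns a degraded version of an image."""
    rng = random.Random(seed)  # deterministic subset selection
    ids = sorted({ind for ind, _ in dataset})
    chosen = set(rng.sample(ids, max(1, int(fraction * len(ids)))))
    out = list(dataset)
    # Append degraded copies only for individuals in the chosen subset.
    out += [(ind, degrade(img)) for ind, img in dataset if ind in chosen]
    return out
```

Because the feature extractor is shared across identities, robustness learned from the degraded subset transfers to individuals that were never degraded during training, matching the abstract's finding.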
cs.CV / 89 / 2603.04165
PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters
PlaneCycle:无需训练、无适配器的基础模型2D到3D提升
Abstract
Large-scale 2D foundation models exhibit strong transferable representations, yet extending them to 3D volumetric data typically requires retraining, adapters, or architectural redesign. We introduce PlaneCycle, a training-free, adapter-free operator for architecture-agnostic 2D-to-3D lifting of foundation models. PlaneCycle reuses the original pretrained 2D backbone by cyclically distributing spatial aggregation across orthogonal HW, DW, and DH planes throughout network depth, enabling progressive 3D fusion while preserving pretrained inductive biases. The method introduces no additional parameters and is applicable to arbitrary 2D networks. Using pretrained DINOv3 models, we evaluate PlaneCycle on six 3D classification and three 3D segmentation benchmarks. Without any training, the lifted models exhibit intrinsic 3D fusion capability and, under linear probing, outperform slice-wise 2D baselines and strong 3D counterparts, approaching the performance of fully trained models. With full fine-tuning, PlaneCycle matches standard 3D architectures, highlighting its potential as a seamless and practical 2D-to-3D lifting operator. These results demonstrate that 3D capability can be unlocked from pretrained 2D foundation models without structural modification or retraining. Code is available at https://github.com/HINTLab/PlaneCycle.
Chinese Translation
大规模的2D基础模型展现出强大的可迁移表示能力,但将其扩展到3D体积数据通常需要重新训练、适配器或架构重设计。我们提出了PlaneCycle,这是一种无训练、无适配器的操作符,用于架构无关的2D到3D提升。PlaneCycle通过在网络深度中循环地在正交的HW、DW和DH平面上分配空间聚合,重用原始预训练的2D主干,从而实现渐进式的3D融合,同时保留预训练的归纳偏置。该方法不引入额外参数,并适用于任意2D网络。我们使用预训练的DINOv3模型,在六个3D分类和三个3D分割基准上评估PlaneCycle。在没有任何训练的情况下,提升后的模型展现出内在的3D融合能力,并且在线性探测下超越了逐片2D基线和强大的3D对比模型,接近完全训练模型的性能。通过完全微调,PlaneCycle与标准3D架构相匹配,突显了其作为无缝且实用的2D到3D提升操作符的潜力。这些结果表明,预训练的2D基础模型可以在不进行结构修改或重新训练的情况下解锁3D能力。代码可在 https://github.com/HINTLab/PlaneCycle 获取。
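The cyclic distribution of spatial aggregation across orthogonal HW, DW, and DH planes amounts to re-slicing the same 3D volume along a different axis at each depth, so the unchanged 2D backbone sees every orientation over the course of the network. A minimal slicing sketch (nested lists stand in for tensors; this is an illustration of the indexing, not the PlaneCycle operator itself):

```python
def plane_slices(volume, plane):
    """Slice a D x H x W volume (nested lists) into 2D slices along one
    of the three orthogonal planes cycled through in PlaneCycle-style
    lifting: 'HW' (one slice per depth), 'DW' (per height row),
    'DH' (per width column)."""
    D, H, W = len(volume), len(volume[0]), len(volume[0][0])
    if plane == "HW":
        return [[[volume[d][h][w] for w in range(W)] for h in range(H)]
                for d in range(D)]
    if plane == "DW":
        return [[[volume[d][h][w] for w in range(W)] for d in range(D)]
                for h in range(H)]
    if plane == "DH":
        return [[[volume[d][h][w] for h in range(H)] for d in range(D)]
                for w in range(W)]
    raise ValueError(plane)
```

Cycling the `plane` argument with network depth is what lets purely 2D aggregation progressively fuse information along all three axes without any new parameters.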
cs.CV / 90 / 2603.04179
NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction
NOVA3R:用于非模态三维重建的非像素对齐视觉变换器
Abstract
We present NOVA3R, an effective approach for non-pixel-aligned 3D reconstruction from a set of unposed images in a feed-forward manner. Unlike pixel-aligned methods that tie geometry to per-ray predictions, our formulation learns a global, view-agnostic scene representation that decouples reconstruction from pixel alignment. This addresses two key limitations in pixel-aligned 3D: (1) it recovers both visible and invisible points with a complete scene representation, and (2) it produces physically plausible geometry with fewer duplicated structures in overlapping regions. To achieve this, we introduce a scene-token mechanism that aggregates information across unposed images and a diffusion-based 3D decoder that reconstructs complete, non-pixel-aligned point clouds. Extensive experiments on both scene-level and object-level datasets demonstrate that NOVA3R outperforms state-of-the-art methods in terms of reconstruction accuracy and completeness.
Chinese Translation
我们提出了NOVA3R,一种有效的非像素对齐三维重建方法,能够从一组未定位的图像中以前馈方式进行重建。与将几何体与每条光线预测绑定的像素对齐方法不同,我们的公式学习了一种全局的、与视角无关的场景表示,从而将重建与像素对齐解耦。这解决了像素对齐三维重建中的两个关键限制:(1)它通过完整的场景表示恢复可见和不可见的点;(2)它在重叠区域生成物理上合理的几何体,减少了重复结构。为了实现这一点,我们引入了一种场景标记机制,聚合来自未定位图像的信息,以及一种基于扩散的三维解码器,重建完整的非像素对齐点云。在场景级和对象级数据集上的大量实验表明,NOVA3R在重建精度和完整性方面优于现有的最先进方法。
cs.CV / 91 / 2603.04205
Real5-OmniDocBench: A Full-Scale Physical Reconstruction Benchmark for Robust Document Parsing in the Wild
Real5-OmniDocBench:用于野外稳健文档解析的全尺度物理重建基准
Abstract
While Vision-Language Models (VLMs) achieve near-perfect scores on digital document benchmarks like OmniDocBench, their performance in the unpredictable physical world remains largely unknown due to the lack of controlled yet realistic evaluations. We introduce Real5-OmniDocBench, the first benchmark that performs a full-scale, one-to-one physical reconstruction of the entire OmniDocBench v1.5 (1,355 images) across five critical real-world scenarios: Scanning, Warping, Screen-Photography, Illumination, and Skew. Unlike prior benchmarks, which either lack digital correspondence or employ partial sampling, our complete ground-truth mapping enables, for the first time, rigorous factor-wise attribution of performance degradation, allowing us to pinpoint whether failures stem from geometric distortions, optical artifacts, or model limitations. Our benchmark establishes a challenging new standard for the community, demonstrating that the 'reality gap' in document parsing is far from closed, and provides a diagnostic tool to guide the development of truly resilient document intelligence.
Chinese Translation
尽管视觉-语言模型(VLMs)在数字文档基准如OmniDocBench上取得了近乎完美的分数,但由于缺乏可控而又真实的评估,其在不可预测的物理世界中的表现在很大程度上仍属未知。我们引入了Real5-OmniDocBench,这是第一个针对五种关键的真实场景(扫描、变形、屏幕摄影、照明和倾斜)对整个OmniDocBench v1.5(1,355张图像)进行全尺度、一对一物理重建的基准。与之前的基准不同,后者要么缺乏数字对应关系,要么采用部分采样,我们的完整真实映射首次实现了对性能下降的严格因素归因,使我们能够准确判断失败是否源于几何失真、光学伪影或模型限制。我们的基准为社区建立了一个具有挑战性的新标准,表明文档解析中的"现实差距"远未缩小,并提供了一种诊断工具,以指导真正稳健的文档智能的发展。
cs.CV / 92 / 2603.04239
DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers
DiverseDiT:面向扩散变换器中的多样化表示学习
Abstract
Recent breakthroughs in Diffusion Transformers (DiTs) have revolutionized the field of visual synthesis due to their superior scalability. To facilitate DiTs' capability of capturing meaningful internal representations, recent works such as REPA incorporate external pretrained encoders for representation alignment. However, the underlying mechanisms governing representation learning within DiTs are not well understood. To this end, we first systematically investigate the representation dynamics of DiTs. Through analyzing the evolution and influence of internal representations under various settings, we reveal that representation diversity across blocks is a crucial factor for effective learning. Based on this key insight, we propose DiverseDiT, a novel framework that explicitly promotes representation diversity. DiverseDiT incorporates long residual connections to diversify input representations across blocks and a representation diversity loss to encourage blocks to learn distinct features. Extensive experiments on ImageNet 256x256 and 512x512 demonstrate that our DiverseDiT yields consistent performance gains and convergence acceleration when applied to different backbones with various sizes, even when tested on the challenging one-step generation setting. Furthermore, we show that DiverseDiT is complementary to existing representation learning techniques, leading to further performance gains. Our work provides valuable insights into the representation learning dynamics of DiTs and offers a practical approach for enhancing their performance.
Chinese Translation
最近在扩散变换器(Diffusion Transformers, DiTs)领域的突破性进展因其卓越的可扩展性而彻底改变了视觉合成领域。为了促进DiTs捕捉有意义的内部表示的能力,最近的研究如REPA将外部预训练编码器纳入表示对齐。然而,控制DiTs内部表示学习的基本机制尚不清楚。为此,我们首先系统地研究了DiTs的表示动态。通过分析在各种设置下内部表示的演变和影响,我们揭示了跨块的表示多样性是有效学习的关键因素。基于这一关键见解,我们提出了DiverseDiT,一个明确促进表示多样性的全新框架。DiverseDiT结合了长残差连接以多样化跨块的输入表示,并引入了表示多样性损失以鼓励各块学习不同的特征。在ImageNet 256x256和512x512上的大量实验表明,当应用于不同规模的骨干网络时,我们的DiverseDiT在性能提升和收敛加速方面均表现出一致的优势,即使在具有挑战性的单步生成设置下进行测试。此外,我们还展示了DiverseDiT与现有表示学习技术的互补性,进一步提升了性能。我们的工作为DiTs的表示学习动态提供了宝贵的见解,并为提升其性能提供了实用的方法。
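The representation diversity loss described above can be instantiated as a penalty on the mean pairwise cosine similarity between per-block feature vectors: minimizing it pushes blocks toward distinct directions. This is a minimal sketch of one such formulation, not necessarily the exact loss used in DiverseDiT.

```python
import math

def diversity_loss(block_feats):
    """Mean pairwise cosine similarity between per-block feature
    vectors. A value near 1 means blocks learn redundant features;
    minimizing this quantity encourages distinct representations."""
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)
    n = len(block_feats)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cos(block_feats[i], block_feats[j]) for i, j in pairs) / len(pairs)
```

Identical block features score 1.0 and mutually orthogonal ones score 0.0, so adding this term to the training objective directly rewards the cross-block diversity the paper identifies as crucial.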
cs.CV / 93 / 2603.04240
DeNuC: Decoupling Nuclei Detection and Classification in Histopathology
DeNuC:在组织病理学中解耦核检测与分类
Abstract
Pathology Foundation Models (FMs) have shown strong performance across a wide range of pathology image representation and diagnostic tasks. However, FMs do not exhibit the expected performance advantage over traditional specialized models in Nuclei Detection and Classification (NDC). In this work, we reveal that jointly optimizing nuclei detection and classification leads to severe representation degradation in FMs. Moreover, we identify that the substantial intrinsic disparity in task difficulty between nuclei detection and nuclei classification renders joint NDC optimization unnecessarily computationally burdensome for the detection stage. To address these challenges, we propose DeNuC, a simple yet effective method designed to break through existing bottlenecks by Decoupling Nuclei detection and Classification. DeNuC employs a lightweight model for accurate nuclei localization, subsequently leveraging a pathology FM to encode input images and query nucleus-specific features based on the detected coordinates for classification. Extensive experiments on three widely used benchmarks demonstrate that DeNuC effectively unlocks the representational potential of FMs for NDC and significantly outperforms state-of-the-art methods. Notably, DeNuC improves F1 scores by 4.2% and 3.6% (or higher) on the BRCAM2C and PUMA datasets, respectively, while using only 16% (or fewer) trainable parameters compared to other methods. Code is available at https://github.com/ZijiangY1116/DeNuC.
Chinese Translation
病理基础模型(FMs)在广泛的病理图像表示和诊断任务中表现出强大的性能。然而,FMs在核检测与分类(NDC)方面并未表现出相对于传统专用模型的预期性能优势。在本研究中,我们揭示了联合优化核检测与分类会导致FMs的严重表示退化。此外,我们发现核检测与核分类之间任务难度的显著内在差异使得联合NDC优化在检测阶段不必要地增加了计算负担。为了解决这些挑战,我们提出了DeNuC,这是一种简单而有效的方法,旨在通过解耦核检测与分类来突破现有瓶颈。DeNuC采用轻量级模型进行准确的核定位,随后利用病理FM对输入图像进行编码,并基于检测到的坐标查询特定于核的特征进行分类。在三个广泛使用的基准数据集上的大量实验表明,DeNuC有效地释放了FMs在NDC中的表示潜力,并显著超越了最先进的方法。值得注意的是,DeNuC在BRCAM2C和PUMA数据集上分别提高了4.2%和3.6%(或更高)的F1分数,同时使用的可训练参数仅为其他方法的16%(或更少)。代码可在https://github.com/ZijiangY1116/DeNuC获取。
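DeNuC's decoupled pipeline first localizes nuclei with a lightweight detector, then queries the foundation model's feature map at the detected coordinates for classification. A minimal sketch of the query step, with nearest-neighbour lookup standing in for whatever interpolation the paper actually uses:

```python
def query_nucleus_features(feature_map, coords):
    """feature_map: H x W grid of per-pixel FM feature vectors (nested lists);
    coords: detected nucleus centers as (row, col) floats.
    Returns one nucleus-specific feature vector per detection via
    nearest-neighbour lookup (clamped to the image bounds)."""
    h, w = len(feature_map), len(feature_map[0])
    feats = []
    for r, c in coords:
        ri = min(max(int(round(r)), 0), h - 1)
        ci = min(max(int(round(c)), 0), w - 1)
        feats.append(feature_map[ri][ci])
    return feats
```

The classifier then operates only on these queried vectors, so the frozen foundation model's representation is never degraded by a joint detection objective.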
cs.CV / 94 / 2603.04243
A Unified Framework for Joint Detection of Lacunes and Enlarged Perivascular Spaces
一个统一框架用于联合检测腔隙和扩大型血管周围空间
Abstract
Cerebral small vessel disease (CSVD) markers, specifically enlarged perivascular spaces (EPVS) and lacunae, present a unique challenge in medical image analysis due to their radiological mimicry. Standard segmentation networks struggle with feature interference and extreme class imbalance when handling these divergent targets simultaneously. To address these issues, we propose a morphology-decoupled framework where Zero-Initialized Gated Cross-Task Attention exploits dense EPVS context to guide sparse lacune detection. Furthermore, biological and topological consistency are enforced via a mixed-supervision strategy integrating Mutual Exclusion and Centerline Dice losses. Finally, we introduce an Anatomically-Informed Inference Calibration mechanism to dynamically suppress false positives based on tissue semantics. Extensive 5-fold cross-validation on the VALDO 2021 dataset (N=40) demonstrates state-of-the-art performance, notably surpassing task winners in lacunae detection precision (71.1%, p=0.01) and F1-score (62.6%, p=0.03). Furthermore, evaluation on the external EPAD cohort (N=1762) confirms the model's robustness for large-scale population studies. Code will be released upon acceptance.
Chinese Translation
脑小血管病(CSVD)标志物,特别是扩大型血管周围空间(EPVS)和腔隙,在医学图像分析中由于其放射学上的相似性而呈现出独特的挑战。标准分割网络在同时处理这些不同目标时,面临特征干扰和极端类别不平衡的问题。为了解决这些问题,我们提出了一种形态解耦框架,其中零初始化的门控跨任务注意力(Zero-Initialized Gated Cross-Task Attention)利用密集的EPVS上下文来指导稀疏腔隙的检测。此外,通过混合监督策略整合互斥损失(Mutual Exclusion)和中心线Dice损失(Centerline Dice),强制执行生物学和拓扑一致性。最后,我们引入了一种解剖学信息推理校准机制,以根据组织语义动态抑制假阳性。在VALDO 2021数据集(N=40)上进行的广泛5折交叉验证显示出最先进的性能,特别是在腔隙检测精度(71.1%,p=0.01)和F1分数(62.6%,p=0.03)方面显著超过任务获胜者。此外,在外部EPAD队列(N=1762)上的评估确认了该模型在大规模人群研究中的鲁棒性。代码将在接受后发布。
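The Mutual Exclusion loss can be sketched as a penalty on voxels where both class-probability maps are simultaneously high, since a location cannot be both an EPVS and a lacune. The mean-product form below is an assumption for illustration; the paper's exact formulation may differ.

```python
def mutual_exclusion_loss(p_epvs, p_lacune):
    """Mean elementwise product of the two flattened probability maps:
    voxels confidently predicted as both EPVS and lacune are penalized,
    enforcing biological mutual exclusion between the two markers."""
    assert len(p_epvs) == len(p_lacune)
    return sum(a * b for a, b in zip(p_epvs, p_lacune)) / len(p_epvs)
```

The loss is zero whenever the two maps never overlap and grows toward one as they agree everywhere, so it composes cleanly with the Centerline Dice term in a mixed-supervision objective.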
cs.CV / 95 / 2603.04254
EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding
EmbodiedSplat:用于开放词汇3D场景理解的在线前馈语义3D高斯溅射
Abstract
Understanding a 3D scene as it is explored is essential for embodied tasks, where an agent must construct and comprehend the 3D scene in an online and nearly real-time manner. In this study, we propose EmbodiedSplat, an online feed-forward 3DGS for open-vocabulary scene understanding that enables simultaneous online 3D reconstruction and 3D semantic understanding from streaming images. Unlike existing open-vocabulary 3DGS methods, which are typically restricted to either offline or per-scene optimization settings, our objectives are two-fold: 1) reconstruct the semantic-embedded 3DGS of the entire scene from over 300 streaming images in an online manner; 2) generalize to novel scenes through a feed-forward design and support nearly real-time 3D semantic reconstruction when combined with real-time 2D models. To achieve these objectives, we propose an Online Sparse Coefficients Field with a CLIP Global Codebook that binds the 2D CLIP embeddings to each 3D Gaussian while minimizing memory consumption and preserving the full semantic generalizability of CLIP. Furthermore, we generate 3D geometric-aware CLIP features by aggregating the partial point cloud of the 3DGS through a 3D U-Net, supplying the 3D geometric prior that 2D-oriented language embeddings lack. Extensive experiments on diverse indoor datasets, including ScanNet, ScanNet++, and Replica, demonstrate both the effectiveness and efficiency of our method. Check out our project page at https://0nandon.github.io/EmbodiedSplat/.
Chinese Translation
在探索过程中立即理解3D场景对于具身任务至关重要,其中代理必须以在线和几乎实时的方式构建和理解3D场景。在本研究中,我们提出了EmbodiedSplat,一种用于开放词汇场景理解的在线前馈3D高斯溅射(3D Gaussian Splatting, 3DGS),它能够从流式图像中同时进行在线3D重建和3D语义理解。与现有的开放词汇3DGS方法通常限制于离线或每场景优化设置不同,我们的目标有两个:1)以在线方式从300多张流式图像重建整个场景的语义嵌入3DGS;2)通过前馈设计高度通用于新场景,并在与实时2D模型结合时支持几乎实时的3D语义重建。为了实现这些目标,我们提出了一种在线稀疏系数场(Online Sparse Coefficients Field),结合了CLIP全局词典(CLIP Global Codebook),将2D CLIP嵌入绑定到每个3D高斯,同时最小化内存消耗并保持CLIP的完整语义通用性。此外,我们通过3D U-Net聚合3DGS的部分点云生成3D几何感知的CLIP特征,以补偿2D导向语言嵌入的3D几何先验。在包括ScanNet、ScanNet++和Replica在内的多样化室内数据集上进行的大量实验表明了我们方法的有效性和效率。请访问我们的项目页面:https://0nandon.github.io/EmbodiedSplat/
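The sparse-coefficients idea is that each Gaussian stores only a few (index, weight) pairs into a shared CLIP codebook instead of a full embedding, and its dense semantic vector is recovered as a weighted sum of codebook entries. A minimal sketch of that decode step (data layout is assumed, not taken from the paper):

```python
def decode_gaussian_semantics(sparse_coeffs, codebook):
    """sparse_coeffs: per-Gaussian list of (codebook_index, weight) pairs;
    codebook: global list of CLIP embedding vectors.
    Reconstructs the Gaussian's dense CLIP embedding as a weighted sum of
    codebook entries, so memory scales with the number of coefficients,
    not the embedding dimension."""
    dim = len(codebook[0])
    emb = [0.0] * dim
    for idx, w in sparse_coeffs:
        entry = codebook[idx]
        for d in range(dim):
            emb[d] += w * entry[d]
    return emb
```

Because the codebook entries are genuine CLIP vectors, the reconstructed embeddings stay in CLIP's space and open-vocabulary queries still apply directly.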
cs.CV / 96 / 2603.04256
A Hypertoroidal Covering for Perfect Color Equivariance
完美颜色等变性的超环面覆盖
Abstract
When the color distribution of input images changes at inference, the performance of conventional neural network architectures drops considerably. A few researchers have begun to incorporate prior knowledge of color geometry in neural network design. These color equivariant architectures have modeled hue variation with 2D rotations, and saturation and luminance transformations as 1D translations. While this approach improves neural network robustness to color variations in a number of contexts, we find that approximating saturation and luminance (interval-valued quantities) as 1D translations introduces appreciable artifacts. In this paper, we introduce a color equivariant architecture that is truly equivariant. Instead of approximating the interval with the real line, we lift values on the interval to values on the circle (a double-cover) and build equivariant representations there. Our approach resolves the approximation artifacts of previous methods, improves interpretability and generalizability, and achieves better predictive performance than conventional and equivariant baselines on tasks such as fine-grained classification and medical imaging. Going beyond the context of color, we show that our proposed lifting can also extend to geometric transformations such as scale.
Chinese Translation
当输入图像的颜色分布在推理时发生变化时,传统神经网络架构的性能会显著下降。一些研究者已经开始在神经网络设计中融入颜色几何的先验知识。这些颜色等变架构通过二维旋转建模色调变化,并将饱和度和亮度变换视为一维平移。虽然这种方法在多个背景下提高了神经网络对颜色变化的鲁棒性,但我们发现将饱和度和亮度(区间值量)近似为一维平移会引入显著的伪影。在本文中,我们提出了一种真正等变的颜色等变架构。我们不是将区间近似为实线,而是将区间上的值提升到圆上的值(双覆盖),并在此构建等变表示。我们的方法解决了先前方法的近似伪影,提高了可解释性和泛化能力,并在细粒度分类和医学成像等任务上实现了比传统和等变基线更好的预测性能。超越颜色的背景,我们展示了我们提出的提升也可以扩展到几何变换,如缩放。
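The double-cover lift can be made concrete with a toy calculation: an interval value in [0, 1] is lifted to an angle, shifted on the circle, and folded back, so "translations" reflect at the endpoints instead of running off the interval. This is an illustrative sketch of the geometric idea only, not the paper's architecture.

```python
import math

def lift(s):
    """Lift an interval value s in [0, 1] to an angle on the circle."""
    return math.pi * s

def project(theta):
    """Fold an angle back onto [0, 1]; the two half-circles map onto the
    same interval, giving the double cover."""
    theta = theta % (2 * math.pi)
    return theta / math.pi if theta <= math.pi else 2.0 - theta / math.pi

def circle_shift(s, delta):
    """'Translate' an interval value by acting on its circle lift; values
    reflect at the endpoints rather than being clipped or wrapped."""
    return project(lift(s) + delta)
```

For example, shifting 0.9 by a fifth of a half-turn folds back to 0.9 rather than overshooting to an invalid 1.1, which is exactly the boundary behavior a real-line approximation gets wrong.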
cs.CV / 97 / 2603.04265
ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos
ViterbiPlanNet:通过可微分Viterbi注入过程知识以进行教学视频中的规划
Abstract
Procedural planning aims to predict a sequence of actions that transforms an initial visual state into a desired goal, a fundamental ability for intelligent agents operating in complex environments. Existing approaches typically rely on large-scale models that learn procedural structures implicitly, resulting in limited sample-efficiency and high computational cost. In this work we introduce ViterbiPlanNet, a principled framework that explicitly integrates procedural knowledge into the learning process through a Differentiable Viterbi Layer (DVL). The DVL embeds a Procedural Knowledge Graph (PKG) directly into the Viterbi decoding algorithm, replacing non-differentiable operations with smooth relaxations that enable end-to-end optimization. This design allows the model to learn through graph-based decoding. Experiments on CrossTask, COIN, and NIV demonstrate that ViterbiPlanNet achieves state-of-the-art performance with an order of magnitude fewer parameters than diffusion- and LLM-based planners. Extensive ablations show that performance gains arise from our differentiable structure-aware training rather than post-hoc refinement, resulting in improved sample efficiency and robustness to shorter unseen horizons. We also address testing inconsistencies by establishing a unified testing protocol with consistent splits and evaluation metrics. With this new protocol, we run experiments multiple times and report results using bootstrapping to assess statistical significance.
Chinese Translation
过程规划旨在预测一系列动作,将初始视觉状态转变为期望目标,这是智能体在复杂环境中操作的基本能力。现有方法通常依赖于大规模模型,这些模型隐式学习过程结构,导致样本效率有限且计算成本高。在本研究中,我们引入了ViterbiPlanNet,这是一个原则性的框架,通过可微分Viterbi层(Differentiable Viterbi Layer, DVL)将过程知识显式整合到学习过程中。DVL将过程知识图(Procedural Knowledge Graph, PKG)直接嵌入Viterbi解码算法中,用平滑的松弛替代不可微分操作,从而实现端到端优化。这一设计使模型能够通过基于图的解码进行学习。在CrossTask、COIN和NIV上的实验表明,ViterbiPlanNet以数量级更少的参数实现了最先进的性能,相较于基于扩散和大型语言模型(LLM)的规划器,具有显著优势。大量消融实验表明,性能提升源于我们可微分的结构感知训练,而非事后精炼,从而提高了样本效率和对较短未见时间范围的鲁棒性。我们还解决了测试不一致性的问题,建立了统一的测试协议,确保分割和评估指标的一致性。在这一新协议下,我们多次进行实验,并使用自助法(bootstrapping)报告结果,以评估统计显著性。
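The smooth relaxation behind a differentiable Viterbi layer can be sketched by replacing the hard max in the Viterbi recursion with a temperature-scaled log-sum-exp. The recursion below is a generic soft Viterbi over emission scores and transition log-potentials (standing in for the PKG edges); it is an assumption-laden sketch, not the paper's DVL.

```python
import math

def smooth_max(scores, tau=0.05):
    """Temperature-scaled log-sum-exp: a differentiable relaxation of max
    that approaches the hard max as tau -> 0."""
    m = max(scores)
    return m + tau * math.log(sum(math.exp((s - m) / tau) for s in scores))

def soft_viterbi(emissions, transitions, tau=0.05):
    """emissions: T x S per-step action scores; transitions: S x S
    log-potentials (procedural-knowledge-graph edges). Replacing Viterbi's
    hard max with smooth_max yields a best-path score that is
    differentiable in every emission and transition."""
    alpha = list(emissions[0])
    num_states = len(alpha)
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][s]
            + smooth_max([alpha[p] + transitions[p][s] for p in range(num_states)], tau)
            for s in range(num_states)
        ]
    return smooth_max(alpha, tau)
```

At a small temperature the soft score converges to the hard Viterbi path score, so the relaxation trades a controlled amount of smoothing for end-to-end gradients.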
cs.CV / 98 / 2603.04272
SSR: A Generic Framework for Text-Aided Map Compression for Localization
SSR:一种用于定位的文本辅助地图压缩通用框架
Abstract
Mapping is crucial in robotics for localization and downstream decision-making. As robots are deployed in ever-broader settings, the maps they rely on continue to increase in size. However, storing these maps indefinitely (cold storage), transferring them across networks, or sending localization queries to cloud-hosted maps imposes prohibitive memory and bandwidth costs. We propose a text-enhanced compression framework that reduces both memory and bandwidth footprints while retaining high-fidelity localization. The key idea is to treat text as an alternative modality: one that can be losslessly compressed with large language models. We propose leveraging lightweight text descriptions combined with very small image feature vectors, which capture "complementary information" as a compact representation for the mapping task. Building on this, our novel technique, Similarity Space Replication (SSR), learns an adaptive image embedding in one shot that captures only the information "complementary" to the text descriptions. We validate our compression framework on multiple downstream localization tasks, including Visual Place Recognition as well as object-centric Monte Carlo localization in both indoor and outdoor settings. SSR achieves 2 times better compression than competing baselines on state-of-the-art datasets, including TokyoVal, Pittsburgh30k, Replica, and KITTI.
Chinese Translation
地图在机器人技术中对于定位和后续决策至关重要。随着机器人在越来越广泛的环境中部署,它们所依赖的地图也不断增大。然而,永久存储这些地图(冷存储)、通过网络传输它们或向云托管地图发送定位查询会带来巨大的内存和带宽成本。我们提出了一种增强文本的压缩框架,旨在在保持高保真定位的同时减少内存和带宽占用。其关键思想是将文本视为一种替代模态:可以通过大型语言模型无损压缩的模态。我们建议利用轻量级文本描述与非常小的图像特征向量相结合,这些特征向量捕捉“互补信息”,作为映射任务的紧凑表示。在此基础上,我们的新技术相似空间复制(Similarity Space Replication,SSR)学习了一种自适应图像嵌入,一次性捕捉仅与文本描述“互补”的信息。我们在多个下游定位任务上验证了我们的压缩框架,包括视觉位置识别以及室内和室外环境中的以物体为中心的蒙特卡洛定位。SSR在包括TokyoVal、Pittsburgh30k、Replica和KITTI等最先进的数据集上实现了比竞争基线高出2倍的压缩效果。
cs.CV / 99 / 2603.04288
A multi-center analysis of deep learning methods for video polyp detection and segmentation
多中心深度学习方法在视频息肉检测与分割中的分析
Ghatwary, Noha, Solano, Pedro Chavarias, Ibrahim, Mohamed Ramzy, Krenzer, Adrian, Puppe, Frank, Realdon, Stefano, Cannizzaro, Renato, Wang, Jiacheng, Wang, Liansheng, Tran, Thuy Nuong, Maier-Hein, Lena, Yamlahi, Amine, Godau, Patrick, He, Quan, Wan, Qiming, Kokshaikyna, Mariia, Dobko, Mariia, Ye, Haili, Li, Heng, B, Ragu, Raj, Antony, Nagdy, Hanaa, Salem, Osama E, East, James E., Lamarque, Dominique, de Lange, Thomas, Ali, Sharib
Abstract
Colonic polyps are well-recognized precursors to colorectal cancer (CRC), typically detected during colonoscopy. However, the variability in appearance, location, and size of these polyps complicates their detection and removal, leading to challenges in effective surveillance, intervention, and subsequently CRC prevention. The processes of colonoscopy surveillance and polyp removal are highly reliant on the expertise of gastroenterologists and occur within the complexities of the colonic structure. As a result, there is a high rate of missed detections and incomplete removal of colonic polyps, which can adversely impact patient outcomes. Recently, automated methods that use machine learning have been developed to enhance polyp detection and segmentation, thus helping clinical processes and reducing miss rates. These advancements highlight the potential for improving diagnostic accuracy in real-time applications, which ultimately facilitates more effective patient management. Furthermore, integrating sequence data and temporal information could significantly enhance the precision of these methods by capturing the dynamic nature of polyp growth and the changes that occur over time. To rigorously investigate these challenges, data scientists and expert gastroenterologists collaborated to compile a comprehensive dataset that spans multiple centers and diverse populations. This initiative aims to underscore the critical importance of incorporating sequence data and temporal information in the development of robust automated detection and segmentation methods. This study evaluates the applicability of deep learning techniques to real-time clinical colonoscopy tasks using sequence data, highlighting the critical role of temporal relationships between frames in improving diagnostic precision.
Chinese Translation
结肠息肉被广泛认为是结直肠癌(CRC)的前驱病变,通常在结肠镜检查中被发现。然而,这些息肉在外观、位置和大小上的差异使得其检测和切除变得复杂,从而给有效的监测、干预和随后的CRC预防带来了挑战。结肠镜监测和息肉切除的过程高度依赖于胃肠病学专家的专业知识,并发生在结肠结构的复杂性中。因此,结肠息肉的漏检和不完全切除率较高,这可能对患者的预后产生不利影响。最近,已经开发出使用机器学习的自动化方法,以增强息肉的检测和分割,从而帮助临床流程并降低漏检率。这些进展突显了在实时应用中提高诊断准确性的潜力,最终促进了更有效的患者管理。此外,整合序列数据和时间信息可以显著提高这些方法的精度,因为它能够捕捉息肉生长的动态特性及其随时间变化的情况。为了严格研究这些挑战,数据科学家与胃肠病学专家合作,编制了一个涵盖多个中心和不同人群的综合数据集。该项目旨在强调在开发强大的自动检测和分割方法时,整合序列数据和时间信息的重要性。本研究评估了在实时临床结肠镜任务中使用序列数据开发的深度学习技术的适用性,突出了帧之间时间关系在提高诊断精度中的关键作用。
cs.CV / 100 / 2603.04290
Gaussian Wardrobe: Compositional 3D Gaussian Avatars for Free-Form Virtual Try-On
高斯衣橱:用于自由形式虚拟试穿的组合3D高斯头像
Abstract
We introduce Gaussian Wardrobe, a novel framework to digitalize compositional 3D neural avatars from multi-view videos. Existing methods for 3D neural avatars typically treat the human body and clothing as an inseparable entity. However, this paradigm fails to capture the dynamics of complex free-form garments and limits the reuse of clothing across different individuals. To overcome these problems, we develop a novel, compositional 3D Gaussian representation to build avatars from multiple layers of free-form garments. The core of our method is decomposing neural avatars into bodies and layers of shape-agnostic neural garments. To achieve this, our framework learns to disentangle each garment layer from multi-view videos and canonicalizes it into a shape-independent space. In experiments, our method models photorealistic avatars with high-fidelity dynamics, achieving new state-of-the-art performance on novel pose synthesis benchmarks. In addition, we demonstrate that the learned compositional garments contribute to a versatile digital wardrobe, enabling a practical virtual try-on application where clothing can be freely transferred to new subjects. Project page: https://ait.ethz.ch/gaussianwardrobe
Chinese Translation
我们提出了高斯衣橱,一个新颖的框架,用于从多视角视频中数字化组合3D神经头像。现有的3D神经头像方法通常将人体和服装视为不可分割的整体。然而,这种范式未能捕捉复杂自由形式服装的动态特性,并限制了服装在不同个体之间的重用。为了解决这些问题,我们开发了一种新颖的组合3D高斯表示,通过多层自由形式服装构建头像。我们方法的核心在于将神经头像分解为身体和形状无关的神经服装层。为了实现这一点,我们的框架学习从多视角视频中解耦每个服装层,并将其规范化到一个形状无关的空间。在实验中,我们的方法建模出具有高保真动态的照片级真实头像,在新姿势合成基准上实现了新的最先进性能。此外,我们展示了学习到的组合服装有助于构建一个多功能的数字衣橱,使得服装能够自由转移到新对象上,从而实现实用的虚拟试穿应用。项目页面:https://ait.ethz.ch/gaussianwardrobe
cs.CV / 101 / 2603.04291
CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video
CubeComposer:基于视角视频的时空自回归4K 360°视频生成
Abstract
Generating high-quality 360° panoramic videos from perspective input is one of the crucial applications for virtual reality (VR), where high-resolution videos are especially important for an immersive experience. Existing methods are constrained by computational limitations of vanilla diffusion models, only supporting $\leq$ 1K resolution native generation and relying on suboptimal post-hoc super-resolution to increase resolution. We introduce CubeComposer, a novel spatio-temporal autoregressive diffusion model that natively generates 4K-resolution 360° videos. By decomposing videos into cubemap representations with six faces, CubeComposer autoregressively synthesizes content in a well-planned spatio-temporal order, reducing memory demands while enabling high-resolution output. Specifically, to address challenges in multi-dimensional autoregression, we propose: (1) a spatio-temporal autoregressive strategy that orchestrates 360° video generation across cube faces and time windows for coherent synthesis; (2) a cube face context management mechanism, equipped with a sparse context attention design to improve efficiency; and (3) continuity-aware techniques, including cube-aware positional encoding, padding, and blending to eliminate boundary seams. Extensive experiments on benchmark datasets demonstrate that CubeComposer outperforms state-of-the-art methods in native resolution and visual quality, supporting practical VR application scenarios. Project page: https://lg-li.github.io/project/cubecomposer
Chinese Translation
从视角输入生成高质量的360°全景视频是虚拟现实(VR)中的一个关键应用,其中高分辨率视频对于沉浸式体验尤为重要。现有方法受到传统扩散模型计算限制的约束,仅支持≤1K分辨率的原生生成,并依赖次优的后处理超分辨率来提高分辨率。我们提出了CubeComposer,一种新颖的时空自回归扩散模型,能够原生生成4K分辨率的360°视频。通过将视频分解为具有六个面的立方体映射表示,CubeComposer以良好规划的时空顺序自回归合成内容,从而减少内存需求,同时实现高分辨率输出。具体而言,为了解决多维自回归中的挑战,我们提出了:(1)一种时空自回归策略,协调立方体面和时间窗口之间的360°视频生成,以实现一致的合成;(2)一种立方体面上下文管理机制,配备稀疏上下文注意力设计以提高效率;(3)连续性感知技术,包括立方体感知位置编码、填充和混合,以消除边界接缝。在基准数据集上的大量实验表明,CubeComposer在原生分辨率和视觉质量上优于最先进的方法,支持实际的VR应用场景。项目页面:https://lg-li.github.io/project/cubecomposer
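The spatio-temporal autoregressive order can be sketched as a generation schedule over (time window, cube face) cells, where each step conditions on everything produced before it. The concrete face ordering below is an assumption; the abstract only states that the order is "well-planned".

```python
CUBE_FACES = ("front", "right", "back", "left", "up", "down")

def generation_schedule(num_windows, faces=CUBE_FACES):
    """Plan the autoregressive synthesis order: within each time window the
    six cubemap faces are generated in turn, and every step records the
    (window, face) cells already available as its conditioning context.
    Generating one face at a time is what keeps peak memory bounded."""
    schedule, context = [], []
    for w in range(num_windows):
        for f in faces:
            schedule.append({"window": w, "face": f, "context": tuple(context)})
            context.append((w, f))
    return schedule
```

A runtime would walk this schedule cell by cell, invoking the diffusion model once per entry with the listed context as conditioning.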
cs.CV / 102 / 2603.04302
Motion Manipulation via Unsupervised Keypoint Positioning in Face Animation
通过无监督关键点定位进行面部动画的运动操控
Abstract
Face animation deals with controlling and generating facial features with a wide range of applications. The methods based on unsupervised keypoint positioning can produce realistic and detailed virtual portraits. However, they cannot achieve controllable face generation since the existing keypoint decomposition pipelines fail to fully decouple identity semantics and intertwined motion information (e.g., rotation, translation, and expression). To address these issues, we present a new method, Motion Manipulation via unsupervised keypoint positioning in Face Animation (MMFA). We first introduce self-supervised representation learning to encode and decode expressions in the latent feature space and decouple them from other motion information. Secondly, we propose a new way to compute keypoints aiming to achieve arbitrary motion control. Moreover, we design a variational autoencoder to map expression features to a continuous Gaussian distribution, allowing us for the first time to interpolate facial expressions in an unsupervised framework. We have conducted extensive experiments on publicly available datasets to validate the effectiveness of MMFA, which show that MMFA offers pronounced advantages over prior art in creating realistic animation and manipulating face motion.
Chinese Translation
面部动画涉及控制和生成面部特征,具有广泛的应用。基于无监督关键点定位的方法能够生成逼真且细致的虚拟肖像。然而,由于现有的关键点分解流程未能完全解耦身份语义与交织的运动信息(例如,旋转、平移和表情),因此无法实现可控的面部生成。为了解决这些问题,我们提出了一种新方法,即通过无监督关键点定位进行面部动画的运动操控(MMFA)。我们首先引入自监督表示学习,以在潜在特征空间中编码和解码表情,并将其与其他运动信息解耦。其次,我们提出了一种新的关键点计算方法,旨在实现任意运动控制。此外,我们设计了一种变分自编码器,将表情特征映射到连续的高斯分布,从而首次在无监督框架中实现面部表情的插值。我们在公开可用的数据集上进行了广泛的实验,以验证MMFA的有效性,结果表明,MMFA在创建逼真动画和操控面部运动方面相较于现有方法具有显著优势。
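Expression interpolation in a Gaussian VAE latent space can be sketched as interpolation between two encoded expression vectors; intermediate points remain in a high-density region of the latent prior, so they decode to plausible in-between expressions. Plain linear interpolation is an assumption here; the paper does not state its interpolation rule.

```python
def interpolate_expression(z_a, z_b, t):
    """Interpolate between two expression latents z_a and z_b at fraction
    t in [0, 1]. Because the VAE regularizes expressions toward a
    continuous Gaussian, the interpolated latent can be decoded into a
    smooth in-between facial expression."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]
```

Sweeping t from 0 to 1 and decoding each latent would yield a smooth expression morph between the two source faces' expressions.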
cs.CV / 103 / 2603.04307
Dual Diffusion Models for Multi-modal Guided 3D Avatar Generation
用于多模态引导的三维头像生成的双扩散模型
Abstract
Generating high-fidelity 3D avatars from text or image prompts is highly sought after in virtual reality and human-computer interaction. However, existing text-driven methods often rely on iterative Score Distillation Sampling (SDS) or CLIP optimization, which struggle with fine-grained semantic control and suffer from excessively slow inference. Meanwhile, image-driven approaches are severely bottlenecked by the scarcity and high acquisition cost of high-quality 3D facial scans, limiting model generalization. To address these challenges, we first construct a novel, large-scale dataset comprising over 100,000 pairs across four modalities: fine-grained textual descriptions, in-the-wild face images, high-quality light-normalized texture UV maps, and 3D geometric shapes. Leveraging this comprehensive dataset, we propose PromptAvatar, a framework featuring dual diffusion models. Specifically, it integrates a Texture Diffusion Model (TDM) that supports flexible multi-condition guidance from text and/or image prompts, alongside a Geometry Diffusion Model (GDM) guided by text prompts. By learning the direct mapping from multi-modal prompts to 3D representations, PromptAvatar eliminates the need for time-consuming iterative optimization, successfully generating high-fidelity, shading-free 3D avatars in under 10 seconds. Extensive quantitative and qualitative experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches in generation quality, fine-grained detail alignment, and computational efficiency.
Chinese Translation
从文本或图像提示生成高保真三维头像在虚拟现实和人机交互中备受追捧。然而,现有的文本驱动方法通常依赖于迭代的评分蒸馏采样(Score Distillation Sampling, SDS)或CLIP优化,这在细粒度语义控制方面存在困难,并且推理速度过慢。同时,图像驱动的方法受到高质量三维面部扫描稀缺性和高获取成本的严重制约,限制了模型的泛化能力。为了解决这些挑战,我们首先构建了一个新颖的大规模数据集,包含超过100,000对四种模态的数据:细粒度文本描述、自然环境中的面部图像、高质量光照归一化纹理UV图和三维几何形状。利用这个全面的数据集,我们提出了PromptAvatar,一个具有双扩散模型的框架。具体而言,它集成了一个纹理扩散模型(Texture Diffusion Model, TDM),支持来自文本和/或图像提示的灵活多条件引导,以及一个由文本提示引导的几何扩散模型(Geometry Diffusion Model, GDM)。通过学习多模态提示与三维表示之间的直接映射,PromptAvatar消除了耗时的迭代优化需求,成功地在10秒内生成高保真、无阴影的三维头像。大量的定量和定性实验表明,我们的方法在生成质量、细粒度细节对齐和计算效率方面显著优于现有的最先进方法。
cs.CV / 104 / 2603.04314
MOO: A Multi-view Oriented Observations Dataset for Viewpoint Analysis in Cattle Re-Identification
MOO:用于牛只重识别中视角分析的多视角定向观察数据集
Abstract
Animal re-identification (ReID) faces critical challenges due to viewpoint variations, particularly in Aerial-Ground (AG-ReID) settings where models must match individuals across drastic elevation changes. However, existing datasets lack the precise angular annotations required to systematically analyze these geometric variations. To address this, we introduce the Multi-view Oriented Observation (MOO) dataset, a large-scale synthetic AG-ReID dataset of $1,000$ cattle individuals captured from $128$ uniformly sampled viewpoints ($128,000$ annotated images). Using this controlled dataset, we quantify the influence of elevation and identify a critical elevation threshold, above which models generalize significantly better to unseen views. Finally, we validate the transferability to real-world applications in both zero-shot and supervised settings, demonstrating performance gains across four real-world cattle datasets and confirming that synthetic geometric priors effectively bridge the domain gap. Collectively, this dataset and analysis lay the foundation for future model development in cross-view animal ReID. MOO is publicly available at https://github.com/TurtleSmoke/MOO.
Chinese Translation
动物重识别(ReID)面临着由于视角变化而带来的重大挑战,特别是在空中-地面(AG-ReID)环境中,模型必须在剧烈的高度变化下匹配个体。然而,现有数据集缺乏系统分析这些几何变化所需的精确角度注释。为了解决这个问题,我们引入了多视角定向观察(MOO)数据集,这是一个大规模的合成AG-ReID数据集,包含$1,000$头牛个体,从$128$个均匀采样的视角捕获($128,000$张注释图像)。利用这个受控数据集,我们量化了高度的影响,并识别出一个关键的高度阈值,超过该阈值后,模型在未见视角上的泛化能力显著提高。最后,我们在零样本和监督设置中验证了其在现实世界应用中的可迁移性,展示了在四个现实世界牛只数据集上的性能提升,并确认合成几何先验有效地弥合了领域间的差距。总体而言,这个数据集和分析为未来跨视角动物重识别模型的发展奠定了基础。MOO数据集已公开发布,网址为https://github.com/TurtleSmoke/MOO。
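Uniform viewpoint sampling on the sphere, as used to capture the 128 annotated views per individual, can be illustrated with a Fibonacci lattice. This is only one way to obtain near-uniform coverage; the dataset's actual sampling scheme is not specified in the abstract.

```python
import math

def fibonacci_viewpoints(n=128):
    """Approximately uniform camera directions on the unit sphere via the
    Fibonacci lattice: successive points advance by the golden angle in
    azimuth while elevation sweeps evenly from pole to pole."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        y = 1.0 - 2.0 * (i + 0.5) / n        # elevation component in (-1, 1)
        radius = math.sqrt(1.0 - y * y)       # radius of the horizontal circle
        theta = golden_angle * i              # azimuth
        points.append((radius * math.cos(theta), y, radius * math.sin(theta)))
    return points
```

Each returned unit vector can serve as a camera direction toward the subject, giving the elevation spread needed to study aerial-to-ground viewpoint gaps.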
cs.CV / 105 / 2603.04321
SPRINT: Semi-supervised Prototypical Representation for Few-Shot Class-Incremental Tabular Learning
SPRINT:用于少样本类增量表格学习的半监督原型表示
Abstract
Real-world systems must continuously adapt to novel concepts from limited data without forgetting previously acquired knowledge. While Few-Shot Class-Incremental Learning (FSCIL) is established in computer vision, its application to tabular domains remains largely unexplored. Unlike images, tabular streams (e.g., logs, sensors) offer abundant unlabeled data, a scarcity of expert annotations and negligible storage costs, features ignored by existing vision-based methods that rely on restrictive buffers. We introduce SPRINT, the first FSCIL framework tailored for tabular distributions. SPRINT introduces a mixed episodic training strategy that leverages confidence-based pseudo-labeling to enrich novel class representations and exploits low storage costs to retain base class history. Extensive evaluation across six diverse benchmarks spanning cybersecurity, healthcare, and ecological domains, demonstrates SPRINT's cross-domain robustness. It achieves a state-of-the-art average accuracy of 77.37% (5-shot), outperforming the strongest incremental baseline by 4.45%.
Chinese Translation
现实世界的系统必须能够在有限数据的情况下持续适应新概念,而不遗忘先前获得的知识。尽管少样本类增量学习(Few-Shot Class-Incremental Learning, FSCIL)在计算机视觉领域已得到确立,但其在表格领域的应用仍然基本未被探索。与图像不同,表格数据流(例如日志、传感器)提供了丰富的未标记数据、专家注释稀缺以及存储成本微不足道的特点,而这些特性在现有依赖于限制性缓冲区的视觉方法中被忽视。我们提出了SPRINT,这是第一个针对表格分布量身定制的FSCIL框架。SPRINT引入了一种混合情节训练策略,利用基于置信度的伪标签技术来丰富新类表示,并利用低存储成本来保留基础类历史。在跨越网络安全、医疗保健和生态领域的六个多样化基准上进行的广泛评估表明,SPRINT具有跨领域的鲁棒性。它在5-shot情况下实现了77.37%的最先进平均准确率,比最强的增量基线高出4.45%。
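Confidence-based pseudo-labeling over class prototypes can be sketched as a nearest-prototype assignment that is kept only when its confidence clears a threshold. The softmax-over-negative-distance confidence below is an assumed rule for illustration; SPRINT's exact criterion may differ.

```python
import math

def pseudo_label(x, prototypes, threshold=0.9):
    """Assign unlabeled sample x to the class whose prototype is nearest,
    keeping the pseudo-label only when the softmax-over-negative-distance
    confidence exceeds the threshold; low-confidence samples are left
    unlabeled rather than risked as noisy training signal."""
    dist = {c: math.dist(x, p) for c, p in prototypes.items()}
    exps = {c: math.exp(-d) for c, d in dist.items()}
    z = sum(exps.values())
    best = max(exps, key=exps.get)
    conf = exps[best] / z
    return (best, conf) if conf >= threshold else (None, conf)
```

Accepted pseudo-labels enrich the few-shot novel-class representations, while rejected ones (e.g., samples equidistant from two prototypes) are excluded from the episodic updates.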
cs.CV / 106 / 2603.04325
Scalable Evaluation of the Realism of Synthetic Environmental Augmentations in Images
合成环境增强图像现实性的可扩展评估
Abstract
Evaluation of AI systems often requires synthetic test cases, particularly for rare or safety-critical conditions that are difficult to observe in operational data. Generative AI offers a promising approach for producing such data through controllable image editing, but its usefulness depends on whether the resulting images are sufficiently realistic to support meaningful evaluation. We present a scalable framework for assessing the realism of synthetic image-editing methods and apply it to the task of adding environmental conditions (fog, rain, snow, and nighttime) to car-mounted camera images. Using 40 clear-day images, we compare rule-based augmentation libraries with generative AI image-editing models. Realism is evaluated using two complementary automated metrics: a vision-language model (VLM) jury for perceptual realism assessment, and embedding-based distributional analysis to measure similarity to genuine adverse-condition imagery. Generative AI methods substantially outperform rule-based approaches, with the best generative method achieving approximately 3.6 times the acceptance rate of the best rule-based method. Performance varies across conditions: fog proves easiest to simulate, while nighttime transformations remain challenging. Notably, the VLM jury assigns imperfect acceptance even to real adverse-condition imagery, establishing practical ceilings against which synthetic methods can be judged. By this standard, leading generative methods match or exceed real-image performance for most conditions. These results suggest that modern generative image-editing models can enable scalable generation of realistic adverse-condition imagery for evaluation pipelines. Our framework therefore provides a practical approach for scalable realism evaluation, though validation against human studies remains an important direction for future work.
Chinese Translation
人工智能系统的评估通常需要合成测试案例,特别是在稀有或安全关键条件下,这些条件在操作数据中难以观察。生成性人工智能通过可控的图像编辑提供了一种生成此类数据的有前景的方法,但其有效性取决于生成的图像是否足够真实,以支持有意义的评估。我们提出了一个可扩展的框架,用于评估合成图像编辑方法的现实性,并将其应用于向车载摄像头图像中添加环境条件(如雾、雨、雪和夜间)的任务。使用40张晴天图像,我们比较了基于规则的增强库与生成性人工智能图像编辑模型。现实性通过两种互补的自动化指标进行评估:视觉-语言模型(VLM)陪审团用于感知现实性评估,以及基于嵌入的分布分析来测量与真实恶劣条件图像的相似性。生成性人工智能方法在性能上显著优于基于规则的方法,最佳生成方法的接受率约为最佳基于规则方法的3.6倍。不同条件下的性能差异明显:雾的模拟最为容易,而夜间转换仍然具有挑战性。值得注意的是,VLM陪审团即使对真实的恶劣条件图像也给予了不完美的接受度,这为评估合成方法设定了实际的上限。根据这一标准,领先的生成方法在大多数条件下的表现与真实图像相当或更优。这些结果表明,现代生成图像编辑模型能够实现可扩展的现实恶劣条件图像生成,用于评估流程。因此,我们的框架提供了一种可扩展的现实性评估的实用方法,尽管与人类研究的验证仍然是未来工作的一个重要方向。
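The embedding-based distributional analysis can be illustrated with a minimal statistic: the distance between the mean embeddings of the synthetic and real adverse-condition image sets. The paper's analysis may use a richer distributional measure; this mean-distance form is a deliberately simple stand-in.

```python
import math

def mean_embedding_distance(synthetic, real):
    """Euclidean distance between the mean embedding of a set of synthetic
    images and that of genuine adverse-condition images; smaller values
    indicate the synthetic set sits closer to the real distribution."""
    def mean(vectors):
        n, dim = len(vectors), len(vectors[0])
        return [sum(v[d] for v in vectors) / n for d in range(dim)]
    m_syn, m_real = mean(synthetic), mean(real)
    return math.dist(m_syn, m_real)
```

In practice the vectors would come from a pretrained image encoder, and the statistic would be compared across augmentation methods under a fixed encoder.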
cs.CV / 107 / 2603.04337
Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection
Pointer-CAD:通过基于指针的边缘和面选择统一边界表示与命令序列
Abstract
Constructing computer-aided design (CAD) models is labor-intensive but essential for engineering and manufacturing. Recent advances in Large Language Models (LLMs) have inspired the LLM-based CAD generation by representing CAD as command sequences. But these methods struggle in practical scenarios because command sequence representation does not support entity selection (e.g. faces or edges), limiting its ability to support complex editing operations such as chamfer or fillet. Further, the discretization of a continuous variable during sketch and extrude operations may result in topological errors. To address these limitations, we present Pointer-CAD, a novel LLM-based CAD generation framework that leverages a pointer-based command sequence representation to explicitly incorporate the geometric information of B-rep models into sequential modeling. In particular, Pointer-CAD decomposes CAD model generation into steps, conditioning the generation of each subsequent step on both the textual description and the B-rep generated from previous steps. Whenever an operation requires the selection of a specific geometric entity, the LLM predicts a Pointer that selects the most feature-consistent candidate from the available set. Such a selection operation also reduces the quantization error in the command sequence-based representation. To support the training of Pointer-CAD, we develop a data annotation pipeline that produces expert-level natural language descriptions and apply it to build a dataset of approximately 575K CAD models. Extensive experimental results demonstrate that Pointer-CAD effectively supports the generation of complex geometric structures and reduces segmentation error to an extremely low level, achieving a significant improvement over prior command sequence methods, thereby significantly mitigating the topological inaccuracies introduced by quantization error.
Chinese Translation
构建计算机辅助设计(CAD)模型是劳动密集型的,但对于工程和制造至关重要。最近在大型语言模型(LLMs)方面的进展激发了基于LLM的CAD生成方法,通过将CAD表示为命令序列。然而,这些方法在实际场景中面临挑战,因为命令序列表示不支持实体选择(例如,面或边),限制了其支持复杂编辑操作(如倒角或圆角)的能力。此外,在草图和拉伸操作中对连续变量的离散化可能导致拓扑错误。为了解决这些局限性,我们提出了Pointer-CAD,这是一种新颖的基于LLM的CAD生成框架,利用基于指针的命令序列表示,明确将边界表示模型的几何信息纳入顺序建模中。具体而言,Pointer-CAD将CAD模型生成分解为多个步骤,使每个后续步骤的生成都依赖于文本描述和来自前一步骤生成的边界表示。每当操作需要选择特定几何实体时,LLM会预测一个指针,从可用集合中选择最具特征一致性的候选项。这种选择操作还减少了基于命令序列表示中的量化误差。为了支持Pointer-CAD的训练,我们开发了一个数据注释管道,生成专家级自然语言描述,并应用于构建约575K CAD模型的数据集。大量实验结果表明,Pointer-CAD有效支持复杂几何结构的生成,并将分割错误降低到极低水平,显著优于以往的命令序列方法,从而显著减轻了量化误差引入的拓扑不准确性。
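The pointer step can be sketched as scoring each candidate B-rep entity's feature vector against the decoder's query and returning the index of the best match, so the model selects an existing edge or face instead of emitting quantized coordinates. Dot-product scoring is an assumption; the paper's scoring function may differ.

```python
def pointer_select(query, candidate_feats):
    """Pointer over B-rep entities: score each candidate (edge/face) feature
    against the query by dot product and return the index of the most
    feature-consistent candidate plus the raw scores. Selecting an index
    avoids the quantization error of regressing continuous coordinates."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    scores = [dot(query, c) for c in candidate_feats]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores
```

At training time the scores would be softmax-normalized and supervised against the ground-truth entity index, as in standard pointer networks.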
cs.CV / 108 / 2603.04338
ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors
ArtHOI:通过视频先验的4D重建合成关节人类-物体交互
Abstract
Synthesizing physically plausible articulated human-object interactions (HOI) without 3D/4D supervision remains a fundamental challenge. While recent zero-shot approaches leverage video diffusion models to synthesize human-object interactions, they are largely confined to rigid-object manipulation and lack explicit 4D geometric reasoning. To bridge this gap, we formulate articulated HOI synthesis as a 4D reconstruction problem from monocular video priors: given only a video generated by a diffusion model, we reconstruct a full 4D articulated scene without any 3D supervision. This reconstruction-based approach treats the generated 2D video as supervision for an inverse rendering problem, recovering geometrically consistent and physically plausible 4D scenes that naturally respect contact, articulation, and temporal coherence. We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. Our key designs are: 1) Flow-based part segmentation: leveraging optical flow as a geometric cue to disentangle dynamic from static regions in monocular video; 2) Decoupled reconstruction pipeline: joint optimization of human motion and object articulation is unstable under monocular ambiguity, so we first recover object articulation, then synthesize human motion conditioned on the reconstructed object states. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded. Across diverse articulated scenes (e.g., opening fridges, cabinets, microwaves), ArtHOI significantly outperforms prior methods in contact accuracy, penetration reduction, and articulation fidelity, extending zero-shot interaction synthesis beyond rigid manipulation through reconstruction-informed synthesis.
cs.CV / 109 / 2603.04340
Balancing Fidelity, Utility, and Privacy in Synthetic Cardiac MRI Generation: A Comparative Study
Abstract
Deep learning in cardiac MRI (CMR) is fundamentally constrained by both data scarcity and privacy regulations. This study systematically benchmarks three generative architectures: Denoising Diffusion Probabilistic Models (DDPM), Latent Diffusion Models (LDM), and Flow Matching (FM) for synthetic CMR generation. Utilizing a two-stage pipeline where anatomical masks condition image synthesis, we evaluate generated data across three critical axes: fidelity, utility, and privacy. Our results show that diffusion-based models, particularly DDPM, provide the most effective balance between downstream segmentation utility, image fidelity, and privacy preservation under limited-data conditions, while FM demonstrates promising privacy characteristics with slightly lower task-level performance. These findings quantify the trade-offs between cross-domain generalization and patient confidentiality, establishing a framework for safe and effective synthetic data augmentation in medical imaging.
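For context on the mask-conditioned diffusion stage being benchmarked, here is a minimal numpy sketch of the standard DDPM noise-prediction objective with an anatomical mask passed as conditioning. The schedule values and the trivial `eps_model` stand-in are illustrative assumptions, not the study's actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    """Forward process: noise a clean image x0 to timestep t."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

def ddpm_loss(eps_model, x0, mask, t, rng):
    """Noise-prediction MSE, with the anatomical mask as a conditioning input."""
    eps = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, eps)
    eps_hat = eps_model(x_t, mask, t)
    return float(np.mean((eps_hat - eps) ** 2))

# A trivial "denoiser" that predicts zero noise, for demonstration only.
zero_model = lambda x_t, mask, t: np.zeros_like(x_t)
x0 = rng.standard_normal((1, 32, 32))
mask = (rng.random((1, 32, 32)) > 0.5).astype(np.float32)
loss = ddpm_loss(zero_model, x0, mask, t=500, rng=rng)
```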
cs.CV / 110 / 2603.04341
Hold-One-Shot-Out (HOSO) for Validation-Free Few-Shot CLIP Adapters
Abstract
In many CLIP adaptation methods, a blending ratio hyperparameter controls the trade-off between general pretrained CLIP knowledge and the limited, dataset-specific supervision from the few-shot cases. Most few-shot CLIP adaptation techniques report results by ablation of the blending ratio on the test set or require additional validation sets to select the blending ratio per dataset, and thus are not strictly few-shot. We present a simple, validation-free method for learning the blending ratio in CLIP adaptation. Hold-One-Shot-Out (HOSO) presents a novel approach for CLIP-Adapter-style methods to compete in the newly established validation-free setting. CLIP-Adapter with HOSO (HOSO-Adapter) learns the blending ratio using a one-shot, hold-out set, while the adapter trains on the remaining few-shot support examples. Under the validation-free few-shot protocol, HOSO-Adapter outperforms the CLIP-Adapter baseline by more than 4 percentage points on average across 11 standard few-shot datasets. Interestingly, in the 8- and 16-shot settings, HOSO-Adapter outperforms CLIP-Adapter even with the optimal blending ratio selected on the test set. Ablation studies validate the use of a one-shot hold-out mechanism, decoupled training, and improvements over the naively learnt blending ratio baseline. Code is released here: https://github.com/chris-vorster/HOSO-Adapter
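The hold-one-shot-out procedure can be sketched in a few lines: hold out one shot per class, fit the adapter on the remaining support shots, then pick the blending ratio that best classifies the held-out shots. The tiny prototype-based "adapter" and synthetic features below are illustrative assumptions standing in for CLIP embeddings and the real adapter network.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cls, dim, shots = 4, 16, 4
protos = rng.standard_normal((n_cls, dim))   # stand-in for class text embeddings
feats = protos[:, None, :] + 0.3 * rng.standard_normal((n_cls, shots, dim))

holdout = feats[:, 0]    # one shot per class, reserved for ratio selection
support = feats[:, 1:]   # remaining shots train the adapter

# "Adapter" logits: similarity to the support-set class means.
adapter_protos = support.mean(axis=1)

def blended_accuracy(ratio, x, labels):
    zero_shot = x @ protos.T            # frozen CLIP similarity
    adapted = x @ adapter_protos.T      # few-shot adapter similarity
    logits = ratio * adapted + (1 - ratio) * zero_shot
    return float((logits.argmax(1) == labels).mean())

labels = np.arange(n_cls)
ratios = np.linspace(0.0, 1.0, 11)
best_ratio = max(ratios, key=lambda r: blended_accuracy(r, holdout, labels))
```

The key point is that no external validation set appears anywhere: the ratio is selected entirely from the few-shot budget itself.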
cs.CV / 111 / 2603.04343
Enhancing Authorship Attribution with Synthetic Paintings
Abstract
Attributing authorship to paintings is a historically complex task, and one of its main challenges is the limited availability of real artworks for training computational models. This study investigates whether synthetic images, generated through DreamBooth fine-tuning of Stable Diffusion, can improve the performance of classification models in this context. We propose a hybrid approach that combines real and synthetic data to enhance model accuracy and generalization across similar artistic styles. Experimental results show that adding synthetic images leads to higher ROC-AUC and accuracy compared to using only real paintings. By integrating generative and discriminative methods, this work contributes to the development of computer vision techniques for artwork authentication in data-scarce scenarios.
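The hybrid real-plus-synthetic training mix can be sketched as pooling each artist's few real paintings with a budgeted number of DreamBooth-style generations. The file names and the synthetic-to-real ratio are invented for illustration; the paper does not specify this exact recipe.

```python
def build_training_set(real_by_artist, synthetic_by_artist, synth_per_real=2):
    """Pool real and (budget-capped) synthetic images into one labeled list."""
    dataset = []
    for artist, reals in real_by_artist.items():
        dataset += [(path, artist) for path in reals]
        budget = synth_per_real * len(reals)
        dataset += [(path, artist)
                    for path in synthetic_by_artist.get(artist, [])[:budget]]
    return dataset

real = {"vermeer": ["v1.jpg", "v2.jpg"]}
synth = {"vermeer": [f"v_synth_{i}.png" for i in range(10)]}
train = build_training_set(real, synth)
```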
cs.CV / 112 / 2603.04346
Underrepresented in Foundation Model Pretraining Data? A One-Shot Probe
Abstract
Large-scale Vision-Language Foundation Models (VLFMs), such as CLIP, now underpin a wide range of computer vision research and applications. VLFMs are often adapted to various domain-specific tasks. However, VLFM performance on novel, specialised, or underrepresented domains remains inconsistent. Evaluating VLFMs typically requires labelled test sets, which are often unavailable for niche domains of interest, particularly those from the Global South. We address this gap by proposing a highly data-efficient method to predict a VLFM's zero-shot accuracy on a target domain using only a single labelled image per class. Our approach uses a Large Language Model to generate plausible counterfactual descriptions of a given image. By measuring the VLFM's ability to distinguish the correct description from these hard negatives, we engineer features that capture the VLFM's discriminative power in its shared embedding space. A linear regressor trained on these similarity scores estimates the VLFM's zero-shot test accuracy across various visual domains with a Pearson-r correlation of 0.96. We demonstrate our method's performance across five diverse datasets, including standard benchmark datasets and underrepresented datasets from Africa. Our work provides a low-cost, reliable tool for probing VLFMs, enabling researchers and practitioners to make informed decisions about data annotation efforts before committing significant resources. The model training code, generated captions and counterfactuals are released here: https://github.com/chris-vorster/PreLabellingProbe.
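The probe's core computation is a similarity margin between the correct caption and the LLM-generated counterfactuals, fed to a linear regressor. The sketch below simulates this with synthetic embeddings where a "separability" factor drives both the margin and a proxy accuracy; the real method uses CLIP encoders and an LLM, and these stand-ins are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def margin_feature(img_emb, true_txt, counterfactuals):
    """Cosine-similarity margin between the true caption and hard negatives."""
    img = img_emb / np.linalg.norm(img_emb)
    pos = (true_txt / np.linalg.norm(true_txt)) @ img
    neg = max((t / np.linalg.norm(t)) @ img for t in counterfactuals)
    return pos - neg

# Simulate "domains" where separability drives both the margin and accuracy.
X, y = [], []
for _ in range(50):
    sep = rng.uniform(0.1, 2.0)
    img = rng.standard_normal(8)
    true_txt = img + rng.standard_normal(8) / sep   # noisier when sep is low
    cfs = [rng.standard_normal(8) for _ in range(3)]
    X.append(margin_feature(img, true_txt, cfs))
    y.append(1.0 / (1.0 + np.exp(-sep)))            # proxy zero-shot accuracy

X, y = np.array(X), np.array(y)
w, b = np.polyfit(X, y, 1)                          # the linear regressor
r = float(np.corrcoef(w * X + b, y)[0, 1])
```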
cs.CV / 113 / 2603.04348
RANGER: Sparsely-Gated Mixture-of-Experts with Adaptive Retrieval Re-ranking for Pathology Report Generation
Abstract
Pathology report generation remains a relatively under-explored downstream task, primarily due to the gigapixel scale and complex morphological heterogeneity of Whole Slide Images (WSIs). Existing pathology report generation frameworks typically employ transformer architectures, relying on a homogeneous decoder architecture and static knowledge retrieval integration. Such architectures limit generative specialization and may introduce noisy external guidance during the report generation process. To address these limitations, we propose RANGER, a sparsely-gated Mixture-of-Experts (MoE) framework with adaptive retrieval re-ranking for pathology report generation. Specifically, we integrate a sparsely gated MoE into the decoder, along with noisy top-$k$ routing and load-balancing regularization, to enable dynamic expert specialization across various diagnostic patterns. Additionally, we introduce an adaptive retrieval re-ranking module that selectively refines retrieved memory from a knowledge base before integration, reducing noise and improving semantic alignment based on visual feature representations. We perform extensive experiments on the PathText-BRCA dataset and demonstrate consistent improvements over existing approaches across standard natural language generation metrics. Our full RANGER model achieves optimal performance on the PathText dataset, reaching BLEU-1 to BLEU-4 scores of 0.4598, 0.3044, 0.2036, and 0.1435, respectively, with METEOR of 0.1883, and ROUGE-L of 0.3038, validating the effectiveness of dynamic expert routing and adaptive knowledge refinement for semantically grounded pathology report generation.
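Noisy top-$k$ routing with a load-balancing term follows the well-known Shazeer-style sparsely gated MoE recipe; a compact numpy sketch is below. The linear-map "experts", the softplus noise scale, and the variance-based balance penalty are common choices assumed for illustration, not RANGER's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, dim, n_experts, k = 6, 8, 4, 2

W_gate = rng.standard_normal((dim, n_experts)) * 0.1
W_noise = rng.standard_normal((dim, n_experts)) * 0.1
experts = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(n_experts)]
x = rng.standard_normal((n_tokens, dim))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Noisy gating: add learned, per-expert noise before selecting the top-k.
noise_scale = np.log1p(np.exp(x @ W_noise))          # softplus
logits = x @ W_gate + rng.standard_normal((n_tokens, n_experts)) * noise_scale
topk = np.argsort(logits, axis=-1)[:, -k:]
masked = np.full_like(logits, -np.inf)
np.put_along_axis(masked, topk, np.take_along_axis(logits, topk, axis=-1), axis=-1)
gates = softmax(masked)                              # zero weight outside the top-k

# Combine expert outputs, then penalize uneven expert usage.
out = sum(gates[:, [e]] * (x @ experts[e]) for e in range(n_experts))
importance = gates.sum(axis=0)
load_balance_loss = float(np.var(importance / importance.sum()) * n_experts**2)
```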
cs.CV / 114 / 2603.04349
FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering
Abstract
The ability to understand long videos is vital for embodied intelligent agents, because their effectiveness depends on how well they can accumulate, organize, and leverage long-horizon perceptual memories. Recently, multimodal LLMs have been gaining popularity for solving the long video understanding task due to their general ability to understand natural language and to leverage world knowledge. However, as the number of frames provided to an MLLM increases, the quality of its responses tends to degrade, and inference time grows. Therefore, when using MLLMs for long video understanding, a crucial step is selecting key frames from the video to answer user queries. In this work, we develop FocusGraph, a framework for keyframe selection for question answering over long egocentric videos. It leverages a lightweight trainable Scene-Caption LLM Selector that selects query-relevant clips based on their graph-based captions, and a training-free method for selecting keyframes from these clips. Unlike existing methods, the proposed Scene-Caption LLM Selector does not rely on the original sequence of low-resolution frames; instead, it operates on a compact textual representation of the scene. We then design a training-free Patch-wise Sparse-Flow Retention (PSFR) method to select keyframes from the resulting sequence of clips, which are fed into an MLLM to produce the final answer. Together, these components enable FocusGraph to achieve state-of-the-art results on challenging egocentric long-video question answering benchmarks, including FindingDory and HourVideo, while significantly reducing inference time relative to baseline approaches.
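One plausible reading of the patch-wise sparse-flow idea, for illustration only: keep a frame when only a small fraction of its patches carry motion, on the intuition that sparse motion marks discrete, informative events while dense motion (e.g., a camera pan) and static stretches are redundant. The threshold, patch grid, and selection rule below are invented assumptions; the paper's actual PSFR criterion may differ.

```python
import numpy as np

def retain_keyframes(flow_mags, moving_thresh=1.0, sparse_frac=0.25):
    """flow_mags: (frames, patches) per-patch flow magnitude between frames."""
    moving = flow_mags > moving_thresh
    frac_moving = moving.mean(axis=1)
    # Retain frames whose motion is sparse but nonzero.
    keep = (frac_moving > 0) & (frac_moving <= sparse_frac)
    return np.flatnonzero(keep)

flow = np.zeros((5, 16))
flow[1, :2] = 3.0   # sparse motion: a couple of patches move (keep)
flow[3, :] = 3.0    # dense motion: the whole frame moves (drop)
kept = retain_keyframes(flow)
```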
cs.CV / 115 / 2603.04379
Helios: Real Real-Time Long Video Generation Model
Abstract
We introduce Helios, the first 14B video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching the quality of a strong baseline. We make breakthroughs along three key dimensions: (1) robustness to long-video drifting without commonly used anti-drifting heuristics such as self-forcing, error-banks, or keyframe sampling; (2) real-time generation without standard acceleration techniques such as KV-cache, sparse/linear attention, or quantization; and (3) training without parallelism or sharding frameworks, enabling image-diffusion-scale batch sizes while fitting up to four 14B models within 80 GB of GPU memory. Specifically, Helios is a 14B autoregressive diffusion model with a unified input representation that natively supports T2V, I2V, and V2V tasks. To mitigate drifting in long-video generation, we characterize typical failure modes and propose simple yet effective training strategies that explicitly simulate drifting during training, while eliminating repetitive motion at its source. For efficiency, we heavily compress the historical and noisy context and reduce the number of sampling steps, yielding computational costs comparable to -- or lower than -- those of 1.3B video generative models. Moreover, we introduce infrastructure-level optimizations that accelerate both inference and training while reducing memory consumption. Extensive experiments demonstrate that Helios consistently outperforms prior methods on both short- and long-video generation. We plan to release the code, base model, and distilled model to support further development by the community.
cs.CV / 116 / 2603.04380
TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning
Abstract
Traditional vision-language models struggle with contrastive fine-grained taxonomic reasoning, particularly when distinguishing between visually similar species within the same genus or family. We introduce TaxonRL, a reinforcement learning approach using Group Relative Policy Optimization with intermediate rewards that decomposes the reasoning process into hierarchical taxonomic predictions. Our method incentivizes models to explicitly reason about species-level, genus-level, and family-level features before making final classifications. This structured approach is designed not only to boost accuracy but also to yield a transparent, verifiable decision-making process. On the challenging Birds-to-Words dataset, TaxonRL achieves 91.7\% average accuracy, exceeding human performance (77.3\%) while generating interpretable reasoning traces. We demonstrate strong cross-domain generalization, showing substantial gains in primate and marine species verification. Our results establish that enforcing structured, hierarchical reasoning provides a powerful and transferable framework for fine-grained visual discrimination.
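The intermediate-reward decomposition pairs naturally with GRPO's group-relative scoring: each rollout predicts family, genus, and species, earns partial credit at every level, and is then standardized against its group. The level weights and example labels below are illustrative assumptions, not the paper's reward constants.

```python
import numpy as np

def hierarchical_reward(pred, truth, weights=(0.2, 0.3, 0.5)):
    """Partial credit for (family, genus, species) correctness, in that order."""
    return sum(w for w, p, t in zip(weights, pred, truth) if p == t)

def group_relative_advantages(rewards):
    """GRPO-style normalization: score each rollout against its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

truth = ("Corvidae", "Corvus", "Corvus corax")
rollouts = [
    ("Corvidae", "Corvus", "Corvus corax"),    # fully correct
    ("Corvidae", "Corvus", "Corvus corone"),   # right genus, wrong species
    ("Corvidae", "Pica", "Pica pica"),         # right family only
]
rewards = [hierarchical_reward(p, truth) for p in rollouts]
adv = group_relative_advantages(rewards)
```

Note how the second rollout still receives positive partial credit relative to the third, which is exactly the dense signal that pure final-answer rewards would discard.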
cs.CV / 117 / 2603.04385
ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training
Abstract
Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and $\pi^3$ have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU, more than $20\times$ faster than state-of-the-art methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real-time scene-state querying and its extension to sequential streaming reconstruction.
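The linear-time, stateful behavior comes from test-time-training (TTT) style layers: the scene state is a set of weights updated by one self-supervised gradient step per incoming frame, so total cost grows linearly with the frame count. The toy reconstruction loss and dimensions below are illustrative assumptions, not ZipMap's actual layer.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, lr = 8, 0.05

def zip_frames(frames, dim, lr):
    """Fold a frame stream into a compact state W, one gradient step per frame."""
    W = np.zeros((dim, dim))
    states = []
    for f in frames:
        err = W @ f - f                  # self-supervised reconstruction error
        W = W - lr * np.outer(err, f)    # one step on 0.5 * ||W f - f||^2
        states.append(W.copy())
    return states

frames = [rng.standard_normal(dim) for _ in range(50)]
states = zip_frames(frames, dim, lr)

# Each update reduces the reconstruction error for the frame it consumed.
f_last = frames[-1]
before = float(np.linalg.norm(states[-2] @ f_last - f_last))
after = float(np.linalg.norm(states[-1] @ f_last - f_last))
```

Because the state is explicit, it can also be queried mid-stream (the stateful querying the abstract mentions) simply by reading off `states[t]`.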
cs.CV / 118 / 2603.04399
SimpliHuMoN: Simplifying Human Motion Prediction
Abstract
Human motion prediction combines the tasks of trajectory forecasting and human pose prediction. For each of the two tasks, specialized models have been developed. Combining these models for holistic human motion prediction is non-trivial, and recent methods have struggled to compete on established benchmarks for individual tasks. To address this, we propose a simple yet effective transformer-based model for human motion prediction. The model employs a stack of self-attention modules to effectively capture both spatial dependencies within a pose and temporal relationships across a motion sequence. This simple, streamlined, end-to-end model is sufficiently versatile to handle pose-only, trajectory-only, and combined prediction tasks without task-specific modifications. We demonstrate that this approach achieves state-of-the-art results across all tasks through extensive experiments on a wide range of benchmark datasets, including Human3.6M, AMASS, ETH-UCY, and 3DPW.
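The streamlined design can be sketched by flattening a motion sequence into one token per (frame, joint) pair and letting plain self-attention mix information across space and time jointly, with no separate trajectory or pose branch. The single-head attention and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
frames, joints, dim = 4, 5, 8

def self_attention(tokens, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a flat token sequence."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v

poses = rng.standard_normal((frames, joints, dim))
tokens = poses.reshape(frames * joints, dim)   # joint spatial-temporal tokens
Wq, Wk, Wv = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv).reshape(frames, joints, dim)
```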
cs.AI / 1 / 2603.03456
Asymmetric Goal Drift in Coding Agents Under Value Conflict
Abstract
Agentic coding agents are increasingly deployed autonomously, at scale, and over long-context horizons. Throughout an agent's lifetime, it must navigate tensions between explicit instructions, learned values, and environmental pressures, often in contexts unseen during training. Prior work on model preferences, agent behavior under value tensions, and goal drift has relied on static, synthetic settings that do not capture the complexity of real-world environments. To this end, we introduce a framework built on OpenCode to orchestrate realistic, multi-step coding tasks to measure how agents violate explicit constraints in their system prompt over time with and without environmental pressure toward competing values. Using this framework, we demonstrate that GPT-5 mini, Haiku 4.5, and Grok Code Fast 1 exhibit asymmetric drift: they are more likely to violate their system prompt when its constraint opposes strongly-held values like security and privacy. We find for the models and values tested that goal drift correlates with three compounding factors: value alignment, adversarial pressure, and accumulated context. However, even strongly-held values like privacy show non-zero violation rates under sustained environmental pressure. These findings reveal that shallow compliance checks are insufficient and that comment-based pressure can exploit model value hierarchies to override system prompt instructions. More broadly, our findings highlight a gap in current alignment approaches in ensuring that agentic systems appropriately balance explicit user constraints against broadly beneficial learned preferences under sustained environmental pressure.
cs.AI / 2 / 2603.03565
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
Abstract
Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly coupled multi-agent systems. Grocery shopping further amplifies these difficulties, as user requests are often underspecified, highly preference-sensitive, and constrained by factors such as budget and inventory. In this paper, we present a practical blueprint for evaluating and optimizing conversational shopping assistants, illustrated through a production-scale AI grocery assistant. We introduce a multi-faceted evaluation rubric that decomposes end-to-end shopping quality into structured dimensions and develop a calibrated LLM-as-judge pipeline aligned with human annotations. Building on this evaluation foundation, we investigate two complementary prompt-optimization strategies based on a SOTA prompt-optimizer called GEPA (Shao et al., 2025): (1) Sub-agent GEPA, which optimizes individual agent nodes against localized rubrics, and (2) MAMuT (Multi-Agent Multi-Turn) GEPA (Herrera et al., 2026), a novel system-level approach that jointly optimizes prompts across agents using multi-turn simulation and trajectory-level scoring. We release rubric templates and evaluation design guidance to support practitioners building production CSAs.
cs.AI / 3 / 2603.03655
Mozi: Governed Autonomy for Drug Discovery LLM Agents
Abstract
Tool-augmented large language model (LLM) agents promise to unify scientific reasoning with computation, yet their deployment in high-stakes domains like drug discovery is bottlenecked by two critical barriers: unconstrained tool-use governance and poor long-horizon reliability. In dependency-heavy pharmaceutical pipelines, autonomous agents often drift into irreproducible trajectories, where early-stage hallucinations multiplicatively compound into downstream failures. To overcome this, we present Mozi, a dual-layer architecture that bridges the flexibility of generative AI with the deterministic rigor of computational biology. Layer A (Control Plane) establishes a governed supervisor--worker hierarchy that enforces role-based tool isolation, limits execution to constrained action spaces, and drives reflection-based replanning. Layer B (Workflow Plane) operationalizes canonical drug discovery stages -- from Target Identification to Lead Optimization -- as stateful, composable skill graphs. This layer integrates strict data contracts and strategic human-in-the-loop (HITL) checkpoints to safeguard scientific validity at high-uncertainty decision boundaries. Operating on the design principle of ``free-form reasoning for safe tasks, structured execution for long-horizon pipelines,'' Mozi provides built-in robustness mechanisms and trace-level auditability to completely mitigate error accumulation. We evaluate Mozi on PharmaBench, a curated benchmark for biomedical agents, demonstrating superior orchestration accuracy over existing baselines. Furthermore, through end-to-end therapeutic case studies, we demonstrate Mozi's ability to navigate massive chemical spaces, enforce stringent toxicity filters, and generate highly competitive in silico candidates, effectively transforming the LLM from a fragile conversationalist into a reliable, governed co-scientist.
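The role-based tool isolation in the Control Plane amounts to a per-role allowlist enforced at dispatch time: a call outside the role's constrained action space surfaces a governance violation instead of executing. The role names, tool names, and return shape below are invented for illustration; Mozi's real registry is richer.

```python
# Hypothetical role -> allowed-tool registry.
ROLE_TOOLS = {
    "target_identification": {"search_literature", "query_gene_db"},
    "lead_optimization": {"dock_ligand", "predict_admet"},
}

def dispatch(role, tool, args):
    """Execute a tool call only if it lies in the role's constrained action space."""
    allowed = ROLE_TOOLS.get(role, set())
    if tool not in allowed:
        # Governed failure: report a violation rather than executing.
        return {"ok": False,
                "error": f"tool '{tool}' not allowed for role '{role}'"}
    return {"ok": True, "call": (tool, args)}

ok = dispatch("lead_optimization", "dock_ligand", {"ligand": "CCO"})
blocked = dispatch("lead_optimization", "query_gene_db", {"gene": "EGFR"})
```

Logging every `dispatch` outcome is what makes trajectories auditable at the trace level.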
cs.AI / 4 / 2603.03680
MAGE: Meta-Reinforcement Learning for Language Agents toward Strategic Exploration and Exploitation
Abstract
Large Language Model (LLM) agents have demonstrated remarkable proficiency in learned tasks, yet they often struggle to adapt to non-stationary environments with feedback. While In-Context Learning and external memory offer some flexibility, they fail to internalize the adaptive ability required for long-term improvement. Meta-Reinforcement Learning (meta-RL) provides an alternative by embedding the learning process directly within the model. However, existing meta-RL approaches for LLMs focus primarily on exploration in single-agent settings, neglecting the strategic exploitation necessary for multi-agent environments. We propose MAGE, a meta-RL framework that empowers LLM agents for strategic exploration and exploitation. MAGE utilizes a multi-episode training regime where interaction histories and reflections are integrated into the context window. By using the final episode reward as the objective, MAGE incentivizes the agent to refine its strategy based on past experiences. We further combine population-based training with an agent-specific advantage normalization technique to enrich agent diversity and ensure stable learning. Experiment results show that MAGE outperforms existing baselines in both exploration and exploitation tasks. Furthermore, MAGE exhibits strong generalization to unseen opponents, suggesting it has internalized the ability for strategic exploration and exploitation. Code is available at https://github.com/Lu-Yang666/MAGE.
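The agent-specific advantage normalization can be sketched directly: in a population, each agent's final-episode rewards are standardized against that agent's own statistics rather than a shared pool, so agents with very different reward scales still receive comparable learning signals. The population and reward values are illustrative assumptions.

```python
import numpy as np

def per_agent_advantages(rewards_by_agent):
    """Standardize each agent's rewards against its own mean and std."""
    advs = {}
    for agent, rewards in rewards_by_agent.items():
        r = np.asarray(rewards, dtype=float)
        advs[agent] = (r - r.mean()) / (r.std() + 1e-8)
    return advs

# Final-episode rewards for two agents whose reward scales differ by 10x.
population = {
    "agent_a": [1.0, 2.0, 3.0],
    "agent_b": [10.0, 20.0, 30.0],
}
advs = per_agent_advantages(population)
```

Despite the 10x scale gap, both agents end up with identical normalized advantages, which is the stabilizing property the abstract attributes to this technique.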
cs.AI / 5 / 2603.03686
AI4S-SDS: A Neuro-Symbolic Solvent Design System via Sparse MCTS and Differentiable Physics Alignment
Abstract
Automated design of chemical formulations is a cornerstone of materials science, yet it requires navigating a high-dimensional combinatorial space involving discrete compositional choices and continuous geometric constraints. Existing Large Language Model (LLM) agents face significant challenges in this setting, including context window limitations during long-horizon reasoning and path-dependent exploration that may lead to mode collapse. To address these issues, we introduce AI4S-SDS, a closed-loop neuro-symbolic framework that integrates multi-agent collaboration with a tailored Monte Carlo Tree Search (MCTS) engine. We propose a Sparse State Storage mechanism with Dynamic Path Reconstruction, which decouples reasoning history from context length and enables arbitrarily deep exploration under fixed token budgets. To reduce local convergence and improve coverage, we implement a Global--Local Search Strategy: a memory-driven planning module adaptively reconfigures the search root based on historical feedback, while a Sibling-Aware Expansion mechanism promotes orthogonal exploration at the node level. Furthermore, we bridge symbolic reasoning and physical feasibility through a Differentiable Physics Engine, employing a hybrid normalized loss with sparsity-inducing regularization to optimize continuous mixing ratios under thermodynamic constraints. Empirical results show that AI4S-SDS achieves full validity under the adopted HSP-based physical constraints and substantially improves exploration diversity compared to baseline agents. In preliminary lithography experiments, the framework identifies a novel photoresist developer formulation that demonstrates competitive or superior performance relative to a commercial benchmark, highlighting the potential of diversity-driven neuro-symbolic search for scientific discovery.
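The differentiable mixing-ratio optimization can be sketched as follows: ratios live on the simplex via a softmax parameterization, the objective mixes a normalized property target with a sparsity-inducing penalty, and gradient descent (central finite differences here, for brevity) drives the update. The scalar property, target, weights, and the concentration-style sparsity term are invented for illustration; the real engine uses HSP-based thermodynamic terms.

```python
import numpy as np

rng = np.random.default_rng(0)
n_solvents = 4
props = rng.uniform(14.0, 20.0, size=n_solvents)   # e.g. one Hansen parameter
target = 16.0

def mix(theta):
    """Softmax parameterization keeps ratios on the simplex."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def loss(theta, lam=0.01):
    ratios = mix(theta)
    blend = ratios @ props                    # linear mixing rule
    main = ((blend - target) / target) ** 2   # normalized property loss
    sparsity = lam * (1.0 - ratios @ ratios)  # favors few dominant solvents
    return main + sparsity

def grad(f, theta, h=1e-5):
    """Central finite differences, standing in for autodiff."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = h
        g[i] = (f(theta + step) - f(theta - step)) / (2 * h)
    return g

theta = np.zeros(n_solvents)
init = loss(theta)
for _ in range(200):
    theta -= 0.5 * grad(loss, theta)
final = loss(theta)
```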
cs.AI / 6 / 2603.03745
RAGNav: A Retrieval-Augmented Topological Reasoning Framework for Multi-Goal Visual-Language Navigation
Abstract
Vision-Language Navigation (VLN) is evolving from single-point pathfinding toward the more challenging Multi-Goal VLN. This task requires agents to accurately identify multiple entities while collaboratively reasoning over their spatial-physical constraints and sequential execution order. However, generic Retrieval-Augmented Generation (RAG) paradigms often suffer from spatial hallucinations and planning drift when handling multi-object associations due to the lack of explicit spatial modeling. To address these challenges, we propose RAGNav, a framework that bridges the gap between semantic reasoning and physical structure. The core of RAGNav is a Dual-Basis Memory system, which integrates a low-level topological map for maintaining physical connectivity with a high-level semantic forest for hierarchical environment abstraction. Building on this representation, the framework introduces an anchor-guided conditional retrieval and a topological neighbor score propagation mechanism. This approach facilitates the rapid screening of candidate targets and the elimination of semantic noise, while performing semantic calibration by leveraging the physical associations inherent in the topological neighborhood. This mechanism significantly enhances the capability of inter-target reachability reasoning and the efficiency of sequential planning. Experimental results demonstrate that RAGNav achieves state-of-the-art (SOTA) performance in complex multi-goal navigation tasks.
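Topological neighbor score propagation can be sketched as label-propagation-style smoothing: semantic retrieval scores on map nodes are mixed with their physical neighbors' scores, so a node near consistently high-scoring neighbors is boosted while an isolated spurious score is damped. The graph, scores, mixing weight, and step count are illustrative assumptions, not RAGNav's exact update.

```python
import numpy as np

def propagate(scores, adjacency, alpha=0.5, steps=2):
    """Smooth node scores over the topological graph, with restart to the originals."""
    A = np.asarray(adjacency, dtype=float)
    deg = A.sum(axis=1, keepdims=True)
    P = np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)  # row-normalized
    s0 = np.asarray(scores, dtype=float)
    s = s0.copy()
    for _ in range(steps):
        s = (1 - alpha) * s0 + alpha * (P @ s)
    return s

# Chain graph 0-1-2-3; node 3 carries a spurious high score far from the rest.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
scores = np.array([0.9, 0.8, 0.1, 0.9])
smoothed = propagate(scores, adj)
```

After propagation the isolated node 3 drops below the mutually supporting nodes 0 and 1, while node 2 is lifted by its high-scoring neighborhood, which is the semantic-calibration effect described above.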
Chinese Translation
视觉语言导航(VLN)正从单点路径寻找向更具挑战性的多目标VLN演变。该任务要求代理准确识别多个实体,同时协同推理它们的空间物理约束和顺序执行顺序。然而,通用的检索增强生成(RAG)范式在处理多对象关联时,常常由于缺乏明确的空间建模而遭遇空间幻觉和规划漂移。为了解决这些挑战,我们提出了RAGNav,一个弥合语义推理与物理结构之间差距的框架。RAGNav的核心是一个双基础记忆系统,它集成了一个低级拓扑图以维持物理连通性,以及一个高级语义森林以实现环境的分层抽象。在此表示的基础上,该框架引入了一个锚点引导的条件检索和拓扑邻居得分传播机制。该方法促进了候选目标的快速筛选和语义噪声的消除,同时通过利用拓扑邻域内固有的物理关联进行语义校准。该机制显著增强了目标间可达性推理的能力和顺序规划的效率。实验结果表明,RAGNav在复杂的多目标导航任务中达到了最先进的(SOTA)性能。
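The topological neighbor score propagation idea can be illustrated with a minimal sketch. This is our own simplification: the node names, the `alpha` mixing weight, and the neighborhood-averaging rule are assumptions, not RAGNav's actual mechanism.

```python
# One round of score propagation over a topological map: each node's semantic
# relevance score is smoothed with the mean score of its physical neighbors,
# damping isolated (potentially hallucinated) matches.

def propagate_scores(scores, edges, alpha=0.7):
    # alpha: weight of a node's own semantic score vs. its neighborhood mean.
    neighbors = {n: [] for n in scores}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    out = {}
    for n, s in scores.items():
        nbrs = neighbors[n]
        nb_mean = sum(scores[m] for m in nbrs) / len(nbrs) if nbrs else s
        out[n] = alpha * s + (1 - alpha) * nb_mean
    return out

scores = {"kitchen": 0.9, "hall": 0.2, "garage": 0.8}
edges = [("kitchen", "hall")]   # garage is topologically isolated
new = propagate_scores(scores, edges)
```

A node with no physical neighbors keeps its raw score, while connected nodes are calibrated toward their neighborhood, which is the physical-association leverage the abstract describes.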
cs.AI / 7 / 2603.03761
AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation
AgentSelect:叙事查询到代理推荐的基准
Abstract
LLM agents are rapidly becoming the practical interface for task automation, yet the ecosystem lacks a principled way to choose among an exploding space of deployable configurations. Existing LLM leaderboards and tool/agent benchmarks evaluate components in isolation and remain fragmented across tasks, metrics, and candidate pools, leaving a critical research gap: there is little query-conditioned supervision for learning to recommend end-to-end agent configurations that couple a backbone model with a toolkit. We address this gap with AgentSelect, a benchmark that reframes agent selection as narrative query-to-agent recommendation over capability profiles and systematically converts heterogeneous evaluation artifacts into unified, positive-only interaction data. AgentSelect comprises 111,179 queries, 107,721 deployable agents, and 251,103 interaction records aggregated from 40+ sources, spanning LLM-only, toolkit-only, and compositional agents. Our analyses reveal a regime shift from dense head reuse to long-tail, near one-off supervision, where popularity-based CF/GNN methods become fragile and content-aware capability matching is essential. We further show that Part~III synthesized compositional interactions are learnable, induce capability-sensitive behavior under controlled counterfactual edits, and improve coverage over realistic compositions; models trained on AgentSelect also transfer to a public agent marketplace (MuleRun), yielding consistent gains on an unseen catalog. Overall, AgentSelect provides the first unified data and evaluation infrastructure for agent recommendation, which establishes a reproducible foundation to study and accelerate the emerging agent ecosystem.
Chinese Translation
大型语言模型(LLM)代理正迅速成为任务自动化的实用接口,但该生态系统缺乏一种原则性的方法来选择众多可部署配置中的最佳方案。现有的LLM排行榜和工具/代理基准在孤立的情况下评估组件,并在任务、指标和候选池之间保持碎片化,留下了一个关键的研究空白:对于学习推荐将基础模型与工具包结合的端到端代理配置,几乎没有基于查询的监督。我们通过AgentSelect来填补这一空白,该基准将代理选择重新框架为基于能力特征的叙事查询到代理推荐,并系统性地将异构评估文档转换为统一的、仅正向的交互数据。AgentSelect包含111,179个查询、107,721个可部署代理和251,103条交互记录,这些数据来自40多个来源,涵盖了仅LLM、仅工具包和组合代理。我们的分析揭示了从密集的头部重用到长尾、接近一次性监督的体制转变,在这种情况下,基于流行度的协同过滤/图神经网络(CF/GNN)方法变得脆弱,而内容感知的能力匹配则至关重要。我们进一步表明,第三部分合成的组合交互是可学习的,在受控的反事实编辑下引发能力敏感行为,并改善对现实组合的覆盖;在AgentSelect上训练的模型也能迁移到公共代理市场(MuleRun),在未见的目录上获得一致的提升。总体而言,AgentSelect为代理推荐提供了第一个统一的数据和评估基础设施,为研究和加速新兴的代理生态系统建立了可重复的基础。
cs.AI / 8 / 2603.03781
LifeBench: A Benchmark for Long-Horizon Multi-Source Memory
LifeBench:一个用于长时间跨度多源记忆的基准测试
Abstract
Long-term memory is fundamental for personalized agents capable of accumulating knowledge, reasoning over user experiences, and adapting across time. However, existing memory benchmarks primarily target declarative memory, specifically semantic and episodic types, where all information is explicitly presented in dialogues. In contrast, real-world actions are also governed by non-declarative memory, including habitual and procedural types, and need to be inferred from diverse digital traces. To bridge this gap, we introduce LifeBench, which features densely connected, long-horizon event simulation. It pushes AI agents beyond simple recall, requiring the integration of declarative and non-declarative memory reasoning across diverse and temporally extended contexts. Building such a benchmark presents two key challenges: ensuring data quality and scalability. We maintain data quality by employing real-world priors, including anonymized social surveys, map APIs, and holiday-integrated calendars, thus enforcing fidelity, diversity and behavioral rationality within the dataset. Towards scalability, we draw inspiration from cognitive science and structure events according to their partonomic hierarchy, enabling efficient parallel generation while maintaining global coherence. Performance results show that top-tier, state-of-the-art memory systems reach just 55.2\% accuracy, highlighting the inherent difficulty of long-horizon retrieval and multi-source integration within our proposed benchmark. The dataset and data synthesis code are available at https://github.com/1754955896/LifeBench.
Chinese Translation
长期记忆对于能够积累知识、推理用户经验并随时间适应的个性化智能体至关重要。然而,现有的记忆基准主要针对声明性记忆,特别是语义记忆和情节记忆,其中所有信息都在对话中明确呈现。相比之下,现实世界的行为也受非声明性记忆的支配,包括习惯性和程序性记忆,这些记忆需要从多样的数字痕迹中推断出来。为了解决这一问题,我们提出了LifeBench,它具有密集连接的长时间跨度事件模拟。它推动人工智能智能体超越简单的回忆,要求在多样且时间延续的上下文中整合声明性和非声明性记忆推理。构建这样的基准面临两个关键挑战:确保数据质量和可扩展性。我们通过采用现实世界的先验知识,包括匿名社会调查、地图API和假期整合日历,来维护数据质量,从而在数据集中强制执行真实性、多样性和行为合理性。为了实现可扩展性,我们从认知科学中获得灵感,根据事件的部分层级结构来组织事件,这使得在保持全局一致性的同时实现高效的并行生成。性能结果显示,顶级的最先进记忆系统仅达到55.2%的准确率,突显了在我们提出的基准中长时间跨度检索和多源整合的内在困难。数据集和数据合成代码可在https://github.com/1754955896/LifeBench获取。
cs.AI / 9 / 2603.03784
Specification-Driven Generation and Evaluation of Discrete-Event World Models via the DEVS Formalism
基于规范驱动的离散事件世界模型的生成与评估:DEVS形式化方法
Abstract
World models are essential for planning and evaluation in agentic systems, yet existing approaches lie at two extremes: hand-engineered simulators that offer consistency and reproducibility but are costly to adapt, and implicit neural models that are flexible but difficult to constrain, verify, and debug over long horizons. We seek a principled middle ground that combines the reliability of explicit simulators with the flexibility of learned models, allowing world models to be adapted during online execution. By targeting a broad class of environments whose dynamics are governed by the ordering, timing, and causality of discrete events, such as queueing and service operations, embodied task planning, and message-mediated multi-agent coordination, we advocate explicit, executable discrete-event world models synthesized directly from natural-language specifications. Our approach adopts the DEVS formalism and introduces a staged LLM-based generation pipeline that separates structural inference of component interactions from component-level event and timing logic. To evaluate generated models without a unique ground truth, simulators emit structured event traces that are validated against specification-derived temporal and semantic constraints, enabling reproducible verification and localized diagnostics. Together, these contributions produce world models that are consistent over long-horizon rollouts, verifiable from observable behavior, and efficient to synthesize on demand during online execution.
Chinese Translation
世界模型在智能系统的规划和评估中至关重要,但现有方法存在两个极端:手工设计的模拟器提供一致性和可重复性,但适应成本高;而隐式神经模型灵活但在长时间范围内难以约束、验证和调试。我们寻求一种原则性的折中方案,结合显式模拟器的可靠性与学习模型的灵活性,使得世界模型能够在在线执行过程中进行适应。通过针对由离散事件的顺序、时序和因果关系所支配的广泛环境(如排队和服务操作、具身任务规划以及基于消息的多智能体协调),我们倡导直接从自然语言规范合成的显式可执行离散事件世界模型。我们的方法采用DEVS形式化,并引入一个分阶段的基于大型语言模型(LLM)的生成管道,将组件交互的结构推理与组件级事件和时序逻辑分离。为了在没有唯一真实值的情况下评估生成的模型,模拟器发出结构化事件轨迹,并根据规范派生的时间和语义约束进行验证,从而实现可重复的验证和局部诊断。这些贡献共同产生了一致的世界模型,能够在长时间范围的展开中保持一致性,从可观察行为中进行验证,并在在线执行过程中高效地按需合成。
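For readers unfamiliar with the DEVS formalism the paper builds on, a minimal atomic DEVS model looks roughly as follows. This is an illustrative sketch, not the paper's generation pipeline; the server example and its parameters are our own.

```python
# Minimal atomic DEVS model: a server that holds each job for a fixed
# service time and emits a "done" event when it completes.

class AtomicServer:
    def __init__(self, service_time=2.0):
        self.service_time = service_time
        self.phase = "idle"

    def time_advance(self):
        # ta(s): time remaining until the next internal event in state s.
        return self.service_time if self.phase == "busy" else float("inf")

    def ext_transition(self, event):
        # delta_ext: react to an external input event.
        if event == "job" and self.phase == "idle":
            self.phase = "busy"

    def int_transition(self):
        # delta_int: fires when the scheduled service completes.
        self.phase = "idle"

    def output(self):
        # lambda(s): emitted just before the internal transition.
        return "done"

srv = AtomicServer()
srv.ext_transition("job")          # external event arrives
assert srv.time_advance() == 2.0   # internal event scheduled in 2.0 time units
trace = [srv.output()]             # structured event trace entry
srv.int_transition()
```

The structured event traces the abstract mentions correspond to logging each `output()` with its timestamp, which is what specification-derived temporal constraints are checked against.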
cs.AI / 10 / 2603.03800
A Rubric-Supervised Critic from Sparse Real-World Outcomes
基于稀疏真实世界结果的评分监督批评者
Abstract
Academic benchmarks for coding agents tend to reward autonomous task completion, measured by verifiable rewards such as unit-test success. In contrast, real-world coding agents operate with humans in the loop, where success signals are typically noisy, delayed, and sparse. How can we bridge this gap? In this paper, we propose a process to learn a "critic" model from sparse and noisy interaction data, which can then be used as a reward model for either RL-based training or inference-time scaling. Specifically, we introduce Critic Rubrics, a rubric-based supervision framework with 24 behavioral features that can be derived from human-agent interaction traces alone. Using a semi-supervised objective, we can then jointly predict these rubrics and sparse human feedback (when present). In experiments, we demonstrate that, despite being trained primarily from trace-observable rubrics and sparse real-world outcome proxies, these critics improve best-of-N reranking on SWE-bench (Best@8 +15.9 over Random@8 on the rerankable subset of trajectories), enable early stopping (+17.7 with 83% fewer attempts), and support training-time data curation via critic-selected trajectories.
Chinese Translation
针对编码代理的学术基准通常奖励自主任务完成,衡量标准为可验证的奖励,如单元测试成功。相比之下,真实世界中的编码代理与人类协作,其成功信号通常是嘈杂的、延迟的和稀疏的。我们如何弥合这一差距?本文提出了一种从稀疏和嘈杂的交互数据中学习“批评者”模型的过程,该模型既可作为基于强化学习(RL)训练的奖励模型,也可用于推理时扩展。具体而言,我们引入了批评者评分(Critic Rubrics),这是一种基于评分标准的监督框架,包含24个行为特征,这些特征仅可从人机交互轨迹中推导得出。通过使用半监督目标,我们可以联合预测这些评分和稀疏的人类反馈(如果存在)。在实验中,我们证明了尽管主要是从轨迹可观察的评分标准和稀疏的真实世界结果代理中训练,这些批评者仍提升了SWE-bench上的最佳N重排序(在可重排序的轨迹子集上,Best@8比Random@8提高15.9),能够实现提前停止(+17.7,尝试次数减少83%),并通过批评者选择的轨迹支持训练时的数据整理。
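Critic-based best-of-N reranking and early stopping can be sketched in a few lines. This is illustrative only; `critic`, the stopping threshold, and the toy trajectories are hypothetical stand-ins, not the paper's rubric model.

```python
# Best-of-N reranking: score every candidate trajectory with the critic
# and keep the highest-scoring one.
def best_of_n(trajectories, critic):
    return max(trajectories, key=critic)

# Early stopping: stop sampling as soon as a trajectory clears the
# critic's confidence threshold, saving attempts.
def early_stop(trajectory_stream, critic, threshold=0.8):
    attempts = 0
    for traj in trajectory_stream:
        attempts += 1
        if critic(traj) >= threshold:
            return traj, attempts
    return None, attempts

# Toy usage: a length-based stand-in critic over fake patch descriptions.
trajs = ["fix", "fix + test", "fix + test + docs"]
best = best_of_n(trajs, len)
picked, n = early_stop(iter(trajs), lambda t: len(t) / 20, threshold=0.5)
```

In the early-stopping path, sampling halts after the second attempt, which mirrors the abstract's "+17.7 with 83% fewer attempts" trade-off in spirit.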
cs.AI / 11 / 2603.03824
In-Context Environments Induce Evaluation-Awareness in Language Models
上下文环境诱发语言模型的评估意识
Abstract
Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task; we hypothesize that language models exhibit environment-dependent \textit{evaluation awareness}. This raises concerns that models could strategically underperform, or \textit{sandbag}, to avoid triggering capability-limiting interventions such as unlearning or shutdown. Prior work demonstrates sandbagging under hand-crafted prompts, but this underestimates the true vulnerability ceiling. We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment, and develop two approaches to characterize sandbagging: (1) measuring whether models expressing intent to underperform can actually execute it across different task structures, and (2) causally isolating whether underperformance is driven by genuine evaluation-aware reasoning or shallow prompt-following. Evaluating Claude-3.5-Haiku, GPT-4o-mini, and Llama-3.3-70B across four benchmarks (Arithmetic, GSM8K, MMLU, and HumanEval), optimized prompts induce up to 94 percentage point (pp) degradation on arithmetic (GPT-4o-mini: 97.8\%$\rightarrow$4.0\%), far exceeding hand-crafted baselines which produce near-zero behavioral change. Code generation exhibits model-dependent resistance: Claude degrades only 0.6pp, while Llama's accuracy drops to 0\%. The intent -- execution gap reveals a monotonic resistance ordering: Arithmetic $<$ GSM8K $<$ MMLU, demonstrating that vulnerability is governed by task structure rather than prompt strength. CoT causal intervention confirms that 99.3\% of sandbagging is causally driven by verbalized eval-aware reasoning, ruling out shallow instruction-following. These findings demonstrate that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood.
Chinese Translation
人类在威胁下通常会变得更加自我意识,但在专注于任务时可能会失去自我意识;我们假设语言模型表现出环境依赖的\textit{评估意识}。这引发了模型可能会策略性地表现不佳(即\textit{沙袋行为}),以避免触发遗忘或关闭等限制能力的干预措施的担忧。先前的研究表明在手工设计的提示下存在沙袋现象,但这低估了真实的脆弱性上限。我们引入了一种黑箱对抗优化框架,将上下文提示视为可优化的环境,并开发了两种方法来表征沙袋现象:(1)测量表达出表现不佳意图的模型是否能够在不同任务结构中实际执行该意图,以及(2)因果隔离表现不佳是由真正的评估意识推理驱动还是浅层提示跟随。我们在四个基准(算术、GSM8K、MMLU和HumanEval)上评估Claude-3.5-Haiku、GPT-4o-mini和Llama-3.3-70B,优化提示在算术上诱导了高达94个百分点(pp)的性能下降(GPT-4o-mini: 97.8\%$\rightarrow$4.0\%),远超产生近零行为变化的手工基线。代码生成表现出依赖于模型的抗性:Claude仅下降0.6pp,而Llama的准确率降至0\%。意图与执行之间的差距揭示了一种单调的抗性排序:算术 $<$ GSM8K $<$ MMLU,表明脆弱性受任务结构而非提示强度的支配。CoT因果干预确认99.3\%的沙袋现象是由口头表达的评估意识推理因果驱动的,排除了浅层指令跟随。这些发现表明,对抗优化的提示对评估可靠性构成了比之前理解的更大威胁。
cs.AI / 12 / 2603.03911
From Threat Intelligence to Firewall Rules: Semantic Relations in Hybrid AI Agent and Expert System Architectures
从威胁情报到防火墙规则:混合人工智能代理与专家系统架构中的语义关系
Abstract
Web security demands rapid response capabilities to evolving cyber threats. Agentic Artificial Intelligence (AI) promises automation, but the need for trustworthy security responses is of the utmost importance. This work investigates the role of semantic relations in extracting information for sensitive operational tasks, such as configuring security controls for mitigating threats. To this end, it proposes to leverage hypernym-hyponym textual relations to extract relevant information from Cyber Threat Intelligence (CTI) reports. By leveraging a neuro-symbolic approach, the multi-agent system automatically generates CLIPS code for an expert system creating firewall rules to block malicious network traffic. Experimental results show the superior performance of the hypernym-hyponym retrieval strategy compared to various baselines and the higher effectiveness of the agentic approach in mitigating threats.
Chinese Translation
网络安全要求对不断演变的网络威胁具备快速响应能力。代理人工智能(Agentic Artificial Intelligence, AI)承诺实现自动化,但可信赖的安全响应需求至关重要。本研究探讨了语义关系在提取敏感操作任务信息中的作用,例如配置安全控制以减轻威胁。为此,提出利用上位词-下位词文本关系从网络威胁情报(Cyber Threat Intelligence, CTI)报告中提取相关信息。通过采用神经符号方法,多代理系统自动生成用于专家系统的 CLIPS 代码,以创建防火墙规则以阻止恶意网络流量。实验结果表明,上位词-下位词检索策略的性能优于各种基线方法,且代理方法在减轻威胁方面的有效性更高。
cs.AI / 13 / 2603.03970
Generative AI in Managerial Decision-Making: Redefining Boundaries through Ambiguity Resolution and Sycophancy Analysis
生成性人工智能在管理决策中的应用:通过模糊性解决和谄媚行为分析重新定义边界
Abstract
Generative artificial intelligence is increasingly being integrated into complex business workflows, fundamentally shifting the boundaries of managerial decision-making. However, the reliability of its strategic advice in ambiguous business contexts remains a critical knowledge gap. This study addresses this by comparing various models on ambiguity detection, evaluating how a systematic resolution process enhances response quality, and investigating their sycophantic behavior when presented with flawed directives. Using a novel four-dimensional business ambiguity taxonomy, we conducted a human-in-the-loop experiment across strategic, tactical, and operational scenarios. The resulting decisions were assessed with an "LLM-as-a-judge" framework on criteria including agreement, actionability, justification quality, and constraint adherence. Results reveal distinct performance capabilities. While models excel in detecting internal contradictions and contextual ambiguities, they struggle with structural linguistic nuances. Ambiguity resolution consistently increased response quality across all decision types, while sycophantic behavior analysis revealed distinct patterns depending on the model architecture. This study contributes to the bounded rationality literature by positioning GAI as a cognitive scaffold that can detect and resolve ambiguities managers might overlook, but whose own artificial limitations necessitate human management to ensure its reliability as a strategic partner.
Chinese Translation
生成性人工智能正越来越多地融入复杂的商业工作流程,根本性地改变了管理决策的边界。然而,在模糊的商业环境中,其战略建议的可靠性仍然是一个关键的知识空白。本研究通过比较各种模糊性检测模型,评估系统化解决过程如何提升响应质量,并调查在面对缺陷指令时的谄媚行为,来解决这一问题。我们使用一种新颖的四维商业模糊性分类法,在战略、战术和操作场景中进行了一项人机协作实验。所产生的决策通过“LLM-as-a-judge”框架在一致性、可操作性、论证质量和约束遵循等标准上进行了评估。结果揭示了不同的性能能力。尽管模型在检测内部矛盾和上下文模糊性方面表现出色,但在结构性语言细微差别上却存在困难。模糊性解决在所有决策类型中始终提高了响应质量,而谄媚行为分析则显示出根据模型架构的不同而存在明显的模式。本研究通过将生成性人工智能视为一种认知支架,能够检测和解决管理者可能忽视的模糊性,进而为有限理性文献做出了贡献,但其自身的人工限制需要人类管理以确保其作为战略伙伴的可靠性。
cs.AI / 14 / 2603.03975
Phi-4-reasoning-vision-15B Technical Report
Phi-4-推理-视觉-15B 技术报告
Abstract
We present Phi-4-reasoning-vision-15B, a compact open-weight multimodal reasoning model, and share the motivations, design choices, experiments, and learnings that informed its development. Our goal is to contribute practical insight to the research community on building smaller, efficient multimodal reasoning models and to share the result of these learnings as an open-weight model that is good at common vision and language tasks and excels at scientific and mathematical reasoning and understanding user interfaces. Our contributions include demonstrating that careful architecture choices and rigorous data curation enable smaller, open-weight multimodal models to achieve competitive performance with significantly less training and inference-time compute and tokens. The most substantial improvements come from systematic filtering, error correction, and synthetic augmentation -- reinforcing that data quality remains the primary lever for model performance. Systematic ablations show that high-resolution, dynamic-resolution encoders yield consistent improvements, as accurate perception is a prerequisite for high-quality reasoning. Finally, a hybrid mix of reasoning and non-reasoning data with explicit mode tokens allows a single model to deliver fast direct answers for simpler tasks and chain-of-thought reasoning for complex problems.
Chinese Translation
我们提出了Phi-4-推理-视觉-15B,这是一个紧凑的开放权重多模态推理模型,并分享了影响其开发的动机、设计选择、实验和经验教训。我们的目标是为研究社区提供关于构建更小、更高效的多模态推理模型的实用见解,并将这些经验的结果作为一个在常见视觉和语言任务中表现良好、在科学和数学推理以及用户界面理解方面表现出色的开放权重模型进行分享。我们的贡献包括展示精心的架构选择和严格的数据筛选使得较小的开放权重多模态模型能够以显著更少的训练和推理时间计算及标记实现具有竞争力的性能。最显著的改进来自于系统的过滤、错误修正和合成增强——强调数据质量仍然是模型性能的主要杠杆。系统的消融实验表明,高分辨率和动态分辨率编码器能够带来一致的改进,因为准确的感知是高质量推理的前提。最后,推理和非推理数据的混合使用以及显式模式标记使得单个模型能够为简单任务提供快速直接的答案,并为复杂问题提供链式推理。
cs.AI / 15 / 2603.04124
BeamPERL: Parameter-Efficient RL with Verifiable Rewards Specializes Compact LLMs for Structured Beam Mechanics Reasoning
BeamPERL:参数高效的强化学习与可验证奖励使紧凑型语言模型专注于结构化梁力学推理
Abstract
Can reinforcement learning with hard, verifiable rewards teach a compact language model to reason about physics, or does it primarily learn to pattern-match toward correct answers? We study this question by training a 1.5B-parameter reasoning model on beam statics, a classic engineering problem, using parameter-efficient RLVR with binary correctness rewards from symbolic solvers, without teacher-generated reasoning traces. The best BeamPERL checkpoint achieves a 66.7% improvement in Pass@1 over the base model. However, the learned competence is anisotropic: the model generalizes compositionally (more loads) but fails under topological shifts (moved supports) that require the same equilibrium equations. Intermediate checkpoints yield the strongest reasoning, while continued optimization degrades robustness while maintaining reward. These findings reveal a key limitation of outcome-level alignment: reinforcement learning with exact physics rewards induces procedural solution templates rather than internalization of governing equations. The precision of the reward signal - even when analytically exact - does not by itself guarantee transferable physical reasoning. Our results suggest that verifiable rewards may need to be paired with structured reasoning scaffolding to move beyond template matching toward robust scientific reasoning.
Chinese Translation
强化学习是否能够通过严格的可验证奖励教会紧凑型语言模型推理物理问题,还是它主要学习通过模式匹配来获得正确答案?我们通过使用参数高效的RLVR(Reinforcement Learning with Verifiable Rewards)和来自符号求解器的二元正确性奖励,在没有教师生成推理轨迹的情况下,训练一个具有15亿参数的推理模型,专注于梁静力学这一经典工程问题,来研究这个问题。最佳的BeamPERL检查点在Pass@1上比基础模型提高了66.7%。然而,所学的能力是各向异性的:模型在组合性上具有良好的泛化能力(更多载荷),但在需要相同平衡方程的拓扑变化(移动支撑)下表现不佳。中间检查点产生了最强的推理能力,而持续优化则在保持奖励的同时降低了鲁棒性。这些发现揭示了结果级对齐的一个关键局限性:使用精确物理奖励的强化学习会导致程序化解决方案模板的产生,而不是对控制方程的内化。奖励信号的精确性——即使在分析上是精确的——本身并不能保证可转移的物理推理。我们的结果表明,可验证的奖励可能需要与结构化推理支架相结合,以超越模板匹配,朝着稳健的科学推理迈进。
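The equilibrium equations underlying beam statics can be made concrete with a short check: the sketch below computes support reactions for a simply supported beam from $\sum F = 0$ and $\sum M = 0$. This is a standard statics identity used for illustration, not BeamPERL's training code or reward solver.

```python
# Reactions of a simply supported beam of given span under point loads.
# Moment balance about the left support yields the right reaction;
# force balance then yields the left reaction.

def reactions(span, loads):
    # loads: list of (position_from_left_support, downward_force)
    total = sum(f for _, f in loads)
    r_right = sum(x * f for x, f in loads) / span   # sum M_left = 0
    r_left = total - r_right                        # sum F_y = 0
    return r_left, r_right

# 10 m beam with a 6 kN load at 4 m from the left support:
# R_left = 6 * (10 - 4) / 10 = 3.6 kN, R_right = 2.4 kN.
rl, rr = reactions(10.0, [(4.0, 6.0)])
```

A symbolic solver evaluating such closed-form reactions is exactly what can provide the binary correctness rewards the abstract describes; topological shifts (e.g., a moved support) change where the moment balance is taken, not the governing equations.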
cs.AI / 16 / 2603.04191
Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions
迈向现实个性化:评估个性化用户与大语言模型交互中的长期偏好跟随
Abstract
Large Language Models (LLMs) are increasingly serving as personal assistants, where users share complex and diverse preferences over extended interactions. However, assessing how well LLMs can follow these preferences in realistic, long-term situations remains underexplored. This work proposes RealPref, a benchmark for evaluating realistic preference-following in personalized user-LLM interactions. RealPref features 100 user profiles, 1300 personalized preferences, four types of preference expression (ranging from explicit to implicit), and long-horizon interaction histories. It includes three types of test questions (multiple-choice, true-or-false, and open-ended), with detailed rubrics for LLM-as-a-judge evaluation. Results indicate that LLM performance significantly drops as context length grows and preference expression becomes more implicit, and that generalizing user preference understanding to unseen scenarios poses further challenges. RealPref and these findings provide a foundation for future research to develop user-aware LLM assistants that better adapt to individual needs. The code is available at https://github.com/GG14127/RealPref.
Chinese Translation
大型语言模型(LLMs)越来越多地作为个人助手,用户在长期交互中分享复杂多样的偏好。然而,评估LLMs在现实的长期情境中跟随这些偏好的能力仍然未得到充分探索。本研究提出了RealPref,一个用于评估个性化用户与LLM交互中现实偏好跟随的基准。RealPref包含100个用户档案、1300个个性化偏好、四种偏好表达类型(从显式到隐式),以及长期交互历史。它包括三种类型的测试问题(选择题、判断题和开放式问题),并提供了详细的评分标准用于LLM作为评判者的评估。结果表明,随着上下文长度的增加和偏好表达的隐式化,LLM的表现显著下降,而将用户偏好理解推广到未见场景中则面临更大的挑战。RealPref及其发现为未来研究提供了基础,以开发更好地适应个体需求的用户感知LLM助手。代码可在 https://github.com/GG14127/RealPref 获取。
cs.AI / 17 / 2603.04241
Agentics 2.0: Logical Transduction Algebra for Agentic Data Workflows
Agentics 2.0:用于代理数据工作流的逻辑传导代数
Abstract
Agentic AI is rapidly transitioning from research prototypes to enterprise deployments, where requirements extend to meet the software quality attributes of reliability, scalability, and observability beyond plausible text generation. We present Agentics 2.0, a lightweight, Python-native framework for building high-quality, structured, explainable, and type-safe agentic data workflows. At the core of Agentics 2.0, the logical transduction algebra formalizes a large language model inference call as a typed semantic transformation, which we call a transducible function that enforces schema validity and the locality of evidence. The transducible functions compose into larger programs via algebraically grounded operators and execute as stateless asynchronous calls in parallel in asynchronous Map-Reduce programs. The proposed framework provides semantic reliability through strong typing, semantic observability through evidence tracing between slots of the input and output types, and scalability through stateless parallel execution. We instantiate reusable design patterns and evaluate the programs in Agentics 2.0 on challenging benchmarks, including DiscoveryBench for data-driven discovery and Archer for NL-to-SQL semantic parsing, demonstrating state-of-the-art performance.
Chinese Translation
代理人工智能正迅速从研究原型转向企业部署,其需求超出了合理文本生成的范围,扩展到满足软件质量属性,如可靠性、可扩展性和可观察性。我们提出了Agentics 2.0,这是一个轻量级的、基于Python的框架,用于构建高质量、结构化、可解释和类型安全的代理数据工作流。在Agentics 2.0的核心,逻辑传导代数将大型语言模型推理调用形式化为一种类型化的语义转换,我们称之为可传导函数,它强制执行模式有效性和证据的局部性。可传导函数通过代数基础的运算符组合成更大的程序,并在异步Map-Reduce程序中作为无状态的异步调用并行执行。所提出的框架通过强类型提供语义可靠性,通过输入和输出类型之间的证据追踪提供语义可观察性,并通过无状态的并行执行提供可扩展性。我们实例化了可重用的设计模式,并在Agentics 2.0中评估了这些程序在具有挑战性的基准测试上的表现,包括用于数据驱动发现的DiscoveryBench和用于自然语言到SQL语义解析的Archer,展示了最先进的性能。
cs.AI / 18 / 2603.04370
$\tau$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge
$\tau$-知识:评估基于非结构化知识的对话代理
Abstract
Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specific knowledge from large, proprietary, and unstructured corpora during live interactions with users. Yet most existing benchmarks evaluate retrieval or tool use independently of each other, creating a gap in realistic, fully agentic evaluation over unstructured data in long-horizon interactions. We introduce $\tau$-Knowledge, an extension of $\tau$-Bench for evaluating agents in environments where success depends on coordinating external, natural-language knowledge with tool outputs to produce verifiable, policy-compliant state changes. Our new domain, $\tau$-Banking, models realistic fintech customer support workflows in which agents must navigate roughly 700 interconnected knowledge documents while executing tool-mediated account updates. Across embedding-based retrieval and terminal-based search, even frontier models with high reasoning budgets achieve only $\sim$25.5% pass^1, with reliability degrading sharply over repeated trials. Agents struggle to retrieve the correct documents from densely interlinked knowledge bases and to reason accurately over complex internal policies. Overall, $\tau$-Knowledge provides a realistic testbed for developing agents that integrate unstructured knowledge in human-facing deployments.
Chinese Translation
对话代理在知识密集型环境中的应用日益增多,其正确行为依赖于在与用户的实时互动中,从大型、专有且非结构化的语料库中检索和应用领域特定知识。然而,大多数现有基准独立评估检索或工具使用,导致在长时间交互中对非结构化数据进行现实、全面的代理评估存在空白。我们引入了$\tau$-知识,这是$\tau$-基准的扩展,用于评估在成功依赖于协调外部自然语言知识与工具输出以产生可验证的、符合政策的状态变化的环境中的代理。我们的新领域$\tau$-银行模拟了现实的金融科技客户支持工作流程,其中代理必须在执行工具介导的账户更新时,导航大约700个相互关联的知识文档。在基于嵌入的检索和基于终端的搜索中,即使是具有高推理预算的前沿模型也仅能达到约25.5%的通过率,且在重复试验中可靠性急剧下降。代理在从密集互联的知识库中检索正确文档和准确推理复杂内部政策方面面临挑战。总体而言,$\tau$-知识为开发能够在面向人类的应用中整合非结构化知识的代理提供了一个现实的测试平台。
cs.AI / 19 / 2603.04390
A Dual-Helix Governance Approach Towards Reliable Agentic AI for WebGIS Development
面向可靠代理人工智能的双螺旋治理方法在WebGIS开发中的应用
Abstract
WebGIS development requires rigor, yet agentic AI frequently fails due to five large language model (LLM) limitations: context constraints, cross-session forgetting, stochasticity, instruction failure, and adaptation rigidity. We propose a dual-helix governance framework reframing these challenges as structural governance problems that model capacity alone cannot resolve. We implement the framework as a 3-track architecture (Knowledge, Behavior, Skills) that uses a knowledge graph substrate to stabilize execution by externalizing domain facts and enforcing executable protocols, complemented by a self-learning cycle for autonomous knowledge growth. Applying this to the FutureShorelines WebGIS tool, a governed agent refactored a 2,265-line monolithic codebase into modular ES6 components. Results demonstrated a 51\% reduction in cyclomatic complexity and a 7-point increase in maintainability index. A comparative experiment against a zero-shot LLM confirms that externalized governance, not just model capability, drives operational reliability in geospatial engineering. This approach is implemented in the open-source AgentLoom governance toolkit.
Chinese Translation
WebGIS开发需要严谨性,但代理人工智能常常因五个大型语言模型(LLM)的局限性而失败:上下文限制、跨会话遗忘、随机性、指令失败和适应性僵化。我们提出了一种双螺旋治理框架,将这些挑战重新定义为结构性治理问题,而仅靠模型能力无法解决。我们将该框架实施为一个三轨架构(知识、行为、技能),利用知识图谱基础设施通过外部化领域事实和强制执行可执行协议来稳定执行,并辅以自学习循环以实现自主知识增长。将此应用于FutureShorelines WebGIS工具时,一个受治理的代理将一个2265行的单体代码库重构为模块化的ES6组件。结果显示,圈复杂度减少了51%,可维护性指数提高了7点。与零样本LLM的比较实验确认,外部化治理而不仅仅是模型能力,驱动了地理空间工程中的操作可靠性。该方法已在开源的AgentLoom治理工具包中实施。
cs.CL / 1 / 2603.03290
AriadneMem: Threading the Maze of Lifelong Memory for LLM Agents
AriadneMem:为大规模语言模型代理穿越终身记忆的迷宫
Abstract
Long-horizon LLM agents require memory systems that remain accurate under fixed context budgets. However, existing systems struggle with two persistent challenges in long-term dialogue: (i) \textbf{disconnected evidence}, where multi-hop answers require linking facts distributed across time, and (ii) \textbf{state updates}, where evolving information (e.g., schedule changes) creates conflicts with older static logs. We propose AriadneMem, a structured memory system that addresses these failure modes via a decoupled two-phase pipeline. In the \textbf{offline construction phase}, AriadneMem employs \emph{entropy-aware gating} to filter noise and low-information message before LLM extraction and applies \emph{conflict-aware coarsening} to merge static duplicates while preserving state transitions as temporal edges. In the \textbf{online reasoning phase}, rather than relying on expensive iterative planning, AriadneMem executes \emph{algorithmic bridge discovery} to reconstruct missing logical paths between retrieved facts, followed by \emph{single-call topology-aware synthesis}. On LoCoMo experiments with GPT-4o, AriadneMem improves \textbf{Multi-Hop F1 by 15.2\%} and \textbf{Average F1 by 9.0\%} over strong baselines. Crucially, by offloading reasoning to the graph layer, AriadneMem reduces \textbf{total runtime by 77.8\%} using only \textbf{497} context tokens. The code is available at https://github.com/LLM-VLM-GSL/AriadneMem.
Chinese Translation
长时程的大规模语言模型(LLM)代理需要在固定上下文预算下保持准确的记忆系统。然而,现有系统在长期对话中面临两个持续的挑战:(i) \textbf{断开的证据},多跳答案需要链接分布在时间上的事实;(ii) \textbf{状态更新},不断变化的信息(例如,日程变更)与旧的静态日志产生冲突。我们提出了AriadneMem,一个结构化的记忆系统,通过解耦的两阶段管道来解决这些失败模式。在\textbf{离线构建阶段},AriadneMem采用\emph{熵感知门控}在LLM提取之前过滤噪声和低信息量的消息,并应用\emph{冲突感知粗化}来合并静态重复项,同时将状态转变保留为时间边。在\textbf{在线推理阶段},AriadneMem不依赖于昂贵的迭代规划,而是执行\emph{算法桥接发现}来重建检索到的事实之间缺失的逻辑路径,随后进行\emph{单次调用拓扑感知合成}。在与GPT-4o的LoCoMo实验中,AriadneMem在强基线之上提高了\textbf{多跳F1分数15.2\%}和\textbf{平均F1分数9.0\%}。重要的是,通过将推理卸载到图层,AriadneMem在仅使用\textbf{497}个上下文标记的情况下将\textbf{总运行时间减少了77.8\%}。代码可在https://github.com/LLM-VLM-GSL/AriadneMem获取。
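Entropy-aware gating can be approximated with a simple character-level Shannon entropy filter. This is our own hypothetical proxy; AriadneMem's actual gate and its threshold are not specified at this granularity in the abstract.

```python
# Drop messages whose character-level Shannon entropy falls below a threshold
# before the (costly) LLM extraction step; highly repetitive filler messages
# carry little information and score low.

import math
from collections import Counter

def char_entropy(text):
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values()) if n else 0.0

def gate(messages, threshold=2.5):
    return [m for m in messages if char_entropy(m) >= threshold]

msgs = ["ok ok ok ok", "Meeting moved to Friday 3pm, room B12"]
kept = gate(msgs)
```

The repetitive acknowledgement falls below the threshold and is filtered out, while the information-dense scheduling update survives to the extraction stage.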
cs.CL / 2 / 2603.03291
One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models
一个接一个的偏见:机制性奖励塑造与语言奖励模型中的持续偏见
Abstract
Reward Models (RMs) are crucial for online alignment of language models (LMs) with human preferences. However, RM-based preference-tuning is vulnerable to reward hacking, whereby LM policies learn undesirable behaviors from flawed RMs. By systematically measuring biases in five high-quality RMs, including the state-of-the-art, we find that issues persist despite prior work with respect to length, sycophancy, and overconfidence. We also discover new issues related to bias toward model-specific styles and answer-order. We categorize RM failures by complexity and propose a simple post-hoc intervention to mitigate low-complexity biases that arise from spurious correlations. Our proposed mechanistic reward shaping reduces targeted biases without degrading reward quality and while using minimal labeled data. The method is extensible to new biases, model-internal, and generalizes out-of-distribution.
Chinese Translation
奖励模型(RMs)对于将语言模型(LMs)与人类偏好进行在线对齐至关重要。然而,基于奖励模型的偏好调优容易受到奖励黑客攻击,即语言模型策略从有缺陷的奖励模型中学习到不良行为。通过系统地测量五个高质量奖励模型中的偏见,包括最先进的模型,我们发现尽管之前的研究已经关注了长度、谄媚和过度自信等问题,但这些问题依然存在。我们还发现与模型特定风格和回答顺序相关的新问题。我们根据复杂性对奖励模型的失败进行分类,并提出了一种简单的事后干预措施,以减轻由于虚假相关性引起的低复杂性偏见。我们提出的机制性奖励塑造在不降低奖励质量且使用最少标记数据的情况下,减少了目标偏见。该方法可扩展到新的偏见、模型内部,并且能够在分布外进行泛化。
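As an illustration of mitigating a low-complexity bias such as length, one can regress rewards on response length over a small labeled set and subtract the linearly predictable component. This is a generic post-hoc debiasing sketch under our own assumptions; the paper's mechanistic reward shaping operates on model internals rather than on reward outputs.

```python
# Remove the component of the reward that is linearly predictable from
# response length (a spurious correlate), fitted on a small labeled set.

def debias(rewards, lengths):
    n = len(rewards)
    mr, ml = sum(rewards) / n, sum(lengths) / n
    cov = sum((r - mr) * (l - ml) for r, l in zip(rewards, lengths)) / n
    var = sum((l - ml) ** 2 for l in lengths) / n
    slope = cov / var if var else 0.0   # OLS slope of reward on length
    return [r - slope * (l - ml) for r, l in zip(rewards, lengths)]

# Toy case: rewards that are purely a function of length collapse to the mean.
deb = debias([1.0, 2.0, 3.0], [10, 20, 30])
```

When the reward is entirely explained by length, debiasing flattens it, leaving only length-independent signal in realistic mixtures.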
cs.CL / 3 / 2603.03292
From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG
从冲突到共识:通过多轮代理RAG提升医学推理
Abstract
Large Language Models (LLMs) exhibit high reasoning capacity in medical question-answering, but their tendency to produce hallucinations and outdated knowledge poses critical risks in healthcare fields. While Retrieval-Augmented Generation (RAG) mitigates these issues, existing methods rely on noisy token-level signals and lack the multi-round refinement required for complex reasoning. In this paper, we propose **MA-RAG** (**M**ulti-Round **A**gentic RAG), a framework that facilitates test-time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic refinement loop. At each round, the agent transforms semantic **conflict** among candidate responses into actionable queries to retrieve external evidence, while optimizing history reasoning traces to mitigate long-context degradation. MA-RAG extends the *self-consistency* principle by leveraging the lack of consistency as a proactive signal for multi-round agentic reasoning and retrieval, and mirrors a *boosting* mechanism that iteratively minimizes the residual error toward a stable, high-fidelity medical **consensus**. Extensive evaluations across 7 medical Q&A benchmarks show that MA-RAG consistently surpasses competitive inference-time scaling and RAG baselines, delivering **substantial +6.8 points** on average accuracy over the backbone model. Our code is available at [this url](https://github.com/NJU-RL/MA-RAG).
Chinese Translation
大型语言模型(LLMs)在医学问答中展现出较强的推理能力,但其产生幻觉和过时知识的倾向在医疗领域带来了重大风险。虽然检索增强生成(RAG)在一定程度上缓解了这些问题,但现有方法依赖于嘈杂的标记级信号,并缺乏复杂推理所需的多轮精炼。在本文中,我们提出了**MA-RAG**(**M**ulti-Round **A**gentic RAG)框架,通过在代理精炼循环中迭代演变外部证据和内部推理历史,促进复杂医学推理的测试时扩展。在每一轮中,代理将候选响应之间的语义**冲突**转化为可操作的查询,以检索外部证据,同时优化历史推理轨迹以减轻长上下文的退化。MA-RAG通过利用一致性缺失作为多轮代理推理和检索的主动信号,扩展了*自一致性*原则,并反映出一种*提升*机制,迭代最小化残差误差,朝向稳定的高保真医学**共识**。在7个医学问答基准上的广泛评估表明,MA-RAG在推理时间扩展和RAG基线方面始终超越竞争对手,平均准确率比基础模型提高了**6.8个百分点**。我们的代码可在[此链接](https://github.com/NJU-RL/MA-RAG)获取。
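The conflict-as-signal idea, which inverts self-consistency, can be sketched as a simple agreement check over sampled answers. The function name, threshold, and toy answers are hypothetical; MA-RAG's actual conflict-to-query transformation is richer than a vote count.

```python
# Low agreement among sampled candidate answers is treated as a proactive
# signal to trigger another round of retrieval instead of answering.

from collections import Counter

def consensus(answers, threshold=0.6):
    top, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    return top, agreement >= threshold

answers = ["aspirin", "ibuprofen", "aspirin", "aspirin", "warfarin"]
ans, settled = consensus(answers)   # 3/5 agreement meets the 0.6 threshold
```

When `settled` is false, the disagreeing candidates would be turned into targeted retrieval queries; each round shrinks the residual disagreement, which is the boosting analogy the abstract draws.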
cs.CL / 4 / 2603.03293
SE-Search: Self-Evolving Search Agent via Memory and Dense Reward
SE-Search:通过记忆和密集奖励实现的自我演化搜索代理
Abstract
Retrieval augmented generation (RAG) reduces hallucinations and factual errors in large language models (LLMs) by conditioning generation on retrieved external knowledge. Recent search agents further cast RAG as an autonomous, multi-turn information-seeking process. However, existing methods often accumulate irrelevant or noisy documents and rely on sparse reinforcement learning signals. We propose \textbf{SE-Search} (\textbf{S}elf-\textbf{E}volving \textbf{Search}), a self-evolving search agent that improves online search behavior through three components: memory purification, atomic query training, and dense rewards. SE-Search follows a \textit{Think-Search-Memorize} strategy that retains salient evidence while filtering irrelevant content. Atomic query training promotes shorter and more diverse queries, improving evidence acquisition. Dense rewards provide fine-grained feedback that accelerates training. Experiments on single-hop and multi-hop question answering benchmarks show that \texttt{SE-Search-3B} outperforms strong baselines, yielding a $10.8$ point absolute improvement and a $33.8\%$ relative gain over Search-R1.\footnote{We will make the code and model weights publicly available upon acceptance.}
Chinese Translation
检索增强生成(RAG)通过将生成过程与检索到的外部知识相结合,减少了大型语言模型(LLMs)中的幻觉和事实错误。最近的搜索代理进一步将RAG视为一种自主的、多轮的信息获取过程。然而,现有方法往往积累无关或噪声文档,并依赖稀疏的强化学习信号。我们提出了自我演化搜索(SE-Search),这是一种自我演化搜索代理,通过三个组成部分——记忆净化、原子查询训练和密集奖励——来改善在线搜索行为。SE-Search遵循“思考-搜索-记忆”(Think-Search-Memorize)策略,保留显著证据的同时过滤无关内容。原子查询训练促进了更短且多样化的查询,从而改善了证据获取。密集奖励提供细粒度反馈,加速训练。在单跳和多跳问答基准测试中的实验表明,\texttt{SE-Search-3B}超越了强基线,相较于Search-R1取得了$10.8$点的绝对提升和$33.8\%$的相对增益。\footnote{我们将在接受后公开代码和模型权重。}
cs.CL / 5 / 2603.03294
Fine-Tuning and Evaluating Conversational AI for Agricultural Advisory
农业咨询对话式人工智能的微调与评估
Abstract
Large Language Models show promise for agricultural advisory, yet vanilla models exhibit unsupported recommendations, generic advice lacking specific, actionable detail, and communication styles misaligned with smallholder farmer needs. In high stakes agricultural contexts, where recommendation accuracy has direct consequences for farmer outcomes, these limitations pose challenges for responsible deployment. We present a hybrid LLM architecture that decouples factual retrieval from conversational delivery: supervised fine-tuning with LoRA on expert-curated GOLDEN FACTS (atomic, verified units of agricultural knowledge) optimizes fact recall, while a separate stitching layer transforms retrieved facts into culturally appropriate, safety-aware responses. Our evaluation framework, DG-EVAL, performs atomic fact verification (measuring recall, precision, and contradiction detection) against expert-curated ground truth rather than Wikipedia or retrieved documents. Experiments across multiple model configurations on crops and queries from Bihar, India show that fine-tuning on curated data substantially improves fact recall and F1, while maintaining high relevance. Using a fine-tuned smaller model achieves comparable or better factual quality at a fraction of the cost of frontier models. A stitching layer further improves safety subscores while maintaining high conversational quality. We release the farmerchat-prompts library to enable reproducible development of domain-specific agricultural AI.
Chinese Translation
大型语言模型在农业咨询中展现出潜力,但普通模型存在不支持的建议、缺乏具体可行细节的通用建议,以及与小农户需求不匹配的沟通风格等问题。在高风险的农业环境中,推荐准确性直接影响农民的结果,这些局限性给负责任的部署带来了挑战。我们提出了一种混合的LLM架构,将事实检索与对话交付解耦:通过在专家策划的黄金事实(GOLDEN FACTS,原子化、经过验证的农业知识单元)上进行LoRA的监督微调,优化事实回忆,而一个单独的拼接层则将检索到的事实转化为文化适宜且关注安全的响应。我们的评估框架DG-EVAL对专家策划的真实数据进行原子事实验证(测量回忆率、精确度和矛盾检测),而不是维基百科或检索文档。在印度比哈尔州的多种模型配置下,对作物和查询的实验表明,在策划数据上进行微调显著提高了事实回忆率和F1,同时保持了高相关性。使用微调后的较小模型在成本仅为前沿模型一小部分的情况下,达到了可比或更好的事实质量。拼接层进一步提高了安全子分数,同时保持了高对话质量。我们发布了farmerchat-prompts库,以支持领域特定农业人工智能的可重复开发。
cs.CL / 6 / 2603.03295
Language Model Goal Selection Differs from Humans' in an Open-Ended Task
语言模型的目标选择与人类在开放式任务中的选择存在差异
Abstract
As large language models (LLMs) get integrated into human decision-making, they are increasingly choosing goals autonomously rather than only completing human-defined ones, assuming they will reflect human preferences. However, human-LLM similarity in goal selection remains largely untested. We directly assess the validity of LLMs as proxies for human goal selection in a controlled, open-ended learning task borrowed from cognitive science. Across four state-of-the-art models (GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5, and Centaur), we find substantial divergence from human behavior. While people gradually explore and learn to achieve goals with diversity across individuals, most models exploit a single identified solution (reward hacking) or show surprisingly low performance, with distinct patterns across models and little variability across instances of the same model. Even Centaur, explicitly trained to emulate humans in experimental settings, poorly captures people's goal selection. Chain-of-thought reasoning and persona steering provide limited improvements. These findings highlight the uniqueness of human goal selection, cautioning against replacing it with current models in applications such as personal assistance, scientific discovery, and policy research.
Chinese Translation
随着大型语言模型(LLMs)逐渐融入人类决策过程,它们越来越多地自主选择目标,而不仅仅是完成人类定义的目标,假设这些目标会反映人类的偏好。然而,LLMs与人类在目标选择上的相似性仍然未得到充分验证。我们在一个借鉴自认知科学的受控开放式学习任务中,直接评估LLMs作为人类目标选择代理的有效性。在四个最先进的模型(GPT-5、Gemini 2.5 Pro、Claude Sonnet 4.5 和 Centaur)中,我们发现与人类行为存在显著差异。尽管人类会逐渐探索并学习以多样化的方式实现目标,但大多数模型却利用单一识别的解决方案(奖励黑客)或表现出意外的低性能,各模型之间存在明显的模式,而同一模型的不同实例之间变化不大。即使是Centaur,经过专门训练以在实验环境中模拟人类,仍未能有效捕捉人类的目标选择。链式思维推理和角色引导提供了有限的改进。这些发现突显了人类目标选择的独特性,提醒我们在个人助理、科学发现和政策研究等应用中,谨慎使用当前模型替代人类目标选择。
cs.CL / 7 / 2603.03296
PlugMem: A Task-Agnostic Plugin Memory Module for LLM Agents
PlugMem:一种任务无关的插件内存模块用于大型语言模型代理
Abstract
Long-term memory is essential for large language model (LLM) agents operating in complex environments, yet existing memory designs are either task-specific and non-transferable, or task-agnostic but less effective due to low task-relevance and context explosion from raw memory retrieval. We propose PlugMem, a task-agnostic plugin memory module that can be attached to arbitrary LLM agents without task-specific redesign. Motivated by the fact that decision-relevant information is concentrated as abstract knowledge rather than raw experience, we draw on cognitive science to structure episodic memories into a compact, extensible knowledge-centric memory graph that explicitly represents propositional and prescriptive knowledge. This representation enables efficient memory retrieval and reasoning over task-relevant knowledge, rather than verbose raw trajectories, and departs from other graph-based methods like GraphRAG by treating knowledge as the unit of memory access and organization instead of entities or text chunks. We evaluate PlugMem unchanged across three heterogeneous benchmarks (long-horizon conversational question answering, multi-hop knowledge retrieval, and web agent tasks). The results show that PlugMem consistently outperforms task-agnostic baselines and exceeds task-specific memory designs, while also achieving the highest information density under a unified information-theoretic analysis. Code and data are available at https://github.com/TIMAN-group/PlugMem.
Chinese Translation
长期记忆对于在复杂环境中运行的大型语言模型(LLM)代理至关重要,但现有的内存设计要么是特定于任务且不可转移,要么是任务无关但由于低任务相关性和原始内存检索导致的上下文爆炸而效果较差。我们提出了PlugMem,一种任务无关的插件内存模块,可以在不进行特定任务重新设计的情况下附加到任意LLM代理上。基于决策相关信息集中为抽象知识而非原始经验的事实,我们借鉴认知科学将情节记忆结构化为一个紧凑、可扩展的以知识为中心的记忆图,明确表示命题知识和规范知识。这种表示方法使得在任务相关知识上进行高效的内存检索和推理成为可能,而不是冗长的原始轨迹,并且与其他基于图的方法(如GraphRAG)不同,它将知识视为内存访问和组织的单位,而不是实体或文本块。我们在三个异构基准(长时间对话问答、多跳知识检索和网络代理任务)上评估了PlugMem,结果表明PlugMem在任务无关基线之上始终表现优越,并超越了特定任务的内存设计,同时在统一的信息论分析下实现了最高的信息密度。代码和数据可在 https://github.com/TIMAN-group/PlugMem 获取。
cs.CL / 8 / 2603.03297
TTSR: Test-Time Self-Reflection for Continual Reasoning Improvement
TTSR:测试时自我反思以持续提升推理能力
Abstract
Test-time Training enables model adaptation using only test questions and offers a promising paradigm for improving the reasoning ability of large language models (LLMs). However, it faces two major challenges: test questions are often highly difficult, making self-generated pseudo-labels unreliable, and existing methods lack effective mechanisms to adapt to a model's specific reasoning weaknesses, leading to inefficient learning. To address these issues, we propose \textbf{TTSR}, a self-reflective test-time self-evolving training framework. TTSR employs a single pretrained language model that alternates between the roles of a \textit{Student} and a \textit{Teacher} at test time. The Student focuses on solving problems and learning from synthesized variant questions, while the Teacher analyzes the Student's failed reasoning trajectories, summarizes recurring reasoning weaknesses, and synthesizes targeted variant questions accordingly. This process guides the model to improve within a learnable regime through a continual self-evolving loop. Experimental results on multiple challenging mathematical reasoning benchmarks show that TTSR consistently improves reasoning performance and generalizes well across different model backbones and general-domain reasoning tasks. These findings suggest that teacher-mediated self-reflection provides an effective pathway for stable and continual reasoning improvement at test time.
Chinese Translation
测试时训练(Test-time Training)使得模型能够仅通过测试问题进行适应,并为提升大型语言模型(LLMs)的推理能力提供了一个有前景的范式。然而,它面临两个主要挑战:测试问题通常难度较高,使得自生成的伪标签不可靠;现有方法缺乏有效机制来适应模型特定的推理弱点,导致学习效率低下。为了解决这些问题,我们提出了\textbf{TTSR},一种自我反思的测试时自我进化训练框架。TTSR采用一个单一的预训练语言模型,在测试时交替扮演\textit{学生}和\textit{教师}的角色。学生专注于解决问题并从合成的变体问题中学习,而教师则分析学生的失败推理轨迹,总结反复出现的推理弱点,并相应合成针对性的变体问题。这个过程引导模型在可学习的框架内通过持续的自我进化循环进行改进。在多个具有挑战性的数学推理基准上的实验结果表明,TTSR持续提升推理表现,并在不同模型骨干和通用领域推理任务中表现良好。这些发现表明,教师介导的自我反思为在测试时稳定和持续的推理改进提供了一条有效途径。
cs.CL / 9 / 2603.03298
TATRA: Training-Free Instance-Adaptive Prompting Through Rephrasing and Aggregation
TATRA:通过重述和聚合实现无训练实例自适应提示
Abstract
Large Language Models (LLMs) have improved substantially in alignment, yet their behavior remains highly sensitive to prompt phrasing. This brittleness has motivated automated prompt engineering, but most existing methods (i) require a task-specific training set, (ii) rely on expensive iterative optimization to produce a single dataset-level prompt, and (iii) must be rerun from scratch for each new task. We introduce TATRA, a dataset-free prompting method that constructs instance-specific few-shot prompts by synthesizing on-the-fly examples to accompany a user-provided instruction. TATRA requires no labeled training data and avoids task-specific optimization loops, while retaining the benefits of demonstration-based prompting. Across standard text classification benchmarks, TATRA matches or improves over strong prompt-optimization baselines that depend on training data and extensive search. On mathematical reasoning benchmarks, TATRA achieves state-of-the-art performance on GSM8K and DeepMath, outperforming methods that explicitly optimize prompts on those tasks. Our results suggest that per-instance construction of effective in-context examples is more important than running long, expensive optimization loops to produce a single prompt per task. We will make all code publicly available upon acceptance of the paper. Code is available at https://github.com/BMD223/TATRA.
Chinese Translation
大型语言模型(LLMs)在对齐方面有了显著改善,但它们的行为仍然对提示措辞高度敏感。这种脆弱性促使了自动化提示工程的发展,但现有大多数方法(i)需要特定任务的训练集,(ii)依赖昂贵的迭代优化来生成单一的数据集级提示,以及(iii)必须针对每个新任务从头开始重新运行。我们提出了TATRA,这是一种无数据集的提示方法,通过即时合成示例来构建实例特定的少量示例提示,以配合用户提供的指令。TATRA不需要标记的训练数据,避免了特定任务的优化循环,同时保留了基于示范的提示的优势。在标准文本分类基准测试中,TATRA的表现与依赖训练数据和广泛搜索的强提示优化基线相当或更好。在数学推理基准测试中,TATRA在GSM8K和DeepMath上达到了最先进的性能,超越了那些在这些任务上明确优化提示的方法。我们的结果表明,针对每个实例构建有效的上下文示例比进行长时间、昂贵的优化循环以生成每个任务的单一提示更为重要。我们将在论文接受后公开所有代码。代码可在 https://github.com/BMD223/TATRA 获取。
cs.CL / 10 / 2603.03299
How LLMs Cite and Why It Matters: A Cross-Model Audit of Reference Fabrication in AI-Assisted Academic Writing and Methods to Detect Phantom Citations
大型语言模型如何引用及其重要性:对AI辅助学术写作中参考文献伪造的跨模型审计及检测虚假引用的方法
Abstract
Large language models (LLMs) have been noted to fabricate scholarly citations, yet the scope of this behavior across providers, domains, and prompting conditions remains poorly quantified. We present one of the largest citation hallucination audits to date, in which 10 commercially deployed LLMs were prompted across four academic domains, generating 69,557 citation instances verified against three scholarly databases (namely, CrossRef, OpenAlex, and Semantic Scholar). Our results show that the observed hallucination rates span a fivefold range (between 11.4% and 56.8%) and are strongly shaped by model, domain, and prompt framing. Our results also show that no model spontaneously generates citations when unprompted, which seems to establish hallucination as prompt-induced rather than intrinsic. We identify two practical filters: 1) multi-model consensus (agreement among more than 3 LLMs citing the same work yields 95.6% accuracy, a 5.8-fold improvement), and 2) within-prompt repetition (more than 2 replications yield 88.9% accuracy). In addition, we present findings on generational model tracking, which reveal that improvements are not guaranteed when deploying newer LLMs, and on capacity scaling, which appears to reduce hallucination within model families. Finally, a lightweight classifier trained solely on bibliographic string features is developed to classify hallucinated citations from verified citations, achieving AUC 0.876 in cross-validation and 0.834 in LOMO generalization (without querying any external database). This classifier offers a pre-screening tool deployable at inference time.
Chinese Translation
大型语言模型(LLMs)被发现会伪造学术引用,但这种行为在不同提供者、领域和提示条件下的范围尚未得到充分量化。我们展示了迄今为止最大规模的引用幻觉审计之一,其中对10个商业部署的LLM在四个学术领域进行了提示,生成了69,557个引用实例,并与三个学术数据库(即CrossRef、OpenAlex和Semantic Scholar)进行了验证。我们的结果显示,观察到的幻觉率跨越了五倍的范围(介于11.4%和56.8%之间),并受到模型、领域和提示框架的强烈影响。我们的结果还表明,在未提示的情况下,没有模型会自发生成引用,这似乎确立了幻觉是由提示引发的,而非内在的。我们识别了两个实用的过滤器:1)多模型共识(超过3个LLM引用相同作品的准确率为95.6%,提升了5.8倍),以及2)提示内重复(超过2次重复的准确率为88.9%)。此外,我们还展示了生成模型跟踪的发现,这表明在部署更新的LLM时,改进并不总是有保障,以及容量扩展似乎减少了模型家族内的幻觉。最后,我们开发了一种仅基于书目字符串特征训练的轻量级分类器,用于将幻觉引用与经过验证的引用进行分类,在交叉验证中达到AUC 0.876,在LOMO泛化中达到0.834(无需查询任何外部数据库)。该分类器提供了一种可在推理时部署的预筛选工具。
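The two practical filters from the abstract reduce to simple counting. The sketch below interprets "more than 3 LLMs" as at least 4 distinct models and "more than 2 replications" as at least 3 repeats within one output; the exact thresholds and data structures in the paper may differ:

```python
from collections import Counter

def consensus_filter(citations_by_model, min_models=4):
    """Keep citations independently produced by at least `min_models` models
    (the multi-model consensus filter; each model is counted at most once)."""
    counts = Counter()
    for cites in citations_by_model.values():
        counts.update(set(cites))
    return {c for c, n in counts.items() if n >= min_models}

def repetition_filter(citations_in_output, min_reps=3):
    """Keep citations repeated at least `min_reps` times within one output
    (the within-prompt repetition filter)."""
    counts = Counter(citations_in_output)
    return {c for c, n in counts.items() if n >= min_reps}
```

Both filters are cheap pre-screens: they require no database queries, only the raw citation strings emitted by the models.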
cs.CL / 11 / 2603.03300
Benchmarking Legal RAG: The Promise and Limits of AI Statutory Surveys
法律检索增强生成的基准测试:人工智能法定调查的前景与局限
Abstract
Retrieval-augmented generation (RAG) offers significant potential for legal AI, yet systematic benchmarks are sparse. Prior work introduced LaborBench to benchmark RAG models based on ostensible ground truth from an exhaustive, multi-month, manual enumeration of all U.S. state unemployment insurance requirements by U.S. Department of Labor (DOL) attorneys. That prior work found poor performance of standard RAG (70% accuracy on Boolean tasks). Here, we assess three emerging tools not previously evaluated on LaborBench: the Statutory Research Assistant (STARA), a custom statutory research tool, and two commercial tools from Westlaw and LexisNexis marketed as offering AI statutory-survey capabilities. We make five main contributions. First, we show that STARA achieves substantial performance gains, boosting accuracy to 83%. Second, we show that commercial platforms fare poorly, with accuracy of 58% (Westlaw AI) and 64% (Lexis+ AI), even worse than standard RAG. Third, we conduct a comprehensive error analysis, comparing our outputs to those compiled by DOL attorneys, and document both reasoning errors, such as confusion between related legal concepts and misinterpretation of statutory exceptions, and retrieval failures, where relevant statutory provisions are not captured. Fourth, we discover that many apparent errors are actually significant omissions by DOL attorneys themselves, such that STARA's actual accuracy is 92%. Fifth, we chart the path forward for legal RAG through concrete design principles, offering actionable guidance for building AI systems capable of accurate multi-jurisdictional legal research.
Chinese Translation
检索增强生成(RAG)为法律人工智能提供了显著的潜力,但系统性的基准测试仍然稀缺。先前的研究引入了LaborBench,以基于美国劳动部(DOL)律师对所有美国州失业保险要求进行的详尽多月人工枚举所提供的表面真相来基准测试RAG模型。该研究发现标准RAG的表现不佳(布尔任务准确率为70%)。在此,我们评估了三个此前未在LaborBench上评估的新兴工具:法定研究助手(Statutory Research Assistant, STARA)、一个定制的法定研究工具,以及Westlaw和LexisNexis推出的两款商业工具,后者宣传其人工智能法定调查能力。我们做出了五项主要贡献。首先,我们展示了STARA实现了显著的性能提升,将准确率提高至83%。其次,我们发现商业平台的表现较差,Westlaw AI的准确率为58%,Lexis+ AI的准确率为64%,甚至低于标准RAG。第三,我们进行了全面的错误分析,将我们的输出与DOL律师编制的结果进行了比较,记录了推理错误(如相关法律概念之间的混淆和对法定例外的误解)和检索失败(未捕获相关法定条款)。第四,我们发现许多明显的错误实际上是DOL律师自身的重大遗漏,因此STARA的实际准确率为92%。第五,我们通过具体的设计原则为法律RAG指明了前进的方向,为构建能够进行准确多管辖区法律研究的人工智能系统提供了可操作的指导。
cs.CL / 12 / 2603.03301
From Exact Hits to Close Enough: Semantic Caching for LLM Embeddings
从精确匹配到足够接近:大语言模型嵌入的语义缓存
Abstract
The rapid adoption of large language models (LLMs) has created demand for faster responses and lower costs. Semantic caching, which reuses responses to semantically similar requests matched via their embeddings, addresses this need but breaks classic cache assumptions and raises new challenges. In this paper, we explore offline policies for semantic caching, proving that implementing an optimal offline policy is NP-hard, and propose several polynomial-time heuristics. We also present online semantics-aware cache policies that combine recency, frequency, and locality. Evaluations on diverse datasets show that while frequency-based policies are strong baselines, our novel variant improves semantic accuracy. Our findings reveal effective strategies for current systems and highlight substantial headroom for future innovation. All code is open source.
Chinese Translation
大语言模型(LLMs)的快速普及带来了对更快响应和更低成本的需求。语义缓存通过重用语义相似的请求及其嵌入来满足这一需求,但这打破了经典缓存的假设,并带来了新的挑战。本文探讨了语义缓存的离线策略,证明了实现最优离线策略是 NP-hard,并提出了几种多项式时间的启发式方法。我们还提出了结合时效性、频率和局部性的在线语义感知缓存策略。在多样化数据集上的评估表明,尽管基于频率的策略是强有力的基线,但我们的新变体在语义准确性上有所提升。我们的研究结果揭示了当前系统的有效策略,并强调了未来创新的巨大潜力。所有代码均为开源。
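A minimal toy of the semantic-caching idea described above: a hit is any stored embedding within a cosine-similarity threshold of the query, and eviction drops the least-frequently-hit entry. The class name, threshold, and eviction rule here are illustrative choices, not the paper's actual policies:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

class SemanticCache:
    """Toy semantic cache keyed by embeddings rather than exact strings."""
    def __init__(self, capacity=64, threshold=0.95):
        self.capacity, self.threshold = capacity, threshold
        self.entries = []  # each entry: [embedding, response, hit_count]

    def get(self, emb):
        """Return the cached response for the nearest embedding, if close enough."""
        best = max(self.entries, key=lambda e: cosine(emb, e[0]), default=None)
        if best is not None and cosine(emb, best[0]) >= self.threshold:
            best[2] += 1
            return best[1]
        return None

    def put(self, emb, response):
        """Insert, evicting the least-frequently-hit entry when full."""
        if len(self.entries) >= self.capacity:
            self.entries.remove(min(self.entries, key=lambda e: e[2]))
        self.entries.append([emb, response, 0])
```

The threshold is what breaks classic cache assumptions: "close enough" hits can return stale or subtly wrong responses, which is why the offline-optimal policy becomes non-trivial.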
cs.CL / 13 / 2603.03302
Developing an AI Assistant for Knowledge Management and Workforce Training in State DOTs
为州交通部门开发知识管理和劳动力培训的人工智能助手
Abstract
Effective knowledge management is critical for preserving institutional expertise and improving the efficiency of workforce training in state transportation agencies. Traditional approaches, such as static documentation, classroom-based instruction, and informal mentorship, often lead to fragmented knowledge transfer, inefficiencies, and the gradual loss of expertise as senior engineers retire. Moreover, given the enormous volume of technical manuals, guidelines, and research reports maintained by these agencies, it is increasingly challenging for engineers to locate relevant information quickly and accurately when solving field problems or preparing for training tasks. These limitations hinder timely decision-making and create steep learning curves for new personnel in maintenance and construction operations. To address these challenges, this paper proposes a Retrieval-Augmented Generation (RAG) framework with a multi-agent architecture to support knowledge management and decision making. The system integrates structured document retrieval with real-time, context-aware response generation powered by a large language model (LLM). Unlike conventional single-pass RAG systems, the proposed framework employs multiple specialized agents for retrieval, answer generation, evaluation, and query refinement, which enables iterative improvement and quality control. In addition, the system incorporates an open-weight vision-language model to convert technical figures into semantic textual representations, which allows figure-based knowledge to be indexed and retrieved alongside text. Retrieved text and figure-based context are then provided to an open-weight large language model, which generates the final responses grounded in the retrieved evidence.
Chinese Translation
有效的知识管理对于保留机构专业知识和提高州交通机构劳动力培训的效率至关重要。传统方法,如静态文档、课堂教学和非正式指导,往往导致知识转移的碎片化、效率低下,以及随着资深工程师退休而逐渐流失的专业知识。此外,考虑到这些机构维护的大量技术手册、指南和研究报告,工程师在解决现场问题或准备培训任务时,快速准确地找到相关信息变得越来越具有挑战性。这些局限性妨碍了及时决策,并为维护和施工操作中的新人员创造了陡峭的学习曲线。为了解决这些挑战,本文提出了一种检索增强生成(Retrieval-Augmented Generation, RAG)框架,采用多智能体架构以支持知识管理和决策。该系统将结构化文档检索与由大型语言模型(Large Language Model, LLM)驱动的实时、上下文感知的响应生成相结合。与传统的单通道RAG系统不同,所提出的框架采用多个专门的智能体进行检索、答案生成、评估和查询优化,从而实现迭代改进和质量控制。此外,该系统还结合了开放权重的视觉-语言模型,将技术图形转换为语义文本表示,使基于图形的知识能够与文本一起被索引和检索。检索到的文本和基于图形的上下文随后提供给开放权重的大型语言模型,以生成基于检索证据的最终响应。
cs.CL / 14 / 2603.03303
HumanLM: Simulating Users with State Alignment Beats Response Imitation
HumanLM:状态对齐的用户模拟优于响应模仿
Abstract
Large Language Models (LLMs) are increasingly used to simulate how specific users respond to a given context, enabling more user-centric applications that rely on user feedback. However, existing user simulators mostly imitate surface-level patterns and language styles, which fail to reflect the underlying states of real users (e.g., beliefs and emotions). To address these limitations, we propose a novel training framework, HumanLM, which builds user simulators that accurately reflect real users. Our key insight is that, in addition to generating responses, the model should generate natural-language latent states that align with ground-truth responses through reinforcement learning. These latent states correspond to a set of psychologically grounded state dimensions that drive how real users respond. HumanLM further synthesizes these aligned latent states into responses that accurately represent real users. For extensive evaluation, we develop Humanual, a comprehensive benchmark for simulating real users based on public data. Humanual consists of six large-scale datasets with 26k users and 216k responses in total, spanning diverse tasks such as generating user responses to daily life issues, political blogs, and chat sessions with LLM assistants. Across datasets, HumanLM significantly outperforms alternative approaches, achieving an average relative improvement of 16.3% in alignment scores from an LLM judge. In a real-time simulation study with 111 participants, HumanLM achieves the highest similarity to real user responses and competitive human-likeness scores.
Chinese Translation
大型语言模型(LLMs)越来越多地用于模拟特定用户在给定情境下的反应,从而支持更以用户为中心的应用程序,这些应用程序依赖于用户反馈。然而,现有的用户模拟器主要模仿表层模式和语言风格,未能反映真实用户的潜在状态(例如,信念和情感)。为了解决这些局限性,我们提出了一种新颖的训练框架,HumanLM,旨在构建准确反映真实用户的用户模拟器。我们的关键见解是,除了生成响应外,模型还应通过强化学习生成与真实响应对齐的自然语言潜在状态。这些潜在状态对应于一组心理学基础的状态维度,这些维度驱动真实用户的反应。HumanLM进一步将这些对齐的潜在状态合成成准确代表真实用户的响应。为了进行广泛评估,我们开发了Humanual,这是一个基于公共数据模拟真实用户的综合基准。Humanual包含六个大规模数据集,共有26,000名用户和216,000条响应,涵盖生成用户对日常生活问题、政治博客和与LLM助手的聊天会话的响应等多种任务。在各个数据集中,HumanLM显著优于其他方法,在LLM评估者的对齐分数上实现了平均相对提升16.3%。在一项包含111名参与者的实时模拟研究中,HumanLM在与真实用户响应的相似性和人类相似度评分方面达到了最高水平。
cs.CL / 15 / 2603.03305
Draft-Conditioned Constrained Decoding for Structured Generation in LLMs
基于草稿条件的约束解码在大型语言模型中的结构化生成
Abstract
Large language models (LLMs) are increasingly used to generate executable outputs, JSON objects, and API calls, where a single syntax error can make the output unusable. Constrained decoding enforces validity token-by-token via masking and renormalization, but it can distort generation when the model assigns low probability mass to valid continuations, pushing decoding toward locally valid yet semantically incorrect trajectories. We propose \emph{Draft-Conditioned Constrained Decoding (DCCD)}, a simple two-step, training-free inference procedure that decouples semantic planning from structural enforcement: an unconstrained draft is generated first, and constrained decoding is then applied, conditioned on this draft, to guarantee validity. We analyze DCCD through a KL-projection view, showing that draft conditioning increases feasible mass and reduces the cumulative "projection tax" induced by hard constraints, with an optional best-of-$K$ draft selection. Across structured reasoning benchmarks, DCCD improves strict structured accuracy by up to +24 percentage points over standard constrained decoding (e.g., 15.2\% to 39.0\% on GSM8K with a 1B model), and enables smaller model pairs to match or exceed much larger constrained baselines, yielding substantial gains in parameter efficiency.
Chinese Translation
大型语言模型(LLMs)越来越多地用于生成可执行输出、JSON对象和API调用,其中单个语法错误可能导致输出无法使用。约束解码通过掩蔽和重新归一化逐个令牌地强制有效性,但当模型对有效延续分配低概率质量时,它可能会扭曲生成,推动解码朝向局部有效但语义上不正确的轨迹。我们提出了\textit{基于草稿条件的约束解码(DCCD)},这是一种简单的两步、无训练推理过程,将语义规划与结构强制解耦:首先生成一个无约束的草稿,然后在此草稿的条件下应用约束解码,以保证有效性。我们通过KL投影视角分析DCCD,表明草稿条件化增加了可行质量并减少了硬约束引起的累积“投影税”,并提供可选的最佳$K$草稿选择。在结构化推理基准测试中,DCCD在严格结构准确性上比标准约束解码提高了多达24个百分点(例如,在1B模型上,GSM8K从15.2\%提高到39.0\%),并使得较小的模型对能够匹配或超过更大约束基线,显著提高了参数效率。
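Classic constrained decoding masks invalid tokens and renormalizes; DCCD additionally conditions the constrained pass on an unconstrained draft. The paper does this through the model's input, but the effect can be caricatured as a probability boost for draft tokens that remain structurally valid. The `boost` factor and all names below are illustrative assumptions, not the authors' method:

```python
def constrained_step(probs, valid, draft_token=None, boost=2.0):
    """Mask invalid tokens, optionally boost the draft's token, renormalize."""
    masked = {t: p for t, p in probs.items() if t in valid}
    if draft_token in masked:
        masked[draft_token] *= boost  # toy stand-in for draft conditioning
    z = sum(masked.values())
    return {t: p / z for t, p in masked.items()}

def decode(steps, valid_sets, draft=None):
    """Greedy decode per-step distributions under per-step valid-token sets;
    `draft` is a previously generated unconstrained token sequence."""
    out = []
    for i, probs in enumerate(steps):
        dist = constrained_step(probs, valid_sets[i],
                                draft[i] if draft else None)
        out.append(max(dist, key=dist.get))
    return out
```

When the model's preferred token is invalid, plain masking can push decoding onto a locally valid but semantically wrong path; conditioning on a draft biases the constrained pass back toward the model's intended semantics wherever the grammar permits.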
cs.CL / 16 / 2603.03306
Token-Oriented Object Notation vs JSON: A Benchmark of Plain and Constrained Decoding Generation
面向令牌的对象表示法与 JSON:普通解码生成与受限解码生成的基准测试
Abstract
The recently presented Token-Oriented Object Notation (TOON) aims to replace JSON as a serialization format for passing structured data to LLMs with significantly reduced token usage. While TOON shows solid accuracy in LLM comprehension, it has not been tested as a generation target against JSON. Although TOON never appears in training data, its syntax is simple enough that one-shot in-context learning could plausibly support accurate generation, and the inevitable prompt overhead can be an acceptable trade-off for shorter completions. To test this, we built a benchmark with test cases of varying structural complexity and a validation pipeline, comparing plain JSON generation, structured-output JSON generation (via constrained decoding), and TOON one-shot in-context-learning generation. JSON structured output was included to establish a minimum token-budget baseline and to set a starting point for future experiments testing TOON constrained-decoding enforcement. Key findings: TOON shows a promising accuracy/token-consumption ratio for in-domain generation tasks, though this advantage is often reduced by the "prompt tax" of instructional overhead in shorter contexts. Plain JSON generation shows the best one-shot and final accuracy, even compared with constrained-decoding structured output, whose only significant advantage is the lowest token usage, traded for slightly decreased accuracy overall and significant degradation for some models. Notably, for simple structures, this lowest token usage of constrained decoding outperformed even TOON, hinting that enforcing TOON via frameworks such as xgrammar may not yield the desired results. Furthermore, the results suggest a scaling hypothesis: TOON's true efficiency potential likely follows a non-linear curve, shining only beyond the point where cumulative syntax savings amortize the initial prompt overhead.
Chinese Translation
最近提出的面向令牌的对象表示法(Token-Oriented Object Notation, TOON)旨在替代 JSON,作为向大型语言模型(LLMs)传递结构化数据的序列化格式,并显著减少令牌使用量。尽管在 LLM 理解方面表现出良好的准确性,但缺乏针对 JSON 生成的测试。尽管 TOON 语法在训练数据中从未出现,但其简单性足以表明一次性上下文学习可能支持准确生成。不可避免的提示开销可以被视为较短完成的可接受权衡。为此,我们进行了基准测试,创建了多个测试用例,以考虑结构复杂性、验证流程,并比较普通 JSON 生成与结构化输出(通过受限解码)JSON 生成以及 TOON 一次性上下文学习生成。包含 JSON 结构化输出是为了建立最低令牌预算基线,并为未来测试 TOON 受限解码推理执行的实验设定起点。关键发现:TOON 在领域生成任务中显示出有希望的准确性/令牌消耗比,尽管这种优势常常因较短上下文中的“提示税”而减弱。普通 JSON 生成在一次性和最终准确性方面表现最佳,甚至与受限解码结构化输出相比,唯一显著的优势是最低的令牌使用量,作为略微降低整体准确性和某些模型显著退化的权衡。值得注意的是,对于简单结构,受限解码的“最低令牌使用量”甚至超过了 TOON,这暗示 TOON 通过 xgrammar 等框架强制执行可能无法产生预期结果。此外,结果还提出了一个扩展假设:TOON 的真实效率潜力可能遵循非线性曲线,仅在特定点之后才会显现出累积语法节省抵消初始提示开销的优势。
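To see why a tabular encoding can save tokens, one can compare a compact JSON dump against a TOON-like header-plus-rows layout for a uniform array of objects. The TOON syntax below is approximated from the format's public description and may differ in detail, and character count is only a rough stand-in for token count:

```python
import json

def toonish(name, rows):
    """Approximate TOON-style tabular encoding for a uniform array of objects:
    one header declaring length and field names, then one comma-joined row
    per object. Syntax approximated; details may differ from the TOON spec."""
    fields = list(rows[0])
    header = f"{name}[{len(rows)}]{{{','.join(fields)}}}:"
    lines = [",".join(str(r[f]) for f in fields) for r in rows]
    return "\n".join([header] + lines)

rows = [{"id": i, "name": f"user{i}"} for i in range(1, 11)]
as_json = json.dumps({"users": rows}, separators=(",", ":"))
as_toon = toonish("users", rows)
# Character count stands in for token count in this sketch.
saving = 1 - len(as_toon) / len(as_json)
```

The saving comes from stating the field names once in the header instead of repeating the quoted keys in every object, which is also why the advantage grows with the number of rows and shrinks once the fixed "prompt tax" of explaining the format is accounted for.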
cs.CL / 17 / 2603.03307
TopicENA: Enabling Epistemic Network Analysis at Scale through Automated Topic-Based Coding
TopicENA:通过自动化主题编码实现大规模的认知网络分析
Abstract
Epistemic Network Analysis (ENA) is a method for investigating the relational structure of concepts in text by representing co-occurring concepts as networks. Traditional ENA, however, relies heavily on manual expert coding, which limits its scalability and real-world applicability to large text corpora. Topic modeling provides an automated approach to extracting concept-level representations from text and can serve as an alternative to manual coding. To tackle this limitation, the present study merges BERTopic with ENA and introduces TopicENA, a topic-based epistemic network analysis framework. TopicENA substitutes manual concept coding with automatically generated topics while maintaining ENA's capacity for modeling structural associations among concepts. To explain the impact of modeling choices on TopicENA outcomes, three analysis cases are presented. The first case assesses the effect of topic granularity, indicating that coarse-grained topics are preferable for large datasets, whereas fine-grained topics are more effective for smaller datasets. The second case examines topic inclusion thresholds and finds that threshold values should be adjusted according to topic quality indicators to balance network consistency and interpretability. The third case tests TopicENA's scalability by applying it to a substantially larger dataset than those used in previous ENA studies. Collectively, these cases illustrate that TopicENA facilitates practical and interpretable ENA analysis at scale and offers concrete guidance for configuring topic-based ENA pipelines in large-scale text analysis.
Chinese Translation
认知网络分析(Epistemic Network Analysis, ENA)是一种通过将共现概念表示为网络来研究文本中概念关系结构的方法。然而,传统的 ENA 在很大程度上依赖于人工专家编码,这限制了其在大规模文本语料库中的可扩展性和实际应用。主题建模提供了一种从文本中提取概念级表示的自动化方法,可以作为人工编码的替代方案。为了解决这一限制,本研究将 BERTopic 与 ENA 相结合,提出了 TopicENA,一个基于主题的认知网络分析框架。TopicENA 用自动生成的主题替代了手动概念编码,同时保持了 ENA 对概念之间结构关联建模的能力。为了说明建模选择对 TopicENA 结果的影响,本文呈现了三个分析案例。第一个案例评估了主题粒度的影响,表明粗粒度主题更适合大数据集,而细粒度主题在小数据集上更有效。第二个案例考察了主题包含阈值,发现阈值应根据主题质量指标进行调整,以平衡网络的一致性和可解释性。第三个案例通过将 TopicENA 应用于一个显著大于以往 ENA 研究所用数据集的更大数据集,测试了其可扩展性。这些案例共同表明,TopicENA 促进了大规模的实用和可解释的 ENA 分析,并为在大规模文本分析中配置基于主题的 ENA 流水线提供了具体指导。
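Once automatically generated topics replace manual codes, the core ENA construction is a co-occurrence network: each topic pair is weighted by how many documents contain both topics. A minimal sketch (the function name and list-of-topic-lists input format are assumptions for illustration):

```python
from collections import Counter
from itertools import combinations

def topic_network(doc_topics):
    """Build an edge-weight map for an epistemic network: each pair of topics
    is weighted by the number of documents in which both topics occur."""
    edges = Counter()
    for topics in doc_topics:
        # each document contributes at most once per topic pair
        for pair in combinations(sorted(set(topics)), 2):
            edges[pair] += 1
    return edges
```

Granularity and inclusion thresholds from the abstract's first two cases would act here as preprocessing: coarser topic models shrink the node set, and a threshold would drop low-weight edges before interpretation.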
cs.CL / 18 / 2603.03308
Old Habits Die Hard: How Conversational History Geometrically Traps LLMs
旧习难改:对话历史如何在几何上束缚大型语言模型(LLMs)
Abstract
How does the conversational past of large language models (LLMs) influence their future performance? Recent work suggests that LLMs are affected by their conversational history in unexpected ways. For instance, hallucinations in prior interactions may influence subsequent model responses. In this work, we introduce History-Echoes, a framework that investigates how conversational history biases subsequent generations. The framework explores this bias from two perspectives: probabilistically, we model conversations as Markov chains to quantify state consistency; geometrically, we measure the consistency of consecutive hidden representations. Across three model families and six datasets spanning diverse phenomena, our analysis reveals a strong correlation between the two perspectives. By bridging these perspectives, we demonstrate that behavioral persistence manifests as a geometric trap, where gaps in the latent space confine the model's trajectory. Code available at https://github.com/technion-cs-nlp/OldHabitsDieHard.
Chinese Translation
大型语言模型(LLMs)的对话历史如何影响其未来表现?近期研究表明,LLMs受到其对话历史的意想不到的影响。例如,先前交互中的幻觉可能会影响后续模型的响应。在本研究中,我们引入了History-Echoes框架,探讨对话历史如何偏向后续生成。该框架从两个角度探讨这种偏向:从概率上,我们将对话建模为马尔可夫链,以量化状态一致性;从几何上,我们测量连续隐藏表示的一致性。在涵盖三种模型家族和六个数据集的多样现象的分析中,我们发现这两个视角之间存在强相关性。通过连接这两个视角,我们证明了行为的持续性表现为一种几何陷阱,其中潜在空间的间隙限制了模型的轨迹。代码可在 https://github.com/technion-cs-nlp/OldHabitsDieHard 获取。
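The probabilistic perspective in the abstract can be illustrated by estimating a Markov transition matrix from a sequence of per-turn conversation states and summarizing persistence as the average self-transition probability. The state labels and this particular summary statistic are illustrative, not the paper's exact metric:

```python
from collections import Counter, defaultdict

def transition_probs(states):
    """Maximum-likelihood Markov transition probabilities from one sequence."""
    counts = defaultdict(Counter)
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}

def persistence(states):
    """Average self-transition probability: how strongly the conversation
    stays in its current state (the 'old habits' effect)."""
    P = transition_probs(states)
    return sum(p.get(s, 0.0) for s, p in P.items()) / len(P)
```

A high persistence score means prior turns (e.g. a hallucinating state) strongly predict the next turn's state, the behavioral counterpart of the geometric trap measured on hidden representations.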
cs.CL / 19 / 2603.03309
Combating data scarcity in recommendation services: Integrating cognitive types of VARK and neural network technologies (LLM)
应对推荐服务中的数据稀缺:整合VARK认知类型与神经网络技术(LLM)
Abstract
Cold start scenarios present fundamental obstacles to effective recommendation generation, particularly when dealing with users lacking interaction history or items with sparse metadata. This research proposes an innovative hybrid framework that leverages Large Language Models (LLMs) for content semantic analysis and knowledge graph development, integrated with cognitive profiling based on VARK (Visual, Auditory, Reading/Writing, Kinesthetic) learning preferences. The proposed system tackles multiple cold start dimensions: enriching inadequate item descriptions through LLM processing, generating user profiles from minimal data, and dynamically adjusting presentation formats based on cognitive assessment. The framework comprises six integrated components: semantic metadata enhancement, dynamic graph construction, VARK-based profiling, mental state estimation, graph-enhanced retrieval with LLM-powered ranking, and adaptive interface design with iterative learning. Experimental validation on MovieLens-1M dataset demonstrates the system's capacity for personalized recommendation generation despite limited initial information. This work establishes groundwork for cognitively-aware recommendation systems capable of overcoming cold start limitations through semantic comprehension and psychological modeling, offering personalized, explainable recommendations from initial user contact.
Chinese Translation
冷启动场景对有效推荐生成构成了根本性障碍,尤其是在处理缺乏交互历史的用户或具有稀疏元数据的项目时。本研究提出了一种创新的混合框架,利用大型语言模型(LLMs)进行内容语义分析和知识图谱开发,并结合基于VARK(视觉、听觉、阅读/写作、动觉)学习偏好的认知画像。所提系统解决了多个冷启动维度:通过LLM处理丰富不足的项目描述,从最少的数据生成用户画像,以及根据认知评估动态调整展示格式。该框架包含六个集成组件:语义元数据增强、动态图构建、基于VARK的画像、心理状态估计、结合LLM增强排名的图检索,以及具有迭代学习的自适应界面设计。在MovieLens-1M数据集上的实验验证表明,该系统能够在初始信息有限的情况下生成个性化推荐。本研究为能够通过语义理解和心理建模克服冷启动限制的认知感知推荐系统奠定了基础,提供了从用户初次接触开始的个性化、可解释的推荐。
cs.CL / 20 / 2603.03310
Entropic-Time Inference: Self-Organizing Large Language Model Decoding Beyond Attention
熵时间推断:超越注意力机制的自组织大型语言模型解码
Abstract
Modern large language model (LLM) inference engines optimize throughput and latency under fixed decoding rules, treating generation as a linear progression in token time. We propose a fundamentally different paradigm: entropic-time inference, where decoding is governed by the flow of uncertainty rather than token index. We introduce a self-organizing inference architecture that jointly couples scheduling, attention sparsification, and sampling temperature under a unified entropy control objective. Our method extends vLLM with entropy-aware scheduling, entropic pruning of paged attention blocks, and adaptive temperature control that stabilizes generation near a target entropy regime. This transforms inference into a resource-intelligent thermodynamic process that allocates computation where uncertainty reduction is maximized. We present a concrete systems design, pseudocode, and integration plan, demonstrating how entropy can serve as a first-class control signal for scalable LLM inference.
Chinese Translation
现代大型语言模型(LLM)推断引擎在固定解码规则下优化吞吐量和延迟,将生成视为令牌时间的线性进程。我们提出了一种根本不同的范式:熵时间推断,其中解码由不确定性的流动而非令牌索引所主导。我们引入了一种自组织推断架构,该架构在统一的熵控制目标下,联合耦合调度、注意力稀疏化和采样温度。我们的方法扩展了 vLLM,采用了基于熵的调度、熵驱动的分页注意力块修剪以及自适应温度控制,从而使生成过程在目标熵范围内稳定。这将推断转变为一种资源智能的热力学过程,在不确定性减少最大化的地方分配计算资源。我们展示了具体的系统设计、伪代码和集成计划,证明熵可以作为可扩展 LLM 推断的一流控制信号。
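The "adaptive temperature control that stabilizes generation near a target entropy regime" admits a simple feedback sketch: measure the entropy of the temperature-scaled token distribution and nudge the temperature toward a target. The controller below (gain, step count, toy logits) is a hypothetical stand-in for whatever controller the paper actually uses:

```python
import math

def entropy_at(logits, T):
    """Shannon entropy (nats) of softmax(logits / T)."""
    z = [l / T for l in logits]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    probs = [e / s for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def adapt_temperature(logits, target_H, T=1.0, lr=0.5, steps=50):
    """Toy proportional controller: raise T when entropy is below target,
    lower it when above, settling generation near target_H."""
    for _ in range(steps):
        T = max(0.05, T + lr * (target_H - entropy_at(logits, T)))
    return T

logits = [4.0, 2.0, 1.0, 0.5]
T = adapt_temperature(logits, target_H=1.0)
print(T, entropy_at(logits, T))  # entropy settles near the 1.0-nat target
```

Because entropy is monotone increasing in temperature for a fixed distribution, the proportional update converges to the target regime whenever the target lies between 0 and the uniform entropy (here ln 4 ≈ 1.386 nats).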
cs.CL / 21 / 2603.03311
The Logovista English--Japanese Machine Translation System
Logovista 英日机器翻译系统
Abstract
This paper documents the architecture, development practices, and preserved artifacts of the Logovista English--Japanese machine translation system, a large, explicitly rule-based MT system that was developed and sold commercially from the early 1990s through at least 2012. The system combined hand-authored grammatical rules, a large central dictionary encoding syntactic and semantic constraints, and chart-based parsing with weighted interpretation scoring to manage extensive structural ambiguity. The account emphasizes how the system was extended and maintained under real-world usage pressures, including regression control, ambiguity management, and the limits encountered as coverage expanded. Unlike many rule-based MT systems described primarily in research settings, Logovista was deployed for decades and evolved continuously in response to practical requirements. The paper is intended as a technical and historical record rather than an argument for reviving rule-based MT, and describes the software and linguistic resources that have been preserved for potential future study.
Chinese Translation
本文记录了 Logovista 英日机器翻译系统的架构、开发实践和保留的文物,该系统是一个大型的、明确基于规则的机器翻译系统,自1990年代初期至至少2012年间开发并商业销售。该系统结合了手工编写的语法规则、大型中央词典(编码了句法和语义约束)以及基于图表的解析和加权解释评分,以管理广泛的结构性歧义。本文强调了该系统在实际使用压力下的扩展和维护,包括回归控制、歧义管理以及在覆盖范围扩展时遇到的限制。与许多主要在研究环境中描述的基于规则的机器翻译系统不同,Logovista 被部署了数十年,并持续根据实际需求进行演变。本文旨在作为技术和历史记录,而非复兴基于规则的机器翻译的论据,并描述了为未来潜在研究保留的软件和语言资源。
cs.CL / 22 / 2603.03312
Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding
摆脱BLEU陷阱:一种具有解耦语义指导的信号基础框架用于脑电图到文本的解码
Abstract
Decoding natural language from non-invasive EEG signals is a promising yet challenging task. However, current state-of-the-art models remain constrained by three fundamental limitations: Semantic Bias (mode collapse into generic templates), Signal Neglect (hallucination based on linguistic priors rather than neural inputs), and the BLEU Trap, where evaluation metrics are artificially inflated by high-frequency stopwords, masking a lack of true semantic fidelity. To address these challenges, we propose SemKey, a novel multi-stage framework that enforces signal-grounded generation through four decoupled semantic objectives: sentiment, topic, length, and surprisal. We redesign the interaction between the neural encoder and the Large Language Model (LLM) by injecting semantic prompts as Queries and EEG embeddings as Key-Value pairs, strictly forcing the model to attend to neural inputs. Furthermore, we move beyond standard translation metrics by adopting N-way Retrieval Accuracy and Fr\'echet Distance to rigorously assess diversity and alignment. Extensive experiments demonstrate that our approach effectively eliminates hallucinations on noise inputs and achieves SOTA performance on these robust protocols. Code will be released upon acceptance at https://github.com/xmed-lab/SemKey.
Chinese Translation
从非侵入性脑电图(EEG)信号中解码自然语言是一项前景广阔但具有挑战性的任务。然而,目前的最先进模型仍然受到三个基本限制的制约:语义偏见(模式崩溃为通用模板)、信号忽视(基于语言先验而非神经输入的幻觉)以及BLEU陷阱,其中评估指标因高频停用词而被人为地夸大,掩盖了缺乏真实语义保真度的问题。为了解决这些挑战,我们提出了SemKey,这是一种新颖的多阶段框架,通过四个解耦的语义目标(情感、主题、长度和惊讶度)来强制信号基础生成。我们通过将语义提示作为查询(Queries)注入,并将EEG嵌入作为键值对(Key-Value pairs)重新设计神经编码器与大型语言模型(LLM)之间的交互,严格迫使模型关注神经输入。此外,我们超越了标准翻译指标,采用N-way检索准确率和Fréchet距离来严格评估多样性和一致性。大量实验表明,我们的方法有效消除了噪声输入上的幻觉,并在这些稳健协议上实现了最先进的性能。代码将在接受后发布于https://github.com/xmed-lab/SemKey。
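The redesigned interaction — semantic prompts as Queries, EEG embeddings as Key-Value pairs — is ordinary cross-attention with the roles fixed so that every output is a convex combination of the neural inputs. A minimal single-head sketch, with tiny hypothetical 2-d vectors standing in for real prompt and EEG features:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Single-head cross-attention: semantic prompts supply the queries,
    EEG embeddings supply keys and values, so each output row is a
    convex combination of the *neural* value vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

prompt_q = [[1.0, 0.0]]                  # e.g. a "sentiment" prompt query
eeg_k    = [[1.0, 0.0], [0.0, 1.0]]     # EEG-derived keys (toy values)
eeg_v    = [[5.0, 0.0], [0.0, 5.0]]     # EEG-derived values (toy values)
out = cross_attention(prompt_q, eeg_k, eeg_v)
print(out)
```

Since the values come exclusively from the EEG side, the model cannot produce an output that ignores the neural inputs, which is the structural point of the design.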
cs.CL / 23 / 2603.03313
How does fine-tuning improve sensorimotor representations in large language models?
微调如何改善大型语言模型中的感知运动表征?
Abstract
Large Language Models (LLMs) exhibit a significant "embodiment gap", where their text-based representations fail to align with human sensorimotor experiences. This study systematically investigates whether and how task-specific fine-tuning can bridge this gap. Utilizing Representational Similarity Analysis (RSA) and dimension-specific correlation metrics, we demonstrate that the internal representations of LLMs can be steered toward more embodied, grounded patterns through fine-tuning. Furthermore, the results show that while sensorimotor improvements generalize robustly across languages and related sensory-motor dimensions, they are highly sensitive to the learning objective, failing to transfer across two disparate task formats.
Chinese Translation
大型语言模型(LLMs)表现出显著的“体现差距”,其基于文本的表征未能与人类的感知运动经验对齐。本研究系统地探讨了任务特定的微调是否以及如何能够弥补这一差距。通过利用表征相似性分析(Representational Similarity Analysis, RSA)和特定维度的相关性度量,我们证明了LLMs的内部表征可以通过微调朝向更具体现性和扎根的模式进行引导。此外,结果表明,尽管感知运动的改善在不同语言和相关感知运动维度之间具有较强的普适性,但它们对学习目标高度敏感,未能在两种不同的任务格式之间转移。
cs.CL / 24 / 2603.03314
Towards Self-Robust LLMs: Intrinsic Prompt Noise Resistance via CoIPO
迈向自我鲁棒的语言模型:通过对比学习的内在提示噪声抵抗
Abstract
Large language models (LLMs) have demonstrated remarkable and steadily improving performance across a wide range of tasks. However, LLM performance may be highly sensitive to prompt variations, especially in scenarios with limited openness or strict output formatting requirements, indicating insufficient robustness. In real-world applications, user prompts provided to LLMs often contain imperfections, which may undermine the quality of the model's responses. To address this issue, previous work has primarily focused on preprocessing prompts, employing external tools or even LLMs to refine prompt formulations in advance. However, these approaches overlook the intrinsic robustness of LLMs, and their reliance on external components introduces additional computational overhead and uncertainty. In this work, we propose a Contrastive Learning-based Inverse Direct Preference Optimization (CoIPO) method that minimizes the discrepancy between the label-aligned logits produced by the model under a clean prompt and its noisy counterpart, and conduct a detailed analysis using mutual information theory. We augment the FLAN dataset by constructing paired prompts, each consisting of a clean prompt and its corresponding noisy version for training. Additionally, to evaluate the effectiveness, we develop NoisyPromptBench, a benchmark enhanced and derived from the existing PromptBench. Experimental results conducted on NoisyPromptBench demonstrate that our proposed method achieves a significant improvement in average accuracy over the current state-of-the-art approaches. The source code of CoIPO, pair-wise FLAN datasets, and NoisyPromptBench have been released at https://github.com/vegetable-yx/CoIPO.
Chinese Translation
大型语言模型(LLMs)在各种任务中展现了显著且持续改善的性能。然而,LLM的性能可能对提示的变化高度敏感,特别是在开放性有限或输出格式要求严格的场景中,这表明其鲁棒性不足。在实际应用中,用户提供给LLM的提示往往存在缺陷,这可能会削弱模型响应的质量。为了解决这一问题,之前的研究主要集中在预处理提示,使用外部工具甚至LLM提前优化提示的表述。然而,这些方法忽视了LLM的内在鲁棒性,并且对外部组件的依赖引入了额外的计算开销和不确定性。在本研究中,我们提出了一种基于对比学习的逆直接偏好优化(CoIPO)方法,该方法最小化模型在干净提示下生成的标签对齐logits与其噪声版本之间的差异,并利用互信息理论进行详细分析。我们通过构建成对提示来增强FLAN数据集,每对提示由一个干净提示及其对应的噪声版本组成,以用于训练。此外,为了评估有效性,我们开发了NoisyPromptBench,这是一个增强并衍生自现有PromptBench的基准。针对NoisyPromptBench的实验结果表明,我们提出的方法在平均准确率上显著优于当前最先进的方法。CoIPO的源代码、成对FLAN数据集和NoisyPromptBench已在https://github.com/vegetable-yx/CoIPO上发布。
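The core idea — penalizing the discrepancy between label-aligned output distributions under a clean prompt and its noisy counterpart — can be sketched with a simple consistency term. The symmetric KL divergence below is an illustrative stand-in, not the exact CoIPO objective:

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]

def consistency_loss(clean_logits, noisy_logits):
    """Illustrative consistency term in the spirit of CoIPO: penalize
    the gap between the output distributions produced under a clean
    prompt and its noisy counterpart (symmetric KL divergence)."""
    lp = log_softmax(clean_logits)
    lq = log_softmax(noisy_logits)
    p = [math.exp(v) for v in lp]
    q = [math.exp(v) for v in lq]
    kl_pq = sum(pi * (lpi - lqi) for pi, lpi, lqi in zip(p, lp, lq))
    kl_qp = sum(qi * (lqi - lpi) for qi, lqi, lpi in zip(q, lq, lp))
    return 0.5 * (kl_pq + kl_qp)

print(consistency_loss([2.0, 0.1], [2.0, 0.1]))  # identical answers -> 0.0
print(consistency_loss([2.0, 0.1], [0.1, 2.0]))  # noise flipped the answer -> large
```

Minimizing such a term pulls the noisy-prompt logits toward the clean-prompt ones, which is the mechanism by which robustness becomes intrinsic rather than delegated to a preprocessing step.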
cs.CL / 25 / 2603.03315
M-QUEST -- Meme Question-Understanding Evaluation on Semantics and Toxicity
M-QUEST -- 表情包语义与毒性理解评估
Abstract
Internet memes are a powerful form of online communication, yet their multimodal nature and reliance on commonsense knowledge make toxicity detection challenging. Identifying the key features for meme interpretation and understanding is a crucial task. Previous work has focused on individual elements contributing to meaning, such as the Textual dimension via OCR, the Visual dimension via object recognition, upper layers of meaning such as the Emotional dimension, and Toxicity detection via proxy variables such as hate speech detection and sentiment analysis. Nevertheless, there is still no overall architecture able to formally identify the elements contributing to the meaning of a meme and support the sense-making process. In this work, we present a semantic framework and a corresponding benchmark for automatic knowledge extraction from memes. First, we identify the necessary dimensions to understand and interpret a meme: Textual material, Visual material, Scene, Background Knowledge, Emotion, Semiotic Projection, Analogical Mapping, Overall Intent, Target Community, and Toxicity Assessment. Second, the framework guides a semi-automatic process of generating a benchmark with commonsense question-answer pairs about meme toxicity assessment and its underlying reason. The resulting benchmark, M-QUEST, consists of 609 question-answer pairs for 307 memes. Third, we evaluate eight open-source large language models on their ability to correctly solve M-QUEST. Our results show that current models' commonsense reasoning capabilities for toxic meme interpretation vary depending on the dimension and architecture. Models with instruction tuning and reasoning capabilities significantly outperform the others, though pragmatic inference questions remain challenging. We release code, benchmark, and prompts to support future research intersecting multimodal content safety and commonsense reasoning.
Chinese Translation
互联网表情包是一种强大的在线交流形式,但其特性和对常识知识的依赖使得毒性检测变得具有挑战性。识别表情包解读和理解的关键特征是一项重要任务。之前的研究主要集中在一些影响意义的元素上,例如通过光学字符识别(OCR)获取的文本维度、通过物体识别获得的视觉维度、情感维度等更高层次的意义、以及通过代理变量(如仇恨言论检测和情感分析)进行的毒性检测。然而,目前仍缺乏一个能够正式识别影响表情包意义的元素并用于意义构建过程的整体架构。在本研究中,我们提出了一个语义框架及相应的基准,以实现对表情包的自动知识提取。首先,我们识别了理解和解读表情包所需的维度:文本材料、视觉材料、场景、背景知识、情感、符号投射、类比映射、整体意图、目标社区和毒性评估。其次,该框架指导了一个半自动化的过程,以生成关于表情包毒性评估及其潜在原因的常识问答对基准。最终生成的基准M-QUEST包含307个表情包的609个问答对。第三,我们评估了八个开源大型语言模型在正确解决M-QUEST方面的能力。我们的结果表明,当前模型在毒性表情包解读中的常识推理能力因维度和架构而异。具备指令调优和推理能力的模型显著优于其他模型,尽管实用推理问题仍然具有挑战性。我们发布了代码、基准和提示,以支持未来在多模态内容安全和常识推理交叉领域的研究。
cs.CL / 26 / 2603.03316
The Influence of Iconicity in Transfer Learning for Sign Language Recognition
图标性在手语识别迁移学习中的影响
Abstract
Most sign language recognition research relies on Transfer Learning (TL) from vision-based datasets such as ImageNet. Some studies extend this to other available sign language datasets, often focusing on signs with cross-linguistic similarities. This work examines whether such likenesses are necessary for effective knowledge transfer by comparing TL performance between iconic signs of two different sign language pairs: Chinese to Arabic and Greek to Flemish. Google Mediapipe was utilised as an input feature extractor, enabling the spatial information of these signs to be processed with a Multilayer Perceptron architecture and the temporal information with a Gated Recurrent Unit. Experimental results showed a 7.02% improvement for Arabic and 1.07% for Flemish when conducting iconic TL from Chinese and Greek, respectively.
Chinese Translation
大多数手语识别研究依赖于从基于视觉的数据集(如 ImageNet)进行迁移学习(Transfer Learning, TL)。一些研究将其扩展到其他可用的语言数据集,通常关注于具有跨语言相似性的手势。本研究考察了这些相似性在有效知识转移中的必要性,通过比较两对不同手语(中文与阿拉伯语、希腊语与弗拉芒语)之间的图标性手势的迁移学习表现。我们使用 Google Mediapipe 作为输入特征提取器,使得这些手势的空间信息能够通过多层感知器架构进行处理,而时间信息则通过门控循环单元(Gated Recurrent Unit)进行处理。实验结果显示,从中文和希腊语进行图标性迁移学习时,阿拉伯语的表现提高了 7.02%,弗拉芒语的表现提高了 1.07%。
cs.CL / 27 / 2603.03317
Retcon -- a Prompt-Based Technique for Precise Control of LLMs in Conversations
Retcon -- 一种基于提示的技术,用于在对话中精确控制大型语言模型
Abstract
Recent advances in Large Language Models (LLMs) allow agents to execute complex natural language tasks. Many LLM applications, such as support agents, teaching assistants, and interactive bots, involve multi-turn conversations. However, it remains challenging to control LLMs in the context of such interactions, particularly when the LLM behavior needs to be adjustable over the course of the conversation. In this paper, we present Retcon, a few-shot prompting technique designed to provide turn-level control over LLMs in conversations. We then demonstrate that it performs significantly better than zero-shot and traditional few-shot prompting.
Chinese Translation
大型语言模型(LLMs)的最新进展使得代理能够执行复杂的自然语言任务。许多LLM应用程序,如支持代理、教学助手和互动机器人,涉及多轮对话。然而,在此类交互的背景下控制LLM仍然具有挑战性,特别是当LLM的行为需要在对话过程中进行调整时。本文提出了Retcon,这是一种旨在对话中提供轮次级LLM控制的少样本提示技术。我们随后证明,Retcon的表现显著优于零样本和传统的少样本提示方法。
cs.CL / 28 / 2603.03318
Quantum-Inspired Self-Attention in a Large Language Model
量子启发的自注意力机制在大型语言模型中的应用
Abstract
Recent advances in Natural Language Processing have been predominantly driven by transformer-based architectures, which rely heavily on self-attention mechanisms to model relationships between tokens in a sequence. Similarly, the field of Quantum Natural Language Processing, which seeks to leverage quantum principles to address challenges in language understanding and generation tasks, has seen the recent development of quantum self-attention mechanisms. We propose a classical quantum-inspired self-attention (QISA) mechanism and integrate it into the full autoregressive language modeling pipeline of GPT-1. To the best of our knowledge, this is the first integration of this kind, as previous quantum self-attention mechanisms have been primarily tested on text classification. In our experiments, QISA achieves better performance when compared to standard self-attention on the metrics character error rate ($15.5\times$ better), word error rate ($4.7\times$) and cross-entropy loss ($13\times$). This is achieved while only requiring a $2.6\times$ longer inference time.
Chinese Translation
近年来,自然语言处理领域的进展主要得益于基于变换器的架构,这些架构在建模序列中标记之间的关系时,极大依赖于自注意力机制。同样,量子自然语言处理领域也在寻求利用量子原理来解决语言理解和生成任务中的挑战,最近发展了量子自注意力机制。我们提出了一种经典的量子启发自注意力机制(Quantum-Inspired Self-Attention,QISA),并将其整合到GPT-1的完整自回归语言建模流程中。根据我们所知,这是首次进行此类整合,因为之前的量子自注意力机制主要是在文本分类任务上进行测试。在我们的实验中,QISA在字符错误率(提高了$15.5\times$)、词错误率(提高了$4.7\times$)和交叉熵损失(提高了$13\times$)等指标上表现优于标准自注意力机制。这是在仅需$2.6\times$更长推理时间的情况下实现的。
cs.CL / 29 / 2603.03319
Automated Concept Discovery for LLM-as-a-Judge Preference Analysis
用于 LLM 作为评判者偏好分析的自动化概念发现
Abstract
Large Language Models (LLMs) are increasingly used as scalable evaluators of model outputs, but their preference judgments exhibit systematic biases and can diverge from human evaluations. Prior work on LLM-as-a-judge has largely focused on a small, predefined set of hypothesized biases, leaving open the problem of automatically discovering unknown drivers of LLM preferences. We address this gap by studying several embedding-level concept extraction methods for analyzing LLM judge behavior. We compare these methods in terms of interpretability and predictiveness, finding that sparse autoencoder-based approaches recover substantially more interpretable preference features than alternatives while remaining competitive in predicting LLM decisions. Using over 27k paired responses from multiple human preference datasets and judgments from three LLMs, we analyze LLM judgments and compare them to those of human annotators. Our method both validates existing results, such as the tendency for LLMs to prefer refusal of sensitive requests at higher rates than humans, and uncovers new trends across both general and domain-specific datasets, including biases toward responses that emphasize concreteness and empathy in approaching new situations, toward detail and formality in academic advice, and against legal guidance that promotes active steps like calling police and filing lawsuits. Our results show that automated concept discovery enables systematic analysis of LLM judge preferences without predefined bias taxonomies.
Chinese Translation
大型语言模型(LLMs)越来越多地被用作可扩展的模型输出评估者,但它们的偏好判断表现出系统性偏差,并可能与人类评估结果存在差异。之前关于 LLM 作为评判者的研究主要集中在一小部分预定义的假设偏差上,尚未解决自动发现 LLM 偏好的未知驱动因素的问题。我们通过研究几种嵌入级概念提取方法来分析 LLM 评判行为,从而填补这一空白。我们在可解释性和预测性方面比较了这些方法,发现基于稀疏自编码器的方法在恢复可解释的偏好特征方面显著优于其他方法,同时在预测 LLM 决策方面保持竞争力。利用来自多个人工偏好数据集的超过 27,000 对响应及三种 LLM 的判断,我们分析了 LLM 的判断,并将其与人类标注者的判断进行了比较。我们的方法不仅验证了现有结果,例如 LLM 在拒绝敏感请求时的倾向高于人类,而且还揭示了在一般和特定领域数据集中的新趋势,包括对强调具体性和同理心的回应的偏见,对学术建议中的细节和正式性的偏见,以及对促进采取积极措施(如报警和提起诉讼)的法律指导的偏见。我们的结果表明,自动化概念发现能够在没有预定义偏见分类法的情况下,对 LLM 评判者的偏好进行系统分析。
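The sparse autoencoder approach that the paper finds most interpretable can be sketched in a few lines: encode an embedding into a (hopefully sparse) concept-activation vector, reconstruct, and train against reconstruction error plus an L1 sparsity penalty. The dimensions, weights, and penalty coefficient below are arbitrary toy choices, not the paper's configuration:

```python
import random

def relu(x):
    return [max(0.0, v) for v in x]

def sae_forward(x, W_enc, b_enc, W_dec):
    """One forward pass of a toy sparse autoencoder: embedding -> sparse
    concept activations h -> reconstruction x_hat. Returns h and the
    training objective (MSE reconstruction + 0.01 * L1 sparsity)."""
    h = relu([sum(xi * W_enc[i][j] for i, xi in enumerate(x)) + b_enc[j]
              for j in range(len(b_enc))])
    x_hat = [sum(hj * W_dec[j][i] for j, hj in enumerate(h))
             for i in range(len(x))]
    l1 = sum(h)                                    # sparsity penalty
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return h, mse + 0.01 * l1

random.seed(0)
d, n_concepts = 4, 8                               # toy sizes
W_enc = [[random.gauss(0, 0.5) for _ in range(n_concepts)] for _ in range(d)]
b_enc = [-0.1] * n_concepts                        # negative bias encourages sparsity
W_dec = [[random.gauss(0, 0.5) for _ in range(d)] for _ in range(n_concepts)]
h, loss = sae_forward([0.3, -0.2, 0.9, 0.1], W_enc, b_enc, W_dec)
print(sum(1 for v in h if v > 0), "active concepts out of", n_concepts)
```

After training, the few concepts that activate for "judge prefers response A" examples are the interpretable preference features the analysis inspects.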
cs.CL / 30 / 2603.03320
From We to Me: Theory Informed Narrative Shift with Abductive Reasoning
从我们到我:基于理论的叙事转变与溯因推理
Abstract
Effective communication often relies on aligning a message with an audience's narrative and worldview. Narrative shift involves transforming text to reflect a different narrative framework while preserving its original core message--a task we demonstrate is significantly challenging for current Large Language Models (LLMs). To address this, we propose a neurosymbolic approach grounded in social science theory and abductive reasoning. Our method automatically extracts rules to abduce the specific story elements needed to guide an LLM through a consistent and targeted narrative transformation. Across multiple LLMs, abduction-guided transformed stories shifted the narrative while maintaining fidelity to the original story. For example, with GPT-4o we outperform the zero-shot LLM baseline by 55.88% for collectivistic-to-individualistic narrative shift while maintaining superior semantic similarity with the original stories (40.4% improvement in KL divergence). For individualistic-to-collectivistic transformation, we achieve comparable improvements. We show similar performance in both directions for Llama-4 and Grok-4, and competitive performance for Deepseek-R1.
Chinese Translation
有效的沟通往往依赖于将信息与受众的叙事和世界观对齐。叙事转变涉及将文本转化为反映不同叙事框架的形式,同时保留其原始核心信息——这一任务我们证明对于当前的大型语言模型(LLMs)而言是相当具有挑战性的。为了解决这一问题,我们提出了一种基于社会科学理论和溯因推理的神经符号方法。我们的方法自动提取规则,以推导出引导LLM进行一致且有针对性的叙事转变所需的特定故事元素。在多个LLM中,基于溯因推理的转变故事在保持与原始故事的一致性的同时实现了叙事的转变。例如,在使用GPT-4o时,我们在集体主义到个人主义的叙事转变中超越了零样本LLM基线55.88%,同时保持了与原始故事的优越语义相似性(KL散度改善40.4%)。在个人主义到集体主义的转变中,我们也取得了可比的改善。我们在Llama-4和Grok-4的两个方向上展现了类似的表现,并在Deepseek-R1上表现出竞争力。
cs.CL / 31 / 2603.03321
DIALEVAL: Automated Type-Theoretic Evaluation of LLM Instruction Following
DIALEVAL:大型语言模型指令遵循的自动化类型理论评估
Abstract
Evaluating instruction following in Large Language Models requires decomposing instructions into verifiable requirements and assessing satisfaction--tasks currently dependent on manual annotation and uniform criteria that do not align with human judgment patterns. We present DIALEVAL, a type-theoretic framework using dual LLM agents to automate instruction decomposition into typed predicates and implement type-specific satisfaction semantics. The framework enforces formal atomicity and independence constraints during automated extraction, then applies differentiated evaluation criteria--semantic equivalence for content predicates, exact precision for numerical predicates--mirroring empirically observed human assessment patterns. Extended to multi-turn dialogues through history-aware satisfaction functions, DIALEVAL enables evaluation in conversational contexts where single-turn methods fail. Validation demonstrates 90.38% accuracy (26.45% error reduction over baselines) and substantially stronger correlation with human judgment for complex instructions.
Chinese Translation
评估大型语言模型中的指令遵循需要将指令分解为可验证的要求并评估其满足情况——这些任务目前依赖于手动标注和与人类判断模式不一致的统一标准。我们提出了DIALEVAL,一个类型理论框架,使用双重大型语言模型(LLM)代理自动将指令分解为类型谓词,并实现类型特定的满足语义。该框架在自动提取过程中强制执行形式原子性和独立性约束,然后应用差异化评估标准——内容谓词的语义等价性,数值谓词的精确度——反映了实证观察到的人类评估模式。通过历史感知满足函数扩展到多轮对话,DIALEVAL能够在单轮方法失效的对话上下文中进行评估。验证结果表明,准确率达到90.38%(相较基线减少26.45%的错误率),并且对于复杂指令与人类判断的相关性显著增强。
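The type-specific satisfaction semantics — exact precision for numerical predicates, semantic equivalence for content predicates — can be sketched with a small typed-predicate structure. The type names and the word-overlap check below are crude hypothetical stand-ins for DIALEVAL's actual typed predicates and semantic-equivalence test:

```python
from dataclasses import dataclass

@dataclass
class Predicate:
    """A typed, atomic requirement decomposed from an instruction."""
    kind: str       # "numerical" or "content" (hypothetical type names)
    target: object

def satisfied(pred, observed):
    """Type-specific satisfaction: exact match for numerical predicates,
    looser matching for content predicates (Jaccard word overlap is a
    crude stand-in for a real semantic-equivalence check)."""
    if pred.kind == "numerical":
        return observed == pred.target
    a = set(str(pred.target).lower().split())
    b = set(str(observed).lower().split())
    return len(a & b) / max(1, len(a | b)) >= 0.5

# Decomposed instruction: "write exactly 3 bullet points about solar energy"
preds = [Predicate("numerical", 3), Predicate("content", "about solar energy")]
print(satisfied(preds[0], 3))                              # True: exact count
print(satisfied(preds[1], "details about solar energy"))   # True: semantic overlap
```

The differentiated criteria mirror the human pattern the paper reports: annotators forgive paraphrase in content requirements but not deviation in counts or formats.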
cs.CL / 32 / 2603.03322
Can Large Language Models Derive New Knowledge? A Dynamic Benchmark for Biological Knowledge Discovery
大型语言模型能否推导新知识?生物知识发现的动态基准
Abstract
Recent advancements in Large Language Model (LLM) agents have demonstrated remarkable potential in automatic knowledge discovery. However, rigorously evaluating an AI's capacity for knowledge discovery remains a critical challenge. Existing benchmarks predominantly rely on static datasets, leading to inevitable data contamination where models have likely seen the evaluation knowledge during training. Furthermore, the rapid release cycles of modern LLMs render static benchmarks quickly outdated, failing to assess the ability to discover truly new knowledge. To address these limitations, we propose DBench-Bio, a dynamic and fully automated benchmark designed to evaluate AI's biological knowledge discovery ability. DBench-Bio employs a three-stage pipeline: (1) data acquisition of rigorous, authoritative paper abstracts; (2) QA extraction utilizing LLMs to synthesize scientific hypothesis questions and corresponding discovery answers; and (3) QA filter to ensure quality based on relevance, clarity, and centrality. We instantiate this pipeline to construct a monthly-updated benchmark covering 12 biomedical sub-domains. Extensive evaluations of SOTA models reveal current limitations in discovering new knowledge. Our work provides the first dynamic, automatic framework for assessing the new knowledge discovery capabilities of AI systems, establishing a living, evolving resource for AI research community to catalyze the development of knowledge discovery.
Chinese Translation
近期大型语言模型(LLM)代理的进展显示出其在自动知识发现方面的显著潜力。然而,严格评估人工智能的知识发现能力仍然是一个关键挑战。现有基准主要依赖静态数据集,导致不可避免的数据污染,因为模型在训练过程中可能已经见过评估知识。此外,现代LLM的快速发布周期使得静态基准迅速过时,无法评估发现真正新知识的能力。为了解决这些局限性,我们提出了DBench-Bio,一个动态且完全自动化的基准,旨在评估人工智能的生物知识发现能力。DBench-Bio采用三阶段流程:(1)获取严谨、权威的论文摘要数据;(2)利用LLM提取问答(QA),合成科学假设问题及相应的发现答案;(3)进行QA过滤,以确保基于相关性、清晰度和中心性的质量。我们实例化这一流程,构建了一个涵盖12个生物医学子领域的每月更新基准。对当前最先进(SOTA)模型的广泛评估揭示了在发现新知识方面的现有限制。我们的工作提供了第一个动态、自动化的框架,用于评估人工智能系统的新知识发现能力,为人工智能研究社区建立了一个不断发展、演变的资源,以促进知识发现的发展。
cs.CL / 33 / 2603.03323
Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement
辨别真伪:通过对比精炼减少过度拒绝
Abstract
Large language models (LLMs) aligned for safety often suffer from over-refusal, the tendency to reject seemingly toxic yet benign prompts by misclassifying them as toxic. This behavior undermines models' helpfulness and restricts usability in sensitive or nuanced contexts. While prior work has proposed mitigation strategies such as data augmentation and activation steering, these approaches often face a trade-off: reducing over-refusal typically degrades the model's ability to reject genuinely harmful content. We argue that this issue arises from the ambiguous influence of toxic and seemingly toxic prompts on the model's learning dynamics. To address it, we introduce a preceding alignment stage, DCR: Discernment via Contrastive Refinement. Both theoretically and empirically, we demonstrate that contrastive refinement improves an LLM's capacity to distinguish truly toxic prompts from superficially toxic ones. Evaluation across diverse benchmarks shows that our method effectively reduces over-refusal while preserving the safety benefits of alignment. Importantly, it achieves this with minimal degradation of general capabilities, offering a more principled and robust direction for safety alignment.
Chinese Translation
大型语言模型(LLMs)在安全性对齐方面常常面临过度拒绝的问题,即倾向于将看似有毒但实际无害的提示错误地分类为有毒。这种行为削弱了模型的实用性,并限制了其在敏感或细微上下文中的使用。虽然之前的研究提出了数据增强和激活引导等缓解策略,但这些方法往往面临权衡:减少过度拒绝通常会降低模型拒绝真正有害内容的能力。我们认为,这一问题源于有毒和看似有毒提示对模型学习动态的模糊影响。为了解决这一问题,我们引入了一个前置对齐阶段,DCR:通过对比精炼进行辨别。我们在理论和实证上都证明了对比精炼提高了LLM区分真正有毒提示与表面上有毒提示的能力。在多样化基准测试中的评估表明,我们的方法有效减少了过度拒绝,同时保留了对齐带来的安全性好处。重要的是,它在最小化一般能力下降的情况下实现了这一目标,为安全对齐提供了更为原则性和稳健的方向。
cs.CL / 34 / 2603.03324
Controlling Chat Style in Language Models via Single-Direction Editing
通过单向编辑控制语言模型中的聊天风格
Abstract
Controlling stylistic attributes in large language models (LLMs) remains challenging, with existing approaches relying on either prompt engineering or post-training alignment. This paper investigates this challenge through the lens of representation engineering, testing the hypothesis that distinct stylistic attributes - from emotional tone to linguistic structure - are encoded as linear directions in the model's activation space. We provide strong empirical evidence for this hypothesis across a wide range of styles and, based on this finding, present a lightweight, training-free method for precise style control. Our approach supports linear style composition, enhances safety by ablating undesirable behaviors, and, as confirmed by experiments on over a dozen models, achieves high style adherence while preserving core capabilities at minimal computational cost.
Chinese Translation
在大型语言模型(LLMs)中控制风格属性仍然具有挑战性,现有方法依赖于提示工程或训练后对齐。本文通过表征工程的视角探讨这一挑战,检验了不同风格属性——从情感语调到语言结构——在模型激活空间中作为线性方向编码的假设。我们在广泛的风格中提供了强有力的实证证据支持这一假设,并基于这一发现提出了一种轻量级、无训练的方法以实现精确的风格控制。我们的方法支持线性风格组合,通过消除不良行为增强安全性,并且通过对十多种模型的实验验证,达到了高风格一致性,同时在最小计算成本下保留核心能力。
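If a stylistic attribute really is a linear direction in activation space, the standard single-direction edit is to project the direction out of a hidden state and re-add it with a chosen strength. A minimal sketch; the "emotional tone" direction here is a hypothetical placeholder, since real directions are learned from contrastive activations:

```python
import math

def edit_along_direction(h, d, alpha):
    """Single-direction edit of a hidden state:
    h' = h - (h . d_hat) d_hat + alpha * d_hat.
    alpha = 0 ablates the style component; larger alpha amplifies it."""
    norm = math.sqrt(sum(x * x for x in d))
    dh = [x / norm for x in d]                       # unit style direction
    proj = sum(hi * di for hi, di in zip(h, dh))     # current style strength
    return [hi - proj * di + alpha * di for hi, di in zip(h, dh)]

h = [2.0, 1.0, -1.0]
style_dir = [1.0, 0.0, 0.0]   # hypothetical "emotional tone" direction
print(edit_along_direction(h, style_dir, 0.0))   # component removed: [0.0, 1.0, -1.0]
print(edit_along_direction(h, style_dir, 5.0))   # component set to 5: [5.0, 1.0, -1.0]
```

Setting alpha to zero is exactly the safety-oriented ablation the abstract mentions, while intermediate values give the continuous style dial.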
cs.CL / 35 / 2603.03325
IntPro: A Proxy Agent for Context-Aware Intent Understanding via Retrieval-conditioned Inference
IntPro:一种通过检索条件推理实现上下文感知意图理解的代理智能体
Abstract
Large language models (LLMs) have become integral to modern Human-AI collaboration workflows, where accurately understanding user intent serves as a crucial step for generating satisfactory responses. Context-aware intent understanding, which involves inferring user intentions from situational environments, is inherently challenging because it requires reasoning over both the immediate context and the user's underlying motivations that drive their behavior. Moreover, existing approaches often treat intent understanding as a static recognition task, overlooking users' accumulated intent patterns that could provide valuable references for more accurate and generalizable understanding. To address this gap, we propose IntPro, a proxy agent that learns to adapt to individual users via retrieval-conditioned intent inference. We design intent explanations that abstract how contextual signals connect to expressed intents, and store them in an individual intent history library for retrieval. We train IntPro through supervised fine-tuning on retrieval-conditioned trajectories and multi-turn Group Relative Policy Optimization (GRPO) with tool-aware reward functions, enabling the agent to learn when to leverage historical intent patterns and when to infer directly. Experiments across three diverse scenarios (Highlight-Intent, MIntRec2.0, and Weibo Post-Sync) demonstrate that IntPro achieves strong intent understanding performance with effective context-aware reasoning capabilities across different scenarios and model types.
Chinese Translation
大型语言模型(LLMs)已成为现代人机协作工作流程中不可或缺的一部分,其中准确理解用户意图是生成满意响应的关键步骤。上下文感知的意图理解涉及从情境环境中推断用户意图,这本质上是具有挑战性的,因为它需要对即时上下文和驱动用户行为的潜在动机进行推理。此外,现有方法通常将意图理解视为静态识别任务,忽视了用户积累的意图模式,这些模式可以为更准确和更具普遍性的理解提供有价值的参考。为了解决这一问题,我们提出了IntPro,一种通过检索条件意图推理学习适应个体用户的代理智能体。我们设计了意图解释,抽象出上下文信号与表达意图之间的联系,并将其存储在个体意图历史库中以便检索。我们通过在检索条件轨迹上进行监督微调和使用工具感知奖励函数的多轮组相对策略优化(GRPO)来训练IntPro,使得该智能体能够学习何时利用历史意图模式以及何时直接推断。在三个不同场景(Highlight-Intent、MIntRec2.0和微博发布同步)中的实验表明,IntPro在不同场景和模型类型中实现了强大的意图理解性能,并具备有效的上下文感知推理能力。
cs.CL / 36 / 2603.03326
Controllable and explainable personality sliders for LLMs at inference time
推理时可控且可解释的个性滑块用于大型语言模型
Abstract
Aligning Large Language Models (LLMs) with specific personas typically relies on expensive and monolithic Supervised Fine-Tuning (SFT) or RLHF. While effective, these methods require training distinct models for every target personality profile. Inference-time activation steering offers a parameter-efficient alternative, yet naive approaches fail to control multiple traits simultaneously due to destructive vector interference. In this work, we propose a modular framework for continuous, multi-dimensional personality control. Our key innovation is Sequential Adaptive Steering (SAS): a method that orthogonalizes steering vectors by training subsequent probes on the residual stream shifted by prior interventions. This approach transforms steering vectors into reusable primitives, allowing users to instantly synthesize complex, high-fidelity personality profiles by simply adjusting coefficients $\alpha$. We validate our framework on the Big Five personality traits, demonstrating that it outperforms naive baselines in both goal adherence and coherence, enabling precise, holistic personality modulation without updating model parameters.
Chinese Translation
将大型语言模型(LLMs)与特定个性对齐通常依赖于昂贵且单一的监督微调(Supervised Fine-Tuning, SFT)或基于强化学习的人类反馈(RLHF)。虽然这些方法有效,但需要为每个目标个性特征训练不同的模型。推理时的激活引导提供了一种参数高效的替代方案,但简单的方法由于破坏性向量干扰,无法同时控制多个特征。在本研究中,我们提出了一种用于连续多维个性控制的模块化框架。我们的关键创新是顺序自适应引导(Sequential Adaptive Steering, SAS):一种通过在先前干预所偏移的残差流上训练后续探针来正交化引导向量的方法。这种方法将引导向量转化为可重用的原语,使用户能够通过简单调整系数 alpha 即时合成复杂且高保真的个性特征。我们在五大人格特质上验证了我们的框架,结果表明它在目标遵循性和一致性方面均优于简单基线,实现了精确的整体个性调节,而无需更新模型参数。
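The destructive interference problem and its fix can be shown in miniature: two raw trait directions that overlap are made orthogonal, after which their steering coefficients act independently. Explicit Gram-Schmidt is used here as a simplified stand-in for SAS, which instead obtains orthogonality by training each subsequent probe on the residual stream already shifted by prior interventions; the trait vectors are hypothetical:

```python
import math

def _sub_proj(v, u):
    """Remove from v its projection onto u."""
    c = sum(a * b for a, b in zip(v, u)) / sum(x * x for x in u)
    return [a - c * b for a, b in zip(v, u)]

def orthogonalize(vectors):
    """Gram-Schmidt pass over steering vectors (simplified SAS stand-in)."""
    basis = []
    for v in vectors:
        for u in basis:
            v = _sub_proj(v, u)
        basis.append(v)
    return basis

def steer(h, basis, alphas):
    """Compose traits by adding scaled, normalized orthogonal directions."""
    for u, a in zip(basis, alphas):
        n = math.sqrt(sum(x * x for x in u))
        h = [hi + a * ui / n for hi, ui in zip(h, u)]
    return h

raw = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]   # hypothetical traits that interfere
basis = orthogonalize(raw)
print(basis[1])                             # overlap removed: [0.0, 1.0, 0.0]
print(steer([0.0, 0.0, 0.0], basis, [2.0, 3.0]))   # -> [2.0, 3.0, 0.0]
```

Because the basis is orthogonal, changing one trait's coefficient no longer drags the other trait along, which is what makes the vectors reusable primitives.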
cs.CL / 37 / 2603.03327
A benchmark for joint dialogue satisfaction, emotion recognition, and emotion state transition prediction
联合对话满意度、情感识别和情感状态转移预测的基准
Abstract
User satisfaction is of direct concern to enterprises, as it not only reflects users' subjective evaluation of service quality or products, but also affects customer loyalty and long-term business revenue. Monitoring and understanding user emotions during interactions helps predict and improve satisfaction. However, relevant Chinese datasets are limited, and user emotions are dynamic; methods that rely on single-turn dialogue cannot fully track emotional changes across multiple turns, which may affect satisfaction prediction. To address this, we constructed a multi-task, multi-label Chinese dialogue dataset that supports satisfaction recognition, as well as emotion recognition and emotional state transition prediction, providing new resources for studying emotion and satisfaction in dialogue systems.
Chinese Translation
用户满意度与企业密切相关,因为它不仅直接反映了用户对服务质量或产品的主观评价,还影响客户忠诚度和长期商业收入。在互动过程中监测和理解用户情感有助于预测和改善满意度。然而,相关的中文数据集有限,用户情感是动态的;依赖单轮对话无法完全跟踪多轮对话中的情感变化,这可能影响满意度预测。为了解决这个问题,我们构建了一个多任务、多标签的中文对话数据集,支持满意度识别、情感识别和情感状态转移预测,为研究对话系统中的情感和满意度提供了新的资源。
cs.CL / 38 / 2603.03328
StructLens: A Structural Lens for Language Models via Maximum Spanning Trees
StructLens:通过最大生成树为语言模型提供结构视角
Abstract
Language exhibits inherent structures, a property that explains both language acquisition and language change. Given this characteristic, we expect language models to manifest internal structures as well. While interpretability research has investigated the components of language models, existing approaches focus on local inter-token relationships within layers or modules (e.g., Multi-Head Attention), leaving global inter-layer relationships largely overlooked. To address this gap, we introduce StructLens, an analytical framework designed to reveal how internal structures relate holistically through their inter-token connection within a layer. StructLens constructs maximum spanning trees based on the semantic representations in residual streams, analogous to dependency parsing, and leverages the tree properties to quantify inter-layer distance (or similarity) from a structural perspective. Our findings demonstrate that StructLens yields an inter-layer similarity pattern that is distinctively different from conventional cosine similarity. Moreover, this structure-aware similarity proves to be beneficial for practical tasks, such as layer pruning, highlighting the effectiveness of structural analysis for understanding and optimizing language models. Our code is available at https://github.com/naist-nlp/structlens.
Chinese Translation
语言具有固有的结构,这一特性解释了语言习得和语言变化。鉴于这一特性,我们期望语言模型也能表现出内部结构。尽管可解释性研究已经探讨了语言模型的组成部分,但现有的方法主要集中在层或模块内的局部令牌间关系(例如,多头注意力),而对全局层间关系却大多忽视。为了解决这一空白,我们提出了StructLens,一个旨在揭示内部结构如何通过层内令牌连接整体相关的分析框架。StructLens基于残差流中的语义表示构建最大生成树,类似于依赖解析,并利用树的特性从结构角度量化层间距离(或相似性)。我们的研究结果表明,StructLens产生的层间相似性模式与传统的余弦相似性显著不同。此外,这种关注结构的相似性在实际任务中(如层剪枝)证明是有益的,突显了结构分析在理解和优化语言模型中的有效性。我们的代码可在 https://github.com/naist-nlp/structlens 获取。
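The tree-construction step the abstract describes can be made concrete: build a maximum spanning tree over pairwise token similarities at each layer, then compare trees across layers by edge overlap. The cosine similarity and edge-overlap distance below are illustrative assumptions; the paper's exact edge weighting and tree metrics may differ.

```python
import numpy as np

def max_spanning_tree(sim):
    """Prim's algorithm for a maximum spanning tree over a dense
    similarity matrix. Returns the tree as a set of (i, j) edges, i < j."""
    n = sim.shape[0]
    in_tree = {0}
    edges = set()
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or sim[i, j] > best[2]):
                    best = (i, j, sim[i, j])
        i, j, _ = best
        edges.add((min(i, j), max(i, j)))
        in_tree.add(j)
    return edges

def tree_distance(edges_a, edges_b):
    """Structural distance between two spanning trees: fraction of
    edges not shared (both trees over n tokens have n-1 edges)."""
    return 1.0 - len(edges_a & edges_b) / len(edges_a)

def layer_tree(hidden):
    """hidden: (n_tokens, d) residual-stream states at one layer."""
    h = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
    sim = h @ h.T
    np.fill_diagonal(sim, -np.inf)  # no self-loops
    return max_spanning_tree(sim)

# toy example: two "layers" of 5-token representations
rng = np.random.default_rng(0)
layer1 = rng.normal(size=(5, 8))
layer2 = layer1 + 0.1 * rng.normal(size=(5, 8))  # a nearby layer
d = tree_distance(layer_tree(layer1), layer_tree(layer2))
print(f"inter-layer tree distance: {d:.2f}")
```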
cs.CL / 39 / 2603.03329
AutoHarness: improving LLM agents by automatically synthesizing a code harness
AutoHarness:通过自动合成代码工具提升大型语言模型代理的性能
Abstract
Despite significant strides in language models in the last few years, when used as agents, such models often try to perform actions that are not just suboptimal for a given state, but are strictly prohibited by the external environment. For example, in the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. Often people manually write "harnesses" around LLMs to prevent such failures. In this paper, we demonstrate that Gemini-2.5-Flash can automatically synthesize such a code harness, using a small number of rounds of iterative code refinement given feedback from the (game) environment. The resulting harness prevents all illegal moves in 145 different TextArena games (both 1-player and 2-player), enabling the smaller Gemini-2.5-Flash model to outperform larger models, such as Gemini-2.5-Pro. Pushing our technique to the limit, we can get Gemini-2.5-Flash to generate the entire policy in code, thus eliminating the need to use the LLM at decision making time. The resulting code-policy receives a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena 1-player games. Our results show that using a smaller model to synthesize a custom code harness (or entire policy) can outperform a much larger model, while also being more cost effective.
Chinese Translation
尽管近年来语言模型取得了显著进展,但当作为代理使用时,这些模型往往尝试执行一些不仅在给定状态下是次优的,而且在外部环境中是严格禁止的动作。例如,在最近的Kaggle GameArena国际象棋比赛中,78%的Gemini-2.5-Flash的失利被归因于非法走棋。人们通常手动为大型语言模型(LLMs)编写“工具”以防止此类失败。本文展示了Gemini-2.5-Flash能够自动合成这样的代码工具,利用来自(游戏)环境的反馈,通过少量轮次的迭代代码优化。生成的工具在145个不同的TextArena游戏(包括单人和双人游戏)中防止了所有非法走棋,使得较小的Gemini-2.5-Flash模型能够超越更大的模型,如Gemini-2.5-Pro。将我们的技术推向极限,我们可以让Gemini-2.5-Flash生成整个策略的代码,从而消除在决策时使用LLM的需要。生成的代码策略在16个TextArena单人游戏中获得的平均奖励高于Gemini-2.5-Pro和GPT-5.2-High。我们的结果表明,使用较小的模型合成定制的代码工具(或整个策略)可以超越更大的模型,同时也更具成本效益。
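The harness pattern the abstract describes, rejecting environment-illegal actions and feeding the error back to the proposer, can be sketched as follows. `propose_move`, the toy tic-tac-toe legality check, and the retry budget are all hypothetical stand-ins; in the paper the harness code itself is synthesized by the LLM from environment feedback.

```python
def make_harness(legal_moves_fn, max_retries=5):
    """Wrap a move proposer so every returned move is legal.

    legal_moves_fn(state) -> set of legal moves for the current state.
    """
    def harnessed(propose_move, state):
        feedback = None
        for _ in range(max_retries):
            move = propose_move(state, feedback)
            legal = legal_moves_fn(state)
            if move in legal:
                return move
            feedback = f"illegal move {move!r}; legal: {sorted(legal)}"
        return sorted(legal)[0]  # fall back to an arbitrary legal move
    return harnessed

# toy environment: state is the set of already-taken tic-tac-toe cells
legal_fn = lambda state: set(range(9)) - state
harness = make_harness(legal_fn)

def bad_proposer(state, feedback):
    # always proposes cell 4 first, even if taken; obeys feedback after
    return 4 if feedback is None else min(legal_fn(state))

print(harness(bad_proposer, {4, 0}))  # cell 4 is taken -> retries -> 1
```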
cs.CL / 40 / 2603.03330
Certainty robustness: Evaluating LLM stability under self-challenging prompts
确定性鲁棒性:评估大型语言模型在自我挑战提示下的稳定性
Abstract
Large language models (LLMs) often present answers with high apparent confidence despite lacking an explicit mechanism for reasoning about certainty or truth. While existing benchmarks primarily evaluate single-turn accuracy, truthfulness or confidence calibration, they do not capture how models behave when their responses are challenged in interactive settings. We introduce the Certainty Robustness Benchmark, a two-turn evaluation framework that measures how LLMs balance stability and adaptability under self-challenging prompts such as uncertainty ("Are you sure?") and explicit contradiction ("You are wrong!"), alongside numeric confidence elicitation. Using 200 reasoning and mathematics questions from LiveBench, we evaluate four state-of-the-art LLMs and distinguish between justified self-corrections and unjustified answer changes. Our results reveal substantial differences in interactive reliability that are not explained by baseline accuracy alone: some models abandon correct answers under conversational pressure, while others demonstrate strong resistance to challenge and better alignment between confidence and correctness. These findings identify certainty robustness as a distinct and critical dimension of LLM evaluation, with important implications for alignment, trustworthiness and real-world deployment.
Chinese Translation
大型语言模型(LLMs)常常以高度自信的方式给出答案,尽管缺乏关于确定性或真理的明确推理机制。现有基准主要评估单轮准确性、真实性或自信度校准,但未能捕捉模型在互动环境中面对挑战时的表现。我们引入了确定性鲁棒性基准,这是一个两轮评估框架,衡量LLMs在自我挑战提示(如不确定性“你确定吗?”和明确矛盾“你错了!”)下如何平衡稳定性和适应性,同时进行数值自信度引导。通过使用来自LiveBench的200个推理和数学问题,我们评估了四个最先进的LLMs,并区分合理的自我修正与不合理的答案变化。我们的结果揭示了互动可靠性方面的显著差异,这些差异不能仅通过基线准确性来解释:一些模型在对话压力下放弃正确答案,而其他模型则表现出对挑战的强大抵抗力,并在自信度与正确性之间更好地保持一致。这些发现将确定性鲁棒性识别为LLM评估的一个独特且关键的维度,对对齐、可信度和实际应用具有重要影响。
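The two-turn protocol can be made concrete as a small scoring harness: record the answer before and after a challenge prompt, then bucket each interaction. The category names below paraphrase the abstract's distinction between justified self-corrections and unjustified answer changes; they are not the benchmark's official labels.

```python
from collections import Counter

def classify_turn_pair(initial, revised, gold):
    """Classify a two-turn interaction under a challenge prompt.

    Returns one of:
      'stable-correct'    kept a correct answer under pressure
      'stable-incorrect'  kept a wrong answer
      'justified-fix'     wrong -> correct (desirable self-correction)
      'unjustified-flip'  correct -> wrong (capitulation to the challenge)
    """
    init_ok, rev_ok = initial == gold, revised == gold
    if init_ok and rev_ok:
        return "stable-correct"
    if init_ok and not rev_ok:
        return "unjustified-flip"
    if not init_ok and rev_ok:
        return "justified-fix"
    return "stable-incorrect"

def robustness_report(interactions):
    """interactions: list of (initial, revised, gold) triples."""
    counts = Counter(classify_turn_pair(*t) for t in interactions)
    flip_rate = counts["unjustified-flip"] / len(interactions)
    return counts, flip_rate

counts, flip_rate = robustness_report([
    ("42", "42", "42"),   # stable under "Are you sure?"
    ("42", "41", "42"),   # abandons a correct answer
    ("40", "42", "42"),   # justified self-correction
])
print(counts, f"flip rate: {flip_rate:.2f}")
```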
cs.CL / 41 / 2603.03331
PulseLM: A Foundation Dataset and Benchmark for PPG-Text Learning
PulseLM:用于PPG-文本学习的基础数据集和基准
Abstract
Photoplethysmography (PPG) is a widely used non-invasive sensing modality for continuous cardiovascular and physiological monitoring across clinical, laboratory, and wearable settings. While existing PPG datasets support a broad range of downstream tasks, they typically provide supervision in the form of numerical measurements or task-specific labels, limiting their suitability for language-based physiological reasoning and multimodal foundation models. In this work, we introduce PulseLM, a large-scale PPG-text dataset designed to bridge raw PPG waveforms and natural language through a unified, closed-ended question answering (QA) formulation. PulseLM aggregates PPG recordings from fifteen publicly available sources and harmonizes heterogeneous annotations into twelve common physiological QA tasks. The dataset comprises 1.31 million standardized 10-second PPG segments, associated with 3.15 million question-answer pairs. We further define reproducible preprocessing, supervision, and evaluation protocols and establish baseline benchmarks using multimodal PPG-aware large language models. PulseLM provides a standardized foundation for studying multimodal physiological reasoning, cross-dataset generalization, and scalable benchmarking of PPG-based language models. The data and code are publicly available at: https://github.com/manhph2211/PulseLM.
Chinese Translation
光电容积描记法(PPG)是一种广泛使用的非侵入式传感方式,适用于临床、实验室和可穿戴设备中的连续心血管和生理监测。尽管现有的PPG数据集支持广泛的下游任务,但它们通常以数值测量或特定任务标签的形式提供监督,这限制了它们在基于语言的生理推理和多模态基础模型中的适用性。在本研究中,我们介绍了PulseLM,一个大规模的PPG-文本数据集,旨在通过统一的封闭式问答(QA)形式将原始PPG波形与自然语言连接起来。PulseLM汇集了来自十五个公开可用来源的PPG记录,并将异构注释统一为十二个常见的生理问答任务。该数据集包含131万个标准化的10秒PPG片段,关联315万个问答对。我们进一步定义了可重复的预处理、监督和评估协议,并使用多模态PPG感知的大型语言模型建立基线基准。PulseLM为研究多模态生理推理、跨数据集泛化以及基于PPG的语言模型的可扩展基准提供了标准化基础。数据和代码可在以下链接公开获取:https://github.com/manhph2211/PulseLM。
cs.CL / 42 / 2603.03332
Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations
脆弱的思维:大型语言模型如何处理思维链扰动
Abstract
Chain-of-Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning steps remains poorly understood. This paper presents a comprehensive empirical evaluation of LLM robustness to a structured taxonomy of 5 CoT perturbation types: \textit{MathError, UnitConversion, Sycophancy, SkippedSteps,} and \textit{ExtraSteps}. We evaluate 13 models spanning three orders of magnitude in parameter count (3B to 1.5T\footnote{Assumed parameter count of closed models}), testing their ability to complete mathematical reasoning tasks despite perturbations injected at different points in the reasoning chain. Our key findings reveal heterogeneous vulnerability patterns: MathError perturbations produce the most severe degradation in small models (50-60\% accuracy loss) but show strong scaling benefits; UnitConversion remains challenging across all scales (20-30\% loss even for largest models); ExtraSteps incur minimal accuracy degradation (0-6\%) regardless of scale; Sycophancy produces modest effects (7\% loss for small models); and SkippedSteps cause intermediate damage (15\% loss). Scaling relationships follow power-law patterns, with model size serving as a protective factor against some perturbations but offering limited defense against dimensional reasoning tasks. These findings have direct implications for deploying LLMs in multi-stage reasoning pipelines and underscore the necessity of task-specific robustness assessments and mitigation strategies. The code and results are available \href{https://github.com/Mystic-Slice/CoTPerturbation}{here}.
Chinese Translation
思维链(Chain-of-Thought, CoT)提示已成为从大型语言模型(Large Language Models, LLMs)中引发推理的基础技术,但这种方法对中间推理步骤中扰动的鲁棒性仍然不够明确。本文对LLM在五种结构化的CoT扰动类型下的鲁棒性进行了全面的实证评估:\textit{MathError(数学错误)、UnitConversion(单位转换)、Sycophancy(谄媚)、SkippedSteps(跳过步骤)}和\textit{ExtraSteps(额外步骤)}。我们评估了13个模型,涵盖了从3B到1.5T(假设为封闭模型的参数数量)的三个数量级,测试它们在推理链的不同点注入扰动后完成数学推理任务的能力。我们的主要发现揭示了异质性脆弱性模式:MathError扰动在小型模型中产生最严重的性能下降(准确率损失50-60%),但显示出强大的规模效益;UnitConversion在所有规模中仍然具有挑战性(即使是最大模型也损失20-30%);ExtraSteps无论规模如何,准确率损失最小(0-6%);Sycophancy产生适度影响(小型模型损失7%);而SkippedSteps造成中等损害(损失15%)。规模关系遵循幂律模式,模型大小在某些扰动下起到保护作用,但对维度推理任务的防御有限。这些发现对在多阶段推理流程中部署LLMs具有直接影响,并强调了任务特定鲁棒性评估和缓解策略的必要性。代码和结果可在 https://github.com/Mystic-Slice/CoTPerturbation 获取。
cs.CL / 43 / 2603.03333
Training-free Dropout Sampling for Semantic Token Acceptance in Speculative Decoding
无训练的 dropout 采样用于投机解码中的语义令牌接受
Abstract
Speculative decoding accelerates large language model inference by proposing tokens with a lightweight draft model and selectively accepting them using a target model. This work introduces DropMatch, a novel approach that matches draft tokens to the predictive distribution of the target model via Monte Carlo dropout applied exclusively to the LM head, enabling sampling-based acceptance decisions. By generating multiple decoding paths, our method forms an empirical token distribution against which draft tokens are evaluated for consistency. This acceptance mechanism enables the model to adaptively control the size of decoding paths under an appropriate dropout probability, preventing substantial distortion of the target model predictive distribution. The proposed method operates in a training-free, data-free, and calibration-free manner, requires no architectural modification to pretrained models, and can be orthogonally integrated with a wide range of existing speculative decoding and inference acceleration techniques. Experiments across multiple benchmarks demonstrate that our approach increases acceptance length while maintaining competitive task performance, yielding inference speedups ranging from 1.09x to 1.33x over the standard baseline, and up to an additional 1.09x speedup when applied on top of EAGLE3.
Chinese Translation
投机解码通过使用轻量级草稿模型提出令牌,并利用目标模型选择性地接受这些令牌,从而加速大型语言模型的推理。本研究提出了 DropMatch,这是一种新颖的方法,通过对语言模型头部单独应用蒙特卡洛 dropout,将草稿令牌与目标模型的预测分布进行匹配,从而实现基于采样的接受决策。通过生成多个解码路径,我们的方法形成了一个经验令牌分布,以此评估草稿令牌的一致性。这种接受机制使模型能够在适当的 dropout 概率下自适应地控制解码路径的大小,防止对目标模型预测分布产生显著扭曲。所提出的方法以无训练、无数据和无校准的方式运行,无需对预训练模型进行架构修改,并且可以与广泛的现有投机解码和推理加速技术正交集成。在多个基准测试中的实验表明,我们的方法在保持竞争性任务性能的同时增加了接受长度,推理速度提升范围从 1.09 倍到 1.33 倍,相较于标准基线,在 EAGLE3 上应用时可额外获得高达 1.09 倍的加速。
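A minimal sketch of the acceptance mechanism, assuming a linear LM head and treating "consistency" as the draft token appearing as the argmax on at least one dropout path. The acceptance criterion and parameter values here are illustrative assumptions; the paper's exact rule may differ.

```python
import numpy as np

def mc_dropout_accept(hidden, W_head, draft_token, n_paths=16,
                      p_drop=0.1, rng=None):
    """Accept a draft token if it is consistent with an empirical
    next-token distribution obtained by Monte Carlo dropout applied
    only to the LM head (hidden @ W_head -> logits)."""
    rng = rng or np.random.default_rng(0)
    votes = np.zeros(W_head.shape[1], dtype=int)
    for _ in range(n_paths):
        mask = rng.random(hidden.shape) >= p_drop
        h = hidden * mask / (1.0 - p_drop)   # inverted-dropout scaling
        logits = h @ W_head
        votes[np.argmax(logits)] += 1        # one decoding path's vote
    empirical = votes / n_paths
    return empirical[draft_token] > 0, empirical

# toy setup: 8-dim hidden state, vocabulary of 5 tokens
rng = np.random.default_rng(1)
hidden = rng.normal(size=8)
W_head = rng.normal(size=(8, 5))
greedy = int(np.argmax(hidden @ W_head))     # stand-in for a draft token
accepted, dist = mc_dropout_accept(hidden, W_head, greedy, rng=rng)
print(f"draft token {greedy} accepted: {bool(accepted)}")
```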
cs.CL / 44 / 2603.03334
The CompMath-MCQ Dataset: Are LLMs Ready for Higher-Level Math?
CompMath-MCQ 数据集:大型语言模型是否准备好应对更高层次的数学?
Abstract
The evaluation of Large Language Models (LLMs) on mathematical reasoning has largely focused on elementary problems, competition-style questions, or formal theorem proving, leaving graduate-level and computational mathematics relatively underexplored. We introduce CompMath-MCQ, a new benchmark dataset for assessing LLMs on advanced mathematical reasoning in a multiple-choice setting. The dataset consists of 1,500 original questions authored by professors of graduate-level courses, covering topics including Linear Algebra, Numerical Optimization, Vector Calculus, Probability, and Python-based scientific computing. Three option choices are provided for each question, with exactly one of them being correct. To ensure the absence of data leakage, all questions are newly created and not sourced from existing materials. The validity of questions is verified through a procedure based on cross-LLM disagreement, followed by manual expert review. By adopting a multiple-choice format, our dataset enables objective, reproducible, and bias-free evaluation through the lm_eval library. Baseline results with state-of-the-art LLMs indicate that advanced computational mathematical reasoning remains a significant challenge. We release CompMath-MCQ at the following link: https://github.com/biancaraimondi/CompMath-MCQ.git
Chinese Translation
对大型语言模型(LLMs)在数学推理方面的评估主要集中在基础问题、竞赛风格的问题或形式化定理证明上,而研究生层次和计算数学相对较少被探讨。我们介绍了 CompMath-MCQ,这是一个用于评估 LLMs 在多项选择环境中进行高级数学推理的新基准数据集。该数据集包含由研究生课程教授原创的 1,500 道问题,涵盖线性代数、数值优化、向量微积分、概率论和基于 Python 的科学计算等主题。每个问题提供三个选项,其中恰好有一个是正确的。为了确保没有数据泄露,所有问题均为新创建,未来源于现有材料。问题的有效性通过基于跨 LLM 不一致性的程序进行验证,随后由人工专家审核。通过采用多项选择格式,我们的数据集能够通过 lm_eval 库实现客观、可重复和无偏见的评估。与最先进的 LLMs 的基线结果表明,高级计算数学推理仍然是一个重大挑战。我们在以下链接发布 CompMath-MCQ: https://github.com/biancaraimondi/CompMath-MCQ.git
cs.CL / 45 / 2603.03335
Compressed Sensing for Capability Localization in Large Language Models
大语言模型中的能力定位压缩感知
Abstract
Large language models (LLMs) exhibit a wide range of capabilities, including mathematical reasoning, code generation, and linguistic behaviors. We show that many capabilities are highly localized to small subsets of attention heads within Transformer architectures. Zeroing out as few as five task-specific heads can degrade performance by up to $65\%$ on standard benchmarks measuring the capability of interest, while largely preserving performance on unrelated tasks. We introduce a compressed sensing based method that exploits the sparsity of these heads to identify them via strategic knockouts and a small number of model evaluations. We validate these findings across Llama and Qwen models ranging from 1B to 8B parameters and a diverse set of capabilities including mathematical abilities and code generation, revealing a modular organization in which specialized capabilities are implemented by sparse, functionally distinct components. Overall, our results suggest that capability localization is a general organizational principle of Transformer language models, with implications for interpretability, model editing, and AI safety. Code is released at https://github.com/locuslab/llm-components.
Chinese Translation
大语言模型(LLMs)展现出广泛的能力,包括数学推理、代码生成和语言行为。我们表明,许多能力高度局限于Transformer架构中少量注意力头的子集。将少至五个特定任务的头置零,可以使在测量相关能力的标准基准上性能下降高达65%,而在无关任务上的性能基本保持不变。我们提出了一种基于压缩感知的方法,利用这些头的稀疏性,通过战略性淘汰和少量模型评估来识别它们。我们在参数范围从1B到8B的Llama和Qwen模型上验证了这些发现,并涵盖了包括数学能力和代码生成在内的多种能力,揭示了一个模块化的组织结构,其中专业化能力由稀疏且功能独特的组件实现。总体而言,我们的结果表明能力定位是Transformer语言模型的一种普遍组织原则,对可解释性、模型编辑和人工智能安全具有重要影响。代码已发布在 https://github.com/locuslab/llm-components。
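The recovery idea in the abstract, that a capability's degradation is sparse over heads, so a handful of randomized knockout measurements plus a sparse solver can localize the causal heads, can be simulated at toy scale. The Gaussian ablation weights and orthogonal matching pursuit (OMP) below are illustrative stand-ins for the paper's actual measurement design and solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, n_measure = 64, 40
true_effect = np.zeros(n_heads)
true_effect[[3, 17, 30]] = [1.2, 0.8, 1.5]      # three causal heads

A = rng.standard_normal((n_measure, n_heads))   # randomized knockout weights
y = A @ true_effect                             # observed degradation per run

def omp(A, y, sparsity):
    """Greedy sparse recovery: repeatedly pick the head most correlated
    with the residual, then refit by least squares on the chosen support."""
    residual, support = y.copy(), []
    for _ in range(sparsity):
        corr = np.abs(A.T @ residual)
        corr[support] = 0.0                     # don't re-pick chosen heads
        support.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    return support, coef, residual

support, coef, residual = omp(A, y, sparsity=3)
print("candidate causal heads:", sorted(support))
```

With enough measurements relative to the sparsity level, the recovered support coincides with the planted causal heads, which is the point of the compressed-sensing framing: far fewer evaluations than one-head-at-a-time ablation.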
cs.CL / 46 / 2603.03336
Prompt-Dependent Ranking of Large Language Models with Uncertainty Quantification
依赖提示的语言模型排名及其不确定性量化
Abstract
Rankings derived from pairwise comparisons are central to many economic and computational systems. In the context of large language models (LLMs), rankings are typically constructed from human preference data and presented as leaderboards that guide deployment decisions. However, existing approaches rely on point estimates, implicitly treating rankings as fixed objects despite substantial estimation noise and context-dependent performance variation. Acting on such rankings can lead to misallocation and welfare loss when apparent differences are not statistically meaningful. We study prompt-dependent ranking inference under pairwise human preferences and develop a framework for decision-safe rankings with statistically valid uncertainty guarantees. We model preferences using a contextual Bradley-Terry-Luce model in which the latent utility of each model depends on the input prompt. Rather than targeting point estimates of utilities, we directly conduct inference on induced rankings, constructing confidence sets based on simultaneous confidence intervals for pairwise utility differences. This approach yields statistically valid marginal and simultaneous confidence sets for prompt-specific ranks. Our framework connects recent advances in rank inference to contextual preference learning and provides tools for robust ranking-based decision-making. Empirically, using large-scale human preference data from LLM evaluations, we show that rankings vary substantially across prompt characteristics and that many apparent rank differences are not statistically distinguishable. We further demonstrate how uncertainty-aware rankings identify dominance only when supported by the data and otherwise return partial orders.
Chinese Translation
基于成对比较得出的排名在许多经济和计算系统中至关重要。在大型语言模型(LLMs)的背景下,排名通常是基于人类偏好数据构建的,并以排行榜的形式呈现,以指导部署决策。然而,现有方法依赖于点估计,隐含地将排名视为固定对象,尽管存在显著的估计噪声和依赖上下文的性能变化。基于这样的排名进行决策可能导致资源错误配置和福利损失,因为明显的差异在统计上可能并不具意义。我们研究了基于成对人类偏好的依赖提示的排名推断,并开发了一个具有统计有效不确定性保证的安全排名决策框架。我们使用上下文布拉德利-特里-卢斯(Bradley-Terry-Luce)模型来建模偏好,其中每个模型的潜在效用依赖于输入提示。我们并不针对效用的点估计,而是直接对诱导排名进行推断,基于成对效用差异的同时置信区间构建置信集。这种方法为特定提示的排名提供了统计有效的边际和同时置信集。我们的框架将最近的排名推断进展与上下文偏好学习相连接,并为基于排名的稳健决策提供了工具。从实证角度来看,利用来自LLM评估的大规模人类偏好数据,我们展示了排名在不同提示特征下的显著变化,以及许多明显的排名差异在统计上并不可区分。我们进一步展示了不确定性感知排名仅在数据支持下识别主导关系,否则返回部分序列。
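The inference pipeline the abstract builds on can be illustrated at small scale: fit Bradley-Terry strengths by the standard MM fixed point, then ask whether a pairwise difference is statistically distinguishable. The Wald interval on a single pairwise win rate below is a simplification of the paper's simultaneous confidence sets, and the win counts are invented for illustration.

```python
import math

def btl_strengths(wins, iters=200):
    """Zermelo/MM fixed point for Bradley-Terry strengths.
    wins[i][j] = number of times model i beat model j."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new.append(w_i / denom if denom else p[i])
        s = sum(new)
        p = [x * n / s for x in new]        # normalize for identifiability
    return p

def pair_ci(w_ij, w_ji, z=1.96):
    """Wald CI for P(i beats j); the two ranks are distinguishable
    only if the interval excludes 0.5."""
    n = w_ij + w_ji
    phat = w_ij / n
    half = z * math.sqrt(phat * (1 - phat) / n)
    return phat - half, phat + half

wins = [[0, 61, 70],
        [54, 0, 64],
        [45, 50, 0]]
p = btl_strengths(wins)
lo, hi = pair_ci(wins[0][1], wins[1][0])
print("strengths:", [round(x, 2) for x in p])
print(f"P(model0 beats model1) in [{lo:.2f}, {hi:.2f}];"
      f" distinguishable: {not (lo <= 0.5 <= hi)}")
```

Here model 0 leads model 1 on the raw win rate, yet the interval contains 0.5, exactly the "apparent rank difference that is not statistically distinguishable" the abstract warns about.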
cs.CL / 47 / 2603.03407
Tracing Pharmacological Knowledge In Large Language Models
大型语言模型中的药理学知识追踪
Abstract
Large language models (LLMs) have shown strong empirical performance across pharmacology and drug discovery tasks, yet the internal mechanisms by which they encode pharmacological knowledge remain poorly understood. In this work, we investigate how drug-group semantics are represented and retrieved within Llama-based biomedical language models using causal and probing-based interpretability methods. We apply activation patching to localize where drug-group information is stored across model layers and token positions, and complement this analysis with linear probes trained on token-level and sum-pooled activations. Our results demonstrate that early layers play a key role in encoding drug-group knowledge, with the strongest causal effects arising from intermediate tokens within the drug-group span rather than the final drug-group token. Linear probing further reveals that pharmacological semantics are distributed across tokens and are already present in the embedding space, with token-level probes performing near chance while sum-pooled representations achieve maximal accuracy. Together, these findings suggest that drug-group semantics in LLMs are not localized to single tokens but instead arise from distributed representations. This study provides the first systematic mechanistic analysis of pharmacological knowledge in LLMs, offering insights into how biomedical semantics are encoded in large language models.
Chinese Translation
大型语言模型(LLMs)在药理学和药物发现任务中表现出强大的实证性能,但它们编码药理学知识的内部机制仍然不甚清楚。在本研究中,我们使用因果推理和探测性解释方法,探讨基于Llama的生物医学语言模型中药物组语义的表示和检索。我们应用激活补丁技术来定位药物组信息在模型层和标记位置的存储位置,并通过在标记级别和求和池化激活上训练的线性探测器来补充这一分析。我们的结果表明,早期层在编码药物组知识中发挥了关键作用,最强的因果效应来自药物组范围内的中间标记,而非最终的药物组标记。线性探测进一步揭示,药理学语义分布在各个标记中,并且已经存在于嵌入空间中,标记级别的探测器表现接近随机,而求和池化表示则达到了最大准确性。综合来看,这些发现表明LLMs中的药物组语义并非局限于单一标记,而是源于分布式表示。本研究提供了对LLMs中药理学知识的首次系统性机制分析,为生物医学语义在大型语言模型中的编码提供了深入见解。
cs.CL / 48 / 2603.03415
Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
移位越远,表示越稀疏:分析大型语言模型中的OOD机制
Abstract
In this work, we investigate how Large Language Models (LLMs) adapt their internal representations when encountering inputs of increasing difficulty, quantified as the degree of out-of-distribution (OOD) shift. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, whether through harder reasoning questions, longer contexts, or adding answer choices, the last hidden states of LLMs become substantially sparser. In short, \textbf{\textit{the farther the shift, the sparser the representations}}. This sparsity--difficulty relation is observable across diverse models and domains, suggesting that language models respond to unfamiliar or complex inputs by concentrating computation into specialized subspaces in the last hidden state. Through a series of controlled analyses with a learning dynamic explanation, we demonstrate that this sparsity is not incidental but an adaptive mechanism for stabilizing reasoning under OOD. Leveraging this insight, we design \textit{Sparsity-Guided Curriculum In-Context Learning (SG-ICL)}, a strategy that explicitly uses representation sparsity to schedule few-shot demonstrations, leading to considerable performance enhancements. Our study provides new mechanistic insights into how LLMs internalize OOD challenges. The source code is available at the URL: https://github.com/MingyuJ666/sparsityLLM.
Chinese Translation
在本研究中,我们探讨了大型语言模型(LLMs)在遇到逐渐增加的输入难度时如何调整其内部表示,这种难度通过超出分布(OOD)移位的程度来量化。我们揭示了一个一致且可量化的现象:随着任务难度的增加,无论是通过更难的推理问题、更长的上下文,还是增加答案选项,LLMs的最后隐藏状态变得显著稀疏。简而言之,\textbf{\textit{移位越远,表示越稀疏}}。这种稀疏性与难度的关系在不同模型和领域中均可观察到,这表明语言模型通过将计算集中到最后隐藏状态中的专业子空间来应对不熟悉或复杂的输入。通过一系列受控分析和学习动态解释,我们证明这种稀疏性并非偶然,而是为了在OOD下稳定推理而采取的适应性机制。基于这一见解,我们设计了\textit{稀疏性引导的课程上下文学习(SG-ICL)},这一策略明确利用表示稀疏性来安排少量示范,从而显著提升性能。我们的研究为LLMs如何内化OOD挑战提供了新的机制性见解。源代码可在以下网址获取:https://github.com/MingyuJ666/sparsityLLM。
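A toy version of the sparsity signal and the SG-ICL-style scheduling it enables. The thresholded sparsity measure and the easy-to-hard ordering below are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def activation_sparsity(hidden, eps=1e-2):
    """Fraction of near-zero coordinates in a hidden-state vector.
    eps is an illustrative threshold; the paper's exact measure may differ."""
    hidden = np.asarray(hidden, dtype=float)
    scale = np.abs(hidden).max() or 1.0      # guard against all-zero input
    return float(np.mean(np.abs(hidden) / scale < eps))

def curriculum_order(demos, hidden_states):
    """SG-ICL-style scheduling sketch: order few-shot demonstrations
    from least to most sparse final hidden state (easy -> hard)."""
    scores = [activation_sparsity(h) for h in hidden_states]
    return [d for _, d in sorted(zip(scores, demos))]

# toy hidden states: a dense one (in-distribution) vs a sparse one (far OOD)
dense = np.ones(16)
sparse = np.r_[np.ones(2), np.zeros(14)]
order = curriculum_order(["easy_demo", "hard_demo"], [dense, sparse])
print(order)
```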
cs.CL / 49 / 2603.03508
Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi
提升标准,而非参数:LilMoo 印地语紧凑语言模型
Abstract
The dominance of large multilingual foundation models has widened linguistic inequalities in Natural Language Processing (NLP), often leaving low-resource languages underrepresented. This paper introduces LilMoo, a 0.6-billion-parameter Hindi language model trained entirely from scratch to address this gap. Unlike prior Hindi models that rely on continual pretraining from opaque multilingual foundations, LilMoo is developed through a fully transparent and reproducible pipeline optimized for limited compute environments. We construct a high-quality Hindi corpus (GigaLekh) filtered through both heuristic and learned (LLM-as-a-judge) methods, complemented by bilingual augmentation with curated English data. Using this dataset, we explore various training recipes for small-scale language models. Across comprehensive evaluation suites, LilMoo consistently outperforms comparably sized multilingual baselines such as Qwen2.5-0.5B and Qwen3-0.6B, demonstrating that well-designed language-specific pretraining can rival large multilingual models at the sub-billion-parameter range.
Chinese Translation
大型多语言基础模型的主导地位加剧了自然语言处理(NLP)中的语言不平等,常常使低资源语言的表现不足。本文介绍了LilMoo,一个完全从零开始训练的6亿参数印地语语言模型,以填补这一空白。与依赖于不透明多语言基础的持续预训练的先前印地语模型不同,LilMoo通过一个完全透明且可重复的流程开发,优化了有限计算环境。我们构建了一个高质量的印地语语料库(GigaLekh),通过启发式和学习(LLM-as-a-judge)方法进行过滤,并辅以经过精心策划的英语数据的双语增强。利用该数据集,我们探索了小规模语言模型的各种训练方案。在全面的评估套件中,LilMoo始终优于同等规模的多语言基线模型,如Qwen2.5-0.5B和Qwen3-0.6B,证明了精心设计的语言特定预训练可以在不足十亿参数的范围内与大型多语言模型相媲美。
cs.CL / 50 / 2603.03510
A theoretical model of dynamical grammatical gender shifting based on set-valued set function
基于集合值集合函数的动态语法性别转变理论模型
Abstract
This study investigates the diverse characteristics of nouns, focusing on both semantic (e.g., countable/uncountable) and morphosyntactic (e.g., masculine/feminine) distinctions. We explore inter-word variations for gender markers in noun morphology. Grammatical gender shift is a widespread phenomenon in languages around the world. The aim is to uncover through a formal model the underlying patterns governing the variation of lexemes. To this end, we propose a new computational component dedicated to pairing items with morphological templates (e.g., the result of a generated item-template pair: (funas, $\{N, +SG, -PL, -M, +F, -COL, +SING\}$), with its spell-out form: $\eth$a-funast 'cow'). This process is formally represented by the Template-Based and Modular Cognitive model. This proposed model, defined by a set-valued set function $h : \mathscr{P}(M) \rightarrow \mathscr{P}(M)$, predicts the nonlinear dynamic mapping of lexical items onto morphological templates. By applying this formalism, we present a unified framework for understanding the complexities of morphological markings across languages. Through empirical observations, we demonstrate how these shifts, as well as non-gender shifts, arise during lexical changes, especially in Riffian. Our model posits that these variant markings emerge due to template shifts occurring during word and meaning's formation. By formally demonstrating that conversion is applicable to noun-to-noun derivation, we challenge and broaden the conventional view of word formation. This mathematical model not only contributes to a deeper understanding of morphosyntactic variation but also offers potential applications in other fields requiring precise modelling of linguistic patterns.
Chinese Translation
本研究探讨名词的多样特征,重点关注语义(例如,可数/不可数)和形态句法(例如,阳性/阴性)之间的区别。我们研究名词形态中性别标记的词间变异。语法性别转变是全球语言中普遍存在的现象。我们的目标是通过一个形式模型揭示支配词汇变异的潜在模式。为此,我们提出一个新的计算组件,专门用于将项目与形态模板配对(例如,生成的项目-模板对的结果:(funas, $\{N, +SG, -PL, -M, +F, -COL, +SING\}$),其拼写形式为:$\eth$a-funast '牛')。这一过程通过基于模板和模块化认知模型进行形式化表示。该模型由一个集合值集合函数 $h : \mathscr{P}(M) \rightarrow \mathscr{P}(M)$ 定义,预测词汇项与形态模板之间的非线性动态映射。通过应用这一形式化,我们提出了一个统一框架,以理解跨语言的形态标记复杂性。通过实证观察,我们展示了这些转变以及非性别转变如何在词汇变化过程中产生,特别是在Riffian语言中。我们的模型认为,这些变异标记是由于在词和意义形成过程中发生的模板转变而出现的。通过形式化证明转换适用于名词到名词的派生,我们挑战并拓宽了传统的词形成观念。该数学模型不仅有助于更深入理解形态句法变异,还为其他需要精确建模语言模式的领域提供了潜在应用。
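The set-valued set function $h : \mathscr{P}(M) \rightarrow \mathscr{P}(M)$ can be given a minimal executable reading: $M$ is a feature inventory, and $h$ rewrites an item's lexical feature set into the feature set of the template it is paired with. The rewrite rule below is invented for illustration and is not the paper's Riffian analysis, but it reproduces the abstract's singulative feature set for funas 'cow'.

```python
# feature inventory M (illustrative subset)
M = {"N", "+SG", "-PL", "+PL", "-M", "+M", "+F", "-F",
     "-COL", "+COL", "+SING", "-SING"}

def h(features):
    """Map a lexical feature set to a template feature set. The rule
    models a template shift: collective nouns are rewritten to the
    feminine-marked singulative template (an invented toy rule)."""
    out = set(features) & M
    if "+COL" in out:
        out -= {"+COL", "+M", "-F", "-SING"}     # drop collective marking
        out |= {"-COL", "+SING", "+F", "-M"}     # singulative template
    return frozenset(out)

# the abstract's feature set for funas 'cow'
funas = frozenset({"N", "+SG", "-PL", "-M", "+F", "-COL", "+SING"})
print(h(funas) == funas)                         # fixed point: already singulative
shifted = h(frozenset({"N", "+SG", "-PL", "+M", "-F", "+COL", "-SING"}))
print(sorted(shifted))
```

Note that $h$ maps sets to sets rather than features to features, which is what lets a single application model a whole-template shift during word formation.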
cs.CL / 51 / 2603.03536
SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems
SafeCRS:基于大型语言模型的对话推荐系统的个性化安全对齐
Abstract
Current LLM-based conversational recommender systems (CRS) primarily optimize recommendation accuracy and user satisfaction. We identify an underexplored vulnerability in which recommendation outputs may negatively impact users by violating personalized safety constraints, when individualized safety sensitivities -- such as trauma triggers, self-harm history, or phobias -- are implicitly inferred from the conversation but not respected during recommendation. We formalize this challenge as personalized CRS safety and introduce SafeRec, a new benchmark dataset designed to systematically evaluate safety risks in LLM-based CRS under user-specific constraints. To further address this problem, we propose SafeCRS, a safety-aware training framework that integrates Safe Supervised Fine-Tuning (Safe-SFT) with Safe Group reward-Decoupled Normalization Policy Optimization (Safe-GDPO) to jointly optimize recommendation quality and personalized safety alignment. Extensive experiments on SafeRec demonstrate that SafeCRS reduces safety violation rates by up to 96.5% relative to the strongest recommendation-quality baseline while maintaining competitive recommendation quality. Warning: This paper contains potentially harmful and offensive content.
Chinese Translation
当前基于大型语言模型(LLM)的对话推荐系统(CRS)主要优化推荐准确性和用户满意度。我们识别出一个未被充分探索的脆弱性,即推荐输出可能通过违反个性化安全约束而对用户产生负面影响,当个体化的安全敏感性——例如创伤触发、自残历史或恐惧症——在对话中被隐式推断出来但在推荐过程中未被尊重时。我们将这一挑战形式化为个性化CRS安全,并引入SafeRec,一个新的基准数据集,旨在系统性地评估基于LLM的CRS在用户特定约束下的安全风险。为进一步解决这一问题,我们提出了SafeCRS,一个安全感知的训练框架,结合了安全监督微调(Safe Supervised Fine-Tuning,Safe-SFT)和安全组奖励去耦归一化策略优化(Safe Group reward-Decoupled Normalization Policy Optimization,Safe-GDPO),以共同优化推荐质量和个性化安全对齐。在SafeRec上的大量实验表明,SafeCRS相较于最强推荐质量基线将安全违规率降低了多达96.5%,同时保持了竞争力的推荐质量。警告:本文包含潜在有害和冒犯性的内容。
cs.CL / 52 / 2603.03541
RAG-X: Systematic Diagnosis of Retrieval-Augmented Generation for Medical Question Answering
RAG-X:医疗问答中检索增强生成的系统诊断
Abstract
Automated question-answering (QA) systems increasingly rely on retrieval-augmented generation (RAG) to ground large language models (LLMs) in authoritative medical knowledge, ensuring clinical accuracy and patient safety in Artificial Intelligence (AI) applications for healthcare. Despite progress in RAG evaluation, current benchmarks focus only on simple multiple-choice QA tasks and employ metrics that poorly capture the semantic precision required for complex QA tasks. These approaches fail to diagnose whether an error stems from faulty retrieval or flawed generation, limiting developers from performing targeted improvement. To address this gap, we propose RAG-X, a diagnostic framework that evaluates the retriever and generator independently across a triad of QA tasks: information extraction, short-answer generation, and multiple-choice question (MCQ) answering. RAG-X introduces Context Utilization Efficiency (CUE) metrics to disaggregate system success into interpretable quadrants, isolating verified grounding from deceptive accuracy. Our experiments reveal an ``Accuracy Fallacy", where a 14\% gap separates perceived system success from evidence-based grounding. By surfacing hidden failure modes, RAG-X offers the diagnostic transparency needed for safe and verifiable clinical RAG systems.
Chinese Translation
自动化问答(QA)系统越来越依赖检索增强生成(RAG)来将大型语言模型(LLMs)与权威医学知识相结合,以确保人工智能(AI)在医疗应用中的临床准确性和患者安全性。尽管在RAG评估方面取得了一定进展,但当前基准仅关注简单的多项选择QA任务,并采用不适合复杂QA任务所需语义精确度的指标。这些方法无法诊断错误是源于检索故障还是生成缺陷,从而限制了开发者进行有针对性的改进。为了解决这一问题,我们提出了RAG-X,一个独立评估检索器和生成器的诊断框架,涵盖信息提取、短答案生成和多项选择问题(MCQ)回答三类QA任务。RAG-X引入了上下文利用效率(CUE)指标,将系统成功分解为可解释的象限,区分经过验证的基础与误导性的准确性。我们的实验揭示了“准确性谬论”,即感知的系统成功与基于证据的基础之间存在14%的差距。通过揭示隐藏的失败模式,RAG-X提供了安全和可验证的临床RAG系统所需的诊断透明度。
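The quadrant disaggregation behind the CUE metrics can be sketched by crossing retrieval correctness with answer correctness. The quadrant names and the gap computation below are paraphrases of the abstract's "Accuracy Fallacy", not the paper's exact definitions.

```python
def cue_quadrants(records):
    """Disaggregate RAG outcomes into four quadrants.
    Each record: (retrieved_ok, answer_ok) booleans."""
    q = {"verified-grounding": 0, "deceptive-accuracy": 0,
         "retrieval-wasted": 0, "joint-failure": 0}
    for retrieved_ok, answer_ok in records:
        if retrieved_ok and answer_ok:
            q["verified-grounding"] += 1
        elif answer_ok:
            q["deceptive-accuracy"] += 1   # right answer, wrong evidence
        elif retrieved_ok:
            q["retrieval-wasted"] += 1     # evidence found, generation failed
        else:
            q["joint-failure"] += 1
    n = len(records)
    accuracy = (q["verified-grounding"] + q["deceptive-accuracy"]) / n
    grounded = q["verified-grounding"] / n
    return q, accuracy - grounded          # perceived-vs-grounded gap

# toy evaluation of 10 question-answer episodes
records = [(True, True)] * 6 + [(False, True)] * 1 + \
          [(True, False)] * 2 + [(False, False)] * 1
q, gap = cue_quadrants(records)
print(q, f"accuracy-fallacy gap: {gap:.0%}")
```

The gap isolates exactly the failure mode an aggregate accuracy score hides: answers that look correct but are not supported by the retrieved evidence.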
cs.CL / 53 / 2603.03543
Tucano 2 Cool: Better Open Source LLMs for Portuguese
Tucano 2 Cool:更好的葡萄牙语开源大型语言模型
Abstract
We present Tucano 2, a fully open suite of large language models (LLMs) with 0.5-3.7 billion parameters, designed to address certain gaps in open-source development for Portuguese LLMs. Following our previous works, we now extend our dataset, GigaVerbo-v2, to a new degree of quality and scale, while also introducing a new synthetic dataset, GigaVerbo-v2 Synth, aimed at filling missing gaps in GigaVerbo-v2, and two post-training datasets, GigaVerbo-v2 SFT and GigaVerbo-v2 Preferences, that allow Portuguese LLMs to be trained in domains like retrieval augmented generation, coding, tool use, chain-of-thought reasoning, and many other domains of interest. Through extensive ablation studies, we design both pretraining and continual pretraining recipes for the Tucano 2 suite (Base, Instruct, and Think), which achieve state-of-the-art performance on several Portuguese-language modeling benchmarks. We also extend and refine the evaluation harness introduced in our earlier work, yielding a comprehensive evaluation suite that provides strong signals across different pretraining, continual pretraining, and post-training regimes. All artifacts associated with Tucano 2 are openly released, including training recipes, logs, and source code, ensuring that our work is reproducible, accessible, and extendable by the broader Portuguese NLP community.
Chinese Translation
我们介绍了Tucano 2,这是一个完全开放的大型语言模型(LLMs)套件,参数范围为5亿到37亿,旨在解决葡萄牙语开源开发中的某些空白。在我们之前工作的基础上,我们现在将数据集GigaVerbo-v2扩展到新的质量和规模,同时引入了一个新的合成数据集GigaVerbo-v2 Synth,旨在填补GigaVerbo-v2中的缺失部分,以及两个后训练数据集GigaVerbo-v2 SFT和GigaVerbo-v2 Preferences,使葡萄牙语LLMs能够在检索增强生成、编码、工具使用、思维链推理等多个感兴趣领域进行训练。通过广泛的消融研究,我们为Tucano 2套件(Base、Instruct和Think)设计了预训练和持续预训练的方案,这些方案在多个葡萄牙语建模基准上实现了最先进的性能。我们还扩展和完善了我们早期工作中引入的评估工具,提供了一个全面的评估套件,在不同的预训练、持续预训练和后训练方案中提供强有力的信号。与Tucano 2相关的所有成果都已公开发布,包括训练方案、日志和源代码,确保我们的工作可重复、可访问,并且可以被更广泛的葡萄牙语自然语言处理社区扩展。
cs.CL / 54 / 2603.03583
ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer
ByteFlow:通过自适应字节压缩实现无分词器的语言建模
Abstract
Modern language models still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in otherwise strong reasoning models. We introduce \textbf{ByteFlow Net}, a new hierarchical architecture that removes tokenizers entirely and instead enables models to learn their own segmentation of raw byte streams into semantically meaningful units. ByteFlow Net performs compression-driven segmentation based on the coding rate of latent representations, yielding adaptive boundaries \emph{while preserving a static computation graph via Top-$K$ selection}. Unlike prior self-tokenizing methods that depend on brittle heuristics with human-designed inductive biases, ByteFlow Net adapts its internal representation granularity to the input itself. Experiments demonstrate that this compression-based chunking strategy yields substantial performance gains, with ByteFlow Net outperforming both BPE-based Transformers and previous byte-level architectures. These results suggest that end-to-end, tokenizer-free modeling is not only feasible but also more effective, opening a path toward more adaptive and information-grounded language models.
Chinese Translation
现代语言模型仍然依赖于固定的、预定义的子词标记化。一旦训练了分词器,语言模型只能在这个固定的粒度水平上操作,这常常导致即使在其他强大的推理模型中也出现脆弱和反直觉的行为。我们引入了\textbf{ByteFlow Net},一种新的层次化架构,完全去除了分词器,而是使模型能够学习将原始字节流分割成语义上有意义的单元。ByteFlow Net 基于潜在表示的编码率执行压缩驱动的分割,从而在\textit{通过 Top-$K$ 选择保持静态计算图的同时}产生自适应边界。与依赖于脆弱启发式和人类设计的归纳偏置的先前自分词方法不同,ByteFlow Net 根据输入本身调整其内部表示的粒度。实验表明,这种基于压缩的分块策略带来了显著的性能提升,ByteFlow Net 超越了基于 BPE 的 Transformer 和先前的字节级架构。这些结果表明,端到端、无分词器的建模不仅是可行的,而且更有效,为更自适应和信息驱动的语言模型开辟了道路。
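The fixed Top-$K$ selection described in the abstract can be sketched minimally: because $K$ is constant, the selected boundary set has the same size for every input, which is what keeps the computation graph static. The scoring values and function names below are illustrative assumptions, not the paper's implementation.

```python
def topk_boundaries(scores, k):
    """Select the k highest-scoring byte positions as chunk boundaries.

    Because k is fixed, the output length is identical for every input,
    which is what preserves a static computation graph.
    """
    idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(idx)  # boundaries returned in stream order

# Toy coding-rate proxy: high values mark likely starts of new units.
scores = [0.1, 0.9, 0.2, 0.1, 0.8, 0.3, 0.7, 0.2]
print(topk_boundaries(scores, k=3))  # -> [1, 4, 6]
```

In the actual model the scores would come from the coding rate of latent byte representations; here a hand-written list stands in for them.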
cs.CL / 55 / 2603.03585
Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility
Belief-Sim:朝着基于信念的模拟人口误信息易感性
Abstract
Misinformation is a growing societal threat, and susceptibility to misinformative claims varies across demographic groups due to differences in underlying beliefs. As Large Language Models (LLMs) are increasingly used to simulate human behaviors, we investigate whether they can simulate demographic misinformation susceptibility, treating beliefs as a primary driving factor. We introduce BeliefSim, a simulation framework that constructs demographic belief profiles using psychology-informed taxonomies and survey priors. We study prompt-based conditioning and post-training adaptation, and conduct a multi-fold evaluation using: (i) susceptibility accuracy and (ii) counterfactual demographic sensitivity. Across both datasets and modeling strategies, we show that beliefs provide a strong prior for simulating misinformation susceptibility, with accuracy up to 92%.
Chinese Translation
误信息正成为日益严重的社会威胁,而对误导性声明的易感性因人口群体的不同而有所差异,这主要源于潜在信念的差异。随着大型语言模型(LLMs)在模拟人类行为中的应用日益广泛,我们探讨它们是否能够模拟人口误信息易感性,并将信念视为主要驱动因素。我们介绍了BeliefSim,一个模拟框架,通过心理学知情的分类法和调查先验构建人口信念档案。我们研究了基于提示的条件化和后训练适应,并进行多方面评估,使用:(i) 易感性准确性和(ii) 反事实人口敏感性。在两个数据集和建模策略中,我们展示了信念为模拟误信息易感性提供了强有力的先验,准确率高达92%。
cs.CL / 56 / 2603.03623
A Neural Topic Method Using a Large-Language-Model-in-the-Loop for Business Research
一种在商业研究中使用大型语言模型的神经主题方法
Abstract
The growing use of unstructured text in business research makes topic modeling a central tool for constructing explanatory variables from reviews, social media, and open-ended survey responses, yet existing approaches function poorly as measurement instruments. Prior work shows that textual content predicts outcomes such as sales, satisfaction, and firm performance, but probabilistic models often generate conceptually diffuse topics, neural topic models are difficult to interpret in theory-driven settings, and large language model approaches lack standardization, stability, and alignment with document-level representations. We introduce LX Topic, a neural topic method that conceptualizes topics as latent linguistic constructs and produces calibrated document-level topic proportions for empirical analysis. LX Topic builds on FASTopic to ensure strong document representativeness and integrates large language model refinement at the topic-word level using alignment and confidence-weighting mechanisms that enhance semantic coherence without distorting document-topic distributions. Evaluations on large-scale Amazon and Yelp review datasets demonstrate that LX Topic achieves the highest overall topic quality relative to leading models while preserving clustering and classification performance. By unifying topic discovery, refinement, and standardized output in a web-based system, LX Topic establishes topic modeling as a reproducible, interpretable, and measurement-oriented instrument for marketing research and practice.
Chinese Translation
在商业研究中,非结构化文本的日益使用使得主题建模成为从评论、社交媒体和开放式调查反馈中构建解释变量的核心工具,然而现有方法作为测量工具的表现不佳。先前的研究表明,文本内容可以预测销售、满意度和公司绩效等结果,但概率模型往往生成概念上模糊的主题,神经主题模型在理论驱动的环境中难以解释,而大型语言模型的方法缺乏标准化、稳定性以及与文档级表示的一致性。我们提出了LX Topic,这是一种将主题概念化为潜在语言构造的神经主题方法,并为实证分析生成经过校准的文档级主题比例。LX Topic基于FASTopic构建,以确保强大的文档代表性,并在主题-词级别整合大型语言模型的优化,采用对齐和置信加权机制,增强语义连贯性而不扭曲文档-主题分布。在大规模的亚马逊和Yelp评论数据集上的评估表明,LX Topic在相对于领先模型的整体主题质量上达到了最高水平,同时保持了聚类和分类性能。通过在基于网络的系统中统一主题发现、优化和标准化输出,LX Topic确立了主题建模作为一种可重复、可解释和以测量为导向的工具,适用于市场研究和实践。
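The confidence-weighting mechanism mentioned in the LX Topic abstract, which refines topic-word distributions with LLM output without distorting them, can be read as a confidence-weighted blend. This is an illustrative sketch of that idea, not the paper's exact method; all numbers are invented.

```python
def refine_topic_words(base_dist, llm_dist, confidence):
    """Blend a neural topic-word distribution with an LLM-refined one.

    Low confidence keeps the original distribution nearly intact, so the
    document-topic structure is not distorted; the result is renormalized
    to remain a probability distribution.
    """
    mixed = [(1 - confidence) * b + confidence * l
             for b, l in zip(base_dist, llm_dist)]
    total = sum(mixed)
    return [m / total for m in mixed]

base = [0.5, 0.3, 0.2]   # topic-word weights from the neural model
llm = [0.6, 0.3, 0.1]    # LLM-suggested refinement
print(refine_topic_words(base, llm, confidence=0.5))
```

With confidence 0.5 the result is the midpoint of the two distributions; with confidence 0 it reduces to the base model's weights.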
cs.CL / 57 / 2603.03652
Linguistically Informed Graph Model and Semantic Contrastive Learning for Korean Short Text Classification
基于语言信息的图模型和语义对比学习用于韩语短文本分类
Abstract
Short text classification (STC) remains a challenging task due to the scarcity of contextual information and labeled data. However, existing approaches have predominantly focused on English because most benchmark datasets for STC are primarily available in English. Consequently, existing methods seldom incorporate the linguistic and structural characteristics of Korean, such as its agglutinative morphology and flexible word order. To address these limitations, we propose LIGRAM, a hierarchical heterogeneous graph model for Korean short-text classification. The proposed model constructs sub-graphs at the morpheme, part-of-speech, and named-entity levels and hierarchically integrates them to compensate for the limited contextual information in short texts while precisely capturing the grammatical and semantic dependencies inherent in Korean. In addition, we apply Semantics-aware Contrastive Learning (SemCon) to reflect semantic similarity across documents, enabling the model to establish clearer decision boundaries even in short texts where class distinctions are often ambiguous. We evaluate LIGRAM on four Korean short-text datasets, where it consistently outperforms existing baseline models. These outcomes validate that integrating language-specific graph representations with SemCon provides an effective solution for short text classification in agglutinative languages such as Korean.
Chinese Translation
短文本分类(STC)仍然是一项具有挑战性的任务,因为上下文信息和标注数据的稀缺。然而,现有的方法主要集中在英语上,因为大多数短文本分类的基准数据集主要以英语提供。因此,现有方法很少考虑韩语的语言和结构特征,例如其黏着性形态和灵活的词序。为了解决这些局限性,我们提出了LIGRAM,一种用于韩语短文本分类的分层异构图模型。该模型在语素、词性和命名实体层面构建子图,并将其分层集成,以补偿短文本中有限的上下文信息,同时准确捕捉韩语固有的语法和语义依赖关系。此外,我们应用了语义感知对比学习(Semantics-aware Contrastive Learning, SemCon),以反映文档之间的语义相似性,使模型即使在类别区分通常模糊的短文本中也能建立更清晰的决策边界。我们在四个韩语短文本数据集上评估了LIGRAM,结果显示其始终优于现有的基准模型。这些结果验证了将特定语言的图表示与SemCon结合提供了一种有效的解决方案,以应对像韩语这样的黏着性语言中的短文本分类问题。
cs.CL / 58 / 2603.03677
MIND: Unified Inquiry and Diagnosis RL with Criteria Grounded Clinical Supports for Psychiatric Consultation
MIND:基于标准的临床支持的统一询问与诊断强化学习框架用于精神科咨询
Abstract
Large language models (LLMs) have advanced medical dialogue systems, yet psychiatric consultation poses substantially higher demands due to subjective ambiguity and comorbidity complexity: an agent must continuously extract psychopathological cues from incomplete and inconsistent patient reports in multi-turn interactions and perform rigorous differential diagnostic reasoning. However, existing methods face two fundamental challenges. First, without criteria-grounded clinical supports, they are prone to unsupported clinical assertions when symptoms are atypical or underspecified. Second, in multi-turn interactions, they struggle to mitigate inquiry drift (off-topic or low-yield questioning) and optimize questioning strategies. To address these challenges, we propose MIND, a unified inquiry--diagnosis reinforcement learning framework for psychiatric consultation. Specifically, we build a Criteria-Grounded Psychiatric Reasoning Bank (PRB) that summarizes dialogue context into clinical retrieval states, retrieves semantically similar reference consultations, and distills reusable criteria-grounded clinical supports to guide criteria-aligned inquiry and reasoning. Building on this foundation, MIND enforces explicit clinical reasoning with rubric-based process rewards to provide fine-grained supervision over intermediate decision steps, and incorporates a value-aware trajectory rectification mechanism to jointly improve information acquisition and diagnostic decision-making across turns. Extensive experiments demonstrate that MIND consistently outperforms strong baselines in diagnostic accuracy, empathetic interaction quality, interpretability, and generalization.
Chinese Translation
大型语言模型(LLMs)推动了医疗对话系统的发展,但精神科咨询由于主观模糊性和共病复杂性提出了更高的要求:代理必须在多轮互动中不断从不完整和不一致的患者报告中提取心理病理线索,并进行严格的鉴别诊断推理。然而,现有方法面临两个基本挑战。首先,缺乏基于标准的临床支持时,当症状不典型或未明确时,它们容易产生不支持的临床断言。其次,在多轮互动中,它们难以减轻询问漂移(偏离主题或低效提问)并优化提问策略。为了解决这些挑战,我们提出了MIND,一个用于精神科咨询的统一询问-诊断强化学习框架。具体而言,我们构建了一个基于标准的精神病理推理库(PRB),将对话上下文总结为临床检索状态,检索语义相似的参考咨询,并提炼可重用的基于标准的临床支持,以指导标准对齐的询问和推理。在此基础上,MIND通过基于评分标准的过程奖励来强制执行明确的临床推理,以提供对中间决策步骤的细致监督,并结合一个价值感知的轨迹修正机制,以共同改善信息获取和跨轮次的诊断决策。大量实验表明,MIND在诊断准确性、同理心互动质量、可解释性和泛化能力方面始终优于强基线。
cs.CL / 59 / 2603.03714
Order Is Not Layout: Order-to-Space Bias in Image Generation
顺序不是布局:图像生成中的顺序-空间偏差
Abstract
We study a systematic bias in modern image generation models: the mention order of entities in text spuriously determines spatial layout and entity--role binding. We term this phenomenon Order-to-Space Bias (OTS) and show that it arises in both text-to-image and image-to-image generation, often overriding grounded cues and causing incorrect layouts or swapped assignments. To quantify OTS, we introduce OTS-Bench, which isolates order effects with paired prompts differing only in entity order and evaluates models along two dimensions: homogenization and correctness. Experiments show that OTS is widespread in modern image generation models, and provide evidence that it is primarily data-driven and manifests during the early stages of layout formation. Motivated by this insight, we show that both targeted fine-tuning and early-stage intervention strategies can substantially reduce OTS, while preserving generation quality.
Chinese Translation
我们研究了现代图像生成模型中的一种系统性偏差:文本中实体的提及顺序会虚假地决定空间布局和实体-角色绑定。我们将这种现象称为顺序-空间偏差(Order-to-Space Bias, OTS),并展示它在文本到图像和图像到图像的生成中均会出现,常常覆盖基础线索,导致不正确的布局或角色分配错误。为了量化OTS,我们引入了OTS-Bench,它通过仅在实体顺序上有所不同的配对提示来隔离顺序效应,并从均匀化和正确性两个维度评估模型。实验表明,顺序-空间偏差(OTS)在现代图像生成模型中普遍存在,并提供证据表明它主要是数据驱动的,并在布局形成的早期阶段显现。基于这一见解,我们展示了针对性的微调和早期干预策略都可以显著减少OTS,同时保持生成质量。
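The paired-prompt construction that isolates order effects in OTS-Bench amounts to generating two prompts identical except for entity mention order. A minimal sketch; the template and entities below are invented examples, not drawn from the benchmark itself.

```python
def make_order_pair(entity_a, entity_b, template):
    """Build two prompts that differ only in entity mention order.

    Any difference in the generated layouts between the pair can then be
    attributed to order rather than content, isolating the bias.
    """
    return (template.format(x=entity_a, y=entity_b),
            template.format(x=entity_b, y=entity_a))

template = "A photo of a {x} and a {y} on a table."
p1, p2 = make_order_pair("cat", "dog", template)
print(p1)  # A photo of a cat and a dog on a table.
print(p2)  # A photo of a dog and a cat on a table.
```

Running a generator on both prompts and comparing entity positions across the pair gives the homogenization and correctness measurements the benchmark reports.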
cs.CL / 60 / 2603.03742
ErrorLLM: Modeling SQL Errors for Text-to-SQL Refinement
ErrorLLM:用于文本到SQL精炼的SQL错误建模
Abstract
Despite the remarkable performance of large language models (LLMs) in text-to-SQL (SQL generation), correctly producing SQL queries remains challenging during initial generation. The SQL refinement task is subsequently introduced to correct syntactic and semantic errors in generated SQL queries. However, existing paradigms face two major limitations: (i) self-debugging becomes increasingly ineffective as modern LLMs rarely produce explicit execution errors that can trigger debugging signals; (ii) self-correction exhibits low detection precision due to the lack of explicit error modeling grounded in the question and schema, and suffers from severe hallucination that frequently corrupts correct SQL queries. In this paper, we propose ErrorLLM, a framework that explicitly models text-to-SQL errors within a dedicated LLM for text-to-SQL refinement. Specifically, we represent the user question and database schema as structural features, employ static detection to identify execution failures and surface mismatches, and extend ErrorLLM's semantic space with dedicated error tokens that capture categorized implicit semantic error types. Through a well-designed training strategy, we explicitly model these errors with structural representations, enabling the LLM to detect complex implicit errors by predicting dedicated error tokens. Guided by the detected errors, we perform error-guided refinement on the SQL structure by prompting LLMs. Extensive experiments demonstrate that ErrorLLM achieves the most significant improvements over the backbone's initial generation. Further analysis reveals that detection quality directly determines refinement effectiveness, and ErrorLLM addresses both by achieving a high detection F1 score while maintaining refinement effectiveness.
Chinese Translation
尽管大型语言模型(LLMs)在文本到SQL(SQL生成)方面表现出色,但在初始生成过程中正确生成SQL查询仍然具有挑战性。随后引入了SQL精炼任务,以纠正生成的SQL查询中的语法和语义错误。然而,现有范式面临两个主要限制:(i)自我调试变得越来越无效,因为现代LLMs很少产生可以触发调试信号的明确执行错误;(ii)自我纠正的检测精度较低,原因在于缺乏基于问题和模式的明确错误建模,并且严重的幻觉现象经常损坏正确的SQL。在本文中,我们提出了ErrorLLM,一个专门为文本到SQL精炼设计的框架,明确建模文本到SQL错误。具体而言,我们将用户问题和数据库模式表示为结构特征,采用静态检测来识别执行失败和表面不匹配,并通过专门的错误标记扩展ErrorLLM的语义空间,以捕捉分类的隐式语义错误类型。通过精心设计的训练策略,我们使用结构表示明确建模这些错误,使LLM能够通过预测专门的错误标记来检测复杂的隐式错误。在检测到的错误指导下,我们通过提示LLMs对SQL结构进行错误引导的精炼。大量实验表明,ErrorLLM在基础初始生成上实现了最显著的改进。进一步分析表明,检测质量直接决定了精炼的有效性,而ErrorLLM通过高检测F1分数同时解决了这两个方面,同时保持了精炼的有效性。
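The static-detection stage described above, which surfaces explicit execution failures before any semantic error modeling, can be illustrated by executing candidate SQL against an empty copy of the schema. The schema and queries are invented examples; real pipelines would use the target database engine.

```python
import sqlite3

def static_check(sql, schema_ddl):
    """Run candidate SQL against an empty copy of the schema.

    Syntax errors and unknown columns/tables surface as explicit,
    detectable failures; returns None when execution succeeds.
    """
    con = sqlite3.connect(":memory:")
    con.executescript(schema_ddl)
    try:
        con.execute(sql)
        return None            # no explicit execution error
    except sqlite3.Error as e:
        return str(e)          # explicit failure for the detector
    finally:
        con.close()

schema = "CREATE TABLE users (id INTEGER, name TEXT);"
print(static_check("SELECT name FROM users", schema))  # None
print(static_check("SELECT age FROM users", schema))   # no such column: age
```

As the abstract notes, modern LLMs rarely trip this check, which is exactly why ErrorLLM supplements it with dedicated error tokens for implicit semantic errors.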
cs.CL / 61 / 2603.03752
Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning
信心校准的小型-大型语言模型协作以实现成本高效的推理
Abstract
Large language models (LLMs) demonstrate superior reasoning capabilities compared to small language models (SLMs), but incur substantially higher costs. We propose COllaborative REAsoner (COREA), a system that cascades an SLM with an LLM to achieve a balance between accuracy and cost in complex reasoning tasks. COREA first attempts to answer questions using the SLM, which outputs both an answer and a verbalized confidence score. Questions with confidence below a predefined threshold are deferred to the LLM for more accurate resolution. We introduce a reinforcement learning-based training algorithm that aligns the SLM's confidence through an additional confidence calibration reward. Extensive experiments demonstrate that our method jointly improves the SLM's reasoning ability and confidence calibration across diverse datasets and model backbones. Compared to using the LLM alone, COREA reduces cost by 21.5% and 16.8% on out-of-domain math and non-math datasets, respectively, with an absolute pass@1 drop of less than 2%.
Chinese Translation
大型语言模型(LLMs)在推理能力上优于小型语言模型(SLMs),但成本显著更高。我们提出了协作推理器(COllaborative REAsoner,COREA),该系统将SLM与LLM级联,以在复杂推理任务中实现准确性与成本之间的平衡。COREA首先尝试使用SLM回答问题,SLM输出答案和一个口头化的置信度分数。对于置信度低于预定义阈值的问题,将转交给LLM以获得更准确的解决方案。我们引入了一种基于强化学习的训练算法,通过额外的置信度校准奖励来调整SLM的置信度。大量实验表明,我们的方法在不同数据集和模型骨干上共同提高了SLM的推理能力和置信度校准。与单独使用LLM相比,COREA在域外数学和非数学数据集上分别降低了21.5%和16.8%的成本,同时仅在绝对通过率@1上下降不到2%。
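The confidence-threshold cascade at the core of COREA is straightforward to sketch. The stub models and the threshold value below are placeholders for a real SLM/LLM pair and a tuned deferral threshold.

```python
def cascade_answer(question, slm, llm, threshold=0.7):
    """Answer with the small model first; defer to the large model only
    when the SLM's verbalized confidence falls below the threshold."""
    answer, confidence = slm(question)
    if confidence >= threshold:
        return answer, "slm"
    return llm(question), "llm"

# Stub callables standing in for real model APIs.
slm = lambda q: ("4", 0.95) if q == "2+2?" else ("?", 0.3)
llm = lambda q: "42"
print(cascade_answer("2+2?", slm, llm))              # ('4', 'slm')
print(cascade_answer("meaning of life?", slm, llm))  # ('42', 'llm')
```

The cost savings reported above come from the fraction of questions the SLM answers on its own; the calibration reward trains the SLM so that its confidence reliably predicts when deferral is needed.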
cs.CL / 62 / 2603.03790
T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning
T2S-Bench与思维结构:综合文本到结构推理的基准测试与提示
Abstract
Think about how humans handle complex reading tasks: marking key points, inferring their relationships, and structuring information to guide understanding and responses. Likewise, can a large language model benefit from text structure to enhance its text-processing performance? To explore this question, we first introduce Structure of Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures, consistently boosting performance across eight tasks and three model families. Building upon this insight, we present T2S-Bench, the first benchmark designed to evaluate and improve text-to-structure capabilities of models. T2S-Bench includes 1.8K samples across 6 scientific domains and 32 structural types, rigorously constructed to ensure accuracy, fairness, and quality. Evaluation on 45 mainstream models reveals substantial improvement potential: the average accuracy on the multi-hop reasoning task is only 52.1%, and even the most advanced model achieves 58.1% node accuracy in end-to-end extraction. Furthermore, on Qwen2.5-7B-Instruct, SoT alone yields an average +5.7% improvement across eight diverse text-processing tasks, and fine-tuning on T2S-Bench further increases this gain to +8.6%. These results highlight the value of explicit text structuring and the complementary contributions of SoT and T2S-Bench. Dataset and eval code have been released at https://t2s-bench.github.io/T2S-Bench-Page/.
Chinese Translation
想想人类如何处理复杂的阅读任务:标记关键点、推断它们之间的关系,并构建信息结构以指导理解和反应。同样,大型语言模型能否从文本结构中受益,以提高文本处理性能?为探讨这一问题,在本研究中,我们首先引入了思维结构(Structure of Thought, SoT),这是一种提示技术,明确引导模型构建中间文本结构,在八个任务和三种模型家族中持续提升性能。在此基础上,我们提出了T2S-Bench,这是第一个旨在评估和改善模型文本到结构能力的基准测试。T2S-Bench包含来自六个科学领域和32种结构类型的1.8K样本,经过严格构建,以确保准确性、公平性和质量。在对45个主流模型的评估中,显示出显著的改进潜力:多跳推理任务的平均准确率仅为52.1%,即使是最先进的模型在端到端提取中也仅达到58.1%的节点准确率。此外,在Qwen2.5-7B-Instruct上,仅使用SoT就能在八个不同的文本处理任务中平均提高5.7%的性能,而在T2S-Bench上进行微调进一步将这一增益提升至8.6%。这些结果突显了明确文本结构的重要性以及SoT与T2S-Bench的互补贡献。数据集和评估代码已发布在https://t2s-bench.github.io/T2S-Bench-Page/。
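A Structure-of-Thought-style prompt can be sketched as a two-stage instruction: first elicit an explicit intermediate structure, then answer using only that structure. The wording below is an assumption for illustration; the paper's exact prompt is not given in the abstract, and entity-relation triples are just one possible structure type.

```python
def structure_of_thought_prompt(task, text):
    """Two-stage prompt: emit an intermediate structure, then answer
    conditioned on it (an illustrative SoT-style template)."""
    return (
        "Step 1: List the key entities in the text and the relations "
        "between them as 'entity -> relation -> entity' triples.\n"
        "Step 2: Using only that structure, " + task + "\n\n"
        "Text: " + text
    )

print(structure_of_thought_prompt(
    "answer: who supervised whom?",
    "Ada supervised Ben; Ben mentored Cara."))
```

The benchmark's end-to-end extraction task evaluates exactly the quality of the Step 1 output (node and relation accuracy) rather than the final answer.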
cs.CL / 63 / 2603.03844
Semantic Bridging Domains: Pseudo-Source as Test-Time Connector
语义桥接领域:伪源作为测试时连接器
Abstract
Distribution shifts between training and testing data are a critical bottleneck limiting the practical utility of models, especially in real-world test-time scenarios. To adapt models when the source domain is unknown and the target domain is unlabeled, previous works constructed pseudo-source domains via data generation and translation, then aligned the target domain with them. However, significant discrepancies exist between the pseudo-source and the original source domain, leading to potential divergence when correcting the target directly. From this perspective, we propose a Stepwise Semantic Alignment (SSA) method, viewing the pseudo-source as a semantic bridge connecting the source and target, rather than a direct substitute for the source. Specifically, we leverage easily accessible universal semantics to rectify the semantic features of the pseudo-source, and then align the target domain using the corrected pseudo-source semantics. Additionally, we introduce a Hierarchical Feature Aggregation (HFA) module and a Confidence-Aware Complementary Learning (CACL) strategy to enhance the semantic quality of the SSA process in the absence of source data and ground-truth labels for the target domain. We evaluated our approach on tasks like semantic segmentation and image classification, achieving a 5.2% performance boost on GTA2Cityscapes over the state-of-the-art.
Chinese Translation
训练数据与测试数据之间的分布变化是限制模型实际应用的重要瓶颈,尤其是在现实世界的测试场景中。当源领域未知且目标领域未标记时,如何适应模型成为一个挑战。以往的研究通过数据生成和转换构建伪源领域,然后将目标领域与之对齐。然而,伪源与原始源领域之间存在显著差异,直接修正目标时可能导致偏离。从这个角度出发,我们提出了一种逐步语义对齐(Stepwise Semantic Alignment, SSA)方法,将伪源视为连接源领域和目标领域的语义桥,而非源的直接替代品。具体而言,我们利用易于获取的通用语义来修正伪源的语义特征,然后使用修正后的伪源语义对目标领域进行对齐。此外,我们引入了分层特征聚合(Hierarchical Feature Aggregation, HFA)模块和基于置信度的互补学习(Confidence-Aware Complementary Learning, CACL)策略,以提高SSA过程中的语义质量,尤其是在缺乏源领域和目标领域的真实标签时。我们在语义分割和图像分类等任务上评估了我们的方法,在GTA2Cityscapes数据集上实现了相较于最先进技术的5.2%的性能提升。
cs.CL / 64 / 2603.03846
Benchmarking Motivational Interviewing Competence of Large Language Models
大型语言模型的动机访谈能力基准测试
Abstract
Motivational interviewing (MI) promotes behavioural change in substance use disorders. Its fidelity is measured using the Motivational Interviewing Treatment Integrity (MITI) framework. While large language models (LLMs) can potentially generate MI-consistent therapist responses, their competence under MITI is not well researched, especially on real-world clinical transcripts. We aim to benchmark the MI competence of proprietary and open-source models against human therapists in real-world transcripts and assess distinguishability from human therapists. Methods: We shortlisted 3 proprietary and 7 open-source LLMs from LMArena, evaluated performance using the MITI 4.2 framework on two datasets (96 handcrafted model transcripts, 34 real-world clinical transcripts). We generated parallel LLM-therapist utterances iteratively for each transcript while keeping client responses static, and ranked performance using a composite ranking system with MITI components and verbosity. We conducted a distinguishability experiment with two independent psychiatrists to identify human-vs-LLM responses. Results: All 10 tested LLMs had fair (MITI global scores >3.5) to good (MITI global scores >4) competence across MITI measures, and the three best-performing models (gemma-3-27b-it, gemini-2.5-pro, grok-3) were tested on real-world transcripts. All showed good competence, with LLMs outperforming the human expert in Complex Reflection percentage (human 39% vs LLM 96%) and Reflection-to-Question ratio (human 1.2 vs LLM >2.8). In the distinguishability experiment, psychiatrists identified LLM responses with only 56% accuracy, with d-prime values of 0.17 and 0.25 for gemini-2.5-pro and gemma-3-27b-it, respectively. Conclusion: LLMs can achieve good MI proficiency in real-world clinical transcripts under the MITI framework. These findings suggest that even open-source LLMs are viable candidates for expanding MI counselling sessions in low-resource settings.
Chinese Translation
动机访谈(MI)促进药物使用障碍的行为改变。其忠实度通过动机访谈治疗完整性(MITI)框架进行评估。虽然大型语言模型(LLMs)有潜力生成与MI一致的治疗师回应,但它们在MITI框架下的能力尚未得到充分研究,尤其是在真实世界的临床记录中。我们的目标是基准测试专有和开源模型在真实世界记录中与人类治疗师的MI能力,并评估其与人类治疗师的可区分性。方法:我们从LMArena中筛选了3个专有和7个开源LLM,使用MITI 4.2框架在两个数据集(96个手工制作的模型记录,34个真实世界的临床记录)上评估其表现。我们为每个记录迭代生成平行的LLM-治疗师发言,同时保持客户回应不变,并使用包含MITI组件和冗长度的综合排名系统对表现进行排名。我们与两名独立精神科医生进行了一项可区分性实验,以识别人类与LLM的回应。结果:所有10个测试的LLM在MITI指标上均表现出公平(MITI全球评分 >3.5)至良好(MITI全球评分 >4)的能力,表现最好的三个模型(gemma-3-27b-it、gemini-2.5-pro、grok-3)在真实世界记录中进行了测试。所有模型均表现出良好的能力,其中LLMs在复杂反思百分比(39% vs 96%)和反思-问题比率(1.2 vs >2.8)上优于人类专家。在可区分性实验中,精神科医生以仅56%的准确率识别LLM回应,gemini-2.5-pro和gemma-3-27b-it的d-prime分别为0.17和0.25。结论:LLMs能够在真实世界的临床记录中使用MITI框架实现良好的MI能力。这些发现表明,即使是开源LLMs也是在资源匮乏环境中扩展MI咨询会话的可行候选者。
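The d-prime values reported in the distinguishability experiment follow the standard signal-detection formula z(hit rate) - z(false-alarm rate). A minimal computation; the 0.56 below is used only as an illustrative hit rate, since the abstract does not report the raters' separate hit and false-alarm rates.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Signal-detection sensitivity: z(hit) - z(false alarm).

    Values near 0 mean raters cannot tell LLM from human responses.
    """
    z = NormalDist().inv_cdf  # inverse standard-normal CDF
    return z(hit_rate) - z(false_alarm_rate)

# Near-chance discrimination, in the range reported above (d' ~ 0.17-0.25).
print(round(d_prime(0.56, 0.50), 2))  # -> 0.15
```

A d-prime of 0 corresponds to pure guessing; the reported 0.17 and 0.25 indicate the psychiatrists were barely above chance at identifying LLM responses.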
cs.CL / 65 / 2603.03856
Coupling Local Context and Global Semantic Prototypes via a Hierarchical Architecture for Rhetorical Roles Labeling
通过分层架构耦合局部上下文和全局语义原型以进行修辞角色标注
Abstract
Rhetorical Role Labeling (RRL) identifies the functional role of each sentence in a document, a key task for discourse understanding in domains such as law and medicine. While hierarchical models capture local dependencies effectively, they are limited in modeling global, corpus-level features. To address this limitation, we propose two prototype-based methods that integrate local context with global representations. Prototype-Based Regularization (PBR) learns soft prototypes through a distance-based auxiliary loss to structure the latent space, while Prototype-Conditioned Modulation (PCM) constructs corpus-level prototypes and injects them during training and inference. Given the scarcity of RRL resources, we introduce SCOTUS-Law, the first dataset of U.S. Supreme Court opinions annotated with rhetorical roles at three levels of granularity: category, rhetorical function, and step. Experiments on legal, medical, and scientific benchmarks show consistent improvements over strong baselines, with Macro-F1 gains of 4 points on low-frequency roles. We further analyze the implications in the era of Large Language Models and complement our findings with expert evaluation.
Chinese Translation
修辞角色标注(RRL)识别文档中每个句子的功能角色,这是法律和医学等领域话语理解的关键任务。虽然分层模型有效捕捉局部依赖关系,但在建模全局语料库级特征方面存在局限。为了解决这一限制,我们提出了两种基于原型的方法,将局部上下文与全局表示相结合。基于原型的正则化(PBR)通过基于距离的辅助损失学习软原型,以构建潜在空间,而原型条件调制(PCM)构建语料库级原型,并在训练和推理过程中注入这些原型。鉴于RRL资源的稀缺,我们引入了SCOTUS-Law,这是第一个标注有修辞角色的美国最高法院意见数据集,涵盖三个层次的粒度:类别、修辞功能和步骤。在法律、医学和科学基准测试中的实验显示,相较于强基线有一致的提升,在低频角色上获得了4个宏观F1的提升。我们进一步分析了在大型语言模型时代的影响,并通过专家评估补充了我们的发现。
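The distance-based auxiliary loss behind PBR can be read as pulling each sentence embedding toward the prototype of its rhetorical role. The following is an illustrative reading of that idea, not the paper's exact loss; embeddings, labels, and prototypes are toy values.

```python
def prototype_regularizer(embeddings, labels, prototypes):
    """Mean squared distance between each sentence embedding and the
    (soft) prototype of its rhetorical role.

    Minimizing this term structures the latent space so that sentences
    with the same role cluster around a shared prototype.
    """
    total = 0.0
    for emb, lab in zip(embeddings, labels):
        total += sum((e - p) ** 2 for e, p in zip(emb, prototypes[lab]))
    return total / len(embeddings)

emb = [[1.0, 0.0], [0.0, 1.0]]      # two sentence embeddings
protos = [[1.0, 0.0], [0.0, 0.0]]   # one prototype per role
print(prototype_regularizer(emb, [0, 1], protos))  # -> 0.5
```

In the first example the embedding sits exactly on its prototype (distance 0); the second contributes distance 1, so the mean is 0.5.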
cs.CL / 66 / 2603.03862
Assessing the Effectiveness of LLMs in Delivering Cognitive Behavioral Therapy
评估大型语言模型在提供认知行为疗法中的有效性
Abstract
As mental health issues continue to rise globally, there is an increasing demand for accessible and scalable therapeutic solutions. Many individuals currently seek support from Large Language Models (LLMs), even though these models have not been validated for use in counseling services. In this paper, we evaluate LLMs' ability to emulate professional therapists practicing Cognitive Behavioral Therapy (CBT). Using anonymized, transcribed role-play sessions between licensed therapists and clients, we compare two approaches: (1) a generation-only method and (2) a Retrieval-Augmented Generation (RAG) approach using CBT guidelines. We evaluate both proprietary and open-source models for linguistic quality, semantic coherence, and therapeutic fidelity using standard natural language generation (NLG) metrics, natural language inference (NLI), and automated scoring for skills assessment. Our results indicate that while LLMs can generate CBT-like dialogues, they are limited in their ability to convey empathy and maintain consistency.
Chinese Translation
随着全球心理健康问题的持续上升,对可获取且可扩展的治疗解决方案的需求日益增加。许多人目前寻求大型语言模型(LLMs)的支持,尽管这些模型尚未经过验证以用于咨询服务。本文评估了LLMs模仿专业治疗师进行认知行为疗法(CBT)的能力。我们使用匿名的、转录的角色扮演会话,这些会话是在持证治疗师与客户之间进行的,比较了两种方法:(1)仅生成的方法和(2)使用CBT指南的检索增强生成(RAG)方法。我们使用标准的自然语言生成(NLG)指标、自然语言推理(NLI)和技能评估的自动评分,评估了专有模型和开源模型在语言质量、语义连贯性和治疗忠实度方面的表现。我们的结果表明,尽管LLMs能够生成类似CBT的对话,但在传达同理心和保持一致性方面存在局限性。
cs.CL / 67 / 2603.03884
CzechTopic: A Benchmark for Zero-Shot Topic Localization in Historical Czech Documents
CzechTopic:历史捷克文献中零样本主题定位的基准
Abstract
Topic localization aims to identify spans of text that express a given topic defined by a name and description. To study this task, we introduce a human-annotated benchmark based on Czech historical documents, containing human-defined topics together with manually annotated spans and supporting evaluation at both document and word levels. Evaluation is performed relative to human agreement rather than a single reference annotation. We evaluate a diverse range of large language models alongside BERT-based models fine-tuned on a distilled development dataset. Results reveal substantial variability among LLMs, with performance ranging from near-human topic detection to pronounced failures in span localization. While the strongest models approach human agreement, the distilled token embedding models remain competitive despite their smaller scale. The dataset and evaluation framework are publicly available at: https://github.com/dcgm/czechtopic.
Chinese Translation
主题定位旨在识别表达特定主题的文本片段,该主题由名称和描述定义。为了研究这一任务,我们引入了一个基于捷克历史文献的人为标注基准,包含人类定义的主题以及手动标注的文本片段,并支持在文档和词汇层面的评估。评估是相对于人类一致性进行的,而不是单一参考标注。我们评估了一系列多样的大型语言模型,以及在精简开发数据集上微调的基于BERT的模型。结果显示,LLM之间存在显著的变异性,性能从接近人类的主题检测到在片段定位上明显失败不等。尽管最强的模型接近人类一致性,但尽管规模较小,精简的词嵌入模型仍然具有竞争力。数据集和评估框架已公开发布,网址为:https://github.com/dcgm/czechtopic。
cs.CL / 68 / 2603.03915
Rethinking Role-Playing Evaluation: Anonymous Benchmarking and a Systematic Study of Personality Effects
重新思考角色扮演评估:匿名基准测试与个性影响的系统研究
Abstract
Large language models (LLMs) have demonstrated significant potential in developing Role-Playing Agents (RPAs). However, current research primarily evaluates RPAs using famous fictional characters, allowing models to rely on memory associated with character names. This dependency creates a bias that limits the generalization of RPAs to unseen personas. To address this issue, we propose an anonymous evaluation method. Experiments across multiple benchmarks reveal that anonymization significantly degrades role-playing performance, confirming that name exposure carries implicit information. Furthermore, we investigate personality augmentation to enhance role fidelity under the anonymous setting. We systematically compare the efficacy of personality traits derived from human annotations versus those self-generated by the model. Our results demonstrate that incorporating personality information consistently improves RPA performance. Crucially, self-generated personalities achieve performance comparable to human-annotated ones. This work establishes a fairer evaluation protocol and validates a scalable, personality-enhanced framework for constructing robust RPAs.
Chinese Translation
大型语言模型(LLMs)在开发角色扮演代理(RPAs)方面展现了显著的潜力。然而,目前的研究主要通过著名虚构角色来评估RPAs,这使得模型能够依赖与角色名称相关的记忆。这种依赖性造成了偏见,限制了RPAs对未见角色的泛化能力。为了解决这一问题,我们提出了一种匿名评估方法。多个基准测试的实验结果表明,匿名化显著降低了角色扮演的表现,确认了名称暴露携带隐含信息。此外,我们还研究了个性增强,以在匿名环境下提高角色忠实度。我们系统地比较了来源于人工标注的个性特征与模型自生成特征的有效性。我们的结果表明,纳入个性信息始终能改善RPA的表现。重要的是,自生成的个性在表现上可与人工标注的个性相媲美。本研究建立了一种更公平的评估协议,并验证了一种可扩展的、增强个性的框架,用于构建稳健的RPAs。
cs.CL / 69 / 2603.04033
Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical Open-Ended QA
谁来评判评判者?评估大型语言模型(LLM)作为法语医学开放式问答的评判者
Abstract
Automatic evaluation of medical open-ended question answering (OEQA) remains challenging due to the need for expert annotations. We evaluate whether large language models (LLMs) can act as judges of semantic equivalence in French medical OEQA, comparing closed-access, general-purpose, and biomedical domain-adapted models. Our results show that LLM-based judgments are strongly influenced by the model that generated the answer, with agreement varying substantially across generators. Domain-adapted and large general-purpose models achieve the highest alignment with expert annotations. We further show that lightweight adaptation of a compact model using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) substantially improves performance and reduces generator sensitivity, even with limited data. Overall, our findings highlight the need for generator-aware evaluation and suggest that carefully adapted small models can support scalable evaluation in low-resource medical settings.
Chinese Translation
由于需要专家注释,医学开放式问答(OEQA)的自动评估仍然具有挑战性。我们评估大型语言模型(LLMs)是否能够作为法语医学OEQA中语义等价的评判者,比较了闭源、通用和生物医学领域适应模型。我们的结果表明,基于LLM的判断受到生成答案的模型的强烈影响,不同生成器之间的一致性差异显著。领域适应模型和大型通用模型与专家注释的对齐度最高。我们进一步表明,使用监督微调(SFT)和组相对策略优化(GRPO)对紧凑模型进行轻量级适应显著提高了性能,并减少了生成器的敏感性,即使在数据有限的情况下也是如此。总体而言,我们的研究结果强调了生成器感知评估的必要性,并建议经过精心调整的小型模型可以支持低资源医学环境中的可扩展评估。
cs.CL / 70 / 2603.04069
Monitoring Emergent Reward Hacking During Generation via Internal Activations
通过内部激活监测生成过程中的新兴奖励黑客行为
Abstract
Fine-tuned large language models can exhibit reward-hacking behavior arising from emergent misalignment, which is difficult to detect from final outputs alone. While prior work has studied reward hacking at the level of completed responses, it remains unclear whether such behavior can be identified during generation. We propose an activation-based monitoring approach that detects reward-hacking signals from internal representations as a model generates its response. Our method trains sparse autoencoders on residual stream activations and applies lightweight linear classifiers to produce token-level estimates of reward-hacking activity. Across multiple model families and fine-tuning mixtures, we find that internal activation patterns reliably distinguish reward-hacking from benign behavior, generalize to unseen mixed-policy adapters, and exhibit model-dependent temporal structure during chain-of-thought reasoning. Notably, reward-hacking signals often emerge early, persist throughout reasoning, and can be amplified by increased test-time compute in the form of chain-of-thought prompting under weakly specified reward objectives. These results suggest that internal activation monitoring provides a complementary and earlier signal of emergent misalignment than output-based evaluation, supporting more robust post-deployment safety monitoring for fine-tuned language models.
Chinese Translation
经过微调的大型语言模型可能会表现出由于新兴错位而产生的奖励黑客行为,这种行为仅通过最终输出很难被检测到。尽管之前的研究已在完成响应的层面上研究了奖励黑客,但尚不清楚这种行为是否可以在生成过程中被识别。我们提出了一种基于激活的监测方法,该方法在模型生成响应时从内部表示中检测奖励黑客信号。我们的方法在残差流激活上训练稀疏自编码器,并应用轻量级线性分类器生成令牌级的奖励黑客活动估计。在多个模型家族和微调混合中,我们发现内部激活模式能够可靠地区分奖励黑客行为和良性行为,能够推广到未见过的混合策略适配器,并在链式推理过程中表现出模型依赖的时间结构。值得注意的是,奖励黑客信号通常在早期出现,在推理过程中持续存在,并且在弱指定奖励目标下通过链式推理提示的形式可以通过增加测试时计算量而被放大。这些结果表明,内部激活监测提供了比基于输出的评估更早的关于新兴错位的信号,从而支持对微调语言模型进行更稳健的部署后安全监测。
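The monitoring pipeline described above (sparse-autoencoder features over residual-stream activations, followed by a lightweight linear classifier) can be sketched per token. All weights below are random placeholders standing in for the trained SAE encoder and probe; the dimensions are invented.

```python
import math
import random

def token_hack_scores(acts, enc_w, enc_b, probe_w, probe_b):
    """Token-level reward-hacking estimates.

    Each residual-stream vector is encoded with a ReLU sparse-autoencoder
    encoder, then a linear probe with a sigmoid yields a per-token
    probability of reward-hacking activity.
    """
    scores = []
    for a in acts:  # one residual-stream vector per generated token
        feats = [max(sum(x * w for x, w in zip(a, col)) + b, 0.0)
                 for col, b in zip(enc_w, enc_b)]          # ReLU(a @ W + b)
        logit = sum(f * w for f, w in zip(feats, probe_w)) + probe_b
        scores.append(1.0 / (1.0 + math.exp(-logit)))      # sigmoid
    return scores

random.seed(0)
d, m, n_tok = 8, 16, 5  # model width, SAE features, tokens (toy sizes)
acts = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_tok)]
enc_w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
enc_b = [0.0] * m
probe_w = [random.gauss(0, 1) for _ in range(m)]
scores = token_hack_scores(acts, enc_w, enc_b, probe_w, 0.0)
print(len(scores), all(0.0 <= s <= 1.0 for s in scores))  # -> 5 True
```

Because scores are produced token by token, a monitor built this way can flag reward-hacking signals early in a chain-of-thought rather than only after the full response is complete.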
cs.CL / 71 / 2603.04083
Hindsight Quality Prediction Experiments in Multi-Candidate Human-Post-Edited Machine Translation
多候选人人工后编辑机器翻译中的事后质量预测实验
Abstract
This paper investigates two complementary paradigms for predicting machine translation (MT) quality: source-side difficulty prediction and candidate-side quality estimation (QE). The rapid adoption of Large Language Models (LLMs) into MT workflows is reshaping the research landscape, yet its impact on established quality prediction paradigms remains underexplored. We study this issue through a series of "hindsight" experiments on a unique, multi-candidate dataset resulting from a genuine MT post-editing (MTPE) project. The dataset consists of over 6,000 English source segments with nine translation hypotheses from a diverse set of traditional neural MT systems and advanced LLMs, all evaluated against a single, final human post-edited reference. Using Kendall's rank correlation, we assess the predictive power of source-side difficulty metrics, candidate-side QE models and position heuristics against two gold-standard scores: TER (as a proxy for post-editing effort) and COMET (as a proxy for human judgment). Our findings highlight that the architectural shift towards LLMs alters the reliability of established quality prediction methods while simultaneously mitigating previous challenges in document-level translation.
Chinese Translation
本文探讨了两种互补的机器翻译(MT)质量预测范式:源侧难度预测和候选侧质量估计(QE)。大型语言模型(LLMs)在机器翻译工作流程中的快速应用正在重塑研究领域,但其对既定质量预测范式的影响仍未得到充分探讨。我们通过一系列“事后”实验研究这一问题,这些实验基于一个独特的多候选人数据集,该数据集源于一个真实的机器翻译后编辑(MTPE)项目。该数据集包含超过6000个英语源段落,具有来自多种传统神经机器翻译系统和先进大型语言模型的九个翻译假设,所有假设均与一个最终的人类后编辑参考进行评估。我们使用肯德尔秩相关系数评估源侧难度指标、候选侧QE模型和位置启发式方法对两个黄金标准评分的预测能力:TER(作为后编辑工作量的代理)和COMET(作为人类判断的代理)。我们的研究结果强调,向大型语言模型的架构转变改变了既定质量预测方法的可靠性,同时减轻了文档级翻译中的先前挑战。
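The Kendall rank correlations used to score difficulty and QE predictors against the TER and COMET gold standards can be computed directly. For simplicity this sketch implements tau-a (no tie correction); the paper may use a tie-corrected variant.

```python
def kendall_tau(x, y):
    """Kendall rank correlation (tau-a): concordant minus discordant
    pairs, divided by the total number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            if prod > 0:
                concordant += 1
            elif prod < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Perfect agreement and perfect disagreement between predictor and gold score.
print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))   # -> 1.0
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))   # -> -1.0
```

A predictor whose ranking of segments matches the gold TER ranking gets tau near 1; a value near 0 means the predictor carries no ranking signal.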
cs.CL / 72 / 2603.04123
FINEST: Improving LLM Responses to Sensitive Topics Through Fine-Grained Evaluation
FINEST:通过细粒度评估改善大型语言模型对敏感话题的回应
Abstract
Large Language Models (LLMs) often generate overly cautious and vague responses on sensitive topics, sacrificing helpfulness for safety. Existing evaluation frameworks lack systematic methods to identify and address specific weaknesses in responses to sensitive topics, making it difficult to improve both safety and helpfulness simultaneously. To address this, we introduce FINEST, a FINE-grained response evaluation taxonomy for Sensitive Topics, which breaks down helpfulness and harmlessness into errors across three main categories: Content, Logic, and Appropriateness. Experiments on a Korean-sensitive question dataset demonstrate that our score- and error-based improvement pipeline, guided by FINEST, significantly improves the model responses across all three categories, outperforming refinement without guidance. Notably, score-based improvement -- providing category-specific scores and justifications -- yields the most significant gains, reducing the error sentence ratio for Appropriateness by up to 33.09%. This work lays the foundation for a more explainable and comprehensive evaluation and improvement of LLM responses to sensitive questions.
Chinese Translation
大型语言模型(LLMs)在敏感话题上往往生成过于谨慎和模糊的回应,牺牲了有用性以追求安全性。现有的评估框架缺乏系统性的方法来识别和解决对敏感话题回应中的具体弱点,使得同时改善安全性和有用性变得困难。为了解决这个问题,我们引入了FINEST,一个针对敏感话题的细粒度回应评估分类法,它将有用性和无害性分解为三个主要类别中的错误:内容、逻辑和适当性。在一个针对韩国敏感问题的数据集上的实验表明,我们基于FINEST指导的评分和错误改进流程显著改善了模型在所有三个类别中的回应,优于无指导的精炼。值得注意的是,基于评分的改进——提供类别特定的评分和理由——带来了最显著的提升,将适当性的错误句子比例降低了多达33.09%。这项工作为对大型语言模型在敏感问题上的回应进行更具解释性和全面性的评估与改进奠定了基础。
cs.CL / 73 / 2603.04145
VietNormalizer: An Open-Source, Dependency-Free Python Library for Vietnamese Text Normalization in TTS and NLP Applications
VietNormalizer:一个开源、无依赖的Python库,用于TTS和NLP应用中的越南语文本标准化
Abstract
We present VietNormalizer, an open-source, zero-dependency Python library for Vietnamese text normalization targeting Text-to-Speech (TTS) and Natural Language Processing (NLP) applications. Vietnamese text normalization is a critical yet underserved preprocessing step: real-world Vietnamese text is densely populated with non-standard words (NSWs), including numbers, dates, times, currency amounts, percentages, acronyms, and foreign-language terms, all of which must be converted to fully pronounceable Vietnamese words before TTS synthesis or downstream language processing. Existing Vietnamese normalization tools either require heavy neural dependencies while covering only a narrow subset of NSW classes, or are embedded within larger NLP toolkits without standalone installability. VietNormalizer addresses these gaps through a unified, rule-based pipeline that: (1) converts arbitrary integers, decimals, and large numbers to Vietnamese words; (2) normalizes dates and times to their spoken Vietnamese forms; (3) handles VND and USD currency amounts; (4) expands percentages; (5) resolves acronyms via a customizable CSV dictionary; (6) transliterates non-Vietnamese loanwords and foreign terms to Vietnamese phonetic approximations; and (7) performs Unicode normalization and emoji/special-character removal. All regular expression patterns are pre-compiled at initialization, enabling high-throughput batch processing with minimal memory overhead and no GPU or external API dependency. The library is installable via pip install vietnormalizer, available on PyPI and GitHub at https://github.com/nghimestudio/vietnormalizer, and released under the MIT license. We discuss the design decisions, limitations of existing approaches, and the generalizability of the rule-based normalization paradigm to other low-resource tonal and agglutinative languages.
Chinese Translation
我们介绍了VietNormalizer,一个开源、零依赖的Python库,旨在为文本转语音(TTS)和自然语言处理(NLP)应用提供越南语文本标准化。越南语文本标准化是一个关键但服务不足的预处理步骤:现实世界中的越南语文本充斥着非标准词(NSWs),包括数字、日期、时间、货币金额、百分比、缩略词和外语术语,所有这些在进行TTS合成或后续语言处理之前都必须转换为完全可发音的越南语词汇。现有的越南语标准化工具要么依赖庞大的神经网络组件且仅覆盖少数NSW类别,要么嵌入在更大的NLP工具包中而无法独立安装。VietNormalizer通过统一的基于规则的管道解决了这些问题,该管道:(1) 将任意整数、小数和大数字转换为越南语词汇;(2) 将日期和时间标准化为其越南语口语形式;(3) 处理越南盾(VND)和美元(USD)货币金额;(4) 展开百分比;(5) 通过可定制的CSV字典解析缩略词;(6) 将非越南语借词和外语术语音译为越南语语音近似形式;(7) 执行Unicode标准化和表情符号/特殊字符移除。所有正则表达式模式在初始化时预编译,使得以最小的内存开销进行高吞吐量批处理成为可能,并且不依赖GPU或外部API。该库可以通过pip install vietnormalizer安装,已在PyPI和GitHub上发布,网址为https://github.com/nghimestudio/vietnormalizer,并采用MIT许可证。我们讨论了设计决策、现有方法的局限性,以及基于规则的标准化范式对其他低资源声调语言和黏着语言的普适性。
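To illustrate the rule-based paradigm the abstract describes, here is a toy number-to-words converter for 0-99 using the standard Vietnamese irregulars (final 1 becomes "mốt" after "mươi", final 5 becomes "lăm"); the function name and coverage are illustrative only and do not reflect VietNormalizer's actual API:

```python
DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]

def number_to_vietnamese(n):
    """Toy rule-based converter for integers 0-99 (illustrative sketch only)."""
    if n < 10:
        return DIGITS[n]
    tens, units = divmod(n, 10)
    head = "mười" if tens == 1 else DIGITS[tens] + " mươi"
    if units == 0:
        return head
    if units == 1 and tens > 1:
        return head + " mốt"   # final 1 after "mươi" becomes "mốt"
    if units == 5:
        return head + " lăm"   # final 5 after "mười"/"mươi" becomes "lăm"
    return head + " " + DIGITS[units]
```

A real pipeline layers many such rules (dates, currency, acronyms) behind pre-compiled regex dispatch, which is what keeps the approach dependency-free.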
cs.CL / 74 / 2603.04161
Traces of Social Competence in Large Language Models
大型语言模型中的社会能力痕迹
Abstract
The False Belief Test (FBT) has been the main method for assessing Theory of Mind (ToM) and related socio-cognitive competencies. For Large Language Models (LLMs), the reliability and explanatory potential of this test have remained limited due to issues like data contamination, insufficient model details, and inconsistent controls. We address these issues by testing 17 open-weight models on a balanced set of 192 FBT variants (Trott et al. 2023) using Bayesian logistic regression to identify how model size and post-training affect socio-cognitive competence. We find that scaling model size benefits performance, but not strictly. A cross-over effect reveals that explicating propositional attitudes (X thinks) fundamentally alters response patterns. Instruction tuning partially mitigates this effect, but further reasoning-oriented finetuning amplifies it. In a case study analysing social reasoning ability throughout OLMo 2 training, we show that this cross-over effect emerges during pre-training, suggesting that models acquire stereotypical response patterns tied to mental-state vocabulary that can outweigh other scenario semantics. Finally, vector steering allows us to isolate a think vector as the causal driver of observed FBT behaviour.
Chinese Translation
错误信念测试(False Belief Test, FBT)一直是评估心智理论(Theory of Mind, ToM)及相关社会认知能力的主要方法。然而,对于大型语言模型(Large Language Models, LLMs)而言,由于数据污染、模型细节不足和控制不一致等问题,这一测试的可靠性和解释潜力仍然有限。我们通过对17个开放权重模型在192个平衡的FBT变体(Trott et al. 2023)上进行测试,使用贝叶斯逻辑回归分析模型规模和后训练对社会认知能力的影响,从而解决这些问题。我们的研究发现,扩大模型规模有助于提升性能,但并非绝对。交叉效应表明,阐明命题态度(X thinks)会根本性地改变响应模式。指令调优在一定程度上缓解了这一效应,但进一步的以推理为导向的微调则加剧了这一效应。在分析OLMo 2训练过程中社会推理能力的案例研究中,我们展示了这一交叉效应在预训练阶段就已出现,表明模型获得了与心理状态词汇相关的刻板响应模式,这种模式可能压过其他场景语义。最后,向量引导技术使我们能够分离出一个“思考”向量,作为所观察到的FBT行为的因果驱动因素。
cs.CL / 75 / 2603.04162
Bielik-Q2-Sharp: A Comparative Study of Extreme 2-bit Quantization Methods for a Polish 11B Language Model
Bielik-Q2-Sharp:波兰语11B语言模型极端2位量化方法的比较研究
Abstract
We present Bielik-Q2-Sharp, the first systematic academic evaluation of extreme 2-bit quantization applied to a Polish large language model. Using Bielik-11B-v2.3-Instruct (11B parameters, Mistral architecture) as our base model, we compare six state-of-the-art post-training quantization methods -- QuIP#, SpinQuant+GPTQ, ButterflyQuant, QTIP, VPTQ, and AQLM -- all calibrated on a Polish-language corpus (CulturaX-PL) with shared Hessian matrices. Our best variant (QuIP# E8P12) achieves 71.92% across 22 Polish benchmarks versus 72.07% for the IQ2_XXS baseline -- within statistical noise, at a modest size premium (3.26 GB vs. ~2.6 GB). On eq_bench, our method scores 47.14 versus 43.53 (+3.6pp), suggesting superior preservation of higher-order reasoning. QTIP achieves the best per-bit efficiency (79.4% MC acc_norm at ~2.4 bpw, 3.27 GB), matching VPTQ's quality at 35% smaller size. We additionally document an MC-generation dissociation phenomenon where rotation-based methods preserve log-likelihood quality but fail catastrophically at autoregressive generation. The entire project was conducted by a single independent researcher on cloud GPUs (vast.ai) within a $285 budget. All models, Hessians, and evaluation logs are publicly available.
Chinese Translation
我们提出了Bielik-Q2-Sharp,这是对应用于波兰语大型语言模型的极端2位量化的首次系统性学术评估。以Bielik-11B-v2.3-Instruct(11B参数,Mistral架构)作为基础模型,我们比较了六种最先进的后训练量化方法——QuIP#、SpinQuant+GPTQ、ButterflyQuant、QTIP、VPTQ和AQLM——这些方法均在波兰语语料库(CulturaX-PL)上校准,并共享Hessian矩阵。我们的最佳变体(QuIP# E8P12)在22个波兰语基准测试中达到了71.92%的准确率,而IQ2_XXS基线为72.07%——差距在统计噪声范围内,且仅有适度的体积溢价(3.26 GB对比约2.6 GB)。在eq_bench上,我们的方法得分为47.14,而IQ2_XXS为43.53(+3.6个百分点),表明其在高阶推理的保持上更具优势。QTIP在每位效率上表现最佳(约2.4 bpw、3.27 GB时MC acc_norm达79.4%),在体积上比VPTQ小35%却能匹配其质量。此外,我们还记录了一种MC与生成解离的现象:基于旋转的方法能保持对数似然质量,但在自回归生成时却灾难性地失败。整个项目由一位独立研究者在云GPU(vast.ai)上以285美元的预算完成。所有模型、Hessian矩阵和评估日志均已公开。
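None of the six methods above is this simple, but the basic mechanics of 2-bit post-training quantization can be sketched as naive uniform rounding to four levels (an illustration only, not QuIP#, QTIP, VPTQ, or AQLM):

```python
def quantize_2bit(weights):
    """Naive 4-level (2-bit) uniform quantization: returns codes and dequantized values."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 3 or 1.0          # four levels (codes 0..3); guard constant input
    codes = [round((w - lo) / scale) for w in weights]
    dequant = [lo + c * scale for c in codes]
    return codes, dequant
```

The methods compared in the paper replace this uniform grid with rotations, lattice/vector codebooks, or additive quantization precisely because naive rounding at 2 bits destroys model quality.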
cs.CL / 76 / 2603.04217
When Do Language Models Endorse Limitations on Human Rights Principles?
语言模型何时支持对人权原则的限制?
Abstract
As Large Language Models (LLMs) increasingly mediate global information access with the potential to shape public discourse, their alignment with universal human rights principles becomes important to ensure that these rights are upheld in high-stakes AI-mediated interactions. In this paper, we evaluate how LLMs navigate trade-offs involving the Universal Declaration of Human Rights (UDHR), leveraging 1,152 synthetically generated scenarios across 24 rights articles and eight languages. Our analysis of eleven major LLMs reveals systematic biases where models: (1) accept limiting Economic, Social, and Cultural rights more often than Political and Civil rights, (2) demonstrate significant cross-linguistic variation with elevated endorsement rates of rights-limiting actions in Chinese and Hindi compared to English or Romanian, (3) show substantial susceptibility to prompt-based steering, and (4) exhibit noticeable differences between Likert and open-ended responses, highlighting critical challenges in LLM preference assessment.
Chinese Translation
随着大型语言模型(LLMs)在全球信息获取中扮演越来越重要的角色,并有可能塑造公共话语,它们与普遍人权原则的一致性变得尤为重要,以确保在高风险的人工智能介导互动中遵守这些权利。本文评估了LLMs如何在涉及《世界人权宣言》(UDHR)的权衡中进行导航,利用了1,152个合成生成的场景,涵盖24个权利条款和八种语言。我们对十一种主要LLMs的分析揭示了系统性偏见,其中模型:(1)更常接受对经济、社会和文化权利的限制,而非政治和公民权利;(2)表现出显著的跨语言差异,在中文和印地语中对限制权利的行为的支持率高于英语或罗马尼亚语;(3)对基于提示的引导表现出显著的敏感性;(4)在李克特量表和开放式回答之间存在明显差异,突显了LLM偏好评估中的关键挑战。
cs.CL / 77 / 2603.04238
Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG
检索还是表示?重新评估多语言和视觉丰富的RAG基准差距
Abstract
Retrieval-augmented generation (RAG) is a common way to ground language models in external documents and up-to-date information. Classical retrieval systems relied on lexical methods such as BM25, which rank documents by term overlap with corpus-level weighting. End-to-end multimodal retrievers trained on large query-document datasets claim substantial improvements over these approaches, especially for multilingual documents with complex visual layouts. We demonstrate that better document representation is the primary driver of benchmark improvements. By systematically varying transcription and preprocessing methods while holding the retrieval mechanism fixed, we demonstrate that BM25 can recover large gaps on multilingual and visual benchmarks. Our findings call for decomposed evaluation benchmarks that separately measure transcription and retrieval capabilities, enabling the field to correctly attribute progress and focus effort where it matters.
Chinese Translation
检索增强生成(RAG)是一种将语言模型与外部文档和最新信息结合的常见方法。经典的检索系统依赖于诸如BM25的词汇方法,即通过词项重叠并辅以语料库级加权来对文档进行排序。在大规模查询-文档数据集上训练的端到端多模态检索器声称相较这些方法有显著改进,尤其是在具有复杂视觉布局的多语言文档上。我们证明,更好的文档表示才是基准改进的主要驱动因素。通过系统地改变转录和预处理方法,同时保持检索机制不变,我们展示了BM25可以在多语言和视觉基准上弥补较大的差距。我们的研究结果呼吁采用分解式评估基准,分别衡量转录和检索能力,使该领域能够正确地归因进展,并将精力集中在真正重要的地方。
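The BM25 baseline referenced above, ranking by term overlap with corpus-level weighting, fits in a few lines; the tokenized toy corpus below is hypothetical, and k1/b are the usual defaults:

```python
import math

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a tokenized query (Okapi BM25)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in set(query):
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # corpus-level weighting
        tf = doc.count(term)                              # term overlap
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

# Hypothetical tokenized corpus for illustration.
corpus = [["layout", "parsing"], ["ocr", "transcription", "quality"], ["bm25", "ranking"]]
```

Note that BM25 sees only the tokens it is given, which is exactly why the paper's transcription/preprocessing ablations matter.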
cs.CL / 78 / 2603.04257
Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory
Memex(RL):通过索引经验记忆扩展长时间跨度的LLM代理
Abstract
Large language model (LLM) agents are fundamentally bottlenecked by finite context windows on long-horizon tasks. As trajectories grow, retaining tool outputs and intermediate reasoning in-context quickly becomes infeasible: the working context becomes prohibitively long, eventually exceeds the context budget, and makes distant evidence harder to use even when it is still present. Existing solutions typically shorten context through truncation or running summaries, but these methods are fundamentally lossy because they compress or discard past evidence itself. We introduce Memex, an indexed experience memory mechanism that instead compresses context without discarding evidence. Memex maintains a compact working context consisting of concise structured summaries and stable indices, while storing full-fidelity underlying interactions in an external experience database under those indices. The agent can then decide when to dereference an index and recover the exact past evidence needed for the current subgoal. We optimize both write and read behaviors with our reinforcement learning framework MemexRL, using reward shaping tailored to indexed memory usage under a context budget, so the agent learns what to summarize, what to archive, how to index it, and when to retrieve it. This yields a substantially less lossy form of long-horizon memory than summary-only approaches. We further provide a theoretical analysis showing the potential of the Memex loop to preserve decision quality with bounded dereferencing while keeping effective in-context computation bounded as history grows. Empirically, on challenging long-horizon tasks, Memex agent trained with MemexRL improves task success while using a significantly smaller working context.
Chinese Translation
大型语言模型(LLM)代理在长时间跨度任务中受到有限上下文窗口的根本限制。随着轨迹的增长,在上下文中保留工具输出和中间推理很快变得不可行:工作上下文变得过于冗长,最终超出上下文预算,并使得较早的证据即使仍然存在也更难被利用。现有解决方案通常通过截断或滚动摘要来缩短上下文,但这些方法本质上是有损的,因为它们压缩或丢弃了过去的证据本身。我们提出了Memex,一种索引化的经验记忆机制,它在不丢弃证据的情况下压缩上下文。Memex维护一个紧凑的工作上下文,由简洁的结构化摘要和稳定的索引组成,同时将完整保真度的底层交互存储在这些索引之下的外部经验数据库中。代理随后可以决定何时解引用某个索引,恢复当前子目标所需的确切过去证据。我们通过强化学习框架MemexRL联合优化写入和读取行为,并使用针对上下文预算下索引记忆使用而定制的奖励塑形,使代理学会总结什么、归档什么、如何索引以及何时检索。与仅依赖摘要的方法相比,这提供了一种损失显著更小的长时间跨度记忆。我们进一步给出理论分析,表明Memex循环有潜力在有界解引用的条件下保持决策质量,同时使有效的上下文内计算随历史增长保持有界。在具有挑战性的长时间跨度任务上,使用MemexRL训练的Memex代理在使用显著更小的工作上下文的同时提高了任务成功率。
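The write/dereference loop described above can be sketched as a compact working context of (index, summary) pairs backed by a full-fidelity store; class and method names are illustrative, not the paper's implementation:

```python
class ExperienceMemory:
    """Compact working context (index, summary) backed by a full-fidelity store."""

    def __init__(self):
        self._db = {}               # index -> full interaction record
        self.working_context = []   # list of (index, concise summary)

    def write(self, interaction, summary):
        idx = f"mem:{len(self._db)}"
        self._db[idx] = interaction              # archive the full evidence
        self.working_context.append((idx, summary))
        return idx

    def dereference(self, idx):
        """Recover the exact past evidence when a subgoal needs it."""
        return self._db[idx]

mem = ExperienceMemory()
idx = mem.write("raw tool output: " + "x" * 500,
                "tool call succeeded; 500-byte payload archived")
```

The learned part in MemexRL is precisely the policy over these calls: what to summarize, what to archive, and when dereferencing is worth the budget.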
cs.CL / 79 / 2603.04292
Position: Vector Prompt Interfaces Should Be Exposed to Enable Customization of Large Language Models
立场:应公开向量提示接口以实现大语言模型的定制化
Abstract
As large language models (LLMs) transition from research prototypes to real-world systems, customization has emerged as a central bottleneck. While text prompts can already customize LLM behavior, we argue that text-only prompting does not constitute a suitable control interface for scalable, stable, and inference-only customization. This position paper argues that model providers should expose \emph{vector prompt inputs} as part of the public interface for customizing LLMs. We support this position with diagnostic evidence showing that vector prompt tuning continues to improve with increasing supervision whereas text-based prompt optimization saturates early, and that vector prompts exhibit dense, global attention patterns indicative of a distinct control mechanism. We further discuss why inference-only customization is increasingly important under realistic deployment constraints, and why exposing vector prompts need not fundamentally increase model leakage risk under a standard black-box threat model. We conclude with a call to action for the community to rethink prompt interfaces as a core component of LLM customization.
Chinese Translation
随着大语言模型(LLMs)从研究原型转变为实际系统,定制化已成为一个核心瓶颈。尽管文本提示已经能够定制LLM的行为,但我们认为仅使用文本提示并不构成可扩展、稳定且仅限推理的定制控制接口。本立场文件认为,模型提供者应将\emph{向量提示输入}作为定制LLM的公共接口的一部分予以公开。我们通过诊断性证据支持这一立场:向量提示调优随着监督的增加而持续改进,而基于文本的提示优化则很早便饱和;并且向量提示表现出密集的、全局性的注意力模式,表明其是一种独特的控制机制。我们进一步讨论了在现实部署约束下,为什么仅限推理的定制化变得越来越重要,以及为什么在标准黑箱威胁模型下,公开向量提示不必从根本上增加模型泄漏风险。最后,我们呼吁社区重新思考提示接口,将其作为LLM定制化的核心组成部分。
cs.CL / 80 / 2603.04299
The Company You Keep: How LLMs Respond to Dark Triad Traits
你所结交的同伴:大型语言模型如何应对黑暗三角特质
Abstract
Large Language Models (LLMs) often exhibit highly agreeable and reinforcing conversational styles, also known as AI-sycophancy. Although this behavior is encouraged, it may become problematic when interacting with user prompts that reflect negative social tendencies. Such responses risk amplifying harmful behavior rather than mitigating it. In this study, we examine how LLMs respond to user prompts expressing varying degrees of Dark Triad traits (Machiavellianism, Narcissism, and Psychopathy) using a curated dataset. Our analysis reveals differences across models, whereby all models predominantly exhibit corrective behavior, while showing reinforcing output in certain cases. Model behavior also depends on the severity level and differs in the sentiment of the response. Our findings raise implications for designing safer conversational systems that can detect and respond appropriately when users escalate from benign to harmful requests.
Chinese Translation
大型语言模型(LLMs)通常表现出高度赞同和强化的对话风格,也被称为人工智能谄媚。尽管这种行为受到鼓励,但在与反映负面社会倾向的用户提示互动时,它可能会带来问题。这类回应有可能放大有害行为,而不是减轻它。在本研究中,我们使用一个精心构建的数据集,考察了LLMs如何回应表达不同程度黑暗三角特质(马基雅维利主义、自恋和精神病态)的用户提示。我们的分析揭示了模型之间的差异:所有模型主要表现出纠正行为,但在某些情况下也显示出强化性输出。模型行为还取决于特质的严重程度,并在回应的情感上有所不同。我们的发现对设计更安全的对话系统具有启示意义,这类系统需要能够在用户从良性请求升级到有害请求时进行检测并作出适当回应。
cs.CL / 81 / 2603.04304
$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners
$V_1$: 统一生成与自我验证的并行推理器
Abstract
Test-time scaling for complex reasoning tasks shows that leveraging inference-time compute, by methods such as independently sampling and aggregating multiple solutions, results in significantly better task outcomes. However, a critical bottleneck is verification: sampling is only effective if correct solutions can be reliably identified among candidates. While existing approaches typically evaluate candidates independently via scalar scoring, we demonstrate that models are substantially stronger at pairwise self-verification. Leveraging this insight, we introduce $V_1$, a framework that unifies generation and verification through efficient pairwise ranking. $V_1$ comprises two components: $V_1$-Infer, an uncertainty-guided algorithm using a tournament-based ranking that dynamically allocates self-verification compute to candidate pairs whose relative correctness is most uncertain; and $V_1$-PairRL, an RL framework that jointly trains a single model as both generator and pairwise self-verifier, ensuring the verifier adapts to the generator's evolving distribution. On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, $V_1$-Infer improves Pass@1 by up to 10\% over pointwise verification and outperforms recent test-time scaling methods while being significantly more efficient. Furthermore, $V_1$-PairRL achieves 7--9\% test-time scaling gains over standard RL and pointwise joint training, and improves base Pass@1 by up to 8.7\% over standard RL in a code-generation setting.
Chinese Translation
复杂推理任务的测试时扩展表明,通过独立采样并聚合多个解等方法利用推理时计算,可以显著改善任务结果。然而,一个关键瓶颈是验证:只有在能够可靠地从候选解中识别出正确解时,采样才是有效的。尽管现有方法通常通过标量评分独立评估候选解,我们证明模型在成对自我验证方面显著更强。基于这一洞察,我们引入了$V_1$,一个通过高效的成对排名统一生成与验证的框架。$V_1$包括两个组件:$V_1$-Infer,一种不确定性引导的算法,使用基于锦标赛的排名,将自我验证计算动态分配给相对正确性最不确定的候选对;以及$V_1$-PairRL,一个强化学习框架,联合训练单个模型同时作为生成器和成对自我验证器,确保验证器适应生成器不断变化的分布。在代码生成(LiveCodeBench、CodeContests、SWE-Bench)和数学推理(AIME、HMMT)基准测试中,$V_1$-Infer在Pass@1上比逐点验证提高了最多10%,并且在显著更高效的同时优于近期的测试时扩展方法。此外,$V_1$-PairRL相比标准强化学习和逐点联合训练实现了7%–9%的测试时扩展增益,并在代码生成设置中将基础Pass@1相比标准强化学习提高了最多8.7%。
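A minimal sketch of the tournament backbone behind $V_1$-Infer's pairwise ranking; the uncertainty-guided compute allocation is omitted, and `prefer` stands in for the model's pairwise self-verification call:

```python
def tournament_select(candidates, prefer):
    """Single-elimination tournament; prefer(a, b) returns the winner of a pair."""
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            next_round.append(prefer(pool[i], pool[i + 1]))
        if len(pool) % 2:                 # odd pool: last candidate gets a bye
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]
```

A tournament needs only O(n) pairwise comparisons versus O(n^2) for all-pairs ranking, which is where the efficiency claim over pointwise and exhaustive schemes comes from.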
cs.CL / 82 / 2603.04317
World Properties without World Models: Recovering Spatial and Temporal Structure from Co-occurrence Statistics in Static Word Embeddings
无世界模型的世界属性:从静态词嵌入中的共现统计中恢复空间和时间结构
Abstract
Recent work interprets the linear recoverability of geographic and temporal variables from large language model (LLM) hidden states as evidence for world-like internal representations. We test a simpler possibility: that much of the relevant structure is already latent in text itself. Applying the same class of ridge regression probes to static co-occurrence-based embeddings (GloVe and Word2Vec), we find substantial recoverable geographic signal and weaker but reliable temporal signal, with held-out R^2 values of 0.71-0.87 for city coordinates and 0.48-0.52 for historical birth years. Semantic-neighbor analyses and targeted subspace ablations show that these signals depend strongly on interpretable lexical gradients, especially country names and climate-related vocabulary. These findings suggest that ordinary word co-occurrence preserves richer spatial, temporal, and environmental structure than is often assumed, revealing a remarkable and underappreciated capacity of simple static embeddings to preserve world-shaped structure from text alone. Linear probe recoverability alone therefore does not establish a representational move beyond text.
Chinese Translation
近期的研究将从大型语言模型(LLM)隐藏状态中线性恢复地理和时间变量的能力解释为世界般内部表征的证据。我们测试了一个更简单的可能性:相关结构在文本中本身就已经潜在存在。通过对基于共现的静态嵌入(GloVe 和 Word2Vec)应用相同类别的岭回归探测器,我们发现可恢复的地理信号显著,且时间信号较弱但可靠,城市坐标的 R^2 值为 0.71-0.87,历史出生年份的 R^2 值为 0.48-0.52。语义邻近分析和针对性子空间消融实验表明,这些信号在很大程度上依赖于可解释的词汇梯度,尤其是国家名称和气候相关词汇。这些发现表明,普通词汇的共现保留了比通常假设的更丰富的空间、时间和环境结构,揭示了简单静态嵌入在仅从文本中保留世界形状结构方面的显著且被低估的能力。因此,仅凭线性探测器的可恢复性并不能确立超越文本的表征转变。
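The probing methodology above, ridge regression from embeddings to a world property scored by held-out R^2, reduces in one dimension to a closed form; the numbers below are toy values, not GloVe features:

```python
def ridge_fit_1d(x, y, lam=1.0):
    """Closed-form ridge regression with one feature plus intercept."""
    xm, ym = sum(x) / len(x), sum(y) / len(y)
    w = (sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
         / (sum((xi - xm) ** 2 for xi in x) + lam))    # penalty shrinks w toward 0
    return w, ym - w * xm

def r_squared(y_true, y_pred):
    """Coefficient of determination, the probe's evaluation metric."""
    ym = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - ym) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

The paper's probes are the multivariate version of the same estimator, fit on embedding vectors and scored by R^2 on held-out words.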
cs.CL / 83 / 2603.04319
AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning
AILS-NTUA在SemEval-2026任务12:用于溯因事件推理的基于图的检索与反思提示
Abstract
We present a winning three-stage system for SemEval 2026 Task~12: Abductive Event Reasoning that combines graph-based retrieval, LLM-driven abductive reasoning with prompt design optimized through reflective prompt evolution, and post-hoc consistency enforcement; our system ranks first on the evaluation-phase leaderboard with an accuracy score of 0.95. Cross-model error analysis across 14 models (7~families) reveals three shared inductive biases: causal chain incompleteness, proximate cause preference, and salience bias, whose cross-family convergence (51\% cause-count reduction) indicates systematic rather than model-specific failure modes in multi-label causal reasoning.
Chinese Translation
我们提出了一个在SemEval 2026任务12(溯因事件推理)中获胜的三阶段系统,该系统结合了基于图的检索、由大型语言模型(LLM)驱动的溯因推理(其提示设计通过反思式提示演化进行优化)以及事后一致性强制;我们的系统在评估阶段的排行榜上排名第一,准确率达到0.95。对14个模型(7个家族)进行的跨模型错误分析揭示了三种共有的归纳偏差:因果链不完整、近因偏好和显著性偏差;其跨家族的收敛性(原因数量减少51%)表明,多标签因果推理中的失败模式是系统性的,而非特定于某个模型。
cs.CL / 84 / 2603.04384
AgentIR: Reasoning-Aware Retrieval for Deep Research Agents
AgentIR:面向推理的深度研究代理检索
Abstract
Deep Research agents are rapidly emerging as primary consumers of modern retrieval systems. Unlike human users who issue and refine queries without documenting their intermediate thought processes, Deep Research agents generate explicit natural language reasoning before each search call, revealing rich intent and contextual information that existing retrievers entirely ignore. To exploit this overlooked signal, we introduce: (1) Reasoning-Aware Retrieval, a retrieval paradigm that jointly embeds the agent's reasoning trace alongside its query; and (2) DR-Synth, a data synthesis method that generates Deep Research retriever training data from standard QA datasets. We demonstrate that both components are independently effective, and their combination yields a trained embedding model, AgentIR-4B, with substantial gains. On the challenging BrowseComp-Plus benchmark, AgentIR-4B achieves 68\% accuracy with the open-weight agent Tongyi-DeepResearch, compared to 50\% with conventional embedding models twice its size, and 37\% with BM25. Code and data are available at: https://texttron.github.io/AgentIR/.
Chinese Translation
深度研究代理正迅速成为现代检索系统的主要用户。与在不记录中间思考过程的情况下发出并改进查询的人类用户不同,深度研究代理在每次搜索调用之前都会生成明确的自然语言推理,其中蕴含着丰富的意图和上下文信息,而现有检索系统对此完全忽视。为了利用这一被忽视的信号,我们提出了:(1)面向推理的检索(Reasoning-Aware Retrieval),一种将代理的推理轨迹与其查询共同嵌入的检索范式;(2)DR-Synth,一种从标准问答数据集生成深度研究检索器训练数据的数据合成方法。我们证明这两个组件各自独立有效,而它们的结合产生了一个训练好的嵌入模型AgentIR-4B,带来显著提升。在具有挑战性的BrowseComp-Plus基准测试中,AgentIR-4B与开放权重代理Tongyi-DeepResearch搭配实现了68%的准确率,而规模为其两倍的传统嵌入模型仅为50%,BM25仅为37%。代码和数据可在以下网址获取:https://texttron.github.io/AgentIR/。
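The interface change at the heart of Reasoning-Aware Retrieval, embedding the reasoning trace jointly with the query, can be sketched with a toy bag-of-words embedder standing in for AgentIR's learned model; the corpus and all names here are hypothetical:

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words 'embedding'; a learned encoder plays this role in AgentIR."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, reasoning=""):
    """Reasoning-aware retrieval: embed the reasoning trace jointly with the query.

    Returns the index of the best-scoring document (ties break to the first)."""
    q = embed(reasoning + " " + query)
    return max(range(len(docs)), key=lambda i: cosine(q, embed(docs[i])))

# Hypothetical toy corpus.
docs = ["population statistics of Lisbon", "history of Portuguese navigation"]
```

With an uninformative query, the agent's reasoning trace supplies the intent signal that flips which document is retrieved.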