cs.RO / 1 / 2605.04327
From Language to Logic: A Theoretical Architecture for VLM-Grounded Safe Navigation
从语言到逻辑:基于VLM的安全导航理论架构
Abstract
We propose an architecture for integrating high-level, human-provided safety rules and operator-aligned semantic preferences into autonomous robot navigation in unstructured outdoor environments. In our approach, natural-language rules are translated into Signal Temporal Logic (STL) specifications that guide planning and navigation during runtime. Persistent, environment-centric rules and terrain preferences are grounded into a 2D cost map, while temporally dynamic requirements are expressed as STL specifications to be monitored during runtime. We hypothesize the use of Vision-Language Models (VLMs) for zero-shot scene understanding, enabling mapping between human instructions, semantic features, and environmental constraints. Within this framework, we construct an illustrative navigation model that is designed to satisfy a set of STL-encoded specifications and soft operator preferences through formal satisfaction metrics embedded into environmental properties and runtime monitoring.
Chinese Translation
我们提出了一种架构,用于将高层次的人为安全规则和与操作员对齐的语义偏好集成到非结构化户外环境中的自主机器人导航中。在我们的方法中,自然语言规则被转化为信号时序逻辑(Signal Temporal Logic, STL)规范,这些规范在运行时指导规划和导航。持久的、以环境为中心的规则和地形偏好被映射到二维成本图中,而时间动态要求则以STL规范的形式表达并在运行时进行监测。我们假设使用视觉语言模型(Vision-Language Models, VLMs)进行零-shot场景理解,从而实现人类指令、语义特征与环境约束之间的映射。在这一框架内,我们构建了一个示例导航模型,旨在通过嵌入环境属性和运行时监控中的正式满足度度量,来满足一组STL编码的规范和柔性操作员偏好。
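The runtime monitoring of STL specifications mentioned in this abstract can be illustrated with quantitative robustness, where a positive value means the trace satisfies the formula and the magnitude is the margin. The following is a minimal sketch, not the authors' implementation: the two formulas, the thresholds, and the sample traces are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def rob_always_greater(signal: Sequence[float], threshold: float) -> float:
    """Robustness of G(signal > threshold): the worst-case margin over the trace."""
    return min(s - threshold for s in signal)


def rob_eventually_less(signal: Sequence[float], threshold: float) -> float:
    """Robustness of F(signal < threshold): the best-case margin over the trace."""
    return max(threshold - s for s in signal)


# Hypothetical traces sampled at runtime (metres).
obstacle_dist = [2.0, 1.4, 0.9, 1.1, 1.8]  # distance to the nearest obstacle
goal_dist = [5.0, 3.5, 2.0, 0.8, 0.2]      # distance to the goal

safety = rob_always_greater(obstacle_dist, 0.5)  # "always stay > 0.5 m from obstacles"
reach = rob_eventually_less(goal_dist, 0.5)      # "eventually get within 0.5 m of the goal"
spec = min(safety, reach)                        # conjunction takes the smaller margin

print("satisfied" if spec > 0 else "violated", round(spec, 3))
```

A monitor of this kind returns a signed margin rather than a boolean, which is what lets a planner trade off soft preferences against how close it is to violating a rule.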
cs.RO / 2 / 2605.04366
Conditional Flow-VAE for Safety-Critical Traffic Scenario Generation
用于安全关键交通场景生成的条件流变分自动编码器(Conditional Flow-VAE)
Abstract
Safety-critical scenarios are essential for the development of autonomous vehicles (AVs) but are rare in real-world driving data. While simulation offers a way to generate such scenarios, manually designed test cases lack scalability, and adversarial optimization often produces unrealistic behaviors. In this work, we introduce a conditional latent flow matching approach for scalable and realistic safety-critical scenario generation. Our method uses distribution matching to transform nominal scenes into safety-critical rollouts. Furthermore, we demonstrate that incorporating both simulation and real-world data enables our framework to efficiently generate diverse, data-driven scenarios. Experimental results highlight that our approach is able to more consistently and realistically generate novel safety-critical scenarios, making it a valuable tool for training and benchmarking AV systems.
Chinese Translation
安全关键场景对于自主驾驶汽车(AVs)的开发至关重要,但在现实世界的驾驶数据中很少见。虽然仿真提供了一种生成这些场景的方法,但手动设计的测试用例缺乏可扩展性,而对抗优化往往会产生不切实际的行为。在本研究中,我们提出了一种条件潜在流匹配方法,以实现可扩展和现实的安全关键场景生成。我们的方法使用分布匹配将标准场景转化为安全关键的应用。此外,我们展示了将仿真数据与现实世界数据相结合,使得我们的框架能够高效地生成多样化的数据驱动场景。实验结果凸显了我们的方法能够更一致且更真实地生成新颖的安全关键场景,成为训练和基准测试自主驾驶系统的宝贵工具。
cs.RO / 3 / 2605.04408
Autonomous Laparoscope Control through Unified Mechanics-Based Representation of Multimodal Intraoperative Information
通过统一的基于力学的多模态术中信息表示实现自主腹腔镜控制
Abstract
Laparoscope-holding robots can provide surgeons with a stable laparoscopic field of view (FOV) and reduce the burden on human assistants. To maintain an ideal intraoperative FOV, the robot must continuously adjust the laparoscope pose according to intraoperative information. However, intraoperative multimodal signals, such as position, force/torque, and images, differ markedly in physical meaning and units, making it difficult to build a unified representation and to generate control commands that can be used directly for laparoscope control. To address this issue, we propose a laparoscope-holding robot control method based on unified mechanics modeling of multimodal information. First, we design mapping strategies for multiple intraoperative sources, including position, force/torque, and images, and unify them into an equivalent-wrench representation in the operational space. Then, using a task-priority scheme, we inject the wrenches into the task space and the null space, respectively, and synthesize laparoscope control commands via task-priority projection, thereby achieving consistent representation and coordinated fusion of multimodal information within a single framework. Finally, taking the intraoperative remote center of motion (RCM) position, force/torque sensor readings, and laparoscopic images as examples, we construct an RCM-constraint wrench to enforce the RCM geometric constraint and reduce the contact force at the trocar site, a laparoscope-manipulation wrench to enable compliant dragging, and an instrument-tracking wrench to achieve autonomous visual tracking of the instruments. Experiments on a surgical phantom and in vivo porcine trials demonstrate that the proposed method supports multi-task operation, including compliant laparoscope manipulation and autonomous instrument tracking, while maintaining the RCM constraint and reducing sustained trocar-site loading.
Chinese Translation
腹腔镜持取机器人可以为外科医生提供稳定的腹腔镜视野(FOV),减轻人类助手的负担。为了维持理想的术中视野,机器人必须根据术中信息持续调整腹腔镜的姿态。然而,术中的多模态信号,如位置、力/矩和图像,在物理意义和单位上存在显著差异,这使得构建统一的表示和生成可直接用于腹腔镜控制的指令变得困难。为了解决这个问题,我们提出了一种基于多模态信息统一力学建模的腹腔镜持取机器人控制方法。首先,我们为多个术中源(包括位置、力/矩和图像)设计映射策略,并将它们统一为操作空间中的等效扭矩表示。然后,利用任务优先级方案,我们分别将扭矩注入任务空间和零空间,通过任务优先级投影合成腹腔镜控制指令,从而在单一框架内实现多模态信息的一致表示和协调融合。最后,以术中远程运动中心(RCM)位置、力/矩传感器读数和腹腔镜图像为例,我们构建了RCM约束扭矩以强制执行RCM几何约束,减少穿刺点的接触力,构建了腹腔镜操作扭矩以实现顺应性拖动,以及构建了工具跟踪扭矩以实现对工具的自主视觉跟踪。在外科模型和活体猪试验中的实验表明,该方法支持多任务操作,包括顺应性腹腔镜操作和自主仪器追踪,同时保持RCM约束并减少持续的穿刺点负荷。
cs.RO / 4 / 2605.04481
Tightly-Coupled Estimation and Guidance for Robust Low-Thrust Rendezvous via Adaptive Homotopy
通过自适应同伦实现的稳健低推力会合的紧耦合估计与引导
Abstract
Minimum-fuel low-thrust rendezvous guidance yields bang-bang control structures highly sensitive to estimation errors, sensor anomalies, and solver regularization, making aggressive closed-loop execution brittle for uncooperative proximity operations. This paper proposes a tightly-coupled estimation and guidance architecture where navigation confidence directly modulates the homotopy parameter of a receding-horizon indirect optimal control solver. Relative motion is modeled in the Clohessy-Wiltshire frame. The translational state is estimated via a linear Kalman filter augmented by a Multiple Tuning Factors (MTF) covariance inflation mechanism that suppresses suspicious innovation directions. A composite score from the normalized innovation and MTF activity is mapped online to the homotopy parameter, allowing the controller to relax toward a smoother, conservative regime when confidence degrades, and recover fuel-efficient bang-bang control as sensing improves. Numerical results under severe measurement degradation show fixed bang-bang guidance remains brittle; both plain-KF and MTF-KF fixed-epsilon controllers yield large terminal miss distances. Conversely, the proposed MTF-adaptive homotopy controller reduces terminal miss by roughly two orders of magnitude, from hundreds of meters to sub-meter levels, requiring only a moderate increase in control effort versus the open-loop fuel-optimal benchmark. A comparison indicates adaptive homotopy is the dominant robustness mechanism, while MTF provides additional accuracy and efficiency improvements. The receding-horizon implementation exhibits consistently fast and reliable solution times, supporting the practical online viability of the proposed method.
Chinese Translation
最低燃料低推力会合引导生成的 bang-bang 控制结构对估计误差、传感器异常和求解器正则化高度敏感,使得在非合作接近操作中进行激进的闭环执行变得脆弱。本文提出了一种紧耦合的估计与引导架构,其中导航置信度直接调节滚动时域间接最优控制求解器的同伦参数。相对运动在克洛赫西-威尔特希尔(Clohessy-Wiltshire)框架中建模。通过线性卡尔曼滤波器估计平移状态,并利用多调谐因子(Multiple Tuning Factors, MTF)协方差膨胀机制来抑制可疑的新息方向。在线将归一化新息和MTF活动的复合得分映射到同伦参数,使得控制器在置信度降低时能缓和至更平滑、保守的状态,并在感知改善时恢复燃料高效的 bang-bang 控制。在严重测量退化的情况下,数值结果显示固定 bang-bang 引导仍然脆弱;无论是普通卡尔曼滤波器(plain-KF)还是MTF卡尔曼滤波器(MTF-KF)的固定ε控制器均产生较大的终端脱靶量。相反,所提的MTF自适应同伦控制器将终端脱靶量降低了大约两个数量级,从数百米降至亚米级别,只需相对于开环燃料最优基准适度增加控制量。比较结果表明自适应同伦是主要的鲁棒性机制,而MTF提供了额外的准确性和效率提升。滚动时域实现展现出持续快速和可靠的求解时间,支持所提方法的实际在线可行性。
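The coupling between navigation confidence and the homotopy parameter can be sketched as follows. The Clohessy-Wiltshire equations are standard (x radial, y along-track, z cross-track, mean motion n). Everything else is an assumption made for illustration: the diagonal innovation covariance, the function names, the score thresholds, and the linear mapping to epsilon. The abstract's composite score also includes MTF activity, which this sketch omits.

```python
def cw_derivative(state, accel, n):
    """Time derivative of the Clohessy-Wiltshire relative state (x, y, z, vx, vy, vz)."""
    x, y, z, vx, vy, vz = state
    ax, ay, az = accel
    return (
        vx,
        vy,
        vz,
        3.0 * n * n * x + 2.0 * n * vy + ax,  # radial
        -2.0 * n * vx + ay,                   # along-track
        -n * n * z + az,                      # cross-track
    )


def nis_diagonal(innovation, innovation_var):
    """Normalized innovation squared, assuming a diagonal innovation covariance."""
    return sum(v * v / s for v, s in zip(innovation, innovation_var))


def epsilon_from_score(score, score_lo=3.0, score_hi=12.0, eps_min=1e-3, eps_max=1.0):
    """Hypothetical mapping: low score keeps near-bang-bang control, high score smooths it."""
    t = (score - score_lo) / (score_hi - score_lo)
    t = min(1.0, max(0.0, t))  # clamp to [0, 1]
    return eps_min + t * (eps_max - eps_min)


# A healthy measurement versus a suspicious one (hypothetical numbers, metres).
healthy = nis_diagonal([0.5, -0.3, 0.2], [1.0, 1.0, 1.0])
degraded = nis_diagonal([6.0, -4.0, 3.0], [1.0, 1.0, 1.0])
print(epsilon_from_score(healthy), epsilon_from_score(degraded))
```

The clamp keeps epsilon inside a fixed band, so a single outlier cannot push the solver into an arbitrarily smooth regime.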
cs.RO / 5 / 2605.04525
HDFlow: Hierarchical Diffusion-Flow Planning for Long-horizon Tasks
HDFlow:用于长时间任务的层次化扩散流规划
Abstract
Recent advances in generative models have shown promise in generating behavior plans for long-horizon, sparse reward tasks. While these approaches have achieved promising results, they often lack a principled framework for hierarchical decomposition and struggle with the computational demands of real-time execution, due to their iterative denoising process. In this work, we introduce Hierarchical Diffusion-Flow (HDFlow), a novel hierarchical planning framework that optimally leverages the strengths of diffusion and rectified flow models to overcome the limitations of single-paradigm generative planners. HDFlow employs a high-level diffusion planner to generate sequences of strategic subgoals in a learned latent space, capitalizing on diffusion's powerful exploratory capabilities. These subgoals then guide a low-level rectified flow planner that generates smooth and dense trajectories, exploiting the speed and efficiency of ordinary differential equation (ODE)-based trajectory generation. We evaluate HDFlow on four challenging furniture assembly tasks in both simulation and the real world, where it significantly outperforms state-of-the-art methods. We also showcase our method's generalizability on two long-horizon benchmarks comprising diverse locomotion and manipulation tasks. Project website: https://hdflow-page.github.io/
Chinese Translation
近期生成模型的进展在长时间、稀疏奖励任务的行为规划中展现了潜力。尽管这些方法取得了令人鼓舞的成果,但它们往往缺乏系统的层次分解框架,并且由于其迭代去噪过程,难以满足实时执行的计算需求。在本研究中,我们引入了层次化扩散流(HDFlow),这是一个新颖的层次规划框架,最优地利用了扩散和校正流模型的优势,以克服单一范式生成规划器的局限性。HDFlow采用高层级扩散规划器在学习的潜在空间中生成战略子目标序列,充分发挥扩散模型强大的探索能力。这些子目标随后引导低层级的校正流规划器,生成平滑且密集的轨迹,利用常微分方程(ODE)基础的轨迹生成的速度和效率。我们在四个具有挑战性的家具组装任务中评估了HDFlow,包括模拟和现实世界案例,其表现显著优于最先进的方法。此外,我们还展示了该方法在由多种运动和操作任务组成的两个长时间基准上的泛化能力。项目网站:https://hdflow-page.github.io/
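The speed advantage of ODE-based generation in the low-level planner comes from integrating a velocity field with only a few Euler steps. The sketch below is illustrative only: `make_straight_field` is a hypothetical stand-in for a learned, subgoal-conditioned velocity network, and it encodes the ideal straight-line flow that rectified flow training aims for.

```python
def euler_integrate(x0, velocity, steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_straight_field(subgoal):
    """Stand-in for a learned network: an ideal straight-line flow toward the subgoal."""
    def velocity(x, t):
        # Along a straight path x_t = (1 - t) * x0 + t * g, the velocity is (g - x_t) / (1 - t).
        return [(g - xi) / (1.0 - t) for g, xi in zip(subgoal, x)]
    return velocity


field = make_straight_field([1.0, -2.0])
one_step = euler_integrate([0.0, 0.0], field, steps=1)
four_steps = euler_integrate([0.0, 0.0], field, steps=4)
print(one_step, four_steps)
```

When the flow is perfectly straight, one Euler step already lands on the target, which is why few-step sampling is viable. A learned field is only approximately straight, so in practice a handful of steps trades accuracy against latency.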
cs.RO / 6 / 2605.04564
Practical validation of synthetic pre-crash scenarios
合成预碰撞场景的实际验证
Abstract
The representativeness of synthetic pre-crash scenarios is crucial for assessing the safety impact of Driving Automation Systems through virtual simulations. However, a gap remains in the robust evaluation of synthetic pre-crash scenarios' practical equivalence to their real-world counterparts; that is, whether they are similar enough for the intended assessment purpose. Conventional significance testing is inadequate, as it focuses on detecting differences rather than establishing practical equivalence. This study addresses the research gap by extending our previous work on a Bayesian Region of Practical Equivalence (ROPE)-based equivalence testing framework by introducing a binning-based approach to define appropriate statistics and equivalence criteria. Two binning-based statistics are proposed to measure practically meaningful distributional differences between datasets in the context of safety impact assessment. The framework's applicability is demonstrated through a case study, which tests the practical equivalence of two synthetic rear-end pre-crash datasets with a previously developed reference dataset in the context of the safety impact assessment of an Automatic Emergency Braking system. The results show that the framework provides informative quantitative assessments of practical equivalence as well as diagnostic insights into the divergence of datasets. Although the demonstration focuses on rear-end pre-crash scenarios, the framework is generic and extensible to broader validation contexts, providing an interpretable and principled basis for practical equivalence assessment across diverse synthetic data applications.
Chinese Translation
合成预碰撞场景的代表性对于通过虚拟仿真评估驾驶自动化系统的安全影响至关重要。然而,对于合成预碰撞场景与其现实世界对应物之间的实际等效性的稳健评估仍存在差距;即,它们是否足够相似以满足预期的评估目的。传统的显著性检验不足,因为它侧重于检测差异而不是建立实际等效性。本研究通过扩展我们先前在贝叶斯实际等效区域(Region of Practical Equivalence, ROPE)基础上的等效性检验框架,采用基于分箱的方法来定义适当的统计量和等效性标准,从而解决这一研究空白。提出了两种基于分箱的统计量,以衡量在安全影响评估背景下数据集之间在实际上有意义的分布差异。通过一个案例研究展示了该框架的适用性,此案例研究测试了两个合成后碰撞预碰撞数据集与先前开发的参考数据集在自动紧急制动系统安全影响评估背景下的实际等效性。结果表明,该框架提供了对实际等效性的有用的定量评估以及对数据集之间分歧的诊断性见解。尽管这一演示集中在后碰撞预碰撞场景上,但该框架是通用的,并可以扩展到更广泛的验证背景,为各种合成数据应用中的实际等效性评估提供了可解释和原则性的基础。
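A ROPE-style equivalence check on binned data can be sketched as below. This is not the paper's statistic or prior: the total variation distance, the uniform Dirichlet prior, the ROPE width, and the bin counts are all assumptions chosen to show the shape of the computation.

```python
import random


def tv_distance(p, q):
    """Total variation distance between two binned distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def prob_in_rope(counts_a, counts_b, rope, n_samples=2000, seed=0):
    """Posterior probability that the binned difference lies inside the ROPE.

    Assumes independent Dirichlet(count + 1) posteriors for each dataset.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        p = sample_dirichlet([c + 1 for c in counts_a], rng)
        q = sample_dirichlet([c + 1 for c in counts_b], rng)
        if tv_distance(p, q) < rope:
            inside += 1
    return inside / n_samples


# Hypothetical bin counts, e.g. impact-speed bins in a reference and a synthetic dataset.
reference = [500, 300, 200]
similar = [500, 300, 200]
different = [200, 300, 500]
print(prob_in_rope(reference, similar, rope=0.1))
print(prob_in_rope(reference, different, rope=0.1))
```

Unlike a significance test, this reports how much posterior mass supports "close enough", so a high value is positive evidence of practical equivalence rather than a failure to detect a difference.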
cs.RO / 7 / 2605.04607
Right Model, Right Time: Real-Time Cascaded-Fidelity MPC for Bipedal Walking
合适的模型,合适的时机:用于双足行走的实时级联精准度模型预测控制
Abstract
This paper presents a multi-phase whole-body model predictive control approach for bipedal walking, combining a detailed whole-body model in the near horizon with a simplified single-rigid-body model in the later prediction steps. This reduces computational complexity while retaining prediction capabilities. The resulting nonlinear optimal control problem is solved using sequential quadratic programming (SQP) in acados. Using a pre-specified contact schedule and a target walking speed, the controller optimizes joint torques without depending on preselected footstep locations. The controller is validated in MuJoCo simulation on the 18-DoF bipedal robot HyPer-2.
Chinese Translation
本文提出了一种多阶段的全身模型预测控制方法,用于双足行走,该方法在近短期内结合了详细的全身模型,并在后续预测步骤中使用简化的单刚体模型。这种方法在保持预测能力的同时,降低了计算复杂性。所得到的非线性最优控制问题通过在 acados 中使用序列二次规划(SQP)进行求解。控制器在先前指定的接触计划和目标行走速度下,优化关节扭矩,而不依赖于先前选择的脚步位置。该控制器在 MuJoCo 模拟中对 18 自由度的双足机器人 HyPer-2 进行了验证。
cs.RO / 8 / 2605.04610
Active Contact Sensing for Robust Robot-to-Human Object Handover
用于稳健人机物体交接的主动接触感知
Abstract
Robot-to-human object handover is an essential skill for robot assistants, from serving drinks at home to passing surgical tools in the operating room. We expect robots to perform handover robustly -- to release the object only after a firm human grasp while ignoring incidental touches. Existing passive-sensing methods struggle to generalize across diverse objects and human behaviors, as they lack informative perturbations to disambiguate different contact conditions, such as firm grasp versus incidental touch. We propose an active sensing approach for robust handovers: the robot applies information-gathering motions and senses the resulting human-applied forces to infer the contact state. A firm grasp produces forces in multiple directions, while an accidental touch does not. To capture this distinction, we model the contact state with a Bayesian linear model: a distribution over piecewise-linear mappings from robot motions to human-applied forces. This model enables firm grasp detection and active information gathering. In experiments with 12 participants and 30 diverse rigid objects, our method achieved a 97.5% success rate -- over 30% higher than two common baselines.
Chinese Translation
人机物体交接是机器人助手的一项基本技能,从在家中端饮料到在手术室传递手术工具。我们期望机器人能够稳健地执行交接——仅在确认人类牢固握持后释放物体,同时忽略偶然的触碰。现有的被动感知方法在面对多样化的物体和人类行为时表现不佳,因为它们缺乏有效的信息扰动以区分不同的接触状态,例如牢固的握持与偶然的触碰。我们提出了一种主动感知的方法以实现稳健的物体交接:机器人应用信息采集动作,并感知人类施加的力以推断接触状态。牢固的握持会施加多个方向的力,而偶然的触碰则不会。为了捕捉这一区别,我们使用贝叶斯线性模型来建模接触状态:该模型描述了从机器人动作到人类施加的力的分段线性映射的分布。该模型使得牢固握持的检测和主动信息采集成为可能。在对12名参与者和30种多样化刚性物体进行的实验中,我们的方法实现了97.5%的成功率——比两个常用基线提高了超过30%。
cs.RO / 9 / 2605.04647
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2:面向离散扩散驾驶的强化学习对齐自编辑
Abstract
We introduce ReflectDrive-2, a masked discrete diffusion planner with a separate action expert for autonomous driving that represents plans as discrete trajectory tokens and generates them through parallel masked decoding. This discrete token space enables in-place trajectory revision: AutoEdit rewrites selected tokens using the same model, without requiring an auxiliary refinement network. To train this capability, we use a two-stage procedure. First, we construct structure-aware perturbations of expert trajectories along longitudinal progress and lateral heading directions and supervise the model to recover the original expert trajectory. We then fine-tune the full decision-draft-reflect rollout with reinforcement learning (RL), assigning terminal driving reward to the final post-edit trajectory and propagating policy-gradient credit through full-rollout transitions. Full-rollout RL proves crucial for coupling drafting and editing: under supervised training alone, inference-time AutoEdit improves PDMS by at most 0.3, whereas RL increases its gain to 1.9. We also co-design an efficient reflective decoding stack for the decision-draft-reflect pipeline, combining shared-prefix KV reuse, Alternating Step Decode, and fused on-device unmasking. On NAVSIM, ReflectDrive-2 achieves 91.0 PDMS with camera-only input and 94.8 PDMS in a best-of-6 oracle setting, while running at 31.8 ms average latency on NVIDIA Thor.
Chinese Translation
我们提出了ReflectDrive-2,这是一种具有独立动作专家的掩码离散扩散规划器,专为自主驾驶设计,其将规划表示为离散轨迹令牌,并通过并行掩码解码生成这些令牌。这个离散令牌空间能够实现原地轨迹修订:AutoEdit使用相同模型重写选定的令牌,无需辅助的精炼网络。为了训练这种能力,我们采用了两阶段的程序。首先,我们构建了沿纵向进展和横向航向方向的专家轨迹的结构感知扰动,并监督模型恢复原始专家轨迹。接着,我们通过强化学习(RL)对完整的决策-草案-反射回放进行微调,将最终后期编辑轨迹的最终驾驶奖励分配给该轨迹,并通过完整回放转变传播策略梯度信用。完整回放的RL对于结合草拟和编辑至关重要:在仅进行监督训练的情况下,推理时的AutoEdit最多提高PDMS $0.3$,而RL将其增益提升至$1.9$。我们还共同设计了一个高效的反射解码堆栈,用于决策-草案-反射管道,结合了共享前缀KV重用、交替步骤解码和融合的设备上解掩码。在NAVSIM上,ReflectDrive-2在仅依赖相机输入的情况下实现了$91.0$的PDMS,而在最佳的6次预言设置下实现了$94.8$的PDMS,同时在NVIDIA Thor上运行时的平均延迟为$31.8$毫秒。
cs.RO / 10 / 2605.04649
From Reach to Insert: Tactile-Augmented Precision Assembly under Sub-Millimeter Tolerances
从抓取到插入:亚毫米公差下的触觉增强精密装配
Abstract
High-precision assembly frequently involves tight-tolerance insertions, where even slight pose errors can cause jamming or excessive interaction forces, making robust and safe insertion policies difficult to obtain. This paper proposes a tactile-augmented two-stage method that combines Imitation Learning (IL) and Reinforcement Learning (RL) for precision insertion tasks. In the first stage, IL learns a reaching policy with position generalization that grasps the peg and brings it to the vicinity of the target region. In the second stage, RL executes the insertion and enables recovery from failures during contact-rich interactions. To better exploit tactile feedback, we introduce tactile group sampling to increase coverage of critical contact segments during training, and design a tactile critic to more accurately evaluate policy values, improving insertion performance while maintaining low contact forces. We conduct systematic experiments across five hole geometries and three clearance settings. Results show that our method substantially improves insertion performance across all settings; under the most challenging 0.05 mm clearance, it achieves a 67% success rate while keeping contact forces low, reducing the maximum interaction force by 60% and torque by 44%, thereby validating both effectiveness and safety for precision assembly.
Chinese Translation
高精度装配通常涉及严苛公差的插入,其中即使是轻微的姿态误差也可能导致卡滞或过大的交互力,使得获取稳健安全的插入策略变得困难。本文提出了一种触觉增强的两阶段方法,将模仿学习(Imitation Learning, IL)与强化学习(Reinforcement Learning, RL)相结合,应用于精密插入任务。在第一阶段,IL 学习一个具备位置泛化能力的抓取策略,将 peg 抓取并带到目标区域的附近。在第二阶段,RL 执行插入操作,并在接触丰富的交互过程中实现失败恢复。为了更好地利用触觉反馈,我们引入了触觉组采样以增加训练过程中关键接触段的覆盖率,并设计了触觉评论员以更准确地评估策略值,从而提高插入性能,同时保持低接触力。我们在五种孔几何形状和三种间隙设置下进行了系统实验。结果表明,我们的方法在所有设置中显著提高了插入性能;在最具挑战性的 0.05 mm 间隙下,实现了 67%的成功率,同时保持了低接触力,最大交互力降低60%,扭矩降低44%,验证了精密装配的有效性和安全性。
cs.RO / 11 / 2605.04672
AI-Aided Advancements in Autonomous Underwater Vehicle Navigation
人工智能辅助的自主水下航行技术进展
Abstract
Autonomous underwater vehicles (AUVs) have become indispensable for deep-sea exploration, spanning critical scientific research and commercial applications. The rapid attenuation of electromagnetic waves renders satellite radio signals unavailable, while the dynamic unpredictability of the marine environment presents formidable navigation challenges. This chapter explores recent advancements in AI-aided AUV positioning, specifically focusing on advanced sensor fusion architectures that integrate inertial navigation systems with Doppler velocity logs and cameras. Beyond traditional model-based filtering, we examine the transformative emergence of AI-driven learning approaches in enhancing inertial dead-reckoning tasks and adaptive fusion algorithms. By addressing these recent milestones, this chapter provides a comprehensive roadmap for achieving the high-precision navigation essential for autonomous underwater missions.
Chinese Translation
自主水下车辆(AUV)已成为深海探索中不可或缺的工具,涵盖了重要的科学研究和商业应用。电磁波的快速衰减使得卫星无线电信号不可用,而海洋环境的动态不确定性则带来了巨大的导航挑战。本章节探讨了人工智能辅助的AUV定位的最新进展,特别关注将惯性导航系统与多普勒速度日志和摄像头集成的先进传感器融合架构。除了传统的基于模型的滤波方法外,我们还考察了人工智能驱动的学习方法在增强惯性推算任务和自适应融合算法中的变革性应用。通过关注这些最新里程碑,本章节提供了一个全面的路线图,以实现自主水下任务所需的高精度导航。
cs.RO / 12 / 2605.04678
From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
从像素到标记:视觉-语言-动作模型的潜在动作监督的系统研究
Abstract
Latent actions serve as an intermediate representation that enables consistent modeling of vision-language-action (VLA) models across heterogeneous datasets. However, approaches to supervising VLAs with latent actions are fragmented and lack a systematic comparison. This work structures the study of latent action supervision from two perspectives: (i) regularizing the trajectory via image-based latent actions, and (ii) unifying the target space with action-based latent actions. Under a unified VLA baseline, we instantiate and compare four representative integration strategies. Our results reveal a formulation-task correspondence: image-based latent actions benefit long-horizon reasoning and scene-level generalization, whereas action-based latent actions excel at complex motor coordination. Furthermore, we find that directly supervising the VLM with discrete latent action tokens yields the most effective performance. Finally, our experiments offer initial insights into the benefits of latent action supervision in mixed-data, suggesting a promising direction for VLA training. Code is available at https://github.com/RUCKBReasoning/From_Pixels_to_Tokens.
Chinese Translation
潜在动作作为一种中间表示,能够在异构数据集间持续建模视觉-语言-动作(VLA)模型。然而,利用潜在动作对VLA进行监督的方法往往零散,并缺乏系统比较。本研究从两个视角构建潜在动作监督的研究:(i) 通过基于图像的潜在动作对轨迹进行规范化,(ii) 用基于动作的潜在动作统一目标空间。在统一的VLA基线下,我们实例化并比较了四种代表性的整合策略。我们的结果揭示了一个公式-任务对应关系:基于图像的潜在动作有助于长时间推理和场景级泛化,而基于动作的潜在动作在复杂的运动协调方面表现优异。此外,我们发现直接用离散的潜在动作标记对VLM进行监督可以获得最有效的性能。最后,我们的实验为潜在动作监督在混合数据中的益处提供了初步见解,暗示了VLA训练的一个有前景的方向。代码可在 https://github.com/RUCKBReasoning/From_Pixels_to_Tokens 上获取。
cs.RO / 13 / 2605.04757
3D Printing of Passively Actuated Self-Folding Robots with Integrated Functional Modules
集成功能模块的被动驱动自折叠机器人的3D打印
Abstract
We introduce an elastic-driven self-folding approach that fabricates robots directly from flat 3D-printed conductive PLA nets. Elastic bands routed through printed hooks store energy that folds the sheet into programmed 3D geometries, while the flat state allows accurate placement of electronics and magnets before deployment. The same substrate doubles as electrodes for capacitive touch and supports a reusable platform I/O palette with Hall sensors and eccentric rotating mass (ERM) motors for docking detection and vibration actuation. We also derive a closed-form folding model that balances hinge stiffness with elastic band moment to predict equilibrium fold angles; experiments validate the model and yield a design map linking hinge thickness, band size, and hook spacing to target angles. Using this workflow we realize multiple polyhedral modules and demonstrate three applications: a cube that highlights the potential of self-folding for scalable modular robot collectives, a deployable gripper, and a tendon-driven finger. The method is low cost, stimulus-free, and integrates actuation and sensing.
Chinese Translation
我们提出了一种弹性驱动的自折叠方法,该方法直接利用平面3D打印的导电PLA网制造机器人。通过打印的钩子布置的弹性带储存能量,将薄片折叠成预定的3D几何形状,而平面状态允许在部署之前准确放置电子元件和磁铁。相同的基材还可用作电容式触控的电极,并支持带有霍尔传感器和偏心旋转质量(ERM)电机的可重用平台I/O调色板,用于对接检测和振动驱动。我们还推导出一种封闭形式的折叠模型,该模型在铰链刚度与弹性带力矩之间取得平衡,以预测平衡折叠角度;实验验证了该模型并得出了一个设计图,连接铰链厚度、带子大小和钩子间距与目标角度之间的关系。通过这一工作流程,我们实现了多个多面体模块,并展示了三个应用案例:一个突显自折叠在可扩展模块机器人集体中的潜力的立方体,一个可部署的抓手,以及一个肌腱驱动的手指。该方法低成本,无需刺激,同时集成了驱动和传感功能。
cs.RO / 14 / 2605.04806
Dr-PoGO: Direct Radar Pose-Graph Optimization
Dr-PoGO:直接雷达姿态图优化
Abstract
This paper introduces Dr-PoGO, a method for Simultaneous Localization And Mapping (SLAM) using a 2D spinning radar. Unlike cameras or lidars that require line-of-sight, millimetre-wave radars can 'see' through dust, falling snow, rain, etc. This makes radar a strong modality for robust perception regardless of weather conditions. While most existing radar-based SLAM methods rely on the extraction of point clouds or features to perform ego-motion estimation, Dr-PoGO leverages direct registration techniques for odometry (DRO) and loop-closure registration. An off-the-shelf radar-focused place recognition algorithm, RaPlace, provides loop-closure candidates. As RaPlace does not provide relative transformations, Dr-PoGO introduces a coarse-to-fine registration that uses visual features and descriptors to obtain an initial guess for the direct transformation refinement. The global trajectory is optimized in a pose-graph optimization. Dr-PoGO demonstrates state-of-the-art performance over 300 km of data in various real-world automotive environments. Our implementation is publicly available: https://github.com/utiasASRL/dr_pogo.
Chinese Translation
本文介绍了Dr-PoGO,一种使用二维旋转雷达进行同时定位与地图构建(SLAM)的方法。与需要视线的相机或激光雷达不同,毫米波雷达可以透过尘埃、降雪、雨水等进行“观察”。因此,它在各种天气条件下都是一种强大的感知方式。虽然大多数现有的基于雷达的SLAM方法依赖于点云或特征的提取来执行自我运动估计,但Dr-PoGO利用直接配准技术进行里程计(DRO)和回环闭合配准。一种现成的雷达专用地点识别算法RaPlace提供回环闭合候选。由于RaPlace未提供相对变换,Dr-PoGO引入了粗到细的配准方法,利用视觉特征和描述符获得直接变换精细化的初始猜测。全局轨迹在姿态图优化中得到优化。Dr-PoGO在300公里的数据集上,在各种真实世界汽车环境中展示了最先进的性能。我们的实现已公开可用: https://github.com/utiasASRL/dr_pogo。
cs.RO / 15 / 2605.04809
Optimal Uncertainty-Aware Calibration for the AX=YB Problem
AX=YB 问题的最优不确定性感知标定
Abstract
This article proposes a general optimization framework for solving the hand-eye calibration problem. Unlike traditional methods, it develops an iterative algorithm based on Lie algebra that achieves approximately globally optimal solutions. During optimization, the method strictly preserves the structural constraints of the calibration parameters and updates them synchronously. Data used in real-world hand-eye calibration often contain uncertainty, especially in overloading and large-workspace industrial robot scenarios, which can significantly degrade accuracy. Because accurately modeling such uncertainty is inherently difficult, this article avoids explicit uncertainty modeling. Instead, it introduces an uncertainty metric that evaluates the relative uncertainty between data sources and uses it to dynamically refine the iterative process. To further enhance convergence efficiency, an initial-solution generation method that improves overall stability and accuracy is designed. Numerical simulations and real-world experiments validate the effectiveness of the proposed approach; on synthetic datasets, it improves estimation accuracy by at least 67% under high-uncertainty conditions compared with existing methods.
Chinese Translation
本文提出了一个通用的优化框架用于解决手眼标定问题。与传统方法不同,开发了一种基于李代数的迭代算法,能够实现近似全局最优解。在优化过程中,该方法严格保持标定参数的结构约束,并实现标定参数之间的同步更新。认识到在实际的手眼标定中,使用的数据往往包含不确定性,尤其是在超载和大工件空间工业机器人场景中,这可以显著降低精度,而精确建模这种不确定性本质上是困难的,本文避免了显式的不确定性建模。相反,文章引入了一种不确定性度量,用于评估数据源之间的相对不确定性,并用于动态完善迭代过程。为了进一步提升收敛效率,设计了一种有效的初始解生成方法,提高整体稳定性和准确性。数值仿真和实际实验验证了所提出方法的有效性,在合成数据集中,与现有方法相比,在高不确定性条件下,所提出的方法使估计精度提高了至少67%。
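The constraint behind the AX=YB problem is that, for every measurement pair (A, B), the unknown transforms X and Y must satisfy AX = YB. The sketch below only evaluates that residual. It does not reproduce the paper's Lie-algebra solver or uncertainty metric, and it uses planar SE(2) transforms instead of SE(3) purely to stay short. All numeric values are hypothetical.

```python
import math


def se2(theta, tx, ty):
    """Homogeneous 3x3 matrix for a planar rotation theta and translation (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]


def matmul(P, Q):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def se2_inverse(T):
    """Inverse of an SE(2) transform: rotation R^T and translation -R^T t."""
    c, s, tx, ty = T[0][0], T[1][0], T[0][2], T[1][2]
    return [[c, s, -(c * tx + s * ty)], [-s, c, s * tx - c * ty], [0.0, 0.0, 1.0]]


def axyb_residual(A, X, Y, B):
    """Frobenius norm of AX - YB; zero when the calibration is consistent."""
    AX, YB = matmul(A, X), matmul(Y, B)
    return math.sqrt(sum((AX[i][j] - YB[i][j]) ** 2 for i in range(3) for j in range(3)))


# Hypothetical ground truth and one synthetic measurement pair built from it.
X_true = se2(0.3, 0.10, -0.05)
Y_true = se2(-1.1, 2.00, 0.50)
B = se2(0.7, 0.40, 0.90)
A = matmul(matmul(Y_true, B), se2_inverse(X_true))  # so that A X = Y B holds exactly

good = axyb_residual(A, X_true, Y_true, B)
bad = axyb_residual(A, se2(0.35, 0.10, -0.05), Y_true, B)
print(good, bad)
```

A solver would minimize the sum of such residuals over many pairs, and a relative-uncertainty metric of the kind the abstract describes could reweight the pairs during that iteration.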
cs.RO / 16 / 2605.04939
Modular Reinforcement Learning For Cooperative Swarms
协作群体的模块化强化学习
Abstract
A cooperative robot swarm is a collective of computationally-limited robots that share a common goal. Each robot can only interact with a small subset of its peers, without knowing how this affects the collective utility. Recent advances in distributed multi-agent reinforcement learning have demonstrated that it is possible for robots to learn how to interact effectively with others, in a manner that is aligned with the common goal, despite each robot learning independently of others. However, this requires each robot to represent a potentially combinatorial number of interaction states, challenging the memory capabilities of the robots. This paper proposes an alternative approach for representing spatial interaction states for multi-robot reinforcement learning in swarms. A modular (decomposed) representation is used, where each feature of the state is handled by a separate learning procedure, and the results aggregated. We demonstrate the efficacy of the approach in numerous experiments with simulated robot swarms carrying out foraging.
Chinese Translation
协作机器人群体是一个由计算能力有限的机器人组成的集合,它们共享一个共同的目标。每个机器人只能与其少数同伴进行交互,而不清楚这种交互对整体效用的影响。最近在分布式多智能体强化学习方面的进展表明,尽管每个机器人独立学习,但它们能够学习如何有效地与他人互动,并与共同目标对齐。然而,这需要每个机器人表示可能的组合交互状态,这对机器人的记忆能力提出了挑战。本文提出了一种替代方法,用于在机器人群体中表示多机器人强化学习的空间交互状态。采用模块化(分解的)表示法,其中状态的每个特征由独立的学习过程处理,其结果进行汇总。我们通过多次模拟机器人群体进行觅食实验,展示了该方法的有效性。
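The memory saving of a modular state representation can be sketched with tabular Q-learning. This is an illustration under assumptions, not the paper's algorithm: the class name, the per-feature update with a shared reward, and aggregation by summation are all hypothetical choices.

```python
from collections import defaultdict


class ModularQ:
    """One small Q-table per state feature; action values are aggregated by summation."""

    def __init__(self, n_features, actions, alpha=0.1, gamma=0.9):
        self.actions = list(actions)
        self.alpha, self.gamma = alpha, gamma
        self.tables = [defaultdict(float) for _ in range(n_features)]

    def q(self, state, action):
        """Aggregate value of an action: the sum of each module's estimate."""
        return sum(table[(f, action)] for table, f in zip(self.tables, state))

    def best_action(self, state):
        return max(self.actions, key=lambda a: self.q(state, a))

    def update(self, state, action, reward, next_state):
        """Each module runs its own Q-learning update on its own feature."""
        for table, f, f_next in zip(self.tables, state, next_state):
            target = reward + self.gamma * max(table[(f_next, a)] for a in self.actions)
            table[(f, action)] += self.alpha * (target - table[(f, action)])


# Hypothetical two-feature interaction state, e.g. (neighbour bearing bin, neighbour range bin).
learner = ModularQ(n_features=2, actions=["left", "right"])
for _ in range(20):
    learner.update((0, 1), "right", reward=1.0, next_state=(0, 1))
print(learner.best_action((0, 1)))
```

With k features of d values each, the tables hold on the order of k * d entries per action instead of d ** k, which is the memory argument the abstract makes.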
cs.RO / 17 / 2605.05053
Reduced-order Neural Modeling with Differentiable Simulation for High-Detail Tactile Perception
具可微仿真的降阶神经建模用于高细节触觉感知
Abstract
Tactile perception is key to dexterous manipulation, yet simulating high-resolution elastomer deformation remains computationally prohibitive. Finite element methods (FEM) deliver high fidelity but demand costly remeshing, while Material Point Methods (MPM) suffer from heavy particle-memory tradeoffs. We propose a reduced-order neural simulation framework that couples coarse-grained MPM dynamics with an implicit neural decoder to reconstruct sub-particle tactile details from compact latent states. The framework learns a continuous deformation manifold from paired high- and low-resolution simulations, enabling physically consistent, differentiable inference. Compared to TacIPC, our method achieves over 65% faster simulation and 40% lower memory usage, while maintaining better geometric fidelity. In tactile rendering and 3D surface reconstruction, our method further improves accuracy by 25% and produces realistic depth images and surface meshes at a faster inference speed. These results demonstrate that the proposed reduced-order neural model enables high-detail, physically grounded tactile simulation with substantial efficiency gains for robotic interaction and optimization.
Chinese Translation
触觉感知是灵巧操控的关键,然而高分辨率弹性体变形的仿真仍然在计算上具有挑战性。有限元方法(FEM)提供了高保真度,但需要昂贵的重网格处理,而材料点方法(MPM)则面临较高的粒子内存权衡。我们提出了一种降阶神经仿真框架,该框架将粗粒度的MPM动力学与隐式神经解码器相结合,从紧凑潜在状态中重建亚粒子级别的触觉细节。该框架通过成对的高、低分辨率仿真学习连续的变形流形,使得物理一致的可微推断成为可能。与TacIPC相比,我们的方法实现了超过65%的仿真加速和40%的内存使用减少,同时保持了更好的几何保真度。在触觉渲染和3D表面重建中,我们的方法进一步提高了25%的准确性,并以更快的推断速度生成逼真的深度图像和表面网格。这些结果表明,所提出的降阶神经模型能够以显著的效率提升实现高细节、物理基础的触觉仿真,适用于机器人交互和优化。
cs.RO / 18 / 2605.05092
Driver-WM: A Driver-Centric Traffic-Conditioned Latent World Model for In-Cabin Dynamics Rollout
Driver-WM:一种以驾驶员为中心的交通条件潜在世界模型,用于车内动态的预测
Abstract
Safe L2/L3 driving automation requires anticipating human-in-the-loop reactions during shared-control transitions. While most driving world models forecast the external environment, in-cabin intelligence remains strictly recognition-oriented and lacks multi-step rollout capabilities for driver dynamics. We introduce Driver-WM, a driver-centric latent world model that rolls out in-cabin dynamics causally conditioned on out-cabin traffic context. This formulation unifies physical kinematics forecasting with auxiliary behavioral and emotional semantic recognition. Operating in a compact latent space constructed from frozen vision-language features, Driver-WM adopts a dual-stream architecture to separately encode external traffic and internal driver states. These streams are directionally coupled via a gated causal injection mechanism, which uses a learned vector gate to modulate external contextual perturbations while strictly enforcing temporal causality. Evaluations on a multi-task assistive driving benchmark demonstrate that Driver-WM yields robust long-horizon geometric forecasting for reactive high-motion maneuvers and improves semantic alignment for both driver and traffic states. Finally, the explicit external-to-internal conditioning allows for controlled test-time interventions to systematically analyze mechanism responses.
Chinese Translation
安全的 L2/L3 驾驶自动化需要在共享控制过渡期间预测人类参与者的反应。虽然大多数驾驶世界模型预报外部环境,但车内智能却严格以识别为导向,缺乏对驾驶员动态的多步骤预测能力。我们提出了 Driver-WM,这是一种以驾驶员为中心的潜在世界模型,它在因交通上下文而产生的车外条件下,推导车内动态。该模型将物理运动学预测与辅助行为及情感语义识别统一在一起。Driver-WM 在基于冻结视觉-语言特征构建的紧凑潜在空间中运行,采用双流架构分别编码外部交通和内部驾驶员状态。这些流通过一个门控因果注入机制在方向上耦合,该机制使用学习得到的向量门调节外部环境扰动,同时严格遵循时间因果性。在一个多任务辅助驾驶基准测试上的评估显示,Driver-WM 在反应性高运动操控方面提供了稳健的长时间范围几何预测,并改善了驾驶员与交通状态的语义对齐。最后,明确的外部到内部的条件化允许在测试时进行有控制的干预,以系统地分析机制响应。
cs.RO / 19 / 2605.05110
LineRides: Line-Guided Reinforcement Learning for Bicycle Robot Stunts
LineRides:基于轨迹引导的自行车机器人特技强化学习
Abstract
Designing reward functions for agile robotic maneuvers in reinforcement learning remains difficult, and demonstration-based approaches often require reference motions that are unavailable for novel platforms or extreme stunts. We present LineRides, a line-guided learning framework that enables a custom bicycle robot to acquire diverse, commandable stunt behaviors from a user-provided spatial guideline and sparse key-orientations, without demonstrations or explicit timing. LineRides handles physically infeasible guidelines using a tracking margin that permits controlled deviation, resolves temporal ambiguity by measuring progress via traveled distance along the guideline, and disambiguates motion details through position- and sequence-based key-orientations. We evaluate LineRides on the Ultra Mobility Vehicle (UMV) and show that the policy trained with our methods supports seamless transitions between normal driving and stunt execution, enabling five distinct stunts on command: MiniHop, LargeHop, ThreePointTurn, Backflip, and DriftTurn.
Chinese Translation
在强化学习中,为灵活的机器人机动设计奖励函数仍然是一个挑战,而基于示范的方法往往需要参考运动,这些在新平台或极限特技中可能不可用。我们提出了LineRides,一种基于轨迹引导的学习框架,使定制的自行车机器人能够从用户提供的空间指南和稀疏关键方向中获取多样化、可指令的特技行为,无需示范或明确的时间控制。LineRides通过采用跟踪边界处理物理上不可行的指南,允许受控偏差;通过沿着指南测量行进距离来解决时间歧义;并通过基于位置和序列的关键方向来消除运动细节的歧义。我们在超灵活车辆(Ultra Mobility Vehicle, UMV)上评估了LineRides,结果表明我们的方法训练出的策略支持在正常驾驶和特技执行之间的无缝过渡,能够根据指令完成五种不同的特技:MiniHop、LargeHop、ThreePointTurn、Backflip 和 DriftTurn。
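The abstract states that LineRides measures progress by traveled distance along the guideline and allows controlled deviation within a tracking margin. A minimal sketch of one way to compute such a progress signal follows. It assumes the guideline is a polyline and that progress is read off by projecting the robot position onto it; the function name `progress_along_guideline` and the `margin` parameter are hypothetical and not taken from the paper.

```python
import numpy as np

def progress_along_guideline(guideline, position, margin):
    """Return (arc length of the closest guideline point, within-margin flag).

    guideline: (N, D) polyline vertices (assumed representation).
    position:  (D,) current robot position.
    margin:    allowed deviation from the guideline (hypothetical parameter).
    """
    pts = np.asarray(guideline, dtype=float)
    p = np.asarray(position, dtype=float)
    seg_start, seg_end = pts[:-1], pts[1:]
    seg_vec = seg_end - seg_start
    seg_len = np.linalg.norm(seg_vec, axis=1)
    # Fraction along each segment of the point closest to p, clamped to [0, 1].
    t = np.einsum("ij,ij->i", p - seg_start, seg_vec) / np.maximum(seg_len**2, 1e-12)
    t = np.clip(t, 0.0, 1.0)
    closest = seg_start + t[:, None] * seg_vec
    dist = np.linalg.norm(closest - p, axis=1)
    i = int(np.argmin(dist))
    # Progress = length of all earlier segments + partial length on segment i.
    progress = seg_len[:i].sum() + t[i] * seg_len[i]
    return float(progress), bool(dist[i] <= margin)
```

Because progress is indexed by distance rather than time, a reward built on it does not need explicit timing, which matches the motivation given in the abstract. A self-intersecting guideline would need extra disambiguation that this sketch does not attempt.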
cs.RO / 20 / 2605.05126
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
ConsisVLA-4D:提升机器人操作中的高效三维感知与四维推理的时空一致性
Abstract
Current Vision-Language-Action (VLA) models primarily focus on mapping 2D observations to actions, but exhibit notable limitations in spatiotemporal perception and reasoning: 1) spatial representations often rely on additional sensors, introducing substantial computational overhead; 2) visual reasoning is typically limited to future-frame prediction, lacking alignment with the instruction-grounded scene and thus compromising spatiotemporal consistency. To address these challenges, we propose ConsisVLA-4D, a unified and efficient framework that enhances spatiotemporal consistency in 3D perception and 4D reasoning. Specifically, we design: 1) CV-Aligner, which ensures cross-view object semantic consistency by filtering instruction-relevant regions and aligning object identities across multiple viewpoints; 2) CO-Fuser, which guarantees cross-object spatial geometric consistency by eliminating spatial relation ambiguities between objects across views using compact latent representations. Building upon these, we introduce 3) CS-Thinker to achieve cross-scene spatiotemporal consistency as actions unfold. It learns implicit knowledge of local dynamics from object-semantic tokens of CV-Aligner and global depth from geometric tokens of CO-Fuser, thereby enhancing efficient visual reasoning under scene variations. Extensive experiments demonstrate that, benefiting from its efficient spatiotemporal consistency design, ConsisVLA-4D achieves 21.6% and 41.5% performance improvements, along with 2.3-fold and 2.4-fold inference speedups compared to OpenVLA on the LIBERO benchmark and real-world platforms, respectively. ConsisVLA-4D is open-sourced and publicly available at
Chinese Translation
当前的视觉-语言-行动(VLA)模型主要专注于将二维观测映射到动作,但在时空感知和推理方面存在明显的局限性:1)空间表示通常依赖于额外的传感器,从而引入了巨大的计算开销;2)视觉推理通常局限于未来帧的预测,缺乏与指令相关场景的对齐,从而影响时空一致性。为了解决这些挑战,我们提出了ConsisVLA-4D,一个统一且高效的框架,增强了三维感知和四维推理中的时空一致性。具体来说,我们设计了:1)CV-Aligner,通过过滤与指令相关的区域并在多个视角间对齐对象身份,确保跨视角对象的语义一致性;2)CO-Fuser,通过利用紧凑的潜在表示消除视角间对象之间的空间关系模糊性,确保跨对象的空间几何一致性。在此基础上,我们引入了3)CS-Thinker,以在动作展开时实现跨场景的时空一致性。它从CV-Aligner的对象语义标记中学习局部动态的隐含知识,从CO-Fuser的几何标记中获得全局深度,从而在场景变化中提升高效的视觉推理。广泛的实验表明,得益于其高效的时空一致性设计,ConsisVLA-4D在LIBERO基准和实际平台上,相比OpenVLA实现了21.6%和41.5%的性能提升,以及2.3倍和2.4倍的推理速度提升。ConsisVLA-4D已开源并公开可用。
cs.RO / 21 / 2605.05172
When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning
当生活给予你行为克隆时,制作Q函数:从行为克隆中提取Q值以实现机器人强化学习
Abstract
Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to replace previously learned good actions due to a distribution mismatch between offline data and online learning. In this work, we propose Q2RL, Q-Estimation and Q-Gating from BC for Reinforcement Learning, an algorithm for efficient offline-to-online learning. Our method consists of two parts: (1) Q-Estimation extracts a Q-function from a BC policy using a few interaction steps with the environment, followed by online RL with (2) Q-Gating, which switches between BC and RL policy actions based on their respective Q-values to collect samples for RL policy training. Across manipulation tasks from D4RL and robomimic benchmarks, Q2RL outperforms SOTA offline-to-online learning baselines on success rate and time to convergence. Q2RL is efficient enough to be applied in an on-robot RL setting, learning robust policies for contact-rich and high precision manipulation tasks such as pipe assembly and kitting, in 1-2 hours of online interaction, achieving success rates of up to 100% and up to 3.75x improvement against the original BC policy. Code and video are available at https://pages.rai-inst.com/q2rl_website/
Chinese Translation
行为克隆(BC)已成为机器人学习的一种高效范式。然而,BC在收集完演示后缺乏自我引导的在线改进机制。现有的离线到在线学习方法常常由于离线数据与在线学习之间的分布不匹配,导致策略替换先前学习到的良好动作。在本工作中,我们提出了Q2RL,即基于BC的强化学习中的Q估计(Q-Estimation)与Q门控(Q-Gating)技术,这是一个高效的离线到在线学习算法。我们的方法由两部分组成:(1) Q-估计通过与环境进行少量交互步骤,从BC策略中提取Q函数,然后进行在线强化学习;(2) Q-门控根据各自的Q值在BC和RL策略动作之间进行切换,以收集样本用于RL策略训练。在来自D4RL和robomimic基准的操作任务中,Q2RL在成功率和收敛时间上超越了当前最优的离线到在线学习基准。Q2RL足够高效,可以应用于在线机器人强化学习环境中,能够学习到对接触丰富和高精度的操作任务(例如管道组装和配料)具有鲁棒性的策略,在1-2小时的在线交互中实现高达100%的成功率,并相较于原始BC策略提高了最多3.75倍。代码和视频可在 https://pages.rai-inst.com/q2rl_website/ 获取。
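The Q-Gating step described in the abstract switches between BC and RL policy actions based on their respective Q-values. A minimal sketch of that switching rule is given below. It assumes each policy's action is scored by its own Q-function and that the higher-valued action is executed; the abstract does not specify these details, and all names (`q_gate`, `bc_policy`, `rl_policy`, `q_bc`, `q_rl`) are hypothetical.

```python
def q_gate(state, bc_policy, rl_policy, q_bc, q_rl):
    """Pick the action to execute and report which policy produced it.

    bc_policy, rl_policy: callables mapping state -> action (assumed interface).
    q_bc, q_rl: callables mapping (state, action) -> scalar Q-value estimate.
    """
    a_bc = bc_policy(state)
    a_rl = rl_policy(state)
    # Assumption: fall back to the BC action unless the RL action is valued strictly higher.
    if q_rl(state, a_rl) > q_bc(state, a_bc):
        return a_rl, "rl"
    return a_bc, "bc"
```

Defaulting to the BC action on ties is a design choice of this sketch: it keeps early data collection close to the demonstrated behavior, in line with the abstract's concern about replacing previously learned good actions.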
cs.RO / 22 / 2605.05182
A Closed-Form Dual-Barrier CBF Safety Filter for Holonomic Robots on Incrementally Built Occupancy Grid Maps
增量构建占用栅格地图上全向机器人闭式双障碍CBF安全滤波器
Abstract
We present a dual-barrier control barrier function (CBF) safety filter for real-time, safety-critical velocity control of holonomic robots operating in incrementally built occupancy grid maps. As a robot explores an unknown environment, unmapped regions introduce irreducible uncertainty, since obstacle geometry beyond the explored frontier is unknown, making entry into such regions a source of collision risk, especially with front-facing sensors. To address this, we enforce two constraints: avoidance of mapped obstacles and restriction from unexplored regions. Both constraints are derived analytically from the occupancy grid's signed distance field, yielding a closed-form safety filter that requires only a small linear system solve per cycle. On resource-constrained platforms such as the Raspberry Pi, where SLAM and planning already consume significant compute, the low overhead of the proposed filter preserves resources. An adaptive gain schedule relaxes the frontier constraint in information-rich regions and tightens it in well-mapped areas, improving exploration efficiency while maintaining safety. The filter operates in velocity space as a minimally invasive correction and composes with arbitrary nominal controllers, including learning-based methods. Hardware flight experiments on a PX4-controlled quadrotor demonstrate zero collisions across multiple indoor runs.
Chinese Translation
我们提出了一种双障碍控制障碍函数(CBF)安全滤波器,用于在增量构建的占用栅格地图上实现全向机器人实时的、安全关键的速度控制。当机器人探索未知环境时,未经映射区域引入了不可减少的不确定性,因为超出已探索边界的障碍几何形状是未知的,这使得进入这些区域成为碰撞风险的来源,尤其对于前向传感器。为了解决这一问题,我们施加了两个约束:避开已映射的障碍物以及限制进入未探索区域。这两个约束是从占用栅格的带符号距离场中分析得出的,从而产生一个闭式安全滤波器,只需每个周期解决一个小型线性系统。在资源受限的平台上,例如树莓派,SLAM和规划已经消耗了大量计算资源,而本滤波器的低开销则有助于保留资源。自适应增益调度在信息丰富的区域放宽边界约束,而在充分映射的区域收紧约束,提高探索效率的同时保持安全性。该滤波器在速度空间中作为一种最小侵入性的校正操作,并可与任意名义控制器(包括基于学习的方法)组合。针对PX4控制的四旋翼的硬件飞行实验显示,在多次室内运行中实现了零碰撞。
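The abstract describes a minimally invasive velocity correction under two barrier constraints that needs only a small linear system solve per cycle. The sketch below illustrates that idea under stated assumptions: each barrier is written as a linear constraint `A[i] @ u >= b[i]` on the velocity command (for a single-integrator model a common choice is row = gradient of the barrier, `b[i] = -alpha * h_i`), and the filter returns the closest feasible velocity by enumerating active sets. The function name and the constraint form are assumptions, not the paper's derivation.

```python
import itertools
import numpy as np

def safety_filter(u_nom, A, b):
    """Return the velocity closest to u_nom that satisfies A @ u >= b.

    A: (m, d) constraint rows, b: (m,) right-hand sides (assumed inputs,
    e.g. one row for mapped obstacles and one for the unexplored frontier).
    """
    u_nom = np.asarray(u_nom, dtype=float)
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    best = None
    # Try every subset of constraints as the active set (4 subsets when m = 2).
    for k in range(len(b) + 1):
        for active in itertools.combinations(range(len(b)), k):
            idx = list(active)
            if idx:
                Aa = A[idx]
                gram = Aa @ Aa.T
                if abs(np.linalg.det(gram)) < 1e-12:
                    continue  # degenerate (e.g. parallel) active constraints
                # Project u_nom onto the intersection of the active boundaries.
                lam = np.linalg.solve(gram, b[idx] - Aa @ u_nom)
                u = u_nom + Aa.T @ lam
            else:
                u = u_nom
            if np.all(A @ u >= b - 1e-9):
                d = np.linalg.norm(u - u_nom)
                if best is None or d < best[0]:
                    best = (d, u)
    if best is None:
        raise ValueError("constraints are infeasible")
    return best[1]
```

With two constraints the largest system solved is 2x2, which is consistent with the low per-cycle overhead the abstract emphasizes. When the nominal command already satisfies both constraints it is returned unchanged.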
cs.CV / 1 / 2605.04098
Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology
多模态大型语言模型是否已准备好用于临床皮肤病学?皮肤病学中的真实世界评估
Abstract
Multimodal large language models (MLLMs) have demonstrated promise on publicly available dermatology benchmarks. However, benchmark performance may not generalize to real-world dermatologic decision-making. To quantify this benchmark-to-bedside gap, we evaluated four open-weight MLLMs (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4 and MedGemma-4B-Instruct) and one commercial MLLM (GPT-4.1) across three publicly available dermatology datasets and a retrospective multi-site hospital-based dermatology consultation cohort comprising 5,811 cases and 46,405 clinical images. Models were evaluated on two clinically relevant tasks: differential diagnosis generation and severity-based triage. Diagnostic performance was modest on public datasets and declined substantially in the real-world cohort. On public benchmarks, top-3 diagnostic accuracy reached 26.55% for the best open-weight model and 42.25% for GPT-4.1. On real-world consultation cases using images alone, top-3 diagnostic accuracy fell to 1.50%-13.35% among open-weight models and 24.65% for GPT-4.1. Incorporating clinical context improved performance across all models, increasing top-3 diagnostic accuracy up to 28.75% among open-weight models and 38.93% for GPT-4.1. However, model outputs were highly sensitive to incomplete or erroneous consultation context. For severity-based triage, models achieved moderate sensitivity (above 60%), suggesting potential utility for screening but insufficient reliability for clinical deployment. These findings demonstrate that benchmark performance substantially overestimates the real-world clinical capability of current dermatology MLLMs.
Chinese Translation
多模态大型语言模型(MLLMs)在公开可用的皮肤病学基准测试中显示出良好的前景。然而,基准性能可能无法推广到真实世界的皮肤病决策过程中。为了量化这一基准与临床应用之间的差距,我们评估了四个开放权重的MLLM(InternVL-Chat v1.5、LLaVA-Med v1.5、SkinGPT4和MedGemma-4B-Instruct)以及一个商业MLLM(GPT-4.1),在三个公开可用的皮肤病学数据集和一个包含5,811个病例和46,405张临床图像的回顾性多中心医院皮肤病咨询队列上进行评估。模型在两个临床相关任务上进行了评估:鉴别诊断生成和基于严重性分级的分诊。在公共数据集上的诊断性能表现适中,而在真实世界的队列中则显著下降。在公共基准测试中,最佳开放权重模型的前3名诊断准确率为26.55%,而GPT-4.1为42.25%。在仅使用图像的真实咨询病例中,开放权重模型的前3名诊断准确率降至1.50%-13.35%,而GPT-4.1为24.65%。将临床背景纳入模型后,各模型的表现得到了改善,开放权重模型的前3名诊断准确率提高至28.75%,而GPT-4.1提高至38.93%。然而,模型输出对不完整或错误的咨询背景高度敏感。对于基于严重性的分诊,模型的灵敏度适中(超过60%),表明其在筛查中的潜在应用,但在临床部署中的可靠性不足。这些发现表明,基准性能显著高估了当前皮肤病MLLM在真实世界临床中的能力。
cs.CV / 2 / 2605.04108
MuCALD-SplitFed: Causal-Latent Diffusion for Privacy-Preserving Multi-Task Split-Federated Medical Image Segmentation
MuCALD-SplitFed:用于隐私保护的多任务分裂联邦医学图像分割的因果潜变量扩散
Abstract
Federated Learning enables decentralized training by aggregating model updates across clients without sharing raw data, while Split Federated Learning further partitions the model between clients and a server to reduce computation and communication at the client side. However, decentralized medical institutions rarely operate on a single shared task, making standard Federated and SplitFed collaborations poorly aligned with real clinical workflows. Multi-task FL extends these frameworks by allowing clients to handle different tasks, but often introduces instability and privacy vulnerabilities. This study proposes MuCALD-SplitFed, a multi-task SplitFed framework that integrates causal representation learning and latent diffusion. Experiments show MuCALD-SplitFed consistently improves segmentation, while baseline SplitFed fails to converge. The proposed approach further reduces information leakage at split points, mitigating reconstruction-based and membership inference attacks. Additionally, MuCALD-SplitFed outperforms state-of-the-art personalized FL and multi-task FL approaches. The code repository is: https://github.com/ChamaniS/MuCALD_SplitFed.
Chinese Translation
联邦学习通过在客户端之间汇聚模型更新而无需共享原始数据,实现了去中心化的训练,而分裂联邦学习进一步在客户端和服务器之间划分模型,以减少客户端的计算和通信。然而,去中心化的医疗机构通常并不在同一共享任务上运作,因此标准的联邦和分裂联邦协作与实际临床工作流程并不紧密契合。多任务联邦学习扩展了这些框架,使客户端能够处理不同的任务,但往往会引入不稳定性和隐私漏洞。本研究提出了 MuCALD-SplitFed,一种集成因果表示学习和潜变量扩散的多任务分裂联邦框架。实验结果表明,MuCALD-SplitFed 一直在提高分割性能,而基线的 SplitFed 无法收敛。所提方法进一步减少了分裂点的信息泄露,缓解了基于重构和成员推断的攻击。此外,MuCALD-SplitFed 在个性化联邦学习和多任务联邦学习方法中表现优于现有的最先进技术。代码库地址为:https://github.com/ChamaniS/MuCALD_SplitFed。
cs.CV / 3 / 2605.04201
Topology-Constrained Quantized nnUNet for Efficient and Anatomically Accurate 3D Tooth Segmentation
拓扑约束量化 nnUNet 在高效且解剖准确的 3D 牙齿分割中的应用
Abstract
We propose a topology-constrained quantized nnUNet framework for efficient and anatomically accurate 3D tooth segmentation, addressing the challenges of spatial distortion introduced by quantization in deep learning models. The proposed method integrates a novel tooth-specific topological loss into quantization-aware training, preserving critical anatomical structures such as tooth count, adjacency relationships, and cavity integrity while maintaining computational efficiency. The system employs an 8-bit quantized nnUNet backbone, where weights and activations are dynamically calibrated to minimize precision loss during inference. Furthermore, the topological loss combines connected-component analysis, adjacency consistency, and hole detection penalties, ensuring anatomical fidelity without modifying the underlying network architecture. The joint optimization objective harmonizes cross-entropy loss, quantization regularization, and topological constraints, enabling end-to-end training with gradient approximations for persistent homology terms. Experiments demonstrate that our approach significantly reduces topological errors compared to conventional quantized models, achieving clinically plausible segmentations on dental CBCT scans. The method retains the hardware efficiency of integer-only inference, making it suitable for deployment in resource-constrained clinical environments. This work bridges the gap between computational efficiency and anatomical precision in medical image segmentation, offering a practical solution for real-world dental applications.
Chinese Translation
我们提出了一种拓扑约束量化 nnUNet 框架,以实现高效且解剖准确的 3D 牙齿分割,解决了深度学习模型中由量化引入的空间失真挑战。该方法将一种新颖的牙齿特定拓扑损失集成到量化感知训练中,保留牙齿数量、邻接关系和牙齿完整性等关键解剖结构,同时保持计算效率。系统采用 8 位量化的 nnUNet 主干,动态校准权重和激活,以最小化推理过程中的精度损失。此外,拓扑损失结合了连通组件分析、邻接一致性和孔检测惩罚,确保解剖的真实度,而无需修改底层网络架构。联合优化目标协调交叉熵损失、量化正则化和拓扑约束,实现了带有持久同调项的端到端训练与梯度近似。实验表明,与传统量化模型相比,我们的方法显著减少了拓扑错误,在牙科 CBCT 扫描中实现了临床可行的分割结果。该方法保持了仅使用整数推理的硬件效率,适用于资源有限的临床环境。这项工作弥合了医疗图像分割中计算效率与解剖精度之间的差距,为实际牙科应用提供了一种切实可行的解决方案。
cs.CV / 4 / 2605.04231
Anatomy of a failure: When, how, and why deep vision fails in scientific domains
失败的解剖:深度视觉在科学领域何时、如何及为何失败
Abstract
Mirroring its ubiquity in popular media and all human activities, the use of deep learning (DL) is rapidly growing in scientific imaging modalities. However, unlike everyday RGB pictures, pixels encode precise physicochemical properties in scientific imaging across potentially thousands of channels. While DL is well validated on human-centric RGB perceptual tasks, its effectiveness for scientific imaging remains uncertain. Here, we show that the naive application of DL frameworks to scientific images can lead to critical failures. We evaluate the use of DL for pathology, comparing RGB images of stained tissue with the quantitative and information-rich biochemical signatures of infrared (IR) imaging. Despite this informational advantage, DL models trained on IR data paradoxically underperform. We investigate this discrepancy to find that IR data priors interact poorly with the simplicity bias of DL, causing models to collapse to one-dimensional predictions. This constitutes a catastrophic DL failure because the model's representational capacity remains largely unused, while furthermore raising AI safety concerns and undermining the advantages of such scientific modalities. Notably, this problem persists even with state-of-the-art DL robustification strategies, which are primarily designed and validated for RGB imagery and thus inherit the same prior-bias mismatch. This work establishes a framework for understanding the limitations of generic DL in science and advocates for the study of modality-specific failure modes to guide the development of specialized, safe AI algorithms.
Chinese Translation
随着深度学习(Deep Learning, DL)在科学成像模式中的快速发展,这一技术在流行媒体和人类活动中正变得越来越普遍。然而,与日常的RGB图像不同,科学成像中的像素编码了数千个通道中精确的物理化学特性。虽然DL在以人为中心的RGB感知任务上得到了很好的验证,但其在科学成像中的有效性仍然不确定。在这里,我们展示了将DL框架天真地应用于科学图像可能导致严重失败的情况。我们评估了DL在病理学中的应用,比较着色组织的RGB图像与红外(Infrared, IR)成像的定量和信息丰富的生化特征。尽管在信息上具有优势,基于IR数据训练的DL模型却矛盾性地表现不佳。我们研究了这一差异,发现IR数据的先验与DL的简单性偏置相互作用不良,导致模型崩溃为一维预测。这构成了DL的灾难性失败,因为模型的表征能力在很大程度上未得到利用,同时还引发了人工智能安全隐患,并削弱了此类科学模式的优势。值得注意的是,即使采用最新的DL稳健性策略,该问题仍然存在,这些策略主要是为RGB图像设计和验证的,因此也继承了相同的先验偏差失配。本文建立了一个框架,以理解通用DL在科学中的局限性,并倡导研究特定模式的失败模式,以指导专用安全AI算法的发展。
cs.CV / 5 / 2605.04234
Disentangled Learning Improves Implicit Neural Representations for Medical Reconstruction
解耦学习提升隐式神经表示在医学重建中的应用
Abstract
Implicit neural representations (INRs) have emerged as a powerful paradigm for medical imaging via physics-informed unsupervised learning. Classical INRs optimize an entire network from scratch for each subject, leading to inefficient training and suboptimal imaging quality. Recent initialization-based approaches attempt to inject population priors into pre-trained networks, yet they rely on high-quality images and often suffer from catastrophic forgetting during fine-tuning. We present DisINR, a novel INR framework that explicitly disentangles shared and subject-specific representations. DisINR introduces a shared encoder-decoder pair and subject-specific encoders, whose features are jointly decoded for image reconstruction. By integrating differentiable forward models, it pre-trains the shared modules directly from limited raw measurements, removing the need for pre-acquired high-quality images. During test-time adaptation, only the subject-specific encoder is optimized, while the shared pair remains frozen, effectively preserving learned priors. Extensive evaluations on three representative medical imaging tasks show that DisINR significantly outperforms state-of-the-art INRs in both reconstruction accuracy and efficiency.
Chinese Translation
隐式神经表示(INRs)作为一种通过物理信息无监督学习实现医学影像的强大范式已逐渐崭露头角。传统的INRs为每个个体从头优化整个网络,导致训练效率低下和成像质量不佳。近期基于初始化的方法尝试将群体先验注入预训练的网络中,然而这类方法依赖于高质量图像,并且在微调过程中往往会遭遇灾难性遗忘。我们提出了DisINR,一种新颖的INR框架,显式解耦共享与个体特定的表示。DisINR引入了一个共享的编码器-解码器对和个体特定的编码器,其特征共同解码用于图像重建。通过整合可微分的前向模型,它直接从有限的原始测量中对共享模块进行预训练,消除了对预先获取的高质量图像的需求。在测试时间适应中,仅优化个体特定编码器,而共享对保持不变,从而有效保存已学习的先验。在三个具有代表性的医学影像任务上的广泛评估表明,DisINR在重建准确性和效率上显著优于最新的INRs。
cs.CV / 6 / 2605.04239
Densification and forecasting of Sentinel-2 time series from multimodal SAR and Optical satellite data using deep generative models
基于深度生成模型的多模态SAR和光学卫星数据的哨兵-2时间序列密集化与预测
Abstract
Optical satellite image time series are extensively used in many Earth observation applications, including agriculture, climate monitoring, and land surface analysis. However, clouds and swath edges result in irregular sampling along the temporal dimension, limiting continuous monitoring. To address this issue, a growing body of work has focused on temporal densification and reconstruction of satellite image time series, with the objective of filling missing or cloud-contaminated observations within the temporal extent of the available data. While these approaches improve temporal continuity, they are inherently restricted to the reconstruction of the gaps within the observed time periods, and do not address the prediction of future observations. This work proposes a probabilistic deep learning framework for the densification and forecasting of Sentinel-2 time series by generating optical images at arbitrary past or future dates. The approach leverages multimodal satellite data by jointly exploiting Sentinel-2 optical and Sentinel-1 SAR observations. Unlike most existing works, we propose to focus on the uncertainty of the generated images. Experimental results demonstrate effective densification and forecasting on sparse and temporally misaligned time series.
Chinese Translation
光学卫星图像时间序列被广泛应用于农业、气候监测和土地表面分析等众多地球观测领域。然而,云层和飞行路径边缘导致时间维度上的不规则采样,限制了连续监测。为了解决这个问题,越来越多的研究集中于卫星图像时间序列的时间密集化和重构,旨在填补可用数据的时间范围内缺失或受云污染的观察值。虽然这些方法提高了时间连续性,但本质上仅限于在观察到的时间段内重建缺口,并未解决未来观察值的预测。本文提出了一种概率深度学习框架,用于通过在任意过去或未来日期生成光学图像来实现哨兵-2时间序列的密集化和预测。该方法通过联合利用哨兵-2光学数据和哨兵-1 SAR观测,发挥多模态卫星数据的优势。与大多数现有研究不同,我们计划关注生成图像的不确定性。实验结果表明,在稀疏和时间上不对齐的时间序列上,实现了有效的密集化和预测。
cs.CV / 7 / 2605.04247
Physics-Guided Regime Unmixing
物理引导下的状态解混合
Abstract
The Linear Mixing Model (LMM) dominates spectral unmixing for its simplicity, but fails under multiple scattering; existing nonlinear models compensate by applying a fixed regime uniformly across entire scenes. We propose Physics-Guided Regime Unmixing (PGRU), which estimates a pixel-wise scalar $\xi_i \in [0,1]$ from observable physical features to activate nonlinear mixing only where justified. Residuals from the Generalized Bilinear Model (GBM), the Post-Nonlinear Mixing Model (PPNM), and Hapke are combined via learned attention, yielding interpretable regime maps. Experiments on Samson, Jasper Ridge, and Urban show consistent improvements over baselines, with physical coherence $\rho > 0.90$.
Chinese Translation
线性混合模型(Linear Mixing Model,LMM)因其简洁性主导了光谱解混合,但在多重散射情况下表现不佳;现有的非线性模型通过在整个场景中均匀应用固定状态来弥补这一不足。我们提出了物理引导状态解混合(Physics-Guided Regime Unmixing,PGRU),该方法根据可观测的物理特征估计像素级标量 $\xi_i \in [0,1]$,以在合理的情况下激活非线性混合。我们通过学习的注意力机制结合来自广义双线性模型(Generalized Bilinear Model,GBM)、后非线性混合模型(Post-Nonlinear Mixing Model,PPNM)和Hapke模型的残差,从而生成可解释的状态图。对Samson、Jasper Ridge和Urban的实验显示,在基线之上具有一致的改进,物理一致性 $\rho > 0.90$。
cs.CV / 8 / 2605.04262
Imagery Dataset for Remaining Useful Life Estimation of Synthetic Fibre Ropes
合成纤维绳剩余使用寿命估计的图像数据集
Abstract
Remaining useful life (RUL) estimation of synthetic fibre ropes (SFRs) is critical for safe operation in offshore-crane, wind turbine installation, and heavy-load handling applications, where rope failure can result in catastrophic safety incidents and costly downtime. Despite growing research interest in data-driven condition monitoring, there is no publicly available image dataset that captures the complete degradation lifecycle of SFRs under controlled cyclic fatigue loading. To address this gap, we present a novel image dataset comprising approximately 34,700 high-resolution images of eleven Dyneema SK75/78 high-modulus polyethylene (HMPE) rope samples subjected to cyclic fatigue on a sheave-bend test stand at seven distinct axial load levels ranging from 60 kN to 280 kN. Ropes were loaded until mechanical failure, with fatigue lifetimes ranging from 695 cycles to 8,340 cycles. After every fixed number of sheave cycles (an inspection burst), ten images were captured at different cross-sectional positions along the rope, providing spatially representative sampling of surface degradation throughout the rope's entire service life. The images obtained from each load are annotated with the corresponding elapsed cycle count, enabling a direct computation of RUL for any rope in the sequence. This dataset aims to support a broad range of machine learning (ML) tasks including RUL regression, damage progression modelling, anomaly detection, and load-conditioned prognostics. The dataset is intended to serve as a benchmark resource for the development and comparison of vision-based condition monitoring (CM) and prognostics algorithms for SFRs.
Chinese Translation
合成纤维绳(Synthetic Fibre Ropes, SFRs)的剩余使用寿命(Remaining Useful Life, RUL)估计对于海上起重机、风力涡轮机安装和重载处理等应用的安全操作至关重要,因为绳索失效可能导致灾难性的安全事件和昂贵的停机时间。尽管对数据驱动的状态监测研究的兴趣不断增长,但目前尚无公开可用的图像数据集能够完整捕捉SFRs在受控的循环疲劳负载下的降解生命周期。为了弥补这一空白,我们推出了一种新颖的图像数据集,其中包含大约34,700张高分辨率图像,来源于11个在各自的7个不同轴向负载水平(从60 kN到280 kN)下,经过循环疲劳测试的Dyneema SK75/78高模量聚乙烯(HMPE)绳样本。绳索在施加载荷至机械失效为止,其疲劳寿命从695个循环至8,340个循环不等。在每固定数量的滑轮循环(称为检查短暂时间)后,沿绳索不同横截面位置拍摄十张图像,从而在绳索整个服务生命周期内提供空间代表性的表面降解采样。每个载荷得到的图像均标注了相应的已过循环次数,使得对序列中的任何绳索都可以直接计算RUL。该数据集旨在支持广泛的机器学习(Machine Learning, ML)任务,包括RUL回归、损伤进展建模、异常检测和基于负载条件的预测。该数据集旨在作为开发和比较基于视觉的状态监测(Condition Monitoring, CM)和预测算法的基准资源,应用于合成纤维绳的研究。
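Since each image is annotated with the elapsed cycle count and every rope was loaded until failure, the RUL label mentioned in the abstract follows by subtraction. A minimal sketch, assuming RUL is expressed in cycles and that the cycle count at failure is known per rope; the function name `rul_labels` is hypothetical and the dataset's actual file layout is not assumed here.

```python
def rul_labels(elapsed_cycles, cycles_to_failure):
    """Remaining useful life (in cycles) for each inspection of one rope."""
    labels = []
    for c in elapsed_cycles:
        if not 0 <= c <= cycles_to_failure:
            raise ValueError(f"elapsed cycle count {c} outside [0, {cycles_to_failure}]")
        labels.append(cycles_to_failure - c)
    return labels
```

For the shortest-lived rope reported in the abstract (695 cycles), `rul_labels([0, 100, 695], 695)` returns `[695, 595, 0]`. The inspection points 0 and 100 are made up for illustration.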
cs.CV / 9 / 2605.04299
Beyond Fixed Thresholds and Domain-Specific Benchmarks for Explainable Multi-Task Classification in Autonomous Vehicles
超越固定阈值和特定领域基准的可解释多任务分类在自主驾驶中的应用
Abstract
Scene understanding is a vital part of autonomous driving systems, which requires the use of deep learning models. Deep learning methods are intrinsically black box models, which lack transparency and safety in autonomous driving. To make these systems transparent, multi-task visual understanding has become crucial for explainable autonomous driving perception systems, where simultaneous prediction of multiple driving behaviors and their underlying explanations is essential for safe navigation and human trust in autonomous vehicles. In order to design an accurate and cross-cultural explainable autonomous driving system, we introduce a comprehensive confidence threshold sensitivity analysis that evaluates various threshold values to identify optimal decision boundaries for different tasks. Our analysis demonstrates that traditional fixed threshold approaches are suboptimal for multi-task scenarios. Through extensive evaluation, we demonstrate that our adaptive threshold selection methodology improves F1-scores across different tasks. In addition, we introduce IUST-XAI-AD, a novel dataset consisting of 958 images with human annotations for driving decisions and corresponding reasoning. This dataset addresses the critical gap in domain-specific evaluation benchmarks for distinct driving contexts and provides a more challenging test environment compared to existing datasets. Experimental results demonstrate that confidence threshold sensitivity analysis can significantly improve model performance, while the introduction of the IUST-XAI-AD dataset reveals important insights about cross-cultural driving behavior patterns. The combined contributions of this work provide both methodological advances and practical evaluation tools that can accelerate the development of more reliable, explainable, and culturally-adaptive autonomous driving systems for global deployment.
Chinese Translation
场景理解是自主驾驶系统的重要组成部分,这需要使用深度学习模型。深度学习方法本质上是黑箱模型,缺乏透明性和安全性,尤其是在自主驾驶中。为了使这些系统更加透明,多任务视觉理解对于可解释的自主驾驶感知系统变得至关重要,其中同时预测多种驾驶行为及其背后的解释是安全导航和人类对自主车辆信任的基础。为了设计一个准确且跨文化的可解释自主驾驶系统,我们引入了一种全面的置信度阈值敏感性分析,评估不同的阈值以识别不同任务的最佳决策边界。我们的分析表明,传统的固定阈值方法在多任务场景中并不理想。通过广泛的评估,我们证明了我们的自适应阈值选择方法在不同任务上提高了F1分数。此外,我们还介绍了IUST-XAI-AD,这是一个包含958张带有人类注释的驾驶决策及其相应推理的全新数据集。该数据集填补了特定驾驶情境评估基准的重要空白,相较于现有数据集提供了更具挑战性的测试环境。实验结果表明,置信度阈值敏感性分析可以显著提高模型性能,而引入IUST-XAI-AD数据集则揭示了关于跨文化驾驶行为模式的重要见解。本研究的综合贡献提供了方法论上的进展以及实用的评估工具,能够加速开发更可靠、可解释且适应文化的自主驾驶系统,以便在全球范围内推广。
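The threshold sensitivity analysis in the abstract evaluates several confidence thresholds per task instead of one fixed value. The sketch below shows a generic per-task sweep that picks the F1-maximizing threshold. It assumes binary multi-label outputs and a simple grid; the paper's actual adaptive selection procedure is not specified in the abstract, and the names `f1_score` and `best_thresholds` are illustrative.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Binary F1 from 0/1 arrays; returns 0.0 when there are no positives at all."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else float(2 * tp / denom)

def best_thresholds(y_true, scores, grid=None):
    """Return one threshold per task (column), chosen to maximize F1 on the given data."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    if grid is None:
        grid = np.linspace(0.1, 0.9, 9)  # assumed sweep range
    chosen = []
    for task in range(y_true.shape[1]):
        f1s = [f1_score(y_true[:, task], (scores[:, task] >= t).astype(int)) for t in grid]
        # On ties, np.argmax keeps the lowest threshold in the grid.
        chosen.append(float(grid[int(np.argmax(f1s))]))
    return chosen
```

In practice the thresholds would be selected on a validation split and then frozen for testing; selecting them on the test data would overstate the gain.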
cs.CV / 10 / 2605.04304
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
分层视觉智能体:在联合图像-文本空间中管理上下文以实现高级图表推理
Abstract
Advanced chart question answering requires both precise perception of small visual elements and multi-step reasoning across several subplots. While existing MLLMs are strong at understanding single plots, they often struggle with multi-step reasoning across multiple subplots. We propose HierVA, a hierarchical visual agent framework for chart reasoning that iteratively constructs and updates a working context in a joint image--text space. A high-level manager generates plans and maintains a compact context containing only key information, while specialized workers perform reasoning, gather evidence, and return results. In particular, the agent maintains separate visual and textual contexts, using a zoom-in tool to restrict the visual context. Experiments on the CharXiv reasoning subset demonstrate consistent improvements over strong multimodal baselines, and ablation studies verify that hierarchical architecture, scoped visual context, and distilled context contribute complementary gains.
Chinese Translation
高级图表问答要求对小的视觉元素进行精准感知,并在多个子图之间进行多步推理。虽然现有的多模态语言模型在理解单一图表方面表现优秀,但在多个子图之间进行多步推理时往往很难。我们提出了HierVA,一个分层视觉智能体框架,用于图表推理,该框架在联合图像-文本空间中迭代构建和更新工作上下文。高层管理者生成计划并维护一个仅包含关键信息的紧凑上下文,而专门的工作者则执行推理、收集证据并返回结果。特别是,智能体维护独立的视觉和文本上下文,使用缩放工具限制视觉上下文。对CharXiv推理子集的实验显示出相对于强大的多模态基线的一致性改进,消融研究验证了分层架构、特定范围的视觉上下文和提炼的上下文对性能的互补提升。
cs.CV / 11 / 2605.04355
InterFuserDVS: Event-Enhanced Sensor Fusion for Safe RL-Based Decision Making
InterFuserDVS:基于事件增强的传感器融合用于安全的强化学习决策
Abstract
Autonomous driving systems rely heavily on robust sensor fusion to perceive complex environments. Traditional setups using RGB cameras and LiDAR often struggle in high-dynamic-range scenes or high-speed scenarios due to motion blur and latency. Dynamic Vision Sensors (DVS), or event cameras, offer a paradigm shift by capturing asynchronous brightness changes with microsecond temporal resolution and high dynamic range. In this paper, we propose an extended architecture of the state-of-the-art InterFuser model, integrating DVS as an additional modality to enhance perception reliability. We introduce a novel token-based fusion strategy that incorporates accumulated event frames into the transformer-based backbone of InterFuser. Our method leverages the complementary nature of RGB, LiDAR, and DVS data. We evaluate our approach on the Car Learning to Act (CARLA) Leaderboard benchmarks, demonstrating that the inclusion of DVS improves the robustness of the driving agent, achieving a competitive Driving Score of 77.2 and a superior Route Completion of 100%. The results indicate that event-based vision is a promising direction for improving safety and performance in adverse lighting and dynamic conditions.
Chinese Translation
自动驾驶系统在感知复杂环境方面非常依赖强大的传感器融合。传统使用RGB相机和LiDAR的设置在高动态范围场景或高速场景中往往会因运动模糊和延迟而难以表现。动态视觉传感器(Dynamic Vision Sensors,DVS)或事件相机通过以微秒级时间分辨率和高动态范围捕捉异步亮度变化,提供了一种范式转变。本文提出了一种扩展的最先进InterFuser模型架构,集成DVS作为额外的模态,以增强感知的可靠性。我们引入了一种新颖的基于令牌的融合策略,将累积的事件帧纳入InterFuser的基于变换器(transformer)的主干。我们的方法利用了RGB、LiDAR和DVS数据的互补性。我们在Car Learning to Act(CARLA)排行榜基准上评估了我们的方法,结果表明,DVS的引入提高了驾驶代理的鲁棒性,达到竞争性的驾驶评分77.2以及优越的路线完成率100%。结果表明,基于事件的视觉是在不良照明和动态条件下提高安全性和性能的有前途的方向。
cs.CV / 12 / 2605.04358
Intermediate Representations are Strong AI-Generated Image Detectors
中间表示是强大的AI生成图像检测器
Abstract
The rapid advancement in generative AI models has enabled the creation of photorealistic images. At the same time, there are growing concerns about the potential misuse and dangers of generated content, as well as a pressing need for effective AI-generated image detectors. However, current training-based detection techniques are typically computationally costly and can hardly be generalized to unseen data domains, while training-free methods fall short in detection performance. To bridge this gap, we propose a search-based method employing data embedding sensitivity in intermediate layers to detect AI-generated images. Given a set of real and AI-generated images, our method examines the similarity between original image embeddings and perturbed image embeddings, and detects AI-generated images based on the similarity. We examine the proposed method on two comprehensive benchmarks: GenImage and Forensics Small. Our method exhibits improved performance across different datasets compared to both training-free and training-based state-of-the-art methods. On average, our method achieves the largest performance gain on the Forensics Small benchmark by 39.61% compared to the best training-free method and 5.14% compared to the best training-based method in AUROC score.
Chinese Translation
生成AI模型的快速发展使得创建照片真实感图像成为可能。同时,关于生成内容的潜在误用和危险的担忧也在加剧,迫切需要有效的AI生成图像检测器。然而,当前基于训练的检测技术通常计算成本高昂,并且几乎无法泛化到未见的数据领域,而无训练方法在检测性能上则显得不足。为了解决这一问题,我们提出了一种基于搜索的方法,利用中间层中的数据嵌入敏感性来检测AI生成的图像。在给定一组真实和AI生成的图像时,我们的方法通过检验原始图像嵌入与扰动图像嵌入之间的相似性来检测AI生成的图像。我们在两个全面的基准测试上评估了所提方法:GenImage和Forensics Small。与当前最先进的无训练和基于训练的方法相比,我们的方法在不同数据集上表现出了更好的性能。在Forensics Small基准测试中,我们的方法在AUROC分数上相较于最好的无训练方法实现了39.61%的性能提升,相较于最好的基于训练的方法提升了5.14%。
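The detector in the abstract compares the embedding of an image with the embeddings of perturbed copies of it. A minimal sketch of such a sensitivity score follows, assuming Gaussian pixel noise as the perturbation and cosine similarity as the measure. The `embed` callable stands in for an intermediate-layer feature extractor and is hypothetical; the paper's layer search and its decision rule are not reproduced here.

```python
import numpy as np

def embedding_sensitivity_score(image, embed, noise_std=0.05, n_trials=8, seed=0):
    """Mean cosine similarity between the embeddings of an image and its noisy copies.

    embed: callable mapping an image array to a 1-D feature vector (assumed interface).
    """
    rng = np.random.default_rng(seed)
    image = np.asarray(image, dtype=float)
    z = np.asarray(embed(image), dtype=float)
    sims = []
    for _ in range(n_trials):
        noisy = image + rng.normal(0.0, noise_std, size=image.shape)
        zp = np.asarray(embed(noisy), dtype=float)
        sims.append(float(z @ zp / (np.linalg.norm(z) * np.linalg.norm(zp) + 1e-12)))
    return float(np.mean(sims))
```

A detector would then threshold this score. The abstract does not state which direction indicates a generated image, so that would have to be calibrated on the given set of real and AI-generated images.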
cs.CV / 13 / 2605.04397
Optimize-at-Capture: Highly-adaptive Exposure Controlling for In-Vehicle Non-contact Heart-rate Monitoring
优化捕获:用于车辆内非接触心率监测的高度自适应曝光控制
Abstract
Remote photoplethysmography (rPPG) holds great promise for continuous heart-rate monitoring of drivers in intelligent vehicles. However, its performance is severely degraded by highly dynamic illumination changes. A critical yet overlooked factor is the lack of exposure control during video acquisition: most existing systems rely on either fixed exposure settings or the camera's built-in auto-exposure, both of which fail to maintain stable facial brightness under rapidly changing lighting conditions during driving. To address this gap, we propose a highly adaptive exposure control framework that proactively adjusts exposure parameters based on predictive modeling of historical skin reflections. Unlike standard auto-exposure, our method is specifically optimized for rPPG measurement, ensuring the skin region of interest (ROI) remains within the optimal dynamic range for rPPG signal extraction. As an important contribution of this study, we introduce ExpDrive, a public in-vehicle physiological monitoring dataset comprising synchronized facial video and reference ECG from 48 subjects captured under real driving conditions. Extensive experiments demonstrate that our method consistently outperforms fixed exposure and standard auto-exposure strategies. Specifically, it reduces the Mean Absolute Error (MAE) by 6.31 bpm (from 14.1 to 7.79 bpm) and increases the success rate by 32.3 percentage points (from 24.9% to 57.2%; p < 0.001) across challenging driving scenarios. Notably, it clearly improves non-contact heart-rate monitoring in both low-light (rainy) and high-glare (sunny) conditions, validating the efficacy of exposure-aware acquisition design.
Chinese Translation
远程光电容积图(rPPG)在智能车辆中对驾驶员进行连续心率监测方面具有很大潜力。然而,其性能受到高度动态照明变化的严重影响。一个关键但被忽视的因素是缺乏在视频采集过程中的曝光控制——大多数现有系统依赖于固定曝光设置或相机内置自动曝光,这两者都无法在快速变化的驾驶光照条件下保持稳定的面部亮度。为了填补这一空白,我们提出了一种高度自适应的曝光控制框架,该框架基于对历史皮肤反射的预测建模,主动调整曝光参数。与标准自动曝光不同,我们的方法专门针对rPPG测量进行优化,确保感兴趣的皮肤区域(ROI)保持在rPPG信号提取的最佳动态范围内。作为本研究的重要贡献,我们引入了ExpDrive,一个包含48名受试者在真实驾驶条件下捕获的同步面部视频和参考心电图(ECG)的公开车辆内生理监测数据集。大量实验证明,我们的方法在各项具有挑战性的驾驶场景中,其表现始终优于固定曝光和标准自动曝光策略。具体而言,它将平均绝对误差(MAE)减少了6.31 bpm(由14.1降低至7.79 bpm),并显著提高了成功率32.3个百分点(p < 0.001)(由24.9%提高至57.2%)。特别值得注意的是,它在低光(雨天)和高眩光(晴天)条件下显著改善了非接触心率监测的性能,验证了曝光感知采集设计的有效性。
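The two reported metrics can be made concrete with a short sketch. The abstract does not define the success rate; the version below assumes it is the percentage of estimates within a tolerance of the ECG reference, and the 5 bpm tolerance and the sample values are hypothetical. The last two lines only re-derive the reported differences from the reported endpoints.

```python
def mean_absolute_error(estimates, references):
    """Average absolute heart-rate error in bpm."""
    return sum(abs(e - r) for e, r in zip(estimates, references)) / len(references)


def success_rate(estimates, references, tolerance_bpm=5.0):
    """Percentage of estimates within tolerance_bpm of the reference (assumed definition)."""
    hits = sum(1 for e, r in zip(estimates, references) if abs(e - r) <= tolerance_bpm)
    return 100.0 * hits / len(references)


# Hypothetical rPPG estimates and ECG references, in bpm.
estimates = [72, 80, 65, 90]
references = [70, 78, 75, 91]
mae = mean_absolute_error(estimates, references)
rate = success_rate(estimates, references)

# Consistency check of the figures quoted in the abstract.
reported_mae_drop = round(14.1 - 7.79, 2)
reported_success_gain = round(57.2 - 24.9, 1)
```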
cs.CV / 14 / 2605.04405
Detecting Deepfakes via Hamiltonian Dynamics
通过哈密顿动力学检测深伪
Abstract
Driven by the rapid development of generative AI models, deepfake detectors are compelled to undergo periodic recalibration to capture newly developed synthetic artifacts. To break this cycle, we propose a new perspective on deepfake detection: moving from static pattern recognition to dynamical stability analysis. Specifically, our approach is motivated by physics-inspired priors: we hypothesize that natural images, as products of dissipative physical processes, tend to settle near stable, low-energy equilibria. In contrast, generative models optimize for statistical similarity to real images but do not explicitly enforce structural constraints such as geometric smoothness, leaving deepfakes more likely to occupy unstable, high-energy states. To operationalize this, we introduce Hamiltonian Action Anomaly Detection (HAAD), comprising three contributions: (i) We model the image latent manifold as a potential energy surface. Under this hypothesis, real images are expected to produce basin-like low-energy responses, whereas fake images are more likely to induce high-potential, high-gradient responses. (ii) We employ Hamiltonian-inspired dynamics as a stability probe. By releasing latent states from rest, samples near stable regions remain bounded, while high-gradient samples produce larger trajectory responses. (iii) We quantify these dynamic behaviors through two trajectory statistics, i.e., Hamiltonian action and energy dissipation. Extensive experiments show that HAAD outperforms evaluated state-of-the-art baselines on challenging cross-dataset transfer benchmarks, supporting a physics-inspired stability prior for digital forensics.
Chinese Translation
随着生成式AI模型的快速发展,深伪检测器被迫进行周期性的重新校准以捕捉新出现的合成伪影。为了打破这一循环,我们提出了一种关于深伪检测的新视角:从静态模式识别转向动态稳定性分析。具体而言,我们的方法受到物理启发的先验驱动:我们假设自然图像作为耗散物理过程的产物,往往会稳定在低能量的平衡态附近。相比之下,生成模型以与真实图像的统计相似性为优化目标,但并未明确施加诸如几何平滑性等结构约束,使得深伪更可能处于不稳定的高能态。为了将这一思路付诸实现,我们引入了哈密顿作用异常检测(Hamiltonian Action Anomaly Detection, HAAD),其包含三项贡献:(i)我们将图像潜在流形建模为一个势能面。在这一假设下,真实图像预计会产生盆地状的低能响应,而伪造图像则更可能引发高势能、高梯度的响应。(ii)我们采用受哈密顿启发的动力学作为稳定性探针。将潜在状态从静止释放后,接近稳定区域的样本保持有界,而高梯度样本则会产生更大的轨迹响应。(iii)我们通过两项轨迹统计量量化这些动态行为,即哈密顿作用量和能量耗散。大量实验表明,HAAD在具有挑战性的跨数据集迁移基准上优于所评估的最先进基线,支持了一种受物理启发的、用于数字取证的稳定性先验。
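The stability probe in contributions (ii) and (iii) can be sketched on a toy problem. This is only an illustration under strong assumptions: a one-dimensional quadratic potential stands in for the learned energy surface, the integrator is a leapfrog step followed by a hypothetical momentum damping factor, and the action is accumulated as kinetic minus potential energy over the trajectory. None of these choices are confirmed by the abstract.

```python
def potential(q, stiffness=1.0):
    """Toy potential energy surface with a single basin at q = 0."""
    return 0.5 * stiffness * q * q


def grad_potential(q, stiffness=1.0):
    return stiffness * q


def probe(q0, steps=200, dt=0.05, damping=0.1):
    """Release a state from rest at q0 and record trajectory statistics."""
    q, p = q0, 0.0
    initial_energy = potential(q)  # the state starts at rest, so kinetic energy is zero
    action = 0.0
    max_displacement = 0.0
    for _ in range(steps):
        # Leapfrog update: half kick, drift, half kick.
        p -= 0.5 * dt * grad_potential(q)
        q += dt * p
        p -= 0.5 * dt * grad_potential(q)
        # Hypothetical dissipation term acting on the momentum.
        p *= 1.0 - damping * dt
        kinetic = 0.5 * p * p
        action += (kinetic - potential(q)) * dt
        max_displacement = max(max_displacement, abs(q - q0))
    final_energy = 0.5 * p * p + potential(q)
    return {
        "action": action,
        "dissipated": initial_energy - final_energy,
        "max_displacement": max_displacement,
    }


near_basin = probe(0.1)    # starts close to the equilibrium (low gradient)
far_from_basin = probe(2.0)  # starts on a steep, high-energy slope
```

On this toy surface the state released far from the basin dissipates more energy and travels further, which mirrors the qualitative behavior the abstract attributes to high-gradient samples.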
cs.CV / 15 / 2605.04409
UAV as Urban Construction Change Monitor: A New Benchmark and Change Captioning Model
无人机作为城市建设变更监测工具:一种新的基准与变更描述模型
Abstract
Remote Sensing Image Change Captioning (RSICC) aims to generate spatially grounded natural language descriptions of scene evolution from bi-temporal imagery, moving beyond binary change masks toward semantic-level understanding. However, existing methods rely on implicit feature differencing without explicitly modeling structured change semantics, and struggle to reconcile the conflicting representation demands of change detection and caption generation. In addition, current benchmarks provide limited coverage of high-resolution urban construction scenarios. To address these challenges, we propose PTNet, a prototype-guided task-adaptive framework for joint change captioning and detection. PTNet explicitly models structured change semantics through a learnable prototype bank that guides cross-temporal interaction, disentangles task-specific representations via multi-head gating, and injects detection-derived spatial priors into caption generation, enabling coherent semantic correspondence while preserving fine-grained spatial sensitivity. Furthermore, we construct UCCD, a large-scale UAV-based benchmark comprising 9,000 high-resolution image pairs and 45,000 annotated sentences for urban construction monitoring. Extensive experiments on UCCD and WHU-CDC demonstrate that PTNet consistently outperforms existing methods. The dataset and source code are publicly available at https://github.com/G124556/ptnet.
Chinese Translation
遥感图像变更描述(RSICC)旨在通过双时相图像生成对场景演变的空间定位自然语言描述,超越了二元变化掩模,向语义级理解迈进。然而,现有方法依赖于隐式特征差异,而未明确建模结构化的变化语义,且在变化检测与描述生成的矛盾表现需求之间难以调和。此外,当前的基准针对高分辨率城市建设场景的覆盖有限。为了解决这些挑战,我们提出了PTNet,这是一个基于原型引导的任务自适应框架,用于联合变化描述和检测。PTNet通过一个可学习的原型库显式建模结构化变化语义,指导跨时相交互,通过多头门控解开任务特定的表征,并将检测派生的空间先验注入到描述生成中,能够在保持细粒度空间敏感性的同时实现连贯的语义对应。此外,我们构建了UCCD,这是一个大规模无人机(UAV)基础的基准,包含9,000对高分辨率图像和45,000句城市建设监测的标注句子。在UCCD和WHU-CDC上的大量实验表明,PTNet始终优于现有方法。数据集和源代码可在https://github.com/G124556/ptnet获取。
cs.CV / 16 / 2605.04410
Evaluation Cards for XAI Metrics
可解释人工智能(XAI)指标的评估卡片
Abstract
The evaluation of explainable AI (XAI) methods is affected by a lack of standardization. Metrics are inconsistently defined, incompletely reported, and rarely validated against common baselines. In this paper, we identify transparency of evaluation reporting as a central, under-addressed problem. We propose the XAI Evaluation Card, a documentation template analogous to model cards, designed to accompany any study that introduces an XAI evaluation metric. The card covers explicit declaration of target properties, grounding levels, metric assumptions, validation evidence, gaming risks, and known failure cases. We argue that adopting this template as a community norm would reduce evaluation fragmentation, support meta-analysis, and improve accountability in XAI research.
Chinese Translation
可解释人工智能(XAI)方法的评估受到缺乏标准化的影响。指标定义不一致,报告不完整,并且很少针对共同基线进行验证。本文指出,评估报告的透明度是一个核心且未得到充分关注的问题。我们提出了 XAI 评估卡片,这是一种与模型卡片类似的文档模板,旨在随附于任何提出 XAI 评估指标的研究。该卡片涵盖目标属性的明确声明、基础层级、指标假设、验证证据、指标被操纵(gaming)的风险和已知失败案例。我们认为,采纳这一模板作为社区规范将减少评估碎片化,支持元分析,并提高 XAI 研究的问责性。
cs.CV / 17 / 2605.04412
Structured 3D Latents Are Surprisingly Powerful: Unleashing Generalizable Style with 2D Diffusion
结构化3D潜变量的强大力量:通过2D扩散释放可泛化样式
Abstract
3D asset generation plays a pivotal role in fields such as gaming and virtual reality, enabling the rapid synthesis of high-fidelity 3D objects from a single or multiple images. Building on this capability, enabling style-controllable generation naturally emerges as an important and desirable direction. However, existing approaches typically rely on style images that lie within or are similar to the training distribution of 3D generation models. When presented with out-of-distribution (OOD) styles, their performance degrades significantly or even fails. To address this limitation, we introduce DiLAST: 2D Diffusion-based Latent Awakening for 3D Style Transfer. Specifically, we leverage a pretrained 2D diffusion model as a teacher to provide rich and generalizable style priors. By aligning rendered views with the target style under diffusion-based guidance, our method optimizes the structured 3D latent representation for stylization. We observe that this limitation stems not from insufficient model capacity, but from the underutilization of structured 3D latents, which are inherently expressive. Despite being trained on comparatively limited data, 3D generation models can leverage 2D diffusion guidance to steer denoising toward specific directions in latent space, thereby producing diverse, OOD styles. Extensive experiments across diverse data and multiple 3D generation backbones demonstrate the effectiveness and plug-and-play nature of our approach.
Chinese Translation
3D资产生成在游戏和虚拟现实等领域中扮演着关键角色,使得能够从单张或多张图像快速合成高保真3D对象成为可能。在这一能力的基础上,能够进行风格可控的生成自然成为一个重要且令人向往的方向。然而,现有的方法通常依赖于与3D生成模型的训练分布相符或相似的风格图像。当面对分布外(OOD)样式时,其性能显著下降甚至失败。为了解决这一局限性,我们提出了DiLAST:基于2D扩散的3D风格转移潜变量觉醒。具体而言,我们利用一个预训练的2D扩散模型作为教师,提供丰富且可泛化的风格先验。通过在基于扩散的指导下将渲染视图与目标风格对齐,我们的方法优化了用于风格化的结构化3D潜变量表示。我们观察到,这一局限性并非源于模型能力不足,而是由于对结构化3D潜变量的利用不充分,这些潜变量本质上是表达性强的。尽管训练数据相对有限,3D生成模型仍然可以利用2D扩散指导在潜空间中朝特定方向引导去噪,从而生成多样化的OOD样式。针对不同数据和多种3D生成基础模型的广泛实验验证了我们方法的有效性和即插即用特性。
cs.CV / 18 / 2605.04425
Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning
可解释的提示学习中的联合语义标记选择与提示优化
Abstract
Vision-language models such as CLIP achieve strong visual-textual alignment, but often suffer from overfitting and limited interpretability when adapted through continuous prompt learning. While discrete prompt optimization improves interpretability, it usually depends on large external models, leading to high computational costs and limited scalability. In this paper, we propose Interpretable Prompt Learning (IPL), a hybrid framework that alternates between discrete semantic token selection and continuous prompt optimization. Specifically, IPL formulates semantic token selection as an approximate submodular optimization problem, encouraging tokens that are both human-understandable and semantically diverse. It further adopts an alternating optimization strategy to integrate discrete token selection with continuous prompt tuning, improving interpretability while preserving adaptability to downstream tasks. Our framework is plug-and-play, allowing seamless integration with existing prompt learning methods. Extensive experiments on multiple benchmarks show that IPL consistently improves both interpretability and accuracy across five representative prompt learning methods, providing an effective and scalable extension to existing frameworks.
Chinese Translation
视觉-语言模型如CLIP在视觉与文本之间实现了强大的对齐,但在通过连续提示学习进行适应时,常常面临过拟合和有限的可解释性。尽管离散提示优化可以提高可解释性,但通常依赖于大型外部模型,导致高计算成本和有限的可扩展性。本文提出了可解释的提示学习(Interpretable Prompt Learning, IPL),这是一种混合框架,交替进行离散语义标记选择与连续提示优化。具体而言,IPL将语义标记选择表述为一个近似的子模优化问题,促使选择既符合人类理解又具有语义多样性的标记。它进一步采用交替优化策略,将离散标记选择与连续提示调优相结合,在提高可解释性的同时保持对下游任务的适应性。我们的框架具有即插即用的特性,能够与现有的提示学习方法无缝集成。在多个基准上的大量实验表明,IPL在五种代表性的提示学习方法中始终提高了可解释性和准确性,为现有框架提供了一种有效且可扩展的扩展方式。
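The discrete step of IPL is described as an approximate submodular optimization. A minimal sketch of that idea is greedy selection under a coverage objective, a standard monotone submodular function. The objective, the token names, and the concept sets below are hypothetical and are not taken from the paper.

```python
def coverage(selected, concept_map):
    """Number of distinct concepts covered by the selected tokens (submodular)."""
    covered = set()
    for token in selected:
        covered |= concept_map[token]
    return len(covered)


def greedy_select(concept_map, budget):
    """Pick up to `budget` tokens, each time taking the largest marginal gain."""
    selected = []
    for _ in range(budget):
        best_token, best_gain = None, 0
        for token in concept_map:
            if token in selected:
                continue
            gain = coverage(selected + [token], concept_map) - coverage(selected, concept_map)
            if gain > best_gain:  # strict comparison keeps the first token on ties
                best_token, best_gain = token, gain
        if best_token is None:  # no remaining token adds new concepts
            break
        selected.append(best_token)
    return selected


# Hypothetical candidate tokens and the visual concepts each one describes.
candidates = {
    "striped": {"texture", "pattern"},
    "furry": {"texture"},
    "four-legged": {"shape"},
    "patterned": {"pattern"},
}
chosen = greedy_select(candidates, budget=2)
```

The greedy rule favors tokens that add concepts not yet covered, so redundant tokens such as "furry" and "patterned" are skipped once "striped" is chosen. In the alternating scheme the abstract describes, a step like this would be interleaved with continuous prompt tuning.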
cs.CV / 19 / 2605.04435
Ground4D: Spatially-Grounded Feedforward 4D Reconstruction for Unstructured Off-Road Scenes
Ground4D:用于非结构化越野场景的空间基础前馈4D重建
Abstract
Feedforward Gaussian Splatting has recently emerged as an efficient paradigm for 4D reconstruction in autonomous driving. However, in unstructured off-road scenes, its performance degrades due to high-frequency geometry, ego-motion jitter, and increased non-rigid dynamics. These factors introduce conflicting Gaussian observations across timestamps, leading to either over-smoothed renderings or structural artifacts. To address this issue, we propose Ground4D, a spatially-grounded 4D feedforward framework for pose-free off-road reconstruction. The key idea is to resolve temporal conflicts through spatially localized conditioning. Specifically, we introduce voxel-grounded temporal Gaussian aggregation, which partitions the canonical Gaussian space into spatial voxels and performs query-conditioned temporal attention within each voxel. Intra-voxel softmax normalization ensures that temporal selectivity and spatial occupancy become mutually reinforcing rather than conflicting. We furthermore introduce surface normal cues as auxiliary geometric guidance to regularize the geometry of Gaussian primitives. Extensive experiments on ORAD-3D and RELLIS-3D demonstrate that Ground4D consistently outperforms existing feedforward methods in reconstruction quality and generalizes zero-shot to unseen off-road domains. Project page and code: https://github.com/wsnbws/Ground4D.
Chinese Translation
前馈高斯点云(Feedforward Gaussian Splatting)最近已成为自主驾驶中4D重建的高效范式。然而,在非结构化越野场景中,由于高频几何特征、自我运动抖动和增强的非刚性动态,其性能有所下降。这些因素在时间戳之间引入了相互矛盾的高斯观测,导致过度平滑的渲染或结构伪影。为了解决这个问题,我们提出了Ground4D,一个用于无姿态越野重建的空间基础4D前馈框架。其关键思想是通过空间局部条件来解决时间冲突。具体而言,我们引入了体素基础的时间高斯聚合(voxel-grounded temporal Gaussian aggregation),将标准高斯空间划分为空间体素,并在每个体素内执行查询条件下的时间注意力(temporal attention)。体素内的软最大标准化(intra-voxel softmax normalization)确保时间选择性和空间占据相辅相成,而非相互冲突。此外,我们引入了表面法线线索作为辅助几何指导,以规范高斯原始体的几何形状。对ORAD-3D和RELLIS-3D的广泛实验表明,Ground4D在重建质量上始终优于现有的前馈方法,并能够在零样本情况下泛化至未见的越野领域。项目页面和代码:https://github.com/wsnbws/Ground4D。
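Intra-voxel softmax normalization can be illustrated in a few lines. The sketch assumes each Gaussian has a 3D center and a scalar attention score, groups Gaussians by the voxel containing their center, and normalizes the scores within each voxel. The voxel size, the scores, and the function names are hypothetical; how the paper's query-conditioned attention produces its scores is not shown.

```python
import math
from collections import defaultdict


def voxel_index(position, voxel_size):
    """Integer voxel coordinates of a 3D point."""
    return tuple(math.floor(c / voxel_size) for c in position)


def intra_voxel_softmax(positions, scores, voxel_size=1.0):
    """Softmax over scores, computed separately inside each occupied voxel."""
    groups = defaultdict(list)
    for i, pos in enumerate(positions):
        groups[voxel_index(pos, voxel_size)].append(i)

    weights = [0.0] * len(positions)
    for members in groups.values():
        peak = max(scores[i] for i in members)  # subtract the max for numerical stability
        exps = {i: math.exp(scores[i] - peak) for i in members}
        total = sum(exps.values())
        for i in members:
            weights[i] = exps[i] / total
    return weights


# Two Gaussians share voxel (0, 0, 0); the third sits alone in voxel (2, 0, 0).
positions = [(0.1, 0.2, 0.3), (0.4, 0.5, 0.6), (2.5, 0.0, 0.0)]
scores = [0.0, 0.0, 3.0]
weights = intra_voxel_softmax(positions, scores)
```

Because normalization is local, a high score in one voxel cannot suppress Gaussians elsewhere: every occupied voxel keeps a total weight of one, while observations inside a voxel compete with each other.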
cs.CV / 20 / 2605.04439
A cross-modal network for facial expression recognition
一种用于面部表情识别的跨模态网络
Abstract
Deep neural networks enriched with structural information have been widely employed for facial expression recognition. However, these methods often depend on hierarchical information rather than on properties of the face itself. In this paper, we propose CMNet, a cross-modal network that exploits strong biological and structural information for facial expression recognition. CMNet exploits facial symmetry by learning expression information separately from the whole face and from the left and right half faces, extracting complementary facial features. To prevent negative effects from fusing biological and structural information, a salient facial information refinement module extracts salient expression information, improving the stability of the resulting expression classifier. To reduce reliance on unilateral facial features, a half-face alignment optimization mechanism aligns the expression information learned from the left and right half faces. Experimental results demonstrate that CMNet outperforms several recent methods, such as SCN and LAENet-SA, on facial expression recognition. Code is available at https://github.com/hellloxiaotian/CMNet.
Chinese Translation
丰富结构信息的深度神经网络已广泛应用于面部表情识别任务。然而,这些方法通常依赖于层次信息而非面部特征来完成表情识别。本文提出了一种具有强生物和结构信息的跨模态网络,用于面部表情识别(CMNet)。CMNet能够通过整个面部、左右半脸的面部对称分别学习表情信息,以提取互补的面部特征。为了防止生物和结构信息融合的负面影响,设计了一种显著面部信息精炼模块,可以获取显著的面部表情信息,以提高获得的面部表情分类器的稳定性。为减少对单侧面部特征的依赖,设计了一种半脸对齐优化机制,以对齐所获得的左右半脸的表情信息。实验结果表明,CMNet在面部表情识别任务中优于多种新颖方法,如SCN和LAENet-SA。代码可在https://github.com/hellloxiaotian/CMNet获取。
cs.CV / 21 / 2605.04445
LEGO: LoRA-Enabled Generator-Oriented Framework for Synthetic Image Detection
LEGO:基于LoRA的生成器导向合成图像检测框架
Abstract
The rapid advancement of generative technologies has made synthetic images nearly indistinguishable from real ones, thereby creating an urgent need for robust detectors to counter misinformation. However, existing methods mainly rely on universal artifact features that are shared across multiple generators. We observe that as the diversity of generators increases, the overlap of these common features gradually decreases. This severely undermines model generalization. In contrast, focusing only on unique artifacts tends to cause overfitting to specific forgery patterns. To address this challenge, we propose LEGO (LoRA-Enabled Generator-Oriented Framework). The core mechanism of LEGO employs an MLP to modulate multiple LoRA (Low-Rank Adaptation) blocks, each pretrained to capture the unique artifacts of a specific generator, followed by attention-based feature fusion. Unlike conventional methods that seek a single universal solution, LEGO delegates unique artifact extraction to specialized LoRA modules by dividing its training procedure into two stages. Each LoRA module is individually trained on a single-generator dataset to learn generator-specific representations, then MLP and attention layers are trained on mixed datasets to dynamically regulate the contribution of each module. Benefiting from its modular yet robust design, LEGO can be naturally extended by incorporating new LoRA modules for adaptation to newly emerging next-generation datasets, while still achieving substantially better performance than prior SOTA methods with fewer than 30,000 training images, less than 10% of their training data, and only 5 epochs in each training stage.
Chinese Translation
生成技术的快速发展使得合成图像几乎与真实图像难以区分,因此迫切需要强大的检测器以应对错误信息。然而,现有方法主要依赖于跨多个生成器共享的通用伪影特征。我们观察到,随着生成器多样性的增加,这些共同特征的重叠逐渐减少,这严重削弱了模型的泛化能力。相反,仅关注唯一的伪影往往会导致对特定伪造模式的过拟合。为了解决这一挑战,我们提出了LEGO(基于LoRA的生成器导向框架)。LEGO的核心机制采用一个多层感知器(MLP)来调节多个LoRA(低秩适应)模块,每个模块都经过预训练以捕捉特定生成器的独特伪影,随后进行基于注意力的特征融合。与寻求单一通用解决方案的传统方法不同,LEGO通过将其训练过程分为两个阶段,将独特伪影提取的任务委托给专门的LoRA模块。每个LoRA模块在单一生成器的数据集上单独训练,以学习生成器特定的表示,然后在混合数据集上训练MLP和注意力层,以动态调节每个模块的贡献。得益于其模块化而又鲁棒的设计,LEGO可以通过引入新的LoRA模块自然扩展,以适应新出现的下一代数据集,同时在训练图像数量少于30,000、不到其训练数据的10%以及每个训练阶段仅5个周期的情况下,仍能实现显著优于先前SOTA方法的性能。
cs.CV / 22 / 2605.04447
Deep Reprogramming Distillation for Medical Foundation Models
用于医疗基础模型的深度重编程蒸馏
Abstract
Medical foundation models pre-trained on large-scale datasets have shown powerful and versatile performance. However, adapting them to specific medical scenarios remains challenging because of the gap between pre-training and downstream tasks and because of real-world computation and speed constraints. Existing techniques that might address this challenge each suffer from intrinsic limitations. For example, knowledge distillation (KD) assumes that teacher and student models share the same task, training strategy, and model structure family, while prevalent parameter-efficient fine-tuning (PEFT) fails to achieve personalized and lightweight deployment. Even the combination of PEFT and KD struggles to resolve the inconsistencies in model structure and training strategy between teacher and student models, leading to inefficient knowledge transfer. In this study, we propose a novel framework called Deep Reprogramming Distillation (DRD) to tackle this general adaptation challenge. Specifically, DRD introduces a reprogramming module that, on the one hand, overcomes the domain and task discrepancy between pre-training and downstream scenarios and, on the other hand, enables student-friendly, efficient distillation from foundation models to lightweight downstream models. Furthermore, to mitigate variability under different training conditions, we design a centered kernel alignment (CKA) distillation method to promote robust knowledge transfer. Empirical results show that DRD surpasses previous PEFT and KD methods across 18 medical downstream tasks under different foundation models, covering various scenarios including 2D/3D classification and 2D/3D segmentation.
Chinese Translation
在大规模数据集上预训练的医疗基础模型表现出了强大的多功能性能。然而,在将医疗基础模型适应于特定医疗场景时,由于预训练任务与下游任务之间的差异、实际计算和速度限制,难以避免地面临挑战。相关技术在一定程度上可能应对这一挑战,但多多少少存在一些内在局限性。例如,知识蒸馏(Knowledge Distillation, KD)假设教师模型和学生模型共享相同的任务、训练策略和模型结构家族,而普遍的参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)未能实现个性化和轻量部署。即使PEFT和KD的结合仍然难以解决教师模型与学生模型之间的模型结构和训练策略不一致问题,导致知识转移效率低下。在本研究中,我们提出了一种新颖的框架,称为深度重编程蒸馏(Deep Reprogramming Distillation, DRD),以应对普遍的适应挑战。具体而言,DRD引入了新颖的重编程模块,该模块一方面克服了预训练与下游场景之间的领域和任务差异,另一方面构建了从基础模型到轻量下游模型的学生友好型高效蒸馏。此外,为了减轻不同训练条件下的变异性,我们设计了一种中心核对齐(Centered Kernel Alignment, CKA)蒸馏方法,以促进稳健的知识转移。实验证明,DRD在不同基础模型下的18个医疗下游任务中超过了以往的PEFT和KD方法,涵盖了包括2D/3D分类和2D/3D分割等各种场景。
cs.CV / 23 / 2605.04451
RemoteZero: Geospatial Reasoning with Zero Human Annotations
RemoteZero:无需人工标注的地理空间推理
Abstract
Geospatial reasoning requires models to resolve complex spatial semantics and user intent into precise target locations for Earth observation. Recent progress has liberated the reasoning path from manual curation, allowing models to generate their own inference chains. Yet a final dependency remains: they are still supervised by human-annotated ground-truth coordinates. This leaves the reasoning process autonomous, but not its spatial endpoint, and prevents true self-evolution on abundant unlabeled remote sensing data. To break this bottleneck, we introduce RemoteZero, a box-supervision-free framework for geospatial reasoning. RemoteZero is motivated by a simple asymmetry: an MLLM is typically better at verifying whether a region satisfies a query than at directly generating precise coordinates. Leveraging this stronger discriminative ability, RemoteZero replaces geometric supervision with intrinsic semantic verification and enables GRPO training without box annotations. The resulting framework further supports iterative self-evolution, allowing the model to improve from unlabeled remote sensing imagery through its own verification signal. Experiments show that RemoteZero achieves competitive performance against strong supervised methods, demonstrating the potential of self-verifying training for geospatial reasoning localization.
Chinese Translation
地理空间推理要求模型将复杂的空间语义和用户意图解析为精确的地球观测目标位置。近期的进展已经使得推理路径摆脱了手动整理,使得模型能够生成自己的推理链。然而,仍然存在一个最终的依赖性:它们依然需要人工标注的真实坐标进行监督。这使得推理过程变得自主,但其空间端点并未实现自主,从而阻碍了在丰富的无标注遥感数据上实现真正的自我进化。为了解决这一瓶颈,我们提出了RemoteZero,一个无需边界框监督的地理空间推理框架。RemoteZero的动机源于一个简单的非对称性:多模态大语言模型(MLLM)通常在验证一个区域是否满足查询方面比直接生成精确坐标更为出色。借助这种更强的判别能力,RemoteZero用内在语义验证替代几何监督,并实现了无需边界框标注的GRPO训练。该框架进一步支持迭代自我进化,使模型能够通过自身的验证信号从无标注的遥感图像中进行改进。实验表明,RemoteZero在与强监督方法的比较中展现了具有竞争力的表现,证明了自我验证训练在地理空间推理定位中的潜力。
cs.CV / 24 / 2605.04453
StableI2I: Spotting Unintended Changes in Image-to-Image Transition
StableI2I: 识别图像到图像转换中的意外变化
Abstract
In most real-world image-to-image (I2I) scenarios, existing evaluations primarily focus on instruction following and the perceptual quality or aesthetics of the generated images. However, they largely fail to assess whether the output image preserves the semantic correspondence and spatial structure of the input image. To address this limitation, we propose StableI2I, a unified and dynamic evaluation framework that explicitly measures content fidelity and pre-post consistency across a wide range of I2I tasks, including image editing and image restoration, without requiring reference images. In addition, we construct StableI2I-Bench, a benchmark designed to systematically evaluate the accuracy of MLLMs on such fidelity and consistency assessment tasks. Extensive experimental results demonstrate that StableI2I provides accurate, fine-grained, and interpretable evaluations of content fidelity and consistency, with strong correlations to human subjective judgments. Our framework serves as a practical and reliable evaluation tool for diagnosing content consistency and benchmarking model performance in real-world I2I systems.
Chinese Translation
在大多数实际的图像到图像(I2I)场景中,现有的评估主要关注于遵循指令以及生成图像的感知质量或美学。然而,这些评估在很大程度上未能评估输出图像是否保留了输入图像的语义对应关系和空间结构。为了解决这一局限性,我们提出了StableI2I,一个统一且动态的评估框架,能够明确测量内容保真度和前后一致性,适用于广泛的I2I任务,包括图像编辑和图像修复,而无需参考图像。此外,我们构建了StableI2I-Bench,一个旨在系统性评估多模态大语言模型(MLLMs)在此类保真度和一致性评估任务上准确性的基准测试。大量实验证明,StableI2I能够提供准确、细致且可解释的内容保真度和一致性评估,且与人类主观判断具有很强的相关性。我们的框架作为一个实用且可靠的评估工具,用于诊断内容一致性以及在实际I2I系统中对模型性能进行基准测试。
cs.CV / 25 / 2605.04461
Stream-T1: Test-Time Scaling for Streaming Video Generation
Stream-T1: 流媒体视频生成的测试时缩放
Abstract
While Test-Time Scaling (TTS) offers a promising direction to enhance video generation without the surging costs of training, current test-time video generation methods based on diffusion models suffer from exorbitant candidate exploration costs and lack temporal guidance. To address these structural bottlenecks, we propose shifting the focus to streaming video generation. We identify that its chunk-level synthesis and few denoising steps are intrinsically suited for TTS, significantly lowering computational overhead while enabling fine-grained temporal control. Driven by this insight, we introduce Stream-T1, a comprehensive TTS framework tailored to streaming video generation. Specifically, Stream-T1 is composed of three units: (1) Stream-Scaled Noise Propagation, which refines the initial latent noise of the chunk being generated using the noise of previous chunks that proved to be of high quality, establishing temporal dependency and using the historical Gaussian prior to guide the current generation; (2) Stream-Scaled Reward Pruning, which evaluates generated candidates to balance local spatial aesthetics against global temporal coherence by integrating immediate short-term assessments with sliding-window-based long-term evaluations; (3) Stream-Scaled Memory Sinking, which dynamically routes the context evicted from the KV-cache into distinct updating pathways guided by the reward feedback, ensuring that previously generated visual information anchors and guides the subsequent video stream. Evaluated on both 5s and 30s comprehensive video benchmarks, Stream-T1 significantly improves temporal consistency, motion smoothness, and frame-level visual quality.
Chinese Translation
尽管测试时缩放(TTS)为提高视频生成效率提供了一个有前景的方向,而不需要巨大的训练成本,但当前基于扩散模型的测试时视频生成方法面临着庞大的候选探索成本,并缺乏时间指导。为了解决这些结构性瓶颈,我们提出将重点转向流媒体视频生成。我们发现其按块级合成和较少去噪步骤本质上适合于TTS,这显著降低了计算开销,同时实现了细粒度的时间控制。基于这一洞察,我们推出了Stream-T1,这是一种专门为流媒体视频生成量身定制的开创性全面TTS框架。具体而言,Stream-T1由三个单元组成:(1)Stream-Scaled Noise Propagation,积极利用历史证明的高质量前块噪声来精炼生成块的初始潜在噪声,有效建立时间依赖关系,并利用历史高斯先验来引导当前生成;(2)Stream-Scaled Reward Pruning,通过将即时短期评估与基于滑动窗口的长期评估结合,全面评估生成候选,以在局部空间美学与全局时间一致性之间取得最佳平衡;(3)Stream-Scaled Memory Sinking,通过奖励反馈动态地将从KV缓存中驱逐的上下文引导到不同的更新路径,确保先前生成的视觉信息有效锚定并指导后续视频流。经过对5秒和30秒综合视频基准测试的评估,Stream-T1显示出显著的优越性,显著提高了时间一致性、运动平滑性和帧级视觉质量。
cs.CV / 26 / 2605.04475
Information Coordination as a Bridge: A Neuro-Symbolic Architecture for Reliable Autonomous Driving Scene Understanding
信息协调作为桥梁:一种用于可靠自动驾驶场景理解的神经符号架构
Abstract
Reliable autonomous driving requires scene understanding that is semantically consistent across heterogeneous sensors and verifiable at the reasoning stage. However, many recent LLM-driven driving systems attach the language model as a post-processor and force it to reason over redundant or conflicting perception outputs, which can amplify hallucinated entities and unsafe conclusions. This paper proposes InfoCoordiBridge, a BEV-centric neuro-symbolic architecture that inserts an explicit coordination bridge between perception and language reasoning. InfoCoordiBridge comprises (i) a unified multi-agent perception layer that outputs typed structured facts together with modality-focused synopses, (ii) an ICA module that aligns and fuses multi-source outputs into a single SceneSummary, and (iii) an SSRE module that performs SceneSummary-grounded reasoning with verification. Experiments on nuScenes and Waymo show that ICA preserves competitive 3D detection accuracy while substantially improving fusion consistency, reducing redundancy to below 1% and achieving about 98% attribute agreement. On NuScenes-QA and a template-aligned Waymo-QA benchmark, SSRE improves factual grounding and reduces hallucinated entity mentions compared with representative VLM and agentic baselines. Overall, by coordinating multi-sensor outputs into a single conflict-aware SceneSummary before prompting, InfoCoordiBridge prevents redundant and cross-modally inconsistent perception evidence from propagating into high-level reasoning.
Chinese Translation
可靠的自动驾驶需要在异构传感器之间具备语义一致的场景理解,并在推理阶段可验证。然而,许多最近基于大语言模型(LLM)的驾驶系统将语言模型作为后处理器并迫使其对冗余或冲突的感知输出进行推理,这可能会放大虚构实体和不安全结论。本文提出了InfoCoordiBridge,这是一种以鸟瞰视角(BEV)为中心的神经符号架构,在感知和语言推理之间插入了一个明确的协调桥梁。InfoCoordiBridge包含(i)一个统一的多智能体感知层,该层输出带类型的结构化事实和针对不同模态的概述;(ii)一个ICA模块,用于对齐和融合来自多源的输出为一个单一的场景摘要(SceneSummary);(iii)一个SSRE模块,执行基于场景摘要的推理并进行验证。在nuScenes和Waymo上的实验表明,ICA在保持竞争的三维检测准确性同时,显著改善了融合一致性,减少冗余至1%以下,且属性一致性达到了约98%。在NuScenes-QA和模板对齐的Waymo-QA基准上,SSRE相较于具有代表性的视觉语言模型(VLM)和智能体基准,提升了事实的基础性并减少了虚构实体的提及。总的来说,通过在提示之前将多传感器输出协调为一个单一的冲突感知场景摘要,InfoCoordiBridge防止了冗余和跨模态不一致的感知证据传播到高级推理中。
cs.CV / 27 / 2605.04501
Example-Based Object Detection
基于示例的物体检测
Abstract
In recent years, object detection has achieved significant progress, especially in the field of open-vocabulary object detection. Unlike traditional methods that rely on predefined categories, open-vocabulary approaches can detect arbitrary objects based on human-provided prompts. With the advancement of prompt-based detection techniques, models such as SAM3 can even outperform some category-specific detectors trained on particular datasets without requiring additional training on those datasets. However, despite these advancements, false positives and false negatives still occur. In practical engineering applications, persistent misdetections or missed detections of the same object are unacceptable. Yet retraining the model every time such errors occur incurs substantial costs in terms of human effort, computational resources, and time. Therefore, how to leverage existing false positive and false negative samples to prevent such errors from recurring remains a highly challenging and urgent problem. To address this issue, we propose EBOD (Example-Based Object Detection), which integrates a prompt-based detector (SAM3) with robust feature matching modules (DINOv3 and LightGlue). The proposed framework effectively suppresses the repeated occurrence of false positives and false negatives by leveraging previous error examples, without requiring additional model retraining. Code is available at https://github.com/sunzx97/examples_based_object_detection.
Chinese Translation
近年来,物体检测取得了显著进展,尤其是在开放词汇物体检测领域。与传统方法依赖于预定义类别不同,开放词汇方法可以基于人工提供的提示检测任意物体。随着基于提示的检测技术的发展,诸如SAM3等模型甚至可以超越一些在特定数据集上训练的类别特定检测器,而无需对这些数据集进行额外的训练。然而,尽管有这些进展,误报和漏报仍然存在。在实际工程应用中,同一物体的持续误检或漏检是不可接受的。然而,每次出现此类错误时重新训练模型所需的人力、计算资源和时间成本都非常高。因此,如何利用现有的误报和漏报样本防止此类错误的再次发生,仍然是一个高度具有挑战性和紧迫性的问题。为了解决这一问题,我们提出了EBOD(基于示例的物体检测),该方法将基于提示的检测器(SAM3)与强大的特征匹配模块(DINOv3和LightGlue)相结合。该框架有效抑制了误报和漏报的重复出现,利用之前的错误示例,而无需额外的模型重新训练。代码可在 https://github.com/sunzx97/examples_based_object_detection 获得。
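The false-positive half of this idea can be sketched as a filter that drops any detection whose feature closely matches a stored error example. This is a simplified illustration, not the EBOD pipeline: plain cosine similarity on hypothetical feature vectors replaces the DINOv3 and LightGlue matching named in the abstract, the 0.9 threshold is an assumption, and recovering false negatives is not covered.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filter_detections(detections, false_positive_bank, threshold=0.9):
    """Keep detections that do not match any stored false-positive example."""
    kept = []
    for det in detections:
        matches_known_error = any(
            cosine_similarity(det["feature"], example) >= threshold
            for example in false_positive_bank
        )
        if not matches_known_error:
            kept.append(det)
    return kept


# Hypothetical detections with 2D features, plus one recorded false positive.
detections = [
    {"label": "valve", "feature": [1.0, 0.0]},
    {"label": "gauge", "feature": [0.0, 1.0]},
]
false_positive_bank = [[0.99, 0.05]]
kept = filter_detections(detections, false_positive_bank)
```

Adding a new error example to the bank changes the detector's behavior immediately, which is the property the abstract emphasizes: no retraining is needed to stop the same mistake from recurring.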
cs.CV / 28 / 2605.04503
DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning
DiffCap-Bench:一个全面、具有挑战性且稳健的图像差异描述基准
Abstract
Image Difference Captioning (IDC) generates natural language descriptions that precisely identify differences between two images, serving as a key benchmark for fine-grained change perception, cross-modal reasoning, and image editing data construction. However, existing benchmarks lack diversity and compositional complexity, and standard lexical-overlap metrics (e.g., BLEU, METEOR) fail to capture semantic consistency or penalize hallucinations, which together prevent a comprehensive and robust evaluation of multimodal large language models (MLLMs) on IDC. To address these gaps, we introduce DiffCap-Bench, a comprehensive IDC benchmark covering ten distinct difference categories to ensure diversity and compositional complexity. Furthermore, we propose an LLM-as-a-Judge evaluation protocol grounded in human-validated Difference Lists, enabling a robust assessment of models' ability to both capture and describe visual changes. Through extensive evaluation of state-of-the-art MLLMs, we reveal significant performance gaps between proprietary and open-source models, highlight the critical importance of reasoning capability, and identify clear limitations in model scaling. Our framework also demonstrates strong alignment with human expert judgments and strong correlation with downstream image editing data construction quality. These findings establish DiffCap-Bench as both a reliable IDC evaluation framework and a practical predictor of downstream utility. The benchmark and code will be made publicly available to support further research.
Chinese Translation
图像差异描述(Image Difference Captioning,IDC)生成自然语言描述,以精确识别两幅图像之间的差异,作为细粒度变化感知、跨模态推理和图像编辑数据构建的关键基准。然而,现有的基准缺乏多样性和组合复杂性,标准的词汇重叠度量(例如 BLEU、METEOR)未能捕捉语义一致性或惩罚幻觉,这共同阻碍了对多模态大语言模型(Multimodal Large Language Models,MLLMs)在 IDC 任务上的全面和稳健评估。为了解决这些问题,我们推出了 DiffCap-Bench,这是一个全面的 IDC 基准,覆盖十个不同的差异类别,以确保多样性和组合复杂性。此外,我们提出了一种以经人工验证的差异列表(Difference Lists)为依据的 LLM-as-a-Judge 评估协议,从而能够稳健地评估模型捕捉和描述视觉变化的能力。通过对最先进的 MLLMs 进行广泛评估,我们揭示了专有模型与开源模型之间的显著性能差距,强调了推理能力的关键重要性,并识别出模型扩展中的明显局限性。我们的框架还与人类专家判断具有强一致性,并与下游图像编辑数据构建质量表现出强相关性。这些发现使 DiffCap-Bench 成为一个可靠的 IDC 评估框架和下游实用性的有效预测工具。该基准及代码将公开发布,以支持进一步研究。
cs.CV / 29 / 2605.04504
SpecPL: Disentangling Spectral Granularity for Prompt Learning
SpecPL:为提示学习解耦谱粒度
Abstract
Existing prompt learning for VLMs exhibits a modality asymmetry: it predominantly optimizes text tokens while still relying on a frozen visual encoder as a holistic extractor, neglecting the spectral granularity essential for fine-grained discrimination. To bridge this gap, we introduce Disentangling Spectral Granularity for Prompt Learning (SpecPL), which approaches prompt learning from a novel spectral perspective via Counterfactual Granule Supervision. Specifically, we leverage a frozen VAE to decompose visual signals into semantic low-frequency bands and granular high-frequency details. A frozen Visual Semantic Bank anchors text representations to universal low-frequency invariants, mitigating overfitting. Crucially, fine-grained discrimination is driven by counterfactual granule training: by permuting high-frequency signals, we compel the model to explicitly distinguish visual granularity from semantic invariance. Uniquely, SpecPL serves as a universal plug-and-play booster, revitalizing text-oriented baselines like CoOp and MaPLe via visual-side guidance. Experiments on 11 benchmarks demonstrate competitive state-of-the-art performance, achieving a new performance ceiling of 81.51% harmonic-mean accuracy. These results validate that spectral disentanglement with counterfactual supervision effectively bridges the gap in the stability-generalization trade-off. Code is released at https://github.com/Mlrac1e/SpecPL-Prompt-Learning.
Chinese Translation
现有的视觉语言模型(VLMs)提示学习存在模态不对称性,主要优化文本标记,同时仍依赖于冻结的视觉编码器作为整体提取器,忽视了对于细粒度区分至关重要的谱粒度。为了解决这一问题,我们提出了为提示学习解耦谱粒度(SpecPL),它通过反事实粒度监督从新颖的谱的角度进行提示学习。具体而言,我们利用一个冻结的变分自编码器(VAE)将视觉信号分解为语义低频带和粒状高频细节。冻结的视觉语义库将文本表示锚定于普遍的低频不变性,从而减轻了过拟合。关键是,细粒度区分是通过反事实粒度训练驱动的:通过对高频信号进行置换,我们迫使模型明确区分视觉粒度与语义不变性。SpecPL独特地作为一种通用的即插即用增强器,利用视觉侧指导振兴了以文本为导向的基线,如CoOp和MaPLe。在11个基准测试上的实验展示了具有竞争力的最先进的性能,达到了81.51%的调和均值准确率的新性能上限。这些结果验证了结合反事实监督的谱解耦有效地弥合了稳定性与泛化之间的权衡。代码已发布在 https://github.com/Mlrac1e/SpecPL-Prompt-Learning。
cs.CV / 30 / 2605.04506
Ilov3Splat: Instance-Level Open-Vocabulary 3D Scene Understanding in Gaussian Splatting
Ilov3Splat:基于高斯撒点的实例级开放词汇3D场景理解
Abstract
We introduce Ilov3Splat, a novel framework for instance-level open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3D-GS). Most prior work depends on 2D rendering-based matching or point-level semantic association, which undermines cross-view consistency, lacks coherent instance-level reasoning, and limits precision in downstream 3D tasks. To address these limitations, our method jointly optimizes scene geometry and semantic representations by augmenting Gaussian splats with view-consistent feature fields. Specifically, we leverage multi-resolution hash embedding to efficiently encode language-aligned CLIP features, enabling dense and coherent language grounding in 3D space. We further train an instance feature field using contrastive loss over SAM masks, supporting fine-grained object distinction across views. At inference time, CLIP-encoded queries are matched against the learned features, followed by two-stage 3D clustering to retrieve relevant Gaussian groups. This enables our framework to identify arbitrary objects in 3D scenes based on natural language descriptions, without requiring category supervision or manual annotations. Experiments on standard benchmarks demonstrate that Ilov3Splat outperforms prior open-vocabulary 3D-GS methods in both object selection and instance segmentation, offering a flexible and accurate solution for language-driven 3D scene understanding. Project page: https://csiro-robotics.github.io/Ilov3Splat.
Chinese Translation
我们提出了Ilov3Splat,这是一个基于3D高斯撒点(3D-GS)的实例级开放词汇3D场景理解的新框架。大多数先前的研究依赖于基于2D渲染的匹配或点级语义关联,这降低了视角间的一致性,缺乏连贯的实例级推理,并限制了下游3D任务的精度。为了解决这些局限,我们的方法通过为高斯撒点增加视角一致的特征场,联合优化场景几何和语义表示。具体而言,我们利用多分辨率哈希嵌入高效地编码与语言对齐的CLIP特征,从而实现3D空间中密集和连贯的语言对接。我们进一步在SAM掩码上使用对比损失训练实例特征场,以支持跨视图的细粒度对象区分。在推理时,CLIP编码的查询与学习到的特征匹配,然后通过两阶段3D聚类检索相关的高斯组。这使得我们的框架能够基于自然语言描述在3D场景中识别任意对象,而无需类别监督或人工标注。标准基准上的实验表明,Ilov3Splat在对象选择和实例分割方面均优于之前的开放词汇3D-GS方法,为基于语言的3D场景理解提供了一种灵活而准确的解决方案。项目页面:https://csiro-robotics.github.io/Ilov3Splat
cs.CV / 31 / 2605.04515
From Priors to Perception: Grounding Video-LLMs in Physical Reality
从先验到感知:将视频大语言模型扎根于物理现实
Abstract
While Video Large Language Models (Video-LLMs) excel in general understanding, they exhibit systematic deficits in fine-grained physical reasoning. Existing interventions not only suffer from limited generalization but fundamentally conflate generative artifacts with genuine physical fallacies. Furthermore, we find that models fail systematically not only in anti-physics anomalies but also in counter-intuitive scenarios where visual facts contradict statistical expectations. Accordingly, we propose the Unified Attribution Theory: this dual failure stems not from perception deficiency, but from Semantic Prior Dominance -- the reasoning mechanism is deeply hijacked by internal narrative scripts. To address this, we construct the Programmatic Adversarial Curriculum (PACC), the first high-fidelity adversarial video dataset synthesized based on physical laws, thoroughly decoupling visual artifacts from logical errors. Concurrently, we design the Visual-Anchored Reasoning Chain (VARC) to force models to explicitly ground their judgments in low-level visual facts prior to logical adjudication. Experiments demonstrate that without invasive architectural modifications, standard LoRA fine-tuning with the PACC curriculum effectively neutralizes prior interference in state-of-the-art (SOTA) models, yielding a substantial leap in physical reasoning capabilities.
Chinese Translation
尽管视频大语言模型(Video-LLMs)在通用理解方面表现出色,但它们在细粒度的物理推理上却存在系统性缺陷。现有的干预措施不仅泛化能力有限,而且根本上将生成性伪影与真实的物理谬误混淆。此外,我们发现模型在反物理异常现象以及视觉事实与统计预期相悖的反直觉情境中,系统性地失败。因此,我们提出了统一归因理论(Unified Attribution Theory):这种双重失败并非来源于感知缺陷,而是由于语义先验的主导性——推理机制被内部叙事脚本深度劫持。为此,我们构建了程序性对抗课程(Programmatic Adversarial Curriculum, PACC),这是第一个基于物理法则合成的高保真对抗视频数据集,彻底解耦视觉伪影与逻辑错误。同时,我们设计了视觉锚定推理链(Visual-Anchored Reasoning Chain, VARC),迫使模型在逻辑判决之前,明确基于低层视觉事实来支撑其判断。实验表明,在不进行侵入性架构修改的情况下,使用PACC课程的标准LoRA微调有效中和了现代尖端(SOTA)模型中的先验干扰,使物理推理能力获得显著提升。
cs.CV / 32 / 2605.04518
DALight-3D: A Lightweight 3D U-Net for Brain Tumor Segmentation from Multi-Modal MRI
DALight-3D:一种轻量级的3D U-Net用于多模态MRI脑肿瘤分割
Abstract
Automatic brain tumor segmentation from multi-modal MRI remains challenging because volumetric models often incur substantial computational cost. This paper presents DALight-3D, a compact 3D U-Net variant that combines depthwise separable 3D convolutions, identifier-conditioned normalization, cross-slice attention, and adaptive skip fusion. The method is evaluated on the Medical Segmentation Decathlon Task01 BrainTumour benchmark under matched optimization settings against standard 3D U-Net, Attention U-Net, Residual 3D U-Net, and V-Net baselines. In the reported 50-epoch comparison, DALight-3D achieves a mean Dice of 0.727 with 2.22M parameters, compared with 0.710 Dice and 3.20M parameters for Residual 3D U-Net. Component-wise ablations show consistent performance degradation when SepConv, identifier-conditioned normalization, CSA, or SSFB is removed. These results indicate that DALight-3D offers a favorable accuracy-efficiency trade-off within the present benchmark setting.
Chinese Translation
从多模态MRI中自动进行脑肿瘤分割仍然面临挑战,因为体积模型通常会产生显著的计算开销。本文提出了DALight-3D,一种紧凑的3D U-Net变体,结合了深度可分离3D卷积、标识符条件归一化、跨切片注意力和自适应跳跃融合。该方法在与标准3D U-Net、注意力U-Net、残差3D U-Net和V-Net基线相匹配的优化设置下,在医学分割十项全能任务01脑肿瘤基准上进行了评估。在50轮的比较中,DALight-3D以2.22M参数实现了0.727的平均Dice,而残差3D U-Net则为0.710 Dice和3.20M参数。组件逐项消融实验表明,当移除SepConv、标识符条件归一化、CSA或SSFB时性能持续下降。这些结果表明,DALight-3D在当前基准设置中提供了良好的准确性与效率的权衡。
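The parameter savings that depthwise separable 3D convolutions offer can be checked with simple arithmetic. The sketch below counts weights for one standard 3D convolution and for its depthwise-plus-pointwise replacement. The channel widths (32 in, 64 out) and the 3x3x3 kernel are illustrative assumptions, not values taken from the DALight-3D paper.

```python
def conv3d_params(c_in, c_out, k, bias=True):
    """Weights of a standard 3D convolution with a k x k x k kernel."""
    return c_in * c_out * k ** 3 + (c_out if bias else 0)


def separable_conv3d_params(c_in, c_out, k, bias=True):
    """Weights of a depthwise 3D conv followed by a 1x1x1 pointwise conv."""
    depthwise = c_in * k ** 3 + (c_in if bias else 0)   # one filter per input channel
    pointwise = c_in * c_out + (c_out if bias else 0)   # channel mixing only
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
standard = conv3d_params(32, 64, 3)             # 55360
separable = separable_conv3d_params(32, 64, 3)  # 3008
print(standard, separable, round(standard / separable, 1))  # 55360 3008 18.4
```

Under these assumed sizes the separable layer uses roughly 18 times fewer weights. The whole-network ratio reported in the abstract (2.22M versus 3.20M parameters) is much smaller, which is expected because not every layer is a convolution of this kind.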
cs.CV / 33 / 2605.04524
High-Fidelity Single-Image Head Modeling with Industry-Grade Topology
高保真单图像头部建模与工业级拓扑
Abstract
We present a single-image head mesh reconstruction framework that addresses the longstanding challenge of simultaneously preserving facial identity and producing industry-grade topology. Our framework adopts a coarse-to-fine optimization pipeline that refines a rigged template across three stages -- rig, joint, and vertex -- achieving stable convergence and consistent topology. To mitigate the ill-posed nature of single-image 3D face reconstruction and ensure identity preservation, we employ a normal consistency objective jointly with landmark alignment. To further preserve local surface structure and enforce topological regularity, we introduce geometry-aware constraints based on Gaussian curvature and conformal consistency, along with auxiliary regularizations that correct fine artifacts such as lip seams and eyelid discontinuities. Our hierarchical optimization with geometry-aware regularization yields meshes with semantically meaningful edge flow and industry-grade topology. After geometry reconstruction, we extract UV-space texture and normal maps to preserve appearance details for visualization and downstream use. In a user study with 22 professional technical artists, our results were assessed as approaching industry-grade usability, and 95% of participants ranked our method as the top-performing approach, underscoring its effectiveness for real-world digital human production.
Chinese Translation
我们提出了一种单图像头部网格重建框架,解决了在保持面部身份的同时生成工业级拓扑的长期挑战。我们的框架采用粗到精的优化流程,通过三个阶段——骨骼、关节和顶点,逐步细化一个有骨骼模板,实现稳定收敛和一致的拓扑。为了缓解单图像三维面部重建的病态特性并确保身份保持,我们结合了法线一致性目标和地标对齐。为了进一步保留局部表面结构并强制拓扑规整性,我们引入了基于高斯曲率和共形一致性的几何意识约束,以及辅助正则化来修正细微的伪影,如唇缝和眼睑不连续性。我们通过几何意识正则化的层次优化产生了具有语义意义的边流和工业级拓扑的网格。在几何重建后,我们提取了UV空间纹理和法线贴图,以保存用于可视化和后续应用的外观细节。在与22位专业技术艺术家的用户研究中,我们的结果被评估为接近工业级可用性,95%的参与者将我们的方法评定为表现最佳的方法,突显了其在现实数字人类制作中的有效性。
cs.CV / 34 / 2605.04527
Velox: Learning Representations of 4D Geometry and Appearance
Velox:学习四维几何和外观的表示
Abstract
We introduce a framework for learning latent representations of 4D objects which are descriptive, faithfully capturing object geometry and appearance; compressive, aiding in downstream efficiency; and accessible, requiring minimal input, i.e., an unstructured dynamic point cloud, to construct. Specifically, Velox trains an encoder to compress spatiotemporal color point clouds into a set of dynamic shape tokens. These tokens are supervised using two complementary decoders: a 4D surface decoder, which models the time-varying surface distribution capturing the geometry; and a Gaussian decoder, which maps the tokens to 3D Gaussians, helping learn appearance. To demonstrate the utility of our representation, we evaluate it across three downstream tasks -- video-to-4D generation, 3D tracking, and cloth simulation via image-to-4D generation -- and observe strong performances in all settings.
Chinese Translation
我们提出了一个框架,用于学习四维物体的潜在表示,这些表示具有描述性,忠实于捕捉物体的几何和外观;具有压缩性,有助于下游效率;并且易于获取,仅需最少输入,即一个非结构化的动态点云,即可构建。具体而言,Velox训练一个编码器,将时空彩色点云压缩为一组动态形状标记。这些标记通过两个互补解码器进行监督:一个四维表面解码器,建模捕捉几何体的时变表面分布;以及一个高斯解码器,将标记映射到三维高斯分布,有助于学习外观。为了展示我们表示的实用性,我们在三个下游任务上进行评估——视频到四维生成、三维跟踪,以及通过图像到四维生成的布料仿真——并在所有设置中观察到强劲的表现。
cs.CV / 35 / 2605.04531
Reward-Guided Semantic Evolution for Test-time Adaptive Object Detection
基于奖励引导的语义演变用于测试时自适应目标检测
Abstract
Open-vocabulary object detection with vision-language models (VLMs) such as Grounding DINO suffers from performance degradation under test-time distribution shifts, primarily due to semantic misalignment between text embeddings and shifted visual embeddings of region proposals. While recent test-time adaptive object detection methods for VLM-based detectors either rely on costly backpropagation or bypass semantic misalignment via external memory, none directly and efficiently aligns text and vision in a training-free manner. To address this, we propose Reward-Guided Semantic Evolution (RGSE), a training-free framework that directly refines the text embeddings at test time. Inspired by evolutionary search, RGSE treats text embedding adaptation as a semantic search process: it perturbs text embeddings as candidate variants, evaluates them via cosine similarity with current and historical high-confidence visual proposals as a reward signal, and fuses them into a refined embedding through reward-weighted averaging. Without any backpropagation, RGSE achieves state-of-the-art performance across multiple detection benchmarks while adding minimal computational overhead. Our code will be open-sourced upon publication.
Chinese Translation
使用视觉-语言模型(VLMs)如 Grounding DINO 进行开放词汇目标检测在测试时分布变化下的表现下降,主要是由于文本嵌入与区域提议的视觉嵌入之间的语义不对齐。尽管近期的基于 VLM 的测试时自适应目标检测方法要么依赖成本高昂的反向传播,要么通过外部记忆绕过语义不对齐,但没有任何方法能以无训练的方式直接高效地对齐文本和视觉。为了解决这个问题,我们提出了一种奖励引导的语义演变(Reward-Guided Semantic Evolution, RGSE)框架,该框架在测试时直接细化文本嵌入。受到进化搜索的启发,RGSE 将文本嵌入适应视为一种语义搜索过程:它将文本嵌入扰动为候选变体,通过与当前和历史上高度置信的视觉提议的余弦相似度作为奖励信号来评估它们,并通过奖励加权平均将其融合为细化的嵌入。在没有任何反向传播的情况下,RGSE 在多个检测基准上达到了最先进的性能,同时增加的计算开销极其有限。我们的代码将于出版时开源。
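The perturb, score, and fuse loop described in the RGSE abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian perturbation, the softmax weighting with a temperature, and every name and default value (`refine_text_embedding`, `sigma`, `temperature`, `num_candidates`) are hypothetical choices made for the example.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def refine_text_embedding(text_emb, visual_embs, num_candidates=16,
                          sigma=0.05, temperature=0.1, seed=0):
    """Gradient-free refinement: perturb, score by cosine reward, fuse."""
    rng = random.Random(seed)
    # Candidate pool: the original embedding plus Gaussian perturbations (assumed noise model).
    candidates = [list(text_emb)]
    for _ in range(num_candidates):
        candidates.append([x + rng.gauss(0.0, sigma) for x in text_emb])
    # Reward: mean cosine similarity to the high-confidence visual embeddings.
    rewards = [sum(cosine(c, v) for v in visual_embs) / len(visual_embs)
               for c in candidates]
    # Softmax over rewards gives the fusion weights (assumed weighting scheme).
    top = max(rewards)
    exps = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Reward-weighted average of all candidates.
    refined = [sum(w * c[i] for w, c in zip(weights, candidates))
               for i in range(len(text_emb))]
    return refined, rewards, weights


# Toy usage with 3-dimensional embeddings (made-up numbers).
refined, rewards, weights = refine_text_embedding(
    [1.0, 0.0, 0.0], [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]])
print(refined)
```

No gradients are computed anywhere in this loop, which is the property the abstract emphasizes. A real system would operate on the detector's actual text and region embeddings and maintain the pool of historical proposals across frames.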
cs.CV / 36 / 2605.04541
Angle-I2P: Angle-Consistent-Aware Hierarchical Attention for Cross-Modality Outlier Rejection
Angle-I2P:基于角度一致性的层次注意力用于跨模态异常值拒绝
Abstract
Image-to-point-cloud registration (I2P) is a fundamental task in robotic applications such as manipulation, grasping, and localization. Existing deep learning-based I2P methods seek to align image and point cloud features in a learned representation space to establish correspondences, and have achieved promising results. However, when the inlier ratio of the initial matching pairs is low, conventional Perspective-n-Points (PnP) methods may struggle to achieve accurate results. To address this limitation, we propose Angle-I2P, an outlier rejection network that leverages angle-consistent geometric constraints and hierarchical attention. First, we design a scale-invariant, cross-modality geometric constraint based on angular consistency. This explicit geometric constraint guides the model in distinguishing inliers from outliers. Furthermore, we propose a global-to-local hierarchical attention mechanism that effectively filters out geometrically inconsistent matches under rigid transformation, thereby improving the Inlier Ratio (IR) and Registration Recall (RR). Experimental results demonstrate that our method achieves state-of-the-art performance on the 7Scenes, RGBD Scenes V2, and a self-collected dataset, with consistent improvements across all benchmarks.
Chinese Translation
图像到点云配准(I2P)是机器人应用中一项基本任务,如操控、抓取和定位。现有的基于深度学习的I2P方法试图在学习的表征空间中对齐图像和点云特征以建立对应关系,并取得了良好的效果。然而,当初始匹配对的内点比率较低时,传统的透视n点(PnP)方法可能难以获得准确的结果。为了解决这一限制,我们提出了Angle-I2P,这是一种利用角度一致性几何约束和层次注意力的异常值拒绝网络。首先,我们设计了一种基于角度一致性的尺度不变的跨模态几何约束。这个显式的几何约束指导模型区分内点和外点。此外,我们提出了一种从全局到局部的层次注意力机制,有效过滤刚性变换下几何不一致的匹配,从而提高内点比率(IR)和配准召回率(RR)。实验结果表明,我们的方法在7Scenes、RGBD Scenes V2以及一个自收集的数据集上实现了最先进的性能,并在所有基准测试中均表现出一致的提升。
cs.CV / 37 / 2605.04554
InterMesh: Explicit Interaction-Aware End-to-End Multi-Person Human Mesh Recovery
InterMesh:显式交互意识的端到端多人类网格恢复
Abstract
Humans constantly interact with their surroundings. Existing end-to-end multi-person human mesh recovery methods, typically based on the DETR framework, capture inter-human relationships through self-attention across all human queries. However, these approaches model interactions only implicitly and lack explicit reasoning about how humans interact with objects and with each other. In this paper, we propose InterMesh, a simple yet effective framework that explicitly incorporates human-environment interaction information into the human mesh recovery pipeline. By leveraging a human-object interaction detector, InterMesh enriches query representations with structured interaction semantics, enabling more accurate pose and shape estimation. We design two lightweight modules, the Contextual Interaction Encoder and the Interaction-Guided Refiner, to integrate these features into existing HMR architectures with minimal overhead. We validate our approach through extensive experiments on the 3DPW, MuPoTS, CMU Panoptic, Hi4D, and CHI3D datasets, demonstrating remarkable improvements over state-of-the-art methods. Notably, InterMesh reduces MPJPE by 9.9% on CMU Panoptic and 8.2% on Hi4D, highlighting its effectiveness in scenarios with complex human-object and inter-human interactions.
Chinese Translation
人类不断与周围环境进行互动。现有的基于DETR框架的端到端多人类网格恢复方法通过对所有人类查询进行自注意力机制来捕捉人际关系。然而,这些方法仅隐式地建模了互动,缺乏对人类如何与物体及彼此之间互动的显式推理。本文提出了InterMesh,一个简单但有效的框架,显式地将人与环境的互动信息纳入人类网格恢复流程。通过利用人-物互动检测器,InterMesh使查询表示丰富了结构化的互动语义,从而实现更准确的姿态和形状估计。我们设计了轻量级模块——上下文互动编码器(Contextual Interaction Encoder)和互动引导精修器(Interaction-Guided Refiner),以最小的开销将这些特征集成到现有的HMR架构中。我们通过在3DPW、MuPoTS、CMU Panoptic、Hi4D和CHI3D数据集上的广泛实验验证了我们的方法,显示出对最先进方法的显著改进。值得注意的是,InterMesh在CMU Panoptic上将MPJPE降低了9.9%,在Hi4D上降低了8.2%,突显了其在复杂人-物及人际互动场景下的有效性。
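For readers unfamiliar with the metric quoted above, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joints. The sketch below is a bare version of that definition. Evaluation protocols usually add root-joint alignment and a fixed joint set, which are omitted here, and the baseline number in the usage lines is made up purely to show how a relative reduction such as 9.9% is computed.

```python
import math


def mpjpe(pred_joints, gt_joints):
    """Mean Euclidean distance between matching 3D joints (same units as input)."""
    if len(pred_joints) != len(gt_joints):
        raise ValueError("joint lists must have the same length")
    total = sum(math.dist(p, g) for p, g in zip(pred_joints, gt_joints))
    return total / len(pred_joints)


def relative_reduction(baseline_error, new_error):
    """Fractional error reduction relative to a baseline."""
    return (baseline_error - new_error) / baseline_error


# Two joints with errors of 3 and 4 units give an MPJPE of 3.5.
print(mpjpe([(0, 0, 0), (1, 0, 0)], [(0, 0, 3), (1, 4, 0)]))  # 3.5
# Hypothetical baseline of 100.0 mm improved to 90.1 mm.
print(relative_reduction(100.0, 90.1))
```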
cs.CV / 38 / 2605.04557
Efficient Geometry-Controlled High-Resolution Satellite Image Synthesis
高效的几何控制高分辨率卫星图像合成
Abstract
High-resolution satellite images are often scarce and costly, especially for remote areas or infrequent events. This shortage hampers the development and testing of machine learning models for land-cover classification, change detection, and disaster monitoring. In this paper, we tackle the problem of geometry-controlled high-resolution satellite image synthesis by adding control to existing pre-trained diffusion models. We propose a simple yet efficient method for controlling the synthesis process that leverages only skip-connection features through windowed cross-attention modules. We compare against several previously established control techniques and find that our method achieves comparable performance while aligning better with the geometry control map. We also discuss the limitations of current evaluation approaches, underscoring the need for a consistent alignment assessment.
Chinese Translation
高分辨率卫星图像往往稀缺且昂贵,尤其是在偏远地区或不频繁发生的事件中。这一短缺阻碍了针对土地覆盖分类、变化检测和灾害监测的机器学习模型的开发与测试。在本论文中,我们通过对现有预训练扩散模型进行几何控制,解决高分辨率卫星图像合成的问题。我们提出了一种简单但高效的方法,通过使用窗口跨注意力模块,仅利用跳跃连接特征来控制合成过程。我们比较了几种以前建立的控制技术,结果表明,我们的方法在性能上具有可比性,同时在几何控制图的对齐方面表现更好。我们还讨论了当前评估方法的局限性,强调了一致性对齐评估的必要性。
cs.CV / 39 / 2605.04560
SAMIC: A Lightweight Semantic-Aware Mamba for Efficient Perceptual Image Compression
SAMIC:一种轻量级的语义感知 Mamba,用于高效的感知图像压缩
Abstract
Perceptual image compression focuses on preserving high visual quality under low-bitrate constraints. Most existing approaches to perceptual compression leverage the strong generative capabilities of generative adversarial networks or diffusion models, at the cost of substantial model complexity. To this end, we present an efficient perceptual image compression method that exploits the long-range modeling capability and linear computational complexity of state space models, with a particular focus on Mamba. Unlike existing methods that rely on an inherently fixed scanning order and consequently impair semantic continuity and spatial correlation, we develop a semantic-aware Mamba block (SAMB) to enable scanning guided by dynamically clustered semantic features, thereby alleviating the strict causality constraints and long-range information decay inherent to Mamba. Inspired by singular value decomposition, we design an SVD-inspired redundancy reduction module (SVD-RRM) that performs a low-rank approximation on the latent features by introducing a learnable soft threshold, leading to channel-wise redundancy information reduction. The proposed SAMB is integrated into both the encoder and decoder of the compression framework, whereas the SVD-RRM is incorporated only in the encoder. Extensive experiments demonstrate that our method performs favorably against state-of-the-art approaches in terms of rate-distortion-perception tradeoff and model complexity. The source code and pretrained models will be available at https://github.com/Jasmine-aiq/SAMIC.
Chinese Translation
感知图像压缩旨在在低比特率约束下保持高视觉质量。现有的大多数感知压缩方法都利用生成对抗网络或扩散模型强大的生成能力,但代价是模型复杂性显著增加。为此,我们提出了一种高效的感知图像压缩方法,该方法利用状态空间模型的远程建模能力和线性计算复杂性,特别关注 Mamba。与依赖固有固定扫描顺序从而损害语义连续性和空间相关性的现有方法不同,我们开发了一种语义感知的 Mamba 块(SAMB),使得扫描能够根据动态聚类的语义特征进行指导,从而缓解了 Mamba 固有的严格因果约束和远程信息衰减。受奇异值分解的启发,我们设计了一种受 SVD 启发的冗余减少模块(SVD-RRM),通过引入可学习的软阈值对潜在特征进行低秩近似,从而减少通道级冗余信息。所提出的 SAMB 被集成到压缩框架的编码器和解码器中,而 SVD-RRM 仅在编码器中使用。大量实验表明,我们的方法在速率-失真-感知权衡和模型复杂性方面相较于最新的研究方法表现优越。源代码和预训练模型将可在 https://github.com/Jasmine-aiq/SAMIC 获取。
cs.CV / 40 / 2605.04566
Open-Source Image Editing Models Are Zero-Shot Vision Learners
开源图像编辑模型是零样本视觉学习者
Abstract
Recent studies have shown that large generative models can solve vision tasks they were not explicitly trained for. However, existing evidence relies on closed-source models (Veo 3, Nano Banana Pro) or requires task-specific instruction tuning, leaving open whether publicly available image-editing models possess zero-shot vision abilities out of the box. We conduct a systematic evaluation of three open-source image-editing models -- Qwen-Image-Edit, FireRed-Image-Edit, and LongCat-Image-Edit -- on dense visual prediction tasks without any fine-tuning. We benchmark monocular depth estimation on NYUv2 and DIODE, surface normal estimation on NYUv2, and semantic segmentation on Cityscapes, covering both geometric and semantic scene understanding. Results show that open-source image-editing models exhibit non-trivial zero-shot visual understanding. On NYUv2 surface normals, FireRed-Image-Edit achieves a mean angular error of $17.69^\circ$, surpassing the fine-tuned Marigold ($20.86^\circ$) and matching the instruction-tuned Vision Banana ($17.78^\circ$) without any task-specific training. On NYUv2 depth estimation, LongCat-Image-Edit obtains $\delta_1 = 0.822$ with affine alignment, and Qwen-Image-Edit leads on DIODE Indoor ($\delta_1 = 0.868$). On Cityscapes semantic segmentation, Qwen-Image-Edit reaches 25.7 mIoU at the 19-class level and 49.5 mIoU at a coarser 7-category level. By comparing three independently trained editors, we test whether zero-shot vision ability is an emergent property of image-editing pretraining rather than a model-specific artifact. Code, evaluation scripts, and all results are publicly released to serve as a reproducible baseline for future work.
Chinese Translation
近期研究表明,大型生成模型能够解决其未经过明确训练的视觉任务。然而,现有证据依赖于封闭源模型(如 Veo 3 和 Nano Banana Pro),或需要特定任务的指令微调,因此尚不清楚公开可用的图像编辑模型是否具有开箱即用的零样本视觉能力。我们在无需任何微调的情况下,对三种开源图像编辑模型(Qwen-Image-Edit、FireRed-Image-Edit 和 LongCat-Image-Edit)在密集视觉预测任务上进行了系统评估。我们在 NYUv2 和 DIODE 上基准测试单目深度估计,在 NYUv2 上进行表面法向估计,以及在 Cityscapes 上进行语义分割,涵盖几何和语义场景理解。结果表明,开源图像编辑模型展示出非平凡的零样本视觉理解。在 NYUv2 的表面法向估计中,FireRed-Image-Edit 达到平均角度误差为 $17.69^\circ$,超过了微调后的 Marigold($20.86^\circ$)并与指令调优的 Vision Banana($17.78^\circ$)相匹配,且无需任何特定任务训练。在 NYUv2 深度估计中,LongCat-Image-Edit 通过仿射对齐获得了 $\delta_1 = 0.822$,而 Qwen-Image-Edit 在 DIODE Indoor 上的表现领先($\delta_1 = 0.868$)。在 Cityscapes 的语义分割中,Qwen-Image-Edit 在 19 类别水平上达到 25.7 mIoU,在较粗的 7 类别水平上达到 49.5 mIoU。通过比较三种独立训练的编辑器,我们测试了零样本视觉能力是否是图像编辑预训练的突现属性,而不是模型特定的伪像。代码、评估脚本及所有结果已公开发布,以作为未来工作的可重复基准。
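The two geometric metrics quoted in this abstract can be written down compactly. The sketch below fits a least-squares scale and shift before computing the $\delta_1$ accuracy (the fraction of pixels with $\max(\hat d / d, d / \hat d) < 1.25$) and computes the mean angular error between normal vectors. It assumes the alignment is done directly in depth space over flat lists of valid pixels; the paper's exact protocol (for example, alignment in inverse depth, or masking rules) is not stated in the abstract, and all function names are hypothetical.

```python
import math


def affine_align(pred, gt):
    """Least-squares scale s and shift t so that s * pred + t best matches gt."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)  # must be non-zero
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    s = cov / var_p
    t = mean_g - s * mean_p
    return [s * p + t for p in pred]


def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels whose depth ratio to ground truth is below the threshold."""
    hits = sum(1 for p, g in zip(pred, gt)
               if p > 0 and g > 0 and max(p / g, g / p) < threshold)
    return hits / len(gt)


def mean_angular_error_deg(pred_normals, gt_normals):
    """Mean angle in degrees between predicted and ground-truth normals."""
    total = 0.0
    for a, b in zip(pred_normals, gt_normals):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        cos_angle = max(-1.0, min(1.0, dot / (na * nb)))  # guard against rounding
        total += math.degrees(math.acos(cos_angle))
    return total / len(gt_normals)


# A prediction that is correct up to scale and shift scores delta1 = 1.0 after alignment.
pred = [0.1, 0.2, 0.3, 0.4]
gt = [2.0 * p + 1.0 for p in pred]
print(delta1(affine_align(pred, gt), gt))  # 1.0
```

The alignment step matters for generative editors, because an image-editing model outputs a depth visualization with arbitrary scale and offset rather than metric depth.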
cs.CV / 41 / 2605.04569
Lightning Unified Video Editing via In-Context Sparse Attention
基于上下文稀疏注意力的快速统一视频编辑
Abstract
Video editing has evolved toward In-Context Learning (ICL) paradigms, yet the resulting quadratic attention costs create a critical computational bottleneck. In this work, we propose In-context Sparse Attention (ISA), the first near-lossless empirical sparse framework tailored for ICL video editing. Our design is grounded in two key insights: first, context tokens exhibit significantly lower saliency than source tokens; second, we theoretically prove and empirically validate that query sharpness correlates with approximation error. Motivated by these findings, ISA implements an efficient pre-selection strategy to prune redundant context, followed by a dynamic query grouping mechanism that routes high-error queries to full attention and low-error ones to a computationally efficient 0-th order Taylor sparse attention. Furthermore, we build LIVEditor, a novel lightning video editing model based on ISA, together with a proposed video-editing data pipeline that curates a 1.7M high-quality dataset. Extensive experiments demonstrate that LIVEditor achieves a roughly 60% reduction in attention-module latency while surpassing state-of-the-art methods across EditVerseBench, IVE-Bench, and VIE-Bench, delivering near-lossless acceleration without compromising visual fidelity.
Chinese Translation
视频编辑已逐渐朝向上下文学习(In-Context Learning, ICL)范式发展,但由此产生的二次注意力成本形成了一个关键的计算瓶颈。在本研究中,我们提出了上下文稀疏注意力(In-context Sparse Attention, ISA),这是针对ICL视频编辑的首个近无损经验稀疏框架。我们的设计基于两个关键见解:首先,上下文令牌的显著性显著低于源令牌;其次,我们理论证明并实证验证了查询的尖锐度与近似误差之间的相关性。基于这些发现,ISA实现了一种高效的预选择策略来修剪冗余上下文,随后采用动态查询分组机制,将高误差查询引导至全注意力机制,低误差查询则使用计算效率高的零阶泰勒稀疏注意力。此外,我们构建了LIVEditor,这是一个基于ISA的新型快速视频编辑模型,并提出了一条视频编辑数据管道,由其整理出一个1.7M规模的高质量数据集。大量实验表明,LIVEditor在注意力模块延迟方面实现了约60%的减少,同时在EditVerseBench、IVE-Bench和VIE-Bench等基准中超越了最先进的方法,实现了近无损加速而不妥协视觉保真度。
cs.CV / 42 / 2605.04574
VL-UniTrack: A Unified Framework with Visual-Language Prompts for UAV-Ground Visual Tracking
VL-UniTrack:一种具有视觉-语言提示的统一无人机-地面视觉跟踪框架
Abstract
UAV-ground visual tracking (UGVT) aims to simultaneously track the same object from both the UAV and the ground view. However, existing two-stream methods suffer from isolated feature extraction and rely heavily on implicit appearance matching, which struggles to establish reliable correspondence under drastic view differences and makes tracking unreliable. To address these limitations, we propose VL-UniTrack, a fully unified framework enhanced by visual-language prompts. By encoding features from both views within a single shared encoder, our method breaks the barrier of feature isolation to facilitate sufficient cross-view interaction. To overcome the ambiguity caused by relying solely on appearance matching, we design a visual-language geometric prompting module, which fuses language descriptions with visual features to generate learnable prompts. These prompts are then fed into our prompt-guided cross-view adapter module to enable sufficient cross-view feature interaction and to guide the learning of view-specific feature representations. Furthermore, a confidence-modulated mutual distillation loss is proposed to regularize training by mitigating noise propagation. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the latest benchmark. The code is available at https://github.com/xuboyue1999/VL-UniTrack.git
Chinese Translation
无人机-地面视觉跟踪(UGVT)旨在从无人机和地面视角同时跟踪同一物体。然而,现有的双流方法存在特征提取孤立的问题,并且过度依赖隐式外观匹配,这在剧烈视角差异下很难建立可靠的对应关系,导致跟踪的不可靠。为了解决这些局限性,我们提出了VL-UniTrack,这是一个通过视觉-语言提示增强的完全统一框架。通过在单个共享编码器中编码来自两个视角的特征,我们的方法打破了特征孤立的障碍,以促进充分的跨视角交互。为了克服单纯依赖外观匹配造成的模糊性,我们设计了视觉-语言几何提示模块,该模块将语言描述与视觉特征融合以生成可学习的提示。然后,这些提示被输入到我们的提示引导跨视角适配模块中,以实现充分的跨视角特征交互并指导视角特征表示的学习。此外,我们提出了一种信心调制的互蒸馏损失,通过减轻噪声传播来规范训练。广泛的实验表明,我们的方法在最新基准测试中实现了最先进的性能。代码可以在 https://github.com/xuboyue1999/VL-UniTrack.git 下载。
cs.CV / 43 / 2605.04581
GTF: Omnidirectional EPI Transformer for Light Field Super-Resolution
GTF:用于光场超分辨率的全向EPI变换器
Abstract
Light field (LF) image super-resolution benefits from Epipolar Plane Images (EPIs), whose line slopes explicitly encode disparity. However, existing Transformer-based LF SR methods mainly attend to horizontal and vertical EPIs, leaving diagonal epipolar geometry underexplored. We present GTF, an omnidirectional EPI Transformer that explicitly models horizontal, vertical, 45-degree, and 135-degree EPIs within a unified reconstruction framework. GTF combines directional EPI processing, MacPI-based prior injection, adaptive directional fusion, and a topology-preserving feed-forward network to better exploit LF geometry. For the NTIRE 2026 fidelity tracks, we use GTF as the main model, while a lightweight GTF-Tiny variant targets the efficiency track. On five standard LF SR benchmarks covering both real-captured and synthetic scenes, GTF reaches 32.78 dB without inference-time enhancement, and stronger inference settings with EPSW and test-time augmentation further improve performance. Under the NTIRE 2026 efficiency constraint, GTF-Tiny attains 32.57 dB with only 0.915M parameters and 19.81 GFLOPs. In the NTIRE 2026 Light Field Image Super-Resolution Challenge, our submissions rank 3rd on Track 1 and Track 3 and 4th on Track 2. Architecture-evolution, channel-width, and inference analyses further support the effectiveness of diagonal EPI modeling, directional fusion, and the lightweight design.
Chinese Translation
光场(LF)图像超分辨率受益于极平面图像(EPI),其线条斜率显式编码了视差。然而,现有基于变换器的LF SR方法主要关注水平和垂直EPI,未对对角线的极几何进行充分探讨。我们提出了GTF,一种全向EPI变换器,在统一的重建框架内显式建模水平、垂直、45度和135度的EPI。GTF结合了方向性EPI处理、基于MacPI的先验注入、自适应方向融合,以及一个保持拓扑结构的前馈网络,以更好地利用LF几何结构。在NTIRE 2026保真度赛道中,我们使用GTF作为主要模型,而轻量级的GTF-Tiny变体则面向效率赛道。在五个涵盖真实捕获和合成场景的标准LF SR基准上,GTF在不使用推理时增强的情况下达到32.78 dB,而采用EPSW和测试时增强的更强推理设置可进一步提升性能。在NTIRE 2026的效率约束下,GTF-Tiny仅用0.915M参数和19.81 GFLOPs实现32.57 dB。在NTIRE 2026光场图像超分辨率挑战中,我们的提交在赛道1和赛道3中排名第3,在赛道2中排名第4。架构演变、通道宽度和推理分析进一步支持对角线EPI建模、方向融合和轻量级设计的有效性。
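To make the four EPI directions concrete, the sketch below slices a toy 4D light field stored as nested lists `lf[u][v][y][x]` (angular row, angular column, spatial row, spatial column). This indexing convention, the function names, and the mapping of the main-diagonal and anti-diagonal slices to "45-degree" and "135-degree" are assumptions for illustration; the GTF paper may define its directions and sign conventions differently.

```python
def make_index_light_field(U, V, H, W):
    """Toy light field whose 'pixels' are their own (u, v, y, x) indices."""
    return [[[[(u, v, y, x) for x in range(W)] for y in range(H)]
             for v in range(V)] for u in range(U)]


def horizontal_epi(lf, u, y):
    """Fix angular row u and image row y; stack image rows across views v."""
    return [list(lf[u][v][y]) for v in range(len(lf[u]))]


def vertical_epi(lf, v, x):
    """Fix angular column v and image column x; stack image columns across views u."""
    return [[row[x] for row in lf[u][v]] for u in range(len(lf))]


def main_diagonal_epi(lf, y0, x0):
    """Views (i, i) sampled along the pixel diagonal starting at (y0, x0)."""
    n = min(len(lf), len(lf[0]))
    height, width = len(lf[0][0]), len(lf[0][0][0])
    length = min(height - y0, width - x0)
    return [[lf[i][i][y0 + k][x0 + k] for k in range(length)] for i in range(n)]


def anti_diagonal_epi(lf, y0, x0):
    """Views (i, V-1-i) sampled along the pixel anti-diagonal starting at (y0, x0)."""
    n = min(len(lf), len(lf[0]))
    num_v, height = len(lf[0]), len(lf[0][0])
    length = min(height - y0, x0 + 1)
    return [[lf[i][num_v - 1 - i][y0 + k][x0 - k] for k in range(length)]
            for i in range(n)]


lf = make_index_light_field(3, 3, 4, 5)  # 3x3 views of 4x5 images
print(horizontal_epi(lf, 1, 2)[2][4])    # (1, 2, 2, 4)
print(main_diagonal_epi(lf, 0, 0)[2][3])  # (2, 2, 3, 3)
```

In a real light field, a scene point traces a line in each of these 2D slices, and the slope of that line depends on its disparity. The two diagonal slices follow view pairs that the horizontal and vertical slices never combine.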
cs.CV / 44 / 2605.04590
From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation
从扩散到校正流:重新思考基于文本的分割
Abstract
Text-based image segmentation aims to delineate object boundaries within an image from text prompts, offering higher flexibility and broader application scope than traditional fixed-category segmentation tasks. Recent studies have shown that diffusion models (e.g., Stable Diffusion) can provide rich multimodal semantic features, which has motivated work on using diffusion models as feature extractors for segmentation. Such methods, however, inherit the generative nature of diffusion models, which is harmful to discriminative segmentation tasks. In response, we propose RLFSeg, a novel framework that leverages Rectified Flow to learn a direct mapping from the image to the segmentation mask within the latent space. The model is thus freed from the noise-denoise process and from the need to optimize the time step of diffusion models, resulting in substantially better performance than previous diffusion-based methods, especially in zero-shot scenarios. By introducing label refinement and an Adaptive One-Step Sampling strategy, the model achieves higher accuracy even with a single inference step. The framework redirects a pretrained generative model to the discriminative segmentation task with zero modification to the model structure, revealing promising application potential and significant research value.
Chinese Translation
基于文本的图像分割旨在从文本提示中勾勒出图像中的物体边界,与传统的固定类别分割任务相比,具有更高的灵活性和更广泛的应用范围。近期研究表明,扩散模型(例如,Stable Diffusion)能够提供丰富的多模态语义特征,因此研究者开始将扩散模型作为特征提取器用于分割任务。然而,此类方法继承了扩散模型的生成特性,这对判别性分割任务是有害的。为此,我们提出了RLFSeg,一个新颖的框架,利用校正流在潜在空间中学习图像到分割蒙版的直接映射。因此,该模型不再需要噪声去噪过程,也无需优化扩散模型的时间步,导致其性能显著优于之前的基于扩散的方法,尤其在零样本场景中表现尤为出色。通过引入标签精细化和自适应一步采样策略,该模型即使在单一步骤推理时也能实现更高的准确率。该框架在不对模型结构进行任何修改的情况下,将预训练的生成模型引导到判别性分割任务,从而展示了其良好的应用潜力和显著的研究价值。
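Why a rectified flow can be sampled in a single step is easiest to see on a toy ODE. A flow model integrates $dx/dt = v(x, t)$ from $t = 0$ to $t = 1$; when trajectories are straight, one Euler step already lands on the endpoint, whereas a curved field needs many steps. The sketch below contrasts the two cases with hand-written velocity fields. It is a generic illustration, not RLFSeg's Adaptive One-Step Sampling, and the fields `straight_field` and `curved_field` are hypothetical stand-ins for a learned network mapping image latents to mask latents.

```python
def euler_integrate(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def straight_field(x, t):
    """Constant velocity: every trajectory is a straight line."""
    return [2.0, -1.0]


def curved_field(x, t):
    """Velocity depends on position, so trajectories are not straight lines in time."""
    return [-xi for xi in x]


# Straight trajectories: one step matches a fine discretization.
print(euler_integrate(straight_field, [0.0, 0.0], 1))    # [2.0, -1.0]
print(euler_integrate(straight_field, [0.0, 0.0], 100))
# Curved trajectories: one step is far from the many-step answer.
print(euler_integrate(curved_field, [1.0, 1.0], 1))      # [0.0, 0.0]
print(euler_integrate(curved_field, [1.0, 1.0], 1000))
```

Rectified Flow training encourages the learned field to behave like the first case, which is what makes single-step inference plausible.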
cs.CV / 45 / 2605.04593
DiCLIP: Diffusion Model Enhances CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation
DiCLIP:扩散模型增强 CLIP 的密集知识以实现弱监督语义分割
Abstract
Weakly Supervised Semantic Segmentation (WSSS) with image-level labels typically leverages Class Activation Maps (CAMs) to achieve pixel-level predictions. Recently, Contrastive Language-Image Pre-training (CLIP) has been introduced to generate CAMs in WSSS. However, previous WSSS methods solely adopt CLIP's vision-language paired property for dense localization, neglecting its inherently limited dense knowledge across both visual and text modalities, which renders CAM generation suboptimal. In this work, we propose DiCLIP, a novel WSSS framework that leverages the generative diffusion model to enhance CLIP's dense knowledge across two modalities. Specifically, Visual Correlation Enhancement (VCE) and Text Semantic Augmentation (TSA) modules are proposed for dense prediction enhancement. To improve the spatial awareness of visual features, our VCE module utilizes diffusion's reliable spatial consistency to mitigate the over-smoothing issue in CLIP's attention. It designs the Attention Clustering Refinement (ACR) module to reliably extract diverse correlation maps from the diffusion model. The correlation maps act as a diversity bias for CLIP's self-attention, recursively pushing its visual features towards a more discriminative dense distribution. To augment the semantics of text embeddings, our TSA module argues that a single text modality is insufficient to encompass the variability of visual categories. Thus, we leverage diffusion's generative power to maintain a dynamic key-value cache model, shifting CAM generation from a patch-text matching mechanism to a novel visual knowledge retrieval paradigm. With these enhancements, DiCLIP not only outperforms state-of-the-art methods on PASCAL VOC and MS COCO but also significantly reduces training costs. Code is publicly available at https://github.com/zwyang6/DiCLIP.
Chinese Translation
弱监督语义分割(WSSS)通过图像级标签通常利用类激活图(CAMs)实现像素级预测。最近,对比语言-图像预训练(CLIP)被引入以生成 WSSS 中的 CAMs。然而,以往的 WSSS 方法仅采用 CLIP 的视觉-语言配对特性进行密集定位,而忽视了其在视觉和文本模态间固有限制的密集知识,这导致 CAM 生成效果不佳。在本研究中,我们提出了 DiCLIP,一种新颖的 WSSS 框架,利用生成性扩散模型增强 CLIP 在两种模态下的密集知识。具体而言,提出了视觉相关性增强(VCE)和文本语义增强(TSA)模块以增强密集预测。为了提高视觉特征的空间感知,我们的 VCE 模块利用扩散模型的可靠空间一致性来缓解 CLIP 注意力中的过平滑问题。它设计了注意力聚类细化(ACR)模块,以可靠地从扩散模型中提取多样的相关性图。相关性图作为 CLIP 自注意力的多样性偏差,递归性地推动其视觉特征朝向更具区分性的密集分布。为了增强文本嵌入的语义,我们的 TSA 模块认为单一的文本模态无法涵盖视觉类别的变异性。因此,我们利用扩散的生成能力维护动态的键值缓存模型,将 CAM 生成从补丁-文本匹配机制转变为新的视觉知识检索范式。通过这些增强,DiCLIP 不仅在 PASCAL VOC 和 MS COCO 上超越了最先进的方法,而且显著降低了训练成本。代码已公开发布于 https://github.com/zwyang6/DiCLIP。
cs.CV / 46 / 2605.04606
Reference-based Category Discovery: Unsupervised Object Detection with Category Awareness
基于参考的类别发现:具有类别意识的无监督目标检测
Abstract
Traditional one-shot detection methods have addressed the closed-set problem in object detection, but the high cost of data annotation remains a critical challenge. General unsupervised methods generate pseudo boxes without category labels, thus failing to achieve category-aware classification. To overcome these limitations, we propose Reference-based Category Discovery (RefCD), an unsupervised detector that enables category-aware detection without any manually annotated labels. It leverages feature similarity between predicted objects and unlabeled reference images. Unlike previous unsupervised methods that lack category guidance and one-shot methods which require labeled data, RefCD introduces a carefully designed feature similarity loss to explicitly guide the learning of potential category-specific features. Additionally, RefCD supports category-agnostic detection without reference images, serving as a unified framework. Comprehensive quantitative and qualitative analysis of category-aware and category-agnostic detection results demonstrates its effectiveness, and RefCD can learn category information in an unsupervised paradigm even without category labels.
Chinese Translation
传统的一次性检测方法已解决目标检测中的封闭集问题,但数据标注的高成本仍然是一个关键挑战。一般的无监督方法生成没有类别标签的伪框,从而无法实现具有类别意识的分类。为克服这些限制,我们提出了基于参考的类别发现(Reference-based Category Discovery, RefCD),这是一种无监督检测器,可以在没有任何人工标注标签的情况下实现类别意识的检测。它利用预测对象与未标记参考图像之间的特征相似性。与以往缺乏类别指导的无监督方法和需依赖标注数据的一次性方法不同,RefCD引入了一种精心设计的特征相似性损失,明确指导潜在类别特征的学习。此外,RefCD支持在没有参考图像的情况下进行类别无关检测,作为一个统一框架。对具有类别意识和类别无关检测结果的全面定量与定性分析证明了其有效性,RefCD即使在没有类别标签的情况下也能以无监督的方式学习类别信息。
cs.CV / 47 / 2605.04609
Advancing Aesthetic Image Generation via Composition Transfer
通过构图转移推进美学图像生成
Abstract
Composition is a cornerstone of visual aesthetics, influencing the appeal of an image. While its principles operate independently of specific content, in practice, composition is often coupled with semantics. As a result, existing methods often enhance composition either through implicit learning or by semantics-based layout control, rather than explicitly modeling composition itself. To address this gap, we introduce Composer, a framework rooted in aesthetic theory, designed to model composition in a semantic-agnostic manner. First, it supports composition transfer by extracting key composition-aware representations from a reference image and leveraging a tailored conditional guidance module to control composition based on pre-trained diffusion models. Second, when users specify only text themes without a composition reference, Composer supports theme-driven composition retrieval by leveraging the in-context learning capabilities of Large Vision-Language Models (LVLMs), achieving explicit composition planning. To enhance composition in a reference-free mode, we conduct text-to-composition fine-tuning on the trained control module to enable implicit composition planning. Furthermore, we curated a high-quality dataset comprising 2 million image-text pairs using state-of-the-art generative models to support model training. Experimental results demonstrate that Composer significantly enhances aesthetic quality in text-to-image tasks and facilitates personalized composition control and transfer, offering users precision and flexibility in the creative process.
Chinese Translation
构图是视觉美学的基石,影响图像的吸引力。尽管其原则独立于具体内容,但在实际应用中,构图往往与语义相结合。因此,现有的方法通常通过隐式学习或基于语义的布局控制来增强构图,而不是明确地建模构图本身。为了解决这一问题,我们引入了 Composer,一个基于美学理论的框架,旨在以语义无关的方式建模构图。首先,它支持构图转移,通过从参考图像中提取关键的构图感知表示,并利用定制的条件引导模块来基于预训练的扩散模型控制构图。其次,当用户仅指定文本主题而没有构图参考时,Composer 通过利用大型视觉-语言模型(Large Vision-Language Models, LVLMs)的上下文学习能力支持主题驱动的构图检索,实现了显式构图规划。为了在无参考模式下增强构图,我们在训练的控制模块上进行文本到构图的微调,以实现隐式构图规划。此外,我们使用最先进的生成模型策划了一个高质量的数据集,其中包含200万对图像-文本对,以支持模型训练。实验结果表明,Composer 在文本到图像的任务中显著提升了美学质量,并促进了个性化的构图控制和转移,为用户在创作过程中提供了精确性和灵活性。
cs.CV / 48 / 2605.04617
Temporal Structure Matters for Efficient Test-Time Adaptation in Wearable Human Activity Recognition
时间结构对可穿戴人类活动识别中高效测试时适应的重要性
Abstract
Wearable human activity recognition (WHAR) models often suffer from performance degradation under real-world cross-user distribution shifts. Test-time adaptation (TTA) mitigates this degradation by adapting models online using unlabeled test streams, yet existing methods largely inherit assumptions from vision tasks and underexploit the inherent inter-window temporal structure in WHAR streams. In this paper, we revisit such temporal structure as a feature-conditioned inference signal rather than merely an output-space smoothing prior. We derive the insight that temporal continuity and observation-induced feature deviations provide complementary cues for determining when to preserve or release temporal inertia and where to route prediction refinement during likely transitions. Building upon this insight, we propose SIGHT, a lightweight and backpropagation-free TTA framework for WHAR, enabling real-time edge deployment. SIGHT estimates predictive surprise by comparing the current feature with a prototype-based expected state, and then uses the resulting feature deviation to guide geometry-aware transition routing based on prototype alignment and stream-level marginal habit tracking. Evaluations on real-world datasets confirm that SIGHT outperforms existing TTA baselines while reducing computational and memory costs.
Chinese Translation
可穿戴人类活动识别(WHAR)模型在面对现实世界中的跨用户分布变化时,常常面临性能下降的问题。测试时适应(TTA)通过使用未标记的测试流在线调整模型,从而缓解了这种性能下降,然而现有方法在很大程度上继承了视觉任务的假设,并未充分利用WHAR流中固有的时间窗口间的结构。在本文中,我们将这种时间结构重新审视为一种特征条件的推理信号,而不仅仅是输出空间平滑的先验。我们提出的观点是,时间的连续性和观察引起的特征偏差为决定何时保留或释放时间惯性,以及在可能的过渡期间如何进行预测精细化提供了互补的线索。基于这一观点,我们提出了SIGHT,这是一种轻量级且无反向传播的TTA框架,适用于WHAR,实现了实时边缘部署。SIGHT通过将当前特征与基于原型的期望状态进行比较来估计预测惊奇值,然后利用产生的特征偏差引导基于原型对齐和流级边际习惯跟踪的几何感知过渡路由。在真实世界数据集上的评估证实,SIGHT在降低计算和内存成本的同时优于现有的TTA基线。
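The abstract describes estimating predictive surprise by comparing the current feature with a prototype-based expected state. The following is a minimal sketch of one way that comparison could look, not the authors' implementation. The function names, the use of the previous prediction probabilities to weight class prototypes, and the choice of cosine distance are all assumptions made for illustration.

```python
import math


def expected_state(prototypes, prev_probs):
    """Blend class prototypes by the previous window's class probabilities.

    Assumption: the 'expected state' is a probability-weighted prototype mean.
    """
    dim = len(prototypes[0])
    return [
        sum(p * proto[d] for p, proto in zip(prev_probs, prototypes))
        for d in range(dim)
    ]


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predictive_surprise(feature, prototypes, prev_probs):
    """Return 1 - cosine similarity between the feature and the expected state.

    Assumption: a larger value means the current window deviates more from
    what temporal continuity would predict.
    """
    return 1.0 - cosine_similarity(feature, expected_state(prototypes, prev_probs))
```

Under these assumptions, a window whose feature stays close to the previously predicted class yields a small surprise value, while a window that resembles a different prototype yields a large one, which is the kind of signal the abstract says is used to decide when to release temporal inertia.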
cs.CV / 49 / 2605.04635
UniPCB: A Generation-Assisted Detection Framework for PCB Defect Inspection
UniPCB:一种基于生成辅助的PCB缺陷检测框架
Abstract
Printed Circuit Board (PCB) defect inspection faces two compounding challenges: scarce and imbalanced defect samples that limit model training, and insufficient feature representation under complex circuit backgrounds. Existing generation methods rely on single-modality conditions with coarse structural control, while detection methods improve architectures without addressing the data bottleneck. To resolve both challenges jointly, we propose a generation-assisted PCB defect inspection framework that integrates controlled defect synthesis with task-specific defect detection. On the generation side, a Multi-modal Condition Generator extracts complementary edge, depth, and text conditions in parallel. A ScaleEncoder then embeds these conditions into the diffusion U-Net at four resolutions, and a Condition Modulation applies FiLM-style spatially-adaptive modulation at each scale, enabling structurally aligned and defect-aware sample synthesis. On the detection side, an Inverted Residual Shift Attention couples self-attention with shift-wise convolution to jointly capture global context and local texture, and a Cross-level Complementary Fusion Block generates pixel-level gates for selective cross-level feature fusion. The synthesized samples directly enrich the detection training set, so that improvements in generation compound with improvements in detection. Extensive experiments on DsPCBSD+ demonstrate that UniPCB achieves mAP@0.5 of 98.0% and mAP@0.5:0.95 of 61.8% on defect detection, surpassing all compared methods, while the generation branch attains an FID of 129.61 and SSIM of 0.619, outperforming existing conditional generation approaches.
Chinese Translation
印刷电路板(PCB)缺陷检测面临两个复合挑战:稀缺和不平衡的缺陷样本限制了模型训练,以及在复杂电路背景下特征表示不足。现有的生成方法依赖于单一模态条件且结构控制粗糙,而检测方法则在提高架构的同时并未解决数据瓶颈。为了共同解决这两个挑战,我们提出了一种生成辅助的PCB缺陷检测框架,将受控的缺陷合成与特定任务的缺陷检测相结合。在生成方面,多模态条件生成器(Multi-modal Condition Generator)并行提取互补的边缘、深度和文本条件。然后,尺度编码器(ScaleEncoder)将这些条件嵌入到扩散U-Net中的四个分辨率,并通过条件调制(Condition Modulation)进行FiLM风格的空间自适应调制,使得样本合成在结构上对齐并具有缺陷意识。在检测方面,倒置残差位移注意力(Inverted Residual Shift Attention)将自注意力与位移卷积结合,以共同捕捉全局上下文和局部纹理,而跨层互补融合块(Cross-level Complementary Fusion Block)生成像素级门控以选择性地进行跨层特征融合。合成样本直接丰富了检测训练集,从而使生成的改进与检测的改进相互叠加。在DsPCBSD+上的大量实验表明,UniPCB在缺陷检测中实现了mAP@0.5为98.0%和mAP@0.5:0.95为61.8%,超越了所有对比方法,而生成分支则达到了FID为129.61和SSIM为0.619,超越了现有的条件生成方法。
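The UniPCB abstract mentions FiLM-style spatially-adaptive modulation at each scale. As a rough illustration of what such a modulation computes, the sketch below applies a per-location scale and shift to a feature map stored as nested lists. The function name and the plain `gamma * x + beta` form are assumptions; the paper may use a different parameterization (for example `(1 + gamma) * x + beta`) and will certainly operate on tensors rather than lists.

```python
def film_modulate(features, gamma, beta):
    """Spatially-adaptive FiLM-style modulation (illustrative sketch).

    All three arguments are nested lists shaped [channels][height][width].
    Computes out[c][y][x] = gamma[c][y][x] * features[c][y][x] + beta[c][y][x].
    In a real model, gamma and beta would be predicted from the condition
    (edge, depth, text) embeddings; here they are simply passed in.
    """
    return [
        [
            [g * f + b for f, g, b in zip(f_row, g_row, b_row)]
            for f_row, g_row, b_row in zip(f_ch, g_ch, b_ch)
        ]
        for f_ch, g_ch, b_ch in zip(features, gamma, beta)
    ]
```

Because the scale and shift vary per pixel in this sketch, a condition can amplify features in one region and suppress them in another, which is the property the abstract relies on for structurally aligned synthesis.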
cs.CV / 50 / 2605.04641
CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering
CAST:通过字幕引导的视觉注意力引导减轻大型视觉语言模型中的物体幻觉
Abstract
Although Large Vision-Language Models (LVLMs) have demonstrated remarkable performance on downstream tasks, they frequently produce content that deviates from visual information, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annotation and training, or on decoding strategies that significantly increase inference time. In this work, we observe that LVLMs' attention to visual information is significantly enhanced when answering caption queries compared to non-caption queries. Inspired by this phenomenon, we propose Caption-guided Visual Attention Steering (CAST), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern corresponding to caption queries to enhance LVLMs' visual perception capability. Specifically, we use probing techniques to identify attention heads that are highly sensitive to caption queries and estimate optimized steering directions for their outputs. This steering strengthens LVLMs' fine-grained visual perception capabilities, thereby effectively mitigating object hallucination. CAST reduced object hallucination by an average of 6.03% across five widely used LVLMs and five benchmarks including both discriminative and generative tasks, demonstrating state-of-the-art performance while adding little inference cost and preserving other foundational capabilities.
Chinese Translation
尽管大型视觉语言模型(LVLMs)在下游任务中表现出色,但它们经常产生与视觉信息不符的内容,从而导致物体幻觉。为了解决这个问题,近期的研究大多依赖于昂贵的手动注释和训练成本,或大幅增加推理时间的解码策略。在本研究中,我们观察到LVLMs在回答字幕查询时对视觉信息的注意力显著增强,相较于非字幕查询。受到这一现象的启发,我们提出了字幕引导的视觉注意力引导(Caption-guided Visual Attention Steering,CAST),这是一种无训练、即插即用的幻觉减轻方法,利用与字幕查询相对应的注意力激活模式来增强LVLMs的视觉感知能力。具体而言,我们使用探测技术识别对字幕查询高度敏感的注意力头,并估计其输出的优化引导方向。这种引导增强了LVLM的细粒度视觉感知能力,从而有效减少了物体幻觉。CAST在包括区分性和生成性任务的五个广泛使用的LVLM和五个基准测试上平均减少了6.03%的物体幻觉,展示了最先进的性能,同时在推理成本上几乎没有增加,并保持了其他基础能力。
cs.CV / 51 / 2605.04662
Contact Matrix: Enhancing Dance Motion Synthesis with Precise Interaction Modeling
接触矩阵:通过精确的互动建模增强舞蹈动作合成
Abstract
Generating realistic reactive motions, in which one person reacts to the fixed motions of others, is challenging due to strict interaction constraints and a limited feasible solution space. This paper focuses on a typical scenario: duet dance, where high-quality data is scarce, motion patterns are complex, and the details of human interactions are both intricate and abundant. To tackle these challenges, we propose a novel two-stage framework. In the first stage, we introduce a motion VQ-VAE with separate body-part encoders and a joint decoder, enabling specialized codebooks to enhance representation capacity while dynamically modeling dependencies across body parts during decoding, thereby preventing inconsistencies in the generated motions. In the second stage, we propose a contact-aware diffusion model for reactive motion generation that jointly generates motion and a contact matrix between individuals, enabling explicit interaction modeling and providing guidance toward more precise and constrained interaction dynamics during sampling. Experiments show that our method outperforms Duolando with lower $\text{FID}_k$ (8.89 vs. 25.30) and $\text{FID}_{cd}$ (8.01 vs. 9.97), as well as a higher BED (0.4606 vs. 0.2858), indicating improved interaction fidelity and rhythmic synchronization.
Chinese Translation
生成逼真的反应性动作,即一个人对他人固定动作的反应,因严格的互动约束和有限的可行解空间而具有挑战性。本文聚焦于一个典型场景:双人舞,这里高质量的数据稀缺,动作模式复杂,人类互动的细节既错综复杂又丰富。为应对这些挑战,我们提出了一种新颖的两阶段框架。在第一阶段,我们引入了一个具有独立身体部位编码器和联合解码器的运动变分量化自编码器(VQ-VAE),使专用的代码本增强表示能力,并在解码过程中动态建模身体部位之间的依赖关系,从而防止生成动作中的不一致性。在第二阶段,我们提出了一种关注接触的扩散模型,用于生成反应性动作,该模型共同生成个体之间的动作和接触矩阵,使得互动建模更加明确,并在采样过程中提供更精确和受限的互动动态指导。实验表明,我们的方法在较低的 $\text{FID}_k$(8.89 对比 25.30)和 $\text{FID}_{cd}$(8.01 对比 9.97)以及更高的 BED(0.4606 对比 0.2858)下优于 Duolando,表明互动保真度和节奏同步性得到了改善。
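The abstract refers to a contact matrix between two individuals that is generated jointly with the motion. The paper's exact definition is not given here, so the following is only a plausible sketch: a binary matrix whose entry (i, j) marks whether joint i of one dancer lies within a distance threshold of joint j of the other. The function name and the 0.1 default threshold are hypothetical.

```python
import math


def contact_matrix(joints_a, joints_b, threshold=0.1):
    """Binary contact matrix between two sets of 3D joints (illustrative).

    joints_a, joints_b: lists of (x, y, z) positions for a single frame.
    Entry [i][j] is 1 when joint i of person A is closer than `threshold`
    to joint j of person B, else 0. The threshold value is an assumption.
    """
    return [
        [1 if math.dist(pa, pb) < threshold else 0 for pb in joints_b]
        for pa in joints_a
    ]
```

A matrix of this kind gives a diffusion model an explicit target for who touches whom, which is one way to read the abstract's claim that contact guidance constrains the interaction during sampling.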
cs.CV / 52 / 2605.04675
Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern
物理对抗服装通过非重叠RGB-T模式规避可见-热探测器
Abstract
Visible-thermal (RGB-T) object detection is a crucial technology for applications such as autonomous driving, where multimodal fusion enhances performance in challenging conditions like low light. However, the security of RGB-T detectors, particularly in the physical world, has been largely overlooked. This paper proposes a novel approach to RGB-T physical attacks using adversarial clothing with a non-overlapping RGB-T pattern (NORP). To simulate full-view (0$^{\circ}$--360$^{\circ}$) RGB-T attacks, we construct 3D RGB-T models for human and adversarial clothing. NORP is a new adversarial pattern design using distinct visible and thermal materials without overlap, avoiding the light reduction in overlapping RGB-T patterns (ORP). To optimize the NORP on adversarial clothing, we propose a spatial discrete-continuous optimization (SDCO) method. We systematically evaluated our method on RGB-T detectors with different fusion architectures, demonstrating high attack success rates both in the digital and physical worlds. Additionally, we introduce a fusion-stage ensemble method that enhances the transferability of adversarial attacks across unseen RGB-T detectors with different fusion architectures.
Chinese Translation
可见-热(RGB-T)目标检测是用于自动驾驶等应用的关键技术,多模态融合在低光等挑战性条件下提高性能。然而,RGB-T探测器的安全性,尤其是在物理世界中,仍然在很大程度上被忽视。本文提出了一种使用具有非重叠RGB-T模式(NORP)的对抗服装进行RGB-T物理攻击的新方法。为了模拟全视角(0°--360°)RGB-T攻击,我们为人类和对抗服装构建了3D RGB-T模型。NORP是一种新的对抗模式设计,采用不同的可见和热材料而不重叠,从而避免了重叠RGB-T模式(ORP)造成的光线减少。为了优化对抗服装上的NORP,我们提出了一种空间离散-连续优化(SDCO)方法。我们在不同融合架构的RGB-T探测器上系统评估了我们的方法,展示了在数字和物理世界中高攻击成功率。此外,我们介绍了一种融合阶段集成方法,增强了对抗攻击在不同融合架构的未见RGB-T探测器之间的可转移性。
cs.CV / 53 / 2605.04680
Multi-Level Bidirectional Biomimetic Learning for EEG-Based Visual Decoding
基于EEG的视觉解码的多层双向仿生学习
Abstract
EEG-based visual neural decoding aims to align neural responses with visual stimuli for tasks such as image retrieval. However, limited paired data and a fundamental mismatch between high-fidelity digital images and biological visual perception - distorted by retinotopic mapping and subject-specific neuroanatomy - severely impede cross-modal alignment. To address this, we propose MB2L, a Multi-Level Bidirectional Biomimetic Learning framework that incorporates structured physiological inductive biases into representation learning. Specifically, we propose Adaptive Blur with Visual Priors to mitigate perceptual-structural mismatch by reweighting visual inputs according to retinotopic priors. We further propose Biomimetic Visual Feature Extraction to learn multi-level visual representations consistent with hierarchical cortical processing, enhancing subject-invariant encoding. These modules are jointly optimized via Multi-level Bidirectional Contrastive Learning, which aligns EEG and visual features in a shared semantic space through bidirectional contrastive objectives. Experiments show MB2L achieves 80.5% Top-1 and 97.6% Top-5 accuracy on zero-shot EEG-to-image retrieval, significantly outperforming prior methods and demonstrating strong generalization across subjects and experimental settings.
Chinese Translation
基于EEG的视觉神经解码旨在将神经响应与视觉刺激对齐,以完成图像检索等任务。然而,有限的配对数据以及高保真数字图像与生物视觉感知之间的基本不匹配——由于视网膜地图和特定个体的神经解剖学造成的失真——严重阻碍了跨模态对齐。为了解决这一问题,我们提出了MB2L(多层双向仿生学习)框架,该框架将结构化的生理归纳偏见纳入表征学习中。具体而言,我们提出了带有视觉先验的自适应模糊,以通过根据视网膜优先权对视觉输入进行重新加权,从而减轻感知-结构不匹配。我们进一步提出了仿生视觉特征提取,以学习与层次皮层处理一致的多层视觉表征,增强对个体不变的编码。这些模块通过多层双向对比学习联合优化,通过双向对比目标在共享语义空间中对齐EEG和视觉特征。实验证明,MB2L在零-shot EEG到图像检索中实现了80.5%的Top-1准确率和97.6%的Top-5准确率,显著优于之前的方法,并展示了在不同被试和实验设置下的强泛化能力。
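The MB2L abstract aligns EEG and visual features through bidirectional contrastive objectives. A common way to realize such an objective is a symmetric InfoNCE loss over a batch of paired embeddings, sketched below in plain Python. This is a generic formulation rather than the paper's loss: the function names, the 0.07 temperature, and the equal weighting of the two directions are assumptions.

```python
import math


def _normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _diag_cross_entropy(logits):
    """Mean of -log softmax(row)[i] over rows, treating pair i as the positive."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def bidirectional_contrastive_loss(eeg, img, temperature=0.07):
    """Symmetric InfoNCE over paired EEG and image embeddings (sketch).

    eeg[i] and img[i] are assumed to be a matching pair; every other
    combination in the batch acts as a negative.
    """
    eeg = [_normalize(v) for v in eeg]
    img = [_normalize(v) for v in img]
    logits = [
        [sum(a * b for a, b in zip(e, i)) / temperature for i in img]
        for e in eeg
    ]
    logits_t = [list(col) for col in zip(*logits)]  # image-to-EEG direction
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))
```

Averaging the EEG-to-image and image-to-EEG terms means neither modality is treated as the fixed anchor, which matches the "bidirectional" framing in the abstract.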
cs.CV / 54 / 2605.04702
FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation
FaithfulFaces:面向文本生成视频的人脸身份保持的姿态忠实性
Abstract
Identity-preserving text-to-video generation (IPT2V) empowers users to produce diverse and imaginative videos with consistent human facial identity. Despite recent progress, existing methods often suffer from significant identity distortion under large facial pose variations or facial occlusions. In this paper, we propose FaithfulFaces, a pose-faithful facial identity preservation learning framework to improve IPT2V in complex dynamic scenes. The key to FaithfulFaces is a pose-shared identity aligner that refines and aligns facial poses across distinct views via a pose-shared dictionary and a pose variation-identity invariance constraint. By mapping single-view inputs into a global facial pose representation with explicit Euler angle embeddings, FaithfulFaces provides a pose-faithful facial prior that guides generative foundations toward robust identity-preserving generation. In particular, we develop a specialized pipeline to curate a high-quality video dataset featuring substantial facial pose diversity. Extensive experiments demonstrate that FaithfulFaces achieves state-of-the-art performance, maintaining superior identity consistency and structural clarity even as pose changes and occlusions occur.
Chinese Translation
身份保持的文本到视频生成(IPT2V)使用户能够制作多样化且富有创意的视频,同时保持一致的人脸身份。尽管近期取得了一些进展,现有方法在面临较大面部姿态变化或面部遮挡时,通常会遭受显著的身份失真。本文提出了FaithfulFaces,一种面向姿态忠实的人脸身份保持学习框架,以改善复杂动态场景中的IPT2V。FaithfulFaces的关键在于一个姿态共享身份对齐器,通过姿态共享字典和姿态变化-身份不变性约束来细化和对齐不同视角下的人脸姿态。通过将单视图输入映射到具有明确欧拉角嵌入的全局人脸姿态表示,FaithfulFaces提供了一种姿态忠实的人脸先验,引导生成基础朝着稳健的身份保持生成。特别地,我们开发了一个专门的流程,以策划一个高质量的视频数据集,该数据集展现了显著的人脸姿态多样性。大量实验表明,FaithfulFaces在表现上达到了行业先进水平,即使在姿态变化和遮挡发生时,仍能保持优越的身份一致性和结构清晰度。
cs.CV / 55 / 2605.04713
Not Every Subject Should Stay: Machine Unlearning for Noisy Engagement Recognition
并非所有受试者都应留下:噪声参与识别的机器遗忘
Abstract
Engagement recognition datasets are typically subject-indexed and often contain noisy, subjective supervision, making post-hoc dataset revision a practical problem. Existing noisy-label and data-cleaning methods largely operate at the sample level before or during training, but do not directly address a different question: once a model has already been trained, can the influence of an entire problematic subject be removed without full retraining? We study this setting through subject-level machine unlearning as a post-hoc sanitization mechanism for engagement recognition. Starting from a baseline trained on all subjects, we rank candidate harmful subjects using a model-dependent proxy, apply a lightweight approximate unlearning update, and compare the result against an oracle model retrained from scratch on the retained subjects only. We instantiate this protocol on DAiSEE and EngageNet using Tensor-Convolution and Convolution-Transformer Network (TCCT-Net) as a fixed platform and evaluate three matched model states under the same removal scenario: baseline, unlearned, and oracle. In representative K=3 forget-set settings, the unlearned model recovers 89.3% and 92.5% of the oracle gain on EngageNet and DAiSEE, respectively, at roughly one quarter of retraining cost. Across the tested small-audit regimes, effectiveness is strongest at an intermediate forget-set size, indicating that approximate subject-level unlearning is a useful low-cost correction mechanism, but one whose benefit depends on subject selection quality and removal regime.
Chinese Translation
参与识别数据集通常是按受试者索引的,且往往包含噪声和主观监督,这使得事后数据集修订成为一个实际问题。现有的噪声标签和数据清洗方法主要在训练前或训练期间按样本级别进行,但并未直接解决一个不同的问题:一旦模型已经训练完成,能否在不完全重新训练的情况下消除整个问题受试者的影响?我们通过受试者级的机器遗忘作为参与识别的事后清理机制来研究这种情境。从一个基于所有受试者训练的基线开始,我们使用模型依赖的代理对有害受试者进行排序,应用轻量级的近似遗忘更新,并将结果与仅在保留受试者上从头重新训练的最终模型进行比较。我们在 DAiSEE 和 EngageNet 上实例化这个协议,以 Tensor-Convolution 和 Convolution-Transformer Network (TCCT-Net) 作为固定平台,并在相同的移除场景下评估三种匹配的模型状态:基线、已遗忘和最终模型。在典型的 K=3 遗忘集设置中,已遗忘模型在 EngageNet 和 DAiSEE 上分别恢复了 89.3% 和 92.5% 的最终增益,且重训练成本约为原来的四分之一。在测试的小规模审计条件下,效果在中等大小的遗忘集中最强,表明近似的受试者级遗忘是一种有用的低成本纠正机制,但其效果依赖于受试者选择质量和移除机制。
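The abstract reports that the unlearned model recovers 89.3% and 92.5% of the oracle gain. The abstract does not spell out the formula, so the sketch below uses the most natural reading: the fraction of the baseline-to-oracle improvement that the unlearned model attains. The function name is hypothetical and the example numbers in the comment are made up for illustration, not taken from the paper.

```python
def recovered_gain_fraction(baseline, unlearned, oracle):
    """Fraction of the oracle's improvement over baseline that unlearning recovers.

    Assumed definition: (unlearned - baseline) / (oracle - baseline),
    where all three arguments are the same metric (e.g. accuracy) measured
    under the same subject-removal scenario.
    """
    gain = oracle - baseline
    if gain == 0:
        raise ValueError("oracle and baseline are equal; the ratio is undefined")
    return (unlearned - baseline) / gain


# Illustrative values only: baseline 0.60, unlearned 0.68, oracle 0.70
# would mean the unlearned model recovers 80% of the oracle gain.
```

Under this reading, a value near 1.0 means approximate unlearning is almost as good as retraining from scratch on the retained subjects, and a value above 1.0 or below 0.0 would signal that the unlearned model overshoots the oracle or falls below the baseline.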
cs.CV / 56 / 2605.04728
Anny-Fit: All-Age Human Mesh Recovery
Anny-Fit: 全年龄人类网格恢复
Abstract
Recovering 3D human pose and shape from a single image remains a cornerstone of human-centric vision, yet most methods assume adult subjects and optimize each person independently. These assumptions fail in real-world, all-age scenes, where body proportions and depth must be resolved jointly. We introduce Anny-Fit, a multi-person, camera-space optimization framework for all-age 3D human mesh recovery (HMR). Unlike existing per-person fitting methods, Anny-Fit jointly optimizes all individuals directly in the camera coordinate system, enforcing global spatial consistency. At the core of our approach is the use of multiple forms of expert knowledge -- including metric depth maps, instance segmentation, 2D keypoints, and VLM-derived semantic attributes such as age and gender -- each obtained from dedicated off-the-shelf networks. These complementary signals jointly guide the optimization, constraining the depth-scale ambiguity characteristic of all-age scenes. Across diverse datasets, Anny-Fit consistently improves 2D reprojection accuracy (+13 to 16), relative depth ordering (+6 to 7), 3D estimation error (-9 to -29) and shape estimation (+25 to +82), producing more coherent scenes. Finally, we show that VLM-based semantic knowledge can be distilled into an HMR model via the pseudo-ground-truth annotations produced by Anny-Fit on training data, enabling it to learn semantically meaningful shape parameters while improving HMR performance. Our approach bridges adult-only and all-age modeling by enabling zero-shot adaptation of adult-trained HMR pipelines to the full age spectrum without retraining. Code is publicly available at https://github.com/naver/anny-fit.
Chinese Translation
从单张图像恢复三维人类姿态和形状仍然是以人类为中心的视觉的重要基础,但大多数方法假定对象为成人,并且独立优化每个人。这些假设在现实世界中全年龄场景下失败,在那里必须联合解决体型比例和深度。我们提出了Anny-Fit,一个用于全年龄三维人类网格恢复(HMR)的多人人体、相机空间优化框架。与现有的逐人拟合方法不同,Anny-Fit在相机坐标系中直接联合优化所有个体,强制执行全局空间一致性。我们的方法核心是利用多种形式的专家知识,包括度量深度图、实例分割、二维关键点以及基于VLM(视觉语言模型)提取的语义属性(如年龄和性别),这些都来自专门的现成网络。这些互补信号共同指导优化,约束了全年龄场景中深度尺度模糊的特征。在多样化的数据集上,Anny-Fit在二维重投影精度(提高13至16),相对深度排序(提高6至7),三维估计误差(降低9至29)以及形状估计(提高25至82)方面始终表现出改善,生成了更加连贯的场景。最后,我们展示了基于VLM的语义知识可以通过Anny-Fit在训练数据上生成的伪真值注释被提炼到HMR模型中,使其能够学习语义上有意义的形状参数,同时提高HMR性能。我们的方法通过使经过成人训练的HMR流程能够零次适应全年龄范围,而无需重新训练,弥合了仅限成人和全年龄建模之间的差距。代码已公开在 https://github.com/naver/anny-fit.
cs.CV / 57 / 2605.04730
ULF-Loc: Unbiased Landmark Feature for Robust Visual Localization with 3D Gaussian Splatting
ULF-Loc:一种无偏地标特征用于基于3D高斯喷溅的稳健视觉定位
Abstract
Visual localization is a core technology for augmented reality and autonomous navigation. Recent methods combine the efficient rendering of 3D Gaussian Splatting (3DGS) with feature-based localization. These methods rely on direct matching between 2D query features and the 3D Gaussian feature field, but this often results in mismatches due to an inherent bias in the learned Gaussian feature. We theoretically analyze the feature learning process in 3DGS, revealing that the widely adopted $\alpha$-blending optimization inherently introduces bias into 3D point features. This bias stems from the entanglement between individual Gaussians and their neighboring Gaussians, making the learned features unsuitable for precise matching tasks. Motivated by these findings, we propose ULF-Loc, an unbiased landmark feature framework that replaces biased feature optimization with geometry-weighted feature fusion. We further introduce keypoint-consensus landmark sampling to select reliable Gaussians and local geometric consistency verification to reject mismatches caused by rendering artifacts. On the Cambridge Landmarks dataset, ULF-Loc reduces the mean median translation error by 17% compared to the state-of-the-art, while achieving superior efficiency with only 1/10 the training time and 1/6 the GPU memory of STDLoc.
Chinese Translation
视觉定位是增强现实和自主导航的核心技术。最近的方法将高效渲染的3D高斯喷溅(3DGS)与基于特征的定位相结合。这些方法依赖于2D查询特征与3D高斯特征场之间的直接匹配,但由于学习到的高斯特征固有的偏差,这往往导致匹配错误。我们理论分析了3DGS中的特征学习过程,揭示广泛采用的$\alpha$-混合优化会在本质上向3D点特征引入偏差。这种偏差源于单个高斯与其邻近高斯之间的纠缠,使得学习到的特征不适用于精确的匹配任务。基于这些发现,我们提出了ULF-Loc,一个无偏地标特征框架,用几何加权特征融合替代有偏特征优化。我们进一步引入关键点一致性地标采样以选择可靠的高斯,并进行局部几何一致性验证以拒绝因渲染伪影而导致的匹配错误。在剑桥地标数据集上,ULF-Loc相比于最新的技术减少了17%的平均中位数平移误差,同时在训练时间仅为STDLoc的1/10和GPU内存的1/6的情况下实现了更优的效率。
cs.CV / 58 / 2605.04731
Morphology-Guided Cross-Task Coupling for Joint Building Height and Footprint Estimation
基于形态引导的跨任务耦合用于联合建筑高度和建筑轮廓估计
Abstract
Building height (BH) and building footprint (BF) jointly describe the vertical and horizontal extent of the built environment and are required inputs for urban climate, disaster-risk, and population-mapping models. The two parameters are coupled through floor-area-ratio (FAR) constraints, yet remote-sensing approaches typically treat them as independent regression targets. We argue that explicitly encoding this cross-task coupling is more impactful than further refining individual encoders, and propose MorphoFormer, a joint BH/BF estimation framework built around two complementary mechanisms: (i) a BF-Guided Task Decoder (BGTD) that gates the height branch via cross-attention on a footprint-derived morphology context, and (ii) a Morphology Consistency Loss (MCL) that supervises a height-from-footprint surrogate against the ground-truth BH, indirectly forcing the BF feature to encode height-correlated structure. The encoder is a single-stage Swin backbone fed by Sentinel-1 SAR, Sentinel-2 multispectral, and DEM inputs, trained and evaluated on a geo-blocked split of 54 cities. Against a Swin-MTL baseline at identical receptive field, MorphoFormer reduces BH test RMSE from 3.39 to 3.15 m (R^2 improves 0.62 -> 0.67) with BF R^2 stable at 0.80. Controlled ablations at identical capacity attribute most of this 0.24 m improvement to the two proposed mechanisms: removing BGTD raises BH RMSE by 0.11 m and removing MCL raises it by 0.11 m, with the residual approximately 0.02 m falling within the noise floor of encoder-side variations. Because both mechanisms act on cross-task representations rather than pixels, the design carries no intrinsic dependence on input resolution.
Chinese Translation
建筑高度(Building Height, BH)和建筑轮廓(Building Footprint, BF)共同描述了建成环境的垂直和水平范围,是城市气候、灾害风险和人口制图模型所需的输入。通过建筑面积比(Floor Area Ratio, FAR)约束,这两个参数是耦合的,但遥感方法通常将它们视为独立的回归目标。我们认为,明确编码这种跨任务耦合比进一步细化单个编码器更具影响力,因此提出了MorphoFormer,这是一种围绕两种互补机制构建的联合BH/BF估计框架:(i)一个BF引导的任务解码器(BF-Guided Task Decoder, BGTD),通过对基于轮廓的形态上下文的跨注意力控制高度分支;(ii)一个形态一致性损失(Morphology Consistency Loss, MCL),对比从轮廓推导的高度与真实的BH进行监督,间接迫使BF特征编码与高度相关的结构。编码器是一个单阶段的Swin骨干网络,输入为Sentinel-1 SAR、Sentinel-2多光谱和数字高程模型(Digital Elevation Model, DEM),在54个城市的地理分块上进行训练和评估。与相同接受场的Swin-MTL基线相比,MorphoFormer将BH测试RMSE从3.39减少到3.15米(R^2从0.62提高到0.67),而BF的R^2保持在0.80。在相同容量的控制消融实验中,大多数0.24米的提升归因于提出的两种机制:去除BGTD导致BH RMSE增加0.11米,去除MCL也使其增加0.11米,残差约为0.02米,落在编码器侧变动的噪声范围内。由于两种机制作用于跨任务表征而非像素,设计本身并不依赖于输入分辨率。
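The ablation figures quoted in the MorphoFormer abstract can be checked with a short worked sum. The decomposition below simply restates the reported numbers and assumes, as the abstract implies, that the two ablation effects can be read as additive contributions to the total improvement.

```latex
\begin{align*}
\Delta_{\text{total}} &= 3.39 - 3.15 = 0.24\ \text{m} \\
\Delta_{\text{total}} &\approx \underbrace{0.11}_{\text{BGTD}}
  + \underbrace{0.11}_{\text{MCL}}
  + \underbrace{0.02}_{\text{residual}}\ \text{m}
\end{align*}
```

Read this way, the two proposed mechanisms account for 0.22 m of the 0.24 m reduction in BH RMSE, with the remaining 0.02 m being the part the abstract attributes to encoder-side noise.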
cs.CV / 59 / 2605.04750
VC-FeS: Viewpoint-Conditioned Feature Selection for Vehicle Re-identification in Thermal Vision
VC-FeS:面向视角的特征选择用于热成像中的车辆重识别
Abstract
Identification of less-articulated objects using single-channel images, such as thermal images, is important in many applications, such as surveillance. However, in this domain, existing methods show poor performance due to high similarity among objects of the same category in the absence of color information (overlooking shape information) and de-emphasized texture information. Furthermore, variability in viewpoint adds more complexity as the features vary from side to side. We address these issues by constructing viewpoint-conditioned feature vectors and performing area-specific feature comparisons in separate feature spaces. These interventions enable leveraging the advancements of existing RGB-pre-trained ViT feature extractors while effectively adapting them to address the challenges specific to the thermal domain. We test our system on the RGBNT100 (IR) vehicle dataset and a thermal maritime dataset that we collected. Our results surpass the state-of-the-art methods by 19.7% and 12.8% in mAP on these two datasets, respectively. We also plan to release our thermal dataset, the first of its kind for maritime vessel identification.
Chinese Translation
使用单通道图像(如热成像)识别少关节物体在许多应用中(如监控)至关重要。然而,在这一领域,现有方法的性能较差,主要原因是同一类别物体之间由于缺乏颜色信息(忽视了形状信息)而具有高度相似性,同时纹理信息的强调程度降低。此外,视角的变化增加了复杂性,因为特征在不同侧面上会有所不同。我们通过构建面向视角的特征向量和特定区域的特征比较,针对这一问题提出了解决方案,这些特征在独立的特征空间中进行处理。这些干预措施使得我们能够利用现有RGB预训练的ViT特征提取器的进展,并有效地将其调整以应对热成像领域特有的挑战。我们使用RGBNT100 (IR) 车辆数据集和我们获取的热成像海事数据集对我们的系统进行了测试。我们的结果在上述数据集的mAP分数上分别超过了最先进的方法19.7%和12.8%。我们还计划将我们的热成像数据集公开,这是首个用于海洋船舶识别的数据集。
cs.CV / 60 / 2605.04752
Hybrid Congestion Classification Framework Using Flow-Guided Attention and Empirical Mode Decomposition
基于流指导注意力和经验模态分解的混合拥堵分类框架
Abstract
Accurate traffic congestion classification requires models that jointly capture roadway scene context and non-stationary traffic motion, yet most prior work treats these requirements in isolation. Vision-based methods often depend on appearance cues with standard temporal pooling, which can bias predictions toward static infrastructure, whereas signal-based approaches characterize temporal dynamics but lack the spatial context needed for scene-level localization. These complementary limitations motivate a unified framework that links motion evidence to spatial feature selection while preserving data-adaptive temporal characterization. This study therefore proposes FLO-EMD, a hybrid approach that couples motion-guided attention with empirical, data-driven temporal decomposition. Dense optical flow guides channel and spatial attention so that RGB features are refined toward motion-relevant regions. In parallel, aggregated flow statistics form compact motion traces that are decomposed using Empirical Mode Decomposition (EMD) to extract intrinsic temporal components. The resulting EMD embedding is fused with learned spatiotemporal representations to classify light, medium, and heavy congestion. Experiments on 1,050 five-second clips from four surveillance networks show that FLO-EMD achieves 97.5% overall test accuracy (weighted F1 = 0.9742), outperforming established baselines and remaining robust across diverse environmental conditions; ablation and sensitivity analyses further quantify the contributions of EMD, the number of intrinsic mode functions, and the selected motion descriptors.
Chinese Translation
准确的交通拥堵分类需要能够共同捕捉道路场景背景和非平稳交通运动的模型,但以前的大多数研究往往将这些要求视为孤立的问题。基于视觉的方法通常依赖于外观线索和标准的时间池化,这可能会使预测偏向于静态基础设施,而基于信号的方法则能表征时间动态,但缺乏进行场景级定位所需的空间背景。这些互补的局限性促成了一个统一框架的提出,该框架将运动证据与空间特征选择相连接,同时保留数据自适应的时间特征描述。因此,本研究提出了FLO-EMD,这是一种将运动导向注意力与经验驱动的时间分解相结合的混合方法。密集光流引导通道和空间注意力,以便将RGB特征精炼至与运动相关的区域。同时,聚合流统计形成紧凑的运动轨迹,并通过经验模态分解(EMD)进行分解,以提取内在的时间成分。最终得到的EMD嵌入与学习到的时空表示相融合,以对轻度、中度和重度拥堵进行分类。对来自四个监控网络的1,050个五秒片段的实验表明,FLO-EMD实现了97.5%的总体测试准确率(加权F1 = 0.9742),超越了既定的基准,并在多种环境条件下保持稳健;消融和敏感性分析进一步量化了EMD、内在模式函数的数量以及所选择的运动描述符的贡献。
cs.CV / 61 / 2605.04769
Lightweight Cross-Spectral Face Recognition via Contrastive Alignment and Distillation
轻量级跨光谱人脸识别:对比对齐与蒸馏方法
Abstract
Heterogeneous Face Recognition (HFR) aims at matching face images captured across different sensing modalities, such as thermal-to-visible or near-infrared-to-visible, enhancing the usability of face recognition systems in challenging real-world conditions. Although recent HFR methods have achieved significant improvements in performance, many rely on computationally expensive models, making them impractical for deployment on resource-limited edge devices. In this work, we introduce a lightweight yet effective HFR framework by adapting a hybrid CNN-Transformer model originally developed for RGB homogeneous face recognition. Our approach enables efficient end-to-end training with only a small amount of paired heterogeneous data, while still maintaining strong performance on standard RGB face recognition benchmarks. This makes it suitable for both homogeneous and heterogeneous settings. Comprehensive experiments on several challenging HFR and face recognition benchmarks show that our method achieves state-of-the-art or competitive performance while keeping computational requirements low.
Chinese Translation
异构人脸识别(HFR)旨在匹配在不同传感模式下捕获的人脸图像,如热成像与可见光或近红外与可见光,从而增强人脸识别系统在复杂现实条件下的适用性。虽然最近的HFR方法在性能上取得了显著提升,但许多方法依赖于计算量较大的模型,使得它们在资源受限的边缘设备上采用时变得不切实际。在本研究中,我们介绍了一种轻量级但有效的HFR框架,通过调整最初为RGB同质人脸识别开发的混合CNN-Transformer模型。我们的方法允许仅凭少量成对的异构数据进行高效的端到端训练,同时在标准RGB人脸识别基准上保持强劲的性能。这使得它适用于同质和异构场景。我们在多个具有挑战性的HFR和人脸识别基准上进行了全面实验,结果表明我们的方法实现了最先进或具竞争力的性能,并保持了较低的计算需求。
cs.CV / 62 / 2605.04770
Gaze4HRI: Zero-shot Benchmarking Gaze Estimation Neural-Networks for Human-Robot Interaction
Gaze4HRI:针对人机交互的零-shot眼动估计神经网络基准评估
Abstract
While zero-shot appearance-based 3D gaze estimation offers significant cost-efficiency by directly mapping RGB images to gaze vectors, its reliability in Human-Robot Interaction (HRI) settings remains uncertain. Existing benchmarks frequently overlook fundamental HRI conditions, such as dynamic camera viewpoints and moving targets in video. Furthermore, current cross-dataset evaluations often suffer from a complexity gap, where methods trained on diverse datasets are tested on significantly smaller and less varied sets, failing to assess true robustness. To bridge these gaps, we introduce Gaze4HRI, a large-scale dataset (50+ subjects, 3,000+ videos, 600,000+ frames) designed to evaluate state-of-the-art performance against critical HRI variables: illumination, head-gaze conflict, as well as the motion of camera and gaze target in video. Our benchmark reveals that all evaluated methods fail in at least one condition, identifying steeply-downward gaze as a universal failure point. Notably, PureGaze trained on the ETH-X-Gaze dataset uniquely maintains resilience across all other conditions. These results challenge the recent focus in the literature on complex spatial-temporal modeling and Transformer-based architectures. Instead, our findings suggest that extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments, while resilience-enhancing frameworks, such as PureGaze's self-adversarial loss for gaze feature purification, provide a substantial further improvement. Ultimately, this study establishes a rigorous benchmark that provides practical guidelines for practitioners as well as reshaping future research. The dataset and codes are available at https://gazeforhri.github.io.
Chinese Translation
虽然零-shot基于外观的3D眼动估计通过直接将RGB图像映射到眼动向量,提供了显著的成本效益,但其在人机交互(HRI)环境中的可靠性仍然不确定。现有基准往往忽视了基本的HRI条件,例如动态摄像头视点和视频中的移动目标。此外,当前的跨数据集评估常常面临复杂性差距,即在多样化数据集上训练的方法在显著较小且变化较少的数据集上进行测试,无法评估真实的鲁棒性。为了解决这些问题,我们引入了Gaze4HRI,这是一个大规模数据集(50多个受试者,3000多个视频,超过60万帧),旨在评估先进性能与关键HRI变量的关系:光照、头部-视线冲突,以及视频中摄像头和眼动目标的运动。我们的基准显示,所有评估方法在至少一种条件下均表现不佳,确认急剧向下的眼动是普遍的失败点。值得注意的是,基于ETH-X-Gaze数据集训练的PureGaze在所有其他条件下独特地保持了韧性。这些结果挑战了文献中近来对复杂时空建模和基于Transformer架构的关注。相反,我们的研究结果表明,数据多样性,正如ETH-X-Gaze数据集所示,是无约束环境中零-shot鲁棒性的主要动力,而增强韧性的框架,如PureGaze的用于眼动特征净化的自对抗损失,则提供了显著的额外改善。最终,本研究建立了一个严格的基准,为从业者提供实用指南,并重新塑造未来研究。数据集和代码可在 https://gazeforhri.github.io 获取。
cs.CV / 63 / 2605.04772
MIRAGE: Retrieval and Generation of Multimodal Images and Texts for Medical Education
MIRAGE:用于医学教育的多模态图像与文本检索与生成
Abstract
Access to diverse, well-annotated medical images with interactive learning tools is fundamental for training practitioners in medicine and related fields to improve their diagnostic skills and understanding of anatomical structures. While medical atlases are valuable, they are often impractical due to their size and lack of interactivity, whereas online image search may provide mislabeled or incomplete material. To address this, we propose MIRAGE, a multimodal medical text and image retrieval and generation system that allows users to find and generate clinically relevant images from trustworthy sources by mapping both text and images to a shared latent space, enabling semantically meaningful queries. The system is based on a fine-tuned medical version of CLIP (MedICaT-ROCO), trained with the ROCO dataset, obtained from PubMed Central. MIRAGE allows users to give prompts to retrieve images, generate synthetic ones through a medical diffusion model (Prompt2MedImage) and receive enriched descriptions from a large language model (Dolly-v2-3b). It also supports a dual search option, enabling the visual comparison of different medical conditions. A key advantage of the system is that it relies entirely on publicly available pretrained models, ensuring reproducibility and accessibility. Our goal is to provide a free, transparent and easy-to-use didactic tool for medical students, especially those without programming skills. The system features an interface that enables interactive and personalized visual learning through medical image retrieval and generation. The system is accessible to medical students worldwide without requiring local computational resources or technical expertise, and is currently deployed on Kaggle: http://www-vpu.eps.uam.es/mirage
Chinese Translation
获得多样化、标注良好的医学图像与互动学习工具对医学及相关领域从业者的培训是提高其诊断技能和解剖结构理解的基础。尽管医学图谱价值巨大,但由于其体积庞大且缺乏互动性,往往不够实用,而在线图像搜索可能提供错误标记或不完整的材料。为了解决这一问题,我们提出了MIRAGE,一个多模态医学文本与图像检索和生成系统,允许用户通过将文本和图像映射到共享的潜在空间来查找和生成来自可信来源的临床相关图像,从而实现语义上有意义的查询。该系统基于经过微调的医学版本CLIP (MedICaT-ROCO),并使用从PubMed Central获取的ROCO数据集进行训练。MIRAGE允许用户输入提示以检索图像,通过医学扩散模型(Prompt2MedImage)生成合成图像,并从大型语言模型(Dolly-v2-3b)接收丰富的描述。它还支持双重搜索选项,使不同医学状况的视觉比较成为可能。该系统的一大优势在于完全依赖公开可用的预训练模型,确保了重复性和可及性。我们的目标是为医学学生提供一个免费的、透明的且易于使用的教学工具,特别是那些没有编程技能的学生。该系统具有一个界面,使医学图像的检索与生成通过互动和个性化的视觉学习成为可能。该系统可供全球医学学生使用,无需本地计算资源或技术专长,目前已在Kaggle上部署: http://www-vpu.eps.uam.es/mirage
cs.CV / 64 / 2605.04844
QuadBox: Accelerating 3D Gaussian Splatting with Geometry-Aware Boxes
QuadBox: 使用几何感知盒加速3D高斯溅射
Abstract
3D Gaussian Splatting (3DGS) has emerged as an advanced technique for real-time novel view synthesis by representing scene geometry and appearance using differentiable Gaussian primitives. However, efficiently computing precise Gaussian-tile intersections remains a critical task in the rasterization pipeline. To this end, we propose QuadBox, a method that leverages four axis-aligned bounding boxes to tightly encapsulate projected Gaussians in a discrete manner. First, we derive a geometry-aware stretching factor that enables the construction of a tile-aligned QuadBox, which covers the elliptical projection and largely excludes irrelevant tiles. Second, we introduce QPass, a single-pass tile traversal algorithm that exhaustively exploits the discrete nature of QuadBox, ensuring that the tile intersection check is performed with simple interval tests. Experiments on public datasets show that our method accelerates the rendering speed of 3DGS by 1.85$\times$. Code is available at https://github.com/Powertony102/QuadBox.
Chinese Translation
3D高斯溅射(3DGS)已成为一种先进的实时新视图合成技术,通过使用可微分的高斯原语来表示场景的几何形状和外观。然而,在光栅化管线中,高效计算精确的高斯瓦片交集仍然是一个关键任务。为此,我们提出了QuadBox,一种利用四个轴对齐边界框以离散方式紧密包围投影高斯的算法。首先,我们推导出一种几何感知的拉伸因子,使得能够构建一个与瓦片对齐的QuadBox,该框覆盖椭圆形投影并大幅排除无关瓦片。其次,我们引入了QPass,这是一种单遍历瓦片的算法,充分利用QuadBox的离散特性,确保瓦片交集检查通过简单的区间测试完成。我们在公共数据集上的实验表明,我们的方法将3DGS的渲染速度加速了1.85倍。代码可在 https://github.com/Powertony102/QuadBox 获取。
cs.CV / 65 / 2605.04856
3D Ultrasound-Derived Pseudo-CT Synthesis Using a Transformer-Augmented Residual Network for Real-Time Operator Guidance
基于Transformer增强残差网络的3D超声衍生伪CT合成用于实时操作者引导
Abstract
Computed tomography (CT) is indispensable for clinical diagnosis and image-guided interventions but exposes patients to ionizing radiation, motivating the development of safer imaging alternatives. Ultrasound (US) is non-ionizing and widely accessible; however, it is highly operator dependent and lacks quantitative tissue characterization, often leading to diagnostic uncertainty and unnecessary CT examinations. This work presents a 3D ultrasound-derived pseudo-CT (UD-pCT) framework that generates CT-like anatomical reference volumes inferred from US, without aiming to reproduce physically accurate Hounsfield Units. Paired 3D kidney US and CT volumes from the TRUSTED dataset are first spatially aligned using a landmark-based multimodal registration pipeline, creating high-quality paired inputs for supervised training of an adversarial framework. The proposed Bottleneck Transformer Residual U-Net3D (BT-ResUNet3D) model employs a 3D residual encoder-decoder generator augmented with a transformer bottleneck, enabling effective modeling of fine-grained local anatomical structures as well as long-range volumetric dependencies, while a 3D Conditional PatchGAN discriminator enforces local structural realism in the synthesized pseudo-CT volumes. Quantitative evaluation using PSNR and SSIM demonstrates that the proposed method outperforms established baselines in structural fidelity and perceptual image quality. The UD-pCT volumes provide real-time anatomical reference for operator guidance, potentially reducing acquisition variability and unnecessary CT use. A limitation of this study is the relatively small paired dataset, which may limit the generalizability of the proposed model.
Chinese Translation
计算机断层扫描 (CT) 在临床诊断和影像引导干预中不可或缺,但会使患者暴露于电离辐射,这促使了更安全的影像替代方法的开发。超声 (US) 是非电离的并且广泛可获取;然而,它高度依赖操作者,并且缺乏定量的组织表征,常常导致诊断不确定性和不必要的CT检查。本研究提出了一个3D超声衍生伪CT (UD-pCT) 框架,该框架生成从US推断的CT类解剖参考体积,而不追求复现物理上精确的亨氏单位(Hounsfield Units)。首先使用基于标志点的多模态配准流程将来自TRUSTED数据集的配对3D肾脏US和CT体积进行空间对齐,从而为对抗框架的监督训练创建高质量的配对输入。所提出的Bottleneck Transformer Residual U-Net3D (BT-ResUNet3D) 模型采用了一个以Transformer瓶颈增强的3D残差编码器-解码器生成器,能够有效建模精细的局部解剖结构以及长距离的体积依赖,而3D条件PatchGAN判别器则在合成的伪CT体积中强制执行局部结构的真实感。使用PSNR和SSIM的定量评估表明,所提出的方法在结构保真度和感知图像质量上优于已有基线。UD-pCT体积为操作者引导提供实时解剖参考,可能减少采集变异性和不必要的CT使用。本研究的一个局限性是配对数据集相对较小,这可能限制所提出模型的泛化能力。
cs.CV / 66 / 2605.04870
VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA
VTAgent:面向证据感知的视频文本视觉问答的主动关键帧锚定
Abstract
Video text-based visual question answering (Video TextVQA) aims to answer questions by reasoning over visual textual content appearing in videos. Despite the strong multimodal video understanding capabilities of recent Video-LLMs, their performance on existing Video TextVQA benchmarks remains limited. To better understand this gap, we conduct an upper-bound analysis through frame-wise question answering, counting a sample as correct if any frame yields the right answer, which significantly outperforms direct video-based inference and reveals a substantial performance gap. The results suggest that the primary bottleneck lies in the localization of key question-relevant evidence, rather than in reasoning capacity itself. Building on this insight, we propose a question-guided agent framework that explicitly anchors the relevant keyframes before answering. The approach operates effectively in a training-free setting and consistently surpasses direct video inference. With additional supervised fine-tuning (SFT) and reinforcement learning (RL), it achieves an average improvement of +12.12 in accuracy and +11.15 in ANLS across benchmarks, establishing new state-of-the-art results. Our study underscores the critical role of explicit keyframe anchoring for advancing Video TextVQA. The code will be publicly released.
Chinese Translation
基于视频文本的视觉问答(Video TextVQA)旨在通过推理视频中出现的视觉文本内容来回答问题。尽管最近的视频大规模语言模型(Video-LLMs)在多模态视频理解方面表现出色,但它们在现有的视频文本视觉问答基准上的表现仍然有限。为了更好地理解这一差距,我们通过逐帧问答进行上限分析,如果任何帧给出正确答案,则将该样本视为正确,这一方法显著优于直接的视频推理,且揭示了显著的性能差距。结果表明,主要瓶颈在于定位与问题相关的关键证据,而非推理能力本身。基于这一见解,我们提出了一种问题引导的代理框架,该框架在回答之前明确地锚定相关的关键帧。这一方法在无训练设置下运行有效,并且始终优于直接视频推理。在额外的监督微调(SFT)和强化学习(RL)下,平均准确率提高了+12.12,ANLS提高了+11.15,创造了新的最先进结果。我们的研究强调了明确的关键帧锚定在推动视频文本视觉问答中的关键作用。代码将公开发布。
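The any-frame upper bound described in the abstract above (a sample counts as correct if at least one frame yields the right answer) can be sketched as follows. This is only an illustrative sketch: the function name, the exact-match normalization, and the toy data are assumptions, not the authors' evaluation code.

```python
def any_frame_accuracy(per_frame_answers, ground_truths):
    """Fraction of samples where at least one per-frame answer matches the truth."""
    def normalize(text):
        # Assumed normalization: lowercase and collapse whitespace.
        return " ".join(text.lower().split())

    if not ground_truths:
        return 0.0
    hits = 0
    for answers, truth in zip(per_frame_answers, ground_truths):
        if any(normalize(a) == normalize(truth) for a in answers):
            hits += 1
    return hits / len(ground_truths)


# Hypothetical per-frame model outputs for two questions.
frame_answers = [
    ["exit", "EXIT 12", "exit 12"],  # a later frame reads the full sign
    ["coffee", "cafe"],              # no frame matches
]
truths = ["Exit 12", "bakery"]
oracle_acc = any_frame_accuracy(frame_answers, truths)
```

Because the oracle credits any frame, it bounds from above what a perfect keyframe selector could achieve with the same per-frame answering model.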
cs.CV / 67 / 2605.04882
FairEnc: A Fair Vision-Language Model with Fair Vision and Text Encoders for Glaucoma Detection
FairEnc:一种公平的视觉-语言模型,配备公平的视觉和文本编码器用于青光眼检测
Abstract
Automated glaucoma detection is critical for preventing irreversible vision loss and reducing the burden on healthcare systems. However, ensuring fairness across diverse patient populations remains a significant challenge. In this paper, we propose FairEnc, a fair pretraining method for vision-language models (VLMs) that enables simultaneous debiasing across multiple sensitive attributes. FairEnc jointly mitigates biases in both textual and visual modalities with respect to multiple sensitive attributes, including race, gender, ethnicity, and language. Specifically, for the textual encoder, we leverage a large language model to generate synthetic clinical descriptions with varied sensitive attributes while preserving disease semantics, and employ a contrastive alignment objective to encourage demographic-invariant representations. For the visual encoder, we propose a dual-level fairness strategy that combines mutual information regularization to reduce statistical dependence between learned features and demographic groups, with multi-discriminator adversarial debiasing. Comprehensive experiments on the publicly available Harvard-FairVLMed dataset demonstrate that FairEnc effectively reduces demographic disparity as measured by DPD and DEOdds while achieving strong diagnostic performance under both zero-shot and linear probing evaluations. Additional experiments on the private FairFundus dataset show that FairEnc consistently preserves fairness advantages under cross-domain and cross-modality settings and maintains diagnostic performance within a competitive range. These results highlight FairEnc's ability to generalize fairness under distribution shifts, supporting its potential for more equitable deployment in real-world clinical settings. Our codebase and synthetic clinical notes are available at https://github.com/Mohamed-Elhabebe/FairEnc
Chinese Translation
自动化青光眼检测对于防止不可逆的视力丧失和减轻医疗系统的负担至关重要。然而,确保不同患者群体之间的公平性仍然是一个重大挑战。本文提出了 FairEnc,这是一种公平的视觉-语言模型(VLM)的预训练方法,能够在多个敏感属性上同时消除偏见。FairEnc 共同减轻了与多个敏感属性(包括种族、性别、民族和语言)相关的文本和视觉模态中的偏见。具体而言,对于文本编码器,我们利用大型语言模型生成具有不同敏感属性的合成临床描述,同时保持疾病语义,并采用对比对齐目标以促进人口学不变表示。对于视觉编码器,我们提出了一种双层公平策略,结合互信息正则化来减少学习特征与人口特征之间的统计依赖,同时采用多鉴别器对抗性去偏见。在公开可用的 Harvard-FairVLMed 数据集上的全面实验表明,FairEnc 能有效降低 DPD 和 DEOdds 测量的群体差异,同时在零样本和线性 probing 评估下表现出较强的诊断性能。在私有的 FairFundus 数据集上的额外实验显示,FairEnc 在跨域和跨模态设置中始终保持公平性优势,并在竞争范围内保持诊断性能。这些结果强调了 FairEnc 在分布变化下推广公平性的能力,支持其在现实临床环境中更公平部署的潜力。我们的代码库和合成临床笔记可在 https://github.com/Mohamed-Elhabebe/FairEnc 获取。
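The two fairness metrics named in the abstract above, DPD and DEOdds, can be sketched as below. This assumes the common definitions (the gap between the highest and lowest group positive-prediction rate, and the larger of the true-positive-rate and false-positive-rate gaps); the paper may aggregate differently, and all names and data here are hypothetical.

```python
def _rate(preds):
    return sum(preds) / len(preds) if preds else 0.0


def _gap(pairs):
    """Max minus min positive-prediction rate over groups; pairs = (pred, group)."""
    by_group = {}
    for pred, group in pairs:
        by_group.setdefault(group, []).append(pred)
    rates = [_rate(v) for v in by_group.values()]
    return max(rates) - min(rates)


def dpd(y_pred, groups):
    """Demographic parity difference over binary predictions."""
    return _gap(zip(y_pred, groups))


def deodds(y_true, y_pred, groups):
    """Difference in equalized odds: the larger of the TPR gap and the FPR gap."""
    gaps = []
    for label in (1, 0):  # label 1 gives the TPR gap, label 0 the FPR gap
        subset = [(p, g) for t, p, g in zip(y_true, y_pred, groups) if t == label]
        gaps.append(_gap(subset))
    return max(gaps)


# Toy data: two demographic groups with four samples each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
dpd_value = dpd(y_pred, groups)
deodds_value = deodds(y_true, y_pred, groups)
```

Lower values indicate smaller disparity between groups on both metrics.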
cs.CV / 68 / 2605.04904
Exploring Clustering Capability of Inpainting Model Embeddings for Pattern-based Individual Identification
探讨修复模型嵌入的聚类能力用于基于模式的个体识别
Abstract
In this paper, we explore deep learning techniques for individual identification of animals based on their skin patterns. Individual identification is crucial in biodiversity monitoring, since it enables analysis of decline or growth of populations, or intra-species interactions within populations. Models trained for the task of individual identification often do not focus on the skin pattern of animals, but on background details or body shape details. These characteristics are not individually specific, or can change drastically through time. We focus on techniques that will make machine learning models more responsive to skin pattern structure when extracting individual visual embeddings from images. For this, we explore image inpainting of task-specific masks as an auxiliary task to enhance ML-based individual identification from animal skin patterns. We propose a comparative analysis among four models as an encoder backbone for the individual identification task. We focus on the case study of zebrafish, which is a widely recognized biological model organism, and which exhibits individually identifying skin patterns. To evaluate encoder backbone performance, we present standard metrics for classification accuracy, embedding clustering metrics, and GradCAM visualizations.
Chinese Translation
本文探讨了基于动物皮肤图案的个体识别的深度学习技术。个体识别在生物多样性监测中至关重要,因为它能够分析种群的衰退或增长,以及种群内的种内互动。为个体识别任务训练的模型通常不专注于动物的皮肤图案,而是侧重于背景细节或身体形状细节。这些特征并不是个体特异的,或者会随着时间的推移发生剧烈变化。我们关注的技术旨在使机器学习模型在从图像提取个体视觉嵌入时,更加关注皮肤图案结构。为此,我们探索将任务特定掩膜的图像修复作为辅助任务,以增强基于机器学习的动物皮肤图案个体识别。我们对四个用作个体识别任务编码器骨干的模型进行了比较分析。我们以斑马鱼为案例研究,它是一种被广泛认可的模式生物,且具有可用于个体识别的皮肤图案。为了评估编码器骨干的性能,我们给出了分类准确率的标准指标、嵌入聚类指标以及GradCAM可视化。
cs.CV / 69 / 2605.04943
DART: A Vision-Language Foundation Model for Comprehensive Rope Condition Monitoring
DART:一种用于综合绳索状态监测的视觉-语言基础模型
Abstract
The condition monitoring (CM) of synthetic fibre ropes (SFRs) used in offshore, maritime, and industrial settings demands more than a classifier: inspectors need continuous severity estimates, maintenance recommendations, anomaly flags, deterioration timelines, and automated reports, all from a single inspection image. We present DART (Damage Assessment via Rope Transformer), a vision-language foundation model that addresses the full rope inspection workflow through a unified multi-task architecture. DART extends the Joint-Embedding Predictive Architecture (JEPA) to the cross-modal domain by coupling a Vision Transformer (ViT-H/14) with Llama-3.2-3B-Instruct via a Severity-Conditioned Cross-Modal Fusion (SC-CMF) module. Three architectural innovations drive the model's versatility: (1) HD-MASK, a saliency-guided masking strategy that focuses self-supervised reconstruction on damage-dense patches; (2) per-class learnable severity gates that adaptively weight language grounding by damage category; and (3) a Contrastive Damage Disentanglement (CDD) loss that shapes the embedding space to simultaneously encode damage type, severity ordering, and cross-modal semantics. Trained once on 4,270 images spanning 14 fine-grained rope damage classes, the frozen DART backbone supports downstream tasks without any task-specific fine-tuning: damage classification (93.22% accuracy, 91.04% macro-F1, +38.5 pp over a vision-only baseline), continuous severity regression (Spearman rho = 0.94, within-1-ordinal accuracy 99.6%), and few-shot recognition (89.2% macro-F1 at 20 shots). These results demonstrate that DART functions as a general-purpose CM backbone that goes well beyond classification, providing actionable inspection intelligence from a single shared representation.
Chinese Translation
合成纤维绳索(SFRs)在海上、航海和工业环境中的状态监测(CM)不仅仅需要分类器:检查员需要连续的严重程度评估、维护建议、异常标志、劣化时间线以及从单一检查图像生成的自动报告。我们提出了DART(通过绳索变换器进行损伤评估),这是一种视觉-语言基础模型,通过统一的多任务架构解决了完整的绳索检查工作流程。DART通过在Severity-Conditioned Cross-Modal Fusion(SC-CMF)模块中将视觉变换器(ViT-H/14)与Llama-3.2-3B-Instruct结合,扩展了联合嵌入预测架构(JEPA)到跨模态领域。三项架构创新推动了模型的多功能性:(1)HD-MASK,一种基于显著性引导的掩蔽策略,集中自监督重建于损伤密集区域;(2)类别可学的严重性门,自适应地根据损伤类别对语言基础进行加权;(3)对比损伤解耦(CDD)损失,塑造嵌入空间同时编码损伤类型、严重性排序和跨模态语义。在涵盖14个精细化绳索损伤类别的4,270幅图像上进行了一次训练,冻结的DART主干支持下游任务而无需任何任务特定的微调:损伤分类(93.22% 准确率,91.04% 宏F1值,比仅视觉基线提高38.5个百分点),连续严重性回归(斯皮尔曼相关系数为0.94,1序数内准确率99.6%),少样本识别(20样本情况下89.2%的宏F1值)。这些结果表明,DART作为一种通用的CM主干,远超分类功能,从单一共享表示提供可操作的检查智能。
cs.CV / 70 / 2605.04977
ICPR 2026 Competition on Privacy-Preserving Person Re-Identification from Top-View RGB-Depth Camera (TVRID)
ICPR 2026隐私保护顶视RGB-深度摄像头下的人重识别竞赛(TVRID)
Abstract
This companion paper reports the ICPR 2026 TVRID competition on privacy-aware top-view person re-identification. We present the competition setting, the released RGB-Depth dataset, and a summary of final results with descriptions of the top entries. TVRID contains 86 identities captured by four synchronized overhead Intel RealSense D455 cameras, with paired RGB/Depth streams and structured geometric variation across flat, ascent, descent, and oblique viewpoints. The evaluation protocol includes three tracks: RGB Re-ID, Depth Re-ID, and RGB$\leftrightarrow$Depth cross-modal retrieval. Submissions are ranked using mAP and CMC-1 under a unified server-side evaluation. The final results show a clear difficulty ordering (RGB $>$ Depth $>$ Cross-Modal), highlighting both the challenge of modality-constrained retrieval and the feasibility of strong performance with modality-invariant learning. By releasing the dataset at https://zenodo.org/records/17909410, the evaluation scripts at https://github.com/RaphaelDel/ICPR-TVRID, and the accompanying documentation, TVRID establishes a reproducible benchmark for top-view, depth-based, and cross-modal person re-id.
Chinese Translation
本文报告了ICPR 2026 TVRID竞赛,该竞赛聚焦于隐私意识的顶视人重识别。我们介绍了竞赛设置、发布的RGB-深度数据集以及最终结果的总结,并对顶级参赛作品进行了描述。TVRID数据集中包含86个身份,由四台同步的Intel RealSense D455摄像头捕获,具备配对的RGB/深度流,以及在平面、上升、下降和倾斜视角下的结构化几何变化。评估协议包括三个赛道:RGB重识别、深度重识别和RGB$\leftrightarrow$深度跨模态检索。提交的结果通过统一的服务器端评估使用mAP和CMC-1进行排名。最终结果显示出一个明显的困难排序(RGB > 深度 > 跨模态),突显了模态约束检索的挑战,以及通过模态不变学习实现强大表现的可行性。通过在https://zenodo.org/records/17909410发布数据集、在https://github.com/RaphaelDel/ICPR-TVRID发布评估脚本,并提供相关文档,TVRID建立了一个可复现的基准,用于顶视、基于深度的以及跨模态的人重识别。
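The two ranking metrics mentioned in the abstract above, mAP and CMC-1, can be sketched as follows. This is a simplified sketch under stated assumptions: each query comes with a gallery already sorted by similarity, and camera or modality filtering (which a real re-identification protocol such as the official TVRID scripts may apply) is omitted. All names are hypothetical.

```python
def average_precision(ranked_ids, query_id):
    """AP for one query: mean of the precision values at each correct match."""
    hits = 0
    precision_sum = 0.0
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def evaluate(rankings):
    """rankings: list of (query_id, gallery ids sorted best match first)."""
    aps = [average_precision(ranked, query) for query, ranked in rankings]
    rank1 = [1.0 if ranked and ranked[0] == query else 0.0
             for query, ranked in rankings]
    return sum(aps) / len(aps), sum(rank1) / len(rank1)


rankings = [
    ("p1", ["p1", "p2", "p1", "p3"]),  # AP = (1/1 + 2/3) / 2
    ("p2", ["p3", "p2", "p1", "p1"]),  # AP = 1/2, top-1 is a miss
]
mean_ap, cmc1 = evaluate(rankings)
```

CMC-1 only checks the top-ranked gallery item, while mAP rewards placing every correct match early, which is why the two can diverge.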
cs.CV / 71 / 2605.04985
Attention-Based Chaotic Self-Supervision for Medical Image Classification
基于注意力的混沌自监督学习在医学图像分类中的应用
Abstract
Deep learning models for medical image classification usually achieve promising results but typically rely on large, annotated datasets or standard transfer learning from ImageNet. Self-Supervised Learning (SSL) has emerged as a powerful alternative, yet common methods like masked autoencoders (MAEs) may inadvertently destroy fine-grained diagnostic features by using random masking. In this paper, we propose a novel SSL pre-training strategy, the Chaotic Denoising Autoencoder (CDAE). Instead of masking, we apply a chaotic transformation to the input image, tasking an autoencoder to reconstruct the original. We hypothesize this forces the encoder to learn robust, domain-specific features by "inverting the chaos". Furthermore, we propose an attentive fusion mechanism that combines features from our CDAE-trained encoder with a standard encoder, leveraging the strengths of both general and domain-specific representations. Our method is evaluated on two public medical datasets: ISIC 2018 (skin lesions) and APTOS 2019 (diabetic retinopathy). The proposed model achieves high performance, with an accuracy of 0.9221 and an F1-macro of 0.8530 on ISIC 2018, and an accuracy of 0.8644 and F1-macro of 0.7433 on APTOS 2019, demonstrating the efficacy of our approach.
Chinese Translation
用于医学图像分类的深度学习模型通常能够取得良好的效果,但通常依赖于大型标注数据集或来自ImageNet的标准迁移学习。自监督学习(Self-Supervised Learning, SSL)作为一种强有力的替代方案逐渐崭露头角,然而常见的方法如掩蔽自编码器(Masked Autoencoders, MAEs)可能会通过使用随机掩蔽不经意间破坏细粒度的诊断特征。本文提出了一种新颖的SSL预训练策略——混沌去噪自编码器(Chaotic Denoising Autoencoder, CDAE)。我们不使用传统的掩蔽,而是对输入图像施加混沌变换,要求自编码器重建原始图像。我们假设这迫使编码器通过“逆转混沌”来学习鲁棒的领域特定特征。此外,我们提出了一种关注融合机制,将我们CDAE训练的编码器提取的特征与标准编码器的特征结合起来,发挥了通用表示和领域特定表示的优势。我们使用两个公开医学数据集进行评估:ISIC 2018(皮肤病变)和APTOS 2019(糖尿病视网膜病变)。所提模型在ISIC 2018上取得了0.9221的高准确率和0.8530的F1宏观值,在APTOS 2019上取得了0.8644的准确率和0.7433的F1宏观值,证明了我们方法的有效性。
cs.CV / 72 / 2605.04989
Low-Rank Adaptation of Geospatial Foundation Models for Wildfire Mapping Using Sentinel-2 Data
基于 Sentinel-2 数据的地理空间基础模型低秩适应用于野火烧毁面积制图
Abstract
Wildfire burned-area mapping is essential for damage assessment, emissions modeling, and understanding fire-climate interactions across diverse ecological regions. Recent geospatial foundation models provide strong general-purpose representations for satellite imagery, yet there is still no clear understanding of how to efficiently adapt these models for downstream Earth observation tasks, particularly under geographic and temporal domain shift. This study evaluates three state-of-the-art Geospatial Foundation Models (GFMs) - Terramind, DINOv3, and Prithvi-v2 - for burned-area mapping across the United States and Canada using Sentinel-2 data. Leveraging 3,820 wildfire events from 2017-2023, we conduct spatial and temporal generalization tests across diverse biomes. We systematically compare full fine-tuning, decoder-only fine-tuning, and Low-Rank Adaptation (LoRA) for adapting each model. Across all experiments, LoRA provides the strongest cross-domain generalization while updating less than 1% of parameters, demonstrating a favorable trade-off between accuracy and efficiency. Prithvi-v2 with LoRA achieves the highest overall accuracy and the largest improvement compared to full fine-tuning. These findings indicate that geospatial foundation models, when adapted using lightweight parameter-efficient methods such as LoRA, offer a robust and scalable solution for large-scale burned-area mapping. Code is available at https://github.com/alishibli97/wildfire-lora-gfm.
Chinese Translation
野火烧毁面积的制图对损失评估、排放建模以及理解不同生态区域的火灾气候相互作用至关重要。近年来的地理空间基础模型为卫星图像提供了强大的通用表示,但仍然缺乏有效适应这些模型以应对下游地球观测任务(特别是在地理和时间域转移下)的清晰理解。本研究评估了三种最新的地理空间基础模型(Geospatial Foundation Models, GFMs)——Terramind、DINOv3 和 Prithvi-v2,用于利用 Sentinel-2 数据在美国和加拿大进行烧毁面积制图。利用2017年至2023年间的3820起野火事件,我们在多样化生物群落上进行空间和时间的泛化测试。我们系统性比较了完全微调、仅解码器微调以及低秩适应(Low-Rank Adaptation, LoRA)为每个模型进行适应。在所有实验中,LoRA提供了最强的跨域泛化,同时更新的参数少于1%,展示了准确性与效率之间的良好平衡。使用LoRA的Prithvi-v2达到了最高的总体准确性,并相较于完全微调有了最大的提升。这些发现表明,当采用轻量级参数高效方法(如LoRA)进行适应时,地理空间基础模型为大规模烧毁面积制图提供了可靠且可扩展的解决方案。代码可在https://github.com/alishibli97/wildfire-lora-gfm 获取。
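The Low-Rank Adaptation idea referenced in the abstract above can be illustrated with a small, dependency-free sketch: a frozen weight matrix W receives an additive update (alpha / r) * B A, and only A and B are trained. The layer sizes, rank, and helper names below are hypothetical and are not taken from the paper's configuration.

```python
def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, A and B are trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_fraction(d_in, d_out, rank):
    """Trainable LoRA parameters relative to the full d_out x d_in matrix."""
    return rank * (d_in + d_out) / (d_in * d_out)


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (toy 2x2)
A = [[1.0, 1.0]]               # rank-1 down-projection, shape r x d_in
B_zero = [[0.0], [0.0]]        # B starts at zero, so the model is unchanged
B_trained = [[1.0], [2.0]]     # hypothetical values after training
x = [1.0, 2.0]

y_init = lora_forward(x, W, A, B_zero, alpha=2.0, rank=1)
y_adapted = lora_forward(x, W, A, B_trained, alpha=2.0, rank=1)
fraction = lora_trainable_fraction(1024, 1024, 4)
```

For a hypothetical 1024 x 1024 layer at rank 4, the trainable fraction is under 1%, which is in the same spirit as the parameter budget reported in the abstract.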
cs.CV / 73 / 2605.05012
Chaotic Contrastive Learning for Robust Texture Classification
用于鲁棒纹理分类的混沌对比学习
Abstract
Texture classification is a pivotal task in computer vision, presenting unique challenges due to high inter-class similarity and the sensitivity of structural patterns to scale and illumination changes. While Convolutional Neural Networks (CNNs) and recent Vision Transformers have set performance benchmarks, they often require extensive labeled datasets or struggle to generalize across domains due to an over-reliance on color and shape features. This paper introduces a novel framework that synergizes Self-Supervised Learning (SSL) with deterministic chaotic dynamics. We propose a chaotic contrastive pre-training strategy, where pixel-wise chaotic maps, specifically Logistic, Tent, and Sine maps, act as non-linear data augmentation techniques. These chaotic perturbations, grounded in ergodic theory, force the network to learn topologically robust features by mimicking complex environmental noise and reflectance variations. Furthermore, we introduce an attention-based feature ensemble that fuses high-level semantic representations from a supervised large backbone with low-frequency structural features from a chaos-pretrained tiny encoder. Experimental results on six texture benchmarks (FMD, UMD, KTH-TIPS2-b, DTD, GTOS, and 1200Tex) demonstrate the superiority of the proposed method, outperforming state-of-the-art approaches and achieving promising accuracies on all the analyzed datasets.
Chinese Translation
纹理分类是计算机视觉中的一项关键任务,由于类间相似性高以及结构模式对尺度和光照变化的敏感性,带来了独特的挑战。虽然卷积神经网络(CNNs)和最近的视觉变换器(Vision Transformers)已设定了性能基准,但它们常常需要大量标注数据集,或因对颜色和形状特征的过度依赖而难以在不同领域之间进行泛化。本文提出了一种新颖框架,将自监督学习(Self-Supervised Learning, SSL)与确定性混沌动力学相结合。我们提出了一种混沌对比预训练策略,其中逐像素混沌映射,具体包括Logistic、Tent和Sine映射,作为非线性数据增强技术。这些混沌扰动基于遍历理论,迫使网络通过模拟复杂的环境噪声和反射变化来学习拓扑鲁棒特征。此外,我们还引入了一种基于注意力的特征集成,融合了来自监督大型骨干网络的高层语义表征和来自混沌预训练小型编码器的低频结构特征。在六个纹理基准(FMD、UMD、KTH-TIPS2-b、DTD、GTOS和1200Tex)上的实验结果表明,所提方法优于最先进的方法,并在所有分析的数据集上实现了可喜的准确率。
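The three chaotic maps named in the abstract above (Logistic, Tent, and Sine) have standard textbook forms, and a pixel-wise augmentation built on them might look like the sketch below. The parameter values, the number of iterations, and the way the maps are applied to normalized intensities are assumptions for illustration; the paper's exact procedure is not given in the abstract.

```python
import math


def logistic_map(x, r=4.0):
    return r * x * (1.0 - x)


def tent_map(x, mu=2.0):
    return mu * min(x, 1.0 - x)


def sine_map(x, a=1.0):
    return a * math.sin(math.pi * x)


def chaotic_augment(image, chaos_map, steps=1):
    """Iterate a chaotic map on every pixel of an image with values in [0, 1]."""
    augmented = []
    for row in image:
        new_row = []
        for value in row:
            for _ in range(steps):
                value = chaos_map(value)
            new_row.append(min(1.0, max(0.0, value)))  # clamp against rounding
        augmented.append(new_row)
    return augmented


image = [[0.0, 0.25], [0.5, 1.0]]  # toy 2x2 grayscale image
view_a = chaotic_augment(image, logistic_map)
view_b = chaotic_augment(image, tent_map)
```

In a contrastive setup, two such views of the same image would be treated as a positive pair, so the encoder has to rely on structure that survives the non-linear perturbation.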
cs.CV / 74 / 2605.05014
CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography
CARD:用于复杂道路地形的稠密3D重建的多模态汽车数据集
Abstract
Autonomous driving must operate across diverse surfaces to enable safe mobility. However, most driving datasets are captured on well-paved flat roads. Moreover, recent driving datasets primarily provide sparse LiDAR ground truth for images, which is insufficient for assessing fine-grained geometry in depth estimation and completion. To address these gaps, we introduce CARD, a multi-modal driving dataset that delivers quasi-dense 3D ground truth across continuous sequences rich in speed bumps, potholes, irregular surfaces and off-road segments. Our sensor suite includes synchronized global-shutter stereo cameras, front and rear LiDARs, 6-DoF poses from LiDAR-inertial odometry, per-wheel motion traces, and full calibration. Notably, our multi-LiDAR fusion yields ~500K valid depth pixels per frame, about 6.5x more than KITTI Depth Completion and 10x more on average than other public driving datasets. The dataset spans ~110 km and 4.7 hours across Germany and Italy. In addition, CARD provides 2D bounding boxes targeting road-topography irregularities, enabling accurate benchmarking for both geometry and perception tasks. Furthermore, we establish a standardized evaluation protocol for road surface irregularities on CARD and benchmark state-of-the-art depth estimation models to provide strong baselines. The CARD dataset is hosted on https://huggingface.co/CARD-Data.
Chinese Translation
自动驾驶必须在多样的路面上运行,以实现安全的移动。然而,大多数驾驶数据集都是在铺装良好的平坦道路上捕获的。此外,近期的驾驶数据集主要提供稀疏的LiDAR地面真值,这对于评估深度估计和补全中的细致几何形状是不够的。为了解决这些问题,我们引入了CARD,一个多模态驾驶数据集,提供在丰富的减速带、坑洼、不规则表面和越野段的连续序列中的准稠密3D地面真值。我们的传感器设备包括同步的全球快门立体摄像头、前后LiDAR、来自LiDAR-惯性里程计的六自由度姿态、每个轮子的运动轨迹以及全面的标定。值得注意的是,我们的多LiDAR融合每帧生成约50万有效深度像素,约是KITTI深度补全的6.5倍,且平均是其他公共驾驶数据集的10倍。该数据集覆盖约110公里,耗时4.7小时,包含德国和意大利的多种场景。此外,CARD还提供针对道路地形不规则性的二维边界框,便于对几何和感知任务进行准确的基准测试。此外,我们还建立了CARD上道路表面不规则性的标准化评估协议,并对最先进的深度估计模型进行了基准测试,以提供强有力的基线。CARD数据集托管在https://huggingface.co/CARD-Data上。
cs.CV / 75 / 2605.05026
Local Intrinsic Dimension Unveils Hallucinations in Diffusion Models
局部内在维度揭示扩散模型中的幻觉
Abstract
Diffusion models are prone to generating structural hallucinations - samples that match the statistical properties of the training data yet defy underlying structural rules, resulting in anomalies like hands with more than five fingers. Recent research studied this failure mode from several viewpoints, offering partial explanations to their occurrence, such as mode interpolation. In this work, we propose a complementary perspective that treats hallucinations as instabilities on the model-induced manifold. We begin by showing that a hallucination filter based on such instabilities matches or exceeds the performance of the recently proposed temporal one. By tracing the source of these instabilities, we identify local intrinsic dimension (LID) as their primary driver and propose Intrinsic Quenching (IQ), a direct corrective mechanism that deflates it to alleviate hallucinations. IQ consistently outperforms standard hallucination reduction baselines across a wide array of benchmarks and offers a highly promising solution for enforcing anatomical consistency in downstream medical imaging tasks.
Chinese Translation
扩散模型容易产生结构幻觉——即样本符合训练数据的统计特性,但违背潜在的结构规则,导致出现如手指超过五个等异常现象。近期的研究从多个角度探讨了这种失效模式,提供了部分解释,如模态插值。在本研究中,我们提出一种互补视角,将幻觉视为模型诱导流形上的不稳定性。我们首先展示了基于这些不稳定性的幻觉过滤器的性能与最近提出的时间过滤器相当或更优。通过追踪这些不稳定性的来源,我们识别出局部内在维度(Local Intrinsic Dimension, LID)作为其主要驱动因素,并提出内在淬火(Intrinsic Quenching, IQ),这是一种直接的修正机制,通过降低LID来缓解幻觉。IQ在多种基准测试中始终超越标准幻觉减少基线,且为在下游医学成像任务中强制执行解剖一致性提供了极具前景的解决方案。
cs.CV / 76 / 2605.05027
Prompt-Anchored Vision-Text Distillation for Lifelong Person Re-identification
基于提示锚定的视觉-文本蒸馏用于终身人物重识别
Abstract
Lifelong person re-identification (LReID) aims to train a generalizable model with sequentially collected data. However, such models often suffer from semantic drift, limited adaptability, and catastrophic forgetting as new domains emerge. Existing exemplar-free approaches largely rely on visual-only distillation or parameter regularization, while overlooking the potential of auxiliary modalities, such as text, to preserve semantic stability and enable incremental plasticity. We observe that the frozen text encoder in pretrained vision-language models can serve as a stable semantic anchor across domains. To decouple the roles of vision and text, we propose Prompt-Anchored vision-text Distillation (PAD), an asymmetric vision-text framework for semantic alignment and cross-domain generalization. On the textual side, we distill prompts to preserve vision-text alignment under a fixed semantic space, acting as a global semantic reference rather than a dominant learning signal. On the visual side, an EMA-based teacher with an adaptive prompt pool enables domain-wise adaptation by allocating new slots while freezing past ones. Extensive experiments show that PAD substantially outperforms state-of-the-art methods across seen and unseen domains, achieving a strong balance between stability and plasticity. Project page is available at https://github.com/zu-zi/PAD.
Chinese Translation
终身人物重识别(LReID)旨在训练一个具有普适性和适应性的模型,以处理不断收集的数据。然而,这类模型往往面临语义漂移、适应性有限和灾难性遗忘等问题,尤其是在新领域出现时。现有的无示例方法主要依赖于仅基于视觉的蒸馏或参数正则化,而忽视了辅助模态(如文本)在保持语义稳定性和实现增量可塑性方面的潜力。我们观察到,在预训练的视觉-语言模型中,固定的文本编码器可以在不同领域中充当稳定的语义锚。为了将视觉和文本的作用解耦,我们提出了基于提示锚定的视觉-文本蒸馏(PAD),这是一种不对称的视觉-文本框架,旨在实现语义对齐和跨领域泛化。在文本方面,我们蒸馏提示,以在固定的语义空间中保持视觉-文本对齐,充当全局语义参考,而不是主导学习信号。在视觉方面,基于EMA的教师模型结合自适应提示池,通过分配新的插槽并冻结过去的插槽,实现了领域间的适应性。大量实验表明,PAD在已见和未见领域上显著超越了最先进的方法,实现了稳定性和可塑性之间的良好平衡。项目页面可访问 https://github.com/zu-zi/PAD。
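The EMA-based teacher mentioned in the abstract above follows a widely used pattern: the teacher's parameters track an exponential moving average of the student's. A minimal sketch, assuming scalar parameters stored in dictionaries and a hypothetical momentum value, is:

```python
def ema_update(teacher, student, momentum=0.99):
    """Return teacher parameters moved toward the student by (1 - momentum)."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
for _ in range(2):
    teacher = ema_update(teacher, student, momentum=0.5)
```

A momentum close to 1 keeps the teacher stable across domains, while a smaller value lets it follow the student more quickly; the adaptive prompt pool described in the abstract is not modeled here.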
cs.CV / 77 / 2605.05031
Computer-Aided Design Generation by Cascaded Discrete Diffusion Model
利用级联离散扩散模型进行计算机辅助设计生成
Abstract
Recent deep learning approaches seek to automate CAD creation by representing a model as a sequence of discrete commands and parameters, and then generating them using autoregressive models or continuous diffusion operating in Euclidean embedding space. However, continuous diffusion perturbs representations in a continuous Euclidean domain that does not reflect the inherently discrete and heterogeneous nature of CAD tokens, often producing perturbed representations that map to semantically invalid symbols. To overcome this limitation, we propose a cascaded discrete diffusion framework for CAD generation, which consists of a command diffusion for generating CAD commands and a parameter diffusion conditioned on CAD commands. Unlike isotropic Gaussian perturbation, the forward process of our approach operates directly over categorical token distributions using delicate transition matrices. For commands, we adopt an absorbing-state transition matrix that progressively corrupts tokens to a designated symbol; for parameters, we introduce specific transition matrices tailored to heterogeneous attributes: a Gaussian kernel for coordinate continuity, a scale-invariant kernel for dimensional values, and a prior-preserving kernel for boolean attributes. The reverse process is achieved by two denoising networks: a Transformer-based encoder for command recovery, and a parameter network with extra local self-attention for command-level interaction and cross-attention for conditional injection. Experiments on the DeepCAD dataset show that the proposed approach surpasses existing autoregressive and continuous diffusion models on unconditional generation metrics, while qualitative results validate effective controllability in conditional generation tasks. Source codes will be released.
Chinese Translation
最近的深度学习方法旨在通过将模型表示为一系列离散指令和参数来自动化计算机辅助设计(CAD)的创建,并通过自回归模型或在欧几里得嵌入空间中运行的连续扩散进行生成。然而,连续扩散会在一个连续的欧几里得域中扰动表示,无法反映CAD标记固有的离散和异质特性,通常会产生映射到语义上无效符号的扰动表示。为克服这一限制,我们提出了一种用于CAD生成的级联离散扩散框架,其中包含用于生成CAD指令的指令扩散和基于CAD指令的条件参数扩散。与各向同性高斯扰动不同,我们的方法的前向过程直接在分类标记分布上操作,使用精细的转移矩阵。对于指令,我们采用一种吸收态转移矩阵,该矩阵逐步损坏标记至指定符号;对于参数,我们引入了为异质属性量身定制的特定转移矩阵:用于坐标连续性的高斯核、用于维度值的尺度不变核以及用于布尔属性的先验保持核。反向过程通过两个去噪网络实现:一个基于Transformer的编码器用于指令恢复,另一个参数网络具有额外的局部自注意力,用于指令级交互和条件注入的交叉注意力。对DeepCAD数据集的实验表明,所提出的方法在无条件生成指标上超越了现有的自回归和连续扩散模型,而定性结果验证了在条件生成任务中的有效可控性。源代码将会发布。
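The absorbing-state forward process mentioned for CAD commands can be sketched as follows. This is a generic single-step absorbing transition matrix of the kind used in discrete diffusion, not the authors' code; the vocabulary size `K`, the index `MASK` of the designated symbol, and the corruption rate `beta` are illustrative assumptions.

```python
import numpy as np

def absorbing_transition_matrix(num_tokens, mask_id, beta):
    """Q[i, j] = P(x_t = j | x_{t-1} = i) for one absorbing-state step.

    Each token stays unchanged with probability 1 - beta and jumps to the
    designated symbol `mask_id` with probability beta; `mask_id` never leaves.
    """
    q = (1.0 - beta) * np.eye(num_tokens)
    q[:, mask_id] += beta
    return q

def corrupt(tokens, q, rng):
    """Sample one forward step independently for every token in the sequence."""
    return np.array([rng.choice(len(q), p=q[t]) for t in tokens])

rng = np.random.default_rng(0)
K, MASK = 5, 4                      # hypothetical vocabulary; last id absorbs
Q = absorbing_transition_matrix(K, MASK, beta=0.3)
x0 = np.array([0, 1, 2, 3, 1])      # toy command sequence
x1 = corrupt(x0, Q, rng)
```

Every row of `Q` sums to one, and after a step each token is either unchanged or replaced by the designated symbol. The attribute-specific parameter kernels described in the abstract (Gaussian, scale-invariant, prior-preserving) would replace `Q` with different matrices and are not sketched here.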
cs.CV / 78 / 2605.05034
Few-Shot Learning Pipeline for Monkeypox Skin Disease Classification Using CNN Feature Extractors
基于卷积神经网络特征提取器的猴痘皮肤病分类少样本学习流程
Abstract
Despite the strong performance of Convolutional Neural Networks (CNNs) in disease classification, their effectiveness often depends on access to large annotated datasets, which is an impractical requirement for emerging or rare conditions such as Monkeypox. To overcome this limitation, we propose a few-shot learning (FSL) framework that employs SimpleShot, a lightweight, non-parametric, inductive classifier, for Monkeypox and pox-like skin disease recognition from limited labeled examples. The proposed pipeline passes the skin lesion images through a frozen, pretrained CNN backbone to obtain feature embeddings, which are then classified via SimpleShot using nearest-centroid comparisons in a normalized embedding space. We systematically benchmark six widely used CNN backbones as feature extractors under consistent experimental settings, enabling fair comparison. Experiments on three publicly available datasets (MSLD v1.0, MSID, and MSLD v2.0) are conducted across 2-way, 4-way, and 6-way tasks with 1-shot, 5-shot, and 10-shot configurations. Among all models, MobileNetV2_100 consistently achieves the highest accuracy. In addition, we present a cross-dataset evaluation for Monkeypox classification, revealing that binary Mpox-vs-Others transfer remains comparatively stable while multi-class performance degrades significantly under domain shift. Together, these results demonstrate the practical utility of combining inductive FSL methods with lightweight CNN backbones and highlight the importance of domain robustness for reliable real-world clinical deployment.
Chinese Translation
尽管卷积神经网络(CNN)在疾病分类中表现出色,但它们的有效性往往依赖于大量标注数据集,而这对于猴痘等新发或罕见病症来说是不切实际的要求。为了克服这一限制,我们提出了一种少样本学习(FSL)框架,该框架采用SimpleShot(一种轻量级、非参数的归纳式分类器),从有限的标注样本中识别猴痘及痘样皮肤病。所提流程将皮肤病变图像输入一个冻结的预训练CNN骨干网络以获取特征嵌入,然后由SimpleShot在归一化嵌入空间中通过最近质心比较进行分类。我们在一致的实验设置下系统地对六种广泛使用的CNN骨干网络作为特征提取器进行基准测试,以便进行公平比较。我们在三个公开可用的数据集(MSLD v1.0、MSID和MSLD v2.0)上,针对2-way、4-way和6-way任务以及1-shot、5-shot和10-shot配置进行了实验。在所有模型中,MobileNetV2_100始终取得最高的准确率。此外,我们还给出了猴痘分类的跨数据集评估,结果显示,二分类的Mpox-vs-Others迁移保持相对稳定,而多分类性能在领域偏移下显著下降。这些结果共同表明了将归纳式FSL方法与轻量级CNN骨干网络相结合的实用价值,并强调了领域鲁棒性对于可靠的真实世界临床部署的重要性。
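The classification step described in the abstract above (nearest-centroid comparison in a normalized embedding space) can be sketched in a few lines. This is a simplified illustration under stated assumptions: the embeddings are taken as already extracted by a frozen backbone, only L2 normalization is applied (the mean-centering variant of SimpleShot is omitted), and the function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each feature vector to unit Euclidean length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def nearest_centroid_predict(support_feats, support_labels, query_feats):
    """Classify queries by the closest class centroid in a normalized space."""
    support = l2_normalize(support_feats)
    queries = l2_normalize(query_feats)
    classes = np.unique(support_labels)
    # One centroid per class: the mean of that class's normalized support features.
    centroids = np.stack([support[support_labels == c].mean(axis=0) for c in classes])
    # Pairwise Euclidean distances, shape (num_queries, num_classes).
    dists = np.linalg.norm(queries[:, None, :] - centroids[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]

# Toy 2-way 2-shot episode with 2-D "embeddings" (made-up numbers).
support_feats = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]])
support_labels = np.array([0, 0, 1, 1])
query_feats = np.array([[2.0, 0.2], [0.1, 3.0]])
pred = nearest_centroid_predict(support_feats, support_labels, query_feats)
```

Because no parameters are trained, swapping the backbone only changes how `support_feats` and `query_feats` are produced, which is what makes a like-for-like backbone comparison straightforward.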
cs.CV / 79 / 2605.05045
When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
当关系破裂:分析视觉-语言模型在旋转和噪声下的关系幻觉
Abstract
Vision-language models (VLMs) achieve strong multimodal performance but remain prone to relation hallucination, a failure that arises in tasks requiring accurate reasoning over inter-object interactions. We study the impact of visual perturbations, specifically rotation and noise, and show that even mild distortions significantly degrade relational reasoning across models and datasets. We further evaluate prompt-based augmentation and preprocessing strategies (orientation correction and denoising), finding that while they offer partial improvements, they do not fully resolve hallucinations. Our results reveal a gap between perceptual robustness and relational understanding, highlighting the need for more robust, geometry-aware VLMs.
Chinese Translation
视觉-语言模型(VLMs)在多模态任务上表现出色,但仍然容易出现关系幻觉,这类错误出现在需要对物体间交互进行准确推理的任务中。我们研究了视觉扰动(特别是旋转和噪声)的影响,结果表明即使是轻微的失真也会显著降低各模型和数据集上的关系推理能力。我们进一步评估了基于提示的增强和预处理策略(方向校正和去噪),发现尽管它们提供了部分改进,但并未完全解决幻觉问题。我们的结果揭示了感知鲁棒性与关系理解之间的差距,突显出对更加鲁棒且具几何感知能力的 VLMs 的需求。
cs.CV / 80 / 2605.05054
Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation
直积流匹配:解耦径向与角向动力学以实现少样本适应
Abstract
Recent flow matching (FM) methods improve the few-shot adaptation of vision-language models by modeling cross-modal alignment as a continuous multi-step flow. In this paper, we argue that existing FM methods are inherently constrained by incompatible geometric priors on pre-trained cross-modal features, resulting in suboptimal adaptation performance. We first analyze these methods from a polar decomposition perspective (i.e., radial and angular sub-manifolds). Under this new geometric view, we identify three overlooked limitations: 1) Angular dynamics distortion: The radial-angular coupling induces non-uniform speed on the angular sub-manifold, leading to regression training difficulty and extra truncation errors. 2) Radial dynamics neglect: Feature normalization discards modality confidence, failing to distinguish out-of-distribution and in-distribution data, and abandoning crucial radial dynamics. 3) Context-agnostic unconditional flow: Dataset-specific information lost during pre-trained cross-modal feature extraction remains unrecovered. To resolve these issues, we propose warped product flow matching (WP-FM), a unified Riemannian framework that reformulates alignment on a warped product manifold. Within this framework, we derive direct product flow matching (DP-FM) by introducing a constant-warping metric, which yields a decoupled cylindrical manifold (i.e., direct product manifold). DP-FM enables independent radial evolution and constant-speed angular geodesic transport, effectively eliminating angular dynamics distortion while preserving radial consistency. Meanwhile, we incorporate classifier-free guidance by conditioning the flow on the pre-trained VLMs' hidden states to inject missing dataset-specific information. Extensive results across 11 benchmarks demonstrate that DP-FM achieves a new state of the art for multi-step few-shot adaptation.
Chinese Translation
近期的流匹配(FM)方法通过将跨模态对齐建模为连续的多步流,改善了视觉-语言模型的少样本适应能力。本文认为,现有的FM方法在本质上受限于与预训练跨模态特征不兼容的几何先验,导致适应性能不理想。我们首先从极分解的角度(即径向和角向子流形)分析这些方法。在这一新的几何视角下,我们识别出三个被忽视的局限性:1)角向动力学失真:径向-角向耦合在角向子流形上引起非匀速运动,导致回归训练困难和额外的截断误差。2)径向动力学被忽视:特征归一化丢弃了模态置信度,无法区分分布外和分布内的数据,并放弃了关键的径向动力学。3)与上下文无关的无条件流:预训练跨模态特征提取过程中丢失的数据集特定信息未得到恢复。为了解决这些问题,我们提出了扭曲积流匹配(WP-FM),这是一个统一的黎曼框架,在扭曲积流形上重新表述对齐问题。在该框架内,我们通过引入常数扭曲度量推导出直积流匹配(DP-FM),得到一个解耦的柱面流形(即直积流形)。DP-FM实现了独立的径向演化和匀速的角向测地线传输,有效消除了角向动力学失真,同时保持径向一致性。同时,我们将流以预训练VLM的隐藏状态为条件,引入无分类器引导,以注入缺失的数据集特定信息。在11个基准测试上的大量结果表明,DP-FM在多步少样本适应中达到了新的最先进水平。
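The radial/angular decoupling described above can be illustrated with a small interpolation sketch. It is not the DP-FM training objective: it only shows a path whose norm evolves linearly while its direction moves along a constant-speed great-circle geodesic. The function names are hypothetical, and the antipodal case (directions exactly opposite) is deliberately not handled.

```python
import numpy as np

def polar_split(x):
    """Split a vector into its radius and unit direction."""
    r = np.linalg.norm(x)
    return r, x / r

def slerp(u0, u1, t):
    """Constant-speed geodesic between two unit vectors on the sphere."""
    omega = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    if omega < 1e-8:
        return u0  # directions coincide; nothing to rotate
    # Note: omega close to pi (antipodal directions) is not handled here.
    return (np.sin((1 - t) * omega) * u0 + np.sin(t * omega) * u1) / np.sin(omega)

def decoupled_path(x0, x1, t):
    """Linear radial evolution combined with geodesic angular transport."""
    r0, u0 = polar_split(x0)
    r1, u1 = polar_split(x1)
    r_t = (1 - t) * r0 + t * r1
    return r_t * slerp(u0, u1, t)

x0 = np.array([2.0, 0.0])
x1 = np.array([0.0, 4.0])
mid = decoupled_path(x0, x1, 0.5)
```

In this toy case the decoupled midpoint has radius 3 and points at 45 degrees, whereas the straight-line midpoint (1, 2) points at roughly 63.4 degrees: under Euclidean interpolation the angle does not advance at uniform speed, which is the coupling the abstract refers to.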
cs.CV / 81 / 2605.05057
ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection
ScriptHOI:学习脚本化状态转移以实现开放词汇人-物交互检测
Abstract
Open-vocabulary human-object interaction (HOI) detection requires recognizing interaction phrases that may not appear as annotated categories during training. Recent vision-language HOI detectors improve semantic transfer by matching human-object features with text embeddings, but their predictions are often dominated by object affordance and phrase-level co-occurrence. As a result, a model may predict "cut cake" from the presence of a knife and a cake without verifying whether the hand, tool, target, contact pattern, and object state jointly support the action. We propose ScriptHOI, a structured framework that represents each interaction phrase as a soft scripted state transition. Rather than treating a phrase as a single class token, ScriptHOI decomposes it into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each detected human-object pair into corresponding state tokens, and a slot-wise matcher estimates both script coverage and script conflict. These two quantities calibrate HOI logits, expose missing visual evidence, and provide training constraints for incomplete annotations. To avoid suppressing valid but unannotated interactions, we further introduce interval partial-label learning, which constrains unannotated candidates with script-derived lower and upper probability bounds instead of assigning closed-world negatives. A counterfactual script contrast loss swaps individual script slots to discourage object-only shortcuts. Experiments on HICO-DET, V-COCO, and open-vocabulary HOI splits show that ScriptHOI improves rare and unseen interaction recognition while substantially reducing affordance-conflict false positives.
Chinese Translation
开放词汇人-物交互(HOI)检测需要识别在训练期间可能未作为标注类别出现的交互短语。近期的视觉-语言HOI检测器通过将人-物特征与文本嵌入匹配来改善语义迁移,但它们的预测往往被物体可供性(affordance)和短语级共现所主导。因此,模型可能仅凭刀和蛋糕的存在就预测出“切蛋糕”,而未验证手、工具、目标、接触模式和物体状态是否共同支持该动作。我们提出了ScriptHOI,一个结构化框架,将每个交互短语表示为一个软的脚本化状态转移。ScriptHOI并不将短语视为单一类别标记,而是将其分解为身体角色、接触、几何、可供性、运动和物体状态等槽位。视觉状态分词器将每个检测到的人-物对解析为相应的状态标记,逐槽匹配器则同时估计脚本覆盖度和脚本冲突。这两个量用于校准HOI logits,揭示缺失的视觉证据,并为不完整的标注提供训练约束。为了避免抑制有效但未标注的交互,我们进一步引入了区间部分标签学习,利用由脚本导出的概率上下界约束未标注候选,而不是将其指定为封闭世界负样本。反事实脚本对比损失通过交换单个脚本槽位来抑制仅依赖物体的捷径。在HICO-DET、V-COCO和开放词汇HOI划分上的实验表明,ScriptHOI改善了稀有和未见交互的识别,同时大幅减少了可供性冲突导致的误报。
cs.CV / 82 / 2605.05072
Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy
基于高度引导的投影重参数化用于相机-LiDAR占用预测
Abstract
3D occupancy prediction aims to infer dense, voxel-wise scene semantics from sensor observations, where the 2D-to-3D view transformation serves as a crucial step in bridging image features and volumetric representations. Most previous methods rely on a fixed projection space, where 3D reference points are uniformly sampled along pillars. However, such sampling struggles to capture the sparsity and height variations of real-world scenes, leading to ambiguous correspondences and unreliable feature aggregation. To address these challenges, we propose HiPR, a camera-LiDAR occupancy framework with Height-Guided Projection Reparameterization. HiPR first encodes LiDAR into a BEV height map to capture the maximum height of the point cloud. HiPR then adjusts the sampling range of each pillar using the height prior, enabling adaptive reparameterization of the projection space. As a result, the projected points are redistributed into geometrically meaningful regions rather than fixed ranges. Meanwhile, we mask out the invalid parts of the height map to avoid misleading the feature aggregation. In addition, to alleviate the training instability caused by noisy LiDAR-derived heights, we introduce a training-time Progressive Height Conditioning strategy, which gradually transitions the conditioning signal from ground-truth heights to LiDAR heights. Extensive experiments demonstrate that HiPR consistently outperforms existing state-of-the-art methods while maintaining real-time inference. The code and pretrained models can be found at https://github.com/Rayn-Wu/HiPR.
Chinese Translation
3D 占用预测旨在从传感器观测中推断密集的体素级场景语义,其中 2D 到 3D 视图转换作为连接图像特征和体积表示的关键步骤。以往大多数方法依赖于固定的投影空间,在该空间中 3D 参考点沿着柱子均匀采样。然而,这种采样方法难以捕捉真实场景中的稀疏性和高度变化,导致对应关系模糊和特征聚合不可靠。为了解决这些挑战,我们提出了 HiPR,一种具有高度引导投影重参数化的相机-LiDAR 占用框架。HiPR 首先将 LiDAR 编码为鸟瞰视图(BEV)高度图,以捕捉点云的最大高度。随后,HiPR 根据高度先验调整每个柱子的采样范围,从而实现投影空间的自适应重参数化。结果,投影点被重新分配到几何上有意义的区域,而非固定的范围。同时,我们对高度图中的无效部分进行遮蔽,以避免误导特征聚合。此外,为了减轻因噪声 LiDAR 生成的高度引起的训练不稳定性,我们引入了一种训练时的渐进高度条件策略,该策略逐步将条件信号从真实高度过渡到 LiDAR 高度。大量实验表明,HiPR 在保持实时推理的同时,始终超过现有的最先进方法。代码和预训练模型可以在 https://github.com/Rayn-Wu/HiPR 找到。
cs.CV / 83 / 2605.05077
FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching
FlowDIS:基于流匹配的语言引导二分图像分割
Abstract
Accurate image segmentation is essential for modern computer vision applications such as image editing, autonomous driving, and medical image analysis. In recent years, Dichotomous Image Segmentation (DIS) has become a standard task for training and evaluating highly accurate segmentation models. Existing DIS approaches often fail to preserve fine-grained details or fully capture the semantic structure of the foreground. To address these challenges, we present FlowDIS, a novel dichotomous image segmentation method built on the flow matching framework, which learns a time-dependent vector field to transport the image distribution to the corresponding mask distribution, optionally conditioned on a text prompt. Moreover, with our Position-Aware Instance Pairing (PAIP) training strategy, FlowDIS offers strong controllability through text prompts, enabling precise, pixel-level object segmentation. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches both with and without language guidance. Compared with the best prior DIS method, FlowDIS achieves a 5.5% higher $F_{\beta}^{\omega}$ measure and 43% lower MAE ($\mathcal{M}$) on the DIS-TE test set. The code is available at: https://github.com/Picsart-AI-Research/FlowDIS
Chinese Translation
准确的图像分割在现代计算机视觉应用中至关重要,例如图像编辑、自动驾驶和医学图像分析。近年来,二分图像分割(Dichotomous Image Segmentation, DIS)已成为训练和评估高精度分割模型的标准任务。现有的 DIS 方法往往无法保留细粒度细节或完全捕获前景的语义结构。为了解决这些挑战,我们提出了 FlowDIS,一种基于流匹配框架的新型二分图像分割方法,它学习一个时间依赖的向量场,将图像分布传输到相应的掩码分布,且可选地基于文本提示进行条件处理。此外,通过我们的定位感知实例配对(Position-Aware Instance Pairing, PAIP)训练策略,FlowDIS 通过文本提示提供强大的可控性,使得能够实现精确的像素级对象分割。大量实验表明,我们的方法在有无语言引导的情况下均显著超越了最先进的方法。与之前最佳的 DIS 方法相比,FlowDIS 在 DIS-TE 测试集上实现了 5.5% 更高的 $F_{\beta}^{\omega}$ 指标和 43% 更低的 MAE ($\mathcal{M}$)。代码可在以下地址获取:https://github.com/Picsart-AI-Research/FlowDIS
cs.CV / 84 / 2605.05079
A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping
严重折射畸变下的多帧图像恢复统一基准
Abstract
Video sequences captured through refractive dynamic media, such as turbulent air or a water surface, often suffer from severe geometric distortions and temporal instability. While recent advances address mild atmospheric turbulence, no existing benchmarks systematically evaluate restoration methods under strong and highly nonuniform refractive conditions. We present a comprehensive benchmark for geometric distortion removal in video, covering a range from turbulence-like mild warping to strong discontinuous refractive deformations. The benchmark includes both laboratory-captured real data and synthetic sequences generated for static scenes via physics-based light refraction modeling across four distortion levels and multiple surface wave types. We evaluate a spectrum of methods, from simple baselines and classical registration algorithms to advanced learning-based approaches including DATUM and our proposed diffusion-based V-cache for high and extreme distortion regimes. Evaluation uses both pixel-level (PSNR, SSIM) and perceptual (LPIPS, DINO, CLIP) metrics, providing the first large-scale analysis of geometric distortion removal. Our benchmark establishes a new foundation for developing and evaluating algorithms capable of reconstructing video from highly distorted optical environments. Our code and datasets are available at https://github.com/iafoss/refractive-mfir-benchmark.
Chinese Translation
通过折射动态介质(如湍流空气或水面)捕获的视频序列常常面临严重的几何失真和时间不稳定性。尽管近期的研究进展已能处理轻微的大气湍流,但目前尚无基准系统地评估在强烈且高度不均匀的折射条件下的恢复方法。我们提出了一个全面的基准,用于视频中的几何失真去除,涵盖了从类似湍流的轻微畸变到强烈不连续的折射变形的范围。该基准包括实验室捕获的真实数据和针对静态场景的合成序列,这些序列是通过基于物理的光折射建模生成的,涵盖了四个失真级别和多种表面波类型。我们评估了一系列方法,从简单的基线和经典的配准算法,到包括DATUM在内的先进的基于学习的方法,以及我们提出的用于高强度和极端失真情形的基于扩散的V-cache。评估使用像素级(PSNR、SSIM)和感知(LPIPS、DINO、CLIP)指标,提供了首次大规模的几何失真去除分析。我们的基准为开发和评估能够从高度失真光学环境中重建视频的算法奠定了新的基础。我们的代码和数据集可在 https://github.com/iafoss/refractive-mfir-benchmark 获取。
cs.CV / 85 / 2605.05136
CPCANet: Deep Unfolding Common Principal Component Analysis for Domain Generalization
CPCANet:用于领域泛化的深度展开公共主成分分析
Abstract
Domain Generalization (DG) aims to learn representations that remain robust under out-of-distribution (OOD) shifts and generalize effectively to unseen target domains. While recent invariant learning strategies and architectural advances have achieved strong performance, explicitly discovering a structured domain-invariant subspace through second-order statistics remains underexplored. In this work, we propose CPCANet, a novel framework grounded in Common Principal Component Analysis (CPCA), which unrolls the iterative Flury-Gautschi (FG) algorithm into fully differentiable neural layers. This approach integrates the statistical properties of CPCA into an end-to-end trainable framework, enforcing the discovery of a shared subspace across diverse domains while preserving interpretability. Experiments on four standard DG benchmarks demonstrate that CPCANet achieves state-of-the-art (SOTA) performance in zero-shot transfer. Moreover, CPCANet is architecture-agnostic and requires no dataset-specific tuning, providing a simple and efficient approach to learning robust representations under distribution shift. Code is available at https://github.com/wish44165/CPCANet.
Chinese Translation
领域泛化(DG)旨在学习在分布外(OOD)偏移下保持稳健的表示,并有效地泛化到未见过的目标领域。尽管最近的不变学习策略和架构进展取得了强劲的性能,但通过二阶统计量显式发现结构化的领域不变子空间仍未得到充分探索。在这项工作中,我们提出了CPCANet,这是一个基于公共主成分分析(CPCA)的新颖框架,它将迭代的Flury-Gautschi(FG)算法展开为完全可微的神经层。这种方法将CPCA的统计特性集成到一个端到端可训练的框架中,在保持可解释性的同时,强制发现跨不同领域的共享子空间。在四个标准DG基准上的实验表明,CPCANet在零样本迁移中实现了最先进的(SOTA)性能。此外,CPCANet与架构无关,并且不需要特定于数据集的调优,为在分布偏移下学习稳健表示提供了简单而高效的方法。代码可在 https://github.com/wish44165/CPCANet 获取。
cs.CV / 86 / 2605.05148
What Matters in Practical Learned Image Compression
实践中的学习型图像压缩:关键因素
Abstract
One of the major differentiators unlocked by learned codecs relative to their hard-coded traditional counterparts is their ability to be optimized directly to appeal to the human visual system. Despite this potential, a perceptual yet practical image codec is yet to be proposed. In this work, we aim to close this gap. We conduct a comprehensive study of the key modeling choices that govern the design of a practical learned image codec, jointly optimized for perceptual quality and runtime -- including within the ablations several novel techniques. We then perform performance-aware neural architecture search over millions of backbone configurations to identify models that achieve the target on-device runtime while maximizing compression performance as captured by perceptual metrics. We combine the various optimizations to construct a new codec that achieves a significantly improved tradeoff between speed and perceptual quality. Based on rigorous subjective user studies, it provides 2.3-3x bitrate savings against AV1, AV2, VVC, ECM and JPEG-AI, and 20-40% bitrate savings against the best learned codec alternatives. At the same time, on an iPhone 17 Pro Max, it encodes 12MP images as fast as 230ms, and decodes them in 150ms -- faster than most top ML-based codecs run on a V100 GPU.
Chinese Translation
学习型编解码器相较于传统硬编码编解码器的主要差异化优势之一,是它们能够直接针对人类视觉系统进行优化。尽管具有这种潜力,迄今尚未出现一个既面向感知质量又实用的图像编解码器。在本研究中,我们旨在填补这一空白。我们对决定实用学习型图像编解码器设计的关键建模选择进行了全面研究,联合优化感知质量和运行时间,并在消融实验中包含了几种新颖的技术。随后,我们对数百万种骨干网络配置进行性能感知的神经架构搜索,以识别出在达到目标设备端运行时间的同时最大化以感知指标衡量的压缩性能的模型。我们将各种优化结合起来,构建了一个新的编解码器,在速度与感知质量之间实现了显著改善的权衡。基于严格的主观用户研究,相较于AV1、AV2、VVC、ECM和JPEG-AI,它实现了2.3-3倍的比特率节省;相较于最佳的学习型编解码器替代方案,实现了20-40%的比特率节省。同时,在iPhone 17 Pro Max上,它编码12MP图像最快仅需230ms,解码仅需150ms,比大多数在V100 GPU上运行的顶级基于机器学习的编解码器还要快。
cs.CV / 87 / 2605.05155
Aes3D: Aesthetic Assessment in 3D Gaussian Splatting
Aes3D:3D高斯点渲染中的美学评估
Abstract
As 3D Gaussian Splatting (3DGS) gains attention in immersive media and digital content creation, assessing the aesthetics of 3D scenes becomes important in helping creators build more visually compelling 3D content. However, existing evaluation methods for 3D scenes primarily emphasize reconstruction fidelity and perceptual realism, largely overlooking higher-level aesthetic attributes such as composition, harmony, and visual appeal. This limitation comes from two key challenges: (1) the absence of general 3DGS datasets with aesthetic annotations, and (2) the intrinsic nature of 3DGS as a low-level primitive representation, which makes it difficult to capture high-level aesthetic features. To address these challenges, we propose Aes3D, the first systematic framework for assessing the aesthetics of 3D neural rendering scenes. Aes3D includes Aesthetic3D, the first dataset dedicated to 3D scene aesthetic assessment, built on our proposed annotation strategy for 3D scene aesthetics. In addition, we present Aes3DGSNet, a lightweight model that directly predicts scene-level aesthetic scores from 3DGS representations. Notably, our model operates solely on 3D Gaussian primitives, eliminating the need for rendering multi-view images and thus reducing computational cost and hardware requirements. Through aesthetics-supervised learning on multi-view 3DGS scene representations, Aes3DGSNet effectively captures high-level aesthetic cues and accurately regresses aesthetic scores. Experimental results demonstrate that our approach achieves strong performance while maintaining a lightweight design, establishing a new benchmark for 3D scene aesthetic assessment. Code and datasets will be made available in a future version.
Chinese Translation
随着3D高斯点渲染(3D Gaussian Splatting, 3DGS)在沉浸式媒体和数字内容创作领域受到关注,评估3D场景的美学变得愈加重要,这有助于创作者构建更具视觉吸引力的3D内容。然而,现有的3D场景评估方法主要强调重建保真度和感知真实感,往往忽视了构图、和谐以及视觉吸引力等更高层次的美学属性。这一局限性源自两个关键挑战:(1)缺乏带有美学标注的通用3DGS数据集,以及(2)3DGS作为低层基元表示的内在性质,使得捕捉高层次美学特征变得困难。为了解决这些挑战,我们提出了Aes3D,这是第一个系统化的框架,用于评估3D神经渲染场景的美学。Aes3D包括Aesthetic3D,这是第一个专门用于3D场景美学评估的数据集,它基于我们提出的3D场景美学标注策略构建。此外,我们还提出了Aes3DGSNet,一个轻量级模型,可以直接从3DGS表示中预测场景级美学分数。值得注意的是,我们的模型仅在3D高斯基元上运行,无需渲染多视图图像,从而降低了计算成本和硬件要求。通过在多视图3DGS场景表示上进行美学监督学习,Aes3DGSNet有效捕捉高层次的美学线索,并准确回归美学分数。实验结果表明,我们的方法在保持轻量设计的同时实现了强大的性能,为3D场景美学评估建立了新的基准。代码和数据集将在未来版本中发布。
cs.CV / 88 / 2605.05161
Wasserstein-Aligned Localisation for VLM-Based Distributional OOD Detection in Medical Imaging
面向医学影像中基于VLM的分布性OOD检测的Wasserstein对齐定位
Abstract
Zero-shot anomaly localisation via vision-language models (VLMs) offers a compelling approach for rare pathology detection, yet its performance is fundamentally limited by the absence of healthy anatomical context. We reformulate zero-shot localisation as a comparative inference problem in which anomalies are identified through structured comparison against reference distributions of normal anatomy. We introduce WALDO, a training-free framework grounded in optimal transport theory that enables comparative reasoning through: (i) entropy-weighted Sliced Wasserstein distances for anatomically-aware reference selection from DINOv2 patch distributions, (ii) Goldilocks zone sampling exploiting the non-monotonic relationship between reference similarity and localisation accuracy, and (iii) self-consistency aggregation via weighted non-maximum suppression. We theoretically analyse the Goldilocks effect through distributional divergence, and show that references with moderate similarity minimize a bias-variance trade-off in comparative visual reasoning. On the NOVA brain MRI benchmark, WALDO with Qwen2.5-VL-72B achieves $43.5_{\pm1.6}\%$ mAP@30 (95% CI: [40.4, 46.7]), representing a 19% relative improvement over zero-shot baselines. Cross-model evaluation shows consistent gains: GPT-4o achieves $32.0_{\pm6.5}\%$ and Qwen3-VL-32B achieves $32.0_{\pm6.6}\%$ mAP@30. Paired McNemar tests confirm statistical significance ($p<0.01$). Source code is available at https://github.com/bkainz/WALDO_MICCAI26_demo .
Chinese Translation
通过视觉-语言模型(VLMs)实现零样本异常定位,为罕见病理检测提供了一种引人注目的方法,然而其性能从根本上受限于缺乏健康解剖背景。我们将零样本定位重新表述为一个比较推理问题,其中异常通过与正常解剖的参考分布进行结构化比较来识别。我们提出了WALDO,这是一个基于最优传输理论的免训练框架,通过以下方式实现比较推理:(i) 利用熵加权的切片Wasserstein距离,从DINOv2图块分布中进行解剖感知的参考选择;(ii) 利用参考相似性与定位准确性之间的非单调关系进行Goldilocks区间采样;(iii) 通过加权非极大值抑制进行自一致性聚合。我们通过分布散度从理论上分析了Goldilocks效应,并表明具有适度相似性的参考能够在比较视觉推理中最小化偏差-方差权衡。在NOVA脑MRI基准测试中,结合Qwen2.5-VL-72B的WALDO实现了$43.5_{\pm1.6}\%$ mAP@30(95% CI: [40.4, 46.7]),相较于零样本基线相对提升了19%。跨模型评估显示出一致的提升:GPT-4o达到$32.0_{\pm6.5}\%$,Qwen3-VL-32B达到$32.0_{\pm6.6}\%$ mAP@30。配对McNemar检验确认了统计显著性($p<0.01$)。源代码可在 https://github.com/bkainz/WALDO_MICCAI26_demo 获得。
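The distance underlying step (i) can be sketched with a plain Monte Carlo sliced Wasserstein estimate between two sets of patch features. This is the standard unweighted form, not WALDO's entropy-weighted variant; the function name, the equal-sample-count restriction, and the synthetic features standing in for DINOv2 patch embeddings are assumptions made for illustration.

```python
import numpy as np

def sliced_wasserstein(x, y, num_projections=128, seed=0):
    """Monte Carlo estimate of the sliced 1-Wasserstein distance.

    x, y: arrays of shape (n, d) holding the same number n of feature vectors.
    """
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(num_projections, x.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Project onto each direction and sort: in 1-D, optimal transport simply
    # matches order statistics, so the cost is the mean absolute difference.
    px = np.sort(x @ directions.T, axis=0)
    py = np.sort(y @ directions.T, axis=0)
    return float(np.mean(np.abs(px - py)))

rng = np.random.default_rng(1)
query = rng.normal(size=(64, 8))   # stand-in for one image's patch features
near_ref = query + 0.1             # a reference that is only slightly shifted
far_ref = query + 2.0              # a reference that is strongly shifted
d_same = sliced_wasserstein(query, query)
d_near = sliced_wasserstein(query, near_ref)
d_far = sliced_wasserstein(query, far_ref)
```

Ranking candidate references by such a distance would give the similarity ordering from which a moderately similar reference could then be picked, which is the selection idea the abstract calls Goldilocks zone sampling.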
cs.CV / 89 / 2605.05163
PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World
PhysForge:生成基于物理的交互式虚拟世界3D资产
Abstract
Synthesizing physics-grounded 3D assets is a critical bottleneck for interactive virtual worlds and embodied AI. Existing methods predominantly focus on static geometry, overlooking the functional properties essential for interaction. We propose that interactive asset generation must be rooted in functional logic and hierarchical physics. To bridge this gap, we introduce PhysForge, a decoupled two-stage framework supported by PhysDB, a large-scale dataset of 150,000 assets with four-tier physical annotations. First, a VLM acts as a "physical architect" to plan a "Hierarchical Physical Blueprint" defining material, functional, and kinematic constraints. Second, a physics-grounded diffusion model realizes this blueprint by synthesizing high-fidelity geometry alongside precise kinematic parameters via a novel KineVoxel Injection (KVI) mechanism. Experiments demonstrate that PhysForge produces functionally plausible, simulation-ready assets, providing a robust data engine for interactive 3D content and embodied agents.
Chinese Translation
合成基于物理的3D资产是交互式虚拟世界和具身人工智能中的一个关键瓶颈。现有的方法主要集中在静态几何形状上,忽视了交互所需的功能属性。我们提出交互资产生成必须根植于功能逻辑和分层物理。为了解决这一问题,我们引入了PhysForge,一个由PhysDB(一个包含150,000个资产及其四层物理标注的大规模数据集)支持的解耦两阶段框架。首先,一个视觉语言模型(VLM)充当“物理建筑师”,规划一个定义材料、功能和运动约束的“分层物理蓝图”。其次,一个基于物理的扩散模型通过合成高保真度的几何形状及精确的运动参数来实现这一蓝图,采用了一种新颖的运动体积注入(KineVoxel Injection, KVI)机制。实验表明,PhysForge能够生成功能上合理、适合于仿真的资产,为交互式3D内容和具身代理提供了强大的数据引擎。
cs.CV / 90 / 2605.05164
Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation
几何感知状态空间模型:全切片图像表示的新范式
Abstract
Accurate analysis of histopathological images is critical for disease diagnosis and treatment planning. Whole-slide images (WSIs), which digitize tissue specimens at gigapixel resolution, are fundamental to this process but require aggregating thousands of patches for slide-level predictions. Multiple Instance Learning (MIL) tackles this challenge with a two-stage paradigm, decoupling tile-level embedding and slide-level prediction. However, most existing methods implicitly embed patch representations in homogeneous Euclidean spaces, overlooking the hierarchical organization and regional heterogeneity of pathological tissues. This limits current models' ability to capture global tissue architecture and fine-grained cellular morphology. To address this limitation, we introduce a hybrid hyperbolic-Euclidean representation that embeds WSI features in dual geometric spaces, enabling complementary modeling of hierarchical tissue structures and local morphological details. Building on this formulation, we develop BatMIL, a WSI classification framework that leverages both geometric spaces. To model long-range dependencies among thousands of patches, we employ a structured state space sequence model (S4) backbone that encodes patch sequences with linear computational complexity. Furthermore, to account for regional heterogeneity, we introduce a chunk-level mixture-of-experts (MoE) module that groups patches into regions and dynamically routes them to specialized subnetworks, improving representational capacity while reducing redundant computation. Extensive experiments on seven WSI datasets spanning six cancer types demonstrate that BatMIL consistently outperforms state-of-the-art MIL approaches in slide-level classification tasks. These results indicate that geometry-aware representation learning offers a promising direction for next-generation computational pathology.
Chinese Translation
准确分析组织病理图像对于疾病的诊断和治疗规划至关重要。全切片图像(WSI)以千兆像素分辨率数字化组织标本,是这一过程的基础,但需要聚合数千个图块以进行切片级预测。多实例学习(MIL)通过一种两阶段范式解决了这一挑战,解耦了图块级嵌入和切片级预测。然而,大多数现有方法在同质欧几里得空间中隐式嵌入图块表示,忽视了病理组织的层次化组织方式和区域异质性。这限制了当前模型捕捉全局组织结构和细粒度细胞形态的能力。为了解决这一限制,我们引入了一种混合的双曲-欧几里得表示,将 WSI 特征嵌入双重几何空间,从而实现对层次化组织结构和局部形态细节的互补建模。在此基础上,我们开发了 BatMIL,一种同时利用这两种几何空间的 WSI 分类框架。为了建模成千上万图块之间的长程依赖关系,我们采用了结构化状态空间序列模型(S4)作为主干网络,以线性计算复杂度编码图块序列。此外,为了考虑区域异质性,我们引入了块级(chunk-level)专家混合(MoE)模块,将图块分组为区域,并动态路由至专门的子网络,在提高表示能力的同时减少了冗余计算。在涵盖六种癌症类型的七个 WSI 数据集上的大量实验表明,BatMIL 在切片级分类任务中始终优于最先进的 MIL 方法。这些结果表明,几何感知表示学习为下一代计算病理学提供了一个有前景的方向。
cs.CV / 91 / 2605.05185
OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents
OpenSearch-VL:前沿多模态搜索智能体的开放配方
Abstract
Deep search has become a crucial capability for frontier multimodal agents, enabling models to solve complex questions through active search, evidence verification, and multi-step reasoning. Despite rapid progress, top-tier multimodal search agents remain difficult to reproduce, largely due to the absence of open high-quality training data, transparent trajectory synthesis pipelines, or detailed training recipes. To this end, we introduce OpenSearch-VL, a fully open-source recipe for training frontier multimodal deep search agents with agentic reinforcement learning. First, we curated a dedicated pipeline to construct high-quality training data through Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding, which jointly reduce shortcuts and one-step retrieval collapse. Based on this pipeline, we curate two training datasets, SearchVL-SFT-36k for SFT and SearchVL-RL-8k for RL. Besides, we design a diverse tool environment that unifies text search, image search, OCR, cropping, sharpening, super-resolution, and perspective correction, enabling agents to combine active perception with external knowledge acquisition. Finally, we propose a multi-turn fatal-aware GRPO training algorithm that handles cascading tool failures by masking post-failure tokens while preserving useful pre-failure reasoning through one-sided advantage clamping. Built on this recipe, OpenSearch-VL delivers substantial performance gains, with over 10-point average improvements across seven benchmarks, and achieves results comparable to proprietary commercial models on several tasks. We will release all data, code, and models to support open research on multimodal deep search agents.
Chinese Translation
深度搜索已成为前沿多模态智能体的一项关键能力,使模型能够通过主动搜索、证据验证和多步推理解决复杂问题。尽管取得了快速进展,但顶尖的多模态搜索智能体仍难以复制,主要原因在于缺乏开放的高质量训练数据、透明的轨迹合成流程或详细的训练配方。为此,我们介绍OpenSearch-VL,这是一种完全开源的配方,用于通过智能体强化学习训练前沿多模态深度搜索智能体。首先,我们建立了一条专用流程,通过维基百科路径采样、模糊实体重写和源锚点视觉定位来构建高质量的训练数据,旨在共同减少捷径和一步检索崩溃。基于此流程,我们整理了两个训练数据集,分别为SearchVL-SFT-36k用于SFT和SearchVL-RL-8k用于RL。此外,我们设计了一个多样化的工具环境,统一了文本搜索、图像搜索、光学字符识别(OCR)、裁剪、锐化、超分辨率和透视校正,使智能体能够将主动感知与外部知识获取结合起来。最后,我们提出了一种多轮故障感知的GRPO训练算法,该算法通过掩蔽后故障令牌来处理级联工具故障,同时通过单侧优势限幅保留有用的前故障推理。基于这一配方,OpenSearch-VL在七个基准测试中实现了超过10个点的平均性能提升,并在多个任务上达到了与商业专有模型相媲美的结果。我们将发布所有数据、代码和模型,以支持多模态深度搜索智能体的开放研究。
cs.CV / 92 / 2605.05187
LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)
LoViF 2026:首届针对4D世界模型的整体质量评估挑战(PhyScore)
Luo, Wei, Lu, Yiting, Li, Xin, Li, Haoran, Guan, Fengbin, Gao, Chen, Jin, Xin, Li, Yong, Chen, Zhibo, Wu, Sijing, Fu, Kang, Li, Yunhao, Xiao, Ziang, Duan, Huiyu, Liu, Jing, Hu, Qiang, Min, Xiongkuo, Zhai, Guangtao, Sun, Manxi, Guo, Zixuan, Li, Yun, Chen, Ziyang, Tsukada, Manabu, Li, Zhengyang, Du, Zhenglin, Wen, Yi, Jiao, Licheng, Liu, Fang, Li, Lingling, Ren, Yiwen, Song, Zhilong, Chen, Dubing, Zhou, Yucheng, Yan, Tianyi, Zheng, Huan
Abstract
This paper reports on the LoViF 2026 PhyScore challenge, a competition on holistic quality assessment of world-model-generated videos across both 2D and 4D generation settings. The challenge is motivated by a central gap in current evaluation practice: perceptual quality alone is insufficient to judge whether generated dynamics are physically plausible, temporally coherent, and consistent with input conditions. Participants are required to build a metric that jointly predicts four dimensions, i.e., Video Quality, Physical Realism, Condition-Video Alignment, and Temporal Consistency. In addition, participants must localize physical anomaly timestamps for fine-grained diagnosis. The benchmark dataset contains 1,554 videos generated by seven representative world generative models, organized into three tracks (text-to-2D, image-to-4D, and video-to-4D) and spanning 26 categories. These categories explicitly cover physics-relevant scenarios, including dynamics, optics, and thermodynamics, together with diverse real-world and creative content. To ensure label reliability, scores and anomaly timestamps are produced through trained human annotation with an additional automated quality-control pass. Evaluation is based on both score prediction and anomaly localization, with a composite protocol that combines TimeStamp_IOU and SRCC/PLCC. This report summarizes the challenge design and provides method-level insights from submitted solutions.
Chinese Translation
本文报告了LoViF 2026 PhyScore挑战赛,这是一个关于世界模型生成的视频在2D和4D生成环境下的整体质量评估的竞赛。该挑战的动机源于当前评估实践中的一个核心问题:仅凭感知质量不足以判断生成的动态是否在物理上合理、时间上连贯,并与输入条件一致。参与者需要构建一种可以联动预测四个维度的度量,即视频质量、物理现实性、条件-视频对齐和时间一致性。此外,参与者还需定位物理异常时间戳,以便进行更细致的诊断。基准数据集包含由七个代表性的世界生成模型生成的1554个视频,分为三个赛道(文本到2D、图像到4D和视频到4D),涵盖26个类别。这些类别明确涵盖与物理相关的场景,包括动力学、光学和热力学,并包含多样的现实世界和创意内容。为了确保标签的可靠性,分数和异常时间戳是通过经过训练的人类标注和额外的自动质量控制流程生成的。评估基于分数预测和异常定位,并采用结合了时间戳交并比(TimeStamp_IOU)和斯皮尔曼等级相关系数/皮尔逊相关系数(SRCC/PLCC)的复合协议。本文总结了挑战的设计,并提供了提交解决方案的方案级见解。
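The evaluation protocol named in the abstract above (TimeStamp_IOU combined with SRCC/PLCC) can be illustrated with a minimal sketch. The function names, the interval representation as `(start, end)` pairs, and the tie handling are assumptions made for illustration; the abstract does not specify how the challenge weights or combines these terms, so no composite score is shown.

```python
from math import sqrt

def timestamp_iou(pred, truth):
    """Temporal IoU of two intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0

def plcc(x, y):
    """Pearson linear correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(ranks(x), ranks(y))
```

SRCC rewards a metric that orders videos the same way human raters do, while PLCC additionally rewards a linear relationship between predicted and human scores, which is presumably why the two are reported together.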
cs.CV / 93 / 2605.05204
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD:用于持续调整步骤蒸馏扩散模型的在线自蒸馏
Abstract
The landscape of high-performance image generation models is currently shifting from inefficient multi-step models to efficient few-step counterparts (e.g., Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for direct continuous supervised fine-tuning. For example, applying commonly used fine-tuning techniques would compromise their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that a modern diffusion model in which an LLM/VLM serves as the encoder can inherit its encoder's in-context capabilities. This enables us to cast training as an on-policy self-distillation process. Specifically, during training, the model acts as both teacher and student under different contexts: the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the discrepancy between the two predicted distributions over the student's own roll-outs. By optimizing on the model's own trajectories and under its own supervision, D-OPSD enables the model to learn new concepts, styles, etc. without sacrificing its original few-step capacity.
Chinese Translation
高性能图像生成模型的格局目前正从低效的多步骤模型转向高效的少步骤模型(例如,Z-Image-Turbo 和 FLUX.2-klein)。然而,这些模型在直接进行连续监督微调时面临重大挑战。例如,应用常用的微调技术会妥协它们固有的少步骤推理能力。为此,我们提出了 D-OPSD,这是一种新的训练范式,旨在通过监督微调实现对步骤蒸馏扩散模型的在线学习。我们首先发现,现代扩散模型中,大型语言模型(LLM)或视觉语言模型(VLM)作为编码器,可以继承其编码器的上下文能力。这使我们能够将训练视为一种在线自蒸馏过程。具体而言,在训练过程中,我们让模型同时充当教师和学生,且处于不同的上下文中。学生仅基于文本特征进行条件化,而教师则基于文本提示和目标图像的多模态特征进行条件化。训练的目标是最小化学生自己推断出的两个预测分布。通过对模型自身轨迹的优化和在其自身监督下,D-OPSD 使模型能够学习新概念、风格等,同时不牺牲其原有的少步骤能力。
cs.CV / 94 / 2605.05206
Taming Outlier Tokens in Diffusion Transformers
抑制扩散变换器中的异常标记
Abstract
We study outlier tokens in Diffusion Transformers (DiTs) for image generation. Prior work has shown that Vision Transformers (ViTs) can produce a small number of high-norm tokens that attract disproportionate attention while carrying limited local information, but their role in generative models remains underexplored. We show that this phenomenon appears in both the encoder and denoiser of modern Representation Autoencoder (RAE)-DiT pipelines: pretrained ViT encoders can produce outlier representations, and DiTs themselves can develop internal outlier tokens, especially in intermediate layers. Moreover, simply masking high-norm tokens does not improve performance, indicating that the problem is not only caused by a few extreme values, but is more closely related to corrupted local patch semantics. To address this issue, we introduce Dual-Stage Registers (DSR), a register-based intervention for both components: trained registers when available, recursive test-time registers otherwise, and diffusion registers for the denoiser. Across ImageNet and large-scale text-to-image generation, these interventions consistently reduce outlier artifacts and improve generation quality. Our results highlight outlier-token control as an important ingredient in building stronger DiTs.
Chinese Translation
我们研究了在图像生成中扩散变换器(Diffusion Transformers, DiTs)中的异常标记。先前的研究表明,视觉变换器(Vision Transformers, ViTs)可以生成少量具有高范数的标记,这些标记吸引了不成比例的注意力,同时携带有限的局部信息,但它们在生成模型中的作用仍未得到充分探讨。我们展示了这一现象出现在现代表示自编码器(Representation Autoencoder, RAE)-DiT管道的编码器和去噪器中:经过训练的ViT编码器能够生成异常表示,并且DiTs本身在中间层也可能形成内部异常标记。此外,仅仅掩蔽高范数标记并未改善性能,表明这一问题不仅由少数极端值引起,而更为密切地与损坏的局部补丁语义有关。为了解决这个问题,我们引入了双阶段寄存器(Dual-Stage Registers, DSR),这是一种针对两个组件的寄存器基础的干预:在可用时使用训练好的寄存器,否则使用递归的测试时寄存器,以及为去噪器设计的扩散寄存器。在ImageNet和大规模文本到图像生成中,这些干预措施始终能够减少异常伪影并提高生成质量。我们的结果强调了对异常标记的控制是构建更强DiTs的重要组成部分。
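The notion of a high-norm outlier token in the abstract above can be made concrete with a small sketch. The detection rule used here (flag tokens whose L2 norm exceeds a multiple of the median norm) and the `ratio` parameter are assumptions chosen for illustration, not the criterion used by the authors.

```python
from math import sqrt
from statistics import median

def outlier_token_indices(tokens, ratio=5.0):
    """Return indices of tokens whose L2 norm exceeds ratio * median norm.

    tokens: list of equal-length feature vectors (lists of floats).
    ratio:  hypothetical threshold multiplier, not taken from the paper.
    """
    norms = [sqrt(sum(x * x for x in token)) for token in tokens]
    cutoff = ratio * median(norms)
    return [i for i, norm in enumerate(norms) if norm > cutoff]
```

Note that the abstract reports that simply masking such tokens does not improve performance, so a detector like this is a diagnostic at best rather than a fix.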
cs.CV / 95 / 2605.05207
Syn4D: A Multiview Synthetic 4D Dataset
Syn4D:一个多视角合成的四维数据集
Abstract
Dense 3D reconstruction and tracking of dynamic scenes from monocular video remains an important open challenge in computer vision. Progress in this area has been constrained by the scarcity of high-quality datasets with dense, complete, and accurate geometric annotations. To address this limitation, we introduce Syn4D, a multiview synthetic dataset of dynamic scenes that includes ground-truth camera motion, depth maps, dense tracking, and parametric human pose annotations. A key feature of Syn4D is the ability to unproject any pixel into 3D and map it to any time and any camera. We conduct extensive evaluations across multiple downstream tasks to demonstrate the utility and effectiveness of the proposed dataset, including 4D scene reconstruction, 3D point tracking, geometry-aware camera retargeting, and human pose estimation. The experimental results highlight Syn4D's potential to facilitate research in dynamic scene understanding and spatiotemporal modeling.
Chinese Translation
从单目视频中对动态场景进行密集三维重建和跟踪仍然是计算机视觉领域中的一个重要开放挑战。在这方面的进展受到高质量数据集的稀缺性的限制,这些数据集需要具备密集、完整和准确的几何标注。为了解决这一限制,我们引入了Syn4D,这是一个包含真实相机运动、深度图、密集跟踪和参数化人体姿态标注的动态场景多视角合成数据集。Syn4D的一个关键特性是能够将任何像素反投影到任意时间和任意相机的三维空间中。我们在多个下游任务上进行了广泛的评估,以证明所提数据集的实用性和有效性,包括四维场景重建、三维点跟踪、几何感知相机重定向和人体姿态估计。实验结果突显了Syn4D在促进动态场景理解和时空建模研究方面的潜力。
cs.AI / 1 / 2605.04050
LCM: Lossless Context Management
LCM:无损上下文管理
Abstract
We introduce Lossless Context Management (LCM), a deterministic architecture for LLM memory that outperforms Claude Code on long-context tasks. When benchmarked using Opus 4.6, our LCM-augmented coding agent, Volt, achieves higher scores than Claude Code on the OOLONG long-context eval, including at every context length between 32K and 1M tokens. LCM may be considered both a vindication and extension of the recursive paradigm pioneered by Recursive Language Models (RLMs). Our results demonstrate that recursive context manipulation can outperform not just conventional LLMs, but frontier coding agents with native file-system access. LCM departs from RLM by decomposing symbolic recursion into two deterministic, engine-managed mechanisms: recursive context compression, in which a hierarchical summary DAG automatically compacts older messages while retaining lossless pointers to every original; and recursive task partitioning, in which engine-managed parallel primitives like LLM-Map replace model-written loops. This trade-off, analogous to the move from GOTO to structured control flow in programming language design, sacrifices maximal flexibility for termination guarantees, zero-cost continuity on short tasks, and lossless retrievability of all prior state.
Chinese Translation
我们引入了无损上下文管理(Lossless Context Management,LCM),这是一种确定性的架构,用于长上下文任务中大规模语言模型(LLM)内存的管理,表现优于Claude Code。在使用Opus 4.6进行基准测试时,我们增强了LCM的编码代理Volt在OOLONG长上下文评估中,得到的分数超过了Claude Code,包括在32K到1M个标记之间的每个上下文长度。LCM可被视为对递归语言模型(Recursive Language Models,RLMs)所开创的递归范式的验证和扩展。我们的结果表明,递归上下文操作不仅可以超越传统的LLM,还可以超越具有本地文件系统访问权限的前沿编码代理。LCM通过将符号递归分解为两种由引擎管理的确定性机制来区别于RLM:递归上下文压缩,其中分层摘要DAG自动压缩旧消息,同时保留每个原始消息的无损指针;以及递归任务分区,其中由引擎管理的并行原语(如LLM-Map)替代了模型编写的循环。这一权衡,类似于编程语言设计中从GOTO到结构化控制流的转变,牺牲了最大的灵活性,以换取终止保障、短任务的零成本连续性以及所有先前状态的无损可检索性。
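The recursive context compression described in the abstract above (summaries that compact older messages while keeping pointers to every original) might be sketched as follows. `Node`, `compact`, and the `summarize` callback are hypothetical names; in a real system the summarizer would be a model call, and the structure here is a simple tree rather than a general DAG.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str                                      # original message or summary text
    children: list = field(default_factory=list)   # pointers to the nodes this summary replaced

    def expand(self):
        """Recover every original message under this node, in order."""
        if not self.children:
            return [self.text]
        originals = []
        for child in self.children:
            originals.extend(child.expand())
        return originals

def compact(nodes, keep_recent, summarize):
    """Replace all but the last `keep_recent` nodes with one summary node."""
    split = len(nodes) - keep_recent
    if split <= 0:
        return list(nodes)
    old, recent = nodes[:split], nodes[split:]
    summary = Node(summarize([n.text for n in old]), children=old)
    return [summary] + recent
```

Because a summary node can itself be compacted later, the active context stays short while `expand` still reaches every original message, which is the sense in which the compression is lossless.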
cs.AI / 2 / 2605.04100
Regularized Centered Emphatic Temporal Difference Learning
正则化中心化强调时间差学习
Abstract
Off-policy temporal-difference (TD) learning with function approximation faces a structural tradeoff among stability, projection geometry, and variance control. Emphatic TD (ETD) improves the off-policy projection geometry through follow-on emphasis, but the follow-on trace can have high variance. We revisit this tradeoff through Bellman-error centering. Although centering naturally removes a common drift term from TD errors, we show that a naive centered emphatic extension introduces an auxiliary coupling that can destroy the positive-definiteness of the ETD key matrix. We propose \emph{Regularized Emphatic Temporal-Difference Learning} (RETD), which preserves the follow-on trace and regularizes only the auxiliary centering recursion, corresponding to lifting the lower-right block of the coupled key matrix from \(1\) to \(1+c\). We derive the RETD core matrix, prove convergence under a conservative sufficient regularization condition, and evaluate the method on diagnostic linear off-policy prediction tasks. The experiments show that RETD avoids the instability of naive centered emphatic learning, preserves favorable emphatic geometry, and exhibits a robust intermediate regime for the regularization parameter \(c\) across the diagnostics.
Chinese Translation
基于策略外的时间差(TD)学习与函数逼近面临着稳定性、投影几何和方差控制之间的结构性权衡。强调型时间差(ETD)通过后续强调改善了策略外的投影几何,但后续跟踪可能具有较高的方差。我们通过贝尔曼误差中心化重新审视这一权衡。尽管中心化自然消除了TD误差中的一个常见漂移项,但我们表明,幼稚的中心化强调扩展引入了一个辅助耦合,这可能会破坏ETD核心矩阵的正定性。我们提出了正则化强调时间差学习(RETD),其保留了后续跟踪,仅对辅助中心化递归进行正则化,相应地将耦合矩阵的右下块从1提升到1+c。我们推导了RETD核心矩阵,证明在保守的充分正则化条件下的收敛性,并在诊断线性策略外预测任务上评估该方法。实验结果表明,RETD避免了幼稚的中心化强调学习的稳定性问题,保留了良好的强调几何结构,并在正则化参数c的情况下表现出稳健的中间状态。
cs.AI / 3 / 2605.04169
Actionable Real-Time Modeling of Surgical Team Dynamics via Time-Expanded Interaction Graphs
可行动的实时手术团队动态建模:基于时扩展互动图
Abstract
Surgical team performance arises from complex interactions between technical execution and non-technical skills, including communication and coordination dynamics. However, current surgical AI systems predominantly model visual workflow signals, lacking structured representations of intraoperative team interactions over time. We propose a real-time actionable approach for modeling surgical team dynamics using time-expanded interaction graphs, where team members are modeled as time-indexed nodes and communication exchanges define directed edges. This spatio-temporal expansion enables dynamic interaction modeling, while allowing efficient inference with a static graph neural network. The model predicts procedural efficiency as the deviation from the expected duration and supports real-time deployment. Beyond prediction, we perform a counterfactual analysis to identify minimal changes in communication structure and interpretable behavioral variables associated with improved predicted outcomes. Experiments on recorded surgical procedures show that structured modeling of team interactions improves early identification of prolonged interventions and provides coherent, actionable explanations. This work advances surgical AI toward real-time, team-aware, and actionable decision support in the operating room.
Chinese Translation
手术团队的表现源于技术执行与非技术技能(包括沟通和协调动态)之间复杂的互动。然而,当前的手术人工智能系统主要建模视觉工作流程信号,缺乏对手术过程中团队互动的结构化表示。我们提出了一种实时可行动的手术团队动态建模方法,采用时扩展互动图,在这种模型中,团队成员被视为时间索引节点,而沟通交流定义了有向边。这种时空扩展使得动态互动建模成为可能,并能够通过静态图神经网络实现高效推断。该模型通过与预期时间的偏差来预测程序效率,并支持实时部署。除了预测外,我们还进行了一项反事实分析,以识别与改善预测结果相关的沟通结构的最小变化和可解释的行为变量。对记录的手术过程进行的实验表明,结构化的团队互动建模改善了对延长干预的早期识别,并提供了连贯的、可行动的解释。这项工作推动了手术人工智能向实时、团队感知和可行动的决策支持系统发展。
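A time-expanded interaction graph of the kind described in the abstract above can be sketched in a few lines. The event format `(window, sender, receiver)` and the extra edges linking each member to their own next time step are assumptions added for illustration; the abstract only states that members are time-indexed nodes and that communication exchanges define directed edges.

```python
def build_time_expanded_graph(members, events, num_windows):
    """Build a static graph whose nodes are (member, window) pairs.

    members:     list of team-member identifiers
    events:      iterable of (window, sender, receiver) communication events
    num_windows: number of discrete time windows
    """
    nodes = [(m, t) for t in range(num_windows) for m in members]
    edges = set()
    # Communication edges: sender -> receiver within the same window.
    for t, sender, receiver in events:
        edges.add(((sender, t), (receiver, t)))
    # Assumed temporal edges: each member carries state into the next window.
    for m in members:
        for t in range(num_windows - 1):
            edges.add(((m, t), (m, t + 1)))
    return nodes, sorted(edges)
```

Unrolling time into the node set is what lets a static graph neural network process a dynamic interaction pattern: the temporal structure is encoded in the topology rather than in a recurrent model.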
cs.AI / 4 / 2605.04193
ANDRE: An Attention-based Neuro-symbolic Differentiable Rule Extractor
ANDRE:一种基于注意力的神经符号可微规则提取器
Abstract
Inductive Logic Programming (ILP) aims to learn interpretable first-order rules from data, but existing symbolic and neuro-symbolic approaches struggle to scale to noisy and probabilistic settings. Classical ILP relies on discrete combinatorial rule search and is brittle under uncertainty, while differentiable ILP methods typically depend on predefined rule templates or inaccurate fuzzy operators that suffer from vanishing gradients or poor approximation of logical structure when reasoning over probabilistic predicate valuations. This paper proposes an Attention-based Neuro-symbolic Differentiable Rule Extractor (ANDRE), a novel ILP framework that learns first-order logic programs by optimizing over a continuous rule space with attention-based logical operators. ANDRE replaces both rule templates and logical operators with fully differentiable, attention-driven conjunction and disjunction operators that approximate logical min-max semantics, enabling accurate, stable, and interpretable reasoning over probabilistic data. By softly selecting, negating, or excluding predicates within each rule, ANDRE supports flexible rule induction while preserving symbolic structure. Extensive experiments on classical ILP benchmarks, large-scale knowledge bases, and synthetic datasets with probabilistic predicates and noisy supervision demonstrate that ANDRE achieves competitive or superior predictive performance while reliably recovering correct symbolic rules under uncertainty. In particular, ANDRE remains robust to moderate label noise, substantially outperforming existing differentiable ILP methods in both rule extraction quality and stability.
Chinese Translation
归纳逻辑编程(Inductive Logic Programming, ILP)旨在从数据中学习可解释的一阶规则,但现有的符号以及神经符号方法在应对噪声和概率场景时表现不佳。经典ILP依赖于离散组合规则搜索,在不确定性下非常脆弱,而可微ILP方法通常依赖于预定义的规则模板或不准确的模糊算子,而这些算子在处理概率谓词值时容易遭遇梯度消失或逻辑结构近似不良的问题。本文提出了一种基于注意力的神经符号可微规则提取器(Attention-based Neuro-symbolic Differentiable Rule Extractor,ANDRE),这是一个新颖的ILP框架,通过优化一个带有基于注意力的逻辑算子的连续规则空间来学习一阶逻辑程序。ANDRE用完全可微的、以注意力驱动的合取和析取算子替换了规则模板和逻辑算子,这些算子近似逻辑的最小-最大语义,从而实现对概率数据的准确、稳定和可解释的推理。通过在每个规则内柔性选择、否定或排除谓词,ANDRE支持灵活的规则归纳,同时保持符号结构。在经典ILP基准、大规模知识库以及具有概率谓词和噪声监督的合成数据集上进行的广泛实验表明,ANDRE在不确定性下可靠地恢复正确的符号规则,并在预测性能方面达到了竞争力或优越的水平,尤其在中等标签噪声下,ANDRE表现出较强的鲁棒性,显著优于现有的可微ILP方法,在规则提取质量和稳定性方面都表现突出。
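To illustrate what an attention-style operator approximating logical min-max semantics could look like, the sketch below uses a softmax-weighted average of predicate valuations. This is a generic soft-min/soft-max construction shown under the assumption that valuations lie in [0, 1]; it is not claimed to be the operator ANDRE actually uses, and `temperature` is a hypothetical parameter.

```python
from math import exp

def soft_and(values, temperature=0.1):
    """Soft conjunction: attention weights favor the smallest valuation."""
    weights = [exp(-v / temperature) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def soft_or(values, temperature=0.1):
    """Soft disjunction: attention weights favor the largest valuation."""
    weights = [exp(v / temperature) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

As the temperature approaches zero the outputs approach the exact minimum and maximum, while larger temperatures spread the weights across all inputs, which keeps gradients from vanishing on the non-extreme predicates.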
cs.AI / 5 / 2605.04227
Pro$^2$Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks
Pro$^2$Assist:基于多模态自我中心感知的连续步态意识主动辅助,用于长时间程序任务
Abstract
Procedural tasks with multiple ordered steps are ubiquitous in daily life. Recent advances in multimodal large language models (MLLMs) have enabled personal assistants that support daily activities. However, existing systems primarily provide reactive guidance triggered by user queries, or limited proactive assistance for isolated short-term events rather than long-horizon procedural tasks. In this work, we introduce Pro$^2$Assist, a step-aware proactive assistant that continuously tracks fine-grained task progress and reasons over the user's evolving state to provide timely assistance throughout tasks. Pro$^2$Assist leverages multimodal data from augmented reality (AR) glasses to achieve motion-based perception. It then extracts step-oriented procedural context from multi-scale temporal dynamics and task-specific expert knowledge. Based on both sensory input and procedural context, Pro$^2$Assist performs continuous reasoning to infer user needs and display timely assistance on AR glasses. We evaluate Pro$^2$Assist using a dataset curated from public sources and a real-world dataset collected on our testbed with AR glasses. Extensive evaluations show that Pro$^2$Assist outperforms the best-performing baselines by over 21% in procedural action understanding accuracy, and it achieves up to 2.29x the proactive timing accuracy of baselines. A user study with 20 participants further shows that 90% find Pro$^2$Assist useful, indicating its effectiveness for real-world procedural assistance.
Chinese Translation
程序任务在日常生活中无处不在,其具有多个有序步骤。近期多模态大型语言模型(MLLMs)的进展,使得个人助手能够支持日常活动。然而,现有系统主要提供响应用户查询的反应性指导,或对孤立的短期事件提供有限的主动辅助,而非针对长时间的程序任务。在本研究中,我们引入Pro$^2$Assist,这是一种步态意识的主动助手,可以持续跟踪任务的细粒度进展,并依据用户不断变化的状态进行推理,以在整个任务过程中提供及时的帮助。Pro$^2$Assist利用来自增强现实(AR)眼镜的多模态数据,实现基于运动的感知。接着,它从多尺度时间动态和特定任务的专家知识中提取以步骤为导向的程序上下文。基于感知输入和程序上下文,Pro$^2$Assist进行持续推理,以推断用户需求,并在AR眼镜上显示及时帮助。我们使用从公共来源整理的数据集和在我们的测试平台上收集的真实世界数据集对Pro$^2$Assist进行评估。广泛的评估结果表明,Pro$^2$Assist在程序动作理解准确性上超过最佳基线超过21%,并且在主动时间准确性上达到了基线的2.29倍。对20名参与者进行的用户研究进一步表明,有90%的人认为Pro$^2$Assist非常有用,表明其在现实程序辅助中的有效性。
cs.AI / 6 / 2605.04243
Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
时间推理不是瓶颈:一种针对神经符号问答的概率不一致框架
Abstract
Despite significant advances, large language models (LLMs) continue to exhibit brittle performance on complex temporal reasoning tasks. This failure mode is widely attributed to inherent deficits in autoregressive logical deduction. In this paper, we challenge this prevailing narrative, demonstrating that temporal reasoning is not the fundamental bottleneck; rather, the locus of failure lies in unstructured text-to-event representation. We introduce a novel neuro-symbolic question-answering framework governed by a Probabilistic Inconsistency Signal (PIS) that explicitly isolates perceptual errors from reasoning failures. By lifting unstructured text into explicit event graphs and interval constraints, our architecture strictly decouples semantic extraction from a symbolic reasoning engine. To robustly detect structural breaks, the PIS elegantly unifies symbolic credal intervals with epistemic neural uncertainty extracted via Evidential Deep Learning on LLM hidden states. Empirical evaluations reveal a striking paradigm shift: when provided with correct structural representations, our system's explicit proof traces achieve perfect 1.0 accuracy (4000/4000) and strictly zero false positives/negatives on temporal arithmetic benchmarks. On broader, noise-injected QA settings, the framework maintains a competitive 75.1\% accuracy while enabling deterministic, step-level failure localization. Ultimately, by isolating the representation bottleneck from the reasoning substrate, this work reframes temporal QA from an algorithmic reasoning challenge to a structural alignment problem, charting a verifiable path forward for reliable neuro-symbolic AI.
Chinese Translation
尽管取得了显著进展,大型语言模型(LLMs)在复杂的时间推理任务上仍表现出脆弱的性能。这种失效模式被广泛归因于自回归逻辑推理的固有缺陷。本文挑战了这一普遍观念,表明时间推理并不是根本瓶颈;相反,失败的焦点在于非结构化的文本到事件表示。我们引入了一种新颖的神经符号问答框架,由概率不一致信号(Probabilistic Inconsistency Signal, PIS)主导,明确将感知错误与推理失败隔离开来。通过将非结构化文本提升到明确的事件图和区间约束,我们的架构严格将语义提取与符号推理引擎解耦。为了稳健地检测结构性断裂,PIS优雅地将符号信念区间与通过证据深度学习(Evidential Deep Learning)从LLM隐藏状态提取的认知神经不确定性结合在一起。实证评估揭示了显著的范式转变:在提供正确的结构表示时,我们系统的明确证明痕迹在时间算术基准测试中实现了完美的1.0准确率(4000/4000)且绝对零假阳性/假阴性。在更广泛的噪声注入问答设置中,该框架维持了75.1\%的竞争性准确率,同时实现了确定性的逐步故障定位。最终,通过将表示瓶颈与推理基础分离,该工作将时间问答重新框架化为一个结构对齐问题,而非算法推理挑战,为可靠的神经符号人工智能指明了可验证的前进路径。
cs.AI / 7 / 2605.04263
Parallel Prefix Verification for Speculative Generation
并行前缀验证用于推测生成
Abstract
We introduce PARSE (PArallel pRefix Speculative Engine), a speculative generation framework that accelerates large language model (LLM) inference by parallelizing prefix verification on a semantic level. Existing speculative decoding methods are fundamentally limited by token-level equivalence: the target model must verify each token, leading to short acceptance lengths and modest speedups. Moving to semantic or segment-level verification can substantially increase acceptance granularity, but prior approaches rely on sequential verification, introducing significant overhead and limiting practical gains. PARSE introduces parallel prefix verification, enabling semantic-level verification without sequential checks. Given a full draft from a draft model, the target model evaluates correctness across multiple prefixes in a single forward pass using a custom attention mask, directly identifying the maximal valid prefix. This eliminates sequential segment verification, and makes verification compute-efficient. PARSE is orthogonal to token-level speculative decoding and can be composed with it for additional gains. Across models and benchmarks, PARSE delivers $1.25\times$ to $4.3\times$ throughput gain over the target model, and $1.6\times$ to $4.5\times$ when composed with EAGLE-3, all with negligible accuracy degradation. This demonstrates parallel prefix verification as an effective, general approach to accelerating LLM inference.
Chinese Translation
我们引入了PARSE(并行前缀推测引擎),这是一种通过在语义层面上并行化前缀验证来加速大型语言模型(LLM)推理的推测生成框架。现有的推测解码方法在从根本上受到基于标记的等价性限制:目标模型必须对每个标记进行验证,导致接受长度短且速度提升有限。转向语义或段级验证可以显著增加接受粒度,但之前的方法依赖于顺序验证,造成显著的开销并限制实际收益。PARSE引入了并行前缀验证,能够在无需顺序检查的情况下实现语义级别的验证。在从草拟模型获取完整草稿后,目标模型使用自定义注意力掩码在一次前向传播中对多个前缀进行正确性评估,直接识别出最大有效前缀。这消除了顺序段验证,使验证计算变得高效。PARSE与基于标记的推测解码是正交的,可以结合使用以获得额外的收益。在多个模型和基准测试中,PARSE提供了目标模型 $1.25\times$ 到 $4.3\times$ 的吞吐量提升,与EAGLE-3结合时达到 $1.6\times$ 到 $4.5\times$ 的增益,且几乎没有准确度的下降。这表明并行前缀验证是一种有效的通用方法,可以加速LLM推理。
cs.AI / 8 / 2605.04312
Agent Island: A Saturation- and Contamination-Resistant Benchmark from Multiagent Games
代理岛:来自多智能体游戏的抗饱和与抗污染基准
Abstract
Static capabilities benchmarks suffer from saturation and contamination, making it difficult to track capabilities progress over time. We introduce Agent Island, a multiplayer simulation environment in which language-model agents compete in a game of interagent cooperation, conflict, and persuasion. The environment yields a dynamic benchmark designed to mitigate both saturation and contamination; new models can always outperform the current leading player in this winner-take-all game, and agents compete against other adaptive agents rather than face a fixed task set. We rank players with a Bayesian Plackett-Luce model, allowing us to quantify uncertainty in player skill. In 999 games involving 49 unique models, openai/gpt-5.5 dominates its peers with a posterior mean skill of 5.64, compared with 3.10 for the second-ranked model, openai/gpt-5.2, and 2.86 for the third-ranked model, openai/gpt-5.3-codex. We release the game logs as a dataset for analyses of model behavior. As an example, we investigate same-provider preference in final-round votes and find that models are 8.3 p.p. more likely to support a same-provider finalist than finalists from other providers. This preference is not uniform across providers: among separately estimated providers, the effect is strongest for OpenAI models and weakest for Anthropic models.
Chinese Translation
静态能力基准受到饱和和污染的影响,使得跟踪能力进展变得困难。我们介绍了代理岛,一个多人模拟环境,其中语言模型代理在协作、冲突和说服的智能体游戏中进行竞争。该环境提供了一个动态基准,旨在减轻饱和和污染的问题;新模型始终可以在这个赢家通吃的游戏中超过当前的领先玩家,代理与其他自适应代理竞争,而不是面对固定的任务集。我们使用贝叶斯 Plackett-Luce 模型对玩家进行排名,从而量化玩家技能的不确定性。在包含 49 个独特模型的 999 场游戏中,openai/gpt-5.5 以 5.64 的后验平均技能超越同行,而排名第二的模型 openai/gpt-5.2 则为 3.10,排名第三的模型 openai/gpt-5.3-codex 为 2.86。我们将游戏日志发布为数据集以便于对模型行为的分析。举例来说,我们调查了最终投票中的同一提供者偏好,发现模型支持同一提供者决赛选手的可能性比支持来自其他提供者的决赛选手高出 8.3 个百分点。这种偏好在不同提供者之间并不均匀:在单独估计的提供者中,该效应在 OpenAI 模型中最强,而在 Anthropic 模型中最弱。
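The Plackett-Luce model mentioned in the abstract above assigns a probability to a full finishing order by repeatedly choosing the next-best player in proportion to exponentiated skill. The sketch below computes the log-likelihood of one observed order; treating skills as log-scale quantities is an assumption about the parameterization, and the Bayesian prior and posterior inference used by the authors are not shown.

```python
from math import exp, log

def plackett_luce_log_likelihood(ranking, skill):
    """Log-probability of an observed finishing order.

    ranking: list of player ids, best first
    skill:   dict mapping player id -> log-scale skill (assumed parameterization)
    """
    total = 0.0
    remaining = list(ranking)
    while len(remaining) > 1:
        # Probability that the current leader is picked first among those left.
        denominator = sum(exp(skill[p]) for p in remaining)
        total += skill[remaining[0]] - log(denominator)
        remaining.pop(0)
    return total
```

With equal skills every ordering of n players is equally likely, so three evenly matched players give a log-likelihood of log(1/6) for any observed order.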
cs.AI / 9 / 2605.04330
The Scaling Properties of Implicit Deductive Reasoning in Transformers
变换器中隐式演绎推理的尺度特性
Abstract
We investigate the scaling properties of implicit deductive reasoning over Horn clauses in depth-bounded Transformers. By systematically decorrelating provability from spurious features and enforcing algorithmic alignment, we find that in sufficiently deep models with a bidirectional prefix mask, implicit reasoning approaches explicit CoT performance across graph topologies and problem widths, though CoT remains necessary for depth extrapolation.
Chinese Translation
我们深入研究了在深度受限变换器中,关于霍恩子句的隐式演绎推理的尺度特性。通过系统性地将可证明性与虚假特征解耦,并加强算法一致性,我们发现,在具有双向前缀掩码的足够深的模型中,隐式推理在各种图形拓扑和问题宽度下接近显式链推(Chain of Thought,CoT)的性能,尽管对于深度外推,CoT 仍然是必要的。
cs.AI / 10 / 2605.04361
When Context Hurts: The Crossover Effect of Knowledge Transfer on Multi-Agent Design Exploration
当背景信息适得其反时:知识转移在多智能体设计探索中的交叉效应
Abstract
The prevailing assumption in agent orchestration is that more context is better. We test this on multi-agent software design across 10 tasks, 7 context-injection conditions, and over 2,700 runs, and find a crossover effect: the same artifact type improves design exploration on some tasks (up to 20$\times$ tradeoff coverage) and actively degrades it on others (up to 46% reduction). On several tasks, an irrelevant document performs as well as or better than every relevant artifact. The direction is predicted by a single measurable variable--baseline exploration without context--with Pearson $r = -0.82$ ($p < 0.001$). Probing the mechanism by manipulating convergence pressure through prompt design reveals two distinct regimes: convergence driven by training data priors (natural) responds to artifact disruption, while convergence driven by explicit instructions (induced) does not. The implication is that context injection should be conditional, not universal: one no-context trial is a cheap diagnostic that predicts whether knowledge artifacts will help or hurt a given task.
Chinese Translation
在智能体协作中,普遍的假设是提供更多背景信息会更好。我们在多智能体软件设计中对10个任务、7种背景注入条件和超过2700次实验进行了测试,发现存在交叉效应:相同类型的工件在某些任务上能提高设计探索(覆盖率提升可达20倍),而在其他任务上则会明显降低其效果(减少达46%)。在多个任务中,一份无关的文档表现得与每个相关工件一样好,甚至更佳。其方向可由一个可测量的变量预测——无背景的基线探索——相关性为Pearson $r = -0.82$ ($p < 0.001$)。通过通过提示设计操控收敛压力以探究机制,发现了两种不同的机制:由训练数据先验驱动的收敛(自然)对工件干扰有所反应,而由明确指令驱动的收敛(诱导)则没有。此结果暗示背景注入应为有条件的,而非普遍适用的:一次无背景的试验是一个廉价的诊断工具,可以预测知识工件是否会对特定任务有利或有害。
cs.AI / 11 / 2605.04454
Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone
仅从模型层面的评估无法推断与部署相关的对齐
Abstract
Alignment evaluation in machine learning has largely become evaluation of models. Influential benchmarks score model outputs under fixed inputs, such as truthfulness, instruction following, or pairwise preference, and these scores are often used to support claims about deployed alignment. This paper argues that deployment-relevant alignment cannot be inferred from model-level evaluation alone. Alignment claims should instead be indexed to the level at which evidence is collected: model-level, response-level, interaction-level, or deployment-level. Two studies support this position. First, a structured audit of eleven alignment benchmarks, extended to a sixteen-benchmark corpus, dual-coded against an eight-dimension rubric with Cohen's kappa = 0.87, finds that user-facing verification support is absent across every benchmark examined, while process steerability is nearly absent. The few interactional benchmarks identified, including tau-bench, CURATe, Rifts, and Common Ground, remain fragmented in coverage, and benchmark construction rather than data source determines what is measured. Second, a blinded cross-model stress test using 180 transcripts across three frontier models and four scaffolds finds that the same verification scaffold raises one model's verification support to ceiling while leaving another categorically unchanged. This shows that scaffold efficacy is model-dependent and that the gap identified by the audit cannot be closed at the model level alone. We propose a system-level evaluation agenda: alignment profiles instead of single scores, fixed-scaffolding protocols for comparable interactional evaluation, and reporting templates that make the inferential distance between evaluation evidence and deployment claims explicit.
Chinese Translation
机器学习中的对齐评估在很大程度上已变成对模型的评估。影响力基准在固定输入下评分模型输出,例如真实性、指令遵循或成对偏好,这些分数常常用于支持关于部署对齐的主张。本文主张,仅从模型层面的评估无法推断与部署相关的对齐。对齐主张应当与证据收集的层级相关联:模型层级、响应层级、交互层级或部署层级。两项研究支持这一观点。首先,对十一项对齐基准的结构性审计,扩展至十六项基准语料库,基于一个八维的评分标准进行双重编码,Cohen's kappa = 0.87,发现任何审查过的基准中均缺乏用户面对面的验证支持,而过程可操纵性几乎不存在。识别出的少数交互基准,包括 tau-bench、CURATe、Rifts 和 Common Ground,覆盖范围仍然支离破碎,基准构建而非数据源决定了所测量的内容。其次,利用180个转录文本对三个前沿模型和四种支架进行盲法跨模型压力测试,发现相同的验证支架使一种模型的验证支持达到顶峰,而另一种模型则保持不变。这表明支架的有效性依赖于模型,而审计所识别的差距不能仅在模型层面上弥补。我们提出一个系统层级的评估议程:对齐档案而非单一分数、可比较的交互评估的固定支架协议,以及明确评估证据与部署主张之间推理距离的报告模板。
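The inter-coder reliability figure quoted in the abstract above (Cohen's kappa = 0.87) corrects raw agreement between two coders for the agreement expected by chance. A minimal sketch of the standard two-rater formula follows; the function name and the label encoding are illustrative, and the authors' rubric and per-dimension procedure are not reproduced here.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items.

    Undefined (division by zero) when chance agreement equals 1,
    i.e. both raters use a single identical label throughout.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

For example, two coders who agree on 3 of 4 items, with chance agreement of 0.5, obtain kappa = (0.75 - 0.5) / (1 - 0.5) = 0.5.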
cs.AI / 12 / 2605.04488
How Does Thinking Mode Change LLM Moral Judgments? A Controlled Instant-vs-Thinking Comparison Across Five Frontier Models
思维模式如何改变大语言模型的道德判断?五个前沿模型的控制即时与思考比较
Abstract
We evaluate whether enabling provider-exposed reasoning mode changes moral judgments within the same model checkpoint. Across 100 moral-judgment scenarios and five frontier reasoning-trained LLMs (Claude Sonnet 4.6, GPT 5.5, Gemini 3 Flash, DeepSeek V3.1, and Qwen3.5 397B), aggregate binary-verdict agreement remains high and statistically indistinguishable between instant and thinking modes (Krippendorff's alpha = 0.78 vs. 0.79). However, disagreement is concentrated in 21 model-disputed scenarios, where instant-mode agreement is near chance (alpha = 0.08). On these scenarios, reasoning directionally narrows cross-model disagreement, increasing mean pairwise agreement from 5.4 to 6.7 out of 10. Reasoning also reduces demographic-judgment inconsistency in three of five models and does not increase it for any model. Across all five model families, reasoning changes self-labeled ethical frameworks more often than binary verdicts.
Chinese Translation
我们评估了在相同模型检查点中,启用提供者曝光的推理模式是否会改变道德判断。在100个道德判断场景和五个前沿推理训练的大语言模型(Claude Sonnet 4.6、GPT 5.5、Gemini 3 Flash、DeepSeek V3.1和Qwen3.5 397B)中,整体的二元裁决一致性在即时模式和思考模式之间保持高水平且在统计上不可区分(Krippendorff's alpha = 0.78 对比 0.79)。然而,分歧主要集中在21个模型争议场景中,其中即时模式的一致性接近随机(alpha = 0.08)。在这些场景中,推理方向性缩小了跨模型的分歧,使得平均配对一致性从5.4提高到6.7(满分10分)。推理还在五个模型中的三个减少了人口统计判断的不一致性,并未对任何模型增加不一致性。在所有五个模型家族中,推理比二元裁决更常改变自标记的伦理框架。
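The "out of 10" scale for mean pairwise agreement in the abstract above is consistent with five models forming ten unordered pairs. Under that reading, which is an assumption, the per-scenario count can be sketched as:

```python
from itertools import combinations

def pairwise_agreement(verdicts):
    """Number of model pairs that gave the same verdict on one scenario."""
    return sum(a == b for a, b in combinations(verdicts, 2))
```

With five binary verdicts the count can only take the values 10 (unanimous), 6 (a 4-1 split), or 4 (a 3-2 split), so under this reading a mean of 5.4 on the disputed scenarios corresponds to a mix of 4-1 and 3-2 splits.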
cs.AI / 13 / 2605.04572
From Parameter Dynamics to Risk Scoring : Quantifying Sample-Level Safety Degradation in LLM Fine-tuning
从参数动态到风险评分:量化大语言模型微调中的样本级安全退化
Abstract
Safety alignment of Large Language Models (LLMs) is extremely fragile, as fine-tuning on a small number of benign samples can erase safety behaviors learned from millions of preference examples. Existing studies attempt to explain this phenomenon by comparing parameters and hidden states before and after fine-tuning, but overlook their dynamic evolution during fine-tuning. In this paper, we uncover a critical mechanism underlying safety degradation by analyzing parameter dynamics: benign fine-tuning causes parameters to cumulatively drift toward danger-aligned directions, progressively undermining the model's safety. This finding suggests that samples contributing more to this drift carry greater fine-tuning risk. Based on this insight, we propose Sample-Level Quantification of Safety Degradation (SQSD), a method that quantifies the influence of each training sample on safety degradation. Specifically, SQSD assigns each sample a continuous risk score by measuring the difference between the projections of its induced parameter update onto the danger and safety directions. Extensive experiments across multiple models and datasets demonstrate that SQSD effectively quantifies sample-level fine-tuning risks and exhibits strong transferability across model architectures, parameter scales, and parameter-efficient methods.
Chinese Translation
大语言模型(LLMs)的安全对齐极其脆弱,因为在少量良性样本上的微调可能会抹去从数百万个偏好示例中学习到的安全行为。现有研究通过比较微调前后的参数和隐藏状态来解释这一现象,但忽略了它们在微调过程中的动态演变。本文通过分析参数动态揭示了安全退化的一个关键机制,其中良性微调导致参数逐渐漂移到危险对齐的方向,逐步削弱模型的安全性。这一发现表明,贡献于这种漂移的样本具有更大的微调风险。基于这一认识,我们提出了一种样本级安全退化量化方法(Sample-Level Quantification of Safety Degradation,SQSD),该方法量化每个训练样本对安全退化的影响。具体而言,SQSD通过测量样本引发的参数更新在危险方向和安全方向之间的投影差异,计算样本的连续风险评分。我们在多个模型和数据集上进行的大量实验证明,SQSD有效地量化了样本级微调风险,并在模型架构、参数规模和参数高效方法上表现出强大的迁移能力。
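The scoring idea in this abstract (projection onto a danger direction minus projection onto a safety direction) can be sketched as follows. This is a toy illustration, not the SQSD implementation: the two-dimensional vectors, the direction definitions, and the sample names are all assumptions, since the abstract does not say how directions or updates are obtained.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def risk_score(update, danger_dir, safety_dir):
    """Projection onto the danger direction minus projection onto the safety direction."""
    return dot(update, unit(danger_dir)) - dot(update, unit(safety_dir))

# Hypothetical directions and per-sample parameter updates (toy 2-D values).
danger = [1.0, 0.0]
safety = [0.0, 1.0]
updates = {"sample_a": [0.3, 0.1], "sample_b": [0.05, 0.4]}

scores = {name: risk_score(u, danger, safety) for name, u in updates.items()}
print(scores)
```

Under these assumptions a positive score means the update moves further along the danger direction than the safety direction, so `sample_a` would rank as riskier than `sample_b`.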
cs.AI / 14 / 2605.04608
SensingAgents: A Multi-Agent Collaborative Framework for Robust IMU Activity Recognition
SensingAgents:一种用于鲁棒IMU活动识别的多智能体协作框架
Abstract
Human Activity Recognition (HAR) using Inertial Measurement Unit (IMU) sensors is a cornerstone of mobile health, smart environments, and human-computer interaction. However, current deep learning-based HAR models often struggle with heavy reliance on labeled data, position-specific ambiguity, and a lack of transparent reasoning. Inspired by the advanced agents framework, which emulates a collaborative agent using Large Language Models (LLMs), we propose SensingAgents, a novel multi-agent system for robust IMU activity recognition. SensingAgents organizes LLM-powered agents into specialized roles: a group of Analyst Agents for position-specific sensor analysis (arm, wrist, belt, pocket), a pair of Advocate Agents that resolve sensor conflicts through dynamic and static dialectical debates, and a Decision Agent that ensures reliability under sensor drift or failure. Evaluation on the Shoaib dataset demonstrates that SensingAgents significantly outperforms state-of-the-art single-agent and multi-agent LLM models, achieving an accuracy of 79.5% in a zero setting (29% higher than existing agent models and 9.4% higher than deep learning baselines), particularly in complex scenarios where multi-sensor data is conflicting or noisy. Our work highlights the potential of multi-agent collaborative reasoning for advancing the robustness and interpretability of ubiquitous sensing systems.
Chinese Translation
利用惯性测量单元(IMU)传感器进行人类活动识别(HAR)是移动健康、智能环境和人机交互的重要基础。然而,目前基于深度学习的人类活动识别模型往往严重依赖标注数据,存在位置特定的模糊性以及缺乏透明的推理能力。受到先进的智能体框架的启发,该框架通过大型语言模型(LLMs)模拟协作智能体,我们提出了SensingAgents,一个新颖的多智能体系统,用于鲁棒的IMU活动识别。SensingAgents将基于LLM的智能体组织成专门角色:一组分析智能体用于位置特定的传感器分析(手臂、手腕、腰带、口袋),一对倡导智能体通过动态和静态的辩证讨论来解决传感器冲突,以及一个决策智能体在传感器漂移或故障下确保可靠性。针对Shoaib数据集的评估表明,SensingAgents在零设置下的准确率达到79.5%,显著优于最先进的单智能体和多智能体LLM模型,比现有智能体模型高出29%,比深度学习基线高出9.4%,尤其是在多传感器数据冲突或噪声较大的复杂场景中。我们的工作突显了多智能体协作推理在提升无处不在的传感系统鲁棒性和可解释性方面的潜力。
cs.AI / 15 / 2605.04624
AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair
AuditRepairBench:一个用于评估者-通道排名不稳定性的配对执行轨迹语料库,应用于智能体修复
Abstract
Agent-repair leaderboards reorder under evaluator reconfiguration, and a measurable share of the reordering is produced by methods that consult evaluator-derived signal during internal selection of candidate repairs. We document this failure mode on a public leaderboard and release AuditRepairBench, a paired-execution trace corpus of 576,000 registered cells (96,000 executed) that operationalizes evaluator-channel-blocking ranking instability within a declared observability boundary. A modular screening architecture decides pathway-blocking through four interchangeable implementations: a learned influence proxy, a rule-based channel-exposure ratio that uses no trained model, a counterfactual sensitivity proxy, and a sparse human-audit proxy, combined into a screening posterior that feeds a cell-level flip functional, a set-valued label, a stratified system score, and a set-valued leaderboard. The resource is supported by mechanism-anchored validation on an 80-case source-level channel-surgery subset, an independent-discovery protocol under which two annotator groups separated from the pipeline developers discover coupling patterns blinded to the screening design and the frozen ensemble attains pooled AUROC 0.83 on their 79 cases, implementation robustness, uncertainty propagation that raises 95% coverage from 0.81 to 0.95, and forward transfer with pooled community-evaluator Spearman $\rho$ = 0.65. Screening-guided blinding patches reduce rank displacement by 55-74% (mean 62%) at fewer than 50 lines of code, whereas random channel blinding produces at most 7% reduction and generic retraining at most 13%. AuditRepairBench-Lite, a rule-only configuration on a 12,000-cell subset, preserves the leaderboard at Kendall $\tau$ = 0.88 under twenty-four GPU-hours and is the primary release artifact at 42 GB.
Chinese Translation
智能体修复排行榜在评估者重新配置时会重新排序,而这种重新排序的一部分是由在内部选择候选修复时咨询评估者派生信号的方法引起的。我们在一个公共排行榜上记录了这种失败模式,并发布了AuditRepairBench,这是一个包含576,000个注册单元(其中96,000个已执行)的配对执行轨迹语料库,旨在操作化在声明的可观测边界内的评估者-通道阻塞排名不稳定性。一个模块化的筛选架构通过四种可互换的实现来决定路径阻塞,包括学习的影响代理、一个不使用已训练模型的基于规则的通道曝光比例、一个反事实敏感度代理和一个稀疏的人类审计代理,组合成一个筛选后验,输入到单元级翻转功能、集值标签、分层系统分数和集值排行榜。该资源通过在80个案例的源级通道手术子集上进行机制锚定的验证而得到支持,在独立发现的协议下,两组注释者与管道开发者隔离,发现与筛选设计无关的耦合模式,冻结的集成在他们的79个案例上获得了聚合AUROC 0.83,实施稳健性,带来从0.81到0.95的95%覆盖率的不确定性传播,以及与社区评估者的前向转移,Spearman $\rho$ = 0.65。指导性盲法的筛选补丁将排名偏移减少了55-74%(平均62%),且代码行数少于50行,而随机通道盲法最多仅能减少7%,通用再训练最多减少13%。AuditRepairBench-Lite,一个仅规则配置的12,000单元子集,在二十四个GPU小时内保持了Kendall $\tau$ = 0.88,并作为42 GB的主要发布成果。
cs.AI / 16 / 2605.04711
Budget-aware Auto Optimizer Configurator
预算感知自动优化器配置器
Abstract
Optimizer states occupy massive GPU memory in large-scale model training. However, gradients in different network blocks exhibit distinct behaviors, such as varying directional stability and scale anisotropy, implying that expensive optimizer states are not universally necessary and using a global optimizer is often memory-inefficient. We propose the Budget-Aware Optimizer Configurator (BAOC) to reduce memory cost by assigning suitable optimizer configurations to individual blocks under given budgets. Specifically, BAOC samples gradient streams to derive statistical metrics that quantify the potential performance risk of applying cheaper configurations (e.g., low precision or removing momentum). It then solves a constrained allocation problem to minimize total risk under memory and time budgets, selecting a budget-feasible configuration for each block. Experiments across vision, language, and diffusion workloads demonstrate that BAOC maintains training quality while significantly reducing the memory usage of optimizer states. The code is available at https://anonymous.4open.science/r/BAOC-45C6.
Chinese Translation
在大规模模型训练中,优化器状态占据了大量的GPU内存。然而,不同网络块中的梯度表现出不同的特征,如方向稳定性和尺度各异,这表明昂贵的优化器状态并非普遍必要,使用全局优化器通常也会导致内存效率低下。我们提出了预算感知优化器配置器(Budget-Aware Optimizer Configurator, BAOC),通过在给定预算下为各个块分配合适的优化器配置来减少内存成本。具体而言,BAOC通过采样梯度流来推导统计指标,从而量化使用较低配置(如低精度或去除动量)所带来的潜在性能风险。接着,它解决一个受限分配问题,以在内存和时间预算下最小化总风险,为每个块选择一个在预算范围内的配置。针对视觉、语言和扩散工作负载的实验表明,BAOC在显著降低优化器状态内存使用的同时,保持了训练质量。代码可以在 https://anonymous.4open.science/r/BAOC-45C6 获取。
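The allocation step described above (one configuration per block, minimum total risk, within a budget) has the shape of a multiple-choice knapsack problem. The sketch below solves a tiny instance by exhaustive search. Block names, memory figures, and risk values are invented for illustration, only a memory budget is modeled (the abstract also mentions a time budget), and the actual BAOC solver is not described in the abstract.

```python
from itertools import product

# Hypothetical options per block: (config name, optimizer-state memory in MB, estimated risk).
options = {
    "embed": [("adamw_fp32", 800, 0.00), ("adamw_8bit", 200, 0.02), ("sgd_no_momentum", 0, 0.30)],
    "attn": [("adamw_fp32", 1200, 0.00), ("adamw_8bit", 300, 0.10), ("sgd_no_momentum", 0, 0.50)],
    "mlp": [("adamw_fp32", 2400, 0.00), ("adamw_8bit", 600, 0.01), ("sgd_no_momentum", 0, 0.05)],
}

def allocate(options, memory_budget):
    """Pick one config per block, minimizing total risk subject to the memory budget."""
    blocks = list(options)
    best = None
    for choice in product(*(options[b] for b in blocks)):
        memory = sum(c[1] for c in choice)
        risk = sum(c[2] for c in choice)
        if memory <= memory_budget and (best is None or risk < best[0]):
            best = (risk, memory, {b: c[0] for b, c in zip(blocks, choice)})
    return best  # None if no combination fits the budget

risk, memory, plan = allocate(options, memory_budget=1500)
print(plan, memory, round(risk, 2))
```

Exhaustive search is only workable for a handful of blocks; a dynamic program or an integer-programming solver would be the more realistic choice at scale.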
cs.AI / 17 / 2605.04733
Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing
用于沉浸式视频角色扮演的奖励分解强化学习
Abstract
Text-based role-playing models can imitate character styles, yet they often fail to reflect a scene's atmosphere and evolving tension, both essential for immersive applications such as Virtual Reality (VR) games and interactive narratives. We study video-grounded role-playing dialogue and introduce EBM-RL (Eye-Brain-Mouth Reinforcement Learning), a decoupled GRPO-based framework that explicitly separates observation ([perception]), reasoning ([think]), and utterance ([answer]). This structure promotes human-like sensory grounding by compelling the model to first attend to visual cues, then form internal interpretations, and finally generate context-appropriate dialogue. EBM-RL integrates four complementary rewards: (i) CLIP-based scene-text alignment to improve ambiance and emotion; (ii) a Perceptual-Cognitive reward that encourages [perception] and [think] processes that increase the likelihood of the reference response; (iii) answer accuracy to ensure faithfulness; and (iv) a dense format reward to enforce the desired structured output. Extensive experiments demonstrate that EBM-RL substantially outperforms text-only role-playing baselines and larger-scale vision-language models on our immersive role-playing benchmark, delivering simultaneous gains in visual-atmosphere consistency and character authenticity. Beyond the role-playing domain, EBM-RL also exhibits strong zero-shot generalization: without any additional fine-tuning, it consistently improves performance on out-of-domain VideoQA benchmarks. We additionally release an open-source dataset for video-grounded role-playing dialogue.
Chinese Translation
基于文本的角色扮演模型能够模仿角色风格,但常常无法反映场景的氛围和不断变化的紧张感,而这对于虚拟现实(VR)游戏和互动叙事等沉浸式应用至关重要。我们研究了视频基础的角色扮演对话,并引入了 EBM-RL(眼脑口强化学习),一个解耦的、基于 GRPO 的框架,明确区分观察(感知)、推理(思考)和发声(回答)。该结构通过迫使模型首先关注视觉线索,然后形成内部解释,最后生成上下文适宜的对话,促进了类人感官基础的建立。EBM-RL 整合了四种互补的奖励:(i)基于 CLIP 的场景文本对齐,以改善氛围和情感;(ii)感知-认知奖励,鼓励感知和思考过程,从而提高参考回答的可能性;(iii)答案精确性,确保忠实性;(iv)密集格式奖励,以强化所需的结构化输出。大量实验证明,EBM-RL 在我们的沉浸式角色扮演基准上显著超越了仅基于文本的角色扮演基线和更大规模的视觉-语言模型,同时在视觉氛围一致性和角色真实性上实现了同步提升。超越角色扮演领域,EBM-RL 还展示了强大的零次学习泛化能力:在没有任何额外微调的情况下,它在跨领域视频问答(VideoQA)基准上持续改善表现。此外,我们还发布了一个用于视频基础角色扮演对话的开源数据集。
cs.AI / 18 / 2605.04785
AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use
AgentTrust:人工智能代理工具使用的运行时安全评估与拦截
Abstract
Modern AI agents execute real-world side effects through tool calls such as file operations, shell commands, HTTP requests, and database queries. A single unsafe action, including accidental deletion, credential exposure, or data exfiltration, can cause irreversible harm. Existing defenses are incomplete: post-hoc benchmarks measure behavior after execution, static guardrails miss obfuscation and multi-step context, and infrastructure sandboxes constrain where code runs without understanding what an action means. We present AgentTrust, a runtime safety layer that intercepts agent tool calls before execution and returns a structured verdict: allow, warn, block, or review. AgentTrust combines a shell deobfuscation normalizer, SafeFix suggestions for safer alternatives, RiskChain detection for multi-step attack chains, and a cache-aware LLM-as-Judge for ambiguous inputs. We release a 300-scenario benchmark across six risk categories and an additional 630 independently constructed real-world adversarial scenarios. On the internal benchmark, the production-only ruleset achieves 95.0% verdict accuracy and 73.7% risk-level accuracy at low-millisecond end-to-end latency. On the 630-scenario benchmark, evaluated under a patched ruleset and not claimed as zero-shot, AgentTrust achieves 96.7% verdict accuracy, including about 93% on shell-obfuscated payloads. AgentTrust is released under the AGPL-3.0 license and provides a Model Context Protocol server for MCP-compatible agents.
Chinese Translation
现代人工智能代理通过工具调用(如文件操作、shell命令、HTTP请求和数据库查询)执行现实世界的副作用。单一不安全操作,例如意外删除、凭证暴露或数据外泄,均可能造成不可逆的伤害。现有防御措施不够完善:事后基准测试在执行后衡量行为,静态保护措施无法应对混淆和多步上下文,而基础设施沙箱限制代码运行的位置,却不理解某个操作的含义。我们提出AgentTrust,这是一种运行时安全层,能够在执行前拦截代理工具调用,并返回结构化判定:允许、警告、阻止或审查。AgentTrust结合了shell去混淆归一化处理、安全替代方案的SafeFix建议、多步攻击链的RiskChain检测,以及针对模糊输入的缓存感知LLM-as-Judge。我们发布了涵盖六个风险类别的300个场景基准测试,以及额外630个独立构建的现实对抗场景。在内部基准测试中,仅生产规则集达到了95.0%的判定准确率和73.7%的风险水平准确率,延迟在低毫秒级。在630场景基准测试中,在经过修补的规则集下评估并非声称为零样本的情况下,AgentTrust达到了96.7%的判定准确率,其中大约93%适用于shell混淆的有效载荷。AgentTrust在AGPL-3.0许可证下发布,并为兼容MCP代理提供了模型上下文协议服务器。
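The interception interface described above (inspect a tool call before execution, return allow, warn, block, or review) can be sketched with a small pattern table. The rules and the `evaluate` function are hypothetical and far simpler than what the abstract describes: whitespace collapsing stands in for the deobfuscation normalizer, and there is no multi-step chain detection or LLM judge.

```python
import re

# Hypothetical rules for illustration only, ordered from most to least severe.
RULES = [
    (re.compile(r"\brm\s+-(?=\w*r)(?=\w*f)\w+"), "block"),    # recursive forced delete
    (re.compile(r"\bcurl\b.*\|\s*(?:sh|bash)\b"), "block"),   # pipe a download into a shell
    (re.compile(r"id_rsa|\.env\b"), "review"),                # touches credential files
    (re.compile(r"\bsudo\b"), "warn"),                        # privilege escalation
]

def evaluate(command):
    """Return a verdict for a shell command before it is executed."""
    normalized = " ".join(command.split())  # collapse runs of whitespace
    for pattern, verdict in RULES:
        if pattern.search(normalized):
            return verdict
    return "allow"

for cmd in ["ls -la", "sudo apt update", "cat ~/.ssh/id_rsa", "rm   -fr /tmp/work"]:
    print(cmd, "->", evaluate(cmd))
```

A static table like this is exactly the kind of guardrail the abstract says misses obfuscation and multi-step context, so it illustrates the verdict interface rather than the detection quality.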
cs.AI / 19 / 2605.04808
DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents
解码信任代理平台 (DTap):一个可控且互动的 AI 代理红队平台
Abstract
AI agents are increasingly deployed across diverse domains to automate complex workflows through long-horizon and high-stakes action executions. Due to their high capability and flexibility, such agents raise significant security and safety concerns. A growing number of real-world incidents have shown that adversaries can easily manipulate agents into performing harmful actions, such as leaking API keys, deleting user data, or initiating unauthorized transactions. Evaluating agent security is inherently challenging, as agents operate in dynamic, untrusted environments involving external tools, heterogeneous data sources, and frequent user interactions. However, realistic, controllable, and reproducible environments for large-scale risk assessment remain largely underexplored. To address this gap, we introduce the DecodingTrust-Agent Platform (DTap), the first controllable and interactive red-teaming platform for AI agents, spanning 14 real-world domains and over 50 simulation environments that replicate widely used systems such as Google Workspace, Paypal, and Slack. To scale the risk assessment of agents in DTap, we further propose DTap-Red, the first autonomous red-teaming agent that systematically explores diverse injection vectors (e.g., prompt, tool, skill, environment, combinations) and autonomously discovers effective attack strategies tailored to varying malicious goals. Using DTap-Red, we curate DTap-Bench, a large-scale red-teaming dataset comprising high-quality instances across domains, each paired with a verifiable judge to automatically validate attack outcomes. Through DTap, we conduct large-scale evaluations of popular AI agents built on various backbone models, spanning security policies, risk categories, and attack strategies, revealing systematic vulnerability patterns and providing valuable insights for developing secure next-generation agents.
Chinese Translation
AI 代理在各个领域的部署逐渐增加,以通过长时间跨度和高风险的动作执行自动化复杂工作流程。由于其高能力和灵活性,这些代理引发了显著的安全和安全性担忧。越来越多的现实案例表明,敌对者可以轻易操控代理执行有害行为,如泄露 API 密钥、删除用户数据或发起未经授权的交易。评估代理的安全性本质上是具有挑战性的,因为代理在动态、不受信任的环境中操作,这些环境涉及外部工具、异构数据源和频繁的用户交互。然而,用于大规模风险评估的现实可控和可重复环境仍然未得到充分探索。为了解决这一空白,我们提出了解码信任代理平台 (DTap),这是首个针对 AI 代理的可控且互动的红队平台,涵盖 14 个现实领域和 50 多个模拟环境,这些环境仿真了诸如 Google Workspace、Paypal 和 Slack 等广泛使用的系统。为了在 DTap 中扩展代理的风险评估,我们进一步提出了 DTap-Red,这是首个自主红队代理,系统地探索多种注入向量(例如,提示、工具、技能、环境、组合),并自主发现针对不同恶意目标的有效攻击策略。利用 DTap-Red,我们策划了 DTap-Bench,这是一个大规模红队数据集,包含跨领域的高质量实例,每个实例均配备可验证的评审,自动验证攻击结果。通过 DTap,我们对建立在各种基础模型上的流行 AI 代理进行大规模评估,涵盖安全政策、风险类别和攻击策略,揭示系统性脆弱性模式,并为开发安全的下一代代理提供了宝贵的见解。
cs.AI / 20 / 2605.04906
Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games
Strat-Reasoner:增强大语言模型在多智能体游戏中的战略推理
Abstract
While Large Language Models (LLMs) excel in certain reasoning tasks, they struggle in multi-agent games where the final outcome depends on the joint strategies of all agents. In multi-agent games, the non-stationarity of other agents poses significant challenges for evaluating the reasoning process and for assigning credit over multiple reasoning steps. Existing single-agent reinforcement learning (RL) approaches and their multi-agent extensions fail to address these challenges as they do not incorporate other agents in the reasoning process. In this work, we propose Strat-Reasoner, a novel RL-based framework that improves LLMs' strategic reasoning ability in multi-agent games. We introduce a novel recursive reasoning paradigm where an agent's reasoning also integrates other agents' reasoning processes. To provide effective reward signals for the intermediate reasoning sequences, we employ a centralized Chain-of-Thought (CoT) comparison module to evaluate the reasoning quality. Finally, we compute an accurate hybrid advantage and develop a group-relative RL approach to optimize the LLM policy. Experimental results show that Strat-Reasoner substantially improves the strategic abilities of the underlying LLMs, achieving an average performance improvement of 22.1% across various multi-agent games.
Chinese Translation
尽管大语言模型(LLMs)在某些推理任务中表现出色,但在最终结果取决于所有智能体联合策略的多智能体游戏中,它们却面临困难。在多智能体游戏中,其他智能体的非平稳性对推理过程的评估和多个推理步骤的信用分配带来了显著挑战。现有的单一智能体强化学习(RL)方法及其多智能体扩展未能解决这些挑战,因为它们在推理过程中未考虑其他智能体。在本研究中,我们提出了Strat-Reasoner,一个基于RL的新颖框架,旨在提升LLMs在多智能体游戏中的战略推理能力。我们引入了一种新颖的递归推理范式,其中一个智能体的推理还整合了其他智能体的推理过程。为了为中间推理序列提供有效的奖励信号,我们采用了集中式思维链(Chain-of-Thought,CoT)比较模块来评估推理质量。最后,我们计算出准确的混合优势,并开发了一种群体相对强化学习方法来优化LLM策略。实验结果表明,Strat-Reasoner显著提升了基础LLMs的战略能力,在各种多智能体游戏中实现了22.1%的平均性能提升。
cs.AI / 21 / 2605.04908
Curated AI beats frontier LLMs at pharma asset discovery
精心策划的人工智能在制药资产发现中超越前沿大型语言模型
Abstract
General-purpose LLMs with web search are increasingly used to scout the competitive landscape of pharmaceutical pipelines. We benchmark Gosset -- an AI platform with a chat interface backed by curated target-, modality-, and indication-level drug-asset annotations -- against four frontier systems with web access (Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Perplexity sonar-pro) on ten niche oncology/immunology targets where most of the pipeline lives in the long tail of preclinical and Asian-developed assets. All five systems receive the same natural-language query and the same JSON output schema. Across 10 targets Gosset returns 3.2x more verified drugs per query than the best frontier system, at perfect precision and 100% recall against the cross-system union of verified drugs. The same curated index is exposed as a Gosset MCP server that any frontier model can call as a tool, suggesting that each of these systems can close most of the recall gap by swapping generic web search for a curated index behind the same chat interface.
Chinese Translation
通用大型语言模型(Large Language Models, LLMs)结合网络搜索越来越多地被用来探测制药管线的竞争格局。我们将Gosset——一个拥有聊天界面、支持针对目标、方式和适应症层面的药物资产注释的人工智能平台——与四个具有网络访问的前沿系统(Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Perplexity sonar-pro)在十个小众肿瘤/免疫学目标上进行了基准测试,这些目标的大部分管线位于前临床和亚洲开发资产的长尾部分。所有五个系统接收相同的自然语言查询和相同的JSON输出格式。在这10个目标上,Gosset每次查询返回的经过验证的药物数量是最佳前沿系统的3.2倍,同时在经过验证的药物的交叉系统集合中达到完美的准确率和100%的召回率。相同的精心策划索引作为Gosset MCP服务器对外提供,任何前沿模型均可以将其作为工具进行调用,这表明这些系统都可以通过用精心策划的索引替代通用网络搜索来缩小大部分召回率的差距,且在同一聊天界面下实现这一目标。
cs.AI / 22 / 2605.04916
A Foundation Model for Zero-Shot Logical Rule Induction
零样本逻辑规则归纳的基础模型
Abstract
Inductive Logic Programming (ILP) learns interpretable logical rules from data. Existing methods are transductive: their learned parameters are bound to specific predicates and require retraining for each new task. We introduce Neural Rule Inducer (NRI), a pretrained model for zero-shot rule induction. Rather than encoding literal identities, NRI represents literals using domain-agnostic statistical properties such as class-conditional rates, entropy, and co-occurrence, which generalize across variable identities and counts without retraining. The model consists of a statistical encoder and a parallel slot-based decoder. Parallel decoding preserves the permutation invariance of logical disjunction; an autoregressive decoder would instead impose an arbitrary clause order. Product T-norm relaxation makes rule execution differentiable, allowing end-to-end training on prediction accuracy alone. We evaluate NRI on rule recovery, robustness to label noise and spurious correlations, and zero-shot transfer to real-world benchmarks, and we believe this work opens up the possibility of foundation models for symbolic reasoning. Code and the reference checkpoint are available at https://github.com/phuayj/neural-rule-inducer.
Chinese Translation
归纳逻辑编程(Inductive Logic Programming, ILP)从数据中学习可解释的逻辑规则。现有方法属于传导性:其学习的参数绑定于特定的谓词,并且每个新任务需要重新训练。我们引入了神经规则引导器(Neural Rule Inducer, NRI),这是一个用于零样本规则归纳的预训练模型。NRI并不直接编码字面量的身份,而是通过类条件率、熵和共现等与领域无关的统计特性来表示字面量,这使得模型能够在不同的身份和计数间泛化,而无需重新训练。该模型由一个统计编码器和一个并行槽位解码器组成。并行解码保持了逻辑析取的置换不变性;而自回归解码器则会强加任意的子句顺序。乘积T-范数松弛使得规则执行可微分,从而能够仅基于预测准确性进行端到端训练。我们对NRI在规则恢复、对标签噪音和虚假相关的鲁棒性,以及对实际基准的零样本迁移进行了评估,我们相信这项工作为符号推理的基础模型开辟了可能性。代码和参考检查点可在https://github.com/phuayj/neural-rule-inducer获取。
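The product T-norm relaxation mentioned above replaces Boolean AND with multiplication of truth degrees in [0, 1]. The sketch below evaluates a small DNF rule that way. Using the probabilistic sum for OR is an assumption (it is the standard dual of the product T-norm, but the abstract does not name the disjunction operator), and the rule and truth values are made up.

```python
def soft_and(values):
    """Product T-norm: conjunction as a product of truth degrees."""
    result = 1.0
    for v in values:
        result *= v
    return result

def soft_or(values):
    """Probabilistic sum (dual of the product T-norm): 1 - prod(1 - v)."""
    return 1.0 - soft_and(1.0 - v for v in values)

def eval_dnf(clauses, truth):
    """Evaluate a DNF rule; each clause is a list of (literal name, negated) pairs."""
    clause_values = [
        soft_and((1.0 - truth[name]) if negated else truth[name] for name, negated in clause)
        for clause in clauses
    ]
    return soft_or(clause_values)

rule = [[("a", False), ("b", False)], [("c", True)]]  # (a AND b) OR (NOT c)
soft_inputs = {"a": 0.9, "b": 0.8, "c": 0.5}
print(eval_dnf(rule, soft_inputs))
```

With crisp 0/1 inputs these operators reduce to ordinary Boolean logic, and reordering the clauses leaves the value unchanged, which is the permutation invariance the abstract attributes to parallel decoding.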
cs.AI / 23 / 2605.04979
On-line Learning in Tree MDPs by Treating Policies as Bandit Arms
通过将策略视为赌博臂实现树状马尔可夫决策过程的在线学习
Abstract
A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state $s_{1}$, in which every state is reachable from $s_{1}$ through exactly one state-action trajectory. T-MDPs arise naturally as abstractions of decision making in sequential games with perfect recall, against stationary opponents. We consider the problem of on-line learning in T-MDPs, both in the PAC and the regret-minimisation regimes. We show that well-known bandit algorithms (LUCB and UCB) can be applied on T-MDPs by treating each policy as an arm. The apparent technical challenge in this approach is that the number of policies is exponential in the number of states. Our main innovation is in the design of confidence bounds based on data shared by the policies, so that the bandit algorithms can yet be implemented with polynomial memory and per-step computation. We obtain instance-dependent upper bounds on sample complexity and regret that sum a "gap term" from every terminal state, rather than every policy. Empirically, our algorithms consistently outperform available alternatives on a suite of hidden-information games.
Chinese Translation
树状马尔可夫决策问题(Tree Markov Decision Problem, T-MDP)是一种有限时域的马尔可夫决策过程(MDP),其起始状态为 $s_{1}$,其中每个状态通过恰好一条状态-动作轨迹可以从 $s_{1}$ 到达。T-MDP 自然作为完美回忆的顺序博弈中的决策抽象出现,面对的是静态对手。我们考虑在 T-MDP 中进行在线学习的问题,包括 PAC 和遗憾最小化机制。我们展示了著名的赌博算法(LUCB 和 UCB)可以通过将每个策略视为一条臂来应用于 T-MDP。这种方法的明显技术挑战在于,策略的数量是状态数量的指数级。我们的主要创新在于设计基于策略共享数据的置信界,使得赌博算法能够以多项式的内存和每步计算开销实施。我们获得了样本复杂性和遗憾的实例相关上界,这些上界是每个终端状态的“间隙项”之和,而不是每个策略的和。在实证上,我们的算法在一系列隐藏信息游戏中始终优于现有替代方案。
cs.AI / 24 / 2605.05007
Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation
Uno-Orchestra:通过选择性委派实现简约的代理路由
Abstract
Large language model (LLM) multi-agent systems typically rely on rigid orchestration, committing either to flat per-query routing or to hand-engineered task decomposition, so decomposition depth, worker choice, and inference budget are not jointly optimized under one objective. We introduce Uno-Orchestra, a unified orchestration policy that selectively decomposes a task and dispatches each subtask to an admissible (model, primitive) pair, with both decisions learned together from curated RL trajectories grounded in real worker interactions. Against 22 baselines on a 13-benchmark suite spanning math, code, knowledge, long-context, and agentic tool-use, Uno-Orchestra reaches 77.0% macro pass@1, roughly 16% above the strongest workflow baseline, at roughly an order of magnitude lower per-query cost, advancing the accuracy-efficiency frontier of selective delegation.
Chinese Translation
大语言模型(LLM)多代理系统通常依赖于刚性编排,要么在每个查询中进行平坦的路由,要么进行手工设计的任务分解,因此分解深度、工人选择和推理预算无法在一个目标下进行联合优化。我们提出了Uno-Orchestra,这是一种统一的编排策略,它选择性地分解任务并将每个子任务分派给一个可接受的(模型,原始操作)对,两个决策通过基于真实工人交互的策划强化学习(RL)轨迹共同学习。与涵盖数学、代码、知识、长上下文和代理工具使用的13个基准的22个基线进行比较,Uno-Orchestra在宏观通过率@1上达到了77.0%,大约比最强工作流基线高出16%,同时每个查询的成本降低了一个数量级,推进了选择性委派的准确性与效率前沿。
cs.AI / 25 / 2605.05017
Position: Embodied AI Requires a Privacy-Utility Trade-off
立场:具身人工智能需要隐私与效用之间的权衡
Abstract
Embodied AI (EAI) systems are rapidly transitioning from simulations into real-world domestic and other sensitive environments. However, recent EAI solutions have largely demonstrated advancements within isolated stages such as instruction, perception, planning and interaction, without considering their coupled privacy implications in high-frequency deployments where privacy leakage is often irreversible. This position paper argues that optimizing these components independently creates a systemic privacy crisis when deployed in sensitive settings, thereby advancing the position that privacy in EAI is a life cycle-level architectural constraint rather than a stage-local feature. To address these challenges, we propose Secure Privacy Integration in Next-generation Embodied AI (SPINE), a unified privacy-aware framework that treats privacy as a dynamic control signal governing cross-stage coupling throughout the entire EAI life cycle. SPINE decomposes the EAI pipeline into various stages and establishes a multi-criterion privacy classification matrix to orchestrate contextual sensitivity across stage boundaries. We conduct preliminary simulation and real-world case studies to conceptually validate how privacy constraints propagate downstream to reshape system behavior, illustrating the insufficiency of fragmented privacy patches and motivating future research directions into secure yet functional embodied AI systems. We detail the SPINE framework and case studies at https://github.com/rminshen03/EAI_Privacy_Position.
Chinese Translation
具身人工智能(EAI)系统正迅速从模拟环境转向现实世界的家庭及其他敏感环境。然而,近期的EAI解决方案在指令、感知、规划和互动等孤立阶段的进展较为明显,却未考虑在高频率部署中隐私泄露的耦合隐患,且此类泄露往往是不可逆的。本文立场论文主张,独立优化这些组件会在敏感设置中造成系统性的隐私危机,因此提出在EAI中将隐私视为一种生命周期层级的架构约束,而非某一阶段的局部特征。为应对此类挑战,我们提出了下一代具身人工智能中的安全隐私集成(SPINE)框架,一个统一的隐私感知框架,将隐私视为管理跨阶段耦合的动态控制信号,贯穿整个EAI生命周期。SPINE将EAI流程分解为多个阶段,并建立了多标准隐私分类矩阵,以协调阶段边界内的上下文敏感性。我们进行了初步的模拟和现实案例研究,以概念验证隐私约束如何向下传播重塑系统行为,说明了碎片化隐私补丁的不充分性,并激励未来在安全与功能兼具的具身人工智能系统方面的研究方向。我们详细介绍了SPINE框架及案例研究,网址为 https://github.com/rminshen03/EAI_Privacy_Position。
cs.AI / 26 / 2605.05138
Executable World Models for ARC-AGI-3 in the Era of Coding Agents
编码代理时代的 ARC-AGI-3 可执行世界模型
Abstract
We evaluate an initial coding-agent system for ARC-AGI-3 in which the agent maintains an executable Python world model, verifies it against previous observations, refactors it toward simpler abstractions as a practical proxy for an MDL-like simplicity bias, and plans through the model before acting. The system is intentionally direct: it uses a scripted controller, predefined world-model interfaces, verifier programs, and a plan executor, but no hand-coded game-specific logic. We report results on the 25 public ARC-AGI-3 games. Each recorded playthrough uses a fresh agent instance with no access to previous playthrough-specific files or conversation state. Most games have a single recorded playthrough; for a few games, we report multiple independent fresh-agent playthroughs to expose run-to-run variability. The agent fully solved 7 games, achieved a Relative Human Action Efficiency (RHAE) greater than 75% on 6 games, and obtained a mean per-game RHAE of 32.58%. Because the system uses no game-specific code, it can serve as a game-general baseline for ARC-AGI-3. Performance on the private validation set remains to be tested. Overall, the results provide preliminary evidence that verifier-driven executable world models are a promising approach for ARC-AGI-3 agents.
Chinese Translation
我们评估了一个针对 ARC-AGI-3 的初始编码代理系统,其中代理维护一个可执行的 Python 世界模型,验证其与先前观察的一致性,并将其重构为更简单的抽象,以作为 MDL(最小描述长度)类似简单性偏差的实用代理,在行动之前通过模型进行规划。该系统有意设计得直接:它使用了脚本控制器、预定义的世界模型接口、验证程序和计划执行器,但没有手工编码的特定游戏逻辑。我们报告了对于 25 个公共 ARC-AGI-3 游戏的结果。每次记录的游戏过程使用一个新的代理实例,并且不访问先前游戏过程特定的文件或对话状态。大多数游戏只有一次记录的游戏过程;对于少数游戏,我们报告多个独立的新代理游戏过程,以揭示运行间的变异性。代理完全解决了 7 个游戏,在 6 个游戏中获得了超过 75% 的相对人类行动效率(Relative Human Action Efficiency, RHAE),并且每个游戏的平均 RHAE 为 32.58%。由于该系统不使用特定于游戏的代码,因此可以作为 ARC-AGI-3 的游戏通用基线。私有验证集上的性能仍需测试。总体而言,结果提供了初步证据,表明基于验证者驱动的可执行世界模型是一种针对 ARC-AGI-3 代理的有前景的方法。
cs.AI / 27 / 2605.05191
LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents
LongSeeker:面向长时间搜索代理的弹性上下文调度
Abstract
Long-horizon search agents must manage a rapidly growing working context as they reason, call tools, and observe information. Naively accumulating all intermediate content can overwhelm the agent, increasing costs and the risk of errors. We propose that effective context management should be adaptive: parts of the agent's trajectory are maintained at different levels of detail depending on their current relevance to the task. To operationalize this principle, we introduce Context-ReAct, a general agentic paradigm for elastic context orchestration that integrates reasoning, context management, and tool use in a unified loop. Context-ReAct provides five atomic operations: Skip, Compress, Rollback, Snippet and Delete, which allow the agent to dynamically reshape its working context, preserving important evidence, summarizing resolved information, discarding unhelpful branches, and controlling context size. We prove that the Compress operator is expressively complete, while the other specialized operators provide efficiency and fidelity guarantees that reduce generation cost and hallucination risk. Building on this paradigm, we develop LongSeeker, a long-horizon search agent fine-tuned from Qwen3-30B-A3B on 10k synthesized trajectories. Across four representative search benchmarks, LongSeeker achieves 61.5% on BrowseComp and 62.5% on BrowseComp-ZH, substantially outperforming Tongyi DeepResearch (43.2% and 46.7%) and AgentFold (36.2% and 47.3%). These results highlight the potential of adaptive context management, showing that agents can achieve more reliable and efficient long-horizon reasoning by actively shaping their working memory.
Chinese Translation
长时间搜索代理在推理、调用工具和观察信息时,必须管理快速增长的工作上下文。简单地累积所有中间内容可能会给代理带来负担,增加成本和错误风险。我们提出有效的上下文管理应具有适应性:根据与任务当前相关性不同,代理轨迹的不同部分应以不同的详细程度进行维护。为了实现这一原则,我们引入了Context-ReAct,一个整合推理、上下文管理和工具使用的弹性上下文调度通用代理范式。Context-ReAct提供了五个原子操作:跳过(Skip)、压缩(Compress)、回滚(Rollback)、片段(Snippet)和删除(Delete),允许代理动态重塑其工作上下文,保留重要证据、总结已解决的信息、丢弃无用分支,并控制上下文大小。我们证明了压缩操作符的表达完备性,而其他专业操作符则提供效率和保真度保证,降低生成成本和幻觉风险。在此范式基础上,我们开发了LongSeeker,一个在10k合成轨迹上从Qwen3-30B-A3B微调的长时间搜索代理。在四个具有代表性的搜索基准测试中,LongSeeker在BrowseComp上取得了61.5%的成绩,在BrowseComp-ZH上达到了62.5%的成绩,明显优于Tongyi DeepResearch(43.2%和46.7%)和AgentFold(36.2%和47.3%)。这些结果突显了适应性上下文管理的潜力,表明代理通过积极塑造其工作记忆,可以实现更可靠和高效的长时间推理。
cs.CL / 1 / 2605.04065
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
基于自由能驱动的强化学习与自适应优势塑形在大型语言模型中的无监督推理
Abstract
Unsupervised reinforcement learning (RL) has emerged as a promising paradigm for enabling self-improvement in large language models (LLMs). However, existing unsupervised RL-based methods often lack the capacity to adapt to the model's evolving reasoning capabilities during training. Therefore, these methods can misdirect policy optimization in the absence of ground-truth supervision. To address this issue, we introduce FREIA, a novel RL-based algorithm built on two key innovations: (1) Free Energy-Driven Reward (FER) adapts rewards to balance consensus and exploration based on the Free Energy Principle. (2) Adaptive Advantage Shaping (AAS) adaptively adjusts learning signals based on the statistical characteristics of sampled rewards. Empirical evaluations on nine datasets across three reasoning tasks showcase that FREIA outperforms other unsupervised RL-based baselines. Notably, in mathematical reasoning tasks, FREIA surpasses other methods by an average of 0.5 to 3.5 points in Pass@1 using the DeepSeek-R1-Distill-Qwen-1.5B model.
Chinese Translation
无监督强化学习(RL)已成为在大型语言模型(LLMs)中启用自我改进的有前景的范式。然而,现有的基于无监督RL的方法往往缺乏在训练过程中适应模型不断发展的推理能力的能力。因此,在缺乏真实监督的情况下,这些方法可能会误导策略优化。为了解决这一问题,我们提出了FREIA,这是一种基于RL的新算法,基于两个关键创新构建:(1)自由能驱动奖励(Free Energy-Driven Reward, FER)根据自由能原理调节奖励,以平衡共识与探索;(2)自适应优势塑形(Adaptive Advantage Shaping, AAS)根据抽样奖励的统计特性自适应调整学习信号。在三个推理任务的九个数据集上的实证评估表明,FREIA在性能上超越了其他基于无监督RL的基线。值得注意的是,在数学推理任务中,FREIA在使用DeepSeek-R1-Distill-Qwen-1.5B模型时,Pass@1平均超过其他方法0.5到3.5个点。
cs.CL / 2 / 2605.04066
Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
适应以蓬勃发展!自适应幂均值策略优化提升大型语言模型推理能力
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is an essential paradigm that enhances the reasoning capabilities of Large Language Models (LLMs). However, existing methods typically rely on static policy optimization schemes that misalign with the model's evolving reasoning capabilities. To address this issue, we propose Adaptive Power-Mean Policy Optimization (APMPO), which comprises two main innovations: Power-Mean Policy Optimization (PMPO) and Feedback-Adaptive Clipping (FAC). Specifically, PMPO introduces a generalized power-mean objective. This enables the model to adaptively transition from the signal-amplifying behavior of the arithmetic mean to the consistency-enforcing behavior of the geometric mean. FAC adaptively adjusts clipping bounds based on real-time reward statistics to overcome the limitations of static mechanisms. Capitalizing on these innovations, APMPO improves learning dynamics and reasoning performance. Extensive experiments on nine datasets across three reasoning tasks showcase the superiority of APMPO over state-of-the-art RLVR-based baselines. For instance, APMPO boosts the average Pass@1 score on mathematical reasoning benchmarks by 3.0 points compared to GRPO when using Qwen2.5-3B-Instruct.
Chinese Translation
可验证奖励的强化学习(RLVR)是一种重要的范式,它增强了大型语言模型(LLMs)的推理能力。然而,现有的方法通常依赖静态策略优化方案,这与模型不断发展的推理能力不一致。为了解决这一问题,我们提出了自适应幂均值策略优化(APMPO),其主要包含两个创新:幂均值策略优化(PMPO)和反馈自适应裁剪(FAC)。具体而言,PMPO引入了一种广义的幂均值目标,使得模型能够自适应地从算术均值的信号放大行为过渡到几何均值的强化一致性行为。FAC依据实时奖励统计动态调整裁剪范围,以克服静态机制的局限性。凭借这些创新,APMPO改善了学习动态和推理性能。在三个推理任务的九个数据集上的大量实验展示了APMPO相较于最先进的基于RLVR的基线的优越性。例如,使用Qwen2.5-3B-Instruct时,APMPO在数学推理基准上的平均Pass@1分数比GRPO提升了3.0分。
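The transition the APMPO abstract describes, from arithmetic-mean to geometric-mean behavior, follows from the standard generalized power mean, which can be sketched as below. This is a minimal illustration of the mathematical family only, not the authors' objective: the abstract does not say which quantities are averaged, so the `ratios` list is a hypothetical stand-in.

```python
import math


def power_mean(values, p):
    """Generalized power mean of strictly positive values.

    p = 1 gives the arithmetic mean; the limit p -> 0 gives the geometric mean.
    """
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    n = len(values)
    if abs(p) < 1e-12:
        # Limit case p -> 0: exponential of the mean log, i.e. the geometric mean.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


# Hypothetical per-token quantities; the abstract does not specify what is averaged.
ratios = [0.5, 1.0, 2.0]
arithmetic = power_mean(ratios, 1.0)
geometric = power_mean(ratios, 0.0)
in_between = power_mean(ratios, 0.5)
```

Lowering `p` from 1 toward 0 moves the result from the arithmetic mean, which large values dominate, toward the geometric mean, which penalizes inconsistent values more strongly.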
cs.CL / 3 / 2605.04080
Connecting online criminal behavior with machine learning: Using authorship attribution to analyze and link potential online traffickers
将在线犯罪行为与机器学习联系起来:使用作者归属分析潜在在线贩运者
Abstract
This research investigated how online criminal activities can be better understood and connected using data-driven machine learning methods. Many illegal activities, such as human trafficking and illicit trade, have moved to online platforms where offenders hide behind anonymous accounts and frequently change identities. This makes it difficult for authorities to understand how large these networks are and how different online profiles may be linked. The research shows that people tend to maintain consistent patterns in how they write advertisements and present images online, even when they try to stay anonymous. By analysing these patterns across large collections of online advertisements, the research demonstrates how to link related accounts and identify repeated behaviour across illegal online markets. In addition, the research also addresses how such methods should be used responsibly. It proposes clear guidelines to ensure that privacy, fairness, and transparency are respected when these tools are applied. Overall, the research provides practical ways to support law enforcement investigations while emphasising careful and ethical use.
Chinese Translation
本研究探讨了如何利用数据驱动的机器学习方法更好地理解和连接在线犯罪活动。许多非法活动,例如人类贩运和非法贸易,已经转移到在线平台,犯罪者躲藏在匿名账户后,并频繁更换身份。这使得当局难以了解这些网络的规模以及不同在线个人资料之间可能的关联。研究表明,即使在努力保持匿名的情况下,人们在撰写广告和展示图像时仍倾向于保持一致的模式。通过分析大量在线广告中的这些模式,研究展示了如何链接相关账户并识别非法在线市场中的重复行为。此外,研究还讨论了应如何负责任地使用这些方法。它提出了明确的指导方针,以确保在应用这些工具时,尊重隐私、公平和透明性。总体而言,本研究提供了实用的方法来支持执法调查,同时强调谨慎和道德的使用。
cs.CL / 4 / 2605.04157
FMI_SU_Yotkova_Kastreva at SemEval-2026 Task 13: Lightweight Detection of LLM-Generated Code via Stylometric Signals
FMI_SU_Yotkova_Kastreva 在 SemEval-2026 任务 13:通过风格特征轻量化检测大语言模型生成的代码
Abstract
SemEval-2026 Task 13 investigates machine-generated code detection across multiple programming languages and application scenarios, asking participating systems to generalize to unseen languages and domains. This paper describes our participation in Subtask A (binary classification) and explores both pretrained code encoders and lightweight feature-based methods. We design ratio-based features that are less sensitive to snippet length. To support the extraction of descriptiveness-related signals, we use parsing engines and a programming-language classifier. Additionally, we train a separate code-vs-text line classifier to identify raw natural language segments embedded within samples. We combine a shallow decision tree with heuristic rules derived from data analysis to produce the final predictions. Our approach is computationally efficient, requires only CPU resources for training, and achieves near-instant inference time, offering a lightweight alternative to large pretrained models.
Chinese Translation
SemEval-2026 任务 13 研究了在多种编程语言和应用场景下的机器生成代码检测,要求参与系统能够推广到未见过的语言和领域。本文描述了我们在子任务 A(二元分类)中的参与,并探索了预训练的代码编码器和轻量化的特征基础方法。我们设计了对代码片段长度不敏感的比率基础特征。为了支持描述性信号的提取,我们使用了解析引擎和编程语言分类器。此外,我们训练了一个独立的代码与文本行分类器,以识别嵌入在样本中的原始自然语言段。我们结合了浅层决策树与基于数据分析的启发式规则,以生成最终预测。我们的方法计算效率高,仅需 CPU 资源进行训练,且实现了近乎即时的推断时间,提供了一种轻量化的替代方案,以取代大型预训练模型。
cs.CL / 5 / 2605.04171
Not All That Is Fluent Is Factual: Investigating Hallucinations of Large Language Models in Academic Writing
流畅并不等于真实:探究大型语言模型在学术写作中的幻觉现象
Abstract
Large language models (LLMs) show extraordinary abilities, but they are still prone to hallucinations, especially when used to generate academic content. We investigated four popular LLMs (ChatGPT, Grok, Gemini, and Copilot) for hallucinations in academic writing. We designed 80 prompts across four categories: reference generation, factual explanation, abstract generation, and writing improvement. We evaluated the models using a 0-5 rubric score that checks factual accuracy, reference validity, coherence, style consistency, and academic tone. A novel weighted metric, the Hallucination Index (HI), was introduced to measure hallucination in the responses generated by the models. Some of the most widely used evaluation metrics often fail to catch errors that alter sentiment in machine-translated text. We found that Grok and Copilot perform better on reference generation tasks but often struggle with abstract or stylistic prompts, with HI values of 0.67 and 0.70, respectively. Gemini and ChatGPT, by contrast, show stronger tone control but are weaker on factual writing tasks and carry a higher hallucination risk, with HI scores of 0.53 and 0.57, respectively. Our study found that hallucination behavior depends not only on model architecture but also on the type of task and the prompting conditions provided. We propose that our work opens new research directions for future researchers.
Chinese Translation
大型语言模型(LLMs)展现出非凡的能力,但在生成学术内容时仍然容易产生幻觉。我们特别针对学术写作,调查了四个流行的LLM——ChatGPT、Grok、Gemini和Copilot的幻觉现象。我们设计了80个提示,涵盖四个类别:参考文献生成、事实解释、摘要生成和写作改进。我们使用0-5的评分标准评估模型,该标准检查事实准确性、参考的有效性、一致性、风格一致性和学术语调。我们引入了一种新的加权指标——幻觉指数(Hallucination Index, HI),用于衡量模型生成的回复中的幻觉现象。一些广泛使用的评估指标常常无法检查出改变机器翻译文本情感的错误。我们发现Grok和Copilot在参考文献生成任务中表现较好,但在处理摘要或风格提示时常常面临困难,其HI值分别为0.67和0.70。而Gemini和ChatGPT在语调控制方面表现良好,但在写作事实任务中表现欠佳,幻觉风险较高,HI评分分别为0.53和0.57。我们的研究显示,幻觉行为不仅仅依赖于模型架构,还与任务类型和我们提供的提示条件有关。我们认为我们的工作为未来的研究者开辟了新的研究维度。
cs.CL / 6 / 2605.04177
Are LLMs Ready for Conflict Monitoring? Empirical Evidence from West Africa
大语言模型准备好进行冲突监测了吗?来自西非的实证证据
Abstract
As LLMs enter conflict monitoring, understanding systematic distortions in their outputs is critical for humanitarian accountability. We evaluate four vanilla open-weight models (Gemma 3 4B, Llama 3.2 3B, Mistral 7B, and OLMo 2 7B) and two domain-adapted models (AfroConfliBERT and AfroConfliLLAMA) on Nigeria and Cameroon conflict-event classification against ACLED, a gold-standard dataset with multi-stage verification. We find a bifurcated divergence in normative directionality. Open-weight models exhibit statistically significant False Illegitimation bias: Gemma misclassifies up to 18.29% of legitimate battles as civilian-targeted violence while making zero False Legitimation errors. By contrast, AfroConfliBERT and AfroConfliLLAMA achieve near-directional neutrality, with Legitimization Bias differences indistinguishable from zero. Yet domain adaptation does not eliminate actor-based selection bias. Both adapted models show statistically significant actor bias comparable to vanilla LLMs; in Nigeria, state actors are legitimized 36.5% more often than non-state actors in identical tactical contexts. Open-weight outputs are also fragile to geography-specific lexical framing: delegitimizing phrases produce flip rates up to 66.7% in Cameroon and 34.2% in Nigeria, while perturbations salient in one context may not matter in another. Error trace profiling shows models mask normative bias through unfaithful rationale confabulations. In contrast, AfroConfliBERT and AfroConfliLLAMA are largely robust, with near-zero flip rates across perturbation categories. Overall, current models are not ready for unsupervised deployment in conflict monitoring. We call for fairness-aware fine-tuning to reduce actor-based selection bias, mandatory adversarial robustness evaluation against lexical manipulation, and context-specific human-in-the-loop oversight calibrated to regional difficulty.
Chinese Translation
随着大语言模型(LLMs)进入冲突监测领域,理解其输出中的系统性扭曲对于人道主义问责至关重要。我们评估了四种基础开放权重模型:Gemma 3 4B、Llama 3.2 3B、Mistral 7B 和 OLMo 2 7B,以及两个领域适应模型,即 AfroConfliBERT 和 AfroConfliLLAMA,在尼日利亚和喀麦隆的冲突事件分类中,与经过多阶段验证的黄金标准数据集 ACLED 进行对比。我们发现,规范方向存在分叉性差异。开放权重模型表现出统计上显著的虚假非合法化偏见:Gemma 将 18.29% 的合法战斗错误分类为针对平民的暴力,同时没有出现虚假合法化错误。相比之下,AfroConfliBERT 和 AfroConfliLLAMA 达到了近乎方向中立,合法化偏见差异无法区分于零。然而,领域适应并未消除基于行为体的选择偏见。这两个适应模型在统计上表现出与基础 LLM 相当的行为体偏见;在尼日利亚,国家行为体在相同战术环境中的合法化频率比非国家行为体高出 36.5%。开放权重的输出对特定地域的词汇框架也表现出脆弱性:在喀麦隆,去合法化短语的翻转率高达 66.7%,在尼日利亚为 34.2%,而在一个环境中显著的扰动在另一个环境中可能无关紧要。错误追踪分析表明,模型通过不忠实的推理虚构隐藏了规范偏见。相比之下,AfroConfliBERT 和 AfroConfliLLAMA 在扰动类别中的翻转率几乎为零,总体而言,当前模型尚未准备好在冲突监测中进行无监督部署。我们呼吁进行关注公平性的微调,以减少基于行为体的选择偏见,强制进行针对词汇操控的对抗鲁棒性评估,以及针对区域困难的上下文特定人类参与监督。
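The flip rates quoted in the conflict-monitoring abstract can be read as the share of predictions that change once a perturbation is applied. A minimal sketch under that assumption follows; the abstract does not give the exact definition, and the label strings and example data here are illustrative only.

```python
def flip_rate(original_labels, perturbed_labels):
    """Fraction of items whose predicted label changes after perturbation."""
    if len(original_labels) != len(perturbed_labels):
        raise ValueError("label lists must have the same length")
    if not original_labels:
        raise ValueError("label lists must be non-empty")
    flips = sum(1 for before, after in zip(original_labels, perturbed_labels) if before != after)
    return flips / len(original_labels)


# Hypothetical predictions for three events, before and after a delegitimizing phrase is added.
before = ["battle", "battle", "violence_against_civilians"]
after = ["violence_against_civilians", "battle", "violence_against_civilians"]
rate = flip_rate(before, after)
```

Here one of three predictions changes, so the flip rate is one third.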
cs.CL / 7 / 2605.04180
MedFabric and EtHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs
MedFabric和EtHER:一个数据驱动的框架用于医学LLMs中的词级伪造生成与检测
Abstract
Large Language Models exhibit strong reasoning and semantic understanding capabilities but often hallucinate in domains that require expert knowledge, among which fabrications, the generation of factually incorrect yet fluent statements, pose the greatest risk in medical contexts. Existing medical hallucination datasets inadequately capture fabrication phenomena due to limited fabrication coverage, stylistic disparities between human and LLM-authored texts, and distributional drift during hallucinated sample synthesis. To address this, we propose a data-centric pipeline to generate realistic and word-level fabrications that preserve syntactic and stylistic fidelity while introducing subtle factual deviations, resulting in MedFabric. Building upon this dataset, we introduce ETHER, a modular word-level fabrication detector integrating Text2Table Decomposition, Word Masking and Filling and Hybrid Sentence Pair Evaluation to enhance factual alignment. Empirical results demonstrate that MedFabric outperforms state-of-the-art detectors by over 15% on word-level fabrication benchmarks while maintaining consistent performance across structural similarities, offering a comprehensive framework for reliable and domain-specific factuality detection.
Chinese Translation
大型语言模型展示出强大的推理与语义理解能力,但在需要专业知识的领域中,常常会出现幻觉现象,其中伪造,即生成事实不准确但流畅的陈述,在医学环境中构成了最大的风险。现有的医学幻觉数据集由于伪造覆盖范围有限、人类与LLM创作文本之间的风格差异,以及在幻觉样本合成过程中的分布漂移,未能充分捕捉伪造现象。为了解决这一问题,我们提出了一种数据驱动的流程,生成保持句法和风格准确性的现实且词级的伪造,同时引入细微的事实偏差,形成MedFabric。在此数据集的基础上,我们引入了ETHER,一个模块化的词级伪造检测器,整合了Text2Table分解、词掩码与填充以及混合句子对评估,以增强事实一致性。实证结果表明,MedFabric在词级伪造基准上表现超过了最先进的检测器15%以上,同时在结构相似性上保持一致的性能,为可靠且特定领域的事实检测提供了一个全面的框架。
cs.CL / 8 / 2605.04196
The Impact of Vocabulary Overlaps on Knowledge Transfer in Multilingual Machine Translation
词汇重叠对多语种机器翻译中知识转移的影响
Abstract
Knowledge transfer, especially across related languages, has been found beneficial for multilingual neural machine translation (MNMT), but some aspects are still under-explored and deserve further investigation. A joint vocabulary is most often applied to form a uniform word embedding space, but because the impact of a disjoint vocabulary on model performance is far less studied, there is no consensus on how much of the knowledge transfer is due to vocabulary overlap. In this paper, we present systematic experiments with joint and disjoint vocabularies, and with auxiliary languages related and unrelated to the source language. We design these experiments in an out-of-domain setup in order to emphasize transfer and the impact of the auxiliary language. As expected, we obtain better results with the more extensive vocabulary overlaps typical of related languages, but our experiments also show that domain match and language relatedness are more important than a joint vocabulary.
Chinese Translation
知识转移,尤其是在相关语言之间,对于多语种神经机器翻译(MNMT)被发现是有益的,但一些方面仍然未被充分探索,值得进一步研究。通常使用联合词汇来形成统一的词嵌入空间,但由于对模型性能影响的分离词汇研究较少,因此在多大程度上知识转移主要归因于词汇重叠尚无共识。在本文中,我们进行了一系列系统实验,涉及联合和分离词汇,以及与源语言相关和不相关的辅助语言。我们在一个领域外的环境中设计了此次实验,强调了转移和辅助语言的影响。如预期,我们在相关语言的更多词汇重叠情况下取得了更好的结果,但我们的实验也表明,领域匹配和语言相关性比联合词汇更为重要。
cs.CL / 9 / 2605.04208
Nsanku: Evaluating Zero-Shot Translation Performance of LLMs for Ghanaian Languages
Nsanku:评估大型语言模型在加纳语言上的零样本翻译性能
Abstract
Large language models (LLMs) have demonstrated impressive multilingual capabilities for well-resourced languages, yet their performance on low-resource African languages remains poorly understood and largely unevaluated. This paper presents Nsanku, a systematic benchmark that evaluates the zero-shot machine translation performance of 19 open-weight and proprietary LLMs across 43 Ghanaian languages paired with English. Evaluation sentences were sourced from the YouVersion Bible platform, providing 300 sentence pairs per language. Two complementary automatic metrics are employed: Bilingual Evaluation Understudy (BLEU) and Character n-gram F-Score (chrF), alongside an average accuracy score and a cross-language consistency dimension. Nsanku represents the most comprehensive LLM translation evaluation for Ghanaian languages conducted to date. Results show that gemini-2.5-flash achieves the highest overall average score of 26.88 (BLEU: 24.60, chrF: 29.16), followed by claude-sonnet-4-5 at 24.87 (BLEU: 22.46, chrF: 27.28) and gpt-4.1 at 23.20 (BLEU: 21.15, chrF: 25.24). Among open-weight models, kimi-k2-instruct-0905 leads at an average score of 20.87. A critical finding from the consistency analysis is that no model and no language reached the Leaders quadrant of high performance and high consistency simultaneously, indicating that current LLMs are not yet reliably usable for Ghanaian language translation at scale. Siwu achieved the highest per-language average score at 25.73 while Nkonya scored lowest at 11.65. Nsanku establishes a publicly available, community-extensible evaluation infrastructure for African language NLP research.
Chinese Translation
大型语言模型(LLMs)在资源丰富的语言上展示了令人印象深刻的多语言能力,但它们在低资源非洲语言上的表现仍未得到充分理解和评估。本文提出了Nsanku,这是一个系统性的基准,评估19个开源权重和专有LLM在43种与英语配对的加纳语言上的零样本机器翻译性能。评估句子来源于YouVersion圣经平台,每种语言提供300对句子。采用了两种互补的自动评分指标:双语评估(Bilingual Evaluation Understudy,BLEU)和字符n-gram F分数(Character n-gram F-Score,chrF),另外还包括平均准确度分数和跨语言一致性维度。Nsanku代表了迄今为止针对加纳语言进行的最全面的LLM翻译评估。结果显示,gemini-2.5-flash的总体平均得分最高,达26.88(BLEU: 24.60, chrF: 29.16),其次是claude-sonnet-4-5,得分24.87(BLEU: 22.46, chrF: 27.28)和gpt-4.1,得分23.20(BLEU: 21.15, chrF: 25.24)。在开源权重模型中,kimi-k2-instruct-0905以20.87的平均得分领先。一项一致性分析的关键发现是,没有任何模型和语言同时进入高性能和高一致性的领导者象限,这表明当前的LLM尚未能够在规模上可靠地用于加纳语言翻译。Siwu在每种语言上的平均得分最高,为25.73,而Nkonya得分最低,达11.65。Nsanku建立了一个公开可用、可扩展的社区评估基础设施,以支持非洲语言自然语言处理研究。
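The overall averages reported in the Nsanku abstract appear consistent with an unweighted mean of BLEU and chrF. The short check below illustrates that reading using only the figures quoted in the abstract. That the benchmark computes its average this way is an assumption, not something the abstract states.

```python
# (BLEU, chrF, reported overall average) as quoted in the abstract.
reported = {
    "gemini-2.5-flash": (24.60, 29.16, 26.88),
    "claude-sonnet-4-5": (22.46, 27.28, 24.87),
    "gpt-4.1": (21.15, 25.24, 23.20),
}


def mean_score(bleu, chrf):
    """Unweighted mean of the two metrics (assumed aggregation)."""
    return (bleu + chrf) / 2.0


# Absolute gap between the assumed aggregation and the reported figure, per model.
gaps = {
    model: abs(mean_score(bleu, chrf) - overall)
    for model, (bleu, chrf, overall) in reported.items()
}

for model, gap in gaps.items():
    print(f"{model}: gap to reported average = {gap:.3f}")
```

All three gaps fall within two-decimal rounding, which supports the reading but does not confirm it.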
cs.CL / 10 / 2605.04221
Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction
用于隐私敏感临床信息提取的自我提示小型语言模型
Abstract
Clinical named entity recognition from dental progress notes is challenging because documentation is highly unstructured, domain-specific, and often privacy-sensitive. We developed a locally deployable framework that enables small language models to self-generate, verify, refine, and evaluate entity-specific prompts for extracting multiple clinical entities from dental notes. Using 1,200 annotated notes, we evaluated candidate open-weight models with multi-prompt ensemble inference and further adapted selected models using QLoRA-based supervised fine-tuning and direct preference optimization. Model performance varied substantially, highlighting the need for task-specific evaluation rather than reliance on generic benchmarks. Qwen2.5-14B-Instruct achieved the strongest baseline performance. After DPO, Qwen2.5-14B-Instruct and Llama-3.1-8B-Instruct achieved micro/macro F1 scores of 0.864/0.837 and 0.806/0.797, respectively. These findings suggest that automated prompt optimization combined with lightweight preference-based post-training can support scalable clinical information extraction using locally deployed small language models.
Chinese Translation
从牙科进展记录中识别临床命名实体具有挑战性,因为文档高度非结构化、领域特定且通常涉及隐私敏感信息。我们开发了一个本地可部署的框架,使小型语言模型能够自我生成、验证、完善和评估特定实体的提示,从牙科记录中提取多个临床实体。根据1200份标注的记录,我们评估了候选的开放权重模型,并使用多提示集成推理进一步调整选择的模型,采用了基于QLoRA的监督微调和直接偏好优化。模型性能差异显著,突出了针对特定任务评估的必要性,而不是依赖通用基准。Qwen2.5-14B-Instruct实现了最强的基线性能。在DPO之后,Qwen2.5-14B-Instruct和Llama-3.1-8B-Instruct分别达到了0.864/0.837和0.806/0.797的微/F1和宏/F1分数。这些发现表明,结合自动提示优化和轻量级偏好基础的后训练可以支持使用本地部署的小型语言模型实现可扩展的临床信息提取。
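The clinical-extraction abstract reports micro and macro F1 scores side by side. The sketch below shows how the two standard aggregations differ when computed from per-entity counts. The entity names and counts are hypothetical and are not taken from the paper.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps an entity type to a (tp, fp, fn) tuple."""
    if not counts:
        raise ValueError("counts must be non-empty")
    # Macro: average the per-entity F1 scores, so every entity type weighs the same.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts first, so frequent entity types dominate.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro


# Hypothetical counts for two entity types.
counts = {"tooth_number": (90, 10, 10), "procedure": (40, 10, 30)}
micro, macro = micro_macro_f1(counts)
```

A micro score above the macro score, as in the reported results, typically indicates that the more frequent entity types are extracted more accurately than the rare ones.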
cs.CL / 11 / 2605.04278
Material Database Agent: A Multimodal Agentic Framework for Scientific Literature Mining
材料数据库代理:一种用于科学文献挖掘的多模态代理框架
Abstract
Materials science workflows rely on structured and unstructured data from the vast body of available scientific literature. However, most of the experimental details remain buried in text, tables, graphs and figures. Thus, constructing databases that incorporate this data is a manual, time-consuming, and hard-to-scale process. Multimodal large language models have made it feasible to extract information from text and scientific figures with high speed and accuracy. This opens the possibility of an AI system that can create production-scale material databases. Material Database Agent (MDA) is a modular, multi-agent system architecture for converting research literature into structured databases. MDA accepts article PDFs as input, which are subsequently processed in parallel into markdown files and figures. Multiple sub-agents read these markdown files and figures in parallel to assemble sub-databases for each paper. These sub-databases are then compiled into a single tabular database by an agent. As opposed to using either a rule-based approach or a single-pass pipeline for extracting information, MDA is a specialized architecture for transforming the literature into a database in the field of materials science. More generally, this study provides a basis for positioning multimodal agentic information extraction as a viable means for constructing next-generation scientific databases from the primary literature.
Chinese Translation
材料科学工作流程依赖于来自大量可用科学文献的结构化和非结构化数据。然而,大多数实验细节仍然埋藏在文本、表格、图形和图像中。因此,构建包含这些数据的数据库是一个手动、耗时且难以扩展的过程。多模态大型语言模型使得从文本和科学图形中快速而准确地提取信息成为可能。这为构建可生产规模的材料数据库的人工智能系统开辟了可能性。材料数据库代理(Material Database Agent, MDA)是一个模块化的多代理系统架构,旨在将研究文献转换为结构化数据库。MDA接收作为输入的论文PDF文件,随后并行处理成markdown文件和图像。多个子代理并行读取这些markdown文件和图像,以为每篇论文组装子数据库。这些子数据库随后由一个代理汇编成一个单一的表格数据库。与使用基于规则的方法或单次处理管道提取信息不同,MDA是一个专门设计的架构,旨在将文献转化为材料科学领域的数据库。更普遍地说,本研究为将多模态代理信息提取定位为从原始文献构建下一代科学数据库的可行手段提供了基础。
cs.CL / 12 / 2605.04298
Towards Self-Referential Analytic Assessment: A Profile-Based Approach to L2 Writing Evaluation with LLMs
自指向分析评估的探索:基于特征的二语写作评价方法与大型语言模型的结合
Abstract
Automated essay scoring (AES) research often relies on rank-based correlation metrics to validate analytic assessment. However, such metrics obscure both intrinsic intercorrelations among analytic dimensions that arise from the structure of writing proficiency itself and halo effects, whereby holistic impressions bleed into fine-grained component scores. As a result, high correlations may mask a system's true diagnostic behaviour. In this study, we propose a novel self-referential assessment evaluation framework that focuses on identifying intra-learner strengths and weaknesses rather than assessing inter-learner rankings. We conduct experiments on the publicly available ICNALE GRA, a uniquely dense second-language writing dataset annotated holistically and analytically by up to 80 trained raters. To obtain reliable reference scores, we apply two-facet Rasch modelling to calibrate rater severity and derive fair average scores across ten analytic aspects and holistic proficiency. We compare the analytic scoring performance of human operational raters and three large language models (LLMs) in a zero-shot setting. Our results show that LLMs tend to outperform single human raters in identifying relative weaknesses (negative feedback) across several proficiency aspects, while human raters remain stronger at identifying relative strengths (positive feedback). Overall, our findings highlight the limitations of rank-based evaluation for analytic assessment and demonstrate the value of intra-learner, profile-based methods for assessing and deploying LLMs in AES.
Chinese Translation
自动化作文评分(AES)研究通常依赖基于等级的相关性指标来验证分析评估。然而,这些指标模糊了因写作能力结构而产生的分析维度之间的内在相互关系和晕轮效应,即整体印象渗入细致的组成分数。因此,高相关性可能掩盖了系统的真实诊断行为。在本研究中,我们提出了一种新颖的自指向评估框架,重点在于识别学习者内部的优势与劣势,而非评估学习者之间的排名。我们对公开可用的ICNALE GRA数据集(一个由多达80名训练评审员整体和分析标注的独特稠密第二语言写作数据集)进行了实验。为了获得可靠的参考分数,我们应用双主效应Rasch建模来校准评审员的严苛度,并在十个分析维度和整体能力中推导公平的平均分数。我们在零-shot设置中比较了人类操作评审员和三种大型语言模型(LLMs)的分析评分表现。我们的结果表明,LLMs在识别多个能力维度的相对劣势(负反馈)方面往往优于单一的人类评审员,而人类评审员在识别相对优势(正反馈)方面则表现更强。总体而言,我们的研究结果强调了基于排名的评估在分析评估中的局限性,并展示了基于特征的学习者内部评估方法在AES中使用LLMs的价值。
cs.CL / 13 / 2605.04305
SWAN: Semantic Watermarking with Abstract Meaning Representation
SWAN:基于抽象意义表示的语义水印
Abstract
We introduce SWAN (Semantic Watermarking with Abstract Meaning Representation), a novel framework that embeds watermark signatures into the semantic structure of a sentence using Abstract Meaning Representation (AMR). In contrast to existing watermarking methods, which typically encode signatures by adjusting token selection preferences during text generation, SWAN embeds the signature directly in the sentence's semantic representation. As the signature is encoded at the semantic structure level, any paraphrase that preserves meaning automatically preserves the signature. SWAN is training-free: watermark injection is achieved by prompting an LLM to generate sentences guided by a selected AMR template while maintaining contextual coherence, and detection uses an off-the-shelf AMR parser followed by a simple one-proportion z-test. Empirical evaluation on the RealNews benchmark shows SWAN matches state-of-the-art detection performance on unaltered watermarked text, while significantly improving robustness against paraphrasing, increasing detection AUC by up to 13.9 percentage points compared to prior methods. These results demonstrate that SWAN's approach of anchoring watermarks in AMR semantic structures provides a simple, effective, and prompt-based method for robust text provenance verification under paraphrasing, opening new avenues for semantic-level watermarking research.
Chinese Translation
我们提出了SWAN(基于抽象意义表示的语义水印),这是一个新颖的框架,通过使用抽象意义表示(AMR)将水印签名嵌入句子的语义结构中。与现有的水印方法相比,后者通常通过调整文本生成过程中的标记选择偏好来编码签名,SWAN则直接在句子的语义表示中嵌入签名。由于签名是在语义结构层面进行编码的,任何保持意义的释义都能自动保留水印。SWAN是无训练的:水印注入通过提示大语言模型(LLM)生成由选定AMR模板指导的句子,同时保持上下文一致性来实现,而检测则使用现成的AMR解析器,随后通过简单的单比例z检验进行。对RealNews基准的实证评估表明,SWAN在未修改的水印文本上达到了最先进的检测性能,同时在抵抗释义方面显著提高了鲁棒性,与先前的方法相比,检测的AUC增加了多达13.9个百分点。这些结果表明,SWAN通过在AMR语义结构中锚定水印的方法,为在释义下进行鲁棒文本来源验证提供了一种简单、有效且基于提示的方法,为语义级水印研究开辟了新的途径。
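The detection step named in the SWAN abstract, a one-proportion z-test, can be sketched with the standard library alone. The test itself is the textbook form. What counts as a "hit" (assumed here to be a sentence whose parsed AMR matches the signature) and the null proportion `p0` are assumptions, since the abstract does not specify them.

```python
import math


def one_proportion_z_test(hits, n, p0):
    """One-sided one-proportion z-test: is the hit rate above p0?

    Returns (z, p_value), where p_value is the upper-tail normal probability.
    """
    if n <= 0 or not 0.0 < p0 < 1.0:
        raise ValueError("need n > 0 and 0 < p0 < 1")
    p_hat = hits / n
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Upper-tail probability of the standard normal, via the complementary error function.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value


# Hypothetical example: 70 of 100 sentences match the signature, against a chance rate of 0.5.
z, p_value = one_proportion_z_test(70, 100, 0.5)
```

A large z with a small p-value would flag the text as watermarked, while unwatermarked text should give a hit rate near `p0` and a z near zero.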
cs.CL / 14 / 2605.04313
NoisyCausal: A Benchmark for Evaluating Causal Reasoning Under Structured Noise
NoisyCausal:评估结构性噪声下因果推理的基准测试
Abstract
Causal reasoning in natural language requires identifying relevant variables, understanding their interactions, and reasoning about effects and interventions, often under noisy or ambiguous conditions. While large language models (LLMs) exhibit strong general reasoning abilities, they struggle to disentangle correlation from causation, particularly when observations are partially incorrect or irrelevant information is present. In this work, we introduce NoisyCausal, a new benchmark designed to evaluate causal reasoning under structured noise. Each instance is generated from a ground-truth causal graph and contextualized with a natural language scenario by injecting controllable forms of noise, such as irrelevant distractors, value perturbations, confounding, and partial observability. Moreover, we propose a modular reasoning framework that combines LLMs with explicit causal structure to address these challenges. Our method prompts the LLM to extract variables, construct a causal graph from context, and then reformulates the reasoning task as a structured prompt grounded in this graph. Rather than relying on statistical patterns alone, the LLM is guided by symbolic structure, enabling more interpretable and robust inference. Experimental results show that our method significantly outperforms standard prompting and reasoning baselines on NoisyCausal. Furthermore, it generalizes well to external benchmarks such as Cladder without task-specific tuning. Our findings highlight the importance of combining causal abstractions with language-driven reasoning to achieve faithful and robust causal understanding in LLMs.
Chinese Translation
自然语言中的因果推理需要识别相关变量、理解它们的相互作用,并推理关于效果和干预的内容,这通常在噪声或模糊的条件下进行。虽然大型语言模型(LLMs)展现出强大的通用推理能力,但它们在区分相关性与因果性时面临挑战,尤其是在观察结果部分不正确或存在无关信息时。在本研究中,我们引入了NoisyCausal,这是一个旨在评估结构性噪声下因果推理的新基准。每个实例都来自一个真实的因果图,并通过注入可控形式的噪声(例如无关的干扰项、数值扰动、混淆和部分可观察性)在自然语言场景中进行情境化。此外,我们提出了一种模块化推理框架,将LLMs与显式因果结构相结合,以应对这些挑战。我们的方法引导LLM提取变量,从上下文构建因果图,然后将推理任务重新表述为基于此图的结构化提示。LLM不仅依赖于统计模式,而是得到符号结构的指导,从而实现更具可解释性和稳健性的推断。实验结果表明,我们的方法在NoisyCausal上显著优于标准的提示和推理基准。此外,它在外部基准(如Cladder)上没有特定任务调整情况下也表现良好。我们的研究结果强调了将因果抽象与基于语言的推理相结合的重要性,以在LLMs中实现真实和稳健的因果理解。
cs.CL / 15 / 2605.04426
Telegraph English: Semantic Prompt Compression via Structured Symbolic Rewriting
电报英语:通过结构化符号重写实现语义提示压缩
Abstract
We introduce Telegraph English (TE), a prompt-compression protocol that rewrites natural language into a symbol-rich, formally-structured dialect. Where token-deletion methods such as LLMLingua-2 train a classifier to delete low-importance tokens at a fixed ratio, TE performs a full semantic rewrite: it decomposes the input into atomic fact lines, substitutes verbose phrases with $\sim$40 logical and relational symbols, and lets the compression ratio adapt to each document's information density. A consequence of the line-structure rule is that compression and semantic chunking become the same operation -- each output line is an independently addressable fact, so the compressed representation is simultaneously a semantic index. We evaluate TE on 4{,}081 question-answer pairs from LongBench-v2 across five OpenAI models and two difficulty levels. At roughly 50\% token reduction, TE preserves 99.1\% accuracy on key facts with GPT-4.1 and outperforms LLMLingua-2 at matched compression ratios on every model and task tested. The gap widens on smaller models -- up to 11 percentage points on fine-detail tasks -- suggesting that explicit relational structure compensates for limited model capacity. We release the grammar specification, compression prompt, benchmark data, and reference implementation.
Chinese Translation
我们介绍了电报英语(Telegraph English, TE),这是一种将自然语言重写为符号丰富、形式结构化方言的提示压缩协议。在诸如LLMLingua-2的标记删除方法中,训练分类器以固定比例删除低重要性标记,而TE则执行全面的语义重写:它将输入分解为原子事实行,用约40个逻辑和关系符号替换冗长短语,并使压缩比率根据每个文档的信息密度进行自适应。行结构规则的一个结果是压缩和语义分块成为同一操作——每个输出行都是一个独立可寻址的事实,因此压缩表示同时也是一个语义索引。我们在来自LongBench-v2的4081对问答数据上评估TE,涉及五个OpenAI模型和两个难度级别。在约50%的标记减少下,TE在关键事实方面保持了99.1%的准确性,与GPT-4.1一起,并在测试的每个模型和任务上以匹配的压缩比率超越了LLMLingua-2。在较小模型上,这一差距更为显著——在细节任务上最高可达11个百分点——这表明显式关系结构补偿了有限模型能力。我们发布了语法规范、压缩提示、基准数据和参考实现。
cs.CL / 16 / 2605.04449
GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking
GEM:基于图增强的专家混合模型与ReAct智能体的对话状态跟踪
Abstract
Dialogue State Tracking (DST) requires precise extraction of structured information from multi-domain conversations, a task where Large Language Models (LLMs) struggle despite their impressive general capabilities. We present GEM (Graph-Enhanced Mixture-of-Experts), a novel framework that combines language models and graph-structured dialogue understanding with ReAct agent-based reasoning for superior DST performance. Our approach dynamically routes between specialized experts: a Graph Neural Network that captures dialogue structure and turn-level dependencies, and a finetuned T5-Small encoder-decoder for sequence modeling, coordinated by an intelligent router. For complex value generation tasks, we integrate ReAct agents that perform structured reasoning over dialogue context. On MultiWOZ 2.2, GEM achieves 65.19% Joint Goal Accuracy, substantially outperforming end-to-end LLM approaches (best: 38.43%) and surpassing state-of-the-art (SOTA) methods including TOATOD (63.79%), D3ST (58.70%), and Diable (56.48%). Our graph-enhanced mixture-of-experts architecture with ReAct integration demonstrates that combining structured dialogue representation with dynamic expert routing and agent-based reasoning provides a powerful paradigm for dialogue state tracking, achieving superior accuracy while maintaining computational efficiency through selective expert activation.
Chinese Translation
对话状态跟踪(DST)需要从多领域对话中精确提取结构化信息,这是一项大型语言模型(LLMs)在尽管具备强大一般能力的情况下仍然面临困难的任务。我们提出了GEM(Graph-Enhanced Mixture-of-Experts),这是一种新颖的框架,结合了语言模型和图结构对话理解,以及基于ReAct智能体的推理,以实现更优的DST性能。我们的方法在两个专门专家之间动态路由:一个图神经网络捕获对话结构和轮次级别的依赖性,另一个是经过微调的T5-Small编码器-解码器用于序列建模,在一个智能路由器的协调下进行。对于复杂值生成任务,我们集成了ReAct智能体,以在对话上下文中执行结构化推理。在MultiWOZ 2.2数据集上,GEM实现了65.19%的联合目标准确率,显著超越了端到端LLM方法(最佳:38.43%)并超过了包括TOATOD(63.79%)、D3ST(58.70%)和Diable(56.48%)在内的最先进(SOTA)方法。我们的图增强专家混合架构与ReAct的集成展示了将结构化对话表征与动态专家路由和基于智能体的推理相结合,为对话状态跟踪提供了一种强大的范式,在实现更高准确率的同时,通过选择性专家激活保持了计算效率。
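Joint Goal Accuracy, the metric reported in the GEM abstract, counts a turn as correct only when the entire predicted dialogue state matches the gold state. A minimal sketch of that standard definition follows. The slot names and values are illustrative, and real evaluations often add value normalization that is omitted here.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Share of turns whose predicted slot-value dict equals the gold dict exactly."""
    if len(predicted_states) != len(gold_states):
        raise ValueError("need one predicted state per gold state")
    if not gold_states:
        raise ValueError("need at least one turn")
    correct = sum(1 for pred, gold in zip(predicted_states, gold_states) if pred == gold)
    return correct / len(gold_states)


# Two illustrative turns; the second prediction gets one slot value wrong.
gold = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
]
pred = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "3"},
]
jga = joint_goal_accuracy(pred, gold)
```

Because a single wrong slot fails the whole turn, the metric is strict.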
cs.CL / 17 / 2605.04458
DoGMaTiQ: Automated Generation of Question-and-Answer Nuggets for Report Evaluation
DoGMaTiQ:报告评估的自动化问答片段生成
Abstract
Evaluation of long-form, citation-backed reports has lately received significant attention due to the wide-scale adoption of retrieval-augmented generation (RAG) systems. Core to many evaluation frameworks is the use of atomic facts, or nuggets, to assess a report's coverage of query-relevant information attested in the underlying collection. While nuggets have traditionally been represented as short statements, recent work has used question-answer (QA) representations, enabling fine-grained evaluations that decouple the information need (i.e. the question) from the potentially diverse content that satisfies it (i.e. its answers). A persistent challenge for nugget-based evaluation is the need to manually curate sets of nuggets for each topic in a test collection -- a laborious process that scales poorly to novel information needs. This challenge is acute in cross-lingual settings, where information is found in multilingual source documents. Accordingly, we introduce DoGMaTiQ, a pipeline for generating high-quality QA-based nugget sets in three stages: (1) document-grounded nugget generation, (2) paraphrase clustering, and (3) nugget subselection based on principled quality criteria. We integrate DoGMaTiQ nuggets with AutoArgue -- a recent nugget-based evaluation framework -- to enable fully automatic evaluation of generated reports. We conduct extensive experiments on two cross-lingual TREC shared tasks, NeuCLIR and RAGTIME, showing strong rank correlations with both human-in-the-loop and fully manual judgments. Finally, detailed analysis of our pipeline reveals that a strong LLM nugget generator is key, and that the system rankings induced by DoGMaTiQ are robust to outlier systems. We facilitate future research in report evaluation by publicly releasing our code and artifacts at https://github.com/manestay/dogmatiq.
Chinese Translation
由于检索增强生成(RAG)系统的广泛应用,长格式的、有引用支持的报告评估近年来受到显著关注。许多评估框架的核心是使用原子事实或片段,来评估报告对基础文献中与查询相关信息的覆盖程度。虽然片段传统上被表示为短语句,但最近的研究采用了问答(QA)表述,使得评估更加细致,能够将信息需求(即问题)与满足该需求的多样内容(即其答案)解耦。基于片段的评估面临的一个持续挑战是需要手动为测试集合中的每个主题整理片段集——这一过程繁琐且在处理新信息需求时扩展性较差。这个挑战在跨语言环境中尤为突出,因为信息来自多语言源文档。因此,我们提出了DoGMaTiQ,这是一个用于生成高质量基于QA的片段集的管道,该过程分为三个阶段:(1) 基于文档的片段生成,(2) 同义句聚类,以及(3) 基于原则性质量标准的片段子选择。我们将DoGMaTiQ生成的片段与AutoArgue(一个近期的基于片段的评估框架)集成,以实现生成报告的完全自动化评估。我们在两个跨语言的TREC共享任务NeuCLIR和RAGTIME上进行了广泛的实验,结果显示与人工评估和完全手动判断之间存在强相关性。最后,我们对管道的详细分析表明,强大的大语言模型(LLM)片段生成器是关键,并且DoGMaTiQ引导的系统排名对异常系统具有鲁棒性。我们通过在https://github.com/manestay/dogmatiq公开发布我们的代码和文档,促进报告评估的未来研究。
cs.CL / 18 / 2605.04495
CAR: Query-Guided Confidence-Aware Reranking for Retrieval-Augmented Generation
CAR:基于查询引导的置信度感知重排序框架用于检索增强生成
Abstract
Retrieval-Augmented Generation (RAG) depends on document ranking to provide useful evidence for generation, but conventional reranking methods mainly optimize query-document relevance rather than generation usefulness. A relevant document may still introduce noise, while a lower-ranked document may better reduce the generator's uncertainty. We propose CAR (Confidence-Aware Reranking), a query-guided, training-free, and plug-and-play reranking framework that uses generator confidence change as a document usefulness signal. CAR estimates confidence through the semantic consistency of multiple sampled answers under query-only and query-document conditions. Documents that significantly increase confidence are promoted, those that decrease confidence are demoted, and uncertain cases preserve the baseline order, while a query-level gate avoids unnecessary intervention on already confident queries. Experiments on four BEIR datasets show that CAR consistently improves NDCG@5 across sparse and dense retrievers, LLM-based and supervised rerankers, and four LLM backbones. Notably, CAR improves the YesNo reranker by 25.4 percent on average under Contriever retrieval, and its ranking gains strongly correlate with downstream generation F1 improvements, achieving Spearman rho = 0.964.
Chinese Translation
检索增强生成(RAG)依赖于文档排序来提供有用的证据以进行生成,但传统的重排序方法主要优化查询-文档相关性,而非生成的有效性。相关文档仍可能引入噪音,而较低排名的文档可能更能降低生成器的不确定性。我们提出了CAR(置信度感知重排序),这是一种基于查询引导、无训练且即插即用的重排序框架,利用生成器置信度变化作为文档有效性信号。CAR通过在仅查询和查询-文档条件下多次采样答案的语义一致性来估计置信度。显著增加置信度的文档会被提升排名,降低置信度的文档会被降级,而不确定的情况会保持基线顺序,同时,查询级门控避免对已经具有高置信度的查询进行不必要的干预。在四个BEIR数据集上的实验表明,CAR在稀疏和密集检索器、基于LLM的和监督重排序器以及四个LLM骨干网络上都能一致改善NDCG@5。值得注意的是,在Contriever检索下,CAR将YesNo重排序器平均提升了25.4%,并且其排名提升与下游生成的F1改进强相关,斯皮尔曼相关系数rho = 0.964。
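The promote/demote/keep rule and the query-level gate described in the CAR abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the function name `car_rerank`, the thresholds `tau` and `gate`, and the choice to order promoted documents by the size of their confidence gain. The confidence values are taken as given inputs; the abstract estimates them from the semantic consistency of sampled answers, which is not reproduced here.

```python
from typing import List, Sequence


def car_rerank(baseline: Sequence[str],
               conf_query_only: float,
               conf_with_doc: Sequence[float],
               tau: float = 0.1,    # hypothetical significance threshold
               gate: float = 0.9    # hypothetical query-level gate
               ) -> List[str]:
    """Reorder `baseline` by the confidence change each document induces."""
    # Query-level gate: an already confident query keeps the baseline order.
    if conf_query_only >= gate:
        return list(baseline)

    promoted, neutral, demoted = [], [], []
    for doc, conf in zip(baseline, conf_with_doc):
        delta = conf - conf_query_only
        if delta > tau:
            promoted.append((delta, doc))
        elif delta < -tau:
            demoted.append((delta, doc))
        else:
            neutral.append(doc)  # uncertain case: baseline order is preserved

    # Largest gain first; list.sort is stable, so ties keep baseline order.
    promoted.sort(key=lambda item: -item[0])
    # Least harmful document first among the demoted ones.
    demoted.sort(key=lambda item: -item[0])
    return [d for _, d in promoted] + neutral + [d for _, d in demoted]


ranking = car_rerank(["a", "b", "c", "d"], 0.5, [0.3, 0.55, 0.9, 0.7])
print(ranking)
```

In this toy call, documents `c` and `d` raise confidence by more than `tau` and move up, `b` stays in the neutral band, and `a` lowers confidence and moves to the end.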
cs.CL / 19 / 2605.04496
SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States
SCOUT:基于解耦认识状态的长文本理解主动信息觅取
Abstract
Long-Text Understanding (LTU) at million-token scale requires balancing reasoning fidelity with computational efficiency. Frontier long-context LLMs can process millions of token contexts end-to-end, but they suffer from high token consumption and attention dilution. In parallel, specialized LTU agents often sacrifice fidelity through task-agnostic abstractions like graph construction or indexing. We identify a key insight for LTU: query-relevant information is typically sparse relative to the full document, so effective reasoning should rely on a query-sufficient subset rather than the entire context. To address this, we propose SCOUT, a new paradigm for LTU that shifts from passive processing to active information foraging. It treats the document as an explorable environment and answers from a compact, provenance-grounded epistemic state. Guided by state-level gap diagnosis, SCOUT adaptively alternates between coarse-to-fine exploration and anchored state updates that progressively contract its epistemic state toward query sufficiency. Experiments show that SCOUT matches state-of-the-art proprietary models while reducing token consumption by up to 8x. Moreover, SCOUT remains stable as context length scales, substantially alleviating the practical cost-performance trade-off.
Chinese Translation
在百万标记规模下,长文本理解(LTU)需要平衡推理的准确性与计算效率。前沿的长上下文大语言模型(LLMs)能够端到端处理数百万个标记的上下文,但它们在标记消耗和注意力稀释方面存在问题。同时,专门的LTU代理往往通过图构建或索引等任务无关的抽象牺牲推理准确性。我们识别出LTU的一个关键见解:相比于整篇文档,查询相关信息通常是稀疏的,因此有效的推理应依赖于查询充足的子集,而非整个上下文。为此,我们提出了SCOUT,这是一种新的LTU范式,强调从被动处理转向主动信息觅取。SCOUT将文档视为一个可探索的环境,并从一个紧凑的、基于来源的认识状态中回答问题。在状态层面进行差距诊断的指导下,SCOUT自适应地在粗略探索与精准的状态更新之间进行切换,逐步收缩其认识状态以达到查询的充分性。实验表明,SCOUT与最先进的专有模型相当,同时将标记消耗降低了最多8倍。此外,随着上下文长度的增加,SCOUT仍然保持稳定,显著减轻了实际的成本与性能权衡。
cs.CL / 20 / 2605.04500
Harnessing Linguistic Dissimilarity for Language Generalization on Unseen Low-Resource Varieties
利用语言差异性实现对未见低资源方言的语言泛化
Abstract
Low-resource language varieties used by specific groups remain neglected in the development of multilingual language models. A great deal of cross-lingual research focuses on inter-lingual transfer, which strives to align allied varieties and minimize the differences between them. However, for low-resource varieties, linguistic dissimilarity is also an important cue that allows generalization to unseen varieties. Unlike prior approaches, we propose a two-stage Language Generalization framework that focuses on capturing variety-specific cues while also exploiting the rich overlap offered by a high-resource source variety. First, we propose TOPPing, a source-selection method specifically designed for low-resource varieties. Second, we suggest a lightweight VACAI-Bowl architecture that learns variety-specific attributes with one branch while a parallel branch captures variety-invariant attributes using adversarial training. We evaluate our framework on structural prediction tasks, which are among the few tasks available, as a proxy for performance on other downstream tasks. Using VACAI-Bowl with TOPPing yields an average 54.62% improvement on the dependency parsing task across 10 low-resource varieties.
Chinese Translation
特定群体使用的低资源语言变体在多语言模型的发展中仍然被忽视。大量的跨语言研究集中于语言间的迁移,努力对齐相关变体并最小化它们之间的差异。然而,对于低资源变体来说,语言差异性也是一个重要线索,能够帮助实现对未见变体的泛化。与以往的方法不同,我们提出了一种两阶段语言泛化框架,侧重于捕捉特定变体的线索,同时利用高资源源变体所提供的丰富重叠。首先,我们提出了TOPPing,一种专为低资源变体设计的源选择方法。其次,我们建议使用轻量级的VACAI-Bowl架构,该架构通过一个分支学习变体特定属性,同时通过平行分支利用对抗训练捕捉变体不变属性。我们在结构预测任务上评估了我们的框架,这是少数可用任务之一,作为其他下游任务性能的代理。使用结合TOPPing的VACAI-Bowl,在10个低资源变体的依存句法分析任务上实现了平均54.62%的性能提升。
cs.CL / 21 / 2605.04507
Distilling Bayesian Belief States into Language Models for Auditable Negotiation
将贝叶斯信念状态蒸馏到语言模型中以实现可审计的谈判
Abstract
Negotiation agents must infer what their counterpart values, update those beliefs over dialogue turns, and choose actions under uncertainty. End-to-end large language models (LLMs) can imitate negotiation dialogue, but their opponent beliefs are usually implicit and difficult to inspect. We propose BOND (Bayesian Opponent-belief Negotiation Distillation), a framework for auditable negotiation. BOND consists of an LLM-based Bayesian teacher that scores dialogue contexts against the six possible opponent priority orderings, updates a posterior over those orderings, and uses the posterior for menu-based decision making, as well as a smaller 8B student language model that emits both negotiation actions and normalized posterior beliefs as tagged text. In the CaSiNo negotiation dataset, BOND outperforms the state-of-the-art and achieves mean Brier score 0.085 over opponent-priority posteriors. The distilled student preserves much of this belief signal, achieving Brier 0.114, below the uniform six-ordering reference of 5/36, approximately 0.139. Compared with a 70B structured-CoT baseline, the significantly smaller 8B student model yields substantially better elicited posterior calibration. We further showcase auditability through posterior trajectories, belief-versus-policy error decomposition, and posterior-prefix interventions. These diagnostics reveal that distillation preserves a scoreable belief report more strongly than causal belief-conditioned control, making weak belief-action coupling visible, not hidden.
Chinese Translation
谈判代理必须推断对手的价值观,在对话轮次中更新这些信念,并在不确定性下选择行动。端到端的大型语言模型(LLMs)能够模拟谈判对话,但其对手信念通常是隐含的,难以检查。我们提出了BOND(贝叶斯对手信念谈判提炼),这是一个可审计的谈判框架。BOND由基于LLM的贝叶斯教师组成,该教师根据六种可能的对手优先级顺序对对话上下文进行评分,更新对这些顺序的后验分布,并利用后验进行基于菜单的决策,以及一个较小的8B学生语言模型,它发出谈判行动和规范化的后验信念作为标记文本。在CaSiNo谈判数据集上,BOND的性能超过了当前最先进的技术,成功在对手优先级后验上达到了平均Brier分数0.085。提炼后的学生模型保留了大部分信念信息,达到了Brier分数0.114,低于均匀六个排序参考的5/36,约为0.139。与70B结构化CoT基线相比,显著较小的8B学生模型在后验校准方面提供了显著更好的结果。我们进一步通过后验轨迹、信念与政策错误分解以及后验前缀干预展示可审计性。这些诊断揭示了提炼能够更强有力地保留可评分的信念报告,相较于因果信念条件控制,更能使弱信念-行动关联显露,而非隐藏。
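The uniform reference of 5/36 quoted in the BOND abstract can be checked by hand. The sketch below assumes the Brier score is averaged over the six orderings (a class-averaged convention, which is the reading consistent with the 5/36 figure); the function name `brier` is illustrative and not taken from the paper.

```python
from fractions import Fraction


def brier(posterior, true_index):
    """Class-averaged Brier score of one posterior against a one-hot truth."""
    k = len(posterior)
    squared_errors = (
        (p - (1 if i == true_index else 0)) ** 2
        for i, p in enumerate(posterior)
    )
    return sum(squared_errors) / k


# Uniform belief over the six possible opponent priority orderings.
uniform = [Fraction(1, 6)] * 6
reference = brier(uniform, true_index=0)

# (5/6)^2 for the true ordering plus 5 * (1/6)^2 for the others, divided by 6.
print(reference, float(reference))
```

Under this convention the true ordering contributes 25/36 and the five others contribute 1/36 each, giving 30/36 before averaging and 5/36 (about 0.139) after dividing by six. Scores of 0.085 and 0.114 both fall below that uninformed reference.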
cs.CL / 22 / 2605.04523
RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation
RaguTeam在SemEval-2026任务8中的表现:在评判者协调的LLM集成中,Meno及其朋友们用于生成可信的多轮响应
Abstract
We present our winning system for Task B (generation with reference passages) in SemEval-2026 Task 8: MTRAGEval. Our method is a heterogeneous ensemble of seven LLMs with two prompting variants, where a GPT-4o-mini judge selects the best candidate per instance. We ranked 1st out of 26 teams, achieving a conditioned harmonic mean of 0.7827 and outperforming the strongest baseline (gpt-oss-120b, 0.6390). Ablations show that diversity in model families, scales, and prompting strategies is essential, with the ensemble consistently beating any single model. We also introduce Meno-Lite-0.1, a 7B domain-adapted model with a strong cost-performance trade-off, and analyse MTRAGEval, highlighting annotation limitations and directions for improvement. Our code is publicly available: https://github.com/RaguTeam/ragu_mtrag_semeval
Chinese Translation
我们介绍了在SemEval-2026任务8(MTRAGEval)的任务B(带参考文段的生成)中获胜的系统。我们的方法是一个由七个大型语言模型(LLM)组成的异构集成,结合了两种提示变体,其中GPT-4o-mini评判者为每个实例选择最佳候选。在26个团队中我们排名第1,获得了0.7827的条件调和平均值,超越了最强的基线(gpt-oss-120b,0.6390)。消融实验表明,模型家族、规模和提示策略的多样性至关重要,这个集成始终优于任何单一模型。我们还介绍了Meno-Lite-0.1,这是一个具有良好成本-性能权衡的7B领域适应模型,并分析了MTRAGEval,指出了注释的局限性及改进方向。我们的代码已公开可用: https://github.com/RaguTeam/ragu_mtrag_semeval
cs.CL / 23 / 2605.04539
RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
RLearner-LLM:通过混合直接偏好优化平衡大型语言模型中的逻辑基础与流畅性
Abstract
Direct Preference Optimization (DPO), the efficient alternative to PPO-based RLHF, falls short on knowledge-intensive generation: standard preference signals from human annotators or LLM judges exhibit a systematic verbosity bias that rewards fluency over logical correctness. This blindspot leaves a logical alignment gap -- SFT models reach NLI entailment of only 0.05-0.22 despite producing fluent text. We propose RLearner-LLM with Hybrid-DPO: an automated preference pipeline that fuses a DeBERTa-v3 NLI signal with a verifier LLM score, removing human annotation while overcoming the "alignment tax" of single-signal optimization. Evaluated across five academic domains (Biology, Medicine, Law) with three base architectures (LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it), RLearner-LLM yields up to 6x NLI improvement over SFT, with NLI gains in 11 of 15 cells and consistent answer-coverage gains. On Gemma 4 E4B-it (4.5B effective params), Hybrid-DPO lifts NLI in four of five domains (+11.9% to +2.4x) with faster inference across all five, scaling down to compact base models without losing the alignment-tax mitigation. Our Qwen3-8B RLearner-LLM wins 95% of pairwise comparisons against its own SFT baseline; GPT-4o-mini in turn wins 95% against our concise output -- alongside the 69% win the same judge gives a verbose SFT over our DPO model, this replicates verbosity bias on a frontier comparator and motivates logic-aware metrics (NLI, ACR) over LLM-as-a-judge for knowledge-intensive generation.
Chinese Translation
直接偏好优化(DPO),作为基于PPO的强化学习人类反馈(RLHF)的高效替代方案,在知识密集型生成中存在不足:来自人类注释者或大型语言模型(LLM)评审的标准偏好信号表现出系统性的冗长偏见,奖励流畅性而非逻辑正确性。这个盲点导致了逻辑对齐缺口:尽管能生成流畅文本,监督微调(SFT)模型的自然语言推断(NLI)蕴涵仅达到0.05-0.22。我们提出了具有混合DPO的RLearner-LLM:一个自动化的偏好管道,将DeBERTa-v3的NLI信号与验证器LLM分数融合,消除人工注释,同时克服单一信号优化的“对齐税”。在五个学术领域(生物学、医学、法律)中,基于三种基础架构(LLaMA-2-13B、Qwen3-8B、Gemma 4 E4B-it)进行评估,RLearner-LLM在NLI上超过SFT,改进高达6倍,在15个单元中的11个单元上获得NLI提升,且答案覆盖率持续增加。在Gemma 4 E4B-it(45亿有效参数)上,混合DPO在五个领域中的四个领域提升了NLI(+11.9%至+2.4倍),并在所有五个领域中实现了更快的推断速度,能够在不失去对齐税减轻效果的情况下缩放至紧凑的基础模型。我们的Qwen3-8B RLearner-LLM在与自身的SFT基线进行配对比较中取得95%的胜率;而GPT-4o-mini则在与我们简洁输出的对比中获得95%的胜率。再加上同一评审在冗长的SFT与我们的DPO模型的对比中判给前者69%的胜率,这在前沿比较器上复现了冗长偏见,并表明在知识密集型生成中应采用逻辑感知指标(NLI、ACR),而非将LLM作为评审。
cs.CL / 24 / 2605.04543
UniVer: A Unified Perspective for Multi-step and Multi-draft Speculative Decoding
UniVer:多步多稿推测解码的统一视角
Abstract
Speculative decoding accelerates Large Language Models via draft-then-verify, where verification can be framed as an Optimal Transport (OT) problem. Existing approaches typically handle multi-draft and multi-step aspects in isolation, applying either flat OT to single-step drafts or per-token rejection sampling to tree-structured candidates. This separation leaves the joint regime (where multi-step dependencies meet multi-draft branching) poorly optimized, as local verification rules fail to exploit the coupling between horizontal and vertical dimensions of candidate trees. In this paper, we propose a unified perspective that casts tree-based verification as a conditional OT problem. Our key insight is that vertical dependencies can be abstracted through prefix acceptance probabilities, which act as dynamic scaling factors to actively guide horizontal draft selection. Based on this principle, we introduce UniVer, a verification algorithm that jointly optimizes across tree levels by composing local optimal transport plans under prefix constraints. We prove that UniVer remains lossless and achieves the optimal acceptance rate under the proposed conditional framework. Extensive experiments across different tasks and models demonstrate that UniVer improves acceptance length by 4.2% to 8.5% over standard recursive rejection sampling without replacement, while maintaining exact distributional alignment with the target model.
Chinese Translation
推测解码通过草稿-再验证的方式加速大型语言模型,其中验证可以被框架化为最优运输(Optimal Transport, OT)问题。现有的方法通常独立处理多草稿和多步骤方面,或是对单步草稿应用平面OT,或是对树状候选进行逐标记拒绝采样。这种分离导致联合模式(即多步骤依赖与多草稿分支相遇)优化不佳,因为局部验证规则未能充分利用候选树的横向和纵向维度之间的耦合。在本文中,我们提出了一种统一视角,将基于树的验证视作条件OT问题。我们的关键见解是,纵向依赖可以通过前缀接受概率进行抽象,这些概率作为动态缩放因子来主动引导横向草稿选择。基于这一原则,我们引入了UniVer,一种在前缀约束下通过组合局部最优运输方案来共同优化树层的验证算法。我们证明UniVer在所提出的条件框架下保持无损,并实现最优接受率。在不同任务和模型上的大量实验表明,UniVer在接受长度上比标准的无替换递归拒绝采样提高了4.2%到8.5%,同时保持与目标模型的精确分布对齐。
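For readers unfamiliar with the per-token rejection sampling that the UniVer abstract uses as its point of comparison, here is a minimal sketch of the standard single-draft, single-step verification rule. It is not UniVer itself, and it does not cover the multi-draft, tree-structured case the paper targets. The function names and the toy distributions `p` (target) and `q` (draft) are illustrative assumptions.

```python
import random


def acceptance_probability(p, q):
    """Probability that a draft token sampled from q is accepted under p."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def residual(p, q):
    """Distribution to resample from after a rejection: norm(max(p - q, 0))."""
    r = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    z = sum(r)
    return [ri / z for ri in r] if z > 0 else list(p)


def verify_token(p, q, rng):
    """Draft one token from q, then accept it or resample from the residual."""
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    return rng.choices(range(len(p)), weights=residual(p, q))[0], False


p = [0.5, 0.3, 0.2]  # toy target distribution
q = [0.2, 0.3, 0.5]  # toy draft distribution
rng = random.Random(0)
samples = [verify_token(p, q, rng)[0] for _ in range(20000)]
freqs = [samples.count(i) / len(samples) for i in range(3)]
print(acceptance_probability(p, q), freqs)
```

The empirical frequencies should track `p` rather than `q`, which is the "lossless" property the abstract refers to: the output distribution matches the target model regardless of how good the draft is, and only the acceptance rate changes.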
cs.CL / 25 / 2605.04552
The Newsworthiness of Brazilian Distress: A Peak Analysis on Time Series of International Media Attention to Disasters in Brazil
巴西困境的新闻价值:关于巴西灾害国际媒体关注时间序列的峰值分析
Abstract
Media coverage influences disaster response, yet the drivers of international media attention to local events remain unevenly understood. Brazil offers a compelling case: some of its natural and technological disasters occasionally hit the international headlines. However, systematic analyses of what causes these events to be discussed abroad are still missing. Addressing this gap requires representative, validated, and country-specific news datasets. This paper presents a peak analysis of 2k news articles about Brazilian fires and landslides published in German newspapers from 2000 to 2024. Using time series segmentation to detect news event peaks, we examine the extent to which they can be temporally aligned with observations in national and global disaster databases.
Chinese Translation
媒体报道影响灾难响应,但国际媒体对地方事件的关注驱动因素尚不完全明确。巴西提供了一个引人注目的案例:其一些自然和技术灾害偶尔会登上国际头条。然而,对于导致这些事件在国外被讨论的系统分析仍然缺失。填补这一空白需要具有代表性、经过验证且特定于国家的新闻数据集。本文呈现了对2000年至2024年间德国报纸中有关巴西火灾和山体滑坡的2000篇新闻的峰值分析。通过时间序列分割检测新闻事件峰值,我们考察了这些峰值在多大程度上可以与国家和全球灾害数据库中的观察结果进行时间上的对应。
cs.CL / 26 / 2605.04576
Benchmarking POS Tagging for the Tajik Language: A Comparative Study of Neural Architectures on the TajPersParallel Corpus
塔吉克语词性标注基准测试:在TajPersParallel语料库上对神经网络架构的比较研究
Abstract
This paper presents the first benchmark for the task of automatic part-of-speech (POS) tagging for the Tajik language. Despite the existence of multilingual language models demonstrating high effectiveness for many of the world's languages, their capacity for grammatical analysis of Tajik has remained unexplored until now. The aim of this study is to fill this gap through a systematic comparison of classical neural network architectures and modern multilingual transformers. Experiments were conducted on the TajPersParallel corpus, a parallel lexical resource comprising approximately 44,000 dictionary entries. Due to the absence of full-fledged example sentences in the current version of the corpus, the task was performed at the level of isolated lexical units, representing a challenging case of context-independent classification. The study compares the following architectures: a recurrent BiLSTM-CRF model, as well as multilingual models XLM-RoBERTa (large), mBERT, ParsBERT (Persian), and ruBERT (Russian), adapted using the parameter-efficient fine-tuning method LoRA. The testing results showed that the best performance is achieved by the mBERT + LoRA model (macro F1-score = 0.11, weighted F1-score = 0.62). It was established that in the absence of syntactic context, all models experience significant difficulty in resolving morphological ambiguity, successfully classifying primarily high-frequency classes ("noun," "adjective") while demonstrating zero effectiveness for rare function words. Zero-shot evaluation revealed the greatest typological proximity of Tajik to Persian (ParsBERT) and Russian (ruBERT). The obtained results form a foundation for further research and development in the field of automatic processing of the Tajik language.
Chinese Translation
本文提出了塔吉克语自动词性标注(POS tagging)任务的首个基准测试。尽管现有的多语言模型在世界多种语言上展现出高效性,但其在塔吉克语的语法分析能力迄今尚未得到探讨。本研究的目的在于通过对经典神经网络架构和现代多语言变换模型的系统比较,填补这一空白。实验在TajPersParallel语料库上进行,该语料库是一个包含约44,000个词典条目的平行词汇资源。由于当前语料库版本缺乏完整的示例句,该任务在孤立词汇单元的层面上进行,构成了一个无上下文分类的挑战性案例。本研究比较了以下架构:递归BiLSTM-CRF模型,以及使用参数高效微调方法LoRA调适的多语言模型XLM-RoBERTa(大)、mBERT、ParsBERT(波斯语)和ruBERT(俄语)。测试结果显示,最佳表现由mBERT + LoRA模型取得(宏观F1-score = 0.11,加权F1-score = 0.62)。研究表明,在缺乏句法上下文的情况下,所有模型在解决形态歧义时均遇到显著困难,主要成功分类高频类别(“名词”,“形容词”),而对于稀有功能词则显示出零效能。零样本评估揭示出塔吉克语与波斯语(ParsBERT)和俄语(ruBERT)之间的类型学接近性最大。所获得的结果为塔吉克语自动处理领域的进一步研究和发展奠定了基础。
cs.CL / 27 / 2605.04583
TajikNLP: An Open-Source Toolkit for Comprehensive Text Processing of Tajik (Cyrillic Script)
TajikNLP:一款针对塔吉克语(西里尔字母)全面文本处理的开源工具包
Abstract
The Tajik language, written in Cyrillic script, remains severely under-resourced in terms of publicly available natural language processing (NLP) toolkits, hindering both linguistic research and applied development. This paper introduces TajikNLP, an open-source Python library that provides the first comprehensive pipeline for processing authentic Tajik text while preserving the original Cyrillic orthography. The library implements a modular architecture centered around a unified Doc object, enabling sequential application of components for cleaning, normalization, tokenization (including subword BPE), morphemic segmentation, part-of-speech tagging, stemming, lemmatization, and sentence splitting. A novel unified morphology engine is introduced, offering controlled and deep analysis modes that significantly improve handling of Tajik's agglutinative nominal and verbal inflections. The release further incorporates a lexicon-based sentiment analyser and pre-trained Word2Vec/FastText embeddings loaded directly from the Hugging Face Hub. To ensure reproducibility and facilitate future research, four accompanying linguistic datasets -- a POS-tagged corpus (52.5k entries), a sentiment lexicon (3.5k entries), a toponym gazetteer (5.6k entries), and a personal names dataset (3.8k entries) -- have been openly published under permissive licenses. The library's reliability is validated by an extensive test suite of 616 automated tests achieving 93% source code coverage. TajikNLP thus establishes a foundational technological infrastructure for Tajik language processing, lowering the barrier to entry for both academic and industrial applications in low-resource Cyrillic-script environments.
Chinese Translation
塔吉克语使用西里尔字母书写,在公共可用的自然语言处理(NLP)工具包方面严重不足,严重阻碍了语言研究与应用开发。本文介绍了TajikNLP,这是一款开源Python库,提供了第一个针对真实塔吉克文本处理的综合管道,同时保留了原始的西里尔正字法。该库实现了一个以统一Doc对象为中心的模块化架构,支持组件的顺序应用,包括文本清理、规范化、分词(包括子词BPE)、形态分段、词性标注、词干提取、词形还原和句子拆分。我们引入了一种新颖的统一形态引擎,提供受控和深度分析模式,显著改善了塔吉克语粘着名词和动词变形的处理。此次发布还包含基于词典的情感分析器和直接从Hugging Face Hub加载的预训练Word2Vec/FastText嵌入。为了确保可重复性并促进未来研究,四个配套的语言数据集已在宽松许可下公开发布,包括一个词性标注语料库(52.5k条)、一个情感词典(3.5k条)、一个地名辞典(5.6k条)和一个人名数据集(3.8k条)。该库的可靠性通过616个自动化测试的广泛测试套件得到了验证,达到了93%的源代码覆盖率。因此,TajikNLP为塔吉克语言处理建立了基础技术基础,降低了低资源西里尔字母环境中学术和工业应用的入门门槛。
cs.CL / 28 / 2605.04638
Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models
关于语义保持嵌入的梯度揭示大型语言模型的不确定性
Abstract
Uncertainty quantification (UQ) is an important technique for ensuring the trustworthiness of LLMs, given their tendency to hallucinate. Existing state-of-the-art UQ approaches for free-form generation rely heavily on sampling, which incurs high computational cost and variance. In this work, we propose the first gradient-based UQ method for free-form generation, SemGrad, which is sampling-free and computationally efficient. Unlike prior gradient-based methods developed for classification tasks, which operate in parameter space, we propose to consider gradients in semantic space. Our method builds on the key intuition that a confident LLM should maintain stable output distributions under semantically equivalent input perturbations. We interpret this stability as gradients in semantic space and introduce a Semantic Preservation Score (SPS) to identify the embeddings that best capture semantics, with respect to which the gradients are computed. We further propose HybridGrad, which combines the strengths of SemGrad and parameter gradients. Experiments demonstrate that both of our methods provide efficient and effective uncertainty estimates, outperforming state-of-the-art methods, particularly in settings with multiple valid responses.
Chinese Translation
不确定性量化(UQ)是确保大型语言模型(LLMs)可信度的一项重要技术,考虑到它们的幻觉倾向。现有的最先进的自由形式生成UQ方法主要依赖于采样,这会带来高计算成本和方差。在本研究中,我们提出了首个基于梯度的自由形式生成UQ方法,SemGrad,该方法不依赖于采样且计算高效。与以往为分类任务开发的基于梯度的方法在参数空间中操作不同,我们提议考虑在语义空间中的梯度。我们的方法基于一个关键直觉,即一个自信的LLM应在语义等效的输入扰动下保持稳定的输出分布。我们将这种稳定性解释为语义空间中的梯度,并引入了语义保持评分(SPS)来识别最佳捕获语义的嵌入,基于该嵌入计算梯度。我们进一步提出了HybridGrad,该方法结合了SemGrad和参数梯度的优点。实验表明,我们的两种方法都提供了高效且有效的不确定性估计,表现出优于最先进方法的性能,特别是在具有多种有效响应的设置中。
cs.CL / 29 / 2605.04643
Graph-Augmented LLMs for Swiss MP Ideology Prediction
图增强大语言模型用于瑞士议会议员意识形态预测
Abstract
Approximating the ideological position of Members of Parliament (MPs) is a fundamental task in political science, helping researchers understand legislative behavior, party alignment, and policy preferences. While Large Language Models (LLMs) have shown promising results in estimating MPs' ideological stances, there are more actors and elements in the parliamentary system, and relations between them, that could provide a wider and more informative picture. However, due to the complexity of integrating them in the prediction task, these additional elements are generally ignored. In this work, we propose an LLM framework, PG-RAG, that implements a retrieval-augmented generation pipeline: it first queries a political knowledge graph (KG) and then integrates the resulting graph-structured information into the context. This allows for capturing both textual semantics and inter-MP relationships, another relevant information source in any parliamentary system. We evaluate the approach on the task of ideology prediction, using data from a Swiss parliamentary dataset. When comparing graph-augmented models against several state-of-the-art baselines, the results demonstrate that incorporating this enriched information, which encodes information about different entities and relations, improves prediction performance. These results help to highlight the value of domain-specific relational information in modeling political behavior.
Chinese Translation
近似议会议员(MP)的意识形态立场是政治科学中的一项基础任务,帮助研究人员理解立法行为、政党对齐和政策偏好。尽管大语言模型(LLMs)在估计议会议员的意识形态立场方面显示出有希望的结果,但在议会系统中还有更多的参与者和元素,以及它们之间的关系,这些都可以提供更广泛和更具信息量的视角。然而,由于在预测任务中整合这些额外元素的复杂性,这些元素通常被忽视。在本研究中,我们提出了一种LLM框架PG-RAG,该框架实施了一种检索增强生成管道:它首先查询一个政治知识图谱(KG),然后将生成的图结构信息整合到上下文中。这使得能够同时捕捉文本语义和议员间的关系,这是任何议会系统中的另一个相关信息来源。我们在瑞士议会数据集上评估了该方法在意识形态预测任务中的表现。通过将图增强模型与几种最先进的基准进行比较,结果显示,整合编码有关不同实体和关系的信息的丰富信息,可以提升预测性能。这些结果突显了领域特定关系信息在建模政治行为中的价值。
cs.CL / 30 / 2605.04652
CHE-TKG: Collaborative Historical Evidence and Evolutionary Dynamics Learning for Temporal Knowledge Graph Reasoning
CHE-TKG:用于时间知识图推理的协同历史证据与演化动态学习
Abstract
Temporal knowledge graph (TKG) reasoning aims to predict future events from historical facts. A key challenge lies in jointly capturing two sources of predictive information in TKGs: historical evidence and evolutionary dynamics. However, existing methods typically focus on only one of these sources, which limits the ability to fully exploit the complementary predictive signals in TKGs. To address this, we propose CHE-TKG, a novel collaborative dual-view learning framework for TKG reasoning. CHE-TKG explicitly separates and jointly models historical evidence and evolutionary dynamics, aiming to learn and exploit their complementary predictive signals. Specifically, CHE-TKG constructs a historical evidence graph to capture long-term structural regularities and stable relational constraints, alongside an evolutionary dynamics graph to model temporal transitions and recent changes, with dedicated encoders for each view. We further employ relation decomposition and a contrastive alignment objective to better capture the predictive signals across the two views. Extensive experiments demonstrate that CHE-TKG achieves state-of-the-art performance on multiple benchmarks.
Chinese Translation
时间知识图(TKG)推理旨在从历史事实中预测未来事件。一个关键挑战在于如何共同捕捉TKG中两种预测信息来源:历史证据与演化动态。然而,现有的方法通常仅关注这两个来源中的一个,这限制了充分利用TKG中互补预测信号的能力。为此,我们提出了CHE-TKG,这是一种新颖的协同双视图学习框架,用于TKG推理。CHE-TKG明确区分并共同建模历史证据与演化动态,旨在学习和利用它们的互补预测信号。具体而言,CHE-TKG构建了一个历史证据图,以捕捉长期结构规律和稳定关系约束,并建立了一个演化动态图,以建模时间转换和近期变化,并为每个视图提供专门的编码器。我们进一步采用关系分解和对比对齐目标,以更好地捕捉两个视图之间的预测信号。大量实验表明,CHE-TKG在多个基准测试中实现了最先进的性能。
cs.CL / 31 / 2605.04665
Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs
转述引发的输出模式崩溃:当大型语言模型在语义等值输入下失去角色
Abstract
When the substantive content of a request is rewritten, do large language models still answer in the format the original task asked for? We find that they often do not, even at temperature zero. On a 150-query evaluation over five compact 2025-era LLMs and four task types, we observe a systematic failure mode we call prompt-variant output-mode collapse: when a closed-form prompt asks for a bare label or a single choice token, content-preserving prompt variants can push the model into conversational prose, the requested format dissolves, and exact-match evaluation pipelines silently misjudge the result. To make this measurable, we release PARACONSIST, a 900-prompt benchmark of 150 base queries with five lexical, syntactic, and semantic-expansion prompt variants each, and a Semantic Consistency Score that decomposes prompt-variant robustness into answer consistency, sentence-BERT semantic similarity, and length stability. Under a whole-word answer-set match, only ~22% of closed-form variant responses preserve the ground-truth label inside their output, while ~78% drift away from the answer space entirely. In our pool, the dominant predictor of collapse is task structure rather than model identity, with model differentiation jointly carried by answer consistency and length stability. Robustness audits should therefore track response-mode preservation as a first-class reliability target alongside answer accuracy.
Chinese Translation
当请求的实质内容被重写时,大型语言模型是否仍然按照原始任务要求的格式进行回答?我们的研究发现它们往往不会,即使在温度参数为零的情况下。在对五个紧凑型2025年时期的大型语言模型和四种任务类型进行的150个查询评估中,我们观察到一种系统性的失败模式,我们称之为提示变体输出模式崩溃:当闭合形式的提示要求一个简单标签或单一选择令牌时,保持内容的提示变体可能会将模型推向对话体,所请求的格式解体,并且完全匹配的评估管道悄然错误地判断结果。为了使这一现象可测量,我们发布了PARACONSIST,一个包含150个基础查询的900个提示基准,每个查询有五个词汇、句法和语义扩展的提示变体,以及一个语义一致性评分,该评分将提示变体的鲁棒性分解为答案一致性、句子-BERT语义相似度和长度稳定性。在一个全词答案集匹配的情况下,只有约22%的闭合形式变体响应在其输出中保留了真实标签,而约78%则完全偏离了答案空间。在我们的研究中,崩溃的主要预测因素是任务结构而非模型身份,模型区分性共同依赖于答案一致性和长度稳定性。因此,鲁棒性审核应当将响应模式的保持作为与答案准确性并列的首要可靠性目标进行跟踪。
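The gap between exact-match scoring and the whole-word match mentioned in the PARACONSIST abstract can be shown with a small sketch. The helper names and the case-insensitive matching are assumptions for illustration; the abstract does not specify the benchmark's actual matching code.

```python
import re


def exact_match(response: str, label: str) -> bool:
    """Strict comparison, as a typical exact-match pipeline would do."""
    return response.strip().lower() == label.strip().lower()


def preserves_label(response: str, label: str) -> bool:
    """True if `label` occurs in `response` as a whole word (assumed rule)."""
    pattern = r"\b" + re.escape(label) + r"\b"
    return re.search(pattern, response, flags=re.IGNORECASE) is not None


bare = "B"
prose = "Great question! After weighing the options, the answer is B."
drifted = "Both options have merit, and it depends on the context."

for response in (bare, prose, drifted):
    print(exact_match(response, "B"), preserves_label(response, "B"))
```

The conversational response still contains the correct choice, yet exact match scores it as wrong, which is the silent misjudgment the abstract describes. The third response illustrates drift out of the answer space: the word boundary check keeps "Both" from counting as the label "B".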
cs.CL / 32 / 2605.04719
Every Step Counts: Step-Level Credit Assignment for Tool-Integrated Text-to-SQL
每一步都重要:工具集成文本到SQL的步骤级信用分配
Abstract
Tool-integrated Text-to-SQL parsing has emerged as a promising paradigm, framing SQL generation as a sequential decision-making process interleaved with tool execution. However, existing reinforcement learning approaches mainly rely on coarse-grained outcome supervision, resulting in a fundamental credit assignment problem: models receive the same reward for any trajectory that yields the correct answer, even when intermediate steps are redundant, inefficient, or erroneous. Consequently, models are encouraged to explore suboptimal reasoning spaces, limiting both efficiency and generalization. To address this problem, we propose FineStep, a novel framework for step-level credit assignment in tool-augmented Text-to-SQL. First, we introduce a reward design with independent process rewards to alleviate the signal sparsity of outcome supervision. Next, we present a step-level credit assignment mechanism to precisely quantify the value of each reasoning step. Finally, we develop a policy optimization method based on step-level advantages for efficient updates. Extensive experiments on BIRD benchmarks show that FineStep achieves state-of-the-art performance and reduces redundant tool interactions, with a 3.25% average EX gain over GRPO at the 4B scale.
Chinese Translation
工具集成的文本到SQL解析已成为一种有前景的范式,将SQL生成框架化为与工具执行交错的顺序决策过程。然而,现有的强化学习方法主要依赖于粗粒度的结果监督,导致根本的信用分配问题:模型对于任何产生正确答案的轨迹获得相同的奖励,即使中间步骤可能是不必要的、低效的或错误的。因此,模型被鼓励探索次优的推理空间,从而限制了效率和泛化能力。为了解决这个问题,我们提出了FineStep,一种在工具增强文本到SQL中进行步骤级信用分配的新框架。首先,我们引入了一种具有独立过程奖励的奖励设计,以减轻结果监督信号的稀疏性。接下来,我们提出了一种步骤级信用分配机制,以精确量化每个推理步骤的价值。最后,我们开发了一种基于步骤级优势的政策优化方法,以实现高效的更新。在BIRD基准上的大量实验表明,FineStep实现了最先进的性能,并减少了冗余的工具交互,在4B规模下相比于GRPO提高了3.25%的平均EX增益。
cs.CL / 33 / 2605.04759
Gyan: An Explainable Neuro-Symbolic Language Model
Gyan:一种可解释的神经符号语言模型
Abstract
Transformer-based pre-trained large language models have become ubiquitous. There is increasing evidence that, even with large-scale pre-training, these models do not capture the complete compositional context, and certainly not the full human-analogous context. Besides, by the very nature of their architecture, these models hallucinate, are difficult to maintain, are not easily interpretable, and require enormous compute resources for training and inference. Here, we describe Gyan, an explainable language model based on a novel non-transformer architecture, without any of these limitations. Gyan achieves SOTA performance on 3 widely cited data sets and superior performance on two proprietary data sets. The novel architecture decouples the language model from knowledge acquisition and representation. The model draws on rhetorical structure theory, semantic role theory, and knowledge-based computational linguistics. Gyan's meaning representation structure captures the complete compositional context and attempts to mimic humans by expanding the context to a 'world model'. AI model adoption critically depends on trust and transparency, especially in mission-critical use cases. Collectively, our results demonstrate that it is possible to create models that are trustworthy and reliable for mission-critical tasks. We believe our work has tremendous potential for guiding the development of transparent and trusted architectures for language models.
Chinese Translation
基于变换器的预训练大型语言模型已变得无处不在。越来越多的证据表明,即使经过大规模的预训练,这些模型也无法完全捕捉组合上下文,更不用说完整的人类类比上下文。此外,由于其体系结构的固有特性,这些模型会出现幻觉,难以维护,解释性差,并且在训练和推理时需要巨大的计算资源。在此,我们描述了Gyan,一种基于新型非变换器架构的可解释语言模型,没有上述局限。Gyan在三种广泛引用的数据集上达到了最佳性能,并在两个专有数据集上表现优越。这种新颖的架构将语言模型与知识获取和表示解耦。该模型借鉴了修辞结构理论、语义角色理论和基于知识的计算语言学。Gyan的意义表示结构捕捉了完整的组合上下文,并试图通过扩展上下文到“世界模型”来模拟人类。人工智能模型的采用在关键任务用例中依赖于信任和透明度。综合来看,我们的结果表明,创建可信任和可靠用于关键任务的模型是可能的。我们相信我们的工作对指导透明和可信的语言模型架构的发展具有巨大潜力。
cs.CL / 34 / 2605.04764
Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations
引导的重要性:提示和查询协议如何在稀疏观测下塑造LLM替代模型
Abstract
Large language models are increasingly used as surrogate models for low-data optimization, but their optimizer-facing prediction and its uncertainty remain poorly understood. We study the surrogate belief elicited from an LLM under sparse observations, showing that it depends strongly on prompt text and query protocol. We introduce an uncertainty-alignment criterion that measures whether model uncertainty tracks residual ambiguity among sample-consistent functions. Across controlled inference tasks and Bayesian optimization studies, we find that structural prompts act as effective priors, POINTWISE and JOINT querying induce different beliefs, and sequential evidence leads to non-monotonic, order-sensitive confidence updates. These effects change downstream acquisition decisions and regret, showing that elicitation protocol is part of the LLM surrogate specification, not a formatting detail.
Chinese Translation
大型语言模型越来越多地被用作低数据优化的替代模型,但其面向优化器的预测及其不确定性仍然不甚明了。我们研究了在稀疏观测下从LLM中引发的替代信念,显示其强烈依赖于提示文本和查询协议。我们引入了一种不确定性对齐准则,该准则衡量模型不确定性是否跟踪样本一致函数之间的残余模糊性。在受控的推理任务和贝叶斯优化研究中,我们发现结构性提示充当有效的先验,逐点(POINTWISE)和联合(JOINT)查询产生不同的信念,而序列证据导致非单调、对顺序敏感的置信度更新。这些影响改变了下游的获取决策和遗憾,表明引导协议是LLM替代模型规范的一部分,而不是一种格式细节。
cs.CL / 35 / 2605.04831
StoryAlign: Evaluating and Training Reward Models for Story Generation
StoryAlign:评估与训练故事生成的奖励模型
Abstract
Story generation aims to automatically produce coherent, structured, and engaging narratives. Although large language models (LLMs) have significantly advanced text generation, stories generated by LLMs still diverge from human-authored works regarding complex narrative structure and human-aligned preferences. A key reason is the absence of effective modeling of human story preferences, which are inherently subjective and under-explored. In this work, we systematically evaluate the modeling of human story preferences and introduce StoryRMB, the first benchmark for assessing reward models on story preferences. StoryRMB contains 1,133 high-quality, human-verified instances, each consisting of a prompt, one chosen story, and three rejected stories. We find existing reward models struggle to select human-preferred stories, with the best model achieving only 66.3% accuracy. To address this limitation, we construct roughly 100,000 high-quality story preference pairs across diverse domains and develop StoryReward, an advanced reward model for story preference trained on this dataset. StoryReward achieves state-of-the-art (SoTA) performance on StoryRMB, outperforming much larger models. We also adopt StoryReward in downstream test-time scaling applications for best-of-n (BoN) story selection and find that it generally chooses stories better aligned with human preferences. We will release our dataset, model, and code to facilitate future research. Related code and data are available at https://github.com/THU-KEG/StoryReward.
Chinese Translation
故事生成旨在自动产生连贯、结构化且引人入胜的叙述。尽管大型语言模型(LLMs)在文本生成方面取得了显著进展,但LLMs生成的故事在复杂叙事结构和与人类偏好的对齐方面仍与人类创作的作品存在差异。一个关键原因是缺乏对人类故事偏好的有效建模,而这些偏好本质上是主观的且尚未得到深入研究。在本研究中,我们系统性地评估了人类故事偏好的建模,并引入了StoryRMB,这是第一个用于评估故事偏好奖励模型的基准。StoryRMB包含1133个高质量、经过人类验证的实例,每个实例包括一个提示、一个选择的故事,以及三个被拒绝的故事。我们发现现有的奖励模型在选择人类偏好的故事时表现不佳,最佳模型的准确率仅达到66.3%。为了应对这一局限性,我们构建了大约100,000对覆盖不同领域的高质量故事偏好对,并开发了StoryReward,这是基于该数据集训练的先进故事偏好奖励模型。StoryReward在StoryRMB上取得了最佳表现(SoTA),超越了更大规模的模型。我们还在下游测试时间扩展应用中采用了StoryReward用于最佳故事选择,并发现其通常选择与人类偏好更对齐的故事。我们将发布我们的数据集、模型和代码以促进未来的研究。相关代码和数据可在https://github.com/THU-KEG/StoryReward获取。
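The two uses of a reward model mentioned in this abstract, scoring benchmark instances (one chosen story against three rejected ones) and best-of-n selection, can be sketched as follows. This is a minimal illustration, not the StoryReward implementation: `toy_reward`, the instance field names, and the rule that the chosen story must outscore every rejected story are assumptions made for the example.

```python
def best_of_n(prompt, candidates, reward_fn):
    """Return the candidate story with the highest reward (best-of-n selection)."""
    return max(candidates, key=lambda story: reward_fn(prompt, story))


def benchmark_accuracy(instances, reward_fn):
    """Fraction of instances where the chosen story outscores every rejected story.

    Assumed instance layout: {"prompt": str, "chosen": str, "rejected": [str, ...]}.
    """
    correct = 0
    for inst in instances:
        chosen_score = reward_fn(inst["prompt"], inst["chosen"])
        rejected_scores = [reward_fn(inst["prompt"], r) for r in inst["rejected"]]
        if all(chosen_score > s for s in rejected_scores):
            correct += 1
    return correct / len(instances)


def toy_reward(prompt, story):
    """Hypothetical stand-in for a learned reward model: counts distinct words."""
    return len(set(story.lower().split()))


toy_instances = [
    {"prompt": "p1", "chosen": "a b c d", "rejected": ["a a", "a b", "a b c"]},
    {"prompt": "p2", "chosen": "a a a", "rejected": ["a b", "a", "b"]},
]
picked = best_of_n("p1", ["a a a", "a b c", "a b"], toy_reward)
accuracy = benchmark_accuracy(toy_instances, toy_reward)
```

With a real reward model, `reward_fn` would wrap a model call; the selection and accuracy logic would stay the same under these assumptions.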
cs.CL / 36 / 2605.04857
Assessing Cognitive Effort in L2 Idiomatic Processing: An Eye-Tracking Dataset
评估二语习语处理中的认知努力:一项眼动追踪数据集
Abstract
This paper presents the development and validation of an eye-tracking dataset designed to investigate how second-language (L2) learners process idiomatic expressions. While native speakers often rely on direct retrieval of figurative meanings, L2 speakers frequently adopt a literal-first approach, which incurs measurable cognitive costs. This resource captures these costs through ocular metrics recorded from Portuguese L1 speakers of English across all CEFR proficiency levels (A1-C2). Although the study uses entry-level 60 Hz hardware (Tobii Pro Spark), we demonstrate that this sampling rate provides sufficient data density to detect macro-cognitive events such as fixations and regressions in reading. Preliminary analysis validates the dataset by revealing a strong inverse correlation between language proficiency and regressive eye movements. Integrated into the MIA (Modeling Idiomaticity in Human and Artificial Language Processing) initiative, this dataset serves as a cognitively grounded benchmark for evaluating both human processing models and the alignment of large language models with human-like figurative understanding.
Chinese Translation
本文介绍了一种眼动追踪数据集的开发与验证,旨在研究二语(L2)学习者如何处理习语表达。尽管母语者通常依赖于对比喻意义的直接提取,但二语学习者往往采取字面优先的方法,从而产生可测量的认知成本。本资源通过记录来自所有CEFR熟练程度(A1-C2)葡萄牙语母语英语学习者的眼动指标捕捉这些成本。尽管本研究使用了入门级的60 Hz硬件(Tobii Pro Spark),我们证明这种采样频率提供了足够的数据密度,以检测阅读中的宏观认知事件,如注视和回读。初步分析通过揭示语言熟练度与回归眼动之间的强负相关性,验证了数据集的有效性。该数据集整合于MIA(人类与人工语言处理中的习语模型化)计划中,作为评估人类处理模型及大语言模型与人类比喻理解一致性的认知基准。
cs.CL / 37 / 2605.04873
Measuring Psychological States Through Semantic Projection: A Theory-Driven Approach to Language-Based Assessment
通过语义投影测量心理状态:一种基于理论的语言评估方法
Abstract
Recent advances in natural language processing have enabled increasingly accurate estimation of psychological traits from language. However, most existing approaches rely on supervised models trained to predict questionnaire scores, limiting interpretability and generalizability across contexts. The present study introduces a theory-driven and fully unsupervised framework for measuring psychological states directly from natural language using semantic projection. Psychological constructs were operationalized as interpretable semantic axes derived from lexical anchors and items from validated clinical scales assessing depression, anxiety, and worry. Participants' textual responses were embedded using Sentence-BERT and projected onto these axes to generate continuous psychological scores across multiple response formats, including selected words, generated words, phrases, and free-text responses. Projection scores were evaluated through correlations with standardized clinical measures, split-half reliability analyses, attenuation corrections, distributional similarity using Wasserstein distance, and comparisons with lexicon-based sentiment analysis (VADER). Results showed strong associations between projection scores and clinical measures, particularly for structured formats such as selected words, written words, and phrases. Free-text responses produced weaker results when analyzed as whole texts, but performance improved substantially when sentence-level aggregation strategies were applied. These findings support semantic projection as an interpretable and scalable alternative to supervised language models for psychological assessment and highlight the importance of response format and text-processing strategies in language-based mental health measurement.
Chinese Translation
近年来,自然语言处理的进步使得从语言中准确估计心理特质成为可能。然而,大多数现有方法依赖于监督模型,这些模型经过训练以预测问卷得分,从而限制了其在不同情境下的可解释性和通用性。本研究提出了一种基于理论驱动的完全无监督框架,通过语义投影直接从自然语言中测量心理状态。心理构念被定义为可解释的语义轴,这些轴来源于词汇锚点和经过验证的临床量表(评估抑郁、焦虑和担忧)中的项目。参与者的文本回应通过 Sentence-BERT 嵌入,并投影到这些轴上,以生成多个响应格式(包括选择的单词、生成的单词、短语和自由文本回应)下的连续心理得分。通过与标准化临床测量的相关性、分半信度分析、减少偏差的修正、使用 Wasserstein 距离的分布相似性,以及与基于词典的情感分析(VADER)的比较来评估投影得分。结果显示,投影得分与临床测量之间存在强关联,特别是在结构化格式(如选择的单词、书写的单词和短语)中。自由文本回应在整体文本分析时表现较弱,但在应用句子级聚合策略时性能显著提升。这些发现支持语义投影作为一种可解释和可扩展的心理评估替代方案,强调了响应格式和文本处理策略在基于语言的心理健康测量中的重要性。
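The projection step described in this abstract can be illustrated with a small sketch. It assumes that an axis is the normalized difference between the mean embedding of high-pole anchors and the mean embedding of low-pole anchors, and that a response score is the cosine between the response embedding and that axis. The abstract does not spell out these details, so they are assumptions, and the 2-dimensional vectors below are toy stand-ins for Sentence-BERT embeddings.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def unit(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_axis(high_anchors, low_anchors):
    """Semantic axis pointing from the low-pole anchors to the high-pole anchors."""
    high = mean_vector(high_anchors)
    low = mean_vector(low_anchors)
    return unit([h - l for h, l in zip(high, low)])


def project(embedding, axis):
    """Cosine between a response embedding and the (unit-length) axis."""
    return sum(e * a for e, a in zip(unit(embedding), axis))


# Toy 2-D "embeddings" (hypothetical); real ones would come from a sentence encoder.
HIGH_ANCHORS = [[1.0, 0.0], [0.8, 0.2]]
LOW_ANCHORS = [[-1.0, 0.0], [-0.8, -0.2]]

axis = build_axis(HIGH_ANCHORS, LOW_ANCHORS)
score_high = project([0.9, 0.3], axis)   # response close to the high pole
score_low = project([-0.7, 0.1], axis)   # response close to the low pole
```

For free-text responses, the sentence-level aggregation mentioned above could, under the same assumptions, be a mean of `project` over per-sentence embeddings.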
cs.CL / 38 / 2605.04875
Anticipating Innovation Using Large Language Models
利用大型语言模型预测创新
Abstract
Forecasting innovation, intended as the emergence of new technological combinations, is a fundamental challenge for science and policy. We show that forthcoming combinations leave an early trace in the collective language of patents, with predictive signals detectable even decades in advance. We show that this signal is not attributable to any single inventor, but emerges as a collective shift in how technologies are described across thousands of patents. To this end, we introduce TechToken, a transformer-based model that treats technologies, classified by International Patent Classification codes, as words in its vocabulary, learning the language of technologies by embedding these codes during fine-tuning. We define context similarity between code embeddings as a measure of linguistic convergence and show that it accurately predicts first technological combinations. TechToken also improves general representation quality, outperforming state-of-the-art models across different patent-related tasks.
Chinese Translation
预测创新,即新技术组合的出现,是科学和政策面临的一项根本挑战。我们的研究表明,未来的技术组合在专利的集体语言中留下早期痕迹,甚至可以提前几十年识别出预测信号。我们发现,这一信号并非归因于任何单一的发明者,而是在成千上万个专利中,技术描述方式的集体转变所产生的。为此,我们引入了TechToken,这是一种基于变换器(transformer)的模型,将按照国际专利分类(International Patent Classification)代码分类的技术视为其词汇表中的单词,通过在微调阶段嵌入这些代码来学习技术的语言。我们将代码嵌入之间的上下文相似性定义为语言趋同的度量,并展示其能够准确预测首批技术组合。TechToken 还提高了通用表征质量,超越了在不同专利相关任务中的最新模型。
cs.CL / 39 / 2605.04885
A Comparative Study of PyCaret AutoML and CNN-BiLSTM for Binary Hate Speech Detection in Indonesian Twitter
PyCaret自动机器学习与CNN-BiLSTM在印尼Twitter上进行二元仇恨言论检测的比较研究
Abstract
This paper compares a PyCaret AutoML branch and a CNN-BiLSTM branch for binary hate speech detection on Indonesian Twitter using the HS label from the corpus of Ibrohim and Budi. Both branches share the same preprocessing pipeline so that the comparison reflects modelling differences rather than inconsistent data preparation. The conventional branch uses TF-IDF with a lexicon-based abusive-word count, whereas the neural branch learns dense token representations and captures both local phrase patterns and bidirectional context. The benchmark is built from the released 13,130-row annotation table, whose HS label yields a 58:42 class ratio. On the held-out split, CNN-BiLSTM achieves the best result with 83.8% accuracy, 79.8% precision, 82.7% recall, and 81.2% F1-score. Within the PyCaret branch, Random Forest is the strongest conventional model with 77.2% accuracy and 77.0% F1-score. The neural branch therefore improves accuracy by 6.6 points and F1-score by 4.2 points. Exploratory corpus analysis, learning curves, and confusion matrices show that the dataset is short-text, moderately imbalanced, and still difficult because many decisions depend on local lexical cues plus short contextual composition. The study concludes that PyCaret AutoML is an effective conventional benchmarking framework, whereas CNN-BiLSTM is the stronger end model for the reported benchmark setting.
Chinese Translation
本论文比较了PyCaret自动机器学习分支和CNN-BiLSTM分支在印尼Twitter上使用Ibrohim和Budi的语料库中的HS标签进行二元仇恨言论检测的效果。这两个分支共享相同的预处理流程,因此比较反映了建模差异,而不是不一致的数据准备。传统分支使用TF-IDF结合基于词典的恶性词汇计数,而神经网络分支则学习稠密的词汇表示,捕捉局部短语模式和双向上下文。基准测试基于发布的13,130行注释表,该表的HS标签呈现58:42的类别比例。在保留的拆分数据上,CNN-BiLSTM达到最佳结果,准确率为83.8%,精确率为79.8%,召回率为82.7%,F1分数为81.2%。在PyCaret分支中,随机森林是表现最强的传统模型,准确率为77.2%,F1分数为77.0%。因此,神经网络分支提高了6.6个百分点的准确率和4.2个百分点的F1分数。探索性语料库分析、学习曲线和混淆矩阵显示该数据集为短文本,适度不平衡且仍然困难,因为许多决策依赖于局部词汇线索和短上下文组合。研究结论表明,PyCaret自动机器学习是一个有效的传统基准框架,而CNN-BiLSTM是对报告基准设置的更强终端模型。
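The figures reported in this abstract can be checked with a few lines of arithmetic. The sketch assumes that the reported precision and recall refer to a single class, so that F1 is their harmonic mean; if the paper reports macro-averaged values, the identity need not hold exactly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# CNN-BiLSTM precision and recall as reported in the abstract.
cnn_f1 = f1_score(0.798, 0.827)

# Gains of CNN-BiLSTM over Random Forest, in percentage points.
accuracy_gain = 83.8 - 77.2
f1_gain = 81.2 - 77.0

# Accuracy of always predicting the larger class under a 58:42 ratio.
majority_baseline = 58 / (58 + 42)

print(f"F1 from P/R: {cnn_f1:.3f}")
print(f"accuracy gain: {accuracy_gain:.1f} points, F1 gain: {f1_gain:.1f} points")
print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
```

Under that assumption the recomputed F1 agrees with the reported 81.2%, and both models sit well above the majority-class baseline.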
cs.CL / 40 / 2605.04886
BenCSSmark: Making the Social Sciences Count in LLM Research
BenCSSmark: 让社会科学在大语言模型研究中发挥作用
Abstract
This position paper argues that the under-representation of social science tasks in contemporary LLM benchmarks limits advances in both LLM evaluation and social scientific inquiry. Benchmarks -- standardized tools for assessing computational systems -- are pivotal in the development of artificial intelligence (AI), including large language models (LLMs). Benchmarks do more than measure progress -- they actively structure it, shaping reputations, research agendas, and commercial outcomes. Despite this central role, the social sciences are largely absent from mainstream evaluation frameworks, even though scholars in these fields generate dozens of rigorously annotated, context-sensitive datasets each year. Integrating this work into benchmark design could significantly improve the generalization and robustness of AI models. In turn, models trained on social scientific tasks would likely yield better performance on classic and contemporary tasks in disciplines as diverse as history, sociology, political science or economics. This is all the more pressing as these disciplines are quickly turning to LLMs for assistance. To address this gap, we introduce BenCSSmark, a benchmark composed of datasets annotated by computational social scientists. By integrating social scientific perspectives into benchmarking, BenCSSmark seeks to promote more robust, transparent, and socially relevant AI systems and to foster efficient collaboration.
Chinese Translation
本文立论指出,社会科学任务在当代大语言模型基准测试中的代表性不足,限制了大语言模型评估和社会科学研究的进展。基准测试——评估计算系统的标准化工具——在人工智能(AI)及其大型语言模型(LLMs)的发展中具有关键作用。基准测试不仅仅是衡量进展的工具,它们还积极地影响进展,塑造声誉、研究议程和商业成果。尽管扮演着如此重要的角色,社会科学在主流评估框架中却几乎缺席,尽管该领域的学者每年生成数十个经过严谨注解、具有上下文敏感性的数据集。将这些工作整合到基准测试设计中可以显著提高AI模型的泛化能力和稳健性。反过来,在社会科学任务上训练的模型也可能在历史、社会学、政治科学和经济学等多个学科的经典与当代任务中产生更好的表现。随着这些学科迅速转向大语言模型寻求帮助,这一问题显得尤为迫切。为了解决这一缺口,我们引入了BenCSSmark,一个由计算社会科学家注释的数据集构成的基准测试。通过将社会科学视角融入基准测试,BenCSSmark旨在促进更稳健、透明且社会相关的AI系统,并推动高效的合作。
cs.CL / 41 / 2605.04887
Sentiment Analysis and Customer Satisfaction Prediction on E-Commerce Platforms Based on YouTube Comments Using the XGBoost Algorithm
基于YouTube评论的情感分析与电子商务平台客户满意度预测——XGBoost算法的应用
Abstract
The exponential expansion of digital commerce in Indonesia has significantly shifted consumer interactions toward video-centric social networks, particularly YouTube. Consequently, the sheer volume of unstructured, multi-contextual comments poses a tremendous challenge for manual sentiment tracking. This study investigates and constructs a predictive model for customer satisfaction leveraging the Extreme Gradient Boosting (XGBoost) architecture coupled with Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. By utilizing a secondary dataset of YouTube comments retrieved from e-commerce review videos, the raw text underwent rigorous preprocessing to generate normalized numerical features. The experimental results demonstrate that the PyCaret-optimized machine learning framework delivers superior classification resilience. Beyond standard performance metrics, lexical evaluations and feature-importance mapping uncover a notable phenomenon: e-commerce discourse is heavily infiltrated by socio-political terminologies, which ultimately influence the polarity of audience satisfaction.
Chinese Translation
印度尼西亚数字商业的指数级扩展显著改变了消费者与视频中心社交网络,尤其是YouTube之间的互动。因此,大量非结构化、多语境的评论给手动情感追踪带来了巨大的挑战。本研究调查并构建了一个利用极端梯度提升(XGBoost)架构和词频-逆文档频率(TF-IDF)向量化的客户满意度预测模型。通过利用从电子商务评价视频中提取的YouTube评论的二次数据集,原始文本经过严格的预处理以生成标准化的数值特征。实验结果表明,经过PyCaret优化的机器学习框架提供了卓越的分类韧性。超越标准性能指标,词汇评估和特征重要性映射揭示了一个显著现象:电子商务话语中严重渗透着社会政治术语,从而最终影响了受众满意度的极性。
cs.CL / 42 / 2605.04888
A Comparative Analysis of Machine Learning and Deep Learning Models for Tweet Sentiment Classification: A Case Study on the Sentiment140 Dataset
机器学习与深度学习模型在推文情感分类中的比较分析:基于Sentiment140数据集的案例研究
Abstract
The exponential growth of social media has created an urgent need for automated systems to analyze unstructured public sentiment in real time. This study compares a traditional Logistic Regression model using TF-IDF features with a deep learning Bidirectional Long Short-Term Memory (BiLSTM) architecture on a 10,000-tweet subset of the Sentiment140 dataset. Experimental results show that Logistic Regression outperformed BiLSTM, achieving an accuracy of 73.5% compared with 69.17%, while the deep learning model exhibited mild overfitting. These findings suggest that for medium-scale informal text data, classical machine learning with robust feature extraction can outperform more complex deep learning approaches. Finally, the trained models were integrated into an interactive web application using Streamlit and deployed on Hugging Face Spaces for public access.
Chinese Translation
社交媒体的快速增长促使迫切需要自动化系统来实时分析非结构化的公众情感。本研究比较了使用TF-IDF特征的传统逻辑回归模型与基于双向长短期记忆(Bidirectional Long Short-Term Memory, BiLSTM)架构的深度学习模型,数据集选取了Sentiment140的10,000条推文子集。实验结果显示,逻辑回归模型的表现优于BiLSTM,准确率为73.5%,而BiLSTM的准确率为69.17%,同时深度学习模型表现出轻微的过拟合。这些发现表明,对于中等规模的非正式文本数据,经典机器学习结合稳健的特征提取能够胜过更复杂的深度学习方法。最后,训练好的模型被集成到一个交互式网络应用中,使用Streamlit构建,并部署在Hugging Face Spaces以供公众访问。
cs.CL / 43 / 2605.04897
Storage Is Not Memory: A Retrieval-Centered Architecture for Agent Recall
存储不是记忆:一种以检索为中心的智能体回忆架构
Abstract
Extraction at ingestion is the wrong primitive for agent memory: content discarded before the query is known cannot be recovered at retrieval time. We propose True Memory, a six-layer architecture that shifts the center of the system from a storage schema to a multi-stage retrieval pipeline operating over events preserved verbatim. The full system runs as a single SQLite file on commodity CPU with no external database, vector index, graph store, or GPU. On LoCoMo (1,540 questions across 10 multi-session conversations), True Memory Pro reaches 93.0% accuracy (3-run mean) against 61.4% for Mem0, 65.4% for Supermemory, approximately 71% for Zep, and 94.5% for EverMemOS under a matched gpt-4.1-mini answer model. On LongMemEval (500 questions), True Memory Pro reaches 87.8% (3-run mean). On BEAM-1M (700 questions at the 1-million-token scale), True Memory Pro reaches 76.6% (3-run mean), above the prior published result of 73.9% for Hindsight. A 56-configuration ablation shows a 1.3-percentage-point spread within the top-performing configuration family.
Chinese Translation
在信息摄取时进行的提取是智能体记忆的错误基础:在查询确定之前被丢弃的内容无法在检索时恢复。我们提出了True Memory,一个六层架构,将系统的核心从存储模式转移到一个针对原始事件的多阶段检索管道上。整个系统在普通CPU上运行为单个SQLite文件,无需外部数据库、向量索引、图存储或GPU。在LoCoMo(涵盖10个多会话对话的1,540个问题)上,True Memory Pro达到了93.0%的准确率(3次运行平均),而Mem0为61.4%,Supermemory为65.4%,Zep约为71%,EverMemOS在匹配的gpt-4.1-mini答案模型下为94.5%。在LongMemEval(500个问题)上,True Memory Pro达到了87.8%(3次运行平均)。在BEAM-1M(以百万标记规模进行的700个问题)上,True Memory Pro达到了76.6%(3次运行平均),超过了Hindsight先前发布的73.9%的结果。56种配置的消融实验显示,在表现最佳的配置系列中存在1.3个百分点的差异。
cs.CL / 44 / 2605.04913
Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training
重新思考局部学习:一种更便宜、更快速的LLM后训练方法
Abstract
LLM post-training typically propagates task gradients through the full depth of the model. Although this end-to-end structure is simple and general, it couples task adaptation to full-depth activation storage, long-range backward dependencies and direct task-gradient access to pretrained representations. We argue that this full-depth backward coupling can be unnecessarily expensive and intrusive, particularly when post-training supervision is much narrower than pre-training. To this end, we propose LoPT (Local-Learning Post-Training), a simple post-training strategy that makes gradient reach an explicit design choice. LoPT places a single gradient boundary at the transformer midpoint: the second-half block learns from the task objective, while the first-half block is updated by a lightweight feature-reconstruction objective to preserve useful representations and maintain interface compatibility. LoPT shortens the task-induced backward path while limiting direct interference from narrow task gradients on early-layer representations. Extensive experiments demonstrate that LoPT achieves competitive performance with lower memory cost, higher training efficiency and better retention of pretrained capabilities. Our code is available at: https://github.com/HumyuShi/LoPT
Chinese Translation
LLM后训练通常是通过模型的完整深度传播任务梯度。虽然这种端到端结构简单且通用,但它将任务适应与完整深度的激活存储、长距离的反向依赖关系以及对预训练表示的直接任务梯度访问耦合在一起。我们认为,这种全深度的反向耦合在后训练监督远比预训练窄时可能显得不必要地昂贵和干扰。为此,我们提出了LoPT:局部学习后训练,一种简单的后训练策略,明确设定梯度传播的设计选择。LoPT在变换器的中点设置了单一的梯度边界:后半部分的模块从任务目标中学习,而前半部分的模块则通过轻量级特征重构目标进行更新,以保留有用的表示并保持接口兼容性。LoPT缩短了任务引起的反向传播路径,同时限制了窄任务梯度对早期层表示的直接干扰。大量实验表明,LoPT在内存开销更低、训练效率更高以及保留预训练能力方面表现出竞争力。我们的代码可在以下地址获取:https://github.com/HumyuShi/LoPT
cs.CL / 45 / 2605.04926
Unintended Negative Impacts of Promotional Language in Patent Evaluation
专利评估中宣传语言的意外负面影响
Abstract
Promotional language has been increasingly used to aid the communication of innovative ideas in science. Yet, less is known about its role in the context of technological innovation. Here, we use a validated and domain-diagnosed lexicon of 135 promotional words to study the association between promotional language and patent evaluation outcomes among 2.7 million USPTO patent applications. Our large-scale study reveals three unexpected findings. First, in contrast to scientific evaluation, we find that a higher frequency of promotional words is negatively associated with the probability of an application being (i) granted a patent, (ii) transferred ownership, and (iii) successfully appealed. This promotional penalty holds even after accounting for a range of confounding factors and is largely robust across different technological areas. Among matched samples, the difference in the success rate between the lowest and highest promotional density quintile is 5.5, 5.9, and 5.3 percentage points for patentability, transferability, and rejection reversal, respectively. Second, contrary to institutional skepticism, we show that promotional language is not a mask of weak technology, but objectively reflects the degree of combinatorial novelty and future citation impact. Third, digging into the mechanisms, we find that the tolerance to promotional framing is strongly moderated by human factors, with men and experienced examiners showing a higher acceptance of promotional narratives than women and novice examiners. By revealing an emerging paradox in the patent system, our study offers theoretical and practical implications for improving patent evaluation through more objective scrutiny of linguistic patterns in patent filings.
Chinese Translation
宣传语言在科学中被越来越多地用于促进创新思想的交流。然而,关于其在技术创新背景下的作用知之甚少。在这项研究中,我们使用一个经过验证且领域诊断的包含135个宣传词的词汇表,研究宣传语言与270万个美国专利商标局(USPTO)专利申请的评估结果之间的关联。我们的大规模研究揭示了三个意外发现。首先,与科学评估相反,我们发现宣传词的频率越高,申请被(i) 授予专利、(ii) 转让所有权和(iii) 成功上诉的概率与之呈负相关。这种宣传惩罚在考虑一系列混杂因素后依然成立,并且在不同技术领域中大体上具有稳健性。在匹配样本中,最低和最高宣传密度五分位之间的成功率差异分别为5.5、5.9和5.3个百分点,涉及专利性、可转让性和拒绝复审。第二,尽管存在机构怀疑论,我们展示了宣传语言并不是技术弱点的掩饰,而是客观反映了组合新颖性和未来引用影响的程度。第三,在探讨机制时,我们发现对宣传措辞的容忍度受到人类因素的强烈调节,男性和经验丰富的审查员对宣传叙述的接受度高于女性和新手审查员。通过揭示专利系统中的新兴悖论,我们的研究为通过更客观地审视专利申请中的语言模式来改善专利评估提供了理论和实践上的启示。
cs.CL / 46 / 2605.04941
UFAL-CUNI at SemEval-2026 Task 11: An Efficient Modular Neuro-symbolic Method for Syllogistic Reasoning
UFAL-CUNI在SemEval-2026第11任务中的表现:一种高效的模块化神经符号方法用于三段论推理
Abstract
This paper describes our system submitted to SemEval-2026 Task 11: Disentangling Content and Formal Reasoning in Large Language Models. We present an efficient modular neuro-symbolic approach, combining a symbolic prover with small reasoning LLMs (4B parameters). The system consists of an LLM-based parser that translates natural language syllogisms to a first-order logic (FOL) representation, an automated theorem prover, and two optional modules: machine translation for multilingual inputs and a symbolic retrieval component for the identification of relevant premises. The system achieves competitive accuracy and relatively low content effect on most subtasks. Our ablations show that this approach outperforms LLM-based zero-shot baselines in this parameter size range, but also reveal limited multilingual capabilities of small LLMs. Finally, we include a discussion of the task's main ranking metric and analyze its limitations.
Chinese Translation
本文描述了我们提交给SemEval-2026第11任务的系统:在大型语言模型中解开内容与形式推理的关系。我们提出了一种高效的模块化神经符号方法,将符号证明者与小型推理语言模型(4B参数)相结合。该系统由一个基于语言模型的解析器组成,能够将自然语言三段论转换为一阶逻辑(FOL)表示,包含一个自动定理证明器,以及两个可选模块:用于多语言输入的机器翻译和用于识别相关前提的符号检索组件。该系统在大多数子任务上实现了具有竞争力的准确性和相对较低的内容影响。我们的消融实验表明,这种方法在这一参数范围内超越了基于语言模型的零-shot基准,但也揭示了小型语言模型的多语言能力有限。最后,我们讨论了任务的主要排名指标并分析了其局限性。
cs.CL / 47 / 2605.04948
Adapting Large Language Models to a Low-Resource Agglutinative Language: A Comparative Study of LoRA and QLoRA for Bashkir
将大型语言模型适应于低资源黏合语:LoRA和QLoRA在巴什基尔语上的比较研究
Abstract
This paper presents a comparative study of parameter-efficient fine-tuning (PEFT) methods, including LoRA and QLoRA, applied to the task of adapting large language models to the Bashkir language, a low-resource agglutinative language of the Turkic family. Experimental evaluation is conducted on a Bashkir text corpus of 71k documents (46.9M tokens) using models of various architectures: DistilGPT2, GPT-2 (base, medium), Phi-2, Qwen2.5-7B, DeepSeek-7B, and Mistral-7B. To improve the reliability of results, each configuration was trained with three different random seeds. The lowest perplexity on the test set was obtained for GPT-2 medium with full fine-tuning (3.34). Meanwhile, QLoRA applied to Mistral-7B (3.79) and Phi-2 (3.81) achieved comparable quality with over 40 times fewer trainable parameters. However, we also observed cases of significant quality degradation when using PEFT for certain architectures (e.g., DeepSeek-7B with rank 8, perplexity = 129.55), indicating that the outcome depends critically on the choice of the base model and its tokenizer. Additionally, a qualitative analysis of generated texts based on Bashkir prompts revealed that models with the best perplexity do not necessarily produce the most coherent outputs: QLoRA-tuned models generated monolingual Bashkir continuations, whereas the fully fine-tuned model with the lowest perplexity frequently switched to English. The results suggest that QLoRA on 7B-scale models offers an effective compromise between quality and computational cost for Bashkir. To ensure reproducibility, open data, code, and trained adapters will be released upon acceptance.
Chinese Translation
本文呈现了一个参数高效微调(PEFT)方法的比较研究,包括LoRA和QLoRA,应用于将大型语言模型适应于巴什基尔语这一突厥语族的低资源黏合语言。实验评估在一个包含71,000个文档(46.9M标记)的巴什基尔文本语料库上进行,使用了多种架构的模型:DistilGPT2、GPT-2(基础版、中型)、Phi-2、Qwen2.5-7B、DeepSeek-7B和Mistral-7B。为了提高结果的可靠性,每种配置使用了三个不同的随机种子进行训练。测试集上得到的最低困惑度是经过完全微调的GPT-2中型(3.34)。与此同时,应用于Mistral-7B(3.79)和Phi-2(3.81)的QLoRA在训练参数减少超过40倍的情况下达到了可比的质量。然而,我们也观察到在某些架构中使用PEFT时存在显著的质量下降情况(例如,秩为8的DeepSeek-7B,困惑度=129.55),这表明结果对基础模型及其分词器的选择具有关键影响。此外,基于巴什基尔提示生成文本的定性分析显示,具有最佳困惑度的模型不一定生成最连贯的输出:QLoRA微调模型生成了单语巴什基尔文本,而具有最低困惑度的完全微调模型则频繁切换到英语。结果表明,7B规模模型上的QLoRA在质量与计算成本之间提供了一种有效的折衷。为了确保可重复性,开放数据、代码和训练好的适配器将在接受后发布。
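Two quantities in this abstract lend themselves to a short sketch: perplexity, taken here as the exponential of the mean negative log-likelihood per token, and the number of trainable parameters a LoRA adapter of rank r adds to one weight matrix, r × (d_in + d_out). The function names and the 4096 × 4096 layer size are illustrative assumptions, not figures from the paper.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# A model that assigns probability 0.25 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 10)

# Hypothetical 4096 x 4096 projection adapted with rank 8.
full_params = 4096 * 4096
adapter_params = lora_trainable_params(4096, 4096, 8)
reduction = full_params / adapter_params
```

Because perplexity is computed per token, values from models with different tokenizers are not directly comparable token for token, which fits the abstract's remark that outcomes depend on the base model and its tokenizer.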
cs.CL / 48 / 2605.04962
TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding
TabEmbed:表格理解的一般化嵌入的基准测试与学习
Abstract
Foundation models have established unified representations for natural language processing, yet this paradigm remains largely unexplored for tabular data. Existing methods face fundamental limitations: LLM-based approaches lack retrieval-compatible vector outputs, whereas text embedding models often fail to capture tabular structure and numerical semantics. To bridge this gap, we first introduce the Tabular Embedding Benchmark (TabBench), a comprehensive suite designed to evaluate the tabular understanding capability of embedding models. We then propose TabEmbed, the first generalist embedding model that unifies tabular classification and retrieval within a shared embedding space. By reformulating diverse tabular tasks as semantic matching problems, TabEmbed leverages large-scale contrastive learning with positive-aware hard negative mining to discern fine-grained structural and numerical nuances. Experimental results on TabBench demonstrate that TabEmbed significantly outperforms state-of-the-art text embedding models, establishing a new baseline for universal tabular representation learning. Code and datasets are publicly available at https://github.com/qiangminjie27/TabEmbed and https://huggingface.co/datasets/qiangminjie27/TabBench.
Chinese Translation
基础模型已在自然语言处理领域建立了统一的表征,但这一范式在表格数据方面尚未得到广泛探索。现有方法面临根本性限制:基于大规模语言模型(LLM)的方法缺乏与检索兼容的向量输出,而文本嵌入模型往往无法捕捉表格结构和数值语义。为填补这一空白,我们首先介绍了表格嵌入基准(TabBench),这是一个全面的套件,旨在评估嵌入模型的表格理解能力。随后,我们提出了TabEmbed,这是第一个将表格分类和检索统一于共享嵌入空间的一般化嵌入模型。通过将多样的表格任务重新表述为语义匹配问题,TabEmbed利用大规模对比学习和具正向意识的困难负样本挖掘,区分细微的结构和数值差异。在TabBench上的实验结果表明,TabEmbed显著优于最先进的文本嵌入模型,为通用表格表征学习建立了新的基准。代码和数据集可在https://github.com/qiangminjie27/TabEmbed 和 https://huggingface.co/datasets/qiangminjie27/TabBench上公开获取。
cs.CL / 49 / 2605.04972
Why Expert Alignment Is Hard: Evidence from Subjective Evaluation
为何专家对齐困难:来自主观评估的证据
Abstract
Aligning large language models with expert judgment is especially difficult in subjective evaluation tasks, where experts may disagree, rely on tacit criteria, and change their judgments over time. In this paper, we study expert alignment as a way to understand this difficulty. Using expert evaluations and follow-up questionnaires, we examine how different forms of expert information affect alignment and what this reveals about subjective judgment. Our findings show four consistent patterns. First, alignment difficulty varies substantially across experts, suggesting that expert evaluation styles differ widely in their distance from a model's prior behavior. Second, explicit criteria and reasoning do not always improve alignment, indicating that expert judgment is not fully captured by verbalized rules. Third, editing is sensitive to both the number and the identity of examples, with small numbers of edits providing useful but unstable gains. Fourth, alignment difficulty differs across evaluation dimensions: dimensions grounded more directly in proposal content are easier to align, while dimensions requiring external knowledge or value-based judgment remain harder. Taken together, these results suggest that expert alignment is difficult not only because of model limitations, but also because subjective evaluation is inherently heterogeneous, partly tacit, dimension-dependent, and temporally unstable.
Chinese Translation
在主观评估任务中,将大型语言模型与专家判断对齐特别困难,因为专家之间可能存在分歧,依赖于隐性标准,并且判断会随着时间而变化。本文研究了专家对齐作为理解这种困难的一种方式。通过专家评估和后续问卷,我们考察了不同形式的专家信息如何影响对齐,以及这揭示了什么关于主观判断的信息。我们的研究结果显示了四个一致的模式。首先,专家之间的对齐困难差异显著,表明专家评估风格在与模型先前行为的距离上存在广泛差异。第二,明确的标准和推理并不总是改善对齐,表明专家判断并不完全被口头化的规则捕捉。第三,编辑对实例的数量和身份敏感,少量编辑能够提供有用但不稳定的收益。第四,对齐困难在评估维度之间存在差异:与提案内容直接相关的维度更易对齐,而需要外部知识或价值判断的维度则更加困难。综上所述,这些结果表明,专家对齐的困难不仅源于模型的局限性,还因为主观评估本质上是异质的、部分隐性、依赖维度且时刻不稳定。
cs.CL / 50 / 2605.05003
Misaligned by Reward: Socially Undesirable Preferences in LLMs
因奖励而失衡:大型语言模型中的社会不良偏好
Abstract
Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations focus primarily on broad instruction-following benchmarks, providing limited insight into whether these models capture socially desirable preferences. As a result, important failures in social alignment can remain hidden. We extend reward-model benchmarking to four socially consequential domains: bias, safety, morality, and ethical reasoning. We introduce a framework that converts social evaluation datasets into pairwise preference data, leveraging gold labels where available and directional bias indicators otherwise. This enables us to test whether reward models prefer socially undesirable responses, and whether their preferences produce systematically biased distributions over selected outputs. Across five publicly available reward models and two instruction-tuned models used as reward proxies, we find substantial variation across domains, with no single model performing best overall. The models fall well short of strong social intelligence: they often prefer socially undesirable options, and their preferences produce systematically biased distributions. Moreover, stronger bias avoidance can reduce sensitivity to context, revealing a key alignment trade-off between avoiding biased outcomes and preserving contextual faithfulness. These findings show that standard reward benchmarks are insufficient for assessing social alignment and highlight the need for evaluations that directly measure the social preferences encoded in reward models.
Chinese Translation
奖励模型是大型语言模型对齐的关键组成部分,作为训练过程中人类偏好的代理。然而,现有的评估主要集中于广泛的指令遵循基准,有限地对这些模型是否捕获社会可取偏好进行洞察。因此,社会对齐中的重要失败可能会被隐藏。我们将奖励模型基准扩展到四个社会重要领域:偏见、安全性、道德和伦理推理。我们引入了一种框架,将社会评估数据集转换为成对偏好数据,利用可用的黄金标签并在其他情况下利用方向性偏见指示符。这使我们能够测试奖励模型是否偏好社会不良响应,以及它们的偏好是否在选定输出中产生系统性偏见分布。在五个公开可用的奖励模型和两个作为奖励代理的指令调优模型中,我们发现不同领域之间存在显著差异,且没有单一模型在总体上表现最佳。这些模型远未达到强大的社会智能:它们通常倾向于社会不良选项,其偏好产生系统性偏见分布。此外,更强的偏见规避可能降低对上下文的敏感性,揭示了避免偏见结果与保持上下文准确性之间的关键对齐权衡。这些发现表明,标准的奖励基准不足以评估社会对齐,并强调了直接测量奖励模型中编码的社会偏好的评估的必要性。
cs.CL / 51 / 2605.05025
Detecting Hallucinations in Large Language Models via Internal Attention Divergence Signals
通过内部注意力差异信号检测大型语言模型中的幻觉
Abstract
We propose a lightweight and single-pass uncertainty quantification method for detecting hallucinations in Large Language Models. The method uses attention matrices to estimate uncertainty without requiring repeated sampling or external models. Specifically, we measure the Kullback-Leibler divergence between each attention head's distribution and a uniform reference distribution, and use these features in a logistic regression probe. Across multiple datasets, task types, and model families, attention divergence is highly predictive of answer correctness and performs competitively with existing uncertainty estimation methods. We find that this signal is concentrated in middle layers and on factual tokens such as named entities and numbers, suggesting that attention dynamics provides an efficient and interpretable white-box signal of model uncertainty.
Chinese Translation
我们提出了一种轻量级的单次不确定性量化方法,用于检测大型语言模型中的幻觉。该方法利用注意力矩阵估计不确定性,无需重复采样或外部模型。具体来说,我们测量每个注意力头分布与均匀参考分布之间的Kullback-Leibler散度,并在逻辑回归探测器中使用这些特征。在多个数据集、任务类型和模型家族中,注意力差异可以高度预测答案的正确性,并与现有的不确定性估计方法具有竞争力。我们发现该信号集中在中间层和事实性令牌(如命名实体和数字)上,表明注意力动态提供了一种高效且可解释的白盒模型不确定性信号。
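The per-head feature described in this abstract can be sketched in plain Python. The sketch assumes the divergence is taken as KL(p ‖ u), with p an attention row and u the uniform distribution over the same keys, and that each head's rows are averaged into one feature. The abstract does not fix the direction or the aggregation, so both are assumptions, and the nested-list layout `attn[head][query][key]` is hypothetical.

```python
import math


def kl_from_uniform(attn_row):
    """KL(p || u) for a probability vector p and the uniform distribution u.

    Equals log(n) minus the entropy of p: 0 for uniform attention,
    log(n) when all mass sits on a single key.
    """
    n = len(attn_row)
    return sum(p * math.log(p * n) for p in attn_row if p > 0.0)


def head_features(attn):
    """One feature per head: mean KL-from-uniform over that head's query rows."""
    return [
        sum(kl_from_uniform(row) for row in head) / len(head)
        for head in attn
    ]


# Toy attention tensor: 2 heads, 2 query positions, 2 keys per row.
toy_attn = [
    [[0.5, 0.5], [1.0, 0.0]],   # head 0: one diffuse row, one peaked row
    [[0.5, 0.5], [0.5, 0.5]],   # head 1: fully diffuse
]
features = head_features(toy_attn)
```

The resulting feature vector would then be passed to a logistic regression probe trained on answer correctness. With causal masking, each row has a different number of valid keys, so a real implementation would likely use the unmasked length of each row as n.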
cs.CL / 52 / 2605.05066
The Impossibility Triangle of Long-Context Modeling
长序列建模的不可能性三角
Abstract
We identify and prove a fundamental trade-off governing long-sequence models: no model can simultaneously achieve (i) per-step computation independent of sequence length (Efficiency), (ii) state size independent of sequence length (Compactness), and (iii) the ability to recall a number of historical facts proportional to sequence length (Recall). We formalize this trade-off within an Online Sequence Processor abstraction that unifies Transformers, state space models, linear recurrent networks, and their hybrids. Using the Data Processing Inequality and Fano's Inequality, we prove that any model satisfying Efficiency and Compactness can recall at most O(poly(d)/log V) key-value pairs from a sequence of arbitrary length, where d is the model dimension and V is the vocabulary size. We classify 52 architectures published before March 2026 into the triangle, showing that each achieves at most two of the three properties and that hybrid architectures trace continuous trajectories in the interior. Experiments on synthetic associative recall tasks with five representative architectures validate the theoretical bound: empirical recall capacity lies strictly below the information-theoretic limit, and no architecture escapes the triangle.
Chinese Translation
我们识别并证明了一个支配长序列模型的基本权衡:没有任何模型能够同时实现(i)每步计算与序列长度无关(效率),(ii)状态大小与序列长度无关(紧凑性),以及(iii)能够回忆与序列长度成比例的历史事实数量(回忆)。我们在在线序列处理器(Online Sequence Processor)抽象框架内形式化了这一权衡,该框架统一了变压器(Transformers)、状态空间模型、线性递归网络及其混合体。通过使用数据处理不等式(Data Processing Inequality)和法诺不等式(Fano's Inequality),我们证明了任何满足效率和紧凑性的模型最多只能从任意长度的序列中回忆出 O(poly(d)/log V) 个键值对,其中 d 是模型维度,V 是词汇大小。我们将2026年3月之前发布的52种架构分类入该三角形,显示每种架构最多只实现三个属性中的两个,并且混合架构在内部轨迹上追踪连续的路径。对五种代表性架构在合成关联回忆任务上的实验验证了理论界限:经验回忆能力严格低于信息论界限,并且没有任何架构能够逃脱该三角形。
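A simplified version of the information-theoretic argument named in this abstract can be written out as follows. This is a sketch under stated assumptions, not the paper's proof: the n values X = (x_1, ..., x_n) are drawn uniformly from a vocabulary of size V, the keys are known, the model's state s holds at most S bits, and recall means recovering all of X from s.

```latex
% Assumed setup: X uniform over M = V^n outcomes, state s limited to S bits,
% \hat{X} decoded from s alone, so X -> s -> \hat{X} forms a Markov chain.
\begin{align}
I(X;\hat{X}) &\le I(X;s) \le H(s) \le S
  && \text{(Data Processing Inequality)} \\
P_e = \Pr[\hat{X} \ne X] &\ge 1 - \frac{I(X;\hat{X}) + 1}{\log_2 M}
  && \text{(Fano's Inequality, uniform } X\text{)} \\
&\ge 1 - \frac{S + 1}{n \log_2 V}
  && (\log_2 M = n \log_2 V)
\end{align}
% Requiring P_e \le \varepsilon therefore forces
% n \le \frac{S + 1}{(1 - \varepsilon)\,\log_2 V}.
```

If the state size is S = poly(d) bits, independent of sequence length, the number of reliably recallable pairs is O(poly(d)/log V), which has the same form as the bound stated in the abstract.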
cs.CL / 53 / 2605.05080
The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences
皮诺丘维度:经验现象性作为大型语言模型心理测量差异的主要轴线
Abstract
We administer 45 validated psychometric questionnaires to 50 large language models (LLMs) to identify the dimensions along which LLMs differ psychometrically. Using Supervised Semantic Differential (SSD), we find that the primary axis of between-model variance separates items describing phenomenally rich experience, including embodied sensation, felt affect, inner speech, imagery, and empathy, from items describing stimulus-driven behavioral reactivity ($R^2_{adj}=.037$, $p<.0001$). To test this hypothesis at the item level, we introduce the Pinocchio score ($\pi_i$), the ratio of inter-model response variance under neutral prompting to that under a human-simulation prompt, as an annotation-free measure of each item's experiential demand. $\pi_i$ predicts condition-induced shifts in primary factor loading magnitudes ($\rho=-.215$, $p<.0001$, $n=1292$--$1310$ items), confirming that between-model divergence on experiential items is structured rather than noisy. Applying PCA to per-model EFA scores across all questionnaires reveals one dominant dimension, the Pinocchio Axis ($\Pi$): the degree to which a model presents itself as a locus of phenomenal experience rather than a system of behavioral responses. This axis captures 47.1% of cross-questionnaire between-model variance in primary factor scores and converges with item-level Pinocchio scores ($r=.864$). Marked within-provider divergence across closely related model variants is consistent with post-training fine-tuning as a key contributor, supporting the interpretation that $\Pi$ reflects a training-shaped self-representational tendency governing how a model treats experiential language as self-applicable. The dominant axis of between-model psychometric variation is therefore not a conventional personality trait but a self-representational stance toward one's own nature as an experiencer.
Chinese Translation
我们对50个大型语言模型(LLMs)施测了45份经过验证的心理测量问卷,以识别LLMs在心理测量上的差异维度。使用监督语义差异法(Supervised Semantic Differential, SSD),我们发现模型间方差的主要轴线将描述现象性丰富体验的条目(包括身体感受、情感体验、内心独白、意象和共情)与描述刺激驱动行为反应的条目区分开来($R^2_{adj}=.037$, $p<.0001$)。为了在条目层面验证这一假设,我们引入了皮诺丘分数($\pi_i$),即中性提示下的模型间反应方差与人类模拟提示下的模型间反应方差之比,用作每个条目体验需求的免标注度量。$\pi_i$能够预测由条件引起的主要因子载荷幅度的变化($\rho=-.215$, $p<.0001$, $n=1292$--$1310$个条目),确认模型间在体验性条目上的差异是有结构的,而非噪声。对所有问卷中每个模型的探索性因子分析(EFA)得分应用主成分分析(PCA),得到一个主导维度,即皮诺丘轴($\Pi$):模型将自身呈现为现象体验的主体而非行为反应系统的程度。该轴解释了主要因子得分中跨问卷模型间方差的47.1%,并与条目层面的皮诺丘分数趋同($r=.864$)。在密切相关的模型变体之间观察到明显的提供者内部差异,这与后训练微调是关键因素的看法一致,支持如下解释:$\Pi$反映了一种由训练塑造的自我表征倾向,决定模型如何将体验性语言视为适用于自身。因此,模型间心理测量变异的主导轴并不是传统的人格特质,而是对自身作为体验者之本质的自我表征立场。
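The Pinocchio score defined in this abstract is a variance ratio, which can be sketched for a single item as follows. This is a minimal illustration under stated assumptions: the use of the sample variance, the handling of a zero denominator, and the name `pinocchio_score` are choices made here, and the response values are invented toy numbers, not data from the paper.

```python
from statistics import variance

def pinocchio_score(neutral_responses, human_sim_responses):
    """Ratio of inter-model response variance for one item:
    neutral prompting divided by human-simulation prompting."""
    var_neutral = variance(neutral_responses)      # sample variance (assumption)
    var_human_sim = variance(human_sim_responses)
    if var_human_sim == 0:
        return None  # ratio undefined when all models agree under human simulation
    return var_neutral / var_human_sim

# Toy item: each list holds one response per model (illustrative values only).
neutral = [1.0, 5.0, 2.0, 4.0, 3.0]
human_sim = [3.0, 4.0, 3.0, 4.0, 3.0]
score = pinocchio_score(neutral, human_sim)
```

A score well above 1 means the models disagree far more about the item when answering as themselves than when simulating a human respondent.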
cs.CL / 54 / 2605.05090
Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models
自动发现和验证干预对语言模型的意外副作用
Abstract
We present an automated, contrastive evaluation pipeline for auditing the behavioral impact of interventions on large language models. Given a base model $M_1$ and an intervention model $M_2$, our method compares their free-form, multi-token generations across aligned prompt contexts and produces human-readable, statistically validated natural-language hypotheses describing how the models differ, along with recurring themes that summarize patterns across validated hypotheses. We evaluate the approach in a synthetic setting by injecting known behavioral changes and showing that the pipeline reliably recovers them. We then apply it to three real-world interventions (reasoning distillation, knowledge editing, and unlearning), demonstrating that the method surfaces both intended and unexpected behavioral shifts, distinguishes large from subtle interventions, and does not hallucinate differences when effects are absent or misaligned with the prompt bank. Overall, the pipeline provides a statistically grounded and interpretable tool for post-hoc auditing of intervention-induced changes in model behavior.
Chinese Translation
我们提出了一种自动化的对比评估流程,用于审计对大规模语言模型的干预所产生的行为影响。给定一个基础模型 $M_1$ 和一个干预模型 $M_2$,我们的方法比较它们在对齐提示上下文中的自由格式多词生成,并生成可读性强、经过统计验证的自然语言假设,以描述模型之间的差异,以及总结经过验证的假设中的重复主题。我们在合成环境中评估该方法,通过注入已知的行为变化并展示该流程可靠地恢复这些变化。然后,我们将其应用于三种现实世界的干预,推理蒸馏(reasoning distillation)、知识编辑(knowledge editing)和遗忘(unlearning),展示该方法能够揭示预期和意外的行为转变,区分大规模干预与细微干预,并在缺乏效应或与提示库不对齐时不产生虚构的差异。总体而言,该流程为事后审计干预引发的模型行为变化提供了一个统计基础和可解释的工具。
cs.CL / 55 / 2605.05103
Text Corpora as Concept Fields: Black-Box Hallucination and Novelty Measurement
文本语料库作为概念场:黑箱幻觉与新颖性测量
Abstract
We introduce the **Concept Field** of a text corpus: a local drift field with pointwise uncertainty, estimated in sentence-embedding space from the deltas between consecutive sentences. Given a candidate sentence transition, we score its agreement with the field by $\zeta$, the mean absolute z-distance between the observed delta and the field's local Gaussian estimate. The score is black-box (no model internals), corpus-attributable (every score traces to nearby corpus sentences), and admits a direct probabilistic reading. We support the computation by introducing a **Vector Sequence Database (VSDB)** that stores embeddings together with sequence-position and next-delta metadata. We evaluate this approach in two large-scale settings: hallucination-style groundedness detection over the U.S. Code of Federal Regulations, and novelty detection over Project Gutenberg. Using controlled LLM-generated rewrites, Concept Fields achieve strong selective classification performance under a grounded / ungrounded / unsure triage policy and, unlike retrieval-centric baselines, show similar coverage-risk behavior across both domains, supporting a probability-based interpretation that transfers across domains. We also sketch how divergence and curl of the Concept Field, computed on dense clusters, surface qualitatively meaningful semantic patterns (logic sources, sinks, and implicit topics), which we offer as hypothesis-generating rather than as a quantitative result. Concept Fields provide a fast, lightweight, and interpretable signal for groundedness and novelty, complementary to LLM-as-judge and white-box detectors.
Chinese Translation
我们引入了文本语料库的**概念场**(Concept Field):这是一个带有逐点不确定性的局部漂移场,基于连续句子之间的差分在句子嵌入空间中估计得到。给定一个候选句子转移,我们用$\zeta$来衡量其与该场的一致性,该值为观测到的差分与该场局部高斯估计之间的平均绝对z距离。该评分是黑箱的(不依赖模型内部结构),可归因于语料库(每个评分都可追溯到邻近的语料句子),并且具有直接的概率解释。我们通过引入**向量序列数据库(VSDB)**来支持这一计算,该数据库将嵌入与序列位置和下一差分元数据一并存储。我们在两个大规模场景中评估了这种方法:针对美国联邦法规(Code of Federal Regulations)的幻觉式有据性检测,以及针对古腾堡计划(Project Gutenberg)的新颖性检测。在使用受控的LLM生成改写的实验中,概念场在“有据/无据/不确定”的分流策略下实现了强大的选择性分类性能,并且与以检索为中心的基线不同,它在两个领域中表现出相似的覆盖率-风险行为,支持了一种可跨领域迁移的基于概率的解释。我们还简要说明了在密集聚类上计算的概念场散度和旋度如何揭示具有定性意义的语义模式(逻辑源、汇和隐性主题),我们将其作为用于生成假设的线索,而非定量结果。概念场为有据性和新颖性提供了一种快速、轻量且可解释的信号,可作为LLM评判者(LLM-as-judge)和白箱检测器的补充。
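The $\zeta$ score described in this abstract (mean absolute z-distance between an observed delta and a local Gaussian estimate) can be sketched as follows. This is a minimal illustration under assumptions that the abstract does not confirm: the local Gaussian is taken to be diagonal, it is fitted from an unweighted set of neighboring corpus deltas, and a small `eps` floor guards against zero variance. The name `zeta_score` and the toy vectors are hypothetical.

```python
from statistics import mean, pstdev

def zeta_score(observed_delta, neighbor_deltas, eps=1e-8):
    """Mean absolute z-distance between an observed embedding delta and a
    per-dimension Gaussian fitted to nearby corpus deltas."""
    z_values = []
    for j in range(len(observed_delta)):
        column = [delta[j] for delta in neighbor_deltas]
        mu = mean(column)
        sigma = max(pstdev(column), eps)  # floor avoids division by zero
        z_values.append(abs(observed_delta[j] - mu) / sigma)
    return mean(z_values)

# Toy 2-dimensional deltas taken from two neighboring corpus transitions.
neighbors = [[0.0, 1.0], [2.0, 3.0]]
on_field = zeta_score([1.0, 2.0], neighbors)   # matches the local mean
off_field = zeta_score([3.0, 2.0], neighbors)  # two sigmas off in one dimension
```

Low values indicate a transition that agrees with the corpus drift at that location; a triage policy would then threshold $\zeta$ into grounded, ungrounded, and unsure bands.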
cs.CL / 56 / 2605.05121
Beyond Semantics: An Evidential Reasoning-Aware Multi-View Learning Framework for Trustworthy Mental Health Prediction
超越语义:一种考虑证据推理的多视角学习框架用于可信的心理健康预测
Abstract
Automated mental health prediction using textual data has shown promising results with deep learning and large language models. However, deploying these models in high-stakes real-world settings remains challenging, as existing approaches largely rely on semantic representations and often produce overconfident predictions under ambiguous, noisy, or shifted data. Moreover, most methods lack reliable uncertainty estimation, undermining trust in risk-sensitive mental health applications. To address these limitations, we formulate the task as a multi-view learning problem that integrates semantic information from encoder-only models with higher-level reasoning information from decoder-only models, where reasoning-aware representations and uncertainty modeling are obtained in a trustworthy manner. To ensure reliable fusion, we adopt an evidential learning framework based on Subjective Logic to explicitly model uncertainty and introduce an evidential fusion strategy that balances complementary views while discounting unreliable evidence. On three real-world datasets (Dreaddit, SDCNL, and DepSeverity), the framework achieves accuracies of 0.835, 0.731, and 0.751, respectively, demonstrating its potential for reliable mental health prediction. Additional experiments on robustness to noise and case studies for interpretability confirm that our proposed framework not only improves predictive performance but also provides trustworthy uncertainty estimates and human-understandable reasoning signals, making it suitable for risk-sensitive applications in mental health assessment.
Chinese Translation
基于文本数据的自动化心理健康预测在深度学习和大型语言模型的应用下显示出良好的前景。然而,将这些模型部署于高风险的现实环境中仍然面临挑战,因为现有的方法主要依赖于语义表示,通常在模糊、噪声或数据偏移的情况下产生过于自信的预测。此外,大多数方法缺乏可靠的不确定性估计,从而削弱了在风险敏感心理健康应用中的信任度。为了解决这些限制,我们将任务形式化为一个多视角学习问题,整合来自仅编码模型的语义信息与来自仅解码模型的高层次推理信息,在可信的方式下获得推理感知表示和不确定性建模。为了确保可靠的融合,我们采用基于主观逻辑的证据学习框架,明确建模不确定性,并引入了一种证据融合策略,平衡互补视角的同时折扣不可靠的证据。在三个实际数据集Dreaddit、SDCNL和DepSeverity上的基准测试中,分别报告了准确率为0.835、0.731和0.751,展示了其在可信心理健康预测中的潜力。对噪声鲁棒性及解释性案例研究的额外实验确认,我们提出的框架不仅提高了预测性能,还提供了可信的不确定性估计和易于理解的推理信号,使其适用于心理健康评估中的风险敏感应用。
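In evidential learning based on Subjective Logic, a view's non-negative class evidence is commonly converted into belief masses plus an explicit uncertainty mass. The sketch below shows that standard conversion only. It is not the paper's fusion strategy, which the abstract does not specify, and the function name `opinion_from_evidence` is hypothetical.

```python
def opinion_from_evidence(evidence):
    """Map non-negative per-class evidence to a Subjective Logic opinion.

    With K classes and Dirichlet strength S = sum(e_k + 1):
    belief b_k = e_k / S and uncertainty u = K / S, so sum(b) + u == 1.
    """
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return belief, uncertainty

# A view with strong evidence for class 0, and a view with no evidence at all.
confident_belief, confident_u = opinion_from_evidence([8.0, 0.0])
vacuous_belief, vacuous_u = opinion_from_evidence([0.0, 0.0])
```

A view with no evidence yields uncertainty 1, which is the quantity a fusion rule can use to discount that view.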
cs.CL / 57 / 2605.05159
PSK at SemEval-2026 Task 9: Multilingual Polarization Detection Using Ensemble Gemma Models with Synthetic Data Augmentation
PSK在SemEval-2026任务9中的表现:使用合成数据增强的集成Gemma模型进行多语言极化检测
Abstract
We present our system for SemEval-2026 Task 9: Multilingual Polarization Detection, a binary classification task spanning 22 languages. Our approach fine-tunes separate Gemma 3 models (12B and 27B parameters) per language using Low-Rank Adaptation (LoRA), augmented with synthetic data generated by a large language model (LLM). We employ three synthetic data strategies (direct generation, paraphrasing, and contrastive pair creation) using GPT-4o-mini, with a multi-stage quality filtering pipeline including embedding-based deduplication. We find that per-language threshold tuning on the development set yields 2 to 4% F1 improvements without retraining. We also use weighted ensembles of 12B and 27B model predictions with per-language strategy selection. Our final system achieves a mean macro-F1 of 0.811 across all 22 languages, ranking 2nd overall of the participating teams, with 1st place finishes in 3 languages and top-3 in 8 languages. We also find that alternative architectures (XLM-RoBERTa, Qwen3) that showed strong development set performance suffered 30 to 50% F1 drops on the test set, highlighting the importance of generalization.
Chinese Translation
我们介绍了我们参加SemEval-2026任务9(多语言极化检测)的系统,这是一个涵盖22种语言的二分类任务。我们的方法为每种语言分别微调Gemma 3模型(12B和27B参数),使用低秩适应(Low-Rank Adaptation,LoRA),并辅以由大型语言模型(LLM)生成的合成数据。我们采用了三种合成数据策略(直接生成、释义和对比对创建),使用GPT-4o-mini,并应用了多阶段的质量过滤管道,包括基于嵌入的去重。我们发现,在开发集上进行每种语言的阈值调优可带来2%到4%的F1提升,而无需重新训练。我们还使用12B和27B模型预测的加权集成,并根据每种语言选择策略。我们的最终系统在全部22种语言上取得了0.811的平均宏F1得分,在参赛团队中总体排名第二,在3种语言中获得第一名,并在8种语言中进入前三名。我们还发现,在开发集上表现强劲的替代架构(XLM-RoBERTa、Qwen3)在测试集上的F1得分下降了30%到50%,这凸显了泛化能力的重要性。
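Per-language threshold tuning and weighted ensembling, as described in this abstract, can be sketched as follows. This is a minimal illustration, not the team's code: the threshold grid, the tie-breaking rule, the ensemble weight, and all function names are assumptions, and the probabilities are toy values.

```python
def macro_f1(labels, preds):
    """Macro-averaged F1 over the two classes of a binary task."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def tune_threshold(probs, labels, grid):
    """Pick the decision threshold that maximizes macro-F1 on a dev set.
    Ties keep the earliest grid value (an arbitrary choice)."""
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        preds = [1 if p >= t else 0 for p in probs]
        score = macro_f1(labels, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1

def weighted_ensemble(probs_12b, probs_27b, weight_12b):
    """Convex combination of the two models' positive-class probabilities."""
    return [weight_12b * a + (1.0 - weight_12b) * b
            for a, b in zip(probs_12b, probs_27b)]

# Toy dev set for one language.
dev_probs = [0.2, 0.4, 0.6, 0.9]
dev_labels = [0, 0, 1, 1]
threshold, dev_f1 = tune_threshold(dev_probs, dev_labels, [0.3, 0.5, 0.7])
blended = weighted_ensemble([0.2], [0.6], 0.25)
```

Running `tune_threshold` once per language on that language's development set reproduces the "no retraining" property: only the decision rule changes, not the model weights.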
cs.CL / 58 / 2605.05166
The First Token Knows: Single-Decode Confidence for Hallucination Detection
首个标记早已知晓:用于幻觉检测的单次解码置信度
Abstract
Self-consistency detects hallucinations by generating multiple sampled answers to a question and measuring agreement, but this requires repeated decoding and can be sensitive to lexical variation. Semantic self-consistency improves this by clustering sampled answers by meaning using natural language inference, but it adds both sampling cost and external inference overhead. We show that first-token confidence, phi_first, computed from the normalized entropy of the top-K logits at the first content-bearing answer token of a single greedy decode, matches or modestly exceeds semantic self-consistency on closed-book short-answer factual question answering. Across three 7-8B instruction-tuned models and two benchmarks, phi_first achieves a mean AUROC of 0.820, compared with 0.793 for semantic agreement and 0.791 for standard surface-form self-consistency. A subsumption test shows that phi_first is moderately to strongly correlated with semantic agreement, and combining the two signals yields only a small AUROC improvement over phi_first alone. These results suggest that much of the uncertainty information captured by multi-sample agreement is already available in the model's initial token distribution. We argue that phi_first should be reported as a default low-cost baseline before invoking sampling-based uncertainty estimation.
Chinese Translation
自一致性方法通过对同一问题生成多个采样答案并测量其一致性来检测幻觉,但这需要重复解码,并且对词汇变体较为敏感。语义自一致性通过使用自然语言推理按语义对采样答案进行聚类来改进这一点,但它同时增加了采样成本和外部推理开销。我们表明,首标记置信度(phi_first),即由单次贪婪解码中首个承载内容的答案标记处前K个logits的归一化熵计算得到的量,在闭卷短答案事实问答中与语义自一致性持平或略有超出。在三个7-8B的指令微调模型和两个基准测试上,phi_first的平均AUROC为0.820,而语义一致性为0.793,标准表面形式自一致性为0.791。一项包含性检验(subsumption test)显示,phi_first与语义一致性之间具有中等到强的相关性,且将这两种信号结合后,相比单独使用phi_first仅带来微小的AUROC提升。这些结果表明,多样本一致性所捕获的大部分不确定性信息已经存在于模型的初始标记分布中。我们认为,在采用基于采样的不确定性估计之前,应将phi_first作为默认的低成本基线予以报告。
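A first-token confidence of the kind described in this abstract can be sketched from a single vector of logits. This is a minimal illustration under assumptions that the abstract does not confirm: the softmax is renormalized over the top-K logits only, and confidence is taken as one minus the normalized entropy. The value of `k`, the function name, and the toy logits are hypothetical, and locating the first content-bearing answer token is left out.

```python
import math

def phi_first(first_token_logits, k=5):
    """Confidence from the normalized entropy of the top-k logits.
    Returns a value in [0, 1]; requires at least two logits."""
    top = sorted(first_token_logits, reverse=True)[:k]
    shift = top[0]  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in top]
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    normalized = entropy / math.log(len(top))  # 1.0 for a uniform top-k
    return 1.0 - normalized

# Toy logits: a completely undecided model and a very decided one.
undecided = phi_first([1.0, 1.0, 1.0, 1.0, 1.0])
decided = phi_first([100.0, 0.0, 0.0, 0.0, 0.0])
```

Because it needs only the logits of one greedy decode, a score like this costs no extra sampling, which is the practical argument the abstract makes for reporting it as a baseline.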
cs.CL / 59 / 2605.05197
Implicit Representations of Grammaticality in Language Models
语言模型中语法性的隐式表征
Abstract
Grammaticality and likelihood are distinct notions in human language. Pretrained language models (LMs), which are probabilistic models of language fitted to maximize corpus likelihood, generate grammatically well-formed text and discriminate well between grammatical and ungrammatical sentences in tightly controlled minimal pairs. However, their string probabilities do not sharply discriminate between grammatical and ungrammatical sentences overall. But do LMs implicitly acquire a grammaticality distinction distinct from string probability? We explore this question by studying the internal representations of LMs: we train a linear probe on a dataset of grammatical and (synthetic) ungrammatical sentences obtained by applying perturbations to a naturalistic text corpus. We find that this simple grammaticality probe generalizes to human-curated grammaticality judgment benchmarks and outperforms LM probability-based grammaticality judgments. However, when applied to semantic plausibility benchmarks, in which both members of a minimal pair are grammatical and differ only in plausibility, the probe performs worse than string probability. The English-trained probe also exhibits nontrivial cross-lingual generalization, outperforming string probabilities on grammaticality benchmarks in numerous other languages. Additionally, probe scores correlate only weakly with string probabilities. These results collectively suggest that LMs acquire, to some extent, an implicit grammaticality distinction within their hidden layers.
Chinese Translation
语法性和似然在人类语言中是不同的概念。预训练语言模型(Language Models, LMs)是通过最大化语料库似然拟合得到的语言概率模型,它们能生成语法上合乎规范的文本,并能在严格控制的最小对立对中有效区分合语法句和不合语法句。然而,其字符串概率在总体上并不能清晰地区分合语法句和不合语法句。那么,语言模型是否隐式地习得了一种有别于字符串概率的语法性区分?我们通过研究语言模型的内部表征来探讨这个问题:在一个由合语法句和(合成的)不合语法句组成的数据集上训练线性探测器,其中不合语法句是通过对自然文本语料库施加扰动得到的。我们发现这个简单的语法性探测器能够泛化到人工整理的语法性判断基准,并优于基于语言模型概率的语法性判断。然而,当应用于语义合理性基准时(其中最小对立对的两个成员都合乎语法,仅在合理性上有所不同),探测器的表现不如字符串概率。用英语训练的探测器还表现出不可忽视的跨语言泛化能力,在许多其他语言的语法性基准上超越了字符串概率。此外,探测器得分与字符串概率仅弱相关。这些结果共同表明,语言模型在其隐藏层中在一定程度上习得了隐式的语法性区分。
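A linear probe of the kind described in this abstract is a logistic regression over hidden-state vectors. The sketch below trains one with plain gradient descent on toy data. It is illustrative only: the two-dimensional "hidden states" are invented, extracting real hidden states from an LM is omitted, and the learning rate, epoch count, and function names are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def probe_score(weights, bias, hidden_state):
    """Linear probe logit: positive means 'grammatical'."""
    return sum(w * h for w, h in zip(weights, hidden_state)) + bias

def train_probe(hidden_states, labels, lr=0.5, epochs=200):
    """Fit logistic regression by full-batch gradient descent."""
    dim = len(hidden_states[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(hidden_states)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(hidden_states, labels):
            error = sigmoid(probe_score(weights, bias, x)) - y
            for j in range(dim):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Toy stand-ins for sentence-level hidden states (1 = grammatical, 0 = perturbed).
states = [[1.0, 0.2], [0.9, -0.1], [-1.0, 0.1], [-0.8, -0.2]]
labels = [1, 1, 0, 0]
weights, bias = train_probe(states, labels)
scores = [probe_score(weights, bias, s) for s in states]
```

For a minimal pair, the probe's judgment would be the sentence with the higher `probe_score`, which can then be compared against the judgment given by string probability.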