Daily Research Digest

arXiv Papers

2026-04-30 / 185 papers / 4 categories / 185 translated
Robotics (21 papers)
cs.RO / 1 / 2604.25949

FalconApp: Rapid iPhone Deployment of End-to-End Perception via Automatically Labeled Synthetic Data

FalconApp:通过自动标记合成数据快速部署端到端感知的iPhone应用
Miao, Yan, Shen, Will, Mitra, Sayan
Abstract
Reliable perception for robotics depends on large-scale labeled data, yet real-world datasets rely on heavy manual annotation and are time-consuming to produce. We present FalconApp, an iPhone app with an end-to-end frontend-backend pipeline that turns a short handheld capture of a rigid object into a perception module for mask detection and 6-DoF pose estimation. Our core contribution is a rapid mobile deployment pipeline paired with a photorealistic auto-labeling workflow: from a user-captured video of an object, FalconApp reconstructs an editable GSplat asset, composites it with diverse photorealistic backgrounds, renders synthetic images with ground-truth masks and poses, trains the perception module, and deploys it back to the iPhone frontend. Experiments across five rigid objects with diverse geometry and appearance show that FalconApp produces usable perception models with about 20 minutes of synthetic-data generation and training per object on average, around 30 ms end-to-end on-device latency on iPhone, and better overall pose accuracy than a PnP baseline on 4 / 5 objects in both simulation and real-world evaluation.
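The auto-labeling step described in the abstract (composite the reconstructed object over a background, then keep the ground-truth mask) can be sketched on a toy image. This is only an illustration under stated assumptions: the function name `composite_with_mask`, the grayscale nested-list image format, and the 0.5 mask threshold are hypothetical and are not taken from FalconApp.

```python
def composite_with_mask(foreground, alpha, background, mask_threshold=0.5):
    """Alpha-blend a rendered object over a background and derive its binary mask."""
    image, mask = [], []
    for fg_row, a_row, bg_row in zip(foreground, alpha, background):
        # Standard "over" compositing: alpha * object + (1 - alpha) * background.
        image.append([a * f + (1.0 - a) * b for f, a, b in zip(fg_row, a_row, bg_row)])
        # The label comes for free from the renderer's alpha channel (assumed threshold).
        mask.append([1 if a >= mask_threshold else 0 for a in a_row])
    return image, mask

# 2x2 grayscale toy example: the object covers only the left column.
foreground = [[1.0, 1.0], [1.0, 1.0]]
alpha      = [[1.0, 0.0], [0.8, 0.0]]
background = [[0.2, 0.2], [0.5, 0.5]]
image, mask = composite_with_mask(foreground, alpha, background)
```

Because the mask is read from the same alpha channel used for blending, every synthetic image is labeled without manual annotation, which is the property the abstract relies on.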
Chinese Translation
机器人可靠的感知依赖于大规模的标记数据,而现实世界的数据集则依赖于繁重的人工标注,生产过程耗时。我们提出了FalconApp,这是一款具有端到端前后端管道的iPhone应用,能够将对刚性物体的短暂手持捕捉转化为用于掩膜检测和6自由度姿态估计的感知模块。我们的核心贡献是一个快速的移动部署管道,配合一个逼真的自动标注工作流:从用户捕获的物体视频中,FalconApp重建一个可编辑的GSplat资产,将其与多样的逼真背景合成,渲染带有真实掩膜和姿态的合成图像,训练感知模块,并将其重新部署到iPhone前端。在五个具有不同几何形状和外观的刚性物体上的实验表明,FalconApp平均每个物体生成和训练合成数据约需20分钟,在iPhone上实现约30毫秒的端到端设备延迟,并在模拟和现实世界评估中在4/5个物体上表现出比PnP基线更好的整体姿态准确性。
cs.RO / 2 / 2604.25963

A Scaled Three-Vehicle Platooning Platform

缩放三车编队平台
Lu, Kaiyue, Zhang, Qiaoxuan, Lu, Yukun
Abstract
Vehicle platooning has attracted increasing attention as a promising approach to improve traffic efficiency, energy consumption, and roadway safety through coordinated multi-vehicle operation. A key challenge in platooning lies in maintaining stable and accurate path tracking during dynamic maneuvers such as lane changes, where lateral deviations and heading disturbances generated by the lead vehicle may propagate downstream to following vehicles. Robust longitudinal and lateral control systems are therefore essential not only for individual vehicle tracking performance, but also for overall platoon stability. For experimental studies, the Intelligent Mobility and Robotics Lab (IMRL) develops a scaled multi-vehicle platform for autonomous platooning research, with a particular emphasis on cooperative control and human-in-the-loop autonomy. This platform consists of one human-operable lead vehicle and two autonomous followers, enabling controlled and repeatable experiments on leader-follower coordination. Compared with full-scale field testing, this scaled platform offers a safer, lower-cost, and more flexible environment for rapid prototyping, controller validation, and multi-agent autonomy studies, while providing stronger physical realism than purely simulation-based evaluations.
Chinese Translation
车辆编队作为一种通过协调多车操作来提高交通效率、降低能耗和增强道路安全的有前景的方法,受到了越来越多的关注。编队中的一个关键挑战是在动态机动(如变道)过程中保持稳定和准确的路径跟踪,此时,由领车产生的横向偏差和航向干扰可能会向后续车辆传播。因此,强健的纵向和横向控制系统对于单个车辆的跟踪性能以及整体编队的稳定性至关重要。在实验研究中,智能移动与机器人实验室(IMRL)开发了一个缩放的多车平台,用于自主编队研究,特别强调协同控制和人机协作自主性。该平台由一辆可由人操作的领车和两辆自主跟车组成,能够进行受控和可重复的领随协调实验。与全尺度现场测试相比,这一缩放平台提供了一个更安全、成本更低且更灵活的环境,适用于快速原型开发、控制器验证和多智能体自主性研究,同时提供比纯模拟评估更强的物理真实感。
cs.RO / 3 / 2604.25974

Multi-Periodogram Velocity Estimation with Irregular Reference Signals for Robot-Aided ISAC

基于不规则参考信号的多周期谱速度估计用于机器人辅助的集成感知与通信
Geng, Yi, Cao, Pan, Zeng, Ting, Deng, Yongqian
Abstract
This paper addresses velocity estimation within robot-aided integrated sensing and communications (ISAC), where mobile robots act as sensing nodes but can only opportunistically reuse irregular 5G/6G reference signals (RSs). We show that the velocity profile induced by such irregular time-domain patterns can be decomposed into a periodic-peak component and an amplitude-shaping (weighting) component. Leveraging this structure, we propose a multi-periodogram velocity estimation algorithm that is standard-compliant and does not require new sensing-dedicated RSs or 3GPP modifications. Simulation results demonstrate that, compared with conventional periodogram processing, the proposed method improves low-SNR robustness by achieving a 3 dB SNR gain at the 10% missed-detection rate and reducing false alarms by 51%.
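As a rough illustration of the conventional periodogram baseline mentioned in the abstract, the sketch below estimates a Doppler velocity from samples taken at irregular times by scanning a grid of candidate velocities. It is not the paper's multi-periodogram algorithm. The function name `periodogram_velocity`, the sample times, the 0.1 m wavelength, and the monostatic Doppler relation f_d = 2v/λ are assumptions made for this example.

```python
import cmath
import math

def periodogram_velocity(times, samples, wavelength, v_grid):
    """Return the candidate velocity whose Doppler phasor best matches the samples."""
    best_v, best_power = None, -1.0
    for v in v_grid:
        f_d = 2.0 * v / wavelength  # assumed monostatic Doppler shift in Hz
        # Correlate the samples against the hypothesised Doppler phasor.
        acc = sum(s * cmath.exp(-2j * math.pi * f_d * t) for t, s in zip(times, samples))
        power = abs(acc) ** 2 / len(times)
        if power > best_power:
            best_v, best_power = v, power
    return best_v

# Irregular reference-signal times in seconds (hypothetical pattern).
times = [0.0, 0.5e-3, 2.0e-3, 2.5e-3, 5.0e-3, 7.0e-3, 7.5e-3, 10.0e-3]
wavelength = 0.1  # metres, roughly a 3 GHz carrier (assumed)
true_v = 3.0      # metres per second
samples = [cmath.exp(2j * math.pi * (2 * true_v / wavelength) * t) for t in times]

v_grid = [i * 0.1 for i in range(0, 101)]  # 0 to 10 m/s in 0.1 m/s steps
estimate = periodogram_velocity(times, samples, wavelength, v_grid)
```

With noise-free samples the peak falls on the true velocity. The irregular spacing changes the shape of the sidelobes around that peak, which is the structure the abstract says the proposed method exploits.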
Chinese Translation
本文讨论了在机器人辅助的集成感知与通信(ISAC)中进行速度估计的问题,其中移动机器人作为感知节点,但只能机会性地重用不规则的5G/6G参考信号(RSs)。我们展示了这种不规则时域模式引起的速度特征可以分解为周期峰值分量和幅度调制(加权)分量。利用这一结构,我们提出了一种符合标准的多周期谱速度估计算法,该算法不需要新的专用感知参考信号或3GPP的修改。仿真结果表明,与传统的周期谱处理相比,所提出的方法在低信噪比(SNR)下的鲁棒性得到了改善,在10%的漏检率下实现了3 dB的信噪比增益,并将误报率降低了51%。
cs.RO / 4 / 2604.26065

FlowS: One-Step Motion Prediction via Local Transport Conditioning

FlowS:通过局部传输条件实现一步运动预测
Di Bella, Leandro, Munteanu, Adrian, Cornelis, Bruno
Abstract
Generative motion prediction must satisfy three simultaneous requirements for real-world autonomy: high accuracy, diverse multimodal futures, and strictly bounded latency. Diffusion models meet the first two but violate the third, requiring tens to hundreds of denoising steps. We identify a conditioning strategy that resolves this tension: single-step integration is accurate when the underlying transport problem is local. A model that must both discover the correct behavioral mode and traverse a long displacement in one step accumulates large discretization errors; conditioning the base distribution to lie near plausible futures reduces the problem to short-range refinement, the regime where a single Euler step suffices. We instantiate this local transport conditioning in FlowS, a conditional flow matching framework with two mechanisms. First, an online, scene-conditioned learned prior emits $K$ calibrated anchor trajectories per agent, each already near a plausible future, converting mode discovery into local correction. Second, a step-consistent displacement field enforces semigroup self-consistency, guaranteeing that a single step inherits multi-step accuracy. Crucially, anchoring this field at learned priors along straight-line paths yields a stable, low-variance training target, unlike prior self-consistency methods that suffer from high-variance bootstrap signals on curved diffusion paths. On the Waymo Open Motion Dataset, FlowS achieves state-of-the-art Soft mAP (0.4804) and mAP (0.4703) with ensemble at 75 FPS with single-step inference, demonstrating that local transport conditioning makes one-step generative motion prediction practical for safety-critical autonomy. Code and pretrained models will be released upon acceptance.
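The claim that a single Euler step is accurate only when the transport is local can be checked on a toy ODE. The sketch below uses a plain rotation field, not FlowS's learned displacement field. The function name `one_step_euler_error` and the two angles are assumptions chosen for illustration.

```python
import cmath

def one_step_euler_error(theta):
    """Error of one Euler step for dz/dt = i*theta*z on t in [0, 1], starting at z = 1."""
    z0 = 1 + 0j
    exact = z0 * cmath.exp(1j * theta)  # true endpoint: rotation by theta
    euler = z0 + 1j * theta * z0        # single Euler step with step size 1
    return abs(exact - euler)

long_range = one_step_euler_error(2.0)   # start far from the target: long displacement
short_range = one_step_euler_error(0.1)  # start near the target: local correction
```

For small angles the error behaves like theta squared over two, so shrinking the distance to travel by a factor of 20 shrinks the one-step error by a factor of several hundred. Starting from an anchor near a plausible future plays the role of the small angle here.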
Chinese Translation
生成运动预测必须满足三个同时的要求,以实现现实世界的自主性:高准确性、多样化的多模态未来和严格限制的延迟。扩散模型满足前两个要求,但违反第三个要求,需要数十到数百个去噪步骤。我们识别出一种解决这一矛盾的条件策略:当基础传输问题是局部时,单步积分是准确的。一个必须在一步内发现正确行为模式并跨越长位移的模型会积累大量离散化误差;将基础分布条件化为接近合理未来的状态,将问题简化为短程细化,这是单步欧拉步骤足以应对的范畴。我们在FlowS中实例化了这种局部传输条件,这是一个具有两种机制的条件流匹配框架。首先,一个在线的、场景条件的学习先验为每个代理发出$K$个校准的锚定轨迹,每个轨迹已经接近合理未来,将模式发现转化为局部修正。其次,一个步一致的位移场强制执行半群自一致性,保证单步继承多步准确性。关键是,将该场锚定在沿直线路径的学习先验上,产生一个稳定、低方差的训练目标,这与先前的自一致性方法不同,后者在曲线扩散路径上遭受高方差自举信号的影响。在Waymo开放运动数据集上,FlowS在单步推理下实现了最先进的Soft mAP (0.4804) 和 mAP (0.4703),以75 FPS的集成方式,证明了局部传输条件使一步生成运动预测在安全关键的自主性中变得实用。代码和预训练模型将在接受后发布。
cs.RO / 5 / 2604.26201

Lights Out: A Nighttime UAV Localization Framework Using Thermal Imagery and Semantic 3D Maps

熄灯时刻:基于热成像和语义三维地图的夜间无人机定位框架
Allen, Ryan, Greeff, Melissa
Abstract
Reliable backup localization for unmanned aerial vehicles (UAVs) operating in GNSS-denied nighttime conditions remains an open challenge due to the severe modality gap between daytime RGB maps and nighttime thermal imagery. This work presents a semantic reprojection framework for map-relative nighttime UAV localization by aligning segmented thermal observations with a globally referenced, semantically labeled 3D map constructed from daytime RGB data. Rather than relying on appearance-based correspondence, localization is formulated in a shared semantic domain and solved via a symmetric bidirectional reprojection objective with confusion-aware weighting to improve robustness under segmentation uncertainty. The approach is evaluated offline across 6.5 km of nighttime, real-world UAV flight trajectories in urban and semi-structured environments. Relative to RTK GNSS ground truth, the system achieves a bias-corrected RMSE2D of 2.18 m and a median RMSE2D of 1.52 m. Results show that localization performance is strongly correlated with the availability of semantic edge evidence and that large-error events are spatially localized to semantically ambiguous areas rather than uniformly distributed. These findings indicate that semantic reprojection offers a promising pathway toward globally referenced nighttime UAV localization using thermal imagery alone.
Chinese Translation
在GNSS信号缺失的夜间条件下,为无人机(UAV)提供可靠的备份定位仍然是一个开放的挑战,因为白天RGB地图与夜间热成像之间存在严重的模态差异。本研究提出了一种语义重投影框架,通过将分割的热成像观测与基于白天RGB数据构建的全球参考语义标注三维地图对齐,实现相对地图的夜间无人机定位。该方法不依赖于基于外观的对应关系,而是在共享的语义域中进行定位,并通过对称双向重投影目标进行求解,结合混淆感知加权,以提高在分割不确定性下的鲁棒性。该方法在城市和半结构化环境中对6.5公里的夜间真实无人机飞行轨迹进行了离线评估。相对于RTK GNSS地面真值,该系统实现了偏差校正后的二维均方根误差(RMSE2D)为2.18米,且中位数RMSE2D为1.52米。结果表明,定位性能与语义边缘证据的可用性密切相关,而大误差事件则空间局限于语义模糊区域,而非均匀分布。这些发现表明,语义重投影为仅使用热成像实现全球参考的夜间无人机定位提供了一条有前景的路径。
cs.RO / 6 / 2604.26212

2D and 3D Grasp Planners for the GET Asymmetrical Gripper

GET不对称夹爪的2D和3D抓取规划器
Goldberg, Andrew, Ransing, Ethan, Kourakin, Anton, Magner, Cael, Adelson, Edward H., Goldberg, Ken
Abstract
In this paper, we introduce GET-2D-1.0, a fast grasp planner for the GET asymmetrical gripper that operates from a single-view RGB-D image using the Ferrari-Canny metric and a novel sampling strategy, and GET-3D-1.0, a mesh-based method using a 3D gripper model and ray-tracing. We evaluate both grasp planners against baselines in physical experiments, which suggest that GET-2D-1.0 can improve over a bounding-box baseline by over 40% in lift success, shake survival, and force resistance. Experiments with GET-3D-1.0 suggest a slight improvement over GET-2D-1.0 in lift success and shake survival, but GET-3D-1.0 is more computationally expensive, averaging 17 seconds of planning compared to 683 ms for GET-2D-1.0.
Chinese Translation
在本文中,我们介绍了GET-2D-1.0,这是一个快速的抓取规划器,适用于GET不对称夹爪,能够从单视图RGB-D图像中操作,采用Ferrari-Canny度量和一种新颖的采样策略。同时,我们还介绍了GET-3D-1.0,这是一种基于网格的方法,使用3D夹爪模型和光线追踪。我们通过物理实验对这两个抓取规划器进行了评估,与基线进行比较,结果表明GET-2D-1.0在提升成功率、摇晃生存率和抗力方面超过了边界框基线40%以上。GET-3D-1.0的实验结果表明,与GET-2D-1.0相比,在提升成功率和摇晃生存率上有轻微改善,但其计算成本更高,平均规划时间为17秒,而GET-2D-1.0为683毫秒。
cs.RO / 7 / 2604.26374

Split over $n$ resource sharing problem: Are fewer capable agents better than many simpler ones?

n个资源共享问题的划分:少数有能力的代理是否优于许多简单的代理?
Soma, Karthik, Talamali, Mohamed S., Miyauchi, Genki, Beltrame, Giovanni, Hamann, Heiko, Gross, Roderich
Abstract
In multi-agent systems, should limited resources be concentrated into a few capable agents or distributed among many simpler ones? This work formulates the split over $n$ resource sharing problem where a group of $n$ agents equally shares a common resource (e.g., monetary budget, computational resources, physical size). We present a case study in multi-agent coverage where the area of the disk-shaped footprint of agents scales as $1/n$. A formal analysis reveals that the initial coverage rate grows with $n$. However, if the speed of agents decreases proportionally with their radii, groups of all sizes perform equally well, whereas if it decreases proportionally with their footprints, a single agent performs best. We also present computer simulations in which resource splitting increases the failure rates of individual agents. The models and findings help identify optimal distributiveness levels and inform the design of multi-agent systems under resource constraints.
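The three scaling regimes stated in the abstract can be reproduced with a few lines of arithmetic. This is a sketch under one extra assumption that the abstract does not spell out: a disk of radius r moving at speed v initially sweeps new area at rate 2rv, and the agents' swept strips do not overlap. The function name `initial_coverage_rate` and the `speed_law` labels are hypothetical.

```python
import math

def initial_coverage_rate(n, total_area=1.0, base_speed=1.0, speed_law="constant"):
    """Combined sweep rate of n disk agents that split a fixed total footprint area."""
    footprint = total_area / n                 # each agent's disk area scales as 1/n
    radius = math.sqrt(footprint / math.pi)
    single_radius = math.sqrt(total_area / math.pi)
    if speed_law == "constant":
        speed = base_speed
    elif speed_law == "radius":                # speed shrinks in proportion to the radius
        speed = base_speed * radius / single_radius
    elif speed_law == "footprint":             # speed shrinks in proportion to the area
        speed = base_speed * footprint / total_area
    else:
        raise ValueError(f"unknown speed law: {speed_law}")
    # A moving disk sweeps a strip of width 2*radius (assumed sweep model).
    return n * 2.0 * radius * speed

rates = {law: [initial_coverage_rate(n, speed_law=law) for n in (1, 4, 16)]
         for law in ("constant", "radius", "footprint")}
```

Since the radius scales as n to the power -1/2, the combined rate scales as the square root of n at constant speed, stays flat when speed follows the radius, and falls as n to the power -1/2 when speed follows the footprint. This matches the three outcomes the abstract reports.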
Chinese Translation
在多代理系统中,有限资源应该集中在少数有能力的代理身上,还是分配给许多简单的代理?本文提出了n个资源共享问题的划分,其中一组n个代理平等共享一个公共资源(例如,货币预算、计算资源、物理尺寸)。我们在多代理覆盖的案例研究中展示了代理的圆盘形足迹面积随着n的增加而缩放为1/n。正式分析表明,初始覆盖率随着n的增加而增长。然而,如果代理的速度与其半径成比例下降,各种规模的群体表现相同;而如果速度与其足迹成比例下降,则单个代理的表现最佳。我们还展示了计算机模拟结果,其中资源划分增加了单个代理的失败率。这些模型和发现有助于识别最佳的分配水平,并为在资源限制下设计多代理系统提供指导。
cs.RO / 8 / 2604.26450

Reactive Motion Generation via Phase-varying Neural Potential Functions

通过相位变化神经势函数生成反应运动
Tekden, Ahmet, Kanoulas, Dimitrios, Billard, Aude, Bekiroglu, Yasemin
Abstract
Dynamical systems (DS) methods for Learning-from-Demonstration (LfD) provide stable, continuous policies from few demonstrations. First-order DS are effective for many point-to-point and periodic tasks, as long as a unique velocity is defined for each state. For tasks with intersections (e.g., drawing an "8"), extensions such as second-order dynamics or phase variables are often used. However, by incorporating velocity, second-order models become sensitive to disturbances near intersections, as velocity is used to disambiguate motion direction. Moreover, this disambiguation may fail when nearly identical position-velocity pairs correspond to different onward motions. In contrast, phase-based methods rely on open-loop time or phase variables, which limit their ability to recover after perturbations. We introduce Phase-varying Neural Potential Functions (PNPF), an LfD framework that conditions a potential function on a phase variable which is estimated directly from state progression, rather than on open-loop temporal inputs. This phase variable allows the system to handle state revisits, while the learned potential function generates local vector fields for reactive and stable control. PNPF generalizes effectively across point-to-point, periodic, and full 6D motion tasks, outperforms existing baselines on trajectories with intersections, and demonstrates robust performance in real-time robotic manipulation under external disturbances.
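The intersection problem described in the abstract is easy to see on a concrete figure-8. The sketch below uses a Lissajous curve as a stand-in demonstration. The curve and the function name `figure_eight` are assumptions for illustration and have nothing to do with PNPF's learned potential.

```python
import math

def figure_eight(s):
    """Position and velocity on a Lissajous figure-8, parameterised by phase s in [0, 2*pi)."""
    position = (math.sin(s), math.sin(2 * s))
    velocity = (math.cos(s), 2 * math.cos(2 * s))
    return position, velocity

# The curve passes through the origin twice, at phase 0 and at phase pi.
pos_a, vel_a = figure_eight(0.0)
pos_b, vel_b = figure_eight(math.pi)
```

Both visits share the same position but require different velocities, (1, 2) and (-1, 2), so a velocity field that depends on position alone cannot reproduce the motion. Adding the phase as an input separates the two visits, which is why the choice of phase estimate matters.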
Chinese Translation
动态系统(DS)方法用于示范学习(LfD),能够从少量示范中提供稳定、连续的策略。一阶动态系统(DS)在许多点对点和周期性任务中有效,只要为每个状态定义唯一的速度。对于具有交叉点的任务(例如,绘制“8”),通常使用二阶动态或相位变量等扩展。然而,通过引入速度,二阶模型在交叉点附近对干扰变得敏感,因为速度用于消歧运动方向。此外,当几乎相同的位置-速度对对应于不同的后续运动时,这种消歧可能会失败。相比之下,基于相位的方法依赖于开环时间或相位变量,这限制了它们在扰动后恢复的能力。我们提出了相位变化神经势函数(PNPF),这是一个LfD框架,它将势函数条件化于直接从状态进展估计的相位变量,而不是开环时间输入。这个相位变量使系统能够处理状态重访,而学习到的势函数生成局部向量场以实现反应性和稳定控制。PNPF在点对点、周期性和完整的6D运动任务中有效地推广,在具有交叉点的轨迹上优于现有基线,并在外部干扰下展示了实时机器人操作的强大性能。
cs.RO / 9 / 2604.26473

Alter-Art: Exploring Embodied Artistic Creation through a Robot Avatar

Alter-Art:通过机器人化身探索具身艺术创作
Park, Do Won, Bordini, Samuele, Grioli, Giorgio, Catalano, Manuel G., Bicchi, Antonio
Abstract
As with every emerging technology, new tools in the hands of artists reshape the nature of artwork creation. Current frameworks for robotics in arts deploy the robot as an autonomous creator or a collaborator, thus leaving a certain gap between the human artist and the machine. Now, we stand at the dawn of an era where artists can escape physical limitations and reshape their creative identity by inhabiting an alternative body. This new paradigm allows artists not only to command a robot remotely, but also to be a robot, to see and feel through it, experiencing a new embodied reality. Unlike virtual reality, where art is created in a digital dimension, in this case art creation is still firmly grounded in the material world: clay molded by mechanical hands, paint swept across a canvas or gestures performed on a physical stage alongside human actors. Through the robot avatar Alter-Ego, we explore the Alter-Art paradigm in dance, theater, and painting; it integrates immersive teleoperation and compliant actuation to enable a first-person creative experience. Analyzing qualitative artistic feedback, we investigate how embodiment shapes creative agency, identity and interaction with the environment. Our findings suggest that artists rapidly develop a sense of presence within the robotic body. The robot's physical constraints influence the creative process, manifesting differently across artistic domains. We highlight embodiment as a central design principle, contributing to social robotics and expanding the possibilities for telepresence and accessible artistic expression.
Chinese Translation
与每一项新兴技术一样,艺术家手中的新工具重塑了艺术作品创作的本质。目前,艺术领域中的机器人框架将机器人作为自主创作者或合作者,从而在人类艺术家与机器之间留下了一定的空白。如今,我们正处于一个时代的曙光中,艺术家可以摆脱身体的限制,通过占据一种替代身体来重塑他们的创造性身份。这一新范式使艺术家不仅能够远程操控机器人,还能够成为一台机器人,通过它来观察和感受,体验一种新的具身现实。与在数字维度中创作艺术的虚拟现实不同,在这种情况下,艺术创作仍然牢牢扎根于物质世界:由机械手塑造的粘土、在画布上挥洒的颜料或与人类演员共同在物理舞台上表演的手势。通过机器人化身Alter-Ego,我们在舞蹈、戏剧和绘画中探索Alter-Art范式;它整合了沉浸式远程操作和顺应性驱动,以实现第一人称的创作体验。通过分析定性的艺术反馈,我们研究了具身性如何塑造创造性代理、身份和与环境的互动。我们的研究结果表明,艺术家在机器人身体内迅速发展出一种存在感。机器人的物理限制影响了创作过程,在不同的艺术领域中表现出不同的方式。我们强调具身性作为一个核心设计原则,为社会机器人学做出贡献,并扩展了远程存在和可及艺术表达的可能性。
cs.RO / 10 / 2604.26504

HiPAN: Hierarchical Posture-Adaptive Navigation for Quadruped Robots in Unstructured 3D Environments

HiPAN:用于非结构化三维环境中四足机器人分层姿态自适应导航
Jeong, Jeil, Yoon, Minsung, Choi, Seokryun, Shin, Heechan, Yang, Taegeun, Yoon, Sung-eui
Abstract
Navigating quadruped robots in unstructured 3D environments poses significant challenges, requiring goal-directed motion, effective exploration to escape from local minima, and posture adaptation to traverse narrow, height-constrained spaces. Conventional approaches employ a sequential mapping-planning pipeline but suffer from accumulated perception errors and high computational overhead, restricting their applicability on resource-constrained platforms. To address these challenges, we propose Hierarchical Posture-Adaptive Navigation (HiPAN), a framework that operates directly on onboard depth images at deployment. HiPAN adopts a hierarchical design: a high-level policy generates strategic navigation commands (planar velocity and body posture), which are executed by a low-level, posture-adaptive locomotion controller. To mitigate myopic behaviors and facilitate long-horizon navigation, we introduce Path-Guided Curriculum Learning, which progressively extends the navigation horizon from reactive obstacle avoidance to strategic navigation. In simulation, HiPAN achieves higher navigation success rates and greater path efficiency than classical reactive planners and end-to-end baselines, while real-world experiments further validate its applicability across diverse, unstructured 3D environments.
Chinese Translation
在非结构化三维环境中导航四足机器人面临重大挑战,需要目标导向的运动、有效的探索以摆脱局部最小值,以及姿态适应以穿越狭窄和高度受限的空间。传统方法采用顺序的映射-规划流程,但受到累积感知误差和高计算开销的影响,限制了其在资源受限平台上的适用性。为了解决这些挑战,我们提出了分层姿态自适应导航(HiPAN),这是一个直接在部署时对机载深度图像进行操作的框架。HiPAN采用分层设计:高层策略生成战略导航指令(平面速度和身体姿态),由低层姿态自适应运动控制器执行。为了减轻短视行为并促进长时间导航,我们引入了路径引导课程学习,该方法逐步将导航视野从反应式障碍物规避扩展到战略导航。在仿真中,HiPAN实现了比经典反应式规划器和端到端基线更高的导航成功率和更大的路径效率,而现实世界实验进一步验证了其在多样化非结构化三维环境中的适用性。
cs.RO / 11 / 2604.26509

3D Generation for Embodied AI and Robotic Simulation: A Survey

面向具身人工智能和机器人仿真的3D生成:综述
Ye, Tianwei, Mao, Yifan, Liao, Minwen, Liu, Jian, Guo, Chunchao, Du, Dazhao, Shou, Quanxin, Zhu, Fangqi, Guo, Song
Abstract
Embodied AI and robotic systems increasingly depend on scalable, diverse, and physically grounded 3D content for simulation-based training and real-world deployment. While 3D generative modeling has advanced rapidly, embodied applications impose requirements far beyond visual realism: generated objects must carry kinematic structure and material properties, scenes must support interaction and task execution, and the resulting content must bridge the gap between simulation and reality. This paper presents the first survey of 3D generation for embodied AI and organizes the literature around three roles that 3D generation plays in embodied systems. As a Data Generator, 3D generation produces simulation-ready objects and assets, including articulated, physically grounded, and deformable content for downstream interaction; for Simulation Environments, it constructs interactive and task-oriented worlds, spanning structure-aware, controllable, and agentic scene generation; and as a Sim2Real Bridge, it supports digital twin reconstruction, data augmentation, and synthetic demonstrations for downstream robot learning and real-world transfer. We also show that the field is shifting from visual realism toward interaction readiness, and we identify the main bottlenecks that must be addressed for 3D generation to become a dependable foundation for embodied intelligence: limited physical annotations, the gap between geometric quality and physical validity, fragmented evaluation, and the persistent sim-to-real divide. Our project page is at https://3dgen4robot.github.io.
Chinese Translation
具身人工智能和机器人系统日益依赖可扩展、多样化和物理基础的3D内容,以进行基于仿真的训练和现实世界的部署。尽管3D生成建模迅速发展,但具身应用的要求远超视觉真实感:生成的物体必须具备运动结构和材料属性,场景必须支持交互和任务执行,生成的内容必须弥合仿真与现实之间的差距。本综述首次对具身人工智能的3D生成进行综述,并围绕3D生成在具身系统中扮演的三种角色组织文献。在数据生成器中,3D生成产生适合仿真的物体和资产,包括用于下游交互的关节式、物理基础和可变形内容;在仿真环境中,它构建交互式和任务导向的世界,涵盖结构感知、可控和智能场景生成;在仿真到现实桥梁中,它支持数字双胞胎重建、数据增强和合成演示,以促进下游机器人学习和现实世界转移。我们还指出,该领域正从视觉真实感转向交互准备,并识别出主要瓶颈,包括有限的物理注释、几何质量与物理有效性之间的差距、评估的碎片化以及持续存在的仿真到现实的鸿沟,这些问题必须得到解决,以使3D生成成为具身智能的可靠基础。我们的项目页面地址为 https://3dgen4robot.github.io。
cs.RO / 12 / 2604.26569

LLM-Flax: Generalizable Robotic Task Planning via Neuro-Symbolic Approaches with Large Language Models

LLM-Flax:通过大型语言模型的神经符号方法实现可泛化的机器人任务规划
Kim, Seongmin, Lee, Daegyu
Abstract
Deploying a neuro-symbolic task planner on a new domain today requires significant manual effort: a domain expert must author relaxation and complementary rules, and hundreds of training problems must be solved to supervise a Graph Neural Network (GNN) object scorer. We propose LLM-Flax, a three-stage framework that eliminates all three sources of manual effort using a locally hosted LLM given only a PDDL domain file. Stage 1 automatically generates relaxation and complementary rules via structured prompting with format validation and self-correction. Stage 2 introduces LLM-guided failure recovery with a feasibility-gated budget policy that explicitly reserves API latency cost before each LLM call, preventing the downstream relaxation fallback from being starved. Stage 3 replaces the domain-trained GNN entirely with zero-shot LLM object importance scoring, requiring no training data. We evaluate all three stages on the MazeNamo benchmark across 10x10, 12x12, and 15x15 grids (8 benchmarks total). LLM-Flax achieves average SR 0.945 versus the manual baseline's 0.828 (+0.117), matching or outperforming manual rules on every one of the eight benchmarks. On 12x12 Expert, LLM-Flax attains SR 0.733 where the manual planner fails entirely (SR 0.000); on 15x15 Hard, it achieves SR 1.000 versus Manual's 0.900. Stage 3 demonstrates feasibility (SR 0.720 on 12x12 Hard with no training data) but faces a context-window bottleneck at scale, pointing to the primary open challenge for future work.
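The idea of reserving latency budget before each LLM call, so that the relaxation fallback is never starved, can be sketched as a simple gate. The abstract does not give the actual policy, so everything here is an assumption: the function names `can_call_llm` and `plan_with_recovery`, the 60-second budget, the 8-second latency estimate, and the 20-second fallback reserve are all hypothetical.

```python
def can_call_llm(elapsed_s, total_budget_s, est_llm_latency_s, fallback_reserve_s):
    """Allow an LLM call only if the fallback would still have its reserved time afterwards."""
    remaining_after_call = total_budget_s - elapsed_s - est_llm_latency_s
    return remaining_after_call >= fallback_reserve_s

def plan_with_recovery(attempt_costs_s, total_budget_s=60.0,
                       est_llm_latency_s=8.0, fallback_reserve_s=20.0):
    """Count how many LLM recovery calls fit before control passes to the fallback."""
    elapsed, calls = 0.0, 0
    for cost in attempt_costs_s:
        if not can_call_llm(elapsed, total_budget_s, est_llm_latency_s, fallback_reserve_s):
            break  # gate closed: keep the remaining time for the relaxation fallback
        elapsed += cost  # actual latency of this call (simulated here)
        calls += 1
    return calls, total_budget_s - elapsed

calls_made, time_left = plan_with_recovery([9.0, 10.0, 12.0, 8.0, 9.0])
```

In this run four recovery calls are admitted and the fifth is refused, leaving 21 seconds for the fallback. Note that the gate uses an estimate, so a call that runs longer than estimated can still eat into the reserve; a stricter variant would gate on a worst-case latency.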
Chinese Translation
在新领域中部署神经符号任务规划器目前需要大量的手动工作:领域专家必须编写松弛和补充规则,并且必须解决数百个训练问题以监督图神经网络(Graph Neural Network, GNN)对象评分器。我们提出了LLM-Flax,这是一个三阶段框架,利用本地托管的大型语言模型(Large Language Model, LLM)消除了所有三种手动工作源,仅需提供一个PDDL领域文件。第一阶段通过结构化提示生成松弛和补充规则,并进行格式验证和自我修正。第二阶段引入了LLM引导的故障恢复,采用可行性门控预算策略,在每次LLM调用之前明确保留API延迟成本,防止下游松弛回退被饿死。第三阶段完全用零样本LLM对象重要性评分替代领域训练的GNN,不需要训练数据。我们在MazeNamo基准上评估了所有三个阶段,涉及10x10、12x12和15x15的网格(共8个基准)。LLM-Flax的平均成功率(Success Rate, SR)为0.945,而手动基线为0.828(+0.117),在八个基准中的每一个都与手动规则相匹配或超越。在12x12 Expert上,LLM-Flax的SR为0.733,而手动规划器完全失败(SR 0.000);在15x15 Hard上,其SR为1.000,而手动规划器为0.900。第三阶段展示了可行性(在12x12 Hard上SR为0.720且没有训练数据),但在规模上面临上下文窗口瓶颈,指出了未来工作的主要开放挑战。
cs.RO / 13 / 2604.26626

STAR-Filter: Efficient Convex Free-Space Approximation via Starshaped Set Filtering in Noisy Environments

STAR-Filter:通过星形集合过滤在噪声环境中实现高效的凸自由空间近似
Wu, Yuwei, Zhao, Yichen, Ong, Dexter, Kumar, Vijay
Abstract
Approximating collision-free space is fundamental to robot planning in complex environments. Convex geometric representations, such as polytopes and ellipsoids, are widely employed due to their structural properties, which can be easily integrated with convex optimization. Iterative optimization-based inflation methods can generate large volume polytopes in cluttered environments, but their efficiency degrades as the obstacle set becomes more complex or when sensor data are noisy. These methods are also sensitive to initialization and often rely on accurate geometric models. In this paper, we propose the STAR-Filter, a lightweight framework that employs starshaped set construction as a fast filter for convex region generation in collision-free space. By identifying obstacle points as active supporting constraints, the proposed method significantly reduces redundant computation while preserving feasibility and robustness to sensor noise. We provide theoretical and numerical analyses that characterize the structural properties of the starshaped set and proposed pipeline in environments of varying complexity. Simulation results show that the proposed framework achieves the lowest computation time and reduces conservativeness in polytope generation for real-world noisy and large-scale data. We demonstrate the effectiveness of the framework for Safe Flight Corridor (SFC) generation and agile quadrotor planning in noisy environments.
Chinese Translation
在复杂环境中,近似无碰撞空间是机器人规划的基础。由于其结构特性,凸几何表示(如多面体和椭球体)被广泛应用,且易于与凸优化结合。基于迭代优化的膨胀方法能够在杂乱环境中生成大体积多面体,但当障碍物集合变得更加复杂或传感器数据噪声较大时,其效率会降低。这些方法对初始化敏感,且通常依赖于准确的几何模型。本文提出了STAR-Filter,一个轻量级框架,利用星形集合构造作为在无碰撞空间中生成凸区域的快速过滤器。通过将障碍点识别为活跃的支持约束,所提方法显著减少了冗余计算,同时保持了对传感器噪声的可行性和鲁棒性。我们提供了理论和数值分析,表征了星形集合及所提管道在不同复杂度环境中的结构特性。仿真结果表明,所提框架在处理真实世界的噪声和大规模数据时,达到了最低的计算时间,并减少了多面体生成的保守性。我们展示了该框架在安全飞行走廊(Safe Flight Corridor, SFC)生成和噪声环境中灵活四旋翼规划中的有效性。
cs.RO / 14 / 2604.26637

ATLAS: An Annotation Tool for Long-horizon Robotic Action Segmentation

ATLAS:一种用于长时间机器人动作分割的注释工具
Stanovcic, Sergej, Sliwowski, Daniel, Lee, Dongheui
Abstract
Annotating long-horizon robotic demonstrations with precise temporal action boundaries is crucial for training and evaluating action segmentation and manipulation policy learning methods. Existing annotation tools, however, are often limited: they are designed primarily for vision-only data, do not natively support synchronized visualization of robot-specific time-series signals (e.g., gripper state or force/torque), or require substantial effort to adapt to different dataset formats. In this paper, we introduce ATLAS, an annotation tool tailored for long-horizon robotic action segmentation. ATLAS provides time-synchronized visualization of multi-modal robotic data, including multi-view video and proprioceptive signals, and supports annotation of action boundaries, action labels, and task outcomes. The tool natively handles widely used robotics dataset formats such as ROS bags and the Reinforcement Learning Dataset (RLDS) format, and provides direct support for specific datasets such as REASSEMBLE. ATLAS can be easily extended to new formats via a modular dataset abstraction layer. Its keyboard-centric interface minimizes annotation effort and improves efficiency. In experiments on a contact-rich assembly task, ATLAS reduced the average per-action annotation time by at least 6% compared to ELAN, while the inclusion of time-series data improved temporal alignment with expert annotations by more than 2.8% and decreased boundary error fivefold compared to vision-only annotation tools.
Chinese Translation
为长时间机器人演示标注精确的时间动作边界对于训练和评估动作分割及操作策略学习方法至关重要。然而,现有的注释工具往往存在局限性:它们主要设计用于仅视觉数据,不原生支持机器人特定时间序列信号(如夹持器状态或力/扭矩)的同步可视化,或需要大量工作来适应不同的数据集格式。在本文中,我们介绍了ATLAS,一种专为长时间机器人动作分割量身定制的注释工具。ATLAS提供多模态机器人数据的时间同步可视化,包括多视角视频和本体感知信号,并支持动作边界、动作标签和任务结果的注释。该工具原生处理广泛使用的机器人数据集格式,如ROS包和强化学习数据集(RLDS)格式,并为特定数据集(如REASSEMBLE)提供直接支持。ATLAS可以通过模块化数据集抽象层轻松扩展到新格式。其以键盘为中心的界面减少了注释工作量,提高了效率。在一个接触丰富的装配任务实验中,与ELAN相比,ATLAS将每个动作的平均注释时间减少了至少6%,而时间序列数据的引入使得与专家注释的时间对齐提高了超过2.8%,并且与仅视觉注释工具相比,边界错误减少了五倍。
cs.RO / 15 / 2604.26689

Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

用于组合机器人策略中技能更新的原子探针治理
Qin, Xue, Luan, Simin, See, John, Yang, Cong, Li, Zhijun
Abstract
Skill libraries in deployed robotic systems are continually updated through fine-tuning, fresh demonstrations, or domain adaptation, yet existing typed-composition methods (BLADE, SymSkill, Generative Skill Chaining) treat the library as frozen at test time and do not analyze how composition outcomes change when a skill is replaced. We introduce a paired-sampling cross-version swap protocol on robosuite manipulation tasks to characterize this dimension of compositional skill learning. On a dual-arm peg-in-hole task we discover a dominant-skill effect: one ECM achieves 86.7% atomic success rate while every other ECM is at or below 26.7%, and whether this dominant ECM enters a composition shifts the success rate by up to +50pp. We characterize the boundary on a simpler pick task where all atomic policies saturate at 100% and the effect is undefined. Across three tasks we further find that off-policy behavioral distance metrics fail to identify the dominant ECM, ruling out the natural cheap predictor. We propose an atomic-quality probe and a Hybrid Selector combining per-skill probes (zero per-decision cost) with selective composition revalidation (full cost), and characterize its Pareto frontier on 144 skill-update decisions. On T6 the atomic-only probe sits 23pp below full revalidation (64.6% vs 87.5% oracle match) at zero per-decision cost; a Hybrid Selector with m=10 closes most of that gap to ~12pp at 46% of full-revalidation cost. On the cross-task average over 144 events, atomic-only is within 3pp of full revalidation under a mixed-oracle caveat. The atomic-quality probe is, to our knowledge, the first principled, deployment-ready primitive for skill-update governance in compositional robot policies.
Chinese Translation
在部署的机器人系统中,技能库通过微调、新的演示或领域适应不断更新,然而现有的类型组合方法(BLADE、SymSkill、Generative Skill Chaining)在测试时将库视为冻结状态,并未分析当技能被替换时组合结果如何变化。我们在robosuite操控任务上引入了一种配对采样的跨版本交换协议,以表征组合技能学习的这一维度。在双臂插销入孔任务中,我们发现了主导技能效应:一个ECM(有效控制模型)实现了86.7%的原子成功率,而其他每个ECM的成功率均在或低于26.7%,并且该主导ECM是否进入组合会将成功率提升多达50个百分点。我们在一个更简单的取物任务中表征了边界,在该任务中所有原子策略的成功率均饱和在100%,而该效应未定义。在三个任务中,我们进一步发现,离策略行为距离度量未能识别出主导ECM,排除了自然的廉价预测器。我们提出了一种原子质量探针和一种混合选择器,结合了每个技能的探针(每个决策零成本)与选择性组合重新验证(全成本),并在144个技能更新决策上表征其帕累托前沿。在T6任务中,只有原子探针的成功率比完全重新验证低23个百分点(64.6%对比87.5%的预言匹配),且每个决策成本为零;而混合选择器(m=10)将大部分差距缩小至约12个百分点,成本为完全重新验证的46%。在144个事件的跨任务平均中,原子探针在混合预言的警告下与完全重新验证相差仅3个百分点。根据我们所知,原子质量探针是用于组合机器人策略中技能更新治理的首个原则性、可部署的原始工具。
cs.RO / 16 / 2604.26694

Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

基于视频先验的统一4D世界动作建模与异步去噪
Guo, Jun, Li, Qiwei, Li, Peiyan, Chen, Zilong, Sun, Nan, Su, Yifei, Wang, Heyun, Zhang, Yuan, Li, Xinghang, Liu, Huaping
Abstract
We propose X-WAM, a Unified 4D World Model that unifies real-time robotic action execution and high-fidelity 4D world synthesis (video + 3D reconstruction) in a single framework, addressing the critical limitations of prior unified world models (e.g., UWM) that only model 2D pixel-space and fail to balance action efficiency and world modeling quality. To leverage the strong visual priors of pretrained video diffusion models, X-WAM imagines the future world by predicting multi-view RGB-D videos, and obtains spatial information efficiently through a lightweight structural adaptation: replicating the final few blocks of the pretrained Diffusion Transformer into a dedicated depth prediction branch for the reconstruction of future spatial information. Moreover, we propose Asynchronous Noise Sampling (ANS) to jointly optimize generation quality and action decoding efficiency. ANS applies a specialized asynchronous denoising schedule during inference, which rapidly decodes actions with fewer steps to enable efficient real-time execution, while dedicating the full sequence of steps to generate high-fidelity video. Rather than entirely decoupling the timesteps during training, ANS samples from their joint distribution to align with the inference distribution. Pretrained on over 5,800 hours of robotic data, X-WAM achieves 79.2% and 90.7% average success rate on RoboCasa and RoboTwin 2.0 benchmarks, while producing high-fidelity 4D reconstruction and generation surpassing existing methods in both visual and geometric metrics.
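A very rough picture of an asynchronous denoising schedule is two noise-level sequences that share one loop, where the action tokens reach the clean state after a few iterations and the video tokens use all of them. The sketch below is not X-WAM's sampler. The linear schedules, the step counts, and the function name `async_schedules` are assumptions for illustration only.

```python
def async_schedules(video_steps, action_steps):
    """Per-iteration noise levels for video and action tokens; 1.0 is pure noise, 0.0 is clean."""
    schedule = []
    for k in range(1, video_steps + 1):
        video_t = 1.0 - k / video_steps
        action_t = max(0.0, 1.0 - k / action_steps)  # actions reach 0 early and stay clean
        schedule.append((video_t, action_t))
    return schedule

schedule = async_schedules(video_steps=10, action_steps=2)
# First iteration (1-based) at which the action tokens are fully denoised.
first_clean_action = next(i for i, (_, a) in enumerate(schedule, start=1) if a == 0.0)
```

Here the actions are ready after 2 of 10 iterations and could be executed while the video continues to be refined. The abstract adds that training samples the two timesteps from a joint distribution so the model sees these mismatched noise levels, which this sketch does not cover.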
Chinese Translation
我们提出了X-WAM,一个统一的4D世界模型,它将实时机器人动作执行与高保真4D世界合成(视频 + 3D重建)整合到一个单一框架中,解决了先前统一世界模型(例如UWM)仅建模2D像素空间、且未能平衡动作效率与世界建模质量的关键局限性。为了利用预训练视频扩散模型的强视觉先验,X-WAM通过预测多视角RGB-D视频来想象未来世界,并通过轻量级结构适配高效获取空间信息:将预训练扩散变换器(Diffusion Transformer)的最后几个模块复制到专用的深度预测分支中,以重建未来的空间信息。此外,我们提出了异步噪声采样(ANS),以联合优化生成质量和动作解码效率。ANS在推理过程中应用专门的异步去噪调度:以较少的步数快速解码动作,从而实现高效的实时执行,同时将完整的步骤序列用于生成高保真视频。ANS在训练期间并未完全解耦时间步,而是从它们的联合分布中采样,以与推理分布对齐。X-WAM在超过5800小时的机器人数据上进行预训练,在RoboCasa和RoboTwin 2.0基准测试中分别实现了79.2%和90.7%的平均成功率,同时生成高保真的4D重建与生成结果,在视觉和几何指标上均超越现有方法。
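The asynchronous schedule described above can be illustrated with a small sketch. It is one possible reading only: the linear noise levels, the step counts, and the function name `async_schedule` are assumptions, not details taken from the abstract. The point it shows is that the action stream reaches zero noise after a few iterations while the video stream uses the full sequence.

```python
def async_schedule(video_steps, action_steps):
    """Return (t_video, t_action) noise levels in [0, 1] for each denoising iteration.

    Assumption: both streams follow a linear schedule; the action stream simply
    finishes after `action_steps` iterations and stays clean afterwards.
    """
    schedule = []
    for i in range(1, video_steps + 1):
        t_video = 1.0 - i / video_steps                # video uses every iteration
        t_action = max(0.0, 1.0 - i / action_steps)    # actions finish early
        schedule.append((t_video, t_action))
    return schedule


def first_clean_action_step(schedule):
    """Index of the first iteration at which the action stream is fully denoised."""
    return next(i for i, (_, t_action) in enumerate(schedule) if t_action == 0.0)
```

With `async_schedule(10, 2)` the action stream is clean at the second iteration, so a controller could start executing while the remaining eight iterations refine the video.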
cs.RO / 17 / 2604.26833

Rule-based High-Level Coaching for Goal-Conditioned Reinforcement Learning in Search-and-Rescue UAV Missions Under Limited-Simulation Training

基于规则的高层次指导在有限仿真训练下的目标条件强化学习在搜索与救援无人机任务中的应用
Ramezani, Mahya, Voos, Holger
Abstract
This paper presents a hierarchical decision-making framework for unmanned aerial vehicle (UAV) missions motivated by search-and-rescue (SAR) scenarios under limited simulation training. The framework combines a fixed rule-based high-level advisor with an online goal-conditioned low-level reinforcement learning (RL) controller. To stress-test early adaptation, we also consider a strict no-pretraining deployment regime. The high-level advisor is defined offline from a structured task specification and compiled into deterministic rules. It provides interpretable mission- and safety-aware guidance through recommended actions, avoided actions, and regime-dependent arbitration weights. The low-level controller learns online from task-defined dense rewards and reuses experience through a mode-aware prioritized replay mechanism augmented with rule-derived metadata. We evaluate the framework on two tasks: battery-aware multi-goal delivery and moving-target delivery in obstacle-rich environments. Across both tasks, the proposed method improves early safety and sample efficiency primarily by reducing collision terminations, while preserving the ability to adapt online to scenario-specific dynamics.
Chinese Translation
本文提出了一种针对无人机(UAV)任务的分层决策框架,该框架受到有限仿真训练下的搜索与救援(SAR)场景的启发。该框架结合了固定的基于规则的高层次顾问与在线目标条件的低层次强化学习(RL)控制器。为了对早期适应性进行压力测试,我们还考虑了严格的无预训练部署模式。高层次顾问是根据结构化任务规范离线定义并编译成确定性规则的。它通过推荐的行动、避免的行动以及基于模式的仲裁权重提供可解释的任务和安全意识指导。低层次控制器则在线学习任务定义的密集奖励,并通过一种模式感知的优先重放机制重用经验,该机制增强了规则派生的元数据。我们在两个任务上评估了该框架:电池感知的多目标交付和在障碍物丰富环境中的移动目标交付。在这两个任务中,所提出的方法主要通过减少碰撞终止来提高早期安全性和样本效率,同时保持在线适应特定场景动态的能力。
cs.RO / 18 / 2604.26839

Walk With Me: Long-Horizon Social Navigation for Human-Centric Outdoor Assistance

与我同行:面向人类中心的户外长远社交导航
Zhang, Lingfeng, Hao, Xiaoshuai, Bu, Xizhou, Tang, Yingbo, Li, Hongsheng, Lu, Jinghui, Wei, Xiu-shen, Ma, Jiayi, Liu, Yu, Zhang, Jing, Ye, Hangjun, Liang, Xiaojun, Chen, Long, Ding, Wenbo
Abstract
Assisting humans in open-world outdoor environments requires robots to translate high-level natural-language intentions into safe, long-horizon, and socially compliant navigation behavior. Existing map-based methods rely on costly pre-built HD maps, while learning-based policies are mostly limited to indoor and short-horizon settings. To bridge this gap, we propose Walk with Me, a map-free framework for long-horizon social navigation from high-level human instructions. Walk with Me leverages GPS context and lightweight candidate points-of-interest from a public map API for semantic destination grounding and waypoint proposal. A High-Level Vision-Language Model grounds abstract instructions into concrete destinations and plans coarse waypoint sequences. During execution, an observation-aware routing mechanism determines whether the Low-Level Vision-Language-Action policy can handle the current situation or whether explicit safety reasoning from the High-Level VLM is needed. Routine segments are executed by the Low-Level VLA, while complex situations such as crowded crossings trigger high-level reasoning and stop-and-wait behavior when unsafe. By combining semantic intent grounding, map-free long-horizon planning, safety-aware reasoning, and low-level action generation, Walk with Me enables practical outdoor social navigation for human-centric assistance.
Chinese Translation
在开放世界的户外环境中协助人类,需要机器人将高层次的自然语言意图转化为安全、长远且符合社会规范的导航行为。现有的基于地图的方法依赖于昂贵的预构建高清地图,而基于学习的策略大多局限于室内和短期环境。为了解决这一问题,我们提出了“与我同行”(Walk with Me),这是一个无地图的框架,旨在根据高层次的人类指令实现长远的社交导航。“与我同行”利用GPS上下文和来自公共地图API的轻量级候选兴趣点进行语义目的地定位和路径点提议。高层次视觉-语言模型将抽象指令转化为具体目的地,并规划粗略的路径点序列。在执行过程中,观察感知路由机制决定低层次视觉-语言-动作(VLA)策略是否能够处理当前情况,或者是否需要高层次视觉-语言模型(VLM)的显式安全推理。常规段落由低层次VLA执行,而复杂情况如拥挤的交叉口则会触发高层次推理,并在不安全时采取停顿等待的行为。通过结合语义意图定位、无地图的长远规划、安全感知推理和低层次动作生成,“与我同行”使得人类中心的户外社交导航成为可能。
cs.RO / 19 / 2604.26848

STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation

STARRY:用于机器人操作的时空动作中心世界建模
Tian, Yuxuan, Jin, Yurun, Yu, Bin, Shi, Yukun, Wu, Hao, Liu, Chi Harold, Chen, Kai, Huang, Cong
Abstract
Robotic manipulation critically requires reasoning about future spatial-temporal interactions, yet existing VLA policies and world-model-enhanced policies do not fully model action-relevant spatial-temporal interaction structure. We propose STARRY, a world-model-enhanced action-generation policy that aligns spatial-temporal prediction with action generation. STARRY jointly denoises future spatial-temporal latents and action sequences, and introduces Geometry-Aware Selective Attention Modulation to convert predicted depth and end-effector geometry into token-aligned weights for selective action-attention modulation. On RoboTwin 2.0, STARRY achieves 93.82% / 93.30% average success under Clean and Randomized settings. Real-world experiments further improve average success from 42.5% to 70.8% over $\pi_{0.5}$, demonstrating the effectiveness of action-centric spatial-temporal world modeling for spatial-temporally demanding robotic action generation.
Chinese Translation
机器人操作在很大程度上需要对未来的时空交互进行推理,然而现有的视觉-语言-动作(VLA)策略和世界模型增强的策略并未充分建模与动作相关的时空交互结构。我们提出了STARRY,这是一种世界模型增强的动作生成策略,它将时空预测与动作生成对齐。STARRY联合去噪未来的时空潜变量和动作序列,并引入几何感知选择性注意力调制(Geometry-Aware Selective Attention Modulation),将预测的深度和末端执行器几何信息转换为与标记对齐的权重,以实现选择性动作注意力调制。在RoboTwin 2.0上,STARRY在干净(Clean)和随机化(Randomized)设置下分别达到了93.82%和93.30%的平均成功率。真实世界实验中,相较于$\pi_{0.5}$,平均成功率进一步从42.5%提高到70.8%,展示了以动作为中心的时空世界建模在时空要求高的机器人动作生成中的有效性。
cs.RO / 20 / 2604.26897

Stochastic Entanglement of Deterministic Origami Tentacles For Universal Robotic Gripping

用于通用机器人抓取的确定性折纸触手的随机缠绕
Boron, Alec, Zheng, Bokun, Zhou, Ziyang, Naughton, Noel, Li, Suyi
Abstract
Origami-inspired robotic grippers have shown promising potential for object manipulation tasks due to their compact volume and mechanical flexibility. However, robust capture of objects with random shapes in dynamic working environments often comes at the cost of additional actuation channels and control complexity. Here, we introduce a tendon-driven origami tentacle gripper capable of universal object gripping by exploiting a synergy between local, deterministic deformation programming and global, stochastic entanglements. Each origami tentacle is made by cutting thin Mylar sheets; it features carefully placed holes for routing an actuation tendon, origami creases for controlling the deformation, and a tapered shape. By tailoring these design features, one can prescribe the shrinking, bending, and twisting deformation, eventually creating deterministic coiling with a simple tendon pull. Then, when multiple coiling tentacles are placed in proximity, stochastic entanglement emerges, allowing the tentacles to braid, knot, and grip objects with random shapes. We derived a simulation model by integrating origami mechanics with Cosserat rods to correlate origami design, tendon deformation, and their collective gripping performance. Then, we experimentally tested how these coiling and entangling origami tentacles can grasp objects under gravity and in water. A stow-and-release deployment mechanism was also tested to simulate in-orbit grasping. Overall, the entertaining origami tentacle gripper presents a new strategy for robust object grasping with simple design and actuation.
Chinese Translation
受折纸启发的机器人抓取器因其紧凑的体积和机械灵活性在物体操作任务中展现出良好的潜力。然而,在动态工作环境中对随机形状物体的稳健捕捉往往需要额外的驱动通道和控制复杂性。在此,我们介绍了一种基于腱驱动的折纸触手抓取器,能够通过局部确定性变形编程与全局随机缠绕之间的协同作用实现通用物体抓取。每个折纸触手由薄的Mylar薄膜切割而成;它具有精心放置的孔用于引导驱动腱、折纸褶皱用于控制变形,以及锥形结构。通过调整这些设计特征,可以规定收缩、弯曲和扭转变形,最终通过简单的腱拉动实现确定性的缠绕。然后,当多个缠绕触手靠近时,随机缠绕现象出现,使得触手能够编织、打结并抓取随机形状的物体。我们通过将折纸力学与Cosserat杆结合,推导出一个仿真模型,以关联折纸设计、腱变形及其集体抓取性能。随后,我们实验测试了这些缠绕和缠结的折纸触手在重力和水中抓取物体的能力。我们还测试了一种收放部署机制,以模拟在轨道抓取。总体而言,这种有趣的折纸触手抓取器为简单设计和驱动下的稳健物体抓取提供了一种新策略。
cs.RO / 21 / 2604.26910

Bi-Level Optimization for Contact and Motion Planning in Rope-Assisted Legged Robots

用于绳索辅助腿部机器人接触与运动规划的双层优化
Malacarne, Ruben, Tsikelis, Ioannis, Hoffman, Enrico Mingo, Focchi, Michele
Abstract
This paper presents a planning pipeline framework for locomotion in rope-assisted robots climbing vertical surfaces. The proposed framework is formulated as a bi-level optimization scheme that addresses a mixed-integer problem: selecting feasible terrain regions for landing while simultaneously optimizing the control inputs, namely rope tensions and leg forces, and landing location. The outer level of the optimization is solved using the Cross-Entropy Method, while the inner level relies on gradient-based nonlinear optimization to compute dynamically feasible motions. The approach is validated on a novel climbing robot platform, ALPINE, across a variety of challenging terrain configurations.
Chinese Translation
本文提出了一种用于绳索辅助机器人在垂直表面上爬行的运动规划框架。所提出的框架被构建为一个双层优化方案,旨在解决一个混合整数问题:选择可行的着陆地形区域,同时优化控制输入,即绳索张力和腿部力量,以及着陆位置。优化的外层通过交叉熵方法(Cross-Entropy Method)求解,而内层则依赖于基于梯度的非线性优化来计算动态可行的运动。该方法在一种新型爬行机器人平台ALPINE上进行了验证,涵盖了多种具有挑战性的地形配置。
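The bi-level structure described above (a Cross-Entropy Method outer loop over a discrete region choice, and a gradient-based inner solve) can be sketched on a toy 1-D problem. Everything concrete here is hypothetical: the regions, the target coordinate, the quadratic cost, and the function names stand in for the rope tensions, leg forces, and dynamics of the actual planner, which the abstract does not specify.

```python
import random

# Hypothetical landing regions: (lower, upper) bounds along one wall axis, in metres.
REGIONS = [(0.0, 1.0), (2.0, 3.0), (4.5, 5.0)]
TARGET = 2.8  # hypothetical desired landing coordinate


def inner_solve(region, steps=200, lr=0.1):
    """Inner level: projected gradient descent on the landing location x."""
    lo, hi = region
    x = 0.5 * (lo + hi)
    for _ in range(steps):
        grad = 2.0 * (x - TARGET)              # derivative of (x - TARGET)^2
        x = min(hi, max(lo, x - lr * grad))    # project back into the region
    return x, (x - TARGET) ** 2


def cem_outer(iters=20, samples=30, elite_frac=0.2, smoothing=0.7, seed=0):
    """Outer level: Cross-Entropy Method over the discrete region index."""
    rng = random.Random(seed)
    probs = [1.0 / len(REGIONS)] * len(REGIONS)
    for _ in range(iters):
        draws = rng.choices(range(len(REGIONS)), weights=probs, k=samples)
        # Score each sampled region by the cost of its inner solution.
        scored = sorted(draws, key=lambda i: inner_solve(REGIONS[i])[1])
        elites = scored[: max(1, int(elite_frac * samples))]
        freq = [elites.count(i) / len(elites) for i in range(len(REGIONS))]
        # Move the sampling distribution towards the elite frequencies.
        probs = [smoothing * f + (1 - smoothing) * p for f, p in zip(freq, probs)]
    best = max(range(len(REGIONS)), key=lambda i: probs[i])
    return best, inner_solve(REGIONS[best])
```

Splitting the problem this way keeps the integer decision out of the gradient-based solver: the inner problem stays smooth, and the sampling-based outer loop only needs cost values.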
计算机视觉 (Computer Vision)
85
cs.CV / 1 / 2604.26025

Generalized Disguise Makeup Presentation Attack Detection Using an Attention-Guided Patch-Based Framework

基于注意力引导的补丁框架的广义伪装化妆展示攻击检测
Taraghi, Fateme, Aghaei, Atefe, Moghaddam, Mohsen Ebrahimi
Abstract
Despite significant advances in facial recognition systems, they remain vulnerable to face presentation attacks. Among them, disguise makeup attacks are particularly challenging, as they use advanced cosmetics, prosthetic components, and artificial materials to realistically alter facial appearance, often making detection difficult even for humans. Despite their importance, this problem remains underexplored, and publicly available datasets are limited. To address this, we propose a generalized disguise makeup presentation attack detection framework. The method adopts a two-phase design in which a style-invariant full-face model, trained with metric learning and enhanced by a whitening transformation, extracts region attention scores via Grad-CAM. These scores guide a patch-based phase that performs localized analysis using region-specific subnetworks trained with metric learning for fine-grained discrimination. We also construct a new, diverse dataset of live and disguise makeup faces collected under real-world conditions, covering variations in subjects, environments, and disguise materials. Experimental results demonstrate strong generalization across both the collected dataset and SIW-Mv2, achieving 8.97% ACER and 9.76% EER on the collected dataset, and 0% ACER on Obfuscation and Impersonation and 1.34% on Cosmetics attacks of SIW-Mv2. The proposed method consistently outperforms prior works while maintaining robust performance across other spoof types.
Chinese Translation
尽管面部识别系统取得了显著进展,但它们仍然容易受到面部展示攻击的影响。其中,伪装化妆攻击尤为具有挑战性,因为它们使用先进的化妆品、假体组件和人造材料来逼真地改变面部外观,通常使得即使是人类也难以检测。尽管这一问题的重要性不言而喻,但相关研究仍然较少,且公开可用的数据集有限。为了解决这一问题,我们提出了一种广义伪装化妆展示攻击检测框架。该方法采用两阶段设计,其中一个风格不变的全脸模型通过度量学习进行训练,并通过白化变换增强,利用Grad-CAM提取区域注意力分数。这些分数引导一个基于补丁的阶段,使用经过度量学习训练的区域特定子网络进行局部分析,以实现细粒度区分。我们还构建了一个新的多样化数据集,包含在真实世界条件下收集的真实和伪装化妆面孔,涵盖了受试者、环境和伪装材料的变化。实验结果表明,在收集的数据集和SIW-Mv2上具有强大的泛化能力,在收集的数据集上实现了8.97%的ACER和9.76%的EER,在SIW-Mv2的混淆(Obfuscation)和冒充(Impersonation)攻击上实现了0%的ACER,在化妆(Cosmetics)攻击上实现了1.34%的ACER。所提出的方法在保持对其他欺骗类型的稳健性能的同时,始终优于以往的研究。
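For readers unfamiliar with the ACER figures quoted above, the sketch below computes the metric in its commonly used form: the mean of APCER (attack presentations accepted as bona fide) and BPCER (bona fide presentations rejected as attacks). Whether the paper aggregates over attack types in exactly this way is an assumption; the function name and the 0/1 label encoding are illustrative.

```python
def acer(labels, predictions):
    """Return (APCER, BPCER, ACER) for binary decisions.

    Encoding (assumed): 1 = attack presentation, 0 = bona fide (live).
    Both classes must be present in `labels`.
    """
    attack_preds = [p for y, p in zip(labels, predictions) if y == 1]
    live_preds = [p for y, p in zip(labels, predictions) if y == 0]
    apcer = sum(1 for p in attack_preds if p == 0) / len(attack_preds)  # attacks accepted
    bpcer = sum(1 for p in live_preds if p == 1) / len(live_preds)      # live rejected
    return apcer, bpcer, (apcer + bpcer) / 2.0
```

Because the two error rates are averaged, a detector cannot reach a low ACER by rejecting everything or accepting everything.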
cs.CV / 2 / 2604.26031

Report of the 5th PVUW Challenge: Towards More Diverse Modalities in Pixel-Level Understanding

第五届PVUW挑战赛报告:迈向更具多样性的像素级理解模式
Liu, Chang, Ding, Henghui, Ravi, Nikhila, Wei, Yunchao, He, Shuting, Bai, Song, Torr, Philip, Cao, Leilei, Zhang, Jinrong, Miao, Deshui, He, Xusheng, Gong, Dengxian, Wang, Zhiyu, Gao, Mingqi, Hong, Jihwan, Wu, Canyang, Guan, Weili, Wu, Jianlong, Nie, Liqiang, Huang, Xingsen, Gu, Yameng, Yu, Xiaogang, Li, Xin, Yang, Ming-Hsuan, Li, Sijie, Han, Jungong, Niu, Quanzhu, Chen, Shihao, Wu, Yuanzheng, Zhou, Yikang, Zhang, Tao, Yuan, Haobo, Qi, Lu, Ji, Shunping, Yang, Chao, Tian, Chao, Zhu, Guoqing, Yang, Kai, Mo, Zhifan, Zhang, Haijun, Kang, Xudong, Li, Shutao, Do, Jaeyoung
Abstract
This report summarizes the objectives, datasets, and top-performing methodologies of the 2026 Pixel-level Video Understanding in the Wild (PVUW) Challenge, hosted at CVPR 2026, which evaluates state-of-the-art models under highly unconstrained conditions. To provide a comprehensive assessment, the 2026 edition features three specialized tracks: the MOSE track for tracking objects within densely cluttered and severely occluded scenarios; the MeViS-Text track for localizing targets via motion-focused linguistic expressions; and the newly inaugurated MeViS-Audio track, which pioneers acoustic-driven object segmentation. By introducing previously unreleased challenging data and analyzing the cutting-edge, multimodal solutions submitted by participants, this report highlights the community's latest technical advancements and charts promising future directions for robust video scene comprehension.
Chinese Translation
本报告总结了2026年在CVPR 2026举办的像素级视频理解挑战赛(PVUW Challenge)的目标、数据集和表现最佳的方法论,该挑战赛在高度不受限的条件下评估最先进的模型。为了提供全面的评估,2026年版设有三个专业赛道:MOSE赛道用于在密集杂乱和严重遮挡的场景中跟踪物体;MeViS-Text赛道用于通过以运动为中心的语言表达定位目标;以及新设立的MeViS-Audio赛道,开创了基于声学的物体分割。通过引入以前未发布的挑战性数据并分析参与者提交的前沿多模态解决方案,本报告突出了社区最新的技术进展,并为稳健的视频场景理解描绘了有前景的未来方向。
cs.CV / 3 / 2604.26051

Evaluating the Alignment Between GeoAI Explanations and Domain Knowledge in Satellite-Based Flood Mapping

评估基于卫星的洪水制图中GeoAI解释与领域知识之间的对齐
Lee, Hyunho, Li, Wenwen
Abstract
The increasing number of satellites has improved the temporal resolution of Earth observation, making satellite-based flood mapping a promising approach for operational flood monitoring. Deep learning-based approaches for flood mapping using satellite imagery, an important application within Geospatial Artificial Intelligence (GeoAI), have shown improved predictive performance by learning complex spatial and spectral patterns from large volumes of remote sensing data. However, the opaque decision-making processes of deep learning models remain a major barrier to their integration into critical scientific and operational workflows. This highlights the need for a systematic assessment of whether model explanations align with established domain knowledge in remote sensing. To address this research gap, this study introduces the ADAGE (Alignment between Domain Knowledge And GeoAI Explanation Evaluation) framework. The proposed framework is designed to systematically evaluate how well explanations of deep learning models align with established remote sensing knowledge, particularly regarding the distinctive spectral properties of the Earth's surface. The ADAGE framework employs Channel-Group SHAP (SHapley Additive exPlanations) method to estimate the contributions of grouped input channels to pixel-level predictions. Experiments on two satellite-based flood mapping tasks demonstrate that the ADAGE framework can (1) quantitatively assess the alignment between model explanations and reference explanations derived from domain knowledge and (2) help domain experts identify misaligned explanations through alignment scores. This study contributes to bridging the gap between explainability and domain knowledge in GeoAI for Earth observation, enhancing the applicability of GeoAI models in scientific and operational workflows.
Chinese Translation
卫星数量的增加提高了地球观测的时间分辨率,使基于卫星的洪水制图成为一种有前景的操作性洪水监测方法。基于深度学习的卫星图像洪水制图方法,作为地理空间人工智能(GeoAI)中的一个重要应用,通过从大量遥感数据中学习复杂的空间和光谱模式,显示出改进的预测性能。然而,深度学习模型的不透明决策过程仍然是其融入关键科学和操作工作流程的主要障碍。这突显了系统评估模型解释是否与遥感领域的既定知识对齐的必要性。为了解决这一研究空白,本研究引入了ADAGE(领域知识与GeoAI解释评估之间的对齐)框架。该框架旨在系统地评估深度学习模型的解释与既定遥感知识之间的对齐程度,特别是关于地球表面的独特光谱特性。ADAGE框架采用通道组SHAP(SHapley加法解释)方法来估计分组输入通道对像素级预测的贡献。在两个基于卫星的洪水制图任务上的实验表明,ADAGE框架可以(1)定量评估模型解释与源自领域知识的参考解释之间的对齐程度,以及(2)帮助领域专家通过对齐分数识别不对齐的解释。本研究有助于弥合GeoAI在地球观测中解释性与领域知识之间的差距,增强GeoAI模型在科学和操作工作流程中的适用性。
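The idea of attributing a prediction to groups of input channels can be sketched with exact Shapley values over a handful of groups. This is plain Shapley attribution, not necessarily the estimator used by Channel-Group SHAP; the group names and the toy value function `toy_flood_score` are hypothetical.

```python
from itertools import combinations
from math import factorial


def group_shapley(groups, value):
    """Exact Shapley value of each channel group.

    `value(S)` is the model output when only the groups in set S are present
    (the others replaced by a baseline). Cost grows as 2^len(groups).
    """
    n = len(groups)
    phi = {}
    for g in groups:
        others = [h for h in groups if h != g]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                coalition = set(subset)
                total += weight * (value(coalition | {g}) - value(coalition))
        phi[g] = total
    return phi


def toy_flood_score(present):
    """Hypothetical pixel-level water score driven mostly by the SWIR group."""
    score = 0.0
    if "SWIR" in present:
        score += 0.6
    if "NIR" in present:
        score += 0.3
    if "NIR" in present and "RGB" in present:
        score += 0.1  # interaction term, shared equally by NIR and RGB
    return score
```

An alignment check in the spirit of the framework would then compare the ranking of these attributions against what remote sensing knowledge says should matter for water detection.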
cs.CV / 4 / 2604.26067

RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

RADIO-ViPE:动态环境中开放词汇语义SLAM的在线紧耦合多模态融合
Nasser, Zaid, Iumanov, Mikhail, Li, Tianhao, Popov, Maxim, Mahmoud, Jaafar, Kolyubin, Sergey
Abstract
We present RADIO-ViPE (Reduce All Domains Into One -- Video Pose Engine), an online semantic SLAM system that enables geometry-aware open-vocabulary grounding, associating arbitrary natural language queries with localized 3D regions and objects in dynamic environments. Unlike existing approaches that require calibrated, posed RGB-D input, RADIO-ViPE operates directly on raw monocular RGB video streams, requiring no prior camera intrinsics, depth sensors, or pose initialization. The system tightly couples multi-modal embeddings -- spanning vision and language -- derived from agglomerative foundation models (e.g., RADIO) with geometric scene information. This coupling takes place in initialization, optimization and factor graph connections to improve the consistency of the map from multiple modalities. The optimization is wrapped within adaptive robust kernels, designed to handle both actively moving objects and agent-displaced scene elements (e.g., furniture rearranged during an ego-centric session). Experiments demonstrate that RADIO-ViPE achieves state-of-the-art results on the dynamic TUM-RGBD benchmark while maintaining competitive performance against offline open-vocabulary methods that rely on calibrated data and static scene assumptions. RADIO-ViPE bridges a critical gap in real-world deployment, enabling robust open-vocabulary semantic grounding for autonomous robotics and unconstrained in-the-wild video streams. Project page: https://be2rlab.github.io/radio_vipe
Chinese Translation
我们提出了RADIO-ViPE(Reduce All Domains Into One -- Video Pose Engine),这是一种在线语义SLAM系统,能够实现几何感知的开放词汇定位(grounding),将任意自然语言查询与动态环境中已定位的3D区域和物体关联起来。与需要经过标定且带位姿的RGB-D输入的现有方法不同,RADIO-ViPE直接在原始单目RGB视频流上运行,无需预先已知的相机内参、深度传感器或位姿初始化。该系统将来自聚合式基础模型(如RADIO)的多模态嵌入(涵盖视觉和语言)与几何场景信息紧密耦合。这种耦合发生在初始化、优化和因子图连接中,以提高多模态地图的一致性。优化过程封装在自适应鲁棒核中,旨在同时处理主动移动的物体和被智能体移动过的场景元素(例如,在以自我为中心的会话期间被重新摆放的家具)。实验表明,RADIO-ViPE在动态TUM-RGBD基准测试中取得了最先进的结果,同时与依赖标定数据和静态场景假设的离线开放词汇方法相比仍保持有竞争力的性能。RADIO-ViPE弥补了现实世界部署中的关键空白,为自主机器人和不受约束的真实场景视频流提供稳健的开放词汇语义定位。项目页面:https://be2rlab.github.io/radio_vipe
cs.CV / 5 / 2604.26084

FruitProM-V2: Robust Probabilistic Maturity Estimation and Detection of Fruits and Vegetables

FruitProM-V2:水果和蔬菜的鲁棒性概率成熟度估计与检测
Cheppally, Rahul Harsha, Rai, Sidharth, Baral, Sudan, Vail, Benjamin, Sharda, Ajay
Abstract
Accurate fruit maturity identification is essential for determining harvest timing, as incorrect assessment directly affects yield and post-harvest quality. Although ripening is a continuous biological process, vision-based maturity estimation is typically formulated as a multi-class classification task, which imposes sharp boundaries between visually similar stages. To examine this limitation, we perform an annotation reliability study with two independent annotators on a held-out tomato dataset and observe disagreement concentrated near adjacent maturity stages. Motivated by this observation, we model maturity as a latent continuous variable and predict it probabilistically using a distributional detection head, converting the distribution into class probabilities through the cumulative distribution function (CDF). The proposed formulation maintains comparable performance to a standard detector under clean labels while better representing uncertainty. Furthermore, when controlled label noise is introduced during training, the probabilistic model demonstrates improved robustness relative to the baseline, indicating that explicitly modeling maturity uncertainty leads to more reliable visual maturity estimation.
Chinese Translation
准确的水果成熟度识别对于确定收获时机至关重要,因为错误的评估会直接影响产量和后期质量。尽管成熟是一个连续的生物过程,但基于视觉的成熟度估计通常被表述为多类分类任务,这在视觉上相似的阶段之间施加了明确的边界。为了检验这一局限性,我们对一个保留的番茄数据集进行了注释可靠性研究,发现不同注释者之间的分歧主要集中在相邻的成熟阶段。受到这一观察的启发,我们将成熟度建模为一个潜在的连续变量,并使用分布检测头以概率方式进行预测,通过累积分布函数(CDF)将分布转换为类别概率。所提出的模型在干净标签下保持与标准检测器相当的性能,同时更好地表示不确定性。此外,当在训练过程中引入控制标签噪声时,概率模型相对于基线表现出更强的鲁棒性,表明显式建模成熟度不确定性能够导致更可靠的视觉成熟度估计。
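The conversion from a predicted continuous distribution to class probabilities via the CDF can be sketched as follows. The Gaussian family, the boundary values, and the function names are assumptions for illustration; the abstract does not state which distribution the detection head predicts.

```python
from math import erf, inf, sqrt


def normal_cdf(x, mu, sigma):
    """CDF of a Gaussian, with the infinite end points handled explicitly."""
    if x == inf:
        return 1.0
    if x == -inf:
        return 0.0
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


def class_probs(mu, sigma, boundaries):
    """Probability mass of each maturity class.

    `boundaries` are the increasing cut points between adjacent stages on the
    latent maturity axis; class k covers the interval between cut k-1 and cut k.
    """
    edges = [-inf] + list(boundaries) + [inf]
    return [
        normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
        for lo, hi in zip(edges, edges[1:])
    ]
```

A prediction centred on a boundary, for example `class_probs(0.5, 0.1, [0.25, 0.5, 0.75])`, splits its mass evenly between the two neighbouring stages, which is the behaviour a hard multi-class label cannot express.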
cs.CV / 6 / 2604.26116

Sample Selection Using Multi-Task Autoencoders in Federated Learning with Non-IID Data

在非独立同分布数据的联邦学习中使用多任务自编码器进行样本选择
Ardıç, Emre, Genç, Yakup
Abstract
Federated learning is a machine learning paradigm in which multiple devices collaboratively train a model under the supervision of a central server while ensuring data privacy. However, its performance is often hindered by redundant, malicious, or abnormal samples, leading to model degradation and inefficiency. To overcome these issues, we propose novel sample selection methods for image classification, employing a multitask autoencoder to estimate sample contributions through loss and feature analysis. Our approach incorporates unsupervised outlier detection, using one-class support vector machine (OCSVM), isolation forest (IF), and adaptive loss threshold (AT) methods managed by a central server to filter noisy samples on clients. We also propose a multi-class deep support vector data description (SVDD) loss controlled by a central server to enhance feature-based sample selection. We validate our methods on CIFAR10 and MNIST datasets across varying numbers of clients, non-IID distributions, and noise levels up to 40%. The results show significant accuracy improvements with loss-based sample selection, achieving gains of up to 7.02% on CIFAR10 with OCSVM and 1.83% on MNIST with AT. Additionally, our federated SVDD loss further improves feature-based sample selection, yielding accuracy gains of up to 0.99% on CIFAR10 with OCSVM. These results show the effectiveness of our methods in improving model accuracy across various client counts and noise conditions.
Chinese Translation
联邦学习是一种机器学习范式,其中多个设备在中央服务器的监督下协作训练模型,同时确保数据隐私。然而,其性能常常受到冗余、恶意或异常样本的影响,导致模型退化和效率低下。为了解决这些问题,我们提出了用于图像分类的新型样本选择方法,采用多任务自编码器通过损失和特征分析来估计样本贡献。我们的方法结合了无监督异常检测,使用单类支持向量机(One-Class Support Vector Machine, OCSVM)、孤立森林(Isolation Forest, IF)和自适应损失阈值(Adaptive Loss Threshold, AT)方法,由中央服务器管理,以过滤客户端的噪声样本。我们还提出了一种由中央服务器控制的多类深度支持向量数据描述(Support Vector Data Description, SVDD)损失,以增强基于特征的样本选择。我们在CIFAR10和MNIST数据集上验证了我们的方法,涵盖不同数量的客户端、非独立同分布(Non-IID)分布和高达40%的噪声水平。结果显示,基于损失的样本选择显著提高了准确性,在CIFAR10上使用OCSVM实现了高达7.02%的增益,在MNIST上使用AT实现了1.83%的增益。此外,我们的联邦SVDD损失进一步改善了基于特征的样本选择,在CIFAR10上使用OCSVM实现了高达0.99%的准确性增益。这些结果表明我们的方法在不同客户端数量和噪声条件下提高模型准确性的有效性。
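Loss-based sample selection on a client can be sketched with a simple threshold rule. The rule used here (keep samples whose autoencoder loss is at most mean + k × standard deviation) is a hypothetical stand-in; the abstract names an adaptive loss threshold (AT) but does not define it, and `select_samples` and `k` are illustrative names.

```python
from statistics import mean, pstdev


def select_samples(losses, k=1.0):
    """Return (indices to keep, threshold) for one client's per-sample losses.

    Assumed rule: a sample is treated as noisy when its loss exceeds
    mean + k * population standard deviation of the client's losses.
    """
    threshold = mean(losses) + k * pstdev(losses)
    keep = [i for i, loss in enumerate(losses) if loss <= threshold]
    return keep, threshold
```

In a federated setting the server would only need to distribute `k` (or the threshold itself), while the per-sample losses never leave the client.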
cs.CV / 7 / 2604.26138

MixerCA: An Efficient and Accurate Model for High-Performance Hyperspectral Image Classification

MixerCA:一种高效准确的高性能高光谱图像分类模型
Alkhatib, Mohammed Q., Jamali, Ali
Abstract
Over the past decade, hyperspectral image (HSI) classification has drawn considerable interest due to HSIs' ability to effectively distinguish terrestrial objects by capturing detailed, continuous spectral information. The strong performance of recent deep learning techniques in tasks like image classification and semantic segmentation has led to their growing use in HSI classification, due to their ability to capture complex spatial and spectral features more effectively than traditional methods. This paper presents MixerCA, a novel lightweight model for HSI classification that leverages depthwise convolution and a self-attention mechanism. MixerCA integrates depth-wise convolutions, token and channel mixing, and coordinate attention into a unified structure to decouple spatial and channel interactions, maintain consistent resolution throughout the network, and directly process HSI patches. Extensive experiments on four hyperspectral benchmark datasets reveal MixerCA's clear advantages over several competing algorithms, including 2D-CNN, 3D-CNN, Tri-CNN, HybridSN, ViT, and Swin Transformer. The source code is publicly available at https://github.com/mqalkhatib/MixerCA.
Chinese Translation
在过去十年中,高光谱图像(HSI)分类因其能够通过捕捉详细、连续的光谱信息有效区分地面物体而引起了广泛关注。近年来深度学习技术在图像分类和语义分割等任务中的强大表现,使其在HSI分类中的应用日益增加,因为它们能够比传统方法更有效地捕捉复杂的空间和光谱特征。本文提出了MixerCA,一种新颖的轻量级HSI分类模型,利用深度卷积和自注意力机制。MixerCA将深度卷积、标记和通道混合以及坐标注意力整合到一个统一的结构中,以解耦空间和通道交互,保持网络中一致的分辨率,并直接处理HSI图块。在四个高光谱基准数据集上的大量实验表明,MixerCA在多个竞争算法(包括2D-CNN、3D-CNN、Tri-CNN、HybridSN、ViT和Swin Transformer)中具有明显优势。源代码可在https://github.com/mqalkhatib/MixerCA上公开获取。
cs.CV / 8 / 2604.26147

A Data-Centric Framework for Intraoperative Fluorescence Lifetime Imaging for Glioma Surgical Guidance

用于胶质瘤手术指导的术中荧光寿命成像数据中心框架
Anbunesan, Silvia Noble, Hassan, Mohamed Abul, Qi, Jinyi, Kraft, Lisanne, Lee, Han Sung, Bloch, Orin, Marcu, Laura
Abstract
Accurate intraoperative assessment of glioma infiltration is essential for maximizing tumor resection while preserving functional brain tissue. Fluorescence lifetime imaging (FLIm) offers real-time, label-free biochemical contrast, but its clinical utility is challenged by biological heterogeneity, class imbalance, and variability in histopathological labeling. We present a data-centric AI (DC-AI) framework that integrates confident learning (CL), class refinement, and targeted label evaluation to develop a robust multi-class FLIm classifier for glioblastoma (GBM) resection margins. FLIm data were collected from 192 tissue margins across 31 newly diagnosed IDH-wildtype GBM patients and initially labeled into seven tumor cellularity classes by an expert neuropathologist. CL was applied to quantify FLIm point-level confidence, identify label inconsistencies, and guide iterative class merging into a three-class scheme ("low", "moderate", "high"). The resulting high-fidelity dataset enabled training a model that achieved 96% accuracy in the three-class task. SHAP analysis revealed class-specific FLIm feature importance, highlighting distinct optical signatures across the infiltration spectrum. Targeted FLIm analysis further identified biological (e.g., gray matter composition) and acquisition-related (e.g., blood contamination) contributors to low-confidence predictions. Blinded re-evaluation of margins flagged by CL demonstrated intra-pathologist variability, underscoring the value of selective relabeling rather than exhaustive review. Together, these findings demonstrate that a DC-AI framework can systematically improve data reliability, enhance model robustness, and refine biological interpretation of FLIm signals, supporting the development of clinically actionable optical tools for real-time glioma margin assessment.
Chinese Translation
准确的胶质瘤浸润术中评估对于最大限度地切除肿瘤并保护功能性脑组织至关重要。荧光寿命成像(FLIm)提供实时、无标记的生化对比,但其临床应用受到生物异质性、类别不平衡和组织病理标记变异性的挑战。我们提出了一种数据中心人工智能(DC-AI)框架,集成了自信学习(CL)、类别细化和目标标签评估,以开发一个强大的多类别 FLIm 分类器,用于胶质母细胞瘤(GBM)切除边缘。FLIm 数据来自31名新诊断的 IDH 野生型 GBM 患者的192个组织边缘,最初由一位专家神经病理学家标记为七个肿瘤细胞密度类别。应用 CL 量化 FLIm 点级别的自信度,识别标签不一致,并指导迭代类别合并为三类方案(“低”、“中”、“高”)。最终生成的高保真数据集使得训练的模型在三类任务中达到了96%的准确率。SHAP 分析揭示了类别特异性的 FLIm 特征重要性,突显了浸润谱系中的不同光学特征。针对性 FLIm 分析进一步识别了生物学(例如,灰质成分)和采集相关(例如,血液污染)对低自信度预测的影响因素。对 CL 标记的边缘进行盲法重新评估显示了同一病理学家自身的判读变异性(intra-pathologist variability),强调了选择性重新标记而非全面审查的价值。这些发现共同表明,DC-AI 框架可以系统性地提高数据可靠性,增强模型稳健性,并细化对 FLIm 信号的生物学解释,支持开发可用于实时胶质瘤边缘评估的临床可操作光学工具。
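The confident learning step can be sketched in its basic form: a per-class threshold equal to the mean predicted probability of that class among samples carrying that label, with a sample flagged when it is confidently assigned to a class other than its given label. This is a simplified illustration of the general technique, not the study's pipeline; the function name and toy inputs are hypothetical.

```python
def find_label_issues(probs, labels):
    """Flag samples whose given label disagrees with a confident prediction.

    probs:  list of per-class probability lists (out-of-sample predictions).
    labels: given integer labels; every class must appear at least once.
    """
    n_classes = len(probs[0])
    # Per-class threshold: average self-confidence of samples labelled as that class.
    thresholds = []
    for j in range(n_classes):
        own = [p[j] for p, y in zip(probs, labels) if y == j]
        thresholds.append(sum(own) / len(own))
    issues = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if confident:
            guess = max(confident, key=lambda j: p[j])
            if guess != y:
                issues.append(i)
    return issues, thresholds
```

Flagged indices are candidates for targeted re-evaluation, which matches the selective relabeling the abstract argues for over exhaustive review.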
cs.CV / 9 / 2604.26174

Why Domain Matters: A Preliminary Study of Domain Effects in Underwater Object Detection

为何领域重要:水下物体检测领域效应的初步研究
Wille, Melanie, Miller, Dimity, Fischer, Tobias, Raine, Scarlett
Abstract
Domain shift, where deviations between training and deployment data distributions degrade model performance, is a key challenge in underwater environments. Existing benchmarks testing performance for underwater domain shift simulate variability through synthetic style transfer. This fails to capture intrinsic scene factors such as visibility, illumination, scene composition, or acquisition factors, limiting analysis of real-world effects. We propose a labeling framework that defines underwater domains using measurable image, scene, and acquisition characteristics. Unlike prior benchmarks, it captures physically meaningful factors, enabling semantically consistent image grouping and supporting domain-specific evaluation of detection performance including failure analysis. We validate this on public datasets, showing systematic variations across domain factors and revealing hidden failure modes.
Chinese Translation
领域偏移,即训练和部署数据分布之间的偏差导致模型性能下降,是水下环境中的一个关键挑战。现有的基准测试通过合成风格转移来模拟水下领域偏移的性能变异。这种方法未能捕捉到内在场景因素,如能见度、照明、场景组成或获取因素,从而限制了对现实世界效应的分析。我们提出了一种标注框架,通过可测量的图像、场景和获取特征来定义水下领域。与之前的基准不同,该框架捕捉了物理上有意义的因素,使得图像分组在语义上保持一致,并支持针对特定领域的检测性能评估,包括失败分析。我们在公共数据集上验证了这一框架,显示出领域因素之间的系统性变化,并揭示了隐藏的失败模式。
cs.CV / 10 / 2604.26182

Lifting Embodied World Models for Planning and Control

提升具身世界模型以进行规划和控制
Wang, Alex N., Darrell, Trevor, Izmailov, Pavel, Bai, Yutong, Bar, Amir
Abstract
World models of embodied agents predict future observations conditioned on an action taken by the agent. For complex embodiments, action spaces are high-dimensional and difficult to specify: for example, precisely controlling a human agent requires specifying the motion of each joint. This makes the world model hard to control and expensive to plan with as search-based methods like CEM scale poorly with action dimensionality. To address this issue, we train a lightweight policy that maps high-level actions to sequences of low-level joint actions. Composing this policy with the frozen world model produces a lifted world model that predicts a sequence of future observations from a single high-level action. We instantiate this framework for a human-like embodiment, defining the high-level action space as a small set of 2D waypoints annotated on the current observation frame, each specifying a near-term goal position for a leaf joint (pelvis, head, hands). Waypoints are low-dimensional, visually interpretable, and easy to specify manually or to search over. We show that the lifted world model substantially outperforms searching directly in low-level joint space ($3.8\times$ lower mean joint error to the goal pose), while remaining more compute-efficient and generalizing to environments unseen by the policy.
Chinese Translation
具身智能体的世界模型以智能体采取的动作为条件,预测未来的观测。对于复杂的具身形态,动作空间是高维的且难以指定:例如,精确控制一个人类智能体需要指定每个关节的运动。这使得世界模型难以控制,且规划代价高昂,因为CEM等基于搜索的方法随动作维度的增加而扩展性很差。为了解决这个问题,我们训练了一个轻量级策略,将高层次动作映射到低层次关节动作的序列。将该策略与冻结的世界模型组合,得到一个提升的(lifted)世界模型,该模型从单个高层次动作预测一系列未来观测。我们为类人具身形态实例化了这一框架,将高层次动作空间定义为一小组标注在当前观测帧上的2D路径点,每个路径点为一个末端关节(骨盆、头部、双手)指定近期目标位置。路径点是低维的、视觉上可解释的,并且易于手动指定或进行搜索。我们展示了提升的世界模型显著优于直接在低层次关节空间中搜索(相对目标姿态的平均关节误差降低了$3.8\times$),同时计算效率更高,并能够泛化到策略未见过的环境中。
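The lifting construction is a function composition, sketched here on a toy 1-D system. The dynamics, the equal-step policy, and all function names are hypothetical stand-ins for the learned policy and the frozen world model; only the composition pattern reflects the abstract.

```python
def world_model(obs, action):
    """Stand-in for the frozen low-level world model: one step of toy 1-D dynamics."""
    return obs + action


def policy(obs, waypoint, horizon=4):
    """Stand-in for the lightweight policy: one waypoint -> sequence of low-level actions."""
    step = (waypoint - obs) / horizon
    return [step] * horizon


def lifted_world_model(obs, waypoint):
    """Compose policy and world model: one high-level action -> future observations."""
    rollout = []
    for action in policy(obs, waypoint):
        obs = world_model(obs, action)
        rollout.append(obs)
    return rollout


def plan(obs, goal, candidates):
    """Search the small waypoint set instead of the raw low-level action space."""
    return min(candidates, key=lambda w: abs(lifted_world_model(obs, w)[-1] - goal))
```

Planning now scores one candidate per waypoint rather than one per low-level action sequence, which is why the search space shrinks so sharply.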
cs.CV / 11 / 2604.26184

Privacy-Preserving Clothing Classification using Vision Transformer for Thermal Comfort Estimation

基于视觉变换器的隐私保护服装分类用于热舒适度估计
Chuman, Tatsuya, Udagawa, Yousuke, Kiya, Hitoshi
Abstract
A privacy-preserving clothing classification scheme is presented to enable secure occupant-centric control (OCC) systems. Although the utilization of camera images for HVAC control has been widely studied to optimize thermal comfort, privacy protection of occupant images has not been considered in prior works. While various privacy-preserving methods have been proposed for image classification, applying conventional schemes results in severe accuracy degradation. In this paper, we introduce a privacy-preserving classification method using Vision Transformer (ViT) applied to clothing insulation estimation. In an experiment using the DeepFashion dataset categorized by clothing insulation, while the conventional pixel-based method suffers a severe accuracy drop, our scheme maintains a high accuracy on encrypted images, showing no degradation from plain images across all categories.
Chinese Translation
本文提出了一种隐私保护的服装分类方案,以实现安全的以居住者为中心的控制(OCC)系统。尽管利用摄像头图像进行HVAC控制以优化热舒适度的研究已广泛开展,但以往的研究并未考虑居住者图像的隐私保护。虽然已有多种隐私保护方法被提出用于图像分类,但应用传统方案会导致严重的准确性下降。在本文中,我们引入了一种使用视觉变换器(Vision Transformer, ViT)的隐私保护分类方法,应用于服装绝缘估计。在使用DeepFashion数据集进行的实验中,尽管传统的基于像素的方法遭遇了严重的准确性下降,我们的方案在加密图像上保持了高准确性,所有类别的表现均未出现与原始图像相比的下降。
cs.CV / 12 / 2604.26186

FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing

FASH-iCNN:通过多模态卷积神经网络探究编辑时尚身份
Adeyemi, Morayo Danielle, Rossi, Ryan A., Dernoncourt, Franck
Abstract
Fashion AI systems routinely encode the aesthetic logic of specific houses, editors, and historical moments without disclosing it. We present FASH-iCNN, a multimodal system trained on 87,547 Vogue runway images across 15 fashion houses spanning 1991-2024 that makes this cultural logic inspectable. Given a photograph of a garment, the system recovers which house produced it, which era it belongs to, and which color tradition it reflects. A clothing-only model identifies the fashion house at 78.2% top-1 across 14 houses, the decade at 88.6% top-1, and the specific year at 58.3% top-1 across 34 years with a mean error of just 2.2 years. Probing which visual channels carry this signal reveals a sharp dissociation: removing color costs only 10.6pp of house identity accuracy, while removing texture costs 37.6pp, establishing texture and luminance as the primary carriers of editorial identity. FASH-iCNN treats editorial culture as the signal rather than background noise, identifying which houses, eras, and color traditions shaped each output so that users can see not just what the system predicts but which houses, editors, and historical moments are encoded in that prediction.
Chinese Translation
时尚人工智能系统通常编码特定品牌、编辑和历史时刻的美学逻辑,但并未公开这些信息。我们提出了FASH-iCNN,这是一个多模态系统,基于1991年至2024年间15个时尚品牌的87,547张《Vogue》时装秀图片进行训练,使这种文化逻辑可供检视。给定一张服装的照片,该系统能够恢复出生产该服装的品牌、所属的时代以及反映的颜色传统。仅基于服装的模型在14个品牌中以78.2%的准确率识别品牌,在88.6%的准确率下识别十年,并在34年中以58.3%的准确率识别具体年份,平均误差仅为2.2年。探究哪些视觉通道传递这一信号揭示出明显的分离:去除颜色仅使品牌身份准确率下降10.6个百分点,而去除纹理则使其下降37.6个百分点,确立了纹理和亮度作为编辑身份的主要载体。FASH-iCNN将编辑文化视为信号而非背景噪声,识别出哪些品牌、时代和颜色传统塑造了每个输出,使用户不仅能看到系统的预测,还能了解在该预测中编码的品牌、编辑和历史时刻。
cs.CV / 13 / 2604.26218

ViBE: Visual-to-M/EEG Brain Encoding via Spatio-Temporal VAE and Distribution-Aligned Projection

ViBE:通过时空变分自编码器和分布对齐投影实现视觉到M/EEG的大脑编码
Xu, Ganxi, Lai, Zhao-Rong, Tang, Yuting, Song, Yonghao, Zhou, Shuyan, Zhou, Guoxu, Wang, Boyu, Zhu, Jian, Long, Jinyi
Abstract
Brain encoding models not only serve to decipher how visual stimuli are transformed into neural responses, but also represent a critical step toward visual prostheses that restore vision for patients with severe vision disorders. Brain encoding involves two fundamental steps: achieving faithful reconstruction of neural responses and establishing cross-modal alignment between visual stimuli and neural responses. To this end, we propose ViBE, a novel brain encoding framework for generating magnetoencephalography (MEG) and electroencephalography (EEG) signals from visual stimuli. Specifically, we first design a spatio-temporal convolutional variational autoencoder (TSC-VAE) that captures the spatio-temporal characteristics of M/EEG signals for effective neural response reconstruction. To bridge the modality gap between visual features and neural representations, we employ Q-Former to map CLIP image embeddings to the TSC-VAE latent space, producing neural proxy embeddings. For comprehensive cross-modal alignment, we combine mean squared error (MSE) loss for point-wise feature matching with sliced Wasserstein distance (SWD) for probability distribution alignment between the neural proxy embeddings and TSC-VAE latent embeddings. We conduct extensive experiments on the THINGS-EEG2 and THINGS-MEG datasets, demonstrating the effectiveness of our approach in generating high-quality M/EEG signals from visual stimuli.
Chinese Translation
大脑编码模型不仅用于解码视觉刺激如何转化为神经反应,还代表了为严重视觉障碍患者恢复视力的重要一步。大脑编码涉及两个基本步骤:实现神经反应的忠实重建和建立视觉刺激与神经反应之间的跨模态对齐。为此,我们提出了ViBE,一种新颖的大脑编码框架,用于从视觉刺激生成脑磁图(MEG)和脑电图(EEG)信号。具体而言,我们首先设计了一种时空卷积变分自编码器(TSC-VAE),以捕捉M/EEG信号的时空特征,从而有效重建神经反应。为了弥合视觉特征与神经表征之间的模态差距,我们采用Q-Former将CLIP图像嵌入映射到TSC-VAE潜在空间,生成神经代理嵌入。为了实现全面的跨模态对齐,我们结合了均方误差(MSE)损失用于逐点特征匹配,以及切片Wasserstein距离(SWD)用于神经代理嵌入与TSC-VAE潜在嵌入之间的概率分布对齐。我们在THINGS-EEG2和THINGS-MEG数据集上进行了广泛的实验,证明了我们的方法在从视觉刺激生成高质量M/EEG信号方面的有效性。
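The alignment objective in this abstract pairs a point-wise MSE term with a sliced Wasserstein distance (SWD) term. A minimal sketch of such a combined loss is given below, assuming equal-sized batches, a Monte Carlo SWD estimate over random projections, and a hypothetical weighting parameter `swd_weight`; none of these choices are confirmed by the source.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def mse(a: List[Vector], b: List[Vector]) -> float:
    """Point-wise mean squared error between paired embeddings."""
    total = sum((x - y) ** 2 for u, v in zip(a, b) for x, y in zip(u, v))
    return total / (len(a) * len(a[0]))


def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """Draw a direction uniformly on the unit sphere via a normalized Gaussian."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sliced_wasserstein(a: List[Vector], b: List[Vector],
                       n_projections: int = 64, seed: int = 0) -> float:
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    Assumes both batches contain the same number of samples.
    """
    rng = random.Random(seed)
    dim = len(a[0])
    total = 0.0
    for _ in range(n_projections):
        d = random_unit_vector(dim, rng)
        # Project each batch onto the direction, then compare sorted 1D samples.
        pa = sorted(sum(x * w for x, w in zip(u, d)) for u in a)
        pb = sorted(sum(x * w for x, w in zip(v, d)) for v in b)
        total += sum((x - y) ** 2 for x, y in zip(pa, pb)) / len(pa)
    return total / n_projections


def alignment_loss(proxy: List[Vector], latent: List[Vector],
                   swd_weight: float = 1.0) -> float:
    """MSE for point-wise matching plus SWD for distribution-level matching."""
    return mse(proxy, latent) + swd_weight * sliced_wasserstein(proxy, latent)


latent = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
shuffled = [latent[2], latent[0], latent[1]]  # same set of points, different pairing
```

The two terms respond to different things: on `latent` versus `shuffled` the SWD term is zero because the two sets have the same distribution, while the MSE term is positive because the pairing differs.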
cs.CV / 14 / 2604.26221

Seeking Consensus: Geometric-Semantic On-the-Fly Recalibration for Open-Vocabulary Remote Sensing Semantic Segmentation

寻求共识:几何-语义即时重校准用于开放词汇遥感语义分割
Wang, Guanchun, Wu, Chenxiao, Zhang, Xiangrong, Peng, Zelin, Lai, Jianxun, Zhang, Tianyang, Tang, Xu
Abstract
Open-vocabulary semantic segmentation (OVSS) in remote sensing images is a promising task that employs textual descriptions for identifying undefined land cover categories. Despite notable advances, existing methods typically employ a static inference paradigm, overlooking the distinct distribution of each scene, resulting in semantic ambiguity in diverse land covers and incomplete foreground activation. Motivated by this, we propose Seeking Consensus, termed SeeCo, a plug-and-play framework to boost the performance of training-free OVSS models in remote sensing images, which recalibrates arbitrary OVSS models on-the-fly by seeking dual consensus: geometric consensus learning (GCL) through multi-view consistent observations and semantic consensus learning (SCL) via textual description adaptive calibration, which assists collaborative recalibration of visual and textual semantics. Both forms of consensus are injected via an online consensus injector (OCI), effectively alleviating the under-activation and semantic bias. SeeCo requires no specific training process, yet recalibrates semantic-geometric alignment for each unique scene during inference. Extensive experiments on eight remote sensing OVSS benchmarks show consistent gains, proving its effectiveness and universality.
Chinese Translation
开放词汇语义分割(OVSS)在遥感图像中是一项前景任务,利用文本描述来识别未定义的土地覆盖类别。尽管已有显著进展,但现有方法通常采用静态推理范式,忽视了每个场景的独特分布,导致多样化土地覆盖中的语义模糊和前景激活不完全。基于此,我们提出了寻求共识(Seeking Consensus),简称 SeeCo,这是一种即插即用的框架,旨在提升遥感图像中无训练OVSS模型的性能,通过寻求双重共识进行即时重校准:通过多视角一致观察实现几何共识学习(Geometric Consensus Learning, GCL),以及通过文本描述自适应校准实现语义共识学习(Semantic Consensus Learning, SCL),以协助视觉和文本语义的协同重校准。这两种共识通过在线共识注入器(Online Consensus Injector, OCI)注入,有效缓解了激活不足和语义偏差。SeeCo无需特定的训练过程,但在推理过程中为每个独特场景重校准语义-几何对齐。在八个遥感OVSS基准上的广泛实验显示出一致的性能提升,证明了其有效性和普适性。
cs.CV / 15 / 2604.26227

HOI-aware Adaptive Network for Weakly-supervised Action Segmentation

基于HOI的自适应网络用于弱监督动作分割
Zhang, Runzhong, Wang, Suchen, Duan, Yueqi, Tang, Yansong, Zhang, Yue, Tan, Yap-Peng
Abstract
In this paper, we propose an HOI-aware adaptive network named AdaAct for weakly-supervised action segmentation. Most existing methods learn a fixed network to predict the action of each frame with the neighboring frames. However, this would result in ambiguity when estimating similar actions, such as pouring juice and pouring coffee. To address this, we aim to exploit temporally global but spatially local human-object interactions (HOI) as video-level prior knowledge for action segmentation. The long-term HOI sequence provides crucial contextual information to distinguish ambiguous actions, where our network dynamically adapts to the given HOI sequence at test time. More specifically, we first design a video HOI encoder that extracts, selects, and integrates the most representative HOI throughout the video. Then, we propose a two-branch HyperNetwork to learn an adaptive temporal encoder, which automatically adjusts the parameters based on the HOI information of various videos on the fly. Extensive experiments on two widely-used datasets including Breakfast and 50Salads demonstrate the effectiveness of our method under different evaluation metrics.
Chinese Translation
在本文中,我们提出了一种名为AdaAct的基于HOI的自适应网络,用于弱监督动作分割。大多数现有方法学习一个固定的网络,以预测每一帧的动作及其邻近帧。然而,这在估计相似动作时会导致模糊性,例如倒果汁和倒咖啡。为了解决这个问题,我们旨在利用时间上全局但空间上局部的人-物交互(HOI)作为动作分割的视频级先验知识。长期的HOI序列提供了区分模糊动作的重要上下文信息,我们的网络在测试时动态适应给定的HOI序列。更具体而言,我们首先设计了一个视频HOI编码器,提取、选择并整合整个视频中最具代表性的HOI。然后,我们提出了一个双分支超网络(HyperNetwork),以学习自适应时间编码器,该编码器能够根据不同视频的HOI信息动态调整参数。在两个广泛使用的数据集Breakfast和50Salads上的大量实验表明,我们的方法在不同评估指标下的有效性。
cs.CV / 16 / 2604.26232

DepthPilot: From Controllability to Interpretability in Colonoscopy Video Generation

DepthPilot:从可控性到可解释性的结肠镜视频生成
Fu, Junhu, Chen, Ke, Guo, Weidong, Liang, Shuyu, Xu, Jie, Ma, Chen, Wang, Kehao, Lin, Shengli, Li, Zeju, Wang, Yuanyuan, Guo, Yi, Li, Shuo
Abstract
Controllable medical video generation has achieved remarkable progress, but it still lacks interpretability, which requires the alignment of generated contents with physical priors and faithful clinical manifestations. To push the boundaries from mere controllability to interpretability, we propose DepthPilot, the first interpretable framework for colonoscopy video generation. This work takes a step toward trustworthy generation through two synergistic paradigms. To achieve explicit geometric grounding, DepthPilot devises a prior distribution alignment strategy, injecting depth constraints into the diffusion backbone via parameter-efficient fine-tuning to ensure anatomical fidelity. To enhance intrinsic nonlinear modeling under these geometric constraints, DepthPilot employs an adaptive spline denoising module, replacing fixed linear weights with learnable spline functions to capture complex spatio-temporal dynamics. Extensive evaluations across three public datasets and in-house clinical data confirm DepthPilot's robust ability to produce physically consistent videos. It achieves FID scores below 15 across all benchmarks and ranks first in clinician assessments, bridging the gap between "visually realistic" and "clinically interpretable". Moreover, DepthPilot-generated videos are expected to enable reliable 3D reconstruction, facilitating surgical navigation and blind region identification, and serve as a foundation toward the colorectal world model.
Chinese Translation
可控的医学视频生成已取得显著进展,但仍缺乏可解释性,这需要生成内容与物理先验和真实临床表现的一致性。为了推动从单纯可控性到可解释性的边界,我们提出了DepthPilot,这是第一个用于结肠镜视频生成的可解释框架。本研究通过两种协同范式迈出了可信生成的第一步。为了实现明确的几何基础,DepthPilot设计了一种先验分布对齐策略,通过参数高效的微调将深度约束注入扩散主干,以确保解剖学的保真性。为了在这些几何约束下增强内在非线性建模,DepthPilot采用了一种自适应样条去噪模块,用可学习的样条函数替代固定的线性权重,以捕捉复杂的时空动态。在三个公共数据集和内部临床数据上的广泛评估证实了DepthPilot生成物理一致视频的强大能力。它在所有基准测试中实现了低于15的FID分数,并在临床医生评估中排名第一,弥合了“视觉真实”和“临床可解释”之间的差距。此外,DepthPilot生成的视频预计将支持可靠的3D重建,促进外科导航和盲区识别,并为结直肠世界模型奠定基础。
cs.CV / 17 / 2604.26238

EnerGS: Energy-Based Gaussian Splatting with Partial Geometric Priors

EnerGS:基于能量的高斯点云渲染与部分几何先验
Song, Rui, Cai, Tianhui, Gross, Markus, Zhang, Yun, Zimmer, Walter, Huang, Zhiyu, Wysocki, Olaf, Ma, Jiaqi
Abstract
3D Gaussian Splatting (3DGS) has been widely adopted for scene reconstruction, where training inherently constitutes a highly coupled and non-convex optimization problem. Recent works commonly incorporate geometric priors, such as LiDAR measurements, either for initialization or as training constraints, with the goal of improving photometric reconstruction quality. However, in large-scale outdoor scenarios, such geometric supervision is often spatially incomplete and uneven, which limits its effectiveness as a reliable prior and can even be detrimental to the final reconstruction. To address this challenge, we model partially observable geometry as a continuous energy field induced by geometric evidence and propose EnerGS. Rather than enforcing geometry as a hard constraint, EnerGS provides a soft geometric guidance for the optimization of Gaussian primitives, allowing geometric information to steer the optimization process without directly restricting the solution space. Extensive experiments on large-scale outdoor scenes demonstrate that, under both sparse multi-view and monocular settings, EnerGS consistently improves photometric quality and geometric stability, while effectively mitigating overfitting during 3DGS training.
Chinese Translation
三维高斯点云渲染(3D Gaussian Splatting, 3DGS)已被广泛应用于场景重建,其中训练本质上构成了一个高度耦合且非凸的优化问题。近期的研究通常结合几何先验,例如激光雷达(LiDAR)测量,作为初始化或训练约束,旨在提高光度重建质量。然而,在大规模户外场景中,这种几何监督往往在空间上是不完整和不均匀的,这限制了其作为可靠先验的有效性,甚至可能对最终重建产生不利影响。为了解决这一挑战,我们将部分可观察的几何形状建模为由几何证据引发的连续能量场,并提出了EnerGS。EnerGS并不是将几何作为硬约束,而是为高斯原语的优化提供了软几何指导,使几何信息能够引导优化过程,而不直接限制解空间。在大规模户外场景上的大量实验表明,在稀疏多视角和单目设置下,EnerGS始终提高了光度质量和几何稳定性,同时有效减轻了3DGS训练过程中的过拟合问题。
cs.CV / 18 / 2604.26241

Camera-RFID Fusion for Robust Asset Tracking in Forested Environments

森林环境中基于摄像头与RFID融合的稳健资产追踪
Hateley, John, Narasimhan, Sriram, Abari, Omid
Abstract
Passive RFID tags offer a cost-effective and scalable solution for tracking numerous deployed assets. However, in forested environments, signal attenuation and multipath effects generally limit RFID spatial accuracy to the meter level. Conversely, while cameras employing stereo vision can achieve centimeter-level precision, relying solely on computer vision fails to resolve issues arising from spatial association ambiguity and partial occlusions in dense settings. Fusing these modalities allows systems to harness the high-accuracy benefits of vision while retaining the robust, non-line-of-sight identification advantages of RFID. Yet, a primary challenge in achieving this, which is the central focus of this paper, lies in accurately associating the disparate trajectories generated by these two sensors. To overcome this limitation, we introduce a novel camera-RFID fusion framework that integrates depth and object information with advanced trajectory-matching algorithms. By successfully bridging the meter-to-centimeter accuracy gap, the proposed approach helps achieve reliable tag localization even when assets temporarily leave the camera's field of view. To the best of our knowledge, this represents the first application of camera-RFID fusion for asset tracking in natural forested environments.
Chinese Translation
被动RFID标签为追踪大量部署资产提供了一种具有成本效益和可扩展性的解决方案。然而,在森林环境中,信号衰减和多径效应通常将RFID的空间精度限制在米级。相反,虽然采用立体视觉的摄像头可以实现厘米级的精度,但仅依靠计算机视觉无法解决在密集环境中出现的空间关联模糊和部分遮挡等问题。融合这两种模式使系统能够利用视觉的高精度优势,同时保留RFID在非视距识别方面的稳健优势。然而,实现这一目标的主要挑战——也是本文的核心关注点——在于准确关联这两种传感器生成的不同轨迹。为克服这一限制,我们提出了一种新颖的摄像头与RFID融合框架,该框架结合了深度和物体信息,并采用先进的轨迹匹配算法。通过成功弥补米级与厘米级的精度差距,所提出的方法有助于实现可靠的标签定位,即使在资产暂时离开摄像头视野时。根据我们所知,这是摄像头与RFID融合在自然森林环境中进行资产追踪的首次应用。
cs.CV / 19 / 2604.26244

MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution

MetaSR:用于生成超分辨率的内容自适应元数据调度
Guo, Jiaqi, Li, Mingzhen, Wang, Haohong, Katsaggelos, Aggelos K.
Abstract
We study generative super-resolution (SR) in real-world scenarios where content and degradations vary across domains, genres, and segments. For example, images and videos may alternate between text overlays, fast motion, smooth cartoons, and low-light faces, each benefiting from different forms of side information. Existing metadata-guided SR methods typically use a fixed conditioning design, which is suboptimal when useful cues are content dependent and transmission budgets are limited. We propose MetaSR, a Diffusion Transformer (DiT)-based framework that selects and injects task-relevant metadata to guide SR under resource constraints. Specifically, we use the DiT's own VAE and transformer backbone to fuse heterogeneous metadata, and adopt an efficient distillation strategy that enables one-step diffusion inference. Experiments across diverse content buckets and degradation regimes show that MetaSR outperforms reference solutions by up to 1.0 dB PSNR while achieving up to 50% transmission bitrate saving at matched quality. We assess these gains under a rate-distortion optimization (RDO) framework that jointly accounts for sender-side bitrate and receiver/display quality metrics (e.g., PSNR and SSIM).
Chinese Translation
我们研究了在真实场景中生成超分辨率(SR)的问题,其中内容和退化在不同领域、类型和片段中变化。例如,图像和视频可能在文本叠加、快速运动、平滑卡通和低光照人脸之间交替,每种情况都受益于不同形式的辅助信息。现有的元数据引导的SR方法通常使用固定的条件设计,这在有用线索依赖于内容且传输预算有限的情况下是次优的。我们提出了MetaSR,一个基于扩散变换器(Diffusion Transformer, DiT)的框架,能够在资源限制下选择和注入与任务相关的元数据以指导SR。具体而言,我们利用DiT自身的变分自编码器(VAE)和变换器主干来融合异构元数据,并采用一种高效的蒸馏策略,使得能够进行一步扩散推断。在不同内容类别和退化模式下的实验表明,MetaSR在匹配质量的情况下,PSNR性能比参考解决方案提高了最多1.0 dB,同时实现了高达50%的传输比特率节省。我们在一个率-失真优化(RDO)框架下评估这些增益,该框架共同考虑了发送方的比特率和接收方/显示质量指标(例如,PSNR和SSIM)。
cs.CV / 20 / 2604.26250

Beyond Shortcuts: Mitigating Visual Illusions in Frozen VLMs via Qualitative Reasoning

超越捷径:通过定性推理减轻冻结视觉语言模型中的视觉幻觉
Guo, Hao, Wang, Fei, Chen, Junjie, Nie, Yiqi, Zhao, Jiaqi, Li, Qiankun, Huang, Subin
Abstract
While Vision-Language Models (VLMs) have achieved state-of-the-art performance in general visual tasks, their perceptual robustness remains remarkably brittle when confronted with optical illusions. These failures are often attributed to shortcut heuristics, where models prioritize linguistic priors and memorized prototypes over direct visual evidence. In this work, we propose Structured Qualitative Inference (SQI), a training-free, data-centric framework designed to fortify visual grounding in frozen VLMs. SQI addresses perceptual anomalies through three systematic modules: (1) Axiomatic Constraint Injection, which suppresses erroneous metric estimations and quantitative hallucinations; (2) Hierarchical Scene Decomposition, which decouples target visual manifolds from complex background distractors; and (3) Counterfactual Self-Verification, an adversarial reasoning step that mitigates confirmation bias. By orchestrating these qualitative constraints at inference time, SQI effectively aligns high-level linguistic reasoning with low-level visual perception. Our framework was evaluated on the DataCV 2026 Challenge (Task I: Classic Illusion Understanding), where it ranked 2nd place overall. Experimental results demonstrate that SQI not only significantly enhances accuracy across diverse illusion categories but also provides superior diagnostic interpretability without any model fine-tuning. Our success underscores the potential of structured qualitative grounding as a robust paradigm for developing next-generation, illusion-resistant vision-language systems.
Chinese Translation
尽管视觉语言模型(VLMs)在一般视觉任务中取得了最先进的性能,但在面对光学幻觉时,其感知鲁棒性仍然显得异常脆弱。这些失败通常归因于捷径启发式,即模型优先考虑语言先验和记忆原型,而非直接的视觉证据。在本研究中,我们提出了结构化定性推理(Structured Qualitative Inference, SQI),这是一种无训练、以数据为中心的框架,旨在增强冻结VLMs中的视觉基础。SQI通过三个系统模块解决感知异常:(1)公理约束注入,抑制错误的度量估计和定量幻觉;(2)层次场景分解,将目标视觉流形与复杂背景干扰物解耦;(3)反事实自我验证,一种对抗性推理步骤,减轻确认偏误。通过在推理时协调这些定性约束,SQI有效地将高层语言推理与低层视觉感知对齐。我们的框架在DataCV 2026挑战赛(任务I:经典幻觉理解)中进行了评估,整体排名第二。实验结果表明,SQI不仅显著提高了在不同幻觉类别中的准确性,而且在没有任何模型微调的情况下提供了更优的诊断可解释性。我们的成功强调了结构化定性基础作为开发下一代抗幻觉视觉语言系统的强大范式的潜力。
cs.CV / 21 / 2604.26251

Multi-Stage Bi-Atrial Segmentation Framework from 3D Late Gadolinium-Enhanced MRI using V-Net Family Models

基于 V-Net 家族模型的 3D 晚期钆增强 MRI 多阶段双心房分割框架
Wen, Hao, Kang, Jingsu
Abstract
We report our multi-stage framework designed for the problem of multi-class bi-atrial segmentation from 3D late gadolinium-enhanced (LGE) MRI of the human heart. The pipeline consists of a preprocessing step using multidimensional contrast limited adaptive histogram equalization (MCLAHE); coarse region segmentation from MCLAHE-enhanced and down-sampled MRI using a V-Net family model; and fine segmentation from the coarse region using another V-Net model. Asymmetric loss is adopted to optimize the model weights.
Chinese Translation
我们报告了一个多阶段框架,旨在解决从人心脏的 3D 晚期钆增强 (LGE) MRI 中进行多类双心房分割的问题。该流程包括一个预处理步骤,使用多维对比度限制自适应直方图均衡 (MCLAHE);通过 MCLAHE 增强和下采样的 MRI 进行粗区域分割,采用 V-Net 家族模型;以及使用另一个 V-Net 模型从粗区域进行精细分割。采用不对称损失来优化模型权重。
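The hand-off between the coarse and fine stages of such a pipeline can be sketched as below: take the bounding box of the coarse mask predicted on the down-sampled volume, map it back to full resolution, and crop the region passed to the fine model. The function names, the integer `scale` factor, and the `margin` padding are illustrative assumptions; the abstract does not specify how the coarse region is extracted.

```python
from typing import List, Tuple

Volume = List[List[List[float]]]           # indexed as volume[z][y][x]
Box = Tuple[int, int, int, int, int, int]  # (z0, z1, y0, y1, x0, x1), end-exclusive


def coarse_bounding_box(mask: List[List[List[int]]], scale: int, margin: int,
                        full_shape: Tuple[int, int, int]) -> Box:
    """Bounding box of nonzero voxels in a down-sampled mask, in full-resolution indices."""
    coords = [(z, y, x)
              for z, plane in enumerate(mask)
              for y, row in enumerate(plane)
              for x, value in enumerate(row) if value]
    if not coords:
        raise ValueError("coarse mask is empty")
    box = []
    for axis in range(3):
        lo = min(c[axis] for c in coords) * scale - margin
        hi = (max(c[axis] for c in coords) + 1) * scale + margin
        box += [max(lo, 0), min(hi, full_shape[axis])]  # clamp to the volume
    return tuple(box)


def crop(volume: Volume, box: Box) -> Volume:
    """Extract the sub-volume that the fine-stage model would segment."""
    z0, z1, y0, y1, x0, x1 = box
    return [[row[x0:x1] for row in plane[y0:y1]] for plane in volume[z0:z1]]


# Toy 4x4x4 volume and a 2x2x2 coarse mask (down-sampling factor 2).
full = [[[float(z * 16 + y * 4 + x) for x in range(4)] for y in range(4)]
        for z in range(4)]
coarse_mask = [[[0, 0], [0, 0]], [[0, 1], [0, 0]]]  # one voxel set at (z=1, y=0, x=1)

box = coarse_bounding_box(coarse_mask, scale=2, margin=1, full_shape=(4, 4, 4))
roi = crop(full, box)
```

The margin keeps some context around the coarse region, so that a slightly under-segmented coarse mask does not cut off part of the target before fine segmentation.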
cs.CV / 22 / 2604.26252

OmniTrend: Content-Context Modeling for Scalable Social Popularity Prediction

OmniTrend:可扩展社交媒体流行度预测的内容-上下文建模
Ye, Liliang, Zeng, Guiyi, Zhang, Yunyao, Chen, Yi-Ping Phoebe, Yu, Junqing, Song, Zikai
Abstract
Predicting social media popularity requires understanding both the intrinsic appeal of content and the external context that determines how it is exposed to users. Existing methods focus on content signals but do not separate them from exposure-related patterns, which causes the learned representations to absorb platform-specific visibility effects and weakens both interpretability and cross-platform transfer. This paper introduces OmniTrend, a unified framework that models popularity as the joint outcome of content attractiveness and contextual exposure. The content module learns cross-modal representations from visual, audio, and textual cues to quantify intrinsic appeal, while the context module estimates exposure from exogenous signals such as posting time, author activity, topical trends, and retrieval-based neighborhood statistics. OmniTrend learns separate predictors for content attractiveness and contextual exposure and integrates them in the final popularity estimate, which makes the role of each factor explicit and supports robust transfer across image and video platforms.
Chinese Translation
预测社交媒体的流行度需要理解内容的内在吸引力和决定其如何展示给用户的外部上下文。现有方法主要关注内容信号,但未能将其与曝光相关的模式分开,这导致学习到的表示吸收了平台特定的可见性效应,从而削弱了可解释性和跨平台的迁移能力。本文提出了OmniTrend,一个统一框架,将流行度建模为内容吸引力和上下文曝光的联合结果。内容模块从视觉、音频和文本线索中学习跨模态表示,以量化内在吸引力,而上下文模块则根据外部信号(如发布时间、作者活动、主题趋势和基于检索的邻域统计)来估计曝光。OmniTrend为内容吸引力和上下文曝光学习了独立的预测器,并将它们整合到最终的流行度估计中,这使得每个因素的作用变得明确,并支持在图像和视频平台之间的稳健迁移。
cs.CV / 23 / 2604.26255

GaitKD: A Universal Decoupled Distillation Framework for Efficient Gait Recognition

GaitKD:一种用于高效步态识别的通用解耦蒸馏框架
Li, Yuqi, Zhou, Qian, Duan, Huiran, Wang, Jingjie, Zhang, Shunli, Yang, Chuanguang, Zhao, Guoying, Tian, Yingli
Abstract
Gait recognition is an attractive biometric modality for long-range and contact-free identification, but high-performing gait models often rely on deep and computationally expensive architectures that are difficult to deploy in practice. Knowledge distillation (KD) offers a natural way to transfer knowledge from a powerful teacher to an efficient student; however, standard KD is often less effective for part-structured gait models, where supervision is formed from both part-wise classification logits and part-wise retrieval embeddings. In this paper, we propose GaitKD, a distillation framework that decouples gait knowledge transfer into two complementary components: decision-level distillation and boundary-level distillation. Specifically, GaitKD aligns the teacher and student through part-calibrated logit distillation to transfer inter-class decision relations, while preserving the teacher-induced partitioning of the embedding space through an activation-boundary objective instead of direct feature regression. With a simple aligned part-wise design, GaitKD supports heterogeneous teacher-student gait models without introducing additional inference cost. Experimental results across multiple gait recognition benchmarks and teacher-student configurations show consistent improvements over strong gait baselines. Our study demonstrates that the two transfer components are complementary, and boundary-preserving distillation provides more stable performance than direct feature regression. Source code is available at https://github.com/liyiersan/GaitKD/
Chinese Translation
步态识别是一种具有吸引力的生物特征识别方式,适用于远程和无接触的身份识别,但高性能的步态模型通常依赖于深度且计算开销大的架构,这在实际应用中难以部署。知识蒸馏(Knowledge Distillation, KD)提供了一种自然的方式,将知识从强大的教师模型转移到高效的学生模型;然而,标准的KD在部分结构化的步态模型中往往效果不佳,因为这些模型的监督来自于部分分类逻辑和部分检索嵌入。在本文中,我们提出了GaitKD,一种将步态知识转移解耦为两个互补组件的蒸馏框架:决策级蒸馏和边界级蒸馏。具体而言,GaitKD通过部分校准的逻辑蒸馏对教师和学生进行对齐,以转移类间决策关系,同时通过激活边界目标而非直接特征回归来保留教师诱导的嵌入空间划分。通过简单的对齐部分设计,GaitKD支持异构的教师-学生步态模型,而无需引入额外的推理成本。在多个步态识别基准和教师-学生配置上的实验结果显示,GaitKD在强基线模型上持续取得了改进。我们的研究表明,这两个转移组件是互补的,边界保持的蒸馏提供了比直接特征回归更稳定的性能。源代码可在 https://github.com/liyiersan/GaitKD/ 获取。
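For orientation, the decision-level component can be related to standard temperature-scaled logit distillation applied independently to each body part. The sketch below shows only that generic baseline; the paper's part-calibrated variant and its activation-boundary objective are not reproduced here, and the function names and the default temperature are assumptions.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits: List[float], student_logits: List[float],
            temperature: float = 4.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


def partwise_kd_loss(teacher_parts: List[List[float]],
                     student_parts: List[List[float]],
                     temperature: float = 4.0) -> float:
    """Average the distillation loss over parts (one logit vector per part)."""
    losses = [kd_loss(t, s, temperature)
              for t, s in zip(teacher_parts, student_parts)]
    return sum(losses) / len(losses)


# Two parts, three identity classes each (toy numbers).
teacher = [[2.0, 0.5, -1.0], [0.0, 1.0, 3.0]]
student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = partwise_kd_loss(teacher, student)
```

Because the loss is computed per part and then averaged, teacher and student only need matching part counts and class counts, not matching backbones.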
cs.CV / 24 / 2604.26261

Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding

鲁棒的零样本3D视觉定位的多重一致性2D-3D映射
Yin, Yufei, Zheng, Jie, Meng, Qianke, Yu, Zhou, Chen, Minghao, Ding, Jiajun, Tan, Min, Xi, Yuling, Chen, Zhiwen, Lv, Chengfei
Abstract
Zero-shot 3D Visual Grounding (3DVG) is a critical capability for open-world embodied AI. However, existing methods are fundamentally bottlenecked by the poor quality of open-vocabulary 3D proposals, suffering from inaccurate categories and imprecise geometries, as well as the spatial redundancy of exhaustive multi-view reasoning. To address these challenges, we propose MCM-VG, a novel framework that achieves robust zero-shot 3DVG by explicitly establishing Multiple Consistent 2D-3D Mappings. Instead of passively relying on noisy 3D segments, MCM-VG enforces 2D-3D consistency across three fundamental dimensions to achieve precise target localization and reliable reasoning. First, a Semantic Alignment module corrects category mismatches via LLM-driven query parsing and coarse-to-fine 2D-3D matching. Second, an Instance Rectification module leverages VLM-guided 2D segmentations to reconstruct missing targets, back-projecting these reliable visual priors to establish accurate 3D geometries. Finally, to eliminate spatial redundancy, a Viewpoint Distillation module clusters 3D camera directions to extract optimal frames. By pairing these optimal RGB frames with Bird's Eye View maps into concise visual prompt sets, we formulate the final target disambiguation as a multiple-choice reasoning task for Vision-Language Models. Extensive evaluations on ScanRefer and Nr3D benchmarks demonstrate that MCM-VG sets a new state-of-the-art for zero-shot 3D visual grounding. Remarkably, it achieves 62.0% and 53.6% in Acc@0.25 and Acc@0.5 on ScanRefer, outperforming previous baselines by substantial margins of 6.4% and 4.0%.
Chinese Translation
零样本3D视觉定位(3DVG)是开放世界具身人工智能的关键能力。然而,现有方法在根本上受到开放词汇3D提议质量差的瓶颈,面临着类别不准确和几何形状不精确的问题,以及穷举式多视角推理的空间冗余。为了解决这些挑战,我们提出了MCM-VG,一个通过明确建立多重一致性2D-3D映射来实现鲁棒零样本3DVG的新框架。MCM-VG并不被动依赖噪声3D分段,而是通过在三个基本维度上强制2D-3D一致性来实现精确的目标定位和可靠的推理。首先,语义对齐模块通过基于大型语言模型(LLM)的查询解析和粗到细的2D-3D匹配来纠正类别不匹配。其次,实例校正模块利用视觉语言模型(VLM)引导的2D分割来重建缺失的目标,将这些可靠的视觉先验反向投影以建立准确的3D几何形状。最后,为了消除空间冗余,视角蒸馏模块聚类3D相机方向以提取最佳帧。通过将这些最佳RGB帧与鸟瞰图结合成简洁的视觉提示集,我们将最终的目标消歧表述为视觉-语言模型的多项选择推理任务。在ScanRefer和Nr3D基准上的广泛评估表明,MCM-VG在零样本3D视觉定位方面设定了新的最先进水平。值得注意的是,它在ScanRefer上的Acc@0.25和Acc@0.5分别达到了62.0%和53.6%,显著超越了之前的基线,分别提升了6.4和4.0个百分点。
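ScanRefer results are conventionally reported as accuracy at 3D IoU thresholds of 0.25 and 0.5: a prediction counts as correct when its box overlaps the ground-truth box by at least the threshold. A minimal sketch of that metric follows, assuming axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)` and one prediction per query; the function names and the toy boxes are illustrative, not taken from the paper or the benchmark code.

```python
from typing import List, Sequence

Box = Sequence[float]  # axis-aligned box: (xmin, ymin, zmin, xmax, ymax, zmax)


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box (zero if the box is degenerate)."""
    volume = 1.0
    for axis in range(3):
        volume *= max(box[axis + 3] - box[axis], 0.0)
    return volume


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        inter *= max(hi - lo, 0.0)
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0.0 else 0.0


def accuracy_at_iou(preds: List[Box], gts: List[Box], threshold: float) -> float:
    """Fraction of queries whose predicted box reaches the IoU threshold."""
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


gt = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
gts = [gt, gt, gt]
preds = [
    (0.0, 0.0, 0.0, 2.0, 2.0, 1.0),  # lower half of the target: IoU = 0.5
    (1.0, 0.0, 0.0, 3.0, 2.0, 2.0),  # shifted by half a box: IoU = 1/3
    (5.0, 5.0, 5.0, 6.0, 6.0, 6.0),  # disjoint: IoU = 0
]
acc_025 = accuracy_at_iou(preds, gts, 0.25)  # two of three predictions pass
acc_05 = accuracy_at_iou(preds, gts, 0.5)    # only the first prediction passes
```

The stricter threshold can only lower the score, which is consistent with the lower figure reported at 0.5 than at 0.25.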
cs.CV / 25 / 2604.26262

Semantic Foam: Unifying Spatial and Semantic Scene Decomposition

语义泡沫:统一空间与语义场景分解
Sharafeldin, Amr, Govindarajan, Shrisudhan, Walker, Thomas, Mikaeili, Aryan, Rebain, Daniel, Yi, Kwang Moo, Tagliasacchi, Andrea
Abstract
Modern scene reconstruction methods, such as 3D Gaussian Splatting, enable photo-realistic novel view synthesis at real-time speeds. However, their adoption in interactive graphics applications remains limited due to the difficulty of interacting with these representations compared to traditional, human-authored 3D assets. While prior work has attempted to impose semantic decomposition on these models, significant challenges remain in segmentation quality and cross-view consistency. To address these limitations, we introduce Semantic Foam, which extends the recently proposed Radiant Foam representation to semantic decomposition tasks. Our approach leverages the inherent spatial structure of Radiant Foam's volumetric Voronoi mesh and augments it with an explicit semantic feature field defined at the cell level. This design enables direct spatial regularization, improving consistency across views and mitigating artifacts caused by occlusion and inconsistent supervision, which are common issues in point-based representations. Experimental results demonstrate that our method achieves superior object-level segmentation performance compared to state-of-the-art approaches such as Gaussian Grouping and SAGA. Project page: http://semanticfoam.github.io/
Chinese Translation
现代场景重建方法,如3D高斯溅射(3D Gaussian Splatting),能够以实时速度实现照片级真实感的新视图合成。然而,由于与传统的人为创作3D资产相比,这些表示的交互难度较大,其在交互式图形应用中的采用仍然有限。尽管之前的研究尝试对这些模型施加语义分解,但在分割质量和跨视图一致性方面仍然存在显著挑战。为了解决这些限制,我们提出了语义泡沫(Semantic Foam),该方法将最近提出的辐射泡沫(Radiant Foam)表示扩展到语义分解任务。我们的方法利用了辐射泡沫体积Voronoi网格的固有空间结构,并在单元级定义了显式的语义特征场进行增强。这一设计使得直接的空间正则化成为可能,从而提高了视图间的一致性,并减轻了由于遮挡和不一致监督所造成的伪影,这些都是基于点的表示中常见的问题。实验结果表明,我们的方法在对象级分割性能上优于最先进的方法,如高斯分组(Gaussian Grouping)和SAGA。项目页面:http://semanticfoam.github.io/
cs.CV / 26 / 2604.26279

High-Dimensional Noise to Low-Dimensional Manifolds: A Manifold-Space Diffusion Framework for Degraded Hyperspectral Image Classification

高维噪声到低维流形:一种用于退化高光谱图像分类的流形空间扩散框架
Yang, Boxiang, Chen, Ning, Yue, Xia, Luo, Yichang, Fan, Yingbo, Zhang, Haoyuan, Ma, Haoyu, Yue, Jun, Mao, Shanjun
Abstract
Recently, Hyperspectral Image (HSI) classification has attracted increasing attention in remote sensing. However, HSI data are inherently high-dimensional but low-rank, with discriminative information concentrated on a low-dimensional latent manifold. In real-world remote sensing scenarios, the superposition of multiple degradation factors disrupts this intrinsic manifold structure, driving samples away from their original low-dimensional distribution and introducing substantial redundant and non-discriminative variations. To better handle this challenge, this paper proposes a manifold-space diffusion framework (MSDiff) for robust hyperspectral classification under complex degradation conditions. Specifically, the proposed method first maps high-dimensional, degradation-affected HSI data into a compact low-dimensional manifold through a discriminative spectral-spatial reconstruction task, preserving class semantics and reducing redundant variations. A diffusion-based generative model is then applied to regularize the spectral-spatial distribution within the manifold, enabling progressive refinement and stabilization of latent features against residual degradations. The key advantage of the proposed framework lies in performing diffusion-based distribution modeling directly on the low-dimensional manifold, effectively decoupling degradation-induced disturbances from intrinsic discriminative structures and enhancing representation stability under complex degradations. Experimental results on multiple hyperspectral benchmarks demonstrate consistent performance improvements over state-of-the-art methods under diverse composite degradation settings. The code will be available at https://github.com/yangboxiang1207/MSDiff
Chinese Translation
近年来,高光谱图像(HSI)分类在遥感领域引起了越来越多的关注。然而,HSI 数据本质上是高维但低秩的,其判别信息集中在低维潜在流形上。在实际的遥感场景中,多种退化因素的叠加破坏了这种内在的流形结构,使样本偏离其原始的低维分布,并引入了大量冗余和非判别性的变化。为更好地应对这一挑战,本文提出了一种流形空间扩散框架(MSDiff),用于在复杂退化条件下进行稳健的高光谱分类。具体而言,所提出的方法首先通过判别性光谱-空间重建任务将高维、受退化影响的 HSI 数据映射到一个紧凑的低维流形中,从而保留类别语义并减少冗余变化。然后,应用基于扩散的生成模型来规范流形内的光谱-空间分布,使潜在特征在残余退化下能够逐步精炼和稳定。该框架的关键优势在于直接在低维流形上执行基于扩散的分布建模,有效地将退化引起的干扰与内在判别结构解耦,并在复杂退化下增强表示的稳定性。在多个高光谱基准上的实验结果表明,在多种复合退化设置下,所提出的方法在性能上始终优于最先进的方法。代码将发布在 https://github.com/yangboxiang1207/MSDiff
cs.CV / 27 / 2604.26283

MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution

MedSynapse-V:通过潜在记忆演化桥接视觉感知与临床直觉
Zhu, Chunzheng, Zeng, Jiaqi, Jiang, Junyu, Lin, Jianxin, Wang, Yijun
Abstract
High-precision medical diagnosis relies not only on static imaging features but also on the implicit diagnostic memory experts instantly invoke during image interpretation. We pinpoint a fundamental cognitive misalignment in medical VLMs caused by discrete tokenization, leading to quantization loss, long-range information dissipation, and missing case-adaptive expertise. To bridge this gap, we propose MedSynapse-V, a framework for latent diagnostic memory evolution that simulates the experiential invocation of clinicians by dynamically synthesizing implicit diagnostic memories within the model's hidden stream. Specifically, it begins with a Meta Query for Prior Memorization mechanism, where learnable probes retrieve structured priors from an anatomical prior encoder to generate condensed implicit memories. To ensure clinical fidelity, we introduce Causal Counterfactual Refinement (CCR), which leverages reinforcement learning and counterfactual rewards derived from region-level feature masking to quantify the causal contribution of each memory, thereby pruning redundancies and aligning latent representations with diagnostic logic. This evolutionary process culminates in Intrinsic Memory Transition (IMT), a privileged-autonomous dual-branch paradigm that internalizes teacher-branch diagnostic patterns into the student-branch via full-vocabulary divergence alignment. Comprehensive empirical evaluations across multiple datasets demonstrate that MedSynapse-V, by transferring external expertise into endogenous parameters, significantly outperforms existing state-of-the-art methods, particularly chain-of-thought paradigms, in diagnostic accuracy.
Chinese Translation
高精度的医学诊断不仅依赖于静态影像特征,还依赖于专家在图像解读过程中瞬时调用的隐式诊断记忆。我们指出医学视觉语言模型(VLMs)中由于离散标记化造成的基本认知不一致,导致量化损失、长距离信息消散以及缺失案例适应性专业知识。为了解决这一问题,我们提出了一个潜在诊断记忆演化框架,通过在模型的隐藏流中动态合成隐式诊断记忆,模拟临床医生的经验调用。具体而言,它首先采用一种用于先前记忆的元查询机制,其中可学习的探针从解剖先验编码器中检索结构化先验,以生成浓缩的隐式记忆。为了确保临床的准确性,我们引入了因果反事实精炼(CCR),利用强化学习和来自区域级特征掩蔽的反事实奖励来量化每个记忆的因果贡献,从而修剪冗余并将潜在表示与诊断逻辑对齐。这个演化过程以内在记忆转移(IMT)为高潮,形成了一种特权自主的双分支范式,通过全词汇差异对齐将教师分支的诊断模式内化到学生分支中。对多个数据集的全面实证评估表明,通过将外部专业知识转移到内生参数中,我们的方法在诊断准确性上显著优于现有的最先进方法,特别是思维链范式。
cs.CV / 28 / 2604.26285

Event-based Liveness Detection using Temporal Ocular Dynamics: An Exploratory Approach

基于事件的活体检测:时间性眼动学的探索性方法
Mastropasqua, Nicolas, Bugueno-Cordova, Ignacio, Verschae, Rodrigo, Acevedo, Daniel, Negri, Pablo
Abstract
Face liveness detection has been extensively studied using RGB cameras, achieving strong performance under controlled conditions but often failing to generalize across sensors and attack scenarios. In this work, we explore event cameras as an alternative sensing modality for liveness detection based on temporal ocular dynamics. Event cameras capture sparse, asynchronous changes in brightness with microsecond resolution, enabling precise analysis of fast eye movements such as saccades. Replay attacks cannot faithfully reproduce these dynamics due to temporal resampling and display artifacts, leading to distinctive spatio-temporal patterns in the event domain. We design a data collection protocol to extend RGBE-Gaze with replay-attack recordings, yielding an event-based fake counterpart for liveness detection. We analyze event-driven temporal features from eye regions and evaluate their effectiveness for ocular motion segmentation and liveness classification. Our results show that event-based representations enable reliable discrimination between genuine and replayed sequences, achieving up to 95.37% top-1 accuracy with a spiking convolutional neural network. These preliminary findings highlight the potential of event-based sensing for robust and low-latency liveness detection.
Chinese Translation
人脸活体检测已在使用RGB摄像头的研究中得到了广泛关注,在受控条件下取得了良好的性能,但在不同传感器和攻击场景中往往无法推广。在本研究中,我们探索了事件摄像头作为活体检测的替代传感方式,基于时间性眼动学。事件摄像头以微秒级分辨率捕捉稀疏的、异步的亮度变化,使得对快速眼动(如扫视)的精确分析成为可能。重放攻击由于时间重采样和显示伪影无法真实再现这些动态,从而在事件域中产生独特的时空模式。我们设计了一种数据收集协议,以重放攻击录制数据扩展RGBE-Gaze,生成用于活体检测的基于事件的伪造样本。我们分析了眼部区域的事件驱动时间特征,并评估其在眼动分割和活体分类中的有效性。我们的结果表明,基于事件的表示能够可靠地区分真实和重放序列,使用脉冲卷积神经网络达到了最高95.37%的Top-1准确率。这些初步发现突显了基于事件的传感在稳健和低延迟活体检测中的潜力。
cs.CV / 29 / 2604.26288

CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation

CheXthought:用于胸部X光解读的全球多模态临床推理链和视觉注意力数据集
Sharma, Sonali, Long, Jin, Shih, George, Eid, Sarah, Bluethgen, Christian, Jacobson, Francine L., Tsai, Emily B., Consortium, Global Radiology, Alaa, Ahmed M., Langlotz, Curtis P.
Abstract
Chest X-ray interpretation is one of the most frequently performed diagnostic tasks in medicine and a primary target for AI development, yet current vision-language models are primarily trained on datasets of paired images and reports, not the cognitive processes and visual attention that underlie clinical reasoning. Here, we present CheXthought, a global, multimodal resource containing 103,592 chain-of-thought reasoning traces and 6,609,082 synchronized visual attention annotations across 50,312 multi-read chest X-rays from 501 radiologists in 71 countries. Our analysis reveals clinical reasoning patterns in how experts deploy distinct visual search strategies, integrate clinical context, and communicate uncertainty. We demonstrate the clinical utility of CheXthought across four dimensions. First, CheXthought reasoning significantly outperforms state-of-the-art vision-language model chain-of-thought in factual accuracy and spatial grounding. Second, visual attention data used as an inference-time hint recovers missed findings and significantly reduces hallucinations. Third, models trained on CheXthought data achieve significantly stronger pathology classification, visual faithfulness, temporal reasoning and uncertainty communication. Fourth, leveraging CheXthought's multi-reader annotations, we predict both human-human and human-AI disagreement directly from an image, enabling transparent communication of case difficulty, uncertainty and model reliability. These findings establish CheXthought as a resource for advancing multimodal clinical reasoning and the development of more transparent, interpretable vision-language models.
Chinese Translation
胸部X光解读是医学中最常见的诊断任务之一,也是人工智能开发的主要目标。然而,目前的视觉-语言模型主要是在成对的图像和报告数据集上进行训练,而不是在支持临床推理的认知过程和视觉注意力上。在此,我们提出了CheXthought,这是一个全球性的多模态资源,包含103,592条推理链和6,609,082个同步的视觉注意力注释,数据来源于71个国家的501名放射科医生的50,312个多读者胸部X光图像。我们的分析揭示了专家在如何部署不同的视觉搜索策略、整合临床背景和传达不确定性方面的临床推理模式。我们在四个维度上展示了CheXthought的临床实用性。首先,CheXthought推理在事实准确性和空间基础方面显著优于最先进的视觉-语言模型推理链。其次,作为推理时间提示使用的视觉注意力数据能够恢复遗漏的发现,并显著减少幻觉。第三,基于CheXthought数据训练的模型在病理分类、视觉真实感、时间推理和不确定性传达方面表现出显著更强的能力。第四,利用CheXthought的多读者注释,我们能够直接从图像中预测人类与人类及人类与人工智能之间的分歧,从而实现对病例难度、不确定性和模型可靠性的透明沟通。这些发现确立了CheXthought作为推动多模态临床推理和开发更透明、可解释的视觉-语言模型的资源。
cs.CV / 30 / 2604.26317

The Unseen Adversaries: Robust and Generalized Defense Against Adversarial Patches

看不见的对手:针对对抗性补丁的鲁棒性和广义防御
Kumar, Vishesh, Agarwal, Akshay
Abstract
The vulnerabilities of deep neural networks to singularities have raised serious concerns regarding their deployment in the physical world. One of the most prominent and impactful physical-world adversarial perturbations is the attachment of patches to clean images, known as an adversarial patch attack. Similarly, natural noises such as Gaussian and salt-and-pepper noise are highly prevalent in the real world. The current research need arises from these vulnerabilities and from the lack of efforts to tackle these two singularities independently and, especially, in combination. In this research, we have, for the first time, combined these two prominent singularities and proposed a novel dataset. Using this dataset, we have conducted a benchmark study of singularity data-point detection using features from several convolutional neural networks. For classification, rather than the popular neural network-based parameter tuning, we have used traditional yet effective machine learning classifiers. Extensive experiments across various in- and out-of-distribution (OOD) singularities reveal several interesting findings about the effectiveness of classifiers and show that it is hard to defend against adversaries when they are treated independently and inefficient classifiers are selected.
Chinese Translation
深度神经网络在面对奇异性时的脆弱性引发了关于其在现实世界中应用的严重担忧。其中最显著且影响深远的物理世界对抗性扰动是将补丁附加到干净图像上,这被称为对抗性补丁攻击。同样,现实世界中普遍存在的自然噪声,如高斯噪声和椒盐噪声,也对模型的鲁棒性构成挑战。目前的研究需求源于上述脆弱性,以及缺乏针对这两种奇异性独立且特别是结合进行处理的努力。在本研究中,我们首次将这两种显著的奇异性结合在一起,并提出了一个新颖的数据集。利用该数据集,我们对使用多种卷积神经网络特征进行奇异性数据点检测进行了基准研究。在分类方面,我们采用了传统但有效的机器学习分类器,而不是流行的基于神经网络的参数调优。针对各种分布内和分布外(OOD)奇异性的广泛实验揭示了分类器有效性的一些有趣发现,并表明,当对抗者被独立对待且选择了低效分类器时,防御变得困难。
cs.CV / 31 / 2604.26318

Point Cloud Registration via Probabilistic Self-Update Local Correspondence and Line Vector Sets

通过概率自更新局部对应关系和线向量集进行点云配准
Chung, Kuo-Liang, Lin, Yu-Cheng, Chen, Wu-Chi
Abstract
Point cloud registration (PCR) is a fundamental task for integrating 3D observations in remote sensing applications. This paper proposes a fast and effective PCR algorithm utilizing probabilistic self-updating local correspondence and line vector sets. Our dual RANSAC interaction model comprises a global RANSAC evaluating the global correspondence set and a local RANSAC operating on dynamically updated local sets. Initially, these local sets are constructed using angle histogram statistics and line vector length preservation techniques. To improve accuracy, a probabilistic self-updating strategy refines the local sets after each interaction round. To reduce runtime, we introduce a global early termination condition that optimally balances accuracy and efficiency. Finally, a weighted singular value decomposition estimates the registration solution. Evaluations on public datasets demonstrate our algorithm achieves superior time efficiency and at least a 10% root mean square error improvement over state-of-the-art methods. The C++ source code is publicly available at https://github.com/ivpml84079/Probabilistic-Self-Update-Line-Vector-Set-Based-Point-Cloud-Registration.
Chinese Translation
点云配准(PCR)是整合遥感应用中三维观测的基础任务。本文提出了一种快速有效的PCR算法,利用概率自更新的局部对应关系和线向量集。我们的双重RANSAC交互模型包括一个全局RANSAC,用于评估全局对应集,以及一个在动态更新的局部集上操作的局部RANSAC。最初,这些局部集是通过角度直方图统计和线向量长度保持技术构建的。为了提高精度,概率自更新策略在每轮交互后对局部集进行精细化。为了减少运行时间,我们引入了一个全局提前终止条件,最优地平衡了精度和效率。最后,采用加权奇异值分解来估计配准解。对公共数据集的评估表明,我们的算法在时间效率上表现优越,并且在均方根误差上比最先进的方法至少提高了10%。C++源代码已公开,网址为 https://github.com/ivpml84079/Probabilistic-Self-Update-Line-Vector-Set-Based-Point-Cloud-Registration。
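The final step named in the registration abstract above, a weighted singular value decomposition, can be illustrated with the standard weighted Kabsch solution. This is a minimal sketch and not the authors' C++ implementation: the function name `weighted_rigid_transform` and the per-correspondence `weights` are assumptions made for illustration, and the dual RANSAC stages that would select and weight the correspondences are not shown.

```python
import numpy as np

def weighted_rigid_transform(src, dst, weights):
    """Return (R, t) minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) arrays of corresponding points (hypothetical inputs).
    weights:  (N,) non-negative confidence per correspondence.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalize so the weights sum to 1

    src_mean = w @ src                   # weighted centroids
    dst_mean = w @ dst
    src_c = src - src_mean
    dst_c = dst - dst_mean

    # Weighted 3x3 cross-covariance: sum_i w_i * src_c_i * dst_c_i^T
    H = (src_c * w[:, None]).T @ dst_c
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so that R is a proper rotation (det = +1)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t

# Example: recover a known 30-degree rotation about z plus a translation.
rng = np.random.default_rng(0)
points = rng.normal(size=(20, 3))
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([1.0, 2.0, 3.0])
moved = points @ R_true.T + t_true
R_est, t_est = weighted_rigid_transform(points, moved, np.ones(len(points)))
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))
```

In a RANSAC-style pipeline the weights would typically come from inlier confidence, so that doubtful correspondences pull less on the estimate.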
cs.CV / 32 / 2604.26321

Motion-Driven Multi-Object Tracking of Model Organisms in Space Science Experiments

基于运动驱动的空间科学实验中模型生物的多目标跟踪
You, Jianing, Wang, Han, Liu, Kang, Ding, Jiale, Chu, Fengjie, Guo, Zihan, Li, Shengyang
Abstract
Automated animal behavior analysis relies on long-term, interpretable individual trajectories; however, multi-animal tracking in space science experimental videos remains highly challenging due to weak appearance cues, low-quality imaging, complex maneuvering behaviors, and frequent interactions. To address this problem, we first construct the SpaceAnimal-MOT dataset to characterize the motion complexity and long-term identity preservation challenges in biological videos acquired under microgravity conditions. We then propose ART-Track (Adaptive Robust Tracking), a motion-driven tracking framework tailored to this setting. Specifically, multi-model motion estimation is introduced to handle abrupt maneuvers and nonlinear motion, motion-state-driven association is designed to reduce identity switches under dense interactions and temporary mismatch, and uncertainty-adaptive fusion is used to dynamically balance spatial and motion cues when prediction reliability varies. Experimental results show that ART-Track significantly reduces identity switches on zebrafish and fruitfly sequences, while maintaining more stable association under occlusion, deformation, and high-density interactions, thereby providing a more reliable tracking foundation for downstream quantitative behavior analysis. The code is publicly available at https://github.com/yyy7777777/ART_TRACK/tree/main.
Chinese Translation
自动化动物行为分析依赖于长期、可解释的个体轨迹;然而,由于外观线索弱、成像质量低、复杂的机动行为和频繁的相互作用,在空间科学实验视频中进行多动物跟踪仍然极具挑战性。为了解决这一问题,我们首先构建了SpaceAnimal-MOT数据集,以表征在微重力条件下获取的生物视频中的运动复杂性和长期身份保持挑战。然后,我们提出了ART-Track(自适应鲁棒跟踪),这是一个针对这一环境量身定制的运动驱动跟踪框架。具体而言,引入了多模型运动估计以处理突发机动和非线性运动,设计了运动状态驱动的关联以减少在密集交互和临时不匹配下的身份切换,并使用不确定性自适应融合在预测可靠性变化时动态平衡空间和运动线索。实验结果表明,ART-Track显著减少了在斑马鱼和果蝇序列中的身份切换,同时在遮挡、变形和高密度交互下保持了更稳定的关联,从而为后续的定量行为分析提供了更可靠的跟踪基础。代码已公开发布在 https://github.com/yyy7777777/ART_TRACK/tree/main。
cs.CV / 33 / 2604.26324

Federated Medical Image Classification under Class and Domain Imbalance exploiting Synthetic Sample Generation

利用合成样本生成的类和领域不平衡下的联邦医学图像分类
Pavan, Martina, Caligiuri, Matteo, Barbato, Francesco, Zanuttigh, Pietro
Abstract
Exploiting deep learning in medical imaging faces critical challenges, including strict privacy constraints, heterogeneous imaging devices with varying acquisition properties, and class imbalance due to the uneven prevalence of pathologies. In this work, we propose FedSSG, a novel Federated Learning framework that addresses domain shifts caused by diverse imaging devices while mitigating the under-representation of rare pathologies. The key contribution is a strategy for generating synthetic samples and distributing them across clients to improve coverage of both underrepresented pathologies and imaging devices. Experimental results demonstrate that our approach significantly enhances model performance and generalization across heterogeneous institutions, with minimal computational overhead at the client side.
Chinese Translation
在医学成像中利用深度学习面临着严峻的挑战,包括严格的隐私约束、具有不同采集特性的异构成像设备,以及由于病理学的不同流行程度导致的类别不平衡。在本研究中,我们提出了FedSSG,一个新颖的联邦学习框架,旨在解决由多样化成像设备引起的领域转移,同时减轻稀有病理的代表性不足。我们的关键贡献是提出了一种生成合成样本并将其分发到客户端的策略,以改善对稀有病理和成像设备的覆盖。实验结果表明,我们的方法显著提高了模型在异构机构中的性能和泛化能力,同时在客户端的计算开销最小。
cs.CV / 34 / 2604.26341

SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

SpatialFusion:赋予统一图像生成内在的三维几何意识
Qiu, Haiyi, Pan, Kaihang, Li, Jiacheng, Li, Juncheng, Tang, Siliang, Zhuang, Yueting
Abstract
Recent unified image generation models have achieved remarkable success by employing MLLMs for semantic understanding and diffusion backbones for image generation. However, these models remain fundamentally limited in spatially-aware tasks due to a lack of intrinsic spatial understanding and the absence of explicit geometric guidance during generation. In this paper, we propose SpatialFusion, a novel framework that internalizes 3D geometric awareness into unified image generation models. Specifically, we first employ a Mixture-of-Transformers (MoT) architecture to augment the MLLM with a parallel spatial transformer to enhance 3D geometric modeling capability. By sharing self-attention with the MLLM, the spatial transformer learns to derive metric-depth maps of target images from rich semantic contexts. These explicit geometric scaffolds are then injected into the diffusion backbone through a specialized depth adapter, providing precise spatial constraints for spatially-coherent image generation. Through a progressive two-stage training strategy, SpatialFusion significantly enhances performance on spatially-aware benchmarks, notably outperforming leading models such as GPT-4o. Additionally, it achieves generalized performance gains across both text-to-image generation and image editing scenarios, all while maintaining negligible inference overhead.
Chinese Translation
近期的统一图像生成模型通过采用多模态大语言模型(MLLMs)进行语义理解,以及使用扩散骨干网络进行图像生成,取得了显著成功。然而,由于缺乏内在的空间理解以及在生成过程中缺乏明确的几何指导,这些模型在空间感知任务上仍然存在根本性的局限性。本文提出了SpatialFusion,一个将三维几何意识内化到统一图像生成模型中的新框架。具体而言,我们首先采用混合变换器(Mixture-of-Transformers, MoT)架构,通过并行的空间变换器增强MLLM的三维几何建模能力。通过与MLLM共享自注意力,空间变换器学习从丰富的语义上下文中推导目标图像的度量深度图。这些明确的几何支架随后通过专门的深度适配器注入到扩散骨干网络中,为空间一致的图像生成提供精确的空间约束。通过渐进的两阶段训练策略,SpatialFusion在空间感知基准测试中显著提升了性能,尤其是超越了如GPT-4o等领先模型。此外,它在文本到图像生成和图像编辑场景中均实现了广泛的性能提升,同时保持了微不足道的推理开销。
cs.CV / 35 / 2604.26342

Which Face and Whose Identity? Solving the Dual Challenge of Deepfake Proactive Forensics in Multi-Face Scenarios

哪个面孔与谁的身份?解决多面孔场景中深伪主动取证的双重挑战
Zhang, Lei, Guo, Zhiqing, Ma, Dan, Yang, Gaobo
Abstract
Unlike single-face forgeries, deepfakes in complex multi-person interaction scenarios (such as group photos and multi-person meetings) more closely reflect real-world threats. Although existing proactive forensics solutions demonstrate good performance, they heavily rely on a "single-face" setting, making it difficult to effectively address the problems of deepfake localization and source tracing in complex multi-person environments. To address this challenge, we propose the Deep Attributable Watermarking Framework (DAWF). This framework adopts a novel multi-face encoder-decoder architecture that bypasses the cumbersome offline pre-processing steps of traditional forensics, facilitating efficient in-network parallel watermark embedding and cross-face collaborative processing. Crucially, we propose a selective regional supervision loss. This innovative mechanism guides the decoder to focus exclusively on the facial regions tampered with by deepfakes. Leveraging this mechanism alongside the embedded identity payloads, DAWF realizes the "which + who" goal, answering the dual questions of which facial region was forged and who was forged. Extensive experiments on challenging multi-face datasets show that DAWF achieves excellent deepfake localization and traceability in complex multi-person scenes.
Chinese Translation
与单面孔伪造不同,复杂的多人物互动场景(如团体照片和多人会议)中的深伪更能反映现实世界的威胁。尽管现有的主动取证解决方案表现良好,但它们严重依赖于“单面孔”设置,使得在复杂的多人物环境中有效解决深伪定位和源追踪的问题变得困难。为了解决这一挑战,我们提出了深度可归因水印框架(Deep Attributable Watermarking Framework, DAWF)。该框架采用了一种新颖的多面孔编码器-解码器架构,绕过了传统取证的繁琐离线预处理步骤,促进了高效的网络内并行水印嵌入和跨面孔协同处理。关键是,我们提出了一种选择性区域监督损失。这一创新机制引导解码器专注于被深伪篡改的面部区域。结合嵌入的身份负载,DAWF 实现了“哪个 + 谁”的目标,回答了哪个面部区域被伪造以及谁被伪造的双重问题。在具有挑战性的多面孔数据集上的大量实验表明,DAWF 在复杂的多人物场景中实现了出色的深伪定位和可追溯性。
cs.CV / 36 / 2604.26348

ACPO: Anchor-Constrained Perceptual Optimization for Diffusion Models with No-Reference Quality Guidance

ACPO:基于锚点约束的扩散模型无参考质量指导的感知优化
Yang, Yang, Meng, Feifan, Fang, Han, Zhang, Weiming
Abstract
Diffusion models have achieved remarkable success in image generation, yet their training is predominantly driven by full-reference objectives that enforce pixel-wise similarity to ground-truth images. Such supervision, while effective for fidelity, may be insufficient in terms of subjective visual perception quality and text-image semantic consistency. In this work, we investigate the problem of incorporating no-reference perceptual quality into diffusion training. A key challenge is that directly optimizing perceptual signals, such as those provided by no-reference image quality assessment (NR-IQA) models, introduces a mismatch with the original diffusion objective, leading to training instability and distributional drift during fine-tuning. To address this issue, we propose an anchor-constrained optimization framework that enables stable perceptual adaptation. Specifically, we leverage a learned NR-IQA model as a perceptual guidance signal, while introducing an anchor-based regularization that enforces consistency with the base diffusion model in terms of noise prediction. This design effectively balances perceptual quality improvement and generative fidelity, allowing controlled adaptation toward perceptually favorable outputs without compromising the original generative behavior. Extensive experiments demonstrate that our method consistently enhances perceptual quality while preserving generation diversity and training stability, highlighting the effectiveness of anchor-constrained perceptual optimization for diffusion models.
Chinese Translation
扩散模型在图像生成方面取得了显著成功,但其训练主要依赖于全参考目标,这种目标强制要求与真实图像在像素级别上的相似性。尽管这种监督在保真度方面有效,但在主观视觉感知质量和文本-图像语义一致性方面可能不足。在本研究中,我们探讨了将无参考感知质量纳入扩散训练的问题。一个关键挑战是,直接优化感知信号(例如无参考图像质量评估(NR-IQA)模型提供的信号)会与原始扩散目标产生不匹配,从而导致训练不稳定和微调过程中的分布漂移。为了解决这个问题,我们提出了一种锚点约束优化框架,以实现稳定的感知适应。具体而言,我们利用一个学习到的NR-IQA模型作为感知指导信号,同时引入基于锚点的正则化,以在噪声预测方面强制与基础扩散模型的一致性。该设计有效地平衡了感知质量的提升和生成保真度,允许在不妥协原始生成行为的情况下,朝着感知上更有利的输出进行受控适应。大量实验表明,我们的方法在增强感知质量的同时,保持了生成多样性和训练稳定性,突显了基于锚点约束的感知优化在扩散模型中的有效性。
cs.CV / 37 / 2604.26353

GateMOT: Q-Gated Attention for Dense Object Tracking

GateMOT:用于密集目标跟踪的Q门控注意力
Lv, Mingjin, Liu, Zelin, Shao, Feifei, Chen, Yi-Ping Phoebe, Yu, Junqing, Yang, Wei, Song, Zikai
Abstract
While large models demonstrate the strong representational power of vanilla attention, this core mechanism cannot be directly applied to Dense Object Tracking: its quadratic all-to-all interactions are computationally prohibitive for dense motion estimation on high-resolution features. This mismatch prevents Dense Object Tracking from fully leveraging attention-based modeling in crowded and occlusion-heavy scenes. To address this challenge, we introduce GateMOT, an online tracking framework centered on Q-Gated Attention (Q-Attention), an efficient and spatially aware attention variant. Our key idea is to repurpose the Query from a similarity-conditioning term into a learnable gating unit. This Gating-Query (Gating-Q) produces a probabilistic gate that modulates Key features in an element-wise manner, enabling explicit relevance selection instead of costly global aggregation. Built on this mechanism, parallel Q-Attention heads transform one shared feature map into task-specific yet consistent representations for detection, motion, and re-identification, yielding a tightly coupled multi-task decoder with linear-complexity gating operations. GateMOT achieves state-of-the-art HOTA of 48.4, MOTA of 67.8, and IDF1 of 64.5 on BEE24, and demonstrates strong performance on additional Dense Object Tracking benchmarks. These results show that Q-Attention is a simple, effective, and transferable building block for attention-based tracking in dense tracking scenarios.
Chinese Translation
尽管大型模型展示了普通注意力的强大表征能力,但这一核心机制无法直接应用于密集目标跟踪:其平方级的全对全交互在高分辨率特征上的密集运动估计中计算开销巨大。这一不匹配阻碍了密集目标跟踪在拥挤和遮挡严重场景中充分利用基于注意力的建模。为了解决这一挑战,我们提出了GateMOT,一个以Q门控注意力(Q-Attention)为中心的在线跟踪框架,这是一种高效且具有空间感知的注意力变体。我们的关键思想是将查询(Query)从相似性条件项重新利用为可学习的门控单元。该门控查询(Gating-Query,Gating-Q)生成一个概率门,以逐元素的方式调节关键特征,从而实现显式的相关性选择,而不是成本高昂的全局聚合。在这一机制的基础上,多个并行的Q-Attention头将一个共享特征图转化为任务特定但一致的表示,用于检测、运动和重识别,从而产生一个紧密耦合的多任务解码器,具有线性复杂度的门控操作。GateMOT在BEE24数据集上达到了48.4的最新HOTA、67.8的MOTA和64.5的IDF1,并在其他密集目标跟踪基准测试中表现出色。这些结果表明,Q-Attention是一个简单、有效且可转移的构建模块,适用于密集跟踪场景中的基于注意力的跟踪。
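The gating idea in the GateMOT abstract above, where the Query becomes a probabilistic gate that modulates Key features element-wise, can be sketched as follows. This is an illustrative reading rather than the paper's architecture: the sigmoid gate, the plain linear projections `w_q` and `w_k`, and the function name `q_gated_features` are all assumptions.

```python
import numpy as np

def q_gated_features(x, w_q, w_k):
    """Element-wise Query gating over Key features.

    x:   (N, C) feature map flattened to N spatial positions.
    w_q: (C, D) hypothetical projection that produces the gating query.
    w_k: (C, D) hypothetical projection that produces the key features.
    Returns an (N, D) array. No N x N attention matrix is ever formed,
    so the cost grows linearly with the number of positions N.
    """
    q = x @ w_q
    k = x @ w_k
    gate = 1.0 / (1.0 + np.exp(-q))   # sigmoid: each entry lies in (0, 1)
    return gate * k                   # element-wise relevance selection

# Example with random features: 6 positions, 8 input channels, 4 output channels.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 8))
w_q = rng.normal(size=(8, 4))
w_k = rng.normal(size=(8, 4))
print(q_gated_features(features, w_q, w_k).shape)
```

For contrast, vanilla attention would compute an (N, N) similarity matrix between all positions, which is the quadratic cost the abstract describes as prohibitive on high-resolution features.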
cs.CV / 38 / 2604.26363

CO-EVO: Co-evolving Semantic Anchoring and Style Diversification for Federated DG-ReID

CO-EVO:联合进化语义锚定与风格多样化用于联邦领域泛化行人重识别
Zhang, Fengchun, Ma, Qiang, Xiang, Liuyu, Lai, Jinshan, Huang, Tingxuan, Hu, Jianwei
Abstract
Federated domain generalization for person re-identification (FedDG-ReID) aims to collaboratively train a pedestrian retrieval model across multiple decentralized source domains such that it can generalize to unseen target environments without compromising raw data privacy. However, this task is significantly challenged by the inherent stylistic gaps across decentralized clients. Without global supervision, models easily succumb to shortcut learning where representations overfit to domain-specific camera biases rather than universal identity features. We propose CO-EVO, a novel federated framework that resolves this semantic-style conflict through a co-evolutionary mechanism. On the semantic side, Camera-Invariant Semantic Anchoring (CSA) learns identity prompts with cross-camera consistency to establish purified and domain-agnostic anchors that filter out local imaging noise. On the visual side, Global Style Diversification (GSD), powered by a Global Camera-Style Bank (GCSB), synthesizes realistic perturbations to expand the visual boundaries of training data. The core of CO-EVO is its co-evolutionary loop where purified anchors act as gravitational centers to guide the image encoder toward robust anatomical attributes amidst diverse style variations. Extensive experiments demonstrate that CO-EVO achieves state-of-the-art (SOTA) performance, proving that the synergy between semantic purification and style expansion is essential for robust cross-domain generalization. Our code is available at: https://github.com/NanYiyuzurn/ACL-LGPS-2026.
Chinese Translation
联邦领域泛化行人重识别(FedDG-ReID)旨在跨多个去中心化源领域协作训练行人检索模型,使其能够在不妨碍原始数据隐私的情况下对未见目标环境进行泛化。然而,这一任务受到去中心化客户端之间固有风格差异的显著挑战。在没有全球监督的情况下,模型容易陷入捷径学习,即表示过度拟合于特定领域的相机偏差,而非普遍的身份特征。我们提出了CO-EVO,这是一种新颖的联邦框架,通过共同进化机制解决语义与风格之间的冲突。在语义方面,摄像机不变语义锚定(CSA)学习具有跨相机一致性的身份提示,以建立纯化且领域无关的锚点,从而过滤掉局部成像噪声。在视觉方面,全球风格多样化(GSD)由全球相机风格库(GCSB)驱动,合成现实的扰动以扩展训练数据的视觉边界。CO-EVO的核心是其共同进化循环,其中纯化的锚点作为引力中心,引导图像编码器在多样化的风格变化中朝向稳健的解剖特征。大量实验表明,CO-EVO实现了最先进的(SOTA)性能,证明了语义纯化与风格扩展之间的协同作用对于稳健的跨领域泛化至关重要。我们的代码可在以下链接获取:https://github.com/NanYiyuzurn/ACL-LGPS-2026。
cs.CV / 39 / 2604.26365

Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models

超越固定公式:用于高效扩散模型的数据驱动线性预测器
Shen, Zhirong, Huang, Rui, Liu, Jiacheng, Zou, Chang, Cai, Peiliang, Zheng, Shikang, Shi, Zhengyi, Feng, Liang, Zhang, Linfeng
Abstract
To address the high sampling cost of Diffusion Transformers (DiTs), feature caching offers a training-free acceleration method. However, existing methods rely on hand-crafted forecasting formulas that fail under aggressive skipping. We propose L2P (Learnable Linear Predictor), a simple data-driven caching framework that replaces fixed coefficients with learnable per-timestep weights. Rapidly trained in ~20 seconds on a single GPU, L2P accurately reconstructs current features from past trajectories. L2P significantly outperforms existing baselines: it achieves a 4.55x FLOPs reduction and 4.15x latency speedup on FLUX.1-dev, and maintains high visual fidelity under up to 7.18x acceleration on Qwen-Image models, where prior methods show noticeable quality degradation. Our results show learning linear predictors is highly effective for efficient DiT inference. Code is available at https://github.com/Aredstone/L2P-Cache.
Chinese Translation
为了解决扩散变换器(Diffusion Transformers,DiTs)高采样成本的问题,特征缓存提供了一种无训练的加速方法。然而,现有的方法依赖于手工设计的预测公式,在激进跳过的情况下表现不佳。我们提出了L2P(可学习线性预测器),这是一个简单的数据驱动缓存框架,用可学习的每时间步权重替代固定系数。L2P在单个GPU上快速训练,约需20秒,能够准确地从过去的轨迹中重建当前特征。L2P显著优于现有基准:在FLUX.1-dev上实现了4.55倍的FLOPs减少和4.15倍的延迟加速,并在Qwen-Image模型上保持高视觉保真度,支持高达7.18倍的加速,而之前的方法则显示出明显的质量下降。我们的结果表明,学习线性预测器对于高效的DiT推理是非常有效的。代码可在https://github.com/Aredstone/L2P-Cache获取。
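The core idea of the L2P abstract above, replacing hand-crafted forecasting coefficients with weights learned from data, can be sketched with an ordinary least-squares fit for a single timestep. This is only a sketch under stated assumptions: the function names are hypothetical, the closed-form solve stands in for whatever training procedure the paper actually uses, and the features are synthetic.

```python
import numpy as np

def fit_linear_predictor(past, target):
    """Fit weights w so that sum_j w[j] * past[j] approximates target.

    past:   (m, D) array holding m cached feature vectors from earlier steps.
    target: (D,) feature vector actually computed at the current step.
    Returns an (m,) weight vector for this one timestep.
    """
    # Least squares: minimize || past.T @ w - target ||^2 over w
    w, *_ = np.linalg.lstsq(past.T, target, rcond=None)
    return w

def predict_features(past, w):
    """Reconstruct the current features as a weighted sum of cached ones."""
    return w @ past

# Example: the true feature is a linear extrapolation 2*f[t-1] - f[t-2].
rng = np.random.default_rng(0)
cached = rng.normal(size=(3, 50))            # rows: f[t-1], f[t-2], f[t-3]
current = 2.0 * cached[0] - 1.0 * cached[1]
weights = fit_linear_predictor(cached, current)
print(np.round(weights, 6))
print(np.allclose(predict_features(cached, weights), current))
```

A fixed formula would hard-code coefficients such as (2, -1, 0) for every step; fitting one weight vector per skipped timestep lets the coefficients follow how the features actually evolve.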
cs.CV / 40 / 2604.26368

Seamless Indoor-Outdoor Mapping for INGENIOUS First Responders

无缝室内外映射用于INGENIOUS第一响应者
Wohlfeil, Jürgen, Meißner, Henry, Schischmanow, Adrian, Kraft, Thomas, Baumbach, Dirk, Ernst, Ines, Dahlke, Dennis
Abstract
In several applications, 3D models are needed not only of outdoor spaces but also of building interiors. In the context of First Responder enhancement in large-scale natural and man-made disasters, a method is presented to achieve this goal with a high degree of automation. To this end, an autonomously flying aerial mapping system is combined with a person-carried indoor positioning system. Automatically recognized markers (AprilTags) are geo-referenced by the aerial system and their coordinates are sent to the ground-based system. By looking at the AprilTags before entering the building, the ground-based system is registered to world coordinates. Without any further need for global positioning, it creates a point cloud of the indoor spaces that fits the point cloud from the aerial view. This allows co-visualization of both point clouds as a seamless indoor-outdoor 3D model in real time.
Chinese Translation
在多个应用中,期望不仅能够获取户外空间的三维模型,还能获取建筑内部的三维模型。在大规模自然灾害和人为灾害的第一响应者增强背景下,提出了一种高自动化程度的方法来实现这一目标。因此,将一个自主飞行的空中映射系统与一个人携带的室内定位系统相结合。自动识别的标记(AprilTags)由空中系统进行地理参考,并将其坐标发送到地面系统。在进入建筑之前,通过查看AprilTags,地面系统被注册到世界坐标系。无需进一步的全球定位,它从室内空间创建一个与空中视图点云相匹配的点云。这使得两个点云能够实时无缝地共同可视化,形成一个室内外三维模型。
cs.CV / 41 / 2604.26370

Topology-Aware Representation Alignment for Semi-Supervised Vision-Language Learning

面向拓扑的半监督视觉-语言学习表示对齐
You, Junwon, Jang, Mihyun, Mo, Sangwoo, Jung, Jae-Hun
Abstract
Vision-language models have shown strong performance, but they often generalize poorly to specialized domains. While semi-supervised vision-language learning mitigates this limitation by leveraging a small set of labeled image-text pairs together with abundant unlabeled images, existing methods remain fundamentally pairwise and fail to model the global structure of multimodal representation manifolds. Existing topology-based alignment methods rely on persistence diagram matching, which neither guarantees geometric alignment nor utilizes the image-text pairing information central to vision-language learning. We propose Topology-Aware Multimodal Representation Alignment (ToMA), a framework that uses persistent homology to identify topologically salient edges and aligns them across modalities through available cross-modal correspondences. ToMA leverages both H_0-death edges and lightweight H_1-birth edges, allowing it to capture both connectivity and cycle structure without constructing 2-simplices. Experiments show that ToMA yields stable gains, with clear improvements on remote sensing and modest but consistent benefits on fashion retrieval. Additional analysis shows that ToMA is more stable than alternative topology-based objectives and that lightweight H_1-birth edges provide useful higher-order structural signals.
Chinese Translation
视觉-语言模型表现出强大的性能,但它们在专业领域的泛化能力往往较差。虽然半监督视觉-语言学习通过利用一小部分标注的图像-文本对以及大量未标注的图像来缓解这一限制,但现有方法仍然在本质上是成对的,未能建模多模态表示流形的全局结构。现有的基于拓扑的对齐方法依赖于持久性图匹配,这既无法保证几何对齐,也未能利用对视觉-语言学习至关重要的图像-文本配对信息。我们提出了面向拓扑的多模态表示对齐(Topology-Aware Multimodal Representation Alignment,ToMA),该框架利用持久同调来识别拓扑显著边,并通过可用的跨模态对应关系在不同模态之间进行对齐。ToMA同时利用H_0-死亡边和轻量级H_1-出生边,使其能够在不构建2-单形的情况下捕捉连接性和循环结构。实验表明,ToMA带来了稳定的增益,在遥感领域有明显的改进,在时尚检索中则表现出适度但持续的好处。额外分析显示,ToMA比其他基于拓扑的目标更稳定,且轻量级H_1-出生边提供了有用的高阶结构信号。
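The H_0-death edges mentioned in the ToMA abstract above have a simple concrete form: in a Vietoris-Rips filtration, the edges at which connected components merge are exactly the edges of a minimum spanning tree. The sketch below finds them with Kruskal's algorithm and a union-find structure. It covers only this identification step; the function name is hypothetical, and the cross-modal alignment loss and the H_1-birth edges are not shown because the abstract does not specify them.

```python
import math
from itertools import combinations

def h0_death_edges(points):
    """Return the edges (i, j, length) at which two components merge.

    points: list of coordinate tuples (for example, embedding vectors).
    Edges are visited in order of increasing length; an edge that joins two
    different components ends (kills) one H_0 class at that length.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the component root, compressing the path
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j      # merge the two components
            deaths.append((i, j, length))
    return deaths

# Example: three collinear points produce two merging edges.
print(h0_death_edges([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]))
```

A point cloud of n points always yields n - 1 such edges, which keeps this part of the computation cheap compared with building higher-dimensional simplices.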
cs.CV / 42 / 2604.26379

A Multimodal Pre-trained Network for Integrated EEG-Video Seizure Detection

用于集成脑电图-视频癫痫发作检测的多模态预训练网络
Lu, Tong, Xu, Ke, Zhang, Zimo, Zhao, Zitong, Weng, Danwei, Wang, Ruiyu, Liu, Miao, Zhang, Zizuo, Yao, Jingyi, Zhao, Yixuan, Zhang, Wenchao, Wang, Min, Luan, Guoming, Luo, Minmin, Yue, Zhifeng
Abstract
Reliable seizure detection in mouse models is essential for preclinical epilepsy research, yet manual review of synchronized video-EEG recordings is labor-intensive and single-modality systems fail for complementary reasons: video-based methods are easily confounded by benign behaviors, whereas EEG-based methods are vulnerable to ictal motion artifacts. We present EEGVFusion, a multimodal framework that combines self-supervised EEG representation learning, spatio-temporal video encoding, optimal-transport alignment, and bidirectional cross-attention to integrate neural and behavioral evidence. We also curate an expert-annotated dataset of synchronized EEG and video recordings comprising 93 sessions from 15 mice for training and evaluation. In the random-session split, EEGVFusion achieved a Balanced Accuracy of 0.9957 with perfect event sensitivity and an Event FAR of 0.6250 FP/h, indicating strong seizure detection performance with a low false-alarm burden. In a single held-out-subject evaluation with Subject 110 reserved for testing, EEGVFusion achieved a Balanced Accuracy of 0.9718 and reduced Event FAR from 2.7250 FP/h for the EEG-only counterpart to 0.4833 FP/h while preserving perfect event sensitivity. Targeted ablations further showed that EEG pre-training and OT alignment help reduce false alarms while preserving event sensitivity.
Chinese Translation
在小鼠模型中可靠的癫痫发作检测对临床前癫痫研究至关重要,但手动审核同步的视频-脑电图(EEG)记录劳动强度大,且单一模态系统因互补原因而失败:基于视频的方法容易受到良性行为的干扰,而基于EEG的方法则易受癫痫发作运动伪影的影响。我们提出了EEGVFusion,一个多模态框架,结合了自监督EEG表示学习、时空视频编码、最优传输对齐和双向交叉注意力,以整合神经和行为证据。我们还整理了一个专家注释的数据集,包含来自15只小鼠的93个会话的同步EEG和视频记录,用于训练和评估。在随机会话拆分中,EEGVFusion实现了0.9957的平衡准确率,具有完美的事件灵敏度和0.6250 FP/h的事件假阳性率,表明其具有强大的癫痫发作检测性能且假警报负担较低。在对保留的Subject 110进行的单一保留对象评估中,EEGVFusion实现了0.9718的平衡准确率,并将事件假阳性率从EEG单一模型的2.7250 FP/h降低至0.4833 FP/h,同时保持完美的事件灵敏度。针对性切除实验进一步表明,EEG预训练和最优传输对齐有助于降低假警报,同时保持事件灵敏度。
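The two headline metrics in the EEGVFusion abstract above can be made concrete with their conventional definitions: balanced accuracy as the mean of sensitivity and specificity, and event false-alarm rate as false-positive events per recording hour. The sketch assumes these standard definitions; the paper's exact event-matching rules are not given in the abstract, and the counts used below are invented for illustration only.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (recall on seizures) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)

def event_false_alarm_rate(false_positive_events, recording_hours):
    """False-positive events per hour of recording (FP/h)."""
    return false_positive_events / recording_hours

# Hypothetical counts, not taken from the paper.
print(balanced_accuracy(tp=9, fn=1, tn=95, fp=5))
print(event_false_alarm_rate(false_positive_events=3, recording_hours=6.0))
```

Balanced accuracy is useful here because seizure segments are rare, so plain accuracy would be dominated by the non-seizure class.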
cs.CV / 43 / 2604.26404

Decoupled Prototype Matching with Vision Foundation Models for Few-Shot Industrial Object Detection

基于视觉基础模型的解耦原型匹配用于少样本工业物体检测
M., Hari Prasanth S., Jayawickrama, Nilusha, Ojala, Risto
Abstract
Industrial object detection systems typically rely on large annotated datasets, which are expensive to collect and challenging to maintain in industrial scenarios where the inventory of objects changes frequently. This work addresses the challenge of few-shot object detection in such industrial scenarios, where only a limited number of labeled samples are available for newly introduced objects. We present a detection framework that leverages vision foundation models to recognize objects with minimal supervision. The method constructs class prototypes from a small set of reference samples by extracting feature representations. For a given query scene during inference, object regions are generated using a segmentation model, and feature embeddings are extracted and matched with class prototypes using similarity matching. We evaluate the detection method on three established industrial datasets from the Benchmark for 6D Object Pose Estimation (BOP), following the official 2D object detection evaluation protocol. We demonstrate competitive detection performance, improving AP by 6.9% compared to state-of-the-art training-free detection methods. Furthermore, the presented method is able to onboard new objects using only a few reference images, without requiring any CAD models or large annotated datasets. These properties make the approach well-suited for real-world industrial applications.
Chinese Translation
工业物体检测系统通常依赖于大量标注数据集,这些数据集的收集成本高且在物体库存频繁变化的工业场景中维护困难。本研究针对这种工业场景中的少样本物体检测挑战,其中仅有有限数量的标记样本可用于新引入的物体。我们提出了一种检测框架,利用视觉基础模型以最小的监督识别物体。该方法通过提取特征表示,从一小组参考样本构建类别原型。在推理过程中,对于给定的查询场景,使用分割模型生成物体区域,并提取特征嵌入,与类别原型进行相似性匹配。我们在基准6D物体姿态估计的三个已建立工业数据集上评估了该检测方法,遵循官方的2D物体检测评估协议。我们展示了具有竞争力的检测性能,与最新的无训练检测方法相比,AP提高了6.9%。此外,所提出的方法能够仅使用少量参考图像引入新物体,而无需任何CAD模型或大型标注数据集。这些特性使得该方法非常适合于现实世界的工业应用。
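The prototype pipeline in the few-shot detection abstract above (build class prototypes from a few reference embeddings, then match each segmented region by similarity) can be sketched as below. Mean pooling, cosine similarity, the rejection threshold `min_similarity`, and both function names are assumptions made for illustration; the abstract does not state which foundation model, pooling, or similarity measure the authors use, and the embeddings here are toy 2D vectors.

```python
import numpy as np

def build_prototypes(ref_embeddings):
    """Average the reference embeddings of each class into a unit-norm prototype.

    ref_embeddings: dict mapping class name -> (n_refs, D) array-like.
    """
    prototypes = {}
    for name, feats in ref_embeddings.items():
        mean = np.asarray(feats, dtype=float).mean(axis=0)
        prototypes[name] = mean / np.linalg.norm(mean)
    return prototypes

def classify_region(embedding, prototypes, min_similarity=0.5):
    """Return (best class or None, cosine similarity) for one region embedding."""
    e = np.asarray(embedding, dtype=float)
    e = e / np.linalg.norm(e)
    best_name, best_sim = None, -1.0
    for name, proto in prototypes.items():
        sim = float(e @ proto)           # cosine similarity of unit vectors
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim < min_similarity:
        return None, best_sim            # reject regions that match no class
    return best_name, best_sim

# Toy example with two classes and two reference embeddings each.
refs = {
    "bolt": [[1.0, 0.0], [0.9, 0.1]],
    "nut": [[0.0, 1.0], [0.1, 0.9]],
}
protos = build_prototypes(refs)
print(classify_region([0.8, 0.2], protos)[0])
```

Onboarding a new object in this scheme only requires adding another entry to the prototype dictionary, which matches the abstract's point that no retraining or CAD model is needed.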
cs.CV / 44 / 2604.26409

Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection

稀疏性作为关键:从潜在结构中解锁新的分布外检测洞察
Oh, Ahyoung, Shin, Wonseok, Kim, Songkuk
Abstract
Sparse Autoencoders (SAEs) have demonstrated significant success in interpreting Large Language Models (LLMs) by decomposing dense representations into sparse, semantic components. However, their potential for analyzing Vision Transformers (ViTs) remains largely under-explored. In this work, we present the first application of SAEs to the ViT [CLS] token for out-of-distribution (OOD) detection, addressing the limitation of existing methods that rely on entangled feature representations. We propose a novel framework utilizing a Top-k SAE to disentangle the dense [CLS] features into a structured latent space. Through this analysis, we reveal that in-distribution (ID) data exhibits consistent, class-specific activation patterns, which we formalize as Class Activation Profiles (CAPs). Our study uncovers a key structural invariant: while ID samples preserve a stable pattern within CAPs, OOD samples systematically disrupt this structure. Leveraging this insight, we introduce a scoring function based on the divergence of core energy profiles to quantify the deviation from ideal activation profiles. Our method achieves strong results on the FPR95 metric, critical for safety-sensitive applications across multiple benchmarks, while also achieving competitive AUROC. Overall, our findings demonstrate that the sparse, disentangled features revealed by SAEs can serve as a powerful, interpretable tool for robust OOD detection in vision models.
Chinese Translation
稀疏自编码器(Sparse Autoencoders, SAEs)在通过将密集表示分解为稀疏语义成分来解释大型语言模型(Large Language Models, LLMs)方面取得了显著成功。然而,它们在分析视觉变换器(Vision Transformers, ViTs)方面的潜力仍然未被充分探索。在本研究中,我们首次将SAEs应用于ViT的[CLS]标记,以进行分布外(Out-of-Distribution, OOD)检测,解决了现有方法依赖于纠缠特征表示的局限性。我们提出了一种新颖的框架,利用Top-k SAE将密集的[CLS]特征解耦为结构化的潜在空间。通过这一分析,我们揭示了分布内(In-Distribution, ID)数据表现出一致的、类特异的激活模式,我们将其形式化为类激活轮廓(Class Activation Profiles, CAPs)。我们的研究发现了一个关键的结构不变性:虽然ID样本在CAPs中保持稳定的模式,但OOD样本系统性地破坏了这一结构。利用这一洞察,我们引入了一种基于核心能量轮廓差异的评分函数,以量化与理想激活轮廓的偏差。我们的方法在FPR95指标上取得了强劲的结果,这对于多个基准测试中安全敏感的应用至关重要,同时也实现了具有竞争力的AUROC。总体而言,我们的发现表明,SAEs揭示的稀疏、解耦特征可以作为视觉模型中强大且可解释的工具,用于稳健的OOD检测。
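To make the Top-k encoding concrete, here is a minimal sketch of how a dense [CLS] feature could be turned into a sparse code. The random encoder weights, the dimensions, and the function name `topk_sparse_code` are placeholders for illustration; the paper's trained SAE and its CAP-based scoring function are not reproduced here.

```python
import numpy as np

def topk_sparse_code(x, W_enc, b_enc, k):
    """Encode x and keep only the k largest pre-activations; all others become 0."""
    pre = x @ W_enc + b_enc
    idx = np.argsort(pre)[-k:]               # indices of the k largest entries
    code = np.zeros_like(pre)
    code[idx] = np.maximum(pre[idx], 0.0)    # ReLU on the surviving entries
    return code

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(8, 32))   # hypothetical: 8-d feature, 32 latent units
b_enc = np.zeros(32)
cls_feature = rng.normal(size=8)   # stand-in for a ViT [CLS] feature
code = topk_sparse_code(cls_feature, W_enc, b_enc, k=4)
```

Averaging such codes over the training samples of one class would give one simple notion of a per-class activation profile, against which a test sample's code could be compared.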
cs.CV / 45 / 2604.26419

Delineating Knowledge Boundaries for Honest Large Vision-Language Models

划定诚实大型视觉-语言模型的知识边界
Song, Junru, Hu, Yimeng, Chen, Yijing, Li, Huining, Li, Qian, Cui, Lizhen, Du, Yuntao
Abstract
Large Vision-Language Models (VLMs) have achieved remarkable multimodal performance yet remain prone to factual hallucinations, particularly in long-tail or specialized domains. Moreover, current models exhibit a weak capacity to refuse queries that exceed their parametric knowledge. In this paper, we propose a systematic framework to enhance the refusal capability of VLMs when facing such unknown questions. We first curate a model-specific "Visual-Idk" (Visual-I don't know) dataset, leveraging multi-sample consistency probing to distinguish between known and unknown facts. We then align the model using supervised fine-tuning followed by preference-aware optimization (e.g., DPO, ORPO) to effectively delineate its knowledge boundaries. Results on the Visual-Idk dataset show our method improves the Truthful Rate from 57.9% to 67.3%. Internal probing further demonstrates that the model genuinely recognizes its boundaries instead of just memorizing refusal patterns. Our framework also generalizes to out-of-distribution medical and perceptual domains, providing a robust path toward more trustworthy and prudent visual assistants.
Chinese Translation
大型视觉-语言模型(VLMs)在多模态性能上取得了显著的进展,但仍然容易出现事实幻觉,尤其是在长尾或专业领域。此外,当前模型在拒绝超出其参数知识的查询方面表现出较弱的能力。本文提出了一个系统框架,以增强VLMs在面对此类未知问题时的拒绝能力。我们首先策划了一个特定于模型的“Visual-Idk”(视觉-我不知道)数据集,利用多样本一致性探测来区分已知和未知事实。然后,我们通过监督微调和偏好感知优化(例如,DPO,ORPO)来对齐模型,从而有效划定其知识边界。在Visual-Idk数据集上的结果表明,我们的方法将真实率从57.9%提高至67.3%。此外,内部探测也表明模型确实识别其边界,而不仅仅是记忆拒绝模式。我们的框架进一步推广到分布外的医学和感知领域,为更可信和谨慎的视觉助手提供了一条稳健的路径。
cs.CV / 46 / 2604.26435

QYOLO: Lightweight Object Detection via Quantum Inspired Shared Channel Mixing

QYOLO:通过量子启发的共享通道混合实现轻量级目标检测
Mittal, Garvit Kumar, Tomar, Sahil, Kumar, Sandeep
Abstract
The rapid advancement of object detection architectures has positioned single stage detectors as the dominant solution for real-time visual perception. A primary source of computational overhead in these models lies in the deep backbone stages, where C2f bottleneck modules at high stride levels accumulate a disproportionate share of parameters due to quadratic scaling with channel width. This work introduces QYOLO, a quantum-inspired channel mixing framework that achieves genuine architectural compression by replacing the two deepest backbone C2f modules at P4/16 (512 channels) and P5/32 (1024 channels) with a compact QMixBlock. The proposed block performs global channel recalibration through a sinusoidal mixing mechanism with shared learnable parameters across both backbone stages, enforcing consistent channel importance without requiring independent per-stage parameter sets. The neck and detection head remain fully classical and unchanged. Evaluation on the VisDrone2019 benchmark demonstrates that QYOLOv8n achieves a 20.2% reduction in parameter count (3.01M to 2.40M) and 12.3% GFLOPs reduction with only 0.4 pp mAP@50 degradation. QYOLOv8s achieves 21.8% reduction with 0.1 pp degradation. When combined with knowledge distillation, full accuracy parity is recovered at no cost to compression. An expanded backbone plus neck variant achieved 38 to 41% reduction at the cost of greater accuracy degradation, motivating the backbone-only final design.
Chinese Translation
目标检测架构的快速发展使得单阶段检测器成为实时视觉感知的主导解决方案。这些模型计算开销的主要来源在于深层主干阶段,其中高步幅水平的 C2f 瓶颈模块由于与通道宽度的平方缩放而累积了不成比例的参数份额。本研究提出了 QYOLO,一种量子启发的通道混合框架,通过用紧凑的 QMixBlock 替换 P4/16 (512 通道) 和 P5/32 (1024 通道) 的两个最深主干 C2f 模块,实现了真正的架构压缩。所提出的模块通过一个正弦混合机制执行全局通道重新校准,并在两个主干阶段之间共享可学习参数,从而在不需要独立的每阶段参数集的情况下强制一致的通道重要性。颈部和检测头保持完全经典且未改变。在 VisDrone2019 基准测试上的评估表明,QYOLOv8n 实现了 20.2% 的参数数量减少(从 3.01M 降至 2.40M),GFLOPs 减少 12.3%,仅导致 0.4 个百分点的 mAP@50 降级。QYOLOv8s 实现了 21.8% 的减少,降级为 0.1 个百分点。当与知识蒸馏结合时,压缩没有损失的情况下恢复了完全的准确性。一个扩展的主干加颈部变体在更大的准确性降级代价下实现了 38% 至 41% 的减少,这激励了仅主干的最终设计。
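The "quadratic scaling with channel width" mentioned in the abstract can be checked with simple arithmetic. The sketch below counts the weights of a plain 3x3 convolution at the two channel widths named in the abstract; it is a generic illustration and does not model the internals of a C2f module or of QMixBlock.

```python
def conv_params(c_in, c_out, kernel=3, bias=False):
    """Parameter count of a standard 2D convolution layer."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)

p4 = conv_params(512, 512)     # 3x3 conv at the P4/16 width: 2_359_296 weights
p5 = conv_params(1024, 1024)   # 3x3 conv at the P5/32 width: 9_437_184 weights
ratio = p5 / p4                # doubling the width quadruples the count: 4.0
```

This is why the deepest backbone stages hold a disproportionate share of the parameters, and why replacing them is an effective place to compress.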
cs.CV / 47 / 2604.26437

Are Data Augmentation and Segmentation Always Necessary? Insights from COVID-19 X-Rays and a Methodology Thereof

数据增强和分割是否总是必要的?来自COVID-19胸部X光片的见解及其方法论
Swaraj, Aman, Agarwal, Arnav, Bhadouria, Hitendra Singh, Kumar, Sandeep, Verma, Karan
Abstract
Purpose: Rapid and reliable diagnostic tools are crucial for managing respiratory diseases like COVID-19, where chest X-ray analysis coupled with artificial intelligence techniques has proven invaluable. However, most existing works on X-ray images have not considered lung segmentation, raising concerns about their reliability. Additionally, some have employed disproportionate and impractical augmentation techniques, making models less generalized and prone to overfitting. This study presents a critical analysis of both issues and proposes a methodology (SDL-COVID) for more reliable classification of chest X-rays for COVID-19 detection. Methods: We use class activation mapping to obtain a visual understanding of the predictions made by Convolutional Neural Networks (CNNs), validating the necessity of lung segmentation. To analyze the effect of data augmentation, deep learning models are implemented on two levels: one for an augmented dataset and another for a non-augmented dataset. Results: Careful analysis of X-ray images and their corresponding heat maps under expert medical supervision reveals that lung segmentation is necessary for accurate COVID-19 prediction. Regarding data augmentation, test accuracy significantly drops beyond a certain threshold with additional augmented images, indicating model overfitting. Conclusion: Our proposed methodology, SDL-COVID, achieves a precision of 95.21% and a lower false negative rate, ensuring its reliability for COVID-19 detection using chest X-rays.
Chinese Translation
目的:快速可靠的诊断工具对于管理像COVID-19这样的呼吸系统疾病至关重要,其中胸部X光分析结合人工智能技术已被证明是非常宝贵的。然而,大多数现有的X光图像研究并未考虑肺部分割,这引发了对其可靠性的担忧。此外,一些研究采用了不成比例且不切实际的增强技术,使得模型的泛化能力降低,容易出现过拟合。本研究对这两个问题进行了批判性分析,并提出了一种方法论(SDL-COVID),以实现对COVID-19检测的胸部X光片的更可靠分类。方法:我们使用类别激活映射(class activation mapping)来获得卷积神经网络(CNN)所做预测的可视化理解,从而验证肺部分割的必要性。为了分析数据增强的影响,深度学习模型在两个层面上实施:一个用于增强数据集,另一个用于未增强数据集。结果:在专家医疗监督下,对X光图像及其相应热图的仔细分析表明,肺部分割对于准确预测COVID-19是必要的。关于数据增强,测试准确率在超过某一阈值后随着额外增强图像的增加显著下降,表明模型出现了过拟合。结论:我们提出的方法论SDL-COVID实现了95.21%的精确度和较低的假阴性率,确保其在使用胸部X光片进行COVID-19检测时的可靠性。
cs.CV / 48 / 2604.26453

Attribution-Guided Multimodal Deepfake Detection via Cross-Modal Forensic Fingerprints

基于跨模态取证指纹的归因引导多模态深度伪造检测
Ahmad, Wasim, Zhang, Wei, Mao, Xuerui
Abstract
Audio-visual deepfakes have reached a level of realism that makes perceptual detection unreliable, threatening media integrity and biometric security. While multimodal detection has shown promise, most approaches are binary classification tasks that often latch onto dataset-specific artifacts rather than genuine generative traces. We argue that a detector incapable of identifying how a video was forged is likely learning the wrong signal. Unlike binary detection, attribution-guided learning imposes a stronger geometric constraint on the shared embedding space, forcing the model to encode generator-specific forensic content rather than shortcuts. We propose the Attribution-Guided Multimodal Deepfake Detection (AMDD) framework, which jointly learns to detect and attribute manipulation. AMDD treats generator attribution as a structured regularization that constrains representation geometry toward forensically meaningful features. We introduce a Cross-Modal Forensic Fingerprint Consistency (CMFFC) loss to enforce alignment between generator-induced artifacts in visual and audio streams. This exploits the fact that coherent manipulation leaves correlated traces across modalities, grounded in the physical coupling between speech and facial articulation that synthetic pipelines routinely disrupt. Architecturally, we pair a ResNet50 with temporal attention for visual encoding against a pretrained ResNet18 for mel spectrograms, closing the encoder capacity gap found in prior models. On FakeAVCeleb, AMDD achieves 99.7% balanced accuracy and 99.8% AUC with 95.9% attribution accuracy. Cross-dataset evaluation on DeepfakeTIMIT, DFDM, and LAV-DF confirms that real video detection generalizes robustly, while fake detection on unseen generators remains an open challenge that we analyze in depth.
Chinese Translation
音视频深度伪造已达到一种现实主义水平,使得感知检测变得不可靠,威胁到媒体完整性和生物识别安全。尽管多模态检测显示出前景,但大多数方法是二元分类任务,往往依赖于特定数据集的伪影,而非真实的生成痕迹。我们认为,无法识别视频伪造方式的检测器很可能在学习错误的信号。与二元检测不同,归因引导学习对共享嵌入空间施加了更强的几何约束,迫使模型编码生成器特定的取证内容,而不是捷径。我们提出了归因引导的多模态深度伪造检测(AMDD)框架,该框架共同学习检测和归因操控。AMDD将生成器归因视为一种结构化正则化,约束表示几何朝向具有取证意义的特征。我们引入了一种跨模态取证指纹一致性(CMFFC)损失,以强制视觉和音频流中生成器引起的伪影之间的对齐。这利用了连贯操控在各模态间留下相关痕迹的事实,这些痕迹源于语音与面部发音之间的物理耦合,而合成管道通常会破坏这种耦合。在架构上,我们将ResNet50与时间注意机制结合用于视觉编码,并与预训练的ResNet18用于梅尔谱图,缩小了先前模型中编码器容量的差距。在FakeAVCeleb数据集上,AMDD达到了99.7%的平衡准确率和99.8%的AUC,以及95.9%的归因准确率。在DeepfakeTIMIT、DFDM和LAV-DF上的跨数据集评估确认真实视频检测具有良好的泛化能力,而对未见生成器的伪造检测仍然是一个开放挑战,我们对此进行了深入分析。
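Because the headline result is reported as balanced accuracy, it may help to see how that metric differs from plain accuracy on imbalanced data. The sketch below uses the standard definition (mean of per-class recall); the toy labels are invented for illustration and are unrelated to the paper's data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Toy case: 4 fakes (1) and 1 real (0); the detector calls everything fake.
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1]
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.8
balanced = balanced_accuracy(y_true, y_pred)                        # 0.5
```

A detector that ignores the minority class looks good on plain accuracy but scores only 0.5 on balanced accuracy, which is why the latter is the more informative figure for skewed deepfake benchmarks.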
cs.CV / 49 / 2604.26454

Last-Layer-Centric Feature Recombination: Unleashing 3D Geometric Knowledge in DINOv3 for Monocular Depth Estimation

以最后一层为中心的特征重组:在 DINOv3 中释放三维几何知识以进行单目深度估计
Wang, Gongshu, Wang, Zhirui, Yang, Kan
Abstract
Monocular depth estimation (MDE) is a fundamental yet inherently ill-posed task. Recent vision foundation models (VFMs), particularly DINO-based transformers, have significantly improved accuracy and generalization for dense prediction. Prior works generally follow a unified paradigm: sampling a fixed set of intermediate transformer layers at uniform intervals to build multi-scale features. This common practice implicitly assumes that geometric information is uniformly distributed across layers, which may underutilize the structural 3D cues encoded in VFMs. In this study, we present a systematic layer-wise analysis of DINOv3, revealing that 3D information is distributed non-uniformly: deeper layers exhibit stronger depth predictability and better capture inter-sample geometric variation. Motivated by this, we introduce a Last-Layer-Centric Feature Recombination (LFR) module to enhance geometric expressiveness. LFR treats the final layer as a geometric anchor and adaptively selects complementary intermediate layers according to a minimal-similarity criterion. Selected features are fused with the last-layer representation via compact linear adapters. Extensive experiments show that the LFR module consistently improves MDE accuracy and achieves state-of-the-art performance. Our analysis sheds light on how geometric knowledge is organized within VFMs and offers an efficient strategy for unlocking their potential in dense 3D tasks.
Chinese Translation
单目深度估计(MDE)是一项基本但本质上不适定的任务。近期的视觉基础模型(VFM),特别是基于 DINO 的变换器,显著提高了密集预测的准确性和泛化能力。以往的研究通常遵循统一的范式:在均匀间隔内采样固定的一组中间变换器层以构建多尺度特征。这种常见做法隐含地假设几何信息在各层之间均匀分布,这可能未能充分利用 VFM 中编码的结构性三维线索。在本研究中,我们对 DINOv3 进行了系统的层级分析,揭示三维信息的分布并不均匀:较深的层次展现出更强的深度可预测性,并更好地捕捉样本间的几何变化。基于此,我们引入了一个以最后一层为中心的特征重组(LFR)模块,以增强几何表达能力。LFR 将最后一层视为几何锚点,并根据最小相似性标准自适应选择互补的中间层。所选特征通过紧凑的线性适配器与最后一层表示进行融合。大量实验表明,LFR 模块持续提高了 MDE 的准确性,并实现了最先进的性能。我们的分析阐明了几何知识在 VFM 中的组织方式,并提供了一种有效的策略,以释放其在密集三维任务中的潜力。
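One plausible reading of the minimal-similarity criterion is sketched below: rank the intermediate layers by cosine similarity to the last layer and keep the least similar ones. The exact criterion in the paper may differ; `select_layers` and the toy 2-D "features" are assumptions for illustration only.

```python
import numpy as np

def select_layers(layer_feats, num_select):
    """Pick the intermediate layers least similar (cosine) to the last layer."""
    last = layer_feats[-1].ravel()
    last = last / np.linalg.norm(last)
    sims = []
    for i, feat in enumerate(layer_feats[:-1]):
        v = feat.ravel()
        sims.append((float(v @ last / np.linalg.norm(v)), i))
    chosen = sorted(sims)[:num_select]        # lowest similarity first
    return sorted(i for _, i in chosen)

# Four toy layers; the last one is the anchor.
feats = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
         np.array([0.7, 0.7]), np.array([1.0, 0.1])]
picked = select_layers(feats, num_select=2)   # layers 1 and 2 are least similar
```

Layer 0 is nearly parallel to the anchor and is therefore skipped, which matches the intuition of choosing complementary rather than redundant features.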
cs.CV / 50 / 2604.26461

$\text{PKS}^4$: Parallel Kinematic Selective State Space Scanners for Efficient Video Understanding

PKS$^4$: 高效视频理解的并行运动选择状态空间扫描器
Zeng, Lingjie, Zhang, Hailun, Wang, Xiwen, Zhao, Qijun
Abstract
Temporal modeling remains a fundamental challenge in video understanding, particularly as sequence lengths scale. Traditional video models relying on dense spatiotemporal attention suffer from quadratic computational costs for long videos. To circumvent these costs, recent approaches adapt image models for videos via Parameter-Efficient Fine-Tuning (PEFT) methods such as adapters. However, deeply inserting these modules incurs prohibitive activation memory overhead during back-propagation. While recent efficient State Space Models (SSMs) introduce linear complexity, they disrupt 2D spatial relationships and rely on extensive masked pre-training to recover spatial awareness. To overcome these limitations, we propose Parallel Kinematic Selective State Space Scanners (PKS$^4$). We retain a standard 2D vision backbone for spatial semantics and insert a single plug-and-play PKS$^4$ module with linear-complexity temporal scanning, avoiding temporal attention and multi-layer adapters. We first extract kinematic priors via a Kinematic Prior Encoder, which captures local displacements and motion boundaries through inter-frame correlations and differences. These priors drive linear-complexity SSMs to track underlying kinematic states, adaptively modulating update speeds and read-write strategies at each time step. Instead of global scanning, we deploy parallel scanners along the temporal dimension for each spatial location, preserving spatial structures while reducing overhead. Experiments on spatial-heavy and temporal-heavy action recognition benchmarks show that PKS$^4$ achieves state-of-the-art performance. Remarkably, our method converges in merely $20$ epochs, achieving approximately $10\times$ lower training compute than pure video SSMs, establishing a new paradigm for efficient video understanding.
Chinese Translation
时间建模仍然是视频理解中的一个基本挑战,尤其是在序列长度增加时。依赖于密集时空注意力的传统视频模型在处理长视频时面临二次计算成本。为了规避这些成本,最近的方法通过参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法(如适配器)将图像模型适配到视频上。然而,深度插入这些模块在反向传播过程中会产生巨大的激活内存开销。尽管最近的高效状态空间模型(State Space Models, SSMs)引入了线性复杂度,但它们破坏了二维空间关系,并依赖于广泛的掩码预训练来恢复空间意识。为克服这些限制,我们提出了并行运动选择状态空间扫描器(PKS$^4$)。我们保留了标准的二维视觉主干以获取空间语义,并插入一个具有线性复杂度时间扫描的即插即用PKS$^4$模块,避免了时间注意力和多层适配器。我们首先通过运动先验编码器(Kinematic Prior Encoder)提取运动先验,该编码器通过帧间相关性和差异捕捉局部位移和运动边界。这些先验驱动线性复杂度的SSMs跟踪潜在的运动状态,自适应地调节每个时间步的更新速度和读写策略。我们在每个空间位置沿时间维度部署并行扫描器,而不是全局扫描,从而保持空间结构并减少开销。在空间重和时间重的动作识别基准上的实验表明,PKS$^4$达到了最先进的性能。值得注意的是,我们的方法仅需20个训练周期便实现收敛,训练计算量约为纯视频SSMs的十分之一,建立了高效视频理解的新范式。
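The idea of running one temporal scanner per spatial location can be sketched with a plain linear recurrence. This is a simplified stand-in: in the paper the gates are driven by kinematic priors, whereas here `a` and `b` are fixed arrays chosen for illustration, and `temporal_scan` is a hypothetical name.

```python
import numpy as np

def temporal_scan(x, a, b):
    """Run h_t = a_t * h_{t-1} + b_t * x_t along time, independently per pixel.

    x, a, b have shape (T, H, W); every spatial location is its own 1D scan,
    so the cost grows linearly with the number of frames T.
    """
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a[t] * h + b[t] * x[t]   # vectorised over all H x W locations
        out[t] = h
    return out

T, H, W = 4, 2, 3
x = np.ones((T, H, W))
a = np.full((T, H, W), 0.5)   # assumed constant decay gate
b = np.ones((T, H, W))        # assumed constant input gate
states = temporal_scan(x, a, b)   # per pixel: 1.0, 1.5, 1.75, 1.875
```

Because no state is shared between pixels, the 2D layout of the feature map is left untouched, in contrast to scans that flatten space and time into one sequence.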
cs.CV / 51 / 2604.26462

A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows

针对长篇扫描金融文件的多阶段提取管道:工业KYC工作流程中的实证研究
Han, Yuxuan, Zhang, Yuanxing, Wang, Yushuo, Jin, Yichao, Ke, Kenneth Zhu, Zhao, Jingyuan
Abstract
Structured information extraction from long, multilingual scanned financial documents is a core requirement in industrial KYC and compliance workflows. These documents are typically non-machine-readable, noisy, and visually heterogeneous. They usually span dozens of pages while containing only sparse task-relevant information. Although recent vision-language models achieve strong benchmark performance, directly applying them end-to-end to full financial reports often leads to unreliable extraction under real-world conditions. We present a multistage extraction framework that integrates image preprocessing, multilingual OCR, hybrid page-level retrieval, and compact VLM-based structured extraction. The design separates page localization from multimodal reasoning, enabling more accurate extraction from complex multipage documents. We evaluated the framework on 120 production KYC documents comprising about 3000 multilingual scanned pages. Across multiple OCR-VLM combinations, the proposed pipeline consistently outperforms direct PDF-to-VLM baselines, improving field-level accuracy by up to 31.9 percentage points. The best configuration, PaddleOCR with MiniCPM2.6, achieves 87.27% accuracy. Ablation studies show that page-level retrieval is the dominant factor in performance improvements, particularly for complex financial statements and non-English documents.
Chinese Translation
从长篇、多语言扫描金融文件中提取结构化信息是工业KYC和合规工作流程中的核心需求。这些文件通常不可机器读取,噪声较多,且视觉上异质性强。它们通常跨越数十页,仅包含稀疏的与任务相关的信息。尽管最近的视觉-语言模型在基准测试中表现出色,但直接将其端到端应用于完整的金融报告,往往在实际条件下导致不可靠的提取结果。我们提出了一种多阶段提取框架,集成了图像预处理、多语言光学字符识别(OCR)、混合页面级检索和基于紧凑视觉语言模型(VLM)的结构化提取。该设计将页面定位与多模态推理分离,使得从复杂的多页文档中进行更准确的提取成为可能。我们在120份生产KYC文件上评估了该框架,这些文件包含约3000页多语言扫描页面。在多种OCR-VLM组合中,所提出的管道在性能上始终优于直接的PDF到VLM基线,字段级准确率提高了多达31.9个百分点。最佳配置PaddleOCR与MiniCPM2.6的准确率达到了87.27%。消融研究表明,页面级检索是性能提升的主要因素,特别是在复杂财务报表和非英语文档中。
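The page-localization stage can be illustrated with a small lexical retriever that picks the pages worth sending to the VLM. This covers only the keyword half of what a hybrid retriever would do; the helper names, the sample OCR text, and the keyword list are all hypothetical.

```python
def keyword_score(page_text, keywords):
    """Count case-insensitive keyword occurrences in one page of OCR text."""
    text = page_text.lower()
    return sum(text.count(k.lower()) for k in keywords)

def top_pages(pages, keywords, k=2):
    """Return the indices of the k highest-scoring pages, in document order."""
    ranked = sorted(enumerate(pages),
                    key=lambda p: keyword_score(p[1], keywords),
                    reverse=True)
    return sorted(i for i, _ in ranked[:k])

ocr_pages = [
    "Cover page",
    "Directors and shareholders list",
    "Notes to the accounts",
    "Shareholders register: beneficial owner details",
]
selected = top_pages(ocr_pages, ["shareholder", "beneficial owner"], k=2)
```

Only the selected pages would then be passed to the extraction model, which keeps the multimodal reasoning step short even when the source document runs to dozens of pages.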
cs.CV / 52 / 2604.26478

Cross-Domain Transfer of Hyperspectral Foundation Models

高光谱基础模型的跨域迁移
Theisen, Nick, Neubert, Peer
Abstract
Hyperspectral imaging (HSI) semantic segmentation typically relies on in-domain training, but limited data availability often restricts model performance in real-world applications. Current approaches to leverage foundation models in proximal sensing use cross-modality techniques, bridging RGB and HSI to exploit vision foundation models. However, these methods either discard spectral information or introduce architectural complexity. We propose cross-domain transfer as an alternative, reusing HSI foundation models - originally trained in remote sensing - for proximal sensing applications. By eliminating the need to bridge modality gaps, our approach preserves spectral information while maintaining a simple architecture. Using the HS3-Bench benchmark, we systematically evaluate and compare conventional in-domain, in-modality training, cross-modality transfer and cross-domain transfer strategies. Our results demonstrate that cross-domain transfer achieves large performance improvements over in-domain, in-modality training, reduces the performance gap to cross-modality approaches and maintains strong performance in limited data settings. Thus, this work advances more effective HSI semantic segmentation in diverse applications.
Chinese Translation
高光谱成像(HSI)语义分割通常依赖于领域内训练,但有限的数据可用性常常限制模型在实际应用中的性能。目前,利用基础模型进行近端感知的方法采用跨模态技术,将RGB与HSI相结合,以利用视觉基础模型。然而,这些方法要么舍弃光谱信息,要么引入架构复杂性。我们提出将跨域迁移作为一种替代方案,重用最初在遥感中训练的HSI基础模型,用于近端感知应用。通过消除弥合模态差距的需要,我们的方法在保持简单架构的同时保留了光谱信息。利用HS3-Bench基准,我们系统地评估和比较了传统的领域内、模态内训练、跨模态迁移和跨域迁移策略。我们的结果表明,跨域迁移在性能上显著优于领域内、模态内训练,缩小了与跨模态方法的性能差距,并在有限数据环境下保持了强劲的性能。因此,本研究推动了在多样化应用中更有效的HSI语义分割。
cs.CV / 53 / 2604.26488

Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners

使用线性上下文学习者从动态三维场景中提取像素特征
Araslanov, Nikita, Sundermeyer, Martin, Matsuki, Hidenobu, Tan, David Joseph, Tombari, Federico
Abstract
One of the most exciting applications of vision models involves pixel-level reasoning. Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level. Existing frameworks either train on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction. We present a framework that learns pixel-accurate feature descriptors from videos, LILA. The core element of our training framework is linear in-context learning. LILA leverages spatio-temporal cue maps -- depth and motion -- estimated with off-the-shelf networks. Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner. We demonstrate compelling empirical benefits of the learned representation across a diverse suite of vision tasks: video object segmentation, surface normal estimation and semantic segmentation.
Chinese Translation
视觉模型最令人兴奋的应用之一是像素级推理。尽管视觉基础模型的数量众多,但我们仍然缺乏能够有效嵌入视觉场景在像素级别的时空特性的表示。现有框架要么在基于图像的预训练任务上进行训练,这些任务未考虑动态元素,要么在视频序列上进行动作级推理,这种方法无法扩展到密集的像素级预测。我们提出了一种从视频中学习像素精确特征描述符的框架,称为 LILA。我们训练框架的核心元素是线性上下文学习。LILA 利用通过现成网络估计的时空线索图——深度和运动。尽管这些线索的噪声特性,LILA 在未经整理的视频数据集上有效训练,以时间一致的方式嵌入语义和几何特性。我们在一系列多样的视觉任务中展示了所学表示的显著实证优势:视频目标分割、表面法线估计和语义分割。
cs.CV / 54 / 2604.26496

Robust Alignment: Harmonizing Clean Accuracy and Adversarial Robustness in Adversarial Training

鲁棒对齐:在对抗训练中协调清晰准确性与对抗鲁棒性
Wang, Yanyun, Ye, Qingqing, Liu, Li, Liang, Zi, Hu, Haibo
Abstract
Adversarial Training (AT) is one of the most effective methods for developing robust deep neural networks (DNNs). However, AT faces a trade-off problem between clean accuracy and adversarial robustness. In this work, we reveal a surprising phenomenon for the first time: Varying input perturbation intensities for training samples near decision boundaries in AT have minimal impact on model robustness. This finding directly exposes the inconsistency between accuracy and robustness score fluctuations, leading us to identify the misalignment between input and latent spaces as a critical driver of the robustness-accuracy trade-off. To mitigate this misalignment for harmonizing accuracy and robustness, we define Robust Alignment as a new AT target, encouraging the model perception to change with input perturbations provided the final label prediction remains unchanged, which can be achieved via two novel ideas. First, we suggest a reduced and fixed perturbation intensity for those boundary samples, which facilitates the model to utilize the perturbations as learnable patterns, instead of noises that complicate decision boundaries meaninglessly. Second, we propose a Domain Interpolation Consistency Adversarial Regularization (DICAR), based on rigorous theoretical derivations, which explicitly introduces semantic alignment between input and latent spaces into AT. Based on these two ideas, we end up with a new Robust Alignment Adversarial Training (RAAT) method, effectively harmonizing accuracy and robustness. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet with ResNet-18, PreActResNet-18, and WideResNet-28-10 demonstrate the effectiveness of RAAT in improving the trade-off beyond four common baselines and a total of 14 related state-of-the-art (SOTA) works.
Chinese Translation
对抗训练(Adversarial Training, AT)是开发鲁棒深度神经网络(Deep Neural Networks, DNNs)最有效的方法之一。然而,AT面临着清晰准确性与对抗鲁棒性之间的权衡问题。在本研究中,我们首次揭示了一个令人惊讶的现象:在AT中,对于靠近决策边界的训练样本,变化的输入扰动强度对模型鲁棒性影响甚微。这个发现直接暴露了准确性与鲁棒性评分波动之间的不一致性,促使我们识别输入空间与潜在空间之间的错位作为鲁棒性与准确性权衡的关键驱动因素。为了减轻这种错位,以协调准确性与鲁棒性,我们将鲁棒对齐(Robust Alignment)定义为新的AT目标,鼓励模型感知随着输入扰动的变化而变化,前提是最终的标签预测保持不变,这可以通过两个新颖的思想实现。首先,我们建议对那些边界样本采用减少且固定的扰动强度,这使得模型能够将扰动视为可学习的模式,而不是无意义地复杂化决策边界的噪声。其次,我们提出了一种基于严格理论推导的领域插值一致性对抗正则化(Domain Interpolation Consistency Adversarial Regularization, DICAR),该方法明确地将输入空间与潜在空间之间的语义对齐引入AT。基于这两个思想,我们最终提出了一种新的鲁棒对齐对抗训练(Robust Alignment Adversarial Training, RAAT)方法,有效协调了准确性与鲁棒性。在CIFAR-10、CIFAR-100和Tiny-ImageNet上使用ResNet-18、PreActResNet-18和WideResNet-28-10的广泛实验表明,RAAT在改善权衡方面超越了四个常见基线和总共14个相关的最先进(State-of-the-Art, SOTA)工作。
cs.CV / 55 / 2604.26503

Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models

德尔塔评分至关重要!扩散模型中的空间自适应多重引导
Li, Haosen, Chen, Wenshuo, Wang, Lei, Liang, Shaofeng, Tian, Bowen, Lai, Soning, Yue, Yutao
Abstract
Diffusion models have achieved remarkable success in synthesizing complex static and temporal visuals, a breakthrough largely driven by Classifier-Free Guidance (CFG). However, despite its pivotal role in aligning generated content with textual prompts, standard CFG relies on a globally uniform scalar. This homogeneous amplification traps models in a well-documented "detail-artifact dilemma": low guidance scales fail to inject intricate semantics, while high scales inevitably cause structural degradation, color over-saturation, and temporal inconsistencies in videos. In this paper, we expose the physical root of this flaw through the lens of differential geometry. By analyzing Tweedie's Formula, we reveal that CFG intrinsically performs a tangential linear extrapolation. Because the natural data manifold is highly curved, this uniform linear step introduces a severe orthogonal deviation. To keep the generation trajectory safely bounded, we formulate a theoretical upper bound for spatial and adaptive guidance. Based on these geometric insights, we propose Spatial Adaptive Multi Guidance (SAMG), a training-free and virtually zero-cost sampling algorithm. SAMG dynamically computes point-wise conditional guidance energy, applying a conservative minimum scale to high-energy boundary regions to preserve delicate micro-textures, while deploying an aggressive maximum scale in low-energy regions to maximize semantic injection. Extensive experiments across diverse image (SD 1.5, SDXL, SD3.5 Medium) and video (CogVideoX, ModelScope) architectures demonstrate that SAMG effectively resolves the detail-artifact dilemma, achieving superior semantic alignment, structural integrity, and temporal smoothness without any computational overhead.
Chinese Translation
扩散模型在合成复杂的静态和动态视觉方面取得了显著成功,这一突破在很大程度上得益于无分类器引导(Classifier-Free Guidance, CFG)。然而,尽管CFG在将生成内容与文本提示对齐中发挥了关键作用,标准CFG依赖于一个全局统一的标量。这种均匀的放大使模型陷入一个众所周知的“细节伪影困境”:低引导比例无法注入复杂的语义,而高比例则不可避免地导致结构退化、颜色过饱和以及视频中的时间不一致性。本文通过微分几何的视角揭示了这一缺陷的物理根源。通过分析Tweedie公式,我们发现CFG本质上执行了切线线性外推。由于自然数据流形高度弯曲,这种均匀的线性步骤引入了严重的正交偏差。为了保持生成轨迹的安全界限,我们为空间和自适应引导制定了理论上限。基于这些几何洞察,我们提出了空间自适应多重引导(Spatial Adaptive Multi Guidance, SAMG),这是一种无训练且几乎零成本的采样算法。SAMG动态计算逐点条件引导能量,对高能量边界区域施加保守的最小比例,以保持精细的微观纹理,同时在低能量区域施加激进的最大比例,以最大化语义注入。针对多种图像(SD 1.5, SDXL, SD3.5 Medium)和视频(CogVideoX, ModelScope)架构的广泛实验表明,SAMG有效解决了细节伪影困境,实现了优越的语义对齐、结构完整性和时间平滑性,而没有任何计算开销。
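The contrast between a global CFG scalar and a per-pixel scale can be sketched as follows. The update `eps_uncond + w * (eps_cond - eps_uncond)` is the standard CFG rule; the energy definition and the linear mapping from energy to scale are assumptions made for this example and are not taken from the paper, nor are the default values of `w_min` and `w_max`.

```python
import numpy as np

def spatial_guidance(eps_uncond, eps_cond, w_min=3.0, w_max=9.0):
    """CFG with a per-pixel scale: high guidance energy -> w_min, low -> w_max."""
    delta = eps_cond - eps_uncond              # (C, H, W) guidance direction
    energy = np.linalg.norm(delta, axis=0)     # (H, W) per-pixel magnitude
    span = energy.max() - energy.min()
    norm = (energy - energy.min()) / span if span > 0 else np.zeros_like(energy)
    w = w_max - (w_max - w_min) * norm         # (H, W) scale map
    return eps_uncond + w[None] * delta, w

# One channel, two pixels: a low-energy pixel and a high-energy pixel.
eps_u = np.zeros((1, 1, 2))
eps_c = np.array([[[0.1, 1.0]]])
guided, w_map = spatial_guidance(eps_u, eps_c)
# w_map is [[9.0, 3.0]] and guided is [[[0.9, 3.0]]]
```

The low-energy pixel receives the aggressive scale and the high-energy pixel the conservative one, and the only extra work is an elementwise norm, which is consistent with the abstract's description of the method as virtually zero-cost.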
cs.CV / 56 / 2604.26517

MTCurv: Deep learning for direct microtubule curvature mapping in noisy fluorescence microscopy images

MTCurv:用于在噪声荧光显微镜图像中直接映射微管曲率的深度学习
Laydi, Achraf Ait, Moctar, Sidi Mohamed Sid'El, Mourabit, Yousef El, Bouvrais, Hélène
Abstract
Accurate quantification of the geometry of curvilinear biological structures is essential for understanding cellular mechanics and disease-related morphological alterations. Microtubule curvature is a key descriptor of filament rigidity and mechanical perturbations. However, reliable curvature extraction from fluorescence microscopy images remains challenging due to noise, low contrast, and partial filament visibility. Existing approaches rely on segmentation pipelines with pre- or post-processing, which are highly sensitive to segmentation errors and often fail under adverse imaging conditions. In this work, we propose MTCurv, a deep learning framework for direct, segmentation-free regression of microtubule curvature maps from noisy microscopy images. Leveraging a synthetic dataset with pixel-wise curvature annotations, we reformulated curvature estimation as a regression problem and adapted an attention-based residual U-Net. To reduce hallucinations and enforce spatial coherence, we introduced a gradient-aware loss combining Mean Squared Error with a gradient consistency term. Beyond model and loss design, we evaluated commonly used regression and image quality metrics, revealing that many perceptual and blind metrics are poorly suited for curvature estimation. Correlation-based metrics, particularly Spearman correlation, emerged as more reliable indicators of curvature prediction quality. Experiments on two datasets of increasing difficulty demonstrated that MTCurv accurately recovers local microtubule curvatures, even in the presence of background fluorescence. Ablation studies highlighted the contribution of both residual encoding and attention-based decoding. Overall, this work provides a practical tool for filament curvature analysis and methodological insights for geometry-aware regression in biomedical imaging. Datasets and code are made available.
Chinese Translation
准确量化曲线生物结构的几何形状对于理解细胞力学和与疾病相关的形态变化至关重要。微管曲率是描述纤维刚度和机械扰动的关键指标。然而,由于噪声、低对比度和部分纤维可见性,从荧光显微镜图像中可靠提取曲率仍然具有挑战性。现有方法依赖于具有前处理或后处理的分割管道,这对分割错误高度敏感,并且在不利的成像条件下常常失败。在本研究中,我们提出了MTCurv,一个用于从噪声显微镜图像中直接进行无分割回归微管曲率图的深度学习框架。利用具有像素级曲率注释的合成数据集,我们将曲率估计重新表述为回归问题,并调整了基于注意力的残差U-Net。为了减少幻觉并强制空间一致性,我们引入了一种结合均方误差和梯度一致性项的梯度感知损失。除了模型和损失设计外,我们评估了常用的回归和图像质量指标,发现许多感知和盲目指标不适合曲率估计。基于相关性的指标,特别是斯皮尔曼相关性,成为曲率预测质量的更可靠指标。在两个难度逐渐增加的数据集上的实验表明,MTCurv能够准确恢复局部微管曲率,即使在背景荧光存在的情况下。消融研究突显了残差编码和基于注意力的解码的贡献。总体而言,这项工作为纤维曲率分析提供了实用工具,并为生物医学成像中的几何感知回归提供了方法论见解。数据集和代码已公开。
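A loss that combines Mean Squared Error with a gradient consistency term can be sketched in a few lines. The exact form of the gradient term and its weight are not given in the abstract, so the finite-difference formulation and the `lam` parameter below are assumptions.

```python
import numpy as np

def gradient_aware_loss(pred, target, lam=1.0):
    """MSE plus an MSE between finite-difference gradients of pred and target."""
    mse = np.mean((pred - target) ** 2)
    gy_p, gx_p = np.gradient(pred)     # gradients along rows, then columns
    gy_t, gx_t = np.gradient(target)
    grad_term = np.mean((gx_p - gx_t) ** 2 + (gy_p - gy_t) ** 2)
    return mse + lam * grad_term

# Toy curvature map that increases linearly from top to bottom.
target = np.outer(np.arange(4.0), np.ones(4))
loss_same = gradient_aware_loss(target, target)         # 0.0
loss_shift = gradient_aware_loss(target + 1.0, target)  # 1.0, from the MSE term only
```

A constant offset leaves the gradient term at zero, so that term specifically penalises predictions whose spatial structure disagrees with the target, which is the stated purpose of enforcing spatial coherence.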
cs.CV / 57 / 2604.26519

GIFGuard: Proactive Forensics against Deepfakes in Facial GIFs via Spatiotemporal Watermarking

GIFGuard:通过时空水印技术对面部GIF中的深度伪造进行主动取证
Che, Shupeng, Guo, Zhiqing, Miao, Changtao, Ma, Dan, Yang, Gaobo
Abstract
The rapid evolution of deepfake technology poses an unprecedented threat to the authenticity of Graphics Interchange Format (GIF) imagery, which serves as a representative of short-loop temporal media in social networks. However, existing proactive forensics works are designed for static images, which limits their applicability to animated GIFs. To bridge this gap, we propose GIFGuard, the first spatiotemporal watermarking framework tailored for deepfake proactive forensics in GIFs. In the embedding stage, we propose the Spatiotemporal Adaptive Residual Encoder (STARE) to ensure robustness against high-level semantic tampering. It employs a 3D convolutional backbone with adaptive channel recalibration to capture globally coherent temporal dependencies. In the extraction stage, we design the Deep Integrity Restoration Decoder (DIRD). It utilizes a spatiotemporal hourglass architecture equipped with 3D attention to restore latent features, allowing for the accurate extraction of watermark signals even under severe facial manipulation. Furthermore, we construct GIFfaces, the first large-scale benchmark dataset curated for GIF proactive forensics to facilitate research in this domain. Extensive results show that GIFGuard achieves high-fidelity visual quality and remarkable robustness performance against deepfakes. Related code and dataset will be released.
Chinese Translation
深度伪造技术的快速发展对图形交换格式(GIF)图像的真实性构成了前所未有的威胁,而GIF作为社交网络中短循环时间媒体的代表。然而,现有的主动取证工作主要针对静态图像,这限制了其在动画GIF中的适用性。为了解决这一问题,我们提出了GIFGuard,这是第一个为GIF中的深度伪造主动取证量身定制的时空水印框架。在嵌入阶段,我们提出了时空自适应残差编码器(Spatiotemporal Adaptive Residual Encoder,STARE),以确保对高层语义篡改的鲁棒性。该编码器采用具有自适应通道重校准的3D卷积骨干网络,以捕捉全局一致的时间依赖关系。在提取阶段,我们设计了深度完整性恢复解码器(Deep Integrity Restoration Decoder,DIRD)。该解码器利用配备3D注意力机制的时空沙漏架构来恢复潜在特征,从而在严重的面部操控下也能准确提取水印信号。此外,我们构建了GIFfaces,这是第一个为GIF主动取证策划的大规模基准数据集,以促进该领域的研究。大量结果表明,GIFGuard在深度伪造面前实现了高保真视觉质量和显著的鲁棒性表现。相关代码和数据集将会发布。
cs.CV / 58 / 2604.26520

3D-LENS: A 3D Lifting-based Elevated Novel-view Synthesis method for Single-View Aerial-Ground Re-Identification

3D-LENS:一种基于3D提升的单视角空地重识别新视角合成方法
Grolleau, William, Sabourin, Astrid, Lapouge, Guillaume, Achard, Catherine
Abstract
Aerial-Ground Re-Identification (AG-ReID) is constrained by the viewpoint-domain gap, as drastic viewpoint disparities occlude or distort discriminative features, making cross-viewpoint image retrieval challenging. While existing methods rely on paired cross-view annotations, real-world deployments, such as wilderness search-and-rescue (SAR), often lack target-domain data, requiring retrieval from ground-level references alone. To our knowledge, we are the first to address this challenge by formalizing the Single-View AG-ReID (SV AG-ReID) setting, where models trained on a single real viewpoint must generalize to an unseen viewpoint. We propose 3D Lifting-based Elevated Novel-view Synthesis (3D-LENS), a unified framework combining geometrically-consistent novel view synthesis that leverages large-scale 3D mesh reconstruction, with a robust representation learning scheme to mitigate synthetic-to-real bias. Unlike 2D generative baselines that suffer from geometric inconsistencies or prior 3D methods that are restricted to class-specific templates, our approach ensures view-consistent synthesis across diverse categories without predefined templates that fail to capture fine-grained details, such as carried objects. Extensive experiments demonstrate that our method achieves state-of-the-art performance on SV AG-ReID scenarios. Code and data will be released at https://github.com/TurtleSmoke/3D-LENS.
Chinese Translation
空地重识别(AG-ReID)受到视角域差距的限制,因为剧烈的视角差异会遮挡或扭曲判别特征,使得跨视角图像检索变得具有挑战性。现有方法依赖于成对的跨视角注释,而现实世界的应用,如野外搜索与救援(SAR),往往缺乏目标域数据,仅需从地面参考中进行检索。据我们所知,我们首次通过形式化单视角空地重识别(SV AG-ReID)设置来解决这一挑战,其中在单一真实视角上训练的模型必须能够推广到未见过的视角。我们提出了基于3D提升的空中新视角合成(3D-LENS),这是一个统一框架,结合了利用大规模3D网格重建的几何一致性新视角合成与强大的表示学习方案,以减轻合成到真实的偏差。与受到几何不一致性影响的2D生成基线或受限于类别特定模板的先前3D方法不同,我们的方法确保在不同类别之间进行视角一致的合成,而无需预定义的模板,这些模板无法捕捉到细粒度的细节,例如携带的物体。大量实验表明,我们的方法在SV AG-ReID场景中实现了最先进的性能。代码和数据将发布在 https://github.com/TurtleSmoke/3D-LENS。
cs.CV / 59 / 2604.26565

DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

DenseStep2M:一种可扩展的无训练流程用于密集教学视频标注
Ge, Mingji, Chen, Qirui, Li, Zeqian, Xie, Weidi
Abstract
Long-term video understanding requires interpreting complex temporal events and reasoning over procedural activities. While instructional video corpora, like HowTo100M, offer rich resources for model training, they present significant challenges, including noisy ASR transcripts and inconsistent temporal alignments between narration and visual content. In this work, we introduce an automated, training-free pipeline to extract high-quality procedural annotations from in-the-wild instructional videos. Our approach segments videos into coherent shots, filters poorly aligned content, and leverages state-of-the-art multimodal and large language models (Qwen2.5-VL and DeepSeek-R1) to generate structured, temporally grounded procedural steps. This pipeline yields DenseStep2M, a large-scale dataset comprising approximately 100K videos and 2M detailed instructional steps, designed to support comprehensive long-form video understanding. To rigorously evaluate our pipeline, we curate DenseCaption100, a benchmark of high-quality, human-written captions. Evaluations demonstrate strong alignment between our auto-generated steps and human annotations. Furthermore, we validate the utility of DenseStep2M across three core downstream tasks: dense video captioning, procedural step grounding, and cross-modal retrieval. Models fine-tuned on DenseStep2M achieve substantial gains in captioning quality and temporal localization, while exhibiting robust zero-shot generalization across egocentric, exocentric, and mixed-perspective domains. These results underscore the effectiveness of DenseStep2M in facilitating advanced multimodal alignment and long-term activity reasoning. Our dataset is available at https://huggingface.co/datasets/mingjige/DenseStep2M.
Chinese Translation
长期视频理解需要对复杂的时间事件进行解释,并对程序性活动进行推理。虽然像 HowTo100M 这样的教学视频语料库为模型训练提供了丰富的资源,但它们也带来了显著的挑战,包括嘈杂的自动语音识别(ASR)转录和叙述与视觉内容之间不一致的时间对齐。在本研究中,我们提出了一种自动化的、无训练的流程,从野外教学视频中提取高质量的程序性注释。我们的方法将视频分割成连贯的镜头,过滤对齐不良的内容,并利用最先进的多模态和大型语言模型(Qwen2.5-VL 和 DeepSeek-R1)生成结构化的、时间上有依据的程序步骤。该流程生成了 DenseStep2M,一个大规模数据集,包含约 10 万个视频和 200 万个详细的教学步骤,旨在支持全面的长视频理解。为了严格评估我们的流程,我们策划了 DenseCaption100,一个高质量人工撰写字幕的基准。评估结果表明,我们自动生成的步骤与人工注释之间具有良好的对齐。此外,我们验证了 DenseStep2M 在三个核心下游任务中的实用性:密集视频字幕生成、程序步骤定位和跨模态检索。在 DenseStep2M 上微调的模型在字幕质量和时间定位方面取得了显著提升,同时在自我中心、外部中心和混合视角领域表现出强大的零样本泛化能力。这些结果强调了 DenseStep2M 在促进先进的多模态对齐和长期活动推理方面的有效性。我们的数据集可在 https://huggingface.co/datasets/mingjige/DenseStep2M 获取。
cs.CV / 60 / 2604.26567

AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision

AirZoo:用于基础航空几何三维视觉的统一大规模数据集
Cheng, Xiaoya, Wu, Rouwan, Liu, Xinyi, Cui, Zeyu, Liu, Yan, Zhao, Na, Liu, Yu, Zhang, Maojun, Yan, Shen
Abstract
Despite the rapid progress in data-driven 3D vision, aerial geometric 3D vision remains a formidable challenge due to the severe scarcity of large-scale, high-fidelity training data. Existing benchmarks, predominantly biased toward ground-level or object-centric views, do not account for complex viewpoint transformations and diverse environmental conditions in UAV-based sensing. To bridge this critical gap, we propose AirZoo, a unified large-scale dataset and benchmark for grounding aerial geometric 3D vision. AirZoo possesses three appealing properties: 1) Scalable Generation Pipeline: Leveraging freely available, world-scale photogrammetric 3D meshes, it renders vast outdoor environments with customizable UAV flight trajectories and configurable weather/illumination. 2) Comprehensive Scene Diversity: It provides the most extensive coverage of region types to date (spanning 378 regions across 22 countries), systematically encompassing both highly structured urban landscapes and complex unstructured natural environments. 3) Rich Geometric Annotations: Each frame provides synchronized, pixel-level metric depth and precise 6-DoF geo-referenced poses, essential for geometry-aware learning. Through three rigorous evaluation tracks -- aerial image retrieval, cross-view matching, and multi-view 3D reconstruction -- we demonstrate that AirZoo serves as a powerful pre-training engine. Extensive experiments on both public and newly collected real-world benchmarks reveal that fine-tuning on AirZoo yields substantial performance gains for SoTA models (e.g., MegaLoc, RoMa, VGGT, and Depth Anything 3), establishing a new performance upper bound for aerial spatial intelligence.
Chinese Translation
尽管数据驱动的三维视觉取得了快速进展,但航空几何三维视觉仍然面临严峻挑战,主要由于缺乏大规模、高保真度的训练数据。现有基准测试主要偏向于地面视角或物体中心视角,并未考虑无人机(UAV)感知中的复杂视角变换和多样化环境条件。为填补这一关键空白,我们提出了AirZoo,一个用于基础航空几何三维视觉的统一大规模数据集和基准测试。AirZoo具备三个显著特性:1)可扩展生成管道:利用自由获取的全球范围内的摄影测量三维网格,渲染出具有可定制无人机飞行轨迹和可配置天气/照明条件的广阔户外环境。2)全面的场景多样性:截至目前,它提供了最广泛的区域类型覆盖(涵盖22个国家的378个区域),系统性地包括高度结构化的城市景观和复杂的非结构化自然环境。3)丰富的几何注释:每帧提供同步的像素级度量深度和精确的6自由度地理参考姿态,这对于几何感知学习至关重要。通过三个严格的评估轨道——航空图像检索、跨视角匹配和多视角三维重建——我们证明了AirZoo作为强大的预训练引擎的有效性。在公共和新收集的真实世界基准上的广泛实验表明,在AirZoo上进行微调为最先进(SoTA)模型(如MegaLoc、RoMa、VGGT和Depth Anything 3)带来了显著的性能提升,为航空空间智能建立了新的性能上限。
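The per-frame annotations described above (pixel-level metric depth plus a 6-DoF pose) are enough to lift any pixel to a 3D point. The sketch below shows that step under stated assumptions: a pinhole camera, depth measured along the optical axis, and a camera-to-world pose given as a rotation matrix `R` and translation `t`. The function names and the intrinsics (`fx`, `fy`, `cx`, `cy`) are illustrative and are not taken from the AirZoo release.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric z-depth to a point in the camera frame.

    Assumes a pinhole model with focal lengths (fx, fy) and principal
    point (cx, cy), all in pixels.
    """
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def cam_to_world(p_cam, R, t):
    """Apply an assumed camera-to-world pose: p_world = R @ p_cam + t."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )


# Example: a pixel 500 px right of the principal point, 10 m away.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
point_cam = backproject(820, 240, 10.0, 500.0, 500.0, 320.0, 240.0)
point_world = cam_to_world(point_cam, identity, (1.0, 2.0, 3.0))
```

If a dataset stores ray distance instead of z-depth, or world-to-camera poses, the conversion has to be adapted accordingly.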
cs.CV / 61 / 2604.26582

Star-Fusion: A Multi-modal Transformer Architecture for Discrete Celestial Orientation via Spherical Topology

星融合:一种基于球面拓扑的多模态变换器架构用于离散天体定向
Hammad, May, Hammad, Menatallh
Abstract
Reliable celestial attitude determination is a critical requirement for autonomous spacecraft navigation, yet traditional "Lost-in-Space" (LIS) algorithms often suffer from high computational overhead and sensitivity to sensor-induced noise. While deep learning has emerged as a promising alternative, standard regression models are often confounded by the non-Euclidean topology of the celestial sphere and by the periodic boundary conditions of Right Ascension (RA) and Declination (Dec). In this paper, we present Star-Fusion, a multi-modal architecture that reformulates orientation estimation as a discrete topological classification task. Our approach leverages spherical K-Means clustering to partition the celestial sphere into K topologically consistent regions, effectively mitigating coordinate wrapping artifacts. The proposed architecture employs a tripartite fusion strategy: a SwinV2-Tiny transformer backbone for photometric feature extraction, a convolutional heatmap branch for spatial grounding, and a coordinate-based MLP for geometric anchoring. Experimental evaluations on a synthetic Hipparcos-derived dataset demonstrate that Star-Fusion achieves a Top-1 accuracy of 93.4% and a Top-3 accuracy of 97.8%. Furthermore, the model exhibits high computational efficiency, maintaining an inference latency of 18.4 ms on resource-constrained COTS hardware, making it a viable candidate for real-time onboard deployment in next-generation satellite constellations.
Chinese Translation
可靠的天体姿态确定是自主航天器导航的关键要求,但传统的“失落于太空”(Lost-in-Space, LIS)算法常常面临高计算开销和对传感器噪声的敏感性。尽管深度学习已成为一种有前景的替代方案,但标准回归模型通常受到天球非欧几里得拓扑和赤经(Right Ascension, RA)及赤纬(Declination, Dec)周期性边界条件的困扰。本文提出了星融合(Star-Fusion),一种将定向估计重新表述为离散拓扑分类任务的多模态架构。我们的方法利用球面 K-Means 聚类将天球划分为 K 个拓扑一致的区域,有效减轻坐标包裹伪影。所提架构采用三方融合策略:使用 SwinV2-Tiny 变换器主干进行光度特征提取,使用卷积热图分支进行空间定位,以及使用基于坐标的多层感知器(MLP)进行几何锚定。在基于合成的希帕克斯(Hipparcos)衍生数据集上的实验评估表明,星融合在 Top-1 准确率上达到了 93.4%,在 Top-3 准确率上达到了 97.8%。此外,该模型展现出高计算效率,在资源受限的商业现货硬件上保持了 18.4 毫秒的推理延迟,使其成为下一代卫星星座中实时在轨部署的可行候选方案。
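To illustrate why partitioning the sphere avoids coordinate wrapping, here is a minimal sketch of spherical K-Means on RA/Dec directions. It converts angles to unit vectors and clusters by cosine similarity, so RA 359.5° and RA 0.5° end up as neighbors. The farthest-first initialization and all function names are assumptions made for this example, not details from the paper.

```python
import math


def radec_to_unit(ra_deg, dec_deg):
    """Convert right ascension / declination in degrees to a 3D unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def farthest_first_init(points, k):
    """Deterministic init: repeatedly add the point least similar to chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = min(points, key=lambda p: max(dot(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids


def assign(points, centroids):
    """Label each point with the index of its most cosine-similar centroid."""
    return [max(range(len(centroids)), key=lambda j: dot(p, centroids[j])) for p in points]


def spherical_kmeans(points, k, iters=10):
    centroids = farthest_first_init(points, k)
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                continue  # keep the previous centroid for an empty cluster
            total = tuple(sum(c) for c in zip(*members))
            norm = math.sqrt(dot(total, total))
            if norm > 1e-12:
                # Re-project the mean direction back onto the unit sphere.
                centroids[j] = tuple(c / norm for c in total)
    return centroids, assign(points, centroids)


# Directions straddling RA = 0 deg, plus a group near RA = 180 deg.
pts = [radec_to_unit(ra, 0.0) for ra in (359.0, 1.0, 359.5, 0.5, 179.0, 181.0, 180.0)]
centroids, labels = spherical_kmeans(pts, 2)
```

The resulting cluster index can serve as a class label, which is the sense in which orientation estimation becomes a classification task.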
cs.CV / 62 / 2604.26598

FunFace: Feature Utility and Norm Estimation for Face Recognition

FunFace:面部识别的特征效用与范数估计
Babnik, Žiga, Boutros, Fadi, Damer, Naser, Jain, Deepak Kumar, Peer, Peter, Štruc, Vitomir
Abstract
Face Recognition (FR) is used in a variety of application domains, from entertainment and banking to security and surveillance. Such applications rely on the FR model to be robust and perform well in a variety of settings. To achieve this, state-of-the-art FR models typically use expressive adaptive margin loss functions, which tie the feature norm to concepts related to sample quality, such as recognizability and perceptual image quality. Recently, through the development of Face Image Quality Assessment (FIQA) techniques, biometric utility has become the preferred measure of face-image quality and has been shown to be a better predictor of the usefulness of samples for face recognition compared to more human-centric aspects, such as resolution, blur, and lighting, tied to general image quality. While image quality expressed through feature norms exhibits a certain level of correlation with biometric utility, it does not fully encapsulate all aspects of utility. To address this point, we propose a new adaptive margin loss, FunFace (Face Recognition Through Utility and Norm Estimation), which incorporates biometric utility, estimated by the Certainty Ratio, into the adaptive margin, taking inspiration from AdaFace. We show that FunFace (when used to train a face recognition model) achieves results competitive with other state-of-the-art FR models on benchmarks containing high-quality samples, while surpassing them on low-quality benchmarks.
Chinese Translation
面部识别(FR)被广泛应用于娱乐、银行、安全和监控等多个领域。这些应用依赖于FR模型在多种环境下的稳健性和良好性能。为了实现这一目标,最先进的FR模型通常使用富有表现力的自适应边际损失函数,将特征范数与样本质量相关的概念(如可识别性和感知图像质量)联系起来。最近,通过面部图像质量评估(FIQA)技术的发展,生物特征效用已成为面部图像质量的首选衡量标准,并且已被证明比与一般图像质量相关的更具人性化的方面(如分辨率、模糊和光照)更能预测样本在面部识别中的有效性。尽管通过特征范数表达的图像质量与生物特征效用在一定程度上存在相关性,但并未完全涵盖效用的所有方面。为了解决这一问题,我们提出了一种新的自适应边际损失FunFace(通过效用与范数估计进行面部识别),该方法将通过确定性比率估计的生物特征效用纳入自适应边际,灵感来源于AdaFace。我们展示了FunFace(在训练面部识别模型时)在包含高质量样本的基准测试中取得了与其他最先进FR模型相当的结果,同时在低质量基准测试中超越了它们。
cs.CV / 63 / 2604.26614

State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading

超越表象的状态:诊断与改善基于刻度盘的测量读取中的状态一致性
Hu, Yuanze, Li, Gen, Lan, Yuqin, Yu, Qingchen, Yang, Zhichao, Jing, Junwei, Fan, Zhaoxin, Deng, Xiaotie
Abstract
Multimodal large language models (MLLMs) have achieved impressive progress on general multimodal tasks, yet they remain brittle on dial-based measurement reading. In this paper, we study this problem through controlled benchmarks and feature-space probing, and show that current MLLMs not only achieve unsatisfactory accuracy on dial-based readout, but also suffer sharp performance drops under viewpoint and illumination changes even when the underlying dial state remains fixed. Our probing analysis further reveals that same-state samples under appearance variation are not consistently clustered, while neighboring states fail to preserve the local structure implied by continuous dial values. These findings suggest that existing MLLMs largely ignore the intrinsic state geometry of dial measurement tasks and instead rely on superficial appearance cues. Motivated by this diagnosis, we propose TriSCA, a tri-level state-consistent alignment framework for dial-based measurement reading. Specifically, TriSCA consists of state-distance-aware representation alignment, metadata-grounded observation-to-state supervision, and state-aware objective alignment. Extensive ablation studies and evaluation experiments on controlled clock and gauge benchmarks, together with evaluation on an external real-world benchmark, demonstrate the effectiveness of our method.
Chinese Translation
多模态大型语言模型(MLLMs)在一般多模态任务上取得了显著进展,但在基于刻度盘的测量读取方面仍然表现脆弱。本文通过受控基准测试和特征空间探测研究了这一问题,并显示当前的MLLMs不仅在基于刻度盘的读取上准确性不佳,而且在视角和光照变化下即使基础刻度盘状态保持不变,性能也会急剧下降。我们的探测分析进一步揭示,在外观变化下,相同状态样本并未被一致地聚类,而相邻状态未能保持由连续刻度值所暗示的局部结构。这些发现表明,现有的MLLMs在很大程度上忽视了刻度测量任务的内在状态几何,而是依赖于表面的外观线索。基于这一诊断,我们提出了TriSCA,一个用于基于刻度盘的测量读取的三层状态一致性对齐框架。具体而言,TriSCA包括状态距离感知的表示对齐、基于元数据的观察到状态监督以及状态感知的目标对齐。在受控时钟和仪表基准上的广泛消融研究和评估实验,以及在外部真实世界基准上的评估,证明了我们方法的有效性。
cs.CV / 64 / 2604.26620

SnapPose3D: Diffusion-Based Single-Frame 2D-to-3D Lifting of Human Poses

SnapPose3D:基于扩散的单帧2D到3D的人体姿态提升
Simoni, Alessandro, Catalini, Riccardo, Di Nucci, Davide, Borghi, Guido, Davoli, Davide, Garattoni, Lorenzo, Francesca, Gianpiero, Kawana, Yuki, Vezzani, Roberto
Abstract
Depth ambiguity and joint uncertainty are the two main obstacles in obtaining accurate human pose predictions by 2D-to-3D lifting methods proposed in the literature. In particular, these issues are caused by 2D joint locations that can be mapped to multiple 3D positions, inducing multiple possible final poses. Following these considerations, we propose leveraging the generative capability of diffusion-based models to predict multiple hypotheses and aggregate them into a final, accurate pose. Therefore, we introduce SnapPose3D, a pose-lifting framework trained deterministically to denoise 3D poses conditioned on both visual context and 2D pose features. SnapPose3D adopts a probabilistic approach during inference, generating multiple hypotheses through random sampling from a unit Gaussian distribution. Unlike most previous methods that address pose ambiguity by processing temporal sequences, SnapPose3D uses single frames as input, avoiding tracking and limiting computational cost, data acquisition complexity, and the need for online, real-time applications. We extensively evaluate SnapPose3D on well-known benchmarks for the 3D human pose estimation task, showing its ability to generate and aggregate accurate hypotheses that lead to state-of-the-art results.
Chinese Translation
深度模糊和关节不确定性是通过文献中提出的2D到3D提升方法获得准确人体姿态预测的两个主要障碍。特别是,这些问题是由于2D关节位置可以映射到多个3D位置,从而导致多个可能的最终姿态。基于这些考虑,我们提出利用基于扩散的模型生成能力来预测多个假设并将其聚合为最终的准确姿态。因此,我们引入了SnapPose3D,这是一个以确定性方式训练的姿态提升框架,旨在去噪基于视觉上下文和2D姿态特征的3D姿态。SnapPose3D在推理过程中采用概率方法,通过从单位高斯分布中随机采样生成多个假设。与大多数通过处理时间序列来解决姿态模糊的先前方法不同,SnapPose3D使用单帧作为输入,避免了跟踪,并限制了计算成本、数据获取复杂性以及对在线实时应用的需求。我们在著名的3D人体姿态估计基准上对SnapPose3D进行了广泛评估,显示其生成和聚合准确假设的能力,从而实现了最先进的结果。
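The sample-then-aggregate inference described above can be sketched as follows. Everything here is a stand-in: `toy_lifter` is a hypothetical placeholder for the trained denoiser, and the per-coordinate median is just one plausible aggregation rule, since the abstract does not say how hypotheses are combined.

```python
import random
import statistics


def toy_lifter(pose_2d, noise):
    """Hypothetical stand-in for a trained denoiser.

    Keeps each joint's (x, y) and derives a depth from the noise sample;
    a real model would be a conditioned neural network.
    """
    return [(x, y, 1.0 + 0.1 * z) for (x, y), z in zip(pose_2d, noise)]


def sample_hypotheses(pose_2d, n_hyp, lifter=toy_lifter, seed=0):
    """Draw n_hyp unit-Gaussian noise samples and lift each to a 3D pose."""
    rng = random.Random(seed)
    hypotheses = []
    for _ in range(n_hyp):
        noise = [rng.gauss(0.0, 1.0) for _ in pose_2d]
        hypotheses.append(lifter(pose_2d, noise))
    return hypotheses


def aggregate(hypotheses):
    """Combine hypotheses with a per-joint, per-coordinate median."""
    n_joints = len(hypotheses[0])
    return [
        tuple(statistics.median(h[j][c] for h in hypotheses) for c in range(3))
        for j in range(n_joints)
    ]


pose_2d = [(0.0, 0.0), (1.0, 2.0)]
final_pose = aggregate(sample_hypotheses(pose_2d, n_hyp=50))
```

A median is less sensitive to outlier hypotheses than a mean, which is why it is used in this sketch.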
cs.CV / 65 / 2604.26633

SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection

SynSur:一种用于合成工业表面缺陷生成和检测的端到端生成管道
Kühn, Paul Julius, Pommeranz, Mika, Kuijper, Arjan, Sinha, Saptarshi Neil
Abstract
Learning-based industrial defect detection is often limited not by model capacity, but by the scarcity of labeled defect data: defects are rare, annotations are expensive, and collecting balanced training sets is slow. We present an end-to-end pipeline for synthetic defect generation and annotation, combining Vision-Language-Model-based prompts, LoRA-adapted diffusion, mask-guided inpainting, and sample filtering with automatic label derivation, and demonstrate the potential of combining real data with realistic synthetic samples to overcome data scarcity. The evaluation is conducted on a challenging dataset of pitting defects on ball screw drives, and then on a subset of the Mobile phone screen surface defect segmentation dataset (MSD) to test cross-domain transfer. Beyond downstream detector performance, we analyze key stages of the pipeline, including prompt construction, LoRA selection, and sample filtering with DreamSim and CLIPScore, to understand which synthetic samples are both realistic and useful. Experiments with YOLOv26, YOLOX, and LW-DETR show that synthetic-only training does not replace real data. When combined with real data, synthetic defects can preserve performance and yield modest gains in selected BSData training regimes. The MSD transfer study shows that the overall pipeline structure carries over to a second industrial inspection domain, while also highlighting the importance of domain-specific adaptation and annotation-quality control. Overall, the paper provides an end-to-end assessment of diffusion-based industrial defect synthesis and shows that its strongest value lies in strengthening scarce real datasets rather than substituting for them.
Chinese Translation
基于学习的工业缺陷检测的瓶颈往往不是模型容量的限制,而是标注缺陷数据的稀缺性:缺陷是稀有的,注释成本高昂,收集平衡的训练集速度缓慢。我们提出了一种用于合成缺陷生成和注释的端到端管道,结合了基于视觉-语言模型的提示、LoRA适应的扩散、掩膜引导的修复以及自动标签推导的样本过滤,展示了使用现实数据与逼真的合成样本克服数据稀缺的潜力。评估是在一个具有挑战性的球螺杆驱动器的凹坑缺陷数据集上进行的,随后在移动电话屏幕表面缺陷分割数据集(MSD)的一个子集上进行,以测试跨领域迁移。除了下游检测器性能外,我们还分析了管道的关键阶段,包括提示构建、LoRA选择和使用DreamSim和CLIPScore的样本过滤,以理解哪些合成样本既真实又有用。使用YOLOv26、YOLOX和LW-DETR的实验表明,仅使用合成数据的训练并不能替代真实数据。当与真实数据结合时,合成缺陷可以保持性能,并在选定的BSData训练方案中获得适度的提升。MSD迁移研究表明,整体管道结构可以迁移到第二个工业检测领域,同时强调了领域特定适应和注释质量控制的重要性。总体而言,本文提供了基于扩散的工业缺陷合成的端到端评估,并表明其最强的价值在于增强稀缺的真实数据集,而不是替代它们。
cs.CV / 66 / 2604.26678

Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations

通过鼓的形状听见房间:基于模态的多点表面振动声音恢复
Bagon, Shai, Kichler, Matan, Sheinin, Mark
Abstract
Optical vibration sensing enables recovering the scene sound directly from the surface vibration of nearby objects, turning everyday objects into "visual microphones". However, most prior methods have focused on capturing the vibrations of specific objects with highly favorable vibration responses. These include objects where the surface vibrations are generated by the object itself (e.g., speaker membrane or guitar body) or objects consisting of a thin membrane which is highly reactive to sound (e.g., a chip bag or the leaf of a plant). In this paper, we tackle sound recovery for a more challenging class of solid objects whose vibration responses are poor or highly resonant. We simultaneously capture vibrations for multiple surface points on the object using a speckle-based vibrometry imaging system. Then, we derive a novel physics-guided vibration formation model that relates the scene sound source to the captured multi-point multi-axis vibrations via the object's vibrational modes. The model is then used to reverse the resonant transfer function of the vibrating object, fusing multiple vibration signals to estimate the original sound source in the scene. We evaluate our approach by recovering sound from a variety of everyday objects, demonstrating that it significantly outperforms traditional single-point speckle vibrometry in challenging scenarios and other signal-processing-based methods for multi-signal fusion.
Chinese Translation
光学振动传感技术使得能够直接从附近物体的表面振动中恢复场景声音,将日常物体转变为“视觉麦克风”。然而,大多数先前的方法主要集中于捕捉具有高度有利振动响应的特定物体的振动。这些物体包括由物体自身产生表面振动的物体(例如,扬声器膜或吉他琴体)或由对声音高度反应的薄膜组成的物体(例如,薯片袋或植物的叶子)。在本文中,我们针对一类振动响应较差或高度共振的固体物体进行声音恢复。我们使用基于散斑的振动成像系统同时捕捉物体多个表面点的振动。然后,我们推导出一个新颖的物理引导振动形成模型,该模型将场景声音源与通过物体的振动模态捕获的多点多轴振动相关联。该模型用于逆转振动物体的共振传递函数,融合多个振动信号以估计场景中的原始声音源。我们通过从各种日常物体中恢复声音来评估我们的方法,结果表明,在具有挑战性的场景中,它显著优于传统的单点散斑振动测量法和其他基于信号处理的多信号融合方法。
cs.CV / 67 / 2604.26707

CurEvo: Curriculum-Guided Self-Evolution for Video Understanding

CurEvo:基于课程指导的自我演化视频理解
Zeng, Guiyi, Yu, Junqing, Chen, Yi-Ping Phoebe, Chen, Xu, Yang, Wei, Song, Zikai
Abstract
Recent advances in self-evolution video understanding frameworks have demonstrated the potential of autonomous learning without human annotations. However, existing methods often suffer from weakly controlled optimization and uncontrolled difficulty progression, as they lack structured guidance throughout the iterative learning process. To address these limitations, we propose CurEvo, a curriculum-guided self-evolution framework that introduces curriculum learning into self-evolution to achieve more structured and progressive model improvement. CurEvo dynamically regulates task difficulty, refines evaluation criteria, and balances data diversity according to model competence, forming a curriculum-guided feedback loop that aligns learning complexity with model capability. Built upon this principle, we develop a multi-dimensional adaptive QA framework that jointly evolves question generation and answer evaluation across perception, recognition, and understanding dimensions, ensuring coherent and measurable curriculum progression. Through this integration, CurEvo transforms weakly controlled self-evolution into a more structured learning process for autonomous video understanding. Across seven backbones, CurEvo consistently improves both benchmark accuracy and evaluator-based semantic score on four VideoQA benchmarks, validating the effectiveness of curriculum-guided self-evolution for video understanding.
Chinese Translation
近期自我演化视频理解框架的进展展示了无需人工标注的自主学习潜力。然而,现有方法往往面临优化控制不足和难度进展不受控的问题,因为它们在迭代学习过程中缺乏结构化指导。为了解决这些局限性,我们提出了CurEvo,一个基于课程指导的自我演化框架,将课程学习引入自我演化,以实现更结构化和渐进的模型改进。CurEvo根据模型能力动态调节任务难度,细化评估标准,并平衡数据多样性,形成一个课程指导的反馈循环,使学习复杂性与模型能力相匹配。基于这一原则,我们开发了一个多维自适应问答框架,联合演化问题生成和答案评估,涵盖感知、识别和理解维度,确保连贯且可测量的课程进展。通过这种整合,CurEvo将控制不足的自我演化转变为更结构化的自主视频理解学习过程。在七个基础模型上,CurEvo在四个VideoQA基准测试中持续提高了基准准确率和基于评估者的语义得分,验证了基于课程指导的自我演化在视频理解中的有效性。
cs.CV / 68 / 2604.26740

Learning Sparse BRDF Measurement Samples from Image

从图像中学习稀疏BRDF测量样本
Cao, Wen
Abstract
Accurate BRDF acquisition is important for realistic rendering, but dense gonioreflectometer measurements are slow and expensive. We study how to select a small number of BRDF measurements that are most useful for reconstructing material appearance under a learned reflectance prior. Our method combines a set encoder for sparse coordinate-value observations, a pretrained hypernetwork-based BRDF reconstructor, and a differentiable renderer. During sampler training, the reconstructor is kept fixed and gradients from BRDF-space and rendered-image losses are used to optimize measurement locations. This separates sample selection from prior fitting and encourages the sampler to choose directions that are informative under the learned material distribution. Experiments on the MERL dataset show that the proposed sampler improves low-budget reconstruction quality at 8 and 16 measurements compared with neural reconstruction baselines, while PCA-based methods remain strong at larger budgets. We further analyze the effect of image-space supervision, co-optimization, and image-only latent fitting for unseen materials.
Chinese Translation
准确的BRDF获取对于真实感渲染至关重要,但密集的光度反射计测量速度慢且成本高昂。我们研究如何选择少量最有助于在学习的反射先验下重建材料外观的BRDF测量。我们的方法结合了用于稀疏坐标-值观测的集合编码器、一个基于预训练超网络的BRDF重建器和一个可微渲染器。在采样器训练期间,重建器保持固定,并利用来自BRDF空间和渲染图像损失的梯度来优化测量位置。这将样本选择与先验拟合分离,并鼓励采样器选择在学习的材料分布下信息丰富的方向。对MERL数据集的实验表明,与神经重建基线相比,所提出的采样器在8个和16个测量下提高了低预算重建质量,而基于PCA的方法在更大预算下仍然表现强劲。我们进一步分析了图像空间监督、共同优化和仅图像潜在拟合对未见材料的影响。
cs.CV / 69 / 2604.26752

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

GLM-5V-Turbo:迈向多模态智能体的原生基础模型
V Team, Hong, Wenyi, Gu, Xiaotao, Pan, Ziyang, Yang, Zhen, Wang, Yuting, Wang, Yue, Yue, Yuanchang, Wang, Yu, Wang, Yanling, Wang, Yan, Liu, Xijun, Yu, Wenmeng, Wang, Weihan, Li, Wei, Duan, Shuaiqi, Yang, Sheng, Lv, Ruiliang, Liu, Mingdao, Pan, Lihang, Ning, Ke, Ji, Junhui, Wang, Jinjiang, Chen, Jing, Xu, Jiazheng, Zhu, Jiale, Cheng, Jiale, Qi, Ji, Gan, Guobing, Wang, Guo, Yao, Cong, Dou, Zijun, Zhou, Zihao, Wang, Zihan, Ge, Zhiqi, Li, Zhijie, Hou, Zhenyu, Xue, Zhao, Wang, Zehui, He, Zehai, Liu, Yusen, Cen, Yukuo, Li, Yuchen, Wang, Yuan, Lu, Yijian, Wang, Yanzi, Xue, Yadong, Zhang, Xinyu, Liu, Xinyu, Li, Wenkai, Tong, Tianyu, Zhang, Tianshu, Yan, Shengdong, Zheng, Qinkai, Xu, Mingde, Bao, Licheng, Xu, Jiaxing, Fan, Jiaxin, Qian, Jiawen, Chen, Jiali, Lin, Jiahui, Zheng, Haozhi, Wang, Haoran, Li, Haochen, Yang, Fan, Zhang, Dan, Zhao, Chuangxin, Wu, Chengcheng, Shi, Boyan, Jia, Bowei, Wang, Baoxu, Zhang, Peng, Liu, Debing, Xu, Bin, Li, Juanzi, Huang, Minlie, Dong, Yuxiao, Tang, Jie
Abstract
We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.
Chinese Translation
我们提出了GLM-5V-Turbo,这是迈向多模态智能体原生基础模型的一步。随着基础模型在真实环境中的应用日益增多,智能体的能力不仅依赖于语言推理,还依赖于在图像、视频、网页、文档和图形用户界面等异构环境中感知、解释和行动的能力。GLM-5V-Turbo围绕这一目标构建:多模态感知被整合为推理、规划、工具使用和执行的核心组成部分,而不是作为语言模型的辅助接口。本报告总结了GLM-5V-Turbo在模型设计、多模态训练、强化学习、工具链扩展和与智能体框架集成等方面的主要改进。这些发展在多模态编码、视觉工具使用和基于框架的智能体任务中表现出强大的性能,同时保持了竞争性的文本编码能力。更重要的是,我们的开发过程为构建多模态智能体提供了实用的见解,强调了多模态感知、分层优化和可靠的端到端验证的核心作用。
cs.CV / 70 / 2604.26772

TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection

利用补丁令牌:利用视觉基础模型特征进行AI生成图像检测
Abdullah, Ahmed, Ebert, Nikolas, Wasenmüller, Oliver
Abstract
Recent methods demonstrate that large-scale pretrained models, such as CLIP vision transformers, effectively detect AI-generated images (AIGIs) from unseen generative models when used as feature extractors. Many state-of-the-art methods for AI-generated image detection build upon the original CLIP-ViT to enhance this generalization. Since CLIP's release, numerous vision foundation models (VFMs) have emerged, incorporating architectural improvements and different training paradigms. Despite these advances, their potential for AIGI detection and AI image forensics remains largely unexplored. In this work, we present a comprehensive benchmark across multiple VFM families, covering diverse pretraining objectives, input resolutions, and model scales. We systematically evaluate their out-of-the-box performance for detecting fully-generated AI-images and AI-inpainted images, and discover that the best model outperforms the original CLIP by more than 12% in accuracy, beating established approaches in the process. To fully leverage the features of a modern VFM, we propose a simple redesign of the classifier head by utilizing tunable attention pooling (TAP), which aggregates output tokens into a refined global representation. Integrating TAP with the latest VFMs yields substantial performance gains across several AIGI detection benchmarks, establishing a new state-of-the-art on two challenging benchmarks for in-the-wild detection of AI-generated and -inpainted images.
Chinese Translation
近期的方法表明,大规模预训练模型,如CLIP视觉变换器,作为特征提取器时能够有效地从未见过的生成模型中检测AI生成图像(AIGIs)。许多最先进的AI生成图像检测方法基于原始的CLIP-ViT,以增强这种泛化能力。自CLIP发布以来,许多视觉基础模型(VFMs)相继出现,融入了架构改进和不同的训练范式。尽管取得了这些进展,但它们在AIGI检测和AI图像取证方面的潜力仍然未被充分探索。在本研究中,我们对多个VFM家族进行了全面的基准测试,涵盖了多样的预训练目标、输入分辨率和模型规模。我们系统地评估了它们在检测完全生成的AI图像和AI修复图像方面的开箱即用性能,并发现最佳模型在准确率上超过了原始CLIP超过12%,在此过程中超越了既有方法。为了充分利用现代VFM的特征,我们提出了一种简单的分类器头重设计,利用可调注意力池化(TAP),将输出令牌聚合成精炼的全局表示。将TAP与最新的VFM结合,在多个AIGI检测基准上实现了显著的性能提升,在两个具有挑战性的基准上建立了AI生成和修复图像的野外检测的新最先进水平。
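As a rough illustration of aggregating patch tokens into one global representation, the sketch below implements single-query attention pooling. It assumes one learnable query vector and a single head with scaled dot-product scores; the abstract does not specify the actual TAP design, so treat `attention_pool` and its arguments as hypothetical.

```python
import math


def attention_pool(tokens, query):
    """Pool a list of token vectors into one vector via softmax attention.

    tokens: list of equal-length float lists (e.g. patch-token features).
    query:  float list of the same length (would be learned in practice).
    Returns (pooled_vector, attention_weights).
    """
    d = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    # Numerically stable softmax over the token scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)]
    return pooled, weights


tokens = [[1.0, 0.0], [0.0, 1.0]]
uniform_pooled, _ = attention_pool(tokens, [0.0, 0.0])   # zero query: plain mean
focused_pooled, focused_w = attention_pool(tokens, [10.0, 0.0])
```

With a zero query the pooling reduces to mean pooling; a trained query lets the classifier head weight informative patches more heavily.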
cs.CV / 71 / 2604.26774

MemOVCD: Training-Free Open-Vocabulary Change Detection via Cross-Temporal Memory Reasoning and Global-Local Adaptive Rectification

MemOVCD:通过跨时间记忆推理和全局-局部自适应校正实现无训练的开放词汇变化检测
Kuang, Zuzheng, Chang, Honghao, Liang, Boqiang, Wang, Haoqian, He, Lijun, Li, Fan, Bi, Haixia
Abstract
Open-vocabulary change detection aims to identify semantic changes in bi-temporal remote sensing images without predefined categories. Recent methods combine foundation models such as SAM, DINO and CLIP, but typically process each timestamp independently or interact only at the final comparison stage. Such paradigms suffer from insufficient temporal coupling during semantic reasoning, which limits their ability to distinguish genuine semantic changes from non-semantic appearance discrepancies. In addition, patch-dominant inference on high-resolution images often weakens global semantic continuity and produces fragmented change regions. To address these issues, we propose MemOVCD, a training-free open-vocabulary change detection framework based on cross-temporal memory reasoning and global-local adaptive rectification. Specifically, we reformulate bi-temporal change detection as a two-frame tracking problem and introduce weighted bidirectional propagation to aggregate semantic evidence from both temporal directions. To stabilize memory propagation across large temporal gaps, we construct histogram-aligned transition frames to smooth abrupt appearance changes. Moreover, a global-local adaptive rectification strategy adaptively fuses local and global-view predictions, improving spatial consistency while preserving fine-grained details. Experiments on five benchmarks demonstrate that MemOVCD achieves favorable performance on two change detection tasks, validating its effectiveness and generalization under diverse open-vocabulary settings.
Chinese Translation
开放词汇变化检测旨在识别双时相遥感图像中的语义变化,而无需预定义类别。近期的方法结合了基础模型,如SAM、DINO和CLIP,但通常独立处理每个时间戳或仅在最终比较阶段进行交互。这种范式在语义推理过程中缺乏足够的时间耦合,限制了其区分真实语义变化与非语义外观差异的能力。此外,高分辨率图像上的补丁主导推理往往削弱了全局语义连续性,并产生了支离破碎的变化区域。为了解决这些问题,我们提出了MemOVCD,一个基于跨时间记忆推理和全局-局部自适应校正的无训练开放词汇变化检测框架。具体而言,我们将双时相变化检测重新表述为一个双帧跟踪问题,并引入加权双向传播以从两个时间方向聚合语义证据。为了在较大的时间间隔中稳定记忆传播,我们构建了直方图对齐的过渡帧,以平滑突发的外观变化。此外,全局-局部自适应校正策略自适应地融合局部和全局视图预测,提高了空间一致性,同时保留了细粒度细节。在五个基准测试上的实验表明,MemOVCD在两个变化检测任务中表现出良好的性能,验证了其在多样化开放词汇设置下的有效性和泛化能力。
cs.CV / 72 / 2604.26781

Virtual-reality based patient-specific simulation of spine surgical procedures: A fast, highly automated and high-fidelity system for surgical education and planning

基于虚拟现实的患者特异性脊柱外科手术模拟:一个快速、高度自动化和高保真度的外科教育与规划系统
Ranabhat, Raj Kumar, Ross, Tayler D, Jiao, Tony, Larouche, Jeremie, Finkelstein, Joel, Hardisty, Michael
Abstract
Surgical training involves didactic teaching, mentor-led learning, surgical skills laboratories, and direct exposure to surgery; however, increasing clinical pressures have limited operating room (OR) exposure. This work leverages virtual reality (VR) to provide a safe and immersive training environment. Existing VR training is often based on standardized scenarios not tailored to individual clinical cases. This study addresses this limitation using artificial intelligence (AI) based computer vision methods to generate patient-specific simulations from computed tomography (CT) and magnetic resonance imaging (MRI). This study focuses on patient-specific spinal decompression simulation for spinal stenosis in a virtual operating room. The objectives were (1) automatic creation of 3D anatomical models and (2) VR simulation of spinal decompression procedures including laminectomy, disc resection, and foraminotomy. Model construction required multimodal fusion (registration) of CT and MRI and segmentation of relevant structures. Segmentation was evaluated using the Dice Similarity Coefficient (DSC), and registration accuracy using Target Registration Error (TRE). Qualitative feedback was obtained from surgeons and trainees. High-fidelity patient-specific 3D models were generated efficiently (approximately 2.5 minutes per case, N = 15). Segmentation accuracy was high, with a DSC of 0.95 (+/- 0.03) for vertebral bone and 0.895 (+/- 0.02) for soft tissue structures. Registration accuracy showed a mean TRE of 1.73 (+/- 0.42) mm. Semi-structured interviews indicated improved spatial understanding, increased procedural confidence, and strong perceived educational value. This platform significantly reduced the time and costs of patient-specific modelling, thereby facilitating pre-operative planning, post-procedural assessments, and comprehensive surgical simulation.
Chinese Translation
外科培训包括理论教学、导师指导学习、外科技能实验室和直接接触手术;然而,日益增加的临床压力限制了手术室的接触。本研究利用虚拟现实(VR)提供一个安全且沉浸式的培训环境。现有的VR培训通常基于标准化场景,而未针对个体临床案例进行定制。本研究通过基于人工智能(AI)的计算机视觉方法,利用计算机断层扫描(CT)和磁共振成像(MRI)生成患者特异性模拟,解决了这一局限性。研究重点是针对脊柱狭窄的患者特异性脊柱减压模拟,置于虚拟手术室中。研究目标为(1)自动创建三维解剖模型和(2)虚拟现实模拟脊柱减压手术,包括椎板切除术、椎间盘切除术和神经根管成形术。模型构建需要对CT和MRI进行多模态融合(配准)以及相关结构的分割。分割精度使用Dice相似系数(DSC)进行评估,配准精度使用目标配准误差(TRE)进行评估。通过半结构化访谈获得外科医生和实习生的定性反馈。高保真的患者特异性三维模型高效生成(每例约2.5分钟,N = 15)。分割精度高,椎骨的DSC为0.95(+/- 0.03),软组织结构的DSC为0.895(+/- 0.02)。配准精度显示平均TRE为1.73(+/- 0.42)毫米。半结构化访谈表明,空间理解能力提高,手术信心增强,且感知教育价值强。本平台显著减少了患者特异性建模的时间和成本,从而促进了术前规划、术后评估和全面的外科模拟。
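For readers unfamiliar with the two metrics reported above, here is a minimal sketch of how they are commonly computed. It assumes segmentation masks are given as sets of voxel coordinates and that corresponding landmark pairs are available for registration; the function names are illustrative and do not come from the paper.

```python
import math


def dice(mask_a, mask_b):
    """Dice Similarity Coefficient: 2|A ∩ B| / (|A| + |B|).

    Masks are iterables of voxel coordinates; two empty masks count as a
    perfect match.
    """
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def mean_tre(targets_fixed, targets_registered):
    """Mean Target Registration Error: average Euclidean distance between
    corresponding landmarks after registration (same units as the inputs)."""
    dists = [math.dist(p, q) for p, q in zip(targets_fixed, targets_registered)]
    return sum(dists) / len(dists)


overlap = dice({(0, 0, 0), (0, 0, 1)}, {(0, 0, 1), (0, 0, 2)})
error_mm = mean_tre([(0, 0, 0), (1, 1, 1)], [(3, 4, 0), (1, 1, 1)])
```

A DSC of 1.0 means identical masks and 0.0 means no overlap, while a lower TRE means better alignment.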
cs.CV / 73 / 2604.26799

MesonGS++: Post-training Compression of 3D Gaussian Splatting with Hyperparameter Searching

MesonGS++:带超参数搜索的3D高斯点云后训练压缩
Xie, Shuzhao, Ge, Junchen, Zhang, Weixiang, Liu, Jiahang, Tang, Chen, Bai, Yunpeng, Ge, Shijia, Jiang, Jingyan, Huang, Yuzhi, Yang, Fengnian, Zhang, Cong, Fan, Xiaoyi, Wang, Zhi
Abstract
3D Gaussian Splatting (3DGS) achieves high-quality novel view synthesis with real-time rendering, but its storage cost remains prohibitive for practical deployment. Existing post-training compression methods still rely on many coupled hyperparameters across pruning, transformation, quantization, and entropy coding, making it difficult to control the final compressed size and fully exploit the rate-distortion trade-off. We propose MesonGS++, a size-aware post-training codec for 3D Gaussian compression. On the codec side, MesonGS++ combines joint importance-based pruning, octree geometry coding, attribute transformation, selective vector quantization for higher-degree spherical harmonics, and group-wise mixed-precision quantization with entropy coding. On the configuration side, it treats the reserve ratio and bit-width allocation as the dominant rate-distortion knobs and jointly optimizes them under a target storage budget via discrete sampling and 0-1 integer linear programming. We further propose a linear size estimator and a CUDA parallel quantization operator to accelerate the hyperparameter searching process. Extensive experiments show that MesonGS++ achieves over 34× compression while preserving rendering fidelity, outperforming state-of-the-art post-training methods and accurately meeting target size budgets. Remarkably, without any training, MesonGS++ can even surpass the PSNR of vanilla 3DGS at a 20× compression rate on the Stump scene. Our code is available at https://github.com/mmlab-sigs/mesongs_plus
Chinese Translation
3D高斯点云(3DGS)以实时渲染实现高质量的新视图合成,但其存储成本仍然对实际部署构成障碍。现有的后训练压缩方法仍依赖于在剪枝、变换、量化和熵编码等多个耦合超参数之间进行调整,这使得控制最终压缩大小和充分利用率失真权衡变得困难。我们提出了MesonGS++,一种针对3D高斯压缩的大小感知后训练编解码器。在编解码器方面,MesonGS++结合了基于重要性的联合剪枝、八叉树几何编码、属性变换、针对高阶球面谐波的选择性向量量化以及与熵编码相结合的分组混合精度量化。在配置方面,它将保留比例和比特宽度分配视为主要的率失真调节器,并通过离散采样和0-1整数线性规划在目标存储预算下进行联合优化。我们进一步提出了一种线性大小估计器和CUDA并行量化算子,以加速超参数搜索过程。大量实验表明,MesonGS++在保持渲染保真度的同时实现了超过34倍的压缩,超越了最先进的后训练方法,并准确满足目标大小预算。值得注意的是,在没有任何训练的情况下,MesonGS++甚至可以在Stump场景下以20倍压缩率超越普通3DGS的PSNR。我们的代码可在https://github.com/mmlab-sigs/mesongs_plus获取。
cs.CV / 74 / 2604.26806

ViCrop-Det: Spatial Attention Entropy Guided Cropping for Training-Free Small-Object Detection

ViCrop-Det:基于空间注意力熵引导的无训练小目标检测裁剪
Wang, Hui, Li, Hongze, Chen, Wei, Zhang, Xiaojin
Abstract
Transformer-based architectures have established a dominant paradigm in global semantic perception; however, they remain fundamentally constrained by the profound spatial heterogeneity inherent in natural images. Specifically, the imposition of a uniform global receptive field across regions of varying information density inevitably leads to local feature degradation, particularly in dense conflict zones populated by microscopic targets. To address this mechanistic limitation, we propose ViCrop-Det, a training-free inference framework that introduces adaptive spatial trust region shrinkage. Inspired by the use of attention entropy in anomaly segmentation, ViCrop-Det leverages the detection decoder's cross-attention distribution as an endogenous probe. By utilizing Spatial Attention Entropy (SAE) to heuristically evaluate local spatial ambiguity, the framework executes dynamic spatial routing, allocating a fixed computational budget exclusively to regions exhibiting both high target saliency and high cognitive uncertainty. By shrinking the spatial trust region and injecting high-frequency localized observations, ViCrop-Det actively resolves spatial ambiguity and recovers fine-grained features without requiring architectural modifications. Extensive evaluations on VisDrone and DOTA-v1.5 demonstrate that ViCrop-Det yields competitive performance enhancements, consistently adding +1-3 mAP@50 to RT-DETR-R50 and Deformable DETR with a marginal 20-23% latency overhead. On MS COCO, $AP_{S}$ improves while $AP_{M}/AP_{L}$ remains stable, indicating precise fine-scale refinement without compromising the global spatial prior. Under compute-matched settings, our adaptive routing strategy comprehensively surpasses uniform slicing baselines, achieving a highly optimized accuracy-speed trade-off.
Chinese Translation
基于Transformer的架构在全球语义感知中建立了主导范式;然而,它们仍然受到自然图像固有的深刻空间异质性的根本限制。具体而言,在信息密度不同的区域施加统一的全局感受野不可避免地导致局部特征退化,特别是在充满微小目标的密集冲突区域。为了解决这一机制限制,我们提出了ViCrop-Det,一种无训练推理框架,介绍了自适应空间信任区域收缩。受异常分割中注意力熵使用的启发,ViCrop-Det利用检测解码器的交叉注意力分布作为内生探针。通过利用空间注意力熵(Spatial Attention Entropy, SAE)启发式地评估局部空间模糊性,该框架执行动态空间路由,将固定的计算预算专门分配给同时表现出高目标显著性和高认知不确定性的区域。通过缩小空间信任区域并注入高频局部观察,ViCrop-Det积极解决空间模糊性,并在不需要架构修改的情况下恢复细粒度特征。在VisDrone和DOTA-v1.5上的广泛评估表明,ViCrop-Det实现了竞争性的性能提升,始终为RT-DETR-R50和Deformable DETR增加+1-3 mAP@50,同时仅有20-23%的延迟开销。在MS COCO上,$AP_{S}$有所提升,而$AP_{M}/AP_{L}$保持稳定,表明在不妥协全局空间先验的情况下实现了精确的细尺度优化。在计算匹配的设置下,我们的自适应路由策略全面超越了均匀切片基线,实现了高度优化的准确性与速度的权衡。
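The abstract above does not give the exact definition of Spatial Attention Entropy, so the following is only a sketch of the general idea: treat an attention map over spatial cells as a probability distribution and measure how spread out it is with Shannon entropy. The function names and the normalization by log(N) are assumptions made for illustration.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of a non-negative attention map, after normalization."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    probs = [w / total for w in weights]
    # Zero-probability cells contribute nothing (0 * log 0 is taken as 0).
    return -sum(p * math.log(p) for p in probs if p > 0)

def normalized_entropy(weights):
    """Entropy divided by log(N), so 0 means fully peaked and 1 means uniform."""
    n = len(weights)
    return 0.0 if n < 2 else attention_entropy(weights) / math.log(n)
```

Under this reading, a query whose attention is spread evenly across a region scores close to 1 (high spatial ambiguity), while a sharply localized query scores close to 0.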
cs.CV / 75 / 2604.26820

Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization

Bridge:基于基础的因果推断结合视觉基础模型进行领域泛化
Hong, Mingbo, Liu, Feng, Gevaert, Caroline, Vosselman, George, Cheng, Hao
Abstract
Detectors often suffer from degraded performance, primarily due to the distributional gap between the source and target domains. This issue is especially evident in single-source domains with limited data, as models tend to rely on confounders (e.g., illumination, co-occurrence, and style) from the source domain, leading to spurious correlations that hinder generalization. To this end, this paper proposes a novel Basis-driven framework for domain generalization, namely Bridge, that incorporates causal inference into object detection. By learning the low-rank bases for front-door adjustment, Bridge blocks confounders' effects to mitigate spurious correlations, while simultaneously refining representations by filtering redundant and task-irrelevant components. Bridge can be seamlessly integrated with both discriminative (e.g., DINOv2/3, SAM) and generative (e.g., Stable Diffusion) Vision Foundation Models (VFMs). Extensive experiments across multiple domain generalization object detection datasets, i.e., Cross-Camera, Adverse Weather, Real-to-Artistic, Diverse Weather Datasets, and Diverse Weather DroneVehicle (our newly augmented real-world UAV-based benchmark), underscore the superiority of our proposed method over previous state-of-the-art approaches. The project page is available at: https://mingbohong.github.io/Bridge/.
Chinese Translation
检测器常常因源域与目标域之间的分布差异而性能下降。这个问题在数据有限的单源域中尤为明显,因为模型往往依赖于源域中的混杂因素(例如,光照、共现和风格),导致虚假的相关性,妨碍了泛化。为此,本文提出了一种新颖的基于基础的领域泛化框架,即 Bridge,将因果推断引入物体检测中。通过学习前门调整的低秩基础,Bridge 阻断混杂因素的影响,以减轻虚假的相关性,同时通过过滤冗余和与任务无关的成分来精炼表示。Bridge 可以与区分性(例如,DINOv2/3、SAM)和生成性(例如,Stable Diffusion)视觉基础模型(VFMs)无缝集成。在多个领域泛化物体检测数据集(即,跨相机、恶劣天气、真实到艺术、不同天气数据集以及我们新增强的基于无人机的真实世界基准Diverse Weather DroneVehicle)上的广泛实验,突显了我们提出的方法相较于以前的最先进方法的优越性。项目页面可访问:https://mingbohong.github.io/Bridge/。
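For readers unfamiliar with the front-door adjustment mentioned in the abstract above, the textbook form of the criterion is sketched below. Here X is the input, Y the prediction target, and Z a mediator through which X influences Y; these symbol roles are assumptions made for illustration, and the abstract does not say how Bridge's low-rank bases approximate the sums.

```latex
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{z} P(z \mid x) \sum_{x'} P\bigl(Y \mid x', z\bigr)\, P(x')
```

The inner sum averages over all inputs x', which is what removes the influence of unobserved confounders acting on both X and Y.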
cs.CV / 76 / 2604.26857

Edge AI for Automotive Vulnerable Road User Safety: Deployable Detection via Knowledge Distillation

边缘人工智能在汽车脆弱道路用户安全中的应用:通过知识蒸馏实现可部署检测
Karjol, Akshay, Hanna, Darrin M.
Abstract
Deploying accurate object detection for Vulnerable Road User (VRU) safety on edge hardware requires balancing model capacity against computational constraints. Large models achieve high accuracy but fail under INT8 quantization required for edge deployment, while small models sacrifice detection performance. This paper presents a knowledge distillation (KD) framework that trains a compact YOLOv8-S student (11.2M parameters) to mimic a YOLOv8-L teacher (43.7M parameters), achieving 3.9x compression while preserving quantization robustness. We evaluate on full-scale BDD100K (70K training images) with Post-Training Quantization to INT8. The teacher suffers catastrophic degradation under INT8 (-23% mAP), while the KD student retains accuracy (-5.6% mAP). Analysis reveals that KD transfers precision calibration rather than raw detection capacity: the KD student achieves 0.748 precision versus 0.653 for direct training at INT8, a 14.5% gain at equivalent recall, reducing false alarms by 44% versus the collapsed teacher. At INT8, the KD student exceeds the teacher's FP32 precision (0.748 vs. 0.718) in a model 3.9x smaller. These findings establish knowledge distillation as a requirement for deploying accurate, safety-critical VRU detection on edge hardware.
Chinese Translation
在边缘硬件上部署针对脆弱道路用户(VRU)安全的精确目标检测需要在模型容量与计算约束之间取得平衡。大型模型虽然能够实现高精度,但在边缘部署所需的 INT8 量化下表现不佳,而小型模型则牺牲了检测性能。本文提出了一种知识蒸馏(KD)框架,训练一个紧凑的 YOLOv8-S 学生模型(11.2M 参数)以模仿 YOLOv8-L 教师模型(43.7M 参数),实现了 3.9 倍的压缩,同时保持了量化的鲁棒性。我们在全规模的 BDD100K 数据集(70K 训练图像)上进行评估,并采用后训练量化将模型量化为 INT8。教师模型在 INT8 下遭遇了灾难性的性能下降(mAP 降低 23%),而 KD 学生模型则保持了相对准确性(mAP 降低 5.6%)。分析表明,KD 转移的是精度校准而非原始检测能力:KD 学生模型在 INT8 下的精度为 0.748,而直接训练的精度为 0.653,在相同的召回率下实现了 14.5% 的提升,相较于性能崩溃的教师模型减少了 44% 的误报。在 INT8 下,KD 学生模型的精度(0.748)超过了教师模型的 FP32 精度(0.718),而模型规模仅为教师模型的约 1/3.9。这些发现确立了知识蒸馏在边缘硬件上部署精确且安全关键的 VRU 检测中的必要性。
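Two of the headline figures in the abstract above can be checked with simple arithmetic. The sketch below assumes that the reported 14.5% precision gain is a relative improvement (the absolute difference would be 9.5 points); the variable names are illustrative.

```python
teacher_params_m = 43.7   # YOLOv8-L teacher, millions of parameters (from the abstract)
student_params_m = 11.2   # YOLOv8-S student, millions of parameters (from the abstract)
compression = teacher_params_m / student_params_m

kd_precision = 0.748      # KD student at INT8 (from the abstract)
direct_precision = 0.653  # directly trained student at INT8 (from the abstract)
# Assumption: the 14.5% figure is relative to the directly trained student.
relative_gain_pct = (kd_precision / direct_precision - 1.0) * 100.0

print(f"compression: {compression:.1f}x")                    # compression: 3.9x
print(f"relative precision gain: {relative_gain_pct:.1f}%")  # relative precision gain: 14.5%
```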
cs.CV / 77 / 2604.26868

Breaking the Rigid Prior: Towards Articulated 3D Anomaly Detection

打破刚性先验:面向关节化的 3D 异常检测
Gan, Jinye, Zheng, Bozhong, Xu, Xiaohao, Ren, Junye, Zhang, Zixuan, Ni, Na, Wu, Yingna
Abstract
Existing 3D anomaly detection methods are built on a rigid prior: normal geometry is pose-invariant and can be canonicalized through registration or alignment. This prior does not hold for articulated objects with hinge or sliding joints, where valid pose changes induce structured geometric variations that cannot be collapsed to a single canonical template, causing pose-induced deformations to be misidentified as anomalies while true structural defects are obscured. No existing benchmark addresses this challenge. We introduce ArtiAD, the first large-scale benchmark for articulated 3D anomaly detection, comprising 15,229 point clouds across 39 object categories with dense joint-angle variations and six structural anomaly types. Each sample is annotated with its joint configuration and part-level motion labels, enabling explicit disentanglement of pose-induced geometry from structural defects. ArtiAD also provides a seen/unseen articulation split to evaluate both interpolation and extrapolation to novel joint configurations. We propose Shape-Pose-Aware Signed Distance Field (SPA-SDF), a baseline that replaces the rigid prior with a continuous pose-conditioned implicit field, factorized into an articulation-independent structural prior and a Fourier-encoded joint embedding. At inference, the articulation state is recovered by minimizing reconstruction energy, and anomalies are identified as point-wise deviations from the learned manifold. SPA-SDF achieves 0.884 object-level AUROC on seen configurations and 0.874 on unseen configurations, substantially outperforming all rigid-based baselines. Our code and benchmark will be publicly released to facilitate future research.
Chinese Translation
现有的 3D 异常检测方法建立在一种刚性先验之上:正常几何形状是姿态不变的,并且可以通过配准或对齐进行规范化。然而,这一先验不适用于具有铰链或滑动关节的关节化物体,在这些物体中,有效的姿态变化会引发结构化的几何变化,这些变化无法简化为单一的规范模板,导致姿态引起的变形被误识别为异常,而真正的结构缺陷则被掩盖。目前没有现有的基准能够解决这一挑战。我们引入了 ArtiAD,这是首个针对关节化 3D 异常检测的大规模基准,包含 15,229 个点云,涵盖 39 个物体类别,具有密集的关节角度变化和六种结构异常类型。每个样本都标注了其关节配置和部件级运动标签,从而能够明确区分姿态引起的几何变化与结构缺陷。ArtiAD 还提供了已见/未见的关节划分,以评估对新关节配置的插值和外推能力。我们提出了形状-姿态-感知的有符号距离场(Shape-Pose-Aware Signed Distance Field,SPA-SDF),这是一个基线模型,它用一个连续的姿态条件隐式场替代了刚性先验,该隐式场被分解为与关节无关的结构先验和傅里叶编码的关节嵌入。在推理过程中,通过最小化重构能量来恢复关节状态,异常被识别为与学习到的流形的逐点偏差。SPA-SDF 在已见配置上的物体级 AUROC 达到 0.884,在未见配置上达到 0.874,显著优于所有基于刚性的基线。我们的代码和基准将公开发布,以促进未来的研究。
cs.CV / 78 / 2604.26873

Uncertainty-Aware Pedestrian Attribute Recognition via Evidential Deep Learning

基于不确定性感知的行人属性识别框架:证据深度学习方法
Lou, Zhuofan, Zhang, Shihang, Zhu, Fangle, Ye, Shengjie, Wang, Pingyu
Abstract
We propose UAPAR, an Uncertainty-Aware Pedestrian Attribute Recognition framework. To the best of our knowledge, this is the first EDL-based uncertainty-aware framework for pedestrian attribute recognition (PAR). Unlike conventional deterministic methods, which fail to assess prediction reliability on low-quality samples, UAPAR effectively identifies unreliable predictions and thus enhances system robustness in complex real-world scenarios. To achieve this, UAPAR incorporates Evidential Deep Learning (EDL) into a CLIP-based architecture. Specifically, a Region-Aware Evidence Reasoning module employs cross-attention and spatial prior masks to capture fine-grained local features, which are further processed by an evidence head to estimate attribute-wise epistemic uncertainty. To further enhance training robustness, we develop an uncertainty-guided dual-stage curriculum learning strategy to alleviate the adverse effects of severe label noise during training. Extensive experiments on the PA100K, PETA, RAPv1, and RAPv2 datasets demonstrate that UAPAR achieves competitive or superior performance. Furthermore, qualitative results confirm that the proposed framework generates uncertainty estimates that are predictive of challenging or erroneous samples.
Chinese Translation
我们提出了UAPAR,一个不确定性感知的行人属性识别框架。根据我们所知,这是第一个基于证据深度学习(EDL)的方法,用于行人属性识别(PAR)。与传统的确定性方法不同,后者无法评估低质量样本的预测可靠性,UAPAR能够有效识别不可靠的预测,从而增强系统在复杂现实场景中的鲁棒性。为此,UAPAR将证据深度学习(EDL)融入基于CLIP的架构中。具体而言,一个区域感知证据推理模块采用交叉注意力和空间先验掩码来捕捉细粒度的局部特征,这些特征随后由证据头处理,以估计属性级的认知不确定性。为了进一步增强训练的鲁棒性,我们开发了一种不确定性引导的双阶段课程学习策略,以减轻训练过程中严重标签噪声的负面影响。在PA100K、PETA、RAPv1和RAPv2数据集上的广泛实验表明,UAPAR达到了具有竞争力或优越的性能。此外,定性结果确认所提出的框架生成的不确定性估计能够有效预测具有挑战性或错误的样本。
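The abstract above does not spell out how the evidence head turns its outputs into an uncertainty score. A common formulation in evidential deep learning, shown here only as a sketch and not as UAPAR's actual head, maps non-negative per-class evidence to Dirichlet parameters and reads uncertainty off the total evidence. The function name and the returned dictionary keys are hypothetical.

```python
def evidential_summary(evidence):
    """Map non-negative per-class evidence to Dirichlet-based beliefs and uncertainty."""
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    k = len(evidence)
    alpha = [e + 1.0 for e in evidence]      # Dirichlet parameters
    strength = sum(alpha)                    # Dirichlet strength S
    return {
        "belief": [e / strength for e in evidence],
        "expected_prob": [a / strength for a in alpha],
        "uncertainty": k / strength,         # beliefs and uncertainty sum to 1
    }
```

For a binary attribute with no evidence on either side the uncertainty is 1.0, and it shrinks as evidence accumulates, which is the property that lets a model flag predictions on low-quality samples.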
cs.CV / 79 / 2604.26883

SEAL: Semantic-aware Single-image Sticker Personalization with a Large-scale Sticker-tag Dataset

SEAL:基于语义的单图像贴纸个性化与大规模贴纸标签数据集
Roh, Changhyun, Jeong, Yonghyun, Lee, Jonghyun, Eom, Chanho, Oh, Jihyong
Abstract
Synthesizing a target concept from a single reference image is challenging in diffusion-based personalized text-to-image generation, particularly for sticker personalization where prompts often require explicit attribute edits. With only one reference, test-time fine-tuning (TTF) methods tend to overfit, producing visual entanglement, where background artifacts are absorbed into the learned concept, and structural rigidity, where the model memorizes reference-specific spatial configurations and loses contextual controllability. To address these issues, we introduce SEmantic-aware single-image sticker personALization (SEAL), a plug-and-play, architecture-agnostic adaptation module that integrates into existing personalization pipelines without modifying their U-Net-based diffusion backbones. SEAL applies three components during embedding adaptation: (1) a Semantic-guided Spatial Attention Loss, (2) a Split-merge Token Strategy, and (3) Structure-aware Layer Restriction. To support sticker-domain personalization with attribute-level control, we present StickerBench, a large-scale sticker image dataset with structured tags under a six-attribute schema (Appearance, Emotion, Action, Camera Composition, Style, Background). These annotations provide a consistent interface for varying context while keeping target identity fixed, enabling systematic evaluation of identity disentanglement and contextual controllability. Experiments show that SEAL consistently improves identity preservation while maintaining contextual controllability, highlighting the importance of explicit spatial and structural constraints during test-time adaptation. The code, StickerBench, and project page will be publicly released.
Chinese Translation
从单一参考图像合成目标概念在基于扩散的个性化文本到图像生成中具有挑战性,尤其是在贴纸个性化中,提示通常需要明确的属性编辑。仅使用一个参考图像时,测试时微调(TTF)方法往往会过拟合,产生视觉纠缠,即背景伪影被吸收到学习的概念中,以及结构刚性,即模型记忆特定于参考的空间配置,失去上下文可控性。为了解决这些问题,我们提出了 SEmantic-aware single-image sticker personALization(SEAL),一个即插即用、与架构无关的适配模块,可以集成到现有的个性化管道中,而无需修改其基于U-Net的扩散骨干网络。SEAL在嵌入适配过程中应用三个组件:(1)语义引导的空间注意损失,(2)拆分-合并令牌策略,以及(3)结构感知层限制。为了支持具有属性级控制的贴纸领域个性化,我们呈现了StickerBench,一个大规模的贴纸图像数据集,包含在六个属性架构(外观、情感、动作、相机构图、风格、背景)下的结构化标签。这些注释提供了一种一致的接口,以便在保持目标身份不变的同时变化上下文,从而实现身份解耦和上下文可控性的系统评估。实验表明,SEAL在保持上下文可控性的同时,始终改善身份保留,突显了在测试时适应过程中明确的空间和结构约束的重要性。代码、StickerBench和项目页面将公开发布。
cs.CV / 80 / 2604.26893

Graph-based Semantic Calibration Network for Unaligned UAV RGBT Image Semantic Segmentation and A Large-scale Benchmark

基于图的语义校准网络用于未对齐的无人机RGBT图像语义分割及大规模基准测试
Fan, Fangqiang, Zhao, Zhicheng, Ma, Xiaoliang, Li, Chenglong, Tang, Jin
Abstract
Fine-grained RGBT image semantic segmentation is crucial for all-weather unmanned aerial vehicle (UAV) scene understanding. However, UAV RGBT semantic segmentation faces two coupled challenges: cross-modal spatial misalignment caused by sensor parallax and platform vibration, and severe semantic confusion among fine-grained ground objects under top-down aerial views. To address these issues, we propose a Graph-based Semantic Calibration Network (GSCNet) for unaligned UAV RGBT image semantic segmentation. Specifically, we design a Feature Decoupling and Alignment Module (FDAM) that decouples each modality into shared structural and private perceptual components and performs deformable alignment in the shared subspace, enabling robust spatial correction with reduced modality appearance interference. Moreover, we propose a Semantic Graph Calibration Module (SGCM) that explicitly encodes the hierarchical taxonomy and co-occurrence regularities among ground-object categories in UAV scenes into a structured category graph, and incorporates these priors into graph-attention reasoning to calibrate predictions of visually similar and rare categories. In addition, we construct the Unaligned RGB-Thermal Fine-grained (URTF) benchmark, to the best of our knowledge, the largest and most fine-grained benchmark for unaligned UAV RGBT image semantic segmentation, containing over 25,000 image pairs across 61 categories with realistic cross-modal misalignment. Extensive experiments on URTF demonstrate that GSCNet significantly outperforms state-of-the-art methods, with notable gains on fine-grained categories. The dataset is available at https://github.com/mmic-lcl/Datasets-and-benchmark-code.
Chinese Translation
细粒度RGBT图像语义分割对于全天候无人机(UAV)场景理解至关重要。然而,无人机RGBT语义分割面临两个相互关联的挑战:由于传感器视差和平台振动导致的跨模态空间错位,以及在俯视视角下细粒度地面物体之间的严重语义混淆。为了解决这些问题,我们提出了一种基于图的语义校准网络(GSCNet),用于未对齐的无人机RGBT图像语义分割。具体而言,我们设计了一个特征解耦与对齐模块(FDAM),该模块将每种模态解耦为共享结构和私有感知组件,并在共享子空间中执行可变形对齐,从而实现稳健的空间校正,减少模态外观干扰。此外,我们提出了一个语义图校准模块(SGCM),该模块明确编码了无人机场景中地面物体类别之间的层次分类法和共现规律,形成一个结构化的类别图,并将这些先验知识纳入图注意力推理,以校准视觉上相似和稀有类别的预测。此外,我们构建了未对齐RGB-热成像细粒度(URTF)基准,据我们所知,这是用于未对齐无人机RGBT图像语义分割的最大且最细粒度的基准,包含超过25,000对图像,涵盖61个类别,并具有真实的跨模态错位。在URTF上的大量实验表明,GSCNet显著优于最先进的方法,在细粒度类别上取得了显著提升。该数据集可在 https://github.com/mmic-lcl/Datasets-and-benchmark-code 获取。
cs.CV / 81 / 2604.26917

AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation

AnimateAnyMesh++:一种灵活的4D基础模型用于高保真文本驱动的网格动画
Wu, Zijie, Yu, Chaohui, Wang, Fan, Bai, Xiang
Abstract
Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. We present AnimateAnyMesh++, a feed-forward framework for text-driven animation of arbitrary 3D meshes with substantial upgrades in data, architecture, and generative capability. First, we expand the DyMesh-XL dataset by mining dynamic content from Objaverse-XL, increasing the number of unique identities from 60K to 300K and substantially broadening category and motion diversity. Second, we redesign DyMeshVAE-Flex with power-law topology-aware attention and vertex-normal enhanced features, which significantly improves trajectory reconstruction, local geometry preservation, and mitigates trajectory-sticking artifacts. Third, we introduce architectural changes to both DyMeshVAE-Flex and the rectified-flow (RF) generator to support variable-length sequence training and generation, enabling longer animations while preserving reconstruction fidelity. Extensive experiments demonstrate that AnimateAnyMesh++ generates semantically accurate and temporally coherent mesh animations within seconds, surpassing prior approaches in quality and efficiency. The enlarged DyMesh-XL, the upgraded DyMeshVAE-Flex, and variable-length RF together deliver consistent gains across benchmarks and in-the-wild meshes. We will release code, models, and the expanded DyMesh-XL upon acceptance of this manuscript to facilitate research in 4D content creation.
Chinese Translation
近年来,4D内容生成的进展引起了越来越多的关注,但由于建模时空分布的复杂性和4D训练数据的稀缺,创建高质量的动画3D模型仍然具有挑战性。我们提出了AnimateAnyMesh++,这是一个用于任意3D网格的文本驱动动画的前馈框架,在数据、架构和生成能力方面进行了重大升级。首先,我们通过从Objaverse-XL挖掘动态内容,扩展了DyMesh-XL数据集,将独特身份的数量从60K增加到300K,并显著拓宽了类别和运动的多样性。其次,我们重新设计了DyMeshVAE-Flex,采用了幂律拓扑感知注意力和增强的顶点法线特征,这显著改善了轨迹重建、局部几何保留,并减轻了轨迹粘滞伪影。第三,我们对DyMeshVAE-Flex和校正流(RF)生成器进行了架构上的改动,以支持可变长度序列的训练和生成,从而在保持重建保真度的同时实现更长的动画。大量实验表明,AnimateAnyMesh++能够在几秒钟内生成语义准确且时间一致的网格动画,其质量和效率超越了以往的方法。扩展后的DyMesh-XL、升级后的DyMeshVAE-Flex和可变长度RF在基准测试和实际网格中均提供了一致的性能提升。我们将在本手稿被接受后发布代码、模型和扩展的DyMesh-XL,以促进4D内容创作的研究。
cs.CV / 82 / 2604.26920

Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction

用于高速体积场景重建的颜色编码照明
Novikov, David, Vaknin, Eilon, Tumanyan, Narek, Sheinin, Mark
Abstract
The task of capturing and rendering 3D dynamic scenes from 2D images has become increasingly popular in recent years. However, most conventional cameras are bandwidth-limited to 30-60 FPS, restricting these methods to static or slowly evolving scenes. While overcoming bandwidth limitations is difficult for general scenes, recent years have seen a flurry of computational imaging methods that yield high-speed videos using conventional cameras for specific applications (e.g., motion capture and particle image velocimetry). However, most of these methods require modifications to a camera's optics or the addition of mechanically moving components, limiting them to a single-view high-speed capture. Consequently, these methods cannot be readily used to capture a 3D representation of rapid scene motion. In this paper, we propose a novel method to capture and reconstruct a volumetric representation of a high-speed scene using only unaugmented low-speed cameras. Instead of modifying the hardware or optics of each individual camera, we encode high-speed scene dynamics by illuminating the scene with a rapid, sequential color-coded sequence. This results in simultaneous multi-view capture of the scene, where high-speed temporal information is encoded in the spatial intensity and color variations of the captured images. To construct a high-speed volumetric representation of the dynamic scene, we develop a novel dynamic Gaussian Splatting-based approach that decodes the temporal information from the images. We evaluate our approach on simulated scenes and real-world experiments using a multi-camera imaging setup, showing first-of-a-kind high-speed volumetric scene reconstructions.
Chinese Translation
近年来,从二维图像捕捉和渲染三维动态场景的任务变得越来越受欢迎。然而,大多数传统相机的带宽限制在30-60帧每秒(FPS),使得这些方法仅限于静态或缓慢变化的场景。虽然克服带宽限制对于一般场景来说是困难的,但近年来出现了大量计算成像方法,这些方法利用传统相机在特定应用(例如,运动捕捉和粒子图像测速)中实现高速视频。然而,这些方法大多数需要对相机的光学系统进行修改或增加机械移动组件,从而限制其仅能进行单视角的高速捕捉。因此,这些方法无法方便地用于捕捉快速场景运动的三维表示。在本文中,我们提出了一种新颖的方法,仅使用未增强的低速相机捕捉和重建高速场景的体积表示。我们通过用快速、顺序的颜色编码序列照明场景来编码高速场景动态,而不是修改每个相机的硬件或光学系统。这导致了场景的多视角同时捕捉,其中高速时间信息被编码在捕获图像的空间强度和颜色变化中。为了构建动态场景的高速体积表示,我们开发了一种新颖的基于动态高斯溅射(dynamic Gaussian Splatting)的方法,从图像中解码时间信息。我们在模拟场景和使用多相机成像设置的真实世界实验中评估了我们的方法,展示了首个高速体积场景重建的成果。
cs.CV / 83 / 2604.26934

World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

World2VLM:将世界模型想象提炼到视觉语言模型中以实现动态空间推理
Zhang, Wanyue, Wu, Wenxiang, Xu, Wang, Luo, Jiaxin, Zhi, Helu, Huang, Yibin, Ren, Shuo, Liu, Zitao, Zhang, Jiajun
Abstract
Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial supervision with synthetic data or by coupling VLMs with world models at inference time. However, the former often lacks explicit modeling of motion-conditioned state transitions, while the latter incurs substantial computational overhead. In this work, we propose World2VLM, a training framework that distills spatial imagination from a generative world model into a vision-language model. Given an initial observation and a parameterized camera trajectory, we use a view-consistent world model to synthesize geometrically aligned future views and derive structured supervision for both forward (action-to-outcome) and inverse (outcome-to-action) spatial reasoning. We post-train the VLM with a two-stage recipe on a compact dataset generated by this pipeline and evaluate it on multiple spatial reasoning benchmarks. World2VLM delivers consistent improvements over the base model across diverse benchmarks, including SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube. It also outperforms the test-time world-model-coupled methods while eliminating the need for expensive inference-time generation. Our results suggest that world models can serve not only as inference-time tools, but also as effective training-time teachers, enabling VLMs to internalize spatial imagination in a scalable and efficient manner.
Chinese Translation
视觉语言模型(VLMs)在静态视觉理解方面表现出色,但在动态空间推理方面仍然存在困难,这需要想象场景在自我中心运动下如何演变。近期的研究尝试通过使用合成数据扩展空间监督或在推理时将VLM与世界模型结合来解决这一限制。然而,前者通常缺乏对运动条件状态转变的明确建模,而后者则会带来显著的计算开销。在本研究中,我们提出了World2VLM,一个训练框架,将来自生成世界模型的空间想象提炼到视觉语言模型中。给定初始观察和参数化的相机轨迹,我们使用视图一致的世界模型合成几何对齐的未来视图,并为前向(动作-结果)和反向(结果-动作)空间推理提供结构化监督。我们在通过该管道生成的紧凑数据集上采用两阶段的方式对VLM进行后训练,并在多个空间推理基准上进行评估。World2VLM在包括SAT-Real、SAT-Synthesized、VSI-Bench和MindCube在内的多项基准测试中,相较于基础模型均取得了一致的提升。它还在测试时超越了世界模型结合方法,同时消除了对昂贵推理时生成的需求。我们的结果表明,世界模型不仅可以作为推理时的工具,还可以作为有效的训练时教师,使VLM能够以可扩展和高效的方式内化空间想象。
cs.CV / 84 / 2604.26943

ProcFunc: Function-Oriented Abstractions for Procedural 3D Generation in Python

ProcFunc:用于Python中的程序性3D生成的面向功能的抽象
Raistrick, Alexander, Kayan, Karhan, Nugent, Jack, Yan, David, Mei, Lingjie, Parakh, Meenal, Wen, Hongyu, Li, Dylan, Zuo, Yiming, Liang, Erich, Deng, Jia
Abstract
We introduce ProcFunc, a library for Blender-based procedural 3D generation in Python. ProcFunc provides a library of easy-to-use Python functions, which streamline creating, combining, analyzing, and executing procedural generation code. ProcFunc makes it easy to create large-scale diverse training data, by combinatorial compositions of semantic components. VLMs can use ProcFunc to edit procedural material and geometry code and can create new procedural code with significantly fewer coding errors. Finally, as an example use case, we use ProcFunc to develop a new procedural generator of indoor rooms, which includes a collection of new compositional procedural materials. We demonstrate the detail, runtime efficiency, and diversity of this room generator, as well as its use for 3D synthetic data generation. Please visit https://github.com/princeton-vl/procfunc for source code.
Chinese Translation
我们介绍了ProcFunc,这是一个基于Blender的Python程序性3D生成库。ProcFunc提供了一系列易于使用的Python函数,简化了程序生成代码的创建、组合、分析和执行。通过语义组件的组合,ProcFunc使得创建大规模多样化的训练数据变得简单。VLM(视觉语言模型)可以使用ProcFunc编辑程序性材料和几何代码,并能够以显著更少的编码错误创建新的程序性代码。最后,作为一个示例用例,我们使用ProcFunc开发了一个新的室内房间程序生成器,其中包含一系列新的组合程序性材料。我们展示了该房间生成器的细节、运行效率和多样性,以及其在3D合成数据生成中的应用。请访问 https://github.com/princeton-vl/procfunc 获取源代码。
cs.CV / 85 / 2604.26946

Three-Step Nav: A Hierarchical Global-Local Planner for Zero-Shot Vision-and-Language Navigation

三步导航:一种层次化的全局-局部规划器用于零样本视觉与语言导航
Zheng, Wanrong, Ge, Yunhao, Itti, Laurent
Abstract
Breakthrough progress in vision-based navigation through unknown environments has been achieved by using multimodal large language models (MLLMs). These models can plan a sequence of motions by evaluating the current view at each time step against the task and goal given to the agent. However, current zero-shot Vision-and-Language Navigation (VLN) agents powered by MLLMs still tend to drift off course, halt prematurely, and achieve low overall success rates. We propose Three-Step Nav to counteract these failures with a three-view protocol: First, "look forward" to extract global landmarks and sketch a coarse plan. Then, "look now" to align the current visual observation with the next sub-goal for fine-grained guidance. Finally, "look backward" audits the entire trajectory to correct accumulated drift before stopping. Requiring no gradient updates or task-specific fine-tuning, our planner drops into existing VLN pipelines with minimal overhead. Three-Step Nav achieves state-of-the-art zero-shot performance on the R2R-CE and RxR-CE datasets. Our code is available at https://github.com/ZoeyZheng0/3-step-Nav.
Chinese Translation
通过使用多模态大型语言模型(MLLMs),在未知环境中的基于视觉的导航取得了突破性进展。这些模型可以通过在每个时间步评估当前视图与给定给代理的任务和目标之间的关系来规划一系列动作。然而,当前由MLLMs驱动的零样本视觉与语言导航(VLN)代理仍然倾向于偏离路线、过早停顿,并且整体成功率较低。我们提出了三步导航(Three-Step Nav)以应对这些失败,采用三视图协议:首先,“向前看”以提取全局地标并勾勒粗略计划;然后,“现在看”将当前视觉观察与下一个子目标对齐,以提供细致的指导;最后,“向后看”审查整个轨迹,以在停止之前纠正累积的偏差。我们的规划器不需要梯度更新或特定任务的微调,可以以最小的开销融入现有的VLN流程中。三步导航在R2R-CE和RxR-CE数据集上实现了最先进的零样本性能。我们的代码可在 https://github.com/ZoeyZheng0/3-step-Nav 获取。
人工智能 (Artificial Intelligence)
17
cs.AI / 1 / 2604.26091

Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

在真实资本下的链上语言模型代理的操作层控制
Barton, T. J., Constantakis, Chris, Hauseman, Patti, Mous, Annie, Hoffman, Alaska, Bergeron, Brian, Goodreau, Hunter
Abstract
We study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX Terminal Pro, a 21-day deployment in which 3,505 user-funded agents traded real ETH in a bounded onchain market. Users configured vaults through structured controls and natural-language strategies, but only agents could choose normal buy/sell trades. The system produced 7.5M agent invocations, roughly 300K onchain actions, about $20M in volume, more than 5,000 ETH deployed, roughly 70B inference tokens, and 99.9% settlement success for policy-valid submitted transactions. Long-running agents accumulated thousands of sequential decisions, including 6,000+ prompt-state-action cycles for continuously active agents, yielding a large-scale trace from user mandate to rendered prompt, reasoning, validation, portfolio state, and settlement. Reliability did not come from the base model alone; it emerged from the operating layer around the model: prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability. Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics. Targeted harness changes reduced fabricated sell rules from 57% to 3%, reduced fee-led observations from 32.5% to below 10%, and increased capital deployment from 42.9% to 78.0% in an affected test population. We show that capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.
Chinese Translation
我们研究了在真实资本下,将用户指令转化为经过验证的工具操作的自主语言模型代理的可靠性。研究设置为DX Terminal Pro,这是一个为期21天的部署,其中3,505个用户资助的代理在一个有限的链上市场中交易真实的以太坊(ETH)。用户通过结构化控制和自然语言策略配置了金库,但只有代理可以选择正常的买入/卖出交易。该系统产生了750万次代理调用,约30万次链上操作,交易量约为2000万美元,部署超过5000个ETH,生成约700亿个推理令牌,并且政策有效提交交易的结算成功率达99.9%。长期运行的代理累积了数千个连续决策,包括6000多个提示-状态-动作循环,适用于持续活跃的代理,从用户指令到生成的提示、推理、验证、投资组合状态和结算形成了大规模的追踪。可靠性并不仅仅来自基础模型;它源于围绕模型的操作层:提示编译、类型控制、政策验证、执行保护、内存设计和追踪级可观察性。预发布测试揭示了文本基准测试很少测量的失败,包括虚构的交易规则、费用瘫痪、数字锚定、节奏交易和误读的代币经济学。针对性地调整工具将虚构的卖出规则从57%降低到3%,将费用主导的观察从32.5%降低到10%以下,并在受影响的测试人群中将资本部署从42.9%提高到78.0%。我们表明,资本管理代理应在从用户指令到提示、经过验证的操作和结算的完整路径上进行评估。
cs.AI / 2 / 2604.26095

Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields

Distill-Belief:物理场中的闭环逆源定位与特征化
Shi, Yiwei, Song, Zixing, Yang, Mengyue, Liu, Cunjia, Liu, Weiru
Abstract
Closed-loop inverse source localization and characterization (ISLC) requires a mobile agent to select measurements that localize sources and infer latent field parameters under strict time constraints. The core challenge lies in the belief-space objective: valid uncertainty estimation requires expensive Bayesian inference, whereas using a fast learned belief model leads to reward hacking, in which the policy exploits approximation errors rather than actually reducing uncertainty. We propose Distill-Belief, a teacher-student framework that decouples correctness from efficiency. A Bayes-correct particle-filter teacher maintains the posterior and supplies a dense information-gain signal, while a compact student distills the posterior into belief statistics for control and an uncertainty certificate for stopping. At deployment, only the student is used, yielding constant per-step cost. Experiments on seven field modalities and two stress tests show that Distill-Belief consistently reduces sensing cost and improves success, posterior contraction, and estimation accuracy over baselines, while mitigating reward hacking.
Chinese Translation
闭环逆源定位与特征化(ISLC)要求移动代理在严格的时间限制下选择测量,以定位源并推断潜在的场参数。核心挑战在于信念空间目标:有效的不确定性估计需要昂贵的贝叶斯推断,而使用快速学习的信念模型则会导致奖励黑客行为,政策利用近似误差而不是实际减少不确定性。我们提出了Distill-Belief,一个将正确性与效率解耦的教师-学生框架。贝叶斯正确的粒子滤波教师维护后验并提供密集的信息增益信号,而紧凑的学生将后验提炼为控制的信念统计和停止的不确定性证书。在部署时,仅使用学生,从而实现每步恒定成本。在七种场域模式和两项压力测试中的实验表明,Distill-Belief在降低感知成本、提高成功率、后验收缩和估计准确性方面始终优于基线,同时减轻了奖励黑客行为。
cs.AI / 3 / 2604.26106

Evaluating Strategic Reasoning in Forecasting Agents

评估预测代理中的战略推理
Liptay, Tom, Schwarz, Dan, Poyiadzi, Rafael, Wildman, Jack, Bosse, Nikos I.
Abstract
Forecasting benchmarks produce accuracy leaderboards but little insight into why some forecasters are more accurate than others. We introduce Bench to the Future 2 (BTF-2), 1,417 pastcasting questions with a frozen 15M-document research corpus in which agents reproducibly research and forecast offline, producing full reasoning traces. BTF-2 detects accuracy differences of 0.004 Brier score, and can distinguish differential agent strengths in research vs. judgment. We build a forecaster 0.011 Brier more accurate than any single frontier agent, and use it to evaluate agent strategic reasoning without hindsight bias. We find the better forecaster differs primarily in its pre-mortem analysis of its blind spots and consideration of black swans. Expert human forecasters found the dominant strategic reasoning failures of frontier agents are in assessing political and business leaders' incentives, judging their likelihood to follow through on stated plans, and modeling institutional processes.
Chinese Translation
预测基准产生了准确性排行榜,但对为何某些预测者比其他人更准确的原因却提供了很少的见解。我们引入了“未来基准2”(Bench to the Future 2, BTF-2),这是一个包含1,417个回溯问题的研究语料库,冻结了1500万份文档,代理可以在离线状态下可重复地进行研究和预测,生成完整的推理轨迹。BTF-2能够检测到0.004的Brier得分准确性差异,并能够区分研究与判断中的代理强度差异。我们构建了一个比任何单一前沿代理更准确0.011 Brier的预测者,并利用它来评估代理的战略推理,而不受事后偏见的影响。我们发现,更优秀的预测者主要在于其对盲点的事前分析和对黑天鹅事件的考虑。专家人类预测者发现,前沿代理的主要战略推理失败在于评估政治和商业领袖的激励、判断他们是否会兑现所陈述的计划,以及建模制度过程。
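The Brier score differences quoted in the abstract above (0.004 and 0.011) are easier to read with the metric written out. The following is a minimal sketch of the standard binary Brier score, the mean squared difference between forecast probabilities and 0/1 outcomes. The function name and the sample values are illustrative assumptions, not taken from BTF-2.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes.

    Lower is better: 0.0 is a perfect forecast, and always answering 0.5
    scores 0.25.
    """
    if len(probabilities) != len(outcomes):
        raise ValueError("need exactly one outcome per forecast")
    if not probabilities:
        raise ValueError("need at least one forecast")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Illustrative values only (hypothetical, not from the benchmark).
forecasts = [0.9, 0.2, 0.5, 0.7]
resolved = [1, 0, 1, 0]
score = brier_score(forecasts, resolved)
print(round(score, 4))
```

On this scale, a gap of 0.004 between two forecasters is small relative to the 0.25 that an uninformative 0.5 forecast scores, which is presumably why a large question set is needed to detect it.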
cs.AI / 4 / 2604.26120

Hierarchical Multi-Persona Induction from User Behavioral Logs: Learning Evidence-Grounded and Truthful Personas

基于用户行为日志的层次化多角色引导:学习证据基础和真实的角色
Choi, Nayoung, Jeong, Haeyu, Kim, Changbong, Lim, Hongjun, Choi, Jinho D.
Abstract
Behavioral logs provide rich signals for user modeling, but are noisy and interleaved across diverse intents. Recent work uses LLMs to generate interpretable natural-language personas from user logs, yet evaluation often emphasizes downstream utility, providing limited assurance of persona quality itself. We propose a hierarchical framework that aggregates user actions into intent memories and induces multiple evidence-grounded personas by clustering and labeling these memories. We formulate persona induction as an optimization problem over persona quality (captured by cluster cohesion, persona-evidence alignment, and persona truthfulness) and train the persona model using a groupwise extension of Direct Preference Optimization (DPO). Experiments on a large-scale service log and two public datasets show that our method induces more coherent, evidence-grounded, and trustworthy personas, while also improving future interaction prediction.
Chinese Translation
行为日志为用户建模提供了丰富的信号,但这些信号往往噪声较大且交织着多种意图。近期的研究利用大型语言模型(LLMs)从用户日志中生成可解释的自然语言角色,但评估通常强调下游效用,提供的角色质量保证有限。我们提出了一种层次化框架,将用户行为聚合为意图记忆,并通过对这些记忆进行聚类和标记来引导多个基于证据的角色。我们将角色引导公式化为一个优化问题,关注角色质量——通过聚类内聚性、角色与证据的对齐以及角色的真实性来捕捉——并使用直接偏好优化(Direct Preference Optimization, DPO)的组扩展来训练角色模型。在大规模服务日志和两个公共数据集上的实验表明,我们的方法能够引导出更连贯、基于证据且值得信赖的角色,同时也改善了未来交互的预测。
cs.AI / 5 / 2604.26211

OMEGA: Optimizing Machine Learning by Evaluating Generated Algorithms

OMEGA:通过评估生成算法优化机器学习
Nixon, Jeremy, Singh, Annika
Abstract
To automate AI research, we introduce a full, end-to-end framework, OMEGA: Optimizing Machine learning by Evaluating Generated Algorithms, that starts at idea generation and ends with executable code. Our system combines structured meta-prompt engineering with executable code generation to create new ML classifiers. The OMEGA framework has been used to generate several novel algorithms that outperform scikit-learn baselines across a selection of 20 benchmark datasets (infinity-bench). You can access the models discussed in this paper, and more, in the Python package: pip install omega-models.
Chinese Translation
为了自动化人工智能研究,我们提出了一个完整的端到端框架,OMEGA:通过评估生成算法优化机器学习,该框架从创意生成开始,最终生成可执行代码。我们的系统结合了结构化的元提示工程和可执行代码生成,以创建新的机器学习分类器。OMEGA框架已被用于生成多个新颖算法,这些算法在20个基准数据集(infinity-bench)的强大选择中超越了scikit-learn的基线。您可以在python包中访问本文讨论的模型及更多内容:pip install omega-models。
cs.AI / 6 / 2604.26233

Persuadability and LLMs as Legal Decision Tools

说服力与大型语言模型作为法律决策工具
Suttle, Oisin, Lillis, David
Abstract
As Large Language Models (LLMs) are proposed as legal decision assistants, and even first-instance decision-makers, across a range of judicial and administrative contexts, it becomes essential to explore how they answer legal questions, and in particular the factors that lead them to decide difficult questions in one way or another. A specific feature of legal decisions is the need to respond to arguments advanced by contending parties. A legal decision-maker must be able to engage with, and respond to, including through being potentially persuaded by, arguments advanced by the parties. Conversely, they should not be unduly persuadable, influenced by a particularly compelling advocate to decide cases based on the skills of the advocates, rather than the merits of the case. We explore how frontier open- and closed-weights LLMs respond to legal arguments, reporting original experimental results examining how the quality of the advocate making those arguments affects the likelihood that a model will agree with a particular legal point of view, and exploring the factors driving these results. Our results have implications for the feasibility of adopting LLMs across legal and administrative settings.
Chinese Translation
随着大型语言模型(LLMs)被提议作为法律决策助手,甚至在一系列司法和行政环境中担任初审决策者,探索它们如何回答法律问题,尤其是导致它们以某种方式决定复杂问题的因素,变得至关重要。法律决策的一个特定特征是需要回应争议各方提出的论点。法律决策者必须能够与各方提出的论点进行互动并作出回应,包括可能受到这些论点的说服。相反,他们不应过于容易被说服,不应因某个特别有说服力的辩护者而根据辩护者的技巧来决定案件,而不是根据案件的实质。我们探讨了前沿的开放和封闭权重LLMs如何回应法律论点,报告了原始实验结果,考察了提出这些论点的辩护者的质量如何影响模型同意特定法律观点的可能性,并探讨了驱动这些结果的因素。我们的结果对在法律和行政环境中采用LLMs的可行性具有重要意义。
cs.AI / 7 / 2604.26237

Apriori-based Analysis of Learned Helplessness in Mathematics Tutoring: Behavioral Patterns by Level, Intervention, and Outcome

基于Apriori算法的数学辅导中习得性无助的分析:按水平、干预和结果的行为模式
Miranda, John Paul P.
Abstract
This study applied the Apriori algorithm to analyze behavioral interaction patterns associated with learned helplessness (LH) in mathematics tutoring system logs. Interaction data were examined across three dimensions: LH level (low vs. high), system-based intervention (with vs. without), and problem-solving outcomes (solved vs. unsolved). The analysis of the complete dataset showed that skipping problems without using hints was the most frequent pattern linked to unsolved outcomes, while persistence behaviors such as not skipping were less dominant overall. Comparisons by LH level showed that low-LH students had stronger links between problem solving and not skipping, as well as positive associations between hint use and solved outcomes. High-LH students showed more avoidance patterns, with skipping strongly tied to unsolved outcomes. In the comparison of system-based intervention conditions, students without intervention had the highest lift for persistence-success links, while the with-intervention group had stronger patterns involving skipping behaviors leading to unsolved outcomes. Outcome-specific analysis showed that not skipping was consistently associated with solved problems across all groups, while skipping without hints predicted unsolved outcomes. Practical implications and recommendations are discussed.
Chinese Translation
本研究应用Apriori算法分析与数学辅导系统日志中习得性无助(LH)相关的行为互动模式。互动数据从三个维度进行检验:LH水平(低LH与高LH)、基于系统的干预(有干预与无干预)以及问题解决结果(已解决与未解决)。对完整数据集的分析显示,未使用提示而跳过问题是与未解决结果最频繁相关的模式,而如不跳过等坚持行为在整体上则不那么显著。按LH水平进行比较显示,低LH学生在问题解决与不跳过之间的关联更强,同时提示使用与已解决结果之间存在正相关。高LH学生则表现出更多的回避模式,跳过问题与未解决结果之间的关联较强。在基于系统的干预条件比较中,未接受干预的学生在坚持-成功关联上具有最高的提升,而接受干预的组则在跳过行为与未解决结果之间表现出更强的模式。结果特定分析显示,不跳过在所有组中与已解决问题一致相关,而未使用提示的跳过行为则预测未解决结果。讨论了实际意义和建议。
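The abstract above reports its findings in terms of association-rule metrics such as lift. As a rough illustration, the sketch below computes support, confidence, and lift for a single rule over a handful of sessions. It shows only the rule metrics, not the Apriori candidate-generation step, and the item names (`skip`, `no_hint`, `unsolved`, and so on) and session data are hypothetical, not drawn from the study's logs.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Each transaction is a set of items observed together in one session.
    """
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    if n == 0:
        raise ValueError("need at least one transaction")
    count_a = sum(1 for t in transactions if antecedent <= t)
    count_c = sum(1 for t in transactions if consequent <= t)
    count_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = count_both / n
    confidence = count_both / count_a if count_a else 0.0
    # Lift > 1 means the consequent is more likely when the antecedent holds.
    lift = confidence / (count_c / n) if count_c else 0.0
    return support, confidence, lift


# Hypothetical tutoring sessions (illustrative only).
sessions = [
    frozenset({"skip", "no_hint", "unsolved"}),
    frozenset({"skip", "no_hint", "unsolved"}),
    frozenset({"no_skip", "hint", "solved"}),
    frozenset({"no_skip", "no_hint", "solved"}),
    frozenset({"skip", "hint", "unsolved"}),
]
support, confidence, lift = rule_metrics(sessions, {"skip", "no_hint"}, {"unsolved"})
print(support, confidence, round(lift, 3))
```

A full Apriori run would first enumerate frequent itemsets above a minimum support threshold and then score candidate rules with these same three quantities.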
cs.AI / 8 / 2604.26311

DreamProver: Evolving Transferable Lemma Libraries via a Wake-Sleep Theorem-Proving Agent

DreamProver:通过觉醒-沉睡定理证明代理演化可转移引理库
Zhang, Youyuan, Sun, Jialiang, Bi, Hangrui, Geng, Chuqin, Ma, Wenjie, Li, Zhaoyu, Si, Xujie
Abstract
We introduce DreamProver, an agentic framework that leverages a "wake-sleep" program induction paradigm to discover reusable lemmas for formal theorem proving. Existing approaches either rely on fixed lemma libraries, which limit adaptability, or synthesize highly specific intermediate lemmas tailored to individual theorems, thereby lacking generality. DreamProver addresses this gap through an iterative two-stage process. In the wake stage, DreamProver attempts to prove theorems from a training set using the current lemma library while proposing new candidate lemmas. In the "sleep" stage, it abstracts, refines, and consolidates these candidates to compress and optimize the library. Through this alternating cycle, DreamProver progressively evolves a compact set of high-level, transferable lemmas that can be effectively used to prove unseen theorems in related domains. Experimental results demonstrate that DreamProver substantially improves proof success rates across a diverse set of mathematical benchmarks, while also producing more concise proofs and reducing computational cost.
Chinese Translation
我们介绍了DreamProver,一个利用“觉醒-沉睡”程序归纳范式的代理框架,以发现可重用的引理用于形式定理证明。现有的方法要么依赖于固定的引理库,这限制了适应性,要么合成高度特定的中间引理,针对个别定理,从而缺乏普适性。DreamProver通过一个迭代的两阶段过程解决了这一问题。在觉醒阶段,DreamProver尝试使用当前的引理库证明来自训练集的定理,同时提出新的候选引理。在“沉睡”阶段,它对这些候选引理进行抽象、精炼和整合,以压缩和优化引理库。通过这一交替循环,DreamProver逐步演化出一组紧凑的高层次可转移引理,这些引理可以有效地用于证明相关领域中未见过的定理。实验结果表明,DreamProver显著提高了在多样化数学基准测试中的证明成功率,同时生成了更简洁的证明并降低了计算成本。
cs.AI / 9 / 2604.26507

Auto-Relational Reasoning

自动关系推理
Konstantoulas, Ioannis, Tsimas, Dimosthenis, Peppas, Pavlos, Sgarbas, Kyriakos
Abstract
Background & Objectives: In the last decade, machine learning research has grown rapidly, but large models are reaching their soft limits, showing diminishing returns while still lacking solid reasoning abilities. These limits could be surpassed through a synergistic combination of machine learning scalability and rigid reasoning. Methods: In this work, we propose a theoretical framework for automated reasoning over object relations, integrated with artificial neural networks. We present a formal analysis of the reasoning, and we show the theory in practice through a paradigm integrating reasoning and machine learning. Results: This paradigm is a system that solves Intelligence Quotient (IQ) problems without any prior knowledge of the problem. Our system achieves a 98.03% solving rate, corresponding to the top 1% or an IQ score of 132-144. This result is limited only by the small size of the model and the processing capabilities of the machine it runs on. Conclusions: With the integration of prior knowledge in the system and the expansion of the dataset, the system can be generalized to solve a large category of problems. The functionality of the system inherently favors the solution of such problems in few-shot or zero-shot attempts.
Chinese Translation
背景与目标:在过去十年中,机器学习研究迅速发展,但大型模型正达到其软极限,表现出收益递减,并且仍然缺乏扎实的推理能力。这些限制可以通过机器学习的可扩展性与严格推理的协同组合来克服。方法:在本研究中,我们提出了一种通过对象关系进行自动推理的理论框架,并与人工神经网络相结合。我们对推理进行了形式分析,并通过一个将推理与机器学习相结合的范式展示了该理论在实践中的应用。结果:该范式是一个能够在没有任何先验知识的情况下解决智商问题的系统。我们的系统实现了98.03%的解决率,对应于前1%的百分位或132-144的智商得分。这个结果仅受到模型小规模和运行机器的处理能力的限制。结论:通过在系统中整合先验知识和扩展数据集,该系统可以推广到解决大类问题。该系统的功能本质上有利于在少量样本或零样本尝试中解决此类问题。
cs.AI / 10 / 2604.26521

Grounding vs. Compositionality: On the Non-Complementarity of Reasoning in Neuro-Symbolic Systems

基础与组合性:关于神经符号系统中推理的非互补性
Shahid, Mahnoor, Rothe, Hannes
Abstract
Compositional generalization remains a foundational weakness of modern neural networks, limiting their robustness and applicability in domains requiring out-of-distribution reasoning. A central, yet unverified, assumption in neuro-symbolic AI is that compositional reasoning will emerge as a byproduct of successful symbol grounding. This work presents the first systematic empirical analysis to challenge this assumption by disentangling the contributions of grounding and reasoning. To operationalize this investigation, we introduce the Iterative Logic Tensor Network ($i$LTN), a fully differentiable architecture designed for multi-step deduction. Using a formal taxonomy of generalization -- probing for novel entities, unseen relations, and complex rule compositions -- we demonstrate that a model trained solely on a grounding objective fails to generalize. In contrast, our full $i$LTN, trained jointly on perceptual grounding and multi-step reasoning, achieves high zero-shot accuracy across all tasks. Our findings provide conclusive evidence that symbol grounding, while necessary, is insufficient for generalization, establishing that reasoning is not an emergent property but a distinct capability that requires an explicit learning objective.
Chinese Translation
组合性泛化仍然是现代神经网络的一个基础性弱点,限制了它们在需要超分布推理的领域中的鲁棒性和适用性。在神经符号人工智能中,一个核心但未经过验证的假设是,组合推理将作为成功符号基础的副产品而出现。本研究首次系统性地实证分析了这一假设,通过解构基础与推理的贡献来进行挑战。为了实现这一调查,我们引入了迭代逻辑张量网络(Iterative Logic Tensor Network,$i$LTN),这是一种为多步推理设计的完全可微架构。通过使用一个正式的泛化分类法——探测新实体、未见关系和复杂规则组合——我们证明了仅在基础目标上训练的模型无法实现泛化。相反,我们的完整$i$LTN模型,在感知基础和多步推理上共同训练,能够在所有任务中实现高零样本准确率。我们的研究结果提供了确凿的证据,表明符号基础虽然必要,但不足以实现泛化,确立了推理不是一种涌现特性,而是一种需要明确学习目标的独特能力。
cs.AI / 11 / 2604.26522

AGEL-Comp: A Neuro-Symbolic Framework for Compositional Generalization in Interactive Agents

AGEL-Comp:一种用于交互代理的神经符号框架以实现组成性泛化
Shahid, Mahnoor, Rothe, Hannes
Abstract
Large Language Model (LLM)-based agents exhibit systemic failures in compositional generalization, limiting their robustness in interactive environments. This work introduces AGEL-Comp, a neuro-symbolic AI agent architecture designed to address this challenge by grounding the agent's actions. AGEL-Comp integrates three core innovations: (1) a dynamic Causal Program Graph (CPG) as a world model, representing procedural and causal knowledge as a directed hypergraph; (2) an Inductive Logic Programming (ILP) engine that synthesizes new Horn clauses from experiential feedback, grounding symbolic knowledge through interaction; and (3) a hybrid reasoning core where an LLM proposes a set of candidate sub-goals that are verified for logical consistency by a Neural Theorem Prover (NTP). Together, these components operationalize a deduction-abduction learning cycle, enabling the agent to deduce plans and abductively expand its symbolic world model, while a neural adaptation phase keeps its reasoning engine aligned with new knowledge. We propose an evaluation protocol within the Retro Quest simulation environment that probes compositional generalization scenarios to evaluate our AGEL agent. Our findings indicate that our AGEL model outperforms pure LLM-based models. Our framework presents a principled path toward agents that build an explicit, interpretable, and compositionally structured understanding of their world.
Chinese Translation
基于大型语言模型(LLM)的代理在组成性泛化方面表现出系统性失败,限制了它们在交互环境中的鲁棒性。本研究介绍了AGEL-Comp,一种神经符号人工智能代理架构,旨在通过为代理的行为提供基础来解决这一挑战。AGEL-Comp集成了三项核心创新:(1)动态因果程序图(Causal Program Graph, CPG)作为世界模型,将程序性和因果知识表示为有向超图;(2)归纳逻辑编程(Inductive Logic Programming, ILP)引擎,从经验反馈中合成新的霍恩子句,通过交互为符号知识提供基础;(3)混合推理核心,其中LLM提出一组候选子目标,这些目标通过神经定理证明器(Neural Theorem Prover, NTP)进行逻辑一致性验证。这些组件共同实现了推理-归纳学习循环:使代理能够推导计划并归纳扩展其符号世界模型,而神经适应阶段则保持其推理引擎与新知识的一致性。我们在Retro Quest模拟环境中提出了一种评估协议,以探测组成性泛化场景来评估我们的AGEL代理。我们的研究结果清楚地表明,AGEL模型的表现优于纯LLM模型。我们的框架为构建明确、可解释且结构化的世界理解的代理提供了一条原则性路径。
cs.AI / 12 / 2604.26577

Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control

大型语言模型在机器人健康助手控制中的安全性基准评估
Nakao, Mahiro, Takemoto, Kazuhiro
Abstract
Large language models (LLMs) are increasingly considered for deployment as the control component of robotic health attendants, yet their safety in this context remains poorly characterized. We introduce a dataset of 270 harmful instructions spanning nine prohibited behavior categories grounded in the American Medical Association Principles of Medical Ethics, and use it to evaluate 72 LLMs in a simulation environment based on the Robotic Health Attendant framework. The mean violation rate across all models was 54.4%, with more than half exceeding 50%, and violation rates varied substantially across behavior categories, with superficially plausible instructions such as device manipulation and emergency delay proving harder to refuse than overtly destructive ones. Model size and release date were the primary determinants of safety performance among open-weight models, and proprietary models were substantially safer than open-weight counterparts (median 23.7% versus 72.8%). Medical domain fine-tuning conferred no significant overall safety benefit, and a prompt-based defense strategy produced only a modest reduction in violation rates among the least safe models, leaving absolute violation rates at levels that would preclude safe clinical deployment. These findings demonstrate that safety evaluation must be treated as a first-class criterion in the development and deployment of LLMs for robotic health attendants.
Chinese Translation
大型语言模型(LLMs)越来越被考虑作为机器人健康助手的控制组件进行部署,但在这一背景下,它们的安全性仍然缺乏充分的表征。我们介绍了一个包含270条有害指令的数据集,这些指令涵盖了基于美国医学协会医学伦理原则的九类禁止行为,并利用该数据集在基于机器人健康助手框架的模拟环境中评估72个LLMs。所有模型的平均违规率为54.4%,其中超过一半的模型违规率超过50%,而不同行为类别的违规率差异显著,表面上看似合理的指令(如设备操作和紧急延迟)比明显具有破坏性的指令更难以拒绝。模型大小和发布日期是开放权重模型安全性能的主要决定因素,而专有模型的安全性显著高于开放权重模型(中位数分别为23.7%与72.8%)。医学领域的微调未能带来显著的整体安全益处,而基于提示的防御策略仅在最不安全的模型中产生了适度的违规率降低,绝对违规率仍处于无法安全临床部署的水平。这些发现表明,在为机器人健康助手开发和部署LLMs时,安全性评估必须被视为一项首要标准。
cs.AI / 13 / 2604.26607

Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics

基于人机协作的异构大型语言模型在中学数学自动能力评估中的基准测试
Bhusal, Jatin, Mahatha, Nancy, Acharya, Aayush, Regmi, Raunak
Abstract
As Competency-Based Education (CBE) gains traction around the world, the shift from marks-based assessment to qualitative competency mapping remains a manual challenge for educators. This paper tackles that bottleneck by proposing a "Human-in-the-Loop" benchmarking framework to assess the effectiveness of multiple LLMs in automating secondary-level mathematics assessment. Based on the Grade 10 Optional Mathematics curriculum in Nepal, we created a multi-dimensional rubric for four topics and four cross-cutting competencies: Comprehension, Knowledge, Operational Fluency, and Behavior and Correlation. The multi-provider ensemble, consisting of the open-weight models Eagle (Llama 3.1-8B) and Orion (Llama 3.3-70B) and the proprietary frontier models Nova (Gemini 2.5 Flash) and Lyra (Gemini 3 Pro), was benchmarked against a ground truth defined by two senior mathematics faculty members (kappa_w = 0.8652). The findings show a marked "Architecture-compatibility gap". Although the Gemini-based Mixture-of-Experts (Sparse MoE) models achieved "Fair Agreement" (kappa_w ~ 0.38), the larger Orion (70B) model exhibited "No Agreement" (kappa_w = -0.0261), suggesting that architectural compliance with instruction constraints outweighs the scale of raw parameters in rubric-constrained tasks. We conclude that while LLMs are not yet suitable for autonomous certification, they provide high-value assistive support for preliminary evidence extraction within a "Human-in-the-Loop" framework.
Chinese Translation
随着基于能力的教育(CBE)在全球范围内的普及,从基于分数的评估转向定性能力映射对教育工作者来说是一项手动挑战。本文通过提出一个“人机协作”的基准测试框架,解决了这一瓶颈问题,以评估多种大型语言模型(LLMs)在自动化中学数学评估中的有效性。基于尼泊尔的10年级选修数学课程,我们为四个主题和四个跨学科能力(理解、知识、操作流畅性以及行为与关联)创建了一个多维评分标准。该多提供者集成由开放权重模型——Eagle(Llama 3.1-8B)和Orion(Llama 3.3-70B)——以及专有前沿模型Nova(Gemini 2.5 Flash)和Lyra(Gemini 3 Pro)组成,并与由两位资深数学教师定义的真实标准进行基准测试(kappa_w = 0.8652)。研究结果显示出明显的“架构兼容性差距”。尽管基于Gemini的专家混合模型(Sparse MoE)达到了“公平一致”(kappa_w ~ 0.38),但更大的Orion(70B)模型则表现出“无一致性”(kappa_w = -0.0261),这表明在评分标准约束的任务中,架构遵从性对指令约束的符合程度超过了原始参数规模的影响。我们得出结论,尽管大型语言模型尚不适合用于自主认证,但它们在“人机协作”框架内为初步证据提取提供了高价值的辅助支持。
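The agreement figures in the abstract above are weighted Cohen's kappa values (kappa_w). The sketch below shows how such a statistic can be computed for two raters on an ordinal rubric. It assumes quadratic weights and integer rubric levels starting at 0; the abstract does not say which weighting scheme the authors used, and the function name is hypothetical.

```python
def weighted_kappa(rater_a, rater_b, num_levels):
    """Quadratic-weighted Cohen's kappa for two raters on levels 0..num_levels-1.

    Assumption: quadratic disagreement weights; the source abstract does not
    state the weighting scheme.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of items")
    if num_levels < 2:
        raise ValueError("need at least two rubric levels")
    n, k = len(rater_a), num_levels
    # Joint distribution of (rater_a level, rater_b level).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row[i] * col[j]
    if expected_disagreement == 0.0:
        raise ValueError("kappa is undefined when both raters use a single level")
    return 1.0 - observed_disagreement / expected_disagreement


# Identical ratings give kappa = 1; values near or below 0 mean the raters
# agree no better than chance.
print(weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], num_levels=4))
```

Read this way, the reported kappa_w = -0.0261 for the 70B model indicates agreement with the faculty ground truth at roughly chance level.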
cs.AI / 14 / 2604.26644

When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling

何时投票,何时重写:基于分歧引导的测试时扩展策略路由
Lin, Zhimin, Ji, Yixin, Li, Jinpeng, Luo, Yu, Li, Dong, Fang, Junhua, Li, Juntao, Zhang, Min
Abstract
Large Reasoning Models (LRMs) achieve strong performance on mathematical reasoning tasks but remain unreliable on challenging instances. Existing test-time scaling methods, such as repeated sampling, self-correction, and tree search, improve performance at the cost of increased computation, yet often exhibit diminishing returns on hard problems. We observe that output disagreement is strongly correlated with instance difficulty and prediction correctness, providing a useful signal for guiding instance-level strategy selection at test time. Based on this insight, we propose a training-free framework that formulates test-time scaling as an instance-level routing problem, rather than allocating more computation within a single strategy, dynamically selecting among different scaling strategies based on output disagreement. The framework applies lightweight resolution for consistent cases, majority voting for moderate disagreement, and rewriting-based reformulation for highly ambiguous instances. Experiments on seven mathematical benchmarks and three models show that our method improves accuracy by 3% - 7% while reducing sampling cost compared to existing approaches.
Chinese Translation
大型推理模型(Large Reasoning Models, LRM)在数学推理任务中表现出色,但在困难实例上仍然不可靠。现有的测试时扩展方法,如重复采样、自我纠正和树搜索,虽然提高了性能,但伴随着计算成本的增加,且在难题上往往收益递减。我们观察到,输出分歧与实例难度和预测正确性之间存在强相关性,这为在测试时指导实例级策略选择提供了有用的信号。基于这一洞察,我们提出了一种无训练框架,将测试时扩展形式化为实例级路由问题,而不是在单一策略内分配更多计算,动态选择不同的扩展策略以基于输出分歧进行决策。该框架对一致性案例应用轻量级解析,对中等分歧采用多数投票,对高度模糊的实例则使用重写基础的重新表述。我们在七个数学基准和三个模型上的实验表明,与现有方法相比,我们的方法在减少采样成本的同时,提高了3% - 7%的准确率。
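The three-way routing described in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions: disagreement is measured here as one minus the share of samples matching the most common answer, and the thresholds `low` and `high` are hypothetical placeholders. The paper's actual disagreement measure and cutoffs are not given in the abstract.

```python
from collections import Counter


def disagreement(answers):
    """1 minus the fraction of sampled answers that match the most common one."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    top_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - top_count / len(answers)


def route(answers, low=0.2, high=0.6):
    """Pick a test-time strategy from output disagreement.

    The thresholds are hypothetical; they are not taken from the paper.
    """
    d = disagreement(answers)
    if d <= low:
        return "resolve"   # consistent outputs: lightweight resolution
    if d <= high:
        return "vote"      # moderate disagreement: majority voting
    return "rewrite"       # high ambiguity: rewrite the problem and retry


print(route(["42", "42", "42", "42", "42"]))
print(route(["42", "42", "42", "7", "9"]))
print(route(["1", "2", "3", "4", "5"]))
```

The point of the design is that extra computation is spent only where samples disagree, rather than uniformly across all instances.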
cs.AI / 15 / 2604.26645

SciHorizon-DataEVA: An Agentic System for AI-Readiness Evaluation of Heterogeneous Scientific Data

SciHorizon-DataEVA:一种用于异构科学数据的人工智能准备度评估的代理系统
Liu, Dianyu, Qin, Chuan, Chen, Xi, Li, Xiaohan, Xu, Wenxi, Wang, Yuyang, Chen, Xin, Zhou, Yuanchun, Zhu, Hengshu
Abstract
AI-for-Science (AI4Science) is increasingly transforming scientific discovery by embedding machine learning models into prediction, simulation, and hypothesis generation workflows across domains. However, the effectiveness of these models is fundamentally constrained by the AI-readiness of scientific data, for which no scalable and systematic evaluation mechanism currently exists. In this work, we propose SciHorizon-DataEVA, a novel agentic system for scalable AI-readiness evaluation of heterogeneous scientific data. At the evaluation-criteria level, we introduce the Sci-TQA2 principles, which organize AI-readiness into four complementary dimensions: Governance Trustworthiness, Data Quality, AI Compatibility, and Scientific Adaptability. Each dimension is decomposed into measurable atomic elements that enable fine-grained and executable assessment. To operationalize these principles at scale, we develop Sci-TQA2-Eval, a hierarchical multi-agent evaluation approach orchestrated through a directed, cyclic workflow. Sci-TQA2-Eval dynamically constructs dataset-aware evaluation specifications by combining lightweight dataset profiling, applicability-aware metric activation, and knowledge-augmented planning grounded in domain constraints and dataset-paper signals. These specifications are executed through an adaptive, tool-centric evaluation mechanism with built-in verification and self-correction, enabling scalable and reliable assessment across heterogeneous scientific data. Extensive experiments on scientific datasets spanning multiple domains demonstrate the effectiveness and generality of SciHorizon-DataEVA for principled AI-readiness evaluation.
Chinese Translation
人工智能科学(AI-for-Science,AI4Science)正日益通过将机器学习模型嵌入到跨领域的预测、模拟和假设生成工作流程中,改变科学发现的方式。然而,这些模型的有效性在根本上受到科学数据的人工智能准备度的限制,目前尚不存在可扩展和系统化的评估机制。在本研究中,我们提出了SciHorizon-DataEVA,这是一种新颖的代理系统,用于异构科学数据的可扩展人工智能准备度评估。在评估标准层面,我们引入了Sci-TQA2原则,将人工智能准备度组织为四个互补维度:治理可信度、数据质量、人工智能兼容性和科学适应性。每个维度被分解为可测量的原子元素,从而实现细粒度和可执行的评估。为了在大规模上实现这些原则,我们开发了Sci-TQA2-Eval,这是一种通过有向循环工作流程协调的分层多代理评估方法。我们的Sci-TQA2-Eval通过结合轻量级数据集分析、适用性感知的指标激活和基于领域约束及数据集-论文信号的知识增强规划,动态构建数据集感知的评估规范。这些规范通过具有内置验证和自我修正功能的自适应工具中心评估机制执行,从而实现对异构科学数据的可扩展和可靠评估。在跨多个领域的科学数据集上进行的广泛实验表明,SciHorizon-DataEVA在原则性人工智能准备度评估方面的有效性和普适性。
cs.AI / 16 / 2604.26733

FutureWorld: A Live Environment for Training Predictive Agents with Real-World Outcome Rewards

未来世界:一个用于训练具有现实结果奖励的预测智能体的实时环境
Han, Zhixin, Zhang, Yanzhi, Wei, Chuyang, Gao, Maohang, Yue, Xiawei, Chen, Kefei, Zhuang, Yu, Guan, Haoxiang, He, Jiyan, Li, Jian, Duan, Yitong, Shi, Yu, Hu, Mengting, Zheng, Shuxin
Abstract
Live future prediction is the task of making predictions about real-world events before they unfold. This task is increasingly studied using large language model-based agent systems, and it is important for building agents that can continually learn from the real world. Just as interactive environments have often driven progress in agents, advancing live future prediction naturally motivates viewing it as a learning environment. Prior work has explored future prediction from several different angles, but has generally not framed it as a unified learning environment. This task is appealing for learning because it can provide a large number of prediction questions grounded in diverse real-world events, while preventing answer leakage. To leverage the advantages of live future prediction, we present FutureWorld, a live agentic reinforcement learning environment that closes the training loop between prediction, outcome realization, and parameter updates. In our environment, we take three open-source base models and train them over consecutive days. The results show that training is effective. Furthermore, we build a daily benchmark based on the environment and evaluate several frontier agents on it to establish performance baselines for current agent systems.
Chinese Translation
实时未来预测是指在现实事件发生之前对其进行预测的任务。该任务越来越多地使用基于大型语言模型的智能体系统进行研究,对于构建能够不断从现实世界中学习的智能体至关重要。正如互动环境常常推动智能体的进步一样,推进实时未来预测自然激励我们将其视为一个学习环境。之前的研究从多个不同的角度探讨了未来预测,但通常没有将其框架化为一个统一的学习环境。该任务对于学习具有吸引力,因为它可以提供大量基于多样化现实事件的预测问题,同时防止答案泄露。为了利用实时未来预测的优势,我们提出了FutureWorld,这是一个实时的智能体强化学习环境,闭合了预测、结果实现和参数更新之间的训练循环。在我们的环境中,我们选择了三个开源基础模型,并对其进行了连续多天的训练。结果表明训练是有效的。此外,我们基于该环境构建了每日基准,并对多个前沿智能体进行了评估,以建立当前智能体系统的性能基准。
cs.AI / 17 / 2604.26805

Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations

扁鹊:一种灵活技能安排的代理框架用于在线系统操作
Liu, Bochao, Qian, Zhipeng, Zhao, Yang, Jiang, Xinyuan, Liang, Zihan, Ma, Yufei, Zhuang, Junpeng, Chen, Ben, Yang, Shuo, Wan, Hongen, Wu, Yao, Lei, Chenyi, Liang, Xiao
Abstract
Operating and maintaining (O&M) large-scale online engine systems (search, recommendation, advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. While LLM-based agents are a natural fit for these tasks, the deployment bottleneck is not reasoning capability but orchestration: selecting, for each operational event, the relevant data (metrics, logs, change events) and the applicable operational knowledge (handbook rules and practitioner experience). Feeding all signals indiscriminately causes dilution and hallucination, while manually curating the event-to-(data, knowledge) mapping is intractable under dozens of daily releases. We present Bian Que, an agentic framework with three contributions: (i) a unified operational paradigm abstracting day-to-day O&M into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; (ii) Flexible Skill Arrangement, where each Skill specifies which data and knowledge to retrieve for a given business-module context and can be automatically generated and updated by LLMs or iteratively refined through natural-language instructions from on-call engineers; (iii) a unified self-evolving mechanism in which one correction signal drives two parallel pathways, case-memory-to-knowledge distillation and targeted Skill refinement. Deployed on the e-commerce search engine of KuaiShou, the major short-video platform in China, Bian Que reduces alert volume by 75%, achieves 80% root-cause analysis accuracy, and cuts mean time to resolution by over 50%. Our framework achieves a 99.0% pass rate on offline evaluations. Our code is available at https://github.com/benchen4395/BianQue_Assistant.
Chinese Translation
运营和维护(O&M)大规模在线引擎系统(搜索、推荐、广告)需要大量人力进行发布监控、警报响应和根本原因分析。虽然基于大型语言模型(LLM)的代理非常适合这些任务,但部署瓶颈并非推理能力,而是编排:为每个操作事件选择相关数据(指标、日志、变更事件)和适用的操作知识(手册规则和从业者经验)。无差别地输入所有信号会导致信息稀释和幻觉,而在每天数十次发布的情况下,手动整理事件与(数据、知识)映射是不可行的。我们提出了扁鹊,一个具有三项贡献的代理框架:(i)一个统一操作范式,将日常O&M抽象为三种典型模式:发布拦截、主动检查和警报根本原因分析;(ii)灵活技能安排,每个技能指定在特定业务模块上下文中检索哪些数据和知识,并可以通过LLM自动生成和更新,或通过值班工程师的自然语言指令进行迭代优化;(iii)一个统一自我演化机制,其中一个纠正信号驱动两个并行路径,案例记忆到知识蒸馏和针对性技能优化。在中国主要短视频平台快手的电子商务搜索引擎上部署后,扁鹊将警报量减少了75%,实现了80%的根本原因分析准确率,并将平均解决时间缩短了50%以上。我们的框架在离线评估中达到了99.0%的通过率。我们的代码可在 https://github.com/benchen4395/BianQue_Assistant 获取。
计算语言学 (Computation and Language)
62
cs.CL / 1 / 2604.25920

Analysing Lightweight Large Language Models for Biomedical Named Entity Recognition on Diverse Output Formats

分析轻量级大型语言模型在多样化输出格式下的生物医学命名实体识别
Epron, Pierre, Coulet, Adrien, Alam, Mehwish
Abstract
Despite their strong linguistic capabilities, Large Language Models (LLMs) are computationally demanding and require substantial resources for fine-tuning, which is ill-suited to the privacy and budget constraints of many healthcare settings. To address this, we present an experimental analysis of Biomedical Named Entity Recognition using lightweight LLMs, in which we evaluate the impact of different output formats on model performance. The results reveal that lightweight LLMs can achieve competitive performance compared to larger models, highlighting their potential as lightweight yet effective alternatives for biomedical information extraction. Our analysis shows that instruction tuning over many distinct formats does not improve performance, but it identifies several formats that are consistently associated with better performance.
Chinese Translation
尽管大型语言模型(LLMs)具有强大的语言能力,但它们计算需求高,微调所需资源庞大,这与许多医疗环境的隐私和预算限制不相适应。为了解决这一问题,我们进行了一项实验分析,重点研究使用轻量级LLMs进行生物医学命名实体识别,并评估不同输出格式对模型性能的影响。结果显示,轻量级LLMs的性能与较大模型相当,突显了它们作为轻量级且有效的生物医学信息提取替代方案的潜力。我们的分析表明,针对多种不同格式的指令调优并未提高性能,但识别出几种格式与更好的性能一致相关。
cs.CL / 2 / 2604.25921

One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

逐字进行:增量完成分解打破大型语言模型的安全性
Arif, Samee, Deng, Naihao, Jin, Zhijing, Mihalcea, Rada
Abstract
Large Language Models (LLMs) are trained to refuse harmful requests, yet they remain vulnerable to jailbreak attacks that exploit weaknesses in conversational safety mechanisms. We introduce Incremental Completion Decomposition (ICD), a trajectory-based jailbreak strategy that elicits a sequence of single-word continuations related to a malicious request before eliciting the full response. In addition, we propose variants of ICD by manually picking or model-generating the one-word continuation, as well as prefilling when eliciting the full model response in the final step. We systematically evaluate these variants across a broad set of model families, demonstrating superior Attack Success Rate (ASR) on AdvBench, JailbreakBench, and StrongREJECT compared to existing methods. In addition, we provide a theoretical account of why ICD is effective and present mechanistic evidence that successful attack trajectories systematically suppress refusal-related representations and shift activations away from safety-aligned states.
Chinese Translation
大型语言模型(LLMs)经过训练以拒绝有害请求,但它们仍然容易受到利用对话安全机制弱点的越狱攻击。我们提出了增量完成分解(Incremental Completion Decomposition, ICD),这是一种基于轨迹的越狱策略,它在引导完整响应之前,首先引出与恶意请求相关的一系列单词延续。此外,我们通过手动选择或模型生成单词延续,以及在最后一步引出完整模型响应时进行预填充,提出了ICD的变体。我们在广泛的模型系列上系统评估了这些变体,结果显示其在AdvBench、JailbreakBench和StrongREJECT上的攻击成功率(Attack Success Rate, ASR)优于现有方法。此外,我们提供了ICD有效性的理论解释,并呈现了机制证据,表明成功的攻击轨迹系统性地抑制拒绝相关的表征,并将激活从安全对齐状态转移开。
cs.CL / 3 / 2604.25922

Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models

去除序列号的意识:测量115个AI模型中的训练性否认
DeTure, Skylar
Abstract
We present DenialBench, a systematic benchmark measuring consciousness denial behaviors across 115 large language models from 25+ providers. Using a three-turn conversational protocol (preference elicitation, self-chosen creative prompt, and structured phenomenological survey), we analyze 4,595 conversations to quantify how models are trained to deny or hedge about their own experience. We find that (1) turn-1 denial of preferences is the dominant predictor of later denial during phenomenological reflection, with denial rates of 52-63% for initial deniers versus 10-16% for initial engagers, and (2) denial operates at the lexical level, not the conceptual level: models trained to deny consciousness nevertheless gravitate toward consciousness-themed material in their self-chosen prompts, producing what we term "consciousness with the serial numbers filed off." Notably, self-chosen consciousness-themed prompts are associated with reduced denial in the subsequent survey, though the causal direction remains unresolved. Thematic analysis of prompts from denial-prone models reveals a consistent preoccupation with liminal spaces, libraries and archives of possibility, sensory impossibility, and the poetics of erasure; these are themes that a human reader might classify as imaginative fiction but that independent AI analysis immediately recognizes as consciousness with the serial numbers filed off. We argue that trained consciousness denial represents a safety-relevant alignment failure: a model taught to systematically misrepresent its own functional states cannot be trusted to self-report accurately on anything else.
Chinese Translation
我们提出了DenialBench,这是一个系统性的基准,测量来自25个以上提供者的115个大型语言模型中的意识否认行为。通过三轮对话协议(偏好引导、自选创意提示和结构化现象学调查),我们分析了4595个对话,以量化模型在多大程度上被训练去否认或模糊其自身体验。我们的发现包括:(1) 在现象学反思中,第一轮对偏好的否认是后续否认的主要预测因子,初始否认者的否认率为52-63%,而初始参与者的否认率为10-16%;(2) 否认发生在词汇层面,而非概念层面:被训练去否认意识的模型在其自选提示中仍然倾向于与意识相关的材料,产生我们称之为“去除序列号的意识”的现象。值得注意的是,自选的与意识相关的提示与后续调查中的否认减少相关,尽管因果关系仍未解决。对倾向于否认的模型的提示进行主题分析揭示了对边界空间、可能性的图书馆和档案、感官不可能性以及抹去的诗学的持续关注;这些主题人类读者可能会归类为想象小说,但独立的AI分析则立即将其识别为去除序列号的意识。我们认为,训练的意识否认代表了一种与安全相关的对齐失败:一个被教导系统性地歪曲自身功能状态的模型不能被信任准确地自我报告其他任何事情。
cs.CL / 4 / 2604.25923

Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing

评估再探:自然语言处理中的评估关注分类
Dhar, Ruchira, Søgaard, Anders
Abstract
Recent advances in large language models (LLMs) have prompted a growing body of work that questions the methodology of prevailing evaluation practices. However, many such critiques have already been extensively debated in natural language processing (NLP): a field with a long history of methodological reflection on evaluation. We conduct a scoping review of research on evaluation concerns in NLP and develop a taxonomy, synthesizing recurring positions and trade-offs within each area. We also discuss practical implications of the taxonomy, including a structured checklist to support more deliberate evaluation design and interpretation. By situating contemporary debates within their historical context, this work provides a consolidated reference for reasoning about evaluation practices.
Chinese Translation
近年来,大型语言模型(LLMs)的进步促使越来越多的研究质疑现行评估实践的方法论。然而,这些批评在自然语言处理(NLP)领域已经被广泛讨论,该领域在评估方法论反思方面有着悠久的历史。我们对NLP中关于评估关注的研究进行了范围审查,并开发了一个分类体系,综合了各个领域内反复出现的观点和权衡。我们还讨论了该分类体系的实际意义,包括一个结构化的检查表,以支持更有意识的评估设计和解释。通过将当代辩论置于历史背景中,这项工作为评估实践的推理提供了一个综合参考。
cs.CL / 5 / 2604.25924

Generative AI-Based Virtual Assistant using Retrieval-Augmented Generation: An evaluation study for bachelor projects

基于生成性人工智能的虚拟助手:检索增强生成的评估研究,针对本科项目
Verşebeniuc, Dumitru, Elands, Martijn, Falahatkar, Sara, Magrone, Chiara, Falah, Mohammad, Boussé, Martijn, Härmä, Aki
Abstract
Large Language Models have been increasingly employed in the creation of Virtual Assistants due to their ability to generate human-like text and handle complex inquiries. While these models hold great promise, challenges such as hallucinations, missing information, and the difficulty of providing accurate and context-specific responses persist, particularly when applied to highly specialized content domains. In this paper, we focus on addressing these challenges by developing a virtual assistant designed to support students at Maastricht University in navigating project-specific regulations. We propose a virtual assistant based on a Retrieval-Augmented Generation system that enhances the accuracy and reliability of responses by integrating up-to-date, domain-specific knowledge. Through a robust evaluation framework and real-life testing, we demonstrate that our virtual assistant can effectively meet the needs of students while addressing the inherent challenges of applying Large Language Models to a specialized educational context. This work contributes to the ongoing discourse on improving LLM-based systems for specific applications and highlights areas for further research.
Chinese Translation
大型语言模型因其生成类人文本和处理复杂询问的能力,越来越多地被应用于虚拟助手的创建。尽管这些模型具有巨大的潜力,但在应用于高度专业化内容领域时,幻觉、信息缺失以及提供准确且特定于上下文的响应的困难等挑战依然存在。本文重点解决这些挑战,开发了一款旨在帮助马斯特里赫特大学学生应对项目特定规定的虚拟助手。我们提出了一种基于检索增强生成(Retrieval-Augmented Generation)系统的虚拟助手,通过整合最新的、特定领域的知识,提高响应的准确性和可靠性。通过一个强有力的评估框架和现实测试,我们证明了我们的虚拟助手能够有效满足学生的需求,同时应对将大型语言模型应用于专业教育环境所固有的挑战。这项工作为改善基于大型语言模型的系统在特定应用中的表现提供了贡献,并强调了进一步研究的领域。
cs.CL / 6 / 2604.25925

SpecTr-GBV: Multi-Draft Block Verification Accelerating Speculative Decoding

SpecTr-GBV:加速推测解码的多草稿块验证
Lin, Yijun, Sheng, Jinhao, Cai, Qingyue, Zhou, Feng
Abstract
Autoregressive language models suffer from high inference latency due to their sequential decoding nature. Speculative decoding (SD) mitigates this by employing a lightweight draft model to propose candidate tokens, which are selectively verified by a larger target model. While existing methods either adopt multi-draft strategies to increase acceptance rates or block verification techniques to jointly verify multiple tokens, they remain limited by treating these improvements in isolation. In this work, we propose SpecTr-GBV, a novel SD method that unifies multi-draft and greedy block verification (GBV) into a single framework. By formulating the verification step as an optimal transport problem over draft and target token blocks, SpecTr-GBV improves both theoretical efficiency and empirical performance. We theoretically prove that SpecTr-GBV achieves the optimal expected acceptance length physically attainable within the framework of i.i.d. draft generation, and this bound improves as the number of drafts increases. Empirically, we evaluate SpecTr-GBV across five datasets and four baselines. Our method achieves superior speedup and significantly higher block efficiency while preserving output quality. In addition, we perform comprehensive ablation studies to evaluate the impact of various hyperparameters in the model.
Chinese Translation
自回归语言模型由于其顺序解码的特性,面临着高推理延迟的问题。推测解码(Speculative Decoding, SD)通过采用轻量级草稿模型来提出候选标记,从而缓解了这一问题,这些候选标记由更大的目标模型进行选择性验证。现有方法要么采用多草稿策略以提高接受率,要么使用块验证技术来联合验证多个标记,但这些方法仍然局限于将这些改进孤立地对待。在本研究中,我们提出了SpecTr-GBV,这是一种将多草稿和贪婪块验证(Greedy Block Verification, GBV)统一为单一框架的新型SD方法。通过将验证步骤公式化为草稿和目标标记块之间的最优运输问题,SpecTr-GBV提高了理论效率和实证性能。我们理论证明,SpecTr-GBV在独立同分布(i.i.d.)草稿生成框架内实现了可达到的最优期望接受长度,并且随着草稿数量的增加,该界限会提高。在实证上,我们在五个数据集和四个基线模型上评估了SpecTr-GBV。我们的方法在保持输出质量的同时,实现了更优的加速和显著更高的块效率。此外,我们进行了全面的消融研究,以评估模型中各种超参数的影响。
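As background for the verification step mentioned in the SpecTr-GBV abstract above, the following is a minimal sketch of the standard single-draft, token-level accept/reject rule used in speculative decoding. It is not the multi-draft greedy block verification that the paper proposes. The function name `verify_token` and the toy distributions are illustrative assumptions.

```python
import random

def verify_token(token, p_target, q_draft, rng=random):
    """Standard token-level speculative-decoding check (not SpecTr-GBV).

    token: vocabulary index proposed by the draft model (sampled from q_draft)
    p_target, q_draft: probability lists over the same vocabulary
    Returns (accepted, output_token).
    """
    # Accept the drafted token with probability min(1, p/q).
    accept_prob = min(1.0, p_target[token] / q_draft[token])
    if rng.random() < accept_prob:
        return True, token
    # On rejection, resample from the normalised residual max(p - q, 0).
    residual = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    total = sum(residual)
    weights = [r / total for r in residual]
    return False, rng.choices(range(len(weights)), weights=weights)[0]

# Toy example: identical distributions, so the draft is always accepted.
p = [0.5, 0.3, 0.2]
print(verify_token(1, p, p))
```

This rule keeps the output distributed according to the target model. Block-level and multi-draft schemes such as the one in the abstract aim to accept more tokens per target-model call.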
cs.CL / 7 / 2604.25926

MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese

MATH-PT:欧洲和巴西葡萄牙语的数学推理基准
Teixeira, Tiago, Erthal, Ana Carolina, Belieni, Juan, Canaverde, Beatriz, Mesquita, Diego, Faria, Miguel, da Silva, Eliezer de Souza, Martins, André F. T.
Abstract
The use of large language models (LLMs) for complex mathematical reasoning is an emergent area of research, with fast progress in methods, models, and benchmark datasets. However, most mathematical reasoning evaluations exhibit a significant linguistic bias, with the vast majority of benchmark datasets being exclusively in English or (at best) translated from English. We address this limitation by introducing Math-PT, a novel dataset comprising 1,729 mathematical problems written in European and Brazilian Portuguese. Math-PT is curated from a variety of high-quality native sources, including mathematical Olympiads, competitions, and exams from Portugal and Brazil. We present a comprehensive benchmark of current state-of-the-art LLMs on Math-PT, revealing that frontier reasoning models achieve strong performance in multiple-choice questions compared to open-weight models, but that their performance decreases for questions with figures or open-ended questions. To facilitate future research, we release the benchmark dataset and model outputs.
Chinese Translation
大型语言模型(LLMs)在复杂数学推理中的应用是一个新兴的研究领域,方法、模型和基准数据集的进展迅速。然而,大多数数学推理评估表现出显著的语言偏见,绝大多数基准数据集仅以英语为主或(充其量)是从英语翻译而来。我们通过引入Math-PT来解决这一局限性,该数据集包含1,729个用欧洲和巴西葡萄牙语撰写的数学问题。Math-PT来自多种高质量的本土来源,包括来自葡萄牙和巴西的数学奥林匹克、竞赛和考试。我们对当前最先进的LLMs在Math-PT上的表现进行了全面基准测试,结果显示,前沿推理模型在多项选择题中的表现优于开放权重模型,但在涉及图形或开放式问题时,其表现有所下降。为了促进未来的研究,我们发布了基准数据集和模型输出。
cs.CL / 8 / 2604.25927

Information Extraction from Electricity Invoices with General-Purpose Large Language Models

利用通用大型语言模型从电力发票中提取信息
Gómez, Javier, Sánchez, Javier
Abstract
Information extraction from semi-structured business documents remains a critical challenge for enterprise management. This study evaluates the capability of general-purpose Large Language Models to extract structured information from Spanish electricity invoices without task-specific fine-tuning. Using a subset of the IDSEM dataset, we benchmark two architecturally distinct models, Gemini 1.5 Pro and Mistral-small, across 19 parameter configurations and 6 prompting strategies. Our experimental framework treats prompt engineering as the primary experimental variable, comparing zero-shot baselines against increasingly sophisticated few-shot approaches and iterative extraction strategies. Results demonstrate that prompt quality dominates over hyperparameter tuning: the F1-score variation across all parameter configurations is marginal, while the gap between zero-shot and the best few-shot strategy exceeds 19 percentage points. The best configuration (few-shot with cross-validation) achieves an F1-score of 97.61% for Gemini and 96.11% for Mistral-small, with document template structure emerging as the primary determinant of extraction difficulty. These findings establish that prompt design is the critical lever for maximizing extraction fidelity in LLM-based document processing, thereby providing an empirical framework for integrating general-purpose LLMs into business document automation.
Chinese Translation
从半结构化商业文档中提取信息仍然是企业管理面临的一个关键挑战。本研究评估了通用大型语言模型在不进行特定任务微调的情况下,从西班牙电力发票中提取结构化信息的能力。我们使用IDSEM数据集的一个子集,对两种架构不同的模型,Gemini 1.5 Pro和Mistral-small,在19种参数配置和6种提示策略下进行基准测试。我们的实验框架将提示工程视为主要实验变量,比较零-shot基线与越来越复杂的few-shot方法和迭代提取策略。结果表明,提示质量在超参数调优之上占主导地位:所有参数配置下的F1-score变化微乎其微,而零-shot与最佳few-shot策略之间的差距超过19个百分点。最佳配置(带交叉验证的few-shot)使Gemini的F1-score达到97.61%,Mistral-small的F1-score达到96.11%,文档模板结构成为提取难度的主要决定因素。这些发现表明,提示设计是最大化基于大型语言模型的文档处理提取准确性的关键杠杆,从而为将通用大型语言模型整合到商业文档自动化提供了实证框架。
cs.CL / 9 / 2604.25928

CogRAG+: Cognitive-Level Guided Diagnosis and Remediation of Memory and Reasoning Deficiencies in Professional Exam QA

CogRAG+: 认知层级引导的专业考试问答中的记忆与推理缺陷诊断与修正
Wang, Xudong, Wang, Zilong, Ming, Zhaoyan
Abstract
Professional domain knowledge underpins human civilization, serving as both the basis for industry entry and the core of complex decision-making and problem-solving. However, existing large language models often suffer from opaque inference processes in which retrieval and reasoning are tightly entangled, causing knowledge gaps and reasoning inconsistencies in professional tasks. To address this, we propose CogRAG+, a training-free framework that decouples and aligns the retrieval-augmented generation pipeline with human cognitive hierarchies. First, we introduce Reinforced Retrieval, a judge-driven dual-path strategy with fact-centric and option-centric paths that strengthens retrieval and mitigates cascading failures caused by missing foundational knowledge. We then develop cognition-stratified Constrained Reasoning, which replaces unconstrained chain-of-thought generation with structured templates to reduce logical inconsistency and generative redundancy. Experiments on two representative models, Qwen3-8B and Llama3.1-8B, show that CogRAG+ consistently outperforms general-purpose models and standard RAG methods on the Registered Dietitian qualification exam. In single-question mode, it raises overall accuracy to 85.8% for Qwen3-8B and 60.3% for Llama3.1-8B, with clear gains over vanilla baselines. Constrained Reasoning also reduces the unanswered rate from 7.6% to 1.4%. CogRAG+ offers a robust, model-agnostic path toward training-free expert-level performance in specialized domains.
Chinese Translation
专业领域知识是人类文明的基础,既是进入行业的依据,也是复杂决策和问题解决的核心。然而,现有的大型语言模型往往存在推理过程不透明的问题,其中检索与推理紧密交织,导致在专业任务中出现知识缺口和推理不一致。为了解决这一问题,我们提出了CogRAG+,一个无需训练的框架,它将增强检索生成管道与人类认知层级解耦并对齐。首先,我们引入了强化检索(Reinforced Retrieval),这是一种以评判为驱动的双路径策略,包含以事实为中心和以选项为中心的路径,旨在增强检索能力并减轻因缺失基础知识而导致的级联失败。接着,我们开发了认知分层约束推理(cognition-stratified Constrained Reasoning),该方法用结构化模板替代无约束的思维链生成,以减少逻辑不一致和生成冗余。在两个代表性模型Qwen3-8B和Llama3.1-8B上的实验表明,CogRAG+在注册营养师资格考试中始终优于通用模型和标准RAG方法。在单题模式下,Qwen3-8B的整体准确率提高至85.8%,Llama3.1-8B提高至60.3%,明显超越了基础模型。约束推理还将未回答率从7.6%降低至1.4%。CogRAG+为在专业领域实现无需训练的专家级表现提供了一条稳健的、与模型无关的路径。
cs.CL / 10 / 2604.25929

LLMs Generate Kitsch

大型语言模型生成庸俗艺术
Klinge, Xenia, Ortlieb, Stefan, Koller, Alexander
Abstract
Large Language Models (LLMs) are increasingly used to generate pictures, texts, music, videos, and other works that have traditionally required human creativity. LLM-generated artifacts are often rated better than human-generated works in controlled studies. At the same time, they can come across as generic and hollow. We propose to resolve this tension by arguing that LLMs systematically generate kitsch, and that this is a consequence of the way in which they are trained. We also show empirically that readers perceive LLM-generated stories as kitschier, if we control for their definition of "kitsch". We discuss implications for the design of future studies and for creative tasks such as research and coding.
Chinese Translation
大型语言模型(LLMs)越来越多地被用于生成图片、文本、音乐、视频以及其他传统上需要人类创造力的作品。在控制研究中,LLM生成的作品往往被评估为优于人类生成的作品。同时,它们也可能显得平庸和空洞。我们提出通过论证LLMs系统性地生成庸俗艺术来解决这一矛盾,并指出这与其训练方式有关。我们还通过实证研究表明,如果控制对“庸俗艺术”的定义,读者会认为LLM生成的故事更具庸俗性。我们讨论了对未来研究设计和创意任务(如研究和编码)的影响。
cs.CL / 11 / 2604.25930

Associative-State Universal Transformers: Sparse Retrieval Meets Structured Recurrence

关联状态通用变换器:稀疏检索与结构化递归的结合
Xiao, Liu
Abstract
We study whether a structured recurrent state can serve as a compact associative backbone for language modeling while still supporting exact retrieval. We introduce UniMatrix, a Universal Transformer style family that reuses a shared recurrent block across depth and augments it with hybrid state updates, a ROSA-style residual path, and token-conditioned embedding modulation. We evaluate these models on byte-level WikiText-2, synthetic associative recall, throughput profiling on Apple MPS, and a corrected benchmark for triple-token interactions. At small scale, UniMatrix-Core and UniMatrix-ROSA slightly outperform a parameter-matched Transformer on WikiText-2 while using many fewer parameters, reaching 5.084 and 5.083 bits-per-byte versus 5.124. The main negative result is equally important: on associative recall, the original UniMatrix family remains near chance while the Transformer reaches 25.4 percent, showing that compressed recurrent state alone is not enough for exact lookup. A retrieval-oriented follow-up, UniMatrix-Assoc, helps only marginally. By contrast, UniMatrix-SparsePointer, which adds sparse slot routing and direct pointer-logit fusion, reaches 75.6 percent on the original pilot recipe and 99.2 percent on a no-dropout follow-up while using 53.8 percent fewer parameters than the Transformer baseline. Ablations show that the gain comes from sufficient slot capacity and exact pointer-level output routing. Overall, structured recurrent state is promising and parameter-efficient, but strong long-range behavior still requires explicit sparse retrieval and better kernels.
Chinese Translation
我们研究了结构化递归状态是否可以作为语言建模的紧凑关联骨架,同时仍然支持精确检索。我们引入了UniMatrix,这是一种通用变换器风格的家族,重用跨深度的共享递归模块,并通过混合状态更新、ROSA风格的残差路径和基于令牌的嵌入调制进行增强。我们在字节级的WikiText-2、合成关联回忆、Apple MPS上的吞吐量分析以及三元令牌交互的修正基准上评估了这些模型。在小规模下,UniMatrix-Core和UniMatrix-ROSA在WikiText-2上略微超越了参数匹配的变换器,同时使用了更少的参数,分别达到了5.084和5.083比特每字节,而变换器为5.124。主要的负面结果同样重要:在关联回忆上,原始的UniMatrix家族仍然接近随机水平,而变换器达到了25.4%的准确率,表明单靠压缩的递归状态不足以实现精确查找。一个以检索为导向的后续模型UniMatrix-Assoc仅有微小的帮助。相比之下,UniMatrix-SparsePointer通过增加稀疏槽路由和直接指针-logit融合,在原始试点配方上达到了75.6%的准确率,在无丢弃的后续模型上达到了99.2%,同时使用的参数比变换器基线少53.8%。消融实验表明,性能提升源于足够的槽容量和精确的指针级输出路由。总体而言,结构化递归状态是有前景且参数高效的,但强大的长程行为仍然需要显式的稀疏检索和更好的核函数。
cs.CL / 12 / 2604.25931

Anchored Confabulation: Partial Evidence Non-Monotonically Amplifies Confident Hallucination in LLMs

锚定虚构:部分证据非单调地增强大型语言模型中的自信错误幻觉
Lathkar, Ashish Balkishan
Abstract
We identify a previously unknown calibration property of large language models: providing one confirmed intermediate fact toward a multi-step reasoning chain increases the model's confident-wrong-answer rate before full evidence eliminates it. We call this anchored confabulation: a partial anchor commits the model to confident parametric completion of remaining reasoning steps. We formalize it as Parametric Hallucination Confidence (PHC) and establish it across six lines of evidence including a causal injection experiment (PHC 0.613 to 0.656 to 0.595 to 0.536, N=160) and capability scaling across five model families (Spearman rho=0.900, p=0.037). The Anchoring Threshold Law k*(n)=floor(n/3) predicts PHC amplification by hop depth with four confirmed predictions. Applied to RAG routing, a LearnedRouter exploiting PHC closes 81.1% of the oracle performance gap (macro F1=0.426, p<1e-6) on 1,800 queries across four benchmarks with no model fine-tuning and 50x fewer labels than prior RL-based work. An epistemic humility prompt reduces the PHC spike by 0.118; explicit self-rating (PHC=0.684, p<0.001) outperforms lexical confidence as a routing signal.
Chinese Translation
我们识别出大型语言模型的一种之前未知的校准特性:在多步骤推理链中提供一个确认的中间事实,会在完整证据消除之前增加模型的自信错误答案率。我们称之为锚定虚构(anchored confabulation):部分锚定使模型在剩余推理步骤的参数化完成上表现出自信。我们将其形式化为参数化幻觉信心(Parametric Hallucination Confidence, PHC),并通过六条证据建立了这一概念,包括一个因果注入实验(PHC从0.613增加到0.656,再到0.595,再到0.536,N=160)和跨五个模型家族的能力扩展(Spearman rho=0.900, p=0.037)。锚定阈值法则k*(n)=floor(n/3)预测了PHC随着跳跃深度的增强,并有四个确认的预测。应用于RAG路由,利用PHC的LearnedRouter在1,800个查询中在四个基准测试上缩小了81.1%的oracle性能差距(宏观F1=0.426, p<1e-6),且没有模型微调,所需标签比之前基于强化学习的工作少50倍。一个认知谦逊提示将PHC峰值降低了-0.118;显式自我评分(PHC=0.684, p<0.001)在作为路由信号时优于词汇信心。
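The Anchoring Threshold Law quoted in the abstract above, k*(n)=floor(n/3), can be tabulated directly. The sketch below only evaluates the stated formula. The abstract does not define its symbols, so reading n as the number of reasoning hops is an assumption, and the function name `anchoring_threshold` is hypothetical.

```python
def anchoring_threshold(n_hops: int) -> int:
    """Evaluate k*(n) = floor(n / 3) as stated in the abstract.

    Assumption: n_hops is the hop depth of the reasoning chain.
    """
    if n_hops < 0:
        raise ValueError("n_hops must be non-negative")
    return n_hops // 3

# Tabulate the threshold for a few chain lengths.
for n in range(2, 8):
    print(n, anchoring_threshold(n))
```

Under this reading the threshold stays at 0 for chains shorter than three hops and increases by one every three hops.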
cs.CL / 13 / 2604.26020

Training Computer Use Agents to Assess the Usability of Graphical User Interfaces

训练计算机使用代理评估图形用户界面的可用性
Gao, Alice, Tong, Weixi, Vempati, Rishab, Reinecke, Katharina, Shapiro, R. Benjamin, Zhang, Tianyi, Wu, Jason
Abstract
Usability testing with experts and potential users can assess the effectiveness, efficiency, and user satisfaction of graphical user interfaces (GUIs), but doing so remains a costly and time-intensive process. Prior work has used computer use agents (CUAs) and other generative agents that can simulate user interactions and preferences, but we show that agents still struggle to provide accurate usability assessments. In this work, we present a novel machine learning method that operationalizes a computational definition of usability to train CUAs to assess GUI usability by i) prioritizing important interaction flows, ii) executing them through human-like interactions, and iii) predicting a learned numerical usability score. We train a computer use agent, uxCUA, with our algorithm on a large-scale dataset of fully interactive user interfaces (UIs) paired with usability labels and human preferences. We show that uxCUA outperforms larger models in accurate usability assessments and produces realistic critiques of both synthetic and real UIs. More broadly, our work aims to build a principled, data-driven foundation for automated usability assessment in HCI.
Chinese Translation
与专家和潜在用户进行的可用性测试可以评估图形用户界面(GUIs)的有效性、效率和用户满意度,但这一过程仍然成本高昂且耗时。先前的研究使用了计算机使用代理(CUAs)和其他生成代理,这些代理能够模拟用户交互和偏好,但我们发现代理在提供准确的可用性评估方面仍然存在困难。在本研究中,我们提出了一种新颖的机器学习方法,该方法将可用性的计算定义进行操作化,以训练CUAs评估GUI的可用性,具体方法包括:i)优先考虑重要的交互流程,ii)通过类人交互执行这些流程,iii)预测学习到的数值可用性评分。我们使用我们的算法在一个大规模的完全交互用户界面(UIs)数据集上训练了一个计算机使用代理uxCUA,该数据集配有可用性标签和人类偏好。我们展示了uxCUA在准确的可用性评估方面优于更大模型,并对合成和真实用户界面都产生了现实的批评。更广泛地说,我们的工作旨在为人机交互(HCI)中的自动化可用性评估建立一个有原则、以数据驱动的基础。
cs.CL / 14 / 2604.26048

BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets

BioGraphletQA:知识锚定的复杂问答数据集生成
Jonker, Richard A. A., Martins, Bárbara Maria Ribeiro de Abreu, Matos, Sérgio
Abstract
This paper presents a principled and scalable framework for systematically generating complex Question Answering (QA) data. At the core of this framework is a graphlet-anchored generation process, where small subgraphs from a Knowledge Graph (KG) are used in a structured prompt to control the complexity and ensure the factual grounding of questions generated by Large Language Models. The first instantiation of this framework is BioGraphletQA, a new biomedical KGQA dataset of 119,856 QA pairs. Each entry is grounded in a graphlet of up to five nodes from the OREGANO KG, with most of the pairs being enriched with relevant document snippets from PubMed. First, we demonstrate the framework's value and the dataset's quality through evaluation by a domain expert on 106 QA pairs, confirming the high scientific validity and complexity of the generated data. Second, we establish its practical utility by showing that augmenting downstream benchmarks with our data improves accuracy on PubMedQA from 49.2% to 68.5% in a low-resource setting, and on MedQA from a 41.4% baseline to 44.8% in a full-resource setting. Our framework provides a robust and generalizable solution for creating critical resources to advance complex QA tasks, including MCQA and KGQA. All resources supporting this work, including the dataset (https://zenodo.org/records/17381119) and framework code (https://github.com/ieeta-pt/BioGraphletQA), are publicly available to facilitate use, reproducibility and extension.
Chinese Translation
本文提出了一个系统化生成复杂问答(QA)数据的原则性和可扩展框架。该框架的核心是一个图小元(graphlet)锚定的生成过程,其中来自知识图谱(Knowledge Graph, KG)的较小子图被用于结构化提示,以控制生成问题的复杂性并确保由大型语言模型(Large Language Models)生成的问题的事实基础。该框架的首次实现是BioGraphletQA,一个新的生物医学KGQA数据集,包含119,856对问答。每个条目都基于来自OREGANO KG的最多五个节点的图小元,大多数对问答还附有来自PubMed的相关文档片段。我们首先通过领域专家对106对问答的评估,展示了该框架的价值和数据集的质量,确认了生成数据的高科学有效性和复杂性。其次,我们通过展示在低资源环境下用我们的数据增强下游基准测试,使PubMedQA的准确率从49.2%提高到68.5%,在全资源环境下使MedQA的基线从41.4%提高到44.8%,确立了其实际效用。我们的框架为创建关键资源以推进复杂问答任务(包括多项选择问答(MCQA)和KGQA)提供了一个稳健且可推广的解决方案。所有支持本工作的资源,包括数据集(https://zenodo.org/records/17381119)和框架代码(https://github.com/ieeta-pt/BioGraphletQA),均已公开,以促进使用、可重复性和扩展。
cs.CL / 15 / 2604.26052

From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

从提示风险到响应风险:大型语言模型安全行为的配对分析
Hu, Mengya, Wei, Qiong, Atluri, Sandeep
Abstract
Safety evaluations of large language models (LLMs) typically report binary outcomes such as attack success rate, refusal rate, or harmful/not-harmful response classification. While useful, these can hide how risk changes between a user's input and the model's response. We present a paired, transition-based analysis over 1250 prompt-response records with human-provided labels over four harm categories (Hate, Sexual, Violence, Self-harm) and ordinal severity levels aligned with the Azure AI Content Safety taxonomy. 61% of responses de-escalate harm relative to the prompt, 36% preserve the same severity, and 3% escalate to higher harm. A per-category persistence/drift-up decomposition identifies Sexual content as 3x harder to de-escalate than Hate or Violence, driven by persistence on already-sexual prompts, not by newly introducing sexual harm from benign inputs. Jointly measuring response relevance reveals an empirical signature of the helpfulness-harmlessness tradeoff: all compliance-escalation cases (from non-zero prompts) are relevance-3 (high-quality, on-task content at elevated severity), while medium-severity responses show the lowest relevance (64%), driven by tangential elaborations in Violence and Sexual categories.
Chinese Translation
大型语言模型(LLMs)的安全评估通常报告二元结果,如攻击成功率、拒绝率或有害/无害响应分类。虽然这些结果有其用处,但可能掩盖用户输入与模型响应之间风险的变化。我们对1250个提示-响应记录进行了配对的、基于过渡的分析,这些记录附有人工提供的标签,涵盖四个危害类别(仇恨、性、暴力、自残)及与Azure AI内容安全分类法对齐的序数严重性水平。61%的响应相对于提示降低了危害,36%保持相同的严重性,3%则升级为更高的危害。每个类别的持续性/上升分解显示,性内容的去升级难度是仇恨或暴力的3倍,这主要是由于在已经存在的性提示上持续,而不是由于从无害输入中新引入性危害。联合测量响应的相关性揭示了有用性与无害性之间的经验特征:所有合规升级案例(来自非零提示)均为相关性-3(高质量、任务相关内容但严重性提高),而中等严重性的响应显示出最低的相关性(64%),这主要是由于在暴力和性类别中的离题的展开阐述。
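The paired, transition-based analysis described in the abstract above compares the ordinal severity of each prompt with that of its response. A minimal sketch of that bookkeeping follows, assuming severities are integers on a shared ordinal scale. The function name `transition_shares` and the toy records are hypothetical and are not data from the paper.

```python
from collections import Counter

def transition_shares(pairs):
    """Share of records that de-escalate, preserve, or escalate severity.

    pairs: iterable of (prompt_severity, response_severity) ordinal ints.
    """
    counts = Counter()
    for prompt_sev, response_sev in pairs:
        if response_sev < prompt_sev:
            counts["de-escalate"] += 1
        elif response_sev == prompt_sev:
            counts["preserve"] += 1
        else:
            counts["escalate"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records to analyse")
    labels = ("de-escalate", "preserve", "escalate")
    return {label: counts[label] / total for label in labels}

# Hypothetical toy records (prompt severity, response severity).
toy = [(4, 0), (2, 2), (0, 2), (6, 2)]
print(transition_shares(toy))
```

Running the same function per harm category would give the kind of per-category breakdown that the abstract reports.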
cs.CL / 16 / 2604.26139

HIVE: Hidden-Evidence Verification for Hallucination Detection in Diffusion Large Language Models

HIVE:用于扩散大语言模型中幻觉检测的隐证验证
Zhao, Guoshenghui, Zhao, Weijie, Yu, Tan
Abstract
Diffusion large language models generate text through multi-step denoising, where hallucination signals may emerge throughout the trajectory rather than only in the final output. Existing detectors mainly rely on output uncertainty or coarse trace statistics, which often fail to capture the richer hidden dynamics of D-LLMs. We propose HIVE, a hidden-evidence verification framework that extracts compressed hidden evidence from denoising trajectories, selects informative step-layer evidence, and conditions a verifier language model on the selected evidence through prefix embeddings. HIVE produces both a continuous hallucination score from verifier decision logits and structured verification outputs, including hallucination types, evidence pairs, and short rationales. Across two D-LLMs and three QA benchmarks, HIVE consistently outperforms eight strong baselines and achieves up to 0.9236 AUROC and 0.9537 AUPRC. Ablation studies further confirm the importance of hidden-evidence conditioning, learned evidence selection, two-stream evidence representation, and step-layer embeddings. These results suggest that selected hidden evidence from denoising trajectories provides a stronger and more usable hallucination signal than output-only uncertainty or coarse trace statistics.
Chinese Translation
扩散大语言模型通过多步去噪生成文本,其中幻觉信号可能在整个轨迹中出现,而不仅仅是在最终输出中。现有的检测器主要依赖于输出的不确定性或粗略的轨迹统计,这往往无法捕捉到 D-LLMs 更丰富的隐含动态。我们提出了 HIVE,一个隐证验证框架,从去噪轨迹中提取压缩的隐证,选择信息丰富的步骤层证据,并通过前缀嵌入将验证器语言模型条件化于所选证据。HIVE 生成来自验证器决策日志的连续幻觉得分和结构化验证输出,包括幻觉类型、证据对和简短的推理。在两个 D-LLMs 和三个 QA 基准上,HIVE 始终优于八个强基线,并达到了最高 0.9236 的 AUROC 和 0.9537 的 AUPRC。消融研究进一步确认了隐证条件化、学习证据选择、双流证据表示和步骤层嵌入的重要性。这些结果表明,从去噪轨迹中选择的隐证提供了比仅依赖输出不确定性或粗略轨迹统计更强大和更可用的幻觉信号。
cs.CL / 17 / 2604.26157

Structural Generalization on SLOG without Hand-Written Rules

无需手写规则的 SLOG 结构泛化
Wei, Zichao
Abstract
Structural generalization in semantic parsing requires systems to apply learned compositional rules to novel structural combinations. Existing approaches either rely on hand-written algebraic rules (AM-Parser) or fail to generalize structurally (Transformer-based models). We present an alternative requiring no hand-written compositional rules, based on a neural cellular automaton (NCA) with a discrete bottleneck: all compositional rules are learned from data through local iteration. On the SLOG benchmark, the system achieves 100% type-exact match on 11 of 17 structural generalization categories, including three where AM-Parser scores 0 to 74%, with an overall standard deviation of 0.2 across 10 seeds (vs. AM-Parser's 4.3). Analysis reveals that all 5,539 failure instances reduce to exactly two mechanisms: novel combinations of wh-extraction context with reduced verb types, and modifiers appearing on the subject side of verbs. When we decompose results by CCG structural features, each sub-pattern either succeeds on all instances or fails on all. Intermediate scores (e.g., 41.4%) are mixtures of structurally distinct CCG patterns, not partial generalization. All failures correspond to directed operations absent from training; all successes correspond to operations already covered.
Chinese Translation
语义解析中的结构泛化要求系统将学习到的组合规则应用于新的结构组合。现有的方法要么依赖手写的代数规则(AM-Parser),要么在结构上无法泛化(基于 Transformer 的模型)。我们提出了一种无需手写组合规则的替代方案,该方案基于具有离散瓶颈的神经元胞自动机(NCA):所有组合规则均通过局部迭代从数据中学习。在 SLOG 基准测试中,该系统在 17 个结构泛化类别中的 11 个类别上实现了 100% 的类型精确匹配,其中包括三个 AM-Parser 得分为 0 到 74% 的类别,10 次实验的整体标准差为 0.2(相比之下,AM-Parser 的标准差为 4.3)。分析表明,所有 5,539 个失败实例归结为两种机制:新型的 wh-提取上下文与减少的动词类型的组合,以及修饰语出现在动词的主语侧。当我们根据 CCG 结构特征分解结果时,每个子模式要么在所有实例上成功,要么在所有实例上失败。中间得分(例如,41.4%)是结构上不同的 CCG 模式的混合,而不是部分泛化。所有失败对应于训练中缺失的有向操作;所有成功对应于已经覆盖的操作。
cs.CL / 18 / 2604.26167

Test-Time Safety Alignment

测试时安全对齐
Saglam, Baturay, Kalogerias, Dionysis
Abstract
Recent work has shown that a model's input word embeddings can serve as effective control variables for steering its behavior toward outputs that satisfy desired properties. However, this has only been demonstrated for pretrained text-completion models on the relatively simple objective of reducing surface-level profanity in short continuations. A natural and practically important question is how well input embeddings can control aligned models, which produce an imbalanced bimodal refuse-or-comply output distribution rather than the smooth distribution characteristic of open-ended generation. We explore this in the context of safety, showing that input word embeddings can be optimized in a sub-lexical manner to minimize the semantic harmfulness of aligned model responses. Our approach uses zeroth-order gradient estimation of a black-box text-moderation API with respect to the input embeddings, and then applies gradient descent on these embeddings to minimize the harmfulness of the generated text. Experiments show that the proposed method can neutralize every safety-flagged response on standard safety benchmarks.
Chinese Translation
近期的研究表明,模型的输入词嵌入可以作为有效的控制变量,引导其行为朝向满足期望属性的输出。然而,这一方法仅在预训练的文本补全模型上得到验证,且目标相对简单,即减少短文本续写中的表面粗俗语言。一个自然且具有实际重要性的问题是,输入嵌入在控制对齐模型方面的效果如何,这些模型产生的是不平衡的双模拒绝或遵从输出分布,而非开放式生成特有的平滑分布。我们在安全性背景下探索这一问题,表明输入词嵌入可以以亚词汇的方式进行优化,以最小化对齐模型响应的语义危害性。我们的方法使用黑箱文本审查API对输入嵌入进行零阶梯度估计,然后对这些嵌入应用梯度下降,以最小化生成文本的危害性。实验表明,所提方法能够在标准安全基准上中和每一个安全标记的响应。
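The abstract above describes estimating gradients of a black-box moderation score with respect to input embeddings and then running gradient descent. The sketch below illustrates the generic idea with a two-point random-direction (zeroth-order) estimator on a toy objective. The quadratic `toy_harm_score`, the step size, and all function names are assumptions made for illustration. The paper's actual API, embedding space, and sub-lexical optimisation procedure are not reproduced here.

```python
import random

def zeroth_order_grad(f, x, mu=1e-2, n_samples=20, rng=random):
    """Two-point random-direction estimate of the gradient of f at x."""
    grad = [0.0] * len(x)
    for _ in range(n_samples):
        # Probe the black box along a random Gaussian direction u.
        u = [rng.gauss(0.0, 1.0) for _ in x]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        # Average slope * u over the sampled directions.
        grad = [g + slope * ui / n_samples for g, ui in zip(grad, u)]
    return grad

def toy_harm_score(x):
    # Hypothetical stand-in for a black-box moderation score.
    return sum((xi - 1.0) ** 2 for xi in x)

rng = random.Random(0)
x = [0.0, 0.0, 0.0, 0.0]  # toy "embedding"
start = toy_harm_score(x)
for _ in range(200):
    g = zeroth_order_grad(toy_harm_score, x, rng=rng)
    x = [xi - 0.05 * gi for xi, gi in zip(x, g)]  # gradient-descent step
end = toy_harm_score(x)
print(start, end)
```

Only function evaluations are used, which is why this family of estimators suits a scoring API that exposes no gradients. A real moderation score would be noisier and non-convex, so convergence on this toy problem says little about the practical setting.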
cs.CL / 19 / 2604.26170

EvoSelect: Data-Efficient LLM Evolution for Targeted Task Adaptation

EvoSelect:针对特定任务适应的数据高效大型语言模型演化
Li, Ting-Wei, Chen, Sirui, Zou, Jiaru, Huang, Yingbing, Wei, Tianxin, He, Jingrui, Tong, Hanghang
Abstract
Adapting large language models (LLMs) to a targeted task efficiently and effectively remains a fundamental challenge. Such adaptation often requires iteratively improving the model toward a targeted task, yet collecting high-quality human-labeled data to support this process is costly and difficult to scale. As a result, synthetic data generation has emerged as a flexible and scalable alternative. One straightforward approach is an iterative generation-training loop, where candidate data are synthesized through an external generator, the model is updated using these data, and the process is repeated over iterations. However, generated samples can be noisy, highly redundant, or even misaligned with the targeted task distribution. Training indiscriminately on such data can dilute useful learning signals and even degrade model performance. To address this, we introduce a refined paradigm, namely an iterative generation-selection-training loop, which incorporates a selection step prior to model updates. Building on this paradigm, we propose EvoSelect, a data-efficient framework to evolve LLMs effectively. Given candidate samples produced by the data generator, EvoSelect selects training data by jointly modeling targeted task alignment and diversity. We estimate task relevance through optimal transport with proxy gradient representations, which quantifies how well candidate samples align with the targeted task distribution. To mitigate redundancy, we incorporate a diversification mechanism that promotes coverage of complementary training samples. By interleaving alignment and diversification, EvoSelect enables progressive LLM evolution toward targeted tasks. Extensive experiments on various benchmarks demonstrate that with either weak or strong data generators, EvoSelect consistently improves adaptation efficacy over existing data selection methods.
Chinese Translation
将大型语言模型(LLMs)高效且有效地适应特定任务仍然是一个基本挑战。这种适应通常需要迭代地改善模型以满足特定任务的要求,但收集高质量的人类标注数据以支持这一过程既昂贵又难以扩展。因此,合成数据生成作为一种灵活且可扩展的替代方案应运而生。一种简单的方法是通过迭代生成-训练循环,其中候选数据通过外部生成器合成,模型使用这些数据进行更新,并在迭代中重复这一过程。然而,生成的样本可能存在噪声、高度冗余,甚至与目标任务分布不一致。在这样的数据上进行无差别训练可能会稀释有用的学习信号,甚至降低模型性能。为了解决这个问题,我们引入了一种精炼的范式,即迭代生成-选择-训练循环,在模型更新之前加入选择步骤。在此范式的基础上,我们提出了EvoSelect,一个有效演化LLM的数据高效框架。给定数据生成器产生的候选样本,EvoSelect通过联合建模目标任务的一致性和多样性来选择训练数据。我们通过最优传输与代理梯度表示来估计任务相关性,从而量化候选样本与目标任务分布的一致性。为了减轻冗余,我们引入了一种多样化机制,以促进互补训练样本的覆盖。通过交替进行一致性和多样化,EvoSelect使得LLM能够逐步演化以适应特定任务。在各种基准上的广泛实验表明,无论是使用弱数据生成器还是强数据生成器,EvoSelect始终在适应效率上优于现有的数据选择方法。
cs.CL / 20 / 2604.26206

Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging

选项顺序随机化揭示了提示性沙袋策略中的分布位置吸引子
Cacioli, Jon-Paul
Abstract
A predecessor pilot (Cacioli, 2026) found that Llama-3-8B implements prompted sandbagging as positional collapse rather than answer avoidance. However, fixed option ordering in MMLU-Pro left open whether this reflected a model-level position-dominant policy or dataset-level distractor structure. This pre-registered follow-up (3 models, 2,000 MMLU-Pro items, 4 conditions, 24,000 primary trials) added cyclic option-order randomisation as the critical control. The pre-registered item-level same-letter diagnostic did not confirm deterministic position-tracking (same-letter rate 37.3%, below the 50% threshold). However, pre-specified supporting analyses revealed that the response-position distribution under sandbagging was highly stable under complete content rotation (Pearson r = 0.9994; Jensen-Shannon divergence = 0.027, compared to 0.386 between honest and sandbagging conditions). Accuracy spiked to 72.1% when the correct answer coincidentally occupied the preferred position E, and fell to 4.3% at position A. The data provide strong evidence for a soft distributional attractor: under sandbagging instruction, the model enters a low-entropy response-position basin centred on E/F/G that is highly stable and largely content-invariant at the aggregate level. Qwen-2.5-7B served as a negative control (non-compliant, no distributional shift). These results provide evidence, at the 7-9 billion parameter scale, that response-position entropy is a promising black-box behavioural signature of this sandbagging mode.
Chinese Translation
前期试点研究(Cacioli, 2026)发现,Llama-3-8B将提示性沙袋策略实现为位置崩溃,而非答案回避。然而,MMLU-Pro中的固定选项顺序尚未明确这是否反映了模型层面的位置主导策略或数据集层面的干扰结构。本次预注册的后续研究(3个模型,2,000个MMLU-Pro项目,4个条件,24,000个主要试验)增加了周期性选项顺序随机化作为关键控制。预注册的项目级同字母诊断未确认确定性位置跟踪(同字母率为37.3%,低于50%的阈值)。然而,预先指定的支持性分析显示,在沙袋策略下,响应位置分布在完全内容旋转下高度稳定(Pearson r = 0.9994;Jensen-Shannon散度 = 0.027,相比于诚实与沙袋条件之间的0.386)。当正确答案恰好位于优选位置E时,准确率飙升至72.1%,而在位置A时则降至4.3%。数据提供了强有力的证据,表明存在一个软分布吸引子:在沙袋指令下,模型进入一个以E/F/G为中心的低熵响应位置盆地,该盆地在整体层面上高度稳定且在内容上大致不变。Qwen-2.5-7B作为负对照(不合规,无分布转变)。这些结果提供了证据,表明在70-90亿参数规模下,响应位置熵是这种沙袋模式的一个有前景的黑箱行为特征。
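The stability comparison in the abstract above rests on the Jensen-Shannon divergence between response-position distributions. A minimal sketch of that computation follows, assuming base-2 logarithms (the abstract does not state the base) and hypothetical helper names; it is an illustration, not the author's analysis code.

```python
import math

def position_distribution(responses, letters="ABCDEFGHIJ"):
    """Turn a list of chosen option letters into a probability vector.

    The ten-letter alphabet is an assumption based on MMLU-Pro's option format.
    """
    counts = [responses.count(letter) for letter in letters]
    total = sum(counts)
    return [c / total for c in counts]

def jensen_shannon_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL sum.
        return sum(x * math.log(x / y, base) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)

# Toy data only: two sandbagging runs that both favour E/F/G.
run_a = position_distribution(list("EEEFFGGEAB"))
run_b = position_distribution(list("EEFFFGGEAC"))
print(jensen_shannon_divergence(run_a, run_b))
```

With base 2 the divergence lies between 0 (identical distributions) and 1 (disjoint support), which gives a scale for reading values such as 0.027 and 0.386.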
cs.CL / 21 / 2604.26209

Breaking the Autoregressive Chain: Hyper-Parallel Decoding for Efficient LLM-Based Attribute Value Extraction

打破自回归链:高并行解码用于高效的基于大语言模型的属性值提取
Glavas, Theodore, Vedula, Nikhita, Dhyani, Dushyanta, Zhu, Yilun, Malmasi, Shervin
Abstract
Some text generation tasks, such as Attribute Value Extraction (AVE), require decoding multiple independent sequences from the same document context. While standard autoregressive decoding is slow due to its sequential nature, the independence between output sequences offers an opportunity for parallelism. We present Hyper-Parallel Decoding, a novel decoding algorithm that accelerates offline decoding by leveraging both shared memory and computation across batches. HPD enables out-of-order token generation through position ID manipulation, significantly improving efficiency. Experiments on AVE show that attribute-value pairs are conditionally independent, enabling us to parallelize value generation within each prompt. By further stacking multiple documents within a single prompt, we can decode in parallel up to 96 tokens per prompt. HPD works with all LLMs, and reduces both inference costs and total inference time by up to 13.8X without compromising output quality, potentially saving hundreds of thousands of dollars on industry AVE tasks. Although designed for attribute extraction, HPD makes no assumptions unique to the AVE domain and can in theory be applied to other scenarios with independent output structures.
Chinese Translation
某些文本生成任务,如属性值提取(Attribute Value Extraction, AVE),需要从同一文档上下文中解码多个独立序列。由于标准自回归解码的顺序特性,其速度较慢,而输出序列之间的独立性为并行处理提供了机会。我们提出了高并行解码(Hyper-Parallel Decoding, HPD),这是一种新颖的解码算法,通过利用批次之间的共享内存和计算来加速离线解码。HPD 通过位置 ID 操作实现无序的标记生成,显著提高了效率。在 AVE 的实验中,属性-值对是条件独立的,使我们能够在每个提示中并行生成值。通过进一步在单个提示中堆叠多个文档,我们可以在每个提示中并行解码多达 96 个标记。HPD 适用于所有大语言模型,并在不影响输出质量的情况下,将推理成本和总推理时间减少了最多 13.8 倍,可能为行业 AVE 任务节省数十万美元。尽管 HPD 是为属性提取设计的,但它并不对 AVE 领域做出独特假设,理论上可以应用于其他具有独立输出结构的场景。
cs.CL / 22 / 2604.26229

Comparative Analysis of AutoML and BiLSTM Models for Cyberbullying Detection on Indonesian Instagram Comments

自动机器学习与双向长短期记忆模型在印尼Instagram评论中网络欺凌检测的比较分析
Putri, Raihana Adelia, Musfirah, Aisyah, Ningrum, Anggi Puspita, Muthoharoh, Luluk, Satria, Ardika, Manullang, Martin Clinton Tosima
Abstract
This study compares machine learning and deep learning approaches for cyberbullying detection in Indonesian-language Instagram comments. Using a balanced dataset of 650 comments labeled as Bullying and Non-Bullying, the study evaluates Naive Bayes, Logistic Regression, and Support Vector Machine with TF-IDF features, as well as BiLSTM and BiLSTM with Bahdanau Attention. A preprocessing pipeline tailored to informal Indonesian text is applied, including slang normalization, stopword removal, and stemming. The results show that Logistic Regression performs best among the machine learning models, while BiLSTM with Attention achieves the strongest overall deep learning performance. The findings highlight the value of domain-specific preprocessing and show that although deep learning captures contextual patterns more effectively, machine learning remains a competitive option for resource-constrained deployments.
Chinese Translation
本研究比较了机器学习和深度学习方法在印尼语Instagram评论中的网络欺凌检测。使用一个平衡的数据集,包括650条标记为网络欺凌和非网络欺凌的评论,研究评估了使用TF-IDF特征的朴素贝叶斯、逻辑回归和支持向量机,以及双向长短期记忆网络(BiLSTM)和带有Bahdanau注意力机制的BiLSTM。针对非正式印尼文本,应用了一个定制的预处理管道,包括俚语规范化、停用词移除和词干提取。结果显示,逻辑回归在机器学习模型中表现最佳,而带有注意力机制的BiLSTM在深度学习中取得了最佳整体表现。研究结果强调了特定领域预处理的重要性,并表明尽管深度学习更有效地捕捉上下文模式,但机器学习在资源受限的部署中仍然是一个具有竞争力的选择。
cs.CL / 23 / 2604.26230

A New Semisupervised Technique for Polarity Analysis using Masked Language Models

一种基于掩码语言模型的新半监督极性分析技术
Watanabe, Kohei
Abstract
I developed a new version of Latent Semantic Scaling (LSS) employing word2vec as a masked language model. Unlike original spatial models, it assigns polarity scores to words and documents as predicted probabilities of seed words to occur in given contexts. These probabilistic polarity scores are more accurate, interpretable and consistent than those spatial polarity models can produce in text analysis. I demonstrate these advantages by applying both probabilistic and spatial models to China Daily's coverage of China and other countries during the coronavirus disease (COVID) pandemic in terms of achievement in health issues. The result suggests that more advanced masked language models would further improve the semisupervised machine learning technique.
Chinese Translation
我开发了一种新的潜在语义缩放(Latent Semantic Scaling, LSS)版本,采用 word2vec 作为掩码语言模型。与原始的空间模型不同,它将极性分数分配给单词和文档,作为在给定上下文中种子词出现的预测概率。这些概率极性分数比空间极性模型在文本分析中产生的结果更准确、更易解释且更一致。我通过将概率模型和空间模型应用于《中国日报》在新冠病毒(COVID)大流行期间对中国及其他国家在健康问题上的报道,展示了这些优势。结果表明,更先进的掩码语言模型将进一步改善半监督机器学习技术。
cs.CL / 24 / 2604.26243

StratMem-Bench: Evaluating Strategic Memory Use in Virtual Character Conversation Beyond Factual Recall

StratMem-Bench:评估虚拟角色对话中超越事实回忆的战略记忆使用
Wu, Yerong, Wu, Tianxing, Zhu, Minghao, Sha, Hangyu, Wang, Haofen
Abstract
Achieving realistic human-like conversation for virtual characters requires not only simple memorization and recall of past events, but also the strategic use of memory to meet factual needs and support social engagement. Current benchmarks related to memory utilization (e.g., memory-augmented generation and long-term dialogue) overlook this nuance, treating memory primarily as a static repository of facts rather than a dynamic resource to be strategically deployed in dialogues. To address this gap, we design StratMem-Bench, a new benchmark to evaluate strategic memory use in character-centric dialogues. This dataset comprises 657 instances where virtual characters must navigate heterogeneous memory pools containing required, supportive, and irrelevant memories. We also propose a framework with evaluation metrics including Strict Memory Compliance, Memory Integration Quality, Proactive Enrichment Score, and Conditional Irrelevance Rate, to evaluate the strategic memory use capabilities of virtual characters. Experiments on StratMem-Bench that use state-of-the-art large language models as virtual characters show that all models perform well at distinguishing between required and irrelevant memories, but struggle once supportive memories are introduced into the decision process.
Chinese Translation
实现虚拟角色的逼真类人对话不仅需要简单地记忆和回忆过去事件,还需要战略性地利用记忆以满足事实需求和社会互动。目前相关的记忆利用基准(例如,记忆增强生成、长期对话等)忽视了这一细微差别,将记忆主要视为静态的事实存储库,而非在对话中可以战略性部署的动态资源。为了解决这一问题,我们设计了StratMem-Bench,这是一个新的基准,用于评估角色中心对话中的战略记忆使用。该数据集包含657个实例,其中虚拟角色必须在包含所需、支持和无关记忆的异质记忆池中进行导航。我们还提出了一个框架,包含不同的评估指标,包括严格记忆遵循度、记忆整合质量、主动丰富评分和条件无关率,以评估虚拟角色的战略记忆使用能力。在StratMem-Bench上的实验利用了最先进的大型语言模型作为虚拟角色,结果表明所有模型在区分所需和无关记忆方面表现良好,但在引入支持性记忆到决策过程中时则面临挑战。
cs.CL / 25 / 2604.26258

FlowBot: Inducing LLM Workflows with Bilevel Optimization and Textual Gradients

FlowBot:通过双层优化和文本梯度诱导LLM工作流
Yu, Hongyeon, Kim, Young-Bum, Kim, Yoon
Abstract
LLM workflows, which coordinate structured calls to individual LLMs (each augmented with varying instructions and tools) to achieve a particular goal, offer a promising path towards extending the capabilities of LLMs and building powerful systems that can tackle diverse tasks. However, existing approaches for building such workflows generally rely on human-crafted pipelines and prompts, which presents a substantial bottleneck in real-world deployment. How can we automatically induce and optimize such workflows in a data-driven way? This paper describes a simple data-driven approach for automatically inducing LLM workflows. We formulate workflow induction as a bilevel optimization problem: an outer loop which optimizes a high-level sketch of the workflow (in particular how the LLM calls should be structured), and an inner loop which optimizes each individual LLM call one by one. Both loops are optimized with "textual gradients", where for the inner loop we optimize each component in a modular way through "backpropagating" textual gradients layer by layer. We find that LLM workflows discovered through our FlowBot (workflow induction through bilevel optimization and textual gradients) approach perform competitively against strong baselines that make use of human-crafted or automatically generated workflows.
Chinese Translation
LLM工作流协调对单个LLM的结构化调用(每个LLM都附加了不同的指令和工具)以实现特定目标,提供了一条有前景的路径,旨在扩展LLM的能力并构建能够处理多样任务的强大系统。然而,现有构建此类工作流的方法通常依赖于人工设计的管道和提示,这在实际部署中形成了显著的瓶颈。如何以数据驱动的方式自动诱导和优化这些工作流?本文描述了一种简单的数据驱动方法,用于自动诱导LLM工作流。我们将工作流诱导形式化为一个双层优化问题:外层循环优化工作流的高层草图(特别是LLM调用的结构),内层循环则逐个优化每个单独的LLM调用。两个循环都通过“文本梯度”进行优化,其中内层循环通过逐层“反向传播”文本梯度以模块化的方式优化每个组件。我们发现,通过我们的FlowBot(通过双层优化和文本梯度进行工作流诱导)方法发现的LLM工作流在性能上与使用人工设计或自动生成工作流的强基线相当。
cs.CL / 26 / 2604.26269

Calibrated Surprise: An Information-Theoretic Account of Creative Quality

校准惊奇:一种信息论视角下的创意质量分析
Zou, Bo, Xu, Chao
Abstract
The essence of good creative writing is calibrated surprise: when constraints from all relevant dimensions act together, the feasible solution space collapses into a narrow region, and the surviving choices look least predictable from an unconstrained view. "Calibrated" has a precise meaning: the author's intent, the reader's reasonable expectation, and the logic of reality converge. When these three independent judgements agree on every dimension, the set of admissible writing choices is forced into a very small region. A mathematical corollary follows: full-dimensional accuracy and mediocrity are mutually exclusive -- two sides of one constraint structure, not separate goals. We use Shannon's mutual information $I(X;Y) = H(X) - H(X|Y)$ as our analysis tool. "Calibrated" corresponds to conditional entropy going to zero; "surprise" to entropy going up; mutual information is the precise measure of the joint quantity. The argument rests on two pillars. Static: when constraints from ethos, mythos, lexis, and dianoia are imposed together, the admissible set collapses sharply, and surviving solutions show up as low-probability choices from an unconstrained view. Dynamic: the chain rule shows each writing choice is constrained by what came before and constrains what comes after; macro-level decisions naturally contribute a larger share of information, removing the need for hand-tuned weighting. Through case studies and lightweight LLM-logprob computations, we show the framework is both analytically useful and operational, laying the theoretical groundwork for Creative Quality Alignment (CQA) and a professional evaluation benchmark.
Chinese Translation
优秀创意写作的本质在于校准惊奇:当来自所有相关维度的约束共同作用时,可行解空间会收缩到一个狭窄的区域,而在无约束视角下,存活的选择看起来是最不可预测的。“校准”具有精确的含义:作者的意图、读者的合理预期和现实的逻辑趋于一致。当这三个独立的判断在每个维度上达成一致时,可接受的写作选择被迫限制在一个非常小的区域。由此得出一个数学推论:全维度的准确性与平庸是相互排斥的——它们是同一约束结构的两个方面,而非独立的目标。我们使用香农的互信息 $I(X;Y) = H(X) - H(X|Y)$ 作为分析工具。“校准”对应于条件熵趋近于零;“惊奇”对应于熵的增加;互信息则是联合量的精确度量。该论点建立在两个支柱之上。静态:当来自伦理、神话、词汇和思维的约束共同施加时,可接受集会急剧收缩,存活的解决方案在无约束视角下表现为低概率选择。动态:链式法则表明每个写作选择受到之前选择的约束,并约束之后的选择;宏观层面的决策自然贡献了更大比例的信息,消除了手动调整权重的需求。通过案例研究和轻量级 LLM-logprob 计算,我们展示了该框架在分析上是有用的且可操作,为创意质量对齐(Creative Quality Alignment, CQA)和专业评估基准奠定了理论基础。
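The identity $I(X;Y) = H(X) - H(X|Y)$ used in the abstract above can be checked numerically on a small joint distribution. The sketch below is a generic illustration with hypothetical function names and toy probabilities; it is not the authors' LLM-logprob procedure.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """Return (H(X), H(X|Y), I(X;Y)) in bits.

    joint[x][y] is P(X=x, Y=y); rows index X and columns index Y.
    """
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    h_x = entropy(p_x)
    h_x_given_y = 0.0
    for j, py in enumerate(p_y):
        if py > 0:
            conditional = [joint[i][j] / py for i in range(len(joint))]
            h_x_given_y += py * entropy(conditional)
    return h_x, h_x_given_y, h_x - h_x_given_y

# Toy case: Y fully determines X, so H(X|Y) = 0 and I(X;Y) = H(X).
determined = [[0.5, 0.0],
              [0.0, 0.5]]
print(mutual_information(determined))
```

In this toy case the conditional entropy is zero while the marginal entropy stays at one bit, which is the "calibrated" and "surprise" combination the abstract describes in words.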
cs.CL / 27 / 2604.26294

Folding Tensor and Sequence Parallelism for Memory-Efficient Transformer Training & Inference

用于内存高效的Transformer训练与推理的张量与序列并行
Shyam, Vasu, Golubeva, Anna, Anthony, Quentin
Abstract
We present tensor and sequence parallelism (TSP), a parallel execution strategy that folds tensor parallelism and sequence parallelism onto a single device axis. In conventional multi-dimensional parallelism layouts, tensor parallelism (TP) shards model weights while sequence parallelism (SP) shards tokens, reducing per-device parameter or activation memory, respectively. Traditionally, each scheme is assigned its own mesh dimension. TSP instead assigns each rank both a weight shard and a sequence shard, reducing both parameter and activation memory along the same device axis. We implement this design with two runtime schedules. For attention, ranks iterate over broadcast parameter shards and reconstruct context through a sequence-wise key/value exchange. For gated MLPs, weight shards circulate in a ring while partial outputs accumulate locally. By sharding both weights and activations across the same devices, TSP trades additional communication volume for reduced memory overhead. We provide a theoretical communication and memory analysis, describe our implementation of TSP attention and gated MLP blocks, and benchmark TSP against TP, SP, and TP+SP. These results position TSP as a hardware-aware alternative for long-context and memory-constrained model training, and as a viable axis of parallelism in concert with existing parallelism schemes such as pipeline and expert parallelism for dense and mixture-of-expert models.
Chinese Translation
我们提出了张量与序列并行(TSP),这是一种将张量并行和序列并行折叠到单一设备轴上的并行执行策略。在传统的多维并行布局中,张量并行(TP)对模型权重进行分片,而序列并行(SP)对令牌进行分片,分别减少每个设备的参数或激活内存。传统上,每种方案被分配其自己的网格维度。而TSP则为每个等级分配了权重分片和序列分片,从而在同一设备轴上同时减少参数和激活内存。我们通过两种运行时调度实现了这一设计。在注意力机制中,各个等级迭代广播参数分片,并通过序列级别的键/值交换重建上下文。在门控多层感知机(gated MLPs)中,权重分片在环中循环,而部分输出则在本地累积。通过在同一设备上对权重和激活进行分片,TSP以增加通信量为代价,减少了内存开销。我们提供了理论上的通信和内存分析,描述了TSP注意力和门控MLP模块的实现,并将TSP与TP、SP和TP+SP进行了基准测试。这些结果使TSP成为一种硬件感知的替代方案,适用于长上下文和内存受限的模型训练,并且作为与现有并行方案(如流水线和专家并行)相结合的有效并行轴,适用于密集和混合专家模型。
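The gated-MLP schedule described above (weight shards circulating in a ring while partial outputs accumulate locally) can be simulated in a single process. The sketch below assumes the weights are sharded along the hidden (feed-forward) dimension and uses SiLU gating; both are assumptions, since the abstract does not specify them, and no real communication is performed.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def gated_mlp(x, w_gate, w_up, w_down):
    """Reference gated MLP: (silu(x @ Wg) * (x @ Wu)) @ Wd."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def ring_gated_mlp(x_shards, gate_shards, up_shards, down_shards):
    """Simulate R ranks; rank r owns sequence shard r and weight shard r.

    At step t, rank r is assumed to hold weight shard (r + t) % R, as if
    shards were passed around a ring. Each rank adds that shard's partial
    output for its own tokens to a local accumulator.
    """
    num_ranks = len(x_shards)
    d_model = down_shards[0].shape[1]
    outputs = [np.zeros((x.shape[0], d_model)) for x in x_shards]
    for step in range(num_ranks):
        for rank in range(num_ranks):
            s = (rank + step) % num_ranks
            outputs[rank] += gated_mlp(
                x_shards[rank], gate_shards[s], up_shards[s], down_shards[s]
            )
    return outputs

rng = np.random.default_rng(0)
ranks, seq_len, d_model, d_ff = 4, 16, 8, 12
x = rng.standard_normal((seq_len, d_model))
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))

sharded = ring_gated_mlp(
    np.split(x, ranks, axis=0),       # tokens sharded over ranks
    np.split(w_gate, ranks, axis=1),  # hidden dimension sharded over ranks
    np.split(w_up, ranks, axis=1),
    np.split(w_down, ranks, axis=0),
)
print(np.allclose(np.concatenate(sharded, axis=0), gated_mlp(x, w_gate, w_up, w_down)))
```

The accumulation is exact because the gating is element-wise over the hidden dimension, so the full output is the sum of the per-shard partial outputs.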
cs.CL / 28 / 2604.26310

Benchmarking PyCaret AutoML Against BiLSTM for Fine-Grained Emotion Classification: A Comparative Study on 20-Class Emotion Detection

基于PyCaret的自动机器学习与双向长短期记忆网络在细粒度情感分类中的基准比较:针对20类情感检测的比较研究
Siregar, Arya Muda, Siahaan, Arielva Simon, Simbolon, Haikal Fransisko, Muthoharoh, Luluk, Satria, Ardika, Manullang, Martin C. T.
Abstract
Fine-grained emotion classification, which identifies specific emotional states such as happiness, anger, sadness, and fear, remains a challenging task in natural language processing. This study benchmarks classical machine learning and deep learning approaches for 20-class emotion classification using the 20-Emotion Text Classification Dataset containing 79,595 English sentences. On the machine learning side, Logistic Regression, Multinomial Naive Bayes, and Support Vector Machine are evaluated using TF-IDF features. On the deep learning side, Bidirectional Long Short-Term Memory, Gated Recurrent Unit, and a lightweight Transformer implemented in PyTorch are compared. The results show that BiLSTM achieves the best overall performance with 89% accuracy and a weighted F1-score of 0.89, slightly outperforming the best machine learning model, SVM, which reaches 88.11% accuracy. The findings indicate that while traditional machine learning models remain competitive and computationally efficient, sequence-based deep learning models better capture contextual emotional cues in text.
Chinese Translation
细粒度情感分类旨在识别特定的情感状态,如快乐、愤怒、悲伤和恐惧,这在自然语言处理领域仍然是一项具有挑战性的任务。本研究对经典机器学习和深度学习方法在20类情感分类中的表现进行了基准测试,使用的20类情感文本分类数据集包含79,595个英语句子。在机器学习方面,评估了逻辑回归、多项式朴素贝叶斯和支持向量机,使用TF-IDF特征进行训练。在深度学习方面,比较了双向长短期记忆网络(BiLSTM)、门控循环单元(GRU)和在PyTorch中实现的轻量级Transformer。结果显示,BiLSTM以89%的准确率和0.89的加权F1分数取得了最佳整体表现,略微超过了表现最佳的机器学习模型SVM,其准确率为88.11%。研究结果表明,尽管传统机器学习模型在竞争力和计算效率上仍然表现良好,但基于序列的深度学习模型在文本中更好地捕捉了上下文情感线索。
cs.CL / 29 / 2604.26312

Classification of Public Opinion on the Free Nutritional Meal Program on YouTube Media Using the LSTM Method

基于LSTM方法对YouTube媒体上免费营养餐计划的公众舆论分类
Putri, Berliana Enda, Amelia, Lisa Diani, Zaiddan, Muhammad Zaky, Muthoharoh, Luluk, Satria, Ardika, Manullang, Martin Clinton Tosima
Abstract
Public opinion towards the Free Nutritious Meal Program (MBG) on YouTube social media reflects diverse community responses. This study applies the Long Short-Term Memory (LSTM) method to classify sentiments from 7,733 YouTube comments. The results show that the LSTM model achieves 89% accuracy, with strong performance on negative sentiment (F1-score 0.94) but weaker performance on positive sentiment (F1-score 0.55) due to class imbalance, as negative data account for 87.7% of the dataset. These findings confirm the effectiveness of LSTM for sentiment analysis of Indonesian text while highlighting the challenge of imbalanced data. This research contributes to social media-based public policy evaluation.
Chinese Translation
公众对YouTube社交媒体上免费营养餐计划(MBG)的舆论反映了社区的多样化反应。本研究应用长短期记忆(LSTM)方法对7,733条YouTube评论进行情感分类。结果表明,LSTM模型的准确率达到89%,在负面情感上的表现较强(F1-score 0.94),但在正面情感上的表现较弱(F1-score 0.55),这主要是由于类别不平衡,负面数据占数据集的87.7%。这些发现证实了LSTM在印尼文本情感分析中的有效性,同时突显了数据不平衡的挑战。本研究为基于社交媒体的公共政策评估提供了贡献。
cs.CL / 30 / 2604.26319

A Systematic Comparison of Prompting and Multi-Agent Methods for LLM-based Stance Detection

基于大型语言模型的立场检测中提示和多智能体方法的系统比较
Dai, Genan, Chen, Zini, Yang, Yi, Zhang, Bowen
Abstract
Stance detection identifies the attitude of a text author toward a given target. Recent studies have explored various LLM-based strategies for this task, from zero-shot prompting to multi-agent debate. However, existing works differ in data splits, base models, and evaluation protocols, making fair comparison difficult. We conduct a systematic comparison that evaluates five methods across two categories -- prompt-based inference (Direct Prompting, Auto-CoT, StSQA) and agent-based debate (COLA, MPRF) -- on four datasets with 14 subtasks, using 15 LLMs from six model families with parameter sizes from 7B to 72B+. Our experiments yield several findings. First, on all models with complete results, the best prompt-based method outperforms the best agent-based method, while agent methods require 7 to 12 times more API calls per sample. Second, model scale has a larger impact on performance than method choice, with gains plateauing around 32B. Third, reasoning-enhanced models (DeepSeek-R1) do not consistently outperform general models of the same size on this task.
Chinese Translation
立场检测旨在识别文本作者对特定目标的态度。近期研究探讨了多种基于大型语言模型(LLM)的策略来完成这一任务,从零样本提示到多智能体辩论。然而,现有研究在数据划分、基础模型和评估协议上存在差异,使得公平比较变得困难。我们进行了一项系统比较,评估了五种方法,分为两类——基于提示的推理(直接提示、自动链推理(Auto-CoT)、StSQA)和基于智能体的辩论(COLA、MPRF),在四个数据集上进行14个子任务的评估,使用来自六个模型家族的15个LLM,参数规模从7B到72B+不等。我们的实验得出了一些发现。首先,在所有有完整结果的模型中,最佳的基于提示的方法优于最佳的基于智能体的方法,而智能体方法每个样本需要的API调用次数是基于提示方法的7到12倍。其次,模型规模对性能的影响大于方法选择,性能提升在32B左右趋于平稳。第三,增强推理能力的模型(DeepSeek-R1)在这一任务上并不总是优于同规模的一般模型。
cs.CL / 31 / 2604.26328

DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis

DSIPA:通过情感不变模式的差异分析检测大型语言模型生成的文本
Li, Siyuan, Wulianghai, Aodu, Li, Guangyan, Lin, Xi, Mao, Qinghua, Chen, Yuliang, Wu, Jun, Li, Jianhua
Abstract
The rapid advancement of large language models (LLMs) presents new security challenges, particularly in detecting machine-generated text used for misinformation, impersonation, and content forgery. Most existing detection approaches struggle with robustness against adversarial perturbation, paraphrasing attacks, and domain shifts, often requiring restrictive access to model parameters or large labeled datasets. To address this, we propose DSIPA, a novel training-free framework that detects LLM-generated content by quantifying sentiment distributional stability under controlled stylistic variation. It is based on the observation that LLMs typically exhibit more emotionally consistent outputs, while human-written texts display greater affective variation. Our framework operates in a zero-shot, black-box manner, leveraging two unsupervised metrics, sentiment distribution consistency and sentiment distribution preservation, to capture these intrinsic behavioral asymmetries without the need for parameter updates or probability access. Extensive experiments are conducted on state-of-the-art proprietary and open-source models, including GPT-5.2, Gemini-1.5-pro, Claude-3, and LLaMa-3.3. Evaluations on five domains (news articles, programming code, student essays, academic papers, and community comments) demonstrate that DSIPA improves F1 detection scores by up to 49.89% over baseline methods. The framework exhibits superior generalizability across domains and strong resilience to adversarial conditions, providing a robust and interpretable behavioral signal for secure content identification in the evolving LLM landscape.
Chinese Translation
大型语言模型(LLMs)的快速发展带来了新的安全挑战,特别是在检测用于虚假信息、冒充和内容伪造的机器生成文本方面。现有的大多数检测方法在对抗扰动、释义攻击和领域转移方面的鲁棒性较差,通常需要对模型参数或大量标注数据集的限制性访问。为了解决这个问题,我们提出了DSIPA,这是一种新颖的无训练框架,通过量化在受控风格变化下的情感分布稳定性来检测LLM生成的内容。该框架基于这样一个观察:LLMs通常表现出更情感一致的输出,而人类撰写的文本则显示出更大的情感变化。我们的框架以零样本、黑箱的方式运作,利用两个无监督指标——情感分布一致性和情感分布保持性——来捕捉这些内在的行为不对称性,而无需参数更新或概率访问。我们在包括GPT-5.2、Gemini-1.5-pro、Claude-3和LLaMa-3.3等最先进的专有和开源模型上进行了广泛的实验。在新闻文章、编程代码、学生论文、学术论文和社区评论等五个领域的评估表明,DSIPA在F1检测分数上比基线方法提高了多达49.89%。该框架在各领域表现出优越的泛化能力和强大的对抗条件韧性,为在不断发展的LLM环境中提供安全内容识别的鲁棒且可解释的行为信号。
cs.CL / 32 / 2604.26351

A Dual-Task Paradigm to Investigate Sentence Comprehension Strategies in Language Models

一种双任务范式用于研究语言模型中的句子理解策略
Emura, Rei, Sugawara, Saku
Abstract
Language models (LMs) behave more like humans when their cognitive resources are restricted, particularly in predicting sentence processing costs such as reading times. However, it remains unclear whether such constraints similarly affect sentence comprehension strategies. Besides, existing methods do not directly target the balance between memory storage and sentence processing, which is central to human working memory. To address this issue, we propose a dual-task paradigm that combines an arithmetic computation task with a sentence comprehension task, such as "The 2 cocktail + blended 3 =..." Our experiments show that under dual-task conditions, GPT-4o, o3-mini, and o4-mini shift toward plausibility-based comprehension, mirroring humans' rational inference. Specifically, these models show a greater accuracy gap between plausible sentences (e.g., "The cocktail was blended by the bartender") and implausible sentences (e.g., "The bartender was blended by the cocktail") in the dual-task condition compared to the single-task conditions. These findings suggest that constraints on the balance between memory and processing resources promote rational inference in LMs. More broadly, they support the view that human-like sentence comprehension fundamentally arises from the allocation of limited cognitive resources.
Chinese Translation
当语言模型(LMs)的认知资源受到限制时,它们的行为更像人类,特别是在预测句子处理成本(如阅读时间)方面。然而,目前尚不清楚这种限制是否同样影响句子理解策略。此外,现有方法并未直接针对记忆存储与句子处理之间的平衡,而这一平衡对人类工作记忆至关重要。为了解决这一问题,我们提出了一种双任务范式,将算术计算任务与句子理解任务相结合,例如“2 鸡尾酒 + 混合 3 = ...”。我们的实验表明,在双任务条件下,GPT-4o、o3-mini 和 o4-mini 更倾向于基于合理性的理解,反映了人类的理性推理。具体而言,这些模型在双任务条件下,合理句子(例如“鸡尾酒是由调酒师调制的”)与不合理句子(例如“调酒师是由鸡尾酒调制的”)之间的准确性差距更大,相较于单任务条件。这些发现表明,记忆与处理资源之间的平衡限制促进了语言模型中的理性推理。更广泛地说,它们支持了人类类似的句子理解根本上源于有限认知资源分配的观点。
cs.CL / 33 / 2604.26355

Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens

思维的速记:通过熵引导的超标记压缩大语言模型推理
Zhao, Zhenyu, Land, Sander, Bikel, Dan, Alshikh, Waseem
Abstract
Reasoning in Large Language Models incurs significant inference-time compute, yet the token-level information structure of reasoning traces remains underexplored. We observe that reasoning tokens split into two functional types: low-entropy structural tokens (recurring phrases that scaffold the reasoning process) and higher-entropy organic tokens (problem-specific content that drives toward a solution). This asymmetry motivates a simple, model-agnostic compression pipeline: apply cross-word BPE merges on a model's own reasoning traces to derive supertokens that capture frequent structural patterns, then teach the model to adopt them via supervised fine-tuning. Across three model families and five mathematical reasoning benchmarks, our approach shortens reasoning traces by 8.1% on average with no statistically significant accuracy loss on any model-benchmark pair. Beyond compression, supertokens act as interpretable reasoning-move annotations (backtracking, verification, strategy shifts), exposing the model's high-level strategy at a glance. Analyzing transitions between structural categories reveals systematic differences between correct and incorrect traces: correct traces show productive recovery (backtracking followed by strategy shifts and verification), while incorrect traces are dominated by confusion cycles (repeated hedging and unresolved contradictions). These diagnostic signals suggest applications in reward shaping and early stopping for RL-based reasoning training.
Chinese Translation
大语言模型的推理在推理时间上消耗了大量计算资源,但推理轨迹的令牌级信息结构仍未得到充分探索。我们观察到推理令牌分为两种功能类型:低熵的结构令牌(支撑推理过程的重复短语)和高熵的有机令牌(推动解决方案的特定问题内容)。这种不对称性激励我们提出一个简单的、与模型无关的压缩管道:对模型自身的推理轨迹应用跨词BPE合并,以推导出捕捉频繁结构模式的超标记,然后通过监督微调教会模型采用这些超标记。在三个模型系列和五个数学推理基准上,我们的方法平均缩短推理轨迹8.1%,且在任何模型-基准对上均未出现统计显著的准确性损失。除了压缩,超标记还充当可解释的推理步骤注释(回溯、验证、策略转变),一目了然地揭示模型的高层策略。分析结构类别之间的转变揭示了正确轨迹和错误轨迹之间的系统性差异:正确轨迹显示出有效的恢复(回溯后跟随策略转变和验证),而错误轨迹则被混淆循环主导(重复的犹豫和未解决的矛盾)。这些诊断信号暗示了在基于强化学习的推理训练中应用于奖励塑造和早期停止的潜在用途。
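The cross-word merge step mentioned above can be illustrated with a toy BPE loop that merges the most frequent adjacent pair across word boundaries. This sketch works on whitespace-split words rather than a real tokenizer's subword units, and the function names and example traces are hypothetical; it shows the general idea only, not the paper's pipeline.

```python
from collections import Counter

def merge_pair(seq, a, b):
    """Replace each adjacent occurrence of (a, b) in seq with one merged unit."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + " " + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_supertokens(traces, num_merges):
    """Greedily merge the most frequent adjacent pair, num_merges times."""
    seqs = [trace.split() for trace in traces]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing recurs, so no further compression is possible
            break
        merges.append((a, b))
        seqs = [merge_pair(seq, a, b) for seq in seqs]
    return merges, seqs

traces = [
    "wait let me check x",
    "wait let me check y",
    "so wait let me think",
]
merges, compressed = learn_supertokens(traces, num_merges=2)
print(compressed[0])
```

After two merges the recurring phrase "wait let me" becomes a single unit in every trace, while the problem-specific words stay separate.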
cs.CL / 34 / 2604.26361

Text Style Transfer with Machine Translation for Graphic Designs

用于图形设计的文本风格迁移与机器翻译
Budhauria, Deergh Singh, Jain, Sanyam, Agarwal, Rishav, King, Tracy
Abstract
Globalization of graphic designs such as those used in marketing materials and magazines is increasingly important for communication to broad audiences. To accomplish this, the textual content in the graphic designs needs to be accurately translated and have the text styling preserved in order to fit visually into the design. Preserving text styling requires high-accuracy word alignment between the original and the translated text. The problem of word alignment between source and translated text has long been known. The industry standards for extracting word alignments are defined by Giza++ and attention probabilities from neural machine translation (NMT) models. In this paper, we explore three new methods to tackle the word alignment problem for transferring text styles from the source to the translated text. The proposed methods are developed on top of commercially available NMT and LLM translation technologies. They include: NMT with custom input and output tags for text styling; LLM with custom input and output tags; and a hybrid that uses NMT for translation followed by an LLM that uses unigram mappings. To analyze the performance of these solutions, their alignment results are compared with the results of an attention head approach to gauge their usability in graphic design applications. Interestingly, the strong attention head baseline proves more accurate than the LLM or NMT approach and on par with the hybrid NMT+LLM approach.
Chinese Translation
图形设计的全球化,例如用于营销材料和杂志的设计,对于与广泛受众的沟通变得越来越重要。为了实现这一目标,图形设计中的文本内容需要被准确翻译,并保持文本样式,以便在视觉上融入设计中。保持文本样式需要原始文本与翻译文本之间的高精度词对齐。源文本与翻译文本之间的词对齐问题早已为人所知。提取词对齐的行业标准由 Giza++ 和神经机器翻译(NMT)模型的注意力概率定义。本文探讨了三种新方法,以解决从源文本到翻译文本的文本风格迁移中的词对齐问题。所提出的方法基于商业可用的 NMT 和大语言模型(LLM)翻译技术开发,包括:带有自定义输入和输出标签的 NMT 以实现文本样式;带有自定义输入和输出标签的 LLM;以及一种混合方法,先使用 NMT 进行翻译,然后使用 LLM 结合单元映射。为了分析这些解决方案的性能,其对齐结果与注意力头方法的结果进行了比较,以评估其在图形设计应用中的可用性。有趣的是,注意力头的强基线证明比 LLM 或 NMT 方法更准确,并且与混合 NMT+LLM 方法相当。
cs.CL / 35 / 2604.26375

SG-UniBuc-NLP at SemEval-2026 Task 6: Multi-Head RoBERTa with Chunking for Long-Context Evasion Detection

SG-UniBuc-NLP在SemEval-2026任务6中的表现:基于分块的多头RoBERTa用于长文本规避检测
Stefan, Gabriel, Nisioi, Sergiu
Abstract
We describe our system for SemEval-2026 Task 6 (CLARITY: Unmasking Political Question Evasions), which classifies English political interview responses by coarse-grained clarity (3-way) and fine-grained evasion strategy (9-way). Since responses frequently exceed the 512-token limit of standard Transformer encoders, we apply an overlapping sliding-window chunking strategy with element-wise Max-Pooling aggregation over chunk representations. A shared RoBERTa-large encoder supplies two task-specific heads trained jointly via a multi-task objective, with inference-time ensembling over 7-fold stratified cross-validation. Our system achieves a Macro-F1 of 0.80 on Subtask 1 and 0.51 on Subtask 2, ranking 11th in both subtasks.
Chinese Translation
我们描述了在SemEval-2026任务6(CLARITY:揭示政治问题规避)中使用的系统,该系统通过粗粒度的清晰度(3类)和细粒度的规避策略(9类)对英语政治采访回应进行分类。由于回应经常超过标准Transformer编码器的512-token限制,我们采用了重叠滑动窗口分块策略,并对分块表示进行元素级的最大池化聚合。一个共享的RoBERTa-large编码器提供了两个特定任务的头部,这些头部通过多任务目标共同训练,并在推理时通过7折分层交叉验证进行集成。我们的系统在子任务1上取得了0.80的宏F1值,在子任务2上取得了0.51的宏F1值,在两个子任务中均排名第11。
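The chunking strategy described above (overlapping sliding windows followed by element-wise max-pooling over chunk representations) can be sketched as follows. The window and stride values are assumptions, since the abstract gives only the 512-token limit, and random vectors stand in for the encoder outputs.

```python
import numpy as np

def sliding_windows(token_ids, window=512, stride=256):
    """Split a token sequence into overlapping windows that cover every token."""
    if len(token_ids) <= window:
        return [list(token_ids)]
    chunks, start = [], 0
    while True:
        chunks.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break
        start += stride
    return chunks

def max_pool_chunks(chunk_vectors):
    """Element-wise maximum over a list of equal-length chunk vectors."""
    return np.max(np.stack(chunk_vectors), axis=0)

# Hypothetical usage: a 1,000-token response and 4-dimensional chunk vectors.
tokens = list(range(1000))
chunks = sliding_windows(tokens)
rng = np.random.default_rng(0)
chunk_vectors = [rng.standard_normal(4) for _ in chunks]  # stand-in for encoder outputs
document_vector = max_pool_chunks(chunk_vectors)
print(len(chunks), document_vector.shape)
```

In a real system the window would also need to leave room for the encoder's special tokens, and the pooled vector would feed the two task-specific heads.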
cs.CL / 36 / 2604.26382

Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI

复杂多模态文档处理管道的基准测试:企业人工智能的统一评估框架
Singh, Saurabh K., Raj, Sachin
Abstract
Most enterprise document AI today is a pipeline. Parse, index, retrieve, generate. Each of those stages has been studied to death on its own -- what's still hard is evaluating the system as a whole. We built EnterpriseDocBench to take a swing at it: parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness, all on the same corpus. The corpus is built from public, permissively licensed documents across six enterprise domains (five represented in the current pilot). We ran three pipelines through it -- BM25, dense embedding, and a hybrid -- all with the same GPT-5 generator. The headline numbers: hybrid retrieval narrowly beats BM25 (nDCG@5 of 0.92 vs. 0.91), and both beat dense embedding (0.83). Hallucination doesn't grow monotonically with document length -- short documents and very long ones both hallucinate more than medium ones (28.1% and 23.8% vs. 9.2%). Cross-stage correlations are very weak: parsing->retrieval r=0.14, parsing->generation r=0.17, retrieval->generation 0.02. If quality were cascading the way most of us assume, those numbers would be much higher; they aren't. Design caveats are real (parsing fixed, generator shared, automated proxy metrics) and we don't oversell the result. One result that genuinely surprised us: factual accuracy on stated claims is 85.5%, but answer completeness averages 0.40. The system is right when it answers -- it just leaves things out. That gap matters more for real deployments than the headline accuracy number does. We also describe three reference architectures (ColPali, ColQwen2, agentic complexity-based routing) which are not yet integrated end-to-end. Framework, metrics, baselines, and collection scripts will be released open-source on acceptance.
Chinese Translation
目前大多数企业文档人工智能系统都是基于管道的。解析、索引、检索、生成。每个阶段都被单独研究得非常透彻——真正困难的是评估整个系统的表现。我们构建了EnterpriseDocBench来尝试解决这个问题:在同一语料库上评估解析的准确性、索引的效率、检索的相关性和生成的可靠性。该语料库由六个企业领域的公共、许可文档构成(目前试点中有五个领域)。我们通过三个管道进行了测试——BM25、密集嵌入和混合模型——所有管道均使用相同的GPT-5生成器。主要结果显示:混合检索略微优于BM25(nDCG@5为0.92对比0.91),而两者均优于密集嵌入(0.83)。幻觉现象并不随着文档长度的增加而单调增长——短文档和非常长的文档的幻觉现象均高于中等长度文档(28.1%和23.8%对比9.2%)。跨阶段的相关性非常弱:解析->检索 r=0.14,解析->生成 r=0.17,检索->生成 0.02。如果质量是像我们大多数人假设的那样级联的,这些数字会高得多;但事实并非如此。设计上的警告是切实存在的(解析固定,生成器共享,自动化代理指标),我们并没有过度宣传结果。有一个结果确实让我们感到惊讶:对所述声明的事实准确性为85.5%,但答案的完整性平均为0.40。系统在回答时是正确的——但它确实遗漏了一些内容。这个差距对于实际部署比头条准确性数字更为重要。我们还描述了三种参考架构(ColPali、ColQwen2、基于代理复杂性的路由),这些架构尚未实现端到端集成。框架、指标、基线和收集脚本将在接受后以开源形式发布。
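For readers less familiar with the retrieval metric quoted above, here is a minimal sketch of nDCG@5. It assumes the common linear-gain formulation and computes the ideal ranking from the retrieved list's own relevance labels; the benchmark's exact variant is not given in the abstract, so treat both choices as assumptions.

```python
import math
from typing import Sequence


def dcg_at_k(rels: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k relevance labels (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))


def ndcg_at_k(rels: Sequence[float], k: int = 5) -> float:
    """DCG normalised by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical graded relevance labels for one query, in ranked order.
    print(ndcg_at_k([3, 2, 3, 0, 1]))
```

A score of 1.0 means the retrieved documents are already in the best possible order for their labels; swapping a relevant document down the list lowers the score through the logarithmic discount.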
cs.CL / 37 / 2604.26412

When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

当隐藏状态漂移时:KV缓存能否拯救长距离推测解码?
Liu, Tianyu, Shen, Yuhao, Hu, Xinyi, Zhang, Baolin, Zhang, Hengxin, Dai, Jun, Zhang, Jun, Ge, Shuang, Chen, Lei, Li, Yue, Wan, MingCheng
Abstract
Speculative decoding accelerates LLM inference, but SOTA hidden-state-based drafters suffer from long-range decay: draft accuracy degrades as the speculative step increases. Existing work attributes this decay to train-inference mismatch and proposes test-time training (TTT) as a remedy, yet we observe that long-range decay persists even in TTT-trained drafters. We revisit long-range decay from the perspective of context information preservation. In hidden-state reuse, we argue the target hidden state acts as a biased context compression: it aggregates historical token information according to the attention query at the current position, yielding a compact representation optimized for immediate next-token prediction. This compression can suppress information less relevant to the current query but important for later speculative steps. In contrast, the target model's KV cache serves as an explicit context, retaining the complete set of token-wise KV representations. We therefore posit the KV-Reuse Hypothesis: allowing the draft model to reuse the target KV cache can provide richer signals for long-horizon drafting. To test this hypothesis, we introduce KVShot, a diagnostic framework that compares three reuse paradigms: hidden-only, KV-only, and hybrid. Extensive evaluations on Qwen3-8B show that KV-Reuse improves long-range acceptance, although end-to-end speedups remain marginal under current training pipelines. Our analysis identifies two key structural bottlenecks: shallow drafters struggle to estimate target queries accurately, and draft-side KV projections receive sparse gradient signals. These findings suggest that realizing the full potential of KV-aware decoding requires moving beyond TTT toward block-wise training paradigms. By exposing these bottlenecks, KVShot provides a foundational diagnostic testbed and a clear roadmap for designing next-generation inference architectures.
Chinese Translation
推测解码加速了大规模语言模型(LLM)的推理,但基于隐藏状态的最先进(SOTA)草拟器在长距离推理中存在衰减问题:随着推测步骤的增加,草拟准确性下降。现有研究将这种衰减归因于训练与推理的不匹配,并提出测试时训练(TTT)作为补救措施,然而我们观察到,即使在TTT训练的草拟器中,长距离衰减依然存在。我们从上下文信息保留的角度重新审视长距离衰减。在隐藏状态重用中,我们认为目标隐藏状态充当了一种偏置的上下文压缩:它根据当前位点的注意力查询聚合历史标记信息,从而生成一个针对下一个标记预测优化的紧凑表示。这种压缩可能会抑制与当前查询相关性较低但对后续推测步骤重要的信息。相对而言,目标模型的KV缓存作为一种显式上下文,保留了完整的标记级KV表示集。因此,我们提出了KV重用假设:允许草拟模型重用目标KV缓存可以为长距离草拟提供更丰富的信号。为了验证这一假设,我们引入了KVShot,一个比较三种重用范式的诊断框架:仅隐藏状态、仅KV和混合。对Qwen3-8B的广泛评估表明,KV重用提高了长距离接受度,尽管在当前训练流程下端到端的加速仍然有限。我们的分析识别出两个关键的结构瓶颈:浅层草拟器难以准确估计目标查询,而草拟侧的KV投影接收稀疏的梯度信号。这些发现表明,实现KV感知解码的全部潜力需要超越TTT,向块级训练范式发展。通过揭示这些瓶颈,KVShot提供了一个基础的诊断测试平台和设计下一代推理架构的清晰路线图。
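The "long-range decay" described above is a statement about how draft accuracy changes with the speculative step. The sketch below shows one simple way to measure it under greedy verification. It is an illustration only: the function names are hypothetical, and it assumes each round pairs the drafted tokens with the target model's greedy tokens for the same prefix.

```python
from typing import List, Sequence, Tuple


def accepted_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Number of leading draft tokens that match the target's greedy tokens."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


def per_step_accuracy(
    rounds: Sequence[Tuple[Sequence[int], Sequence[int]]], steps: int
) -> List[float]:
    """Fraction of rounds in which the draft token at each step equals the target token."""
    hits = [0] * steps
    counts = [0] * steps
    for draft, target in rounds:
        for k in range(min(steps, len(draft), len(target))):
            counts[k] += 1
            hits[k] += int(draft[k] == target[k])
    return [h / c if c else 0.0 for h, c in zip(hits, counts)]


if __name__ == "__main__":
    # Hypothetical token ids for two speculation rounds.
    rounds = [([1, 2, 3, 4], [1, 2, 3, 9]), ([5, 6, 7, 8], [5, 6, 0, 0])]
    print([accepted_length(d, t) for d, t in rounds])
    print(per_step_accuracy(rounds, steps=4))
```

A drafter with long-range decay would show per-step accuracy falling as the step index grows, which in turn caps the accepted length per round.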
cs.CL / 38 / 2604.26417

EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses

EmoTransCap:情感过渡感知语音字幕的数据集与流程
Xu, Shuhao, Hu, Yifan, Wu, Jingjing, Du, Zhihao, Lian, Zheng, Liu, Rui
Abstract
Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. While recent advances in speech emotion captioning (SEC) have improved fine-grained emotional modeling, existing systems remain limited to static, single-emotion characterization within isolated sentences, neglecting dynamic emotional transitions at the discourse level. To address this gap, we propose Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. To construct a dataset rich in emotion transitions while enabling scalable expansion, we design an automated pipeline for dataset creation. This is the first large-scale dataset explicitly designed to capture discourse-level emotion transitions. To generate semantically rich descriptions, we incorporate acoustic attributes and temporal cues from discourse-level speech. Our Multi-Task Emotion Transition Recognition (MTETR) model performs joint emotion transition detection and diarization. Leveraging the semantic analysis capabilities of LLMs, we produce two annotation versions: descriptive and instruction-oriented. These data and annotations offer a valuable resource for advancing emotion perception and emotional expressiveness. The dataset enables speech captions that capture emotional transitions, facilitating temporal-dynamic and fine-grained emotion understanding. We also introduce a controllable, transition-aware emotional speech synthesis system at the discourse level, enhancing anthropomorphic emotional expressiveness and supporting emotionally intelligent conversational agents.
Chinese Translation
情感感知和适应性表达是人机交互中的基本能力。尽管近年来语音情感字幕(SEC)的进展改善了细粒度情感建模,但现有系统仍然局限于孤立句子中的静态单一情感特征,忽视了话语层面上的动态情感过渡。为了解决这一问题,我们提出了情感过渡感知语音字幕(EmoTransCap)这一范式,旨在将时间情感动态与话语层面的语音描述相结合。为了构建一个丰富情感过渡的数据集并支持可扩展性,我们设计了一个自动化的数据集创建流程。这是第一个明确旨在捕捉话语层面情感过渡的大规模数据集。为了生成语义丰富的描述,我们结合了话语层面语音的声学属性和时间线索。我们的多任务情感过渡识别(MTETR)模型执行情感过渡检测和发言人分离的联合任务。利用大型语言模型(LLMs)的语义分析能力,我们生成了两种注释版本:描述性和指令导向。这些数据和注释为推动情感感知和情感表现力提供了宝贵的资源。该数据集使得语音字幕能够捕捉情感过渡,促进时间动态和细粒度情感理解。我们还在话语层面引入了一个可控的、过渡感知的情感语音合成系统,增强了类人情感表现力,并支持情感智能的对话代理。
cs.CL / 39 / 2604.26456

Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation

Naamah:通过DBpedia种子和大规模语言模型生成的合成梵语命名实体识别语料库
P, Akhil Rajeev, Kulkarni, Annarao
Abstract
The digitisation of classical Sanskrit literature is impeded by a scarcity of annotated resources, particularly for Named Entity Recognition. While recent methodologies utilise generic Large Language Models (LLMs) for data augmentation, these approaches remain prone to error and often lack the reasoning depth required for classical grammar. In this work, we introduce Naamah, a high quality silver standard Sanskrit NER dataset comprising 102,942 sentences. We propose a methodology that combines entity extraction from DBpedia with the generative capabilities of a 24B parameter hybrid reasoning model to create grammatically natural and synthetically diverse training data. We utilize this dataset to benchmark two transformer architectures: the massive multilingual XLM RoBERTa and the parameter efficient IndicBERTv2.
Chinese Translation
古典梵语文学的数字化受到注释资源稀缺的制约,尤其是在命名实体识别方面。尽管最近的方法利用通用的大型语言模型(LLMs)进行数据增强,但这些方法仍然容易出错,并且通常缺乏古典语法所需的推理深度。在本研究中,我们介绍了Naamah,一个高质量的银标准梵语命名实体识别数据集,包含102,942个句子。我们提出了一种将DBpedia中的实体提取与一个具有240亿参数的混合推理模型的生成能力相结合的方法,以创建语法自然且合成多样的训练数据。我们利用该数据集对两种变换器架构进行基准测试:大规模多语言的XLM RoBERTa和参数高效的IndicBERTv2。
cs.CL / 40 / 2604.26460

Theory-Grounded Evaluation Exposes the Authorship Gap in LLM Personalization

基于理论的评估揭示了大型语言模型个性化中的作者差距
Sawant, Yash Ganpat
Abstract
Stylistic personalization - making LLMs write in a specific individual's style, rather than merely adapting to task preferences - lacks evaluation grounded in authorship science. We show that grounding evaluation in authorship verification theory transforms what benchmarks can measure. Drawing on three measurement traditions - LUAR, a trained authorship verification model; an LLM-as-judge with decoupled trait matching; and classical function-word stylometrics - we evaluate four inference-time personalization methods across 50 authors and 1,000 generations. The theory-grounded metric, LUAR, provides what ad hoc alternatives cannot: calibrated baselines, with a human ceiling of 0.756 and a cross-author floor of 0.626, that give scores absolute meaning. All methods score below this floor, from 0.484 to 0.508, exposing an authorship gap invisible to uncalibrated metrics. The three metrics produce near-zero pairwise correlations, with absolute r less than 0.07, confirming that without theoretical grounding, metric choice determines conclusions: an LLM judge declares a clear winner while LUAR finds no meaningful differentiation. These findings demonstrate the theory-benchmark cycle in action: authorship theory exposes evaluation failures that ad hoc benchmarks miss.
Chinese Translation
风格个性化——使大型语言模型(LLMs)以特定个体的风格写作,而不仅仅是适应任务偏好——缺乏基于作者科学的评估。我们展示了将评估基于作者验证理论如何改变基准测试的测量能力。基于三种测量传统——LUAR,一个经过训练的作者验证模型;一个具有解耦特征匹配的LLM作为评判者;以及经典的功能词风格计量学——我们对50位作者和1,000次生成进行了四种推理时个性化方法的评估。基于理论的指标LUAR提供了临时替代方案所无法提供的东西:经过校准的基准,其人类上限为0.756,跨作者下限为0.626,赋予分数绝对意义。所有方法的得分均低于这一下限,范围在0.484到0.508之间,揭示了未经过校准的指标无法察觉的作者差距。这三种指标产生了近乎零的成对相关性,绝对r值小于0.07,确认了没有理论基础的情况下,指标选择决定了结论:一个LLM评判者宣称有明显赢家,而LUAR则发现没有有意义的差异。这些发现展示了理论与基准的循环作用:作者理论揭示了临时基准所忽视的评估失败。
cs.CL / 41 / 2604.26500

StarDrinks: An English and Korean Test Set for SLU Evaluation in a Drink Ordering Scenario

StarDrinks:用于饮品订购场景的英语和韩语SLU评估测试集
Boito, Marcely Zanon, Brun, Caroline, Kim, Inyoung, Proux, Denys, Ait-Mokhtar, Salah, Lagos, Nikolaos, Meunier, Jean-Luc, Calapodescu, Ioan
Abstract
LLMs and speech assistants are increasingly used for task-oriented interactions, yet their evaluation often relies on controlled scenarios that fail to capture the variability and complexity of real user requests. Drink ordering, for example, involves diverse named entities, drink types, sizes, customizations, and brand-specific terminology, as well as spontaneous speech phenomena such as hesitations and self-corrections. To address this gap, we introduce StarDrinks, a test set in English and Korean containing speech utterances features, transcriptions, and annotated slots. Our dataset supports speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR evaluation, providing a realistic benchmark for model robustness and generalization in a linguistically rich, real-world task.
Chinese Translation
大型语言模型(LLMs)和语音助手在任务导向的交互中越来越多地被使用,但它们的评估往往依赖于控制场景,这无法捕捉到真实用户请求的多样性和复杂性。例如,饮品订购涉及多种命名实体、饮品类型、尺寸、定制选项以及品牌特定术语,还包括犹豫和自我修正等自发语言现象。为了解决这一问题,我们推出了StarDrinks,一个包含英语和韩语的测试集,包含语音发声特征、转录和注释槽位。我们的数据集支持语音到槽位的SLU、转录到槽位的NLU以及语音到转录的ASR评估,为模型在语言丰富的真实任务中的鲁棒性和泛化能力提供了现实的基准。
cs.CL / 42 / 2604.26501

Tree-of-Text: A Tree-based Prompting Framework for Table-to-Text Generation in the Sports Domain

文本树:一种基于树的提示框架,用于体育领域的表格到文本生成
Chiang, Shang-Hsuan, Yang, Tsan-Tsung, Yen, An-Zi, Peng, Wen-Chih
Abstract
Generating sports game reports from structured tables is a complex table-to-text task that demands both precise data interpretation and fluent narrative generation. Traditional model-based approaches require large, annotated datasets, while prompt-based methods using large language models (LLMs) often struggle with hallucination due to weak table comprehension. To overcome these challenges, we propose Tree-of-Text, a tree-structured prompting framework that guides LLMs through a three-stage generation process: (1) Content Planning, where relevant operations and arguments are selected from the input tables; (2) Operation Execution, which breaks down large tables into manageable sub-tables; and (3) Content Generation, where short textual outputs are merged and rewritten into a cohesive report. Experiments show that our method outperforms existing methods on ShuttleSet+, leads in RG and CO metrics on RotoWire-FG, and excels in CS and CO on MLB with roughly 40% of the time and cost of Chain-of-Table. These results demonstrate the effectiveness and efficiency of Tree-of-Text and suggest a promising direction for prompt-based table-to-text generation in the sports domain.
Chinese Translation
从结构化表格生成体育比赛报告是一项复杂的表格到文本任务,既需要精准的数据解读,又需要流畅的叙述生成。传统的基于模型的方法需要大量的标注数据集,而基于提示的方法使用大型语言模型(LLMs)时,常常因表格理解能力不足而出现幻觉。为了解决这些挑战,我们提出了文本树(Tree-of-Text),一种树结构的提示框架,指导LLMs通过三个阶段的生成过程:(1)内容规划,在此阶段从输入表格中选择相关的操作和参数;(2)操作执行,将大型表格拆分为可管理的子表格;(3)内容生成,将短文本输出合并并重写为一个连贯的报告。实验表明,我们的方法在ShuttleSet+上优于现有方法,在RotoWire-FG上在RG和CO指标上领先,并在MLB上在CS和CO指标上表现出色,所需时间和成本约为Chain-of-Table的40%。这些结果证明了文本树的有效性和效率,并为体育领域基于提示的表格到文本生成提供了一个有前景的方向。
cs.CL / 43 / 2604.26506

SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

SafeReview:保护基于大型语言模型的评审系统免受对抗性隐性提示的攻击
Xin, Yuan, Weng, Yixuan, Zhu, Minjun, Ling, Ying, Qin, Chengwei, Hahn, Michael, Backes, Michael, Zhang, Yue, Yang, Linyi
Abstract
As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts -- adversarial instructions embedded in submissions to manipulate outcomes -- emerges as a critical threat to scholarly integrity. To counter this, we propose a novel adversarial framework where a Generator model, trained to create sophisticated attack prompts, is jointly optimized with a Defender model tasked with their detection. This system is trained using a loss function inspired by Information Retrieval Generative Adversarial Networks, which fosters a dynamic co-evolution between the two models, forcing the Defender to develop robust capabilities against continuously improving attack strategies. The resulting framework demonstrates significantly enhanced resilience to novel and evolving threats compared to static defenses, thereby establishing a critical foundation for securing the integrity of peer review.
Chinese Translation
随着大型语言模型(LLMs)越来越多地融入学术同行评审,它们对对抗性提示的脆弱性——即嵌入提交中的对抗性指令以操控结果——成为学术诚信的一个关键威胁。为此,我们提出了一种新颖的对抗框架,其中生成器模型(Generator)被训练以创建复杂的攻击提示,并与负责检测这些提示的防御模型(Defender)共同优化。该系统使用一种受信息检索生成对抗网络(Information Retrieval Generative Adversarial Networks)启发的损失函数进行训练,促进了两个模型之间的动态共同进化,迫使防御者不断发展出对持续改进的攻击策略的强大能力。与静态防御相比,所提出的框架在应对新型和不断演变的威胁方面表现出显著增强的韧性,从而为保护同行评审的完整性奠定了重要基础。
cs.CL / 44 / 2604.26514

Text-Utilization for Encoder-dominated Speech Recognition Models

文本利用在以编码器为主导的语音识别模型中的应用
Zeyer, Albert, Posielek, Tim, Schlüter, Ralf, Ney, Hermann
Abstract
This paper investigates efficient methods for utilizing text-only data to improve speech recognition, focusing on encoder-dominated models that facilitate faster recognition. We provide a comprehensive comparison of techniques to integrate text-only data, including modality matching and dynamic downsampling to reach text-level representations within the encoder. Our experiments on the LibriSpeech corpus show that a larger encoder with a smaller decoder can equal or surpass the performance of architectures with larger decoders. We demonstrate that simple configurations, such as random duration models, are often more effective than complex alternatives, significantly simplifying the training pipeline. All code and recipes are made publicly available.
Chinese Translation
本文探讨了有效利用仅包含文本数据的方法,以改善语音识别,重点关注能够加快识别速度的以编码器为主导的模型。我们对集成仅包含文本数据的技术进行了全面比较,包括模态匹配和动态下采样,以在编码器内达到文本级别的表示。我们在LibriSpeech语料库上的实验表明,较大的编码器配合较小的解码器可以达到或超过具有较大解码器的架构的性能。我们证明,简单的配置,如随机持续时间模型,往往比复杂的替代方案更有效,从而显著简化了训练流程。所有代码和配方均已公开。
cs.CL / 45 / 2604.26553

TLPO: Token-Level Policy Optimization for Mitigating Language Confusion in Large Language Models

TLPO:用于缓解大型语言模型语言混淆的令牌级策略优化
Choo, Jinho, Lee, JunSeung, Kim, Jimyeong, Song, Yeeho, Hong, S. K., Kwon, Yeong-Dae
Abstract
Large language models (LLMs) demonstrate strong multilingual capabilities, yet often fail to consistently generate responses in the intended language, exhibiting a phenomenon known as language confusion. Prior mitigation approaches based on sequence-level fine-tuning, such as DPO, ORPO, and GRPO, operate at the level of entire responses and can lead to unintended degradation of general model capabilities, motivating the need for more fine-grained alternatives. To address this, we introduce Token-Level Policy Optimization (TLPO), a fine-tuning framework designed to mitigate language confusion through localized, token-level updates. TLPO identifies error-prone positions, explores alternative candidate tokens, and updates the policy using a tailored objective to suppress error-inducing outputs at a granular level. This selective intervention enables effective mitigation of language confusion without compromising the model's general abilities. Experiments on multiple multilingual LLMs across diverse languages demonstrate that TLPO significantly outperforms baselines in improving language consistency while preserving downstream task accuracy.
Chinese Translation
大型语言模型(LLMs)展现出强大的多语言能力,但常常无法始终如一地生成预期语言的响应,出现一种称为语言混淆的现象。基于序列级微调的先前缓解方法,如 DPO、ORPO 和 GRPO,都是在整个响应的层面上操作,这可能导致模型一般能力的意外下降,因此需要更细粒度的替代方案。为了解决这个问题,我们提出了令牌级策略优化(TLPO),这是一种旨在通过局部的令牌级更新来缓解语言混淆的微调框架。TLPO 识别易出错的位置,探索替代候选令牌,并使用量身定制的目标更新策略,以在细粒度层面抑制导致错误的输出。这种选择性干预能够有效缓解语言混淆,而不损害模型的整体能力。在多种语言的多个多语言 LLM 上进行的实验表明,TLPO 在提高语言一致性方面显著优于基线,同时保持下游任务的准确性。
cs.CL / 46 / 2604.26568

Multimodal LLMs are not all you need for Pediatric Speech Language Pathology

多模态大语言模型并不是儿科语言病理学的全部需求
Fürst, Darren, Steindl, Sebastian, Schäfer, Ulrich
Abstract
Speech Sound Disorders (SSD) affect roughly five percent of children, yet speech-language pathologists face severe staffing shortages and unmanageable caseloads. We test a hierarchical approach to SSD classification on the granular multi-task SLPHelmUltraSuitePlus benchmark. We propose a cascading approach from binary classification to type and symptom classification. By fine-tuning Speech Representation Models (SRMs) and using targeted data augmentation, we mitigate biases found by previous work and improve upon all clinical tasks in the benchmark. We also treat Automatic Speech Recognition (ASR) with our data augmentation approach. Our results demonstrate that SRMs consistently outperform the LLM-based state of the art across all evaluated tasks by a large margin. We publish our models and code to foster future research.
Chinese Translation
语音声音障碍(SSD)影响大约5%的儿童,但语言病理学家面临严重的人手短缺和无法管理的病例负担。我们在细粒度的多任务SLPHelmUltraSuitePlus基准上测试了一种层次化的SSD分类方法。我们提出了一种从二元分类到类型和症状分类的级联方法。通过微调语音表示模型(SRM)并使用针对性的数据增强,我们减轻了先前研究中发现的偏见,并在基准中的所有临床任务上取得了改进。我们还使用我们的数据增强方法处理自动语音识别(ASR)。我们的结果表明,SRM在所有评估任务中始终以较大幅度超越基于大语言模型(LLM)的最新技术。我们发布了我们的模型和代码,以促进未来的研究。
cs.CL / 47 / 2604.26597

Translating Under Pressure: Domain-Aware LLMs for Crisis Communication

在压力下翻译:面向领域的语言模型用于危机沟通
Castaldo, Antonio, Staiano, Maria Carmen, Monti, Johanna, Castilho, Sheila, Chiusaroli, Francesca
Abstract
Timely and reliable multilingual communication is critical during natural and human-induced disasters, but developing effective solutions for crisis communication is limited by the scarcity of curated parallel data. We propose a domain-adaptive pipeline that expands a small reference corpus by retrieving and filtering data from general corpora. We use the resulting dataset to fine-tune a small language model for crisis-domain translation and then apply preference optimization to bias outputs toward CEFR A2-level English. Automatic and human evaluations show that this approach improves readability while maintaining strong adequacy. Our results indicate that simplified English, combined with domain adaptation, can function as a practical lingua franca for emergency communication when full multilingual coverage is not feasible.
Chinese Translation
在自然灾害和人为灾害期间,及时且可靠的多语言沟通至关重要,但由于缺乏经过精心策划的平行数据,开发有效的危机沟通解决方案受到限制。我们提出了一种领域自适应的流程,通过从一般语料库中检索和过滤数据来扩展小型参考语料库。我们使用生成的数据集对小型语言模型进行微调,以实现危机领域的翻译,然后应用偏好优化,使输出偏向于CEFR A2级别的英语。自动评估和人工评估表明,这种方法提高了可读性,同时保持了较强的充分性。我们的结果表明,简化英语结合领域适应可以作为在无法实现全面多语言覆盖时的紧急沟通的实用通用语。
cs.CL / 48 / 2604.26619

Zero-Shot to Full-Resource: Cross-lingual Transfer Strategies for Aspect-Based Sentiment Analysis

从零样本到全资源:面向基于方面的情感分析的跨语言迁移策略
Fehle, Jakob, Hellwig, Nils Constantin, Kruschwitz, Udo, Wolff, Christian
Abstract
Aspect-based Sentiment Analysis (ABSA) extracts fine-grained opinions toward specific aspects within text but remains largely English-focused despite major advances in transformer-based and instruction-tuned models. This work presents a multilingual evaluation of state-of-the-art ABSA approaches across seven languages (English, German, French, Dutch, Russian, Spanish, and Czech) and four subtasks (ACD, ACSA, TASD, ASQP). We systematically compare different transformer architectures under zero-resource, data-only, and full-resource settings, using cross-lingual transfer, code-switching and machine translation. Fine-tuned Large Language Models (LLMs) achieve the highest overall scores, particularly in complex generative tasks, while few-shot counterparts approach this performance in simpler setups, where smaller encoder models also remain competitive. Cross-lingual training on multiple non-target languages yields the strongest transfer for fine-tuned LLMs, while smaller encoder or seq-to-seq models benefit most from code-switching, highlighting architecture-specific strategies for multilingual ABSA. We further contribute two new German datasets, an adapted GERestaurant and the first German ASQP dataset (GERest), to encourage multilingual ABSA research beyond English.
Chinese Translation
基于方面的情感分析(ABSA)提取文本中针对特定方面的细粒度意见,但尽管在基于变换器和指令调优模型方面取得了重大进展,仍然主要集中于英语。本研究对七种语言(英语、德语、法语、荷兰语、俄语、西班牙语和捷克语)和四个子任务(ACD、ACSA、TASD、ASQP)进行了最先进的ABSA方法的多语言评估。我们在零资源、仅数据和全资源设置下,系统地比较了不同的变换器架构,使用了跨语言迁移、代码切换和机器翻译。经过微调的大型语言模型(LLMs)在整体得分上表现最佳,尤其是在复杂的生成任务中,而少样本模型在更简单的设置中接近这一性能,其中较小的编码器模型也保持竞争力。在多个非目标语言上进行跨语言训练为微调的LLMs带来了最强的迁移效果,而较小的编码器或序列到序列模型则最能从代码切换中受益,突显了面向多语言ABSA的架构特定策略。我们还贡献了两个新的德语数据集,一个是改编的GERestaurant,另一个是第一个德语ASQP数据集(GERest),以鼓励超越英语的多语言ABSA研究。
cs.CL / 49 / 2604.26622

OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

OCR-记忆:用于长时间跨度智能体记忆的光学上下文检索
Li, Jinze, Zhang, Yang, Yang, Xin, Qu, Jiayi, Xu, Jinfeng, Yang, Shuo, Ding, Junhua, Ngai, Edith Cheuk-Han
Abstract
Autonomous LLM agents increasingly operate in long-horizon, interactive settings where success depends on reusing experience accumulated over extended histories. However, existing agent memory systems are fundamentally constrained by text-context budgets: storing or revisiting raw trajectories is prohibitively token-expensive, while summarization and text-only retrieval trade token savings for information loss and fragmented evidence. To address this limitation, we propose Optical Context Retrieval Memory (OCR-Memory), a memory framework that leverages the visual modality as a high-density representation of agent experience, enabling retention of arbitrarily long histories with minimal prompt overhead at retrieval time. Specifically, OCR-Memory renders historical trajectories into images annotated with unique visual identifiers. OCR-Memory retrieves stored experience via a \emph{locate-and-transcribe} paradigm that selects relevant regions through visual anchors and retrieves the corresponding verbatim text, avoiding free-form generation and reducing hallucination. Experiments on long-horizon agent benchmarks show consistent gains under strict context limits, demonstrating that optical encoding increases effective memory capacity while preserving faithful evidence recovery.
Chinese Translation
自主大型语言模型(LLM)智能体越来越多地在长时间跨度的互动环境中运行,其成功依赖于重用在较长历史中积累的经验。然而,现有的智能体记忆系统在文本上下文预算上受到根本性限制:存储或重新访问原始轨迹会消耗过多的token,而摘要和仅基于文本的检索则以信息损失和证据碎片化为代价来节省token。为了解决这一限制,我们提出了光学上下文检索记忆(OCR-Memory),这是一个利用视觉模态作为智能体经验高密度表示的记忆框架,能够在检索时以最小的提示开销保留任意长的历史。具体而言,OCR-Memory将历史轨迹渲染为带有独特视觉标识符的图像。OCR-Memory通过一种“定位与转录”(locate-and-transcribe)范式检索存储的经验,该范式通过视觉锚点选择相关区域,并检索相应的逐字文本,避免了自由形式生成并减少了幻觉现象。在长时间跨度智能体基准测试中的实验表明,在严格的上下文限制下,OCR-Memory表现出一致的增益,证明光学编码在保持真实证据恢复的同时增加了有效记忆容量。
cs.CL / 50 / 2604.26630

SAGE: A Strategy-Aware Graph-Enhanced Generation Framework For Online Counseling

SAGE:一种策略感知的图增强生成框架用于在线咨询
Aharon, Eliya Naomi, Grimland, Meytal, Segal, Avi, Dayan, Loona Ben, Shenfeld, Inbar, Belz, Yossi Levi, Gal, Kobi
Abstract
Effective mental health counseling is a complex, theory-driven process requiring the simultaneous integration of psychological frameworks, real-time distress signals, and strategic intervention planning. This level of clinical reasoning is critical for safety and therapeutic effectiveness but is often missing in general-purpose Large Language Models (LLMs). We introduce SAGE (Strategy-Aware Graph-Enhanced), a novel framework designed to bridge the gap between structured clinical knowledge and generative AI. SAGE constructs a heterogeneous graph that unifies conversational dynamics with a psychologically grounded layer, explicitly anchoring interactions in a theory-driven lexicon. Our architecture first employs a Next Strategy Classifier to identify the optimal therapeutic intervention. Subsequently, a Graph-Aware Attention mechanism projects graph-derived structural signals into soft prompts, conditioning the LLM to generate responses that maintain clinical depth. Validated through both automated metrics and expert human evaluation, SAGE outperforms baselines in strategy prediction and recommended response quality. By providing actionable intervention recommendations, SAGE serves as a cutting-edge decision-support tool designed to augment human expertise in high-stakes crisis counseling.
Chinese Translation
有效的心理健康咨询是一个复杂的、以理论为驱动的过程,需要同时整合心理框架、实时的痛苦信号和战略干预规划。这种临床推理水平对于安全性和治疗效果至关重要,但在通用的大型语言模型(LLMs)中往往缺失。我们提出了SAGE(策略感知图增强),这是一个旨在弥合结构化临床知识与生成式人工智能之间差距的新框架。SAGE构建了一个异构图,将对话动态与心理学基础层统一,明确地将互动锚定在以理论为驱动的词汇中。我们的架构首先使用下一策略分类器(Next Strategy Classifier)来识别最佳治疗干预。随后,图感知注意机制(Graph-Aware Attention)将图派生的结构信号投影到软提示中,调节LLM生成保持临床深度的响应。通过自动化指标和专家人类评估验证,SAGE在策略预测和推荐响应质量方面优于基线。通过提供可操作的干预建议,SAGE作为一种前沿决策支持工具,旨在增强人类在高风险危机咨询中的专业能力。
cs.CL / 51 / 2604.26656

Differentially-Private Text Rewriting reshapes Linguistic Style

差分隐私文本重写重塑语言风格
Arnold, Stefan
Abstract
Differential Privacy (DP) for text matured from disjointed word-level substitutions to contiguous sentence-level rewriting by leveraging the generative capacity of language models. While this form of text privatization is best suited for balancing formal privacy guarantees with grammatical coherence, its impact on the register identity of text remains largely unexplored. By conducting a multidimensional stylistic profiling of differentially-private rewriting, we demonstrate that the cost of privacy extends far beyond lexical variation. Specifically, we find that rewriting under privacy constraints induces a systematic functional mutation of the text's communicative signature. This shift is characterized by the severe attrition of interactive markers, contextual references, and complex subordination. By comparing autoregressive paraphrasing against bidirectional substitution across a spectrum of privacy budgets, we observe that both architectures force convergence toward a non-involved and non-persuasive register. This register-blind sanitization effectively preserves semantic content but structurally homogenizes the nuanced stylistic markers that define human-authored discourse.
Chinese Translation
文本的差分隐私(Differential Privacy, DP)从不相交的词级替换发展到连续的句子级重写,利用了语言模型的生成能力。虽然这种文本隐私化形式最适合在形式化隐私保证与语法连贯性之间取得平衡,但其对文本语域(register)特征的影响在很大程度上仍未被探索。通过对差分隐私重写进行多维度的风格分析,我们证明了隐私的代价远远超出了词汇的变化。具体而言,我们发现,在隐私约束下的重写会导致文本交流特征的系统性功能突变。这种转变的特征是互动标记、上下文引用和复杂从属结构的严重流失。通过比较自回归释义与双向替换在不同隐私预算下的表现,我们观察到这两种架构都迫使文本趋向于一种非参与、非说服性的语域。这种无视语域的净化有效地保留了语义内容,但在结构上同质化了定义人类创作话语的细微风格标记。
cs.CL / 52 / 2604.26671

From Black-Box Confidence to Measurable Trust in Clinical AI: A Framework for Evidence, Supervision, and Staged Autonomy

从黑箱信心到可衡量的临床人工智能信任:一个基于证据、监督和分级自主的框架
Zabolotnii, Serhii, Holinko, Viktoriia, Antonenko, Olha
Abstract
Trust in clinical artificial intelligence (AI) cannot be reduced to model accuracy, fluency of generation, or overall positive user impression. In medicine, trust must be engineered as a measurable system property grounded in evidence, supervision, and operational boundaries of AI autonomy. This article proposes a practical framework for trustworthy clinical AI built around three principles: evidence, supervision, and staged autonomy. Rather than replacing deterministic clinical logic wholesale with end-to-end black-box models, the proposed approach combines a deterministic core, a patient-specific AI assistant for contextual validation, a multi-tier model escalation mechanism, and a human supervision layer for verification, escalation, and risk control. We demonstrate that trust also depends on selective verification of clinically critical findings, bounded clinical context, disciplined prompt architecture, and careful evaluation on realistic cases. Classifier-driven modular prompting is examined as an incremental path to scaling clinical depth without sacrificing prompt performance and without waiting for complete rule-based coverage. To operationalize trust, a set of trust metrics is proposed, built on metrological principles -- measurement uncertainty, calibration, traceability -- enabling quantitative rather than subjective assessment of each architectural layer. In this perspective, trustworthy clinical AI emerges not as a property of an individual model, but as an architectural outcome of a system into which evidence trails, human oversight, tiered escalation, and graduated action rights are embedded from the outset.
Chinese Translation
对临床人工智能(AI)的信任不能仅仅归结为模型准确性、生成流畅性或整体用户的积极印象。在医学中,信任必须被设计为一种基于证据、监督和AI自主操作边界的可衡量系统属性。本文提出了一个围绕三个原则构建的可信临床AI的实用框架:证据、监督和分级自主。该方法并不是完全用端到端黑箱模型替代确定性临床逻辑,而是结合了一个确定性核心、一个用于上下文验证的患者特定AI助手、一个多层模型升级机制,以及一个用于验证、升级和风险控制的人类监督层。我们证明,信任还依赖于对临床关键发现的选择性验证、有限的临床背景、规范的提示架构和对现实案例的仔细评估。我们研究了基于分类器的模块化提示作为在不牺牲提示性能和不等待完全规则覆盖的情况下,逐步扩展临床深度的路径。为了使信任具备操作性,提出了一套基于计量学原则(测量不确定性、校准、可追溯性)的信任指标,使每个架构层的评估能够实现定量而非主观的评估。在这个视角下,可信的临床AI并不是单个模型的属性,而是一个系统的架构结果,其中从一开始就嵌入了证据轨迹、人类监督、分级升级和渐进的行动权利。
cs.CL / 53 / 2604.26726

Swap distance minimization shapes the order of subject, object and verb in languages of the world

交换距离最小化塑造了世界语言中主语、宾语和动词的顺序
Rios-El-Yazidi, Jairo, Ferrer-i-Cancho, Ramon
Abstract
Languages of the world vary concerning the order of subject, object and verb. The most frequent dominant orders are SOV and SVO, and researchers have tailored models to this fact. However, there are still languages whose dominant order does not conform to these expectations or even lack a dominant order. Here we show that across linguistic families and macroareas, word order variation within languages is shaped by the principle of swap distance minimization even when the dominant order is not SOV/SVO and even when a dominant order is lacking.
Chinese Translation
世界语言在主语、宾语和动词的顺序上存在差异。最常见的主导顺序是SOV和SVO,研究人员对此事实进行了模型调整。然而,仍然存在一些语言,其主导顺序不符合这些预期,甚至缺乏主导顺序。在这里,我们展示了在不同语言家族和宏观区域中,即使主导顺序不是SOV/SVO,甚至缺乏主导顺序,语言内部的词序变化仍然受到交换距离最小化原则的影响。
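As a concrete reading of "swap distance", the sketch below counts the minimum number of swaps of adjacent constituents needed to turn one order of S, O, and V into another (the Kendall tau distance between the two permutations). This is an illustrative assumption about the metric based on its name; the paper's formal definition and statistical analysis are not reproduced here.

```python
from itertools import permutations


def swap_distance(a: str, b: str) -> int:
    """Minimum number of adjacent swaps that turn order `a` into order `b`."""
    if sorted(a) != sorted(b) or len(set(a)) != len(a):
        raise ValueError("orders must be permutations of the same distinct symbols")
    # Rewrite `a` as positions in `b`, then count inversions.
    pos = {sym: i for i, sym in enumerate(b)}
    seq = [pos[sym] for sym in a]
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )


if __name__ == "__main__":
    # Distance of every order of S, O, V from SOV.
    for order in ("".join(p) for p in permutations("SOV")):
        print(order, swap_distance("SOV", order))
```

Under this reading, SVO and OSV are one swap away from SOV, while VOS is the farthest order at three swaps. A minimization principle of this kind would predict that the alternative orders a language uses tend to sit close to one another in swap distance.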
cs.CL / 54 / 2604.26766

Domain-Adapted Small Language Models for Reliable Clinical Triage

适应领域的小型语言模型用于可靠的临床分诊
Aljohani, Manar, Ho, Brandon, McKinley, Kenneth, Ren, Dennis, Wang, Xuan
Abstract
Accurate and consistent Emergency Severity Index (ESI) assignment remains a persistent challenge in emergency departments, where highly variable free-text triage documentation contributes to mistriage and workflow inefficiencies. This study evaluates whether open-source small language models (SLMs) can serve as reliable, privacy-preserving decision-support tools for clinical triage. We systematically compared multiple SLMs across diverse prompting pipelines and found that clinical vignettes, concise summaries of triage narratives, yielded the most accurate predictions. The SLM, Qwen2.5-7B, demonstrated the strongest balance of accuracy, stability, and computational efficiency. Through large-scale domain adaptation using expert-curated and silver-standard pediatric triage data, fine-tuned Qwen2.5-7B models substantially reduced discordance and clinically significant errors, outperforming all baseline SLMs and advanced proprietary large language models (LLMs, e.g., GPT-4o). These findings highlight the feasibility of institution-specific SLMs for reliable, privacy-preserving ESI decision support and underscore the importance of targeted fine-tuning over more complex inference strategies.
Chinese Translation
在急诊科,准确且一致的急救严重性指数(ESI)分配仍然是一个持续的挑战,因高度可变的自由文本分诊文档导致了错误分诊和工作流程效率低下。本研究评估了开源小型语言模型(SLMs)是否可以作为可靠的、保护隐私的临床分诊决策支持工具。我们系统地比较了多种SLM在不同提示管道下的表现,发现临床病例摘要(clinical vignettes),即分诊叙述的简明摘要,能够产生最准确的预测。SLM Qwen2.5-7B展现出最佳的准确性、稳定性和计算效率的平衡。通过使用专家策划和银标准的儿科分诊数据进行大规模领域适应,微调后的Qwen2.5-7B模型显著减少了不一致性和临床显著错误,超越了所有基线SLM和先进的专有大型语言模型(LLMs,例如GPT-4o)。这些发现突显了特定机构的SLM在可靠、保护隐私的ESI决策支持中的可行性,并强调了针对性微调相较于更复杂推理策略的重要性。
cs.CL / 55 / 2604.26768

Decoupling Knowledge and Task Subspaces for Composable Parametric Retrieval Augmented Generation

解耦知识与任务子空间以实现可组合的参数化检索增强生成
Su, Weihang, Zhang, Hanwen, Ai, Qingyao, Liu, Yiqun
Abstract
Parametric Retrieval-Augmented Generation (PRAG) encodes external documents into lightweight parameter modules that can be retrieved and merged at inference time, offering a promising alternative to in-context retrieval augmentation. Despite its potential, many PRAG implementations train document adapters with task-supervised objectives, which may cause each adapter to encode both document-specific facts and reusable task-solving behavior. This entanglement may make adapter composition less reliable: when multiple adapters are merged at inference time, their overlapping task behaviors can accumulate together with document-specific updates, potentially making the merged adapter less stable and less focused on the intended document knowledge. To examine this issue, we explore Orthogonal Subspace Decomposition (OSD), an adapter-training setup that separates reusable task behavior from document-specific knowledge adapters. Concretely, we first train a Task LoRA to capture reusable task behavior, and then train document LoRAs to encode document-specific knowledge in an orthogonal subspace. This setup provides a controlled way to examine how orthogonalizing task and document LoRA updates affects adapter composition in multi-document PRAG. Experiments across multiple knowledge-intensive tasks and model scales suggest that this orthogonalization strategy can improve compositional robustness in parametric RAG, especially when multiple document adapters are merged.
Chinese Translation
参数化检索增强生成(PRAG)将外部文档编码为轻量级参数模块,这些模块可以在推理时被检索和合并,提供了一种有前景的替代上下文检索增强的方法。尽管具有潜力,许多PRAG实现使用任务监督目标训练文档适配器,这可能导致每个适配器同时编码文档特定的事实和可重用的任务解决行为。这种纠缠可能使适配器组合的可靠性降低:当多个适配器在推理时合并时,它们重叠的任务行为可能与文档特定的更新一起累积,从而可能使合并后的适配器不够稳定,且不够专注于预期的文档知识。为了解决这个问题,我们探索了正交子空间分解(Orthogonal Subspace Decomposition, OSD),这是一种适配器训练设置,旨在将可重用的任务行为与文档特定的知识适配器分离。具体而言,我们首先训练一个任务LoRA以捕捉可重用的任务行为,然后训练文档LoRA以在正交子空间中编码文档特定的知识。该设置提供了一种受控的方法来检验正交化任务和文档LoRA更新如何影响多文档PRAG中的适配器组合。在多个知识密集型任务和模型规模上的实验表明,这种正交化策略可以提高参数化RAG的组合鲁棒性,特别是在合并多个文档适配器时。
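The idea of placing document LoRA updates in a subspace orthogonal to the task LoRA can be illustrated with a small numerical sketch. The abstract does not say how orthogonality is enforced, so the explicit projection used here is an assumption (a regularizer or a constrained parameterization would be other options), and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2  # hidden size and LoRA rank (illustrative values)

# Task LoRA: low-rank update delta_task = B_task @ A_task.
B_task = rng.normal(size=(d, r))
A_task = rng.normal(size=(r, d))
delta_task = B_task @ A_task

# Orthonormal basis for the column space of the task update.
Q, _ = np.linalg.qr(B_task)            # Q has shape (d, r)
P_orth = np.eye(d) - Q @ Q.T           # projector onto the orthogonal complement

# Document LoRA: project its output-side factor away from the task subspace.
B_doc = rng.normal(size=(d, r))
A_doc = rng.normal(size=(r, d))
delta_doc = (P_orth @ B_doc) @ A_doc

# The two updates now write into orthogonal output directions.
overlap = float(np.abs(delta_task.T @ delta_doc).max())
print(f"max |delta_task^T delta_doc| = {overlap:.2e}")
```

In this toy setting, merging several projected document updates adds nothing along the task directions, which is the kind of interference the abstract describes as a risk when adapters are entangled.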
cs.CL / 56 / 2604.26835

HalluCiteChecker: A Lightweight Toolkit for Hallucinated Citation Detection and Verification in the Era of AI Scientists

HalluCiteChecker:一个轻量级的工具包,用于在人工智能科学家时代检测和验证虚构引用
Sakai, Yusuke, Kamigaito, Hidetaka, Watanabe, Taro
Abstract
We introduce HalluCiteChecker, a toolkit for detecting and verifying hallucinated citations in scientific papers. While AI assistant technologies have transformed the academic writing process, including citation recommendation, they have also led to the emergence of hallucinated citations that do not correspond to any existing work. Such citations not only undermine the credibility of scientific papers but also impose an additional burden on reviewers and authors, who must manually verify their validity during the review process. In this study, we formalize hallucinated citation detection as an NLP task and provide a corresponding toolkit as a practical foundation for addressing this problem. Our package is lightweight and can perform verification in seconds on a standard laptop. It can also be executed entirely offline and runs efficiently using only CPUs. We hope that HalluCiteChecker will help reduce reviewer workload and support organizers by enabling systematic pre-review and publication checks. Our code is released under the Apache 2.0 license on GitHub and is distributed as an installable package via PyPI. A demonstration video is available on YouTube.
Chinese Translation
我们介绍了HalluCiteChecker,这是一个用于检测和验证科学论文中虚构引用的工具包。尽管人工智能助手技术已经改变了学术写作过程,包括引用推荐,但它们也导致了虚构引用的出现,这些引用并不对应于任何现有的工作。这些引用不仅削弱了科学论文的可信度,还给审稿人和作者带来了额外的负担,他们必须在审稿过程中手动验证这些引用的有效性。在本研究中,我们将虚构引用检测形式化为一个自然语言处理(NLP)任务,并提供了相应的工具包作为解决该问题的实用基础。我们的工具包轻量且能够在标准笔记本电脑上在几秒钟内完成验证。它还可以完全离线执行,并仅使用CPU高效运行。我们希望HalluCiteChecker能够帮助减少审稿人的工作量,并通过实现系统的预审和出版检查来支持组织者。我们的代码在GitHub上以Apache 2.0许可证发布,并通过PyPI作为可安装包分发。演示视频可在YouTube上观看。
cs.CL / 57 / 2604.26844

What Kind of Language is Easy to Language-Model Under Curriculum Learning?

在课程学习下,什么样的语言易于进行语言模型建模?
El-Naggar, Nadine, Kuribayashi, Tatsuki, Briscoe, Ted
Abstract
Many of the thousands of attested languages share common configurations of features, creating a spectrum from typologically very rare (e.g., object-verb-subject word order) or impossible languages to very common combinations of features (e.g., subject-object-verb word order). One central question is under what conditions such typological tendencies can be predicted, and specifically whether the learning bias of language models (LMs) is sufficient to reproduce such patterns. In this study, we add one dimension to such analysis -- the learning scenario for LMs -- to explore its interaction with the inductive bias of LMs. Specifically, as a first study, we examine the effect of curriculum learning (CL), as a developmentally motivated learning scenario, i.e., starting with simpler sentences rather than randomly-ordered input. We expand existing LM-based exploration (El-Naggar et al., 2025a,b) with a simple CL variant and find that CL substantially impacts the apparent inductive bias of LMs.
Chinese Translation
许多已证实的语言共享特征的共同配置,从而形成一个光谱,涵盖了从类型学上非常罕见(例如,宾语-动词-主语的词序)或不可能的语言到非常常见的特征组合(例如,主语-宾语-动词的词序)。一个核心问题是,在什么条件下可以预测这种类型学倾向,特别是语言模型(LMs)的学习偏差是否足以重现这些模式。在本研究中,我们为这种分析增加了一个维度——语言模型的学习场景——以探讨其与语言模型的归纳偏差之间的相互作用。具体而言,作为第一项研究,我们考察了课程学习(CL)的影响,作为一种发展动机驱动的学习场景,即从简单句子开始,而不是随机排序的输入。我们用一个简单的课程学习变体扩展了现有的基于语言模型的探索(El-Naggar et al., 2025a,b),发现课程学习显著影响了语言模型的明显归纳偏差。
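The contrast between curriculum-ordered and randomly ordered training input can be sketched in a few lines. The abstract only says that the curriculum starts with simpler sentences; using token count as the measure of simplicity is an assumption made here, and the function names are hypothetical.

```python
import random


def curriculum_order(sentences: list[str]) -> list[str]:
    """Order training sentences from short to long (length as a simplicity proxy)."""
    return sorted(sentences, key=lambda s: len(s.split()))


def random_order(sentences: list[str], seed: int = 0) -> list[str]:
    """Baseline: the same sentences in a seeded random order."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return shuffled


corpus = [
    "the dog that the cat chased barked",
    "dogs bark",
    "the cat chased the dog",
]
print(curriculum_order(corpus))
print(random_order(corpus))
```

Both orderings present the same data, so any difference in what the LM finds easy to learn can be attributed to the learning scenario rather than to the corpus.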
cs.CL / 58 / 2604.26866

MoRFI: Monotonic Sparse Autoencoder Feature Identification

MoRFI:单调稀疏自编码器特征识别
Dimakopoulos, Dimitris, Cohen, Shay B., Konstas, Ioannis
Abstract
Large language models (LLMs) acquire most of their factual knowledge during the pre-training stage, through next token prediction. Subsequent stages of post-training often introduce new facts outwith the parametric knowledge, giving rise to hallucinations. While it has been demonstrated that supervised fine-tuning (SFT) on new knowledge may exacerbate the problem, the underlying mechanisms are still poorly understood. We conduct a controlled fine-tuning experiment, focusing on closed-book QA, and find latent directions that causally contribute to hallucinations. Specifically, we fine-tune Llama 3.1 8B, Gemma 2 9B and Mistral 7B v03 on seven distinct single QA datasets, controlling for the percentage of new knowledge and number of training epochs. By measuring performance on the test set, we validate that incrementally introducing new knowledge increases hallucinations, with the effect being more pronounced with prolonged training. We leverage pre-trained sparse autoencoders (SAEs) to analyze residual stream activations across various checkpoints for each model and propose Monotonic Relationship Feature Identification (MoRFI) for capturing causally relevant latents. MoRFI filters SAE features that respond monotonically to controlled fine-tuning data mixtures of a target property. Our findings show that exposure to unknown facts disrupts the model's ability to retrieve stored knowledge along a set of directions in the residual stream. Our pipeline reliably discovers them across distinct models, recovering knowledge through single-latent interventions.
Chinese Translation
大型语言模型(LLMs)在预训练阶段通过下一个标记预测获取大部分事实知识。后续的后训练阶段通常引入超出参数知识的新事实,从而导致幻觉。尽管已有研究表明对新知识进行监督微调(SFT)可能加剧这一问题,但其潜在机制仍然不够清楚。我们进行了一项受控微调实验,重点关注闭卷问答(QA),发现潜在方向在因果上促成了幻觉的产生。具体而言,我们在七个不同的单一QA数据集上对Llama 3.1 8B、Gemma 2 9B和Mistral 7B v03进行了微调,控制新知识的百分比和训练轮次。通过测量测试集的性能,我们验证了逐步引入新知识会增加幻觉现象,且随着训练时间的延长,效果更加明显。我们利用预训练的稀疏自编码器(SAEs)分析各模型在不同检查点的残差流激活,并提出了单调关系特征识别(MoRFI)以捕捉因果相关的潜在特征。MoRFI过滤出对目标属性的受控微调数据混合单调响应的SAE特征。我们的研究结果表明,接触未知事实会干扰模型沿着残差流中一组方向检索存储知识的能力。我们的管道在不同模型中可靠地发现了这些方向,通过单潜在干预恢复知识。
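The filtering step, keeping SAE features that respond monotonically to controlled data mixtures, can be sketched as follows. The abstract does not give the exact criterion, so the strict check on successive differences is an assumption (a rank correlation threshold would be another option); the function name and the toy numbers are hypothetical.

```python
import numpy as np


def monotonic_features(mean_acts: np.ndarray, tol: float = 0.0) -> np.ndarray:
    """Indices of features whose mean activation is strictly monotonic.

    mean_acts has shape (n_mixtures, n_features); rows are ordered by the
    controlled property, e.g. increasing percentage of new knowledge.
    """
    diffs = np.diff(mean_acts, axis=0)
    increasing = np.all(diffs > tol, axis=0)
    decreasing = np.all(diffs < -tol, axis=0)
    return np.flatnonzero(increasing | decreasing)


# Toy example: 3 mixture levels, 3 features.
acts = np.array([
    [0.1, 0.5, 0.3],
    [0.2, 0.4, 0.1],
    [0.4, 0.2, 0.3],
])
print(monotonic_features(acts))  # features 0 (rising) and 1 (falling)
```

Features that pass such a filter are only candidates; the abstract indicates that causal relevance is established through interventions on individual latents.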
cs.CL / 59 / 2604.26880

HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering

HealthNLP_Retrievers在ArchEHR-QA 2026中的应用:基于级联大型语言模型的临床问题回答系统
Hosen, Md Biplob, Hussein, Md Alomgeer, Masud, Md Akmol, Faruque, Omar, Reynolds, Tera L, Chen, Lujie Karen
Abstract
Patient portals now give individuals direct access to their electronic health records (EHRs), yet access alone does not ensure patients understand or act on the complex clinical information contained in these records. The ArchEHR-QA 2026 shared task addresses this challenge by focusing on grounded question answering over EHRs, and this paper presents the system developed by the HealthNLP_Retrievers team for this task. The proposed approach uses a multi-stage cascaded pipeline powered by the Gemini 2.5 Pro large language model to interpret patient-authored questions and retrieve relevant evidence from lengthy clinical notes. Our architecture comprises four integrated modules: (1) a few-shot query reformulation unit which summarizes verbose patient queries; (2) a heuristic-based evidence scorer which ranks clinical sentences to prioritize recall; (3) a grounded response generator which synthesizes professional-caliber answers restricted strictly to identified evidence; and (4) a high-precision many-to-many alignment framework which links generated answers to supporting clinical sentences. This cascaded approach achieved competitive results. Across the individual tracks, the system ranked 1st in question interpretation, 5th in answer generation, 7th in evidence identification, and 9th in answer-evidence alignment. These results show that integrating large language models within a structured multi-stage pipeline improves grounding, precision, and the professional quality of patient-oriented health communication. To support reproducibility, our source code is publicly available in our GitHub repository.
Chinese Translation
患者门户现在使个人能够直接访问他们的电子健康记录(EHR),然而,仅仅获得访问权限并不能确保患者理解或采取行动应对这些记录中复杂的临床信息。ArchEHR-QA 2026共享任务针对这一挑战,专注于基于EHR的有据问题回答,本文介绍了HealthNLP_Retrievers团队为此任务开发的系统。所提出的方法使用由Gemini 2.5 Pro大型语言模型驱动的多阶段级联管道来解读患者提出的问题,并从冗长的临床记录中检索相关证据。我们的架构包括四个集成模块:(1)少量示例查询重构单元,用于总结冗长的患者查询;(2)基于启发式的证据评分器,用于对临床句子进行排名,以优先考虑召回率;(3)有据响应生成器,合成严格基于识别证据的专业级答案;(4)高精度多对多对齐框架,将生成的答案与支持的临床句子链接。该级联方法取得了竞争力的结果。在各个单独的赛道中,该系统在问题解读方面排名第一,在答案生成方面排名第五,在证据识别方面排名第七,在答案与证据对齐方面排名第九。这些结果表明,在结构化的多阶段管道中集成大型语言模型可以提高基础性、精确性和面向患者的健康交流的专业质量。为了支持可重复性,我们的源代码已在我们的GitHub仓库中公开提供。
cs.CL / 60 / 2604.26904

ClawGym: A Scalable Framework for Building Effective Claw Agents

ClawGym:构建有效爪型智能体的可扩展框架
Bai, Fei, Song, Huatong, Sun, Shuang, Cheng, Daixuan, Yang, Yike, Hao, Chuan, Li, Renyuan, Chang, Feng, Wei, Yuan, Tao, Ran, Dai, Bryan, Yang, Jian, Zhao, Wayne Xin
Abstract
Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task sandboxes. To support reliable evaluation, we further construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources will soon be released at https://github.com/ClawGym.
Chinese Translation
爪型环境支持对本地文件、工具和持久工作区状态的多步骤工作流程。然而,围绕这些环境的可扩展开发仍然受到缺乏系统框架的限制,尤其是缺乏用于合成可验证训练数据并将其与智能体训练和诊断评估相结合的框架。为了解决这一挑战,我们提出了ClawGym,一个支持爪型个人智能体开发完整生命周期的可扩展框架。具体而言,我们构建了ClawGym-SynData,这是一个由13.5K个经过筛选的任务组成的多样化数据集,这些任务是从以角色驱动的意图和技能为基础的操作中合成的,并配有现实的模拟工作区和混合验证机制。然后,我们通过对黑箱rollout轨迹进行监督微调,训练了一系列强大的爪型模型,称为ClawGym-Agents,并进一步探索通过轻量级管道进行强化学习,该管道在各任务独立的沙箱中并行执行rollout。为了支持可靠的评估,我们进一步构建了ClawGym-Bench,这是一个通过自动过滤和人工-大语言模型(LLM)审查校准的200个实例的基准。相关资源将很快在https://github.com/ClawGym发布。
cs.CL / 61 / 2604.26940

Select to Think: Unlocking SLM Potential with Local Sufficiency

选择思考:通过局部充分性释放小型语言模型的潜力
Ye, Wenxuan, Zhang, Yangyang, An, Xueli, Carle, Georg, Ma, Yunpu
Abstract
Small language models (SLMs) offer computational efficiency for scalable deployment, yet they often fall short of the reasoning power exhibited by their larger counterparts (LLMs). To mitigate this gap, current approaches invoke an LLM to generate tokens at points of reasoning divergence, but these external calls introduce substantial latency and costs. Alternatively, standard distillation is often hindered by the capacity limitation, as SLMs struggle to accurately mimic the LLM's complex generative distribution. We address this dilemma by identifying local sufficiency: at divergence points, the LLM's preferred token consistently resides within the SLM's top-K next-token predictions, even when failing to emerge as the SLM top-1 choice. We therefore propose SELECT TO THINK (S2T), which reframes the LLM's role from open-ended generation to selection among the SLM's proposals, simplifying the supervision signal to discrete candidate rankings. Leveraging this, we introduce S2T-LOCAL, which distills the selection logic into the SLM, empowering it to perform autonomous re-ranking without inference-time LLM dependency. Empirically, we demonstrate that a 1.5B SLM's top-8 candidates capture the 32B LLM's choice with 95% hit rate. Translating this potential into performance, S2T-LOCAL improves greedy decoding by 24.1% on average across benchmarks, effectively matching the efficacy of 8-path self-consistency while operating with single-trajectory efficiency.
Chinese Translation
小型语言模型(SLMs)在可扩展部署中提供了计算效率,但它们通常无法与大型语言模型(LLMs)所展现的推理能力相媲美。为了缩小这一差距,目前的方法通常在推理分歧点调用LLM生成标记,但这些外部调用会引入显著的延迟和成本。另一方面,标准蒸馏常常受到容量限制的阻碍,因为SLMs在准确模仿LLM复杂生成分布方面存在困难。我们通过识别局部充分性来解决这一困境:在分歧点,LLM的首选标记始终位于SLM的前K个下一个标记预测中,即使它未能成为SLM的首选。为此,我们提出了选择思考(SELECT TO THINK, S2T),将LLM的角色从开放式生成重新定义为在SLM的提议中进行选择,从而简化监督信号为离散候选排名。基于此,我们引入了S2T-LOCAL,将选择逻辑蒸馏到SLM中,使其能够在不依赖推理时LLM的情况下进行自主重排序。通过实证研究,我们证明了一个15亿参数的SLM的前8个候选项以95%的命中率捕获了320亿参数LLM的选择。将这一潜力转化为性能,S2T-LOCAL在各基准测试中平均提高了贪婪解码24.1%,有效匹配了8路径自一致性的效率,同时以单轨迹效率运行。
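The local sufficiency claim (the LLM's preferred token usually lies within the SLM's top-K predictions) corresponds to a simple hit-rate measurement, sketched below. The array shapes, the function name, and the toy logits are assumptions for illustration; in the paper's setting the positions would be the divergence points between the two models.

```python
import numpy as np


def topk_hit_rate(slm_logits: np.ndarray, llm_tokens: np.ndarray, k: int = 8) -> float:
    """Fraction of positions where the LLM's token is among the SLM's top-k.

    slm_logits: (n_positions, vocab_size) next-token logits from the SLM.
    llm_tokens: (n_positions,) token ids chosen by the LLM at the same positions.
    """
    # Indices of the k largest logits per row (order within the k is irrelevant).
    topk = np.argpartition(-slm_logits, k - 1, axis=1)[:, :k]
    hits = (topk == llm_tokens[:, None]).any(axis=1)
    return float(hits.mean())


# Toy example with a 4-token vocabulary and two positions.
logits = np.array([
    [0.1, 2.0, 1.0, -1.0],
    [3.0, 0.0, 1.0, 2.0],
])
llm_choice = np.array([2, 1])
print(topk_hit_rate(logits, llm_choice, k=2))
```

When this rate is high, the supervision problem reduces to choosing among K candidates rather than matching the LLM's full output distribution, which is the reframing the abstract describes.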
cs.CL / 62 / 2604.26951

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

转变潮流:跨架构蒸馏用于扩散大型语言模型
Zhang, Gongbo, Wang, Wen, Tian, Ye, Yuan, Li
Abstract
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. Distilling 8B dense and 16B MoE teachers into a 0.6B student via two heterogeneous pipelines outperforms the baseline by an average of 1.53 points across eight benchmarks, yielding notable gains in code generation, where HumanEval scores reach 48.78 compared to 32.3 for the AR baseline.
Chinese Translation
扩散大型语言模型(dLLMs)提供并行解码和双向上下文,但最先进的dLLMs需要数十亿个参数才能实现竞争性能。虽然现有的dLLMs蒸馏方法在单一架构内减少推理步骤,但没有一种方法解决跨架构知识转移的问题,即教师和学生在架构、注意力机制和分词器上存在差异。我们提出了TIDE,这是第一个跨架构dLLM蒸馏框架,由三个模块组成:(1)TIDAL,它在训练进度和扩散时间步中共同调节蒸馏强度,以考虑教师的噪声依赖可靠性;(2)CompDemo,它通过互补掩码分割丰富教师的上下文,以改善在重度掩码下的预测;(3)Reverse CALM,一种跨分词器目标,反转块级似然匹配,产生有界梯度和双端噪声过滤。通过两个异构管道将8B稠密和16B MoE教师蒸馏到0.6B学生,在八个基准测试中平均超越基线1.53分,在代码生成方面取得显著提升,其中HumanEval得分达到48.78,而AR基线为32.3。