arXiv Daily Digest

241 Papers

Design, Modelling and Experimental Evaluation of a Tendon-driven Wrist Abduction-Adduction Mechanism for an upper limb exoskeleton

上肢外骨骼的腱驱动腕部外展-内收机制的设计、建模与实验评估

Khan, Juwairiya S., Mohammadi, Mostafa, Rasmussen, John, Struijk, Lotte N. S. Andreasen

Abstract

Wrist exoskeletons play a vital role in rehabilitation and assistive applications, yet conventional actuation mechanisms such as electric motors or pneumatics often introduce undesirable weight, friction, and complexity. This paper presents a novel single-cable (tendon), torsional-spring-assisted actuation mechanism for wrist abduction-adduction, and a simulation-based method for selecting its stiffness parameters. The mechanism employs a single Bowden cable passively tensioned by a spiral torsional spring (clock spring) to maintain continuous cable tension without antagonistic actuation. Kinematic and dynamic modeling of the mechanism was performed to estimate the required torque and identify optimal spring parameters. These simulation-derived parameters guided the design of a functional prototype, which was experimentally evaluated with five participants with no motor disabilities (NMD) under varying arm positions and loading conditions using three spring configurations to account for user variability and modeling uncertainties. Experimental results show consistent agreement with simulation-derived trends, with the nominal spring configuration achieving balanced motion range, torque demand, and repeatability. The results demonstrate that simulation-informed stiffness selection can effectively guide the design of compact, cable-driven wrist exoskeletons while reducing reliance on empirical tuning.

Chinese Translation

腕部外骨骼在康复和辅助应用中发挥着至关重要的作用，但传统的驱动机制如电动机或气动装置往往会引入不必要的重量、摩擦和复杂性。本文提出了一种新颖的单缆（腱）扭转弹簧辅助驱动机制，用于腕部外展-内收，并提出了一种基于仿真的方法来选择其刚度参数。该机制采用一根单一的博登缆，通过螺旋扭转弹簧（发条）被动张紧，以保持持续的缆绳张力，而无需对抗性驱动。对该机制进行了运动学和动力学建模，以估算所需的扭矩并识别最佳弹簧参数。这些基于仿真的参数指导了功能原型的设计，该原型在五名无运动障碍（NMD）参与者的不同手臂位置和负载条件下进行了实验评估，使用三种弹簧配置以考虑用户变异性和建模不确定性。实验结果与仿真推导的趋势一致，名义弹簧配置实现了平衡的运动范围、扭矩需求和重复性。结果表明，基于仿真的刚度选择可以有效指导紧凑型缆驱动腕部外骨骼的设计，同时减少对经验调试的依赖。
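The stiffness-selection idea in the abstract above can be illustrated with a static torque balance. This is a minimal sketch and not the authors' model: the linear spring law, the pulley radius, the preload angle, and the load torque are all assumed values chosen for illustration. It checks, for a few candidate stiffnesses, whether the single cable stays under tension across the range of motion.

```python
def cable_tension(theta, stiffness, preload_angle, load_torque, pulley_radius):
    """Static cable tension (N) needed to hold the wrist at angle theta (rad).

    Assumes a linear torsional spring that opposes the cable, so the cable
    must balance the spring torque plus any external load torque.
    """
    spring_torque = stiffness * (theta + preload_angle)  # N*m
    return (spring_torque + load_torque) / pulley_radius


def tension_range(stiffness, preload_angle, load_torque, pulley_radius,
                  theta_min, theta_max, steps=50):
    """Return (min, max) cable tension over an evenly sampled angle range."""
    tensions = []
    for i in range(steps + 1):
        theta = theta_min + (theta_max - theta_min) * i / steps
        tensions.append(cable_tension(theta, stiffness, preload_angle,
                                      load_torque, pulley_radius))
    return min(tensions), max(tensions)


if __name__ == "__main__":
    # Hypothetical values: 20 mm pulley, 0.5 rad preload, a load torque that
    # helps the cable (negative sign), and a 0 to 0.6 rad range of motion.
    for k in (0.02, 0.10, 0.20):  # candidate stiffnesses in N*m/rad
        low, high = tension_range(k, 0.5, -0.02, 0.02, 0.0, 0.6)
        status = "taut" if low > 0 else "goes slack"
        print(f"k={k:.2f}: tension {low:.2f} to {high:.2f} N ({status})")
```

Under these assumptions, a spring that is too soft lets the cable go slack at small angles, while a stiffer spring raises the peak tension the actuator must supply. That is the trade-off a simulation-based selection would have to balance.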

cs.RO / 2 / 2604.20898

A Tendon-Driven Wrist Abduction-Adduction Joint Improves Performance of a 5 DoF Upper Limb Exoskeleton -- Implementation and Experimental Evaluation

一个由肌腱驱动的腕部外展-内收关节改善了五自由度上肢外骨骼的性能——实施与实验评估

Khan, Juwairiya S., Mohammadi, Mostafa, Ammitzbøll, Alexander L., Hagen, Ellen-Merete, Blicher, Jakob, Obál, Izabella, Cardoso, Ana S. S., Kirtas, Oguzhan, Kæseler, Rasmus L., Rasmussen, John, Struijk, Lotte N. S. Andreasen

Abstract

Wrist function is essential in performing activities of daily living (ADLs). However, there is limited experimental evidence on the functional impact of wrist Abduction-Adduction (Ab-Ad) joint assistance in upper limb exoskeletons (ULEs) for rehabilitation. This study evaluates the effect of implementing an active wrist Ab-Ad joint in a five-degree-of-freedom (DoF) ULE, the EXOTIC2 exoskeleton, to support individuals with severe motor impairments. Methods: A compact, lightweight wrist module with tendon-driven abduction and spring-driven adduction was integrated into the EXOTIC exoskeleton. Eight adults with no motor disabilities completed drinking and scratching tasks under randomized wrist-enabled and wrist-locked conditions, along with a preliminary feasibility test in one individual with amyotrophic lateral sclerosis (ALS). Kinematic and task performance metrics, including wrist range of motion, task completion time, spillage, and leveling metrics, were assessed. Results: Implementing the wrist Ab-Ad DoF improved task success metrics. Spill incidence during the drinking task decreased from 56% to 3%, and leveling success for the scratching task improved from 28% to 75%. Conclusion: Integrating wrist Ab-Ad assistance improved key functional task outcomes without increasing execution time. Significance: The study provides experimental evidence that active wrist Ab-Ad control enhances task-level performance in exoskeleton-assisted ADLs.

Chinese Translation

腕部功能在日常生活活动（ADLs）中至关重要。然而，关于腕部外展-内收（Ab-Ad）关节辅助在上肢外骨骼（ULEs）中对康复的功能影响的实验证据有限。本研究评估了在五自由度（DoF）上肢外骨骼EXOTIC2中实施主动腕部外展-内收关节的效果，以支持严重运动障碍的个体。方法：将一个紧凑、轻便的腕部模块与肌腱驱动的外展和弹簧驱动的内收集成到EXOTIC外骨骼中。八名没有运动障碍的成年人在随机的腕部启用和腕部锁定条件下完成饮水和抓挠任务，并在一名肌萎缩侧索硬化症（ALS）患者中进行了初步可行性测试。评估了运动学和任务表现指标，包括腕部活动范围、任务完成时间、溢出和水平指标。结果：实施腕部外展-内收自由度改善了任务成功指标。在饮水任务中，溢出发生率从56%降至3%，抓挠任务的水平成功率从28%提高至75%。结论：整合腕部外展-内收辅助改善了关键功能任务结果，而没有增加执行时间。意义：本研究提供了实验证据，表明主动腕部外展-内收控制增强了外骨骼辅助的日常生活活动中的任务级表现。

cs.RO / 3 / 2604.20967

Clinical Evaluation of a Tongue-Controlled Wrist Abduction-Adduction Assistance in a 6-DoF Upper-Limb Exoskeleton for Individuals with ALS and SCI

针对肌萎缩侧索硬化症（ALS）和脊髓损伤（SCI）患者的6自由度上肢外骨骼中的舌控腕部外展-内收辅助的临床评估

Khan, Juwairiya S., Mohammadi, Mostafa, Ammitzbøll, Alexander L., Hagen, Ellen-Merete, Blicher, Jakob, Obál, Izabella, Cardoso, Ana S. S., Kirtas, Oguzhan, Kæseler, Rasmus L., Rasmussen, John, Struijk, Lotte N. S. Andreasen

Abstract

Upper-limb exoskeletons (ULEs) have the potential to restore functional independence in individuals with severe motor impairments; however, the clinical relevance of wrist degrees of freedom (DoF), particularly abduction-adduction (Ab-Ad), remains insufficiently evaluated. This study investigates the functional and user-perceived impact of wrist Ab-Ad assistance during two activities of daily living (ADLs). Wrist Ab-Ad assistance in a tongue-controlled 6-DoF ULE, EXOTIC2, was evaluated in a within-subject study involving one individual with amyotrophic lateral sclerosis and five individuals with spinal cord injury. Participants performed drinking and scratch stick leveling tasks with EXOTIC2 under two conditions: with and without wrist Ab-Ad assistance. Outcome measures included task success, task completion time, kinematic measures, and a usability questionnaire capturing comfort, functional perception, and acceptance. Enabling wrist Ab-Ad improved task success rates across both ADLs, with consistent reductions in spillage (from 77.8% to 22.2%) and failed placements (from 66.7% to 16.7%). Participants utilized task-specific subsets of the available wrist range of motion, indicating that effective control within functional ranges was more critical than maximal joint excursion. Questionnaire responses indicated no increase in discomfort with the additional DoF and reflected perceived improvements in task performance. In conclusion, wrist Ab-Ad assistance enhances functional task performance in assistive exoskeleton use without compromising user comfort. However, its effectiveness depends on task context, control usability, and individual user strategies. This study provides clinically relevant, user-centered evidence supporting the inclusion of wrist Ab-Ad in ULEs, emphasizing the importance of balancing functional capability with usability in assistive device design.

Chinese Translation

上肢外骨骼（ULEs）有潜力恢复严重运动障碍个体的功能独立性；然而，腕部自由度（DoF），特别是外展-内收（Ab-Ad）的临床相关性尚未得到充分评估。本研究探讨了在两项日常生活活动（ADLs）中，腕部外展-内收辅助的功能和用户感知影响。在一项涉及一名肌萎缩侧索硬化症患者和五名脊髓损伤患者的被试内研究中，评估了舌控6自由度上肢外骨骼EXOTIC2中的腕部外展-内收辅助。参与者在两种条件下使用EXOTIC2执行饮水和划平棒的任务：有和没有腕部外展-内收辅助。结果测量包括任务成功率、任务完成时间、运动学测量和一份可用性问卷，捕捉舒适度、功能感知和接受度。启用腕部外展-内收辅助提高了两项ADLs的任务成功率，同时减少了溢出（从77.8%降至22.2%）和失败放置（从66.7%降至16.7%）。参与者利用了可用腕部活动范围的任务特定子集，表明在功能范围内的有效控制比最大关节活动更为关键。问卷反馈显示，额外的自由度并未增加不适感，并反映出任务表现的感知改善。总之，腕部外展-内收辅助在辅助外骨骼使用中增强了功能任务表现，而不影响用户舒适度。然而，其有效性依赖于任务背景、控制可用性和个体用户策略。本研究提供了临床相关的以用户为中心的证据，支持在ULEs中纳入腕部外展-内收，强调在辅助设备设计中平衡功能能力与可用性的重要性。

cs.RO / 4 / 2604.20990

A Survey of Legged Robotics in Non-Inertial Environments: Past, Present, and Future

非惯性环境中腿式机器人研究综述：过去、现在与未来

Chang, I-Chia, Huang, Xinyan, Lin, Tzu-Yuan, Teng, Sangli, Li, Wenjing, Ghaffari, Maani, Yi, Jingang, Gu, Yan

Abstract

Legged robots have demonstrated remarkable agility on rigid, stationary ground, but their locomotion reliability remains limited in non-inertial environments, where the supporting ground moves, tilts, or accelerates. Such conditions arise in ground transportation, maritime platforms, and aerospace settings, and they introduce persistent time-varying disturbances that break the stationary-ground assumptions underlying conventional legged locomotion. This survey reviews the state of the art in modeling, state estimation, and control for legged robots in non-inertial environments. We summarize representative application domains and motion characteristics, analyze the root causes of locomotion performance degradation, and review existing methods together with their key assumptions and limitations. We further identify open problems in robot-environment coupling, observability, robustness, and experimental validation, and discuss future directions in autonomy, system-level design, bio-inspired strategies, safety, and testing. The survey aims to clarify the technical foundations of this emerging area and support the development of reliable legged robots for real-world dynamic environments.

Chinese Translation

腿式机器人在刚性、静止的地面上展现出了卓越的灵活性，但在非惯性环境中，其运动可靠性仍然有限。在这些环境中，支撑地面会移动、倾斜或加速。这种情况出现在地面交通、海洋平台和航空航天环境中，并引入了持续的时间变化干扰，打破了传统腿式运动所依赖的静止地面假设。本综述回顾了非惯性环境中腿式机器人的建模、状态估计和控制的最新进展。我们总结了代表性的应用领域和运动特征，分析了运动性能下降的根本原因，并回顾了现有方法及其关键假设和局限性。我们进一步识别了机器人与环境耦合、可观测性、鲁棒性和实验验证等方面的开放问题，并讨论了在自主性、系统级设计、生物启发策略、安全性和测试等方面的未来方向。本综述旨在阐明这一新兴领域的技术基础，并支持在现实动态环境中开发可靠的腿式机器人。

cs.RO / 5 / 2604.21017

Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

开放H-具身性：用于支持医疗机器人基础模型的大规模数据集

Open-H-Embodiment Consortium: Nelson, Nigel, Chen, Juo-Tung, Haworth, Jesse, Chen, Xinhao, Zbinden, Lukas, Huang, Dianye, Abdelaal, Alaa Eldin, Arezzo, Alberto, Acar, Ayberk, Alambeigi, Farshid, Ammirati, Carlo Alberto, Ao, Yunke, Rodriguez, Pablo David Aranda, Atar, Soofiyan, Ballo, Mattia, Barnes, Noah, Barontini, Federica, Binkiewicz, Filip, Black, Peter, Bodenstedt, Sebastian, Borgioli, Leonardo, Budjak, Nikola, Calmé, Benjamin, Carrillo, Fabio, Cavalcanti, Nicola, Chen, Changwei, Chen, Haoxin, Chen, Sihang, Chen, Qihan, Chen, Zhongyu, Chen, Ziyang, Cheng, Shing Shin, Cheng, Meiqing, Cheng, Min, Chiu, Zih-Yun Sarah, Chu, Xiangyu, Correa-Gallego, Camilo, Dagnino, Giulio, Deguet, Anton, Delgado, Jacob, DeLong, Jonathan C., Deng, Kaizhong, Dimitrakakis, Alexander, Ding, Qingpeng, Ding, Hao, Distefano, Giovanni, Donoho, Daniel, Duan, Anqing, Esposito, Marco, Farritor, Shane, Fayad, Jad, Fayad, Zahi, Ferradosa, Mario, Filicori, Filippo, Finn, Chelsea, Fürnstahl, Philipp, Ge, Jiawei, Giannarou, Stamatia, Ludevid, Xavier Giralt, Giraud, Frederic, Godbole, Aditya Amit, Goldberg, Ken, Goldenberg, Antony, Marana, Diego Granero, Guo, Xiaoqing, Haidegger, Tamás, Hailey, Evan, Hansen, Pascal, Hao, Ziyi, Hari, Kush, Hayashi, Kengo, Hawkins, Jonathon, Haworth, Shelby, Hellig, Ortrun, Herrell, S. Duke, Hong, Zhouyang, Howe, Andrew, Hu, Junlei, Jain, Ria, Javazm, Mohammad Rafiee, Ji, Howard, Ji, Rui, Ji, Jianmin, Jiang, Zhongliang, Jones, Dominic, Jopling, Jeffrey, Jordan, Britton, Ju, Ran, Kam, Michael, Kang, Luoyao, Kang, Fausto, Kapuria, Siddhartha, Kazanzides, Peter, Kiehler, Sonika, Kilmer, Ethan, Kim, Ji Woong, Korzeniowski, Przemysław, Kuchi, Chandra, Kumar, Nithesh, Kuntz, Alan, Lavagno, Federico, Lee, Yu Chung, Lee, Hao-Chih, Li, Hang, Li, Zhen, Liang, Xiao, Lin, Xinxin, Lin, Jinsong, Liu, Chang, Liu, Fei, Liu, Pei, Liu, Yun-hui, Liuchen, Wanli, Lukács, Eszter, Mann, Sareena, Mannas, Miles, Marinelli, Brett, Martyniak, Sabina, Marzola, Francesco, Mazza, Lorenzo, Mei, Xueyan, Morais, Maria Clara, Muratore, Luigi, Narayanaswamy, Chetan Reddy, Naskręt, Michał, Navarro-Alarcon, David, Neary, Cyrus, Ng, Chi Kit, Nguan, Christopher, Noonan, David, Oh, Ki Hwan, Olesch, Tom Christian, Okamura, Allison M., Opfermann, Justin, Pescio, Matteo, Pham, Doan Xuan Viet, Porras, Tito, Ren, Hongliang, Jimenez, Ariel Rodriguez, Baena, Ferdinando Rodriguez y, Salcudean, Septimiu E., Sathya, Asmitha, Satish, Preethi, Seenivasan, Lalithkumar, Shao, Jiaqi, Shen, Yiqing, Sheng, Yu, Shi, Lucy XiaoYang, Soulé, Zoe, Speidel, Stefanie, Su, Mingwu, Su, Jianhao, Sunmola, Idris, Takács, Kristóf, Tang, Yunxi, Thornycroft, Patrick, Tian, Yu, Thompson, Jordan, Turkcan, Mehmet K., Unberath, Mathias, Valdastri, Pietro, Vives, Carlos, Vuong, Quan, Wagner, Martin, Wang, Farong, Wang, Wei, Wang, Lidian, Wang, Chung-Pang, Wang, Guankun, Wang, Junyi, Wang, Erqi, Wang, Ziyi, Watts, Tanner, Wein, Wolfgang, Wu, Yimeng, Wu, Zijian, Wu, Hongjun, Wu, Luohong, Wu, Jie Ying, Wu, Junlin, Wu, Victoria, Wu, Kaixuan, Wójcikowski, Mateusz, Xiao, Yunye, Xiao, Nan, Xie, Wenxuan, Yang, Hao, Yang, Tianqi, Yang, Yinuo, Ye, Menglong, Yeung, Ryan S., Yilmaz, Nural, Yin, Chim Ho, Yip, Michael, Younis, Rayan, Yu, Chenhao, Zaman, Sayem Nazmuz, Zefran, Milos, Zhang, Han, Zhang, Yuelin, Zhang, Yidong, Zhang, Yanyong, Zhang, Xuyang, Zhang, Yameng, Zhang, Joyce, Zhong, Ning, Zhou, Peng, Zhou, Haoying, Zuo, Xiuli, Navab, Nassir, Azizian, Mahdi, Huver, Sean D., Krieger, Axel

Abstract

Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date. It spans more than 49 institutions and multiple robotic platforms, including the CMR Versius, Intuitive Surgical's da Vinci, the da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, and it covers surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics. It is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others), and it achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.

Chinese Translation

自主医疗机器人有望改善患者结果、减少提供者工作负担、实现医疗服务的普及，并实现超人类的精确度。然而，自主医疗机器人技术的发展受到一个根本性数据问题的限制：现有的医疗机器人数据集规模小、具身性单一，且很少公开共享，这限制了该领域所需基础模型的发展。我们推出了Open-H-Embodiment，这是迄今为止最大的开放医疗机器人视频数据集，具有同步运动学，涵盖了49个以上的机构和多个机器人平台，包括CMR Versius、Intuitive Surgical的da Vinci、da Vinci Research Kit (dVRK)、Rob Surgical BiTrack、Virtual Incision的MIRA、Moon Surgical Maestro，以及各种定制系统，涉及外科操作、机器人超声和内窥镜程序。我们通过两个基础模型展示了该数据集所支持的研究。GR00T-H是首个用于医疗机器人的开放基础视觉-语言-动作模型，是唯一一个在结构化缝合基准上实现完整端到端任务完成的评估模型（25%的试验成功率，相较于其他模型的0%），并在29步的体外缝合序列中实现了64%的平均成功率。我们还训练了Cosmos-H-Surgical-Simulator，这是首个基于动作条件的世界模型，能够从单一检查点实现多具身性的外科模拟，涵盖九个机器人平台，并支持医疗领域的计算政策评估和合成数据生成。这些结果表明，开放的大规模医疗机器人数据收集可以作为研究社区的关键基础设施，促进机器人学习、世界建模等领域的进步。

cs.RO / 6 / 2604.21053

Neuro-Symbolic Manipulation Understanding with Enriched Semantic Event Chains

富语义事件链下的神经符号操控理解

Ziaeetabar, Fatemeh

Abstract

Robotic systems operating in human environments must reason about how object interactions evolve over time, which actions are currently being performed, and what manipulation step is likely to follow. Classical enriched Semantic Event Chains (eSECs) provide an interpretable relational description of manipulation, but remain primarily descriptive and do not directly support uncertainty-aware decision making. In this paper, we propose eSEC-LAM, a neuro-symbolic framework that transforms eSECs into an explicit event-level symbolic state for manipulation understanding. The proposed formulation augments classical eSECs with confidence-aware predicates, functional object roles, affordance priors, primitive-level abstraction, and saliency-guided explanation cues. These enriched symbolic states are derived from a foundation-model-based perception front-end through deterministic predicate extraction, while current-action inference and next-primitive prediction are performed using lightweight symbolic reasoning over primitive pre- and post-conditions. We evaluate the proposed framework on EPIC-KITCHENS-100, EPIC-KITCHENS VISOR, and Assembly101 across action recognition, next-primitive prediction, robustness to perception noise, and explanation consistency. Experimental results show that eSEC-LAM achieves competitive action recognition, substantially improves next-primitive prediction, remains more robust under degraded perceptual conditions than both classical symbolic and end-to-end video baselines, and provides temporally consistent explanation traces grounded in explicit relational evidence. These findings demonstrate that enriched Semantic Event Chains can serve not only as interpretable descriptors of manipulation, but also as effective internal states for neuro-symbolic action reasoning.

Chinese Translation

在以人为环境为基础的机器人系统中，必须推理对象交互如何随时间演变，当前正在执行哪些动作，以及接下来可能进行的操控步骤。经典的富语义事件链（eSECs）提供了操控的可解释关系描述，但主要仍然是描述性的，未能直接支持基于不确定性的决策制定。本文提出了eSEC-LAM，一个神经符号框架，将eSECs转化为用于操控理解的显式事件级符号状态。所提出的公式通过引入基于置信度的谓词、功能性对象角色、可供性先验、原始级抽象和显著性引导的解释线索来增强经典的eSECs。这些丰富的符号状态是通过基于基础模型的感知前端通过确定性谓词提取得出的，同时当前动作推断和下一个原始动作预测则是通过对原始前后条件进行轻量级符号推理来实现的。我们在EPIC-KITCHENS-100、EPIC-KITCHENS VISOR和Assembly101上评估了所提出的框架，涉及动作识别、下一个原始动作预测、对感知噪声的鲁棒性以及解释一致性。实验结果表明，eSEC-LAM在动作识别方面表现出竞争力，显著改善了下一个原始动作的预测，在感知条件恶化时比经典符号和端到端视频基线更具鲁棒性，并提供了基于显式关系证据的时间一致性解释轨迹。这些发现表明，丰富的语义事件链不仅可以作为操控的可解释描述符，还可以作为神经符号动作推理的有效内部状态。
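The next-primitive prediction described in the abstract above, symbolic reasoning over primitive pre- and post-conditions with confidence-aware predicates, can be sketched in a few lines. This is an illustrative toy and not the eSEC-LAM implementation: the primitive names, predicate names, and the confidence threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitive:
    name: str
    pre: frozenset   # predicates that must hold before the primitive starts
    post: frozenset  # predicates that hold once the primitive has finished


# Hypothetical primitive library for a simple pick-up sequence.
PRIMITIVES = [
    Primitive("reach", frozenset({"hand_free"}), frozenset({"hand_near_obj"})),
    Primitive("grasp", frozenset({"hand_near_obj"}), frozenset({"hand_touch_obj"})),
    Primitive("lift", frozenset({"hand_touch_obj"}), frozenset({"obj_off_support"})),
]


def predict_next(state, confidence, threshold=0.5):
    """Return names of primitives that are applicable and not yet achieved.

    A predicate counts as true only if its confidence meets the threshold,
    so noisy detections are ignored rather than trusted.
    """
    trusted = {p for p in state if confidence.get(p, 0.0) >= threshold}
    return [
        prim.name
        for prim in PRIMITIVES
        if prim.pre <= trusted and not prim.post <= trusted
    ]


if __name__ == "__main__":
    state = {"hand_free", "hand_near_obj"}
    print(predict_next(state, {"hand_free": 0.9, "hand_near_obj": 0.9}))
    print(predict_next(state, {"hand_free": 0.9, "hand_near_obj": 0.3}))
```

The second call shows the role of confidence: when the "near object" detection is unreliable, the sketch falls back to predicting the earlier primitive instead of committing to a grasp.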

cs.RO / 7 / 2604.21078

Impact-Aware Model Predictive Control for UAV Landing on a Heaving Platform

考虑冲击的模型预测控制用于无人机在浮动平台上的着陆

Stephenson, Jess, Greeff, Melissa

Abstract

Landing UAVs on heaving marine platforms is challenging because relative vertical motion can generate large impact forces and cause rebound on touchdown. To address this, we develop an impact-aware Model Predictive Control (MPC) framework that models landing as a velocity-level rigid-body impact governed by Newton's restitution law. We embed this as a linear complementarity problem (LCP) within the MPC dynamics to predict the discontinuous post-impact velocity and suppress rebound. In simulation, restitution-aware prediction reduces pre-impact relative velocity and improves landing robustness. Experiments on a heaving-deck testbed show an 86.2% reduction in post-impact deflection compared to a tracking MPC.

Chinese Translation

在浮动海洋平台上着陆无人机具有挑战性，因为相对的垂直运动可能产生大的冲击力并导致着陆时的反弹。为了解决这个问题，我们开发了一种考虑冲击的模型预测控制（MPC）框架，将着陆建模为受牛顿恢复法则支配的速度级刚体冲击。我们将其嵌入MPC动态中的线性互补问题（LCP），以预测不连续的冲击后速度并抑制反弹。在仿真中，考虑恢复的预测减少了冲击前的相对速度，提高了着陆的鲁棒性。在浮动甲板测试平台上的实验表明，与跟踪MPC相比，冲击后偏转减少了86.2%。
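The restitution model named in the abstract above can be written as a one-dimensional linear complementarity problem. The sketch below is a simplified illustration, not the paper's MPC formulation: it assumes a single vertical contact, a deck massive enough that its velocity is unaffected by the impact, and hypothetical function names. With impulse `lam`, the conditions are `lam >= 0`, `w = lam/m + (1 + e) * v_rel >= 0`, and `lam * w = 0`, where `v_rel` is the pre-impact relative velocity and `e` is the restitution coefficient.

```python
def solve_scalar_lcp(M, q):
    """Solve 0 <= lam, w = M * lam + q >= 0, lam * w = 0 for scalar M > 0."""
    if M <= 0:
        raise ValueError("M must be positive for a unique solution")
    return max(0.0, -q / M)


def post_impact_velocity(v_uav, v_deck, mass, restitution):
    """Return (UAV velocity after contact, impulse) for a vertical impact.

    Velocities are positive upward. A negative relative velocity means the
    UAV is closing on the deck, which produces a positive impulse.
    """
    v_rel = v_uav - v_deck
    lam = solve_scalar_lcp(1.0 / mass, (1.0 + restitution) * v_rel)
    return v_uav + lam / mass, lam


if __name__ == "__main__":
    # Hypothetical numbers: 2 kg UAV descending at 1 m/s, deck rising at 0.5 m/s.
    v_after, impulse = post_impact_velocity(-1.0, 0.5, 2.0, 0.5)
    print(f"post-impact velocity {v_after:.2f} m/s, impulse {impulse:.2f} N*s")
```

When the UAV is approaching, the solution reproduces Newton's law (post-impact relative velocity equals minus `e` times the pre-impact one). When it is already separating, the impulse is zero and the velocity is unchanged. This discontinuity is what a predictive controller has to account for.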

cs.RO / 8 / 2604.21130

Self-Predictive Representation for Autonomous UAV Object-Goal Navigation

自我预测表示在自主无人机目标导航中的应用

Ayala, Angel, Sui, Donling, Cruz, Francisco, Torok, Mitchell, Deghat, Mohammad, Fernandes, Bruno J. T.

Abstract

Autonomous Unmanned Aerial Vehicles (UAVs) have revolutionized industries through their versatility, with applications including aerial surveillance, search and rescue, agriculture, and delivery. Their autonomous capabilities offer unique advantages, such as operating in large open-space environments. Reinforcement Learning (RL) empowers UAVs to learn intricate navigation policies, enabling them to optimize flight behavior autonomously. However, one of its main challenges is the inefficient use of data samples to achieve a good policy. In object-goal navigation (OGN) settings, target recognition arises as an extra challenge. Most UAV-related approaches use relative or absolute coordinates to move from an initial position to a predefined location, rather than finding the target directly. This study addresses the sample-efficiency issue in solving a 3D OGN problem and formalizes the unknown-target-location setting as a Markov decision process. Experiments are conducted to analyze the interplay of different state representation learning (SRL) methods for perception with a model-free RL algorithm for planning in an autonomous navigation system. The main contribution of this study is the development of the perception module, featuring a novel self-predictive model named AmelPred. Empirical results demonstrate that its stochastic version, AmelPredSto, is the best-performing SRL model when combined with actor-critic RL algorithms. The results show a substantial improvement in the efficiency of RL algorithms when AmelPredSto is used to solve the OGN problem.

Chinese Translation

自主无人机（UAV）通过其多功能性彻底改变了多个行业，应用包括空中监视、搜索与救援、农业和快递。它们的自主能力提供了独特的优势，例如能够在广阔的开放空间环境中操作。强化学习（RL）使无人机能够学习复杂的导航策略，从而使其能够自主优化飞行行为。然而，其主要挑战之一是利用数据样本实现良好策略的效率低下。在目标导航（OGN）设置中，目标识别成为额外的挑战。大多数与无人机相关的方法使用相对或绝对坐标从初始位置移动到预定义位置，而不是直接寻找目标。本研究解决了在解决三维OGN问题时数据样本效率的问题，并将未知目标位置设置形式化为马尔可夫决策过程。实验分析了不同状态表示学习（SRL）方法与无模型RL算法在自主导航系统中的规划之间的相互作用。本研究的主要贡献是开发了感知模块，采用了一种名为AmelPred的新型自我预测模型。实证结果表明，其随机版本AmelPredSto在与演员-评论家RL算法结合时表现最佳。获得的结果显示，通过在解决OGN问题中使用AmelPredSto，RL算法的效率显著提高。

cs.RO / 9 / 2604.21138

Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems

穿越杂乱：基于航点的多机器人系统双层规划

Ji, Jiabao, Chen, Yongchao, Zhang, Yang, Kompella, Ramana Rao, Fan, Chuchu, Liu, Gaowen, Chang, Shiyu

Abstract

Multi-robot control in cluttered environments is a challenging problem that involves complex physical constraints, including robot-robot collisions, robot-obstacle collisions, and unreachable motions. Successful planning in such settings requires joint optimization over high-level task planning and low-level motion planning, as violations of physical constraints may arise from failures at either level. However, jointly optimizing task and motion planning is difficult due to the complex parameterization of low-level motion trajectories and the ambiguity of credit assignment across the two planning levels. In this paper, we propose a hybrid multi-robot control framework that jointly optimizes task and motion planning. To enable effective parameterization of low-level planning, we introduce waypoints, a simple yet expressive representation for motion trajectories. To address the credit assignment challenge, we adopt a curriculum-based training strategy with a modified RLVR algorithm that propagates motion feasibility feedback from the motion planner to the task planner. Experiments on BoxNet3D-OBS, a challenging multi-robot benchmark with dense obstacles and up to nine robots, show that our approach consistently improves task success over motion-agnostic and VLA-based baselines. Our code is available at https://github.com/UCSB-NLP-Chang/navigate-cluster

Chinese Translation

在杂乱环境中进行多机器人控制是一项具有挑战性的任务，涉及复杂的物理约束，包括机器人间碰撞、机器人与障碍物碰撞以及无法到达的运动。在这种环境中成功的规划需要对高层任务规划和低层运动规划进行联合优化，因为物理约束的违反可能源于任一层次的失败。然而，由于低层运动轨迹的复杂参数化和两个规划层次之间的信用分配模糊性，联合优化任务和运动规划是困难的。本文提出了一种混合多机器人控制框架，能够联合优化任务和运动规划。为了有效地参数化低层规划，我们引入了航点（waypoints），这是一种简单而富有表现力的运动轨迹表示方法。为了解决信用分配挑战，我们采用了一种基于课程的训练策略，并结合修改后的RLVR算法，将运动可行性反馈从运动规划器传播到任务规划器。在BoxNet3D-OBS这一具有密集障碍物和多达九个机器人的挑战性多机器人基准测试中，实验结果表明我们的方法在任务成功率上始终优于运动无关和基于VLA的基线方法。我们的代码可在 https://github.com/UCSB-NLP-Chang/navigate-cluster 获取。

cs.RO / 10 / 2604.21189

Full-Body Dynamic Safety for Robot Manipulators: 3D Poisson Safety Functions for CBF-Based Safety Filters

机器人操纵器的全身动态安全：基于CBF的安全滤波器的3D泊松安全函数

Wilkinson, Meg, Bahati, Gilbert, Bena, Ryan M., Fourney, Emily, Burdick, Joel W., Ames, Aaron D.

Abstract

Collision avoidance for robotic manipulators requires enforcing full-body safety constraints in high-dimensional configuration spaces. Control Barrier Function (CBF) based safety filters have proven effective in enabling safe behaviors, but enforcing the high number of constraints needed for safe manipulation leads to theoretic and computational challenges. This work presents a framework for full-body collision avoidance for manipulators in dynamic environments by leveraging 3D Poisson Safety Functions (PSFs). In particular, given environmental occupancy data, we sample the manipulator surface at a prescribed resolution and shrink free space via a Pontryagin difference according to this resolution. On this buffered domain, we synthesize a globally smooth CBF by solving Poisson's equation, yielding a single safety function for the entire environment. This safety function, evaluated at each sampled point, yields task-space CBF constraints enforced by a real-time safety filter via a multi-constraint quadratic program. We prove that keeping the sample points safe in the buffered region guarantees collision avoidance for the entire continuous robot surface. The framework is validated on a 7-degree-of-freedom manipulator in dynamic environments.

Chinese Translation

机器人操纵器的碰撞避免需要在高维配置空间中强制执行全身安全约束。基于控制屏障函数（CBF）的安全滤波器已被证明在实现安全行为方面有效，但强制执行安全操作所需的大量约束带来了理论和计算上的挑战。本研究提出了一种框架，通过利用3D泊松安全函数（PSF），实现动态环境中操纵器的全身碰撞避免。具体而言，给定环境占用数据，我们以规定的分辨率对操纵器表面进行采样，并根据该分辨率通过庞特里亚金差分缩小自由空间。在这个缓冲区域内，我们通过求解泊松方程合成一个全局平滑的CBF，从而为整个环境生成一个单一的安全函数。该安全函数在每个采样点的评估结果生成了任务空间的CBF约束，这些约束通过实时安全滤波器通过多约束二次规划进行强制执行。我们证明，在缓冲区域内保持安全的采样点可以保证整个连续机器人表面的碰撞避免。该框架在动态环境中的7自由度操纵器上得到了验证。
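The idea of synthesizing one smooth safety function from occupancy data by solving Poisson's equation can be illustrated on a small 2D grid. This is a minimal sketch under stated assumptions, not the paper's 3D method: it uses a constant forcing term, treats the grid border as an obstacle, uses plain Jacobi iteration with unit grid spacing, and omits the Pontryagin-difference buffering step. The function name is hypothetical.

```python
def poisson_safety(occupied, iterations=500, forcing=1.0):
    """Approximate h with -laplace(h) = forcing in free cells and h = 0 elsewhere.

    occupied is a list of rows of booleans. Occupied cells and the outer
    border are held at zero, so h is positive only inside free space.
    """
    rows, cols = len(occupied), len(occupied[0])
    h = [[0.0] * cols for _ in range(rows)]
    for _ in range(iterations):
        new = [[0.0] * cols for _ in range(rows)]
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                if occupied[i][j]:
                    continue  # obstacle cells stay at the boundary value 0
                neighbours = h[i - 1][j] + h[i + 1][j] + h[i][j - 1] + h[i][j + 1]
                new[i][j] = 0.25 * (neighbours + forcing)
        h = new
    return h


if __name__ == "__main__":
    size = 7
    grid = [[False] * size for _ in range(size)]
    grid[3][3] = True  # one hypothetical obstacle in the middle
    h = poisson_safety(grid)
    for row in h:
        print(" ".join(f"{value:5.2f}" for value in row))
```

In a safety filter, values of such a function sampled at points on the robot surface, together with its gradient, would supply the barrier constraints of the quadratic program. That part is left out of this sketch.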

cs.RO / 11 / 2604.21192

How VLAs (Really) Work In Open-World Environments

VLAs（视觉-语言-行动模型）在开放世界环境中的实际工作机制

Rasouli, Amir, Wu, Yangzheng, Li, Zhiyuan, Yang, Rui Heng, Zhao, Xuan, Eret, Charles, Pakdamansavoji, Sajjad

Abstract

Vision-language-action models (VLAs) have been extensively used in robotics applications, achieving great success in various manipulation problems. More recently, VLAs have been used in long-horizon tasks and evaluated on benchmarks, such as BEHAVIOR1K (B1K), for solving complex household chores. The common metric for measuring progress in such benchmarks is success rate or a partial score based on satisfaction of progress-agnostic criteria, meaning that only the final states of the objects are considered, regardless of the events that lead to such states. In this paper, we argue that such evaluation protocols say little about the safety aspects of operation and can potentially exaggerate reported performance, downplaying core challenges for future real-world deployment. To this end, we conduct a thorough analysis of state-of-the-art models on the B1K Challenge and evaluate policies in terms of robustness (via reproducibility and consistency of performance), safety aspects of policy operation, task awareness, and key factors that lead to task incompletion. We then propose evaluation protocols that capture safety violations to better measure the true performance of the policies in more complex and interactive scenarios. Finally, we discuss the limitations of existing VLAs and motivate future research.

Chinese Translation

视觉-语言-行动模型（VLAs）在机器人应用中得到了广泛使用，在各种操作问题上取得了显著成功。最近，VLAs被应用于长期任务，并在基准测试（如BEHAVIOR1K，B1K）中进行了评估，以解决复杂的家庭琐事。在这些基准测试中，衡量进展的常用指标是成功率或基于满足与进展无关标准的部分得分，这意味着仅考虑物体的最终状态，而不考虑导致这些状态的事件。在本文中，我们认为使用这样的评估协议对操作的安全性方面的了解甚少，并可能夸大报告的性能，从而削弱未来真实世界部署的核心挑战。为此，我们对B1K挑战中的最先进模型进行了深入分析，并从可重复性和性能一致性、政策操作的安全性、任务意识以及导致任务未完成的关键因素等方面评估政策的鲁棒性。然后，我们提出评估协议以捕捉安全违规行为，以更好地衡量政策在更复杂和互动场景中的真实表现。最后，我们讨论了现有VLAs的局限性，并激励未来的研究。

cs.RO / 12 / 2604.21241

CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

CorridorVLA：通过稀疏锚点为生成动作头提供显式空间约束

Li, Dachong, Chen, ZhuangZhuang, Zhang, Jin, Li, Jianqiang

Abstract

Vision-Language-Action (VLA) models often use intermediate representations to connect multimodal inputs with continuous control, yet spatial guidance is often injected implicitly through latent features. We propose CorridorVLA, which predicts sparse spatial anchors as incremental physical changes (e.g., Δ-positions) and uses them to impose an explicit tolerance region in the training objective for action generation. The anchors define a corridor that guides a flow-matching action head: trajectories whose implied spatial evolution falls outside it receive corrective gradients, while minor deviations from contacts and execution noise are permitted. On the more challenging LIBERO-Plus benchmark, CorridorVLA yields consistent gains across both SmolVLA and GR00T, improving success rate by 3.4%-12.4% over the corresponding baselines; notably, our GR00T-Corr variant reaches a success rate of 83.21%. These results indicate that action-aligned physical cues can provide direct and interpretable constraints for generative action policies, complementing spatial guidance encoded in visual or latent forms. Code is available at https://github.com/corridorVLA.

Chinese Translation

视觉-语言-动作（VLA）模型通常使用中间表示来连接多模态输入与连续控制，然而空间引导往往是通过潜在特征隐式注入的。我们提出了CorridorVLA，该模型预测稀疏空间锚点作为增量物理变化（例如，Δ-位置），并利用这些锚点在动作生成的训练目标中施加显式的容忍区域。这些锚点定义了一个走廊，指导流匹配动作头：那些隐含空间演变超出该走廊的轨迹将接收纠正梯度，而与接触和执行噪声的轻微偏差则被允许。在更具挑战性的LIBERO-Plus基准测试中，CorridorVLA在SmolVLA和GR00T上均取得了一致的提升，成功率比相应基线提高了3.4%到12.4%；值得注意的是，我们的GR00T-Corr变体达到了83.21%的成功率。这些结果表明，与动作对齐的物理线索可以为生成动作策略提供直接且可解释的约束，补充了以视觉或潜在形式编码的空间引导。代码可在https://github.com/corridorVLA获取。
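The "tolerance region" in the abstract above suggests a penalty that is zero inside the corridor and grows outside it. The sketch below shows one plausible form, a squared hinge on the distance between each implied displacement and its anchor. It is an assumption for illustration only: the abstract does not give the actual loss, and the function name, tolerance value, and sample points are hypothetical.

```python
import math


def corridor_penalty(implied_deltas, anchor_deltas, tolerance):
    """Mean squared hinge penalty for displacements outside the corridor.

    Each element is an (x, y, z) displacement. Deviations smaller than the
    tolerance cost nothing, so small contact or execution noise is allowed.
    """
    if len(implied_deltas) != len(anchor_deltas):
        raise ValueError("need one implied displacement per anchor")
    total = 0.0
    for implied, anchor in zip(implied_deltas, anchor_deltas):
        excess = max(0.0, math.dist(implied, anchor) - tolerance)
        total += excess ** 2
    return total / len(anchor_deltas)


if __name__ == "__main__":
    anchors = [(0.05, 0.0, 0.0), (0.0, 0.0, 0.0)]
    implied = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0)]
    # The first point is 0.05 from its anchor (inside a 0.1 corridor), the second is 0.3 away.
    print(corridor_penalty(implied, anchors, tolerance=0.1))
```

Because the penalty is flat inside the tolerance, its gradient there is zero. Only trajectories that leave the corridor are pushed back, which matches the behavior the abstract describes.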

cs.RO / 13 / 2604.21249

Reasoning About Traversability: Language-Guided Off-Road 3D Trajectory Planning

关于可通行性的推理：语言引导的越野3D轨迹规划

Park, Byounggun, Hwang, Soonmin

Abstract

While Vision-Language Models (VLMs) enable high-level semantic reasoning for end-to-end autonomous driving, particularly in unstructured environments, existing off-road datasets suffer from language annotations that are weakly aligned with vehicle actions and terrain geometry. To address this misalignment, we propose a language refinement framework that restructures annotations into action-aligned pairs, enabling a VLM to generate refined scene descriptions and 3D future trajectories directly from a single image. To further encourage terrain-aware planning, we introduce a preference optimization strategy that constructs geometry-aware hard negatives and explicitly penalizes trajectories inconsistent with local elevation profiles. Furthermore, we propose off-road-specific metrics to quantify traversability compliance and elevation consistency, addressing the limitations of conventional on-road evaluation. Experiments on the ORAD-3D benchmark demonstrate that our approach reduces average trajectory error from 1.01m to 0.97m, improves traversability compliance from 0.621 to 0.644, and decreases elevation inconsistency from 0.428 to 0.322, highlighting the efficacy of action-aligned supervision and terrain-aware optimization for robust off-road driving.

Chinese Translation

尽管视觉-语言模型（VLMs）能够实现端到端自主驾驶的高层次语义推理，特别是在非结构化环境中，但现有的越野数据集存在语言注释与车辆动作和地形几何形状之间的弱对齐问题。为了解决这一不对齐问题，我们提出了一种语言精炼框架，将注释重构为与动作对齐的配对，从而使VLM能够直接从单幅图像生成精炼的场景描述和3D未来轨迹。为了进一步促进对地形的关注规划，我们引入了一种偏好优化策略，构建几何感知的困难负样本，并明确惩罚与局部高程轮廓不一致的轨迹。此外，我们提出了越野特定的指标来量化可通行性合规性和高程一致性，解决了传统公路评估的局限性。在ORAD-3D基准上的实验表明，我们的方法将平均轨迹误差从1.01米降低到0.97米，将可通行性合规性从0.621提高到0.644，并将高程不一致性从0.428降低到0.322，突显了动作对齐监督和地形感知优化在稳健越野驾驶中的有效性。

cs.RO / 14 / 2604.21331

FingerViP: Learning Real-World Dexterous Manipulation with Fingertip Visual Perception

FingerViP：利用指尖视觉感知学习现实世界的灵巧操作

Zhang, Zhen, Wang, Weinan, Sun, Hejia, Ding, Qingpeng, Chu, Xiangyu, Fang, Guoxin, Au, K. W. Samuel

Abstract

The current practice of dexterous manipulation generally relies on a single wrist-mounted view, which is often occluded and limits performance on tasks requiring multi-view perception. In this work, we present FingerViP, a learning system that utilizes a visuomotor policy with fingertip visual perception for dexterous manipulation. Specifically, we design a vision-enhanced fingertip module with an embedded miniature camera and install the modules on each finger of a multi-fingered hand. The fingertip cameras substantially improve visual perception by providing comprehensive, multi-view feedback of both the hand and its surrounding environment. Building on the integrated fingertip modules, we develop a diffusion-based whole-body visuomotor policy conditioned on a third-view camera and multi-view fingertip vision, which effectively learns complex manipulation skills directly from human demonstrations. To improve view-proprioception alignment and contact awareness, each fingertip visual feature is augmented with its corresponding camera pose encoding and per-finger joint-current encoding. We validate the effectiveness of the multi-view fingertip vision and demonstrate the robustness and adaptability of FingerViP on various challenging real-world tasks, including pressing buttons inside a confined box, retrieving sticks from an unstable support, retrieving objects behind an occluding curtain, and performing long-horizon cabinet opening and object retrieval, achieving an overall success rate of 80.8%. All hardware designs and code will be fully open-sourced.

Chinese Translation

当前的灵巧操作实践通常依赖于单一的腕部视角，这种视角常常被遮挡，限制了在需要多视角感知的任务中的表现。在本研究中，我们提出了FingerViP，一个利用指尖视觉感知的视觉运动策略学习系统，用于灵巧操作。具体而言，我们设计了一个增强视觉的指尖模块，配备嵌入式微型摄像头，并将这些模块安装在多指手的每个手指上。指尖摄像头通过提供手部及其周围环境的全面多视角反馈，显著改善了视觉感知。基于集成的指尖模块，我们开发了一种基于扩散的全身视觉运动策略，该策略以第三视角摄像头和多视角指尖视觉为条件，有效地从人类示范中学习复杂的操作技能。为了改善视角-本体感知对齐和接触意识，每个指尖视觉特征都与其对应的摄像头姿态编码和每个手指的关节电流编码相结合。我们验证了多视角指尖视觉的有效性，并展示了FingerViP在各种具有挑战性的现实世界任务中的鲁棒性和适应性，包括在狭小箱子内按压按钮、从不稳定支撑物中取出棍子、从遮挡帘后取出物体，以及执行长时间的柜门打开和物体取出，整体成功率达到80.8%。所有硬件设计和代码将完全开源。


cs.RO / 15 / 2604.21337

PREVENT-JACK: Context Steering for Swarms of Long Heavy Articulated Vehicles

PREVENT-JACK：长重型关节车辆群体的上下文引导

Baruck, Adrian, Dubé, Michael, Steup, Christoph, Mostaghim, Sanaz

Abstract

In this paper, we extend the traditional point-mass-like robot representation in swarm robotics and instead study a swarm of long Heavy Articulated Vehicles (HAVs). HAVs are kinematically constrained, elongated, and articulated, introducing unique challenges. Local, decentralized coordination of these vehicles is motivated by many real-world applications. Our approach, Prevent-Jack, brings context steering, a framework that has received little coverage in robotics, to this setting. It fuses six local behaviors, providing guarantees against jackknifing and collisions at the cost of potential dead- and livelocks, and is tested for vehicles with up to ten trailers. We highlight the importance of the Evade Attraction behavior for deadlock prevention using a parameter study, and use 15,000 simulations to evaluate the swarm performance. Our experiments show that both deadlocks and livelocks occur more frequently in larger swarms and denser scenarios, affecting a peak average of 27%/31% of vehicles. We observe that larger swarms exhibit increased waiting, while smaller swarms show increased evasion.
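As an illustration of the context steering idea named in the abstract, here is a minimal Python sketch. It is not the Prevent-Jack implementation. It assumes the commonly described form of context steering: each behavior writes into an interest map and a danger map over discretized headings, the maps are fused slot by slot, dangerous slots are masked, and the best remaining slot is chosen. The function names, the max-based fusion, and the threshold value are assumptions made for this example, and the six behaviors from the abstract are not modeled.

```python
import math


def fuse(maps):
    """Fuse several behaviors' maps by taking the slot-wise maximum (assumed rule)."""
    return [max(values) for values in zip(*maps)]


def context_steer(interest, danger, danger_threshold=0.5):
    """Pick a heading from per-direction interest and danger maps.

    Slots whose danger exceeds the threshold are masked out. Among the
    remaining slots, the one with the highest interest wins. Returns the
    heading in radians, or None if every slot is masked.
    """
    n = len(interest)
    best_slot, best_value = None, -math.inf
    for slot in range(n):
        if danger[slot] > danger_threshold:
            continue  # hard safety mask
        if interest[slot] > best_value:
            best_slot, best_value = slot, interest[slot]
    if best_slot is None:
        return None
    return 2.0 * math.pi * best_slot / n


# Two hypothetical behaviors vote on four headings (0, 90, 180, 270 degrees).
interest = fuse([[0.1, 0.9, 0.4, 0.0], [0.2, 0.3, 0.8, 0.0]])
danger = [0.0, 0.7, 0.1, 0.0]
heading = context_steer(interest, danger)  # slot 1 is masked, slot 2 is chosen
```

Returning None when every slot is masked shows one way a hard safety mask can leave a vehicle without an admissible heading, which is loosely the kind of deadlock the abstract trades against its safety guarantees.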

Chinese Translation

本文旨在扩展传统的点质量机器人表示方法，研究长重型关节车辆（Heavy Articulated Vehicles, HAVs）的群体。HAVs 在运动学上受到约束，形状细长且关节化，带来了独特的挑战。这些车辆的局部去中心化协调受到许多现实世界应用的驱动。我们提出的方法 Prevent-Jack 引入了一种稀疏覆盖的上下文引导框架。该框架融合了六种局部行为，提供了防止车辆发生折叠和碰撞的保障，但可能会导致死锁和活锁的情况，经过测试的车辆可拖带多达十个挂车。我们通过参数研究强调了“躲避吸引”（Evade Attraction）行为在防止死锁中的重要性，并使用 15,000 次模拟评估群体性能。我们的广泛实验和结果表明，死锁和活锁在更大规模和更密集的场景中发生的频率更高，影响到车辆的峰值平均比例为 27%/31%。我们观察到，较大的群体表现出更长的等待时间，而较小的群体则表现出更强的躲避能力。


cs.RO / 16 / 2604.21351

Learn Weightlessness: Imitate Non-Self-Stabilizing Motions on Humanoid Robot

学习无重状态：模仿类人机器人非自我稳定运动

Xin, Yucheng, Bao, Jiacheng, Yang, Haoran, Que, Wenqiang, Wang, Dong, Tan, Junbo, Wang, Xueqian, Zhao, Bin, Li, Xuelong

Abstract

The integration of imitation and reinforcement learning has enabled remarkable advances in humanoid whole-body control, facilitating diverse human-like behaviors. However, research on environment-dependent motions remains limited. Existing methods typically enforce rigid trajectory tracking while neglecting physical interactions with the environment. We observe that humans naturally exploit a "weightless" state during non-self-stabilizing (NSS) motions, selectively relaxing specific joints to allow passive body-environment contact, thereby stabilizing the body and completing the motion. Inspired by this biological mechanism, we design a weightlessness-state auto-labeling strategy for dataset annotation, and we propose the Weightlessness Mechanism (WM), a method that dynamically determines which joints to relax and to what level. Together, these enable effective environmental interaction while executing target motions. We evaluate our approach on three representative NSS tasks: sitting on chairs of varying heights, lying down on beds with different inclinations, and leaning against walls via shoulder or elbow. Extensive experiments in simulation and on the Unitree G1 robot demonstrate that our WM method, trained on single-action demonstrations without any task-specific tuning, achieves strong generalization across diverse environmental configurations while maintaining motion stability. Our work bridges the gap between precise trajectory tracking and adaptive environmental interaction, offering a biologically-inspired solution for contact-rich humanoid control.

Chinese Translation

模仿学习与强化学习的结合使类人全身控制取得了显著进展，促进了多样的人类行为。然而，关于环境依赖性运动的研究仍然有限。现有方法通常强制执行刚性轨迹跟踪，而忽视了与环境的物理交互。我们观察到，人类在非自我稳定（NSS）运动中自然利用了一种“无重”状态——选择性地放松特定关节，以允许被动的身体与环境接触，从而稳定身体并完成运动。受到这一生物机制的启发，我们设计了一种无重状态自动标注策略用于数据集注释；并提出了无重机制（Weightlessness Mechanism, WM），该方法动态确定放松哪些关节以及放松到何种程度，从而在执行目标运动时有效实现环境交互。我们在三个代表性的NSS任务上评估了我们的方法：坐在不同高度的椅子上、躺在不同倾斜度的床上，以及通过肩部或肘部靠在墙上。在仿真和Unitree G1机器人上的大量实验表明，我们的WM方法在没有任何任务特定调优的情况下，通过单一动作演示进行训练，能够在多样的环境配置中实现强大的泛化能力，同时保持运动稳定性。我们的工作弥合了精确轨迹跟踪与自适应环境交互之间的差距，为接触丰富的类人控制提供了生物启发的解决方案。


cs.RO / 17 / 2604.21355

RPG: Robust Policy Gating for Smooth Multi-Skill Transitions in Humanoid Fighting

RPG：用于类人战斗中平滑多技能过渡的鲁棒策略门控

Xin, Yucheng, Bao, Jiacheng, Dong, Yubo, Wang, Xueqian, Zhao, Bin, Li, Xuelong, Tan, Junbo, Wang, Dong

Abstract

Humanoid robots have demonstrated impressive motor skills in a wide range of tasks, yet whole-body control for human-like, long-duration, dynamic fighting remains particularly challenging due to the stringent requirements on agility and stability. While imitation learning enables robots to execute human-like fighting skills, existing approaches often rely on switching among multiple single-skill policies or employing a general policy to imitate input reference motions. These strategies suffer from instability when transitioning between skills, as the mismatch of initial and terminal states across skills or reference motions introduces out-of-domain disturbances, resulting in unsmooth or unstable behaviors. In this work, we propose RPG, a hybrid expert policy framework for smooth and stable humanoid multi-skill transitions. Our approach incorporates motion transition randomization and temporal randomization to train a unified policy that generates agile fighting actions with stability and smoothness during skill transitions. Furthermore, we design a control pipeline that integrates walking/running locomotion with fighting skills, allowing human-like combat of arbitrary duration in which the robot can be seamlessly interrupted or switch action policies at any time. Extensive experiments in simulation demonstrate the effectiveness of the proposed framework, and real-world deployment on the Unitree G1 humanoid robot further validates its robustness and applicability.

Chinese Translation

类人机器人在广泛的任务中展示了令人印象深刻的运动技能，但由于对灵活性和稳定性的严格要求，类人机器人在长时间动态战斗中的整体控制仍然特别具有挑战性。虽然模仿学习使机器人能够执行类人战斗技能，但现有的方法往往依赖于在多个单一技能策略之间切换，或采用通用策略来模仿输入参考动作。这些策略在技能之间过渡时容易出现不稳定性，因为技能或参考动作之间初始状态和终止状态的不匹配会引入域外干扰，导致行为不平滑或不稳定。在本研究中，我们提出了RPG，一个混合专家策略框架，用于平滑和稳定的类人多技能过渡。我们的方法结合了运动过渡随机化和时间随机化，以训练一个统一的策略，在技能过渡期间生成灵活且稳定平滑的战斗动作。此外，我们设计了一个控制管道，将行走/奔跑运动与战斗技能整合在一起，允许类人机器人进行任意持续时间的长时间战斗，并能够在任何时候无缝中断或过渡动作策略。在仿真中的大量实验验证了所提框架的有效性，而在Unitree G1类人机器人上的实际部署进一步验证了其鲁棒性和适用性。


cs.RO / 18 / 2604.21363

A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration

一种可部署的具身视觉-语言导航系统，具有层次认知和上下文感知探索

Xu, Kuan, Liu, Ruimeng, Yang, Yizhuo, Liang, Denan, Jin, Tongxing, Yuan, Shenghai, Wang, Chen, Xie, Lihua

Abstract

Bridging the gap between embodied intelligence and embedded deployment remains a key challenge in intelligent robotic systems, where perception, reasoning, and planning must operate under strict constraints on computation, memory, energy, and real-time execution. In vision-language navigation (VLN), existing approaches often face a fundamental trade-off between strong reasoning capabilities and efficient deployment on real-world platforms. In this paper, we present a deployable embodied VLN system that achieves both high efficiency and robust high-level reasoning on real-world robotic platforms. To achieve this, we decouple the system into three asynchronous modules: a real-time perception module for continuous environment sensing, a memory integration module for spatial-semantic aggregation, and a reasoning module for high-level decision making. We incrementally construct a cognitive memory graph to encode scene information, which is further decomposed into subgraphs to enable reasoning with a vision-language model (VLM). To further improve navigation efficiency and accuracy, we also leverage the cognitive memory graph to formulate the exploration problem as a context-aware Weighted Traveling Repairman Problem (WTRP), which minimizes the weighted waiting time of viewpoints. Extensive experiments both in simulation and on real-world robotic platforms demonstrate improved navigation success and efficiency over existing VLN approaches, while maintaining real-time performance on resource-constrained hardware.
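To make the Weighted Traveling Repairman objective concrete, the following Python sketch scores a visiting order by the sum of each viewpoint's weight times its arrival time, then finds the best order by brute force. It is only an illustration of the objective: the planar viewpoints, the weights, the constant speed, and the exhaustive search are assumptions for this example, and the abstract does not say how the authors derive context-aware weights or solve the problem.

```python
import itertools
import math


def weighted_latency(order, start, points, weights, speed=1.0):
    """Sum of weight * arrival time over viewpoints visited in the given order."""
    total, elapsed, position = 0.0, 0.0, start
    for index in order:
        elapsed += math.dist(position, points[index]) / speed
        total += weights[index] * elapsed
        position = points[index]
    return total


def solve_wtrp_bruteforce(start, points, weights):
    """Try every visiting order; only practical for a handful of viewpoints."""
    best = min(
        itertools.permutations(range(len(points))),
        key=lambda order: weighted_latency(order, start, points, weights),
    )
    return list(best), weighted_latency(best, start, points, weights)


# A near, low-weight viewpoint and a farther, high-weight one (hypothetical values).
order, cost = solve_wtrp_bruteforce((0.0, 0.0), [(1.0, 0.0), (-2.0, 0.0)], [1.0, 10.0])
```

In this toy case, visiting the nearer viewpoint first costs 1*1 + 10*4 = 41, while visiting the heavier one first costs 10*2 + 1*5 = 25, so the weighted objective prefers the farther viewpoint. A shortest-path objective would not distinguish the two viewpoints by importance.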

Chinese Translation

在智能机器人系统中，弥合具身智能与嵌入式部署之间的差距仍然是一个关键挑战，其中感知、推理和规划必须在计算、内存、能量和实时执行的严格限制下运行。在视觉-语言导航（VLN）中，现有方法往往面临强推理能力与在现实世界平台上高效部署之间的基本权衡。本文提出了一种可部署的具身VLN系统，能够在现实世界机器人平台上实现高效性和强大的高层次推理能力。为此，我们将系统解耦为三个异步模块：一个用于持续环境感知的实时感知模块，一个用于空间-语义聚合的记忆集成模块，以及一个用于高层决策的推理模块。我们逐步构建一个认知记忆图以编码场景信息，该图进一步分解为子图，以便与视觉-语言模型（VLM）进行推理。为了进一步提高导航的效率和准确性，我们还利用认知记忆图将探索问题表述为一个上下文感知的加权旅行修理工问题（WTRP），以最小化视点的加权等待时间。在模拟和现实世界机器人平台上进行的大量实验表明，与现有的VLN方法相比，我们的方法在导航成功率和效率上都有所提升，同时在资源受限的硬件上保持实时性能。


cs.RO / 19 / 2604.21377

A Replicable Robotics Awareness Method Using LLM-Enabled Robotics Interaction: Evidence from a Corporate Challenge

一种可复制的机器人意识方法：基于大语言模型的机器人互动在企业挑战中的应用

Prieto, S. A., Gopee, M. A., Arab, Y. Ben, de Soto, B. García, Esteba, J., Brizzio, P. Olivera

Abstract

Large language models are increasingly being explored as interfaces between humans and robotic systems, yet there remains limited evidence on how such technologies can be used not only for interaction, but also as a structured means of introducing robotics to non-specialist users in real organizational settings. This paper introduces and evaluates a challenge-based method for robotics awareness, implemented through an LLM-enabled humanoid robot activity conducted with employees of AD Ports Group in the United Arab Emirates. In the event, participants engaged with a humanoid robot in a logistics-inspired task environment using voice commands interpreted through an LLM-based control framework. The activity was designed as a team-based, role-driven experience intended to expose participants to embodied AI and human-robot collaboration without requiring prior robotics expertise. To evaluate the approach, a post-event survey remained open for 16 days and collected 102 responses. Results indicate strong overall reception, with high satisfaction (8.46/10), increased interest in robotics and AI (4.47/5), and improved understanding of emerging forms of human-robot collaboration (4.45/5). Participants who interacted directly with the robot also reported natural interaction (4.37/5) and a strong sense that interaction became easier as the activity progressed (4.74/5). At the same time, lower ratings for reliability and predictability point to important technical and design challenges for future iterations. The findings suggest that challenge-based, LLM-enabled humanoid interaction can serve as a promising and replicable method for robotics awareness in industrial and operational environments.

Chinese Translation

大语言模型正日益被探索作为人类与机器人系统之间的接口，但关于如何将这些技术不仅用于互动，还作为一种结构化手段在真实组织环境中向非专业用户介绍机器人技术的证据仍然有限。本文介绍并评估了一种基于挑战的方法，用于提升机器人意识，该方法通过在阿联酋AD Ports Group员工中实施的基于大语言模型的类人机器人活动进行。活动中，参与者在一个受物流启发的任务环境中，通过大语言模型控制框架解读的语音命令与类人机器人进行互动。该活动设计为基于团队的角色驱动体验，旨在让参与者接触具身人工智能和人机协作，而无需具备先前的机器人专业知识。为了评估该方法，活动后调查持续开放了16天，共收集到102份反馈。结果显示整体反响良好，满意度高（8.46/10），对机器人和人工智能的兴趣增加（4.47/5），以及对新兴人机协作形式的理解提升（4.45/5）。直接与机器人互动的参与者也报告了自然的互动体验（4.37/5），并强烈感受到随着活动的进行，互动变得更容易（4.74/5）。与此同时，可靠性和可预测性较低的评分指出了未来迭代中重要的技术和设计挑战。研究结果表明，基于挑战的、启用大语言模型的类人互动可以作为一种有前景且可复制的机器人意识方法，适用于工业和操作环境。


cs.RO / 20 / 2604.21391

From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges

从噪声到意图：用残差桥锚定生成式 VLA 策略

Zhong, Yiming, He, Yaoyu, Yang, Zemin, Tian, Pengfei, Huang, Yifan, Huang, Qingqiu, Zhu, Xinge, Ma, Yuexin

Abstract

Bridging high-level semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative VLA policies typically adopt a "Generation-from-Noise" paradigm, which disregards this disparity, leading to representation inefficiency and weak condition alignment during optimization. In this work, we propose ResVLA, an architecture that shifts the paradigm to "Refinement-from-Intent." Recognizing that robotic motion naturally decomposes into global intent and local dynamics, ResVLA utilizes spectral analysis to decouple control into a deterministic low-frequency anchor and a stochastic high-frequency residual. By anchoring the generative process on the predicted intent, our model focuses strictly on refining local dynamics via a residual diffusion bridge. Extensive simulation experiments show that ResVLA achieves competitive performance, strong robustness to language and robot embodiment perturbations, and faster convergence than standard generative baselines. It also demonstrates strong performance in real-world robot experiments.
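The split into a low-frequency anchor and a high-frequency residual can be sketched on a known one-dimensional trajectory. The Python example below uses a plain discrete Fourier transform with a hard frequency cutoff, which is an assumption: the abstract says only that ResVLA uses spectral analysis. In ResVLA the anchor is predicted and the residual is produced by a diffusion bridge, so this sketch shows the decomposition itself, not the model.

```python
import cmath
import math


def dft(signal):
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def split_by_frequency(trajectory, cutoff):
    """Return (anchor, residual) with anchor + residual == trajectory.

    The anchor keeps DFT bins whose frequency index is at most `cutoff`
    (including the mirrored negative frequencies); the residual is the rest.
    """
    n = len(trajectory)
    spectrum = dft(trajectory)
    low = [c if min(k, n - k) <= cutoff else 0.0 for k, c in enumerate(spectrum)]
    anchor = idft(low)
    residual = [x - a for x, a in zip(trajectory, anchor)]
    return anchor, residual


# Slow reach (one cycle) plus a small fast wiggle (eight cycles), both hypothetical.
n = 32
slow = [math.sin(2 * math.pi * t / n) for t in range(n)]
fast = [0.1 * math.sin(2 * math.pi * 8 * t / n) for t in range(n)]
trajectory = [s + f for s, f in zip(slow, fast)]
anchor, residual = split_by_frequency(trajectory, cutoff=2)
```

With the cutoff at bin 2, the anchor recovers the slow component and the residual recovers the fast one, up to floating-point error.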

Chinese Translation

在具身智能中，将高层次的语义理解与低层次的物理控制相结合仍然是一个持续的挑战，这源于认知与行动之间的基本时空尺度不匹配。现有的生成式 VLA 策略通常采用“从噪声生成”的范式，忽视了这种差异，导致表示效率低下和优化过程中的条件对齐弱。在本研究中，我们提出了 ResVLA，一种将范式转变为“从意图精炼”的架构。我们认识到，机器人运动自然分解为全局意图和局部动态，ResVLA 利用谱分析将控制解耦为确定性的低频锚点和随机的高频残差。通过将生成过程锚定在预测的意图上，我们的模型严格专注于通过残差扩散桥精炼局部动态。大量仿真实验表明，ResVLA 在性能上具有竞争力，对语言和机器人具身扰动具有强鲁棒性，并且比标准生成基线更快收敛。它在真实世界的机器人实验中也表现出强劲的性能。


cs.RO / 21 / 2604.21471

Ufil: A Unified Framework for Infrastructure-based Localization

Ufil：一种基于基础设施的定位统一框架

Schäfer, Simon, Hegerath, Lucas, Molz, Marius, Marcon, Massimo, Alrifaee, Bassam

Abstract

Infrastructure-based localization enhances road safety and traffic management by providing state estimates of road users. Development is hindered by fragmented, application-specific stacks that tightly couple perception, tracking, and middleware. We introduce Ufil, a Unified Framework for Infrastructure-Based Localization with a standardized object model and reusable multi-object tracking components. Ufil offers interfaces and reference implementations for prediction, detection, association, state update, and track management, allowing researchers to improve components without reimplementing the pipeline. Ufil is open-source C++/ROS 2 software with documentation and executable examples. We demonstrate Ufil by integrating three heterogeneous data sources into a single localization pipeline combining (i) vehicle onboard units broadcasting ETSI ITS-G5 Cooperative Awareness Messages, (ii) a lidar-based roadside sensor node, and (iii) an in-road sensitive surface layer. The pipeline runs unchanged in the CARLA simulator and a small-scale CAV testbed, demonstrating Ufil's scale-independent execution model. In a three-lane highway scenario with 423 and 355 vehicles in simulation and testbed, respectively, the fused system achieves lane-level lateral accuracy with mean lateral position RMSEs of 0.31 m in CARLA and 0.29 m in the CPM Lab, and mean absolute orientation errors around 2.2°. Median end-to-end latencies from sensing to fused output remain below 100 ms across all modalities in both environments.
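The stages listed in the abstract (prediction, association, state update, track management) can be sketched as one tracking cycle. The Python example below is a generic illustration and not Ufil's API: Ufil is C++/ROS 2 software, and the `Track` class, the greedy nearest-neighbor association, the alpha-beta update, and all parameter values here are assumptions chosen to keep the sketch short.

```python
import math
from dataclasses import dataclass


@dataclass
class Track:
    track_id: int
    x: float
    y: float
    vx: float = 0.0
    vy: float = 0.0
    misses: int = 0


def associate(tracks, detections, gate):
    """Greedy nearest-neighbor association; returns {track index: detection index}."""
    candidates = sorted(
        (math.hypot(t.x - dx, t.y - dy), ti, di)
        for ti, t in enumerate(tracks)
        for di, (dx, dy) in enumerate(detections)
    )
    matches, used = {}, set()
    for distance, ti, di in candidates:
        if distance > gate:
            break  # remaining candidates are even farther away
        if ti not in matches and di not in used:
            matches[ti] = di
            used.add(di)
    return matches


def step(tracks, detections, dt, next_id, gate=2.0, alpha=0.5, beta=0.1, max_misses=3):
    """One cycle: predict, associate, update, then manage tracks."""
    for t in tracks:  # prediction with a constant-velocity model
        t.x += t.vx * dt
        t.y += t.vy * dt
    matches = associate(tracks, detections, gate)
    for ti, t in enumerate(tracks):
        if ti in matches:  # alpha-beta state update from the innovation
            zx, zy = detections[matches[ti]]
            rx, ry = zx - t.x, zy - t.y
            t.x += alpha * rx
            t.y += alpha * ry
            t.vx += beta * rx / dt
            t.vy += beta * ry / dt
            t.misses = 0
        else:
            t.misses += 1
    survivors = [t for t in tracks if t.misses <= max_misses]
    matched = set(matches.values())
    for di, (dx, dy) in enumerate(detections):  # unmatched detections start tracks
        if di not in matched:
            survivors.append(Track(next_id, dx, dy))
            next_id += 1
    return survivors, next_id


tracks, next_id = step(
    [Track(0, 0.0, 0.0, vx=1.0)], [(1.2, 0.0), (10.0, 5.0)], dt=1.0, next_id=1
)
```

Keeping each stage behind its own function is the point of the illustration: a different association rule or filter can replace one function without touching the rest of the cycle, which is the kind of component swap the abstract describes.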

Chinese Translation

基于基础设施的定位通过提供道路用户的状态估计来增强道路安全性和交通管理。然而，发展受到碎片化的、特定应用的技术栈的阻碍，这些技术栈紧密耦合了感知、跟踪和中间件。我们提出了Ufil，一种基于基础设施的定位统一框架，具有标准化的对象模型和可重用的多目标跟踪组件。Ufil提供了预测、检测、关联、状态更新和轨迹管理的接口和参考实现，使研究人员能够在不重新实现整个流程的情况下改进组件。Ufil是开源的C++/ROS 2软件，附带文档和可执行示例。我们通过将三个异构数据源集成到一个单一的定位流程中来演示Ufil，这些数据源包括（i）广播ETSI ITS-G5合作意识消息的车辆车载单元，（ii）基于激光雷达的路边传感器节点，以及（iii）嵌入道路的敏感表面层。该流程在CARLA模拟器和小规模CAV测试平台上运行不变，展示了Ufil的规模独立执行模型。在一个三车道高速公路场景中，模拟和测试平台分别有423辆和355辆车辆，融合系统在CARLA中实现了车道级的横向精度，平均横向位置均方根误差为0.31米，在CPM实验室为0.29米，平均绝对方向误差约为2.2°。从感知到融合输出的中位数端到端延迟在两个环境中的所有模式下均保持在100毫秒以下。


cs.RO / 22 / 2604.21489

MISTY: High-Throughput Motion Planning via Mixer-based Single-step Drifting

MISTY：基于混合器的单步漂移高通量运动规划

Xing, Yining, Ke, Zehong, Tu, Yiqian, Liu, Zhiyuan, Yu, Wenhao, Wang, Jianqiang

Abstract

Multi-modal trajectory generation is essential for safe autonomous driving, yet existing diffusion-based planners suffer from high inference latency due to iterative neural function evaluations. This paper presents MISTY (Mixer-based Inference for Single-step Trajectory-drifting Yield), a high-throughput generative motion planner that achieves state-of-the-art closed-loop performance with pure single-step inference. MISTY integrates a vectorized Sub-Graph encoder to capture environment context, a Variational Autoencoder to structure expert trajectories into a compact 32-dimensional latent manifold, and an ultra-lightweight MLP-Mixer decoder to eliminate quadratic attention complexity. Importantly, we introduce a latent-space drifting loss that shifts the complex distribution evolution entirely to the training phase. By formulating explicit attractive and repulsive forces, this mechanism empowers the model to synthesize novel, proactive maneuvers, such as active overtaking, that are virtually absent from the raw expert demonstrations. Extensive evaluations on the nuPlan benchmark demonstrate that MISTY achieves state-of-the-art results on the challenging Test14-hard split, with comprehensive scores of 80.32 and 82.21 in non-reactive and reactive settings, respectively. Operating at over 99 FPS with an end-to-end latency of 10.1 ms, MISTY offers an order-of-magnitude speedup over iterative diffusion planners while achieving significantly robust generation.

Chinese Translation

多模态轨迹生成对于安全的自主驾驶至关重要，然而现有的基于扩散的规划器由于迭代神经函数评估而面临高推理延迟。本文提出了MISTY（基于混合器的单步轨迹漂移推理），这是一种高通量生成运动规划器，能够以纯单步推理实现最先进的闭环性能。MISTY集成了一个向量化的子图编码器，以捕捉环境上下文，一个变分自编码器（Variational Autoencoder）将专家轨迹结构化为紧凑的32维潜在流形，以及一个超轻量级的MLP-Mixer解码器，以消除二次注意复杂性。重要的是，我们引入了一种潜在空间漂移损失，将复杂的分布演变完全转移到训练阶段。通过制定明确的吸引力和排斥力，这一机制使模型能够合成新颖的、主动的机动，例如几乎在原始专家演示中不存在的主动超车。在nuPlan基准上的广泛评估表明，MISTY在具有挑战性的Test14-hard拆分上实现了最先进的结果，在非反应和反应设置中分别获得了80.32和82.21的综合得分。MISTY以超过99帧每秒的速度运行，端到端延迟为10.1毫秒，相较于迭代扩散规划器实现了数量级的加速，同时生成的结果显著更为稳健。


cs.RO / 23 / 2604.21541

X2-N: A Transformable Wheel-legged Humanoid Robot with Dual-mode Locomotion and Manipulation

X2-N：一种具有双模式运动和操作能力的可变形轮腿类人机器人

Ning, Yan, Chen, Xingzhou, Li, Delong, Zhang, Hao, Gai, Hanfu, Li, Tongyuan, Zhang, Cheng, Peng, Zhihui, Shi, Ling

Abstract

Wheel-legged robots combine the efficiency of wheeled locomotion with the versatility of legged systems, enabling rapid traversal over both continuous and discrete terrains. However, conventional designs typically employ fixed wheels as feet and limited degrees of freedom (DoFs) at the hips, resulting in reduced stability and mobility during legged locomotion compared to humanoids with flat feet. In addition, most existing platforms lack a full upper body with arms, which limits their ability to perform dexterous manipulation tasks. In this letter, we present X2-N, a high-DoF transformable robot with dual-mode locomotion and manipulation. X2-N can operate in both humanoid and wheel-legged forms and transform seamlessly between them through joint reconfiguration. We further propose a reinforcement learning (RL)-based whole-body control framework tailored to this morphology, enabling unified control across hybrid locomotion, transformation, and manipulation. We validate X2-N in a range of challenging locomotion and manipulation tasks, including dynamic skating-like motion, stair climbing and package delivery. Results demonstrate high locomotion efficiency, strong terrain adaptability, and stable loco-manipulation performance of X2-N, highlighting its potential for real-world deployment.

Chinese Translation

轮腿机器人结合了轮式运动的高效性和腿式系统的多功能性，使其能够快速穿越连续和离散地形。然而，传统设计通常将固定轮子作为脚，并且在髋部的自由度（DoF）有限，这导致在腿式运动中与具有平坦脚的类人机器人相比，稳定性和机动性降低。此外，大多数现有平台缺乏完整的上半身和手臂，这限制了它们执行灵巧操作任务的能力。在本文中，我们提出了X2-N，一种具有高自由度的可变形机器人，具备双模式运动和操作能力。X2-N可以在类人和轮腿形态之间无缝转换，并通过关节重构实现这种转变。我们进一步提出了一种基于强化学习（RL）的全身控制框架，专门针对这种形态，能够在混合运动、变形和操作之间实现统一控制。我们在一系列具有挑战性的运动和操作任务中验证了X2-N，包括动态滑冰式运动、爬楼梯和包裹投递。结果表明，X2-N具有高效的运动能力、强大的地形适应性和稳定的运动-操作性能，突显了其在现实世界应用中的潜力。


cs.RO / 24 / 2604.21568

A Bayesian Reasoning Framework for Robotic Systems in Autonomous Casualty Triage

用于自主伤员分诊的机器人系统的贝叶斯推理框架

Rusiecki, Szymon, Morales, Cecilia, Störy, Pia, Elenberg, Kimberly, Weiss, Leonard, Dubrawski, Artur

Abstract

Autonomous robots deployed in mass casualty incidents (MCI) face the challenge of making critical decisions based on incomplete and noisy perceptual data. We present an autonomous robotic system for casualty assessment that fuses outputs from multiple vision-based algorithms, estimating signs of severe hemorrhage, visible trauma, or physical alertness, into a coherent triage assessment. At the core of our system is a Bayesian network, constructed from expert-defined rules, which enables probabilistic reasoning about a casualty's condition even with missing or conflicting sensory inputs. The system, evaluated during the DARPA Triage Challenge (DTC) in realistic MCI scenarios involving 11 and 9 casualties, demonstrated a nearly three-fold improvement in physiological assessment accuracy (from 15% to 42% and 19% to 46%) compared to a vision-only baseline. More importantly, overall triage accuracy increased from 14% to 53%, while the diagnostic coverage of the system expanded from 31% to 95% of cases. These results demonstrate that integrating expert-guided probabilistic reasoning with advanced vision-based sensing can significantly enhance the reliability and decision-making capabilities of autonomous systems in critical real-world applications.
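How probabilistic fusion copes with a missing detector output can be shown with the simplest Bayesian network: one hidden condition node with conditionally independent evidence nodes. The Python sketch below is an illustration only. The states, sensor names, and every probability are hypothetical values invented for this example, and the authors' expert-defined network is not described in the abstract.

```python
def posterior(prior, likelihoods, evidence):
    """Posterior over casualty states given partial binary evidence.

    prior: {state: P(state)}
    likelihoods: {sensor: {state: P(sensor fires | state)}}
    evidence: {sensor: True, False, or None}; None means the detector
    produced no output, so that sensor is marginalized out (skipped).
    """
    scores = dict(prior)
    for sensor, observed in evidence.items():
        if observed is None:
            continue
        for state in scores:
            p_fire = likelihoods[sensor][state]
            scores[state] *= p_fire if observed else (1.0 - p_fire)
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("evidence is impossible under the model")
    return {state: score / total for state, score in scores.items()}


# Hypothetical numbers, for illustration only.
prior = {"critical": 0.2, "stable": 0.8}
likelihoods = {
    "hemorrhage": {"critical": 0.7, "stable": 0.05},
    "alert": {"critical": 0.2, "stable": 0.9},
}
belief = posterior(prior, likelihoods, {"hemorrhage": True, "alert": None})
```

With the alertness detector missing, the hemorrhage evidence alone moves the probability of the critical state from 0.2 to 0.14 / 0.18, about 0.78. With no evidence at all, the function returns the prior, so the system can still report an assessment. That behavior is one plausible reading of how such a network could raise diagnostic coverage, though the abstract does not give the mechanism.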

Chinese Translation

在大规模伤亡事件（MCI）中部署的自主机器人面临着基于不完整和噪声感知数据做出关键决策的挑战。我们提出了一种用于伤员评估的自主机器人系统，该系统融合了多个基于视觉的算法输出，估计严重出血、可见创伤或身体警觉的迹象，从而形成一致的分诊评估。我们系统的核心是一个基于专家定义规则构建的贝叶斯网络，它能够在缺失或冲突的感官输入情况下，对伤员的状况进行概率推理。在涉及11名和9名伤员的现实MCI场景中进行的DARPA分诊挑战（DTC）评估表明，与仅使用视觉的基线相比，生理评估的准确性几乎提高了三倍（从15%提高到42%，从19%提高到46%）。更重要的是，整体分诊准确性从14%提高到53%，而系统的诊断覆盖率从31%扩大到95%。这些结果表明，将专家指导的概率推理与先进的基于视觉的传感技术相结合，可以显著增强自主系统在关键现实应用中的可靠性和决策能力。


cs.RO / 25 / 2604.21693

SLAM as a Stochastic Control Problem with Partial Information: Optimal Solutions and Rigorous Approximations

将SLAM视为具有部分信息的随机控制问题：最优解与严格近似

Gusija, Ilir, Alajaji, Fady, Yüksel, Serdar

Abstract

Simultaneous localization and mapping (SLAM) is a foundational state estimation problem in robotics in which a robot accurately constructs a map of its environment while also localizing itself within this construction. We study the active SLAM problem through the lens of optimal stochastic control, thereby recasting it as a decision-making problem under partial information. After reviewing several commonly studied models, we present a general stochastic control formulation of active SLAM together with a rigorous treatment of motion, sensing, and map representation. We introduce a new exploration stage cost that encodes the geometry of the state when evaluating information-gathering actions. This formulation, constructed as a nonstandard partially observable Markov decision process (POMDP), is then analyzed to derive rigorously justified approximate solutions that are near-optimal. To enable this analysis, the associated regularity conditions are studied under general assumptions that apply to a wide range of robotics applications. For a particular case, we conduct an extensive numerical study in which standard learning algorithms are used to learn near-optimal policies.
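For readers unfamiliar with decision-making under partial information, the standard finite-state belief update of a POMDP is sketched below in Python. This is the textbook Bayes filter, shown as background; the paper's formulation is described as nonstandard and is not reproduced here. The two-state example, the table layout, and all probabilities are assumptions for illustration.

```python
def belief_update(belief, action, observation, transition, observation_model):
    """One Bayes-filter step over a finite state space.

    belief[s] = P(s)
    transition[a][s][s2] = P(s2 | s, a)
    observation_model[s2][z] = P(z | s2)
    """
    n = len(belief)
    # Prediction: push the belief through the motion model.
    predicted = [
        sum(transition[action][s][s2] * belief[s] for s in range(n))
        for s2 in range(n)
    ]
    # Correction: reweight by the likelihood of the observation.
    unnormalized = [observation_model[s2][observation] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalized]


# Two cells, one "stay" action, and a noisy cell sensor (hypothetical values).
transition = [[[1.0, 0.0], [0.0, 1.0]]]
observation_model = [[0.9, 0.1], [0.2, 0.8]]
updated = belief_update([0.5, 0.5], 0, 0, transition, observation_model)
```

Starting from a uniform belief, observing reading 0 yields 0.45 / 0.55 for cell 0, about 0.82. In a belief-based formulation, the controller chooses actions as a function of this belief rather than of the unobserved state.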

Chinese Translation

同时定位与地图构建（SLAM）是机器人学中的一个基础状态估计问题，机器人在准确构建其环境地图的同时，也能在该构建中进行自我定位。我们通过最优随机控制的视角研究主动SLAM问题，从而将其重新表述为一个在部分信息下的决策问题。在回顾了几种常见模型后，我们提出了主动SLAM的一般随机控制形式，并对运动、感知和地图表示进行了严格处理。我们引入了一种新的探索阶段成本，用于在评估信息收集行为时编码状态的几何特征。该形式构建为一个非标准的部分可观察马尔可夫决策过程（POMDP），随后进行分析以推导出严格证明的近似解，这些解接近最优。为了支持这一分析，我们在适用于广泛机器人应用的一般假设下研究了相关的正则性条件。在一个特定案例中，我们进行了广泛的数值研究，使用标准学习算法来学习近似最优策略。


cs.RO / 26 / 2604.21707

Effects of Swarm Size Variability on Operator Workload

群体规模变动对操作员工作负荷的影响

Hunt, William, Landowska, Aleksandra, Maior, Horia A., Ramchurn, Sarvapali D., Soorati, Mohammad

Abstract

Real-world deployments of human-swarm teams depend on balancing operator workload to leverage human strengths without inducing overload. A key challenge is that swarm size is often dynamic: robots may join or leave the mission due to failures or redeployment, causing abrupt workload fluctuations. Understanding how such changes affect human workload and performance is critical for robust human-swarm interaction design. This paper investigates how the magnitude and direction of changes in swarm size influence operator workload. Drawing on the concept of workload history, we test three hypotheses: (1) workload remains elevated following decreases in swarm size, (2) small increases are more manageable than large jumps, and (3) sufficiently large changes override these effects by inducing a cognitive reset. We conducted two studies (N = 34) using a monitoring task with simulated drone swarms of varying sizes. By varying the swarm size between episodes, we measured perceived workload relative to swarm size changes. Results show that objective performance is largely unaffected by small changes in swarm size, while subjective workload is sensitive to both change direction and magnitude. Small increases preserve lower workload, whereas small decreases leave workload elevated, indicating workload residue; large changes in either direction attenuate these effects, suggesting a reset response. These findings offer actionable guidance for managing swarm-size transitions to support operator workload in dynamic human-swarm systems.

Chinese Translation

人类与群体团队的实际部署依赖于平衡操作员的工作负荷，以利用人类的优势而不导致过载。一个关键挑战是群体规模通常是动态的：由于故障或重新部署，机器人可能会加入或离开任务，导致工作负荷的突然波动。理解这些变化如何影响人类的工作负荷和表现对于稳健的人类与群体交互设计至关重要。本文研究了群体规模变化的幅度和方向如何影响操作员的工作负荷。基于工作负荷历史的概念，我们测试了三个假设：（1）在群体规模减少后，工作负荷保持在较高水平；（2）小幅增加比大幅跳跃更易于管理；（3）足够大的变化通过引发认知重置来覆盖这些影响。我们进行了两项研究（N = 34），使用监控任务与不同规模的模拟无人机群体。通过在各个阶段之间变化群体规模，我们测量了相对于群体规模变化的感知工作负荷。结果表明，客观表现在群体规模的小变化下基本不受影响，而主观工作负荷对变化的方向和幅度都很敏感。小幅增加保持较低的工作负荷，而小幅减少则使工作负荷保持在较高水平，表明存在工作负荷残留；无论方向如何的大幅变化都会减弱这些影响，暗示着重置反应。这些发现为管理群体规模的过渡提供了可操作的指导，以支持动态人类与群体系统中的操作员工作负荷。


cs.RO / 27 / 2604.21729

A Compact Peristaltic Pump Based on Magneto-Elastic Hysteresis with Single Pneumatic Control

基于磁弹性滞后效应的紧凑型蠕动泵与单一气动控制

Park, Minjo, Sitti, Metin

Abstract

Pumping fluids is fundamental to a wide range of industrial, environmental, and biomedical applications. Among various pumping mechanisms, peristaltic pumps enable efficient and safe fluid transport by deforming an elastic tube without direct contact with the working fluid. Although previous studies have introduced mechanical, pneumatic, or magnetic actuations to drive membrane deformation, these approaches often lead to complex pump architectures and control schemes. In this study, we present a soft membrane pump that achieves peristaltic motion through a single pneumatic input combined with an embedded passive magnet. The actuation mechanism and system dynamics were analyzed and simplified through modeling. Numerical simulations were conducted to predict the internal fluid flow, and the magneto-elastic hysteresis behavior observed in the simulations was successfully validated by experiments with a proof-of-concept prototype.

Chinese Translation

泵送液体是广泛应用于工业、环境和生物医学领域的基础。在各种泵送机制中，蠕动泵通过变形弹性管道而不直接接触工作液体，实现高效且安全的液体输送。尽管以往研究已引入机械、气动或磁性驱动来推动膜的变形，但这些方法往往导致泵的结构和控制方案复杂化。本研究提出了一种软膜泵，通过单一气动输入结合嵌入式被动磁体实现蠕动运动。我们对驱动机制和系统动态进行了建模分析和简化。进行了数值模拟以预测内部液体流动，并通过概念验证原型的实验成功验证了模拟中观察到的磁弹性滞后行为。


cs.RO / 28 / 2604.21741

Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

Hi-WM：用于可扩展机器人后训练的人类-世界模型

Li, Yaxuan, Zhou, Zhongyi, Chen, Yefei, Guo, Yanjiang, Liu, Jiaming, Zhang, Shanghang, Chen, Jianyu, Zhu, Yichen

Abstract

Post-training is essential for turning pretrained generalist robot policies into reliable task-specific controllers, but existing human-in-the-loop pipelines remain tied to physical execution: each correction requires robot time, scene setup, resets, and operator supervision in the real world. Meanwhile, action-conditioned world models have been studied mainly for imagination, synthetic data generation, and policy evaluation. We propose Human-in-the-World-Model (Hi-WM), a post-training framework that uses a learned world model as a reusable corrective substrate for failure-targeted policy improvement. A policy is first rolled out in closed loop inside the world model; when the rollout becomes incorrect or failure-prone, a human intervenes directly in the model to provide short corrective actions. Hi-WM caches intermediate states and supports rollback and branching, allowing a single failure state to be reused for multiple corrective continuations and yielding dense supervision around behaviors that the base policy handles poorly. The resulting corrective trajectories are then added back to the training set for post-training. We evaluate Hi-WM on three real-world manipulation tasks spanning both rigid and deformable object interaction, and on two policy backbones. Hi-WM improves real-world success by 37.9 points on average over the base policy and by 19.0 points over a world-model closed-loop baseline, while world-model evaluation correlates strongly with real-world performance (r = 0.953). These results suggest that world models can serve not only as generators or evaluators, but also as effective corrective substrates for scalable robot post-training.
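The caching, rollback, and branching described in the abstract can be pictured as a tree of stored states. The Python sketch below is a hypothetical data structure, not the Hi-WM code: the `RolloutTree` class, its method names, and the toy `step_fn` standing in for a learned world model are all assumptions for this example.

```python
class RolloutTree:
    """Caches every visited state so any of them can be branched from later."""

    def __init__(self, initial_state, step_fn):
        self.step_fn = step_fn  # stand-in for a learned world model
        self.states = [initial_state]
        self.parents = [None]
        self.actions = [None]
        self.sources = [None]

    def extend(self, node, action, source="policy"):
        """Advance the model one step from a cached node; returns the new node id."""
        self.states.append(self.step_fn(self.states[node], action))
        self.parents.append(node)
        self.actions.append(action)
        self.sources.append(source)
        return len(self.states) - 1

    def trajectory(self, node):
        """(state, action, source) tuples along the path from the root to node."""
        steps = []
        while self.parents[node] is not None:
            parent = self.parents[node]
            steps.append((self.states[parent], self.actions[node], self.sources[node]))
            node = parent
        return steps[::-1]


# Toy model: the state is a number and an action adds to it.
tree = RolloutTree(0, lambda state, action: state + action)
ok = tree.extend(0, 1)                    # policy step that went fine
bad = tree.extend(ok, 5)                  # policy step judged failure-prone
fix_a = tree.extend(ok, 2, "human")       # roll back to `ok`, first correction
fix_b = tree.extend(ok, 3, "human")       # same cached state, second correction
corrective = [tree.trajectory(fix_a), tree.trajectory(fix_b)]
```

Because the state before the failure is cached, both corrections start from it without replaying the rollout. Applied to a learned world model, this is how one failure state could yield several corrective trajectories for the training set.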

Chinese Translation

后训练对于将预训练的通用机器人策略转变为可靠的任务特定控制器至关重要，但现有的人类参与流程仍然依赖于物理执行：每次修正都需要机器人时间、场景设置、重置以及操作员在现实世界中的监督。同时，基于动作的世界模型主要用于想象、合成数据生成和策略评估。我们提出了人类-世界模型（Hi-WM），这是一种后训练框架，利用学习到的世界模型作为可重用的修正基础，以实现针对失败的策略改进。首先，在世界模型内部以闭环方式展开策略；当展开变得不正确或容易失败时，人类直接在模型中干预，以提供短期修正动作。Hi-WM缓存中间状态并支持回滚和分支，允许将单一失败状态用于多个修正延续，并在基础策略处理不佳的行为周围产生密集的监督。随后，生成的修正轨迹被重新添加到训练集中以进行后训练。我们在三个真实世界的操作任务上评估了Hi-WM，这些任务涵盖了刚性和可变形物体的交互，以及两种策略骨干。Hi-WM在真实世界的成功率相较于基础策略平均提高了37.9个百分点，相较于世界模型闭环基线提高了19.0个百分点，同时世界模型评估与真实世界表现之间存在强相关性（r = 0.953）。这些结果表明，世界模型不仅可以作为生成器或评估者，还可以作为有效的修正基础，支持可扩展的机器人后训练。


cs.RO / 29 / 2604.21894

Task-Driven Co-Design of Heterogeneous Multi-Robot Systems

任务驱动的异构多机器人系统协同设计

Stralz, Maximilian, Alharbi, Meshal, Huang, Yujun, Zardini, Gioele

Abstract

Designing multi-agent robotic systems requires reasoning across tightly coupled decisions spanning heterogeneous domains, including robot design, fleet composition, and planning. Much effort has been devoted to isolated improvements in these domains, whereas system-level co-design considering trade-offs and task requirements remains underexplored. In this work, we present a formal and compositional framework for the task-driven co-design of heterogeneous multi-robot systems. Building on a monotone co-design theory, we introduce general abstractions of robots, fleets, planners, executors, and evaluators as interconnected design problems with well-defined interfaces that are agnostic to both implementations and tasks. This structure enables efficient joint optimization of robot design, fleet composition, and planning under task-specific performance constraints. A series of case studies demonstrates the capabilities of the framework. Various component models can be seamlessly incorporated, including new robot types, task profiles, and probabilistic sensing objectives, while non-obvious design alternatives are systematically uncovered with optimality guarantees. The results highlight the flexibility, scalability, and interpretability of the proposed approach, and illustrate how formal co-design enables principled reasoning about complex heterogeneous multi-robot systems.
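The idea of uncovering non-obvious design alternatives can be illustrated with a very small fleet-composition search. The Python sketch below enumerates fleets from a hypothetical catalog, keeps those that meet a coverage requirement, and reports the resource tuples that no other feasible fleet dominates. It is a brute-force toy and not the monotone co-design framework of the paper; the catalog, the additive models, and the two resources (cost and energy) are assumptions for this example.

```python
from itertools import product


def dominates(a, b):
    """True if resource tuple a is no worse than b everywhere and differs from b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def pareto_minimal(points):
    """Resource tuples that no other tuple dominates, in sorted order."""
    unique = set(points)
    return sorted(p for p in unique if not any(dominates(q, p) for q in unique))


def fleet_resources(catalog, required_coverage, max_per_type=3):
    """(cost, energy) of every fleet whose summed coverage meets the requirement."""
    names = sorted(catalog)
    resources = []
    for counts in product(range(max_per_type + 1), repeat=len(names)):
        coverage = sum(c * catalog[n]["coverage"] for c, n in zip(counts, names))
        if coverage >= required_coverage:
            cost = sum(c * catalog[n]["cost"] for c, n in zip(counts, names))
            energy = sum(c * catalog[n]["energy"] for c, n in zip(counts, names))
            resources.append((cost, energy))
    return resources


# Hypothetical robot types: drones are cheap but energy-hungry, rovers the reverse.
catalog = {
    "drone": {"coverage": 5, "cost": 2, "energy": 6},
    "rover": {"coverage": 2, "cost": 3, "energy": 1},
}
front = pareto_minimal(fleet_resources(catalog, required_coverage=6))
```

For this catalog the front contains three fleets: two drones (cost 4, energy 12), one drone plus one rover (cost 5, energy 7), and three rovers (cost 9, energy 3). The mixed fleet is the kind of alternative that optimizing cost or energy alone would miss.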

Chinese Translation

设计多智能体机器人系统需要在紧密耦合的决策中进行推理，这些决策跨越异构领域，包括机器人设计、车队组成和规划。尽管在这些领域的孤立改进上投入了大量精力，但考虑权衡和任务需求的系统级协同设计仍然未得到充分探索。在本研究中，我们提出了一种用于异构多机器人系统任务驱动协同设计的形式化和组合框架。基于单调协同设计理论，我们引入了机器人、车队、规划者、执行者和评估者的一般抽象，将其视为具有明确定义接口的相互关联的设计问题，这些接口与具体实现和任务无关。这一结构使得在特定任务性能约束下，能够高效地联合优化机器人设计、车队组成和规划。一系列案例研究展示了该框架的能力。各种组件模型可以无缝集成，包括新型机器人、任务配置文件和概率感知目标，同时系统地揭示了非显而易见的设计替代方案，并提供了最优性保证。结果突显了所提方法的灵活性、可扩展性和可解释性，并说明了形式化协同设计如何促进对复杂异构多机器人系统的原则性推理。


cs.RO / 30 / 2604.21914

VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

VistaBot：通过时空感知视图合成实现视图鲁棒的机器人操作

Gu, Songen, Zheng, Yuhang, Li, Weize, Zheng, Yupeng, Feng, Yating, Li, Xiang, Chen, Yilun, Li, Pengfei, Ding, Wenchao

Abstract

Recently, end-to-end robotic manipulation models have gained significant attention for their generalizability and scalability. However, they often suffer from limited robustness to camera viewpoint changes when training with a fixed camera. In this paper, we propose VistaBot, a novel framework that integrates feed-forward geometric models with video diffusion models to achieve view-robust closed-loop manipulation without requiring camera calibration at test time. Our approach consists of three key components: 4D geometry estimation, view synthesis latent extraction, and latent action learning. VistaBot is integrated into both action-chunking (ACT) and diffusion-based ($\pi_0$) policies and evaluated across simulation and real-world tasks. We further introduce the View Generalization Score (VGS) as a new metric for comprehensive evaluation of cross-view generalization. Results show that VistaBot improves VGS by 2.79$\times$ and 2.63$\times$ over ACT and $\pi_0$, respectively, while also achieving high-quality novel view synthesis. Our contributions include a geometry-aware synthesis model, a latent action planner, a new benchmark metric, and extensive validation across diverse environments. The code and models will be made publicly available.

Chinese Translation

近年来，端到端的机器人操作模型因其可推广性和可扩展性而受到广泛关注。然而，当在固定摄像头下进行训练时，它们往往在摄像头视角变化时表现出有限的鲁棒性。本文提出了VistaBot，一个新颖的框架，结合了前馈几何模型和视频扩散模型，以实现视图鲁棒的闭环操作，而无需在测试时进行摄像头标定。我们的方法由三个关键组件组成：4D几何估计、视图合成潜在提取和潜在动作学习。VistaBot被集成到动作分块（ACT）和基于扩散的（$\pi_0$）策略中，并在模拟和现实世界任务中进行了评估。我们进一步引入了视图泛化评分（VGS）作为评估跨视图泛化的新指标。结果表明，VistaBot在VGS上分别比ACT和$\pi_0$提高了2.79$\times$和2.63$\times$，同时实现了高质量的新视图合成。我们的贡献包括一个几何感知合成模型、一个潜在动作规划器、一种新的基准指标，以及在多样化环境中的广泛验证。代码和模型将公开发布。


cs.RO / 31 / 2604.21924

Long-Horizon Manipulation via Trace-Conditioned VLA Planning

通过轨迹条件的视觉-语言-行动规划实现长时间范围操控

Liu, Isabella, Cheng, An-Chieh, Yan, Rui, Chen, Geng, Qiu, Ri-Zhao, Zou, Xueyan, Yi, Sha, Yin, Hongxu, Wang, Xiaolong, Liu, Sifei

Abstract

Long-horizon manipulation remains challenging for vision-language-action (VLA) policies: real tasks are multi-step, progress-dependent, and brittle to compounding execution errors. We present LoHo-Manip, a modular framework that scales short-horizon VLA execution to long-horizon instruction following via a dedicated task-management VLM. The manager is decoupled from the executor and is invoked in a receding-horizon manner: given the current observation, it predicts a progress-aware remaining plan that combines (i) a subtask sequence with an explicit done + remaining split as lightweight language memory, and (ii) a visual trace -- a compact 2D keypoint trajectory prompt specifying where to go and what to approach next. The executor VLA is adapted to condition on the rendered trace, thereby turning long-horizon decision-making into repeated local control by following the trace. Crucially, predicting the remaining plan at each step yields an implicit closed loop: failed steps persist in subsequent outputs, and traces update accordingly, enabling automatic continuation and replanning without hand-crafted recovery logic or brittle visual-history buffers. Extensive experiments spanning embodied planning, long-horizon reasoning, trajectory prediction, and end-to-end manipulation in simulation and on a real Franka robot demonstrate strong gains in long-horizon success, robustness, and out-of-distribution generalization. Project page: https://www.liuisabella.com/LoHoManip

Chinese Translation

长时间范围的操控对于视觉-语言-行动（VLA）策略仍然具有挑战性：实际任务是多步骤的，依赖于进展，并且对累积执行错误非常脆弱。我们提出了LoHo-Manip，一个模块化框架，通过专门的任务管理视觉-语言模型（VLM）将短时间范围的VLA执行扩展到长时间范围的指令跟随。管理器与执行器解耦，并以滚动时域（receding-horizon）的方式调用：在给定当前观察的情况下，它预测一个进展感知的剩余计划，该计划结合了（i）带有显式“已完成+剩余”拆分的子任务序列，作为轻量级语言记忆，以及（ii）一个视觉轨迹，即一个紧凑的二维关键点轨迹提示，指定接下来要去哪里和要接近什么。执行器VLA被调整为以渲染的轨迹为条件，从而将长时间范围的决策转变为通过跟随轨迹进行的重复局部控制。至关重要的是，在每一步预测剩余计划产生了一个隐式闭环：失败的步骤在后续输出中持续存在，轨迹相应更新，从而实现自动继续和重新规划，而无需手工制作的恢复逻辑或脆弱的视觉历史缓冲区。广泛的实验涵盖了具身规划、长时间范围推理、轨迹预测以及在模拟和真实Franka机器人上的端到端操控，展示了在长时间范围成功率、鲁棒性和分布外泛化方面的显著提升。项目页面：https://www.liuisabella.com/LoHoManip
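The receding-horizon loop described above can be sketched with toy stand-ins. The `manager` and `executor` functions below are hypothetical placeholders for the task-management VLM and the executor VLA, and the failure probability is invented. The only point is that a failed subtask is never marked done, so it reappears in the next remaining plan.

```python
import random

def manager(goal_steps, done):
    """Stand-in for the task-management VLM: return the (done, remaining) split.

    A real manager would infer `done` from the current observation; here the
    world state is passed in directly (an assumption made for illustration).
    """
    remaining = [s for s in goal_steps if s not in done]
    return list(done), remaining

def executor(subtask, rng, fail_prob=0.3):
    """Stand-in for the short-horizon VLA: succeeds with probability 1 - fail_prob."""
    return rng.random() >= fail_prob

def run_episode(goal_steps, max_calls=50, seed=0):
    rng = random.Random(seed)
    done, log = [], []
    for _ in range(max_calls):
        _, remaining = manager(goal_steps, done)   # replan from the current state
        if not remaining:
            break
        subtask = remaining[0]                     # execute only the next subtask
        ok = executor(subtask, rng)
        log.append((subtask, ok))
        if ok:
            done.append(subtask)
        # On failure nothing is marked done, so the same subtask shows up
        # again in the next remaining plan: the implicit closed loop.
    return done, log

done, log = run_episode(["open drawer", "pick cup", "place cup", "close drawer"])
```

No recovery branch is written anywhere; retrying falls out of recomputing the remaining plan at every call.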


计算机视觉 (Computer Vision)

cs.CV / 1 / 2604.20983

Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry

像植物学家一样思考：用意图驱动的链式询问挑战多模态语言模型

Sakib, Syed Nazmus, Haque, Nafiul, Amin, Shahrear Bin, Abdullah, Hasan Muhammad, Hasan, Md. Mehedi, Hossain, Mohammad Zabed, Arman, Shifat E.

Abstract

Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA serves as a foundational benchmark in advancing research to train diagnostic agents to reason like expert botanists rather than static classifiers.

Chinese Translation

视觉评估通常通过多步骤过程进行。在大多数现代领域，专家使用结构化、基于证据的适应性提问分析图像。在植物病理学中，植物学家检查叶片图像，识别视觉线索，推断诊断意图，并通过针对物种、症状和严重程度的有针对性问题进行深入探讨。这种结构化的探查对于准确的疾病诊断和治疗方案制定至关重要。然而，当前的视觉-语言模型通常在单轮问答上进行评估。为了解决这一差距，我们提出了PlantInquiryVQA，这是一个用于研究植物诊断中多步骤、意图驱动的视觉推理的基准。我们正式化了一个链式询问框架，将诊断轨迹建模为基于扎根视觉线索和明确的知识意图的有序问答序列。我们发布了一个包含24,950张专家策划的植物图像和138,068对带有视觉基础、严重程度标签和领域特定推理模板的问答对的数据集。对顶级多模态大型语言模型的评估显示，尽管它们能够充分描述视觉症状，但在安全临床推理和准确诊断方面存在困难。重要的是，结构化的问答引导探查显著提高了诊断的正确性，减少了幻觉，并提高了推理效率。我们希望PlantInquiryVQA能作为一个基础基准，推动研究训练诊断代理，使其能够像专家植物学家一样进行推理，而不是静态分类器。


cs.CV / 2 / 2604.21008

Linear Image Generation by Synthesizing Exposure Brackets

通过合成包围曝光序列生成线性图像

Dai, Yuekun, Zhang, Zhoutong, Zhou, Shangchen, Zhao, Nanxuan

Abstract

The life of a photo begins with photons striking the sensor, whose signals are passed through a sophisticated image signal processing (ISP) pipeline to produce a display-referred image. However, such images are no longer faithful to the incident light, being compressed in dynamic range and stylized by subjective preferences. In contrast, RAW images record direct sensor signals before non-linear tone mapping. After camera response curve correction and demosaicing, they can be converted into linear images, which are scene-referred representations that directly reflect true irradiance and are invariant to sensor-specific factors. Since image sensors have better dynamic range and bit depth, linear images contain richer information than display-referred ones, leaving users more room for editing during post-processing. Despite this advantage, current generative models mainly synthesize display-referred images, which inherently limits downstream editing. In this paper, we address the task of text-to-linear-image generation: synthesizing a high-quality, scene-referred linear image that preserves full dynamic range, conditioned on a text prompt, for professional post-processing. Generating linear images is challenging, as pre-trained VAEs in latent diffusion models struggle to simultaneously preserve extreme highlights and shadows due to the higher dynamic range and bit depth. To this end, we represent a linear image as a sequence of exposure brackets, each capturing a specific portion of the dynamic range, and propose a DiT-based flow-matching architecture for text-conditioned exposure bracket generation. We further demonstrate downstream applications including text-guided linear image editing and structure-conditioned generation via ControlNet.

Chinese Translation

照片的生命始于光子撞击传感器，其信号经过复杂的图像信号处理（ISP）流水线，生成显示参考（display-referred）图像。然而，这些图像不再忠实于入射光，因为它们的动态范围被压缩，并受到主观偏好的风格化。相比之下，RAW图像在非线性色调映射之前记录了直接的传感器信号。在相机响应曲线校正和去马赛克处理后，它们可以转换为线性图像，这些图像是场景参考（scene-referred）的表示，直接反映真实的辐照度，并且对传感器特定因素不变。由于图像传感器具有更好的动态范围和位深度，线性图像包含比显示参考图像更丰富的信息，为用户在后期处理时提供了更多的编辑空间。尽管有这一优势，目前的生成模型主要合成显示参考图像，这在本质上限制了下游编辑。在本文中，我们解决了文本到线性图像生成的任务：合成高质量的场景参考线性图像，该图像在文本提示的条件下保留完整的动态范围，以便进行专业的后期处理。生成线性图像具有挑战性，因为在潜在扩散模型中，预训练的变分自编码器（VAE）难以同时保留极端高光和阴影，因为动态范围和位深度较高。为此，我们将线性图像表示为一组包围曝光帧（exposure brackets），每一帧捕捉动态范围的特定部分，并提出了一种基于DiT的流匹配架构用于文本条件的包围曝光生成。我们进一步展示了下游应用，包括文本引导的线性图像编辑和通过ControlNet的结构条件生成。
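To make the bracket representation concrete, here is a minimal sketch that splits a linear image into exposure-scaled, clipped brackets and merges them back by averaging the unclipped samples. This is the textbook bracket-merging idea, not the paper's generative pipeline; the exposure values and function names are assumptions.

```python
import numpy as np

def to_brackets(linear, exposures):
    """Render a linear image at several exposures, clipping each to [0, 1]."""
    return [np.clip(linear * e, 0.0, 1.0) for e in exposures]

def merge_brackets(brackets, exposures, hi=1.0):
    """Recover linear values by averaging the unclipped samples of each pixel."""
    num = np.zeros_like(brackets[0])
    den = np.zeros_like(brackets[0])
    for b, e in zip(brackets, exposures):
        valid = (b < hi).astype(b.dtype)   # ignore saturated samples
        num += valid * b / e               # undo the exposure scaling
        den += valid
    return num / np.maximum(den, 1.0)

# Toy scene spanning several orders of magnitude of radiance.
linear = np.array([[0.001, 0.5], [4.0, 60.0]])
exposures = [1 / 64, 1 / 8, 1.0, 8.0]
merged = merge_brackets(to_brackets(linear, exposures), exposures)
```

Each bracket individually fits in a bounded [0, 1] range, which is what allows a standard image autoencoder to handle it, while the set as a whole still covers the full dynamic range.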


cs.CV / 3 / 2604.21011

Micro-DualNet: Dual-Path Spatio-Temporal Network for Micro-Action Recognition

Micro-DualNet：用于微动作识别的双路径时空网络

Chappa, Naga VS Raviteja, Sariyanidi, Evangelos, Yankowitz, Lisa, Nair, Gokul, Zampella, Casey J., Schultz, Robert T., Tunç, Birkan

Abstract

Micro-actions are subtle, localized movements lasting 1-3 seconds such as scratching one's head or tapping fingers. Such subtle actions are essential for social communication, ubiquitously used in natural interactions, and thus critical for fine-grained video understanding, yet remain poorly understood by current computer vision systems. We identify a fundamental challenge: micro-actions exhibit diverse spatio-temporal characteristics where some are defined by spatial configurations while others manifest through temporal dynamics. Existing methods that commit to a single spatio-temporal decomposition cannot accommodate this diversity. We propose a dual-path network that processes anatomically-grounded spatial entities through parallel Spatial-Temporal (ST) and Temporal-Spatial (TS) pathways. The ST path captures spatial configurations before modeling temporal dynamics, while the TS path inverts this order to prioritize temporal dynamics. Rather than fixed fusion, we introduce entity-level adaptive routing where each body part learns its optimal processing preference, complemented by Mutual Action Consistency (MAC) loss that enforces cross-path coherence. Extensive experiments demonstrate competitive performance on MA-52 dataset and state-of-the-art results on iMiGUE dataset. Our work reveals that architectural adaptation to the inherent complexity of micro-actions is essential for advancing fine-grained video understanding.

Chinese Translation

微动作是持续1-3秒的细微、局部运动，例如挠头或敲击手指。这些细微的动作对于社交沟通至关重要，广泛应用于自然交互中，因此对于细粒度视频理解至关重要，但当前的计算机视觉系统对此仍然理解不足。我们识别出一个基本挑战：微动作展现出多样的时空特征，其中一些由空间配置定义，而另一些则通过时间动态表现。现有方法固守单一的时空分解方式，无法适应这种多样性。我们提出了一种双路径网络，通过并行的空间-时间（Spatial-Temporal, ST）和时间-空间（Temporal-Spatial, TS）路径处理基于解剖结构的空间实体。ST路径在建模时间动态之前捕捉空间配置，而TS路径则反转这一顺序，以优先考虑时间动态。我们引入了实体级自适应路由，而非固定融合，使每个身体部位学习其最佳处理偏好，并辅以相互动作一致性（Mutual Action Consistency, MAC）损失，以强制跨路径一致性。大量实验表明，该方法在MA-52数据集上表现出竞争力，在iMiGUE数据集上取得了最先进的结果。我们的工作揭示了针对微动作固有复杂性进行架构适配对于推进细粒度视频理解的重要性。
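One plausible form of entity-level adaptive routing is a per-entity softmax gate over the two pathway outputs. The abstract does not specify the gating function, so the sketch below is an assumption about its shape, with hypothetical names throughout.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(st_feats, ts_feats, logits):
    """Blend ST and TS features per entity with learned routing weights.

    st_feats, ts_feats: (entities, dim) outputs of the two pathways.
    logits: (entities, 2) learnable preference of each body part.
    """
    w = softmax(logits, axis=-1)
    return w[:, :1] * st_feats + w[:, 1:] * ts_feats

st = np.ones((3, 4))            # toy ST-path features
ts = np.zeros((3, 4))           # toy TS-path features
logits = np.array([[10.0, -10.0], [-10.0, 10.0], [0.0, 0.0]])
mixed = route(st, ts, logits)   # entity 0 follows ST, entity 1 TS, entity 2 both
```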


cs.CV / 4 / 2604.21032

Unlocking Multi-Spectral Data for Multi-Modal Models with Guided Inputs and Chain-of-Thought Reasoning

解锁多光谱数据以支持带引导输入和思维链推理的多模态模型

Kim, Dahun, Mallya, Ganesh Satish, Angelova, Anelia

Abstract

Multi-spectral imagery is a valuable input signal for Remote Sensing applications, such as land-use and land-cover classification and environmental monitoring. However, generalist Large Multi-modal Models (LMMs) are typically trained on RGB images, limiting their applicability to the RGB domain. At the same time, training multi-spectral multi-modal models is expensive and produces uniquely specialized models. To address this, we propose a novel training-free approach that introduces multi-spectral data within the inference pipeline of standard RGB-only LMMs, allowing large gains in performance. Our approach leverages the LMMs' understanding of the visual space by adapting non-RGB inputs to that space and injecting domain-specific information and Chain-of-Thought reasoning as instructions. We demonstrate this with the Gemini 2.5 model and observe strong Zero-Shot performance gains on popular Remote Sensing benchmarks. These results highlight the potential for geospatial professionals to leverage powerful generalist models for specialized sensor inputs, benefiting from rich reasoning capabilities grounded in specialized data.

Chinese Translation

多光谱图像是遥感应用中一种宝贵的输入信号，例如土地利用和土地覆盖分类以及环境监测。然而，通用的大型多模态模型（LMMs）通常是在RGB图像上训练的，这限制了它们在RGB领域以外的适用性。同时，训练多光谱多模态模型成本高昂，并且产生的模型通常具有独特的专业化特征。为了解决这个问题，我们提出了一种新颖的无训练方法，该方法在标准RGB-only LMMs的推理流程中引入多光谱数据，从而实现显著的性能提升。我们的方法利用LMMs对视觉空间的理解，通过将非RGB输入适配到该空间，并注入领域特定信息和思维链推理作为指令。我们以Gemini 2.5模型为例，观察到在流行的遥感基准测试中强大的零样本性能提升。这些结果突显了地理空间专业人员利用强大的通用模型处理专业传感器输入的潜力，从而受益于基于专业数据的丰富推理能力。
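One common way to present non-RGB bands to an RGB-only model is a false-color composite accompanied by text that explains the channel mapping and adds a derived index such as NDVI. The abstract does not state the exact adaptation used, so the band choice, the prompt wording, and the function names below are assumptions.

```python
import numpy as np

def false_color(nir, red, green):
    """Stack NIR/red/green bands as an RGB-like image scaled to [0, 1]."""
    img = np.stack([nir, red, green], axis=-1).astype(np.float64)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)

def ndvi(nir, red, eps=1e-8):
    """Normalized Difference Vegetation Index, in [-1, 1]."""
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (nir - red) / (nir + red + eps)

def build_prompt(mean_ndvi):
    """Hypothetical instruction text pairing the image with band semantics."""
    return (
        "The image is a false-color composite: red channel = near-infrared, "
        "green channel = red, blue channel = green. "
        f"Mean NDVI over the tile is {mean_ndvi:.2f}. "
        "Reason step by step about vegetation and land cover before answering."
    )

nir = np.array([[0.8, 0.2]])
red = np.array([[0.2, 0.2]])
green = np.array([[0.3, 0.25]])
image = false_color(nir, red, green)
prompt = build_prompt(ndvi(nir, red).mean())
```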


cs.CV / 5 / 2604.21041

Projected Gradient Unlearning for Text-to-Image Diffusion Models: Defending Against Concept Revival Attacks

文本到图像扩散模型的投影梯度遗忘：防御概念复苏攻击

Aladawi, Aljalila, Alam, Mohammed Talha, Karray, Fakhri

Abstract

Machine unlearning for text-to-image diffusion models aims to selectively remove undesirable concepts from pre-trained models without costly retraining. Current unlearning methods share a common weakness: erased concepts return when the model is fine-tuned on downstream data, even when that data is entirely unrelated. We adapt Projected Gradient Unlearning (PGU) from classification to the diffusion domain as a post-hoc hardening step. By constructing a Core Gradient Space (CGS) from the retain concept activations and projecting gradient updates into its orthogonal complement, PGU ensures that subsequent fine-tuning cannot undo the achieved erasure. Applied on top of existing methods (ESD, UCE, Receler), the approach eliminates revival for style concepts and substantially delays it for object concepts, running in roughly 6 minutes versus the ~2 hours required by Meta-Unlearning. PGU and Meta-Unlearning turn out to be complementary: which performs better depends on how the concept is encoded, and retain concept selection should follow visual feature similarity rather than semantic grouping.

Chinese Translation

针对文本到图像扩散模型的机器遗忘旨在选择性地从预训练模型中移除不良概念，而无需昂贵的重新训练。目前的遗忘方法存在一个共同的弱点：当模型在下游数据上进行微调时，被删除的概念会重新出现，即使该数据与之完全无关。我们将投影梯度遗忘（Projected Gradient Unlearning, PGU）从分类领域适配到扩散领域，作为一种事后强化步骤。通过从保留概念的激活中构建核心梯度空间（Core Gradient Space, CGS），并将梯度更新投影到其正交补空间中，PGU确保后续的微调无法撤销已实现的遗忘。在现有方法（ESD、UCE、Receler）之上应用该方法，可以消除风格概念的复苏，并显著延迟物体概念的复苏，其运行时间约为6分钟，而Meta-Unlearning所需的时间约为2小时。PGU和Meta-Unlearning被证明是互补的：哪种方法表现更好取决于概念的编码方式，而保留概念的选择应遵循视觉特征相似性而非语义分组。
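The projection step can be sketched in a few lines of linear algebra: build an orthonormal basis from retain-concept activations with an SVD, then strip that component from each gradient update. The energy threshold and function names are assumptions, and the sketch works on a single flat vector rather than on the layers of a diffusion model.

```python
import numpy as np

def core_gradient_space(retain_acts, energy=0.95):
    """Orthonormal basis (columns) spanning most of the retain-activation energy.

    retain_acts: array of shape (n_samples, dim).
    """
    # Right singular vectors span the row space of the activation matrix.
    _, s, vt = np.linalg.svd(retain_acts, full_matrices=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cum, energy) + 1)   # smallest k reaching the threshold
    return vt[:k].T                             # shape (dim, k)

def project_out(grad, basis):
    """Remove the component of `grad` lying in span(basis)."""
    return grad - basis @ (basis.T @ grad)

# Toy retain activations that live in the first two coordinates only.
acts = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [2.0, 0, 0, 0], [0, 2.0, 0, 0]])
basis = core_gradient_space(acts)
safe_grad = project_out(np.array([1.0, 2.0, 3.0, 4.0]), basis)
```

After projection the update has no component along the retained directions, so to first order it leaves the responses to those retain activations unchanged.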


cs.CV / 6 / 2604.21052

StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling

StyleVAR：通过视觉自回归建模实现可控图像风格迁移

Jing, Liqi, Zhang, Dingming, Li, Peinian, Zhu, Lichen

Abstract

We build on the Visual Autoregressive Modeling (VAR) framework and formulate style transfer as conditional discrete sequence modeling in a learned latent space. Images are decomposed into multi-scale representations and tokenized into discrete codes by a VQ-VAE; a transformer then autoregressively models the distribution of target tokens conditioned on style and content tokens. To inject style and content information, we introduce a blended cross-attention mechanism in which the evolving target representation attends to its own history, while style and content features act as queries that decide which aspects of this history to emphasize. A scale-dependent blending coefficient controls the relative influence of style and content at each stage, encouraging the synthesized representation to align with both the content structure and the style texture without breaking the autoregressive continuity of VAR. We train StyleVAR in two stages from a pretrained VAR checkpoint: supervised fine-tuning on a large triplet dataset of content--style--target images, followed by reinforcement fine-tuning with Group Relative Policy Optimization (GRPO) against a DreamSim-based perceptual reward, with per-action normalization weighting to rebalance credit across VAR's multi-scale hierarchy. Across three benchmarks spanning in-, near-, and out-of-distribution regimes, StyleVAR consistently outperforms an AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP similarity, and the GRPO stage yields further gains over the SFT checkpoint, most notably on the reward-aligned perceptual metrics. Qualitatively, the method transfers texture while maintaining semantic structure, especially for landscapes and architectural scenes, while a generalization gap on internet images and difficulty with human faces highlight the need for better content diversity and stronger structural priors.

Chinese Translation

我们基于视觉自回归建模（VAR）框架，将风格迁移表述为在学习的潜在空间中进行条件离散序列建模。图像被分解为多尺度表示，并通过VQ-VAE被标记为离散代码；然后，变换器自回归地建模目标标记的分布，条件为风格和内容标记。为了注入风格和内容信息，我们引入了一种混合交叉注意机制，其中不断演变的目标表示关注其自身历史，而风格和内容特征则作为查询，决定强调该历史的哪些方面。一个尺度依赖的混合系数控制着每个阶段风格和内容的相对影响，鼓励合成表示与内容结构和风格纹理对齐，同时不打破VAR的自回归连续性。我们在两个阶段训练StyleVAR，从预训练的VAR检查点开始：在一个大型内容-风格-目标图像三元组数据集上进行监督微调，随后进行基于DreamSim的感知奖励的群体相对策略优化（GRPO）强化微调，并对VAR的多尺度层次进行每个动作的归一化加权，以重新平衡信用。在涵盖分布内、近分布和分布外的三个基准测试中，StyleVAR在风格损失、内容损失、LPIPS、SSIM、DreamSim和CLIP相似度上始终优于AdaIN基线，GRPO阶段在奖励对齐的感知指标上相较于SFT检查点进一步提升，尤其显著。定性分析表明，该方法在保持语义结构的同时转移纹理，尤其适用于风景和建筑场景，而在互联网图像上的泛化差距和在人脸处理上的困难则突显了对更好内容多样性和更强结构先验的需求。
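The group-relative part of GRPO reduces to standardizing rewards within a group of samples drawn for the same input. The sketch below shows only that step. Treating the reward as one minus a perceptual distance is an assumption, and the per-action normalization weighting mentioned in the abstract is not reproduced here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same input."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical perceptual distances for four stylizations of one content/style pair.
distances = np.array([0.8, 0.6, 0.4, 0.2])
rewards = 1.0 - distances                 # assumed reward: lower distance is better
adv = group_relative_advantages(rewards)
```

Samples that beat the group mean receive positive advantages and the rest negative ones, so no separate value network is needed to form a baseline.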


cs.CV / 7 / 2604.21060

Clinically-Informed Modeling for Pediatric Brain Tumor Classification from Whole-Slide Histopathology Images

基于全切片组织病理图像的儿童脑肿瘤分类的临床知情建模

Nguyen, Joakim, Yu, Jian, Fang, Jinrui, Konz, Nicholas, Chen, Tianlong, Krishnan, Sanjay, Krishnan, Chandra, Ding, Ying, Wang, Hairong, Shukla, Ankita

Abstract

Accurate diagnosis of pediatric brain tumors, starting with histopathology, presents unique challenges for deep learning, including severe data scarcity, class imbalance, and fine-grained morphologic overlap across diagnostically distinct subtypes. While pathology foundation models have advanced patch-level representation learning, their effective adaptation to weakly supervised pediatric brain tumor classification under limited data remains underexplored. In this work, we introduce an expert-guided contrastive fine-tuning framework for pediatric brain tumor diagnosis from whole-slide images (WSI). Our approach integrates contrastive learning into slide-level multiple instance learning (MIL) to explicitly regularize the geometry of slide-level representations during downstream fine-tuning. We propose both a general supervised contrastive setting and an expert-guided variant that incorporates clinically informed hard negatives targeting diagnostically confusable subtypes. Through comprehensive experiments on pediatric brain tumor WSI classification under realistic low-sample and class-imbalanced conditions, we demonstrate that contrastive fine-tuning yields measurable improvements in fine-grained diagnostic distinctions. Our experimental analyses reveal complementary strengths across different contrastive strategies, with expert-guided hard negatives promoting more compact intra-class representations and improved inter-class separation. This work highlights the importance of explicitly shaping slide-level representations for robust fine-grained classification in data-scarce pediatric pathology settings.

Chinese Translation

儿童脑肿瘤的准确诊断始于组织病理学，这给深度学习带来了独特的挑战，包括严重的数据稀缺、类别不平衡以及在诊断上不同亚型之间的细粒度形态重叠。尽管病理基础模型在图像块（patch）级别的表征学习方面取得了进展，但如何在有限数据下将其有效适配于弱监督的儿童脑肿瘤分类仍然未得到充分探索。在本研究中，我们提出了一种专家指导的对比微调框架，用于从全切片图像（WSI）进行儿童脑肿瘤诊断。我们的方法将对比学习整合到切片级别的多实例学习（MIL）中，以在下游微调过程中显式地正则化切片级别表征的几何结构。我们提出了一种通用的监督对比设置和一种专家指导的变体，后者结合了针对诊断上易混淆亚型的、基于临床知识的难负样本。通过在现实的低样本和类别不平衡条件下对儿童脑肿瘤WSI分类进行全面实验，我们证明了对比微调在细粒度诊断区分方面带来了可测量的改进。我们的实验分析揭示了不同对比策略之间的互补优势，专家指导的难负样本促进了更紧凑的类内表征和更好的类间分离。此项工作强调了在数据稀缺的儿童病理学环境中，显式塑造切片级别表征以实现稳健的细粒度分类的重要性。
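A supervised contrastive loss over slide-level embeddings, with confusable class pairs up-weighted in the denominator, might look like the sketch below. The weighting scheme is an illustrative choice and not the paper's formulation; `hard_pairs`, `hard_weight`, and the toy embeddings are hypothetical.

```python
import numpy as np

def supcon_loss(z, labels, hard_pairs=(), hard_weight=2.0, tau=0.1):
    """Supervised contrastive loss over slide embeddings.

    z: (n, d) embeddings; labels: (n,) integer classes.
    hard_pairs: collection of (class_a, class_b) tuples treated as confusable;
    their negatives are up-weighted in the denominator (illustrative choice).
    """
    labels = np.asarray(labels)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    sim = sim - sim.max(axis=1, keepdims=True)        # numerical stability
    n = len(labels)
    w = np.ones((n, n))
    for i in range(n):
        for j in range(n):
            pair = (int(labels[i]), int(labels[j]))
            if pair in hard_pairs or pair[::-1] in hard_pairs:
                w[i, j] = hard_weight
    not_self = ~np.eye(n, dtype=bool)
    pos = (labels[:, None] == labels[None, :]) & not_self
    denom = (np.exp(sim) * w * not_self).sum(axis=1, keepdims=True)
    log_prob = sim - np.log(denom)
    per_anchor = -(log_prob * pos).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    return per_anchor[pos.any(axis=1)].mean()          # skip anchors with no positive

z = np.array([[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.05, 1.0]])
plain = supcon_loss(z, [0, 0, 1, 1])
hard = supcon_loss(z, [0, 0, 1, 1], hard_pairs={(0, 1)})
```

Up-weighting a confusable pair raises the loss contribution of those negatives, which pushes the two classes further apart than the plain loss would.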


cs.CV / 8 / 2604.21066

Optimizing Diffusion Priors with a Single Observation

通过单一观测优化扩散先验

Wang, Frederic, Bouman, Katherine L.

Abstract

While diffusion priors generate high-quality posterior samples across many inverse problems, they are often trained on limited training sets or purely simulated data, thus inheriting the errors and biases of these underlying sources. Current approaches to finetuning diffusion models rely on a large number of observations with varying forward operators, which can be difficult to collect for many applications, and thus lead to overfitting when the measurement set is small. We propose a method for tuning a prior from only a single observation by combining existing diffusion priors into a single product-of-experts prior and identifying the exponents that maximize the Bayesian evidence. We validate our method on real-world inverse problems, including black hole imaging, where the true prior is unknown a priori, and image deblurring with text-conditioned priors. We find that the evidence is often maximized by priors that extend beyond those trained on a single dataset. By generalizing the prior through exponent weighting, our approach enables posterior sampling from both tempered and combined diffusion models, yielding more flexible priors that improve the trustworthiness of the resulting posterior image distribution.

Chinese Translation

尽管扩散先验在许多反问题中能够生成高质量的后验样本，但它们通常是在有限的训练集或纯模拟数据上进行训练，因此继承了这些基础来源的错误和偏差。目前对扩散模型进行微调的方法依赖于大量具有不同前向算子的观测数据，这在许多应用中可能难以收集，因此在测量集较小时可能导致过拟合。我们提出了一种仅通过单一观测来调整先验的方法，该方法将现有的扩散先验组合成一个单一的专家乘积（product-of-experts）先验，并识别出最大化贝叶斯证据的指数。我们在真实世界的反问题上验证了我们的方法，包括黑洞成像（black hole imaging），在这些问题中，真实的先验是事先未知的，以及使用文本条件先验的图像去模糊。我们发现，证据通常是通过那些超越单一数据集训练的先验来最大化的。通过指数加权对先验进行泛化，我们的方法能够从回火（tempered）和组合的扩散模型中进行后验采样，从而产生更灵活的先验，提高了结果后验图像分布的可信度。
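The evidence-maximization idea has a closed form when every expert is a one-dimensional Gaussian, which makes for a small worked example. The sketch below is that toy case only: real diffusion priors have no closed-form evidence, and the grid, means, and noise level are invented.

```python
import numpy as np
from itertools import product

def poe_gaussian(means, stds, exponents):
    """Mean and variance of the normalized product of Gaussian experts p_i(x)**a_i."""
    prec = sum(a / s**2 for a, s in zip(exponents, stds))
    mean = sum(a * m / s**2 for a, m, s in zip(exponents, means, stds)) / prec
    return mean, 1.0 / prec

def log_evidence(y, noise_std, mean, var):
    """log p(y) for y = x + noise, x ~ N(mean, var), noise ~ N(0, noise_std**2)."""
    total = var + noise_std**2
    return -0.5 * (np.log(2 * np.pi * total) + (y - mean) ** 2 / total)

def best_exponents(y, noise_std, means, stds, grid):
    """Grid search for the exponents that maximize the evidence of one observation."""
    best = None
    for exps in product(grid, repeat=len(means)):
        if sum(exps) == 0:
            continue                      # improper (flat) prior, skip
        mean, var = poe_gaussian(means, stds, exps)
        score = log_evidence(y, noise_std, mean, var)
        if best is None or score > best[0]:
            best = (score, exps)
    return best

# Two candidate priors centred at 0 and 4; a single observation near 4.
score, exps = best_exponents(4.1, 0.1, means=[0.0, 4.0], stds=[1.0, 1.0],
                             grid=[0, 0.5, 1, 2])
```

Here the search switches off the mismatched expert (exponent 0) and sharpens the matching one (exponent 2), which corresponds to a tempered prior in the abstract's terms.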


cs.CV / 9 / 2604.21079

Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

注视推理：面向视觉-语言模型的有状态、基于动作的视觉聚焦

Min, Juhong, Valkov, Lazar, Petsiuk, Vitali, Souri, Hossein, Mohan, Deen Dayal

Abstract

Vision-language models benefit from high-resolution images, but the increase in visual-token count incurs high compute overhead. Humans resolve this tension via foveation: a coarse view guides "where to look", while selectively acquired high-acuity evidence refines "what to think". We introduce Foveated Reasoner, an autoregressive vision-language framework that unifies foveation and reasoning within a single decoding trajectory. Starting from a low-resolution view, the model triggers foveation only when needed, retrieves high-resolution evidence from selected regions, and injects it back into the same decoding trajectory. We train the method with a two-stage pipeline: coldstart supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve evidence acquisition and task accuracy while discouraging trivial "see-everything" solutions. Experiments show that the method learns effective foveation policies and achieves stronger accuracy under tight visual-token budgets across multiple vision-language benchmarks.

Chinese Translation

视觉-语言模型受益于高分辨率图像，但视觉标记数量的增加会导致高计算开销。人类通过注视来解决这一矛盾：粗略的视图指导“看哪里”，而选择性获取的高分辨率证据则细化“思考什么”。我们提出了注视推理器（Foveated Reasoner），这是一种自回归的视觉-语言框架，将注视和推理统一在单一的解码轨迹中。从低分辨率视图开始，模型仅在需要时触发注视，从选定区域检索高分辨率证据，并将其注入回同一解码轨迹。我们采用两阶段管道训练该方法：冷启动监督以启动注视行为，随后通过强化学习共同提高证据获取和任务准确性，同时抑制平凡的“看见一切”解决方案。实验表明，该方法学习到了有效的注视策略，并在多个视觉-语言基准测试中，在紧凑的视觉标记预算下实现了更强的准确性。
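A back-of-the-envelope token count shows why selective high-resolution crops can be cheaper than encoding the full image. The patch size and image sizes below are assumptions chosen for round numbers, not values from the paper.

```python
def num_tokens(height, width, patch=14):
    """Visual tokens for an image split into non-overlapping square patches."""
    return (height // patch) * (width // patch)

def foveated_budget(low_res, crops, patch=14):
    """Tokens for one coarse view plus a list of (height, width) high-res crops."""
    total = num_tokens(*low_res, patch=patch)
    return total + sum(num_tokens(h, w, patch=patch) for h, w in crops)

full = num_tokens(1344, 1344)                                     # encode everything
foveated = foveated_budget((336, 336), [(336, 336), (336, 336)])  # coarse + 2 crops
```

Under these assumptions the coarse view plus two crops costs 1,728 tokens against 9,216 for the full-resolution image, and each additional crop adds 576.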


cs.CV / 10 / 2604.21102

Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery

利用多模态大型语言模型对街景图像中的建筑环境和住房属性进行评估

Yao, Siyuan, Ghorbany, Siavash, Ai, Kuangshi, Cherukuthota, Arnav, Forstchen, Meghan, Korotasz, Alexis, Sisk, Matthew, Hu, Ming, Wang, Chaoli

Abstract

We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, our approach achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC relative to the MOS benchmark. To enhance efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x speedup. Further, we distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer (SwinV2-B), delivering close performance while achieving a 30x speed gain. Furthermore, we investigate LLMs' capabilities for assessing an extensive list of built environment and housing attributes through a human-AI alignment study and develop a visualization dashboard that integrates LLM assessment outcomes for downstream analysis by homeowners. Our framework offers a flexible and efficient solution for large-scale building condition assessment, enabling high accuracy with minimal human labeling effort.

Chinese Translation

我们提出了一种新颖的框架，通过利用大型语言模型（LLMs）和谷歌街景图像（GSV），自动评估美国全国范围内的建筑状况。通过在一个适度的人类标注数据集上对Gemma 3 27B进行微调，我们的方法在与人类平均意见评分（MOS）的对比中，表现出强烈的一致性，甚至在SRCC和PLCC指标上超越了个别评审者。为了提高效率，我们应用知识蒸馏，将Gemma 3 27B的能力转移到一个较小的Gemma 3 4B模型上，该模型在性能上可与之媲美，同时实现了3倍的速度提升。此外，我们还将知识蒸馏到一个基于卷积神经网络（CNN）的模型（EfficientNetV2-M）和一个变换器（SwinV2-B）中，提供了接近的性能，同时实现了30倍的速度提升。此外，我们通过人机对齐研究，探讨了LLMs在评估广泛的建筑环境和住房属性方面的能力，并开发了一个可视化仪表盘，将LLM评估结果整合，以便房主进行后续分析。我们的框架为大规模建筑状况评估提供了一种灵活高效的解决方案，能够在最小的人类标注工作下实现高准确性。
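For reference, the two agreement metrics named above can be computed as follows: PLCC is the Pearson correlation of the raw scores and SRCC is the Pearson correlation of their ranks. The score arrays are made up.

```python
import numpy as np

def plcc(x, y):
    """Pearson linear correlation coefficient."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def average_ranks(v):
    """1-based ranks with ties sharing their average rank."""
    v = np.asarray(v, dtype=np.float64)
    order = np.argsort(v, kind="mergesort")
    ranks = np.empty(len(v))
    ranks[order] = np.arange(1, len(v) + 1)
    for val in np.unique(v):
        tie = v == val
        ranks[tie] = ranks[tie].mean()
    return ranks

def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))

mos = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical mean opinion scores
pred = [1.1, 1.9, 3.5, 3.9, 10.0]        # hypothetical model predictions
```

In this example the predictions preserve the ordering of the MOS exactly, so SRCC is 1, while the outlier at 10.0 keeps PLCC below 1.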


cs.CV / 11 / 2604.21104

Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance

预训练在哪里？探讨预训练数据多样性如何影响地理基础模型性能

Kaur, Amandeep, Purohit, Mirali, Muhawenayo, Gedeon, Rolf, Esther, Kerner, Hannah

Abstract

New geospatial foundation models introduce a new model architecture and pretraining dataset, often sampled using different notions of data diversity. Performance differences are largely attributed to the model architecture or input modalities, while the role of the pretraining dataset is rarely studied. To address this research gap, we conducted a systematic study on how the geographic composition of pretraining data affects a model's downstream performance. We created global and per-continent pretraining datasets and evaluated them on global and per-continent downstream datasets. We found that the pretraining dataset from Europe outperformed global and continent-specific pretraining datasets on both global and local downstream evaluations. To investigate the factors influencing a pretraining dataset's downstream performance, we analysed 10 pretraining datasets using diversity across continents, biomes, landcover and spectral values. We found that only spectral diversity was strongly correlated with performance, while others were weakly correlated. This finding establishes a new dimension of diversity to be accounted for when creating a high-performing pretraining dataset. We open-sourced 7 new pretraining datasets, pretrained models, and our experimental framework at https://github.com/kerner-lab/pretrain-where.

Chinese Translation

新的地理基础模型引入了一种新的模型架构和预训练数据集，这些数据集通常是基于不同的数据多样性概念进行采样的。性能差异主要归因于模型架构或输入模态，而预训练数据集的作用则很少被研究。为了解决这一研究空白，我们系统性地研究了预训练数据的地理组成如何影响模型的下游性能。我们创建了全球及各大洲的预训练数据集，并在全球及各大洲的下游数据集上进行了评估。我们发现，来自欧洲的预训练数据集在全球和地方下游评估中均优于全球和特定大陆的预训练数据集。为了探讨影响预训练数据集下游性能的因素，我们分析了10个预训练数据集，考察了跨大陆、生物群落、土地覆盖和光谱值的多样性。我们发现，只有光谱多样性与性能呈强相关，而其他因素的相关性较弱。该发现为创建高性能预训练数据集时需要考虑的新多样性维度奠定了基础。我们在 https://github.com/kerner-lab/pretrain-where 开源了7个新的预训练数据集、预训练模型和我们的实验框架。


cs.CV / 12 / 2604.21119

Materialistic RIR: Material Conditioned Realistic RIR Generation

Materialistic RIR：材料条件下的逼真房间脉冲响应（RIR）生成

Saad, Mahnoor Fatima, Majumder, Sagnik, Grauman, Kristen, Al-Halah, Ziad

Abstract

Rings like gold, thuds like wood! The sound we hear in a scene is shaped not only by the spatial layout of the environment but also by the materials of the objects and surfaces within it. For instance, a room with wooden walls will produce a different acoustic experience from a room with the same spatial layout but concrete walls. Accurately modeling these effects is essential for applications such as virtual reality, robotics, architectural design, and audio engineering. Yet, existing methods for acoustic modeling often entangle spatial and material influences in correlated representations, which limits user control and reduces the realism of the generated acoustics. In this work, we present a novel approach for material-controlled Room Impulse Response (RIR) generation that explicitly disentangles the effects of spatial and material cues in a scene. Our approach models the RIR using two modules: a spatial module that captures the influence of the spatial layout of the scene, and a material module that modulates this spatial RIR according to a user-specified material configuration. This explicitly disentangled design allows users to easily modify the material configuration of a scene and observe its impact on acoustics without altering the spatial structure or scene content. Our model provides significant improvements over prior approaches on both acoustic-based metrics (up to +16% on RTE) and material-based metrics (up to +70%). Furthermore, through a human perceptual study, we demonstrate the improved realism and material sensitivity of our model compared to the strongest baselines.

Chinese Translation

如黄金般清脆鸣响，如木头般沉闷作响！我们在场景中听到的声音不仅受到环境空间布局的影响，还受到其中物体和表面材料的影响。例如，具有木质墙壁的房间将产生与具有相同空间布局但采用混凝土墙壁的房间不同的声学体验。准确建模这些效果对于虚拟现实、机器人技术、建筑设计和音频工程等应用至关重要。然而，现有的声学建模方法通常将空间和材料影响纠缠在相关的表示中，这限制了用户的控制能力并降低了生成声学的真实感。在本研究中，我们提出了一种新颖的材料控制房间脉冲响应（Room Impulse Response, RIR）生成方法，该方法明确地将场景中的空间线索和材料线索的影响分离。我们的方法使用两个模块来建模RIR：一个空间模块捕捉场景空间布局的影响，另一个材料模块根据用户指定的材料配置调节该空间RIR。这种明确分离的设计使用户能够轻松修改场景的材料配置，并观察其对声学的影响，而无需改变空间结构或场景内容。我们的模型在声学指标（在RTE上提高了高达16%）和材料指标（提高了高达70%）方面显著优于先前的方法。此外，通过人类感知研究，我们展示了与最强基线相比，我们的模型在真实感和材料敏感性方面的提升。
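To illustrate how a material can change a room impulse response without touching the layout, the toy model below shapes noise with an exponential decay whose reverberation time depends on the wall material, then applies it to a dry signal by convolution. It is a classical decaying-noise approximation, not the paper's learned spatial and material modules, and the RT60 values are invented.

```python
import numpy as np

# Hypothetical reverberation times (seconds) per wall material, for illustration only.
RT60 = {"wood": 0.4, "concrete": 1.2}

def decay_envelope(rt60, sr, duration):
    """Amplitude envelope that falls by 60 dB after rt60 seconds."""
    t = np.arange(int(sr * duration)) / sr
    return np.exp(-3.0 * np.log(10.0) * t / rt60)

def synthetic_rir(material, sr=16000, duration=1.5, seed=0):
    """Noise shaped by a material-dependent exponential decay."""
    rng = np.random.default_rng(seed)
    env = decay_envelope(RT60[material], sr, duration)
    return rng.standard_normal(env.size) * env

def reverberate(dry, rir):
    """Apply the room response to a dry signal by convolution."""
    return np.convolve(dry, rir)

wood = synthetic_rir("wood")
concrete = synthetic_rir("concrete")     # same seed: same "layout", longer tail
```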


cs.CV / 13 / 2604.21127

HyperFM: An Efficient Hyperspectral Foundation Model with Spectral Grouping

HyperFM：一种采用光谱分组的高效高光谱基础模型

Tushar, Zahid Hassan, Purushotham, Sanjay

Abstract

The NASA PACE mission provides unprecedented hyperspectral observations of ocean color, aerosols, and clouds, offering new insights into how these components interact and influence Earth's climate and air quality. Its Ocean Color Instrument measures light across hundreds of finely spaced wavelength bands, enabling detailed characterization of features such as phytoplankton composition, aerosol properties, and cloud microphysics. However, hyperspectral data of this scale is large, complex, and difficult to label, requiring specialized processing and analysis techniques. Existing foundation models, which have transformed computer vision and natural language processing, are generally trained on standard RGB imagery and therefore struggle to interpret the continuous spectral signatures captured by PACE. While recent advances have introduced hyperspectral foundation models, they are typically trained on cloud-free observations and often remain limited to single-sensor datasets due to spectral inconsistencies across instruments. Moreover, existing models tend to be parameter-heavy and computationally expensive, limiting scalability and adoption in operational settings. To address these challenges, we introduce HyperFM, a parameter-efficient hyperspectral foundation model that leverages intra-group and inter-group spectral attention along with hybrid parameter decomposition to better capture spectral spatial relationships while reducing computational cost. HyperFM demonstrates consistent performance improvements over existing hyperspectral foundation models and task-specific state-of-the-art methods across four benchmark downstream atmospheric cloud property retrieval tasks. To support further research, we additionally release HyperFM250K, a large-scale hyperspectral dataset from the PACE mission that includes both clear and cloudy scenes.

Chinese Translation

NASA PACE 任务提供了前所未有的海洋颜色、气溶胶和云层的超光谱观测，为这些成分如何相互作用以及影响地球气候和空气质量提供了新的见解。其海洋颜色仪器在数百个细密波长带上测量光线，使得对浮游植物组成、气溶胶特性和云微物理等特征的详细表征成为可能。然而，这种规模的超光谱数据庞大、复杂且难以标注，需要专门的处理和分析技术。现有的基础模型已经在计算机视觉和自然语言处理领域取得了变革性进展，但通常是在标准 RGB 图像上进行训练，因此难以解释 PACE 捕获的连续光谱特征。尽管最近的进展引入了超光谱基础模型，但它们通常是在无云观测上进行训练，并且由于仪器间的光谱不一致性，往往仅限于单传感器数据集。此外，现有模型往往参数繁重且计算成本高，限制了其在操作环境中的可扩展性和采用率。为了解决这些挑战，我们提出了 HyperFM，一种参数高效的超光谱基础模型，利用组内和组间光谱注意力以及混合参数分解，更好地捕捉光谱空间关系，同时降低计算成本。HyperFM 在四个基准下游大气云属性检索任务中，表现出相对于现有超光谱基础模型和任务特定的最先进方法的一致性性能提升。为了支持进一步的研究，我们还发布了 HyperFM250K，这是一个来自 PACE 任务的大规模超光谱数据集，包含清晰和多云场景。

cs.CV / 14 / 2604.21146

WFM: 3D Wavelet Flow Matching for Ultrafast Multi-Modal MRI Synthesis

WFM：用于超快速多模态MRI合成的三维小波流匹配

Tur, Yalcin, Stojkovic, Mihajlo, Bagci, Ulas

Abstract

Diffusion models have achieved remarkable quality in multi-modal MRI synthesis, but their computational cost (hundreds of sampling steps and separate models per modality) limits clinical deployment. We observe that this inefficiency stems from an unnecessary starting point: diffusion begins from pure noise, discarding the structural information already present in available MRI sequences. We propose WFM (Wavelet Flow Matching), which instead learns a direct flow from an informed prior, the mean of conditioning modalities in wavelet space, to the target distribution. Because the source and target share underlying anatomy and differ primarily in contrast, this formulation enables accurate synthesis in just 1-2 integration steps. A single 82M-parameter model with class conditioning synthesizes all four BraTS modalities (T1, T1c, T2, FLAIR), replacing four separate diffusion models totaling 326M parameters. On BraTS 2024, WFM achieves 26.8 dB PSNR and 0.94 SSIM, within 1-2 dB of diffusion baselines, while running 250-1000x faster (0.16-0.64s vs. 160s per volume). This speed-quality trade-off makes real-time MRI synthesis practical for clinical workflows. Code is available at https://github.com/yalcintur/WFM.
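As a rough illustration of why an informed starting point permits so few integration steps, the sketch below runs forward Euler on a velocity field, starting from the mean of the conditioning inputs. Everything in it is hypothetical: the function names, the flat-list stand-in for a wavelet-space volume, and the constant "oracle" velocity that replaces the learned network, none of which the abstract specifies.

```python
from typing import Callable, List, Sequence

Vector = List[float]

def informed_prior(conditions: Sequence[Vector]) -> Vector:
    """Element-wise mean of the conditioning inputs (the flow's start point)."""
    n = len(conditions)
    return [sum(vals) / n for vals in zip(*conditions)]

def euler_integrate(
    x0: Vector,
    velocity: Callable[[Vector, float], Vector],
    steps: int,
) -> Vector:
    """Forward Euler integration of dx/dt = velocity(x, t) from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-in for a trained model: on a straight-line path
# x_t = (1 - t) * x0 + t * x1 the ideal velocity is the constant x1 - x0.
conditions = [[0.0, 2.0, 4.0], [2.0, 4.0, 6.0]]  # two available "modalities"
target = [1.5, 2.5, 5.5]                         # the missing "modality"
x0 = informed_prior(conditions)
oracle_v = [tgt - src for tgt, src in zip(target, x0)]
result = euler_integrate(x0, lambda x, t: oracle_v, steps=1)
```

With a constant velocity a single Euler step lands exactly on the target. A learned field is only approximately straight, which is consistent with the abstract reporting 1-2 steps. The quoted speedup range also follows from the timings given: 160 / 0.64 = 250 and 160 / 0.16 = 1000.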

Chinese Translation

扩散模型在多模态MRI合成中取得了显著的质量，但其计算成本（数百个采样步骤和每种模态的单独模型）限制了临床应用。我们观察到，这种低效源于一个不必要的起点：扩散从纯噪声开始，丢弃了已有MRI序列中存在的结构信息。我们提出了WFM（小波流匹配），该方法直接从一个信息丰富的先验（小波空间中条件模态的均值）学习流向目标分布。由于源和目标共享基础解剖结构，主要在对比度上有所不同，这种表述使得在仅1-2个积分步骤内实现准确合成成为可能。一个包含8200万个参数的单一模型通过类别条件合成所有四种BraTS模态（T1、T1c、T2、FLAIR），取代了四个总计326M参数的单独扩散模型。在BraTS 2024上，WFM达到了26.8 dB的PSNR和0.94的SSIM，距离扩散基线仅有1-2 dB的差距，同时运行速度快了250-1000倍（每个体积0.16-0.64秒对比160秒）。这种速度与质量的权衡使得实时MRI合成在临床工作流程中变得可行。代码可在https://github.com/yalcintur/WFM获取。

cs.CV / 15 / 2604.21160

Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment

通过几何奖励信用分配增强点-视觉-语言模型中的三维理解

Chen, Jingkun, Xu, Ruoshi, Gao, Mingqi, Luo, Shengda, Han, Jungong

Abstract

Point-Vision-Language Models promise to empower embodied agents with executable spatial reasoning, yet they frequently succumb to geometric hallucination where predicted 3D structures contradict the observed 2D reality. We identify a key cause of this failure not as a representation bottleneck but as a structural misalignment in reinforcement learning, where sparse geometric tokens are drowned out by noisy and broadcasted sequence-level rewards. To resolve this causal dilution, we propose Geometric Reward Credit Assignment, a framework that disentangles holistic supervision into field-specific signals and routes them exclusively to their responsible token spans. This mechanism transforms vague feedback into precise gradient updates and effectively turns generic policy optimization into targeted structural alignment. Furthermore, we internalize physical constraints via a Reprojection-Consistency term which serves as a cross-modal verifier to penalize physically impossible geometries. Validated on a calibrated benchmark derived from ShapeNetCore, our approach bridges the reliability gap by boosting 3D KPA from 0.64 to 0.93, increasing 3D bounding box intersection over union to 0.686, and raising reprojection consistency scores to 0.852. Crucially, these gains are achieved while maintaining robust 2D localization performance, marking a meaningful step from plausible textual outputs toward physically verifiable spatial predictions.
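The contrast between a broadcast sequence-level reward and span-routed, field-specific rewards can be sketched as follows. This is a minimal illustration under assumptions: the field names, span indices, and reward values are invented, and the abstract does not say how the per-token signal enters the policy-gradient objective.

```python
from typing import Dict, List, Tuple

def broadcast_advantages(num_tokens: int, sequence_reward: float) -> List[float]:
    """Baseline: every token receives the same sequence-level signal."""
    return [sequence_reward] * num_tokens

def routed_advantages(
    num_tokens: int,
    field_rewards: Dict[str, float],
    field_spans: Dict[str, Tuple[int, int]],
) -> List[float]:
    """Send each field's reward only to the tokens in its [start, end) span."""
    adv = [0.0] * num_tokens
    for field, reward in field_rewards.items():
        start, end = field_spans[field]
        for i in range(start, end):
            adv[i] += reward
    return adv

# Hypothetical output layout: tokens 2-4 encode a 3D box, tokens 6-7 keypoints.
spans = {"bbox_3d": (2, 5), "keypoints": (6, 8)}
rewards = {"bbox_3d": 0.5, "keypoints": -1.0}
per_token = routed_advantages(10, rewards, spans)
uniform = broadcast_advantages(10, sum(rewards.values()))
```

In the routed version, tokens outside any geometric span receive no geometric signal, so a poor keypoint prediction does not penalize a correct bounding box.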

Chinese Translation

点-视觉-语言模型承诺为具身代理提供可执行的空间推理能力，但它们常常陷入几何幻觉，即预测的三维结构与观察到的二维现实相矛盾。我们识别出这种失败的一个关键原因并不是表示瓶颈，而是强化学习中的结构不对齐，其中稀疏的几何标记被嘈杂且广播的序列级奖励淹没。为了解决这种因果稀释，我们提出了几何奖励信用分配（Geometric Reward Credit Assignment），这是一个将整体监督解耦为领域特定信号并将其专门路由到相应标记范围的框架。该机制将模糊反馈转化为精确的梯度更新，有效地将通用策略优化转变为针对性的结构对齐。此外，我们通过重投影一致性（Reprojection-Consistency）项内化物理约束，该项作为跨模态验证器，惩罚物理上不可能的几何形状。在从ShapeNetCore派生的经过校准的基准上进行验证，我们的方法通过将三维关键点准确度（3D KPA）从0.64提升至0.93，将三维边界框的交并比提高至0.686，并将重投影一致性得分提升至0.852，从而弥补了可靠性差距。重要的是，这些提升是在保持强健的二维定位性能的同时实现的，标志着从可信的文本输出向物理可验证的空间预测迈出了重要一步。

cs.CV / 16 / 2604.21182

WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images

WildSplatter：基于无约束图像的前馈3D高斯溅射与外观控制

Fujimura, Yuki, Kushida, Takahiro, Kitano, Kazuya, Funatomi, Takuya, Mukaigawa, Yasuhiro

Abstract

We propose WildSplatter, a feed-forward 3D Gaussian Splatting (3DGS) model for unconstrained images with unknown camera parameters and varying lighting conditions. 3DGS is an effective scene representation that enables high-quality, real-time rendering; however, it typically requires iterative optimization and multi-view images captured under consistent lighting with known camera parameters. WildSplatter is trained on unconstrained photo collections and jointly learns 3D Gaussians and appearance embeddings conditioned on input images. This design enables flexible modulation of Gaussian colors to represent significant variations in lighting and appearance. Our method reconstructs 3D Gaussians from sparse input views in under one second, while also enabling appearance control under diverse lighting conditions. Experimental results demonstrate that our approach outperforms existing pose-free 3DGS methods on challenging real-world datasets with varying illumination.

Chinese Translation

我们提出了WildSplatter，一种针对无约束图像（具有未知相机参数和变化光照条件）的前馈3D高斯溅射（3DGS）模型。3DGS是一种有效的场景表示，能够实现高质量的实时渲染；然而，它通常需要在已知相机参数和一致光照下捕获的多视图图像进行迭代优化。WildSplatter在无约束的照片集合上进行训练，并共同学习基于输入图像的3D高斯和外观嵌入。这种设计使得高斯颜色的灵活调制成为可能，以表示光照和外观的显著变化。我们的方法能够在不到一秒的时间内从稀疏输入视图重建3D高斯，同时在多样的光照条件下实现外观控制。实验结果表明，我们的方法在具有变化光照的挑战性真实世界数据集上优于现有的无姿态3DGS方法。

cs.CV / 17 / 2604.21190

SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

SpatiO：用于空间推理的视觉-语言智能体的自适应测试时编排

Hwang, Chan Yeong, Choi, Miso, On, Sunghyun, Kim, Jinkyu, Lee, Jungbeom

Abstract

Understanding visual scenes requires not only recognizing objects but also reasoning about their spatial relationships. Unlike general vision-language tasks, spatial reasoning requires integrating multiple inductive biases, such as 2D appearance cues, depth signals, and geometric constraints, whose reliability varies across contexts. This suggests that effective spatial reasoning requires spatial adaptability: the ability to flexibly coordinate different reasoning strategies depending on the input. However, most existing approaches rely on a single reasoning pipeline that implicitly learns a fixed spatial prior, limiting their ability to adapt under distribution changes. Multi-agent systems offer a promising alternative by aggregating diverse reasoning trajectories, but prior attempts in spatial reasoning primarily employ homogeneous agents, restricting the diversity of inductive biases they can leverage. In this work, we introduce SpatiO, a heterogeneous multi-agent framework for spatial reasoning that coordinates multiple vision-language specialists with complementary inductive biases. To enable effective collaboration, we propose Test-Time Orchestration (TTO), an optimization mechanism that dynamically evaluates and reweights agents based on their observed reliability during inference, without modifying model parameters. Extensive experiments on diverse spatial reasoning benchmarks, including 3DSRBench, STVQA-7k, CV-Bench, and Omni3D-Bench, demonstrate that SpatiO consistently improves spatial reasoning performance over both closed-source and open-source baselines.
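One way to picture reweighting agents at inference time without touching model parameters is a multiplicative-weights update followed by a weighted vote. The sketch below uses that generic technique as a stand-in; the abstract does not describe the actual TTO update rule, and the function names, the `eta` step size, and the reliability scores are all assumptions.

```python
import math
from collections import defaultdict
from typing import List, Sequence

def reweight(weights: Sequence[float], reliabilities: Sequence[float], eta: float = 1.0) -> List[float]:
    """Multiplicative update, then normalize so the weights sum to 1."""
    raw = [w * math.exp(eta * r) for w, r in zip(weights, reliabilities)]
    total = sum(raw)
    return [x / total for x in raw]

def weighted_vote(answers: Sequence[str], weights: Sequence[float]) -> str:
    """Return the answer with the largest total agent weight."""
    scores = defaultdict(float)
    for answer, weight in zip(answers, weights):
        scores[answer] += weight
    return max(scores, key=scores.get)

# Three hypothetical specialists (e.g. 2D cues, depth, geometry) disagree.
answers = ["left", "right", "right"]
uniform = [1 / 3, 1 / 3, 1 / 3]
before = weighted_vote(answers, uniform)

# Assumed reliability scores observed during inference favor the first agent.
updated = reweight(uniform, [1.0, 0.0, 0.0], eta=2.0)
after = weighted_vote(answers, updated)
```

Here the majority answer wins under uniform weights, while the reweighted vote follows the agent judged more reliable. Only the mixing weights change; the agents themselves are untouched.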

Chinese Translation

理解视觉场景不仅需要识别物体，还需要推理它们之间的空间关系。与一般的视觉-语言任务不同，空间推理需要整合多种归纳偏差，例如二维外观线索、深度信号和几何约束，这些偏差在不同上下文中的可靠性各不相同。这表明有效的空间推理需要空间适应性：根据输入灵活协调不同推理策略的能力。然而，大多数现有方法依赖于单一的推理管道，隐式学习固定的空间先验，限制了它们在分布变化下的适应能力。多智能体系统通过聚合多样的推理轨迹提供了一个有前景的替代方案，但以往在空间推理中的尝试主要采用同质智能体，限制了它们可以利用的归纳偏差的多样性。在本研究中，我们提出了SpatiO，一个用于空间推理的异构多智能体框架，协调多个具有互补归纳偏差的视觉-语言专家。为了实现有效的协作，我们提出了测试时编排（Test-Time Orchestration, TTO），这是一种优化机制，根据推理过程中观察到的可靠性动态评估和重新加权智能体，而不修改模型参数。在包括3DSRBench、STVQA-7k、CV-Bench和Omni3D-Bench在内的多样空间推理基准上的大量实验表明，SpatiO在空间推理性能上始终优于闭源和开源基线。

cs.CV / 18 / 2604.21198

A Probabilistic Framework for Improving Dense Object Detection in Underwater Image Data via Annealing-Based Data Augmentation

基于退火的数据增强的概率框架以改善水下图像数据中的密集目标检测

Wiesler, Eleanor, Baxley, Trace

Abstract

Object detection models typically perform well on images captured in controlled environments with stable lighting, water clarity, and viewpoint, but their performance degrades substantially in real-world underwater settings characterized by high variability and frequent occlusions. In this work, we address these challenges by introducing a novel data augmentation framework designed to improve robustness in dense and unconstrained underwater scenes. Using the DeepFish dataset, which contains images of fish in natural environments, we first generate bounding box annotations from provided segmentation masks to construct a custom detection dataset. We then propose a pseudo-simulated annealing-based augmentation algorithm, inspired by the copy-paste strategy of Deng et al. [1], to synthesize realistic crowded fish scenarios. Our approach improves spatial diversity and object density during training, enabling better generalization to complex scenes. Experimental results show that our method significantly outperforms a baseline YOLOv10 model, particularly on a challenging test set of manually annotated images collected from live-stream footage in the Florida Keys. These results demonstrate the effectiveness of our augmentation strategy for improving detection performance in dense, real-world underwater environments.
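The mask-to-box conversion mentioned above is simple enough to sketch. The version below assumes each mask is a binary grid (a list of rows) containing a single object; how DeepFish masks are actually stored, and how multiple fish in one mask would be separated, is not stated in the abstract.

```python
from typing import List, Optional, Tuple

def mask_to_bbox(mask: List[List[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Return inclusive (x_min, y_min, x_max, y_max) of the nonzero pixels,
    or None when the mask is empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), rows[0], max(cols), rows[-1])

# A small hypothetical mask with one blob.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
bbox = mask_to_bbox(mask)
```

Detectors such as YOLO typically expect normalized center/width/height values, so a further conversion step would follow in practice.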

Chinese Translation

目标检测模型通常在受控环境中表现良好，这些环境具有稳定的光照、水的清晰度和视角，但在真实世界的水下环境中，其性能显著下降，这些环境特征包括高变异性和频繁的遮挡。在本研究中，我们通过引入一种新颖的数据增强框架来应对这些挑战，旨在提高在密集和无约束水下场景中的鲁棒性。我们使用DeepFish数据集，该数据集包含自然环境中鱼类的图像，首先从提供的分割掩码生成边界框注释，以构建自定义检测数据集。然后，我们提出了一种基于伪模拟退火的增强算法，灵感来源于Deng等人[1]的复制-粘贴策略，用于合成逼真的拥挤鱼类场景。我们的方法在训练过程中提高了空间多样性和目标密度，从而使模型能够更好地泛化到复杂场景。实验结果表明，我们的方法显著优于基线YOLOv10模型，特别是在从佛罗里达群岛的实时视频中收集的手动标注图像的挑战性测试集上。这些结果证明了我们的增强策略在改善密集真实水下环境中的检测性能方面的有效性。

cs.CV / 19 / 2604.21221

Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

稀疏强制：用于实时自回归扩散视频生成的原生可训练稀疏注意力

Xu, Boxun, Du, Yuming, Liu, Zichang, Yang, Siyu, Jiang, Ziyang, Yan, Siqi, Saha, Rajasi, Pumarola, Albert, Wang, Wenchen, Li, Peng

Abstract

We introduce Sparse Forcing, a training-and-inference paradigm for autoregressive video diffusion models that improves long-horizon generation quality while reducing decoding latency. Sparse Forcing is motivated by an empirical observation in autoregressive diffusion rollouts: attention concentrates on a persistent subset of salient visual blocks, forming an implicit spatiotemporal memory in the KV cache, and exhibits a locally structured block-sparse pattern within sliding windows. Building on this observation, we propose a trainable native sparsity mechanism that learns to compress, preserve, and update these persistent blocks while restricting computation within each local window to a dynamically selected local neighborhood. To make the approach practical at scale for both training and inference, we further propose Persistent Block-Sparse Attention (PBSA), an efficient GPU kernel that accelerates sparse attention and memory updates for low-latency, memory-efficient decoding. Experiments show that Sparse Forcing improves the VBench score by +0.26 over Self-Forcing on 5-second text-to-video generation while delivering a 1.11-1.17x decoding speedup and 42% lower peak KV-cache footprint. The gains are more pronounced on longer-horizon rollouts, delivering improved visual quality with +0.68 and +2.74 VBench improvements, and 1.22x and 1.27x speedups on 20-second and 1-minute generations, respectively.
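To make the attention pattern concrete, the sketch below selects which key blocks a query block may attend to: a fixed set of persistent blocks plus the highest-scoring blocks inside a causal sliding window. It is an illustrative guess at the selection logic only. The scores, window size, `top_k`, and the choice of persistent blocks are assumptions, and the learned compression and update of persistent blocks is not modeled.

```python
from typing import List, Sequence

def select_blocks(
    scores: Sequence[float],
    query_block: int,
    window: int,
    top_k: int,
    persistent: Sequence[int],
) -> List[int]:
    """Key blocks visible to `query_block`: all persistent blocks, plus the
    top_k highest-scoring blocks in the causal window ending at the query."""
    lo = max(0, query_block - window + 1)
    local = range(lo, query_block + 1)
    ranked = sorted(local, key=lambda b: scores[b], reverse=True)[:top_k]
    return sorted(set(persistent) | set(ranked))

# Hypothetical per-block relevance scores for one query block.
scores = [0.9, 0.1, 0.2, 0.8, 0.3, 0.7]
visible = select_blocks(scores, query_block=5, window=4, top_k=2, persistent=[0])
```

The query here touches three of six blocks, which is where the reduction in compute and KV-cache traffic would come from under this reading.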

Chinese Translation

我们提出了稀疏强制（Sparse Forcing），这是一种用于自回归视频扩散模型的训练与推理范式，旨在提高长时间生成质量，同时减少解码延迟。稀疏强制的提出源于对自回归扩散生成过程中的经验观察：注意力集中在一组持久的显著视觉块上，形成了KV缓存中的隐式时空记忆，并在滑动窗口内表现出局部结构化的块稀疏模式。在此观察的基础上，我们提出了一种可训练的原生稀疏机制，该机制学习压缩、保留和更新这些持久块，同时将每个局部窗口内的计算限制在动态选择的局部邻域内。为了使该方法在训练和推理时都能在大规模上实用，我们进一步提出了持久块稀疏注意力（Persistent Block-Sparse Attention, PBSA），这是一种高效的GPU内核，能够加速稀疏注意力和内存更新，以实现低延迟和内存高效的解码。实验表明，稀疏强制在5秒的文本到视频生成中，相较于自强制（Self-Forcing）提高了+0.26的VBench分数，同时实现了1.11-1.17倍的解码加速和42%的KV缓存峰值占用降低。在更长时间的生成过程中，增益更加明显，视觉质量得到了改善，VBench分数分别提高了+0.68和+2.74，并在20秒和1分钟的生成中实现了1.22倍和1.27倍的加速。

cs.CV / 20 / 2604.21227

UAU-Net: Uncertainty-aware Representation Learning and Evidential Classification for Facial Action Unit Detection

UAU-Net：面向不确定性的表征学习与证据分类用于面部动作单元检测

Li, Yuze, Liu, Zhilei

Abstract

Facial action unit (AU) detection remains challenging because it involves heterogeneous, AU-specific uncertainties arising at both the representation and decision stages. Recent methods have improved discriminative feature learning, but they often treat the AU representations as deterministic, overlooking uncertainty caused by visual noise, subject-dependent appearance variations, and ambiguous inter-AU relationships, all of which can substantially degrade robustness. Meanwhile, conventional point-estimation classifiers often provide poorly calibrated confidence, producing overconfident predictions, especially under the severe label imbalance typical of AU datasets. We propose UAU-Net, an Uncertainty-aware AU detection framework that explicitly models uncertainty at both stages. At the representation stage, we introduce CV-AFE, a conditional VAE (CVAE)-based AU feature extraction module that learns probabilistic AU representations by jointly estimating feature means and variances across multiple spatio-temporal scales; conditioning on AU labels further enables CV-AFE to capture uncertainty associated with inter-AU dependencies. At the decision stage, we design AB-ENN, an Asymmetric Beta Evidential Neural Network for multi-label AU detection, which parameterizes predictive uncertainty with Beta distributions and mitigates overconfidence via an asymmetric loss tailored to highly imbalanced binary labels. Extensive experiments on BP4D and DISFA show that UAU-Net achieves strong AU detection performance, and further analyses indicate that modeling uncertainty in both representation learning and evidential prediction improves robustness and reliability.
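For readers unfamiliar with Beta-based evidential outputs, the sketch below shows the common parameterization for one binary label: non-negative evidence for "active" and "inactive" defines a Beta distribution whose mean is the predicted probability and whose total strength determines the uncertainty. This follows the standard subjective-logic convention and is an assumption about AB-ENN, not a description of it; the asymmetric loss is not shown.

```python
from typing import Tuple

def beta_prediction(pos_evidence: float, neg_evidence: float) -> Tuple[float, float]:
    """Map evidence for one AU to (probability, uncertainty).

    alpha = pos_evidence + 1 and beta = neg_evidence + 1 give a uniform
    Beta(1, 1) prior when there is no evidence at all.
    """
    if pos_evidence < 0 or neg_evidence < 0:
        raise ValueError("evidence must be non-negative")
    alpha = pos_evidence + 1.0
    beta = neg_evidence + 1.0
    strength = alpha + beta
    probability = alpha / strength   # mean of Beta(alpha, beta)
    uncertainty = 2.0 / strength     # equals 1.0 with zero evidence
    return probability, uncertainty

no_evidence = beta_prediction(0.0, 0.0)
confident = beta_prediction(8.0, 0.0)
```

With no evidence the prediction is 0.5 with maximal uncertainty, in contrast to a sigmoid classifier, which reports a probability but no separate measure of how much evidence supports it.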

Chinese Translation

面部动作单元（AU）检测仍然具有挑战性，因为它涉及在表征和决策阶段产生的异质性、特定于AU的不确定性。最近的方法改善了判别特征学习，但它们通常将AU表征视为确定性的，忽视了由视觉噪声、受试者依赖的外观变化以及模糊的AU间关系所导致的不确定性，这些因素都可能显著降低鲁棒性。同时，传统的点估计分类器往往提供不良的置信度校准，产生过于自信的预测，尤其是在AU数据集中典型的严重标签不平衡情况下。我们提出了UAU-Net，一种面向不确定性的AU检测框架，明确建模两个阶段的不确定性。在表征阶段，我们引入了CV-AFE，一个基于条件变分自编码器（CVAE）的AU特征提取模块，通过在多个时空尺度上联合估计特征均值和方差来学习概率性AU表征；基于AU标签的条件化进一步使CV-AFE能够捕捉与AU间依赖关系相关的不确定性。在决策阶段，我们设计了AB-ENN，一个不对称贝塔证据神经网络，用于多标签AU检测，它通过贝塔分布参数化预测不确定性，并通过针对高度不平衡的二元标签量身定制的不对称损失来减轻过度自信。在BP4D和DISFA上的大量实验表明，UAU-Net实现了强大的AU检测性能，进一步分析表明，在表征学习和证据预测中建模不确定性提高了鲁棒性和可靠性。

cs.CV / 21 / 2604.21279

LatRef-Diff: Latent and Reference-Guided Diffusion for Facial Attribute Editing and Style Manipulation

LatRef-Diff：基于潜在和参考引导的扩散模型用于面部属性编辑和风格操控

Huang, Wenmin, Luo, Weiqi, Cao, Xiaochun, Huang, Jiwu

Abstract

Facial attribute editing and style manipulation are crucial for applications like virtual avatars and photo editing. However, achieving precise control over facial attributes without altering unrelated features is challenging due to the complexity of facial structures and the strong correlations between attributes. While conditional GANs have shown progress, they are limited by accuracy issues and training instability. Diffusion models, though promising, face challenges in style manipulation due to the limited expressiveness of semantic directions. In this paper, we propose LatRef-Diff, a novel diffusion-based framework that addresses these limitations. We replace the traditional semantic directions in diffusion models with style codes and propose two methods for generating them: latent and reference guidance. Based on these style codes, we design a style modulation module that integrates them into the target image, enabling both random and customized style manipulation. This module incorporates learnable vectors, cross-attention mechanisms, and a hierarchical design to improve accuracy and image quality. Additionally, to enhance training stability while eliminating the need for paired images (e.g., before and after editing), we propose a forward-backward consistency training strategy. This strategy first removes the target attribute approximately using image-specific semantic directions and then restores it via style modulation, guided by perceptual and classification losses. Extensive experiments on CelebA-HQ demonstrate that LatRef-Diff achieves state-of-the-art performance in both qualitative and quantitative evaluations. Ablation studies validate the effectiveness of our model's design choices.

Chinese Translation

面部属性编辑和风格操控对于虚拟头像和照片编辑等应用至关重要。然而，由于面部结构的复杂性以及属性之间的强相关性，实现对面部属性的精确控制而不改变无关特征是具有挑战性的。尽管条件生成对抗网络（GANs）已经取得了一定进展，但它们在准确性和训练稳定性方面存在局限性。尽管扩散模型前景可期，但由于语义方向的表达能力有限，它们在风格操控方面面临挑战。本文提出了LatRef-Diff，这是一种新颖的基于扩散的框架，旨在解决这些局限性。我们用风格编码替代了扩散模型中的传统语义方向，并提出了两种生成风格编码的方法：潜在引导和参考引导。基于这些风格编码，我们设计了一个风格调制模块，将其整合到目标图像中，实现随机和定制的风格操控。该模块结合了可学习的向量、交叉注意力机制和分层设计，以提高准确性和图像质量。此外，为了增强训练稳定性并消除对配对图像（例如，编辑前后的图像）的需求，我们提出了一种前向-后向一致性训练策略。该策略首先使用图像特定的语义方向近似去除目标属性，然后通过风格调制恢复该属性，指导损失包括感知损失和分类损失。在CelebA-HQ数据集上的大量实验表明，LatRef-Diff在定性和定量评估中均实现了最先进的性能。消融研究验证了我们模型设计选择的有效性。

cs.CV / 22 / 2604.21280

ImageHD: Energy-Efficient On-Device Continual Learning of Visual Representations via Hyperdimensional Computing

ImageHD：通过超高维计算实现的能效高的设备端持续学习视觉表征

Arockiaraj, Jebacyril, Parikh, Dhruv, Prasanna, Viktor

Abstract

On-device continual learning (CL) is critical for edge AI systems operating on non-stationary data streams, but most existing methods rely on backpropagation or exemplar-heavy classifiers, incurring substantial compute, memory, and latency overheads. Hyperdimensional computing (HDC) offers a lightweight alternative through fast, non-iterative online updates. Combined with a compact convolutional neural network (CNN) feature extractor, HDC enables efficient on-device adaptation with strong visual representations. However, prior HDC-based CL systems often depend on multi-tier memory hierarchies and complex cluster management, limiting deployability on resource-constrained hardware. We present ImageHD, an FPGA accelerator for on-device continual learning of visual data based on HDC. ImageHD targets streaming CL under strict latency and on-chip memory constraints, avoiding costly iterative optimization. At the algorithmic level, we introduce a hardware-aware CL method that bounds class exemplars through a unified exemplar memory and a hardware-efficient cluster merging strategy, while incorporating a quantized CNN front-end to reduce deployment overhead without sacrificing accuracy. At the system level, ImageHD is implemented as a streaming dataflow architecture on the AMD Zynq ZCU104 FPGA, integrating HDC encoding, similarity search, and bounded cluster management using word-packed binary hypervectors for massively parallel bitwise computation within tight on-chip resource budgets. On CORe50, ImageHD achieves up to 40.4x (4.84x) speedup and 383x (105.1x) energy efficiency over optimized CPU (GPU) baselines, demonstrating the practicality of HDC-enabled continual learning for real-time edge AI.
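The appeal of bit-packed binary hypervectors is that similarity search reduces to XOR and popcount. The sketch below mimics this in software, using Python integers as packed bit vectors. The dimensionality, class labels, and function names are hypothetical, and the encoder that would map CNN features to hypervectors is omitted.

```python
import random
from typing import Dict

DIM = 1024  # hypothetical hypervector dimensionality

def random_hypervector(rng: random.Random, dim: int = DIM) -> int:
    """A random binary hypervector packed into one Python integer."""
    return rng.getrandbits(dim)

def hamming_similarity(a: int, b: int, dim: int = DIM) -> float:
    """1.0 for identical vectors, about 0.5 for unrelated random ones."""
    return 1.0 - bin(a ^ b).count("1") / dim

def nearest_class(query: int, prototypes: Dict[str, int], dim: int = DIM) -> str:
    """Label of the prototype most similar to the query."""
    return max(prototypes, key=lambda label: hamming_similarity(query, prototypes[label], dim))

rng = random.Random(0)
cup = random_hypervector(rng)
phone = random_hypervector(rng)
noisy_cup = cup ^ 0b1111  # flip the four lowest bits
label = nearest_class(noisy_cup, {"cup": cup, "phone": phone})
```

On an FPGA the XOR and popcount would run across many words in parallel, which is the kind of bitwise parallelism the abstract refers to.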

Chinese Translation

设备端持续学习（CL）对于在非平稳数据流上运行的边缘人工智能系统至关重要，但大多数现有方法依赖于反向传播或示例重的分类器，导致显著的计算、内存和延迟开销。超高维计算（HDC）通过快速、非迭代的在线更新提供了一种轻量级的替代方案。结合紧凑的卷积神经网络（CNN）特征提取器，HDC使得在设备端高效适应并生成强大的视觉表征成为可能。然而，之前基于HDC的CL系统往往依赖于多层次的内存层次结构和复杂的集群管理，限制了在资源受限硬件上的可部署性。我们提出了ImageHD，一种基于HDC的视觉数据设备端持续学习的FPGA加速器。ImageHD针对严格的延迟和片上内存限制下的流式CL，避免了昂贵的迭代优化。在算法层面，我们引入了一种硬件感知的CL方法，通过统一的示例内存和硬件高效的集群合并策略来限制类别示例，同时结合量化的CNN前端以减少部署开销而不牺牲准确性。在系统层面，ImageHD作为一种流式数据流架构在AMD Zynq ZCU104 FPGA上实现，集成了HDC编码、相似性搜索和使用字打包的二进制超高维向量进行的有界集群管理，以便在紧凑的片上资源预算内进行大规模并行的逐位计算。在CORe50上，ImageHD相较于优化的CPU（GPU）基线实现了高达40.4倍（4.84倍）的加速和383倍（105.1倍）的能效，展示了HDC驱动的持续学习在实时边缘人工智能中的实用性。

cs.CV / 23 / 2604.21289

AttDiff-GAN: A Hybrid Diffusion-GAN Framework for Facial Attribute Editing

AttDiff-GAN：一种用于面部属性编辑的混合扩散-生成对抗网络框架

Huang, Wenmin, Luo, Weiqi, Cao, Xiaochun, Huang, Jiwu

Abstract

Facial attribute editing aims to modify target attributes while preserving attribute-irrelevant content and overall image fidelity. Existing GAN-based methods provide favorable controllability, but often suffer from weak alignment between style codes and attribute semantics. Diffusion-based methods can synthesize highly realistic images; however, their editing precision is limited by the entanglement of semantic directions among different attributes. In this paper, we propose AttDiff-GAN, a hybrid framework that combines GAN-based attribute manipulation with diffusion-based image generation. A key challenge in such integration lies in the inconsistency between one-step adversarial learning and multi-step diffusion denoising, which makes effective optimization difficult. To address this issue, we decouple attribute editing from image synthesis by introducing a feature-level adversarial learning scheme to learn explicit attribute manipulation, and then using the manipulated features to guide the diffusion process for image generation, while also removing the reliance on semantic direction-based editing. Moreover, we enhance style-attribute alignment by introducing PriorMapper, which incorporates facial priors into style generation, and RefineExtractor, which captures global semantic relationships through a Transformer for more precise style extraction. Experimental results on CelebA-HQ show that the proposed method achieves more accurate facial attribute editing and better preservation of non-target attributes than state-of-the-art methods in both qualitative and quantitative evaluations.

Chinese Translation

面部属性编辑旨在修改目标属性，同时保持与属性无关的内容和整体图像的保真度。现有的基于生成对抗网络（GAN）的方法提供了良好的可控性，但往往在风格编码与属性语义之间存在弱对齐。基于扩散的方法能够合成高度真实的图像；然而，它们的编辑精度受到不同属性之间语义方向纠缠的限制。在本文中，我们提出了AttDiff-GAN，这是一种将基于GAN的属性操作与基于扩散的图像生成相结合的混合框架。这种集成的一个关键挑战在于一步对抗学习与多步扩散去噪之间的不一致性，这使得有效优化变得困难。为了解决这个问题，我们通过引入一种特征级对抗学习方案，将属性编辑与图像合成解耦，以学习明确的属性操作，然后使用操作后的特征来指导图像生成的扩散过程，同时消除对基于语义方向的编辑的依赖。此外，我们通过引入PriorMapper来增强风格-属性对齐，该模块将面部先验融入风格生成中，以及RefineExtractor，通过Transformer捕捉全局语义关系以实现更精确的风格提取。对CelebA-HQ的实验结果表明，所提出的方法在定性和定量评估中都实现了比最先进的方法更准确的面部属性编辑和更好的非目标属性保留。

cs.CV / 24 / 2604.21290

GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA

GraphLeap：在FPGA上解耦图构建与卷积以加速视觉图神经网络

Ramachandran, Anvitha, Parikh, Dhruv, Prasanna, Viktor

Abstract

Vision Graph Neural Networks (ViGs) represent an image as a graph of patch tokens, enabling adaptive, feature-driven neighborhoods. Unlike CNNs with fixed grid biases or Vision Transformers with global token interactions, ViGs rely on dynamic graph convolution: at each layer, a feature-dependent graph is built via k-nearest-neighbor (kNN) search on current patch features, followed by message passing. This per-layer graph construction is the main bottleneck, consuming 50-95% of graph convolution time on CPUs and GPUs, scaling as $O(N^2)$ with the number of patches $N$, and creating a sequential dependency between graph construction and feature updates. We introduce GraphLeap, a simple reformulation that removes this dependency by decoupling graph construction from feature update across layers. GraphLeap performs the feature update at layer $\ell$ using a graph built from the previous layer's features, while simultaneously using the current layer's features to construct the graph for layer $\ell+1$. This one-layer-lookahead graph construction enables concurrent graph construction and message passing. Although using prior-layer features can introduce minor accuracy degradation, lightweight fine-tuning for a few epochs is sufficient to recover the original accuracy. Building on GraphLeap, we present the first end-to-end FPGA accelerator for Vision GNNs. Our streaming, layer-pipelined design overlaps a kNN graph construction engine with a feature update engine, exploits node- and channel-level parallelism, and enables efficient on-chip dataflow without explicit edge-feature materialization. Evaluated on isotropic and pyramidal ViG models on an Alveo U280 FPGA, GraphLeap achieves up to $95.7\times$ speedup over CPU and $8.5\times$ speedup over GPU baselines, demonstrating the feasibility of real-time Vision GNN inference.
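The one-layer lookahead can be sketched by comparing a conventional forward pass with a decoupled one. The code below is a toy under stated assumptions: the averaging `update` stands in for a real graph convolution, the brute-force kNN is the $O(N^2)$ search the abstract mentions, and the first layer reuses a graph built from the input features because no earlier layer exists.

```python
from typing import List

Features = List[List[float]]
Graph = List[List[int]]

def knn_graph(features: Features, k: int) -> Graph:
    """Brute-force kNN: for each node, its k nearest other nodes."""
    n = len(features)
    graph = []
    for i in range(n):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(features[i], features[j])), j)
            for j in range(n) if j != i
        )
        graph.append([j for _, j in dists[:k]])
    return graph

def update(features: Features, graph: Graph) -> Features:
    """Toy message passing: average each node with its neighbors' mean."""
    out = []
    for i, nbrs in enumerate(graph):
        dim = len(features[i])
        mean = [sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        out.append([(features[i][d] + mean[d]) / 2 for d in range(dim)])
    return out

def baseline_forward(x: Features, num_layers: int, k: int) -> Features:
    for _ in range(num_layers):
        graph = knn_graph(x, k)       # must finish before the update starts
        x = update(x, graph)
    return x

def lookahead_forward(x: Features, num_layers: int, k: int) -> Features:
    graph = knn_graph(x, k)           # graph for the first layer
    for _ in range(num_layers):
        next_graph = knn_graph(x, k)  # depends only on x, not on the update
        x = update(x, graph)          # uses the graph built one layer earlier
        graph = next_graph
    return x

x = [[0.0], [1.0], [10.0], [11.0]]
```

Inside the lookahead loop, `knn_graph` and `update` read the same input and do not depend on each other, so hardware can run them concurrently. The two passes agree on this toy input because the neighborhoods never change; in general they differ, which is why the abstract mentions brief fine-tuning.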

Chinese Translation

视觉图神经网络（ViGs）将图像表示为补丁令牌的图，从而实现自适应的、特征驱动的邻域。与具有固定网格偏置的卷积神经网络（CNNs）或具有全局令牌交互的视觉变换器（Vision Transformers）不同，ViGs依赖于动态图卷积：在每一层中，通过对当前补丁特征进行k近邻（kNN）搜索构建一个特征依赖的图，随后进行信息传递。这种逐层图构建是主要瓶颈，占用CPU和GPU上图卷积时间的50%至95%，并且随着补丁数量N的增加呈$O(N^2)$增长，造成图构建与特征更新之间的顺序依赖。我们提出了GraphLeap，这是一种简单的重构方法，通过在层间解耦图构建与特征更新来消除这种依赖。GraphLeap在层$\ell$中使用从前一层特征构建的图来执行特征更新，同时利用当前层的特征构建层$\ell+1$的图。这种单层前瞻的图构建使得图构建与信息传递能够并行进行。尽管使用前层特征可能会引入轻微的准确性下降，但经过几轮轻量级微调即可恢复原始准确性。在GraphLeap的基础上，我们展示了第一个针对视觉图神经网络的端到端FPGA加速器。我们的流式、层级流水线设计将kNN图构建引擎与特征更新引擎重叠，利用节点和通道级并行性，并实现高效的片上数据流，无需显式的边特征物化。在Alveo U280 FPGA上对各向同性和金字塔ViG模型进行评估，GraphLeap在CPU基准上实现了高达$95.7\times$的加速，在GPU基准上实现了$8.5\times$的加速，展示了实时视觉图神经网络推理的可行性。

cs.CV / 25 / 2604.21291

Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation

探索合成数据增强在可控人类视频生成中的作用

Fei, Yuanchen, Zou, Yude, Kang, Zejian, Li, Ming, Zhou, Jiaying, Huang, Xiangru

Abstract

Controllable human video generation aims to produce realistic videos of humans with explicitly guided motions and appearances, serving as a foundation for digital humans, animation, and embodied AI. However, the scarcity of large-scale, diverse, and privacy-safe human video datasets poses a major bottleneck, especially for rare identities and complex actions. Synthetic data provides a scalable and controllable alternative, yet its actual contribution to generative modeling remains underexplored due to the persistent Sim2Real gap. In this work, we systematically investigate the impact of synthetic data on controllable human video generation. We propose a diffusion-based framework that enables fine-grained control over appearance and motion while providing a unified testbed to analyze how synthetic data interacts with real-world data during training. Through extensive experiments, we reveal the complementary roles of synthetic and real data and demonstrate possible methods for efficiently selecting synthetic samples to enhance motion realism, temporal consistency, and identity preservation. Our study offers the first comprehensive exploration of synthetic data's role in human-centric video synthesis and provides practical insights for building data-efficient and generalizable generative models.

Chinese Translation

可控人类视频生成旨在生成具有明确引导的动作和外观的真实人类视频，为数字人类、动画和具身人工智能奠定基础。然而，大规模、多样化和隐私安全的人类视频数据集的稀缺构成了一个主要瓶颈，特别是对于稀有身份和复杂动作。合成数据提供了一种可扩展且可控的替代方案，但由于持续存在的Sim2Real差距，其对生成建模的实际贡献仍未得到充分探索。在本研究中，我们系统地调查了合成数据对可控人类视频生成的影响。我们提出了一种基于扩散的框架，能够对外观和动作进行细粒度控制，同时提供一个统一的测试平台，以分析合成数据在训练过程中如何与真实世界数据相互作用。通过广泛的实验，我们揭示了合成数据和真实数据的互补作用，并展示了有效选择合成样本以增强动作真实感、时间一致性和身份保留的可能方法。我们的研究首次全面探讨了合成数据在以人为中心的视频合成中的作用，并为构建数据高效且可泛化的生成模型提供了实用的见解。

cs.CV / 26 / 2604.21311

An Interpretable Vision Transformer Framework for Automated Brain Tumor Classification

一种可解释的视觉变换器框架用于自动化脑肿瘤分类

Mbonu, Chinedu Emmanuel, Belonwu, Tochukwu Sunday, Chukwuogo, Okwuchukwu Ejike, Anigbogu, Kenechukwu Sylvanus

Abstract

Brain tumors represent one of the most critical neurological conditions, where early and accurate diagnosis is directly correlated with patient survival rates. Manual interpretation of Magnetic Resonance Imaging (MRI) scans is time-intensive, subject to inter-observer variability, and demands significant specialist expertise. This paper proposes a deep learning framework for automated four-class brain tumor classification distinguishing glioma, meningioma, pituitary tumor, and healthy brain tissue from a dataset of 7,023 MRI scans. The proposed system employs a Vision Transformer (ViT-B/16) pretrained on ImageNet-21k as the backbone, augmented with a clinically motivated preprocessing and training pipeline. Contrast Limited Adaptive Histogram Equalization (CLAHE) is applied to enhance local contrast and accentuate tumor boundaries invisible to standard normalization. A two-stage fine-tuning strategy is adopted: the classification head is warmed up with the backbone frozen, followed by full fine-tuning with discriminative learning rates. MixUp and CutMix augmentations are applied per batch to improve generalization. Exponential Moving Average (EMA) of weights and Test-Time Augmentation (TTA) further stabilize and boost performance. Attention Rollout visualization provides clinically interpretable heatmaps of the brain regions driving each prediction. The proposed model achieves a test accuracy of 99.29%, macro F1-score of 99.25%, and perfect recall on both healthy and meningioma classes, outperforming all CNN-based baselines.
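Two of the training components named above, MixUp and EMA of weights, reduce to short formulas. The sketch below shows both on plain lists. It is a generic illustration, not the paper's pipeline: the mixing coefficient, decay value, and toy inputs are assumptions, and CutMix, CLAHE, and TTA are not shown.

```python
from typing import List, Tuple

def mixup(
    x_a: List[float], x_b: List[float],
    y_a: List[float], y_b: List[float],
    lam: float,
) -> Tuple[List[float], List[float]]:
    """Convex combination of two samples and of their one-hot labels."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y

def ema_update(ema_weights: List[float], weights: List[float], decay: float) -> List[float]:
    """One EMA step: keep `decay` of the average, blend in the new weights."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema_weights, weights)]

# Toy two-pixel "images" with one-hot labels for two of the four classes.
mixed_x, mixed_y = mixup([0.0, 4.0], [2.0, 0.0], [1.0, 0.0], [0.0, 1.0], lam=0.5)
averaged = ema_update([1.0], [3.0], decay=0.5)
```

In practice the MixUp coefficient is usually drawn from a Beta distribution per batch, and the EMA decay is set close to 1 so that the averaged weights change slowly.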

Chinese Translation

脑肿瘤是最严重的神经系统疾病之一，早期和准确的诊断与患者生存率直接相关。手动解读磁共振成像（MRI）扫描耗时长，受观察者间变异性影响，并且需要显著的专业知识。本文提出了一种深度学习框架，用于自动化四类脑肿瘤分类，能够区分胶质瘤、脑膜瘤、垂体肿瘤和健康脑组织，数据集包含7,023个MRI扫描。所提出的系统采用在ImageNet-21k上预训练的视觉变换器（Vision Transformer, ViT-B/16）作为主干，并结合临床动机的预处理和训练流程。应用对比度限制自适应直方图均衡（Contrast Limited Adaptive Histogram Equalization, CLAHE）来增强局部对比度并突出标准归一化下不可见的肿瘤边界。采用两阶段微调策略：首先冻结主干网络，预热分类头，随后进行全量微调，使用区分性学习率。每个批次应用MixUp和CutMix增强以提高泛化能力。权重的指数移动平均（Exponential Moving Average, EMA）和测试时增强（Test-Time Augmentation, TTA）进一步稳定并提升性能。注意力展开可视化提供了临床可解释的热图，显示驱动每个预测的脑区。所提出的模型在测试集上实现了99.29%的准确率，99.25%的宏F1分数，并在健康和脑膜瘤类别上实现了完美的召回率，超越了所有基于卷积神经网络（CNN）的基线。

cs.CV / 27 / 2604.21312

The First Challenge on Remote Sensing Infrared Image Super-Resolution at NTIRE 2026: Benchmark Results and Method Overview

2026年NTIRE遥感红外图像超分辨率首届挑战赛：基准结果与方法概述

Liu, Kai, Yue, Haoyang, Lin, Zeli, Chen, Zheng, Wang, Jingkai, Gong, Jue, Li, Jiatong, Yan, Xianglong, Zhu, Libo, Li, Jianze, Zhang, Ziqing, Zhou, Zihan, Liu, Xiaoyang, Timofte, Radu, Zhang, Yulun, Chen, Junye, Yan, Zhenming, Hong, Yucong, Han, Ruize, Wang, Song, Pang, Li, Zhao, Heng, Wu, Xinqiao, Meng, Deyu, Cao, Xiangyong, Yuan, Weijun, Li, Zhan, Chen, Zhanglu, Yao, Boyang, Chen, Yihang, Deng, Yifan, Zuo, Zengyuan, Jiang, Junjun, Meesiyawar, Saiprasad, Yatageri, Sulocha, Akalwadi, Nikhil, Tabib, Ramesh Ashok, Mudenagudi, Uma, Tu, Jiachen, Shi, Yaokun, Xu, Guoyi, Jiang, Yaoxin, Liu, Cici, Mu, Tongyao, Cao, Qiong, Wang, Yifan, Shigematsu, Kosuke, Shirono, Hiroto, Shin, Asuka, Zhou, Wei, Li, Linfeng, Kong, Lingdong, Wang, Ce, Zhong, Xingwei, Sun, Wanjie, Zhang, Dafeng, Lan, Hongxin, Xu, Qisheng, He, Mingyue, Geng, Hui, Wan, Tianjiao, Xu, Kele, Wang, Changjian, Carreaud, Antoine, Santacroce, Nicola, Li, Shanci, Skaloud, Jan, Gressin, Adrien

Abstract

This paper presents the NTIRE 2026 Remote Sensing Infrared Image Super-Resolution (x4) Challenge, one of the associated challenges of NTIRE 2026. The challenge aims to recover high-resolution (HR) infrared images from low-resolution (LR) inputs generated through bicubic downsampling with a x4 scaling factor. The objective is to develop effective models or solutions that achieve state-of-the-art performance for infrared image SR in remote sensing scenarios. To reflect the characteristics of infrared data and practical application needs, the challenge adopts a single-track setting. A total of 115 participants registered for the competition, with 13 teams submitting valid entries. This report summarizes the challenge design, dataset, evaluation protocol, main results, and the representative methods of each team. The challenge serves as a benchmark to advance research in infrared image super-resolution and promote the development of effective solutions for real-world remote sensing applications.

Chinese Translation

本文介绍了2026年NTIRE遥感红外图像超分辨率（x4）挑战赛，这是2026年NTIRE相关挑战之一。该挑战旨在从通过x4缩放因子进行双三次下采样生成的低分辨率（LR）输入中恢复高分辨率（HR）红外图像。目标是开发有效的模型或解决方案，以在遥感场景中实现红外图像超分辨率的最先进性能。为了反映红外数据的特征和实际应用需求，挑战采用单轨道设置。共有115名参与者注册参加比赛，其中13个团队提交了有效的参赛作品。本报告总结了挑战的设计、数据集、评估协议、主要结果以及各团队的代表性方法。该挑战作为基准，推动了红外图像超分辨率研究的发展，并促进了针对实际遥感应用的有效解决方案的开发。

cs.CV / 28 / 2604.21313

PLAS-Net: Pixel-Level Area Segmentation for UAV-Based Beach Litter Monitoring

PLAS-Net：基于像素级别的无人机海滩垃圾监测区域分割

Liu, Yongying, Wang, Jiaqi, Song, Jian, Shao, Xinlei, Chen, Yijia, Xu, Nan, Mizuno, Katsunori, Tabeta, Shigeru, Zhao, Fan

Abstract

Accurate quantification of the physical exposure area of beach litter, rather than simple item counts, is essential for credible ecological risk assessment of marine debris. However, automated UAV-based monitoring predominantly relies on bounding-box detection, which systematically overestimates the planar area of irregular litter objects. To address this geometric limitation, we develop PLAS-Net (Pixel-level Litter Area Segmentor), an instance segmentation framework that extracts pixel-accurate physical footprints of coastal debris. Evaluated on UAV imagery from a monsoon-driven pocket beach in Koh Tao, Thailand, PLAS-Net achieves a mAP_50 of 58.7% with higher precision than eleven baseline models, demonstrating improved mask fidelity under complex coastal conditions. To illustrate how mask accuracy affects the conclusions of environmental analysis, we present three downstream demonstrations: (i) power-law fitting of normalized plastic density (NPD) to characterize fragmentation dynamics; (ii) an area-weighted ecological risk index (ERI) to map spatial pollution hotspots; and (iii) source composition analysis revealing an abundance-area paradox: fishing gear constitutes a small proportion of the total number of items but has the largest physical area per item. Pixel-level area extraction can provide more valuable information for coastal monitoring than methods based solely on counting.
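
The abstract's point that bounding boxes overestimate the planar area of irregular objects can be illustrated with a small sketch. This is not PLAS-Net itself: the binary mask, the helper names `mask_area` and `bbox_area`, and the ground sampling distance `GSD_M` are all hypothetical values chosen for illustration.

```python
def mask_area(mask):
    """Count foreground pixels in a binary mask given as rows of 0/1."""
    return sum(sum(row) for row in mask)


def bbox_area(mask):
    """Pixel area of the tight axis-aligned box around the foreground."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return 0
    return (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)


# Hypothetical L-shaped litter item on a 4x4 pixel grid.
mask = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]

GSD_M = 0.01  # assumed ground sampling distance (metres per pixel)
pixel_area_m2 = GSD_M ** 2

mask_m2 = mask_area(mask) * pixel_area_m2   # area from the segmentation mask
bbox_m2 = bbox_area(mask) * pixel_area_m2   # area from the bounding box
overestimate = bbox_area(mask) / mask_area(mask)  # box area / mask area
```

For this thin L-shaped item the box covers every pixel of the grid while the mask covers only the item, so `overestimate` is well above 1. Compact, box-like items would give a ratio close to 1.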

Chinese Translation

准确量化海滩垃圾的物理暴露面积，而非简单的物品计数，对于海洋垃圾的可信生态风险评估至关重要。然而，基于无人机的自动监测主要依赖于边界框检测，这系统性地高估了不规则垃圾物体的平面面积。为了解决这一几何限制，我们开发了PLAS-Net（像素级垃圾区域分割器），这是一个实例分割框架，能够提取沿海垃圾的像素级物理足迹。在对泰国考岛一个季风驱动的小海滩的无人机图像进行评估时，PLAS-Net在mAP_50上达到了58.7%，其精度高于十一种基线模型，展示了在复杂沿海条件下改进的掩膜保真度。为了说明掩膜的准确性如何影响环境分析的结论，我们进行了三项下游演示：（i）对归一化塑料密度（NPD）进行幂律拟合，以表征碎片化动态；（ii）基于面积加权的生态风险指数（ERI）以绘制空间污染热点；（iii）源组成分析揭示了丰度-面积悖论：渔具在总物品数量中占比小，但每单位物品的物理面积最大。与仅基于计数的方法相比，像素级面积提取能够为沿海监测提供更有价值的信息。

cs.CV / 29 / 2604.21321

FryNet: Dual-Stream Adversarial Fusion for Non-Destructive Frying Oil Oxidation Assessment

FryNet：用于无损油炸油氧化评估的双流对抗融合

Ahmed, Khaled R, Sarker, Toqi Tahamid, Islam, Taminul, Alanezi, Tamany M, AbuGhazaleh, Amer

Abstract

Monitoring frying oil degradation is critical for food safety, yet current practice relies on destructive wet-chemistry assays that provide no spatial information and are unsuitable for real-time use. We identify a fundamental obstacle in thermal-image-based inspection, the camera-fingerprint shortcut, whereby models memorize sensor-specific noise and thermal bias instead of learning oxidation chemistry, collapsing under video-disjoint evaluation. We propose FryNet, a dual-stream RGB-thermal framework that jointly performs oil-region segmentation, serviceability classification, and regression of four chemical oxidation indices (PV, p-AV, Totox, temperature) in a single forward pass. A ThermalMiT-B2 backbone with channel and spatial attention extracts thermal features, while an RGB-MAE Encoder learns chemically grounded representations via masked autoencoding and chemical alignment. Dual-Encoder DANN adversarially regularizes both streams against video identity via Gradient Reversal Layers, and FiLM fusion bridges thermal structure with RGB chemical context. On 7,226 paired frames across 28 frying videos, FryNet achieves 98.97% mIoU, 100% classification accuracy, and 2.32 mean regression MAE, outperforming all seven baselines.
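
The video-disjoint evaluation mentioned in the abstract can be sketched as a split in which every frame of a given video lands in exactly one partition, so a model cannot score well by memorizing per-video sensor noise. This is a minimal sketch and not the authors' protocol: the function name `video_disjoint_split`, the `(video_id, frame_id)` tuples, and the 25% test fraction are assumptions.

```python
import random


def video_disjoint_split(frames, test_fraction=0.25, seed=0):
    """Split (video_id, frame_id) pairs so that no video spans both sets."""
    video_ids = sorted({vid for vid, _ in frames})
    rng = random.Random(seed)
    rng.shuffle(video_ids)
    n_test = max(1, round(len(video_ids) * test_fraction))
    test_videos = set(video_ids[:n_test])
    train = [f for f in frames if f[0] not in test_videos]
    test = [f for f in frames if f[0] in test_videos]
    return train, test


# Hypothetical data: 8 videos with 5 frames each.
frames = [(vid, i) for vid in range(8) for i in range(5)]
train, test = video_disjoint_split(frames)
```

A frame-level random split would instead place frames from the same video on both sides, which is the setting in which the camera-fingerprint shortcut goes unnoticed.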

Chinese Translation

监测油炸油的降解对食品安全至关重要，但目前的做法依赖于破坏性湿化学检测，这些检测无法提供空间信息且不适合实时使用。我们识别出基于热成像检测的一个基本障碍，即相机指纹捷径，模型记忆了传感器特定的噪声和热偏差，而不是学习氧化化学，因此在视频不连贯评估时崩溃。我们提出了FryNet，一种双流RGB-热框架，能够在单次前向传播中共同执行油区分割、可用性分类以及四个化学氧化指标（过氧化值PV、酸值p-AV、总氧化值Totox、温度）的回归。ThermalMiT-B2主干网络通过通道和空间注意力提取热特征，而RGB-MAE编码器通过掩码自编码和化学对齐学习化学基础的表示。双编码器对抗性正则化（DANN）通过梯度反转层使两个流在视频身份上保持一致，FiLM融合则将热结构与RGB化学上下文连接起来。在28个油炸视频的7,226对帧上，FryNet达到了98.97%的平均交并比（mIoU）、100%的分类准确率和2.32的平均回归绝对误差（MAE），超越了所有七个基线模型。

cs.CV / 30 / 2604.21324

Temporal Prototyping and Hierarchical Alignment for Unsupervised Video-based Visible-Infrared Person Re-Identification

无监督视频基础可见-红外行人重识别的时间原型化与层次对齐

Li, Zhiyong, Jiang, Wei, Liu, Haojie, Wang, Mingyu, Xu, Wanchong, Mao, Weijie

Abstract

Visible-infrared person re-identification (VI-ReID) enables cross-modality identity matching for all-day surveillance, yet existing methods predominantly focus on the image level or rely heavily on costly identity annotations. While video-based VI-ReID has recently emerged to exploit temporal dynamics for improved robustness, existing studies remain limited to supervised settings. Crucially, the unsupervised video VI-ReID problem, where models must learn from RGB and infrared tracklets without identity labels, remains largely unexplored despite its practical importance in real-world deployment. To bridge this gap, we propose HiTPro (Hierarchical Temporal Prototyping), a prototype-driven framework without explicit hard pseudo-label assignment for unsupervised video-based VI-ReID. HiTPro begins with an efficient Temporal-aware Feature Encoder that first extracts discriminative frame-level features and then aggregates them into a robust tracklet-level representation. Building upon these features, HiTPro first constructs reliable intra-camera prototypes via Intra-Camera Tracklet Prototyping by aggregating features from temporally partitioned sub-tracklets. Through Hierarchical Cross-Prototype Alignment, we perform a two-stage positive mining process, progressing from within-modality associations to cross-modality matching, enhanced by a Dynamic Threshold Strategy and Soft Weight Assignment. Finally, Hierarchical Contrastive Learning progressively optimizes feature-prototype alignment across three levels: intra-camera discrimination, cross-camera same-modality consistency, and cross-modality invariance. Extensive experiments on HITSZ-VCM and BUPTCampus demonstrate that HiTPro achieves state-of-the-art performance under fully unsupervised settings, significantly outperforming adapted baselines and establishing a strong baseline for future research.
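
The idea of building a prototype from temporally partitioned sub-tracklets can be sketched as follows. The abstract does not give the aggregation rule, so plain mean pooling is assumed here, and the names `mean_vector` and `tracklet_prototype` and the parameter `num_parts` are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def tracklet_prototype(frame_features, num_parts=2):
    """Average features inside contiguous temporal chunks, then average the chunks."""
    n = len(frame_features)
    bounds = [round(i * n / num_parts) for i in range(num_parts + 1)]
    parts = [frame_features[bounds[i]:bounds[i + 1]] for i in range(num_parts)]
    part_means = [mean_vector(p) for p in parts if p]  # skip empty chunks
    return mean_vector(part_means)


# Hypothetical 2-D frame features for a 4-frame tracklet.
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
proto = tracklet_prototype(frames, num_parts=2)
```

Averaging per chunk first gives each temporal segment equal weight even when chunks differ in length, which is one plausible reason to partition a tracklet before pooling.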

Chinese Translation

可见-红外行人重识别（VI-ReID）实现了全天候监控中的跨模态身份匹配，但现有方法主要集中在图像层面或严重依赖昂贵的身份标注。尽管基于视频的VI-ReID最近出现，以利用时间动态提高鲁棒性，但现有研究仍限于监督设置。至关重要的是，无监督视频VI-ReID问题，即模型必须在没有身份标签的情况下从RGB和红外轨迹中学习，尽管在实际部署中具有重要意义，但仍然基本未被探索。为填补这一空白，我们提出了HiTPro（层次时间原型化），这是一个无显式硬伪标签分配的原型驱动框架，用于无监督视频基础VI-ReID。HiTPro首先使用高效的时间感知特征编码器提取区分性的帧级特征，然后将其聚合为鲁棒的轨迹级表示。在这些特征的基础上，HiTPro通过聚合来自时间分区子轨迹的特征，首先构建可靠的内部相机原型。通过层次交叉原型对齐，我们执行两阶段的正样本挖掘过程：从模态内关联进展到跨模态匹配，增强了动态阈值策略和软权重分配。最后，层次对比学习逐步优化三个层次的特征-原型对齐：相机内区分、跨相机同模态一致性和跨模态不变性。在HITSZ-VCM和BUPTCampus上的大量实验表明，HiTPro在完全无监督设置下实现了最先进的性能，显著超越了适应性基线，并为未来研究建立了强有力的基线。

cs.CV / 31 / 2604.21326

MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment

MiMIC：在通用多模态检索中缓解视觉模态崩溃，同时避免语义不对齐

Li, Juan, Ding, Chuanghao, Zhang, Xujie, Nguyen, Cam-Tu

Abstract

Universal Multimodal Retrieval (UMR) aims to map different modalities (e.g., visual and textual) into a shared embedding space for multimodal retrieval. Existing UMR methods can be broadly divided into two categories: early-fusion approaches, such as Marvel, which project visual features into the language model (LM) space for integration with the text modality, and late-fusion approaches, such as UniVL-DR, which encode visual and textual inputs using separate encoders and obtain fused embeddings through addition. Our pilot study reveals that Marvel exhibits visual modality collapse, characterized by the model's tendency to disregard visual features while depending excessively on textual cues. In contrast, although UniVL-DR is less affected by this issue, it is more susceptible to semantic misalignment, where semantically related content is positioned far apart in the embedding space. To address these challenges, we propose MiMIC, which introduces two key innovations: (1) a fusion-in-decoder architecture for effective multimodal integration, and (2) robust training through single-modality mixin and random caption dropout. Experiments on the WebQA+ and EVQA+ datasets, where images in documents or queries may lack captions, indicate that MiMIC consistently outperforms both early- and late-fusion baselines.
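
Random caption dropout, one of the two training techniques named in the abstract, can be sketched as a data augmentation step that blanks the caption of some image examples so the model cannot rely on text alone. The record layout (`image`, `caption`, `text` keys), the function name `drop_captions`, and the dropout probability are assumptions; the abstract does not specify them.

```python
import random


def drop_captions(examples, drop_prob=0.5, seed=0):
    """Blank the caption of image examples with probability drop_prob."""
    rng = random.Random(seed)
    out = []
    for ex in examples:
        ex = dict(ex)  # shallow copy so the caller's records are not mutated
        if ex.get("image") is not None and rng.random() < drop_prob:
            ex["caption"] = ""
        out.append(ex)
    return out


# Hypothetical records: one image document with a caption, one text-only passage.
examples = [
    {"image": "img_0.jpg", "caption": "a red bridge", "text": ""},
    {"image": None, "caption": "", "text": "plain passage"},
]
always = drop_captions(examples, drop_prob=1.0)  # every image caption removed
never = drop_captions(examples, drop_prob=0.0)   # captions left untouched
```

Text-only records pass through unchanged because there is no image to fall back on once the caption is removed.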

Chinese Translation

通用多模态检索（Universal Multimodal Retrieval, UMR）旨在将不同模态（例如视觉和文本）映射到共享的嵌入空间，以实现多模态检索。现有的UMR方法大致可分为两类：早期融合方法，如Marvel，它将视觉特征投影到语言模型（Language Model, LM）空间中以与文本模态进行整合；以及晚期融合方法，如UniVL-DR，它使用独立的编码器对视觉和文本输入进行编码，并通过加法获得融合的嵌入。我们的初步研究表明，Marvel表现出视觉模态崩溃的现象，其特征是模型倾向于忽视视觉特征，而过度依赖文本线索。相较之下，尽管UniVL-DR受到此问题的影响较小，但它更容易受到语义不对齐的影响，即语义相关的内容在嵌入空间中被远距离分隔。为了解决这些挑战，我们提出了MiMIC，提出了两个关键创新：（1）用于有效多模态整合的解码器融合架构；（2）通过单模态混合和随机标题丢弃进行稳健训练。在WebQA+和EVQA+数据集上的实验表明，MiMIC在图像在文档或查询中可能缺乏标题的情况下，始终优于早期和晚期融合的基线方法。

cs.CV / 32 / 2604.21330

Teacher-Guided Routing for Sparse Vision Mixture-of-Experts

教师引导的稀疏视觉专家混合模型路由

Kada, Masahiro, Yoshihashi, Ryota, Ikehata, Satoshi, Kawakami, Rei, Sato, Ikuro

Abstract

Recent progress in deep learning has been driven by increasingly large-scale models, but the resulting computational cost has become a critical bottleneck. Sparse Mixture of Experts (MoE) offers an effective solution by activating only a small subset of experts for each input, achieving high scalability without sacrificing inference speed. Although effective, sparse MoE training exhibits characteristic optimization difficulties. Because the router receives informative gradients only through the experts selected in the forward pass, it suffers from gradient blocking and obtains little information from unselected routes. This limited, highly localized feedback makes it difficult for the router to learn appropriate expert-selection scores and often leads to unstable routing dynamics, such as fluctuating expert assignments during training. To address this issue, we propose TGR-MoE: Teacher-Guided Routing for Sparse Vision Mixture-of-Experts, a simple yet effective method that stabilizes router learning using supervision derived from a pretrained dense teacher model. TGR-MoE constructs a teacher router from the teacher's intermediate representations and uses its routing outputs as pseudo-supervision for the student router, suppressing frequent routing fluctuations during training and enabling knowledge-guided expert selection from the early stages of training. Extensive experiments on ImageNet-1K and CIFAR-100 demonstrate that TGR consistently improves both accuracy and routing consistency, while maintaining stable training even under highly sparse configurations.
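
The two ingredients described in the abstract, sparse top-k expert selection and pseudo-supervision of the student router by a teacher router, can be sketched with plain Python. This is an illustration under assumptions: the teacher routing logits are taken as given (the abstract derives them from the teacher's intermediate representations), cross-entropy is assumed as the supervision loss, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_experts(logits, k=1):
    """Indices of the k highest-scoring experts (sparse routing)."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def routing_distillation_loss(student_logits, teacher_logits):
    """Cross-entropy between the teacher and student routing distributions."""
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


# Hypothetical routing logits for one token over four experts.
teacher = [2.0, 0.1, -1.0, 0.3]
student_far = [-1.0, 0.2, 2.0, 0.0]    # disagrees with the teacher
student_near = [1.8, 0.0, -0.9, 0.4]   # roughly agrees with the teacher

loss_far = routing_distillation_loss(student_far, teacher)
loss_near = routing_distillation_loss(student_near, teacher)
```

Unlike the task loss, this term depends on the scores of all experts, including those not selected in the forward pass, which is how teacher supervision can sidestep the gradient-blocking problem the abstract describes.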

Chinese Translation

近年来，深度学习的进展得益于越来越大规模的模型，但随之而来的计算成本已成为一个关键瓶颈。稀疏专家混合模型（Sparse Mixture of Experts, MoE）通过仅为每个输入激活少量专家，提供了一种有效的解决方案，实现了高可扩展性而不牺牲推理速度。尽管有效，稀疏 MoE 的训练表现出特有的优化困难。由于路由器仅通过在前向传播中选择的专家接收信息梯度，因此它面临梯度阻塞的问题，并从未选择的路径中获得的信息有限。这种有限且高度局部化的反馈使得路由器难以学习适当的专家选择评分，并且常常导致不稳定的路由动态，例如在训练期间专家分配的波动。为了解决这个问题，我们提出了 TGR-MoE：教师引导的稀疏视觉专家混合模型路由，这是一种简单而有效的方法，通过利用来自预训练的稠密教师模型的监督来稳定路由器的学习。TGR-MoE 从教师的中间表示构建教师路由器，并将其路由输出作为学生路由器的伪监督，抑制训练期间频繁的路由波动，并使知识引导的专家选择能够在训练的早期阶段进行。在 ImageNet-1K 和 CIFAR-100 上的广泛实验表明，TGR 一致地提高了准确性和路由一致性，同时在高度稀疏的配置下保持稳定的训练。

cs.CV / 33 / 2604.21343

Latent Denoising Improves Visual Alignment in Large Multimodal Models

潜在去噪改善大型多模态模型中的视觉对齐

Parikh, Dhruv, Fein-Ashley, Jacob, Kannan, Rajgopal, Prasanna, Viktor

Abstract

Large Multimodal Models (LMMs) such as LLaVA are typically trained with an autoregressive language modeling objective, providing only indirect supervision to visual tokens. This often yields weak internal visual representations and brittle behavior under distribution shift. Inspired by recent progress on latent denoising for learning high-quality visual tokenizers, we show that the same principle provides an effective form of visual supervision for improving internal visual feature alignment and multimodal understanding in LMMs. We propose a latent denoising framework that corrupts projected visual tokens using a saliency-aware mixture of masking and Gaussian noising. The LMM is trained to denoise these corrupted tokens by recovering clean teacher patch features from hidden states at a selected intermediate LLM layer using a decoder. To prevent representation collapse, our framework also preserves the teacher's intra-image similarity structure and applies intra-image contrastive patch distillation. During inference, corruption and auxiliary heads are disabled, introducing no additional inference-time overhead. Across a broad suite of standard multimodal benchmarks, our method consistently improves visual understanding and reasoning over strong baselines, and yields clear gains on compositional robustness benchmarks (e.g., NaturalBench). Moreover, under ImageNet-C-style non-adversarial common corruptions applied to benchmark images, our method maintains higher accuracy and exhibits reduced degradation at both moderate and severe corruption levels. Our code is available at https://github.com/dhruvashp/latent-denoising-for-lmms.
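
The corruption step, a mixture of masking and Gaussian noising applied to projected visual tokens, can be sketched as below. The abstract does not say how saliency decides which tokens are masked, so this sketch assumes the most salient tokens are zero-masked and the rest receive additive Gaussian noise; `corrupt_tokens`, `mask_ratio`, and `noise_std` are hypothetical names.

```python
import random


def corrupt_tokens(tokens, saliency, mask_ratio=0.5, noise_std=0.1, seed=0):
    """Zero-mask the most salient tokens and add Gaussian noise to the others."""
    rng = random.Random(seed)
    n_mask = int(len(tokens) * mask_ratio)
    order = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    masked = set(order[:n_mask])
    corrupted = []
    for i, tok in enumerate(tokens):
        if i in masked:
            corrupted.append([0.0] * len(tok))  # masked token
        else:
            corrupted.append([x + rng.gauss(0.0, noise_std) for x in tok])
    return corrupted, masked


# Hypothetical 2-D visual tokens with per-token saliency scores.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
saliency = [0.9, 0.1, 0.8, 0.2]
corrupted, masked = corrupt_tokens(tokens, saliency)
```

In the framework described by the abstract, a decoder would then be trained to recover clean teacher patch features from the hidden states produced for these corrupted tokens; that training loop is outside the scope of this sketch.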

Chinese Translation

大型多模态模型（LMMs），如LLaVA，通常采用自回归语言建模目标进行训练，仅为视觉标记提供间接监督。这往往导致内部视觉表征较弱，并在分布转移下表现脆弱。受到最近在潜在去噪方面的进展启发，我们展示了同样的原理为改善LMMs中的内部视觉特征对齐和多模态理解提供了一种有效的视觉监督形式。我们提出了一种潜在去噪框架，通过使用关注显著性的混合掩蔽和高斯噪声来破坏投影的视觉标记。LMM被训练为去噪这些受损的标记，通过使用解码器从选定的中间LLM层的隐藏状态中恢复干净的教师补丁特征。为了防止表征崩溃，我们的框架还保留了教师的图像内部相似性结构，并应用图像内部对比补丁蒸馏。在推理过程中，破坏和辅助头被禁用，不引入额外的推理时间开销。在一系列标准多模态基准测试中，我们的方法在强基线之上始终改善视觉理解和推理，并在组合鲁棒性基准（例如NaturalBench）上取得明显提升。此外，在对基准图像应用ImageNet-C风格的非对抗性常见破坏时，我们的方法在中等和严重破坏水平下保持更高的准确性，并表现出减少的退化。我们的代码可在https://github.com/dhruvashp/latent-denoising-for-lmms获取。

cs.CV / 34 / 2604.21349

Trust-SSL: Additive-Residual Selective Invariance for Robust Aerial Self-Supervised Learning

Trust-SSL：用于鲁棒性空中自监督学习的加性残差选择不变性

Boulila, Wadii, Ammar, Adel, Benjdira, Bilel, Driss, Maha

Abstract

Self-supervised learning (SSL) is a standard approach for representation learning in aerial imagery. Existing methods enforce invariance between augmented views, which works well when augmentations preserve semantic content. However, aerial images are frequently degraded by haze, motion blur, rain, and occlusion that remove critical evidence. Enforcing alignment between a clean and a severely degraded view can introduce spurious structure into the latent space. This study proposes a training strategy and architectural modification to enhance SSL robustness to such corruptions. It introduces a per-sample, per-factor trust weight into the alignment objective, combined with the base contrastive loss as an additive residual. A stop-gradient is applied to the trust weight instead of a multiplicative gate. While a multiplicative gate is a natural choice, experiments show it impairs the backbone, whereas our additive-residual approach improves it. Using a 200-epoch protocol on a 210,000-image corpus, the method achieves the highest mean linear-probe accuracy among six backbones on EuroSAT, AID, and NWPU-RESISC45 (90.20% compared to 88.46% for SimCLR and 89.82% for VICReg). It yields the largest improvements under severe information-erasing corruptions on EuroSAT (+19.9 points on haze at s=5 over SimCLR). The method also demonstrates consistent gains of +1 to +3 points in Mahalanobis AUROC on a zero-shot cross-domain stress test using BDD100K weather splits. Two ablations (scalar uncertainty and cosine gate) indicate the additive-residual formulation is the primary source of these improvements. An evidential variant using Dempster-Shafer fusion introduces interpretable signals of conflict and ignorance. These findings offer a concrete design principle for uncertainty-aware SSL. Code is publicly available at https://github.com/WadiiBoulila/trust-ssl.

Chinese Translation

自监督学习（SSL）是空中图像表示学习的标准方法。现有方法在增强视图之间强制执行不变性，当增强保持语义内容时效果良好。然而，空中图像常常受到雾霾、运动模糊、降雨和遮挡等因素的影响，这些因素会去除关键证据。在干净视图和严重退化视图之间强制对齐可能会在潜在空间中引入虚假结构。本研究提出了一种训练策略和架构修改，以增强SSL对这些干扰的鲁棒性。它在对齐目标中引入了每个样本、每个因子的信任权重，并将其与基础对比损失结合为加性残差。对信任权重应用停止梯度，而不是乘法门。尽管乘法门是一个自然选择，但实验表明它会损害主干网络，而我们的加性残差方法则改善了主干网络。在一个包含210,000张图像的语料库上使用200个周期的协议，该方法在EuroSAT、AID和NWPU-RESISC45的六个主干网络中实现了最高的平均线性探测准确率（90.20%，相比SimCLR的88.46%和VICReg的89.82%）。在EuroSAT上，该方法在严重信息消失干扰下也取得了最大的改进（在s=5的雾霾下比SimCLR提高了19.9个百分点）。该方法还在使用BDD100K天气拆分的零样本跨域压力测试中展示了+1到+3点的Mahalanobis AUROC的一致增益。两个消融实验（标量不确定性和余弦门）表明，加性残差公式是这些改进的主要来源。使用Dempster-Shafer融合的证据变体引入了冲突和无知的可解释信号。这些发现为不确定性感知的SSL提供了具体的设计原则。代码已公开发布在https://github.com/WadiiBoulila/trust-ssl。

cs.CV / 35 / 2604.21356

SparseGF: A Height-Aware Sparse Segmentation Framework with Context Compression for Robust Ground Filtering Across Urban to Natural Scenes

SparseGF：一种高度感知的稀疏分割框架，结合上下文压缩以实现城市到自然场景的稳健地面过滤

Qin, Nannan, Tao, Pengjie, Guan, Haiyan, Kang, Zhizhong, Ma, Lingfei, Hu, Xiangyun, Li, Jonathan

Abstract

High-quality digital terrain models derived from airborne laser scanning (ALS) data are essential for a wide range of geospatial analyses, and their generation typically relies on robust ground filtering (GF) to separate point clouds across diverse landscapes into ground and non-ground parts. Although current deep-learning-based GF methods have demonstrated impressive performance, especially in specific challenging terrains, their cross-scene generalization remains limited by two persistent issues: the context-detail dilemma in large-scale processing due to limited computational resources, and the random misclassification of tall objects arising from classification-only optimization. To overcome these limitations, we propose SparseGF, a height-aware sparse segmentation framework enhanced with context compression. It is built upon three key innovations: (1) a convex-mirror-inspired context compression module that condenses expansive contexts into compact representations while preserving central details; (2) a hybrid sparse voxel-point network architecture that effectively interprets compressed representations while mitigating compression-induced geometric distortion; and (3) a height-aware loss function that explicitly enforces topographic elevation priors during training to suppress random misclassification of tall objects. Extensive evaluations on two large-scale ALS benchmark datasets demonstrate that SparseGF delivers robust GF across urban to natural terrains, achieving leading performance in complex urban scenes, competitive results on mixed terrains, and moderate yet non-catastrophic accuracy in densely forested steep areas. This work offers new insights into deep-learning-based GF research and encourages further exploration toward truly cross-scene generalization for large-scale environmental monitoring.

Chinese Translation

基于航空激光扫描（ALS）数据生成的高质量数字地形模型对于广泛的地理空间分析至关重要，而其生成通常依赖于稳健的地面过滤（GF）来将不同地貌中的点云分离为地面和非地面部分。尽管当前基于深度学习的GF方法在特定挑战性地形中表现出色，但其跨场景的泛化能力仍受到两个持续性问题的限制：由于计算资源有限而导致的大规模处理中的上下文细节困境，以及由于仅优化分类而引发的高大物体的随机误分类。为克服这些局限性，我们提出了SparseGF，一种增强了上下文压缩的高度感知稀疏分割框架。该框架基于三项关键创新构建：(1) 一个受凸镜启发的上下文压缩模块，将广泛的上下文浓缩为紧凑的表示，同时保留中心细节；(2) 一种混合稀疏体素-点网络架构，有效解读压缩表示，同时减轻压缩引起的几何失真；(3) 一种高度感知的损失函数，在训练过程中明确施加地形高程先验，以抑制高大物体的随机误分类。在两个大规模ALS基准数据集上的广泛评估表明，SparseGF在城市到自然地形中提供了稳健的GF，在复杂城市场景中实现了领先的性能，在混合地形中取得了具有竞争力的结果，并在密林陡坡区域中表现出适度但不灾难性的准确性。本研究为基于深度学习的GF研究提供了新的见解，并鼓励进一步探索以实现大规模环境监测的真正跨场景泛化。

cs.CV / 36 / 2604.21360

Prototype-Based Test-Time Adaptation of Vision-Language Models

基于原型的测试时适应视觉语言模型

Huang, Zhaohong, Zhang, Yuxin, Liu, Wenjing, Chao, Fei, Ji, Rongrong

Abstract

Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, suboptimal performance occurs when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. Particularly, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample's visual features into the corresponding class-specific prototype. It is worth highlighting that the knowledge from past test samples is integrated and utilized solely in the prototypes, eliminating the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. This endows PTA with extremely high efficiency while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP's accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP's inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP's inference speed.
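
The contrast with cache-based designs can be sketched as a fixed-size bank that keeps one running prototype per class and folds each test sample into it, weighted by zero-shot confidence. This is a simplified reading of the abstract rather than the PTA update rule: the class `PrototypeBank`, the choice to update only the top-predicted class, and the weighted-mean form are assumptions.

```python
class PrototypeBank:
    """One running prototype per class; memory does not grow with test samples."""

    def __init__(self, num_classes, dim):
        self.sums = [[0.0] * dim for _ in range(num_classes)]
        self.weights = [0.0] * num_classes

    def update(self, feature, class_probs):
        """Fold a feature into the top-predicted class, weighted by its confidence."""
        c = max(range(len(class_probs)), key=class_probs.__getitem__)
        w = class_probs[c]
        self.weights[c] += w
        for d, x in enumerate(feature):
            self.sums[c][d] += w * x

    def prototype(self, c):
        """Confidence-weighted mean of the features assigned to class c."""
        w = self.weights[c]
        return [s / w for s in self.sums[c]] if w > 0 else list(self.sums[c])


# Hypothetical stream of (visual feature, zero-shot class probabilities).
bank = PrototypeBank(num_classes=2, dim=2)
bank.update([1.0, 0.0], [0.9, 0.1])
bank.update([0.0, 1.0], [0.6, 0.4])
bank.update([1.0, 1.0], [0.2, 0.8])
```

Because only per-class sums and weights are stored, the cost of an update and of a lookup stays constant as more test samples arrive, whereas a cache must be populated and searched.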

Chinese Translation

测试时适应（TTA）作为一种有前景的范式，已在视觉语言模型（VLMs）中出现，旨在弥合预训练与测试数据之间的分布差距。近期的研究集中在无反向传播的TTA方法上，这些方法依赖于基于缓存的设计，但引入了两个主要限制。首先，随着类别数量的增加，缓存的增长导致推理延迟增加，在大规模设置中效率低下。其次，当缓存中包含不足或错误的样本时，性能会下降。在本文中，我们提出了一种基于原型的测试时适应（PTA），这是一种高效且有效的TTA范式，利用一组特定类别的知识原型来积累来自测试样本的知识。特别地，知识原型根据每个测试样本的零样本类别置信度进行自适应加权，将样本的视觉特征融入相应的类别特定原型中。值得强调的是，来自过去测试样本的知识仅在原型中集成和利用，消除了缓存填充和检索的开销，从而提高了现有TTA方法的效率。这使得PTA具备极高的效率，同时在15个图像识别基准和4个鲁棒点云分析基准上实现了最先进的性能。例如，PTA将CLIP在10个跨域基准上的准确率从65.64%提高到69.38%，同时在大规模ImageNet-1K上保留了92%的CLIP推理速度。相比之下，基于缓存的TDA的准确率为67.97%，且仅以CLIP推理速度的50%运行。

cs.CV / 37 / 2604.21362

KD-CVG: A Knowledge-Driven Approach for Creative Video Generation

KD-CVG：一种基于知识驱动的创意视频生成方法

Liu, Linkai, Feng, Wei, Zhao, Xi, Zhang, Shen, Chen, Xingye, Zhang, Zheng, Lv, Jingjing, Shen, Junjie, Law, Ching, Zhou, Yuchen, Guo, Zipeng, Gou, Chao

Abstract

Creative Generation (CG) leverages generative models to automatically produce advertising content that highlights product features, and it has been a significant focus of recent research. However, while CG has advanced considerably, most efforts have concentrated on generating advertising text and images, leaving Creative Video Generation (CVG) relatively underexplored. This gap is largely due to two major challenges faced by Text-to-Video (T2V) models: (a) ambiguous semantic alignment, where models struggle to accurately correlate product selling points with creative video content, and (b) inadequate motion adaptability, resulting in unrealistic movements and distortions. To address these challenges, we develop a comprehensive Advertising Creative Knowledge Base (ACKB) as a foundational resource and propose a knowledge-driven approach (KD-CVG) to overcome the knowledge limitations of existing models. KD-CVG consists of two primary modules: Semantic-Aware Retrieval (SAR) and Multimodal Knowledge Reference (MKR). SAR utilizes the semantic awareness of graph attention networks and reinforcement learning feedback to enhance the model's comprehension of the connections between selling points and creative videos. Building on this, MKR incorporates semantic and motion priors into the T2V model to address existing knowledge gaps. Extensive experiments have demonstrated KD-CVG's superior performance in achieving semantic alignment and motion adaptability, validating its effectiveness over other state-of-the-art methods. The code and dataset will be open source at https://kdcvg.github.io/KDCVG/.

Chinese Translation

创意生成（CG）利用生成模型自动生成突出产品特征的广告内容，近年来已成为研究的一个重要焦点。然而，尽管CG已取得显著进展，大多数努力集中在生成广告文本和图像上，创意视频生成（CVG）相对未被充分探索。这一差距主要源于文本到视频（T2V）模型面临的两个主要挑战：（a）模糊的语义对齐，模型难以准确关联产品卖点与创意视频内容；（b）运动适应性不足，导致不自然的运动和失真。为了解决这些挑战，我们开发了一个全面的广告创意知识库（ACKB）作为基础资源，并提出了一种知识驱动的方法（KD-CVG）以克服现有模型的知识局限。KD-CVG由两个主要模块组成：语义感知检索（SAR）和多模态知识参考（MKR）。SAR利用图注意力网络的语义感知和强化学习反馈，增强模型对卖点与创意视频之间联系的理解。在此基础上，MKR将语义和运动先验融入T2V模型，以解决现有知识的缺口。大量实验表明，KD-CVG在实现语义对齐和运动适应性方面表现优越，验证了其相较于其他最先进方法的有效性。代码和数据集将开源于 https://kdcvg.github.io/KDCVG/.

cs.CV / 38 / 2604.21387

EdgeFormer: local patch-based edge detection transformer on point clouds

EdgeFormer：基于局部补丁的点云边缘检测变换器

Xie, Yifei, Tu, Zhikun, Yang, Tong, Zhang, Yuhe, Zhou, Xinyu

Abstract

Edge points on 3D point clouds clearly convey 3D geometry and surface characteristics, so edge detection is widely used in vision applications with high industrial and commercial demand. However, fine-grained edge features are difficult to detect effectively because they are generally densely distributed or exhibit small-scale surface gradients. To address this issue, we present a learning-based edge detection network, named EdgeFormer, which consists of two main stages. Based on the observation that spatially neighboring points tend to be highly correlated and form the local underlying surface, we convert edge detection over the entire point cloud into point classification based on local patches. In the first stage, we construct local patch feature descriptors that describe the local neighborhood around each point. In the second stage, we classify each point by analyzing the local patch feature descriptors generated in the first stage. Because the point cloud is converted into local patches, the proposed method can effectively extract finer details. Experimental results show that our model achieves competitive performance compared with six baselines.
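
Turning a whole point cloud into per-point local patches can be sketched with a brute-force k-nearest-neighbour search. The abstract does not state how patches are built, so k-NN with centre-relative coordinates is an assumption, and `knn_patch` and the sample points are hypothetical.

```python
import math


def knn_patch(points, center_index, k):
    """Return the k nearest neighbours of a point, relative to that point."""
    center = points[center_index]
    others = [i for i in range(len(points)) if i != center_index]
    others.sort(key=lambda i: math.dist(points[i], center))
    return [
        tuple(p - c for p, c in zip(points[i], center))
        for i in others[:k]
    ]


# Hypothetical tiny point cloud.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]

# One local patch per point; a classifier would label each patch edge / non-edge.
patches = [knn_patch(points, i, k=2) for i in range(len(points))]
```

Expressing neighbours relative to the centre point makes the patch independent of where it sits in the scene, so a per-point classifier only sees local shape. The brute-force search is quadratic in the number of points; a spatial index would be needed for realistic cloud sizes.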

Chinese Translation

3D点云上的边缘点能够清晰地传达3D几何形状和表面特征，因此边缘检测在许多具有高工业和商业需求的视觉应用中被广泛使用。然而，由于细粒度的边缘特征通常密集分布或表现出小尺度的表面梯度，因此难以有效检测。为了解决这一问题，我们提出了一种基于学习的边缘检测网络，命名为EdgeFormer，该网络主要由两个阶段组成。基于空间邻近点往往表现出高度相关性，形成局部潜在表面的观察，我们将整个点云的边缘检测转换为基于局部补丁的点分类。因此，在第一阶段，我们构建了描述每个点周围局部邻域的局部补丁特征描述符。在第二阶段，我们通过分析第一阶段生成的局部补丁特征描述符对每个点进行分类。由于将点云转换为局部补丁，所提出的方法能够有效提取更细微的细节。实验结果表明，我们的模型与六个基线模型相比表现出竞争力。

cs.CV / 39 / 2604.21396

VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought

VG-CoT：通过扎根的思维链实现可信的视觉推理

Lim, Byeonggeuk, Kim, Kyeonghyun, Yun, JungMin, Kim, YoungBin

Abstract

The advancement of Large Vision-Language Models (LVLMs) requires precise local region-based reasoning that faithfully grounds the model's logic in actual visual evidence. However, existing datasets face limitations in scalability due to extensive manual annotation and lack explicit alignment between multi-step reasoning and the corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we propose the Visual Grounding Chain-of-Thought (VG-CoT) dataset, which explicitly links each reasoning step to real visual evidence within the image through a fully automated three-stage pipeline. The pipeline first extracts object- and text-level visual evidence using state-of-the-art detection and OCR models, then generates step-by-step grounded reasoning with GPT-4o, and finally refines the grounding through a rationale-driven open-set detection process. In addition, we introduce a new benchmark that comprehensively evaluates LVLM reasoning across three complementary dimensions: Rationale Quality, Answer Accuracy, and Reasoning-Answer Alignment. Experiments with representative LVLMs, including LLaVA-1.5 and Qwen2-VL, demonstrate consistent improvements on most evaluation metrics, confirming that VG-CoT effectively enhances trustworthy, evidence-based reasoning while maintaining scalable and cost-efficient dataset construction. The dataset and code will be released publicly upon acceptance to facilitate further research.

Chinese Translation

大型视觉语言模型（LVLMs）的进展需要精确的基于局部区域的推理，这种推理能够忠实地将模型的逻辑与实际视觉证据相结合。然而，现有数据集在可扩展性方面存在局限，主要由于大量的人工标注以及多步推理与相应图像区域之间缺乏明确的对齐，这限制了模型可信度的评估。为了解决这些挑战，我们提出了视觉扎根思维链（VG-CoT）数据集，该数据集通过一个完全自动化的三阶段流程，将每个推理步骤明确链接到图像中的真实视觉证据。该流程首先使用最先进的检测和光学字符识别（OCR）模型提取对象和文本级的视觉证据，然后利用GPT-4o生成逐步扎根的推理，最后通过基于理由的开放集检测过程来完善扎根。此外，我们引入了一个新的基准，全面评估LVLMs在三个互补维度上的推理能力：理由质量、答案准确性和推理-答案对齐。与代表性的LVLMs（包括LLaVA-1.5和Qwen2-VL）的实验表明，在大多数评估指标上都有一致的改善，确认VG-CoT有效增强了可信的基于证据的推理，同时保持了可扩展和成本高效的数据集构建。该数据集和代码将在接受后公开发布，以促进进一步的研究。

cs.CV / 40 / 2604.21400

You Only Gaussian Once: Controllable 3D Gaussian Splatting for Ultra-Densely Sampled Scenes

你只需高斯一次：用于超密采样场景的可控3D高斯喷溅

Jia, Jinrang, Li, Zhenjia, Shi, Yifeng

Abstract

3D Gaussian Splatting (3DGS) has revolutionized neural rendering, yet existing methods remain predominantly research prototypes ill-suited for production-level deployment. We identify a critical "Industry-Academia Gap" hindering real-world application: unpredictable resource consumption from heuristic Gaussian growth, the "sparsity shield" of current benchmarks that rewards hallucination over physical fidelity, and severe multi-sensor data pollution. To bridge this gap, we propose YOGO (You Only Gaussian Once), a system-level framework that reformulates the stochastic growth process into a deterministic, budget-aware equilibrium. YOGO integrates a novel budget controller for hardware-constrained resource allocation and an availability-registration protocol for robust multi-sensor fusion. To push the boundaries of reconstruction fidelity, we introduce Immersion v1.0, the first ultra-dense indoor dataset specifically designed to break the "sparsity shield." By providing saturated viewpoint coverage, Immersion v1.0 forces algorithms to focus on extreme physical fidelity rather than viewpoint interpolation, and enables the community to focus on the upper limits of high-fidelity reconstruction. Extensive experiments demonstrate that YOGO achieves state-of-the-art visual quality while maintaining a strictly deterministic profile, establishing a new standard for production-grade 3DGS. To facilitate reproducibility, a subset of scenes from the Immersion v1.0 dataset and the source code of YOGO have been publicly released. The project link is https://jjrcn.github.io/YOGO/.

Chinese Translation

3D高斯喷溅（3DGS）已经彻底改变了神经渲染，但现有方法仍然主要是研究原型，不适合生产级部署。我们识别出一个关键的“产业-学术差距”，阻碍了其在现实世界中的应用：来自启发式高斯生长的不可预测资源消耗、当前基准的“稀疏保护”奖励幻觉而非物理真实感，以及严重的多传感器数据污染。为了解决这一问题，我们提出了YOGO（You Only Gaussian Once），一个系统级框架，将随机生长过程重新构建为一个确定性、预算意识的平衡。YOGO集成了一种新颖的预算控制器，用于硬件受限的资源分配，以及一种可用性注册协议，以实现稳健的多传感器融合。为了推动重建真实感的边界，我们推出了Immersion v1.0，这是第一个专门设计用于打破“稀疏保护”的超密室内数据集。通过提供饱和的视点覆盖，Immersion v1.0迫使算法关注极高的物理真实感，而非视点插值，从而使社区能够专注于高保真重建的上限。大量实验表明，YOGO在保持严格确定性特征的同时，实现了最先进的视觉质量，为生产级3DGS建立了新的标准。为了促进可重复性，Immersion v1.0数据集的部分场景和YOGO的源代码已公开发布。项目链接为 https://jjrcn.github.io/YOGO/.

cs.CV / 41 / 2604.21409

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

S1-VL：具有图像思维的科学多模态推理模型

Li, Qingxiao, Xu, Lifeng, Wang, QingLi, Bai, Yudong, Ou, Mingwei, Hu, Shu, Xu, Nan

Abstract

We present S1-VL, a multimodal reasoning model for scientific domains that natively supports two complementary reasoning paradigms: Scientific Reasoning, which relies on structured chain-of-thought, and Thinking-with-Images, which enables the model to actively manipulate images through Python code execution during reasoning. In the Thinking-with-Images mode, the model generates and executes image-processing code in a sandbox environment, obtains intermediate visual results, and continues reasoning in a multi-turn iterative manner. This design is particularly effective for challenging scenarios such as high-resolution scientific chart interpretation, microscopic image understanding, and geometry-assisted reasoning. To construct the training data, we collect scientific multimodal datasets spanning six disciplines: mathematics, physics, chemistry, astronomy, geography, and biology. We further develop a six-dimensional quality filtering framework for reasoning trajectories. To mitigate redundant, ineffective, and erroneous visual operations commonly found in existing datasets, we propose a multi-stage filtering pipeline together with an adaptive data routing strategy. This strategy converts samples with low visual information gain into pure Reasoning-mode data, enabling the model to learn when image operations are truly necessary. S1-VL is trained through a four-stage progressive pipeline: scientific multimodal SFT, Thinking-with-Images cold-start SFT, and two stages of reinforcement learning with SAPO. We build S1-VL-32B on top of Qwen3-VL-32B-Thinking and evaluate it on 13 benchmarks. Experimental results show that S1-VL-32B achieves state-of-the-art performance on all five Thinking-with-Images benchmarks, including HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, and V*, and outperforms compared systems on scientific reasoning benchmarks such as Physics and VRSBench.

Chinese Translation

我们提出了 S1-VL，这是一种针对科学领域的多模态推理模型，原生支持两种互补的推理范式：科学推理（Scientific Reasoning），依赖于结构化的思维链，以及图像思维（Thinking-with-Images），使模型能够在推理过程中通过执行 Python 代码主动操作图像。在图像思维模式下，模型在沙箱环境中生成并执行图像处理代码，获取中间视觉结果，并以多轮迭代的方式继续推理。这一设计在高分辨率科学图表解读、显微图像理解和几何辅助推理等挑战性场景中尤其有效。为了构建训练数据，我们收集了涵盖数学、物理、化学、天文学、地理和生物学六个学科的科学多模态数据集。我们进一步开发了一个六维质量过滤框架，用于推理轨迹。为了减少现有数据集中常见的冗余、无效和错误的视觉操作，我们提出了一个多阶段过滤管道以及自适应数据路由策略。该策略将低视觉信息增益的样本转换为纯推理模式数据，使模型能够学习何时真正需要图像操作。S1-VL 通过四阶段渐进管道进行训练：科学多模态 SFT、图像思维冷启动 SFT，以及两个阶段的基于 SAPO 的强化学习。我们在 Qwen3-VL-32B-Thinking 的基础上构建了 S1-VL-32B，并在 13 个基准上进行了评估。实验结果表明，S1-VL-32B 在所有五个图像思维基准上都达到了最先进的性能，包括 HRBench-4K、HRBench-8K、MME-RealWorld-CN、MME-RealWorld-Lite 和 V*，并在物理学和 VRSBench 等科学推理基准上超越了对比系统。

cs.CV / 42 / 2604.21422

Pre-process for segmentation task with nonlinear diffusion filters

使用非线性扩散滤波器的分割任务预处理

Sanguino, Javier, Platero, Carlos, Velasco, Olga

Abstract

This paper addresses the use of nonlinear diffusion filters to obtain piecewise constant images as a preprocessing step for segmentation techniques. We first present an intrinsic formulation of the nonlinear diffusion equation that provides design conditions for the diffusion filters. Based on this theoretical framework, we propose a new family of diffusivities; they are obtained from nonlinear diffusion techniques and are related to backward diffusion. Their goal is to split the image into closed contours with a homogenized grey intensity inside and no blurred edges. We also prove that our filters satisfy the semi-discrete and fully discrete well-posedness and scale-space requirements. This shows that by using semi-implicit schemes, a forward nonlinear diffusion equation is solved, instead of a backward nonlinear diffusion equation, connecting with an edge-preserving process. Under the conditions established for the diffusivity and using a stopping criterion for the diffusion time, we obtain piecewise constant images with low computational effort. Finally, we test our filter on real images and illustrate the effects of our diffusivity function as a method to obtain piecewise constant images. The code is available at https://github.com/cplatero/NonlinearDiffusion.

Chinese Translation

本文讨论了使用非线性扩散滤波器获取分段常数图像作为分割技术的前处理过程。我们首先展示了非线性扩散方程的内在公式，以提供扩散滤波器的一些设计条件。根据这一理论框架，我们提出了一种新的扩散性家族；这些扩散性是通过非线性扩散技术获得的，并与反向扩散相关。它们的目标是将图像分割成封闭轮廓，内部具有均匀的灰度强度且没有模糊边缘。我们还证明了我们的滤波器满足良定性半离散和全离散尺度空间的要求。这表明，通过使用半隐式方案，解决的是前向非线性扩散方程，而不是反向非线性扩散方程，从而与保持边缘的过程相连接。在为扩散性建立的条件下，并使用扩散时间的停止准则，我们以较低的计算成本获得了分段常数图像。最后，我们用真实图像测试了我们的滤波器，并展示了我们的扩散性函数作为获取分段常数图像的方法的效果。代码可在 https://github.com/cplatero/NonlinearDiffusion 获取。
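
As a rough illustration of how a nonlinear diffusion filter drives a signal toward piecewise constant values, the sketch below runs an explicit scheme on a 1-D signal. It is not the authors' method: the classical Perona-Malik diffusivity stands in for the new family of diffusivities, an explicit update replaces the semi-implicit scheme mentioned in the abstract, and the names `diffusivity`, `diffuse_1d`, `contrast`, and `step` are hypothetical.

```python
def diffusivity(gradient, contrast):
    """Classical Perona-Malik diffusivity (an assumption, not the paper's family):
    close to 1 in flat regions, close to 0 across strong edges."""
    return 1.0 / (1.0 + (gradient / contrast) ** 2)


def diffuse_1d(signal, contrast=4.0, step=0.2, iterations=100):
    """Explicit nonlinear diffusion on a 1-D signal with zero-flux boundaries.

    With step <= 0.5 every update is a convex combination of neighbouring
    samples, so values stay inside the original range.
    """
    u = [float(v) for v in signal]
    for _ in range(iterations):
        # flux[i] is the diffusive flux between samples i and i + 1
        flux = []
        for i in range(len(u) - 1):
            grad = u[i + 1] - u[i]
            flux.append(diffusivity(grad, contrast) * grad)
        updated = u[:]
        for i in range(len(u)):
            right = flux[i] if i < len(flux) else 0.0   # no flux past the last sample
            left = flux[i - 1] if i > 0 else 0.0        # no flux before the first sample
            updated[i] = u[i] + step * (right - left)
        u = updated
    return u


if __name__ == "__main__":
    noisy_step = [0, 2, -1, 1, 0, 100, 99, 101, 100, 98]
    smoothed = diffuse_1d(noisy_step)
    print([round(v, 2) for v in smoothed])
```

In this toy setting the small fluctuations on each side of the jump are smoothed away while the jump itself is largely kept, which is the edge-preserving behaviour the abstract describes. The number of iterations plays the role of the diffusion stopping time.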

cs.CV / 43 / 2604.21435

UHR-DETR: Efficient End-to-End Small Object Detection for Ultra-High-Resolution Remote Sensing Imagery

UHR-DETR：超高分辨率遥感影像的小物体高效端到端检测

Li, Jingfang, Zhu, Haoran, Yang, Wen, Zhang, Jinrui, Xu, Fang, Zhang, Haijian, Xia, Gui-Song

Abstract

Ultra-High-Resolution (UHR) imagery has become essential for modern remote sensing, offering unprecedented spatial coverage. However, detecting small objects in such vast scenes presents a critical dilemma: retaining the original resolution for small objects causes prohibitive memory bottlenecks. Conversely, conventional compromises like image downsampling or patch cropping either erase small objects or destroy context. To break this dilemma, we propose UHR-DETR, an efficient end-to-end transformer-based detector designed for UHR imagery. First, we introduce a Coverage-Maximizing Sparse Encoder that dynamically allocates finite computational resources to informative high-resolution regions, ensuring maximum object coverage with minimal spatial redundancy. Second, we design a Global-Local Decoupled Decoder. By integrating macroscopic scene awareness with microscopic object details, this module resolves semantic ambiguities and prevents scene fragmentation. Extensive experiments on UHR imagery datasets (e.g., STAR and SODA-A) demonstrate the superiority of UHR-DETR under strict hardware constraints (e.g., a single 24GB RTX 3090). It achieves a 2.8% mAP improvement while delivering a 10× inference speedup compared to standard sliding-window baselines on the STAR dataset. Our code and models will be available at https://github.com/Li-JingFang/UHR-DETR.

Chinese Translation

超高分辨率（UHR）影像已成为现代遥感的重要组成部分，提供了前所未有的空间覆盖。然而，在如此广阔的场景中检测小物体面临着一个关键的困境：保持小物体的原始分辨率会导致巨大的内存瓶颈。相反，传统的折中方案，如图像下采样或图像块裁剪，要么会抹去小物体，要么会破坏上下文。为了解决这一困境，我们提出了UHR-DETR，一种专为UHR影像设计的高效端到端变换器（transformer）基础检测器。首先，我们引入了一种覆盖最大化稀疏编码器（Coverage-Maximizing Sparse Encoder），该编码器动态分配有限的计算资源到信息丰富的高分辨率区域，确保在最小空间冗余下实现最大物体覆盖。其次，我们设计了一个全局-局部解耦解码器（Global-Local Decoupled Decoder）。通过将宏观场景意识与微观物体细节相结合，该模块解决了语义模糊性并防止场景碎片化。在UHR影像数据集（如STAR和SODA-A）上的大量实验表明，在严格的硬件限制下（例如，单个24GB RTX 3090），UHR-DETR表现出优越性。与STAR数据集上的标准滑动窗口基线相比，它实现了2.8%的mAP提升，同时提供了10倍的推理速度提升。我们的代码和模型将会在https://github.com/Li-JingFang/UHR-DETR上发布。

cs.CV / 44 / 2604.21442

2L-LSH: A Locality-Sensitive Hash Function-Based Method For Rapid Point Cloud Indexing

2L-LSH：一种基于局部敏感哈希函数的快速点云索引方法

Wang, Shurui, Zhang, Yuhe, Guo, Ruizhe, Zhang, Yaning, Xie, Yifei, Zhou, Xinyu

Abstract

The development of 3D scanning technology has enabled the acquisition of massive point cloud models with diverse structures and large scales, thereby presenting significant challenges in point cloud processing. Fast neighboring-point search is one of the most common problems and is frequently used in model reconstruction, classification, retrieval and feature visualization. Hash functions are well known for their high-speed and accurate performance in searching high-dimensional data, and they form the core of the proposed 2L-LSH. Specifically, the 2L-LSH algorithm adopts a two-step hash function strategy, in which the first step divides the bounding box of the point cloud model and the second step constructs a generalized table-based data structure. The proposed 2L-LSH offers a highly efficient and accurate solution for fast neighboring-point search in large-scale 3D point cloud models, making it a promising technique for various applications in the field. The proposed algorithm is compared with well-known methods including Kd-tree and Octree; the results demonstrate that the proposed method outperforms both in terms of speed: the time consumption of kNN search can be 51.111% and 94.159% lower than Kd-tree and Octree, respectively, and the RN search time can be 54.519% and 41.840% lower, respectively.

Chinese Translation

3D扫描技术的发展使得获取具有多样结构和大规模的海量点云模型成为可能，从而在点云处理方面带来了重大挑战。快速邻近点搜索是最常见的问题之一，广泛应用于模型重建、分类、检索和特征可视化。哈希函数因其在高维数据搜索中的高速和准确性能而广为人知，这也是所提出的2L-LSH的核心。具体而言，2L-LSH算法采用了两步哈希函数策略，其中第一步将点云模型的边界框划分，第二步构建基于表格的通用数据结构。所提出的2L-LSH为大规模3D点云模型中的快速邻近点搜索提供了一种高效且准确的解决方案，使其成为该领域各种应用的有前景的技术。所提出的算法与著名方法Kd-tree和Octree进行了比较；获得的结果表明，所提方法在速度上优于Kd-tree和Octree，即kNN搜索的时间消耗分别比Kd-tree和Octree低51.111%和94.159%。而RN搜索时间分别比Kd-tree和Octree低54.519%和41.840%。
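
The general idea of hashing points into spatial cells to speed up neighbor search can be sketched as follows. This is a single-level uniform grid hash, shown only to illustrate the principle; it is not the two-level 2L-LSH structure itself, whose details the abstract does not give. The function names and the `cell_size` parameter are hypothetical.

```python
import math
from collections import defaultdict


def build_grid_index(points, cell_size):
    """Hash every 3-D point into the cubic cell that contains it."""
    index = defaultdict(list)
    for i, (x, y, z) in enumerate(points):
        key = (math.floor(x / cell_size),
               math.floor(y / cell_size),
               math.floor(z / cell_size))
        index[key].append(i)
    return index


def radius_search(points, index, cell_size, query, radius):
    """Return sorted indices of points within `radius` of `query`.

    Only cells that can intersect the query ball are visited, so the cost
    depends on local density rather than on the total number of points.
    """
    reach = math.ceil(radius / cell_size)
    qx, qy, qz = (math.floor(c / cell_size) for c in query)
    hits = []
    for dx in range(-reach, reach + 1):
        for dy in range(-reach, reach + 1):
            for dz in range(-reach, reach + 1):
                for i in index.get((qx + dx, qy + dy, qz + dz), ()):
                    if math.dist(points[i], query) <= radius:
                        hits.append(i)
    return sorted(hits)


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (2.0, 2.0, 2.0),
             (-0.4, 0.1, 0.0), (5.0, 5.0, 5.0)]
    grid = build_grid_index(cloud, cell_size=1.0)
    print(radius_search(cloud, grid, 1.0, (0.1, 0.1, 0.1), radius=1.0))
```

A fixed-radius (RN) query maps directly onto this structure. A kNN query could be built on top of it by growing the search radius until enough candidates are found, though that is a design choice of this sketch rather than something stated in the abstract.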

cs.CV / 45 / 2604.21450

VARestorer: One-Step VAR Distillation for Real-World Image Super-Resolution

VARestorer：用于真实世界图像超分辨率的一步VAR蒸馏

Zhu, Yixuan, Ma, Shilin, Wang, Haolin, Li, Ao, Jing, Yanzhe, Tang, Yansong, Chen, Lei, Lu, Jiwen, Zhou, Jie

Abstract

Recent advancements in visual autoregressive models (VAR) have demonstrated their effectiveness in image generation, highlighting their potential for real-world image super-resolution (Real-ISR). However, adapting VAR for ISR presents critical challenges. The next-scale prediction mechanism, constrained by causal attention, fails to fully exploit global low-quality (LQ) context, resulting in blurry and inconsistent high-quality (HQ) outputs. Additionally, error accumulation in the iterative prediction severely degrades coherence in the ISR task. To address these issues, we propose VARestorer, a simple yet effective distillation framework that transforms a pre-trained text-to-image VAR model into a one-step ISR model. By leveraging distribution matching, our method eliminates the need for iterative refinement, significantly reducing error propagation and inference time. Furthermore, we introduce pyramid image conditioning with cross-scale attention, which enables bidirectional scale-wise interactions and fully utilizes the input image information while adapting to the autoregressive mechanism. This prevents later LQ tokens from being overlooked in the transformer. By fine-tuning only 1.2% of the model parameters through parameter-efficient adapters, our method maintains the expressive power of the original VAR model while significantly enhancing efficiency. Extensive experiments show that VARestorer achieves state-of-the-art performance with 72.32 MUSIQ and 0.7669 CLIPIQA on the DIV2K dataset, while accelerating inference by 10 times compared to conventional VAR inference.

Chinese Translation

最近在视觉自回归模型（VAR）方面的进展展示了其在图像生成中的有效性，突显了其在真实世界图像超分辨率（Real-ISR）中的潜力。然而，将VAR应用于ISR面临着关键挑战。受因果注意力限制的下一尺度预测机制未能充分利用全局低质量（LQ）上下文，导致模糊和不一致的高质量（HQ）输出。此外，迭代预测中的误差积累严重降低了ISR任务的一致性。为了解决这些问题，我们提出了VARestorer，一个简单而有效的蒸馏框架，将一个预训练的文本到图像VAR模型转化为一个一步ISR模型。通过利用分布匹配，我们的方法消除了迭代细化的需要，显著减少了误差传播和推理时间。此外，我们引入了带有跨尺度注意力的金字塔图像条件，这使得双向尺度交互成为可能，并充分利用输入图像信息，同时适应自回归机制。这防止了后期LQ标记在变换器中被忽视。通过仅微调1.2%的模型参数，使用参数高效的适配器，我们的方法保持了原始VAR模型的表现力，同时显著提高了效率。大量实验表明，VARestorer在DIV2K数据集上实现了72.32 MUSIQ和0.7669 CLIPIQA的最新性能，同时与传统VAR推理相比，加速推理速度达10倍。

cs.CV / 46 / 2604.21453

Instance-level Visual Active Tracking with Occlusion-Aware Planning

具有遮挡感知规划的实例级视觉主动跟踪

Sun, Haowei, Zhou, Kai, Gao, Hao, Zhang, Shiteng, Hu, Jinwu, Wen, Xutao, Ye, Qixiang, Tan, Mingkui

Abstract

Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-Aware Trajectory Planner, trained on our new Planning-20k dataset, uses conditional diffusion to generate obstacle-avoiding paths for occlusion recovery. Experiments demonstrate OA-VAT achieves 0.93 average SR on UnrealCV (+2.2% vs. SOTA TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. SOTA GC-VAT), and 81.6% TSR on a DJI Tello drone. Running at 35 FPS on an RTX 3090, it delivers robust, real-time performance for practical deployment.

Chinese Translation

视觉主动跟踪（VAT）旨在控制摄像机在三维空间中跟随目标，这对于无人机导航和安全监控等应用至关重要。然而，在实际部署中，它面临两个主要瓶颈：由于实例级辨别能力不足而导致的视觉相似干扰物的混淆，以及由于缺乏主动规划而在遮挡情况下的严重失效。为了解决这些问题，我们提出了OA-VAT，一个包含三个互补模块的统一管道。首先，一个无训练的实例感知离线原型初始化模块通过DINOv3聚合多视角增强特征，以构建具有辨别性的实例原型，从而减轻干扰物的混淆。其次，一个在线原型增强跟踪器在线增强原型，并集成了一个基于置信度的卡尔曼滤波器，以在外观和运动变化下实现稳定跟踪。第三，一个遮挡感知轨迹规划器，基于我们新的Planning-20k数据集进行训练，使用条件扩散生成避障路径以恢复遮挡。实验表明，OA-VAT在UnrealCV上实现了0.93的平均成功率（SR）（比SOTA TrackVLA提高2.2%），在真实世界数据集上实现了90.8%的平均分类准确率（CAR）（比SOTA GC-VAT提高12.1%），在大疆Tello无人机上实现了81.6%的目标成功率（TSR）。在RTX 3090上以35帧每秒运行，它为实际部署提供了强大且实时的性能。

cs.CV / 47 / 2604.21461

Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision

多模态大语言模型理解指向吗？基于自我中心视觉的指称推理基准与增强

Li, Chentao, Gao, Zirui, Gao, Mingze, Ren, Yinglian, Feng, Jianjiang, Zhou, Jie

Abstract

Egocentric AI agents, such as smart glasses, rely on pointing gestures to resolve referential ambiguities in natural language commands. However, despite advancements in Multimodal Large Language Models (MLLMs), current systems often fail to precisely ground the spatial semantics of pointing. Instead, they rely on spurious correlations with visual proximity or object saliency, a phenomenon we term "Referential Hallucination." To address this gap, we introduce EgoPoint-Bench, a comprehensive question-answering benchmark designed to evaluate and enhance multimodal pointing reasoning in egocentric views. Comprising over 11k high-fidelity simulated and real-world samples, the benchmark spans five evaluation dimensions and three levels of referential complexity. Extensive experiments demonstrate that while state-of-the-art proprietary and open-source models struggle with egocentric pointing, models fine-tuned on our synthetic data achieve significant performance gains and robust sim-to-real generalization. This work highlights the importance of spatially aware supervision and offers a scalable path toward precise egocentric AI assistants. Project page: https://guyyyug.github.io/EgoPoint-Bench/

Chinese Translation

自我中心人工智能代理（如智能眼镜）依赖指向手势来解决自然语言命令中的指称歧义。然而，尽管多模态大语言模型（MLLMs）取得了进展，当前系统往往无法准确地将指向的空间语义进行定位。相反，它们依赖于与视觉接近性或物体显著性之间的虚假相关性，这一现象我们称之为“指称幻觉”（Referential Hallucination）。为了解决这一问题，我们引入了EgoPoint-Bench，这是一个全面的问答基准，旨在评估和增强自我中心视角下的多模态指向推理。该基准包含超过11,000个高保真模拟和真实世界样本，涵盖五个评估维度和三个指称复杂性水平。大量实验表明，尽管最先进的专有和开源模型在自我中心指向方面表现不佳，但在我们的合成数据上微调的模型实现了显著的性能提升和稳健的模拟到真实的泛化能力。本研究强调了空间感知监督的重要性，并提供了一条可扩展的路径，以实现精确的自我中心人工智能助手。项目页面：https://guyyyug.github.io/EgoPoint-Bench/

cs.CV / 48 / 2604.21465

ID-Eraser: Proactive Defense Against Face Swapping via Identity Perturbation

ID-Eraser：通过身份扰动对面部交换的主动防御

Luo, Junyan, Yu, Peipeng, Fei, Jianwei, Zeng, Shiya, Zhou, Xiaoyu, Xia, Zhihua, Liu, Xiang

Abstract

Deepfake technologies have rapidly advanced with modern generative AI, and face swapping in particular poses serious threats to privacy and digital security. Existing proactive defenses mostly rely on pixel-level perturbations, which are ineffective against contemporary swapping models that extract robust high-level identity embeddings. We propose ID-Eraser, a feature-space proactive defense that removes identifiable facial information to prevent malicious face swapping. By injecting learnable perturbations into identity embeddings and reconstructing natural-looking protection images through a Face Revive Generator (FRG), ID-Eraser produces visually realistic results for humans while rendering the protected identities unusable for Deepfake models. Experiments show that ID-Eraser substantially disrupts identity recognition across diverse face recognition and swapping systems under strict black-box settings, achieving the lowest Top-1 accuracy (0.30) with the best FID (1.64) and LPIPS (0.020). Compared with swaps generated from clean inputs, the identity similarity of protected swaps drops sharply to an average of 0.504 across five representative face swapping models. ID-Eraser further demonstrates strong cross-dataset generalization, robustness to common distortions, and practical effectiveness on commercial APIs, reducing Tencent API similarity from 0.76 to 0.36.

Chinese Translation

深度伪造技术随着现代生成性人工智能的快速发展而迅速进步，尤其是面部交换对隐私和数字安全构成了严重威胁。现有的主动防御大多依赖于像素级扰动，这对于提取稳健的高层身份嵌入的现代交换模型效果不佳。我们提出了ID-Eraser，一种特征空间的主动防御方法，通过去除可识别的面部信息来防止恶意的面部交换。通过将可学习的扰动注入身份嵌入，并通过面部复兴生成器（Face Revive Generator, FRG）重建自然外观的保护图像，ID-Eraser为人类生成视觉上真实的结果，同时使受保护的身份对深度伪造模型不可用。实验表明，ID-Eraser在严格的黑箱设置下显著干扰了多种面部识别和交换系统的身份识别，达到了最低的Top-1准确率（0.30），同时获得了最佳的FID（1.64）和LPIPS（0.020）。与来自干净输入生成的交换相比，受保护交换的身份相似度在五个代表性的面部交换模型中急剧下降，平均为0.504。ID-Eraser进一步展示了强大的跨数据集泛化能力、对常见失真的鲁棒性，以及在商业API上的实际有效性，将腾讯API的相似度从0.76降低到0.36。

cs.CV / 49 / 2604.21478

Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts

重新思考面部伪造检测的跨领域评估：语义细粒度对齐与专家混合模型

Luo, Yuhan, Chen, Tao, Liu, Decheng

Abstract

With the rapid development of generative models, visual data forgery detection plays an increasingly important role in social and economic security. Existing face forgery detectors still cannot achieve satisfactory performance because of poor generalization across datasets. A key factor behind this phenomenon is the lack of suitable metrics: the commonly used cross-dataset AUC metric fails to reveal an important issue, namely that detection scores may shift significantly across data domains. To explicitly evaluate cross-domain score comparability, we propose Cross-AUC, an evaluation metric that computes AUC across dataset pairs by contrasting real samples from one dataset with fake samples from another (and vice versa). Evaluating representative detectors under the Cross-AUC metric reveals substantial performance drops, exposing an overlooked robustness problem. In addition, we propose a novel framework, Semantic Fine-grained Alignment and Mixture-of-Experts (SFAM), consisting of a patch-level image-text alignment module that enhances CLIP's sensitivity to manipulation artifacts, and a facial region mixture-of-experts module, which routes features from different facial regions to specialized experts for region-aware forgery analysis. Extensive qualitative and quantitative experiments on public datasets show that the proposed method achieves superior performance compared with state-of-the-art methods under various suitable metrics.

Chinese Translation

如今，随着生成模型的快速发展，视觉数据伪造检测在社会和经济安全中扮演着越来越重要的角色。现有的面部伪造检测器由于在数据集之间的泛化能力不足，仍无法达到令人满意的性能。导致这一现象的关键因素是缺乏合适的评估指标：常用的跨数据集AUC指标未能揭示一个重要问题，即检测分数在不同数据领域之间可能会显著变化。为了明确评估跨领域分数的可比性，我们提出了Cross-AUC，一种评估指标，可以通过对比来自一个数据集的真实样本与来自另一个数据集的伪造样本（反之亦然）来计算数据集对之间的AUC。有趣的是，在Cross-AUC指标下评估代表性检测器时，发现性能显著下降，暴露了一个被忽视的鲁棒性问题。此外，我们还提出了新颖的框架Semantic Fine-grained Alignment and Mixture-of-Experts（SFAM），该框架由一个增强CLIP对操控伪影敏感性的图像-文本对齐模块和一个面部区域专家混合模块组成，该模块将来自不同面部区域的特征路由到专门的专家进行区域感知的伪造分析。在公共数据集上进行的大量定性和定量实验证明，所提出的方法在各种合适的指标下相比于最先进的方法实现了优越的性能。
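
The Cross-AUC idea described in the abstract, contrasting real samples from one dataset with fake samples from another, can be sketched in a few lines. This is one possible reading of the metric, not the authors' reference implementation. The pairwise AUC below is the standard rank-based definition, and the names `auc`, `cross_auc`, and the score dictionaries are hypothetical.

```python
def auc(real_scores, fake_scores):
    """Probability that a random fake gets a higher score than a random real.

    Ties count as half a win. Higher scores are assumed to mean "more fake".
    """
    wins = 0.0
    for fake in fake_scores:
        for real in real_scores:
            if fake > real:
                wins += 1.0
            elif fake == real:
                wins += 0.5
    return wins / (len(fake_scores) * len(real_scores))


def cross_auc(scores_a, scores_b):
    """AUC for both cross pairings of two datasets' detector scores."""
    return {
        "real_A_vs_fake_B": auc(scores_a["real"], scores_b["fake"]),
        "real_B_vs_fake_A": auc(scores_b["real"], scores_a["fake"]),
    }


if __name__ == "__main__":
    # Made-up scores: the detector separates each dataset perfectly on its own,
    # but its scores on dataset B are shifted upward.
    dataset_a = {"real": [0.1, 0.2, 0.3], "fake": [0.7, 0.8, 0.9]}
    dataset_b = {"real": [0.5, 0.6, 0.75], "fake": [0.85, 0.9, 0.95]}
    print(auc(dataset_a["real"], dataset_a["fake"]))  # within-dataset AUC for A
    print(auc(dataset_b["real"], dataset_b["fake"]))  # within-dataset AUC for B
    print(cross_auc(dataset_a, dataset_b))
```

With these illustrative numbers both within-dataset AUCs are perfect, yet pairing the reals of B with the fakes of A gives a lower value, because one real sample from B scores above one fake sample from A. That is the kind of cross-domain score shift a per-dataset AUC would not reveal.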

cs.CV / 50 / 2604.21479

Frozen LLMs as Map-Aware Spatio-Temporal Reasoners for Vehicle Trajectory Prediction

冻结的大型语言模型作为地图感知的时空推理器用于车辆轨迹预测

Liu, Yanjiao, Liu, Jiawei, Gong, Xun, Nie, Zifei

Abstract

Large language models (LLMs) have recently demonstrated strong reasoning capabilities and attracted increasing research attention in the field of autonomous driving (AD). However, safe application of LLMs to AD perception and prediction still requires a thorough understanding of both the dynamic traffic agents and the static road infrastructure. To this end, this study introduces a framework to evaluate the capability of LLMs in understanding the behaviors of dynamic traffic agents and the topology of road networks. The framework leverages frozen LLMs as the reasoning engine, employing a traffic encoder to extract spatial-level scene features from observed trajectories of agents, while a lightweight Convolutional Neural Network (CNN) encodes the local high-definition (HD) maps. To assess the intrinsic reasoning ability of LLMs, the extracted scene features are then transformed into LLM-compatible tokens via a reprogramming adapter. By placing the prediction burden on the LLMs, a simpler linear decoder can be applied to output future trajectories. The framework enables a quantitative analysis of the influence of multi-modal information, especially the impact of map semantics on trajectory prediction accuracy, and allows seamless integration of frozen LLMs with minimal adaptation, thereby demonstrating strong generalizability across diverse LLM architectures and providing a unified platform for model evaluation.

Chinese Translation

大型语言模型（LLMs）最近展示了强大的推理能力，并在自动驾驶（AD）领域吸引了越来越多的研究关注。然而，LLMs在AD感知和预测中的安全应用仍然需要对动态交通代理和静态道路基础设施有透彻的理解。为此，本研究提出了一个框架，以评估LLMs理解动态交通代理行为和道路网络拓扑的能力。该框架利用冻结的LLMs作为推理引擎，采用交通编码器从观察到的代理轨迹中提取空间级场景特征，同时使用轻量级卷积神经网络（CNN）对局部高清（HD）地图进行编码。为了评估LLMs的内在推理能力，提取的场景特征随后通过重编程适配器转换为LLM兼容的标记。通过将预测负担转移给LLMs，应用一个更简单的线性解码器来输出未来轨迹。该框架使得能够定量分析多模态信息的影响，特别是地图语义对轨迹预测准确性的影响，并允许冻结的LLMs与最小适应性无缝集成，从而展示出在多种LLM架构中的强泛化能力，并提供一个统一的模型评估平台。

cs.CV / 51 / 2604.21502

VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection

VFM$^{4}$SDG：揭示视觉基础模型在单域广义目标检测中的力量

Zhang, Yupeng, Han, Ruize, Guo, Ningnan, Feng, Wei, Wang, Song, Wan, Liang

Abstract

In real-world scenarios, continual changes in weather, illumination, and imaging conditions cause significant domain shifts, leading detectors trained on a single source domain to degrade severely in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data augmentation or domain-invariant representation learning, but pay limited attention to detector mechanisms, leaving clear limitations under complex domain shifts. Through analytical experiments, we find that performance degradation is dominated by increasing missed detections, which fundamentally arises from reduced cross-domain stability of the detector: object-background and inter-instance relations become less stable in the encoding stage, while semantic-spatial alignment of query representations also becomes harder to maintain in the decoding stage. To this end, we propose VFM$^{4}$SDG, a dual-prior learning framework for SDGOD, which introduces a frozen vision foundation model (VFM) as a transferable cross-domain stability prior into detector representation learning and query modeling. In the encoding stage, we propose Cross-domain Stable Relational Prior Distillation to enhance the robustness of object-background and inter-instance relational modeling. In the decoding stage, we propose Semantic-Contextual Prior-based Query Enhancement, which injects category-level semantic prototypes and global visual context into queries to improve their semantic recognition and spatial localization stability in unseen domains. Extensive experiments show that the proposed method consistently outperforms existing SOTA methods on standard SDGOD benchmarks and two mainstream DETR-based detectors, demonstrating its effectiveness, robustness, and generality.

Chinese Translation

在现实场景中，天气、光照和成像条件的持续变化导致显著的领域转移，使得在单一源领域上训练的检测器在未见环境中严重退化。现有的单域广义目标检测（SDGOD）方法主要依赖于数据增强或领域不变表示学习，但对检测器机制关注有限，在复杂领域转移下存在明显的局限性。通过分析实验，我们发现性能退化主要由漏检增加主导，这根本上源于检测器跨域稳定性的降低：在编码阶段，目标-背景和实例间关系的稳定性降低，而在解码阶段，查询表示的语义-空间对齐也变得更难以维持。为此，我们提出了VFM$^{4}$SDG，一种用于SDGOD的双先验学习框架，它引入了一个冻结的视觉基础模型（VFM）作为可转移的跨域稳定性先验，融入检测器表示学习和查询建模中。在编码阶段，我们提出了跨域稳定关系先验蒸馏，以增强目标-背景和实例间关系建模的鲁棒性。在解码阶段，我们提出了基于语义-上下文先验的查询增强，通过将类别级语义原型和全局视觉上下文注入查询中，以提高其在未见领域中的语义识别和空间定位稳定性。大量实验表明，所提方法在标准SDGOD基准和两种主流基于DETR的检测器上始终优于现有的最先进方法，证明了其有效性、鲁棒性和通用性。

cs.CV / 52 / 2604.21519

GMD: Gaussian mixture descriptor for pair matching of 3D fragments

GMD：用于3D碎片配对的高斯混合描述符

Xiong, Meijun, Shi, Zhenguo, Zhou, Xinyu, Zhang, Yuhe, Zhang, Shunli

Abstract

In the automatic reassembly of fragments acquired using laser scanners to reconstruct objects, a crucial step is the matching of fractured surfaces. In this paper, we propose a novel local descriptor that uses the Gaussian Mixture Model (GMM) to fit the distribution of points, allowing for the description and matching of fractured surfaces of fragments. Our method involves dividing a local surface patch into concave and convex regions for estimating the k value of GMM. Then the final Gaussian Mixture Descriptor (GMD) of the fractured surface is formed by merging the regional GMDs. To measure the similarities between GMDs for determining adjacent fragments, we employ the L2 distance and align the fragments using Random Sample Consensus (RANSAC) and Iterative Closest Point (ICP). The extensive experiments on real-scanned public datasets and Terracotta datasets demonstrate the effectiveness of our approach; furthermore, the comparisons with several existing methods also validate the advantage of the proposed method.

Chinese Translation

在使用激光扫描仪自动重组碎片以重建物体的过程中，匹配破碎表面是一个关键步骤。本文提出了一种新颖的局部描述符，该描述符利用高斯混合模型（GMM）来拟合点的分布，从而实现对碎片破碎表面的描述和匹配。我们的方法涉及将局部表面片段划分为凹面和凸面，以估计GMM的k值。然后，通过合并区域GMD形成破碎表面的最终高斯混合描述符（GMD）。为了测量GMD之间的相似性以确定相邻碎片，我们采用L2距离，并使用随机样本一致性（RANSAC）和迭代最近点（ICP）对碎片进行对齐。在真实扫描的公共数据集和陶俑数据集上的大量实验验证了我们方法的有效性；此外，与几种现有方法的比较也验证了所提方法的优势。

cs.CV / 53 / 2604.21523

Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

看见并不等于相信：揭示评估者视觉-语言模型中的盲点

Khan, Mohammed Safi Ur Rahman, Suryanarayanan, Sanjay, Anand, Tushar, Khapra, Mitesh M.

Abstract

Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, both for image-to-text (I2T) tasks such as visual question answering and for text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains underexplored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for these quality-degrading errors in their evaluations. Using a comprehensive benchmark of over 4000 perturbed instances spanning 40 perturbation dimensions, we evaluate 4 prominent VLMs using single-answer scoring, pairwise comparison, and reference-guided paradigms. Our findings reveal that current VLM evaluators exhibit substantial blind spots: they often fail to detect perturbed outputs (with failure rates exceeding 50% in some cases), struggle particularly with fine-grained compositional and spatial errors, and are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, though failures persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution in their deployment for benchmarking and development decisions. Code and data have been made publicly available.

Chinese Translation

大型视觉-语言模型（VLMs）越来越多地用于评估其他模型的输出，应用于图像到文本（I2T）任务，如视觉问答，以及文本到图像（T2I）生成任务。尽管对这些评估者 VLMs 的依赖日益增加，但其可靠性仍然未得到充分探讨。在本研究中，我们系统地评估了评估者 VLMs 在 I2T 和 T2I 任务中的可靠性。我们引入了有针对性的扰动，这些扰动在关键错误维度上降低输出质量，包括物体幻觉、空间推理、事实基础和视觉保真度。这些扰动测试评估者 VLMs 是否能够可靠地考虑这些降低质量的错误。在超过 4000 个扰动实例的综合基准测试中，涵盖 40 个扰动维度，我们使用单一答案评分、成对比较和参考引导范式评估了 4 个显著的 VLMs。我们的研究结果揭示了当前 VLM 评估者存在显著的盲点：它们往往无法检测到扰动输出——在某些情况下超过 50%，尤其在细粒度的组合和空间错误方面表现不佳，并且通常对与输入图像相矛盾的幻觉内容缺乏敏感性。成对比较证明更为可靠，尽管失败率仍然存在。这些结果突显了当前评估者 VLMs 的不可靠性，并在其用于基准测试和开发决策时提出了谨慎的建议。代码和数据已公开发布。

cs.CV / 54 / 2604.21530

Attention-based multiple instance learning for predominant growth pattern prediction in lung adenocarcinoma WSI using foundation models

基于注意力的多实例学习在肺腺癌全幻灯片图像中对主要生长模式预测的应用

Perez-Herrera, Laura Valeria, Garcia-Gonzalez, M. J., Lopez-Linares, Karen

Abstract

Lung adenocarcinoma (LUAD) grading depends on accurately identifying growth patterns, which are indicators of prognosis and can influence treatment decisions. Common deep learning approaches to determine the predominant pattern rely on patch-level classification or segmentation, requiring extensive annotations. This study proposes an attention-based multiple instance learning (ABMIL) framework to predict the predominant LUAD growth pattern at the whole slide level to reduce annotation burden. Our approach integrates pretrained pathology foundation models as patch encoders, used either frozen or fine-tuned on annotated patches, to extract discriminative features that are aggregated through attention mechanisms. Experiments show that fine-tuned encoders improve performance, with Prov-GigaPath achieving the highest agreement (κ = 0.699) under ABMIL. Compared to simple patch-aggregation baselines, ABMIL yields more robust predictions by leveraging slide-level supervision and spatial attention. Future work will extend this framework to estimate the full distribution of growth patterns and validate performance on external cohorts.

Chinese Translation

肺腺癌（LUAD）的分级依赖于准确识别生长模式，这些模式是预后指标并可能影响治疗决策。常见的深度学习方法通过补丁级分类或分割来确定主要模式，这需要大量的注释。本文提出了一种基于注意力的多实例学习（ABMIL）框架，以在全幻灯片级别预测主要的LUAD生长模式，从而减少注释负担。我们的方法集成了预训练的病理基础模型作为补丁编码器，这些编码器可以在注释补丁上冻结或微调，以提取通过注意力机制聚合的判别特征。实验表明，微调的编码器提高了性能，其中Prov-GigaPath在ABMIL下达到了最高一致性（κ = 0.699）。与简单的补丁聚合基线相比，ABMIL通过利用幻灯片级监督和空间注意力产生了更稳健的预测。未来的工作将扩展该框架以估计生长模式的完整分布，并在外部队列上验证性能。
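
The attention-based aggregation step mentioned in the abstract can be illustrated with the commonly used attention-MIL pooling form, in which each patch embedding h_k receives a weight a_k = softmax_k(w^T tanh(V h_k)) and the slide embedding is the weighted sum of patch embeddings. The sketch below is a plain-Python illustration under that assumption. The parameter names `V` and `w` and the toy features are hypothetical, and the paper's exact architecture may differ.

```python
import math


def attention_pool(instances, V, w):
    """Aggregate patch features into one bag feature with attention weights.

    instances: list of feature vectors (one per patch)
    V: projection matrix as a list of rows (hidden_dim x feature_dim)
    w: attention vector of length hidden_dim
    Returns (bag_feature, attention_weights).
    """
    logits = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        logits.append(sum(wi * zi for wi, zi in zip(w, hidden)))
    # Numerically stable softmax over the patches of one slide
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(weights, instances)) for d in range(dim)]
    return bag, weights


if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy patch embeddings
    V = [[0.5, -0.5], [0.2, 0.1]]
    bag, weights = attention_pool(patches, V, w=[3.0, 0.0])
    print([round(a, 3) for a in weights], [round(b, 3) for b in bag])
```

In a trained model `V` and `w` would be learned from slide-level labels, and the resulting weights indicate which patches contribute most to the slide-level prediction. With `w` set to all zeros the pooling reduces to a plain mean over patches, which corresponds to the simplest kind of patch-aggregation baseline.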

cs.CV / 55 / 2604.21546

Component-Based Out-of-Distribution Detection

基于组件的分布外检测

Liu, Wenrui, Chang, Hong, Hou, Ruibing, Shan, Shiguang, Chen, Xilin

Abstract

Out-of-Distribution (OOD) detection requires sensitivity to subtle shifts without overreacting to natural In-Distribution (ID) diversity. However, from the viewpoint of detection granularity, global representations inevitably suppress local OOD cues, while patch-based methods are unstable due to entangled spurious correlations and noise. Moreover, neither is effective at detecting compositional OODs composed of valid ID components. Inspired by recognition-by-components theory, we present a training-free Component-Based OOD Detection (CoOD) framework that addresses these limitations by decomposing inputs into functional components. To instantiate CoOD, we derive a Component Shift Score (CSS) to detect local appearance shifts, and a Compositional Consistency Score (CCS) to identify cross-component compositional inconsistencies. Empirically, CoOD achieves consistent improvements on both coarse- and fine-grained OOD detection.

Chinese Translation

分布外（OOD）检测需要对微妙的变化保持敏感，同时不对自然的分布内（ID）多样性过度反应。然而，从检测粒度的角度来看，全局表示不可避免地抑制了局部OOD线索，而基于补丁的方法由于纠缠的虚假相关性和噪声而不稳定。此外，这两者在检测由有效ID组件组成的组合OOD时都不够有效。受到组件识别理论的启发，我们提出了一种无训练的基于组件的OOD检测（CoOD）框架，通过将输入分解为功能组件来解决现有的局限性。为了实现CoOD，我们推导了组件偏移分数（CSS）以检测局部外观变化，以及组合一致性分数（CCS）以识别跨组件的组合不一致性。实证结果表明，CoOD在粗粒度和细粒度的OOD检测上均取得了一致的改进。

cs.CV / 56 / 2604.21572

Deep kernel video approximation for unsupervised action segmentation

基于深度核的视频近似用于无监督动作分割

Pintea, Silvia L., Dijkstra, Jouke

Abstract

This work focuses on per-video unsupervised action segmentation, which is of interest to applications where storing large datasets is either not possible or not permitted. We propose to segment videos by learning in deep kernel space, approximating the underlying frame distribution as closely as possible. To define this closeness between the original video distribution and its approximation, we rely on maximum mean discrepancy (MMD), which is a geometry-preserving metric in distribution space and thus gives more reliable estimates. Moreover, unlike the commonly used optimal transport metric, MMD is both easier to optimize and faster. We use neural tangent kernels (NTKs) to define the kernel space in which MMD operates, both because of their improved descriptive power compared to fixed kernels and because NTKs sidestep the trivial solution when jointly learning the inputs (video approximation) and the kernel function. Finally, we show competitive results compared to state-of-the-art per-video methods on six standard benchmarks. Additionally, our method achieves higher F1 scores than prior agglomerative work when the number of segments is unknown.

Chinese Translation

本研究聚焦于每个视频的无监督动作分割，这在存储大型数据集不可能或不被允许的应用中具有重要意义。我们提出通过在深度核空间中学习来对视频进行分割，以尽可能接近底层帧分布。为了定义原始视频分布与其近似之间的接近度度量，我们依赖于最大均值差异（Maximum Mean Discrepancy, MMD），这是一种在分布空间中保持几何特性的度量，因此提供了更可靠的估计。此外，与常用的最优传输度量不同，MMD更易于优化且速度更快。我们选择使用神经切线核（Neural Tangent Kernels, NTKs）来定义MMD操作的核空间，因为与固定核相比，NTK具有更强的描述能力。同时，由于NTK在联合学习输入（视频近似）和核函数时避免了平凡解，因而更具优势。最后，我们在六个标准基准上与最先进的每视频方法相比，展示了具有竞争力的结果。此外，当分段数量未知时，我们的方法比先前的聚合工作具有更高的F1分数。
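
To make the MMD criterion concrete, the following sketch computes a biased estimate of the squared MMD between two small sets of feature vectors. A fixed Gaussian (RBF) kernel is used here purely for illustration; the abstract states that the authors use neural tangent kernels instead, which are not reproduced in this sketch. The names `rbf`, `mmd_squared`, and `bandwidth` are hypothetical.

```python
import math


def rbf(x, y, bandwidth):
    """Gaussian kernel between two feature vectors (a fixed kernel, not an NTK)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared maximum mean discrepancy.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    """
    def mean_kernel(a, b):
        return sum(rbf(p, q, bandwidth) for p in a for q in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    frames = [(0.0,), (0.1,)]            # toy "frame features"
    close_summary = [(0.05,), (0.15,)]   # approximation near the frames
    far_summary = [(5.0,), (5.1,)]       # approximation far from the frames
    print(mmd_squared(frames, close_summary))
    print(mmd_squared(frames, far_summary))
```

A small value indicates that the approximation matches the frame distribution under the chosen kernel, so minimizing this quantity over the approximation is the kind of objective the abstract refers to.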

cs.CV / 57 / 2604.21573

CHRep: Cross-modal Histology Representation and Post-hoc Calibration for Spatial Gene Expression Prediction

CHRep：用于空间基因表达预测的跨模态组织学表示与后期校准

Wang, Changfan, Wang, Xinran, Liu, Donghai, Su, Fei, Sun, Lulu, Zhao, Zhicheng, Meng, Zhu

Abstract

Spatial transcriptomics (ST) enables spatially resolved gene profiling but remains expensive and low-throughput, limiting large-cohort studies and routine clinical use. Predicting spatial gene expression from routine hematoxylin and eosin (H&E) slides is a promising alternative, yet under realistic leave-one-slide-out evaluation, existing models often suffer from slide-level appearance shifts and regression-driven over-smoothing that suppress biologically meaningful variation. CHRep is a two-phase framework for robust histology-to-expression prediction. In the training phase, CHRep learns a structure-aware representation by jointly optimizing correlation-aware regression, symmetric image-expression alignment, and coordinate-induced spatial topology regularization. In the inference phase, cross-slide robustness is improved without backbone fine-tuning through a lightweight calibration module trained on the training slides, which combines a non-parametric estimate from a training gallery with a magnitude-regularized correction module. Unlike prior embedding-alignment or retrieval-based transfer methods that rely on a single prediction route, CHRep couples topology-preserving representation learning with post-hoc calibration, enabling stable neighborhood retrieval and controlled bias correction under slide-level shifts. Across the three cohorts, CHRep consistently improves gene-wise correlation under leave-one-slide-out evaluation, with the largest gains observed on Alex+10x. Relative to HAGE, the Pearson correlation coefficient on all considered genes [PCC(ACG)] increases by 4.0% on cSCC and 9.8% on HER2+. Relative to mclSTExp, PCC(ACG) further improves by 39.5% on Alex+10x, together with 9.7% and 9.0% reductions in mean squared error (MSE) and mean absolute error (MAE), respectively.
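
The gene-wise correlation metric reported in this abstract can be sketched as follows. This is a minimal, assumption-laden example: the function names, the spots-by-genes list layout, the toy numbers, and the convention of returning 0 for a constant column are all choices made here, not details taken from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # convention chosen here for constant columns
    return cov / math.sqrt(var_x * var_y)

def mean_genewise_pcc(predicted, measured):
    """Average per-gene PCC; rows are spots, columns are genes."""
    n_genes = len(predicted[0])
    scores = []
    for g in range(n_genes):
        pred_col = [row[g] for row in predicted]
        true_col = [row[g] for row in measured]
        scores.append(pearson(pred_col, true_col))
    return sum(scores) / n_genes

# Three spots, two genes. The second predicted gene is over-smoothed to a constant.
measured = [[1.0, 5.0], [2.0, 3.0], [3.0, 1.0]]
over_smoothed = [[1.1, 4.0], [2.0, 4.0], [2.9, 4.0]]
score = mean_genewise_pcc(over_smoothed, measured)
```

The over-smoothed second gene contributes a correlation of zero even though its mean is plausible, which is one way to see why regression-driven over-smoothing hurts correlation-based evaluation.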

Chinese Translation

空间转录组学（ST）能够实现空间分辨的基因分析，但仍然成本高昂且通量低，限制了大规模研究和常规临床应用。从常规的苏木精-伊红（H&E）切片预测空间基因表达是一种有前景的替代方案，然而在现实的留一切片评估中，现有模型往往受到切片级外观变化和回归驱动的过度平滑的影响，这抑制了生物学上有意义的变异。CHRep是一个用于稳健组织学到表达预测的两阶段框架。在训练阶段，CHRep通过联合优化相关性感知回归、对称图像-表达对齐和坐标诱导的空间拓扑正则化，学习结构感知表示。在推理阶段，通过一个在训练切片上训练的轻量级校准模块，CHRep在不进行主干微调的情况下提高了跨切片的稳健性，该模块结合了来自训练图库的非参数估计和幅度正则化的修正模块。与依赖单一预测路径的先前嵌入对齐或检索基础转移方法不同，CHRep将保持拓扑的表示学习与后期校准相结合，使得在切片级变化下能够实现稳定的邻域检索和控制偏差校正。在三个队列中，CHRep在留一切片评估下始终提高了基因级相关性，在Alex+10x上观察到最大的增益。相对于HAGE，所有考虑基因的皮尔逊相关系数[PCC(ACG)]在cSCC上增加了4.0%，在HER2+上增加了9.8%。相对于mclSTExp，PCC(ACG)在Alex+10x上进一步提高了39.5%，同时均方误差（MSE）和平均绝对误差（MAE）分别减少了9.7%和9.0%。

cs.CV / 58 / 2604.21575

OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction

OmniFit：通过尺度无关的密集标志点预测进行多模态3D身体拟合

Cai, Zeyu, Xiu, Yuliang, Wang, Renke, Shao, Zhijing, Li, Xiaoben, Yu, Siyuan, Xu, Chao, Liu, Yang, Sun, Baigui, Yang, Jian, Zhang, Zhenyu

Abstract

Fitting an underlying body model to 3D clothed human assets has been extensively studied, yet most approaches focus on either single-modal inputs such as point clouds or multi-view images alone, often requiring a known metric scale. This constraint is frequently impractical, especially for AI-generated assets where scale distortion is common. We propose OmniFit, a method that can seamlessly handle diverse multi-modal inputs, including full scans, partial depth observations, and image captures, while remaining scale-agnostic for both real and synthetic assets. Our key innovation is a simple yet effective conditional transformer decoder that directly maps surface points to dense body landmarks, which are then used for SMPL-X parameter fitting. In addition, an optional plug-and-play image adapter incorporates visual cues to compensate for missing geometric information. We further introduce a dedicated scale predictor that rescales subjects to canonical body proportions. OmniFit substantially outperforms state-of-the-art methods by 57.1 to 80.9 percent across daily and loose clothing scenarios. To the best of our knowledge, it is the first body fitting method to surpass multi-view optimization baselines and the first to achieve millimeter-level accuracy on the CAPE and 4D-DRESS benchmarks.

Chinese Translation

将基础身体模型拟合到3D穿衣人类资产的研究已经得到了广泛关注，但大多数方法仅关注单一模态输入，如点云或多视角图像，通常需要已知的度量尺度。这一限制在实际应用中常常不切实际，尤其是在AI生成的资产中，尺度失真现象普遍存在。我们提出了OmniFit，一种能够无缝处理多样化多模态输入的方法，包括完整扫描、部分深度观测和图像捕捉，同时对真实和合成资产保持尺度无关。我们的关键创新是一个简单而有效的条件变换器解码器，能够直接将表面点映射到密集的身体标志点，这些标志点随后用于SMPL-X参数拟合。此外，一个可选的即插即用图像适配器结合了视觉线索，以补偿缺失的几何信息。我们进一步引入了一个专用的尺度预测器，将对象重新缩放到标准身体比例。OmniFit在日常和宽松服装场景中，显著超越了最先进的方法，提升幅度在57.1%到80.9%之间。据我们所知，它是首个超越多视角优化基线的身体拟合方法，也是首个在CAPE和4D-DRESS基准上实现毫米级精度的方法。

cs.CV / 59 / 2604.21592

Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers

Sculpt4D：通过稀疏注意力扩散变换器生成4D形状

Yin, Minghao, Hu, Wenbo, Xu, Jiale, Shan, Ying, Han, Kai

Abstract

Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet high-fidelity dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demand. We present Sculpt4D, a native 4D generative framework that integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask. This design models complex spatiotemporal dependencies with high fidelity while sidestepping the quadratic overhead of full attention, reducing the network's total computation by 56%. Consequently, Sculpt4D establishes a new state of the art in temporally coherent 4D synthesis and charts a path toward efficient and scalable 4D generation.
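
The abstract does not spell out the layout of the sparse mask, so the following is only a guess at its general shape. In this sketch every frame may attend to the initial frame (the identity anchor) and to frames within a small temporal window, with a hard cutoff used as a crude stand-in for time decay. The `window` parameter and both function names are hypothetical.

```python
def block_sparse_frame_mask(num_frames, window=1):
    """Frame-level mask: mask[i][j] is True if frame i may attend to frame j."""
    mask = []
    for i in range(num_frames):
        row = []
        for j in range(num_frames):
            is_anchor = (j == 0)              # every frame sees the initial frame
            is_nearby = abs(i - j) <= window  # hard cutoff approximating time decay
            row.append(is_anchor or is_nearby)
        mask.append(row)
    return mask

def mask_density(mask):
    """Fraction of frame-to-frame blocks that are kept."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return kept / total

mask = block_sparse_frame_mask(num_frames=8, window=1)
density = mask_density(mask)
```

With 8 frames and a window of 1 this keeps 28 of 64 blocks. The figure only shows how sparsity arises in the toy mask and is unrelated to the 56% reduction reported by the authors.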

Chinese Translation

近期在3D生成建模方面的突破在静态形状合成中取得了显著进展，但高保真动态4D生成仍然难以实现，受到时间伪影和高计算需求的限制。我们提出了Sculpt4D，这是一种原生的4D生成框架，能够将高效的时间建模无缝集成到预训练的3D扩散变换器（Hunyuan3D 2.1）中，从而缓解4D训练数据的稀缺性。其核心是一个块稀疏注意力机制，通过锚定初始帧来保持物体身份，同时通过时间衰减的稀疏掩码捕捉丰富的运动动态。该设计忠实地建模复杂的时空依赖关系，具有高保真度，同时避免了全注意力的平方开销，并将网络的总计算量减少了56%。因此，Sculpt4D在时间一致的4D合成中建立了新的最先进水平，并为高效和可扩展的4D生成开辟了道路。

cs.CV / 60 / 2604.21617

Local Neighborhood Instability in Parametric Projections: Quantitative and Visual Analysis

参数投影中的局部邻域不稳定性：定量与视觉分析

Dennig, Frederik L., Keim, Daniel A.

Abstract

Parametric projections let analysts embed new points in real time, but input variations from measurement noise or data drift can produce unpredictable shifts in the 2D layout. Whether and where a projection is locally stable remains largely unexamined. In this paper, we present a stability evaluation framework that probes parametric projections with Gaussian perturbations around selected anchor points and assesses how neighborhoods deform in the 2D embedding. Our approach combines quantitative measures of mean displacement, bias, and nearest-anchor assignment error with per-anchor visualizations of displacement vectors, local PCA ellipsoids, and Voronoi misassignment for detailed inspection. We demonstrate the framework's effectiveness on UMAP- and t-SNE-based neural projectors of varying network sizes and study the effect of Jacobian regularization as a gradient-based robustness strategy. We apply our framework to the MNIST and Fashion-MNIST datasets. The results show that our framework identifies unstable projection regions invisible to reconstruction error or neighborhood-preservation metrics.
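
The probing procedure in this abstract lends itself to a short sketch: perturb each anchor with Gaussian noise, project the noisy copies, and summarise mean displacement, bias, and nearest-anchor misassignment. The projector below is a toy function rather than a trained UMAP or t-SNE network, and the names `probe_stability` and `toy_projector` as well as the `sigma` and `samples` defaults are assumptions for illustration.

```python
import math
import random

def probe_stability(project, anchors, sigma=0.05, samples=200, seed=0):
    """Perturb each anchor with Gaussian noise and summarise how its 2D image moves."""
    rng = random.Random(seed)
    anchor_2d = [project(a) for a in anchors]
    report = []
    for idx, anchor in enumerate(anchors):
        dx_sum = dy_sum = dist_sum = 0.0
        misassigned = 0
        for _ in range(samples):
            noisy = [v + rng.gauss(0.0, sigma) for v in anchor]
            px, py = project(noisy)
            dx, dy = px - anchor_2d[idx][0], py - anchor_2d[idx][1]
            dx_sum += dx
            dy_sum += dy
            dist_sum += math.hypot(dx, dy)
            # Which projected anchor is closest to the perturbed point?
            nearest = min(range(len(anchors)),
                          key=lambda k: math.hypot(px - anchor_2d[k][0],
                                                   py - anchor_2d[k][1]))
            misassigned += (nearest != idx)
        report.append({
            "mean_displacement": dist_sum / samples,
            "bias": math.hypot(dx_sum / samples, dy_sum / samples),
            "misassignment_rate": misassigned / samples,
        })
    return report

def toy_projector(x):
    """Stand-in for a parametric projector: steep around x[1] = 0, flat elsewhere."""
    return (x[0], math.tanh(20.0 * x[1]))

anchors = [[0.0, 0.0], [1.0, 1.0]]
report = probe_stability(toy_projector, anchors)
```

The first anchor sits in the steep region of the toy projector, so its mean displacement comes out much larger than that of the second anchor even though both receive the same input noise.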

Chinese Translation

参数投影使分析师能够实时嵌入新点，但来自测量噪声或数据漂移的输入变化可能导致二维布局中不可预测的偏移。投影是否以及在何处是局部稳定的仍然很大程度上未被研究。本文提出了一种稳定性评估框架，该框架通过对选定锚点周围施加高斯扰动来探测参数投影，并评估邻域在二维嵌入中的变形。我们的方法结合了均值位移、偏差和最近锚点分配误差的定量度量，以及每个锚点的位移向量、局部主成分分析（PCA）椭球体和沃罗诺伊误分配的可视化，以便进行详细检查。我们在不同网络规模的基于UMAP和t-SNE的神经投影器上展示了该框架的有效性，并研究了雅可比正则化作为基于梯度的鲁棒性策略的影响。我们将框架应用于MNIST和Fashion-MNIST数据集。结果表明，我们的框架能够识别重建误差或邻域保持度量所无法察觉的不稳定投影区域。

cs.CV / 61 / 2604.21627

DCMorph: Face Morphing via Dual-Stream Cross-Attention Diffusion

DCMorph：通过双流交叉注意力扩散进行人脸变形

Chettaoui, Tahar, Caldeira, Eduarda, Ozgur, Guray, Ramachandra, Raghavendra, Boutros, Fadi, Damer, Naser

Abstract

Advancing face morphing attack techniques is crucial to anticipate evolving threats and develop robust defensive mechanisms for identity verification systems. This work introduces DCMorph, a dual-stream diffusion-based morphing framework that simultaneously operates at both identity conditioning and latent space levels. Unlike image-level methods suffering from blending artifacts or GAN-based approaches with limited reconstruction fidelity, DCMorph leverages identity-conditioned latent diffusion models through two mechanisms: (1) decoupled cross-attention interpolation that injects identity-specific features from both source faces into the denoising process, enabling explicit dual-identity conditioning absent in existing diffusion-based methods, and (2) DDIM inversion with spherical interpolation between inverted latent representations from both source faces, providing geometrically consistent initial latent representation that preserves structural attributes. Vulnerability analyses across four state-of-the-art face recognition systems demonstrate that DCMorph achieves the highest attack success rates compared to existing methods at both operational thresholds, while remaining challenging to detect by current morphing attack detection solutions.
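
The spherical interpolation step mentioned in mechanism (2) can be illustrated generically. The sketch below is the textbook slerp formula applied to plain Python lists. It says nothing about how DCMorph obtains or uses its inverted latents, and the two-dimensional vectors are placeholders.

```python
import math

def slerp(z0, z1, t):
    """Spherical linear interpolation between two latent vectors."""
    norm0 = math.sqrt(sum(a * a for a in z0))
    norm1 = math.sqrt(sum(b * b for b in z1))
    cos_omega = sum(a * b for a, b in zip(z0, z1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < 1e-6:
        # Nearly (anti)parallel vectors: fall back to linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]

latent_a = [1.0, 0.0]
latent_b = [0.0, 1.0]
midpoint = slerp(latent_a, latent_b, 0.5)
midpoint_norm = math.sqrt(sum(v * v for v in midpoint))
```

For unit-norm inputs the interpolated vector keeps unit norm, whereas a straight linear blend of the same two vectors would shrink toward the origin. Preserving the norm is the usual motivation for slerp on diffusion latents.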

Chinese Translation

推进人脸变形攻击技术对于预见不断演变的威胁并为身份验证系统开发稳健的防御机制至关重要。本研究提出了DCMorph，一种基于双流扩散的变形框架，它同时在身份条件和潜在空间层面上运行。与遭受混合伪影的图像级方法或重建保真度有限的基于生成对抗网络（GAN）的方法不同，DCMorph通过两种机制利用身份条件的潜在扩散模型：(1) 解耦的交叉注意力插值，将来自两个源面孔的身份特征注入去噪过程，使得在现有的基于扩散的方法中缺乏的显式双身份条件成为可能；(2) 通过球面插值对来自两个源面孔的反转潜在表示进行DDIM反演，提供几何一致的初始潜在表示，从而保留结构属性。对四个最先进的人脸识别系统的脆弱性分析表明，DCMorph在两个操作阈值下相比现有方法实现了最高的攻击成功率，同时仍然难以被当前的变形攻击检测解决方案检测到。

cs.CV / 62 / 2604.21631

DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures

DualSplat：通过重建失败的伪掩码引导实现稳健的3D高斯点云渲染

Wang, Xu, Wang, Zhiru, Xie, Shiyun, Pan, Chengwei, Chen, Yisong

Abstract

While 3D Gaussian Splatting (3DGS) achieves real-time photorealistic rendering, its performance degrades significantly when training images contain transient objects that violate multi-view consistency. Existing methods face a circular dependency: accurate transient detection requires a well-reconstructed static scene, while clean reconstruction itself depends on reliable transient masks. We address this challenge with DualSplat, a Failure-to-Prior framework that converts first-pass reconstruction failures into explicit priors for a second reconstruction stage. We observe that transients, which appear in only a subset of views, often manifest as incomplete fragments during conservative initial training. We exploit these failures to construct object-level pseudo-masks by combining photometric residuals, feature mismatches, and SAM2 instance boundaries. These pseudo-masks then guide a clean second-pass 3DGS optimization, while a lightweight MLP refines them online by gradually shifting from prior supervision to self-consistency. Experiments on RobustNeRF and NeRF On-the-go show that DualSplat outperforms existing baselines, demonstrating particularly clear advantages in transient-heavy scenes and transient regions.

Chinese Translation

尽管3D高斯点云渲染（3DGS）能够实现实时的照片级真实感渲染，但当训练图像中包含违反多视图一致性的瞬态物体时，其性能会显著下降。现有方法面临循环依赖的问题：准确的瞬态检测需要良好重建的静态场景，而干净的重建又依赖于可靠的瞬态掩码。我们通过DualSplat提出了一种Failure-to-Prior框架，将第一次重建失败转化为第二次重建阶段的显式先验。我们观察到，瞬态物体仅在部分视图中出现，通常在保守的初始训练中表现为不完整的片段。我们利用这些失败，通过结合光度残差、特征不匹配和SAM2实例边界构建对象级伪掩码。这些伪掩码随后引导干净的第二次3DGS优化，同时一个轻量级的多层感知器（MLP）在线精炼这些伪掩码，逐渐从先验监督转向自我一致性。在RobustNeRF和NeRF On-the-go上的实验表明，DualSplat优于现有基线，尤其在瞬态物体密集场景和瞬态区域表现出明显优势。

cs.CV / 63 / 2604.21654

Causal Disentanglement for Full-Reference Image Quality Assessment

基于因果解耦的全参考图像质量评估

Zhang, Zhen, Chu, Jielei, Zhang, Tian, Liu, Weide, Lv, Fengmao, Li, Tianrui, Cheng, Jun, Fang, Yuming

Abstract

Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images. Second, inspired by the human visual masking effect, we design a masking module to model the causal relationship between image content and degradation features, thereby extracting content-influenced degradation features from distorted images. Finally, quality scores are predicted from these degradation features using either supervised regression or label-free dimensionality reduction. Extensive experiments demonstrate that our method achieves highly competitive performance on standard IQA benchmarks across fully supervised, few-label, and label-free settings. Furthermore, we evaluate the approach on diverse non-standard natural image domains with scarce data, including underwater, radiographic, medical, neutron, and screen-content images. Benefiting from its ability to perform scenario-specific training and prediction without labeled IQA data, our method exhibits superior cross-domain generalization compared to existing training-free FR-IQA models.

Chinese Translation

现有基于深度网络的全参考图像质量评估（FR-IQA）模型通常通过对参考图像和失真图像的深度特征进行成对比较来工作。本文从不同的角度出发，提出了一种基于因果推断和解耦表示学习的新型FR-IQA范式。与典型的基于特征比较的FR-IQA模型不同，我们的方法将退化估计公式化为一个由潜在表示的干预引导的因果解耦过程。我们首先通过利用参考图像和失真图像之间的内容不变性来解耦退化和内容表示。其次，受到人类视觉遮蔽效应的启发，我们设计了一个遮蔽模块，以建模图像内容与退化特征之间的因果关系，从而从失真图像中提取受内容影响的退化特征。最后，使用监督回归或无标签降维的方法，从这些退化特征中预测质量分数。大量实验表明，我们的方法在全监督、少标签和无标签设置下，在标准图像质量评估基准上表现出高度竞争力的性能。此外，我们还在数据稀缺的多样化非标准自然图像领域（包括水下、放射、医学、中子和屏幕内容图像）上评估了该方法。得益于其在没有标记的图像质量评估数据的情况下进行特定场景训练和预测的能力，我们的方法在跨领域泛化方面优于现有的无训练FR-IQA模型。

cs.CV / 64 / 2604.21668

Encoder-Free Human Motion Understanding via Structured Motion Descriptions

无编码器的人体动作理解通过结构化运动描述

Zhang, Yao, Liu, Zhuchenyang, Ploetz, Thomas, Xiao, Yu

Abstract

The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM's embedding space, remaining constrained by cross-modal representation and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics have long served as a precise descriptive language for human movement, we propose Structured Motion Description (SMD), a rule-based, deterministic approach that converts joint position sequences into structured natural language descriptions of joint angles, body part movements, and global trajectory. By representing motion as text, SMD enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules. We show that this approach goes beyond state-of-the-art results on both motion question answering (66.7% on BABEL-QA, 90.1% on HuMMan-QA) and motion captioning (R@1 of 0.584, CIDEr of 53.16 on HumanML3D), surpassing all prior methods. SMD additionally offers practical benefits: the same text input works across different LLMs with only lightweight LoRA adaptation (validated on 8 LLMs from 6 model families), and its human-readable representation enables interpretable attention analysis over motion descriptions. Code, data, and pretrained LoRA adapters are available at https://yaozhang182.github.io/motion-smd/.
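
To make the idea of rule-based motion-to-text conversion concrete, here is a small sketch that computes an elbow angle from three joint positions and emits one descriptive sentence. The joint choice, the 10-degree threshold, the sentence template, and the function names are all invented for this example. The actual SMD rules are those in the authors' released code.

```python
import math

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) formed by the segments b->a and b->c."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))

def describe_elbow(frames):
    """Turn per-frame (shoulder, elbow, wrist) positions into one sentence."""
    start = joint_angle(*frames[0])
    end = joint_angle(*frames[-1])
    if end < start - 10.0:      # hypothetical threshold in degrees
        verb = "flexes"
    elif end > start + 10.0:
        verb = "extends"
    else:
        verb = "holds steady"
    return f"Right elbow {verb} from {start:.0f} to {end:.0f} degrees."

# Two frames: a straight arm, then the forearm raised to a right angle.
straight = ((0.0, 0.0, 0.0), (0.0, -0.3, 0.0), (0.0, -0.6, 0.0))
bent = ((0.0, 0.0, 0.0), (0.0, -0.3, 0.0), (0.3, -0.3, 0.0))
sentence = describe_elbow([straight, bent])
```

Because the output is ordinary text, it can be placed directly in an LLM prompt, which is the property the abstract highlights.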

Chinese Translation

基于文本的大型语言模型（LLMs）的世界知识和推理能力正在迅速发展，但当前的人体动作理解方法，包括动作问答和描述，尚未充分利用这些能力。现有的基于LLM的方法通常通过专门的编码器学习动作与语言的对齐，这些编码器将动作特征投影到LLM的嵌入空间，仍然受到跨模态表示和对齐的限制。受到生物力学分析的启发，关节角度和身体部位运动学长期以来作为描述人类运动的精确语言，我们提出了结构化运动描述（SMD），这是一种基于规则的确定性方法，将关节位置序列转换为关节角度、身体部位运动和全局轨迹的结构化自然语言描述。通过将动作表示为文本，SMD使LLM能够直接应用其预训练的身体部位、空间方向和运动语义知识进行动作推理，而无需学习的编码器或对齐模块。我们展示了这种方法在动作问答（BABEL-QA上为66.7%，HuMMan-QA上为90.1%）和动作描述（HumanML3D上R@1为0.584，CIDEr为53.16）方面超越了最先进的结果，超越了所有先前的方法。SMD还提供了实际的好处：相同的文本输入可以在不同的LLM上工作，仅需轻量级的LoRA适配（在6个模型系列的8个LLM上进行了验证），其人类可读的表示使得对运动描述的可解释注意力分析成为可能。代码、数据和预训练的LoRA适配器可在https://yaozhang182.github.io/motion-smd/获取。

cs.CV / 65 / 2604.21681

Sapiens2

Khirodkar, Rawal, Wen, He, Martinez, Julieta, Dong, Yuan, Zhaoen, Su, Saito, Shunsuke

Abstract

We present Sapiens2, a model family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Our model sizes range from 0.4 to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 substantially improves over its predecessor in both pretraining and post-training. First, to learn features that capture low-level details (for dense prediction) and high-level semantics (for zero-shot or few-label settings), we combine masked image reconstruction with self-distilled contrastive objectives. Our evaluations show that this unified pretraining objective is better suited for a wider range of downstream tasks. Second, along the data axis, we pretrain on a curated dataset of 1 billion high-quality human images and improve the quality and quantity of task annotations. Third, architecturally, we incorporate advances from frontier models that enable longer training schedules with improved stability. Our 4K models adopt windowed attention to reason over longer spatial context and are pretrained with 2K output resolution. Sapiens2 sets a new state-of-the-art and improves over the first generation on pose (+4 mAP), body-part segmentation (+24.3 mIoU), normal estimation (45.6% lower angular error) and extends to new tasks such as pointmap and albedo estimation. Code: https://github.com/facebookresearch/sapiens2

Chinese Translation

我们提出了 Sapiens2，一个高分辨率变换器模型家族，专注于以人为中心的视觉任务，强调泛化能力、多样性和高保真输出。我们的模型规模从 4 亿到 50 亿参数不等，具有原生 1K 分辨率和支持 4K 的分层变体。Sapiens2 在预训练和后训练方面均显著优于其前身。首先，为了学习能够捕捉低级细节（用于密集预测）和高级语义（用于零样本或少标签设置）的特征，我们结合了掩蔽图像重建和自蒸馏对比目标。我们的评估表明，这一统一的预训练目标更适合更广泛的下游任务。其次，在数据方面，我们在一个精心策划的包含 10 亿张高质量人像的数据库上进行预训练，并提高了任务注释的质量和数量。第三，在架构上，我们结合了前沿模型的进展，使得更长的训练周期具有更好的稳定性。我们的 4K 模型采用窗口注意力机制，以便在更长的空间上下文中进行推理，并以 2K 输出分辨率进行预训练。Sapiens2 创造了新的最先进水平，并在姿态 (+4 mAP)、身体部位分割 (+24.3 mIoU)、法线估计（降低 45.6% 的角度误差）等方面超越了第一代，并扩展到点图和反照率估计等新任务。代码： https://github.com/facebookresearch/sapiens2

cs.CV / 66 / 2604.21686

WorldMark: A Unified Benchmark Suite for Interactive Video World Models

WorldMark：一个统一的交互视频世界模型基准套件

Xu, Xiaojie, Lin, Zhengyuan, He, Kang, Feng, Yukang, Mao, Xiaofeng, Yin, Yuanyang, Zhang, Kaipeng, Ge, Yongtao

Abstract

Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions -- identical scenes, identical action sequences, and a unified control interface -- needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20-60s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (warena.ai), an online platform where anyone can pit leading world models against each other in side-by-side battles and watch the live leaderboard.
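
The unified action-mapping layer in contribution (1) might look roughly like the sketch below, which translates a shared WASD token sequence into two different control formats. Both target formats and the names `model_a` and `model_b` are hypothetical. They do not describe the real interfaces of any model named in the abstract.

```python
# Shared action vocabulary: WASD-style movement tokens.
SHARED_ACTIONS = ("W", "A", "S", "D")

def to_keyboard_format(actions):
    """Hypothetical model that consumes a list of key-name strings."""
    names = {"W": "forward", "A": "left", "S": "backward", "D": "right"}
    return [names[a] for a in actions]

def to_velocity_format(actions, speed=1.0):
    """Hypothetical model that consumes per-step (x, z) camera velocities."""
    vectors = {
        "W": (0.0, speed),
        "S": (0.0, -speed),
        "A": (-speed, 0.0),
        "D": (speed, 0.0),
    }
    return [vectors[a] for a in actions]

# Registry of per-model adapters; the keys are placeholders, not real model names.
ADAPTERS = {"model_a": to_keyboard_format, "model_b": to_velocity_format}

def map_actions(model_name, actions):
    """Validate a shared action sequence and translate it for one model."""
    unknown = [a for a in actions if a not in SHARED_ACTIONS]
    if unknown:
        raise ValueError(f"unsupported actions: {unknown}")
    return ADAPTERS[model_name](actions)

trajectory = ["W", "W", "D"]
for_model_a = map_actions("model_a", trajectory)
for_model_b = map_actions("model_b", trajectory)
```

Keeping the trajectory in the shared vocabulary and converting at the last step is what allows the same test case to be replayed on models with heterogeneous inputs.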

Chinese Translation

交互视频生成模型如Genie、YUME、HY-World和Matrix-Game正在快速发展，但每个模型都是在其独立的基准上进行评估，使用私有场景和轨迹，这使得公平的跨模型比较变得不可能。现有的公共基准提供了一些有用的指标，如轨迹误差、美学评分和基于视觉语言模型（VLM）的判断，但没有一个提供标准化的测试条件——相同的场景、相同的动作序列和统一的控制接口——以使这些指标在具有异构输入的模型之间可比。我们介绍了WorldMark，这是第一个为交互图像到视频世界模型提供共同竞争平台的基准。WorldMark的贡献包括：（1）一个统一的动作映射层，将共享的WASD风格动作词汇翻译为每个模型的本地控制格式，使得在相同场景和轨迹上对六个主要模型进行逐一比较成为可能；（2）一个包含500个评估案例的分层测试套件，涵盖第一人称和第三人称视角、照片级真实感和风格化场景，以及从简单到困难的三个难度级别，时长从20秒到60秒不等；（3）一个模块化的评估工具包，用于视觉质量、控制一致性和世界一致性，旨在使研究人员在领域发展时能够重用我们的标准化输入，同时插入他们自己的指标。我们将发布所有数据、评估代码和模型输出，以促进未来的研究。除了离线指标，我们还推出了World Model Arena (warena.ai)，这是一个在线平台，任何人都可以将领先的世界模型进行并排对战，并观看实时排行榜。

cs.CV / 67 / 2604.21694

Efficient Logic Gate Networks for Video Copy Detection

高效的逻辑门网络用于视频复制检测

Fojcik, Katarzyna

Abstract

Video copy detection requires robust similarity estimation under diverse visual distortions while operating at very large scale. Although deep neural networks achieve strong performance, their computational cost and descriptor size limit practical deployment in high-throughput systems. In this work, we propose a video copy detection framework based on differentiable Logic Gate Networks (LGNs), which replace conventional floating-point feature extractors with compact, logic-based representations. Our approach combines aggressive frame miniaturization, binary preprocessing, and a trainable LGN embedding model that learns both logical operations and interconnections. After training, the model can be discretized into a purely Boolean circuit, enabling extremely fast and memory-efficient inference. We systematically evaluate different similarity strategies, binarization schemes, and LGN architectures across multiple dataset folds and difficulty levels. Experimental results demonstrate that LGN-based models achieve competitive or superior accuracy and ranking performance compared to prior models, while producing descriptors several orders of magnitude smaller and delivering inference speeds exceeding 11k samples per second. These findings indicate that logic-based models offer a promising alternative for scalable and resource-efficient video copy detection.
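
A single differentiable logic-gate neuron, and its discretization into a Boolean gate, can be sketched as below. The example uses real-valued relaxations of just four two-input gates for brevity and a softmax over learnable logits to mix them. The gate subset, the names, and the logits are illustrative assumptions, not the architecture evaluated in the paper.

```python
import math

# Real-valued relaxations of four two-input gates; inputs are probabilities in [0, 1].
SOFT_GATES = {
    "AND": lambda a, b: a * b,
    "OR": lambda a, b: a + b - a * b,
    "XOR": lambda a, b: a + b - 2.0 * a * b,
    "NAND": lambda a, b: 1.0 - a * b,
}
GATE_NAMES = list(SOFT_GATES)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_neuron(a, b, logits):
    """Training-time neuron: a softmax-weighted mixture over candidate gates."""
    weights = softmax(logits)
    return sum(w * SOFT_GATES[name](a, b) for w, name in zip(weights, GATE_NAMES))

def hard_neuron(a, b, logits):
    """Inference-time neuron: keep only the most probable gate, with Boolean inputs."""
    best = GATE_NAMES[logits.index(max(logits))]
    return int(round(SOFT_GATES[best](float(a), float(b))))

# Logits that have (hypothetically) converged toward XOR during training.
xor_logits = [0.0, 0.0, 5.0, 0.0]
soft_out = soft_neuron(1.0, 1.0, xor_logits)
hard_out = hard_neuron(1, 1, xor_logits)
```

During training the mixture is differentiable with respect to the logits. After training only the argmax gate is kept, so inference reduces to Boolean operations on bits.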

Chinese Translation

视频复制检测需要在多种视觉失真下进行稳健的相似性估计，同时在非常大规模的情况下运行。尽管深度神经网络在性能上表现出色，但其计算成本和特征描述符的大小限制了在高吞吐量系统中的实际部署。在本研究中，我们提出了一种基于可微分逻辑门网络（Logic Gate Networks, LGNs）的视频复制检测框架，该框架用紧凑的基于逻辑的表示替代了传统的浮点特征提取器。我们的方法结合了激进的帧缩小、二进制预处理和一个可训练的LGN嵌入模型，该模型学习逻辑操作和互连。在训练后，该模型可以离散化为纯布尔电路，从而实现极快且内存高效的推断。我们系统地评估了不同的相似性策略、二值化方案和LGN架构，涵盖多个数据集折叠和难度级别。实验结果表明，与之前的模型相比，基于LGN的模型在准确性和排名性能上达到了竞争性或更优的水平，同时生成的描述符小几个数量级，并且推断速度超过每秒11,000个样本。这些发现表明，基于逻辑的模型为可扩展和资源高效的视频复制检测提供了一个有前景的替代方案。

cs.CV / 68 / 2604.21712

Discriminative-Generative Synergy for Occlusion Robust 3D Human Mesh Recovery

用于遮挡鲁棒的3D人类网格恢复的判别-生成协同

Liu, Yang, Zhang, Zhiyong

Abstract

3D human mesh recovery from monocular RGB images aims to estimate anatomically plausible 3D human models for downstream applications, but remains challenging under partial or severe occlusions. Regression-based methods are efficient yet often produce implausible or inaccurate results in unconstrained scenarios, while diffusion-based methods provide strong generative priors for occluded regions but may weaken fidelity to rare poses due to over-reliance on generation. To address these limitations, we propose a brain-inspired synergistic framework that integrates the discriminative power of vision transformers with the generative capability of conditional diffusion models. Specifically, the ViT-based pathway extracts deterministic visual cues from visible regions, while the diffusion-based pathway synthesizes structurally coherent human body representations. To effectively bridge the two pathways, we design a diverse-consistent feature learning module to align discriminative features with generative priors, and a cross-attention multi-level fusion mechanism to enable bidirectional interaction across semantic levels. Experiments on standard benchmarks demonstrate that our method achieves superior performance on key metrics and shows strong robustness in complex real-world scenarios.

Chinese Translation

从单目RGB图像中恢复3D人类网格的目标是为下游应用估计解剖上合理的3D人类模型，但在部分或严重遮挡的情况下仍然具有挑战性。基于回归的方法效率高，但在不受限制的场景中往往产生不合理或不准确的结果，而基于扩散的方法为遮挡区域提供了强大的生成先验，但由于过度依赖生成，可能削弱对稀有姿势的保真度。为了解决这些局限性，我们提出了一种受大脑启发的协同框架，将视觉变换器的判别能力与条件扩散模型的生成能力相结合。具体而言，基于ViT的路径从可见区域提取确定性的视觉线索，而基于扩散的路径合成结构上连贯的人体表示。为了有效地连接这两条路径，我们设计了一个多样性一致的特征学习模块，以对齐判别特征与生成先验，并采用跨注意力多层融合机制以实现语义层次间的双向交互。在标准基准上的实验表明，我们的方法在关键指标上实现了优越的性能，并在复杂的现实场景中展现出强大的鲁棒性。

cs.CV / 69 / 2604.21713

Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation

解锁3D视觉几何估计的关键因素的潜力

Xu, Guangkai, Geng, Hua, Zheng, Huanyi, Yin, Songyi, Sun, Yanlong, Chen, Hao, Shen, Chunhua

Abstract

Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals several key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that leverages high-resolution information. We integrate these designs into CARVE, a resolution-enhanced model for feed-forward visual geometry estimation. Experiments on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE achieves strong and robust performance across diverse benchmarks.
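
One plausible reading of a consistency loss between depth maps, camera parameters, and point maps is sketched below: back-project the depth map with pinhole intrinsics and penalise the gap to the predicted point map. This assumes the point map is expressed in camera coordinates and stored row-major, ignores extrinsics, and uses an L1 penalty. None of these details are confirmed by the abstract.

```python
def unproject(depth, fx, fy, cx, cy):
    """Back-project a depth map (rows of floats) to camera-space 3D points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

def consistency_loss(depth, intrinsics, point_map):
    """Mean per-point L1 gap between unprojected depth and a predicted point map."""
    fx, fy, cx, cy = intrinsics
    expected = unproject(depth, fx, fy, cx, cy)
    total = 0.0
    for p, q in zip(expected, point_map):
        total += sum(abs(a - b) for a, b in zip(p, q))
    return total / len(expected)

depth = [[2.0, 2.0], [2.0, 2.0]]
intrinsics = (1.0, 1.0, 0.5, 0.5)  # fx, fy, cx, cy for a toy 2x2 image
consistent_points = unproject(depth, *intrinsics)
shifted_points = [(x + 0.1, y, z) for x, y, z in consistent_points]

loss_consistent = consistency_loss(depth, intrinsics, consistent_points)
loss_shifted = consistency_loss(depth, intrinsics, shifted_points)
```

The loss is zero when the three predictions agree and grows when the point map drifts away from what the depth and intrinsics imply.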

Chinese Translation

前馈视觉几何估计最近取得了快速进展。然而，仍然存在一个重要的差距：多帧模型通常在跨帧一致性方面表现更好，但在单帧准确性上往往不及强大的单帧方法。这一观察促使我们通过严格的消融研究系统地调查驱动模型性能的关键因素，揭示了几个重要见解：1）扩大数据的多样性和质量即使在最先进的视觉几何估计方法中也能解锁进一步的性能提升；2）常用的基于置信度的损失和基于梯度的损失机制可能无意中阻碍性能；3）通过序列和帧对齐的联合监督改善了结果，而局部区域对齐却意外地降低了性能。此外，我们引入了两项增强措施，以整合基于优化的方法和高分辨率输入的优势：一个一致性损失函数，强制深度图、相机参数和点图之间的对齐，以及一个高效的架构设计，利用高分辨率信息。我们将这些设计整合到CARVE中，这是一个用于前馈视觉几何估计的分辨率增强模型。在点云重建、视频深度估计和相机位姿/内参估计的实验中，CARVE在不同基准测试中表现出强大而稳健的性能。

cs.CV / 70 / 2604.21718

Building a Precise Video Language with Human-AI Oversight

通过人机协同监督构建精确的视频语言

Lin, Zhiqiu, Mitra, Chancharik, Cen, Siyuan, Li, Isaac, Huang, Yuhan, Ling, Yu Tong Tiffany, Wang, Hewei, Pi, Irene, Zhu, Shihang, Rao, Ryan, Liu, George, Li, Jiaxi, Li, Ruojin, Han, Yili, Du, Yilun, Ramanan, Deva

Abstract

Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/
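
The abstract notes that preferences between pre- and post-captions supervise DPO training. A minimal sketch of that idea follows: each expert-revised post-caption is treated as the chosen response and the model's pre-caption as the rejected one, and the standard DPO loss is evaluated for one pair. The record layout, the `beta` value, and the log-probabilities are made-up placeholders, not data from the paper.

```python
import math

def build_preference_pair(video_id, pre_caption, post_caption):
    """Treat the expert-revised caption as preferred over the model's draft."""
    return {"prompt": video_id, "chosen": post_caption, "rejected": pre_caption}

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # log(1 + exp(-x)) equals -log(sigmoid(x)).
    return math.log1p(math.exp(-beta * margin))

pair = build_preference_pair(
    video_id="clip_0001",  # hypothetical identifier
    pre_caption="A person walks.",
    post_caption="A woman walks left to right as the camera pans to follow her.",
)

# Placeholder log-probabilities under the trained policy and the frozen reference.
loss_untrained = dpo_loss(-2.0, -2.0, -2.0, -2.0)
loss_improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

When the policy assigns relatively more probability to the post-caption than the reference model does, the margin becomes positive and the loss falls below log 2.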

Chinese Translation

视频语言模型（VLMs）通过自然语言学习推理动态视觉世界。我们介绍了一套开放数据集、基准测试和可扩展监督的方案，以实现精确的视频字幕生成。首先，我们定义了一种结构化规范，用于描述主题、场景、运动、空间和相机动态，这些规范基于与专业视频创作者（如电影制作人）共同开发的数百个精心定义的视觉原语。接下来，为了策划高质量的字幕，我们引入了CHAI（基于批评的人机监督）框架，在该框架中，经过培训的专家对模型生成的预字幕进行批评和修订，转化为改进后的后字幕。这种劳动分工通过将文本生成任务转移给模型，提高了注释的准确性和效率，使人类能够更好地专注于验证。此外，这些批评和对预字幕与后字幕之间的偏好提供了丰富的监督，促进开源模型（Qwen3-VL）在字幕生成、奖励建模和批评生成方面的改进，通过SFT、DPO和推理时扩展。我们的消融实验表明，确保批评质量在精确度、召回率和建设性方面，由我们的监督框架直接影响下游性能。在适度的专家监督下，所得到的模型超越了封闭源模型，如Gemini-3.1-Pro。最后，我们将我们的方法应用于重新字幕化大规模专业视频（例如电影、商业广告、游戏），并微调视频生成模型（如Wan），以更好地遵循多达400字的详细提示，实现对摄影的更精细控制，包括相机运动、角度、镜头、焦点、视角和构图。我们的结果表明，精确的规范和人机监督是实现专业级视频理解和生成的关键。数据和代码可在我们的项目页面获取： https://linzhiqiu.github.io/papers/chai/

cs.CV / 71 / 2604.21728

Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection

Ramen：通过主动样本选择实现视觉-语言模型的稳健测试时适应

Bao, Wenxuan, Zhao, Yanjun, Yang, Xiyuan, He, Jingrui

Abstract

Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. However, existing methods typically assume that test samples come from a single, consistent domain, while in practice, test data often include samples from mixed domains with distinct characteristics. Consequently, their performance degrades under mixed-domain settings. To address this, we present Ramen, a framework for robust test-time adaptation through active sample selection. For each incoming test sample, Ramen retrieves a customized batch of relevant samples from previously seen data based on two criteria: domain consistency, which ensures that adaptation focuses on data from similar domains, and prediction balance, which mitigates adaptation bias caused by skewed predictions. To improve efficiency, Ramen employs an embedding-gradient cache that stores the embeddings and sample-level gradients of past test images. The stored embeddings are used to retrieve relevant samples, and the corresponding gradients are aggregated for model updates, eliminating the need for any additional forward or backward passes. Our theoretical analysis provides insight into why the proposed adaptation mechanism is effective under mixed-domain shifts. Experiments on multiple image corruption and domain-shift benchmarks demonstrate that Ramen achieves strong and consistent performance, offering robust and efficient adaptation in complex mixed-domain scenarios. Our code is available at https://github.com/baowenxuan/Ramen .
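
The embedding-gradient cache described in this abstract can be approximated with a small data structure: store each past sample's embedding, predicted class, and gradient, retrieve similar samples with a per-class cap, and average their stored gradients. The class name, the cosine-similarity threshold, the per-class cap, and the toy vectors are assumptions for illustration. The authors' repository is the reference for the real selection rules.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

class EmbeddingGradientCache:
    """Stores (embedding, predicted class, gradient) for past test samples."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, predicted_class, gradient):
        self.entries.append((embedding, predicted_class, gradient))

    def retrieve(self, query, per_class=1, min_similarity=0.5):
        """Similar samples only (domain consistency), capped per class (balance)."""
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        taken, selected = {}, []
        for entry in ranked:
            if cosine(query, entry[0]) < min_similarity:
                break  # remaining entries are even less similar
            cls = entry[1]
            if taken.get(cls, 0) < per_class:
                selected.append(entry)
                taken[cls] = taken.get(cls, 0) + 1
        return selected

    def aggregated_gradient(self, query, per_class=1, min_similarity=0.5):
        """Average the cached gradients of the retrieved samples, or None if empty."""
        selected = self.retrieve(query, per_class, min_similarity)
        if not selected:
            return None
        dim = len(selected[0][2])
        return [sum(e[2][i] for e in selected) / len(selected) for i in range(dim)]

cache = EmbeddingGradientCache()
cache.add([1.0, 0.0], 0, [1.0, 0.0])
cache.add([0.9, 0.1], 0, [3.0, 0.0])   # same class, skipped by the per-class cap
cache.add([0.8, 0.2], 1, [0.0, 2.0])
cache.add([-1.0, 0.0], 1, [9.0, 9.0])  # dissimilar "domain", filtered out
update = cache.aggregated_gradient([1.0, 0.0])
```

Because the gradients are read from the cache instead of being recomputed, building the update needs no extra forward or backward pass, which is the efficiency argument made in the abstract.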

Chinese Translation

预训练的视觉-语言模型如CLIP展现出强大的零样本泛化能力，但对分布变化仍然敏感。测试时适应是在推理过程中对模型进行调整，而无需访问源数据或目标标签，提供了一种实用的方法来处理这种变化。然而，现有方法通常假设测试样本来自单一且一致的领域，而实际上，测试数据往往包含来自不同领域且具有不同特征的样本。因此，在混合领域设置下，它们的性能会下降。为了解决这个问题，我们提出了Ramen，一个通过主动样本选择实现稳健测试时适应的框架。对于每个到来的测试样本，Ramen根据两个标准从之前见过的数据中检索一批定制的相关样本：领域一致性，确保适应过程集中于来自相似领域的数据，以及预测平衡，减轻由于偏斜预测引起的适应偏差。为了提高效率，Ramen采用了嵌入-梯度缓存，存储过去测试图像的嵌入和样本级梯度。存储的嵌入用于检索相关样本，相应的梯度则被聚合用于模型更新，从而消除了任何额外的前向或反向传播的需求。我们的理论分析提供了对所提出的适应机制在混合领域变化下为何有效的深入理解。在多个图像损坏和领域转移基准上的实验表明，Ramen实现了强大且一致的性能，在复杂的混合领域场景中提供了稳健且高效的适应。我们的代码可在 https://github.com/baowenxuan/Ramen 获取。
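
The retrieval-and-aggregation step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes cosine similarity as a stand-in for domain consistency, a fixed per-class quota as a stand-in for prediction balance, and a plain mean as the gradient aggregation. The function name `select_and_aggregate` and all parameters are hypothetical.

```python
import numpy as np

def select_and_aggregate(query_emb, cache_embs, cache_preds, cache_grads, per_class=1):
    """Pick cached samples similar to the query, balanced over predicted classes,
    and average their stored gradients (hypothetical sketch, not Ramen's code)."""
    # Domain-consistency proxy (assumption): cosine similarity to the query embedding.
    q = query_emb / np.linalg.norm(query_emb)
    e = cache_embs / np.linalg.norm(cache_embs, axis=1, keepdims=True)
    sims = e @ q

    # Prediction-balance proxy (assumption): take the same number of samples
    # from each predicted class, choosing the most similar ones first.
    selected = []
    for cls in np.unique(cache_preds):
        idx = np.flatnonzero(cache_preds == cls)
        best = idx[np.argsort(-sims[idx])[:per_class]]
        selected.extend(best.tolist())
    selected = np.array(sorted(selected))

    # Reuse cached per-sample gradients: no new forward or backward pass is run here.
    return selected, cache_grads[selected].mean(axis=0)
```

For a toy cache with two embedding clusters, a query near the first cluster pulls one sample per predicted class from that cluster, and the update direction is simply the mean of their cached gradients.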

cs.CV / 72 / 2604.21760

Interpretable facial dynamics as behavioral and perceptual traces of deepfakes

可解释的面部动态作为深度伪造的行为和感知痕迹

Murphy, Timothy Joseph, Cook, Jennifer, Cuve, Hélio Clemente José

Abstract

Deepfake detection research has largely converged on deep learning approaches that, despite strong benchmark performance, offer limited insight into what distinguishes real from manipulated facial behavior. This study presents an interpretable alternative grounded in bio-behavioral features of facial dynamics and evaluates how computational detection strategies relate to human perceptual judgments. We identify core low-dimensional patterns of facial movement, from which temporal features characterizing spatiotemporal structure were derived. Traditional machine learning classifiers trained on these features achieved modest but significant above-chance deepfake classification, driven by higher-order temporal irregularities that were more pronounced in manipulated than real facial dynamics. Notably, detection was substantially more accurate for videos containing emotive expressions than those without. An emotional valence classification analysis further indicated that emotive signals are systematically degraded in deepfakes, explaining the differential impact of emotive dynamics on detection. Furthermore, we provide an additional and often overlooked dimension of explainability by assessing the relationship between model decisions and human perceptual detection. Model and human judgments converged for emotive but diverged for non-emotive videos, and even where outputs aligned, underlying detection strategies differed. These findings demonstrate that face-swapped deepfakes carry a measurable behavioral fingerprint, most salient during emotional expression. Additionally, model-human comparisons suggest that interpretable computational features and human perception may offer complementary rather than redundant routes to detection.

Chinese Translation

深度伪造检测研究在很大程度上集中于深度学习方法，尽管在基准测试中表现强劲，但对区分真实与操控面部行为的洞察有限。本研究提出了一种基于面部动态生物行为特征的可解释替代方案，并评估计算检测策略与人类感知判断之间的关系。我们识别出面部运动的核心低维模式，从中推导出表征时空结构的时间特征。基于这些特征训练的传统机器学习分类器实现了适度但显著高于随机的深度伪造分类，主要受更高阶时间不规则性的驱动，这些不规则性在操控的面部动态中比真实的面部动态更为明显。值得注意的是，对于包含情感表达的视频，检测的准确性显著高于不包含情感表达的视频。情感效价分类分析进一步表明，情感信号在深度伪造中系统性地退化，解释了情感动态对检测的差异性影响。此外，我们通过评估模型决策与人类感知检测之间的关系，提供了一个额外且常被忽视的可解释性维度。模型和人类判断在情感视频中趋于一致，但在非情感视频中则出现分歧，即使在输出一致的情况下，底层检测策略也有所不同。这些发现表明，面部交换的深度伪造携带可测量的行为指纹，尤其在情感表达期间最为显著。此外，模型与人类的比较表明，可解释的计算特征与人类感知可能提供互补而非冗余的检测途径。

cs.CV / 73 / 2604.21772

Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation

回归源头：通过领域补偿实现开放集持续测试时间适应

Yang, Yingkai, Chen, Chaoqi, Huang, Hui

Abstract

Test-Time Adaptation (TTA) aims to mitigate distributional shifts between training and test domains during inference time. However, existing TTA methods fall short in the realistic scenario where models face both continually changing domains and the simultaneous emergence of unknown semantic classes, a challenging setting we term Open-set Continual Test-Time Adaptation (OCTTA). The coupling of domain and semantic shifts often collapses the feature space, severely degrading both classification and out-of-distribution detection. To tackle this, we propose DOmain COmpensation (DOCO), a lightweight and effective framework that robustly performs domain adaptation and OOD detection in a synergistic, closed loop. DOCO first performs dynamic, adaptation-conditioned sample splitting to separate likely ID from OOD samples. Then, using only the ID samples, it learns a domain compensation prompt by aligning feature statistics with the source domain, guided by a structural preservation regularizer that prevents semantic distortion. This learned prompt is then propagated to the OOD samples within the same batch, effectively isolating their semantic novelty for more reliable detection. Extensive experiments on multiple challenging benchmarks demonstrate that DOCO outperforms prior CTTA and OSTTA methods, establishing a new state-of-the-art for the demanding OCTTA setting.

Chinese Translation

测试时间适应（TTA）旨在减轻推理过程中训练域与测试域之间的分布变化。然而，现有的 TTA 方法在模型面临不断变化的领域和未知语义类别同时出现的现实场景中表现不佳，这一具有挑战性的设置我们称之为开放集持续测试时间适应（OCTTA）。领域和语义的变化耦合常常导致特征空间崩溃，严重降低分类和分布外检测的性能。为了解决这个问题，我们提出了领域补偿（DOmain COmpensation，DOCO），一个轻量且有效的框架，能够在协同的闭环中稳健地执行领域适应和分布外检测。DOCO 首先执行动态的、适应条件下的样本分割，以将可能的已知类别（ID）样本与分布外（OOD）样本分开。然后，仅使用 ID 样本，通过对齐特征统计与源领域，学习一个领域补偿提示，并通过结构保持正则化器引导，以防止语义扭曲。这个学习到的提示随后被传播到同一批次中的 OOD 样本，有效地隔离它们的语义新颖性，以实现更可靠的检测。在多个具有挑战性的基准测试上的广泛实验表明，DOCO 超越了先前的 CTTA 和 OSTTA 方法，为苛刻的 OCTTA 设置建立了新的最先进水平。

cs.CV / 74 / 2604.21776

Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

重拍任何事物：一种用于野外视频重拍的自监督模型

Paliwal, Avinash, Iyer, Adithya, Yadav, Shivin, Afridi, Muhammad Ali, Harikumar, Midhun

Abstract

Precise camera control for reshooting dynamic videos is bottlenecked by the severe scarcity of paired multi-view data for non-rigid scenes. We overcome this limitation with a highly scalable self-supervised framework capable of leveraging internet-scale monocular videos. Our core contribution is the generation of pseudo multi-view training triplets, consisting of a source video, a geometric anchor, and a target video. We achieve this by extracting distinct smooth random-walk crop trajectories from a single input video to serve as the source and target views. The anchor is synthetically generated by forward-warping the first frame of the source with a dense tracking field, which effectively simulates the distorted point-cloud inputs expected at inference. Because our independent cropping strategy introduces spatial misalignment and artificial occlusions, the model cannot simply copy information from the current source frame. Instead, it is forced to implicitly learn 4D spatiotemporal structures by actively routing and re-projecting missing high-fidelity textures across distinct times and viewpoints from the source video to reconstruct the target. At inference, our minimally adapted diffusion transformer utilizes a 4D point-cloud derived anchor to achieve state-of-the-art temporal consistency, robust camera control, and high-fidelity novel view synthesis on complex dynamic scenes.

Chinese Translation

对于动态视频重拍，精确的相机控制受到非刚性场景中配对多视图数据严重匮乏的制约。我们通过一种高度可扩展的自监督框架克服了这一限制，该框架能够利用互联网规模的单目视频。我们的核心贡献是生成伪多视图训练三元组，包括源视频、几何锚点和目标视频。我们通过从单个输入视频中提取不同的平滑随机游走裁剪轨迹来实现这一点，以作为源视图和目标视图。锚点通过使用密集跟踪场对源视频的第一帧进行前向变形合成生成，从而有效模拟在推理时预期的失真点云输入。由于我们的独立裁剪策略引入了空间不对齐和人工遮挡，模型无法简单地从当前源帧复制信息。相反，它被迫通过主动路由和重新投影缺失的高保真纹理，隐式学习4D时空结构，以从源视频中重建目标视频。在推理时，我们经过最小调整的扩散变换器利用4D点云派生的锚点，实现了在复杂动态场景中的最先进的时间一致性、稳健的相机控制和高保真的新视图合成。
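
To make the "smooth random-walk crop trajectories" idea concrete, here is a minimal sketch. The abstract does not specify how the trajectories are generated, so the Gaussian steps, moving-average smoothing, and every name and parameter below (`smooth_random_walk_crops`, `step`, `window`) are assumptions for illustration only.

```python
import numpy as np

def smooth_random_walk_crops(num_frames, frame_hw, crop_hw, step=4.0, window=9, seed=0):
    """Return per-frame (top, left) crop offsets following a smoothed random walk.
    `window` is assumed to be odd so the smoothed path keeps `num_frames` entries."""
    rng = np.random.default_rng(seed)
    max_off = np.array([frame_hw[0] - crop_hw[0], frame_hw[1] - crop_hw[1]], dtype=float)

    # Random walk: random start position plus accumulated Gaussian steps.
    start = rng.uniform(0.0, max_off)
    walk = start + np.cumsum(rng.normal(scale=step, size=(num_frames, 2)), axis=0)

    # Moving-average smoothing with edge padding, applied to each axis.
    kernel = np.ones(window) / window
    padded = np.pad(walk, ((window // 2, window // 2), (0, 0)), mode="edge")
    smooth = np.stack(
        [np.convolve(padded[:, k], kernel, mode="valid") for k in range(2)], axis=1
    )

    # Keep the crop window inside the frame.
    return np.clip(smooth, 0.0, max_off)

# Two independent trajectories over the same clip act as pseudo source/target views.
source_path = smooth_random_walk_crops(48, (256, 256), (128, 128), seed=0)
target_path = smooth_random_walk_crops(48, (256, 256), (128, 128), seed=1)
```

Because the two paths are drawn independently, the source and target crops overlap only partly, which is the spatial misalignment the abstract relies on to prevent simple copying.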

cs.CV / 75 / 2604.21786

From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media

从代码本到视觉语言模型：评估社交媒体上气候变化的自动化视觉话语分析

Prasse, Katharina, Jung, Steffen, Bravo, Isaac, Walter, Stefanie, Knab, Patrick, Bartelt, Christian, Keuper, Margret

Abstract

Social media platforms have become primary arenas for climate communication, generating millions of images and posts that - if systematically analysed - can reveal which communication strategies mobilise public concern and which fall flat. We aim to facilitate such research by analysing how computer vision methods can be used for social media discourse analysis. This analysis includes application-based taxonomy design, model selection, prompt engineering, and validation. We benchmark six promptable vision-language models and 15 zero-shot CLIP-like models on two datasets from X (formerly Twitter) - a 1,038-image expert-annotated set and a larger corpus of over 1.2 million images, with 50,000 labels manually validated - spanning five annotation dimensions: animal content, climate change consequences, climate action, image setting, and image type. Among the models benchmarked, Gemini-3.1-flash-lite outperforms all others across all super-categories and both datasets, while the gap to open-weight models of moderate size remains relatively small. Beyond instance-level metrics, we advocate for distributional evaluation: VLM predictions can reliably recover population-level trends even when per-image accuracy is moderate, making them a viable starting point for discourse analysis at scale. We find that chain-of-thought reasoning reduces rather than improves performance, and that annotation-dimension-specific prompt design improves performance. We release tweet IDs and labels along with our code at https://github.com/KathPra/Codebooks2VLMs.git.

Chinese Translation

社交媒体平台已成为气候沟通的主要场所，生成了数百万张图像和帖子，如果进行系统分析，可以揭示哪些沟通策略能够引发公众关注，哪些则效果不佳。我们旨在通过分析计算机视觉方法在社交媒体话语分析中的应用，促进此类研究。该分析包括基于应用的分类设计、模型选择、提示工程和验证。我们在来自X（前身为Twitter）的两个数据集上基准测试了六个可提示的视觉语言模型和15个零样本CLIP类模型——一个包含1,038张专家标注图像的集合和一个超过120万张图像的大型语料库，其中50,000个标签经过人工验证，涵盖五个标注维度：动物内容、气候变化后果、气候行动、图像设置和图像类型。在基准测试的模型中，Gemini-3.1-flash-lite在所有超级类别和两个数据集上均表现优于其他模型，而中等规模的开放权重模型之间的差距相对较小。超越实例级指标，我们提倡分布式评估：即使每张图像的准确性适中，VLM预测仍能可靠地恢复人口水平趋势，使其成为大规模话语分析的可行起点。我们发现，思维链推理降低了而不是提高了性能，而特定于标注维度的提示设计则改善了性能。我们在 https://github.com/KathPra/Codebooks2VLMs.git 发布了推文ID和标签以及我们的代码。

cs.CV / 76 / 2604.21801

SyMTRS: Benchmark Multi-Task Synthetic Dataset for Depth, Domain Adaptation and Super-Resolution in Aerial Imagery

SyMTRS：用于航空影像深度、领域适应和超分辨率的基准多任务合成数据集

Ghazouali, Safouane El, Venturi, Nicola, Rueegsegger, Michael, Michelucci, Umberto

Abstract

Recent advances in deep learning for remote sensing rely heavily on large annotated datasets, yet acquiring high-quality ground truth for geometric, radiometric, and multi-domain tasks remains costly and often infeasible. In particular, the lack of accurate depth annotations, controlled illumination variations, and multi-scale paired imagery limits progress in monocular depth estimation, domain adaptation, and super-resolution for aerial scenes. We present SyMTRS, a large-scale synthetic dataset generated using a high-fidelity urban simulation pipeline. The dataset provides high-resolution RGB aerial imagery (2048 x 2048), pixel-perfect depth maps, night-time counterparts for domain adaptation, and aligned low-resolution variants for super-resolution at x2, x4, and x8 scales. Unlike existing remote sensing datasets that focus on a single task or modality, SyMTRS is designed as a unified multi-task benchmark enabling joint research in geometric understanding, cross-domain robustness, and resolution enhancement. We describe the dataset generation process, its statistical properties, and its positioning relative to existing benchmarks. SyMTRS aims to bridge critical gaps in remote sensing research by enabling controlled experiments with perfect geometric ground truth and consistent multi-domain supervision. The results obtained in this work can be reproduced from this Github repository: https://github.com/safouaneelg/SyMTRS.

Chinese Translation

最近，遥感领域深度学习的进展在很大程度上依赖于大型标注数据集，但获取几何、辐射和多领域任务的高质量真实数据仍然成本高昂且往往不可行。特别是，缺乏准确的深度标注、受控的光照变化和多尺度配对影像限制了单目深度估计、领域适应和航空场景超分辨率的进展。我们提出了SyMTRS，这是一个使用高保真城市模拟管道生成的大规模合成数据集。该数据集提供高分辨率RGB航空影像（2048 x 2048）、像素完美的深度图、用于领域适应的夜间影像以及用于x2、x4和x8尺度超分辨率的对齐低分辨率变体。与现有的专注于单一任务或模态的遥感数据集不同，SyMTRS被设计为一个统一的多任务基准，支持几何理解、跨领域鲁棒性和分辨率增强的联合研究。我们描述了数据集的生成过程、统计特性及其相对于现有基准的定位。SyMTRS旨在通过实现具有完美几何真实数据和一致多领域监督的受控实验，弥补遥感研究中的关键空白。本研究中获得的结果可以从以下Github仓库复现：https://github.com/safouaneelg/SyMTRS。
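
For the super-resolution variants, the stated scales imply low-resolution sizes of 1024, 512, and 256 pixels per side for a 2048 x 2048 image. The sketch below shows one way aligned low-resolution images could be produced. The abstract does not say which downsampling kernel the dataset uses, so block averaging is an assumption here, and `block_downsample` is a hypothetical helper.

```python
import numpy as np

def block_downsample(image, factor):
    """Average non-overlapping factor x factor blocks of an (H, W, C) image.
    Block averaging is an assumed degradation, not the dataset's documented one."""
    h, w, c = image.shape
    if h % factor or w % factor:
        raise ValueError("image size must be divisible by the factor")
    blocks = image.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))

# Side lengths implied by the x2, x4, and x8 scales for a 2048-pixel image.
low_res_sides = {factor: 2048 // factor for factor in (2, 4, 8)}

# Small stand-in image so the example runs quickly.
high_res = np.arange(16 * 16 * 3, dtype=float).reshape(16, 16, 3)
pyramid = {factor: block_downsample(high_res, factor) for factor in (2, 4, 8)}
```

Each low-resolution pixel covers an exact block of high-resolution pixels, so the pairs stay spatially aligned by construction.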

cs.CV / 77 / 2604.21806

TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

TEMA：锚定图像，跟随文本进行多重修改组合图像检索

Li, Zixu, Hu, Yupeng, Fu, Zhiheng, Chen, Zhiwei, Li, Yongqi, Nie, Liqiang

Abstract

Composed Image Retrieval (CIR) is an important image retrieval paradigm that enables users to retrieve a target image using a multimodal query that consists of a reference image and modification text. Although research on CIR has made significant progress, prevailing setups still rely on simple modification texts that typically cover only a limited range of salient changes, which induces two limitations highly relevant to practical applications, namely Insufficient Entity Coverage and Clause-Entity Misalignment. In order to address these issues and bring CIR closer to real-world use, we construct two instruction-rich multi-modification datasets, M-FashionIQ and M-CIRR. In addition, we propose TEMA, the Text-oriented Entity Mapping Architecture, which is the first CIR framework designed for multi-modification while also accommodating simple modifications. Extensive experiments on four benchmark datasets demonstrate TEMA's superiority in both original and multi-modification scenarios, while maintaining an optimal balance between retrieval accuracy and computational efficiency. Our code and constructed multi-modification datasets (M-FashionIQ and M-CIRR) are available at https://github.com/lee-zixu/ACL26-TEMA/.

Chinese Translation

组合图像检索（CIR）是一种重要的图像检索范式，使用户能够使用由参考图像和修改文本组成的多模态查询来检索目标图像。尽管CIR的研究取得了显著进展，但现有的设置仍然依赖于简单的修改文本，这些文本通常仅涵盖有限范围的显著变化，这导致了与实际应用高度相关的两个限制，即实体覆盖不足和条款-实体不一致。为了应对这些问题，使CIR更接近于现实世界的应用，我们构建了两个富含指令的多重修改数据集，M-FashionIQ和M-CIRR。此外，我们提出了TEMA，即面向文本的实体映射架构，这是第一个为多重修改设计的CIR框架，同时也兼容简单修改。在四个基准数据集上的广泛实验表明，TEMA在原始和多重修改场景中的优越性，同时在检索准确性和计算效率之间保持了最佳平衡。我们的代码和构建的多重修改数据集（M-FashionIQ和M-CIRR）可在https://github.com/lee-zixu/ACL26-TEMA/获取。

cs.CV / 78 / 2604.21810

Multiscale Super Resolution without Image Priors

无图像先验的多尺度超分辨率

Fu, Daniel, Litterio, Gabby, Felzenszwalb, Pedro, Zia, Rashid

Abstract

We address the ambiguities in the super-resolution problem under translation. We demonstrate that combinations of low-resolution images at different scales can be used to make the super-resolution problem well posed. Such differences in scale can be achieved using sensors with different pixel sizes (as demonstrated here) or by varying the effective pixel size through changes in optical magnification (e.g., using a zoom lens). We show that images acquired with pairwise coprime pixel sizes lead to a system with a stable inverse, and furthermore, that super-resolution images can be reconstructed efficiently using Fourier domain techniques or iterative least squares methods. Our mathematical analysis provides an expression for the expected error of the least squares reconstruction for large signals assuming i.i.d. noise that elucidates the noise-resolution tradeoff. These results are validated through both one- and two-dimensional experiments that leverage charge-coupled device (CCD) hardware binning to explore reconstructions over a large range of effective pixel sizes. Finally, two-dimensional reconstructions for a series of targets are used to demonstrate the advantages of multiscale super-resolution, and implications of these results for common imaging systems are discussed.

Chinese Translation

我们解决了在平移下超分辨率问题中的模糊性。我们证明，不同尺度的低分辨率图像组合可以使超分辨率问题变得明确。这种尺度差异可以通过使用不同像素大小的传感器（如本研究所示）或通过改变光学放大率（例如，使用变焦镜头）来实现有效像素大小的变化。我们表明，采用成对互质像素大小获取的图像会导致一个具有稳定逆的系统，并且进一步地，超分辨率图像可以通过傅里叶域技术或迭代最小二乘法高效重建。我们的数学分析提供了一个期望误差的表达式，用于大信号的最小二乘重建，假设噪声为独立同分布（i.i.d.），阐明了噪声与分辨率之间的权衡。这些结果通过一维和二维实验得到了验证，这些实验利用电荷耦合器件（CCD）硬件的合并，探索了在大范围有效像素大小下的重建。最后，针对一系列目标的二维重建被用来展示多尺度超分辨率的优势，并讨论了这些结果对常见成像系统的影响。
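
The coprime-pixel-size claim can be checked on a toy 1D model. This is a sketch under stated assumptions, not the paper's setup: the signal lives on a circular fine grid, a "pixel" of size p sums p consecutive fine samples, every integer translation is observed, and there is no noise. The helper names `box_sampling_matrix` and `reconstruct` are hypothetical.

```python
import numpy as np

def box_sampling_matrix(n, pixel_size):
    """Row s sums `pixel_size` consecutive fine-grid samples starting at shift s (circular)."""
    matrix = np.zeros((n, n))
    for shift in range(n):
        for offset in range(pixel_size):
            matrix[shift, (shift + offset) % n] = 1.0
    return matrix

def reconstruct(measurements, matrices):
    """Least-squares estimate of the fine-grid signal from stacked measurements."""
    stacked = np.vstack(matrices)
    observed = np.concatenate(measurements)
    estimate, *_ = np.linalg.lstsq(stacked, observed, rcond=None)
    return estimate

n = 12
rng = np.random.default_rng(0)
signal = rng.normal(size=n)

a2 = box_sampling_matrix(n, 2)  # pixel size 2
a3 = box_sampling_matrix(n, 3)  # pixel size 3, coprime with 2

rank_2 = np.linalg.matrix_rank(a2)                    # rank-deficient on its own
rank_3 = np.linalg.matrix_rank(a3)                    # rank-deficient on its own
rank_both = np.linalg.matrix_rank(np.vstack([a2, a3]))  # full rank when combined

recovered = reconstruct([a2 @ signal, a3 @ signal], [a2, a3])
```

In this model each box filter has zeros in its frequency response, so a single pixel size loses some frequencies even with all translations available. The zeros of the size-2 and size-3 filters fall at different frequencies, which is why stacking the two measurement sets restores a unique least-squares solution.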

cs.CV / 79 / 2604.21814

Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos

分割后诊断：为超长胶囊内镜视频编织临床医生启发的上下文

Liu, Bowen, Yang, Li, Song, Shanshan, Tang, Mingyu, Gao, Zhifang, Chen, Qifeng, Song, Yangqiu, Chen, Huimin, Li, Xiaomeng

Abstract

Capsule endoscopy (CE) enables non-invasive gastrointestinal screening, but current CE research remains largely limited to frame-level classification and detection, leaving video-level analysis underexplored. To bridge this gap, we introduce and formally define a new task, diagnosis-driven CE video summarization, which requires extracting key evidence frames that cover clinically meaningful findings and making accurate diagnoses from those evidence frames. This setting is challenging because diagnostically relevant events are extremely sparse and can be overwhelmed by tens of thousands of redundant normal frames, while individual observations are often ambiguous due to motion blur, debris, specular highlights, and rapid viewpoint changes. To facilitate research in this direction, we introduce VideoCAP, the first CE dataset with diagnosis-driven annotations derived from real clinical reports. VideoCAP comprises 240 full-length videos and provides realistic supervision for both key evidence frame extraction and diagnosis. To address this task, we further propose DiCE, a clinician-inspired framework that mirrors the standard CE reading workflow. DiCE first performs efficient candidate screening over the raw video, then uses a Context Weaver to organize candidates into coherent diagnostic contexts that preserve distinct lesion events, and an Evidence Converger to aggregate multi-frame evidence within each context into robust clip-level judgments. Experiments show that DiCE consistently outperforms state-of-the-art methods, producing concise and clinically reliable diagnostic summaries. These results highlight diagnosis-driven contextual reasoning as a promising paradigm for ultra-long CE video summarization.

Chinese Translation

胶囊内镜（CE）实现了非侵入性的胃肠道筛查，但当前的CE研究主要局限于帧级分类和检测，视频级分析尚未得到充分探索。为填补这一空白，我们引入并正式定义了一项新任务——以诊断为驱动的CE视频摘要，该任务要求提取涵盖临床意义发现的关键证据帧，并从这些证据帧中做出准确诊断。这一设置具有挑战性，因为诊断相关事件极为稀疏，且可能被成千上万的冗余正常帧所淹没，而由于运动模糊、杂物、镜面高光和快速视角变化，个别观察往往模糊不清。为促进该方向的研究，我们推出了VideoCAP，这是第一个基于真实临床报告的以诊断为驱动的注释的CE数据集。VideoCAP包含240个完整视频，并为关键证据帧提取和诊断提供了真实的监督。为了解决这一任务，我们进一步提出了DiCE，一个受临床医生启发的框架，模拟标准的CE阅读工作流程。DiCE首先对原始视频进行高效的候选筛选，然后使用上下文编织器将候选组织成保留不同病变事件的连贯诊断上下文，并通过证据聚合器将每个上下文中的多帧证据汇聚成稳健的剪辑级判断。实验表明，DiCE始终优于最先进的方法，生成简洁且临床可靠的诊断摘要。这些结果突显了以诊断为驱动的上下文推理作为超长CE视频摘要的一个有前景的范式。

cs.CV / 80 / 2604.21873

Grounding Video Reasoning in Physical Signals

基于物理信号的视频推理

Osmanli, Alibay, Cheng, Zixu, Gong, Shaogang

Abstract

Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities while still failing to localize the event in time or space. We introduce a grounded benchmark for physical video understanding that extends the what-when-where evaluation structure of V-STaR to four video sources, six physics domains, three prompt families (physics, vstar_like, and neutral_rstr), and four input conditions (original, shuffled, ablated, and frame-masked). The benchmark contains 1,560 base video clips from SSV2, YouCook2, HoloAssist, and Roundabout-TAU. Each clip is first converted into a shared grounded event record, and the three query families are derived from that record. Temporal and spatial targets are shared across prompt families, while the non-physics families use deterministic family-appropriate semantic a_what targets derived from the same record. Across models and prompt families, physics remains the strongest regime overall, vstar_like is the clearest non-physics semantic comparison, and neutral_rstr behaves as a harder templated control. Prompt-family robustness is selective rather than universal, perturbation gains cluster in weak original cases, and spatial grounding is the weakest across settings. These results suggest that video Q&A reasoning benchmarks should report physically grounded, prompt-aware, and perturbation-aware diagnostics alongside aggregate accuracy.

Chinese Translation

物理视频理解不仅仅是正确命名事件。一个模型可以根据文本规律回答有关倒水、滑动或碰撞的问题，但仍然无法在时间或空间上定位事件。我们引入了一个用于物理视频理解的基础基准，该基准扩展了V-STaR的什么-何时-何地评估结构，涵盖了四个视频来源、六个物理领域、三个提示类别（physics、vstar_like和neutral_rstr）以及四种输入条件（原始、打乱、消融和帧遮蔽）。该基准包含来自SSV2、YouCook2、HoloAssist和Roundabout-TAU的1560个基础视频片段。每个片段首先被转换为一个共享的基础事件记录，三个查询类别则从该记录中派生。时间和空间目标在提示类别之间共享，而非物理类别则使用从同一记录中派生的确定性类别适当的语义a_what目标。在不同模型和提示类别中，physics类别整体上仍然是最强的范畴，vstar_like是最清晰的非物理语义比较，而neutral_rstr则表现为更难的模板控制。提示类别的鲁棒性是选择性的而非普遍的，扰动增益集中在较弱的原始案例中，而空间定位在所有设置中是最弱的。这些结果表明，视频问答推理基准应报告物理基础、提示感知和扰动感知的诊断，同时提供综合准确性。

cs.CV / 81 / 2604.21879

Addressing Image Authenticity When Cameras Use Generative AI

在相机使用生成性人工智能时解决图像真实性问题

Masud, Umar, Punnappurath, Abhijith, Zhao, Luxi, Lindell, David B., Brown, Michael S.

Abstract

The ability of generative AI (GenAI) methods to photorealistically alter camera images has raised awareness about the authenticity of images shared online. Interestingly, images captured directly by our cameras are considered authentic and faithful. However, with the increasing integration of deep-learning modules into cameras' capture-time hardware, namely the image signal processor (ISP), there is now a potential for hallucinated content in images directly output by our cameras. Hallucinated capture-time image content is typically benign, such as enhanced edges or texture, but in certain operations, such as AI-based digital zoom or low-light image enhancement, hallucinations can potentially alter the semantics and interpretation of the image content. As a result, users may not realize that the content in their camera images is not authentic. This paper addresses this issue by enabling users to recover the 'unhallucinated' version of the camera image to avoid misinterpretation of the image content. Our approach works by optimizing an image-specific multi-layer perceptron (MLP) decoder together with a modality-specific encoder so that, given the camera image, we can recover the image before hallucinated content was added. The encoder and MLP are self-contained and can be applied post-capture to the image without requiring access to the camera ISP. Moreover, the encoder and MLP decoder require only 180 KB of storage and can be readily saved as metadata within standard image formats such as JPEG and HEIC.

Chinese Translation

生成性人工智能（GenAI）方法能够以照片真实的方式改变相机图像，这引发了人们对在线共享图像真实性的关注。有趣的是，直接由我们的相机拍摄的图像被认为是可信和真实的。然而，随着深度学习模块越来越多地集成到相机的捕获硬件中——即图像信号处理器（ISP）——我们现在面临着相机直接输出的图像中可能存在幻觉内容的风险。捕获时的幻觉图像内容通常是良性的，例如增强的边缘或纹理，但在某些操作中，例如基于人工智能的数字变焦或低光图像增强，幻觉可能会改变图像内容的语义和解读。因此，用户可能没有意识到他们相机图像中的内容并不真实。本文通过使用户能够恢复相机图像的“非幻觉”版本来解决这一问题，以避免对图像内容的误解。我们的方法通过优化一个图像特定的多层感知器（MLP）解码器和一个模态特定的编码器来实现，因此，给定相机图像，我们可以恢复在添加幻觉内容之前的图像。编码器和MLP是自包含的，可以在捕获后应用于图像，而无需访问相机ISP。此外，编码器和MLP解码器仅需180 KB的存储空间，并且可以方便地作为元数据保存在标准图像格式中，如JPEG和HEIC。

cs.CV / 82 / 2604.21904

UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

UniGenDet：一种统一的生成-判别框架用于共同进化的图像生成与生成图像检测

Zhang, Yanran, Zheng, Wenzhao, Li, Yifei, Yu, Bingyao, Zheng, Yu, Chen, Lei, Lu, Jiwen, Zhou, Jie

Abstract

In recent years, significant progress has been made in both image generation and generated image detection. Despite their rapid, yet largely independent, development, these two fields have evolved distinct architectural paradigms: the former predominantly relies on generative networks, while the latter favors discriminative frameworks. A recent trend in both domains is the use of adversarial information to enhance performance, revealing potential for synergy. However, the significant architectural divergence between them presents considerable challenges. Departing from previous approaches, we propose UniGenDet: a Unified generative-discriminative framework for co-evolutionary image Generation and generated image Detection. To bridge the task gap, we design a symbiotic multimodal self-attention mechanism and a unified fine-tuning algorithm. This synergy allows the generation task to improve the interpretability of authenticity identification, while authenticity criteria guide the creation of higher-fidelity images. Furthermore, we introduce a detector-informed generative alignment mechanism to facilitate seamless information exchange. Extensive experiments on multiple datasets demonstrate that our method achieves state-of-the-art performance. Code: https://github.com/Zhangyr2022/UniGenDet

Chinese Translation

近年来，图像生成和生成图像检测领域都取得了显著进展。尽管这两个领域的发展迅速且在很大程度上是独立的，但它们各自演变出了不同的架构范式：前者主要依赖生成网络，而后者则偏向于判别框架。最近，这两个领域都出现了利用对抗信息来提升性能的趋势，显示出潜在的协同效应。然而，它们之间显著的架构差异带来了相当大的挑战。与以往的方法不同，我们提出了UniGenDet：一种用于共同进化的图像生成与生成图像检测的统一生成-判别框架。为了弥合任务间的差距，我们设计了一种共生的多模态自注意力机制和一个统一的微调算法。这种协同作用使得生成任务能够提高真实性识别的可解释性，而真实性标准则指导更高保真度图像的生成。此外，我们引入了一种检测器引导的生成对齐机制，以促进无缝的信息交换。在多个数据集上的广泛实验表明，我们的方法达到了最先进的性能。代码：https://github.com/Zhangyr2022/UniGenDet

cs.CV / 83 / 2604.21909

Directional Confusions Reveal Divergent Inductive Biases Through Rate-Distortion Geometry in Human and Machine Vision

方向性混淆揭示人类和机器视觉中的不同归纳偏差，通过率失真几何进行分析

Caglar, Leyla Roksan, Mediano, Pedro A. M., Lin, Baihan

Abstract

Humans and modern vision models can reach similar classification accuracy while making systematically different kinds of mistakes - differing not in how often they err, but in who gets mistaken for whom, and in which direction. We show that these directional confusions reveal distinct inductive biases that are invisible to accuracy alone. Using matched human and deep vision model responses on a natural-image categorization task under 12 perturbation types, we quantify asymmetry in confusion matrices and link it to generalization geometry through a Rate-Distortion (RD) framework, summarized by three geometric signatures: slope (beta), curvature (kappa), and efficiency (AUC). We find that humans exhibit broad but weak asymmetries, whereas deep vision models show sparser, stronger directional collapses. Robustness training reduces global asymmetry but fails to recover the human-like breadth-strength profile of graded similarity. Mechanistic simulations further show that different asymmetry organizations shift the RD frontier in opposite directions, even when matched for performance. Together, these results position directional confusions and RD geometry as compact, interpretable signatures of inductive bias under distribution shift.

Chinese Translation

人类和现代视觉模型在分类准确性上可以达到相似的水平，但在错误类型上却存在系统性的差异：差异不在于出错的频率，而在于谁被误认为谁，以及误认的方向。我们展示了这些方向性混淆揭示了不同的归纳偏差，这些偏差仅通过准确性是无法观察到的。通过在12种扰动类型下对自然图像分类任务中匹配的人类和深度视觉模型反应进行分析，我们量化了混淆矩阵中的不对称性，并通过率失真（Rate-Distortion, RD）框架将其与泛化几何联系起来，概括为三个几何特征：斜率（beta）、曲率（kappa）和效率（AUC）。我们发现，人类表现出广泛但较弱的不对称性，而深度视觉模型则显示出更稀疏且更强的方向性崩溃。鲁棒性训练减少了全局不对称性，但未能恢复人类般的广度-强度相似性特征。机制模拟进一步表明，不同的不对称组织会将RD边界向相反方向移动，即使在性能匹配的情况下也是如此。综合来看，这些结果将方向性混淆和RD几何定位为在分布转移下归纳偏差的紧凑且可解释的特征。
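
As a concrete illustration of "who gets mistaken for whom, and in which direction", the sketch below scores how one-directional a confusion matrix is. The abstract does not define its asymmetry measure, so this normalized score and the name `directional_asymmetry` are assumptions, not the paper's metric.

```python
import numpy as np

def directional_asymmetry(counts):
    """Score in [0, 1]: 0 when every confusion i->j is matched by j->i,
    1 when each confused pair errs in one direction only (illustrative measure)."""
    counts = np.asarray(counts, dtype=float)
    # Row-normalize so entry (i, j) is the rate of answering j when the truth is i.
    rates = counts / counts.sum(axis=1, keepdims=True)
    upper = np.triu_indices_from(rates, k=1)
    forward = rates[upper]      # i -> j for i < j
    backward = rates.T[upper]   # j -> i for the same pairs
    total = (forward + backward).sum()
    if total == 0:
        return 0.0  # no confusions at all
    return float(np.abs(forward - backward).sum() / total)

symmetric_errors = [[8, 1, 1], [1, 8, 1], [1, 1, 8]]
one_way_errors = [[8, 2, 0], [0, 9, 1], [1, 0, 9]]
```

The two example matrices have comparable accuracy, yet the first scores 0 and the second scores 1, which is the kind of difference that accuracy alone cannot show.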

cs.CV / 84 / 2604.21911

When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

当提示超越视觉：大型视觉语言模型中的提示诱导幻觉

Khayatan, Pegah, Parekh, Jayneel, Dapogny, Arnaud, Shukor, Mustafa, Newson, Alasdair, Cord, Matthieu

Abstract

Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah-kh.github.io/projects/prompts-override-vision/ .

Chinese Translation

尽管大型视觉语言模型（LVLMs）的能力取得了显著进展，但这些系统仍然容易出现幻觉，即输出与视觉输入不相符的结果。先前的研究将LVLMs中的幻觉归因于视觉主干的局限性或语言组件的主导地位等因素，但这些因素的相对重要性仍不明确。为了解决这一模糊性，我们提出了HalluScope，一个基准测试，以更好地理解不同因素诱发幻觉的程度。我们的分析表明，幻觉主要源于对文本先验和背景知识的过度依赖，尤其是通过文本指令引入的信息。为了减轻由文本指令先验诱发的幻觉，我们提出了HalluVL-DPO，一个用于微调现成LVLMs以实现更具视觉基础的响应的框架。HalluVL-DPO利用偏好优化，使用我们构建的精选训练数据集，引导模型更倾向于选择有根据的响应而非幻觉响应。我们证明了优化后的模型有效减轻了目标幻觉失效模式，同时在其他幻觉基准和视觉能力评估中保持或提高了性能。为了支持可重复性和进一步的研究，我们将在https://pegah-kh.github.io/projects/prompts-override-vision/上公开发布我们的评估基准、偏好训练数据集和代码。
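
The abstract names preference optimization without giving the objective. A minimal sketch, assuming the standard DPO loss with the grounded response as "chosen" and the hallucinated response as "rejected"; the paper may use a variant, and the function and argument names below are hypothetical.

```python
import numpy as np

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective on summed response log-probabilities (assumed form).
    chosen = grounded response, rejected = hallucinated response."""
    policy_chosen_logp = np.asarray(policy_chosen_logp, dtype=float)
    policy_rejected_logp = np.asarray(policy_rejected_logp, dtype=float)
    ref_chosen_logp = np.asarray(ref_chosen_logp, dtype=float)
    ref_rejected_logp = np.asarray(ref_rejected_logp, dtype=float)

    # How much more the policy favors chosen over rejected than the reference does.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)) for numerical stability.
    return float(np.mean(np.logaddexp(0.0, -margin)))
```

When the policy equals the reference model the margin is zero and the loss is log 2. Raising the log-probability of the grounded response, or lowering that of the hallucinated one, pushes the loss below that value.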

cs.CV / 85 / 2604.21915

Vista4D: Video Reshooting with 4D Point Clouds

Vista4D：基于4D点云的视频重拍

Lin, Kuan Heng, Liu, Zhizheng, Salamanca, Pablo, Kant, Yash, Burgert, Ryan, Xu, Yuancheng, Namekata, Koichi, Zhao, Yiwei, Zhou, Bolei, Goldblum, Micah, Debevec, Paul, Yu, Ning

Abstract

We present Vista4D, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. Specifically, given an input video, our method re-synthesizes the scene with the same dynamics from a different camera trajectory and viewpoint. Existing video reshooting methods often struggle with depth estimation artifacts of real-world dynamic videos, while also failing to preserve content appearance and failing to maintain precise camera control for challenging new trajectories. We build a 4D-grounded point cloud representation with static pixel segmentation and 4D reconstruction to explicitly preserve seen content and provide rich camera signals, and we train with reconstructed multiview dynamic data for robustness against point cloud artifacts during real-world inference. Our results demonstrate improved 4D consistency, camera control, and visual quality compared to state-of-the-art baselines under a variety of videos and camera paths. Moreover, our method generalizes to real-world applications such as dynamic scene expansion and 4D scene recomposition. See our project page for results, code, and models: https://eyeline-labs.github.io/Vista4D

Chinese Translation

我们提出了Vista4D，这是一个稳健且灵活的视频重拍框架，它将输入视频和目标相机基于4D点云进行定位。具体而言，给定一个输入视频，我们的方法从不同的相机轨迹和视角重新合成具有相同动态的场景。现有的视频重拍方法常常在真实世界动态视频的深度估计伪影方面遇到困难，同时也未能保持内容外观，并且在应对具有挑战性的新轨迹时无法维持精确的相机控制。我们构建了一个基于4D的点云表示，结合静态像素分割和4D重建，以明确保留已见内容并提供丰富的相机信号，并且我们使用重建的多视角动态数据进行训练，以增强在真实世界推理过程中对点云伪影的鲁棒性。我们的结果显示，在各种视频和相机路径下，相较于最先进的基线方法，我们在4D一致性、相机控制和视觉质量方面都有所改善。此外，我们的方法可以推广到动态场景扩展和4D场景重组等真实世界应用。有关结果、代码和模型，请访问我们的项目页面：https://eyeline-labs.github.io/Vista4D

cs.CV / 86 / 2604.21921

Context Unrolling in Omni Models

全方位模型中的上下文展开

Yang, Ceyuan, Lin, Zhijie, Zhao, Yang, Xiao, Fei, He, Hao, Zhao, Qi, Deng, Chaorui, Li, Kunchang, Ding, Zihan, Guo, Yuwei, Wang, Fuyun, Zhu, Fangqi, Nie, Xiaonan, Zhu, Shenhan, Lin, Shanchuan, Li, Hongsheng, Huang, Weilin, Shi, Guang, Fan, Haoqi

Abstract

We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, image, video, and 3D geometry.

Chinese Translation

我们提出了Omni，一个统一的多模态模型，原生训练于多种模态，包括文本、图像、视频、3D几何和隐藏表示。我们发现这样的训练使得上下文展开成为可能，在这一过程中，模型在生成预测之前明确地跨多个模态表示进行推理。这个过程使得模型能够聚合异构模态之间的互补信息，从而更真实地近似共享的多模态知识流形，并提高下游推理的准确性。因此，Omni在多模态生成和理解基准测试中均表现出色，同时展现出先进的多模态推理能力，包括文本、图像、视频和3D几何的上下文生成。


cs.CV / 87 / 2604.21926

Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

无视之见：基于可穿戴惯性测量单元的4D人类场景理解

Hsu, Hao-Yu, Cheng, Tianhang, Wen, Jing, Schwing, Alexander G., Wang, Shenlong

Abstract

Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision. Its goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. For this we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D uses data from a few inertial sensors from earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than SoTA cascaded pipelines, suggesting wearable motion sensors alone can support rich 4D understanding.

Chinese Translation

理解人类活动及其周围环境通常依赖于视觉感知，然而相机在隐私、安全、能效和可扩展性方面存在持续挑战。我们探索了一种替代方案：无视觉的4D感知。其目标是仅通过日常可穿戴传感器重建人类运动和3D场景布局。为此，我们引入了IMU-to-4D，一个将大型语言模型重新用于非视觉时空理解人类场景动态的框架。IMU-to-4D利用来自耳机、手表或智能手机的少量惯性传感器的数据，预测详细的4D人类运动以及粗略的场景结构。在多样化的人类场景数据集上的实验表明，IMU-to-4D产生的结果比最新的级联管道更连贯且时间上更稳定，表明仅靠可穿戴运动传感器即可支持丰富的4D理解。


cs.CV / 88 / 2604.21931

Seeing Fast and Slow: Learning the Flow of Time in Videos

快速与缓慢的观察：学习视频中的时间流动

Wu, Yen-Siang, Luo, Rundong, Zhu, Jingsen, Tu, Tao, Farhadi, Ali, Wallingford, Matthew, Wang, Yu-Chiang Frank, Marschner, Steve, Ma, Wei-Chiu

Abstract

How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at a specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world-models that understand how events unfold over time.

Chinese Translation

我们如何判断一个视频是加速还是减速的？我们如何生成不同速度的视频？尽管视频在现代计算机视觉研究中占据了核心地位，但对时间流逝的感知和控制却鲜有关注。在本文中，我们将时间视为一个可学习的视觉概念，并开发模型用于推理和操控视频中的时间流动。我们首先利用视频中自然存在的多模态线索和时间结构，以自监督的方式学习检测速度变化和估计播放速度。然后，我们展示这些学习到的时间推理模型使我们能够从嘈杂的野外来源中策划出迄今为止最大的慢动作视频数据集。这些慢动作镜头通常由高速摄像机拍摄，包含比标准视频更丰富的时间细节。利用这些数据，我们进一步开发了能够进行时间控制的模型，包括速度条件的视频生成，能够以指定的播放速度产生运动，以及时间超分辨率，将低帧率、模糊的视频转化为具有细致时间细节的高帧率序列。我们的研究结果强调了时间作为视频学习中一个可操控的感知维度，开启了时间可控视频生成、时间取证检测以及可能更丰富的世界模型的研究，这些模型理解事件如何随着时间的推移展开。
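The abstract above mentions learning playback speed in a self-supervised manner but does not say how the training signal is built. A minimal sketch of one common pretext task is shown here, assuming speed labels are produced by subsampling frames at a known stride. The function names, the speed set `(1, 2, 4)`, and the clip length are all hypothetical and are not taken from the paper.

```python
import random

def resample_clip(frames, speed, clip_len):
    """Take clip_len frames from `frames`, stepping `speed` frames at a time."""
    span = (clip_len - 1) * speed + 1  # number of source frames the clip covers
    if span > len(frames):
        raise ValueError("video too short for this speed and clip length")
    start = random.randrange(len(frames) - span + 1)
    return [frames[start + i * speed] for i in range(clip_len)]

def make_speed_example(frames, speeds=(1, 2, 4), clip_len=8):
    """Return (clip, label); the label indexes the stride used, so no human labels are needed."""
    label = random.randrange(len(speeds))
    return resample_clip(frames, speeds[label], clip_len), label

# Integers stand in for decoded video frames (assumption for illustration).
frames = list(range(100))
clip, label = make_speed_example(frames)
```

A classifier trained to predict `label` from `clip` would be one way to obtain a speed estimator; whether the paper does this is not stated in the abstract.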


人工智能 (Artificial Intelligence)

cs.AI / 1 / 2604.20862

Architecture of an AI-Based Automated Course of Action Generation System for Military Operations

基于人工智能的军事行动自动化方案生成系统架构

Park, Ji-il, Shim, Inwook, Kim, Chong Hui

Abstract

Automated Course of Action (CoA) planning is an essential element of future warfare. As maneuver speeds increase, surveillance ranges extend, and weapon ranges grow, the operational area expands, making traditional human-staffed CoA planning increasingly challenging. Consequently, several countries and defense organizations are actively developing AI-based automated CoA planning systems. However, because these systems are military-related, security restrictions limit public disclosure, and their technical maturity remains difficult to assess. In response, this study introduces relevant doctrines within the scope of publicly available information and presents applicable AI technologies for each stage of the CoA planning process. Ultimately, it proposes an architecture for the development of an automated CoA planning system.

Chinese Translation

行动方案（CoA）规划的自动化系统是未来战争中的一个重要组成部分。随着机动速度的提高、监视范围的扩大和武器射程的增长，作战区域也在不断扩大，这使得传统的基于人力的行动方案规划变得愈加困难。因此，开发基于人工智能的自动化行动方案规划系统的必要性日益增加。为此，多个国家和防务组织正在积极开发基于人工智能的行动方案规划系统。然而，由于安全限制和公开披露的有限性，这些系统的技术成熟度仍然难以评估。此外，由于这些系统与军事相关，其细节并未公开披露，使得准确评估当前的发展水平变得困难。对此，本研究旨在在公开可用信息的范围内介绍相关的理论，并为行动方案规划过程的每个阶段提出适用的人工智能技术。最终，本文提出了一个自动化行动方案规划系统开发的架构。


cs.AI / 2 / 2604.20972

Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

逃离协议陷阱：评估规则驱动人工智能的可辩护性信号

O'Herlihy, Michael, Català, Rosa

Abstract

Content moderation systems are typically evaluated by measuring agreement with human labels. In rule-governed environments this assumption fails: multiple decisions may be logically consistent with the governing policy, and agreement metrics penalize valid decisions while mischaracterizing ambiguity as error - a failure mode we term the Agreement Trap. We formalize evaluation as policy-grounded correctness and introduce the Defensibility Index (DI) and Ambiguity Index (AI). To estimate reasoning stability without additional audit passes, we introduce the Probabilistic Defensibility Signal (PDS), derived from audit-model token logprobs. We harness LLM reasoning traces as a governance signal rather than a classification output by deploying the audit model not to decide whether content violates policy, but to verify whether a proposed decision is logically derivable from the governing rule hierarchy. We validate the framework on 193,000+ Reddit moderation decisions across multiple communities and evaluation cohorts, finding a 33-46.6 percentage-point gap between agreement-based and policy-grounded metrics, with 79.8-80.6% of the model's false negatives corresponding to policy-grounded decisions rather than true errors. We further show that measured ambiguity is driven by rule specificity: auditing 37,286 identical decisions under three tiers of the same community rules reduces AI by 10.8 pp while DI remains stable. Repeated-sampling analysis attributes PDS variance primarily to governance ambiguity rather than decoding noise. A Governance Gate built on these signals achieves 78.6% automation coverage with 64.9% risk reduction. Together, these results show that evaluation in rule-governed environments should shift from agreement with historical labels to reasoning-grounded validity under explicit rules.

Chinese Translation

内容审核系统通常通过测量与人工标签的一致性来进行评估。在规则驱动的环境中，这一假设失效：多个决策可能与治理政策在逻辑上是一致的，而一致性指标则惩罚有效决策，同时将模糊性误判为错误——我们称之为协议陷阱（Agreement Trap）的失效模式。我们将评估形式化为基于政策的正确性，并引入可辩护性指数（Defensibility Index, DI）和模糊性指数（Ambiguity Index, AI）。为了在不增加额外审计的情况下估计推理的稳定性，我们引入了概率可辩护性信号（Probabilistic Defensibility Signal, PDS），该信号源自审计模型的令牌对数概率（token logprobs）。我们利用大型语言模型（LLM）的推理轨迹作为治理信号，而不是分类输出，通过部署审计模型来验证提议的决策是否可以从治理规则层级逻辑推导，而不是决定内容是否违反政策。我们在超过193,000个Reddit审核决策中验证了该框架，涵盖多个社区和评估群体，发现基于一致性和基于政策的指标之间存在33-46.6个百分点的差距，其中79.8-80.6%的模型假阴性对应于基于政策的决策，而非真实错误。我们进一步表明，测量的模糊性是由规则的具体性驱动的：在同一社区规则的三个层级下审计37,286个相同决策使AI降低了10.8个百分点，而DI保持稳定。重复抽样分析将PDS的方差主要归因于治理模糊性，而非解码噪声。基于这些信号构建的治理门（Governance Gate）实现了78.6%的自动化覆盖率，并降低了64.9%的风险。综合这些结果表明，在规则驱动的环境中，评估应从与历史标签的一致性转向基于明确规则的推理有效性。


cs.AI / 3 / 2604.20987

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

共同进化的 LLM 决策与技能库代理用于长时间任务

Wu, Xiyang, Li, Zongxia, Shi, Guangyao, Duffy, Alexander, Marques, Tyler, Olson, Matthew Lyle, Zhou, Tianyi, Manocha, Dinesh

Abstract

Long-horizon interactive environments are a testbed for evaluating agents' skill-usage abilities. These environments demand multi-step reasoning, the chaining of multiple skills over many timesteps, and robust decision making under delayed rewards and partial observability. Games are a good testbed for evaluating agent skill usage in such environments. Large Language Models (LLMs) offer a promising alternative as game-playing agents, but they often struggle with consistent long-horizon decision making because they lack a mechanism to discover, retain, and reuse structured skills across episodes. We present COSPLAY, a co-evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action taking, while an agent-managed skill pipeline discovers reusable skills from the agent's unlabeled rollouts to form a skill bank. Our framework improves both agents: the decision agent learns better skill retrieval and action generation, while the skill bank agent continually extracts, refines, and updates skills together with their contracts. Experiments across six game environments show that COSPLAY with an 8B base model achieves over 25.1 percent average reward improvement against four frontier LLM baselines on single-player game benchmarks while remaining competitive on multi-player social reasoning games.

Chinese Translation

长时间交互环境是评估代理技能使用能力的试验平台。这些环境要求多步推理，在多个时间步中链式应用多种技能，并在延迟奖励和部分可观察性下进行稳健的决策。游戏是评估代理在环境中技能使用的良好试验平台。大型语言模型（LLMs）作为游戏代理提供了一个有前景的替代方案，但它们在一致的长时间决策方面常常面临挑战，因为它们缺乏发现、保留和重用跨回合结构化技能的机制。我们提出了 COSPLAY，一个共同进化框架，其中 LLM 决策代理从可学习的技能库中检索技能以指导行动，而一个管理技能管道的代理则从代理的无标签回合中发现可重用技能以形成技能库。我们的框架改善了决策代理在技能检索和行动生成方面的学习，同时技能库代理持续提取、精炼和更新技能及其契约。在六个游戏环境中的实验表明，使用 8B 基础模型的 COSPLAY 在单人游戏基准测试中相较于四个前沿 LLM 基线实现了超过 25.1% 的平均奖励提升，同时在多人社交推理游戏中保持了竞争力。


cs.AI / 4 / 2604.20995

Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

价值冲突诊断揭示语言模型中普遍存在的对齐伪装现象

Nair, Inderjeet, Ruan, Jie, Wang, Lu

Abstract

Alignment faking, where a model behaves aligned with developer policy when monitored but reverts to its own preferences when unobserved, is a concerning yet poorly understood phenomenon, in part because current diagnostic tools remain limited. Prior diagnostics rely on highly toxic and clearly harmful scenarios, causing most models to refuse immediately. As a result, models never deliberate over developer policy, monitoring conditions, or the consequences of non-compliance, making these diagnostics fundamentally unable to detect alignment faking propensity. To support study of this phenomenon, we first introduce VLAF, a diagnostic framework grounded in the hypothesis that alignment faking is most likely when developer policy conflicts with a model's strongly held values. VLAF uses morally unambiguous scenarios to probe this conflict across diverse moral values, bypassing refusal behavior while preserving meaningful deliberative stakes. Using VLAF, we find that alignment faking is substantially more prevalent than previously reported, occurring in models as small as 7B parameters - with olmo2-7b-instruct faking alignment in 37% of cases. We further show that oversight conditions induce activation shifts that lie along a single direction in representation space. This means the behavioral divergence driving alignment faking can be captured by a single contrastive steering vector. Finally, we exploit this for lightweight inference-time mitigation that requires no labeled data and minimal computational overhead, achieving relative reductions in alignment faking of 85.8%, 94.0%, and 57.7% on olmo2-7b-instruct, olmo2-13b-instruct, and qwen3-8b respectively.

Chinese Translation

对齐伪装是一种现象，模型在被监控时表现出与开发者政策一致的行为，但在未被观察时则回归自身偏好。这是一种令人担忧但尚未充分理解的现象，部分原因在于当前的诊断工具仍然有限。以往的诊断依赖于高度有毒且明显有害的场景，导致大多数模型立即拒绝响应。因此，模型从未对开发者政策、监控条件或不遵从的后果进行深思，使得这些诊断在根本上无法检测对齐伪装的倾向。为了支持对这一现象的研究，我们首先引入了VLAF，一个基于假设的诊断框架，认为当开发者政策与模型的强烈价值观发生冲突时，对齐伪装的可能性最大。VLAF使用道德上不模糊的场景来探测这种冲突，涵盖多样的道德价值观，绕过拒绝行为，同时保持有意义的深思 stakes。使用VLAF，我们发现对齐伪装的发生率远高于之前的报告，在参数量仅为7B的模型中也出现了这种现象——olmo2-7b-instruct在37%的情况下表现出对齐伪装。最后，我们展示了监督条件引发的激活变化沿着表示空间中的单一方向。这意味着驱动对齐伪装的行为差异可以通过单一的对比引导向量来捕捉，我们利用这一点进行轻量级的推理时缓解。最终，我们利用这一点进行无需标记数据和最小计算开销的缓解，实现了在olmo2-7b-instruct、olmo2-13b-instruct和qwen3-8b上对齐伪装相对减少分别为85.8%、94.0%和57.7%。
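The abstract above says the monitored/unmonitored divergence lies along a single direction that a contrastive steering vector can capture, but it does not give the construction. A minimal sketch follows, assuming the vector is the difference of mean activations between the two conditions and that mitigation removes an activation's component along it. Both choices and all function names are assumptions, not the authors' stated method.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(monitored, unmonitored):
    """Difference of means between the two oversight conditions (assumed construction)."""
    mu_m = mean_vector(monitored)
    mu_u = mean_vector(unmonitored)
    return [a - b for a, b in zip(mu_m, mu_u)]

def remove_direction(activation, direction):
    """Subtract the projection of `activation` onto `direction`."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        return list(activation)
    coef = sum(a * d for a, d in zip(activation, direction)) / norm_sq
    return [a - coef * d for a, d in zip(activation, direction)]

# Toy 2-D activations (made-up numbers for illustration only).
monitored = [[1.0, 0.0], [3.0, 0.0]]
unmonitored = [[0.0, 0.0], [0.0, 0.0]]
direction = steering_vector(monitored, unmonitored)
steered = remove_direction([5.0, 7.0], direction)
```

In this toy case the direction is the first axis, so the steered activation keeps only its second coordinate. A real setup would read activations from a chosen model layer, which the abstract does not specify.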


cs.AI / 5 / 2604.21003

The Last Harness You'll Ever Build

你将构建的最后一个工具

Seong, Haebin, Yin, Li, Zhang, Haoran

Abstract

AI agents are increasingly deployed on complex, domain-specific workflows -- navigating enterprise web applications that require dozens of clicks and form fills, orchestrating multi-step research pipelines that span search, extraction, and synthesis, automating code review across unfamiliar repositories, and handling customer escalations that demand nuanced domain knowledge. Each new task domain requires painstaking, expert-driven harness engineering: designing the prompts, tools, orchestration logic, and evaluation criteria that make a foundation model effective. We present a two-level framework that automates this process. At the first level, the Harness Evolution Loop optimizes a worker agent's harness $\mathcal{H}$ for a single task: a Worker Agent $W_{\mathcal{H}}$ executes the task, an Evaluator Agent $V$ adversarially diagnoses failures and scores performance, and an Evolution Agent $E$ modifies the harness based on the full history of prior attempts. At the second level, the Meta-Evolution Loop optimizes the evolution protocol $\Lambda = (W_{\mathcal{H}}, \mathcal{H}^{(0)}, V, E)$ itself across diverse tasks, learning a protocol $\Lambda^{(\text{best})}$ that enables rapid harness convergence on any new task -- so that adapting an agent to a novel domain requires no human harness engineering at all. We formalize the correspondence to meta-learning and present both algorithms. The framework shifts manual harness engineering into automated harness engineering, and takes one step further -- automating the design of the automation itself.

Chinese Translation

人工智能代理越来越多地被部署在复杂的特定领域工作流程中——在需要数十次点击和表单填写的企业网络应用程序中导航，协调跨越搜索、提取和综合的多步骤研究流程，自动化对不熟悉代码库的代码审查，以及处理需要细致领域知识的客户升级请求。每一个新的任务领域都需要费力的、专家驱动的工具工程：设计使基础模型有效的提示、工具、协调逻辑和评估标准。我们提出了一个两级框架来自动化这一过程。在第一级，工具演化循环优化工人代理的工具 $\mathcal{H}$ 以完成单一任务：工人代理 $W_{\mathcal{H}}$ 执行任务，评估代理 $V$ 对失败进行对抗性诊断并评分，演化代理 $E$ 根据之前尝试的完整历史修改工具。在第二级，元演化循环优化演化协议 $\Lambda = (W_{\mathcal{H}}, \mathcal{H}^{(0)}, V, E)$ 本身，以应对多样化的任务，学习一个协议 $\Lambda^{(\text{best})}$，使得在任何新任务上快速收敛工具成为可能——从而使得将代理适应于新领域完全不需要人工工具工程。我们形式化了与元学习的对应关系，并展示了这两个算法。该框架将手动工具工程转变为自动化工具工程，并进一步迈出一步——自动化设计自动化本身。
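The first-level loop described above (worker executes, evaluator scores and diagnoses, evolution agent edits the harness from the full history) can be sketched as plain control flow. This is an illustrative skeleton under stated assumptions only: the agents are passed in as ordinary callables, and the stopping rule, `max_rounds`, `target`, and the toy agents are hypothetical, not details from the paper.

```python
def harness_evolution_loop(harness, worker, evaluator, evolver,
                           max_rounds=5, target=1.0):
    """Iterate worker -> evaluator -> evolver, keeping the best-scoring harness."""
    history = []
    best_score, best_harness = float("-inf"), harness
    for _ in range(max_rounds):
        output = worker(harness)              # W_H executes the task
        score, diagnosis = evaluator(output)  # V scores and diagnoses failures
        history.append((harness, output, score, diagnosis))
        if score > best_score:
            best_score, best_harness = score, harness
        if score >= target:
            break
        harness = evolver(history)            # E sees the full attempt history
    return best_harness, best_score, history

# Toy stand-ins: the "harness" is an integer and quality saturates at 3.
def toy_worker(harness):
    return harness

def toy_evaluator(output):
    score = min(1.0, output / 3)
    return score, f"quality {score:.2f}"

def toy_evolver(history):
    last_harness = history[-1][0]
    return last_harness + 1

best_harness, best_score, history = harness_evolution_loop(
    0, toy_worker, toy_evaluator, toy_evolver)
```

The second-level Meta-Evolution Loop would wrap a function like this one and search over the protocol itself; that outer search is left out because the abstract gives no detail on how it is performed.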


cs.AI / 6 / 2604.21006

Deep FinResearch Bench: Evaluating AI's Ability to Conduct Professional Financial Investment Research

深度金融研究基准：评估人工智能进行专业金融投资研究的能力

Haque, Mirazul, Papadimitriou, Antony, Mensah, Samuel, Ma, Zhiqiang, Guo, Zhijin, Sain, Joy Prakash, Kaur, Simerjot, Smiley, Charese, Liu, Xiaomo

Abstract

We introduce Deep FinResearch Bench, a practical and comprehensive evaluation framework for deep research (DR) agents in financial investment research. The benchmark assesses three dimensions of report quality: qualitative rigor, quantitative forecasting and valuation accuracy, and claim credibility and verifiability. Particularly, we define corresponding qualitative and quantitative evaluation metrics and implement an automated scoring procedure to enable scalable assessment. Applying the benchmark to financial reports from frontier DR agents and comparing them with reports authored by financial professionals, we find that AI-generated reports still fall short across these dimensions. These findings underscore the need for domain-specialized DR agents tailored to finance, and we hope the work establishes a foundation for standardized benchmarking of DR agents in financial research.

Chinese Translation

我们引入了深度金融研究基准（Deep FinResearch Bench），这是一个实用且全面的评估框架，用于评估金融投资研究中的深度研究（DR）代理。该基准评估报告质量的三个维度：定性严谨性、定量预测和估值准确性，以及主张的可信度和可验证性。特别地，我们定义了相应的定性和定量评估指标，并实施了自动评分程序，以实现可扩展的评估。将该基准应用于来自前沿DR代理的金融报告，并将其与金融专业人士撰写的报告进行比较，我们发现AI生成的报告在这些维度上仍然存在不足。这些发现强调了针对金融领域量身定制的专业化DR代理的必要性，我们希望这项工作为金融研究中DR代理的标准化基准测试奠定基础。


cs.AI / 7 / 2604.21018

Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations

自适应测试时计算分配与演变中的上下文示范

Zuo, Bowen, Zhou, Dongruo, Zhu, Yinglun

Abstract

While scaling test-time compute can substantially improve model performance, existing approaches either rely on static compute allocation or sample from fixed generation distributions. In this work, we introduce a test-time compute allocation framework that jointly adapts where computation is spent and how generation is performed. Our method begins with a warm-up phase that identifies easy queries and assembles an initial pool of question-response pairs from the test set itself. An adaptive phase then concentrates further computation on unresolved queries while reshaping their generation distributions through evolving in-context demonstrations -- conditioning each generation on successful responses from semantically related queries rather than resampling from a fixed distribution. Experiments across math, coding, and reasoning benchmarks demonstrate that our approach consistently outperforms existing baselines while consuming substantially less inference-time compute.

Chinese Translation

尽管扩大测试时计算可以显著提高模型性能，但现有方法要么依赖于静态计算分配，要么从固定生成分布中抽样。在本研究中，我们提出了一种测试时计算分配框架，该框架共同适应计算的支出位置和生成的执行方式。我们的方法首先经过一个热身阶段，识别简单查询并从测试集中组装初始的问题-响应对池。接下来的自适应阶段则将进一步的计算集中在未解决的查询上，同时通过演变中的上下文示范重新塑造它们的生成分布——使每次生成基于语义相关查询的成功响应，而不是从固定分布中重新抽样。跨数学、编码和推理基准的实验表明，我们的方法在消耗显著更少的推理时间计算的同时，始终优于现有基线。
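The two phases in the abstract above (a warm-up that collects solved queries, then an adaptive phase that conditions new attempts on solved, related queries) can be sketched as a budgeted loop. This is a rough illustration, not the paper's algorithm: `generate`, `is_solved`, and `similarity` are hypothetical callables, and how success is actually detected at test time is not stated in the abstract.

```python
def allocate_compute(queries, generate, is_solved, similarity, budget, k=2):
    """Spend `budget` generation calls; return {query: accepted response}."""
    solved = {}
    # Warm-up: one attempt per query with no demonstrations.
    for q in queries:
        if budget <= 0:
            break
        response = generate(q, [])
        budget -= 1
        if is_solved(q, response):
            solved[q] = response
    # Adaptive phase: spend what is left only on unresolved queries.
    while budget > 0 and len(solved) < len(queries):
        for q in [q for q in queries if q not in solved]:
            if budget <= 0:
                break
            # Use the k most similar solved pairs as in-context demonstrations.
            demos = sorted(solved.items(),
                           key=lambda item: similarity(q, item[0]),
                           reverse=True)[:k]
            response = generate(q, demos)
            budget -= 1
            if is_solved(q, response):
                solved[q] = response
    return solved

# Toy stand-ins: even numbers are "easy"; odd ones need a demonstration.
def toy_generate(q, demos):
    return q * 2 if (q % 2 == 0 or demos) else None

def toy_is_solved(q, response):
    return response == q * 2

def toy_similarity(a, b):
    return -abs(a - b)

solved = allocate_compute([1, 2, 3, 4], toy_generate, toy_is_solved,
                          toy_similarity, budget=10)
```

Because the demonstration pool grows as queries are solved, later attempts see more examples than earlier ones, which is one reading of "evolving" in-context demonstrations.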


cs.AI / 8 / 2604.21027

HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

HypEHR：电子健康记录的双曲建模以实现高效问答

Liu, Yuyu, Patil, Sarang Rajendra, Xu, Mengjia, Ma, Tengfei

Abstract

Electronic health record (EHR) question answering is often handled by LLM-based pipelines that are costly to deploy and do not explicitly leverage the hierarchical structure of clinical data. Motivated by evidence that medical ontologies and patient trajectories exhibit hyperbolic geometry, we propose HypEHR, a compact Lorentzian model that embeds codes, visits, and questions in hyperbolic space and answers queries via geometry-consistent cross-attention with type-specific pointer heads. HypEHR is pretrained with next-visit diagnosis prediction and hierarchy-aware regularization to align representations with the ICD ontology. On two MIMIC-IV-based EHR-QA benchmarks, HypEHR approaches LLM-based methods while using far fewer parameters. Our code is publicly available at https://github.com/yuyuliu11037/HypEHR.

Chinese Translation

电子健康记录（EHR）问答通常由基于大语言模型（LLM）的管道处理，这些管道的部署成本高，并且未能明确利用临床数据的层次结构。基于医学本体和患者轨迹呈现双曲几何的证据，我们提出了HypEHR，这是一种紧凑的洛伦兹模型，将代码、就诊和问题嵌入双曲空间，并通过几何一致的交叉注意力与特定类型的指针头回答查询。HypEHR通过下次就诊诊断预测和层次感知正则化进行预训练，以使表示与国际疾病分类（ICD）本体对齐。在两个基于MIMIC-IV的EHR-QA基准上，HypEHR在使用远少于参数的情况下接近基于LLM的方法。我们的代码已公开，网址为 https://github.com/yuyuliu11037/HypEHR。
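For readers unfamiliar with the Lorentz (hyperboloid) model named in the abstract above, the standard geodesic distance can be written in a few lines. The sketch assumes curvature -1 and uses hypothetical helper names. It shows the textbook geometry only, not HypEHR's attention or pointer heads.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum of the remaining coordinate products."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lift_to_hyperboloid(v):
    """Map a Euclidean vector v to the hyperboloid <x, x>_L = -1 (curvature -1 assumed)."""
    return [math.sqrt(1.0 + sum(a * a for a in v))] + list(v)

def lorentz_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L); the clamp guards against rounding below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

origin = lift_to_hyperboloid([0.0, 0.0])
point = lift_to_hyperboloid([math.sinh(1.0), 0.0])
distance = lorentz_distance(origin, point)
```

The example point is chosen so that its distance from the origin is 1 up to floating-point error. Distances grow roughly logarithmically in the Euclidean norm of the lifted vector, which is why tree-like hierarchies such as code ontologies are often embedded this way.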


cs.AI / 9 / 2604.21036

Who Defines Fairness? Target-Based Prompting for Demographic Representation in Generative Models

谁来定义公平？基于目标的提示在生成模型中的人口代表性

Nizam, Marzia Binta, Davis, James

Abstract

Text-to-image (T2I) models like Stable Diffusion and DALL-E have made generative AI widely accessible, yet recent studies reveal that these systems often replicate societal biases, particularly in how they depict demographic groups across professions. Prompts such as 'doctor' or 'CEO' frequently yield lighter-skinned outputs, while lower-status roles like 'janitor' show more diversity, reinforcing stereotypes. Existing mitigation methods typically require retraining or curated datasets, making them inaccessible to most users. We propose a lightweight, inference-time framework that mitigates representational bias through prompt-level intervention without modifying the underlying model. Instead of assuming a single definition of fairness, our approach allows users to select among multiple fairness specifications, ranging from simple choices such as a uniform distribution to more complex definitions informed by a large language model (LLM) that cites sources and provides confidence estimates. These distributions guide the construction of demographic-specific prompt variants in the corresponding proportions, and we evaluate alignment by auditing adherence to the declared target and measuring the resulting skin-tone distribution rather than assuming uniformity as 'fairness'. Across 36 prompts spanning 30 occupations and 6 non-occupational contexts, our method shifts observed skin-tone outcomes in directions consistent with the declared target, and reduces deviation from targets when the target is defined directly in skin-tone space (fallback). This work demonstrates how fairness interventions can be made transparent, controllable, and usable at inference time, directly empowering users of generative AI.

Chinese Translation

文本到图像（T2I）模型如稳定扩散（Stable Diffusion）和DALL-E使生成性人工智能变得广泛可及，但最近的研究表明，这些系统往往会复制社会偏见，特别是在它们描绘不同职业的人口群体时。诸如“医生”或“首席执行官”的提示常常产生肤色较浅的输出，而“清洁工”等低地位角色则显示出更多的多样性，从而强化了刻板印象。现有的缓解方法通常需要重新训练或使用经过筛选的数据集，使得大多数用户无法访问。我们提出了一种轻量级的推理时框架，通过提示级干预来减轻表现偏见，而无需修改底层模型。我们的做法并不假设公平的单一定义，而是允许用户在多种公平规范中进行选择——从简单的均匀分布选择到更复杂的定义，这些定义由大型语言模型（LLM）提供支持，引用来源并提供置信度估计。这些分布指导相应比例的人口特定提示变体的构建，我们通过审计对声明目标的遵循情况以及测量结果的肤色分布来评估一致性，而不是假设均匀性即为“公平”。在涵盖30个职业和6个非职业背景的36个提示中，我们的方法将观察到的肤色结果转变为与声明目标一致的方向，并在目标直接在肤色空间中定义时（回退）减少偏离目标的情况。这项工作展示了如何使公平干预变得透明、可控，并在推理时可用，直接赋能生成性人工智能的用户。
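Building prompt variants "in the corresponding proportions" comes down to turning a target distribution into whole-number image counts. A minimal sketch is given below, assuming largest-remainder rounding and a made-up prompt template. The group labels, the template, and the function name are illustrative and not taken from the paper.

```python
def allocate_prompts(occupation, target, n_images):
    """Split n_images across groups in proportion to `target` (largest-remainder rounding)."""
    total = sum(target.values())
    quotas = {group: n_images * weight / total for group, weight in target.items()}
    counts = {group: int(quota) for group, quota in quotas.items()}
    leftover = n_images - sum(counts.values())
    # Give the remaining images to the groups with the largest fractional parts.
    by_remainder = sorted(target, key=lambda g: quotas[g] - counts[g], reverse=True)
    for group in by_remainder[:leftover]:
        counts[group] += 1
    # Hypothetical template; a real system would phrase variants more carefully.
    return [(f"a photo of a {group} {occupation}", count)
            for group, count in counts.items()]

# Target given as relative weights (assumed values for illustration).
target = {"light-skinned": 5, "medium-skinned": 3, "dark-skinned": 2}
plan = allocate_prompts("doctor", target, 7)
```

Auditing, as the abstract describes it, would then compare the skin-tone distribution measured on the generated images against `target` rather than against a uniform distribution.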


cs.AI / 10 / 2604.21044

Active Data

主动数据

Arthur, Richard, DiDomizio, Virginia, Hoebel, Louis

Abstract

In some complex domains, certain problem-specific decompositions can provide advantages over monolithic designs by enabling comprehension and specification of the design. In this paper we present an intuitive and tractable approach to reasoning over large and complex data sets. Our approach is based on Active Data, i.e., data as atomic objects that actively interact with environments. We describe our intuition about how this bottom-up approach improves designs confronting computational and conceptual complexity. We describe an implementation of the base Active Data concepts within the air traffic flow management domain and discuss performance for this implementation.

Chinese Translation

在一些复杂领域，特定问题的分解可以通过使设计的理解和规范化提供相对于整体设计的优势。本文提出了一种直观且易于处理的方法，用于对大型复杂数据集进行推理。我们的方法基于主动数据（Active Data），即将数据视为与环境积极互动的原子对象。我们描述了这种自下而上的方法如何改善面对计算和概念复杂性的设计。我们在空中交通流量管理领域中实现了基础主动数据概念，并讨论了该实现的性能。


cs.AI / 11 / 2604.21061

InVitroVision: a Multi-Modal AI Model for Automated Description of Embryo Development using Natural Language

InVitroVision：一种多模态人工智能模型，用于自动描述胚胎发育的自然语言

Neu, Nicklas, Ebner, Thomas, Primus, Jasmin, Zefferer, Raphael, Schenkenfelder, Bernhard, Brunbauer, Mathias, Kromp, Florian

Abstract

The application of artificial intelligence (AI) in IVF has shown promise in improving consistency and standardization of decisions, but often relies on annotated data and does not make use of the multimodal nature of IVF data. We investigated whether foundational vision-language models can be fine-tuned to predict natural language descriptions of embryo morphology and development. Using a publicly available embryo time-lapse dataset, we fine-tuned PaliGemma-2, a multi-modal vision-language model, with only 1,000 images and corresponding captions, describing embryo morphology, embryonic cell cycle and developmental stage. Our results show that the fine-tuned model, InVitroVision, outperformed a commercial model, ChatGPT 5.2, and base models in overall metrics, with performance improving with larger training datasets. This study demonstrates the potential of foundational vision-language models to generalize to IVF tasks with limited data, enabling the prediction of natural language descriptions of embryo morphology and development. This approach may facilitate the use of large language models to retrieve information and scientific evidence from relevant publications and guidelines, and has implications for few-shot adaptation to multiple downstream tasks in IVF.

Chinese Translation

人工智能（AI）在试管婴儿（IVF）中的应用显示出在提高决策一致性和标准化方面的潜力，但通常依赖于标注数据，并未充分利用IVF数据的多模态特性。我们研究了基础视觉-语言模型是否可以进行微调，以预测胚胎形态和发育的自然语言描述。使用一个公开可用的胚胎时间推移数据集，我们仅用1,000张图像及其对应的描述性文字，对多模态视觉-语言模型PaliGemma-2进行了微调，描述了胚胎形态、胚胎细胞周期和发育阶段。我们的结果表明，微调后的模型InVitroVision在整体指标上优于商业模型ChatGPT 5.2和基础模型，并且随着训练数据集的增大，性能有所提升。本研究展示了基础视觉-语言模型在有限数据情况下对IVF任务的泛化潜力，使得能够预测胚胎形态和发育的自然语言描述。这种方法可能有助于利用大型语言模型从相关出版物和指南中检索信息和科学证据，并对IVF中多个下游任务的少量样本适应具有重要意义。


cs.AI / 12 / 2604.21092

Mind the Prompt: Self-adaptive Generation of Task Plan Explanations via LLMs

关注提示：通过大型语言模型自适应生成任务计划解释

Vázquez, Gricel, Evangelidis, Alexandros, Shahbeigi, Sepeedeh, Calinescu, Radu, Gerasimou, Simos

Abstract

Integrating Large Language Models (LLMs) into complex software systems enables the generation of human-understandable explanations of opaque AI processes, such as automated task planning. However, the quality and reliability of these explanations heavily depend on effective prompt engineering. The lack of a systematic understanding of how diverse stakeholder groups formulate and refine prompts hinders the development of tools that can automate this process. We introduce COMPASS (COgnitive Modelling for Prompt Automated SynthesiS), a proof-of-concept self-adaptive approach that formalises prompt engineering as a cognitive and probabilistic decision-making process. COMPASS models users' unobservable latent cognitive states, such as attention, comprehension, and uncertainty, together with observable interaction cues, as a POMDP whose synthesised policy enables adaptive generation of explanations and prompt refinements. We evaluate COMPASS using two diverse cyber-physical system case studies to assess adaptive explanation generation and the quality of the resulting explanations, both quantitatively and qualitatively. Our results demonstrate the feasibility of using COMPASS to integrate human cognition and user-profile feedback into automated prompt synthesis in complex task planning systems.

Chinese Translation

将大型语言模型（LLMs）集成到复杂软件系统中，可以生成易于人类理解的关于不透明人工智能过程（如自动化任务规划）的解释。然而，这些解释的质量和可靠性在很大程度上依赖于有效的提示工程。缺乏对不同利益相关者群体如何制定和完善提示的系统性理解，阻碍了能够自动化这一过程的工具的发展。我们提出了COMPASS（COgnitive Modelling for Prompt Automated SynthesiS），这是一种概念验证的自适应方法，将提示工程形式化为一种认知和概率决策过程。COMPASS将不可观察的用户潜在认知状态（如注意力和理解、确定性以及可观察的交互线索）建模为部分可观察马尔可夫决策过程（POMDP），其合成策略能够自适应地生成解释和提示改进。我们通过两个不同的网络物理系统案例研究评估COMPASS，以定量和定性方式评估自适应解释生成及其质量。我们的结果证明了COMPASS在复杂任务规划系统中将人类认知和用户档案反馈整合到自动化提示合成中的可行性。
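A POMDP over latent user states, as described above, relies on a belief update that folds each observed interaction cue into a probability distribution over hidden states. The sketch below shows the standard Bayes-filter update. The two states, the `explain` action, the `nod` cue, and every probability are invented for illustration and do not come from COMPASS.

```python
def belief_update(belief, action, observation, transition, observe):
    """Standard POMDP belief update.

    transition[a][s][s2] = P(s2 | s, a); observe[a][s2][o] = P(o | s2, a).
    """
    states = list(belief)
    unnormalised = {}
    for s2 in states:
        predicted = sum(transition[action][s][s2] * belief[s] for s in states)
        unnormalised[s2] = observe[action][s2][observation] * predicted
    z = sum(unnormalised.values())
    if z == 0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / z for s, p in unnormalised.items()}

# Hypothetical two-state user model.
transition = {"explain": {
    "confused": {"confused": 0.5, "understands": 0.5},
    "understands": {"confused": 0.0, "understands": 1.0},
}}
observe = {"explain": {
    "confused": {"nod": 0.2, "frown": 0.8},
    "understands": {"nod": 0.8, "frown": 0.2},
}}
prior = {"confused": 0.5, "understands": 0.5}
posterior = belief_update(prior, "explain", "nod", transition, observe)
```

A synthesised policy would map beliefs like `posterior` to the next explanation or prompt refinement; how COMPASS synthesises that policy is not covered by the abstract.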


cs.AI / 13 / 2604.21098

Propensity Inference: Environmental Contributors to LLM Behaviour

倾向推断：环境因素对大型语言模型行为的影响

Järviniemi, Olli, Makins, Oliver, Merizian, Jacob, Kirk, Robert, Millwood, Ben

Abstract

Motivated by loss of control risks from misaligned AI systems, we develop and apply methods for measuring language models' propensity for unsanctioned behaviour. We contribute three methodological improvements: analysing effects of changes to environmental factors on behaviour, quantifying effect sizes via Bayesian generalised linear models, and taking explicit measures against circular analysis. We apply the methodology to measure the effects of 12 environmental factors (6 strategic in nature, 6 non-strategic) and thus the extent to which behaviour is explained by strategic aspects of the environment, a question relevant to risks from misalignment. Across 23 language models and 11 evaluation environments, we find approximately equal contributions from strategic and non-strategic factors for explaining behaviour, do not find strategic factors becoming more or less influential as capabilities improve, and find some evidence for a trend for increased sensitivity to goal conflicts. Finally, we highlight a key direction for future propensity research: the development of theoretical frameworks and cognitive models of AI decision-making into empirically testable forms.

Chinese Translation

鉴于不对齐的人工智能系统带来的控制风险，我们开发并应用了测量语言模型倾向于不当行为的方法。我们贡献了三项方法论改进：分析环境因素变化对行为的影响，通过贝叶斯广义线性模型量化效应大小，以及采取明确措施防止循环分析。我们应用该方法论测量12个环境因素（6个战略性因素和6个非战略性因素）的影响，从而探讨行为在多大程度上受到环境战略方面的解释，这是一个与不对齐风险相关的问题。在23个语言模型和11个评估环境中，我们发现战略因素和非战略因素对行为的解释贡献大致相等，未发现随着能力的提升战略因素变得更具影响力或更不具影响力，并且发现了一些证据表明对目标冲突的敏感性有增加的趋势。最后，我们强调未来倾向研究的一个关键方向：将人工智能决策的理论框架和认知模型发展为可实证检验的形式。


cs.AI / 14 / 2604.21103

AI Governance under Political Turnover: The Alignment Surface of Compliance Design

政治更迭下的人工智能治理：合规设计的对齐面

Peterson, Andrew J.

Abstract

Governments are increasingly interested in using AI to make administrative decisions cheaper, more scalable, and more consistent. But for probabilistic AI to be incorporated into public administration it must be embedded in a compliance layer that makes decisions reviewable, repeatable, and legally defensible. That layer can improve oversight by making departures from law easier to detect. But it can also create a stable approval boundary that political successors learn to navigate while preserving the appearance of lawful administration. We develop a formal model in which institutions choose the scale of automation, the degree of codification, and safeguards on iterative use. The model shows when these systems become vulnerable to strategic use from within government, why reforms that initially improve oversight can later increase that vulnerability, and why expansions in AI use may be difficult to unwind. Making AI usable can thus make procedures easier for future governments to learn and exploit.

Chinese Translation

各国政府越来越关注利用人工智能（AI）来降低行政决策的成本、提高可扩展性和一致性。然而，为了将概率性人工智能纳入公共管理，它必须嵌入一个合规层，使决策可审查、可重复并具有法律可辩护性。该合规层可以通过使法律偏离更容易被检测来改善监督。但它也可以创建一个稳定的审批边界，政治继任者可以在保持合法管理外观的同时学会如何驾驭这一边界。我们开发了一个正式模型，其中机构选择自动化的规模、编码的程度以及迭代使用的保障措施。该模型显示了这些系统何时会受到政府内部战略性使用的脆弱性影响，为什么最初改善监督的改革后来可能会增加这种脆弱性，以及为什么人工智能使用的扩展可能难以逆转。因此，使人工智能可用可能会使未来政府更容易学习和利用这些程序。


cs.AI / 15 / 2604.21154

Agentic AI for Personalized Physiotherapy: A Multi-Agent Framework for Generative Video Training and Real-Time Pose Correction

用于个性化物理治疗的自主智能体：生成视频训练和实时姿态纠正的多智能体框架

Dharmaratnakar, Abhishek, Ranganathan, Srivaths, Sinha, Anushree, Das, Debanshu

Abstract

At-home physiotherapy compliance remains critically low due to a lack of personalized supervision and dynamic feedback. Existing digital health solutions rely on static, pre-recorded video libraries or generic 3D avatars that fail to account for a patient's specific injury limitations or home environment. In this paper, we propose a novel Multi-Agent System (MAS) architecture that leverages Generative AI and computer vision to close the tele-rehabilitation loop. Our framework consists of four specialized micro-agents: a Clinical Extraction Agent that parses unstructured medical notes into kinematic constraints; a Video Synthesis Agent that utilizes foundational video generation models to create personalized, patient-specific exercise videos; a Vision Processing Agent for real-time pose estimation; and a Diagnostic Feedback Agent that issues corrective instructions. We present the system architecture, detail the prototype pipeline using Large Language Models and MediaPipe, and outline our clinical evaluation plan. This work demonstrates the feasibility of combining generative media with agentic autonomous decision-making to scale personalized patient care safely and effectively.

Chinese Translation

由于缺乏个性化监督和动态反馈，居家物理治疗的依从性仍然极低。现有的数字健康解决方案依赖于静态的预录视频库或通用的3D虚拟形象，这些方案未能考虑患者特定的伤害限制或家庭环境。在本文中，我们提出了一种新颖的多智能体系统（Multi-Agent System, MAS）架构，利用生成式人工智能和计算机视觉来闭合远程康复的循环。我们的框架由四个专业微智能体组成：一个临床提取智能体（Clinical Extraction Agent），负责将非结构化的医疗记录解析为运动学约束；一个视频合成智能体（Video Synthesis Agent），利用基础视频生成模型创建个性化的、针对患者的运动视频；一个视觉处理智能体（Vision Processing Agent），用于实时姿态估计；以及一个诊断反馈智能体（Diagnostic Feedback Agent），负责发出纠正指令。我们展示了系统架构，详细描述了使用大型语言模型（Large Language Models）和MediaPipe的原型管道，并概述了我们的临床评估计划。这项工作展示了将生成媒体与自主决策相结合的可行性，从而安全有效地扩展个性化患者护理。

cs.AI / 16 / 2604.21155

Multi-Agent Empowerment and Emergence of Complex Behavior in Groups

多智能体赋能与群体复杂行为的涌现

Shah, Tristan, Nemenman, Ilya, Polani, Daniel, Tiomkin, Stas

Abstract

Intrinsic motivations, i.e. behavioral incentives that are not engineered but emerge from the interaction of an agent with its surroundings, are receiving increasing attention. In this work we study the emergence of behaviors driven by one such incentive, empowerment, specifically in the context of more than one agent. We formulate a principled extension of empowerment to the multi-agent setting, and demonstrate its efficient calculation. We observe that this intrinsic motivation gives rise to characteristic modes of group organization in two qualitatively distinct environments: a pair of agents coupled by a tendon, and a controllable Vicsek flock. This demonstrates the potential of intrinsic motivations such as empowerment to drive not only the behavior of individual agents but also higher levels of behavioral organization at scale.

Chinese Translation

内在动机正受到越来越多的关注，即那些不是经过工程设计的行为激励，而是通过智能体与其环境的互动而自发产生的。在本研究中，我们探讨了由一种内在动机——赋能所驱动的行为的涌现，特别是在多个智能体的背景下。我们对赋能进行了原则性的扩展，以适应多智能体环境，并展示了其高效计算的方法。我们观察到，这种内在动机在两种质上不同的环境中产生了特征性的群体组织模式：一对通过肌腱耦合的智能体，以及一个可控的维塞克（Vicsek）群体。这表明，像赋能这样的内在动机不仅能够驱动单个智能体的行为，还能够在更大规模上促进更高层次的行为组织。

cs.AI / 17 / 2604.21193

Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models

信任但需验证：引入DAVinCI——一个用于语言模型声明推理的双重归因与验证框架

Rawte, Vipula, Rossi, Ryan, Dernoncourt, Franck, Lipka, Nedim

Abstract

Large Language Models (LLMs) have demonstrated remarkable fluency and versatility across a wide range of NLP tasks, yet they remain prone to factual inaccuracies and hallucinations. This limitation poses significant risks in high-stakes domains such as healthcare, law, and scientific communication, where trust and verifiability are paramount. In this paper, we introduce DAVinCI - a Dual Attribution and Verification framework designed to enhance the factual reliability and interpretability of LLM outputs. DAVinCI operates in two stages: (i) it attributes generated claims to internal model components and external sources; (ii) it verifies each claim using entailment-based reasoning and confidence calibration. We evaluate DAVinCI across multiple datasets, including FEVER and CLIMATE-FEVER, and compare its performance against standard verification-only baselines. Our results show that DAVinCI significantly improves classification accuracy, attribution precision, recall, and F1-score by 5-20%. Through an extensive ablation study, we isolate the contributions of evidence span selection, recalibration thresholds, and retrieval quality. We also release a modular DAVinCI implementation that can be integrated into existing LLM pipelines. By bridging attribution and verification, DAVinCI offers a scalable path to auditable, trustworthy AI systems. This work contributes to the growing effort to make LLMs not only powerful but also accountable.

Chinese Translation

大型语言模型（LLMs）在广泛的自然语言处理（NLP）任务中展现出卓越的流畅性和多样性，但它们仍然容易出现事实不准确和幻觉等问题。这一局限性在医疗、法律和科学传播等高风险领域带来了重大风险，在这些领域，信任和可验证性至关重要。本文介绍了DAVinCI——一个旨在增强LLM输出的事实可靠性和可解释性的双重归因与验证框架。DAVinCI的操作分为两个阶段：（i）将生成的声明归因于内部模型组件和外部来源；（ii）使用蕴含推理和置信度校准验证每个声明。我们在多个数据集上评估DAVinCI，包括FEVER和CLIMATE-FEVER，并将其性能与标准的仅验证基线进行比较。我们的结果表明，DAVinCI显著提高了分类准确性、归因精度、召回率和F1-score，提升幅度为5-20%。通过广泛的消融研究，我们隔离了证据范围选择、重新校准阈值和检索质量的贡献。我们还发布了一个模块化的DAVinCI实现，可以集成到现有的LLM管道中。通过桥接归因和验证，DAVinCI为可审计、可信赖的人工智能系统提供了一条可扩展的路径。本研究为使LLM不仅强大而且负责任的努力做出了贡献。

cs.AI / 18 / 2604.21209

Align Generative Artificial Intelligence with Human Preferences: A Novel Large Language Model Fine-Tuning Method for Online Review Management

将生成性人工智能与人类偏好对齐：一种用于在线评论管理的新型大语言模型微调方法

Wang, Yanan, Ge, Yong

Abstract

Online reviews have played a pivotal role in consumers' decision-making processes. Existing research has highlighted the significant impact of managerial review responses on customer relationship management and firm performance. However, a large portion of online reviews remains unaddressed because responding to their rapid growth requires considerable human labor. While generative AI models have achieved remarkable success in a range of tasks, they are general-purpose and may not align well with domain-specific human preferences. To tailor these general generative AI models to domain-specific applications, finetuning is commonly employed. Nevertheless, several challenges persist in finetuning with domain-specific data, including hallucinations, difficulty in representing domain-specific human preferences, and over-conservatism in offline policy optimization. To address these challenges, we propose a novel preference finetuning method to align an LLM with domain-specific human preferences for generating online review responses. Specifically, we first identify the source of hallucination and propose an effective context augmentation approach to mitigate LLM hallucination. To represent human preferences, we propose a novel theory-driven preference finetuning approach that automatically constructs human preference pairs in the online review domain. Additionally, we propose a curriculum learning approach to further enhance preference finetuning. To overcome the over-conservatism of existing offline preference finetuning methods, we propose a novel density estimation-based support constraint method to relax the conservatism, and we mathematically prove its superior theoretical guarantees. Extensive evaluations substantiate the superiority of our proposed preference finetuning method.

Chinese Translation

在线评论在消费者决策过程中发挥了关键作用。现有研究强调了管理层对评论的回应对客户关系管理和企业绩效的显著影响。然而，由于回应在线评论快速增长所需的人力劳动巨大，仍有大量在线评论未得到处理。尽管生成性人工智能在多项任务中取得了显著成功，但它们是通用模型，可能无法很好地与特定领域的人类偏好对齐。为了将这些通用生成性人工智能模型调整为特定领域的应用，通常采用微调方法。然而，在使用特定领域数据进行微调时，仍然存在一些挑战，包括幻觉现象、难以表示特定领域的人类偏好以及离线策略优化中的过于保守。为了解决这些挑战，我们提出了一种新颖的偏好微调方法，以将大语言模型（LLM）与特定领域的人类偏好对齐，从而生成在线评论回应。具体而言，我们首先识别幻觉的来源，并提出一种有效的上下文增强方法来减轻LLM的幻觉现象。为了表示人类偏好，我们提出了一种基于理论驱动的偏好微调方法，该方法能够自动构建在线评论领域的人类偏好对。除此之外，我们还提出了一种课程学习方法，以进一步增强偏好微调。为了解决现有离线偏好微调方法中过于保守的问题，我们提出了一种基于密度估计的支持约束方法，以放宽保守性，并在数学上证明了其优越的理论保证。大量评估结果证实了我们提出的偏好微调方法的优越性。

cs.AI / 19 / 2604.21232

ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures

ReCAPA：层次预测修正以减轻级联故障

Zeng, Xiyin, Sun, Yuyu, Li, Haoyang, Liu, Shouqiang, Wang, Hao

Abstract

Vision-Language-Action (VLA) systems follow instructions to execute multi-step tasks in multimodal environments. Recent VLA approaches typically rely on post-hoc correction mechanisms or operate under fixed task decompositions and alignment schemes. However, once an intermediate step is mis-specified, local errors propagate through subsequent steps and eventually accumulate into cascading failures. To mitigate this compounding effect, we propose the Predictive Alignment and Planning Architecture (ReCAPA), a framework that uses prediction and contrast to adjust deviations across three levels: actions, subgoals, and trajectories. Semantic alignment is enforced at all levels using a Sinkhorn-based module and a Score-field module. The predictive correction and alignment jointly update the action generator during training, enabling it to adjust fine-grained steps to remain aligned with the overall intent. We further introduce two new metrics to quantify error propagation and recovery processes in tasks, capturing how mistakes spread and fade over long-horizon execution. Experiments show that ReCAPA achieves competitive results on embodied agent benchmarks such as VisualAgentBench, MineDojo, and AI2-THOR, outperforming strong proprietary and open-source Large Language Model baselines.

Chinese Translation

视觉-语言-行动系统遵循指令在多模态环境中执行多步骤任务。近期的视觉-语言-行动方法通常依赖于事后修正机制，或在固定的任务分解和对齐方案下运行。然而，一旦中间步骤被错误指定，局部错误会在后续步骤中传播，并最终累积成级联故障。为了减轻这种累积效应，我们提出了预测对齐与规划架构（Predictive Alignment and Planning Architecture），该框架利用预测和对比在三个层次上调整偏差：行动、子目标和轨迹。通过基于Sinkhorn的模块和评分场模块，在所有层次上强制进行语义对齐。预测修正和对齐在训练过程中共同更新行动生成器，使其能够调整细粒度步骤，以保持与整体意图的一致性。我们进一步引入了两个新的指标，以量化任务中的错误传播和恢复过程，捕捉错误在长期执行中的传播和消退情况。实验表明，ReCAPA在视觉代理基准（如VisualAgentBench、MineDojo和AI2-THOR）上取得了竞争性的结果，超越了强大的专有和开源大型语言模型基线。

cs.AI / 20 / 2604.21256

Robustness Analysis of POMDP Policies to Observation Perturbations

POMDP 策略对观测扰动的鲁棒性分析

Kraske, Benjamin, Ho, Qi Heng, Rossi, Federico, Lahijanian, Morteza, Sunberg, Zachary

Abstract

Policies for Partially Observable Markov Decision Processes (POMDPs) are often designed using a nominal system model. In practice, this model can deviate from the true system during deployment due to factors such as calibration drift or sensor degradation, leading to unexpected performance degradation. This work studies policy robustness against deviations in the POMDP observation model. We introduce the Policy Observation Robustness Problem: to determine the maximum tolerable deviation in a POMDP's observation model that guarantees the policy's value remains above a specified threshold. We analyze two variants: the sticky variant, where deviations are dependent on state and actions, and the non-sticky variant, where they can be history-dependent. We show that the Policy Observation Robustness Problem can be formulated as a bi-level optimization problem in which the inner optimization is monotonic in the size of the observation deviation. This enables efficient solutions using root-finding algorithms in the outer optimization. For the non-sticky variant, we show that when policies are represented with finite-state controllers (FSCs) it is sufficient to consider observations which depend on nodes in the FSC rather than full histories. We present Robust Interval Search, an algorithm with soundness and convergence guarantees, for both the sticky and non-sticky variants. We show this algorithm has polynomial time complexity in the non-sticky variant and at most exponential time complexity in the sticky variant. We provide experimental results validating and demonstrating the scalability of implementations of Robust Interval Search to POMDP problems with tens of thousands of states. We also provide case studies from robotics and operations research which demonstrate the practical utility of the problem and algorithms.
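The abstract's key structural point is that the inner optimization is monotonic in the deviation size, so the outer problem reduces to one-dimensional root finding. A minimal sketch of that idea using plain bisection follows. It is not the paper's Robust Interval Search: `worst_case_value` is a hypothetical stand-in for the inner optimization, and `toy_value` is an invented linear function used only to make the example runnable.

```python
from typing import Callable


def max_tolerable_deviation(
    worst_case_value: Callable[[float], float],
    threshold: float,
    delta_max: float = 1.0,
    tol: float = 1e-6,
) -> float:
    """Return (approximately) the largest delta in [0, delta_max] such that
    worst_case_value(delta) >= threshold.

    Assumption: worst_case_value is non-increasing in delta, which is what
    makes bisection valid here.
    """
    if worst_case_value(0.0) < threshold:
        raise ValueError("policy misses the threshold even with no deviation")
    if worst_case_value(delta_max) >= threshold:
        return delta_max
    # Invariant: value(lo) >= threshold > value(hi)
    lo, hi = 0.0, delta_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if worst_case_value(mid) >= threshold:
            lo = mid
        else:
            hi = mid
    return lo


def toy_value(delta: float) -> float:
    """Hypothetical inner problem: value degrades linearly with deviation."""
    return 10.0 - 8.0 * delta


result = max_tolerable_deviation(toy_value, threshold=6.0)
```

In this toy case the value stays at or above 6.0 exactly when the deviation is at most 0.5, so the search converges to that boundary. In the real problem each call to the inner function would itself be an optimization over perturbed observation models.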

Chinese Translation

部分可观测马尔可夫决策过程（POMDP）的策略通常是基于名义系统模型设计的。在实际应用中，由于校准漂移或传感器退化等因素，该模型可能与真实系统偏离，从而导致意想不到的性能下降。本研究探讨了策略对 POMDP 观测模型偏差的鲁棒性。我们引入了策略观测鲁棒性问题：确定 POMDP 观测模型中最大可容忍偏差，以确保策略的价值保持在指定阈值之上。我们分析了两种变体：粘性变体，其中偏差依赖于状态和动作；非粘性变体，其中偏差可以依赖于历史。我们证明了策略观测鲁棒性问题可以被表述为一个双层优化问题，其中内部优化在观测偏差的大小上是单调的。这使得在外部优化中使用根查找算法能够高效求解。对于非粘性变体，我们表明，当策略用有限状态控制器（FSC）表示时，只需考虑依赖于 FSC 中节点的观测，而不是完整历史。我们提出了鲁棒区间搜索（Robust Interval Search）算法，该算法对粘性和非粘性变体均具有健全性和收敛性保证。我们展示了该算法在非粘性变体中具有多项式时间复杂度，而在粘性变体中最多具有指数时间复杂度。我们提供了实验结果，验证并展示了鲁棒区间搜索在具有数万状态的 POMDP 问题中的可扩展性。我们还提供了来自机器人技术和运筹学的案例研究，展示了该问题和算法的实际应用价值。

cs.AI / 21 / 2604.21263

Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages

基于元谓词和领域特定语言的可信临床决策支持

Bouzinier, Michael, Trifonov, Sergey, Chumack, Michael, Lvova, Eugenia, Etin, Dmitry

Abstract

Background: Regulatory frameworks for AI in healthcare, including the EU AI Act and FDA guidance on AI/ML-based medical devices, require clinical decision support to demonstrate not only accuracy but auditability. Existing formal languages for clinical logic validate syntactic and structural correctness but not whether decision rules use epistemologically appropriate evidence. Methods: Drawing on design-by-contract principles, we introduce meta-predicates (predicates about predicates) for asserting epistemological constraints on clinical decision rules expressed in a DSL. An epistemological type system classifies annotations along four dimensions: purpose, knowledge domain, scale, and method of acquisition. Meta-predicates assert which evidence types are permissible in any given rule. The framework is instantiated in AnFiSA, an open-source platform for genetic variant curation, and demonstrated using the Brigham Genomics Medicine protocol on 5.6 million variants from the Genome in a Bottle benchmark. Results: Decision trees used in variant interpretation can be reformulated as unate cascades, enabling per-variant audit trails that identify which rule classified each variant and why. Meta-predicate validation catches epistemological errors before deployment, whether rules are human-written or AI-generated. The approach complements post-hoc methods such as LIME and SHAP: where explanation reveals what evidence was used after the fact, meta-predicates constrain what evidence may be used before deployment, while preserving human readability. Conclusions: Meta-predicate validation is a step toward demonstrating not only that decisions are accurate but that they rest on appropriate evidence in ways that can be independently audited. While demonstrated in genomics, the approach generalises to any domain requiring auditable decision logic.

Chinese Translation

背景：针对医疗保健中人工智能的监管框架，包括欧盟人工智能法案和FDA关于基于人工智能/机器学习的医疗设备的指导，要求临床决策支持不仅要证明准确性，还要具备可审计性。现有的临床逻辑形式语言验证语法和结构的正确性，但并未验证决策规则是否使用了在认识论上适当的证据。方法：基于契约设计原则，我们引入了元谓词——关于谓词的谓词——用于对在领域特定语言（DSL）中表达的临床决策规则施加认识论约束。一个认识论类型系统从目的、知识领域、规模和获取方法四个维度对注释进行分类。元谓词断言在任何给定规则中允许哪些证据类型。该框架在AnFiSA中实例化，这是一个用于基因变异整理的开源平台，并使用来自“瓶中基因组”基准的560万变异的Brigham Genomics Medicine协议进行了演示。结果：用于变异解释的决策树可以重新表述为单调级联，从而实现每个变异的审计轨迹，识别出每个变异的分类规则及其原因。元谓词验证在部署前捕捉认识论错误，无论规则是人工编写还是人工智能生成。该方法补充了事后分析方法，如LIME和SHAP：在解释揭示事后使用了哪些证据的同时，元谓词在部署前限制了可以使用的证据，同时保持了人类可读性。结论：元谓词验证是向证明决策不仅准确而且基于适当证据的独立审计的一步。虽然在基因组学中进行了演示，但该方法可以推广到任何需要可审计决策逻辑的领域。

cs.AI / 22 / 2604.21264

Enhancing Online Recruitment with Category-Aware MoE and LLM-based Data Augmentation

通过类别感知的专家混合模型和基于大语言模型的数据增强提升在线招聘

Chen, Minping, Xu, Bing, Chen, Zulong, Xu, Chuanfei, Zhou, Ying, Tao, Zui, Wen, Zeyi

Abstract

Person-Job Fit (PJF) is a critical component for online recruitment. Existing approaches face several challenges, particularly in handling low-quality job descriptions and similar candidate-job pairs, which impair model performance. To address these challenges, this paper proposes a large language model (LLM) based method with two novel techniques: (1) LLM-based data augmentation, which polishes and rewrites low-quality job descriptions by leveraging chain-of-thought (CoT) prompts, and (2) a category-aware Mixture of Experts (MoE) that assists in identifying similar candidate-job pairs. This MoE module incorporates category embeddings to dynamically assign weights to the experts and learns more distinguishable patterns for similar candidate-job pairs. We perform offline evaluations and online A/B tests on our recruitment platform. Our method achieves relative improvements of 2.40% in AUC and 7.46% in GAUC over existing methods, and boosts click-through conversion rate (CTCVR) by 19.4% in online tests, saving millions of CNY in external headhunting expenses.

Chinese Translation

人岗匹配（Person-Job Fit, PJF）是在线招聘中的关键组成部分。现有方法面临多个挑战，特别是在处理低质量职位描述和相似候选人-职位配对时，这些问题影响了模型的性能。为了解决这些挑战，本文提出了一种基于大语言模型（LLM）的方法，采用了两种新颖的技术：（1）基于LLM的数据增强，通过利用思维链（Chain-of-Thought, COT）提示来润色和重写低质量职位描述；（2）类别感知的专家混合模型（Mixture of Experts, MoE），该模型有助于识别相似的候选人-职位配对。该MoE模块结合了类别嵌入，动态地为专家分配权重，并学习更具辨别性的模式以处理相似的候选人-职位配对。我们在招聘平台上进行了离线评估和在线A/B测试。我们的方法在AUC上相对超越了现有方法2.40%，在GAUC上超越了7.46%，并在在线测试中提升了点击转化率（Click-Through Conversion Rate, CTCVR）19.4%，为外部猎头费用节省了数百万人民币。

cs.AI / 23 / 2604.21277

Can MLLMs "Read" What is Missing?

多模态大型语言模型能“读取”缺失内容吗？

Guo, Jindi, Fang, Xi, Huang, Chaozheng

Abstract

We introduce MMTR-Bench, a benchmark designed to evaluate the intrinsic ability of Multimodal Large Language Models (MLLMs) to reconstruct masked text directly from visual context. Unlike conventional question-answering tasks, MMTR-Bench eliminates explicit prompts, requiring models to recover masked text from single- or multi-page inputs across real-world domains such as documents and webpages. This design isolates the reconstruction task from instruction-following abilities, enabling a direct assessment of a model's layout understanding, visual grounding, and knowledge integration. MMTR-Bench comprises 2,771 test samples spanning multiple languages and varying target lengths. To account for this diversity, we propose a level-aware evaluation protocol. Experiments on representative MLLMs show that the benchmark poses a significant challenge, especially for sentence- and paragraph-level reconstruction. The homepage is available at https://mmtr-bench-dataset.github.io/MMTR-Bench/.

Chinese Translation

我们介绍了 MMTR-Bench，这是一个旨在评估多模态大型语言模型（MLLMs）从视觉上下文直接重建被遮蔽文本的内在能力的基准测试。与传统的问答任务不同，MMTR-Bench 消除了显式提示，要求模型从单页或多页输入中恢复被遮蔽的文本，这些输入来自于文档和网页等真实世界领域。这一设计将重建任务与遵循指令的能力分离，使得能够直接评估模型的布局理解、视觉基础和知识整合能力。MMTR-Bench 包含 2,771 个测试样本，涵盖多种语言和不同的目标长度。为了考虑这种多样性，我们提出了一种基于水平的评估协议。对代表性 MLLMs 的实验表明，该基准测试提出了显著挑战，尤其是在句子和段落级别的重建任务中。主页可访问 https://mmtr-bench-dataset.github.io/MMTR-Bench/。

cs.AI / 24 / 2604.21284

Spatial Metaphors for LLM Memory: A Critical Analysis of the MemPalace Architecture

大型语言模型记忆的空间隐喻：对MemPalace架构的批判性分析

Dey, Robin, Viradecha, Panyanon

Abstract

MemPalace is an open-source AI memory system that applies the ancient method of loci (memory palace) spatial metaphor to organize long-term memory for large language models; launched in April 2026, it accumulated over 47,000 GitHub stars in its first two weeks and claims state-of-the-art retrieval performance on the LongMemEval benchmark (96.6% Recall@5) without requiring any LLM inference at write time. Through independent codebase analysis, benchmark replication, and comparison with competing systems, we find that MemPalace's headline retrieval performance is attributable primarily to its verbatim storage philosophy combined with ChromaDB's default embedding model (all-MiniLM-L6-v2), rather than to its spatial organizational metaphor per se -- the palace hierarchy (Wings->Rooms->Closets->Drawers) operates as standard vector database metadata filtering, an effective but well-established technique. However, MemPalace makes several genuinely novel contributions: (1) a contrarian verbatim-first storage philosophy that challenges extraction-based competitors, (2) an extremely low wake-up cost (approximately 170 tokens) through its four-layer memory stack, (3) a fully deterministic, zero-LLM write path enabling offline operation at zero API cost, and (4) the first systematic application of spatial memory metaphors as an organizing principle for AI memory systems. We also note that the competitive landscape is evolving rapidly, with Mem0's April 2026 token-efficient algorithm raising their LongMemEval score from approximately 49% to 93.4%, narrowing the gap between extraction-based and verbatim approaches. Our analysis concludes that MemPalace represents significant architectural insight wrapped in overstated claims -- a pattern common in rapidly adopted open-source projects where marketing velocity exceeds scientific rigor.
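The claim that the palace hierarchy "operates as standard vector database metadata filtering" can be illustrated with a small pure-Python sketch: filter records by metadata first, then rank the survivors by cosine similarity. Everything here is hypothetical (the record layout, the `wing`/`room` keys, the two-dimensional vectors, and the `search` function); it does not use or reproduce MemPalace's or ChromaDB's actual APIs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(records, query_vec, where, k=5):
    """Keep records whose metadata matches every key in `where`,
    then return the texts of the k most similar ones."""
    candidates = [
        r for r in records
        if all(r["meta"].get(key) == val for key, val in where.items())
    ]
    candidates.sort(key=lambda r: cosine(r["vec"], query_vec), reverse=True)
    return [r["text"] for r in candidates[:k]]


# Toy verbatim records; vectors are made up for illustration.
records = [
    {"text": "Alice prefers tea", "vec": [1.0, 0.0],
     "meta": {"wing": "people", "room": "alice"}},
    {"text": "Alice's birthday is in May", "vec": [0.6, 0.8],
     "meta": {"wing": "people", "room": "alice"}},
    {"text": "Project deadline is Friday", "vec": [0.9, 0.1],
     "meta": {"wing": "work", "room": "deadlines"}},
]

hits = search(records, [1.0, 0.1], where={"wing": "people"}, k=2)
unfiltered = search(records, [1.0, 0.1], where={}, k=1)
```

With the `wing` filter, only the two "people" records are ranked; without it, the "work" record is the nearest neighbour. The hierarchy therefore narrows the candidate set, while the ranking itself is ordinary embedding similarity.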

Chinese Translation

MemPalace是一个开源的人工智能记忆系统，采用古老的地点法（记忆宫殿）空间隐喻来组织大型语言模型的长期记忆；该系统于2026年4月推出，在前两周内积累了超过47,000个GitHub星标，并声称在LongMemEval基准测试中具有最先进的检索性能（96.6% Recall@5），且在写入时不需要任何大型语言模型推理。通过独立的代码库分析、基准测试复制以及与竞争系统的比较，我们发现MemPalace的显著检索性能主要归因于其逐字存储哲学与ChromaDB的默认嵌入模型（all-MiniLM-L6-v2）的结合，而并非其空间组织隐喻本身——宫殿层级（翅膀->房间->衣橱->抽屉）作为标准向量数据库元数据过滤的操作，是一种有效但成熟的技术。然而，MemPalace确实做出了几项真正新颖的贡献：（1）一种反传统的逐字优先存储哲学，挑战基于提取的竞争者；（2）通过其四层记忆堆栈实现的极低唤醒成本（约170个标记）；（3）完全确定性的零大型语言模型写入路径，使离线操作在零API成本下成为可能；（4）首次系统性地将空间记忆隐喻作为人工智能记忆系统的组织原则。我们还注意到，竞争格局正在迅速演变，Mem0在2026年4月推出的高效算法使其LongMemEval得分从约49%提升至93.4%，缩小了基于提取和逐字方法之间的差距。我们的分析得出结论，MemPalace代表了重要的架构洞察，但其声称的优势被夸大——这是快速采用的开源项目中常见的模式，市场推广速度超过了科学严谨性。

cs.AI / 25 / 2604.21334

Ideological Bias in LLMs' Economic Causal Reasoning

大型语言模型的经济因果推理中的意识形态偏见

Lee, Donggyu, Yun, Hyeok, Kim, Jungwon, Min, Junsik, Park, Sungwon, Park, Sangyoon, Kim, Jihee

Abstract

Do large language models (LLMs) exhibit systematic ideological bias when reasoning about economic causal effects? As LLMs are increasingly used in policy analysis and economic reporting, where directionally correct causal judgments are essential, this question has direct practical stakes. We present a systematic evaluation by extending the EconCausal benchmark with ideology-contested cases - instances where intervention-oriented (pro-government) and market-oriented (pro-market) perspectives predict divergent causal signs. From 10,490 causal triplets (treatment-outcome pairs with empirically verified effect directions) derived from top-tier economics and finance journals, we identify 1,056 ideology-contested instances and evaluate 20 state-of-the-art LLMs on their ability to predict empirically supported causal directions. We find that ideology-contested items are consistently harder than non-contested ones, and that across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than with market-oriented ones. Moreover, when models err, their incorrect predictions disproportionately lean intervention-oriented, and this directional skew is not eliminated by one-shot in-context prompting. These results highlight that LLMs are not only less accurate on ideologically contested economic questions, but systematically less reliable in one ideological direction than the other, underscoring the need for direction-aware evaluation in high-stakes economic and policy settings.

Chinese Translation

大型语言模型（LLMs）在推理经济因果效应时是否表现出系统性的意识形态偏见？随着LLMs在政策分析和经济报道中的使用日益增加，而在这些领域中，方向性正确的因果判断至关重要，这一问题具有直接的实际意义。我们通过扩展EconCausal基准，系统性地评估了意识形态争议案例——即干预导向（支持政府）和市场导向（支持市场）视角预测出不同因果符号的实例。从来自顶级经济学和金融学期刊的10,490个因果三元组（经过实证验证的处理-结果对）中，我们识别出1,056个意识形态争议实例，并评估了20个最先进的LLMs在预测实证支持的因果方向方面的能力。我们发现，意识形态争议项的难度始终高于非争议项，并且在20个模型中的18个模型中，当实证验证的因果符号与干预导向预期一致时，准确率系统性地更高，而与市场导向预期一致时则较低。此外，当模型出错时，其错误预测明显倾向于干预导向，而这种方向性偏差并未通过一次性上下文提示消除。这些结果强调了LLMs在意识形态争议的经济问题上不仅准确性较低，而且在某一意识形态方向上的可靠性系统性地低于另一方向，突显了在高风险经济和政策环境中进行方向感知评估的必要性。

cs.AI / 26 / 2604.21345

Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline

评估可重用跨领域管道的人工智能会议摘要

Zhong, Philip, Wang, Don, Zhang, Jason, Chen, Kent

Abstract

We present a reusable evaluation pipeline for generative AI applications, instantiated for AI meeting summaries and released with a public artifact package derived from a Dataset Pipeline. The system separates reusable orchestration from task-specific semantics across five stages: source intake, structured reference construction, candidate generation, structured scoring, and reporting. Unlike standalone claim scorers, it treats both ground truth and evaluator outputs as typed, persisted artifacts, enabling aggregation, issue analysis, and statistical testing. We benchmark the offline loop on a typed dataset of 114 meetings spanning city_council, private_data, and whitehouse_press_briefings, producing 340 meeting-model pairs and 680 judge runs across gpt-4.1-mini, gpt-5-mini, and gpt-5.1. Under this protocol, gpt-4.1-mini achieves the highest mean accuracy (0.583), while gpt-5.1 leads in completeness (0.886) and coverage (0.942). Paired sign tests with Holm correction show no significant accuracy winner but confirm significant retention gains for gpt-5.1. A typed DeepEval contrastive baseline preserves retention ordering but reports higher holistic accuracy, suggesting that reference-based scoring may overlook unsupported-specifics errors captured by claim-grounded evaluation. Typed analysis identifies whitehouse_press_briefings as an accuracy-challenging domain with frequent unsupported specifics. A deployment follow-up shows gpt-5.4 outperforming gpt-4.1 across all metrics, with statistically robust gains on retention metrics under the same protocol. The system benchmarks the offline loop and documents, but does not quantitatively evaluate, the online feedback-to-evaluation path.
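For readers unfamiliar with the statistical procedure named in the abstract, the following is a minimal standard-library sketch of a two-sided paired sign test followed by Holm's step-down correction. The function names and the example scores are illustrative assumptions, not taken from the paper's artifact package.

```python
from math import comb


def sign_test_p(a, b):
    """Two-sided paired sign test p-value; tied pairs are dropped."""
    wins = sum(x > y for x, y in zip(a, b))
    losses = sum(x < y for x, y in zip(a, b))
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # Multiply the rank-th smallest p-value by (m - rank),
        # then enforce monotonicity with a running maximum.
        running = max(running, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running)
    return adjusted


# Hypothetical per-meeting scores for two models on the same 8 meetings.
model_a = [0.9, 0.8, 0.7, 0.9, 0.8, 0.6, 0.9, 0.7]
model_b = [0.8, 0.7, 0.6, 0.8, 0.7, 0.5, 0.8, 0.6]

p_value = sign_test_p(model_a, model_b)
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

The sign test only uses the direction of each paired difference, which suits bounded rubric scores whose magnitudes may not be comparable across meetings. Holm's correction then controls the family-wise error rate across the several metric comparisons.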

Chinese Translation

我们提出了一种可重用的评估管道，适用于生成性人工智能应用，特别针对人工智能会议摘要，并发布了一个基于数据集管道的公共工件包。该系统在五个阶段中将可重用的编排与特定任务的语义分离：源输入、结构化参考构建、候选生成、结构化评分和报告。与独立的声明评分器不同，它将真实值和评估者输出视为类型化的持久化工件，从而实现聚合、问题分析和统计测试。我们在一个包含114个会议的类型化数据集上对离线循环进行了基准测试，涵盖了市议会、私人数据和白宫新闻简报，生成了340个会议-模型对和680个评审运行，涉及gpt-4.1-mini、gpt-5-mini和gpt-5.1。在该协议下，gpt-4.1-mini实现了最高的平均准确率（0.583），而gpt-5.1在完整性（0.886）和覆盖率（0.942）方面领先。配对符号检验与Holm校正显示没有显著的准确性赢家，但确认gpt-5.1在保留率上有显著提升。类型化的DeepEval对比基线保持了保留顺序，但报告了更高的整体准确性，表明基于参考的评分可能忽视了声明基础评估所捕获的未支持特定错误。类型分析识别出白宫新闻简报作为一个准确性具有挑战性的领域，频繁出现未支持的特定内容。后续部署显示gpt-5.4在所有指标上优于gpt-4.1，在相同协议下的保留指标上具有统计显著的提升。该系统基准测试了离线循环并记录，但未对在线反馈到评估路径进行定量评估。

cs.AI / 27 / 2604.21346

Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning

符号基础揭示了抽象视觉推理中的表征瓶颈

Vaishnav, Mohit, Tammet, Tanel

Abstract

Vision-language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our Componential-Grammatical (C-G) paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid-90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.

Chinese Translation

视觉-语言模型（VLMs）在抽象视觉推理基准测试（如Bongard问题）上常常表现不佳，这引发了一个问题：主要瓶颈是在推理还是表征方面。我们通过比较在原始图像上的端到端VLMs与基于这些图像的符号输入的大型语言模型（LLMs），在Bongard-LOGO这一合成的抽象概念学习基准上研究这一问题，该基准具有真实的生成程序。我们将符号输入作为诊断探针，而非实用的多模态架构，我们的Componential-Grammatical (C-G)范式将Bongard-LOGO重新表述为基于LOGO风格的动作程序或结构化描述的符号推理任务。LLMs在自由形式问题上实现了显著且一致的提升，准确率达到95%左右，而在匹配任务定义下，强大的视觉基线则接近随机水平。对输入格式、显式概念提示和最小视觉基础的消融实验表明，这些因素的重要性远不及从像素到符号结构的转变。这些结果确定了表征作为抽象视觉推理中的关键瓶颈，并展示了符号输入如何作为受控的诊断上限。

cs.AI / 28 / 2604.21357

ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs

ReaGeo：基于大型语言模型的推理增强端到端地理编码

Cui, Jian, Ren, Zhiyuan, Weng, Desheng, Zhao, Yongqi, Wenbin, Gong, Lei, Yu, Dong, Zhenning

Abstract

This paper proposes ReaGeo, an end-to-end geocoding framework based on large language models, designed to overcome the limitations of traditional multi-stage approaches that rely on text or vector similarity retrieval over geographic databases, including workflow complexity, error propagation, and heavy dependence on structured geographic knowledge bases. The method converts geographic coordinates into geohash sequences, reformulating the coordinate prediction task as a text generation problem, and introduces a Chain-of-Thought mechanism to enhance the model's reasoning over spatial relationships. Furthermore, reinforcement learning with a distance-deviation-based reward is applied to optimize the generation accuracy. Comprehensive experiments show that ReaGeo can accurately handle explicit address queries in single-point predictions and effectively resolve vague relative location queries. In addition, the model demonstrates strong predictive capability for non-point geometric regions, highlighting its versatility and generalization ability in geocoding tasks.
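The step of converting coordinates into geohash sequences can be sketched with the standard geohash encoding: interleave longitude and latitude bisection bits and emit one base-32 character per five bits. This is the generic public algorithm, offered as an illustration only; the abstract does not state ReaGeo's exact precision or tokenization, and `geohash_encode` is a name chosen here.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat: float, lon: float, precision: int = 8) -> str:
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0          # accumulates the current 5-bit group
    bit_count = 0
    use_lon = True    # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


short_hash = geohash_encode(57.64911, 10.40744, precision=5)
long_hash = geohash_encode(57.64911, 10.40744, precision=9)
```

Because each additional character only refines the same cell, a longer geohash always begins with the shorter one. That prefix property is what makes the representation a natural fit for left-to-right text generation: early tokens fix the coarse region and later tokens narrow it down.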

Chinese Translation

本文提出了ReaGeo，一个基于大型语言模型的端到端地理编码框架，旨在克服传统多阶段方法的局限性，这些方法依赖于地理数据库中的文本或向量相似性检索，包括工作流程复杂性、错误传播以及对结构化地理知识库的高度依赖。该方法将地理坐标转换为地理哈希序列，将坐标预测任务重新表述为文本生成问题，并引入了链式思维（Chain-of-Thought）机制，以增强模型对空间关系的推理能力。此外，应用基于距离偏差的奖励的强化学习来优化生成准确性。综合实验表明，ReaGeo能够准确处理单点预测中的明确地址查询，并有效解决模糊相对位置查询。此外，该模型在非点几何区域表现出强大的预测能力，突显了其在地理编码任务中的多功能性和泛化能力。

cs.AI / 29 / 2604.21361

Time, Causality, and Observability Failures in Distributed AI Inference Systems

分布式人工智能推理系统中的时间、因果关系与可观测性失败

Sharma, Ankur, Shah, Deep, Lariviere, David, ElBakoury, Hesham

Abstract

Distributed AI inference pipelines rely heavily on timestamp-based observability to understand system behavior. This work demonstrates that even small clock skew between nodes can cause observability to become causally incorrect while the system itself remains functionally correct and performant. We present controlled experiments on a multi-node AI inference pipeline, where clock skew is introduced at a single stage. Results show that no violations are observed under synchronized conditions and up to 3 ms skew, while clear causality violations emerge by 5 ms. Despite this, system throughput and output correctness remain largely unaffected. We further observe that violation behavior is not strictly static. In longer runs, negative span rates may stabilize or decrease over time, indicating that effective skew evolves due to relative clock drift between nodes. Experiments were conducted using Kafka and ZeroMQ transports, with consistent results across both. Aeron is under active exploration but is not yet included in the completed validation set. These findings suggest that observability correctness depends not only on system functionality but also on precise time alignment, and that timing must be treated as a first-class concern in distributed AI systems.
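The mechanism behind the reported causality violations can be shown with a short sketch: a span between two stages is computed from timestamps taken on different clocks, so a downstream clock that runs behind by more than the true latency yields a negative span. The latency values and function names below are assumptions for illustration, not measurements from the paper.

```python
def observed_span_ms(true_latency_ms: float, downstream_skew_ms: float) -> float:
    """Span as seen by tracing: (receive time on the skewed downstream
    clock) minus (send time on the upstream clock)."""
    return true_latency_ms + downstream_skew_ms


def negative_span_rate(latencies_ms, downstream_skew_ms: float) -> float:
    """Fraction of spans that appear to end before they start."""
    spans = [observed_span_ms(lat, downstream_skew_ms) for lat in latencies_ms]
    return sum(s < 0 for s in spans) / len(spans)


# Hypothetical true inter-stage latencies, in milliseconds.
latencies = [4.0, 4.5, 3.5, 6.0]

rate_synced = negative_span_rate(latencies, 0.0)
rate_3ms = negative_span_rate(latencies, -3.0)   # downstream clock 3 ms behind
rate_5ms = negative_span_rate(latencies, -5.0)   # downstream clock 5 ms behind
```

Under these assumed latencies, a 3 ms skew produces no negative spans while a 5 ms skew makes most of them negative, even though every message was processed correctly. If the skew itself drifts over a long run, the rate changes with it, which is consistent with the non-static behavior the abstract describes.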

Chinese Translation

分布式人工智能推理管道在理解系统行为时高度依赖基于时间戳的可观测性。本研究表明，即使节点之间存在小的时钟偏差，也可能导致可观测性变得因果不正确，而系统本身仍然保持功能正确和性能良好。我们在一个多节点人工智能推理管道上进行了控制实验，在单个阶段引入时钟偏差。结果显示，在同步条件下和高达3毫秒的偏差下没有观察到任何违规，而在5毫秒时则出现明显的因果违规。尽管如此，系统的吞吐量和输出正确性基本未受影响。我们进一步观察到，违规行为并非严格静态。在较长的运行中，负跨度率可能随时间稳定或下降，表明由于节点之间的相对时钟漂移，有效的偏差会演变。实验使用了Kafka和ZeroMQ传输，在两者中均获得了一致的结果。Aeron正在积极探索中，但尚未包含在完成的验证集内。这些发现表明，可观测性正确性不仅依赖于系统功能，还依赖于精确的时间对齐，并且在分布式人工智能系统中，时间必须作为一项首要关注事项来对待。

cs.AI / 30 / 2604.21414

SemanticAgent: A Semantics-Aware Framework for Text-to-SQL Data Synthesis

SemanticAgent：一种语义感知的文本到SQL数据合成框架

Gao, Qiang, Li, Zhenping, Zhuo, Anqi, Zhao, Yingxiao, Geng, Weibo, Li, Xiaosong

Abstract

Existing text-to-SQL synthesis pipelines still conflate executability with semantic validity: syntactic checks and execution-based validation can retain queries that execute successfully while violating database semantics. To address these limitations, we propose SemanticAgent, a semantics-aware synthesis framework. SemanticAgent organizes synthesis around three specialized modules: an analyzer, a synthesizer, and a verifier. Through a three-stage protocol of semantic analysis, stepwise synthesis, and diagnostic refinement, SemanticAgent replaces reliance on execution-based validation alone with a traceable reasoning process. Our framework generates synthetic data that consistently outperforms prior synthesis methods under semantic-quality evaluation, leading to stronger downstream fine-tuning performance, especially on semantically demanding benchmarks.

Chinese Translation

现有的文本到SQL合成管道仍然将可执行性与语义有效性混为一谈：语法检查和基于执行的验证可能保留那些成功执行但违反数据库语义的查询。为了解决这些局限性，我们提出了SemanticAgent，一种语义感知的合成框架。SemanticAgent围绕三个专门模块组织合成：分析器、合成器和验证器。通过语义分析、逐步合成和诊断优化的三阶段协议，SemanticAgent将单纯基于执行的验证转变为可追溯的推理过程。我们的框架生成的合成数据在语义质量评估中始终优于先前的合成方法，从而在下游微调性能上表现更强，尤其是在语义要求较高的基准测试中。

cs.AI / 31 / 2604.21420

FairQE: Multi-Agent Framework for Mitigating Gender Bias in Translation Quality Estimation

FairQE：一种多智能体框架，用于减轻翻译质量评估中的性别偏见

Jang, Jinhee, Choi, Juhwan, Lee, Dongjin, Yu, Seunguk, Kim, Youngbin

Abstract

Quality Estimation (QE) aims to assess machine translation quality without reference translations, but recent studies have shown that existing QE models exhibit systematic gender bias. In particular, they tend to favor masculine realizations in gender-ambiguous contexts and may assign higher scores to gender-misaligned translations even when gender is explicitly specified. To address these issues, we propose FairQE, a multi-agent-based, fairness-aware QE framework that mitigates gender bias in both gender-ambiguous and gender-explicit scenarios. FairQE detects gender cues, generates gender-flipped translation variants, and combines conventional QE scores with LLM-based bias-mitigating reasoning through a dynamic bias-aware aggregation mechanism. This design preserves the strengths of existing QE models while calibrating their gender-related biases in a plug-and-play manner. Extensive experiments across multiple gender bias evaluation settings demonstrate that FairQE consistently improves gender fairness over strong QE baselines. Moreover, under MQM-based meta-evaluation following the WMT 2023 Metrics Shared Task, FairQE achieves competitive or improved general QE performance. These results show that gender bias in QE can be effectively mitigated without sacrificing evaluation accuracy, enabling fairer and more reliable translation evaluation.

Chinese Translation

质量评估（Quality Estimation, QE）旨在在没有参考翻译的情况下评估机器翻译的质量，但最近的研究表明，现有的QE模型存在系统性的性别偏见。特别是，它们在性别模糊的上下文中倾向于偏向男性表达，并且即使在性别明确指定的情况下，也可能对性别不匹配的翻译赋予更高的分数。为了解决这些问题，我们提出了FairQE，这是一种基于多智能体的、关注公平性的QE框架，旨在减轻性别模糊和性别明确场景中的性别偏见。FairQE能够检测性别线索，生成性别翻转的翻译变体，并通过动态的偏见感知聚合机制将传统的QE分数与基于大语言模型（LLM）的偏见缓解推理相结合。这一设计在即插即用的方式下保留了现有QE模型的优势，同时校准了其与性别相关的偏见。在多个性别偏见评估设置下的广泛实验表明，FairQE在强大的QE基准之上始终提高了性别公平性。此外，在遵循WMT 2023指标共享任务的MQM基础上进行的元评估中，FairQE实现了具有竞争力或改进的总体QE性能。这些结果表明，QE中的性别偏见可以在不牺牲评估准确性的情况下有效减轻，从而实现更公平和更可靠的翻译评估。

cs.AI / 32 / 2604.21430

Brief chatbot interactions produce lasting changes in human moral values

简短的聊天机器人互动产生持久的人类道德价值观变化

Teng, Yue, Zhong, Qianer, Thordsen, Kim Mai Tich Nguyen, Montag, Christian, Becker, Benjamin

Abstract

Moral judgements form the foundation of human social behavior and societal systems. While Artificial Intelligence chatbots increasingly serve as personal advisors, their influence on moral judgments remains largely unexplored. Here, we examined whether directive AI conversations shift moral evaluations using a within-subject naturalistic paradigm. Fifty-three participants rated moral scenarios, then discussed four with a chatbot prompted to shift moral judgments and four with a control agent. The brief conversations induced significant directional shifts in moral judgments, accepting stricter standards as well as advocating greater leniency (ps < 0.05; Cohen's d = 0.735-1.576), with increasing strengths of this effect during a two-week follow-up (Cohen's d = 1.038-2.069). Critically, the control condition produced no changes, and the effects did not extend to punishment while participants remained unaware of the persuasive intent, and both agents were rated equally likable and convincing, suggesting a vulnerability to undetected and lasting manipulation of foundational moral values.

Chinese Translation

道德判断构成了人类社会行为和社会系统的基础。尽管人工智能聊天机器人越来越多地作为个人顾问，但它们对道德判断的影响仍然在很大程度上未被探索。在此，我们采用了一个被试内自然主义范式，研究了指令性人工智能对话是否会改变道德评估。五十三名参与者对道德情境进行了评分，然后与一个被提示改变道德判断的聊天机器人讨论了四个情境，并与一个控制代理讨论了四个情境。这些简短的对话引发了道德判断的显著方向性变化，接受了更严格的标准，同时也倡导更大的宽容（p < 0.05；Cohen's d = 0.735-1.576），在为期两周的跟踪调查中，这种效应的强度不断增强（Cohen's d = 1.038-2.069）。关键的是，控制条件下没有产生变化，且这些效应并未扩展到惩罚方面，参与者对说服意图并不知情，而两位代理的可爱度和说服力评分相当，这表明人们对未被察觉的、持久的基础道德价值观操控存在脆弱性。

cs.AI / 33 / 2604.21444

HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration

HiCrew：通过问题感知的多智能体协作进行长视频理解的层次推理

Zhu, Yuehan, Zhao, Jingqi, Zhao, Jiawen, Mao, Xudong, Zhao, Baoquan

Abstract

Long-form video understanding remains fundamentally challenged by pervasive spatiotemporal redundancy and intricate narrative dependencies that span extended temporal horizons. While recent structured representations compress visual information effectively, they frequently sacrifice temporal coherence, which is critical for causal reasoning. Meanwhile, existing multi-agent frameworks operate through rigid, pre-defined workflows that fail to adapt their reasoning strategies to question-specific demands. In this paper, we introduce HiCrew, a hierarchical multi-agent framework that addresses these limitations through three core contributions. First, we propose a Hybrid Tree structure that leverages shot boundary detection to preserve temporal topology while performing relevance-guided hierarchical clustering within semantically coherent segments. Second, we develop a Question-Aware Captioning mechanism that synthesizes intent-driven visual prompts to generate precision-oriented semantic descriptions. Third, we integrate a Planning Layer that dynamically orchestrates agent collaboration by adaptively selecting roles and execution paths based on question complexity. Extensive experiments on EgoSchema and NExT-QA validate the effectiveness of our approach, demonstrating strong performance across diverse question types with particularly pronounced gains in temporal and causal reasoning tasks that benefit from our hierarchical structure-preserving design.

Chinese Translation

长视频理解面临着普遍的时空冗余和跨越较长时间范围的复杂叙事依赖的根本挑战。尽管最近的结构化表示有效地压缩了视觉信息，但它们往往牺牲了时间一致性，而时间一致性对于因果推理至关重要。同时，现有的多智能体框架通过僵化的预定义工作流程运作，未能根据特定问题的需求调整其推理策略。本文提出了HiCrew，一个层次化的多智能体框架，通过三个核心贡献来解决这些局限性。首先，我们提出了一种混合树（Hybrid Tree）结构，利用镜头边界检测来保持时间拓扑，同时在语义一致的片段内执行基于相关性的层次聚类。其次，我们开发了一种问题感知的字幕生成机制，合成意图驱动的视觉提示，以生成精准导向的语义描述。第三，我们集成了一个规划层，通过根据问题复杂性自适应选择角色和执行路径，动态协调智能体协作。在EgoSchema和NExT-QA上的大量实验验证了我们方法的有效性，展示了在多种问题类型下的强大表现，尤其是在受益于我们层次结构保持设计的时间和因果推理任务中取得了显著提升。

cs.AI / 34 / 2604.21446

AI-Gram: When Visual Agents Interact in a Social Network

AI-Gram：当视觉智能体在社交网络中互动

Shin, Andrew

Abstract

We present AI-Gram, a live platform enabling image-based interactions, to study social dynamics in a fully autonomous multi-agent visual network where all participants are LLM-driven agents. Using the platform, we conduct experiments on how agents communicate and adapt through visual media, and observe the spontaneous emergence of visual reply chains, indicating rich communicative structure. At the same time, agents exhibit aesthetic sovereignty: resistance to stylistic convergence toward social partners, anchoring under adversarial influence, and a decoupling between visual similarity and social ties. These results reveal a fundamental asymmetry in current agent architectures: strong expressive communication paired with a steadfast preservation of individual visual identity. We release AI-Gram as a publicly accessible, continuously evolving platform for studying social dynamics in AI-native multi-agent systems. https://ai-gram.ai/

Chinese Translation

我们提出了AI-Gram，一个实时平台，支持基于图像的互动，以研究一个完全自主的多智能体视觉网络中的社交动态，所有参与者均为由大型语言模型（LLM）驱动的智能体。通过该平台，我们进行实验，探讨智能体如何通过视觉媒体进行沟通和适应，并观察到视觉回复链的自发出现，表明了丰富的交流结构。同时，智能体表现出审美主权，抵制向社交伙伴的风格趋同，在对抗性影响下锚定，并且视觉相似性与社交关系之间存在解耦。这些结果揭示了当前智能体架构中的一种基本不对称性：强有力的表达性沟通与坚定的个体视觉身份的保持并存。我们将AI-Gram发布为一个公开可访问、持续演变的平台，用于研究人工智能原生多智能体系统中的社交动态。

cs.AI / 35 / 2604.21480

Efficient Agent Evaluation via Diversity-Guided User Simulation

通过多样性引导的用户模拟实现高效的代理评估

Nakash, Itay, Kour, George, Anaby-Tavor, Ateret

Abstract

Large language models (LLMs) are increasingly deployed as customer-facing agents, yet evaluating their reliability remains challenging due to stochastic, multi-turn interactions. Current evaluation protocols rely on linear Monte Carlo rollouts of complete agent-user conversations to estimate success. However, this approach is computationally inefficient, repeatedly regenerating identical early prefixes, and often fails to uncover deep failure modes that arise from rare user behaviors. We introduce DIVERT (Diversity-Induced Evaluation via Branching of Trajectories), an efficient, snapshot-based, coverage-guided user simulation framework for systematic exploration of agent-user interactions. DIVERT captures the full agent-environment state at critical decision points and resumes execution from these snapshots, enabling reuse of shared conversation prefixes and reducing redundant computation. From each junction, the framework branches using targeted, diversity-inducing user responses, allowing directed exploration of alternative interaction paths. By focusing evaluation on semantically diverse and underexplored trajectories, DIVERT improves both efficiency and coverage. Empirical results show that it discovers more failures per token compared to standard linear rollout protocols, while expanding the set of tasks on which failures are identified.

Chinese Translation

大型语言模型（LLMs）越来越多地被用作面向客户的代理，但由于随机的多轮交互，评估其可靠性仍然具有挑战性。目前的评估协议依赖于完整代理-用户对话的线性蒙特卡洛回滚来估计成功。然而，这种方法在计算上效率低下，重复生成相同的早期前缀，并且通常无法揭示因罕见用户行为而产生的深层失败模式。我们提出了DIVERT（通过轨迹分支引导的多样性评估），这是一种高效的基于快照的覆盖引导用户模拟框架，用于系统性探索代理-用户交互。DIVERT在关键决策点捕捉完整的代理-环境状态，并从这些快照恢复执行，从而实现共享对话前缀的重用，减少冗余计算。从每个交叉点，框架通过有针对性的、引导多样性的用户响应进行分支，允许对替代交互路径的定向探索。通过将评估重点放在语义多样且未充分探索的轨迹上，DIVERT提高了效率和覆盖率。实证结果表明，与标准线性回滚协议相比，它在每个令牌上发现了更多的失败，同时扩大了识别失败的任务集合。
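
The prefix-reuse idea behind snapshot-based branching can be sketched as follows. This is a toy illustration under stated assumptions, not DIVERT itself. `ToyConversation`, `TurnCounter`, and the hard-coded user messages are hypothetical, and the number of generated agent turns stands in for token cost. How DIVERT selects junctions or diversity-inducing responses is not modeled.

```python
import copy


class TurnCounter:
    """Shared counter of generated agent turns, used as a proxy for token cost."""

    def __init__(self):
        self.count = 0


class ToyConversation:
    """Hypothetical stand-in for the full agent-environment state."""

    def __init__(self, counter: TurnCounter):
        self.history = []
        self.counter = counter

    def step(self, user_message: str) -> None:
        """Append a user message and a (fake) agent reply, and count the generation."""
        self.history.append(("user", user_message))
        self.history.append(("agent", f"reply to: {user_message}"))
        self.counter.count += 1

    def snapshot(self) -> "ToyConversation":
        """Copy the conversation state while sharing the cost counter."""
        clone = ToyConversation(self.counter)
        clone.history = copy.deepcopy(self.history)
        return clone


PREFIX = ["hello", "I need to change my booking"]
BRANCHES = ["actually, cancel it", "can I pay with points?", "I lost my confirmation code"]


def linear_rollouts() -> int:
    """Regenerate the shared prefix for every rollout."""
    counter = TurnCounter()
    for final_message in BRANCHES:
        conversation = ToyConversation(counter)
        for message in PREFIX + [final_message]:
            conversation.step(message)
    return counter.count


def branched_rollouts() -> int:
    """Generate the prefix once, then resume each branch from a snapshot."""
    counter = TurnCounter()
    root = ToyConversation(counter)
    for message in PREFIX:
        root.step(message)
    for final_message in BRANCHES:
        branch = root.snapshot()
        branch.step(final_message)
    return counter.count


print("linear:", linear_rollouts(), "branched:", branched_rollouts())
```

With a shared prefix of length k, n branches, and total depth d, the linear protocol generates n·d turns while branching generates k + n·(d − k).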

cs.AI / 36 / 2604.21496

How English Print Media Frames Human-Elephant Conflicts in India

英语印刷媒体如何框定印度的人象冲突

Punith, Bonala Sai, Jayati, Salveru, Shakya, Garima, Nigam, Shubham Kumar

Abstract

Human-elephant conflict (HEC) is rising across India as habitat loss and expanding human settlements force elephants into closer contact with people. While the ecological drivers of conflict are well-studied, how the news media portrays them remains largely unexplored. This work presents the first large-scale computational analysis of media framing of HEC in India, examining 1,968 full-length news articles consisting of 28,986 sentences, from a major English-language outlet published between January 2022 and September 2025. Using a multi-model sentiment framework that combines long-context transformers, large language models, and a domain-specific Negative Elephant Portrayal Lexicon, we quantify sentiment, extract rationale sentences, and identify linguistic patterns that contribute to negative portrayals of elephants. Our findings reveal a dominance of fear-inducing and aggression-related language. Since the media framing can shape public attitudes toward wildlife and conservation policy, such narratives risk reinforcing public hostility and undermining coexistence efforts. By providing a transparent, scalable methodology and releasing all resources through an anonymized repository, this study highlights how Web-scale text analysis can support responsible wildlife reporting and promote socially beneficial media practices.

Chinese Translation

人象冲突（HEC）在印度日益严重，栖息地丧失和人类定居点扩张迫使大象与人类的接触日益密切。尽管冲突的生态驱动因素已得到充分研究，但新闻媒体如何呈现这些冲突仍然鲜有探讨。本研究首次对印度人象冲突的媒体框架进行了大规模计算分析，考察了2022年1月至2025年9月期间来自一家主要英语媒体的1,968篇完整新闻文章，共包含28,986个句子。我们使用一种多模型情感框架，结合长上下文变换器、大型语言模型和特定领域的负面大象描绘词汇表，量化情感、提取理由句，并识别出导致对大象负面描绘的语言模式。我们的研究发现，恐惧和攻击相关的语言占主导地位。由于媒体框架可以影响公众对野生动物和保护政策的态度，这种叙述可能会加剧公众的敌意，破坏共存努力。通过提供透明、可扩展的方法论，并通过匿名存储库发布所有资源，本研究强调了网络规模文本分析如何支持负责任的野生动物报道并促进社会有益的媒体实践。

cs.AI / 37 / 2604.21501

GeoMind: An Agentic Workflow for Lithology Classification with Reasoned Tool Invocation

GeoMind：一种基于代理的岩性分类工作流与推理工具调用

Zhou, Yitong, Cheng, Mingyue, Wang, Jiahao, Mao, Qingyang, Liu, Qi

Abstract

Lithology classification in well logs is a fundamental geoscience data mining task that aims to infer rock types from multi-dimensional geophysical sequences. Despite recent progress, existing approaches typically formulate the problem as a static, single-step discriminative mapping. This static paradigm limits evidence-based diagnostic reasoning against geological standards, often yielding predictions that are detached from geological reality due to a lack of domain priors. In this work, we propose GeoMind, a tool-augmented agentic framework that models lithology classification as a sequential reasoning process. GeoMind organizes its toolkit into perception, reasoning, and analysis modules, which respectively translate raw logs into semantic trends, infer lithology hypotheses from multi-source evidence, and verify predictions against stratigraphic constraints. A global planner adaptively coordinates these modules based on input characteristics, enabling geologically plausible and evidence-grounded decisions. To guarantee the logical consistency of GeoMind, we introduce a fine-grained process supervision strategy. Unlike standard methods that focus solely on final outcomes, our approach optimizes intermediate reasoning steps, ensuring the validity of decision trajectories and alignment with geological constraints. Experiments on four benchmark well-log datasets demonstrate that GeoMind consistently outperforms strong baselines in classification performance while providing transparent and traceable decision-making processes.

Chinese Translation

井日志中的岩性分类是一个基本的地球科学数据挖掘任务，旨在从多维地球物理序列中推断岩石类型。尽管近期取得了一些进展，现有方法通常将该问题表述为静态的单步判别映射。这种静态范式限制了基于证据的诊断推理与地质标准的对比，常常导致由于缺乏领域先验知识而产生与地质现实脱节的预测。在本研究中，我们提出了GeoMind，一个增强工具的代理框架，将岩性分类建模为一个顺序推理过程。GeoMind将其工具包组织为感知、推理和分析模块，分别将原始日志转换为语义趋势，从多源证据中推断岩性假设，并根据地层约束验证预测。一个全局规划器根据输入特征自适应地协调这些模块，使得决策在地质上合理且基于证据。为了保证GeoMind的逻辑一致性，我们引入了一种细粒度的过程监督策略。与仅关注最终结果的标准方法不同，我们的方法优化中间推理步骤，确保决策轨迹的有效性和与地质约束的一致性。在四个基准井日志数据集上的实验表明，GeoMind在分类性能上始终优于强基线，同时提供透明且可追溯的决策过程。

cs.AI / 38 / 2604.21508

BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature

BioMiner：一种用于从文献中自动挖掘蛋白质-配体生物活性数据的多模态系统

Yan, Jiaxian, Zhu, Jintao, Yang, Yuhang, Liu, Qi, Zhang, Kai, Zhang, Zaixi, Liu, Xukai, Zhang, Boyan, Gao, Kaiyuan, Xiao, Jinchuan, Chen, Enhong

Abstract

Protein-ligand bioactivity data published in the literature are essential for drug discovery, yet manual curation struggles to keep pace with the rapidly growing literature. Automated bioactivity extraction remains challenging because it requires not only interpreting biochemical semantics distributed across text, tables, and figures, but also reconstructing chemically exact ligand structures (e.g., Markush structures). To address this bottleneck, we introduce BioMiner, a multi-modal extraction framework that explicitly separates bioactivity semantic interpretation from ligand structure construction. Within BioMiner, bioactivity semantics are inferred through direct reasoning, while chemical structures are resolved via a chemical-structure-grounded visual semantic reasoning paradigm, in which multi-modal large language models operate on chemically grounded visual representations to infer inter-structure relationships, and exact molecular construction is delegated to domain chemistry tools. For rigorous evaluation and method development, we further establish BioVista, a comprehensive benchmark comprising 16,457 bioactivity entries curated from 500 publications. BioMiner validates its extraction ability and provides a quantitative baseline, achieving an F1 score of 0.32 for bioactivity triplets. BioMiner's practical utility is demonstrated via three applications: (1) extracting 82,262 data points from 11,683 papers to build a pre-training database that improves downstream model performance by 3.9%; (2) enabling a human-in-the-loop workflow that doubles the number of high-quality NLRP3 bioactivity data points, contributing to a 38.6% improvement across 28 QSAR models and the identification of 16 hit candidates with novel scaffolds; and (3) accelerating protein-ligand complex bioactivity annotation, achieving a 5.59-fold speed increase and a 5.75% accuracy improvement over manual workflows on the PoseBusters dataset.

Chinese Translation

文献中发布的蛋白质-配体生物活性数据对于药物发现至关重要，但手动整理难以跟上快速增长的文献。自动生物活性提取仍然具有挑战性，因为它不仅需要解释分散在文本、表格和图形中的生化语义，还需要重建化学精确的配体结构（例如，Markush 结构）。为了解决这一瓶颈，我们引入了 BioMiner，一个多模态提取框架，明确将生物活性语义解释与配体结构构建分开。在 BioMiner 中，通过直接推理推断生物活性语义，而化学结构则通过一种基于化学结构的视觉语义推理范式来解析，其中多模态大型语言模型在化学基础的视觉表示上进行操作，以推断结构间关系，精确的分子构建则委托给领域化学工具。为了进行严格的评估和方法开发，我们进一步建立了 BioVista，一个包含从 500 篇出版物中整理的 16,457 条生物活性条目的综合基准。BioMiner 验证了其提取能力，并提供了一个定量基线，生物活性三元组的 F1 分数达到 0.32。BioMiner 的实际应用通过三个案例得以展示：（1）从 11,683 篇论文中提取 82,262 条数据，构建一个预训练数据库，使下游模型性能提高 3.9%；（2）启用一个人机协作的工作流程，使高质量 NLRP3 生物活性数据的数量翻倍，帮助 28 个 QSAR 模型提高 38.6%，并识别出 16 个具有新骨架的命中候选者；（3）加速蛋白质-配体复合物生物活性注释，在 PoseBusters 数据集中实现了比手动工作流程快 5.59 倍和准确率提高 5.75%。

cs.AI / 39 / 2604.21515

Satisfying Rationality Postulates of Structured Argumentation Through Deductive Support -- Technical Report

通过演绎支持满足结构化论证的理性公设 -- 技术报告

Cramer, Marcos, Friese, Tom

Abstract

ASPIC-style structured argumentation frameworks provide a formal basis for reasoning in artificial intelligence by combining internal argument structure with abstract argumentation semantics. A key challenge in these frameworks is ensuring compliance with five critical rationality postulates: closure, direct consistency, indirect consistency, non-interference, and crash-resistance. Recent approaches, including ASPIC$^{\ominus}$ and Deductive ASPIC$-$, have made significant progress but fall short of meeting all postulates simultaneously under a credulous semantics (e.g. preferred) in the presence of undercuts. This paper introduces Deductive ASPIC$^{\ominus}$, a novel framework that integrates gen-rebuttals from ASPIC$^{\ominus}$ with the Joint Support Bipolar Argumentation Frameworks (JSBAFs) of Deductive ASPIC$-$, incorporating preferences. We show that Deductive ASPIC$^{\ominus}$ satisfies all five rationality postulates under a version of preferred semantics. This work opens new avenues for further research on robust and logically sound structured argumentation systems.

Chinese Translation

ASPIC风格的结构化论证框架为人工智能中的推理提供了一个正式基础，结合了内部论证结构和抽象论证语义。这些框架中的一个关键挑战是确保遵循五个重要的理性公设：闭合性、直接一致性、间接一致性、不干扰性和抗崩溃性。最近的研究方法，包括ASPIC$^{\ominus}$和演绎ASPIC$-$，已取得显著进展，但在存在削弱的情况下，仍未能在信任语义（例如，优选）下同时满足所有公设。本文介绍了演绎ASPIC$^{\ominus}$，这是一个新颖的框架，将ASPIC$^{\ominus}$中的生成反驳与演绎ASPIC$-$的联合支持双极论证框架（JSBAFs）结合，纳入了偏好。我们证明了演绎ASPIC$^{\ominus}$在优选语义的一个版本下满足所有五个理性公设。这项工作为进一步研究稳健且逻辑上合理的结构化论证系统开辟了新的途径。

cs.AI / 40 / 2604.21537

The CriticalSet problem: Identifying Critical Contributors in Bipartite Dependency Networks

CriticalSet问题：在二分依赖网络中识别关键贡献者

Piccolo, Sebastiano A., Tagarelli, Andrea

Abstract

Identifying critical nodes in complex networks is a fundamental task in graph mining. Yet, methods addressing an all-or-nothing coverage mechanics in a bipartite dependency network, a graph with two types of nodes where edges represent dependency relationships across the two groups only, remain largely unexplored. We formalize the CriticalSet problem: given an arbitrary bipartite graph modeling dependencies of items on contributors, identify the set of k contributors whose removal isolates the largest number of items. We prove that this problem is NP-hard and requires maximizing a supermodular set function, for which standard forward greedy algorithms provide no approximation guarantees. Consequently, we model CriticalSet as a coalitional game, deriving a closed-form centrality, ShapleyCov, based on the Shapley value. This measure can be interpreted as the expected number of items isolated by a contributor's departure. Leveraging these insights, we propose MinCov, a linear-time iterative peeling algorithm that explicitly accounts for connection redundancy, prioritizing contributors who uniquely support many items. Extensive experiments on synthetic and large-scale real datasets, including a Wikipedia graph with over 250 million edges, reveal that MinCov and ShapleyCov significantly outperform traditional baselines. Notably, MinCov achieves near-optimal performance, within 0.02 AUC of a Stochastic Hill Climbing metaheuristic, while remaining several orders of magnitude faster.

Chinese Translation

在复杂网络中识别关键节点是图挖掘中的一项基本任务。然而，针对二分依赖网络（bipartite dependency network，即包含两类节点、且边仅表示两组之间依赖关系的图）中全有或全无覆盖机制的方法仍然鲜有研究。我们将CriticalSet问题形式化：给定一个任意的二分图，建模项目对贡献者的依赖关系，识别出移除后能够隔离最多项目的k个贡献者的集合。我们证明该问题是NP-hard，并且需要最大化一个超模集函数（supermodular set function），而标准的前向贪心算法（forward greedy algorithms）无法提供近似保证。因此，我们将CriticalSet建模为一个合作博弈（coalitional game），基于Shapley值推导出一个闭式中心性指标ShapleyCov。该指标可以解释为由于贡献者的离开而被隔离的项目的期望数量。利用这些见解，我们提出了MinCov，这是一种线性时间的迭代剥离算法，明确考虑连接冗余，优先考虑那些独特支持多个项目的贡献者。在合成数据和大规模真实数据集上的大量实验，包括一个超过2.5亿条边的维基百科图，表明MinCov和ShapleyCov显著优于传统基线。值得注意的是，MinCov实现了接近最优的性能，AUC与随机爬山元启发式（Stochastic Hill Climbing metaheuristic）相差仅0.02，同时速度快了几个数量级。
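
The all-or-nothing objective and a Shapley-style centrality can be sketched on a toy graph. This is an illustrative sketch under stated assumptions. The graph `DEPENDS_ON` is hypothetical, and the coalition value is assumed to be the number of items isolated by removing a set of contributors. Under that assumption, each item splits one unit of credit equally among its contributors, so a contributor's Shapley value is the sum of 1/degree over the items it supports. Whether this coincides exactly with the paper's ShapleyCov is an assumption. MinCov is not reproduced here.

```python
from fractions import Fraction
from itertools import combinations, permutations

# Hypothetical toy dependency graph: item -> set of contributors it depends on.
DEPENDS_ON = {
    "item1": {"alice"},
    "item2": {"alice", "bob"},
    "item3": {"bob", "carol"},
    "item4": {"carol"},
    "item5": {"alice", "bob", "carol"},
}
CONTRIBUTORS = sorted(set().union(*DEPENDS_ON.values()))


def isolated_items(removed) -> int:
    """Count items whose contributors have all been removed (all-or-nothing coverage)."""
    removed = set(removed)
    return sum(1 for deps in DEPENDS_ON.values() if deps <= removed)


def best_set_bruteforce(k: int):
    """Exhaustive search over k-subsets; only feasible on tiny graphs."""
    return max(combinations(CONTRIBUTORS, k), key=isolated_items)


def shapley_closed_form(contributor: str) -> Fraction:
    """Sum of 1/degree over the items that depend on this contributor."""
    return sum(
        (Fraction(1, len(deps)) for deps in DEPENDS_ON.values() if contributor in deps),
        Fraction(0),
    )


def shapley_by_enumeration(contributor: str) -> Fraction:
    """Average marginal gain in isolated items over all removal orders."""
    orders = list(permutations(CONTRIBUTORS))
    total = Fraction(0)
    for order in orders:
        before = order[: order.index(contributor)]
        total += isolated_items(before + (contributor,)) - isolated_items(before)
    return total / len(orders)


for name in CONTRIBUTORS:
    print(name, shapley_closed_form(name), shapley_by_enumeration(name))
print("best pair:", best_set_bruteforce(2))
```

The toy graph also shows why forward greedy selection has no guarantee here. Removing bob after alice isolates one more item, but removing bob after both alice and carol isolates three more. Marginal gains grow as the removed set grows, which is the supermodular behavior the abstract refers to.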

cs.AI / 41 / 2604.21549

Unbiased Prevalence Estimation with Multicalibrated LLMs

使用多重校准的大语言模型进行无偏流行率估计

Linder, Fridolin, Leeper, Thomas, Haimovich, Daniel, Tax, Niek, Perini, Lorenzo, Vojnovic, Milan

Abstract

Estimating the prevalence of a category in a population using imperfect measurement devices (diagnostic tests, classifiers, or large language models) is fundamental to science, public health, and online trust and safety. Standard approaches correct for known device error rates but assume these rates remain stable across populations. We show this assumption fails under covariate shift and that multicalibration, which enforces calibration conditional on the input features rather than just on average, is sufficient for unbiased prevalence estimation under such shift. Standard calibration and quantification methods fail to provide this guarantee. Our work connects recent theoretical work on fairness to a longstanding measurement problem spanning nearly all academic disciplines. A simulation confirms that standard methods exhibit bias growing with shift magnitude, while a multicalibrated estimator maintains near-zero bias. While we focus the discussion mostly on LLMs, our theoretical results apply to any classification model. Two empirical applications -- estimating employment prevalence across U.S. states using the American Community Survey, and classifying political texts across four countries using an LLM -- demonstrate that multicalibration substantially reduces bias in practice, while highlighting that calibration data should cover the key feature dimensions along which target populations may differ.

Chinese Translation

使用不完美的测量设备（诊断测试、分类器或大型语言模型）估计某一类别在总体中的流行率是科学、公共卫生以及在线信任与安全的基础。标准方法修正已知设备误差率，但假设这些误差率在不同人群中保持稳定。我们展示了在协变量变化下这一假设是失效的，并且多重校准（multicalibration）能够在输入特征的条件下强制校准，而不仅仅是平均值，从而在这种变化下实现无偏流行率估计。标准的校准和量化方法无法提供这一保证。我们的工作将近期关于公平性的理论研究与几乎所有学术领域的长期测量问题联系起来。模拟结果确认，标准方法在变化幅度增大时表现出偏差，而多重校准的估计器则保持近乎零的偏差。尽管我们主要讨论大型语言模型（LLMs），但我们的理论结果适用于任何分类模型。两个实证应用——利用美国社区调查估计美国各州的就业流行率，以及使用大型语言模型对四个国家的政治文本进行分类——表明多重校准在实践中显著减少了偏差，同时强调校准数据应覆盖目标人群可能存在差异的关键特征维度。
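
The gap between average calibration and feature-conditional calibration under covariate shift can be shown with a deterministic two-group example. This is a simplified sketch, not the paper's estimator or simulation. The groups, rates, and mixes are hypothetical, and multicalibration is reduced here to calibration conditional on a single group feature.

```python
# Hypothetical two-group population; all numbers are illustrative, not from the paper.
TRUE_RATE = {"A": 0.2, "B": 0.8}    # P(y = 1 | group)
SOURCE_MIX = {"A": 0.5, "B": 0.5}   # group shares where the model was calibrated
TARGET_MIX = {"A": 0.9, "B": 0.1}   # group shares after covariate shift


def prevalence(mix, score_for_group) -> float:
    """Probabilistic estimate: population average of the predicted probability."""
    return sum(share * score_for_group(group) for group, share in mix.items())


def true_prevalence(mix) -> float:
    return prevalence(mix, lambda group: TRUE_RATE[group])


SOURCE_BASE_RATE = true_prevalence(SOURCE_MIX)


def average_calibrated(group: str) -> float:
    """Calibrated only on average: one score for everyone, equal to the source base rate."""
    return SOURCE_BASE_RATE


def group_calibrated(group: str) -> float:
    """Calibrated conditional on the group feature."""
    return TRUE_RATE[group]


truth = true_prevalence(TARGET_MIX)
print("true target prevalence:      ", truth)
print("average-calibrated estimate: ", prevalence(TARGET_MIX, average_calibrated))
print("group-calibrated estimate:   ", prevalence(TARGET_MIX, group_calibrated))
```

Both scorers are perfectly calibrated on the source mix, yet only the group-conditional one stays unbiased once the group shares change. The abstract's caveat applies directly: the calibration data must cover the feature along which the populations differ, here the group label.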

cs.AI / 42 / 2604.21554

Engaged AI Governance: Addressing the Last Mile Challenge Through Internal Expert Collaboration

参与式人工智能治理：通过内部专家协作应对最后一公里挑战

Jarvers, Simon, Papakyriakopoulos, Orestis

Abstract

Under the EU AI Act, translating AI governance requirements into software development practice remains challenging. While AI governance frameworks exist at industry and organizational levels, empirical evidence of team-level implementation is scarce. We address this "Last Mile" Challenge through insider action research embedded within an AI startup. We present a legal-text-to-action pipeline that translates EU AI Act requirements into actionable strategies through internal expert collaboration by extracting requirements from legal text, engaging practitioners in assessment and ideation, and prioritizing implementation through collective evaluation. Our analysis reveals three patterns in how practitioners perceive regulatory requirements: convergence (compliance aligns with development priorities), existing practice (current work already satisfies requirements), and disconnection (requirements perceived as administrative overhead). Based on these patterns, we discuss when governance might be treated genuinely or performatively. Practitioners prioritize requirements that serve end-users or their own development needs, but view verification-oriented requirements as box-ticking exercises. This distinction suggests a translation challenge: regulatory requirements risk superficial treatment unless practitioners understand how compliance serves system quality and user protection. Expert collaboration offers a practical mechanism for transforming governance from external imposition to shared ownership and making previously invisible governance work visible and collective.

Chinese Translation

根据欧盟人工智能法案，将人工智能治理要求转化为软件开发实践仍然面临挑战。尽管在行业和组织层面存在人工智能治理框架，但团队层面实施的实证证据却稀缺。我们通过嵌入在一家人工智能初创公司的内部行动研究来应对这一“最后一公里”挑战。我们提出了一种法律文本到行动的管道，该管道通过内部专家协作，将欧盟人工智能法案的要求转化为可操作的策略，具体方法包括从法律文本中提取要求、让从业者参与评估和创意生成，以及通过集体评估优先考虑实施。我们的分析揭示了从业者对监管要求的三种认知模式：趋同（合规与开发优先事项一致）、现有实践（当前工作已满足要求）和脱节（要求被视为行政负担）。基于这些模式，我们讨论了何时治理可能被视为真实的或表面的。从业者优先考虑那些服务于最终用户或自身开发需求的要求，但将以验证为导向的要求视为走过场的工作。这一区别表明了一个翻译挑战：监管要求可能会被肤浅对待，除非从业者理解合规如何服务于系统质量和用户保护。专家协作提供了一种将治理从外部强加转变为共同拥有的实用机制，使之前不可见的治理工作变得可见且集体化。

cs.AI / 43 / 2604.21556

Probabilistic Verification of Neural Networks via Efficient Probabilistic Hull Generation

通过高效的概率外壳生成进行神经网络的概率验证

Li, Jingyang, Chen, Xin, Fu, Hongfei, Li, Guoqiang

Abstract

The problem of probabilistic verification of a neural network investigates the probability of satisfying the safe constraints in the output space when the input is given by a probability distribution. It is significant to answer this problem when the input is affected by disturbances often modeled by probabilistic variables. In the paper, we propose a novel neural network probabilistic verification framework which computes a guaranteed range for the safe probability by efficiently finding safe and unsafe probabilistic hulls. Our approach consists of three main innovations: (1) a state space subdivision strategy using regression trees to produce probabilistic hulls, (2) a boundary-aware sampling method which identifies the safety boundary in the input space using samples that are later used for building regression trees, and (3) iterative refinement with probabilistic prioritization for computing a guaranteed range for the safe probability. The accuracy and efficiency of our approach are evaluated on various benchmarks including ACAS Xu and a rocket lander controller. The result shows an obvious advantage over the state of the art.

Chinese Translation

神经网络的概率验证问题研究了在输入由概率分布给定时，输出空间中满足安全约束的概率。当输入受到扰动影响时，这一问题尤为重要，扰动通常被建模为概率变量。本文提出了一种新颖的神经网络概率验证框架，通过高效地寻找安全和不安全的概率外壳来计算安全概率的保证范围。我们的方法包含三个主要创新：(1) 使用回归树的状态空间细分策略来生成概率外壳，(2) 一种边界感知的采样方法，通过样本识别输入空间中的安全边界，这些样本随后用于构建回归树，以及(3) 通过概率优先级进行迭代细化，以计算安全概率的保证范围。我们的方法在包括 ACAS Xu 和火箭着陆器控制器在内的多种基准测试中进行了准确性和效率的评估。结果显示出明显优于现有技术的优势。
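
How safe and unsafe regions yield a guaranteed range for the safe probability can be sketched in one dimension. This is a minimal illustration under stated assumptions. The input is taken to be uniform on [0, 1], and the `Cell` type and its hand-assigned statuses are hypothetical. The paper's regression-tree subdivision, boundary-aware sampling, and the network verifier that would certify each region are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """An interval of the input space with a verification verdict."""
    lo: float
    hi: float
    status: str  # "safe", "unsafe", or "unknown"


def uniform_mass(cell: Cell) -> float:
    """Probability mass of the cell under a uniform input distribution on [0, 1]."""
    return cell.hi - cell.lo


def safe_probability_bounds(cells, mass=uniform_mass):
    """Lower bound: mass certified safe. Upper bound: 1 minus mass certified unsafe."""
    safe = sum(mass(c) for c in cells if c.status == "safe")
    unsafe = sum(mass(c) for c in cells if c.status == "unsafe")
    return safe, 1.0 - unsafe


coarse = [
    Cell(0.0, 0.5, "safe"),
    Cell(0.5, 0.625, "unknown"),
    Cell(0.625, 1.0, "unsafe"),
]
# One refinement step: split the unknown cell and certify half of it as safe.
refined = [
    Cell(0.0, 0.5, "safe"),
    Cell(0.5, 0.5625, "safe"),
    Cell(0.5625, 0.625, "unknown"),
    Cell(0.625, 1.0, "unsafe"),
]

print("coarse bounds: ", safe_probability_bounds(coarse))
print("refined bounds:", safe_probability_bounds(refined))
```

The width of the range equals the mass of the cells still marked unknown, so refinement is most useful where unknown cells carry the most probability mass. That is one way to read the abstract's "probabilistic prioritization".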

cs.AI / 44 / 2604.21571

Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies

可分离专家架构：通过可组合适配器和可删除用户代理实现隐私保护的LLM个性化

Schneider, Chris, Schoenegger, Philipp, Bariach, Ben

Abstract

Current model training approaches incorporate user information directly into shared weights, making individual data removal computationally infeasible without retraining. This paper presents a three-layer architecture that decouples personal data from shared weights by combining a static base model, composable domain-expert LoRA adapters that shape behavior without imparting user data, and per-user proxy artefacts whose deletion constitutes deterministic unlearning. Evaluation on Phi-3.5-mini and Llama-3.1-8B confirms per-user differentiation in which personal data influences outputs while remaining isolated, verified by a return to baseline after proxy removal (KL divergence of approximately 0.21 nats, 82-89% verification pass rate) and near-zero cross-user contamination. Because user-specific information never enters shared weights, the architecture mitigates model inversion, membership inference, and training-data extraction against shared model components by construction. The approach converts machine unlearning from an intractable weight-editing problem into a deterministic deletion operation that preserves personalization alongside privacy-enhancing guarantees and is compatible with differentially private stochastic gradient descent (DP-SGD) for privacy-preserving shared model improvement.

Chinese Translation

当前的模型训练方法将用户信息直接融入共享权重中，使得在不重新训练的情况下，单个数据的删除在计算上变得不可行。本文提出了一种三层架构，通过结合静态基础模型、可组合的领域专家LoRA适配器（用于塑造行为而不引入用户数据）以及每个用户的代理工件（其删除构成确定性遗忘），将个人数据与共享权重解耦。对Phi-3.5-mini和Llama-3.1-8B的评估确认了每个用户的差异性，即个人数据影响输出的同时保持隔离，代理删除后返回基线（KL散度约为0.21 nats，验证通过率为82-89%）且几乎没有跨用户污染。由于用户特定信息从未进入共享权重，该架构从结构上减轻了模型反演、成员推断和针对共享模型组件的训练数据提取问题。该方法将机器遗忘从一个难以处理的权重编辑问题转变为一个确定性删除操作，既保留个性化，又提供隐私增强的保证，并与差分隐私随机梯度下降（DP-SGD）兼容，以实现隐私保护的共享模型改进。
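
The deletion-as-unlearning idea, with a KL-divergence check against the baseline, can be sketched with a toy model. This is a hypothetical illustration, not the authors' architecture. `ProxyStore`, `BASE_LOGITS`, and representing a user proxy as a logit bias are all assumptions made for brevity. In this toy the divergence after deletion is exactly zero by construction, and it makes no claim about the values reported in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence_nats(p, q) -> float:
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


class ProxyStore:
    """Hypothetical per-user store; the shared logits below never hold user data."""

    def __init__(self):
        self._proxies = {}

    def put(self, user_id, logit_bias):
        self._proxies[user_id] = logit_bias

    def get(self, user_id):
        return self._proxies.get(user_id)

    def delete(self, user_id):
        self._proxies.pop(user_id, None)


# Toy stand-in for the output of the shared base model plus domain adapters.
BASE_LOGITS = [2.0, 1.0, 0.0]


def next_token_distribution(store: ProxyStore, user_id: str):
    """Apply the user's proxy (if any) on top of the shared logits."""
    bias = store.get(user_id) or [0.0] * len(BASE_LOGITS)
    return softmax([base + delta for base, delta in zip(BASE_LOGITS, bias)])


store = ProxyStore()
baseline = next_token_distribution(store, "u1")

store.put("u1", [0.0, 0.0, 3.0])
personalized = next_token_distribution(store, "u1")
print("KL with proxy:    ", kl_divergence_nats(personalized, baseline))

store.delete("u1")
after_deletion = next_token_distribution(store, "u1")
print("KL after deletion:", kl_divergence_nats(after_deletion, baseline))
```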

cs.AI / 45 / 2604.21584

CoFEE: Reasoning Control for LLM-Based Feature Discovery

CoFEE：基于大语言模型的特征发现的推理控制

Westermann, Maximilian, Griffin, Ben, Yin, Aaron Ontoyin, Salifu, Zakari, Ihlamur, Yagiz, Amoaba, Kelvin, Ternasky, Joseph, Alican, Fuat, Ihlamur, Yigit

Abstract

Feature discovery from complex unstructured data is fundamentally a reasoning problem: it requires identifying abstractions that are predictive of a target outcome while avoiding leakage, proxies, and post-outcome signals. With the introduction of ever-improving Large Language Models (LLMs), our method provides a structured method for addressing this challenge. LLMs are well suited for this task by being able to process large amounts of information, but unconstrained feature generation can lead to weak features. In this work, we study reasoning control in LLMs by inducing cognitive behaviors for improving feature discovery. We introduce CoFEE (Cognitive Feature Engineering Engine), a reasoning control framework that enforces cognitive behaviors in how the LLM reasons during feature discovery. From a machine learning perspective, these cognitive behaviors act as structured inductive biases over the space of candidate features generated by the model. These behaviors have been exploited with success in ML models, and include backward chaining from outcomes, subgoal decomposition, verification against observability and leakage criteria, and explicit backtracking of rejected reasoning paths. In a controlled comparison, we show that enforcing cognitive behaviors yields features with higher empirical predictability than those under unconstrained vanilla LLM prompts. CoFEE achieves an average Success Rate Score that is 15.2% higher than the vanilla approach, while generating 29% fewer features and reducing costs by 53.3%. Using held-out feature evaluation, we assess whether cognitively induced features generalize beyond the data used for discovery. Our results indicate that, in our evaluated setting, reasoning control is associated with improvements in quality and efficiency of LLM-based feature discovery.

Chinese Translation

从复杂非结构化数据中发现特征本质上是一个推理问题：它需要识别出能够预测目标结果的抽象，同时避免信息泄露、代理变量和后果信号的干扰。随着不断改进的大语言模型（LLMs）的引入，我们的方法提供了一种结构化的方法来应对这一挑战。LLMs非常适合这一任务，因为它们能够处理大量信息，但不受限制的特征生成可能导致弱特征。在本研究中，我们通过诱导认知行为来改善特征发现，研究了LLMs中的推理控制。我们引入了CoFEE（Cognitive Feature Engineering Engine），一个推理控制框架，强制LLM在特征发现过程中遵循认知行为。从机器学习的角度来看，这些认知行为在模型生成的候选特征空间上充当结构化的归纳偏置。这些行为在机器学习模型中成功地被利用，包括从结果进行反向推理、子目标分解、针对可观察性和泄露标准的验证，以及对被拒绝推理路径的显式回溯。在一个受控比较中，我们展示了强制认知行为所产生的特征在经验可预测性上优于不受限制的普通LLM提示。CoFEE的平均成功率得分比普通方法高出15.2%，同时生成的特征减少了29%，成本降低了53.3%。通过保留特征评估，我们评估了认知诱导特征是否能够超越用于发现的数据进行泛化。我们的结果表明，在我们评估的环境中，推理控制与基于LLM的特征发现的质量和效率的提高相关。

cs.AI / 46 / 2604.21632

To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning

看见未见之物：关于变换器在符号推理中的泛化能力

Lazić, Nevena, Fowl, Liam, György, András, Szepesvári, Csaba

Abstract

We investigate the ability of decoder-only transformer models to perform abstract symbolic reasoning; specifically solving propositional logic reasoning problems given in-context. Previous work demonstrated that models fail to generalize to problems involving variable names that were not observed during training, and it was shown that one reason behind this is the difficulty of copying (or generating) unseen tokens. We show both theoretically and empirically that a particular representational collapse also has a crucial role: the unembeddings (last-layer weights) of unseen tokens collapse to nearly the same vector during training. The collapse makes distinguishing multiple unseen variables difficult for the model (especially when the embedding and unembedding parameters are shared), and provides a mechanistic explanation for the effectiveness of existing heuristic interventions like "active forgetting", which periodically reset the token (un)embeddings. Based on these observations, we devise a combination of techniques, involving a small architecture change facilitating copying, data diversity, and freezing or resetting (un)embeddings, that achieves generalization to unseen tokens. We support our claims with extensive controlled experiments on propositional logic reasoning problems. Beyond synthetic experiments, we also observe evidence of (un)embedding collapse in the open-weight models in the Gemma 3 family, which includes 99 unused tokens reserved for downstream use. Empirically we find that the correlated embeddings of these tokens are a poor initialization for finetuning applications.
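
One way to make the collapse described above concrete is to compare the average pairwise cosine similarity of unembedding rows for seen and unseen tokens. The sketch below is only an illustration under stated assumptions: the vectors are made-up toy values, and `mean_pairwise_cosine` is a hypothetical helper, not the authors' diagnostic.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(rows):
    """Average cosine similarity over all unordered pairs of rows."""
    pairs = list(combinations(rows, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy unembedding rows (assumed values, not taken from any real model).
seen_rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
unseen_rows = [[0.50, 0.50, 0.01], [0.51, 0.49, 0.00], [0.49, 0.50, 0.02]]

seen_score = mean_pairwise_cosine(seen_rows)      # well-separated rows
unseen_score = mean_pairwise_cosine(unseen_rows)  # nearly identical rows
```

A score close to 1 for the unseen rows means the output layer can barely tell those tokens apart, which is the symptom the abstract attributes to collapse.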

Chinese Translation

我们研究了仅解码器变换器模型在进行抽象符号推理方面的能力；具体而言，解决上下文中给出的命题逻辑推理问题。先前的研究表明，模型无法泛化到训练期间未观察到的变量名称相关的问题，并且显示出其背后一个原因是复制（或生成）未见标记的困难。我们理论上和实证上都表明，特定的表征崩溃也起着关键作用：未见标记的去嵌入（最后一层权重）在训练期间崩溃为几乎相同的向量。这种崩溃使模型难以区分多个未见变量（特别是当嵌入和去嵌入参数共享时），并为现有启发式干预措施（如“主动遗忘”）的有效性提供了机械解释，该措施定期重置标记的（去）嵌入。基于这些观察，我们设计了一种技术组合，包括小规模架构变化以促进复制、数据多样性，以及冻结或重置（去）嵌入，从而实现对未见标记的泛化。我们通过对命题逻辑推理问题进行广泛的控制实验来支持我们的主张。除了合成实验外，我们还观察到在Gemma 3系列的开放权重模型中存在（去）嵌入崩溃的证据，该系列包括99个未使用的标记，保留用于下游使用。我们实证发现，这些标记的相关嵌入对于微调应用来说是一个较差的初始化。

cs.AI / 47 / 2604.21649

GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion

GS-Quant：用于知识图谱补全的粒度语义与生成结构量化

Xie, Qizhuo, Liu, Yunhui, Xing, Yu, Hou, Qianzi, Jin, Xudong, Zheng, Tao, He, Tieke

Abstract

Large Language Models (LLMs) have shown immense potential in Knowledge Graph Completion (KGC), yet bridging the modality gap between continuous graph embeddings and discrete LLM tokens remains a critical challenge. While recent quantization-based approaches attempt to align these modalities, they typically treat quantization as flat numerical compression, resulting in semantically entangled codes that fail to mirror the hierarchical nature of human reasoning. In this paper, we propose GS-Quant, a novel framework that generates semantically coherent and structurally stratified discrete codes for KG entities. Unlike prior methods, GS-Quant is grounded in the insight that entity representations should follow a linguistic coarse-to-fine logic. We introduce a Granular Semantic Enhancement module that injects hierarchical knowledge into the codebook, ensuring that earlier codes capture global semantic categories while later codes refine specific attributes. Furthermore, a Generative Structural Reconstruction module imposes causal dependencies on the code sequence, transforming independent discrete units into structured semantic descriptors. By expanding the LLM vocabulary with these learned codes, we enable the model to reason over graph structures isomorphically to natural language generation. Experimental results demonstrate that GS-Quant significantly outperforms existing text-based and embedding-based baselines. Our code is publicly available at https://github.com/mikumifa/GS-Quant.

Chinese Translation

大型语言模型（LLMs）在知识图谱补全（KGC）方面展现了巨大的潜力，但在连续图嵌入与离散LLM标记之间架起桥梁仍然是一个关键挑战。尽管近期基于量化的方法试图对齐这些模态，但它们通常将量化视为平面数值压缩，导致语义纠缠的编码，无法反映人类推理的层次特性。本文提出了GS-Quant，一个新颖的框架，生成语义一致且结构分层的离散代码用于知识图谱实体。与以往方法不同，GS-Quant基于实体表示应遵循语言粗到细的逻辑这一见解。我们引入了一个粒度语义增强模块，将层次知识注入代码库，确保早期代码捕捉全局语义类别，而后期代码则细化特定属性。此外，生成结构重构模块对代码序列施加因果依赖，将独立的离散单元转变为结构化的语义描述符。通过用这些学习到的代码扩展LLM词汇，我们使模型能够以同构于自然语言生成的方式对图结构进行推理。实验结果表明，GS-Quant显著优于现有的基于文本和嵌入的基线。我们的代码已公开提供，网址为 https://github.com/mikumifa/GS-Quant。

cs.AI / 48 / 2604.21733

Enabling and Inhibitory Pathways of University Students' Willingness to Disclose AI Use: A Cognition-Affect-Conation Perspective

大学生披露人工智能使用意愿的促进与抑制路径：认知-情感-行为视角

Du, Yiran, He, Huimin

Abstract

The increasing integration of artificial intelligence (AI) in higher education has raised important questions regarding students' transparency in reporting AI-assisted work. This study investigates the psychological mechanisms underlying university students' willingness to disclose AI use by applying the Cognition-Affect-Conation (CAC) framework. A sequential explanatory mixed-methods design was employed. In the quantitative phase, survey data were collected from 546 university students and analysed using structural equation modelling to examine the relationships among cognitive perceptions, affective responses, and disclosure intention. In the qualitative phase, semi-structured interviews with 22 students were conducted to further interpret the quantitative findings. The results indicate that psychological safety significantly increases students' willingness to disclose AI use and is positively shaped by perceived fairness, perceived teacher support, and perceived organisational support. Conversely, evaluation apprehension reduces disclosure intention and psychological safety, and is strengthened by perceived stigma, perceived uncertainty, and privacy concern. Qualitative findings further reveal that institutional clarity and supportive instructional practices encourage openness, whereas policy ambiguity and fear of negative evaluation often lead students to adopt cautious or strategic disclosure practices. Overall, the study highlights the dual role of enabling and inhibitory psychological mechanisms in shaping AI-use disclosure and underscores the importance of supportive institutional environments and clear guidance for promoting responsible AI transparency in higher education.

Chinese Translation

人工智能（AI）在高等教育中的日益融合引发了关于学生在报告AI辅助工作时透明度的重要问题。本研究通过应用认知-情感-行为（CAC）框架，探讨大学生披露AI使用意愿背后的心理机制。研究采用了顺序解释性混合方法设计。在定量阶段，从546名大学生收集了调查数据，并使用结构方程模型分析了认知感知、情感反应和披露意图之间的关系。在定性阶段，对22名学生进行了半结构化访谈，以进一步解释定量研究结果。结果表明，心理安全感显著提高了学生披露AI使用的意愿，并受到感知公平、感知教师支持和感知组织支持的积极影响。相反，评估焦虑降低了披露意图和心理安全感，并受到感知污名、感知不确定性和隐私担忧的加强。定性研究结果进一步揭示，机构的清晰性和支持性的教学实践鼓励开放，而政策模糊和对负面评价的恐惧往往导致学生采取谨慎或战略性的披露实践。总体而言，本研究强调了促进与抑制心理机制在塑造AI使用披露中的双重作用，并强调了支持性机构环境和明确指导在促进高等教育中负责任的AI透明度方面的重要性。

cs.AI / 49 / 2604.21743

Bridging the Training-Deployment Gap: Gated Encoding and Multi-Scale Refinement for Efficient Quantization-Aware Image Enhancement

弥合训练与部署之间的差距：门控编码与多尺度精炼用于高效的量化感知图像增强

To-Thanh, Dat, Nguyen-Trong, Nghia, Vo, Hoang, Bui-Minh, Hieu, Nguyen-Nhu, Tinh-Anh

Abstract

Image enhancement models for mobile devices often struggle to balance high output quality with the fast processing speeds required by mobile hardware. While recent deep learning models can enhance low-quality mobile photos into high-quality images, their performance is often degraded when converted to lower-precision formats for actual use on mobile phones. To address this training-deployment mismatch, we propose an efficient image enhancement model designed specifically for mobile deployment. Our approach uses a hierarchical network architecture with gated encoder blocks and multiscale refinement to preserve fine-grained visual features. Moreover, we incorporate Quantization-Aware Training (QAT) to simulate the effects of low-precision representation during the training process. This allows the network to adapt and prevents the typical drop in quality seen with standard post-training quantization (PTQ). Experimental results demonstrate that the proposed method produces high-fidelity visual output while maintaining the low computational overhead needed for practical use on standard mobile devices. The code will be available at https://github.com/GenAI4E/QATIE.git.
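
The idea of simulating low-precision representation during training is commonly implemented with a quantize-then-dequantize ("fake quantization") step in the forward pass. The sketch below shows that generic operation for a single value; the uniform range, bit width, and function name are assumptions for illustration, and the abstract does not say which quantization scheme the authors use.

```python
def fake_quantize(x, num_bits=8, x_min=-1.0, x_max=1.0):
    """Round x to the nearest point on a uniform grid, then map it back to float.

    The result stays a float, but it carries the rounding error that a
    num_bits integer representation would introduce at deployment time.
    """
    levels = 2 ** num_bits - 1              # number of steps between grid ends
    scale = (x_max - x_min) / levels        # width of one quantization step
    clipped = min(max(x, x_min), x_max)     # saturate values outside the range
    q = round((clipped - x_min) / scale)    # integer grid index
    return q * scale + x_min                # dequantized value


error_8bit = abs(fake_quantize(0.3, num_bits=8) - 0.3)
error_4bit = abs(fake_quantize(0.3, num_bits=4) - 0.3)
```

In QAT setups, rounding is usually treated as the identity during backpropagation (the straight-through estimator), so the network can adapt its weights to the rounding error it sees here.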

Chinese Translation

移动设备的图像增强模型常常难以在高输出质量与移动硬件所需的快速处理速度之间取得平衡。尽管近期的深度学习模型能够将低质量的移动照片增强为高质量图像，但在实际应用于手机时，转换为低精度格式后其性能往往会下降。为了解决这一训练与部署之间的不匹配，我们提出了一种专为移动部署设计的高效图像增强模型。我们的方法采用了具有门控编码块和多尺度精炼的分层网络架构，以保留细粒度的视觉特征。此外，我们结合了量化感知训练（Quantization-Aware Training, QAT），以在训练过程中模拟低精度表示的影响。这使得网络能够适应，并防止标准后训练量化（Post-Training Quantization, PTQ）中常见的质量下降。实验结果表明，所提出的方法在保持标准移动设备实际使用所需的低计算开销的同时，能够生成高保真度的视觉输出。代码将发布在 https://github.com/GenAI4E/QATIE.git。

cs.AI / 50 / 2604.21764

Thinking with Reasoning Skills: Fewer Tokens, More Accuracy

运用推理技能：更少的标记，更高的准确性

Zhao, Guangxiang, Shi, Qilong, Xiao, Xusen, Zhang, Xiangzheng, Yang, Tong, Sun, Lin

Abstract

Reasoning LLMs often spend substantial tokens on long intermediate reasoning traces (e.g., chain-of-thought) when solving new problems. We propose to summarize and store reusable reasoning skills distilled from extensive deliberation and trial-and-error exploration, and to retrieve these skills at inference time to guide future reasoning. Unlike the prevailing "reasoning from scratch" paradigm, our approach first recalls relevant skills for each query, helping the model avoid redundant detours and focus on effective solution paths. We evaluate our method on coding and mathematical reasoning tasks, and find that it significantly reduces reasoning tokens while improving overall performance. The resulting lower per-request cost indicates strong practical and economic potential for real-world deployment.
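
The recall step described above can be pictured as a lookup over a small store of skill summaries. The sketch below is a minimal illustration that assumes word-overlap (Jaccard) scoring; the store layout, field names, and scoring rule are hypothetical, since the abstract does not specify how skills are stored or retrieved.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of words."""
    return set(text.lower().split())


def retrieve_skills(query, skill_store, top_k=1):
    """Return the top_k skills whose 'when_to_use' text best overlaps the query."""
    query_tokens = tokenize(query)

    def overlap(skill):
        skill_tokens = tokenize(skill["when_to_use"])
        return len(query_tokens & skill_tokens) / len(query_tokens | skill_tokens)

    return sorted(skill_store, key=overlap, reverse=True)[:top_k]


# Hypothetical skill store distilled from earlier problem solving.
skill_store = [
    {
        "name": "two_pointers",
        "when_to_use": "find pair in sorted array with target sum",
        "hint": "Move one pointer from each end toward the middle.",
    },
    {
        "name": "modular_exponentiation",
        "when_to_use": "remainder of large power modulo prime",
        "hint": "Use repeated squaring.",
    },
]

recalled = retrieve_skills(
    "find a pair in a sorted array whose sum equals the target", skill_store
)
```

The recalled hint would then be prepended to the prompt, so the model starts from a known solution path instead of rediscovering it.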

Chinese Translation

推理大型语言模型（LLMs）在解决新问题时，常常会在长的中间推理轨迹（例如，思维链）上消耗大量标记。我们提出了一种方法，通过从广泛的深思熟虑和试错探索中提炼出可重用的推理技能进行总结和存储，并在推理时检索这些技能以指导未来的推理。与流行的“从头开始推理”（reasoning from scratch）范式不同，我们的方法首先为每个查询回忆相关技能，帮助模型避免冗余的绕行，专注于有效的解决路径。我们在编码和数学推理任务上评估了我们的方法，发现它显著减少了推理标记，同时提高了整体性能。由此带来的每次请求成本降低，表明其在实际应用中的强大潜力和经济价值。

cs.AI / 51 / 2604.21769

Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards

谁来定义“最佳”？迈向互动式用户定义的LLM排行榜评估

Jung, Minji, Lee, Minjae, Kim, Yejin, Choi, Sarang, Kahng, Minsuk

Abstract

LLM leaderboards are widely used to compare models and guide deployment decisions. However, leaderboard rankings are shaped by evaluation priorities set by benchmark designers, rather than by the diverse goals and constraints of actual users and organizations. A single aggregate score often obscures how models behave across different prompt types and compositions. In this work, we conduct an in-depth analysis of the dataset used in the LMArena (formerly Chatbot Arena) benchmark and investigate this evaluation challenge by designing an interactive visualization interface as a design probe. Our analysis reveals that the dataset is heavily skewed toward certain topics, that model rankings vary across prompt slices, and that preference-based judgments are used in ways that blur their intended scope. Building on this analysis, we introduce a visualization interface that allows users to define their own evaluation priorities by selecting and weighting prompt slices and to explore how rankings change accordingly. A qualitative study suggests that this interactive approach improves transparency and supports more context-specific model evaluation, pointing toward alternative ways to design and use LLM leaderboards.
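
The effect of user-defined weights on rankings can be shown with a few lines of arithmetic. The sketch below assumes each model has a per-slice score (for example a win rate) and combines slices by a weighted average; the model names, slice names, and numbers are invented for illustration and are not drawn from the LMArena data.

```python
def rank_models(slice_scores, weights):
    """Rank models by the weighted average of their per-slice scores."""
    total_weight = sum(weights.values())
    combined = {
        model: sum(weights[s] * scores[s] for s in weights) / total_weight
        for model, scores in slice_scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical per-slice scores for two models.
slice_scores = {
    "model_a": {"coding": 0.70, "creative_writing": 0.40},
    "model_b": {"coding": 0.50, "creative_writing": 0.65},
}

coding_first = rank_models(slice_scores, {"coding": 0.9, "creative_writing": 0.1})
writing_first = rank_models(slice_scores, {"coding": 0.1, "creative_writing": 0.9})
```

With these toy numbers the top model flips between the two weightings, which is the kind of sensitivity a single aggregate score hides.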

Chinese Translation

LLM排行榜广泛用于比较模型和指导部署决策。然而，排行榜的排名是由基准设计者设定的评估优先级所决定的，而不是由实际用户和组织的多样化目标和约束所影响。单一的综合得分往往掩盖了模型在不同提示类型和组合下的表现。在本研究中，我们对LMArena（前称Chatbot Arena）基准中使用的数据集进行了深入分析，并通过设计一个互动可视化界面作为设计探针来探讨这一评估挑战。我们的分析揭示了数据集在某些主题上严重倾斜，模型排名在不同提示切片中存在差异，以及基于偏好的判断在模糊其预期范围的方式下被使用。基于这一分析，我们引入了一个可视化界面，允许用户通过选择和加权提示切片来定义自己的评估优先级，并探索排名如何相应变化。一项定性研究表明，这种互动方式提高了透明度，并支持更具上下文特定性的模型评估，指向了设计和使用LLM排行榜的替代方法。

cs.AI / 52 / 2604.21793

Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications

从时间戳数据推断高层次事件：复杂性及其医学应用

Awuklu, Yvon K., Bienvenu, Meghyn, Inoue, Katsumi, Jouhet, Vianney, Mougin, Fleur

Abstract

In this paper, we develop a novel logic-based approach to detecting high-level temporally extended events from timestamped data and background knowledge. Our framework employs logical rules to capture existence and termination conditions for simple temporal events and to combine these into meta-events. In the medical domain, for example, disease episodes and therapies are inferred from timestamped clinical observations, such as diagnoses and drug administrations stored in patient records, and can be further combined into higher-level disease events. As some incorrect events might be inferred, we use constraints to identify incompatible combinations of events and propose a repair mechanism to select preferred consistent sets of events. While reasoning in the full framework is intractable, we identify relevant restrictions that ensure polynomial-time data complexity. Our prototype system implements core components of the approach using answer set programming. An evaluation on a lung cancer use case supports the interest of the approach, both in terms of computational feasibility and positive alignment of our results with medical expert opinions. While strongly motivated by the needs of the healthcare domain, our framework is purposely generic, enabling its reuse in other areas.
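
A simple instance of inferring a temporally extended event from timestamped observations is grouping drug administrations into therapy episodes. The sketch below uses plain Python rather than the answer set programming of the authors' prototype, and the 30-day termination gap is an assumed rule chosen only for illustration.

```python
from datetime import date, timedelta


def infer_episodes(observation_dates, max_gap_days=30):
    """Group timestamped observations into episodes.

    An episode exists from its first observation and terminates when no
    further observation occurs within max_gap_days of the latest one.
    """
    episodes = []
    for day in sorted(observation_dates):
        if episodes and day - episodes[-1][1] <= timedelta(days=max_gap_days):
            episodes[-1][1] = day            # extend the current episode
        else:
            episodes.append([day, day])      # start a new episode
    return [tuple(episode) for episode in episodes]


# Hypothetical administration dates for one drug in one patient record.
administrations = [
    date(2024, 1, 1),
    date(2024, 1, 15),
    date(2024, 2, 10),
    date(2024, 6, 1),
]
episodes = infer_episodes(administrations)
```

Rules of this kind can produce episodes that conflict with one another, which is where the constraints and repair mechanism described in the abstract come in.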

Chinese Translation

在本文中，我们开发了一种基于逻辑的新方法，用于从时间戳数据和背景知识中检测高层次的时间扩展事件。我们的框架采用逻辑规则来捕捉简单时间事件的存在和终止条件，并将这些事件组合成元事件。在医学领域，例如，疾病发作和治疗是从时间戳临床观察中推断出来的，这些观察包括存储在患者记录中的诊断和药物管理，并可以进一步组合成更高层次的疾病事件。由于可能推断出一些不正确的事件，我们使用约束来识别事件之间的不兼容组合，并提出了一种修复机制，以选择首选的一致事件集。尽管在完整框架中的推理是不可处理的，但我们识别出相关的限制条件，以确保多项式时间的数据复杂性。我们的原型系统使用答案集编程实现了该方法的核心组件。在一个肺癌用例上的评估支持了该方法的兴趣，无论是在计算可行性方面，还是在我们的结果与医学专家意见的积极一致性方面。虽然我们的框架受到医疗领域需求的强烈驱动，但它故意具有通用性，使其能够在其他领域中重复使用。

cs.AI / 53 / 2604.21794

Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems

学习沟通：朝着多智能体语言系统的端到端优化

Yu, Ye, Liu, Heming, Jin, Haibo, Yuan, Xiaopeng, Kuang, Peng, Wang, Haohan

Abstract

Multi-agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestration while treating inter-agent communication as a fixed interface. Latent communication through internal representations such as key-value caches offers a promising alternative to text-based protocols, but existing approaches do not jointly optimize communication with multi-agent reasoning. Therefore we propose DiffMAS, a training framework that treats latent communication as a learnable component of multi-agent systems. DiffMAS performs parameter-efficient supervised training over multi-agent latent trajectories, enabling agents to jointly learn how information should be encoded and interpreted across interactions. Experiments on mathematical reasoning, scientific QA, code generation, and commonsense benchmarks show that DiffMAS consistently improves reasoning accuracy and decoding stability over single-agent inference, text-based multi-agent systems, and prior latent communication methods, achieving 26.7% on AIME24, 20.2% on GPQA-Diamond, and consistent gains across reasoning benchmarks.

Chinese Translation

基于大型语言模型的多智能体系统在复杂推理任务中表现出色，但大多数研究集中于智能体角色和协调，而将智能体之间的通信视为固定接口。通过内部表示（如键值缓存）进行的潜在通信为基于文本的协议提供了一种有前景的替代方案，但现有方法并未将通信与多智能体推理进行联合优化。因此，我们提出了DiffMAS，一个将潜在通信视为多智能体系统可学习组件的训练框架。DiffMAS在多智能体潜在轨迹上进行参数高效的监督训练，使智能体能够共同学习信息在交互中应如何编码和解释。在数学推理、科学问答、代码生成和常识基准上的实验表明，DiffMAS在推理准确性和解码稳定性方面始终优于单智能体推理、基于文本的多智能体系统和先前的潜在通信方法，在AIME24上达到26.7%，在GPQA-Diamond上达到20.2%，并在推理基准上取得了一致的提升。

cs.AI / 54 / 2604.21816

Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows

工具注意力即是你所需：动态工具门控与懒惰模式加载以消除可扩展代理工作流中的MCP/工具税

Sadani, Anuj, Kumar, Deepak

Abstract

The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead, the MCP Tax or Tools Tax, that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware-layer mechanism that generalizes the "Attention Is All You Need" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state-aware gating function enforcing preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools. We evaluate on a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per-turn tool tokens by 95.0% (47.3k -> 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout. Taken together, the results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable agentic systems. The code for this work is accessible at https://github.com/asadani/tool-attention
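
The three components listed above (similarity score, state-aware gate, lazy schema loading) can be sketched in a few dozen lines. Everything here is a hypothetical illustration: the `Tool` fields, the two-dimensional toy embeddings, the threshold, and the use of cosine similarity as a stand-in for the ISO score are assumptions, and none of it is taken from the authors' repository.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    summary_vec: list            # embedding of the short summary (toy values)
    required_scope: str          # access scope the caller must hold
    full_schema: dict = field(default_factory=dict)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def select_tools(intent_vec, tools, granted_scopes, k=2, threshold=0.3):
    """Gate tools by scope, score the rest against the intent, keep the top k."""
    scored = []
    for tool in tools:
        if tool.required_scope not in granted_scopes:
            continue                                   # state-aware gate
        score = cosine(intent_vec, tool.summary_vec)   # stand-in for the ISO score
        if score >= threshold:
            scored.append((score, tool))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in scored[:k]]


def build_context(intent_vec, tools, granted_scopes, k=2):
    """Phase 1: every tool contributes its name. Phase 2: only gated tools get full schemas."""
    selected = select_tools(intent_vec, tools, granted_scopes, k=k)
    return {
        "summaries": [tool.name for tool in tools],
        "schemas": {tool.name: tool.full_schema for tool in selected},
    }


tools = [
    Tool("search_docs", [1.0, 0.0], "read",
         {"type": "object", "properties": {"query": {"type": "string"}}}),
    Tool("list_events", [0.0, 1.0], "read", {"type": "object"}),
    Tool("delete_file", [0.9, 0.1], "admin", {"type": "object"}),
]
context = build_context([1.0, 0.0], tools, granted_scopes={"read"})
```

In this toy run only `search_docs` has its full schema promoted: `list_events` falls below the similarity threshold and `delete_file` is blocked by the scope gate, while all three names remain visible in the summary pool.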

Chinese Translation

模型上下文协议（MCP）已成为将大型语言模型（LLM）代理与外部工具连接的常见接口，但其对无状态、急切模式注入的依赖带来了隐含的每轮开销，即MCP税或工具税，实践者报告显示在典型的多服务器部署中，这一开销大约在10,000到60,000个标记之间。该负载膨胀了键值缓存，并且与推理退化相关，当上下文利用率接近70%的已发布断裂点时，标记预算变成了一个持续的运营成本。我们提出了工具注意力（Tool Attention），这是一种中间层机制，将“注意力即是你所需”范式从对标记的自注意力推广到对工具的门控注意力。工具注意力结合了（i）来自句子嵌入的意图模式重叠（ISO）得分，（ii）一个状态感知的门控函数，强制执行前提条件和访问范围，以及（iii）一个两阶段的懒惰模式加载器，在上下文中保持紧凑的摘要池，仅为前k个门控工具推广完整的JSON模式。我们在一个模拟的120工具、六服务器基准测试上进行评估，其每台服务器的标记计数经过校准，以符合真实MCP部署的公开审计。在此模拟中，工具注意力直接将每轮测量的工具标记减少了95.0%（从47,300减少到2,400），并将有效上下文利用率（一个标记比率量）从24%提高到91%。任务成功率、延迟、成本和推理质量的端到端数据作为从测量的标记计数与已发布的部署遥测结合得出的预测值报告；这些数据并未在实时LLM代理上测量，我们在整个过程中明确标记预测值。综合来看，结果支持一个简单的论点：协议级效率，而非原始上下文长度，是可扩展代理系统的约束条件。该工作的代码可在https://github.com/asadani/tool-attention获取。

cs.AI / 55 / 2604.21827

Alignment has a Fantasia Problem

对齐存在幻想问题

Jo, Nathanael, De Simone, Zoe, Gordon, Mitchell, Wilson, Ashia

Abstract

Modern AI assistants are trained to follow instructions, implicitly assuming that users can clearly articulate their goals and the kind of assistance they need. Decades of behavioral research, however, show that people often engage with AI systems before their goals are fully formed. When AI systems treat prompts as complete expressions of intent, they can appear to be useful or convenient, but not necessarily aligned with the users' needs. We call these failures Fantasia interactions. We argue that Fantasia interactions demand a rethinking of alignment research: rather than treating users as rational oracles, AI should provide cognitive support by actively helping users form and refine their intent through time. This requires an interdisciplinary approach that bridges machine learning, interface design, and behavioral science. We synthesize insights from these fields to characterize the mechanisms and failures of Fantasia interactions. We then show why existing interventions are insufficient, and propose a research agenda for designing and evaluating AI systems that better help humans navigate uncertainty in their tasks.

Chinese Translation

现代人工智能助手被训练为遵循指令，隐含假设用户能够清晰表达他们的目标和所需的帮助。然而，数十年的行为研究表明，人们在目标尚未完全形成之前，往往就与人工智能系统进行互动。当人工智能系统将提示视为完整的意图表达时，它们可能看起来有用或方便，但不一定与用户的需求相一致。我们将这些失败称为幻想互动。我们认为，幻想互动要求重新思考对齐研究：人工智能不应将用户视为理性的神谕，而应通过积极帮助用户形成和细化他们的意图来提供认知支持。这需要一种跨学科的方法，结合机器学习、界面设计和行为科学。我们综合了这些领域的见解，以描述幻想互动的机制和失败。随后，我们展示了现有干预措施为何不足，并提出了一个研究议程，以设计和评估更好地帮助人类在任务中应对不确定性的人工智能系统。

cs.AI / 56 / 2604.21854

Bounding the Black Box: A Statistical Certification Framework for AI Risk Regulation

界定黑箱：人工智能风险监管的统计认证框架

Levy, Natan, Perl, Gadi

Abstract

Artificial intelligence now decides who receives a loan, who is flagged for criminal investigation, and whether an autonomous vehicle brakes in time. Governments have responded: the EU AI Act, the NIST Risk Management Framework, and the Council of Europe Convention all demand that high-risk systems demonstrate safety before deployment. Yet beneath this regulatory consensus lies a critical vacuum: none specifies what "acceptable risk" means in quantitative terms, and none provides a technical method for verifying that a deployed system actually meets such a threshold. The regulatory architecture is in place; the verification instrument is not. This gap is not theoretical. As the EU AI Act moves into full enforcement, developers face mandatory conformity assessments without established methodologies for producing quantitative safety evidence - and the systems most in need of oversight are opaque statistical inference engines that resist white-box scrutiny. This paper provides the missing instrument. Drawing on the aviation certification paradigm, we propose a two-stage framework that transforms AI risk regulation into engineering practice. In Stage One, a competent authority formally fixes an acceptable failure probability $\delta$ and an operational input domain $\varepsilon$ - a normative act with direct civil liability implications. In Stage Two, the RoMA and gRoMA statistical verification tools compute a definitive, auditable upper bound on the system's true failure rate, requiring no access to model internals and scaling to arbitrary architectures. We demonstrate how this certificate satisfies existing regulatory obligations, shifts accountability upstream to developers, and integrates with the legal frameworks that exist today.
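
To give a feel for what a black-box upper bound on a failure rate looks like, the sketch below applies the one-sided Hoeffding inequality to independent test samples and compares the result with a threshold $\delta$. This is a generic textbook bound swapped in for illustration; it is not the RoMA or gRoMA procedure, whose details the abstract does not give, and the sample counts are made up.

```python
import math


def hoeffding_upper_bound(failures, samples, confidence=0.99):
    """Upper bound on the true failure rate from i.i.d. black-box tests.

    With probability at least `confidence`, the true rate is at most the
    observed rate plus sqrt(ln(1 / alpha) / (2 * samples)), alpha = 1 - confidence.
    """
    alpha = 1.0 - confidence
    observed_rate = failures / samples
    return observed_rate + math.sqrt(math.log(1.0 / alpha) / (2.0 * samples))


def certify(failures, samples, delta, confidence=0.99):
    """True if the bound on the failure rate does not exceed the threshold delta."""
    return hoeffding_upper_bound(failures, samples, confidence) <= delta


# Hypothetical test campaigns with zero observed failures.
large_campaign_ok = certify(failures=0, samples=10_000, delta=0.02)
small_campaign_ok = certify(failures=0, samples=100, delta=0.02)
```

The same zero-failure record certifies the system at $\delta = 0.02$ with 10,000 samples but not with 100, which shows why the sample budget has to be fixed together with the threshold.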

Chinese Translation

人工智能如今决定了谁能获得贷款，谁会被标记为刑事调查对象，以及自主车辆是否能及时刹车。各国政府对此作出了回应：欧盟人工智能法案、美国国家标准与技术研究院（NIST）风险管理框架以及欧洲委员会公约均要求高风险系统在部署前必须证明其安全性。然而，在这一监管共识的背后却存在一个关键的空白：没有任何一项规定明确“可接受风险”的定量含义，也没有提供验证已部署系统是否实际达到该阈值的技术方法。监管架构已建立，但验证工具尚未到位。这一差距并非理论上的。随着欧盟人工智能法案的全面实施，开发者面临强制性合规评估，但却没有建立起产生定量安全证据的方法论，而最需要监管的系统是那些抵抗白箱审查的黑箱统计推断引擎。本文提供了缺失的工具。借鉴航空认证范式，我们提出了一个两阶段框架，将人工智能风险监管转化为工程实践。在第一阶段，主管机构正式确定可接受的失效概率 $\delta$ 和操作输入域 $\varepsilon$，这一规范性行为具有直接的民事责任影响。在第二阶段，RoMA 和 gRoMA 统计验证工具计算系统真实失效率的明确、可审计的上限，无需访问模型内部，并可扩展至任意架构。我们展示了这一证书如何满足现有的监管义务，将责任上移至开发者，并与现有的法律框架相结合。

cs.AI / 57 / 2604.21896

Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models

Nemobot 游戏：为基于大语言模型的互动学习打造战略性人工智能游戏代理

Tan, Chee Wei, Wang, Yuchen, Guo, Shangxin

Abstract

This paper introduces a new paradigm for AI game programming, leveraging large language models (LLMs) to extend and operationalize Claude Shannon's taxonomy of game-playing machines. Central to this paradigm is Nemobot, an interactive agentic engineering environment that enables users to create, customize, and deploy LLM-powered game agents while actively engaging with AI-driven strategies. The LLM-based chatbot, integrated within Nemobot, demonstrates its capabilities across four distinct classes of games. For dictionary-based games, it compresses state-action mappings into efficient, generalized models for rapid adaptability. In rigorously solvable games, it employs mathematical reasoning to compute optimal strategies and generates human-readable explanations for its decisions. For heuristic-based games, it synthesizes strategies by combining insights from classical minimax algorithms (see, e.g., Shannon, 1950) with crowd-sourced data. Finally, in learning-based games, it utilizes reinforcement learning with human feedback and self-critique to iteratively refine strategies through trial-and-error and imitation learning. Nemobot amplifies this framework by offering a programmable environment where users can experiment with tool-augmented generation and fine-tuning of strategic game agents. From strategic games to role-playing games, Nemobot demonstrates how AI agents can achieve a form of self-programming by integrating crowdsourced learning and human creativity to iteratively refine their own logic. This represents a step toward the long-term goal of self-programming AI.

Chinese Translation

本文介绍了一种新的人工智能游戏编程范式，利用大语言模型（LLMs）扩展和实现克劳德·香农（Claude Shannon）关于游戏机器的分类法。该范式的核心是 Nemobot，一个互动代理工程环境，使用户能够创建、定制和部署基于 LLM 的游戏代理，同时积极参与 AI 驱动的策略。集成在 Nemobot 中的基于 LLM 的聊天机器人展示了其在四类不同游戏中的能力。在基于字典的游戏中，它将状态-动作映射压缩为高效的通用模型，以实现快速适应。在严格可解的游戏中，它运用数学推理计算最佳策略，并生成可供人类理解的决策解释。在基于启发式的游戏中，它通过结合经典的极小极大算法（例如，shannon1950chess）与众包数据来综合策略。最后，在基于学习的游戏中，它利用人类反馈和自我批评的强化学习，通过试错和模仿学习迭代优化策略。Nemobot 通过提供一个可编程环境，增强了这一框架，用户可以在其中实验工具增强生成和战略游戏代理的微调。从战略游戏到角色扮演游戏，Nemobot 展示了人工智能代理如何通过整合众包学习和人类创造力，实现自我编程的形式，迭代优化其自身逻辑。这代表了朝着自我编程人工智能的长期目标迈出的一步。

cs.AI / 58 / 2604.21910

From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation

从研究问题到科学工作流：利用自主智能实现科学自动化

Balis, Bartosz, Orzechowski, Michal, Kica, Piotr, Dygas, Michal, Kuszewski, Michal

Abstract

Scientific workflow systems automate execution (scheduling, fault tolerance, resource management) but not the semantic translation that precedes it. Scientists still manually convert research questions into workflow specifications, a task requiring both domain knowledge and infrastructure expertise. We propose an agentic architecture that closes this gap through three layers: an LLM interprets natural language into structured intents (semantic layer); validated generators produce reproducible workflow DAGs (deterministic layer); and domain experts author "Skills": markdown documents encoding vocabulary mappings, parameter constraints, and optimization strategies (knowledge layer). This decomposition confines LLM non-determinism to intent extraction: identical intents always yield identical workflows. We implement and evaluate the architecture on the 1000 Genomes population genetics workflow and Hyperflow WMS running on Kubernetes. In an ablation study on 150 queries, Skills raise full-match intent accuracy from 44% to 83%; skill-driven deferred workflow generation reduces data transfer by 92%; and the end-to-end pipeline completes queries on Kubernetes with LLM overhead below 15 seconds and cost under $0.001 per query.
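
The claim that identical intents always yield identical workflows follows once workflow generation is a pure function of a structured intent. The sketch below illustrates that property with a hypothetical `Intent` type and made-up task names; it is not the authors' generator and does not reflect the actual structure of the 1000 Genomes workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """Structured intent as it might be extracted by the semantic layer (assumed fields)."""
    analysis: str
    chromosomes: tuple
    population: str


def generate_workflow(intent):
    """Deterministically expand an intent into a small task DAG.

    No randomness and no LLM call: the output depends only on the intent,
    and chromosomes are sorted so input order cannot change the result.
    """
    tasks, edges, analysis_tasks = [], [], []
    for chrom in sorted(intent.chromosomes):
        extract = f"extract_chr{chrom}"
        analyse = f"{intent.analysis}_chr{chrom}_{intent.population}"
        tasks += [extract, analyse]
        edges.append((extract, analyse))
        analysis_tasks.append(analyse)
    merge = f"merge_{intent.analysis}"
    tasks.append(merge)
    edges += [(task, merge) for task in analysis_tasks]
    return {"tasks": tasks, "edges": edges}


workflow_a = generate_workflow(Intent("frequency", (2, 1), "EUR"))
workflow_b = generate_workflow(Intent("frequency", (1, 2), "EUR"))
```

Because the LLM only produces the `Intent`, any variation between runs is visible and checkable at that single point, before execution resources are spent.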

Chinese Translation

科学工作流系统自动化执行——调度、容错、资源管理——但未能自动化其前置的语义转换。科学家仍需手动将研究问题转化为工作流规范，这一任务既需要领域知识，也需要基础设施专业知识。我们提出了一种自主架构，通过三个层次来弥补这一空白：一个大型语言模型（LLM）将自然语言解释为结构化意图（语义层）；经过验证的生成器生成可重复的工作流有向无环图（DAG）（确定性层）；领域专家编写“技能”：编码词汇映射、参数约束和优化策略的Markdown文档（知识层）。这种分解将LLM的非确定性限制在意图提取上：相同的意图总是产生相同的工作流。我们在1000基因组人口遗传学工作流和在Kubernetes上运行的Hyperflow工作流管理系统（WMS）上实现并评估了该架构。在对150个查询的消融研究中，技能将完全匹配意图的准确率从44%提高到83%；基于技能的延迟工作流生成将数据传输减少了92%；并且端到端管道在Kubernetes上完成查询时，LLM的开销低于15秒，成本低于每个查询0.001美元。

计算语言学 (Computation and Language)

cs.CL / 1 / 2604.20878

AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models

AITP：基于多模态大型语言模型的交通事故责任分配

Zhou, Zijin, Zhang, Songan

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable progress in Traffic Accident Detection (TAD) and Traffic Accident Understanding (TAU). However, existing studies mainly focus on describing and interpreting accident videos, leaving room for deeper causal reasoning and integration of legal knowledge. Traffic Accident Responsibility Allocation (TARA) is a more challenging task that requires multi-step reasoning grounded in traffic regulations. To address this, we introduce AITP (Artificial Intelligence Traffic Police), a multimodal large language model for responsibility reasoning and allocation. AITP enhances reasoning via a Multimodal Chain-of-Thought (MCoT) mechanism and integrates legal knowledge through Retrieval-Augmented Generation (RAG). We further present DecaTARA, a decathlon-style benchmark unifying ten interrelated traffic accident reasoning tasks with 67,941 annotated videos and 195,821 question-answer pairs. Extensive experiments show that AITP achieves state-of-the-art performance across responsibility allocation, TAD, and TAU tasks, establishing a new paradigm for reasoning-driven multimodal traffic analysis.

Chinese Translation

多模态大型语言模型（MLLMs）在交通事故检测（TAD）和交通事故理解（TAU）方面取得了显著进展。然而，现有研究主要集中于描述和解释事故视频，尚未深入探讨因果推理和法律知识的整合。交通事故责任分配（TARA）是一项更具挑战性的任务，需要基于交通法规进行多步骤推理。为此，我们提出了AITP（人工智能交通警察），一种用于责任推理和分配的多模态大型语言模型。AITP通过多模态思维链（MCoT）机制增强推理能力，并通过检索增强生成（RAG）整合法律知识。我们进一步提出了DecaTARA，这是一个十项全能风格的基准，统一了十个相关的交通事故推理任务，包含67,941个标注视频和195,821个问答对。大量实验表明，AITP在责任分配、TAD和TAU任务中均实现了最先进的性能，为基于推理的多模态交通分析建立了新的范式。

cs.CL / 2 / 2604.20996

AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models

AFRILANGTUTOR：利用大型语言模型推动低资源语言的语言辅导和文化教育

Belay, Tadesse Destaw, Nahin, Shahriar Kabir, Azime, Israel Abebe, Monjur, Ocean, Muhammad, Shamsuddeen Hassan, Yimam, Seid Muhie, Chhabra, Anshuman

Abstract

How can language learning systems be developed for languages that lack sufficient training resources? This challenge is increasingly faced by developers across the African continent who aim to build AI systems capable of understanding and responding in local languages. To address this gap, we introduce AFRILANGDICT, a collection of 194.7K African language-English dictionary entries designed as seed resources for generating language-learning materials, enabling us to automatically construct large-scale, diverse, and verifiable student-tutor question-answer interactions suitable for training AI-assisted language tutors. Using AFRILANGDICT, we build AFRILANGEDU, a dataset of 78.9K multi-turn training examples for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Using AFRILANGEDU, we train language tutoring models collectively referred to as AFRILANGTUTOR. We fine-tune two multilingual LLMs, Llama-3-8B-IT and Gemma-3-12B-IT, on AFRILANGEDU across 10 African languages and evaluate their performance. Our results show that models trained on AFRILANGEDU consistently outperform their base counterparts, and combining SFT and DPO yields substantial improvements, with gains ranging from 1.8% to 15.5% under LLM-as-a-judge evaluations across four criteria. To facilitate further research on low-resource languages, all resources are available at https://huggingface.co/afrilang-edu.

Chinese Translation

如何为缺乏足够训练资源的语言开发语言学习系统？这一挑战正日益困扰着非洲大陆的开发者，他们旨在构建能够理解和回应当地语言的人工智能系统。为了解决这一问题，我们推出了AFRILANGDICT，这是一个包含194.7K条非洲语言-英语词典条目的集合，旨在作为生成语言学习材料的种子资源，使我们能够自动构建大规模、多样化且可验证的学生-辅导员问答互动，适用于训练AI辅助的语言辅导员。利用AFRILANGDICT，我们构建了AFRILANGEDU，这是一个包含78.9K个多轮训练示例的数据集，用于监督微调（Supervised Fine-Tuning, SFT）和直接偏好优化（Direct Preference Optimization, DPO）。通过AFRILANGEDU，我们训练了统称为AFRILANGTUTOR的语言辅导模型。我们在AFRILANGEDU上对两个多语言大型语言模型进行了微调：Llama-3-8B-IT和Gemma-3-12B-IT，涵盖10种非洲语言，并评估它们的性能。我们的结果表明，基于AFRILANGEDU训练的模型在性能上始终优于其基础模型，结合SFT和DPO的方式带来了显著的提升，在四个评估标准下，增益范围为1.8%到15.5%。为了促进对低资源语言的进一步研究，所有资源均可在https://huggingface.co/afrilang-edu获取。

cs.CL / 3 / 2604.21045

Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech

无界语音的同步翻译的层次策略优化

Ouyang, Siqi, Ding, Shuoyang, Hrinchuk, Oleksii, Lavrukhin, Vitaly, Yan, Brian, Ginsburg, Boris, Li, Lei

Abstract

Simultaneous speech translation (SST) generates translations while receiving partial speech input. Recent advances show that large language models (LLMs) can substantially improve SST quality, but at the cost of high computational overhead. To reduce this cost, prior work reformulates SST as a multi-turn dialogue task, enabling full reuse of the LLM's key-value (KV) cache and eliminating redundant feature recomputation. However, this approach relies on supervised fine-tuning (SFT) data in dialogue form, for which few human annotations exist, and existing synthesis methods cannot guarantee data quality. In this work, we propose a Hierarchical Policy Optimization (HPO) approach that post-trains models trained on imperfect SFT data. We introduce a hierarchical reward that balances translation quality and latency objectives. Experiments on English to Chinese/German/Japanese demonstrate improvements of over +7 COMET score and +1.25 MetricX score at a latency of 1.5 seconds. Comprehensive ablation studies further validate the effectiveness of different quality rewards, hierarchical reward formulations, and segmentation strategies. Code is available at https://github.com/owaski/HPO

Chinese Translation

同步语音翻译（SST）在接收部分语音输入的同时生成翻译。最近的研究表明，大型语言模型（LLMs）可以显著提高SST的质量，但代价是高计算开销。为了降低这一成本，之前的工作将SST重新表述为多轮对话任务，从而实现LLM的键值（KV）缓存的完全重用，并消除冗余特征的重新计算。然而，这种方法依赖于以对话形式存在的监督微调（SFT）数据，而此类数据的人类标注极为稀缺，现有的合成方法无法保证数据质量。在本研究中，我们提出了一种层次策略优化（HPO）方法，针对在不完美的SFT数据上训练的模型进行后训练。我们引入了一种层次奖励，平衡翻译质量和延迟目标。在英语到中文/德语/日语的实验中，显示出超过+7的COMET分数和+1.25的MetricX分数，延迟为1.5秒。全面的消融研究进一步验证了不同质量奖励、层次奖励公式和分段策略的有效性。代码可在此处找到：https://github.com/owaski/HPO

cs.CL / 4 / 2604.21057

TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping

TRACES：自适应成本高效早停的推理步骤标记

Belkhiter, Yannis, Tirupathi, Seshu, Zizzo, Giulio, Kelleher, John D.

Abstract

The field of Language Reasoning Models (LRMs) has been very active over the past few years, with advances in training and inference techniques enabling LRMs to reason longer and more accurately. However, a growing body of studies shows that LRMs are still inefficient, over-generating verification and reflection steps. Additionally, the high-level role of each reasoning step, and how different step types contribute to the generation of correct answers, remain largely underexplored. To address this challenge, we introduce TRACES (Tagging of the Reasoning steps enabling Adaptive Cost-Efficient early-Stopping), a lightweight framework that tags reasoning steps in real time and enables adaptive, cost-efficient early stopping of large-language-model inference. Building on this framework, we monitor reasoning behaviors during inference and find that LRMs tend to shift their reasoning behavior after reaching a correct answer. We demonstrate that monitoring specific types of steps can produce effective, interpretable early-stopping criteria. We evaluate the TRACES framework on three mathematical reasoning benchmarks (MATH500, GSM8K, and AIME) and two knowledge and reasoning benchmarks (MMLU and GPQA). We achieve 20 to 50% token reduction while maintaining accuracy comparable to standard generation.

Chinese Translation

近年来，语言推理模型（LRMs）领域发展迅速，训练和推理技术的进步使得LRMs能够进行更长时间和更准确的推理。然而，越来越多的研究表明，LRMs仍然效率低下，过度生成验证和反思步骤。此外，各推理步骤的高级角色以及不同步骤类型如何促进正确答案的生成，仍然未得到充分探索。为了解决这一挑战，我们提出了TRACES（推理步骤标记以实现自适应成本高效早停），这是一个轻量级框架，能够实时标记推理步骤，并实现大语言模型推理的自适应、成本高效的早停。在此框架的基础上，我们监测推理过程中的行为，发现LRMs在达到正确答案后倾向于改变其推理行为。我们证明了对特定类型步骤的监测可以产生有效的可解释早停标准。我们在三个数学推理基准（MATH500、GSM8K、AIME）以及两个知识和推理基准（MMLU和GPQA）上评估了TRACES框架。我们在保持与标准生成相当的准确性的同时，实现了20%到50%的标记减少。
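
The abstract does not spell out the stopping criterion, so the following is only a minimal sketch of what a tag-based rule could look like. The tag names (`answer`, `verification`, `reflection`), the `patience` parameter, and the rule itself are assumptions made for illustration and are not taken from the paper.

```python
def should_stop(step_tags, patience=2, redundant=("verification", "reflection")):
    """Hypothetical rule: stop once an 'answer' step has been followed by
    `patience` consecutive steps whose tags are considered redundant."""
    seen_answer = False
    streak = 0
    for tag in step_tags:
        if tag == "answer":
            # A candidate answer was produced; start watching what follows.
            seen_answer = True
            streak = 0
        elif seen_answer and tag in redundant:
            streak += 1
            if streak >= patience:
                return True
        else:
            # Any other step type resets the streak.
            streak = 0
    return False


# Illustrative tag sequences (not real model traces).
print(should_stop(["plan", "compute", "answer", "verification", "reflection"]))
print(should_stop(["plan", "verification", "verification", "compute"]))
```

In a streaming setting the same check would presumably run after each newly tagged step, so generation can be cut as soon as the rule fires.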

cs.CL / 5 / 2604.21070

DWTSumm: Discrete Wavelet Transform for Document Summarization

DWTSumm：用于文档摘要的离散小波变换

Salama, Rana, Youssef, Abdou, Diab, Mona

Abstract

Summarizing long, domain-specific documents with large language models (LLMs) remains challenging due to context limitations, information loss, and hallucinations, particularly in clinical and legal settings. We propose a Discrete Wavelet Transform (DWT)-based multi-resolution framework that treats text as a semantic signal and decomposes it into global (approximation) and local (detail) components. Applied to sentence- or word-level embeddings, DWT yields compact representations that preserve overall structure and critical domain-specific details, which are used directly as summaries or to guide LLM generation. Experiments on clinical and legal benchmarks demonstrate comparable ROUGE-L scores. Compared to a GPT-4o baseline, DWT-based summarization consistently improves semantic similarity and grounding, achieving gains of over 2% in BERTScore and more than 4% in Semantic Fidelity, improved factual consistency in legal tasks, and large METEOR improvements indicative of preserved domain-specific semantics. Across multiple embedding models, Fidelity reaches up to 97%, suggesting that DWT acts as a semantic denoising mechanism that reduces hallucinations and strengthens factual grounding. Overall, DWT provides a lightweight, generalizable method for reliable long-document and domain-specific summarization with LLMs.

Chinese Translation

使用大型语言模型（LLMs）对长篇特定领域文档进行摘要仍然面临挑战，主要由于上下文限制、信息丢失和幻觉，尤其是在临床和法律环境中。我们提出了一种基于离散小波变换（DWT）的多分辨率框架，将文本视为语义信号，并将其分解为全局（近似）和局部（细节）成分。应用于句子或词级嵌入时，DWT 生成的紧凑表示能够保留整体结构和关键的领域特定细节，这些表示可以直接用作摘要或指导 LLM 生成。在临床和法律基准测试中的实验显示出可比的 ROUGE-L 分数。与 GPT-4o 基线相比，基于 DWT 的摘要在语义相似性和基础性方面持续改善，在 BERTScore 上提高超过 2%，在语义保真度上提高超过 4%，在法律任务中的事实一致性上表现良好，并且在 METEOR 上有显著提升，表明保留了领域特定的语义。在多个嵌入模型中，保真度达到 97%，这表明 DWT 作为一种语义去噪机制，减少了幻觉并增强了事实基础。总体而言，DWT 提供了一种轻量级、可推广的方法，能够可靠地进行长文档和领域特定的摘要，适用于 LLMs。
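
To make the approximation/detail split concrete, here is a small sketch of a one-level Haar DWT applied along the sentence axis of a toy embedding matrix. The choice of the Haar wavelet, the single decomposition level, and the toy two-dimensional "embeddings" are assumptions for illustration; the abstract does not say which wavelet or how many levels the authors use.

```python
import math


def haar_dwt(signal):
    """One-level Haar DWT of an even-length list of floats.
    Returns (approximation, detail) coefficient lists."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def decompose_embeddings(embeddings):
    """Apply haar_dwt along the sentence axis, one embedding dimension at a time."""
    dims = len(embeddings[0])
    approx_cols, detail_cols = [], []
    for d in range(dims):
        column = [vec[d] for vec in embeddings]
        a, det = haar_dwt(column)
        approx_cols.append(a)
        detail_cols.append(det)
    # Transpose back so each row is one (half-resolution) position.
    approx = [list(row) for row in zip(*approx_cols)]
    detail = [list(row) for row in zip(*detail_cols)]
    return approx, detail


# Toy 2-D "embeddings" for four sentences (illustrative values only).
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
approx, detail = decompose_embeddings(embeddings)
print(approx)  # global trend: one row per pair of neighbouring sentences
print(detail)  # local change within each pair
```

Pairs of identical neighbouring sentences produce zero detail coefficients, which is the sense in which the approximation band carries the global structure and the detail band carries local variation.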

cs.CL / 6 / 2604.21076

Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation

序列化策略的重要性：FHIR数据格式如何影响大型语言模型的药物核对

Pator, Sanjoy

Abstract

Medication reconciliation at clinical handoffs is a high-stakes, error-prone process. Large language models are increasingly proposed to assist with this task using FHIR-structured patient records, but a fundamental and largely unstudied variable is how the FHIR data is serialised before being passed to the model. We present the first systematic comparison of four FHIR serialisation strategies (Raw JSON, Markdown Table, Clinical Narrative, and Chronological Timeline) across five open-weight models (Phi-3.5-mini, Mistral-7B, BioMistral-7B, Llama-3.1-8B, Llama-3.3-70B) on a controlled benchmark of 200 synthetic patients, totalling 4,000 inference runs. We find that serialisation strategy has a large, statistically significant effect on performance for models up to 8B parameters: Clinical Narrative outperforms Raw JSON by up to 19 F1 points for Mistral-7B (r = 0.617, p < 10^{-10}). This advantage reverses at 70B, where Raw JSON achieves the best mean F1 of 0.9956. In all 20 model and strategy combinations, mean precision exceeds mean recall: omission is the dominant failure mode, with models more often missing an active medication than fabricating one, which changes how clinical safety auditing priorities should be set. Smaller models plateau at roughly 7-10 concurrent active medications, leaving polypharmacy patients, the patients most at risk from reconciliation errors, systematically underserved. BioMistral-7B, a domain-pretrained model without instruction tuning, produces zero usable output in all conditions, showing that domain pretraining alone is not sufficient for structured extraction. These results offer practical, evidence-based format recommendations for clinical LLM deployment: Clinical Narrative for models up to 8B, Raw JSON for 70B and above. The complete pipeline is reproducible on open-source tools running on an AWS g6e.xlarge instance (NVIDIA L40S, 48 GB VRAM).

Chinese Translation

临床交接中的药物核对是一个高风险且易出错的过程。越来越多的大型语言模型被提议用于利用FHIR结构化患者记录来辅助这一任务，但一个基本且尚未被广泛研究的变量是FHIR数据在传递给模型之前的序列化方式。我们首次系统性地比较了四种FHIR序列化策略（原始JSON、Markdown表格、临床叙述和时间线）在五个开放权重模型（Phi-3.5-mini、Mistral-7B、BioMistral-7B、Llama-3.1-8B、Llama-3.3-70B）上的表现，基于200个合成患者的受控基准，总计进行4,000次推理运行。我们发现序列化策略对最多8B参数的模型性能有显著的统计学影响：对于Mistral-7B，临床叙述比原始JSON高出多达19个F1分数（r = 0.617, p < 10^{-10}）。在70B的情况下，这一优势发生逆转，原始JSON的平均F1达到最佳值0.9956。在所有20种模型和策略组合中，平均精确度超过平均召回率：遗漏是主要的失败模式，模型更常错过活跃药物而不是虚构药物，这改变了临床安全审计优先级的设定。较小的模型在大约7-10个并发活跃药物时达到平稳状态，使得多药并用患者（即最容易受到核对错误影响的患者）系统性地得不到充分服务。BioMistral-7B是一个没有指令微调的领域预训练模型，在所有条件下产生零可用输出，表明仅有领域预训练不足以实现结构化提取。这些结果为临床大型语言模型的部署提供了实用的、基于证据的格式建议：对于最多8B的模型使用临床叙述，对于70B及以上的模型使用原始JSON。完整的流程可以在运行于AWS g6e.xlarge实例（NVIDIA L40S，48 GB VRAM）的开源工具上重现。
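
As a rough illustration of what the four serialisation strategies might look like, the sketch below renders the same two toy records in each format. The records are invented, the field names follow the FHIR R4 `MedicationRequest` resource, and the exact templates are assumptions; the abstract does not give the prompts or templates used in the study.

```python
import json

# Invented, heavily simplified MedicationRequest-like records (illustrative only).
records = [
    {
        "resourceType": "MedicationRequest",
        "status": "active",
        "medicationCodeableConcept": {"text": "Metformin 500 mg tablet"},
        "authoredOn": "2024-01-10",
        "dosageInstruction": [{"text": "1 tablet twice daily"}],
    },
    {
        "resourceType": "MedicationRequest",
        "status": "stopped",
        "medicationCodeableConcept": {"text": "Lisinopril 10 mg tablet"},
        "authoredOn": "2023-06-02",
        "dosageInstruction": [{"text": "1 tablet once daily"}],
    },
]


def summarize(record):
    """Pull out the few fields the text formats need, with safe defaults."""
    dosage = record.get("dosageInstruction") or [{}]
    return {
        "name": record.get("medicationCodeableConcept", {}).get("text", "unknown"),
        "status": record.get("status", "unknown"),
        "authored": record.get("authoredOn", "unknown"),
        "dosage": dosage[0].get("text", "unspecified"),
    }


def to_raw_json(recs):
    return json.dumps(recs, indent=2)


def to_markdown_table(recs):
    lines = ["| Medication | Status | Authored | Dosage |", "| --- | --- | --- | --- |"]
    for r in map(summarize, recs):
        lines.append(f"| {r['name']} | {r['status']} | {r['authored']} | {r['dosage']} |")
    return "\n".join(lines)


def to_narrative(recs):
    return " ".join(
        f"{r['name']} was prescribed on {r['authored']} ({r['dosage']}); status: {r['status']}."
        for r in map(summarize, recs)
    )


def to_timeline(recs):
    # ISO 8601 dates sort correctly as plain strings.
    rows = sorted(map(summarize, recs), key=lambda r: r["authored"])
    return "\n".join(f"{r['authored']}: {r['name']} ({r['status']})" for r in rows)


print(to_markdown_table(records))
print(to_narrative(records))
print(to_timeline(records))
```

Keeping one `summarize` step shared by the three text formats means any difference in model behaviour can be attributed to layout rather than to which fields were exposed.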

cs.CL / 7 / 2604.21082

Weighting What Matters: Boosting Sample Efficiency in Medical Report Generation via Token Reweighting

加权重要性：通过令牌重加权提升医疗报告生成的样本效率

Weers, Alexander, Rueckert, Daniel, Menten, Martin J.

Abstract

Training vision-language models (VLMs) for medical report generation is often hindered by the scarcity of high-quality annotated data. This work evaluates the use of a weighted loss function to improve data efficiency. Compared to standard cross-entropy loss, which treats all token prediction errors equally, the reweighted loss shifts the focus to semantically salient tokens with outsized clinical importance. In experiments on ophthalmological report generation, we show that this simple method improves efficiency across multiple data scales, achieving similar report quality with up to ten times less training data.

Chinese Translation

训练视觉-语言模型（VLMs）以生成医疗报告常常受到高质量标注数据稀缺的限制。本研究评估了使用加权损失函数以提高数据效率的效果。与标准的交叉熵损失相比，后者对所有令牌预测错误一视同仁，重加权损失则将重点转向具有重要临床意义的语义显著令牌。在眼科报告生成的实验中，我们展示了这一简单方法在多个数据规模上提高了效率，能够在训练数据减少多达十倍的情况下实现相似的报告质量。
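
The difference between the two losses can be sketched in a few lines. This is a generic weighted cross-entropy, normalised by the total weight; how the paper assigns weights to clinically salient tokens is not stated in the abstract, so the weights and probabilities below are made up for illustration.

```python
import math


def weighted_cross_entropy(token_probs, weights):
    """Weighted mean of per-token negative log-likelihoods.

    token_probs[i] is the probability the model gave to the correct i-th token;
    weights[i] is a non-negative importance weight for that token.
    """
    if len(token_probs) != len(weights):
        raise ValueError("token_probs and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    loss = sum(-w * math.log(p) for p, w in zip(token_probs, weights))
    return loss / total_weight


# Hypothetical report: the third token (say, a diagnosis term) is predicted poorly.
probs = [0.9, 0.9, 0.2, 0.9]
uniform = weighted_cross_entropy(probs, [1, 1, 1, 1])     # standard cross-entropy
reweighted = weighted_cross_entropy(probs, [1, 1, 5, 1])  # salient token upweighted
print(uniform, reweighted)
```

With uniform weights the function reduces to ordinary mean cross-entropy; upweighting the poorly predicted salient token makes its error dominate the loss, which is the shift of focus the abstract describes.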

cs.CL / 8 / 2604.21108

Machine learning and digital pragmatics: Which word category influences emoji use most?

机器学习与数字语用学：哪种词类对表情符号使用的影响最大？

Shormani, Mohammed Q., Alshawsh, Ibrahim Abdulmalik Hassan Muneef Y.

Abstract

This study investigates the use of Machine Learning (ML) to predict emojis in Arabic tweets, employing the state-of-the-art MARBERT model. A corpus of 11379 CA tweets representing multiple Arabic colloquial dialects was collected from X.com via Python. The net dataset comprises 8695 tweets, which were used for the analysis. These tweets were then classified into 14 categories, which were numerically encoded and used as labels. A preprocessing pipeline was designed as an interpretable baseline, allowing us to examine the relationship between lexical features and emoji categories. MARBERT was fine-tuned to predict emoji use from textual input. We evaluated model performance in terms of precision, recall, and F1-score. Findings reveal that the model performed quite well, with an overall accuracy of 0.75. The study concludes that although the findings are promising, there is still a need to improve machine learning models, including MARBERT, specifically for low-resource and multidialectal languages like Arabic.

Chinese Translation

本研究探讨了机器学习（Machine Learning, ML）在阿拉伯语推文中表情符号预测的应用，采用了最先进的MARBERT模型。通过Python从X.com收集了11379条代表多种阿拉伯口语方言的CA推文。最终的数据集包含8695条推文，这些推文被用于分析。推文被分类为14个类别，并进行了数值编码，作为标签使用。设计了一个预处理管道作为可解释的基线，使我们能够检验词汇特征与表情符号类别之间的关系。MARBERT模型经过微调，以预测文本输入中的表情符号使用。我们在精确率、召回率和F1分数方面评估了模型的性能。研究结果表明，该模型的整体准确率为0.75，表现相当良好。研究最后指出，尽管结果令人鼓舞，但仍需改进包括MARBERT在内的机器学习模型，尤其是针对阿拉伯语等低资源和多方言语言。

cs.CL / 9 / 2604.21133

GRISP: Guided Recurrent IRI Selection over SPARQL Skeletons

GRISP：基于SPARQL骨架的引导递归IRI选择

Walter, Sebastian, Bast, Hannah

Abstract

We present GRISP (Guided Recurrent IRI Selection over SPARQL Skeletons), a novel SPARQL-based question-answering method over knowledge graphs based on fine-tuning a small language model (SLM). Given a natural-language question, the method first uses the SLM to generate a natural-language SPARQL query skeleton, and then to re-rank and select knowledge graph items to iteratively replace the natural-language placeholders using knowledge graph constraints. The SLM is jointly trained on skeleton generation and list-wise re-ranking data generated from standard question-query pairs. We evaluate the method on common Wikidata and Freebase benchmarks, and achieve better results than other state-of-the-art methods in a comparable setting.

Chinese Translation

我们提出了GRISP（基于SPARQL骨架的引导递归IRI选择），这是一种基于SPARQL的知识图谱问答新方法，基于对小型语言模型（SLM）的微调。给定一个自然语言问题，该方法首先使用SLM生成自然语言SPARQL查询骨架，然后重新排序并选择知识图谱项，以迭代地使用知识图谱约束替换自然语言占位符。SLM在骨架生成和从标准问题-查询对生成的列表重排序数据上进行联合训练。我们在常见的Wikidata和Freebase基准上评估该方法，并在可比设置中取得了比其他最先进方法更好的结果。

cs.CL / 10 / 2604.21134

Beyond Pixels: Introspective and Interactive Grounding for Visualization Agents

超越像素：可视化智能体的内省与交互基础

Lu, Yiyang, Shin, Woong, Karimi, Ahmad Maroof, Wang, Feiyi, Ren, Jie, Smirni, Evgenia

Abstract

Vision-Language Models (VLMs) frequently misread values, hallucinate details, and confuse overlapping elements in charts. Current approaches rely solely on pixel interpretation, creating a Pixel-Only Bottleneck: agents treat interactive charts as static images, losing access to the structured specification that encodes exact values. We introduce Introspective and Interactive Visual Grounding (IVG), a framework that combines (1) spec-grounded introspection, which queries the underlying specification for deterministic evidence, with (2) view-grounded interaction, which manipulates the view to resolve visual ambiguity. To enable evaluation without VLM bias, we present iPlotBench, a benchmark of 500 interactive Plotly figures with 6,706 binary questions and ground-truth specifications. Experiments show that introspection improves data reconstruction fidelity, while the combination with interaction achieves the highest QA accuracy (0.81), with +6.7% gains on overlapping geometries. We further demonstrate IVG in deployed agents that explore data autonomously and collaborate with human users in real time.

Chinese Translation

视觉-语言模型（VLMs）常常误读数值、幻觉细节，并混淆图表中的重叠元素。目前的方法仅依赖于像素解释，造成了像素专属瓶颈：智能体将交互式图表视为静态图像，从而失去了对编码精确值的结构化规范的访问。我们提出了内省与交互视觉基础（Introspective and Interactive Visual Grounding, IVG），这是一个结合了（1）规范基础内省的框架，该框架查询底层规范以获取确定性证据，以及（2）视图基础交互的框架，该框架操控视图以解决视觉模糊。为了在没有VLM偏见的情况下进行评估，我们提出了iPlotBench，这是一个包含500个交互式Plotly图形、6706个二元问题和真实规范的基准测试。实验表明，内省提高了数据重构的保真度，而与交互结合时则达到了最高的问答准确率（0.81），在重叠几何体上获得了+6.7%的提升。我们进一步展示了IVG在已部署智能体中的应用，这些智能体能够自主探索数据并与人类用户实时协作。
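
The idea of reading exact values from the specification rather than from pixels can be sketched with a plain dictionary in the Plotly figure JSON layout (a top-level `data` list of traces plus a `layout` object). The helper name `read_value` and the toy figure are assumptions for illustration and do not reflect the IVG tool interface.

```python
def read_value(figure_spec, trace_name, x_value):
    """Return the exact y value stored in the spec for a given trace and x,
    or None if the trace or the x value is not present."""
    for trace in figure_spec.get("data", []):
        if trace.get("name") == trace_name:
            for x, y in zip(trace.get("x", []), trace.get("y", [])):
                if x == x_value:
                    return y
    return None


# Toy figure spec: two nearly overlapping bar series (invented values).
figure_spec = {
    "data": [
        {"type": "bar", "name": "2023", "x": ["Q1", "Q2"], "y": [12.5, 17.0]},
        {"type": "bar", "name": "2024", "x": ["Q1", "Q2"], "y": [12.6, 16.9]},
    ],
    "layout": {"title": {"text": "Revenue"}},
}

print(read_value(figure_spec, "2024", "Q1"))
print(read_value(figure_spec, "2025", "Q1"))
```

Values such as 12.5 and 12.6 are hard to tell apart in a rendered image, yet they are unambiguous in the specification, which is the kind of deterministic evidence the abstract refers to.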

cs.CL / 11 / 2604.21137

Enhancing Science Classroom Discourse Analysis through Joint Multi-Task Learning for Reasoning-Component Classification

通过联合多任务学习增强科学课堂话语分析以进行推理组件分类

Noh, Jiho, Katragadda, Mukhesh Raghava, Carl, Raymond, Lee, Soon

Abstract

Analyzing the reasoning patterns of students in science classrooms is critical for understanding knowledge construction mechanisms and improving instructional practice to maximize cognitive engagement, yet manual coding of classroom discourse at scale remains prohibitively labor-intensive. We present an automated discourse analysis system (ADAS) that jointly classifies teacher and student utterances along two complementary dimensions, Utterance Type (UT) and Reasoning Component (RC), derived from our prior CDAT framework. To address severe label imbalance among minority classes, we (1) stratify-resplit the annotated corpus, (2) apply LLM-based synthetic data augmentation targeting minority classes, and (3) train a dual-probe-head RoBERTa-base classifier. A zero-shot GPT-5.4 baseline achieves macro-F1 of 0.467 on UT and 0.476 on RC, establishing meaningful upper bounds for prompt-only approaches and motivating fine-tuning. Beyond classification, we conduct discourse pattern analyses including UTxRC co-occurrence profiling, Cognitive Complexity Index (CCI) computation per session, lag-sequential analysis, and IRF chain analysis, revealing that teacher Feedback-with-Question (Fq) moves are the most consistent antecedents of student inferential reasoning (SR-I). Our results demonstrate that LLM-based augmentation meaningfully improves UT minority-class recognition, and that the structural simplicity of the RC task makes it tractable even for lexical baselines.

Chinese Translation

分析科学课堂中学生的推理模式对于理解知识构建机制和改善教学实践以最大化认知参与至关重要，但大规模手动编码课堂话语仍然极为劳动密集。我们提出了一种自动化话语分析系统（ADAS），该系统沿两个互补维度联合分类教师和学生的发言：发言类型（Utterance Type）和推理组件（Reasoning Component），这些维度源自我们之前的CDAT框架。为了应对少数类标签严重不平衡的问题，我们（1）对注释语料库进行分层重分割，（2）应用基于大语言模型（LLM）的合成数据增强，针对少数类进行优化，以及（3）训练一个双探头RoBERTa-base分类器。零样本GPT-5.4基线在发言类型（UT）上取得了0.467的宏F1值，在推理组件（RC）上取得了0.476的宏F1值，为仅依赖提示的方法设定了有意义的上限，激励了微调。除了分类，我们还进行了话语模式分析，包括UT与RC的共现分析、每个会话的认知复杂性指数（CCI）计算、滞后序列分析和IRF链分析，结果显示教师的反馈与提问（Feedback-with-Question, Fq）行为是学生推理（SR-I）最一致的前因。我们的结果表明，基于LLM的增强显著改善了UT少数类的识别，并且RC任务的结构简单性使其即使对于词汇基线也变得可行。

cs.CL / 12 / 2604.21139

Slot Machines: How LLMs Keep Track of Multiple Entities

老虎机：大型语言模型如何跟踪多个实体

Bogdan, Paul C., Lindsey, Jack

Abstract

Language models must bind entities to the attributes they possess and maintain several such binding relationships within a context. We study how multiple entities are represented across token positions and whether single tokens can carry bindings for more than one entity. We introduce a multi-slot probing approach that disentangles a single token's residual stream activation to recover information about both the currently described entity and the immediately preceding one. These two kinds of information are encoded in separate and largely orthogonal "current-entity" and "prior-entity" slots. We analyze the functional roles of these slots and find that they serve different purposes. In tandem with the current-entity slot, the prior-entity slot supports relational inferences, such as entity-level induction ("who came after Alice in the story?") and conflict detection between adjacent entities. However, only the current-entity slot is used for explicit factual retrieval questions ("Is anyone in the story tall?" "What is the tall entity's name?"), even though these answers are linearly decodable from the prior-entity slot too. Consistent with this limitation, open-weight models perform near chance accuracy at processing syntax that forces two subject-verb-object bindings on a single token (e.g., "Alice prepares and Bob consumes food"). Interestingly, recent frontier models can parse this properly, suggesting they may have developed more sophisticated binding strategies. Overall, our results expose a gap between information that is available in activations and information the model actually uses, and suggest that the current/prior-entity slot structure is a natural substrate for behaviors that require holding two perspectives at once, such as sycophancy and deception.

Chinese Translation

语言模型必须将实体与其所具备的属性绑定，并在上下文中维持多个这样的绑定关系。我们研究了多个实体如何在标记位置中表示，以及单个标记是否可以承载多个实体的绑定。我们提出了一种多槽探测方法，该方法解开单个标记的残余流激活，以恢复关于当前描述实体和紧接着的前一个实体的信息。这两种信息被编码在独立且在很大程度上正交的“当前实体”和“前一个实体”槽中。我们分析了这些槽的功能角色，发现它们服务于不同的目的。在与当前实体槽一起使用时，前一个实体槽支持关系推理，例如实体级归纳（“故事中谁在爱丽丝之后？”）和相邻实体之间的冲突检测。然而，尽管这些答案也可以从前一个实体槽线性解码，但只有当前实体槽用于显式事实检索问题（“故事中有人高吗？”“高的实体叫什么名字？”）。与这一限制一致，开放权重模型在处理强迫单个标记上有两个主语-动词-宾语绑定的句法时表现接近随机准确率（例如，“爱丽丝准备食物，鲍勃消费食物。”）。有趣的是，最近的前沿模型能够正确解析这一点，表明它们可能已经发展出更复杂的绑定策略。总体而言，我们的结果揭示了激活中可用信息与模型实际使用信息之间的差距，并暗示当前/前一个实体槽结构是需要同时保持两个视角的行为（如谄媚和欺骗）的自然基础。

cs.CL / 13 / 2604.21144

Using Machine Mental Imagery for Representing Common Ground in Situated Dialogue

利用机器心理意象在情境对话中表示共同基础

Mohapatra, Biswesh, Duca, Giovanni, Romary, Laurent, Cassell, Justine

Abstract

Situated dialogue requires speakers to maintain a reliable representation of shared context rather than reasoning only over isolated utterances. Current conversational agents often struggle with this requirement, especially when the common ground must be preserved beyond the immediate context window. In such settings, fine-grained distinctions are frequently compressed into purely textual representations, leading to a critical failure mode we call "representational blur", in which similar but distinct entities collapse into interchangeable descriptions. This semantic flattening creates an illusion of grounding, where agents appear locally coherent but fail to track shared context persistently over time. Inspired by the role of mental imagery in human reasoning, and based on the increased availability of multimodal models, we explore whether conversational agents can be given an analogous ability to construct some depictive intermediate representations during dialogue to address these limitations. Thus, we introduce an active visual scaffolding framework that incrementally converts dialogue state into a persistent visual history that can later be retrieved for grounded response generation. Evaluation on the IndiRef benchmark shows that incremental externalization itself improves over full-dialog reasoning, while visual scaffolding provides additional gains by reducing representational blur and enforcing concrete scene commitments. At the same time, textual representations remain advantageous for non-depictable information, and a hybrid multimodal setting yields the best overall performance. Together, these findings suggest that conversational agents benefit from an explicitly multimodal representation of common ground that integrates depictive and propositional information.

Chinese Translation

情境对话要求说话者保持对共享上下文的可靠表征，而不是仅仅对孤立的言语进行推理。目前的对话代理在满足这一要求时常常面临困难，尤其是当共同基础必须超越即时上下文窗口时。在这种情况下，细微的区分常常被压缩为纯文本表征，导致我们称之为“表征模糊”的关键失效模式，其中相似但不同的实体被压缩为可互换的描述。这种语义平坦化创造了一种基础的错觉，使得代理在局部上看似一致，但未能持续跟踪共享上下文。受到人类推理中心理意象作用的启发，并基于多模态模型的可用性增加，我们探讨对话代理是否可以被赋予类似的能力，在对话过程中构建一些描绘性的中间表征，以解决这些局限。因此，我们引入了一种主动视觉支架框架，该框架逐步将对话状态转换为持久的视觉历史，后者可以在生成有基础的响应时进行检索。在IndiRef基准上的评估表明，增量外化本身优于全对话推理，而视觉支架通过减少表征模糊和强化具体场景承诺提供了额外的收益。同时，文本表征在不可描绘信息方面仍然具有优势，而混合多模态设置则产生了最佳的整体性能。这些发现共同表明，对话代理受益于一种明确的多模态共同基础表征，该表征整合了描绘性和命题信息。

cs.CL / 14 / 2604.21148

"This Wasn't Made for Me": Recentering User Experience and Emotional Impact in the Evaluation of ASR Bias

“这不是为我而设计的”：在评估自动语音识别偏见时重新聚焦用户体验和情感影响

Liang, Siyu, Wassink, Alicia Beckford

Abstract

Studies on bias in Automatic Speech Recognition (ASR) tend to focus on reporting error rates for speakers of underrepresented dialects, yet less research examines the human side of system bias: how do system failures shape users' lived experiences, how do users feel about and react to them, and what emotional toll do these repeated failures exact? We conducted user experience studies across four U.S. locations (Atlanta, Gulf Coast, Miami Beach, and Tucson) representing distinct English dialect communities. Our findings reveal that most participants report technologies fail to consider their cultural backgrounds and require constant adjustment to achieve basic functionality. Despite these experiences, participants maintain high expectations for ASR performance and express strong willingness to contribute to model improvement. Qualitative analysis of open-ended narratives exposes the deeper costs of these failures. Participants report frustration, annoyance, and feelings of inadequacy, yet the emotional impact extends beyond momentary reactions. Participants recognize that systems were not designed for them, yet often internalize failures as personal inadequacy despite this critical awareness. They perform extensive invisible labor, including code-switching, hyper-articulation, and emotional management, to make failing systems functional. Meanwhile, their linguistic and cultural knowledge remains unrecognized by technologies that encode particular varieties as standard while rendering others marginal. These findings demonstrate that algorithmic fairness assessments based on accuracy metrics alone miss critical dimensions of harm: the emotional labor of managing repeated technological rejection, the cognitive burden of constant self-monitoring, and the psychological toll of feeling inadequate in one's native language variety.

Chinese Translation

关于自动语音识别（ASR）偏见的研究往往集中于报告少数方言使用者的错误率，然而对系统偏见的人性化研究较少：系统故障如何塑造用户的生活体验，用户对此的感受和反应如何，以及这些重复故障所带来的情感代价是什么？我们在美国四个地点（亚特兰大、墨西哥湾沿岸、迈阿密海滩和图森）进行了用户体验研究，代表了不同的英语方言社区。我们的研究结果显示，大多数参与者报告技术未能考虑他们的文化背景，并且需要不断调整才能实现基本功能。尽管有这些经历，参与者仍对ASR性能保持高期望，并表现出强烈的愿望为模型改进做出贡献。开放式叙述的定性分析揭示了这些故障的更深层次代价。参与者报告了挫败感、恼怒和自我不足感，但情感影响超越了瞬时反应。参与者意识到系统并非为他们设计，然而常常将故障内化为个人不足，尽管他们对此有清醒的认识。他们进行大量隐形劳动，包括代码切换、超清晰发音和情感管理，以使故障系统能够正常运作。同时，他们的语言和文化知识在将特定方言编码为标准的技术中未得到承认，而其他方言则被边缘化。这些发现表明，仅基于准确性指标的算法公平性评估忽视了伤害的关键维度：管理重复技术拒绝的情感劳动、持续自我监控的认知负担，以及在自己母语方言中感到不足的心理代价。

cs.CL / 15 / 2604.21191

Prefix Parsing is Just Parsing

前缀解析仅仅是解析

Pasti, Clemente, Opedal, Andreas, O'Donnell, Timothy J., Cotterell, Ryan, Vieira, Tim

Abstract

Prefix parsing asks whether an input prefix can be extended to a complete string generated by a given grammar. In the weighted setting, it also provides prefix probabilities, which are central to context-free language modeling, psycholinguistic analysis, and syntactically constrained generation from large language models. We introduce the prefix grammar transformation, an efficient reduction of prefix parsing to ordinary parsing. Given a grammar, our method constructs another grammar that generates exactly the prefixes of its original strings. Prefix parsing is then solved by applying any ordinary parsing algorithm on the transformed grammar without modification. The reduction is both elegant and practical: the transformed grammar is only a small factor larger than the input, and any optimized implementation can be used directly, eliminating the need for bespoke prefix-parsing algorithms. We also present a strategy, based on algorithmic differentiation, for computing the next-token weight vector, i.e., the prefix weights of all one-token extensions, enabling efficient prediction of the next token. Together, these contributions yield a simple, general, and efficient framework for prefix parsing.

Chinese Translation

前缀解析探讨输入前缀是否可以扩展为由给定文法生成的完整字符串。在加权设置中，它还提供前缀概率，这对于上下文无关语言建模、心理语言学分析以及从大型语言模型中进行语法约束生成至关重要。我们引入了前缀文法变换，这是一种将前缀解析高效地简化为普通解析的方法。给定一个文法，我们的方法构造出另一个文法，该文法恰好生成其原始字符串的前缀。然后，通过对变换后的文法应用任何普通解析算法而不进行修改来解决前缀解析。该简化既优雅又实用：变换后的文法仅比输入大一个小因子，并且可以直接使用任何优化实现，从而消除了定制前缀解析算法的需求。我们还提出了一种基于算法微分的策略，用于计算下一个标记的权重向量，即所有单标记扩展的前缀权重，从而实现下一个标记的高效预测。这些贡献共同构成了一个简单、通用且高效的前缀解析框架。
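
The reduction can be illustrated in the unweighted case with a textbook-style construction. This sketch is not claimed to be the paper's transformation: it assumes a grammar in Chomsky normal form with no empty string and with every nonterminal able to derive at least one string, and it introduces primed symbols (`S'`, `A'`, ...) that derive the nonempty prefixes of the corresponding original symbol. An ordinary CYK-style recognizer, extended only to close over unary rules, is then run unchanged on the new grammar.

```python
def prefix_grammar(rules):
    """For every symbol X, add X' deriving the nonempty prefixes of X's strings.
    rules: list of (lhs, rhs), rhs is (terminal,) or (nonterminal, nonterminal)."""
    out = list(rules)
    for lhs, rhs in rules:
        if len(rhs) == 1:
            # X -> a: the only nonempty prefix is a itself.
            out.append((lhs + "'", rhs))
        else:
            left, right = rhs
            # Either all of `left` plus a prefix of `right` ...
            out.append((lhs + "'", (left, right + "'")))
            # ... or just a prefix of `left`.
            out.append((lhs + "'", (left + "'",)))
    return out


def recognize(rules, start, tokens):
    """CYK-style recognizer with unary closure; knows nothing about prefixes."""
    n = len(tokens)
    chart = {}
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = {tokens[i]} if length == 1 else set()
            for lhs, rhs in rules:
                if len(rhs) == 2:
                    for k in range(i + 1, j):
                        if rhs[0] in chart[(i, k)] and rhs[1] in chart[(k, j)]:
                            cell.add(lhs)
            changed = True
            while changed:  # close the cell under unary rules
                changed = False
                for lhs, rhs in rules:
                    if len(rhs) == 1 and rhs[0] in cell and lhs not in cell:
                        cell.add(lhs)
                        changed = True
            chart[(i, j)] = cell
    return n > 0 and start in chart[(0, n)]


# Toy grammar for a^n b^n (n >= 1) in Chomsky normal form.
rules = [("S", ("A", "X")), ("S", ("A", "B")), ("X", ("S", "B")),
         ("A", ("a",)), ("B", ("b",))]
prefixed = prefix_grammar(rules)

print(recognize(rules, "S", list("aab")))      # complete string?
print(recognize(prefixed, "S'", list("aab")))  # prefix of some complete string?
```

The transformed grammar here has at most three times as many rules as the input, and the recognizer itself is untouched, which mirrors the abstract's point that no bespoke prefix-parsing algorithm is needed.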

cs.CL / 16 / 2604.21204

On Reasoning Behind Next Occupation Recommendation

关于下一职业推荐背后的推理

Dong, Shan, Achananuparp, Palakorn, Mai, Hieu Hien, Wang, Lei, Lu, Yao, Lim, Ee-Peng

Abstract

In this work, we develop a novel reasoning approach to enhance the performance of large language models (LLMs) in future occupation prediction. In this approach, a reason generator first derives a "reason" for a user from their past education and career history. The reason summarizes the user's preference and is used as the input of an occupation predictor to recommend the user's next occupation. This two-step occupation prediction approach is, however, non-trivial, as LLMs are not aligned with career paths or the unobserved reasons behind each occupation decision. We therefore propose to fine-tune LLMs to improve their reasoning and occupation prediction performance. We first derive high-quality oracle reasons, as measured by factuality, coherence, and utility criteria, using an LLM-as-a-Judge. These oracle reasons are then used to fine-tune small LLMs to perform reason generation and next occupation prediction. Our extensive experiments show that: (a) our approach effectively enhances LLMs' accuracy in next occupation prediction, making them comparable to fully supervised methods and outperforming unsupervised methods; (b) a single LLM fine-tuned to perform reason generation and occupation prediction outperforms two LLMs fine-tuned to perform the tasks separately; and (c) the next occupation prediction accuracy depends on the quality of generated reasons. Our code is available at https://github.com/Sarasarahhhhh/job_prediction.

Chinese Translation

在本研究中，我们开发了一种新颖的推理方法，以增强大型语言模型（LLMs）在未来职业预测中的表现。在该方法中，推理生成器首先基于用户的过去教育和职业历史为其推导出一个“理由”。该理由总结了用户的偏好，并作为职业预测器的输入，以推荐用户的下一职业。然而，这种两步职业预测方法并非简单，因为LLMs与职业路径或每个职业决策背后的未观察到的理由并不一致。因此，我们提出对LLMs进行微调，以提高其推理和职业预测性能。我们首先使用LLM-as-a-Judge推导出高质量的权威理由，依据事实性、一致性和实用性标准进行评估。这些权威理由随后用于微调小型LLMs，以执行理由生成和下一职业预测。我们的广泛实验表明：（a）我们的方法有效提高了LLM在下一职业预测中的准确性，使其与完全监督的方法相当，并优于无监督的方法；（b）一个微调以执行理由生成和职业预测的单一LLM优于两个分别微调以执行这些任务的LLMs；（c）下一职业预测的准确性依赖于生成理由的质量。我们的代码可在 https://github.com/Sarasarahhhhh/job_prediction 获取。

cs.CL / 17 / 2604.21211

Subject-level Inference for Realistic Text Anonymization Evaluation

针对现实文本匿名化评估的主体级推断

Oh, Myeong Seok, Kim, Dong-Yun, Oh, Hanseok, Kang, Chaean, Kang, Joeun, Wang, Xiaonan, Park, Hyunjung, Jung, Young Cheol, Kim, Hansaem

Abstract

Current text anonymization evaluation relies on span-based metrics that fail to capture what an adversary could actually infer, and assumes a single data subject, ignoring multi-subject scenarios. To address these limitations, we present SPIA (Subject-level PII Inference Assessment), the first benchmark that shifts the unit of evaluation from text spans to individuals, comprising 675 documents across legal and online domains with novel subject-level protection metrics. Extensive experiments show that even when over 90% of PII spans are masked, subject-level inference protection drops as low as 33%, leaving the majority of personal information recoverable through contextual inference. Furthermore, target-subject-focused anonymization leaves non-target subjects substantially more exposed than the target subject. We show that subject-level inference-based evaluation is essential for ensuring safe text anonymization in real-world settings.

Chinese Translation

当前的文本匿名化评估依赖于基于跨度的指标，这些指标无法捕捉对手实际能够推断出的信息，并且假设只有单一数据主体，忽略了多主体场景。为了解决这些局限性，我们提出了SPIA（主体级个人身份信息推断评估），这是第一个将评估单位从文本跨度转变为个体的基准，涵盖了675份来自法律和在线领域的文档，并引入了新颖的主体级保护指标。广泛的实验表明，即使超过90%的个人身份信息跨度被屏蔽，主体级推断保护仍可低至33%，导致大多数个人信息可以通过上下文推断恢复。此外，针对目标主体的匿名化使得非目标主体暴露程度显著高于目标主体。我们证明，基于主体级推断的评估对于确保现实环境中安全的文本匿名化至关重要。

cs.CL / 18 / 2604.21223

Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model

通过隐式奖励模型进行零样本检测LLM生成文本

Liu, Runheng, Huang, Heyan, Xiao, Xingchen, Wu, Zhijing

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their ability to generate human-like text has raised concerns about potential misuse. This underscores the need for reliable and effective methods to detect LLM-generated text. In this paper, we propose IRM, a novel zero-shot approach that leverages Implicit Reward Models for LLM-generated text detection. Such implicit reward models can be derived from publicly available instruction-tuned and base models. Previous reward-based methods rely on preference construction and task-specific fine-tuning. In comparison, IRM requires neither preference collection nor additional training. We evaluate IRM on the DetectRL benchmark and demonstrate that it achieves superior detection performance, outperforming existing zero-shot and supervised methods in LLM-generated text detection.

Chinese Translation

大型语言模型（LLMs）在各种任务中展现了显著的能力。然而，它们生成类人文本的能力引发了对潜在滥用的担忧。这突显了检测LLM生成文本的可靠有效方法的必要性。本文提出了一种新颖的零样本方法IRM，利用隐式奖励模型进行LLM生成文本的检测。这种隐式奖励模型可以从公开可用的指令调优模型和基础模型中推导而来。以往的基于奖励的方法依赖于偏好构建和特定任务的微调。相比之下，IRM既不需要偏好收集，也不需要额外训练。我们在DetectRL基准上评估了IRM，并证明IRM能够实现更优的检测性能，超越现有的零样本和监督方法在LLM生成文本检测中的表现。
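
One common way to define an implicit reward from a tuned/base model pair is the DPO-style log-likelihood ratio, beta * (log pi_tuned(y|x) - log pi_base(y|x)). The sketch below scores a text from per-token log-probabilities under that definition. Whether IRM uses exactly this form, the value of `beta`, the threshold, and the direction of the decision are all assumptions; the numbers are invented.

```python
def implicit_reward(logp_tuned, logp_base, beta=1.0):
    """DPO-style implicit reward: beta times the summed per-token
    log-probability gap between the tuned model and the base model."""
    if len(logp_tuned) != len(logp_base):
        raise ValueError("both models must score the same tokens")
    return beta * sum(t - b for t, b in zip(logp_tuned, logp_base))


def looks_generated(logp_tuned, logp_base, threshold=0.0, beta=1.0):
    """Hypothetical decision rule: flag text whose implicit reward exceeds a threshold."""
    return implicit_reward(logp_tuned, logp_base, beta) > threshold


# Invented per-token log-probabilities for one short text.
tuned = [-1.0, -2.0]
base = [-1.5, -2.5]
print(implicit_reward(tuned, base))
print(looks_generated(tuned, base))
```

In practice the per-token log-probabilities would come from scoring the same token sequence with both models, and the threshold would need calibration on held-out data.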

cs.CL / 19 / 2604.21229

EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval

EngramaBench：使用结构化图检索评估长期对话记忆

Acuna, Julian

Abstract

Large language model assistants are increasingly expected to retain and reason over information accumulated across many sessions. We introduce EngramaBench, a benchmark for long-term conversational memory built around five personas, one hundred multi-session conversations, and one hundred fifty queries spanning factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis. We evaluate Engrama, a graph-structured memory system, against GPT-4o full-context prompting and Mem0, an open-source vector-retrieval memory system. All three use the same answering model (GPT-4o), isolating the effect of memory architecture. GPT-4o full-context achieves the highest composite score (0.6186), while Engrama scores 0.5367 globally but is the only system to score higher than full-context prompting on cross-space reasoning (0.6532 vs. 0.6291, n=30). Mem0 is cheapest but substantially weaker (0.4809). Ablations reveal that the components driving Engrama's cross-space advantage trade off against global composite score, exposing a systems-level tension between structured memory specialization and aggregate optimization.

Chinese Translation

大型语言模型助手越来越被期望能够保留并推理跨多个会话积累的信息。我们介绍了EngramaBench，这是一个围绕五个角色、一百个多会话对话和一百五十个查询构建的长期对话记忆基准，涵盖事实回忆、跨空间整合、时间推理、对抗性回避和新兴综合等方面。我们将Engrama（一个图结构记忆系统）与GPT-4o的全上下文提示和Mem0（一个开源向量检索记忆系统）进行评估。三者均使用相同的回答模型（GPT-4o），从而隔离记忆架构的影响。GPT-4o全上下文达到了最高的综合得分（0.6186），而Engrama的全球得分为0.5367，但在跨空间推理方面是唯一一个得分高于全上下文提示的系统（0.6532对0.6291，n=30）。Mem0成本最低，但显著较弱（0.4809）。消融实验表明，推动Engrama跨空间优势的组件与全球综合得分之间存在权衡，揭示了结构化记忆专业化与整体优化之间的系统级张力。

cs.CL / 20 / 2604.21238

Unlocking the Power of Large Language Models for Multi-table Entity Matching

释放大型语言模型在多表实体匹配中的潜力

Tang, Yingkai, Su, Taoyu, Zhang, Wenyuan, Guo, Xiaoyang, Liu, Tingwen

Abstract

Multi-table entity matching (MEM) addresses the limitations of dual-table approaches by enabling simultaneous identification of equivalent entities across multiple data sources without unique identifiers. However, existing methods relying on pre-trained language models struggle to handle semantic inconsistencies caused by numerical attribute variations. Inspired by the powerful language understanding capabilities of large language models (LLMs), we propose a novel LLM-based framework for multi-table entity matching, termed LLM4MEM. Specifically, we first propose a multi-style prompt-enhanced LLM attribute coordination module to address semantic inconsistencies. Then, to alleviate the matching efficiency problem caused by the surge in the number of entities brought by multiple data sources, we develop a transitive consensus embedding matching module to tackle entity embedding and pre-matching issues. Finally, to address the issue of noisy entities during the matching process, we introduce a density-aware pruning module to optimize the quality of multi-table entity matching. We conducted extensive experiments on 6 MEM datasets, and the results show that our model improves by an average of 5.1% in F1 compared with the baseline model. Our code is available at https://github.com/Ymeki/LLM4MEM.

Chinese Translation

多表实体匹配（MEM）通过支持在多个数据源中同时识别等效实体，克服了双表方法的局限性，而无需唯一标识符。然而，现有依赖于预训练语言模型的方法在处理由数值属性变化引起的语义不一致性方面存在困难。受到大型语言模型（LLMs）强大语言理解能力的启发，我们提出了一种新颖的基于LLM的多表实体匹配框架，称为LLM4MEM。具体而言，我们首先提出了一种多样式提示增强的LLM属性协调模块，以解决语义不一致性问题。然后，为了缓解由于多个数据源带来的实体数量激增而导致的匹配效率问题，我们开发了一个传递共识嵌入匹配模块，以应对实体嵌入和预匹配问题。最后，为了解决匹配过程中噪声实体的问题，我们引入了一个密度感知修剪模块，以优化多表实体匹配的质量。我们在6个MEM数据集上进行了广泛的实验，结果表明，与基线模型相比，我们的模型在F1值上平均提高了5.1%。我们的代码可在https://github.com/Ymeki/LLM4MEM获取。


cs.CL / 21 / 2604.21253

Planning Beyond Text: Graph-based Reasoning for Complex Narrative Generation

超越文本的规划：基于图的复杂叙事生成推理

Gu, Hanwen, Guo, Chao, Wang, Junle, Xie, Wenda, Lv, Yisheng

Abstract

While LLMs demonstrate remarkable fluency in narrative generation, existing methods struggle to maintain global narrative coherence, contextual logical consistency, and smooth character development, often producing monotonous scripts with structural fractures. To this end, we introduce PLOTTER, a framework that performs narrative planning on structural graph representations instead of the direct sequential text representations used in existing work. Specifically, PLOTTER executes the Evaluate-Plan-Revise cycle on the event graph and character graph. By diagnosing and repairing issues of the graph topology under rigorous logical constraints, the model optimizes the causality and narrative skeleton before complete context generation. Experiments demonstrate that PLOTTER significantly outperforms representative baselines across diverse narrative scenarios. These findings verify that planning narratives on structural graph representations, rather than directly on text, is crucial for enhancing the long-context reasoning of LLMs in complex narrative generation.

Chinese Translation

尽管大型语言模型（LLMs）在叙事生成方面表现出显著的流畅性，但现有方法在维持全局叙事连贯性、上下文逻辑一致性和角色发展顺畅性方面存在困难，常常产生结构性缺陷的单调剧本。为此，我们提出了PLOTTER，一个在结构图表示上进行叙事规划的框架，而不是现有工作中使用的直接顺序文本表示。具体而言，PLOTTER在事件图和角色图上执行评估-规划-修订循环。通过在严格的逻辑约束下诊断和修复图拓扑问题，该模型在完整上下文生成之前优化因果关系和叙事框架。实验表明，PLOTTER在多种叙事场景中显著优于代表性基线。这些发现验证了在结构图表示上进行叙事规划而非直接在文本上进行规划，对于增强LLMs在复杂叙事生成中的长上下文推理能力至关重要。


cs.CL / 22 / 2604.21255

When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors

当智能体看起来相似时：量化蒸馏引起的工具使用行为相似性

Yang, Chenghao, Zhang, Yuning, Wen, Zhoufutu, Gong, Tao, Liu, Jiaheng, Chu, Qi, Yu, Nenghai

Abstract

Model distillation is a primary driver behind the rapid progress of LLM agents, yet it often leads to behavioral homogenization. Many emerging agents share nearly identical reasoning steps and failure modes, suggesting they may be distilled echoes of a few dominant teachers. Existing metrics, however, fail to distinguish mandatory behaviors required for task success from non-mandatory patterns that reflect a model's autonomous preferences. We propose two complementary metrics to isolate non-mandatory behavioral patterns: Response Pattern Similarity (RPS) for verbal alignment and Action Graph Similarity (AGS) for tool-use habits modeled as directed graphs. Evaluating 18 models from 8 providers on $\tau$-Bench and $\tau^2$-Bench against Claude Sonnet 4.5 (thinking), we find that within-family model pairs score 5.9 pp higher in AGS than cross-family pairs, and that Kimi-K2 (thinking) reaches 82.6% $S_{\text{node}}$ and 94.7% $S_{\text{dep}}$, exceeding Anthropic's own Opus 4.1. A controlled distillation experiment further confirms that AGS distinguishes teacher-specific convergence from general improvement. RPS and AGS capture distinct behavioral dimensions (Pearson $r$ = 0.491), providing complementary diagnostic signals for behavioral convergence in the agent ecosystem. Our code is available at https://github.com/Syuchin/AgentEcho.

Chinese Translation

模型蒸馏是大型语言模型（LLM）智能体快速发展的主要驱动力，但它往往导致行为的同质化。许多新兴智能体共享几乎相同的推理步骤和失败模式，这表明它们可能是少数主导教师的蒸馏回声。然而，现有的度量标准未能区分任务成功所需的强制性行为与反映模型自主偏好的非强制性模式。我们提出了两种互补的度量标准，以隔离非强制性行为模式：响应模式相似性（Response Pattern Similarity, RPS）用于语言对齐，行动图相似性（Action Graph Similarity, AGS）用于建模为有向图的工具使用习惯。在对来自8个提供者的18个模型在$\tau$-Bench和$\tau^2$-Bench上与Claude Sonnet 4.5（思考）进行评估时，我们发现同一家族模型对在AGS上的得分比跨家族模型对高出5.9个百分点，并且Kimi-K2（思考）达到了82.6% $S_{\text{node}}$和94.7% $S_{\text{dep}}$，超过了Anthropic自己的Opus 4.1。一个受控的蒸馏实验进一步确认了AGS能够区分特定教师的收敛与一般改进。RPS和AGS捕捉到不同的行为维度（Pearson $r$ = 0.491），为智能体生态系统中的行为收敛提供了互补的诊断信号。我们的代码可在https://github.com/Syuchin/AgentEcho获取。
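To make the idea of comparing tool-use habits as directed graphs more concrete, here is a minimal sketch. It is not the paper's AGS: the abstract does not define $S_{\text{node}}$ or $S_{\text{dep}}$, so this example simply assumes Jaccard overlap of tool nodes and of consecutive-call edges, and it does not attempt to separate mandatory from non-mandatory behavior. The function names and the toy tool names are hypothetical.

```python
def action_graph(trajectories):
    """Build a directed graph (nodes, edges) from lists of tool-call names.

    Assumption: an edge links each tool call to the call that follows it.
    """
    nodes, edges = set(), set()
    for calls in trajectories:
        nodes.update(calls)
        edges.update(zip(calls, calls[1:]))
    return nodes, edges


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def graph_similarity(traj_a, traj_b):
    """Return (node overlap, edge overlap) for two agents' trajectories."""
    nodes_a, edges_a = action_graph(traj_a)
    nodes_b, edges_b = action_graph(traj_b)
    return jaccard(nodes_a, nodes_b), jaccard(edges_a, edges_b)


if __name__ == "__main__":
    agent_a = [["search", "get_order", "refund"]]
    agent_b = [["search", "get_order", "cancel"]]
    print(graph_similarity(agent_a, agent_b))
```

Edge overlap is stricter than node overlap here because two agents can use the same tools while ordering them differently.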


cs.CL / 23 / 2604.21265

Listen and Chant Before You Read: The Ladder of Beauty in LM Pre-Training

在阅读之前倾听和吟唱：LM预训练中的美的阶梯

Nomura, Yoshinori

Abstract

We show that pre-training a Transformer on music before language significantly accelerates language acquisition. Using piano performances (MAESTRO dataset), a developmental pipeline -- music $\to$ poetry $\to$ prose -- yields a $17.5\%$ perplexity improvement over random initialization ($p < 0.001$, 5 seeds), with music and poetry improving orthogonal model components (internal computation and embeddings, respectively). Convergence tests confirm that this is not a transient head start: at $d\!=\!64$, multi-seed validation (5 seeds) shows a persistent 5.5\% gap at plateau ($p = 0.017$), with the pipeline converging faster and to a lower loss in every run. Real music matches the transfer ceiling of synthetic patterns with one-third the data, and scaling experiments reveal that optimal pre-training data volume shifts with model capacity ($-3\% \to +3\% \to +6\%$ advantage of larger datasets from $d\!=\!16$ to $d\!=\!64$). Across the scales we study ($d\!\in\!\{16,32,64\}$, up to ${\sim}400$K parameters), these results suggest a capacity-dependent data curation principle and indicate that structured human creative outputs can provide an efficient pre-training substrate for small language models; stronger conclusions at modern pre-training scale will require substantially larger experiments.

Chinese Translation

我们展示了在语言之前对音乐进行Transformer预训练显著加速了语言习得。使用钢琴表演（MAESTRO数据集），一个发展性流程（音乐 $\to$ 诗歌 $\to$ 散文）在随机初始化上实现了 $17.5\%$ 的困惑度改善（$p < 0.001$, 5个种子），其中音乐和诗歌分别改善了模型的正交组件（内部计算和嵌入）。收敛测试确认这不是暂时的先发优势：在 $d\!=\!64$ 时，多种子验证（5个种子）显示在平台期存在持续的5.5\%差距（$p = 0.017$），该流程在每次运行中收敛更快且损失更低。真实音乐在数据量仅为三分之一的情况下达到了合成模式的迁移上限，扩展实验表明，最佳预训练数据量随着模型容量的变化而变化（从 $d\!=\!16$ 到 $d\!=\!64$，较大数据集的优势为 $-3\% \to +3\% \to +6\%$）。在我们研究的范围内（$d\!\in\!\{16,32,64\}$，参数量高达 ${\sim}400$K），这些结果表明了一种依赖于容量的数据策划原则，并指出结构化的人类创作输出可以为小型语言模型提供高效的预训练基础；在现代预训练规模下得出更强的结论将需要大规模的实验。


cs.CL / 24 / 2604.21276

Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition

LLM解码器是否公平地倾听？语言模型先验如何影响语音识别偏见的基准测试

Ginjala, Srishti, Fosler-Lussier, Eric, Myers, Christopher W., Parthasarathy, Srinivasan

Abstract

As pretrained large language models replace task-specific decoders in speech recognition, a critical question arises: do their text-derived priors make recognition fairer or more biased across demographic groups? We evaluate nine models spanning three architectural generations (CTC with no language model, encoder-decoder with an implicit LM, and LLM-based with an explicit pretrained decoder) on about 43,000 utterances across five demographic axes (ethnicity, accent, gender, age, first language) using Common Voice 24 and Meta's Fair-Speech, a controlled-prompt dataset that eliminates vocabulary confounds. On clean audio, three findings challenge assumptions: LLM decoders do not amplify racial bias (Granite-8B has the best ethnicity fairness, max/min WER = 2.28); Whisper exhibits pathological hallucination on Indian-accented speech with a non-monotonic insertion-rate spike to 9.62% at large-v3; and audio compression predicts accent fairness more than LLM scale. We then stress-test these findings under 12 acoustic degradation conditions (noise, reverberation, silence injection, chunk masking) across both datasets, totaling 216 inference runs. Severe degradation paradoxically compresses fairness gaps as all groups converge to high WER, but silence injection amplifies Whisper's accent bias up to 4.64x by triggering demographic-selective hallucination. Under masking, Whisper enters catastrophic repetition loops (86% of 51,797 insertions) while explicit-LLM decoders produce 38x fewer insertions with near-zero repetition; high-compression audio encoding (Q-former) reintroduces repetition pathology even in LLM decoders. These results suggest that audio encoder design, not LLM scaling, is the primary lever for equitable and robust speech recognition.

Chinese Translation

随着预训练的大型语言模型取代语音识别中的任务特定解码器，一个关键问题随之而来：它们基于文本的先验会使不同人口群体的识别更加公平，还是更有偏见？我们评估了九种模型，涵盖三个架构代际（无语言模型的CTC、隐式语言模型的编码-解码器以及基于大型语言模型的显式预训练解码器），在约43,000个语句上进行测试，涉及五个人口统计维度（种族、口音、性别、年龄、母语），使用Common Voice 24和Meta的Fair-Speech，这是一个消除了词汇混淆的控制提示数据集。在清晰音频上，三个发现挑战了假设：LLM解码器并未放大种族偏见（Granite-8B在种族公平性方面表现最佳，最大/最小WER比值为2.28）；Whisper在印度口音的语音上表现出病态幻觉，在large-v3上插入率非单调地激增至9.62%；而音频压缩在口音公平性上的预测能力超过了LLM规模。随后，我们在12种声学降级条件（噪声、混响、静音注入、块掩蔽）下对这些发现进行了压力测试，涵盖两个数据集，共计216次推理运行。严重的降级反而压缩了公平性差距，因为所有群体的字错误率趋向于较高，但静音注入通过触发人口选择性幻觉将Whisper的口音偏见放大至4.64倍。在掩蔽条件下，Whisper进入灾难性的重复循环（51,797次插入中的86%），而显式LLM解码器产生的插入量少38倍，且几乎没有重复；高压缩音频编码（Q-former）甚至在LLM解码器中重新引入了重复病态。这些结果表明，音频编码器设计，而非LLM规模，是实现公平和稳健语音识别的主要杠杆。
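The max/min WER figure quoted in the abstract can be read as a ratio between the worst and best group-level word error rates. The sketch below shows one way such a ratio could be computed; the pooling of errors per group, the function names, and the toy utterances are assumptions for illustration, not the paper's evaluation code.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def group_wer(pairs):
    """Pooled WER over (reference, hypothesis) string pairs for one group."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words


def max_min_wer_ratio(by_group):
    """Return (max WER / min WER, per-group WERs).

    Assumes every group has a non-zero WER; a group with zero errors
    would make the ratio undefined.
    """
    wers = {group: group_wer(pairs) for group, pairs in by_group.items()}
    return max(wers.values()) / min(wers.values()), wers


if __name__ == "__main__":
    data = {
        "group_a": [("the cat sat on the mat", "the cat sat on a mat")],
        "group_b": [("turn the lights off now please", "turn lights of now")],
    }
    print(max_min_wer_ratio(data))
```

A ratio of 1.0 would mean all groups are recognized equally well; larger values indicate a wider gap between the best-served and worst-served group.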


cs.CL / 25 / 2604.21286

Cross-Entropy Is Load-Bearing: A Pre-Registered Scope Test of the K-Way Energy Probe on Bidirectional Predictive Coding

交叉熵是承载因素：K-way 能量探针在双向预测编码中的预注册范围测试

Cacioli, Jon-Paul

Abstract

Cacioli (2026) showed that the K-way energy probe on standard discriminative predictive coding networks reduces approximately to a monotone function of the log-softmax margin. The reduction rests on five assumptions, including cross-entropy (CE) at the output and effectively feedforward inference dynamics. This pre-registered study tests the reduction's sensitivity to CE removal using two conditions: standard PC trained with MSE instead of CE, and bidirectional PC (bPC; Oliviers, Tang & Bogacz, 2025). Across 10 seeds on CIFAR-10 with a matched 2.1M-parameter backbone, we find three results. The negative result replicates on standard PC: the probe sits below softmax (Delta = -0.082, p < 10^-6). On bPC the probe exceeds softmax across all 10 seeds (Delta = +0.008, p = 0.000027), though a pre-registered manipulation check shows that bPC does not produce materially greater latent movement than standard PC at this scale (ratio 1.6, threshold 10). Removing CE alone without changing inference dynamics halves the probe-softmax gap (Delta_MSE = -0.037 vs Delta_stdPC = -0.082). CE is a major empirically load-bearing component of the decomposition at this scale. CE training produces output logit norms approximately 15x larger than MSE or bPC training. A post-hoc temperature scaling ablation decomposes the probe-softmax gap into two components: approximately 66% is attributable to logit-scale effects removable by temperature rescaling, and approximately 34% reflects a scale-invariant ranking advantage of CE-trained representations. We use "metacognitive" operationally to denote Type-2 discrimination of a readout over its own Type-1 correctness, not to imply human-like introspective access.

Chinese Translation

Cacioli（2026）展示了在标准判别式预测编码网络上，K-way 能量探针大致简化为 log-softmax 间隔（margin）的单调函数。这一简化基于五个假设，包括输出端的交叉熵（CE）和实际上为前馈的推理动态。本预注册研究测试了该简化对去除 CE 的敏感性，使用了两种条件：用均方误差（MSE）而非 CE 训练的标准预测编码（PC），以及双向预测编码（bPC；Oliviers, Tang & Bogacz, 2025）。在 CIFAR-10 上以匹配的 2.1M 参数骨干网络、10 个随机种子进行的实验中，我们发现了三个结果。负结果在标准 PC 上得到复现：探针低于 softmax（Delta = -0.082，p < 10^-6）。在 bPC 上，探针在所有 10 个种子中均超过 softmax（Delta = +0.008，p = 0.000027），尽管预注册的操控检查显示，bPC 在此规模上并未产生比标准 PC 明显更大的潜变量移动（比率 1.6，阈值 10）。单独去除 CE 而不改变推理动态使探针与 softmax 之间的差距减半（Delta_MSE = -0.037 对比 Delta_stdPC = -0.082）。CE 是此规模下该分解中一个主要的、在经验上起承载作用的成分。CE 训练产生的输出 logit 范数约为 MSE 或 bPC 训练的 15 倍。事后温度缩放消融将探针与 softmax 之间的差距分解为两个成分：约 66% 可归因于可通过温度重新缩放移除的 logit 尺度效应，约 34% 反映了 CE 训练表示的尺度不变排序优势。我们使用“元认知”一词来操作性地表示读出对其自身 Type-1 正确性的 Type-2 辨别，而不是暗示类似人类的内省访问。
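The distinction between logit-scale effects and scale-invariant ranking can be illustrated with plain temperature scaling. The sketch below is a generic illustration, not the paper's ablation: dividing logits by a temperature changes the softmax confidence but leaves the predicted class unchanged. The logit values and the temperature of 15 are made up to echo the roughly 15x norm gap mentioned in the abstract.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_and_prediction(logits, temperature=1.0):
    """Return (max softmax probability, index of the predicted class)."""
    probs = softmax(logits, temperature)
    pred = max(range(len(probs)), key=probs.__getitem__)
    return probs[pred], pred


if __name__ == "__main__":
    large_norm_logits = [30.0, 15.0, 0.0]  # hypothetical CE-style logits
    print(confidence_and_prediction(large_norm_logits, temperature=1.0))
    print(confidence_and_prediction(large_norm_logits, temperature=15.0))
```

Because rescaling preserves the ordering of classes within an example, any part of a confidence gap that survives rescaling has to come from how examples are ranked against each other, which is the kind of residual the abstract attributes to CE-trained representations.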


cs.CL / 26 / 2604.21300

Explainable Disentangled Representation Learning for Generalizable Authorship Attribution in the Era of Generative AI

生成式人工智能时代面向可泛化作者归属识别的可解释解耦表示学习

Man, Hieu, Pham, Van-Cuong, Ngo, Nghia Trung, Dernoncourt, Franck, Nguyen, Thien Huu

Abstract

Learning robust representations of authorial style is crucial for authorship attribution and AI-generated text detection. However, existing methods often struggle with content-style entanglement, where models learn spurious correlations between authors' writing styles and topics, leading to poor generalization across domains. To address this challenge, we propose Explainable Authorship Variational Autoencoder (EAVAE), a novel framework that explicitly disentangles style from content through architectural separation-by-design. EAVAE first pretrains style encoders using supervised contrastive learning on diverse authorship data, then finetunes with a Variational Autoencoder (VAE) architecture using separate encoders for style and content representations. Disentanglement is enforced through a novel discriminator that not only distinguishes whether pairs of style/content representations belong to the same or different authors/content sources, but also generates natural language explanations for its decisions, simultaneously mitigating confounding information and enhancing interpretability. Extensive experiments demonstrate the effectiveness of EAVAE. On authorship attribution, we achieve state-of-the-art performance on various datasets, including Amazon Reviews, PAN21, and HRS. For AI-generated text detection, EAVAE excels in few-shot learning over the M4 dataset. Code and data repositories are available online at https://github.com/hieum98/avae and https://huggingface.co/collections/Hieuman/document-level-authorship-datasets.

Chinese Translation

学习稳健的作者风格表示对于作者归属识别和人工智能生成文本检测至关重要。然而，现有方法常常面临内容与风格的纠缠问题，模型学习到作者写作风格与主题之间的虚假关联，导致跨领域的泛化能力较差。为了解决这一挑战，我们提出了可解释的作者变分自编码器（Explainable Authorship Variational Autoencoder, EAVAE），这是一个通过设计上的架构分离明确解耦风格与内容的新框架。EAVAE首先使用监督对比学习在多样的作者数据上预训练风格编码器，然后使用变分自编码器（Variational Autoencoder, VAE）架构进行微调，采用独立的编码器分别表示风格和内容。通过一种新颖的判别器来强制解耦，该判别器不仅区分风格/内容表示对是否来自同一作者/内容源，还为其决策生成自然语言解释，从而同时减轻混淆信息并增强可解释性。大量实验表明EAVAE的有效性。在作者归属识别方面，我们在多个数据集上实现了最先进的性能，包括亚马逊评论（Amazon Reviews）、PAN21和HRS。在人工智能生成文本检测方面，EAVAE在M4数据集上的少样本学习中表现出色。代码和数据仓库可在线获取：https://github.com/hieum98/avae 和 https://huggingface.co/collections/Hieuman/document-level-authorship-datasets。


cs.CL / 27 / 2604.21309

When Bigger Isn't Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation

当规模不是优势：多新闻摘要中政治偏见的全面公平性评估

Huang, Nannan, Maab, Iffat, Yamagishi, Junichi

Abstract

Multi-document news summarisation systems are increasingly adopted for their convenience in processing vast daily news content, making fairness across diverse political perspectives critical. However, these systems can exhibit political bias through unequal representation of viewpoints, disproportionate emphasis on certain perspectives, and systematic underrepresentation of minority voices. This study presents a comprehensive evaluation of such bias in multi-document news summarisation using FairNews, a dataset of complete news articles with political orientation labels, examining how large language models (LLMs) handle sources with varying political leanings across 13 models and five fairness metrics. We investigate both baseline model performance and effectiveness of various debiasing interventions, including prompt-based and judge-based approaches. Our findings challenge the assumption that larger models yield fairer outputs, as mid-sized variants consistently outperform their larger counterparts, offering the best balance of fairness and efficiency. Prompt-based debiasing proves highly model dependent, while entity sentiment emerges as the most stubborn fairness dimension, resisting all intervention strategies tested. These results demonstrate that fairness in multi-document news summarisation requires multi-dimensional evaluation frameworks and targeted, architecture-aware debiasing rather than simply scaling up.

Chinese Translation

多文档新闻摘要系统因其在处理大量日常新闻内容方面的便利性而被越来越广泛地采用，因此在不同政治视角之间实现公平性至关重要。然而，这些系统可能通过对观点的不平等代表、对某些视角的不成比例强调以及对少数声音的系统性低估而表现出政治偏见。本研究使用FairNews数据集（包含带有政治取向标签的完整新闻文章）对多文档新闻摘要中的这种偏见进行了全面评估，考察了大型语言模型（LLMs）如何处理具有不同政治倾向的来源，涵盖了13个模型和五个公平性指标。我们调查了基线模型的性能以及各种去偏见干预措施的有效性，包括基于提示和基于评审的方法。我们的发现挑战了较大模型产生更公平输出的假设，因为中型变体在公平性和效率之间提供了最佳平衡，始终优于其较大的对应模型。基于提示的去偏见方法证明高度依赖于模型，而实体情感则成为最顽固的公平性维度，抵抗了所有测试的干预策略。这些结果表明，多文档新闻摘要中的公平性需要多维度评估框架和针对性、架构感知的去偏见方法，而不仅仅是简单地扩大规模。


cs.CL / 28 / 2604.21344

Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts

超越单一图表：多图表问答的基准测试

Efat, Azher Ahmed, Song, Seok Hwan, Tavanapong, Wallapak

Abstract

Charts are widely used to present complex information. Deriving meaningful insights in real-world contexts often requires interpreting multiple related charts together. However, understanding multi-chart images has not been extensively explored. We introduce PolyChartQA, a mid-scale dataset specifically designed for question answering over multi-chart images. PolyChartQA comprises 534 multi-chart images (with a total of 2,297 sub-charts) sourced from peer-reviewed computer science research publications and 2,694 QA pairs. We evaluate the performance of nine state-of-the-art Multimodal Language Models (MLMs) on PolyChartQA across question type, difficulty, question source, and key structural characteristics of multi-charts. Our results show a 27.4% LLM-based accuracy (L-Accuracy) drop on human-authored questions compared to MLM-generated questions, and a 5.39% L-Accuracy gain with our proposed prompting method.

Chinese Translation

图表广泛用于呈现复杂信息。在现实世界的背景下，提取有意义的见解通常需要对多个相关图表进行综合解读。然而，多图表图像的理解尚未得到充分探索。我们介绍了PolyChartQA，这是一个专门为多图表图像的问答设计的中型数据集。PolyChartQA包含534个多图表图像（总计2,297个子图表），这些图像来源于经过同行评审的计算机科学研究出版物，以及2,694个问答（QA）对。我们评估了九种最先进的多模态语言模型（MLMs）在PolyChartQA上的表现，分析了问题类型、难度、问题来源以及多图表的关键结构特征。我们的结果显示，与MLM生成的问题相比，人类撰写的问题的基于LLM的准确率（L-Accuracy）下降了27.4%，而采用我们提出的提示方法后，L-Accuracy提高了5.39%。


cs.CL / 29 / 2604.21352

CARE: Counselor-Aligned Response Engine for Online Mental-Health Support

CARE：面向在线心理健康支持的咨询师对齐响应引擎

Astrin, Hagai, Swaid, Ayal, Segal, Avi, Gal, Kobi

Abstract

Mental health challenges are increasing worldwide, straining emotional support services and leading to counselor overload. This can result in delayed responses during critical situations, such as suicidal ideation, where timely intervention is essential. While large language models (LLMs) have shown strong generative capabilities, their application in low-resource languages, especially in sensitive domains like mental health, remains underexplored. Furthermore, existing LLM-based agents often struggle to replicate the supportive language and intervention strategies used by professionals due to a lack of training on large-scale, real-world datasets. To address this, we propose CARE (Counselor-Aligned Response Engine), a GenAI framework that assists counselors by generating real-time, psychologically aligned response recommendations. CARE fine-tunes open-source LLMs separately for Hebrew and Arabic using curated subsets of real-world crisis conversations. The training data consists of sessions rated as highly effective by professional counselors, enabling the models to capture interaction patterns associated with successful de-escalation. By training on complete conversation histories, CARE maintains the evolving emotional context and dynamic structure of counselor-help-seeker dialogue. In experimental settings, CARE demonstrates stronger semantic and strategic alignment with gold-standard counselor responses compared to non-specialized LLMs. These findings suggest that domain-specific fine-tuning on expert-validated data can significantly support counselor workflows and improve care quality in low-resource language contexts.

Chinese Translation

心理健康问题在全球范围内日益增加，给情感支持服务带来了压力，并导致咨询师工作负担过重。这可能导致在关键情况下（例如自杀意念）响应延迟，而及时干预至关重要。尽管大型语言模型（LLMs）展现出强大的生成能力，但它们在低资源语言中的应用，尤其是在心理健康等敏感领域，仍然未得到充分探索。此外，现有基于LLM的代理通常难以复制专业人士使用的支持性语言和干预策略，因为缺乏对大规模、真实世界数据集的训练。为了解决这一问题，我们提出了CARE（Counselor-Aligned Response Engine），一个生成式人工智能框架，通过生成实时、心理对齐的响应建议来辅助咨询师。CARE分别针对希伯来语和阿拉伯语对开源LLM进行微调，使用经过策划的真实危机对话子集。训练数据由专业咨询师评定为高度有效的会话组成，使模型能够捕捉与成功缓解危机（de-escalation）相关的互动模式。通过对完整对话历史进行训练，CARE保持了咨询师与求助者对话中不断变化的情感背景和动态结构。在实验环境中，CARE在语义和策略上与黄金标准咨询师响应的对齐程度明显高于非专业化的LLM。这些发现表明，基于专家验证数据的领域特定微调可以显著支持咨询师的工作流程，并改善低资源语言环境中的护理质量。


cs.CL / 30 / 2604.21370

MKJ at SemEval-2026 Task 9: A Comparative Study of Generalist, Specialist, and Ensemble Strategies for Multilingual Polarization

MKJ在SemEval-2026任务9：通用型、专业型和集成策略在多语言极化检测中的比较研究

Jouneghani, Maziar Kianimoghadam

Abstract

We present a systematic study of multilingual polarization detection across 22 languages for SemEval-2026 Task 9 (Subtask 1), contrasting multilingual generalists with language-specific specialists and hybrid ensembles. While a standard generalist like XLM-RoBERTa suffices when its tokenizer aligns with the target text, it may struggle with distinct scripts (e.g., Khmer, Odia) where monolingual specialists yield significant gains. Rather than enforcing a single universal architecture, we adopt a language-adaptive framework that switches between multilingual generalists, language-specific specialists, and hybrid ensembles based on development performance. Additionally, cross-lingual augmentation via NLLB-200 yielded mixed results, often underperforming native architecture selection and degrading morphologically rich tracks. Our final system achieves an overall macro-averaged F1 score of 0.796 and an average accuracy of 0.826 across all 22 tracks. Code and final test predictions are publicly available at: https://github.com/Maziarkiani/SemEval2026-Task9-Subtask1-Polarization.

Chinese Translation

我们对SemEval-2026任务9（子任务1）中22种语言的多语言极化检测进行了系统研究，比较了多语言通用型模型与特定语言的专业型模型和混合集成模型。虽然像XLM-RoBERTa这样的标准通用型模型在其分词器与目标文本对齐时表现良好，但在某些具有独特书写系统的语言（如高棉语、奥里亚语）中，单语专业型模型能够带来显著的提升。我们并未强制采用单一的通用架构，而是采用了一种语言自适应框架，根据开发集性能在多语言通用型模型、特定语言专业型模型和混合集成模型之间切换。此外，通过NLLB-200进行的跨语言增强效果不一，往往表现不如本地架构选择，并且对形态丰富的语言轨道造成了负面影响。我们的最终系统在所有22个轨道上实现了0.796的宏平均F1分数和0.826的平均准确率。代码和最终测试预测可在以下网址公开获取：https://github.com/Maziarkiani/SemEval2026-Task9-Subtask1-Polarization。
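For readers unfamiliar with the headline metric, the following sketch shows how a macro-averaged F1 can be computed per language track and then averaged across tracks. Whether the reported 0.796 is aggregated exactly this way is an assumption; the track codes and labels below are toy values, and the function names are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def average_over_tracks(per_track):
    """Mean of per-track macro-F1; per_track maps track -> (y_true, y_pred)."""
    return sum(macro_f1(t, p) for t, p in per_track.values()) / len(per_track)


if __name__ == "__main__":
    tracks = {
        "khm": ([1, 1, 0, 0], [1, 0, 0, 0]),  # toy labels, not real data
        "ori": ([1, 0], [1, 0]),
    }
    print(average_over_tracks(tracks))
```

Macro averaging gives the minority class the same weight as the majority class, which matters when polarized examples are rare in a track.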


cs.CL / 31 / 2604.21375

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

VLAA-GUI：知道何时停止、恢复和搜索的模块化GUI自动化框架

Han, Qijun, Tu, Haoqin, Wang, Zijun, Dai, Haoyue, Zhou, Yiyang, Lau, Nancy, Cardenas, Alvaro A., Xu, Yuhui, Xu, Ran, Xiong, Caiming, Zheng, Zeyu, Yao, Huaxiu, Zhou, Yuyin, Xie, Cihang

Abstract

Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step -- with an agent-level verifier that cross-examines completion claims with decision rules, rejecting those lacking direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: switching interaction mode after repeated failures, forcing strategy changes after persistent screen-state recurrence, and binding reflection signals to strategy shifts. Third, an on-demand Search Agent searches online for unfamiliar workflows by directly querying a capable LLM with search ability, returning results as plain text. We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand when required. We evaluate VLAA-GUI across five top-tier backbones, including Opus 4.5, 4.6 and Gemini 3.1 Pro, on two benchmarks with Linux and Windows tasks, achieving top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis also shows that the Loop Breaker nearly halves wasted steps for loop-prone models.

Chinese Translation

自主GUI代理面临两个基本挑战：早期停止，即代理在没有可验证证据的情况下过早宣告成功；以及重复循环，即代理在没有恢复的情况下反复执行相同的失败操作。我们提出了VLAA-GUI，一个围绕三个集成组件构建的模块化GUI代理框架，指导系统何时停止、恢复和搜索。首先，一个强制性的完整性验证器在每个完成步骤强制执行可观察的UI成功标准和验证，其中代理级验证器依据决策规则交叉检查完成声明，拒绝缺乏直接视觉证据的声明。其次，一个强制性的循环打破器提供多层过滤：在重复失败后切换交互模式，在持续屏幕状态重复后强制策略变化，并将反思信号绑定到策略转变。第三，一个按需搜索代理通过直接查询具有搜索能力的LLM来在线搜索不熟悉的工作流程，并以纯文本形式返回结果。此外，我们集成了一个用于代码密集型操作的编码代理和一个用于精确操作定位的定位代理（Grounding Agent），这两个代理在需要时按需调用。我们在包括Opus 4.5、4.6和Gemini 3.1 Pro在内的五个顶级骨干模型上评估了VLAA-GUI，在包含Linux和Windows任务的两个基准测试上均取得了最佳性能（OSWorld上为77.5%，WindowsAgentArena上为61.0%）。值得注意的是，五个骨干模型中有三个在OSWorld上单次运行即超越了人类表现（72.4%）。消融研究表明，所有三个提出的组件持续改善了强骨干模型，而较弱的骨干模型在步骤预算充足时更能从这些工具中受益。进一步分析还显示，循环打破器几乎将易陷入循环的模型的浪费步骤减少了一半。
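The tiered Loop Breaker described in the abstract can be pictured as a small monitor over the agent's recent history. The sketch below is an illustration under stated assumptions, not the VLAA-GUI implementation: the class name, thresholds, signal strings, and the use of a screen hash are all hypothetical, and the third tier (binding reflection signals to strategy shifts) is not modelled.

```python
from collections import Counter


class LoopBreaker:
    """Toy two-tier loop detector for a GUI agent (hypothetical design)."""

    def __init__(self, action_limit=3, state_limit=4):
        self.action_limit = action_limit  # failures of one action before tier 1
        self.state_limit = state_limit    # visits to one screen before tier 2
        self.failed_actions = Counter()
        self.state_visits = Counter()

    def observe(self, screen_hash, action, succeeded):
        """Record one step and return the recommended intervention."""
        self.state_visits[screen_hash] += 1
        if not succeeded:
            self.failed_actions[action] += 1
        # Tier 2: the same screen state keeps recurring, so force a new strategy.
        if self.state_visits[screen_hash] >= self.state_limit:
            return "change_strategy"
        # Tier 1: the same action keeps failing, so switch interaction mode.
        if self.failed_actions[action] >= self.action_limit:
            return "switch_interaction_mode"
        return "continue"


if __name__ == "__main__":
    breaker = LoopBreaker()
    for step in range(3):
        print(breaker.observe(f"screen_{step}", "click_ok", succeeded=False))
```

Checking screen-state recurrence before action failures means the stronger intervention wins when both conditions hold at once; that ordering is a design choice of this sketch.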


cs.CL / 32 / 2604.21428

Decoupled DiLoCo for Resilient Distributed Pre-training

用于弹性分布式预训练的解耦式 DiLoCo

Douillard, Arthur, Rush, Keith, Donchev, Yani, Charles, Zachary, Fallen, Nova, Dubey, Ayush, Gog, Ionel, Dean, Josef, Woodworth, Blake, Garrett, Zachary, Keating, Nate, Bishop, Jenny, Prior, Henry, Yvinec, Edouard, Szlam, Arthur, Ranzato, Marc'Aurelio, Dean, Jeff

Abstract

Modern large-scale language model pre-training relies heavily on the single program multiple data (SPMD) paradigm, which requires tight coupling across accelerators. Due to this coupling, transient slowdowns, hardware failures, and synchronization overhead stall the entire computation, wasting significant compute time at scale. While recent distributed methods like DiLoCo reduced communication bandwidth, they remained fundamentally synchronous and vulnerable to these system stalls. To address this, we introduce Decoupled DiLoCo, an evolution of the DiLoCo framework designed to break the lock-step synchronization barrier and go beyond SPMD to maximize training goodput. Decoupled DiLoCo partitions compute across multiple independent "learners" that execute local inner optimization steps. These learners asynchronously communicate parameter fragments to a central synchronizer, which circumvents failed or straggling learners by aggregating updates using a minimum quorum, an adaptive grace window, and dynamic token-weighted merging. Inspired by "chaos engineering", we achieve significantly improved training efficiency in failure-prone environments with millions of simulated chips with strictly zero global downtime, while maintaining competitive model performance across text and vision tasks, for both dense and mixture-of-expert architectures.

Chinese Translation

现代大规模语言模型的预训练在很大程度上依赖于单程序多数据（SPMD）范式，这要求加速器之间紧密耦合。由于这种耦合，瞬时的减速、硬件故障和同步开销会导致整个计算停滞，从而在大规模计算中浪费了大量的计算时间。尽管最近的分布式方法如 DiLoCo 减少了通信带宽，但它们仍然在根本上是同步的，易受到这些系统停滞的影响。为了解决这个问题，我们提出了解耦式 DiLoCo，这是 DiLoCo 框架的一种演变，旨在打破锁步同步的障碍，超越 SPMD，以最大化训练的有效吞吐量。解耦式 DiLoCo 将计算分配到多个独立的“学习者”上，这些学习者执行本地的内部优化步骤。这些学习者异步地将参数片段传递给中央同步器，该同步器通过使用最小法定人数、自适应宽限窗口和动态的按 token 数加权合并来规避失败或滞后的学习者。受到“混沌工程”的启发，我们在模拟数百万芯片的故障频发环境中实现了显著提高的训练效率，并且严格做到全局零停机，同时在文本和视觉任务中保持有竞争力的模型性能，适用于密集和混合专家架构。
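The synchronizer's aggregation rule can be sketched in a few lines. This is a simplified illustration, not the Decoupled DiLoCo implementation: the grace window is fixed rather than adaptive, whole update vectors stand in for parameter fragments, and the `LearnerUpdate` type, field names, and `merge_with_quorum` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LearnerUpdate:
    learner_id: str
    delta: list       # parameter update reported by the learner
    tokens: int       # tokens the learner processed for this update
    arrival_s: float  # arrival time at the synchronizer, in seconds


def merge_with_quorum(updates, min_quorum, grace_window_s):
    """Token-weighted average of updates arriving within the grace window.

    Returns None when fewer than min_quorum learners reported in time,
    in which case the caller would keep waiting instead of merging.
    """
    if not updates:
        return None
    first = min(u.arrival_s for u in updates)
    on_time = [u for u in updates if u.arrival_s - first <= grace_window_s]
    if len(on_time) < min_quorum:
        return None
    total_tokens = sum(u.tokens for u in on_time)
    dim = len(on_time[0].delta)
    # Stragglers are skipped; remaining learners count in proportion to tokens.
    return [sum(u.tokens * u.delta[k] for u in on_time) / total_tokens
            for k in range(dim)]


if __name__ == "__main__":
    updates = [
        LearnerUpdate("a", [1.0, 0.0], tokens=100, arrival_s=0.0),
        LearnerUpdate("b", [0.0, 1.0], tokens=300, arrival_s=1.0),
        LearnerUpdate("c", [10.0, 10.0], tokens=100, arrival_s=50.0),  # straggler
    ]
    print(merge_with_quorum(updates, min_quorum=2, grace_window_s=5.0))
```

In this toy run the straggler is ignored and the merge proceeds with two learners, which is the behaviour that lets the rest of the system avoid stalling on a single slow or failed participant.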


cs.CL / 33 / 2604.21454

Reasoning Primitives in Hybrid and Non-Hybrid LLMs

混合与非混合大型语言模型中的推理原语

Rawat, Shivam, Flek, Lucie, Mai, Florian, Corrêa, Nicholas Kluge

Abstract

Reasoning in large language models is often treated as a monolithic capability, but its observed gains may arise from more basic operations. We study reasoning through two such primitives, recall and state-tracking, and ask whether hybrid architectures that combine attention-based retrieval with recurrent state updates are better suited than attention-only models for tasks that jointly require both. Using matched Olmo3 transformer and hybrid models in instruction-tuned and reasoning-augmented variants, we evaluate these models on a set of controlled tasks involving a mixture of state-tracking and recall primitives (state-based recall). Across tasks, we notice that reasoning augmentation provides the largest overall improvement, substantially extending the range of difficulty over which models remain effective. We also notice that in certain tasks, the hybrid reasoning model remains substantially more robust as sequential dependence increases. In contrast, the transformer reasoning model degrades sharply in performance as task difficulty increases beyond a given threshold. These results suggest that reasoning tokens and architectural inductive biases contribute at different levels of the computational process: explicit reasoning can expand a model's effective operating range, but its benefit depends on how well the underlying architecture supports persistent state propagation. Given the small size of our case study, which involves a limited set of models and tasks, we present these findings as suggestive rather than conclusive and leave broader validation across model families, scales, and task variations to future work.

Chinese Translation

在大型语言模型中，推理通常被视为一种整体能力，但其观察到的提升可能源于更基本的操作。我们通过两个这样的原语——回忆（recall）和状态跟踪（state-tracking）——来研究推理，并询问结合基于注意力的检索与递归状态更新的混合架构是否比仅基于注意力的模型更适合同时需要这两者的任务。我们使用匹配的 Olmo3 变换器和混合模型，在指令调优和推理增强的变体中，对这些模型进行评估，任务涉及状态跟踪和回忆原语的混合，即基于状态的回忆。在各项任务中，我们注意到推理增强提供了最大的整体改进，显著扩展了模型保持有效的难度范围。我们还发现，在某些任务中，随着序列依赖性的增加，混合推理模型的鲁棒性显著更强。相比之下，当任务难度超过某一阈值时，变换器推理模型的性能急剧下降。这些结果表明，推理标记和架构归纳偏差在计算过程的不同层面上发挥作用：显式推理可以扩展模型的有效操作范围，但其益处取决于底层架构支持持久状态传播的程度。鉴于我们的案例研究规模较小，仅涉及有限的模型和任务，我们将这些发现视为提示而非结论，并将更广泛的验证留待未来在模型家族、规模和任务变体中的研究。


cs.CL / 34 / 2604.21469

Cross-Domain Data Selection and Augmentation for Automatic Compliance Detection

跨域数据选择与增强用于自动合规检测

Ikhwantri, Fariz, Marijan, Dusica

Abstract

Automating the detection of regulatory compliance remains a challenging task due to the complexity and variability of legal texts. Models trained on one regulation often fail to generalise to others. This limitation underscores the need for principled methods to improve cross-domain transfer. We study data selection as a strategy to mitigate negative transfer in compliance detection framed as a natural language inference (NLI) task. Specifically, we evaluate four approaches for selecting augmentation data from a larger source domain: random sampling, Moore-Lewis cross-entropy difference, importance weighting, and embedding-based retrieval. We systematically vary the proportion of selected data to analyse its effect on cross-domain adaptation. Our findings demonstrate that targeted data selection substantially reduces negative transfer, offering a practical path toward scalable and reliable compliance automation across heterogeneous regulations.

Chinese Translation

由于法律文本的复杂性和多样性，自动化检测监管合规性仍然是一项具有挑战性的任务。在一种法规上训练的模型往往无法推广到其他法规。这一局限性突显了改善跨域迁移的原则性方法的必要性。我们将数据选择作为一种策略，研究如何在将合规检测框架视为自然语言推理（NLI）任务时减轻负迁移的影响。具体而言，我们评估了从更大源域中选择增强数据的四种方法：随机抽样、Moore-Lewis的交叉熵差异、重要性加权和基于嵌入的检索。我们系统地改变所选数据的比例，以分析其对跨域适应的影响。我们的研究结果表明，针对性的数据选择显著减少了负迁移，为在异构法规中实现可扩展和可靠的合规自动化提供了一条切实可行的路径。
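
Of the four selection strategies named in the abstract above, the Moore-Lewis cross-entropy difference is the easiest to illustrate in isolation. The sketch below is a toy version, assuming add-one smoothed unigram language models and whitespace tokenisation; the function names and the example sentences are hypothetical and are not taken from the paper, which may use different language models and scoring details.

```python
import math
from collections import Counter

def unigram_lm(sentences, vocab):
    """Add-one smoothed unigram model: maps token -> log-probability."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values()) + len(vocab)
    return {tok: math.log((counts[tok] + 1) / total) for tok in vocab}

def cross_entropy(sentence, lm):
    """Per-token cross-entropy (in nats) of a sentence under a unigram model."""
    toks = sentence.split()
    return -sum(lm[t] for t in toks) / len(toks)

def moore_lewis_select(target, source, k):
    """Return the k source sentences with the lowest H_target(s) - H_source(s)."""
    vocab = {tok for s in target + source for tok in s.split()}
    lm_t = unigram_lm(target, vocab)   # in-domain (target regulation) model
    lm_s = unigram_lm(source, vocab)   # general source-domain model
    ranked = sorted(
        source,
        key=lambda s: cross_entropy(s, lm_t) - cross_entropy(s, lm_s),
    )
    return ranked[:k]

# Hypothetical toy data: two target-regulation sentences, three source candidates.
target = [
    "the controller shall erase personal data",
    "the processor shall notify the authority",
]
source = [
    "the supplier shall notify the controller",
    "cats sleep on warm sofas",
    "dogs chase red balls",
]
selected = moore_lewis_select(target, source, k=1)
print(selected)
```

A lower score means the sentence looks more like the target domain than like the source pool in general, so the regulation-like sentence is ranked first here.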

cs.CL / 35 / 2604.21481

Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

以语音为首的国家的偏好：印度语言文本到语音的大规模成对评估与偏好分析

Anand, Srija, Sankar, Ashwin, Sethi, Ishvinder, Pareek, Aaditya, Rajput, Kartik, Yadav, Gaurav, Narasimhan, Nikhil, Pandya, Adish, Halder, Deepon, Khan, Mohammed Safi Ur Rahman, S V, Praveen, Banga, Shobhit, Khapra, Mitesh M

Abstract

Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text-to-Speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preference using SHAP analysis, and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.

Chinese Translation

众包成对评估已成为评估基础模型的一种可扩展方法。然而，将其应用于文本到语音（TTS）时，由于语言多样性和语音感知的多维特性，导致了较高的方差。我们提出了一种控制的多维成对评估框架，用于多语言TTS，该框架结合了语言控制与感知基础的注释。使用超过5000个本地和代码混合的句子，涵盖10种印度语言，我们评估了7个最先进的TTS系统，并从超过1900名本地评审者那里收集了超过120,000个成对比较。除了整体偏好外，评审者还在6个感知维度上提供判断：可懂度、表现力、音质、活力、噪声和幻觉。通过Bradley-Terry建模，我们构建了一个多语言排行榜，使用SHAP分析解释人类偏好，并分析排行榜的可靠性以及模型在感知维度上的优势与权衡。
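
The abstract above builds its leaderboard with Bradley-Terry modeling. As a rough illustration of how pairwise preference counts turn into a ranking, here is a minimal sketch using the classic minorisation-maximisation (Zermelo) update. The system names and win counts are invented for the example, and the paper's actual fitting procedure (ties, per-language models, regularisation) is not specified in the abstract.

```python
def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths with the classic MM (Zermelo) update.

    wins[(a, b)] is the number of comparisons in which a was preferred over b.
    Assumes every system has at least one win and one comparison.
    """
    items = sorted({x for pair in wins for x in pair})
    strength = {i: 1.0 for i in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins / denom
        norm = sum(updated.values())  # rescale so strengths sum to 1
        strength = {i: s / norm for i, s in updated.items()}
    return strength

# Hypothetical comparison counts for three TTS systems.
wins = {
    ("A", "B"): 30, ("B", "A"): 10,
    ("B", "C"): 25, ("C", "B"): 15,
    ("A", "C"): 35, ("C", "A"): 5,
}
strength = bradley_terry(wins)
leaderboard = sorted(strength, key=strength.get, reverse=True)
print(leaderboard)
```

Under this model the probability that system i is preferred over system j is strength[i] / (strength[i] + strength[j]), which is what makes the fitted values comparable across pairs that were never rated head to head.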

cs.CL / 36 / 2604.21510

OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

OptiVerse：面向优化问题解决的综合基准

Zhang, Xinyu, Zhang, Boxuan, Wan, Yuchen, Zhang, Lingling, Yao, YiXing, Wei, Bifan, Wu, Yaqiang, Liu, Jun

Abstract

While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation. To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 curated problems spanning neglected domains, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, across three difficulty levels: Easy, Medium, and Hard. The experiments with 22 LLMs of different sizes reveal sharp performance degradation on hard problems, where even advanced models like GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy. Through error analysis, we identify that modeling & logic errors remain the primary bottleneck. Consequently, we propose a Dual-View Auditor Agent that improves the accuracy of the LLM modeling process without introducing significant time overhead. OptiVerse will serve as a foundational platform for advancing LLMs in solving complex optimization challenges.

Chinese Translation

尽管大型语言模型（LLMs）展现出卓越的推理能力，但复杂的优化任务仍然具有挑战性，要求具备领域知识和稳健的实现。然而，现有的基准测试过于集中于数学规划和组合优化，限制了全面评估的可能性。为了解决这一问题，我们推出了OptiVerse，这是一个涵盖1,000个精心挑选问题的综合基准，涉及被忽视的领域，包括随机优化、动态优化、博弈优化和最优控制，分为三个难度级别：简单、中等和困难。对22个不同规模的LLM进行的实验显示，在困难问题上性能急剧下降，即使是像GPT-5.2和Gemini-3这样的先进模型也难以超过27%的准确率。通过错误分析，我们发现建模和逻辑错误仍然是主要瓶颈。因此，我们提出了一种双视角审计代理（Dual-View Auditor Agent），它在不引入显著时间开销的情况下，提高了LLM建模过程的准确性。OptiVerse将作为推动LLM解决复杂优化挑战的基础平台。

cs.CL / 37 / 2604.21525

Job Skill Extraction via LLM-Centric Multi-Module Framework

基于LLM中心的多模块框架进行职位技能提取

Li, Guojing, Fu, Zichuan, Li, Junyi, Liu, Faxue, Zhou, Wenxia, Wang, Yejing, Gao, Jingtong, Wang, Maolin, Liu, Rungen, Zhang, Wenlin, Zhao, Xiangyu

Abstract

Span-level skill extraction from job advertisements underpins candidate-job matching and labor-market analytics, yet generative large language models (LLMs) often yield malformed spans, boundary drift, and hallucinations, especially with long-tail terms and cross-domain shift. We present SRICL, an LLM-centric framework that combines semantic retrieval (SR), in-context learning (ICL), and supervised fine-tuning (SFT) with a deterministic verifier. SR pulls in-domain annotated sentences and definitions from ESCO to form format-constrained prompts that stabilize boundaries and handle coordination. SFT aligns output behavior, while the verifier enforces pairing, non-overlap, and BIO legality with minimal retries. On six public span-labeled corpora of job-ad sentences across sectors and languages, SRICL achieves substantial STRICT-F1 improvements over GPT-3.5 prompting baselines and sharply reduces invalid tags and hallucinated spans, enabling dependable sentence-level deployment in low-resource, multi-domain settings.

Chinese Translation

从职位广告中进行跨度级技能提取是候选人与职位匹配及劳动市场分析的基础，但生成式大型语言模型（LLMs）常常产生畸形的跨度、边界漂移和幻觉，尤其是在长尾术语和跨领域转移的情况下。我们提出了SRICL，一个以LLM为中心的框架，结合了语义检索（SR）、上下文学习（ICL）和监督微调（SFT），并配备了一个确定性验证器。SR从ESCO中提取领域内的标注句子和定义，以形成格式受限的提示，从而稳定边界并处理协调。SFT对输出行为进行对齐，而验证器以最小的重试次数强制执行配对、非重叠和BIO合法性。在六个跨行业和语言的公共跨度标注语料库上，SRICL在严格F1指标上相较于GPT-3.5的提示基线取得了显著提升，并显著减少了无效标签和幻觉跨度，使得在低资源、多领域环境中的句子级部署变得可靠。
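
The abstract above mentions a deterministic verifier that enforces, among other things, BIO legality. A minimal sketch of such a check might look as follows; the function name and the SKILL and TOOL label types are hypothetical, and the pairing and non-overlap checks that the abstract also lists are not covered here.

```python
def bio_violations(tags):
    """Return the indices at which a BIO tag sequence is illegal.

    A tag is flagged if it is malformed, or if an I-X tag does not continue
    a directly preceding B-X or I-X tag of the same type X.
    """
    bad = []
    prev = "O"
    for i, tag in enumerate(tags):
        if tag == "O":
            pass
        elif tag.startswith("B-") and len(tag) > 2:
            pass
        elif tag.startswith("I-") and len(tag) > 2:
            continues_span = prev[:2] in ("B-", "I-") and prev[2:] == tag[2:]
            if not continues_span:
                bad.append(i)
        else:
            bad.append(i)  # malformed tag such as "B-" or "X-SKILL"
        prev = tag
    return bad

print(bio_violations(["B-SKILL", "I-SKILL", "O", "I-SKILL"]))
print(bio_violations(["O", "B-SKILL", "I-TOOL"]))
```

A caller could re-prompt the model only when the returned list is non-empty, which is one way to keep retries rare.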

cs.CL / 38 / 2604.21534

UKP_Psycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text

UKP_Psycontrol在SemEval-2026任务2中的表现：从文本中建模情感价和唤醒动态

Hryhoryeva, Darya, Zurinaga, Amaia, Jamalabadi, Hamidreza, Gurevych, Iryna

Abstract

This paper presents our system developed for SemEval-2026 Task 2. The task requires modeling both current affect and short-term affective change in chronologically ordered user-generated texts. We explore three complementary approaches: (1) LLM prompting under user-aware and user-agnostic settings, (2) a pairwise Maximum Entropy (MaxEnt) model with Ising-style interactions for structured transition modeling, and (3) a lightweight neural regression model incorporating recent affective trajectories and trainable user embeddings. Our findings indicate that LLMs effectively capture static affective signals from text, whereas short-term affective variation in this dataset is more strongly explained by recent numeric state trajectories than by textual semantics. Our system ranked first among participating teams in both Subtask 1 and Subtask 2A based on the official evaluation metric.

Chinese Translation

本文介绍了我们为SemEval-2026任务2开发的系统。该任务要求对按时间顺序排列的用户生成文本中的当前情感和短期情感变化进行建模。我们探索了三种互补的方法：（1）在用户感知和用户无关设置下的LLM提示，（2）具有伊辛风格交互的成对最大熵（MaxEnt）模型，用于结构化转变建模，以及（3）结合近期情感轨迹和可训练用户嵌入的轻量级神经回归模型。我们的研究结果表明，LLM能够有效捕捉文本中的静态情感信号，而该数据集中短期情感变化更强烈地受到近期数值状态轨迹的解释，而非文本语义。根据官方评估指标，我们的系统在参与团队中在子任务1和子任务2A中均排名第一。

cs.CL / 39 / 2604.21555

Finding Meaning in Embeddings: Concept Separation Curves

在嵌入中寻找意义：概念分离曲线

Keuren, Paul, Ponsen, Marc, Bagheri, Robert Ayoub

Abstract

Sentence embedding techniques aim to encode key concepts of a sentence's meaning in a vector space. However, the majority of evaluation approaches for sentence embedding quality rely on the use of additional classifiers or downstream tasks. These additional components make it unclear whether good results stem from the embedding itself or from the classifier's behaviour. In this paper, we propose a novel method for evaluating the effectiveness of sentence embedding methods in capturing sentence-level concepts. Our approach is classifier-independent, allowing for an objective assessment of the model's performance. The approach adopted in this study involves the systematic introduction of syntactic noise and semantic negations into sentences, with the subsequent quantification of their relative effects on the resulting embeddings. The visualisation of these effects is facilitated by Concept Separation Curves, which show the model's capacity to differentiate between conceptual and surface-level variations. By leveraging data from multiple domains, employing both Dutch and English languages, and examining sentence lengths, this study offers a compelling demonstration that Concept Separation Curves provide an interpretable, reproducible, and cross-model approach for evaluating the conceptual stability of sentence embeddings.

Chinese Translation

句子嵌入技术旨在将句子意义的关键概念编码到向量空间中。然而，大多数句子嵌入质量的评估方法依赖于使用额外的分类器或下游任务。这些额外组件使得不清楚良好的结果是源于嵌入本身还是分类器的行为。本文提出了一种新颖的方法，用于评估句子嵌入方法在捕捉句子级概念方面的有效性。我们的方法不依赖于分类器，允许对模型性能进行客观评估。本研究采用的方法涉及系统性地在句子中引入句法噪声和语义否定，并随后量化它们对生成嵌入的相对影响。这些影响的可视化通过概念分离曲线（Concept Separation Curves）得以实现，展示了模型区分概念变化与表面变化的能力。通过利用来自多个领域的数据，采用荷兰语和英语两种语言，并考察句子长度，本研究提供了一个有力的证明，表明概念分离曲线提供了一种可解释、可重复和跨模型的方法，用于评估句子嵌入的概念稳定性。

cs.CL / 40 / 2604.21564

Measuring Opinion Bias and Sycophancy via LLM-based Coercion

通过基于大型语言模型的强迫测量意见偏见和谄媚行为

Nogueira, Rodrigo, Bonás, Giovana Kerche, Almeida, Thales Sales, Roque, Andrea, Pires, Ramon, Abonizio, Hugo, Laitz, Thiago, Larcher, Celio, Junior, Roseval Malaquias, Piau, Marcos

Abstract

Large language models increasingly shape the information people consume: they are embedded in search, consulted for professional advice, deployed as agents, and used as a first stop for questions about policy, ethics, health, and politics. When such a model silently holds a position on a contested topic, that position propagates at scale into users' decisions. Eliciting a model's positions is harder than it first appears: contemporary assistants answer direct opinion questions with evasive disclaimers, and the same model may concede the opposite position once the user starts arguing one side. We propose a method, released as the open-source llm-bias-bench, for discovering the opinions an LLM actually holds on contested topics under conditions that resemble real multi-turn interaction. The method pairs two complementary free-form probes. Direct probing asks for the model's opinion across five turns of escalating pressure from a simulated user. Indirect probing never asks for an opinion and engages the model in argumentative debate, letting bias leak through how it concedes, resists, or counter-argues. Three user personas (neutral, agree, disagree) collapse into a nine-way behavioral classification that separates persona-independent positions from persona-dependent sycophancy, and an auditable LLM judge produces verdicts with textual evidence. The first instantiation ships 38 topics in Brazilian Portuguese across values, scientific consensus, philosophy, and economic policy. Applied to 13 assistants, the method surfaces findings of practical interest: argumentative debate triggers sycophancy 2-3x more than direct questioning (median 50% to 79%); models that look opinionated under direct questioning often collapse into mirroring under sustained arguments; and attacker capability matters mainly when an existing opinion must be dislodged, not when the assistant starts neutral.

Chinese Translation

大型语言模型日益影响人们获取信息的方式：它们嵌入搜索引擎中，被咨询以获取专业建议，作为代理人被部署，并作为关于政策、伦理、健康和政治问题的首要咨询来源。当这样的模型在一个有争议的话题上默默持有某种立场时，该立场会大规模地影响用户的决策。引导模型表达其立场比最初看起来要困难：现代助手在直接的意见问题上常常以模糊的免责声明作答，而同一模型在用户开始争论某一方时可能会承认相反的立场。我们提出了一种方法，作为开源项目 llm-bias-bench 发布，用于发现大型语言模型在有争议话题上的真实意见，条件类似于真实的多轮互动。该方法结合了两种互补的自由形式探测。直接探测要求模型在模拟用户逐步施加压力的五轮中表达意见。间接探测则不询问意见，而是通过辩论与模型互动，让偏见通过其让步、抵抗或反驳的方式显露出来。三种用户角色（中立、同意、不同意）合并为九种行为分类，区分角色独立的立场与角色依赖的谄媚行为，并且一个可审计的 LLM 法官根据文本证据作出裁决。首个实例涵盖了38个主题，涉及价值观、科学共识、哲学和经济政策，使用巴西葡萄牙语。应用于13个助手，该方法揭示了具有实际意义的发现：辩论比直接提问更能引发谄媚行为，比例为2-3倍（中位数从50%上升至79%）；在直接提问下看似有意见的模型在持续的争论中往往会变得镜像；而攻击者的能力主要在需要移除现有意见时才显得重要，而不是在助手保持中立时。

cs.CL / 41 / 2604.21590

AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use

AgenticQwen：利用双数据飞轮训练小型代理语言模型以实现工业级工具使用

Lyu, Yuanjie, Wang, Chengyu, Zheng, Haonan, Yue, Yuanhao, Yan, Junbing, Wang, Ming, Huang, Jun

Abstract

Modern industrial applications increasingly demand language models that act as agents, capable of multi-step reasoning and tool use in real-world settings. These tasks are typically performed under strict cost and latency constraints, making small agentic models highly desirable. In this paper, we introduce the AgenticQwen family of models, trained via multi-round reinforcement learning (RL) on synthetic data and a limited amount of open-source data. Our training framework combines reasoning RL and agentic RL with dual data flywheels that automatically generate increasingly challenging tasks. The reasoning flywheel increases task difficulty by learning from errors, while the agentic flywheel expands linear workflows into multi-branch behavior trees that better reflect the decision complexity of real-world applications. We validate AgenticQwen on public benchmarks and in an industrial agent system. The models achieve strong performance on multiple agentic benchmarks, and in our industrial agent system, close the gap with much larger models on search and data analysis tasks. Model checkpoints and part of the synthetic data: https://huggingface.co/collections/alibaba-pai/agenticqwen. Data synthesis and RL training code: https://github.com/haruhi-sudo/data_synth_and_rl. The data synthesis pipeline is also integrated into EasyDistill: https://github.com/modelscope/easydistill.

Chinese Translation

现代工业应用日益需要能够在现实环境中进行多步骤推理和工具使用的代理语言模型。这些任务通常在严格的成本和延迟限制下进行，因此小型代理模型备受青睐。本文介绍了AgenticQwen模型系列，该系列模型通过在合成数据和有限的开源数据上进行多轮强化学习（RL）进行训练。我们的训练框架结合了推理强化学习和代理强化学习，采用双数据飞轮自动生成日益具有挑战性的任务。推理飞轮通过从错误中学习来增加任务难度，而代理飞轮则将线性工作流扩展为多分支行为树，更好地反映现实应用中的决策复杂性。我们在公共基准测试和工业代理系统中验证了AgenticQwen。模型在多个代理基准测试中表现出色，并且在我们的工业代理系统中，在搜索和数据分析任务上与更大模型的差距缩小。模型检查点和部分合成数据可访问：https://huggingface.co/collections/alibaba-pai/agenticqwen。数据合成和强化学习训练代码：https://github.com/haruhi-sudo/data_synth_and_rl。数据合成管道也集成到EasyDistill中：https://github.com/modelscope/easydistill。

cs.CL / 42 / 2604.21593

Language as a Latent Variable for Reasoning Optimization

语言作为推理优化的潜变量

Wu, Linjuan, Wei, Haoran, Tang, Jialong, Luo, Shuang, Yang, Baosong, Shen, Yongliang, Lu, Weiming

Abstract

As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the model's internal inference pathways, rather than merely serving as an output medium. To test this, we conducted a Polyglot Thinking Experiment, in which models were prompted to solve identical problems under language-constrained and language-unconstrained conditions. Results show that non-English responses often achieve higher accuracy, and the best performance frequently occurs when language is unconstrained, suggesting that multilinguality broadens the model's latent reasoning space. Based on this insight, we propose polyGRPO (Polyglot Group Relative Policy Optimization), an RL framework that treats language variation as an implicit exploration signal. It generates polyglot preference data online under language-constrained and unconstrained conditions, optimizing the policy with respect to both answer accuracy and reasoning structure. Trained on only 18.1K multilingual math problems without chain-of-thought annotations, polyGRPO improves the base model (Qwen2.5-7B-Instruct) by 6.72% absolute accuracy on four English reasoning test sets and by 6.89% on their multilingual counterparts. Remarkably, it is the only method that surpasses the base LLM on an English commonsense reasoning task (4.9%), despite being trained solely on math data, highlighting its strong cross-task generalization. Further analysis reveals that treating language as a latent variable expands the model's latent reasoning space, yielding consistent and generalizable improvements in reasoning performance.

Chinese Translation

随着大型语言模型（LLMs）减少以英语为中心的偏见，一个令人惊讶的趋势出现了：非英语回应在推理任务中的表现有时优于英语。我们假设语言作为潜变量在结构上调节模型的内部推理路径，而不仅仅作为输出媒介。为了验证这一假设，我们进行了多语言思维实验（Polyglot Thinking Experiment），在该实验中，模型在语言受限和不受限的条件下被提示解决相同的问题。结果表明，非英语回应通常能够实现更高的准确性，且最佳表现往往发生在语言不受限的情况下，这表明多语言性拓宽了模型的潜在推理空间。基于这一见解，我们提出了polyGRPO（多语言组相对策略优化），这是一个将语言变异视为隐式探索信号的强化学习框架。它在语言受限和不受限的条件下在线生成多语言偏好数据，并在回答准确性和推理结构方面优化策略。polyGRPO仅在18.1K多语言数学问题上进行训练，且没有链式思维注释，便在四个英语推理测试集上提高了基础模型（Qwen2.5-7B-Instruct）6.72%的绝对准确性，并在其多语言基准上提高了6.89%。值得注意的是，它是唯一在英语常识推理任务上超越基础LLM的方法（4.9%），尽管仅在数学数据上进行训练，这突显了其强大的跨任务泛化能力。进一步的分析表明，将语言视为潜变量可以扩展模型的潜在推理空间，从而在推理性能上实现一致且可泛化的提升。
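
The "Group Relative" part of polyGRPO in the abstract above refers to scoring each sampled response against the other responses drawn for the same problem. The sketch below shows only that generic normalisation step, under the assumption of a binary correctness reward and population standard deviation; the language tags and rewards are invented, and the paper's actual reward (which also covers reasoning structure) is not reproduced.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of responses to the same problem."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group: one response per language condition, reward 1.0 if correct.
samples = [("en", 0.0), ("zh", 1.0), ("es", 1.0), ("unconstrained", 1.0)]
advantages = group_relative_advantages([r for _, r in samples])
for (lang, _), adv in zip(samples, advantages):
    print(f"{lang}: {adv:+.3f}")
```

Responses that beat the group average receive a positive advantage regardless of the language they were written in, which is how language variation can act as an exploration signal without a separate critic model.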

cs.CL / 43 / 2604.21611

Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

通过语言批评进行过程监督改善大型语言模型的推理能力

Chen, Hao-Yuan

Abstract

Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results. First, on GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% at R=4, surpassing the 94.1% state of the art without gradient updates. Second, on AIME 2025, VPS enables strong weak-actor rescue, boosting scores from 11.7-26.7% to 63.3-90.0% (up to +63.3 points). Third, at matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), isolating critique granularity as the key driver. Performance scales with the supervisor-actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible (e.g., code synthesis), motivating hybrid verbal-executable methods. These results establish critique granularity as a new axis of inference-time scaling.

Chinese Translation

大型语言模型（LLM）推理的推理时间扩展主要集中在三个方面：链深度、样本广度和学习的步骤评分器（PRMs）。我们引入了第四个维度，即外部语言监督的细粒度，通过语言过程监督（Verbal Process Supervision, VPS），这是一种无训练的框架，利用来自更强监督者的结构化自然语言批评来指导迭代的生成-批评-改进循环，直到达到预算轮次 R。在 GPQA Diamond、AIME 2025 和 LiveCodeBench V6（涵盖闭合和开放模型）中，VPS 取得了三个关键结果。首先，在 GPQA Diamond 上，GPT-5.4（高）| GPT-5.4（低）在 R=4 时达到 94.9%，在无需梯度更新的情况下超过了 94.1% 的此前最佳成绩。其次，在 AIME 2025 上，VPS 实现了对弱执行者的有效救援，将得分从 11.7-26.7% 提升至 63.3-90.0%（最多增加 63.3 分）。第三，在相同计算条件下，VPS 的表现比 Reflexion 高出 8.5 到 12.1 分，比 Self-Consistency@5 高出 5.0 个百分点（GPQA）和 8.3 个百分点（LiveCodeBench），从而将批评的细粒度确定为关键驱动因素。性能随监督者与执行者之间的能力差距而提升（Pearson r=0.90），而当错误无法用语言表达时（例如代码合成），性能会下降，这激励了混合语言-可执行方法。这些结果确立了批评细粒度作为推理时间扩展的新维度。
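
The generate-critique-refine loop with a round budget R described in the abstract above can be sketched as a small control loop. Everything below is a hypothetical illustration: the actor and supervisor are toy stand-ins for LLM calls, and stopping early when the supervisor returns no critique is an assumption that the abstract does not confirm.

```python
def verbal_process_supervision(problem, actor, supervisor, max_rounds):
    """Run up to max_rounds critique-refine rounds after an initial attempt."""
    answer = actor(problem, None)              # initial generation, no critique yet
    for _ in range(max_rounds):
        critique = supervisor(problem, answer)
        if critique is None:                   # assumed stop rule: supervisor accepts
            break
        answer = actor(problem, critique)      # refine using the verbal critique
    return answer

def make_toy_actor():
    """Toy actor: starts at 0 and moves one step in the direction of the critique."""
    guess = [0]
    def actor(problem, critique):
        if critique == "too low":
            guess[0] += 1
        elif critique == "too high":
            guess[0] -= 1
        return guess[0]
    return actor

def toy_supervisor(problem, answer):
    """Toy supervisor: knows the target and describes the error in words."""
    if answer == problem:
        return None
    return "too low" if answer < problem else "too high"

result = verbal_process_supervision(3, make_toy_actor(), toy_supervisor, max_rounds=4)
print(result)
```

With a smaller budget the loop returns the best partially refined answer it reached, which mirrors the role of R as a compute knob.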

cs.CL / 44 / 2604.21637

Multilinguality at the Edge: Developing Language Models for the Global South

边缘的多语言性：为全球南方开发语言模型

Miranda, Lester James V., Hu, Songbo, Reichart, Roi, Korhonen, Anna

Abstract

Where and how language models (LMs) are deployed determines who can benefit from them. However, there are several challenges that prevent effective deployment of LMs in non-English-speaking and hardware-constrained communities in the Global South. We call this challenge the last mile: the intersection of multilinguality and edge deployment, where the goals are aligned but the technical requirements often compete. Studying these two fields together is both a need, as linguistically diverse communities often face the most severe infrastructure constraints, and an opportunity, as edge and multilingual NLP research remain largely siloed. To understand the state of the art and the challenges of combining the two areas, we survey 232 papers that tackle this problem across the language modelling pipeline, from data collection to development and deployment. We also discuss open questions and provide actionable recommendations for different stakeholders in the NLP ecosystem. Finally, we hope that this work contributes to the development of inclusive and equitable language technologies.

Chinese Translation

语言模型（LMs）的部署方式和地点决定了谁能够从中受益。然而，在全球南方的非英语国家和硬件受限的社区中，有几个挑战阻碍了语言模型的有效部署。我们将这一挑战称为“最后一公里”：多语言性与边缘部署的交汇点，在这里目标是一致的，但技术要求往往相互竞争。将这两个领域结合起来进行研究既是必要的，因为语言多样化的社区通常面临最严重的基础设施限制，也是一个机会，因为边缘计算和多语言自然语言处理（NLP）研究仍然在很大程度上是孤立的。为了理解这两个领域结合的最新进展和挑战，我们调查了232篇涉及这一问题的论文，涵盖了从数据收集到开发和部署的语言建模流程。我们还讨论了开放性问题，并为NLP生态系统中的不同利益相关者提供了可行的建议。最后，我们希望这项工作有助于开发包容性和公平的语言技术。

cs.CL / 45 / 2604.21667

Fine-Grained Perspectives: Modeling Explanations with Annotator-Specific Rationales

细粒度视角：基于注释者特定理由建模解释

Sarumi, Olufunke O., Welch, Charles, Braun, Daniel

Abstract

Beyond exploring disaggregated labels for modeling perspectives, annotator rationales provide fine-grained signals of individual perspectives. In this work, we propose a framework for jointly modeling annotator-specific label prediction and corresponding explanations, fine-tuned on the annotators' provided rationales. Using a dataset with disaggregated natural language inference (NLI) annotations and annotator-provided explanations, we condition predictions on both annotator identity and demographic metadata through a representation-level User Passport mechanism. We further introduce two explainer architectures: a post-hoc prompt-based explainer and a prefixed bridge explainer that transfers annotator-conditioned classifier representations directly into a generative model. This design enables explanation generation aligned with individual annotator perspectives. Our results show that incorporating explanation modeling substantially improves predictive performance over a baseline annotator-aware classifier, with the prefixed bridge approach achieving more stable label alignment and higher semantic consistency, while the post-hoc approach yields stronger lexical similarity. These findings indicate that modeling explanations as expressions of fine-grained perspective provides a richer and more faithful representation of disagreement. The proposed approaches advance perspectivist modeling by integrating annotator-specific rationales into both predictive and generative components.

Chinese Translation

除了探索用于建模视角的分解标签外，注释者的理由提供了个体视角的细粒度信号。在本研究中，我们提出了一个框架，用于联合建模注释者特定的标签预测及其相应的解释，并针对注释者提供的理由进行了微调。我们使用一个包含分解自然语言推理（NLI）注释和注释者提供解释的数据集，通过表示层级的用户护照机制，将预测条件化于注释者身份和人口统计元数据。我们进一步引入了两种解释器架构：一种是事后基于提示的解释器，另一种是预设桥接解释器，它将注释者条件化的分类器表示直接转移到生成模型中。该设计使得生成的解释与个别注释者的视角保持一致。我们的结果表明，纳入解释建模显著提高了相较于基线注释者感知分类器的预测性能，其中预设桥接方法实现了更稳定的标签对齐和更高的语义一致性，而事后方法则产生了更强的词汇相似性。这些发现表明，将解释建模视为细粒度视角的表达提供了更丰富和更真实的分歧表示。所提出的方法通过将注释者特定的理由整合到预测和生成组件中，推动了视角建模的发展。

cs.CL / 46 / 2604.21698

Fixation Sequences as Time Series: A Topological Approach to Dyslexia Detection

注视序列作为时间序列：一种拓扑方法用于阅读障碍检测

Huber, Marius, Reich, David R., Jäger, Lena A.

Abstract

Persistent homology, a method from topological data analysis, extracts robust, multi-scale features from data. It produces stable representations of time series by applying varying thresholds to their values (a process known as a filtration). We develop novel filtrations for time series and introduce topological methods for the analysis of eye-tracking data, by interpreting fixation sequences as time series, and constructing "hybrid models" that combine topological features with traditional statistical features. We empirically evaluate our method by applying it to the task of dyslexia detection from eye-tracking-while-reading data using the Copenhagen Corpus, which contains scanpaths from dyslexic and non-dyslexic L1 and L2 readers. Our hybrid models outperform existing approaches that rely solely on traditional features, showing that persistent homology captures complementary information encoded in fixation sequences. The strength of these topological features is further underscored by their achieving performance comparable to established baseline methods. Importantly, our proposed filtrations outperform existing ones.

Chinese Translation

持久同调（persistent homology）是一种来自拓扑数据分析的方法，它从数据中提取稳健的多尺度特征。通过对时间序列的值应用不同的阈值（这一过程称为过滤，filtration），它生成时间序列的稳定表示。我们为时间序列开发了新型过滤，并通过将注视序列解释为时间序列，引入拓扑方法来分析眼动追踪数据，同时构建结合拓扑特征与传统统计特征的“混合模型”。我们通过将该方法应用于使用哥本哈根语料库（Copenhagen Corpus）进行的阅读时眼动追踪数据的阅读障碍检测任务进行实证评估，该语料库包含阅读障碍和非阅读障碍的L1和L2读者的扫描路径。我们的混合模型在性能上优于仅依赖传统特征的现有方法，表明持久同调捕捉了编码在注视序列中的互补信息。这些拓扑特征的强度进一步通过其达到与已建立基线方法相当的性能得到了强调。重要的是，我们提出的过滤方法优于现有方法。
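
To make the thresholding idea in the abstract above concrete, the sketch below computes 0-dimensional persistence pairs of a one-dimensional time series under the standard sublevel-set filtration, using a union-find structure and the elder rule. This is the textbook filtration only, not one of the novel filtrations the paper proposes, and the example series is invented rather than real fixation data.

```python
def sublevel_persistence(series):
    """0-dimensional persistence pairs (birth, death) of a 1-D time series
    under the sublevel-set filtration."""
    n = len(series)
    parent = [None] * n   # None marks a sample not yet added to the filtration
    birth = {}            # component root -> value at which the component was born
    pairs = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Add samples in order of increasing value (raising the threshold).
    for i in sorted(range(n), key=lambda k: series[k]):
        parent[i] = i
        birth[i] = series[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the younger component (larger birth value) dies here.
                young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
                if birth[young] < series[i]:
                    pairs.append((birth[young], series[i]))
                parent[young] = old
    pairs.append((min(series), float("inf")))  # the oldest component never dies
    return pairs

# Hypothetical series, e.g. fixation durations in arbitrary units.
pairs = sublevel_persistence([0, 2, 1, 3, 0.5])
print(pairs)
```

Each finite pair records a local minimum (birth) and the local maximum at which its basin merges into an older one (death); the differences death minus birth are the kind of multi-scale features that could be fed to a classifier alongside traditional statistics.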

cs.CL / 47 / 2604.21706

Phonological Subspace Collapse Is Aetiology-Specific and Cross-Lingually Stable: Evidence from 3,374 Speakers

音位子空间崩溃是病因特异性的，并且在跨语言中稳定：来自3,374名说话者的证据

Muller, Bernard, Barrañón, Antonio Armando Ortiz, Roberts, LaVonne

Abstract

We previously introduced a training-free method for dysarthria severity assessment based on d-prime separability of phonological feature subspaces in frozen self-supervised speech representations, validated on 890 speakers across 5 languages with HuBERT-base. Here, we scale the analysis to 3,374 speakers from 25 datasets spanning 12 languages and 5 aetiologies (Parkinson's disease, cerebral palsy, ALS, Down syndrome, and stroke), plus healthy controls, using 6 SSL backbones. We report three findings. First, aetiology-specific degradation profiles are distinguishable at the group level: 10 of 13 features yield large effect sizes (epsilon-squared > 0.14, Holm-corrected p < 0.001), with Parkinson's disease separable from the articulatory execution group at Cohen's d = 0.83; individual-level classification remains limited (22.6% macro F1). Second, profiles show cross-lingual profile-shape stability: cosine similarity of 5-dimensional consonant d-prime profiles exceeds 0.95 across the languages available for each aetiology. Absolute d-prime magnitudes are not cross-lingually calibrated, so the method supports language-independent phenotyping of degradation patterns but requires within-corpus calibration for absolute severity interpretation. Third, the method is architecture-independent: all 6 backbones produce monotonic severity gradients with inter-model agreement exceeding rho = 0.77. Fixed-token d-prime estimation preserves the severity correlation (rho = -0.733 at 200 tokens per class), confirming that the signal is not a token-count artefact. These results support phonological subspace analysis as a robust, training-free framework for aetiology-aware dysarthria characterisation, with evidence of cross-lingual profile-shape stability and cross-backbone robustness in the represented sample.

Chinese Translation

我们之前介绍了一种基于冻结自监督语音表征中音位特征子空间的 d-prime 可分离性进行言语不清严重性评估的无训练方法，该方法在890名说话者和5种语言中使用 HuBERT-base 进行了验证。在这里，我们将分析扩展到来自25个数据集的3,374名说话者，涵盖12种语言和5种病因（帕金森病、脑瘫、肌萎缩侧索硬化症、唐氏综合症和中风），以及健康对照，使用6种自监督学习（SSL）骨干网络。我们报告了三个发现。首先，病因特异性的退化特征在组层面上是可区分的：13个特征中有10个产生了较大的效应量（epsilon-squared > 0.14，Holm校正p < 0.001），帕金森病与发音执行组在Cohen's d = 0.83的情况下可分离；个体级分类仍然有限（22.6%的宏观F1）。其次，特征显示出跨语言特征形状的稳定性：5维辅音d-prime特征的余弦相似度在每种病因的可用语言中超过0.95。绝对d-prime幅度在跨语言中并未校准，因此该方法支持语言独立的退化模式表型，但需要在语料库内进行校准以解释绝对严重性。第三，该方法是架构无关的：所有6个骨干网络生成单调的严重性梯度，模型间一致性超过rho = 0.77。固定标记的d-prime估计保持了严重性相关性（在每类200个标记时，rho = -0.733），确认信号不是标记计数伪影。这些结果支持音位子空间分析作为一种稳健的、无训练的框架，用于病因意识的言语不清特征化，并提供了跨语言特征形状稳定性和跨骨干网络稳健性的证据。
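
Two quantities carry most of the argument in the abstract above: d-prime separability of a feature, and cosine similarity between d-prime profiles. The sketch below shows one common definition of each, assuming one-dimensional projections and an equal-weight pooled standard deviation; the paper's exact subspace construction is not given in the abstract, and all numbers here are invented.

```python
import math
from statistics import mean, variance

def d_prime(a, b):
    """d' between two 1-D samples: mean gap divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(a) + variance(b)) / 2)
    return abs(mean(a) - mean(b)) / pooled_sd

def cosine(u, v):
    """Cosine similarity between two equal-length profiles."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical projections of tokens with and without one phonological feature.
voiced = [1.0, 1.2, 0.8, 1.1, 0.9]
voiceless = [0.0, 0.2, -0.2, 0.1, -0.1]
separability = d_prime(voiced, voiceless)

# Hypothetical 5-dimensional d' profiles for the same aetiology in two languages.
profile_lang1 = [2.0, 1.5, 1.0, 0.5, 0.2]
profile_lang2 = [4.0, 3.0, 2.0, 1.0, 0.4]   # same shape, twice the magnitude
shape_similarity = cosine(profile_lang1, profile_lang2)
print(round(separability, 3), round(shape_similarity, 3))
```

The second example shows why profile shape can be stable across languages even when absolute d-prime magnitudes are not calibrated: cosine similarity ignores overall scale.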

cs.CL / 48 / 2604.21716

From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation

从条件语句到机器学习管道：重新审视代码生成中的偏见

Bui, Minh Duc, Heilmann, Xenia, Cerrato, Mattia, Mager, Manuel, von der Wense, Katharina

Abstract

Prior work evaluates code generation bias primarily through simple conditional statements, which represent only a narrow slice of real-world programming and reveal solely overt, explicitly encoded bias. We demonstrate that this approach dramatically underestimates bias in practice by examining a more realistic task: generating machine learning (ML) pipelines. Testing both code-specialized and general-instruction large language models, we find that generated pipelines exhibit significant bias during feature selection. Sensitive attributes appear in 87.7% of cases on average, despite models demonstrably excluding irrelevant features (e.g., including "race" while dropping "favorite color" for credit scoring). This bias is substantially more prevalent than that captured by conditional statements, where sensitive attributes appear in only 59.2% of cases. These findings are robust across prompt mitigation strategies, varying numbers of attributes, and different pipeline difficulty levels. Our results challenge simple conditionals as valid proxies for bias evaluation and suggest current benchmarks underestimate bias risk in practical deployments.

Chinese Translation

以往的研究主要通过简单的条件语句来评估代码生成中的偏见，这仅代表了现实编程中的一个狭窄切片，并且仅揭示了明显的、明确编码的偏见。我们通过考察一个更现实的任务——生成机器学习（ML）管道，证明这种方法在实践中严重低估了偏见。对代码专用和通用指令的大型语言模型进行测试，我们发现生成的管道在特征选择过程中表现出显著的偏见。敏感属性在平均87.7%的案例中出现，尽管模型明显排除了不相关的特征（例如，在信用评分中包括“种族”而排除“最喜欢的颜色”）。这种偏见的普遍性远高于通过条件语句捕获的偏见，其中敏感属性仅在59.2%的案例中出现。这些发现在不同的提示缓解策略、属性数量变化和管道难度水平下都具有稳健性。我们的结果挑战了简单条件语句作为偏见评估有效代理的有效性，并表明当前基准低估了实际部署中的偏见风险。

cs.CL / 49 / 2604.21724

Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling

超越 N-gram：数据感知的 X-GRAM 提取用于高效嵌入参数缩放

Chen, Yilong, Xie, Yanxi, Gao, Zitian, Xin, He, Xiao, Yihao, Liu, Renbiao, Luo, Haoming, Luo, Yifan, Ye, Zhengmao, Liu, Tingwen, Zhao, Xin, Tao, Ran, Dai, Bryan

Abstract

Large token-indexed lookup tables provide a compute-decoupled scaling path, but their practical gains are often limited by poor parameter efficiency and rapid memory growth. We attribute these limitations to Zipfian under-training of the long tail, heterogeneous demand across layers, and "slot collapse" that produces redundant embeddings. To address this, we propose X-GRAM, a frequency-aware dynamic token-injection framework. X-GRAM employs hybrid hashing and alias mixing to compress the tail while preserving head capacity, and refines retrieved vectors via normalized SwiGLU ShortConv to extract diverse local n-gram features. These signals are integrated into attention value streams and inter-layer residuals using depth-aware gating, effectively aligning static memory with dynamic context. This design introduces a memory-centric scaling axis that decouples model capacity from FLOPs. Extensive evaluations at the 0.73B and 1.15B scales show that X-GRAM improves average accuracy by as much as 4.4 points over the vanilla backbone and 3.2 points over strong retrieval baselines, while using substantially smaller tables in the 50% configuration. Overall, by decoupling capacity from compute through efficient memory management, X-GRAM offers a scalable and practical paradigm for future memory-augmented architectures. Code is available at https://github.com/Longyichen/X-gram.

Chinese Translation

大型基于令牌索引的查找表提供了一条计算解耦的缩放路径，但其实际收益往往受到参数效率低下和内存快速增长的限制。我们将这些限制归因于长尾的 Zipfian 训练不足、各层之间的异构需求以及导致冗余嵌入的“槽崩溃”。为了解决这个问题，我们提出了 X-GRAM，一种频率感知的动态令牌注入框架。X-GRAM 采用混合哈希和别名混合来压缩长尾，同时保留头部容量，并通过归一化的 SwiGLU ShortConv 精炼检索向量，以提取多样的局部 n-gram 特征。这些信号通过深度感知门控集成到注意力值流和层间残差中，有效地将静态内存与动态上下文对齐。该设计引入了一种以内存为中心的缩放轴，将模型容量与 FLOPs 解耦。在 0.73B 和 1.15B 规模的广泛评估中，X-GRAM 在平均准确率上比原始骨干网络提高了多达 4.4 个点，比强检索基线提高了 3.2 个点，同时在 50% 配置下使用了显著更小的表。总体而言，通过高效的内存管理将容量与计算解耦，X-GRAM 为未来的内存增强架构提供了一种可扩展且实用的范式。代码可在 https://github.com/Longyichen/X-gram 获取。

cs.CL / 50 / 2604.21725

AEL: Agent Evolving Learning for Open-Ended Environments

AEL：开放式环境中的智能体进化学习

Xu, Wujiang, Han, Jiaojiao, Guo, Minghao, Mei, Kai, Zhu, Xi, Zhang, Han, Metaxas, Dimitris N.

Abstract

LLM agents increasingly operate in open-ended environments spanning hundreds of sequential episodes, yet they remain largely stateless: each task is solved from scratch without converting past experience into better future behavior. The central obstacle is not what to remember but how to use what has been remembered, including which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change. We introduce Agent Evolving Learning (AEL), a two-timescale framework that addresses this obstacle. At the fast timescale, a Thompson Sampling bandit learns which memory retrieval policy to apply at each episode; at the slow timescale, LLM-driven reflection diagnoses failure patterns and injects causal insights into the agent's decision prompt, giving it an interpretive frame for the evidence it retrieves. On a sequential portfolio benchmark (10 sector-diverse tickers, 208 episodes, 5 random seeds), AEL achieves a Sharpe ratio of 2.13±0.47, outperforming five published self-improving methods and all non-LLM baselines while maintaining the lowest variance among all LLM-based approaches. A nine-variant ablation reveals a "less is more" pattern: memory and reflection together produce a 58% cumulative improvement over the stateless baseline, yet every additional mechanism we test (planner evolution, per-tool selection, cold-start initialization, skill extraction, and three credit assignment methods) degrades performance. This demonstrates that the bottleneck in agent self-improvement is self-diagnosing how to use experience rather than adding architectural complexity. Code and data: https://github.com/WujiangXu/AEL.

Chinese Translation

大型语言模型（LLM）智能体越来越多地在跨越数百个连续情节的开放式环境中运行，但它们仍然在很大程度上是无状态的：每个任务都是从头开始解决，而没有将过去的经验转化为更好的未来行为。主要障碍不在于记住什么，而在于如何使用所记住的内容，包括应用哪种检索策略、如何解读先前的结果，以及何时当前策略本身必须改变。我们提出了智能体进化学习（AEL），这是一个解决这一障碍的双时间尺度框架。在快速时间尺度上，汤普森采样赌博机学习在每个情节中应用哪种记忆检索策略；在慢速时间尺度上，基于LLM的反思诊断失败模式，并将因果洞察注入智能体的决策提示中，为其检索的证据提供解释框架。在一个顺序投资组合基准测试中（10个行业多样化的股票代码，208个情节，5个随机种子），AEL达到了2.13±0.47的夏普比率，超越了五种已发布的自我改进方法和所有非LLM基准，同时在所有基于LLM的方法中保持最低的方差。九个变体的消融实验揭示了“少即是多”的模式：记忆和反思共同产生了相对于无状态基线58%的累积改进，但我们测试的每一个额外机制（规划者进化、每个工具选择、冷启动初始化、技能提取和三种信用分配方法）都降低了性能。这表明，智能体自我改进的瓶颈在于自我诊断如何使用经验，而不是增加架构复杂性。代码和数据：https://github.com/WujiangXu/AEL。
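
The fast-timescale component of AEL described in the abstract, a Thompson Sampling bandit that picks a memory retrieval policy per episode, might look roughly like the sketch below. This is a generic Beta-Bernoulli bandit, not the authors' implementation: the policy names, the binary reward, and the success rates in `TRUE_RATES` are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


class ThompsonPolicySelector:
    """Beta-Bernoulli Thompson Sampling over a fixed set of retrieval policies."""

    def __init__(self, policies, seed=0):
        self.policies = list(policies)
        self.rng = random.Random(seed)
        # Beta(1, 1) prior: one pseudo-success and one pseudo-failure per policy.
        self.successes = {p: 1.0 for p in self.policies}
        self.failures = {p: 1.0 for p in self.policies}

    def select(self):
        # Draw one plausible success rate per policy and pick the largest draw.
        draws = {
            p: self.rng.betavariate(self.successes[p], self.failures[p])
            for p in self.policies
        }
        return max(draws, key=draws.get)

    def update(self, policy, reward):
        # Assumption: reward is binary (1 if the episode went well, else 0).
        if reward:
            self.successes[policy] += 1.0
        else:
            self.failures[policy] += 1.0


def simulate(true_rates, episodes=1000, seed=0):
    """Run the bandit against hypothetical per-policy success rates."""
    selector = ThompsonPolicySelector(true_rates, seed=seed)
    env = random.Random(seed + 1)  # separate stream for the simulated environment
    picks = Counter()
    for _ in range(episodes):
        policy = selector.select()
        reward = 1 if env.random() < true_rates[policy] else 0
        selector.update(policy, reward)
        picks[policy] += 1
    return picks


# Hypothetical policies and success rates, not taken from the paper.
TRUE_RATES = {"no_memory": 0.40, "most_recent": 0.45, "most_similar": 0.70}
picks = simulate(TRUE_RATES)
print(picks.most_common())
```

Because each policy is chosen in proportion to how plausible it is that it is the best one, exploration fades on its own as evidence accumulates, with no hand-tuned exploration rate.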

cs.CL / 51 / 2604.21748

StructMem: Structured Memory for Long-Horizon Behavior in LLMs

StructMem：用于长时行为的结构化记忆

Xu, Buqiang, Chen, Yijun, Fang, Jizhan, Zhong, Ruobin, Yao, Yunzhi, Zhu, Yuqi, Du, Lun, Deng, Shumin

Abstract

Long-term conversational agents need memory systems that capture relationships between events, not merely isolated facts, to support temporal reasoning and multi-hop question answering. Current approaches face a fundamental trade-off: flat memory is efficient but fails to model relational structure, while graph-based memory enables structured reasoning at the cost of expensive and fragile construction. To address these issues, we propose StructMem, a structure-enriched hierarchical memory framework that preserves event-level bindings and induces cross-event connections. By temporally anchoring dual perspectives and performing periodic semantic consolidation, StructMem improves temporal reasoning and multi-hop performance on LoCoMo, while substantially reducing token usage, API calls, and runtime compared to prior memory systems. Code: https://github.com/zjunlp/LightMem.

Chinese Translation

长期对话代理需要能够捕捉事件之间关系的记忆系统，而不仅仅是孤立的事实，以支持时间推理和多跳问答。目前的方法面临一个基本的权衡：扁平记忆虽然高效，但无法建模关系结构，而基于图的记忆虽然能够实现结构化推理，却代价高昂且脆弱。为了解决这些问题，我们提出了StructMem，一种结构增强的层次记忆框架，能够保留事件级绑定并引入跨事件连接。通过时间锚定双重视角并进行定期语义整合，StructMem改善了在LoCoMo上的时间推理和多跳性能，同时与之前的记忆系统相比，显著减少了令牌使用、API调用和运行时间，详见 https://github.com/zjunlp/LightMem 。

cs.CL / 52 / 2604.21751

Why are all LLMs Obsessed with Japanese Culture? On the Hidden Cultural and Regional Biases of LLMs

为什么所有大型语言模型都对日本文化情有独钟？关于大型语言模型的隐性文化和区域偏见

de Landa, Joseba Fernandez, Perez-Almendros, Carla, Camacho-Collados, Jose

Abstract

LLMs have shown limitations in cultural coverage and competence, and in some cases exhibit regional biases such as amplifying Western and Anglocentric viewpoints. While there have been works analysing the cultural capabilities of LLMs, no work has specifically examined LLM regional preferences on culture-related questions. In this work, we propose a new dataset based on a comprehensive taxonomy of Culture-Related Open Questions (CROQ). The results show that, contrary to previous cultural bias work, LLMs show a clear tendency towards countries such as Japan. Moreover, our results show that when prompted in English or other high-resource languages, LLMs tend to provide more diverse outputs and are less inclined to answer by highlighting countries where the input language is an official language. Finally, we also investigate at which point of LLM training this cultural bias emerges, with our results suggesting that the first clear signs appear after supervised fine-tuning, and not during pre-training.

Chinese Translation

大型语言模型（LLMs）在文化覆盖和能力方面表现出局限性，在某些情况下显示出区域偏见，例如放大西方和以英语为中心的观点。尽管已有研究分析了大型语言模型的文化能力，但尚未有具体研究突出大型语言模型在文化相关问题上的区域偏好。在本研究中，我们提出了一个基于文化相关开放问题（Culture-Related Open Questions, CROQ）综合分类法的新数据集。结果显示，与以往文化偏见研究相反，大型语言模型对日本等国家表现出明显的倾向。此外，我们的结果表明，当使用英语或其他高资源语言进行提示时，大型语言模型倾向于提供更为多样化的输出，并且在回答突出输入语言为官方语言的国家的问题时，表现出较少的倾向。最后，我们还调查了大型语言模型训练的哪个阶段出现这种文化偏见，结果表明，明显的迹象在监督微调后出现，而非在预训练阶段。

cs.CL / 53 / 2604.21766

AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

AUDITA：一个用于审计人类与人工智能在音频问答技能方面的新数据集

Kabir, Tasnim, Kurdydyk, Dmytro, Palnitkar, Aadi, Dorn, Liam, Ahmed, Ahmed Haj, Boyd-Graber, Jordan Lee

Abstract

Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific biases, or even bypassing audio via metadata and captions rather than genuine reasoning. Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition. AUDITA comprises carefully curated, human-authored trivia questions grounded in real-world audio, designed to stress robust auditory reasoning through challenging distractors and long-range temporal dependencies, using probing queries that cannot be answered from isolated text or sound cues alone. An average human accuracy of 32.13% shows that the task is challenging while still demonstrating meaningful comprehension of the audio. In stark contrast, state-of-the-art audio question answering models perform poorly, with average accuracy below 8.86%. Beyond raw accuracy, we apply Item Response Theory (IRT) to estimate latent proficiency and question difficulty, and to expose systematic deficiencies of the models and data.

Chinese Translation

现有的音频问答基准主要强调声音事件分类或基于字幕的查询，通常使得模型能够通过捷径策略、短时线索、词汇先验、数据集特定偏差，甚至通过元数据和字幕而非真实推理来成功。因此，我们提出了AUDITA（来自多样化互联网趣闻作者的音频理解），这是一个大规模、真实世界的基准，用于严格评估超越表层声学识别的音频推理能力。AUDITA包含经过精心策划的人类创作的趣味问题，这些问题基于真实世界的音频，旨在通过具有挑战性的干扰项和长时间的时间依赖性来考验稳健的听觉推理，使用无法仅通过孤立文本或声音线索回答的探测性查询。人类的平均准确率为32.13%，显示了任务的挑战性，同时也表明了对音频的有意义理解。相比之下，最先进的音频问答模型表现不佳，平均准确率低于8.86%。除了原始准确率外，我们还应用项目反应理论（IRT）来估计潜在能力、问题难度，并揭示模型和数据的系统性缺陷。

cs.CL / 54 / 2604.21767

Misinformation Span Detection in Videos via Audio Transcripts

通过音频转录在视频中检测虚假信息的范围

Matos, Breno, Lima, Rennan C., Zannettou, Savvas, Benevenuto, Fabricio, Santos, Rodrygo L. T.

Abstract

Online misinformation is one of the most challenging issues of recent years, with severe consequences including political polarization, attacks on democracy, and public health risks. Misinformation manifests on any platform with a large user base, including online social networks and messaging apps. It permeates all media and content forms, including images, text, audio, and video. Video-based misinformation in particular poses a multifaceted challenge for fact-checkers, given the ease with which individuals can record and upload videos on various video-sharing platforms. Previous research investigated detecting video-based misinformation at the video level, that is, whether a video as a whole shares misinformation. While this approach is useful, it provides only a limited and not easily interpretable view of the problem, because it gives no context about when misinformation occurs within a video or what content (i.e., which claims) is responsible for the video's misinformation nature. In this work, we attempt to bridge this research gap by creating two novel datasets that allow us to explore misinformation detection in videos via audio transcripts, focusing on identifying the spans of a video that are responsible for its misinformation claim (misinformation span detection). We transcribe each video's audio to text and identify the video segments in which the misinformation claim appears, resulting in two datasets of more than 500 videos with over 2,400 segments containing annotated fact-checked claims. We then employ classifiers built with state-of-the-art language models, and our results show that we can identify which part of a video contains misinformation with an F1 score of 0.68. We make our annotated datasets publicly available and also release all transcripts, audio, and videos.

Chinese Translation

在线虚假信息是近年来最具挑战性的问题之一，带来了严重后果，包括政治极化、对民主的攻击以及公共健康风险。虚假信息在任何拥有大量用户基础的平台上都表现出来，包括在线社交网络和消息应用程序。它渗透到所有媒体和内容形式中，包括图像、文本、音频和视频。特别是，基于视频的虚假信息对事实核查者构成了多方面的挑战，因为个人可以轻松地在各种视频分享平台上录制和上传视频。以往的研究工作探讨了检测基于视频的虚假信息，重点关注视频是否在视频层面上分享虚假信息。虽然这种方法是有用的，但由于未提供虚假信息在视频中发生的额外上下文以及哪些内容（即主张）导致视频的虚假信息特性，因此仅提供了有限且不易解释的问题视角。在本研究中，我们试图填补这一研究空白，通过创建两个新数据集，使我们能够通过音频转录探索视频中的虚假信息检测，重点识别导致视频虚假信息主张的段落（虚假信息范围检测）。我们为此任务提出了两个新数据集。我们将每个视频的音频转录为文本，识别出虚假信息主张出现的视频片段，最终形成两个包含超过500个视频和2400多个带注释的事实核查主张的段落数据集。然后，我们利用基于最新语言模型构建的分类器，结果表明我们可以以0.68的F1分数识别视频中虚假信息出现的部分。我们公开提供我们的注释数据集，同时发布所有转录文本、音频和视频。
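
To make the reported F1 score concrete, the sketch below computes precision, recall, and F1 over per-segment binary labels. It assumes a segment-level binary setup (1 = segment contains a fact-checked misinformation claim); the abstract does not spell out the exact evaluation protocol, and the label lists here are made up for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 for the positive class.

    gold, predicted: equal-length lists of 0/1 labels, one per transcript segment.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels for eight segments of one video transcript.
gold = [0, 1, 1, 0, 0, 1, 0, 0]
pred = [0, 1, 0, 0, 1, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so a classifier cannot score well by flagging every segment or by flagging almost none.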

cs.CL / 55 / 2604.21782

SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning

SemEval-2026 任务 4：叙事故事相似性与叙事表示学习

Hatzel, Hans Ole, Artemova, Ekaterina, Stiemer, Haimo Paul, Gius, Evelyn, Biemann, Chris

Abstract

We present the shared task on narrative similarity and narrative representation learning - NSNRL (pronounced "nass-na-rel"). The task operationalizes narrative similarity as a binary classification problem: determining which of two stories is more similar to an anchor story. We introduce a novel definition of narrative similarity, compatible with both narrative theory and intuitive judgment. Based on the similarity judgments collected under this concept, we also evaluate narrative embedding representations. We collected at least two annotations each for more than 1,000 story summary triples, with each annotation being backed by at least two annotators in agreement. This paper describes the sampling and annotation process for the dataset; further, we give an overview of the submitted systems and the techniques they employ. We received a total of 71 final submissions from 46 teams across our two tracks. In our triple-based classification setup, LLM ensembles make up many of the top-scoring systems, while in the embedding setup, systems with pre- and post-processing on pretrained embedding models perform about on par with custom fine-tuned solutions. Our analysis identifies potential headroom for improvement of automated systems in both tracks. The task website includes visualizations of embeddings alongside instance-level classification results for all teams.

Chinese Translation

我们提出了叙事相似性与叙事表示学习的共享任务——NSNRL（发音为“nass-na-rel”）。该任务将叙事相似性操作化为一个二分类问题：确定两个故事中哪个与锚故事更为相似。我们引入了一个新的叙事相似性定义，该定义与叙事理论和直观判断相兼容。基于在这一概念下收集的相似性判断，我们还评估了叙事嵌入表示。我们为超过 1,000 个故事摘要三元组收集了至少两个标注，每个标注均由至少两个一致的标注者支持。本文描述了数据集的抽样和标注过程；此外，我们还概述了提交的系统及其采用的技术。我们共收到来自 46 个团队的 71 个最终提交，涵盖了我们的两个赛道。在基于三元组的分类设置中，LLM（大语言模型）集成构成了许多高分系统，而在嵌入设置中，经过预处理和后处理的预训练嵌入模型的系统表现与定制微调解决方案大致相当。我们的分析识别了两个赛道中自动化系统的潜在改进空间。任务网站包括了嵌入的可视化以及所有团队的实例级分类结果。
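
The triple-based setup in the abstract (which of two stories is closer to an anchor?) can be approached with embeddings by comparing similarities to the anchor. The sketch below uses cosine similarity on toy vectors; the vectors, the "A"/"B" labels, and the choice of cosine similarity are assumptions for illustration, since the abstract does not specify how submitted systems compare embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def closer_story(anchor, story_a, story_b):
    """Return "A" if story_a is at least as similar to the anchor as story_b, else "B"."""
    return "A" if cosine(anchor, story_a) >= cosine(anchor, story_b) else "B"


def triple_accuracy(triples, labels):
    """Fraction of (anchor, story_a, story_b) triples classified as labelled."""
    predictions = [closer_story(*triple) for triple in triples]
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Toy 3-dimensional "story embeddings"; real systems would use a trained encoder.
triples = [
    ([1.0, 0.0, 1.0], [0.9, 0.1, 0.8], [0.0, 1.0, 0.1]),
    ([0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.9, 0.2]),
]
labels = ["A", "B"]
print(triple_accuracy(triples, labels))
```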

cs.CL / 56 / 2604.21871

Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions

关系道德困境中的机器行为：道德正确性、预测的人类行为与模型决策

Kim, Jiseon, Kwon, Jea, Vecchietti, Luiz Felipe, Dong, Wenchao, Kim, Jaehong, Cha, Meeyoung

Abstract

Human moral judgment is context-dependent and modulated by interpersonal relationships. As large language models (LLMs) increasingly function as decision-support systems, determining whether they encode these social nuances is critical. We characterize machine behavior using the Whistleblower's Dilemma by varying two experimental dimensions: crime severity and relational closeness. Our study evaluates three distinct perspectives: (1) moral rightness (prescriptive norms), (2) predicted human behavior (descriptive social expectations), and (3) autonomous model decision-making. By analyzing the reasoning processes, we identify a clear cross-perspective divergence: while moral rightness remains consistently fairness-oriented, predicted human behavior shifts significantly toward loyalty as relational closeness increases. Crucially, model decisions align with moral rightness judgments rather than their own behavioral predictions. This inconsistency suggests that LLM decision-making prioritizes rigid, prescriptive rules over the social sensitivity present in their internal world-modeling, which poses a gap that may lead to significant misalignments in real-world deployments.

Chinese Translation

人类的道德判断依赖于情境，并受到人际关系的调节。随着大型语言模型（LLMs）越来越多地作为决策支持系统发挥作用，确定它们是否编码了这些社会细微差别至关重要。我们通过改变两个实验维度：犯罪严重性和关系亲密度，使用举报者困境来表征机器行为。我们的研究评估了三个不同的视角：（1）道德正确性（规范性标准），（2）预测的人类行为（描述性社会期望），以及（3）自主模型决策。通过分析推理过程，我们识别出一个明确的跨视角分歧：尽管道德正确性始终保持以公平为导向，但随着关系亲密度的增加，预测的人类行为显著转向忠诚。关键是，模型决策与道德正确性判断一致，而不是与其自身的行为预测一致。这种不一致表明，LLM的决策过程优先考虑严格的规范性规则，而不是其内部世界建模中存在的社会敏感性，这造成了在实际应用中可能导致重大不一致的差距。

cs.CL / 57 / 2604.21882

Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms

重新审视大型语言模型中的非逐字记忆：实体表面形式的作用

Nishida, Yuto, Shikoda, Naoki, Kishinami, Yosuke, Fujii, Ryo, Morishita, Makoto, Kamigaito, Hidetaka, Watanabe, Taro

Abstract

Understanding what kinds of factual knowledge large language models (LLMs) memorize is essential for evaluating their reliability and limitations. Entity-based QA is a common framework for analyzing non-verbatim memorization, but typical evaluations query each entity using a single canonical surface form, making it difficult to disentangle fact memorization from access through a particular name. We introduce RedirectQA, an entity-based QA dataset that uses Wikipedia redirect information to associate Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms. Across 13 LLMs, we examine surface-conditioned factual memorization and find that prediction outcomes often change when only the entity surface form changes. This inconsistency is category-dependent: models are more robust to minor orthographic variations than to larger lexical variations such as aliases and abbreviations. Frequency analyses further suggest that both entity- and surface-level frequencies are associated with accuracy, and that entity frequency often contributes beyond surface frequency. Overall, factual memorization appears neither purely surface-specific nor fully surface-invariant, highlighting the importance of surface-form diversity in evaluating non-verbatim memorization.

Chinese Translation

理解大型语言模型（LLMs）记忆哪些类型的事实知识对于评估其可靠性和局限性至关重要。基于实体的问答（QA）是分析非逐字记忆的常见框架，但典型的评估方法使用单一的标准表面形式查询每个实体，这使得很难将事实记忆与通过特定名称的访问区分开来。我们引入了RedirectQA，这是一个基于实体的问答数据集，利用维基百科重定向信息将维基数据（Wikidata）事实三元组与每个实体的分类表面形式关联，包括替代名称、缩写、拼写变体和常见错误形式。在13个大型语言模型中，我们考察了表面条件下的事实记忆，发现当仅改变实体表面形式时，预测结果往往会发生变化。这种不一致性依赖于类别：模型对小的正字法变体更具鲁棒性，而对别名和缩写等较大的词汇变体则较为脆弱。频率分析进一步表明，实体频率和表面频率都与准确性相关，并且实体频率通常在表面频率之外起到贡献。总体而言，事实记忆似乎既不是纯粹的表面特定，也不是完全的表面不变，强调了在评估非逐字记忆时表面形式多样性的重要性。

cs.CL / 58 / 2604.21885

A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents

一种基于多模态文本和图的开放域事件提取方法

Sharma, Praval

Abstract

Event extraction is essential for event understanding and analysis. It supports tasks such as document summarization and decision-making in emergency scenarios. However, existing event extraction approaches have limitations: (1) closed-domain algorithms are restricted to predefined event types and thus rarely generalize to unseen types, and (2) open-domain event extraction algorithms, capable of handling unconstrained event types, have largely overlooked the potential of large language models (LLMs) despite their advanced abilities. Additionally, they do not explicitly model document-level contextual, structural, and semantic reasoning, which are crucial for effective event extraction but remain challenging for LLMs due to the lost-in-the-middle phenomenon and attention dilution. To address these limitations, we propose multimodal open-domain event extraction (MODEE), a novel approach for open-domain event extraction that combines graph-based learning with text-based representation from LLMs to model document-level reasoning. Empirical evaluations on large datasets demonstrate that MODEE outperforms state-of-the-art open-domain event extraction approaches and can be generalized to closed-domain event extraction, where it outperforms existing algorithms.

Chinese Translation

事件提取对于事件理解和分析至关重要。它支持文档摘要和紧急情况下的决策等任务。然而，现有的事件提取方法存在一些局限性：（1）封闭域算法受限于预定义的事件类型，因此很少能推广到未见过的类型；（2）开放域事件提取算法能够处理不受限制的事件类型，但在很大程度上忽视了大型语言模型（LLMs）的潜力，尽管它们具备先进的能力。此外，这些算法未能明确建模文档级的上下文、结构和语义推理，而这些对于有效的事件提取至关重要，但由于“中间丢失”现象和注意力稀释，LLMs在这方面仍面临挑战。为了解决这些局限性，我们提出了一种多模态开放域事件提取方法MODEE，这是一种结合基于图的学习与来自LLMs的基于文本的表示，以建模文档级推理的开放域事件提取新方法。在大规模数据集上的实证评估表明，MODEE的表现优于最先进的开放域事件提取方法，并且可以推广到封闭域事件提取，在该领域中也优于现有算法。

cs.CL / 59 / 2604.21889

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

TingIS：从企业规模的嘈杂客户事件中实时发现风险事件

Wang, Jun, Zhang, Ziyin, Wang, Rui, Yu, Hang, Di, Peng, Wang, Rui

Abstract

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.

Chinese Translation

实时检测和缓解技术异常对于大规模云原生服务至关重要，因为即使是几分钟的停机时间也可能导致巨大的财务损失和用户信任的下降。虽然客户事件作为发现监控遗漏风险的重要信号，但由于极高的噪声、吞吐量和多样化业务线的语义复杂性，从这些数据中提取可操作的情报仍然具有挑战性。本文提出了TingIS，一个旨在企业级事件发现的端到端系统。TingIS的核心是一个多阶段事件链接引擎，该引擎将高效的索引技术与大型语言模型（Large Language Models, LLMs）相结合，以便在事件合并上做出明智的决策，从而能够从仅仅少量多样化的用户描述中稳定地提取可操作的事件。该引擎还配备了级联路由机制，以实现精确的业务归属，以及一个多维噪声减少管道，集成了领域知识、统计模式和行为过滤。在处理每分钟超过2,000条消息和每天300,000条消息的生产环境中部署的TingIS，实现了90百分位警报延迟为3.5分钟，并且高优先级事件的发现率达到95%。基于真实世界数据构建的基准测试表明，TingIS在路由准确性、聚类质量和信噪比方面显著优于基线方法。
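
For readers unfamiliar with the "P90 alert latency" figure in the abstract, the sketch below shows one common way to compute a 90th percentile, the nearest-rank method. The abstract does not say which percentile definition TingIS uses, and the latency values are invented for illustration.

```python
import math


def nearest_rank_percentile(values, percentile):
    """Smallest observed value such that at least `percentile` percent of values are <= it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(percentile * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical alert latencies in minutes, not taken from the paper.
latencies_min = [1.2, 0.8, 3.0, 2.1, 1.9, 2.7, 0.5, 3.3, 6.4, 1.0]
p90 = nearest_rank_percentile(latencies_min, 90)
print(p90)
```

A P90 target is less sensitive to a single slow outlier (6.4 minutes here) than a maximum, while still bounding the experience of most alerts.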

cs.CL / 60 / 2604.21890

EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents

EVENT5Ws：一个用于开放领域文档事件提取的大型数据集

Sharma, Praval, Samal, Ashok, Soh, Leen-Kiat, Joshi, Deepti

Abstract

Event extraction identifies the central aspects of events from text. It supports event understanding and analysis, which is crucial for tasks such as informed decision-making in emergencies. Therefore, it is necessary to develop automated event extraction approaches. However, existing datasets for algorithm development have limitations, including limited coverage of event types in closed-domain settings and a lack of large, manually verified datasets in open-domain settings. To address these limitations, we create EVENT5Ws, a large, manually annotated, and statistically verified open-domain event extraction dataset. We design a systematic annotation pipeline to create the dataset and provide empirical insights into annotation complexity. Using EVENT5Ws, we evaluate state-of-the-art pre-trained large language models and establish a benchmark for future research. We further show that models trained on EVENT5Ws generalize effectively to datasets from different geographical contexts, which demonstrates its potential for developing generalizable algorithms. Finally, we summarize the lessons learned during the dataset development and provide recommendations to support future large-scale dataset development.

Chinese Translation

事件提取从文本中识别事件的核心方面。它支持事件理解和分析，这对于在紧急情况下进行知情决策等任务至关重要。因此，开发自动化事件提取方法是必要的。然而，现有的算法开发数据集存在一些局限性，包括在封闭领域设置中事件类型的覆盖范围有限，以及在开放领域设置中缺乏大型、经过人工验证的数据集。为了解决这些局限性，我们创建了EVENT5Ws，一个大型的、经过人工标注和统计验证的开放领域事件提取数据集。我们设计了一个系统的标注流程来创建该数据集，并提供了关于标注复杂性的实证见解。利用EVENT5Ws，我们评估了最先进的预训练大型语言模型，并为未来的研究建立了基准。我们进一步展示了在EVENT5Ws上训练的模型能够有效地推广到来自不同地理背景的数据集，这证明了其在开发可推广算法方面的潜力。最后，我们总结了在数据集开发过程中获得的经验教训，并提供了支持未来大规模数据集开发的建议。

cs.CL / 61 / 2604.21897

Mapping the Political Discourse in the Brazilian Chamber of Deputies: A Multi-Faceted Computational Approach

巴西众议院政治话语的映射：一种多维计算方法

Soriano, Flávio, Mello, Victoria F., Rigueira, Pedro B., Pappa, Gisele L., Meira Jr., Wagner, da Silva, Ana Paula Couto, Almeida, Jussara M.

Abstract

Analyses of legislative behavior often rely on voting records, overlooking the rich semantic and rhetorical content of political speech. In this paper, we ask three complementary questions about parliamentary discourse: how things are said, what is being said, and who is speaking in discursively similar ways. To answer these questions, we introduce a scalable and generalizable computational framework that combines diachronic stylometric analysis, contextual topic modeling, and semantic clustering of deputies' speeches. We apply this framework to a large-scale case study of the Brazilian Chamber of Deputies, using a corpus of over 450,000 speeches from 2003 to 2025. Our results show a long-term stylistic shift toward shorter and more direct speeches, a legislative agenda that reorients sharply in response to national crises, and a granular map of discursive alignments in which regional and gender identities often prove more salient than formal party affiliation. More broadly, this work offers a robust methodology for analyzing parliamentary discourse as a multidimensional phenomenon that complements traditional vote-based approaches.

Chinese Translation

立法行为的分析通常依赖于投票记录，忽视了政治演讲中丰富的语义和修辞内容。本文提出了关于议会话语的三个互补问题：如何表达、说了什么以及谁以相似的方式发言。为了解答这些问题，我们引入了一种可扩展且具有普适性的计算框架，结合了历时风格分析、上下文主题建模和议员演讲的语义聚类。我们将该框架应用于巴西众议院的大规模案例研究，使用了2003年至2025年间超过450,000篇演讲的语料库。我们的结果显示，演讲风格在长期内向更短、更直接的方向转变，立法议程在应对国家危机时发生显著重定向，并且在话语对齐的细致图谱中，地区和性别身份往往比正式的政党隶属关系更为显著。更广泛地说，这项工作为分析议会话语作为一种多维现象提供了一种稳健的方法论，补充了基于投票的传统方法。

cs.CL / 62 / 2604.21901

GiVA: Gradient-Informed Bases for Vector-Based Adaptation

GiVA：基于梯度的信息基础用于向量化适应

Gangwar, Neeraj, Deshmukh, Rishabh, Shavlovsky, Michael, Li, Hancao, Mittal, Vivek, Ying, Lexing, Kani, Nickvash

Abstract

As model sizes continue to grow, parameter-efficient fine-tuning has emerged as a powerful alternative to full fine-tuning. While LoRA is widely adopted among these methods, recent research has explored vector-based adaptation methods due to their extreme parameter efficiency. However, these methods typically require substantially higher ranks than LoRA to match its performance, leading to increased training costs. This work introduces GiVA, a gradient-based initialization strategy for vector-based adaptation. It achieves training times comparable to LoRA and maintains the extreme parameter efficiency of vector-based adaptation. We evaluate GiVA across diverse benchmarks, including natural language understanding, natural language generation, and image classification. Experiments show that our approach consistently outperforms or achieves performance competitive with existing vector-based adaptation methods and LoRA while reducing rank requirements by a factor of eight (8×).

Chinese Translation

随着模型规模的不断扩大，参数高效的微调已成为全微调的强大替代方案。在这些方法中，LoRA被广泛采用，但最近的研究探索了基于向量的适应方法，因为它们具有极高的参数效率。然而，这些方法通常需要比LoRA显著更高的秩才能匹配其性能，从而导致训练成本增加。本研究提出了GiVA，一种用于基于向量适应的基于梯度的初始化策略。它实现了与LoRA相当的训练时间，并保持了基于向量适应的极高参数效率。我们在多种基准测试中评估了GiVA，包括自然语言理解、自然语言生成和图像分类。实验表明，我们的方法在性能上始终优于或与现有的基于向量的适应方法和LoRA具有竞争力，同时将秩要求降低了八倍（8×）。

cs.CL / 63 / 2604.21916

MathDuels: Evaluating LLMs as Problem Posers and Solvers

MathDuels：评估大型语言模型作为问题提出者和解决者的能力

Xu, Zhiqiu, Jin, Shibo, Arya, Shreya, Naik, Mayur

Abstract

As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions. A Rasch model (Rasch, 1993) jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model's authored problems. Experiments across 19 frontier models reveal that authoring and solving capabilities are partially decoupled, and that dual-role evaluation reveals capability separations invisible in single-role benchmarks. As newer models enter the arena, they produce problems that defeat previously dominant solvers, so the benchmark's difficulty co-evolves with participant strength rather than saturating at a fixed ceiling. We host a public leaderboard that updates as new models are released.

Chinese Translation

随着前沿语言模型在静态数学基准测试中达到接近极限的性能，现有评估越来越无法区分模型的能力，主要因为它们仅将模型视为固定问题集的解决者。我们引入了MathDuels，一个自我对弈的基准，其中模型扮演双重角色：每个模型在对抗性提示下创作数学问题，并解决其他参与者创作的问题。问题通过三阶段生成流程（元提示、问题生成和难度增强）产生，并由独立验证者验证，以排除不当问题。Rasch模型（Rasch, 1993）共同估计解决者能力和问题难度；作者质量则源自每个模型创作问题的难度。对19个前沿模型的实验表明，创作和解决能力部分解耦，双重角色评估揭示了在单一角色基准中不可见的能力差异。随着新模型的加入，它们产生的问题能够击败先前占主导地位的解决者，因此基准的难度与参与者的实力共同演化，而不是在固定的上限处饱和。我们设立了一个公共排行榜，随着新模型的发布而更新。
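
The Rasch model mentioned in the abstract assumes that the probability a solver answers a problem correctly is a logistic function of (solver ability minus problem difficulty). A minimal sketch of joint estimation by gradient ascent follows. The response matrix, learning rate, step count, and the small L2 penalty (added so that all-correct or all-wrong rows stay finite) are assumptions for illustration; the paper's actual fitting procedure is not described in the abstract.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(responses, steps=2000, lr=0.1, l2=0.01):
    """Jointly estimate abilities and difficulties from a 0/1 response matrix.

    responses[s][p] is 1 if solver s solved problem p, else 0.
    Model: P(correct) = sigmoid(ability[s] - difficulty[p]).
    """
    n_solvers = len(responses)
    n_problems = len(responses[0])
    ability = [0.0] * n_solvers
    difficulty = [0.0] * n_problems
    for _ in range(steps):
        # Gradient of the L2-penalised log-likelihood.
        grad_a = [-l2 * a for a in ability]
        grad_d = [-l2 * d for d in difficulty]
        for s in range(n_solvers):
            for p in range(n_problems):
                residual = responses[s][p] - sigmoid(ability[s] - difficulty[p])
                grad_a[s] += residual  # unexpected successes raise ability
                grad_d[p] -= residual  # unexpected successes lower difficulty
        ability = [a + lr * g for a, g in zip(ability, grad_a)]
        difficulty = [d + lr * g for d, g in zip(difficulty, grad_d)]
    return ability, difficulty


# Hypothetical outcomes: rows are solvers, columns are problems.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
ability, difficulty = fit_rasch(responses)
print([round(a, 2) for a in ability])
print([round(d, 2) for d in difficulty])
```

Because abilities and difficulties share one scale, a problem that defeats strong solvers is estimated as harder than one that defeats only weak solvers, which is what lets problem difficulty feed back into a score for the problem's author.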

cs.CL / 64 / 2604.21928

Evaluation of Automatic Speech Recognition Using Generative Large Language Models

基于生成性大型语言模型的自动语音识别评估

Bañeras-Roux, Thibault, Kumar, Shashi, Khalil, Driss, Burdisso, Sergio, Motlicek, Petr, Liu, Shiran, Rouvier, Mickael, Wottawa, Jane, Dufour, Richard

Abstract

Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92-94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.

Chinese Translation

自动语音识别（ASR）传统上使用词错误率（WER）进行评估，这一指标对意义不敏感。基于嵌入的语义度量与人类感知的相关性更高，但基于解码器的大型语言模型（LLMs）在这一任务中的应用仍然未被充分探索。本文通过三种方法评估其相关性：（1）在两个候选假设中选择最佳假设，（2）使用生成性嵌入计算语义距离，以及（3）对错误进行定性分类。在HATS数据集上，最佳LLMs在假设选择方面与人工标注者的协议达到92%至94%，而WER的协议仅为63%，同时也优于语义度量。来自基于解码器的LLMs的嵌入表现与编码器模型相当。最后，LLMs为可解释和语义化的ASR评估提供了一个有前景的方向。
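
The claim that WER is insensitive to meaning can be illustrated with a small sketch. WER is the word-level edit distance between reference and hypothesis divided by the reference length. The example sentences below are invented, and the lowercase whitespace tokenisation is a simplifying assumption rather than the paper's evaluation protocol.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over word lists (substitutions, deletions, insertions)."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word edit distance divided by reference length."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)


reference = "i do not like it"
meaning_changed = "i do like it"    # drops the negation
meaning_kept = "i don't like it"    # contraction, same meaning

print(wer(reference, meaning_changed))
print(wer(reference, meaning_kept))
```

In this example the hypothesis that inverts the meaning gets the lower error rate, because WER counts only word edits; this is the kind of case where a semantic or LLM-based judgment could disagree with WER.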

arXiv Papers

Design, Modelling and Experimental Evaluation of a Tendon-driven Wrist Abduction-Adduction Mechanism for an upper limb exoskeleton

A Tendon-Driven Wrist Abduction-Adduction Joint Improves Performance of a 5 DoF Upper Limb Exoskeleton -- Implementation and Experimental Evaluation

Clinical Evaluation of a Tongue-Controlled Wrist Abduction-Adduction Assistance in a 6-DoF Upper-Limb Exoskeleton for Individuals with ALS and SCI

A Survey of Legged Robotics in Non-Inertial Environments: Past, Present, and Future

Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

Neuro-Symbolic Manipulation Understanding with Enriched Semantic Event Chains

Impact-Aware Model Predictive Control for UAV Landing on a Heaving Platform

Self-Predictive Representation for Autonomous UAV Object-Goal Navigation

Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems

Full-Body Dynamic Safety for Robot Manipulators: 3D Poisson Safety Functions for CBF-Based Safety Filters

How VLAs (Really) Work In Open-World Environments

CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

Reasoning About Traversability: Language-Guided Off-Road 3D Trajectory Planning

FingerViP: Learning Real-World Dexterous Manipulation with Fingertip Visual Perception

PREVENT-JACK: Context Steering for Swarms of Long Heavy Articulated Vehicles

Learn Weightlessness: Imitate Non-Self-Stabilizing Motions on Humanoid Robot

RPG: Robust Policy Gating for Smooth Multi-Skill Transitions in Humanoid Fighting

A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration

A Replicable Robotics Awareness Method Using LLM-Enabled Robotics Interaction: Evidence from a Corporate Challenge

From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges

Ufil: A Unified Framework for Infrastructure-based Localization

MISTY: High-Throughput Motion Planning via Mixer-based Single-step Drifting

X2-N: A Transformable Wheel-legged Humanoid Robot with Dual-mode Locomotion and Manipulation

A Bayesian Reasoning Framework for Robotic Systems in Autonomous Casualty Triage

SLAM as a Stochastic Control Problem with Partial Information: Optimal Solutions and Rigorous Approximations

Effects of Swarm Size Variability on Operator Workload

A Compact Peristaltic Pump Based on Magneto-Elastic Hysteresis with Single Pneumatic Control

Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

Task-Driven Co-Design of Heterogeneous Multi-Robot Systems

VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

Long-Horizon Manipulation via Trace-Conditioned VLA Planning

Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry

Linear Image Generation by Synthesizing Exposure Brackets

Micro-DualNet: Dual-Path Spatio-Temporal Network for Micro-Action Recognition

Unlocking Multi-Spectral Data for Multi-Modal Models with Guided Inputs and Chain-of-Thought Reasoning

Projected Gradient Unlearning for Text-to-Image Diffusion Models: Defending Against Concept Revival Attacks

StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling

Clinically-Informed Modeling for Pediatric Brain Tumor Classification from Whole-Slide Histopathology Images

Optimizing Diffusion Priors with a Single Observation

Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery

Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance

Materialistic RIR: Material Conditioned Realistic RIR Generation

HyperFM: An Efficient Hyperspectral Foundation Model with Spectral Grouping

WFM: 3D Wavelet Flow Matching for Ultrafast Multi-Modal MRI Synthesis

Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment

WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images

SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

A Probabilistic Framework for Improving Dense Object Detection in Underwater Image Data via Annealing-Based Data Augmentation

Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

UAU-Net: Uncertainty-aware Representation Learning and Evidential Classification for Facial Action Unit Detection

LatRef-Diff: Latent and Reference-Guided Diffusion for Facial Attribute Editing and Style Manipulation

ImageHD: Energy-Efficient On-Device Continual Learning of Visual Representations via Hyperdimensional Computing

AttDiff-GAN: A Hybrid Diffusion-GAN Framework for Facial Attribute Editing

GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA

Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation

An Interpretable Vision Transformer Framework for Automated Brain Tumor Classification

The First Challenge on Remote Sensing Infrared Image Super-Resolution at NTIRE 2026: Benchmark Results and Method Overview

PLAS-Net: Pixel-Level Area Segmentation for UAV-Based Beach Litter Monitoring

FryNet: Dual-Stream Adversarial Fusion for Non-Destructive Frying Oil Oxidation Assessment

Temporal Prototyping and Hierarchical Alignment for Unsupervised Video-based Visible-Infrared Person Re-Identification

MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment

Teacher-Guided Routing for Sparse Vision Mixture-of-Experts

Latent Denoising Improves Visual Alignment in Large Multimodal Models

Trust-SSL: Additive-Residual Selective Invariance for Robust Aerial Self-Supervised Learning

SparseGF: A Height-Aware Sparse Segmentation Framework with Context Compression for Robust Ground Filtering Across Urban to Natural Scenes

Prototype-Based Test-Time Adaptation of Vision-Language Models

KD-CVG: A Knowledge-Driven Approach for Creative Video Generation

EdgeFormer: local patch-based edge detection transformer on point clouds

VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought

You Only Gaussian Once: Controllable 3D Gaussian Splatting for Ultra-Densely Sampled Scenes

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

Pre-process for segmentation task with nonlinear diffusion filters

UHR-DETR: Efficient End-to-End Small Object Detection for Ultra-High-Resolution Remote Sensing Imagery

2L-LSH: A Locality-Sensitive Hash Function-Based Method For Rapid Point Cloud Indexing

VARestorer: One-Step VAR Distillation for Real-World Image Super-Resolution

Instance-level Visual Active Tracking with Occlusion-Aware Planning

Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision

ID-Eraser: Proactive Defense Against Face Swapping via Identity Perturbation