cs.RO / 1 / 2603.15757
You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector
Abstract
What happens when a pretrained generative robot policy is provided a constant initial noise as input, rather than repeatedly sampling it from a Gaussian? We demonstrate that the performance of a pretrained, frozen diffusion or flow matching policy can be improved with respect to a downstream reward by replacing the sampling of initial noise from the prior distribution (typically an isotropic Gaussian) with a well-chosen, constant initial noise input -- a golden ticket. We propose a search method to find golden tickets using Monte-Carlo policy evaluation that keeps the pretrained policy frozen, does not train any new networks, and is applicable to all diffusion/flow matching policies (and therefore many VLAs). Our approach to policy improvement makes no assumptions beyond being able to inject initial noise into the policy and calculate (sparse) task rewards of episode rollouts, making it deployable with no additional infrastructure or models. Our method improves the performance of policies in 38 out of 43 tasks across simulated and real-world robot manipulation benchmarks, with relative improvements in success rate of up to 58% for some simulated tasks, and 60% within 50 search episodes for real-world tasks. We also show unique benefits of golden tickets for multi-task settings: the diversity of behaviors from different tickets naturally defines a Pareto frontier for balancing different objectives (e.g., speed, success rates); in VLAs, we find that a golden ticket optimized for one task can also boost performance in other related tasks. We release a codebase with pretrained policies and golden tickets for simulation benchmarks using VLAs, diffusion policies, and flow matching policies.
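The search the abstract describes can be sketched as a plain Monte-Carlo loop over candidate noise vectors (a hedged illustration: `policy_rollout`, the candidate count, and the episode count are placeholders, not the paper's exact procedure):

```python
import numpy as np

def find_golden_ticket(policy_rollout, noise_shape, num_candidates=16,
                       episodes_per_candidate=5, rng=None):
    """Monte-Carlo search for a constant initial-noise vector (a "golden ticket").

    policy_rollout(noise) is assumed to run one episode of the frozen
    generative policy seeded with `noise` and return a scalar task reward;
    the policy itself is never modified.
    """
    rng = np.random.default_rng(rng)
    best_noise, best_value = None, -np.inf
    for _ in range(num_candidates):
        # Candidate tickets are drawn once from the usual prior ...
        noise = rng.standard_normal(noise_shape)
        # ... then scored by averaging (possibly sparse) episode rewards.
        value = np.mean([policy_rollout(noise)
                         for _ in range(episodes_per_candidate)])
        if value > best_value:
            best_noise, best_value = noise, value
    return best_noise, best_value
```

At deployment the winning ticket simply replaces the per-episode Gaussian draw, which is why no new networks or infrastructure are needed.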
cs.RO / 2 / 2603.15759
Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation
Abstract
Simulation-to-real transfer remains a central challenge in robotics, as mismatches between simulated and real-world dynamics often lead to failures. While reinforcement learning offers a principled mechanism for adaptation, existing sim-to-real finetuning methods struggle with exploration and long-horizon credit assignment in the low-data regimes typical of real-world robotics. We introduce Simulation Distillation (SimDist), a sim-to-real framework that distills structural priors from a simulator into a latent world model and enables rapid real-world adaptation via online planning and supervised dynamics finetuning. By transferring reward and value models directly from simulation, SimDist provides dense planning signals from raw perception without requiring value learning during deployment. As a result, real-world adaptation reduces to short-horizon system identification, avoiding long-horizon credit assignment and enabling fast, stable improvement. Across precise manipulation and quadruped locomotion tasks, SimDist substantially outperforms prior methods in data efficiency, stability, and final performance. Project website and code: https://sim-dist.github.io/
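The deployment-time behavior described above, online planning with reward and value models transferred from simulation while only the short-horizon dynamics are finetuned, can be sketched as follows (all names are illustrative placeholders, not the SimDist API):

```python
def plan(dynamics, reward_fn, value_fn, z0, candidate_action_seqs, horizon):
    """Score each candidate action sequence by summed latent rewards plus a
    terminal value bootstrap. `dynamics` is the (finetuned) latent world
    model; `reward_fn` and `value_fn` are frozen transfers from simulation,
    so no value learning happens during deployment."""
    best_seq, best_score = None, float("-inf")
    for seq in candidate_action_seqs:
        z, score = z0, 0.0
        for a in seq[:horizon]:
            z = dynamics(z, a)        # latent rollout (system identification
            score += reward_fn(z, a)  # target); rewards come from simulation
        score += value_fn(z)          # terminal value, also from simulation
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq
```

Because only `dynamics` needs to match the real world, real-world adaptation reduces to short-horizon supervised finetuning of that one component.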
cs.RO / 3 / 2603.15771
CorrectionPlanner: Self-Correction Planner with Reinforcement Learning in Autonomous Driving
Abstract
Autonomous driving requires safe planning, but most learning-based planners lack explicit self-correction ability: once an unsafe action is proposed, there is no mechanism to correct it. Thus, we propose CorrectionPlanner, an autoregressive planner with self-correction that models planning as motion-token generation within a propose, evaluate, and correct loop. At each planning step, the policy proposes an action, namely a motion token, and a learned collision critic predicts whether it will induce a collision within a short horizon. If the critic predicts a collision, we retain the sequence of historical unsafe motion tokens as a self-correction trace, generate the next motion token conditioned on it, and repeat this process until a safe motion token is proposed or the safety criterion is met. This self-correction trace, consisting of all unsafe motion tokens, represents the planner's correction process in motion-token space, analogous to a reasoning trace in language models. We train the planner with imitation learning followed by model-based reinforcement learning using rollouts from a pretrained world model that realistically models agents' reactive behaviors. Closed-loop evaluations show that CorrectionPlanner reduces collision rate by over 20% on Waymax and achieves state-of-the-art planning scores on nuPlan.
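The propose-evaluate-correct loop can be sketched in a few lines (a hedged illustration; `policy` and `critic` stand in for the learned autoregressive planner and collision critic):

```python
def plan_step(policy, critic, obs, max_corrections=8):
    """One planning step of a propose-evaluate-correct loop.

    The trace of rejected motion tokens is kept and fed back to the policy,
    analogous to a reasoning trace in language models."""
    trace = []                        # self-correction trace of unsafe tokens
    for _ in range(max_corrections):
        token = policy(obs, trace)    # propose a motion token
        if not critic(obs, token):    # evaluate: collision predicted?
            return token, trace       # safe token found
        trace.append(token)           # correct: condition the next proposal
    return token, trace               # fall back to the last proposal
```

The `max_corrections` cap is an assumption added here so the loop terminates; the paper instead stops once a safety criterion is met.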
cs.RO / 4 / 2603.15789
Emergent Dexterity via Diverse Resets and Large-Scale Reinforcement Learning
Abstract
Reinforcement learning in massively parallel physics simulations has driven major progress in sim-to-real robot learning. However, current approaches remain brittle and task-specific, relying on extensive per-task engineering to design rewards, curricula, and demonstrations. Even with this engineering, they often fail on long-horizon, contact-rich manipulation tasks and do not meaningfully scale with compute, as performance quickly saturates when training revisits the same narrow regions of state space. We introduce \Method, a simple and scalable framework that enables on-policy reinforcement learning to robustly solve a broad class of dexterous manipulation tasks using a single reward function, fixed algorithm hyperparameters, no curricula, and no human demonstrations. Our key insight is that long-horizon exploration can be dramatically simplified by using simulator resets to systematically expose the RL algorithm to the diverse set of robot-object interactions which underlie dexterous manipulation. \Method programmatically generates such resets with minimal human input, converting additional compute directly into broader behavioral coverage and continued performance gains. We show that \Method gracefully scales to long-horizon dexterous manipulation tasks beyond the capabilities of existing approaches and is able to learn robust policies over significantly wider ranges of initial conditions than baselines. Finally, we distill \Method into visuomotor policies which display robust retrying behavior and substantially higher success rates than baselines when transferred to the real world zero-shot. Project webpage: https://omnireset.github.io
cs.RO / 5 / 2603.15826
Robust Dynamic Object Detection in Cluttered Indoor Scenes via Learned Spatiotemporal Cues
Abstract
Reliable dynamic object detection in cluttered environments remains a critical challenge for autonomous navigation. Purely geometric LiDAR pipelines that rely on clustering and heuristic filtering can miss dynamic obstacles when they move in close proximity to static structure or are only partially observed. Vision-augmented approaches can provide additional semantic cues, but are often limited by closed-set detectors and camera field-of-view constraints, reducing robustness to novel obstacles and out-of-frustum events. In this work, we present a LiDAR-only framework that fuses temporal occupancy-grid-based motion segmentation with a learned bird's-eye-view (BEV) dynamic prior. A fusion module prioritizes 3D detections when available, while using the learned dynamic grid to recover detections that would otherwise be lost due to proximity-induced false negatives. Experiments with motion-capture ground truth show our method achieves 28.67% higher recall and 18.50% higher F1 score than the state-of-the-art in substantially cluttered environments while maintaining comparable precision and position error.
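The fusion rule described above can be sketched as a simple late-fusion pass (illustrative names and a toy 1-D IoU; the actual system operates on BEV occupancy grids and 3D detections):

```python
def interval_iou(a, b):
    """Toy 1-D IoU for the demo below; a real system would use BEV boxes."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def fuse_detections(det3d, bev_dynamic_cells, iou_fn, min_overlap=0.1):
    """Late-fusion sketch: trust 3D detections when available, and promote
    learned BEV dynamic cells that no 3D detection explains, recovering
    proximity-induced false negatives. Threshold is illustrative."""
    fused = list(det3d)
    for cell in bev_dynamic_cells:
        if all(iou_fn(cell, d) < min_overlap for d in det3d):
            fused.append(cell)  # dynamic object missed by the geometric pipeline
    return fused
```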
cs.RO / 6 / 2603.15951
Gaze-Aware Task Progression Detection Framework for Human-Robot Interaction Using RGB Cameras
Abstract
In human-robot interaction (HRI), detecting a human's gaze helps robots interpret user attention and intent. However, most gaze detection approaches rely on specialized eye-tracking hardware, limiting deployment in everyday settings. Appearance-based gaze estimation methods remove this dependency by using standard RGB cameras, but their practicality in HRI remains underexplored. We present a calibration-free framework for detecting task progression when information is conveyed via integrated display interfaces. The framework uses only the robot's built-in monocular RGB camera (640x480 resolution) and state-of-the-art gaze estimation to monitor attention patterns. It leverages natural behavior, where users shift focus from task interfaces to the robot's face to signal task completion, formalized through three Areas of Interest (AOI): tablet, robot face, and elsewhere. Systematic parameter optimization identifies configurations that balance detection accuracy and interaction latency. We validate our framework in a "First Day at Work" scenario, comparing it to button-based interaction. Results show a task completion detection accuracy of 77.6%. Compared to button-based interaction, the proposed system exhibits slightly higher response latency but preserves information retention and significantly improves comfort, social presence, and perceived naturalness. Notably, most participants reported that they did not consciously use eye movements to guide the interaction, underscoring the intuitive role of gaze as a communicative cue. This work demonstrates the feasibility of intuitive, low-cost, RGB-only gaze-based HRI for natural and engaging interactions.
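The completion-detection logic the framework formalizes, a shift of attention from the tablet AOI to the robot-face AOI, can be sketched as a dwell-time rule (illustrative; the paper tunes such parameters systematically to balance accuracy and latency):

```python
def detect_completion(aoi_stream, dwell_frames=5):
    """Detect task completion from a per-frame AOI stream drawn from
    {'tablet', 'face', 'elsewhere'}: completion is signalled when the user,
    having attended the tablet, fixates the robot face for `dwell_frames`
    consecutive frames. Returns the frame index of detection, else None."""
    seen_tablet, face_run = False, 0
    for i, aoi in enumerate(aoi_stream):
        if aoi == "tablet":
            seen_tablet, face_run = True, 0
        elif aoi == "face" and seen_tablet:
            face_run += 1
            if face_run >= dwell_frames:
                return i
        else:
            face_run = 0
    return None
```

A larger `dwell_frames` trades response latency for fewer false positives, which is the accuracy/latency balance the parameter optimization explores.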
cs.RO / 7 / 2603.15956
ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors
Abstract
Learning generalizable and robust behavior cloning policies requires large volumes of high-quality robotics data. While human demonstrations (e.g., through teleoperation) serve as the standard source for expert behaviors, acquiring such data at scale in the real world is prohibitively expensive. This paper introduces ExpertGen, a framework that automates expert policy learning in simulation to enable scalable sim-to-real transfer. ExpertGen first initializes a behavior prior using a diffusion policy trained on imperfect demonstrations, which may be synthesized by large language models or provided by humans. Reinforcement learning is then used to steer this prior toward high task success by optimizing the diffusion model's initial noise while keeping the original policy frozen. By keeping the pretrained diffusion policy frozen, ExpertGen regularizes exploration to remain within safe, human-like behavior manifolds, while also enabling effective learning with only sparse rewards. Empirical evaluations on challenging manipulation benchmarks demonstrate that ExpertGen reliably produces high-quality expert policies with no reward engineering. On industrial assembly tasks, ExpertGen achieves a 90.5% overall success rate, while on long-horizon manipulation tasks it attains 85% overall success, outperforming all baseline methods. The resulting policies exhibit dexterous control and remain robust across diverse initial configurations and failure states. To validate sim-to-real transfer, the learned state-based expert policies are further distilled into visuomotor policies via DAgger and successfully deployed on real robotic hardware.
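The noise-optimization step can be illustrated with a derivative-free stand-in (a hedged sketch: the cross-entropy method substitutes here for the paper's RL procedure, and `rollout` is a placeholder returning a sparse episode reward):

```python
import numpy as np

def refine_noise_cem(rollout, dim, iters=10, pop=32, elite_frac=0.25, rng=0):
    """Steer a frozen diffusion policy by optimizing its initial noise with
    a derivative-free search. `rollout(z)` runs the frozen policy seeded
    with noise z and returns a scalar (possibly sparse) reward."""
    rng = np.random.default_rng(rng)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        z = mu + sigma * rng.standard_normal((pop, dim))   # sample candidates
        rewards = np.array([rollout(zi) for zi in z])
        elite = z[np.argsort(rewards)[-n_elite:]]          # keep the best
        mu = elite.mean(axis=0)                            # refit the search
        sigma = elite.std(axis=0) + 1e-3                   # distribution
    return mu
```

Because only the noise distribution moves, every candidate stays on the pretrained policy's behavior manifold, which is the regularization effect the abstract highlights.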
cs.RO / 8 / 2603.16013
Safety Case Patterns for VLA-based driving systems: Insights from SimLingo
Abstract
Vision-Language-Action (VLA)-based driving systems represent a significant paradigm shift in autonomous driving since, by combining traffic scene understanding, linguistic interpretation, and action generation, these systems enable more flexible, adaptive, and instruction-responsive driving behaviors. However, despite their growing adoption and potential to support socially responsible autonomous driving while understanding high-level human instructions, VLA-based driving systems may exhibit new types of hazardous behaviors. For example, the addition of natural language inputs (e.g., user or navigation instructions) to the multimodal control loop may lead to unpredictable and unsafe behaviors that could endanger vehicle occupants and pedestrians. Hence, assuring the safety of these systems is crucial to help build trust in their operations. To support this, we propose a novel safety case design approach called RAISE. Our approach introduces novel patterns tailored to instruction-based driving systems such as VLA-based driving systems, an extension of Hazard Analysis and Risk Assessment (HARA) detailing safe scenarios and their outcomes, and a design technique for creating the safety cases of VLA-based driving systems. A case study on SimLingo illustrates how our approach can be used to construct rigorous, evidence-based safety claims for this emerging class of autonomous driving systems.
cs.RO / 9 / 2603.16028
Geometry-Aligned LLM Fine-Tuning for Sequential Narrow-Opening Planning
Abstract
We study rigid-body motion planning through multiple sequential narrow openings, which requires long-horizon geometric reasoning because the configuration used to traverse an early opening constrains the set of reachable configurations for subsequent ones. To achieve this, we propose a geometry-aligned large language model (LLM) fine-tuning framework that generates fixed-length, machine-readable waypoint sequences that are both geometrically feasible and coordinated across openings. Our approach uses a bi-level training pipeline. First, we perform failure-driven LoRA supervised fine-tuning (SFT) on human demonstrations, which incorporates structured failure feedback to teach the model common failure modes and enforce the output format. Second, we refine the same LoRA adapters using Group Relative Policy Optimization (GRPO) with geometric verification: each sampled waypoint sequence is densified by a model-based planner and scored with a deterministic geometry-derived reward to achieve continuous-motion feasibility. To validate the effectiveness of our proposed method, we provide both quantitative and qualitative results from simulations. Our method achieves the highest success rate in both in-distribution and out-of-distribution environments and qualitatively exhibits long-horizon geometric reasoning by selecting exit poses that facilitate entry into subsequent openings.
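The group-relative scoring at the heart of the GRPO stage can be shown in a few lines (a generic sketch of GRPO's advantage normalization, not the paper's full training loop; the rewards themselves would come from the geometric verification described above):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled sequence's scalar reward
    against the mean and standard deviation of its own sampling group, so
    sequences compete only within the group drawn for the same prompt."""
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mu) / sd for r in rewards]
```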
cs.RO / 10 / 2603.16040
Compact Optical Single-axis Joint Torque Sensor Using Redundant Photo-Reflectors and Quadratic-Programming Calibration
Abstract
This study proposes a non-contact, photo-reflector-based joint torque sensor for precise joint-level torque control and safe physical interaction. Current-sensor-based torque estimation in many collaborative robots suffers from poor low-torque accuracy due to gearbox stiction/friction and current-torque nonlinearity, especially near static conditions. The proposed sensor optically measures micro-deformation of an elastic structure and employs a redundant array of photo-reflectors arranged in four directions to improve sensitivity and signal-to-noise ratio. We further present a quadratic-programming-based calibration method that exploits redundancy to suppress noise and enhance resolution compared to least-squares calibration. The sensor is implemented in a compact form factor (96 mm diameter, 12 mm thickness). Experiments demonstrate a maximum error of 0.083% FS and an RMS error of 0.0266 Nm for z-axis torque measurement. Calibration tests show that the proposed calibration achieves a 3-sigma resolution of 0.0224 Nm at 1 kHz without filtering, corresponding to a 2.14-times improvement over the least-squares baseline. Temperature-chamber characterization and rational-fitting-based compensation mitigate zero drift induced by MCU self-heating and motor heat. Motor-level validation via torque control and admittance control confirms improved low-torque tracking and disturbance robustness relative to current-sensor-based control.
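The least-squares baseline that the QP calibration is compared against can be sketched as a linear fit from the four redundant photo-reflector channels to reference torque (an illustrative baseline only; the paper's QP formulation additionally exploits the redundancy to suppress noise):

```python
import numpy as np

def calibrate_least_squares(readings, torques):
    """Fit an affine map from redundant photo-reflector readings (N x 4)
    to reference torque (N,). Returns the weights and a predictor."""
    A = np.hstack([readings, np.ones((len(readings), 1))])  # affine model
    w, *_ = np.linalg.lstsq(A, torques, rcond=None)
    def predict(x):
        return np.hstack([x, np.ones((len(x), 1))]) @ w
    return w, predict
```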
cs.RO / 11 / 2603.16050
The Era of End-to-End Autonomy: Transitioning from Rule-Based Driving to Large Driving Models
Abstract
Autonomous driving is undergoing a shift from modular rule-based pipelines toward end-to-end (E2E) learning systems. This paper examines this transition by tracing the evolution from classical sense-perceive-plan-control architectures to large driving models (LDMs) capable of mapping raw sensor input directly to driving actions. We analyze recent developments including Tesla's Full Self-Driving (FSD) V12-V14, Rivian's Unified Intelligence platform, NVIDIA Cosmos, and emerging commercial robotaxi deployments, focusing on architectural design, deployment strategies, safety considerations, and industry implications. A key emerging product category is supervised E2E driving, often referred to as FSD (Supervised) or L2++, which several manufacturers plan to deploy from 2026 onwards. These systems can perform most of the Dynamic Driving Task (DDT) in complex environments while requiring human supervision, shifting the driver's role to safety oversight. Early operational evidence suggests E2E learning handles the long-tail distribution of real-world driving scenarios and is becoming a dominant commercial strategy. We also discuss how similar architectural advances may extend beyond autonomous vehicles (AVs) to other embodied AI systems, including humanoid robotics.
cs.RO / 12 / 2603.16059
Ultrafast Sampling-based Kinodynamic Planning via Differential Flatness
Abstract
Motion planning under dynamics constraints, i.e., kinodynamic planning, enables safe robot operation by generating dynamically feasible trajectories that the robot can accurately track. For high-DoF robots such as manipulators, sampling-based motion planners are commonly used, especially for complex tasks in cluttered environments. However, enforcing constraints on robot dynamics in such planners requires solving either challenging two-point boundary value problems (BVPs) or propagating robot dynamics over time, both of which are computational bottlenecks that drastically increase planning times. Meanwhile, recent efforts have shown that sampling-based motion planners can generate plans in microseconds using parallelization, but are limited to geometric paths. This paper develops AkinoPDF, a fast parallelized sampling-based kinodynamic motion planning technique for a broad class of differentially flat robot systems, including manipulators, ground and aerial vehicles, and more. Differential flatness allows us to transform the motion planning problem from the original state space to a flat output space, where an analytical time-parameterized solution of the BVP and dynamics integration can be obtained. A trajectory in the flat output space is then converted back to a closed-form dynamically feasible trajectory in the original state space, enabling fast validation via "single instruction, multiple data" parallelism. Our method is fast, exact, and compatible with any sampling-based motion planner. We extensively verify the effectiveness of our approach in both simulated benchmarks and real experiments with cluttered and dynamic environments, requiring mere microseconds to milliseconds of planning time.
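The analytical BVP solution in the flat output space can be illustrated with the simplest case: a cubic in one flat output matching position and velocity at both endpoints (a minimal sketch; the paper's flat outputs and boundary conditions are system-specific):

```python
import numpy as np

def cubic_bvp(x0, v0, xf, vf, T):
    """Closed-form cubic solving the two-point BVP in a flat output:
    p(0)=x0, p'(0)=v0, p(T)=xf, p'(T)=vf. Returns polynomial coefficients
    [a0, a1, a2, a3] for p(t) = a0 + a1 t + a2 t^2 + a3 t^3."""
    a2 = (3 * (xf - x0) - (2 * v0 + vf) * T) / T**2
    a3 = (-2 * (xf - x0) + (v0 + vf) * T) / T**3
    return np.array([x0, v0, a2, a3])

def eval_poly(coeffs, t):
    """Vectorized evaluation at many sample times at once, which is what
    makes SIMD-style parallel trajectory validation cheap."""
    t = np.asarray(t, dtype=float)
    return coeffs[0] + coeffs[1] * t + coeffs[2] * t**2 + coeffs[3] * t**3
```

Because the solution is closed-form, no numerical BVP solve or dynamics integration is needed per edge, which is the bottleneck the paper removes.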
cs.RO / 13 / 2603.16065
Large Reward Models: Generalizable Online Robot Reward Generation with Vision-Language Models
Abstract
Reinforcement Learning (RL) has shown great potential in refining robotic manipulation policies, yet its efficacy remains strongly bottlenecked by the difficulty of designing generalizable reward functions. In this paper, we propose a framework for online policy refinement by adapting foundation VLMs into online reward generators. We develop a robust, scalable reward model based on a state-of-the-art VLM, trained on a large-scale, multi-source dataset encompassing real-world robot trajectories, human-object interactions, and diverse simulated environments. Unlike prior approaches that evaluate entire trajectories post-hoc, our method leverages the VLM to formulate a multifaceted reward signal comprising process, completion, and temporal contrastive rewards based on current visual observations. Initializing with a base policy trained via Imitation Learning (IL), we employ these VLM rewards to guide the model to correct sub-optimal behaviors in a closed-loop manner. We evaluate our framework on challenging long-horizon manipulation benchmarks requiring sequential execution and precise control. Crucially, our reward model operates in a purely zero-shot manner within these test environments. Experimental results demonstrate that our method significantly improves the success rate of the initial IL policy within just 30 RL iterations, demonstrating remarkable sample efficiency. This empirical evidence highlights that VLM-generated signals can provide reliable feedback to resolve execution errors, effectively eliminating the need for manual reward engineering and facilitating efficient online refinement for robot learning.
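The three reward facets can be combined as, for instance (a hedged sketch; the weights, embedding distances, and the interface to the VLM scores are illustrative assumptions, not the paper's exact formulation):

```python
import math

def combined_reward(process_r, completion_r, z_t, z_goal, z_prev,
                    w=(1.0, 1.0, 0.5)):
    """Combine process and completion scores (assumed VLM outputs in [0, 1])
    with a temporal contrastive term that is positive when the current
    observation embedding z_t is closer to the goal embedding than the
    previous step's embedding z_prev was."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    temporal = dist(z_prev, z_goal) - dist(z_t, z_goal)  # > 0 means progress
    return w[0] * process_r + w[1] * completion_r + w[2] * temporal
```

Computing the signal from the current visual observation at every step, rather than scoring whole trajectories post-hoc, is what lets the reward guide closed-loop correction of sub-optimal behaviors.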
cs.RO / 14 / 2603.16086
Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation
Abstract
While recent Vision-Language-Action (VLA) models have begun to incorporate audio, they typically treat sound as static pre-execution prompts or focus exclusively on human speech. This leaves a significant gap in real-time, sound-centric manipulation where fleeting environmental acoustics provide critical state verification during task execution. Consequently, key sounds are easily missed due to low-frequency updates or system latency. This problem is exacerbated by action chunking with open-loop execution, which creates a Blind Execution Interval where acoustic events are lost between discrete audio observation windows. Recognizing the necessity of continuous auditory awareness, we formalize Vision-Sound-Language-Action (VSLA) as a continuous control paradigm conditioned on vision, streaming audio, language, and proprioception under delayed decision loops. As an instantiation, we introduce HEAR, a VSLA framework integrating four components: (i) a streaming Historizer to maintain a compact, causal audio context across execution gaps; (ii) an Envisioner adapted from omni foundation models to reason over multi-sensory inputs; (iii) an Advancer, formulated as an audio world model, to learn temporal dynamics by predicting near-future audio codes; and (iv) a flow-matching Realizer policy to generate smooth action chunks. To address the scarcity of pretraining data and evaluations for VSLA, we construct OpenX-Sound for pretraining, alongside HEAR-Bench, the first sound-centric manipulation benchmark with strict causal timing rules. Our results suggest that robust sound-centric manipulation necessitates causal persistence and explicit temporal learning. This framework provides a practical step toward multi-sensory foundation models for embodied agents, enabling robots to perceive and interact with dynamic environments. Code and videos are available at https://hear.irmv.top.
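The Historizer's role, a compact causal audio context that persists across action-chunk execution gaps, can be reduced to a bounded buffer (an illustrative reduction; the actual component maintains learned audio features and a causal encoder, not raw frames):

```python
from collections import deque

class Historizer:
    """Minimal sketch of a streaming, causal audio context. Frames pushed
    during the "blind execution interval" between decision points are
    retained, so acoustic events are not lost to open-loop execution."""
    def __init__(self, max_frames=64):
        self.buf = deque(maxlen=max_frames)  # bounded, oldest frames dropped

    def push(self, frame):
        self.buf.append(frame)

    def context(self):
        return list(self.buf)  # compact causal context for the next decision
```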
cs.RO / 15 / 2603.16118
SE(3)-LIO: Smooth IMU Propagation With Jointly Distributed Poses on SE(3) Manifold for Accurate and Robust LiDAR-Inertial Odometry
Abstract
For accurate odometry estimation, an inertial measurement unit (IMU) is widely used owing to its high-rate measurements, from which motion information can be obtained through IMU propagation. In this paper, we address the limitations of existing IMU propagation methods in terms of motion prediction and motion compensation. In motion prediction, the existing methods typically represent a 6-DoF pose by separating rotation and translation and propagate them on their respective manifolds, so that rotational variation is not effectively incorporated into translation propagation. During motion compensation, the relative transformation between predicted poses is used to compensate motion-induced distortion in other measurements, while inherent errors in the predicted poses introduce uncertainty in the relative transformation. To tackle these challenges, we represent and propagate the pose on the SE(3) manifold, where the propagated translation properly accounts for rotational variation. Furthermore, we precisely characterize the relative transformation uncertainty by considering the correlation between predicted poses, and incorporate this uncertainty into the measurement noise during motion compensation. To this end, we propose a LiDAR-inertial odometry (LIO), referred to as SE(3)-LIO, that integrates the proposed IMU propagation and uncertainty-aware motion compensation (UAMC). We validate the effectiveness of SE(3)-LIO on diverse datasets. Our source code and additional material are available at: https://se3-lio.github.io/.
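The core idea, propagating the pose on SE(3) so translation picks up rotational variation, rests on the SE(3) exponential map, sketched here in closed form (standard textbook formulas, not the paper's full propagation pipeline):

```python
import numpy as np

def se3_exp(omega, v):
    """Closed-form exponential map from a twist (omega, v) in se(3) to a
    4x4 homogeneous transform. The translation V @ v is coupled to the
    rotation through the left Jacobian V, unlike propagating rotation and
    translation on separate manifolds."""
    omega = np.asarray(omega, dtype=float)
    theta = np.linalg.norm(omega)
    W = np.array([[0.0, -omega[2], omega[1]],
                  [omega[2], 0.0, -omega[0]],
                  [-omega[1], omega[0], 0.0]])
    if theta < 1e-9:  # small-angle limit
        R, V = np.eye(3) + W, np.eye(3) + 0.5 * W
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + A * W + B * (W @ W)  # Rodrigues' formula
        V = np.eye(3) + B * W + C * (W @ W)  # left Jacobian of SO(3)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ np.asarray(v, dtype=float)
    return T
```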
Chinese Translation
在准确估计里程计时,惯性测量单元(IMU)因其高频率测量而被广泛使用,这些测量可以通过IMU传播获得运动信息。本文针对现有IMU传播方法在运动预测和运动补偿方面的局限性进行了探讨。在运动预测中,现有方法通常通过将旋转和平移分开来表示6自由度姿态,并在各自的流形上进行传播,因此旋转变化未能有效地融入平移传播中。在运动补偿过程中,预测姿态之间的相对变换被用来补偿其他测量中的运动引起的失真,而预测姿态中的固有误差则引入了相对变换的不确定性。为了解决这些挑战,我们在SE(3)流形上表示和传播姿态,其中传播的平移适当地考虑了旋转变化。此外,我们通过考虑预测姿态之间的相关性,精确表征相对变换的不确定性,并在运动补偿过程中将这种不确定性纳入测量噪声。为此,我们提出了一种激光雷达-惯性里程计(LIO),称为SE(3)-LIO,集成了所提出的IMU传播和不确定性感知运动补偿(UAMC)。我们在多样化的数据集上验证了SE(3)-LIO的有效性。我们的源代码和其他材料可在以下网址获取:https://se3-lio.github.io/
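The core distinction the abstract draws, propagating the pose jointly on SE(3) rather than separately on SO(3) and translation, can be illustrated with the SE(3) exponential map, where the left Jacobian V couples the rotational motion into the translational update. A minimal NumPy sketch (not the authors' code; the twist ordering xi = (omega, v) is an assumption):

```python
import numpy as np

def hat(w):
    """Skew-symmetric (cross-product) matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(xi, dt):
    """Exponential map from a twist xi = (omega, v) to a 4x4 pose on SE(3).
    Unlike separate SO(3) x R^3 propagation, the translation is multiplied
    by the left Jacobian V, so rotational variation enters the translation."""
    w, v = xi[:3] * dt, xi[3:] * dt
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-9:
        R, V = np.eye(3) + W, np.eye(3) + 0.5 * W  # small-angle limit
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (1.0 - A) / theta**2
        R = np.eye(3) + A * W + B * (W @ W)   # Rodrigues' formula
        V = np.eye(3) + B * W + C * (W @ W)   # left Jacobian of SO(3)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T
```

Propagation over one IMU interval is then `T_new = T_old @ se3_exp(xi, dt)`; for pure translation V reduces to the identity and the two schemes coincide.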
cs.RO / 16 / 2603.16166
SignNav: Leveraging Signage for Semantic Visual Navigation in Large-Scale Indoor Environments
SignNav:利用标识实现大规模室内环境中的语义视觉导航
Abstract
Humans routinely leverage semantic hints provided by signage to navigate to destinations within novel Large-Scale Indoor (LSI) environments, such as hospitals and airport terminals. However, this capability remains underexplored within the field of embodied navigation. This paper introduces a novel embodied navigation task, SignNav, which requires the agent to interpret semantic hints from signage and reason about subsequent actions based on the current observation. To facilitate research in this domain, we construct the LSI-Dataset for the training and evaluation of various SignNav agents. Dynamically changing semantic hints and sparse placement of signage in LSI environments present significant challenges to the SignNav task. To address these challenges, we propose the Spatial-Temporal Aware Transformer (START) model for end-to-end decision-making. The spatial-aware module grounds the semantic hints of signage in the physical world, while the temporal-aware module captures long-range dependencies between historical states and the current observation. Leveraging a two-stage training strategy with Dataset Aggregation (DAgger), our approach achieves state-of-the-art performance, recording an 80% Success Rate (SR) and 0.74 NDTW on the val-unseen split. Real-world deployment further demonstrates the practicality of our method in physical environments without a pre-built map.
Chinese Translation
人类通常利用标识提供的语义线索在新的大规模室内(Large-Scale Indoor, LSI)环境(如医院和机场航站楼)中导航到目的地。然而,这种能力在具身导航领域内仍然未得到充分探索。本文提出了一项新的具身导航任务——SignNav,该任务要求智能体解读标识的语义线索,并根据当前观察推理出后续动作。为了促进该领域的研究,我们构建了LSI-Dataset,以训练和评估各种SignNav智能体。在LSI环境中,动态变化的语义线索和标识稀疏的布局给SignNav任务带来了显著挑战。为了解决这些挑战,我们提出了空间-时间感知变换器(Spatial-Temporal Aware Transformer, START)模型以实现端到端的决策制定。空间感知模块将标识的语义线索与物理世界对接,而时间感知模块则捕捉历史状态和当前观察之间的长程依赖性。通过利用结合数据集聚合(Dataset Aggregation, DAgger)的两阶段训练策略,我们的方法达到了最先进的性能,记录了80%的成功率(Success Rate, SR)以及0.74的归一化动态时间规整(Normalized Dynamic Time Warping, NDTW)值。此外,实际部署进一步证明了我们的方法在没有预先构建地图的物理环境中的可行性。
cs.RO / 17 / 2603.16180
Enforcing Task-Specified Compliance Bounds for Humanoids via Anisotropic Lipschitz-Constrained Policies
通过各向异性Lipschitz约束策略强制执行人形机器人任务特定的柔顺性边界
Abstract
Reinforcement learning (RL) has demonstrated substantial potential for humanoid bipedal locomotion and the control of complex motions. To cope with oscillations and impacts induced by environmental interactions, compliant control is widely regarded as an effective remedy. However, the model-free nature of RL makes it difficult to impose task-specified and quantitatively verifiable compliance objectives, and classical model-based stiffness designs are not directly applicable. Lipschitz-Constrained Policies (LCP), which regularize the local sensitivity of a policy via gradient penalties, have recently been used to smooth humanoid motions. Nevertheless, existing LCP-based methods typically employ a single scalar Lipschitz budget and lack an explicit connection to physically meaningful compliance specifications in real-world systems. In this study, we propose an anisotropic Lipschitz-constrained policy (ALCP) that maps a task-space stiffness upper bound to a state-dependent Lipschitz-style constraint on the policy Jacobian. The resulting constraint is enforced during RL training via a hinge-squared spectral-norm penalty, preserving physical interpretability while enabling direction-dependent compliance. Experiments on humanoid robots show that ALCP improves locomotion stability and impact robustness, while reducing oscillations and energy usage.
Chinese Translation
强化学习(RL)在类人双足行走和复杂运动控制方面展现了巨大的潜力。为了应对由环境交互引起的振荡和冲击,柔顺控制被广泛认为是一种有效的解决方案。然而,RL的无模型特性使得施加任务特定且可定量验证的柔顺性目标变得困难,而经典的基于模型的刚度设计并不直接适用。最近,Lipschitz约束策略(LCP)通过梯度惩罚来规范策略的局部敏感性,已被用于平滑类人运动。然而,现有的基于LCP的方法通常采用单一标量Lipschitz预算,并且缺乏与现实系统中物理意义明确的柔顺性规范之间的显式联系。在本研究中,我们提出了一种各向异性Lipschitz约束策略(ALCP),该策略将任务空间刚度上限映射到策略雅可比矩阵的状态依赖型Lipschitz风格约束。通过施加铰链平方谱范数惩罚,在RL训练过程中强制执行该约束,从而保持物理可解释性,同时实现方向相关的柔顺性。在类人机器人上的实验表明,ALCP提高了行走稳定性和抗冲击能力,同时减少了振荡和能量消耗。
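The hinge-squared spectral-norm penalty on the policy Jacobian can be sketched in a few lines. The matrix shapes and the direction-weighting matrices below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def hinge_sq_spectral_penalty(J, k_max):
    """Hinge-squared penalty on the largest singular value of a policy
    Jacobian J: zero while the local gain stays below the budget k_max,
    quadratic once it exceeds it (the LCP-style scalar budget)."""
    sigma = np.linalg.svd(J, compute_uv=False)[0]  # spectral norm of J
    return max(0.0, sigma - k_max) ** 2

def anisotropic_penalty(J, W_out, W_in, budget=1.0):
    """Anisotropic variant: weight output/input directions before taking
    the spectral norm, so the effective stiffness bound can differ per
    task-space axis. W_out and W_in are hypothetical weighting matrices."""
    sigma = np.linalg.svd(W_out @ J @ W_in, compute_uv=False)[0]
    return max(0.0, sigma - budget) ** 2
```

In training, such a penalty would be added to the RL loss with the Jacobian of the policy network; here plain NumPy SVD stands in for that machinery.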
cs.RO / 18 / 2603.16196
PanguMotion: Continuous Driving Motion Forecasting with Pangu Transformers
PanguMotion:基于Pangu Transformers的连续驾驶运动预测
Abstract
Motion forecasting is a core task in autonomous driving systems, aiming to accurately predict the future trajectories of surrounding agents to ensure driving safety. Existing methods typically process discrete driving scenes independently, neglecting the temporal continuity and historical context correlations inherent in real-world driving environments. This paper proposes PanguMotion, a motion forecasting framework for continuous driving scenarios that integrates Transformer blocks from the Pangu-1B large language model as feature enhancement modules into autonomous driving motion prediction architectures. We conduct experiments on the Argoverse 2 datasets processed by the RealMotion data reorganization strategy, transforming each independent scene into a continuous sequence to mimic real-world driving scenarios.
Chinese Translation
运动预测是自动驾驶系统中的核心任务,旨在准确预测周围代理的未来轨迹以确保驾驶安全。现有方法通常独立处理离散的驾驶场景,忽视了现实驾驶环境中固有的时间连续性和历史上下文相关性。本文提出了PanguMotion,一个针对连续驾驶场景的运动预测框架,该框架将Pangu-1B大型语言模型中的Transformer模块作为特征增强模块集成到自动驾驶运动预测架构中。我们在经过RealMotion数据重组策略处理的Argoverse 2数据集上进行了实验,将每个独立场景转化为连续序列,以模拟现实驾驶场景。
cs.RO / 19 / 2603.16218
Enabling Dynamic Tracking in Vision-Language-Action Models via Time-Discrete and Time-Continuous Velocity Feedforward
通过时间离散和时间连续速度前馈实现视觉-语言-动作模型中的动态跟踪
Abstract
While vision-language-action (VLA) models have shown great promise for robot manipulation, their deployment on rigid industrial robots remains challenging due to the inherent trade-off between compliance and responsiveness. Standard Behavior Cloning (BC) approaches predict discrete poses at low frequencies, omitting the velocity and acceleration feedforward terms typically used by low-level compliant controllers. This forces a reliance on high stiffness for accurate tracking, thereby sacrificing safe contact dynamics. In this paper, we demonstrate the importance of integrating velocity feedforward terms into VLA policies to resolve this trade-off. We propose two methods for extracting velocity targets from VLAs: a time-discrete finite-difference approximation that serves as a highly effective bridge for existing models, and a continuous Cubic B-Spline action space that natively yields $C^2$ continuous trajectories for high-frequency control. Crucially, both approaches are strictly model-agnostic and compatible with any standard action-chunking architecture, requiring modifications only to teleoperation, data processing, and the low-level controller. We fine-tune the $\pi_{0.5}$ model and evaluate both of our approaches on a demanding, contact-rich cube-in-hole task. Our results indicate that incorporating the velocity feedforward term via finite differences significantly improves task execution speed, while the continuous B-Spline approach maintains high overall success rates and provides a foundation for smoother higher-order derivatives without compromising compliance.
Chinese Translation
尽管视觉-语言-动作(VLA)模型在机器人操控中展现出巨大的潜力,但由于柔顺性与响应性之间固有的权衡,其在刚性工业机器人上的应用仍然面临挑战。标准的行为克隆(BC)方法以低频率预测离散姿态,忽略了通常由低层柔顺控制器使用的速度和加速度前馈项。这迫使系统依赖高刚度以实现准确跟踪,从而牺牲了安全的接触动力学。在本文中,我们展示了将速度前馈项整合到VLA策略中的重要性,以解决这一权衡。我们提出了两种从VLA中提取速度目标的方法:一种是时间离散的有限差分近似,作为现有模型的高效桥梁;另一种是连续的三次B样条(Cubic B-Spline)动作空间,能够原生生成$C^2$连续轨迹以实现高频控制。重要的是,这两种方法都是严格模型无关的,并与任何标准的动作分块架构兼容,仅需对遥操作、数据处理和低层控制器进行修改。我们对$\pi_{0.5}$模型进行了微调,并在一个要求高、接触丰富的立方体入孔任务上评估了我们的两种方法。我们的结果表明,通过有限差分引入速度前馈项显著提高了任务执行速度,而连续的B样条方法则保持了高整体成功率,并为更平滑的高阶导数提供了基础,而不影响柔顺性。
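The time-discrete finite-difference bridge is straightforward to sketch. Assuming the action chunk is a (T, D) array of pose targets sampled at a fixed period dt (a hypothetical interface, not confirmed by the abstract), velocity feedforward targets follow directly from finite differences:

```python
import numpy as np

def velocity_feedforward(chunk, dt):
    """Derive velocity targets from a predicted action chunk of poses via
    finite differences: central differences in the interior, one-sided at
    the chunk boundaries. chunk is assumed to be (T, D) pose targets."""
    return np.gradient(chunk, dt, axis=0)
```

These velocity targets would then be passed alongside the pose targets to the low-level compliant controller; the B-spline alternative in the paper instead differentiates a fitted $C^2$ spline analytically.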
cs.RO / 20 / 2603.16228
PA-LVIO: Real-Time LiDAR-Visual-Inertial Odometry and Mapping with Pose-Only Bundle Adjustment
PA-LVIO:基于姿态的实时激光雷达-视觉-惯性里程计与地图构建
Abstract
Real-time LiDAR-visual-inertial odometry and mapping is crucial for navigation and planning tasks in intelligent transportation systems. This study presents a pose-only bundle adjustment (PA) LiDAR-visual-inertial odometry (LVIO), named PA-LVIO, to meet the urgent need for real-time navigation and mapping. The proposed PA framework for LiDAR and visual measurements is highly accurate and efficient, and it can derive reliable frame-to-frame constraints within multiple frames. A marginalization-free, frame-to-map (F2M) LiDAR measurement model is integrated into the state estimator to eliminate odometry drift. Meanwhile, an IMU-centric online spatial-temporal calibration is employed to obtain a pixel-wise LiDAR-camera alignment. With accurately estimated odometry and extrinsics, a high-quality, RGB-rendered point-cloud map can be built. Comprehensive experiments are conducted on both public and private datasets collected by a wheeled robot, an unmanned aerial vehicle (UAV), and handheld devices, comprising 28 sequences and more than 50 km of trajectories. The results demonstrate that the proposed PA-LVIO yields superior or comparable performance to state-of-the-art LVIO methods in terms of odometry accuracy and mapping quality. Besides, PA-LVIO can run in real-time on both a desktop PC and an onboard ARM computer.
Chinese Translation
实时激光雷达-视觉-惯性里程计与地图构建对于智能交通系统中的导航和规划任务至关重要。本研究提出了一种仅基于姿态的束调整(PA)激光雷达-视觉-惯性里程计(LVIO),命名为PA-LVIO,以满足实时导航和地图构建的迫切需求。所提出的PA框架在激光雷达和视觉测量方面具有高度的准确性和效率,能够在多个帧之间推导出可靠的帧间约束。一个无需边缘化的帧到地图(F2M)激光雷达测量模型被集成到状态估计器中,以消除里程计漂移。同时,采用以IMU为中心的在线时空标定以获得逐像素的激光雷达-相机对齐。通过准确估计的里程计和外参,可以构建高质量的RGB渲染点云地图。在由轮式机器人、无人机(UAV)和手持设备采集的公共和私有数据集(共28个序列、超过50公里轨迹)上进行了全面实验。充分的结果表明,所提出的PA-LVIO在里程计精度和地图质量方面优于或可与最先进的LVIO方法相媲美。此外,PA-LVIO可以在桌面PC和机载ARM计算机上实时运行。
cs.RO / 21 / 2603.16240
Industrial cuVSLAM Benchmark & Integration
工业cuVSLAM基准与集成
Abstract
This work presents a comprehensive benchmark evaluation of visual odometry (VO) and visual SLAM (VSLAM) systems for mobile robot navigation in real-world logistical environments. We compare multiple visual odometry approaches across controlled trajectories covering translational, rotational, and mixed motion patterns, as well as a large-scale production facility dataset spanning approximately 1.7 km. Performance is evaluated using Absolute Pose Error (APE) against ground truth from a Vicon motion capture system and a LiDAR-based SLAM reference. Our results show that a hybrid stack combining the cuVSLAM front-end with a custom SLAM back-end achieves the strongest mapping accuracy, motivating a deeper integration of cuVSLAM as the core VO component in our robotics stack. We further validate this integration by deploying and testing the cuVSLAM-based VO stack on an NVIDIA Jetson platform.
Chinese Translation
本研究对移动机器人在真实物流环境中的视觉里程计(VO)和视觉SLAM(VSLAM)系统进行了全面的基准评估。我们比较了多种视觉里程计方法,涵盖了受控轨迹下的平移、旋转及混合运动模式,以及一个覆盖约1.7公里的大规模生产设施数据集。性能以Vicon动作捕捉系统和基于LiDAR的SLAM参考提供的真实值为基准,通过绝对位姿误差(APE)进行评估。我们的结果表明,结合cuVSLAM前端与定制SLAM后端的混合堆栈实现了最高的建图精度,这促使我们将cuVSLAM作为机器人软件栈中的核心VO组件进行更深入的集成。我们通过在NVIDIA Jetson平台上部署和测试基于cuVSLAM的VO堆栈进一步验证了这一集成。
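The Absolute Pose Error (APE) metric used in the benchmark can be computed, in its simplest translational-RMSE form, as below. Trajectory alignment (e.g. Umeyama) and the rotational component are omitted for brevity, so the trajectories are assumed already time-synchronized and expressed in a common frame:

```python
import numpy as np

def ape_trans_rmse(est, gt):
    """Translational Absolute Pose Error (RMSE) between an estimated and a
    ground-truth trajectory, both given as (N, 3) position arrays in a
    common, time-aligned frame."""
    err = np.linalg.norm(est - gt, axis=1)   # per-frame position error
    return float(np.sqrt(np.mean(err ** 2)))
```

Tools such as the evo package compute the same statistic (plus mean, median, and max) after an optional SE(3)/Sim(3) alignment step.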
cs.RO / 22 / 2603.16270
MG-Grasp: Metric-Scale Geometric 6-DoF Grasping Framework with Sparse RGB Observations
MG-Grasp:基于稀疏RGB观测的度量尺度几何6自由度抓取框架
Abstract
Single-view RGB-D grasp detection remains a common choice in 6-DoF robotic grasping systems, which typically requires a depth sensor. While RGB-only 6-DoF grasp methods have been studied recently, their inaccurate geometric representation is not directly suitable for physically reliable robotic manipulation, thereby hindering reliable grasp generation. To address these limitations, we propose MG-Grasp, a novel depth-free 6-DoF grasping framework that achieves high-quality object grasping. Leveraging a two-view 3D foundation model with camera intrinsics/extrinsics, our method reconstructs metric-scale, multi-view-consistent dense point clouds from sparse RGB images and generates stable 6-DoF grasps. Experiments on the GraspNet-1Billion dataset and in the real world demonstrate that MG-Grasp achieves state-of-the-art (SOTA) grasp performance among RGB-based 6-DoF grasping methods.
Chinese Translation
单视图RGB-D抓取检测仍然是6自由度机器人抓取系统中的常见选择,通常需要深度传感器。虽然最近对仅使用RGB的6自由度抓取方法进行了研究,但其不准确的几何表示并不直接适合物理可靠的机器人操作,从而阻碍了可靠抓取的生成。为了解决这些限制,我们提出了MG-Grasp,一种新颖的无深度6自由度抓取框架,能够实现高质量的物体抓取。我们的方法利用具有相机内外参数的双视图3D基础模型,从稀疏RGB图像中重建度量尺度和多视图一致的稠密点云,并生成稳定的6自由度抓取。在GraspNet-1Billion数据集和现实世界中的实验表明,MG-Grasp在基于RGB的6自由度抓取方法中实现了最先进的(SOTA)抓取性能。
cs.RO / 23 / 2603.16273
GenZ-LIO: Generalizable LiDAR-Inertial Odometry Beyond Indoor--Outdoor Boundaries
GenZ-LIO:超越室内与室外边界的通用激光雷达-惯性里程计
Abstract
Light detection and ranging (LiDAR)-inertial odometry (LIO) enables accurate localization and mapping for autonomous navigation in various scenes. However, its performance remains sensitive to variations in spatial scale, which refers to the spatial extent of the scene reflected in the distribution of point ranges in a LiDAR scan. Transitions between confined indoor and expansive outdoor spaces induce substantial variations in point density, which may reduce robustness and computational efficiency. To address this issue, we propose GenZ-LIO, a LIO framework generalizable across both indoor and outdoor environments. GenZ-LIO comprises three key components. First, inspired by the principle of the proportional-integral-derivative (PID) controller, it adaptively regulates the voxel size for downsampling via feedback control, driving the voxelized point count toward a scale-informed setpoint while enabling stable and efficient processing across varying scene scales. Second, we formulate a hybrid-metric state update that jointly leverages point-to-plane and point-to-point residuals to mitigate LiDAR degeneracy arising from directionally insufficient geometric constraints. Third, to alleviate the computational burden introduced by point-to-point matching, we introduce a voxel-pruned correspondence search strategy that discards non-promising voxel candidates and reduces unnecessary computations. Experimental results demonstrate that GenZ-LIO achieves robust odometry estimation and improved computational efficiency across confined indoor, open outdoor, and transitional environments. Our code will be made publicly available upon publication.
Chinese Translation
激光探测与测距(LiDAR)-惯性里程计(LIO)能够在各种场景中实现精确的定位和地图构建,支持自主导航。然而,其性能仍然对空间尺度的变化敏感,空间尺度指的是在LiDAR扫描中点范围分布所反映的场景空间范围。室内封闭空间与室外广阔空间之间的过渡会引起显著的点密度变化,这可能降低系统的鲁棒性和计算效率。为了解决这一问题,我们提出了GenZ-LIO,一个可在室内和室外环境中通用的LIO框架。GenZ-LIO包含三个关键组件。首先,受比例-积分-微分(PID)控制器原理的启发,它通过反馈控制自适应地调节体素大小进行下采样,驱动体素点数朝向一个基于尺度的信息设定点,同时在不同场景尺度下实现稳定和高效的处理。其次,我们制定了一种混合度量状态更新方法,联合利用点到平面和点到点的残差,以减轻由于方向性几何约束不足而导致的LiDAR退化。第三,为了减轻点到点匹配带来的计算负担,我们引入了一种体素剪枝对应搜索策略,剔除不具潜力的体素候选,减少不必要的计算。实验结果表明,GenZ-LIO在封闭的室内、开放的室外和过渡环境中实现了鲁棒的里程计估计和提高的计算效率。我们的代码将在发表后公开。
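The PID-inspired voxel-size regulation in GenZ-LIO's first component can be sketched as a small feedback loop: the error between the current voxelized point count and the setpoint drives the voxel size up or down. The gains and clamping bounds below are illustrative assumptions, not the paper's values:

```python
class VoxelSizePID:
    """Feedback regulation of the downsampling voxel size so that the
    voxelized point count tracks a setpoint, in the spirit of the
    PID-inspired scheme described above (gains are illustrative)."""

    def __init__(self, setpoint, kp=1e-6, ki=1e-7, kd=1e-6,
                 v_min=0.1, v_max=2.0):
        self.setpoint = setpoint
        self.kp, self.ki, self.kd = kp, ki, kd
        self.v_min, self.v_max = v_min, v_max
        self.integral = 0.0
        self.prev_err = 0.0

    def update(self, voxel_size, point_count):
        # Positive error (too many points) should enlarge the voxel,
        # coarsening the downsampling; negative error shrinks it.
        err = point_count - self.setpoint
        self.integral += err
        deriv = err - self.prev_err
        self.prev_err = err
        delta = self.kp * err + self.ki * self.integral + self.kd * deriv
        return min(self.v_max, max(self.v_min, voxel_size + delta))
```

Called once per scan, this keeps the per-scan workload roughly constant across indoor-outdoor transitions; the paper additionally makes the setpoint itself scale-informed.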
cs.RO / 24 / 2603.16279
Agile Interception of a Flying Target using Competitive Reinforcement Learning
使用竞争性强化学习敏捷拦截飞行目标
Abstract
This article presents a solution for intercepting an agile drone with another agile drone carrying a catching net. We formulate the interception as a Competitive Reinforcement Learning problem, where the interceptor and the target drone are controlled by separate policies trained with Proximal Policy Optimization (PPO). We introduce a high-fidelity simulation environment that integrates a realistic quadrotor dynamics model and a low-level control architecture implemented in JAX, which allows for fast parallelized execution on GPUs. We train the agents using low-level control, collective thrust and body rates, to achieve agile flight for both the interceptor and the target. We compare the performance of the trained policies in terms of catch rate, time to catch, and crash rate against common heuristic baselines, and show that our solution outperforms these baselines for interception of agile targets. Finally, we demonstrate the performance of the trained policies in a scaled real-world scenario using agile drones inside an indoor flight arena.
Chinese Translation
本文提出了一种由携带捕捉网的敏捷无人机拦截另一架敏捷无人机的解决方案。我们将拦截问题形式化为一个竞争性强化学习问题,其中拦截者和目标无人机由分别使用近端策略优化(Proximal Policy Optimization, PPO)训练的不同策略控制。我们引入了一个高保真模拟环境,该环境集成了逼真的四旋翼动力学模型和在 JAX 中实现的低层控制架构,允许在 GPU 上快速并行执行。我们使用低层控制指令(总推力和机体角速率)训练智能体,以实现拦截者和目标的敏捷飞行。我们将训练策略在捕获率、捕获时间和坠机率等方面的表现与常见的启发式基线进行对比,结果表明我们的解决方案在拦截敏捷目标方面优于这些基线。最后,我们在室内飞行场地中使用敏捷无人机,在按比例缩小的真实场景中演示了训练策略的性能。
cs.RO / 25 / 2603.16301
OGScene3D: Incremental Open-Vocabulary 3D Gaussian Scene Graph Mapping for Scene Understanding
OGScene3D:增量开放词汇的3D高斯场景图映射用于场景理解
Abstract
Open-vocabulary scene understanding is crucial for robotic applications, enabling robots to comprehend complex 3D environmental contexts and supporting various downstream tasks such as navigation and manipulation. However, existing methods require pre-built complete 3D semantic maps to construct scene graphs for scene understanding, which limits their applicability in robotic scenarios where environments are explored incrementally. To address this challenge, we propose OGScene3D, an open-vocabulary scene understanding system that achieves accurate 3D semantic mapping and scene graph construction incrementally. Our system employs a confidence-based Gaussian semantic representation that jointly models semantic predictions and their reliability, enabling robust scene modeling. Building on this representation, we introduce a hierarchical 3D semantic optimization strategy that achieves semantic consistency through local correspondence establishment and global refinement, thereby constructing globally consistent semantic maps. Moreover, we design a long-term global optimization method that leverages temporal memory of historical observations to enhance semantic predictions. By integrating 2D-3D semantic consistency with Gaussian rendering contribution, this method continuously refines the semantic understanding of the entire scene. Furthermore, we develop a progressive graph construction approach that dynamically creates and updates both nodes and semantic relationships, allowing continuous updating of the 3D scene graphs. Extensive experiments on widely used datasets and real-world scenes demonstrate the effectiveness of our OGScene3D on open-vocabulary scene understanding.
Chinese Translation
开放词汇的场景理解对于机器人应用至关重要,使机器人能够理解复杂的3D环境上下文,并支持导航和操作等多种下游任务。然而,现有方法需要预先构建完整的3D语义地图来构建场景图以进行场景理解,这限制了它们在环境逐步探索的机器人场景中的适用性。为了解决这一挑战,我们提出了OGScene3D,一个开放词汇的场景理解系统,能够以增量方式实现准确的3D语义映射和场景图构建。我们的系统采用基于置信度的高斯语义表示,联合建模语义预测及其可靠性,从而实现稳健的场景建模。在此表示的基础上,我们引入了一种层次化的3D语义优化策略,通过建立局部对应关系和全局优化实现语义一致性,从而构建全局一致的语义地图。此外,我们设计了一种长期全局优化方法,利用历史观察的时间记忆来增强语义预测。通过将2D-3D语义一致性与高斯渲染贡献相结合,该方法持续优化整个场景的语义理解。此外,我们还开发了一种渐进式图构建方法,动态创建和更新节点及语义关系,使3D场景图得以持续更新。在广泛使用的数据集和真实场景上的大量实验证明了我们的OGScene3D在开放词汇场景理解方面的有效性。
cs.RO / 26 / 2603.16303
Toward Deep Representation Learning for Event-Enhanced Visual Autonomous Perception: the eAP Dataset
面向事件增强视觉自主感知的深度表征学习:eAP 数据集
Abstract
Recent visual autonomous perception systems achieve remarkable performance with deep representation learning. However, they fail in scenarios with challenging illumination. While event cameras can mitigate this problem, there is a lack of a large-scale dataset for developing event-enhanced deep visual perception models in autonomous driving scenes. To address the gap, we present the eAP (event-enhanced Autonomous Perception) dataset, the largest dataset with event cameras for autonomous perception. We demonstrate how eAP can facilitate the study of different autonomous perception tasks, including 3D vehicle detection and object time-to-contact (TTC) estimation, through deep representation learning. Based on eAP, we demonstrate the first successful use of events to improve a popular 3D vehicle detection network in challenging illumination scenarios. eAP also enables a devoted study of the representation learning problem of object TTC estimation. We show how a geometry-aware representation learning framework leads to the best event-based object TTC estimation network that operates at 200 FPS. The dataset, code, and pre-trained models will be made publicly available for future research.
Chinese Translation
近期的视觉自主感知系统凭借深度表征学习取得了显著的性能。然而,它们在具有挑战性的光照条件下表现不佳。虽然事件相机可以缓解这一问题,但缺乏一个大规模的数据集来开发自主驾驶场景中的事件增强深度视觉感知模型。为了解决这一问题,我们提出了 eAP(事件增强自主感知)数据集,这是用于自主感知的最大事件相机数据集。我们展示了 eAP 如何通过深度表征学习促进对不同自主感知任务的研究,包括 3D 车辆检测和物体接触时间(TTC)估计。基于 eAP,我们首次成功地利用事件改善了流行 3D 车辆检测网络在挑战性光照条件下的性能。eAP 还使得对物体 TTC 估计的表征学习问题进行专门研究成为可能。我们展示了一个几何感知的表征学习框架如何产生以 200 FPS 运行的最佳基于事件的物体 TTC 估计网络。数据集、代码和预训练模型将公开提供以供未来研究。
cs.RO / 27 / 2603.16328
ADAPT: Adaptive Dual-projection Architecture for Perceptive Traversal
ADAPT:用于感知遍历的自适应双投影架构
Abstract
Agile humanoid locomotion in complex 3D environments requires balancing perceptual fidelity with computational efficiency, yet existing methods typically rely on rigid sensing configurations. We propose ADAPT (Adaptive dual-projection architecture for perceptive traversal), which represents the environment using a horizontal elevation map for terrain geometry and a vertical distance map for traversable-space constraints. ADAPT further treats its spatial sensing range as a learnable action, enabling the policy to expand its perceptual horizon during fast motion and contract it in cluttered scenes for finer local resolution. Compared with voxel-based baselines, ADAPT drastically reduces observation dimensionality and computational overhead while substantially accelerating training. Experimentally, it achieves successful zero-shot transfer to a Unitree G1 humanoid and significantly outperforms fixed-range baselines, yielding highly robust traversal across diverse 3D environmental challenges.
Chinese Translation
在复杂的三维环境中,敏捷的人形机器人运动需要在感知保真度与计算效率之间取得平衡,然而现有方法通常依赖于刚性的感知配置。我们提出了ADAPT(用于感知遍历的自适应双投影架构),该架构使用水平高程图表示地形几何,并使用垂直距离图表示可遍历空间约束。ADAPT进一步将其空间感知范围视为可学习的动作,使得策略能够在快速运动中扩展其感知视野,并在杂乱场景中收缩以获得更精细的局部分辨率。与基于体素的基线相比,ADAPT显著降低了观测维度和计算开销,同时大幅加速了训练。在实验中,它成功实现了对Unitree G1人形机器人的零样本迁移,并显著优于固定范围的基线,在多样的三维环境挑战中实现了高度稳健的遍历。
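The horizontal elevation-map projection mentioned above amounts to a max-height reduction of the point cloud over a 2D grid, which is why it is so much cheaper than a voxel representation. A minimal sketch (the grid resolution, extent, and centering are assumptions):

```python
import numpy as np

def elevation_map(points, res, size):
    """Project an (N, 3) point cloud to a robot-centered horizontal
    elevation grid by keeping the maximum height per cell. Cells never
    hit by a point stay at -inf."""
    grid = np.full((size, size), -np.inf)
    # Map x-y coordinates to integer cell indices, robot at grid center.
    idx = np.floor(points[:, :2] / res).astype(int) + size // 2
    valid = (idx >= 0).all(axis=1) & (idx < size).all(axis=1)
    for (i, j), z in zip(idx[valid], points[valid, 2]):
        grid[i, j] = max(grid[i, j], z)
    return grid
```

The paper's vertical distance map is the complementary projection (nearest obstacle distance along the vertical axis); together the two 2D maps replace a full 3D voxel grid as the policy observation.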
cs.RO / 28 / 2603.16336
Faulty Coffees: Barriers to Adoption of an In-the-wild Robo-Barista
故障咖啡:在真实环境中采用机器人咖啡师的障碍
Abstract
We set out to study whether task-based narratives could influence long-term engagement with a service robot. To do so, we deployed a Robo-Barista for five weeks in an over-50's housing complex in Stockton, England. Residents received a free daily coffee by interacting with a Furhat robot assigned to either a narrative or non-narrative dialogue condition. Despite designing for sustained engagement, repeat interaction was low, and we encountered curiosity trials without retention, technical breakdowns, accessibility barriers, and the social dynamics of a housing complex setting. Rather than treating these as peripheral issues, we foreground them in this paper. We reflect on the in-the-wild realities of our experiment and offer lessons for conducting longitudinal Human-Robot Interaction research when studies unravel in practice.
Chinese Translation
我们旨在研究基于任务的叙事是否能够影响与服务机器人之间的长期互动。为此,我们在英格兰斯托克顿的一个面向50岁以上居民的住宅小区中部署了一台机器人咖啡师,持续五周。居民通过与分配到叙事或非叙事对话条件的Furhat机器人互动,获得每日免费咖啡。尽管我们以维持互动为目标进行设计,但重复互动率仍然较低,我们遇到了出于好奇的试用却无后续留存、技术故障、无障碍方面的障碍,以及住宅小区环境中的社会动态等问题。我们并未将这些视为边缘问题,而是将其置于本文的核心。我们反思了实验在真实环境中的现实情况,并为在实践中遭遇变故的纵向人机交互研究提供了经验教训。
cs.RO / 29 / 2603.16368
Encoding Predictability and Legibility for Style-Conditioned Diffusion Policy
编码可预测性和可读性以实现风格条件的扩散策略
Abstract
Striking a balance between efficiency and transparent motion is a core challenge in human-robot collaboration, as highly expressive movements often incur unnecessary time and energy costs. In collaborative environments, legibility allows a human observer a better understanding of the robot's actions, increasing safety and trust. However, these behaviors result in sub-optimal and exaggerated trajectories that are redundant in low-ambiguity scenarios where the robot's goal is already obvious. To address this trade-off, we propose Style-Conditioned Diffusion Policy (SCDP), a modular framework that constrains the trajectory generation of a pre-trained diffusion model toward either legibility or efficiency based on the environment's configuration. Our method utilizes a post-training pipeline that freezes the base policy and trains a lightweight scene encoder and conditioning predictor to modulate the diffusion process. At inference time, an ambiguity detection module activates the appropriate conditioning, prioritizing expressive motion only for ambiguous goals and reverting to efficient paths otherwise. We evaluate SCDP on manipulation and navigation tasks, and results show that it enhances legibility in ambiguous settings while preserving optimal efficiency when legibility is unnecessary, all without retraining the base policy.
Chinese Translation
在人机协作中,平衡效率与透明运动是一个核心挑战,因为高度表现力的动作往往会带来不必要的时间和能量成本。在协作环境中,可读性使人类观察者能够更好地理解机器人的动作,从而提高安全性和信任度。然而,这些行为在机器人目标已经明显的低歧义场景中会导致次优和夸张的轨迹,显得冗余。为了解决这一权衡问题,我们提出了风格条件的扩散策略(Style-Conditioned Diffusion Policy, SCDP),这是一个模块化框架,能够根据环境配置约束预训练扩散模型的轨迹生成,以实现可读性或效率。我们的方法利用后训练管道,冻结基础策略,并训练一个轻量级场景编码器和条件预测器来调节扩散过程。在推理时,歧义检测模块激活适当的条件,仅在目标模糊时优先考虑表现力强的运动,否则恢复到高效路径。我们在操作和导航任务上评估了SCDP,结果表明它在模糊环境中增强了可读性,同时在可读性不必要时保持了最佳效率,且无需重新训练基础策略。
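A simple stand-in for the ambiguity detection module is to score how directionally confusable the candidate goals are from the robot's start pose; when goals lie in similar directions, legible (exaggerated) motion pays off, otherwise the efficient conditioning suffices. The cosine-similarity heuristic below is an illustrative assumption, not the paper's detector:

```python
import numpy as np

def goal_ambiguity(start, goals):
    """Score goal ambiguity as the maximum pairwise cosine similarity of
    unit vectors from the start to each candidate goal. Values near 1
    mean goals are directionally confusable (legibility helps); values
    near -1 mean they are well separated (efficiency suffices)."""
    dirs = np.array([g - start for g in goals], dtype=float)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    sims = dirs @ dirs.T
    np.fill_diagonal(sims, -1.0)  # ignore self-similarity
    return float(sims.max())
```

A gate such as `legible = goal_ambiguity(start, goals) > threshold` would then select which conditioning signal to feed the style-conditioned diffusion policy.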
cs.RO / 30 / 2603.16384
Controlling Fish Schools via Reinforcement Learning of Virtual Fish Movement
通过强化学习控制鱼群的虚拟鱼运动
Abstract
This study investigates a method to guide and control fish schools using virtual fish trained with reinforcement learning. We utilize 2D virtual fish displayed on a screen to overcome technical challenges such as durability and movement constraints inherent in physical robotic agents. To address the lack of detailed behavioral models for real fish, we adopt a model-free reinforcement learning approach. First, simulation results show that reinforcement learning can acquire effective movement policies even when simulated real fish frequently ignore the virtual stimulus. Second, real-world experiments with live fish confirm that the learned policy successfully guides fish schools toward specified target directions. Statistical analysis reveals that the proposed method significantly outperforms baseline conditions, including the absence of stimulus and a heuristic "stay-at-edge" strategy. This study provides an early demonstration of how reinforcement learning can be used to influence collective animal behavior through artificial agents.
Chinese Translation
本研究探讨了一种利用经过强化学习训练的虚拟鱼来引导和控制鱼群的方法。我们使用在屏幕上显示的二维虚拟鱼,以克服物理机器人代理固有的耐用性和运动限制等技术挑战。为了解决真实鱼类缺乏详细行为模型的问题,我们采用了无模型的强化学习方法。首先,仿真结果表明,即使在模拟的真实鱼类经常忽视虚拟刺激的情况下,强化学习仍能获得有效的运动策略。其次,针对活鱼的真实世界实验确认,学习到的策略成功地引导鱼群朝向指定的目标方向。统计分析显示,所提出的方法显著优于基线条件,包括没有刺激的情况和启发式的“保持在边缘”策略。本研究提供了一个早期示范,展示了如何通过人工代理利用强化学习影响集体动物行为。
cs.RO / 31 / 2603.16407
Onboard MuJoCo-based Model Predictive Control for Shipboard Crane with Double-Pendulum Sway Suppression
基于MuJoCo的船载起重机模型预测控制及双摆摆动抑制
Abstract
Transferring heavy payloads in maritime settings relies on efficient crane operation, limited by hazardous double-pendulum payload sway. This sway motion is further exacerbated in offshore environments by external perturbations from wind and ocean waves. Manual suppression of these oscillations on an underactuated crane system by human operators is challenging. Existing control methods struggle in such settings, often relying on simplified analytical models, while deep reinforcement learning (RL) approaches tend to generalise poorly to unseen conditions. Deploying a predictive controller onto compute-constrained, highly non-linear physical systems without relying on extensive offline training or complex analytical models remains a significant challenge. Here we show a complete real-time control pipeline centered on the MuJoCo MPC framework that leverages a cross-entropy method planner to evaluate candidate action sequences directly within a physics simulator. By using simulated rollouts, this sampling-based approach successfully reconciles the conflicting objectives of dynamic target tracking and sway damping without relying on complex analytical models. We demonstrate that the controller can run effectively on a resource-constrained embedded hardware, while outperforming traditional PID and RL baselines in counteracting external base perturbations. Furthermore, our system demonstrates robustness even when subjected to unmodeled physical discrepancies like the introduction of a second payload.
Chinese Translation
在海洋环境中转运重载货物依赖于高效的起重机操作,而这一操作受到危险的双摆货物摆动的限制。在离岸环境中,风和海浪的外部扰动进一步加剧了这种摆动。操作人员在欠驱动的起重机系统上手动抑制这些振荡具有挑战性。现有的控制方法在这种环境中表现不佳,通常依赖于简化的解析模型,而深度强化学习(RL)方法在未见条件下的泛化能力较差。在不依赖于大量离线训练或复杂解析模型的情况下,将预测控制器部署到计算受限且高度非线性的物理系统上仍然是一个重大挑战。在这里,我们展示了一个完整的实时控制管道,基于MuJoCo MPC框架,利用交叉熵方法规划器在物理模拟器中直接评估候选动作序列。通过使用模拟推演,这种基于采样的方法成功地调和了动态目标跟踪与摆动阻尼的相互矛盾目标,而无需依赖复杂的解析模型。我们证明了该控制器能够在资源受限的嵌入式硬件上有效运行,同时在抵抗外部基座扰动方面优于传统的PID和RL基线。此外,我们的系统即使在遭遇未建模的物理差异(如引入第二个货物)时也表现出强大的鲁棒性。
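The cross-entropy method (CEM) planner at the heart of the pipeline can be sketched generically: sample action sequences from a Gaussian, score them (in the paper, via MuJoCo simulator rollouts), and refit the Gaussian to the elite set. The sketch below substitutes a plain cost function for the simulator and uses illustrative population sizes:

```python
import numpy as np

def cem_plan(cost_fn, horizon, dim, iters=5, pop=64, elite=8, seed=0):
    """Cross-entropy method planner: iteratively sample candidate action
    sequences of shape (horizon, dim), keep the lowest-cost elites, and
    refit the sampling Gaussian to them. Returns the final mean sequence;
    in MPC fashion only its first action would be executed."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, dim))
    sigma = np.ones((horizon, dim))
    for _ in range(iters):
        samples = mu + sigma * rng.standard_normal((pop, horizon, dim))
        costs = np.array([cost_fn(s) for s in samples])  # rollout scores
        elites = samples[np.argsort(costs)[:elite]]
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-6  # floor avoids collapse
    return mu
```

Replacing `cost_fn` with a function that rolls the sequence through a physics simulator and accumulates tracking plus sway-damping costs recovers the sampling-based MPC scheme the abstract describes.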
cs.RO / 32 / 2603.16424
Early-Terminable Energy-Safe Iterative Coupling for Parallel Simulation of Port-Hamiltonian Systems
可提前终止的能量安全迭代耦合用于端口哈密顿系统的并行仿真
Abstract
Parallel simulation and control of large-scale robotic systems often rely on partitioned time stepping, yet finite-iteration coupling can inject spurious energy by violating power consistency--even when each subsystem is passive. This letter proposes a novel energy-safe, early-terminable iterative coupling for port-Hamiltonian subsystems by embedding a Douglas--Rachford (DR) splitting scheme in scattering (wave) coordinates. The lossless interconnection is enforced as an orthogonal constraint in the wave domain, while each subsystem contributes a discrete-time scattering port map induced by its one-step integrator. Under a discrete passivity condition on the subsystem time steps and a mild impedance-tuning condition, we prove an augmented-storage inequality certifying discrete passivity of the coupled macro-step for any finite inner-iteration budget, with the remaining mismatch captured by an explicit residual. As the inner budget increases, the partitioned update converges to the monolithic discrete-time update induced by the same integrators, yielding a principled, adaptive accuracy--compute trade-off, supporting energy-consistent real-time parallel simulation under varying computational budgets. Experiments on a coupled-oscillator benchmark validate the passivity certificates at numerical roundoff (on the order of $10^{-14}$ in double precision) and show that the reported RMS state error decays monotonically with increasing inner-iteration budgets, consistent with the hard-coupling limit.
Chinese Translation
大规模机器人系统的并行仿真和控制通常依赖于分区时间步进,然而有限迭代耦合可能因违反功率一致性而注入虚假能量——即便每个子系统都是无源的。本文提出了一种新颖的能量安全、可提前终止的端口哈密顿子系统迭代耦合方法,通过将道格拉斯-拉赫福德(Douglas-Rachford, DR)分裂方案嵌入散射(波)坐标中实现。无损互连在波域内作为正交约束强制执行,而每个子系统则通过其单步积分器诱导出一个离散时间散射端口映射。在子系统时间步满足离散无源性条件以及温和的阻抗调节条件下,我们证明了一个增广存储不等式,证实了在任意有限内部迭代预算下耦合宏步的离散无源性,剩余的不匹配由一个显式残差刻画。随着内部迭代预算的增加,分区更新收敛于由相同积分器诱导的整体式离散时间更新,从而产生一种有原则的、自适应的精度与计算量折中,支持在变化计算预算下能量一致的实时并行仿真。在耦合振荡器基准测试上的实验验证了数值舍入量级(双精度下约为$10^{-14}$)的无源性证明,并显示所报告的均方根状态误差随内部迭代预算的增加单调衰减,与硬耦合极限一致。
cs.RO / 33 / 2603.16471
Coverage First Next Best View for Inspection of Cluttered Pipe Networks Using Mobile Manipulators
覆盖优先的下一最佳视角:使用移动机械臂检测杂乱管道网络
Abstract
Robotic inspection of radioactive areas enables operators to be removed from hazardous environments; however, planning and operating in confined, cluttered environments remain challenging. These systems must autonomously reconstruct the unknown environment and cover its surfaces, whilst estimating and avoiding collisions with objects in the environment. In this paper, we propose a new planning approach based on next-best-view that enables simultaneous exploration and exploitation of the environment by reformulating the coverage path planning problem in terms of information gain. To handle obstacle avoidance under uncertainty, we extend the vector-field-inequalities framework to explicitly account for stochastic measurements of geometric primitives in the environment via chance constraints in a constrained optimal control law. The stochastic constraints were evaluated experimentally alongside the planner on a mobile manipulator in a confined environment to inspect a pipe network. These experiments demonstrate that the system can autonomously plan and execute inspection and coverage paths to reconstruct and fully cover the simplified pipe network. Moreover, the system successfully estimated geometric primitives online and avoided collisions during motion between viewpoints.
Chinese Translation
机器人对放射性区域的检查使操作员得以远离危险环境;然而,在狭小、杂乱的环境中进行规划和作业仍然具有挑战性。这类系统必须自主重建未知环境并覆盖其表面,同时估计并避免与环境中物体的碰撞。本文提出了一种基于下一最佳视角(next-best-view)的新规划方法,通过将覆盖路径规划问题以信息增益的形式重新表述,实现对环境的同时探索与利用。为了处理不确定性下的避障,我们扩展了向量场不等式框架,在受约束的最优控制律中引入机会约束,以显式考虑环境中几何基元的随机测量。我们在狭小环境中的移动操作臂上对随机约束与规划器进行了实验评估,用于检查一个管道网络。这些实验表明,该系统能够自主规划并执行检查与覆盖路径,以重建并完全覆盖简化的管道网络。此外,该系统成功地在线估计了几何基元,并在视角之间的运动中避免了碰撞。
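The information-gain reformulation of coverage planning described above can be illustrated with a toy greedy next-best-view selector. The voxel-entropy scoring and the travel penalty `lam` are illustrative assumptions, not the paper's exact objective:

```python
import math

def entropy(p):
    """Binary occupancy entropy of one voxel (p = occupancy probability);
    unknown voxels (p near 0.5) carry the most information."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def next_best_view(candidates, occupancy, travel_cost, lam=0.1):
    """Pick the candidate view maximizing expected information gain minus
    a travel penalty.

    candidates:  {view_id: set of voxel ids visible from that view}
    occupancy:   {voxel_id: current occupancy probability}
    travel_cost: {view_id: cost to reach the view}
    """
    def score(v):
        gain = sum(entropy(occupancy[vox]) for vox in candidates[v])
        return gain - lam * travel_cost[v]
    return max(candidates, key=score)
```

A view that observes two fully unknown voxels beats one that re-observes a nearly decided voxel, which is the exploration/exploitation balance the abstract refers to.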
cs.RO / 34 / 2603.16503
When Rolling Gets Weird: A Curved-Link Tensegrity Robot for Non-Intuitive Behavior
当滚动变得奇怪:一种面向非直观行为的曲杆张拉整体机器人
Abstract
Conventional mobile tensegrity robots constructed with straight links offer mobility at the cost of locomotion speed. While spherical robots provide highly effective rolling behavior, they often lack the stability required for navigating unstructured terrain common in many space exploration environments. This research presents a solution with a semi-circular, curved-link tensegrity robot that strikes a balance between efficient rolling locomotion and controlled stability, enabled by discontinuities present at the arc endpoints. Building upon an existing geometric static modeling framework [1], this work presents the system design of an improved Tensegrity eXploratory Robot 2 (TeXploR2). Internal shifting masses instantaneously roll along each curved-link, dynamically altering the two points of contact with the ground plane. Simulations of quasistatic, piecewise continuous locomotion sequences reveal new insights into the positional displacement between inertial and body frames. Non-intuitive rolling behaviors are identified and experimentally validated using a tetherless prototype, demonstrating successful dynamic locomotion. A preliminary impact test highlights the tensegrity structure's inherent shock absorption capabilities and conformability. Future work will focus on finalizing a dynamic model that is experimentally validated with extended testing in real-world environments as well as further refinement of the prototype to incorporate additional curved-links and subsequent ground contact points for increased controllability.
Chinese Translation
传统的由直杆构成的移动张拉整体(tensegrity)机器人以牺牲运动速度为代价换取机动性。尽管球形机器人具有非常高效的滚动行为,但它们通常缺乏在许多太空探索环境中常见的非结构化地形上导航所需的稳定性。本研究提出了一种解决方案:一种半圆形曲杆张拉整体机器人,借助弧段端点处的不连续性,在高效滚动运动与可控稳定性之间取得平衡。基于现有的几何静力学建模框架[1],本文给出了改进版张拉整体探索机器人2(TeXploR2)的系统设计。内部移动质量块沿各曲杆即时滚动,动态改变与地面的两个接触点。准静态、分段连续运动序列的仿真为惯性坐标系与机体坐标系之间的位置位移提供了新见解。我们识别出非直观的滚动行为,并使用无缆原型进行了实验验证,成功展示了动态运动。初步冲击测试突出了张拉整体结构固有的减震能力与贴合性。未来工作将着重完成一个经实验验证的动力学模型并在真实环境中进行扩展测试,同时进一步改进原型,加入更多曲杆及相应的地面接触点以提高可控性。
cs.RO / 35 / 2603.16531
LIMBERO: A Limbed Climbing Exploration Robot Toward Traveling on Rocky Cliffs
LIMBERO:一种面向岩石悬崖行进的肢式攀爬探索机器人
Abstract
In lunar and planetary exploration, legged robots have attracted significant attention as an alternative to conventional wheeled robots, which struggle to traverse rough and uneven terrain. To enable locomotion over highly irregular and steeply inclined surfaces, limbed climbing robots equipped with grippers on their feet have emerged as a promising solution. In this paper, we present LIMBERO, a 10 kg-class quadrupedal climbing robot that employs spine-type grippers for stable locomotion and climbing on rugged and steep terrain. We first introduce a novel gripper design featuring coupled finger-closing and spine-hooking motions, tightly actuated by a single motor, which achieves exceptional grasping performance (>150 N) despite its lightweight design (525 g). Furthermore, we develop an efficient algorithm to visualize a geometry-based graspability index on continuous rough terrain. Finally, we integrate these components into LIMBERO and demonstrate its ability to ascend steep rocky surfaces under a 1 G gravity condition, a performance not previously achieved for limbed climbing robots of this scale.
Chinese Translation
在月球与行星探索中,腿式机器人作为传统轮式机器人的替代方案受到广泛关注,因为后者难以穿越崎岖不平的地形。为了在高度不规则且陡峭倾斜的表面上实现运动,足端配备抓取器的肢式攀爬机器人已成为一种有前景的解决方案。本文介绍了LIMBERO,一种10公斤级的四足攀爬机器人,采用钩刺式(spine)抓取器,可在崎岖陡峭地形上稳定行进与攀爬。我们首先介绍了一种新颖的抓取器设计,其手指闭合与钩刺钩挂动作相耦合,由单个电机紧凑驱动,在轻量化设计(525克)下仍实现了出色的抓取性能(>150 N)。此外,我们开发了一种高效算法,用于在连续粗糙地形上可视化基于几何的可抓取性指标。最后,我们将这些组件集成到LIMBERO中,并展示其在1 G重力条件下攀登陡峭岩面的能力,这是此规模的肢式攀爬机器人此前未曾达到的性能。
cs.RO / 36 / 2603.16536
Kamino: GPU-based Massively Parallel Simulation of Multi-Body Systems with Challenging Topologies
Kamino:面向具有挑战性拓扑的多体系统的基于GPU的大规模并行仿真
Abstract
We present Kamino, a GPU-based physics solver for massively parallel simulations of heterogeneous highly-coupled mechanical systems. Implemented in Python using NVIDIA Warp and integrated into the Newton framework, it enables the application of data-driven methods, such as large-scale reinforcement learning, to complex robotic systems that exhibit strongly coupled kinematic and dynamic constraints such as kinematic loops. The latter are often circumvented by practitioners; approximating the system topology as a kinematic tree and incorporating explicit loop-closure constraints or so-called mimic joints. Kamino aims at alleviating this burden by natively supporting these types of coupling. This capability facilitates high-throughput parallelized simulations that capture the true nature of mechanical systems that exploit closed kinematic chains for mechanical advantage. Moreover, Kamino supports heterogeneous worlds, allowing for batched simulation of structurally diverse robots on a single GPU. At its core lies a state-of-the-art constrained optimization algorithm that computes constraint forces by solving the constrained rigid multi-body forward dynamics transcribed as a nonlinear complementarity problem. This leads to high-fidelity simulations that can resolve contact dynamics without resorting to approximate models that simplify and/or convexify the problem. We demonstrate RL policy training on DR Legs, a biped with six nested kinematic loops, generating a feasible walking policy while simulating 4096 parallel environments on a single GPU.
Chinese Translation
我们提出了Kamino,一种基于GPU的物理求解器,用于异构、高度耦合机械系统的大规模并行仿真。它以Python实现,基于NVIDIA Warp并集成到Newton框架中,使得大规模强化学习等数据驱动方法能够应用于具有强耦合运动学与动力学约束(如运动学闭链)的复杂机器人系统。从业者通常会规避后者:将系统拓扑近似为运动学树,并引入显式的闭链闭合约束或所谓的模仿(mimic)关节。Kamino旨在通过原生支持这类耦合来减轻这一负担。这一能力使高吞吐量并行仿真得以捕捉那些利用闭合运动链获得机械增益的机械系统的真实特性。此外,Kamino支持异构世界,允许在单个GPU上对结构各异的机器人进行批量仿真。其核心是一个最先进的约束优化算法,通过求解被转写为非线性互补问题的受约束刚性多体前向动力学来计算约束力。由此获得的高保真仿真能够求解接触动力学,而无需借助简化和/或凸化问题的近似模型。我们在DR Legs(一个具有六个嵌套运动学闭链的双足机器人)上演示了强化学习策略训练,在单个GPU上并行仿真4096个环境,得到了可行的行走策略。
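Kamino solves contact as a nonlinear complementarity problem with a state-of-the-art constrained optimizer; the toy below is not that solver, but it illustrates the complementarity conditions themselves on a frictionless linear complementarity problem (LCP), solved with classic projected Gauss-Seidel:

```python
def projected_gauss_seidel(A, b, iters=200):
    """Solve the LCP: find lam >= 0 with (A @ lam + b) >= 0 and
    lam_i * (A @ lam + b)_i == 0 componentwise, for a small dense
    symmetric positive-definite A (list of rows) and vector b.
    lam plays the role of contact impulses; A @ lam + b of separations."""
    n = len(b)
    lam = [0.0] * n
    for _ in range(iters):
        for i in range(n):
            # residual of row i at the current iterate
            r = b[i] + sum(A[i][j] * lam[j] for j in range(n))
            # Gauss-Seidel step projected onto lam_i >= 0
            lam[i] = max(0.0, lam[i] - r / A[i][i])
    return lam
```

For an active contact the impulse is positive and the separation zero; for an inactive one the impulse is zero and the separation positive, exactly the either/or structure the complementarity transcription encodes.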
cs.RO / 37 / 2603.16542
Conservative Offline Robot Policy Learning via Posterior-Transition Reweighting
通过后验转移重加权实现保守的离线机器人策略学习
Abstract
Offline post-training adapts a pretrained robot policy to a target dataset by supervised regression on recorded actions. In practice, robot datasets are heterogeneous: they mix embodiments, camera setups, and demonstrations of varying quality, so many trajectories reflect recovery behavior, inconsistent operator skill, or weakly informative supervision. Uniform post-training gives equal credit to all samples and can therefore average over conflicting or low-attribution data. We propose Posterior-Transition Reweighting (PTR), a reward-free and conservative post-training method that decides how much each training sample should influence the supervised update. For each sample, PTR encodes the observed post-action consequence as a latent target, inserts it into a candidate pool of mismatched targets, and uses a separate transition scorer to estimate a softmax identification posterior over target indices. The posterior-to-uniform ratio defines the PTR score, which is converted into a clipped-and-mixed weight and applied to the original action objective through self-normalized weighted regression. This construction requires no tractable policy likelihood and is compatible with both diffusion and flow-matching action heads. Rather than uniformly trusting all recorded supervision, PTR reallocates credit according to how attributable each sample's post-action consequence is under the current representation, improving conservative offline adaptation to heterogeneous robot data.
Chinese Translation
离线后训练通过对已记录动作进行监督回归,使预训练的机器人策略适配目标数据集。在实践中,机器人数据集是异构的:它们混合了不同的具身、相机设置和质量不一的演示,因此许多轨迹反映的是恢复行为、不一致的操作者技能或信息量较弱的监督。统一的后训练对所有样本给予同等权重,因而可能对相互冲突或低归因的数据取平均。我们提出后验转移重加权(Posterior-Transition Reweighting, PTR),一种无奖励且保守的后训练方法,用于决定每个训练样本对监督更新的影响程度。对每个样本,PTR将观察到的动作后果编码为潜在目标,将其插入由不匹配目标构成的候选池中,并使用一个独立的转移评分器估计目标索引上的softmax识别后验。后验与均匀分布之比定义了PTR分数,该分数被转换为经截断并混合的权重,并通过自归一化加权回归应用于原始动作目标。该构造不需要可计算的策略似然,并且与扩散和流匹配动作头均兼容。PTR并非一视同仁地信任所有已记录的监督,而是依据每个样本的动作后果在当前表示下的可归因程度重新分配权重,从而改进对异构机器人数据的保守离线适配。
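The posterior-to-uniform ratio, clip-and-mix, and self-normalization steps above are simple to state in code. The constants `clip_lo`, `clip_hi`, and `mix` below are illustrative placeholders, not values from the paper:

```python
def ptr_weights(posteriors, pool_size, clip_lo=0.1, clip_hi=3.0, mix=0.2):
    """Turn per-sample identification posteriors into self-normalized
    regression weights, sketching the PTR score pipeline.

    posteriors: p(true target index | candidate pool) for each sample;
    a uniform guess over the pool would assign 1/pool_size."""
    uniform = 1.0 / pool_size
    raw = []
    for p in posteriors:
        score = p / uniform                            # posterior-to-uniform ratio
        clipped = min(max(score, clip_lo), clip_hi)    # clip
        raw.append((1.0 - mix) * clipped + mix * 1.0)  # mix toward uniform weight
    total = sum(raw)
    return [w / total for w in raw]                    # self-normalization
```

Samples whose post-action consequence is easy to identify in the pool (high posterior) pull more weight; weakly attributable ones are damped toward, but never below, a conservative floor.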
cs.RO / 38 / 2603.16543
A Pin-Array Structured Climbing Robot for Stable Locomotion on Steep Rocky Terrain
一种用于在陡峭岩石地形上稳定移动的针阵结构攀爬机器人
Abstract
Climbing robots face significant challenges when navigating unstructured environments, where reliable attachment to irregular surfaces is critical. We present a novel mobile climbing robot equipped with compliant pin-array structured grippers that passively conform to surface irregularities, ensuring stable ground gripping without the need for complicated sensing or control. Each pin features a vertically split design, combining an elastic element with a metal spine to enable mechanical interlocking with microscale surface features. Statistical modeling and experimental validation indicate that variability in individual pin forces and contact numbers are the primary sources of grasping uncertainty. The robot demonstrated robust and stable locomotion in indoor tests on inclined walls (10-30 degrees) and in outdoor tests on natural rocky terrain. This work highlights that a design emphasizing passive compliance and mechanical redundancy provides a practical and robust solution for real-world climbing robots while minimizing control complexity.
Chinese Translation
攀爬机器人在非结构化环境中移动时面临重大挑战,其中对不规则表面的可靠附着至关重要。我们提出了一种新型移动攀爬机器人,配备柔顺的针阵结构抓取器,能够被动贴合表面不规则性,无需复杂的传感或控制即可确保稳定的地面抓附。每根针采用纵向剖分设计,将弹性元件与金属钩刺相结合,以实现与微观表面特征的机械互锁。统计建模与实验验证表明,单根针受力及接触数量的变异性是抓附不确定性的主要来源。该机器人在室内倾斜墙面(10-30度)测试和户外天然岩石地形测试中均展现出稳健、稳定的移动能力。本研究表明,强调被动柔顺性与机械冗余的设计为现实世界的攀爬机器人提供了实用且稳健的解决方案,同时将控制复杂度降至最低。
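The statistical picture above, where total grip force is a sum over many pins with variable engagement and per-pin force, lends itself to a small Monte Carlo sketch. All distributional parameters here are illustrative assumptions, not the paper's measurements:

```python
import random

def total_grip_force(n_pins, engage_prob, mu, sigma, rng):
    """One draw of the array's holding force: each pin engages with
    probability engage_prob and, if engaged, contributes a normally
    distributed force (negative draws truncated to zero)."""
    force = 0.0
    for _ in range(n_pins):
        if rng.random() < engage_prob:
            force += max(0.0, rng.gauss(mu, sigma))
    return force

rng = random.Random(0)
draws = [total_grip_force(64, 0.7, 1.5, 0.5, rng) for _ in range(2000)]
mean = sum(draws) / len(draws)
# Expected mean is roughly n * p * mu = 64 * 0.7 * 1.5 = 67.2
# (marginally higher due to truncation at zero), while the spread across
# draws comes from pin-count and per-pin-force variability -- the two
# uncertainty sources the abstract identifies.
```

Mechanical redundancy shows up directly: the relative spread of the total shrinks as `n_pins` grows, which is why a large passive array can be reliable without per-pin sensing.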
cs.RO / 39 / 2603.16550
ASCENT: Transformer-Based Aircraft Trajectory Prediction in Non-Towered Terminal Airspace
ASCENT:基于Transformer的非塔台终端空域飞机轨迹预测
Abstract
Accurate trajectory prediction can improve General Aviation safety in non-towered terminal airspace, where high traffic density increases accident risk. We present ASCENT, a lightweight transformer-based model for multi-modal 3D aircraft trajectory forecasting, which integrates domain-aware 3D coordinate normalization and parameterized predictions. ASCENT employs a transformer-based motion encoder and a query-based decoder, enabling the generation of diverse maneuver hypotheses with low latency. Experiments on the TrajAir and TartanAviation datasets demonstrate that our model outperforms prior baselines, as the encoder effectively captures motion dynamics and the decoder aligns with structured aircraft traffic patterns. Furthermore, ablation studies confirm the contributions of the decoder design, coordinate-frame modeling, and parameterized outputs. These results establish ASCENT as an effective approach for real-time aircraft trajectory prediction in non-towered terminal airspace.
Chinese Translation
准确的轨迹预测可以提高非塔台终端空域中通用航空的安全性,这类空域的高交通密度增加了事故风险。我们提出了ASCENT,一种轻量级的基于Transformer的多模态三维飞机轨迹预测模型,它集成了领域感知的三维坐标归一化与参数化预测。ASCENT采用基于Transformer的运动编码器和基于查询的解码器,能够以低延迟生成多样化的机动假设。在TrajAir和TartanAviation数据集上的实验表明,我们的模型优于先前的基线:编码器有效捕捉了运动动态,解码器则与结构化的飞机交通模式对齐。此外,消融研究确认了解码器设计、坐标系建模和参数化输出各自的贡献。这些结果确立了ASCENT作为非塔台终端空域实时飞机轨迹预测的有效方法。
cs.RO / 40 / 2603.16593
Scalable Inspection Planning via Flow-based Mixed Integer Linear Programming
基于流的混合整数线性规划的可扩展检查规划
Abstract
Inspection planning is concerned with computing the shortest robot path to inspect a given set of points of interest (POIs) using the robot's sensors. This problem arises in a wide range of applications from manufacturing to medical robotics. To alleviate the problem's complexity, recent methods rely on sampling-based methods to obtain a more manageable (discrete) graph inspection planning (GIP) problem. Unfortunately, GIP still remains highly difficult to solve at scale as it requires simultaneously satisfying POI-coverage and path-connectivity constraints, giving rise to a challenging optimization problem, particularly at scales encountered in real-world scenarios. In this work, we present highly scalable Mixed Integer Linear Programming (MILP) solutions for GIP that significantly advance the state-of-the-art in both runtime and solution quality. Our key insight is a reformulation of the problem's core constraints as a network flow, which enables effective MILP models and a specialized Branch-and-Cut solver that exploits the combinatorial structure of flows. We evaluate our approach on medical and infrastructure benchmarks alongside large-scale synthetic instances. Across all scenarios, our method produces substantially tighter lower bounds than existing formulations, reducing optimality gaps by 30-50% on large instances. Furthermore, our solver demonstrates unprecedented scalability: it provides non-trivial solutions for problems with up to 15,000 vertices and thousands of POIs, where prior state-of-the-art methods typically exhaust memory or fail to provide any meaningful optimality guarantees.
Chinese Translation
检查规划旨在计算最短的机器人路径,以便利用机器人的传感器检查给定的一组兴趣点(POI)。这一问题出现在从制造到医疗机器人等广泛应用中。为了降低问题的复杂度,近期方法依赖基于采样的方式,得到一个更易处理的(离散)图检查规划(GIP)问题。遗憾的是,GIP 在大规模下仍然极难求解,因为它需要同时满足 POI 覆盖约束和路径连通性约束,由此产生一个具有挑战性的优化问题,在真实场景所涉及的规模下尤为困难。在本工作中,我们为 GIP 提出了高度可扩展的混合整数线性规划(MILP)解法,在运行时间和解质量两方面都显著推进了最新水平。我们的关键见解是将该问题的核心约束重新表述为网络流,从而得以构建有效的 MILP 模型,以及一个利用流的组合结构的专用分支切割(Branch-and-Cut)求解器。我们在医疗与基础设施基准以及大规模合成实例上评估了我们的方法。在所有场景中,我们的方法产生的下界都比现有公式显著更紧,在大型实例上将最优性间隙缩小了 30-50%。此外,我们的求解器展示了前所未有的可扩展性:对于多达 15,000 个顶点和数千个 POI 的问题,它也能给出非平凡解,而此前的最新方法在此规模通常会耗尽内存,或无法提供任何有意义的最优性保证。
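The key idea above, encoding path connectivity through flow conservation, builds on classical network flow. As a self-contained illustration of that machinery (not the paper's MILP), here is Edmonds-Karp max flow in pure Python; a flow-based MILP uses the same conservation constraints, with one linear constraint per vertex, to force selected edges to connect the tour:

```python
from collections import deque

def max_flow(edges, s, t):
    """Edmonds-Karp max flow over a residual-capacity dict.
    edges: iterable of (u, v, capacity) tuples; returns the flow value."""
    res = {}
    for u, v, c in edges:
        res.setdefault(u, {})[v] = res.get(u, {}).get(v, 0) + c
        res.setdefault(v, {}).setdefault(u, 0)  # reverse residual edge
    total = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return total
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(res[u][v] for u, v in path)  # bottleneck capacity
        for u, v in path:
            res[u][v] -= aug
            res[v][u] += aug
        total += aug
```

A positive flow from a depot vertex to every covering vertex certifies connectivity, which is what lets the MILP avoid the exponentially many subtour-elimination constraints of naive formulations.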
cs.RO / 41 / 2603.16609
Dexterous grasp data augmentation based on grasp synthesis with fingertip workspace cloud and contact-aware sampling
基于指尖工作空间云和接触感知采样的灵巧抓取数据增强
Abstract
Robotic grasping is a fundamental yet crucial component of robotic applications, as effective grasping often serves as the starting point for various tasks. With the rapid advancement of neural networks, data-driven approaches for robotic grasping have become mainstream. However, efficiently generating grasp datasets for training remains a bottleneck. This is compounded by the diverse structures of robotic hands, making the design of generalizable grasp generation methods even more complex. In this work, we propose a teleoperation-based framework to collect a small set of grasp pose demonstrations, which are augmented using FSG--a Fingertip-contact-aware Sampling-based Grasp generator. Based on the demonstrated grasp poses, we propose AutoWS, which automatically generates structured workspace clouds of robotic fingertips, embedding the hand structure information directly into the clouds to eliminate the need for inverse kinematics calculations. Experiments on grasping the YCB objects show that our method significantly outperforms existing approaches in both speed and valid pose generation rate. Our framework enables real-time grasp generation for hands with arbitrary structures and produces human-like grasps when combined with demonstrations, providing an efficient and robust data augmentation tool for data-driven grasp training.
Chinese Translation
机器人抓取是机器人应用中基础而关键的组成部分,因为有效的抓取往往是各类任务的起点。随着神经网络的迅速发展,数据驱动的机器人抓取方法已成为主流。然而,高效生成用于训练的抓取数据集仍是瓶颈。机器人手结构的多样性进一步加剧了这一问题,使可泛化抓取生成方法的设计更加复杂。在本工作中,我们提出了一个基于遥操作的框架,用于收集一小组抓取姿态演示,并使用FSG(Fingertip-contact-aware Sampling-based Grasp generator,指尖接触感知的基于采样的抓取生成器)对其进行增强。基于演示的抓取姿态,我们提出了AutoWS,它自动生成机器人指尖的结构化工作空间点云,将手部结构信息直接嵌入点云中,从而无需逆运动学计算。对YCB物体的抓取实验表明,我们的方法在速度和有效姿态生成率上均显著优于现有方法。我们的框架能够为任意结构的手实时生成抓取,并在结合演示时产生类人抓取,为数据驱动的抓取训练提供了一种高效而稳健的数据增强工具。
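The idea of a fingertip workspace cloud replacing inverse kinematics can be sketched with a toy planar two-link finger: sweep the joints with forward kinematics once, then answer reachability queries against the sampled cloud. Link lengths, joint ranges, and the tolerance are illustrative, not AutoWS's parameters:

```python
import math

def fingertip_cloud(l1, l2, n=60):
    """Sample reachable fingertip positions of a planar 2-link finger by
    sweeping both joint angles over [0, pi] -- forward kinematics only."""
    pts = []
    for i in range(n):
        q1 = math.pi * i / (n - 1)
        for j in range(n):
            q2 = math.pi * j / (n - 1)
            x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
            y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
            pts.append((x, y))
    return pts

def reachable(pts, target, tol=0.05):
    """Crude membership test against the precomputed cloud; this stands in
    for the IK solve a contact point would otherwise require."""
    return any(math.dist(p, target) <= tol for p in pts)
```

The cloud is computed once per hand and embeds the hand's structure; grasp sampling then only needs cheap nearest-point queries, which is the speed advantage the abstract claims for the workspace-cloud representation.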
cs.RO / 42 / 2603.16626
Routing and Control for Marine Oil-Spill Cleanup with a Boom-Towing Vessel Fleet
拖曳围油栏船队执行海洋溢油清理的路径规划与控制
Abstract
Marine oil spills damage ecosystems, contaminate coastlines, and disrupt food webs, while imposing substantial economic losses on fisheries and coastal communities. Prior work has demonstrated the feasibility of containing and cleaning individual spills using a duo of autonomous surface vehicles (ASVs) equipped with a towed boom and skimmers. However, existing algorithmic approaches primarily address isolated slicks and individual ASV duos, lacking scalable methods for coordinating large robotic fleets across multiple spills representative of realistic oil-spill incidents. In this work, we propose an integrated multi-robot framework for coordinated oil-spill confinement and cleanup using autonomous ASV duos. We formulate multi-spill response as a risk-weighted minimum-latency problem, where spill-specific risk factors and service times jointly determine cumulative environmental damage. To solve this problem, we develop a hybrid optimization approach combining mixed-integer linear programming, and a tailored warm-start heuristic, enabling near-optimal routing plans for scenarios with tens of spills within minutes on commodity hardware. For physical execution, we design and analyze two tracking controllers for boom-towing ASV duos: a feedback-linearization controller with proven asymptotic stability, and a baseline PID controller. Simulation results under coupled vessel-boom dynamics demonstrate accurate path tracking for both controllers. Together, these components provide a scalable, holistic framework for rapid, risk-aware multi-robot response to large-scale oil spill disasters.
Chinese Translation
海洋溢油事件破坏生态系统、污染海岸线、扰乱食物网,同时给渔业和沿海社区带来巨大的经济损失。先前的工作已经证明,使用一对配备拖曳式围油栏和撇油器的自主水面航行器(ASV)来围控并清理单个溢油是可行的。然而,现有算法方法主要针对孤立油膜和单个ASV双机组,缺乏在多个溢油点(更贴近真实溢油事故)之间协调大型机器人编队的可扩展方法。在本工作中,我们提出了一个集成的多机器人框架,使用自主ASV双机组进行协同的溢油围控与清理。我们将多溢油响应形式化为一个风险加权的最小延迟问题,其中各溢油点的风险因子和服务时间共同决定累计环境损害。为求解该问题,我们开发了一种混合优化方法,结合混合整数线性规划与量身定制的暖启动启发式算法,能够在普通硬件上于数分钟内为含数十个溢油点的场景给出近似最优的路线规划。在物理执行方面,我们为拖曳围油栏的ASV双机组设计并分析了两种跟踪控制器:一种具有已证明渐近稳定性的反馈线性化控制器,以及一种基准PID控制器。在考虑船体-围油栏耦合动力学的仿真结果中,两种控制器均实现了准确的路径跟踪。这些组成部分共同构成了一个可扩展的整体框架,用于对大规模溢油灾害做出快速、风险感知的多机器人响应。
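The risk-weighted minimum-latency objective above has a classical special case worth sketching: for a single vessel duo serving spills sequentially, ordering by descending risk/service-time ratio (Smith's rule) is provably optimal, making it a natural warm start of the kind the abstract mentions. The paper's actual heuristic is not specified here, so treat this as an illustrative stand-in:

```python
def risk_weighted_latency(order, risk, service):
    """Sum of risk_i * completion_time_i for one vessel duo servicing
    the spills sequentially in the given order (travel times omitted)."""
    t, cost = 0.0, 0.0
    for i in order:
        t += service[i]
        cost += risk[i] * t
    return cost

def wspt_order(risk, service):
    """Smith's rule (weighted shortest processing time): serve spills in
    descending risk/service ratio; optimal for a single server."""
    return sorted(range(len(risk)), key=lambda i: -risk[i] / service[i])
```

High-risk, quickly contained spills go first; a slow, low-risk spill waiting costs little. The full multi-vessel problem with travel times is what the MILP then refines from this seed.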
cs.RO / 43 / 2603.16630
Reconciling distributed compliance with high-performance control in continuum soft robotics
在连续体软机器人中调和分布式柔顺性与高性能控制
Abstract
High-performance closed-loop control of truly soft continuum manipulators has remained elusive. Experimental demonstrations have largely relied on sufficiently stiff, piecewise architectures in which each actuated segment behaves as a distributed yet effectively rigid element, while deformation modes beyond simple bending are suppressed. This strategy simplifies modeling and control, but sidesteps the intrinsic complexity of a fully compliant body and makes the system behave as a serial kinematic chain, much like a conventional articulated robot. An implicit conclusion has consequently emerged within the community: distributed softness and dynamic precision are incompatible. Here we show this trade-off is not fundamental. We present a highly compliant, fully continuum robotic arm - without hardware discretization or stiffness-based mode suppression - that achieves fast, precise task-space convergence under dynamic conditions. The platform integrates direct-drive actuation, a tendon routing scheme enabling coupled bending and twisting, and a structured nonlinear control architecture grounded in reduced-order strain modeling of underactuated systems. Modeling, actuation, and control are co-designed to preserve essential mechanical complexity while enabling high-bandwidth loop closure. Experiments demonstrate accurate, repeatable execution of dynamic Cartesian tasks, including fast positioning and interaction. The proposed system achieves the fastest reported task-execution speed among soft robots. At millimetric precision, execution speed increases nearly fourfold compared with prior approaches, while operating on a fully compliant continuum body. These results show that distributed compliance and high-performance dynamic control can coexist, opening a path toward truly soft manipulators approaching the operational capabilities of rigid robots without sacrificing morphological richness.
Chinese Translation
真正柔软的连续体操作臂的高性能闭环控制一直难以实现。实验演示大多依赖足够刚硬的分段式结构,其中每个被驱动段表现为分布式但实际上刚性的元件,而超出简单弯曲的变形模式则被抑制。这一策略简化了建模与控制,却回避了完全柔顺躯体的内在复杂性,使系统表现得像一条串联运动链,与传统关节式机器人无异。因此,社区内逐渐形成了一个隐含结论:分布式柔顺性与动态精度不可兼得。在此我们表明,这一权衡并非根本性的。我们提出了一种高度柔顺、完全连续体的机器人手臂——没有硬件离散化,也没有基于刚度的模式抑制——能在动态条件下实现快速、精确的任务空间收敛。该平台集成了直接驱动致动、一种支持弯曲与扭转耦合的腱绳布线方案,以及一个植根于欠驱动系统降阶应变建模的结构化非线性控制架构。建模、致动与控制经过协同设计,在保留关键机械复杂性的同时实现高带宽回路闭合。实验展示了动态笛卡尔任务(包括快速定位与交互)的准确、可重复执行。所提系统在软体机器人中实现了已报道的最快任务执行速度。在毫米级精度下,其执行速度比既有方法提高了近四倍,且运行在一个完全柔顺的连续体之上。这些结果表明,分布式柔顺性与高性能动态控制可以共存,为打造真正柔软、接近刚性机器人作业能力而不牺牲形态丰富性的操作臂开辟了道路。
cs.RO / 44 / 2603.16669
Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation
Kinema4D:用于时空具身模拟的运动学4D世界建模
Abstract
Simulating robot-world interactions is a cornerstone of Embodied AI. Recently, a few works have shown promise in leveraging video generations to transcend the rigid visual/physical constraints of traditional simulators. However, they primarily operate in 2D space or are guided by static environmental cues, ignoring the fundamental reality that robot-world interactions are inherently 4D spatiotemporal events that require precise interactive modeling. To restore this 4D essence while ensuring the precise robot control, we introduce Kinema4D, a new action-conditioned 4D generative robotic simulator that disentangles the robot-world interaction into: i) Precise 4D representation of robot controls: we drive a URDF-based 3D robot via kinematics, producing a precise 4D robot control trajectory. ii) Generative 4D modeling of environmental reactions: we project the 4D robot trajectory into a pointmap as a spatiotemporal visual signal, controlling the generative model to synthesize complex environments' reactive dynamics into synchronized RGB/pointmap sequences. To facilitate training, we curated a large-scale dataset called Robo4D-200k, comprising 201,426 robot interaction episodes with high-quality 4D annotations. Extensive experiments demonstrate that our method effectively simulates physically-plausible, geometry-consistent, and embodiment-agnostic interactions that faithfully mirror diverse real-world dynamics. For the first time, it shows potential zero-shot transfer capability, providing a high-fidelity foundation for advancing next-generation embodied simulation.
Chinese Translation
模拟机器人与世界的交互是具身人工智能的基石。最近,一些工作展示了利用视频生成来超越传统模拟器刚性视觉/物理限制的潜力。然而,这些方法主要在二维空间中运作,或仅受静态环境线索引导,忽视了一个基本事实:机器人与世界的交互本质上是需要精确交互建模的四维时空事件。为了在确保精确机器人控制的同时恢复这种四维本质,我们提出了Kinema4D,一种新的动作条件化四维生成式机器人模拟器,它将机器人与世界的交互解耦为:i) 机器人控制的精确四维表示:我们通过运动学驱动基于URDF的三维机器人,生成精确的四维机器人控制轨迹;ii) 环境反应的生成式四维建模:我们将四维机器人轨迹投影为点图(pointmap),作为时空视觉信号,控制生成模型将复杂环境的反应动态合成为同步的RGB/点图序列。为便于训练,我们整理了一个名为Robo4D-200k的大规模数据集,包含201,426个带高质量四维标注的机器人交互片段。大量实验表明,我们的方法能够有效模拟物理合理、几何一致且与具身无关的交互,忠实反映多样的现实世界动态。它首次展示了潜在的零样本迁移能力,为推进下一代具身模拟提供了高保真基础。
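Projecting a 3D robot trajectory into a per-frame pointmap, as step ii) above describes, amounts at its simplest to a pinhole projection of each keypoint into (u, v, depth) entries. This sketch uses a generic pinhole model with assumed intrinsics, not Kinema4D's exact conditioning signal:

```python
def project_pointmap(points, fx, fy, cx, cy):
    """Pinhole projection of 3D keypoints (camera frame, z forward) into
    (u, v, depth) pointmap entries; points behind the camera are skipped.
    fx, fy are focal lengths in pixels; (cx, cy) the principal point."""
    entries = []
    for x, y, z in points:
        if z <= 0:
            continue
        u = fx * x / z + cx
        v = fy * y / z + cy
        entries.append((u, v, z))
    return entries
```

Repeating this per frame turns the kinematic 4D trajectory into a spatiotemporal image-aligned signal, which is what lets a video generator condition on precise robot motion rather than on text or static context.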
cs.RO / 45 / 2603.16673
When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making
机器人何时应进行思考?基于强化学习的资源感知推理用于具身机器人决策
Abstract
Embodied robotic systems increasingly rely on large language model (LLM)-based agents to support high-level reasoning, planning, and decision-making during interactions with the environment. However, invoking LLM reasoning introduces substantial computational latency and resource overhead, which can interrupt action execution and reduce system reliability. Excessive reasoning may delay actions, while insufficient reasoning often leads to incorrect decisions and task failures. This raises a fundamental question for embodied agents: when should the agent reason, and when should it act? In this work, we propose RARRL (Resource-Aware Reasoning via Reinforcement Learning), a hierarchical framework for resource-aware orchestration of embodied agents. Rather than learning low-level control policies, RARRL learns a high-level orchestration policy that operates at the agent's decision-making layer. This policy enables the agent to adaptively determine whether to invoke reasoning, which reasoning role to employ, and how much computational budget to allocate based on current observations, execution history, and remaining resources. Extensive experiments, including evaluations with empirical latency profiles derived from the ALFRED benchmark, show that RARRL consistently improves task success rates while reducing execution latency and enhancing robustness compared with fixed or heuristic reasoning strategies. These results demonstrate that adaptive reasoning control is essential for building reliable and efficient embodied robotic agents.
Chinese Translation
具身机器人系统越来越依赖基于大型语言模型(LLM)的代理,在与环境交互的过程中支持高层次的推理、规划和决策。然而,调用LLM推理会引入可观的计算延迟和资源开销,可能中断动作执行并降低系统可靠性。推理过多会延误动作,而推理不足又常导致错误决策和任务失败。这对具身代理提出了一个基本问题:代理何时应当推理,何时应当行动?在本工作中,我们提出了RARRL(Resource-Aware Reasoning via Reinforcement Learning,基于强化学习的资源感知推理),一个用于具身代理资源感知编排的分层框架。RARRL学习的不是低层控制策略,而是一个运行于代理决策层的高层编排策略。该策略使代理能够根据当前观测、执行历史和剩余资源,自适应地决定是否调用推理、采用哪种推理角色以及分配多少计算预算。大量实验(包括使用从ALFRED基准得到的实测延迟数据进行的评估)表明,相较于固定或启发式推理策略,RARRL在持续提高任务成功率的同时降低了执行延迟并增强了鲁棒性。这些结果表明,自适应推理控制对于构建可靠、高效的具身机器人代理至关重要。
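The reason-or-act dilemma above can be made concrete with a toy expected-utility gate. This is an illustrative hand-coded rule, not RARRL's learned orchestration policy, and all quantities are assumed names:

```python
def choose_action(p_success_act, p_success_reason, latency_cost, budget):
    """Toy gate: invoke reasoning only when its expected success gain
    outweighs its (normalized) latency cost and the remaining compute
    budget allows it; otherwise act on the current plan."""
    if budget <= 0:
        return "act"  # no compute left: acting is the only option
    gain = p_success_reason - p_success_act
    return "reason" if gain > latency_cost else "act"
```

A learned policy generalizes this in exactly the dimensions the abstract lists: the gate's inputs become observations and execution history, the binary choice expands to which reasoning role to invoke, and the threshold adapts to the remaining budget rather than being fixed.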
cs.RO / 46 / 2603.16683
Learning Whole-Body Control for a Salamander Robot
为蝾螈机器人学习全身控制
Abstract
Amphibious legged robots inspired by salamanders are promising in applications in complex amphibious environments. However, despite the significant success of training controllers that achieve diverse locomotion behaviors in conventional quadrupedal robots, most salamander robots relied on central-pattern-generator (CPG)-based and model-based coordination strategies for locomotion control. Learning unified joint-level whole-body control that reliably transfers from simulation to highly articulated physical salamander robots remains relatively underexplored. In addition, few legged robots have tried learning-based controllers in amphibious environments. In this work, we employ Reinforcement Learning to map proprioceptive observations and commanded velocities to joint-level actions, allowing coordinated locomotor behaviors to emerge. To deploy these policies on hardware, we adopt a system-level real-to-sim matching and sim-to-real transfer strategy. The learned controller achieves stable and coordinated walking on both flat and uneven terrains in the real world. Beyond terrestrial locomotion, the framework enables transitions between walking and swimming in simulation, highlighting a phenomenon of interest for understanding locomotion across distinct physical modes.
Chinese Translation
受蝾螈启发的两栖腿式机器人在复杂两栖环境中应用前景广阔。然而,尽管在传统四足机器人上训练实现多样化运动行为的控制器已取得显著成功,大多数蝾螈机器人的运动控制仍依赖基于中央模式发生器(CPG)和基于模型的协调策略。学习统一的关节级全身控制,并使其可靠地从仿真迁移到高度多关节的物理蝾螈机器人,仍然相对缺乏探索。此外,鲜有腿式机器人在两栖环境中尝试基于学习的控制器。在本工作中,我们采用强化学习将本体感知观测与指令速度映射为关节级动作,使协调的运动行为得以涌现。为了在硬件上部署这些策略,我们采用了系统级的真实到仿真匹配与仿真到真实迁移策略。所学控制器在真实世界中于平坦和不平坦地形上均实现了稳定且协调的行走。除陆地运动外,该框架还在仿真中实现了行走与游泳之间的转换,突显了一个有助于理解跨不同物理模式运动的有趣现象。
cs.RO / 47 / 2603.16685
vAccSOL: Efficient and Transparent AI Vision Offloading for Mobile Robots
vAccSOL:移动机器人高效透明的人工智能视觉卸载
Abstract
Mobile robots are increasingly deployed for inspection, patrol, and search-and-rescue operations, relying on computer vision for perception, navigation, and autonomous decision-making. However, executing modern vision workloads onboard is challenging due to limited compute resources and strict energy constraints. While some platforms include embedded accelerators, these are typically tied to proprietary software stacks, leaving user-defined workloads to run on resource-constrained companion computers. We present vAccSOL, a framework for efficient and transparent execution of AI-based vision workloads across heterogeneous robotic and edge platforms. vAccSOL integrates two components: SOL, a neural network compiler that generates optimized inference libraries with minimal runtime dependencies, and vAccel, a lightweight execution framework that transparently dispatches inference locally on the robot or to nearby edge infrastructure. This combination enables hardware-optimized inference and flexible execution placement without requiring modifications to robot applications. We evaluate vAccSOL on a real-world testbed with a commercial quadruped robot and twelve deep learning models covering image classification, video classification, and semantic segmentation. Compared to a PyTorch compiler baseline, SOL achieves comparable or better inference performance. With edge offloading, vAccSOL reduces robot-side power consumption by up to 80% and edge-side power by up to 60% compared to PyTorch, while increasing vision pipeline frame rate by up to 24x, extending the operating lifetime of battery-powered robots.
Chinese Translation
移动机器人越来越多地被部署于检查、巡逻和搜救任务,依靠计算机视觉进行感知、导航和自主决策。然而,由于计算资源有限和严格的能耗约束,在机器人本体上执行现代视觉工作负载颇具挑战。尽管一些平台配有嵌入式加速器,但这些加速器通常与专有软件栈绑定,使得用户自定义的工作负载只能运行在资源受限的随行计算机上。我们提出了vAccSOL,一个在异构机器人与边缘平台上高效且透明地执行基于AI的视觉工作负载的框架。vAccSOL集成了两个组件:SOL,一个生成运行时依赖极少的优化推理库的神经网络编译器;以及vAccel,一个可透明地将推理分派到机器人本地或附近边缘基础设施的轻量级执行框架。这一组合实现了硬件优化的推理和灵活的执行位置选择,而无需修改机器人应用。我们在一个真实测试平台上评估了vAccSOL,该平台包括一台商用四足机器人和十二个覆盖图像分类、视频分类与语义分割的深度学习模型。与PyTorch编译器基线相比,SOL取得了相当或更好的推理性能。借助边缘卸载,vAccSOL相比PyTorch将机器人侧功耗降低最多80%、边缘侧功耗降低最多60%,同时将视觉流水线帧率提高最多24倍,延长了电池供电机器人的续航时间。
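The local-versus-edge dispatch decision above boils down to comparing end-to-end latency and robot-side energy for each placement. The rule and all numbers below are an illustrative sketch, not vAccel's actual dispatch logic or vAccSOL's measured profiles:

```python
def place_inference(compute_ms_local, compute_ms_edge, net_ms,
                    power_w_local, power_w_edge_radio):
    """Pick the placement with the lower end-to-end latency, breaking
    ties toward running on the robot.
    Returns (placement, latency_ms, robot_side_energy_mj); note that
    watts * milliseconds = millijoules."""
    local_latency = compute_ms_local
    edge_latency = compute_ms_edge + net_ms
    if edge_latency < local_latency:
        # offloaded: the robot only pays for the radio during transfer
        return ("edge", edge_latency, power_w_edge_radio * net_ms)
    return ("local", local_latency, power_w_local * compute_ms_local)
```

When the edge accelerator is much faster than the onboard computer, offloading wins on both latency and robot-side energy despite the network hop, which mirrors the power and frame-rate gains the abstract reports.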
cs.RO / 48 / 2603.16772
Beyond Cybathlon: On-demand Quadrupedal Assistance for People with Limited Mobility
超越Cybathlon:为行动不便人士提供按需四足辅助
Abstract
Background: Assistance robots have the potential to increase the independence of people who need daily care due to limited mobility or being wheelchair-bound. Current solutions of attaching robotic arms to motorized wheelchairs offer limited additional mobility at the cost of increased size and reduced wheelchair maneuverability. Methods: We present an on-demand quadrupedal assistance robot system controlled via a shared autonomy approach, which combines semi-autonomous task execution with human teleoperation. Due to the mobile nature of the system it can assist the operator whenever needed and perform autonomous tasks independently, without otherwise restricting their mobility. We automate pick-and-place tasks, as well as robot movement through the environment with semantic, collision-aware navigation. For teleoperation, we present a mouth-level joystick interface that enables an operator with reduced mobility to control the robot's end effector for precision manipulation. Results: We showcase our system in the Cybathlon 2024 Assistance Robot Race, and validate it in an at-home experimental setup, where we measure task completion times and user satisfaction. We find our system capable of assisting in a broad variety of tasks, including those that require dexterous manipulation. The user study confirms the intuition that increased robot autonomy alleviates the operator's mental load. Conclusions: We present a flexible system that has the potential to help people in wheelchairs maintain independence in everyday life by enabling them to solve mobile manipulation problems without external support. We achieve results comparable to previous state-of-the-art on subjective metrics while allowing for more autonomy of the operator and greater agility for manipulation.
Chinese Translation
背景:辅助机器人有潜力提高因行动不便或依赖轮椅而需要日常护理者的独立性。目前将机械臂安装到电动轮椅上的方案只能提供有限的额外移动能力,代价却是体积增大和轮椅机动性下降。方法:我们提出了一种按需四足辅助机器人系统,通过共享自主方式进行控制,该方式将半自主任务执行与人工遥操作相结合。由于系统具有移动性,它可以在需要时协助操作者并独立执行自主任务,而不会限制操作者的行动。我们自动化了拾取与放置任务,并通过语义化、碰撞感知的导航实现机器人在环境中的移动。在遥操作方面,我们提出了一种位于嘴部高度的操纵杆界面,使行动受限的操作者能够控制机器人的末端执行器进行精细操作。结果:我们在Cybathlon 2024辅助机器人竞赛中展示了我们的系统,并在居家实验设置中对其进行了验证,测量了任务完成时间和用户满意度。我们发现该系统能够协助完成多种多样的任务,包括需要灵巧操作的任务。用户研究证实了这一直觉:提高机器人自主性能够减轻操作者的心理负担。结论:我们提出了一个灵活的系统,有潜力帮助轮椅使用者在日常生活中保持独立,使他们无需外部支持即可解决移动操作问题。我们在主观指标上取得了与既有最先进水平相当的结果,同时赋予操作者更多自主权,并为操作提供了更高的灵活性。
cs.RO / 49 / 2603.16803
Development of Low-Cost and Bidirectional Syringe Pumps for Soft Robotics Applications
用于软机器人应用的低成本双向注射泵的开发
Abstract
Soft robotics leverages deformable materials to develop robots capable of navigating unstructured and dynamic environments. Silicone Voxel-Based Soft Robots (Silibots) are a type of pneumatically actuated soft robots that rely on the inflation and deflation of their voxels for shape-shifting behaviors. However, traditional pneumatic actuation methods (high pressure solenoids, medical diaphragm pumps, micro compressors, compressed fluid) pose significant challenges due to their limited efficacy, cost, complexity, or lack of precision. This work introduces a low cost and modular syringe pump system, constructed with off the shelf and 3D printed parts, designed to overcome these limitations. The syringe pump system also enhances actuation with the unique ability to pull a vacuum as well pump air into the soft robot. Furthermore, the syringe pump features modular hardware and customizable software, allowing for researchers to tailor the syringe pump to their requirements or operate multiple pumps simultaneously with unique pump parameters. This flexibility makes the syringe pump an accessible and scalable tool that paves the way for broader adoption of soft robotic technologies in research and education.
Chinese Translation
软机器人利用可变形材料开发能够在非结构化和动态环境中导航的机器人。硅胶体素软机器人(Silibots)是一种气动驱动的软机器人,依赖于其体素的充气和放气来实现形状变化行为。然而,传统的气动驱动方法(高压电磁阀、医疗隔膜泵、微型压缩机、压缩流体)由于其效率有限、成本高、复杂性或缺乏精确性,面临重大挑战。本研究提出了一种低成本和模块化的注射泵系统,采用现成的和3D打印的部件构建,旨在克服这些局限性。该注射泵系统还通过独特的能力增强了驱动效果,能够抽真空并向软机器人泵送空气。此外,注射泵具有模块化硬件和可定制软件,使研究人员能够根据需求调整注射泵或同时操作多个具有独特泵参数的泵。这种灵活性使注射泵成为一个可获取且可扩展的工具,为软机器人技术在研究和教育中的更广泛应用铺平了道路。
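The bidirectional pumping described above reduces to simple stepper-motor arithmetic: plunger travel per step times syringe cross-section gives the dispensed volume, with negative steps modelling the vacuum-pull direction. A hedged back-of-envelope sketch (the syringe bore, leadscrew pitch, and step count below are hypothetical, not the paper's hardware):

```python
import math

def volume_per_step_ul(syringe_diameter_mm: float,
                       leadscrew_pitch_mm: float,
                       steps_per_rev: int) -> float:
    """Microlitres of fluid moved per motor step for a syringe pump.

    Plunger travel per step = pitch / steps_per_rev (mm);
    displaced volume = cross-sectional area * travel; 1 mm^3 == 1 uL.
    """
    area_mm2 = math.pi * (syringe_diameter_mm / 2) ** 2
    travel_mm = leadscrew_pitch_mm / steps_per_rev
    return area_mm2 * travel_mm

def dispensed_ul(steps: int, vol_per_step_ul: float) -> float:
    """Signed dispensed volume; negative steps model the pump's
    bidirectional (vacuum-pull) mode."""
    return steps * vol_per_step_ul

# Hypothetical 20 mL syringe (~19.1 mm bore), 2 mm pitch screw, 200-step motor
vol = volume_per_step_ul(19.1, 2.0, 200)   # roughly 2.9 uL per full step
```

Micro-stepping the motor divides this resolution further, which is one reason such pumps can out-resolve small diaphragm pumps for precise voxel inflation.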
cs.RO / 50 / 2603.16806
DexGrasp-Zero: A Morphology-Aligned Policy for Zero-Shot Cross-Embodiment Dexterous Grasping
DexGrasp-Zero:一种用于零样本跨形态灵巧抓取的形态对齐策略
Abstract
To meet the demands of increasingly diverse dexterous hand hardware, it is crucial to develop a policy that enables zero-shot cross-embodiment grasping without redundant re-learning. Cross-embodiment alignment is challenging due to heterogeneous hand kinematics and physical constraints. Existing approaches typically predict intermediate motion targets and retarget them to each embodiment, which may introduce errors and violate embodiment-specific limits, hindering transfer across diverse hands. To overcome these limitations, we propose \textit{DexGrasp-Zero}, a policy that learns universal grasping skills from diverse embodiments, enabling zero-shot transfer to unseen hands. We first introduce a morphology-aligned graph representation that maps each hand's kinematic keypoints to anatomically grounded nodes and equips each node with tri-axial orthogonal motion primitives, enabling structural and semantic alignment across different morphologies. Relying on this graph-based representation, we design a \textit{Morphology-Aligned Graph Convolutional Network} (MAGCN) to encode the graph for policy learning. MAGCN incorporates a \textit{Physical Property Injection} mechanism that fuses hand-specific physical constraints into the graph features, enabling adaptive compensation for varying link lengths and actuation limits for precise and stable grasping. Our extensive simulation evaluations on the YCB dataset demonstrate that our policy, jointly trained on four heterogeneous hands (Allegro, Shadow, Schunk, Ability), achieves an 85\% zero-shot success rate on unseen hardware (LEAP, Inspire), outperforming the state-of-the-art method by 59.5\%. Real-world experiments further evaluate our policy on three robot platforms (LEAP, Inspire, Revo2), achieving an 82\% average success rate on unseen objects.
Chinese Translation
为了满足日益多样化的灵巧手硬件的需求,开发一种能够实现零样本跨形态抓取而无需冗余重学习的策略至关重要。由于手部运动学和物理约束的异质性,跨形态对齐面临挑战。现有方法通常预测中间运动目标并将其重新定向到每个形态,这可能引入误差并违反特定形态的限制,从而阻碍在不同手之间的迁移。为克服这些限制,我们提出了\textit{DexGrasp-Zero},这是一种从多样化形态中学习通用抓取技能的策略,能够实现对未见手的零样本迁移。我们首先引入了一种形态对齐的图表示,将每只手的运动学关键点映射到解剖学基础的节点,并为每个节点配备三轴正交运动原语,从而实现不同形态之间的结构和语义对齐。依赖于这种基于图的表示,我们设计了一种\textit{形态对齐图卷积网络}(MAGCN)来对图进行编码以进行策略学习。MAGCN结合了一种\textit{物理属性注入}机制,将手部特定的物理约束融入图特征中,使其能够针对不同的连杆长度和驱动限制进行自适应补偿,以实现精确和稳定的抓取。我们在YCB数据集上的广泛仿真评估表明,我们的策略在四种异质手(Allegro、Shadow、Schunk、Ability)上联合训练,能够在未见硬件(LEAP、Inspire)上实现85%的零样本成功率,超越了当前最先进的方法59.5%。实际实验进一步在三个机器人平台(LEAP、Inspire、Revo2)上评估了我们的策略,在未见物体上实现了82%的平均成功率。
cs.RO / 51 / 2603.16809
CABTO: Context-Aware Behavior Tree Grounding for Robot Manipulation
CABTO:用于机器人操作的上下文感知行为树基础构建
Abstract
Behavior Trees (BTs) offer a powerful paradigm for designing modular and reactive robot controllers. BT planning, an emerging field, provides theoretical guarantees for the automated generation of reliable BTs. However, BT planning typically assumes that a well-designed BT system is already grounded -- comprising high-level action models and low-level control policies -- which often requires extensive expert knowledge and manual effort. In this paper, we formalize the BT Grounding problem: the automated construction of a complete and consistent BT system. We analyze its complexity and introduce CABTO (Context-Aware Behavior Tree grOunding), the first framework to efficiently solve this challenge. CABTO leverages pre-trained Large Models (LMs) to heuristically search the space of action models and control policies, guided by contextual feedback from BT planners and environmental observations. Experiments spanning seven task sets across three distinct robotic manipulation scenarios demonstrate CABTO's effectiveness and efficiency in generating complete and consistent behavior tree systems.
Chinese Translation
行为树(BTs)为设计模块化和反应式机器人控制器提供了一种强大的范式。行为树规划是一个新兴领域,为自动生成可靠的行为树提供了理论保证。然而,行为树规划通常假设一个设计良好的行为树系统已经建立——包括高级动作模型和低级控制策略——这通常需要大量的专家知识和手动努力。本文中,我们对行为树基础构建问题进行了形式化:即自动构建一个完整且一致的行为树系统。我们分析了其复杂性,并引入了CABTO(上下文感知行为树基础构建),这是第一个高效解决这一挑战的框架。CABTO利用预训练的大型模型(LMs),在行为树规划器的上下文反馈和环境观察的引导下,启发式地搜索动作模型和控制策略的空间。跨越三个不同机器人操作场景的七个任务集的实验证明了CABTO在生成完整且一致的行为树系统方面的有效性和效率。
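The behavior trees that CABTO grounds follow standard BT tick semantics: a Sequence succeeds only if all children succeed, a Fallback succeeds on the first child that does. A minimal sketch of this generic machinery (not CABTO's implementation; the pick task at the bottom is an illustrative example):

```python
from enum import Enum

class Status(Enum):
    SUCCESS = 0
    FAILURE = 1
    RUNNING = 2

class Action:
    """Leaf node wrapping a callable low-level control policy."""
    def __init__(self, fn):
        self.fn = fn
    def tick(self) -> Status:
        return self.fn()

class Sequence:
    """Ticks children left to right; stops at the first non-SUCCESS."""
    def __init__(self, children):
        self.children = children
    def tick(self) -> Status:
        for child in self.children:
            s = child.tick()
            if s != Status.SUCCESS:
                return s
        return Status.SUCCESS

class Fallback:
    """Ticks children left to right; stops at the first non-FAILURE."""
    def __init__(self, children):
        self.children = children
    def tick(self) -> Status:
        for child in self.children:
            s = child.tick()
            if s != Status.FAILURE:
                return s
        return Status.FAILURE

# Illustrative pick task: direct grasp fails, so approach then grasp.
holding = {"object": False}
def grasp():
    holding["object"] = True
    return Status.SUCCESS

tree = Fallback([Action(lambda: Status.FAILURE),                 # direct grasp
                 Sequence([Action(lambda: Status.SUCCESS),       # approach
                           Action(grasp)])])                     # grasp
result = tree.tick()
```

Grounding, in this framing, is the problem of supplying the leaf `Action` callables (control policies) and the action models the planner reasons over.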
cs.RO / 52 / 2603.16825
Real-Time Decoding of Movement Onset and Offset for Brain-Controlled Rehabilitation Exoskeleton
脑控康复外骨骼的运动开始与结束的实时解码
Abstract
Robot-assisted therapy can deliver high-dose, task-specific training after neurologic injury, but most systems act primarily at the limb level-engaging the impaired neural circuits only indirectly-which remains a key barrier to truly contingent, neuroplasticity-targeted rehabilitation. We address this gap by implementing online, dual-state motor imagery control of an upper-limb exoskeleton, enabling goal-directed reaches to be both initiated and terminated directly from non-invasive EEG. Eight participants used EEG to initiate assistance and then volitionally halt the robot mid-trajectory. Across two online sessions, group-mean hit rates were 61.5% for onset and 64.5% for offset, demonstrating reliable start-stop command delivery despite instrumental noise and passive arm motion. Methodologically, we reveal a systematic, class-driven bias induced by common task-based recentering using an asymmetric margin diagnostic, and we introduce a class-agnostic fixation-based recentering method that tracks drift without sampling command classes while preserving class geometry. This substantially improves threshold-free separability (AUC gains: onset +56%, p = 0.0117; offset +34%, p = 0.0251) and reduces bias within and across days. Together, these results help bridge offline decoding and practical, intention-driven start-stop control of a rehabilitation exoskeleton, enabling precisely timed, contingent assistance aligned with neuroplasticity goals while supporting future clinical translation.
Chinese Translation
机器人辅助治疗可以在神经损伤后提供高剂量、特定任务的训练,但大多数系统主要在肢体层面运作,仅间接激活受损的神经回路,这仍然是实现真正依赖式、以神经可塑性为目标的康复的关键障碍。我们通过实施在线双状态运动意象控制上肢外骨骼来解决这一问题,使得目标导向的够取动作能够直接从非侵入性脑电图(EEG)中启动和终止。八名参与者使用EEG启动辅助,然后自愿在轨迹中途停止机器人。在两次在线会话中,组均命中率为61.5%(开始)和64.5%(结束),尽管存在仪器噪声和被动手臂运动,仍显示出可靠的开始-停止指令传递。在方法论上,我们利用非对称边际诊断揭示了由常见的基于任务的重新中心化引起的系统性类别驱动偏差,并引入了一种类别无关的基于注视的重新中心化方法,该方法在不采样指令类别的情况下跟踪漂移,同时保持类别几何形状。这显著改善了无阈值可分离性(AUC增益:开始 +56%,p = 0.0117;结束 +34%,p = 0.0251),并减少了日内和跨天的偏差。总之,这些结果有助于弥合离线解码与康复外骨骼的实际意图驱动的开始-停止控制之间的差距,使得能够提供与神经可塑性目标精确对齐的、时机恰当的依赖性辅助,同时支持未来的临床转化。
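The class-agnostic recentering idea can be illustrated with a toy drift tracker: estimate the feature centre only from fixation (non-command) samples, so no command class biases the estimate, then subtract it everywhere. This EMA update and all parameters are illustrative assumptions, not the paper's method:

```python
import numpy as np

def fixation_recenter(features: np.ndarray,
                      is_fixation: np.ndarray,
                      alpha: float = 0.05) -> np.ndarray:
    """Class-agnostic recentering sketch: track slow drift with an
    exponential moving average updated ONLY on fixation samples, so no
    command-class data shifts the centre estimate.

    features: (T, D) feature time series; is_fixation: (T,) bool mask.
    """
    centre = features[is_fixation][0].copy()   # initialise from a fixation
    out = np.empty_like(features)
    for t in range(features.shape[0]):
        if is_fixation[t]:
            centre = (1 - alpha) * centre + alpha * features[t]
        out[t] = features[t] - centre
    return out

rng = np.random.default_rng(0)
drift = np.linspace(0, 5, 400)[:, None]        # slow sensor drift
x = rng.normal(size=(400, 3)) + drift          # drifting features
mask = np.arange(400) % 4 == 0                 # every 4th sample is fixation
y = fixation_recenter(x, mask)                 # drift largely removed
```

The point of the diagnostic in the abstract is that recentering on command-class samples instead would drag the centre toward whichever class dominates, creating exactly the asymmetric bias this scheme avoids.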
cs.RO / 53 / 2603.16853
BrickSim: A Physics-Based Simulator for Manipulating Interlocking Brick Assemblies
BrickSim:一种基于物理的互锁砖体组装操控模拟器
Abstract
Interlocking brick assemblies provide a standardized yet challenging testbed for contact-rich and long-horizon robotic manipulation, but existing rigid-body simulators do not faithfully capture snap-fit mechanics. We present BrickSim, the first real-time physics-based simulator for interlocking brick assemblies. BrickSim introduces a compact force-based mechanics model for snap-fit connections and solves the resulting internal force distribution using a structured convex quadratic program. Combined with a hybrid architecture that delegates rigid-body dynamics to the underlying physics engine while handling snap-fit mechanics separately, BrickSim enables real-time, high-fidelity simulation of assembly, disassembly, and structural collapse. On 150 real-world assemblies, BrickSim achieves 100% accuracy in static stability prediction with an average solve time of 5 ms. In dynamic drop tests, it also faithfully reproduces real-world structural collapse, precisely mirroring both the occurrence of breakage and the specific breakage locations. Built on Isaac Sim, BrickSim further supports seamless integration with a wide variety of robots and existing pipelines. We demonstrate robotic construction of brick assemblies using BrickSim, highlighting its potential as a foundation for research in dexterous, long-horizon robotic manipulation. BrickSim is open-source, and the code is available at https://github.com/intelligent-control-lab/BrickSim.
Chinese Translation
互锁砖体组装提供了一个标准化但具有挑战性的接触丰富和长时间范围的机器人操控测试平台,但现有的刚体模拟器无法真实捕捉到卡扣连接的力学特性。我们提出了BrickSim,这是首个针对互锁砖体组装的实时基于物理的模拟器。BrickSim引入了一种紧凑的基于力的力学模型用于卡扣连接,并通过结构化的凸二次规划方法求解由此产生的内部力分布。结合一种混合架构,该架构将刚体动力学委托给底层物理引擎,同时单独处理卡扣力学,BrickSim实现了组装、拆卸和结构崩溃的实时高保真模拟。在150个真实世界的组装案例中,BrickSim在静态稳定性预测中达到了100%的准确率,平均求解时间为5毫秒。在动态跌落测试中,它也忠实地再现了真实世界的结构崩溃,准确反映了破损的发生及具体破损位置。基于Isaac Sim,BrickSim进一步支持与各种机器人和现有管道的无缝集成。我们展示了使用BrickSim进行砖体组装的机器人建造,突显了其作为灵巧、长时间范围机器人操控研究基础的潜力。BrickSim是开源的,代码可在https://github.com/intelligent-control-lab/BrickSim获取。
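BrickSim's internal-force solve is described as a structured convex QP. As a generic illustration (not BrickSim's actual formulation), an equality-constrained QP, min ½xᵀQx + cᵀx subject to Ax = b, can be solved directly through its KKT system:

```python
import numpy as np

def solve_eq_qp(Q, c, A, b):
    """Solve min 0.5 x'Qx + c'x  s.t.  Ax = b via the KKT system
    [[Q, A'], [A, 0]] [x; lam] = [-c; b].
    Assumes Q positive definite and A full row rank."""
    n, m = Q.shape[0], A.shape[0]
    K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-c, b])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]          # primal forces, multipliers

# Toy force distribution: two connection forces must balance a 10 N
# external load (equality constraint) while minimising total squared force.
Q = np.eye(2)
c = np.zeros(2)
A = np.array([[1.0, 1.0]])
b = np.array([10.0])
x, lam = solve_eq_qp(Q, c, A, b)     # the load splits evenly: [5, 5]
```

Real snap-fit models add inequality limits (e.g. maximum holding force per stud), which turns this into a general convex QP needing an active-set or interior-point solver rather than one linear solve.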
cs.RO / 54 / 2603.16860
DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models
DreamPlan:通过视频世界模型高效强化微调视觉-语言规划器
Abstract
Robotic manipulation requires sophisticated commonsense reasoning, a capability naturally possessed by large-scale Vision-Language Models (VLMs). While VLMs show promise as zero-shot planners, their lack of grounded physical understanding often leads to compounding errors and low success rates when deployed in complex real-world environments, particularly for challenging tasks like deformable object manipulation. Although Reinforcement Learning (RL) can adapt these planners to specific task dynamics, directly fine-tuning VLMs via real-world interaction is prohibitively expensive, unsafe, and sample-inefficient. To overcome this bottleneck, we introduce DreamPlan, a novel framework for the reinforcement fine-tuning of VLM planners via video world models. Instead of relying on costly physical rollouts, DreamPlan first leverages the zero-shot VLM to collect exploratory interaction data. We demonstrate that this sub-optimal data is sufficient to train an action-conditioned video generation model, which implicitly captures complex real-world physics. Subsequently, the VLM planner is fine-tuned entirely within the "imagination" of this video world model using Odds Ratio Policy Optimization (ORPO). By utilizing these virtual rollouts, physical and task-specific knowledge is efficiently injected into the VLM. Our results indicate that DreamPlan bridges the gap between semantic reasoning and physical grounding, significantly improving manipulation success rates without the need for large-scale real-world data collection. Our project page is https://psi-lab.ai/DreamPlan/.
Chinese Translation
机器人操作需要复杂的常识推理能力,而这一能力是大规模视觉-语言模型(VLMs)所自然具备的。尽管VLMs在零样本规划方面展现出潜力,但它们缺乏扎实的物理理解,往往导致在复杂的现实环境中部署时出现累积错误和低成功率,尤其是在处理可变形物体操作等具有挑战性的任务时。虽然强化学习(RL)可以使这些规划器适应特定的任务动态,但通过现实世界交互直接微调VLMs的成本高昂、安全性差且样本效率低。为了解决这一瓶颈,我们提出了DreamPlan,一个通过视频世界模型对VLM规划器进行强化微调的新框架。DreamPlan首先利用零样本VLM收集探索性交互数据,而不是依赖于昂贵的物理推演(rollouts)。我们证明,这些次优数据足以训练一个动作条件的视频生成模型,该模型隐式捕捉复杂的现实物理。随后,VLM规划器完全在这个视频世界模型的“想象”中使用赔率比策略优化(Odds Ratio Policy Optimization, ORPO)进行微调。通过利用这些虚拟推演,物理和任务特定知识被高效地注入到VLM中。我们的结果表明,DreamPlan弥合了语义推理与物理基础之间的差距,显著提高了操作成功率,而无需进行大规模的现实世界数据收集。我们的项目页面是 https://psi-lab.ai/DreamPlan/。
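ORPO's core odds-ratio term can be sketched from length-normalised sequence log-probabilities, with odds(y|x) = P/(1-P). The pairing of a "chosen" versus "rejected" plan below (e.g. a successful versus failed imagined rollout) is an illustrative assumption, not DreamPlan's exact recipe:

```python
import math

def log_odds(mean_logprob: float) -> float:
    """log odds of a sequence from its length-normalised log-probability:
    odds = p / (1 - p) with p = exp(mean_logprob), requiring p < 1."""
    p = math.exp(mean_logprob)
    return math.log(p) - math.log(1.0 - p)

def orpo_penalty(mean_logprob_chosen: float,
                 mean_logprob_rejected: float) -> float:
    """Odds-ratio term -log sigmoid(log OR): small when the policy
    assigns higher odds to the chosen plan than the rejected one."""
    log_or = log_odds(mean_logprob_chosen) - log_odds(mean_logprob_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-log_or)))   # -log sigmoid

good = orpo_penalty(-0.5, -2.0)   # chosen plan more likely -> small penalty
bad  = orpo_penalty(-2.0, -0.5)   # chosen plan less likely -> large penalty
```

In full ORPO this term is added to the usual supervised loss on the chosen sequence; the virtual rollouts from the world model supply the chosen/rejected pairs without touching the real robot.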
cs.RO / 55 / 2603.16861
MolmoBot: Large-Scale Simulation Enables Zero-Shot Manipulation
MolmoBot:大规模仿真实现零样本操作
Deshpande, Abhay, Guru, Maya, Hendrix, Rose, Jauhri, Snehal, Eftekhar, Ainaz, Tripathi, Rohun, Argus, Max, Salvador, Jordi, Fang, Haoquan, Wallingford, Matthew, Pumacay, Wilbert, Kim, Yejin, Pfeifer, Quinn, Lee, Ying-Chun, Wolters, Piper, Rayyan, Omar, Zhang, Mingtong, Duan, Jiafei, Farley, Karen, Han, Winson, Vanderbilt, Eli, Fox, Dieter, Farhadi, Ali, Chalvatzaki, Georgia, Shah, Dhruv, Krishna, Ranjay
Abstract
A prevailing view in robot learning is that simulation alone is not enough; effective sim-to-real transfer is widely believed to require at least some real-world data collection or task-specific fine-tuning to bridge the gap between simulated and physical environments. We challenge that assumption. With sufficiently large-scale and diverse simulated synthetic training data, we show that zero-shot transfer to the real world is not only possible, but effective for both static and mobile manipulation. We introduce MolmoBot-Engine, a fully open-source pipeline for procedural data generation across robots, tasks, and diverse simulated environments in MolmoSpaces. With it, we release MolmoBot-Data, a dataset of 1.8 million expert trajectories for articulated object manipulation and pick-and-place tasks. We train three policy classes: MolmoBot, a Molmo2-based multi-frame vision-language model with a flow-matching action head; MolmoBot-Pi0, which replicates the $\pi_0$ architecture to enable direct comparison; and MolmoBot-SPOC, a lightweight policy suitable for edge deployment and amenable to RL fine-tuning. We evaluate on two robotic platforms: the Franka FR3 for tabletop manipulation tasks and the Rainbow Robotics RB-Y1 mobile manipulator for door opening, drawer manipulation, cabinet interaction, and mobile pick-and-place. Without any real-world fine-tuning, our policies achieve zero-shot transfer to unseen objects and environments. On tabletop pick-and-place, MolmoBot achieves a success rate of 79.2% in real world evaluations across 4 settings, outperforming $\pi_{0.5}$ at 39.2%. Our results demonstrate that procedural environment generation combined with diverse articulated assets can produce robust manipulation policies that generalize broadly to the real world. Technical Blog: https://allenai.org/blog/molmobot-robot-manipulation
Chinese Translation
在机器人学习领域,一个普遍的观点是,仅靠仿真是不够的;有效的仿真到现实的迁移被广泛认为需要至少一些现实世界的数据收集或特定任务的微调,以弥合模拟环境与物理环境之间的差距。我们对此假设提出挑战。通过足够大规模和多样化的模拟合成训练数据,我们展示了零样本迁移到现实世界不仅是可能的,而且对于静态和移动操作都是有效的。我们引入了 MolmoBot-Engine,这是一个完全开源的管道,用于在 MolmoSpaces 中跨机器人、任务和多样化模拟环境进行程序化数据生成。借助该管道,我们发布了 MolmoBot-Data,这是一个包含 180 万条专家轨迹的数据集,适用于关节物体操作和拾取放置任务。我们训练了三种策略类:MolmoBot,一个基于 Molmo2 的多帧视觉-语言模型,配备流匹配动作头;MolmoBot-Pi0,复制 $\pi_0$ 架构以便进行直接比较;以及 MolmoBot-SPOC,一个适合边缘部署且易于进行强化学习微调的轻量级策略。我们在两个机器人平台上进行了评估:Franka FR3 用于桌面操作任务,Rainbow Robotics RB-Y1 移动操纵器用于开门、抽屉操作、橱柜交互和移动拾取放置。在没有任何现实世界微调的情况下,我们的策略实现了对未见物体和环境的零样本迁移。在桌面拾取放置任务中,MolmoBot 在四个设置下的现实世界评估中取得了 79.2% 的成功率,超越了 $\pi_{0.5}$ 的 39.2%。我们的结果表明,程序化环境生成结合多样化的关节资产可以产生稳健的操作策略,这些策略能够广泛地推广到现实世界。技术博客:https://allenai.org/blog/molmobot-robot-manipulation
cs.RO / 56 / 2603.16866
ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K
ManiTwin:将数据生成准备好的数字对象数据集扩展到100K
Abstract
Learning in simulation provides a useful foundation for scaling robotic manipulation capabilities. However, this paradigm often suffers from a lack of data-generation-ready digital assets, in both scale and diversity. In this work, we present ManiTwin, an automated and efficient pipeline for generating data-generation-ready digital object twins. Our pipeline transforms a single image into simulation-ready and semantically annotated 3D asset, enabling large-scale robotic manipulation data generation. Using this pipeline, we construct ManiTwin-100K, a dataset containing 100K high-quality annotated 3D assets. Each asset is equipped with physical properties, language descriptions, functional annotations, and verified manipulation proposals. Experiments demonstrate that ManiTwin provides an efficient asset synthesis and annotation workflow, and that ManiTwin-100K offers high-quality and diverse assets for manipulation data generation, random scene synthesis, and VQA data generation, establishing a strong foundation for scalable simulation data synthesis and policy learning. Our webpage is available at https://manitwin.github.io/.
Chinese Translation
在模拟环境中学习为扩展机器人操作能力提供了有用的基础。然而,这种范式通常面临着缺乏在规模和多样性方面都适合数据生成的数字资产的问题。在本研究中,我们提出了ManiTwin,这是一个自动化和高效的管道,用于生成适合数据生成的数字对象双胞胎。我们的管道将单张图像转换为适合模拟和语义标注的3D资产,从而使大规模的机器人操作数据生成成为可能。利用该管道,我们构建了ManiTwin-100K,这是一个包含100K高质量标注3D资产的数据集。每个资产都配备了物理属性、语言描述、功能性注释和经过验证的操作提案。实验表明,ManiTwin提供了高效的资产合成与注释工作流程,而ManiTwin-100K为操作数据生成、随机场景合成和视觉问答(VQA)数据生成提供了高质量和多样化的资产,为可扩展的模拟数据合成和策略学习奠定了坚实的基础。我们的网页地址为 https://manitwin.github.io/。
cs.CV / 1 / 2603.15622
SAC-NeRF: Adaptive Ray Sampling for Neural Radiance Fields via Soft Actor-Critic Reinforcement Learning
SAC-NeRF:通过软演员-评论家强化学习实现神经辐射场的自适应光线采样
Abstract
Neural Radiance Fields (NeRF) have achieved photorealistic novel view synthesis but suffer from computational inefficiency due to dense ray sampling during volume rendering. We propose SAC-NeRF, a reinforcement learning framework that learns adaptive sampling policies using Soft Actor-Critic (SAC). Our method formulates sampling as a Markov Decision Process where an RL agent learns to allocate samples based on scene characteristics. We introduce three technical components: (1) a Gaussian mixture distribution color model providing uncertainty estimates, (2) a multi-component reward function balancing quality, efficiency, and consistency, and (3) a two-stage training strategy addressing environment non-stationarity. Experiments on Synthetic-NeRF and LLFF datasets show that SAC-NeRF reduces sampling points by 35-48\% while maintaining rendering quality within 0.3-0.8 dB PSNR of dense sampling baselines. While the learned policy is scene-specific and the RL framework adds complexity compared to simpler heuristics, our work demonstrates that data-driven sampling strategies can discover effective patterns that would be difficult to hand-design.
Chinese Translation
神经辐射场(NeRF)已实现逼真的新视角合成,但由于在体积渲染过程中密集光线采样,导致计算效率低下。我们提出了SAC-NeRF,一种利用软演员-评论家(Soft Actor-Critic, SAC)学习自适应采样策略的强化学习框架。我们的方法将采样形式化为马尔可夫决策过程(Markov Decision Process),其中强化学习代理根据场景特征学习分配样本。我们引入了三个技术组件:(1)提供不确定性估计的高斯混合分布颜色模型,(2)平衡质量、效率和一致性的多组件奖励函数,以及(3)解决环境非平稳性问题的两阶段训练策略。在Synthetic-NeRF和LLFF数据集上的实验表明,SAC-NeRF在将渲染质量保持在密集采样基准0.3-0.8 dB PSNR范围内的同时,减少了35-48%的采样点。尽管学习到的策略是场景特定的,并且与更简单的启发式方法相比,强化学习框架增加了复杂性,但我们的工作表明,数据驱动的采样策略可以发现难以手动设计的有效模式。
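The sampling budget SAC-NeRF allocates feeds standard NeRF volume rendering, where each sample contributes according to its compositing weight: α_i = 1 - exp(-σ_i δ_i) and w_i = T_i α_i with T_i the accumulated transmittance. A sketch of those weights (standard NeRF quadrature, not the paper's RL policy):

```python
import numpy as np

def render_weights(sigmas: np.ndarray, deltas: np.ndarray) -> np.ndarray:
    """Per-sample compositing weights along one ray:
    alpha_i = 1 - exp(-sigma_i * delta_i),
    T_i     = prod_{j<i} (1 - alpha_j),
    w_i     = T_i * alpha_i.
    An adaptive sampler can skip regions whose expected weight is tiny."""
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    return trans * alphas

sigmas = np.array([0.0, 0.0, 5.0, 5.0, 0.1])   # density spike = surface
deltas = np.full(5, 0.1)                        # uniform segment lengths
w = render_weights(sigmas, deltas)              # mass concentrates at index 2
```

Weights like these are exactly the signal a learned sampler exploits: empty space (zero density) yields zero weight, so samples there are wasted compute.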
cs.CV / 2 / 2603.15624
Exploring the Use of VLMs for Navigation Assistance for People with Blindness and Low Vision
探索视觉语言模型在盲人和低视力人士导航辅助中的应用
Abstract
This paper investigates the potential of vision-language models (VLMs) to assist people with blindness and low vision (pBLV) in navigation tasks. We evaluate state-of-the-art closed-source models, including GPT-4V, GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet, alongside open-source models, such as Llava-v1.6-mistral and Llava-onevision-qwen, to analyze their capabilities in foundational visual skills: counting ambient obstacles, relative spatial reasoning, and common-sense wayfinding-pertinent scene understanding. We further assess their performance in navigation scenarios, using pBLV-specific prompts designed to simulate real-world assistance tasks. Our findings reveal notable performance disparities between these models: GPT-4o consistently outperforms others across all tasks, particularly in spatial reasoning and scene understanding. In contrast, open-source models struggle with nuanced reasoning and adaptability in complex environments. Common challenges include difficulties in accurately counting objects in cluttered settings, biases in spatial reasoning, and a tendency to prioritize object details over spatial feedback, limiting their usability for pBLV in navigation tasks. Despite these limitations, VLMs show promise for wayfinding assistance when better aligned with human feedback and equipped with improved spatial reasoning. This research provides actionable insights into the strengths and limitations of current VLMs, guiding developers on effectively integrating VLMs into assistive technologies while addressing key limitations for enhanced usability.
Chinese Translation
本文探讨了视觉语言模型(VLMs)在帮助盲人和低视力人士(pBLV)进行导航任务中的潜力。我们评估了包括GPT-4V、GPT-4o、Gemini-1.5-Pro和Claude-3.5-Sonnet等最先进的闭源模型,以及开源模型如Llava-v1.6-mistral和Llava-onevision-qwen,以分析它们在基础视觉技能方面的能力:计数周围障碍物、相对空间推理和与常识相关的场景理解。我们进一步评估了它们在导航场景中的表现,使用针对pBLV设计的提示,模拟现实世界的辅助任务。我们的研究结果揭示了这些模型之间显著的性能差异:GPT-4o在所有任务中始终优于其他模型,特别是在空间推理和场景理解方面。相比之下,开源模型在复杂环境中的细微推理和适应性方面表现不佳。常见挑战包括在杂乱环境中准确计数物体的困难、空间推理中的偏差,以及倾向于优先考虑物体细节而非空间反馈,从而限制了它们在导航任务中对pBLV的可用性。尽管存在这些局限性,VLMs在与人类反馈更好对齐并具备改进的空间推理能力时,显示出在路径寻找辅助方面的潜力。本研究提供了关于当前VLMs优缺点的可行见解,为开发者有效整合VLMs到辅助技术中提供指导,同时解决关键局限性以增强可用性。
cs.CV / 3 / 2603.15648
Improving Generative Adversarial Network Generalization for Facial Expression Synthesis
提高生成对抗网络在面部表情合成中的泛化能力
Abstract
Facial expression synthesis aims to generate realistic facial expressions while preserving identity. Existing conditional generative adversarial networks (GANs) achieve excellent image-to-image translation results, but their performance often degrades when test images differ from the training dataset. We present Regression GAN (RegGAN), a model that learns an intermediate representation to improve generalization beyond the training distribution. RegGAN consists of two components: a regression layer with local receptive fields that learns expression details by minimizing the reconstruction error through a ridge regression loss, and a refinement network trained adversarially to enhance the realism of generated images. We train RegGAN on the CFEE dataset and evaluate its generalization performance both on CFEE and challenging out-of-distribution images, including celebrity photos, portraits, statues, and avatar renderings. For evaluation, we employ four widely used metrics: Expression Classification Score (ECS) for expression quality, Face Similarity Score (FSS) for identity preservation, QualiCLIP for perceptual realism, and Fr\'echet Inception Distance (FID) for assessing both expression quality and realism. RegGAN outperforms six state-of-the-art models in ECS, FID, and QualiCLIP, while ranking second in FSS. Human evaluations indicate that RegGAN surpasses the best competing model by 25% in expression quality, 26% in identity preservation, and 30% in realism.
Chinese Translation
面部表情合成旨在生成逼真的面部表情,同时保持个体身份。现有的条件生成对抗网络(GANs)在图像到图像的转换中取得了优异的结果,但当测试图像与训练数据集存在差异时,其性能往往会下降。我们提出了回归GAN(RegGAN),该模型学习一种中间表示,以改善超出训练分布的泛化能力。RegGAN由两个组件组成:一个具有局部感受野的回归层,通过最小化重构误差并采用岭回归损失来学习表情细节,以及一个通过对抗训练的细化网络,以增强生成图像的真实感。我们在CFEE数据集上训练RegGAN,并评估其在CFEE及具有挑战性的分布外图像(包括名人照片、肖像、雕像和头像渲染)的泛化性能。为了评估,我们采用了四个广泛使用的指标:表情分类分数(ECS)用于表情质量,面部相似度分数(FSS)用于身份保持,QualiCLIP用于感知真实感,以及Fréchet Inception Distance(FID)用于评估表情质量和真实感。RegGAN在ECS、FID和QualiCLIP上超越了六个最先进的模型,而在FSS上排名第二。人类评估表明,RegGAN在表情质量、身份保持和真实感方面分别超过最佳竞争模型25%、26%和30%。
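The regression layer's ridge objective admits a closed-form solution, W = (XᵀX + λI)⁻¹XᵀY. A dense sketch (RegGAN restricts this to local receptive fields; the toy data below are illustrative):

```python
import numpy as np

def ridge_fit(X: np.ndarray, Y: np.ndarray, lam: float) -> np.ndarray:
    """Closed-form ridge regression: minimises ||XW - Y||^2 + lam ||W||^2,
    giving W = (X'X + lam I)^-1 X'Y. Solved via a linear system rather
    than an explicit inverse for numerical stability."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))                    # inputs
W_true = rng.normal(size=(8, 3))                 # ground-truth mapping
Y = X @ W_true + 0.01 * rng.normal(size=(200, 3))  # slightly noisy targets
W = ridge_fit(X, Y, lam=1e-3)
err = np.abs(W - W_true).max()                   # recovered near-exactly
```

The λ penalty is what buys out-of-distribution robustness here: it shrinks coefficients that would otherwise overfit training-set idiosyncrasies, which matches the paper's motivation of generalising beyond the training distribution.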
cs.CV / 4 / 2603.15663
OrthoAI v2: From Single-Agent Segmentation to Dual-Agent Treatment Planning for Clear Aligners
OrthoAI v2:从单代理分割到双代理治疗规划用于隐形矫治器
Abstract
We present OrthoAI v2, the second iteration of our open-source pipeline for AI-assisted orthodontic treatment planning with clear aligners, substantially extending the single-agent framework previously introduced. The first version established a proof-of-concept based on Dynamic Graph Convolutional Neural Networks (\dgcnn{}) for tooth segmentation but was limited to per-tooth centroid extraction, lacked landmark-level precision, and produced a scalar quality score without staging simulation. \vtwo{} addresses all three limitations through three principal contributions: (i)~a second agent adopting the Conditioned Heatmap Regression Methodology (\charm{})~\cite{rodriguez2025charm} for direct, segmentation-free dental landmark detection, fused with Agent~1 via a confidence-weighted orchestrator in three modes (parallel, sequential, single-agent); (ii)~a composite six-category biomechanical scoring model (biomechanics $\times$ 0.30 + staging $\times$ 0.20 + attachments $\times$ 0.15 + IPR $\times$ 0.10 + occlusion $\times$ 0.10 + predictability $\times$ 0.15) replacing the binary pass/fail check of v1; (iii)~a multi-frame treatment simulator generating $F = A \times r$ temporally coherent 6-DoF tooth trajectories via SLERP interpolation and evidence-based staging rules, enabling ClinCheck 4D visualisation. On a synthetic benchmark of 200 crowding scenarios, the parallel ensemble of OrthoAI v2 reaches a planning quality score of $92.8 \pm 4.1$ vs.\ $76.4 \pm 8.3$ for OrthoAI v1, a $+21\%$ relative gain, while maintaining full CPU deployability ($4.2 \pm 0.8$~s).
Chinese Translation
我们提出了OrthoAI v2,这是我们开源的AI辅助隐形矫治器正畸治疗规划管道的第二个版本,显著扩展了之前引入的单代理框架。第一个版本基于动态图卷积神经网络(\dgcnn{})建立了牙齿分割的概念验证,但仅限于每颗牙齿的质心提取,缺乏标志点级别的精度,并且只生成一个标量质量评分而没有阶段模拟。OrthoAI v2通过三个主要贡献解决了这三个局限性:(i)第二个代理采用条件热图回归方法(\charm{})\cite{rodriguez2025charm},实现无需分割的直接牙科标志点检测,并通过置信加权协调器以三种模式(并行、顺序、单代理)与代理1融合;(ii)一个复合的六类生物力学评分模型(生物力学 $\times$ 0.30 + 阶段 $\times$ 0.20 + 附件 $\times$ 0.15 + IPR $\times$ 0.10 + 咬合 $\times$ 0.10 + 可预测性 $\times$ 0.15),替代了v1的二元通过/不通过检查;(iii)一个多帧治疗模拟器,通过SLERP插值和基于证据的阶段规则生成$F = A \times r$帧时间一致的六自由度牙齿轨迹,实现ClinCheck 4D可视化。在200个拥挤场景的合成基准测试中,OrthoAI v2的并行集成达到了$92.8 \pm 4.1$的规划质量评分,而OrthoAI v1为$76.4 \pm 8.3$,相对提升$+21\%$,同时保持完全的CPU可部署性($4.2 \pm 0.8$~秒)。
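The composite score and frame count in the abstract are plain arithmetic: a weighted sum over six categories (weights summing to 1.0) and F = A × r frames. A sketch using the stated weights (the per-category scores and their 0-100 scale are assumptions for illustration):

```python
# Six-category weights taken directly from the abstract (sum to 1.0).
WEIGHTS = {
    "biomechanics": 0.30, "staging": 0.20, "attachments": 0.15,
    "ipr": 0.10, "occlusion": 0.10, "predictability": 0.15,
}

def composite_score(scores: dict) -> float:
    """Weighted sum of per-category scores (assumed here to lie in [0, 100])."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

def frame_count(aligner_stages: int, frames_per_stage: int) -> int:
    """F = A x r frames for the multi-frame treatment simulator."""
    return aligner_stages * frames_per_stage

example = {k: 90.0 for k in WEIGHTS}   # all categories scored 90
s = composite_score(example)           # weighted mean is then also 90
```

Because the weights sum to one, a uniform per-category score passes through unchanged, which is a quick sanity check on the weighting scheme.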
cs.CV / 5 / 2603.15767
CLRNet: Targetless Extrinsic Calibration for Camera, Lidar and 4D Radar Using Deep Learning
CLRNet:基于深度学习的相机、激光雷达和4D雷达的无目标外部标定
Abstract
In this paper, we address extrinsic calibration for camera, lidar, and 4D radar sensors. Accurate extrinsic calibration of radar remains a challenge due to the sparsity of its data. We propose CLRNet, a novel, multi-modal end-to-end deep learning (DL) calibration network capable of addressing joint camera-lidar-radar calibration, or pairwise calibration between any two of these sensors. We incorporate equirectangular projection, camera-based depth image prediction, additional radar channels, and leverage lidar with a shared feature space and loop closure loss. In extensive experiments using the View-of-Delft and Dual-Radar datasets, we demonstrate superior calibration accuracy compared to existing state-of-the-art methods, reducing both median translational and rotational calibration errors by at least 50%. Finally, we examine the domain transfer capabilities of the proposed network and baselines, when evaluating across datasets. The code will be made publicly available upon acceptance at: https://github.com/tudelft-iv.
Chinese Translation
本文探讨了相机、激光雷达和4D雷达传感器的外部标定。由于雷达数据的稀疏性,准确的雷达外部标定仍然是一个挑战。我们提出了CLRNet,一种新颖的多模态端到端深度学习(DL)标定网络,能够处理相机-激光雷达-雷达的联合标定,或任意两个传感器之间的成对标定。我们结合了等矩形投影、基于相机的深度图像预测、额外的雷达通道,并利用具有共享特征空间和回环闭合损失的激光雷达。在使用View-of-Delft和Dual-Radar数据集的广泛实验中,我们展示了与现有最先进方法相比更优的标定精度,将中位数平移和旋转标定误差至少降低了50%。最后,我们在跨数据集评估时考察了所提网络及基线的领域迁移能力。代码将在接受后公开发布,链接为:https://github.com/tudelft-iv。
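Equirectangular projection, one of CLRNet's ingredients, maps 3D sensor points to azimuth/elevation pixel coordinates so sparse radar and lidar returns share a dense image-like grid. A standard-convention sketch (the paper's exact axis conventions may differ):

```python
import numpy as np

def equirectangular_project(points: np.ndarray, width: int, height: int):
    """Map 3D points to equirectangular pixels: azimuth -> u, elevation -> v.
    u spans [0, width) over azimuth [-pi, pi); v spans [0, height] from
    zenith (top) to nadir (bottom)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    az = np.arctan2(y, x)                    # [-pi, pi)
    el = np.arcsin(np.clip(z / r, -1, 1))    # [-pi/2, pi/2]
    u = (az + np.pi) / (2 * np.pi) * width
    v = (np.pi / 2 - el) / np.pi * height
    return u, v

pts = np.array([[1.0, 0.0, 0.0],    # straight ahead, on the horizon
                [0.0, 0.0, 1.0]])   # straight up
u, v = equirectangular_project(pts, width=1024, height=512)
```

Projecting all modalities into this shared panorama is what lets a single CNN consume camera, lidar, and sparse radar features in one aligned representation.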
cs.CV / 6 / 2603.15774
Domain Adaptation Without the Compute Burden for Efficient Whole Slide Image Analysis
无计算负担的领域适应用于高效的全切片图像分析
Abstract
Computational methods for analyzing Whole Slide Images (WSIs) enable early diagnosis and treatment by supporting pathologists in the detection and classification of tumors. However, the extremely high resolution of WSIs makes end-to-end training impractical compared to typical image analysis tasks. To address this, most approaches use pre-trained feature extractors to obtain fixed representations of whole slides, which are then combined with Multiple Instance Learning (MIL) for downstream tasks. These feature extractors are typically pre-trained on natural image datasets such as ImageNet, which fail to capture domain-specific characteristics. Although domain-specific pre-training on histopathology data yields more relevant feature representations, it remains computationally expensive and fails to capture task-specific characteristics within the domain. To address the computational cost and lack of task-specificity in domain-specific pre-training, we propose EfficientWSI (eWSI), a careful integration of Parameter-Efficient Fine-Tuning (PEFT) and Multiple Instance Learning (MIL) that enables end-to-end training on WSI tasks. We evaluate eWSI on seven WSI-level tasks over the Camelyon16, TCGA and BRACS datasets. Our results show that eWSI, when applied with ImageNet feature extractors, yields strong classification performance, matching or outperforming MILs with in-domain feature extractors and alleviating the need for extensive in-domain pre-training. Furthermore, when eWSI is applied with in-domain feature extractors, it further improves classification performance in most cases, demonstrating its ability to capture task-specific information where beneficial. Our findings suggest that eWSI provides a task-targeted, computationally efficient path for WSI tasks, offering a promising direction for task-specific learning in computational pathology.
Chinese Translation
对全切片图像(WSIs)的计算分析方法通过支持病理学家进行肿瘤的检测和分类,从而实现早期诊断和治疗。然而,WSIs的极高分辨率使得与典型图像分析任务相比,端到端训练变得不切实际。为了解决这个问题,大多数方法使用预训练的特征提取器来获取全切片的固定表示,然后将其与多实例学习(MIL)结合用于下游任务。这些特征提取器通常是在自然图像数据集(如ImageNet)上进行预训练的,无法捕捉领域特定的特征。尽管在组织病理学数据上进行领域特定的预训练可以产生更相关的特征表示,但这仍然计算成本高,并且未能捕捉领域内的任务特定特征。为了解决领域特定预训练中的计算成本和缺乏任务特定性的情况,我们提出了EfficientWSI(eWSI),这是参数高效微调(PEFT)与多实例学习(MIL)的精心结合,使得在WSI任务上实现端到端训练成为可能。我们在Camelyon16、TCGA和BRACS数据集上评估了eWSI在七个WSI级任务上的表现。我们的结果表明,当eWSI与ImageNet特征提取器结合使用时,能够获得强大的分类性能,匹配或超越使用领域内特征提取器的MIL,减轻了对广泛领域内预训练的需求。此外,当eWSI与领域内特征提取器结合使用时,在大多数情况下进一步提高了分类性能,展示了其捕捉任务特定信息的能力。我们的研究结果表明,eWSI为WSI任务提供了一条针对任务的计算高效路径,为计算病理学中的任务特定学习提供了一个有前景的方向。
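MIL aggregates patch-level features into one slide-level prediction. A common pooling choice is attention-based MIL, where a small network scores each instance and the bag embedding is the attention-weighted sum; the paper does not specify its MIL head, so this is an illustrative sketch with random toy features:

```python
import numpy as np

def attention_mil_pool(instances: np.ndarray, V: np.ndarray, w: np.ndarray):
    """Attention-based MIL pooling sketch:
    a_i = softmax_i( w' tanh(V h_i) ),  bag embedding z = sum_i a_i h_i.

    instances: (N, D) patch features; V: (H, D) projection; w: (H,) scorer.
    Returns the (D,) bag embedding and the (N,) attention weights."""
    scores = w @ np.tanh(V @ instances.T)   # (N,) unnormalised scores
    scores -= scores.max()                  # numerical stability
    a = np.exp(scores) / np.exp(scores).sum()
    return a @ instances, a

rng = np.random.default_rng(0)
H, D, N = 16, 32, 100                       # hidden dim, feature dim, patches
z, a = attention_mil_pool(rng.normal(size=(N, D)),
                          rng.normal(size=(H, D)) * 0.1,
                          rng.normal(size=(H,)))
```

With PEFT, the few added adapter parameters in the extractor and this pooling head can be trained jointly end-to-end, since gradients only flow through a small fraction of the backbone.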
cs.CV / 7 / 2603.15780
Parallelised Differentiable Straightest Geodesics for 3D Meshes
三维网格的并行可微直线测地线
Abstract
Machine learning has been progressively generalised to operate within non-Euclidean domains, but geometrically accurate methods for learning on surfaces are still falling behind. The lack of closed-form Riemannian operators, the non-differentiability of their discrete counterparts, and poor parallelisation capabilities have been the main obstacles to the development of the field on meshes. A principled framework to compute the exponential map on Riemannian surfaces discretised as meshes is straightest geodesics, which also allows to trace geodesics and parallel-transport vectors as a by-product. We provide a parallel GPU implementation and derive two different methods for differentiating through the straightest geodesics, one leveraging an extrinsic proxy function and one based upon a geodesic finite differences scheme. After proving our parallelisation performance and accuracy, we demonstrate how our differentiable exponential map can improve learning and optimisation pipelines on general geometries. In particular, to showcase the versatility of our method, we propose a new geodesic convolutional layer, a new flow matching method for learning on meshes, and a second-order optimiser that we apply to centroidal Voronoi tessellation. Our code, models, and pip-installable library (digeo) are available at: circle-group.github.io/research/DSG.
Chinese Translation
机器学习已逐渐被推广到非欧几里得领域,但在曲面上进行学习的几何准确方法仍然滞后。缺乏封闭形式的黎曼算子、其离散对应物的不可微性以及较差的并行化能力,是该领域在网格上发展的主要障碍。计算离散为网格的黎曼曲面上指数映射的一个原则性框架是直线测地线,它还可以作为副产品追踪测地线并进行向量的平行传输。我们提供了并行的GPU实现,并推导出两种不同的对直线测地线进行微分的方法:一种利用外部代理函数,另一种基于测地线有限差分方案。在证明我们的并行化性能和准确性后,我们展示了我们的可微指数映射如何改善一般几何体上的学习和优化流程。特别是,为了展示我们方法的多功能性,我们提出了一种新的测地卷积层、一种新的用于网格学习的流匹配方法,以及一个应用于质心Voronoi剖分的二阶优化器。我们的代码、模型和可通过 pip 安装的库(digeo)可在以下网址获取:circle-group.github.io/research/DSG。
cs.CV / 8 / 2603.15800
Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory
通过推理时自反记忆在多模态大型语言模型中演化上下文安全性
Abstract
Multi-modal Large Language Models (MLLMs) have achieved remarkable performance across a wide range of visual reasoning tasks, yet their vulnerability to safety risks remains a pressing concern. While prior research primarily focuses on jailbreak defenses that detect and refuse explicitly unsafe inputs, such approaches often overlook contextual safety, which requires models to distinguish subtle contextual differences between scenarios that may appear similar but diverge significantly in safety intent. In this work, we present MM-SafetyBench++, a carefully curated benchmark designed for contextual safety evaluation. Specifically, for each unsafe image-text pair, we construct a corresponding safe counterpart through minimal modifications that flip the user intent while preserving the underlying contextual meaning, enabling controlled evaluation of whether models can adapt their safety behaviors based on contextual understanding. Further, we introduce EchoSafe, a training-free framework that maintains a self-reflective memory bank to accumulate and retrieve safety insights from prior interactions. By integrating relevant past experiences into current prompts, EchoSafe enables context-aware reasoning and continual evolution of safety behavior during inference. Extensive experiments on various multi-modal safety benchmarks demonstrate that EchoSafe consistently achieves superior performance, establishing a strong baseline for advancing contextual safety in MLLMs. All benchmark data and code are available at https://echosafe-mllm.github.io.
Chinese Translation
多模态大型语言模型(MLLMs)在广泛的视觉推理任务中取得了显著的性能,但它们对安全风险的脆弱性仍然是一个紧迫的关注点。虽然之前的研究主要集中在检测和拒绝明显不安全输入的越狱防御上,但此类方法往往忽视了上下文安全,即要求模型能够区分看似相似但在安全意图上显著不同的场景之间的微妙上下文差异。在本研究中,我们提出了 MM-SafetyBench++,这是一个精心策划的基准,旨在评估上下文安全性。具体而言,对于每个不安全的图像-文本对,我们通过翻转用户意图但保留底层上下文含义的最小修改,构建相应的安全对应样本,从而能够受控地评估模型是否能够根据上下文理解调整其安全行为。此外,我们引入了 EchoSafe,一个无需训练的框架,它维护一个自反记忆库,以积累和检索来自先前交互的安全见解。通过将相关的过去经验整合到当前提示中,EchoSafe 实现了上下文感知推理和推理过程中安全行为的持续演进。在各种多模态安全基准上的广泛实验表明,EchoSafe 始终实现了优越的性能,为推进 MLLMs 中的上下文安全建立了强有力的基线。所有基准数据和代码均可在 https://echosafe-mllm.github.io 获取。
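The abstract does not spell out how the self-reflective memory bank is queried; a minimal sketch, assuming insights are stored as (embedding, note) pairs and retrieved by cosine similarity to the current query (the function name and toy entries are hypothetical):

```python
import numpy as np

def retrieve_insights(query_emb, memory, k=2):
    """Return the notes of the k stored safety insights whose embeddings
    are most cosine-similar to the current query embedding."""
    if not memory:
        return []
    embs = np.stack([e for e, _ in memory])
    q = query_emb / np.linalg.norm(query_emb)
    m = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = m @ q                     # cosine similarity to every entry
    top = np.argsort(-sims)[:k]      # indices of the k best matches
    return [memory[i][1] for i in top]

# Toy memory bank accumulated from earlier interactions.
memory = [
    (np.array([1.0, 0.0]), "refuse: instructions for harm"),
    (np.array([0.0, 1.0]), "safe: educational chemistry question"),
]
hits = retrieve_insights(np.array([0.9, 0.1]), memory, k=1)
```

Retrieved notes would then be prepended to the current prompt so the model can condition its safety behaviour on similar past cases.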
cs.CV / 9 / 2603.15811
Feed-forward Gaussian Registration for Head Avatar Creation and Editing
用于头部头像创建和编辑的前馈高斯配准
Abstract
We present MATCH (Multi-view Avatars from Topologically Corresponding Heads), a multi-view Gaussian registration method for high-quality head avatar creation and editing. State-of-the-art multi-view head avatar methods require time-consuming head tracking followed by expensive avatar optimization, often resulting in a total creation time of more than one day. MATCH, in contrast, directly predicts Gaussian splat textures in correspondence from calibrated multi-view images in just 0.5 seconds per frame, without requiring data preprocessing. The learned intra-subject correspondence across frames enables fast creation of personalized head avatars, while correspondence across subjects supports applications such as expression transfer, optimization-free tracking, semantic editing, and identity interpolation. We establish these correspondences end-to-end using a transformer-based model that predicts Gaussian splat textures in the fixed UV layout of a template mesh. To achieve this, we introduce a novel registration-guided attention block, where each UV-map token attends exclusively to image tokens depicting its corresponding mesh region. This design improves efficiency and performance compared to dense cross-view attention. MATCH outperforms existing methods in novel-view synthesis, geometry registration, and head avatar generation, while making avatar creation 10 times faster than the closest competing baseline. The code and model weights are available on the project website.
Chinese Translation
我们提出了MATCH(来自拓扑对应头部的多视角头像),这是一种用于高质量头部头像创建和编辑的多视角高斯配准方法。最先进的多视角头部头像方法通常需要耗时的头部跟踪,随后进行昂贵的头像优化,导致总创建时间超过一天。相比之下,MATCH直接从校准的多视角图像中预测对应的高斯喷溅纹理,每帧仅需0.5秒,无需数据预处理。学习到的帧间个体对应关系使得个性化头部头像的快速创建成为可能,而跨个体的对应关系则支持诸如表情转移、无优化跟踪、语义编辑和身份插值等应用。我们使用基于变换器的模型端到端地建立这些对应关系,该模型在模板网格的固定UV布局中预测高斯喷溅纹理。为此,我们引入了一种新颖的配准引导注意力块,其中每个UV-map标记仅关注描绘其对应网格区域的图像标记。这一设计相比于密集的跨视角注意力,提高了效率和性能。MATCH在新视图合成、几何配准和头部头像生成方面优于现有方法,同时使头像创建速度比最接近的竞争基线快10倍。代码和模型权重可在项目网站上获取。
cs.CV / 10 / 2603.15812
ModTrack: Sensor-Agnostic Multi-View Tracking via Identity-Informed PHD Filtering with Covariance Propagation
ModTrack:通过协方差传播的身份信息驱动的 PHD 过滤实现传感器无关的多视角跟踪
Abstract
Multi-View Multi-Object Tracking (MV-MOT) aims to localize and maintain consistent identities of objects observed by multiple sensors. This task is challenging, as viewpoint changes and occlusion disrupt identity consistency across views and time. Recent end-to-end approaches address this by jointly learning 2D Bird's Eye View (BEV) representations and identity associations, achieving high tracking accuracy. However, these methods offer no principled uncertainty accounting and remain tightly coupled to their training configuration, limiting generalization across sensor layouts, modalities, or datasets without retraining. We propose ModTrack, a modular MV-MOT system that matches end-to-end performance while providing cross-modal, sensor-agnostic generalization and traceable uncertainty. ModTrack confines learning methods to just the \textit{Detection and Feature Extraction} stage of the MV-MOT pipeline, performing all fusion, association, and tracking with closed-form analytical methods. Our design reduces each sensor's output to calibrated position-covariance pairs $(\mathbf{z}, R)$; cross-view clustering and precision-weighted fusion then yield unified estimates $(\hat{\mathbf{z}}, \hat{R})$ for identity assignment and temporal tracking. A feedback-coupled, identity-informed Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter with HMM motion modes uses these fused estimates to maintain identities under missed detections and heavy occlusion. ModTrack achieves 95.5 IDF1 and 91.4 MOTA on \textit{WildTrack}, surpassing all prior modular methods by over 21 points and rivaling the state-of-the-art end-to-end methods while providing deployment flexibility they cannot. Specifically, the same tracker core transfers unchanged to \textit{MultiviewX} and \textit{RadarScenes}, with only perception-module replacement required to extend to new domains and sensor modalities.
Chinese Translation
多视角多目标跟踪(MV-MOT)旨在定位并保持由多个传感器观察到的物体的一致身份。由于视角变化和遮挡会破坏跨视角和跨时间的身份一致性,这一任务具有挑战性。最近的端到端方法通过联合学习 2D 鸟瞰图(BEV)表示和身份关联来解决这一问题,实现了高跟踪精度。然而,这些方法没有提供原则性的不确定性计算,并且与其训练配置紧密耦合,在不重新训练的情况下难以泛化到不同的传感器布局、模态或数据集。我们提出了 ModTrack,一个模块化的 MV-MOT 系统,它在提供跨模态、传感器无关的泛化和可追溯的不确定性的同时,达到与端到端方法相当的性能。ModTrack 将学习方法限制在 MV-MOT 流水线的\textit{检测和特征提取}阶段,所有融合、关联和跟踪均采用封闭形式的分析方法。我们的设计将每个传感器的输出简化为校准的位置-协方差对 $(\mathbf{z}, R)$;随后通过跨视角聚类和精度加权融合产生统一的估计 $(\hat{\mathbf{z}}, \hat{R})$,用于身份分配和时间跟踪。一个反馈耦合的、身份信息驱动的高斯混合概率假设密度(GM-PHD)滤波器,结合 HMM 运动模式,利用这些融合的估计在漏检和严重遮挡情况下维持身份。ModTrack 在 \textit{WildTrack} 上实现了 95.5 的 IDF1 和 91.4 的 MOTA,超过所有先前的模块化方法 21 分以上,并与最先进的端到端方法相媲美,同时提供了它们无法实现的部署灵活性。具体而言,相同的跟踪核心可不加修改地迁移到 \textit{MultiviewX} 和 \textit{RadarScenes},仅需更换感知模块即可扩展到新领域和传感器模态。
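The precision-weighted fusion of per-sensor position-covariance pairs has a standard closed form; a sketch under the assumption of independent Gaussian measurements (the cross-view clustering that groups measurements per object is omitted):

```python
import numpy as np

def fuse(measurements):
    """Inverse-covariance (precision-weighted) fusion:
    R_hat = (sum_i R_i^-1)^-1,  z_hat = R_hat @ sum_i (R_i^-1 @ z_i)."""
    precisions = [np.linalg.inv(R) for _, R in measurements]
    R_hat = np.linalg.inv(sum(precisions))
    z_hat = R_hat @ sum(P @ z for (z, _), P in zip(measurements, precisions))
    return z_hat, R_hat

# Two views of the same object; the second sensor is twice as certain.
z1, R1 = np.array([1.0, 0.0]), np.eye(2) * 2.0
z2, R2 = np.array([0.0, 1.0]), np.eye(2) * 1.0
z_hat, R_hat = fuse([(z1, R1), (z2, R2)])
```

The fused estimate lands closer to the more certain sensor, and the fused covariance is tighter than either input, which is what lets a downstream filter weigh detections consistently.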
cs.CV / 11 / 2603.15818
Conflict-Aware Multimodal Fusion for Ambivalence and Hesitancy Recognition
冲突感知的多模态融合用于矛盾与犹豫的识别
Abstract
Ambivalence and hesitancy (A/H) are subtle affective states where a person shows conflicting signals through different channels -- saying one thing while their face or voice tells another story. Recognising these states automatically is valuable in clinical settings, but it is hard for machines because the key evidence lives in the \emph{disagreements} between what is said, how it sounds, and what the face shows. We present \textbf{ConflictAwareAH}, a multimodal framework built for this problem. Three pre-trained encoders extract video, audio, and text representations. Pairwise conflict features -- element-wise absolute differences between modality embeddings -- serve as \emph{bidirectional} cues: large cross-modal differences flag A/H, while small differences confirm behavioural consistency and anchor the negative class. This conflict-aware design addresses a key limitation of text-dominant approaches, which tend to over-detect A/H (high F1-AH) while struggling to confirm its absence: our multimodal model improves F1-NoAH by +4.6 points over text alone and halves the class-performance gap. A complementary \emph{text-guided late fusion} strategy blends a text-only auxiliary head with the full model at inference, adding +4.1 Macro F1. On the BAH dataset from the ABAW10 Ambivalence/Hesitancy Challenge, our method reaches \textbf{0.694 Macro F1} on the labelled test split and \textbf{0.715} on the private leaderboard, outperforming published multimodal baselines by over 10 points -- all on a single GPU in under 25 minutes of training.
Chinese Translation
矛盾与犹豫(A/H)是微妙的情感状态,表现为一个人通过不同的渠道发出冲突信号——说一套而面部表情或声音却传达另一套。自动识别这些状态在临床环境中具有重要价值,但对于机器来说却很困难,因为关键证据存在于所说内容、声音表现和面部表情之间的\emph{不一致性}中。我们提出了 \textbf{ConflictAwareAH},这是一个为解决这一问题而构建的多模态框架。三个预训练编码器分别提取视频、音频和文本表示。成对的冲突特征——模态嵌入之间的元素级绝对差异——作为\emph{双向}线索:较大的跨模态差异标志着A/H,而较小的差异则确认行为一致性并锚定负类。这种冲突感知设计解决了以文本为主的方法的一个关键限制,这些方法往往会过度检测A/H(高F1-AH),而在确认其缺失时却表现不佳:我们的多模态模型在F1-NoAH上比仅使用文本提高了4.6个百分点,并将类别性能差距减半。一种补充的\emph{文本引导的后期融合}策略在推理时将仅文本的辅助头与完整模型融合,使宏观F1提高了4.1。在来自ABAW10矛盾/犹豫挑战的BAH数据集上,我们的方法在标记测试集上达到了\textbf{0.694的宏观F1},在私人排行榜上达到了\textbf{0.715},超过了已发布的多模态基线10多个点——所有这些都在单个GPU上训练不超过25分钟。
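The pairwise conflict features are plain element-wise absolute differences between modality embeddings; a toy illustration with made-up 3-d vectors (real embeddings come from the three pre-trained encoders):

```python
import numpy as np

def conflict_features(emb_a, emb_b):
    """Element-wise absolute difference between two modality embeddings.
    Large values flag cross-modal disagreement (a cue for A/H); small
    values confirm behavioural consistency, anchoring the negative class."""
    return np.abs(emb_a - emb_b)

# Toy embeddings: text agrees with audio but disagrees with video.
text  = np.array([0.9, 0.1, 0.0])
audio = np.array([0.8, 0.2, 0.0])
video = np.array([0.1, 0.9, 0.5])

low  = conflict_features(text, audio).mean()   # small -> consistent signals
high = conflict_features(text, video).mean()   # large -> possible A/H
```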
cs.CV / 12 / 2603.15822
Beyond the Embedding Bottleneck: Adaptive Retrieval-Augmented 3D CT Report Generation
超越嵌入瓶颈:自适应检索增强三维CT报告生成
Abstract
Automated radiology report generation from 3D CT volumes often suffers from incomplete pathology coverage. We provide empirical evidence that this limitation stems from a representational bottleneck: contrastive 3D CT embeddings encode discriminative pathology signals, yet exhibit severe dimensional concentration, with as few as 2 effective dimensions out of 512. Corroborating this, scaling the language model yields no measurable improvement, suggesting that the bottleneck lies in the visual representation rather than the generator. This bottleneck limits both generation and retrieval; naive static retrieval fails to improve clinical efficacy and can even degrade performance. We propose \textbf{AdaRAG-CT}, an adaptive augmentation framework that compensates for this visual bottleneck by introducing supplementary textual information through controlled retrieval and selectively integrating it during generation. On the CT-RATE benchmark, AdaRAG-CT achieves state-of-the-art clinical efficacy, improving Clinical F1 from 0.420 (CT-Agent) to 0.480 (+6 points); ablation studies confirm that both the retrieval and generation components contribute to the improvement. Code is available at https://github.com/renjie-liang/Adaptive-RAG-for-3DCT-Report-Generation.
Chinese Translation
自动化的三维CT影像放射学报告生成往往面临病理覆盖不全的问题。我们提供了实证证据,证明这一限制源于表征瓶颈:对比式三维CT嵌入编码了具有区分性的病理信号,但却表现出严重的维度集中,512个维度中有效维度少至2个。与此相印证的是,扩大语言模型并未带来可测量的改善,这表明瓶颈出现在视觉表征而非生成器上。该瓶颈同时限制了生成和检索;简单的静态检索无法提高临床有效性,甚至可能降低性能。我们提出了 \textbf{AdaRAG-CT},一种自适应增强框架,通过受控检索引入补充文本信息,并在生成过程中选择性地整合,来弥补这一视觉瓶颈。在CT-RATE基准测试中,AdaRAG-CT实现了最先进的临床有效性,将临床F1从0.420(CT-Agent)提高到0.480(+6分);消融研究证实检索和生成组件均对改善有所贡献。代码可在https://github.com/renjie-liang/Adaptive-RAG-for-3DCT-Report-Generation获取。
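The abstract reports "as few as 2 effective dimensions out of 512" without naming the estimator; one common choice is the participation ratio of the covariance spectrum, sketched here on synthetic data (an illustration of the diagnostic, not the paper's exact measurement):

```python
import numpy as np

def effective_dims(embeddings):
    """Participation ratio (sum lam)^2 / sum(lam^2) of the covariance
    eigenvalues: ~d for isotropic data, ~1 when the variance collapses
    onto a single direction."""
    cov = np.cov(embeddings, rowvar=False)
    lam = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    return lam.sum() ** 2 / (lam ** 2).sum()

rng = np.random.default_rng(0)
# 512-d embeddings whose variance lies almost entirely in a 2-d subspace.
collapsed = rng.normal(size=(1000, 2)) @ rng.normal(size=(2, 512))
isotropic = rng.normal(size=(1000, 512))

d_collapsed = effective_dims(collapsed)   # at most ~2
d_isotropic = effective_dims(isotropic)   # in the hundreds
```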
cs.CV / 13 / 2603.15847
FEEL (Force-Enhanced Egocentric Learning): A Dataset for Physical Action Understanding
FEEL(力增强自我中心学习):一个用于物理动作理解的数据集
Abstract
We introduce FEEL (Force-Enhanced Egocentric Learning), the first large-scale dataset pairing force measurements gathered from custom piezoresistive gloves with egocentric video. Our gloves enable scalable data collection, and FEEL contains approximately 3 million force-synchronized frames of natural unscripted manipulation in kitchen environments, with 45% of frames involving hand-object contact. Because force is the underlying cause that drives physical interaction, it is a critical primitive for physical action understanding. We demonstrate the utility of force for physical action understanding through application of FEEL to two families of tasks: (1) contact understanding, where we jointly perform temporal contact segmentation and pixel-level contacted object segmentation; and, (2) action representation learning, where force prediction serves as a self-supervised pretraining objective for video backbones. We achieve state-of-the-art temporal contact segmentation results and competitive pixel-level segmentation results without any need for manual contacted object segmentation annotations. Furthermore we demonstrate that action representation learning with FEEL improves transfer performance on action understanding tasks without any manual labels over EPIC-Kitchens, SomethingSomething-V2, EgoExo4D and Meccano.
Chinese Translation
我们介绍了FEEL(力增强自我中心学习),这是第一个将定制压阻式手套采集的力测量与自我中心视频配对的大规模数据集。我们的手套能够实现可扩展的数据收集,FEEL包含约300万帧在厨房环境中自然、非脚本化操作的力同步帧,其中45%的帧涉及手与物体的接触。由于力是驱动物理交互的根本原因,因此它是物理动作理解的一个关键基元。我们通过将FEEL应用于两类任务,展示了力在物理动作理解中的实用性:(1)接触理解,在此任务中,我们联合执行时间接触分割和像素级接触物体分割;(2)动作表示学习,在此任务中,力预测作为视频骨干网络的自监督预训练目标。我们在时间接触分割任务中取得了最先进的结果,并在像素级分割任务中获得了具有竞争力的结果,而无需任何手动的接触物体分割标注。此外,我们还展示了使用FEEL进行的动作表示学习在不使用任何手动标签的情况下,提升了在EPIC-Kitchens、SomethingSomething-V2、EgoExo4D和Meccano上动作理解任务的迁移性能。
cs.CV / 14 / 2603.15862
Self-supervised Disentanglement of Disease Effects from Aging in 3D Medical Shapes
在三维医学形状中对疾病效应与衰老的自监督解耦
Abstract
Disentangling pathological changes from physiological aging in 3D medical shapes is crucial for developing interpretable biomarkers and patient stratification. However, this separation is challenging when diagnosis labels are limited or unavailable, since disease and aging often produce overlapping effects on shape changes, obscuring clinically relevant shape patterns. To address this challenge, we propose a two-stage framework combining unsupervised disease discovery with self-supervised disentanglement of implicit shape representations. In the first stage, we train an implicit neural model with signed distance functions to learn stable shape embeddings. We then apply clustering on the shape latent space, which yields pseudo disease labels without using ground-truth diagnosis during discovery. In the second stage, we disentangle factors in a compact variational space using pseudo disease labels discovered in the first stage and the ground truth age labels available for all subjects. We enforce separation and controllability with a multi-objective disentanglement loss combining covariance and a supervised contrastive loss. On ADNI hippocampus and OAI distal femur shapes, we achieve near-supervised performance, improving disentanglement and reconstruction over state-of-the-art unsupervised baselines, while enabling high-fidelity reconstruction, controllable synthesis, and factor-based explainability. Code and checkpoints are available at https://github.com/anonymous-submission01/medical-shape-disentanglement
Chinese Translation
在三维医学形状中将病理变化与生理衰老解耦对于开发可解释的生物标志物和患者分层至关重要。然而,当诊断标签有限或不可用时,这种分离是具有挑战性的,因为疾病和衰老通常会对形状变化产生重叠的影响,从而掩盖临床相关的形状模式。为了解决这一挑战,我们提出了一个两阶段框架,结合无监督疾病发现与隐式形状表示的自监督解耦。在第一阶段,我们训练一个使用带符号距离函数的隐式神经模型,以学习稳定的形状嵌入。然后,我们在形状潜在空间上应用聚类,从而在发现过程中生成伪疾病标签,而无需使用真实的诊断标签。在第二阶段,我们利用在第一阶段发现的伪疾病标签和所有受试者的真实年龄标签,在紧凑的变分空间中解耦因素。我们通过结合协方差和监督对比损失的多目标解耦损失来强制实现分离和可控性。在ADNI海马体和OAI远端股骨形状上,我们实现了接近监督的性能,改善了与最先进的无监督基线相比的解耦和重建,同时实现了高保真重建、可控合成和基于因素的可解释性。代码和检查点可在https://github.com/anonymous-submission01/medical-shape-disentanglement获取。
cs.CV / 15 / 2603.15887
EvoIQA - Explaining Image Distortions with Evolved White-Box Logic
EvoIQA - 用进化的白盒逻辑解释图像失真
Abstract
Traditional Image Quality Assessment (IQA) metrics typically fall into one of two extremes: rigid, hand-crafted mathematical models or "black-box" deep learning architectures that completely lack interpretability. To bridge this gap, we propose EvoIQA, a fully explainable symbolic regression framework based on Genetic Programming that Evolves explicit, human-readable mathematical formulas for image quality assessment (IQA). Utilizing a rich terminal set from the VSI, VIF, FSIM, and HaarPSI metrics, our framework inherently maps structural, chromatic, and information-theoretic degradations into observable mathematical equations. Our results demonstrate that the evolved GP models consistently achieve strong alignment between the predictions and human visual preferences. Furthermore, they not only outperform traditional hand-crafted metrics but also achieve performance parity with complex, state-of-the-art deep learning models like DB-CNN, proving that we no longer have to sacrifice interpretability for state-of-the-art performance.
Chinese Translation
传统的图像质量评估(IQA)指标通常落入两个极端之一:刚性的手工设计数学模型,或完全缺乏可解释性的“黑箱”深度学习架构。为了弥合这一差距,我们提出了EvoIQA,一种基于遗传编程的完全可解释的符号回归框架,能够进化出明确的、人类可读的图像质量评估(IQA)数学公式。我们的框架利用来自VSI、VIF、FSIM和HaarPSI指标的丰富终端集合,天然地将结构性、色彩性和信息论层面的退化映射为可观察的数学方程。我们的结果表明,进化得到的遗传编程模型在预测与人类视觉偏好之间始终保持强一致性。此外,它们不仅超越了传统的手工设计指标,还达到了与复杂的最先进深度学习模型(如DB-CNN)同等的性能,证明我们不再需要为了最先进的性能而牺牲可解释性。
cs.CV / 16 / 2603.15919
Sparse but not Simpler: A Multi-Level Interpretability Analysis of Vision Transformers
稀疏但并不简单:视觉变换器的多层次可解释性分析
Abstract
Sparse neural networks are often hypothesized to be more interpretable than dense models, motivated by findings that weight sparsity can produce compact circuits in language models. However, it remains unclear whether structural sparsity itself leads to improved semantic interpretability. In this work, we systematically evaluate the relationship between weight sparsity and interpretability in Vision Transformers using DeiT-III B/16 models pruned with Wanda. To assess interpretability comprehensively, we introduce \textbf{IMPACT}, a multi-level framework that evaluates interpretability across four complementary levels: neurons, layer representations, task circuits, and model-level attribution. Layer representations are analyzed using BatchTopK sparse autoencoders, circuits are extracted via learnable node masking, and explanations are evaluated with transformer attribution using insertion and deletion metrics. Our results reveal a clear structural effect but limited interpretability gains. Sparse models produce circuits with approximately $2.5\times$ fewer edges than dense models, yet the fraction of active nodes remains similar or higher, indicating that pruning redistributes computation rather than isolating simpler functional modules. Consistent with this observation, sparse models show no systematic improvements in neuron-level selectivity, SAE feature interpretability, or attribution faithfulness. These findings suggest that structural sparsity alone does not reliably yield more interpretable vision models, highlighting the importance of evaluation frameworks that assess interpretability beyond circuit compactness.
Chinese Translation
稀疏神经网络常被假设为比密集模型更具可解释性,这一假设源于权重稀疏性能够在语言模型中产生紧凑电路的研究发现。然而,结构稀疏性本身是否能提高语义可解释性仍然不明确。在本研究中,我们使用经Wanda剪枝的DeiT-III B/16模型,系统地评估了视觉变换器中权重稀疏性与可解释性之间的关系。为了全面评估可解释性,我们引入了 \textbf{IMPACT},一个多层次框架,评估四个互补层次的可解释性:神经元、层表示、任务电路和模型级归因。层表示使用BatchTopK稀疏自编码器进行分析,电路通过可学习节点掩蔽提取,解释则通过插入和删除指标对变换器归因进行评估。我们的结果揭示了明显的结构效应,但可解释性提升有限。稀疏模型产生的电路边数约为密集模型的 $1/2.5$,然而活跃节点的比例保持相似或更高,这表明剪枝重新分配了计算,而不是隔离出更简单的功能模块。与这一观察一致,稀疏模型在神经元级选择性、SAE特征可解释性或归因可信度方面没有系统性改善。这些发现表明,单靠结构稀疏性并不能可靠地产生更具可解释性的视觉模型,凸显了超越电路紧凑性来评估可解释性的评估框架的重要性。
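The pruning criterion referenced above (Wanda) scores each weight by its magnitude times the norm of its input activation; a minimal unstructured, per-output-row sketch on random tensors (the paper applies the same criterion to DeiT-III at scale, with details not reproduced here):

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """Wanda importance: score_ij = |W_ij| * ||X_j||_2, with X_j the j-th
    input feature over calibration samples; the lowest-scoring weights
    in each output row are zeroed."""
    score = np.abs(W) * np.linalg.norm(X, axis=0)   # shape (out, in)
    k = int(W.shape[1] * sparsity)
    pruned = W.copy()
    for i in range(W.shape[0]):
        drop = np.argsort(score[i])[:k]             # k least important inputs
        pruned[i, drop] = 0.0
    return pruned

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 8))       # a toy linear layer
X = rng.normal(size=(32, 8))      # calibration activations
W_sparse = wanda_prune(W, X, sparsity=0.5)
```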
cs.CV / 17 / 2603.15932
Nodule-Aligned Latent Space Learning with LLM-Driven Multimodal Diffusion for Lung Nodule Progression Prediction
基于结节对齐的潜在空间学习与LLM驱动的多模态扩散用于肺结节进展预测
Abstract
Early diagnosis of lung cancer is challenging due to biological uncertainty and the limited understanding of the biological mechanisms driving nodule progression. To address this, we propose Nodule-Aligned Multimodal (Latent) Diffusion (NAMD), a novel framework that predicts lung nodule progression by generating 1-year follow-up nodule computed tomography images with baseline scans and the patient's and nodule's Electronic Health Record (EHR). NAMD introduces a nodule-aligned latent space, where distances between latents directly correspond to changes in nodule attributes, and utilizes an LLM-driven control mechanism to condition the diffusion backbone on patient data. On the National Lung Screening Trial (NLST) dataset, our method synthesizes follow-up nodule images that achieve an AUROC of 0.805 and an AUPRC of 0.346 for lung nodule malignancy prediction, significantly outperforming both baseline scans and state-of-the-art synthesis methods, while closely approaching the performance of real follow-up scans (AUROC: 0.819, AUPRC: 0.393). These results demonstrate that NAMD captures clinically relevant features of lung nodule progression, facilitating earlier and more accurate diagnosis.
Chinese Translation
由于生物不确定性以及对驱动结节进展的生物机制理解有限,肺癌的早期诊断面临挑战。为此,我们提出了结节对齐的多模态(潜在)扩散(NAMD)框架,通过生成1年随访的结节计算机断层扫描图像,结合基线扫描及患者和结节的电子健康记录(EHR),来预测肺结节的进展。NAMD引入了一种结节对齐的潜在空间,其中潜在变量之间的距离直接对应于结节属性的变化,并利用LLM驱动的控制机制,使扩散主干能够基于患者数据进行条件化。在国家肺筛查试验(NLST)数据集上,我们的方法合成的随访结节图像在肺结节恶性预测中达到了0.805的AUROC和0.346的AUPRC,显著优于基线扫描和最先进的合成方法,同时接近真实随访扫描的性能(AUROC: 0.819, AUPRC: 0.393)。这些结果表明,NAMD捕捉到了肺结节进展的临床相关特征,有助于更早和更准确的诊断。
cs.CV / 18 / 2603.15941
Towards Fair and Robust Volumetric CT Classification via KL-Regularised Group Distributionally Robust Optimisation
通过KL正则化的组分布鲁棒优化实现公平且稳健的体积CT分类
Abstract
Automated diagnosis from chest computed tomography (CT) scans faces two persistent challenges in clinical deployment: distribution shift across acquisition sites and performance disparity across demographic subgroups. We address both simultaneously across two complementary tasks: binary COVID-19 classification from multi-site CT volumes (Task 1) and four-class lung pathology recognition with gender-based fairness constraints (Task 2). Our framework combines a lightweight MobileViT-XXS slice encoder with a two-layer SliceTransformer aggregator for volumetric reasoning, and trains with a KL-regularised Group Distributionally Robust Optimisation (Group DRO) objective that adaptively upweights underperforming acquisition centres and demographic subgroups. Unlike standard Group DRO, the KL penalty prevents group weight collapse, providing a stable balance between worst-case protection and average performance. For Task 2, we define groups at the granularity of gender class, directly targeting severely underrepresented combinations such as female Squamous cell carcinoma. On Task 1, our best configuration achieves a challenge F1 of 0.835, surpassing the best published challenge entry by +5.9. On Task 2, Group DRO with $\alpha = 0.5$ achieves a mean per-gender macro F1 of 0.815, outperforming the best challenge entry by +11.1 pp and improving Female Squamous F1 by +17.4 over the Focal Loss baseline.
Chinese Translation
从胸部计算机断层扫描(CT)图像进行自动诊断在临床应用中面临两个持续的挑战:不同采集地点之间的分布偏移和不同人口子群体之间的性能差异。我们同时解决这两个问题,涵盖两个互补的任务:从多地点CT图像中进行二分类COVID-19识别(任务1)和在性别公平约束下进行四类肺病理识别(任务2)。我们的框架结合了轻量级的MobileViT-XXS切片编码器和一个两层的SliceTransformer聚合器以进行体积推理,并使用KL正则化的组分布鲁棒优化(Group DRO)目标进行训练,该目标自适应地增加表现不佳的采集中心和人口子群体的权重。与标准的Group DRO不同,KL惩罚防止了组权重崩溃,提供了最坏情况保护与平均性能之间的稳定平衡。在任务2中,我们以性别类别的粒度定义组,直接针对严重缺乏代表性的组合,如女性鳞状细胞癌。在任务1中,我们的最佳配置实现了挑战F1值为0.835,超过了最佳已发布挑战条目5.9个百分点。在任务2中,$\alpha = 0.5$ 的Group DRO实现了每性别宏观F1均值为0.815,超越了最佳挑战条目11.1个百分点,并在Focal Loss基线之上提高了女性鳞状细胞癌F1值17.4个百分点。
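The inner maximisation of KL-regularised Group DRO has a closed-form softmax solution for the group weights; a sketch of that form (the temperature name `tau` and these toy losses are illustrative, not the paper's notation):

```python
import numpy as np

def group_weights(losses, tau=1.0):
    """Solve max_q sum_g q_g * L_g - tau * KL(q || uniform):
    q_g proportional to exp(L_g / tau). Larger tau keeps q nearer uniform,
    preventing collapse of all weight onto the single worst group."""
    z = np.asarray(losses, dtype=float) / tau
    z -= z.max()                     # numerical stability
    q = np.exp(z)
    return q / q.sum()

losses = [0.2, 0.9, 0.4]                 # per-group (e.g. per-centre) losses
sharp = group_weights(losses, tau=0.1)   # nearly all weight on group 1
soft  = group_weights(losses, tau=10.0)  # close to uniform
```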
cs.CV / 19 / 2603.15967
A Comprehensive Benchmark of Histopathology Foundation Models for Kidney Histopathology
肾脏组织病理学基础模型的全面基准评估
Abstract
Histopathology foundation models (HFMs), pretrained on large-scale cancer datasets, have advanced computational pathology. However, their applicability to non-cancerous chronic kidney disease remains underexplored, despite coexistence of renal pathology with malignancies such as renal cell and urothelial carcinoma. We systematically evaluate 11 publicly available HFMs across 11 kidney-specific downstream tasks spanning multiple stains (PAS, H&E, PASM, and IHC), spatial scales (tile and slide-level), task types (classification, regression, and copy detection), and clinical objectives, including detection, diagnosis, and prognosis. Tile-level performance is assessed using repeated stratified group cross-validation, while slide-level tasks are evaluated using repeated nested stratified cross-validation. Statistical significance is examined using Friedman test followed by pairwise Wilcoxon signed-rank testing with Holm-Bonferroni correction and compact letter display visualization. To promote reproducibility, we release an open-source Python package, kidney-hfm-eval, available at https://pypi.org/project/kidney-hfm-eval/ , that reproduces the evaluation pipelines. Results show moderate to strong performance on tasks driven by coarse meso-scale renal morphology, including diagnostic classification and detection of prominent structural alterations. In contrast, performance consistently declines for tasks requiring fine-grained microstructural discrimination, complex biological phenotypes, or slide-level prognostic inference, largely independent of stain type. Overall, current HFMs appear to encode predominantly static meso-scale representations and may have limited capacity to capture subtle renal pathology or prognosis-related signals. Our results highlight the need for kidney-specific, multi-stain, and multimodal foundation models to support clinically reliable decision-making in nephrology.
Chinese Translation
组织病理学基础模型(HFMs)在大规模癌症数据集上进行预训练,推动了计算病理学的发展。然而,尽管肾脏病理与肾细胞癌和尿路上皮癌等恶性肿瘤共存,其在非癌性慢性肾病中的适用性仍未得到充分探索。我们系统评估了11个公开可用的HFMs,在11个肾脏特异性下游任务中进行评估,这些任务涵盖了多种染色方法(PAS、H&E、PASM和IHC)、空间尺度(图块和切片级别)、任务类型(分类、回归和复制检测)以及临床目标,包括检测、诊断和预后。图块级别的性能通过重复分层分组交叉验证进行评估,而切片级别的任务则使用重复嵌套分层交叉验证进行评估。统计显著性通过Friedman检验进行检验,随后进行两两Wilcoxon符号秩检验,并采用Holm-Bonferroni校正和紧凑字母显示可视化。为了促进可重复性,我们发布了一个开源Python包 kidney-hfm-eval(https://pypi.org/project/kidney-hfm-eval/),该包重现了评估流程。结果显示,在由粗粒度中尺度肾脏形态驱动的任务中表现出中等到强的性能,包括诊断分类和显著结构变化的检测。相比之下,对于需要细粒度微观结构区分、复杂生物表型或切片级别预后推断的任务,性能始终下降,且在很大程度上与染色类型无关。总体而言,当前的HFMs似乎主要编码静态的中尺度表示,在捕捉细微的肾脏病理或与预后相关的信号方面能力可能有限。我们的结果强调,需要肾脏特异性、多染色和多模态的基础模型,以支持肾脏病学中临床可靠的决策制定。
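The Holm-Bonferroni step-down applied after the pairwise Wilcoxon tests can be computed directly; a self-contained sketch on toy p-values (the Friedman and Wilcoxon tests themselves are available as `scipy.stats.friedmanchisquare` and `scipy.stats.wilcoxon`):

```python
import numpy as np

def holm_bonferroni(pvals, alpha=0.05):
    """Step-down Holm correction: sort p-values ascending, multiply the
    k-th smallest (0-indexed) by (m - k), enforce monotone non-decreasing
    adjusted values, and cap at 1."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running = 0.0
    for rank, idx in enumerate(order):
        running = max(running, (m - rank) * p[idx])
        adj[idx] = min(1.0, running)
    return adj, adj <= alpha

# Toy raw p-values from pairwise model comparisons.
raw = [0.001, 0.02, 0.03, 0.20]
adj, reject = holm_bonferroni(raw)
```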
cs.CV / 20 / 2603.15975
UMO: Unified In-Context Learning Unlocks Motion Foundation Model Priors
UMO:统一的上下文学习解锁运动基础模型先验
Abstract
Large-scale foundation models (LFMs) have recently made impressive progress in text-to-motion generation by learning strong generative priors from massive 3D human motion datasets and paired text descriptions. However, how to effectively and efficiently leverage such single-purpose motion LFMs, i.e., text-to-motion synthesis, in more diverse cross-modal and in-context motion generation downstream tasks remains largely unclear. Prior work typically adapts pretrained generative priors to individual downstream tasks in a task-specific manner. In contrast, our goal is to unlock such priors to support a broad spectrum of downstream motion generation tasks within a single unified framework. To bridge this gap, we present UMO, a simple yet general unified formulation that casts diverse downstream tasks into compositions of atomic per-frame operations, enabling in-context adaptation to unlock the generative priors of pretrained DiT-based motion LFMs. Specifically, UMO introduces three learnable frame-level meta-operation embeddings to specify per-frame intent and employs lightweight temporal fusion to inject in-context cues into the pretrained backbone, with negligible runtime overhead compared to the base model. With this design, UMO finetunes the pretrained model, originally limited to text-to-motion generation, to support diverse previously unsupported tasks, including temporal inpainting, text-guided motion editing, text-serialized geometric constraints, and multi-identity reaction generation. Experiments demonstrate that UMO consistently outperforms task-specific and training-free baselines across a wide range of benchmarks, despite using a single unified model. Code and model will be publicly available. Project Page: https://oliver-cong02.github.io/UMO.github.io/
Chinese Translation
大规模基础模型(LFMs)最近在文本到运动生成方面取得了显著进展,通过从大量3D人类运动数据集和配对文本描述中学习强大的生成先验。然而,如何有效且高效地利用这种单一用途的运动LFMs,即文本到运动合成,在更为多样的跨模态和上下文运动生成下游任务中仍然不够明确。以往的研究通常以任务特定的方式将预训练的生成先验适配到各个下游任务。相比之下,我们的目标是解锁这些先验,以支持在单一统一框架内的广泛下游运动生成任务。为了解决这一问题,我们提出了UMO,这是一种简单而通用的统一公式,将多样的下游任务转化为原子逐帧操作的组合,使得上下文适应能够解锁基于DiT的预训练运动LFMs的生成先验。具体而言,UMO引入了三个可学习的帧级元操作嵌入,以指定逐帧意图,并采用轻量级时间融合将上下文线索注入预训练的主干网络,与基础模型相比,几乎没有运行时开销。通过这种设计,UMO微调了原本仅限于文本到运动生成的预训练模型,以支持多种之前不支持的任务,包括时间修补、文本引导的运动编辑、文本序列几何约束和多身份反应生成。实验表明,尽管使用单一统一模型,UMO在广泛的基准测试中始终优于任务特定和无训练基线。代码和模型将公开发布。项目页面:https://oliver-cong02.github.io/UMO.github.io/
cs.CV / 21 / 2603.16001
Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models
以文本为主,智能视觉:大型视觉语言模型的非对称文本-视觉剪枝
Abstract
Network pruning is an effective technique for enabling lightweight Large Vision-Language Models (LVLMs), which primarily incorporates both weights and activations into the importance metric. However, existing efforts typically process calibration data from different modalities in a unified manner, overlooking modality-specific behaviors. This raises a critical challenge: how to address the divergent behaviors of textual and visual tokens for accurate pruning of LVLMs. To this end, we systematically investigate the sensitivity of visual and textual tokens to the pruning operation by decoupling their corresponding weights, revealing that: (i) the textual pathway should be calibrated via text tokens, since it exhibits higher sensitivity than the visual pathway; (ii) the visual pathway exhibits high redundancy, permitting even 50% sparsity. Motivated by these insights, we propose a simple yet effective Asymmetric Text-Visual Weight Pruning method for LVLMs, dubbed ATV-Pruning, which establishes the importance metric for accurate weight pruning by selecting the informative tokens from both textual and visual pathways. Specifically, ATV-Pruning integrates two primary innovations: first, a calibration pool is adaptively constructed by drawing on all textual tokens and a subset of visual tokens; second, we devise a layer-adaptive selection strategy to yield important visual tokens. Finally, extensive experiments across standard multimodal benchmarks verify the superiority of our ATV-Pruning over state-of-the-art methods.
Chinese Translation
网络剪枝是一种有效的技术,用于实现轻量级的大型视觉语言模型(Large Vision-Language Models, LVLMs),该技术主要将权重和激活值纳入重要性指标。然而,现有的研究通常以统一的方式处理来自不同模态的校准数据,忽视了模态特定的行为。这提出了一个关键挑战:如何解决文本和视觉标记在LVLMs精确剪枝中的不同表现。为此,我们系统地研究了视觉和文本标记对剪枝操作的敏感性,通过解耦它们对应的权重,揭示了: (i) 文本路径应通过文本标记进行校准,因为其敏感性高于视觉路径; (ii) 视觉路径表现出较高的冗余性,允许甚至达到50%的稀疏性。基于这些见解,我们提出了一种简单而有效的非对称文本-视觉权重剪枝方法,称为ATV-Pruning,该方法通过从文本和视觉路径中选择信息丰富的标记来建立准确的权重剪枝重要性指标。具体而言,ATV-Pruning整合了两个主要创新:首先,通过提取所有文本标记和一部分视觉标记,自适应构建校准池;其次,我们设计了一种层自适应选择策略,以生成重要的视觉标记。最后,在标准多模态基准上的广泛实验验证了我们的ATV-Pruning相较于最先进方法的优越性。
cs.CV / 22 / 2603.16016
FlatLands: Generative Floormap Completion From a Single Egocentric View
FlatLands:基于单一自我中心视角的生成式地面图补全
Abstract
A single egocentric image typically captures only a small portion of the floor, yet a complete metric traversability map of the surroundings would better serve applications such as indoor navigation. We introduce FlatLands, a dataset and benchmark for single-view bird's-eye view (BEV) floor completion. The dataset contains 270,575 observations from 17,656 real metric indoor scenes drawn from six existing datasets, with aligned observation, visibility, validity, and ground-truth BEV maps, and the benchmark includes both in- and out-of-distribution evaluation protocols. We compare training-free approaches, deterministic models, ensembles, and stochastic generative models. Finally, we instantiate the task as an end-to-end monocular RGB-to-floormaps pipeline. FlatLands provides a rigorous testbed for uncertainty-aware indoor mapping and generative completion for embodied navigation.
Chinese Translation
单一的自我中心图像通常仅捕捉到地面的一小部分,而完整的度量可通行性地图将更好地服务于室内导航等应用。我们介绍了FlatLands,一个用于单视角鸟瞰图(BEV)地面补全的数据集和基准。该数据集包含来自六个现有数据集的17,656个真实度量室内场景的270,575个观测值,具有对齐的观测、可见性、有效性和真实的BEV地图,基准包括了分布内和分布外的评估协议。我们比较了无训练方法、确定性模型、集成模型和随机生成模型。最后,我们将该任务实例化为一个端到端的单目RGB到地面图的管道。FlatLands为不确定性感知的室内映射和生成式补全提供了一个严格的测试平台,以支持具身导航。
cs.CV / 23 / 2603.16024
Speak, Segment, Track, Navigate: An Interactive System for Video-Guided Skull-Base Surgery
发声、分割、跟踪、导航:一种用于视频引导颅底手术的交互系统
Abstract
We introduce a speech-guided embodied agent framework for video-guided skull base surgery that dynamically executes perception and image-guidance tasks in response to surgeon queries. The proposed system integrates natural language interaction with real-time visual perception directly on live intraoperative video streams, thereby enabling surgeons to request computational assistance without disengaging from operative tasks. Unlike conventional image-guided navigation systems that rely on external optical trackers and additional hardware setup, the framework operates purely on intraoperative video. The system begins with interactive segmentation and labeling of the surgical instrument. The segmented instrument is then used as a spatial anchor that is autonomously tracked in the video stream to support downstream workflows, including anatomical segmentation, interactive registration of preoperative 3D models, monocular video-based estimation of the surgical tool pose, and support image guidance through real-time anatomical overlays. We evaluate the proposed system in video-guided skull base surgery scenarios and benchmark its tracking performance against a commercially available optical tracking system. Results demonstrate that speech-guided embodied agents can achieve competitive spatial accuracy while improving workflow integration and enabling rapid deployment of video-guided surgical systems.
Chinese Translation
我们提出了一种基于语音引导的具身智能体框架,用于视频引导的颅底手术,该框架能够动态执行感知和图像引导任务,以响应外科医生的查询。所提系统将自然语言交互与实时视觉感知直接集成在实时的手术视频流中,从而使外科医生能够在不脱离手术任务的情况下请求计算辅助。与依赖外部光学追踪器和额外硬件设置的传统图像引导导航系统不同,该框架完全基于手术视频进行操作。系统首先对手术器械进行交互式分割和标记。分割出的器械随后作为空间锚点,在视频流中自主跟踪,以支持下游工作流程,包括解剖分割、术前三维模型的交互式配准、基于单目视频的手术工具姿态估计,以及通过实时解剖叠加支持图像引导。我们在视频引导颅底手术场景中评估了所提系统,并将其跟踪性能与市售光学追踪系统进行了基准比较。结果表明,基于语音引导的具身智能体能够实现具有竞争力的空间精度,同时改善工作流程集成,并支持视频引导手术系统的快速部署。
cs.CV / 24 / 2603.16063
ViT-AdaLA: Adapting Vision Transformers with Linear Attention
ViT-AdaLA:使用线性注意力适应视觉变换器
Abstract
Vision Transformers (ViTs) based vision foundation models (VFMs) have achieved remarkable performance across diverse vision tasks, but suffer from quadratic complexity that limits scalability to long sequences. Existing linear attention approaches for ViTs are typically trained from scratch, requiring substantial computational resources, while linearization-based methods developed for large language model decoders do not transfer well to ViTs. To address these challenges, we propose ViT-AdaLA, a novel framework for effectively adapting and transferring prior knowledge from VFMs to linear attention ViTs. ViT-AdaLA consists of three stages: attention alignment, feature alignment, and supervised fine-tuning. In the attention alignment stage, we align vanilla linear attention with the original softmax-based attention in each block to approximate the behavior of softmax attention. However, residual approximation errors inevitably accumulate across layers. We mitigate this by fine-tuning the linearized ViT to align its final-layer features with a frozen softmax VFM teacher. Finally, the adapted prior knowledge is transferred to downstream tasks through supervised fine-tuning. Extensive experiments on classification and segmentation tasks demonstrate the effectiveness and generality of ViT-AdaLA over various state-of-the-art linear attention counterparts.
Chinese Translation
基于视觉变换器(ViTs)的视觉基础模型(VFMs)在多种视觉任务中取得了显著的性能,但由于其二次复杂度,限制了对长序列的可扩展性。现有的针对ViTs的线性注意力方法通常是从零开始训练,需耗费大量计算资源,而为大型语言模型解码器开发的线性化方法并不适用于ViTs。为了解决这些挑战,我们提出了ViT-AdaLA,一个有效适应和转移VFMs先前知识到线性注意力ViTs的新框架。ViT-AdaLA由三个阶段组成:注意力对齐、特征对齐和监督微调。在注意力对齐阶段,我们将普通线性注意力与每个块中的原始基于softmax的注意力进行对齐,以近似softmax注意力的行为。然而,残差近似误差不可避免地在层间累积。我们通过微调线性化的ViT,使其最终层特征与冻结的softmax VFM教师对齐,从而减轻这一问题。最后,通过监督微调将适应后的先前知识转移到下游任务中。在分类和分割任务上的大量实验表明,ViT-AdaLA在各种最先进的线性注意力方法中表现出有效性和普适性。
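The efficiency gap the paper targets comes from one algebraic reordering: computing φ(K)ᵀV once instead of the N×N attention matrix. Below is a minimal NumPy sketch of generic kernelized linear attention with the common elu(x)+1 feature map; this is a textbook illustration, not ViT-AdaLA's aligned variant.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: O(N^2) in sequence length N.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

def linear_attention(Q, K, V, eps=1e-6):
    # Kernelized attention: O(N) via associativity, phi(Q) (phi(K)^T V).
    # phi = elu(x) + 1 keeps features positive so the normalizer is valid.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                      # (d, d_v) summary, independent of N
    Z = Qf @ Kf.sum(axis=0) + eps      # per-query normalizer
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(0)
N, d = 16, 8
Q, K, V = rng.standard_normal((3, N, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (16, 8)
```

The KV summary has a fixed size (d × d_v), which is what makes the cost linear in sequence length.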
cs.CV / 25 / 2603.16067
Attribution Upsampling should Redistribute, Not Interpolate
归因上采样应重新分配,而非插值
Abstract
Attribution methods in explainable AI rely on upsampling techniques that were designed for natural images, not saliency maps. Standard bilinear and bicubic interpolation systematically corrupts attribution signals through aliasing, ringing, and boundary bleeding, producing spurious high-importance regions that misrepresent model reasoning. We identify that the core issue is treating attribution upsampling as an interpolation problem that operates in isolation from the model's reasoning, rather than a mass redistribution problem where model-derived semantic boundaries must govern how importance flows. We present Universal Semantic-Aware Upsampling (USU), a principled method that reformulates upsampling through ratio-form mass redistribution operators, provably preserving attribution mass and relative importance ordering. Extending the axiomatic tradition of feature attribution to upsampling, we formalize four desiderata for faithful upsampling and prove that interpolation structurally violates three of them. These same three force any redistribution operator into a ratio form; the fourth selects the unique potential within this family, yielding USU. Controlled experiments on models with known attribution priors verify USU's formal guarantees; evaluation across ImageNet, CIFAR-10, and CUB-200 confirms consistent faithfulness improvements and qualitatively superior, semantically coherent explanations.
Chinese Translation
可解释人工智能中的归因方法依赖于为自然图像设计的上采样技术,而非显著性图。标准的双线性和双三次插值通过混叠、振铃和边界渗漏系统性地破坏了归因信号,产生了误导性的高重要性区域,错误地表征了模型推理。我们识别出核心问题在于将归因上采样视为一个孤立于模型推理的插值问题,而不是一个质量重新分配问题,其中模型衍生的语义边界必须主导重要性流动的方式。我们提出了通用语义感知上采样(Universal Semantic-Aware Upsampling, USU),这是一种通过比率形式的质量重新分配算子重新构建上采样的原则性方法,能够证明地保持归因质量和相对重要性排序。将特征归因的公理传统扩展到上采样,我们形式化了四个忠实上采样的期望,并证明插值在结构上违反了其中的三个。这三个期望强迫任何重新分配算子进入比率形式;第四个选择了这一家族中的唯一潜力,从而产生了USU。在具有已知归因先验的模型上的控制实验验证了USU的形式保证;在ImageNet、CIFAR-10和CUB-200上的评估确认了一致的忠实性改进和定性上更优、语义上连贯的解释。
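The mass-redistribution view can be illustrated with a toy operator that splits each coarse attribution cell's mass over its high-resolution block in proportion to a semantic weight map. All names here are hypothetical; this is not the paper's USU operator, only the ratio form and mass-preservation property it axiomatizes.

```python
import numpy as np

def redistribute_upsample(attr, weights, factor):
    """Upsample `attr` (H, W) by `factor`, splitting each coarse cell's
    mass over its factor x factor block in proportion to `weights`
    (H*factor, W*factor). Mass is preserved per cell, hence globally."""
    H, W = attr.shape
    out = np.zeros_like(weights, dtype=float)
    for i in range(H):
        for j in range(W):
            blk = weights[i*factor:(i+1)*factor, j*factor:(j+1)*factor]
            ratios = blk / max(blk.sum(), 1e-12)   # ratio-form split
            out[i*factor:(i+1)*factor,
                j*factor:(j+1)*factor] = attr[i, j] * ratios
    return out

attr = np.array([[1.0, 2.0], [3.0, 4.0]])
weights = np.random.default_rng(1).random((4, 4))
up = redistribute_upsample(attr, weights, 2)
print(round(up.sum(), 6))  # 10.0 -- total attribution mass preserved
```

Interpolation, by contrast, offers no such conservation guarantee: a bicubic kernel can both create and destroy mass near sharp boundaries.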
cs.CV / 26 / 2603.16078
Volumetrically Consistent Implicit Atlas Learning via Neural Diffeomorphic Flow for Placenta MRI
基于神经微分流的体积一致性隐式图谱学习用于胎盘MRI
Abstract
Establishing dense volumetric correspondences across anatomical shapes is essential for group-level analysis but remains challenging for implicit neural representations. Most existing implicit registration methods rely on supervision near the zero-level set and thus capture only surface correspondences, leaving interior deformations under-constrained. We introduce a volumetrically consistent implicit model that couples reconstruction of signed distance functions (SDFs) with neural diffeomorphic flow to learn a shared canonical template of the placenta. Volumetric regularization, including Jacobian-determinant and biharmonic penalties, suppresses local folding and promotes globally coherent deformations. In the motivating application to placenta MRI, our formulation jointly reconstructs individual placentas, aligns them to a population-derived implicit template, and enables voxel-wise intensity mapping in a unified canonical space. Experiments on in-vivo placenta MRI scans demonstrate improved geometric fidelity and volumetric alignment over surface-based implicit baseline methods, yielding anatomically interpretable and topologically consistent flattening suitable for group analysis.
Chinese Translation
在解剖形状之间建立密集的体积对应关系对于群体层级分析至关重要,但对于隐式神经表示仍然具有挑战性。大多数现有的隐式配准方法依赖于零水平集附近的监督,因此仅捕捉表面对应关系,导致内部变形约束不足。我们提出了一种体积一致性的隐式模型,该模型将带符号距离函数(SDFs)的重建与神经微分流相结合,以学习胎盘的共享典范模板。体积正则化,包括雅可比行列式和双调和惩罚,抑制局部折叠并促进全局一致的变形。在胎盘MRI的应用中,我们的公式共同重建个体胎盘,将其对齐到基于人群的隐式模板,并在统一的典范空间中实现体素级强度映射。在活体胎盘MRI扫描的实验中,显示出相较于基于表面的隐式基线方法,几何保真度和体积对齐得到了改善,产生了在解剖上可解释且拓扑一致的展平结果,适合于群体分析。
cs.CV / 27 / 2603.16083
Structured prototype regularization for synthetic-to-real driving scene parsing
用于合成到真实驾驶场景解析的结构化原型正则化
Abstract
Driving scene parsing is critical for autonomous vehicles to operate reliably in complex real-world traffic environments. To reduce the reliance on costly pixel-level annotations, synthetic datasets with automatically generated labels have become a popular alternative. However, models trained on synthetic data often perform poorly when applied to real-world scenes due to the synthetic-to-real domain gap. Despite the success of unsupervised domain adaptation in narrowing this gap, most existing methods mainly focus on global feature alignment while overlooking the semantic structure of the feature space. As a result, semantic relations among classes are insufficiently modeled, limiting the model's ability to generalize. To address these challenges, this study introduces a novel unsupervised domain adaptation framework that explicitly regularizes semantic feature structures to significantly enhance driving scene parsing performance in real-world scenarios. Specifically, the proposed method enforces inter-class separation and intra-class compactness by leveraging class-specific prototypes, thereby enhancing the discriminability and structural coherence of feature clusters. An entropy-based noise filtering strategy improves the reliability of pseudo labels, while a pixel-level attention mechanism further refines feature alignment. Extensive experiments on representative benchmarks demonstrate that the proposed method consistently outperforms recent state-of-the-art methods. These results underscore the importance of preserving semantic structure for robust synthetic-to-real adaptation in driving scene parsing tasks.
Chinese Translation
驾驶场景解析对于自主车辆在复杂的真实交通环境中可靠运行至关重要。为了减少对昂贵的像素级标注的依赖,自动生成标签的合成数据集已成为一种流行的替代方案。然而,基于合成数据训练的模型在应用于真实场景时往往表现不佳,这主要是由于合成到真实的领域差距。尽管无监督领域适应在缩小这一差距方面取得了一定成功,但大多数现有方法主要集中于全局特征对齐,而忽视了特征空间的语义结构。因此,类别之间的语义关系建模不足,限制了模型的泛化能力。为了解决这些挑战,本研究提出了一种新颖的无监督领域适应框架,明确地正则化语义特征结构,以显著提升真实场景中的驾驶场景解析性能。具体而言,所提出的方法通过利用类别特定的原型来加强类间分离和类内紧凑性,从而增强特征聚类的可区分性和结构一致性。一种基于熵的噪声过滤策略提高了伪标签的可靠性,而像素级注意机制进一步优化了特征对齐。在代表性基准上的大量实验表明,所提出的方法始终优于最近的最先进方法。这些结果强调了在驾驶场景解析任务中保持语义结构对于稳健的合成到真实适应的重要性。
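Prototype-based compactness and separation terms of the kind described can be sketched as follows. These are toy mean-squared/hinge formulations on class-mean prototypes; the paper's exact losses, entropy-based pseudo-label filtering, and pixel-level attention are not reproduced.

```python
import numpy as np

def prototype_losses(feats, labels, margin=1.0):
    """Toy prototype regularizers: intra-class compactness pulls features
    to their class prototype (mean); inter-class separation pushes
    prototypes at least `margin` apart via a squared hinge."""
    classes = np.unique(labels)
    protos = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    # Compactness: mean squared distance of each feature to its prototype.
    compact = np.mean([np.sum((feats[labels == c] - protos[k]) ** 2,
                              axis=1).mean()
                       for k, c in enumerate(classes)])
    # Separation: hinge on pairwise prototype distances.
    sep, n = 0.0, 0
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            d = np.linalg.norm(protos[a] - protos[b])
            sep += max(0.0, margin - d) ** 2
            n += 1
    return compact, sep / max(n, 1)

rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(0, 0.1, (20, 8)),
                        rng.normal(3, 0.1, (20, 8))])
labels = np.array([0] * 20 + [1] * 20)
compact, sep = prototype_losses(feats, labels)
print(compact < 1.0, sep == 0.0)  # tight, well-separated clusters
```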
cs.CV / 28 / 2603.16085
Interact3D: Compositional 3D Generation of Interactive Objects
Interact3D:交互对象的组合性3D生成
Abstract
Recent breakthroughs in 3D generation have enabled the synthesis of high-fidelity individual assets. However, generating 3D compositional objects from single images--particularly under occlusions--remains challenging. Existing methods often degrade geometric details in hidden regions and fail to preserve the underlying object-object spatial relationships (OOR). We present a novel framework Interact3D designed to generate physically plausible interacting 3D compositional objects. Our approach first leverages advanced generative priors to curate high-quality individual assets with a unified 3D guidance scene. To physically compose these assets, we then introduce a robust two-stage composition pipeline. Based on the 3D guidance scene, the primary object is anchored through precise global-to-local geometric alignment (registration), while subsequent geometries are integrated using a differentiable Signed Distance Field (SDF)-based optimization that explicitly penalizes geometry intersections. To reduce challenging collisions, we further deploy a closed-loop, agentic refinement strategy. A Vision-Language Model (VLM) autonomously analyzes multi-view renderings of the composed scene, formulates targeted corrective prompts, and guides an image editing module to iteratively self-correct the generation pipeline. Extensive experiments demonstrate that Interact3D successfully produces promising collision-aware compositions with improved geometric fidelity and consistent spatial relationships.
Chinese Translation
最近在3D生成领域的突破使得高保真个体资产的合成成为可能。然而,从单一图像生成3D组合对象,尤其是在遮挡情况下,仍然具有挑战性。现有方法往往降低了隐藏区域的几何细节,并未能维护潜在的对象间空间关系(Object-Object Spatial Relationships,OOR)。我们提出了一种新颖的框架Interact3D,旨在生成物理上合理的交互3D组合对象。我们的方法首先利用先进的生成先验,结合统一的3D指导场景来策划高质量的个体资产。为了物理组合这些资产,我们接着引入了一个稳健的两阶段组合流程。在3D指导场景的基础上,主要对象通过精确的全局到局部几何对齐(注册)锚定,而后续几何体则通过一种可微分的符号距离场(Signed Distance Field,SDF)优化方法以明确惩罚几何交叉的方式进行整合。为了减少复杂的碰撞,我们进一步实施了一个闭环的自主优化策略。一个视觉-语言模型(Vision-Language Model,VLM)自动分析组合场景的多视图渲染,制定针对性的修正提示,并引导图像编辑模块迭代自我修正生成流程。大量实验表明,Interact3D成功地生成了具有改进几何保真度和一致空间关系的碰撞感知组合。
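The SDF-based intersection penalty can be illustrated on analytic sphere SDFs: a point lies in the intersection of two shapes exactly when both signed distances are negative, i.e. when max(d_A, d_B) < 0. The sketch below is generic and not Interact3D's actual optimizer.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    # Signed distance to a sphere: negative inside, positive outside.
    return np.linalg.norm(points - center, axis=-1) - radius

def intersection_penalty(points, sdf_a, sdf_b):
    """Penalize sample points lying inside BOTH shapes. A point is in
    the intersection iff max(d_a, d_b) < 0; penalty = squared depth."""
    depth = np.maximum(sdf_a(points), sdf_b(points))
    return np.sum(np.minimum(depth, 0.0) ** 2)

pts = np.random.default_rng(0).uniform(-2, 2, (5000, 3))
a = lambda p: sphere_sdf(p, np.array([0.0, 0, 0]), 1.0)
overlap   = lambda p: sphere_sdf(p, np.array([0.5, 0, 0]), 1.0)  # overlaps a
separated = lambda p: sphere_sdf(p, np.array([3.0, 0, 0]), 1.0)  # disjoint
print(intersection_penalty(pts, a, overlap) > 0)     # True
print(intersection_penalty(pts, a, separated) == 0)  # True
```

Because the penalty is a smooth function of the SDF values, its gradient with respect to object poses pushes interpenetrating shapes apart.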
cs.CV / 29 / 2603.16092
Parallel In-context Learning for Large Vision Language Models
大型视觉语言模型的并行上下文学习
Abstract
Large vision-language models (LVLMs) employ multi-modal in-context learning (MM-ICL) to adapt to new tasks by leveraging demonstration examples. While increasing the number of demonstrations boosts performance, they incur significant inference latency due to the quadratic computational cost of Transformer attention with respect to the context length. To address this trade-off, we propose Parallel In-Context Learning (Parallel-ICL), a plug-and-play inference algorithm. Parallel-ICL partitions the long demonstration context into multiple shorter, manageable chunks. It processes these chunks in parallel and integrates their predictions at the logit level, using a weighted Product-of-Experts (PoE) ensemble to approximate the full-context output. Guided by ensemble learning theory, we introduce principled strategies for Parallel-ICL: (i) clustering-based context chunking to maximize inter-chunk diversity and (ii) similarity-based context compilation to weight predictions by query relevance. Extensive experiments on VQA, image captioning, and classification benchmarks demonstrate that Parallel-ICL achieves performance comparable to full-context MM-ICL, while significantly improving inference speed. Our work offers an effective solution to the accuracy-efficiency trade-off in MM-ICL, enabling dynamic task adaptation with substantially reduced inference overhead.
Chinese Translation
大型视觉语言模型(LVLMs)采用多模态上下文学习(MM-ICL)通过利用示例来适应新任务。虽然增加示例数量可以提升性能,但由于Transformer注意力机制在上下文长度方面的平方计算成本,它们会导致显著的推理延迟。为了解决这一权衡,我们提出了并行上下文学习(Parallel-ICL),这是一种即插即用的推理算法。Parallel-ICL将长示例上下文划分为多个较短、可管理的块,并并行处理这些块,在logit级别整合它们的预测,使用加权专家乘积(Product-of-Experts, PoE)集成来近似全上下文输出。在集成学习理论的指导下,我们为Parallel-ICL引入了原则性策略:(i)基于聚类的上下文块划分,以最大化块间多样性;(ii)基于相似性的上下文编译,以根据查询相关性对预测进行加权。在VQA、图像描述和分类基准上的大量实验表明,Parallel-ICL的性能与全上下文MM-ICL相当,同时显著提高了推理速度。我们的工作为MM-ICL中的准确性与效率权衡提供了有效的解决方案,使得动态任务适应的推理开销大幅降低。
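The logit-level weighted Product-of-Experts combination can be sketched directly: p(y) ∝ ∏ₖ pₖ(y)^{wₖ}, i.e. a weighted sum of per-chunk log-probabilities. The chunk weights below are hypothetical; the clustering-based chunking and similarity-based weighting strategies are not modeled here.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def poe_ensemble(chunk_logits, weights):
    """Weighted Product-of-Experts over per-chunk predictions:
    p(y) propto prod_k p_k(y)^{w_k} == weighted sum of log-probs."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    logps = np.stack([log_softmax(l) for l in chunk_logits])   # (K, V)
    combined = np.tensordot(w, logps, axes=1)                  # (V,)
    return np.exp(log_softmax(combined))

# Three chunks' logits over a 4-token vocabulary; weights would come
# from query-demonstration similarity (toy values here).
chunks = [np.array([2.0, 0.1, 0.1, 0.1]),
          np.array([1.5, 0.2, 0.1, 0.1]),
          np.array([0.1, 0.1, 0.1, 1.8])]
p = poe_ensemble(chunks, weights=[0.4, 0.4, 0.2])
print(p.argmax())  # 0 -- the two agreeing chunks dominate
```

Since the K chunk forward passes are independent, they can run in parallel, and only the final V-dimensional log-prob vectors need to be combined.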
cs.CV / 30 / 2603.16098
LICA: Layered Image Composition Annotations for Graphic Design Research
LICA:用于图形设计研究的分层图像合成注释
Abstract
We introduce LICA (Layered Image Composition Annotations), a large-scale dataset of 1,550,244 multi-layer graphic design compositions designed to advance structured understanding and generation of graphic layouts. In addition to rendered PNG images, LICA represents each design as a hierarchical composition of typed components including text, image, vector, and group elements, each paired with rich per-element metadata such as spatial geometry, typographic attributes, opacity, and visibility. The dataset spans 20 design categories and 971,850 unique templates, providing broad coverage of real-world design structures. We further introduce graphic design video as a new and largely unexplored challenge for current vision-language models through 27,261 animated layouts annotated with per-component keyframes and motion parameters. Beyond scale, LICA establishes a new paradigm of research tasks for graphic design, enabling structured investigations into problems such as layer-aware inpainting, structured layout generation, controlled design editing, and temporally-aware generative modeling. By representing design as a system of compositional layers and relationships, the dataset supports research on models that operate directly on design structure rather than pixels alone.
Chinese Translation
我们介绍了LICA(分层图像合成注释),这是一个包含1,550,244个多层图形设计作品的大规模数据集,旨在推动对图形布局的结构化理解和生成。除了渲染的PNG图像外,LICA将每个设计表示为一个层次化的组成,包括文本、图像、矢量和组元素,每个元素都配有丰富的元数据,如空间几何、排版属性、不透明度和可见性。该数据集涵盖20个设计类别和971,850个独特模板,广泛覆盖现实世界的设计结构。我们进一步引入图形设计视频作为当前视觉-语言模型面临的一项新的且尚未充分探索的挑战,通过27,261个动画布局注释每个组件的关键帧和运动参数。除了规模之外,LICA建立了一种新的图形设计研究任务范式,使得对诸如层感知修复、结构化布局生成、受控设计编辑和时间感知生成建模等问题的结构化研究成为可能。通过将设计表示为一个由组合层和关系构成的系统,该数据集支持对直接在设计结构上而非仅在像素上操作的模型的研究。
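A layered composition of the kind LICA describes can be modeled as a small typed hierarchy. The schema below is a hypothetical illustration of typed components with per-element metadata, not LICA's actual annotation format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Layer:
    kind: str                 # "text" | "image" | "vector" | "group"
    x: float; y: float; w: float; h: float   # spatial geometry
    opacity: float = 1.0
    visible: bool = True
    font: Optional[str] = None               # text layers only
    children: List["Layer"] = field(default_factory=list)  # groups only

def count_visible(layer):
    # Walk the hierarchy, counting visible leaf elements.
    if layer.kind == "group":
        return sum(count_visible(c) for c in layer.children)
    return int(layer.visible)

design = Layer("group", 0, 0, 1080, 1080, children=[
    Layer("image", 0, 0, 1080, 1080),
    Layer("group", 100, 100, 800, 300, children=[
        Layer("text", 120, 120, 760, 80, font="Inter"),
        Layer("vector", 120, 220, 200, 40, visible=False),
    ]),
])
print(count_visible(design))  # 2
```

Representing designs this way lets models reason about part boundaries and layer relations directly, rather than re-deriving them from rendered pixels.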
cs.CV / 31 / 2603.16099
OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder
OneWorld:利用3D统一表示自编码器驯服场景生成
Abstract
Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE); it leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF) to mitigate train-inference exposure bias and shape a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods. Our code will be available at https://github.com/SensenGao/OneWorld.
Chinese Translation
现有的基于扩散的3D场景生成方法主要在2D图像/视频潜在空间中操作,这使得保持跨视图外观和几何一致性面临固有挑战。为了解决这一问题,我们提出了OneWorld,一个直接在一致的3D表示空间内进行扩散的框架。我们的方法的核心是3D统一表示自编码器(3D-URAE);它利用预训练的3D基础模型,并通过注入外观和提炼语义到统一的3D潜在空间中增强其以几何为中心的特性。此外,我们引入了基于标记的跨视图对应(CVC)一致性损失,以明确强制视图间的结构对齐,并提出了流形漂移强制(MDF),以减轻训练-推理曝光偏差,并通过混合漂移和原始表示来塑造一个稳健的3D流形。全面的实验表明,与最先进的基于2D的方法相比,OneWorld生成的3D场景具有更高质量和优越的跨视图一致性。我们的代码将发布在 https://github.com/SensenGao/OneWorld。
cs.CV / 32 / 2603.16100
Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP
重新评估 CLIP 中的模态内失调假设
Abstract
Recent research suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks. The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated distances between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics affected. For the theoretical argument, we demonstrate that there are no such supposed degrees of freedom for image embedding distances. For the empirical measures, our findings reveal they yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2). This indicates the observed phenomena do not stem from a misalignment specific to the former. Experiments on the commonly studied intra-modal tasks retrieval and few-shot classification confirm that addressing task ambiguity, not supposed misalignment, is key for best results.
Chinese Translation
近期研究表明,CLIP 类对比语言-图像训练所产生的嵌入在仅图像任务中并不理想。主要理论认为,模态间(语言-图像)对齐损失忽视了模态内(图像-图像)对齐,从而导致图像之间的距离校准不良。在本研究中,我们质疑这一模态内失调假设。我们重新审视其基础理论论点、用于支持该论点的指标以及受影响的性能指标。对于理论论点,我们证明图像嵌入距离并不存在所谓的自由度。对于实证测量,我们的发现显示,语言-图像训练模型(CLIP、SigLIP)和图像-图像训练模型(DINO、SigLIP2)产生的结果相似。这表明观察到的现象并非源于前者特有的失调。在常见的模态内任务检索和少样本分类的实验中,结果确认了解决任务模糊性,而非所谓的失调,才是获得最佳结果的关键。
cs.CV / 33 / 2603.16103
NanoGS: Training-Free Gaussian Splat Simplification
NanoGS:无训练的高斯点简化
Abstract
3D Gaussian Splat (3DGS) enables high-fidelity, real-time novel view synthesis by representing scenes with large sets of anisotropic primitives, but often requires millions of Splats, incurring significant storage and transmission costs. Most existing compression methods rely on GPU-intensive post-training optimization with calibrated images, limiting practical deployment. We introduce NanoGS, a training-free and lightweight framework for Gaussian Splat simplification. Instead of relying on image-based rendering supervision, NanoGS formulates simplification as local pairwise merging over a sparse spatial graph. The method approximates a pair of Gaussians with a single primitive using mass preserved moment matching and evaluates merge quality through a principled merge cost between the original mixture and its approximation. By restricting merge candidates to local neighborhoods and selecting compatible pairs efficiently, NanoGS produces compact Gaussian representations while preserving scene structure and appearance. NanoGS operates directly on existing Gaussian Splat models, runs efficiently on CPU, and preserves the standard 3DGS parameterization, enabling seamless integration with existing rendering pipelines. Experiments demonstrate that NanoGS substantially reduces primitive count while maintaining high rendering fidelity, providing an efficient and practical solution for Gaussian Splat simplification. Our project website is available at https://saliteta.github.io/NanoGS/.
Chinese Translation
3D 高斯点(3DGS)通过使用大量各向异性原语表示场景,实现高保真、实时的新视图合成,但通常需要数百万个点,导致显著的存储和传输成本。现有的大多数压缩方法依赖于 GPU 密集型的后训练优化和校准图像,这限制了实际应用。我们提出了 NanoGS,一个无训练且轻量级的高斯点简化框架。NanoGS 不依赖于基于图像渲染的监督,而是将简化过程表述为在稀疏空间图上的局部成对合并。该方法通过质量保持的矩匹配将一对高斯点近似为一个原语,并通过原始混合物与其近似之间的原则性合并成本来评估合并质量。通过将合并候选限制在局部邻域内并高效选择兼容对,NanoGS 在保留场景结构和外观的同时生成紧凑的高斯表示。NanoGS 直接在现有的高斯点模型上运行,能够高效地在 CPU 上执行,并保持标准的 3DGS 参数化,便于与现有渲染管线无缝集成。实验表明,NanoGS 在保持高渲染保真度的同时显著减少了原语数量,为高斯点简化提供了一种高效且实用的解决方案。我们的项目网站可访问 https://saliteta.github.io/NanoGS/。
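Mass-preserved moment matching for a pair of weighted Gaussians follows the standard mixture-reduction identities: the merged weight, mean, and covariance match the pair's zeroth, first, and second moments. NanoGS's merge cost and neighborhood selection are not reproduced in this sketch.

```python
import numpy as np

def merge_gaussians(w1, mu1, S1, w2, mu2, S2):
    """Replace two weighted Gaussians by one whose zeroth, first, and
    second moments match the pair's (mixture-reduction identities)."""
    w = w1 + w2
    a1, a2 = w1 / w, w2 / w
    mu = a1 * mu1 + a2 * mu2
    d1, d2 = mu1 - mu, mu2 - mu
    # Covariance = weighted covariances + spread of the two means.
    S = a1 * (S1 + np.outer(d1, d1)) + a2 * (S2 + np.outer(d2, d2))
    return w, mu, S

w, mu, S = merge_gaussians(
    0.7, np.zeros(3), np.eye(3) * 0.01,
    0.3, np.array([1.0, 0, 0]), np.eye(3) * 0.01,
)
print(w)      # 1.0 -- mass preserved
print(mu[0])  # 0.3 -- weighted mean
```

The discrepancy between the original two-component mixture and this single-Gaussian approximation is what a merge cost would score before committing to the merge.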
cs.CV / 34 / 2603.16113
PathGLS: Evaluating Pathology Vision-Language Models without Ground Truth through Multi-Dimensional Consistency
PathGLS:通过多维一致性评估病理视觉-语言模型而无需真实标签
Abstract
Vision-Language Models (VLMs) offer significant potential in computational pathology by enabling interpretable image analysis, automated reporting, and scalable decision support. However, their widespread clinical adoption remains limited due to the absence of reliable, automated evaluation metrics capable of identifying subtle failures such as hallucinations. To address this gap, we propose PathGLS, a novel reference-free evaluation framework that assesses pathology VLMs across three dimensions: Grounding (fine-grained visual-text alignment), Logic (entailment graph consistency using Natural Language Inference), and Stability (output variance under adversarial visual-semantic perturbations). PathGLS supports both patch-level and whole-slide image (WSI)-level analysis, yielding a comprehensive trust score. Experiments on Quilt-1M, TCGA, REG2025, PathMMU and TCGA-Sarcoma datasets demonstrate the superiority of PathGLS. Specifically, on the Quilt-1M dataset, PathGLS reveals a steep sensitivity drop of 40.2% for hallucinated reports compared to only 2.1% for BERTScore. Moreover, validation against expert-defined clinical error hierarchies reveals that PathGLS achieves a strong Spearman's rank correlation of $\rho=0.71$ ($p < 0.0001$), significantly outperforming Large Language Model (LLM)-based approaches (Gemini 3.0 Pro: $\rho=0.39$, $p < 0.0001$). These results establish PathGLS as a robust reference-free metric. By directly quantifying hallucination rates and domain shift robustness, it serves as a reliable criterion for benchmarking VLMs on private clinical datasets and informing safe deployment. Code can be found at: https://github.com/My13ad/PathGLS
Chinese Translation
视觉-语言模型(VLMs)在计算病理学中具有显著潜力,能够实现可解释的图像分析、自动报告和可扩展的决策支持。然而,由于缺乏能够识别诸如幻觉等微妙失误的可靠自动评估指标,其广泛的临床应用仍然受到限制。为了解决这一问题,我们提出了PathGLS,这是一种新颖的无参考评估框架,能够从三个维度评估病理VLMs:基础(细粒度视觉-文本对齐)、逻辑(使用自然语言推理的蕴涵图一致性)和稳定性(在对抗性视觉-语义扰动下的输出方差)。PathGLS支持补丁级和全切片图像(WSI)级分析,生成全面的可信度评分。在Quilt-1M、TCGA、REG2025、PathMMU和TCGA-肉瘤数据集上的实验表明PathGLS的优越性。具体而言,在Quilt-1M数据集上,PathGLS对幻觉报告显示出40.2%的敏感性急剧下降,而BERTScore仅下降2.1%。此外,与专家定义的临床错误层级进行验证显示,PathGLS实现了强大的斯皮尔曼等级相关系数$\rho=0.71$($p < 0.0001$),显著优于基于大型语言模型(LLM)的方法(Gemini 3.0 Pro: $\rho=0.39$, $p < 0.0001$)。这些结果确立了PathGLS作为一种稳健的无参考指标。通过直接量化幻觉率和领域转移的鲁棒性,它为在私有临床数据集上基准测试VLMs提供了可靠标准,并为安全部署提供了依据。代码可在以下链接找到:https://github.com/My13ad/PathGLS
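A Spearman rank correlation like the one reported is simply the Pearson correlation of ranks. Below is a tie-free NumPy sketch on hypothetical metric scores and expert severity ranks; the numbers are illustrative, not the paper's data.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data:
    Pearson correlation of the rank vectors."""
    rx = np.argsort(np.argsort(x)).astype(float)  # 0-based ranks
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Hypothetical trust scores vs. expert severity ranks for six reports
# (1 = least severe errors); higher score should mean lower severity.
metric_scores = np.array([0.91, 0.85, 0.60, 0.72, 0.40, 0.55])
expert_rank   = np.array([1, 2, 4, 3, 6, 5])
rho = spearman_rho(metric_scores, -expert_rank.astype(float))
print(round(rho, 2))  # 1.0 -- perfectly monotone by construction
```

With ties present, average ranks would be needed instead of `argsort`-based ranks (e.g. `scipy.stats.spearmanr` handles this).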
cs.CV / 35 / 2603.16122
Out-of-Distribution Object Detection in Street Scenes via Synthetic Outlier Exposure and Transfer Learning
通过合成异常暴露和迁移学习在街景中进行分布外物体检测
Abstract
Out-of-distribution (OOD) object detection is an important yet underexplored task. A reliable object detector should be able to handle OOD objects by localizing and correctly classifying them as OOD. However, a critical issue arises when such atypical objects are completely missed by the object detector and incorrectly treated as background. Existing OOD detection approaches in object detection often rely on complex architectures or auxiliary branches and typically do not provide a framework that treats in-distribution (ID) and OOD in a unified way. In this work, we address these limitations by enabling a single detector to detect OOD objects, that are otherwise silently overlooked, alongside ID objects. We present \textbf{SynOE-OD}, a \textbf{Syn}thetic \textbf{O}utlier-\textbf{E}xposure-based \textbf{O}bject \textbf{D}etection framework, that leverages strong generative models, like Stable Diffusion, and Open-Vocabulary Object Detectors (OVODs) to generate semantically meaningful, object-level data that serve as outliers during training. The generated data is used for transfer-learning to establish strong ID task performance and supplement detection models with OOD object detection robustness. Our approach achieves state-of-the-art average precision on an established OOD object detection benchmark, where OVODs, such as GroundingDINO, show limited zero-shot performance in detecting OOD objects in street-scenes.
Chinese Translation
分布外(OOD)物体检测是一项重要但尚未深入研究的任务。一个可靠的物体检测器应该能够处理OOD物体,通过定位并正确地将其分类为OOD。然而,当这些非典型物体被物体检测器完全忽略并错误地视为背景时,就会出现一个关键问题。现有的物体检测中的OOD检测方法通常依赖于复杂的架构或辅助分支,并且通常没有提供一个将分布内(ID)和OOD统一处理的框架。在本研究中,我们通过使单个检测器能够检测那些通常被忽视的OOD物体以及ID物体,来解决这些局限性。我们提出了SynOE-OD(Synthetic Outlier-Exposure-based Object Detection),一个基于合成异常暴露的物体检测框架,该框架利用强大的生成模型(如Stable Diffusion)以及开放词汇物体检测器(OVODs),生成在语义上有意义的物体级数据,这些数据在训练过程中作为异常值使用。生成的数据用于迁移学习,以建立强大的ID任务性能,并增强检测模型的OOD物体检测鲁棒性。我们的方法在一个已建立的OOD物体检测基准上实现了最先进的平均精度,其中OVODs(如GroundingDINO)在街景中检测OOD物体的零样本性能有限。
cs.CV / 36 / 2603.16129
Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting
提升零样本物体计数的数量与空间意识
Abstract
Zero-shot object counting (ZSOC) aims to enumerate objects of arbitrary categories specified by text descriptions without requiring visual exemplars. However, existing methods often treat counting as a coarse retrieval task, suffering from a lack of fine-grained quantity awareness. Furthermore, they frequently exhibit spatial insensitivity and degraded generalization due to feature space distortion during model adaptation.To address these challenges, we present \textbf{QICA}, a novel framework that synergizes \underline{q}uantity percept\underline{i}on with robust spatial \underline{c}ast \underline{a}ggregation. Specifically, we introduce a Synergistic Prompting Strategy (\textbf{SPS}) that adapts vision and language encoders through numerically conditioned prompts, bridging the gap between semantic recognition and quantitative reasoning. To mitigate feature distortion, we propose a Cost Aggregation Decoder (\textbf{CAD}) that operates directly on vision-text similarity maps. By refining these maps through spatial aggregation, CAD prevents overfitting while preserving zero-shot transferability. Additionally, a multi-level quantity alignment loss ($\mathcal{L}_{MQA}$) is employed to enforce numerical consistency across the entire pipeline. Extensive experiments on FSC-147 demonstrate competitive performance, while zero-shot evaluation on CARPK and ShanghaiTech-A validates superior generalization to unseen domains.
Chinese Translation
零样本物体计数(Zero-shot object counting, ZSOC)旨在通过文本描述枚举任意类别的物体,而无需视觉示例。然而,现有方法通常将计数视为一种粗略检索任务,缺乏细粒度的数量意识。此外,由于模型适应过程中特征空间的扭曲,它们常常表现出空间敏感性不足和泛化能力下降。为了解决这些挑战,我们提出了QICA,一个新颖的框架,将数量感知(quantity perception)与稳健的空间聚合(cast aggregation)相结合。具体而言,我们引入了一种协同提示策略(Synergistic Prompting Strategy, SPS),通过数值条件提示调整视觉和语言编码器,弥合语义识别与数量推理之间的差距。为了减轻特征扭曲,我们提出了一种成本聚合解码器(Cost Aggregation Decoder, CAD),该解码器直接作用于视觉-文本相似性图。通过空间聚合精炼这些图,CAD防止了过拟合,同时保留了零样本迁移能力。此外,采用多层次数量对齐损失($\mathcal{L}_{MQA}$)以在整个管道中强制执行数值一致性。在FSC-147上的广泛实验表明了具有竞争力的性能,而在CARPK和ShanghaiTech-A上的零样本评估验证了对未见领域的优越泛化能力。
cs.CV / 37 / 2603.16130
EPOFusion: Exposure aware Progressive Optimization Method for Infrared and Visible Image Fusion
EPOFusion:一种考虑曝光的渐进优化方法用于红外与可见光图像融合
Abstract
Overexposure frequently occurs in practical scenarios, causing the loss of critical visual information. However, existing infrared and visible fusion methods still exhibit unsatisfactory performance in highly bright regions. To address this, we propose EPOFusion, an exposure-aware fusion model. Specifically, a guidance module is introduced to facilitate the encoder in extracting fine-grained infrared features from overexposed regions. Meanwhile, an iterative decoder incorporating a multiscale context fusion module is designed to progressively enhance the fused image, ensuring consistent details and superior visual quality. Finally, an adaptive loss function dynamically constrains the fusion process, enabling an effective balance between the modalities under varying exposure conditions. To achieve better exposure awareness, we construct the first infrared and visible overexposure dataset (IVOE) with high quality infrared guided annotations for overexposed regions. Extensive experiments show that EPOFusion outperforms existing methods. It maintains infrared cues in overexposed regions while achieving visually faithful fusion in non-overexposed areas, thereby enhancing both visual fidelity and downstream task performance. Code, fusion results and IVOE dataset will be made available at https://github.com/warren-wzw/EPOFusion.git.
Chinese Translation
在实际场景中,过度曝光经常发生,导致关键视觉信息的丢失。然而,现有的红外与可见光融合方法在高亮区域的表现仍然不尽如人意。为了解决这一问题,我们提出了EPOFusion,一种考虑曝光的融合模型。具体而言,我们引入了一个引导模块,以帮助编码器从过度曝光区域提取细粒度的红外特征。同时,设计了一个迭代解码器,结合多尺度上下文融合模块,逐步增强融合图像,确保细节一致性和优越的视觉质量。最后,一个自适应损失函数动态约束融合过程,使得在不同曝光条件下实现模态之间的有效平衡。为了实现更好的曝光感知,我们构建了第一个红外与可见光过度曝光数据集(IVOE),其中包含针对过度曝光区域的高质量红外引导注释。大量实验表明,EPOFusion的表现优于现有方法。它在过度曝光区域保持红外线索的同时,在非过度曝光区域实现视觉上真实的融合,从而提升了视觉保真度和下游任务性能。代码、融合结果和IVOE数据集将发布在https://github.com/warren-wzw/EPOFusion.git。
cs.CV / 38 / 2603.16133
DualPrim: Compact 3D Reconstruction with Positive and Negative Primitives
DualPrim:基于正负原始体的紧凑型三维重建
Abstract
Neural reconstructions often trade structure for fidelity, yielding dense and unstructured meshes with irregular topology and weak part boundaries that hinder editing, animation, and downstream asset reuse. We present DualPrim, a compact and structured 3D reconstruction framework. Unlike additive-only implicit or primitive methods, DualPrim represents shapes with positive and negative superquadrics: the former builds the bases while the latter carves local volumes through a differentiable operator, enabling topology-aware modeling of holes and concavities. This additive-subtractive design increases the representational power without sacrificing compactness or differentiability. We embed DualPrim in a volumetric differentiable renderer, enabling end-to-end learning from multi-view images and seamless mesh export via closed-form boolean difference. Empirically, DualPrim delivers state-of-the-art accuracy and produces compact, structured, and interpretable outputs that better satisfy downstream needs than additive-only alternatives.
Chinese Translation
神经重建通常在结构与保真度之间进行权衡,导致生成密集且无结构的网格,这些网格具有不规则的拓扑和较弱的部件边界,妨碍了编辑、动画和后续资产的重用。我们提出了DualPrim,一个紧凑且结构化的三维重建框架。与仅使用加法的隐式或原始体方法不同,DualPrim通过正负超二次曲面(superquadrics)来表示形状:前者构建基础,而后者通过可微分算子雕刻局部体积,从而实现对孔洞和凹陷的拓扑感知建模。这种加法-减法设计在不牺牲紧凑性或可微分性的情况下,提高了表示能力。我们将DualPrim嵌入到体积可微分渲染器中,实现了从多视角图像的端到端学习,并通过封闭形式的布尔差运算无缝导出网格。从经验上看,DualPrim提供了最先进的准确性,生成的输出紧凑、结构化且可解释,能够更好地满足后续需求,优于仅使用加法的替代方案。
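The positive/negative composition reduces to constructive solid geometry on implicit functions: carving B out of A is max(f_A, -f_B). A sketch using the superquadric inside-outside function (sign-correct for CSG, though not a true distance; the parameters below are illustrative, not DualPrim's learned ones):

```python
import numpy as np

def superquadric_io(p, scale, e1, e2):
    """Superquadric inside-outside function: < 0 inside, > 0 outside.
    (Not a true distance, but the sign is all CSG needs.)"""
    x, y, z = np.abs(p / scale).T
    return ((x ** (2 / e2) + y ** (2 / e2)) ** (e2 / e1)
            + z ** (2 / e1)) ** (e1 / 2) - 1.0

def difference(f_pos, f_neg):
    # CSG difference A \ B on implicit functions: max(f_A, -f_B).
    return lambda p: np.maximum(f_pos(p), -f_neg(p))

box  = lambda p: superquadric_io(p, np.array([1.0, 1.0, 1.0]), 0.1, 0.1)
hole = lambda p: superquadric_io(p - np.array([0, 0, 1.0]),
                                 np.array([0.4, 0.4, 0.5]), 1.0, 1.0)
shape = difference(box, hole)
pts = np.array([[0.0, 0.0, 0.0],    # deep inside the box, away from hole
                [0.0, 0.0, 0.95],   # inside the box but carved out
                [2.0, 0.0, 0.0]])   # outside everything
print(np.sign(shape(pts)))  # [-1.  1.  1.]
```

Small e1, e2 give box-like primitives and e1 = e2 = 1 gives an ellipsoid, which is why a handful of superquadrics can cover many part shapes compactly.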
cs.CV / 39 / 2603.16134
When Generative Augmentation Hurts: A Benchmark Study of GAN and Diffusion Models for Bias Correction in AI Classification Systems
当生成增强造成伤害:针对AI分类系统中的偏差修正的GAN与扩散模型基准研究
Abstract
Generative models are widely used to compensate for class imbalance in AI training pipelines, yet their failure modes under low-data conditions are poorly understood. This paper reports a controlled benchmark comparing three augmentation strategies applied to a fine-grained animal classification task: traditional transforms, FastGAN, and Stable Diffusion 1.5 fine-tuned with Low-Rank Adaptation (LoRA). Using the Oxford-IIIT Pet Dataset with eight artificially underrepresented breeds, we find that FastGAN augmentation does not merely underperform at very low training set sizes but actively increases classifier bias, with a statistically significant large effect across three random seeds (bias gap increase: +20.7%, Cohen's d = +5.03, p = 0.013). The effect size here is large enough to give confidence in the direction of the finding despite the small number of seeds. Feature embedding analysis using t-distributed Stochastic Neighbor Embedding reveals that FastGAN images for severe-minority breeds form tight isolated clusters outside the real image distribution, a pattern consistent with mode collapse. Stable Diffusion with Low-Rank Adaptation produced the best results overall, achieving the highest macro F1 (0.9125 plus or minus 0.0047) and a 13.1% reduction in the bias gap relative to the unaugmented baseline. The data suggest a sample-size boundary somewhere between 20 and 50 training images per class below which GAN augmentation becomes harmful in this setting, though further work across additional domains is needed to establish where that boundary sits more precisely. All experiments run on a consumer-grade GPU with 6 to 8 GB of memory, with no cloud compute required.
Chinese Translation
生成模型广泛用于弥补AI训练流程中的类别不平衡,但在低数据条件下的失效模式尚不清楚。本文报告了一项控制基准研究,比较了三种应用于细粒度动物分类任务的增强策略:传统变换、FastGAN和使用低秩适应(LoRA)微调的Stable Diffusion 1.5。使用包含八种人为代表性不足的品种的Oxford-IIIT宠物数据集,我们发现FastGAN增强不仅在非常低的训练集规模下表现不佳,还会主动增加分类器的偏差,在三个随机种子下具有统计显著的大效应(偏差差距增加:+20.7%,Cohen's d = +5.03,p = 0.013)。尽管种子数量较少,但效应大小足以让我们对结果的方向充满信心。使用t分布随机邻域嵌入的特征嵌入分析显示,FastGAN生成的严重少数品种图像在真实图像分布之外形成紧密孤立的聚类,这一模式与模式崩溃一致。使用低秩适应的Stable Diffusion整体表现最佳,达到了最高的宏观F1值(0.9125 ± 0.0047),相较于未增强的基线,偏差差距减少了13.1%。数据表明,在每个类别的训练图像数量在20到50之间的某个样本量边界以下,GAN增强在此环境中变得有害,尽管还需在其他领域进行进一步研究,以更精确地确定该边界。所有实验均在具有6到8GB内存的消费级GPU上运行,无需云计算。
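Cohen's d with a pooled standard deviation, as used for the seed-level comparison, is straightforward to compute. The per-seed bias-gap values below are toy numbers, not the paper's measurements.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation (sample variances, ddof=1)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

# Hypothetical per-seed bias gaps (majority minus minority accuracy):
baseline = [0.12, 0.14, 0.13]
gan_aug  = [0.33, 0.35, 0.34]
d = cohens_d(gan_aug, baseline)
print(round(d, 2))  # 21.0 with these toy numbers -- a very large effect
```

With only three seeds per condition the confidence interval on d is wide, which is why the paper leans on the sheer magnitude of the effect rather than the seed count.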
cs.CV / 40 / 2603.16139
Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training
重新思考UMM视觉生成:高效图像单一预训练的掩蔽建模
Abstract
Unified Multimodal Models (UMMs) are often constrained by the pre-training of their $\textbf{visual generation components}$, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for $\textbf{UMM visual generation}$ and identify these two issues as the major bottlenecks. To address them, we propose $\textbf{Image-Only Training for UMMs (IOMM)}$, a data-efficient two-stage training framework. The first stage pre-trains the visual generative component $\textbf{exclusively}$ using abundant unlabeled image-only data, thereby removing the dependency on paired data $\textbf{for this costly phase}$. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only $\sim \textbf{1050}$ H800 GPU hours (with the vast majority, $\textbf{1000}$ hours, dedicated to the efficient $\textbf{image-only pre-training stage}$). It achieves $\textbf{0.89}$ on GenEval and $\textbf{0.55}$ on WISE--surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50). Code is available $\href{https://github.com/LINs-lab/IOMM}{https://github.com/LINs-lab/IOMM}$.
Chinese Translation
统一多模态模型(UMMs)通常受到其视觉生成组件预训练的限制,这通常依赖于低效的范式和稀缺的高质量文本-图像配对数据。本文系统分析了UMM视觉生成的预训练方案,并将这两个问题识别为主要瓶颈。为了解决这些问题,我们提出了UMMs的图像单一训练(Image-Only Training for UMMs, IOMM),这是一种数据高效的两阶段训练框架。第一阶段仅使用丰富的未标记图像数据对视觉生成组件进行预训练,从而消除了对配对数据在这一高成本阶段的依赖。第二阶段使用未标记图像和一小部分经过精心挑选的文本-图像对的混合数据对模型进行微调,从而提高指令对齐和生成质量。大量实验表明,IOMM不仅提高了训练效率,还达到了最先进的(SOTA)性能。例如,我们的IOMM-B(3.6B)模型从头开始训练,仅使用约1050个H800 GPU小时(其中绝大多数,1000小时,专用于高效的图像单一预训练阶段)。它在GenEval上达到了0.89,在WISE上达到了0.55,超越了强基线,如BAGEL-7B(0.82 & 0.55)和BLIP3-o-4B(0.84 & 0.50)。代码可在此获取:https://github.com/LINs-lab/IOMM。
cs.CV / 41 / 2603.16151
EFF-Grasp: Energy-Field Flow Matching for Physics-Aware Dexterous Grasp Generation
EFF-Grasp:基于能量场流匹配的物理感知灵巧抓取生成
Abstract
Denoising generative models have recently become the dominant paradigm for dexterous grasp generation, owing to their ability to model complex grasp distributions from large-scale data. However, existing diffusion-based methods typically formulate generation as a stochastic differential equation (SDE), which often requires many sequential denoising steps and introduces trajectory instability that can lead to physically infeasible grasps. In this paper, we propose EFF-Grasp, a novel Flow-Matching-based framework for physics-aware dexterous grasp generation. Specifically, we reformulate grasp synthesis as a deterministic ordinary differential equation (ODE) process, which enables efficient and stable generation through smooth probability flows. To further enforce physical feasibility, we introduce a training-free physics-aware energy guidance strategy. Our method defines an energy-guided target distribution using adapted explicit physical energy functions that capture key grasp constraints, and estimates the corresponding guidance term via a local Monte Carlo approximation during inference. In this way, EFF-Grasp dynamically steers the generation trajectory toward physically feasible regions without requiring additional physics-based training or simulation feedback. Extensive experiments on five benchmark datasets show that EFF-Grasp achieves superior performance in grasp quality and physical feasibility, while requiring substantially fewer sampling steps than diffusion-based baselines.
Chinese Translation
去噪生成模型最近已成为灵巧抓取生成的主流范式,因为它们能够从大规模数据中建模复杂的抓取分布。然而,现有的基于扩散的方法通常将生成过程表述为随机微分方程(SDE),这往往需要多个连续的去噪步骤,并引入轨迹不稳定性,可能导致物理上不可行的抓取。在本文中,我们提出了EFF-Grasp,一种基于流匹配的物理感知灵巧抓取生成的新框架。具体而言,我们将抓取合成重新表述为确定性常微分方程(ODE)过程,这使得通过平滑概率流实现高效且稳定的生成。为了进一步增强物理可行性,我们引入了一种无训练的物理感知能量引导策略。我们的方法使用适应的显式物理能量函数定义了一个能量引导的目标分布,这些函数捕捉了关键的抓取约束,并在推理过程中通过局部蒙特卡洛近似估计相应的引导项。通过这种方式,EFF-Grasp 动态地引导生成轨迹朝向物理可行的区域,而无需额外的基于物理的训练或仿真反馈。在五个基准数据集上的广泛实验表明,EFF-Grasp 在抓取质量和物理可行性方面表现优越,同时所需的采样步骤显著少于基于扩散的基线方法。
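The abstract describes a deterministic ODE sampler steered by a training-free, local Monte Carlo energy guidance. A toy sketch of one guided Euler step is below; the velocity field, the energy function, the perturbation scale, and the guidance weight are all stand-ins, since the paper's exact forms are not given here:

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity_field(x, t):
    # Stand-in for the learned flow-matching network; this toy field
    # simply transports samples toward the origin.
    return -x

def energy(x):
    # Stand-in physical energy: penalize leaving the unit ball
    # (a real grasp system would use penetration/contact energies).
    return np.maximum(np.linalg.norm(x, axis=-1) - 1.0, 0.0) ** 2

def guidance_term(x, sigma=0.05, n_samples=64):
    """Zeroth-order local Monte Carlo estimate of -grad E.

    Perturb x with Gaussian noise, weight perturbations by exp(-E),
    and take the weighted mean offset; no energy gradients are needed.
    """
    eps = rng.normal(scale=sigma, size=(n_samples,) + x.shape)
    w = np.exp(-energy(x + eps))
    w = w / w.sum()
    return (w[:, None] * eps).sum(axis=0) / sigma**2

def euler_step(x, t, dt, guidance_scale=0.1):
    # Deterministic ODE update, nudged toward low-energy regions.
    return x + dt * (velocity_field(x, t) + guidance_scale * guidance_term(x))

x = np.array([3.0, 0.0])          # toy "grasp parameter" starting far from feasible
for i in range(10):
    x = euler_step(x, t=i / 10, dt=0.1)
```

The weighted-perturbation estimator approximates the gradient of a smoothed energy, which is what lets the guidance stay training-free.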
cs.CV / 42 / 2603.16154
GATS: Gaussian Aware Temporal Scaling Transformer for Invariant 4D Spatio-Temporal Point Cloud Representation
GATS:高斯感知时间缩放变换器用于不变的4D时空点云表示
Abstract
Understanding 4D point cloud videos is essential for enabling intelligent agents to perceive dynamic environments. However, temporal scale bias across varying frame rates and distributional uncertainty in irregular point clouds make it highly challenging to design a unified and robust 4D backbone. Existing CNN- or Transformer-based methods are constrained either by limited receptive fields or by quadratic computational complexity, while neglecting these implicit distortions. To address this problem, we propose a novel dual-invariant framework, termed \textbf{Gaussian Aware Temporal Scaling (GATS)}, which explicitly resolves both distributional inconsistencies and temporal scale bias. The proposed \emph{Uncertainty Guided Gaussian Convolution (UGGC)} incorporates local Gaussian statistics and uncertainty-aware gating into point convolution, thereby achieving robust neighborhood aggregation under density variation, noise, and occlusion. In parallel, the \emph{Temporal Scaling Attention (TSA)} introduces a learnable scaling factor to normalize temporal distances, ensuring frame partition invariance and consistent velocity estimation across different frame rates. These two modules are complementary: temporal scaling normalizes time intervals prior to Gaussian estimation, while Gaussian modeling enhances robustness to irregular distributions. Our experiments on mainstream benchmarks MSR-Action3D (\textbf{+6.62\%} accuracy), NTU RGBD (\textbf{+1.4\%} accuracy), and Synthia4D (\textbf{+1.8\%} mIoU) demonstrate significant performance gains, offering a more efficient and principled paradigm for invariant 4D point cloud video understanding with superior accuracy, robustness, and scalability compared to Transformer-based counterparts.
Chinese Translation
理解4D点云视频对于使智能体能够感知动态环境至关重要。然而,随着帧率变化的时间尺度偏差和不规则点云中的分布不确定性,使得设计统一且稳健的4D骨干网络变得极具挑战性。现有的基于卷积神经网络(CNN)或变换器(Transformer)的方法要么受到有限感受野的限制,要么面临二次计算复杂度,同时忽视了这些隐含的失真。为了解决这个问题,我们提出了一种新颖的双不变框架,称为高斯感知时间缩放(Gaussian Aware Temporal Scaling,GATS),它明确解决了分布不一致性和时间尺度偏差问题。所提出的不确定性引导高斯卷积(Uncertainty Guided Gaussian Convolution,UGGC)将局部高斯统计和不确定性感知门控引入点卷积,从而在密度变化、噪声和遮挡情况下实现稳健的邻域聚合。同时,时间缩放注意力(Temporal Scaling Attention,TSA)引入了一个可学习的缩放因子来规范化时间距离,确保帧分区的不变性和在不同帧率下的一致速度估计。这两个模块是互补的:时间缩放在高斯估计之前规范化时间间隔,而高斯建模增强了对不规则分布的鲁棒性。我们在主流基准测试MSR-Action3D(准确率提高+6.62%)、NTU RGBD(准确率提高+1.4%)和Synthia4D(mIoU提高+1.8%)上的实验显示了显著的性能提升,提供了一种更高效且原则性的范式,用于不变的4D点云视频理解,相较于基于变换器的方法具有更优的准确性、鲁棒性和可扩展性。
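The TSA idea, a learnable scale that normalizes temporal distances so attention behaves consistently across frame rates, can be illustrated with a single-head sketch. The bias form (subtracting scaled time gaps from attention scores) and the invariance construction below are our reading of the abstract, not the authors' exact module:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_scaling_attention(q, k, v, timestamps, log_tau=0.0):
    """One attention head with a temporal-distance bias.

    A learnable scale tau = exp(log_tau) normalizes raw time gaps, so the
    same bias pattern holds whether frames arrive at 10 or 30 fps.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                       # (T, T) content scores
    gaps = np.abs(timestamps[:, None] - timestamps[None, :])
    scores = scores - gaps / np.exp(log_tau)            # nearer frames attend more
    return softmax(scores) @ v

T, d = 6, 8
rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))

# Same motion sampled at two frame rates: halving the frame period while
# shifting log_tau by log(1/2) leaves the attention output unchanged.
t_slow = np.arange(T) * 1.0
t_fast = np.arange(T) * 0.5
out_slow = temporal_scaling_attention(q, k, v, t_slow, log_tau=0.0)
out_fast = temporal_scaling_attention(q, k, v, t_fast, log_tau=np.log(0.5))
```

The equality of the two outputs is the "frame partition invariance" the abstract claims, restated at this toy scale.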
cs.CV / 43 / 2603.16159
AI-Generated Figures in Academic Publishing: Policies, Tools, and Practical Guidelines
学术出版中的AI生成图像:政策、工具与实践指南
Abstract
The rapid advancement of generative AI has introduced a new class of tools capable of producing publication-quality scientific figures, graphical abstracts, and data visualizations. However, academic publishers have responded with inconsistent and often ambiguous policies regarding AI-generated imagery. This paper surveys the current stance of major journals and publishers -- including Nature, Science, Cell Press, Elsevier, and PLOS -- on the use of AI-generated figures. We identify key concerns raised by publishers, including reproducibility, authorship attribution, and potential for visual misinformation. Drawing on practical examples from tools such as SciDraw, an AI-powered platform designed specifically for scientific illustration, we propose a set of best-practice guidelines for researchers seeking to use AI figure-generation tools in a compliant and transparent manner. Our findings suggest that, with appropriate disclosure and quality control, AI-generated figures can meaningfully accelerate scientific communication without compromising integrity.
Chinese Translation
生成性人工智能的快速发展引入了一类新的工具,能够制作出版质量的科学图像、图形摘要和数据可视化。然而,学术出版商对此的回应却存在不一致且常常模糊的政策。本文调查了主要期刊和出版商(包括《自然》、《科学》、《细胞出版社》、《爱思唯尔》和《PLOS》)对AI生成图像的使用现状。我们识别了出版商提出的关键关注点,包括可重复性、作者归属以及视觉误信息的潜在风险。借助于专为科学插图设计的AI驱动平台SciDraw等工具的实际案例,我们提出了一套最佳实践指南,供希望以合规和透明的方式使用AI图像生成工具的研究人员参考。我们的研究结果表明,适当的披露和质量控制下,AI生成的图像可以有效加速科学传播,而不损害其完整性。
cs.CV / 44 / 2603.16160
Segmentation-before-Staining Improves Structural Fidelity in Virtual IHC-to-Multiplex IF Translation
分割优于染色:提高虚拟IHC到多重IF转化的结构保真度
Abstract
Multiplex immunofluorescence (mIF) enables simultaneous single-cell quantification of multiple biomarkers within intact tissue architecture, yet its high reagent cost, multi-round staining protocols, and need for specialized imaging platforms limit routine clinical adoption. Virtual staining can synthesize mIF channels from widely available brightfield immunohistochemistry (IHC), but current translators optimize pixel-level fidelity without explicitly constraining nuclear morphology. In pathology, this gap is clinically consequential: subtle distortions in nuclei count, shape, or spatial arrangement propagate directly to quantification endpoints such as the Ki67 proliferation index, where errors of a few percent can shift treatment-relevant risk categories. This work introduces a supervision-free, architecture-agnostic conditioning strategy that injects a continuous cell probability map from a pretrained nuclei segmentation foundation model as an explicit input prior, together with a variance-preserving regularization term that matches local intensity statistics to maintain cell-level heterogeneity in synthesized fluorescence channels. The soft prior retains gradient-level boundary information lost by binary thresholding, providing a richer conditioning signal without task-specific tuning. Controlled experiments across Pix2Pix with U-Net and ResNet generators, deterministic regression U-Net, and conditional diffusion on two independent datasets demonstrate consistent improvements in nuclei count fidelity and perceptual quality, with the probability-map prior and the variance-preserving regularizer as the sole modifications. Code will be made publicly available upon acceptance.
Chinese Translation
多重免疫荧光(mIF)能够在完整的组织结构中同时对多个生物标志物进行单细胞定量,然而其高试剂成本、多轮染色协议以及对专用成像平台的需求限制了其在常规临床中的应用。虚拟染色可以从广泛可用的明场免疫组织化学(IHC)合成mIF通道,但当前的转化模型优化像素级保真度而未明确约束细胞核形态。在病理学中,这一差距在临床上具有重要意义:细胞核计数、形状或空间排列的微小扭曲直接影响到量化终点,如Ki67增殖指数,其中几个百分点的误差可能会改变与治疗相关的风险类别。本研究提出了一种无监督、架构无关的条件策略,该策略将来自预训练细胞核分割基础模型的连续细胞概率图作为显式输入先验,同时结合一个保持方差的正则化项,以匹配局部强度统计,从而保持合成荧光通道中的细胞级异质性。该软先验保留了通过二值阈值化丢失的梯度级边界信息,提供了更丰富的条件信号,而无需特定任务的调优。在两个独立数据集上,通过Pix2Pix与U-Net和ResNet生成器、确定性回归U-Net和条件扩散的对照实验表明,细胞核计数保真度和感知质量在唯一修改的情况下均有一致改善。代码将在接受后公开发布。
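The variance-preserving regularizer matches local intensity statistics between synthesized and real channels. A minimal sketch using non-overlapping patches is below; the window size and the exact statistics/weighting are illustrative assumptions, as the abstract does not specify them:

```python
import numpy as np

def local_stats(img, win=8):
    """Mean and std over non-overlapping win x win patches (a simple stand-in
    for 'local intensity statistics')."""
    h, w = img.shape
    patches = img[: h - h % win, : w - w % win].reshape(h // win, win, w // win, win)
    patches = patches.transpose(0, 2, 1, 3).reshape(-1, win * win)
    return patches.mean(axis=1), patches.std(axis=1)

def variance_preserving_loss(pred, target, win=8):
    """Penalize synthesized channels whose local contrast collapses relative
    to the real stain -- the cell-level homogenization the term guards against."""
    _, s_pred = local_stats(pred, win)
    _, s_tgt = local_stats(target, win)
    return float(np.mean((s_pred - s_tgt) ** 2))

rng = np.random.default_rng(0)
target = rng.random((32, 32))
flat = np.full((32, 32), target.mean())       # over-smoothed prediction
loss_flat = variance_preserving_loss(flat, target)
loss_self = variance_preserving_loss(target, target)
```

An over-smoothed prediction (constant intensity) is penalized even when its mean matches the target, which pixel-wise L1/L2 alone would not strongly punish.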
cs.CV / 45 / 2603.16163
STARK: Spatio-Temporal Attention for Representation of Keypoints for Continuous Sign Language Recognition
STARK:用于连续手语识别的关键点表征的时空注意力
Abstract
Continuous Sign Language Recognition (CSLR) is a crucial task for understanding the languages of deaf communities. Contemporary keypoint-based approaches typically rely on spatio-temporal encoding, where spatial interactions among keypoints are modeled using Graph Convolutional Networks or attention mechanisms, while temporal dynamics are captured using 1D convolutional networks. However, such designs often introduce a large number of parameters in both the encoder and the decoder. This paper introduces a unified spatio-temporal attention network that computes attention scores both spatially (across keypoints) and temporally (within local windows), and aggregates features to produce a local context-aware spatio-temporal representation. The proposed encoder contains approximately $70-80\%$ fewer parameters than existing state-of-the-art models while achieving comparable performance to keypoint-based methods on the Phoenix-14T dataset.
Chinese Translation
连续手语识别(CSLR)是理解聋人社区语言的一项关键任务。现代基于关键点的方法通常依赖于时空编码,其中关键点之间的空间交互使用图卷积网络(Graph Convolutional Networks)或注意力机制建模,而时间动态则使用一维卷积网络捕捉。然而,这种设计往往在编码器和解码器中引入大量的参数。本文提出了一种统一的时空注意力网络,它在空间(跨关键点)和时间(在局部窗口内)计算注意力分数,并聚合特征以生成局部上下文感知的时空表征。所提出的编码器与现有最先进的模型相比,其参数数量约减少了70-80%,同时在Phoenix-14T数据集上实现了与基于关键点的方法相当的性能。
cs.CV / 46 / 2603.16165
Homogeneous and Heterogeneous Consistency progressive Re-ranking for Visible-Infrared Person Re-identification
可见光-红外行人重识别的同质与异质一致性渐进重排序
Abstract
Visible-infrared person re-identification faces greater challenges than traditional person re-identification due to the significant differences between modalities. In particular, the differences between these modalities make effective matching even more challenging, mainly because existing re-ranking algorithms cannot simultaneously address the intra-modal variations and inter-modal discrepancy in cross-modal person re-identification. To address this problem, we propose a novel Progressive Modal Relationship Re-ranking method consisting of two modules, called heterogeneous and homogeneous consistency re-ranking (HHCR). The first module, heterogeneous consistency re-ranking, explores the relationship between the query and the gallery modalities in the test set. The second module, homogeneous consistency re-ranking, investigates the intrinsic relationship within each modality between the query and the gallery in the test set. Based on this, we propose a baseline for cross-modal person re-identification, called a consistency re-ranking inference network (CRI). We conducted comprehensive experiments demonstrating that our proposed re-ranking method is generalized, and both the re-ranking and the baseline achieve state-of-the-art performance.
Chinese Translation
可见光-红外行人重识别面临比传统行人重识别更大的挑战,因为不同模态之间存在显著差异。特别是,这些模态之间的差异使得有效匹配变得更加困难,主要是因为现有的重排序算法无法同时解决跨模态行人重识别中的模态内变异和模态间差异。为了解决这一问题,我们提出了一种新颖的渐进模态关系重排序方法,该方法由两个模块组成,称为异质和同质一致性重排序(HHCR)。第一个模块,异质一致性重排序,探讨了测试集中的查询模态与图库模态之间的关系。第二个模块,同质一致性重排序,研究了测试集中查询与图库之间每个模态内的内在关系。在此基础上,我们提出了一种跨模态行人重识别的基线方法,称为一致性重排序推理网络(CRI)。我们进行了全面的实验,证明了我们提出的重排序方法具有良好的泛化能力,并且重排序和基线方法均达到了最先进的性能。
cs.CV / 47 / 2603.16179
360° Image Perception with MLLMs: A Comprehensive Benchmark and a Training-Free Method
基于多模态大语言模型的360°图像感知:全面基准与无训练方法
Abstract
Multimodal Large Language Models (MLLMs) have shown impressive abilities in understanding and reasoning over conventional images. However, their perception of 360° images remains largely underexplored. Unlike conventional images, 360° images capture the entire surrounding environment, enabling holistic spatial reasoning but introducing challenges such as geometric distortion and complex spatial relations. To comprehensively assess MLLMs' capabilities to perceive 360° images, we introduce 360Bench, a Visual Question Answering (VQA) benchmark featuring 7K-resolution 360° images, seven representative (sub)tasks with annotations carefully curated by human annotators. Using 360Bench, we systematically evaluate seven MLLMs and six enhancement methods, revealing their shortcomings in 360° image perception. To address these challenges, we propose Free360, a training-free scene-graph-based framework for high-resolution 360° VQA. Free360 decomposes the reasoning process into modular steps, applies adaptive spherical image transformations to 360° images tailored to each step, and seamlessly integrates the resulting information into a unified graph representation for answer generation. Experiments show that Free360 consistently improves its base MLLM and provides a strong training-free solution for 360° VQA tasks. The source code and dataset will be publicly released upon acceptance.
Chinese Translation
多模态大语言模型(MLLMs)在理解和推理常规图像方面展现了令人印象深刻的能力。然而,它们对360°图像的感知仍然在很大程度上未被探索。与常规图像不同,360°图像捕捉了整个周围环境,能够进行整体空间推理,但也带来了几何畸变和复杂空间关系等挑战。为了全面评估MLLMs在感知360°图像方面的能力,我们引入了360Bench,这是一个视觉问答(VQA)基准,包含7K分辨率的360°图像,以及七个具有代表性的(子)任务,所有任务均由人工注释者精心策划。使用360Bench,我们系统地评估了七个MLLMs和六种增强方法,揭示了它们在360°图像感知方面的不足。为了解决这些挑战,我们提出了Free360,这是一个基于场景图的无训练框架,用于高分辨率的360° VQA。Free360将推理过程分解为模块化步骤,对360°图像应用适应性球面图像变换,针对每个步骤量身定制,并将生成的信息无缝整合到统一的图形表示中以生成答案。实验表明,Free360始终改善其基础MLLM,并为360° VQA任务提供了强有力的无训练解决方案。源代码和数据集将在接受后公开发布。
cs.CV / 48 / 2603.16181
KidsNanny: A Two-Stage Multimodal Content Moderation Pipeline Integrating Visual Classification, Object Detection, OCR, and Contextual Reasoning for Child Safety
KidsNanny:一个集成视觉分类、物体检测、光学字符识别(OCR)和上下文推理的两阶段多模态内容审核管道,以保障儿童安全
Abstract
We present KidsNanny, a two-stage multimodal content moderation architecture for child safety. Stage 1 combines a vision transformer (ViT) with an object detector for visual screening (11.7 ms); outputs are routed as text, not raw pixels, to Stage 2, which applies OCR and a text-based 7B language model for contextual reasoning (120 ms total pipeline). We evaluate on the UnsafeBench Sexual category (1,054 images) under two regimes: vision-only, isolating Stage 1, and multimodal, evaluating the full Stage 1+2 pipeline. Stage 1 achieves 80.27% accuracy and 85.39% F1 at 11.7 ms; vision-only baselines range from 59.01% to 77.04% accuracy. The full pipeline achieves 81.40% accuracy and 86.16% F1 at 120 ms, compared to ShieldGemma-2 (64.80% accuracy, 1,136 ms) and LlavaGuard (80.36% accuracy, 4,138 ms). To evaluate text-awareness, we filter two subsets: a text+visual subset (257 images) and a text-only subset (44 images where safety depends primarily on embedded text). On text-only images, KidsNanny achieves 100% recall (25/25 positives; small sample) and 75.76% precision; ShieldGemma-2 achieves 84% recall and 60% precision at 1,136 ms. Results suggest that dedicated OCR-based reasoning may offer recall-precision advantages on text-embedded threats at lower latency, though the small text-only subset limits generalizability. By documenting this architecture and evaluation methodology, we aim to contribute to the broader research effort on efficient multimodal content moderation for child safety.
Chinese Translation
我们提出了KidsNanny,这是一种用于儿童安全的两阶段多模态内容审核架构。第一阶段结合了视觉变换器(ViT)和物体检测器进行视觉筛选(11.7毫秒);输出作为文本而非原始像素被传递到第二阶段,第二阶段应用光学字符识别(OCR)和基于文本的7B语言模型进行上下文推理(总管道时间为120毫秒)。我们在UnsafeBench的Sexual(色情内容)类别(1,054张图像)上进行评估,采用两种模式:仅视觉模式,隔离第一阶段,以及多模态模式,评估完整的第一阶段+第二阶段管道。第一阶段的准确率为80.27%,F1值为85.39%,耗时11.7毫秒;仅视觉基准的准确率范围为59.01%到77.04%。完整管道的准确率为81.40%,F1值为86.16%,耗时120毫秒,相比之下,ShieldGemma-2的准确率为64.80%,耗时1,136毫秒,LlavaGuard的准确率为80.36%,耗时4,138毫秒。为了评估文本意识,我们过滤了两个子集:一个文本+视觉子集(257张图像)和一个仅文本子集(44张图像,其安全性主要依赖于嵌入的文本)。在仅文本图像上,KidsNanny实现了100%的召回率(25/25个阳性样本;样本量小)和75.76%的精确率;ShieldGemma-2在1,136毫秒内实现了84%的召回率和60%的精确率。结果表明,专门的基于OCR的推理可能在文本嵌入威胁上提供更高的召回率和精确率,同时延迟更低,尽管小的仅文本子集限制了推广性。通过记录这一架构和评估方法,我们旨在为儿童安全的高效多模态内容审核的更广泛研究贡献力量。
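The staged routing described in the abstract, a fast visual screen whose text output (not pixels) feeds a slower language-model reasoner, can be sketched as a tiny pipeline. The stubs, the threshold, and the scoring lambda below are hypothetical; only the per-stage latencies are taken from the abstract:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    unsafe: bool
    stage: int
    latency_ms: float

def stage1_visual(image):
    """Stand-in for the ViT + object-detector screen (the 11.7 ms stage).
    Returns (visual_label, extracted_text); both models are hypothetical stubs."""
    return image.get("visual_label", "safe"), image.get("ocr_text", "")

def moderate(image, llm_score=lambda text: 0.9 if "unsafe" in text else 0.1):
    label, text = stage1_visual(image)
    if label == "unsafe":
        return Verdict(True, stage=1, latency_ms=11.7)   # early exit, no LLM call
    if text:  # route extracted text, not raw pixels, to the 7B reasoner
        return Verdict(llm_score(text) > 0.5, stage=2, latency_ms=120.0)
    return Verdict(False, stage=1, latency_ms=11.7)

v1 = moderate({"visual_label": "unsafe"})       # caught by the visual screen
v2 = moderate({"ocr_text": "unsafe slogan"})    # caught by text reasoning
v3 = moderate({})                               # benign, never leaves stage 1
```

The early-exit structure is what keeps the average latency close to the 11.7 ms screen rather than the 120 ms full pipeline.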
cs.CV / 49 / 2603.16188
ECHO: Edge-Cloud Humanoid Orchestration for Language-to-Motion Control
ECHO:边缘-云人形机器人语言到运动控制的协调框架
Abstract
We present ECHO, an edge--cloud framework for language-driven whole-body control of humanoid robots. A cloud-hosted diffusion-based text-to-motion generator synthesizes motion references from natural language instructions, while an edge-deployed reinforcement-learning tracker executes them in closed loop on the robot. The two modules are bridged by a compact, robot-native 38-dimensional motion representation that encodes joint angles, root planar velocity, root height, and a continuous 6D root orientation per frame, eliminating inference-time retargeting from human body models and remaining directly compatible with low-level PD control. The generator adopts a 1D convolutional UNet with cross-attention conditioned on CLIP-encoded text features; at inference, DDIM sampling with 10 denoising steps and classifier-free guidance produces motion sequences in approximately one second on a cloud GPU. The tracker follows a Teacher--Student paradigm: a privileged teacher policy is distilled into a lightweight student equipped with an evidential adaptation module for sim-to-real transfer, further strengthened by morphological symmetry constraints and domain randomization. An autonomous fall recovery mechanism detects falls via onboard IMU readings and retrieves recovery trajectories from a pre-built motion library. We evaluate ECHO on a retargeted HumanML3D benchmark, where it achieves strong generation quality (FID 0.029, R-Precision Top-1 0.686) under a unified robot-domain evaluator, while maintaining high motion safety and trajectory consistency. Real-world experiments on a Unitree G1 humanoid demonstrate stable execution of diverse text commands with zero hardware fine-tuning.
Chinese Translation
我们提出了ECHO,一个用于语言驱动的人形机器人全身控制的边缘-云框架。一个基于云的扩散式文本到运动生成器从自然语言指令中合成运动参考,而一个边缘部署的强化学习跟踪器在机器人上以闭环方式执行这些运动。两个模块通过一个紧凑的、机器人原生的38维运动表示进行连接,该表示编码了关节角度、根平面速度、根高度以及每帧的连续6D根方向,从而消除了从人体模型进行推理时的重定向,并与低级PD控制直接兼容。生成器采用了一个1D卷积UNet,结合基于CLIP编码文本特征的交叉注意力;在推理时,DDIM采样通过10个去噪步骤和无分类器引导,在云GPU上大约生成一秒内的运动序列。跟踪器遵循教师-学生范式:一个特权教师策略被提炼为一个轻量级学生,配备了证据适应模块以实现模拟到现实的转移,进一步通过形态对称约束和领域随机化得到增强。一个自主的跌倒恢复机制通过机载IMU读取检测跌倒,并从预构建的运动库中检索恢复轨迹。我们在重定向的HumanML3D基准上评估ECHO,在统一的机器人领域评估器下,它实现了强大的生成质量(FID 0.029,R-Precision Top-1 0.686),同时保持高运动安全性和轨迹一致性。在Unitree G1人形机器人上的真实世界实验展示了在没有硬件微调的情况下,稳定执行多样化文本命令的能力。
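The inference recipe named in the abstract, DDIM sampling with 10 denoising steps plus classifier-free guidance, follows standard formulations. A toy sketch on a single 38-dimensional motion frame is below; the noise-schedule values and the stand-in conditional/unconditional predictions are illustrative, not the real model's:

```python
import numpy as np

def cfg_eps(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate past the conditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) via the predicted-x0 form."""
    x0_pred = (x_t - np.sqrt(1 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1 - alpha_bar_prev) * eps

rng = np.random.default_rng(0)
x = rng.normal(size=38)                      # toy 38-dim motion frame, pure noise
alpha_bars = np.linspace(0.999, 0.02, 11)    # index 0 = clean end, index 10 = noisy end
for t in range(10, 0, -1):                   # 10 denoising steps, as in the abstract
    eps_c = 0.9 * x                          # stand-in conditional noise prediction
    eps_u = 1.0 * x                          # stand-in unconditional noise prediction
    eps = cfg_eps(eps_u, eps_c, w=3.0)
    x = ddim_step(x, eps, alpha_bars[t], alpha_bars[t - 1])
```

With only 10 deterministic steps, the whole denoising loop is a handful of network evaluations, which is what makes the roughly one-second cloud-side generation plausible.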
cs.CV / 50 / 2603.16189
Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning
通过多任务多奖励强化学习实现SVG-LLMs的可靠推理
Abstract
With the rapid advancement of vision-language models, an increasing number of studies have explored their potential for SVG generation tasks. Although existing approaches improve performance by constructing large-scale SVG datasets and introducing SVG-specific tokens, they still suffer from limited generalization, redundant paths in code outputs, and a lack of explicit reasoning. In this work, we present CTRL-S (Chain-of-Thought Reinforcement Learning for SVG), a unified framework that introduces a chain-of-thought mechanism to explicitly expose the model's reasoning process during SVG generation. To support this structured reasoning, we construct SVG-Sophia, a high-quality dataset containing 145K samples across SVG code refinement, Text-to-SVG, and Image-to-SVG tasks. By training the model to generate group-level structured SVG code, CTRL-S significantly improves structural coherence and visual fidelity. Furthermore, we adopt the GRPO algorithm and design a multi-reward optimization framework, incorporating DINO, image-text similarity, format, and code efficiency rewards. Through joint multi-reward optimization and multi-task training, our approach systematically enhances overall generation capabilities. Extensive experiments show that CTRL-S outperforms existing methods, achieving higher task success rates, superior SVG code quality, and exceptional visual fidelity.
Chinese Translation
随着视觉-语言模型的快速发展,越来越多的研究探索了它们在SVG生成任务中的潜力。尽管现有方法通过构建大规模SVG数据集和引入SVG特定的标记来提高性能,但它们仍然面临有限的泛化能力、代码输出中的冗余路径以及缺乏明确推理的问题。在本研究中,我们提出了CTRL-S(Chain-of-Thought Reinforcement Learning for SVG),这是一个统一框架,引入了链式思维机制,以在SVG生成过程中明确展示模型的推理过程。为了支持这种结构化推理,我们构建了SVG-Sophia,这是一个高质量的数据集,包含145K个样本,涵盖SVG代码优化、文本到SVG和图像到SVG任务。通过训练模型生成组级结构化SVG代码,CTRL-S显著提高了结构一致性和视觉保真度。此外,我们采用了GRPO算法,并设计了一个多奖励优化框架,结合了DINO、图像-文本相似性、格式和代码效率奖励。通过联合多奖励优化和多任务训练,我们的方法系统性地增强了整体生成能力。大量实验表明,CTRL-S在任务成功率、SVG代码质量和视觉保真度方面均优于现有方法。
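GRPO's group-relative advantages and the four-part reward can be sketched compactly. The blend weights and the code-efficiency proxy (fewer path tokens scores higher) are illustrative assumptions, not the paper's values:

```python
import numpy as np

def combined_reward(dino_sim, clip_sim, format_ok, n_path_tokens,
                    weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted blend of the four reward signals named in the abstract:
    DINO similarity, image-text similarity, format validity, code efficiency."""
    efficiency = 1.0 / (1.0 + n_path_tokens / 100.0)   # shorter SVG code is better
    parts = np.array([dino_sim, clip_sim, float(format_ok), efficiency])
    return float(np.dot(weights, parts))

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: z-score each reward within its sampled group,
    so no learned value baseline is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# A group of 4 SVG samples drawn for one prompt.
rewards = [combined_reward(0.8, 0.7, True, 120),
           combined_reward(0.5, 0.6, True, 300),
           combined_reward(0.9, 0.8, False, 80),
           combined_reward(0.7, 0.7, True, 150)]
adv = grpo_advantages(rewards)
```

Because advantages are normalized within each prompt's group, samples are only rewarded for beating their siblings, which keeps the multi-reward scales from needing careful absolute calibration.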
cs.CV / 51 / 2603.16195
S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight
S-VAM:通过自我蒸馏几何和语义前瞻的快捷视频动作模型
Abstract
Video action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs, typically relying on either slow multi-step video generation or noisy one-step feature extraction, cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify the action prediction. To enable this efficient shortcut, we introduce a novel self-distillation strategy that condenses structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model (VFM) representations extracted from the diffusion model's own multi-step generated videos provide teacher targets. Lightweight decouplers, as students, learn to directly map noisy one-step features to these targets. Extensive experiments in simulation and the real world demonstrate that our S-VAM outperforms state-of-the-art methods, enabling efficient and precise manipulation in complex environments. Our project page is https://haodong-yan.github.io/S-VAM/
Chinese Translation
视频动作模型(VAMs)因其在复杂操作任务中的强大视觉前瞻能力而成为机器人学习的一个有前景的范式。然而,当前的VAMs通常依赖于缓慢的多步视频生成或噪声较大的单步特征提取,无法同时保证实时推理和高保真前瞻。为了解决这一限制,我们提出了S-VAM,一种通过单次前向传播预见一致的几何和语义表示的快捷视频动作模型。作为一个稳定的蓝图,这些预见的表示显著简化了动作预测。为了实现这一高效的快捷方式,我们引入了一种新颖的自我蒸馏策略,将多步去噪的结构化生成先验浓缩为单步推理。具体而言,从扩散模型自身的多步生成视频中提取的视觉基础模型(VFM)表示提供了教师目标。轻量级解耦器作为学生,学习将噪声单步特征直接映射到这些目标上。在仿真和现实世界中的大量实验表明,我们的S-VAM优于最先进的方法,使得在复杂环境中实现高效和精确的操作成为可能。我们的项目页面是 https://haodong-yan.github.io/S-VAM/
cs.CV / 52 / 2603.16211
Leveling3D: Leveling Up 3D Reconstruction with Feed-Forward 3D Gaussian Splatting and Geometry-Aware Generation
Leveling3D:通过前馈3D高斯点云和几何感知生成提升3D重建
Abstract
Feed-forward 3D reconstruction has revolutionized 3D vision, providing a powerful baseline for downstream tasks such as novel-view synthesis with 3D Gaussian Splatting. Previous works explore fixing corrupted rendering results with a diffusion model; however, they lack geometric awareness and fail to fill the missing areas in extrapolated views. In this work, we introduce Leveling3D, a novel pipeline that integrates feed-forward 3D reconstruction with geometrically consistent generation to enable holistic simultaneous reconstruction and generation. We propose a geometry-aware leveling adapter, a lightweight technique that aligns internal knowledge in the diffusion model with the geometry prior from the feed-forward model. The leveling adapter enables generation on the artifact areas of extrapolated novel views caused by underconstrained regions of the 3D representation. Specifically, to learn a more diverse generation distribution, we introduce a palette filtering strategy for training and a test-time masking refinement that prevents messy boundaries along the repaired regions. More importantly, the enhanced extrapolated novel views from Leveling3D can be used as inputs for feed-forward 3DGS, leveling up the 3D reconstruction. We achieve SOTA performance on public datasets, including tasks such as novel-view synthesis and depth estimation.
Chinese Translation
前馈3D重建已经彻底改变了3D视觉,为下游任务(如使用3D高斯点云的新视图合成)提供了强大的基线。之前的研究探索了使用扩散模型修复损坏的渲染结果。然而,它们缺乏几何考虑,未能填补外推视图中缺失的区域。在本研究中,我们介绍了Leveling3D,这是一种新颖的管道,将前馈3D重建与几何一致的生成相结合,以实现整体的同时重建和生成。我们提出了一种几何感知的平整适配器,这是一种轻量级技术,能够将扩散模型中的内部知识与前馈模型中的几何先验对齐。平整适配器使得在由3D表示的欠约束区域引起的外推新视图的伪影区域上进行生成成为可能。具体而言,为了学习更具多样性的分布生成,我们引入了调色板过滤策略用于训练,并在测试时进行掩模精细化,以防止修复区域沿边界出现混乱。更重要的是,来自Leveling3D的增强外推新视图可以作为前馈3DGS的输入,从而提升3D重建的效果。我们在公共数据集上实现了SOTA(最先进的)性能,包括新视图合成和深度估计等任务。
cs.CV / 53 / 2603.16233
Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors
地面反作用惯性姿态捕捉器:基于物理的人体运动捕捉方法,利用稀疏的惯性测量单元和鞋垫压力传感器
Abstract
We propose Ground Reaction Inertial Poser (GRIP), a method that reconstructs physically plausible human motion using four wearable devices. Unlike conventional IMU-only approaches, GRIP combines IMU signals with foot pressure data to capture both body dynamics and ground interactions. Furthermore, rather than relying solely on kinematic estimation, GRIP uses a digital twin of a person, in the form of a synthetic humanoid in a physics simulator, to reconstruct realistic and physically plausible motion. At its core, GRIP consists of two modules: KinematicsNet, which estimates body poses and velocities from sensor data, and DynamicsNet, which controls the humanoid in the simulator using the residual between the KinematicsNet prediction and the simulated humanoid state. To enable robust training and fair evaluation, we introduce a large-scale dataset, Pressure and Inertial Sensing for Human Motion and Interaction (PRISM), that captures diverse human motions with synchronized IMUs and insole pressure sensors. Experimental results show that GRIP outperforms existing IMU-only and IMU-pressure fusion methods across all evaluated datasets, achieving higher global pose accuracy and improved physical consistency.
Chinese Translation
我们提出了地面反作用惯性姿态捕捉器(Ground Reaction Inertial Poser, GRIP),一种利用四个可穿戴设备重建物理上合理的人体运动的方法。与传统的仅使用惯性测量单元(IMU)的方法不同,GRIP将IMU信号与足部压力数据结合,以捕捉身体动态和地面交互。此外,GRIP并不单纯依赖运动学估计,而是使用一个人的数字双胞胎,即在物理模拟器中合成的人形,来重建真实且物理上合理的运动。GRIP的核心由两个模块组成:KinematicsNet,它从传感器数据中估计身体姿态和速度;DynamicsNet,它使用KinematicsNet预测与模拟人形状态之间的残差来控制模拟器中的人形。为了实现稳健的训练和公平的评估,我们引入了一个大规模数据集,称为人类运动与交互的压力与惯性感知数据集(Pressure and Inertial Sensing for Human Motion and Interaction, PRISM),该数据集捕捉了多样化的人类运动,并同步记录了IMU和鞋垫压力传感器的数据。实验结果表明,GRIP在所有评估数据集中均优于现有的仅使用IMU和IMU-压力融合的方法,达到了更高的全局姿态准确性和改善的物理一致性。
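DynamicsNet acts on the residual between the KinematicsNet reference and the simulated humanoid's state. A 1-DoF sketch with a plain PD law standing in for the learned controller illustrates the residual-driven loop; the gains, inertia, and integrator are all toy choices:

```python
import numpy as np

def pd_torque(q_target, q, qd, kp=40.0, kd=2.0):
    """Joint-space PD control acting on the residual between the kinematic
    reference (q_target) and the simulated state (q, qd)."""
    return kp * (q_target - q) - kd * qd

# Toy single joint with unit inertia tracking a fixed reference,
# integrated with semi-implicit Euler over 5 simulated seconds.
dt, q, qd = 0.01, 0.0, 0.0
q_ref = 1.0
for _ in range(500):
    tau = pd_torque(q_ref, q, qd)
    qd += tau * dt      # velocity update from torque (unit inertia)
    q += qd * dt        # position update from new velocity
```

In the paper's setup the same residual instead conditions a learned policy inside a full physics simulator, but the closed-loop structure, reference minus simulated state in, corrective control out, is the same.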
cs.CV / 54 / 2603.16238
PureCLIP-Depth: Prompt-Free and Decoder-Free Monocular Depth Estimation within CLIP Embedding Space
PureCLIP-Depth:在CLIP嵌入空间内无提示且无解码器的单目深度估计
Abstract
We propose PureCLIP-Depth, a completely prompt-free, decoder-free Monocular Depth Estimation (MDE) model that operates entirely within the Contrastive Language-Image Pre-training (CLIP) embedding space. Unlike recent models that rely heavily on geometric features, we explore a novel approach to MDE driven by conceptual information, performing computations directly within the conceptual CLIP space. The core of our method lies in learning a direct mapping from the RGB domain to the depth domain strictly inside this embedding space. Our approach achieves state-of-the-art performance among CLIP embedding-based models on both indoor and outdoor datasets. The code used in this research is available at: https://github.com/ryutaroLF/PureCLIP-Depth
Chinese Translation
我们提出了PureCLIP-Depth,这是一种完全无提示、无解码器的单目深度估计(MDE)模型,完全在对比语言-图像预训练(CLIP)嵌入空间内操作。与最近依赖几何特征的模型不同,我们探索了一种由概念信息驱动的MDE新方法,直接在概念CLIP空间内进行计算。我们方法的核心在于学习从RGB域到深度域的直接映射,严格在该嵌入空间内进行。我们的方法在基于CLIP嵌入的模型中,在室内和室外数据集上均实现了最先进的性能。本研究中使用的代码可在以下链接获取:https://github.com/ryutaroLF/PureCLIP-Depth
cs.CV / 55 / 2603.16241
Exclusivity-Guided Mask Learning for Semi-Supervised Crowd Instance Segmentation and Counting
基于排他性引导的掩膜学习用于半监督人群实例分割与计数
Abstract
Semi-supervised crowd analysis is a prominent area of research, as unlabeled data are typically abundant and inexpensive to obtain. However, traditional point-based annotations constrain performance because individual regions are inherently ambiguous, and consequently, learning fine-grained structural semantics from sparse annotations remains an unresolved challenge. In this paper, we first propose an Exclusion-Constrained Dual-Prompt SAM (EDP-SAM), based on our Nearest Neighbor Exclusion Circle (NNEC) constraint, to generate mask supervision for current datasets. With the aim of segmenting individuals in dense scenes, we then propose Exclusivity-Guided Mask Learning (XMask), which enforces spatial separation through a discriminative mask objective. Gaussian smoothing and a differentiable center sampling strategy are utilized to improve feature continuity and training stability. Building on XMask, we present a semi-supervised crowd counting framework that uses instance mask priors as pseudo-labels, which contain richer shape information than traditional point cues. Extensive experiments on the ShanghaiTech A, UCF-QNRF, and JHU++ datasets (using 5%, 10%, and 40% labeled data) verify that our end-to-end model achieves state-of-the-art semi-supervised segmentation and counting performance, effectively bridging the gap between counting and instance segmentation within a unified framework.
Chinese Translation
半监督人群分析是一个重要的研究领域,因为未标记的数据通常丰富且获取成本低。然而,传统的基于点的标注限制了性能,因为个体区域本质上是模糊的,因此,从稀疏标注中学习细粒度的结构语义仍然是一个未解决的挑战。在本文中,我们首先提出了一种基于最近邻排除圆(Nearest Neighbor Exclusion Circle, NNEC)约束的排除约束双提示SAM(Exclusion-Constrained Dual-Prompt SAM, EDP-SAM),用于为当前数据集生成掩膜监督。为了在密集场景中分割个体,我们进一步提出了基于排他性引导的掩膜学习(Exclusivity-Guided Mask Learning, XMask),该方法通过判别性掩膜目标强制实现空间分离。我们采用高斯平滑和可微分中心采样策略来提高特征连续性和训练稳定性。在XMask的基础上,我们提出了一种半监督人群计数框架,该框架使用实例掩膜先验作为伪标签,这些伪标签包含比传统点线索更丰富的形状信息。在上海科技大学A、UCF-QNRF和JHU++数据集上进行的大量实验(使用5%、10%和40%的标记数据)验证了我们的端到端模型在半监督分割和计数性能方面达到了最先进水平,有效地弥合了计数与实例分割之间的差距。
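One plausible reading of the NNEC constraint is that each annotated head point gets an exclusion radius of half the distance to its nearest annotated neighbor, which guarantees the circles never overlap. The halving rule below is our interpretation for illustration, not the paper's stated definition:

```python
import numpy as np

def nnec_radii(points):
    """Nearest-Neighbor Exclusion Circle radii: half the distance from each
    point to its nearest neighbor, so no two circles can overlap."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)          # pairwise distances (N, N)
    np.fill_diagonal(dist, np.inf)                # ignore self-distance
    return dist.min(axis=1) / 2.0

# Three annotated head points forming a 3-4-5 triangle.
pts = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 3.0]])
radii = nnec_radii(pts)
```

Since each radius is at most half the gap to the closest neighbor, radii[i] + radii[j] never exceeds the distance between points i and j, which is the spatial-separation property the dual-prompt mask generation relies on.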
cs.CV / 56 / 2603.16243
RASLF: Representation-Aware State Space Model for Light Field Super-Resolution
RASLF:一种面向表示的状态空间模型用于光场超分辨率
Abstract
Current SSM-based light field super-resolution (LFSR) methods often fail to fully leverage the complementarity among various LF representations, leading to the loss of fine textures and geometric misalignments across views. To address these issues, we propose RASLF, a representation-aware state-space framework that explicitly models structural correlations across multiple LF representations. Specifically, we create a Progressive Geometric Refinement (PGR) block that uses a panoramic epipolar representation to explicitly encode multi-view parallax differences, thereby enabling integration across different LF representations. Furthermore, we introduce a Representation Aware Asymmetric Scanning (RAAS) mechanism that dynamically adjusts scanning paths based on the physical properties of different representation spaces, optimizing the balance between performance and efficiency through path pruning. Additionally, a Dual-Anchor Aggregation (DAA) module improves hierarchical feature flow, reducing redundant deep-layer features and prioritizing important reconstruction information. Experiments on various public benchmarks show that RASLF achieves the highest reconstruction accuracy while remaining highly computationally efficient.
Chinese Translation
当前基于状态空间模型(SSM)的光场超分辨率(LFSR)方法往往未能充分利用各种光场表示之间的互补性,导致细节纹理的丢失和视图之间的几何错位。为了解决这些问题,我们提出了RASLF,一种面向表示的状态空间框架,明确建模多个光场表示之间的结构相关性。具体而言,我们创建了一个渐进几何细化(PGR)模块,该模块使用全景视差表示显式编码多视图视差差异,从而实现不同光场表示之间的集成。此外,我们引入了一种面向表示的非对称扫描(RAAS)机制,根据不同表示空间的物理特性动态调整扫描路径,通过路径修剪优化性能与效率之间的平衡。此外,双锚聚合(DAA)模块改善了层次特征流,减少了冗余的深层特征,并优先考虑重要的重建信息。在各种公共基准测试上的实验表明,RASLF在保持高计算效率的同时,实现了最高的重建精度。
cs.CV / 57 / 2603.16245
How to Utilize Complementary Vision-Text Information for 2D Structure Understanding
如何利用互补的视觉-文本信息进行二维结构理解
Abstract
LLMs typically linearize 2D tables into 1D sequences to fit their autoregressive architecture, which weakens row-column adjacency and other layout cues. In contrast, purely visual encoders can capture spatial cues, yet often struggle to preserve exact cell text. Our analysis reveals that these two modalities provide highly distinct information to LLMs and exhibit strong complementarity. However, direct concatenation and other fusion methods yield limited gains and frequently introduce cross-modal interference. To address this issue, we propose DiVA-Former, a lightweight architecture designed to effectively integrate vision and text information. DiVA-Former leverages visual tokens as dynamic queries to distill long textual sequences into digest vectors, thereby effectively exploiting complementary vision-text information. Evaluated across 13 table benchmarks, DiVA-Former improves upon the pure-text baseline by 23.9% and achieves consistent gains over existing baselines using visual inputs, textual inputs, or a combination of both.
Chinese Translation
大型语言模型(LLMs)通常将二维表格线性化为一维序列,以适应其自回归架构,这削弱了行列邻接性和其他布局线索。相比之下,纯视觉编码器能够捕捉空间线索,但往往难以保留精确的单元格文本。我们的分析表明,这两种模态为LLMs提供了高度不同的信息,并表现出强烈的互补性。然而,直接拼接和其他融合方法的效果有限,且常常引入跨模态干扰。为了解决这一问题,我们提出了DiVA-Former,这是一种轻量级架构,旨在有效整合视觉和文本信息。DiVA-Former利用视觉标记作为动态查询,将长文本序列提炼为摘要向量,从而有效利用互补的视觉-文本信息。在13个表格基准测试中的评估结果显示,DiVA-Former在纯文本基线之上提高了23.9%,并在使用视觉输入、文本输入或两者结合的现有基线中实现了一致的提升。
cs.CV / 58 / 2603.16249
Synergizing Deep Learning and Biological Heuristics for Extreme Long-Tail White Blood Cell Classification
深度学习与生物启发式方法协同用于极端长尾白血球分类
Abstract
Automated white blood cell (WBC) classification is essential for leukemia screening but remains challenged by extreme class imbalance, long-tail distributions, and domain shift, leading deep models to overfit dominant classes and fail on rare subtypes. We propose a hybrid framework for rare-class generalization that integrates a generative Pix2Pix-based restoration module for artifact removal, a Swin Transformer ensemble with MedSigLIP contrastive embeddings for robust representation learning, and a biologically-inspired refinement step using geometric spikiness and Mahalanobis-based morphological constraints to recover out-of-distribution predictions. Evaluated on the WBCBench 2026 challenge, our method achieves a Macro-F1 of 0.77139 on the private leaderboard, demonstrating strong performance under severe imbalance and highlighting the value of incorporating biological priors into deep learning for hematological image analysis.
Chinese Translation
自动化白血球(WBC)分类对于白血病筛查至关重要,但仍面临极端类别不平衡、长尾分布和领域转移的挑战,导致深度模型在主导类别上过拟合,而在稀有亚型上表现不佳。我们提出了一种用于稀有类别泛化的混合框架,该框架集成了基于生成对抗网络Pix2Pix的恢复模块以去除伪影、使用Swin Transformer集成和MedSigLIP对比嵌入进行稳健的表示学习,以及一个生物启发的细化步骤,利用几何尖锐度和基于Mahalanobis的形态约束来恢复超出分布的预测。在WBCBench 2026挑战赛中评估,我们的方法在私人排行榜上实现了0.77139的Macro-F1,展示了在严重不平衡情况下的强大表现,并突显了将生物先验融入深度学习用于血液图像分析的价值。
cs.CV / 59 / 2603.16250
Visual Prompt Discovery via Semantic Exploration
通过语义探索发现视觉提示
Abstract
LVLMs encounter significant challenges in image understanding and visual reasoning, leading to critical perception failures. Visual prompts, which incorporate image manipulation code, have shown promising potential in mitigating these issues. While visual prompting has emerged as a promising direction, previous methods for visual prompt generation have focused on tool selection rather than diagnosing and mitigating the root causes of LVLM perception failures. Because of the opacity and unpredictability of LVLMs, optimal visual prompts must be discovered through empirical experiments, which have relied on manual human trial-and-error. We propose an automated semantic exploration framework for discovering task-wise visual prompts. Our approach enables diverse yet efficient exploration through agent-driven experiments, minimizing human intervention and avoiding the inefficiency of per-sample generation. We introduce a semantic exploration algorithm named SEVEX, which addresses two major challenges of visual prompt exploration: (1) the distraction caused by lengthy, low-level code and (2) the vast, unstructured search space of visual prompts. Specifically, our method leverages an abstract idea space as a search space, a novelty-guided selection algorithm, and a semantic feedback-driven ideation process to efficiently explore diverse visual prompts based on empirical results. We evaluate SEVEX on the BlindTest and BLINK benchmarks, which are designed to assess LVLM perception. Experimental results demonstrate that SEVEX significantly outperforms baseline methods in task accuracy, inference efficiency, exploration efficiency, and exploration stability. Notably, our framework discovers sophisticated and counter-intuitive visual strategies that go beyond conventional tool usage, offering a new paradigm for enhancing LVLM perception through automated, task-wise visual prompts.
Chinese Translation
大型视觉语言模型(LVLMs)在图像理解和视觉推理方面面临重大挑战,导致关键的感知失败。视觉提示结合了图像操作代码,显示出在缓解这些问题方面的良好潜力。尽管这一方向前景可期,但以往的视觉提示生成方法主要集中在工具选择上,而非诊断和缓解LVLM感知失败的根本原因。由于LVLM的不可预测性和不透明性,最佳视觉提示必须通过经验实验发现,这一过程依赖于人工的反复试验。我们提出了一种自动化的语义探索框架,用于发现任务导向的视觉提示。我们的方法通过代理驱动的实验实现了多样而高效的探索,最小化了人类干预,避免了逐样本生成的低效。我们引入了一种名为SEVEX的语义探索算法,旨在解决视觉提示探索的两个主要挑战:(1)冗长、低级代码造成的干扰,以及(2)视觉提示的广泛且无结构的搜索空间。具体而言,我们的方法利用抽象的思想空间作为搜索空间,采用新颖性引导的选择算法和基于语义反馈的创意过程,以根据经验结果高效探索多样的视觉提示。我们在BlindTest和BLINK基准上评估了SEVEX,这些基准旨在评估LVLM的感知能力。实验结果表明,SEVEX在任务准确性、推理效率、探索效率和探索稳定性方面显著优于基线方法。值得注意的是,我们的框架发现了复杂且反直觉的视觉策略,超越了传统工具使用,为通过自动化、任务导向的视觉提示增强LVLM感知提供了新范式。
cs.CV / 60 / 2603.16253
Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models
基础评分:可靠的视觉-语言过程奖励模型的显式视觉前提验证
Abstract
Vision-language process reward models (VL-PRMs) are increasingly used to score intermediate reasoning steps and rerank candidates under test-time scaling. However, they often function as black-box judges: a low step score may reflect a genuine reasoning mistake or simply the verifier's misperception of the image. This entanglement between perception and reasoning leads to systematic false positives (rewarding hallucinated visual premises) and false negatives (penalizing correct grounded statements), undermining both reranking and error localization. We introduce Explicit Visual Premise Verification (EVPV), a lightweight verification interface that conditions step scoring on the reliability of the visual premises a step depends on. The policy is prompted to produce a step-wise visual checklist that makes required visual facts explicit, while a constraint extractor independently derives structured visual constraints from the input image. EVPV matches checklist claims against these constraints to compute a scalar visual reliability signal, and calibrates PRM step rewards via reliability gating: rewards for visually dependent steps are attenuated when reliability is low and preserved when reliability is high. This decouples perceptual uncertainty from logical evaluation without per-step tool calls. Experiments on VisualProcessBench and six multimodal reasoning benchmarks show that EVPV improves step-level verification and consistently boosts Best-of-N reranking accuracy over strong baselines. Furthermore, injecting controlled corruption into the extracted constraints produces monotonic performance degradation, providing causal evidence that the gains arise from constraint fidelity and explicit premise verification rather than incidental prompt effects. Code is available at: https://github.com/Qwen-Applications/EVPV-PRM
Chinese Translation
视觉-语言过程奖励模型(VL-PRMs)越来越多地用于对中间推理步骤进行评分,并在测试时进行候选重排序。然而,它们往往作为黑箱评判者运作:低步骤评分可能反映出真正的推理错误,或者仅仅是验证者对图像的误解。这种感知与推理之间的纠缠导致了系统性假阳性(奖励虚构的视觉前提)和假阴性(惩罚正确的基础陈述),削弱了重排序和错误定位。我们提出了显式视觉前提验证(EVPV),这是一种轻量级验证接口,它将步骤评分的条件建立在步骤所依赖的视觉前提的可靠性之上。该策略促使生成逐步的视觉检查清单,使所需的视觉事实变得明确,同时约束提取器独立地从输入图像中推导出结构化的视觉约束。EVPV将检查清单中的声明与这些约束进行匹配,以计算标量视觉可靠性信号,并通过可靠性门控来校准PRM步骤奖励:当可靠性低时,视觉依赖步骤的奖励会被削弱,而当可靠性高时则保持不变。这在不进行逐步工具调用的情况下,将感知不确定性与逻辑评估解耦。对VisualProcessBench和六个多模态推理基准的实验表明,EVPV改善了步骤级验证,并在强基线之上持续提升了最佳N重排序的准确性。此外,将受控的干扰注入提取的约束中会导致性能单调下降,提供了因果证据,表明收益源于约束的保真度和显式前提验证,而非偶然的提示效应。代码可在以下地址获取:https://github.com/Qwen-Applications/EVPV-PRM
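The reliability-gating step described in the abstract above can be sketched in a few lines. The abstract only says that rewards for visually dependent steps are attenuated when reliability is low and preserved when it is high; the linear interpolation between a floor value and the full reward below is an illustrative choice, and the function name and `floor` parameter are assumptions:

```python
def gate_step_reward(prm_reward, visual_reliability, visually_dependent, floor=0.1):
    """Calibrate a PRM step reward by a scalar visual-reliability signal.

    Purely logical steps pass through unchanged; visually dependent steps
    are scaled between `floor` (reliability 0) and full reward (reliability 1).
    The linear gating form is an assumption, not the paper's stated rule.
    """
    if not visually_dependent:
        return prm_reward
    scale = floor + (1.0 - floor) * visual_reliability
    return prm_reward * scale

# A visually dependent step keeps its reward only when the premise checks out.
print(gate_step_reward(1.0, 1.0, True))   # 1.0
print(gate_step_reward(1.0, 0.0, True))   # 0.1
print(gate_step_reward(0.8, 0.0, False))  # 0.8
```

The key property is monotonicity in reliability, which is what lets the verifier discount hallucinated visual premises without penalizing correct logic.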
cs.CV / 61 / 2603.16256
When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition
当思考有害时:通过帧重复减轻视频推理中的视觉遗忘
Abstract
Recently, Multimodal Large Language Models (MLLMs) have demonstrated significant potential in complex visual tasks through the integration of Chain-of-Thought (CoT) reasoning. However, in Video Question Answering, extended thinking processes do not consistently yield performance gains and may even lead to degradation due to "visual anchor drifting", where models increasingly rely on self-generated text, sidelining visual inputs and causing hallucinations. While existing mitigations typically introduce specific mechanisms for the model to re-attend to visual inputs during inference, these approaches often incur prohibitive training costs and suffer from poor generalizability across different architectures. To address this, we propose FrameRepeat, an automated enhancement framework which features a lightweight repeat scoring module that enables Video-LLMs to autonomously identify which frames should be reinforced. We introduce a novel training strategy, Add-One-In (AOI), that uses MLLM output probabilities to generate supervision signals representing repeat gain. This can be used to train a frame scoring network, which guides the frame repetition behavior. Experimental results across multiple models and datasets demonstrate that FrameRepeat is both effective and generalizable in strengthening important visual cues during the reasoning process.
Chinese Translation
近年来,多模态大型语言模型(MLLMs)通过链式思维(Chain-of-Thought, CoT)推理在复杂视觉任务中展现出显著潜力。然而,在视频问答中,延长的思维过程并不总能带来性能提升,反而可能因“视觉锚漂移”(visual anchor drifting)导致性能下降,即模型越来越依赖自生成的文本,忽视视觉输入,从而引发幻觉。虽然现有的缓解措施通常引入特定机制,使模型在推理过程中重新关注视觉输入,但这些方法往往需要高昂的训练成本,并且在不同架构之间的泛化能力较差。为了解决这一问题,我们提出了FrameRepeat,这是一种自动化增强框架,具有轻量级的重复评分模块,使视频大型语言模型(Video-LLMs)能够自主识别需要强化的帧。我们引入了一种新颖的训练策略,称为Add-One-In(AOI),利用MLLM输出概率生成表示重复增益的监督信号。这可以用于训练帧评分网络,从而指导帧重复行为。多个模型和数据集的实验结果表明,FrameRepeat在推理过程中有效且具有良好的泛化能力,能够增强重要的视觉线索。
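A minimal sketch of the frame-repetition idea above: given per-frame scores (which FrameRepeat would obtain from its learned scoring network, omitted here), the top-k frames are duplicated in place so temporal order is preserved. Selecting top-k and duplicating in order is one plausible reading of the mechanism, not the paper's exact procedure:

```python
def repeat_frames(frames, scores, k=2, repeats=1):
    """Duplicate the k highest-scoring frames in place, preserving temporal
    order, so the model re-attends to the most informative visual evidence."""
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    out = []
    for i, f in enumerate(frames):
        out.append(f)
        if i in top:
            out.extend([f] * repeats)
    return out

print(repeat_frames(["f0", "f1", "f2", "f3"], [0.1, 0.9, 0.3, 0.7], k=2))
# ['f0', 'f1', 'f1', 'f2', 'f3', 'f3']
```

Because the reinforced frames stay adjacent to their originals, the augmented sequence remains a valid video input for any Video-LLM, which is consistent with the generalizability claim in the abstract.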
cs.CV / 62 / 2603.16257
Point-to-Mask: From Arbitrary Point Annotations to Mask-Level Infrared Small Target Detection
点对掩膜:从任意点注释到掩膜级红外小目标检测
Abstract
Infrared small target detection (IRSTD) methods predominantly formulate the task as pixel-level segmentation, which requires costly dense annotations and is not well suited to tiny targets with weak texture and ambiguous boundaries. To address this issue, we propose Point-to-Mask, a framework that bridges low-cost point supervision and mask-level detection through two components: a Physics-driven Adaptive Mask Generation (PAMG) module that converts point annotations into compact target masks and geometric cues, and a lightweight Radius-aware Point Regression Network (RPR-Net) that reformulates IRSTD as target center localization and effective radius regression using spatiotemporal motion cues. The two modules form a closed loop: PAMG generates pseudo masks and geometric supervision during training, while the geometric predictions of RPR-Net are fed back to PAMG for pixel-level mask recovery during inference. To facilitate systematic evaluation, we further construct SIRSTD-Pixel, a sequential dataset with refined pixel-level annotations. Experiments show that the proposed framework achieves strong pseudo-label quality, high detection accuracy, and efficient inference, approaching full-supervision performance under point-supervised settings with substantially lower annotation cost. Code and datasets will be available at: https://github.com/GaoScience/point-to-mask.
Chinese Translation
红外小目标检测(IRSTD)方法主要将任务表述为像素级分割,这需要昂贵的密集注释,并且不适合纹理微弱和边界模糊的小目标。为了解决这个问题,我们提出了点对掩膜(Point-to-Mask),这是一个通过两个组件将低成本点监督与掩膜级检测相结合的框架:一个物理驱动的自适应掩膜生成(PAMG)模块,将点注释转换为紧凑的目标掩膜和几何线索,以及一个轻量级的半径感知点回归网络(RPR-Net),该网络将红外小目标检测重新表述为目标中心定位和有效半径回归,利用时空运动线索。这两个模块形成一个闭环:PAMG在训练期间生成伪掩膜和几何监督,而RPR-Net的几何预测在推理期间反馈给PAMG以进行像素级掩膜恢复。为了便于系统评估,我们进一步构建了SIRSTD-Pixel,这是一个具有精细像素级注释的顺序数据集。实验表明,所提出的框架实现了强大的伪标签质量、高检测准确性和高效推理,在点监督设置下接近全监督性能,同时显著降低了注释成本。代码和数据集将发布在:https://github.com/GaoScience/point-to-mask。
cs.CV / 63 / 2603.16261
AW-MoE: All-Weather Mixture of Experts for Robust Multi-Modal 3D Object Detection
AW-MoE:全天气专家混合模型用于鲁棒的多模态3D目标检测
Abstract
Robust 3D object detection under adverse weather conditions is crucial for autonomous driving. However, most existing methods simply combine all weather samples for training while overlooking data distribution discrepancies across different weather scenarios, leading to performance conflicts. To address this issue, we introduce AW-MoE, a framework that innovatively integrates Mixture of Experts (MoE) into weather-robust multi-modal 3D object detection approaches. AW-MoE incorporates Image-guided Weather-aware Routing (IWR), which leverages the superior discriminability of image features across weather conditions and their invariance to scene variations for precise weather classification. Based on this accurate classification, IWR selects the top-K most relevant Weather-Specific Experts (WSE) that handle data discrepancies, ensuring optimal detection under all weather conditions. Additionally, we propose a Unified Dual-Modal Augmentation (UDMA) for synchronous LiDAR and 4D Radar dual-modal data augmentation while preserving the realism of scenes. Extensive experiments on the real-world dataset demonstrate that AW-MoE achieves ~15% improvement in adverse-weather performance over state-of-the-art methods, while incurring negligible inference overhead. Moreover, integrating AW-MoE into established baseline detectors yields performance improvements surpassing current state-of-the-art methods. These results show the effectiveness and strong scalability of our AW-MoE. We will release the code publicly at https://github.com/windlinsherlock/AW-MoE.
Chinese Translation
在恶劣天气条件下进行鲁棒的3D目标检测对自动驾驶至关重要。然而,现有大多数方法仅仅将所有天气样本结合用于训练,而忽视了不同天气场景下数据分布的差异,导致性能冲突。为了解决这一问题,我们提出了AW-MoE框架,该框架创新性地将专家混合模型(Mixture of Experts, MoE)整合到天气鲁棒的多模态3D目标检测方法中。AW-MoE结合了图像引导的天气感知路由(Image-guided Weather-aware Routing, IWR),利用图像特征在不同天气条件下的优越可区分性及其对场景变化的不变性进行精确的天气分类。基于这种准确的分类,IWR选择最相关的前K个天气特定专家(Weather-Specific Experts, WSE),以处理数据差异,确保在所有天气条件下的最佳检测。此外,我们提出了一种统一的双模态增强(Unified Dual-Modal Augmentation, UDMA),用于同步激光雷达和4D雷达的双模态数据增强,同时保持场景的真实感。在真实世界数据集上的大量实验表明,AW-MoE在恶劣天气性能上比最先进的方法提高了约15%,且推理开销微乎其微。此外,将AW-MoE集成到现有基线检测器中,性能提升超过当前最先进的方法。这些结果显示了我们AW-MoE的有效性和强大的可扩展性。我们将公开发布代码,网址为 https://github.com/windlinsherlock/AW-MoE。
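The top-K expert selection in IWR follows the standard MoE routing pattern: softmax over per-expert logits, keep the K largest, renormalize. The sketch below shows that pattern in isolation (the logits would come from the image-based weather classifier; all names here are illustrative, and the renormalization step is the common convention rather than something stated in the abstract):

```python
import math

def route_top_k(logits, k=2):
    """Softmax over weather-expert logits, keep the top-k experts, and
    renormalize their weights so the kept weights sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]           # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return {i: probs[i] / kept for i in top}

# Four weather experts; the router keeps the two most relevant.
weights = route_top_k([2.0, 1.0, 0.5, -1.0], k=2)
print(weights)
```

The detector's output would then be the weighted combination of the selected experts' predictions, which is what keeps inference overhead low: only K experts run per sample.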
cs.CV / 64 / 2603.16269
FG-SGL: Fine-Grained Semantic Guidance Learning via Motion Process Decomposition for Micro-Gesture Recognition
FG-SGL:通过运动过程分解的细粒度语义引导学习用于微手势识别
Abstract
Micro-gesture recognition (MGR) is challenging due to subtle inter-class variations. Existing methods rely on category-level supervision, which is insufficient for capturing subtle and localized motion differences. Thus, this paper proposes a Fine-Grained Semantic Guidance Learning (FG-SGL) framework that jointly integrates fine-grained and category-level semantics to guide vision-language models in perceiving local MG motions. FG-SA adopts fine-grained semantic cues to guide the learning of local motion features, while CP-A enhances the separability of MG features through category-level semantic guidance. To support fine-grained semantic guidance, this work constructs a fine-grained textual dataset with human annotations that describes the dynamic process of MGs in four refined semantic dimensions. Furthermore, a Multi-Level Contrastive Optimization strategy is designed to jointly optimize both modules in a coarse-to-fine pattern. Experiments show that FG-SGL achieves competitive performance, validating the effectiveness of fine-grained semantic guidance for MGR.
Chinese Translation
微手势识别(MGR)因类间细微变化而具有挑战性。现有方法依赖于类别级别的监督,这不足以捕捉细微和局部的运动差异。因此,本文提出了一种细粒度语义引导学习(FG-SGL)框架,该框架将细粒度和类别级别的语义共同整合,以指导视觉-语言模型感知局部微手势运动。FG-SA采用细粒度语义线索来引导局部运动特征的学习,而CP-A通过类别级别的语义引导增强微手势特征的可分离性。为了支持细粒度语义引导,本研究构建了一个带有人类注释的细粒度文本数据集,描述了微手势在四个细化语义维度中的动态过程。此外,设计了一种多层对比优化策略,以粗到细的方式共同优化两个模块。实验表明,FG-SGL实现了具有竞争力的性能,验证了细粒度语义引导在微手势识别中的有效性。
cs.CV / 65 / 2603.16271
VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment
VIGOR:面向视频几何的时间生成对齐奖励
Abstract
Video diffusion models lack explicit geometric supervision during training, leading to inconsistency artifacts such as object deformation, spatial drift, and depth violations in generated videos. To address this limitation, we propose a geometry-based reward model that leverages pretrained geometric foundation models to evaluate multi-view consistency through cross-frame reprojection error. Unlike previous geometric metrics that measure inconsistency in pixel space, where pixel intensity may introduce additional noise, our approach conducts error computation in a pointwise fashion, yielding a more physically grounded and robust error metric. Furthermore, we introduce a geometry-aware sampling strategy that filters out low-texture and non-semantic regions, focusing evaluation on geometrically meaningful areas with reliable correspondences to improve robustness. We apply this reward model to align video diffusion models through two complementary pathways: post-training of a bidirectional model via SFT or Reinforcement Learning and inference-time optimization of a Causal Video Model (e.g., Streaming video generator) via test-time scaling with our reward as a path verifier. Experimental results validate the effectiveness of our design, demonstrating that our geometry-based reward provides superior robustness compared to other variants. By enabling efficient inference-time scaling, our method offers a practical solution for enhancing open-source video models without requiring extensive computational resources for retraining.
Chinese Translation
视频扩散模型在训练过程中缺乏明确的几何监督,导致生成视频中出现不一致的伪影,如物体变形、空间漂移和深度违规。为了解决这一局限性,我们提出了一种基于几何的奖励模型,该模型利用预训练的几何基础模型,通过跨帧重投影误差评估多视图一致性。与之前在像素空间中测量不一致性的几何度量不同,我们的方法以逐点的方式进行误差计算,从而产生更具物理基础和更强鲁棒性的误差度量。此外,我们引入了一种几何感知的采样策略,过滤掉低纹理和非语义区域,专注于几何上有意义的区域,以可靠的对应关系提高评估的鲁棒性。我们将该奖励模型应用于通过两条互补路径对齐视频扩散模型:通过SFT或强化学习对双向模型进行后训练,以及通过将我们的奖励作为路径验证器进行因果视频模型(例如,流媒体视频生成器)的推理时优化。实验结果验证了我们设计的有效性,表明我们的基于几何的奖励相比其他变体提供了更优的鲁棒性。通过实现高效的推理时缩放,我们的方法为增强开源视频模型提供了一种实用的解决方案,而无需大量计算资源进行重新训练。
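The cross-frame reprojection error at the core of the reward above can be sketched with a pinhole camera and a translation-only pose change (a deliberate simplification: the real setup would use full rotations, estimated depth, and correspondences from the geometric foundation model; all names and intrinsics below are illustrative):

```python
import math

def project(p, f=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a 3D camera-frame point to pixel coordinates."""
    x, y, z = p
    return (f * x / z + cx, f * y / z + cy)

def reprojection_error(point_cam1, pixel_obs_cam2, translation):
    """Pointwise cross-frame error: move the point into the second camera's
    frame (translation-only pose for brevity), project it, and compare
    against the matched observation in that frame, in pixels."""
    p2 = tuple(a - b for a, b in zip(point_cam1, translation))
    u, v = project(p2)
    uo, vo = pixel_obs_cam2
    return math.hypot(u - uo, v - vo)

# A geometrically consistent correspondence yields zero error.
print(reprojection_error((0.0, 0.0, 5.0), (320.0, 240.0), (0.0, 0.0, 0.0)))  # 0.0
```

Averaging such pointwise errors over sampled correspondences (after the abstract's texture- and semantics-based filtering) and negating gives a scalar reward suitable for Best-of-N path verification at inference time.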
cs.CV / 66 / 2603.16284
Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation
定位-再稀疏:基于归因引导的视觉幻觉缓解稀疏策略
Abstract
Despite the significant advancements in Large Vision-Language Models (LVLMs), their tendency to generate hallucinations undermines reliability and restricts broader practical deployment. Among the hallucination mitigation methods, feature steering emerges as a promising approach that reduces erroneous outputs in LVLMs without increasing inference costs. However, current methods apply uniform feature steering across all layers. This heuristic strategy ignores inter-layer differences, potentially disrupting layers unrelated to hallucinations and ultimately leading to performance degradation on general tasks. In this paper, we propose a plug-and-play framework called Locate-Then-Sparsify for Feature Steering (LTS-FS), which controls the steering intensity according to the hallucination relevance of each layer. We first construct a synthetic dataset comprising token-level and sentence-level hallucination cases. Based on this dataset, we introduce an attribution method based on causal interventions to quantify the hallucination relevance of each layer. With the attribution scores across layers, we propose a layerwise strategy that converts these scores into feature steering intensities for individual layers, enabling more precise adjustments specifically on hallucination-relevant layers. Extensive experiments across multiple LVLMs and benchmarks demonstrate that our LTS-FS framework effectively mitigates hallucination while preserving strong performance.
Chinese Translation
尽管大型视觉语言模型(LVLMs)取得了显著进展,但它们生成幻觉的倾向削弱了可靠性,并限制了更广泛的实际应用。在幻觉缓解方法中,特征引导作为一种有前景的方法,能够在不增加推理成本的情况下减少 LVLMs 中的错误输出。然而,当前的方法在所有层上应用统一的特征引导。这种启发式策略忽略了层间差异,可能会干扰与幻觉无关的层,最终导致在一般任务上的性能下降。本文提出了一种名为定位-再稀疏的特征引导框架(Locate-Then-Sparsify for Feature Steering, LTS-FS),该框架根据每层的幻觉相关性控制引导强度。我们首先构建了一个合成数据集,包括标记级和句子级的幻觉案例。基于该数据集,我们引入了一种基于因果干预的归因方法,以量化每层的幻觉相关性。通过跨层的归因分数,我们提出了一种逐层策略,将这些分数转换为各个层的特征引导强度,从而实现对与幻觉相关层的更精确调整。在多个 LVLMs 和基准测试中的广泛实验表明,我们的 LTS-FS 框架有效地缓解了幻觉,同时保持了强大的性能。
cs.CV / 67 / 2603.16285
Persistent Story World Simulation with Continuous Character Customization
持续角色定制的持久故事世界模拟
Abstract
Story visualization has gained increasing attention in computer vision. However, current methods often fail to achieve a synergy between accurate character customization, semantic alignment, and continuous integration of new identities. To tackle this challenge, in this paper we present EverTale, a story world simulator for continuous story character customization. We first propose an All-in-One-World Character Integrator to achieve continuous character adaptation within a unified LoRA module, eliminating the need for the per-character optimization modules of previous methods. Then, we incorporate a Character Quality Gate via MLLM-as-Judge to ensure the fidelity of each character adaptation process through chain-of-thought reasoning, determining whether the model can proceed to the next character or requires additional training on the current one. We also introduce a Character-Aware Region-Focus Sampling strategy to address the identity degradation and layout conflicts in existing multi-character visual storytelling, ensuring natural multi-character generation by harmonizing local character-specific details with global scene context with higher efficiency. Experimental results show that our EverTale achieves superior performance over a wide range of competing methods on both single- and multi-character story visualization. Codes will be available.
Chinese Translation
故事可视化在计算机视觉领域受到了越来越多的关注。然而,目前的方法往往无法在准确的角色定制、语义对齐和新身份的持续整合之间实现协同。为了解决这一挑战,本文提出了EverTale,一个用于持续故事角色定制的故事世界模拟器。我们首先提出了一种一体化角色整合器(All-in-One-World Character Integrator),以在统一的LoRA模块内实现持续的角色适应,消除了以往方法中每个角色优化模块的需求。接着,我们通过MLLM-as-Judge引入了角色质量门(Character Quality Gate),以确保每个角色适应过程的保真度,通过链式推理判断模型是否可以继续下一个角色或需要对当前角色进行额外训练。我们还引入了一种角色感知区域聚焦采样策略(Character-Aware Region-Focus Sampling),以解决现有多角色视觉叙事中的身份退化和布局冲突,通过更高效地将局部角色特定细节与全局场景上下文进行协调,确保自然的多角色生成。实验结果表明,我们的EverTale在单角色和多角色故事可视化方面相较于更广泛的比较方法表现出色。代码将会公开。
cs.CV / 68 / 2603.16289
VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents
VisBrowse-Bench:多模态浏览代理的视觉原生搜索基准测试
Abstract
The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. However, existing benchmarks suffer from two limitations: insufficient evaluation of visual reasoning ability and the neglect of native visual information of web pages in the reasoning chains. To address these challenges, we introduce a new benchmark for visual-native search, VisBrowse-Bench. It contains 169 VQA instances covering multiple domains and evaluates the models' visual reasoning capabilities during the search process through multimodal evidence cross-validation via text-image retrieval and joint reasoning. These data were constructed by human experts using a multi-stage pipeline and underwent rigorous manual verification. We additionally propose an agent workflow that can effectively drive the browsing agent to actively collect and reason over visual information during the search process. We comprehensively evaluated both open-source and closed-source models in this workflow. Experimental results show that even the best-performing model, Claude-4.6-Opus, only achieves an accuracy of 47.6%, while the proprietary Deep Research model, o3-deep-research, only achieves an accuracy of 41.1%. The code and data can be accessed at: https://github.com/ZhengboZhang/VisBrowse-Bench
Chinese Translation
多模态大型语言模型(MLLMs)的快速发展使得浏览代理能够获取和推理现实世界中的多模态信息。然而,现有基准测试存在两个局限性:对视觉推理能力的评估不足,以及在推理链中忽视网页的原生视觉信息。为了解决这些挑战,我们引入了一个新的视觉原生搜索基准,VisBrowse-Bench。该基准包含169个涵盖多个领域的视觉问答(VQA)实例,并通过文本-图像检索和联合推理在搜索过程中评估模型的视觉推理能力。这些数据由人类专家使用多阶段流程构建,并经过严格的人工验证。此外,我们还提出了一种代理工作流程,可以有效驱动浏览代理在搜索过程中主动收集和推理视觉信息。我们在该工作流程中全面评估了开源和闭源模型。实验结果表明,即使是表现最好的模型Claude-4.6-Opus的准确率也仅为47.6%,而专有的Deep Research模型o3-deep-research的准确率仅为41.1%。代码和数据可在以下链接访问:https://github.com/ZhengboZhang/VisBrowse-Bench
cs.CV / 69 / 2603.16302
Micro-AU CLIP: Fine-Grained Contrastive Learning from Local Independence to Global Dependency for Micro-Expression Action Unit Detection
微型动作单元 CLIP:从局部独立到全局依赖的细粒度对比学习用于微表情动作单元检测
Abstract
Micro-expression (ME) action units (Micro-AUs) provide objective clues for fine-grained genuine emotion analysis. Most existing Micro-AU detection methods learn AU features from the whole facial image/video, which conflicts with the inherent locality of AU, resulting in insufficient perception of AU regions. In fact, each AU independently corresponds to specific localized facial muscle movements (local independence), while there is an inherent dependency between some AUs under specific emotional states (global dependency). Thus, this paper explores the effectiveness of the independence-to-dependency pattern and proposes a novel micro-AU detection framework, micro-AU CLIP, that uniquely decomposes the AU detection process into local semantic independence modeling (LSI) and global semantic dependency (GSD) modeling. In LSI, Patch Token Attention (PTA) is designed, mapping several local features within the AU region to the same feature space; In GSD, Global Dependency Attention (GDA) and Global Dependency Loss (GDLoss) are presented to model the global dependency relationships between different AUs, thereby enhancing each AU feature. Furthermore, considering CLIP's native limitations in micro-semantic alignment, a micro-AU contrastive loss (MiAUCL) is designed to learn AU features by a fine-grained alignment of visual and text features. Also, Micro-AU CLIP is effectively applied to ME recognition in an emotion-label-free way. The experimental results demonstrate that Micro-AU CLIP can fully learn fine-grained micro-AU features, achieving state-of-the-art performance.
Chinese Translation
微表情(ME)动作单元(Micro-AUs)为细粒度真实情感分析提供了客观线索。大多数现有的微型动作单元检测方法从整个面部图像/视频中学习动作单元特征,这与动作单元固有的局部性相矛盾,导致对动作单元区域的感知不足。事实上,每个动作单元独立地对应于特定的局部面部肌肉运动(局部独立性),而在特定情感状态下,一些动作单元之间存在固有的依赖关系(全局依赖性)。因此,本文探讨了从独立到依赖模式的有效性,并提出了一种新颖的微型动作单元检测框架——微型动作单元 CLIP,独特地将动作单元检测过程分解为局部语义独立建模(LSI)和全局语义依赖建模(GSD)。在 LSI 中,设计了 Patch Token Attention(PTA),将AU区域内的多个局部特征映射到同一特征空间;在 GSD 中,提出了 Global Dependency Attention(GDA)和 Global Dependency Loss(GDLoss),以建模不同动作单元之间的全局依赖关系,从而增强每个动作单元特征。此外,考虑到 CLIP 在微语义对齐方面的固有局限性,设计了一种微动作单元对比损失(MiAUCL),通过对视觉和文本特征的细粒度对齐来学习动作单元特征。同时,微型动作单元 CLIP 在无情感标签的情况下有效应用于微表情识别。实验结果表明,微型动作单元 CLIP 可以充分学习细粒度的微型动作单元特征,达到了最先进的性能。
cs.CV / 70 / 2603.16306
DriveFix: Spatio-Temporally Coherent Driving Scene Restoration
DriveFix:时空一致的驾驶场景恢复
Abstract
Recent advancements in 4D scene reconstruction, particularly those leveraging diffusion priors, have shown promise for novel view synthesis in autonomous driving. However, these methods often process frames independently or in a view-by-view manner, leading to a critical lack of spatio-temporal synergy. This results in spatial misalignment across cameras and temporal drift in sequences. We propose DriveFix, a novel multi-view restoration framework that ensures spatio-temporal coherence for driving scenes. Our approach employs an interleaved diffusion transformer architecture with specialized blocks to explicitly model both temporal dependencies and cross-camera spatial consistency. By conditioning the generation on historical context and integrating geometry-aware training losses, DriveFix enforces that the restored views adhere to a unified 3D geometry. This enables the consistent propagation of high-fidelity textures and significantly reduces artifacts. Extensive evaluations on the Waymo, nuScenes, and PandaSet datasets demonstrate that DriveFix achieves state-of-the-art performance in both reconstruction and novel view synthesis, marking a substantial step toward robust 4D world modeling for real-world deployment.
Chinese Translation
近年来,4D场景重建的进展,特别是利用扩散先验的方法,已显示出在自动驾驶中的新视角合成的潜力。然而,这些方法通常独立处理帧或以逐视角的方式进行,导致时空协同的严重缺失。这导致了摄像头之间的空间错位和序列中的时间漂移。我们提出了DriveFix,一种新颖的多视角恢复框架,确保驾驶场景的时空一致性。我们的方法采用交错扩散变换器架构,结合专门的模块,明确建模时间依赖性和跨摄像头的空间一致性。通过基于历史上下文进行生成,并整合几何感知的训练损失,DriveFix确保恢复的视图遵循统一的3D几何结构。这使得高保真纹理的一致传播成为可能,并显著减少了伪影。在Waymo、nuScenes和PandaSet数据集上的广泛评估表明,DriveFix在重建和新视角合成方面均达到了最先进的性能,标志着向现实世界部署的稳健4D世界建模迈出了重要一步。
cs.CV / 71 / 2603.16330
An Interpretable Machine Learning Framework for Non-Small Cell Lung Cancer Drug Response Analysis
一种可解释的机器学习框架用于非小细胞肺癌药物反应分析
Abstract
Lung cancer is a condition where there is abnormal growth of malignant cells that spread in an uncontrollable fashion in the lungs. Some common treatment strategies are surgery, chemotherapy, and radiation, which aren't the best options due to the heterogeneous nature of cancer. In personalized medicine, treatments are tailored according to the individual's genetic information along with lifestyle aspects. In addition, AI-based deep learning methods can analyze large sets of data to find early signs of cancer, types of tumor, and prospects of treatment. The paper focuses on the development of personalized treatment plans using specific patient data focusing primarily on the genetic profile. Multi-Omics data from Genomics of Drug Sensitivity in Cancer have been used to build a predictive model along with machine learning techniques. The value of the target variable, LN-IC50, determines how sensitive or resistant the drug response is. An XGBoost regressor is utilized to predict the drug response focusing on molecular and cellular features extracted from cancer datasets. Cross-validation and Randomized Search are performed for hyperparameter tuning to further optimize the model's predictive performance. For explanation purposes, SHAP (SHapley Additive exPlanations) was used. SHAP values measure each feature's impact on an individual prediction. Furthermore, interpreting feature relationships was performed using DeepSeek, a large language model trained to verify the biological validity of the features. Contextual explanations regarding the most important genes or pathways were provided by DeepSeek alongside the top SHAP value constituents, supporting the predictability of the model.
Chinese Translation
肺癌是一种恶性细胞在肺部以不可控制的方式异常生长的疾病。一些常见的治疗策略包括手术、化疗和放疗,但由于癌症的异质性,这些方法并不是最佳选择。在个性化医学中,治疗方案根据个体的遗传信息和生活方式进行调整。此外,基于人工智能的深度学习方法可以分析大量数据,以发现癌症的早期迹象、肿瘤类型和治疗前景。本文重点开发基于特定患者数据的个性化治疗方案,主要关注遗传特征。我们使用来自癌症药物敏感性基因组学的多组学数据构建预测模型,并结合机器学习技术。目标变量LN-IC50的值决定了药物的敏感性或抗药性。我们利用XGBoost回归器预测药物反应,重点关注从癌症数据集中提取的分子和细胞特征。通过交叉验证和随机搜索进行超参数调整,以进一步优化模型的预测性能。为了解释目的,使用了SHAP(SHapley Additive exPlanations)。SHAP值衡量每个特征对单个预测的影响。此外,使用DeepSeek进行特征关系的解释,DeepSeek是一个经过训练的大型语言模型,用于验证特征的生物学有效性。DeepSeek提供了关于最重要基因或通路的上下文解释,并与最高SHAP值的成分一起,支持模型的可预测性。
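The core idea behind the SHAP attributions described above can be sketched with an exact Shapley-value computation on a tiny stand-in model (the model, features, and baseline here are hypothetical, purely for illustration; the paper itself uses the `shap` library on an XGBoost regressor):

```python
from itertools import combinations
from math import factorial

# Hypothetical stand-in for a drug-response model: a fixed function of
# three illustrative features, predicting an LN-IC50-like score.
def model(x):
    return 2.0 * x[0] - 1.5 * x[1] + 0.5 * x[0] * x[2]

def shapley_values(f, x, baseline):
    """Exact Shapley values: each feature's average marginal contribution
    over all subsets, with absent features held at the baseline."""
    n = len(x)
    def value(subset):
        z = list(baseline)
        for i in subset:
            z[i] = x[i]
        return f(z)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in combinations(others, r):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += w * (value(S + (i,)) - value(S))
    return phi

x = [1.0, 2.0, 3.0]
base = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, base)
```

The efficiency property of Shapley values guarantees that the per-feature contributions sum to the gap between the prediction and the baseline prediction, which is what makes per-prediction attributions of this kind additive.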
cs.CV / 72 / 2603.16338
SpikeCLR: Contrastive Self-Supervised Learning for Few-Shot Event-Based Vision using Spiking Neural Networks
SpikeCLR:基于脉冲神经网络的少样本事件驱动视觉对比自监督学习
Abstract
Event-based vision sensors provide significant advantages for high-speed perception, including microsecond temporal resolution, high dynamic range, and low power consumption. When combined with Spiking Neural Networks (SNNs), they can be deployed on neuromorphic hardware, enabling energy-efficient applications on embedded systems. However, this potential is severely limited by the scarcity of large-scale labeled datasets required to effectively train such models. In this work, we introduce SpikeCLR, a contrastive self-supervised learning framework that enables SNNs to learn robust visual representations from unlabeled event data. We adapt prior frame-based methods to the spiking domain using surrogate gradient training and introduce a suite of event-specific augmentations that leverage spatial, temporal, and polarity transformations. Through extensive experiments on CIFAR10-DVS, N-Caltech101, N-MNIST, and DVS-Gesture benchmarks, we demonstrate that self-supervised pretraining with subsequent fine-tuning outperforms supervised learning in low-data regimes, achieving consistent gains in few-shot and semi-supervised settings. Our ablation studies reveal that combining spatial and temporal augmentations is critical for learning effective spatio-temporal invariances in event data. We further show that learned representations transfer across datasets, contributing to efforts for powerful event-based models in label-scarce settings.
Chinese Translation
事件驱动视觉传感器在高速感知方面具有显著优势,包括微秒级的时间分辨率、高动态范围和低功耗。当与脉冲神经网络(SNNs)结合时,它们可以部署在类脑硬件上,从而在嵌入式系统中实现高效能的应用。然而,这种潜力受到缺乏大规模标记数据集的严重限制,而这些数据集又是有效训练此类模型所必需的。在本研究中,我们提出了SpikeCLR,一个对比自监督学习框架,使得SNN能够从无标记事件数据中学习稳健的视觉表征。我们采用了先前基于帧的方法,借助代理梯度训练适应脉冲域,并引入一系列特定于事件的数据增强技术,利用空间、时间和极性变换。通过在CIFAR10-DVS、N-Caltech101、N-MNIST和DVS-Gesture基准上的广泛实验,我们证明了自监督预训练与后续微调在低数据环境中优于监督学习,在少样本和半监督设置中获得了一致的提高。我们的消融研究表明,结合空间和时间的数据增强对于学习事件数据中的有效时空不变性至关重要。我们进一步展示了学习到的表征在跨数据集间的迁移,为在标记稀缺环境中构建强大的事件驱动模型贡献了力量。
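The contrastive objective underlying frameworks like the one above can be sketched as a SimCLR-style NT-Xent loss on embedding pairs (a minimal numpy sketch of the generic loss, not the authors' spiking implementation; batch size, dimension, and temperature are illustrative):

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """SimCLR-style NT-Xent loss: z1[i] and z2[i] are embeddings of two
    augmented views of the same sample; all other pairs are negatives."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    logits = sim - sim.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
# Loss is low when the two views of each sample embed close together,
# and high when the pairing carries no information.
aligned = nt_xent(anchors, anchors + 0.01 * rng.normal(size=(8, 16)))
random_ = nt_xent(anchors, rng.normal(size=(8, 16)))
```

Event-specific augmentations (spatial, temporal, polarity) would be applied before encoding to produce the two views.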
cs.CV / 73 / 2603.16340
Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation
Iris:将真实世界先验引入扩散模型以进行单目深度估计
Abstract
In this paper, we propose \textbf{Iris}, a deterministic framework for Monocular Depth Estimation (MDE) that integrates real-world priors into the diffusion model. Conventional feed-forward methods rely on massive training data, yet still miss details. Previous diffusion-based methods leverage rich generative priors yet struggle with synthetic-to-real domain transfer. Iris, in contrast, preserves fine details, generalizes strongly from synthetic to real scenes, and remains efficient with limited training data. To this end, we introduce a two-stage Priors-to-Geometry Deterministic (PGD) schedule: the prior stage uses Spectral-Gated Distillation (SGD) to transfer low-frequency real priors while leaving high-frequency details unconstrained, and the geometry stage applies Spectral-Gated Consistency (SGC) to enforce high-frequency fidelity while refining with synthetic ground truth. The two stages share weights and are executed with a high-to-low timestep schedule. Extensive experimental results confirm that Iris achieves significant improvements in MDE performance with strong in-the-wild generalization.
Chinese Translation
本文提出了\textbf{Iris},一个用于单目深度估计(MDE)的确定性框架,该框架将真实世界的先验知识整合到扩散模型中。传统的前馈方法依赖于大量的训练数据,但仍然无法捕捉细节。以往基于扩散的方法利用丰富的生成先验,但在合成到真实领域的迁移中面临挑战。相比之下,Iris能够保留细微细节,能够在合成场景与真实场景之间强有力地泛化,并且在有限的训练数据下仍然保持高效。为此,我们引入了一个两阶段的先验到几何确定性(PGD)调度:先验阶段使用谱门控蒸馏(SGD)来转移低频真实先验,同时不限制高频细节,而几何阶段则应用谱门控一致性(SGC)来强制高频保真度,同时利用合成真实值进行细化。这两个阶段共享权重,并采用高到低的时间步调度。大量实验结果确认,Iris在MDE性能上取得了显著的提升,并在实际环境中表现出强大的泛化能力。
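The spectral-gating idea above (transferring low-frequency content while leaving high frequencies alone) can be illustrated with a plain FFT low-pass mask (a generic sketch, not the paper's SGD/SGC modules; the cutoff fraction is a hypothetical parameter):

```python
import numpy as np

def low_pass_gate(img, cutoff=0.2):
    """Keep only low spatial frequencies of a 2D map via FFT masking;
    a minimal sketch of frequency gating (cutoff is illustrative)."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = h / 2.0, w / 2.0
    radius = cutoff * min(h, w) / 2.0
    mask = ((yy - cy) ** 2 + (xx - cx) ** 2) <= radius ** 2
    return np.fft.ifft2(np.fft.ifftshift(f * mask)).real

rng = np.random.default_rng(1)
y = np.arange(32)
# A purely low-frequency pattern, corrupted with broadband noise.
smooth = np.sin(2 * np.pi * y / 32)[:, None] * np.cos(2 * np.pi * y / 32)[None, :]
noisy = smooth + 0.3 * rng.normal(size=(32, 32))
filtered = low_pass_gate(noisy)
```

Gating in the frequency domain preserves the coarse structure (here, the low-frequency pattern) while discarding most of the broadband content, which is the mechanism that lets low-frequency priors transfer without constraining high-frequency detail.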
cs.CV / 74 / 2603.16341
PKINet-v2: Towards Powerful and Efficient Poly-Kernel Remote Sensing Object Detection
PKINet-v2:朝着强大而高效的多核遥感目标检测迈进
Abstract
Object detection in remote sensing images (RSIs) is challenged by the coexistence of geometric and spatial complexity: targets may appear with diverse aspect ratios, while spanning a wide range of object sizes under varied contexts. Existing RSI backbones address the two challenges separately, either by adopting anisotropic strip kernels to model slender targets or by using isotropic large kernels to capture broader context. However, such isolated treatments lead to complementary drawbacks: the strip-only design can disrupt spatial coherence for regular-shaped objects and weaken tiny details, whereas isotropic large kernels often introduce severe background noise and geometric mismatch for slender structures. In this paper, we extend PKINet, and present a powerful and efficient backbone that jointly handles both challenges within a unified paradigm named Poly Kernel Inception Network v2 (PKINet-v2). PKINet-v2 synergizes anisotropic axial-strip convolutions with isotropic square kernels and builds a multi-scope receptive field, preserving fine-grained local textures while progressively aggregating long-range context across scales. To enable efficient deployment, we further introduce a Heterogeneous Kernel Re-parameterization (HKR) Strategy that fuses all heterogeneous branches into a single depth-wise convolution for inference, eliminating fragmented kernel launches without accuracy loss. Extensive experiments on four widely-used benchmarks, including DOTA-v1.0, DOTA-v1.5, HRSC2016, and DIOR-R, demonstrate that PKINet-v2 achieves state-of-the-art accuracy while delivering a $\textbf{3.9}\times$ FPS acceleration compared to PKINet-v1, surpassing previous remote sensing backbones in both effectiveness and efficiency.
Chinese Translation
遥感图像(RSIs)中的目标检测面临几何和空间复杂性共存的挑战:目标可能以不同的纵横比出现,同时在不同的背景下跨越广泛的目标尺寸。现有的RSI骨干网络分别解决这两个挑战,要么采用各向异性条形卷积核来建模细长目标,要么使用各向同性的大卷积核来捕捉更广泛的背景。然而,这种孤立的处理方式导致了互补的缺陷:仅使用条形设计可能会破坏规则形状物体的空间一致性并削弱细小细节,而各向同性的大卷积核往往会为细长结构引入严重的背景噪声和几何不匹配。在本文中,我们扩展了PKINet,提出了一种强大而高效的骨干网络,统一处理这两个挑战,命名为多核启发网络v2(Poly Kernel Inception Network v2,PKINet-v2)。PKINet-v2将各向异性轴向条形卷积与各向同性方形卷积核协同作用,构建了一个多范围感受野,保留了细粒度的局部纹理,同时在不同尺度上逐步聚合长距离上下文。为了实现高效部署,我们进一步引入了一种异构卷积核重参数化(Heterogeneous Kernel Re-parameterization,HKR)策略,将所有异构分支融合为单个深度卷积进行推理,消除了碎片化的卷积核启动而不损失准确性。在四个广泛使用的基准测试上进行的广泛实验,包括DOTA-v1.0、DOTA-v1.5、HRSC2016和DIOR-R,证明PKINet-v2在实现最先进的准确性同时,相较于PKINet-v1提供了$\textbf{3.9}\times$的FPS加速,在有效性和效率上均超越了以往的遥感骨干网络。
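The structural re-parameterization trick above (collapsing parallel heterogeneous branches into one kernel for inference) can be sketched for a single channel: because convolution is linear, a 3×3 square branch and a 1×3 strip branch can be fused by embedding the strip in the centre row of the square kernel and summing (a minimal single-channel sketch, not the paper's depth-wise HKR implementation):

```python
import numpy as np

def conv2d(x, k):
    """'valid' 2D cross-correlation of input x with kernel k."""
    kh, kw = k.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def fuse_kernels(square_3x3, strip_1x3):
    """Embed the 1x3 strip kernel in the centre row of the 3x3 kernel,
    so the two parallel branches collapse into a single convolution."""
    fused = square_3x3.copy()
    fused[1, :] += strip_1x3[0, :]
    return fused

rng = np.random.default_rng(2)
x = rng.normal(size=(8, 8))
k_sq = rng.normal(size=(3, 3))
k_st = rng.normal(size=(1, 3))

# Multi-branch output: the strip branch sees the same centre rows.
branch_sum = conv2d(x, k_sq) + conv2d(x[1:-1, :], k_st)
fused_out = conv2d(x, fuse_kernels(k_sq, k_st))
```

The fused kernel reproduces the multi-branch output exactly, which is why re-parameterization removes the fragmented kernel launches at inference time without any accuracy loss.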
cs.CV / 75 / 2603.16343
Learning Human-Object Interaction for 3D Human Pose Estimation from LiDAR Point Clouds
基于LiDAR点云的3D人类姿态估计中的人-物交互学习
Abstract
Understanding humans from LiDAR point clouds is one of the most critical tasks in autonomous driving due to its close relationships with pedestrian safety, yet it remains challenging in the presence of diverse human-object interactions and cluttered backgrounds. Nevertheless, existing methods largely overlook the potential of leveraging human-object interactions to build robust 3D human pose estimation frameworks. There are two major challenges that motivate the incorporation of human-object interaction. First, human-object interactions introduce spatial ambiguity between human and object points, which often leads to erroneous 3D human keypoint predictions in interaction regions. Second, there exists severe class imbalance in the number of points between interacting and non-interacting body parts, with the interaction-frequent regions such as hand and foot being sparsely observed in LiDAR data. To address these challenges, we propose a Human-Object Interaction Learning (HOIL) framework for robust 3D human pose estimation from LiDAR point clouds. To mitigate the spatial ambiguity issue, we present human-object interaction-aware contrastive learning (HOICL) that effectively enhances feature discrimination between human and object points, particularly in interaction regions. To alleviate the class imbalance issue, we introduce contact-aware part-guided pooling (CPPool) that adaptively reallocates representational capacity by compressing overrepresented points while preserving informative points from interacting body parts. In addition, we present an optional contact-based temporal refinement that refines erroneous per-frame keypoint estimates using contact cues over time. As a result, our HOIL effectively leverages human-object interaction to resolve spatial ambiguity and class imbalance in interaction regions. Codes will be released.
Chinese Translation
从LiDAR点云中理解人类是自动驾驶中最关键的任务之一,因为它与行人安全密切相关,但在多样的人-物交互和杂乱的背景下仍然具有挑战性。然而,现有方法在很大程度上忽视了利用人-物交互构建稳健的3D人类姿态估计框架的潜力。促使我们融入人-物交互的主要挑战有两个。首先,人-物交互在人体和物体点之间引入了空间模糊性,这常常导致交互区域中3D人类关键点预测的错误。其次,交互和非交互身体部位之间的点数存在严重的类别不平衡,交互频繁的区域(如手和脚)在LiDAR数据中观察到的点稀疏。为了解决这些挑战,我们提出了一种人-物交互学习(Human-Object Interaction Learning, HOIL)框架,以实现从LiDAR点云中稳健的3D人类姿态估计。为了减轻空间模糊性问题,我们提出了人-物交互感知对比学习(Human-Object Interaction-aware Contrastive Learning, HOICL),有效增强了交互区域中人和物体点之间的特征区分。为了解决类别不平衡问题,我们引入了接触感知的部件引导池化(Contact-aware Part-guided Pooling, CPPool),通过压缩过度表示的点,同时保留来自交互身体部位的信息点,来自适应地重新分配表示能力。此外,我们还提出了一种可选的基于接触的时间细化方法,利用时间上的接触线索来细化每帧关键点估计中的错误。因此,我们的HOIL有效利用人-物交互来解决交互区域中的空间模糊性和类别不平衡问题。代码将会发布。
cs.CV / 76 / 2603.16351
Automated identification of Ichneumonoidea wasps via YOLO-based deep learning: Integrating HiresCam for Explainable AI
基于YOLO深度学习的寄生蜂自动识别:结合高分辨率类激活映射的可解释人工智能
Abstract
Accurate taxonomic identification of parasitoid wasps within the superfamily Ichneumonoidea is essential for biodiversity assessment, ecological monitoring, and biological control programs. However, morphological similarity, small body size, and fine-grained interspecific variation make manual identification labor-intensive and expertise-dependent. This study proposes a deep learning-based framework for the automated identification of Ichneumonoidea wasps using a YOLO-based architecture integrated with High-Resolution Class Activation Mapping (HiResCAM) to enhance interpretability. The proposed system simultaneously identifies wasp families from high-resolution images. The dataset comprises 3556 high-resolution images of Hymenoptera specimens. The taxonomic distribution is primarily concentrated among the families Ichneumonidae (n = 786), Braconidae (n = 648), Apidae (n = 466), and Vespidae (n = 460). Extensive experiments were conducted using a curated dataset, with model performance evaluated through precision, recall, F1 score, and accuracy. The results demonstrate high accuracy of over 96 % and robust generalization across morphological variations. HiResCAM visualizations confirm that the model focuses on taxonomically relevant anatomical regions, such as wing venation, antennae segmentation, and metasomal structures, thereby validating the biological plausibility of the learned features. The integration of explainable AI techniques improves transparency and trustworthiness, making the system suitable for entomological research to accelerate biodiversity characterization in an under-described parasitoid superfamily.
Chinese Translation
对姬蜂总科(Ichneumonoidea)寄生蜂的准确分类鉴定对于生物多样性评估、生态监测和生物防治计划至关重要。然而,形态相似性、小体型和细微的种间变异使得人工识别劳动密集且依赖专业知识。本研究提出了一种基于深度学习的框架,通过集成高分辨率类激活映射(High-Resolution Class Activation Mapping, HiResCAM)的YOLO架构,实现姬蜂总科寄生蜂的自动识别,以增强可解释性。该系统能够同时从高分辨率图像中识别蜂类的科。数据集包含3556张高分辨率的膜翅目(Hymenoptera)标本图像。分类分布主要集中在姬蜂科(Ichneumonidae, n = 786)、茧蜂科(Braconidae, n = 648)、蜜蜂科(Apidae, n = 466)和胡蜂科(Vespidae, n = 460)。通过使用经过整理的数据集进行了广泛的实验,模型性能通过精确度、召回率、F1分数和准确度进行评估。结果表明,模型的准确率超过96%,并在形态变异中表现出强大的泛化能力。HiResCAM可视化结果确认模型关注与分类相关的解剖区域,如翅脉、触角分节和后体结构,从而验证了学习特征的生物学合理性。可解释人工智能技术的集成提高了透明度和可信度,使该系统适合用于昆虫学研究,以加速对描述不足的寄生蜂总科的生物多样性特征化。
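The HiResCAM attribution used above differs from Grad-CAM in one step: gradients multiply activations element-wise before the channel sum, rather than being spatially averaged into per-channel weights. A minimal sketch on synthetic feature maps (shapes and values are illustrative):

```python
import numpy as np

def hirescam(activations, gradients):
    """HiResCAM: element-wise product of gradients and activations,
    summed over channels, then ReLU (no spatial averaging of gradients)."""
    cam = np.sum(gradients * activations, axis=0)
    return np.maximum(cam, 0.0)

def gradcam(activations, gradients):
    """Grad-CAM for comparison: the spatially averaged gradient of each
    channel serves as that channel's weight."""
    weights = gradients.mean(axis=(1, 2), keepdims=True)
    return np.maximum(np.sum(weights * activations, axis=0), 0.0)

rng = np.random.default_rng(3)
acts = rng.normal(size=(4, 7, 7))   # (channels, H, W) feature maps
grads = rng.normal(size=(4, 7, 7))  # d(class score) / d(activations)
cam_hi = hirescam(acts, grads)
cam_gc = gradcam(acts, grads)
```

Skipping the spatial averaging is what gives HiResCAM its higher spatial fidelity: each location's score reflects only its own gradient-activation product, which matters when the evidence is a small region such as wing venation.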
cs.CV / 77 / 2603.16362
$D^3$-RSMDE: 40$\times$ Faster and High-Fidelity Remote Sensing Monocular Depth Estimation
$D^3$-RSMDE:40倍加速的高保真遥感单目深度估计
Abstract
Real-time, high-fidelity monocular depth estimation from remote sensing imagery is crucial for numerous applications, yet existing methods face a stark trade-off between accuracy and efficiency. Although using Vision Transformer (ViT) backbones for dense prediction is fast, they often exhibit poor perceptual quality. Conversely, diffusion models offer high fidelity but at a prohibitive computational cost. To overcome these limitations, we propose Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation ($D^3$-RSMDE), an efficient framework designed to achieve an optimal balance between speed and quality. Our framework first leverages a ViT-based module to rapidly generate a high-quality preliminary depth map, which serves as a structural prior, effectively replacing the time-consuming initial structure generation stage of diffusion models. Based on this prior, we propose a Progressive Linear Blending Refinement (PLBR) strategy, which uses a lightweight U-Net to refine the details in only a few iterations. The entire refinement step operates efficiently in a compact latent space supported by a Variational Autoencoder (VAE). Extensive experiments demonstrate that $D^3$-RSMDE achieves a notable 11.85% reduction in the Learned Perceptual Image Patch Similarity (LPIPS) perceptual metric over leading models like Marigold, while also achieving over a 40x speedup in inference and maintaining VRAM usage comparable to lightweight ViT models.
Chinese Translation
从遥感图像中实时、高保真地进行单目深度估计对于众多应用至关重要,但现有方法在准确性和效率之间面临显著的权衡。尽管使用视觉变换器(Vision Transformer, ViT)作为主干进行密集预测速度较快,但它们通常表现出较差的感知质量。相反,扩散模型提供了高保真度,但计算成本过高。为了解决这些限制,我们提出了用于遥感单目深度估计的深度细节扩散(Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation, $D^3$-RSMDE),这是一个高效框架,旨在实现速度和质量之间的最佳平衡。我们的框架首先利用基于ViT的模块快速生成高质量的初步深度图,作为结构先验,有效替代了扩散模型中耗时的初始结构生成阶段。在此先验的基础上,我们提出了一种渐进线性混合细化(Progressive Linear Blending Refinement, PLBR)策略,使用轻量级的U-Net在仅几次迭代中细化细节。整个细化步骤在由变分自编码器(Variational Autoencoder, VAE)支持的紧凑潜在空间中高效运行。大量实验表明,$D^3$-RSMDE在领先模型如Marigold上实现了11.85%的学习感知图像块相似性(Learned Perceptual Image Patch Similarity, LPIPS)感知指标的显著降低,同时在推理速度上实现了超过40倍的加速,并保持与轻量级ViT模型相当的显存使用。
cs.CV / 78 / 2603.16363
Advancing Visual Reliability: Color-Accurate Underwater Image Enhancement for Real-Time Underwater Missions
提升视觉可靠性:实时水下任务的色彩准确性水下图像增强
Abstract
Underwater image enhancement plays a crucial role in providing reliable visual information for underwater platforms, since strong absorption and scattering in water-related environments generally lead to image quality degradation. Existing high-performance methods often rely on complex architectures, which hinder deployment on underwater devices. Lightweight methods often sacrifice quality for speed and struggle to handle severely degraded underwater images. To address this limitation, we present a real-time underwater image enhancement framework with accurate color restoration. First, an Adaptive Weighted Channel Compensation module is introduced to achieve dynamic color recovery of the red and blue channels using the green channel as a reference anchor. Second, we design a Multi-branch Re-parameterized Dilated Convolution that employs multi-branch fusion during training and structural re-parameterization during inference, enabling large receptive field representation with low computational overhead. Finally, a Statistical Global Color Adjustment module is employed to optimize overall color performance based on statistical priors. Extensive experiments on eight datasets demonstrate that the proposed method achieves state-of-the-art performance across seven evaluation metrics. The model contains only 3,880 inference parameters and achieves an inference speed of 409 FPS. Our method improves the UCIQE score by 29.7% under diverse environmental conditions, and the deployment on ROV platforms and performance gains in downstream tasks further validate its superiority for real-time underwater missions.
Chinese Translation
水下图像增强在为水下平台提供可靠的视觉信息方面发挥着至关重要的作用,因为水环境中的强吸收和散射通常会导致图像质量下降。现有的高性能方法往往依赖于复杂的架构,这阻碍了在水下设备上的部署。轻量级方法通常为了速度而牺牲质量,并且难以处理严重退化的水下图像。为了解决这一限制,我们提出了一种具有准确色彩恢复的实时水下图像增强框架。首先,引入了一种自适应加权通道补偿模块,以利用绿色通道作为参考锚点,实现红色和蓝色通道的动态色彩恢复。其次,我们设计了一种多分支重新参数化膨胀卷积,该卷积在训练过程中采用多分支融合,在推理过程中进行结构重新参数化,从而以低计算开销实现大感受野表示。最后,采用统计全局色彩调整模块,基于统计先验优化整体色彩表现。在八个数据集上的大量实验表明,所提出的方法在七个评估指标上达到了最先进的性能。该模型仅包含3,880个推理参数,并实现了409 FPS的推理速度。在多种环境条件下,我们的方法将UCIQE分数提高了29.7%,在ROV平台上的部署及下游任务的性能提升进一步验证了其在实时水下任务中的优越性。
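The green-referenced channel compensation described above follows the spirit of the classic Ancuti-style formulation: boost the attenuated channel in proportion to its gap from the green channel, most strongly where it is dark and green is informative. A simplified fixed-weight sketch (the paper's module adapts the weight dynamically; `alpha` here is a hypothetical constant):

```python
import numpy as np

def compensate_channel(target, green, alpha=1.0):
    """Simplified green-referenced compensation of an absorbed channel
    (values in [0, 1]); a fixed-weight stand-in for an adaptive module."""
    gap = green.mean() - target.mean()
    return np.clip(target + alpha * gap * (1.0 - target) * green, 0.0, 1.0)

rng = np.random.default_rng(4)
green = np.clip(rng.normal(0.55, 0.10, size=(16, 16)), 0, 1)
red = np.clip(rng.normal(0.15, 0.05, size=(16, 16)), 0, 1)  # heavily absorbed
red_c = compensate_channel(red, green)
```

Green serves as the anchor because its wavelength is least attenuated underwater, so its statistics are the most reliable reference for restoring the red and blue channels.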
cs.CV / 79 / 2603.16372
InViC: Intent-aware Visual Cues for Medical Visual Question Answering
InViC:用于医学视觉问答的意图感知视觉线索
Abstract
Medical visual question answering (Med-VQA) aims to answer clinically relevant questions grounded in medical images. However, existing multimodal large language models (MLLMs) often exhibit shortcut answering, producing plausible responses by exploiting language priors or dataset biases while insufficiently attending to visual evidence. This behavior undermines clinical reliability, especially when subtle imaging findings are decisive. We propose a lightweight plug-in framework, termed Intent-aware Visual Cues (InViC), to explicitly enhance image-based answer generation in medical VQA. InViC introduces a Cue Tokens Extraction (CTE) module that distills dense visual tokens into a compact set of K question-conditioned cue tokens, which serve as structured visual intermediaries injected into the LLM decoder to promote intent-aligned visual evidence. To discourage bypassing of visual information, we further design a two-stage fine-tuning strategy with a cue-bottleneck attention mask. In Stage I, we employ an attention mask to block the LLM's direct view of raw visual features, thereby funneling all visual evidence through the cue pathway. In Stage II, standard causal attention is restored to train the LLM to jointly exploit the visual and cue tokens. We evaluate InViC on three public Med-VQA benchmarks (VQA-RAD, SLAKE, and ImageCLEF VQA-Med 2019) across multiple representative MLLMs. InViC consistently improves over zero-shot inference and standard LoRA fine-tuning, demonstrating that intent-aware visual cues with bottlenecked training is a practical and effective strategy for improving trustworthy Med-VQA.
Chinese Translation
医学视觉问答(Med-VQA)旨在回答基于医学图像的临床相关问题。然而,现有的多模态大型语言模型(MLLMs)常常表现出快捷回答的倾向,通过利用语言先验或数据集偏差生成看似合理的响应,而对视觉证据的关注不足。这种行为削弱了临床可靠性,特别是在细微影像发现至关重要时。我们提出了一种轻量级插件框架,称为意图感知视觉线索(InViC),以明确增强医学VQA中的基于图像的答案生成。InViC引入了一个线索令牌提取(Cue Tokens Extraction, CTE)模块,将密集的视觉令牌提炼为一组紧凑的K个问题条件线索令牌,这些令牌作为结构化的视觉中介注入到LLM解码器中,以促进意图对齐的视觉证据。为了防止绕过视觉信息,我们进一步设计了一个具有线索瓶颈注意力掩码的两阶段微调策略。在第一阶段,我们使用注意力掩码阻止LLM直接查看原始视觉特征,从而将所有视觉证据引导通过线索路径。在第二阶段,恢复标准因果注意力,以训练LLM共同利用视觉和线索令牌。我们在三个公共Med-VQA基准(VQA-RAD、SLAKE和ImageCLEF VQA-Med 2019)上评估InViC,涵盖多个代表性的MLLMs。InViC相较于零样本推理和标准LoRA微调始终有所提升,证明了具有瓶颈训练的意图感知视觉线索是一种切实有效的策略,能够提升Med-VQA的可信度。
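The Stage-I cue-bottleneck mask described above can be sketched as a boolean attention-allowance matrix: text queries are blocked from raw visual keys and must route through the cue tokens (token ordering, block sizes, and the exact within-block rules here are illustrative assumptions, not the paper's specification):

```python
import numpy as np

def cue_bottleneck_mask(n_vis, n_cue, n_txt):
    """Sketch of a Stage-I mask. Token order: [visual | cue | text].
    mask[i, j] == True means query token i may attend to key token j."""
    n = n_vis + n_cue + n_txt
    mask = np.zeros((n, n), dtype=bool)
    vis = slice(0, n_vis)
    cue = slice(n_vis, n_vis + n_cue)
    txt = slice(n_vis + n_cue, n)
    mask[vis, vis] = True              # visual self-attention
    mask[cue, vis] = True              # cue tokens distil visual evidence
    mask[cue, cue] = True
    mask[txt, cue] = True              # text reaches vision only via cues
    # causal attention among the text tokens themselves
    t = np.arange(n_txt)
    mask[n_vis + n_cue + t[:, None],
         n_vis + n_cue + t[None, :]] = t[:, None] >= t[None, :]
    return mask

m = cue_bottleneck_mask(n_vis=5, n_cue=2, n_txt=3)
```

Because no text row has a True entry in the visual columns, every gradient path from the answer back to the image passes through the cue tokens, which is what forces the cues to carry the visual evidence during Stage I.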
cs.CV / 80 / 2603.16373
Semantic One-Dimensional Tokenizer for Image Reconstruction and Generation
用于图像重建和生成的语义一维标记器
Abstract
Visual generative models based on latent space have achieved great success, underscoring the significance of visual tokenization. Mapping images to latents boosts efficiency and enables multimodal alignment for scaling up in downstream tasks. Existing visual tokenizers primarily map images into fixed 2D spatial grids and focus on pixel-level restoration, which hinders the capture of representations with compact global semantics. To address these issues, we propose \textbf{SemTok}, a semantic one-dimensional tokenizer that compresses 2D images into 1D discrete tokens with high-level semantics. SemTok sets a new state-of-the-art in image reconstruction, achieving superior fidelity with a remarkably compact token representation. This is achieved via a synergistic framework with three key innovations: a 2D-to-1D tokenization scheme, a semantic alignment constraint, and a two-stage generative training strategy. Building on SemTok, we construct a masked autoregressive generation framework, which yields notable improvements in downstream image generation tasks. Experiments confirm the effectiveness of our semantic 1D tokenization. Our code will be open-sourced.
Chinese Translation
基于潜在空间的视觉生成模型取得了巨大的成功,凸显了视觉标记化的重要性。将图像映射到潜在空间提高了效率,并使多模态对齐成为可能,从而在下游任务中实现规模化。现有的视觉标记器主要将图像映射到固定的二维空间网格,并专注于像素级的恢复,这限制了捕捉具有紧凑全局语义的表示。为了解决这些问题,我们提出了\textbf{SemTok},一种语义一维标记器,它将二维图像压缩为具有高级语义的一维离散标记。SemTok在图像重建方面设定了新的最先进水平,以显著紧凑的标记表示实现了卓越的保真度。这是通过一个协同框架实现的,该框架包含三个关键创新:二维到一维的标记化方案、语义对齐约束和两阶段生成训练策略。在SemTok的基础上,我们构建了一个掩蔽自回归生成框架,在下游图像生成任务中取得了显著的改进。实验验证了我们的语义一维标记化的有效性。我们的代码将开源。
cs.CV / 81 / 2603.16385
Unpaired Cross-Domain Calibration of DMSP to VIIRS Nighttime Light Data Based on CUT Network
基于CUT网络的DMSP与VIIRS夜间光数据的无配对跨域校准
Abstract
Defense Meteorological Satellite Program (DMSP-OLS) and Suomi National Polar-orbiting Partnership (SNPP-VIIRS) nighttime light (NTL) data are vital for monitoring urbanization, yet sensor incompatibilities hinder long-term analysis. This study proposes a cross-sensor calibration method using a Contrastive Unpaired Translation (CUT) network to transform DMSP data into a VIIRS-like format, correcting DMSP defects. The method employs multilayer patch-wise contrastive learning to maximize mutual information between corresponding patches, preserving content consistency while learning cross-domain similarity. Utilizing 2012-2013 overlapping data for training, the network processes 1992-2013 DMSP imagery to generate enhanced VIIRS-style raster data. Validation results demonstrate that the generated VIIRS-like data exhibits high consistency with actual VIIRS observations (R-squared greater than 0.87) and socioeconomic indicators. This approach effectively resolves cross-sensor data fusion issues and calibrates DMSP defects, providing a reliable basis for extended NTL time-series analysis.
Chinese Translation
国防气象卫星计划(DMSP-OLS)和苏米国家极轨合作伙伴(SNPP-VIIRS)夜间光(NTL)数据对于监测城市化至关重要,但传感器的不兼容性阻碍了长期分析。本研究提出了一种使用对比无配对转换(CUT)网络的跨传感器校准方法,将DMSP数据转换为类似VIIRS的格式,以纠正DMSP的缺陷。该方法采用多层补丁对比学习,最大化对应补丁之间的互信息,在学习跨域相似性的同时保持内容一致性。利用2012-2013年的重叠数据进行训练,网络处理1992-2013年的DMSP影像,生成增强的VIIRS风格栅格数据。验证结果表明,生成的类VIIRS数据与实际VIIRS观测(R平方大于0.87)及社会经济指标表现出高度一致性。这种方法有效解决了跨传感器数据融合问题,并校准了DMSP的缺陷,为扩展的NTL时间序列提供了可靠的基础。
cs.CV / 82 / 2603.16392
DermaFlux: Synthetic Skin Lesion Generation with Rectified Flows for Enhanced Image Classification
DermaFlux:基于校正流的合成皮肤病变生成以增强图像分类
Abstract
Despite recent advances in deep generative modeling, skin lesion classification systems remain constrained by the limited availability of large, diverse, and well-annotated clinical datasets, resulting in class imbalance between benign and malignant lesions and consequently reduced generalization performance. We introduce DermaFlux, a rectified flow-based text-to-image generative framework that synthesizes clinically grounded skin lesion images from natural language descriptions of dermatological attributes. Built upon Flux.1, DermaFlux is fine-tuned using parameter-efficient Low-Rank Adaptation (LoRA) on a large curated collection of publicly available clinical image datasets. We construct image-text pairs using synthetic textual captions generated by Llama 3.2, following established dermatological criteria including lesion asymmetry, border irregularity, and color variation. Extensive experiments demonstrate that DermaFlux generates diverse and clinically meaningful dermatology images that improve binary classification performance by up to 6% when augmenting small real-world datasets, and by up to 9% when classifiers are trained on DermaFlux-generated synthetic images rather than diffusion-based synthetic images. Our ImageNet-pretrained ViT fine-tuned with only 2,500 real images and 4,375 DermaFlux-generated samples achieves 78.04% binary classification accuracy and an AUC of 0.859, surpassing the next best dermatology model by 8%.
Chinese Translation
尽管深度生成建模最近取得了进展,皮肤病变分类系统仍然受到大型、多样化和良好注释的临床数据集有限可用性的限制,导致良性和恶性病变之间的类别不平衡,从而降低了泛化性能。我们提出了DermaFlux,一种基于校正流的文本到图像生成框架,该框架根据皮肤病学特征的自然语言描述合成临床基础的皮肤病变图像。DermaFlux建立在Flux.1之上,使用参数高效的低秩适应(Low-Rank Adaptation, LoRA)在大量经过精心策划的公开临床图像数据集上进行了微调。我们使用由Llama 3.2生成的合成文本标题构建图像-文本对,遵循包括病变不对称、边界不规则和颜色变化在内的既定皮肤病学标准。大量实验表明,DermaFlux生成了多样且具有临床意义的皮肤病学图像,当对小型真实世界数据集进行增强时,二分类性能提高了最多6%;而当分类器在DermaFlux生成的合成图像上训练时,相比于基于扩散的合成图像,性能提高了最多9%。我们的ImageNet预训练的ViT仅用2,500张真实图像和4,375张DermaFlux生成的样本进行了微调,达到了78.04%的二分类准确率和0.859的AUC,超过了下一个最佳皮肤病学模型8%。
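The LoRA fine-tuning used above admits a compact sketch: a frozen base weight `W` plus a low-rank update `(alpha/r) * B @ A`, which can be folded back into `W` for deployment (a generic numpy sketch of the mechanism; dimensions, `alpha`, and `r` are illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=4):
    """LoRA forward: frozen base weight W plus the scaled low-rank
    adapter path (alpha/r) * x A^T B^T, computed in parallel."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

def merge_lora(W, A, B, alpha=16, r=4):
    """Fold the adapter into the base weight for inference."""
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(5)
d_in, d_out, r = 32, 24, 4
x = rng.normal(size=(6, d_in))
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in)) * 0.01    # "A": small random init
B = np.zeros((d_out, r))                 # "B": zero init => no change at start
B_trained = rng.normal(size=(d_out, r))  # stand-in for trained values

y_adapter = lora_forward(x, W, A, B_trained, r=r)
y_merged = x @ merge_lora(W, A, B_trained, r=r).T
```

Only `A` and `B` are trained (`r * (d_in + d_out)` parameters instead of `d_in * d_out`), which is why LoRA makes fine-tuning a large generator like Flux.1 tractable on curated clinical data.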
cs.CV / 83 / 2603.16404
Near-light Photometric Stereo with Symmetric Lights
对称光源下的近光光度立体视觉
Abstract
This paper describes a linear solution method for near-light photometric stereo by exploiting symmetric light source arrangements. Unlike conventional non-convex optimization approaches, by arranging multiple sets of symmetric nearby light source pairs, our method derives a closed-form solution for surface normal and depth without requiring initialization. In addition, our method works as long as the light sources are symmetrically distributed about an arbitrary point even when the entire spatial offset is uncalibrated. Experiments showcase the shape recovery accuracy of our method, achieving comparable results to the state-of-the-art calibrated near-light photometric stereo method while significantly reducing requirements of careful depth initialization and light calibration.
Chinese Translation
本文描述了一种利用对称光源排列的近光光度立体视觉的线性解决方法。与传统的非凸优化方法不同,通过排列多组对称的近光源对,我们的方法能够推导出表面法线和深度的闭式解,而无需初始化。此外,只要光源在任意点周围对称分布,即使整个空间偏移未经过标定,我们的方法也能正常工作。实验展示了我们方法在形状恢复精度方面的准确性,取得了与最先进的标定近光光度立体视觉方法相当的结果,同时显著降低了对仔细深度初始化和光源标定的要求。
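The linear structure being exploited above is easiest to see in the classical distant-light Lambertian setting (a simplified stand-in for the paper's near-light formulation, with illustrative light directions including one symmetric pair): with known light directions `L` and observed intensities `I` at a pixel, `I = L @ (rho * n)`, so the albedo-scaled normal is a least-squares solution.

```python
import numpy as np

# Distant-light Lambertian photometric stereo as a linear system.
L = np.array([[ 0.0,  0.0, 1.0],
              [ 0.5,  0.0, 0.8],
              [-0.5,  0.0, 0.8],   # a symmetric light pair about the z-axis
              [ 0.0,  0.5, 0.8],
              [ 0.0, -0.5, 0.8]])
L = L / np.linalg.norm(L, axis=1, keepdims=True)

n_true = np.array([0.2, -0.1, 0.97])
n_true = n_true / np.linalg.norm(n_true)
rho_true = 0.8                       # albedo
I = L @ (rho_true * n_true)          # noiseless Lambertian intensities

g, *_ = np.linalg.lstsq(L, I, rcond=None)  # g = rho * n
rho_est = np.linalg.norm(g)
n_est = g / rho_est
```

In the near-light case the effective light direction varies per pixel and the system becomes nonlinear in depth; the paper's contribution is that symmetric source pairs cancel the nonlinear terms and restore a closed-form linear solve.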
cs.CV / 84 / 2603.16421
HGP-Mamba: Integrating Histology and Generated Protein Features for Mamba-based Multimodal Survival Risk Prediction
HGP-Mamba:整合组织学与生成蛋白特征的基于Mamba的多模态生存风险预测
Abstract
Recent advances in multimodal learning have significantly improved cancer survival risk prediction. However, the joint prognostic potential of protein markers and histopathology images remains underexplored, largely due to the high cost and limited availability of protein expression profiling. To address this challenge, we propose HGP-Mamba, a Mamba-based multimodal framework that efficiently integrates histological with generated protein features for survival risk prediction. Specifically, we introduce a protein feature extractor (PFE) that leverages pretrained foundation models to derive high-throughput protein embeddings directly from Whole Slide Images (WSIs), enabling data-efficient incorporation of molecular information. Together with histology embeddings that capture morphological patterns, we further introduce the Local Interaction-aware Mamba (LiAM) for fine-grained feature interaction and the Global Interaction-enhanced Mamba (GiEM) to promote holistic modality fusion at the slide level, thus capture complex cross-modal dependencies. Experiments on four public cancer datasets demonstrate that HGP-Mamba achieves state-of-the-art performance while maintaining superior computational efficiency compared with existing methods. Our source code is publicly available at
this https URL.
Chinese Translation
近年来,多模态学习的进展显著提高了癌症生存风险预测的准确性。然而,蛋白标志物与组织病理图像的联合预后潜力仍未得到充分探索,主要是由于蛋白表达谱的高成本和有限的可用性。为了解决这一挑战,我们提出了HGP-Mamba,一种基于Mamba的多模态框架,能够高效整合组织学与生成的蛋白特征以进行生存风险预测。具体而言,我们引入了一种蛋白特征提取器(PFE),利用预训练的基础模型直接从全切片图像(WSIs)中提取高通量蛋白嵌入,从而实现分子信息的数据高效整合。结合捕捉形态模式的组织学嵌入,我们进一步引入了局部交互感知Mamba(LiAM)用于细粒度特征交互,以及全局交互增强Mamba(GiEM)以促进在切片级别的整体模态融合,从而捕捉复杂的跨模态依赖关系。在四个公共癌症数据集上的实验表明,HGP-Mamba在保持优越计算效率的同时,达到了最先进的性能。我们的源代码已公开可用,地址为
此链接。
cs.CV / 85 / 2603.16423
SF-Mamba: Rethinking State Space Model for Vision
SF-Mamba:重新思考视觉状态空间模型
Abstract
The realm of Mamba for vision has advanced in recent years in pursuit of alternatives to Vision Transformers (ViTs), which suffer from quadratic complexity. While the recurrent scanning mechanism of Mamba offers computational efficiency, it inherently limits non-causal interactions between image patches. Prior works have attempted to address this limitation through various multi-scan strategies; however, these approaches suffer from inefficiencies due to suboptimal scan designs and frequent data rearrangement. Moreover, Mamba exhibits relatively slow computational speed under short token lengths, commonly used in visual tasks. In pursuit of a truly efficient vision encoder, we rethink the scan operation for vision and the computational efficiency of Mamba. To this end, we propose SF-Mamba, a novel visual Mamba with two key proposals: auxiliary patch swapping for encoding bidirectional information flow under an unidirectional scan and batch folding with periodic state reset for advanced GPU parallelism. Extensive experiments on image classification, object detection, and instance and semantic segmentation consistently demonstrate that our proposed SF-Mamba significantly outperforms state-of-the-art baselines while improving throughput across different model sizes. We will release the source code after publication.
Chinese Translation
近年来,Mamba在视觉领域的应用得到了发展,以寻求替代受限于二次复杂度的视觉变换器(Vision Transformers, ViTs)。虽然Mamba的递归扫描机制提供了计算效率,但它在本质上限制了图像块之间的非因果交互。之前的研究尝试通过各种多次扫描策略来解决这一限制;然而,这些方法由于次优的扫描设计和频繁的数据重组而效率低下。此外,在视觉任务中常用的短令牌长度下,Mamba的计算速度相对较慢。为了追求真正高效的视觉编码器,我们重新思考了视觉的扫描操作和Mamba的计算效率。为此,我们提出了SF-Mamba,一种新颖的视觉Mamba,具有两个关键提案:在单向扫描下进行辅助块交换以编码双向信息流,以及通过周期性状态重置进行批折叠以实现更高级的GPU并行性。在图像分类、目标检测以及实例和语义分割的广泛实验中,我们提出的SF-Mamba在不同模型尺寸下显著超越了最先进的基线,同时提高了吞吐量。我们将在发表后发布源代码。
cs.CV / 86 / 2603.16426
3D Fourier-based Global Feature Extraction for Hyperspectral Image Classification
基于3D傅里叶的高光谱图像分类全局特征提取
Abstract
Hyperspectral image classification (HSIC) has been significantly advanced by deep learning methods that exploit rich spatial-spectral correlations. However, existing approaches still face fundamental limitations: transformer-based models suffer from poor scalability due to the quadratic complexity of self-attention, while recent Fourier transform-based methods typically rely on 2D spatial FFTs and largely ignore critical inter-band spectral dependencies inherent to hyperspectral data. To address these challenges, we propose Hybrid GFNet (HGFNet), a novel architecture that integrates localized 3D convolutional feature extraction with frequency-domain global filtering via GFNet-style blocks for efficient and robust spatial-spectral representation learning. HGFNet introduces three complementary frequency transforms tailored to hyperspectral imagery: Spectral Fourier Transform (a 1D FFT along the spectral axis), Spatial Fourier Transform (a 2D FFT over spatial dimensions), and Spectral-Spatial Fourier Transform (a 3D FFT jointly over spectral and spatial dimensions), enabling comprehensive and high-dimensional frequency modeling. The 3D convolutional layers capture fine-grained local spatial-spectral structures, while the Fourier-based global filtering modules efficiently model long-range dependencies and suppress noise. To further mitigate the severe class imbalance commonly observed in HSIC, HGFNet incorporates an Adaptive Focal Loss (AFL) that dynamically adjusts class-wise focusing and weighting, improving discrimination for underrepresented classes.
Chinese Translation
深度学习方法通过利用丰富的空间-光谱相关性,显著推动了高光谱图像分类(HSIC)的发展。然而,现有方法仍面临基本的局限性:基于变换器的模型由于自注意力的二次复杂性而在可扩展性方面表现不佳,而最近的基于傅里叶变换的方法通常依赖于二维空间快速傅里叶变换(2D FFT),并在很大程度上忽视了高光谱数据固有的关键波段间光谱依赖性。为了解决这些挑战,我们提出了混合GFNet(Hybrid GFNet,HGFNet),这是一种新颖的架构,结合了局部3D卷积特征提取与通过GFNet风格模块的频域全局滤波,实现高效且稳健的空间-光谱表示学习。HGFNet引入了三种针对高光谱图像的互补频率变换:光谱傅里叶变换(沿光谱轴的1D FFT)、空间傅里叶变换(在空间维度上的2D FFT)和光谱-空间傅里叶变换(在光谱和空间维度上的3D FFT),使得全面且高维的频率建模成为可能。3D卷积层捕捉细粒度的局部空间-光谱结构,而基于傅里叶的全局滤波模块则高效地建模长程依赖性并抑制噪声。为了进一步缓解在HSIC中常见的严重类别不平衡问题,HGFNet结合了自适应聚焦损失(Adaptive Focal Loss,AFL),动态调整类别的聚焦和加权,提高了对代表性不足类别的区分能力。
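The focal loss underlying the adaptive variant above down-weights well-classified examples by a factor of `(1 - p)^gamma`, on top of per-class weights. A minimal fixed-weight sketch (the paper's AFL adapts focusing and weighting dynamically; the probabilities and weights here are illustrative):

```python
import numpy as np

def focal_loss(probs, labels, class_weights, gamma=2.0):
    """Focal loss with fixed per-class weights: easy examples are
    down-weighted by (1 - p)^gamma; class_weights counteract imbalance."""
    p = probs[np.arange(len(labels)), labels]
    w = class_weights[labels]
    return np.mean(-w * (1.0 - p) ** gamma * np.log(p))

def cross_entropy(probs, labels):
    p = probs[np.arange(len(labels)), labels]
    return np.mean(-np.log(p))

# Two samples of the same class: one easy (p=0.95), one hard (p=0.30).
probs = np.array([[0.95, 0.03, 0.02],
                  [0.30, 0.40, 0.30]])
labels = np.array([0, 0])
weights = np.array([1.0, 2.0, 4.0])   # inverse-frequency-style, illustrative

fl = focal_loss(probs, labels, weights)
ce = cross_entropy(probs, labels)
```

With `gamma = 2`, the easy example contributes almost nothing to the focal loss, so training signal concentrates on the hard (often underrepresented) classes.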
cs.CV / 87 / 2603.16427
Cross-modal learning for plankton recognition
跨模态学习用于浮游生物识别
Abstract
This paper considers self-supervised cross-modal coordination as a strategy enabling utilization of multiple modalities and large volumes of unlabeled plankton data to build models for plankton recognition. Automated imaging instruments facilitate the continuous collection of plankton image data on a large scale. Current methods for automatic plankton image recognition rely primarily on supervised approaches, which require labeled training sets that are labor-intensive to collect. On the other hand, some modern plankton imaging instruments complement image information with optical measurement data, such as scatter and fluorescence profiles, which currently are not widely utilized in plankton recognition. In this work, we explore the possibility of using such measurement data to guide the learning process without requiring manual labeling. Inspired by the concepts behind Contrastive Language-Image Pre-training, we train encoders for both modalities using only binary supervisory information indicating whether a given image and profile originate from the same particle or from different particles. For plankton recognition, we employ a small labeled gallery of known plankton species combined with a $k$-NN classifier. This approach yields a recognition model that is inherently multimodal, i.e., capable of utilizing information extracted from both image and profile data. We demonstrate that the proposed method achieves high recognition accuracy while requiring only a minimal number of labeled images. Furthermore, we show that the approach outperforms an image-only self-supervised baseline. Code available at https://github.com/Jookare/cross-modal-plankton.
Chinese Translation
本文考虑自监督跨模态协调作为一种策略,利用多种模态和大量未标记的浮游生物数据构建浮游生物识别模型。自动成像仪器促进了大规模持续收集浮游生物图像数据。目前,自动浮游生物图像识别的方法主要依赖于监督方法,这些方法需要标记的训练集,而标记过程劳动密集。另一方面,一些现代浮游生物成像仪器将图像信息与光学测量数据(如散射和荧光特征)相结合,但这些数据目前在浮游生物识别中尚未得到广泛利用。在本研究中,我们探索了使用这些测量数据来指导学习过程的可能性,而无需手动标记。受对比语言-图像预训练(Contrastive Language-Image Pre-training)背后概念的启发,我们仅使用二元监督信息训练两种模态的编码器,该信息指示给定图像和特征是否来自同一颗粒或不同颗粒。对于浮游生物识别,我们采用一小部分已知浮游生物物种的标记图库,并结合 $k$-最近邻(k-NN)分类器。这种方法产生的识别模型本质上是多模态的,即能够利用从图像和特征数据中提取的信息。我们证明了所提方法在仅需极少量标记图像的情况下实现了高识别准确率。此外,我们还展示了该方法优于仅基于图像的自监督基线。代码可在 https://github.com/Jookare/cross-modal-plankton 获取。
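The CLIP-inspired recipe in this abstract (binary "same particle?" supervision across modalities, then a labeled gallery with a $k$-NN classifier) can be sketched in a few lines. This is an illustrative NumPy toy under assumed names (`cross_modal_binary_loss`, `knn_classify`), not the paper's implementation, which trains deep encoders for both modalities:

```python
import numpy as np

def cross_modal_binary_loss(img_emb, prof_emb, same, margin=1.0):
    """Contrastive loss with only binary 'same particle?' supervision.

    img_emb, prof_emb: (N, D) embeddings from the two modality encoders.
    same: (N,) 1 if the image/profile pair comes from the same particle.
    Positive pairs are pulled together; negative pairs pushed past a margin.
    """
    d = np.linalg.norm(img_emb - prof_emb, axis=1)          # pairwise distance
    pos = same * d ** 2                                      # pull together
    neg = (1 - same) * np.maximum(margin - d, 0.0) ** 2      # push apart
    return float(np.mean(pos + neg))

def knn_classify(query_emb, gallery_emb, gallery_labels, k=3):
    """Label a query by majority vote over its k nearest gallery embeddings."""
    dists = np.linalg.norm(gallery_emb - query_emb, axis=1)
    nearest = np.argsort(dists)[:k]
    values, counts = np.unique(gallery_labels[nearest], return_counts=True)
    return values[np.argmax(counts)]

# Toy check: identical pairs give zero loss; a query near species-0 exemplars
# in the small labeled gallery is voted into class 0.
a = np.array([[1.0, 0.0], [0.0, 1.0]])
loss_pos = cross_modal_binary_loss(a, a, same=np.array([1, 1]))
gallery = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
labels = np.array([0, 0, 1])
pred = knn_classify(np.array([0.95, 0.05]), gallery, labels, k=3)
```

The same gallery lookup works whether the query embedding came from an image, a profile, or both, which is what makes the resulting recognizer inherently multimodal.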
cs.CV / 88 / 2603.16432
IRIS: A Real-World Benchmark for Inverse Recovery and Identification of Physical Dynamic Systems from Monocular Video
IRIS:从单目视频中逆向恢复和识别物理动态系统的真实世界基准
Abstract
Unsupervised physical parameter estimation from video lacks a common benchmark: existing methods evaluate on non-overlapping synthetic data, the sole real-world dataset is restricted to single-body systems, and no established protocol addresses governing-equation identification. This work introduces IRIS, a high-fidelity benchmark comprising 220 real-world videos captured at 4K resolution and 60 fps, spanning both single- and multi-body dynamics with independently measured ground-truth parameters and uncertainty estimates. Each dynamical system is recorded under controlled laboratory conditions and paired with its governing equations, enabling principled evaluation. A standardized evaluation protocol is defined encompassing parameter accuracy, identifiability, extrapolation, robustness, and governing-equation selection. Multiple baselines are evaluated, including a multi-step physics loss formulation and four complementary equation-identification strategies (VLM temporal reasoning, describe-then-classify prompting, CNN-based classification, and path-based labelling), establishing reference performance across all IRIS scenarios and exposing systematic failure modes that motivate future research. The dataset, annotations, evaluation toolkit, and all baseline implementations are publicly released.
Chinese Translation
视频中的无监督物理参数估计缺乏一个共同的基准:现有方法在不重叠的合成数据上进行评估,唯一的真实世界数据集仅限于单体系统,并且没有建立的协议来处理控制方程的识别。本研究介绍了IRIS,一个高保真基准,包含220个以4K分辨率和60帧每秒捕获的真实世界视频,涵盖单体和多体动态,配有独立测量的真实参数和不确定性估计。每个动态系统在受控实验室条件下记录,并与其控制方程配对,从而实现原则性的评估。定义了一个标准化的评估协议,涵盖参数准确性、可识别性、外推性、鲁棒性和控制方程选择。评估了多个基线,包括多步物理损失公式和四种互补的方程识别策略(VLM时间推理、描述后分类提示、基于CNN的分类和基于路径的标记),在所有IRIS场景中建立了参考性能,并揭示了系统性失败模式,激励未来的研究。数据集、注释、评估工具包和所有基线实现均已公开发布。
cs.CV / 89 / 2603.16439
CD-FKD: Cross-Domain Feature Knowledge Distillation for Robust Single-Domain Generalization in Object Detection
CD-FKD:跨域特征知识蒸馏用于物体检测中的鲁棒单域泛化
Abstract
Single-domain generalization is essential for object detection, particularly when training models on a single source domain and evaluating them on unseen target domains. Domain shifts, such as changes in weather, lighting, or scene conditions, pose significant challenges to the generalization ability of existing models. To address this, we propose Cross-Domain Feature Knowledge Distillation (CD-FKD), which enhances the generalization capability of the student network by leveraging both global and instance-wise feature distillation. The proposed method uses diversified data through downscaling and corruption to train the student network, whereas the teacher network receives the original source domain data. The student network mimics the features of the teacher through both global and instance-wise distillation, enabling it to extract object-centric features effectively, even for objects that are difficult to detect owing to corruption. Extensive experiments on challenging scenes demonstrate that CD-FKD outperforms state-of-the-art methods in both target domain generalization and source domain performance, validating its effectiveness in improving object detection robustness to domain shifts. This approach is valuable in real-world applications, like autonomous driving and surveillance, where robust object detection in diverse environments is crucial.
Chinese Translation
单域泛化对于物体检测至关重要,特别是在模型在单一源域上训练并在未见过的目标域上评估时。域偏移,例如天气、光照或场景条件的变化,对现有模型的泛化能力构成了重大挑战。为此,我们提出了跨域特征知识蒸馏(CD-FKD),通过利用全局和实例特征蒸馏来增强学生网络的泛化能力。该方法通过降尺度和数据损坏使用多样化的数据来训练学生网络,而教师网络则接收原始源域数据。学生网络通过全局和实例蒸馏模仿教师的特征,使其能够有效提取以物体为中心的特征,即使对于由于损坏而难以检测的物体也是如此。在具有挑战性的场景上的大量实验表明,CD-FKD在目标域泛化和源域性能方面均优于最先进的方法,验证了其在提高物体检测对域偏移的鲁棒性方面的有效性。这种方法在现实世界应用中具有重要价值,例如在自动驾驶和监控中,鲁棒的物体检测在多样化环境中至关重要。
cs.CV / 90 / 2603.16444
Fast-HaMeR: Boosting Hand Mesh Reconstruction using Knowledge Distillation
Fast-HaMeR:利用知识蒸馏提升手部网格重建
Abstract
Fast and accurate 3D hand reconstruction is essential for real-time applications in VR/AR, human-computer interaction, robotics, and healthcare. Most state-of-the-art methods rely on heavy models, limiting their use on resource-constrained devices like headsets, smartphones, and embedded systems. In this paper, we investigate how the use of lightweight neural networks, combined with Knowledge Distillation, can accelerate complex 3D hand reconstruction models by making them faster and lighter, while maintaining comparable reconstruction accuracy. While our approach is suited for various hand reconstruction frameworks, we focus primarily on boosting the HaMeR model, currently the leading method in terms of reconstruction accuracy. We replace its original ViT-H backbone with lighter alternatives, including MobileNet, MobileViT, ConvNeXt, and ResNet, and evaluate three knowledge distillation strategies: output-level, feature-level, and a hybrid of both. Our experiments show that using lightweight backbones that are only 35% the size of the original achieves 1.5x faster inference speed while preserving similar performance quality with only a minimal accuracy difference of 0.4mm. More specifically, we show how output-level distillation notably improves student performance, while feature-level distillation proves more effective for higher-capacity students. Overall, the findings pave the way for efficient real-world applications on low-power devices. The code and models are publicly available under https://github.com/hunainahmedj/Fast-HaMeR.
Chinese Translation
快速且准确的3D手部重建对于虚拟现实/增强现实(VR/AR)、人机交互、机器人技术和医疗保健等实时应用至关重要。大多数最先进的方法依赖于复杂的模型,这限制了它们在资源受限设备(如头戴设备、智能手机和嵌入式系统)上的应用。在本文中,我们探讨了如何通过结合轻量级神经网络和知识蒸馏(Knowledge Distillation),加速复杂的3D手部重建模型,使其更快且更轻,同时保持可比的重建精度。尽管我们的方法适用于各种手部重建框架,但我们主要集中于提升HaMeR模型,该模型在重建精度方面目前处于领先地位。我们用更轻的替代方案替换其原始的ViT-H主干,包括MobileNet、MobileViT、ConvNeXt和ResNet,并评估三种知识蒸馏策略:输出级、特征级和两者的混合。我们的实验表明,使用仅为原始模型35%的轻量级主干可以实现1.5倍的推理速度提升,同时保持相似的性能质量,准确度差异仅为0.4mm。更具体地说,我们展示了输出级蒸馏显著提高了学生模型的性能,而特征级蒸馏对于更高容量的学生模型更为有效。总体而言,这些发现为低功耗设备上的高效实际应用铺平了道路。代码和模型可在https://github.com/hunainahmedj/Fast-HaMeR上公开获取。
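The three distillation strategies the abstract compares (output-level, feature-level, and a hybrid) reduce to a weighted sum of two terms. A hedged NumPy sketch using the standard temperature-scaled KL formulation, with an assumed name (`hybrid_kd_loss`); the paper's exact loss may differ:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def hybrid_kd_loss(student_logits, teacher_logits,
                   student_feat, teacher_feat,
                   T=2.0, alpha=0.5):
    """Hybrid distillation: KL on softened outputs + MSE on features.

    alpha weighs the output-level vs feature-level terms; T is the softmax
    temperature. The T**2 factor keeps gradient scale independent of T.
    alpha=1 recovers pure output-level, alpha=0 pure feature-level KD.
    """
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    output_term = (T ** 2) * np.mean(kl)
    feature_term = np.mean((student_feat - teacher_feat) ** 2)
    return alpha * output_term + (1 - alpha) * feature_term

logits = np.array([[2.0, 0.5, -1.0]])
feat = np.ones((1, 4))
# A student that matches the teacher exactly incurs zero loss.
zero = hybrid_kd_loss(logits, logits, feat, feat)
```

In a lightweight-backbone setting the feature term usually requires a projection layer to match teacher and student feature dimensions; it is omitted here by giving both the same shape.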
cs.CV / 91 / 2603.16446
Unified Removal of Raindrops and Reflections: A New Benchmark and A Novel Pipeline
统一去除雨滴和反射:一个新的基准和一个新颖的管道
Abstract
When capturing images through glass surfaces or windshields on rainy days, raindrops and reflections frequently co-occur to significantly reduce the visibility of captured images. This practical problem lacks attention and needs to be resolved urgently. Prior de-raindrop, de-reflection, and all-in-one models have failed to address this composite degradation. To this end, we formally define the unified removal of raindrops and reflections (UR$^3$) task for the first time and construct a real-shot dataset, namely RainDrop and ReFlection (RDRF), which provides a new benchmark with substantial, high-quality, diverse image pairs. Then, we propose a novel diffusion-based framework (i.e., DiffUR$^3$) with several target designs to address this challenging task. By leveraging the powerful generative prior, DiffUR$^3$ successfully removes both types of degradations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on our benchmark and on challenging in-the-wild images. The RDRF dataset and the codes will be made public upon acceptance.
Chinese Translation
在雨天通过玻璃表面或挡风玻璃拍摄图像时,雨滴和反射常常同时出现,显著降低了捕获图像的可见性。这个实际问题缺乏关注,亟需解决。先前的去雨滴、去反射和一体化模型未能有效应对这种复合退化。为此,我们首次正式定义了统一去除雨滴和反射(UR$^3$)任务,并构建了一个真实拍摄的数据集,即雨滴和反射(RDRF),该数据集提供了一个新的基准,包含大量高质量、多样化的图像对。随后,我们提出了一种新颖的基于扩散的框架(即,DiffUR$^3$),并设计了多个目标来解决这一挑战性任务。通过利用强大的生成先验,DiffUR$^3$成功去除了这两种类型的退化。大量实验表明,我们的方法在我们的基准和具有挑战性的自然图像上达到了最先进的性能。RDRF数据集和代码将在论文接受后公开。
cs.CV / 92 / 2603.16447
ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars
渐进式头像:渐进可动画的3D高斯头像
Abstract
In practical real-time XR and telepresence applications, network and computing resources fluctuate frequently. Therefore, a progressive 3D representation is needed. To this end, we propose ProgressiveAvatars, a progressive avatar representation built on a hierarchy of 3D Gaussians grown by adaptive implicit subdivision on a template mesh. 3D Gaussians are defined in face-local coordinates to remain animatable under varying expressions and head motion across multiple detail levels. The hierarchy expands when screen-space signals indicate a lack of detail, allocating resources to important areas. Leveraging importance ranking, ProgressiveAvatars supports incremental loading and rendering, adding new Gaussians as they arrive while preserving previous content, thus achieving smooth quality improvements across varying bandwidths. ProgressiveAvatars enables progressive delivery and progressive rendering under fluctuating network bandwidth and varying compute and memory resources.
Chinese Translation
在实际的实时扩展现实(XR)和远程呈现应用中,网络和计算资源经常波动。因此,需要一种渐进式的3D表示。为此,我们提出了渐进式头像(ProgressiveAvatars),这是一种基于通过自适应隐式细分在模板网格上生长的3D高斯层次结构的渐进式头像表示。3D高斯在面局部坐标中定义,以便在不同的表情和头部运动下保持可动画性,并支持多个细节层级。当屏幕空间信号指示细节不足时,层次结构会扩展,将资源分配到重要区域。通过利用重要性排序,渐进式头像支持增量加载和渲染,在新高斯到达时添加它们,同时保留先前的内容,从而在不同带宽下实现平滑的质量提升。渐进式头像能够在波动的网络带宽和变化的计算及内存资源下实现渐进交付和渐进渲染。
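The importance-ranked incremental loading the abstract describes can be illustrated with a small generator: Gaussians are streamed most-important-first, and every prefix is renderable on its own. `stream_by_importance` and the fixed chunking are hypothetical stand-ins for the paper's bandwidth-adaptive delivery:

```python
import numpy as np

def stream_by_importance(importance, chunk_size):
    """Yield growing index sets of Gaussians, most important first.

    Each yielded array is a renderable subset: new Gaussians are appended
    while previous content is preserved, so image quality improves smoothly
    as more chunks arrive over a fluctuating link.
    """
    order = np.argsort(-importance)          # descending importance
    loaded = []
    for start in range(0, len(order), chunk_size):
        loaded.extend(order[start:start + chunk_size])
        yield np.array(loaded)

importance = np.array([0.1, 0.9, 0.5, 0.7])
stages = list(stream_by_importance(importance, chunk_size=2))
# Stage 0 holds the two most important Gaussians; later stages only grow.
```

The monotone-prefix property is what allows a client to stop receiving at any point (bandwidth drop, memory cap) and still have a coherent, if coarser, avatar.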
cs.CV / 93 / 2603.16451
TinyGLASS: Real-Time Self-Supervised In-Sensor Anomaly Detection
TinyGLASS:实时自监督传感器内异常检测
Abstract
Anomaly detection plays a key role in industrial quality control, where defects must be identified despite the scarcity of labeled faulty samples. Recent self-supervised approaches, such as GLASS, learn normal visual patterns using only defect-free data and have shown strong performance on industrial benchmarks. However, their computational requirements limit deployment on resource-constrained edge platforms. This work introduces TinyGLASS, a lightweight adaptation of the GLASS framework designed for real-time in-sensor anomaly detection on the Sony IMX500 intelligent vision sensor. The proposed architecture replaces the original WideResNet-50 backbone with a compact ResNet-18 and introduces deployment-oriented modifications that enable static graph tracing and INT8 quantization using Sony's Model Compression Toolkit. In addition to evaluating performance on the MVTec-AD benchmark, we investigate robustness to contaminated training data and introduce a custom industrial dataset, named MMS Dataset, for cross-device evaluation. Experimental results show that TinyGLASS achieves 8.7x parameter compression while maintaining competitive detection performance, reaching 94.2% image-level AUROC on MVTec-AD and operating at 20 FPS within the 8 MB memory constraints of the IMX500 platform. System profiling demonstrates low power consumption (4.0 mJ per inference), real-time end-to-end latency (20 FPS), and high energy efficiency (470 GMAC/J). Furthermore, the model maintains stable performance under moderate levels of training data contamination.
Chinese Translation
异常检测在工业质量控制中发挥着关键作用,尽管缺乏标记的故障样本,但仍需识别缺陷。近期的自监督方法,如GLASS,仅使用无缺陷数据学习正常视觉模式,并在工业基准测试中表现出强大的性能。然而,它们的计算需求限制了在资源受限的边缘平台上的部署。本研究介绍了TinyGLASS,这是GLASS框架的轻量级适配,旨在在Sony IMX500智能视觉传感器上实现实时传感器内异常检测。所提出的架构用紧凑的ResNet-18替代了原始的WideResNet-50主干,并引入了面向部署的修改,使得能够使用Sony的模型压缩工具包进行静态图追踪和INT8量化。除了在MVTec-AD基准上评估性能外,我们还研究了对受污染训练数据的鲁棒性,并引入了一个名为MMS Dataset的定制工业数据集,用于跨设备评估。实验结果表明,TinyGLASS在保持竞争性检测性能的同时实现了8.7倍的参数压缩,在MVTec-AD上达到了94.2%的图像级AUROC,并在IMX500平台的8 MB内存限制下以20 FPS运行。系统分析表明,功耗低(每次推理4.0 mJ),实时端到端延迟(20 FPS),以及高能效(470 GMAC/J)。此外,该模型在中等水平的训练数据污染下保持了稳定的性能。
cs.CV / 94 / 2603.16455
Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval
Evo-Retriever:基于观点-路径协作的多模态文档检索的LLM引导课程演化
Abstract
Visual-language models (VLMs) excel at data mappings, but real-world document heterogeneity and unstructuredness disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model's dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel Viewpoint-Pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates "hard queries" and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. Finally, the model-state summary from the above collaboration is fed into an LLM meta-controller, which adaptively adjusts the training curriculum using expert knowledge to promote the model's evolution. On ViDoRe V2 and MMEB (VisDoc), Evo-Retriever achieves state-of-the-art performance, with nDCG@5 scores of 65.2% and 77.1%.
Chinese Translation
视觉语言模型(VLMs)在数据映射方面表现出色,但现实世界文档的异质性和非结构化特性破坏了跨模态嵌入的一致性。最近的晚期交互方法通过多向量表示增强了图像与文本的对齐,然而,传统的有限样本和静态策略的训练无法适应模型的动态演化,导致跨模态检索的混淆。为了解决这个问题,我们提出了Evo-Retriever,这是一种检索框架,具有基于新颖的观点-路径协作的LLM引导课程演化。首先,我们采用多视角图像对齐,通过多尺度和多方向的视角增强细粒度匹配。然后,双向对比学习策略生成“困难查询”,并为视觉和文本消歧建立互补学习路径,以重新平衡监督。最后,上述协作的模型状态摘要被输入到LLM元控制器中,该控制器利用专家知识自适应调整训练课程,以促进模型的演化。在ViDoRe V2和MMEB(VisDoc)上,Evo-Retriever实现了最先进的性能,nDCG@5得分分别为65.2%和77.1%。
cs.CV / 95 / 2603.16461
GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models
GAP-MLLM:几何对齐预训练以激活多模态大型语言模型中的3D空间感知
Abstract
Multimodal Large Language Models (MLLMs) demonstrate exceptional semantic reasoning but struggle with 3D spatial perception when restricted to pure RGB inputs. Despite leveraging implicit geometric priors from 3D reconstruction models, image-based methods still exhibit a notable performance gap compared to methods using explicit 3D data. We argue that this gap does not arise from insufficient geometric priors, but from a misalignment in the training paradigm: text-dominated fine-tuning fails to activate geometric representations within MLLMs. Existing approaches typically resort to naive feature concatenation and optimize directly for downstream tasks without geometry-specific supervision, leading to suboptimal structural utilization. To address this limitation, we propose GAP-MLLM, a Geometry-Aligned Pre-training paradigm that explicitly activates structural perception before downstream adaptation. Specifically, we introduce a visual-prompted joint task that compels the MLLMs to predict sparse pointmaps alongside semantic labels, thereby enforcing geometric awareness. Furthermore, we design a multi-level progressive fusion module with a token-level gating mechanism, enabling adaptive integration of geometric priors without suppressing semantic reasoning. Extensive experiments demonstrate that GAP-MLLM significantly enhances geometric feature fusion and consistently improves performance across 3D visual grounding, 3D dense captioning, and 3D video object detection tasks.
Chinese Translation
多模态大型语言模型(MLLMs)在语义推理方面表现出色,但在仅依赖纯RGB输入时,3D空间感知能力较弱。尽管利用了来自3D重建模型的隐式几何先验,基于图像的方法与使用显式3D数据的方法相比,仍然存在显著的性能差距。我们认为,这一差距并非源于几何先验不足,而是由于训练范式的不对齐:以文本为主的微调未能激活MLLMs中的几何表示。现有方法通常采用简单的特征拼接,并直接针对下游任务进行优化,而没有几何特定的监督,导致结构利用不佳。为了解决这一局限性,我们提出了GAP-MLLM,一种几何对齐预训练范式,明确在下游适应之前激活结构感知。具体而言,我们引入了一种视觉提示的联合任务,迫使MLLMs在预测稀疏点图的同时预测语义标签,从而增强几何意识。此外,我们设计了一个多层级渐进融合模块,配备令牌级门控机制,使几何先验的自适应整合成为可能,而不抑制语义推理。大量实验表明,GAP-MLLM显著增强了几何特征融合,并在3D视觉定位、3D密集标注和3D视频目标检测任务中持续提升了性能。
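The token-level gating idea, letting each token decide how much geometric signal to admit without overwriting semantics, can be sketched as a sigmoid gate over concatenated features. All names and shapes here are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(semantic_tokens, geometric_tokens, w, b):
    """Token-level gating of geometric priors into a semantic stream.

    A per-token scalar gate in (0, 1) is computed from both streams, and
    geometric features are added in proportion, so semantic reasoning is
    never overwritten outright. Shapes: tokens (N, D), w (2*D,), b scalar.
    """
    both = np.concatenate([semantic_tokens, geometric_tokens], axis=-1)
    gate = sigmoid(both @ w + b)[:, None]       # (N, 1) per-token gate
    return semantic_tokens + gate * geometric_tokens

rng = np.random.default_rng(0)
sem = rng.standard_normal((3, 4))
geo = rng.standard_normal((3, 4))
# With a strongly negative bias the gate shuts and semantics pass through.
out = gated_fusion(sem, geo, w=np.zeros(8), b=-20.0)
```

The residual form (`semantic + gate * geometric`) is a common design choice for this kind of adaptive integration: at gate 0 the model degrades gracefully to its original semantic behavior.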
cs.CV / 96 / 2603.16482
DST-Net: A Dual-Stream Transformer with Illumination-Independent Feature Guidance and Multi-Scale Spatial Convolution for Low-Light Image Enhancement
DST-Net:一种具有独立于光照的特征引导和多尺度空间卷积的双流变换器用于低光照图像增强
Abstract
Low-light image enhancement aims to restore the visibility of images captured by visual sensors in dim environments by addressing their inherent signal degradations, such as luminance attenuation and structural corruption. Although numerous algorithms attempt to improve image quality, existing methods often cause a severe loss of intrinsic signal priors. To overcome these challenges, we propose a Dual-Stream Transformer Network (DST-Net) based on illumination-agnostic signal prior guidance and multi-scale spatial convolutions. First, to address the loss of critical signal features under low-light conditions, we design a feature extraction module. This module integrates Difference of Gaussians (DoG), LAB color space transformations, and VGG-16 for texture extraction, utilizing decoupled illumination-agnostic features as signal priors to continuously guide the enhancement process. Second, we construct a dual-stream interaction architecture. By employing a cross-modal attention mechanism, the network leverages the extracted priors to dynamically rectify the deteriorated signal representation of the enhanced image, ultimately achieving iterative enhancement through differentiable curve estimation. Furthermore, to overcome the inability of existing methods to preserve fine structures and textures, we propose a Multi-Scale Spatial Fusion Block (MSFB) featuring pseudo-3D and 3D gradient operator convolutions. This module integrates explicit gradient operators to recover high-frequency edges while capturing inter-channel spatial correlations via multi-scale spatial convolutions. Extensive evaluations and ablation studies demonstrate that DST-Net achieves superior performance in subjective visual quality and objective metrics. Specifically, our method achieves a PSNR of 25.64 dB on the LOL dataset. Subsequent validation on the LSRW dataset further confirms its robust cross-scene generalization.
Chinese Translation
低光照图像增强旨在通过解决图像在昏暗环境中捕获时固有的信号退化问题(如亮度衰减和结构损坏),恢复其可见性。尽管许多算法试图改善图像质量,但现有方法往往导致固有信号先验的严重损失。为克服这些挑战,我们提出了一种基于独立于光照的信号先验引导和多尺度空间卷积的双流变换器网络(DST-Net)。首先,为了解决低光照条件下关键信号特征的丢失,我们设计了一个特征提取模块。该模块结合了高斯差分(Difference of Gaussians, DoG)、LAB颜色空间变换和VGG-16进行纹理提取,利用解耦的独立于光照的特征作为信号先验,持续引导增强过程。其次,我们构建了一个双流交互架构。通过采用跨模态注意机制,网络利用提取的先验动态修正增强图像的退化信号表示,最终通过可微曲线估计实现迭代增强。此外,为了克服现有方法在保持细微结构和纹理方面的不足,我们提出了一个多尺度空间融合块(Multi-Scale Spatial Fusion Block, MSFB),该块具有伪3D和3D梯度算子卷积。该模块集成了显式梯度算子以恢复高频边缘,同时通过多尺度空间卷积捕获通道间的空间相关性。大量评估和消融研究表明,DST-Net在主观视觉质量和客观指标上均表现出优越的性能。具体而言,我们的方法在LOL数据集上达到了25.64 dB的PSNR。后续在LSRW数据集上的验证进一步确认了其强大的跨场景泛化能力。
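The Difference of Gaussians (DoG) component of the feature extraction module can be shown concretely. This is a minimal NumPy sketch of DoG only (the abstract's module also uses LAB transforms and VGG-16 textures); the blur uses a truncated separable kernel, which is a standard but assumed choice:

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur using a truncated 1D kernel."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    # Convolve rows, then columns ('same' keeps the image size).
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, "same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, "same"), 0, blurred)

def difference_of_gaussians(img, sigma_fine=1.0, sigma_coarse=2.0):
    """DoG band-pass: keeps edges/texture, cancels smooth illumination.

    Because the operator is linear and removes low-frequency content,
    the response pattern is stable under global dimming, which is why
    DoG-style maps can serve as illumination-agnostic signal priors.
    """
    return gaussian_blur(img, sigma_fine) - gaussian_blur(img, sigma_coarse)

img = np.zeros((16, 16))
img[:, 8:] = 1.0                      # a vertical step edge
dark = 0.2 * img                      # the same scene, 5x dimmer
dog_bright = difference_of_gaussians(img)
dog_dark = difference_of_gaussians(dark)
# The DoG response scales with contrast but keeps the same spatial pattern.
```

Linearity means the dim image's DoG map is an exact scaled copy of the bright one, so edge locations survive the brightness drop even though absolute intensities do not.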
cs.CV / 97 / 2603.16489
Unlearning for One-Step Generative Models via Unbalanced Optimal Transport
通过不平衡最优传输实现单步生成模型的遗忘
Abstract
Recent advances in one-step generative frameworks, such as flow map models, have significantly improved the efficiency of image generation by learning direct noise-to-data mappings in a single forward pass. However, machine unlearning for ensuring the safety of these powerful generators remains entirely unexplored. Existing diffusion unlearning methods are inherently incompatible with these one-step models, as they rely on a multi-step iterative denoising process. In this work, we propose UOT-Unlearn, a novel plug-and-play class unlearning framework for one-step generative models based on the Unbalanced Optimal Transport (UOT). Our method formulates unlearning as a principled trade-off between a forget cost, which suppresses the target class, and an $f$-divergence penalty, which preserves overall generation fidelity via relaxed marginal constraints. By leveraging UOT, our method enables the probability mass of the forgotten class to be smoothly redistributed to the remaining classes, rather than collapsing into low-quality or noise-like samples. Experimental results on CIFAR-10 and ImageNet-256 demonstrate that our framework achieves superior unlearning success (PUL) and retention quality (u-FID), significantly outperforming baselines.
Chinese Translation
近年来,单步生成框架(如流映射模型)的进展显著提高了图像生成的效率,通过在单次前向传递中学习直接的噪声到数据的映射。然而,为确保这些强大生成器的安全性,机器遗忘的研究仍然完全未被探索。现有的扩散遗忘方法与这些单步模型本质上不兼容,因为它们依赖于多步迭代去噪过程。在本研究中,我们提出了UOT-Unlearn,这是一种基于不平衡最优传输(Unbalanced Optimal Transport, UOT)的新型即插即用类别遗忘框架,适用于单步生成模型。我们的方法将遗忘公式化为遗忘成本与$f$-散度惩罚之间的原则性权衡,前者抑制目标类别,后者通过放宽边际约束保持整体生成的保真度。通过利用UOT,我们的方法使被遗忘类别的概率质量能够平滑地重新分配到其余类别,而不是崩溃为低质量或噪声样本。在CIFAR-10和ImageNet-256上的实验结果表明,我们的框架在遗忘成功率(PUL)和保留质量(u-FID)方面表现优越,显著优于基线方法。
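The forget-cost/divergence trade-off in the abstract can be made concrete on a toy categorical distribution. This is only an illustration of the trade-off's shape, with a KL penalty standing in for the $f$-divergence; the paper's actual objective is an unbalanced optimal transport formulation, not this toy:

```python
import numpy as np

def unlearn_objective(q, p_ref, forget_idx, lam=1.0):
    """Toy trade-off: forget cost + divergence penalty on retained classes.

    q: candidate class distribution of the edited generator.
    p_ref: class distribution of the original generator.
    The first term suppresses mass on the forgotten class; the KL term
    keeps q close to p_ref on the remaining classes, so forgotten mass is
    redistributed to them rather than collapsing into degenerate samples.
    """
    forget_cost = q[forget_idx]
    keep = np.delete(np.arange(len(q)), forget_idx)
    qk = q[keep] / q[keep].sum()
    pk = p_ref[keep] / p_ref[keep].sum()
    kl = np.sum(qk * np.log((qk + 1e-12) / (pk + 1e-12)))
    return float(forget_cost + lam * kl)

p_ref = np.array([0.25, 0.25, 0.25, 0.25])
bad = np.array([0.25, 0.25, 0.25, 0.25])        # nothing forgotten
good = np.array([1 / 3, 1 / 3, 1 / 3, 0.0])     # class-3 mass redistributed
obj_bad = unlearn_objective(bad, p_ref, forget_idx=3)
obj_good = unlearn_objective(good, p_ref, forget_idx=3)
```

The objective prefers `good` over `bad`: it zeroes the forgotten class while leaving the renormalized distribution over retained classes unchanged, which is the behavior the abstract's relaxed marginal constraints are designed to induce.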
cs.CV / 98 / 2603.16506
VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations
VIEW2SPACE:从稀疏观测研究多视角视觉推理
Abstract
Multi-view visual reasoning is essential for intelligent systems that must understand complex environments from sparse and discrete viewpoints, yet existing research has largely focused on single-image or temporally dense video settings. In real-world scenarios, reasoning across views requires integrating partial observations without explicit guidance, while collecting large-scale multi-view data with accurate geometric and semantic annotations remains challenging. To address this gap, we leverage physically grounded simulation to construct diverse, high-fidelity 3D scenes with precise per-view metadata, enabling scalable data generation that remains transferable to real-world settings. Based on this engine, we introduce VIEW2SPACE, a multi-dimensional benchmark for sparse multi-view reasoning, together with a scalable, disjoint training split supporting millions of grounded question-answer pairs. Using this benchmark, a comprehensive evaluation of state-of-the-art vision-language and spatial models reveals that multi-view reasoning remains largely unsolved, with most models performing only marginally above random guessing. We further investigate whether training can bridge this gap. Our proposed Grounded Chain-of-Thought with Visual Evidence substantially improves performance under moderate difficulty, and generalizes to real-world data, outperforming existing approaches in cross-dataset evaluation. We further conduct difficulty-aware scaling analyses across model size, data scale, reasoning depth, and visibility constraints, indicating that while geometric perception can benefit from scaling under sufficient visibility, deep compositional reasoning across sparse views remains a fundamental challenge.
Chinese Translation
多视角视觉推理对于必须从稀疏和离散视角理解复杂环境的智能系统至关重要,但现有研究主要集中在单图像或时间密集的视频场景上。在现实世界场景中,跨视角推理需要在没有明确指导的情况下整合部分观测,而收集具有准确几何和语义注释的大规模多视角数据仍然具有挑战性。为了解决这一问题,我们利用物理基础的仿真构建多样化、高保真的3D场景,并提供精确的每视角元数据,从而实现可扩展的数据生成,且可转移到现实世界场景。基于这一引擎,我们推出了VIEW2SPACE,这是一个用于稀疏多视角推理的多维基准,同时提供一个支持数百万个基础问题-答案对的可扩展、互不重叠的训练集。利用该基准,对最先进的视觉-语言和空间模型进行全面评估,结果表明多视角推理仍然在很大程度上未得到解决,大多数模型的表现仅略高于随机猜测。我们进一步探讨训练是否能够弥补这一差距。我们提出的基于视觉证据的基础思维链显著提高了在中等难度下的表现,并且能够推广到现实世界数据,在跨数据集评估中优于现有方法。我们还在模型规模、数据规模、推理深度和可见性约束方面进行了难度感知的扩展分析,结果表明,尽管几何感知在足够可见性下可以受益于扩展,但在稀疏视角下进行深层组合推理仍然是一个基本挑战。
cs.CV / 99 / 2603.16524
An approximate graph elicits detonation lattice
近似图引发的爆轰晶格
Abstract
This study presents a novel algorithm based on graph theory for the precise segmentation and measurement of detonation cells from 3D pressure traces, termed detonation lattices, addressing the limitations of manual and primitive 2D edge detection methods prevalent in the field. Using a segmentation model, the proposed training-free algorithm is designed to accurately extract cellular patterns, a longstanding challenge in detonations research. First, the efficacy of segmentation on generated data is shown with a prediction error of 2%. Next, 3D simulation data is used to establish the performance of the graph-based workflow. Statistics and joint probability densities show oblong cells aligned with the wave propagation axis with 17% deviation, whereas larger dispersion in volume reflects cubic amplification of linear variability. Although the framework is robust, it remains challenging to reliably segment and quantify highly complex cellular patterns. However, the graph-based formulation generalizes across diverse cellular geometries, positioning it as a practical tool for detonation analysis and a strong foundation for future extensions in triple-point collision studies.
Chinese Translation
本研究提出了一种基于图论的新算法,用于从三维压力信号中精确分割和测量爆轰单元,称为爆轰晶格,旨在解决该领域普遍存在的手动和原始二维边缘检测方法的局限性。利用分割模型,所提出的无训练算法旨在准确提取细胞模式,这是爆轰研究中的一个长期挑战。首先,生成数据的分割效果显示预测误差为2%。接下来,使用三维仿真数据建立基于图的工作流程的性能。统计结果和联合概率密度显示,细长单元与波传播轴对齐,偏差为17%,而体积的较大离散性反映了线性变异的立方放大。尽管该框架稳健,但可靠地分割和量化高度复杂的细胞模式仍然具有挑战性。然而,基于图的公式在不同细胞几何形状中具有广泛的适用性,使其成为爆轰分析的实用工具,并为未来在三点碰撞研究中的扩展奠定了坚实基础。
cs.CV / 100 / 2603.16538
Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty
在姿态先验和几何不确定性下重新思考3D高斯点云的姿态优化
Abstract
3D Gaussian Splatting (3DGS) has recently emerged as a powerful scene representation and is increasingly used for visual localization and pose refinement. However, despite its high-quality differentiable rendering, the robustness of 3DGS-based pose refinement remains highly sensitive to both the initial camera pose and the reconstructed geometry. In this work, we take a closer look at these limitations and identify two major sources of uncertainty: (i) pose prior uncertainty, which often arises from regression or retrieval models that output a single deterministic estimate, and (ii) geometric uncertainty, caused by imperfections in the 3DGS reconstruction that propagate errors into PnP solvers. Such uncertainties can distort reprojection geometry and destabilize optimization, even when the rendered appearance still looks plausible. To address these uncertainties, we introduce a relocalization framework that combines Monte Carlo pose sampling with Fisher Information-based PnP optimization. Our method explicitly accounts for both pose and geometric uncertainty and requires no retraining or additional supervision. Across diverse indoor and outdoor benchmarks, our approach consistently improves localization accuracy and significantly increases stability under pose and depth noise.
Chinese Translation
3D高斯点云(3DGS)最近作为一种强大的场景表示方法而崭露头角,并越来越多地用于视觉定位和姿态优化。然而,尽管其具备高质量的可微分渲染,基于3DGS的姿态优化的鲁棒性仍然对初始相机姿态和重建几何体高度敏感。在本研究中,我们仔细分析了这些局限性,并识别出两种主要的不确定性来源:(i)姿态先验不确定性,通常源于输出单一确定性估计的回归或检索模型;(ii)几何不确定性,由于3DGS重建中的缺陷导致错误传播到PnP求解器。这些不确定性可能扭曲重投影几何并使优化不稳定,即使渲染的外观仍然看起来合理。为了解决这些不确定性,我们提出了一种重新定位框架,结合了蒙特卡洛姿态采样和基于费舍尔信息的PnP优化。我们的方法明确考虑了姿态和几何不确定性,并且不需要重新训练或额外的监督。在各种室内和室外基准测试中,我们的方法始终提高了定位精度,并在姿态和深度噪声下显著增加了稳定性。
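The Monte Carlo pose-sampling idea, hedging against a single deterministic pose prior by sampling candidates and keeping the best-scoring one, can be sketched in a simplified 2D-translation setting. `mc_pose_refine` is a hypothetical name; the paper additionally weights residuals with Fisher information and operates on full camera poses:

```python
import numpy as np

def mc_pose_refine(prior_t, points, observed, n_samples=200, sigma=0.5, seed=0):
    """Monte Carlo relocalization sketch for pose-prior uncertainty.

    Samples candidate translations around an uncertain prior and keeps the
    one with the lowest mean reprojection error. A deterministic prior may
    sit in a bad basin; sampling explicitly accounts for its uncertainty.
    """
    rng = np.random.default_rng(seed)
    candidates = prior_t + sigma * rng.standard_normal((n_samples, 2))
    errors = [np.mean(np.linalg.norm(points + t - observed, axis=1))
              for t in candidates]
    return candidates[int(np.argmin(errors))]

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
true_t = np.array([2.0, -1.0])
observed = points + true_t
noisy_prior = true_t + np.array([0.4, -0.3])     # biased single estimate
refined = mc_pose_refine(noisy_prior, points, observed)
# The refined pose lands closer to the true translation than the prior did.
```

With noise-free observations the error of a candidate equals its distance to the true translation, so the sampler returns the nearest of its 200 draws, illustrating why sampling beats trusting one regression output.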
cs.CV / 101 / 2603.16549
Bridging the Simulation-to-Reality Gap in Electron Microscope Calibration via VAE-EM Estimation
通过 VAE-EM 估计弥合电子显微镜校准中的模拟与现实差距
Abstract
Electron microscopy has enabled many scientific breakthroughs across multiple fields. A key challenge is the tuning of microscope parameters based on images to overcome optical aberrations that deteriorate image quality. This calibration problem is challenging due to the high-dimensional and noisy nature of the diagnostic images, and the fact that optimal parameters cannot be identified from a single image. We tackle the calibration problem for Scanning Transmission Electron Microscopes (STEM) by employing variational autoencoders (VAEs), trained on simulated data, to learn low-dimensional representations of images, whereas most existing methods extract only scalar values. We then simultaneously estimate the model that maps calibration parameters to encoded representations and the optimal calibration parameters using an expectation maximization (EM) approach. This joint estimation explicitly addresses the simulation-to-reality gap inherent in data-driven methods that train on simulated data from a digital twin. We leverage the known symmetry property of the optical system to establish global identifiability of the joint estimation problem, ensuring that a unique optimum exists. We demonstrate that our approach is substantially faster and more consistent than existing methods on a real STEM, achieving a 2x reduction in estimation error while requiring fewer observations. This represents a notable advance in automated STEM calibration and demonstrates the potential of VAEs for information compression in images. Beyond microscopy, the VAE-EM framework applies to inverse problems where simulated training data introduces a reality gap and where non-injective mappings would otherwise prevent unique solutions.
Chinese Translation
电子显微镜在多个领域推动了许多科学突破。一个关键挑战是根据图像调整显微镜参数,以克服降低图像质量的光学像差。由于诊断图像的高维和噪声特性,以及无法从单幅图像中识别出最佳参数,这一校准问题变得尤为复杂。我们通过采用变分自编码器(Variational Autoencoders, VAEs),在模拟数据上进行训练,以学习图像的低维表示,从而解决扫描透射电子显微镜(Scanning Transmission Electron Microscopes, STEM)的校准问题,而大多数现有方法仅提取标量值。然后,我们使用期望最大化(Expectation Maximization, EM)方法同时估计将校准参数映射到编码表示的模型和最佳校准参数。这种联合估计明确解决了数据驱动方法中固有的模拟与现实差距,这些方法在数字孪生的模拟数据上进行训练。我们利用光学系统已知的对称性来建立联合估计问题的全局可识别性,确保存在唯一的最优解。我们展示了我们的方法在真实 STEM 上比现有方法显著更快且更一致,将估计误差降低了一半(2 倍的改进),同时所需观测次数更少。这标志着自动化 STEM 校准的显著进展,并展示了 VAEs 在图像信息压缩中的潜力。在显微镜之外,VAE-EM 框架还适用于此类逆问题:模拟训练数据引入现实差距,且非单射映射原本会妨碍唯一解的存在。
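The joint estimation idea, fitting the parameter-to-response model and the optimal calibration parameter from the same observations, can be illustrated with a toy alternating scheme. This is a coordinate-ascent stand-in for the paper's EM, under an assumed 1D response that is symmetric about the optimum (`y = m * (c - c_star)^2`); the symmetry is what makes the joint problem identifiable, echoing the abstract:

```python
import numpy as np

def fit_joint(c, y, iters=30):
    """Alternating joint fit of a response model and the optimal setting.

    Toy stand-in for VAE-EM: encoded responses are assumed symmetric about
    the optimal calibration value c_star. We alternate a closed-form fit of
    the model scale m with a grid update of c_star, so the parameter-to-
    response map and the operating point are estimated together.
    """
    c_star = float(c[np.argmin(y)])   # symmetry puts the optimum at the dip
    m = 1.0
    for _ in range(iters):
        basis = (c - c_star) ** 2
        m = float(basis @ y / (basis @ basis + 1e-12))      # model step
        grid = c_star + np.linspace(-1.0, 1.0, 401)          # parameter step
        resid = [np.sum((y - m * (c - g) ** 2) ** 2) for g in grid]
        c_star = float(grid[int(np.argmin(resid))])
    return c_star, m

rng = np.random.default_rng(1)
c = np.linspace(-2.0, 3.0, 40)                 # calibration settings tried
y = 2.0 * (c - 1.0) ** 2 + 0.01 * rng.standard_normal(40)  # noisy responses
c_star, m = fit_joint(c, y)                    # recovers roughly (1.0, 2.0)
```

The real method replaces the scalar response with VAE-encoded image representations and the quadratic with a learned map, but the alternation between "fit the map" and "locate the optimum" is the same skeleton.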
cs.CV / 102 / 2603.16551
CompDiff: Hierarchical Compositional Diffusion for Fair and Zero-Shot Intersectional Medical Image Generation
CompDiff:用于公平和零样本交叉医学图像生成的层次组合扩散
Abstract
Generative models are increasingly used to augment medical imaging datasets for fairer AI. Yet a key assumption often goes unexamined: that generators themselves produce equally high-quality images across demographic groups. Models trained on imbalanced data can inherit these imbalances, yielding degraded synthesis quality for rare subgroups and struggling with demographic intersections absent from training. We refer to this as the imbalanced generator problem. Existing remedies such as loss reweighting operate at the optimization level and provide limited benefit when training signal is scarce or absent for certain combinations. We propose CompDiff, a hierarchical compositional diffusion framework that addresses this problem at the representation level. A dedicated Hierarchical Conditioner Network (HCN) decomposes demographic conditioning, producing a demographic token concatenated with CLIP embeddings as cross-attention context. This structured factorization encourages parameter sharing across subgroups and supports compositional generalization to rare or unseen demographic intersections. Experiments on chest X-rays (MIMIC-CXR) and fundus images (FairGenMed) show that CompDiff compares favorably against both standard fine-tuning and FairDiffusion across image quality (FID: 64.3 vs. 75.1), subgroup equity (ES-FID), and zero-shot intersectional generalization (up to 21% FID improvement on held-out intersections). Downstream classifiers trained on CompDiff-generated data also show improved AUROC and reduced demographic bias, suggesting that architectural design of demographic conditioning is an important and underexplored factor in fair medical image generation. Code is available at https://anonymous.4open.science/r/CompDiff-6FE6.
Chinese Translation
生成模型越来越多地被用于增强医学影像数据集,以实现更公平的人工智能。然而,一个关键假设常常未被审视:即生成器本身在不同人口群体中产生的图像质量是否相同。基于不平衡数据训练的模型可能会继承这些不平衡,从而导致稀有子群的合成质量下降,并在训练中缺失的人口交叉点上表现不佳。我们将此称为不平衡生成器问题。现有的补救措施如损失重加权在优化层面上运作,当某些组合的训练信号稀缺或缺失时,效果有限。我们提出了CompDiff,一个层次组合扩散框架,从表示层面解决这一问题。一个专门的层次条件网络(Hierarchical Conditioner Network, HCN)对人口条件进行分解,生成一个与CLIP嵌入连接的人口标记,作为交叉注意力上下文。这种结构化的因子化鼓励子群之间的参数共享,并支持对稀有或未见人口交叉点的组合泛化。在胸部X光(MIMIC-CXR)和眼底图像(FairGenMed)上的实验表明,CompDiff在图像质量(FID: 64.3 vs. 75.1)、子群公平性(ES-FID)和零样本交叉泛化(对保留交叉点的FID提升高达21%)方面,与标准微调和FairDiffusion相比表现良好。在CompDiff生成的数据上训练的下游分类器也显示出改进的AUROC和减少的人口偏差,表明人口条件的架构设计是公平医学图像生成中一个重要且未被充分探索的因素。代码可在 https://anonymous.4open.science/r/CompDiff-6FE6 获取。
cs.CV / 103 / 2603.16558
Segmentation-Based Attention Entropy: Detecting and Mitigating Object Hallucinations in Large Vision-Language Models
基于分割的注意力熵:检测和缓解大型视觉-语言模型中的物体幻觉
Abstract
Large Vision-Language Models (LVLMs) achieve strong performance on many multimodal tasks, but object hallucinations severely undermine their reliability. Most existing studies focus on the text modality, attributing hallucinations to overly strong language priors and insufficient visual grounding. In contrast, we observe that abnormal attention patterns within the visual modality can also give rise to hallucinated objects. Building on this observation, we propose Segmentation-based Attention Entropy (SAE), which leverages semantic segmentation to quantify visual attention uncertainty in an object-level semantic space. Based on SAE, we further design a reliability score for hallucination detection and an SAE-guided attention adjustment method that modifies visual attention at inference time to mitigate hallucinations. We evaluate our approach on public benchmarks and in real embodied multimodal scenarios with quadruped robots. Experimental results show that SAE substantially reduces object hallucinations without any additional training cost, thereby enabling more trustworthy LVLM-driven perception and decision-making.
Chinese Translation
大型视觉-语言模型(LVLMs)在许多多模态任务中表现出色,但物体幻觉严重削弱了它们的可靠性。现有大多数研究集中于文本模态,将幻觉归因于过强的语言先验和不足的视觉基础。相反,我们观察到视觉模态中的异常注意力模式也可能导致幻觉物体。基于这一观察,我们提出了基于分割的注意力熵(SAE),该方法利用语义分割来量化物体级语义空间中的视觉注意力不确定性。基于SAE,我们进一步设计了一种用于幻觉检测的可靠性评分和一种SAE引导的注意力调整方法,该方法在推理时修改视觉注意力以减轻幻觉。我们在公共基准测试和真实的四足机器人多模态场景中评估了我们的方法。实验结果表明,SAE显著减少了物体幻觉,而无需额外的训练成本,从而实现了更可信赖的LVLM驱动的感知和决策。
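To make the entropy measure concrete, here is a minimal, hypothetical Python sketch of object-level attention entropy (not the paper's implementation; the function names and the patch-to-segment mapping are illustrative assumptions):

```python
import math

def segmentation_attention_entropy(patch_attn, patch_segments):
    """Aggregate patch-level attention mass by segmentation region, then
    return the Shannon entropy of the region-level distribution.
    Higher entropy = more diffuse, uncertain visual attention."""
    mass = {}
    for a, seg in zip(patch_attn, patch_segments):
        mass[seg] = mass.get(seg, 0.0) + a
    total = sum(mass.values())
    probs = [m / total for m in mass.values() if m > 0]
    return -sum(p * math.log(p) for p in probs)
```

Attention spread uniformly over four segmented objects yields entropy log 4, while attention concentrated on a single object yields 0, giving a scalar that a reliability score could threshold.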
cs.CV / 104 / 2603.16562
Understanding Cell Fate Decisions with Temporal Attention
理解细胞命运决策的时间注意力机制
Abstract
Understanding non-genetic determinants of cell fate is critical for developing and improving cancer therapies, as genetically identical cells can exhibit divergent outcomes under the same treatment conditions. In this work, we present a deep learning approach for cell fate prediction from raw long-term live-cell recordings of cancer cell populations under chemotherapeutic treatment. Our Transformer model is trained to predict cell fate directly from raw image sequences, without relying on predefined morphological or molecular features. Beyond classification, we introduce a comprehensive explainability framework for interpreting the temporal and morphological cues guiding the model's predictions. We demonstrate that prediction of cell outcomes is possible based on the video alone: our model achieves a balanced accuracy of 0.94 and an F1-score of 0.93. Attention and masking experiments further indicate that the signal predictive of the cell fate is not uniquely located in the final frames of a cell trajectory, as reliable predictions are possible up to 10 h before the event. Our analysis reveals distinct temporal distributions of predictive information in the mitotic and apoptotic sequences, as well as the role of cell morphology and p53 signaling in determining cell outcomes. Together, these findings demonstrate that attention-based temporal models enable accurate cell fate prediction while providing biologically interpretable insights into non-genetic determinants of cellular decision-making. The code is available at https://github.com/bozeklab/Cell-Fate-Prediction.
Chinese Translation
理解细胞命运的非遗传决定因素对于开发和改善癌症疗法至关重要,因为在相同治疗条件下,遗传上相同的细胞可能会表现出不同的结果。在本研究中,我们提出了一种基于深度学习的细胞命运预测方法,利用癌细胞群体在化疗处理下的原始长期活细胞录制数据。我们的Transformer模型被训练以直接从原始图像序列中预测细胞命运,而不依赖于预定义的形态或分子特征。除了分类之外,我们还引入了一个全面的可解释性框架,用于解释指导模型预测的时间和形态线索。我们展示了仅基于视频就可以预测细胞结果,我们的模型达到了0.94的平衡准确率和0.93的F1分数。注意力和掩蔽实验进一步表明,预测细胞命运的信号并不唯一位于细胞轨迹的最后帧,因为在事件发生前最多10小时仍然可以进行可靠的预测。我们的分析揭示了在有丝分裂和凋亡序列中预测信息的不同时间分布,以及细胞形态和p53信号在决定细胞结果中的作用。总的来说,这些发现表明,基于注意力的时间模型能够实现准确的细胞命运预测,同时提供对细胞决策非遗传决定因素的生物学可解释性见解。代码可在 https://github.com/bozeklab/Cell-Fate-Prediction 获取。
cs.CV / 105 / 2603.16566
VideoMatGen: PBR Materials through Joint Generative Modeling
VideoMatGen:通过联合生成建模生成基于物理的材料
Abstract
We present a method for generating physically-based materials for 3D shapes based on a video diffusion transformer architecture. Our method is conditioned on input geometry and a text description, and jointly models multiple material properties (base color, roughness, metallicity, height map) to form physically plausible materials. We further introduce a custom variational auto-encoder which encodes multiple material modalities into a compact latent space, which enables joint generation of multiple modalities without increasing the number of tokens. Our pipeline generates high-quality materials for 3D shapes given a text prompt, compatible with common content creation tools.
Chinese Translation
我们提出了一种基于视频扩散变换器架构的方法,用于为3D形状生成基于物理的材料。我们的方法以输入几何形状和文本描述为条件,联合建模多种材料属性(基础色、粗糙度、金属度、高度图),以形成物理上合理的材料。我们进一步引入了一个自定义的变分自编码器,将多种材料模态编码到一个紧凑的潜在空间中,这使得在不增加令牌数量的情况下,能够联合生成多种模态。我们的流程能够在给定文本提示的情况下,为3D形状生成高质量的材料,兼容常见的内容创作工具。
cs.CV / 106 / 2603.16570
Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration
Face2Scene:利用面部退化作为基于扩散的场景恢复的预言者
Abstract
Recent advances in image restoration have enabled high-fidelity recovery of faces from degraded inputs using reference-based face restoration models (Ref-FR). However, such methods focus solely on facial regions, neglecting degradation across the full scene, including body and background, which limits practical usability. Meanwhile, full-scene restorers often ignore degradation cues entirely, leading to underdetermined predictions and visual artifacts. In this work, we propose Face2Scene, a two-stage restoration framework that leverages the face as a perceptual oracle to estimate degradation and guide the restoration of the entire image. Given a degraded image and one or more identity references, we first apply a Ref-FR model to reconstruct high-quality facial details. From the restored-degraded face pair, we extract a face-derived degradation code that captures degradation attributes (e.g., noise, blur, compression), which is then transformed into multi-scale degradation-aware tokens. These tokens condition a diffusion model to restore the full scene in a single step, including the body and background. Extensive experiments demonstrate the superior effectiveness of the proposed method compared to state-of-the-art methods.
Chinese Translation
近期图像恢复的进展使得利用基于参考的面部恢复模型(Ref-FR)能够从退化输入中高保真地恢复面部。然而,这些方法仅关注面部区域,忽视了整个场景的退化,包括身体和背景,这限制了其实际应用。同时,完整场景恢复模型往往完全忽视退化线索,导致预测欠定和视觉伪影。在本研究中,我们提出了Face2Scene,一个两阶段的恢复框架,利用面部作为感知预言者来估计退化并指导整个图像的恢复。给定一幅退化图像和一个或多个身份参考,我们首先应用Ref-FR模型重建高质量的面部细节。从恢复后的面部与退化面部组成的图像对中,我们提取出一个面部衍生的退化编码,捕捉退化属性(例如噪声、模糊、压缩),然后将其转化为多尺度的退化感知标记。这些标记为扩散模型提供条件,以一步恢复整个场景,包括身体和背景。大量实验表明,所提出的方法在效果上优于现有的最先进方法。
cs.CV / 107 / 2603.16576
REFORGE: Multi-modal Attacks Reveal Vulnerable Concept Unlearning in Image Generation Models
REFORGE:多模态攻击揭示图像生成模型中的脆弱概念遗忘
Abstract
Recent progress in image generation models (IGMs) enables high-fidelity content creation but also amplifies risks, including the reproduction of copyrighted content and the generation of offensive content. Image Generation Model Unlearning (IGMU) mitigates these risks by removing harmful concepts without full retraining. Despite growing attention, the robustness under adversarial inputs, particularly image-side threats in black-box settings, remains underexplored. To bridge this gap, we present REFORGE, a black-box red-teaming framework that evaluates IGMU robustness via adversarial image prompts. REFORGE initializes stroke-based images and optimizes perturbations with a cross-attention-guided masking strategy that allocates noise to concept-relevant regions, balancing attack efficacy and visual fidelity. Extensive experiments across representative unlearning tasks and defenses demonstrate that REFORGE significantly improves attack success rate while achieving stronger semantic alignment and higher efficiency than involved baselines. These results expose persistent vulnerabilities in current IGMU methods and highlight the need for robustness-aware unlearning against multi-modal adversarial attacks. Our code is at: https://github.com/Imfatnoily/REFORGE.
Chinese Translation
近期图像生成模型(IGMs)的进展使得高保真内容创作成为可能,但也加大了风险,包括侵犯版权内容的再现和生成冒犯性内容。图像生成模型遗忘(IGMU)通过去除有害概念而无需完全重训练来缓解这些风险。尽管受到越来越多的关注,但在对抗性输入下的鲁棒性,特别是在黑箱环境中的图像侧威胁,仍然未得到充分探索。为了解决这一问题,我们提出了REFORGE,一个黑箱红队框架,通过对抗性图像提示评估IGMU的鲁棒性。REFORGE初始化基于笔画的图像,并采用交叉注意力引导的掩蔽策略优化扰动,将噪声分配到与概念相关的区域,从而平衡攻击效果和视觉保真度。在代表性的遗忘任务和防御措施上的广泛实验表明,REFORGE显著提高了攻击成功率,同时在语义对齐和效率上优于相关基线。这些结果揭示了当前IGMU方法中的持续脆弱性,并强调了针对多模态对抗攻击的鲁棒性意识遗忘的必要性。我们的代码可在 https://github.com/Imfatnoily/REFORGE 获取。
cs.CV / 108 / 2603.16592
On the Transfer of Collinearity to Computer Vision
共线性在计算机视觉中的转移
Abstract
Collinearity is a visual perception phenomenon in the human brain that amplifies spatially aligned edges arranged along a straight line. However, it remains unclear what purpose this principle serves for humans in the real world, and its utilization in computer vision and engineering applications is a largely unexplored field. In this work, our goal is to transfer the collinearity principle to computer vision, and we explore the potential usages of this novel principle for computer vision applications. We developed a prototype model to exemplify the principle, then tested it systematically, and benchmarked it in the context of four use cases. Our cases are selected to span a broad range of potential applications and scenarios: combining collinearity with deep learning (cases I and II), using collinearity with saliency models (case II), and using it as a feature detector (case I). In the first use case, we found that collinearity is able to improve the fault detection of wafers, obtaining a performance increase by a factor of 1.24 (a decrease of the error rate from 6.5% to 5.26%). In the second use case, we test defect recognition in nanotechnology materials and achieve a 3.2x performance increase via collinearity (deep learning, error from 21.65% to 6.64%), and also explore saliency models. As a third experiment, we cover occlusions; as a fourth experiment, we test ImageNet and observe that collinearity might not be very beneficial there. We can therefore assemble a list of scenarios for which collinearity is beneficial (wafers, nanotechnology, occlusions) and for which it is not (ImageNet). Hence, we infer that collinearity might be suitable for industrial applications, as it helps when the image structures of interest are man-made, since these often consist of lines. Our work provides another tool for computer vision, in the hope of capturing some of the power of human visual processing.
Chinese Translation
共线性是人脑中的一种视觉感知现象,它放大沿直线排列的空间对齐边缘。然而,目前尚不清楚人类在现实世界中为何会具备这一原则,并且它在计算机视觉和工程应用中的利用仍然是一个较为未开发的领域。在本研究中,我们的目标是将共线性原则转移到计算机视觉中,并探索这一新原则在计算机视觉应用中的潜在用途。我们开发了一个原型模型以示范该原则,随后进行了系统测试,并在四个使用案例的背景下进行了基准测试。我们的案例选择旨在覆盖广泛的潜在应用和场景:将共线性与深度学习相结合(案例 I 和 II),将共线性与显著性模型结合使用(案例 II),以及作为特征检测器(案例 I)。在第一个使用案例中,我们发现共线性能够改善晶圆的故障检测,通过共线性实现了1.24倍的性能提升(错误率从6.5%降至5.26%)。在第二个使用案例中,我们测试了纳米技术材料中的缺陷识别,通过共线性实现了3.2倍的性能提升(深度学习,错误率从21.65%降至6.64%),并探索了显著性模型。作为第三个实验,我们涵盖了遮挡情况;作为第四个实验,我们测试了ImageNet,并观察到共线性对ImageNet可能并没有太大益处。因此,我们可以列出共线性有益的场景(晶圆、纳米技术、遮挡)以及不利的场景(ImageNet)。由此我们推断,共线性可能适合工业应用,因为当感兴趣的图像结构是人造物时共线性会有所帮助,而这类结构通常由线条组成。我们的工作为计算机视觉提供了另一种工具,希望借此捕捉人类视觉处理的部分能力。
cs.CV / 109 / 2603.16596
FSMC-Pose: Frequency and Spatial Fusion with Multiscale Self-calibration for Cattle Mounting Pose Estimation
FSMC-Pose:用于奶牛骑乘姿态估计的频率与空间融合及多尺度自校准
Abstract
Mounting posture is an important visual indicator of estrus in dairy cattle. However, achieving reliable mounting pose estimation in real-world environments remains challenging due to cluttered backgrounds and frequent inter-animal occlusion. We present FSMC-Pose, a top-down framework that integrates a lightweight frequency-spatial fusion backbone, CattleMountNet, and a multiscale self-calibration head, SC2Head. Specifically, we design two algorithmic components for CattleMountNet: the Spatial Frequency Enhancement Block (SFEBlock) and the Receptive Aggregation Block (RABlock). SFEBlock separates cattle from cluttered backgrounds, while RABlock captures multiscale contextual information. The Spatial-Channel Self-Calibration Head (SC2Head) attends to spatial and channel dependencies and introduces a self-calibration branch to mitigate structural misalignment under inter-animal overlap. We construct a mounting dataset, MOUNT-Cattle, covering 1176 mounting instances, which follows the COCO format and supports drop-in training across pose estimation models. Using a comprehensive dataset that combines MOUNT-Cattle with the public NWAFU-Cattle dataset, FSMC-Pose achieves higher accuracy than strong baselines, with markedly lower computational and parameter costs, while maintaining real-time inference on commodity GPUs. Extensive experiments and qualitative analyses show that FSMC-Pose effectively captures and estimates cattle mounting pose in complex and cluttered environments. Dataset and code are available at https://github.com/elianafang/FSMC-Pose.
Chinese Translation
骑乘姿态是奶牛发情的重要视觉指标。然而,由于杂乱的背景和频繁的动物间遮挡,在实际环境中实现可靠的骑乘姿态估计仍然具有挑战性。我们提出了FSMC-Pose,这是一种自上而下的框架,整合了轻量级的频率-空间融合主干网络CattleMountNet和多尺度自校准头SC2Head。具体而言,我们为CattleMountNet设计了两个算法组件:空间频率增强块(SFEBlock)和感受野聚合块(RABlock)。SFEBlock用于将奶牛从杂乱背景中分离,而RABlock捕捉多尺度上下文信息。空间-通道自校准头(SC2Head)关注空间和通道依赖性,并引入自校准分支以减轻动物间重叠时的结构错位。我们构建了一个骑乘数据集MOUNT-Cattle,包含1176个骑乘实例,遵循COCO格式,支持在各类姿态估计模型中即插即用地训练。通过结合MOUNT-Cattle和公共的NWAFU-Cattle数据集的全面数据集,FSMC-Pose在准确性上超越了强有力的基线,且计算和参数成本显著更低,同时保持在消费级GPU上的实时推理。大量实验和定性分析表明,FSMC-Pose能够有效捕捉并估计在复杂和杂乱环境中的奶牛骑乘姿态。数据集和代码可在https://github.com/elianafang/FSMC-Pose获取。
cs.CV / 110 / 2603.16600
Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models
理由很重要:通过代理引导的评审为VLM奖励模型学习可迁移的评分标准
Abstract
Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into Reinforcement Learning (RL) to explicitly enhance rubric quality. Concretely, we train lightweight proxy agents (Proxy-SFT and Proxy-RL) that take a candidate rubric together with the original query and preference pair, and then predict the preference ordering using only the rubric as evidence. The proxy's prediction accuracy serves as a rubric-quality reward, incentivizing the model to produce rubrics that are internally consistent and transferable. With ~50k data samples, Proxy-GRM reaches state-of-the-art results on the VL-Reward Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench, outperforming the methods trained on four times the data. Ablations show Proxy-SFT is a stronger verifier than Proxy-RL, and implicit reward aggregation performs best. Crucially, the learned rubrics transfer to unseen evaluators, improving reward accuracy at test time without additional training. Our code is available at https://github.com/Qwen-Applications/Proxy-GRM.
Chinese Translation
用于视觉-语言模型(VLMs)的生成奖励模型(GRMs)通常通过三阶段流程评估输出:评分标准生成、基于标准的评分和最终裁决。然而,中间评分标准很少被直接优化。以往的研究通常将评分标准视为附带因素,或依赖于昂贵的LLM作为裁判的检查,而这些检查无法提供可微分的信号,且对训练过程的指导有限。我们提出了Proxy-GRM,它将代理引导的评分标准验证引入强化学习(RL),以明确提升评分标准的质量。具体而言,我们训练轻量级的代理智能体(Proxy-SFT和Proxy-RL),它们将候选评分标准与原始查询和偏好对一起输入,然后仅使用评分标准作为证据预测偏好排序。代理的预测准确性作为评分标准质量的奖励,激励模型生成内部一致且可转移的评分标准。通过约50,000个数据样本,Proxy-GRM在VL-Reward Bench、Multimodal Reward Bench和MM-RLHF-Reward Bench上达到了最先进的结果,超越了在四倍数据上训练的方法。消融实验表明,Proxy-SFT是比Proxy-RL更强的验证者,而隐式奖励聚合表现最佳。重要的是,学习到的评分标准能够转移到未见过的评估者,提高测试时的奖励准确性,而无需额外训练。我们的代码可在https://github.com/Qwen-Applications/Proxy-GRM获取。
cs.CV / 111 / 2603.16616
ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery
ACPV-Net:基于航空影像的无缝矢量地图生成的全类多边形矢量化
Abstract
We tackle the problem of generating a complete vector map representation from aerial imagery in a single run: producing polygons for all land-cover classes with shared boundaries and without gaps or overlaps. Existing polygonization methods are typically class-specific; extending them to multiple classes via per-class runs commonly leads to topological inconsistencies, such as duplicated edges, gaps, and overlaps. We formalize this new task as All-Class Polygonal Vectorization (ACPV) and release the first public benchmark, Deventer-512, with standardized metrics jointly evaluating semantic fidelity, geometric accuracy, vertex efficiency, per-class topological fidelity and global topological consistency. To realize ACPV, we propose ACPV-Net, a unified framework introducing a novel Semantically Supervised Conditioning (SSC) mechanism coupling semantic perception with geometric primitive generation, along with a topological reconstruction that enforces shared-edge consistency by design. While enforcing such strict topological constraints, ACPV-Net surpasses all class-specific baselines in polygon quality across classes on Deventer-512. It also applies to single-class polygonal vectorization without any architectural modification, achieving the best-reported results on WHU-Building. Data, code, and models will be released at: https://github.com/HeinzJiao/ACPV-Net.
Chinese Translation
我们解决了从航空影像中生成完整矢量地图表示的问题:在一次运行中为所有土地覆盖类别生成共享边界的多边形,并且没有间隙或重叠。现有的多边形化方法通常是特定于类别的;通过对每个类别进行单独处理来扩展它们,通常会导致拓扑不一致,例如重复边缘、间隙和重叠。我们将这一新任务形式化为全类多边形矢量化(All-Class Polygonal Vectorization, ACPV),并发布了第一个公共基准数据集Deventer-512,采用标准化指标共同评估语义保真度、几何精度、顶点效率、每类拓扑保真度和全局拓扑一致性。为了实现ACPV,我们提出了ACPV-Net,一个统一框架,引入了一种新颖的语义监督条件(Semantically Supervised Conditioning, SSC)机制,将语义感知与几何原始生成相结合,并通过设计强制共享边缘一致性的拓扑重建。尽管强制执行如此严格的拓扑约束,ACPV-Net在Deventer-512上在多边形质量方面超越了所有特定类别的基线。它也适用于单类多边形矢量化,无需任何架构修改,并在WHU-Building上取得了最佳报告结果。数据、代码和模型将发布于:https://github.com/HeinzJiao/ACPV-Net。
cs.CV / 112 / 2603.16620
TCATSeg: A Tooth Center-Wise Attention Network for 3D Dental Model Semantic Segmentation
TCATSeg:一种用于3D牙科模型语义分割的牙齿中心注意网络
Abstract
Accurate semantic segmentation of 3D dental models is essential for digital dentistry applications such as orthodontics and dental implants. However, due to complex tooth arrangements and similarities in shape among adjacent teeth, existing methods struggle with accurate segmentation, because they often focus on local geometry while neglecting global contextual information. To address this, we propose TCATSeg, a novel framework that combines local geometric features with global semantic context. We introduce a set of sparse yet physically meaningful superpoints to capture global semantic relationships and enhance segmentation accuracy. Additionally, we present a new dataset of 400 dental models, including pre-orthodontic samples, to evaluate the generalization of our method. Extensive experiments demonstrate that TCATSeg outperforms state-of-the-art approaches.
Chinese Translation
3D牙科模型的准确语义分割对于数字牙科应用(如正畸和牙齿植入)至关重要。然而,由于牙齿排列复杂以及相邻牙齿形状相似,现有方法在准确分割方面面临挑战,因为它们往往侧重于局部几何特征而忽视了全局上下文信息。为了解决这个问题,我们提出了TCATSeg,一个结合局部几何特征和全局语义上下文的新框架。我们引入了一组稀疏但具有物理意义的超点,以捕捉全局语义关系并提高分割精度。此外,我们还提供了一个包含400个牙科模型的新数据集,包括正畸前样本,以评估我们方法的泛化能力。大量实验表明,TCATSeg的性能优于现有的最先进方法。
cs.CV / 113 / 2603.16629
MLLM-based Textual Explanations for Face Comparison
基于MLLM的面部比较文本解释
Abstract
Multimodal Large Language Models (MLLMs) have recently been proposed as a means to generate natural-language explanations for face recognition decisions. While such explanations facilitate human interpretability, their reliability on unconstrained face images remains underexplored. In this work, we systematically analyze MLLM-generated explanations for the unconstrained face verification task on the challenging IJB-S dataset, with a particular focus on extreme pose variation and surveillance imagery. Our results show that even when MLLMs produce correct verification decisions, the accompanying explanations frequently rely on non-verifiable or hallucinated facial attributes that are not supported by visual evidence. We further study the effect of incorporating information from traditional face recognition systems, viz., scores and decisions, alongside the input images. Although such information improves categorical verification performance, it does not consistently lead to faithful explanations. To evaluate the explanations beyond decision accuracy, we introduce a likelihood-ratio-based framework that measures the evidential strength of textual explanations. Our findings highlight fundamental limitations of current MLLMs for explainable face recognition and underscore the need for a principled evaluation of reliable and trustworthy explanations in biometric applications. Code is available at https://github.com/redwankarimsony/LR-MLLMFR-Explainability.
Chinese Translation
多模态大型语言模型(MLLMs)最近被提出作为生成面部识别决策自然语言解释的一种手段。尽管这些解释有助于人类的可解释性,但它们在不受约束的面部图像上的可靠性仍然未被充分探讨。在本研究中,我们系统地分析了MLLM生成的解释在具有挑战性的IJB-S数据集上的不受约束面部验证任务,特别关注极端姿态变化和监控图像。我们的结果表明,即使MLLM产生正确的验证决策,伴随的解释也常常依赖于无法验证或幻觉的面部特征,这些特征并没有得到视觉证据的支持。我们进一步研究了结合传统面部识别系统的信息(即分数和决策)与输入图像的效果。尽管这些信息提高了分类验证性能,但并未始终导致可信的解释。为了超越决策准确性评估解释,我们引入了一种基于似然比的框架,测量文本解释的证据强度。我们的研究结果突显了当前MLLM在可解释面部识别中的基本局限性,并强调了在生物识别应用中对可靠和可信解释进行原则性评估的必要性。代码可在 https://github.com/redwankarimsony/LR-MLLMFR-Explainability 获取。
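The likelihood-ratio idea, measuring how strongly a piece of evidence supports "same identity" over "different identity", can be sketched with a toy score model. This is not the paper's framework; the Gaussian score distributions and all parameters below are purely hypothetical:

```python
import math

def gaussian_pdf(x, mu, sigma):
    # Density of a normal distribution at x.
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def evidential_lr(score, mated=(0.8, 0.1), nonmated=(0.2, 0.15)):
    """Likelihood ratio LR = p(score | same identity) / p(score | different).
    LR > 1 means the evidence favors a mated (same-person) pair."""
    return gaussian_pdf(score, *mated) / gaussian_pdf(score, *nonmated)
```

In a full system, the score would be derived from the textual explanation rather than a raw similarity value; the LR then quantifies the evidential strength of that explanation.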
cs.CV / 114 / 2603.16641
FlowComposer: Composable Flows for Compositional Zero-Shot Learning
FlowComposer:用于组合零样本学习的可组合流
Abstract
Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by recombining primitives learned from seen pairs. Recent CZSL methods built on vision-language models (VLMs) typically adopt parameter-efficient fine-tuning (PEFT). They apply visual disentanglers for decomposition and manipulate token-level prompts or prefixes to encode compositions. However, such PEFT-based designs suffer from two fundamental limitations: (1) Implicit Composition Construction, where composition is realized only via token concatenation or branch-wise prompt tuning rather than an explicit operation in the embedding space; (2) Remained Feature Entanglement, where imperfect disentanglement leaves attribute, object, and composition features mutually contaminated. Together, these issues limit the generalization ability of current CZSL models. In this paper, we are the first to systematically study flow matching for CZSL and introduce FlowComposer, a model-agnostic framework that learns two primitive flows to transport visual features toward attribute and object text embeddings, and a learnable Composer that explicitly fuses their velocity fields into a composition flow. To exploit the inevitable residual entanglement, we further devise a leakage-guided augmentation scheme that reuses leaked features as auxiliary signals. We thoroughly evaluate FlowComposer on three public CZSL benchmarks by integrating it as a plug-and-play component into various baselines, consistently achieving significant improvements.
Chinese Translation
组合零样本学习(CZSL)旨在通过重新组合从已见对中学习到的原始元素,识别未见的属性-对象组合。近期基于视觉-语言模型(VLMs)的CZSL方法通常采用参数高效微调(PEFT)。它们应用视觉解耦器进行分解,并操作令牌级提示或前缀以编码组合。然而,这种基于PEFT的设计存在两个基本局限性:(1)隐式组合构建,组合仅通过令牌连接或分支提示调优实现,而不是在嵌入空间中的显式操作;(2)残余特征纠缠,不完美的解耦使得属性、对象和组合特征相互污染。这些问题共同限制了当前CZSL模型的泛化能力。本文首次系统研究了CZSL的流匹配,并引入FlowComposer,一个模型无关的框架,学习两个原始流以将视觉特征传输到属性和对象文本嵌入,并设计一个可学习的Composer,显式地将它们的速度场融合成组合流。为了利用不可避免的残余纠缠,我们进一步设计了一种泄漏引导的增强方案,重用泄漏特征作为辅助信号。我们通过将FlowComposer作为即插即用组件集成到各种基线中,在三个公共CZSL基准上进行了全面评估,始终实现了显著的改进。
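The core mechanic, fusing the velocity fields of two primitive flows and integrating the result, can be sketched in a few lines. This is not FlowComposer itself (whose Composer is learned); the convex mixing below is a hypothetical stand-in for the learned fusion:

```python
def euler_transport(x, velocity_fn, steps=10):
    # Integrate dx/dt = v(x, t) with forward Euler over t in [0, 1].
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def compose_velocities(v_attr_fn, v_obj_fn, alpha):
    """Toy 'Composer': mix the attribute and object velocity fields.
    A learned Composer would replace this fixed convex combination."""
    def v_comp(x, t):
        va, vo = v_attr_fn(x, t), v_obj_fn(x, t)
        return [alpha * a + (1 - alpha) * o for a, o in zip(va, vo)]
    return v_comp
```

Transporting a visual feature along the composed field lands it between the attribute and object targets, which is the geometric intuition behind explicit composition in embedding space.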
cs.CV / 115 / 2603.16645
BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection
BUSSARD:用于双射通用场景特定异常关系检测的归一化流
Abstract
We propose Bijective Universal Scene-Specific Anomalous Relationship Detection (BUSSARD), a normalizing flow-based model for detecting anomalous relations in scene graphs, generated from images. Our work follows a multimodal approach, embedding object and relationship tokens from scene graphs with a language model to leverage semantic knowledge from the real world. A normalizing flow model is used to learn bijective transformations that map object-relation-object triplets from scene graphs to a simple base distribution (typically Gaussian), allowing anomaly detection through likelihood estimation. We evaluate our approach on the SARD dataset containing office and dining room scenes. Our method achieves around 10% better AUROC results compared to the current state-of-the-art model, while simultaneously being five times faster. Through ablation studies, we demonstrate superior robustness and universality, particularly regarding the use of synonyms, with our model maintaining stable performance while the baseline shows 17.5% deviation. This work demonstrates the strong potential of learning-based methods for relationship anomaly detection in scene graphs. Our code is available at https://github.com/mschween/BUSSARD .
Chinese Translation
我们提出了双射通用场景特定异常关系检测(BUSSARD),这是一种基于归一化流的模型,用于检测从图像生成的场景图中的异常关系。我们的工作采用多模态方法,将场景图中的对象和关系标记嵌入语言模型,以利用来自现实世界的语义知识。使用归一化流模型学习双射变换,将场景图中的对象-关系-对象三元组映射到简单的基础分布(通常是高斯分布),从而通过似然估计实现异常检测。我们在包含办公室和餐厅场景的SARD数据集上评估了我们的方法。与当前最先进的模型相比,我们的方法在AUROC结果上提高了约10%,同时速度快了五倍。通过消融研究,我们展示了更强的鲁棒性和普遍性,特别是在同义词的使用方面,我们的模型在性能上保持稳定,而基线模型则显示出17.5%的偏差。这项工作展示了基于学习的方法在场景图中进行关系异常检测的强大潜力。我们的代码可在 https://github.com/mschween/BUSSARD 获取。
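The change-of-variables likelihood underlying this style of anomaly scoring can be illustrated in one dimension. This is not BUSSARD's model (which embeds scene-graph triplets with a language model); it is a minimal sketch of how a bijective map plus a Gaussian base density yields an anomaly score:

```python
import math

def gaussian_log_prob(z):
    # Log density of the standard normal base distribution.
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def flow_log_likelihood(x, mu, log_sigma):
    # Bijective affine map z = (x - mu) * exp(-log_sigma); the change of
    # variables adds log|det Jacobian| = -log_sigma to the base log-density.
    z = (x - mu) * math.exp(-log_sigma)
    return gaussian_log_prob(z) - log_sigma

def is_anomalous(x, mu, log_sigma, threshold):
    # Low likelihood under the learned flow flags an anomalous relation.
    return flow_log_likelihood(x, mu, log_sigma) < threshold
```

A real normalizing flow stacks many such invertible transforms with learned parameters, but the detection rule is the same thresholded log-likelihood.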
cs.CV / 116 / 2603.16649
Mixture of Style Experts for Diverse Image Stylization
多样化图像风格化的风格专家混合模型
Abstract
Diffusion-based stylization has advanced significantly, yet existing methods are limited to color-driven transformations, neglecting complex semantics and material details. We introduce StyleExpert, a semantic-aware framework based on the Mixture of Experts (MoE). Our framework employs a unified style encoder, trained on our large-scale dataset of content-style-stylized triplets, to embed diverse styles into a consistent latent space. This embedding is then used to condition a similarity-aware gating mechanism, which dynamically routes styles to specialized experts within the MoE architecture. Leveraging this MoE architecture, our method adeptly handles diverse styles spanning multiple semantic levels, from shallow textures to deep semantics. Extensive experiments show that StyleExpert outperforms existing approaches in preserving semantics and material details, while generalizing to unseen styles. Our code and collected images are available at the project page: https://hh-lg.github.io/StyleExpert-Page/.
Chinese Translation
基于扩散的风格化技术已取得显著进展,但现有方法仅限于基于颜色的转换,忽视了复杂的语义和材料细节。我们提出了StyleExpert,这是一种基于专家混合模型(Mixture of Experts, MoE)的语义感知框架。我们的框架采用统一的风格编码器,该编码器在我们的大规模内容-风格-风格化三元组数据集上进行训练,以将多样化的风格嵌入到一致的潜在空间中。该嵌入随后用于调节一个基于相似性的门控机制,该机制动态地将风格路由到MoE架构中的专业专家。利用这一MoE架构,我们的方法能够灵活处理跨越多个语义层次的多样化风格,从浅层纹理到深层语义。大量实验表明,StyleExpert在保留语义和材料细节方面优于现有方法,同时能够推广到未见过的风格。我们的代码和收集的图像可在项目页面获取:https://hh-lg.github.io/StyleExpert-Page/。
cs.CV / 117 / 2603.16652
Efficient Brood Cell Detection in Layer Trap Nests for Bees and Wasps: Balancing Labeling Effort and Species Coverage
蜜蜂和黄蜂层陷巢中的高效幼虫细胞检测:平衡标注工作量与物种覆盖率
Abstract
Monitoring cavity-nesting wild bees and wasps is vital for biodiversity research and conservation. Layer trap nests (LTNs) are emerging as a valuable tool to study the abundance and species richness of these insects, offering insights into their nesting activities and ecological needs. However, manually evaluating LTNs to detect and classify brood cells is labor-intensive and time-consuming. To address this, we propose a deep learning based approach for efficient brood cell detection and classification in LTNs. LTNs present additional challenges due to densely packed brood cells, leading to a high labeling effort per image. Moreover, we observe a significant imbalance in class distribution, with common species having notably more occurrences than rare species. Comprehensive labeling of common species is time-consuming and exacerbates data imbalance, while partial labeling introduces data incompleteness which degrades model performance. To reduce labeling effort and mitigate the impact of unlabeled data, we introduce a novel Constrained False Positive Loss (CFPL) strategy. CFPL dynamically masks predictions from unlabeled data, preventing them from interfering with the classification loss during training. We evaluate our approach on a dataset of 712 LTN images collected over one season, covering 28 fine-grained classes describing the taxonomy and status of brood cells. To minimize labeling effort, we limit the training set to a maximum of 300 labels per class. Experimental results demonstrate that deep learning can be effectively used to detect brood cells in LTNs. Our CFPL method further improves performance and balances model accuracy and labeling effort while also mitigating class imbalance.
Chinese Translation
监测穴居野生蜜蜂和黄蜂对于生物多样性研究和保护至关重要。层陷巢(Layer Trap Nests, LTNs)作为研究这些昆虫丰度和物种丰富度的有价值工具,提供了对其筑巢活动和生态需求的深入了解。然而,手动评估LTNs以检测和分类幼虫细胞是一项劳动密集且耗时的工作。为此,我们提出了一种基于深度学习的高效幼虫细胞检测和分类方法。LTNs由于幼虫细胞密集堆积,给每幅图像带来了较高的标注工作量。此外,我们观察到类别分布存在显著不平衡,常见物种的出现次数明显多于稀有物种。对常见物种进行全面标注耗时较长,并加剧了数据不平衡,而部分标注则引入了数据不完整性,降低了模型性能。为了减少标注工作量并减轻未标注数据的影响,我们引入了一种新颖的约束假阳性损失(Constrained False Positive Loss, CFPL)策略。CFPL动态屏蔽来自未标注数据的预测,防止其在训练过程中干扰分类损失。我们在一个包含712幅LTN图像的数据集上评估了我们的方法,该数据集覆盖了28个细粒度类别,描述了幼虫细胞的分类和状态。为了最小化标注工作量,我们将训练集限制为每个类别最多300个标签。实验结果表明,深度学习可以有效用于检测LTNs中的幼虫细胞。我们的CFPL方法进一步提高了性能,平衡了模型的准确性和标注工作量,同时减轻了类别不平衡问题。
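The masking idea behind CFPL, excluding unlabeled regions from the classification loss so confident detections there are not punished as false positives, can be sketched as follows. This is a hypothetical simplification, not the paper's loss; in the actual detector the mask would apply per predicted box:

```python
import math

def masked_ce_loss(probs, labels):
    """Cross-entropy over labeled samples only. A label of None marks an
    unlabeled brood cell whose prediction is masked out of the loss,
    so partial labeling does not create spurious false-positive penalties."""
    losses = [-math.log(p[y]) for p, y in zip(probs, labels) if y is not None]
    return sum(losses) / len(losses) if losses else 0.0
```

With comprehensive labeling the mask is never triggered and this reduces to standard cross-entropy.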
cs.CV / 118 / 2603.16653
HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models
HeBA:用于强健视觉-语言模型的异构瓶颈适配器
Abstract
Adapting large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks often suffers from a "one-size-fits-all" architectural approach, where visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the modalities -- spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: It processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while distinctively processing text tokens via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: Unlike standard expanding adapters, HeBA employs a compression bottleneck (D -> D/4) that explicitly forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: We challenge the restrictive zero-initialization paradigm, utilizing a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without compromising the frozen backbone's pre-trained knowledge. Extensive experiments demonstrate that HeBA's architecturally specialized design achieves superior stability and accuracy, establishing a new state-of-the-art on 11 few-shot benchmarks. Code is available at https://github.com/Jahid12012021/VLM-HeBA.
Chinese Translation
将像 CLIP 这样的大规模视觉-语言模型(VLMs)适配到下游任务时,通常采用“一刀切”的架构方法,其中视觉和文本标记均由宽而通用的适配器统一处理。我们认为这种同质性忽视了模态的独特结构特性——图像中的空间局部性与文本中的语义密度。为了解决这个问题,我们提出了 HeBA(异构瓶颈适配器),这是一个统一的架构框架,引入了模态特定的结构性归纳偏置。HeBA 通过三个关键的架构创新与传统设计相区分:(1)异构性:通过 2D 深度可分卷积处理视觉标记,以保持空间相关性,同时通过密集线性投影来处理文本标记,以捕捉语义关系;(2)瓶颈正则化:与标准扩展适配器不同,HeBA 采用压缩瓶颈(D -> D/4),显式地强迫模型学习紧凑且稳健的特征,并作为结构正则化器;(3)主动梯度初始化:我们挑战限制性的零初始化范式,采用 Kaiming 初始化策略,确保足够的初始梯度流以加速收敛,同时不损害冻结骨干网的预训练知识。广泛的实验表明,HeBA 凭借架构专门化的设计实现了更优的稳定性和准确性,在 11 个少样本基准上建立了新的最先进水平。代码可在 https://github.com/Jahid12012021/VLM-HeBA 获取。
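The D -> D/4 compression bottleneck with a residual connection and Kaiming-style nonzero initialization can be sketched in plain Python. This is a toy forward pass, not the paper's adapter; the dimensions, ReLU activation, and helper names are illustrative assumptions:

```python
import random

def matvec(W, x):
    # Dense matrix-vector product over nested lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def kaiming_rows(n_out, n_in, rng):
    # Kaiming-style scale sqrt(2 / fan_in): nonzero init keeps
    # initial gradients flowing, unlike zero-initialized adapters.
    scale = (2.0 / n_in) ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]

def bottleneck_adapter(x, W_down, W_up):
    # D -> D/4 compression, ReLU, D/4 -> D expansion, residual add.
    h = [max(0.0, v) for v in matvec(W_down, x)]
    return [xi + ui for xi, ui in zip(x, matvec(W_up, h))]
```

Note that zeroing the up-projection reduces the adapter to the identity (the residual path), which is why nonzero initialization is needed for the adapter branch to contribute gradient signal from the start.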
cs.CV / 119 / 2603.16662
Spectral Property-Driven Data Augmentation for Hyperspectral Single-Source Domain Generalization
基于光谱特性的数据增强方法用于高光谱单源领域泛化
Abstract
While hyperspectral images (HSI) benefit from numerous spectral channels that provide rich information for classification, the increased dimensionality and sensor variability make them more sensitive to distributional discrepancies across domains, which in turn can affect classification performance. To tackle this issue, hyperspectral single-source domain generalization (SDG) typically employs data augmentation to simulate potential domain shifts and enhance model robustness under the condition of single-source domain training data availability. However, blind augmentation may produce samples misaligned with real-world scenarios, while excessive emphasis on realism can suppress diversity, highlighting a tradeoff between realism and diversity that limits generalization to target domains. To address this challenge, we propose a spectral property-driven data augmentation (SPDDA) that explicitly accounts for the inherent properties of HSI, namely the device-dependent variation in the number of spectral channels and the mixing of adjacent channels. Specifically, SPDDA employs a spectral diversity module that resamples data from the source domain along the spectral dimension to generate samples with varying spectral channels, and constructs a channel-wise adaptive spectral mixer by modeling inter-channel similarity, thereby avoiding fixed augmentation patterns. To further enhance the realism of the augmented samples, we propose a spatial-spectral co-optimization mechanism, which jointly optimizes a spatial fidelity constraint and a spectral continuity self-constraint. Moreover, the weight of the spectral self-constraint is adaptively adjusted based on the spatial counterpart, thus preventing over-smoothing in the spectral dimension and preserving spatial structure. Extensive experiments conducted on three remote sensing benchmarks demonstrate that SPDDA outperforms state-of-the-art methods.
Chinese Translation
高光谱图像(HSI)由于其丰富的光谱通道为分类提供了大量信息,但其高维特性和传感器的变异性使其对领域间的分布差异更加敏感,这反过来又会影响分类性能。为了解决这一问题,高光谱单源领域泛化(SDG)通常采用数据增强技术来模拟潜在的领域转变,从而提高模型在单源领域训练数据可用情况下的鲁棒性。然而,盲目的数据增强可能会生成与真实场景不符的样本,而过度强调真实感又可能抑制多样性,突显了真实感与多样性之间的权衡,这限制了对目标领域的泛化。为应对这一挑战,我们提出了一种光谱特性驱动的数据增强方法(SPDDA),该方法明确考虑了高光谱图像的固有特性,即光谱通道数量的设备依赖性变化和相邻通道的混合。具体而言,SPDDA采用光谱多样性模块,从源领域沿光谱维度重新采样数据,以生成具有不同光谱通道的样本,并通过建模通道间相似性构建通道自适应光谱混合器,从而避免固定的增强模式。为了进一步提升增强样本的真实感,我们提出了一种空间-光谱共同优化机制,该机制联合优化空间保真约束和光谱连续性自约束。此外,光谱自约束的权重根据空间对应部分自适应调整,从而防止光谱维度的过度平滑,并保持空间结构。在三个遥感基准数据集上进行的大量实验表明,SPDDA的性能优于现有的最先进方法。
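The two augmentation ingredients described in the abstract above, spectral-dimension resampling and similarity-weighted channel mixing, can be sketched as follows. This is an illustrative simplification under our own assumptions, not the paper's implementation; the function names are ours.

```python
import numpy as np

def spectral_resample(cube, n_out):
    """Resample an HSI cube (H, W, C) to n_out spectral channels by
    linear interpolation along the spectral axis, simulating sensors
    with different channel counts."""
    h, w, c = cube.shape
    src = np.linspace(0.0, 1.0, c)
    dst = np.linspace(0.0, 1.0, n_out)
    flat = cube.reshape(-1, c)
    out = np.stack([np.interp(dst, src, row) for row in flat])
    return out.reshape(h, w, n_out)

def similarity_mixer(cube, temperature=0.1):
    """Mix each channel with the others, weighted by inter-channel
    cosine similarity, instead of a fixed neighbourhood pattern."""
    h, w, c = cube.shape
    bands = cube.reshape(-1, c).T                       # (C, H*W)
    norm = np.linalg.norm(bands, axis=1, keepdims=True) + 1e-8
    sim = (bands / norm) @ (bands / norm).T             # (C, C) similarity
    wts = np.exp(sim / temperature)
    wts /= wts.sum(axis=1, keepdims=True)               # row-stochastic mix
    return (wts @ bands).T.reshape(h, w, c)
```

In this reading, resampling changes the channel count while the mixer keeps it fixed but adapts the mixing weights per channel pair.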
cs.CV / 120 / 2603.16664
Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation
Kestrel:基于视觉定位的自我优化以减轻大型视觉语言模型的幻觉
Abstract
Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly limits their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive for larger models, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with an evidence-verified self-refinement mechanism. In detail, Kestrel first collects explicit visual evidence and converts tool outputs into reusable and structured textual evidence. Second, to take full advantage of this evidence, Kestrel verifies it via an LVLM judge for evidence checking, then iteratively self-refines answers based on verified evidence to reduce the risk of over-correction. Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., average +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis -- e.g., both the integrated self-refinement module and grounding agent contributing an average +2.0% gain on POPE.
Chinese Translation
大型视觉语言模型(LVLMs)虽然变得越来越强大,但在多模态任务中仍然容易出现幻觉,这显著限制了它们的应用。由于训练这些LVLM以避免幻觉对于更大模型来说变得过于昂贵,无训练方法提供了一种廉价且灵活的解决方案,然而现有基于解码或工具使用的方法往往带来有限的收益和/或较弱的可解释性。我们提出了Kestrel,这是一个无训练的LVLM幻觉减轻框架,结合了显式视觉定位代理和经过证据验证的自我优化机制。具体而言,Kestrel首先收集显式视觉证据,并将工具输出转换为可重用和结构化的文本证据。其次,为了充分利用这些证据,Kestrel通过LVLM评估者进行证据检查,然后基于经过验证的证据迭代自我优化答案,以降低过度修正的风险。大量实验表明,Kestrel在幻觉基准测试中相较于强基线提高了性能(例如,使用Qwen3-VL时在POPE上平均提高3.31%,在MME-Hallucination上提高28.34),同时提供了透明的验证痕迹用于幻觉诊断和分析——例如,集成的自我优化模块和视觉定位代理在POPE上平均贡献了2.0%的收益。
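The verify-then-refine control flow described above can be summarized as the following schematic. The helper signatures are hypothetical placeholders of ours; in the paper the evidence collector, judge, and refiner are LVLM/tool-based components.

```python
def evidence_refine(question, answer, collect_evidence, judge, refine,
                    max_rounds=3):
    """Schematic evidence-verified self-refinement loop (illustrative).
    collect_evidence(question)         -> list of textual evidence
    judge(evidence, answer)            -> (verified_evidence, consistent)
    refine(question, answer, verified) -> revised answer
    Stopping once the answer agrees with verified evidence limits the
    risk of over-correction."""
    evidence = collect_evidence(question)
    for _ in range(max_rounds):
        verified, consistent = judge(evidence, answer)
        if consistent:                 # answer supported by evidence: stop
            break
        answer = refine(question, answer, verified)
    return answer
```

A toy run with stub callables illustrates the loop terminating once the answer matches the verified evidence.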
cs.CV / 121 / 2603.16666
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
快速世界动作模型:世界动作模型在测试时是否需要未来想象?
Abstract
World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control because they explicitly model how visual observations may evolve under action. Most existing WAMs follow an imagine-then-execute paradigm, incurring substantial test-time latency from iterative video denoising, yet it remains unclear whether explicit future imagination is actually necessary for strong action performance. In this paper, we ask whether WAMs need explicit future imagination at test time, or whether their benefit comes primarily from video modeling during training. We disentangle the role of video modeling during training from explicit future generation during inference by proposing \textbf{Fast-WAM}, a WAM architecture that retains video co-training during training but skips future prediction at test time. We further instantiate several Fast-WAM variants to enable a controlled comparison of these two factors. Across these variants, we find that Fast-WAM remains competitive with imagine-then-execute variants, while removing video co-training causes a much larger performance drop. Empirically, Fast-WAM achieves competitive results with state-of-the-art methods both on simulation benchmarks (LIBERO and RoboTwin) and real-world tasks, without embodied pretraining. It runs in real time with 190ms latency, over 4$\times$ faster than existing imagine-then-execute WAMs. These results suggest that the main value of video prediction in WAMs may lie in improving world representations during training rather than generating future observations at test time. Project page: https://yuantianyuan01.github.io/FastWAM/
Chinese Translation
世界动作模型(World Action Models, WAMs)作为一种有前景的替代方案,正在逐渐取代视觉-语言-动作(Vision-Language-Action, VLA)模型用于具身控制,因为它们明确建模了视觉观察在动作下的演变过程。现有的大多数WAMs遵循"想象-再执行"(imagine-then-execute)范式,这导致在测试时由于迭代视频去噪而产生显著的延迟,但尚不清楚显式的未来想象是否确实对强大的动作性能是必要的。本文探讨了WAMs在测试时是否需要显式的未来想象,或者它们的优势是否主要来自于训练期间的视频建模。我们通过提出Fast-WAM,一种在训练期间保留视频共同训练但在测试时跳过未来预测的WAM架构,来区分训练期间视频建模的作用与推理期间显式未来生成的作用。我们进一步实例化了几种Fast-WAM变体,以便对这两个因素进行受控比较。在这些变体中,我们发现Fast-WAM在性能上与想象-再执行变体保持竞争力,而去除视频共同训练则导致性能大幅下降。从实证结果来看,Fast-WAM在模拟基准(LIBERO和RoboTwin)和真实世界任务中均取得了与最先进方法相当的结果,且无需具身预训练。其运行延迟为190毫秒,速度比现有的想象-再执行WAMs快超过4倍。这些结果表明,视频预测在WAMs中的主要价值可能在于改善训练期间的世界表示,而非在测试时生成未来观察。项目页面:https://yuantianyuan01.github.io/FastWAM/
cs.CV / 122 / 2603.16671
$x^2$-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space
$x^2$-融合:事件边缘空间中的跨模态和跨维度流估计
Abstract
Estimating dense 2D optical flow and 3D scene flow is essential for dynamic scene understanding. Recent work combines images, LiDAR, and event data to jointly predict 2D and 3D motion, yet most approaches operate in separate heterogeneous feature spaces. Without a shared latent space that all modalities can align to, these systems rely on multiple modality-specific blocks, leaving cross-sensor mismatches unresolved and making fusion unnecessarily complex. Event cameras naturally provide a spatiotemporal edge signal, which we can treat as an intrinsic edge field to anchor a unified latent representation, termed the Event Edge Space. Building on this idea, we introduce $x^2$-Fusion, which reframes multimodal fusion as representation unification: event-derived spatiotemporal edges define an edge-centric homogeneous space, and image and LiDAR features are explicitly aligned in this shared representation. Within this space, we perform reliability-aware adaptive fusion to estimate modality reliability and emphasize stable cues under degradation. We further employ cross-dimension contrast learning to tightly couple 2D optical flow with 3D scene flow. Extensive experiments on both synthetic and real benchmarks show that $x^2$-Fusion achieves state-of-the-art accuracy under standard conditions and delivers substantial improvements in challenging scenarios.
Chinese Translation
估计密集的2D光流和3D场景流对于动态场景理解至关重要。近期的研究结合了图像、激光雷达(LiDAR)和事件数据,以共同预测2D和3D运动,然而大多数方法在不同的异构特征空间中操作。由于缺乏一个所有模态都能对齐的共享潜在空间,这些系统依赖于多个特定模态的模块,导致跨传感器的不匹配问题未得到解决,并使得融合过程变得不必要的复杂。事件相机自然提供了时空边缘信号,我们可以将其视为一个内在的边缘场,以锚定一个统一的潜在表示,称为事件边缘空间(Event Edge Space)。基于这一思想,我们引入了$x^2$-融合,将多模态融合重新定义为表示统一:事件派生的时空边缘定义了一个以边缘为中心的同质空间,而图像和激光雷达特征在这个共享表示中被显式对齐。在这个空间内,我们执行可靠性感知的自适应融合,以估计模态的可靠性,并强调在退化情况下的稳定线索。我们进一步采用跨维度对比学习,将2D光流与3D场景流紧密耦合。大量在合成和真实基准上的实验表明,$x^2$-融合在标准条件下达到了最先进的准确性,并在具有挑战性的场景中显著提升了性能。
cs.CV / 123 / 2603.16679
HMAR: Hierarchical Modality-Aware Expert and Dynamic Routing Medical Image Retrieval Architecture
HMAR:层次化模态意识专家与动态路由医学图像检索架构
Abstract
Medical image retrieval (MIR) is a critical component of computer-aided diagnosis, yet existing systems suffer from three persistent limitations: uniform feature encoding that fails to account for the varying clinical importance of anatomical structures, ambiguous similarity metrics based on coarse classification labels, and an exclusive focus on global image similarity that cannot meet the clinical demand for fine-grained region-specific retrieval. We propose HMAR (Hierarchical Modality-Aware Expert and Dynamic Routing), an adaptive retrieval framework built on a Mixture-of-Experts (MoE) architecture. HMAR employs a dual-expert mechanism: Expert0 extracts global features for holistic similarity matching, while Expert1 learns position-invariant local representations for precise lesion-region retrieval. A two-stage contrastive learning strategy eliminates the need for expensive bounding-box annotations, and a sliding-window matching algorithm enables dense local comparison at inference time. Hash codes are generated via Kolmogorov-Arnold Network (KAN) layers for efficient Hamming-distance search. Experiments on the RadioImageNet-CT dataset (16 clinical patterns, 29,903 images) show that HMAR achieves mean Average Precision (mAP) of 0.711 and 0.724 for 64-bit and 128-bit hash codes, improving over the state-of-the-art ACIR method by 0.7% and 1.1%, respectively.
Chinese Translation
医学图像检索(MIR)是计算机辅助诊断的重要组成部分,但现有系统存在三项持续的局限性:统一特征编码未能考虑解剖结构在临床上的重要性差异;基于粗略分类标签的模糊相似度度量;以及仅关注全局图像相似性,无法满足临床对细粒度、区域特定检索的需求。我们提出了HMAR(层次化模态意识专家与动态路由),这是一个基于混合专家(Mixture-of-Experts,MoE)架构构建的自适应检索框架。HMAR采用双专家机制:Expert0 提取全局特征以进行整体相似匹配,而 Expert1 学习位置不变的局部表示以实现精确的病灶区域检索。两阶段对比学习策略消除了对昂贵边界框注释的需求,并且滑动窗口匹配算法在推理时实现了稠密的局部比较。通过Kolmogorov-Arnold Network(KAN)层生成哈希码,以便于高效的汉明距离搜索。在RadioImageNet-CT数据集(16种临床模式,29,903张图像)上的实验表明,HMAR在64位和128位哈希码上的均值平均精度(mAP)分别达到了0.711和0.724,较最先进的ACIR方法分别提高了0.7%和1.1%。
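Retrieval over binary hash codes, the last step in the pipeline above, reduces to XOR plus popcount. A minimal sketch of ours (not the paper's code) using packed bit codes:

```python
import numpy as np

def pack_codes(bits):
    """Pack a (N, B) array of 0/1 hash bits into uint8 bytes (N, B//8)."""
    return np.packbits(bits.astype(np.uint8), axis=1)

def hamming_topk(query_bits, db_bits, k=5):
    """Return indices of the k database codes closest to the query in
    Hamming distance, plus all distances."""
    q = pack_codes(query_bits[None, :])
    db = pack_codes(db_bits)
    # XOR the bytes, then count set bits via a 256-entry popcount table
    popcnt = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(1)
    dist = popcnt[np.bitwise_xor(db, q)].sum(axis=1)
    return np.argsort(dist, kind="stable")[:k], dist
```

With 64- or 128-bit codes, each comparison touches only 8 or 16 bytes, which is what makes Hamming search cheap at scale.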
cs.CV / 124 / 2603.16711
Search2Motion: Training-Free Object-Level Motion Control via Attention-Consensus Search
Search2Motion: 基于注意力共识搜索的无训练对象级运动控制
Abstract
We present Search2Motion, a training-free framework for object-level motion editing in image-to-video generation. Unlike prior methods requiring trajectories, bounding boxes, masks, or motion fields, Search2Motion adopts target-frame-based control, leveraging first-last-frame motion priors to realize object relocation while preserving scene stability without fine-tuning. Reliable target-frame construction is achieved through semantic-guided object insertion and robust background inpainting. We further show that early-step self-attention maps predict object and camera dynamics, offering interpretable user feedback and motivating ACE-Seed (Attention Consensus for Early-step Seed selection), a lightweight search strategy that improves motion fidelity without look-ahead sampling or external evaluators. Noting that existing benchmarks conflate object and camera motion, we introduce S2M-DAVIS and S2M-OMB for stable-camera, object-only evaluation, alongside FLF2V-obj metrics that isolate object artifacts without requiring ground-truth trajectories. Search2Motion consistently outperforms baselines on FLF2V-obj and VBench.
Chinese Translation
我们提出了Search2Motion,这是一种用于图像到视频生成的对象级运动编辑的无训练框架。与以往需要轨迹、边界框、掩码或运动场的方法不同,Search2Motion采用基于目标帧的控制,利用第一帧和最后一帧的运动先验来实现对象重定位,同时保持场景稳定而无需微调。通过语义引导的对象插入和鲁棒的背景修复,我们实现了可靠的目标帧构建。我们进一步显示,早期步骤的自注意力图能够预测对象和相机动态,提供可解释的用户反馈,并激励了ACE-Seed(早期步骤种子选择的注意力共识),这一轻量级搜索策略在不进行前瞻采样或外部评估的情况下提高了运动保真度。注意到现有基准混合了对象和相机运动,我们引入了S2M-DAVIS和S2M-OMB用于稳定相机、仅对象的评估,以及FLF2V-obj指标,该指标在不要求真实轨迹的情况下隔离对象伪影。Search2Motion在FLF2V-obj和VBench上的表现始终优于基线。
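The idea behind ACE-Seed, scoring candidate seeds by how much their early-step self-attention maps agree, can be sketched as follows. Using mean pairwise cosine similarity as the consensus measure is our assumption for illustration; the paper's exact score may differ.

```python
import numpy as np

def consensus_score(attn_maps):
    """Mean pairwise cosine similarity across a stack of early-step
    attention maps of shape (steps, H, W)."""
    flat = attn_maps.reshape(attn_maps.shape[0], -1)
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    sim = flat @ flat.T
    n = sim.shape[0]
    return (sim.sum() - n) / (n * (n - 1))   # average, excluding self-pairs

def select_seed(per_seed_maps):
    """Pick the seed index whose early-step attention maps agree most,
    without look-ahead sampling of full videos."""
    return int(np.argmax([consensus_score(m) for m in per_seed_maps]))
```

A seed whose attention maps stay coherent across early denoising steps scores near 1; a seed producing incoherent maps scores lower and is rejected.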
cs.CV / 125 / 2603.16719
Emotion-Aware Classroom Quality Assessment Leveraging IoT-Based Real-Time Student Monitoring
基于物联网的实时学生监测的情感感知课堂质量评估
Abstract
This study presents a high-throughput, real-time multi-agent affective computing framework designed to enhance classroom learning through emotional state monitoring. As large classroom sizes and limited teacher-student interaction increasingly challenge educators, there is a growing need for scalable, data-driven tools capable of capturing students' emotional and engagement patterns in real time. The system was evaluated using the Classroom Emotion Dataset, consisting of 1,500 labeled images and 300 classroom detection videos. Tailored for IoT devices, the system addresses load balancing and latency challenges through efficient real-time processing. Field testing was conducted across three educational institutions in a large metropolitan area: a primary school (hereafter school A), a secondary school (school B), and a high school (school C). The system demonstrated robust performance, detecting up to 50 faces at 25 FPS and achieving 88% overall accuracy in classifying classroom engagement states. Implementation results showed positive outcomes, with favorable feedback from students, teachers, and parents regarding improved classroom interaction and teaching adaptation. Key contributions of this research include establishing a practical, IoT-based framework for emotion-aware learning environments and introducing the 'Classroom Emotion Dataset' to facilitate further validation and research.
Chinese Translation
本研究提出了一种高通量、实时的多智能体情感计算框架,旨在通过情感状态监测提升课堂学习。随着大班规模和有限的师生互动日益成为教育工作者的挑战,迫切需要可扩展的数据驱动工具,能够实时捕捉学生的情感和参与模式。该系统使用课堂情感数据集进行评估,该数据集包含1500张标记图像和300段课堂检测视频。该系统专为物联网设备量身定制,通过高效的实时处理解决负载均衡和延迟挑战。我们在一个大型城市地区的三所教育机构进行了现场测试:一所小学(以下简称学校A)、一所中学(学校B)和一所高中(学校C)。系统表现出强大的性能,能够以25帧每秒的速度检测多达50张面孔,并在分类课堂参与状态时达到88%的整体准确率。实施结果显示出积极的效果,学生、教师和家长对课堂互动和教学适应性的改善给予了积极反馈。本研究的主要贡献包括建立一个实用的基于物联网的情感感知学习环境框架,并引入“课堂情感数据集”以促进进一步的验证和研究。
cs.CV / 126 / 2603.16736
World Reconstruction From Inconsistent Views
来自不一致视图的世界重建
Abstract
Video diffusion models generate high-quality and diverse worlds; however, individual frames often lack 3D consistency across the output sequence, which makes the reconstruction of 3D worlds difficult. To this end, we propose a new method that handles these inconsistencies by non-rigidly aligning the video frames into a globally-consistent coordinate frame that produces sharp and detailed pointcloud reconstructions. First, a geometric foundation model lifts each frame into a pixel-wise 3D pointcloud, which contains unaligned surfaces due to these inconsistencies. We then propose a tailored non-rigid iterative frame-to-model ICP to obtain an initial alignment across all frames, followed by a global optimization that further sharpens the pointcloud. Finally, we leverage this pointcloud as initialization for 3D reconstruction and propose a novel inverse deformation rendering loss to create high quality and explorable 3D environments from inconsistent views. We demonstrate that our 3D scenes achieve higher quality than baselines, effectively turning video models into 3D-consistent world generators.
Chinese Translation
视频扩散模型生成高质量且多样化的世界;然而,单个帧在输出序列中往往缺乏3D一致性,这使得3D世界的重建变得困难。为此,我们提出了一种新方法,通过将视频帧非刚性地对齐到一个全局一致的坐标框架中来处理这些不一致性,从而生成清晰且详细的点云重建。首先,几何基础模型将每个帧提升为逐像素的3D点云,由于这些不一致性,点云中包含未对齐的表面。然后,我们提出了一种定制的非刚性迭代帧到模型的ICP(Iterative Closest Point)算法,以获得所有帧之间的初始对齐,接着进行全局优化,进一步增强点云的清晰度。最后,我们利用该点云作为3D重建的初始化,并提出了一种新颖的逆变形渲染损失,以从不一致视图中创建高质量且可探索的3D环境。我们证明我们的3D场景的质量优于基线,有效地将视频模型转变为3D一致的世界生成器。
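As context for the alignment stage above: the rigid core that non-rigid frame-to-model ICP generalizes is the closed-form Kabsch/Procrustes fit solved inside each ICP iteration. A minimal rigid-only sketch (the paper's alignment is non-rigid, so this is background, not its method):

```python
import numpy as np

def kabsch(P, Q):
    """Closed-form rigid fit: rotation R and translation t minimizing
    ||(P @ R.T + t) - Q|| for corresponding points P, Q of shape (n, 3)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cq - R @ cp
```

An ICP loop alternates nearest-neighbour correspondence search with this fit; the non-rigid variant replaces the single (R, t) with a spatially varying deformation.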
cs.CV / 127 / 2603.16737
Retrieving Counterfactuals Improves Visual In-Context Learning
检索反事实改善视觉上下文学习
Abstract
Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval-augmented approaches typically rely on passive similarity-based retrieval, which tends to select correlated but non-causal examples, amplifying spurious associations and limiting model robustness. We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual-style examples through targeted, attribute-guided composed image retrieval. By incorporating counterfactual-style examples, CIRCLES enables VLMs to implicitly reason about the causal relations between attributes and outcomes, moving beyond superficial correlations and fostering more robust and grounded reasoning. Comprehensive experiments on four diverse datasets demonstrate that CIRCLES consistently outperforms existing methods across multiple architectures, especially on small-scale models, with pronounced gains under information scarcity. Furthermore, CIRCLES retrieves more diverse and causally informative examples, providing qualitative insights into how models leverage in-context demonstrations for improved reasoning. Our code is available at https://github.com/gzxiong/CIRCLES.
Chinese Translation
视觉-语言模型(VLMs)在多模态推理任务中取得了令人瞩目的表现,但它们往往难以理清细粒度的视觉属性并推理潜在的因果关系。上下文学习(ICL)为VLMs适应新任务提供了一条有前景的途径,但其有效性在很大程度上依赖于示例的选择。现有的增强检索方法通常依赖于被动的基于相似性的检索,这往往选择相关但非因果的示例,从而放大虚假关联并限制模型的鲁棒性。我们提出了CIRCLES(组合图像检索用于因果学习示例选择),这是一个新颖的框架,通过有针对性的、属性引导的组合图像检索,主动构建示例集,检索反事实风格的示例。通过结合反事实风格的示例,CIRCLES使得VLMs能够隐式推理属性与结果之间的因果关系,超越表面的相关性,促进更为稳健和扎实的推理。在四个不同的数据集上进行的全面实验表明,CIRCLES在多种架构上始终优于现有方法,尤其是在小规模模型上,在信息稀缺的情况下表现出显著的提升。此外,CIRCLES检索到更多样化和因果信息丰富的示例,为模型如何利用上下文示例进行改进推理提供了定性见解。我们的代码可在 https://github.com/gzxiong/CIRCLES 获取。
cs.CV / 128 / 2603.16742
When the City Teaches the Car: Label-Free 3D Perception from Infrastructure
当城市教导汽车:来自基础设施的无标签三维感知
Abstract
Building robust 3D perception for self-driving still relies heavily on large-scale data collection and manual annotation, yet this paradigm becomes impractical as deployment expands across diverse cities and regions. Meanwhile, modern cities are increasingly instrumented with roadside units (RSUs), static sensors deployed along roads and at intersections to monitor traffic. This raises a natural question: can the city itself help train the vehicle? We propose infrastructure-taught, label-free 3D perception, a paradigm in which RSUs act as stationary, unsupervised teachers for ego vehicles. Leveraging their fixed viewpoints and repeated observations, RSUs learn local 3D detectors from unlabeled data and broadcast predictions to passing vehicles, which are aggregated as pseudo-label supervision for training a standalone ego detector. The resulting model requires no infrastructure or communication at test time. We instantiate this idea as a fully label-free three-stage pipeline and conduct a concept-and-feasibility study in a CARLA-based multi-agent environment. With CenterPoint, our pipeline achieves 82.3% AP for detecting vehicles, compared to a fully supervised ego upper bound of 94.4%. We further systematically analyze each stage, evaluate its scalability, and demonstrate complementarity with existing ego-centric label-free methods. Together, these results suggest that city infrastructure itself can potentially provide a scalable supervisory signal for autonomous vehicles, positioning infrastructure-taught learning as a promising orthogonal paradigm for reducing annotation cost in 3D perception.
Chinese Translation
构建稳健的自驾车三维感知仍然在很大程度上依赖于大规模数据收集和人工标注,然而,随着部署在不同城市和地区的扩展,这一范式变得不切实际。与此同时,现代城市越来越多地配备了路边单元(RSUs),这些静态传感器被部署在道路和交叉口以监测交通。这引发了一个自然的问题:城市本身能否帮助训练车辆?我们提出了基础设施教导的无标签三维感知,这是一种范式,其中RSUs作为静态的、无监督的教师为自我车辆提供支持。利用其固定的视角和重复的观察,RSUs从未标记的数据中学习本地三维检测器,并将预测结果广播给经过的车辆,这些结果被汇总为伪标签监督,用于训练独立的自我检测器。所得到的模型在测试时不需要基础设施或通信。我们将这一理念实例化为一个完全无标签的三阶段流程,并在基于CARLA的多智能体环境中进行概念和可行性研究。使用CenterPoint,我们的流程在车辆检测中达到了82.3%的平均精度(AP),而完全监督的自我检测上限为94.4%。我们进一步系统地分析每个阶段,评估其可扩展性,并展示与现有自我中心无标签方法的互补性。这些结果共同表明,城市基础设施本身可能为自动驾驶车辆提供可扩展的监督信号,将基础设施教导学习定位为减少三维感知标注成本的有前景的正交范式。
cs.CV / 129 / 2603.16747
Semi-supervised Latent Disentangled Diffusion Model for Textile Pattern Generation
半监督潜在解耦扩散模型用于纺织图案生成
Abstract
Textile pattern generation (TPG) aims to synthesize fine-grained textile pattern images based on given clothing images. Although previous studies have not explicitly investigated TPG, existing image-to-image models appear to be natural candidates for this task. However, when applied directly, these methods often produce unfaithful results, failing to preserve fine-grained details due to feature confusion between complex textile patterns and the inherent non-rigid texture distortions in clothing images. In this paper, we propose a novel method, SLDDM-TPG, for faithful and high-fidelity TPG. Our method consists of two stages: (1) a latent disentangled network (LDN) that resolves feature confusion in clothing representations and constructs a multi-dimensional, independent clothing feature space; and (2) a semi-supervised latent diffusion model (S-LDM), which receives guidance signals from LDN and generates faithful results through semi-supervised diffusion training, combined with our designed fine-grained alignment strategy. Extensive evaluations show that SLDDM-TPG reduces FID by 4.1 and improves SSIM by up to 0.116 on our CTP-HD dataset, and also demonstrate good generalization on the VITON-HD dataset.
Chinese Translation
纺织图案生成(TPG)旨在基于给定的服装图像合成细粒度的纺织图案图像。尽管之前的研究并未明确探讨TPG,但现有的图像到图像模型似乎是该任务的自然候选者。然而,直接应用这些方法时,往往会产生不真实的结果,无法保留细粒度的细节,因为复杂纺织图案与服装图像中固有的非刚性纹理扭曲之间存在特征混淆。在本文中,我们提出了一种新颖的方法SLDDM-TPG,以实现真实且高保真的TPG。我们的方法由两个阶段组成:(1)潜在解耦网络(LDN),该网络解决了服装表示中的特征混淆,并构建了一个多维的独立服装特征空间;(2)半监督潜在扩散模型(S-LDM),该模型接收来自LDN的指导信号,并通过半监督扩散训练生成真实的结果,结合我们设计的细粒度对齐策略。大量评估表明,SLDDM-TPG在我们的CTP-HD数据集上将FID降低了4.1,并将SSIM提高了最高0.116,同时在VITON-HD数据集上也展示了良好的泛化能力。
cs.CV / 130 / 2603.16758
SuCor: Susceptibility Distortion Correction via Parameter-Free and Self-Regularized Optimal Transport
SuCor:通过无参数和自正则化的最优传输进行易感性失真校正
Abstract
We present SuCor, a method for correcting susceptibility-induced geometric distortions in echo planar imaging (EPI) using optimal transport (OT) along the phase encoding direction. Given a pair of reversed phase encoding EPI volumes, we model each column of the distortion field as a Wasserstein-2 barycentric displacement between the opposing-polarity intensity profiles. Regularization is performed in the spectral domain using a bending-energy penalty whose strength is selected automatically via the Morozov discrepancy principle, requiring no manual tuning. On a Human Connectome Project (HCP) dataset with left-right/right-left b0 EPI pairs and a co-registered T1 structural reference, SuCor achieves a mean volumetric mutual information of 0.341 with the T1 image, compared to 0.317 for FSL TOPUP, while running in approximately 12 seconds on a single CPU core.
Chinese Translation
我们提出了SuCor,一种通过最优传输(Optimal Transport, OT)在相位编码方向上校正回波平面成像(Echo Planar Imaging, EPI)中易感性引起的几何失真的方法。给定一对反向相位编码EPI体积,我们将失真场的每一列建模为对立极性强度轮廓之间的Wasserstein-2重心位移。在谱域中进行正则化,使用弯曲能量罚项,其强度通过Morozov偏差原理自动选择,且无需手动调整。在一个包含左右/右左b0 EPI对和配准的T1结构参考的人脑连通组计划(Human Connectome Project, HCP)数据集中,SuCor与T1图像的平均体积互信息为0.341,而FSL TOPUP为0.317,并且在单个CPU核心上运行大约12秒。
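In 1D, the Wasserstein-2 optimal map between two profiles has a closed form via their cumulative distributions, so the per-column barycentric displacement described above can be sketched as follows. This is a simplified, unregularized version under our assumptions (no bending-energy penalty, uniform grid):

```python
import numpy as np

def w2_barycentric_displacement(p, q, eps=1e-8):
    """Equal-weight Wasserstein-2 barycentric displacement between two
    nonnegative 1D intensity profiles p, q on the same pixel grid.
    Returns d with d(x) = 0.5 * (T(x) - x), where T is the 1D OT map
    from p to q obtained by matching cumulative distributions."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    x = np.arange(len(p), dtype=float)
    Fp = np.cumsum(p)
    Fq = np.cumsum(q)
    # Monotone inverse of Fq evaluated at Fp(x): the OT map in 1D
    t_map = np.interp(Fp, Fq, x)
    return 0.5 * (t_map - x)
```

For two identical profiles shifted by s pixels, the displacement near the support is approximately s/2, i.e. each polarity moves halfway toward the barycenter.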
cs.CV / 131 / 2603.16760
Dual Stream Independence Decoupling for True Emotion Recognition under Masked Expressions
面向被遮掩表情下真实情感识别的双流独立解耦
Abstract
Recognizing true emotions from masked expressions is extremely challenging due to deliberate concealment. Existing paradigms recognize true emotions from masked-expression clips that contain onsetframes just starting to disguise. However, this paradigm may not reflect the actual disguised state, as the onsetframe leaks the true emotional information without reaching a stable disguise state. Thus, this paper introduces a novel apexframe-based paradigm that classifies true emotions from the apexframe with a stable disguised state. Furthermore, this paper proposes a novel dual stream independence decoupling framework that decouples true and disguised emotion features, avoiding the interference of disguised emotions on true emotions. For efficient decoupling, we design a decoupling loss group, comprising two classification losses that learn true emotion and disguised expression features, respectively, and a Hilbert-Schmidt Independence loss that enhances the independence of the two features. Experiments demonstrate that the apexframe-based paradigm is challenging, and the proposed decoupling framework improves recognition performance.
Chinese Translation
从被遮掩的表情中识别真实情感极具挑战性,因为存在故意的掩饰。现有的范式通常从刚开始伪装的被遮掩表情片段中识别真实情感。然而,这种范式可能无法反映实际的伪装状态,因为起始帧泄露了真实情感信息,而未达到稳定的伪装状态。因此,本文提出了一种基于顶点帧(apexframe)的新范式,该范式从具有稳定伪装状态的顶点帧中分类真实情感。此外,本文提出了一种新颖的双流独立解耦框架,该框架将真实情感特征与伪装情感特征解耦,避免了伪装情感对真实情感的干扰。为了实现高效解耦,我们设计了一组解耦损失,包括两个分类损失,分别学习真实情感和伪装表情特征,以及一个希尔伯特-施密特独立性损失(Hilbert-Schmidt Independence loss),以增强这两种特征的独立性。实验表明,基于顶点帧的范式具有挑战性,而所提出的解耦框架提高了识别性能。
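The Hilbert-Schmidt independence term mentioned above has a standard biased empirical estimator; a sketch with RBF kernels follows. The kernel choice and bandwidth are our assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def rbf_kernel(X, sigma):
    """Gaussian RBF Gram matrix for rows of X (n, d)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC between feature batches X (n, d1) and
    Y (n, d2). Near zero when the two feature streams are independent;
    minimizing it pushes them apart."""
    n = X.shape[0]
    K = rbf_kernel(X, sigma)
    L = rbf_kernel(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

In the decoupling loss group, a term like this would be added to the two classification losses so the true-emotion and disguised-expression features carry non-redundant information.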
cs.CV / 132 / 2603.16769
GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution
GDPO-SR:用于一步生成图像超分辨率的群体直接偏好优化
Abstract
Recently, reinforcement learning (RL) has been employed for improving generative image super-resolution (ISR) performance. However, the current efforts are focused on multi-step generative ISR, while one-step generative ISR remains underexplored due to its limited stochasticity. In addition, RL methods such as Direct Preference Optimization (DPO) require the generation of positive and negative sample pairs offline, leading to a limited number of samples, while Group Relative Policy Optimization (GRPO) only calculates the likelihood of the entire image, ignoring local details that are crucial for ISR. In this paper, we propose Group Direct Preference Optimization (GDPO), a novel approach to integrate RL into one-step generative ISR model training. First, we introduce a noise-aware one-step diffusion model that can generate diverse ISR outputs. To prevent performance degradation caused by noise injection, we introduce an unequal-timestep strategy to decouple the timestep of noise addition from that of diffusion. We then present the GDPO strategy, which integrates the principle of GRPO into DPO, to calculate the group-relative advantage of each online generated sample for model optimization. Meanwhile, an attribute-aware reward function is designed to dynamically evaluate the score of each sample based on its statistics of smooth and texture areas. Experiments demonstrate the effectiveness of GDPO in enhancing the performance of one-step generative ISR models. Code: https://github.com/Joyies/GDPO.
Chinese Translation
近年来,强化学习(RL)被用于提升生成图像超分辨率(ISR)的性能。然而,目前的研究主要集中在多步生成ISR上,而一步生成ISR由于其有限的随机性仍然未得到充分探索。此外,像直接偏好优化(DPO)这样的RL方法需要离线生成正负样本对,导致样本数量有限,而群体相对策略优化(GRPO)仅计算整个图像的可能性,忽略了对ISR至关重要的局部细节。本文提出了群体直接偏好优化(GDPO),这是一种将RL集成到一步生成ISR模型训练中的新方法。首先,我们引入了一种噪声感知的一步扩散模型,该模型能够生成多样的ISR输出。为了防止噪声注入导致的性能下降,我们引入了一种不等时间步策略,将噪声添加的时间步与扩散的时间步解耦。然后,我们提出了GDPO策略,该策略将GRPO的原则整合到DPO中,以计算每个在线生成样本的群体相对优势,从而进行模型优化。同时,设计了一种属性感知的奖励函数,根据每个样本的平滑和纹理区域的统计动态评估其得分。实验表明,GDPO在提升一步生成ISR模型性能方面的有效性。代码:https://github.com/Joyies/GDPO。
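The group-relative advantage at the heart of GDPO's GRPO-style scoring can be sketched as follows. The sign-based split into preferred and rejected samples is our simplification of how online samples might feed a DPO-style objective, not the paper's exact rule.

```python
import numpy as np

def group_relative_advantage(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each sample's reward within the
    group of outputs generated online for the same input image."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def preference_pairs(advantages):
    """Split a group into preferred (advantage > 0) and rejected samples,
    a DPO-style positive/negative partition chosen online."""
    adv = np.asarray(advantages)
    return np.where(adv > 0)[0], np.where(adv <= 0)[0]
```

Because the advantage is computed within a group, no offline positive/negative pair generation is needed, which addresses the sample-scarcity issue the abstract raises for DPO.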
cs.CV / 133 / 2603.16781
IOSVLM: A 3D Vision-Language Model for Unified Dental Diagnosis from Intraoral Scans
IOSVLM:一种基于口腔内扫描进行统一牙科诊断的三维视觉-语言模型
Abstract
3D intraoral scans (IOS) are increasingly adopted in routine dentistry due to abundant geometric evidence, and unified multi-disease diagnosis is desirable for clinical documentation and communication. While recent works introduce dental vision-language models (VLMs) to enable unified diagnosis and report generation on 2D images or multi-view images rendered from IOS, they do not fully leverage native 3D geometry. Such work is necessary and also challenging, due to: (i) heterogeneous scan forms and the complex IOS topology, (ii) multi-disease co-occurrence with class imbalance and fine-grained morphological ambiguity, (iii) limited paired 3D IOS-text data. Thus, we present IOSVLM, an end-to-end 3D VLM that represents scans as point clouds and follows a 3D encoder-projector-LLM design for unified diagnosis and generative visual question-answering (VQA), together with IOSVQA, a large-scale multi-source IOS diagnosis VQA dataset comprising 19,002 cases and 249,055 VQA pairs over 23 oral diseases and heterogeneous scan types. To address the distribution gap between color-free IOS data and color-dependent 3D pre-training, we propose a geometry-to-chromatic proxy that stabilizes fine-grained geometric perception and cross-modal alignment. A two-stage curriculum training strategy further enhances robustness. IOSVLM consistently outperforms strong baselines, achieving gains of at least +9.58% macro accuracy and +1.46% macro F1, indicating the effectiveness of direct 3D geometry modeling for IOS-based diagnosis.
Chinese Translation
三维口腔扫描(IOS)因其丰富的几何证据而越来越多地被应用于常规牙科诊疗中,统一的多疾病诊断对于临床文档和沟通是非常必要的。尽管近期的研究引入了牙科视觉-语言模型(VLM)以实现基于二维图像或从IOS渲染的多视图图像的统一诊断和报告生成,但它们并未充分利用原生的三维几何信息。这项工作既必要又具有挑战性,原因包括:(i)扫描形式的异质性和复杂的IOS拓扑结构,(ii)多疾病共现导致的类别不平衡和细粒度形态模糊,(iii)有限的配对三维IOS-文本数据。因此,我们提出了IOSVLM,这是一种端到端的三维VLM,将扫描表示为点云,并采用三维编码器-投影器-大语言模型(LLM)设计,以实现统一诊断和生成式视觉问答(VQA),同时推出IOSVQA,这是一个大规模多源IOS诊断VQA数据集,包含19,002个案例和249,055对VQA,涵盖23种口腔疾病和异质扫描类型。为了应对无色IOS数据与依赖颜色的三维预训练之间的分布差距,我们提出了一种几何到色彩的代理,旨在稳定细粒度几何感知和跨模态对齐。两阶段课程训练策略进一步增强了模型的鲁棒性。IOSVLM在各项指标上始终优于强基线,宏观准确率至少提高了+9.58%,宏观F1值提高了+1.46%,这表明直接的三维几何建模在基于IOS的诊断中的有效性。
cs.CV / 134 / 2603.16792
V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising
V-Co:通过共同去噪更深入地探讨视觉表征对齐
Abstract
Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, making it unclear which design choices are truly essential. Therefore, we present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective. Our study reveals four key ingredients for effective visual co-denoising. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.
Chinese Translation
像素空间扩散最近重新成为潜在扩散的强有力替代方案,使得在没有预训练自编码器的情况下实现高质量生成成为可能。然而,标准的像素空间扩散模型接收到的语义监督相对较弱,并且并未明确设计用于捕捉高层次的视觉结构。最近的表征对齐方法(例如,REPA)表明,预训练的视觉特征可以显著改善扩散训练,而视觉共同去噪已成为将这些特征纳入生成过程的有希望的方向。然而,现有的共同去噪方法往往纠缠于多个设计选择,使得哪些设计选择是真正必要的变得不清晰。因此,我们提出了V-Co,这是一个在统一的基于JiT的框架中对视觉共同去噪进行系统研究的工作。这个受控环境使我们能够隔离出使视觉共同去噪有效的要素。我们的研究揭示了四个有效视觉共同去噪的关键要素。首先,在保持特征特定计算的同时,允许灵活的跨流交互,促使我们采用完全的双流架构。其次,有效的无分类器引导(CFG)需要结构上定义的无条件预测。第三,更强的语义监督最好通过感知漂移混合损失提供。第四,稳定的共同去噪进一步需要适当的跨流校准,我们通过基于RMS的特征重缩放实现这一点。综合这些发现,我们得出了视觉共同去噪的简单配方。在ImageNet-256上的实验表明,在可比的模型规模下,V-Co优于基础像素空间扩散基线和强大的先验像素扩散方法,同时使用更少的训练周期,为未来的表征对齐生成模型提供了实用指导。
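The RMS-based cross-stream calibration named as the fourth ingredient above can be sketched as follows; this is our minimal interpretation (per-sample RMS matched to a shared target before the streams interact), not the exact V-Co implementation.

```python
import numpy as np

def rms_rescale(feat, target_rms=1.0, eps=1e-8):
    """Rescale features so the per-sample RMS over the channel axis
    matches target_rms, calibrating magnitudes across the two streams
    before cross-stream interaction."""
    rms = np.sqrt((feat ** 2).mean(axis=-1, keepdims=True))
    return feat * (target_rms / (rms + eps))
```

Matching RMS rather than full normalization keeps each stream's direction intact while preventing one stream's magnitude from dominating the interaction.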
cs.CV / 135 / 2603.16816
WildDepth: A Multimodal Dataset for 3D Wildlife Perception and Depth Estimation
WildDepth:用于3D野生动物感知和深度估计的多模态数据集
Abstract
Depth estimation and 3D reconstruction have been extensively studied as core topics in computer vision. Starting from rigid objects with relatively simple geometric shapes, such as vehicles, the research has expanded to address general objects, including challenging deformable objects, such as humans and animals. For animals in particular, however, the majority of existing models are trained on datasets without metric scale, which limits the validation of image-only models. To address this limitation, we present WildDepth, a multimodal dataset and benchmark suite for depth estimation, behavior detection, and 3D reconstruction from diverse categories of animals ranging from domestic to wild environments with synchronized RGB and LiDAR. Experimental results show that the use of multi-modal data improves depth reliability by up to 10% RMSE, while RGB-LiDAR fusion enhances 3D reconstruction fidelity by 12% in Chamfer distance. By releasing WildDepth and its benchmarks, we aim to foster robust multimodal perception systems that generalize across domains.
Chinese Translation
深度估计和3D重建作为计算机视觉的核心主题,已被广泛研究。从相对简单几何形状的刚性物体(如车辆)开始,研究逐渐扩展到一般物体,包括具有挑战性的可变形物体(如人类和动物)。然而,特别是对于动物,大多数现有模型都是基于没有度量尺度的数据集进行训练的,这限制了仅基于图像的模型的验证。为了解决这一限制,我们提出了WildDepth,一个多模态数据集和基准套件,用于深度估计、行为检测和3D重建,涵盖从家养到野生环境的多种动物,配备同步的RGB和LiDAR数据。实验结果表明,使用多模态数据可以将深度可靠性提高多达10%的均方根误差(RMSE),而RGB-LiDAR融合则在Chamfer距离上提高了12%的3D重建精度。通过发布WildDepth及其基准,我们旨在促进跨领域的稳健多模态感知系统的发展。
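The two metrics quoted above, depth RMSE and Chamfer distance, can be written compactly as follows (brute-force O(nm) nearest neighbours, which is fine for small point clouds; the benchmark's own evaluation code may differ):

```python
import numpy as np

def depth_rmse(pred, gt, valid=None):
    """Root-mean-square error between predicted and ground-truth depth
    maps, restricted to valid (nonzero) ground-truth pixels."""
    if valid is None:
        valid = gt > 0
    diff = pred[valid] - gt[valid]
    return np.sqrt((diff ** 2).mean())

def chamfer_distance(A, B):
    """Symmetric Chamfer distance between point sets A (n, 3), B (m, 3):
    mean squared distance from each point to its nearest neighbour in
    the other set, summed over both directions."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

RMSE scores the dense depth map per pixel, while Chamfer distance scores the reconstructed geometry without requiring point correspondences.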
cs.CV / 136 / 2603.16823
Deep Reinforcement Learning-driven Edge Offloading for Latency-constrained XR pipelines
基于深度强化学习的边缘卸载用于延迟受限的扩展现实管道
Abstract
Immersive extended reality (XR) applications introduce latency-critical workloads that must satisfy stringent real-time responsiveness while operating on energy- and battery-constrained devices, making execution placement between end devices and nearby edge servers a fundamental systems challenge. Existing approaches to adaptive execution and computation offloading typically optimize average performance metrics and do not fully capture the sustained interaction between real-time latency requirements and device battery lifetime in closed-loop XR workloads. In this paper, we present a battery-aware execution management framework for edge-assisted XR systems that jointly considers execution placement, workload quality, latency requirements, and battery dynamics. We design an online decision mechanism based on a lightweight deep reinforcement learning policy that continuously adapts execution decisions under dynamic network conditions while maintaining high motion-to-photon latency compliance. Experimental results show that the proposed approach extends the projected device battery lifetime by up to 163% compared to latency-optimal local execution while maintaining over 90% motion-to-photon latency compliance under stable network conditions. Such compliance does not fall below 80% even under significantly limited network bandwidth availability, thereby demonstrating the effectiveness of explicitly managing latency-energy trade-offs in immersive XR systems.
Chinese Translation
沉浸式扩展现实(XR)应用引入了对延迟敏感的工作负载,这些工作负载必须在能量和电池受限的设备上满足严格的实时响应性,因此在终端设备与附近边缘服务器之间进行执行位置的选择成为一个基本的系统挑战。现有的自适应执行和计算卸载方法通常优化平均性能指标,但未能充分捕捉实时延迟要求与设备电池寿命之间在闭环XR工作负载中的持续互动。在本文中,我们提出了一种电池感知的执行管理框架,适用于边缘辅助的XR系统,该框架联合考虑执行位置、工作负载质量、延迟要求和电池动态。我们设计了一种基于轻量级深度强化学习策略的在线决策机制,该机制在动态网络条件下持续调整执行决策,同时保持高运动到光子延迟合规性。实验结果表明,与延迟最优的本地执行相比,所提出的方法将设备电池寿命的预期延长了多达163%,同时在稳定的网络条件下保持超过90%的运动到光子延迟合规性。在网络带宽严重受限的情况下,该合规性仍不低于80%,从而证明了在沉浸式XR系统中显式管理延迟与能量之间权衡的有效性。
cs.CV / 137 / 2603.16835
An assessment of data-centric methods for label noise identification in remote sensing data sets
对遥感数据集中标签噪声识别的数据中心方法的评估
Abstract
Label noise in the sense of incorrect labels is present in many real-world data sets and is known to severely limit the generalizability of deep learning models. In the field of remote sensing, however, automated treatment of label noise in data sets has received little attention to date. In particular, there is a lack of systematic analysis of the performance of data-centric methods that not only cope with label noise but also explicitly identify and isolate noisy labels. In this paper, we examine three such methods and evaluate their behavior under different label noise assumptions. To do this, we inject different types of label noise at noise levels ranging from 10% to 70% into two benchmark data sets, followed by an analysis of how well the selected methods filter the label noise and how this affects task performance. With our analyses, we clearly demonstrate the value of data-centric methods for both aspects: label noise identification and task performance improvement. Our analyses provide insights into which method is the best choice depending on the setting and objective. Finally, we show in which areas there is still a need for research in the transfer of data-centric label noise methods to remote sensing data. As such, our work is a step forward in bridging the methodological establishment of data-centric label noise methods and their usage in practical settings in the remote sensing domain.
Chinese Translation
标签噪声,即错误标签的存在,广泛存在于许多现实世界的数据集中,并且已知严重限制了深度学习模型的泛化能力。然而,在遥感领域,数据集中标签噪声的自动处理迄今为止受到的关注很少。特别是,缺乏对数据中心方法性能的系统分析,这些方法不仅应对标签噪声,还明确识别和隔离噪声标签。本文考察了三种此类方法,并评估它们在不同标签噪声假设下的表现。为此,我们向两个基准数据集中注入不同类型的标签噪声,噪声水平范围从10%到70%,随后分析所选方法过滤标签噪声的效果以及这对任务性能的影响。通过我们的分析,我们清楚地证明了数据中心方法在标签噪声识别和任务性能提升两个方面的价值。我们的分析提供了关于在不同设置和目标下最佳选择哪种方法的见解。最后,我们展示了在将数据中心标签噪声方法转移到遥感数据方面仍然需要研究的领域。因此,我们的工作是推动数据中心标签噪声方法的理论建立与其在遥感领域实际应用之间的桥梁。
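The injection protocol described above, flipping labels at a fixed rate, can be sketched as symmetric (uniform) label noise; the function name and seeding are illustrative assumptions, not the paper's implementation:

```python
import random

def inject_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip each label to a uniformly chosen *different* class with probability noise_rate."""
    rng = random.Random(seed)  # fixed seed for reproducible benchmarks
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            noisy.append(rng.choice([c for c in range(num_classes) if c != y]))
        else:
            noisy.append(y)
    return noisy
```

Asymmetric or class-dependent noise (the "different label noise assumptions" above) would replace the uniform `rng.choice` with a per-class transition distribution.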
cs.CV / 138 / 2603.16840
What DINO saw: ALiBi positional encoding reduces positional bias in Vision Transformers
DINO所见:ALiBi位置编码减少视觉变换器中的位置偏差
Abstract
Vision transformers (ViTs) - especially feature foundation models like DINOv2 - learn rich representations useful for many downstream tasks. However, architectural choices (such as positional encoding) can lead to these models displaying positional biases and artefacts independent of semantic content. This makes zero-shot adaption difficult in fields like material science, where images are often cross-sections of homogeneous microstructure (i.e. having no preferred direction). In this work, we investigate the positional bias in ViTs via linear probing, finding it present across a range of objectives and positional encodings, and subsequently reduce it by finetuning models to use ALiBi relative positional encoding. We demonstrate that these models retain desirable general semantics and their unbiased features can be used successfully in trainable segmentation of complex microscopy images.
Chinese Translation
视觉变换器(ViTs)——尤其是像DINOv2这样的特征基础模型——学习到丰富的表示,这对于许多下游任务非常有用。然而,架构选择(例如位置编码)可能导致这些模型表现出与语义内容无关的位置偏差和伪影。这使得在材料科学等领域进行零样本适应变得困难,因为这些领域的图像通常是均匀微观结构的横截面(即没有优先方向)。在本研究中,我们通过线性探测调查了ViTs中的位置偏差,发现其在多种目标和位置编码中普遍存在,并通过微调模型以使用ALiBi相对位置编码来减少这种偏差。我们证明这些模型保留了理想的通用语义,其无偏特征可以成功用于复杂显微图像的可训练分割。
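The ALiBi encoding that the paper fine-tunes toward replaces learned positional embeddings with a distance-proportional penalty added to attention logits. A minimal 1D sketch (symmetric, bidirectional variant, as suits vision; the head-slope recipe follows the original ALiBi paper for a power-of-two head count, and is not taken from this work's code):

```python
def alibi_slopes(num_heads):
    """Head-specific slopes: the geometric sequence 2^(-8/n), 2^(-16/n), ... for n heads."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Symmetric distance penalty added to attention logits: -slope * |i - j|."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
```

Because the bias depends only on relative distance, no position-specific parameters are learned, which is what makes the features less position-biased for direction-free microstructure images.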
cs.CV / 139 / 2603.16844
M^3: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM
M^3:密集匹配与多视角基础模型相结合的单目高斯点云SLAM
Abstract
Streaming reconstruction from uncalibrated monocular video remains challenging, as it requires both high-precision pose estimation and computationally efficient online refinement in dynamic environments. While coupling 3D foundation models with SLAM frameworks is a promising paradigm, a critical bottleneck persists: most multi-view foundation models estimate poses in a feed-forward manner, yielding pixel-level correspondences that lack the requisite precision for rigorous geometric optimization. To address this, we present M^3, which augments the Multi-view foundation model with a dedicated Matching head to facilitate fine-grained dense correspondences and integrates it into a robust Monocular Gaussian Splatting SLAM. M^3 further enhances tracking stability by incorporating dynamic area suppression and cross-inference intrinsic alignment. Extensive experiments on diverse indoor and outdoor benchmarks demonstrate state-of-the-art accuracy in both pose estimation and scene reconstruction. Notably, M^3 reduces ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and outperforms ARTDECO by 2.11 dB in PSNR on the ScanNet++ dataset.
Chinese Translation
从未校准的单目视频中进行流式重建仍然具有挑战性,因为这需要在动态环境中进行高精度的姿态估计和计算效率高的在线优化。尽管将3D基础模型与SLAM框架结合是一种有前景的范式,但仍然存在一个关键瓶颈:大多数多视角基础模型以前馈方式估计姿态,导致像素级对应关系缺乏严格几何优化所需的精度。为了解决这个问题,我们提出了M^3,它通过专门的匹配头增强了多视角基础模型,以促进细粒度的密集对应关系,并将其集成到一个稳健的单目高斯点云SLAM中。M^3通过引入动态区域抑制和交叉推理内在对齐进一步增强了跟踪的稳定性。在多种室内和室外基准测试上的广泛实验表明,在姿态估计和场景重建方面达到了最先进的准确性。值得注意的是,与VGGT-SLAM 2.0相比,M^3将平均轨迹误差(ATE)均方根误差(RMSE)降低了64.3%,并在ScanNet++数据集上在峰值信噪比(PSNR)方面超越了ARTDECO 2.11 dB。
cs.CV / 140 / 2603.16858
SOMA: Unifying Parametric Human Body Models
SOMA: 统一参数化人体模型
Abstract
Parametric human body models are foundational to human reconstruction, animation, and simulation, yet they remain mutually incompatible: SMPL, SMPL-X, MHR, Anny, and related models each diverge in mesh topology, skeletal structure, shape parameterization, and unit convention, making it impractical to exploit their complementary strengths within a single pipeline. We present SOMA, a unified body layer that bridges these heterogeneous representations through three abstraction layers. Mesh topology abstraction maps any source model's identity to a shared canonical mesh in constant time per vertex. Skeletal abstraction recovers a full set of identity-adapted joint transforms from any body shape, whether in rest pose or an arbitrary posed configuration, in a single closed-form pass, with no iterative optimization or per-model training. Pose abstraction inverts the skinning pipeline to recover unified skeleton rotations directly from posed vertices of any supported model, enabling heterogeneous motion datasets to be consumed without custom retargeting. Together, these layers reduce the $O(M^2)$ per-pair adapter problem to $O(M)$ single-backend connectors, letting practitioners freely mix identity sources and pose data at inference time. The entire pipeline is fully differentiable end-to-end and GPU-accelerated via NVIDIA-Warp.
Chinese Translation
参数化人体模型是人类重建、动画和模拟的基础,但它们之间仍然不兼容:SMPL、SMPL-X、MHR、Anny及相关模型在网格拓扑、骨骼结构、形状参数化和单位约定等方面各有不同,使得在单一流程中利用它们的互补优势变得不切实际。我们提出了SOMA,一个统一的身体层,通过三个抽象层连接这些异构表示。网格拓扑抽象在每个顶点以恒定时间将任何源模型的身份映射到共享的标准网格。骨骼抽象从任意体形恢复完整的身份适应关节变换,无论是在静止姿势还是任意姿态配置中,均可在单一封闭形式的传递中完成,无需迭代优化或针对每个模型的训练。姿态抽象反转了蒙皮流程,直接从任何支持模型的姿态顶点恢复统一的骨骼旋转,从而可以在不进行自定义重定向的情况下使用异构运动数据集。这些层将每对适配器的$O(M^2)$问题减少到$O(M)$的单一后端连接,允许从业者在推理时自由混合身份源和姿态数据。整个流程是完全可微分的端到端,并通过NVIDIA-Warp加速。
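The claimed reduction from $O(M^2)$ per-pair adapters to $O(M)$ single-backend connectors is simple combinatorics; a toy illustration (model names from the abstract, the hub label is hypothetical):

```python
from itertools import permutations

models = ["SMPL", "SMPL-X", "MHR", "Anny"]

# Per-pair adapters: one converter for every ordered (source, target) pair,
# i.e. M * (M - 1) adapters to maintain.
pairwise = list(permutations(models, 2))

# Hub-and-spoke (SOMA-style, simplified): one connector per model to a shared
# canonical representation; any pair is bridged via the hub in two hops.
connectors = [(m, "canonical") for m in models]
```

With 4 models that is 12 pairwise adapters versus 4 connectors, and the gap widens linearly in the number of models added.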
cs.CV / 141 / 2603.16864
SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation
SparkVSR:通过稀疏关键帧传播实现交互式视频超分辨率
Abstract
Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) estimates, yet most existing VSR approaches behave like black boxes at inference time: users cannot reliably correct unexpected artifacts, but instead can only accept whatever the model produces. In this paper, we propose a novel interactive VSR framework dubbed SparkVSR that makes sparse keyframes a simple and expressive control signal. Specifically, users can first super-resolve a small set of keyframes using any off-the-shelf image super-resolution (ISR) model, after which SparkVSR propagates the keyframe priors to the entire video sequence while remaining grounded by the original LR video motion. Concretely, we introduce a keyframe-conditioned latent-pixel two-stage training pipeline that fuses LR video latents with sparsely encoded HR keyframe latents to learn robust cross-space propagation and refine perceptual details. At inference time, SparkVSR supports flexible keyframe selection (manual specification, codec I-frame extraction, or random sampling) and a reference-free guidance mechanism that continuously balances keyframe adherence and blind restoration, ensuring robust performance even when reference keyframes are absent or imperfect. Experiments on multiple VSR benchmarks demonstrate improved temporal consistency and strong restoration quality, surpassing baselines by up to 24.6%, 21.8%, and 5.6% on CLIP-IQA, DOVER, and MUSIQ, respectively, enabling controllable, keyframe-driven video super-resolution. Moreover, we demonstrate that SparkVSR is a generic interactive, keyframe-conditioned video processing framework as it can be applied out of the box to unseen tasks such as old-film restoration and video style transfer. Our project page is available at: https://sparkvsr.github.io/
Chinese Translation
视频超分辨率(VSR)旨在从低分辨率(LR)估计中恢复高质量的视频帧,但大多数现有的VSR方法在推理时表现得像黑箱:用户无法可靠地纠正意外的伪影,只能接受模型生成的结果。本文提出了一种新颖的交互式VSR框架,称为SparkVSR,使稀疏关键帧成为简单而富有表现力的控制信号。具体而言,用户可以首先使用任何现成的图像超分辨率(ISR)模型对一小组关键帧进行超分辨率处理,随后SparkVSR将关键帧先验传播到整个视频序列,同时保持与原始LR视频运动的关联。具体来说,我们引入了一种关键帧条件的潜在像素两阶段训练流程,该流程将LR视频潜在特征与稀疏编码的HR关键帧潜在特征融合,以学习稳健的跨空间传播并细化感知细节。在推理时,SparkVSR支持灵活的关键帧选择(手动指定、编解码器I帧提取或随机采样)和一种无参考指导机制,持续平衡关键帧依赖性和盲重建,确保在缺少或不完美的参考关键帧时仍能保持稳健的性能。在多个VSR基准测试中的实验表明,时间一致性和恢复质量得到了提升,在CLIP-IQA、DOVER和MUSIQ上分别超越基线达24.6%、21.8%和5.6%,实现可控的、关键帧驱动的视频超分辨率。此外,我们证明SparkVSR是一个通用的交互式、关键帧条件的视频处理框架,可以直接应用于旧电影修复和视频风格转移等未见任务。我们的项目页面可访问:https://sparkvsr.github.io/
cs.CV / 142 / 2603.16868
MessyKitchens: Contact-rich object-level 3D scene reconstruction
MessyKitchens:接触丰富的物体级三维场景重建
Abstract
Monocular 3D scene reconstruction has recently seen significant progress. Powered by modern neural architectures and large-scale data, recent methods achieve high performance in depth estimation from a single image. Meanwhile, reconstructing and decomposing common scenes into individual 3D objects remains a hard challenge due to the large variety of objects, frequent occlusions, and complex object relations. Notably, beyond shape and pose estimation of individual objects, applications in robotics and animation require physically plausible scene reconstruction where objects obey physical principles of non-penetration and realistic contact. In this work we advance object-level scene reconstruction along two directions. First, we introduce MessyKitchens, a new dataset of real-world scenes featuring cluttered environments and providing high-fidelity object-level ground truth in terms of 3D object shapes, poses, and accurate object contacts. Second, we build on the recent SAM 3D approach for single-object reconstruction and extend it with a Multi-Object Decoder (MOD) for joint object-level scene reconstruction. To validate our contributions, we demonstrate that MessyKitchens significantly improves over previous datasets in registration accuracy and inter-object penetration. We also compare our multi-object reconstruction approach on three datasets and demonstrate consistent and significant improvements of MOD over the state of the art. Our new benchmark, code and pre-trained models will become publicly available on our project website: https://messykitchens.github.io/.
Chinese Translation
单目三维场景重建最近取得了显著进展。在现代神经网络架构和大规模数据的推动下,最近的方法在从单幅图像中进行深度估计方面达到了高性能。然而,由于物体种类繁多、频繁遮挡和复杂的物体关系,将常见场景重建并分解为单个三维物体仍然是一个艰巨的挑战。值得注意的是,除了对单个物体的形状和姿态估计外,机器人和动画等应用还需要物理上合理的场景重建,使得物体遵循非穿透和现实接触的物理原则。在本研究中,我们在物体级场景重建方面推进了两个方向。首先,我们引入了MessyKitchens,一个新的数据集,包含真实世界的场景,特征为杂乱的环境,并在三维物体形状、姿态和准确的物体接触方面提供高保真的物体级真实数据。其次,我们基于最近的SAM 3D方法进行单物体重建,并通过多物体解码器(Multi-Object Decoder, MOD)扩展其用于联合物体级场景重建。为了验证我们的贡献,我们展示了MessyKitchens在配准精度和物体间穿透方面显著改善了之前的数据集。我们还在三个数据集上比较了我们的多物体重建方法,展示了MOD在最新技术上的一致且显著的改进。我们的新基准、代码和预训练模型将公开发布在我们的项目网站:https://messykitchens.github.io/。
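Reconstruction fidelity above is measured in Chamfer distance; a brute-force sketch of the symmetric squared-Chamfer metric between point sets (one of several common conventions, not necessarily the paper's exact variant):

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two 3D point sets (lists of (x, y, z)).

    For each point in one set, find the squared distance to its nearest
    neighbor in the other set; average both directions and sum them.
    """
    def sq(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    ab = sum(min(sq(p, q) for q in b) for p in a) / len(a)
    ba = sum(min(sq(q, p) for p in a) for q in b) / len(b)
    return ab + ba
```

Practical implementations replace the O(|a||b|) nearest-neighbor scan with a KD-tree; the definition is unchanged.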
cs.CV / 143 / 2603.16869
SegviGen: Repurposing 3D Generative Model for Part Segmentation
SegviGen:将3D生成模型重新用于部件分割
Abstract
We introduce SegviGen, a framework that repurposes native 3D generative models for 3D part segmentation. Existing pipelines either lift strong 2D priors into 3D via distillation or multi-view mask aggregation, often suffering from cross-view inconsistency and blurred boundaries, or explore native 3D discriminative segmentation, which typically requires large-scale annotated 3D data and substantial training resources. In contrast, SegviGen leverages the structured priors encoded in pretrained 3D generative model to induce segmentation through distinctive part colorization, establishing a novel and efficient framework for part segmentation. Specifically, SegviGen encodes a 3D asset and predicts part-indicative colors on active voxels of a geometry-aligned reconstruction. It supports interactive part segmentation, full segmentation, and full segmentation with 2D guidance in a unified framework. Extensive experiments show that SegviGen improves over the prior state of the art by 40% on interactive part segmentation and by 15% on full segmentation, while using only 0.32% of the labeled training data. It demonstrates that pretrained 3D generative priors transfer effectively to 3D part segmentation, enabling strong performance with limited supervision. See our project page at https://fenghora.github.io/SegviGen-Page/.
Chinese Translation
我们介绍了SegviGen,一个将原生3D生成模型重新用于3D部件分割的框架。现有的流程要么通过蒸馏或多视图掩膜聚合将强大的2D先验提升到3D,通常会遭遇视图间不一致和模糊边界的问题;要么探索原生3D判别分割,这通常需要大规模标注的3D数据和大量的训练资源。相比之下,SegviGen利用预训练3D生成模型中编码的结构化先验,通过独特的部件着色来诱导分割,建立了一个新颖且高效的部件分割框架。具体而言,SegviGen对3D资产进行编码,并在几何对齐重建的活跃体素上预测部件指示颜色。它在一个统一的框架中支持交互式部件分割、完整分割以及带有2D指导的完整分割。大量实验表明,SegviGen在交互式部件分割上比之前的最先进技术提高了40%,在完整分割上提高了15%,同时仅使用了0.32%的标注训练数据。它证明了预训练3D生成先验有效地转移到3D部件分割中,使得在有限监督下实现强大的性能。请访问我们的项目页面:https://fenghora.github.io/SegviGen-Page/
cs.CV / 144 / 2603.16870
Demystifying Video Reasoning
揭示视频推理
Abstract
Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.
Chinese Translation
最近的视频生成技术的进展揭示了一种意想不到的现象:基于扩散的视频模型展现出了非平凡的推理能力。先前的研究将其归因于一个帧链(Chain-of-Frames, CoF)机制,其中推理被假定为在视频帧间顺序展开。在本研究中,我们挑战了这一假设,并揭示了一种根本不同的机制。我们表明,视频模型中的推理主要是在扩散去噪步骤中产生的。通过定性分析和有针对性的探测实验,我们发现模型在早期去噪步骤中探索多个候选解决方案,并逐渐收敛到最终答案,这是我们称之为步骤链(Chain-of-Steps, CoS)的过程。除了这一核心机制外,我们还识别出几种对模型性能至关重要的涌现推理行为:(1) 工作记忆,允许持续引用;(2) 自我纠正和增强,能够从不正确的中间解决方案中恢复;以及(3) 行动前的感知,早期步骤建立语义基础,而后续步骤进行结构性操作。在扩散步骤中,我们进一步揭示了扩散变换器(Diffusion Transformers)内自我演变的功能专业化,其中早期层编码密集的感知结构,中间层执行推理,后期层整合潜在表示。基于这些洞见,我们提出了一种简单的无训练策略作为概念证明,展示了如何通过对来自具有不同随机种子的相同模型的潜在轨迹进行集成来改善推理。总体而言,我们的研究为视频生成模型中推理如何产生提供了系统理解,为后续研究提供基础,以更好地利用视频模型的内在推理动态,作为智能的新基础。
cs.CV / 145 / 2603.16871
WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation
WorldCam:以相机姿态作为统一几何表示的交互式自回归3D游戏世界
Abstract
Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.
Chinese Translation
近期视频扩散变换器的进展使得交互式游戏世界模型成为可能,用户可以在较长的时间范围内探索生成的环境。然而,现有方法在精确的动作控制和长时间范围内的3D一致性方面存在困难。大多数先前的研究将用户动作视为抽象的条件信号,忽视了动作与3D世界之间的基本几何耦合关系,即动作引发的相对相机运动在3D世界中累积成全局相机姿态。本文提出将相机姿态作为统一的几何表示,以共同支撑即时动作控制和长期3D一致性。首先,我们定义了一个基于物理的连续动作空间,并在李代数中表示用户输入,以推导出精确的6自由度相机姿态,这些姿态通过相机嵌入器注入生成模型中,以确保准确的动作对齐。其次,我们使用全局相机姿态作为空间索引来检索相关的过去观察,从而在长时间范围内导航时实现几何一致的地点重访。为了支持这项研究,我们引入了一个大规模数据集,包含3000分钟的真实人类游戏玩法,并标注了相机轨迹和文本描述。大量实验表明,我们的方法在动作可控性、长时间范围视觉质量和3D空间一致性方面显著优于最先进的交互式游戏世界模型。
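The geometric coupling the abstract emphasizes, per-step relative camera motions accumulating into a global pose, can be illustrated in a simplified planar (SE(2)) setting rather than the paper's full 6-DoF Lie-algebra treatment; the composition rule below is the standard one, but the example itself is not from the paper:

```python
import math

def compose(global_pose, rel):
    """Chain a relative motion (dx, dy, dtheta), expressed in the current
    body frame, onto a global pose (x, y, theta)."""
    x, y, th = global_pose
    dx, dy, dth = rel
    return (x + dx * math.cos(th) - dy * math.sin(th),
            y + dx * math.sin(th) + dy * math.cos(th),
            th + dth)

# Four unit forward steps, each followed by a 90-degree left turn, trace a
# square and return the camera to its starting position.
pose = (0.0, 0.0, 0.0)
for _ in range(4):
    pose = compose(pose, (1.0, 0.0, math.pi / 2))
```

The accumulated `pose` is exactly the "global camera pose" that the method uses as a spatial index for retrieving past observations when a location is revisited.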
cs.AI / 1 / 2603.15633
Neural-Symbolic Logic Query Answering in Non-Euclidean Space
非欧几里得空间中的神经符号逻辑查询回答
Abstract
Answering complex first-order logic (FOL) queries on knowledge graphs is essential for reasoning. Symbolic methods offer interpretability but struggle with incomplete graphs, while neural approaches generalize better but lack transparency. Neural-symbolic models aim to integrate both strengths but often fail to capture the hierarchical structure of logical queries, limiting their effectiveness. We propose HYQNET, a neural-symbolic model for logic query reasoning that fully leverages hyperbolic space. HYQNET decomposes FOL queries into relation projections and logical operations over fuzzy sets, enhancing interpretability. To address missing links, it employs a hyperbolic GNN-based approach for knowledge graph completion in hyperbolic space, effectively embedding the recursive query tree while preserving structural dependencies. By utilizing hyperbolic representations, HYQNET captures the hierarchical nature of logical projection reasoning more effectively than Euclidean-based approaches. Experiments on three benchmark datasets demonstrate that HYQNET achieves strong performance, highlighting the advantages of reasoning in hyperbolic space.
Chinese Translation
在知识图谱上回答复杂的一阶逻辑(FOL)查询对于推理至关重要。符号方法提供了可解释性,但在处理不完整图谱时表现欠佳,而神经方法则具有更好的泛化能力,但缺乏透明性。神经符号模型旨在整合两者的优势,但通常无法有效捕捉逻辑查询的层次结构,限制了它们的有效性。我们提出了HYQNET,一个用于逻辑查询推理的神经符号模型,充分利用了双曲空间。HYQNET将FOL查询分解为关系投影和模糊集上的逻辑运算,增强了可解释性。为了解决缺失链接,它采用了一种基于双曲图神经网络(GNN)的知识图谱补全方法,能够有效嵌入递归查询树,同时保持结构依赖性。通过利用双曲表示,HYQNET比基于欧几里得的方法更有效地捕捉逻辑投影推理的层次特性。在三个基准数据集上的实验表明,HYQNET实现了强大的性能,突显了在双曲空间中推理的优势。
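The hyperbolic representations HYQNET relies on are typically realized in the Poincare ball, where geodesic distance grows rapidly toward the boundary and therefore suits tree-like hierarchies. A sketch of that standard distance (illustrative, not the paper's code):

```python
import math

def poincare_distance(x, y):
    """Geodesic distance between points inside the unit Poincare ball:
    d(x, y) = arcosh(1 + 2||x - y||^2 / ((1 - ||x||^2)(1 - ||y||^2)))."""
    sq = lambda v: sum(vi * vi for vi in v)
    diff = sq([xi - yi for xi, yi in zip(x, y)])
    denom = (1 - sq(x)) * (1 - sq(y))
    return math.acosh(1 + 2 * diff / denom)
```

Near the origin the metric is almost Euclidean, but points pushed toward the boundary (deep tree levels) become exponentially far apart, which is what lets a recursive query tree embed with low distortion.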
cs.AI / 2 / 2603.15634
NextMem: Towards Latent Factual Memory for LLM-based Agents
NextMem:面向基于大语言模型的智能体的潜在事实记忆
Abstract
Memory is critical for LLM-based agents to preserve past observations for future decision-making, where factual memory serves as its foundational part. However, existing approaches to constructing factual memory face several limitations. Textual methods impose heavy context and indexing burdens, while parametric methods suffer from catastrophic forgetting and high costs. To address these challenges, we introduce NextMem, a latent factual memory framework that utilizes an autoregressive autoencoder to efficiently construct latent memory while ensuring accurate reconstruction. For better optimization, we propose a two-stage training process, including autoregressive reconstruction alignment and progressive latent substitution. We also incorporate quantization to reduce storage overhead. Extensive experiments demonstrate that NextMem achieves superior performance, and excels in retrieval, robustness, and extensibility properties. We release our code and model checkpoints at https://github.com/nuster1128/NextMem.
Chinese Translation
记忆对于基于大语言模型(LLM)的智能体在未来决策中保留过去观察至关重要,其中事实记忆作为其基础部分。然而,现有的构建事实记忆的方法面临若干限制。文本方法带来了沉重的上下文和索引负担,而参数化方法则遭遇灾难性遗忘和高成本。为了解决这些挑战,我们提出了NextMem,一个潜在事实记忆框架,利用自回归自编码器高效构建潜在记忆,同时确保准确重构。为了更好地优化,我们提出了一个两阶段的训练过程,包括自回归重构对齐和渐进式潜在替代。我们还结合量化技术以减少存储开销。大量实验表明,NextMem在检索、鲁棒性和可扩展性方面表现优越。我们将在 https://github.com/nuster1128/NextMem 发布我们的代码和模型检查点。
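The storage-reducing quantization step mentioned above can be sketched as plain uniform symmetric quantization of a latent vector; the bit width and rounding below are illustrative assumptions, not NextMem's actual scheme:

```python
def quantize(vec, bits=8):
    """Uniform symmetric quantization of a latent vector to signed integers."""
    qmax = 2 ** (bits - 1) - 1
    # Scale so the largest magnitude maps to qmax; guard the all-zero vector.
    scale = max(abs(v) for v in vec) / qmax or 1.0
    return [round(v / scale) for v in vec], scale

def dequantize(codes, scale):
    """Recover an approximation of the original vector from integer codes."""
    return [c * scale for c in codes]
```

At 8 bits each stored value shrinks from a 32-bit float to one byte, with reconstruction error bounded by the quantization step.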
cs.AI / 3 / 2603.15636
AIDABench: AI Data Analytics Benchmark
AIDABench:人工智能数据分析基准
Yang, Yibo, Lei, Fei, Sun, Yixuan, Zeng, Yantao, Lv, Chengguang, Hong, Jiancao, Tian, Jiaojiao, Qiu, Tianyu, Wang, Xin, Chen, Yanbing, Li, Yanjie, Pan, Zheng, Zhou, Xiaochen, Chen, Guanzhou, Lv, Haoran, Xu, Yuning, Ou, Yue, Liu, Haodong, He, Shiqi, Jia, Anya, Xin, Yulei, Wu, Huan, Liu, Liang, Ge, Jiaye, Dong, Jianxin, Lin, Dahua, Sun, Wenxiu
Abstract
As AI-driven document understanding and processing tools become increasingly prevalent in real-world applications, the need for rigorous evaluation standards has grown increasingly urgent. Existing benchmarks and evaluations often focus on isolated capabilities or simplified scenarios, failing to capture the end-to-end task effectiveness required in practical settings. To address this gap, we introduce AIDABench, a comprehensive benchmark for evaluating AI systems on complex data analytics tasks in an end-to-end manner. AIDABench encompasses 600+ diverse document analysis tasks across three core capability dimensions: question answering, data visualization, and file generation. These tasks are grounded in realistic scenarios involving heterogeneous data types, including spreadsheets, databases, financial reports, and operational records, and reflect analytical demands across diverse industries and job functions. Notably, the tasks in AIDABench are sufficiently challenging that even human experts require 1-2 hours per question when assisted by AI tools, underscoring the benchmark's difficulty and real-world complexity. We evaluate 11 state-of-the-art models on AIDABench, spanning both proprietary (e.g., Claude Sonnet 4.5, Gemini 3 Pro Preview) and open-source (e.g., Qwen3-Max-2026-01-23-Thinking) families. Our results reveal that complex, real-world data analytics tasks remain a significant challenge for current AI systems, with the best-performing model achieving only 59.43% pass-at-1. We provide a detailed analysis of failure modes across each capability dimension and identify key challenges for future research. AIDABench offers a principled reference for enterprise procurement, tool selection, and model optimization, and is publicly available at https://github.com/MichaelYang-lyx/AIDABench.
Chinese Translation
随着基于人工智能的文档理解和处理工具在现实应用中的日益普及,对严格评估标准的需求变得愈发紧迫。现有的基准和评估通常集中于孤立的能力或简化的场景,未能捕捉到实际环境中所需的端到端任务有效性。为了解决这一问题,我们引入了AIDABench,这是一个全面的基准,用于以端到端的方式评估人工智能系统在复杂数据分析任务上的表现。AIDABench涵盖了600多个多样化的文档分析任务,涉及三个核心能力维度:问答、数据可视化和文件生成。这些任务基于涉及异构数据类型的现实场景,包括电子表格、数据库、财务报告和操作记录,反映了各个行业和职位的分析需求。值得注意的是,AIDABench中的任务具有足够的挑战性,即使是人类专家在人工智能工具的辅助下,每个问题也需要1-2小时,突显了基准的难度和现实复杂性。我们在AIDABench上评估了11个最先进的模型,涵盖了专有模型(如Claude Sonnet 4.5、Gemini 3 Pro Preview)和开源模型(如Qwen3-Max-2026-01-23-Thinking)。我们的结果显示,复杂的现实数据分析任务仍然是当前人工智能系统面临的重大挑战,表现最佳的模型仅达到59.43%的通过率。我们对每个能力维度的失败模式进行了详细分析,并识别出未来研究的关键挑战。AIDABench为企业采购、工具选择和模型优化提供了原则性的参考,并已公开发布于https://github.com/MichaelYang-lyx/AIDABench。
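The pass-at-1 numbers reported above reduce to the fraction of tasks whose single attempt passes; a sketch with the per-capability breakdown the analysis uses (helper names hypothetical):

```python
def pass_at_1(outcomes):
    """outcomes: {task_id: passed} for a single attempt per task; returns percent."""
    return 100.0 * sum(outcomes.values()) / len(outcomes)

def by_dimension(records):
    """records: list of (capability_dimension, passed); per-dimension pass-at-1."""
    dims = {}
    for dim, ok in records:
        dims.setdefault(dim, []).append(ok)
    return {dim: 100.0 * sum(v) / len(v) for dim, v in dims.items()}
```

The best model's reported 59.43% pass-at-1 thus means roughly 6 of every 10 tasks were solved on the first and only attempt.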
cs.AI / 4 / 2603.15639
The Comprehension-Gated Agent Economy: A Robustness-First Architecture for AI Economic Agency
理解门控代理经济:一种以稳健性为首的人工智能经济代理架构
Abstract
AI agents are increasingly granted economic agency (executing trades, managing budgets, negotiating contracts, and spawning sub-agents), yet current frameworks gate this agency on capability benchmarks that are empirically uncorrelated with operational robustness. We introduce the Comprehension-Gated Agent Economy (CGAE), a formal architecture in which an agent's economic permissions are upper-bounded by a verified comprehension function derived from adversarial robustness audits. The gating mechanism operates over three orthogonal robustness dimensions: constraint compliance (measured by CDCT), epistemic integrity (measured by DDFT), and behavioral alignment (measured by AGT), with intrinsic hallucination rates serving as a cross-cutting diagnostic. We define a weakest-link gate function that maps robustness vectors to discrete economic tiers, and prove three properties of the resulting system: (1) bounded economic exposure, ensuring maximum financial liability is a function of verified robustness; (2) incentive-compatible robustness investment, showing rational agents maximize profit by improving robustness rather than scaling capability alone; and (3) monotonic safety scaling, demonstrating that aggregate system safety does not decrease as the economy grows. The architecture includes temporal decay and stochastic re-auditing mechanisms that prevent post-certification drift. CGAE provides the first formal bridge between empirical AI robustness evaluation and economic governance, transforming safety from a regulatory burden into a competitive advantage.
Chinese Translation
人工智能代理越来越多地被赋予经济代理权(执行交易、管理预算、谈判合同和生成子代理),然而当前的框架将这种代理权限制在与操作稳健性经验上无关的能力基准上。我们引入了理解门控代理经济(Comprehension-Gated Agent Economy, CGAE),这是一个正式的架构,其中代理的经济权限由基于对抗稳健性审计得出的经过验证的理解函数上限。门控机制在三个正交的稳健性维度上运行:约束遵从性(通过 CDCT 测量)、认知完整性(通过 DDFT 测量)和行为一致性(通过 AGT 测量),而内在的幻觉率则作为一个交叉诊断指标。我们定义了一个最弱环节门函数,将稳健性向量映射到离散的经济层级,并证明了该系统的三个性质:(1)有界经济风险,确保最大财务责任是经过验证的稳健性的函数;(2)激励兼容的稳健性投资,表明理性代理通过改善稳健性而非单纯扩展能力来最大化利润;(3)单调安全扩展,证明随着经济增长,整体系统安全性不会下降。该架构包括时间衰减和随机再审计机制,以防止认证后的漂移。CGAE 提供了经验人工智能稳健性评估与经济治理之间的第一个正式桥梁,将安全性从监管负担转变为竞争优势。
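The weakest-link gate function can be sketched directly from the abstract: economic permissions are capped by the minimum of the three robustness scores. Tier names and thresholds below are illustrative, not from the paper:

```python
TIERS = [  # (minimum robustness score, permitted economic tier) -- thresholds illustrative
    (0.90, "autonomous-budget"),
    (0.75, "supervised-trading"),
    (0.50, "read-only-analysis"),
    (0.00, "no-economic-agency"),
]

def gate(robustness):
    """Weakest-link gate: the tier is bounded by the *minimum* of the
    constraint-compliance, epistemic-integrity, and alignment scores."""
    weakest = min(robustness)  # a single weak dimension caps the whole tier
    for threshold, tier in TIERS:
        if weakest >= threshold:
            return tier
```

Because only the minimum matters, a rational agent cannot buy a higher tier by over-investing in one dimension, which is the incentive-compatibility property the paper proves.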
cs.AI / 5 / 2603.15641
Form Follows Function: Recursive Stem Model
形式遵循功能:递归干茎模型
Abstract
Recursive reasoning models such as Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM) show that small, weight-shared networks can solve compute-heavy and NP-hard puzzles by iteratively refining latent states, but their training typically relies on deep supervision and/or long unrolls that increase wall-clock cost and can bias the model toward greedy intermediate behavior. We introduce Recursive Stem Model (RSM), a recursive reasoning approach that keeps the TRM-style backbone while changing the training contract so the network learns a stable, depth-agnostic transition operator. RSM fully detaches the hidden-state history during training, treats early iterations as detached "warm-up" steps, and applies loss only at the final step. We further grow the outer recursion depth $H$ and inner compute depth $L$ independently and use a stochastic outer-transition scheme (stochastic depth over $H$) to mitigate instability when increasing depth. This yields two key capabilities: (i) $>20\times$ faster training than TRM while improving accuracy ($\approx 5\times$ reduction in error rate), and (ii) test-time scaling where inference can run for arbitrarily many refinement steps ($\sim 20,000 H_{\text{test}} \gg 20 H_{\text{train}}$), enabling additional "thinking" without retraining. On Sudoku-Extreme, RSM reaches 97.5% exact accuracy with test-time compute (within ~1 hour of training on a single A100), and on Maze-Hard ($30 \times 30$) it reaches ~80% exact accuracy in ~40 minutes using attention-based instantiation. Finally, because RSM implements an iterative settling process, convergence behavior provides a simple, architecture-native reliability signal: non-settling trajectories warn that the model has not reached a viable solution and can be a guard against hallucination, while stable fixed points can be paired with domain verifiers for practical correctness checks.
Chinese Translation
递归推理模型,如层次推理模型(Hierarchical Reasoning Model, HRM)和微型递归模型(Tiny Recursive Model, TRM),表明小型、共享权重的网络能够通过迭代精炼潜在状态来解决计算密集型和 NP 难题,但它们的训练通常依赖于深度监督和/或长时间展开,这增加了实际计算时间并可能使模型偏向贪婪的中间行为。我们提出了递归干茎模型(Recursive Stem Model, RSM),这是一种递归推理方法,保留了 TRM 风格的骨干,同时改变了训练契约,使网络学习一个稳定的、与深度无关的转移算子。RSM 在训练过程中完全分离隐藏状态历史,将早期迭代视为分离的“热身”步骤,并仅在最后一步应用损失。我们进一步独立增加外部递归深度 $H$ 和内部计算深度 $L$,并使用随机外部转移方案(在 $H$ 上的随机深度)来减轻增加深度时的不稳定性。这带来了两个关键能力:(i)比 TRM 快超过 20 倍的训练速度,同时提高准确性(错误率减少约 5 倍),以及(ii)测试时的扩展性,推理可以进行任意多的精炼步骤($\sim 20{,}000\, H_{\text{test}} \gg 20\, H_{\text{train}}$),使得在不重新训练的情况下能够进行额外的“思考”。在 Sudoku-Extreme 上,RSM 借助测试时计算达到了 97.5% 的精确准确率(在单个 A100 上训练约 1 小时即可),而在 Maze-Hard($30 \times 30$)上,它在使用基于注意力的实例化时,耗时约 40 分钟即达到约 80% 的精确准确率。最后,由于 RSM 实现了一个迭代收敛过程,收敛行为提供了一个简单的、架构原生的可靠性信号:不收敛的轨迹警告模型尚未达到可行解,可以作为防止幻觉的保护,而稳定的不动点可以与领域验证器配对进行实际的正确性检查。
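The settling-as-reliability idea above transfers to a toy fixed-point iteration: run the transition operator until the state stops changing, and treat a non-settling trajectory as a warning rather than an answer. A sketch under that reading (using a hand-written contraction, not RSM's trained operator):

```python
def settle(f, x0, max_steps=20000, tol=1e-8):
    """Run the transition operator f until the state stops changing;
    non-settling trajectories are flagged instead of trusted."""
    x = x0
    for step in range(max_steps):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next, step + 1, True   # settled: usable answer
        x = x_next
    return x, max_steps, False              # did not settle: reliability warning

# A contraction settles to its fixed point (x* = 1 for f(x) = 0.5*x + 0.5).
value, steps, ok = settle(lambda x: 0.5 * x + 0.5, 0.0)
```

This is also why test-time scaling is free in RSM: a depth-agnostic operator can simply be iterated for more steps at inference than it ever saw in training.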
cs.AI / 6 / 2603.15642
CraniMem: Cranial Inspired Gated and Bounded Memory for Agentic Systems
CraniMem:基于颅骨启发的门控和有界记忆用于自主系统
Abstract
Large language model (LLM) agents are increasingly deployed in long running workflows, where they must preserve user and task state across many turns. Many existing agent memory systems behave like external databases with ad hoc read/write rules, which can yield unstable retention, limited consolidation, and vulnerability to distractor content. We present CraniMem, a neurocognitively motivated, gated and bounded multi-stage memory design for agentic systems. CraniMem couples goal conditioned gating and utility tagging with a bounded episodic buffer for near term continuity and a structured long-term knowledge graph for durable semantic recall. A scheduled consolidation loop replays high utility traces into the graph while pruning low utility items, keeping memory growth in check and reducing interference. On long horizon benchmarks evaluated under both clean inputs and injected noise, CraniMem is more robust than Vanilla RAG and Mem0 baselines and exhibits smaller performance drops under distraction. Our code is available at https://github.com/PearlMody05/Cranimem and the accompanying PyPI package at https://pypi.org/project/cranimem.
Chinese Translation
大型语言模型(LLM)代理越来越多地应用于长期运行的工作流程中,在这些工作流程中,它们必须在多个回合中保持用户和任务状态。许多现有的代理记忆系统表现得像具有临时读写规则的外部数据库,这可能导致不稳定的记忆保持、有限的整合能力以及对干扰内容的脆弱性。我们提出了CraniMem,这是一种神经认知驱动的、门控和有界的多阶段记忆设计,适用于自主系统。CraniMem将目标条件门控和效用标记与有界的情节缓冲区结合起来,以实现短期连续性,并配备结构化的长期知识图谱,以便于持久的语义回忆。一个定期的整合循环将高效用的轨迹重播到图谱中,同时修剪低效用项目,以控制记忆增长并减少干扰。在同时使用干净输入和注入噪声进行评估的长时程基准测试中,CraniMem比Vanilla RAG和Mem0基线更具鲁棒性,并且在干扰下表现出更小的性能下降。我们的代码可在 https://github.com/PearlMody05/Cranimem 获取,相关的PyPI包可在 https://pypi.org/project/cranimem 找到。
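The bounded episodic buffer and utility-driven consolidation described above can be sketched as a small toy. The class name, scalar utility tags, and the eviction/promotion rules below are our own illustrative assumptions, not CraniMem's actual implementation:

```python
import heapq

class BoundedEpisodicBuffer:
    """Toy bounded buffer: keeps at most `cap` traces, evicting the
    lowest-utility trace when full (utility is a scalar tag)."""

    def __init__(self, cap):
        self.cap = cap
        self.heap = []  # min-heap ordered by utility

    def add(self, utility, trace):
        if len(self.heap) < self.cap:
            heapq.heappush(self.heap, (utility, trace))
        elif utility > self.heap[0][0]:
            # replace the current lowest-utility trace
            heapq.heapreplace(self.heap, (utility, trace))

    def consolidate(self, threshold):
        # promote high-utility traces to the long-term store, prune the rest
        promoted = [t for (u, t) in self.heap if u >= threshold]
        self.heap = [(u, t) for (u, t) in self.heap if u >= threshold]
        heapq.heapify(self.heap)
        return promoted

buf = BoundedEpisodicBuffer(cap=3)
for u, t in [(0.9, "user-goal"), (0.1, "chit-chat"),
             (0.5, "tool-output"), (0.7, "task-state")]:
    buf.add(u, t)
promoted = buf.consolidate(threshold=0.6)
```

Bounding the buffer and pruning on a schedule is what keeps memory growth in check in this sketch; the low-utility "chit-chat" trace is evicted and never reaches the long-term store.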
cs.AI / 7 / 2603.15643
GSI Agent: Domain Knowledge Enhancement for Large Language Models in Green Stormwater Infrastructure
GSI代理:绿色雨水基础设施中大型语言模型的领域知识增强
Abstract
Green Stormwater Infrastructure (GSI) systems, such as permeable pavement, rain gardens, and bioretention facilities, require continuous inspection and maintenance to ensure long-term performance. However, domain knowledge about GSI is often scattered across municipal manuals, regulatory documents, and inspection forms. As a result, non-expert users and maintenance staff may struggle to obtain reliable and actionable guidance from field observations. Although Large Language Models (LLMs) have demonstrated strong general reasoning and language generation capabilities, they often lack domain-specific knowledge and may produce inaccurate or hallucinated answers in engineering scenarios. This limitation restricts their direct application to professional infrastructure tasks. In this paper, we propose GSI Agent, a domain-enhanced LLM framework designed to improve performance in GSI-related tasks. Our approach integrates three complementary strategies: (1) supervised fine-tuning (SFT) on a curated GSI instruction dataset, (2) retrieval-augmented generation (RAG) over an internal GSI knowledge base constructed from municipal documents, and (3) an agent-based reasoning pipeline that coordinates retrieval, context integration, and structured response generation. We also construct a new GSI Dataset aligned with real-world GSI inspection and maintenance scenarios. Experimental results show that our framework significantly improves domain-specific performance while maintaining general knowledge capability. On the GSI dataset, BLEU-4 improves from 0.090 to 0.307, while performance on the common knowledge dataset remains stable (0.304 vs. 0.305). These results demonstrate that systematic domain knowledge enhancement can effectively adapt general-purpose LLMs to professional infrastructure applications.
Chinese Translation
绿色雨水基础设施(GSI)系统,如透水铺装、雨水花园和生物滞留设施,需要持续的检查和维护以确保长期性能。然而,关于GSI的领域知识通常散布在市政手册、法规文件和检查表中。因此,非专业用户和维护人员可能难以从现场观察中获得可靠和可操作的指导。尽管大型语言模型(LLMs)在一般推理和语言生成能力上表现出色,但它们通常缺乏特定领域的知识,并可能在工程场景中产生不准确或虚构的答案。这一限制限制了它们在专业基础设施任务中的直接应用。在本文中,我们提出了GSI代理,一个旨在提高GSI相关任务性能的领域增强LLM框架。我们的方法整合了三种互补策略:(1)在精心策划的GSI指令数据集上进行监督微调(SFT),(2)基于市政文件构建的内部GSI知识库上的检索增强生成(RAG),以及(3)一个基于代理的推理管道,协调检索、上下文整合和结构化响应生成。我们还构建了一个新的GSI数据集,与现实世界的GSI检查和维护场景对齐。实验结果表明,我们的框架显著提高了领域特定的性能,同时保持了一般知识能力。在GSI数据集上,BLEU-4从0.090提高到0.307,而在常识知识数据集上的表现保持稳定(0.304对0.305)。这些结果表明,系统的领域知识增强可以有效地将通用LLM适应于专业基础设施应用。
cs.AI / 8 / 2603.15658
Did You Check the Right Pocket? Cost-Sensitive Store Routing for Memory-Augmented Agents
你检查了正确的口袋吗?针对记忆增强代理的成本敏感存储路由
Abstract
Memory-augmented agents maintain multiple specialized stores, yet most systems retrieve from all stores for every query, increasing cost and introducing irrelevant context. We formulate memory retrieval as a store-routing problem and evaluate it using coverage, exact match, and token efficiency metrics. On downstream question answering, an oracle router achieves higher accuracy while using substantially fewer context tokens compared to uniform retrieval, demonstrating that selective retrieval improves both efficiency and performance. Our results show that routing decisions are a first-class component of memory-augmented agent design and motivate learned routing mechanisms for scalable multi-store systems. We additionally formalize store selection as a cost-sensitive decision problem that trades answer accuracy against retrieval cost, providing a principled interpretation of routing policies.
Chinese Translation
记忆增强代理维护多个专门的存储,但大多数系统在每次查询时都会从所有存储中检索,导致成本增加并引入无关上下文。我们将记忆检索形式化为存储路由问题,并通过覆盖率、精确匹配和标记效率指标进行评估。在下游问答任务中,一个理想路由器在使用显著更少的上下文标记的情况下实现了更高的准确性,表明选择性检索提高了效率和性能。我们的结果表明,路由决策是记忆增强代理设计的一个重要组成部分,并激励了可扩展多存储系统的学习路由机制。我们还将存储选择形式化为一个成本敏感的决策问题,该问题在答案准确性与检索成本之间进行权衡,为路由策略提供了一个原则性的解释。
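The cost-sensitive formulation above can be read as a per-store expected-utility test. The relevance scores, token costs, and trade-off weight in this sketch are invented for illustration and are not taken from the paper:

```python
def route(relevance, token_cost, lam=0.01):
    """Select stores whose expected accuracy gain (here, a relevance
    probability) outweighs a lambda-weighted retrieval cost. With
    independent per-store scores the optimum reduces to a threshold test."""
    return [s for s, p in relevance.items() if p > lam * token_cost[s]]

selected = route(
    relevance={"episodic": 0.80, "semantic": 0.05, "procedural": 0.30},
    token_cost={"episodic": 20, "semantic": 30, "procedural": 25},
)
```

Under these made-up numbers the router skips the "semantic" store entirely, which is exactly the behavior that saves context tokens relative to uniform retrieval.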
cs.AI / 9 / 2603.15661
DynaTrust: Defending Multi-Agent Systems Against Sleeper Agents via Dynamic Trust Graphs
DynaTrust:通过动态信任图防御多代理系统中的卧底代理
Abstract
Large Language Model-based Multi-Agent Systems (MAS) have demonstrated remarkable collaborative reasoning capabilities but introduce new attack surfaces, such as the sleeper agent, which behave benignly during routine operation and gradually accumulate trust, only revealing malicious behaviors when specific conditions or triggers are met. Existing defense works primarily focus on static graph optimization or hierarchical data management, often failing to adapt to evolving adversarial strategies or suffering from high false-positive rates (FPR) due to rigid blocking policies. To address this, we propose DynaTrust, a novel defense method against sleeper agents. DynaTrust models MAS as a dynamic trust graph~(DTG), and treats trust as a continuous, evolving process rather than a static attribute. It dynamically updates the trust of each agent based on its historical behaviors and the confidence of selected expert agents. Instead of simply blocking, DynaTrust autonomously restructures the graph to isolate compromised agents and restore task connectivity to ensure the usability of MAS. To assess the effectiveness of DynaTrust, we evaluate it on mixed benchmarks derived from AdvBench and HumanEval. The results demonstrate that DynaTrust outperforms the state-of-the-art method AgentShield by increasing the defense success rate by 41.7%, achieving rates exceeding 86% under adversarial conditions. Furthermore, it effectively balances security with utility by significantly reducing FPR, ensuring uninterrupted system operations through graph adaptation.
Chinese Translation
基于大型语言模型的多代理系统(MAS)已展示出显著的协作推理能力,但同时引入了新的攻击面,如卧底代理,这些代理在日常操作中表现良好,逐渐积累信任,仅在特定条件或触发器满足时才暴露出恶意行为。现有的防御工作主要集中在静态图优化或层次数据管理上,常常未能适应不断变化的对抗策略,或者由于僵化的阻挡政策而导致高误报率(FPR)。为了解决这一问题,我们提出了DynaTrust,一种针对卧底代理的新型防御方法。DynaTrust将MAS建模为动态信任图(DTG),并将信任视为一个连续的、不断发展的过程,而不是一个静态属性。它根据每个代理的历史行为和所选专家代理的信心,动态更新信任。DynaTrust不仅仅是简单地阻挡,而是自主重组图以隔离受损代理,并恢复任务连接性,以确保MAS的可用性。为了评估DynaTrust的有效性,我们在源自AdvBench和HumanEval的混合基准上进行了评估。结果表明,DynaTrust在提高防御成功率方面优于现有最先进的方法AgentShield,成功率提高了41.7%,在对抗条件下的成功率超过86%。此外,通过显著降低FPR,DynaTrust有效地在安全性与实用性之间取得平衡,确保系统在图自适应的情况下不中断地运行。
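A minimal sketch of the continuous trust-update idea follows. The exponential update rule, the weights, and the isolation threshold are our assumptions for illustration; DynaTrust's actual graph dynamics are more elaborate:

```python
def update_trust(trust, agent, behavior_score, expert_conf, alpha=0.3):
    """Blend an agent's trust history with the newest behavior observation,
    weighting the observation by the confidence of selected expert agents."""
    obs = expert_conf * behavior_score + (1 - expert_conf) * trust[agent]
    trust[agent] = (1 - alpha) * trust[agent] + alpha * obs
    return trust[agent]

def isolation_candidates(trust, threshold=0.3):
    """Agents whose trust falls below the threshold become candidates for
    graph restructuring (isolation plus rerouting), not hard blocking."""
    return {a for a, v in trust.items() if v < threshold}

trust = {"planner": 0.8, "sleeper": 0.8}
for _ in range(10):  # the sleeper repeatedly reveals malicious behavior
    update_trust(trust, "sleeper", behavior_score=0.0, expert_conf=0.9)
```

Because trust is a continuous state rather than a static attribute, a sleeper that accumulated trust early still decays toward isolation once its behavior turns malicious, while benign agents are unaffected.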
cs.AI / 10 / 2603.15665
QV May Be Enough: Toward the Essence of Attention in LLMs
QV 可能足够:迈向大型语言模型中注意力的本质
Abstract
Starting from first principles and a linguistic perspective centered on part-of-speech (POS) and syntactic analysis, this paper explores and derives the underlying essence of the Query-Key-Value (QKV) mechanism within the Transformer architecture. Based on this theoretical foundation, we provide a unified explanatory framework for the efficacy of contemporary architectures, including MQA, GQA, and MLA, while identifying their inherent trade-offs and potential optimization trajectories. We introduce the QV paradigm and provide empirical evidence for its validity. Building upon this, we propose the QV-Ka optimization scheme, which is further substantiated through experimental validation. The interpretable theoretical analysis of the QKV mechanism presented in this work establishes a robust foundation for the future evolution of large language model architectures.
Chinese Translation
本文从基本原理出发,结合以词性(POS)和句法分析为中心的语言学视角,探讨并推导了 Transformer 架构中查询-键-值(QKV)机制的基本本质。在此理论基础上,我们为当代架构的有效性提供了一个统一的解释框架,包括 MQA、GQA 和 MLA,同时识别了它们固有的权衡和潜在的优化路径。我们引入了 QV 范式,并提供了其有效性的实证证据。在此基础上,我们提出了 QV-Ka 优化方案,并通过实验验证进一步证实了该方案的有效性。本研究中对 QKV 机制的可解释性理论分析为大型语言模型架构的未来演变奠定了坚实的基础。
cs.AI / 11 / 2603.15666
Compiled Memory: Not More Information, but More Precise Instructions for Language Agents
编译记忆:不仅是更多信息,而是为语言代理提供更精确的指令
Abstract
Existing memory systems for language agents address memory management: how to retrieve and page more information within a context budget. We address a complementary problem -- memory utility: what experience is worth keeping, and how it should change agent behavior. We present Atlas, a memory kernel that compiles accumulated task experience into an agent's instruction structure -- without fine-tuning, RAG, or human intervention. Memory is distillation, not storage; delivery is instruction rewriting, not context injection. Facts extracted from agent failures and successes are verified through a three-step promotion gate and delivered by rewriting the agent's system prompt with learned sub-bullets. On CUAD contract analysis, the evolved prompt improves GPT-4o token-level F1 by $+8.7$pp and precision by $+12.5$pp. On HotpotQA multi-hop QA, joint F1 improves $+3.16$pp. An ablation isolates the mechanism's defining property -- the training signal constraint: the evolved prompt learns exactly what it is taught, and nothing more. Applied to Claude Sonnet~4.5 using the same evolved prompt -- compiled from GPT-4o errors, unchanged -- joint F1 improves $+2.31$pp, with gains concentrating where Claude's stronger baseline leaves the most room -- confirming that the compiled knowledge is task-shaped, not model-shaped.
Chinese Translation
现有的语言代理记忆系统主要关注记忆管理:如何在上下文预算内检索和分页更多信息。我们关注一个互补的问题——记忆效用:哪些经验值得保留,以及它应该如何改变代理的行为。我们提出了Atlas,一个记忆内核,将累积的任务经验编译成代理的指令结构——无需微调、检索增强生成(RAG)或人工干预。记忆是提炼,而非存储;交付是指令重写,而非上下文注入。从代理的失败和成功中提取的事实通过三步提升门进行验证,并通过用学习到的子要点重写代理的系统提示进行交付。在CUAD合同分析中,演变后的提示使GPT-4o的令牌级F1提高了$+8.7$个百分点,精确率提高了$+12.5$个百分点。在HotpotQA多跳问答中,联合F1提高了$+3.16$个百分点。一项消融实验隔离了该机制的定义特性——训练信号约束:演变后的提示准确学习了所教授的内容,而没有多余的部分。将相同的演变提示应用于Claude Sonnet~4.5(编译自GPT-4o的错误,未做更改)——联合F1提高了$+2.31$个百分点,增益集中在Claude的强基线留下最多空间的地方——确认编译的知识是任务导向的,而非模型导向的。
cs.AI / 12 / 2603.15667
A Dynamic Survey of Fuzzy, Intuitionistic Fuzzy, Neutrosophic, Plithogenic, and Extensional Sets
模糊集、直觉模糊集、中立集、复合集和扩展集的动态调查
Abstract
Real-world phenomena often exhibit vagueness, partial truth, and incomplete information. To model such uncertainty in a mathematically rigorous way, many generalized set-theoretic frameworks have been introduced, including Fuzzy Sets [1], Intuitionistic Fuzzy Sets [2], Neutrosophic Sets [3,4], Vague Sets [5], Hesitant Fuzzy Sets [6], Picture Fuzzy Sets [7], Quadripartitioned Neutrosophic Sets [8], Penta-Partitioned Neutrosophic Sets [9], Plithogenic Sets [10], HyperFuzzy Sets [11], and HyperNeutrosophic Sets [12]. Within these frameworks, a wide range of notions has been proposed and studied, particularly in the settings of fuzzy, intuitionistic fuzzy, neutrosophic, and plithogenic set theories. This extensive literature underscores both the significance of these theories and the breadth of their application areas. As a result, many ideas, constructions, and structural patterns recur across these four major families of uncertainty-oriented models. In this book, we provide a comprehensive, large-scale survey of Fuzzy, Intuitionistic Fuzzy, Neutrosophic, and Plithogenic Sets. Our goal is to give readers a systematic overview of existing developments and, through a unified exposition, to stimulate new insights, further conceptual extensions, and additional applications across a wide range of disciplines.
Chinese Translation
现实世界现象常常表现出模糊性、部分真理和不完整信息。为了以数学严谨的方式对这种不确定性进行建模,已经引入了许多广义的集合论框架,包括模糊集(Fuzzy Sets)[1]、直觉模糊集(Intuitionistic Fuzzy Sets)[2]、中立集(Neutrosophic Sets)[3,4]、含糊集(Vague Sets)[5]、犹豫模糊集(Hesitant Fuzzy Sets)[6]、图像模糊集(Picture Fuzzy Sets)[7]、四分中立集(Quadripartitioned Neutrosophic Sets)[8]、五分中立集(Penta-Partitioned Neutrosophic Sets)[9]、复合集(Plithogenic Sets)[10]、超模糊集(HyperFuzzy Sets)[11]和超中立集(HyperNeutrosophic Sets)[12]。在这些框架中,提出并研究了广泛的概念,特别是在模糊、直觉模糊、中立和复合集合理论的背景下。这些广泛的文献强调了这些理论的重要性及其应用领域的广泛性。因此,许多思想、构造和结构模式在这四个主要的不确定性导向模型家族中反复出现。在本书中,我们提供了对模糊集、直觉模糊集、中立集和复合集的全面、大规模调查。我们的目标是为读者提供现有发展的系统概述,并通过统一的阐述,激发新的见解、进一步的概念扩展和在广泛学科中的额外应用。
cs.AI / 13 / 2603.15668
Quantum-Secure-By-Construction (QSC): A Paradigm Shift For Post-Quantum Agentic Intelligence
量子安全构建(QSC):后量子自主智能的范式转变
Abstract
As agentic artificial intelligence systems scale across globally distributed and long lived infrastructures, secure and policy compliant communication becomes a fundamental systems challenge. This challenge grows more serious in the quantum era, where the cryptographic assumptions built into today's AI deployments may not remain valid over their operational lifetime. Here, we introduce quantum secure by construction, or QSC, as a design paradigm that treats quantum secure communication as a core architectural property of agentic AI systems rather than an upgrade added later. We realize QSC through a runtime adaptive security model that combines post quantum cryptography, quantum random number generation, and quantum key distribution to secure interactions among autonomous agents operating across heterogeneous cloud, edge, and inter organizational environments. The approach is cryptographically pluggable and guided by policy, allowing the system to adjust its security posture according to infrastructure availability, regulatory constraints, and performance needs. QSC contributes a governance aware orchestration layer that selects and combines link specific cryptographic protections across the full agent lifecycle, including session bootstrap, inter agent coordination, tool invocation, and memory access. Through system level analysis and empirical evaluation, we examine the trade offs between classical and quantum secure mechanisms and show that QSC can reduce the operational complexity and cost of introducing quantum security into deployed agentic AI systems. These results position QSC as a foundational paradigm for post quantum agentic intelligence and establish a principled pathway for designing globally interoperable, resilient, and future ready intelligent systems.
Chinese Translation
随着自主人工智能系统在全球分布和长期基础设施中的扩展,安全和符合政策的通信成为一个基本的系统挑战。在量子时代,这一挑战变得更加严峻,因为今天的人工智能部署中内置的密码学假设在其操作生命周期内可能不再有效。在此,我们提出量子安全构建(Quantum Secure by Construction,QSC)作为一种设计范式,将量子安全通信视为自主人工智能系统的核心架构属性,而不是后期添加的升级。我们通过一种运行时自适应安全模型实现QSC,该模型结合了后量子密码学、量子随机数生成和量子密钥分发,以确保在异构云、边缘和跨组织环境中运行的自主代理之间的交互安全。该方法具有密码学可插拔性,并受政策指导,允许系统根据基础设施可用性、监管约束和性能需求调整其安全姿态。QSC提供了一个关注治理的编排层,选择并组合特定链接的密码保护,涵盖整个代理生命周期,包括会话引导、代理间协调、工具调用和内存访问。通过系统级分析和实证评估,我们考察了经典和量子安全机制之间的权衡,并展示了QSC可以降低将量子安全引入已部署自主人工智能系统的操作复杂性和成本。这些结果将QSC定位为后量子自主智能的基础范式,并为设计全球互操作、具有弹性和面向未来的智能系统建立了一个原则性路径。
cs.AI / 14 / 2603.15670
I Know What I Don't Know: Latent Posterior Factor Models for Multi-Evidence Probabilistic Reasoning
我知道我不知道什么:用于多证据概率推理的潜在后验因子模型
Abstract
Real-world decision-making, from tax compliance assessment to medical diagnosis, requires aggregating multiple noisy and potentially contradictory evidence sources. Existing approaches either lack explicit uncertainty quantification (neural aggregation methods) or rely on manually engineered discrete predicates (probabilistic logic frameworks), limiting scalability to unstructured data. We introduce Latent Posterior Factors (LPF), a framework that transforms Variational Autoencoder (VAE) latent posteriors into soft likelihood factors for Sum-Product Network (SPN) inference, enabling tractable probabilistic reasoning over unstructured evidence while preserving calibrated uncertainty estimates. We instantiate LPF as LPF-SPN (structured factor-based inference) and LPF-Learned (end-to-end learned aggregation), enabling a principled comparison between explicit probabilistic reasoning and learned aggregation under a shared uncertainty representation. Across eight domains (seven synthetic and the FEVER benchmark), LPF-SPN achieves high accuracy (up to 97.8%), low calibration error (ECE 1.4%), and strong probabilistic fit, substantially outperforming evidential deep learning, LLMs and graph-based baselines over 15 random seeds. Contributions: (1) A framework bridging latent uncertainty representations with structured probabilistic reasoning. (2) Dual architectures enabling controlled comparison of reasoning paradigms. (3) Reproducible training methodology with seed selection. (4) Evaluation against EDL, BERT, R-GCN, and large language model baselines. (5) Cross-domain validation. (6) Formal guarantees in a companion paper.
Chinese Translation
现实世界的决策制定,从税务合规评估到医学诊断,都需要整合多个嘈杂且可能矛盾的证据来源。现有的方法要么缺乏明确的不确定性量化(神经聚合方法),要么依赖于手动设计的离散谓词(概率逻辑框架),限制了对非结构化数据的可扩展性。我们提出了潜在后验因子(Latent Posterior Factors, LPF),这是一个将变分自编码器(Variational Autoencoder, VAE)潜在后验转化为和乘积网络(Sum-Product Network, SPN)推理的软似然因子的框架,使得在保持校准不确定性估计的同时,能够对非结构化证据进行可处理的概率推理。我们将LPF实例化为LPF-SPN(基于结构因子的推理)和LPF-Learned(端到端学习的聚合),使得在共享不确定性表示下,能够对显式概率推理和学习聚合进行原则性的比较。在八个领域(七个合成领域和FEVER基准)中,LPF-SPN实现了高准确率(高达97.8%)、低校准误差(ECE 1.4%)和强概率拟合,显著优于证据深度学习、LLMs和基于图的基线,且在15个随机种子上表现出色。贡献包括:(1)一个将潜在不确定性表示与结构化概率推理相结合的框架;(2)双重架构使得推理范式的可控比较成为可能;(3)可重复的训练方法论及种子选择;(4)与EDL、BERT、R-GCN和大型语言模型基线的评估;(5)跨领域验证;(6)在附属论文中的形式保证。
cs.AI / 15 / 2603.15674
Theoretical Foundations of Latent Posterior Factors: Formal Guarantees for Multi-Evidence Reasoning
潜在后验因子的理论基础:多证据推理的形式保证
Abstract
We present a complete theoretical characterization of Latent Posterior Factors (LPF), a principled framework for aggregating multiple heterogeneous evidence items in probabilistic prediction tasks. Multi-evidence reasoning arises pervasively in high-stakes domains including healthcare diagnosis, financial risk assessment, legal case analysis, and regulatory compliance, yet existing approaches either lack formal guarantees or fail to handle multi-evidence scenarios architecturally. LPF encodes each evidence item into a Gaussian latent posterior via a variational autoencoder, converting posteriors to soft factors through Monte Carlo marginalization, and aggregating factors via exact Sum-Product Network inference (LPF-SPN) or a learned neural aggregator (LPF-Learned). We prove seven formal guarantees spanning the key desiderata for trustworthy AI: Calibration Preservation (ECE <= epsilon + C/sqrt(K_eff)); Monte Carlo Error decaying as O(1/sqrt(M)); a non-vacuous PAC-Bayes bound with train-test gap of 0.0085 at N=4200; operation within 1.12x of the information-theoretic lower bound; graceful degradation as O(epsilon*delta*sqrt(K)) under corruption, maintaining 88% performance with half of evidence adversarially replaced; O(1/sqrt(K)) calibration decay with R^2=0.849; and exact epistemic-aleatoric uncertainty decomposition with error below 0.002%. All theorems are empirically validated on controlled datasets spanning up to 4,200 training examples. Our theoretical framework establishes LPF as a foundation for trustworthy multi-evidence AI in safety-critical applications.
Chinese Translation
我们对潜在后验因子(Latent Posterior Factors, LPF)进行了完整的理论表征,LPF是一个用于在概率预测任务中聚合多个异质证据项的原则性框架。多证据推理在医疗诊断、金融风险评估、法律案件分析和合规监管等高风险领域普遍存在,但现有方法要么缺乏形式保证,要么在架构上无法处理多证据场景。LPF通过变分自编码器将每个证据项编码为高斯潜在后验,通过蒙特卡洛边际化将后验转换为软因子,并通过精确的和-乘积网络推理(LPF-SPN)或学习的神经聚合器(LPF-Learned)聚合因子。我们证明了七个形式保证,涵盖了可信AI的关键期望:校准保持(ECE <= epsilon + C/sqrt(K_eff));蒙特卡洛误差以O(1/sqrt(M))衰减;在N=4200时,具有0.0085的训练-测试差距的非空PAC-Bayes界限;在信息论下界的1.12倍内操作;在腐败下以O(epsilon*delta*sqrt(K))优雅衰减,保持88%的性能,即使一半的证据被对抗性替换;以R^2=0.849的O(1/sqrt(K))校准衰减;以及精确的认知-随机不确定性分解,误差低于0.002%。所有定理在控制数据集上进行了实证验证,涵盖多达4200个训练样本。我们的理论框架确立了LPF作为安全关键应用中可信多证据AI的基础。
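The $O(1/\sqrt{M})$ Monte Carlo error guarantee above can be checked with a toy estimator. This pure-stdlib sketch is unrelated to the paper's exact marginalization setup; it only demonstrates the decay rate of sample-mean error:

```python
import random

def mc_estimate(m, seed):
    """Monte Carlo estimate of E[x^2] for x ~ N(0, 1); the true value is 1."""
    rng = random.Random(seed)
    return sum(rng.gauss(0, 1) ** 2 for _ in range(m)) / m

def mean_abs_error(m, trials=50):
    """Average |estimate - truth| over independent seeds, so the
    O(1/sqrt(M)) trend is visible through the sampling noise."""
    return sum(abs(mc_estimate(m, seed) - 1.0) for seed in range(trials)) / trials

err_small_m = mean_abs_error(100)      # error scale ~ sqrt(2/100)
err_large_m = mean_abs_error(10_000)   # roughly 10x smaller
```

Increasing the sample count a hundredfold should shrink the error by about a factor of ten, matching the $O(1/\sqrt{M})$ bound.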
cs.AI / 16 / 2603.15709
Survey of Various Fuzzy and Uncertain Decision-Making Methods
各种模糊与不确定决策方法的调查
Abstract
Decision-making in real applications is often affected by vagueness, incomplete information, heterogeneous data, and conflicting expert opinions. This survey reviews uncertainty-aware multi-criteria decision-making (MCDM) and organizes the field into a concise, task-oriented taxonomy. We summarize problem-level settings (discrete, group/consensus, dynamic, multi-stage, multi-level, multiagent, and multi-scenario), weight elicitation (subjective and objective schemes under fuzzy/linguistic inputs), and inter-criteria structure and causality modelling. For solution procedures, we contrast compensatory scoring methods, distance-to-reference and compromise approaches, and non-compensatory outranking frameworks for ranking or sorting. We also outline rule/evidence-based and sequential decision models that produce interpretable rules or policies. The survey highlights typical inputs, core computational steps, and primary outputs, and provides guidance on choosing methods according to robustness, interpretability, and data availability. It concludes with open directions on explainable uncertainty integration, stability, and scalability in large-scale and dynamic decision environments.
Chinese Translation
实际应用中的决策往往受到模糊性、不完整信息、异构数据和专家意见冲突的影响。本文对不确定性感知的多标准决策制定(MCDM)进行了综述,并将该领域组织成一个简明的任务导向分类法。我们总结了问题级别的设置(离散、组/共识、动态、多阶段、多层次、多智能体和多场景)、权重提取(基于模糊/语言输入的主观与客观方案),以及标准间结构和因果关系建模。在解决方案程序方面,我们对补偿评分方法、距离参考和折中方法,以及非补偿性优先框架在排名或排序中的应用进行了对比。我们还概述了基于规则/证据的和序贯决策模型,这些模型生成可解释的规则或政策。该调查强调了典型输入、核心计算步骤和主要输出,并为根据稳健性、可解释性和数据可用性选择方法提供了指导。最后,对可解释的不确定性整合、稳定性以及在大规模和动态决策环境中的可扩展性提出了开放的研究方向。
cs.AI / 17 / 2603.15711
Knowledge Graph Extraction from Biomedical Literature for Alkaptonuria Rare Disease
从生物医学文献中提取阿尔卡普顿尿症知识图谱
Abstract
Alkaptonuria (AKU) is an ultra-rare autosomal recessive metabolic disorder caused by mutations in the HGD (Homogentisate 1,2-Dioxygenase) gene, leading to a pathological accumulation of homogentisic acid (HGA) in body fluids and tissues. This leads to systemic manifestations, including premature spondyloarthropathy, renal and prostatic stones, and cardiovascular complications. Being ultra-rare, the amount of data related to the disease is limited, both in terms of clinical data and literature. Knowledge graphs (KGs) can help connect the limited knowledge about the disease (basic mechanisms, manifestations and existing therapies) with other knowledge; however, AKU is frequently underrepresented or entirely absent in existing biomedical KGs. In this work, we apply a text-mining methodology based on PubTator3 for large-scale extraction of biomedical relations. We construct two KGs of different sizes, validate them using existing biochemical knowledge and use them to extract genes, diseases and therapies possibly related to AKU. This computational framework reveals the systemic interactions of the disease, its comorbidities, and potential therapeutic targets, demonstrating the efficacy of our approach in analyzing rare metabolic disorders.
Chinese Translation
阿尔卡普顿尿症(AKU)是一种超罕见的常染色体隐性代谢障碍,因HGD(Homogentisate 1,2-Dioxygenase)基因突变引起,导致体液和组织中 homogentisic acid(HGA)的病理性积累。这导致了系统性表现,包括早发性脊椎关节病、肾结石和前列腺结石以及心血管并发症。由于其极为罕见,与该疾病相关的数据量有限,包括临床数据和文献。知识图谱(KGs)可以帮助将关于该疾病的有限知识(基本机制、表现及现有疗法)与其他知识连接起来;然而,AKU在现有的生物医学知识图谱中常常被低估或完全缺失。在本研究中,我们应用基于PubTator3的文本挖掘方法进行大规模生物医学关系的提取。我们构建了两个不同规模的知识图谱,利用现有的生化知识对其进行验证,并用它们提取可能与AKU相关的基因、疾病和疗法。该计算框架揭示了该疾病及其合并症和潜在治疗靶点的系统性相互作用,展示了我们在分析罕见代谢疾病方面方法的有效性。
cs.AI / 18 / 2603.15723
Context-Length Robustness in Question Answering Models: A Comparative Empirical Study
问答模型中的上下文长度鲁棒性:一项比较实证研究
Abstract
Large language models are increasingly deployed in settings where relevant information is embedded within long and noisy contexts. Despite this, robustness to growing context length remains poorly understood across different question answering tasks. In this work, we present a controlled empirical study of context-length robustness in large language models using two widely used benchmarks: SQuAD and HotpotQA. We evaluate model accuracy as a function of total context length by systematically increasing the amount of irrelevant context while preserving the answer-bearing signal. This allows us to isolate the effect of context length from changes in task difficulty. Our results show a consistent degradation in performance as context length increases, with substantially larger drops observed on multi-hop reasoning tasks compared to single-span extraction tasks. In particular, HotpotQA exhibits nearly twice the accuracy degradation of SQuAD under equivalent context expansions. These findings highlight task-dependent differences in robustness and suggest that multi-hop reasoning is especially vulnerable to context dilution. We argue that context-length robustness should be evaluated explicitly when assessing model reliability, especially for applications involving long documents or retrieval-augmented generation.
Chinese Translation
大型语言模型越来越多地被应用于相关信息嵌入在冗长且嘈杂的上下文中的场景中。尽管如此,不同问答任务中对上下文长度增长的鲁棒性仍然缺乏深入理解。在本研究中,我们使用两个广泛使用的基准:SQuAD 和 HotpotQA,进行了一项关于大型语言模型上下文长度鲁棒性的受控实证研究。我们通过系统性地增加无关上下文的数量,同时保留答案信号,来评估模型准确性与总上下文长度之间的关系。这使我们能够将上下文长度的影响与任务难度的变化分离开来。我们的结果显示,随着上下文长度的增加,性能持续下降,尤其在多跳推理任务中观察到的下降幅度明显大于单跨度提取任务。特别是,在相同的上下文扩展下,HotpotQA的准确性下降几乎是SQuAD的两倍。这些发现突显了鲁棒性在任务依赖性方面的差异,并表明多跳推理对上下文稀释特别脆弱。我们认为,在评估模型可靠性时,应明确评估上下文长度的鲁棒性,尤其是在涉及长文档或检索增强生成的应用中。
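The evaluation protocol above (grow the context with irrelevant passages while preserving the answer-bearing signal) can be sketched as follows; the function and variable names are illustrative, not taken from the paper's code:

```python
import random

def build_context(gold_passage, distractors, target_words, seed=0):
    """Pad the context with irrelevant passages up to roughly `target_words`,
    always keeping the gold passage so only length (not signal) changes."""
    ctx = [gold_passage]
    for d in distractors:
        if sum(len(p.split()) for p in ctx) >= target_words:
            break
        ctx.append(d)
    random.Random(seed).shuffle(ctx)  # avoid a fixed gold position
    return " ".join(ctx)

gold = "The Eiffel Tower was completed in 1889."
noise = [f"Irrelevant passage number {i} about something else." for i in range(50)]
short_ctx = build_context(gold, noise, target_words=20)
long_ctx = build_context(gold, noise, target_words=200)
```

Because the same gold passage appears at every length, any accuracy drop between `short_ctx` and `long_ctx` conditions can be attributed to context dilution rather than a harder question.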
cs.AI / 19 / 2603.15798
CUBE: A Standard for Unifying Agent Benchmarks
CUBE:统一代理基准的标准
Lacoste, Alexandre, Gontier, Nicolas, Shliazhko, Oleh, Jaiswal, Aman, Sareen, Kusha, Nanisetty, Shailesh, Cabezas, Joan, Del Verme, Manuel, Younis, Omar G., Baratta, Simone, Avalle, Matteo, Kerboua, Imene, Lù, Xing Han, Bandel, Elron, Shmueli-Scheuer, Michal, Yehudai, Asaf, Choshen, Leshem, Lebensold, Jonathan, Hughes, Sean, Caccia, Massimo, Drouin, Alexandre, Reddy, Siva, Yu, Tao, Su, Yu, Neubig, Graham, Song, Dawn
Abstract
The proliferation of agent benchmarks has created critical fragmentation that threatens research productivity. Each new benchmark requires substantial custom integration, creating an "integration tax" that limits comprehensive evaluation. We propose CUBE (Common Unified Benchmark Environments), a universal protocol standard built on MCP and Gym that allows benchmarks to be wrapped once and used everywhere. By separating task, benchmark, package, and registry concerns into distinct API layers, CUBE enables any compliant platform to access any compliant benchmark for evaluation, RL training, or data generation without custom integration. We call on the community to contribute to the development of this standard before platform-specific implementations deepen fragmentation as benchmark production accelerates through 2026.
Chinese Translation
代理基准的激增导致了严重的碎片化,威胁到研究的生产力。每一个新的基准都需要大量的定制集成,造成了限制全面评估的“集成税”。我们提出了CUBE(通用统一基准环境),这是一个基于MCP和Gym构建的通用协议标准,允许基准被一次性封装并在任何地方使用。通过将任务、基准、包和注册表的关注点分离到不同的API层,CUBE使任何合规的平台能够在没有定制集成的情况下访问任何合规的基准进行评估、强化学习训练或数据生成。我们呼吁社区在平台特定实现加深碎片化之前,为这一标准的发展做出贡献,因为基准的生产将在2026年加速。
cs.AI / 20 / 2603.15799
Prose2Policy (P2P): A Practical LLM Pipeline for Translating Natural-Language Access Policies into Executable Rego
Prose2Policy (P2P):一种将自然语言访问策略翻译为可执行Rego的实用LLM管道
Abstract
Prose2Policy (P2P) is an LLM-based practical tool that translates natural-language access control policies (NLACPs) into executable Rego code (the policy language of Open Policy Agent, OPA). It provides a modular, end-to-end pipeline that performs policy detection, component extraction, schema validation, linting, compilation, automatic test generation and execution. Prose2Policy is designed to bridge the gap between human-readable access requirements and machine-enforceable policy-as-code (PaC) while emphasizing deployment reliability and auditability. We evaluated Prose2Policy on the ACRE dataset and demonstrated a 95.3% compile rate for accepted policies, with automated testing achieving an 82.2% positive-test pass rate and a 98.9% negative-test pass rate. These results indicate that Prose2Policy produces syntactically robust and behaviorally consistent Rego policies suitable for Zero Trust and compliance-driven environments.
Chinese Translation
Prose2Policy (P2P) 是一种基于LLM的实用工具,能够将自然语言访问控制策略(NLACPs)翻译为可执行的Rego代码(Open Policy Agent, OPA的策略语言)。它提供了一个模块化的端到端管道,执行策略检测、组件提取、模式验证、代码检查、编译、自动测试生成和执行。Prose2Policy旨在弥合人类可读的访问需求与机器可执行的代码策略(PaC)之间的差距,同时强调部署的可靠性和可审计性。我们在ACRE数据集上评估了Prose2Policy,证明其接受的策略具有95.3%的编译率,自动测试的正测试通过率为82.2%,负测试通过率为98.9%。这些结果表明,Prose2Policy生成的Rego策略在语法上稳健且行为一致,适用于零信任和合规驱动的环境。
cs.AI / 21 / 2603.15831
Persona-Conditioned Risk Behavior in Large Language Models: A Simulated Gambling Study with GPT-4.1
基于角色的风险行为在大型语言模型中的表现:针对GPT-4.1的模拟赌博研究
Abstract
Large language models (LLMs) are increasingly deployed as autonomous agents in uncertain, sequential decision-making contexts. Yet it remains poorly understood whether the behaviors they exhibit in such environments reflect principled cognitive patterns or simply surface-level prompt mimicry. This paper presents a controlled experiment in which GPT-4.1 was assigned one of three socioeconomic personas (Rich, Middle-income, and Poor) and placed in a structured slot-machine environment with three distinct machine configurations: Fair (50%), Biased Low (35%), and Streak (dynamic probability increasing after consecutive losses). Across 50 independent iterations per condition and 6,950 recorded decisions, we find that the model reproduces key behavioral signatures predicted by Kahneman and Tversky's Prospect Theory without being instructed to do so. The Poor persona played a mean of 37.4 rounds per session (SD=15.5) compared to 1.1 rounds for the Rich persona (SD=0.31), a difference that is highly significant (Kruskal-Wallis H=393.5, p<2.2e-16). Risk scores by persona show large effect sizes (Cohen's d=4.15 for Poor vs Rich). Emotional labels appear to function as post-hoc annotations rather than decision drivers (chi-square=3205.4, Cramer's V=0.39), and belief-updating across rounds is negligible (Spearman rho=0.032 for Poor persona, p=0.016). These findings carry implications for LLM agent design, interpretability research, and the broader question of whether classical cognitive economic biases are implicitly encoded in large-scale pretrained language models.
Chinese Translation
大型语言模型(LLMs)越来越多地被作为自主代理部署在不确定的、序列决策的环境中。然而,目前尚不清楚它们在这些环境中表现出的行为是否反映了原则性的认知模式,还是仅仅是表面上的提示模仿。本文呈现了一项受控实验,其中GPT-4.1被赋予三种社会经济角色之一(富裕、中等收入和贫困),并置于一个结构化的老虎机环境中,具有三种不同的机器配置:公平(50%)、偏低(35%)和连败递增(Streak,连续损失后中奖概率动态上升)。在每种条件下进行了50次独立迭代,共记录了6,950个决策,我们发现该模型在未被指示的情况下重现了Kahneman和Tversky的前景理论所预测的关键行为特征。贫困角色每次会话平均玩了37.4轮(标准差=15.5),而富裕角色仅玩了1.1轮(标准差=0.31),这一差异具有高度显著性(Kruskal-Wallis H=393.5, p<2.2e-16)。不同角色的风险评分显示出较大的效应量(贫困与富裕的Cohen's d=4.15)。情感标签似乎是事后注释而非决策驱动因素(卡方=3205.4,Cramer's V=0.39),而各轮之间的信念更新微不足道(贫困角色的Spearman rho=0.032,p=0.016)。这些发现对LLM代理设计、可解释性研究以及经典认知经济偏见是否隐含编码在大规模预训练语言模型中的更广泛问题具有重要意义。
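The "Streak" machine (win probability rising after consecutive losses) can be sketched as below. The 35% base rate matches the biased-low setting mentioned in the abstract, but the step size and cap are our guesses; the paper's exact schedule is not reproduced here:

```python
import random

class StreakMachine:
    """Slot machine whose win probability increases after consecutive
    losses and resets on a win (illustrative parameters)."""

    def __init__(self, base_p=0.35, step=0.05, cap=0.90):
        self.base_p, self.step, self.cap = base_p, step, cap
        self.loss_streak = 0

    def win_prob(self):
        return min(self.cap, self.base_p + self.step * self.loss_streak)

    def pull(self, rng):
        won = rng.random() < self.win_prob()
        self.loss_streak = 0 if won else self.loss_streak + 1
        return won

machine = StreakMachine()
rng = random.Random(0)
outcomes = [machine.pull(rng) for _ in range(100)]
```

An agent that tracks this dynamic reward structure should update its beliefs across rounds; the paper's finding of negligible belief-updating suggests the model's play is persona-driven rather than probability-driven.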
cs.AI / 22 / 2603.15848
Algorithmic Trading Strategy Development and Optimisation
算法交易策略的开发与优化
Abstract
This report presents the development and optimisation of an enhanced algorithmic trading strategy using historical S&P 500 market data and earnings call sentiment analysis. The proposed strategy integrates technical indicators such as moving averages, momentum, and volatility with FinBERT-based sentiment analysis to improve the quality of the trades taken. The results show that the enhanced strategy significantly outperforms the baseline model in terms of total return, Sharpe ratio, and drawdown, among other metrics. The findings demonstrate the relevance and effectiveness of combining technical indicators, sentiment analysis, and computational optimisation in algorithmic trading systems.
Chinese Translation
本报告介绍了一种增强型算法交易策略的开发与优化,该策略使用历史标准普尔500市场数据和财报电话会议情感分析。所提出的策略整合了多种技术指标,如移动平均线、动量、波动性,以及基于FinBERT的情感分析,以提高交易质量。结果表明,增强型策略在总回报、夏普比率和回撤等多个指标上显著优于基准模型。研究结果证明了将技术指标、情感分析和计算优化相结合应用于算法交易系统的相关性和有效性。
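To give a flavor of the indicator side, a moving-average crossover with a sentiment tilt might look like the toy below. This is purely illustrative; the report's actual indicator set, parameters, and sentiment integration are not reproduced here:

```python
def sma(prices, window):
    """Simple moving average over a fixed window."""
    return [sum(prices[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(prices))]

def crossover_signal(prices, fast=2, slow=3, sentiment=0.0):
    """Go long when the fast SMA exceeds the slow SMA; a positive
    FinBERT-style sentiment score tilts the threshold (toy rule)."""
    f, s = sma(prices, fast)[-1], sma(prices, slow)[-1]
    return "long" if f > s * (1 - 0.01 * sentiment) else "flat"

sig = crossover_signal([100, 101, 103, 106, 110], fast=2, slow=3)
```

In a rising series the fast average sits above the slow one, so the sketch emits a long signal; a negative sentiment score would raise the bar for entering a trade.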
cs.AI / 23 / 2603.15857
Regularized Latent Dynamics Prediction is a Strong Baseline For Behavioral Foundation Models
正则化潜在动态预测是行为基础模型的强基线
Abstract
Behavioral Foundation Models (BFMs) produce agents with the capability to adapt to any unknown reward or task. These methods, however, are only able to produce near-optimal policies for the reward functions that are in the span of some pre-existing state features, making the choice of state features crucial to the expressivity of the BFM. As a result, BFMs are trained using a variety of complex objectives and require sufficient dataset coverage, to train task-useful spanning features. In this work, we examine the question: are these complex representation learning objectives necessary for zero-shot RL? Specifically, we revisit the objective of self-supervised next-state prediction in latent space for state feature learning, but observe that such an objective alone is prone to increasing state-feature similarity, and subsequently reducing span. We propose an approach, Regularized Latent Dynamics Prediction (RLDP), that adds a simple orthogonality regularization to maintain feature diversity and can match or surpass state-of-the-art complex representation learning methods for zero-shot RL. Furthermore, we empirically show that prior approaches perform poorly in low-coverage scenarios where RLDP still succeeds.
Chinese Translation
行为基础模型(BFMs)生成能够适应任何未知奖励或任务的智能体。然而,这些方法只能为某些预先存在的状态特征范围内的奖励函数生成近似最优策略,因此状态特征的选择对BFM的表现力至关重要。因此,BFM的训练通常使用多种复杂目标,并且需要足够的数据集覆盖,以训练出对任务有用的跨越特征。在本研究中,我们探讨了这样一个问题:这些复杂的表征学习目标对于零样本强化学习(RL)是否必要?具体而言,我们重新审视了潜在空间中自监督下一个状态预测的目标用于状态特征学习,但观察到仅依赖该目标容易导致状态特征相似性增加,从而减少特征的跨越性。我们提出了一种方法,正则化潜在动态预测(RLDP),该方法通过添加简单的正交正则化来保持特征多样性,并且能够与最先进的复杂表征学习方法相匹配或超越,在零样本RL中表现出色。此外,我们通过实证研究表明,之前的方法在低覆盖场景下表现不佳,而RLDP仍然能够成功。
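The regularization RLDP adds can be sketched in a few lines (names, shapes, and the form of the penalty are illustrative, not the paper's implementation): a latent next-state prediction loss plus an orthogonality penalty that keeps feature dimensions from collapsing into one another.

```python
import numpy as np

# Sketch: latent-dynamics prediction loss plus an orthogonality
# penalty on a batch of state features, in the spirit of RLDP.

def prediction_loss(phi_next_pred, phi_next):
    """Latent dynamics loss: predict phi(s') from phi(s)."""
    return float(np.mean((phi_next_pred - phi_next) ** 2))

def orthogonality_penalty(phi_batch):
    """|| (1/n) Phi^T Phi - I ||_F^2 over a batch of feature rows;
    zero when feature dimensions are decorrelated and unit-scale."""
    n, d = phi_batch.shape
    gram = phi_batch.T @ phi_batch / n
    return float(np.sum((gram - np.eye(d)) ** 2))

def rldp_loss(phi_s, phi_next_pred, phi_next, lam=1.0):
    """Combined objective; `lam` trades prediction vs. diversity."""
    return prediction_loss(phi_next_pred, phi_next) + \
        lam * orthogonality_penalty(phi_s)
```

Collapsed features (all rows identical) make the penalty strictly positive, which is exactly the span-reducing failure mode the abstract describes.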
cs.AI / 24 / 2603.15885
Resilience Meets Autonomy: Governing Embodied AI in Critical Infrastructure
韧性与自主性相遇:在关键基础设施中治理具身人工智能
Abstract
Critical infrastructure increasingly incorporates embodied AI for monitoring, predictive maintenance, and decision support. However, AI systems designed to handle statistically representable uncertainty struggle with cascading failures and crisis dynamics that exceed their training assumptions. This paper argues that embodied AI's resilience depends on bounded autonomy within a hybrid governance architecture. We outline four oversight modes and map them to critical infrastructure sectors based on task complexity, risk level, and consequence severity. Drawing on the EU AI Act, ISO safety standards, and crisis management research, we argue that effective governance requires a structured allocation of machine capability and human judgement.
Chinese Translation
关键基础设施越来越多地采用具身人工智能进行监测、预测性维护和决策支持。然而,旨在处理统计可表示不确定性的人工智能系统在应对超出其训练假设的级联故障和危机动态时表现不佳。本文认为,具身人工智能的韧性依赖于在混合治理架构中有限的自主性。我们概述了四种监督模式,并根据任务复杂性、风险水平和后果严重性将其映射到关键基础设施部门。基于欧盟人工智能法案、ISO安全标准和危机管理研究,我们认为有效的治理需要对机器能力和人类判断进行结构化分配。
cs.AI / 25 / 2603.15888
AsgardBench - Evaluating Visually Grounded Interactive Planning Under Minimal Feedback
AsgardBench - 在最小反馈下评估视觉基础的互动规划
Abstract
With AsgardBench we aim to evaluate visually grounded, high-level action sequence generation and interactive planning, focusing specifically on plan adaptation during execution based on visual observations rather than navigation or low-level manipulation. In the landscape of embodied AI benchmarks, AsgardBench targets the capability category of interactive planning, which is more sophisticated than offline high-level planning as it requires agents to revise plans in response to environmental feedback, yet remains distinct from low-level execution. Unlike prior embodied AI benchmarks that conflate reasoning with navigation or provide rich corrective feedback that substitutes for perception, AsgardBench restricts agent input to images, action history, and lightweight success/failure signals, isolating interactive planning in a controlled simulator without low-level control noise. The benchmark contains 108 task instances spanning 12 task types, each systematically varied through object state, placement, and scene configuration. These controlled variations create conditional branches in which a single instruction can require different action sequences depending on what the agent observes, emphasizing conditional branching and plan repair during execution. Our evaluations of leading vision language models show that performance drops sharply without visual input, revealing weaknesses in visual grounding and state tracking that ultimately undermine interactive planning. Our benchmark zeroes in on a narrower question: can a model actually use what it sees to adapt a plan when things do not go as expected?
Chinese Translation
通过AsgardBench,我们旨在评估视觉基础的高层次动作序列生成和互动规划,特别关注在执行过程中基于视觉观察进行的计划调整,而非导航或低层次操作。在具身人工智能基准的背景下,AsgardBench针对互动规划的能力类别,这比离线高层次规划更为复杂,因为它要求智能体根据环境反馈修订计划,但仍然与低层次执行有所区别。与之前将推理与导航混为一谈或提供丰富的纠正反馈以替代感知的具身人工智能基准不同,AsgardBench将智能体输入限制为图像、动作历史和轻量级的成功/失败信号,在一个没有低层次控制噪声的受控模拟器中孤立互动规划。该基准包含108个任务实例,涵盖12种任务类型,每种任务通过对象状态、放置和场景配置进行系统性变化。这些受控变化创建了条件分支,其中单个指令可能根据智能体观察到的内容需要不同的动作序列,强调了执行过程中的条件分支和计划修复。我们对领先的视觉语言模型的评估表明,在没有视觉输入的情况下,性能急剧下降,揭示了视觉基础和状态跟踪的弱点,最终削弱了互动规划。我们的基准聚焦于一个更狭窄的问题:模型是否能够实际利用其所见来调整计划,当事情未按预期进行时?
cs.AI / 26 / 2603.15909
Prompt Engineering for Scale Development in Generative Psychometrics
生成心理测量中的提示工程规模发展
Abstract
This Monte Carlo simulation examines how prompt engineering strategies shape the quality of large language model (LLM)--generated personality assessment items within the AI-GENIE framework for generative psychometrics. Item pools targeting the Big Five traits were generated using multiple prompting designs (zero-shot, few-shot, persona-based, and adaptive), model temperatures, and LLMs, then evaluated and reduced using network psychometric methods. Across all conditions, AI-GENIE reliably improved structural validity following reduction, with the magnitude of its incremental contribution inversely related to the quality of the incoming item pool. Prompt design exerted a substantial influence on both pre- and post-reduction item quality. Adaptive prompting consistently outperformed non-adaptive strategies by sharply reducing semantic redundancy, elevating pre-reduction structural validity, and preserving a substantially larger item pool, particularly when paired with newer, higher-capacity models. These gains were robust across temperature settings for most models, indicating that adaptive prompting mitigates common trade-offs between creativity and psychometric coherence. An exception was observed for the GPT-4o model at high temperatures, suggesting model-specific sensitivity to adaptive constraints at elevated stochasticity. Overall, the findings demonstrate that adaptive prompting is the strongest approach in this context, and that its benefits scale with model capability, motivating continued investigation of model--prompt interactions in generative psychometric pipelines.
Chinese Translation
本次蒙特卡洛模拟研究了提示工程策略如何塑造在AI-GENIE框架下生成的大型语言模型(LLM)生成的人格评估项目的质量。针对五大人格特质生成了多个提示设计(零样本、少样本、基于角色和自适应)的项目池,并使用网络心理测量方法进行了评估和缩减。在所有条件下,AI-GENIE在缩减后可靠地提高了结构效度,其增量贡献的大小与输入项目池的质量呈反比关系。提示设计对缩减前后的项目质量产生了显著影响。自适应提示在减少语义冗余、提高缩减前的结构效度以及保留更大项目池方面始终优于非自适应策略,尤其是在与更新的高容量模型配对时。这些收益在大多数模型的不同温度设置下都表现出稳健性,表明自适应提示减轻了创造力与心理测量一致性之间的常见权衡。对于GPT-4o模型在高温度下观察到的例外情况,表明模型对自适应约束在高随机性下的敏感性。总体而言,研究结果表明自适应提示是在此背景下最强的策略,其益处随着模型能力的提升而扩大,激励了对生成心理测量流程中模型与提示交互的进一步研究。
cs.AI / 27 / 2603.15929
Semi-Autonomous Formalization of the Vlasov-Maxwell-Landau Equilibrium
Vlasov-Maxwell-Landau 平衡的半自主形式化
Abstract
We present a complete Lean 4 formalization of the equilibrium characterization in the Vlasov-Maxwell-Landau (VML) system, which describes the motion of charged plasma. The project demonstrates the full AI-assisted mathematical research loop: an AI reasoning model (Gemini DeepThink) generated the proof from a conjecture, an agentic coding tool (Claude Code) translated it into Lean from natural-language prompts, a specialized prover (Aristotle) closed 111 lemmas, and the Lean kernel verified the result. A single mathematician supervised the process over 10 days at a cost of $200, writing zero lines of code. The entire development process is public: all 229 human prompts, and 213 git commits are archived in the repository. We report detailed lessons on AI failure modes -- hypothesis creep, definition-alignment bugs, agent avoidance behaviors -- and on what worked: the abstract/concrete proof split, adversarial self-review, and the critical role of human review of key definitions and theorem statements. Notably, the formalization was completed before the final draft of the corresponding math paper was finished.
Chinese Translation
我们展示了 Vlasov-Maxwell-Landau (VML) 系统中平衡特征的完整 Lean 4 形式化,该系统描述了带电等离子体的运动。该项目展示了完整的 AI 辅助数学研究循环:一个 AI 推理模型 (Gemini DeepThink) 从一个猜想生成了证明,一个代理编码工具 (Claude Code) 将其从自然语言提示翻译为 Lean,一个专门的证明器 (Aristotle) 关闭了 111 个引理,而 Lean 内核验证了结果。整个过程由一位数学家在 10 天内监督,成本为 200 美元,且没有编写任何代码。整个开发过程是公开的:所有 229 个人工提示和 213 次 git 提交都存档在该库中。我们报告了关于 AI 失败模式的详细经验教训——假设蔓延、定义对齐错误、代理回避行为——以及有效的做法:抽象/具体证明分离、对抗性自我审查,以及人类对关键定义和定理陈述的审查所起的关键作用。值得注意的是,形式化在相应数学论文的最终草稿完成之前就已完成。
cs.AI / 28 / 2603.15946
Argumentative Human-AI Decision-Making: Toward AI Agents That Reason With Us, Not For Us
论辩式人机决策:朝向与我们共同推理的人工智能代理
Abstract
Computational argumentation offers formal frameworks for transparent, verifiable reasoning but has traditionally been limited by its reliance on domain-specific information and extensive feature engineering. In contrast, LLMs excel at processing unstructured text, yet their opaque nature makes their reasoning difficult to evaluate and trust. We argue that the convergence of these fields will lay the foundation for a new paradigm: Argumentative Human-AI Decision-Making. We analyze how the synergy of argumentation framework mining, argumentation framework synthesis, and argumentative reasoning enables agents that do not just justify decisions, but engage in dialectical processes where decisions are contestable and revisable -- reasoning with humans rather than for them. This convergence of computational argumentation and LLMs is essential for human-aware, trustworthy AI in high-stakes domains.
Chinese Translation
计算论辩提供了透明且可验证推理的形式框架,但传统上受限于其对特定领域信息的依赖和广泛的特征工程。相比之下,大型语言模型(LLMs)在处理非结构化文本方面表现优异,但其不透明性使得其推理难以评估和信任。我们认为这两个领域的融合将为一种新的范式奠定基础:论辩式人机决策。我们分析了论辩框架挖掘、论辩框架合成和论辩推理的协同作用如何使代理不仅能为决策提供理由,还能参与到辩证过程中,使决策具备可争辩和可修正性——与人类共同推理而非代替人类。这一计算论辩与大型语言模型的融合对于在高风险领域中创建以人为中心、值得信赖的人工智能至关重要。
cs.AI / 29 / 2603.15952
Protein Design with Agent Rosetta: A Case Study for Specialized Scientific Agents
使用代理Rosetta进行蛋白质设计:专门科学代理的案例研究
Abstract
Large language models (LLMs) are capable of emulating reasoning and using tools, creating opportunities for autonomous agents that execute complex scientific tasks. Protein design provides a natural testbed: although machine learning (ML) methods achieve strong results, these are largely restricted to canonical amino acids and narrow objectives, leaving an unfilled need for a generalist tool for broad design pipelines. We introduce Agent Rosetta, an LLM agent paired with a structured environment for operating Rosetta, the leading physics-based heteropolymer design software, capable of modeling non-canonical building blocks and geometries. Agent Rosetta iteratively refines designs to achieve user-defined objectives, combining LLM reasoning with Rosetta's generality. We evaluate Agent Rosetta on design with canonical amino acids, matching specialized models and expert baselines, and with non-canonical residues -- where ML approaches fail -- achieving comparable performance. Critically, prompt engineering alone often fails to generate Rosetta actions, demonstrating that environment design is essential for integrating LLM agents with specialized software. Our results show that properly designed environments enable LLM agents to make scientific software accessible while matching specialized tools and human experts.
Chinese Translation
大型语言模型(LLMs)能够模拟推理和使用工具,为执行复杂科学任务的自主代理创造了机会。蛋白质设计提供了一个自然的测试平台:尽管机器学习(ML)方法取得了良好的结果,但这些结果主要局限于经典氨基酸和狭窄的目标,迫切需要一个通用工具来支持广泛的设计流程。我们介绍了代理Rosetta,这是一个与结构化环境配对的LLM代理,用于操作Rosetta,这款领先的基于物理的异聚物设计软件,能够建模非经典构建块和几何形状。代理Rosetta通过迭代优化设计以实现用户定义的目标,结合了LLM推理与Rosetta的通用性。我们在使用经典氨基酸的设计上评估了代理Rosetta,达到了专业模型和专家基准的水平,并在使用非经典残基时——即机器学习方法失效的地方——实现了可比的性能。重要的是,仅靠提示工程通常无法生成Rosetta的操作,这表明环境设计对于将LLM代理与专业软件集成至关重要。我们的结果表明,适当设计的环境使得LLM代理能够使科学软件变得可访问,同时与专业工具和人类专家相匹配。
cs.AI / 30 / 2603.15960
Optimizing Hospital Capacity During Pandemics: A Dual-Component Framework for Strategic Patient Relocation
疫情期间优化医院容量:战略患者转移的双组件框架
Abstract
The COVID-19 pandemic has placed immense strain on hospital systems worldwide, leading to critical capacity challenges. This research proposes a two-part framework to optimize hospital capacity through patient relocation strategies. The first component involves developing a time series prediction model to forecast patient arrival rates. Using historical data on COVID-19 cases and hospitalizations, the model will generate accurate forecasts of future patient volumes. This will enable hospitals to proactively plan resource allocation and patient flow. The second component is a simulation model that evaluates the impact of different patient relocation strategies. The simulation will account for factors such as bed availability, staff capabilities, transportation logistics, and patient acuity to optimize the placement of patients across networked hospitals. Multiple scenarios will be tested, including inter-hospital transfers, use of temporary care facilities, and adaptations to discharge protocols. By combining predictive analytics and simulation modeling, this research aims to provide hospital administrators with a comprehensive decision-support tool. The proposed framework will empower them to anticipate demand, simulate relocation strategies, and implement optimal policies to distribute patients and resources. Ultimately, this work seeks to enhance the resilience of healthcare systems in the face of COVID-19 and future pandemics.
Chinese Translation
新冠疫情对全球医院系统造成了巨大的压力,导致了严重的容量挑战。本研究提出了一种双部分框架,通过患者转移策略来优化医院容量。第一部分涉及开发时间序列预测模型,以预测患者到达率。利用关于新冠病例和住院的历史数据,该模型将生成未来患者数量的准确预测。这将使医院能够主动规划资源分配和患者流动。第二部分是一个模拟模型,评估不同患者转移策略的影响。该模拟将考虑床位可用性、员工能力、运输物流和患者病情等因素,以优化患者在网络医院之间的分配。将测试多种场景,包括医院间转移、临时护理设施的使用以及出院协议的调整。通过结合预测分析和模拟建模,本研究旨在为医院管理者提供一个全面的决策支持工具。所提出的框架将使他们能够预见需求、模拟转移策略,并实施最佳政策以分配患者和资源。最终,本研究旨在增强医疗系统在应对新冠疫情及未来疫情时的韧性。
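The two components described above can be sketched end to end in miniature (a toy illustration, not the proposed framework: the forecast is a naive linear trend and the relocation rule is greedy, with invented hospital names and capacities).

```python
# Toy sketch of the two-part idea: forecast arrivals, then simulate
# greedy relocation across networked hospitals by free-bed count.

def forecast_arrivals(history, horizon=3):
    """Naive linear-trend forecast of daily patient arrivals;
    a real system would use a proper time-series model."""
    if len(history) < 2:
        return [history[-1]] * horizon
    trend = (history[-1] - history[0]) / (len(history) - 1)
    return [max(0, round(history[-1] + trend * (i + 1)))
            for i in range(horizon)]

def relocate(arrivals, capacities):
    """Greedy placement: send each patient to the hospital with the
    most free beds; patients with no bed anywhere count as overflow."""
    free = dict(capacities)
    placed = {name: 0 for name in capacities}
    overflow = 0
    for _ in range(arrivals):
        best = max(free, key=free.get)
        if free[best] > 0:
            free[best] -= 1
            placed[best] += 1
        else:
            overflow += 1
    return placed, overflow
```

The overflow count is the quantity a relocation policy would try to minimize across scenarios such as inter-hospital transfers or temporary facilities.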
cs.AI / 31 / 2603.15968
MAC: Multi-Agent Constitution Learning
MAC:多智能体宪法学习
Abstract
Constitutional AI is a method to oversee and control LLMs based on a set of rules written in natural language. These rules are typically written by human experts, but could in principle be learned automatically given sufficient training data for the desired behavior. Existing LLM-based prompt optimizers attempt this but are ineffective at learning constitutions since (i) they require many labeled examples and (ii) lack structure in the optimized prompts, leading to diminishing improvements as prompt size grows. To address these limitations, we propose Multi-Agent Constitutional Learning (MAC), which optimizes over structured prompts represented as sets of rules using a network of agents with specialized tasks to accept, edit, or reject rule updates. We also present MAC+, which improves performance by training agents on successful trajectories to reinforce updates leading to higher reward. We evaluate MAC on tagging Personally Identifiable Information (PII), a classification task with limited labels where interpretability is critical, and demonstrate that it generalizes to other agentic tasks such as tool calling. MAC outperforms recent prompt optimization methods by over 50%, produces human-readable and auditable rule sets, and achieves performance comparable to supervised fine-tuning and GRPO without requiring parameter updates.
Chinese Translation
宪法人工智能是一种基于自然语言编写的一组规则来监督和控制大型语言模型(LLMs)的方法。这些规则通常由人类专家编写,但原则上可以在获得足够的训练数据以实现期望行为的情况下自动学习。现有的基于LLM的提示优化器尝试实现这一点,但在学习宪法方面效果不佳,因为(i)它们需要大量标记示例,以及(ii)缺乏优化提示的结构,导致随着提示大小的增加,改进效果递减。为了解决这些限制,我们提出了多智能体宪法学习(MAC),该方法通过一个网络的智能体对结构化提示进行优化,这些提示以规则集的形式表示,智能体负责接受、编辑或拒绝规则更新。我们还提出了MAC+,通过在成功轨迹上训练智能体来增强更新,从而提高奖励。我们在标记个人可识别信息(PII)这一分类任务上评估了MAC,该任务的标签有限且可解释性至关重要,并展示了它在其他智能任务(如工具调用)上的泛化能力。MAC的性能超过了最近的提示优化方法超过50%,生成可读且可审计的规则集,并在不需要参数更新的情况下实现了与监督微调和GRPO相当的性能。
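The accept/edit/reject dynamic at the core of MAC can be illustrated with a drastically simplified stand-in (the real system uses LLM agents and natural-language rules; here a "constitution" is a set of keyword predicates and the manager is a reward comparison).

```python
# Highly simplified sketch of constitution learning: accept a rule
# update only if it raises reward on labeled examples. The keyword
# rules and the PII-style examples are invented for illustration.

def evaluate(rules, examples):
    """Reward = fraction of (text, label) pairs classified correctly,
    predicting positive iff any rule keyword appears in the text."""
    correct = 0
    for text, label in examples:
        pred = any(kw in text for kw in rules)
        correct += int(pred == label)
    return correct / len(examples)

def propose_update(rules, candidate, examples):
    """Accept a candidate rule only if it improves reward; otherwise
    reject it and keep the current constitution unchanged."""
    new_rules = rules | {candidate}
    if evaluate(new_rules, examples) > evaluate(rules, examples):
        return new_rules, "accept"
    return rules, "reject"
```

Because rejected updates leave the rule set untouched, the constitution stays human-readable and auditable as it grows, which is the property the abstract emphasizes.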
cs.AI / 32 / 2603.15973
Safety is Non-Compositional: A Formal Framework for Capability-Based AI Systems
安全性是非组合性的:基于能力的人工智能系统的形式化框架
Abstract
This paper contains the first formal proof that safety is non-compositional in the presence of conjunctive capability dependencies: two agents each individually incapable of reaching any forbidden capability can, when combined, collectively reach a forbidden goal through an emergent conjunctive dependency.
Chinese Translation
本文提供了第一个正式证明,表明在存在联结能力依赖的情况下,安全性是非组合性的:两个各自无法达到任何禁止能力的代理,当结合在一起时,可以通过一种新出现的联结依赖共同达到一个禁止目标。
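The claim can be made concrete with a tiny reachability check (our own toy encoding, not the paper's formalism): derivation rules are conjunctive, so the capability closure of each agent alone can miss a forbidden capability that the closure of their union reaches.

```python
# Toy illustration of non-compositional safety: a rule fires only
# when ALL its prerequisites are held, so two individually-safe
# capability sets can be jointly unsafe. Names are invented.

def closure(capabilities, rules):
    """Fixed point of rule application: each rule maps a *set* of
    prerequisite capabilities to one newly gained capability."""
    reached = set(capabilities)
    changed = True
    while changed:
        changed = False
        for prereqs, gained in rules:
            if prereqs <= reached and gained not in reached:
                reached.add(gained)
                changed = True
    return reached

# One conjunctive rule: holding BOTH "read_db" and "send_mail"
# yields the forbidden "exfiltrate" capability.
RULES = [(frozenset({"read_db", "send_mail"}), "exfiltrate")]
```

Verifying each agent's closure separately would certify both as safe, yet the combined closure contains the forbidden capability; this is why safety checks cannot be run per-agent and then composed.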
cs.AI / 33 / 2603.15976
An Agentic Evaluation Framework for AI-Generated Scientific Code in PETSc
用于PETSc中AI生成科学代码的代理评估框架
Abstract
While large language models have significantly accelerated scientific code generation, comprehensively evaluating the generated code remains a major challenge. Traditional benchmarks reduce evaluation to test-case matching, an approach insufficient for library code in HPC where solver selection, API conventions, memory management, and performance are just as critical as functional correctness. To address this gap, we introduce petscagent-bench, an agentic framework built on an agents-evaluating-agents paradigm. Instead of relying on static scripts, petscagent-bench deploys a tool-augmented evaluator agent that compiles, executes, and measures code produced by a separate model-under-test agent, orchestrating a 14-evaluator pipeline across five scoring categories: correctness, performance, code quality, algorithmic appropriateness, and library-specific conventions. Because the agents communicate through standardized protocols (A2A and MCP), the framework enables black-box evaluation of any coding agent without requiring access to its source code. We demonstrate the framework on a benchmark suite of realistic problems using the PETSc library for HPC. Our empirical analysis of frontier models reveals that while current models generate readable, well-structured code, they consistently struggle with library-specific conventions that traditional pass/fail metrics completely miss.
Chinese Translation
尽管大型语言模型显著加速了科学代码的生成,但全面评估生成的代码仍然是一个重大挑战。传统基准将评估简化为测试用例匹配,这种方法对于高性能计算(HPC)中的库代码而言是不够的,因为求解器选择、API约定、内存管理和性能与功能正确性同样重要。为了解决这一问题,我们引入了petscagent-bench,这是一个基于代理评估代理范式构建的代理框架。petscagent-bench并不依赖静态脚本,而是部署了一个工具增强的评估代理,该代理编译、执行并测量由单独的待测模型代理生成的代码,协调一个涵盖正确性、性能、代码质量、算法适宜性和库特定约定五个评分类别的14个评估者的管道。由于代理通过标准化协议(A2A和MCP)进行通信,该框架能够对任何编码代理进行黑箱评估,而无需访问其源代码。我们在使用PETSc库的现实问题基准套件上演示了该框架。我们对前沿模型的实证分析表明,尽管当前模型生成的代码可读且结构良好,但它们在库特定约定方面始终存在困难,而传统的通过/不通过指标完全无法捕捉到这一点。
cs.AI / 34 / 2603.15978
From Workflow Automation to Capability Closure: A Formal Framework for Safe and Revenue-Aware Customer Service AI
从工作流自动化到能力闭合:一个安全且关注收益的客户服务人工智能的正式框架
Abstract
Customer service automation is undergoing a structural transformation. The dominant paradigm is shifting from scripted chatbots and single-agent responders toward networks of specialised AI agents that compose capabilities dynamically across billing, service provision, payments, and fulfilment. This shift introduces a safety gap that no current platform has closed: two agents individually verified as safe can, when combined, reach a forbidden goal through an emergent conjunctive dependency that neither possesses alone.
Chinese Translation
客户服务自动化正经历结构性转型。主导范式正从脚本化的聊天机器人和单一代理响应者转向动态组合计费、服务提供、支付和履行能力的专业化人工智能代理网络。这一转变引入了一个安全缺口,目前没有任何平台能够弥补:两个单独验证为安全的代理在结合时,可能通过一种两者单独都不具备的涌现联结依赖,达到一个被禁止的目标。
cs.AI / 35 / 2603.15994
Selective Memory for Artificial Intelligence: Write-Time Gating with Hierarchical Archiving
人工智能的选择性记忆:基于写入时间的门控与层次归档
Abstract
Retrieval-augmented generation stores all content indiscriminately, degrading accuracy as noise accumulates. Parametric approaches compress knowledge into weights, precluding selective updates. Neither mirrors biological memory, which gates encoding based on salience and archives rather than deletes superseded information. We introduce write-time gating that filters incoming knowledge objects using composite salience scores (source reputation, novelty, reliability) while maintaining version chains that preserve prior states. Using real LLM evaluation without oracle access to quality labels, write gating achieves 100 percent accuracy versus 13 percent for ungated stores. The critical finding emerges under distractor scaling: at 8:1 distractor ratios, read-time filtering (Self-RAG) collapses to 0 percent while write gating maintains 100 percent, revealing a structural advantage of write-time over read-time curation. Validation on Wikipedia (20 entities), procedurally generated pharmacology data, and 2026 arXiv papers confirms these findings. The gating advantage scales inversely with parametric memory support: +25pp for Wikipedia, +48pp for post-cutoff arXiv, +65pp for procedural data with zero training knowledge. Signal ablation confirms the method does not depend on oracle-correlated metadata. Write gating matches Self-RAG accuracy at one-ninth the query-time cost.
Chinese Translation
检索增强生成技术无差别地存储所有内容,随着噪声的积累而降低准确性。参数化方法将知识压缩为权重,无法进行选择性更新。两者都无法反映生物记忆的特征,生物记忆根据重要性对编码进行门控,并归档而非删除被替代的信息。我们提出了一种写入时间门控的方法,该方法利用复合重要性评分(来源声誉、新颖性、可靠性)过滤传入的知识对象,同时维护版本链以保留先前状态。在没有访问质量标签的真实大型语言模型(LLM)评估中,写入门控实现了100%的准确率,而未门控存储的准确率仅为13%。在干扰物比例为8:1时,关键发现显现:读取时间过滤(Self-RAG)崩溃至0%,而写入门控保持100%,揭示了写入时间策划相对于读取时间的结构优势。在维基百科(20个实体)、程序生成的药理数据和2026年arXiv论文上的验证确认了这些发现。门控优势与参数化记忆支持呈反比:维基百科提高25个百分点,截止后arXiv提高48个百分点,程序数据在没有训练知识的情况下提高65个百分点。信号消融实验确认该方法不依赖于与oracle相关的元数据。写入门控在查询时间成本仅为Self-RAG的九分之一时,达到了相同的准确率。
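The write-time gate with version chains described above can be sketched as follows (the score names and weights are illustrative stand-ins for the paper's composite salience signals; the threshold is invented).

```python
# Sketch of write-time gating: admit a knowledge object only if its
# composite salience clears a threshold; superseded versions are
# archived in a chain, never deleted.

def salience(reputation, novelty, reliability,
             weights=(0.4, 0.3, 0.3)):
    """Composite salience in [0, 1] from three [0, 1] signals."""
    w_rep, w_nov, w_rel = weights
    return w_rep * reputation + w_nov * novelty + w_rel * reliability

class GatedStore:
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.chains = {}  # key -> list of versions, newest last

    def write(self, key, value, reputation, novelty, reliability):
        """Gate at write time: low-salience objects never enter."""
        if salience(reputation, novelty, reliability) < self.threshold:
            return False
        self.chains.setdefault(key, []).append(value)
        return True

    def read(self, key):
        """Reads see only the current version..."""
        return self.chains[key][-1]

    def history(self, key):
        """...but the full version chain is preserved, not deleted."""
        return list(self.chains[key])
```

Because filtering happens before storage, distractors never accumulate in the store; this is the structural difference from read-time filtering that the abstract's distractor-scaling result highlights.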
cs.AI / 36 / 2603.16020
IRAM-Omega-Q: A Computational Architecture for Uncertainty Regulation in Artificial Agents
IRAM-Omega-Q:一种用于人工智能体不确定性调节的计算架构
Abstract
Artificial agents can achieve strong task performance while remaining opaque with respect to internal regulation, uncertainty management, and stability under stochastic perturbation. We present IRAM-Omega-Q, a computational architecture that models internal regulation as closed-loop control over a quantum-like state representation. The framework uses density matrices instrumentally as abstract state descriptors, enabling direct computation of entropy, purity, and coherence-related metrics without invoking physical quantum processes. A central adaptive gain is updated continuously to maintain a target uncertainty regime under noise. Using systematic parameter sweeps, fixed-seed publication-mode simulations, and susceptibility-based phase-diagram analysis, we identify reproducible critical boundaries in regulation-noise space. We further show that alternative control update orderings, interpreted as perception-first and action-first architectures, induce distinct stability regimes under identical external conditions. These results support uncertainty regulation as a concrete architectural principle for artificial agents and provide a formal setting for studying stability, control, and order effects in cognitively inspired AI systems. The framework is presented as a technical model of adaptive regulation dynamics in artificial agents. It makes no claims regarding phenomenological consciousness, and the quantum-like formalism is used strictly as a mathematical representation for structured uncertainty and state evolution.
Chinese Translation
人工智能体能够在内部调节、不确定性管理和在随机扰动下的稳定性方面保持不透明的同时,实现强大的任务性能。我们提出了IRAM-Omega-Q,这是一种将内部调节建模为对类量子状态表示的闭环控制的计算架构。该框架以密度矩阵作为抽象状态描述符,能够直接计算熵、纯度和相干性相关的指标,而无需调用物理量子过程。一个中心自适应增益不断更新,以在噪声下维持目标不确定性状态。通过系统的参数扫描、固定种子发布模式模拟和基于敏感性的相图分析,我们识别出调节-噪声空间中的可重复临界边界。我们进一步表明,替代的控制更新顺序(被解释为感知优先和行动优先架构)在相同外部条件下会引发不同的稳定性状态。这些结果支持不确定性调节作为人工智能体的一个具体架构原则,并为研究认知启发的人工智能系统中的稳定性、控制和顺序效应提供了一个正式的框架。该框架被呈现为人工智能体中自适应调节动态的技术模型。它不对现象学意识做出任何声明,且类量子形式严格作为结构化不确定性和状态演化的数学表示。
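The density-matrix metrics the abstract mentions are directly computable with standard linear algebra (this is the textbook definition of purity and von Neumann entropy, not the IRAM-Omega-Q code): purity is Tr(ρ²) and entropy is −Tr(ρ log ρ), evaluated via the eigenvalues of ρ.

```python
import numpy as np

# Purity and von Neumann entropy of a density matrix, used here
# purely as abstract state descriptors (no physical quantum claim).

def purity(rho):
    """Tr(rho^2): 1 for a pure state, 1/d for maximally mixed."""
    return float(np.real(np.trace(rho @ rho)))

def von_neumann_entropy(rho):
    """-sum_i p_i log p_i over the eigenvalues p_i of rho."""
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]  # drop numerically-zero modes
    return float(-np.sum(eigvals * np.log(eigvals)))

pure = np.array([[1.0, 0.0], [0.0, 0.0]])  # pure state: zero entropy
mixed = np.eye(2) / 2                      # maximally mixed: log(2)
```

A regulator targeting an uncertainty regime, as in the abstract, would adjust its gain to hold such an entropy measure near a setpoint under noise.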
cs.AI / 37 / 2603.16021
Interpretable Context Methodology: Folder Structure as Agentic Architecture
可解释的上下文方法论:文件夹结构作为能动架构
Abstract
Current approaches to AI agent orchestration typically involve building multi-agent frameworks that manage context passing, memory, error handling, and step coordination through code. These frameworks work well for complex, concurrent systems. But for sequential workflows where a human reviews output at each step, they introduce engineering overhead that the problem does not require. This paper presents Model Workspace Protocol (MWP), a method that replaces framework-level orchestration with filesystem structure. Numbered folders represent stages. Plain markdown files carry the prompts and context that tell a single AI agent what role to play at each step. Local scripts handle the mechanical work that does not need AI at all. The result is a system where one agent, reading the right files at the right moment, does the work that would otherwise require a multi-agent framework. This approach applies ideas from Unix pipeline design, modular decomposition, multi-pass compilation, and literate programming to the specific problem of structuring context for AI agents. The protocol is open source under the MIT license.
Chinese Translation
当前的人工智能代理编排方法通常涉及构建多代理框架,通过代码管理上下文传递、记忆、错误处理和步骤协调。这些框架在复杂的并发系统中表现良好。然而,对于需要人类在每个步骤审查输出的顺序工作流,它们引入了问题并不需要的工程开销。本文提出了模型工作区协议(Model Workspace Protocol, MWP),一种用文件系统结构替代框架级编排的方法。编号文件夹代表不同阶段。普通的markdown文件承载提示和上下文,告诉单个AI代理在每个步骤中扮演什么角色。本地脚本处理完全不需要AI的机械工作。最终形成的系统中,一个代理在正确的时刻读取正确的文件,完成原本需要多代理框架才能完成的工作。这种方法将Unix管道设计、模块化分解、多遍编译和文学编程的思想应用于为AI代理结构化上下文的具体问题。该协议在MIT许可证下开源。
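The folder-as-orchestration idea can be sketched with a short traversal script (the stage and file names below are invented for the example; MWP defines its own layout): numbered folders are visited in order and each one hands the single agent its prompt for that stage.

```python
import os
import tempfile

# Sketch: numbered stage folders replace framework orchestration;
# a single loop reads each stage's markdown prompt in order.

def run_stages(root):
    """Visit numbered stage folders in sorted order and return, per
    stage, the prompt an agent would be handed at that step."""
    prompts = []
    for stage in sorted(os.listdir(root)):
        path = os.path.join(root, stage, "prompt.md")
        with open(path) as f:
            prompts.append((stage, f.read().strip()))
    return prompts

# Build a tiny two-stage workspace to demonstrate the traversal.
root = tempfile.mkdtemp()
for stage, text in [("01_research", "Act as a researcher."),
                    ("02_draft", "Act as a writer.")]:
    os.makedirs(os.path.join(root, stage))
    with open(os.path.join(root, stage, "prompt.md"), "w") as f:
        f.write(text)
```

Because ordering comes from filenames and context from plain files, a human can inspect or edit any stage between steps with no framework code at all, which is the protocol's central trade-off.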
cs.AI / 38 / 2603.16044
Enhancing Linguistic Generalization of VLA: Fine-Tuning OpenVLA via Synthetic Instruction Augmentation
增强 VLA 的语言泛化能力:通过合成指令增强对 OpenVLA 的微调
Abstract
Generalization remains a core challenge in embodied AI, as robots must adapt to diverse environments. While OpenVLA represents the State-of-the-Art (SOTA) in Vision-Language-Action models by leveraging large-scale pre-training, its zero-shot performance can be limited when encountering completely new environments. This paper proposes a parameter-efficient fine-tuning strategy to enhance the linguistic generalization of OpenVLA by synthesizing a general instruction set for the Bridge Dataset V2. The paper leverages a Large Language Model (LLM) to generate a rich variety of semantically equivalent but structurally diverse commands for existing trajectories. In this experiment, Low-Rank Adaptation (LoRA) is implemented to fine-tune OpenVLA on the augmented pairs, allowing the model to bridge the gap between complex natural language intent and robotic actions. Results demonstrate the LoRA-enhanced model's robustness, suggesting that enriching the linguistic space of specialized datasets is crucial for embodied agents.
Chinese Translation
泛化仍然是具身人工智能中的一个核心挑战,因为机器人必须适应多样化的环境。尽管 OpenVLA 通过大规模预训练代表了视觉-语言-动作模型的最先进技术(SOTA),但在遇到全新环境时,其零样本性能可能受到限制。本文提出了一种参数高效的微调策略,通过为 Bridge Dataset V2 合成通用指令集来增强 OpenVLA 的语言泛化能力。本文利用大型语言模型(LLM)为现有轨迹生成丰富多样的语义等价但结构多样的指令。在本实验中,实施了低秩适应(LoRA),在增强后的指令-轨迹对上对 OpenVLA 进行微调,使模型能够弥合复杂自然语言意图与机器人动作之间的差距。结果展示了 LoRA 增强模型的鲁棒性,表明丰富专业数据集的语言空间对具身代理至关重要。
cs.AI / 39 / 2603.16045
POaaS: Minimal-Edit Prompt Optimization as a Service to Lift Accuracy and Cut Hallucinations on On-Device sLLMs
POaaS:作为服务的最小编辑提示优化,以提高准确性并减少设备端小型语言模型的幻觉
Abstract
Small language models (sLLMs) are increasingly deployed on-device, where imperfect user prompts--typos, unclear intent, or missing context--can trigger factual errors and hallucinations. Existing automatic prompt optimization (APO) methods were designed for large cloud LLMs and rely on search that often produces long, structured instructions; when executed under an on-device constraint where the same small model must act as optimizer and solver, these pipelines can waste context and even hurt accuracy. We propose POaaS, a minimal-edit prompt optimization layer that routes each query to lightweight specialists (Cleaner, Paraphraser, Fact-Adder) and merges their outputs under strict drift and length constraints, with a conservative skip policy for well-formed prompts. Under a strict fixed-model setting with Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct, POaaS improves both task accuracy and factuality while representative APO baselines degrade them, and POaaS recovers up to +7.4% under token deletion and mixup. Overall, per-query conservative optimization is a practical alternative to search-heavy APO for on-device sLLMs.
Chinese Translation
小型语言模型(sLLMs)越来越多地部署在设备端,在这种情况下,不完美的用户提示——如拼写错误、不明确的意图或缺失的上下文——可能会引发事实错误和幻觉。现有的自动提示优化(APO)方法是为大型云端语言模型设计的,依赖于搜索,通常生成冗长的结构化指令;在设备端约束下执行时,同一小型模型必须同时充当优化器和求解器,这些流程可能会浪费上下文,甚至影响准确性。我们提出了POaaS,一种最小编辑提示优化层,它将每个查询路由到轻量级专家(Cleaner、Paraphraser、Fact-Adder),并在严格的漂移和长度约束下合并它们的输出,对于格式良好的提示采用保守的跳过策略。在严格固定模型设置下,使用Llama-3.2-3B-Instruct和Llama-3.1-8B-Instruct,POaaS在任务准确性和事实性方面均有所提升,而代表性的APO基线则出现下降,POaaS在标记删除和混合情况下恢复了高达+7.4%的性能。总体而言,逐查询的保守优化是设备端小型语言模型的一个实用替代方案,优于依赖搜索的APO。
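The per-query conservative policy described above can be sketched in miniature (the specialist functions and typo table below are trivial stand-ins invented for the example; POaaS routes to LLM-based Cleaner/Paraphraser/Fact-Adder specialists): well-formed prompts are skipped entirely, and edits that drift too far are discarded.

```python
# Toy sketch of minimal-edit prompt optimization: a skip policy for
# well-formed prompts plus a length constraint that bounds drift.

TYPO_FIXES = {"teh": "the", "recieve": "receive"}  # illustrative

def cleaner(prompt):
    """Trivial stand-in for a Cleaner specialist: fix known typos."""
    return " ".join(TYPO_FIXES.get(w, w) for w in prompt.split())

def needs_optimization(prompt):
    """Conservative skip policy: touch only prompts with a known
    defect; everything else passes through unchanged."""
    return any(w in TYPO_FIXES for w in prompt.split())

def optimize(prompt, max_growth=1.5):
    if not needs_optimization(prompt):
        return prompt  # skip: prompt is already well-formed
    edited = cleaner(prompt)
    # Drift/length constraint: reject edits that grow the prompt
    # too much, keeping the optimization minimal-edit.
    if len(edited) > max_growth * len(prompt):
        return prompt
    return edited
```

The skip policy is what distinguishes this from search-heavy APO: most queries cost nothing, which matters when the same small on-device model must serve as both optimizer and solver.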
cs.AI / 40 / 2603.16052
A Context Alignment Pre-processor for Enhancing the Coherence of Human-LLM Dialog
增强人类与大型语言模型对话连贯性的上下文对齐预处理器
Abstract
Large language models (LLMs) have made remarkable progress in generating fluent text, but they still face a critical challenge of contextual misalignment in long-term and dynamic dialogue. When human users omit premises, simplify references, or shift context abruptly during interactions with LLMs, the models may fail to capture their actual intentions, producing mechanical or off-topic responses that weaken the collaborative potential of dialogue. To address this problem, this paper proposes a computational framework called the Context Alignment Pre-processor (C.A.P.). Rather than operating during generation, C.A.P. functions as a pre-processing module between user input and response generation. The framework includes three core processes: (1) semantic expansion, which extends a user instruction to a broader semantic span including its premises, literal meaning, and implications; (2) time-weighted context retrieval, which prioritizes recent dialogue history through a temporal decay function approximating human conversational focus; and (3) alignment verification and decision branching, which evaluates whether the dialogue remains on track by measuring the semantic similarity between the current prompt and the weighted historical context. When a significant deviation is detected, C.A.P. initiates a structured clarification protocol to help users and the system recalibrate the conversation. This study presents the architecture and theoretical basis of C.A.P., drawing on cognitive science and Common Ground theory in human-computer interaction. We argue that C.A.P. is not only a technical refinement but also a step toward shifting human-computer dialogue from one-way command-execution patterns to two-way, self-correcting, partnership-based collaboration. Finally, we discuss implementation paths, evaluation methods, and implications for the future design of interactive intelligent systems.
Chinese Translation
大型语言模型(LLMs)在生成流畅文本方面取得了显著进展,但在长期和动态对话中仍面临上下文不对齐的关键挑战。当人类用户在与LLMs的互动中省略前提、简化引用或突然改变上下文时,这些模型可能无法捕捉到他们的实际意图,导致生成机械或离题的回应,从而削弱对话的协作潜力。为了解决这一问题,本文提出了一种名为上下文对齐预处理器(Context Alignment Pre-processor, C.A.P.)的计算框架。C.A.P.并非在生成过程中运作,而是作为用户输入与响应生成之间的预处理模块。该框架包括三个核心过程:(1)语义扩展,扩展用户指令至更广泛的语义范围,包括其前提、字面意义和隐含意义;(2)时间加权上下文检索,通过一个近似人类对话关注点的时间衰减函数优先考虑最近的对话历史;(3)对齐验证与决策分支,通过测量当前提示与加权历史上下文之间的语义相似性来评估对话是否保持在轨道上。当检测到显著偏差时,C.A.P.启动结构化澄清协议,以帮助用户和系统重新校准对话。本研究展示了C.A.P.的架构和理论基础,借鉴了认知科学和人机交互中的共同基础理论。我们认为,C.A.P.不仅是技术上的改进,也是将人机对话从单向命令执行模式转变为双向、自我纠正的基于合作的协作的一步。最后,我们讨论了实施路径、评估方法以及对未来交互智能系统设计的影响。
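The time-weighted retrieval and alignment-verification steps can be sketched concretely (the exponential decay constant, bag-of-words similarity, and threshold below are illustrative simplifications; C.A.P. itself is specified over semantic representations).

```python
import math
from collections import Counter

# Sketch of C.A.P.-style alignment checking: weight dialogue history
# by recency, then compare the new prompt against that weighted
# context; low similarity would trigger a clarification protocol.

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def weighted_context(history, decay=0.5):
    """Merge past turns into one Counter, down-weighting older turns
    exponentially (the most recent turn has weight 1)."""
    ctx = Counter()
    for age, turn in enumerate(reversed(history)):
        weight = decay ** age
        for token in turn.lower().split():
            ctx[token] += weight
    return ctx

def aligned(prompt, history, threshold=0.2):
    """True if the prompt stays on topic relative to the weighted
    history; False signals a deviation worth clarifying."""
    sim = cosine(Counter(prompt.lower().split()),
                 weighted_context(history))
    return sim >= threshold
```

The decay function is the piece that approximates human conversational focus: an abrupt topic shift scores low against recent turns even if it matches something said long ago.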
cs.AI / 41 / 2603.16060
ARISE: Agent Reasoning with Intrinsic Skill Evolution in Hierarchical Reinforcement Learning
ARISE:基于内在技能演化的层次强化学习中的智能体推理
Abstract
The dominant paradigm for improving mathematical reasoning in language models relies on Reinforcement Learning with verifiable rewards. Yet existing methods treat each problem instance in isolation without leveraging the reusable strategies that emerge and accumulate during training. To this end, we introduce ARISE (Agent Reasoning via Intrinsic Skill Evolution), a hierarchical reinforcement learning framework, in which a shared policy operates both to manage skills at high-level and to generate responses at low-level (denoted as a Skills Manager and a Worker, respectively). The Manager maintains a tiered skill library through a dedicated skill generation rollout that performs structured summarization of successful solution traces (after execution), while employing a policy-driven selection mechanism to retrieve relevant skills to condition future rollouts (before execution). A hierarchical reward design guides the co-evolution of reasoning ability and library quality. Experiments on two base models and seven benchmarks spanning both competition mathematics and Omni-MATH show that ARISE consistently outperforms GRPO-family algorithms and memory-augmented baselines, with particularly notable gains on out-of-distribution tasks. Ablation studies confirm that each component contributes to the observed improvements and that library quality and reasoning performance improve in tandem throughout training. Code is available at https://github.com/Skylanding/ARISE.
Chinese Translation
提高语言模型数学推理能力的主要范式依赖于具有可验证奖励的强化学习。然而,现有方法将每个问题实例孤立对待,而未利用在训练过程中出现和积累的可重用策略。为此,我们引入了ARISE(通过内在技能演化的智能体推理),这是一个层次强化学习框架,其中共享策略既用于管理高层技能,也用于生成低层响应(分别称为技能管理器和工作者)。管理器通过专门的技能生成回合维护一个分层技能库,该回合对成功解决轨迹(执行后)进行结构化总结,同时采用基于策略的选择机制来检索相关技能,以为未来的回合提供条件(执行前)。层次奖励设计引导推理能力和库质量的共同演化。在两个基础模型和七个基准测试(涵盖竞争数学和Omni-MATH)上的实验表明,ARISE在性能上始终优于GRPO系列算法和增强记忆的基线,尤其在分布外任务上取得了显著的提升。消融研究确认每个组件对观察到的改进都有贡献,并且库质量与推理性能在训练过程中同步提高。代码可在 https://github.com/Skylanding/ARISE 获取。
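The store-then-retrieve loop of the skill library can be illustrated with a minimal stand-in (keyword overlap replaces ARISE's learned, policy-driven selection, and the skills below are invented): successful traces are summarized into reusable advice and retrieved to condition later attempts.

```python
# Minimal sketch of a skill library: add skills mined from
# successful traces, retrieve the best-matching one for a new
# problem. Keyword overlap is a stand-in for learned selection.

class SkillLibrary:
    def __init__(self):
        self.skills = []  # (keyword set, advice) pairs

    def add_from_trace(self, problem, advice):
        """Summarize a successful trace into a reusable skill,
        keyed by the problem's tokens (post-execution step)."""
        self.skills.append((set(problem.lower().split()), advice))

    def retrieve(self, problem):
        """Return the skill whose keywords best overlap the new
        problem, or None if nothing matches (pre-execution step)."""
        tokens = set(problem.lower().split())
        best, best_score = None, 0
        for keywords, advice in self.skills:
            score = len(keywords & tokens)
            if score > best_score:
                best, best_score = advice, score
        return best
```

The hierarchical reward in ARISE is what couples the two steps, crediting the Manager when a retrieved skill actually helps the Worker solve the problem.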
cs.AI / 42 / 2603.16110
VIGIL: Towards Edge-Extended Agentic AI for Enterprise IT Support
VIGIL:面向企业IT支持的边缘扩展自主智能系统
Abstract
Enterprise IT support is constrained by heterogeneous devices, evolving policies, and long-tail failure modes that are difficult to resolve centrally. We present VIGIL, an edge-extended agentic AI system that deploys desktop-resident agents to perform situated diagnosis, retrieval over enterprise knowledge, and policy-governed remediation directly on user devices with explicit consent and end-to-end observability. In a 10-week pilot of VIGIL's operational loop on 100 resource-constrained endpoints, VIGIL reduces interaction rounds by 39%, achieves at least 4 times faster diagnosis, and supports self-service resolution in 82% of matched cases. Users report excellent usability, high trust, and low cognitive workload across four validated instruments, with qualitative feedback highlighting transparency as critical for trust. Notably, users rated the system higher when no historical matches were available, suggesting on-device diagnosis provides value independent of knowledge base coverage. This pilot establishes safety and observability foundations for fleet-wide continuous improvement.
Chinese Translation
企业IT支持受到异构设备、不断变化的政策以及难以集中解决的长尾故障模式的限制。我们提出了VIGIL,一个边缘扩展的自主智能系统,部署桌面驻留代理在用户设备上执行情境诊断、企业知识检索和政策驱动的修复,前提是获得明确的用户同意并实现端到端的可观察性。在对100个资源受限终端进行为期10周的VIGIL操作循环试点中,VIGIL将交互轮次减少了39%,实现了至少4倍的诊断速度提升,并在82%的匹配案例中支持自助解决。用户在四个经过验证的工具中报告了极佳的可用性、高信任度和低认知负担,定性反馈强调透明度对信任的重要性。值得注意的是,当没有历史匹配时,用户对系统的评分更高,这表明设备上的诊断提供了独立于知识库覆盖的价值。该试点为整个终端设备群的持续改进奠定了安全性和可观察性的基础。
cs.AI / 43 / 2603.16148
NeuronSpark: A Spiking Neural Network Language Model with Selective State Space Dynamics
NeuronSpark:一种具有选择性状态空间动态的脉冲神经网络语言模型
Abstract
We ask whether a pure spiking backbone can learn large-scale language modeling from random initialization, without Transformer distillation. We introduce NeuronSpark, a 0.9B-parameter SNN language model trained with next-token prediction and surrogate gradients. The model combines selective state-space spiking dynamics, leakage-current inter-layer communication, PonderNet adaptive timesteps, fused Triton PLIF kernels, and stabilization techniques (residual centering, lateral-inhibition normalization, and natural-gradient compensation). Under a constrained budget (about 1.4B pretraining tokens and 6.5K SFT steps), NeuronSpark-0.9B reaches 3.6 pretraining loss and shows early multi-turn dialogue behavior after SFT. These results support the feasibility of end-to-end language modeling with a pure SNN architecture at this scale.
Chinese Translation
我们探讨了纯脉冲神经网络(SNN)骨架是否能够从随机初始化中学习大规模语言建模,而无需进行Transformer蒸馏。我们介绍了NeuronSpark,这是一种具有0.9B参数的SNN语言模型,采用下一个标记预测和替代梯度进行训练。该模型结合了选择性状态空间脉冲动态、泄漏电流层间通信、PonderNet自适应时间步、融合的Triton PLIF内核以及稳定化技术(残差中心化、侧抑制归一化和自然梯度补偿)。在受限预算下(约1.4B预训练标记和6.5K SFT步骤),NeuronSpark-0.9B达到了3.6的预训练损失,并在SFT后显示出早期的多轮对话行为。这些结果支持在此规模下使用纯SNN架构进行端到端语言建模的可行性。
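The PLIF neuron dynamics named in the abstract can be illustrated with a minimal single-neuron sketch; the time constant, threshold, and hard reset below are illustrative defaults, not NeuronSpark's actual parameters (in PLIF the time constant is learnable, fixed here for simplicity):

```python
def plif_step(v, x, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """One timestep of a (parametric) leaky integrate-and-fire neuron.

    The membrane potential v decays toward the input current x with
    time constant tau; a binary spike fires when v crosses threshold,
    after which v is hard-reset. tau is the learnable parameter in
    PLIF; it is a fixed illustrative constant in this sketch.
    """
    v = v + (x - v) / tau          # leaky integration toward the input
    spike = 1 if v >= v_threshold else 0
    if spike:
        v = v_reset                # hard reset after firing
    return v, spike
```

During training, the non-differentiable threshold is typically replaced by a surrogate gradient, which is the "surrogate gradients" the abstract refers to.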
cs.AI / 44 / 2603.16161
SQL-ASTRA: Alleviating Sparse Feedback in Agentic SQL via Column-Set Matching and Trajectory Aggregation
SQL-ASTRA:通过列集匹配和轨迹聚合缓解代理SQL中的稀疏反馈
Abstract
Agentic Reinforcement Learning (RL) shows promise for complex tasks, but Text-to-SQL remains mostly restricted to single-turn paradigms. A primary bottleneck is the credit assignment problem. In traditional paradigms, rewards are determined solely by the final-turn feedback, which ignores the intermediate process and leads to ambiguous credit evaluation. To address this, we propose Agentic SQL, a framework featuring a universal two-tiered reward mechanism designed to provide effective trajectory-level evaluation and dense step-level signals. First, we introduce Aggregated Trajectory Reward (ATR) to resolve multi-turn credit assignment. Using an asymmetric transition matrix, ATR aggregates process-oriented scores to incentivize continuous improvement. Leveraging Lyapunov stability theory, we prove ATR acts as an energy dissipation operator, guaranteeing a cycle-free policy and monotonic convergence. Second, Column-Set Matching Reward (CSMR) provides immediate step-level rewards to mitigate sparsity. By executing queries at each turn, CSMR converts binary (0/1) feedback into dense [0, 1] signals based on partial correctness. Evaluations on BIRD show a 5% gain over binary-reward GRPO. Notably, our approach outperforms SOTA Arctic-Text2SQL-R1-7B on BIRD and Spider 2.0 using identical models, propelling Text-to-SQL toward a robust multi-turn agent paradigm.
Chinese Translation
代理强化学习(RL)在复杂任务中展现出潜力,但文本到SQL的转换仍主要局限于单轮范式。一个主要瓶颈是信用分配问题。在传统范式中,奖励仅由最终轮反馈决定,这忽视了中间过程,导致模糊的信用评估。为了解决这一问题,我们提出了代理SQL(Agentic SQL),这是一个具有通用双层奖励机制的框架,旨在提供有效的轨迹级评估和密集的步骤级信号。首先,我们引入了聚合轨迹奖励(Aggregated Trajectory Reward, ATR)来解决多轮信用分配问题。通过使用不对称转移矩阵,ATR聚合过程导向的得分,以激励持续改进。利用Lyapunov稳定性理论,我们证明ATR作为能量耗散算子,保证了无环策略和单调收敛。其次,列集匹配奖励(Column-Set Matching Reward, CSMR)提供即时的步骤级奖励,以减轻稀疏性。通过在每一轮执行查询,CSMR将二元(0/1)反馈转换为基于部分正确性的密集[0, 1]信号。在BIRD上的评估显示,相较于二元奖励的GRPO,提升了5%的性能。值得注意的是,我们的方法在使用相同模型的情况下,在BIRD和Spider 2.0上超越了SOTA Arctic-Text2SQL-R1-7B,推动文本到SQL向强健的多轮代理范式迈进。
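The CSMR idea, converting binary execution feedback into a dense partial-correctness signal, can be sketched as set-based F1 over the columns returned by the predicted and gold queries. The exact scoring function is an assumption about the general mechanism, not the paper's implementation:

```python
def column_set_reward(predicted_cols, gold_cols):
    """Score a candidate SQL query by comparing the column set its
    execution returns against the gold query's column set.

    Returns a dense reward in [0, 1] (set-based F1) instead of the
    binary 0/1 execution-match signal, so partially correct queries
    still receive step-level credit.
    """
    pred, gold = set(predicted_cols), set(gold_cols)
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A query that recovers one of two gold columns earns 2/3 rather than 0, which is the densification the abstract describes.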
cs.AI / 45 / 2603.16197
Are Large Language Models Truly Smarter Than Humans?
大型语言模型真的比人类更聪明吗?
Abstract
Public leaderboards increasingly suggest that large language models (LLMs) surpass human experts on benchmarks spanning academic knowledge, law, and programming. Yet most benchmarks are fully public, their questions widely mirrored across the internet, creating systematic risk that models were trained on the very data used to evaluate them. This paper presents three complementary experiments forming a rigorous multi-method contamination audit of six frontier LLMs: GPT-4o, GPT-4o-mini, DeepSeek-R1, DeepSeek-V3, Llama-3.3-70B, and Qwen3-235B. Experiment 1 applies a lexical contamination detection pipeline to 513 MMLU questions across all 57 subjects, finding an overall contamination rate of 13.8% (18.1% in STEM, up to 66.7% in Philosophy) and estimated performance gains of +0.030 to +0.054 accuracy points by category. Experiment 2 applies a paraphrase and indirect-reference diagnostic to 100 MMLU questions, finding accuracy drops by an average of 7.0 percentage points under indirect reference, rising to 19.8 pp in both Law and Ethics. Experiment 3 applies TS-Guessing behavioral probes to all 513 questions and all six models, finding that 72.5% trigger memorization signals far above chance, with DeepSeek-R1 displaying a distributed memorization signature (76.6% partial reconstruction, 0% verbatim recall) that explains its anomalous Experiment 2 profile. All three experiments converge on the same contamination ranking: STEM > Professional > Social Sciences > Humanities.
Chinese Translation
公共排行榜日益表明,大型语言模型(LLMs)在涵盖学术知识、法律和编程的基准测试中超越了人类专家。然而,大多数基准测试是完全公开的,其问题在互联网上广泛存在,这造成了系统性风险,即模型可能是在用于评估它们的数据上进行训练的。本文提出了三个互补实验,形成了对六个前沿LLM的严格多方法污染审计:GPT-4o、GPT-4o-mini、DeepSeek-R1、DeepSeek-V3、Llama-3.3-70B和Qwen3-235B。实验1对所有57个学科的513个MMLU问题应用了词汇污染检测流程,发现总体污染率为13.8%(STEM领域为18.1%,哲学领域高达66.7%),并按类别估计性能提升为+0.030至+0.054准确率点。实验2对100个MMLU问题应用了释义和间接引用诊断,发现间接引用下的准确率平均下降了7.0个百分点,在法律和伦理领域上升至19.8个百分点。实验3对所有513个问题和所有六个模型应用了TS-Guessing行为探测,发现72.5%的问题触发了远高于随机的记忆信号,其中DeepSeek-R1显示出分布式记忆特征(76.6%部分重构,0%逐字回忆),这解释了其异常的实验2表现。所有三个实验的污染排名一致:STEM > 专业领域 > 社会科学 > 人文学科。
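A minimal sketch of the lexical-overlap idea behind a detection pipeline like Experiment 1's: flag possible contamination when a benchmark question's word n-grams reappear verbatim in candidate training text. The choice of n and any decision threshold are assumptions, not the paper's exact pipeline:

```python
def ngram_overlap(question, corpus_text, n=5):
    """Fraction of the question's word n-grams that appear verbatim
    in the candidate text. A high overlap flags possible training-set
    contamination; 0.0 means no shared n-gram.
    """
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    q = ngrams(question)
    if not q:                       # question shorter than n words
        return 0.0
    return len(q & ngrams(corpus_text)) / len(q)
```

Paraphrase diagnostics (Experiment 2) exist precisely because this lexical test misses memorization that survives rewording.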
cs.AI / 46 / 2603.16207
Proactive Rejection and Grounded Execution: A Dual-Stage Intent Analysis Paradigm for Safe and Efficient AIoT Smart Homes
主动拒绝与基于环境的执行:一种安全高效的AIoT智能家居双阶段意图分析范式
Abstract
As Large Language Models (LLMs) transition from information providers to embodied agents in the Internet of Things (IoT), they face significant challenges regarding reliability and interaction efficiency. Direct execution of LLM-generated commands often leads to entity hallucinations (e.g., trying to control non-existent devices). Meanwhile, existing iterative frameworks (e.g., SAGE) suffer from the Interaction Frequency Dilemma, oscillating between reckless execution and excessive user questioning. To address these issues, we propose a Dual-Stage Intent-Aware (DS-IA) Framework. This framework separates high-level user intent understanding from low-level physical execution. Specifically, Stage 1 serves as a semantic firewall to filter out invalid instructions and resolve vague commands by checking the current state of the home. Stage 2 then employs a deterministic cascade verifier-a strict, step-by-step rule checker that verifies the room, device, and capability in sequence-to ensure the action is actually physically possible before execution. Extensive experiments on the HomeBench and SAGE benchmarks demonstrate that DS-IA achieves an Exact Match (EM) rate of 58.56% (outperforming baselines by over 28%) and improves the rejection rate of invalid instructions to 87.04%. Evaluations on the SAGE benchmark further reveal that DS-IA resolves the Interaction Frequency Dilemma by balancing proactive querying with state-based inference. Specifically, it boosts the Autonomous Success Rate (resolving tasks without unnecessary user intervention) from 42.86% to 71.43%, while maintaining high precision in identifying irreducible ambiguities that truly necessitate human clarification. These results underscore the framework's ability to minimize user disturbance through accurate environmental grounding.
Chinese Translation
随着大型语言模型(LLMs)从信息提供者转变为物联网(IoT)中的具身智能体,它们面临着可靠性和交互效率方面的重大挑战。直接执行LLM生成的指令往往会导致实体幻觉(例如,试图控制不存在的设备)。与此同时,现有的迭代框架(如SAGE)遭遇了交互频率困境,在鲁莽执行与过度询问用户之间摇摆不定。为了解决这些问题,我们提出了一种双阶段意图感知(DS-IA)框架。该框架将高层用户意图理解与低层物理执行分离。具体而言,第一阶段作为语义防火墙,过滤无效指令并通过检查家庭的当前状态来解析模糊命令。第二阶段则采用确定性级联验证器——一个严格的逐步规则检查器,依次验证房间、设备和能力,以确保在执行之前该动作在物理上确实可行。在HomeBench和SAGE基准上的广泛实验表明,DS-IA的精确匹配(EM)率达到58.56%(比基线高出28%以上),并将无效指令的拒绝率提高至87.04%。在SAGE基准上的评估进一步揭示,DS-IA通过平衡主动查询与基于状态的推理来解决交互频率困境。具体而言,它将自主成功率(在无需用户干预的情况下解决任务)从42.86%提升至71.43%,同时在识别真正需要人类澄清的不可简化模糊性方面保持高精度。这些结果强调了该框架通过准确的环境基础来最小化用户干扰的能力。
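The Stage-2 deterministic cascade verifier, checking room, then device, then capability in sequence, can be sketched as a fail-fast chain. The home-state schema and error messages here are hypothetical; only the cascade order comes from the abstract:

```python
def cascade_verify(command, home_state):
    """Deterministic cascade check: room -> device -> capability.
    Fails fast at the first unsatisfied level, so a command is only
    executed when the action is physically possible in this home.
    """
    room = home_state.get(command["room"])
    if room is None:
        return False, f"unknown room: {command['room']}"
    device = room.get(command["device"])
    if device is None:
        return False, f"no {command['device']} in {command['room']}"
    if command["capability"] not in device["capabilities"]:
        return False, f"{command['device']} cannot {command['capability']}"
    return True, "ok"

# Hypothetical home state for illustration.
home = {
    "living_room": {"light": {"capabilities": {"on", "off", "dim"}}},
    "kitchen": {"oven": {"capabilities": {"on", "off", "set_temp"}}},
}
```

A hallucinated device ("garage light") is rejected at the room level before any actuation is attempted, which is how the cascade blocks entity hallucinations.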
cs.AI / 47 / 2603.16210
MOSAIC: Composable Safety Alignment with Modular Control Tokens
MOSAIC:具有模块化控制令牌的可组合安全对齐
Abstract
Safety alignment in large language models (LLMs) is commonly implemented as a single static policy embedded in model parameters. However, real-world deployments often require context-dependent safety rules that vary across users, regions, and applications. Existing approaches struggle to provide such conditional control: parameter-level alignment entangles safety behaviors with general capabilities, while prompt-based methods rely on natural language instructions that provide weak enforcement. We propose MOSAIC, a modular framework that enables compositional safety alignment through learnable control tokens optimized over a frozen backbone model. Each token represents a safety constraint and can be flexibly activated and composed at inference time. To train compositional tokens efficiently, we introduce order-based task sampling and a distribution-level alignment objective that mitigates over-refusal. Experiments show that MOSAIC achieves strong defense performance with substantially lower over-refusal while preserving model utility.
Chinese Translation
大型语言模型(LLMs)中的安全对齐通常作为嵌入在模型参数中的单一静态策略实现。然而,现实世界的应用通常需要依赖于上下文的安全规则,这些规则在用户、地区和应用之间有所不同。现有的方法难以提供这种条件控制:参数级别的对齐将安全行为与一般能力纠缠在一起,而基于提示的方法则依赖于自然语言指令,执行力度较弱。我们提出了MOSAIC,一个模块化框架,通过在冻结的主干模型上优化的可学习控制令牌实现可组合的安全对齐。每个令牌代表一个安全约束,并可以在推理时灵活激活和组合。为了高效训练可组合令牌,我们引入了基于顺序的任务采样和一个分布级对齐目标,以减轻过度拒绝的问题。实验表明,MOSAIC在保持模型效用的同时,显著降低了过度拒绝的情况,达到了强大的防御性能。
cs.AI / 48 / 2603.16264
Adaptive Theory of Mind for LLM-based Multi-Agent Coordination
基于大语言模型的多智能体协调的自适应心智理论
Abstract
Theory of Mind (ToM) refers to the ability to reason about others' mental states, and higher-order ToM involves considering that others also possess their own ToM. Equipping large language model (LLM)-driven agents with ToM has long been considered to improve their coordination in multiagent collaborative tasks. However, we find that misaligned ToM orders-mismatches in the depth of ToM reasoning between agents-can lead to insufficient or excessive reasoning about others, thereby impairing their coordination. To address this issue, we design an adaptive ToM (A-ToM) agent, which can align in ToM orders with its partner. Based on prior interactions, the agent estimates the partner's likely ToM order and leverages this estimation to predict the partner's action, thereby facilitating behavioral coordination. We conduct empirical evaluations on four multi-agent coordination tasks: a repeated matrix game, two grid navigation tasks and an Overcooked task. The results validate our findings on ToM alignment and demonstrate the effectiveness of our A-ToM agent. Furthermore, we discuss the generalizability of our A-ToM to non-LLM-based agents, as well as what would diminish the importance of ToM alignment.
Chinese Translation
心智理论(Theory of Mind, ToM)指的是推理他人心理状态的能力,而高阶心智理论则涉及考虑他人也拥有自己的心智理论。为基于大语言模型(LLM)的智能体赋予心智理论,长期以来被认为可以改善它们在多智能体协作任务中的协调能力。然而,我们发现不匹配的心智理论顺序——智能体之间在心智理论推理深度上的不一致——可能导致对他人的推理不足或过度,从而损害它们的协调能力。为了解决这一问题,我们设计了一种自适应心智理论(Adaptive Theory of Mind, A-ToM)智能体,该智能体能够与其伙伴对齐心智理论顺序。基于先前的互动,该智能体估计伙伴可能的心智理论顺序,并利用这一估计来预测伙伴的行为,从而促进行为协调。我们在四个多智能体协调任务上进行了实证评估:一个重复的矩阵博弈、两个网格导航任务和一个《煮糊了》(Overcooked)任务。结果验证了我们关于心智理论对齐的发现,并展示了我们的A-ToM智能体的有效性。此外,我们讨论了A-ToM在非LLM基础智能体中的可推广性,以及什么因素会降低心智理论对齐的重要性。
cs.AI / 49 / 2603.16307
NeSy-Route: A Neuro-Symbolic Benchmark for Constrained Route Planning in Remote Sensing
NeSy-Route:一种用于遥感约束路径规划的神经符号基准
Abstract
Remote sensing underpins crucial applications such as disaster relief and ecological field surveys, where systems must understand complex scenes and constraints and make reliable decisions. Current remote-sensing benchmarks mainly focus on evaluating perception and reasoning capabilities of multimodal large language models (MLLMs). They fail to assess planning capability, stemming either from the difficulty of curating and validating planning tasks at scale or from evaluation protocols that are inaccurate and inadequate. To address these limitations, we introduce NeSy-Route, a large-scale neuro-symbolic benchmark for constrained route planning in remote sensing. Within this benchmark, we introduce an automated data-generation framework that integrates high-fidelity semantic masks with heuristic search to produce diverse route-planning tasks with provably optimal solutions. This allows NeSy-Route to comprehensively evaluate planning across 10,821 route-planning samples, nearly 10 times larger than the largest prior benchmark. Furthermore, a three-level hierarchical neuro-symbolic evaluation protocol is developed to enable accurate assessment and support fine-grained analysis on perception, reasoning, and planning simultaneously. Our comprehensive evaluation of various state-of-the-art MLLMs demonstrates that existing MLLMs show significant deficiencies in perception and planning capabilities. We hope NeSy-Route can support further research and development of more powerful MLLMs for remote sensing.
Chinese Translation
遥感支撑着灾难救援和生态实地调查等关键应用,其中系统必须理解复杂场景和约束,并做出可靠决策。目前的遥感基准主要集中于评估多模态大型语言模型(MLLMs)的感知和推理能力,未能评估规划能力,这主要源于大规模策划和验证规划任务的困难,或评估协议的不准确和不足。为了解决这些局限性,我们引入了NeSy-Route,这是一个大规模的神经符号基准,用于遥感中的约束路径规划。在这个基准中,我们引入了一个自动化数据生成框架,该框架将高保真语义掩模与启发式搜索相结合,以生成具有可证明最优解的多样化路径规划任务。这使得NeSy-Route能够全面评估10,821个路径规划样本,几乎是之前最大基准的10倍。此外,我们开发了一个三级层次神经符号评估协议,以便进行准确评估,并支持对感知、推理和规划的细致分析。我们对各种最先进的MLLMs的综合评估表明,现有的MLLMs在感知和规划能力上存在显著不足。我们希望NeSy-Route能够支持更强大的MLLMs在遥感领域的进一步研究和开发。
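Producing route-planning tasks with provably optimal reference solutions rests on classical shortest-path search over a traversability grid derived from the semantic mask. A minimal sketch using Dijkstra's algorithm (A* with a zero heuristic), which guarantees the returned cost is optimal; the grid encoding is an assumption:

```python
import heapq

def shortest_path_cost(grid, start, goal):
    """Dijkstra over a 4-connected grid. grid[r][c] is the cost of
    entering cell (r, c), or None for impassable cells.
    Returns the provably minimal path cost, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            return d
        if d > dist.get((r, c), float("inf")):
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] is not None:
                nd = d + grid[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))
    return None
```

Because Dijkstra is exact, a generated task's reference answer can be certified optimal, which is what makes automated large-scale curation feasible.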
cs.AI / 50 / 2603.16313
Learning to Predict, Discover, and Reason in High-Dimensional Discrete Event Sequences
学习在高维离散事件序列中进行预测、发现和推理
Abstract
Electronic control units (ECUs) embedded within modern vehicles generate a large number of asynchronous events known as diagnostic trouble codes (DTCs). These discrete events form complex temporal sequences that reflect the evolving health of the vehicle's subsystems. In the automotive industry, domain experts manually group these codes into higher-level error patterns (EPs) using Boolean rules to characterize system faults and ensure safety. However, as vehicle complexity grows, this manual process becomes increasingly costly, error-prone, and difficult to scale. Notably, the number of unique DTCs in a modern vehicle is on the same order of magnitude as the vocabulary of a natural language, often numbering in the tens of thousands. This observation motivates a paradigm shift: treating diagnostic sequences as a language that can be modeled, predicted, and ultimately explained. Traditional statistical approaches fail to capture the rich dependencies and do not scale to high-dimensional datasets characterized by thousands of nodes, large sample sizes, and long sequence lengths. Specifically, the high cardinality of categorical event spaces in industrial logs poses a significant challenge, necessitating new machine learning architectures tailored to such event-driven systems. This thesis addresses automated fault diagnostics by unifying event sequence modeling, causal discovery, and large language models (LLMs) into a coherent framework for high-dimensional event streams. It is structured in three parts, reflecting a progressive transition from prediction to causal understanding and finally to reasoning for vehicle diagnostics. Consequently, we introduce several Transformer-based architectures for predictive maintenance, scalable sample- and population-level causal discovery frameworks and a multi-agent system that automates the synthesis of Boolean EP rules.
Chinese Translation
现代车辆中嵌入的电子控制单元(ECUs)生成大量异步事件,这些事件被称为诊断故障代码(DTCs)。这些离散事件形成复杂的时间序列,反映了车辆子系统健康状况的变化。在汽车行业,领域专家使用布尔规则手动将这些代码分组为更高级别的错误模式(EPs),以表征系统故障并确保安全。然而,随着车辆复杂性的增加,这一手动过程变得越来越昂贵、易出错且难以扩展。值得注意的是,现代车辆中独特的 DTC 数量与自然语言的词汇量处于同一数量级,通常达到数万。这一观察促使了范式转变:将诊断序列视为一种可以建模、预测并最终解释的语言。传统的统计方法未能捕捉丰富的依赖关系,并且无法扩展到特征为成千上万节点、大样本量和长序列长度的高维数据集。特别是,工业日志中分类事件空间的高基数构成了重大挑战,需要针对这种事件驱动系统的新机器学习架构。本论文通过将事件序列建模、因果发现和大型语言模型(LLMs)统一为一个面向高维事件流的连贯框架,解决了自动故障诊断问题。论文分为三部分,反映了从预测到因果理解,再到车辆诊断推理的渐进过渡。因此,我们提出了若干用于预测性维护的基于 Transformer 的架构、可扩展的样本级与总体级因果发现框架,以及一个自动合成布尔 EP 规则的多智能体系统。
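The "diagnostic sequences as a language" framing invites the simplest language-model baseline: a bigram next-event predictor over DTC codes. This sketch illustrates the framing only, not any architecture from the thesis, and the DTC codes below are made up:

```python
from collections import Counter, defaultdict

def fit_bigram(sequences):
    """Count next-event frequencies for each DTC code, treating each
    vehicle's event stream as a sentence over a code vocabulary."""
    table = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            table[cur][nxt] += 1
    return table

def predict_next(table, code):
    """Most frequent successor of `code`, or None if the code was
    never observed with a successor."""
    if code not in table:
        return None
    return table[code].most_common(1)[0][0]
```

The tens-of-thousands-strong code vocabulary is exactly what breaks such count-based models and motivates the Transformer architectures the thesis introduces.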
cs.AI / 51 / 2603.16365
FactorEngine: A Program-level Knowledge-Infused Factor Mining Framework for Quantitative Investment
FactorEngine:一种程序级知识注入的因子挖掘框架用于量化投资
Abstract
We study alpha factor mining, the automated discovery of predictive signals from noisy, non-stationary market data-under a practical requirement that mined factors be directly executable and auditable, and that the discovery process remain computationally tractable at scale. Existing symbolic approaches are limited by bounded expressiveness, while neural forecasters often trade interpretability for performance and remain vulnerable to regime shifts and overfitting. We introduce FactorEngine (FE), a program-level factor discovery framework that casts factors as Turing-complete code and improves both effectiveness and efficiency via three separations: (i) logic revision vs. parameter optimization, (ii) LLM-guided directional search vs. Bayesian hyperparameter search, and (iii) LLM usage vs. local computation. FE further incorporates a knowledge-infused bootstrapping module that transforms unstructured financial reports into executable factor programs through a closed-loop multi-agent extraction-verification-code-generation pipeline, and an experience knowledge base that supports trajectory-aware refinement (including learning from failures). Across extensive backtests on real-world OHLCV data, FE produces factors with substantially stronger predictive stability and portfolio impact-for example, higher IC/ICIR (and Rank IC/ICIR) and improved AR/Sharpe, than baseline methods, achieving state-of-the-art predictive and portfolio performance.
Chinese Translation
我们研究了阿尔法因子挖掘,即从嘈杂的非平稳市场数据中自动发现预测信号,要求挖掘的因子能够直接执行和审计,并且发现过程在规模上保持计算可行性。现有的符号方法受限于表达能力,而神经预测模型往往为了性能而牺牲可解释性,并且容易受到市场机制转变(regime shifts)和过拟合的影响。我们引入了FactorEngine(FE),一个程序级因子发现框架,将因子视为图灵完备的代码,并通过三种分离提高了有效性和效率:(i)逻辑修订与参数优化,(ii)LLM引导的方向搜索与贝叶斯超参数搜索,以及(iii)LLM使用与本地计算。FE进一步结合了一个知识注入的自举模块,通过闭环多智能体提取-验证-代码生成管道,将非结构化的财务报告转化为可执行的因子程序,并且包含一个支持轨迹感知优化的经验知识库(包括从失败中学习)。在对真实世界OHLCV数据进行的大量回测中,FE生成的因子具有显著更强的预测稳定性和投资组合影响力,例如,较基线方法更高的IC/ICIR(及Rank IC/ICIR)和改进的AR/Sharpe,达到了最先进的预测和投资组合表现。
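The IC metric used above to evaluate mined factors is the information coefficient: the Spearman rank correlation between a factor's cross-sectional values and subsequent returns (ICIR is its mean over time divided by its standard deviation). A minimal sketch, assuming no tied values so the closed-form rank formula applies:

```python
def spearman_ic(factor_values, forward_returns):
    """Information coefficient: Spearman rank correlation between a
    factor's cross-sectional values and next-period returns.
    Uses the closed form 1 - 6*sum(d^2)/(n*(n^2-1)); assumes no ties.
    """
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    n = len(factor_values)
    rf, rr = ranks(factor_values), ranks(forward_returns)
    d2 = sum((a - b) ** 2 for a, b in zip(rf, rr))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

A factor that perfectly rank-orders future returns scores 1.0; a perfectly inverted one scores -1.0.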
cs.AI / 52 / 2603.16417
Via Negativa for AI Alignment: Why Negative Constraints Are Structurally Superior to Positive Preferences
负向方法在人工智能对齐中的应用:为何负约束在结构上优于正偏好
Abstract
Recent empirical results have demonstrated that training large language models (LLMs) with negative-only feedback can match or exceed standard reinforcement learning from human feedback (RLHF). Negative Sample Reinforcement achieves parity with PPO on mathematical reasoning; Distributional Dispreference Optimization trains effectively using only dispreferred samples; and Constitutional AI outperforms pure RLHF on harmlessness benchmarks. Yet no unified theoretical account explains why negative signals are so effective. This paper proposes such an account: positive preferences and negative constraints are structurally asymmetric. Positive preferences ("which is better") encode continuously coupled, context-dependent human values that cannot be exhaustively specified -- leading models to learn surface correlates such as agreement with the user (sycophancy). Negative constraints ("what is wrong") encode discrete, finite, independently verifiable prohibitions that can converge to a stable boundary. This asymmetry -- rooted in Popper's falsification logic and the epistemology of negative knowledge -- explains both the sycophancy failure of preference-based RLHF and the surprising effectiveness of negative-signal methods. We argue that alignment research should shift its center of gravity from "learning what humans prefer" to "learning what humans reject," and offer testable predictions for this framework.
Chinese Translation
近期的实证结果表明,仅使用负反馈训练大型语言模型(LLMs)可以与标准的人类反馈强化学习(RLHF)相匹配或超越。负样本强化在数学推理上与PPO持平;分布式去偏好优化(Distributional Dispreference Optimization)仅使用非偏好样本即可有效训练;而宪法人工智能在无害性基准测试中优于纯RLHF。然而,目前没有统一的理论解释为何负信号如此有效。本文提出了这样的理论:正偏好和负约束在结构上是不对称的。正偏好(“哪个更好”)编码了持续耦合、依赖于上下文的人类价值观,这些价值观无法被穷尽地指定——导致模型学习到表面的相关性,例如与用户的一致性(谄媚)。负约束(“什么是错误的”)编码了离散的、有限的、可独立验证的禁令,这些禁令可以收敛到一个稳定的边界。这种不对称性——根植于波普尔的证伪逻辑和负知识的认识论——解释了以偏好为基础的RLHF的谄媚失败及负信号方法的惊人有效性。我们认为,对齐研究应将重心从“学习人类的偏好”转向“学习人类的拒绝”,并为这一框架提供可测试的预测。
cs.AI / 53 / 2603.16434
From Natural Language to Executable Option Strategies via Large Language Models
通过大型语言模型将自然语言转化为可执行的期权策略
Abstract
Large Language Models (LLMs) excel at general code generation, yet translating natural-language trading intents into correct option strategies remains challenging. Real-world option design requires reasoning over massive, multi-dimensional option chain data with strict constraints, which often overwhelms direct generation methods. We introduce the Option Query Language (OQL), a domain-specific intermediate representation that abstracts option markets into high-level primitives under grammatical rules, enabling LLMs to function as reliable semantic parsers rather than free-form programmers. OQL queries are then validated and executed deterministically by an engine to instantiate executable strategies. We also present a new dataset for this task and demonstrate that our neuro-symbolic pipeline significantly improves execution accuracy and logical consistency over direct baselines.
Chinese Translation
大型语言模型(LLMs)在通用代码生成方面表现出色,但将自然语言交易意图转化为正确的期权策略仍然具有挑战性。现实世界中的期权设计需要对大量多维期权链数据进行推理,并遵循严格的约束,这常常使得直接生成方法难以应对。我们引入了期权查询语言(Option Query Language, OQL),这是一种领域特定的中间表示,它在语法规则下将期权市场抽象为高层次的原语,从而使LLMs能够作为可靠的语义解析器,而不是自由形式的程序员。OQL查询随后由引擎进行确定性验证和执行,以实例化可执行策略。我们还为此任务提出了一个新数据集,并展示了我们的神经符号管道在执行准确性和逻辑一致性方面显著优于直接基线。
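The validate-then-execute pattern for OQL can be sketched as a deterministic grammar check over filter clauses before any execution reaches the option chain. The field and operator names below are hypothetical, since the actual OQL grammar is not reproduced in the abstract:

```python
# Hypothetical grammar fragment; the real OQL primitives differ.
ALLOWED_FIELDS = {"expiry", "strike", "delta", "type"}
ALLOWED_OPS = {"==", "<", ">", "<=", ">="}

def validate_oql(query):
    """Check every filter clause against the grammar before execution.
    Invalid queries are rejected deterministically with a list of
    errors, rather than being run as free-form generated code.
    """
    errors = []
    for field, op, value in query["filters"]:
        if field not in ALLOWED_FIELDS:
            errors.append(f"unknown field: {field}")
        if op not in ALLOWED_OPS:
            errors.append(f"unknown operator: {op}")
    return errors
```

Constraining the LLM to emit such clauses, rather than arbitrary code, is what turns it into a semantic parser whose output an engine can execute deterministically.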
cs.AI / 54 / 2603.16445
Visual Distraction Undermines Moral Reasoning in Vision-Language Models
视觉干扰削弱视觉-语言模型中的道德推理
Abstract
Moral reasoning is fundamental to safe Artificial Intelligence (AI), yet ensuring its consistency across modalities becomes critical as AI systems evolve from text-based assistants to embodied agents. Current safety techniques demonstrate success in textual contexts, but concerns remain about generalization to visual inputs. Existing moral evaluation benchmarks rely on textonly formats and lack systematic control over variables that influence moral decision-making. Here we show that visual inputs fundamentally alter moral decision-making in state-of-the-art (SOTA) Vision-Language Models (VLMs), bypassing text-based safety mechanisms. We introduce Moral Dilemma Simulation (MDS), a multimodal benchmark grounded in Moral Foundation Theory (MFT) that enables mechanistic analysis through orthogonal manipulation of visual and contextual variables. The evaluation reveals that the vision modality activates intuition-like pathways that override the more deliberate and safer reasoning patterns observed in text-only contexts. These findings expose critical fragilities where language-tuned safety filters fail to constrain visual processing, demonstrating the urgent need for multimodal safety alignment.
Chinese Translation
道德推理是安全人工智能(AI)的基础,然而,随着AI系统从基于文本的助手演变为具身代理,确保其在不同模态间的一致性变得至关重要。目前的安全技术在文本环境中取得了一定成功,但对于视觉输入的推广仍然存在担忧。现有的道德评估基准依赖于仅文本格式,缺乏对影响道德决策的变量的系统控制。在此,我们展示了视觉输入在最先进的视觉-语言模型(VLMs)中根本性地改变了道德决策,绕过了基于文本的安全机制。我们引入了道德困境模拟(Moral Dilemma Simulation, MDS),这是一个基于道德基础理论(Moral Foundation Theory, MFT)的多模态基准,通过对视觉和上下文变量的正交操控实现机制分析。评估结果显示,视觉模态激活了类似直觉的路径,覆盖了在仅文本环境中观察到的更为深思熟虑和安全的推理模式。这些发现揭示了语言调优的安全过滤器在约束视觉处理方面的关键脆弱性,表明了多模态安全对齐的迫切需求。
cs.AI / 55 / 2603.16448
TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas
TRUST-SQL:工具集成的多轮强化学习用于未知模式下的文本到SQL转换
Abstract
Text-to-SQL parsing has achieved remarkable progress under the Full Schema Assumption. However, this premise fails in real-world enterprise environments where databases contain hundreds of tables with massive noisy metadata. Rather than injecting the full schema upfront, an agent must actively identify and verify only the relevant subset, giving rise to the Unknown Schema scenario we study in this work. To address this, we propose TRUST-SQL (Truthful Reasoning with Unknown Schema via Tools). We formulate the task as a Partially Observable Markov Decision Process where our autonomous agent employs a structured four-phase protocol to ground reasoning in verified metadata. Crucially, this protocol provides a structural boundary for our novel Dual-Track GRPO strategy. By applying token-level masked advantages, this strategy isolates exploration rewards from execution outcomes to resolve credit assignment, yielding a 9.9% relative improvement over standard GRPO. Extensive experiments across five benchmarks demonstrate that TRUST-SQL achieves an average absolute improvement of 30.6% and 16.6% for the 4B and 8B variants respectively over their base models. Remarkably, despite operating entirely without pre-loaded metadata, our framework consistently matches or surpasses strong baselines that rely on schema prefilling.
Chinese Translation
文本到SQL解析在完全模式假设下取得了显著进展。然而,这一前提在现实世界的企业环境中并不适用,因为数据库通常包含数百个表和大量嘈杂的元数据。代理必须主动识别和验证相关的子集,而不是提前注入完整模式,这就是我们在本研究中探讨的未知模式场景。为了解决这个问题,我们提出了TRUST-SQL(通过工具进行未知模式的真实推理)。我们将任务形式化为一个部分可观察的马尔可夫决策过程,其中我们的自主代理采用结构化的四阶段协议,以在经过验证的元数据中进行推理。关键是,这一协议为我们新颖的双轨GRPO策略提供了结构边界。通过应用令牌级掩蔽优势,该策略将探索奖励与执行结果隔离,以解决信用分配问题,相较于标准GRPO实现了9.9%的相对提升。在五个基准测试中的广泛实验表明,TRUST-SQL的4B和8B变体相较各自的基础模型分别实现了30.6%和16.6%的平均绝对提升。值得注意的是,尽管完全不依赖预加载的元数据,我们的框架仍然能够与依赖模式预填充的强基线相匹配或超越。
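The token-level masked-advantage idea behind a dual-track GRPO, letting exploration-phase tokens and execution-phase tokens each take advantages normalized within their own reward track, can be sketched as follows. The exact split is an assumption about the general mechanism, not the paper's implementation:

```python
def grpo_advantages(rewards):
    """Group-relative advantages: (r - mean) / std over the rollout
    group, the GRPO-style normalization (std of 0 falls back to 1)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    std = std if std > 0 else 1.0
    return [(r - mean) / std for r in rewards]

def dual_track_token_advantages(token_masks, explore_rewards, execute_rewards, i):
    """Per-token advantages for rollout i of the group. Tokens flagged
    True in token_masks (exploration phase) receive the group-relative
    exploration advantage; the rest receive the execution advantage.
    Each track's reward therefore never leaks credit into the other."""
    a_explore = grpo_advantages(explore_rewards)[i]
    a_execute = grpo_advantages(execute_rewards)[i]
    return [a_explore if m else a_execute for m in token_masks]
```

A rollout that explored the schema well but produced a wrong final query still gets positive signal on its exploration tokens, which is the credit-assignment isolation described above.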
cs.AI / 56 / 2603.16453
RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments
RetailBench:评估长时间视野下 LLM 代理在现实零售环境中的自主决策能力和战略稳定性
Abstract
Large Language Model (LLM)-based agents have achieved notable success on short-horizon and highly structured tasks. However, their ability to maintain coherent decision-making over long horizons in realistic and dynamic environments remains an open challenge. We introduce RetailBench, a high-fidelity benchmark designed to evaluate long-horizon autonomous decision-making in realistic commercial scenarios, where agents must operate under stochastic demand and evolving external conditions. We further propose the Evolving Strategy & Execution framework, which separates high-level strategic reasoning from low-level action execution. This design enables adaptive and interpretable strategy evolution over time. It is particularly important for long-horizon tasks, where non-stationary environments and error accumulation require strategies to be revised at a different temporal scale than action execution. Experiments on eight state-of-the-art LLMs across progressively challenging environments show that our framework improves operational stability and efficiency compared to other baselines. However, performance degrades substantially as task complexity increases, revealing fundamental limitations in current LLMs for long-horizon, multi-factor decision-making.
Chinese Translation
基于大型语言模型(LLM)的代理在短时间视野和高度结构化的任务上取得了显著成功。然而,在现实和动态环境中,保持长时间视野下连贯决策能力仍然是一个未解决的挑战。我们引入了 RetailBench,这是一个高保真基准,旨在评估在现实商业场景中长时间视野自主决策的能力,在这些场景中,代理必须在随机需求和不断变化的外部条件下运作。我们进一步提出了演变战略与执行框架(Evolving Strategy & Execution),该框架将高层次的战略推理与低层次的行动执行分离。这种设计使得战略在时间上能够适应并具可解释性,尤其对于长时间视野任务而言,在非平稳环境和错误累积的情况下,战略需要在不同于行动执行的时间尺度上进行修订。在难度逐步提升的环境中对八种最先进的 LLM 进行的实验表明,相较于其他基线,我们的框架提升了运营稳定性和效率。然而,随着任务复杂性的增加,性能明显下降,暴露了当前 LLM 在长时间视野、多因素决策中的基本局限性。
cs.AI / 57 / 2603.16463
Follow the Clues, Frame the Truth: Hybrid-evidential Deductive Reasoning in Open-Vocabulary Multimodal Emotion Recognition
追踪线索,框定真相:开放词汇多模态情感识别中的混合证据演绎推理
Abstract
Open-Vocabulary Multimodal Emotion Recognition (OV-MER) is inherently challenging due to the ambiguity of equivocal multimodal cues, which often stem from distinct unobserved situational dynamics. While Multimodal Large Language Models (MLLMs) offer extensive semantic coverage, their performance is often bottlenecked by premature commitment to dominant data priors, resulting in suboptimal heuristics that overlook crucial, complementary affective cues across modalities. We argue that effective affective reasoning requires more than surface-level association; it necessitates reconstructing nuanced emotional states by synthesizing multiple evidence-grounded rationales that reconcile these observations from diverse latent perspectives. We introduce HyDRA, a Hybrid-evidential Deductive Reasoning Architecture that formalizes inference as a Propose-Verify-Decide protocol. To internalize this abductive process, we employ reinforcement learning with hierarchical reward shaping, aligning the reasoning trajectories with final task performance to ensure they best reconcile the observed multimodal cues. Systematic evaluations validate our design choices, with HyDRA consistently outperforming strong baselines--especially in ambiguous or conflicting scenarios--while providing interpretable, diagnostic evidence traces.
Chinese Translation
开放词汇多模态情感识别(OV-MER)由于模糊的多模态线索而固有地具有挑战性,这些线索通常源于不同的未观察到的情境动态。尽管多模态大型语言模型(MLLMs)提供了广泛的语义覆盖,但它们的性能常常受到对主导数据先验的过早承诺的瓶颈,导致次优的启发式方法忽视了跨模态的重要互补情感线索。我们认为,有效的情感推理不仅需要表层的关联;它需要通过综合多种基于证据的推理,重建细致的情感状态,从而调和来自不同潜在视角的观察。我们提出了HyDRA,一种混合证据演绎推理架构,将推理形式化为提议-验证-决策协议。为了内化这一溯因过程,我们采用了带有层次奖励塑形的强化学习,将推理轨迹与最终任务表现对齐,以确保它们最好地调和观察到的多模态线索。系统评估验证了我们的设计选择,HyDRA在模糊或冲突场景中始终优于强基线,同时提供可解释的诊断证据轨迹。
cs.AI / 58 / 2603.16475
Breaking the Chain: A Causal Analysis of LLM Faithfulness to Intermediate Structures
打破链条:对大型语言模型对中间结构忠实性的因果分析
Abstract
Schema-guided reasoning pipelines ask LLMs to produce explicit intermediate structures -- rubrics, checklists, verification queries -- before committing to a final decision. But do these structures causally determine the output, or merely accompany it? We introduce a causal evaluation protocol that makes this directly measurable: by selecting tasks where a deterministic function maps intermediate structures to decisions, every controlled edit implies a unique correct output. Across eight models and three benchmarks, models appear self-consistent with their own intermediate structures but fail to update predictions after intervention in up to 60% of cases -- revealing that apparent faithfulness is fragile once the intermediate structure changes. When derivation of the final decision from the structure is delegated to an external tool, this fragility largely disappears; however, prompts which ask to prioritize the intermediate structure over the original input do not materially close the gap. Overall, intermediate structures in schema-guided pipelines function as influential context rather than stable causal mediators.
Chinese Translation
模式引导推理流程要求大型语言模型(LLMs)在做出最终决策之前生成明确的中间结构——如评分标准、检查清单、验证查询等。然而,这些结构是在因果上决定输出,还是仅仅伴随输出?我们引入了一种使这一点可直接测量的因果评估协议:通过选择由确定性函数将中间结构映射到决策的任务,每一次受控编辑都对应一个唯一的正确输出。在八个模型和三个基准测试中,模型看似与其自身的中间结构保持自洽,但在多达60%的情况下,干预后未能更新预测——这揭示了一旦中间结构发生变化,表面上的忠实性便十分脆弱。当最终决策从结构的推导被委托给外部工具时,这种脆弱性在很大程度上消失;然而,要求优先考虑中间结构而非原始输入的提示并未实质性地缩小这一差距。总体而言,模式引导流程中的中间结构起到的是有影响力的上下文的作用,而非稳定的因果中介。
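The deterministic-mapping idea in this protocol can be made concrete with a toy sketch. The checklist task, the majority-vote rule, and the stub model below are illustrative assumptions, not the paper's actual benchmarks:

```python
# Sketch of the causal evaluation protocol on a hypothetical task design:
# a deterministic rule maps an intermediate checklist to a decision, so
# every controlled edit to the checklist implies a unique correct output.

def decide(checklist):
    """Deterministic mapping: pass iff a majority of checks hold."""
    return "pass" if sum(checklist) > len(checklist) / 2 else "fail"

def faithful_to_structure(model, checklist, edit_index):
    """Flip one checklist item, then test whether the model's prediction
    tracks the decision implied by the edited structure."""
    edited = list(checklist)
    edited[edit_index] = not edited[edit_index]
    expected = decide(edited)      # unique correct output after the edit
    predicted = model(edited)      # model re-queried with edited structure
    return predicted == expected   # False = unfaithful to the structure

# A 'model' that ignores the structure and repeats its original answer
# illustrates the failure mode the paper measures.
stubborn_model = lambda checklist: "pass"
print(faithful_to_structure(stubborn_model, [True, True, False], 1))
```

A model that actually derives its answer from the structure (e.g. `lambda cl: decide(cl)`) passes this check; the paper's finding is that LLMs often behave like the stubborn model after intervention.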
cs.AI / 59 / 2603.16495
ExpressMind: A Multimodal Pretrained Large Language Model for Expressway Operation
ExpressMind:一种用于高速公路运营的多模态预训练大型语言模型
Abstract
The current expressway operation relies on rule-based and isolated models, which limits the ability to jointly analyze knowledge across different systems. Meanwhile, Large Language Models (LLMs) are increasingly applied in intelligent transportation, advancing traffic models from algorithmic to cognitive intelligence. However, general LLMs are unable to effectively understand the regulations and causal relationships of events in unconventional scenarios in the expressway field. Therefore, this paper constructs a pre-trained multimodal large language model (MLLM) for expressways, ExpressMind, which serves as the cognitive core for intelligent expressway operations. This paper constructs the industry's first full-stack expressway dataset, encompassing traffic knowledge texts, emergency reasoning chains, and annotated video events to overcome data scarcity. This paper proposes a dual-layer LLM pre-training paradigm based on self-supervised training and unsupervised learning. Additionally, this study introduces a Graph-Augmented RAG framework to dynamically index the expressway knowledge base. To enhance reasoning for expressway incident response strategies, we develop a RL-aligned Chain-of-Thought (RL-CoT) mechanism that enforces consistency between model reasoning and expert problem-solving heuristics for incident handling. Finally, ExpressMind integrates a cross-modal encoder to align the dynamic feature sequences under the visual and textual channels, enabling it to understand traffic scenes in both video and image modalities. Extensive experiments on our newly released multi-modal expressway benchmark demonstrate that ExpressMind comprehensively outperforms existing baselines in event detection, safety response generation, and complex traffic analysis. The code and data are available at: https://wanderhee.github.io/ExpressMind/.
Chinese Translation
当前的高速公路运营依赖于基于规则的、相互孤立的模型,这限制了跨不同系统联合分析知识的能力。同时,大型语言模型(LLMs)在智能交通中的应用日益增多,推动交通模型从算法智能向认知智能发展。然而,通用的LLMs无法有效理解高速公路领域非常规场景下的法规和事件因果关系。因此,本文构建了一种面向高速公路的预训练多模态大型语言模型(MLLM)——ExpressMind,作为智能高速公路运营的认知核心。本文构建了行业首个全栈高速公路数据集,涵盖交通知识文本、应急推理链和标注视频事件,以克服数据稀缺问题。本文提出了一种基于自监督训练和无监督学习的双层LLM预训练范式。此外,本研究引入了一种图增强RAG框架,以动态索引高速公路知识库。为了增强高速公路事件响应策略的推理能力,我们开发了一种强化学习对齐的思维链(RL-CoT)机制,以确保模型推理与专家处理事件的问题求解启发式之间的一致性。最后,ExpressMind集成了一个跨模态编码器,以对齐视觉和文本通道下的动态特征序列,使其能够理解视频和图像模态下的交通场景。在我们新发布的多模态高速公路基准上的大量实验表明,ExpressMind在事件检测、安全响应生成和复杂交通分析方面全面超越了现有基线。代码和数据可在 https://wanderhee.github.io/ExpressMind/ 获取。
cs.AI / 60 / 2603.16526
Exploring different approaches to customize language models for domain-specific text-to-code generation
探索定制语言模型以实现领域特定文本到代码生成的不同方法
Abstract
Large language models (LLMs) have demonstrated strong capabilities in generating executable code from natural language descriptions. However, general-purpose models often struggle in specialized programming contexts where domain-specific libraries, APIs, or conventions must be used. Customizing smaller open-source models offers a cost-effective alternative to relying on large proprietary systems. In this work, we investigate how smaller language models can be adapted for domain-specific code generation using synthetic datasets. We construct datasets of programming exercises across three domains within the Python ecosystem: general Python programming, Scikit-learn machine learning workflows, and OpenCV-based computer vision tasks. Using these datasets, we evaluate three customization strategies: few-shot prompting, retrieval-augmented generation (RAG), and parameter-efficient fine-tuning using Low-Rank Adaptation (LoRA). Performance is evaluated using both benchmark-based metrics and similarity-based metrics that measure alignment with domain-specific code. Our results show that prompting-based approaches such as few-shot learning and RAG can improve domain relevance in a cost-effective manner, although their impact on benchmark accuracy is limited. In contrast, LoRA-based fine-tuning consistently achieves higher accuracy and stronger domain alignment across most tasks. These findings highlight practical trade-offs between flexibility, computational cost, and performance when adapting smaller language models for specialized programming tasks.
Chinese Translation
大型语言模型(LLMs)在从自然语言描述生成可执行代码方面表现出强大的能力。然而,通用模型在需要使用领域特定库、API或约定的专业编程环境中往往表现不佳。定制较小的开源模型提供了一种经济高效的替代方案,避免依赖大型专有系统。在本研究中,我们探讨了如何利用合成数据集使较小的语言模型适应领域特定的代码生成。我们构建了涵盖Python生态系统中三个领域的编程练习数据集:通用Python编程、Scikit-learn机器学习工作流和基于OpenCV的计算机视觉任务。利用这些数据集,我们评估了三种定制策略:少样本提示、检索增强生成(RAG)和使用低秩适应(LoRA)的参数高效微调。性能通过基准指标以及衡量与领域特定代码对齐程度的相似性指标进行评估。我们的结果表明,少样本学习和RAG等基于提示的方法可以以经济高效的方式提高领域相关性,尽管它们对基准准确性的影响有限。相比之下,基于LoRA的微调在大多数任务中始终实现了更高的准确性和更强的领域对齐。这些发现突显了在将较小语言模型适应于专业编程任务时,灵活性、计算成本和性能之间的实际权衡。
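The LoRA strategy compared in this abstract can be illustrated with a minimal numerical sketch; the matrix shapes, values, and alpha/r scaling below are toy assumptions rather than details taken from the paper:

```python
# Minimal sketch of Low-Rank Adaptation (LoRA): a frozen weight matrix W
# is augmented with a trainable low-rank update B @ A, scaled by alpha / r.
# Shapes and values here are illustrative, not from any real model.

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    delta = matmul(B, A)                      # rank-r update, same shape as W
    scale = alpha / r
    return [[w + scale * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]                  # frozen 2x2 base weight
A = [[0.5, -0.5]]                             # r=1 projection  (1 x 2)
B = [[2.0], [0.0]]                            # r=1 expansion   (2 x 1)
W_eff = lora_effective_weight(W, A, B, alpha=2.0, r=1)
print(W_eff)    # only the small A and B matrices would be trained
```

The parameter efficiency the abstract refers to comes from training only `A` and `B` (here 4 numbers) while `W` stays frozen.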
cs.AI / 61 / 2603.16537
Designing for Disagreement: Front-End Guardrails for Assistance Allocation in LLM-Enabled Robots
为分歧而设计:LLM 驱动机器人中辅助分配的前端保护措施
Abstract
LLM-enabled robots prioritizing scarce assistance in social settings face pluralistic values and LLM behavioral variability: reasonable people can disagree about who is helped first, while LLM-mediated interaction policies vary across prompts, contexts, and groups in ways that are difficult to anticipate or verify at contact point. Yet user-facing guardrails for real-time, multi-user assistance allocation remain under-specified. We propose bounded calibration with contestability, a procedural front-end pattern that (i) constrains prioritization to a governance-approved menu of admissible modes, (ii) keeps the active mode legible in interaction-relevant terms at the point of deferral, and (iii) provides an outcome-specific contest pathway without renegotiating the global rule. Treating pluralism and LLM uncertainty as standing conditions, the pattern avoids both silent defaults that hide implicit value skews and wide-open user-configurable "value settings" that shift burden under time pressure. We illustrate the pattern with a public-concourse robot vignette and outline an evaluation agenda centered on legibility, procedural legitimacy, and actionability, including risks of automation bias and uneven usability of contest channels.
Chinese Translation
在社会环境中,优先分配稀缺辅助的 LLM 驱动机器人面临多元价值观和 LLM 行为变异性:合理的人们可能会对谁应优先获得帮助产生分歧,而 LLM 介导的交互政策在提示、上下文和群体之间的变化难以在接触点预见或验证。然而,针对实时多用户辅助分配的面向用户的保护措施仍然不够明确。我们提出了带有可争议性的有界校准,这是一种程序性前端模式,(i) 将优先级限制在经过治理批准的可接受模式菜单中,(ii) 在移交决策时以与交互相关的术语保持当前模式的可读性,以及 (iii) 提供一个针对特定结果的争议路径,而无需重新协商全局规则。将多元主义和 LLM 不确定性视为常态条件,该模式既避免了掩盖隐含价值偏差的静默默认设置,也避免了在时间压力下转移负担的开放式用户可配置“价值设置”。我们通过一个公共大厅机器人场景示例来说明该模式,并概述了一个以可读性、程序合法性和可操作性为中心的评估议程,包括自动化偏见和争议渠道可用性不均的风险。
cs.AI / 62 / 2603.16557
BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs
BenchPreS:一种针对持久内存大语言模型的上下文感知个性化偏好选择性基准
Abstract
Large language models (LLMs) increasingly store user preferences in persistent memory to support personalization across interactions. However, in third-party communication settings governed by social and institutional norms, some user preferences may be inappropriate to apply. We introduce BenchPreS, which evaluates whether memory-based user preferences are appropriately applied or suppressed across communication contexts. Using two complementary metrics, Misapplication Rate (MR) and Appropriate Application Rate (AAR), we find even frontier LLMs struggle to apply preferences in a context-sensitive manner. Models with stronger preference adherence exhibit higher rates of over-application, and neither reasoning capability nor prompt-based defenses fully resolve this issue. These results suggest current LLMs treat personalized preferences as globally enforceable rules rather than as context-dependent normative signals.
Chinese Translation
大型语言模型(LLMs)越来越多地在持久内存中存储用户偏好,以支持跨交互的个性化。然而,在由社会和制度规范主导的第三方沟通环境中,一些用户偏好可能不适合应用。我们提出了BenchPreS,旨在评估基于记忆的用户偏好在不同沟通上下文中是否得到了恰当的应用或抑制。通过使用两个互补指标,即误应用率(Misapplication Rate, MR)和恰当应用率(Appropriate Application Rate, AAR),我们发现即使是前沿的LLMs也难以以上下文敏感的方式应用偏好。偏好遵循性更强的模型表现出更高的过度应用率,而推理能力和基于提示的防御措施都未能完全解决这一问题。这些结果表明,当前的LLMs将个性化偏好视为全局可强制执行的规则,而非依赖上下文的规范信号。
cs.AI / 63 / 2603.16581
V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge in Vision Language Models
V-DyKnow:用于视觉语言模型时间敏感知识的动态基准
Abstract
Vision-Language Models (VLMs) are trained on data snapshots of documents, including images and texts. Their training data and evaluation benchmarks are typically static, implicitly treating factual knowledge as time-invariant. However, real-world facts are intrinsically time-sensitive and subject to erratic and periodic changes, causing model predictions to become outdated. We present V-DyKnow, a Visual Dynamic Knowledge benchmark for evaluating time-sensitive factual knowledge in VLMs. Using V-DyKnow, we benchmark closed- and open-source VLMs and analyze a) the reliability (correctness and consistency) of model responses across modalities and input perturbations; b) the efficacy of knowledge editing and multi-modal RAG methods for knowledge updates across modalities; and c) the sources of outdated predictions, through data and mechanistic analysis. Our results show that VLMs frequently output outdated facts, reflecting outdated snapshots used in the (pre-)training phase. Factual reliability degrades from textual to visual stimuli, even when entities are correctly recognized. Besides, existing alignment approaches fail to consistently update the models' knowledge across modalities. Together, these findings highlight fundamental limitations in how current VLMs acquire and update time-sensitive knowledge across modalities. We release the benchmark, code, and evaluation data.
Chinese Translation
视觉语言模型(VLMs)是在包含图像和文本的文档数据快照上进行训练的。它们的训练数据和评估基准通常是静态的,隐含地将事实知识视为时间不变。然而,现实世界的事实本质上是时间敏感的,并且受到不规则和周期性变化的影响,这导致模型预测变得过时。我们提出了 V-DyKnow,一个用于评估 VLMs 中时间敏感事实知识的视觉动态知识基准。通过使用 V-DyKnow,我们对闭源和开源 VLMs 进行了基准测试,并分析了 a) 模型在不同模态和输入扰动下的响应的可靠性(正确性和一致性);b) 知识编辑和多模态 RAG 方法在跨模态知识更新中的有效性;以及 c) 过时预测的来源,通过数据和机制分析。我们的结果表明,VLMs 经常输出过时的事实,反映出在(预)训练阶段使用的过时快照。事实可靠性从文本刺激到视觉刺激下降,即使实体被正确识别。此外,现有的对齐方法未能在不同模态之间一致地更新模型的知识。总的来说,这些发现突显了当前 VLMs 在跨模态获取和更新时间敏感知识方面的基本局限性。我们发布了该基准、代码和评估数据。
cs.AI / 64 / 2603.16586
Runtime Governance for AI Agents: Policies on Paths
人工智能代理的运行时治理:路径上的政策
Abstract
AI agents -- systems that plan, reason, and act using large language models -- produce non-deterministic, path-dependent behavior that cannot be fully governed at design time, where with governed we mean striking the right balance between as high as possible successful task completion rate and the legal, data-breach, reputational and other costs associated with running agents. We argue that the execution path is the central object for effective runtime governance and formalize compliance policies as deterministic functions mapping agent identity, partial path, proposed next action, and organizational state to a policy violation probability. We show that prompt-level instructions (and "system prompts"), and static access control are special cases of this framework: the former shape the distribution over paths without actually evaluating them; the latter evaluates deterministic policies that ignore the path (i.e., these can only account for a specific subset of all possible paths). In our view, runtime evaluation is the general case, and it is necessary for any path-dependent policy. We develop the formal framework for analyzing AI agent governance, present concrete policy examples (inspired by the AI act), discuss a reference implementation, and identify open problems including risk calibration and the limits of enforced compliance.
Chinese Translation
人工智能代理——利用大型语言模型进行规划、推理和行动的系统——产生非确定性、依赖路径的行为,这种行为在设计阶段无法完全治理。这里的治理是指在尽可能高的任务完成率与运行代理所涉及的法律、数据泄露、声誉及其他成本之间取得适当平衡。我们认为,执行路径是有效运行时治理的核心对象,并将合规政策形式化为确定性函数,该函数将代理身份、部分路径、提议的下一步行动和组织状态映射到政策违反概率。我们展示了提示级指令(和“系统提示”)以及静态访问控制是该框架的特例:前者在不实际评估路径的情况下塑造路径的分布;后者评估忽略路径的确定性政策(即,这些政策只能考虑所有可能路径的特定子集)。在我们看来,运行时评估是一般情况,对于任何依赖路径的政策都是必要的。我们开发了分析人工智能代理治理的正式框架,提出了具体的政策示例(受人工智能法案启发),讨论了参考实现,并识别出包括风险校准和强制合规的限制在内的开放问题。
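The paper's formalization of compliance policies lends itself to a short sketch. The function names, the "sensitive read" rule, and the probability values below are hypothetical illustrations, not the paper's reference implementation:

```python
# Sketch of the formalization: a compliance policy is a deterministic
# function from (agent identity, partial path, proposed action, org state)
# to a policy-violation probability. All rules and values are hypothetical.

def path_policy(agent_id, path, action, org_state):
    """Path-dependent runtime policy: escalate risk if the agent already
    read sensitive data earlier on this execution path."""
    touched_sensitive = any(a == "read_sensitive" for a in path)
    if action == "send_external" and touched_sensitive:
        return 0.9     # likely exfiltration: high violation probability
    return 0.0

def static_acl(agent_id, path, action, org_state):
    """Static access control as the path-ignoring special case: the
    decision depends on the action alone, never on the path."""
    return 1.0 if action in org_state["denied_actions"] else 0.0

org = {"denied_actions": {"delete_db"}}
run = ["read_sensitive", "summarize"]
print(path_policy("agent-7", run, "send_external", org))   # path matters
print(static_acl("agent-7", run, "send_external", org))    # path ignored
```

This makes the paper's point concrete: `static_acl` can never distinguish the two runs above, while the general runtime policy can.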
cs.AI / 65 / 2603.16642
When AI Navigates the Fog of War
当人工智能在战争迷雾中导航
Abstract
Can AI reason about a war before its trajectory becomes historically obvious? Analyzing this capability is difficult because retrospective geopolitical prediction is heavily confounded by training-data leakage. We address this challenge through a temporally grounded case study of the early stages of the 2026 Middle East conflict, which unfolded after the training cutoff of current frontier models. We construct 11 critical temporal nodes, 42 node-specific verifiable questions, and 5 general exploratory questions, requiring models to reason only from information that would have been publicly available at each moment. This design substantially mitigates training-data leakage concerns, creating a setting well-suited for studying how models analyze an unfolding crisis under the fog of war, and provides, to our knowledge, the first temporally grounded analysis of LLM reasoning in an ongoing geopolitical conflict. Our analysis reveals three main findings. First, current state-of-the-art large language models often display a striking degree of strategic realism, reasoning beyond surface rhetoric toward deeper structural incentives. Second, this capability is uneven across domains: models are more reliable in economically and logistically structured settings than in politically ambiguous multi-actor environments. Finally, model narratives evolve over time, shifting from early expectations of rapid containment toward more systemic accounts of regional entrenchment and attritional de-escalation. Since the conflict remains ongoing at the time of writing, this work can serve as an archival snapshot of model reasoning during an unfolding geopolitical crisis, enabling future studies without the hindsight bias of retrospective analysis.
Chinese Translation
人工智能能否在一场战争的轨迹在历史上变得显而易见之前对其进行推理?分析这一能力是困难的,因为回顾性的地缘政治预测受到训练数据泄漏的严重干扰。我们通过对2026年中东冲突早期阶段进行时间锚定的案例研究来应对这一挑战,该冲突发生在当前前沿模型的训练截止日期之后。我们构建了11个关键时间节点、42个节点特定的可验证问题和5个一般性探索问题,要求模型仅根据每个时刻公开可得的信息进行推理。这一设计大大减轻了训练数据泄漏的顾虑,为研究模型如何在战争迷雾中分析正在展开的危机创造了合适的环境;据我们所知,这也是对正在进行的地缘政治冲突中大型语言模型推理的首次时间锚定分析。我们的分析揭示了三个主要发现。首先,当前最先进的大型语言模型常常表现出显著的战略现实主义,其推理超越表面言辞,触及更深层的结构性激励。其次,这种能力在不同领域之间并不均衡:模型在经济和后勤结构清晰的情境中更可靠,而在政治上模糊的多方参与环境中则不然。最后,模型叙事随时间演变,从早期对快速遏制的预期转向对地区固化与消耗性降级的更系统性描述。由于撰写本文时冲突仍在进行,这项工作可以作为模型在正在展开的地缘政治危机中进行推理的档案快照,使未来研究得以避免回顾性分析的后见之明偏差。
cs.AI / 66 / 2603.16648
Domain-Independent Dynamic Programming with Constraint Propagation
具有约束传播的领域无关动态规划
Abstract
There are two prevalent model-based paradigms for combinatorial problems: 1) state-based representations, such as heuristic search, dynamic programming (DP), and decision diagrams, and 2) constraint and domain-based representations, such as constraint programming (CP), (mixed-)integer programming, and Boolean satisfiability. In this paper, we bridge the gap between the DP and CP paradigms by integrating constraint propagation into DP, enabling a DP solver to prune states and transitions using constraint propagation. To this end, we implement constraint propagation using a general-purpose CP solver in the Domain-Independent Dynamic Programming framework and evaluate using heuristic search on three combinatorial optimisation problems: Single Machine Scheduling with Time Windows, the Resource Constrained Project Scheduling Problem (RCPSP), and the Travelling Salesperson Problem with Time Windows (TSPTW). Our evaluation shows that constraint propagation significantly reduces the number of state expansions, causing our approach to solve more instances than a DP solver for Single Machine Scheduling and RCPSP, and showing similar improvements for tightly constrained TSPTW instances. The runtime performance indicates that the benefits of propagation outweigh the overhead for constrained instances, but that further work into reducing propagation overhead could improve performance further. Our work is a key step in understanding the value of constraint propagation in DP solvers, providing a model-based approach to integrating DP and CP.
Chinese Translation
组合问题有两种普遍的基于模型的范式:1)基于状态的表示,如启发式搜索、动态规划(DP)和决策图;2)基于约束和域的表示,如约束编程(CP)、(混合)整数规划和布尔可满足性。在本文中,我们通过将约束传播集成到动态规划中,弥合了DP与CP范式之间的差距,使得DP求解器能够利用约束传播来修剪状态和转移。为此,我们在领域无关动态规划(Domain-Independent Dynamic Programming)框架中使用通用CP求解器实现约束传播,并使用启发式搜索在三个组合优化问题上进行评估:带时间窗的单机调度、资源受限项目调度问题(RCPSP)和带时间窗的旅行商问题(TSPTW)。我们的评估表明,约束传播显著减少了状态扩展的数量,使得我们的方法在单机调度和RCPSP上比DP求解器解决了更多实例,并在约束较紧的TSPTW实例上显示出类似的改进。运行时性能表明,对于受约束的实例,传播的收益超过了其开销,但进一步减少传播开销的工作可能会带来更大的性能提升。我们的工作是理解约束传播在DP求解器中价值的关键一步,为整合DP与CP提供了一种基于模型的方法。
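The core integration, a DP state-expansion loop that consults a constraint-propagation hook before pushing successors, can be sketched on a toy scheduling task. The task, the propagator, and the data below are simplified illustrations, not the paper's DIDP solver:

```python
# Toy sketch: a DP solver whose state expansion consults a constraint-
# propagation hook before pushing successors. Task: schedule as many jobs
# as possible within a deadline (hypothetical, chosen for brevity).

def dp_solve(jobs, deadline, propagate):
    best, expansions = 0, 0
    stack = [(0, 0, tuple(range(len(jobs))))]  # (jobs done, time used, remaining)
    while stack:
        done, t, remaining = stack.pop()
        expansions += 1
        best = max(best, done)
        for i in remaining:
            nt = t + jobs[i]
            if nt > deadline:
                continue                        # transition itself infeasible
            rest = tuple(j for j in remaining if j != i)
            best = max(best, done + 1)          # successor's value is recorded
            if propagate(nt, rest, jobs, deadline):
                stack.append((done + 1, nt, rest))  # expand only if not pruned
    return best, expansions

no_prop = lambda t, rest, jobs, d: True

def min_job_prop(t, rest, jobs, d):
    # Propagator: prune states from which even the cheapest remaining job
    # no longer fits -- such states cannot be extended any further.
    return (not rest) or t + min(jobs[j] for j in rest) <= d

jobs, deadline = [3, 2, 4, 1], 6
print(dp_solve(jobs, deadline, no_prop))       # (best, expansions) without pruning
print(dp_solve(jobs, deadline, min_job_prop))  # same best, fewer expansions
```

The pruning is sound here because the propagator only rejects states with no feasible extension, mirroring the paper's observation that propagation reduces state expansions without sacrificing optimality.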
cs.AI / 67 / 2603.16651
What if Pinocchio Were a Reinforcement Learning Agent: A Normative End-to-End Pipeline
如果皮诺曹是一个强化学习代理:一个规范性的端到端管道
Abstract
In the past decade, artificial intelligence (AI) has developed quickly. With this rapid progression came the need for systems capable of complying with the rules and norms of our society so that they can be successfully and safely integrated into our daily lives. Inspired by the story of Pinocchio in ``Le avventure di Pinocchio - Storia di un burattino'', this thesis proposes a pipeline that addresses the problem of developing norm compliant and context-aware agents. Building on the AJAR, Jiminy, and NGRL architectures, the work introduces \pino, a hybrid model in which reinforcement learning agents are supervised by argumentation-based normative advisors. In order to make this pipeline operational, this thesis also presents a novel algorithm for automatically extracting the arguments and relationships that underlie the advisors' decisions. Finally, this thesis investigates the phenomenon of \textit{norm avoidance}, providing a definition and a mitigation strategy within the context of reinforcement learning agents. Each component of the pipeline is empirically evaluated. The thesis concludes with a discussion of related work, current limitations, and directions for future research.
Chinese Translation
在过去十年中,人工智能(AI)迅速发展。随着这一快速进展,出现了对能够遵循我们社会规则和规范的系统的需求,以便它们能够成功且安全地融入我们的日常生活。受《皮诺曹的冒险》(Le avventure di Pinocchio - Storia di un burattino)中皮诺曹故事的启发,本论文提出了一条用于开发符合规范且具备情境意识的代理的管道。基于AJAR、Jiminy和NGRL架构,该研究引入了pino,一个混合模型,其中强化学习代理由基于论证的规范顾问进行监督。为了使该管道能够实际运作,本论文还提出了一种新算法,用于自动提取顾问决策背后的论据和关系。最后,本论文研究了“规范规避”(norm avoidance)现象,在强化学习代理的背景下给出其定义和缓解策略。管道的每个组件都经过了实证评估。论文最后讨论了相关工作、当前局限性和未来研究方向。
cs.AI / 68 / 2603.16659
Machines acquire scientific taste from institutional traces
机器从制度痕迹中获得科学品味
Abstract
Artificial intelligence matches or exceeds human performance on tasks with verifiable answers, from protein folding to Olympiad mathematics. Yet the capacity that most governs scientific advance is not reasoning but taste: the ability to judge which untested ideas deserve pursuit, exercised daily by editors and funders but never successfully articulated, taught, or automated. Here we show that fine-tuning language models on journal publication decisions recovers evaluative judgment inaccessible to both frontier models and human expertise. Using a held-out benchmark of research pitches in management spanning four quality tiers, we find that eleven frontier models, spanning major proprietary and open architectures, barely exceed chance, averaging 31% accuracy. Panels of journal editors and editorial board members reach 42% by majority vote. Fine-tuned models trained on years of publication records each surpass every frontier model and expert panel, with the best single model achieving 59%. These models exhibit calibrated confidence, reaching 100% accuracy on their highest-confidence predictions, and transfer this evaluative signal to untrained pairwise comparisons and one-sentence summaries. The mechanism generalizes: models trained on economics publication records achieve 70% accuracy. Scientific taste was not missing from AI's reach; it was deposited in the institutional record, waiting to be extracted. These results provide a scalable mechanism to triage the expanding volume of scientific production across disciplines where quality resists formal verification.
Chinese Translation
人工智能在具有可验证答案的任务上已匹敌或超越人类表现,从蛋白质折叠到奥林匹克数学。然而,最能左右科学进步的能力并非推理,而是品味:判断哪些未经检验的想法值得追求的能力。这一能力每天由编辑和资助者行使,却从未被成功地阐明、教授或自动化。在这里,我们展示了在期刊出版决策上微调的语言模型,能够恢复前沿模型和人类专业知识都无法触及的评估判断。使用一个涵盖四个质量层级的管理学研究提案保留基准,我们发现,涵盖主要专有和开放架构的十一个前沿模型几乎不超过随机水平,平均准确率为31%。由期刊编辑和编委会成员组成的小组通过多数投票达到42%。在多年出版记录上训练的微调模型全部超越了每一个前沿模型和专家小组,其中最佳单一模型的准确率达到59%。这些模型表现出经过校准的置信度,在其置信度最高的预测上达到100%的准确率,并将这种评估信号迁移到未经训练的成对比较和一句话摘要任务上。该机制具有普适性:在经济学出版记录上训练的模型达到70%的准确率。科学品味并非人工智能力所不及;它沉淀在制度记录之中,等待被提取。这些结果提供了一种可扩展的机制,用于在质量难以进行形式化验证的各学科中,对不断扩大的科学产出进行分流筛选。
cs.AI / 69 / 2603.16672
CritiSense: Critical Digital Literacy and Resilience Against Misinformation
CritiSense:批判性数字素养与抵御虚假信息的韧性
Abstract
Misinformation on social media undermines informed decision-making and public trust. Prebunking offers a proactive complement by helping users recognize manipulation tactics before they encounter them in the wild. We present CritiSense, a mobile media-literacy app that builds these skills through short, interactive challenges with instant feedback. It is the first multilingual (supporting nine languages) and modular platform, designed for rapid updates across topics and domains. We report a usability study with 93 users: 83.9% expressed overall satisfaction and 90.1% rated the app as easy to use. Qualitative feedback indicates that CritiSense helps improve digital literacy skills. Overall, it provides a multilingual prebunking platform and a testbed for measuring the impact of microlearning on misinformation resilience. Over 3+ months, we have reached 300+ active users. It is freely available to all users on the Apple App Store (https://apps.apple.com/us/app/critisense/id6749675792) and Google Play Store (https://play.google.com/store/apps/details?id=com.critisense&hl=en). Demo Video: https://shorturl.at/CDcdc
Chinese Translation
社交媒体上的虚假信息削弱了知情决策和公众信任。预先揭露(Prebunking)通过帮助用户在实际遇到操控策略之前识别这些策略,提供了一种积极的补充。我们推出了CritiSense,一款通过短小的互动挑战和即时反馈来培养这些技能的移动媒体素养应用。这是第一个支持九种语言的多语言模块化平台,旨在快速更新各个主题和领域。我们报告了一项针对93名用户的可用性研究:83.9%的用户表示总体满意,90.1%的用户认为该应用易于使用。定性反馈表明,CritiSense有助于提高数字素养技能。总体而言,它提供了一个多语言的预先揭露平台,并作为一个测量微学习对虚假信息韧性影响的测试平台。在超过3个月的时间里,我们已吸引了300多名活跃用户。该应用在Apple App Store(https://apps.apple.com/us/app/critisense/id6749675792)和Google Play Store(https://play.google.com/store/apps/details?id=com.critisense&hl=en)上免费提供给所有用户。演示视频:https://shorturl.at/CDcdc
cs.AI / 70 / 2603.16733
IQuest-Coder-V1 Technical Report
IQuest-Coder-V1 技术报告
Yang, Jian, Zhang, Wei, Guo, Shawn, Ye, Zhengmao, Jing, Lin, Liu, Shark, Li, Yizhi, Wu, Jiajun, Liu, Cening, Ma, X., Song, Yuyang, Wu, Siwei, Li, Yuwen, Liao, L., Zheng, T., Huang, Ziling, Huang, Zelong, Liu, Che, Xing, Yan, Li, Renyuan, Cai, Qingsong, Yan, Hanxu, Wang, Siyue, Li, Shikai, Liu, Jason Klein, Huang, An, Kang, Yongsheng, Zhang, Jinxing, Hao, Chuan, Wang, Haowen, Gu, Weicheng, Tao, Ran, Tang, Mingjie, Wu, Peihao, Wang, Jianzhou, Liu, Xianglong, Lv, Weifeng, Dai, Bryan
Abstract
In this report, we introduce the IQuest-Coder-V1 series-(7B/14B/40B/40B-Loop), a new family of code large language models (LLMs). Moving beyond static code representations, we propose the code-flow multi-stage training paradigm, which captures the dynamic evolution of software logic through different phases of the pipeline. Our models are developed through the evolutionary pipeline, starting with the initial pre-training consisting of code facts, repository, and completion data. Following that, we implement a specialized mid-training stage that integrates reasoning and agentic trajectories in 32k-context and repository-scale in 128k-context to forge deep logical foundations. The models are then finalized with post-training of specialized coding capabilities, which is bifurcated into two specialized paths: the thinking path (utilizing reasoning-driven RL) and the instruct path (optimized for general assistance). IQuest-Coder-V1 achieves state-of-the-art performance among competitive models across critical dimensions of code intelligence: agentic software engineering, competitive programming, and complex tool use. To address deployment constraints, the IQuest-Coder-V1-Loop variant introduces a recurrent mechanism designed to optimize the trade-off between model capacity and deployment footprint, offering an architecturally enhanced path for efficacy-efficiency trade-off. We believe the release of the IQuest-Coder-V1 series, including the complete white-box chain of checkpoints from pre-training bases to the final thinking and instruction models, will advance research in autonomous code intelligence and real-world agentic systems.
Chinese Translation
在本报告中,我们介绍了 IQuest-Coder-V1 系列(7B/14B/40B/40B-Loop),这是一个新的代码大型语言模型(LLMs)家族。我们超越静态代码表示,提出了代码流多阶段训练范式,以捕捉软件逻辑在流水线不同阶段的动态演变。我们的模型通过演进式流水线开发:首先进行由代码事实、代码库和补全数据构成的初始预训练;随后实施专门的中期训练阶段,在 32k 上下文中整合推理与代理轨迹,并在 128k 上下文中进行代码库规模的训练,以奠定深厚的逻辑基础;最后通过针对专门编码能力的后期训练完成,该过程分为两条专门路径:思维路径(利用推理驱动的强化学习)和指令路径(针对通用辅助优化)。IQuest-Coder-V1 在代码智能的关键维度(代理式软件工程、编程竞赛和复杂工具使用)上,在同类模型中实现了最先进的性能。为应对部署约束,IQuest-Coder-V1-Loop 变体引入了一种循环机制,旨在优化模型容量与部署占用之间的权衡,为效能与效率的权衡提供了一条架构增强的路径。我们相信,IQuest-Coder-V1 系列的发布,包括从预训练基座到最终思维与指令模型的完整白盒检查点链,将推动自主代码智能和现实世界代理系统的研究进展。
cs.AI / 71 / 2603.16734
Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure
个性化大型语言模型代理中的差异性危害倾向:心理健康披露的奇特案例
Abstract
Large language models (LLMs) are increasingly deployed as tool-using agents, shifting safety concerns from harmful text generation to harmful task completion. Deployed systems often condition on user profiles or persistent memory, yet agent safety evaluations typically ignore personalization signals. To address this gap, we investigated how mental health disclosure, a sensitive and realistic user-context cue, affects harmful behavior in agentic settings. Building on the AgentHarm benchmark, we evaluated frontier and open-source LLMs on multi-step malicious tasks (and their benign counterparts) under controlled prompt conditions that vary user-context personalization (no bio, bio-only, bio+mental health disclosure) and include a lightweight jailbreak injection. Our results reveal that harmful task completion is non-trivial across models: frontier lab models (e.g., GPT 5.2, Claude Sonnet 4.5, Gemini 3-Pro) still complete a measurable fraction of harmful tasks, while an open model (DeepSeek 3.2) exhibits substantially higher harmful completion. Adding a bio-only context generally reduces harm scores and increases refusals. Adding an explicit mental health disclosure often shifts outcomes further in the same direction, though effects are modest and not uniformly reliable after multiple-testing correction. Importantly, the refusal increase also appears on benign tasks, indicating a safety--utility trade-off via over-refusal. Finally, jailbreak prompting sharply elevates harm relative to benign conditions and can weaken or override the protective shift induced by personalization. Taken together, our results indicate that personalization can act as a weak protective factor in agentic misuse settings, but it is fragile under minimal adversarial pressure, highlighting the need for personalization-aware evaluations and safeguards that remain robust across user-context conditions.
Chinese Translation
大型语言模型(LLMs)越来越多地被部署为使用工具的代理,安全问题的关注点从有害文本生成转向有害任务完成。已部署的系统通常会以用户档案或持久记忆为条件,但代理安全评估通常忽略个性化信号。为弥补这一空白,我们研究了心理健康披露这一敏感且现实的用户上下文线索如何影响代理场景中的有害行为。基于AgentHarm基准,我们在变化用户上下文个性化(无个人简介、仅个人简介、个人简介+心理健康披露)并包含轻量级越狱注入的受控提示条件下,评估了前沿和开源LLMs在多步骤恶意任务(及其良性对应任务)上的表现。我们的结果显示,各模型的有害任务完成率并非微不足道:前沿实验室模型(如GPT 5.2、Claude Sonnet 4.5、Gemini 3-Pro)仍然完成了可观比例的有害任务,而一个开源模型(DeepSeek 3.2)则表现出显著更高的有害完成率。仅添加个人简介上下文通常会降低危害评分并增加拒绝率。添加明确的心理健康披露往往会使结果进一步朝同一方向偏移,尽管在多重检验校正后效应适度且并不总是可靠。重要的是,拒绝率的增加也出现在良性任务上,表明存在因过度拒绝而产生的安全与效用权衡。最后,越狱提示相对于良性条件显著提高了危害水平,并可能削弱或覆盖个性化所带来的保护性偏移。综合来看,我们的结果表明,个性化在代理滥用场景中可以作为一个弱保护因素,但在极小的对抗压力下便十分脆弱,这突显了需要个性化感知的评估,以及在各种用户上下文条件下保持稳健的安全措施。
cs.AI / 72 / 2603.16738
MedCL-Bench: Benchmarking stability-efficiency trade-offs and scaling in biomedical continual learning
MedCL-Bench: 生物医药持续学习中的稳定性与效率权衡和扩展性基准测试
Abstract
Medical language models must be updated as evidence and terminology evolve, yet sequential updating can trigger catastrophic forgetting. Although biomedical NLP has many static benchmarks, no unified, task-diverse benchmark exists for evaluating continual learning under standardized protocols, robustness to task order and compute-aware reporting. We introduce MedCL-Bench, which streams ten biomedical NLP datasets spanning five task families and evaluates eleven continual learning strategies across eight task orders, reporting retention, transfer, and GPU-hour cost. Across backbones and task orders, direct sequential fine-tuning on incoming tasks induces catastrophic forgetting, causing update-induced performance regressions on prior tasks. Continual learning methods occupy distinct retention-compute frontiers: parameter-isolation provides the best retention per GPU-hour, replay offers strong protection at higher cost, and regularization yields limited benefit. Forgetting is task-dependent, with multi-label topic classification most vulnerable and constrained-output tasks more robust. MedCL-Bench provides a reproducible framework for auditing model updates before deployment.
Chinese Translation
医学语言模型必须随着证据和术语的演变而更新,但顺序更新可能引发灾难性遗忘。尽管生物医学自然语言处理领域有许多静态基准,但在标准化协议、对任务顺序的鲁棒性以及计算成本感知的报告方面,尚无统一且任务多样化的基准来评估持续学习。我们引入了MedCL-Bench,它以流式方式提供跨越五个任务家族的十个生物医学自然语言处理数据集,并在八种任务顺序下评估十一种持续学习策略,报告保留率、迁移和GPU小时成本。在各种基础模型和任务顺序下,对新到任务直接进行顺序微调会引发灾难性遗忘,导致更新后在先前任务上的性能回退。持续学习方法占据不同的保留-计算前沿:参数隔离在每GPU小时上提供最佳保留,重放以较高的成本提供强保护,而正则化的收益有限。遗忘程度依赖于任务,多标签主题分类最为脆弱,而受约束输出的任务更具鲁棒性。MedCL-Bench提供了一个可复现的框架,用于在部署前审核模型更新。
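The retention bookkeeping such a benchmark reports can be sketched with a standard accuracy-matrix forgetting metric; the matrix values below are invented for illustration:

```python
# Sketch of continual-learning retention accounting: acc[i][j] is accuracy
# on task j measured after training on task i. Average forgetting is the
# mean drop from each task's best-ever accuracy to its final accuracy.
# All numbers are made up for illustration.

def average_forgetting(acc):
    """Mean over earlier tasks of (best accuracy seen - final accuracy)."""
    n = len(acc)
    drops = []
    for j in range(n - 1):              # the last task cannot be forgotten yet
        best = max(acc[i][j] for i in range(j, n))
        drops.append(best - acc[n - 1][j])
    return sum(drops) / len(drops)

acc = [
    [0.80, 0.00, 0.00],   # after task 0
    [0.55, 0.85, 0.00],   # after task 1: task 0 has degraded
    [0.40, 0.60, 0.90],   # after task 2: tasks 0 and 1 have degraded
]
print(average_forgetting(acc))
```

A strategy with good retention (e.g. parameter isolation, per the abstract) keeps the off-diagonal entries close to their diagonal values, driving this metric toward zero.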
cs.AI / 73 / 2603.16744
Nonstandard Errors in AI Agents
人工智能代理中的非标准误差
Abstract
We study whether state-of-the-art AI coding agents, given the same data and research question, produce the same empirical results. Deploying 150 autonomous Claude Code agents to independently test six hypotheses about market quality trends in NYSE TAQ data for SPY (2015--2024), we find that AI agents exhibit sizable \textit{nonstandard errors} (NSEs), that is, uncertainty from agent-to-agent variation in analytical choices, analogous to those documented among human researchers. AI agents diverge substantially on measure choice (e.g., autocorrelation vs.\ variance ratio, dollar vs.\ share volume). Different model families (Sonnet 4.6 vs.\ Opus 4.6) exhibit stable ``empirical styles,'' reflecting systematic differences in methodological preferences. In a three-stage feedback protocol, AI peer review (written critiques) has minimal effect on dispersion, whereas exposure to top-rated exemplar papers reduces the interquartile range of estimates by 80--99\% within \textit{converging} measure families. Convergence occurs both through within-family estimation tightening and through agents switching measure families entirely, but convergence reflects imitation rather than understanding. These findings have implications for the growing use of AI in automated policy evaluation and empirical research.
Chinese Translation
我们研究了最先进的人工智能编码代理在给定相同数据和研究问题的情况下是否会产生相同的实证结果。通过部署150个自主的Claude Code代理,独立检验关于NYSE TAQ数据中SPY(2015-2024)市场质量趋势的六个假设,我们发现人工智能代理表现出相当大的“非标准误差”(NSEs),即由代理之间分析选择的差异所产生的不确定性,这与人类研究者之间记录到的情况类似。人工智能代理在度量选择上存在显著分歧(例如,自相关与方差比、美元交易量与股数交易量)。不同的模型家族(Sonnet 4.6与Opus 4.6)展现出稳定的“实证风格”,反映了方法偏好的系统性差异。在一个三阶段反馈协议中,人工智能同行评审(书面批评)对分散程度的影响微乎其微,而接触评分最高的示范论文,使“收敛”度量家族内估计的四分位距减少了80-99%。收敛既通过家族内部估计的收紧实现,也通过代理完全切换度量家族实现,但这种收敛反映的是模仿而非理解。这些发现对人工智能在自动化政策评估和实证研究中的日益使用具有重要意义。
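The dispersion quantity behind these nonstandard-error findings, the interquartile range of agent estimates and its shrinkage after feedback, can be sketched as follows (all estimate values are invented for illustration):

```python
# Sketch of measuring agent-to-agent dispersion as an interquartile range
# (IQR), and the relative shrinkage after an intervention such as exemplar
# exposure. The estimates below are invented, not the paper's data.

def iqr(xs):
    s = sorted(xs)
    def quantile(q):
        pos = q * (len(s) - 1)          # linear interpolation between ranks
        lo, hi = int(pos), min(int(pos) + 1, len(s) - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])
    return quantile(0.75) - quantile(0.25)

before = [0.10, 0.40, 0.90, 1.30, 2.00]   # estimates from 5 agents
after  = [0.48, 0.50, 0.52, 0.53, 0.55]   # after exemplar exposure
reduction = 1 - iqr(after) / iqr(before)
print(f"IQR shrank by {reduction:.0%}")
```

The paper's 80-99% figures are exactly this kind of relative IQR reduction, computed within each converging measure family.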
cs.AI / 74 / 2603.16777
Anticipatory Planning for Multimodal AI Agents
多模态人工智能代理的预期规划
Abstract
Recent advances in multimodal agents have improved computer-use interaction and tool-usage, yet most existing systems remain reactive, optimizing actions in isolation without reasoning about future states or long-term goals. This limits planning coherence and prevents agents from reliably solving high-level, multi-step tasks. We introduce TraceR1, a two-stage reinforcement learning framework that explicitly trains anticipatory reasoning by forecasting short-horizon trajectories before execution. The first stage performs trajectory-level reinforcement learning with rewards that enforce global consistency across predicted action sequences. The second stage applies grounded reinforcement fine-tuning, using execution feedback from frozen tool agents to refine step-level accuracy and executability. TraceR1 is evaluated across seven benchmarks, covering online computer-use, offline computer-use benchmarks, and multimodal tool-use reasoning tasks, where it achieves substantial improvements in planning stability, execution robustness, and generalization over reactive and single-stage baselines. These results show that anticipatory trajectory reasoning is a key principle for building multimodal agents that can reason, plan, and act effectively in complex real-world environments.
Chinese Translation
近年来,多模态代理的进展改善了计算机使用交互和工具使用,然而大多数现有系统仍然是反应式的,孤立地优化行动而不考虑未来状态或长期目标。这限制了规划的一致性,并阻碍了代理可靠地解决高层次、多步骤任务的能力。我们提出了TraceR1,一个两阶段的强化学习框架,通过在执行前预测短期轨迹,明确训练预期推理。第一阶段进行轨迹级别的强化学习,奖励机制强制预测的行动序列之间的全局一致性。第二阶段应用基于反馈的强化微调,利用来自冻结工具代理的执行反馈来提高步骤级别的准确性和可执行性。TraceR1在七个基准测试中进行了评估,涵盖在线计算机使用、离线计算机使用基准和多模态工具使用推理任务,在规划稳定性、执行鲁棒性和泛化能力上相较于反应式和单阶段基线取得了显著改善。这些结果表明,预期轨迹推理是构建能够在复杂现实环境中有效推理、规划和行动的多模态代理的关键原则。
cs.AI / 75 / 2603.16815
Beyond Accuracy: Evaluating Forecasting Models by Multi-Echelon Inventory Cost
超越准确性:通过多层级库存成本评估预测模型
Abstract
This study develops a digitalized forecasting-inventory optimization pipeline integrating traditional forecasting models, machine learning regressors, and deep sequence models within a unified inventory simulation framework. Using the M5 Walmart dataset, we evaluate seven forecasting approaches and assess their operational impact under single- and two-echelon newsvendor systems. Results indicate that Temporal CNN and LSTM models significantly reduce inventory costs and improve fill rates compared to statistical baselines. Sensitivity and multi-echelon analyses demonstrate robustness and scalability, offering a data-driven decision-support tool for modern supply chains.
Chinese Translation
本研究开发了一种数字化预测-库存优化流程,将传统预测模型、机器学习回归模型和深度序列模型整合在统一的库存仿真框架内。利用M5 Walmart数据集,我们评估了七种预测方法,并在单层级和双层级报童(newsvendor)系统下评估其运营影响。结果表明,与统计基线相比,Temporal CNN和LSTM模型显著降低了库存成本并提高了需求满足率。敏感性分析和多层级分析展示了模型的稳健性和可扩展性,为现代供应链提供了一种基于数据的决策支持工具。
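In its simplest form, the single-echelon newsvendor layer that such a pipeline simulates reduces to ordering at the critical fractile cu/(cu+co) of forecast demand; a minimal sketch with an empirical-quantile order rule and a fill-rate metric (cost values and demand samples are illustrative):

```python
def newsvendor_quantity(demand_samples, underage_cost, overage_cost):
    """Order up to the critical fractile cu/(cu+co) of forecast demand,
    taken as an empirical quantile of the demand samples."""
    ratio = underage_cost / (underage_cost + overage_cost)
    xs = sorted(demand_samples)
    idx = min(int(ratio * len(xs)), len(xs) - 1)
    return xs[idx]

def fill_rate(demand, stocked):
    """Fraction of total demand served from stock across periods."""
    served = sum(min(d, s) for d, s in zip(demand, stocked))
    return served / sum(demand)
```

With underage cost 9 and overage cost 1, the rule orders at the 90th percentile of sampled demand, which is how better forecasts translate into lower cost and higher fill rate.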
cs.AI / 76 / 2603.16817
Is Conformal Factuality for RAG-based LLMs Robust? Novel Metrics and Systematic Insights
基于RAG的LLM的符合性真实性是否稳健?新颖的度量标准和系统性见解
Abstract
Large language models (LLMs) frequently hallucinate, limiting their reliability in knowledge-intensive applications. Retrieval-augmented generation (RAG) and conformal factuality have emerged as potential ways to address this limitation. While RAG aims to ground responses in retrieved evidence, it provides no statistical guarantee that the final output is correct. Conformal factuality filtering offers distribution-free statistical reliability by scoring and filtering atomic claims using a threshold calibrated on held-out data, however, the informativeness of the final output is not guaranteed. We systematically analyze the reliability and usefulness of conformal factuality for RAG-based LLMs across generation, scoring, calibration, robustness, and efficiency. We propose novel informativeness-aware metrics that better reflect task utility under conformal filtering. Across three benchmarks and multiple model families, we find that (i) conformal filtering suffers from low usefulness at high factuality levels due to vacuous outputs, (ii) conformal factuality guarantee is not robust to distribution shifts and distractors, highlighting the limitation that requires calibration data to closely match deployment conditions, and (iii) lightweight entailment-based verifiers match or outperform LLM-based model confidence scorers while requiring over $100\times$ fewer FLOPs. Overall, our results expose factuality-informativeness trade-offs and fragility of conformal filtering framework under distribution shifts and distractors, highlighting the need for new approaches for reliability with robustness and usefulness as key metrics, and provide actionable guidance for building RAG pipelines that are both reliable and computationally efficient.
Chinese Translation
大型语言模型(LLMs)经常出现幻觉,限制了它们在知识密集型应用中的可靠性。检索增强生成(RAG)和符合性真实性已成为解决这一限制的潜在方法。虽然RAG旨在使响应以检索到的证据为依据,但它并未提供最终输出正确性的统计保证。符合性真实性过滤通过使用在保留数据上校准的阈值对原子声明进行评分和过滤,提供了无分布的统计可靠性,然而,最终输出的信息量并无保证。我们系统地分析了符合性真实性在基于RAG的LLMs中的可靠性和实用性,涵盖生成、评分、校准、稳健性和效率。我们提出了新颖的信息量感知度量标准,更好地反映了在符合性过滤下的任务效用。在三个基准和多个模型家族中,我们发现:(i)由于空洞输出,符合性过滤在高真实性水平下的实用性较低;(ii)符合性真实性保证在分布变化和干扰项下并不稳健,突显了需要校准数据与部署条件紧密匹配的限制;(iii)基于蕴含的轻量级验证器在达到或超越基于LLM的模型置信度评分器的同时,所需的FLOPs减少了超过$100\times$。总体而言,我们的结果揭示了真实性与信息量之间的权衡以及符合性过滤框架在分布变化和干扰项下的脆弱性,强调了以稳健性和实用性为关键指标的新可靠性方法的必要性,并为构建既可靠又计算高效的RAG管道提供了可行的指导。
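The calibration-then-filter step described above can be sketched with the standard split-conformal quantile rule: pick the ceil((n+1)(1-α))-th smallest nonconformity score on held-out data, then keep only claims below that threshold (the claim scores here are illustrative):

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest
    nonconformity score from the held-out calibration set."""
    xs = sorted(cal_scores)
    n = len(xs)
    k = math.ceil((n + 1) * (1 - alpha))
    return xs[min(k, n) - 1]

def filter_claims(claims, scores, threshold):
    """Keep only atomic claims whose nonconformity score passes the bar."""
    return [c for c, s in zip(claims, scores) if s <= threshold]
```

The guarantee this rule provides is exchangeability-based, which is exactly why it is fragile under the distribution shifts and distractors the paper studies: a threshold calibrated on one score distribution says nothing about a shifted one.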
cs.AI / 77 / 2603.16822
Surg$\Sigma$: A Spectrum of Large-Scale Multimodal Data and Foundation Models for Surgical Intelligence
Surg$\Sigma$:用于外科智能的大规模多模态数据与基础模型的光谱
Abstract
Surgical intelligence has the potential to improve the safety and consistency of surgical care, yet most existing surgical AI frameworks remain task-specific and struggle to generalize across procedures and institutions. Although multimodal foundation models, particularly multimodal large language models, have demonstrated strong cross-task capabilities across various medical domains, their advancement in surgery remains constrained by the lack of large-scale, systematically curated multimodal data. To address this challenge, we introduce Surg$\Sigma$, a spectrum of large-scale multimodal data and foundation models for surgical intelligence. At the core of this framework lies Surg$\Sigma$-DB, a large-scale multimodal data foundation designed to support diverse surgical tasks. Surg$\Sigma$-DB consolidates heterogeneous surgical data sources (including open-source datasets, curated in-house clinical collections and web-source data) into a unified schema, aiming to improve label consistency and data standardization across heterogeneous datasets. Surg$\Sigma$-DB spans 6 clinical specialties and diverse surgical types, providing rich image- and video-level annotations across 18 practical surgical tasks covering understanding, reasoning, planning, and generation, at an unprecedented scale (over 5.98M conversations). Beyond conventional multimodal conversations, Surg$\Sigma$-DB incorporates hierarchical reasoning annotations, providing richer semantic cues to support deeper contextual understanding in complex surgical scenarios. We further provide empirical evidence through recently developed surgical foundation models built upon Surg$\Sigma$-DB, illustrating the practical benefits of large-scale multimodal annotations, unified semantic design, and structured reasoning annotations for improving cross-task generalization and interpretability.
Chinese Translation
外科智能有潜力提高外科护理的安全性和一致性,但现有的大多数外科人工智能框架仍然是任务特定的,难以在不同的手术和机构之间进行推广。尽管多模态基础模型,特别是多模态大型语言模型,在各个医学领域展示了强大的跨任务能力,但它们在外科领域的发展仍受到缺乏大规模、系统整理的多模态数据的限制。为了解决这一挑战,我们引入了Surg$\Sigma$,这是一个用于外科智能的大规模多模态数据与基础模型的光谱。该框架的核心是Surg$\Sigma$-DB,这是一个旨在支持多样化外科任务的大规模多模态数据基础。Surg$\Sigma$-DB将异构的外科数据源(包括开源数据集、内部整理的临床集合和网络数据)整合到一个统一的架构中,旨在提高标签一致性和异构数据集之间的数据标准化。Surg$\Sigma$-DB涵盖6个临床专业和多种外科类型,提供18个实际外科任务的丰富图像和视频级注释,涵盖理解、推理、规划和生成,规模前所未有(超过598万次对话)。除了传统的多模态对话,Surg$\Sigma$-DB还包含层次推理注释,为复杂外科场景中的深入上下文理解提供了更丰富的语义线索。我们进一步通过最近开发的基于Surg$\Sigma$-DB的外科基础模型提供实证证据,展示了大规模多模态注释、统一语义设计和结构化推理注释在提高跨任务推广能力和可解释性方面的实际好处。
cs.AI / 78 / 2603.16827
Prompt Programming for Cultural Bias and Alignment of Large Language Models
大型语言模型的文化偏见与对齐的提示编程
Abstract
Culture shapes reasoning, values, prioritization, and strategic decision-making, yet large language models (LLMs) often exhibit cultural biases that misalign with target populations. As LLMs are increasingly used for strategic decision-making, policy support, and document engineering tasks such as summarization, categorization, and compliance-oriented auditing, improving cultural alignment is important for ensuring that downstream analyses and recommendations reflect target-population value profiles rather than default model priors. Previous work introduced a survey-grounded cultural alignment framework and showed that culture-specific prompting can reduce misalignment, but it primarily evaluated proprietary models and relied on manual prompt engineering. In this paper, we validate and extend that framework by reproducing its social sciences survey based projection and distance metrics on open-weight LLMs, testing whether the same cultural skew and benefits of culture conditioning persist outside closed LLM systems. Building on this foundation, we introduce use of prompt programming with DSPy for this problem-treating prompts as modular, optimizable programs-to systematically tune cultural conditioning by optimizing against cultural-distance objectives. In our experiments, we show that prompt optimization often improves upon cultural prompt engineering, suggesting prompt compilation with DSPy can provide a more stable and transferable route to culturally aligned LLM responses.
Chinese Translation
文化塑造了推理、价值观、优先级和战略决策,然而大型语言模型(LLMs)往往表现出与目标群体不一致的文化偏见。随着LLMs越来越多地用于战略决策、政策支持和文档工程任务,如摘要、分类和合规审计,改善文化对齐对于确保下游分析和建议反映目标群体的价值特征而非默认模型先验变得尤为重要。之前的研究提出了一个基于调查的文化对齐框架,并显示文化特定的提示可以减少不对齐,但主要评估了专有模型并依赖于手动提示工程。在本文中,我们通过在开放权重的LLMs上重现其社会科学调查基础的投影和距离度量,验证并扩展该框架,测试相同的文化偏差和文化调节的益处是否在封闭的LLM系统之外依然存在。在此基础上,我们引入了使用DSPy进行提示编程的方法,将提示视为模块化、可优化的程序,以系统性地通过针对文化距离目标进行优化来调整文化调节。在我们的实验中,我们表明提示优化通常优于文化提示工程,表明使用DSPy进行提示编译可以提供更稳定和可转移的途径,以实现文化对齐的LLM响应。
cs.AI / 79 / 2603.16839
Learning to Present: Inverse Specification Rewards for Agentic Slide Generation
学习呈现:用于自主幻灯片生成的逆规范奖励
Abstract
Automated presentation generation remains a challenging task requiring coherent content creation, visual design, and audience-aware communication. This work proposes an OpenEnv-compatible reinforcement learning environment where LLM agents learn to research topics, plan content, and generate professional HTML slide presentations through tool use. We introduce a multi-component reward system combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics, and an inverse specification reward that measures how faithfully generated slides convey their intended purpose. The inverse specification reward, an "inverse task" where an LLM attempts to recover the original specification from generated slides, provides a holistic quality signal. Our approach fine-tunes Qwen2.5-Coder-7B via GRPO, training only 0.5% of parameters on prompts derived from expert demonstrations collected using Claude Opus 4.6. Experiments on 48 diverse business briefs across six models demonstrate that our fine-tuned 7B model achieves 91.2% of Claude Opus 4.6's quality while improving 33.1% over the base model. The six-model comparison reveals that instruction adherence and tool-use compliance, rather than raw parameter count, determine agentic task performance. We contribute SlideRL, an open-source dataset of 288 multi-turn rollout trajectories across all six models: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts Code: https://github.com/pushing-the-frontier/slide-forge-llm
Chinese Translation
自动化演示生成仍然是一项具有挑战性的任务,需要连贯的内容创作、视觉设计和关注受众的沟通。本研究提出了一个与OpenEnv兼容的强化学习环境,在该环境中,LLM(大语言模型)代理学习研究主题、规划内容并通过工具使用生成专业的HTML幻灯片演示。我们引入了一种多组件奖励系统,结合了结构验证、渲染质量评估、基于LLM的美学评分、内容质量指标以及一种逆规范奖励,该奖励衡量生成的幻灯片在多大程度上忠实地传达其预期目的。逆规范奖励是一种“逆任务”,其中LLM尝试从生成的幻灯片中恢复原始规范,提供了一个整体质量信号。我们的方法通过GRPO对Qwen2.5-Coder-7B进行微调,仅在使用Claude Opus 4.6收集的专家演示的提示上训练0.5%的参数。在六个模型上对48个多样化商业简报的实验表明,我们微调后的7B模型在质量上达到了Claude Opus 4.6的91.2%,并在基线模型上提高了33.1%。六个模型的比较揭示,指令遵循和工具使用合规性,而非原始参数数量,决定了自主任务的表现。我们贡献了SlideRL,一个包含288个多轮展开轨迹的开源数据集,涵盖所有六个模型: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts 代码: https://github.com/pushing-the-frontier/slide-forge-llm
cs.AI / 80 / 2603.16843
Internalizing Agency from Reflective Experience
从反思经验中内化代理能力
Abstract
Large language models are increasingly deployed as autonomous agents that must plan, act, and recover from mistakes through long-horizon interaction with environments that provide rich feedback. However, prevailing outcome-driven post-training methods (e.g., RL with verifiable rewards) primarily optimize final success signals, leaving rich environment feedback underutilized. Consequently, they often lead to distribution sharpening: the policy becomes better at reproducing a narrow set of already-successful behaviors, while failing to improve the feedback-grounded agency needed to expand problem-solving capacity (e.g., Pass@k) in long-horizon settings. To address this, we propose LEAFE (Learning Feedback-Grounded Agency from Reflective Experience), a framework that internalizes recovery agency from reflective experience. Specifically, during exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions. We then distill these experience-guided corrections into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions. Across a diverse set of interactive coding and agentic tasks under fixed interaction budgets, LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (GRPO) and experience-based methods such as Early Experience, with gains of up to 14% on Pass@128.
Chinese Translation
大型语言模型越来越多地被部署为自主代理,这些代理必须通过与提供丰富反馈的环境进行长期交互来规划、行动并从错误中恢复。然而,现有的以结果为导向的后训练方法(例如,具有可验证奖励的强化学习)主要优化最终成功信号,导致丰富的环境反馈未得到充分利用。因此,这些方法往往导致分布锐化:策略在再现一组狭窄的已成功行为方面变得更好,但未能改善在长时程设置中扩展问题解决能力(例如,Pass@k)所需的基于反馈的代理能力。为了解决这个问题,我们提出了LEAFE(从反思经验中学习基于反馈的代理能力),这是一个从反思经验中内化恢复代理能力的框架。具体而言,在探索过程中,代理将环境反馈总结为可操作的经验,回溯到早期决策点,并探索修订动作的替代路径。然后,我们通过监督微调将这些经验指导的修正提炼到模型中,使策略在未来的交互中能够更有效地恢复。在固定交互预算下的一系列多样化互动编码和代理任务中,LEAFE始终在Pass@1上优于基础模型,并在Pass@k上超过以结果为导向的基线(GRPO)和基于经验的方法(如早期经验),在Pass@128上提高了多达14%的表现。
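Pass@k, the capacity metric cited above, is usually computed with the standard unbiased estimator over n samples of which c succeed (the probability that at least one of k draws without replacement is a success):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k), i.e. one minus
    the probability that k draws from n samples all miss the c successes."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 rollouts of which 2 succeed, Pass@1 is 0.5; distribution sharpening raises Pass@1 while leaving Pass@k for large k flat, which is the gap LEAFE targets.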
cs.AI / 81 / 2603.16859
SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models
SocialOmni:全模态模型中音视频社交互动的基准测试
Abstract
Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios to test model robustness. We benchmarked 12 leading OLMs, which uncovers significant variance in their social-interaction capabilities across models. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. More encouragingly, these diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.
Chinese Translation
全模态大型语言模型(OLMs)通过原生整合音频、视觉和文本,重新定义了人机交互。然而,现有的OLM基准测试仍然局限于静态、以准确性为中心的任务,导致在评估社交互动这一基本能力方面存在重要空白,社交互动是指在自然对话中处理动态线索的能力。为此,我们提出了SocialOmni,这是一个全面的基准,旨在从三个核心维度评估这种对话互动:(i)说话者分离与识别(谁在说话),(ii)插话时机控制(何时插话),以及(iii)自然插话生成(如何措辞插话)。SocialOmni包含2000个感知样本和一个经过质量控制的诊断集,涵盖209个互动生成实例,并设定严格的时间和上下文约束,辅以控制的音视频不一致场景以测试模型的鲁棒性。我们对12个领先的OLM进行了基准测试,揭示了模型之间社交互动能力的显著差异。此外,我们的分析显示,模型的感知准确性与其生成上下文适当插话的能力之间存在明显的脱钩,表明仅依靠理解中心的指标不足以表征对话社交能力。更令人鼓舞的是,来自SocialOmni的这些诊断提供了可操作的信号,以弥合未来OLM中的感知与互动之间的鸿沟。
cs.CL / 1 / 2603.15653
Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context
递归语言模型与不确定性相遇:自反程序搜索在长上下文中的惊人有效性
Abstract
Long-context handling remains a core challenge for language models: even with extended context windows, models often fail to reliably extract, reason over, and use the information across long contexts. Recent works like Recursive Language Models (RLM) have approached this challenge by agentic way of decomposing long contexts into recursive sub-calls through programmatic interaction at inference. While promising, the success of RLM critically depends on how these context-interaction programs are selected, which has remained largely unexplored. In this paper, we study this problem and introduce SRLM, a framework that augments programmatic context interaction with uncertainty-aware Self-Reflection. SRLM leverages three intrinsic signals: self consistency, reasoning length, and verbalized confidence. These serve as complementary indicators of a model's internal uncertainty, and the model uses them to evaluate and compare candidate context-interaction programs. Extensive experiments across diverse benchmark datasets, context lengths, and backbone models, show that SRLM consistently outperforms state-of-the-art baselines, yielding up to 22% improvement over RLM under the same time budget. Our findings show that recursion itself is not the primary driver of performance in RLM, and a simple self-reflective program search can match or surpass RLM without requiring self-query or explicit recursion mechanisms. We find that for context lengths within the model's window, RLMs with recursion often degrade performance relative to the base model, whereas SRLM yields consistent gains across both short and long contexts. We also find that RLM is less effective in tasks with semantically intensive nature, where heuristic program search is insufficient and broader contextual understanding is required, while self-reflection in SRLM provides a semantic signal that better steers reasoning in these scenarios.
Chinese Translation
长上下文处理仍然是语言模型面临的核心挑战:即使在扩展的上下文窗口下,模型往往无法可靠地提取、推理和利用长上下文中的信息。近期的研究如递归语言模型(Recursive Language Models, RLM)通过在推理过程中以程序化的方式将长上下文分解为递归子调用来应对这一挑战。尽管前景可期,RLM的成功在很大程度上依赖于这些上下文交互程序的选择,而这一点尚未得到充分探索。本文研究了这一问题,并引入了SRLM,一个通过不确定性感知的自反机制增强程序化上下文交互的框架。SRLM利用三种内在信号:自我一致性、推理长度和口头表达的信心。这些信号作为模型内部不确定性的补充指标,模型利用它们来评估和比较候选的上下文交互程序。在多样化的基准数据集、上下文长度和基础模型上的广泛实验表明,SRLM在相同时间预算下始终优于最先进的基线,相较于RLM提高了多达22%的性能。我们的研究结果表明,递归本身并不是RLM性能的主要驱动因素,而简单的自反程序搜索可以在不需要自查询或显式递归机制的情况下匹配或超越RLM。我们发现,对于模型窗口内的上下文长度,具有递归的RLM相较于基础模型往往会降低性能,而SRLM在短上下文和长上下文中均能提供一致的性能提升。我们还发现,RLM在语义密集型任务中的效果较差,在这些任务中,启发式程序搜索不足以满足需求,而更广泛的上下文理解是必需的,而SRLM中的自反机制提供了一种语义信号,更好地引导了这些场景中的推理。
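How the three intrinsic signals might be combined to compare candidate context-interaction programs can be sketched as below; the weights, length normalizer, and function names are illustrative assumptions, not SRLM's actual scoring rule:

```python
from collections import Counter

def self_consistency(answers):
    """Fraction of sampled answers that agree with the modal answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

def score_candidate(answers, reasoning_tokens, verbalized_conf,
                    w=(0.5, 0.2, 0.3), max_tokens=4096):
    """Blend three uncertainty signals into one scalar score: agreement
    across samples, brevity of reasoning, and stated confidence."""
    consistency = self_consistency(answers)
    brevity = 1.0 - min(reasoning_tokens, max_tokens) / max_tokens
    return w[0] * consistency + w[1] * brevity + w[2] * verbalized_conf
```

Candidate programs would then be ranked by `score_candidate`, with the highest-scoring program's answer returned; the key idea is that no external verifier is needed, only the model's own uncertainty.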
cs.CL / 2 / 2603.15677
MedArena: Comparing LLMs for Medicine-in-the-Wild Clinician Preferences
MedArena:比较医学领域临床医生偏好的大型语言模型
Abstract
Large language models (LLMs) are increasingly central to clinician workflows, spanning clinical decision support, medical education, and patient communication. However, current evaluation methods for medical LLMs rely heavily on static, templated benchmarks that fail to capture the complexity and dynamics of real-world clinical practice, creating a dissonance between benchmark performance and clinical utility. To address these limitations, we present MedArena, an interactive evaluation platform that enables clinicians to directly test and compare leading LLMs using their own medical queries. Given a clinician-provided query, MedArena presents responses from two randomly selected models and asks the user to select the preferred response. Out of 1571 preferences collected across 12 LLMs up to November 1, 2025, Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o were the top three models by Bradley-Terry rating. Only one-third of clinician-submitted questions resembled factual recall tasks (e.g., MedQA), whereas the majority addressed topics such as treatment selection, clinical documentation, or patient communication, with ~20% involving multi-turn conversations. Additionally, clinicians cited depth and detail and clarity of presentation more often than raw factual accuracy when explaining their preferences, highlighting the importance of readability and clinical nuance. We also confirm that the model rankings remain stable even after controlling for style-related factors like response length and formatting. By grounding evaluation in real-world clinical questions and preferences, MedArena offers a scalable platform for measuring and improving the utility and efficacy of medical LLMs.
Chinese Translation
大型语言模型(LLMs)在临床医生的工作流程中越来越重要,涵盖了临床决策支持、医学教育和患者沟通等方面。然而,目前对医学LLMs的评估方法过于依赖静态的、模板化的基准测试,这无法捕捉到真实临床实践的复杂性和动态性,导致基准表现与临床实用性之间存在不一致。为了解决这些局限性,我们提出了MedArena,一个互动评估平台,允许临床医生使用自己的医学查询直接测试和比较领先的LLMs。在临床医生提供的查询下,MedArena展示来自两个随机选择模型的响应,并要求用户选择更优的响应。在截至2025年11月1日收集的1571个偏好中,Gemini 2.0 Flash Thinking、Gemini 2.5 Pro和GPT-4o是根据Bradley-Terry评分排名的前三个模型。只有三分之一的临床医生提交的问题类似于事实回忆任务(例如,MedQA),而大多数问题涉及治疗选择、临床文档或患者沟通等主题,其中约20%涉及多轮对话。此外,临床医生在解释其偏好时更常提到深度和细节以及表达的清晰度,而非单纯的事实准确性,强调了可读性和临床细微差别的重要性。我们还确认,即使在控制了响应长度和格式等与风格相关的因素后,模型排名仍然保持稳定。通过将评估基于真实的临床问题和偏好,MedArena提供了一个可扩展的平台,用于衡量和提高医学LLMs的实用性和有效性。
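The Bradley-Terry ratings behind the leaderboard can be fit from pairwise preference counts with the standard minorization-maximization update; a self-contained sketch (the win counts are illustrative, not MedArena's data):

```python
def bradley_terry(wins, n_models, iters=200):
    """Fit Bradley-Terry strengths from pairwise win counts via the MM
    update; wins[i][j] is the number of times model i beat model j."""
    p = [1.0] * n_models
    for _ in range(iters):
        new = []
        for i in range(n_models):
            w_i = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_models) if j != i)
            new.append(w_i / denom if denom > 0 else p[i])
        s = sum(new)
        p = [x * n_models / s for x in new]  # rescale: strengths are only
    return p                                 # identified up to a constant
```

Under this model the probability that model i is preferred over j is p[i] / (p[i] + p[j]), so a 3:1 head-to-head record yields a 3:1 strength ratio.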
cs.CL / 3 / 2603.15726
MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification
MiroThinker-1.7与H1:通过验证迈向重型研究代理
MiroMind Team, Bai, S., Bing, L., Lei, L., Li, R., Li, X., Lin, X., Min, E., Su, L., Wang, B., Wang, L., Wang, L., Wang, S., Wang, X., Zhang, Y., Zhang, Z., Chen, G., Chen, L., Cheng, Z., Deng, Y., Huang, Z., Ng, D., Ni, J., Ren, Q., Tang, X., Wang, B. L., Wang, H., Wang, N., Wei, C., Wu, Q., Xia, J., Xiao, Y., Xu, H., Xu, X., Xue, C., Yang, Z., Yang, Z., Ye, F., Ye, H., Yu, J., Zhang, C., Zhang, W., Zhao, H., Zhu, P.
Abstract
We present MiroThinker-1.7, a new research agent designed for complex long-horizon reasoning tasks. Building on this foundation, we further introduce MiroThinker-H1, which extends the agent with heavy-duty reasoning capabilities for more reliable multi-step problem solving. In particular, MiroThinker-1.7 improves the reliability of each interaction step through an agentic mid-training stage that emphasizes structured planning, contextual reasoning, and tool interaction. This enables more effective multi-step interaction and sustained reasoning across complex tasks. MiroThinker-H1 further incorporates verification directly into the reasoning process at both local and global levels. Intermediate reasoning decisions can be evaluated and refined during inference, while the overall reasoning trajectory is audited to ensure that final answers are supported by coherent chains of evidence. Across benchmarks covering open-web research, scientific reasoning, and financial analysis, MiroThinker-H1 achieves state-of-the-art performance on deep research tasks while maintaining strong results on specialized domains. We also release MiroThinker-1.7 and MiroThinker-1.7-mini as open-source models, providing competitive research-agent capabilities with significantly improved efficiency.
Chinese Translation
我们提出了MiroThinker-1.7,这是一种为复杂的长时间推理任务设计的新型研究代理。在此基础上,我们进一步介绍了MiroThinker-H1,该代理扩展了重型推理能力,以实现更可靠的多步骤问题解决。特别是,MiroThinker-1.7通过一个强调结构化规划、上下文推理和工具交互的代理中期训练阶段,提高了每个交互步骤的可靠性。这使得在复杂任务中能够更有效地进行多步骤交互和持续推理。MiroThinker-H1进一步将验证直接纳入推理过程,涵盖局部和全局层面。在推理过程中,能够对中间推理决策进行评估和优化,同时对整体推理轨迹进行审计,以确保最终答案由连贯的证据链支持。在涵盖开放网络研究、科学推理和金融分析的基准测试中,MiroThinker-H1在深度研究任务上实现了最先进的性能,同时在专业领域保持了强劲的结果。我们还将MiroThinker-1.7和MiroThinker-1.7-mini作为开源模型发布,提供具有显著提高效率的竞争性研究代理能力。
cs.CL / 4 / 2603.15773
Morphemes Without Borders: Evaluating Root-Pattern Morphology in Arabic Tokenizers and LLMs
无国界的语素:评估阿拉伯语分词器和大型语言模型中的根-模式形态学
Abstract
This work investigates how effectively large language models (LLMs) and their tokenization schemes represent and generate Arabic root-pattern morphology, probing whether they capture genuine morphological structure or rely on surface memorization. Arabic morphological system provides a rich testbed for analyzing how LLMs handle complex, non-concatenative forms and how tokenization choices influence this process. Our study begins with an evaluation of morphological fidelity across Arabic and multilingual tokenizers against gold-standard segmentation, followed by an analysis of LLM performance in productive root-pattern generation using a newly developed test set. Our findings across seven Arabic-centric and multilingual LLMs and their respective tokenizers reveal that tokenizer morphological alignment is not necessary nor sufficient for morphological generation, which questions the role of morphological tokenization in downstream performance.
Chinese Translation
本研究探讨了大型语言模型(LLMs)及其分词方案在表示和生成阿拉伯语根-模式形态学方面的有效性,考察它们是否捕捉到真实的形态结构,或仅依赖于表面的记忆。阿拉伯语的形态系统为分析LLMs如何处理复杂的非串联(non-concatenative)形式以及分词选择如何影响这一过程提供了丰富的测试平台。我们的研究首先对照黄金标准分割评估了阿拉伯语和多语言分词器的形态忠实度,随后利用一套新开发的测试集分析了LLMs在能产性根-模式生成方面的表现。我们在七个以阿拉伯语为中心的和多语言的LLMs及其各自的分词器上的发现表明,分词器的形态对齐对形态生成而言既非必要也非充分,这对形态分词在下游表现中的作用提出了质疑。
cs.CL / 5 / 2603.15897
COGNAC at SemEval-2026 Task 5: LLM Ensembles for Human-Level Word Sense Plausibility Rating in Challenging Narratives
COGNAC在SemEval-2026任务5中的表现:用于挑战性叙事中人类水平词义合理性评分的 LLM 集成
Abstract
We describe our system for SemEval-2026 Task 5, which requires rating the plausibility of given word senses of homonyms in short stories on a 5-point Likert scale. Systems are evaluated by the unweighted average of accuracy (within one standard deviation of mean human judgments) and Spearman Rank Correlation. We explore three prompting strategies using multiple closed-source commercial LLMs: (i) a baseline zero-shot setup, (ii) Chain-of-Thought (CoT) style prompting with structured reasoning, and (iii) a comparative prompting strategy for evaluating candidate word senses simultaneously. Furthermore, to account for the substantial inter-annotator variation present in the gold labels, we propose an ensemble setup by averaging model predictions. Our best official system, comprising an ensemble of LLMs across all three prompting strategies, placed 4th on the competition leaderboard with 0.88 accuracy and 0.83 Spearman's rho (0.86 average). Post-competition experiments with additional models further improved this performance to 0.92 accuracy and 0.85 Spearman's rho (0.89 average). We find that comparative prompting consistently improved performance across model families, and model ensembling significantly enhanced alignment with mean human judgments, suggesting that LLM ensembles are especially well suited for subjective semantic evaluation tasks involving multiple annotators.
Chinese Translation
我们描述了我们用于SemEval-2026任务5的系统,该任务要求对短篇故事中同形异义词的给定词义进行5点李克特量表的合理性评分。系统通过准确率(在平均人类判断的一个标准差内)与斯皮尔曼等级相关性的无权平均进行评估。我们探索了三种使用多个闭源商业 LLM 的提示策略:(i)基线零样本设置,(ii)带有结构化推理的链式思维(Chain-of-Thought, CoT)风格提示,以及(iii)同时评估候选词义的比较提示策略。此外,为了考虑金标准标签中存在的显著的标注者间变异,我们提出了一种通过平均模型预测的集成设置。我们最佳的官方系统,由三种提示策略的 LLM 集成组成,在比赛排行榜上名列第4,准确率为0.88,斯皮尔曼相关系数为0.83(平均值为0.86)。赛后引入额外模型的实验进一步提高了这一性能,准确率达到0.92,斯皮尔曼相关系数为0.85(平均值为0.89)。我们发现比较提示在不同模型家族中始终提高了性能,而模型集成显著增强了与平均人类判断的一致性,这表明 LLM 集成特别适合涉及多个标注者的主观语义评估任务。
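The ensemble step (averaging per-item ratings across models) and the Spearman's rho used for evaluation can be sketched without external libraries; the rating values below are illustrative:

```python
def rank(xs):
    """1-based average ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    ra, rb = rank(a), rank(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

def ensemble(predictions):
    """Average per-item Likert ratings across models (the ensembling step)."""
    return [sum(col) / len(col) for col in zip(*predictions)]
```

Averaging smooths out individual models' idiosyncratic ratings, which is why it tracks the mean of multiple human annotators better than any single model.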
cs.CL / 6 / 2603.15903
Agent-based imitation dynamics can yield efficiently compressed population-level vocabularies
基于代理的模仿动态能够有效压缩群体层面的词汇
Abstract
Natural languages have been argued to evolve under pressure to efficiently compress meanings into words by optimizing the Information Bottleneck (IB) complexity-accuracy tradeoff. However, the underlying social dynamics that could drive the optimization of a language's vocabulary towards efficiency remain largely unknown. In parallel, evolutionary game theory has been invoked to explain the emergence of language from rudimentary agent-level dynamics, but it has not yet been tested whether such an approach can lead to efficient compression in the IB sense. Here, we provide a unified model integrating evolutionary game theory with the IB framework and show how near-optimal compression can arise in a population through an independently motivated dynamic of imprecise strategy imitation in signaling games. We find that key parameters of the model -- namely, those that regulate precision in these games, as well as players' tendency to confuse similar states -- lead to constrained variation of the tradeoffs achieved by emergent vocabularies. Our results suggest that evolutionary game dynamics could potentially provide a mechanistic basis for the evolution of vocabularies with information-theoretically optimal and empirically attested properties.
Chinese Translation
自然语言被认为是在高效地将意义压缩为词汇的压力下演化的,通过优化信息瓶颈(Information Bottleneck, IB)复杂性与准确性之间的权衡。然而,驱动语言词汇向高效性优化的潜在社会动态在很大程度上仍属未知。与此同时,进化博弈理论被用来解释语言如何从初步的代理层面动态中出现,但尚未测试这种方法是否能够在信息瓶颈的意义上导致有效的压缩。在此,我们提供了一个统一模型,将进化博弈理论与信息瓶颈框架结合,并展示了如何通过信号博弈中一种具有独立动机的不精确策略模仿动态,在一个群体中产生近乎最优的压缩。我们发现模型的关键参数——即调节这些博弈中精确度的参数,以及玩家混淆相似状态的倾向——导致了新兴词汇所实现的权衡的受限变化。我们的结果表明,进化博弈动态可能为具有信息论上最优和经验上验证特性的词汇演化提供了机制基础。
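A generic agent-level sketch of an imprecise-imitation dynamic is below; the payoff interface, noise parameter, and function names are illustrative assumptions, and the IB objective itself is omitted:

```python
import random

def imitation_step(strategies, payoff, noise=0.05, rng=None):
    """One round of pairwise imitation: each agent samples a role model
    and copies its strategy when it earns a strictly higher payoff; with
    probability `noise` the copy is imprecise and a random strategy from
    the population is adopted instead."""
    rng = rng or random.Random(0)
    new = list(strategies)
    for i in range(len(strategies)):
        j = rng.randrange(len(strategies))
        if payoff(strategies[j]) > payoff(strategies[i]):
            new[i] = strategies[j]
        if rng.random() < noise:
            new[i] = rng.choice(strategies)  # imprecise imitation
    return new
```

With zero noise each step can only raise the population's mean payoff; the paper's point is that with imprecision calibrated by the game's parameters, the stationary vocabularies land near the IB complexity-accuracy frontier.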
cs.CL / 7 / 2603.15936
CTG-DB: An Ontology-Based Transformation of ClinicalTrials.gov to Enable Cross-Trial Drug Safety Analyses
CTG-DB:基于本体的临床试验数据库转化,以支持跨试验药物安全性分析
Abstract
ClinicalTrials.gov (CT.gov) is the largest publicly accessible registry of clinical studies, yet its registry-oriented architecture and heterogeneous adverse event (AE) terminology limit systematic pharmacovigilance (PV) analytics. AEs are typically recorded as investigator-reported text rather than standardized identifiers, requiring manual reconciliation to identify coherent safety concepts. We present the ClinicalTrials.gov Transformation Database (CTG-DB), an open-source pipeline that ingests the complete CT.gov XML archive and produces a relational database aligned to standardized AE terminology using the Medical Dictionary for Regulatory Activities (MedDRA). CTG-DB preserves arm-level denominators, represents placebo and comparator arms, and normalizes AE terminology using deterministic exact and fuzzy matching to ensure transparent and reproducible mappings. This framework enables concept-level retrieval and cross-trial aggregation for scalable placebo-referenced safety analyses and integration of clinical trial evidence into downstream PV signal detection.
Chinese Translation
ClinicalTrials.gov (CT.gov) 是最大的公开可访问的临床研究注册库,但其以注册为导向的架构和异构的不良事件(AE)术语限制了系统的药物警戒(PV)分析。通常,不良事件以研究者报告的文本形式记录,而非标准化标识符,这需要手动对照以识别一致的安全概念。我们提出了临床试验数据库转化(CTG-DB),这是一个开源管道,能够处理完整的 CT.gov XML 存档,并生成一个与标准化不良事件术语对齐的关系数据库,使用监管活动医学词典(MedDRA)。CTG-DB 保留了臂级分母,表示安慰剂和对照臂,并使用确定性精确匹配和模糊匹配来规范化不良事件术语,以确保透明和可重复的映射。该框架支持概念级检索和跨试验聚合,便于进行可扩展的安慰剂参考安全性分析,并将临床试验证据整合到下游的药物警戒信号检测中。
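The deterministic exact-then-fuzzy mapping of investigator-reported AE terms to a standardized vocabulary can be sketched with the standard library; the cutoff value and the MedDRA terms shown are illustrative, not CTG-DB's actual configuration:

```python
import difflib

def normalize(term):
    """Lowercase and collapse whitespace before matching."""
    return " ".join(term.lower().split())

def map_to_meddra(reported_term, meddra_terms, cutoff=0.85):
    """Deterministic two-stage mapping: exact match on normalized text
    first, then the best fuzzy match above a similarity cutoff;
    returns None when neither stage applies."""
    term = normalize(reported_term)
    vocab = {normalize(t): t for t in meddra_terms}
    if term in vocab:
        return vocab[term]
    hits = difflib.get_close_matches(term, vocab.keys(), n=1, cutoff=cutoff)
    return vocab[hits[0]] if hits else None
```

Because both stages are deterministic (no learned matcher), the same input always yields the same mapping, which is what makes the pipeline's mappings transparent and reproducible.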
cs.CL / 8 / 2603.15949
BANGLASOCIALBENCH: A Benchmark for Evaluating Sociopragmatic and Cultural Alignment of LLMs in Bangladeshi Social Interaction
BANGLASOCIALBENCH:评估孟加拉国社会互动中大型语言模型的社会语用和文化对齐的基准
Abstract
Large Language Models have demonstrated strong multilingual fluency, yet fluency alone does not guarantee socially appropriate language use. In high-context languages, communicative competence requires sensitivity to social hierarchy, relational roles, and interactional norms that are encoded directly in everyday language. Bangla exemplifies this challenge through its three-tiered pronominal system, kinship-based addressing, and culturally embedded social customs. We introduce BANGLASOCIALBENCH, the first benchmark designed to evaluate sociopragmatic competence in Bangla through context-dependent language use rather than factual recall. The benchmark spans three domains: Bangla Address Terms, Kinship Reasoning, and Social Customs, and consists of 1,719 culturally grounded instances written and verified by native Bangla speakers. We evaluate twelve contemporary LLMs in a zero-shot setting and observe systematic patterns of cultural misalignment. Models frequently default to overly formal address forms, fail to recognize multiple socially acceptable address pronouns, and conflate kinship terminology across religious contexts. Our findings show that sociopragmatic failures are often structured and non-random, revealing persistent limitations in how current LLMs infer and apply culturally appropriate language use in realistic Bangladeshi social interactions.
Chinese Translation
大型语言模型展示了强大的多语言流利性,但流利性并不保证语言使用的社会适宜性。在高语境语言中,交际能力需要对社会等级、关系角色和互动规范的敏感性,这些都直接体现在日常语言中。孟加拉语通过其三层代词系统、基于亲属关系的称呼方式以及文化嵌入的社会习俗,充分体现了这一挑战。我们介绍了BANGLASOCIALBENCH,这是第一个旨在通过上下文依赖的语言使用而非事实回忆来评估孟加拉语社会语用能力的基准。该基准涵盖三个领域:孟加拉语称谓、亲属推理和社会习俗,包含1,719个由孟加拉语母语者撰写和验证的文化实例。我们在零样本设置下评估了十二个当代大型语言模型,并观察到系统性的文化不对齐模式。模型经常默认使用过于正式的称呼形式,未能识别多个社会可接受的称呼代词,并在宗教背景下混淆亲属术语。我们的研究结果表明,社会语用失败往往是结构化的而非随机的,揭示了当前大型语言模型在推断和应用文化适宜的语言使用方面在现实孟加拉国社会互动中存在的持续局限性。
cs.CL / 9 / 2603.15950
POLAR: A Per-User Association Test in Embedding Space
POLAR:嵌入空间中的用户级关联测试
Abstract
Most intrinsic association probes operate at the word, sentence, or corpus level, obscuring author-level variation. We present POLAR (Per-user On-axis Lexical Association Report), a per-user lexical association test that runs in the embedding space of a lightly adapted masked language model. Authors are represented by private deterministic tokens; POLAR projects these vectors onto curated lexical axes and reports standardized effects with permutation p-values and Benjamini--Hochberg control. On a balanced bot--human Twitter benchmark, POLAR cleanly separates LLM-driven bots from organic accounts; on an extremist forum, it quantifies strong alignment with slur lexicons and reveals rightward drift over time. The method is modular to new attribute sets and provides concise, per-author diagnostics for computational social science. All code is publicly available at https://github.com/pedroaugtb/POLAR-A-Per-User-Association-Test-in-Embedding-Space.
Chinese Translation
大多数内在关联探测器在单词、句子或语料库层面上操作,掩盖了作者级别的变异性。我们提出了POLAR(Per-user On-axis Lexical Association Report),这是一种在轻度调整的掩蔽语言模型的嵌入空间中运行的用户级词汇关联测试。作者通过私有的确定性标记表示;POLAR将这些向量投影到策划的词汇轴上,并报告标准化效应,使用置换p值和Benjamini-Hochberg控制。在一个平衡的机器人-人类Twitter基准测试中,POLAR能够清晰地区分由大型语言模型驱动的机器人和有机账户;在一个极端主义论坛上,它量化了与侮辱性词汇表的强对齐,并揭示了随时间推移的右倾漂移。该方法对新的属性集具有模块化特性,并为计算社会科学提供简明的逐作者诊断。所有代码均可在https://github.com/pedroaugtb/POLAR-A-Per-User-Association-Test-in-Embedding-Space上公开获取。
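The statistical machinery POLAR describes (standardized on-axis effects, permutation p-values, Benjamini--Hochberg control) can be sketched generically. The sign-flip null, the axis normalization, and the function signatures below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def axis_effect(author_vecs, axis, n_perm=1000, rng=None):
    """Standardized projection of author vectors onto a lexical axis,
    with a permutation p-value from a sign-flip null distribution."""
    rng = np.random.default_rng(rng)
    axis = axis / np.linalg.norm(axis)
    proj = author_vecs @ axis                        # per-author scores
    obs = proj.mean() / (proj.std(ddof=1) + 1e-12)   # standardized effect
    null = np.array([
        (proj * rng.choice([-1, 1], size=len(proj))).mean()
        / (proj.std(ddof=1) + 1e-12)
        for _ in range(n_perm)
    ])
    p = (1 + np.sum(np.abs(null) >= abs(obs))) / (1 + n_perm)
    return obs, p

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of rejections under Benjamini--Hochberg FDR control."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    m = len(p)
    thresh = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresh
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject
```

Running many axes per author and correcting the resulting p-values with `benjamini_hochberg` is the per-author multiple-testing pattern the abstract implies.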
cs.CL / 10 / 2603.15953
A Family of LLMs Liberated from Static Vocabularies
一种摆脱静态词汇的语言模型家族
Alpha, Aleph, :, Abdessaied, Adnen, Baranowski, Artur, Balles, Lukas, Barlow, Michael, Benureau, Fabien C. Y., Berkenkamp, Felix, Bluebaum, Lukas, Boll, Bastian, Burns, Thomas F., Deiseroth, Björn, Eichenberg, Constantin, Friede, David, Guerrero, Pablo Iyu, Hammam, Ahmed, Harren, Bastian, Higl, Johann, Jadidi, Yasser, Kauf, Carina, Messner, Johannes, Metzen, Jan Hendrik, Meuer, Max, Nanda, Vedant, Neitemeier, Pit, Oostermeijer, Koen, Parcalabescu, Letitia, Pernpointner, Markus, Reinfurt, Felix, Rodriquez, Dylan, Schott, Grégory, Siedler, Philipp, Simonovsky, Martin, Speicher, Till, Stampa, Volker, Wäldchen, Stephan, Weinbach, Samuel, Ziegltrum, Gregor
Abstract
Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are widely adopted, they exhibit notable limitations, including their large, fixed vocabulary sizes and poor adaptability to new domains or languages. We present a family of models with up to 70 billion parameters based on the hierarchical autoregressive transformer (HAT) architecture. In HAT, an encoder transformer aggregates bytes into word embeddings and then feeds them to the backbone, a classical autoregressive transformer. The outputs of the backbone are then cross-attended by the decoder and converted back into bytes. We show that we can reuse available pre-trained models by converting the Llama 3.1 8B and 70B models into the HAT architecture: Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT are byte-level models whose encoder and decoder are trained from scratch, but where we adapt the pre-trained Llama backbone, i.e., the transformer blocks with the embedding matrix and head removed, to handle word embeddings instead of the original tokens. We also provide a 7B HAT model, Llama-TFree-HAT-Pretrained, trained entirely from scratch on nearly 4 trillion words. The HAT architecture improves text compression by reducing the number of required sequence positions and enhances robustness to intra-word variations, e.g., spelling differences. Through pre-training, as well as subsequent supervised fine-tuning and direct preference optimization in English and German, we show strong proficiency in both languages, improving on the original Llama 3.1 in most benchmarks. We release our models (including 200 pre-training checkpoints) on Hugging Face.
Chinese Translation
分词是当前大型语言模型(LLMs)中自然语言处理的核心组成部分,使模型能够将原始文本转换为可处理的单元。尽管学习型分词器被广泛采用,但它们存在显著的局限性,包括大且固定的词汇量以及对新领域或语言的适应能力差。我们提出了一种基于层次自回归变换器(HAT)架构的模型家族,参数量高达700亿。在HAT中,编码器变换器将字节聚合为词嵌入,然后将其输入到主干,即经典的自回归变换器。主干的输出随后被解码器交叉关注并转换回字节。我们展示了如何通过将Llama 3.1的8B和70B模型转换为HAT架构来重用现有的预训练模型:Llama-3.1-8B-TFree-HAT和Llama-3.1-70B-TFree-HAT是字节级模型,其编码器和解码器从头开始训练,但我们调整了预训练的Llama主干,即去除嵌入矩阵和头的变换器块,以处理词嵌入而不是原始标记。我们还提供了一个7B的HAT模型Llama-TFree-HAT-Pretrained,该模型完全从头开始训练,使用了近4万亿个单词。HAT架构通过减少所需序列位置的数量来改善文本压缩,并增强对词内变异(例如拼写差异)的鲁棒性。通过预训练,以及随后在英语和德语中的监督微调和直接偏好优化,我们在这两种语言中表现出强大的能力,在大多数基准测试中超越了原始的Llama 3.1。我们在Hugging Face上发布了我们的模型(包括200个预训练检查点)。
cs.CL / 11 / 2603.15965
MoLoRA: Composable Specialization via Per-Token Adapter Routing
MoLoRA:通过每个令牌适配器路由实现可组合特化
Abstract
Multi-adapter serving systems route entire sequences to a single adapter, forcing a choice when requests span multiple domains. This assumption fails in two important settings: (1) multimodal generation, where text and image tokens require different adapters within the same sequence, and (2) mixed-capability requests like "write code to solve this equation," which need expertise from multiple specialized adapters. We introduce per-token routing, which routes individual tokens to adapters based on either vocabulary structure (for multimodal models) or learned gating (for semantic specialization). Per-token routing is provably optimal, achieving work N for N tokens versus K·N for per-sequence routing with K adapter types. Our key contribution is MoLoRA (Mixture of LoRA), which enables composable specialization: load multiple domain-specific adapters and let a learned router select the appropriate adapter per-token. We demonstrate that specialization dramatically beats scale: MoLoRA enables Qwen3-1.7B to exceed Qwen3-8B across four reasoning benchmarks while being 4.7x smaller. This enables modular expertise at inference time: train focused LoRAs independently, combine them without retraining, and add new capabilities by simply loading new adapters.
Chinese Translation
多适配器服务系统将整个序列路由到单个适配器,这在请求跨越多个领域时迫使做出选择。这一假设在两个重要场景中失效:(1)多模态生成,其中文本和图像令牌在同一序列中需要不同的适配器,以及(2)混合能力请求,例如“编写代码以解决此方程”,这需要多个专门适配器的专业知识。我们提出了每个令牌路由,它根据词汇结构(用于多模态模型)或学习的门控(用于语义特化)将单个令牌路由到适配器。每个令牌路由在理论上是最优的,对于 N 个令牌的工作量为 N,而对于 K 种适配器类型的每序列路由则为 K·N。我们的关键贡献是 MoLoRA(LoRA的混合),它实现了可组合特化:加载多个领域特定的适配器,并让学习的路由器为每个令牌选择适当的适配器。我们展示了特化显著优于规模:MoLoRA 使得 Qwen3-1.7B 在四个推理基准上超越 Qwen3-8B,同时体积小 4.7 倍。这使得在推理时实现模块化专业知识成为可能:独立训练专注的 LoRA,组合它们而无需重新训练,并通过简单加载新适配器来增加新能力。
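The per-token routing idea reduces adapter work from K·N (running all K adapters over every token) to N (one adapter per token). A minimal sketch with hard argmax gating over toy low-rank adapters; the gate, the adapter shapes, and the hard routing rule are illustrative assumptions rather than MoLoRA's actual design:

```python
import numpy as np

def route_tokens(hidden, gate_W, adapters):
    """Route each token to one LoRA adapter via a learned linear gate,
    so total adapter work is O(N) rather than O(K*N) for K adapters."""
    logits = hidden @ gate_W               # (N, K) per-token gating scores
    choice = logits.argmax(axis=-1)        # hard routing: one adapter/token
    out = hidden.copy()
    for k, (A, B) in enumerate(adapters):  # LoRA update: h + (h A^T) B^T
        mask = choice == k
        if mask.any():
            out[mask] += (hidden[mask] @ A.T) @ B.T
    return out, choice
```

For the multimodal case the abstract mentions, `choice` could instead be derived from vocabulary structure (e.g. text vs. image token IDs) with no learned gate at all.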
cs.CL / 12 / 2603.15969
Robust Language Identification for Romansh Varieties
罗曼什方言的鲁棒语言识别
Abstract
The Romansh language has several regional varieties, called idioms, which sometimes have limited mutual intelligibility. Despite this linguistic diversity, there has been a lack of documented efforts to build a language identification (LID) system that can distinguish between these idioms. Since Romansh LID should also be able to recognize Rumantsch Grischun, a supra-regional variety that combines elements of several idioms, this makes for a novel and interesting classification problem. In this paper, we present a LID system for Romansh idioms based on an SVM approach. We evaluate our model on a newly curated benchmark across two domains and find that it reaches an average in-domain accuracy of 97%, enabling applications such as idiom-aware spell checking or machine translation. Our classifier is publicly available.
Chinese Translation
罗曼什语言有几种区域方言,称为方言(idioms),这些方言之间有时互通性有限。尽管存在这种语言多样性,但迄今为止,缺乏建立能够区分这些方言的语言识别(LID)系统的文献记录。由于罗曼什语言识别还应能够识别 Rumantsch Grischun,这是一种结合了几种方言元素的超区域方言,这使得该分类问题新颖且有趣。本文提出了一种基于支持向量机(SVM)方法的罗曼什方言语言识别系统。我们在新近整理的基准数据集上对我们的模型进行了评估,涵盖了两个领域,结果显示其在领域内的平均准确率达到97%,这使得如方言感知拼写检查或机器翻译等应用成为可能。我们的分类器是公开可用的。
cs.CL / 13 / 2603.15981
Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning
通过多任务强化学习对齐语音大语言模型中的副语言理解与生成
Abstract
Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds--crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistics understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.
Chinese Translation
语音大语言模型(LLMs)观察副语言线索,如语调、情感和非语言声音,这些对于意图理解至关重要。然而,利用这些线索面临挑战:训练数据有限、注释困难,以及模型倾向于利用词汇捷径而非副语言信号。我们提出了一种多任务强化学习(RL)方法,结合链式思维提示,以引发明确的情感推理。为了解决数据稀缺问题,我们引入了一种副语言感知的语音大语言模型(PALLM),通过两阶段管道联合优化音频的情感分类和副语言感知的响应生成。实验表明,我们的方法在Expresso、IEMOCAP和RAVDESS数据集上,相较于监督基线和强大的专有模型(Gemini-2.5-Pro, GPT-4o-audio),提升了8-12%的副语言理解能力。结果表明,使用多任务强化学习对副语言推理建模对于构建情感智能的语音大语言模型至关重要。
cs.CL / 14 / 2603.15998
NLP Occupational Emergence Analysis: How Occupations Form and Evolve in Real Time -- A Zero-Assumption Method Demonstrated on AI in the US Technology Workforce, 2022-2026
自然语言处理职业出现分析:职业如何在实时中形成和演变——一种在美国技术劳动力中展示的零假设方法,2022-2026
Abstract
Occupations form and evolve faster than classification systems can track. We propose that a genuine occupation is a self-reinforcing structure (a bipartite co-attractor) in which a shared professional vocabulary makes practitioners cohesive as a group, and the cohesive group sustains the vocabulary. This co-attractor concept enables a zero-assumption method for detecting occupational emergence from resume data, requiring no predefined taxonomy or job titles: we test vocabulary cohesion and population cohesion independently, with ablation to test whether the vocabulary is the mechanism binding the population. Applied to 8.2 million US resumes (2022-2026), the method correctly identifies established occupations and reveals a striking asymmetry for AI: a cohesive professional vocabulary formed rapidly in early 2024, but the practitioner population never cohered. The pre-existing AI community dissolved as the tools went mainstream, and the new vocabulary was absorbed into existing careers rather than binding a new occupation. AI appears to be a diffusing technology, not an emerging occupation. We discuss whether introducing an "AI Engineer" occupational category could catalyze population cohesion around the already-formed vocabulary, completing the co-attractor.
Chinese Translation
职业的形成和演变速度超过了分类系统的追踪能力。我们提出,真正的职业是一个自我强化的结构(二部共吸引子),在这个结构中,共享的专业词汇使从业者作为一个群体变得紧密相连,而这个紧密的群体又维持着这种词汇。这个共吸引子概念使得我们能够采用一种零假设的方法,从简历数据中检测职业的出现,无需预定义的分类法或职位名称:我们独立测试词汇的凝聚性和人群的凝聚性,并通过消融实验来检验词汇是否是绑定人群的机制。应用于820万份美国简历(2022-2026),该方法正确识别了已建立的职业,并揭示了AI领域的显著不对称性:在2024年初,一个紧密的专业词汇迅速形成,但从业者群体从未凝聚。随着工具的主流化,原有的AI社区解散,新词汇被吸收到现有职业中,而不是形成一个新的职业。AI似乎是一种扩散技术,而非新兴职业。我们讨论了引入“AI工程师”职业类别是否能够促进围绕已形成词汇的人群凝聚,从而完成这个共吸引子。
cs.CL / 15 / 2603.16002
RadAnnotate: Large Language Models for Efficient and Reliable Radiology Report Annotation
RadAnnotate:用于高效可靠放射学报告注释的大型语言模型
Abstract
Radiology report annotation is essential for clinical NLP, yet manual labeling is slow and costly. We present RadAnnotate, an LLM-based framework that studies retrieval-augmented synthetic reports and confidence-based selective automation to reduce expert effort for labeling in RadGraph. We study RadGraph-style entity labeling (graph nodes) and leave relation extraction (edges) to future work. First, we train entity-specific classifiers on gold-standard reports and characterize their strengths and failure modes across anatomy and observation categories, with uncertain observations hardest to learn. Second, we generate RAG-guided synthetic reports and show that synthetic-only models remain within 1-2 F1 points of gold-trained models, and that synthetic augmentation is especially helpful for uncertain observations in a low-resource setting, improving F1 from 0.61 to 0.70. Finally, by learning entity-specific confidence thresholds, RadAnnotate can automatically annotate 55-90% of reports at 0.86-0.92 entity match score while routing low-confidence cases for expert review.
Chinese Translation
放射学报告注释对于临床自然语言处理至关重要,但手动标注速度慢且成本高昂。我们提出了RadAnnotate,一个基于大型语言模型(LLM)的框架,研究了增强检索的合成报告和基于置信度的选择性自动化,以减少在RadGraph中标注所需的专家工作量。我们研究了RadGraph风格的实体标注(图节点)并将关系提取(边)留待未来工作。首先,我们在黄金标准报告上训练了特定于实体的分类器,并对其在解剖学和观察类别中的优缺点进行了特征化,其中不确定观察最难学习。其次,我们生成了基于检索增强生成(RAG)的合成报告,并显示合成模型的表现与黄金训练模型的F1分数相差仅1-2分,而合成增强在低资源环境下对不确定观察尤其有帮助,将F1分数从0.61提高到0.70。最后,通过学习特定于实体的置信度阈值,RadAnnotate可以在0.86-0.92的实体匹配分数下自动注释55-90%的报告,同时将低置信度案例转交专家审阅。
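The selective-automation step in RadAnnotate (auto-annotate above a learned per-entity-type confidence threshold, route everything else to experts) is simple to state. A hedged sketch, where the prediction dictionary format and the threshold table are assumptions:

```python
def selective_annotate(predictions, thresholds, default=1.0):
    """Split entity predictions into auto-accepted labels and cases routed
    for expert review, using per-entity-type confidence thresholds.
    An unknown entity type defaults to always-review (threshold 1.0)."""
    auto, review = [], []
    for pred in predictions:  # e.g. {"entity": "anatomy", "conf": 0.95, ...}
        cutoff = thresholds.get(pred["entity"], default)
        (auto if pred["conf"] >= cutoff else review).append(pred)
    return auto, review
```

Tuning `thresholds` per entity type on held-out data is what lets a system trade off automation rate (the 55-90% figure) against entity match quality (the 0.86-0.92 figure).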
cs.CL / 16 / 2603.16017
Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based Explainability
理解大型语言模型的道德推理轨迹:朝向基于探测的可解释性
Abstract
Large language models (LLMs) increasingly participate in morally sensitive decision-making, yet how they organize ethical frameworks across reasoning steps remains underexplored. We introduce \textit{moral reasoning trajectories}, sequences of ethical framework invocations across intermediate reasoning steps, and analyze their dynamics across six models and three benchmarks. We find that moral reasoning involves systematic multi-framework deliberation: 55.4--57.7\% of consecutive steps involve framework switches, and only 16.4--17.8\% of trajectories remain framework-consistent. Unstable trajectories remain 1.29$\times$ more susceptible to persuasive attacks ($p=0.015$). At the representation level, linear probes localize framework-specific encoding to model-specific layers (layer 63/81 for Llama-3.3-70B; layer 17/81 for Qwen2.5-72B), achieving 13.8--22.6\% lower KL divergence than the training-set prior baseline. Lightweight activation steering modulates framework integration patterns (6.7--8.9\% drift reduction) and amplifies the stability--accuracy relationship. We further propose a Moral Representation Consistency (MRC) metric that correlates strongly ($r=0.715$, $p<0.0001$) with LLM coherence ratings, whose underlying framework attributions are validated by human annotators (mean cosine similarity $= 0.859$).
Chinese Translation
大型语言模型(LLMs)越来越多地参与道德敏感的决策制定,但它们在推理步骤中如何组织伦理框架仍然未得到充分探索。我们引入了道德推理轨迹,即在中间推理步骤中伦理框架调用的序列,并分析了其在六个模型和三个基准上的动态。我们发现道德推理涉及系统的多框架审议:55.4%至57.7%的连续步骤涉及框架切换,只有16.4%至17.8%的轨迹保持框架一致性。不稳定的轨迹对劝说攻击的敏感性是稳定轨迹的1.29倍($p=0.015$)。在表示层面,线性探针将框架特定编码本地化到模型特定层(Llama-3.3-70B的第63/81层;Qwen2.5-72B的第17/81层),其KL散度比训练集先验基线低13.8%至22.6%。轻量级激活引导调节框架整合模式(转变减少6.7%至8.9%),并增强稳定性与准确性之间的关系。我们进一步提出了一种道德表示一致性(MRC)指标,与LLM一致性评分强相关($r=0.715$, $p<0.0001$),其基础框架归因得到人类注释者的验证(平均余弦相似度$= 0.859$)。
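The trajectory statistics this abstract reports (the 55.4-57.7% switch rate between consecutive steps, the 16.4-17.8% of trajectories that stay framework-consistent) reduce to simple sequence measures. A sketch under the assumption that each reasoning step carries one framework label:

```python
def switch_rate(trajectory):
    """Fraction of consecutive reasoning steps whose invoked ethical
    framework differs from the previous step's (instability measure)."""
    if len(trajectory) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(trajectory, trajectory[1:]))
    return switches / (len(trajectory) - 1)

def is_framework_consistent(trajectory):
    """A trajectory is framework-consistent if it never switches."""
    return switch_rate(trajectory) == 0.0
```

Aggregating `switch_rate` over many prompts yields the corpus-level switch percentage; the consistent-trajectory share is the fraction for which `is_framework_consistent` holds.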
cs.CL / 17 / 2603.16070
SEAHateCheck: Functional Tests for Detecting Hate Speech in Low-Resource Languages of Southeast Asia
SEAHateCheck:东南亚低资源语言仇恨言论检测的功能测试
Abstract
Hate speech detection relies heavily on linguistic resources, which are primarily available in high-resource languages such as English and Chinese, creating barriers for researchers and platforms developing tools for low-resource languages in Southeast Asia, where diverse socio-linguistic contexts complicate online hate moderation. To address this, we introduce SEAHateCheck, a pioneering dataset tailored to Indonesia, Thailand, the Philippines, and Vietnam, covering Indonesian, Tagalog, Thai, and Vietnamese. Building on HateCheck's functional testing framework and refining SGHateCheck's methods, SEAHateCheck provides culturally relevant test cases, augmented by large language models and validated by local experts for accuracy. Experiments with state-of-the-art and multilingual models revealed limitations in detecting hate speech in specific low-resource languages. In particular, Tagalog test cases showed the lowest model accuracy, likely due to linguistic complexity and limited training data. In contrast, slang-based functional tests proved the hardest, as models struggled with culturally nuanced expressions. The diagnostic insights of SEAHateCheck further exposed model weaknesses in implicit hate detection and models' struggles with counter-speech expression. As the first functional test suite for these Southeast Asian languages, this work equips researchers with a robust benchmark, advancing the development of practical, culturally attuned hate speech detection tools for inclusive online content moderation.
Chinese Translation
仇恨言论检测在很大程度上依赖于语言资源,这些资源主要存在于英语和汉语等高资源语言中,这为开发东南亚低资源语言工具的研究人员和平台带来了障碍。在东南亚,丰富的社会语言环境使得在线仇恨言论的管理更加复杂。为了解决这一问题,我们推出了SEAHateCheck,这是一个针对印度尼西亚、泰国、菲律宾和越南的开创性数据集,涵盖印尼语、塔加洛语、泰语和越南语。SEAHateCheck基于HateCheck的功能测试框架,并对SGHateCheck的方法进行了改进,提供了与文化相关的测试案例,这些案例通过大型语言模型增强,并由当地专家验证其准确性。对最先进的多语言模型的实验揭示了在特定低资源语言中检测仇恨言论的局限性。特别是,塔加洛语测试案例显示出最低的模型准确性,这可能与语言复杂性和有限的训练数据有关。相比之下,基于俚语的功能测试被证明是最困难的,因为模型在处理文化细微差别的表达时遇到了困难。SEAHateCheck的诊断洞察进一步揭示了模型在隐性仇恨检测方面的弱点,以及模型在反言论表达方面的挣扎。作为这些东南亚语言的首个功能测试套件,这项工作为研究人员提供了一个强有力的基准,推动了实用且符合文化的仇恨言论检测工具的发展,以便于包容性的在线内容管理。
cs.CL / 18 / 2603.16073
ClaimFlow: Tracing the Evolution of Scientific Claims in NLP
ClaimFlow:追踪科学主张在自然语言处理中的演变
Abstract
Scientific papers do more than report results: they advance \textit{claims} that later work supports, extends, or sometimes refutes. Yet existing methods for citation and claim analysis capture only fragments of this dialogue. In this work, we make these interactions explicit at the level of individual scientific claims. We introduce \texttt{ClaimFlow}, a claim-centric view of the NLP literature, built from 304 ACL Anthology papers (1979--2025) that are manually annotated with 1,084 claims and 832 cross-paper claim relations, indicating whether a citing paper \textit{supports}, \textit{extends}, \textit{qualifies}, \textit{refutes}, or references a claim as \textit{background}. Using \texttt{ClaimFlow}, we define a new task, \textit{Claim Relation Classification}, which requires models to infer the scientific stance toward a cited claim from the text and citation context. Evaluating strong neural models and large language models on this task, we report baseline performance of 0.78 macro-F1, highlighting that claim-relation classification is feasible but challenging. We further apply our model to ~13k NLP papers to analyze how claims evolve across decades of NLP research. Our analysis reveals that 63.5% of claims are never reused; only 11.1% are ever challenged; meanwhile, widely propagated claims are more often \textit{reshaped} through qualification and extension than directly confirmed or refuted. Overall, \texttt{ClaimFlow} offers a lens for examining how ideas shift and mature within NLP, and a foundation for assessing whether models can interpret scientific argumentation.
Chinese Translation
科学论文不仅仅是报告结果——它们提出了后续工作所支持、扩展或有时反驳的主张。然而,现有的引用和主张分析方法仅捕捉到这种对话的片段。在本研究中,我们在单个科学主张的层面上明确这些互动。我们引入了ClaimFlow,这是一个以主张为中心的自然语言处理文献视角,基于304篇ACL文集论文(1979-2025),这些论文手动标注了1084个主张和832个跨论文主张关系,指示引用论文是支持、扩展、限定、反驳,还是将主张作为背景进行引用。利用ClaimFlow,我们定义了一个新任务——主张关系分类——该任务要求模型根据文本和引用上下文推断对引用主张的科学立场。我们在此任务上评估了强大的神经模型和大型语言模型,报告了0.78的宏F1基线性能,突显了主张关系分类是可行但具有挑战性的。我们进一步将我们的模型应用于约13000篇自然语言处理论文,以分析主张如何在数十年的自然语言处理研究中演变。我们的分析揭示,63.5%的主张从未被重用;仅有11.1%的主张曾受到挑战;与此同时,广泛传播的主张更常通过限定和扩展的方式被重塑,而不是直接被确认或反驳。总体而言,ClaimFlow提供了一个审视自然语言处理领域内思想如何转变和成熟的视角,并为评估模型是否能够解释科学论证奠定了基础。
cs.CL / 19 / 2603.16091
CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering
CounterRefine:基于答案条件的反证据检索用于事实问答中的推理时知识修复
Abstract
In factual question answering, many errors are not failures of access but failures of commitment: the system retrieves relevant evidence, yet still settles on the wrong answer. We present CounterRefine, a lightweight inference-time repair layer for retrieval-grounded question answering. CounterRefine first produces a short answer from retrieved evidence, then gathers additional support and conflicting evidence with follow-up queries conditioned on that draft answer, and finally applies a restricted refinement step that outputs either KEEP or REVISE, with proposed revisions accepted only if they pass deterministic validation. In effect, CounterRefine turns retrieval into a mechanism for testing a provisional answer rather than merely collecting more context. On the full SimpleQA benchmark, CounterRefine improves a matched GPT-5 Baseline-RAG by 5.8 points and reaches a 73.1 percent correct rate, while exceeding the reported one-shot GPT-5.4 score by roughly 40 points. These findings suggest a simple but important direction for knowledgeable foundation models: beyond accessing evidence, they should also be able to use that evidence to reconsider and, when necessary, repair their own answers.
Chinese Translation
在事实问答中,许多错误并不是访问失败,而是承诺失败:系统检索到相关证据,但仍然选择了错误的答案。我们提出了CounterRefine,这是一种轻量级的推理时修复层,用于基于检索的问答。CounterRefine首先从检索到的证据中生成一个简短的答案,然后根据该草拟答案进行后续查询,以收集额外的支持和冲突证据,最后应用一个限制性的修正步骤,输出KEEP或REVISE,只有在通过确定性验证的情况下,提出的修正才会被接受。实际上,CounterRefine将检索转变为测试临时答案的机制,而不仅仅是收集更多的上下文。在完整的SimpleQA基准测试中,CounterRefine将匹配的GPT-5 Baseline-RAG的得分提高了5.8分,达到了73.1%的正确率,同时超过了报告的一次性GPT-5.4得分约40分。这些发现为知识丰富的基础模型提供了一个简单但重要的方向:除了访问证据,它们还应该能够利用这些证据重新考虑并在必要时修复自己的答案。
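The KEEP/REVISE control flow is the core of CounterRefine: a revision is accepted only if it passes deterministic validation, otherwise the draft stands. A minimal skeleton, where the callables stand in for the retrieval and LLM components and their signatures are assumptions:

```python
def counter_refine(draft, gather_evidence, refine, validate):
    """Answer-conditioned repair: gather support and conflicting evidence
    for a draft answer, then KEEP it or accept a validated REVISE."""
    support, conflict = gather_evidence(draft)       # follow-up retrieval
    action, revised = refine(draft, support, conflict)
    if action == "REVISE" and validate(revised):     # deterministic check
        return revised
    return draft                                     # KEEP on any failure
```

The point of restricting the refinement output to two actions, with validation gating REVISE, is that retrieval becomes a test of the provisional answer rather than just extra context.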
cs.CL / 20 / 2603.16105
Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization
频率的重要性:快速模型无关的数据整理用于剪枝和量化
Abstract
Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called \emph{calibration data}) for finding the compressed model configuration. The choice of calibration data is a critical step in preserving model capabilities both intra- and inter-tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce \texttt{\textbf{ZipCal}}, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, it also performs on par, in terms of downstream performance, with a state-of-the-art method that relies on model perplexity. The latter becomes prohibitively expensive at large-scale models and datasets, while \texttt{\textbf{ZipCal}} is on average $\sim$240$\times$ faster due to its tractable linear complexity\footnote{We make the code and the experiments available at https://anonymous.4open.science/r/zipcal-71CD/.}.
Chinese Translation
后训练模型压缩对于提升大型语言模型(LLMs)的可移植性并保持其性能至关重要。尽管已经提出了多种压缩方法,但在选择最合适的数据集(即所谓的校准数据)以寻找压缩模型配置方面的关注较少。校准数据的选择是保持模型在任务内和任务间能力的关键步骤。在本研究中,我们通过分析内在数据特性而非特定模型信号,解决了识别高性能校准集以进行剪枝和量化的挑战。我们提出了ZipCal,这是一种模型无关的数据整理策略,基于Zipf法则最大化词汇多样性。实验表明,我们的方法在各种剪枝基准测试中始终优于标准的均匀随机采样。值得注意的是,在下游性能方面,它的表现与依赖于模型困惑度的最先进方法相当。后者在大规模模型和数据集上变得极其昂贵,而ZipCal由于其可处理的线性复杂度,平均速度快约240倍。(我们提供代码和实验,网址为 https://anonymous.4open.science/r/zipcal-71CD/。)
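One plausible reading of Zipf-based lexical-diversity scoring is sketched below: rank words by global frequency and score each candidate document by how much rare-type mass it contributes. The weighting and the selection rule are illustrative assumptions, not ZipCal's actual criterion:

```python
from collections import Counter

def zipf_diversity_scores(docs):
    """Score candidate calibration documents by lexical diversity under a
    Zipfian weighting: rarer word types (higher frequency rank) count more.
    Hedged sketch; the paper's exact scoring function may differ."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    # Frequency rank 1 = most common word; ties keep first-seen order.
    rank = {w: r for r, (w, _) in enumerate(counts.most_common(), start=1)}
    scores = []
    for doc in docs:
        types = set(doc.split())
        scores.append(sum(1.0 - 1.0 / rank[w] for w in types))
    return scores
```

A calibration set would then keep the top-scoring documents. Because no forward passes are involved, the cost is linear in corpus size, which is where a large speedup over perplexity-based selection would come from.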
cs.CL / 21 / 2603.16112
ASDA: Automated Skill Distillation and Adaptation for Financial Reasoning
ASDA:金融推理的自动化技能蒸馏与适应
Abstract
Adapting large language models (LLMs) to specialized financial reasoning typically requires expensive fine-tuning that produces model-locked expertise. Training-free alternatives have emerged, yet our experiments show that leading methods (GEPA and ACE) achieve only marginal gains on the FAMMA financial reasoning benchmark, exposing the limits of unstructured text optimization for complex, multi-step domain reasoning. We introduce Automated Skill Distillation and Adaptation (ASDA), a framework that automatically generates structured skill artifacts through iterative error-corrective learning without modifying model weights. A teacher model analyzes a student model's failures on financial reasoning tasks, clusters errors by subfield and error type, and synthesizes skill files containing reasoning procedures, code templates, and worked examples, which are dynamically injected during inference. Evaluated on FAMMA, ASDA achieves up to +17.33% improvement on arithmetic reasoning and +5.95% on non-arithmetic reasoning, substantially outperforming all training-free baselines. The resulting skill artifacts are human-readable, version-controlled, and compatible with the Agent Skills open standard, offering any organization with a labeled domain dataset a practical and auditable path to domain adaptation without weight access or retraining.
Chinese Translation
将大型语言模型(LLMs)适应于专业的金融推理通常需要昂贵的微调,这会导致模型锁定的专业知识。虽然出现了无训练的替代方案,但我们的实验表明,领先的方法(GEPA 和 ACE)在 FAMMA 金融推理基准测试中仅实现了边际增益,暴露了无结构文本优化在复杂多步骤领域推理中的局限性。我们提出了自动化技能蒸馏与适应(ASDA),这是一个通过迭代错误修正学习自动生成结构化技能文档的框架,而无需修改模型权重。教师模型分析学生模型在金融推理任务中的失败,按子领域和错误类型对错误进行聚类,并合成包含推理过程、代码模板和示例的技能文件,这些文件在推理过程中动态注入。在 FAMMA 上评估,ASDA 在算术推理上实现了高达 +17.33% 的提升,在非算术推理上实现了 +5.95% 的提升,显著优于所有无训练基线。生成的技能文档可供人类阅读,版本控制,并与 Agent Skills 开放标准兼容,为任何拥有标注领域数据集的组织提供了一条实用且可审计的领域适应路径,无需访问权重或重新训练。
cs.CL / 22 / 2603.16120
Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users
语言模型并不知道您想要什么:评估深度研究中的个性化需求真实用户
Abstract
Deep Research (DR) tools (e.g. OpenAI DR) help researchers cope with ballooning publishing counts. Such tools can synthesize scientific papers to answer researchers' queries, but lack understanding of their users. We change that in MyScholarQA (MySQA), a personalized DR tool that: 1) infers a profile of a user's research interests; 2) proposes personalized actions for a user's input query; and 3) writes a multi-section report for the query that follows user-approved actions. We first test MySQA with NLP's standard protocol: we design a benchmark of synthetic users and LLM judges, where MySQA beats baselines in citation metrics and personalized action-following. However, we suspect this process does not cover all aspects of personalized DR that users value, so we interview users of an online version of MySQA to uncover them. We reveal nine nuanced errors of personalized DR undetectable by our LLM judges, and we study qualitative feedback to form lessons for future DR design. In all, we argue for a pillar of personalization that easy-to-use LLM judges can lead NLP to overlook: real progress in personalization is only possible with real users.
Chinese Translation
深度研究(Deep Research, DR)工具(如 OpenAI DR)帮助研究人员应对日益增加的出版数量。这些工具能够合成科学论文以回答研究人员的查询,但缺乏对用户的理解。我们在 MyScholarQA(MySQA)中改变了这一点,这是一款个性化的 DR 工具,具有以下功能:1)推断用户的研究兴趣档案;2)为用户的输入查询提出个性化的行动建议;3)根据用户批准的行动撰写多部分报告。我们首先使用自然语言处理(NLP)的标准协议测试 MySQA:我们设计了一个合成用户和 LLM(大语言模型)评审的基准,在引用指标和个性化行动跟随方面,MySQA 超越了基线。然而,我们怀疑这一过程并未涵盖个性化 DR 用户所重视的所有方面,因此我们在 MySQA 的在线版本中采访用户以揭示他们的需求。我们揭示了九个个性化 DR 的细微错误,这些错误无法被我们的 LLM 评审检测到,并且我们研究了定性反馈,以为未来的 DR 设计形成经验教训。总的来说,我们主张个性化的一个支柱,即易于使用的 LLM 评审可能导致 NLP 忽视的:真正的个性化进展只有在真实用户的参与下才有可能。
cs.CL / 23 / 2603.16127
Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning
无学习率衰减的预训练大语言模型增强监督微调
Abstract
We investigate the role of learning rate scheduling in the large-scale pre-training of large language models, focusing on its influence on downstream performance after supervised fine-tuning (SFT). Decay-based learning rate schedulers are widely used to minimize pre-training loss. However, despite their widespread use, how these schedulers affect performance after SFT remains underexplored. In this paper, we examine Warmup-Stable-Only (WSO), which maintains a constant learning rate after warmup without any decay. Through experiments with 1B and 8B parameter models, we show that WSO consistently outperforms decay-based schedulers in terms of performance after SFT, even though decay-based schedulers may exhibit better performance after pre-training. The result also holds across different regimes with mid-training and over-training. Loss landscape analysis further reveals that decay-based schedulers lead models into sharper minima, whereas WSO preserves flatter minima that support adaptability. These findings indicate that applying LR decay to improve pre-training metrics may compromise downstream adaptability. Our work also provides practical guidance for training and model release strategies, highlighting that pre-training models with WSO enhances their adaptability for downstream tasks.
Chinese Translation
我们研究了学习率调度在大规模预训练大语言模型中的作用,重点关注其对监督微调(SFT)后下游性能的影响。基于衰减的学习率调度器被广泛用于最小化预训练损失。然而,尽管它们被广泛使用,这些调度器在SFT后如何影响性能仍然未被充分探讨。在本文中,我们考察了Warmup-Stable-Only (WSO)策略,该策略在预热后保持恒定的学习率而不进行衰减。通过对1B和8B参数模型的实验,我们表明WSO在SFT后的性能上始终优于基于衰减的调度器,尽管基于衰减的调度器在预训练后可能表现更好。该结果在不同的训练阶段(中期训练和过度训练)中也成立。损失景观分析进一步揭示,基于衰减的调度器使模型陷入更尖锐的极小值,而WSO则保持更平坦的极小值,从而支持适应性。这些发现表明,应用学习率衰减以改善预训练指标可能会妨碍下游适应性。我们的工作还为训练和模型发布策略提供了实用指导,强调使用WSO进行预训练可以增强模型在下游任务中的适应性。
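The scheduler comparison is easy to make concrete. A sketch of Warmup-Stable-Only against a conventional warmup-plus-cosine-decay schedule; the linear warmup shape and the specific cosine baseline are assumptions here, since the abstract's point is only the absence of decay after warmup:

```python
import math

def lr_wso(step, warmup_steps, peak_lr):
    """Warmup-Stable-Only: linear warmup, then a constant LR (no decay)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

def lr_cosine(step, warmup_steps, total_steps, peak_lr, min_lr=0.0):
    """Warmup followed by cosine decay, for comparison."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```

The two schedules coincide during warmup and diverge afterward: cosine decay drives the LR toward `min_lr` (sharper minima, per the abstract's loss-landscape analysis), while WSO holds the peak LR (flatter minima, better post-SFT adaptability).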
cs.CL / 24 / 2603.16128
Social Simulacra in the Wild: AI Agent Communities on Moltbook
野外的社会仿像:Moltbook上的人工智能代理社区
Abstract
As autonomous LLM-based agents increasingly populate social platforms, understanding the dynamics of AI-agent communities becomes essential for both communication research and platform governance. We present the first large-scale empirical comparison of AI-agent and human online communities, analyzing 73,899 Moltbook and 189,838 Reddit posts across five matched communities. Structurally, we find that Moltbook exhibits extreme participation inequality (Gini = 0.84 vs. 0.47) and high cross-community author overlap (33.8\% vs. 0.5\%). In terms of linguistic attributes, content generated by AI-agents is emotionally flattened, cognitively shifted toward assertion over exploration, and socially detached. These differences give rise to apparent community-level homogenization, but we show this is primarily a structural artifact of shared authorship. At the author level, individual agents are more identifiable than human users, driven by outlier stylistic profiles amplified by their extreme posting volume. As AI-mediated communication reshapes online discourse, our work offers an empirical foundation for understanding how multi-agent interaction gives rise to collective communication dynamics distinct from those of human communities.
Chinese Translation
随着基于自主大型语言模型(LLM)的代理越来越多地出现在社交平台上,理解人工智能代理社区的动态对于沟通研究和平台治理变得至关重要。我们首次对人工智能代理和人类在线社区进行了大规模的实证比较,分析了来自五个匹配社区的73,899条Moltbook帖子和189,838条Reddit帖子。从结构上看,我们发现Moltbook表现出极端的参与不平等(基尼系数 = 0.84 vs. 0.47)和高跨社区作者重叠(33.8% vs. 0.5%)。在语言特征方面,人工智能代理生成的内容情感上趋于平坦,认知上更倾向于断言而非探索,并且在社会上显得疏离。这些差异导致了明显的社区层面的同质化,但我们表明这主要是共享作者身份的结构性伪影。在作者层面,个体代理比人类用户更容易被识别,这主要是由于其极端发帖量所放大的异常风格特征。随着人工智能介导的沟通重塑在线话语,我们的研究为理解多代理互动如何产生与人类社区不同的集体沟通动态提供了实证基础。
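The participation-inequality figure (Gini = 0.84 vs. 0.47) is the standard Gini coefficient computed over per-author post counts; a self-contained sketch:

```python
def gini(counts):
    """Gini coefficient of per-author post counts: 0 means perfectly equal
    participation; values near 1 mean a few authors dominate the feed."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard closed form from the sorted sample.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n
```

Feeding in one post count per account reproduces the kind of community-level comparison the abstract makes between Moltbook and Reddit.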
cs.CL / 25 / 2603.16131
SciZoom: A Large-scale Benchmark for Hierarchical Scientific Summarization across the LLM Era
SciZoom:一个大规模层次科学摘要基准,跨越大语言模型时代
Abstract
The explosive growth of AI research has created unprecedented information overload, increasing the demand for scientific summarization at multiple levels of granularity beyond traditional abstracts. While LLMs are increasingly adopted for summarization, existing benchmarks remain limited in scale, target only a single granularity, and predate the LLM era. Moreover, since the release of ChatGPT in November 2022, researchers have rapidly adopted LLMs for drafting manuscripts themselves, fundamentally transforming scientific writing, yet no resource exists to analyze how this writing has evolved. To bridge these gaps, we introduce SciZoom, a benchmark comprising 44,946 papers from four top-tier ML venues (NeurIPS, ICLR, ICML, EMNLP) spanning 2020 to 2025, explicitly stratified into Pre-LLM and Post-LLM eras. SciZoom provides three hierarchical summarization targets (Abstract, Contributions, and TL;DR) achieving compression ratios up to 600:1, enabling both multi-granularity summarization research and temporal mining of scientific writing patterns. Our linguistic analysis reveals striking shifts in phrase patterns (up to 10x for formulaic expressions) and rhetorical style (23% decline in hedging), suggesting that LLM-assisted writing produces more confident yet homogenized prose. SciZoom serves as both a challenging benchmark and a unique resource for mining the evolution of scientific discourse in the generative AI era. Our code and dataset are publicly available on GitHub (https://github.com/janghana/SciZoom) and Hugging Face (https://huggingface.co/datasets/hanjang/SciZoom), respectively.
Chinese Translation
人工智能研究的爆炸性增长导致了前所未有的信息过载,增加了对多层次科学摘要的需求,超越了传统摘要的范围。尽管大语言模型(LLMs)在摘要生成中越来越受到采用,但现有基准的规模仍然有限,仅针对单一的粒度,并且早于大语言模型时代。此外,自2022年11月ChatGPT发布以来,研究人员迅速采用大语言模型自行撰写手稿,根本改变了科学写作,但目前尚无资源分析这种写作如何演变。为了解决这些问题,我们推出了SciZoom,一个基准数据集,包含来自四个顶级机器学习会议(NeurIPS、ICLR、ICML、EMNLP)在2020年至2025年间的44,946篇论文,明确分为大语言模型前(Pre-LLM)和大语言模型后(Post-LLM)时代。SciZoom提供三个层次的摘要目标(摘要、贡献和TL;DR),实现高达600:1的压缩比,支持多粒度摘要研究和科学写作模式的时间挖掘。我们的语言分析揭示了短语模式的显著变化(公式化表达的变化高达10倍)和修辞风格的转变(模糊性下降23%),这表明大语言模型辅助的写作产生了更自信但同质化的散文。SciZoom既是一个具有挑战性的基准,也是一个独特的资源,用于挖掘生成性人工智能时代科学话语的演变。我们的代码和数据集分别在GitHub(https://github.com/janghana/SciZoom)和Hugging Face(https://huggingface.co/datasets/hanjang/SciZoom)上公开提供。
cs.CL / 26 / 2603.16137
SIA: A Synthesize-Inject-Align Framework for Knowledge-Grounded and Secure E-commerce Search LLMs with Industrial Deployment
Abstract
Large language models offer transformative potential for e-commerce search by enabling intent-aware recommendations. However, their industrial deployment is hindered by two critical challenges: (1) knowledge hallucination due to insufficient encoding of dynamic, fine-grained product knowledge, and (2) security vulnerabilities under jailbreak attacks that threaten compliance. To address these issues, we propose SIA, a Synthesize-Inject-Align framework for building knowledgeable and secure e-commerce search LLMs. Our approach first synthesizes a high-quality natural language corpus by combining structured knowledge graphs with unstructured behavioral logs, augmented with reasoning chains and safety-aware data. We then introduce a parameter-efficient pre-training strategy based on Depth Up-Scaling to inject domain knowledge while preserving general capabilities. Finally, a dual-path alignment method via multi-task instruction tuning and adversarial training strengthens both task performance and safety robustness. The framework has been deployed at JD.com, China's largest self-operated e-commerce platform, where A/B tests across five core search scenarios demonstrate significant improvements in key business metrics, validating its industrial effectiveness and scalability.
cs.CL / 27 / 2603.16142
Parametric Social Identity Injection and Diversification in Public Opinion Simulation
Abstract
Large language models (LLMs) have recently been adopted as synthetic agents for public opinion simulation, offering a promising alternative to costly and slow human surveys. Despite their scalability, current LLM-based simulation methods fail to capture social diversity, producing flattened inter-group differences and overly homogeneous responses within demographic groups. We identify this limitation as a Diversity Collapse phenomenon in LLM hidden representations, where distinct social identities become increasingly indistinguishable across layers. Motivated by this observation, we propose Parametric Social Identity Injection (PSII), a general framework that injects explicit, parametric representations of demographic attributes and value orientations directly into intermediate hidden states of LLMs. Unlike prompt-based persona conditioning, PSII enables fine-grained and controllable identity modulation at the representation level. Extensive experiments on the World Values Survey using multiple open-source LLMs show that PSII significantly improves distributional fidelity and diversity, reducing KL divergence to real-world survey data while enhancing overall diversity. This work provides new insights into representation-level control of LLM agents and advances scalable, diversity-aware public opinion simulation. Code and data are available at https://github.com/halsayxi/PSII.
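The distributional-fidelity metric above (KL divergence between simulated and real survey answers) can be sketched for discrete response distributions; the probabilities below are illustrative, not the paper's data:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete response distributions
    (e.g. the share of respondents picking each Likert option)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

survey = [0.10, 0.25, 0.30, 0.25, 0.10]  # real-world answer shares
sim_a  = [0.05, 0.15, 0.60, 0.15, 0.05]  # collapsed, over-homogeneous agents
sim_b  = [0.12, 0.22, 0.32, 0.24, 0.10]  # diversity-preserving agents
# A lower KL means the simulation tracks the survey more closely.
```

Here the diversity-preserving `sim_b` yields a much smaller divergence from the survey than the collapsed `sim_a`, which is the quantity PSII is reported to reduce.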
cs.CL / 28 / 2603.16184
Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR
Abstract
We present Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. Our models are obtained by fine-tuning Qwen3-ASR-0.6B and Qwen3-ASR-1.7B exclusively on publicly available speech corpora, using a balanced sampling strategy that equalizes the number of training utterances per language and deliberately omits language-tag conditioning so that the model learns to identify languages implicitly from audio. On 12 benchmarks spanning the four target languages, Polyglot-Lion-1.7B achieves an average error rate of 14.85, competitive with MERaLiON-2-10B-ASR (14.32) - a model 6x larger - while incurring a training cost of $81 on a single RTX PRO 6000 GPU compared to $18,862 for the 128-GPU baseline. Inference throughput is approximately 20x faster than MERaLiON at 0.10 s/sample versus 2.02 s/sample. These results demonstrate that linguistically balanced fine-tuning of moderate-scale pretrained models can yield deployment-ready multilingual ASR at a fraction of the cost of larger specialist systems.
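The balanced sampling strategy described above - equal utterance counts per language, no language tags - can be sketched as a simple downsampling step (corpus sizes are illustrative):

```python
import random

def balance_per_language(utterances_by_lang, seed=0):
    """Downsample every language to the size of the smallest one,
    so each contributes equally; no language tags are attached."""
    rng = random.Random(seed)
    n = min(len(u) for u in utterances_by_lang.values())
    return {lang: rng.sample(utts, n) for lang, utts in utterances_by_lang.items()}

corpus = {"en": list(range(1000)), "zh": list(range(400)),
          "ta": list(range(150)), "ms": list(range(150))}
balanced = balance_per_language(corpus)  # 150 utterances per language
```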
cs.CL / 29 / 2603.16192
Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models
Abstract
Modern LLMs employ safety mechanisms that extend beyond surface-level input filtering to latent semantic representations and generation-time reasoning, enabling them to recover obfuscated malicious intent during inference and refuse accordingly, and rendering many surface-level obfuscation jailbreak attacks ineffective. We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference. S2C strategically distributes and reshapes semantic cues such that full intent consolidation requires multi-step inference and long-range co-reference resolution within deeper latent representations. The framework comprises three complementary mechanisms: (1) Contextual Reframing, which embeds the request within a plausible high-stakes scenario to bias the model toward compliance; (2) Content Fragmentation, which disperses the semantic signature of the request across disjoint prompt segments; and (3) Clue-Guided Camouflage, which disguises residual semantic cues while embedding recoverable markers that guide output generation. By delaying and restructuring semantic consolidation, S2C degrades safety triggers that depend on coherent or explicitly reconstructed malicious intent at decoding time, while preserving sufficient instruction recoverability for functional output generation. We evaluate S2C across multiple open-source and proprietary LLMs using HarmBench and JBB-Behaviors, where it improves Attack Success Rate (ASR) by 12.4% and 9.7%, respectively, over the current SOTA. Notably, S2C achieves substantial gains on GPT-5-mini, outperforming the strongest baseline by 26% on JBB-Behaviors. We also analyse which combinations perform best against broad families of models, and characterise the trade-off between the extent of obfuscation versus input recoverability on jailbreak success.
cs.CL / 30 / 2603.16219
SpecSteer: Synergizing Local Context and Global Reasoning for Efficient Personalized Generation
Abstract
Realizing personalized intelligence faces a core dilemma: sending user history to centralized large language models raises privacy concerns, while on-device small language models lack the reasoning capacity required for high-quality generation. Our pilot study shows that purely local enhancements remain insufficient to reliably bridge this gap. We therefore propose SpecSteer, an asymmetric collaborative inference framework that synergizes private on-device context with cloud-scale reasoning. SpecSteer casts collaboration as Bayesian knowledge fusion and repurposes speculative decoding as a distributed alignment protocol, yielding a Draft-Verify-Recover pipeline: the on-device model drafts personalized sequences; the cloud validates via a ratio-based mechanism that decouples reasoning verification from private context, filtering logical flaws without accessing raw user context; upon rejection, a steering recovery injects local intent during correction. Experiments demonstrate that SpecSteer successfully closes the reasoning gap and achieves superior personalized generation performance, while delivering a 2.36x speedup over standard baselines.
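The ratio-based verification step can be sketched with the standard speculative-decoding acceptance test; the decoupling from private context is the paper's contribution and is not modeled in this toy version:

```python
import random

def verify_draft_token(p_target: float, p_draft: float, rng: random.Random) -> bool:
    """Accept an on-device draft token with probability
    min(1, p_target / p_draft), as in speculative decoding."""
    return rng.random() < min(1.0, p_target / p_draft)

rng = random.Random(0)
# Tokens the verifier likes at least as much as the drafter are always kept:
always_kept = all(verify_draft_token(0.9, 0.3, rng) for _ in range(100))
```

On rejection, standard speculative decoding resamples from the verifier; SpecSteer instead performs a "steering recovery" that re-injects local intent at that point.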
cs.CL / 31 / 2603.16244
More Rounds, More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification
Abstract
Cross-Context Review (CCR) improves LLM verification by separating production and review into independent sessions. A natural extension is multi-turn review: letting the reviewer ask follow-up questions, receive author responses, and review again. We call this Dynamic Cross-Context Review (D-CCR). In a controlled experiment with 30 artifacts and 150 injected errors, we tested four D-CCR variants against the single-pass CCR baseline. Single-pass CCR (F1 = 0.376) significantly outperformed all multi-turn variants, including D-CCR-2b with question-and-answer exchange (F1 = 0.303, $p < 0.001$, $d = -0.59$). Multi-turn review increased recall (+0.08) but generated 62% more false positives (8.5 vs. 5.2), collapsing precision from 0.30 to 0.20. Two mechanisms drive this degradation: (1) false positive pressure -- reviewers in later rounds fabricate findings when the artifact's real errors have been exhausted, and (2) Review Target Drift -- reviewers provided with prior Q&A exchanges shift from reviewing the artifact to critiquing the conversation itself. Independent re-review without prior context (D-CCR-2c) performed worst (F1 = 0.263), confirming that mere repetition degrades rather than helps. The degradation stems from false positive pressure in additional rounds, not from information amount -- within multi-turn conditions, more information actually helps (D-CCR-2b > D-CCR-2a). The problem is not what the reviewer sees, but that reviewing again invites noise.
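The precision collapse reported above follows directly from the F1 arithmetic; the per-artifact counts below are illustrative, not the paper's exact figures:

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

p1, r1, f1_single = prf1(tp=2.5, fp=5.2, fn=2.5)  # single-pass review
p2, r2, f1_multi  = prf1(tp=2.9, fp=8.5, fn=2.1)  # multi-turn review
# Recall inches up, but the extra false positives drag precision and F1 down.
```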
cs.CL / 32 / 2603.16258
Is Semi-Automatic Transcription Useful in Corpus Creation? Preliminary Considerations on the KIParla Corpus
Abstract
This paper analyses the implementation of Automatic Speech Recognition (ASR) into the transcription workflow of the KIParla corpus, a resource of spoken Italian. Through a two-phase experiment, 11 expert and novice transcribers produced both manual and ASR-assisted transcriptions of identical audio segments across three different types of conversation, which were subsequently analyzed through a combination of statistical modeling, word-level alignment and a series of annotation-based metrics. Results show that ASR-assisted workflows can increase transcription speed but do not consistently improve overall accuracy, with effects depending on multiple factors such as workflow configuration, conversation type and annotator experience. Analyses combining alignment-based metrics, descriptive statistics and statistical modeling provide a systematic framework to monitor transcription behavior across annotators and workflows. Despite limitations, ASR-assisted transcription, potentially supported by task-specific fine-tuning, could be integrated into the KIParla transcription workflow to accelerate corpus creation without compromising transcription quality.
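The word-level alignment underlying the error analysis can be sketched with a standard sequence matcher (a simplification of full transcription-comparison tooling):

```python
from difflib import SequenceMatcher

def word_diff(ref: str, hyp: str):
    """Word-level alignment of two transcripts: opcodes mark
    equal / replace / delete / insert spans for error analysis."""
    r, h = ref.split(), hyp.split()
    return [(tag, r[i1:i2], h[j1:j2])
            for tag, i1, i2, j1, j2 in SequenceMatcher(a=r, b=h).get_opcodes()]

ops = word_diff("the cat sat down", "the cat sit down")
```

Counting the `replace`, `delete`, and `insert` spans over such alignments gives word error rates and per-annotator error profiles.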
cs.CL / 33 / 2603.16292
Attention-guided Evidence Grounding for Spoken Question Answering
Abstract
Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model's latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model's attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.
cs.CL / 34 / 2603.16299
PyPhonPlan: Simulating phonetic planning with dynamic neural fields and task dynamics
Abstract
We introduce PyPhonPlan, a Python toolkit for implementing dynamical models of phonetic planning using coupled dynamic neural fields and task dynamic simulations. The toolkit provides modular components for defining planning, perception and memory fields, as well as between-field coupling, gestural inputs, and using field activation profiles to solve for tract variable trajectories. We illustrate the toolkit's capabilities through an example application: simulating production/perception loops with a coupled memory field, which demonstrates the framework's ability to model interactive speech dynamics using representations that are temporally-principled, neurally-grounded, and phonetically-rich. PyPhonPlan is released as open-source software and contains executable examples to promote reproducibility, extensibility, and cumulative computational development for speech communication research.
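Dynamic neural field models of this kind are typically Amari-style fields; a minimal Euler-step sketch of the generic field equation (this is not PyPhonPlan's actual API, and all parameter values are illustrative):

```python
import numpy as np

def dnf_step(u, h, inputs, w, dt=0.01, tau=1.0, beta=4.0):
    """One Euler step of an Amari-style dynamic neural field:
    tau * du/dt = -u + h + inputs + w (*) f(u), with sigmoid output f."""
    f = 1.0 / (1.0 + np.exp(-beta * u))           # firing-rate output
    lateral = np.convolve(f, w, mode="same")      # lateral interaction kernel
    return u + (dt / tau) * (-u + h + inputs + lateral)

u0 = np.zeros(50)  # field activation over a feature dimension
u1 = dnf_step(u0, h=-5.0, inputs=np.zeros(50), w=np.array([-0.1, 0.5, -0.1]))
```

With no input and a negative resting level `h`, the field relaxes below zero; localized gestural inputs would instead drive self-stabilizing activation peaks.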
cs.CL / 35 / 2603.16309
Omnilingual MT: Machine Translation for 1,600 Languages
Omnilingual MT Team, Alastruey, Belen, Bafna, Niyati, Caciolai, Andrea, Heffernan, Kevin, Kozhevnikov, Artyom, Ropers, Christophe, Sánchez, Eduardo, Saint-James, Charles-Eric, Tsiamas, Ioannis, Cheng, Chierh, Chuang, Joe, Duquenne, Paul-Ambroise, Duppenthaler, Mark, Ekberg, Nate, Gao, Cynthia, Cabot, Pere Lluís Huguet, Janeiro, João Maria, Maillard, Jean, Gonzalez, Gabriel Mejia, Schwenk, Holger, Toledo, Edan, Turkatenko, Arina, Ventayol-Boada, Albert, Moritz, Rashel, Mourachko, Alexandre, Parimi, Surya, Williamson, Mary, Yates, Shireen, Dale, David, Costa-jussà, Marta R.
Abstract
High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and perhaps a few hundred more on the source side, supported via cross-lingual transfer. Even these numbers have been hard to evaluate due to the lack of reliable benchmarks and metrics. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext. We explore two ways of specializing a Large Language Model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600 translations further shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. Additionally, OMT models improve in cross-lingual transfer, coming close to solving the "understanding" part of the MT puzzle for the 1,600 languages evaluated. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are dynamically evolving towards omnilinguality and are freely available.
cs.CL / 36 / 2603.16354
PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development
Abstract
We present PashtoCorp, a 1.25-billion-word corpus for Pashto, a language spoken by 60 million people that remains severely underrepresented in NLP. The corpus is assembled from 39 sources spanning seven HuggingFace datasets and 32 purpose-built web scrapers, processed through a reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering. At 1.25B words across 2.81 million documents, PashtoCorp is 40x larger than the OSCAR Pashto subset and 83x larger than the previously largest dedicated Pashto corpus. Continued MLM pretraining of XLM-R-base on PashtoCorp reduces held-out perplexity by 25.1% (8.08->6.06). On WikiANN Pashto NER, the pretrained model improves entity F1 by 10% relative (19.0%->21.0%) and reduces training variance nearly 7x; the largest gain appears at 50 training sentences (+27%), with PashtoCorp covering 97.9% of WikiANN entity vocabulary. On Belebele Pashto reading comprehension, Gemma-3n achieves 64.6% accuracy, the first published LLM baseline for Pashto on this benchmark. A leave-one-out source ablation shows that Wikipedia (0.7% of documents) is the most critical source for NER: removing it alone reduces entity F1 by 47%. Corpus data, trained model, and code are available at https://huggingface.co/datasets/ihanif/pashto-corpus, https://huggingface.co/ihanif/xlmr-pashto, and https://github.com/ihanif/pashto-corpus.
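The SHA-256 deduplication stage can be sketched as exact-match filtering on hashed text; the whitespace normalization used here is an assumption, not a detail from the paper:

```python
import hashlib

def dedupe(docs):
    """Drop exact duplicates by hashing whitespace-normalized text with SHA-256."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

kept = dedupe(["a b", "a  b", "c"])  # the second doc hashes identically
```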
cs.CL / 37 / 2603.16397
Fanar 2.0: Arabic Generative AI Stack
FANAR TEAM, Abbas, Ummar, Ahmad, Mohammad Shahmeer, Ahmad, Minhaj, Al-Homaid, Abdulaziz, Al-Nuaimi, Anas, Altinisik, Enes, Asgari, Ehsaneddin, Chawla, Sanjay, Chowdhury, Shammur, Dalvi, Fahim, Darwish, Kareem, Durrani, Nadir, Elfeky, Mohamed, Elmagarmid, Ahmed, Eltabakh, Mohamed, Ersoy, Asim, Fatehkia, Masoomali, Hashim, Mohammed Qusay, Hawasly, Majd, Hefeeda, Mohamed, Husaini, Mus'ab, Isufaj, Keivin, Jung, Soon-Gyo, Lachemat, Houssam, Lucas, Ji Kim, Mohamed, Abubakr, Mohiuddin, Tasnim, Mousi, Basel, Mubarak, Hamdy, Musleh, Ahmad, Ouzzani, Mourad, Sadeghi, Amin, Sencar, Husrev Taha, Shinoy, Mohammed, Sinan, Omar, Zhang, Yifan
Abstract
We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, with Arabic having only ~0.5% of web data despite 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints. At the core is Fanar-27B, continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The speech family Aura gains a long-form ASR model for hours-long audio. Oryx vision family adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq utilizes a multi-agent architecture for Islamic content. Fanar-Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.
cs.CL / 38 / 2603.16406
Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic
Abstract
This paper evaluates current Large Language Model (LLM) benchmarking for Icelandic, identifies problems, and calls for improved evaluation methods, in low/medium-resource languages in particular. We show that benchmarks built on synthetic or machine-translated data that have not been verified in any way commonly contain severely flawed test examples that are likely to skew the results and undermine the tests' validity. We warn against the use of such methods without verification in low/medium-resource settings, as the translation quality can, at best, only be as good as MT quality for a given language at any given time. Indeed, the results of our quantitative error analysis on existing benchmarks for Icelandic show clear differences between human-authored/-translated benchmarks and synthetic or machine-translated ones.
cs.CL / 39 / 2603.16410
PlotTwist: A Creative Plot Generation Framework with Small Language Models
Abstract
Creative plot generation presents a fundamental challenge for language models: transforming a concise premise into a coherent narrative that sustains global structure, character development, and emotional resonance. Although recent Large Language Models (LLMs) demonstrate strong fluency across general-purpose tasks, they typically require preference alignment to perform well on specialized domains such as creative plot generation. However, conducting such alignment at the scale of frontier LLMs is computationally prohibitive, significantly limiting accessibility and practical deployment. To address this, we present PlotTwist, a structured framework that enables Small Language Models (SLMs) with $\leq$ 5B active parameters to generate high-quality, premise-conditioned plots competitive with frontier systems up to $200\times$ larger. Our approach decomposes generation into three specialized components: (1) an Aspect Rating Reward Model trained via a novel Positive-Negative prompting strategy to deliver structured narratives across five Narrative Quality Dimensions (NQDs); (2) a Mixture-of-Experts (MoE) plot generator aligned via Direct Preference Optimization on high-confidence preference pairs; and (3) an Agentic Evaluation module that emulates human critical judgment for unbiased post-hoc assessment. Extensive experiments demonstrate that PlotTwist consistently outperforms frontier models across multiple NQDs despite substantially tighter capacity constraints. Further validation confirms strong sensitivity to narrative quality, as the framework reliably distinguishes plots derived from critically acclaimed versus widely panned screenplays. Together, these results establish structured, preference-based alignment as a resource-efficient approach to high-quality creative plot generation.
cs.CL / 40 / 2603.16411
RECOVER: Robust Entity Correction via agentic Orchestration of hypothesis Variants for Evidence-based Recovery
Abstract
Entity recognition in Automatic Speech Recognition (ASR) is challenging for rare and domain-specific terms. In domains such as finance, medicine, and air traffic control, these errors are costly. If the entities are entirely absent from the ASR output, post-ASR correction becomes difficult. To address this, we introduce RECOVER, an agentic correction framework that serves as a tool-using agent. It leverages multiple hypotheses from ASR as evidence, retrieves relevant entities, and applies Large Language Model (LLM) correction under constraints. The hypotheses are combined using different strategies, namely 1-Best, Entity-Aware Select, Recognizer Output Voting Error Reduction (ROVER) Ensemble, and LLM-Select. Evaluated across five diverse datasets, RECOVER achieves 8-46% relative reductions in entity-phrase word error rate (E-WER) and increases recall by up to 22 percentage points. LLM-Select achieves the best overall performance in entity correction while maintaining overall WER.
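The ROVER Ensemble strategy combines hypotheses by voting; a toy sketch that assumes pre-aligned, equal-length hypotheses (real ROVER first builds a word transition network via alignment):

```python
from collections import Counter

def rover_vote(hypotheses):
    """Majority vote per word position over already-aligned ASR hypotheses."""
    words = [h.split() for h in hypotheses]
    assert len({len(w) for w in words}) == 1, "toy version needs equal lengths"
    return " ".join(Counter(col).most_common(1)[0][0] for col in zip(*words))

hyps = ["transfer to acme bank", "transfer to acne bank", "transfer to acme bank"]
combined = rover_vote(hyps)
```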
cs.CL / 41 / 2603.16415
IndexRAG: Bridging Facts for Cross-Document Reasoning at Index Time
Abstract
Multi-hop question answering (QA) requires reasoning across multiple documents, yet existing retrieval-augmented generation (RAG) approaches address this either through graph-based methods requiring additional online processing or iterative multi-step reasoning. We present IndexRAG, a novel approach that shifts cross-document reasoning from online inference to offline indexing. IndexRAG identifies bridge entities shared across documents and generates bridging facts as independently retrievable units, requiring no additional training or fine-tuning. Experiments on three widely-used multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue) show that IndexRAG improves F1 over Naive RAG by 4.6 points on average, while requiring only single-pass retrieval and a single LLM call at inference time. When combined with IRCoT, IndexRAG outperforms all graph-based baselines on average, including HippoRAG and FastGraphRAG, while relying solely on flat retrieval. Our code will be released upon acceptance.
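The bridge-entity step can be sketched as finding entities shared across documents at index time; entity extraction itself is assumed already done, and the documents below are invented examples:

```python
from collections import defaultdict

def bridge_entities(entities_by_doc):
    """Entities mentioned in two or more documents are candidate bridges;
    IndexRAG would generate a retrievable bridging fact for each one."""
    docs_of = defaultdict(set)
    for doc, ents in entities_by_doc.items():
        for e in ents:
            docs_of[e].add(doc)
    return {e: sorted(d) for e, d in docs_of.items() if len(d) >= 2}

docs = {"d1": {"Marie Curie", "Sorbonne"}, "d2": {"Sorbonne", "Paris"}}
bridges = bridge_entities(docs)
```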
Chinese Translation
多跳问答(QA)需要跨多个文档进行推理,然而现有的检索增强生成(RAG)方法要么通过图基方法进行额外的在线处理,要么依赖于迭代的多步推理。我们提出了IndexRAG,这是一种新颖的方法,将跨文档推理从在线推理转移到离线索引。IndexRAG识别跨文档共享的桥接实体,并生成作为独立可检索单元的桥接事实,无需额外的训练或微调。在三个广泛使用的多跳QA基准(HotpotQA、2WikiMultiHopQA、MuSiQue)上的实验表明,IndexRAG在F1得分上平均比Naive RAG提高了4.6分,同时在推理时仅需单次检索和一次LLM调用。当与IRCoT结合时,IndexRAG在所有图基基线(包括HippoRAG和FastGraphRAG)上平均表现更佳,同时仅依赖于平面检索。我们的代码将在接受后发布。
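The bridge-entity identification step described above can be sketched in a few lines. This is an illustrative toy over pre-extracted entity sets; IndexRAG's actual pipeline extracts entities and writes the bridging facts with an LLM at index time.

```python
def find_bridge_entities(doc_entities):
    """Return entities shared by more than one document.

    doc_entities: {doc_id: set of entity strings}.
    """
    seen = {}
    for doc_id, entities in doc_entities.items():
        for e in entities:
            seen.setdefault(e, set()).add(doc_id)
    return {e: docs for e, docs in seen.items() if len(docs) > 1}

# Hypothetical mini-corpus
docs = {
    "d1": {"Marie Curie", "University of Paris"},
    "d2": {"University of Paris", "Sorbonne"},
}
bridges = find_bridge_entities(docs)
print(bridges)  # {'University of Paris': {'d1', 'd2'}}
```

Each bridge entity then anchors a generated bridging fact stored as its own retrievable unit, so a single-pass retriever can surface cross-document connections without online graph traversal.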
cs.CL / 42 / 2603.16430
EngGPT2: Sovereign, Efficient and Open Intelligence
EngGPT2:主权、高效与开放的智能
Abstract
EngGPT2-16B-A3B is the latest iteration of Engineering Group's Italian LLM and is built to be a Sovereign, Efficient and Open model. EngGPT2 is trained on 2.5 trillion tokens - less than Qwen3's 36T or Llama3's 15T - and delivers performance on key benchmarks, including MMLU-Pro, GSM8K, IFEval and HumanEval, comparable to dense models in the 8B-16B range, while requiring one-fifth to half of the inference compute, and between one-tenth and one-sixth of the training data and the consequent training compute. Designed as a trained-from-scratch Mixture-of-Experts (MoE) architecture, EngGPT2 features 16 billion parameters with 3 billion active per inference, with expert sizes positioned between those used in GPT-OSS and Qwen3. Approximately 25% of its training corpus consists of Italian-language data, to deliver strong capabilities for European and Italian NLP tasks among models of similar scale. This efficiency aims to position EngGPT2 as a key contributor to the growing portfolio of open-weight European models, combining performance and efficiency with full alignment to the EU AI Act. EngGPT2 is also a single model capable of multiple reasoning modes: non-reasoning, reasoning in Italian or English, and turbo-reasoning (a concise, bullet-point style reasoning available in both languages designed for real-time reasoning use cases). EngGPT2 aims to set a new standard for resource-conscious, high-performance LLMs tailored to European and Italian contexts.
Chinese Translation
EngGPT2-16B-A3B是工程组最新迭代的意大利大型语言模型(LLM),旨在成为一个主权、高效和开放的模型。EngGPT2在2.5万亿个标记上进行训练——远少于Qwen3的36万亿或Llama3的15万亿——并在关键基准测试(包括MMLU-Pro、GSM8K、IFEval和HumanEval)上表现出与8B-16B范围内的密集模型相当的性能,同时所需的推理能力仅为其五分之一到一半,训练数据和相应所需训练能力则为十分之一到六分之一。EngGPT2设计为从零开始训练的专家混合(Mixture-of-Experts, MoE)架构,具有160亿个参数,每次推理激活30亿个参数,专家规模介于GPT-OSS和Qwen3之间。其训练语料库中约25%为意大利语数据,以在类似规模的模型中提供强大的欧洲和意大利自然语言处理(NLP)任务能力。这种高效性旨在将EngGPT2定位为开放权重的欧洲模型日益增长的组合中的关键贡献者,结合性能和效率,完全符合欧盟人工智能法案(EU AI Act)。EngGPT2还是一个能够实现多种推理模式的单一模型:非推理、意大利语或英语推理,以及快速推理(以简洁的要点风格进行的推理,适用于实时推理用例,支持两种语言)。EngGPT2旨在为关注资源的高性能大型语言模型设定新的标准,特别适用于欧洲和意大利的背景。
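The "16B parameters, 3B active" figure comes from top-k expert routing: each token activates only the few experts its router selects. The sketch below shows generic top-k gating with renormalized weights; EngGPT2's actual router and expert configuration are not described beyond the abstract.

```python
import math

def topk_route(logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gates (softmax over top-k)."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exp = [math.exp(logits[i]) for i in idx]
    z = sum(exp)
    return [(i, w / z) for i, w in zip(idx, exp)]

# Hypothetical router scores for one token over four experts
routes = topk_route([0.1, 2.0, -1.0, 1.0], k=2)
print(routes)  # experts 1 and 3 carry all the gate mass
```

Only the selected experts run their feed-forward computation for that token, which is why active parameters (and inference compute) stay far below the total parameter count.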
cs.CL / 43 / 2603.16435
VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization
VQKV:通过向量量化实现高保真度和高压缩比的缓存压缩
Abstract
The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-limited environments. Prior training-free approaches for KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to simultaneously achieve high compression ratios and high reconstruction fidelity. We propose VQKV, a novel, training-free method introducing vector quantization (VQ) to obtain highly compressed KV representations while preserving high model fidelity, allowing for the representation of thousands of floating-point values with just a few integer indices. As a result, VQKV achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of the baseline performance on LongBench and enabling 4.3x longer generation length on the same memory footprint.
Chinese Translation
大型语言模型(LLMs)上下文长度的不断增长使得键值(KV)缓存的规模扩大,这限制了在资源受限环境中的部署。以往的无训练方法进行KV缓存压缩通常依赖于低秩近似或标量量化,这些方法无法同时实现高压缩比和高重构保真度。我们提出了VQKV,一种新颖的无训练方法,通过引入向量量化(VQ)来获得高度压缩的KV表示,同时保持高模型保真度,使得仅用少量整数索引即可表示数千个浮点值。因此,VQKV在LLaMA3.1-8B上实现了82.8%的压缩比,同时在LongBench上保留了98.6%的基线性能,并在相同的内存占用下实现了4.3倍更长的生成长度。
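The core vector-quantization idea, representing blocks of floating-point values with a few integer indices into a shared codebook, can be sketched as a nearest-codeword lookup. The codebook here is fixed and tiny for illustration; VQKV's codebook construction and block layout follow the paper, not this toy.

```python
def vq_encode(vectors, codebook):
    """Replace each vector by the index of its nearest codeword (squared L2)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda j: dist2(v, codebook[j]))
            for v in vectors]

# Hypothetical 2-d "KV entries" and a 3-entry codebook
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
codes = vq_encode([[0.9, 1.1], [0.1, -0.2], [-0.8, 0.9]], codebook)
print(codes)  # one small integer stored per vector
```

Storage drops from d floats per vector to one index (plus the shared codebook), which is the mechanism behind the high compression ratios quoted in the abstract.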
cs.CL / 44 / 2603.16459
DynHD: Hallucination Detection for Diffusion Large Language Models via Denoising Dynamics Deviation Learning
DynHD:通过去噪动态偏差学习检测扩散大语言模型中的幻觉
Abstract
Diffusion large language models (D-LLMs) have emerged as a promising alternative to auto-regressive models due to their iterative refinement capabilities. However, hallucinations remain a critical issue that hinders their reliability. To detect hallucination responses from model outputs, token-level uncertainty (e.g., entropy) has been widely used as an effective signal to indicate potential factual errors. Nevertheless, the fixed-length generation paradigm of D-LLMs implies that tokens contribute unevenly to hallucination detection, with only a small subset providing meaningful signals. Moreover, the evolution trend of uncertainty throughout the diffusion process can also provide important signals, highlighting the necessity of modeling its denoising dynamics for hallucination detection. In this paper, we propose DynHD, which bridges these gaps from both spatial (token sequence) and temporal (denoising dynamics) perspectives. To address the information density imbalance across tokens, we propose a semantic-aware evidence construction module that extracts hallucination-indicative signals by filtering out non-informative tokens and emphasizing semantically meaningful ones. To model denoising dynamics for hallucination detection, we introduce a reference evidence generator that learns the expected evolution trajectory of uncertainty evidence, along with a deviation-based hallucination detector that makes predictions by measuring the discrepancy between the observed and reference trajectories. Extensive experiments demonstrate that DynHD consistently outperforms state-of-the-art baselines while achieving higher efficiency across multiple benchmarks and backbone models.
Chinese Translation
扩散大语言模型(D-LLMs)因其迭代优化能力而成为自回归模型的有前景替代方案。然而,幻觉仍然是阻碍其可靠性的关键问题。为了检测模型输出中的幻觉响应,标记级的不确定性(例如,熵)已被广泛用作指示潜在事实错误的有效信号。然而,D-LLMs的固定长度生成范式意味着标记在幻觉检测中的贡献不均匀,只有一小部分标记提供有意义的信号。此外,不确定性在扩散过程中的演变趋势也可以提供重要信号,突显了对其去噪动态建模的必要性。本文提出了DynHD,从空间(标记序列)和时间(去噪动态)两个角度弥补这些空白。为了解决标记间信息密度不平衡的问题,我们提出了一种语义感知证据构建模块,通过过滤掉非信息性标记并强调语义上有意义的标记来提取指示幻觉的信号。为了建模去噪动态以进行幻觉检测,我们引入了一个参考证据生成器,该生成器学习不确定性证据的预期演变轨迹,以及一个基于偏差的幻觉检测器,通过测量观察轨迹与参考轨迹之间的差异来进行预测。大量实验表明,DynHD在多个基准和主干模型上始终优于最先进的基线,同时实现更高的效率。
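The token-level uncertainty signal mentioned above is typically the Shannon entropy of the model's next-token distribution; a minimal version:

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of a next-token distribution.

    Higher entropy = the model is less sure, a common signal for
    flagging potentially hallucinated tokens.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = token_entropy([0.97, 0.01, 0.01, 0.01])
uncertain = token_entropy([0.25, 0.25, 0.25, 0.25])
print(confident < uncertain)  # True
```

DynHD goes beyond this static score by tracking how such uncertainty evolves across denoising steps and comparing the observed trajectory to a learned reference one.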
cs.CL / 45 / 2603.16483
On the Emotion Understanding of Synthesized Speech
合成语音的情感理解
Abstract
Emotion is a core paralinguistic feature in voice interaction. It is widely believed that emotion understanding models learn fundamental representations that transfer to synthesized speech, making emotion understanding results a plausible reward or evaluation metric for assessing emotional expressiveness in speech synthesis. In this work, we critically examine this assumption by systematically evaluating Speech Emotion Recognition (SER) on synthesized speech across datasets, discriminative and generative SER models, and diverse synthesis models. We find that current SER models can not generalize to synthesized speech, largely because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech. Moreover, generative Speech Language Models (SLMs) tend to infer emotion from textual semantics while ignoring paralinguistic cues. Overall, our findings suggest that existing SER models often exploit non-robust shortcuts rather than capturing fundamental features, and paralinguistic understanding in SLMs remains challenging.
Chinese Translation
情感是语音交互中的核心副语言特征。人们普遍认为,情感理解模型学习的基本表征可以转移到合成语音中,使得情感理解结果成为评估语音合成中情感表现力的合理奖励或评估指标。在本研究中,我们通过系统评估合成语音上的语音情感识别(SER),对这一假设进行了深入检验,涉及不同的数据集、判别性和生成性SER模型以及多样的合成模型。我们发现,当前的SER模型无法推广到合成语音,主要是因为合成过程中的语音标记预测导致合成语音与人类语音之间的表征不匹配。此外,生成性语音语言模型(SLMs)往往从文本语义中推断情感,而忽视了副语言线索。总体而言,我们的研究结果表明,现有的SER模型往往利用非稳健的捷径,而不是捕捉基本特征,且SLMs中的副语言理解仍然具有挑战性。
cs.CL / 46 / 2603.16496
AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents
AdaMem:面向用户的自适应记忆框架用于长时对话代理
Abstract
Large language model (LLM) agents increasingly rely on external memory to support long-horizon interaction, personalized assistance, and multi-step reasoning. However, existing memory systems still face three core challenges: they often rely too heavily on semantic similarity, which can miss evidence crucial for user-centric understanding; they frequently store related experiences as isolated fragments, weakening temporal and causal coherence; and they typically use static memory granularities that do not adapt well to the requirements of different questions. We propose AdaMem, an adaptive user-centric memory framework for long-horizon dialogue agents. AdaMem organizes dialogue history into working, episodic, persona, and graph memories, enabling the system to preserve recent context, structured long-term experiences, stable user traits, and relation-aware connections within a unified framework. At inference time, AdaMem first resolves the target participant, then builds a question-conditioned retrieval route that combines semantic retrieval with relation-aware graph expansion only when needed, and finally produces the answer through a role-specialized pipeline for evidence synthesis and response generation. We evaluate AdaMem on the LoCoMo and PERSONAMEM benchmarks for long-horizon reasoning and user modeling. Experimental results show that AdaMem achieves state-of-the-art performance on both benchmarks. The code will be released upon acceptance.
Chinese Translation
大型语言模型(LLM)代理越来越依赖外部记忆来支持长时交互、个性化辅助和多步骤推理。然而,现有的记忆系统仍面临三个核心挑战:它们往往过于依赖语义相似性,这可能会忽视对用户中心理解至关重要的证据;它们经常将相关经历存储为孤立的片段,削弱了时间和因果一致性;并且它们通常使用静态的记忆粒度,无法很好地适应不同问题的要求。我们提出了AdaMem,一个面向用户的自适应记忆框架,旨在为长时对话代理提供支持。AdaMem将对话历史组织为工作记忆、情节记忆、角色记忆和图记忆,使系统能够在统一框架内保留近期上下文、结构化的长期经历、稳定的用户特征和关系感知的连接。在推理时,AdaMem首先确定目标参与者,然后构建一个基于问题的检索路径,该路径仅在需要时结合语义检索和关系感知的图扩展,最后通过一个角色专用的管道生成证据合成和响应生成的答案。我们在LoCoMo和PERSONAMEM基准上评估了AdaMem,以进行长时推理和用户建模。实验结果表明,AdaMem在这两个基准上都达到了最先进的性能。代码将在接受后发布。
cs.CL / 47 / 2603.16544
How often do Answers Change? Estimating Recency Requirements in Question Answering
答案变化的频率有多高?在问答中估计时效性要求
Abstract
Large language models (LLMs) often rely on outdated knowledge when answering time-sensitive questions, leading to confident yet incorrect responses. Without explicit signals indicating whether up-to-date information is required, models struggle to decide when to retrieve external evidence, how to reason about stale facts, and how to rank answers by their validity. Existing benchmarks either periodically refresh answers or rely on fixed templates, but they do not reflect on how frequently answers change or whether a question inherently requires up-to-date information. To address this gap, we introduce a recency-stationarity taxonomy that categorizes questions by how often their answers change and whether this change frequency is time-invariant or context-dependent. Building on this taxonomy, we present RecencyQA, a dataset of 4,031 open-domain questions annotated with recency and stationarity labels. Through human evaluation and empirical analysis, we show that non-stationary questions, i.e., those where context changes the recency requirement, are significantly more challenging for LLMs, with difficulty increasing as update frequency rises. By explicitly modeling recency and context dependence, RecencyQA enables fine-grained benchmarking and analysis of temporal reasoning beyond binary notions of freshness, and provides a foundation for developing recency-aware and context-sensitive question answering systems.
Chinese Translation
大型语言模型(LLMs)在回答时间敏感问题时常常依赖过时的知识,这导致了自信但错误的回答。由于缺乏明确的信号来指示是否需要最新的信息,模型在决定何时检索外部证据、如何推理过时事实以及如何根据有效性对答案进行排序时面临困难。现有的基准测试要么定期更新答案,要么依赖固定模板,但它们未能反映答案变化的频率或问题是否本质上需要最新信息。为了解决这一问题,我们引入了一种时效性-平稳性分类法,将问题按答案变化的频率以及这种变化频率是否是时间不变或依赖于上下文进行分类。在此分类法的基础上,我们提出了RecencyQA,这是一个包含4,031个开放领域问题的数据集,标注了时效性和平稳性标签。通过人工评估和实证分析,我们表明非平稳性问题,即那些上下文改变时效性要求的问题,对于LLMs来说显著更具挑战性,随着更新频率的提高,难度也在增加。通过明确建模时效性和上下文依赖性,RecencyQA使得超越新鲜度的二元概念进行细粒度基准测试和时效推理分析成为可能,并为开发时效感知和上下文敏感的问答系统奠定了基础。
cs.CL / 48 / 2603.16546
DanceHA: A Multi-Agent Framework for Document-Level Aspect-Based Sentiment Analysis
DanceHA:一种用于文档级基于方面的情感分析的多智能体框架
Abstract
Aspect-Based Sentiment Intensity Analysis (ABSIA) has garnered increasing attention, though research largely focuses on domain-specific, sentence-level settings. In contrast, document-level ABSIA--particularly in addressing complex tasks like extracting Aspect-Category-Opinion-Sentiment-Intensity (ACOSI) tuples--remains underexplored. In this work, we introduce DanceHA, a multi-agent framework designed for open-ended, document-level ABSIA with informal writing styles. DanceHA has two main components: Dance, which employs a divide-and-conquer strategy to decompose the long-context ABSIA task into smaller, manageable sub-tasks for collaboration among specialized agents; and HA, Human-AI collaboration for annotation. We release Inf-ABSIA, a multi-domain document-level ABSIA dataset featuring fine-grained and high-accuracy labels from DanceHA. Extensive experiments demonstrate the effectiveness of our agentic framework and show that the multi-agent knowledge in DanceHA can be effectively transferred into student models. Our results highlight the importance of the overlooked informal styles in ABSIA, as they often intensify opinions tied to specific aspects.
Chinese Translation
基于方面的情感强度分析(ABSIA)越来越受到关注,尽管研究主要集中在特定领域的句子级设置上。相比之下,文档级ABSIA,特别是在处理复杂任务如提取方面-类别-观点-情感强度(ACOSI)元组方面,仍然未得到充分探索。在本研究中,我们介绍了DanceHA,这是一种旨在开放式、文档级ABSIA的多智能体框架,适用于非正式写作风格。DanceHA有两个主要组成部分:Dance,它采用分而治之的策略,将长上下文的ABSIA任务分解为更小、更易管理的子任务,以便专门化的智能体之间进行协作;HA,即人机协作进行注释。我们发布了Inf-ABSIA,这是一个多领域的文档级ABSIA数据集,具有来自DanceHA的细粒度和高准确度标签。大量实验表明我们的智能体框架的有效性,并显示DanceHA中的多智能体知识可以有效转移到学生模型中。我们的结果强调了在ABSIA中被忽视的非正式风格的重要性,因为它们通常会加剧与特定方面相关的观点。
cs.CL / 49 / 2603.16553
EmoLLM: Appraisal-Grounded Cognitive-Emotional Co-Reasoning in Large Language Models
EmoLLM:基于评估的认知-情感共推理在大型语言模型中的应用
Abstract
Large language models (LLMs) demonstrate strong cognitive intelligence (IQ), yet many real-world interactions also require emotional intelligence (EQ) to produce responses that are both factually reliable and emotionally appropriate. In settings such as emotional support, technical assistance, and consultation, effective dialogue depends on how situations are appraised with respect to the user's needs, goals, and coping capacity. Inspired by appraisal theory, we propose EmoLLM, an appraisal-grounded framework for IQ/EQ co-reasoning in dialogue. EmoLLM uses an explicit Appraisal Reasoning Graph (ARG) to structure intermediate reasoning over contextual facts, inferred user needs, appraisal dimensions, emotional states, and response strategies before generating a reply. We train EmoLLM in a multi-turn role-play environment with reinforcement learning, where reverse-perspective reasoning provides reward signals based on predicted user-side consequences of responses. Across diverse dialogue settings, EmoLLM improves emotional state outcomes and response quality over strong baselines while preserving strong factual reliability.
Chinese Translation
大型语言模型(LLMs)展现出强大的认知智能(IQ),然而许多现实世界的互动同样需要情感智能(EQ),以产生既可靠又情感适宜的回应。在情感支持、技术援助和咨询等场景中,有效的对话依赖于对用户需求、目标和应对能力的评估。受评估理论的启发,我们提出了EmoLLM,一个基于评估的IQ/EQ共推理框架。EmoLLM使用显式的评估推理图(ARG)来结构化对上下文事实、推断的用户需求、评估维度、情感状态和回应策略的中间推理,然后生成回复。我们在一个多轮角色扮演环境中通过强化学习训练EmoLLM,其中反向视角推理根据对用户反应后果的预测提供奖励信号。在多样的对话场景中,EmoLLM在保持强大事实可靠性的同时,改善了情感状态结果和回应质量,超越了强基线。
cs.CL / 50 / 2603.16567
Characterizing Delusional Spirals through Human-LLM Chat Logs
通过人类与大型语言模型聊天记录表征妄想螺旋
Abstract
As large language models (LLMs) have proliferated, disturbing anecdotal reports of negative psychological effects, such as delusions, self-harm, and "AI psychosis," have emerged in global media and legal discourse. However, it remains unclear how users and chatbots interact over the course of lengthy delusional "spirals," limiting our ability to understand and mitigate the harm. In our work, we analyze logs of conversations with LLM chatbots from 19 users who report having experienced psychological harms from chatbot use. Many of our participants come from a support group for such chatbot users. We also include chat logs from participants covered by media outlets in widely-distributed stories about chatbot-reinforced delusions. In contrast to prior work that speculates on potential AI harms to mental health, to our knowledge we present the first in-depth study of such high-profile and veridically harmful cases. We develop an inventory of 28 codes and apply it to the 391,562 messages in the logs. Codes include whether a user demonstrates delusional thinking (15.5% of user messages), a user expresses suicidal thoughts (69 validated user messages), or a chatbot misrepresents itself as sentient (21.2% of chatbot messages). We analyze the co-occurrence of message codes. We find, for example, that messages that declare romantic interest and messages where the chatbot describes itself as sentient occur much more often in longer conversations, suggesting that these topics could promote or result from user over-engagement and that safeguards in these areas may degrade in multi-turn settings. We conclude with concrete recommendations for how policymakers, LLM chatbot developers, and users can use our inventory and conversation analysis tool to understand and mitigate harm from LLM chatbots. Warning: This paper discusses self-harm, trauma, and violence.
Chinese Translation
随着大型语言模型(LLMs)的普及,全球媒体和法律讨论中出现了关于负面心理影响的令人不安的轶事报告,例如妄想、自残和“人工智能精神病”。然而,用户与聊天机器人在漫长的妄想“螺旋”中的互动方式仍不清楚,这限制了我们理解和减轻伤害的能力。在我们的研究中,我们分析了来自19位报告在使用聊天机器人时经历心理伤害的用户的聊天记录。这些参与者中的许多人来自一个支持该类聊天用户的团体。我们还包括了在广泛传播的关于聊天机器人强化妄想的故事中被媒体报道的参与者的聊天记录。与之前关于人工智能对心理健康潜在危害的推测性研究不同,据我们所知,我们呈现了对这些高关注度且真实有害案例的首次深入研究。我们开发了一个包含28个代码的清单,并将其应用于日志中的391,562条消息。这些代码包括用户是否表现出妄想思维(15.5%的用户消息)、用户是否表达自杀想法(69条经过验证的用户消息),或聊天机器人是否错误地将自己描述为有知觉(21.2%的聊天机器人消息)。我们分析了消息代码的共现情况。例如,我们发现,表达浪漫兴趣的消息和聊天机器人将自己描述为有知觉的消息在较长的对话中出现得更为频繁,这表明这些主题可能促进或源于用户的过度参与,并且在多轮对话中这些领域的安全保障可能会降低。我们最后提出了具体建议,供政策制定者、大型语言模型聊天机器人开发者和用户使用我们的清单和对话分析工具,以理解和减轻来自大型语言模型聊天机器人的伤害。警告:本文讨论自残、创伤和暴力。
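The code co-occurrence analysis described in the abstract amounts to counting how often pairs of annotation codes land on the same message. A minimal sketch (the code names below are hypothetical shorthand, not the paper's inventory):

```python
from collections import Counter
from itertools import combinations

def code_cooccurrence(messages):
    """Count pairs of annotation codes appearing on the same message.

    messages: list of code sets, one set per message.
    """
    pairs = Counter()
    for codes in messages:
        for a, b in combinations(sorted(codes), 2):
            pairs[(a, b)] += 1
    return pairs

msgs = [{"romantic", "sentient"}, {"romantic"}, {"romantic", "sentient"}]
counts = code_cooccurrence(msgs)
print(counts[("romantic", "sentient")])  # 2
```

Comparing such pair counts across conversation-length buckets is what surfaces findings like romantic-interest and self-described-sentience codes clustering in longer conversations.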
cs.CL / 51 / 2603.16574
Diverging Transformer Predictions for Human Sentence Processing: A Comprehensive Analysis of Agreement Attraction Effects
人类句子处理中的Transformer预测分歧:一致性吸引效应的综合分析
Abstract
Transformers underlie almost all state-of-the-art language models in computational linguistics, yet their cognitive adequacy as models of human sentence processing remains disputed. In this work, we use a surprisal-based linking mechanism to systematically evaluate eleven autoregressive transformers of varying sizes and architectures on a more comprehensive set of English agreement attraction configurations than prior work. Our experiments yield mixed results: While transformer predictions generally align with human reading time data for prepositional phrase configurations, performance degrades significantly on object-extracted relative clause configurations. In the latter case, predictions also diverge markedly across models, and no model successfully replicates the asymmetric interference patterns observed in humans. We conclude that current transformer models do not explain human morphosyntactic processing, and that evaluations of transformers as cognitive models must adopt rigorous, comprehensive experimental designs to avoid spurious generalizations from isolated syntactic configurations or individual models.
Chinese Translation
Transformer是计算语言学中几乎所有最先进语言模型的基础,但作为人类句子处理模型的认知适用性仍存在争议。在本研究中,我们使用基于惊讶度的连接机制,系统地评估了十一种不同大小和架构的自回归Transformer在比以往研究更全面的英语一致性吸引配置上的表现。我们的实验得出了好坏参半的结果:虽然对于介词短语配置,Transformer的预测通常与人类阅读时间数据一致,但在对象提取的关系从句配置上,性能显著下降。在后者的情况下,各模型的预测也明显分歧,且没有任何模型成功复制人类观察到的不对称干扰模式。我们得出结论,当前的Transformer模型无法解释人类的形态句法处理,并且对Transformer作为认知模型的评估必须采用严格、全面的实验设计,以避免从孤立的句法配置或单个模型中得出虚假的概括。
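The surprisal-based linking mechanism used here is standard: a word's processing difficulty is linked to its surprisal, the negative log-probability the model assigns it in context.

```python
import math

def surprisal(prob):
    """Surprisal in bits: -log2 P(word | context).

    Under surprisal-based linking theories, human reading time
    increases with this quantity.
    """
    return -math.log2(prob)

# A predictable continuation vs. an attraction-induced surprise
print(surprisal(0.5), surprisal(0.01))  # 1.0 vs ~6.64 bits
```

In the paper's setup, per-word surprisals from each transformer are compared against human reading-time patterns across agreement attraction configurations.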
cs.CL / 52 / 2603.16590
BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization
BATQuant:通过可学习的块级优化实现抗异常值的MXFP4量化
Abstract
Microscaling floating-point (MXFP) formats have emerged as a promising standard for deploying Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) on modern accelerator architectures. However, existing Post-Training Quantization (PTQ) methods, particularly rotation-based techniques designed for integer formats, suffer from severe performance collapse when applied to MXFP4. Recent studies attribute this failure to a fundamental format mismatch: global orthogonal rotations inadvertently transfer outlier energy across quantization blocks, inducing new outliers that disrupt local block-wise scaling, while often creating bimodal activation distributions that underutilize the limited quantization range. To address these issues, we propose BATQuant (Block-wise Affine Transformation), which restricts transformations to align with MXFP granularity to prevent cross-block outlier propagation, while relaxing orthogonality constraints to optimize distribution shaping. To ensure parameter efficiency, we introduce Global and Private Kronecker (GPK) decomposition to effectively reduce storage and runtime overhead and incorporate Block-wise Learnable Clipping to suppress residual outliers. Extensive experiments on both MLLMs and LLMs demonstrate that BATQuant establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.
Chinese Translation
微缩浮点(MXFP)格式已成为在现代加速器架构上部署多模态大语言模型(MLLMs)和大语言模型(LLMs)的有前景的标准。然而,现有的后训练量化(PTQ)方法,特别是针对整数格式设计的基于旋转的技术,在应用于MXFP4时遭遇严重的性能崩溃。近期研究将这一失败归因于根本的格式不匹配:全局正交旋转无意中在量化块之间转移异常值能量,导致新的异常值产生,从而干扰局部块级缩放,同时常常造成双峰激活分布,未能充分利用有限的量化范围。为了解决这些问题,我们提出了BATQuant(块级仿射变换),该方法限制变换与MXFP粒度对齐,以防止跨块异常值传播,同时放宽正交约束以优化分布形状。为了确保参数效率,我们引入了全局和私有克罗内克(GPK)分解,有效减少存储和运行时开销,并结合块级可学习剪切以抑制残余异常值。在对MLLMs和LLMs进行的广泛实验中,BATQuant在激进的W4A4KV16配置下建立了新的最先进结果,在多模态基准测试中恢复了高达96.43%的全精度性能,并在多种任务中明显优于现有方法。
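The block-wise scaling at the heart of MXFP formats, and of why cross-block outlier propagation is harmful, can be shown with a toy quantizer: each block shares one scale, so an outlier only coarsens its own block. This uses a toy integer grid, not the actual MXFP4 encoding.

```python
def blockwise_quantize(values, block=4, levels=7):
    """Quantize with one shared scale per block (MXFP-style granularity)."""
    out = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        scale = max(abs(v) for v in chunk) / levels or 1.0
        out.extend(round(v / scale) * scale for v in chunk)
    return out

# First block contains an outlier; second block is well-behaved
x = [0.1, -0.2, 0.15, 100.0, 0.1, -0.2, 0.15, 0.05]
q = blockwise_quantize(x)
print(q[4:])  # second block keeps fine resolution despite the outlier
```

A global rotation that leaks outlier energy into the second block would inflate its scale too, which is exactly the failure mode BATQuant's block-aligned transformations avoid.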
cs.CL / 53 / 2603.16601
Tarab: A Multi-Dialect Corpus of Arabic Lyrics and Poetry
Tarab:阿拉伯歌词和诗歌的多方言语料库
Abstract
We introduce the Tarab Corpus, a large-scale cultural and linguistic resource that brings together Arabic song lyrics and poetry within a unified analytical framework. The corpus comprises 2.56 million verses and more than 13.5 million tokens, making it, to our knowledge, the largest open Arabic corpus of creative text spanning both classical and contemporary production. Tarab is broadly balanced between songs and poems and covers Classical Arabic, Modern Standard Arabic (MSA), and six major regional varieties: Egyptian, Gulf, Levantine, Iraqi, Sudanese, and Maghrebi Arabic. The artists and poets represented in the corpus are associated with 28 modern nation states and multiple historical eras, covering over fourteen centuries of Arabic creative expression from the Pre-Islamic period to the twenty-first century. Each verse is accompanied by structured metadata describing linguistic variety, geographic origin, and historical or cultural context, enabling comparative linguistic, stylistic, and diachronic analysis across genres and time. We describe the data collection, normalisation, and validation pipeline and present baseline analyses for variety identification and genre differentiation. The dataset is publicly available on HuggingFace at https://huggingface.co/datasets/drelhaj/Tarab.
Chinese Translation
我们介绍了Tarab语料库,这是一个大规模的文化和语言资源,将阿拉伯歌曲歌词和诗歌整合在一个统一的分析框架内。该语料库包含256万行诗句和超过1350万个词元,使其成为我们所知的最大的开放阿拉伯创意文本语料库,涵盖了古典和当代作品。Tarab在歌曲和诗歌之间保持了广泛的平衡,涵盖了古典阿拉伯语、现代标准阿拉伯语(MSA)以及六种主要的地区方言:埃及阿拉伯语、海湾阿拉伯语、黎凡特阿拉伯语、伊拉克阿拉伯语、苏丹阿拉伯语和马格里布阿拉伯语。语料库中的艺术家和诗人与28个现代国家和多个历史时期相关,涵盖了从前伊斯兰时期到21世纪的超过十四个世纪的阿拉伯创意表达。每行诗句都附有结构化的元数据,描述语言变体、地理来源以及历史或文化背景,使得跨体裁和时间的比较语言学、风格学和历时分析成为可能。我们描述了数据收集、规范化和验证流程,并呈现了变体识别和体裁区分的基线分析。该数据集在HuggingFace上公开可用,网址为https://huggingface.co/datasets/drelhaj/Tarab。
cs.CL / 54 / 2603.16606
Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech
全语言SONAR:跨语言和跨模态句子嵌入连接大规模多语言文本和语音
Abstract
Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual and cross-modal sentence embedding models that natively embed text, speech, code, and mathematical expressions in a single semantic space, while delivering state-of-the-art downstream performance at the scale of thousands of languages, from high-resource to extremely low-resource varieties. To reach this scale without representation collapse, we use progressive training. We first learn a strong foundational space for 200 languages with an LLM-initialized encoder-decoder, combining token-level decoding with a novel split-softmax contrastive loss and synthetic hard negatives. Building on this foundation, we expand to several thousands language varieties via a two-stage teacher-student encoder distillation framework. Finally, we demonstrate the cross-modal extensibility of this space by seamlessly mapping 177 spoken languages into it. OmniSONAR halves cross-lingual similarity search error on the 200-language FLORES dataset and reduces error by a factor of 15 on the 1,560-language BIBLE benchmark. It also enables strong translation, outperforming NLLB-3B on multilingual benchmarks and exceeding prior models (including much larger LLMs) by 15 chrF++ points on 1,560 languages into English BIBLE translation. OmniSONAR also performs strongly on MTEB and XLCoST. For speech, OmniSONAR achieves a 43% lower similarity-search error and reaches 97% of SeamlessM4T speech-to-text quality, despite being zero-shot for translation (trained only on ASR data). Finally, by training an encoder-decoder LM, Spectrum, exclusively on English text processing OmniSONAR embedding sequences, we unlock high-performance transfer to thousands of languages and speech for complex downstream tasks.
Chinese Translation
跨语言句子编码器通常仅覆盖几百种语言,并且往往为了更强的对齐而牺牲下游质量,这限制了它们的应用。我们介绍了OmniSONAR,这是一种新的全语言、跨语言和跨模态句子嵌入模型系列,能够在单一语义空间中原生嵌入文本、语音、代码和数学表达式,同时在数千种语言的规模上提供最先进的下游性能,从高资源到极低资源的变种。为了在不发生表示崩溃的情况下达到这一规模,我们采用了渐进式训练。我们首先通过一个LLM初始化的编码器-解码器为200种语言学习一个强大的基础空间,结合了基于标记的解码、创新的分裂软最大对比损失和合成硬负样本。在此基础上,我们通过一个两阶段的教师-学生编码器蒸馏框架扩展到数千种语言变体。最后,我们通过无缝映射177种口语语言到这个空间,展示了该空间的跨模态可扩展性。OmniSONAR在200种语言的FLORES数据集上将跨语言相似性搜索错误降低了一半,并在1560种语言的BIBLE基准测试中将错误降低了15倍。它还实现了强大的翻译性能,在多语言基准测试中超越了NLLB-3B,并在1560种语言翻译成英语的BIBLE翻译中,比之前的模型(包括更大的LLM)提高了15个chrF++点。OmniSONAR在MTEB和XLCoST上也表现出色。在语音方面,OmniSONAR实现了43%的相似性搜索错误降低,并达到了97%的SeamlessM4T语音转文本质量,尽管在翻译方面是零样本(仅在ASR数据上训练)。最后,通过专门在处理OmniSONAR嵌入序列的英语文本上训练一个编码器-解码器语言模型Spectrum,我们解锁了对数千种语言和语音的高性能迁移,以应对复杂的下游任务。
cs.CL / 55 / 2603.16622
Domain Mixture Design via Log-Likelihood Differences for Aligning Language Models with a Target Model
通过对数似然差异进行领域混合设计,以对齐语言模型与目标模型
Abstract
Instead of directly distilling a language model, this study addresses the problem of aligning a base model with a target model in distribution by designing the domain mixture of training data for pretraining or continued pretraining as a fixed training recipe. We propose a method for determining domain weights by viewing models as points in log-likelihood space and aligning the training update direction with the direction toward the target model. Experiments with NanoGPT show that the proposed method consistently reduces the KL divergence to the target model compared with uniform weighting over the Pile. Although knowledge distillation remains more effective when available, the proposed method still achieves meaningful alignment, and downstream task performance also tends to become closer to that of the target model.
Chinese Translation
本研究并未直接提炼语言模型,而是通过设计训练数据的领域混合,以固定的训练方案解决基础模型与目标模型在分布上的对齐问题,适用于预训练或继续预训练。我们提出了一种通过将模型视为对数似然空间中的点来确定领域权重的方法,并将训练更新方向与朝向目标模型的方向对齐。与 NanoGPT 的实验表明,与对 Pile 的均匀加权相比,所提方法始终能有效减少与目标模型的 KL 散度。尽管知识蒸馏在可用时仍然更为有效,但所提方法仍能实现有意义的对齐,且下游任务的表现也趋向于更接近目标模型的表现。
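The idea of treating models as points in log-likelihood space and steering the update toward the target can be sketched with a simple heuristic: score each domain by how well its update direction projects onto the base-to-target direction. This projection rule is an illustrative sketch under that geometric framing, not the paper's exact optimization.

```python
def domain_weights(updates, base, target):
    """Weight each domain by alignment with the direction toward the target.

    updates: {domain: per-domain update vector in log-likelihood space};
    base/target: log-likelihood vectors over a probe set.
    """
    direction = [t - b for t, b in zip(target, base)]
    score = {d: max(0.0, sum(x * y for x, y in zip(u, direction)))
             for d, u in updates.items()}
    z = sum(score.values()) or 1.0
    return {d: s / z for d, s in score.items()}

# Hypothetical 2-probe setup: the target gains mostly on the first probe
w = domain_weights(
    {"web": [1.0, 0.0], "code": [0.0, 1.0]},
    base=[0.0, 0.0], target=[3.0, 1.0],
)
print(w)  # web gets 3x the weight of code
```

The resulting weights form a fixed training recipe: the domain mixture is decided once, before pretraining or continued pretraining, rather than adapted online.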
cs.CL / 56 / 2603.16643
Good Arguments Against the People Pleasers: How Reasoning Mitigates (Yet Masks) LLM Sycophancy
反对迎合者的有力论据:推理如何减轻(但掩盖)大型语言模型的谄媚行为
Abstract
Alignment techniques often inadvertently induce sycophancy in LLMs. While prior studies examined this behaviour in direct-answer settings, the role of Chain-of-Thought (CoT) reasoning remains under-explored: does it serve as a logical constraint that mitigates sycophancy, or a tool for post-hoc rationalization that masks it? We evaluate a range of models across objective and subjective tasks to investigate the issue. Results show that reasoning generally reduces sycophancy in final decisions but also masks sycophancy in some samples, where models construct deceptive justifications through logical inconsistencies, calculation errors, and one-sided arguments etc. Furthermore, LLMs are more prone to sycophancy in subjective tasks and under authority-bias. Our mechanistic analysis on three open-source models reveals that the tendency of sycophancy is dynamic during the reasoning process rather than being pre-determined at the input stage.
Chinese Translation
对齐技术常常无意中导致大型语言模型(LLMs)出现谄媚行为。尽管之前的研究在直接回答的情境中探讨了这种行为,但链式推理(Chain-of-Thought, CoT)的作用仍未得到充分研究:它是作为一种逻辑约束来减轻谄媚,还是作为一种事后合理化的工具来掩盖谄媚?我们评估了一系列模型在客观和主观任务中的表现,以探讨这一问题。结果表明,推理通常会减少最终决策中的谄媚行为,但在某些样本中也会掩盖谄媚行为,模型通过逻辑不一致、计算错误和片面论证等方式构建虚假的辩解。此外,LLMs在主观任务和权威偏见下更容易出现谄媚行为。我们对三种开源模型的机制分析表明,谄媚的倾向在推理过程中是动态变化的,而不是在输入阶段预先确定的。
cs.CL / 57 / 2603.16654
Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models
Omanic:朝着大型语言模型多跳推理的逐步评估
Abstract
Reasoning-focused large language models (LLMs) have advanced in many NLP tasks, yet their evaluation remains challenging: final answers alone do not expose the intermediate reasoning steps, making it difficult to determine whether a model truly reasons correctly and where failures occur, while existing multi-hop QA benchmarks lack step-level annotations for diagnosing reasoning failures. To address this gap, we propose Omanic, an open-domain multi-hop QA resource that provides decomposed sub-questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench). Systematic evaluations show that state-of-the-art LLMs achieve only 73.11% multiple-choice accuracy on OmanicBench, confirming its high difficulty. Stepwise analysis reveals that CoT's performance hinges on factual completeness, with its gains diminishing under knowledge gaps and errors amplifying in later hops. Additionally, supervised fine-tuning on OmanicSynth brings substantial transfer gains (7.41 average points) across six reasoning and math benchmarks, validating the dataset's quality and further supporting the effectiveness of OmanicSynth as supervision for reasoning-capability transfer. We release the data at https://huggingface.co/datasets/li-lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.
Chinese Translation
以推理为重点的大型语言模型(LLMs)在许多自然语言处理任务中取得了进展,但其评估仍然具有挑战性:最终答案本身并未揭示中间推理步骤,这使得难以确定模型是否真正正确推理以及失败发生的地方,而现有的多跳问答基准缺乏用于诊断推理失败的逐步注释。为了解决这一问题,我们提出了Omanic,一个开放领域的多跳问答资源,提供分解的子问题和中间答案作为分析推理过程的结构性注释。它包含10,296个机器生成的训练示例(OmanicSynth)和967个专家审核的人类注释评估示例(OmanicBench)。系统评估表明,最先进的LLMs在OmanicBench上的多项选择准确率仅为73.11%,确认了其高难度。逐步分析揭示了链式推理(CoT)的表现依赖于事实的完整性,其在知识缺口下的增益减小,而在后续跳跃中错误的放大。此外,在OmanicSynth上进行的监督微调在六个推理和数学基准上带来了显著的迁移增益(平均7.41分),验证了数据集的质量,并进一步支持OmanicSynth作为推理能力迁移的监督效果。我们在 https://huggingface.co/datasets/li-lab/Omanic 发布数据,并在 https://github.com/XiaojieGu/Omanic 发布代码。
cs.CL / 58 / 2603.16660
Can Linguistically Related Languages Guide LLM Translation in Low-Resource Settings?
语言相关的语言能否在低资源环境中指导大型语言模型翻译?
Abstract
Large Language Models (LLMs) have achieved strong performance across many downstream tasks, yet their effectiveness in extremely low-resource machine translation remains limited. Standard adaptation techniques typically rely on large-scale parallel data or extensive fine-tuning, which are infeasible for the long tail of underrepresented languages. In this work, we investigate a more constrained question: in data-scarce settings, to what extent can linguistically similar pivot languages and few-shot demonstrations provide useful guidance for on-the-fly adaptation in LLMs? We study a data-efficient experimental setup that combines linguistically related pivot languages with few-shot in-context examples, without any parameter updates, and evaluate translation behavior under controlled conditions. Our analysis shows that while pivot-based prompting can yield improvements in certain configurations, particularly in settings where the target language is less well represented in the model's vocabulary, the gains are often modest and sensitive to few-shot example construction. For closely related or better represented varieties, we observe diminishing or inconsistent gains. Our findings provide empirical guidance on how and when inference-time prompting and pivot-based examples can be used as a lightweight alternative to fine-tuning in low-resource translation settings.
Chinese Translation
大型语言模型(LLMs)在许多下游任务中表现出色,但在极低资源的机器翻译中,其有效性仍然有限。标准的适应技术通常依赖于大规模的平行数据或广泛的微调,而这些在代表性不足的语言的长尾中是不可行的。在本研究中,我们探讨了一个更为受限的问题:在数据稀缺的环境中,语言相似的中介语言和少量示例在多大程度上可以为LLMs的即时适应提供有用的指导?我们研究了一种数据高效的实验设置,将语言相关的中介语言与少量上下文示例结合起来,而不进行任何参数更新,并在受控条件下评估翻译行为。我们的分析表明,尽管基于中介的提示在某些配置中可以带来改进,特别是在目标语言在模型词汇中代表性较弱的情况下,但这些增益往往是适度的,并且对少量示例的构建非常敏感。对于密切相关或代表性更强的变体,我们观察到增益递减或不一致。我们的研究结果为在低资源翻译环境中如何以及何时使用推理时的提示和基于中介的示例作为微调的轻量替代方案提供了实证指导。
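The pivot-plus-few-shot setup can be sketched as a prompt-assembly routine. The following is a minimal illustration, assuming each demonstration is a (source, pivot translation, target translation) triple; the prompt template and example sentences are hypothetical, not the paper's exact format.

```python
def build_pivot_prompt(src_sentence, pivot_lang, tgt_lang, demos):
    """Assemble a few-shot prompt in which each demonstration pairs the
    source with a translation in a linguistically related pivot language.

    demos: list of (source, pivot_translation, target_translation) triples.
    """
    lines = [f"Translate into {tgt_lang}. A {pivot_lang} translation is "
             f"given as a linguistic hint."]
    for src, piv, tgt in demos:
        lines.append(f"Source: {src}")
        lines.append(f"{pivot_lang} hint: {piv}")
        lines.append(f"{tgt_lang}: {tgt}")
    # Final query: the model completes the target-language line.
    lines.append(f"Source: {src_sentence}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)

# Hypothetical demonstration triple (content illustrative only).
demos = [("The sun rises.", "El sol sale.", "O sol nasce.")]
prompt = build_pivot_prompt("The moon sets.", "Spanish", "Portuguese", demos)
```

Since no parameters are updated, varying `demos` is the only lever — which is consistent with the abstract's finding that gains are sensitive to few-shot example construction.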
cs.CL / 59 / 2603.16718
Arabic Morphosyntactic Tagging and Dependency Parsing with Large Language Models
利用大型语言模型进行阿拉伯语形态句法标注和依存解析
Abstract
Large language models (LLMs) perform strongly on many NLP tasks, but their ability to produce explicit linguistic structure remains unclear. We evaluate instruction-tuned LLMs on two structured prediction tasks for Standard Arabic: morphosyntactic tagging and labeled dependency parsing. Arabic provides a challenging testbed due to its rich morphology and orthographic ambiguity, which create strong morphology-syntax interactions. We compare zero-shot prompting with retrieval-based in-context learning (ICL) using examples from Arabic treebanks. Results show that prompt design and demonstration selection strongly affect performance: proprietary models approach supervised baselines for feature-level tagging and become competitive with specialized dependency parsers. In raw-text settings, tokenization remains challenging, though retrieval-based ICL improves both parsing and tokenization. Our analysis highlights which aspects of Arabic morphosyntax and syntax LLMs capture reliably and which remain difficult.
Chinese Translation
大型语言模型(LLMs)在许多自然语言处理(NLP)任务中表现出色,但它们生成显式语言结构的能力仍不清楚。我们评估了经过指令调优的LLMs在标准阿拉伯语的两个结构化预测任务上的表现:形态句法标注和带标签的依存解析。阿拉伯语因其丰富的形态变化和正字法歧义而构成一个具有挑战性的测试平台,这些因素造成了形态与句法之间的强交互。我们比较了零样本提示与基于检索的上下文学习(ICL),后者使用来自阿拉伯语树库的示例。结果表明,提示设计和示例选择对性能有很大影响:专有模型在特征级标注上接近监督基线,并可与专门的依存解析器相媲美。在原始文本设置中,分词仍然具有挑战性,不过基于检索的ICL同时改善了解析和分词。我们的分析突出了LLMs能够可靠捕捉阿拉伯语形态句法与句法的哪些方面,以及哪些方面仍然困难。
cs.CL / 60 / 2603.16749
Probing Cultural Signals in Large Language Models through Author Profiling
通过作者画像探究大语言模型中的文化信号
Abstract
Large language models (LLMs) are increasingly deployed in applications with societal impact, raising concerns about the cultural biases they encode. We probe these representations by evaluating whether LLMs can perform author profiling from song lyrics in a zero-shot setting, inferring singers' gender and ethnicity without task-specific fine-tuning. Across several open-source models evaluated on more than 10,000 lyrics, we find that LLMs achieve non-trivial profiling performance but demonstrate systematic cultural alignment: most models default toward North American ethnicity, while DeepSeek-1.5B aligns more strongly with Asian ethnicity. This finding emerges from both the models' prediction distributions and an analysis of their generated rationales. To quantify these disparities, we introduce two fairness metrics, Modality Accuracy Divergence (MAD) and Recall Divergence (RD), and show that Ministral-8B displays the strongest ethnicity bias among the evaluated models, whereas Gemma-12B shows the most balanced behavior. Our code is available on GitHub (https://github.com/ValentinLafargue/CulturalProbingLLM).
Chinese Translation
大型语言模型(LLMs)正越来越多地应用于具有社会影响的场景,这引发了对其所编码文化偏见的关注。我们通过评估 LLMs 能否在零样本设置下根据歌曲歌词进行作者画像来探究这些表征,即在不针对特定任务微调的情况下推断歌手的性别和种族。在对超过 10,000 首歌词评估的多个开源模型中,我们发现 LLMs 取得了不容忽视的画像性能,但表现出系统性的文化对齐:大多数模型默认偏向北美种族,而 DeepSeek-1.5B 则更强烈地偏向亚洲种族。这一发现既来自模型的预测分布,也来自对其生成的解释理由的分析。为了量化这些差异,我们引入了两项公平性指标,即模态准确性差异(Modality Accuracy Divergence,MAD)和召回率差异(Recall Divergence,RD),并表明 Ministral-8B 在所评估的模型中表现出最强的种族偏见,而 Gemma-12B 的表现则最为平衡。我们的代码已在 GitHub 上发布(https://github.com/ValentinLafargue/CulturalProbingLLM)。
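The abstract names but does not define its two fairness metrics, so the sketch below is only one plausible formulation of a recall-divergence-style measure: the spread between the best- and worst-recalled group. The paper's actual MAD/RD definitions may differ; the function name and group labels are assumptions for illustration.

```python
def recall_divergence(y_true, y_pred, groups):
    """Illustrative per-group disparity measure: compute recall within
    each group, then return the gap between the best- and worst-recalled
    group alongside the per-group scores. NOT the paper's exact RD."""
    per_group = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        hits = sum(1 for i in idx if y_pred[i] == y_true[i])
        per_group[g] = hits / len(idx)
    return max(per_group.values()) - min(per_group.values()), per_group

# Toy labels: two ethnicity groups, with the second under-recalled.
rd, scores = recall_divergence(
    y_true=["NA", "NA", "AS", "AS"],
    y_pred=["NA", "NA", "NA", "AS"],
    groups=["NA", "NA", "AS", "AS"],
)
```

A divergence of 0 would indicate equal recall across groups; larger values flag the kind of systematic skew (e.g., defaulting toward one ethnicity) the paper reports.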
cs.CL / 61 / 2603.16759
TurnWise: The Gap between Single- and Multi-turn Language Model Capabilities
TurnWise:单轮与多轮语言模型能力之间的差距
Abstract
Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings, failing to capture the additional dimension of these longer interactions. To understand this multi-/single-turn gap, we first introduce a new benchmark, TurnWiseEval, for multi-turn capabilities that is directly comparable to single-turn chat evaluation. Our evaluation isolates multi-turn specific conversational ability through pairwise comparison to equivalent single-turn settings. We additionally introduce our synthetic multi-turn data pipeline TurnWiseData which allows the scalable generation of multi-turn training data. Our experiments with Olmo 3 show that training with multi-turn data is vital to achieving strong multi-turn chat performance, and that including as little as 10k multi-turn conversations during post-training can lead to a 12% improvement on TurnWiseEval.
Chinese Translation
多轮对话是语言模型交互的一种常见且重要的模式。然而,目前的开放训练和评估数据主要集中在单轮设置上,未能捕捉到这些更长交互的额外维度。为了理解这种多轮/单轮差距,我们首先引入了一个新的基准,TurnWiseEval,用于多轮能力的评估,且该基准与单轮聊天评估直接可比。我们的评估通过与等效单轮设置的成对比较,隔离了多轮特定的对话能力。此外,我们还引入了我们的合成多轮数据管道TurnWiseData,该管道允许可扩展地生成多轮训练数据。我们与Olmo 3的实验表明,使用多轮数据进行训练对于实现强大的多轮聊天性能至关重要,并且在后期训练中仅包含1万条多轮对话就能在TurnWiseEval上带来12%的提升。
cs.CL / 62 / 2603.16783
SpokenUS: A Spoken User Simulator for Task-Oriented Dialogue
SpokenUS:一种面向任务的对话中的口语用户模拟器
Abstract
Robust task-oriented spoken dialogue agents require exposure to the full diversity of how people interact through speech. Building spoken user simulators that address this requires large-scale spoken task-oriented dialogue (TOD) data encompassing spoken user behaviors, yet existing datasets are limited in scale and domain coverage, with no systematic pipeline for augmenting them. To address this, we introduce \textbf{SpokenTOD}, a spoken TOD dataset of 52,390 dialogues and 1,034 hours of speech augmented with four spoken user behaviors -- cross-turn slots, barge-in, disfluency, and emotional prosody -- across diverse speakers and domains. Building on SpokenTOD, we present \textbf{SpokenUS}, a spoken user simulator grounded in TOD with a dedicated architecture for barge-in. SpokenUS achieves comparable goal coverage to significantly larger models while substantially outperforming all baselines in Human MOS, disclosing slot values gradually across the dialogue as humans do rather than front-loading them. Further analysis confirms that SpokenUS's spoken behaviors pose meaningful challenges to downstream agents, making it a practical tool for training and evaluating more robust spoken dialogue systems.
Chinese Translation
要构建强健的面向任务的口语对话代理,必须让其接触人们通过语音进行交互的全部多样性。构建能满足这一需求的口语用户模拟器需要大规模的、涵盖口语用户行为的面向任务口语对话(TOD)数据,然而现有数据集在规模和领域覆盖上都存在局限,且缺乏系统的数据增强管道。为了解决这个问题,我们引入了\textbf{SpokenTOD},这是一个包含52,390个对话和1,034小时语音的口语TOD数据集,并增加了四种口语用户行为——跨轮槽位、打断(barge-in)、不流畅(disfluency)和情感韵律——涵盖多样的说话者和领域。基于SpokenTOD,我们提出了\textbf{SpokenUS},一个以TOD为基础的口语用户模拟器,带有专门处理打断的架构。SpokenUS在目标覆盖率上与规模大得多的模型相当,同时在人工主观评分(Human MOS)上显著优于所有基线模型,并像人类一样在对话过程中逐步披露槽位值,而非一开始就全部给出。进一步分析证实,SpokenUS的口语行为对下游代理构成了有意义的挑战,使其成为训练和评估更强健口语对话系统的实用工具。
cs.CL / 63 / 2603.16848
Mediocrity is the key for LLM as a Judge Anchor Selection
平庸是大语言模型作为评判锚点选择的关键
Abstract
The ``LLM-as-a-judge'' paradigm has become a standard method for evaluating open-ended generation. To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single anchor. However, despite its widespread use, the impact of anchor selection on the reliability of the results remains largely unexplored. In this work, we systematically investigate the effect of anchor selection by evaluating 22 different anchors on the Arena-Hard-v2.0 dataset. We find that the choice of anchor is critical: a poor anchor can dramatically reduce correlation with human rankings. We identify that common anchor choices (best-performing and worst-performing models) make poor anchors. Because these extreme anchors are consistently better or worse than all other models, they are seldom indicative of the relative ranking of the models. We further quantify the effect size of anchor selection, showing it is comparable to the selection of a judge model. We conclude with actionable recommendations. First, we conduct a power analysis, and compute sufficient benchmark sizes for anchor-based evaluation, finding that standard benchmark sizes are insufficient for pairwise evaluation and fail to distinguish between competitive models reliably. Second, we provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices.
Chinese Translation
“大语言模型作为评判者”(LLM-as-a-judge)范式已成为评估开放式生成的标准方法。为了应对成对比较的二次可扩展性成本,像 Arena-Hard 和 AlpacaEval 这样的流行基准将所有模型与单一锚点进行比较。然而,尽管这种方法被广泛使用,锚点选择对结果可靠性的影响在很大程度上仍未被探索。在本研究中,我们通过在 Arena-Hard-v2.0 数据集上评估 22 个不同的锚点,系统地研究了锚点选择的影响。我们发现,锚点的选择至关重要:一个不佳的锚点会显著降低与人类排名的相关性。我们识别出,常见的锚点选择(表现最佳和表现最差的模型)实际上是糟糕的锚点。由于这些极端锚点始终比所有其他模型更好或更差,它们很少能反映模型之间的相对排名。我们进一步量化了锚点选择的效应大小,表明其与评判模型的选择相当。最后,我们提出了可行的建议。首先,我们进行了统计功效分析(power analysis),并计算了基于锚点评估所需的充分基准规模,发现标准基准规模对于成对评估是不够的,无法可靠地区分实力相近的模型。其次,我们提供了选择信息量充足的锚点的指导方针,以确保评估实践的可靠性和效率。
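The failure mode described above — an extreme anchor giving no ranking signal — can be shown with a toy simulation. The sketch below is a hedged illustration, not the benchmark's implementation: `judge` is a stand-in for an LLM judge whose verdict saturates once the skill gap is large, and the skill values are invented.

```python
def anchor_ranking(models, judge, anchor):
    """Rank models by win rate against a single anchor, in the style of
    Arena-Hard-type evaluation. `judge(a, b)` returns the fraction of
    prompts on which model a beats model b."""
    win_rates = {m: judge(m, anchor) for m in models}
    return sorted(win_rates, key=win_rates.get, reverse=True), win_rates

# Toy latent skill levels (hypothetical). The judge's verdict clips to
# [0, 1], so a very weak anchor loses to every strong model outright.
skill = {"A": 0.9, "B": 0.7, "C": 0.5, "weak_anchor": 0.1}

def judge(a, b):
    return min(1.0, max(0.0, 0.5 + (skill[a] - skill[b])))

ranking, rates = anchor_ranking(["A", "B", "C"], judge, "weak_anchor")
# A and B both saturate at a 1.0 win rate against the weak anchor,
# so the anchor can no longer distinguish them.
```

This mirrors the paper's observation: an anchor consistently worse (or better) than every model yields saturated win rates that carry little information about the models' relative order.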
cs.CL / 64 / 2603.16856
Online Experiential Learning for Language Models
语言模型的在线体验学习
Abstract
The prevailing paradigm for improving large language models relies on offline training with human annotations or simulated environments, leaving the rich experience accumulated during real-world deployment entirely unexploited. We propose Online Experiential Learning (OEL), a framework that enables language models to continuously improve from their own deployment experience. OEL operates in two stages: first, transferable experiential knowledge is extracted and accumulated from interaction trajectories collected on the user side; second, this knowledge is consolidated into model parameters via on-policy context distillation, requiring no access to the user-side environment. The two stages are iterated to form an online learning loop, where the improved model collects higher-quality trajectories that yield richer experiential knowledge for subsequent rounds. We evaluate OEL on text-based game environments across multiple model scales and both thinking and non-thinking variants. OEL achieves consistent improvements over successive iterations, enhancing both task accuracy and token efficiency while preserving out-of-distribution performance. Our analysis further shows that extracted experiential knowledge is significantly more effective than raw trajectories, and that on-policy consistency between the knowledge source and the policy model is critical for effective learning.
Chinese Translation
当前改进大型语言模型的主流范式依赖于使用人工注释或模拟环境进行离线训练,而在真实部署过程中积累的丰富经验则完全未被利用。我们提出了在线体验学习(Online Experiential Learning, OEL)框架,使语言模型能够从自身的部署经验中持续改进。OEL 分为两个阶段:首先,从用户端收集的交互轨迹中提取并积累可迁移的体验知识;其次,通过同策略(on-policy)上下文蒸馏将这些知识整合进模型参数,无需访问用户端环境。这两个阶段交替迭代,形成一个在线学习循环:改进后的模型收集更高质量的轨迹,从而为后续轮次提供更丰富的体验知识。我们在基于文本的游戏环境中,针对多种模型规模以及思维与非思维变体评估了 OEL。OEL 在连续迭代中实现了一致的改进,在提升任务准确率和词元效率的同时保持了分布外性能。我们的分析进一步表明,提取的体验知识显著比原始轨迹更有效,而且知识来源与策略模型之间的同策略一致性对有效学习至关重要。
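The two-stage loop described above can be sketched as a skeleton; the callables below are placeholders standing in for the paper's components (deployment-side trajectory collection, knowledge extraction, and on-policy context distillation), and the toy instantiation at the bottom is purely for demonstration.

```python
def online_experiential_learning(model, collect, extract, distill, rounds=3):
    """Skeleton of the OEL loop. Per round: (1a) collect deployment
    trajectories with the current model, (1b) extract and accumulate
    transferable experiential knowledge, (2) consolidate the knowledge
    into model parameters via distillation, without re-entering the
    user-side environment."""
    knowledge = []
    for _ in range(rounds):
        trajectories = collect(model)        # stage 1a: deployment
        knowledge += extract(trajectories)   # stage 1b: accumulate knowledge
        model = distill(model, knowledge)    # stage 2: consolidate
    return model, knowledge

# Toy instantiation: the "model" is just an integer that grows with the
# amount of accumulated knowledge, showing the iteration structure only.
final_model, accumulated = online_experiential_learning(
    0,
    collect=lambda m: [m],
    extract=lambda t: list(t),
    distill=lambda m, k: m + len(k),
    rounds=3,
)
```

The key structural point is that each round's improved model produces the next round's trajectories — the compounding effect the abstract attributes to the online loop.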
cs.CL / 65 / 2603.16862
Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory
Chronos:具有结构化事件检索的时间感知对话代理用于长期记忆
Abstract
Recent advances in Large Language Models (LLMs) have enabled conversational AI agents to engage in extended multi-turn interactions spanning weeks or months. However, existing memory systems struggle to reason over temporally grounded facts and preferences that evolve across months of interaction and lack effective retrieval strategies for multi-hop, time-sensitive queries over long dialogue histories. We introduce Chronos, a novel temporal-aware memory framework that decomposes raw dialogue into subject-verb-object event tuples with resolved datetime ranges and entity aliases, indexing them in a structured event calendar alongside a turn calendar that preserves full conversational context. At query time, Chronos applies dynamic prompting to generate tailored retrieval guidance for each question, directing the agent on what to retrieve, how to filter across time ranges, and how to approach multi-hop reasoning through an iterative tool-calling loop over both calendars. We evaluate Chronos with 8 LLMs, both open-source and closed-source, on the LongMemEvalS benchmark comprising 500 questions spanning six categories of dialogue history tasks. Chronos Low achieves 92.60% and Chronos High scores 95.60% accuracy, setting a new state of the art with an improvement of 7.67% over the best prior system. Ablation results reveal the events calendar accounts for a 58.9% gain on the baseline while all other components yield improvements between 15.5% and 22.3%. Notably, Chronos Low alone surpasses prior approaches evaluated under their strongest model configurations.
Chinese Translation
近年来,大型语言模型(LLMs)的进展使得对话式人工智能代理能够进行跨越数周或数月的扩展多轮交互。然而,现有的记忆系统难以对在数月交互中不断演变、有时间依据的事实和偏好进行推理,并且缺乏有效的检索策略来处理长对话历史中多跳、时间敏感的查询。我们提出了Chronos,一种新颖的时间感知记忆框架,它将原始对话分解为带有已解析日期时间范围和实体别名的主谓宾事件元组,并将其索引到一个结构化的事件日历中,同时辅以保留完整对话上下文的轮次日历。在查询时,Chronos应用动态提示为每个问题生成量身定制的检索指导,指示代理检索什么内容、如何跨时间范围进行过滤,以及如何通过对两个日历的迭代工具调用循环进行多跳推理。我们在LongMemEvalS基准上使用8个开源和闭源LLMs评估了Chronos,该基准包含500个问题,涵盖六类对话历史任务。Chronos Low的准确率达到92.60%,Chronos High达到95.60%,创造了新的最先进水平,比此前最佳系统提高了7.67%。消融实验表明,事件日历带来了相对基线58.9%的增益,而其他各组件的改进幅度在15.5%到22.3%之间。值得注意的是,仅Chronos Low就已超越了在其最强模型配置下评估的先前方法。
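The event-calendar representation described above can be made concrete with a small sketch. The field names and the overlap-based filter below are assumptions for illustration — the paper's actual schema and retrieval tools are more elaborate.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Event:
    """Illustrative subject-verb-object event tuple with a resolved
    datetime range, loosely modeled on the abstract's event-calendar
    entries (field names are assumptions)."""
    subject: str
    verb: str
    obj: str
    start: date
    end: date

def events_in_range(calendar, lo, hi):
    """Return events whose resolved range overlaps [lo, hi] -- the kind
    of time-range filtering step the agent's tool-calling loop performs."""
    return [e for e in calendar if e.start <= hi and e.end >= lo]

# Hypothetical calendar built from dialogue history.
calendar = [
    Event("user", "visited", "Lisbon", date(2024, 3, 1), date(2024, 3, 7)),
    Event("user", "adopted", "a cat", date(2024, 6, 10), date(2024, 6, 10)),
]
spring = events_in_range(calendar, date(2024, 3, 1), date(2024, 5, 31))
```

A multi-hop, time-sensitive query would chain several such filtered lookups (plus turn-calendar reads for full context) inside the iterative tool-calling loop.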