cs.CV / 1 / 2604.22754
HalalBench: A Multilingual OCR Benchmark for Food Packaging Ingredient Extraction
HalalBench:用于食品包装成分提取的多语言OCR基准
Abstract
No standardized benchmark exists for evaluating OCR on food packaging, despite its critical role in automated halal food verification. Existing benchmarks target documents or scene text, missing the unique challenges of ingredient labels: curved surfaces, dense multilingual text, and sub-8pt fonts. We present HalalBench, the first open multilingual benchmark for food packaging OCR, comprising 1,043 images (50 real, 993 synthetic) with 36,438 annotations in COCO format spanning 14 languages. We evaluate four engines: docTR achieves F1=0.193, ML Kit 0.180, EasyOCR 0.167, while all fail on Japanese (F1=0.000). A clustering ablation shows 36% F1 improvement from our post-processing algorithm. We validate findings through HalalLens (https://halallens.no), a production halal scanner serving 20+ countries. Dataset and code are released under open licenses.
Chinese Translation
尽管OCR在自动化清真食品验证中发挥着关键作用,但目前尚无标准化基准用于评估食品包装上的OCR。现有基准主要针对文档或场景文本,未能涵盖成分标签所面临的独特挑战:曲面、密集的多语言文本和小于8pt的字体。我们提出了HalalBench,这是第一个开放的多语言食品包装OCR基准,包含1,043张图像(50张真实图像,993张合成图像),在COCO格式下有36,438个注释,涵盖14种语言。我们评估了四个引擎:docTR的F1值为0.193,ML Kit为0.180,EasyOCR为0.167,而在日语上均表现不佳(F1=0.000)。聚类消融实验表明,我们的后处理算法使F1值提高了36%。我们通过HalalLens(https://halallens.no)验证了这些发现,该平台是一个为20多个国家服务的生产级清真扫描器。数据集和代码已在开放许可下发布。
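The F1 figures quoted in the abstract above combine precision and recall in the standard way. The following is a minimal sketch, assuming OCR predictions have already been matched against ground-truth annotations; the abstract does not state the matching rule, and the counts in the usage line are hypothetical, not taken from the paper.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not from the paper).
example_f1 = f1_score(true_pos=20, false_pos=80, false_neg=90)
print(round(example_f1, 3))
```

An engine that matches no annotations at all, as reported for Japanese, lands in the `true_pos == 0` branch and scores exactly 0.0.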
cs.CV / 2 / 2604.22805
See No Evil: Semantic Context-Aware Privacy Risk Detection for AR
无恶可见:面向语义上下文的增强现实隐私风险检测
Abstract
Augmented reality (AR) systems pose unique privacy risks due to their continuous capture of visual data. Existing AR privacy frameworks lack semantic understanding of visual content, limiting their effectiveness in detecting context-dependent privacy risks. We propose PrivAR, which leverages vision language models (VLMs) with chain-of-thought prompting for contextual privacy risk detection in AR environments. PrivAR uses visual scene cues to infer potential sensitive information types, such as identifying password notes in office environments through contextual reasoning. PrivAR detects and obfuscates textual content, preventing exposure of sensitive information while preserving contextual cues necessary for VLM inference. Additionally, we investigate contextually-informed warning interfaces to enhance user privacy awareness. Experiments on a real-world AR dataset show that PrivAR achieves superior accuracy (81.48%) and F1-score (84.62%) compared to baselines, while reducing privacy leakage rate to 17.58%. User studies evaluating contextually-informed warning interfaces provide insights into effective privacy-aware AR design.
Chinese Translation
增强现实(AR)系统由于其持续捕捉视觉数据而带来了独特的隐私风险。现有的AR隐私框架缺乏对视觉内容的语义理解,限制了其在检测依赖上下文的隐私风险方面的有效性。我们提出了PrivAR,它利用视觉语言模型(VLMs)和链式思维提示进行AR环境中的上下文隐私风险检测。PrivAR使用视觉场景线索推断潜在的敏感信息类型,例如通过上下文推理识别办公环境中的密码便条。PrivAR检测并模糊化文本内容,防止敏感信息的泄露,同时保留VLM推理所需的上下文线索。此外,我们还研究了上下文知情的警告界面,以增强用户的隐私意识。在一个真实世界的AR数据集上的实验表明,PrivAR的准确率(81.48%)和F1分数(84.62%)优于基准,同时将隐私泄露率降低至17.58%。对上下文知情警告界面的用户研究提供了有效的隐私意识AR设计的见解。
cs.CV / 3 / 2604.22808
FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers
FreqFormer:具有自适应频谱路由的分层频域注意力用于长序列视频扩散变换器
Abstract
Long-sequence video diffusion transformers hit a quadratic self-attention cost that dominates runtime and memory for very long token sequences. Most efficient attention methods use one approximation everywhere, yet video features are spectrally structured: low frequencies carry global layout and coarse motion; high frequencies carry texture and fine detail. We present FreqFormer, a frequency-aware heterogeneous attention framework. Token features are split into spectral bands with different operators: dense global attention on compressed low-frequency content, structured block-sparse attention on mid frequencies, and sliding-window local attention on high frequencies. A lightweight spectral routing network allocates heads across bands using layer statistics and the diffusion timestep, shifting compute toward global structure early in denoising and detail later. Cross-band summary tokens provide cheap residual exchange. FreqFormer is paired with a fused GPU execution plan that co-schedules dense, sparse, and local branches to cut kernel launches and memory traffic. We give a consistent complexity model, an orthonormal-decomposition view of approximation, and simulation-based systems numbers (throughput, arithmetic intensity, memory traffic, duration scaling). In simulations from 64K to 1M tokens, FreqFormer substantially reduces estimated attention FLOPs and KV-related memory traffic versus dense attention while keeping a hardware-friendly pattern, supporting spectrally structured heterogeneous attention as a practical direction for long-video diffusion transformers.
Chinese Translation
长序列视频扩散变换器在处理非常长的令牌序列时会面临二次自注意力成本,这在运行时间和内存上占据主导地位。大多数高效的注意力方法在所有地方使用一种近似方法,但视频特征具有频谱结构:低频携带全局布局和粗略运动;高频则携带纹理和细节。我们提出了FreqFormer,一种频率感知的异构注意力框架。令牌特征被分割成不同的频谱带,并采用不同的操作符:对压缩的低频内容进行密集的全局注意力,对中频进行结构化的块稀疏注意力,以及对高频进行滑动窗口的局部注意力。一个轻量级的频谱路由网络根据层统计信息和扩散时间步在频带之间分配注意力头,将计算重心早期转向去噪过程中的全局结构,后期则关注细节。跨频带的摘要令牌提供了廉价的残差交换。FreqFormer配备了一个融合的GPU执行计划,该计划共同调度密集、稀疏和局部分支,以减少内核启动和内存流量。我们提供了一个一致的复杂度模型、对近似的正交分解视图,以及基于仿真的系统指标(吞吐量、算术强度、内存流量、持续时间缩放)。在从64K到1M令牌的仿真中,FreqFormer显著减少了估计的注意力FLOPs和与KV相关的内存流量,相比于密集注意力,同时保持了硬件友好的模式,支持频谱结构的异构注意力作为长视频扩散变换器的一个实际方向。
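To make the cost argument in the FreqFormer abstract concrete, here is a rough back-of-the-envelope sketch. It is not the paper's complexity model: the even three-way channel split and the `compress`, `block`, and `window` parameters are all assumptions introduced for illustration.

```python
def dense_cost(n: int, d: int) -> int:
    """Multiply-adds for the score matrix of full self-attention over n tokens."""
    return n * n * d


def banded_cost(n: int, d: int, compress: int, block: int, window: int) -> int:
    """Hypothetical three-band cost; each band is assumed to get d // 3 channels."""
    d_band = d // 3
    low = (n // compress) ** 2 * d_band  # dense attention on compressed low-frequency tokens
    mid = n * block * d_band             # block-sparse: each token attends within one block
    high = n * window * d_band           # sliding window: each token attends to `window` neighbours
    return low + mid + high


# Assumed settings: 64K tokens, 96 channels, 16x compression, 256-token blocks, 128-token window.
n, d = 65_536, 96
ratio = dense_cost(n, d) / banded_cost(n, d, compress=16, block=256, window=128)
print(f"dense / banded cost ratio: {ratio:.1f}")
```

Only the compressed low-frequency term stays quadratic in this sketch, which is why the gap to dense attention widens as the sequence grows.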
cs.CV / 4 / 2604.22822
DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models
DO-Bench:用于诊断视觉语言模型中对象幻觉的可归因基准
Abstract
Object-level hallucination remains a central reliability challenge for vision-language models (VLMs), particularly in binary object existence verification. Existing benchmarks emphasize aggregate accuracy but rarely disentangle whether errors stem from perceptual limitations or from the influence of contextual textual priors, leaving underlying failure mechanisms ambiguous. We introduce DO-Bench, a controlled diagnostic benchmark that isolates these sources through structured multimodal interventions. Rather than evaluating models in unconstrained settings, DO-Bench probes two complementary dimensions: the Prior Override dimension progressively strengthens contextual textual priors while holding visual evidence constant to assess resistance to prior pressure, and the Perception-Limited dimension incrementally enhances visual evidence from full-scene context to localized object crops to measure perceptual grounding strength. This paired design enables attribution of errors to prior suppression, perceptual insufficiency, or their interaction. We further define two diagnostic metrics, PriorRobust and PerceptionAbility, to quantify these behaviors consistently. Evaluations across diverse open- and closed-source VLMs reveal systematic differences in prior sensitivity and perceptual reliability, demonstrating that object hallucination reflects heterogeneous, mechanism-dependent failure patterns beyond aggregate accuracy.
Chinese Translation
对象级幻觉仍然是视觉语言模型(VLMs)面临的一个主要可靠性挑战,特别是在二元对象存在性验证中。现有基准强调整体准确性,但很少区分错误是源于感知限制还是上下文文本先验的影响,从而使潜在的失败机制变得模糊。我们引入了DO-Bench,这是一个受控的诊断基准,通过结构化的多模态干预来隔离这些来源。DO-Bench并不是在不受限制的环境中评估模型,而是探讨两个互补维度:先验覆盖(Prior Override)维度逐步增强上下文文本先验,同时保持视觉证据不变,以评估对先验压力的抵抗力;感知限制(Perception-Limited)维度逐步增强从全场景上下文到局部对象裁剪的视觉证据,以测量感知基础的强度。这种配对设计使得可以将错误归因于先验抑制、感知不足或两者的交互。我们进一步定义了两个诊断指标,PriorRobust和PerceptionAbility,以一致地量化这些行为。在多种开放源和闭源VLMs上的评估揭示了先验敏感性和感知可靠性之间的系统性差异,表明对象幻觉反映了超越整体准确性的异质性、机制依赖的失败模式。
cs.CV / 5 / 2604.22823
PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging
PivotMerge:通过后对齐模型合并桥接异构多模态预训练
Abstract
Multimodal Large Language Models (MLLMs) rely on multimodal pre-training over diverse data sources, where different datasets often induce complementary cross-modal alignment capabilities. Model merging provides a cost-effective mechanism for integrating multiple expert MLLMs with complementary strengths into a unified model. However, existing model merging research mainly focuses on post-finetuning scenarios, leaving the pre-training stage largely unexplored. We argue that the core of MLLM pre-training lies in establishing effective cross-modal alignment, which bridges visual and textual representations into a unified semantic space. Motivated by this insight, we introduce the post-alignment merging task, which aims to integrate cross-modal alignment capabilities learned from heterogeneous multimodal pre-training. This setting introduces two key challenges: cross-domain parameter interference, where parameter updates learned from different data distributions conflict during merging, and layer-wise alignment contribution disparity, where different layers and projectors contribute unevenly to cross-modal alignment. To address them, we propose PivotMerge, a post-alignment merging framework for cross-modal projectors. PivotMerge incorporates two key components: Shared-space Decomposition and Filtering, which disentangles shared alignment patterns from domain-specific variations and suppresses conflicting directions, and Alignment-guided Layer-wise Merging, which assigns layer-specific merging weights based on differing alignment contributions. We construct systematic CC12M-based post-alignment merging scenarios for evaluation. Extensive experiments on multiple multimodal benchmarks show that PivotMerge consistently outperforms existing baselines, demonstrating its effectiveness and generalization ability.
Chinese Translation
多模态大型语言模型(MLLMs)依赖于对多样数据源的多模态预训练,其中不同的数据集通常会引发互补的跨模态对齐能力。模型合并提供了一种经济高效的机制,将具有互补优势的多个专家级MLLMs整合为一个统一模型。然而,现有的模型合并研究主要集中在后微调场景,预训练阶段则基本未被探索。我们认为,MLLM预训练的核心在于建立有效的跨模态对齐,将视觉和文本表示桥接到一个统一的语义空间。基于这一见解,我们引入了后对齐合并任务,旨在整合从异构多模态预训练中学习到的跨模态对齐能力。这一设置引入了两个关键挑战:跨域参数干扰,即来自不同数据分布的参数更新在合并过程中发生冲突,以及层级对齐贡献差异,即不同层和投影器对跨模态对齐的贡献不均匀。为了解决这些问题,我们提出了PivotMerge,一个用于跨模态投影器的后对齐合并框架。PivotMerge包含两个关键组件:共享空间分解与过滤,旨在将共享对齐模式与领域特定变异解耦,并抑制冲突方向;以及基于对齐指导的层级合并,根据不同的对齐贡献分配层特定的合并权重。我们构建了系统的基于CC12M的后对齐合并场景进行评估。在多个多模态基准上的广泛实验表明,PivotMerge始终优于现有基线,展示了其有效性和泛化能力。
cs.CV / 6 / 2604.22824
WeatherSeg: Weather-Robust Image Segmentation using Teacher-Student Dual Learning and Classifier-Updating Attention
WeatherSeg:基于教师-学生双重学习和分类器更新注意力的天气鲁棒图像分割
Abstract
WeatherSeg, an advanced semi-supervised segmentation framework, addresses autonomous driving's environmental perception challenges in adverse weather while reducing annotation costs. This framework integrates a Dual Teacher-Student Weight-Sharing Model (DTSWSM) that enables knowledge distillation from weather-affected images, and a Classifier Weight Updating Attention Mechanism (CWUAM) that dynamically adjusts classifier weights based on environmental attributes. Comprehensive evaluations demonstrate that WeatherSeg significantly outperforms baseline models in both accuracy and robustness across various weather conditions, including clear, rainy, cloudy, and foggy scenarios, establishing it as an effective solution for all-weather semantic segmentation in autonomous driving and related applications.
Chinese Translation
WeatherSeg是一种先进的半监督分割框架,旨在解决自主驾驶在恶劣天气条件下的环境感知挑战,同时降低标注成本。该框架集成了双教师-学生权重共享模型(Dual Teacher-Student Weight-Sharing Model, DTSWSM),能够从受天气影响的图像中进行知识蒸馏,以及分类器权重更新注意力机制(Classifier Weight Updating Attention Mechanism, CWUAM),根据环境属性动态调整分类器权重。全面的评估结果表明,WeatherSeg在各种天气条件下(包括晴天、雨天、多云和雾天)的准确性和鲁棒性方面显著优于基线模型,确立了其作为自主驾驶及相关应用中全天气语义分割的有效解决方案。
cs.CV / 7 / 2604.22825
SGP-SAM: Self-Gated Prompting for Transferring 3D Segment Anything Models to Lesion Segmentation
SGP-SAM:自门控提示用于将3D Segment Anything模型转移到病变分割
Abstract
Large segmentation foundation models such as the Segment Anything Model (SAM) have reshaped promptable segmentation in natural images, and recent efforts have extended these models to medical images and volumetric settings. However, directly transferring a 3D SAM-style model to lesion segmentation remains challenging due to (i) weak spatial representational capacity for small, irregular targets in intermediate features, and (ii) extreme foreground-background imbalance in 3D volumes. We propose SGP-SAM, a self-gated prompting framework for efficient and effective transfer to 3D lesion segmentation. Our key component, the Self-Gated Prompting Module (SGPM), performs conditional multi-scale spatial enhancement: a lightweight multi-channel gating unit predicts whether the current features require additional multi-scale fusion, and only then activates a Multi-Scale Feature Fusion Block to enrich spatial context. To further address small-lesion learning, we design a Zoom Loss that up-weights lesion-focused supervision by combining Dice and a voxel-balanced focal term. Experiments on MSD Liver Tumor and MSD Brain Tumor (enhancing tumor) show consistent gains over strong transfer baselines based on SAM-Med3D. On MSD Liver Tumor, SGP-SAM improves mDice by 7.3% over fine-tuning.
Chinese Translation
大型分割基础模型,如Segment Anything Model (SAM),已经重塑了自然图像中的可提示分割,最近的研究努力将这些模型扩展到医学图像和体积设置。然而,直接将3D SAM风格模型转移到病变分割仍然面临挑战,原因在于(i)中间特征对小型不规则目标的空间表示能力较弱,以及(ii)3D体积中极端的前景-背景不平衡。我们提出了SGP-SAM,一种自门控提示框架,用于高效且有效地转移到3D病变分割。我们的关键组件,自门控提示模块(SGPM),执行条件多尺度空间增强:一个轻量级多通道门控单元预测当前特征是否需要额外的多尺度融合,只有在这种情况下才激活多尺度特征融合块以丰富空间上下文。为了进一步解决小病变学习的问题,我们设计了一种Zoom Loss,通过结合Dice和体素平衡焦点项来加大对病变聚焦监督的权重。在MSD肝肿瘤和MSD脑肿瘤(增强肿瘤)上的实验显示,相较于基于SAM-Med3D的强转移基线,SGP-SAM在性能上有一致的提升。在MSD肝肿瘤上,SGP-SAM的mDice提高了7.3%。
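The SGP-SAM abstract describes its Zoom Loss only as a combination of Dice and a voxel-balanced focal term. The sketch below shows one plausible reading on flattened voxel lists; the per-class averaging, the `gamma` value, and the `focal_weight` mixing factor are assumptions, not the paper's definition.

```python
import math


def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss on flattened foreground probabilities and 0/1 targets."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    total = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def balanced_focal_loss(probs, targets, gamma=2.0, eps=1e-6):
    """Focal loss averaged per class, so rare lesion voxels are not swamped by background."""
    n_fg = sum(targets)
    n_bg = len(targets) - n_fg
    fg_term = bg_term = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        if t == 1:
            fg_term += -((1.0 - p) ** gamma) * math.log(p)
        else:
            bg_term += -(p ** gamma) * math.log(1.0 - p)
    return fg_term / max(n_fg, 1) + bg_term / max(n_bg, 1)


def combined_loss(probs, targets, focal_weight=1.0):
    """Assumed combination: Dice plus a weighted, class-balanced focal term."""
    return dice_loss(probs, targets) + focal_weight * balanced_focal_loss(probs, targets)
```

Averaging the focal term separately over foreground and background voxels is one simple way to counter the extreme foreground-background imbalance the abstract mentions.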
cs.CV / 8 / 2604.22826
Shape: A Self-Supervised 3D Geometry Foundation Model for Industrial CAD Analysis
Shape:用于工业CAD分析的自监督3D几何基础模型
Abstract
Industrial CAD workflows require robust, generalizable 3D geometric representations supporting accuracy and explainability. We introduce Shape, a self-supervised foundation model converting surface meshes into dense per-token embeddings. Shape combines a structured 3D latent grid, a multi-scale geometry-aware tokenizer (MAGNO) with cross-attention, and a transformer processor using grouped-query attention and RMSNorm. A learned reconstruction prior enables per-region attribution for explainable predictions. Pretraining uses masked-token reconstruction of normalized geometry statistics and multi-resolution contrastive consistency. The 10.9M-parameter backbone is pretrained on 61,052 CAD meshes from Thingi10K, MFCAD, and Fusion360. On a held-out split of 2,983 meshes, Shape achieves reconstruction R2 = 0.729 and 98.1% top-1 retrieval under the Wang-Isola protocol, with near-zero reconstruction train/val gap (contrastive scores use a larger evaluation pool). A 2x2 ablation on loss type and target-space normalization shows per-dimension normalization is critical: without it, performance collapses (R2 < 0.14, top-1 < 88%); with it, both losses succeed (R2 > 0.70, top-1 > 96%). Smooth-L1 offers secondary stability. Code, embeddings, and an interactive demo are released at https://github.com/simd-ai/shape.
Chinese Translation
工业CAD工作流程需要稳健且可泛化的3D几何表示,以支持准确性和可解释性。我们介绍了Shape,这是一种自监督基础模型,将表面网格转换为密集的每个标记嵌入。Shape结合了结构化的3D潜在网格、多尺度几何感知标记器(MAGNO)与交叉注意力,以及使用分组查询注意力和RMSNorm的变换处理器。学习到的重建先验使得区域归因成为可能,从而实现可解释的预测。预训练使用了标准化几何统计的掩码标记重建和多分辨率对比一致性。该模型的10.9M参数主干在Thingi10K、MFCAD和Fusion360的61,052个CAD网格上进行了预训练。在一个包含2,983个网格的保留拆分上,Shape在Wang-Isola协议下实现了重建R2 = 0.729和98.1%的top-1检索,且重建训练/验证间隙接近零(对比得分使用了更大的评估池)。对损失类型和目标空间归一化进行的2x2消融实验表明,每维归一化至关重要:没有它,性能崩溃(R2 < 0.14,top-1 < 88%);有了它,两种损失均成功(R2 > 0.70,top-1 > 96%)。Smooth-L1提供了次级稳定性。代码、嵌入和交互演示已发布在 https://github.com/simd-ai/shape。
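The reconstruction score in the Shape abstract is reported as R2. Assuming this denotes the usual coefficient of determination (the abstract does not spell it out), a minimal sketch of the computation is:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined (division by zero) when all targets are identical.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A score of 1.0 means perfect reconstruction, while 0.0 is what a model achieves by always predicting the mean target.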
cs.CV / 9 / 2604.22827
DGHMesh: A Large-scale Dual-radar mmWave Dataset and Generalization-focused Benchmark for Human Mesh Reconstruction
DGHMesh:一个大规模双雷达毫米波数据集及以泛化为重点的人体网格重建基准
Abstract
Millimeter-wave (mmWave) radar has shown great potential for contactless, privacy-preserving, and robust human sensing, yet existing mmWave-based human mesh reconstruction (HMR) studies are still limited by the lack of benchmarks for generalization analysis under configuration shifts and fair comparison of different algorithms. To address this limitation, we present DGHMesh, a large-scale dual-radar mmWave dataset and generalization-focused benchmark for HMR. It contains data from 15 subjects performing 8 actions, with 360,000 synchronized frames collected from FMCW radar, SFCW radar, RGB images, and high-precision 3D HMR annotations. In addition, the dataset provides synchronized raw I/Q data from both radar modalities and accurately calibrated radar spatial positions. The benchmark is designed to evaluate HMR methods under diverse measurement configurations, including human position shifts, human orientation shifts, subarray size variations, and cross-subject settings. Based on DGHMesh, we also propose mmPTM, a query-based multi-radar fusion framework that jointly exploits point clouds and imaging tubes for HMR. Extensive experiments are conducted against representative baselines under different settings. The results demonstrate that mmPTM consistently achieves outstanding accuracy and competitive generalization capability across multiple sub-benchmarks, validating the effectiveness of multi-radar fusion and the practical value of the proposed dataset and benchmark for mmWave-based HMR research. DGHMesh and mmPTM are publicly available at https://github.com/SPIresearch/DGHMesh. (The complete benchmark and code will be released after paper publication.)
Chinese Translation
毫米波(mmWave)雷达在无接触、保护隐私和稳健的人体感知方面展现了巨大的潜力,但现有基于毫米波的人体网格重建(HMR)研究仍受到缺乏用于配置变化下的泛化分析和不同算法公平比较的基准的限制。为了解决这一限制,我们提出了DGHMesh,一个大规模的双雷达毫米波数据集及以泛化为重点的HMR基准。该数据集包含15个受试者执行8个动作的数据,收集了来自FMCW雷达、SFCW雷达、RGB图像和高精度3D HMR注释的360,000帧同步数据。此外,该数据集提供了来自两种雷达模式的同步原始I/Q数据和准确校准的雷达空间位置。该基准旨在评估HMR方法在多种测量配置下的表现,包括人体位置变化、人体朝向变化、子阵列大小变化和跨受试者设置。基于DGHMesh,我们还提出了mmPTM,一个基于查询的多雷达融合框架,联合利用点云和成像管进行HMR。在不同设置下,我们针对代表性基线进行了广泛的实验。结果表明,mmPTM在多个子基准中始终实现了卓越的准确性和竞争力的泛化能力,验证了多雷达融合的有效性以及所提出的数据集和基准在基于毫米波的HMR研究中的实际价值。DGHMesh和mmPTM已公开发布于https://github.com/SPIresearch/DGHMesh。(完整的基准和代码将在论文发表后发布)
cs.CV / 10 / 2604.22828
MetaEarth3D: Unlocking World-scale 3D Generation with Spatially Scalable Generative Modeling
MetaEarth3D:通过空间可扩展生成建模解锁世界规模的3D生成
Abstract
Recent generative AI models have achieved remarkable breakthroughs in language and visual understanding. However, although these models can generate realistic visual content, their spatial scale remains confined to bounded environments, preventing them from capturing how geographic environments evolve across thousands of kilometers or from modeling the spatial structure of the large-scale physical world. This limitation poses a critical challenge for ultra-wide-area spatial intelligence in Earth observation and simulation, revealing a deeper gap in generative AI: progress has relied primarily on scaling model parameters and training data, while overlooking spatial scale as a core dimension of intelligence. Here, motivated by this missing dimension, we investigate spatial scale as a new scaling axis in foundation models and present MetaEarth3D, the first generative foundation model capable of spatially consistent generation at the planetary scale. Taking optical Earth observation simulation as a testbed, MetaEarth3D enables the generation of multi-level, unbounded, and diverse 3D scenes spanning large-scale terrains, medium-scale cities, and fine-grained street blocks. Built upon 10 million globally distributed real-world training images, MetaEarth3D demonstrates both strong visual realism and geospatial statistical realism. Beyond generation, MetaEarth3D serves as a generative data engine for diverse virtual environments in ultra-wide spatial intelligence. We argue that this study may help empower next-generation spatial intelligence for Earth observation.
Chinese Translation
近期的生成性人工智能模型在语言和视觉理解方面取得了显著突破。然而,尽管这些模型能够生成逼真的视觉内容,它们的空间尺度仍然局限于有限的环境,无法捕捉地理环境在数千公里范围内的演变,也无法建模大规模物理世界的空间结构。这一限制对地球观测和模拟中的超广域空间智能构成了重大挑战,揭示了生成性人工智能中的一个更深层次的差距:进展主要依赖于模型参数和训练数据的扩展,而忽视了空间尺度作为智能的核心维度。在此,我们受到这一缺失维度的启发,研究空间尺度作为基础模型中的新扩展轴,并提出MetaEarth3D,这是第一个能够在行星尺度上进行空间一致生成的生成基础模型。以光学地球观测模拟作为测试平台,MetaEarth3D能够生成跨越大规模地形、中等规模城市和细粒度街区的多层次、无限制和多样化的3D场景。基于1000万张全球分布的真实世界训练图像,MetaEarth3D展示了强大的视觉真实感和地理空间统计真实感。除了生成,MetaEarth3D还作为超广域空间智能中多样虚拟环境的生成数据引擎。我们认为,这项研究可能有助于增强下一代地球观测的空间智能。
cs.CV / 11 / 2604.22829
Lost in the Vibrations: Vision Language Models Fail the Dynamic Gauges Test
迷失在振动中:视觉语言模型未能通过动态仪表测试
Abstract
The digital transformation of industrial manufacturing increasingly relies on the ability of autonomous robots to interact with legacy infrastructure, particularly analog gauges. While Vision-Language Models (VLMs) have demonstrated potential in zero-shot instrument recognition, their deployment in measurement systems remains constrained by an inherent inability to accurately analyze high-frequency temporal events and needle vibrations. This paper evaluates state-of-the-art models, including GPT-5 and Gemini 3, against the strict requirements of metrology and uncertainty quantification. To facilitate this evaluation, we introduce a novel dataset comprising video sequences of various gauge types: circular, linear, and Vernier, under diverse motion speed profiles. Our findings indicate that current VLMs exhibit limited ability in interpreting needle trajectories and scale semantics, failing to provide the traceability and reliability needed for safety-critical monitoring. The results demonstrate that these models have not yet achieved the performance necessary to be classified as trustworthy synthetic instruments under existing IEEE and ISO standards.
Chinese Translation
工业制造的数字化转型日益依赖于自主机器人与传统基础设施,特别是模拟仪表的互动能力。尽管视觉语言模型(Vision-Language Models, VLMs)在零样本仪器识别方面显示出潜力,但它们在测量系统中的应用仍受到固有的限制,无法准确分析高频时间事件和针头振动。本文评估了包括GPT-5和Gemini 3在内的最新模型,针对计量学和不确定性量化的严格要求进行测试。为促进这一评估,我们引入了一个新颖的数据集,包含不同类型仪表的视频序列:圆形、线性和游标,涵盖多种运动速度特征。我们的研究结果表明,当前的VLM在解释针头轨迹和刻度语义方面能力有限,未能提供安全关键监测所需的可追溯性和可靠性。结果表明,这些模型尚未达到根据现有IEEE和ISO标准被归类为可信合成仪器所需的性能。
cs.CV / 12 / 2604.22830
2D Pre-Training for 3D Pose Estimation
用于三维姿态估计的二维预训练
Abstract
Pre-training is a general method used in a range of deep learning tasks. By first training a model on one task and then further training it on the downstream task used for final evaluation, the model is pushed to learn a more general understanding of the input data. While pre-training has previously been applied to 3D Human Pose Estimation (HPE), the datasets used are typically limited to a few strong benchmarks, such as Human3.6M. Therefore, in this project, we expand the scope of an existing 3D HPE scheme to be compatible with additional 2D and 3D HPE datasets, such as Occlusion Person. We perform an extensive study on how aspects of 2D pre-training, such as model size, affect downstream performance, and to what extent pre-training can help the model generalize to different datasets. Experimental results show that 2D pre-training consistently outperforms training on 3D data alone, particularly in terms of computational efficiency. Finally, using MPII and Human3.6M, we obtain an MPJPE of under 64.5 mm.
Chinese Translation
预训练是一种广泛应用于多种深度学习任务的通用方法。通过首先在一个任务上训练模型,然后在用于最终评估的下游任务上进一步训练,模型被迫学习对输入数据的更一般理解。尽管预训练已在三维人体姿态估计(3D Human Pose Estimation, HPE)中应用,但所使用的数据集范围通常非常有限,主要集中在一些强基准上,如Human3.6M。因此,在本项目中,我们扩展了现有的3D HPE方案,使其兼容其他二维和三维HPE数据集,如Occlusion Person。我们对二维预训练的各个方面,如模型大小,如何影响下游性能进行了广泛研究,并探讨了预训练在多大程度上可以帮助模型泛化到不同的数据集。实验结果表明,二维预训练在计算效率方面始终优于仅在三维数据上训练。最后,使用MPII和Human3.6M,我们能够获得低于64.5mm的MPJPE分数。
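The headline number in the abstract above is an MPJPE (mean per-joint position error). As a minimal sketch of the metric for a single pose, assuming joints are given as (x, y, z) tuples in millimetres and that any root alignment has already been applied (the abstract does not state the evaluation protocol):

```python
import math


def mpjpe(pred_joints, gt_joints):
    """Mean Euclidean distance between predicted and ground-truth joints."""
    distances = [math.dist(p, g) for p, g in zip(pred_joints, gt_joints)]
    return sum(distances) / len(distances)


# Hypothetical two-joint pose: one joint is off by 5 mm, the other is exact.
pred = [(0.0, 0.0, 0.0), (10.0, 10.0, 10.0)]
gt = [(3.0, 4.0, 0.0), (10.0, 10.0, 10.0)]
print(mpjpe(pred, gt))  # 2.5
```

Dataset-level MPJPE is then the average of this per-pose value over all evaluated frames.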
cs.CV / 13 / 2604.22832
Intervention-Aware Multiscale Representation Learning from Imaging Phenomics and Perturbation Transcriptomics
干预感知的多尺度表征学习:来自成像表型组学和干扰转录组学的研究
Abstract
Microscopy-based phenotypic profiling is scalable for drug discovery but lacks the mechanistic depth of transcriptomics, which remains costly and scarce. Existing multimodal approaches either use images to support other modalities or naively align representations by sample identity, ignoring cell-type and dose variations in weakly paired data, which limits generalization to unseen interventions. In this paper, we introduce an intervention-aware distillation framework that leverages perturbational transcriptomics to guide image representation learning. A transcriptome-conditioned teacher integrates gene expression and intervention metadata to produce soft distributions over a chemistry-aware codebook organized by drug similarity. The teacher employs a fine-tuned single-cell foundation model to encode cell-type context and disentangle dose effects. An image-only student learns to predict these distributions from microscopy alone, distilling mechanistic knowledge while operating independently at test time. This design emphasizes intervention semantics rather than identity alignment and explicitly handles dose and cell-type mismatches. We provide theoretical guarantees showing that transcriptomic guidance tightens the risk bound for image-based prediction. On Cell Painting and RxRx datasets paired with L1000, our method significantly improves one-shot transfer to unseen interventions and drug-target gene discovery compared to self-supervised and alignment baselines.
Chinese Translation
基于显微镜的表型分析在药物发现中具有可扩展性,但缺乏转录组学的机制深度,而转录组学仍然成本高昂且稀缺。现有的多模态方法要么利用图像支持其他模态,要么简单地通过样本身份对齐表征,忽视了弱配对数据中的细胞类型和剂量变化,从而限制了对未见干预的推广。在本文中,我们引入了一种干预感知的蒸馏框架,该框架利用干扰转录组学来指导图像表征学习。转录组条件教师整合了基因表达和干预元数据,以生成基于药物相似性组织的化学感知代码本上的软分布。教师采用经过微调的单细胞基础模型来编码细胞类型上下文并解开剂量效应。仅使用图像的学生模型学习从显微镜图像中预测这些分布,在测试时独立操作的同时提炼机制知识。该设计强调干预语义而非身份对齐,并明确处理剂量和细胞类型的不匹配。我们提供理论保证,表明转录组指导收紧了基于图像预测的风险界限。在与L1000配对的Cell Painting和RxRx数据集上,与自监督和对齐基线相比,我们的方法显著提高了对未见干预的一次性迁移和药物-靶基因发现的能力。
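The teacher-student setup in the abstract above has an image-only student predict the teacher's soft distribution over codebook entries. A common way to train such a student is a KL-divergence distillation loss; the sketch below assumes that objective, which the abstract does not confirm, and the logits and `temperature` parameter are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student): penalty for the student deviating from the teacher."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


# Hypothetical logits over a three-entry codebook.
teacher = softmax([2.0, 0.5, -1.0])
student = softmax([1.0, 1.0, 0.0])
print(kl_divergence(teacher, student))
```

Because only the student runs at test time, the transcriptomic teacher is needed solely to produce the training targets.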
cs.CV / 14 / 2604.22834
WebSerial Vision Training for Microcontrollers: A Browser-Based Companion to On-Device CNN Training
微控制器的WebSerial视觉训练:基于浏览器的设备内CNN训练伴侣
Abstract
This paper presents webmcu-vision-web, a single-file, zero-install browser application for end-to-end TinyML vision model training and deployment on the Seeed Studio XIAO ESP32-S3 Sense (XIAO ML Kit, $15-40 USD). Acting as a browser-based companion to the on-device Arduino firmware of Paper 1 [1], it provides a private, fully local machine learning pipeline, from firmware flashing through image collection, CNN training, weight export, and live activation visualization, without any software installation beyond a Chromium-based browser. The system targets educators, small businesses, and researchers who need to train task-specific visual classifiers under their exact deployment conditions. Key capabilities include: in-browser firmware flashing via esptool-js; an SD card file browser with image preview and inline editing; config.json live-sync for zero-recompile hyperparameter adjustment; webcam and ESP32 OV2640 camera image capture; TensorFlow.js CNN training completing a three-class run (~30 images per class, 20 epochs) in approximately 1 minute browser-side versus 9 minutes on-device, enabling a complete collect-train-deploy cycle in under 10 minutes; weight export as myWeights.bin and myWeights.h; confusion matrix; and a live Conv2 activation heatmap streamed from the ESP32 during inference. No data leaves the local machine at any stage. A five-run consistency evaluation on the three-class reference problem (0Blank, 1Cup, 2Pen) demonstrates stable convergence with mean accuracy and standard deviation reported; all artefacts are released at the repository link below. The repository is a living template for LLM-assisted adaptation to new hardware and tasks. All source code is MIT-licensed at https://github.com/webmcu-ai/webmcu-vision-web.
Chinese Translation
本文介绍了webmcu-vision-web,这是一个单文件、零安装的浏览器应用程序,用于在Seeed Studio XIAO ESP32-S3 Sense(XIAO ML Kit,15-40美元)上进行端到端的TinyML视觉模型训练和部署。作为论文1 [1]中设备内Arduino固件的浏览器伴侣,它提供了一个私有的、完全本地的机器学习管道,从固件烧录、图像收集、CNN训练、权重导出到实时激活可视化,除了基于Chromium的浏览器外无需任何软件安装。该系统的目标用户是需要在其特定部署条件下训练任务特定视觉分类器的教育工作者、小型企业和研究人员。主要功能包括:通过esptool-js进行浏览器内固件烧录;带有图像预览和内联编辑的SD卡文件浏览器;config.json的实时同步以实现零重编译的超参数调整;网络摄像头和ESP32 OV2640摄像头的图像捕获;TensorFlow.js CNN训练在浏览器端完成三类运行(每类约30张图像,20个周期)大约需要1分钟,而设备端则需要9分钟,从而使完整的收集-训练-部署周期在10分钟内完成;权重导出为myWeights.bin和myWeights.h;混淆矩阵;以及在推理过程中从ESP32流式传输的实时Conv2激活热图。在任何阶段,数据都不会离开本地机器。对三类参考问题(0Blank、1Cup、2Pen)进行的五次运行一致性评估表明,收敛稳定,报告了平均准确率和标准差;所有文档均在下面的仓库链接中发布。该仓库是一个活模板,用于LLM辅助适应新硬件和任务。所有源代码均在https://github.com/webmcu-ai/webmcu-vision-web上以MIT许可证发布。
cs.CV / 15 / 2604.22835
ParkingScenes: A Structured Dataset for End-to-End Autonomous Parking in Simulation Scenes
ParkingScenes:用于端到端自主停车的结构化数据集
Abstract
Autonomous parking remains a critical yet challenging task in intelligent driving systems, particularly within constrained urban environments where maneuvering space is limited and precise control is essential. While recent advances in end-to-end learning have shown great promise, the lack of high-quality, structured datasets tailored for parking scenarios remains a significant bottleneck. To address this gap, we present ParkingScenes, a comprehensive multimodal dataset specifically designed for end-to-end autonomous parking in simulated scenes. Built on the CARLA simulator, ParkingScenes features structured parking trajectories generated by a Hybrid A* planner and a Model Predictive Controller (MPC), providing accurate and reproducible supervision signals. The dataset includes 16 reverse-in and 6 parallel parking scenarios, each executed under two pedestrian conditions (present and absent), resulting in 704 structured episodes and approximately 105,000 frames. Each scenario is repeated 16 times to ensure consistent coverage. Each frame contains synchronized data from four RGB cameras, four depth sensors, vehicle motion states, and Bird's-Eye View (BEV) representations, enabling rich multimodal fusion and context-aware learning. To demonstrate the utility of our dataset, we compare models trained on ParkingScenes with those trained on unstructured, manually collected simulation data under identical conditions. Results show significant improvements in performance, underscoring the effectiveness of structured supervision for robust and accurate parking policy learning. By releasing both the dataset and the collection framework, ParkingScenes establishes a scalable and reproducible benchmark for advancing learning-based autonomous parking systems. The dataset and collection framework will be released at: https://github.com/haonan-ai/ParkingScenes
Chinese Translation
自主停车仍然是智能驾驶系统中一项关键而具有挑战性的任务,尤其是在受限的城市环境中,机动空间有限且精确控制至关重要。尽管最近在端到端学习方面取得了显著进展,但缺乏针对停车场景的高质量结构化数据集仍然是一个重要瓶颈。为了解决这一问题,我们提出了ParkingScenes,这是一个专门为模拟场景中的端到端自主停车设计的综合多模态数据集。ParkingScenes基于CARLA模拟器构建,具有由混合A*规划器和模型预测控制器(Model Predictive Controller, MPC)生成的结构化停车轨迹,提供准确且可重复的监督信号。该数据集包括16个倒车入库和6个平行停车场景,每个场景在两种行人条件(存在和缺失)下执行,生成704个结构化集和大约105000帧。每个场景重复16次,以确保覆盖的一致性。每帧包含来自四个RGB相机、四个深度传感器、车辆运动状态和鸟瞰图(Bird's-Eye View, BEV)表示的同步数据,支持丰富的多模态融合和上下文感知学习。为了展示我们数据集的实用性,我们将基于ParkingScenes训练的模型与在相同条件下基于非结构化手动收集的模拟数据训练的模型进行了比较。结果显示性能显著提升,强调了结构化监督在稳健和准确的停车策略学习中的有效性。通过发布数据集和收集框架,ParkingScenes建立了一个可扩展和可重复的基准,以推动基于学习的自主停车系统的发展。数据集和收集框架将发布在:https://github.com/haonan-ai/ParkingScenes
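The episode count in the ParkingScenes abstract follows directly from the scenario grid it describes. A short worked check, using only the figures stated in the abstract (the frames-per-episode value is a derived average, not a number the paper reports):

```python
# Figures stated in the abstract.
reverse_in_scenarios = 16
parallel_scenarios = 6
pedestrian_conditions = 2  # pedestrians present / absent
repeats_per_scenario = 16
approx_total_frames = 105_000

episodes = (reverse_in_scenarios + parallel_scenarios) * pedestrian_conditions * repeats_per_scenario
frames_per_episode = approx_total_frames / episodes  # derived average, an estimate only

print(episodes)                   # 704
print(round(frames_per_episode))  # 149
```

The product (16 + 6) x 2 x 16 reproduces the 704 episodes quoted, which works out to roughly 149 frames per episode on average.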
cs.CV / 16 / 2604.22836
AgentRVOS for MeViS-Text Track of 5th PVUW Challenge: 3rd Method
基于AgentRVOS的第五届PVUW挑战MeViS-文本跟踪:第三种方法
Abstract
This report describes a Ref-VOS pipeline centered on Sa2VA and organized with explicit agent roles. The key idea is that Sa2VA should provide the first dense semantic hypothesis, while an agent loop decides whether that hypothesis should be accepted, revised, or refined. The pipeline starts with a target-presence judgment stage. If the referred object does not exist in the video, the system directly outputs zero masks. Otherwise, Sa2VA receives the video and referring prompt and produces a coarse mask trajectory over the full video. This trajectory is treated as a semantic prior rather than a final answer. A planner agent decomposes the query, temporal partition agents identify informative blocks, scout agents search for anchor frames, and refinement agents convert reliable Sa2VA masks into boxes and points for SAM3 propagation. A critic scores candidate trajectories, a reflection controller repairs weak hypotheses, and a collaboration controller reconciles multiple agent branches. The result is a Ref-VOS system in which Sa2VA is responsible for dense grounded understanding, while the agent layer handles presence verification, temporal search, confidence-aware revision, and final mask refinement.
Chinese Translation
本报告描述了一种以Sa2VA为中心的Ref-VOS管道,并明确组织了代理角色。关键思想是Sa2VA应提供第一个密集语义假设,而代理循环决定该假设是被接受、修正还是细化。该管道从目标存在判断阶段开始。如果所指对象在视频中不存在,系统将直接输出零掩码。否则,Sa2VA接收视频和引用提示,并在整个视频中生成粗略的掩码轨迹。该轨迹被视为语义先验,而非最终答案。规划代理分解查询,时间分区代理识别信息块,侦查代理搜索锚帧,细化代理将可靠的Sa2VA掩码转换为框和点,以便进行SAM3传播。评估者对候选轨迹进行评分,反思控制器修复弱假设,协作控制器调和多个代理分支。最终结果是一个Ref-VOS系统,其中Sa2VA负责密集的基础理解,而代理层则处理存在验证、时间搜索、基于信心的修正和最终掩码细化。
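The refinement step described above (turning a reliable mask into a box and a point prompt) can be sketched as follows. This is a minimal illustration, not the report's implementation: `mask_to_prompts` is a hypothetical helper, masks are plain nested lists of 0/1, and picking the foreground pixel nearest the centroid is an assumed heuristic.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]                 # binary mask as rows of 0/1
Box = Tuple[int, int, int, int]        # (x_min, y_min, x_max, y_max)
Point = Tuple[int, int]                # (x, y)


def mask_to_prompts(mask: Mask) -> Optional[Tuple[Box, Point]]:
    """Convert a binary mask into a box prompt and one point prompt.

    Returns None for an empty mask, which mirrors the "target absent"
    case where no prompt should be issued.
    """
    coords = [(x, y) for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    box = (min(xs), min(ys), max(xs), max(ys))
    # The centroid of a non-convex mask can fall outside it, so use the
    # foreground pixel closest to the centroid instead.
    cx = sum(xs) / len(xs)
    cy = sum(ys) / len(ys)
    point = min(coords, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
    return box, point
```

A downstream propagator would receive `box` and `point` for each anchor frame; how SAM3 consumes them is outside this sketch.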
cs.CV / 17 / 2604.22837
OAMVOS:2nd Report for 5th PVUW MOSE Track
OAMVOS:第五届PVUW MOSE赛道的第二份报告
Abstract
SAM-based dense trackers provide strong short-term mask propagation but remain fragile under long occlusion, fast motion, viewpoint change, and distractors. The problem is especially severe for small objects, where a few incorrect memory updates can dominate later predictions. This report presents an occlusion- and reappearance-aware extension of DAM4SAM that improves memory control rather than changing the backbone. The method augments the original SAM3 tracker with four ingredients: a reliability-aware tracking state machine, branch-based recovery, delayed DRM promotion, and a selective policy for native SAM3 memory selection. During stable tracking, the model follows the original single-path propagation process. Once confidence drops, the tracker enters an ambiguous or recovery mode, maintains a small set of candidate branches, and commits memory only after a branch is reconfirmed. For small-object disappearance and reappearance, native memory selection is temporarily bypassed so older anchors remain accessible. In addition, the first conditioning frame is explicitly preserved, and the conditioning-memory budget is moderately enlarged to improve long-gap recovery. The resulting design keeps DAM4SAM efficient in easy cases while improving robustness in sequences dominated by occlusion and reappearance.
Chinese Translation
基于SAM的密集跟踪器在短期掩码传播方面表现出色,但在长时间遮挡、快速运动、视角变化和干扰物的情况下仍然脆弱。对于小物体而言,这一问题尤为严重,少量不正确的记忆更新可能会主导后续的预测。本报告提出了一种关注遮挡和重新出现的DAM4SAM扩展,旨在改善记忆控制,而不是改变主干网络。该方法通过四个要素增强了原始的SAM3跟踪器:一个可靠性感知的跟踪状态机、基于分支的恢复、延迟的DRM提升,以及针对原生SAM3记忆选择的选择性策略。在稳定跟踪期间,模型遵循原始的单路径传播过程。一旦置信度下降,跟踪器进入模糊或恢复模式,维持一小组候选分支,并仅在分支被重新确认后才提交记忆。对于小物体的消失和重新出现,暂时绕过原生记忆选择,以便旧的锚点保持可访问。此外,首个条件帧被明确保留,条件记忆预算适度扩大,以改善长时间间隔的恢复。最终设计使得DAM4SAM在简单情况下保持高效,同时在被遮挡和重新出现主导的序列中提高了鲁棒性。
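The reliability-aware state machine described above can be sketched roughly as below. The state names follow the abstract (stable, ambiguous, recovery), but the two thresholds, the `reconfirmed` flag, and the function names are assumptions made for illustration only.

```python
from enum import Enum


class TrackState(Enum):
    STABLE = "stable"
    AMBIGUOUS = "ambiguous"
    RECOVERY = "recovery"


HIGH_CONF = 0.8   # hypothetical threshold for trusting a prediction
LOW_CONF = 0.4    # hypothetical threshold below which the target is treated as lost


def next_state(state: TrackState, confidence: float,
               reconfirmed: bool = False) -> TrackState:
    """Pick the next tracking state from the current state and mask confidence."""
    # Leaving a non-stable state requires a reconfirmed branch, not just one
    # confident frame.
    if confidence >= HIGH_CONF and (state is TrackState.STABLE or reconfirmed):
        return TrackState.STABLE
    if confidence < LOW_CONF:
        return TrackState.RECOVERY
    return TrackState.AMBIGUOUS


def may_commit_memory(state: TrackState) -> bool:
    """Memory is only written while tracking is stable."""
    return state is TrackState.STABLE
```

The point of the design is that a single confident frame after an occlusion does not immediately overwrite memory; the branch has to be reconfirmed first.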
cs.CV / 18 / 2604.22838
Neural Network Optimization Reimagined: Decoupled Techniques for Scratch and Fine-Tuning
重新构想的神经网络优化:针对从头训练和微调的解耦技术
Abstract
With the accumulation of resources in the era of big data and the rise of pre-trained models in deep learning, optimizing neural networks for various tasks often involves different strategies for fine-tuning pre-trained models versus training from scratch. However, existing optimizers primarily focus on reducing the loss function by updating model parameters, without fully addressing the unique demands of these two major paradigms. In this paper, we propose DualOpt, a novel approach that decouples optimization techniques specifically tailored for these distinct training scenarios. For training from scratch, we introduce real-time layer-wise weight decay, designed to enhance both convergence and generalization by aligning with the characteristics of weight updates and network architecture. More importantly, for fine-tuning, we integrate weight rollback with the optimizer, incorporating a rollback term into each weight update step. This ensures consistency in the weight distribution between upstream and downstream models, effectively mitigating knowledge forgetting and improving fine-tuning performance. Additionally, we extend the layer-wise weight decay to dynamically adjust the rollback levels across layers, adapting to the varying demands of different downstream tasks. Extensive experiments across diverse tasks, including image classification, object detection, semantic segmentation, and instance segmentation, demonstrate the broad applicability and state-of-the-art performance of DualOpt. Code is available at https://github.com/qklee-lz/OLOR-AAAI-2024.
Chinese Translation
随着大数据时代资源的积累和深度学习中预训练模型的兴起,针对不同任务优化神经网络通常涉及对预训练模型的微调与从头训练的不同策略。然而,现有的优化器主要集中在通过更新模型参数来减少损失函数,而未能充分满足这两种主要范式的独特需求。本文提出了DualOpt,一种新颖的方法,专门为这两种不同的训练场景解耦优化技术。对于从头训练,我们引入了实时层级权重衰减,旨在通过与权重更新和网络架构的特征对齐来增强收敛性和泛化能力。更重要的是,对于微调,我们将权重回滚与优化器结合,在每次权重更新步骤中纳入回滚项。这确保了上游和下游模型之间权重分布的一致性,有效减轻了知识遗忘并改善了微调性能。此外,我们将层级权重衰减扩展到动态调整各层的回滚水平,以适应不同下游任务的变化需求。在包括图像分类、目标检测、语义分割和实例分割等多种任务上的广泛实验表明,DualOpt具有广泛的适用性和最先进的性能。代码可在 https://github.com/qklee-lz/OLOR-AAAI-2024 获取。
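The idea of adding a rollback term to each update, with a per-layer rollback level, can be illustrated with a simplified sketch. This is not the exact DualOpt update rule: the plain-SGD form, the linear per-layer schedule, and both function names are assumptions chosen to keep the example small.

```python
from typing import List


def rollback_step(weights: List[float], grads: List[float],
                  pretrained: List[float], lr: float,
                  rollback_level: float) -> List[float]:
    """One SGD step plus a pull-back toward the pretrained weights.

    rollback_level = 0 gives plain SGD; with zero gradient,
    rollback_level = 1 resets the weights to their pretrained values.
    """
    return [w - lr * g - rollback_level * (w - w0)
            for w, g, w0 in zip(weights, grads, pretrained)]


def layer_rollback_levels(num_layers: int, shallow_level: float,
                          deep_level: float) -> List[float]:
    """Linearly interpolate a rollback level per layer (hypothetical schedule)."""
    if num_layers == 1:
        return [shallow_level]
    step = (deep_level - shallow_level) / (num_layers - 1)
    return [shallow_level + i * step for i in range(num_layers)]
```

With a schedule such as `layer_rollback_levels(12, 1.0, 0.0)`, shallow layers stay close to the upstream weights while deeper layers are free to adapt to the downstream task.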
cs.CV / 19 / 2604.22839
From Skeletons to Pixels: Few-Shot Precise Event Spotting via Representation and Prediction Distillation
从骨架到像素:通过表征与预测蒸馏实现少样本精确事件检测
Abstract
Precise Event Spotting (PES) is essential in fast-paced sports such as tennis, where fine-grained events occur within very short temporal windows. Accurate frame-level localization is challenging because of motion blur, subtle action differences, and limited annotated data. We study two complementary distillation strategies for few-shot PES: Adaptive Weight Distillation (AWD), a prediction-level method that adaptively weights teacher supervision on unlabeled data, and Annealed Multimodal Distillation for Few-Shot Event Detection (AMD-FED), a representation-level framework that transfers robust skeleton knowledge into visual modalities through annealed pseudo-labeling. Both methods use multimodal distillation to improve generalization under limited supervision. We evaluate them on F3Set-Tennis(sub) under few-shot k-clip settings, where they consistently outperform single-modality baselines and prior PES approaches. After observing the stronger performance of representation-level distillation on tennis, we further validate AMD-FED on a second sports dataset, Figure Skating, where it also shows robust performance in the k-clip scenario. These results highlight the effectiveness of multimodal distillation, especially representation-level transfer, for few-shot precise event spotting.
Chinese Translation
精确事件检测(PES)在快速节奏的体育运动中至关重要,例如网球,其中细粒度事件发生在非常短的时间窗口内。由于运动模糊、微妙的动作差异和有限的标注数据,准确的帧级定位面临挑战。我们研究了两种互补的少样本PES蒸馏策略:自适应权重蒸馏(AWD),一种在预测级别上自适应加权教师对未标注数据的监督的方法,以及用于少样本事件检测的退火多模态蒸馏(AMD-FED),一种通过退火伪标注将稳健的骨架知识转移到视觉模态的表征级框架。这两种方法都利用多模态蒸馏来改善在有限监督下的泛化能力。我们在少样本k-片段设置下对F3Set-Tennis(sub)进行了评估,结果显示它们始终优于单模态基线和先前的PES方法。在观察到表征级蒸馏在网球上的更强表现后,我们进一步在第二个体育数据集——花样滑冰上验证了AMD-FED,在k-片段场景中也表现出稳健的性能。这些结果突显了多模态蒸馏的有效性,特别是表征级转移在少样本精确事件检测中的重要性。
cs.CV / 20 / 2604.22840
AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards
AeSlides:通过可验证奖励激励基于大型语言模型的幻灯片生成中的美学布局
Abstract
Large language models (LLMs) have demonstrated strong potential in agentic tasks, particularly in slide generation. However, slide generation poses a fundamental challenge: the generation process is text-centric, whereas its quality is governed by visual aesthetics. This modality gap leads current models to frequently produce slides with aesthetically suboptimal layouts. Existing solutions typically rely either on heavy visual reflection, which incurs high inference cost yet yields limited gains; or on fine-tuning with large-scale datasets, which still provides weak and indirect aesthetic supervision. In contrast, the explicit use of aesthetic principles as supervision remains unexplored. In this work, we present AeSlides, a reinforcement learning framework with verifiable rewards for Aesthetic layout supervision in Slide generation. We introduce a suite of meticulously designed verifiable metrics to quantify slide layout quality, capturing key layout issues in an accurate, efficient, and low-cost manner. Leveraging these verifiable metrics, we develop a GRPO-based reinforcement learning method that directly optimizes slide generation models for aesthetically coherent layouts. With only 5K training prompts on GLM-4.7-Flash, AeSlides improves aspect ratio compliance from 36% to 85%, while reducing whitespace by 44%, element collisions by 43%, and visual imbalance by 28%. Human evaluation further shows a substantial improvement in overall quality, increasing scores from 3.31 to 3.56 (+7.6%), outperforming both model-based reward optimization and reflection-based agentic approaches, and even edging out Claude-Sonnet-4.5. These results demonstrate that such a verifiable aesthetic paradigm provides an efficient and scalable approach to aligning slide generation with human aesthetic preferences. Our repository is available at https://github.com/ympan0508/aeslides.
Chinese Translation
大型语言模型(LLMs)在代理任务中展现了强大的潜力,尤其是在幻灯片生成方面。然而,幻灯片生成面临一个根本性的挑战:生成过程以文本为中心,而其质量则由视觉美学决定。这种模态差距导致当前模型经常生成美学布局不佳的幻灯片。现有解决方案通常依赖于重度的视觉反思,这会产生高昂的推理成本,但收益有限;或者依赖于使用大规模数据集进行微调,这仍然提供了弱且间接的美学监督。相比之下,明确使用美学原则作为监督的方式尚未被探索。在本研究中,我们提出了AeSlides,一个具有可验证奖励的强化学习框架,用于幻灯片生成中的美学布局监督。我们引入了一套精心设计的可验证指标,以量化幻灯片布局质量,以准确、高效和低成本的方式捕捉关键布局问题。利用这些可验证指标,我们开发了一种基于GRPO的强化学习方法,直接优化幻灯片生成模型,以实现美学一致的布局。在仅使用5K个训练提示的情况下,AeSlides在GLM-4.7-Flash上将纵横比合规性从36%提高到85%,同时减少了44%的空白、43%的元素碰撞和28%的视觉不平衡。人类评估进一步显示整体质量显著提升,评分从3.31提高到3.56(+7.6%),超越了基于模型的奖励优化和基于反思的代理方法,甚至超过了Claude-Sonnet-4.5。这些结果表明,这种可验证的美学范式为将幻灯片生成与人类美学偏好对齐提供了一种高效且可扩展的方法。我们的代码库可在https://github.com/ympan0508/aeslides获取。
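To make the notion of a verifiable layout metric concrete, here is a minimal sketch of two checks of the kind the abstract lists (element collisions and aspect ratio compliance). The exact metric definitions used by AeSlides are not given in the abstract, so the function names, the axis-aligned box representation, and the 2% tolerance are all assumptions.

```python
from itertools import combinations
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x0, y0, x1, y1), with x0 < x1 and y0 < y1


def overlaps(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def collision_count(boxes: List[Box]) -> int:
    """Number of element pairs whose bounding boxes overlap."""
    return sum(1 for a, b in combinations(boxes, 2) if overlaps(a, b))


def aspect_ratio_ok(width: float, height: float,
                    target: float = 16 / 9, tol: float = 0.02) -> bool:
    """True if the rendered slide is within a relative tolerance of the target ratio."""
    return abs(width / height - target) <= tol * target
```

Because both checks are deterministic functions of the rendered layout, they can be computed without a learned reward model, which is what makes them usable as verifiable rewards.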
cs.CV / 21 / 2604.22841
ATTN-FIQA: Interpretable Attention-based Face Image Quality Assessment with Vision Transformers
ATTN-FIQA:基于可解释注意力的面部图像质量评估方法与视觉变换器
Abstract
Face Image Quality Assessment (FIQA) aims to assess the recognition utility of face samples and is essential for reliable face recognition (FR) systems. Existing approaches require computationally expensive procedures such as multiple forward passes, backpropagation, or additional training, and only recent work has focused on the use of Vision Transformers. Recent studies highlighted that these architectures inherently function as saliency learners with attention patterns naturally encoding spatial importance. This work proposes ATTN-FIQA, a novel training-free approach that investigates whether pre-softmax attention scores from pre-trained Vision Transformer-based face recognition models can serve as quality indicators. We hypothesize that attention magnitudes intrinsically encode quality: high-quality images with discriminative facial features enable strong query-key alignments producing focused, high-magnitude attention patterns, while degraded images generate diffuse, low-magnitude patterns. ATTN-FIQA extracts pre-softmax attention matrices from the final transformer block, aggregates multi-head attention information across all patches, and computes image-level quality scores through simple averaging, requiring only a single forward pass through pre-trained models without architectural modifications, backpropagation, or additional training. Through comprehensive evaluation across eight benchmark datasets and four FR models, this work demonstrates that attention-based quality scores effectively correlate with face image quality and provide spatial interpretability, revealing which facial regions contribute most to quality determination.
Chinese Translation
面部图像质量评估(FIQA)旨在评估面部样本的识别效用,对于可靠的面部识别(FR)系统至关重要。现有方法通常需要计算开销较大的过程,如多次前向传播、反向传播或额外训练,而最近的研究则集中于视觉变换器的应用。近期研究表明,这些架构本质上作为显著性学习者运作,注意力模式自然地编码了空间重要性。本研究提出了ATTN-FIQA,这是一种新颖的无训练方法,探讨预训练的基于视觉变换器的面部识别模型中的预-softmax注意力分数是否可以作为质量指标。我们假设注意力幅度本质上编码了质量:高质量图像具有可区分的面部特征,能够产生强查询-关键对齐,从而产生集中的高幅度注意力模式,而退化图像则生成分散的低幅度模式。ATTN-FIQA从最终的变换器块中提取预-softmax注意力矩阵,聚合所有补丁的多头注意力信息,并通过简单的平均计算图像级质量分数,仅需一次前向传播通过预训练模型,无需架构修改、反向传播或额外训练。通过在八个基准数据集和四个FR模型上的全面评估,本研究表明基于注意力的质量分数与面部图像质量有效相关,并提供空间可解释性,揭示哪些面部区域对质量判定贡献最大。
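The scoring idea above (average the pre-softmax query-key scores of the last block) can be sketched in a few lines. This is an illustration under assumptions: queries and keys are given as plain lists per head, the usual 1/sqrt(d) scaling is applied, and a plain mean over heads and patch pairs stands in for the paper's aggregation, which the abstract does not spell out.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def pre_softmax_scores(queries: Matrix, keys: Matrix) -> Matrix:
    """Scaled dot-product scores q.k / sqrt(d), before any softmax."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [[scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
            for q in queries]


def attention_quality(heads: List[Tuple[Matrix, Matrix]]) -> float:
    """Average pre-softmax score over all heads and all patch pairs."""
    total, count = 0.0, 0
    for queries, keys in heads:
        for row in pre_softmax_scores(queries, keys):
            total += sum(row)
            count += len(row)
    return total / count
```

Under the abstract's hypothesis, a sharp face image yields larger query-key alignments and therefore a higher value of `attention_quality` than a degraded one.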
cs.CV / 22 / 2604.22842
EX-FIQA: Leveraging Intermediate Early eXit Representations from Vision Transformers for Face Image Quality Assessment
EX-FIQA:利用视觉变换器中的中间早期退出表示进行人脸图像质量评估
Abstract
Face Image Quality Assessment is crucial for reliable face recognition systems, yet existing Vision Transformer-based approaches rely exclusively on final-layer representations, ignoring quality-relevant information captured at intermediate network depths. This paper presents the first comprehensive investigation of how intermediate representations within ViTs contribute to face quality assessment through early exit mechanisms and score fusion strategies. We systematically analyze all twelve transformer blocks of ViT-FIQA architectures, demonstrating that different depths capture distinct and complementary quality-relevant information, as evidenced by varying attention patterns and performance characteristics across network layers. We propose a score fusion framework that combines quality predictions from multiple transformer blocks without architectural modifications or additional training. Our early exit analysis reveals optimal performance-efficiency trade-offs, enabling significant computational savings while maintaining competitive performance. Through extensive evaluation across eight benchmark datasets using four FR models, we demonstrate that our fusion strategy improves upon single-exit approaches. Our proposed quality fusion approach employs depth-weighted averaging that assigns progressively higher importance to deeper transformer blocks, achieving the best quality assessment performance by effectively leveraging the hierarchical nature of feature learning in ViTs. Our work challenges the conventional wisdom that only deep features matter for face analysis, revealing that intermediate representations contain valuable information for quality assessment. The proposed framework offers practical benefits for real-world biometric systems by enabling adaptive computation based on resource constraints while maintaining competitive quality assessment capabilities.
Chinese Translation
人脸图像质量评估对于可靠的人脸识别系统至关重要,但现有基于视觉变换器的方法仅依赖于最终层表示,忽视了在中间网络深度捕获的与质量相关的信息。本文首次全面探讨了视觉变换器(ViTs)中间表示如何通过早期退出机制和评分融合策略对人脸质量评估的贡献。我们系统分析了ViT-FIQA架构的十二个变换器块,证明了不同深度捕获了不同且互补的与质量相关的信息,这在网络层之间的注意力模式和性能特征的变化中得到了证实。我们提出了一种评分融合框架,该框架在不修改架构或额外训练的情况下,结合了来自多个变换器块的质量预测。我们的早期退出分析揭示了最佳的性能与效率权衡,使得在保持竞争性能的同时实现显著的计算节省。通过在八个基准数据集上使用四种人脸识别模型进行广泛评估,我们证明了我们的融合策略优于单一退出方法。我们提出的质量融合方法采用深度加权平均,逐步提高对更深变换器块的重要性,从而通过有效利用ViTs中特征学习的层次结构,达到最佳的质量评估性能。我们的研究挑战了传统观念,即只有深层特征对人脸分析重要,揭示了中间表示在质量评估中包含了有价值的信息。所提出的框架为现实世界的生物识别系统提供了实际好处,使其能够根据资源限制进行自适应计算,同时保持竞争力的质量评估能力。
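Depth-weighted score fusion, as described above, can be sketched as a weighted mean whose weights grow with block index. The linear weights 1..N and the `early_exit_score` helper are assumptions; the abstract only states that deeper blocks receive progressively higher importance.

```python
from typing import List


def depth_weighted_fusion(block_scores: List[float]) -> float:
    """Fuse per-block quality scores, weighting block i (1-based) by i."""
    if not block_scores:
        raise ValueError("need at least one block score")
    weights = range(1, len(block_scores) + 1)
    return sum(w * s for w, s in zip(weights, block_scores)) / sum(weights)


def early_exit_score(block_scores: List[float], exit_block: int) -> float:
    """Fuse only the first `exit_block` blocks, as if the network stopped there."""
    return depth_weighted_fusion(block_scores[:exit_block])
```

Calling `early_exit_score(scores, 6)` on a twelve-block model would use half of the blocks, which is the kind of compute-versus-accuracy trade-off the early-exit analysis examines.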
cs.CV / 23 / 2604.22846
Unified Multi-Foundation-Model Slide Representation for Pan-Cancer Recognition and Text-Guided Tumor Localization
统一多基础模型切片表示用于泛癌识别和文本引导的肿瘤定位
Abstract
The expanding ecosystem of pathology foundation models has produced powerful but fragmented tile-level representations, limiting their use in clinical tasks that require unified slide-level reasoning and interpretable linkage to clinically meaningful information. We present ASTRA, a pan-cancer framework that integrates heterogeneous foundation-model representations into a shared slide-level representation space and semantically grounds that space using structured pathology annotation fields, including classification category, cancer type, and anatomic site. ASTRA combines sparse mixture-of-experts contextualization, masked multi-model reconstruction, and contrastive alignment to structured pathology prompts to learn slide representations that support 4-category classification, 3-class solid tumor typing, 16-class cancer typing, and text-guided tumor localization without pixel-level supervision. Developed on a CHTN cohort of 10,359 whole-slide images (WSIs) spanning 16 tumor types, ASTRA consistently improves pan-cancer classification across four pathology foundation-model backbones, achieving up to 97.8% macro-AUC for 4-category classification, 99.7% for 3-class solid tumor typing, and 99.2% for 16-class cancer typing. For tumor localization, ASTRA achieves a mean Dice of 0.897 on an annotated in-domain CHTN subset (n = 380) spanning 16 cancer types and 0.738 on an external TCGA cohort (n = 1,686) spanning four cancer types. These results demonstrate that minimal structured pathology annotation fields derived from slide-level metadata can provide effective semantic supervision for unified slide representation learning, enabling both pan-cancer prediction and weakly supervised tumor localization within a single framework.
Chinese Translation
不断扩展的病理基础模型生态系统产生了强大但分散的图块级表示,限制了它们在需要统一切片级推理和与临床相关信息可解释链接的临床任务中的应用。我们提出了ASTRA,一个泛癌框架,将异构基础模型表示集成到共享的切片级表示空间,并使用结构化病理注释字段(包括分类类别、癌症类型和解剖部位)为该空间提供语义锚定。ASTRA结合稀疏专家混合上下文化、掩蔽多模型重建以及与结构化病理提示的对比对齐,以学习支持4类分类、3类实体肿瘤分型、16类癌症分型和文本引导肿瘤定位的切片表示,而无需像素级监督。ASTRA在涵盖16种肿瘤类型的10,359张全切片图像(WSI)的CHTN队列上开发,在四个病理基础模型骨干网络上持续改善泛癌分类,4类分类的宏观AUC高达97.8%,3类实体肿瘤分型为99.7%,16类癌症分型为99.2%。在肿瘤定位方面,ASTRA在涵盖16种癌症类型的已标注域内CHTN子集(n = 380)上实现了0.897的平均Dice,在涵盖四种癌症类型的外部TCGA队列(n = 1,686)上实现了0.738。这些结果表明,来自切片级元数据的最小结构化病理注释字段可以为统一切片表示学习提供有效的语义监督,从而在单一框架内实现泛癌预测和弱监督肿瘤定位。
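The localization results above are reported as mean Dice. For reference, a minimal sketch of the standard Dice coefficient on flattened binary masks; returning 1.0 when both masks are empty is a common convention and an assumption here, not something the abstract specifies.

```python
from typing import Sequence


def dice(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |A and B| / (|A| + |B|) for flattened 0/1 masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0   # both masks empty: treated as perfect agreement (assumed convention)
    return 2.0 * intersection / size_sum
```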
cs.CV / 24 / 2604.22847
Dream-Cubed: Controllable Generative Modeling in Minecraft by Training on Billions of Cubes
梦立方:通过训练数十亿个立方体实现Minecraft中的可控生成建模
Abstract
We introduce Dream-Cubed, a large-scale dataset of Minecraft worlds at voxel resolution, and a family of models using cubes as powerful compositional units for efficient generation of interactive 3D environments. Dream-Cubed comprises tens of billions of tokens from a carefully curated mixture of procedural biome terrain and high-quality human-authored maps. We use this dataset to conduct the first large-scale study of 3D diffusion models for voxel generation, analyzing discrete and continuous diffusion formulations, data compositions, and architectural design choices. Our models operate directly in the space of blocks, enabling efficient and semantically grounded generation while supporting interactive user workflows such as inpainting and outpainting from user-authored blocks. To quantitatively evaluate our models, we adapt the FID metric to assess semantic differences between real and generated world renderings, and validate generation quality through a human preference study. We release the full dataset, code, and all our pretrained models, which we hope will provide a foundation for future research in efficient generative modeling for structured, interactive 3D environments.
Chinese Translation
我们介绍了梦立方(Dream-Cubed),这是一个以体素分辨率构建的Minecraft世界的大规模数据集,以及一系列使用立方体作为强大组合单元的模型,用于高效生成交互式3D环境。梦立方包含来自精心策划的程序生物群落地形和高质量人类创作地图的数十亿个标记。我们利用该数据集开展了首次针对体素生成的3D扩散模型的大规模研究,分析了离散和连续扩散公式、数据组合以及架构设计选择。我们的模型直接在块的空间中操作,实现了高效且语义上有依据的生成,同时支持用户创作块的修补和扩展等交互式工作流程。为了定量评估我们的模型,我们调整了FID指标,以评估真实世界渲染与生成世界渲染之间的语义差异,并通过人类偏好研究验证生成质量。我们发布了完整的数据集、代码以及所有预训练模型,希望为未来在结构化、交互式3D环境中高效生成建模的研究提供基础。
cs.CV / 25 / 2604.22848
LunarDepthNet: Generation of Digital Elevation Models using Deep Learning and Monocular Satellite Images
LunarDepthNet:利用深度学习和单目卫星图像生成数字高程模型
Abstract
Demand for high-quality Digital Elevation Models (DEMs) of the lunar surface has grown in recent years, because they are essential for studying the Moon and planning future missions. However, detailed elevation data for the Moon remain scarce. To overcome this limitation, this study proposes a novel deep learning method that estimates a surface elevation map directly from monocular images of the surface. The dataset comprises Chandrayaan-2 Terrain Mapping Camera (TMC) images with their corresponding Digital Terrain Models (DTMs). The study proposes LunarDepthNet, a UNet architecture that generates DEMs. It incorporates an EfficientNet encoder and custom layers to learn how light and shadow on the surface relate to the actual elevation values. A combined loss function is also used to keep the terrain details accurate and smooth. During validation, the model showed a stable loss convergence of 12%. It achieved a mean nRMSE of 0.437 and an MAE of 4.5 m in the testing stage. These results show that the model can generate dependable elevation maps from single orbital images, which is useful in regions of the Moon where stereo images are not available.
Chinese Translation
近年来,对月球表面高质量数字高程模型(DEMs)的需求不断增加,因为它们对于研究月球和规划未来的任务至关重要。然而,月球上显然缺乏详细的高程数据。为了解决这一限制,本研究提出了一种新颖的深度学习方法,直接从表面的单目图像中估计和生成表面高程图。所使用的数据集包括来自昌德拉扬-2号地形测绘相机(TMC)的图像及其对应的数字地形模型(DTMs)。本研究提出的LunarDepthNet采用UNet架构生成DEMs。它结合了EfficientNet编码器和自定义层,以正确学习表面光影与实际高程值之间的关系。同时,采用了组合损失函数,以保持地形细节的准确和平滑。在验证过程中,该模型显示出12%的稳定损失收敛。在测试阶段,模型实现了0.437的平均归一化均方根误差(nRMSE)和4.5米的平均绝对误差(MAE)。这些结果证明该模型能够从单一轨道图像生成可靠的高程图,这在没有立体图像的月球区域尤为有用。
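The two test metrics quoted above can be computed as in the sketch below. MAE is standard; for nRMSE the abstract does not say which normalization is used, so dividing the RMSE by the range of the reference elevations is an assumption, and the function names are illustrative.

```python
import math
from typing import Sequence


def mae(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error, in the same unit as the elevations (e.g. metres)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def nrmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """RMSE divided by the range of the reference values (assumed normalization)."""
    value_range = max(target) - min(target)
    if value_range == 0:
        raise ValueError("reference elevations are constant; range normalization undefined")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    return math.sqrt(mse) / value_range
```

Other normalizations (by the mean or the standard deviation of the reference) give different numbers, so a reported nRMSE is only comparable when the normalization is known.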
cs.CV / 26 / 2604.22850
Accelerating New Product Introduction for Visual Quality Inspection via Few-Shot Diffusion-Based Defect Synthesis
通过少量样本扩散基础缺陷合成加速新产品引入的视觉质量检测
Abstract
Industrial visual inspection systems often suffer from a severe scarcity of labeled defect data, particularly during the early stages of New Product Introduction (NPI). This limitation hinders the deployment of robust supervised detectors precisely when automated quality control is most needed. We present an end-to-end generative framework for high-fidelity, few-shot defect synthesis that enables both in-domain augmentation and cross-domain transfer. Our approach disentangles defect morphology from background appearance by combining masked textual inversion for defect representation learning, noise-blended conditioned generation for surface-aware synthesis, and gradient-aware post-processing for seamless visual integration. We evaluate the framework in two practically relevant settings: few-shot data augmentation, where synthetic samples enrich a small set of real defects, and zero-shot adaptation, where defects learned from a source domain are transferred to a novel target surface without any real target-domain defect examples. Using RF-DETR as the downstream detector, we show that the proposed pipeline substantially narrows the domain gap on a private industrial dataset. In the few-shot setting, synthetic augmentation improves mAP from 78.8% to 83.3%. In the zero-shot setting, synthetic domain adaptation improves mAP from 65.0% to 85.1%. These results demonstrate that high-fidelity defect synthesis can meaningfully accelerate NPI by enabling effective inspection models before sufficient real defect data has been collected.
Chinese Translation
工业视觉检测系统在新产品引入(NPI)早期阶段常常面临标记缺陷数据严重匮乏的问题。这一限制妨碍了在自动化质量控制最需要的时候部署稳健的监督检测器。我们提出了一种端到端的生成框架,用于高保真、少量样本缺陷合成,能够实现领域内增强和跨领域转移。我们的方法通过结合缺陷表示学习的掩码文本反演、表面感知合成的噪声混合条件生成以及无缝视觉集成的梯度感知后处理,将缺陷形态与背景外观进行解耦。我们在两个实际相关的设置中评估该框架:少量样本数据增强,其中合成样本丰富了一小组真实缺陷,以及零样本适应,其中从源领域学习的缺陷被转移到一个新的目标表面,而没有任何真实目标领域缺陷示例。使用 RF-DETR 作为下游检测器,我们展示了所提出的管道显著缩小了在一个私有工业数据集上的领域差距。在少量样本设置中,合成增强将 mAP 从 78.8% 提高到 83.3%。在零样本设置中,合成领域适应将 mAP 从 65.0% 提高到 85.1%。这些结果表明,高保真缺陷合成能够在收集到足够真实缺陷数据之前,有效加速新产品引入,促进有效的检测模型的建立。
cs.CV / 27 / 2604.22851
EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving
EgoDyn-Bench:评估视觉中心基础模型在自主驾驶中的自我运动理解能力
Abstract
While Vision-Language Models (VLMs) have advanced high-level reasoning in autonomous driving, their ability to ground this reasoning in the underlying physics of ego-motion remains poorly understood. We introduce EgoDyn-Bench, a diagnostic benchmark for evaluating the semantic ego-motion understanding of vision-centric foundation models. By mapping continuous vehicle kinematics to discrete motion concepts via a deterministic oracle, we decouple a model's internal physical logic from its visual perception. Our large-scale empirical audit spanning 20+ models, including closed-source MLLMs, open-source VLMs across multiple scales, and specialized VLAs, identifies a significant Perception Bottleneck: while models exhibit logical physical concepts, they consistently fail to accurately align them with visual observations, frequently underperforming classical non-learned geometric baselines. This failure persists across model scales and domain-specific training, indicating a structural deficit in how current architectures couple visual perception with physical reasoning. We demonstrate that providing explicit trajectory encodings substantially restores physical consistency across all evaluated models, revealing a functional disentanglement between vision and language: ego-motion logic is derived almost exclusively from the language modality, while visual observations contribute negligible additional signal. This structural finding provides a standardized diagnostic framework and a practical pathway toward physically aligned embodied AI. Keywords: Ego-motion - Physical Reasoning - Foundation Models
Chinese Translation
尽管视觉-语言模型(VLMs)在自主驾驶中的高级推理方面取得了进展,但它们在将这种推理与自我运动的基本物理原理相结合的能力上仍然缺乏理解。我们引入了EgoDyn-Bench,这是一个用于评估视觉中心基础模型的语义自我运动理解的诊断基准。通过通过一个确定性神谕将连续的车辆运动学映射到离散的运动概念,我们将模型的内部物理逻辑与其视觉感知解耦。我们的广泛实证审计涵盖了20多个模型,包括封闭源的MLLMs、多个规模的开源VLMs和专门的VLAs,发现了一个显著的感知瓶颈:尽管模型展示了逻辑物理概念,但它们始终未能准确地将这些概念与视觉观察对齐,常常表现不及经典的非学习几何基线。这种失败在模型规模和领域特定训练中持续存在,表明当前架构在将视觉感知与物理推理结合方面存在结构性缺陷。我们证明,提供明确的轨迹编码显著恢复了所有评估模型的物理一致性,揭示了视觉与语言之间的功能性解耦:自我运动逻辑几乎完全源自语言模态,而视觉观察贡献的额外信号微乎其微。这一结构性发现提供了一个标准化的诊断框架和通向物理对齐的具身人工智能的实用路径。关键词:自我运动 - 物理推理 - 基础模型
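A deterministic oracle that maps continuous kinematics to discrete motion concepts could look like the sketch below. The label vocabulary, the thresholds, and the sign convention (positive yaw rate meaning a left turn) are all assumptions for illustration; the benchmark's actual oracle is not described in the abstract.

```python
from typing import Tuple

# Hypothetical thresholds; a real oracle would calibrate these on data.
STOP_SPEED = 0.1    # m/s
ACCEL_LIMIT = 0.5   # m/s^2
YAW_LIMIT = 0.05    # rad/s


def motion_concepts(speed: float, accel: float, yaw_rate: float) -> Tuple[str, str]:
    """Map (speed, longitudinal acceleration, yaw rate) to two discrete labels."""
    if speed < STOP_SPEED:
        longitudinal = "stopped"
    elif accel > ACCEL_LIMIT:
        longitudinal = "accelerating"
    elif accel < -ACCEL_LIMIT:
        longitudinal = "braking"
    else:
        longitudinal = "cruising"

    if speed < STOP_SPEED or abs(yaw_rate) <= YAW_LIMIT:
        lateral = "straight"
    elif yaw_rate > 0:
        lateral = "turning left"    # assumes positive yaw is counter-clockwise
    else:
        lateral = "turning right"
    return longitudinal, lateral
```

Because the mapping is rule-based, the same kinematic trace always yields the same labels, which is what lets a benchmark separate perception errors from reasoning errors.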
cs.CV / 28 / 2604.22853
FastAT Benchmark: A Comprehensive Framework for Fair Evaluation of Fast Adversarial Training Methods
FastAT 基准:快速对抗训练方法公平评估的综合框架
Abstract
Fast Adversarial Training (FastAT) seeks to achieve adversarial robustness at a fraction of the computational cost incurred by standard multi-step methods such as PGD-AT. Although numerous FastAT techniques have been proposed in recent years, fair comparison among them remains elusive. Existing benchmarks and public leaderboards typically permit diverse model architectures, varying training configurations, and external data sources, making it unclear whether reported improvements reflect genuine algorithmic advances or merely more favorable experimental conditions. To address this problem, we introduce the FastAT Benchmark, a controlled evaluation framework built on three core design principles: unified architecture requirements, standardized training settings, and strict prohibition of external or synthetic data. The benchmark implements over twenty representative FastAT methods within a single codebase, enabling direct and reproducible comparison. Each method is assessed through a dual-metric evaluation framework that measures both adversarial robustness (accuracy under PGD, AutoAttack, and CR Attack) and computational cost (GPU training time and peak memory footprint). Comprehensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet provide reliable baseline measurements and reveal that well-designed single-step methods can match or surpass PGD-AT robustness at substantially lower cost, while no single method dominates across all evaluation dimensions. The complete benchmark, including source code, configuration files, and experimental results, is publicly available to support transparent and fair evaluation of future FastAT research.
Chinese Translation
快速对抗训练(Fast Adversarial Training, FastAT)旨在以标准多步方法(如 PGD-AT)所需计算成本的一小部分实现对抗鲁棒性。尽管近年来提出了众多 FastAT 技术,但它们之间的公平比较仍然难以实现。现有的基准测试和公共排行榜通常允许多样的模型架构、不同的训练配置和外部数据源,这使得报告的改进是否反映真实的算法进步或仅仅是更有利的实验条件变得不明确。为了解决这个问题,我们引入了 FastAT 基准,这是一个基于三个核心设计原则的受控评估框架:统一的架构要求、标准化的训练设置,以及严格禁止外部或合成数据。该基准在单一代码库中实现了二十多种具有代表性的 FastAT 方法,能够进行直接且可重复的比较。每种方法通过双指标评估框架进行评估,测量对抗鲁棒性(在 PGD、AutoAttack 和 CR Attack 下的准确率)和计算成本(GPU 训练时间和峰值内存占用)。在 CIFAR-10、CIFAR-100 和 Tiny-ImageNet 上进行的全面实验提供了可靠的基线测量,并揭示出设计良好的单步方法可以以显著更低的成本匹配或超越 PGD-AT 的鲁棒性,而没有单一方法在所有评估维度上占据主导地位。完整的基准,包括源代码、配置文件和实验结果,已公开可用,以支持未来 FastAT 研究的透明和公平评估。
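For readers unfamiliar with the single-step versus multi-step distinction, the sketch below shows a single-step FGSM-style perturbation, the usual building block of fast adversarial training, on a toy logistic-regression model. The abstract does not name FGSM explicitly, so treating it as the representative single-step attack is an assumption, and `fgsm_linear` is a hypothetical helper rather than anything from the benchmark's codebase.

```python
import math
from typing import List


def sign(value: float) -> int:
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def fgsm_linear(x: List[float], y: int, w: List[float], b: float,
                eps: float) -> List[float]:
    """One-step L-infinity attack on a logistic-regression classifier.

    For binary cross-entropy, d(loss)/dx = (sigmoid(w.x + b) - y) * w,
    so the adversarial input is x + eps * sign(gradient).
    """
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    prob = 1.0 / (1.0 + math.exp(-logit))
    grad = [(prob - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]
```

A multi-step method such as PGD repeats this step several times with projection back into the eps-ball, which is where the extra training cost comes from.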
cs.CV / 29 / 2604.22854
MAE-Based Self-Supervised Pretraining for Data-Efficient Medical Image Segmentation Using nnFormer
基于MAE的自监督预训练方法用于数据高效的医学图像分割,采用nnFormer
Abstract
Transformer architectures, including nnFormer, have demonstrated promising results in volumetric medical image segmentation because they can capture long-range spatial interactions. Despite their high performance, these models need large quantities of labeled training data and are prone to overfitting and unstable training. This is a serious practical problem because expert-annotated medical images are both time-consuming and expensive to obtain. Moreover, traditional fully supervised training pipelines do not take advantage of the large amounts of unlabeled medical imaging data that are readily available in clinics. We address these drawbacks by improving the data efficiency of nnFormer with a self-supervised pretraining framework based on Masked Autoencoders (MAE). In this method, the model is pretrained on unlabeled volumetric medical images to reconstruct randomly masked parts of the input, which allows the encoder to learn meaningful anatomical and structural representations. The encoder is then fine-tuned on a labeled dataset for the downstream segmentation task. Experiments show that the proposed method yields higher segmentation performance in terms of Dice score, faster convergence during fine-tuning, and better generalization with limited labeled data. These findings indicate that self-supervised learning combined with transformer-based segmentation models is an appropriate approach to the problem of data shortage in medical image analysis.
Chinese Translation
Transformer架构,包括nnFormer,在体积医学图像分割中展现了良好的效果,能够捕捉长距离的空间交互。尽管这些模型性能优越,但它们需要大量标注的训练数据,并且容易出现过拟合和训练不稳定的问题。这是一个严重的实际问题,因为获取专家注释的医学图像不仅耗时而且成本高昂。此外,完全监督的传统训练流程并未利用可在临床中轻松获得的大量未标注医学影像数据。我们通过提升nnFormer的效率,采用基于Masked Autoencoders (MAE) 的自监督预训练框架,解决了这些缺陷。在该方法中,模型在未标注的体积医学图像上进行预训练,以重建随机遮挡的输入部分。这使得编码器能够学习有意义的解剖和结构表示。随后,编码器在下游分割任务的标注数据集上进一步微调。实验结果表明,所提出的方法在Dice得分的分割性能上更高,在微调过程中的收敛速度更快,并且在有限标注数据的基础上具有更好的泛化能力。这些发现验证了自监督学习与基于Transformer的分割模型结合,是解决医学图像分析中数据短缺问题的有效方法。
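The masked-reconstruction objective described above has two ingredients: choosing which patches to hide, and scoring the reconstruction on the hidden patches only. A minimal sketch follows; the 75% mask ratio used in the usage note is a common MAE default and an assumption here, and patches are represented as flat lists of voxel values for simplicity.

```python
import random
from typing import List, Optional


def random_patch_mask(num_patches: int, mask_ratio: float,
                      seed: Optional[int] = None) -> List[bool]:
    """Return a boolean list where True marks a patch hidden from the encoder."""
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def reconstruction_loss(pred: List[List[float]], target: List[List[float]],
                        mask: List[bool]) -> float:
    """Mean squared error computed over masked patches only."""
    errors = [(pv - tv) ** 2
              for p, t, hidden in zip(pred, target, mask) if hidden
              for pv, tv in zip(p, t)]
    return sum(errors) / len(errors) if errors else 0.0
```

For example, `random_patch_mask(100, 0.75, seed=0)` hides 75 of 100 patches; restricting the loss to hidden patches keeps the encoder from being rewarded for copying visible input.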
cs.CV / 30 / 2604.22855
Evaluating Remote Sensing Image Captions Beyond Metric Biases
超越度量偏差的遥感图像描述评估
Abstract
The core objective of image captioning is to achieve lossless semantic compression from visual signals into textual modalities. However, the reliance on manually curated reference texts for evaluation essentially forces models to mimic specific human annotation styles, thereby masking the true descriptive capabilities of advanced foundation models. This systemic misalignment prompts a critical question: Is task-specific fine-tuning truly necessary for Remote Sensing Image Captioning, or is the perceived performance gap merely an artifact of flawed evaluation criteria? To investigate this discrepancy, we propose ReconScore, a novel reference-free evaluation metric. Rather than computing textual similarities, we assess caption quality by its capability to reconstruct the original visual elements solely from the generated text, effectively neutralizing human annotation biases. Applying this metric, we uncover a profound, counterintuitive truth: inherently powerful, unfine-tuned MLLMs surpass their fine-tuned counterparts in authentic zero-shot RSIC tasks. Driven by this structural discovery, we introduce RemoteDescriber, a completely training-free generation methodology. By employing ReconScore as a self-correction mechanism, we iteratively refine the semantic precision of MLLM outputs without any computational fine-tuning overhead. Comprehensive experiments demonstrate that RemoteDescriber achieves state-of-the-art performance on three datasets. Furthermore, we validate ReconScore's reliability and analyze the flaws of traditional metrics. Our code is available at https://github.com/hhu-czy/RemoteDescriber.
Chinese Translation
图像描述的核心目标是实现从视觉信号到文本模态的无损语义压缩。然而,依赖于人工策划的参考文本进行评估,实际上迫使模型模仿特定的人类注释风格,从而掩盖了先进基础模型的真实描述能力。这种系统性的不对齐引发了一个关键问题:任务特定的微调对于遥感图像描述(Remote Sensing Image Captioning)是否真的必要,还是感知的性能差距仅仅是有缺陷的评估标准的产物?为了调查这一差异,我们提出了ReconScore,一种新颖的无参考评估指标。我们并不计算文本相似度,而是通过生成文本重建原始视觉元素的能力来评估描述质量,有效中和了人类注释的偏见。应用这一指标,我们发现了一个深刻且反直觉的真相:本质上强大的、未经微调的多模态大语言模型(MLLMs)在真实的零样本遥感图像描述任务中超越了其经过微调的对应模型。基于这一结构性发现,我们引入了RemoteDescriber,一种完全无训练的生成方法。通过将ReconScore作为自我修正机制,我们在没有任何计算微调开销的情况下,迭代地提高MLLM输出的语义精度。全面的实验表明,RemoteDescriber在三个数据集上达到了最先进的性能。此外,我们验证了ReconScore的可靠性,并分析了传统指标的缺陷。我们的代码可在https://github.com/hhu-czy/RemoteDescriber获取。
cs.CV / 31 / 2604.22856
Attention-Augmented YOLOv8 with Ghost Convolution for Real-Time Vehicle Detection in Intelligent Transportation Systems
基于注意力增强的YOLOv8与Ghost卷积在智能交通系统中实时车辆检测的应用
Abstract
Accurate vehicle detection is a critical component of autonomous driving, traffic surveillance, and intelligent transportation systems. This paper presents an enhanced YOLOv8n-based model that integrates the Ghost Module, Convolutional Block Attention Module (CBAM), and Deformable Convolutional Networks v2 (DCNv2) to improve detection performance. The Ghost Module reduces feature redundancy through efficient feature generation, CBAM refines feature representation via channel and spatial attention, and DCNv2 enhances adaptability to geometric variations in vehicle structures. Evaluated on the KITTI dataset, the proposed model achieves 95.4% mAP@0.5, representing an 8.97% improvement over the baseline YOLOv8n, along with 96.2% precision, 93.7% recall, and a 94.93% F1-score. Comparative analysis against seven state-of-the-art detectors demonstrates consistent superiority across key performance metrics, while ablation studies validate the individual and combined contributions of the integrated modules. By addressing feature redundancy, attention refinement, and spatial adaptability, the proposed approach offers a robust and computationally efficient solution for vehicle detection in diverse and complex traffic environments.
Chinese Translation
准确的车辆检测是自动驾驶、交通监控和智能交通系统的关键组成部分。本文提出了一种基于增强YOLOv8n模型的改进方案,该模型集成了Ghost模块、卷积块注意力模块(CBAM)和可变形卷积网络v2(DCNv2),以提高检测性能。Ghost模块通过高效特征生成减少特征冗余,CBAM通过通道和空间注意力精炼特征表示,而DCNv2增强了对车辆结构几何变化的适应性。在KITTI数据集上的评估结果显示,所提模型在mAP@0.5上达到了95.4%,比基线YOLOv8n提高了8.97%,同时获得了96.2%的精确率、93.7%的召回率和94.93%的F1-score。与七种最先进的检测器进行的比较分析表明,在关键性能指标上表现出持续的优越性,而消融研究验证了集成模块的单独和组合贡献。通过解决特征冗余、注意力精炼和空间适应性,所提方法为在多样化和复杂的交通环境中进行车辆检测提供了一种稳健且计算高效的解决方案。
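As a quick consistency check on the precision, recall, and F1 figures quoted in this abstract, the F1-score is the harmonic mean of precision and recall. The sketch below is illustrative Python, not the authors' evaluation code; the function name `f1_from_pr` is hypothetical.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract: 96.2% precision, 93.7% recall.
f1 = f1_from_pr(0.962, 0.937)
print(f"{100 * f1:.2f}%")  # prints 94.93%
```

The result agrees with the 94.93% F1-score reported for the proposed model.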
cs.CV / 32 / 2604.22857
IoT-Enhanced CNN-Based Labelled Crack Detection for Additive Manufacturing Image Annotation in Industry 4.0
基于物联网增强的卷积神经网络标记裂纹检测用于工业4.0中的增材制造图像标注
Abstract
This paper presents an IoT-enhanced deep learning framework for automated crack detection in Additive Manufacturing (AM) surfaces using convolutional neural networks (CNNs). By integrating IoT-enabled real-time monitoring, high-resolution imaging, and edge computing, the system enables continuous in-situ defect detection and classification. Real-time data acquisition supports immediate CNN-based analysis, improving both accuracy and efficiency in AM quality control. The framework supports supervised and semi-supervised learning, enabling robust performance on large, sparsely annotated datasets. Using LabelImg for annotation and OpenCV for preprocessing, the system achieves 99.54% accuracy on 14,982 images, with 96% precision, 98% recall, and a 97% F1-score. Dataset balancing and augmentation significantly improve generalization, increasing accuracy from 32% to 99%. Beyond detection, the framework establishes a linkage between AM process parameters, defect formation, and surface topology, supporting predictive analytics and defect mitigation. Aligned with Industry 4.0, it incorporates Digital Twin (DT) technology for real-time process simulation, predictive maintenance, and adaptive control. Key contributions include an IoT-based monitoring system using edge devices (Raspberry Pi 4B), an optimized CNN with model quantization and batch processing reducing inference latency by 47%, and an MQTT-based low-latency data streaming system with 5G connectivity, lowering transmission overhead by 35%. DT integration further enables predictive defect analysis and dynamic adjustment of AM parameters. This work advances intelligent AM quality control by providing a scalable, high-accuracy, and low-latency framework. Future directions include multimodal data fusion, hybrid architectures, and enhanced Digital Twin simulations for AI-driven defect prevention.
Chinese Translation
本文提出了一种基于物联网增强的深度学习框架,用于使用卷积神经网络(CNN)自动检测增材制造(AM)表面的裂纹。通过集成物联网支持的实时监控、高分辨率成像和边缘计算,该系统实现了持续的原位缺陷检测和分类。实时数据采集支持即时的基于CNN的分析,提高了增材制造质量控制的准确性和效率。该框架支持监督学习和半监督学习,能够在大型、稀疏标注的数据集上实现稳健的性能。使用LabelImg进行标注和OpenCV进行预处理,该系统在14,982张图像上达到了99.54%的准确率,具有96%的精确率、98%的召回率和97%的F1分数。数据集平衡和增强显著提高了模型的泛化能力,将准确率从32%提升至99%。除了检测,该框架还建立了增材制造过程参数、缺陷形成和表面拓扑之间的联系,支持预测分析和缺陷缓解。与工业4.0相一致,它结合了数字双胞胎(Digital Twin, DT)技术,实现实时过程模拟、预测性维护和自适应控制。主要贡献包括使用边缘设备(Raspberry Pi 4B)的基于物联网的监控系统、经过优化的CNN模型及其量化和批处理,减少推理延迟47%,以及基于MQTT的低延迟数据流系统,结合5G连接,降低传输开销35%。DT集成进一步实现了预测性缺陷分析和增材制造参数的动态调整。本研究通过提供可扩展、高准确性和低延迟的框架,推动了智能增材制造质量控制的发展。未来的研究方向包括多模态数据融合、混合架构和增强的数字双胞胎模拟,以实现基于人工智能的缺陷预防。
cs.CV / 33 / 2604.22858
A Digital Pathology Resource for Liver Cancer Quantification with Datasets, Benchmarks, and Tools
用于肝癌量化的数字病理资源:数据集、基准和工具
Abstract
Liver cancer, especially hepatocellular carcinoma (HCC), imposes a substantial global disease burden. Accurate diagnosis and prognostic assessment directly influence treatment selection and patient survival, and pathological examination remains the gold standard for liver cancer diagnosis. Identifying diverse tissue components and pathological subtypes on histopathology slides is crucial for estimating postoperative recurrence risk and overall prognosis. However, most publicly available resources are still provided at the whole-slide image (WSI) level, and well-annotated datasets for fine-grained tissue component identification in liver cancer are scarce, which hinders reproducible model development and the deployment of quantitative analysis tools. To address this gap, we release HepatoBench, a patch-level image database for liver cancer with annotations for seven key tissue categories. Based on HepatoBench, we train and open-source a deep learning classification model as a tissue recognition tool. Furthermore, we train a WSI-level tumor/non-tumor segmentation model to automatically localize lesion regions across entire slides. By integrating the patch-level tissue classifier with the WSI-level segmentation model, we build HepatoQuant, an end-to-end, disease-specific regional quantification tool for liver cancer, enabling a unified workflow from WSIs to tissue composition parsing and quantitative statistics. We also open-source HepatoBench, the benchmarking protocol, and supporting tools, providing a solid foundation for automated regional quantification and fair method comparison in liver cancer pathology.
Chinese Translation
肝癌,尤其是肝细胞癌(HCC),对全球健康造成了重大负担。准确的诊断和预后评估直接影响治疗选择和患者生存,而病理检查仍然是肝癌诊断的金标准。在组织病理切片上识别多样的组织成分和病理亚型对于估计术后复发风险和整体预后至关重要。然而,目前大多数公开可用的资源仍然以全切片图像(WSI)级别提供,且用于肝癌中细粒度组织成分识别的良好注释数据集稀缺,这阻碍了可重复模型的开发和定量分析工具的部署。为了解决这一问题,我们发布了HepatoBench,这是一个针对肝癌的补丁级图像数据库,包含七个关键组织类别的注释。基于HepatoBench,我们训练并开源了一个深度学习分类模型作为组织识别工具。此外,我们还训练了一个WSI级别的肿瘤/非肿瘤分割模型,以自动定位整个切片中的病变区域。通过将补丁级组织分类器与WSI级别分割模型相结合,我们构建了HepatoQuant,这是一个端到端的、特定于疾病的肝癌区域量化工具,实现了从WSI到组织成分解析和定量统计的统一工作流程。我们还开源了HepatoBench、基准协议和支持工具,为肝癌病理中的自动区域量化和公平方法比较提供了坚实基础。
cs.CV / 34 / 2604.22865
MeshLAM: Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction
MeshLAM:前馈式一次性可动画纹理网格头像重建
Abstract
We introduce MeshLAM, a feed-forward framework for one-shot animatable mesh head reconstruction that generates high-fidelity, animatable 3D head avatars from a single image. Unlike previous work that relies on time-consuming test-time optimization or extensive multi-view data, our method produces complete mesh representations with inherent animatability from a single image in a single forward pass. Our approach employs a dual shape and texture map architecture that simultaneously processes mesh vertices and texture map with extracted image features from a shared transformer backbone, allowing for coherent shape carving and appearance modeling. To prevent mesh collapse and ensure topological integrity during feed-forward deformation, we propose an iterative GRU-based decoding mechanism with progressive geometry deformation and texture refinement, coupled with a novel reprojection-based texture guidance mechanism that anchors appearance learning to the input image. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in reconstruction quality, animation capability, and computational efficiency. Project page at https://meshlam.github.io.
Chinese Translation
我们介绍了MeshLAM,一种前馈框架,用于一次性可动画网格头部重建,该框架能够从单张图像生成高保真、可动画的3D头部头像。与依赖耗时的测试时优化或大量多视图数据的先前工作不同,我们的方法能够在单次前向传递中,从单张图像生成具有固有可动画性的完整网格表示。我们的方法采用双重形状和纹理图架构,同时处理网格顶点和纹理图,利用从共享变换器主干网络中提取的图像特征,从而实现一致的形状雕刻和外观建模。为了防止网格崩溃并确保前馈变形过程中的拓扑完整性,我们提出了一种基于迭代GRU的解码机制,结合渐进几何变形和纹理细化,以及一种新颖的基于重投影的纹理引导机制,将外观学习锚定到输入图像。大量实验表明,我们的方法在重建质量、动画能力和计算效率方面优于最先进的方法。项目页面:https://meshlam.github.io。
cs.CV / 35 / 2604.22868
Probing Visual Planning in Image Editing Models
探究图像编辑模型中的视觉规划
Abstract
Visual planning represents a crucial facet of human intelligence, especially in tasks that require complex spatial reasoning and navigation. Yet, in machine learning, this inherently visual problem is often tackled through a verbal-centric lens. While recent research demonstrates the promise of fully visual approaches, they suffer from significant computational inefficiency due to the step-by-step planning-by-generation paradigm. In this work, we present EAR, an editing-as-reasoning paradigm that reformulates visual planning as a single-step image transformation. To isolate intrinsic reasoning from visual recognition, we employ abstract puzzles as probing tasks and introduce AMAZE, a procedurally generated dataset that features the classical Maze and Queen problems, covering distinct, complementary forms of visual planning. The abstract nature of AMAZE also facilitates automatic evaluation of autoregressive and diffusion-based models in terms of both pixel-wise fidelity and logical validity. We assess leading proprietary and open-source editing models. The results show that they all struggle in the zero-shot setting, whereas finetuning on basic scales enables remarkable generalization to larger in-domain scales and to out-of-domain scales and geometries. However, even our best model, running on high-end hardware, fails to match the zero-shot efficiency of human solvers, highlighting a persistent gap in neural visual reasoning.
Chinese Translation
视觉规划是人类智能的一个关键方面,尤其在需要复杂空间推理和导航的任务中。然而,在机器学习中,这一固有的视觉问题常常通过以语言为中心的视角来处理。尽管近期研究展示了完全视觉方法的潜力,但由于逐步生成规划的范式,它们在计算效率上存在显著不足。在本研究中,我们提出了EAR,一种将视觉规划重新表述为单步图像转换的编辑-推理范式。为了将内在推理与视觉识别分离,我们采用抽象拼图作为探测任务,并引入AMAZE,一个程序生成的数据集,包含经典的迷宫(Maze)和皇后(Queen)问题,涵盖了不同且互补的视觉规划形式。AMAZE的抽象特性还便于对自回归和基于扩散的模型在像素级保真度和逻辑有效性方面进行自动评估。我们评估了领先的专有和开源编辑模型。结果显示,它们在零-shot设置中均表现不佳,而在基本规模上进行微调则能够显著推广到更大的领域内规模和领域外规模及几何形状。然而,我们在高端硬件上运行的最佳模型未能匹敌人类解题者的零-shot效率,突显了神经视觉推理中的持续差距。
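The "logical validity" side of such an automatic evaluation can be illustrated with a small checker. This is a minimal sketch, assuming the Queen task follows the classical N-Queens rules (the abstract does not spell them out) and that queen positions have already been parsed from the edited image; the function name `is_valid_queens` is hypothetical and this is not the AMAZE evaluation code.

```python
def is_valid_queens(positions: list[tuple[int, int]], n: int) -> bool:
    """Return True if `positions` places n mutually non-attacking queens on an n x n board."""
    if len(positions) != n:
        return False
    rows, cols, diag, anti = set(), set(), set(), set()
    for r, c in positions:
        # Reject queens placed outside the board.
        if not (0 <= r < n and 0 <= c < n):
            return False
        # Two queens attack each other if they share a row, column, or diagonal.
        if r in rows or c in cols or (r - c) in diag or (r + c) in anti:
            return False
        rows.add(r)
        cols.add(c)
        diag.add(r - c)
        anti.add(r + c)
    return True
```

A checker of this kind scores an output as correct or incorrect regardless of how close it is pixel-wise to a reference solution, which is why the two criteria are reported separately.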
cs.CV / 36 / 2604.22872
Vision-Based Lane Following and Traffic Sign Recognition for Resource-Constrained Autonomous Vehicles
基于视觉的车道跟踪与交通标志识别用于资源受限的自主车辆
Abstract
Autonomous vehicles (AVs) rely on real-time perception systems to understand road environments and ensure safe navigation. However, implementing reliable perception algorithms on resource-constrained embedded platforms remains challenging due to limited computational resources. This paper presents a lightweight vision-based framework that integrates lane detection, lane tracking, and traffic sign recognition for embedded autonomous vehicles. A computationally efficient threshold-based lane segmentation method combined with perspective transformation and histogram-based curvature estimation is used for robust lane tracking under varying illumination conditions. A rule-based steering controller generates steering commands to maintain stable vehicle navigation. For traffic sign recognition, two lightweight convolutional neural networks (CNNs), EfficientNet-B0 and MobileNetV2, are evaluated using a custom dataset captured from the vehicle's onboard camera. Experimental results show that the system achieves real-time performance while maintaining accurate lane tracking with only 3.16% maximum offset RMSE. EfficientNet-B0 achieves a high offline classification accuracy of 98.77% on the test dataset, while achieving 90% accuracy during real-time on-device deployment, outperforming MobileNetV2 in both settings. MobileNetV2, however, offers slightly faster inference and lower computational cost. These results highlight the effectiveness of lightweight vision-based perception pipelines for resource-constrained autonomous driving applications.
Chinese Translation
自主车辆(AVs)依赖实时感知系统来理解道路环境并确保安全导航。然而,由于计算资源有限,在资源受限的嵌入式平台上实现可靠的感知算法仍然具有挑战性。本文提出了一种轻量级的基于视觉的框架,集成了车道检测、车道跟踪和交通标志识别,专为嵌入式自主车辆设计。采用基于阈值的高效车道分割方法,结合透视变换和基于直方图的曲率估计,以在不同光照条件下实现稳健的车道跟踪。基于规则的转向控制器生成转向指令,以保持车辆导航的稳定性。在交通标志识别方面,评估了两个轻量级卷积神经网络(CNNs),即 EfficientNet-B0 和 MobileNetV2,使用从车辆车载摄像头捕获的自定义数据集进行测试。实验结果表明,该系统在保持准确的车道跟踪的同时,实现了实时性能,最大偏移均方根误差(RMSE)仅为3.16%。EfficientNet-B0 在测试数据集上实现了98.77%的高离线分类准确率,而在实时设备部署中达到90%的准确率,在这两种情况下均优于 MobileNetV2。然而,MobileNetV2 提供了稍快的推理速度和更低的计算成本。这些结果突显了轻量级基于视觉的感知管道在资源受限的自主驾驶应用中的有效性。
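The histogram step of a threshold-based lane pipeline can be sketched as follows. This is an illustrative example under stated assumptions, not the paper's implementation: it assumes a binary mask that has already been thresholded and warped to a bird's-eye view, it covers only lane localization and lateral offset (curvature fitting is omitted), and the names `lane_bases` and `lateral_offset` are hypothetical.

```python
def lane_bases(mask: list[list[int]]) -> tuple[int, int]:
    """Locate left/right lane x-positions in a binary bird's-eye-view mask.

    mask[y][x] is 1 for lane pixels and 0 otherwise.
    """
    height, width = len(mask), len(mask[0])
    # Column-wise histogram over the lower half of the image, nearest the vehicle.
    hist = [sum(mask[y][x] for y in range(height // 2, height)) for x in range(width)]
    mid = width // 2
    # The strongest column on each side of the centre is taken as a lane base.
    left = max(range(mid), key=hist.__getitem__)
    right = max(range(mid, width), key=hist.__getitem__)
    return left, right


def lateral_offset(left: int, right: int, width: int) -> float:
    """Signed offset of the lane centre from the image centre, as a fraction of width."""
    return ((left + right) / 2 - width / 2) / width
```

A rule-based steering controller could then map the sign and magnitude of this offset to a steering command; the specific rules used in the paper are not given in the abstract.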
cs.CV / 37 / 2604.22875
SketchVLM: Vision language models can annotate images to explain thoughts and guide users
SketchVLM:视觉语言模型可以注释图像以解释思路并指导用户
Abstract
When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision-language models (VLMs) such as Gemini-3-Pro and GPT-5 only respond with text, which can be difficult for users to verify. We present SketchVLM, a training-free, model-agnostic framework that enables VLMs to produce non-destructive, editable SVG overlays on the input image to visually explain their answers. Across seven benchmarks spanning visual reasoning (maze navigation, ball-drop trajectory prediction, and object counting) and drawing (part labeling, connecting-the-dots, and drawing shapes around objects), SketchVLM improves visual reasoning task accuracy by up to +28.5 percentage points and annotation quality by up to 1.48x relative to image-editing and fine-tuned sketching baselines, while also producing annotations that are more faithful to the model's stated answer. We find that single-turn generation already achieves strong accuracy and annotation quality, and multi-turn generation opens up further opportunities for human-AI collaboration. An interactive demo and code are at https://sketchvlm.github.io/.
Chinese Translation
在回答有关图像的问题时,人类自然会指向、标记和绘制以解释他们的推理。相比之下,现代视觉语言模型(VLMs),如Gemini-3-Pro和GPT-5,仅以文本形式回应,这使得用户难以验证。我们提出了SketchVLM,一个无训练、模型无关的框架,使VLM能够在输入图像上生成非破坏性、可编辑的SVG叠加层,以视觉方式解释其答案。在涵盖视觉推理(迷宫导航、球落轨迹预测和物体计数)和绘图(部件标记、连线和在物体周围绘制形状)的七个基准测试中,SketchVLM在视觉推理任务的准确性上提高了多达28.5个百分点,在注释质量上提高了相对于图像编辑和微调草图基线的1.48倍,同时生成的注释更忠实于模型所述的答案。我们发现单轮生成已经实现了较强的准确性和注释质量,而多轮生成则为人机协作开辟了进一步的机会。交互式演示和代码可在https://sketchvlm.github.io/找到。
cs.CV / 38 / 2604.22883
NeuroAPS-Net: Neuro-Anatomically Aware Point Cloud Representation for Efficient Alzheimer's Disease Classification
NeuroAPS-Net:一种神经解剖学感知的点云表示方法用于高效的阿尔茨海默病分类
Abstract
Alzheimer's disease (AD) is a progressive neurodegenerative disorder and a major cause of dementia. Structural MRI is widely used to analyze AD-related brain atrophy; however, most deep learning methods rely on computationally expensive 3D convolutional neural networks (CNNs), limiting deployment in resource-constrained settings. This work introduces two main contributions. First, we propose a pipeline that converts T1-weighted MRI into anatomically informed 2D point clouds using Anatomical Priority Sampling (APS), producing ADNI-2DPC, the first neuroanatomically labeled MRI-derived point cloud dataset. Second, we present NeuroAPS-Net, a lightweight geometric deep learning model that incorporates anatomical priors via region-aware feature encoding and ROI token aggregation. Experiments on ADNI-2DPC demonstrate that NeuroAPS-Net achieves competitive classification accuracy while significantly reducing inference latency and GPU memory compared to state-of-the-art point cloud methods. These results highlight the potential of anatomically guided point cloud learning as an efficient and interpretable alternative to voxel-based CNNs for AD classification.
Chinese Translation
阿尔茨海默病(AD)是一种进行性的神经退行性疾病,是导致痴呆的主要原因。结构性MRI被广泛用于分析与AD相关的大脑萎缩;然而,大多数深度学习方法依赖于计算成本高昂的3D卷积神经网络(CNN),限制了其在资源受限环境中的应用。本研究提出了两个主要贡献。首先,我们提出了一种将T1加权MRI转换为解剖学信息的2D点云的流程,使用解剖优先采样(Anatomical Priority Sampling, APS),生成ADNI-2DPC,这是第一个神经解剖学标注的MRI衍生点云数据集。其次,我们提出了NeuroAPS-Net,一种轻量级几何深度学习模型,通过区域感知特征编码和ROI令牌聚合来结合解剖学先验。对ADNI-2DPC的实验表明,NeuroAPS-Net在分类准确性上具有竞争力,同时显著降低了推理延迟和GPU内存使用,相较于最先进的点云方法。这些结果突显了基于解剖学引导的点云学习作为一种高效且可解释的替代方案,能够替代基于体素的CNN进行AD分类的潜力。
cs.CV / 39 / 2604.22884
Can Multimodal Large Language Models Truly Understand Small Objects?
多模态大型语言模型真的能够理解小物体吗?
Abstract
Multimodal Large Language Models (MLLMs) have shown promising potential in diverse understanding tasks, e.g., image and video analysis, and math and physics olympiads. However, their capabilities on Small Object Understanding (SOU) tasks remain unexplored. To fill this gap, we introduce SOUBench, the first comprehensive benchmark for exploring the small object understanding capability of existing MLLMs. Specifically, we first design an effective and automatic visual question-answer generation strategy, constructing a new SOU-VQA evaluation dataset with 18,204 VQA pairs, six relevant sub-tasks, and three dominant scenarios (i.e., Driving, Aerial, and Underwater). Then, we conduct a comprehensive evaluation of 15 state-of-the-art MLLMs and reveal their weak capabilities in small object understanding. Furthermore, we develop SOU-Train, a multimodal training dataset with 11,226 VQA pairs, to improve the SOU capabilities of MLLMs. Through supervised fine-tuning of the latest MLLM, we demonstrate that SOU-Train can effectively enhance its ability to understand small objects. Comprehensive experimental results demonstrate that the proposed SOUBench, along with the SOU-VQA and SOU-Train datasets, provides a crucial empirical foundation for the community to further develop models with enhanced small object understanding capabilities. Datasets and Code: https://github.com/Hanfj-X/SOU.
Chinese Translation
多模态大型语言模型(MLLMs)在多种理解任务中展现出良好的潜力,例如图像和视频分析、数学和物理奥林匹克竞赛。然而,它们在小物体理解(SOU)任务上仍然空白且未被探索。为填补这一空白,我们引入了SOUBench,这是第一个全面的基准,用于探索现有MLLMs的小物体理解能力。具体而言,我们首先设计了一种有效且自动的视觉问答生成策略,构建了一个新的SOU-VQA评估数据集,其中包含18,204个VQA对、六个相关子任务和三个主要场景(即驾驶、空中和水下)。然后,我们对15个最先进的MLLMs进行了全面评估,揭示了它们在小物体理解方面的薄弱能力。此外,我们开发了SOU-Train,这是一个包含11,226个VQA对的多模态训练数据集,以提高MLLMs的小物体理解能力。通过对最新的MLLM进行监督微调,我们证明SOU-Train能够有效增强最新MLLM对小物体的理解能力。全面的实验结果表明,所提出的SOUBench以及SOU-VQA和SOU-Train数据集为社区提供了一个重要的实证基础,以进一步开发具有增强小物体理解能力的模型。数据集和代码: https://github.com/Hanfj-X/SOU。
cs.CV / 40 / 2604.22885
Federated Cross-Modal Retrieval with Missing Modalities via Semantic Routing and Adapter Personalization
通过语义路由和适配器个性化实现缺失模态的联邦跨模态检索
Abstract
Federated cross-modal retrieval faces severe challenges from heterogeneous client data, particularly non-IID semantic distributions and missing modalities. Under such heterogeneity, a single global model is often insufficient to capture both shared cross-modal knowledge and client-specific characteristics. We propose RCSR, a personalization-friendly federated framework that integrates prototype anchoring, retrieval-centric semantic routing, and optional client-specific adapters. Built on a frozen CLIP backbone, RCSR leverages lightweight shared adapters for global knowledge transfer while supporting efficient local personalization. Prototype anchoring helps unimodal clients align with global cross-modal semantics, and a server-side semantic router adaptively assigns aggregation weights based on retrieval consistency to mitigate alignment drift during heterogeneous updates. Extensive experiments on MS-COCO, Flickr30K, and other benchmarks show that RCSR consistently improves global retrieval accuracy and training stability, while further enhancing client-level retrieval performance, especially for clients with incomplete modalities. Code is available at https://github.com/RezinChow/RCSR-Retrieval-Centric-Semantic-Routing.
Chinese Translation
联邦跨模态检索面临来自异构客户端数据的严峻挑战,特别是非独立同分布的语义分布和缺失模态。在这种异构性下,单一的全局模型往往不足以捕捉共享的跨模态知识和客户端特定特征。我们提出了RCSR(Retrieval-Centric Semantic Routing),这是一个友好的个性化联邦框架,集成了原型锚定、以检索为中心的语义路由和可选的客户端特定适配器。RCSR基于一个冻结的CLIP骨干网络,利用轻量级共享适配器进行全局知识转移,同时支持高效的本地个性化。原型锚定帮助单模态客户端与全局跨模态语义对齐,而服务器端的语义路由器根据检索一致性自适应地分配聚合权重,以减轻异构更新过程中的对齐漂移。在MS-COCO、Flickr30K和其他基准上的大量实验表明,RCSR在提高全局检索准确性和训练稳定性的同时,进一步增强了客户端级别的检索性能,特别是对于模态不完整的客户端。代码可在https://github.com/RezinChow/RCSR-Retrieval-Centric-Semantic-Routing获取。
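One way to picture server-side aggregation weighted by retrieval consistency is a softmax over per-client scores followed by a weighted average of the shared adapter parameters. The sketch below rests on assumptions throughout: the abstract does not say how RCSR computes the scores or the weights, so the softmax, the `temperature` parameter, and the function names are all hypothetical and are not taken from the released code.

```python
import math


def consistency_weights(scores: list[float], temperature: float = 1.0) -> list[float]:
    """Softmax over per-client retrieval-consistency scores (assumed weighting rule)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def aggregate(client_params: list[list[float]], weights: list[float]) -> list[float]:
    """Weighted average of per-client adapter parameter vectors."""
    dim = len(client_params[0])
    return [sum(w * p[i] for w, p in zip(weights, client_params)) for i in range(dim)]
```

With equal scores this reduces to a plain average, as in standard federated averaging with equally sized clients; clients whose updates score as less consistent receive a smaller share.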
cs.CV / 41 / 2604.22886
Breaking Degradation Coupling: A Structural Entropy Guided Decoupled Framework and Benchmark for Infrared Enhancement
打破降解耦合:一种结构熵引导的解耦框架及红外增强基准
Abstract
Thermal infrared image enhancement aims to restore high-quality images from complex compound degradations. Existing all-in-one approaches typically employ a single shared backbone to handle diverse degradations, which causes gradient interference and parameter competition. To address this, we propose a Structural Entropy-Guided Decoupled (SEGD) Framework. Unlike unified modeling paradigms, SEGD decomposes compound degradations into independent sub-processes and models them in a divide-and-conquer manner through Degradation-Specific Residual Modules (DRMs). Each DRM focuses on residual estimation for a specific degradation, enabling task decoupling while remaining jointly trainable, which mitigates parameter contention. A Degradation-Aware Evidential Network further estimates degradation type and intensity, providing priors that adaptively regulate DRM restoration strength. To handle compound cases, DRMs are composed in varying orders to form multiple restoration paths, from which the most informative features are aggregated under a structural-entropy criterion, yielding decoder-ready representations with structural fidelity and degradation awareness. Integrating divide-and-conquer restoration, evidential perception, and entropy-guided adaptation, SEGD achieves fine-grained and interpretable enhancement. We also construct a nighttime TIR benchmark for evaluation under real low-light conditions. Experimental results demonstrate that SEGD surpasses state-of-the-art methods while achieving higher efficiency with fewer parameters.
Chinese Translation
热红外图像增强旨在从复杂的复合降解中恢复高质量图像。现有的一体化方法通常采用单一共享骨干网络来处理多种降解,这导致了梯度干扰和参数竞争。为了解决这个问题,我们提出了一种结构熵引导的解耦框架(Structural Entropy-Guided Decoupled Framework,SEGD)。与统一建模范式不同,SEGD将复合降解分解为独立的子过程,并通过降解特定残差模块(Degradation-Specific Residual Modules,DRMs)以分而治之的方式对其进行建模。每个DRM专注于特定降解的残差估计,使任务解耦,同时保持可联合训练,从而减轻参数争用。降解感知证据网络(Degradation-Aware Evidential Network)进一步估计降解类型和强度,提供自适应调节DRM恢复强度的先验信息。为了处理复合情况,DRMs以不同顺序组合形成多个恢复路径,从中根据结构熵标准聚合最具信息性的特征,生成具有结构保真度和降解感知的解码器准备表示。通过整合分而治之的恢复、证据感知和熵引导适应,SEGD实现了细粒度和可解释的增强。我们还构建了一个夜间热红外基准,以便在真实低光条件下进行评估。实验结果表明,SEGD超越了最先进的方法,同时以更少的参数实现更高的效率。
cs.CV / 42 / 2604.22899
Text-Guided Multimodal Unified Industrial Anomaly Detection
文本引导的多模态统一工业异常检测
Abstract
Industrial anomaly detection based on RGB-3D multimodal data has emerged as a mainstream paradigm for intelligent quality inspection. However, existing unsupervised methods suffer from two critical limitations: ambiguous cross-modal alignment caused by the lack of high-level semantic guidance and insufficient geometric modeling for RGB-to-3D feature mapping. To address these issues, we propose a unified multimodal industrial anomaly detection framework guided by text semantics. The framework consists of two core modules: a Geometry-Aware Cross-Modal Mapper to preserve geometric structure during modality conversion, and an Object-Conditioned Textual Feature Adaptor to align multimodal features with semantic priors. Furthermore, we establish a unified learning paradigm for multimodal industrial anomaly detection, which breaks the one-model-one-class constraint and enables accurate anomaly detection across diverse classes using a single model. Extensive experiments on the MVTec 3D-AD and Eyecandies datasets demonstrate that our method achieves state-of-the-art performance in classification and localization under unsupervised settings.
Chinese Translation
基于RGB-3D多模态数据的工业异常检测已成为智能质量检测的主流范式。然而,现有的无监督方法存在两个关键限制:由于缺乏高层次语义指导,导致模态间对齐模糊,以及RGB到3D特征映射的几何建模不足。为了解决这些问题,我们提出了一种由文本语义引导的统一多模态工业异常检测框架。该框架由两个核心模块组成:一个几何感知的跨模态映射器(Geometry-Aware Cross-Modal Mapper),用于在模态转换过程中保持几何结构,以及一个基于对象的文本特征适配器(Object-Conditioned Textual Feature Adaptor),用于将多模态特征与语义先验对齐。此外,我们建立了一种统一的多模态工业异常检测学习范式,打破了“一模型一类别”的限制,使得单一模型能够在不同类别中实现准确的异常检测。在MVTec 3D-AD和Eyecandies数据集上的大量实验表明,我们的方法在无监督设置下的分类和定位任务中达到了最先进的性能。
cs.CV / 43 / 2604.22903
On the Complementarity of Quantum and Classical Features: Adaptive Hybrid Quantum-Classical Feature Fusion for Breast Cancer Classification
量子与经典特征的互补性:用于乳腺癌分类的自适应混合量子-经典特征融合
Abstract
The integration of quantum machine learning with classical deep learning offers promising avenues for medical image analysis by mapping data into high-dimensional Hilbert spaces. However, effectively unifying these distinct paradigms remains challenging due to common optimization asymmetries. In this paper, a novel hybrid quantum-classical architecture for breast cancer diagnosis based on a dual-branch feature-extraction pipeline is proposed. Our framework extracts and unifies complementary representations from classical models and quantum circuits, exploring both trainable and deterministic (non-trainable) quantum paradigms. To integrate these embeddings, three progressive feature fusion strategies are introduced: Static Hybrid Fusion (SHF) for offline extraction, Dynamic Hybrid Fusion (DHF) for end-to-end co-adaptation, and a novel Temperature-Scaled Hybrid Fusion (TSHF). The TSHF strategy incorporates a learnable scalar, inspired by multimodal learning, that dynamically balances hybrid gradient dynamics and resolves optimization bottlenecks. Empirical validation on the BreastMNIST dataset confirms our hypothesis that unifying diverse feature representations creates a richer data context. The TSHF strategy, specifically when pairing a ResNet backbone with a trainable quantum circuit, achieved a peak accuracy of 87.82%, F1-score of 91.77%, and an AUC-ROC of 89.08%, outperforming purely classical baselines. These results demonstrate that the proposed hybrid framework improves classification accuracy and threshold reliability, providing a stable, high-performance architecture for the clinical deployment of quantum-enhanced diagnostic tools.
Chinese Translation
量子机器学习与经典深度学习的结合为医学图像分析提供了有前景的途径,通过将数据映射到高维希尔伯特空间。然而,由于常见的优化不对称性,有效地统一这两种不同的范式仍然具有挑战性。本文提出了一种基于双支路特征提取管道的乳腺癌诊断新型混合量子-经典架构。我们的框架从经典模型和量子电路中提取并统一互补表示,探索可训练和确定性(不可训练)量子范式。为整合这些嵌入,提出了三种渐进特征融合策略:离线提取的静态混合融合(Static Hybrid Fusion, SHF)、端到端共同适应的动态混合融合(Dynamic Hybrid Fusion, DHF)以及一种新颖的温度缩放混合融合(Temperature-Scaled Hybrid Fusion, TSHF)。TSHF策略结合了一个可学习的标量,受到多模态学习的启发,动态平衡混合梯度动态并解决优化瓶颈。在BreastMNIST数据集上的实证验证确认了我们的假设,即统一多样的特征表示创造了更丰富的数据上下文。特别是在将ResNet主干与可训练量子电路配对时,TSHF策略达到了87.82%的峰值准确率、91.77%的F1分数和89.08%的AUC-ROC,优于纯经典基线。这些结果表明,所提出的混合框架提高了分类准确性和阈值可靠性,为量子增强诊断工具的临床部署提供了一个稳定的高性能架构。
cs.CV / 44 / 2604.22942
VS-DDPM: Efficient Low-Cost Diffusion Model for Medical Modality Translation
VS-DDPM:高效低成本的医学模态转换扩散模型
Abstract
Diffusion models produce high-quality synthetic data but suffer from slow inference. We propose the 3D Variable-Step Denoising Diffusion Probabilistic Model (VS-DDPM), a framework engineered to maintain generative quality while accelerating inference several-fold. We tested our approach on four tasks (missing MRI, tumor removal, MRI-to-sCT, and CBCT-to-sCT) within the BraTS2025 and SynthRAD2025 challenges. The framework is designed for high efficiency under the hardware and time constraints imposed by both challenges. VS-DDPM achieved state-of-the-art (SOTA) performance in missing MRI synthesis, yielding Dice scores of 0.80, 0.83, and 0.88 for the enhancing tumor, tumor core, and whole tumor regions, respectively, alongside a structural similarity index (SSIM) of 0.95. For MRI tumor removal, the model attained a root mean squared error (RMSE) of 0.053, a peak signal-to-noise ratio (PSNR) of 26.77, and an SSIM of 0.918. While the framework demonstrated competitive performance in the MRI-to-sCT and CBCT-to-sCT tasks, it did not reach SOTA benchmarks, potentially due to sensitivities in the data pre- and post-processing pipelines or specific loss function configurations. These results demonstrate that VS-DDPM provides a robust and tunable solution for high-fidelity 3D medical image synthesis. The code is available at https://github.com/andre-fs-ferreira/SynthRAD_by_Faking_it.
Chinese Translation
扩散模型能够生成高质量的合成数据,但推理速度较慢。我们提出了3D可变步长去噪扩散概率模型(VS-DDPM),该框架旨在保持生成质量的同时加速推理。我们在BraTS2025和SynthRAD2025挑战中对四个任务(缺失MRI、肿瘤去除、MRI到sCT和CBCT到sCT)进行了测试。该模型设计用于在这两个挑战所施加的硬件和时间限制下实现高效性。VS-DDPM在缺失MRI合成中取得了最先进的(SOTA)性能,增强肿瘤、肿瘤核心和整个肿瘤区域的Dice分数分别为0.80、0.83和0.88,同时结构相似性指数(SSIM)为0.95。在MRI肿瘤去除任务中,该模型达到了0.053的均方根误差(RMSE)、26.77的峰值信噪比(PSNR)和0.918的SSIM。虽然该框架在MRI到sCT和CBCT到sCT任务中表现出竞争力,但未能达到SOTA基准,这可能与数据的前后处理管道或特定损失函数配置的敏感性有关。这些结果表明,VS-DDPM为高保真3D医学图像合成提供了一种稳健且可调的解决方案。代码可在https://github.com/andre-fs-ferreira/SynthRAD_by_Faking_it获取。
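For readers unfamiliar with the Dice scores quoted for the tumor regions, the metric compares a predicted binary mask with a reference mask. The sketch below is a generic illustration on flattened masks, not the challenge's official scoring code; the function name and the convention for two empty masks are assumptions.

```python
def dice_score(pred: list[int], target: list[int]) -> float:
    """Dice coefficient between two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Count voxels marked 1 in both masks.
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2 * intersection / total
```

A score of 1.0 means the two masks coincide exactly and 0.0 means they do not overlap at all.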
cs.CV / 45 / 2604.22964
AnemiaVision: Non-Invasive Anemia Detection via Smartphone Imagery Using EfficientNet-B3 with TrivialAugmentWide, Mixup Augmentation, and Persistent Patient History Management
AnemiaVision:通过智能手机图像使用 EfficientNet-B3 进行非侵入性贫血检测,结合 TrivialAugmentWide、Mixup 增强和持久患者历史管理
Abstract
Anemia affects over one billion people globally and remains severely under-diagnosed in low-resource regions where laboratory blood tests are inaccessible. This paper presents AnemiaVision, an end-to-end web-based system for non-invasive anemia screening from smartphone photographs of the palpebral conjunctiva and fingernail beds. The proposed pipeline fine-tunes a pre-trained EfficientNet-B3 backbone with a redesigned three-layer classifier head incorporating BatchNorm, GELU activations, and high-rate Dropout (0.45/0.35). Training employs four orthogonal accuracy-boosting techniques: TrivialAugmentWide for policy-free image augmentation, RandomErasing for spatial regularisation, Mixup (alpha=0.2) for inter-class smoothing, and cosine-annealing scheduling with linear warmup. Early stopping is governed by peak validation accuracy rather than validation loss to prevent premature termination on high-variance epochs. The deployed Flask application integrates persistent patient-history management backed by PostgreSQL on Render, with an automated database-migration entrypoint ensuring zero data loss across redeploys. Ablation experiments demonstrate that accuracy-first early stopping contributes +1.6% and Mixup contributes +2.8% to final validation accuracy. Overall, the proposed system achieves a validation accuracy of 96.2% and AUC-ROC of 0.98, compared with 44.9% validation accuracy and AUC-ROC of 0.58 from the three-epoch CPU-only baseline. Sensitivity for the anemic class reaches 0.96, making the system suitable as a first-line screening tool for community health workers in rural settings. The system is publicly accessible and source code is openly available.
Chinese Translation
贫血影响全球超过十亿人,并在实验室血液检测无法获得的低资源地区严重被低估。本文提出了 AnemiaVision,一个基于网络的端到端系统,用于从智能手机拍摄的眼睑结膜和指甲床照片中进行非侵入性贫血筛查。所提出的流程对预训练的 EfficientNet-B3 主干进行微调,设计了一个包含 BatchNorm、GELU 激活和高比率 Dropout(0.45/0.35)的三层分类器头。训练过程中采用四种正交的准确性提升技术:TrivialAugmentWide 用于无策略图像增强,RandomErasing 用于空间正则化,Mixup(alpha=0.2)用于类间平滑,以及带线性预热的余弦退火调度。早停策略以峰值验证准确率为依据,而非验证损失,以防止在高方差的训练周期中过早终止。部署的 Flask 应用集成了由 PostgreSQL 支持的持久患者历史管理,自动化的数据库迁移入口确保在重新部署过程中零数据丢失。消融实验表明,准确性优先的早停策略对最终验证准确率贡献 +1.6%,而 Mixup 对最终验证准确率贡献 +2.8%。总体而言,所提出的系统实现了 96.2% 的验证准确率和 0.98 的 AUC-ROC,相较于仅使用 CPU 的三轮基线模型,其验证准确率为 44.9% 和 AUC-ROC 为 0.58。贫血类别的灵敏度达到 0.96,使该系统适合作为农村社区卫生工作者的第一线筛查工具。该系统公开可访问,源代码也可公开获取。
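Two of the training techniques named in this abstract, Mixup with alpha=0.2 and cosine annealing with linear warmup, can be sketched in plain Python. This is a minimal illustration rather than the authors' training code: the warmup length, the base learning rate, and the choice to anneal to zero are assumptions, and `mixup` and `lr_at` are hypothetical names.

```python
import math
import random


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two samples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


def lr_at(step, total_steps, warmup_steps, base_lr):
    """Linear warmup to base_lr, then cosine annealing down to zero (assumed floor)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

With a small alpha such as 0.2, the Beta distribution puts most of its mass near 0 and 1, so most blended samples stay close to one of the two originals while the labels are still softened.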
cs.CV / 46 / 2604.22984
BrickNet: Graph-Backed Generative Brick Assembly
BrickNet:基于图的生成砖块组装
Abstract
We train a language model to generate LEGO-brick build sequences. While prior work has been restricted to discrete, voxel-like towers, we consider a much broader set of pieces, encompassing thousands of part types with diverse connection semantics. To enable this, we first collect a large-scale dataset of over 100,000 human-designed LDraw brick objects and scenes. The complexity of our setting makes it challenging to autoregressively assemble structures that satisfy physical constraints. When predicting block pose directly, build sequences quickly become invalid after a small number of steps. Although pieces are placed in 3D space, it is the spatial relationships of the parts which define the whole. With this in mind, we design a graph-based program representation that parametrizes structure through connectivity, improving the physical grounding of generated sequences. To enable future applications, we make our dataset and models available for research purposes. https://kulits.github.io/BrickNet
Chinese Translation
我们训练了一个语言模型以生成乐高砖块的构建序列。尽管之前的研究限制于离散的、体素状的塔楼,但我们考虑了更广泛的零件集合,涵盖了数千种具有多样连接语义的零件类型。为此,我们首先收集了一个大规模的数据集,其中包含超过100,000个由人类设计的LDraw砖块对象和场景。我们设置的复杂性使得自回归地组装满足物理约束的结构变得具有挑战性。当直接预测砖块姿态时,构建序列在经过少量步骤后迅速变得无效。尽管零件被放置在三维空间中,但定义整体的是零件之间的空间关系。考虑到这一点,我们设计了一种基于图的程序表示,通过连接性对结构进行参数化,从而改善生成序列的物理基础。为了促进未来的应用,我们将我们的数据集和模型提供给研究用途。
cs.CV / 47 / 2604.22989
CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging
CheXmix:医学影像中视觉语言模型的统一生成预训练
Abstract
Recent medical multimodal foundation models are built as multimodal LLMs (MLLMs) by connecting a CLIP-pretrained vision encoder to an LLM using LLaVA-style finetuning. This two-stage, decoupled approach introduces a projection layer that can distort visual features. This is especially concerning in medical imaging where subtle cues are essential for accurate diagnoses. In contrast, early-fusion generative approaches such as Chameleon eliminate the projection bottleneck by processing image and text tokens within a single unified sequence, enabling joint representation learning that leverages the inductive priors of language models. We present CheXmix, a unified early-fusion generative model trained on a large corpus of chest X-rays paired with radiology reports. We expand on Chameleon's autoregressive framework by introducing a two-stage multimodal generative pretraining strategy that combines the representational strengths of masked autoencoders with MLLMs. The resulting models are highly flexible, supporting both discriminative and generative tasks at both coarse and fine-grained scales. Our approach outperforms well-established generative models across all masking ratios by 6.0% and surpasses CheXagent by 8.6% on AUROC at high image masking ratios on the CheXpert classification task. We further inpaint images over 51.0% better than text-only generative models and outperform CheXagent by 45% on the GREEN metric for radiology report generation. These results demonstrate that CheXmix captures fine-grained information across a broad spectrum of chest X-ray tasks. Our code is at: https://github.com/StanfordMIMI/CheXmix.
Chinese Translation
近期的医学多模态基础模型通过将CLIP预训练的视觉编码器与使用LLaVA风格微调的语言模型(LLM)连接,构建为多模态大语言模型(MLLM)。这种两阶段的解耦方法引入了一个可能扭曲视觉特征的投影层。这在医学影像中尤为令人担忧,因为细微线索对准确诊断至关重要。相比之下,早期融合生成方法如Chameleon通过在单一统一序列中处理图像和文本标记,消除了投影瓶颈,从而实现了联合表示学习,利用了语言模型的归纳先验。我们提出了CheXmix,这是一种统一的早期融合生成模型,基于大量配对的胸部X光片和放射学报告进行训练。我们通过引入一种结合了掩蔽自编码器和MLLM表示优势的两阶段多模态生成预训练策略,扩展了Chameleon的自回归框架。所得到的模型具有高度灵活性,支持粗粒度和细粒度的判别性和生成性任务。我们的方法在所有掩蔽比例上比成熟的生成模型提高了6.0%,并在CheXpert分类任务中,在高图像掩蔽比例下超越了CheXagent 8.6%的AUROC。我们在图像修复方面的表现比仅使用文本的生成模型提高了51.0%,并在放射学报告生成的GREEN指标上超越了CheXagent 45%。这些结果表明,CheXmix能够捕捉胸部X光任务中广泛范围内的细粒度信息。我们的代码可在:https://github.com/StanfordMIMI/CheXmix.
cs.CV / 48 / 2604.22990
Hard to See, Hard to Label: Generative and Symbolic Acquisition for Subtle Visual Phenomena
难以察觉,难以标注:细微视觉现象的生成与符号获取
Abstract
Subtle visual anomalies such as hairline cracks, sub-millimeter voids, and low-contrast inclusions are structurally atypical yet visually ambiguous, making them both difficult to annotate and easy to overlook during active learning. Standard acquisition heuristics based on discriminative uncertainty or feature diversity often overselect dominant patterns while underexploring sparse yet important regions of the data space. This failure mode is especially severe in industrial defect inspection, where anomalies may be both low-prevalence and difficult to distinguish from surrounding structure. To resolve this, we propose GSAL, an active learning framework for object detection that combines a diffusion-based difficulty signal with a hierarchical semantic coverage prior. The diffusion component scores images and proposals using reconstruction discrepancy and denoising variability, prioritizing visually atypical or ambiguous examples. However, diffusion alone does not prevent acquisition from repeatedly favoring hard samples within dominant semantic modes. The semantic component therefore organizes candidate samples in a three-level concept graph and promotes coverage of underrepresented semantic regions while providing interpretable acquisition rationales. By balancing visual difficulty with semantic coverage, GSAL improves retrieval of subtle and rare targets that are often missed by uncertainty-only selection. Experiments on a proprietary thin-film defect dataset, Pascal VOC, and MS COCO show consistent gains in label efficiency and rare-class retrieval over uncertainty-, diversity-, and hybrid-based baselines.
Chinese Translation
细微的视觉异常,如发丝裂缝、亚毫米空隙和低对比度夹杂物,结构上不典型但视觉上模糊,使得它们在主动学习过程中既难以注释又容易被忽视。基于判别不确定性或特征多样性的标准获取启发式方法往往过度选择主导模式,而对数据空间中稀疏但重要的区域探索不足。这种失败模式在工业缺陷检测中尤为严重,因为异常可能既低频率又难以与周围结构区分。为了解决这个问题,我们提出了GSAL,一个结合了基于扩散的难度信号和分层语义覆盖先验的目标检测主动学习框架。扩散组件通过重建差异和去噪变异性对图像和提议进行评分,优先考虑视觉上不典型或模糊的示例。然而,仅靠扩散并不能防止获取过程重复偏向主导语义模式中的困难样本。因此,语义组件在三层概念图中组织候选样本,并促进对代表性不足的语义区域的覆盖,同时提供可解释的获取理由。通过平衡视觉难度与语义覆盖,GSAL提高了对细微和稀有目标的检索,这些目标通常被仅基于不确定性的选择所遗漏。在专有薄膜缺陷、Pascal VOC和MS COCO数据集上的实验表明,与基于不确定性、多样性和混合的基线相比,标签效率和稀有类别检索均有一致的提升。
cs.CV / 49 / 2604.22992
Efficient Image Annotation via Semi-Supervised Object Segmentation with Label Propagation
通过标签传播的半监督对象分割实现高效图像标注
Abstract
Reliable object perception is necessary for general-purpose service robots. Open-vocabulary detectors struggle to generalize beyond a few classes, and fully supervised training of object detectors requires time-intensive annotation. We present a semi-supervised label propagation approach for household object segmentation. A segment proposer generates class-agnostic masks, and an ensemble of Hopfield networks assigns labels by learning representative embeddings in complementary foundation model embedding spaces (CLIP, ViT, Theia). Our approach scales to 50 object classes with limited annotation overhead and can automatically label 60% of the data in a RoboCup@Home setting, where preparation time is severely constrained. Dataset and code are publicly available at https://github.com/ais-bonn/label_propagation.
Chinese Translation
可靠的对象感知对于通用服务机器人是必要的。开放词汇检测器在超出少数类别时难以泛化,而对象检测器的完全监督训练需要耗时的标注。我们提出了一种用于家庭对象分割的半监督标签传播方法。一个分割提议器生成与类别无关的掩膜,而一组霍普菲尔德网络通过学习互补基础模型嵌入空间(CLIP、ViT、Theia)中的代表性嵌入来分配标签。我们的方法能够扩展到50个对象类别,且标注开销有限,并且能够在RoboCup@Home环境中自动标注60%的数据,在该环境中准备时间受到严重限制。数据集和代码可在 https://github.com/ais-bonn/label_propagation 上公开获取。
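The label assignment step can be illustrated with a small sketch, assuming a modern-Hopfield-style retrieval (softmax over similarities to stored class patterns) and a simple average across embedding spaces. The abstract does not specify these details, so `class_probs`, `ensemble_label`, `beta`, and `threshold` are hypothetical choices made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def class_probs(query, prototypes, beta=8.0):
    """Hopfield-style retrieval: softmax over similarities to stored class patterns."""
    return softmax([beta * dot(query, p) for p in prototypes])


def ensemble_label(queries, prototype_sets, threshold=0.6):
    """Average per-space class probabilities (one query per embedding space).

    Returns the class index, or None when the ensemble is not confident,
    in which case the segment would be left for manual annotation.
    """
    n_classes = len(prototype_sets[0])
    avg = [0.0] * n_classes
    for q, protos in zip(queries, prototype_sets):
        for i, p in enumerate(class_probs(q, protos)):
            avg[i] += p / len(queries)
    best = max(range(n_classes), key=lambda i: avg[i])
    return best if avg[best] >= threshold else None
```

Abstaining below a confidence threshold is one plausible way to label only part of the data automatically while leaving uncertain segments to a human.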
cs.CV / 50 / 2604.23010
GenAssets: Generating in-the-wild 3D Assets in Latent Space
GenAssets:在潜在空间中生成真实环境下的3D资产
Abstract
High-quality 3D assets for traffic participants are critical for multi-sensor simulation, which is essential for the safe end-to-end development of autonomy. Building assets from in-the-wild data is key for diversity and realism, but existing neural-rendering based reconstruction methods are slow and generate assets that render well only from viewpoints close to the original observations, limiting their usefulness in simulation. Recent diffusion-based generative models build complete and diverse assets, but perform poorly on in-the-wild driving scenes, where observed actors are captured under sparse and limited fields of view, and are partially occluded. In this work, we propose a 3D latent diffusion model that learns on in-the-wild LiDAR and camera data captured by a sensor platform and generates high-quality 3D assets with complete geometry and appearance. Key to our method is a "reconstruct-then-generate" approach that first leverages occlusion-aware neural rendering trained over multiple scenes to build a high-quality latent space for objects, and then trains a diffusion model that operates on the latent space. We show our method outperforms existing reconstruction and generation based methods, unlocking diverse and scalable content creation for simulation.
Chinese Translation
高质量的交通参与者3D资产对于多传感器仿真至关重要,这对于自主系统的安全端到端开发是必不可少的。从真实环境数据构建资产是实现多样性和真实感的关键,但现有的基于神经渲染的重建方法速度较慢,并且生成的资产仅能从接近原始观察的视角良好渲染,这限制了它们在仿真中的实用性。最近的基于扩散的生成模型能够构建完整且多样的资产,但在真实驾驶场景中的表现较差,因为观察到的参与者是在稀疏和有限的视野下捕获的,并且部分被遮挡。在本研究中,我们提出了一种3D潜在扩散模型,该模型在由传感器平台捕获的真实环境LiDAR和相机数据上进行学习,并生成具有完整几何形状和外观的高质量3D资产。我们方法的关键在于一种“重建后生成”的方法,首先利用在多个场景上训练的考虑遮挡的神经渲染技术构建高质量的对象潜在空间,然后训练一个在潜在空间上运行的扩散模型。我们展示了我们的方法优于现有的重建和生成方法,为仿真解锁了多样化和可扩展的内容创作。
cs.CV / 51 / 2604.23018
AmaraSpatial-10K: A Spatially and Semantically Aligned 3D Dataset for Spatial Computing and Embodied AI
AmaraSpatial-10K:一个空间和语义对齐的用于空间计算和具身人工智能的3D数据集
Abstract
Web-scale 3D asset collections are abundant, but rarely deployment-ready. Assets ship with arbitrary metric scale, incorrect pivots and forward axes, brittle geometry, and textures that do not support relighting, which limits their utility for embodied AI, robotics simulation, game development, and AR/VR. We present AmaraSpatial-10K, a dataset of over 10,000 synthetic 3D assets designed for downstream use rather than volume alone. Each asset is released as a metric-scaled, semantically anchored .glb with separated PBR material maps, a convex collision hull, a paired reference image, and rich multi-sentence text metadata. The dataset spans indoor objects, vehicles, architecture, creatures, and props under a unified spatial convention. Alongside the dataset, we introduce an evaluation suite for 3D asset banks. The suite comprises a continuous Scale Plausibility Score (SPS) with an LLM-as-Judge interval protocol, an LLM Concept Density score for metadata, an anchor-error metric, and a cross-modal CLIP coherence protocol, and we use it to audit AmaraSpatial-10K alongside matched subsets from Objaverse, HSSD, ABO, and GSO. Compared with Objaverse-sourced assets, we demonstrate that AmaraSpatial-10K substantially improves text-based retrieval precision (CLIP Recall@5 of 0.612 vs 0.181, a 3.4x improvement with median rank falling from 267 to 3), and we establish that it satisfies the spatial and semantic prerequisites for physics-aware scene composition and embodied-AI asset banks, leaving those downstream evaluations to future work. AmaraSpatial-10K is publicly available on Hugging Face.
Chinese Translation
网络规模的3D资产集合非常丰富,但很少有可直接部署的资产。这些资产通常具有任意的度量尺度、错误的支点和前向轴、脆弱的几何形状,以及不支持重新照明的纹理,这限制了它们在具身人工智能、机器人仿真、游戏开发和增强现实/虚拟现实中的实用性。我们提出了AmaraSpatial-10K,这是一个超过10,000个合成3D资产的数据集,旨在用于下游应用,而不仅仅是数量。每个资产都以度量缩放、语义锚定的.glb格式发布,包含分离的PBR材质图、凸包碰撞体、配对的参考图像和丰富的多句文本元数据。该数据集涵盖室内物体、车辆、建筑、生物和道具,并遵循统一的空间约定。除了数据集,我们还引入了一个用于3D资产库的评估套件。该套件包括一个连续的尺度合理性评分(Scale Plausibility Score,SPS),采用LLM-as-Judge间隔协议,一个用于元数据的LLM概念密度评分,一个锚点误差指标,以及一个跨模态CLIP一致性协议,我们利用这些工具对AmaraSpatial-10K进行审计,同时与Objaverse、HSSD、ABO和GSO的匹配子集进行比较。与Objaverse来源的资产相比,我们证明AmaraSpatial-10K显著提高了基于文本的检索精度(CLIP Recall@5为0.612对比0.181,提升了3.4倍,中位排名从267降至3),并且我们确认它满足物理感知场景构建和具身人工智能资产库的空间和语义先决条件,后续的评估工作将留待未来进行。AmaraSpatial-10K已在Hugging Face上公开发布。
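The retrieval figures quoted above (Recall@5 and median rank) follow from per-query ranks of the correct asset. The sketch below shows the standard definitions; the function names and the toy ranks are illustrative and not taken from the paper's evaluation suite.

```python
from statistics import median


def recall_at_k(ranks, k=5):
    """Fraction of queries whose correct asset appears in the top k (1-based ranks)."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median 1-based rank of the correct asset across queries."""
    return median(ranks)


# The reported improvement factor is the ratio of the two Recall@5 values.
improvement = 0.612 / 0.181  # about 3.38, which rounds to the quoted 3.4x
```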
cs.CV / 52 / 2604.23019
Understanding Representation Gaps Across Scales in Tropical Tree Species Classification from Drone Imagery
理解热带树种分类中基于无人机影像的尺度间表征差距
Abstract
Accurate classification of tropical tree species from unoccupied aerial vehicle (UAV) imagery remains challenging due to high species diversity and strong visual similarity among species at typical image resolutions (centimeters per pixel). In contrast, models trained on close-up citizen science photographs captured with smartphones achieve strong plant species classification performance. Recent advances in UAV data acquisition now enable the collection of close-up images that are spatially registered with top-view aerial imagery and approach the level of visual detail found in smartphone photographs, with the trade-off that such high-resolution photos cannot be acquired for many trees. In this work, we evaluate the performance of existing methods using paired top-view and close-up UAV imagery collected in a species-rich tropical forest. Through fine-tuning experiments, we quantify the performance gap between vision foundation models and in-domain generalist plant recognition models across both image types (high-resolution close-up versus coarser-resolution top-view imagery). We show that classification performance is consistently higher on close-up images than on top-view aerial imagery, and that this performance gap widens for rare species. Finally, we propose that self-supervised representation alignment across these two spatial scales offers a promising approach for integrating fine-grained visual information into canopy-level species classification models based on top-view UAV imagery. Leveraging high-resolution close-up UAV imagery to enhance canopy-level species classification could substantially improve large-scale monitoring of tropical forest biodiversity.
Chinese Translation
从无人机(UAV)影像中准确分类热带树种仍然面临挑战,这主要是由于物种多样性高以及在典型图像分辨率(每像素厘米级)下物种之间的视觉相似性强。相比之下,使用智能手机拍摄的特写公民科学照片训练的模型在植物物种分类性能上表现优异。近期无人机数据采集的进展使得能够收集与俯视航拍影像空间对齐的特写图像,这些图像的视觉细节接近智能手机照片的水平,但这种高分辨率照片无法为许多树木获取。本文评估了在物种丰富的热带森林中使用配对的俯视和特写无人机影像的现有方法的性能。通过微调实验,我们量化了视觉基础模型与领域内通用植物识别模型在两种图像类型(高分辨率特写与较粗分辨率俯视影像)之间的性能差距。我们发现,特写图像的分类性能始终高于俯视航拍影像,并且对于稀有物种,这一性能差距进一步扩大。最后,我们提出在这两种空间尺度之间进行自监督表征对齐,可能为将细粒度视觉信息整合到基于俯视无人机影像的树冠级物种分类模型中提供一种有前景的方法。利用高分辨率特写无人机影像来增强树冠级物种分类,可能会显著改善热带森林生物多样性的规模监测。
cs.CV / 53 / 2604.23066
Urban Flood Observations (UFO): A hand-labeled training and validation dataset of post-flood inundation
城市洪水观测(UFO):一套手工标注的洪水后淹没训练和验证数据集
Abstract
Urban flooding affects lives and infrastructure worldwide. Mapping inundation in complex urban environments from satellite imagery remains challenging due to limited spatial resolution, infrequent acquisitions, and cloud cover. We present Urban Flood Observations (UFO), a global, hand-labeled dataset of post-flood inundation in diverse urban settings. UFO comprises 215 image chips (1024 by 1024 pixels) from 14 flood events between 2017 and 2021, derived from 3 m PlanetScope imagery. Each chip is annotated with two classes: 'inundated' (all visible surface water, including floodwater and pre-existing water bodies (permanent or seasonal)) and 'non-inundated'. To demonstrate the dataset's utility, we trained a segmentation model using leave-one-event-out cross-validation, achieving a mean Intersection over Union (IoU) of 77.3. We also used UFO to evaluate two widely used surface water products, the Sentinel-1-based NASA IMPACT model and Google's 10 m Dynamic World water class, which yielded IoUs of 44.1 and 48.1, respectively. UFO is publicly available to support the development and validation of urban inundation mapping methods.
Chinese Translation
城市洪水对全球的生活和基础设施产生影响。从卫星图像中在复杂城市环境中绘制淹没图仍然具有挑战性,原因包括空间分辨率有限、获取频率低以及云层覆盖。我们提出了城市洪水观测(UFO),这是一个全球性的手工标注数据集,涵盖了不同城市环境中的洪水后淹没情况。UFO包含来自2017年至2021年间14次洪水事件的215个图像切片(1024 x 1024 像素),这些图像源自3米的PlanetScope影像。每个切片被标注为两个类别:'淹没'(所有可见的水面,包括洪水和先前存在的水体(永久性或季节性))和'未淹没'。为了展示数据集的实用性,我们使用留一事件交叉验证训练了一个分割模型,达到了77.3的平均交并比(IoU)。我们还利用UFO评估了两个广泛使用的地表水产品,即基于Sentinel-1的NASA IMPACT模型和谷歌的10米动态世界水类,分别获得了44.1和48.1的IoU。UFO公开可用,以支持城市淹没绘制方法的开发和验证。
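Two pieces of the evaluation protocol above are easy to make concrete: the IoU of the 'inundated' class and the leave-one-event-out split. The sketch below uses nested lists for masks and `(event_id, chip_id)` tuples for chips; both representations are assumptions for illustration. The function returns a fraction in [0, 1], whereas the abstract's values (for example 77.3) appear to be percentages.

```python
def iou(pred, truth):
    """IoU of the positive ('inundated') class for two equally sized binary masks."""
    inter = union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; treat that case as IoU = 1.
    return inter / union if union else 1.0


def leave_one_event_out(chips):
    """Yield (held_out_event, train_chips, test_chips) for each flood event.

    `chips` is a list of (event_id, chip_id) pairs.
    """
    events = sorted({event for event, _ in chips})
    for held_out in events:
        train = [c for c in chips if c[0] != held_out]
        test = [c for c in chips if c[0] == held_out]
        yield held_out, train, test
```

Holding out whole events, rather than random chips, keeps imagery from the same flood out of both the training and the test fold.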
cs.CV / 54 / 2604.23079
From Pixels to Explanations: Interpretable Diabetic Retinopathy Grading with CNN-Transformer Ensembles, Visual Explainability and Vision-Language Models
从像素到解释:基于CNN-Transformer集成的可解释糖尿病视网膜病变分级、视觉可解释性与视觉-语言模型
Abstract
The quality of diabetic retinopathy (DR) screening relies on the ability to correctly grade severity; however, many deep-learning (DL) classifiers cannot be easily interpreted in the clinical context. This study presents a methodology that combines strong discriminative models with multimodal explanations, converting retinal pixels into clinically interpretable outputs. Using the APTOS 2019 benchmark, we evaluated six representative CNN- and transformer-based backbones under a controlled protocol with stratified five-fold cross-validation. We then compared ensembling strategies (hard voting, weighted soft voting, stacking) and investigated a hybrid class-level fusion variant to exploit grade-specific advantages. For interpretability, we produced Grad-CAM++ visual attribution maps and short textual rationales using vision-language models (VLMs) conditioned on the fundus image and classifier outputs under conservative prompting constraints. Modern CNN backbones (ResNet-50 and ConvNeXt-Tiny) provided the strongest single-model baselines, with cross-validated QWK up to 0.919 and 0.914, respectively. Ensembling improved ordinal agreement, and weighted soft voting was the most consistent across folds (QWK 0.934 +/- 0.017). Hybrid class-level fusion was competitive but did not yield a statistically reliable improvement over standard fusion in paired fold comparisons (Holm-adjusted p >= 1.000). For explanation quality, Grad-CAM++ offered plausible but coarse localization, and VLM rationales were generally grade-consistent. Quantitatively, VLM variants showed a trade-off between clinical completeness and template-level semantic similarity (coverage 0.700 vs. BERTScore 0.072), while image-text alignment was comparable (CLIPScore approximately 0.34).
Chinese Translation
糖尿病视网膜病变(DR)筛查的质量依赖于正确分级严重程度的能力;然而,许多深度学习(DL)分类器在临床环境中难以被解释。本研究提出了一种方法,将强大的判别模型与多模态解释相结合,将视网膜像素转换为临床可解释的输出。使用APTOS 2019基准,我们在受控协议下进行了分层五折交叉验证,评估了六种具有代表性的基于CNN和Transformer的骨干网络。然后,我们比较了集成策略(硬投票、加权软投票、堆叠),并研究了一种混合类级融合变体,以利用特定等级的优势。为了实现可解释性,我们生成了Grad-CAM++视觉归因图和简短的文本推理,使用了基于视觉-语言模型(VLMs),这些模型在保守提示约束下以眼底图像和分类器输出为条件。现代CNN骨干网络(ResNet-50和ConvNeXt-Tiny)提供了最强的单模型基线,交叉验证的QWK分别达到0.919和0.914。集成提高了序数一致性,加权软投票在各折中最为一致(QWK 0.934 +/- 0.017)。混合类级融合具有竞争力,但在配对折比较中未能在统计上显著优于标准融合(Holm调整后的p >= 1.000)。在解释质量方面,Grad-CAM++提供了合理但粗略的定位,而VLM推理通常与等级一致。从定量上看,VLM变体在临床完整性和模板级语义相似性之间存在权衡(覆盖率0.700与BERTScore 0.072),而图像-文本对齐相当(CLIPScore约为0.34)。
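For readers unfamiliar with the metric and fusion rule named above, the following sketch implements quadratic weighted kappa (QWK) and weighted soft voting in plain Python. It follows the textbook definitions; the per-model weights in the paper are not given in the abstract, so the `weights` argument is a placeholder.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic disagreement weights for ordinal grades."""
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            expected = true_hist[i] * pred_hist[j] / n
            num += w * observed[i][j]
            den += w * expected
    # Kappa is undefined when all labels and predictions fall in one class.
    return float("nan") if den == 0 else 1.0 - num / den


def weighted_soft_vote(prob_lists, weights):
    """Fuse per-model class probabilities with a weighted average; return (class, fused)."""
    total = sum(weights)
    n_classes = len(prob_lists[0])
    fused = [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=lambda c: fused[c]), fused
```

Because the weights grow with the squared distance between grades, predicting grade 4 for a grade 0 eye is penalized far more than an adjacent-grade error.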
cs.CV / 55 / 2604.23094
Toward Real-World Adoption of Portrait Relighting via Hybrid Domain Knowledge Fusion
通过混合领域知识融合推动人像重光照的现实世界应用
Abstract
The real-world adoption of portrait relighting is hindered by dataset domain gaps, camera sensitivity, and computational costs. We address these challenges with Hybrid Domain Knowledge Fusion, a paradigm that fuses the specialized strengths of synthetic, One-Light-at-A-Time (OLAT), and real-world datasets into a compact model. Our approach features specialized prior models hardened by domain-aware adaptation, followed by augmented knowledge distillation into a lightweight student model with multi-domain expertise. Our method demonstrates a 6x to 240x inference speedup while maintaining state-of-the-art (SOTA) visual quality in the experiments. Additionally, we construct a massive, high-fidelity synthetic dataset with diverse ground-truth intrinsics to support our training pipeline.
Chinese Translation
人像重光照在现实世界中的应用受到数据集领域差异、相机灵敏度和计算成本的制约。我们通过混合领域知识融合(Hybrid Domain Knowledge Fusion)来应对这些挑战,这是一种将合成数据集、一次光源(One-Light-at-A-Time, OLAT)和真实世界数据集的专门优势融合成紧凑模型的范式。我们的方法具有经过领域感知适应强化的专门先验模型,随后通过知识蒸馏增强到具有多领域专业知识的轻量级学生模型。在实验中,我们的方法在保持最先进(SOTA)视觉质量的同时,实现了6倍到240倍的推理速度提升。此外,我们构建了一个庞大且高保真的合成数据集,包含多样化的真实内在参数,以支持我们的训练流程。
cs.CV / 56 / 2604.23095
INSIGHT: Indoor Scene Intelligence from Geometric-Semantic Hierarchy Transfer for Public Safety
INSIGHT:基于几何-语义层次转移的室内场景智能用于公共安全
Abstract
Indoor environments lack the spatial intelligence infrastructure that GPS provides outdoors; first responders arriving at unfamiliar buildings typically have no machine-readable map of safety equipment. Prior work on 3D semantic segmentation for public safety identified two barriers: scarcity of labeled indoor training data and poor recognition of small safety-critical features by native point-cloud methods. This paper presents INSIGHT, a zero-target-domain-annotation pipeline that projects 2D image understanding into 3D metric space via registered RGB-D data. Two interchangeable vision stacks share a common 3D back end: a SAM3 foundation-model stack for text-prompted segmentation, and a traditional CV stack (open-set detection, VQA, OCR) whose intermediate outputs are independently inspectable. Evaluated on all seven subareas of Stanford 2D-3D-S (70,496 images), the pipeline produces Pointcept-schema-compatible labeled point clouds and ISO 19164-compliant scene graphs with ~10^4x compression; role-filtered payloads transmit in <15 s at 1 Mbps over FirstNet Band 14. We report per-point labeling accuracy on 7 shared classes, detection sensitivity for 15 safety-critical classes absent from public 3D benchmarks alongside code-capped deployable estimates, and inter-pipeline complementarity, demonstrating that 2D-to-3D semantic transfer addresses the labeled-data bottleneck while scene graphs provide building intelligence compact enough for field deployment.
Chinese Translation
室内环境缺乏GPS在户外提供的空间智能基础设施;急救人员在到达不熟悉的建筑时通常没有安全设备的机器可读地图。之前关于公共安全的3D语义分割研究发现了两个障碍:标注的室内训练数据稀缺,以及原生点云方法对小型安全关键特征的识别能力差。本文提出了INSIGHT,一个零目标领域标注的管道,通过配准的RGB-D数据将2D图像理解投影到3D度量空间。两个可互换的视觉堆栈共享一个共同的3D后端:一个用于文本提示分割的SAM3基础模型堆栈,以及一个传统的计算机视觉堆栈(开放集检测、视觉问答、光学字符识别),其中间输出可独立检查。在斯坦福2D-3D-S的所有七个子区域(70,496张图像)上进行评估,该管道生成与Pointcept模式兼容的标注点云和符合ISO 19164的场景图,压缩比约为10^4倍;角色过滤的有效载荷在FirstNet Band 14上以1 Mbps的速度在不到15秒内传输。我们报告了7个共享类别的逐点标注准确率,15个在公共3D基准中缺失的安全关键类别的检测灵敏度,以及代码限制的可部署估计和管道间的互补性,证明了2D到3D的语义转移解决了标注数据瓶颈,而场景图提供了足够紧凑的建筑智能以便于现场部署。
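The core geometric step, lifting labeled 2D pixels into 3D using registered depth, can be sketched with a standard pinhole camera model. This is an assumption-laden illustration: the intrinsics tuple `(fx, fy, cx, cy)`, the nested-list mask and depth map, and the function names are hypothetical, and the camera-to-world transform that a full pipeline would apply is omitted.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def label_points(mask, depth_map, label, intrinsics):
    """Turn every masked pixel with valid depth into a labeled 3D point.

    `mask` and `depth_map` are row-major nested lists of the same shape;
    a depth of 0 marks a missing measurement and is skipped.
    """
    fx, fy, cx, cy = intrinsics
    points = []
    for v, row in enumerate(mask):
        for u, inside in enumerate(row):
            d = depth_map[v][u]
            if inside and d > 0:
                points.append((backproject(u, v, d, fx, fy, cx, cy), label))
    return points
```

As a rough check on the bandwidth claim, a payload that transmits in under 15 s at 1 Mbps must be smaller than 15 Mbit, or about 1.9 MB.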
cs.CV / 57 / 2604.23105
Transferable Physical-World Adversarial Patches Against Object Detection in Autonomous Driving
可转移的物理世界对抗补丁在自动驾驶物体检测中的应用
Abstract
Deep learning drives major advances in autonomous driving (AD), where object detectors are central to perception. However, adversarial attacks pose significant threats to the reliability and safety of these systems, with physical adversarial patches representing a particularly potent form of attack. Physical adversarial patch attacks pose severe risks but are usually crafted for a single model, yielding poor transferability to unseen detectors. We propose AdvAD, a transfer-based physical attack against object detection in autonomous driving. Instead of targeting a specific detector, AdvAD optimizes adversarial patches over multiple detection models in a unified framework, encouraging the learned perturbations to capture shared vulnerabilities across architectures. The optimization process adaptively balances model contributions and enforces robustness to physical variations. It further employs data augmentation and geometric transformations to maintain patch effectiveness under diverse physical conditions. Experiments in both digital and real-world settings show that AdvAD consistently outperforms state-of-the-art (SOTA) attacks in performance and transferability.
Chinese Translation
深度学习推动了自动驾驶(AD)的重大进展,其中物体检测器是感知的核心。然而,对抗攻击对这些系统的可靠性和安全性构成了重大威胁,而物理对抗补丁则代表了一种特别强大的攻击形式。物理对抗补丁攻击带来了严重风险,但通常是针对单一模型设计的,导致其在未见过的检测器上转移性较差。我们提出了AdvAD,一种针对自动驾驶物体检测的基于转移的物理攻击。AdvAD并不针对特定检测器,而是在统一框架下对多个检测模型优化对抗补丁,鼓励学习到的扰动捕捉不同架构之间的共享脆弱性。优化过程自适应地平衡模型贡献,并增强对物理变化的鲁棒性。它进一步采用数据增强和几何变换,以保持补丁在多种物理条件下的有效性。在数字和现实世界环境中的实验表明,AdvAD在性能和转移性方面始终优于最先进(SOTA)的攻击方法。
cs.CV / 58 / 2604.23125
Learning from Imperfect Text Guidance: Robust Long-Tail Visual Recognition with High-Noise Label
从不完美文本指导中学习:高噪声标签下的鲁棒长尾视觉识别
Abstract
Real-world data often exhibit long-tailed distributions with numerous noisy labels, substantially degrading the performance of deep models. While prior research has made progress in addressing this combined challenge, it overlooks the severe label-image mismatch inherent to high-noise settings, which limits its effectiveness. Given that observed labels, though mismatched with images, still retain category information, we propose employing auxiliary text information from labels to address label-image inconsistencies in long-tailed noisy data. Specifically, we leverage the intrinsic cross-modal alignment in pre-trained visual-language models to correct the label-image inconsistencies. This supervisory signal, referred to as Weak Teacher Supervision (WTS), is unaffected by label noise and data distribution biases, although it exhibits limited accuracy. Therefore, the activation of WTS is determined by evaluating the discrepancy between text-predicted labels and observed labels. Extensive experiments demonstrate the superior performance of WTS across synthetic and real-world datasets, particularly under high-noise conditions. The source code is available at https://anonymous.4open.science/r/WTS-0F3C.
Chinese Translation
现实世界的数据通常呈现长尾分布,并伴有大量噪声标签,这显著降低了深度模型的性能。尽管先前的研究在解决这一综合挑战方面取得了一定进展,但忽视了高噪声环境中标签与图像之间严重的不匹配,从而限制了其有效性。考虑到观察到的标签虽然与图像不匹配,但仍保留类别信息,我们提出利用标签中的辅助文本信息来解决长尾噪声数据中的标签-图像不一致性。具体而言,我们利用预训练视觉-语言模型中的内在跨模态对齐来纠正标签-图像的不一致性。这种监督信号称为弱教师监督(Weak Teacher Supervision, WTS),不受标签噪声和数据分布偏差的影响,尽管其准确性有限。因此,WTS的激活是通过评估文本预测标签与观察标签之间的差异来决定的。大量实验表明,WTS在合成和真实世界数据集上的表现优于其他方法,特别是在高噪声条件下。源代码可在 https://anonymous.4open.science/r/WTS-0F3C 获取。
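One way to read the activation rule above is as a per-sample switch: the text-derived label replaces the observed label only when the two disagree. The sketch below is an interpretation, not the paper's rule; `select_target` and the `min_confidence` threshold are hypothetical, and `text_probs` stands in for zero-shot class probabilities from a vision-language model.

```python
def select_target(text_probs, observed_label, min_confidence=0.5):
    """Choose the training target for one sample.

    Returns (label, wts_active). The weak teacher is activated only when its
    predicted label differs from the observed one and it is reasonably confident
    (the confidence check is an assumption added for this sketch).
    """
    text_label = max(range(len(text_probs)), key=lambda c: text_probs[c])
    disagrees = text_label != observed_label
    confident = text_probs[text_label] >= min_confidence
    if disagrees and confident:
        return text_label, True
    return observed_label, False
```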
cs.CV / 59 / 2604.23137
CNN-ViT Fusion with Adaptive Attention Gate for Brain Tumor MRI Classification: A Hybrid Deep Learning Model
基于自适应注意力门的CNN-ViT融合用于脑肿瘤MRI分类:一种混合深度学习模型
Abstract
Early detection and classification of brain tumors from Magnetic Resonance Imaging (MRI) scans is highly important, but the relevant features are difficult to extract from medical images. Convolutional Neural Networks (CNNs) are good at capturing local texture and spatial information, whereas Vision Transformers (ViTs) are good at capturing long-range global dependencies. In this paper, we propose a new hybrid architecture that combines a SqueezeNet-style CNN branch with a MobileViT-style global transformer branch through an Adaptive Attention Gate mechanism. The gate dynamically learns per-sample, per-feature weights that scale the contribution of each branch, allowing context-sensitive merging of local and global representations. Trained and evaluated on the Brain Tumor MRI Dataset (Kaggle), the proposed model achieves a test accuracy of 97.60, a precision of 97.30, a recall of 97.50, an F1-score of 97.40, and a macro-average area under the curve (AUC) of 0.9946. These scores are higher than those of single CNN and ViT baselines and of current competitive fusion methods, showing that dynamic feature weighting is an effective approach to medical image classification.
Chinese Translation
使用磁共振成像(MRI)图像早期检测和分类脑肿瘤非常重要,但在医学图像中提取这些信息却很困难。卷积神经网络(CNN)擅长捕捉局部纹理和空间信息,而视觉变换器(ViT)则擅长捕捉长距离的全局依赖关系。本文提出了一种新的混合架构,通过自适应注意力门机制,将SqueezeNet风格的CNN分支与MobileViT风格的全局变换器分支相结合。该门机制动态学习每个样本、每个特征的权重,以加权每个分支的贡献,从而实现局部和全局表示的上下文敏感融合。所提出的模型在脑肿瘤MRI数据集(Kaggle)上的测试准确率为97.60,精确率为97.30,召回率为97.50,F1-score为97.40,宏平均曲线下面积(AUC)为0.9946。这些得分高于单一的CNN和ViT基线,以及当前竞争性的融合方法,表明动态特征加权是分类医学图像的有效方式。
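A per-sample, per-feature gate of the kind described above is commonly written as a sigmoid over the concatenated branch features, followed by a convex combination. The sketch below assumes exactly that form; the abstract does not give the gate's equations, so `gated_fusion`, `gate_weights`, and `gate_bias` are illustrative names rather than the paper's.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(local_feat, global_feat, gate_weights, gate_bias):
    """Fuse two equally sized feature vectors with a learned per-feature gate.

    gate_i  = sigmoid(W_i . [local; global] + b_i)
    fused_i = gate_i * local_i + (1 - gate_i) * global_i
    """
    concat = local_feat + global_feat  # list concatenation: [local; global]
    fused = []
    for i, (l, g) in enumerate(zip(local_feat, global_feat)):
        z = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        a = sigmoid(z)
        fused.append(a * l + (1.0 - a) * g)
    return fused
```

Because the gate depends on the input features, each image gets its own mixing weights: a gate near 1 favors the CNN branch for that feature, and a gate near 0 favors the transformer branch.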
cs.CV / 60 / 2604.23145
UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks
UpstreamQA:一个用于视频问答任务的显式推理模块化框架
Abstract
Video Question Answering (VideoQA) demands models that jointly reason over spatial, temporal, and linguistic cues. However, the task's inherent complexity often requires multi-step reasoning that current large multimodal models (LMMs) perform implicitly, leaving their internal decision process opaque. In contrast, large reasoning models (LRMs) explicitly generate intermediate logical steps that enhance interpretability and can improve multi-hop reasoning accuracy. Yet, these models are not designed for native video understanding, as they typically rely on static frame sampling. We propose UpstreamQA, a modular framework that disentangles and evaluates core video reasoning components through explicit upstream reasoning modules. Specifically, we employ multimodal LRMs to perform object identification and scene context generation before passing enriched reasoning traces to downstream LMMs for VideoQA. We evaluate UpstreamQA on the OpenEQA and NExTQA datasets using two LRMs (o4-mini, Gemini 2.5 Pro) and two LMMs (GPT-4o, Gemini 2.5 Flash). Our results demonstrate that introducing explicit reasoning can significantly boost performance and interpretability of downstream VideoQA, but can also lead to performance degradation when baseline performance is sufficiently high. Overall, UpstreamQA offers a principled framework for combining explicit reasoning and multimodal understanding, advancing both performance and diagnostic transparency in VideoQA in several scenarios.
Chinese Translation
视频问答(VideoQA)要求模型能够对空间、时间和语言线索进行联合推理。然而,该任务固有的复杂性通常需要多步推理,而当前的大型多模态模型(LMMs)往往隐式地执行这些推理,使其内部决策过程不透明。相反,大型推理模型(LRMs)显式生成中间逻辑步骤,增强了解释性,并可以提高多跳推理的准确性。然而,这些模型并未针对原生视频理解进行设计,因为它们通常依赖于静态帧采样。我们提出了UpstreamQA,一个模块化框架,通过显式的上游推理模块来解构和评估核心视频推理组件。具体而言,我们采用多模态LRMs在将丰富的推理轨迹传递给下游LMMs进行VideoQA之前,执行对象识别和场景上下文生成。我们在OpenEQA和NExTQA数据集上评估了UpstreamQA,使用了两个LRMs(o4-mini,Gemini 2.5 Pro)和两个LMMs(GPT-4o,Gemini 2.5 Flash)。我们的结果表明,引入显式推理可以显著提升下游VideoQA的性能和可解释性,但在基线性能足够高时也可能导致性能下降。总体而言,UpstreamQA提供了一个原则性框架,用于结合显式推理和多模态理解,在多个场景中推动VideoQA的性能和诊断透明度。
cs.CV / 61 / 2604.23165
BSViT: A Burst Spiking Vision Transformer for Expressive and Efficient Visual Representation Learning
BSViT:一种用于表现力和高效视觉表征学习的突发脉冲视觉变换器
Abstract
Spiking Vision Transformers (S-ViTs) offer a promising framework for energy-efficient visual learning. However, existing designs remain limited by two fundamental issues: the restricted information capacity of binary spike coding and the dense token interactions introduced by global self-attention. To address these challenges, this work proposes BSViT, a burst spiking-driven Vision Transformer featuring a Dual-Channel Burst Spiking Self-Attention (DBSSA) mechanism. DBSSA encodes queries with binary spikes and keys with burst spikes to enhance representational capacity. The value pathway adopts dual excitatory and inhibitory binary channels, enabling signed modulation and richer spike interactions. Importantly, the entire attention operation preserves addition-only computation, ensuring compatibility with energy-efficient neuromorphic hardware. To further reduce spike activity and incorporate spatial priors, a patch adjacency masking strategy is introduced to restrict attention to local neighborhoods, resulting in structure-aware sparsity and reduced computational overhead. In addition, burst spike coding is systematically integrated across the network to increase spike-level representational capacity beyond conventional binary spiking. Extensive experiments on both static and event-based vision benchmarks demonstrate that BSViT consistently outperforms existing spiking Transformers in accuracy while maintaining competitive energy efficiency.
Chinese Translation
脉冲视觉变换器(S-ViTs)为节能视觉学习提供了一个有前景的框架。然而,现有设计受到两个基本问题的限制:二进制脉冲编码的信息容量受限,以及全局自注意力引入的密集令牌交互。为了解决这些挑战,本文提出了BSViT,一种基于突发脉冲驱动的视觉变换器,具有双通道突发脉冲自注意力(DBSSA)机制。DBSSA使用二进制脉冲对查询进行编码,使用突发脉冲对键进行编码,以增强表征能力。值通路采用双重兴奋和抑制的二进制通道,实现了符号调制和更丰富的脉冲交互。重要的是,整个注意力操作保持加法计算,确保与节能神经形态硬件的兼容性。为了进一步减少脉冲活动并结合空间先验,提出了一种补丁邻接掩蔽策略,以限制注意力集中在局部邻域,从而实现结构感知的稀疏性和减少计算开销。此外,突发脉冲编码在整个网络中系统性地集成,以提高脉冲级别的表征能力,超越传统的二进制脉冲。对静态和基于事件的视觉基准进行的广泛实验表明,BSViT在准确性上始终优于现有的脉冲变换器,同时保持竞争力的能效。
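The patch adjacency masking idea can be illustrated by building a boolean attention mask over a grid of patches. This sketch assumes a square (Chebyshev-distance) neighborhood with a `radius` parameter; the abstract does not state the neighborhood shape or size, so both are assumptions.

```python
def patch_adjacency_mask(height, width, radius=1):
    """Build an (H*W) x (H*W) mask over patches laid out row-major on a grid.

    mask[i][j] is True when patch j lies within `radius` of patch i
    (Chebyshev distance), i.e. when query i is allowed to attend to key j.
    """
    n = height * width
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        ri, ci = divmod(i, width)
        for j in range(n):
            rj, cj = divmod(j, width)
            mask[i][j] = max(abs(ri - rj), abs(ci - cj)) <= radius
    return mask
```

On a 4 x 4 grid with `radius=1`, only 100 of the 256 query-key pairs remain active, which is where the reduction in spike activity and computation would come from.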
cs.CV / 62 / 2604.23167
A Topology fixated Shape Gradient Framework for Non Simple Boundary Extraction for CIE Lab color images with Repulsive Energy
一种基于拓扑固定形状梯度框架的CIE Lab颜色图像非简单边界提取方法,具有排斥能量
Abstract
A level-set-free hybrid image segmentation approach is considered, based on a modified version of the piecewise-constant shape gradient of a Mumford-Shah shape functional and a repulsive function. Segmentation is performed through the evolution of discrete curves driven by a non-local, shape-based energy, so that images containing disjoint regions and multiple boundaries can be segmented. The formulation adds a novel component: a multivariable function that depends on a few sampled points of the curves and handles self-intersections arising during the evolution of the boundary curves. The method is applied to several grayscale and color images, including images with nested structures and astronomical objects. The results indicate effective segmentation in complex scenarios, with absolute control over the topology of the segments and the self-intersections of the boundaries.
Chinese Translation
本文考虑了一种不基于水平集的混合图像分割方法,该方法基于Mumford Shah形状泛函的分段常数形状梯度的修改版本和一个排斥函数进行。分割通过非局部形状驱动的离散曲线演化进行,以分割包含不相交区域和多个边界的图像。该公式具有一个新颖的附加组件,即一个依赖于曲线少量采样点的多变量函数,处理边界曲线演化过程中自交的发生。该方法应用于几幅灰度图像和颜色图像,包括具有嵌套结构和天文物体的图像。结果表明,在复杂场景中实现了有效的分割,并对分段的拓扑和边界的自交进行了绝对控制。
cs.CV / 63 / 2604.23173
One Identity, Many Roles: Multimodal Entity Coreference for Enhanced Video Situation Recognition
一个身份,多重角色:用于增强视频情境识别的多模态实体共指
Abstract
Video Situation Recognition (VidSitu) addresses the challenging problem of "who did what to whom, with what, how, and where" in a video. It tests thorough video understanding by requiring identification of salient actions and associated short descriptions for event roles across multiple events. Grounding with VidSitu requires spatio-temporal localization of key entities across shots and varied appearances. We posit that coherent video understanding requires consistent identification of entities that play different roles. We propose Multimodal Entity Coreference (MEC) to unite entity descriptions in text with grounding across the video. Towards this, we introduce CineMEC, a multi-stage approach that unites event role mention groups with visual clusters of entities, without explicit grounding supervision during training. Our approach is designed to exploit the synergy between visual grounding and captioning, where improving one influences the other and vice versa. For evaluation, we extend the VidSitu dataset with grounding annotations. While previous work focuses primarily on descriptions, CineMEC improves consistency across both: captioning (+2.5% CIDEr, +7% LEA) and visual grounding (+18% HOTA).
Chinese Translation
视频情境识别(VidSitu)解决了在视频中“谁对谁做了什么,使用了什么,如何以及在哪里”的挑战性问题。它通过要求识别显著动作及其在多个事件中的相关简短描述来测试全面的视频理解。与VidSitu的结合需要在不同镜头和多样外观中对关键实体进行时空定位。我们认为,连贯的视频理解需要对扮演不同角色的实体进行一致的识别。为此,我们提出了多模态实体共指(MEC),以将文本中的实体描述与视频中的定位结合起来。为此,我们引入了CineMEC,这是一种多阶段方法,将事件角色提及组与实体的视觉聚类结合在一起,在训练过程中没有显式的定位监督。我们的方法旨在利用视觉定位与字幕生成之间的协同效应,改善其中一个会影响另一个,反之亦然。为了评估,我们扩展了VidSitu数据集,增加了定位注释。尽管之前的工作主要集中在描述上,CineMEC在两者之间提高了一致性:字幕生成(+2.5% CIDEr,+7% LEA)和视觉定位(+18% HOTA)。
cs.CV / 64 / 2604.23187
DyABD: The Abdominal Muscle Segmentation in Dynamic MRI Benchmark
DyABD:动态MRI中腹部肌肉分割基准
Abstract
This work introduces DyABD, a novel and complex benchmark dataset of dynamic abdominal MRIs from patients with abdominal hernias and associated high quality abdominal muscle annotations. DyABD is the first of its kind in four key ways: (1) it proposes the first abdominal muscle segmentation task, (2) the dynamic MRIs are acquired whilst the patients perform various exercises, introducing extreme anatomical variability, making it one of the most challenging segmentation datasets to date, (3) it includes both pre- and post-corrective MRIs, and (4) DyABD promotes clinical research into the high recurrence rates of abdominal hernias. Beyond dataset introduction, this work provides a comprehensive evaluation of the generalisation capabilities of existing segmentation models across Supervised, Few Shot and Zero Shot paradigms on the unseen DyABD dataset. This work reveals that there is still room for substantial improvement in the field of medical image segmentation, with the majority of techniques achieving a Dice Coefficient of 0.82. This work therefore sheds light on the true progress of the field and redefines the benchmark for progress in medical image segmentation.
Chinese Translation
本研究介绍了DyABD,一个新颖且复杂的动态腹部MRI基准数据集,来源于腹部疝气患者,并附有高质量的腹部肌肉标注。DyABD在四个关键方面是首创的;(1)提出了首个腹部肌肉分割任务,(2)动态MRI是在患者进行各种运动时获取的,带来了极大的解剖变异性,使其成为迄今为止最具挑战性的分割数据集之一,(3)包括了矫正前后的MRI,以及(4)DyABD促进了对腹部疝气高复发率的临床研究。除了数据集的介绍,本研究还全面评估了现有分割模型在监督学习、少样本学习和零样本学习范式下在未见的DyABD数据集上的泛化能力。研究表明,医学图像分割领域仍有显著改进的空间,大多数技术的Dice系数仅为0.82。因此,本研究揭示了该领域的真实进展,并重新定义了医学图像分割进展的基准。
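The result above is reported as a Dice Coefficient. For readers unfamiliar with the metric, the sketch below computes it for two binary masks given as flat lists of 0/1 values. The handling of two empty masks is an assumed convention, not something taken from the DyABD evaluation protocol.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |P intersect T| / (|P| + |T|) for flat binary masks (0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks (hypothetical data): one of the two predicted pixels is correct.
predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
score = dice_coefficient(predicted, reference)
```

The score ranges from 0 (no overlap) to 1 (identical masks), so 0.82 means substantial but imperfect overlap with the reference annotation.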
cs.CV / 65 / 2604.23195
AnalogRetriever: Learning Cross-Modal Representations for Analog Circuit Retrieval
AnalogRetriever:用于模拟电路检索的跨模态表示学习
Abstract
Analog circuit design relies heavily on reusing existing intellectual property (IP), yet searching across heterogeneous representations such as SPICE netlists, schematics, and functional descriptions remains challenging. Existing methods are largely limited to exact matching within a single modality, failing to capture cross-modal semantic relationships. To bridge this gap, we present AnalogRetriever, a unified tri-modal retrieval framework for analog circuit search. We first build a high-quality dataset on top of Masala-CHAI through a two-stage repair pipeline that raises the netlist compile rate from 22% to 100%. Built on this foundation, AnalogRetriever encodes schematics and descriptions with a vision-language model and netlists with a port-aware relational graph convolutional network, mapping all three modalities into a shared embedding space via curriculum contrastive learning. Experiments show that AnalogRetriever achieves an average Recall@1 of 75.2% across all six cross-modal retrieval directions, significantly outperforming existing baselines. When integrated into the AnalogCoder agentic framework as a retrieval-augmented generation module, it consistently improves functional pass rates and enables previously unsolved tasks to be completed. Our code and dataset will be released.
Chinese Translation
模拟电路设计在很大程度上依赖于重用现有的知识产权(IP),然而在SPICE网表、原理图和功能描述等异构表示之间进行搜索仍然具有挑战性。现有方法主要局限于单一模态内的精确匹配,未能捕捉跨模态的语义关系。为了解决这一问题,我们提出了AnalogRetriever,一个用于模拟电路搜索的统一三模态检索框架。我们首先在Masala-CHAI的基础上,通过两阶段修复流程构建了一个高质量的数据集,将网表编译率从22%提高到100%。在此基础上,AnalogRetriever使用视觉-语言模型对原理图和描述进行编码,并使用端口感知关系图卷积网络对网表进行编码,通过课程对比学习将所有三种模态映射到共享的嵌入空间。实验表明,AnalogRetriever在所有六个跨模态检索方向上平均达到75.2%的Recall@1,显著优于现有基准。当作为检索增强生成模块集成到AnalogCoder智能框架中时,它持续提高功能通过率,并使得以前未解决的任务得以完成。我们的代码和数据集将会发布。
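The headline number is an average Recall@1 over six cross-modal retrieval directions, i.e. every ordered pair of the three modalities. The sketch below shows how such an average could be computed from embeddings in a shared space, assuming cosine similarity and index-aligned items. The two-dimensional embeddings are toy values invented for illustration and have nothing to do with the paper's data.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 0.0 if norm == 0 else sum(a * b for a, b in zip(u, v)) / norm


def recall_at_1(queries, gallery):
    """Fraction of queries whose top-ranked gallery item has the same index."""
    hits = 0
    for i, q in enumerate(queries):
        best = max(range(len(gallery)), key=lambda j: cosine(q, gallery[j]))
        hits += best == i
    return hits / len(queries)


# Toy embeddings (hypothetical): item i of each modality is the same circuit.
embeddings = {
    "netlist": [[1.0, 0.0], [0.0, 1.0]],
    "schematic": [[0.9, 0.1], [0.2, 0.8]],
    "description": [[0.8, 0.2], [0.1, 0.9]],
}

# Three modalities give 3 * 2 = 6 ordered (query, gallery) directions.
directions = list(permutations(embeddings, 2))
average_recall = sum(
    recall_at_1(embeddings[src], embeddings[dst]) for src, dst in directions
) / len(directions)
```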
cs.CV / 66 / 2604.23247
Micro-Expression-Aware Avatar Fingerprinting via Inter-Frame Feature Differencing
基于微表情感知的虚拟形象指纹识别:帧间特征差异化方法
Abstract
Avatar fingerprinting, i.e., verifying who drives a synthetic talking-head video rather than whether it is real, is a critical safeguard for authorized use of face-reenactment technology. Existing methods rely on a fixed, non-differentiable landmark extraction stage that prevents the fingerprinting model from being optimized end-to-end from raw pixels. We propose a preprocessing-free system built on a micro-expression-aware backbone operating on raw video frames, with inter-frame feature differencing as the core design principle: consecutive feature maps are subtracted in the learned deep feature space, so that temporally stable appearance dimensions contribute zero to the output while driver-specific motion dynamics are preserved. A controlled ablation on NVFAIR confirms that temporal motion accounts for the large majority of discriminative performance, and that raw appearance features actively degrade identity separation. Both the choice of backbone and the differencing principle are essential: differencing alone is insufficient when applied to a generic encoder, as appearance-dominated features collapse to near-identical representations across adjacent frames, while the micro-expression-aware F5C backbone retains measurable motion variation that the differencing operation can exploit. Without any external preprocessing, our model achieves an overall AUC of 0.877 on NVFAIR and matches or exceeds the landmark-based baseline on the majority of cross-generator pairs.
Chinese Translation
虚拟形象指纹识别,即验证谁在操控合成的对话视频,而非其真实性,是面部重现技术授权使用的重要保障。现有方法依赖于固定的、不可微分的特征点提取阶段,这阻碍了指纹识别模型从原始像素进行端到端的优化。我们提出了一种无预处理的系统,该系统基于微表情感知的骨干网络,在原始视频帧上运行,以帧间特征差异化作为核心设计原则:在学习到的深度特征空间中,相邻的特征图被相减,从而使时间上稳定的外观维度对输出贡献为零,而驱动者特定的运动动态得以保留。对NVFAIR的控制消融实验确认,时间运动占据了大多数判别性能,并且原始外观特征会主动降低身份分离能力。骨干网络的选择和差异化原则都是至关重要的:仅仅应用差异化于通用编码器是不够的,因为以外观为主导的特征在相邻帧之间会崩溃为近乎相同的表示,而微表情感知的F5C骨干网络则保留了可测量的运动变化,差异化操作可以利用这些变化。在没有任何外部预处理的情况下,我们的模型在NVFAIR上实现了0.877的总体AUC,并且在大多数跨生成器对中与基于特征点的基线相匹配或超过。
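The core operation described above, subtracting consecutive feature maps so that temporally stable dimensions cancel, can be illustrated on plain vectors. This is a sketch of the differencing step only, with made-up feature values; the F5C backbone and everything downstream of the differences are not modelled here.

```python
def frame_differences(features):
    """Subtract each frame's feature vector from the next frame's vector.

    features: list of per-frame feature vectors (lists of floats).
    Returns len(features) - 1 difference vectors.
    """
    return [
        [curr_v - prev_v for prev_v, curr_v in zip(prev, curr)]
        for prev, curr in zip(features, features[1:])
    ]


# Toy features (hypothetical): dimension 0 is a static "appearance" value,
# dimension 1 changes from frame to frame and stands in for motion.
features = [
    [5.0, 0.0],
    [5.0, 0.5],
    [5.0, 0.25],
]
differences = frame_differences(features)
```

In this toy case the constant first dimension contributes exactly zero to every difference, while the varying second dimension survives, which is the property the abstract attributes to inter-frame differencing.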
cs.CV / 67 / 2604.23264
MotionHiFlow: Text-to-motion via hierarchical flow matching
MotionHiFlow:通过层次流匹配实现文本到运动的转换
Abstract
Text-to-motion generation aims to generate 3D human motions that are tightly aligned with the input text while remaining physically plausible and rich in fine-grained detail. Although recent approaches can produce complex and natural movements, they usually operate at only one temporal scale, which limits both semantic alignment and temporal coherence. Inspired by the fact that complex motions are conceptualized hierarchically rather than at a single temporal scale in the human cognitive system, we propose MotionHiFlow, a hierarchical flow matching framework that generates motion progressively by constructing a flow path from low to high temporal scales. The flows at lower scales capture high-level semantics and coarse motion structures, while flows at higher scales refine temporal details. To link the flows across scales, we introduce a novel cross-scale transition process, ensuring continuity and preserving noise consistency. Furthermore, by integrating a Text-Motion Diffusion Transformer and a topology-aware Motion VAE, MotionHiFlow explicitly models structural dependencies among joints via joint-aware positional encoding and skeletal topology, enabling precise semantic alignment alongside fine-grained motion details. Extensive experiments on HumanML3D and KIT-ML benchmarks demonstrate state-of-the-art performance, with ablation studies confirming the effectiveness of the hierarchical design and key components. Code is available at https://github.com/ai-lh/MotionHiFlow.
Chinese Translation
文本到运动生成旨在生成与输入文本紧密对齐的3D人类运动,同时保持物理上的合理性和丰富的细节。尽管最近的方法能够产生复杂且自然的运动,但它们通常仅在一个时间尺度上操作,这限制了语义对齐和时间一致性。受到人类认知系统中复杂运动是以层次方式而非单一时间尺度进行概念化的启发,我们提出了MotionHiFlow,一个层次流匹配框架,通过从低到高的时间尺度构建流路径逐步生成运动。低尺度的流捕捉高层语义和粗略运动结构,而高尺度的流则细化时间细节。为了连接不同尺度的流,我们引入了一种新颖的跨尺度过渡过程,确保连续性并保持噪声一致性。此外,通过集成文本-运动扩散变换器(Text-Motion Diffusion Transformer)和拓扑感知运动变分自编码器(topology-aware Motion VAE),MotionHiFlow显式建模关节之间的结构依赖关系,通过关节感知位置编码和骨架拓扑,实现精确的语义对齐和细粒度的运动细节。在HumanML3D和KIT-ML基准上的广泛实验表明了其最先进的性能,消融研究证实了层次设计和关键组件的有效性。代码可在 https://github.com/ai-lh/MotionHiFlow 获取。
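For orientation, the sketch below shows the basic ingredients of flow matching at a single scale: a straight-line path between a noise sample and a data sample, the constant velocity that a model would regress toward along that path, and Euler integration of a velocity field. The linear path is a common choice in the flow matching literature and is an assumption here; the abstract does not state which path MotionHiFlow uses, and the hierarchical and cross-scale parts are not modelled.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (t = 0) to x1 (t = 1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def euler_integrate(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy endpoints (hypothetical): a "noise" sample and a "motion" sample.
noise, motion = [0.0, 0.0], [1.0, 2.0]


def oracle_velocity(x, t):
    """Stand-in for a trained network: returns the exact target velocity."""
    return target_velocity(noise, motion)


generated = euler_integrate(noise, oracle_velocity, steps=4)
```

With the exact velocity plugged in, integration carries the noise sample onto the data sample; a trained model only approximates this field, so real samplers trade step count against accuracy.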
cs.CV / 68 / 2604.23268
LatentBurst: A Fast and Efficient Multi Frame Super-Resolution for Hexadeca-Bayer Pattern CIS images
LatentBurst:一种快速高效的多帧超分辨率方法用于十六元Bayer模式的接触图像传感器图像
Abstract
This paper introduces a novel multi-frame super-resolution (MFSR) network for burst hexadeca-Bayer pattern Contact Image Sensor (CIS) images, covering demosaicing, denoising, multi-frame fusion, and super-resolution. Designing a high-quality reconstruction network poses several challenges: 1) unlike the Bayer color filter array (CFA) pattern, the hexadeca-Bayer pattern is hard to interpolate because the pixel distance between same-color groups increases; 2) due to large object motion and camera movement, the final fusion result usually suffers from misalignment, resulting in blurry images or ghosting artifacts; 3) the network must be fast and efficient enough to operate in real time on mobile devices. To overcome these challenges, we propose a novel network, called LatentBurst, which contains: 1) a pyramid alignment and fusion approach in latent feature space to deal with large-motion scenarios; 2) an efficient UNet-based structure that can run efficiently on mobile devices; 3) fine-tuned optical flow estimation and two-step knowledge distillation to reduce the domain gap more effectively. Experimental results in various scenarios demonstrate the effectiveness of our proposed method compared with other state-of-the-art methods.
Chinese Translation
本文介绍了一种新颖的多帧超分辨率网络(MFSR),用于突发的十六元Bayer模式接触图像传感器(CIS)图像,涵盖了去马赛克、去噪、多帧融合和超分辨率等步骤。设计高质量重建网络面临以下几个挑战:1)与Bayer颜色滤波器阵列(CFA)模式不同,由于相同颜色组之间的像素距离增加,插值十六元Bayer模式变得困难;2)由于大物体运动和相机移动,最终的融合结果通常会受到错位的影响,导致模糊图像或重影伪影;3)所提出的网络应足够快速高效,以便在移动设备上实时运行。为了解决这些挑战,我们提出了一种新颖的网络,称为LatentBurst,包含:1)一种金字塔对齐与融合方法,以处理大运动场景;2)一种高效的基于UNet的结构,能够在移动设备上高效运行;3)经过精细调整的光流估计和两步知识蒸馏,以更有效地减少领域间隙。各种场景下的实验结果表明,与其他最先进的方法相比,我们提出的方法具有良好的有效性。
cs.CV / 69 / 2604.23271
A Hierarchical Ensemble Inference Pipeline for Robust White Blood Cell Classification Under Domain Shifts
针对领域转移的鲁棒性白血球分类的层次集成推理管道
Abstract
Automated white blood cell (WBC) classification is essential for scalable leukaemia screening. However, real-world deployment is challenged by domain shifts caused by staining protocols, scanner characteristics, and inter-laboratory variability, which often degrade model performance. The White Blood Cell Classification Challenge (WBCBench) at ISBI 2026 aims to advance robust WBC recognition, with a focus on accurately identifying blast cells and other clinically critical rare subtypes. We propose a memory-augmented, hierarchical ensemble pipeline for WBC classification under domain shifts, leveraging a feature bank and a DinoBloom backbone fine-tuned with LoRA. Our three-stage inference hierarchy combines k-nearest neighbors (kNN) retrieval at each level, reducing over-reliance on any single decision. Evaluated on the WBCBench dataset, our method ranks within the top ten by macro F1-score in the final testing phase.
Chinese Translation
自动化白血球(WBC)分类对于可扩展的白血病筛查至关重要。然而,实际应用面临着由于染色协议、扫描仪特性和实验室间变异性引起的领域转移挑战,这往往会降低模型性能。2026年ISBI的白血球分类挑战(WBCBench)旨在推动鲁棒性白血球识别,重点是准确识别爆发细胞和其他临床关键的稀有亚型。我们提出了一种增强记忆的层次集成管道,用于在领域转移下进行白血球分类,利用特征库和经过LoRA微调的DinoBloom骨干网络。我们的三阶段推理层次结构在每个层级结合了k近邻(kNN)检索,减少对任何单一决策的过度依赖。在WBCBench数据集上的评估中,我们的方法在最终测试阶段的宏观F1分数中排名前十。
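The abstract describes kNN retrieval from a feature bank applied at each level of an inference hierarchy. The sketch below illustrates that idea with a pure kNN vote that is repeated inside the branch chosen at the previous level. The two-level label paths, the 2-D features, and the value of k are all invented for illustration; the actual three stages and how retrieval is combined with the fine-tuned backbone are not specified in the abstract.

```python
import math
from collections import Counter


def knn_vote(query, bank, k, level):
    """Majority label at `level` among the k bank entries nearest to query.

    bank: list of (feature, label_path) pairs, label_path being a tuple
    that goes from coarse to fine.
    """
    nearest = sorted(bank, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(path[level] for _, path in nearest)
    return votes.most_common(1)[0][0]


def hierarchical_predict(query, bank, k=3):
    """Walk the label hierarchy, re-running kNN inside the chosen branch."""
    depth = len(bank[0][1])
    candidates, decision = bank, []
    for level in range(depth):
        choice = knn_vote(query, candidates, k, level)
        decision.append(choice)
        # Keep only bank entries that agree with the decision so far.
        candidates = [item for item in candidates if item[1][level] == choice]
    return tuple(decision)


# Toy feature bank (hypothetical features and label hierarchy).
bank = [
    ((0.0, 0.0), ("mature", "lymphocyte")),
    ((0.1, 0.0), ("mature", "lymphocyte")),
    ((0.0, 1.0), ("mature", "neutrophil")),
    ((0.1, 1.1), ("mature", "neutrophil")),
    ((5.0, 5.0), ("immature", "blast")),
    ((5.1, 5.0), ("immature", "blast")),
]
```

Restricting the bank after each decision means a fine-grained vote is never diluted by neighbours from a branch that was already ruled out.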
cs.CV / 70 / 2604.23274
SemiGDA: Generative Dual-distribution Alignment for Semi-Supervised Medical Image Segmentation
SemiGDA:用于半监督医学图像分割的生成对分布对齐
Abstract
Semi-supervised learning addresses label scarcity and high annotation costs in medical image segmentation by exploiting the latent information in unlabeled data to enhance model performance. Traditional discriminative segmentation relies on segmentation masks, neglecting feature-level distribution constraints. This limits robust semantic representation learning and adaptive modeling of unlabeled data in scenarios with few labels. To address these limitations, we propose SemiGDA, a novel Generative Dual-distribution Alignment framework for semi-supervised medical image segmentation. Our SemiGDA overcomes the reliance of discriminative methods on large labeled datasets by aligning feature and semantic distributions to boost semantic learning and scene adaptability. Specifically, we propose a Dual-distribution Alignment Module (DAM), which employs two structurally distinct encoders to model image and mask feature distributions. It enforces their alignment in the latent space via distributional constraints, establishing structured feature consistency. Moreover, we design a Consistency-Driven Skip Adapter (CDSA) strategy, which introduces dual skip adapters (Image and Mask) to fuse multi-scale features via skip connections. Using a consistency loss, CDSA enhances cross-branch semantic alignment and reinforces fine-grained semantic consistency. Experimental results on diverse medical datasets show that our method outperforms other state-of-the-art semi-supervised segmentation methods. Code is released at: https://github.com/taozh2017/SemiGDA.
Chinese Translation
半监督学习通过利用未标记数据中的潜在信息来提高模型性能,从而解决医学图像分割中的标签稀缺和高标注成本问题。传统的判别式分割依赖于分割掩膜,忽视了特征级分布约束。这限制了在标签稀少的场景中对未标记数据的鲁棒语义表示学习和自适应建模。为了解决这些局限性,我们提出了SemiGDA,一种用于半监督医学图像分割的新颖生成对分布对齐框架。我们的SemiGDA通过对齐特征和语义分布,克服了判别式方法对大型标记数据集的依赖,从而提升了语义学习和场景适应性。具体而言,我们提出了一个双分布对齐模块(Dual-distribution Alignment Module, DAM),该模块采用两个结构上不同的编码器来建模图像和掩膜特征分布。它通过分布约束在潜在空间中强制它们的对齐,从而建立结构化的特征一致性。此外,我们设计了一种一致性驱动跳接适配器(Consistency-Driven Skip Adapter, CDSA)策略,该策略引入了双跳接适配器(图像和掩膜)通过跳接连接融合多尺度特征。利用一致性损失,CDSA增强了跨分支的语义对齐,并强化了细粒度的语义一致性。在多样化的医学数据集上的实验结果表明,我们的方法优于其他最先进的半监督分割方法。代码已发布于:https://github.com/taozh2017/SemiGDA。
cs.CV / 71 / 2604.23276
Lightweight and Production-Ready PDF Visual Element Parsing
轻量级且适用于生产的PDF视觉元素解析
Abstract
PDF documents contain critical visual elements such as figures, tables, and forms whose accurate extraction is essential for document understanding and multimodal retrieval-augmented generation (RAG). Existing PDF parsers often miss complex visuals, extract non-informative artifacts (e.g., watermarks, logos), produce fragmented elements, and fail to reliably associate captions with their corresponding elements, which degrades downstream retrieval and question answering. We present a lightweight, production-level PDF parsing framework that accurately detects visual elements and associates captions with them using a combination of spatial heuristics, layout analysis, and semantic similarity. On popular benchmark datasets and internal product data, the proposed solution achieves ≥96% visual element detection accuracy and 93% caption association accuracy. When used as a preprocessing step for multimodal RAG, it significantly outperforms state-of-the-art parsers and large vision-language models on both internal data and the MMDocRAG benchmark, while reducing latency by over 2×. We have deployed the proposed system in a challenging production environment.
Chinese Translation
PDF文档包含关键的视觉元素,如图形、表格和表单,其准确提取对于文档理解和多模态检索增强生成(RAG)至关重要。现有的PDF解析器往往无法捕捉复杂的视觉内容,提取出非信息性的伪影(例如水印、徽标),生成碎片化的元素,并且无法可靠地将标题与其对应的元素关联,这降低了后续的检索和问答效果。我们提出了一种轻量级且适用于生产的PDF解析框架,该框架能够准确检测视觉元素,并通过空间启发式、布局分析和语义相似性相结合的方式来关联标题。在流行的基准数据集和内部产品数据上,所提出的解决方案实现了≥96%的视觉元素检测准确率和93%的标题关联准确率。当作为多模态RAG的预处理步骤时,它在内部数据和MMDocRAG基准测试中显著超越了最先进的解析器和大型视觉-语言模型,同时将延迟减少了超过2倍。我们已在具有挑战性的生产环境中部署了该系统。
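One of the ingredients named above is a spatial heuristic for caption association. A minimal sketch of such a heuristic follows: each visual element is paired with the caption that overlaps it horizontally and has the smallest vertical gap, subject to a maximum gap. The box format, the threshold value, and the page layout are assumptions for illustration; the layout analysis and semantic similarity components of the described system are not covered.

```python
def horizontal_overlap(a, b):
    """Overlap width of two boxes given as (x0, y0, x1, y1), y growing down."""
    return max(0.0, min(a[2], b[2]) - max(a[0], b[0]))


def associate_captions(elements, captions, max_gap=40.0):
    """Map each element name to its nearest caption text, or None.

    max_gap is a hypothetical threshold in page units.
    """
    pairs = {}
    for name, box in elements.items():
        best, best_gap = None, max_gap
        for text, cbox in captions.items():
            if horizontal_overlap(box, cbox) <= 0:
                continue  # caption is not in the same column
            # Gap if the caption sits below the element, or above it.
            gap = min(abs(cbox[1] - box[3]), abs(box[1] - cbox[3]))
            if gap <= best_gap:
                best, best_gap = text, gap
        pairs[name] = best
    return pairs


# Toy page (hypothetical): a figure captioned below, a table captioned above.
elements = {"figure": (50, 100, 300, 300), "table": (50, 400, 300, 500)}
captions = {
    "Figure 1: Overview": (50, 310, 300, 325),
    "Table 1: Results": (50, 380, 300, 395),
}
pairs = associate_captions(elements, captions)
```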
cs.CV / 72 / 2604.23282
Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search
弥合姿态-语义差距:基于文本的人物异常搜索级联框架
Abstract
Text-based person anomaly search retrieves specific behavioral events from surveillance archives using natural-language queries. Although recent pose-aware methods align geometric structures well, they face a fundamental Pose-Semantic Gap: semantically different actions can share similar skeletal geometries. While Multimodal Large Language Models (MLLMs) can reduce this ambiguity, using them for large-scale retrieval is computationally prohibitive. We propose the Structure-Semantic Decoupled Cascade (SSDC) framework, which decouples retrieval into two stages: (1) Structure-Aware Coarse Retrieval, where a lightweight model quickly filters candidates by skeletal similarity; and (2) Detective Squad Interaction, a multi-agent semantic verification module. The squad consists of a Detective for fast binary filtering, an Analyst for evidence extraction, and a Writer for semantic synthesis. Finally, we re-rank candidates by fusing the synthesized captions with structural priors. Experiments on the PAB benchmark show that SSDC achieves state-of-the-art performance by balancing efficiency and semantic reasoning.
Chinese Translation
基于文本的人物异常搜索通过自然语言查询从监控档案中检索特定的行为事件。尽管最近的姿态感知方法在几何结构上对齐良好,但它们面临一个根本的姿态-语义差距:语义上不同的动作可能共享相似的骨骼几何形状。虽然多模态大型语言模型(Multimodal Large Language Models, MLLMs)可以减少这种模糊性,但在大规模检索中使用它们在计算上是不可行的。我们提出了结构-语义解耦级联(Structure-Semantic Decoupled Cascade, SSDC)框架,该框架将检索解耦为两个阶段:(1)结构感知粗检索,其中轻量级模型通过骨骼相似性快速过滤候选项;(2)侦探小组互动,一个多代理语义验证模块。该小组由一个侦探负责快速二元过滤,一个分析师负责证据提取,以及一个撰写者负责语义综合。最后,我们通过将合成的标题与结构先验融合来重新排序候选项。在PAB基准上的实验表明,SSDC通过平衡效率和语义推理实现了最先进的性能。
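The two-stage design above follows a general coarse-to-fine pattern: a cheap scorer shortlists candidates, and an expensive scorer is run only on the shortlist before the two scores are fused for re-ranking. The sketch below shows that pattern in isolation. The scoring functions, the fusion weight alpha, and the gallery values are placeholders; the Detective, Analyst, and Writer agents are not modelled.

```python
def cascade_search(query, gallery, coarse_score, fine_score, top_k, alpha=0.5):
    """Shortlist by the cheap score, then re-rank by a fused score.

    alpha is a hypothetical fusion weight between the two scores.
    """
    shortlist = sorted(
        gallery, key=lambda item: coarse_score(query, item), reverse=True
    )[:top_k]

    def fused(item):
        return (alpha * coarse_score(query, item)
                + (1 - alpha) * fine_score(query, item))

    return sorted(shortlist, key=fused, reverse=True)


# Toy gallery (hypothetical): precomputed pose and semantic match scores.
gallery = [
    {"id": "a", "pose": 0.9, "semantic": 0.1},
    {"id": "b", "pose": 0.8, "semantic": 0.9},
    {"id": "c", "pose": 0.2, "semantic": 1.0},
]
expensive_calls = []


def pose_score(query, item):
    """Cheap structural score."""
    return item["pose"]


def semantic_score(query, item):
    """Stand-in for an expensive verifier; records each call."""
    expensive_calls.append(item["id"])
    return item["semantic"]


ranked = cascade_search("person falling", gallery, pose_score, semantic_score, top_k=2)
ranked_ids = [item["id"] for item in ranked]
```

Item "c" has the best semantic score but never reaches the expensive stage, which shows both the efficiency gain and the risk of a cascade: the coarse stage bounds what the fine stage can recover.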
cs.CV / 73 / 2604.23289
MetaErr: Towards Predicting Error Patterns in Deep Neural Networks
MetaErr:预测深度神经网络中的错误模式
Abstract
Due to the unprecedented success of deep learning, it has become an integral component in several multimedia computing applications in today's world. Unfortunately, deep learning systems are not perfect and can fail, sometimes abruptly, without prior warning or explanation. While reducing the error rate of deep neural networks has been the primary focus of the multimedia community, the problem of predicting when a deep learning system is going to fail has received significantly less research attention. In this paper, we propose a simple yet effective framework, MetaErr, to address this under-explored problem in deep learning research. We train a meta-model whose goal is to predict whether a base deep neural network will succeed or fail in predicting a particular data sample, by observing the base model's performance on a given learning task. The meta-model is completely agnostic of the architecture and training parameters of the base model. Such an error prediction system can be immensely useful in a variety of smart multimedia applications. Our empirical studies corroborate the promise and potential of our framework against competing baselines. We further demonstrate the usefulness of our framework to improve the performance of pseudo-labeling-based semi-supervised learning, and show that MetaErr outperforms several strong baselines on three benchmark computer vision datasets.
Chinese Translation
由于深度学习的前所未有的成功,它已成为当今多个多媒体计算应用中的重要组成部分。不幸的是,深度学习系统并不完美,有时会突然出现故障,而没有任何事先警告或解释。尽管降低深度神经网络的错误率一直是多媒体社区的主要关注点,但预测深度学习系统何时会失败的问题却受到的研究关注显著较少。本文提出了一个简单而有效的框架MetaErr,以解决深度学习研究中这一未被充分探索的问题。我们训练了一个元模型,其目标是通过观察基础模型在特定学习任务上的表现,预测基础深度神经网络在预测特定数据样本时是成功还是失败。该元模型对基础模型的架构和训练参数完全不敏感。这种错误预测系统在各种智能多媒体应用中可能非常有用。我们的实证研究证实了我们框架相对于竞争基线的前景和潜力。我们进一步展示了我们的框架在改进基于伪标签的半监督学习性能方面的有效性,并表明MetaErr在三个基准计算机视觉数据集上优于多个强基线。
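Training a meta-model to predict success or failure requires per-sample targets derived from the base model's observed behaviour. The sketch below builds such targets while treating the base model as a black box, consistent with the architecture-agnostic setup described above. What inputs the meta-model actually consumes is not stated in the abstract; the threshold classifier and the data are toy assumptions.

```python
def build_meta_dataset(samples, labels, base_predict):
    """Pair each sample with 1 if the base model is correct on it, else 0.

    base_predict is any callable; no access to weights or architecture.
    """
    return [(x, int(base_predict(x) == y)) for x, y in zip(samples, labels)]


def base_model(x):
    """Toy base model (hypothetical): a fixed threshold classifier."""
    return int(x > 0.5)


samples = [0.1, 0.4, 0.6, 0.9]
labels = [0, 1, 1, 1]
meta_dataset = build_meta_dataset(samples, labels, base_model)
failure_rate = 1 - sum(target for _, target in meta_dataset) / len(meta_dataset)
```

The resulting pairs form an ordinary binary classification dataset, so any classifier can serve as the meta-model.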
cs.CV / 74 / 2604.23309
STAND: Semantic Anchoring Constraint with Dual-Granularity Disambiguation for Remote Sensing Image Change Captioning
STAND:具有双粒度消歧的语义锚定约束用于遥感图像变化描述
Abstract
Remote sensing image change captioning (RSICC) aims to describe the difference between two remote sensing images. While recent methods have explored video modeling, they largely overlook the inherent ambiguities in viewpoint, scale, and prior knowledge, lacking effective constraints on the encoder. In this paper, we present STAND, a Semantic Anchoring Constraint with Dual-Granularity Disambiguation for RSICC, to progressively resolve these ambiguities. Specifically, to establish a reliable feature foundation, we first introduce an interpretable constraint to regularize temporal representations. Operating on these purified features, a dual-granularity disambiguation module resolves spatial uncertainties by coupling macro-level global context aggregation for viewpoint confusion with micro-level frequency-refocused attention for small-object scale enhancement. Ultimately, to translate these visually disambiguated features into precise text, a semantic concept anchoring module leverages language categorical priors to tackle knowledge ambiguity during decoding. Extensive experiments verify the superiority of STAND and its effectiveness in addressing ambiguities.
Chinese Translation
遥感图像变化描述(RSICC)旨在描述两幅遥感图像之间的差异。尽管近期的方法探讨了视频建模,但它们在很大程度上忽视了视角、尺度和先验知识中的固有模糊性,缺乏对编码器的有效约束。本文提出了STAND,一种用于RSICC的具有双粒度消歧的语义锚定约束,旨在逐步解决这些模糊性。具体而言,为了建立可靠的特征基础,我们首先引入了一种可解释的约束来规范时间表示。在这些纯化特征上,双粒度消歧模块通过结合宏观层面的全局上下文聚合(以解决视角混淆)和微观层面的频率重聚焦注意力(以增强小物体尺度)来解决空间不确定性。最终,为了将这些视觉上消歧的特征转化为精确文本,语义概念锚定模块利用语言类别先验来解决解码过程中的知识模糊性。大量实验验证了STAND的优越性及其在解决模糊性方面的有效性。
cs.CV / 75 / 2604.23314
Learning from Noisy Prompts: Saliency-Guided Prompt Distillation for Robust Segmentation with SAM
从嘈杂提示中学习:基于显著性引导的提示蒸馏用于与SAM的稳健分割
Abstract
Segmentation is central to clinical diagnosis and monitoring, yet the reliability of modern foundation models in medical imaging still depends on the availability of precise prompts. The Segment Anything Model (SAM) offers powerful zero-shot capabilities, although it collapses under the weak, generic, and noisy prompts that dominate real clinical workflows. In practice, annotations such as centerline points are coarse and ambiguous, often drifting across neighboring anatomy and misguiding SAM toward inconsistent or incomplete masks. We introduce SPD, a Saliency-Guided Prompt Distillation framework that converts these unreliable cues into robust guidance. SPD first learns data-driven anatomical priors through a lightweight saliency head to obtain confident localization maps. These priors then drive Contextual Prompt Distillation, which validates and enriches noisy prompts using cues from anatomically adjacent slices, producing a consensus prompt set that matches the behavior of expert reasoning. A Pairwise Slice Consistency objective further enforces local anatomical coherence during segmentation. Experiments on four challenging MRI and CT benchmarks demonstrate that SPD consistently outperforms existing SAM adaptations and supervised baselines, delivering large gains in both region-based and boundary-based metrics. SPD provides a practical and principled path toward reliable foundation model deployment in clinical environments where only imperfect prompts are available.
Chinese Translation
分割在临床诊断和监测中至关重要,但现代基础模型在医学成像中的可靠性仍然依赖于精确提示的可用性。Segment Anything Model (SAM) 提供了强大的零样本能力,尽管在主导真实临床工作流程的弱、通用和嘈杂提示下,它的性能会崩溃。在实际应用中,诸如中心线点的标注通常粗糙且模糊,常常在相邻解剖结构之间漂移,误导SAM生成不一致或不完整的掩膜。我们提出了SPD(显著性引导的提示蒸馏)框架,将这些不可靠的线索转化为稳健的指导。SPD首先通过轻量级显著性头学习数据驱动的解剖先验,以获得自信的定位图。这些先验随后驱动上下文提示蒸馏,利用来自解剖相邻切片的线索验证和丰富嘈杂提示,生成与专家推理行为相匹配的一致提示集。成对切片一致性目标进一步强化了分割过程中的局部解剖一致性。在四个具有挑战性的MRI和CT基准测试上的实验表明,SPD始终优于现有的SAM适应和监督基线,在基于区域和基于边界的指标上均取得了显著提升。SPD为在仅有不完美提示的临床环境中可靠基础模型的部署提供了一条实用且有原则的路径。
cs.CV / 76 / 2604.23320
KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition
KAConvNet:用于视觉识别的Kolmogorov-Arnold卷积网络
Abstract
Convolutional Neural Networks (CNNs) have been the dominant and effective approach for general computer vision tasks. Recently, Kolmogorov-Arnold neural networks (KANs), based on the Kolmogorov-Arnold representation theorem, have shown potential to replace Multi-Layer Perceptrons (MLPs) in deep learning. KANs, which use learnable nonlinear activations on edges and simple summation on nodes, offer fewer parameters and greater explainability compared to MLPs. However, there has been limited exploration of integrating the Kolmogorov-Arnold representation theorem with convolutional methods for computer vision tasks. Existing attempts have merely replaced learnable activation functions with weights, undermining KANs' theoretical foundation and limiting their potential effectiveness. Additionally, the B-spline curves used in KANs suffer from computational inefficiency and a tendency to overfit. In this paper, we propose a novel Kolmogorov-Arnold Convolutional Layer that deeply integrates the Kolmogorov-Arnold representation theorem with convolution. This layer provides stronger interpretability because it is based on established mathematical theorems and its design has theoretical alignment. Building on the Kolmogorov-Arnold Convolutional Layer, we design an efficient network architecture called KAConvNet, which outperforms existing methods combining KAN and convolution, and achieves competitive performance compared to mainstream ViTs and CNNs. We believe that our work offers valuable insight into the field of artificial intelligence and will inspire the development of more innovative CNNs in the 2020s. The code is publicly available at https://github.com/UnicomAI/KAConvNet.
Chinese Translation
卷积神经网络(CNNs)一直是通用计算机视觉任务的主导和有效方法。最近,基于Kolmogorov-Arnold表示定理的Kolmogorov-Arnold神经网络(KANs)显示出在深度学习中替代多层感知器(MLPs)的潜力。KANs在边缘上使用可学习的非线性激活,在节点上进行简单的求和,相较于MLPs,提供了更少的参数和更高的可解释性。然而,将Kolmogorov-Arnold表示定理与卷积方法结合以解决计算机视觉任务的探索仍然有限。现有的尝试仅仅是用权重替代可学习的激活函数,削弱了KANs的理论基础,限制了其潜在的有效性。此外,KANs中使用的B样条曲线在计算效率和过拟合倾向方面存在问题。在本文中,我们提出了一种新颖的Kolmogorov-Arnold卷积层,深度整合了Kolmogorov-Arnold表示定理与卷积。这一层提供了更强的方式可解释性,因为它基于已建立的数学定理,其设计具有理论一致性。在Kolmogorov-Arnold卷积层的基础上,我们设计了一种高效的网络架构,称为KAConvNet,它在结合KAN和卷积的现有方法中表现优越,并在与主流视觉变换器(ViTs)和卷积神经网络(CNNs)相比时,达到了具有竞争力的性能。我们相信,我们的工作为人工智能领域提供了宝贵的见解,并将激励2020年代更具创新性的CNN的发展。代码可在https://github.com/UnicomAI/KAConvNet公开获取。
cs.CV / 77 / 2604.23325
EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence
EAD-Net:具有空间细化和时间一致性的情感感知对话头生成
Abstract
Emotional talking head video generation aims to generate expressive portrait videos with accurate lip synchronization and emotional facial expressions. Current methods rely on simple emotional labels, leading to insufficient semantic information. While introducing high-level semantics enhances expressiveness, it easily causes lip-sync degradation. Furthermore, mainstream generation methods struggle to balance computational efficiency and global motion awareness in long videos and suffer from poor temporal coherence. Therefore, we propose an Emotion-Aware Diffusion model-based Network, called EAD-Net. We introduce SyncNet supervision and Temporal Representation Alignment (TREPA) to mitigate lip-sync degradation caused by multi-modal fusion. To model complex spatio-temporal dependencies in long video sequences, we propose a Spatio-Temporal Directional Attention (STDA) mechanism that captures global motion patterns through strip attention. Additionally, we design a Temporal Frame graph Reasoning Module (TFRM) to explicitly model temporal coherence between video frames through graph structure learning. To enhance emotional semantic control, a large language model is employed to extract textual descriptions from real videos, serving as high-level semantic guidance. Experiments on the HDTF and MEAD datasets demonstrate that our method outperforms existing methods in terms of lip-sync accuracy, temporal consistency, and emotional accuracy.
Chinese Translation
情感对话头视频生成旨在生成具有准确唇部同步和情感面部表情的生动肖像视频。目前的方法依赖于简单的情感标签,导致语义信息不足。虽然引入高层次语义可以增强表现力,但容易导致唇部同步下降。此外,主流生成方法在长视频中难以平衡计算效率和全局运动意识,并且存在时间一致性差的问题。因此,我们提出了一种基于Emotion-Aware Diffusion模型的Network,称为EAD-Net。我们引入SyncNet监督和时间表示对齐(Temporal Representation Alignment, TREPA)来减轻多模态融合引起的唇部同步下降。为了建模长视频序列中的复杂时空依赖关系,我们提出了一种时空方向注意力(Spatio-Temporal Directional Attention, STDA)机制,通过条带注意力捕捉全局运动模式。此外,我们设计了一种时间帧图推理模块(Temporal Frame graph Reasoning Module, TFRM),通过图结构学习显式建模视频帧之间的时间一致性。为了增强情感语义控制,采用大型语言模型从真实视频中提取文本描述,作为高层次语义指导。对HDTF和MEAD数据集的实验表明,我们的方法在唇部同步准确性、时间一致性和情感准确性方面优于现有方法。
cs.CV / 78 / 2604.23335
H-SemiS: Hierarchical Fusion of Semi and Self-Supervised Learning for Knee Osteoarthritis Severity Grading
H-SemiS:半监督与自监督学习的层次融合用于膝关节骨关节炎严重程度分级
Abstract
Knee osteoarthritis (KOA) is a degenerative joint disease that can lead to chronic pain, reduced mobility, and long-term disability. Automated severity grading from knee radiographs can support early assessment, but current methods heavily depend on large labeled datasets and remain sensitive to class imbalance, noisy samples, and variability in clinical annotations. To alleviate these limitations, we propose a Hierarchical fusion of Semi-Supervised framework with Self-Supervision (H-SemiS) for KOA severity grading in knee X-ray samples using limited annotated data. Rather than treating severity grading as a flat multi-class problem, H-SemiS decomposes the task into a sequence of binary sub-tasks within a semi-supervised teacher-student architecture, directly mitigating the impact of class imbalance. To further enhance feature learning from unlabeled data, the framework integrates an adversarial self-supervised reconstruction module that encourages the network to capture robust anatomical structures. In parallel, a teacher-student design with quantum-inspired feature mixing improves discrimination boundaries between adjacent grades when pseudo-labels are noisy. We comprehensively evaluate H-SemiS on two challenging multi-class datasets and assess its generalizability on two binary-class datasets. Our experimental results demonstrate the superiority of the proposed H-SemiS framework across multiple evaluation metrics, consistently outperforming several competing baselines and state-of-the-art methods. The code is publicly available at https://github.com/chandravardhan-singh-raghaw/H-SemiS.
Chinese Translation
膝关节骨关节炎(KOA)是一种退行性关节疾病,可能导致慢性疼痛、活动能力下降和长期残疾。从膝关节X光片中自动进行严重程度分级可以支持早期评估,但当前方法严重依赖于大量标注数据集,并且对类别不平衡、噪声样本和临床注释的变异性敏感。为了缓解这些限制,我们提出了一种半监督框架与自监督学习的层次融合方法(H-SemiS),用于在有限标注数据的情况下对膝关节X光样本的KOA严重程度进行分级。H-SemiS并不是将严重程度分级视为一个平坦的多类问题,而是将任务分解为一系列二分类子任务,采用半监督的师生架构,直接减轻类别不平衡的影响。为了进一步增强对未标记数据的特征学习,该框架集成了一个对抗性自监督重建模块,鼓励网络捕捉稳健的解剖结构。同时,采用量子启发的特征混合的师生设计,在伪标签噪声较大的情况下改善相邻等级之间的区分边界。我们在两个具有挑战性的多类数据集上全面评估了H-SemiS,并评估其在两个二类数据集上的泛化能力。实验结果表明,所提出的H-SemiS框架在多个评估指标上优于几种竞争基线和最先进的方法。代码已公开发布在 https://github.com/chandravardhan-singh-raghaw/H-SemiS。
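Decomposing severity grading into a sequence of binary sub-tasks can be done in several ways, and the abstract does not say which one H-SemiS uses. The sketch below shows one common option as an assumption: ordinal thresholds, where sub-task k asks whether the grade exceeds k. The number of grades and the probabilities are toy values.

```python
def binary_targets(grade, num_grades):
    """Targets for the num_grades - 1 sub-tasks 'is the grade above k?'."""
    return [int(grade > k) for k in range(num_grades - 1)]


def grade_from_binary_cascade(binary_probs, threshold=0.5):
    """Walk the sub-tasks in order and stop at the first 'no'.

    binary_probs[k] is a hypothetical predicted probability that grade > k.
    """
    grade = 0
    for p in binary_probs:
        if p < threshold:
            break
        grade += 1
    return grade


# Toy example with five grades (0 to 4), an assumed grading scale.
targets_for_grade_2 = binary_targets(2, num_grades=5)
predicted_grade = grade_from_binary_cascade([0.9, 0.8, 0.3, 0.1])
```

Each sub-task pools all grades on either side of its threshold, so rare grades still contribute examples to several binary problems, which is one way such a decomposition can soften class imbalance.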
cs.CV / 79 / 2604.23344
Exploring Hierarchical Consistency and Unbiased Objectness for Open-Vocabulary Object Detection
探索开放词汇物体检测中的层次一致性与无偏物体性
Abstract
Conventional object detectors typically operate under a closed-set assumption, limiting recognition to a predefined set of base classes seen during training. Open-vocabulary object detection (OVD) addresses this limitation by leveraging vision-language models (VLMs) to generate pseudo labels for novel object classes. However, existing OVD methods suffer from two critical drawbacks: (1) inaccurate class label assignments, as VLMs are optimized for image-level predictions rather than the region-level predictions required for pseudo labeling, and (2) unreliable objectness scores from region proposal networks (RPNs) trained exclusively on base object classes. To address these issues, we propose a novel pseudo labeling framework for OVD. Our approach introduces a hierarchical confidence calibration (HCC) technique, which ensures reliable class label estimation by assessing consistency across hierarchical semantic levels (class, super- and sub-category). We also present LoCLIP, a parameter-efficient adaptation of CLIP that incorporates an objectness token to mitigate the base-class bias of RPNs and provide reliable objectness estimates for novel object classes. Extensive experiments on standard OVD benchmarks, including COCO and LVIS, demonstrate that our approach sets a new state of the art, validating its effectiveness. Project site: https://cvlab.yonsei.ac.kr/projects/HCC
Chinese Translation
传统的物体检测器通常在封闭集假设下运行,限制了对训练期间所见的预定义基础类别的识别。开放词汇物体检测(Open-Vocabulary Object Detection, OVD)通过利用视觉-语言模型(Vision-Language Models, VLMs)为新物体类别生成伪标签,从而解决了这一限制。然而,现有的OVD方法存在两个关键缺陷:(1)类标签分配不准确,因为VLMs是针对图像级预测而优化的,而不是伪标签所需的区域级预测;(2)区域提议网络(Region Proposal Networks, RPNs)仅在基础物体类别上训练,导致物体性评分不可靠。为了解决这些问题,我们提出了一种新的OVD伪标签框架。我们的方法引入了一种层次置信度校准(Hierarchical Confidence Calibration, HCC)技术,通过评估层次语义级别(类别、超类别和子类别)之间的一致性,确保可靠的类标签估计。我们还提出了LoCLIP,这是一种高效的CLIP适配,结合了物体性标记,以减轻RPNs的基础类别偏差问题,并为新物体类别提供可靠的物体性估计。在标准OVD基准(包括COCO和LVIS)上的广泛实验表明,我们的方法明显设定了新的最先进水平,验证了我们方法的有效性。项目网站:https://cvlab.yonsei.ac.kr/projects/HCC
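The idea of assessing consistency across class, super-category, and sub-category can be illustrated with a simple agreement check: a pseudo label is kept only if the three independent predictions fit the same branch of a taxonomy. This is an illustrative assumption, not the HCC calibration procedure itself, which the abstract does not detail; the taxonomy and labels are invented.

```python
# Hypothetical taxonomy: each class lists its parent and its children.
TAXONOMY = {
    "dog": {"super": "animal", "subs": {"husky", "poodle"}},
    "car": {"super": "vehicle", "subs": {"sedan", "taxi"}},
}


def is_hierarchically_consistent(pred_class, pred_super, pred_sub, taxonomy):
    """True if the three predictions lie on one branch of the taxonomy."""
    node = taxonomy.get(pred_class)
    return (node is not None
            and node["super"] == pred_super
            and pred_sub in node["subs"])


def filter_pseudo_labels(predictions, taxonomy):
    """Keep the class label of every consistent (class, super, sub) triple."""
    return [
        cls for cls, sup, sub in predictions
        if is_hierarchically_consistent(cls, sup, sub, taxonomy)
    ]


# Toy region-level predictions (hypothetical): the second one disagrees
# with itself, since a sedan is not a kind of dog.
predictions = [
    ("dog", "animal", "husky"),
    ("dog", "animal", "sedan"),
    ("car", "vehicle", "taxi"),
]
kept = filter_pseudo_labels(predictions, TAXONOMY)
```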
cs.CV / 80 / 2604.23348
EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs
EmoTrans:理解、推理和预测多模态大语言模型中情感转变的基准
Abstract
Recent multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and generation, and are increasingly used in applications such as social robots and human-computer interaction, where understanding human emotions is essential. However, existing benchmarks mainly formulate emotion understanding as a static recognition problem, leaving it largely unclear whether current MLLMs can understand emotion as a dynamic process that evolves, shifts between states, and unfolds across diverse social contexts. To bridge this gap, we present EmoTrans, a benchmark for evaluating emotion dynamics understanding in multimodal videos. EmoTrans contains 1,000 carefully collected and manually annotated video clips, covering 12 real-world scenarios, and further provides over 3,000 task-specific question-answer (QA) pairs for fine-grained evaluation. The benchmark introduces four tasks, namely Emotion Change Detection (ECD), Emotion State Identification (ESI), Emotion Transition Reasoning (ETR), and Next Emotion Prediction (NEP), forming a progressive evaluation framework from coarse-grained detection to deeper reasoning and prediction. We conduct a comprehensive evaluation of 18 state-of-the-art MLLMs on EmoTrans and obtain two main findings. First, although current MLLMs show relatively stronger performance on coarse-grained emotion change detection, they still struggle with fine-grained emotion dynamics modeling. Second, socially complex settings, especially multi-person scenarios, remain substantially challenging, while reasoning-oriented variants do not consistently yield clear improvements. To facilitate future research, we publicly release the benchmark, evaluation protocol, and code at https://github.com/Emo-gml/EmoTrans.
Chinese Translation
近期的多模态大语言模型(MLLMs)在感知、推理和生成方面展现出强大的能力,并越来越多地应用于社交机器人和人机交互等领域,在这些领域中,理解人类情感至关重要。然而,现有的基准主要将情感理解视为一个静态识别问题,这使得当前的MLLMs是否能够理解情感作为一个动态过程仍然不清晰,该过程会随着状态的变化而演变,并在多样的社会情境中展开。为了解决这一问题,我们提出了EmoTrans,一个用于评估多模态视频中情感动态理解的基准。EmoTrans包含1,000个经过精心收集和手动标注的视频片段,涵盖12种现实场景,并进一步提供超过3,000个任务特定的问题-答案(QA)对,以便进行细粒度评估。该基准引入了四个任务,即情感变化检测(Emotion Change Detection, ECD)、情感状态识别(Emotion State Identification, ESI)、情感转变推理(Emotion Transition Reasoning, ETR)和下一个情感预测(Next Emotion Prediction, NEP),形成了一个从粗粒度检测到更深层次推理和预测的渐进评估框架。我们对18个最先进的MLLMs在EmoTrans上进行了全面评估,并得出了两个主要发现。首先,尽管当前的MLLMs在粗粒度情感变化检测上表现相对较强,但它们在细粒度情感动态建模方面仍然面临挑战。其次,社会复杂环境,尤其是多人场景,依然具有相当大的挑战,而以推理为导向的变体并未始终带来明显的改善。为了促进未来的研究,我们在https://github.com/Emo-gml/EmoTrans上公开发布了该基准、评估协议和代码。
cs.CV / 81 / 2604.23375
Hierarchical Spatio-Channel Clustering for Efficient Model Compression in Medical Image Analysis
用于医学图像分析中高效模型压缩的层次化空间-通道聚类
Abstract
Convolutional neural networks (CNNs) have become increasingly difficult to deploy in resource-constrained environments due to their large memory and computational requirements. Although low-rank compression methods can reduce this burden, most existing approaches compress spatial and channel redundancy independently and therefore do not fully exploit the localised structure within convolutional feature maps. This paper proposes a hierarchical spatio-channel low-rank compression framework for CNNs that exploits redundancy across spatial regions and channel activations. Unlike conventional methods, which apply a uniform decomposition across an entire layer, the proposed approach first partitions feature maps into spatial regions, then groups channels according to their co-activation patterns within each region, and finally applies rank-adaptive SVD to each resulting spatio-channel cluster. The method is evaluated on an AlexNet-based brain tumour MRI classification model and compared with Global SVD and Tucker decomposition under \(3\times\) and \(6\times\) compression budgets. Our method outperforms both baselines, reducing FLOPs from \(8.21\,\mathrm{G}\) to \(1.55\,\mathrm{G}\) (\(81.1\%\) reduction), achieving a \(1.38\times\) inference speed-up, and increasing classification accuracy from \(87.76\%\) to \(89.80\%\). The method also improves the macro \(F_1\)-score and performance on challenging classes such as meningioma. A hyper-parameter trade-off analysis demonstrates that the framework provides Pareto-optimal configurations, enabling control over the balance between compression and predictive performance. Moderate clustering with adaptive rank selection yields strong results. Bootstrap standard errors are reported for all classification metrics.
Chinese Translation
卷积神经网络(CNN)由于其较大的内存和计算需求,在资源受限的环境中变得越来越难以部署。尽管低秩压缩方法可以减轻这一负担,但大多数现有方法独立压缩空间和通道冗余,因此未能充分利用卷积特征图中的局部结构。本文提出了一种针对CNN的层次化空间-通道低秩压缩框架,利用空间区域和通道激活之间的冗余。与传统方法在整个层中应用均匀分解不同,所提方法首先将特征图划分为空间区域,然后根据每个区域内的共同激活模式对通道进行分组,最后对每个生成的空间-通道聚类应用秩自适应奇异值分解(SVD)。该方法在基于AlexNet的脑肿瘤MRI分类模型上进行了评估,并与全局SVD和Tucker分解在3倍和6倍压缩预算下进行了比较。我们的方案优于这两种基线,将FLOPs从8.21 G减少到1.55 G(减少81.1%),实现了1.38倍的推理加速,并将分类准确率从87.76%提高到89.80%。该方法还提高了宏观F1分数以及在脑膜瘤等挑战性类别上的表现。超参数权衡分析表明,该框架提供了帕累托最优配置,使得在压缩和预测性能之间的平衡可控。适度的聚类与自适应秩选择结合,取得了良好的结果。所有分类指标均报告了自助法标准误差。
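The last step of the pipeline described above, rank-adaptive SVD applied to each spatio-channel cluster, can be illustrated with a minimal NumPy sketch. The abstract does not say how ranks are chosen, so the energy-retention rule, the `energy` parameter, and the function names `rank_for_energy` and `compress_cluster` are assumptions made here for illustration, not the authors' implementation.

```python
import numpy as np

def rank_for_energy(singular_values, energy=0.95):
    """Smallest rank whose squared singular values keep `energy` of the total.

    The energy criterion is an assumed selection rule, not one taken from the paper.
    """
    power = np.asarray(singular_values, dtype=float) ** 2
    total = power.sum()
    if total == 0.0:
        return 1  # degenerate all-zero block: keep a single component
    cumulative = np.cumsum(power) / total
    rank = int(np.searchsorted(cumulative, energy)) + 1
    return min(rank, len(power))

def compress_cluster(weights, energy=0.95):
    """Return factors (A, B) of a truncated SVD so that weights ~= A @ B."""
    u, s, vt = np.linalg.svd(weights, full_matrices=False)
    r = rank_for_energy(s, energy)
    # Fold the singular values into the left factor.
    return u[:, :r] * s[:r], vt[:r, :]
```

For a cluster matrix of shape (m, n) and selected rank r, storing the two factors takes r(m + n) values instead of mn, so clusters with faster-decaying spectra compress more under the same energy target.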
cs.CV / 82 / 2604.23387
Keypoint-based Dynamic Object 6-DoF Pose Tracking via Event Camera
基于事件相机与关键点的动态物体6自由度姿态跟踪
Abstract
Accurate 6-DoF pose estimation of objects is critical for robots to perform precise manipulation tasks. However, for dynamic object pose estimation, conventional camera-based approaches face several major challenges, such as motion blur, sensor noise, and low-light limitation. To address these issues, we employ event cameras, whose high dynamic range and low latency offer a promising solution. Furthermore, we propose a keypoint-based detection and tracking approach for dynamic object pose estimation. Firstly, a keypoint detection network is constructed to extract keypoints from the time surface generated by the event stream. Subsequently, the polarity and spatial coordinates of the events are leveraged, and the event density in the vicinity of each keypoint is utilized to achieve continuous keypoint tracking. Finally, a hash mapping is established between the 2D keypoints and the 3D model keypoints, and the EPnP algorithm is employed to estimate the 6-DoF pose. Experimental results demonstrate that, whether in simulated or real event environments, the proposed method outperforms the event-based state-of-the-art methods in terms of both accuracy and robustness.
Chinese Translation
物体的准确6自由度姿态估计对于机器人执行精确的操作任务至关重要。然而,对于动态物体的姿态估计,传统的基于相机的方法面临诸多主要挑战,如运动模糊、传感器噪声和低光照限制。为了解决这些问题,我们采用事件相机,其高动态范围和低延迟提供了一个有前景的解决方案。此外,我们提出了一种基于关键点的动态物体姿态估计检测和跟踪方法。首先,构建一个关键点检测网络,从事件流生成的时间表面中提取关键点。随后,利用事件的极性和空间坐标,并利用每个关键点附近的事件密度实现连续的关键点跟踪。最后,在2D关键点和3D模型关键点之间建立哈希映射,并采用EPnP算法估计6自由度姿态。实验结果表明,无论是在模拟还是实际事件环境中,所提出的方法在准确性和鲁棒性方面均优于基于事件的最先进方法。
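The first stage above extracts keypoints from a time surface built from the event stream. The sketch below shows one common time-surface construction, an exponential decay of the most recent event timestamp per pixel and polarity. The decay constant `tau`, the two-channel layout, and the function name `time_surface` are illustrative assumptions, since the abstract does not specify the exact formulation.

```python
import numpy as np

def time_surface(events, height, width, t_ref, tau=0.05):
    """Build a (2, H, W) time surface at reference time t_ref.

    events: iterable of (x, y, t, polarity) with integer pixel coordinates,
            timestamps in seconds, and polarity > 0 for ON events.
    Pixels that never fired map to 0; a pixel that fired at t_ref maps to 1.
    """
    last = np.full((2, height, width), -np.inf)
    for x, y, t, p in events:
        if t <= t_ref:  # ignore events from the future of the reference time
            channel = 1 if p > 0 else 0
            last[channel, y, x] = max(last[channel, y, x], t)
    # exp(-inf) evaluates to 0, so silent pixels stay at zero.
    return np.exp((last - t_ref) / tau)
```

A smaller `tau` keeps only the freshest edges, which suits fast motion, while a larger `tau` yields denser surfaces at the cost of trailing artifacts.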
cs.CV / 83 / 2604.23399
Breaking the Resource Wall: Geometry-Guided Sequence Modeling for Efficient Semantic Segmentation
打破资源壁垒:几何引导的序列建模用于高效语义分割
Abstract
High-performance semantic segmentation has achieved significant progress in recent years, often driven by increasingly large backbones and higher computational budgets. While effective, such approaches introduce substantial computational overhead and limit accessibility under constrained hardware settings. In this paper, we propose DGM-Net (Directional Geometric Mamba Network), an efficient architecture that improves modeling capability through structural design rather than increasing model capacity. We introduce Directional Geometric Mamba (G-Mamba), a linear-complexity O(N) operator as an alternative to conventional context modeling modules such as ASPP and PPM. To further enhance structural awareness in state space model (SSM)-based modeling, we design the DGM-Module, which extracts centripetal flow fields and topological skeletons to guide the scanning process and improve boundary preservation. Without relying on large-scale pretraining or heavy backbone scaling, DGM-Net achieves 80.8% mIoU within 28k iterations, 82.3% mIoU on Cityscapes test set, and 45.24% mIoU on ADE20K. In addition, the model maintains stable performance under constrained hardware settings (e.g., batch size of 2 on 8GB VRAM), highlighting its efficiency and practicality. These results demonstrate that incorporating geometric guidance into SSM-based architectures provides an effective and resource-efficient direction for semantic segmentation.
Chinese Translation
近年来,高性能语义分割取得了显著进展,通常依赖于越来越大的主干网络和更高的计算预算。尽管这些方法有效,但它们引入了大量的计算开销,并限制了在受限硬件环境下的可访问性。本文提出了DGM-Net(方向几何曼巴网络),这是一种高效的架构,通过结构设计而非增加模型容量来提高建模能力。我们引入了方向几何曼巴(G-Mamba),这是一种线性复杂度O(N)的操作符,作为传统上下文建模模块(如ASPP和PPM)的替代方案。为了进一步增强基于状态空间模型(SSM)建模中的结构意识,我们设计了DGM模块,该模块提取向心流场和拓扑骨架,以引导扫描过程并改善边界保留。在不依赖于大规模预训练或重型主干扩展的情况下,DGM-Net在28k次迭代中达到了80.8%的mIoU,在Cityscapes测试集上达到了82.3%的mIoU,在ADE20K上达到了45.24%的mIoU。此外,该模型在受限硬件环境下(例如,8GB显存下的批量大小为2)保持稳定性能,突显了其高效性和实用性。这些结果表明,将几何引导融入基于SSM的架构为语义分割提供了一种有效且资源高效的方向。
cs.CV / 84 / 2604.23403
Learn&Drop: Fast Learning of CNNs based on Layer Dropping
Learn&Drop:基于层丢弃的卷积神经网络快速学习
Abstract
This paper proposes a new method to improve the training efficiency of deep convolutional neural networks. During training, the method evaluates scores to measure how much each layer's parameters change and whether the layer will continue learning or not. Based on these scores, the network is scaled down such that the number of parameters to be learned is reduced, yielding a speed up in training. Unlike state-of-the-art methods that try to compress the network to be used in the inference phase or to limit the number of operations performed in the backpropagation phase, the proposed method is novel in that it focuses on reducing the number of operations performed by the network in the forward propagation during training. The proposed training strategy has been validated on two widely used architecture families: VGG and ResNet. Experiments on MNIST, CIFAR-10 and Imagenette show that, with the proposed method, the training time of the models is more than halved without significantly impacting accuracy. The FLOPs reduction in the forward propagation during training ranges from 17.83% for VGG-11 to 83.74% for ResNet-152. These results demonstrate the effectiveness of the proposed technique in speeding up learning of CNNs. The technique will be especially useful in applications where fine-tuning or online training of convolutional models is required, for instance because data arrive sequentially.
Chinese Translation
本文提出了一种新方法,以提高深度卷积神经网络的训练效率。在训练过程中,该方法评估分数,以衡量每一层参数的变化程度以及该层是否会继续学习。基于这些分数,网络被缩减,从而减少需要学习的参数数量,加快训练速度。与试图压缩网络以用于推理阶段或限制反向传播阶段执行的操作数量的最先进方法不同,所提出的方法的新颖之处在于它专注于减少训练期间网络在前向传播中执行的操作数量。所提出的训练策略已在两个广泛使用的架构系列上进行了验证:VGG 和 ResNet。在 MNIST、CIFAR-10 和 Imagenette 上的实验表明,采用该方法后,模型的训练时间减少了一半以上,而准确性几乎没有受到显著影响。在训练期间,前向传播的 FLOPs 减少范围从 VGG-11 的 17.83% 到 ResNet-152 的 83.74%。这些结果证明了所提出技术在加速卷积神经网络学习方面的有效性。该技术在需要微调或在线训练卷积模型的应用中尤其有用,例如因为数据是顺序到达的。
cs.CV / 85 / 2604.23407
PushupBench: Your VLM is not good at counting pushups
PushupBench:你的视觉语言模型在计数俯卧撑方面表现不佳
Abstract
Large vision-language models (VLMs) can recognize what happens in video but fail to count how many times. We introduce PushupBench, 446 long-form clips (avg. 36.7s) for evaluating repetition counting. The best frontier model achieves 42.1% exact accuracy; open-source 4B models score ~6%, matching supervised baselines. We show that accuracy alone misleads: weaker models exploit the modal count rather than reasoning temporally. Fine-tuning on counting with 1k samples transfers to general video understanding: MVBench (+2.15), PerceptionTest (+1.88), TVBench (+4.54), suggesting counting is a proxy for broader temporal reasoning. PushupBench is incorporated in lmms-eval (https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/1262) and hosted at pushupbench.com.
Chinese Translation
大型视觉语言模型(VLMs)能够识别视频中发生了“什么”,但无法计数发生了“多少”次。我们介绍了PushupBench,包含446个长视频片段(平均时长36.7秒),用于评估重复计数。最佳前沿模型的准确率为42.1%;开源的4B模型得分约为6%,与监督基线相匹配。我们表明,仅凭准确率会产生误导:较弱的模型利用模态计数而不是进行时间推理。在1k样本上进行计数的微调能够转移到一般视频理解上:MVBench (+2.15),PerceptionTest (+1.88),TVBench (+4.54),这表明计数是更广泛时间推理的代理。PushupBench已纳入lmms-eval(https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/1262)并托管于pushupbench.com。
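The claim that accuracy alone misleads can be made concrete with a toy check. A predictor that always outputs the most frequent count in the dataset uses no temporal reasoning, yet it can still score well on exact accuracy when counts are concentrated. The sketch below uses made-up counts and hypothetical helper names (`exact_accuracy`, `modal_count_baseline`); it is not the benchmark's evaluation code.

```python
from collections import Counter

def exact_accuracy(predictions, targets):
    """Fraction of clips whose predicted count equals the ground truth exactly."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

def modal_count_baseline(targets):
    """Predict the single most frequent ground-truth count for every clip."""
    mode, _ = Counter(targets).most_common(1)[0]
    return [mode] * len(targets)

# Toy ground-truth repetition counts (illustrative only, not PushupBench data).
toy_counts = [10, 10, 10, 12, 15, 20]
baseline_accuracy = exact_accuracy(modal_count_baseline(toy_counts), toy_counts)
```

Reporting a model's exact accuracy next to this baseline shows how much of its score could come from guessing the mode rather than from counting.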
cs.CV / 86 / 2604.23415
A Heterogeneous Two-Stream Framework for Video Action Recognition with Comparative Fusion Analysis
一种异构双流框架用于视频动作识别的比较融合分析
Abstract
Most two-stream action recognition networks apply the same convolutional backbone to both RGB and optical flow streams, ignoring the fact that the two modalities have fundamentally different structural properties. Optical flow captures fine-grained motion patterns, while RGB frames carry rich appearance and scene context - treating them identically discards this distinction. We propose DualStreamHybrid, a heterogeneous two-stream architecture that assigns each stream a backbone suited to its input: a pretrained ViT-Tiny/16 for RGB frames, and a MobileNetV2 trained from scratch on a 20-channel stacked optical flow representation. A learned projection layer maps the two differently-sized feature vectors to a common dimensionality before fusion, enabling the two streams to interact without forcing architectural symmetry. We design five fusion strategies within a unified framework - late fusion, concatenation, cross-attention, weighted fusion, and gated fusion - and evaluate them on UCF11 (1,600 videos, 11 classes) and UCF50 (6,681 videos, 50 classes) to study how fusion behaviour scales with dataset size. On UCF11, cross-attention achieves 98.12% test accuracy, outperforming the RGB-only ViT-Tiny baseline of 95.94%, which suggests that explicit inter-modal attention is particularly effective on smaller, less complex datasets. On UCF50, weighted fusion reaches 96.86% and proves the most consistent strategy across both benchmarks. The learned stream weights reveal an interesting pattern: UCF11 sees near-equal modality contribution (RGB: 0.507, flow: 0.493), while UCF50 favours the RGB stream slightly more (RGB: 0.554, flow: 0.446) - arguably reflecting the larger and more visually diverse action space. Taken together, these results suggest that even a lightweight motion stream meaningfully complements a strong appearance encoder, and that the optimal fusion strategy depends on dataset scale.
Chinese Translation
大多数双流动作识别网络对RGB和光流流进行相同的卷积骨干网络处理,忽视了这两种模态在结构属性上存在根本性的不同。光流捕捉细粒度的运动模式,而RGB帧则携带丰富的外观和场景上下文——将它们视为相同会忽略这种区别。我们提出了DualStreamHybrid,这是一种异构双流架构,为每个流分配适合其输入的骨干网络:对RGB帧使用预训练的ViT-Tiny/16,对从头开始在20通道堆叠光流表示上训练的MobileNetV2。一个学习的投影层在融合之前将两个不同大小的特征向量映射到一个共同的维度,使得两个流能够相互作用,而不强制要求架构对称。我们在一个统一框架内设计了五种融合策略——晚期融合、连接、交叉注意、加权融合和门控融合——并在UCF11(1,600个视频,11个类别)和UCF50(6,681个视频,50个类别)上进行评估,以研究融合行为如何随着数据集规模的变化而变化。在UCF11上,交叉注意达到了98.12%的测试准确率,超越了RGB-only ViT-Tiny基线的95.94%,这表明显式的跨模态注意在较小、复杂性较低的数据集上特别有效。在UCF50上,加权融合达到了96.86%,并在两个基准测试中证明是最一致的策略。学习到的流权重揭示了一个有趣的模式:UCF11几乎实现了模态贡献的均等(RGB: 0.507,光流: 0.493),而UCF50则稍微偏向RGB流(RGB: 0.554,光流: 0.446)——这可以说反映了更大且视觉上更为多样的动作空间。综合来看,这些结果表明,即使是轻量级的运动流也能有效补充强大的外观编码器,并且最佳的融合策略依赖于数据集的规模。
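Weighted fusion, the most consistent of the five strategies above, can be sketched in a few lines of NumPy. The sketch assumes the two stream weights come from a softmax over two learnable scalars, which is consistent with the reported weights summing to one but is not confirmed by the abstract. The projection matrices, feature sizes, and function names are likewise hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def weighted_fusion(rgb_feat, flow_feat, proj_rgb, proj_flow, stream_logits):
    """Project both streams to a common size, then mix them with two weights.

    proj_rgb / proj_flow stand in for the learned projection layer;
    stream_logits holds two scalars (RGB first) that would be learned in training.
    """
    rgb_common = proj_rgb @ rgb_feat
    flow_common = proj_flow @ flow_feat
    w_rgb, w_flow = softmax(stream_logits)
    fused = w_rgb * rgb_common + w_flow * flow_common
    return fused, (float(w_rgb), float(w_flow))
```

Under this parameterization, equal logits give a 0.5/0.5 split, and an RGB logit that exceeds the flow logit by ln(0.554/0.446) reproduces the 0.554/0.446 split reported for UCF50.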
cs.CV / 87 / 2604.23426
Enhanced Privacy and Communication Efficiency in Non-IID Federated Learning with Adaptive Quantization and Differential Privacy
通过自适应量化和差分隐私增强非独立同分布联邦学习中的隐私性和通信效率
Abstract
Federated learning (FL) is a distributed machine learning method where multiple devices collaboratively train a model under the management of a central server without sharing underlying data. One of the key challenges of FL is the communication bottleneck caused by variations in connection speed and bandwidth across devices. Therefore, it is essential to reduce the size of transmitted data during training. Additionally, there is a potential risk of exposing sensitive information through the model or gradient analysis during training. To address both privacy and communication efficiency, we combine differential privacy (DP) and adaptive quantization methods. We use Laplacian-based DP to preserve privacy, which is relatively underexplored in FL and offers tighter privacy guarantees than Gaussian-based DP. We propose a simple and efficient global bit-length scheduler using round-based cosine annealing, along with a client-based scheduler that dynamically adapts based on client contribution estimated through dataset entropy analysis. We evaluate our approach through extensive experiments on CIFAR10, MNIST, and medical imaging datasets, using non-IID data distributions across varying client counts, bit-length schedulers, and privacy budgets. The results show that our adaptive quantization methods reduce total communicated data by up to 52.64% for MNIST, 45.06% for CIFAR10, and 31% to 37% for medical imaging datasets compared to 32-bit float training while maintaining competitive model accuracy and ensuring robust privacy through differential privacy.
Chinese Translation
联邦学习(Federated Learning, FL)是一种分布式机器学习方法,多个设备在中央服务器的管理下协作训练模型,而无需共享底层数据。FL的一个关键挑战是由于设备之间连接速度和带宽的差异而导致的通信瓶颈。因此,在训练过程中减少传输数据的大小至关重要。此外,在训练过程中,通过模型或梯度分析暴露敏感信息的潜在风险也存在。为了解决隐私和通信效率的问题,我们结合了差分隐私(Differential Privacy, DP)和自适应量化方法。我们使用基于拉普拉斯的DP来保护隐私,这在FL中相对较少被探索,并且提供比基于高斯的DP更严格的隐私保证。我们提出了一种简单而高效的全局比特长度调度器,采用基于回合的余弦退火,以及一个基于客户端的调度器,该调度器根据通过数据集熵分析估计的客户端贡献动态调整。我们通过在CIFAR10、MNIST和医学影像数据集上的广泛实验评估我们的方法,使用不同客户端数量、比特长度调度器和隐私预算下的非独立同分布数据分布。结果表明,我们的自适应量化方法在保持竞争性模型准确性的同时,通过差分隐私确保强大的隐私保护,相比于32位浮点训练,MNIST的总通信数据减少了高达52.64%,CIFAR10减少了45.06%,医学影像数据集减少了31%至37%。
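The two ingredients combined above, a round-based cosine-annealed bit length and Laplace noise for differential privacy, can be sketched as follows. Several details are assumptions made for illustration: the schedule direction (from `max_bits` down to `min_bits`), the bit-length bounds, the uniform min-max quantizer, and the order in which noise and quantization would be applied. None of these are stated in the abstract, and all function names are hypothetical.

```python
import math
import numpy as np

def cosine_bit_length(round_idx, total_rounds, min_bits=4, max_bits=16):
    """Cosine-anneal the bit length from max_bits (round 0) to min_bits (last round)."""
    progress = round_idx / max(total_rounds - 1, 1)
    bits = min_bits + 0.5 * (max_bits - min_bits) * (1.0 + math.cos(math.pi * progress))
    return int(round(bits))

def quantize(update, bits):
    """Uniformly quantize a vector to 2**bits levels over its own value range."""
    lo, hi = update.min(), update.max()
    if hi == lo:
        return update.copy()
    levels = 2 ** bits - 1
    codes = np.round((update - lo) / (hi - lo) * levels)  # integers to transmit
    return codes / levels * (hi - lo) + lo                # dequantized values

def laplace_perturb(update, sensitivity, epsilon, rng):
    """Add Laplace noise with scale sensitivity / epsilon (the Laplace mechanism)."""
    return update + rng.laplace(0.0, sensitivity / epsilon, size=update.shape)
```

Relative to 32-bit floats, sending b-bit codes cuts the payload per round by roughly a factor of 32/b, ignoring the two range scalars, so the total saving depends on how quickly the schedule lowers b.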
cs.CV / 88 / 2604.23432
Sphere-Depth: A Benchmark for Depth Estimation Methods with Varying Spherical Camera Orientations
Sphere-Depth:一个针对不同球形相机方向的深度估计方法基准
Abstract
Reliable depth estimation from spherical images is crucial for 360° vision in robotic navigation and immersive scene understanding. However, the onboard spherical camera can experience unintentional pose variations in real-world robotic platforms that, along with the geometric distortions inherent in equirectangular projections, significantly impact the effectiveness of depth estimation. To study this issue, a novel public benchmark, called Sphere-Depth, is introduced to systematically evaluate the robustness of monocular depth estimation models from equirectangular images in a reproducible way. Camera pose perturbations are simulated and used to assess the performance of a popular perspective-based model, Depth Anything, and of spherical-aware models such as Depth Anywhere, ACDNet, Bifuse++, and SliceNet. Furthermore, to ensure meaningful evaluation across models, a depth calibration-based error protocol is proposed to convert predicted relative depth values into metric depth values using supervised learned scaling factors for each model. Experiments show that even models explicitly designed to process spherical images exhibit substantial performance degradation when variations in the camera pose are observed with respect to the canonical pose. The full benchmark, evaluation protocol, and dataset splits are made publicly available at: https://github.com/sgazzeh/Sphere_depth
Chinese Translation
从球形图像中可靠地进行深度估计对于机器人导航中的360°视觉和沉浸式场景理解至关重要。然而,机载球形相机在现实世界的机器人平台上可能会经历意外的姿态变化,这与等矩形投影固有的几何失真一起,显著影响深度估计的有效性。为研究这一问题,本文引入了一种新的公共基准,称为Sphere-Depth,旨在以可重复的方式系统评估来自等矩形图像的单目深度估计模型的鲁棒性。通过模拟相机姿态扰动,评估了一种流行的基于透视的模型Depth Anything,以及诸如Depth Anywhere、ACDNet、Bifuse++和SliceNet等球形感知模型的性能。此外,为确保跨模型的有意义评估,提出了一种基于深度校准的误差协议,以利用每个模型的监督学习缩放因子将预测的相对深度值转换为度量深度值。实验表明,即使是专门设计用于处理球形图像的模型,在观察到与标准姿态的相机姿态变化时,性能也会显著下降。完整的基准、评估协议和数据集划分已公开发布,网址为:https://github.com/sgazzeh/Sphere_depth
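The calibration step above converts relative depth predictions into metric depth with a scaling factor fitted per model. One simple way to obtain such a factor is a closed-form least-squares fit against ground truth, sketched below. The abstract only says the factors are learned with supervision, so the single-scalar form, the least-squares criterion, the validity mask, and the function names are assumptions made for illustration.

```python
import numpy as np

def fit_scale(pred_rel, gt_metric, valid=None):
    """Least-squares scalar s minimizing ||s * pred - gt||^2 over valid pixels."""
    pred = np.asarray(pred_rel, dtype=float)
    gt = np.asarray(gt_metric, dtype=float)
    if valid is None:
        valid = gt > 0  # treat non-positive ground truth as missing
    p, g = pred[valid], gt[valid]
    return float((p * g).sum() / (p * p).sum())

def abs_rel_error(pred_metric, gt_metric, valid=None):
    """Mean absolute relative error |pred - gt| / gt over valid pixels."""
    pred = np.asarray(pred_metric, dtype=float)
    gt = np.asarray(gt_metric, dtype=float)
    if valid is None:
        valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))
```

Fitting the scale once on calibration data and reusing it for perturbed camera poses keeps the comparison focused on robustness rather than on each model's arbitrary output range.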
cs.CV / 89 / 2604.23435
Knee-xRAI: An Explainable AI Framework for Automatic Kellgren-Lawrence Grading of Knee Osteoarthritis
Knee-xRAI:一种可解释的人工智能框架,用于自动化膝关节骨关节炎的Kellgren-Lawrence分级
Abstract
Radiographic grading of knee osteoarthritis (KOA) with the Kellgren-Lawrence (KL) system is limited by inter-reader variability and the opacity of current deep learning approaches, which predict KL grades directly from images without decomposing structural features. We present Knee-xRAI, a modular framework that independently quantifies the three cardinal radiographic features of KOA (joint space narrowing [JSN], osteophytes, and subchondral sclerosis) and integrates them into an explainable KL grade classification. The pipeline combines U-Net++ segmentation for contour-based JSN measurement, an SE-ResNet-50 network for per-site osteophyte grading (OARSI scale), and a hybrid texture-CNN classifier for binary sclerosis quantification. The resulting 50-dimensional structured feature vector feeds two complementary classification paths. An XGBoost path supports SHAP-based feature attribution. A ConvNeXt hybrid path combines the structured vector with a full-image encoder for enhanced predictive performance. Evaluated on 8,260 radiographs from an OAI-derived dataset, the JSN module achieved a Dice coefficient of 0.8909 and an mJSW intraclass correlation of 0.8674 against manual annotations. The ConvNeXt hybrid path reached a test quadratic weighted kappa (QWK) of 0.8436 and AUC of 0.9017. The transparent XGBoost path achieved a test QWK of 0.6294 with full feature-level audit capability. Ablation confirmed JSN as the dominant predictor (QWK = 0.6103 alone), with osteophyte features providing consistent incremental gain (+0.0183) and sclerosis contributing marginally. Inference-time ablation of Path B confirmed the structured pathway contributes materially beyond the image encoder, with QWK drops of 0.098 (feature zeroing) and 0.284 (feature-image permutation). Knee-xRAI explicitly quantifies all three KL-defining radiographic features within a single auditable pipeline.
Chinese Translation
使用Kellgren-Lawrence (KL) 系统对膝关节骨关节炎 (KOA) 进行放射学分级受到读者间变异性和当前深度学习方法的不透明性的限制,这些方法直接从图像中预测KL等级,而未分解结构特征。我们提出了Knee-xRAI,这是一个模块化框架,独立量化KOA的三个主要放射学特征(关节间隙变窄 [JSN]、骨刺和软骨下硬化),并将它们整合为可解释的KL等级分类。该流程结合了U-Net++分割用于基于轮廓的JSN测量,SE-ResNet-50网络用于每个部位的骨刺分级(OARSI尺度),以及混合纹理-CNN分类器用于二元硬化量化。最终生成的50维结构特征向量输入两个互补的分类路径。XGBoost路径支持基于SHAP的特征归因。ConvNeXt混合路径将结构化向量与全图编码器结合,以增强预测性能。在来自OAI衍生数据集的8,260幅放射图像上评估,JSN模块实现了0.8909的Dice系数和0.8674的mJSW组内相关性,针对手动注释。ConvNeXt混合路径达到了0.8436的测试二次加权Kappa (QWK) 和0.9017的AUC。透明的XGBoost路径在特征级别审计能力下实现了0.6294的测试QWK。消融实验确认JSN是主要预测因子(单独QWK = 0.6103),而骨刺特征提供了一致的增益 (+0.0183),硬化贡献较小。对路径B的推理时间消融确认结构化路径在图像编码器之外有实质性贡献,QWK下降了0.098(特征归零)和0.284(特征-图像置换)。Knee-xRAI在单一可审计流程中明确量化了所有三个KL定义的放射学特征。
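Quadratic weighted kappa (QWK), the main agreement metric reported above, penalizes each disagreement by the squared distance between grades, so confusing KL 0 with KL 4 costs far more than confusing KL 2 with KL 3. A minimal NumPy sketch of the standard definition follows. The function name and the integer grade encoding 0 to 4 are choices made here for illustration, and the paper's own evaluation code may differ.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, num_classes=5):
    """QWK between two integer label sequences with values in [0, num_classes)."""
    observed = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    idx = np.arange(num_classes)
    # Quadratic disagreement weights: 0 on the diagonal, 1 at the far corners.
    weights = (idx[:, None] - idx[None, :]) ** 2 / (num_classes - 1) ** 2
    # Expected counts if the two raters were independent with the same marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    # Note: undefined (division by zero) if every label falls in a single class.
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

A value of 1 means perfect agreement and 0 means agreement no better than chance given the marginals. This is why the reported drop of 0.284 under feature-image permutation is substantial.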
cs.CV / 90 / 2604.23442
Resource-Constrained UAV-Based Weed Detection for Site-Specific Management on Edge Devices
基于资源受限的无人机杂草检测用于边缘设备的特定区域管理
Abstract
Weeds compete with crops for light, water, and nutrients, reducing yield and crop quality. Efficient weed detection is essential for site-specific weed management (SSWM). Although deep learning models have been deployed on UAV-based edge systems, a systematic understanding of how different model architectures perform under real-world resource constraints is still lacking. To address this gap, this study proposes a deployment-oriented framework for real-time UAV-based weed detection on resource-constrained edge platforms. The framework integrates UAV data acquisition, model development, and on-device inference, with a focus on balancing detection accuracy and computational efficiency. A diverse set of state-of-the-art object detection models is evaluated, including convolution-based YOLO models (v8-v12) and transformer-based RT-DETR models (v1-v2). Experiments on three edge devices (Jetson Orin Nano, Jetson AGX Xavier, and Jetson AGX Orin) demonstrate clear trade-offs between accuracy and inference latency across models and hardware configurations. Results show that high-capacity models achieve up to 86.9% mAP50 but suffer from high latency, limiting real-time deployment. In contrast, lightweight models achieve 66%-71% mAP50 with significantly lower latency, enabling real-time performance. Among all models, RT-DETRv2-R50-M achieves competitive accuracy (79% mAP50) with improved efficiency, while YOLOv10n provides the fastest inference speed. YOLOv11s and RT-DETRv2-R50-M offer the best balance between accuracy and speed, making them strong candidates for real-time UAV deployment.
Chinese Translation
杂草与作物竞争光照、水分和养分,降低产量和作物质量。高效的杂草检测对于特定区域杂草管理(SSWM)至关重要。尽管深度学习模型已在基于无人机的边缘系统中得到应用,但对于不同模型架构在现实世界资源限制下的表现缺乏系统性的理解。为了解决这一问题,本研究提出了一种面向部署的框架,用于在资源受限的边缘平台上进行实时无人机杂草检测。该框架整合了无人机数据采集、模型开发和设备内推理,重点在于平衡检测精度和计算效率。评估了一系列先进的目标检测模型,包括基于卷积的YOLO模型(v8-v12)和基于变换器的RT-DETR模型(v1-v2)。在三种边缘设备(Jetson Orin Nano、Jetson AGX Xavier和Jetson AGX Orin)上的实验表明,不同模型和硬件配置之间存在明显的精度与推理延迟的权衡。结果显示,高容量模型的mAP50可达到86.9%,但延迟较高,限制了实时部署。相比之下,轻量级模型的mAP50可达到66%-71%,且延迟显著较低,能够实现实时性能。在所有模型中,RT-DETRv2-R50-M在提高效率的同时实现了具有竞争力的精度(79% mAP50),而YOLOv10n则提供了最快的推理速度。YOLOv11s和RT-DETRv2-R50-M在精度和速度之间提供了最佳平衡,使其成为实时无人机部署的强有力候选者。
cs.CV / 91 / 2604.23452
From Edges to Depth: Probing the Spatial Hierarchy in Vision Transformers
从边缘到深度:探究视觉变换器中的空间层次结构
Abstract
Vision Transformers trained only on image classification routinely transfer to tasks that demand spatial understanding, yet they receive no spatial supervision during pretraining. We ask where and how robustly such structure is encoded. Probing a frozen ViT-B/16 layerwise for two complementary properties, local patch boundaries (BSDS500) and per-patch depth (NYU Depth V2), reveals a clear hierarchy: boundary structure becomes linearly decodable at layers 5-6 (AP = 0.833), while depth, which requires integrating global cues, peaks two to three layers later at layer 8 (MAE = 0.0875). Both signals collapse at the final classification layer, and random-weight controls confirm the encodings are learned rather than architectural. Causal interventions add specificity: ablating the single direction a linear depth probe reads degrades depth decoding by up to 165%, while ablating any other direction changes it by less than 1%. Targeted activation patching along that direction shows the depth signal is partially re-derived at each layer rather than passively carried in the residual stream, with mid-layer interventions persisting most strongly downstream. The result is that a classification-trained ViT develops an actively maintained spatial hierarchy that mirrors the early-to-late progression observed in the primate visual cortex.
Chinese Translation
仅在图像分类上训练的视觉变换器(Vision Transformers)通常能够迁移到需要空间理解的任务上,尽管在预训练期间并未接受任何空间监督。我们探讨这种结构是在哪里以及如何被稳健地编码的。逐层探测一个冻结的ViT-B/16模型的两个互补特性,即局部补丁边界(BSDS500)和每个补丁的深度(NYU Depth V2),揭示了一个明确的层次结构:边界结构在第5-6层变得可以线性解码(AP = 0.833),而深度信号需要整合全局线索,在第8层达到峰值(MAE = 0.0875),比边界结构晚两到三层。这两种信号在最后的分类层崩溃,随机权重控制实验确认这些编码是学习到的,而非架构性的。因果干预增加了特异性:消除线性深度探测器读取的单一方向使深度解码下降了多达165%,而消除其他任何方向的影响则变化不到1%。沿该方向的针对性激活修补显示深度信号在每一层都是部分重新推导的,而不是被动地保留在残差流中,中间层的干预在下游持续最为明显。结果是,一个经过分类训练的ViT发展出一个积极维护的空间层次结构,这与在灵长类动物视觉皮层中观察到的早期到晚期的进展相呼应。
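The causal intervention above, ablating the single direction that a linear depth probe reads, amounts to projecting that direction out of the activations. A minimal sketch, assuming the probe is a linear map with weight vector `direction` and that activations are stored as a (tokens, hidden_dim) array; the function name is hypothetical.

```python
import numpy as np

def ablate_direction(activations, direction):
    """Remove the component of each activation vector along `direction`.

    activations: array of shape (num_tokens, hidden_dim)
    direction:   array of shape (hidden_dim,), e.g. a linear probe's weight vector
    """
    d = direction / np.linalg.norm(direction)
    # Subtract each row's projection onto the unit vector d.
    return activations - np.outer(activations @ d, d)
```

After ablation, the probe's readout along that weight vector is constant (its bias only), while every orthogonal direction is left unchanged. That is what makes the comparison against ablating other directions a specificity control.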
cs.CV / 92 / 2604.23481
Leveraging Spatial Transcriptomics as Alternative to Manual Annotations for Deep Learning-Based Nuclei Analysis
利用空间转录组学作为深度学习基础的细胞核分析的手动注释替代方案
Abstract
Deep learning-based nuclei segmentation and classification in pathology images typically rely on large-scale pixel-level manual annotations, which are costly and difficult to obtain across diverse tissues and staining conditions. To address this limitation, we propose a framework that leverages spatial transcriptomics (ST) data as supervision for nuclei segmentation and classification. By incorporating cell-level ST data, we obtain gene expression profiles and corresponding nuclear masks from histopathological images. Gene expression profiles are converted into cell-type labels and used as training data for image-based classification. Because existing gene expression-based cell-type classification methods are not designed for image recognition, we introduce an image-oriented classification approach that bridges gene expression-based cell typing and image-based cell classification. To evaluate generalization, we conduct segmentation experiments on previously unseen organs and compare our method with conventional supervised models. Despite being trained on fewer organ types, our framework achieves higher segmentation accuracy, demonstrating strong transferability. Classification experiments further show consistent improvements over existing approaches.
Chinese Translation
基于深度学习的细胞核分割和分类在病理图像中通常依赖于大规模的像素级手动注释,这在不同组织和染色条件下成本高昂且难以获得。为了解决这一限制,我们提出了一个框架,利用空间转录组学(Spatial Transcriptomics, ST)数据作为细胞核分割和分类的监督。通过结合细胞级ST数据,我们从组织病理图像中获得基因表达谱和相应的核掩膜。基因表达谱被转换为细胞类型标签,并作为图像分类的训练数据。由于现有的基于基因表达的细胞类型分类方法并未针对图像识别进行设计,我们引入了一种面向图像的分类方法,以连接基于基因表达的细胞类型划分与基于图像的细胞分类。为了评估模型的泛化能力,我们在之前未见过的器官上进行分割实验,并将我们的方法与传统的监督模型进行比较。尽管在较少的器官类型上进行训练,我们的框架仍然实现了更高的分割准确性,展示了强大的迁移能力。分类实验进一步显示出相较于现有方法的一致性改进。
cs.CV / 93 / 2604.23508
BurstGP: Enhancing Raw Burst Image Super Resolution with Generative Priors
BurstGP:利用生成先验增强原始突发图像超分辨率
Abstract
Burst image super resolution (BISR) aims to construct a single high-resolution (HR) image by aggregating information from multiple low-resolution (LR) frames, relying on temporal redundancy and spatial coherence across the burst. While conventional methods achieve impressive results, they often struggle with complex textures and oversmoothing. Diffusion models, particularly those pretrained on high-quality data, have shown remarkable capability in generating realistic details for image and video super-resolution. However, their potential remains largely under-explored in BISR, where existing approaches typically rely on task-specific diffusion models trained from scratch and operate on single-frame reconstructions. In this work, we propose BurstGP, a novel diffusion-based solution for BISR, which leverages generative priors of recent foundation models to overcome these issues. In particular, we build a multiframe-aware diffusion model on top of a conventional BISR approach, which boosts image quality with minimal loss to fidelity. Further, we introduce (i) a novel degradation-aware conditioning mechanism, which controls synthesis of fine details based on the estimated degradation in the input, and (ii) a robust sRGB-to-lRGB inverter, enabling us to utilize generative multiframe (video) sRGB priors, while operating with raw input and lRGB output images. Empirically, we demonstrate that BurstGP outperforms the existing state of the art, both quantitatively (especially with respect to perceptual metrics, including MUSIQ and LPIPS) and qualitatively. In particular, our proposed method excels at recovering richer textures and finer structural details, highlighting the potential of video priors for BISR over traditional methods.
Chinese Translation
突发图像超分辨率(BISR)旨在通过聚合多个低分辨率(LR)帧的信息来构建单一高分辨率(HR)图像,依赖于突发图像中的时间冗余和空间一致性。虽然传统方法取得了令人印象深刻的结果,但在处理复杂纹理和过度平滑方面常常面临挑战。扩散模型,特别是那些在高质量数据上进行预训练的模型,在图像和视频超分辨率中展现了生成真实细节的显著能力。然而,它们在BISR中的潜力仍然未被充分探索,现有方法通常依赖于从头训练的特定任务扩散模型,并在单帧重建上进行操作。在本研究中,我们提出了BurstGP,一种基于扩散的BISR新解决方案,利用最近基础模型的生成先验来克服这些问题。具体而言,我们在传统BISR方法的基础上构建了一个多帧感知的扩散模型,以最小的保真度损失提升图像质量。此外,我们引入了(i)一种新颖的降解感知条件机制,根据输入中估计的降解控制细节的合成,以及(ii)一个稳健的sRGB到lRGB反转器,使我们能够利用生成的多帧(视频)sRGB先验,同时处理原始输入和lRGB输出图像。通过实证研究,我们证明了BurstGP在定量(尤其是在感知指标方面,包括MUSIQ和LPIPS)和定性上均优于现有的最先进技术。特别是,我们提出的方法在恢复更丰富的纹理和更精细的结构细节方面表现出色,突显了视频先验在BISR中相较于传统方法的潜力。
cs.CV / 94 / 2604.23532
Emotion-Conditioned Short-Horizon Human Pose Forecasting with a Lightweight Predictive World Model
基于轻量级预测世界模型的情感条件短期人类姿态预测
Abstract
Short-term human pose prediction plays a crucial role in interactive systems, assistive robots, and emotion-aware human-computer interaction[1-3]. While current trajectory prediction models primarily rely on geometric motion cues, they often overlook the underlying emotional signals influencing human motion dynamics[4-5]. This paper investigates whether facial expression-derived emotion embeddings can provide auxiliary conditional signals for short-term pose prediction. To further evaluate multimodal conditioning in a recursive prediction setting, we propose a lightweight autoregressive predictive world model that performs 15-step rolling pose prediction. This framework combines pose keypoints with emotion embeddings through a learnable gating mechanism and performs autoregressive unfolding prediction using a recurrent sequence model based on a two-layer LSTM architecture. Experiments were conducted on two small-scale pose-emotion video datasets: controlled motion sequences with minimal facial expression changes and natural emotion-driven motion sequences with considerable facial expression changes. The results show that simple multimodal fusion does not consistently improve prediction accuracy, while normalized gating fusion significantly enhances the performance of emotion-driven motion sequences. Furthermore, counterfactual perturbation experiments demonstrate that the predicted trajectory exhibits measurable sensitivity to changes in multimodal input, suggesting that facial expression embeddings act as auxiliary conditional signals rather than redundant features. In summary, these results indicate that incorporating facial expression-derived emotion embeddings into emotion-conditional short-term pose prediction based on a lightweight predictive world model architecture is a feasible approach.
Chinese Translation
短期人类姿态预测在交互系统、辅助机器人和情感感知的人机交互中扮演着至关重要的角色[1-3]。尽管当前的轨迹预测模型主要依赖几何运动线索,但它们往往忽视了影响人类运动动态的潜在情感信号[4-5]。本文探讨了基于面部表情衍生的情感嵌入是否可以为短期姿态预测提供辅助条件信号。为了进一步评估递归预测设置中的多模态条件化,我们提出了一种轻量级自回归预测世界模型,该模型执行15步滚动姿态预测。该框架通过可学习的门控机制将姿态关键点与情感嵌入结合,并使用基于双层LSTM架构的递归序列模型进行自回归展开预测。实验在两个小规模的姿态-情感视频数据集上进行:受控运动序列(面部表情变化最小)和自然情感驱动的运动序列(面部表情变化显著)。结果表明,简单的多模态融合并未始终提高预测准确性,而归一化门控融合显著提升了情感驱动运动序列的性能。此外,反事实扰动实验表明,预测轨迹对多模态输入变化表现出可测量的敏感性,这表明面部表情嵌入作为辅助条件信号,而非冗余特征。总之,这些结果表明,将基于面部表情衍生的情感嵌入纳入基于轻量级预测世界模型架构的情感条件短期姿态预测是一种可行的方法。
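The gated fusion and 15-step rollout described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' model: the scalar gate, the choice to L2-normalize both inputs before gating, and the `step_fn` callable (a stand-in for the two-layer LSTM) are all assumptions made for the example.

```python
import math

def l2_normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / (n + eps) for x in v]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_fusion(pose, emotion, gate_w, gate_b):
    """Fuse pose and emotion features with a learnable scalar gate.

    Assumption: both inputs are L2-normalized first, and the gate only
    scales the emotion branch. gate_w has len(pose) + len(emotion) entries.
    """
    p = l2_normalize(pose)
    e = l2_normalize(emotion)
    z = sum(w * x for w, x in zip(gate_w, p + e)) + gate_b
    g = sigmoid(z)
    return p + [g * x for x in e], g

def rollout(pose0, emotion, step_fn, gate_w, gate_b, horizon=15):
    """Autoregressive rollout: each predicted pose is fed back as input.

    step_fn is a hypothetical one-step predictor (fused features -> next pose).
    """
    poses, pose = [], pose0
    for _ in range(horizon):
        fused, _ = gated_fusion(pose, emotion, gate_w, gate_b)
        pose = step_fn(fused)
        poses.append(pose)
    return poses
```

With the gate weights at zero the gate sits at 0.5, so the emotion branch is half-weighted; a trained gate would move away from that value depending on how informative the emotion embedding is.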
cs.CV / 95 / 2604.23536
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
$Z^2$-采样:用于扩散模型中语义对齐的零成本锯齿轨迹
Abstract
Diffusion models have achieved unprecedented success in text-aligned generation, largely driven by Classifier-Free Guidance (CFG). However, standard CFG operates strictly on instantaneous gradients, omitting the intrinsic curvature of the data manifold. Recent methods like Zigzag-sampling (Z-Sampling) explicitly traverse multi-step forward-backward trajectories to probe this curvature, significantly improving semantic alignment. Yet, these explicit traversals triple the Neural Function Evaluation (NFE) cost and introduce unconstrained truncation errors from off-manifold evaluations, causing cumulative drift from the true marginal distribution. In this paper, we theoretically demonstrate that the explicit zigzag sequence is topologically reducible. We propose Implicit Z-Sampling, rigorously proving that intermediate states can be algebraically annihilated via operator dualities, physically eliminating off-manifold approximation errors. To push sampling efficiency to its theoretical lower bound, we introduce $Z^2$-Sampling (Zero-cost Zigzag Sampling). Exploiting the Probability Flow ODE's temporal coherence, $Z^2$-Sampling couples implicit algebraic collapse with a dynamically cached Temporal Semantic Surrogate. This restores the standard 2-NFE baseline without sacrificing semantic exploration. We formally prove via Backward Error Analysis that this discrete collapse inherently synthesizes a directional derivative curvature penalty. Finally, extensive evaluations demonstrate that $Z^2$-Sampling structurally shatters the performance-efficiency Pareto frontier. We validate its universal applicability across diverse architectures (U-Nets, DiTs) and modalities (image/video), establishing seamless orthogonality with advanced alignment frameworks (AYS, Diffusion-DPO).
Chinese Translation
扩散模型在文本对齐生成方面取得了前所未有的成功,这在很大程度上得益于无分类器引导(Classifier-Free Guidance, CFG)。然而,标准的CFG严格依赖于瞬时梯度,忽略了数据流形的内在曲率。最近的方法如锯齿采样(Zigzag-sampling, Z-Sampling)明确地沿多步前后轨迹遍历,以探测这种曲率,显著改善了语义对齐。然而,这些显式的遍历使神经函数评估(Neural Function Evaluation, NFE)成本增加了三倍,并引入了来自流形外评估的不受限截断误差,导致与真实边际分布的累积漂移。在本文中,我们理论上证明了显式锯齿序列在拓扑上是可约的。我们提出了隐式Z-采样(Implicit Z-Sampling),严格证明中间状态可以通过算子对偶被代数消除,从而物理上消除了流形外的近似误差。为了将采样效率推向其理论下限,我们引入了$Z^2$-采样(零成本锯齿采样)。利用概率流常微分方程(Probability Flow ODE)的时间一致性,$Z^2$-采样将隐式代数崩溃与动态缓存的时间语义替代物耦合。这在不牺牲语义探索的情况下恢复了标准的2-NFE基线。我们通过反向误差分析(Backward Error Analysis)正式证明,这种离散崩溃本质上合成了方向导数曲率惩罚。最后,大量评估表明,$Z^2$-采样在结构上打破了性能-效率的帕累托前沿。我们验证了其在各种架构(U-Nets, DiTs)和模态(图像/视频)中的普遍适用性,并建立了与先进对齐框架(AYS, Diffusion-DPO)的无缝正交性。
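For reference, the "standard 2-NFE baseline" mentioned above corresponds to the usual Classifier-Free Guidance update, which needs one conditional and one unconditional network evaluation per sampling step. The notation below is an assumption for illustration ($\epsilon_\theta$ for the noise predictor, $c$ for the text condition, $\varnothing$ for the null condition, $w$ for the guidance scale); some papers parameterize the scale differently.

```latex
\hat{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + w \,\bigl( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr)
```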
cs.CV / 96 / 2604.23540
Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
Oracle噪声:用于可解释潜在优化的快速语义球面对齐
Abstract
Text-to-image diffusion models have achieved remarkable generative capabilities, yet accurately aligning complex textual prompts with synthesized layouts remains an ongoing challenge. In these models, the initial Gaussian noise acts as a critical structural seed dictating the macroscopic layout. Recent online optimization and search methods attempt to refine this noise to enhance text-image alignment. However, relying on unconstrained Euclidean gradient ascent mathematically inflates the latent norm and destroys the standard Gaussian prior, causing severe visual artifacts like color over-saturation. Furthermore, these methods suffer from inefficient semantic routing and easily fall into the "reward hacking" trap of external proxy models. To address these intertwined bottlenecks, we propose Oracle Noise, a zero-shot framework reframing noise initialization as semantic-driven optimization strictly confined to a Riemannian hypersphere. Instead of relying on complex external parsers, we directly identify the most impactful structural words in the prompt to efficiently route optimization energy. By updating the noise strictly along a spherical path, we mathematically preserve the original Gaussian distribution. This geometric constraint eliminates norm inflation and unlocks aggressive step sizes for rapid convergence. Extensive experiments demonstrate that Oracle Noise significantly accelerates semantic alignment and achieves superior aesthetics without black-box models. It completely mitigates Euclidean-induced degradation, establishing state-of-the-art performance across human preference metrics (e.g., HPSv2, ImageReward), semantic alignment (CLIP Score), and sample diversity, all within a strict 2-second optimization budget.
Chinese Translation
文本到图像的扩散模型已实现显著的生成能力,但准确对齐复杂的文本提示与合成布局仍然是一个持续的挑战。在这些模型中,初始的高斯噪声作为一个关键的结构种子,决定了宏观布局。最近的在线优化和搜索方法试图精炼这种噪声,以增强文本与图像的对齐。然而,依赖于不受约束的欧几里得梯度上升在数学上膨胀了潜在范数,并破坏了标准高斯先验,导致严重的视觉伪影,如颜色过饱和。此外,这些方法在语义路由上效率低下,容易陷入外部代理模型的“奖励黑客”陷阱。为了解决这些相互交织的瓶颈,我们提出了Oracle噪声,一个零-shot框架,将噪声初始化重新构建为严格限制在黎曼超球面上的语义驱动优化。我们不依赖复杂的外部解析器,而是直接识别提示中最具影响力的结构词,以高效地引导优化能量。通过沿着球面路径严格更新噪声,我们在数学上保持了原始的高斯分布。这一几何约束消除了范数膨胀,并解锁了快速收敛的激进步长。大量实验表明,Oracle噪声显著加速了语义对齐,并在没有黑箱模型的情况下实现了优越的美学效果。它完全缓解了由欧几里得引起的退化,在人类偏好指标(如HPSv2、ImageReward)、语义对齐(CLIP Score)和样本多样性等方面建立了最先进的性能,所有这些都在严格的2秒优化预算内完成。
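The norm-preserving update described above can be illustrated with a generic great-circle step. This is a sketch under stated assumptions, not the Oracle Noise implementation: the function name, the fixed angular step, and the use of a raw gradient vector are all hypothetical. The idea is to project the gradient onto the tangent space of the sphere and rotate the latent toward it, so the norm stays fixed, whereas a plain Euclidean step `z + step * grad` generally changes it.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def spherical_step(z, grad, step):
    """Rotate z toward grad along a great circle, keeping ||z|| fixed.

    step is the rotation angle in radians (an assumption for this sketch).
    """
    r = norm(z)
    # Remove the radial component of the gradient (tangent-space projection).
    radial = sum(a * b for a, b in zip(z, grad)) / (r * r)
    tangent = [g - radial * a for g, a in zip(grad, z)]
    t_norm = norm(tangent)
    if t_norm < 1e-12:
        return list(z)  # gradient is purely radial; nothing to rotate toward
    u = [t / t_norm for t in tangent]
    # z and u are orthogonal, so the result has norm exactly r.
    return [math.cos(step) * a + math.sin(step) * r * b for a, b in zip(z, u)]
```

Because a high-dimensional standard Gaussian sample concentrates near a sphere of radius roughly the square root of the dimension, keeping the norm fixed is one way to stay close to where the sampler expects its initial noise to live.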
cs.CV / 97 / 2604.23542
AusSmoke meets MultiNatSmoke: a fully-labelled diverse smoke segmentation dataset
AusSmoke与MultiNatSmoke:一个完全标注的多样化烟雾分割数据集
Abstract
Wildfires are an escalating global concern due to the devastating impacts on the environment, economy, and human health, with notable incidents such as the 2019-2020 Australian bushfires and the 2025 California wildfires underscoring the severity of these events. AI-enabled camera-based smoke detection has emerged as a promising approach for the rapid detection of wildfires. However, existing wildfire smoke segmentation datasets that are used for training detection and segmentation models are limited in scale, geographically constrained, and often rely on synthetic imagery, which hinders effective training and generalization. To overcome these limitations, we present AusSmoke, a new smoke segmentation dataset collected from Australia to address the data scarcity in this region. Furthermore, we introduce a MultiNational geographically diverse and substantially larger fully-labelled benchmark, called MultiNatSmoke, that consolidates publicly available international datasets with the newly collected Australian imagery, expanding the scale by an order of magnitude over previous collections. Finally, we benchmark smoke segmentation models, demonstrating improved performance and enhanced generalization across diverse geographical contexts. The project is available at https://github.com/henryzhao0615/MultiNatSmoke.
Chinese Translation
野火因其对环境、经济和人类健康的毁灭性影响而成为全球日益严重的关注问题,2019-2020年澳大利亚丛林大火和2025年加利福尼亚州野火等显著事件突显了这些事件的严重性。基于人工智能的摄像头烟雾检测已成为快速检测野火的有前景的方法。然而,现有用于训练检测和分割模型的野火烟雾分割数据集在规模上有限,地理上受限,并且通常依赖于合成图像,这阻碍了有效的训练和泛化。为了解决这些限制,我们提出了AusSmoke,这是一个从澳大利亚收集的新烟雾分割数据集,以应对该地区的数据稀缺。此外,我们还介绍了一个名为MultiNatSmoke的多国地理多样性且规模显著更大的完全标注基准,它整合了公开可用的国际数据集与新收集的澳大利亚图像,将规模扩大了一个数量级。最后,我们对烟雾分割模型进行了基准测试,展示了在不同地理背景下的性能提升和泛化能力增强。该项目可在 https://github.com/henryzhao0615/MultiNatSmoke 上获取。
cs.CV / 98 / 2604.23546
COMO: Closed-Loop Optical Molecule Recognition with Minimum Risk Training
COMO:最小风险训练的闭环光学分子识别
Abstract
Optical chemical structure recognition (OCSR) translates molecular images into machine-readable representations like SMILES strings or molecular graphs, but remains challenging in real-world documents due to inexhaustible variations in chemical structures, shorthand conventions, and visual noise. Most existing deep-learning-based approaches rely on teacher forcing with token-level Maximum Likelihood Estimation (MLE). This training paradigm suffers from exposure bias, as models are trained under ground-truth prefixes but must condition on their own previous predictions during inference. Moreover, token-level MLE objectives hinder the optimization towards molecular-level evaluation criteria such as chemical validity and structural similarity. Here we introduce Minimum Risk Training (MRT) to OCSR and propose COMO (Closed-loop Optical Molecule recOgnition), a closed-loop framework that mitigates exposure bias by directly optimizing over molecule-level, non-differentiable objectives, by iteratively sampling and evaluating the model's own predictions. Experiments on ten benchmarks including synthetic and real-world chemical diagrams from patent and scientific literature demonstrate that COMO substantially outperforms existing rule-based and learning-based methods with less training data. Ablation studies further show that MRT is architecture-agnostic, demonstrating its potential for broad application to end-to-end OCSR systems.
Chinese Translation
光学化学结构识别(OCSR)将分子图像转换为机器可读的表示,如SMILES字符串或分子图,但由于化学结构的无尽变化、简写惯例和视觉噪声,在实际文档中仍然具有挑战性。大多数现有的基于深度学习的方法依赖于带有标记级最大似然估计(MLE)的教师强制训练。这种训练范式存在曝光偏差,因为模型是在真实前缀下训练的,但在推理时必须依赖于自己之前的预测。此外,标记级MLE目标阻碍了朝着分子级评估标准(如化学有效性和结构相似性)的优化。在此,我们将最小风险训练(MRT)引入OCSR,并提出COMO(闭环光学分子识别),这是一个闭环框架,通过直接优化分子级的非可微目标,迭代地对模型自身的预测进行采样和评估,从而减轻曝光偏差。在包括来自专利和科学文献的合成和真实化学图的十个基准测试中的实验表明,COMO在使用更少训练数据的情况下显著优于现有的基于规则和基于学习的方法。消融研究进一步表明,MRT与架构无关,展示了其在端到端OCSR系统中广泛应用的潜力。
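The Minimum Risk Training objective referred to above is, in its common formulation, the expected cost of the model's own sampled outputs under a renormalized distribution over those samples. The sketch below computes only that scalar; in actual training the gradient would flow through the log-probabilities. The function name, the sharpness parameter `alpha`, and the idea of using a cost such as one minus a molecular similarity score are assumptions for illustration, not details taken from COMO.

```python
import math

def mrt_loss(log_probs, costs, alpha=1.0):
    """Expected cost over sampled candidates.

    log_probs: model log-probability of each sampled candidate sequence.
    costs:     non-differentiable cost of each candidate (lower is better).
    alpha:     sharpness of the renormalized distribution (assumed hyperparameter).
    """
    scaled = [alpha * lp for lp in log_probs]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    q = [w / total for w in weights]
    return sum(qi * ci for qi, ci in zip(q, costs))
```

Lowering this loss shifts probability mass toward candidates with low cost, which is how a sequence-level criterion can shape training even though the criterion itself has no gradient.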
cs.CV / 99 / 2604.23551
Spatiotemporal Degradation-Aware 3D Gaussian Splatting for Realistic Underwater Scene Reconstruction
考虑时空退化的3D高斯点云技术用于真实感水下场景重建
Abstract
Reconstructing realistic underwater scenes from underwater video remains a meaningful yet challenging task in the multimedia domain. The inherent spatiotemporal degradations in underwater imaging, including caustics, flickering, attenuation, and backscattering, frequently result in inaccurate geometry and appearance in existing 3D reconstruction methods. While a few recent works have explored underwater degradation-aware reconstruction, they often address either spatial or temporal degradation alone, falling short in more real-world underwater scenarios where both types of degradation occur. We propose MarineSTD-GS, a novel 3D Gaussian Splatting-based framework that explicitly models both temporal and spatial degradations for realistic underwater scene reconstruction. Specifically, we introduce two paired Gaussian primitives: Intrinsic Gaussians represent the true scene, while Degraded Gaussians render the degraded observations. The color of each Degraded Gaussian is physically derived from its paired Intrinsic Gaussian via a Spatiotemporal Degradation Modeling (SDM) module, enabling self-supervised disentanglement of realistic appearance from degraded images. To ensure stable training and accurate geometry, we further propose a Depth-Guided Geometry Loss and a Multi-Stage Optimization strategy. We also construct a simulated benchmark with diverse spatial and temporal degradations and ground-truth appearances for comprehensive evaluation. Experiments on both simulated and real-world datasets show that MarineSTD-GS robustly handles spatiotemporal degradations and outperforms existing methods in novel view synthesis with realistic, water-free scene appearances.
Chinese Translation
从水下视频重建真实感水下场景仍然是多媒体领域中一项有意义但具有挑战性的任务。水下成像中固有的时空退化现象,包括光晕、闪烁、衰减和反向散射,常常导致现有3D重建方法中的几何形状和外观不准确。尽管一些近期研究探索了考虑水下退化的重建方法,但它们往往仅关注空间或时间退化,未能在更真实的水下场景中同时处理这两种退化。我们提出了MarineSTD-GS,这是一种基于3D高斯点云的新框架,明确建模时空退化以实现真实感水下场景重建。具体而言,我们引入了两种配对的高斯原语:内在高斯(Intrinsic Gaussians)表示真实场景,而退化高斯(Degraded Gaussians)则呈现退化的观测结果。每个退化高斯的颜色通过时空退化建模(Spatiotemporal Degradation Modeling, SDM)模块从其配对的内在高斯物理推导而来,使得从退化图像中自监督地解耦真实外观成为可能。为了确保稳定的训练和准确的几何形状,我们进一步提出了深度引导几何损失(Depth-Guided Geometry Loss)和多阶段优化策略(Multi-Stage Optimization strategy)。我们还构建了一个具有多样化时空退化和真实外观的模拟基准,以便进行全面评估。在模拟和真实世界数据集上的实验表明,MarineSTD-GS能够稳健地处理时空退化,并在新视角合成中以真实的无水场景外观超越现有方法。
cs.CV / 100 / 2604.23574
PhysLayer: Language-Guided Layered Animation with Depth-Aware Physics
PhysLayer:基于语言指导的深度感知物理分层动画
Abstract
Existing image-to-video generation methods often produce physically implausible motions and lack precise control over object dynamics. While prior approaches have incorporated physics simulators, they remain confined to 2D planar motions and fail to capture depth-aware spatial interactions. We introduce PhysLayer, a novel framework enabling language-guided, depth-aware layered animation of static images. PhysLayer consists of three key components: First, a language-guided scene understanding module that utilizes vision foundation models to decompose scenes into depth-based layers by analyzing object composition, material properties, and physical parameters. Second, a depth-aware layered physics simulation that extends 2D rigid-body dynamics with depth motion and perspective-consistent scaling, enabling more realistic object interactions without requiring full 3D reconstruction. Third, a physics-guided video synthesis module that integrates simulated trajectories with scene-aware relighting for temporally coherent results. Experimental results demonstrate improvements in CLIP-Similarity (+2.2%), FID score (+9.3%), and Motion-FID (+3%), with human evaluation showing enhanced physical plausibility (+24%) and text-video alignment (+35%). Our approach provides a practical balance between physical realism and computational efficiency for controllable image animation.
Chinese Translation
现有的图像到视频生成方法往往产生不符合物理规律的运动,并且缺乏对物体动态的精确控制。尽管之前的方法已结合了物理模拟器,但仍然局限于二维平面运动,无法捕捉深度感知的空间交互。我们提出了PhysLayer,一个新颖的框架,能够实现基于语言指导的深度感知静态图像分层动画。PhysLayer由三个关键组件组成:首先,一个基于语言指导的场景理解模块,利用视觉基础模型通过分析物体组成、材料属性和物理参数,将场景分解为基于深度的层次。其次,一个深度感知的分层物理模拟,扩展了二维刚体动力学,加入了深度运动和透视一致的缩放,使得物体交互更加真实,而无需完整的三维重建。第三,一个物理指导的视频合成模块,将模拟的轨迹与场景感知的重光照结合,以实现时间一致的结果。实验结果表明,CLIP相似度提高了2.2%,FID分数提高了9.3%,Motion-FID提高了3%,人类评估显示物理合理性提升了24%,文本与视频的对齐度提升了35%。我们的方法在可控图像动画中提供了物理真实感与计算效率之间的实用平衡。
cs.CV / 101 / 2604.23584
Identity-Decoupled Anonymization for Visual Evidence in Multi-modal Retrieval-Augmented Generation
用于多模态检索增强生成的身份解耦匿名化
Abstract
Multi-modal retrieval-augmented generation (MRAG) systems retrieve visual evidence from large image corpora to ground the responses of large multi-modal models, yet the retrieved images frequently contain human faces whose identities constitute sensitive personal information. Existing anonymization techniques either destroy the non-identity visual cues that downstream reasoning depends on or fail to provide principled privacy guarantees. We propose Identity-Decoupled MRAG, a framework that interposes a generative anonymization module between retrieval and generation. Our approach consists of three components: (i) a disentangled variational encoder that factorizes each face into an identity code and a spatially-structured attribute code, regularized by a mutual-information penalty and a gradient-based independence term; (ii) a manifold-aware rejection sampler that replaces the identity code with a synthetic one guaranteed to be both distinct from the original and realistic; and (iii) a conditional latent diffusion generator that synthesizes the anonymized face from the replacement identity and the preserved attributes, distilled into a latent consistency model for low-latency deployment. Privacy is enforced through a multi-oracle ensemble of face recognition models with a hinge-based loss that halts optimization once identity similarity drops below the impostor-regime threshold.
Chinese Translation
多模态检索增强生成(MRAG)系统从大型图像库中检索视觉证据,以支持大型多模态模型的响应,但所检索的图像通常包含人脸,其身份构成敏感的个人信息。现有的匿名化技术要么破坏了下游推理所依赖的非身份视觉线索,要么未能提供有原则的隐私保障。我们提出了身份解耦的MRAG框架,该框架在检索和生成之间插入了一个生成式匿名化模块。我们的方法由三个组成部分构成:(i)一个解耦的变分编码器,将每张人脸分解为身份编码和空间结构属性编码,并通过互信息惩罚和基于梯度的独立性项进行正则化;(ii)一个流形感知的拒绝采样器,用于用一个合成的身份编码替换原身份编码,确保其既与原身份不同又具有现实性;(iii)一个条件潜在扩散生成器,从替换的身份和保留的属性合成匿名化的人脸,并将其提炼为低延迟部署的潜在一致性模型。通过一个多重神谕集成的面部识别模型,采用基于铰链的损失来强制隐私,一旦身份相似度降至冒名顶替者阈值以下,优化过程将停止。
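The hinge-based privacy term described above can be sketched as follows. This is an illustration under assumptions: cosine similarity as the identity metric, a plain mean over oracles, and the function names are all hypothetical rather than taken from the paper. The useful property of the hinge is that the loss becomes exactly zero once every oracle reports similarity below the threshold, so there is no further pressure to distort the face.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def privacy_hinge_loss(orig_embs, anon_embs, threshold):
    """Mean over face-recognition oracles of max(0, similarity - threshold).

    orig_embs / anon_embs: one embedding per oracle for the original and
    the anonymized face (assumed to be precomputed by each oracle).
    """
    losses = [
        max(0.0, cosine(o, a) - threshold)
        for o, a in zip(orig_embs, anon_embs)
    ]
    return sum(losses) / len(losses)
```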
cs.CV / 102 / 2604.23586
Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling
Talker-T2AV:基于自回归扩散建模的联合语音音频-视频生成
Abstract
Joint audio-video generation models have shown that unified generation yields stronger cross-modal coherence than cascaded approaches. However, existing models couple modalities throughout denoising via pervasive attention, treating high-level semantics and low-level details in a fully entangled manner. This is suboptimal for talking head synthesis: while audio and facial motion are semantically correlated, their low-level realizations (acoustic signals and visual textures) follow distinct rendering processes. Enforcing joint modeling across all levels causes unnecessary entanglement and reduces efficiency. We propose Talker-T2AV, an autoregressive diffusion framework where high-level cross-modal modeling occurs in a shared backbone, while low-level refinement uses modality-specific decoders. A shared autoregressive language model jointly reasons over audio and video in a unified patch-level token space. Two lightweight diffusion transformer heads decode the hidden states into frame-level audio and video latents. Experiments on talking portrait benchmarks show Talker-T2AV outperforms dual-branch baselines in lip-sync accuracy, video quality, and audio quality, achieving stronger cross-modal consistency than cascaded pipelines.
Chinese Translation
联合音频-视频生成模型表明,统一生成比级联方法具有更强的跨模态一致性。然而,现有模型通过普遍的注意力机制在去噪过程中耦合模态,以完全纠缠的方式处理高层语义和低层细节。这对于说话头合成并不理想:尽管音频和面部运动在语义上是相关的,但它们的低层实现(声学信号和视觉纹理)遵循不同的渲染过程。在所有层次上强制联合建模会导致不必要的纠缠并降低效率。我们提出了Talker-T2AV,一种自回归扩散框架,其中高层跨模态建模发生在共享主干网络中,而低层细化则使用特定模态的解码器。一个共享的自回归语言模型在统一的补丁级标记空间中共同推理音频和视频。两个轻量级的扩散变换器头将隐藏状态解码为帧级音频和视频潜变量。在说话肖像基准测试中的实验表明,Talker-T2AV在唇同步精度、视频质量和音频质量方面优于双分支基线,达到了比级联管道更强的跨模态一致性。
cs.CV / 103 / 2604.23604
Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation
学习识别3D LiDAR异常分割中的分布外物体
Abstract
Understanding the surrounding environment is fundamental in autonomous driving and robotic perception. Distinguishing between known classes and previously unseen objects, the goal of anomaly segmentation, is crucial in real-world environments. However, research in the 3D field remains limited, with most existing approaches applying post-processing techniques from 2D vision. To address this shortcoming, we propose a new, efficient approach that operates directly in the feature space, modeling the feature distribution of inlier classes to constrain anomalous samples. Moreover, the only publicly available 3D LiDAR anomaly segmentation dataset contains simple scenarios, with few anomaly instances, and exhibits a severe domain gap due to its sensor resolution. To bridge this gap, we introduce a set of mixed real-synthetic datasets for 3D LiDAR anomaly segmentation, built upon established semantic segmentation benchmarks, with multiple out-of-distribution objects and diverse, complex environments. Extensive experiments demonstrate that our approach achieves state-of-the-art and competitive results on the existing real-world dataset and the newly introduced mixed datasets, respectively, validating the effectiveness of our method and the utility of the proposed datasets. Code and datasets are available at https://simom0.github.io/lido-page/.
Chinese Translation
理解周围环境是自主驾驶和机器人感知的基础。在现实环境中,区分已知类别和之前未见物体对于异常分割至关重要。然而,3D领域的研究仍然有限,大多数现有方法采用来自2D视觉的后处理技术。为了解决这一不足,我们提出了一种新的高效方法,直接在特征空间中操作,建模内点类别的特征分布以约束异常样本。此外,唯一公开可用的3D LiDAR异常分割数据集包含简单场景,异常实例较少,并且由于传感器分辨率存在严重的领域差距。为了弥补这一差距,我们引入了一组混合真实-合成数据集,用于3D LiDAR异常分割,基于已建立的语义分割基准,包含多个分布外物体和多样化、复杂的环境。大量实验表明,我们的方法在现有的真实世界数据集和新引入的混合数据集上均取得了最先进和具有竞争力的结果,验证了我们方法的有效性和所提数据集的实用性。代码和数据集可在 https://simom0.github.io/lido-page/ 获取。
cs.CV / 104 / 2604.23612
Comparative Study of Weighted and Coupled Second- and Fourth-Order PDEs for Image Despeckling in Grayscale, Color, SAR, and Ultrasound
加权和耦合二阶与四阶偏微分方程在灰度、彩色、合成孔径雷达和超声图像去噪中的比较研究
Abstract
Partial Differential Equation (PDE)-based approaches have gained significant attention in image despeckling due to their strong capability to preserve structural details while suppressing noise. However, conventional second-order PDE models tend to generate blocky artifacts, whereas higher-order models often introduce speckle patterns. To address both issues, this paper proposes and comparatively analyzes two advanced PDE-based frameworks designed to suppress speckle noise while preserving fine edges. The first model introduces a novel weighted formulation that combines second- and fourth-order PDEs through a weighting parameter. The second-order diffusion coefficient employs grayscale and gradient-based indicators, while the fourth-order term is guided solely by a Laplacian-based indicator. The second model constructs a coupled PDE framework, where independent fourth- and second-order components are explicitly solved in an iterative manner. In this coupled structure, each diffusion coefficient is defined separately to enhance adaptability in varying image regions. Both models are implemented using the explicit finite difference method. The proposed techniques are extensively evaluated on a variety of datasets, including standard grayscale, color, Synthetic Aperture Radar (SAR), and ultrasound images. Comparative experiments with the existing Telegraph Diffusion Model (TDM) and Fourth-Order Telegraph Diffusion Model (TDFM) demonstrate the superiority of the proposed approaches in reducing speckle noise while effectively preserving fine image structures and edges. Quantitative evaluations using PSNR, SSIM, and Speckle Index metrics confirm that the proposed models produce higher image quality and enhanced visual perception. Overall, the presented PDE-based formulations provide a reliable and efficient framework for image despeckling in both natural and medical imaging.
Chinese Translation
基于偏微分方程(PDE)的方法在图像去噪中受到广泛关注,因为它们在抑制噪声的同时能够有效保留结构细节。然而,传统的二阶PDE模型往往会产生块状伪影,而高阶模型则常常引入斑点模式。为了解决这个问题,本文提出并比较分析了两种先进的基于PDE的框架,旨在抑制斑点噪声的同时保留细微边缘。第一个模型引入了一种新颖的加权公式,通过加权参数将二阶和四阶PDE结合在一起。二阶扩散系数采用灰度和基于梯度的指标,而四阶项则仅由拉普拉斯基指标引导。第二个模型构建了一个耦合PDE框架,其中独立的四阶和二阶分量以迭代方式显式求解。在这个耦合结构中,每个扩散系数被单独定义,以增强在不同图像区域的适应性。两个模型均采用显式有限差分法进行实现。所提技术在多种数据集上进行了广泛评估,包括标准灰度、彩色、合成孔径雷达(SAR)和超声图像。与现有的电报扩散模型(TDM)和四阶电报扩散模型(TDFM)的比较实验表明,所提方法在减少斑点噪声的同时有效保留了细微图像结构和边缘。使用峰值信噪比(PSNR)、结构相似性指数(SSIM)和斑点指数等定量评估确认,所提模型产生了更高的图像质量和增强的视觉感知。总体而言,所提出的基于PDE的公式为自然和医学成像中的图像去噪提供了可靠且高效的框架。
cs.CV / 105 / 2604.23622
A Synergistic CNN-Transformer Network with Pooling Attention Fusion for Hyperspectral Image Classification
一种协同的CNN-Transformer网络与池化注意力融合用于高光谱图像分类
Abstract
In the hyperspectral image (HSI) classification task, each pixel is categorized into a specific land-cover category or material. Convolutional neural networks (CNNs) and transformers have been widely used to extract local and non-local features in HSI classification. Recent works have utilized a multi-scale vision transformer (ViT) to enhance spectral feature capture and yield promising results. However, most existing methods still face challenges in the effective joint use of spatial-spectral information and in preserving information across layers during the propagation process. To address these issues, we propose a synergistic CNN-Transformer network with pooling attention fusion for HSI classification, which collaboratively utilizes CNNs and ViT to process spatial and spectral features separately. Specifically, we propose a Twin-Branch Feature Extraction (TBFE) module, which employs 3D and 2D convolution in parallel to comprehensively extract spectral and spatial features from HSI. A hybrid pooling attention (HPA) module is designed to aggregate spatial attention. Moreover, a cascade transformer encoder is employed for global spectral feature extraction, and a simple yet efficient cross-layer feature fusion (CFF) module is designed to reduce the loss of crucial information in the previous network layers. Extensive experiments are conducted on several representative datasets to demonstrate the superior performance of our proposed method compared to the state-of-the-art works. Code is available at https://github.com/chenpeng052/SCT-Net.git.
Chinese Translation
在高光谱图像(HSI)分类任务中,每个像素被分类为特定的土地覆盖类别或材料。卷积神经网络(CNN)和变换器(transformers)已被广泛应用于高光谱图像分类中的局部和非局部特征提取。近期的研究利用多尺度视觉变换器(ViT)来增强光谱特征的捕获,并取得了良好的效果。然而,大多数现有方法在有效联合使用空间-光谱信息以及在传播过程中保持层间信息方面仍面临挑战。为了解决这些问题,我们提出了一种协同的CNN-Transformer网络与池化注意力融合用于高光谱图像分类,该网络协同利用CNN和ViT分别处理空间和光谱特征。具体而言,我们提出了一个双分支特征提取(Twin-Branch Feature Extraction, TBFE)模块,该模块采用3D和2D卷积并行工作,以全面提取高光谱图像的光谱和空间特征。设计了一个混合池化注意力(Hybrid Pooling Attention, HPA)模块以聚合空间注意力。此外,采用级联变换器编码器进行全局光谱特征提取,并设计了一个简单而高效的跨层特征融合(Cross-Layer Feature Fusion, CFF)模块,以减少前一网络层中关键信息的损失。在多个代表性数据集上进行了大量实验,以证明我们提出的方法相较于最先进的工作具有优越的性能。代码可在 https://github.com/chenpeng052/SCT-Net.git 获取。
cs.CV / 106 / 2604.23632
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
Hallo-Live:基于异步双流和以人为本的偏好蒸馏的实时流媒体联合音视频化身生成
Abstract
Real-time text-driven joint audio-video avatar generation requires jointly synthesizing portrait video and speech with high fidelity and precise synchronization, yet existing audio-visual diffusion models remain too slow for interactive use and often degrade noticeably after aggressive acceleration. We present Hallo-Live, a streaming framework for joint audio-visual avatar generation that combines asynchronous dual-stream diffusion with human-centric preference-guided distillation. To reduce articulation lag in causal generation, we introduce Future-Expanding Attention, which allows each video block to access synchronous audio together with a short horizon of future phonetic cues. To mitigate the quality loss of few-step distillation, we further propose Human-Centric Preference-Guided DMD (HP-DMD), which reweights training samples using rewards from visual fidelity, speech naturalness, and audio-visual synchronization. On two NVIDIA H200 GPUs, Hallo-Live runs at 20.38 FPS with 0.94 seconds latency, yielding 16.0x higher throughput and 99.3x lower latency than the teacher model Ovi. Despite this speedup, it retains strong generation quality, reaching comparable VideoAlign overall score and Sync Confidence score while outperforming other accelerated baselines in the overall quality-efficiency trade-off. Qualitative results further show robust generalization across photorealistic, multi-speaker, and stylized scenarios. To the best of our knowledge, Hallo-Live is the first framework to combine streaming dual-stream diffusion with preference-guided distillation for real-time, text-driven audio-visual generation.
Chinese Translation
实时文本驱动的联合音视频化身生成需要高保真度和精确同步地联合合成肖像视频和语音,但现有的音视频扩散模型在交互使用时速度过慢,并且在激进加速后往往显著降级。我们提出了Hallo-Live,这是一种联合音视频化身生成的流媒体框架,结合了异步双流扩散与以人为本的偏好引导蒸馏。为了减少因因果生成导致的发音延迟,我们引入了未来扩展注意力(Future-Expanding Attention),使每个视频块能够访问同步音频以及一小段未来语音线索。为了减轻少步蒸馏的质量损失,我们进一步提出了以人为本的偏好引导动态模式蒸馏(HP-DMD),通过视觉保真度、语音自然性和音视频同步的奖励重新加权训练样本。在两块NVIDIA H200 GPU上,Hallo-Live以20.38 FPS的速度运行,延迟为0.94秒,吞吐量比教师模型Ovi高出16.0倍,延迟低99.3倍。尽管速度提升,生成质量依然强劲,整体VideoAlign评分和同步置信度评分相当,同时在整体质量与效率的权衡中优于其他加速基线。定性结果进一步显示在照片真实感、多说话者和风格化场景中具有强大的泛化能力。据我们所知,Hallo-Live是第一个将流媒体双流扩散与偏好引导蒸馏结合用于实时文本驱动音视频生成的框架。
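One way to picture the Future-Expanding Attention described above is as a block-level attention mask in which each video block sees its time-aligned audio block plus a short look-ahead window. The sketch below is an assumption-laden illustration, not the Hallo-Live code: it assumes video block `v` is aligned with audio block `v`, that all earlier audio blocks are also visible, and that `horizon` counts look-ahead blocks.

```python
def future_expanding_mask(num_video_blocks, num_audio_blocks, horizon):
    """Boolean mask[v][a]: may video block v attend to audio block a?

    A purely causal mask would allow a <= v; here the window is widened
    by `horizon` future audio blocks (a hypothetical parameter).
    """
    return [
        [a <= v + horizon for a in range(num_audio_blocks)]
        for v in range(num_video_blocks)
    ]
```

Setting `horizon=0` recovers a strictly causal mask, which makes it easy to compare the two settings when probing articulation lag.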
cs.CV / 107 / 2604.23636
Discriminator-Guided Adaptive Diffusion for Source-Free Test-Time Adaptation under Image Corruptions
基于判别器引导的自适应扩散方法用于图像损坏下的无源测试时间适应
Abstract
In this work, we study Source-Free Unsupervised Domain Adaptation under corruption-induced domain shifts, where performance degradation is caused by natural image corruptions that go beyond additive noise, including blur, weather effects, and digital artifacts. We propose a diffusion-based, input-level adaptation framework that operates entirely at test time and keeps all source-trained models frozen, explicitly targeting robustness to corrupted target inputs. Our method leverages a source-trained diffusion model as a generative prior and introduces a discriminator-guided adaptive diffusion strategy that dynamically controls the amount of perturbation applied to each test sample. Rather than relying on a fixed diffusion depth, the discriminator determines, on a per-image basis, when sufficient forward diffusion has been applied to suppress corruption-specific artifacts, with each corruption type effectively defining a distinct target domain. This adaptive stopping mechanism applies only the necessary amount of noise to remove domain-specific corruption while preserving class-discriminative structure. The reverse diffusion process then reconstructs a source-aligned image, optionally stabilized through structural guidance, which is classified using a frozen source-trained classifier. We evaluate the proposed approach across a broad spectrum of corruption-induced target domains, covering 15 diverse corruption types, and demonstrate more balanced robustness with competitive or improved performance across non-noise corruptions. Additional analyses reveal how the adaptive diffusion schedule responds to different corruption characteristics, highlighting the practicality, generality, and robustness of the proposed framework. The code is publicly available at https://github.com/fmolivato/dgadiffusion/.
Chinese Translation
在本研究中,我们探讨了在由损坏引起的领域转移下的无源无监督领域适应,其中性能下降是由于超出加性噪声的自然图像损坏引起的,包括模糊、天气效应和数字伪影。我们提出了一种基于扩散的输入级适应框架,该框架完全在测试时运行,并保持所有源训练模型不变,明确针对对损坏目标输入的鲁棒性。我们的方法利用源训练的扩散模型作为生成先验,并引入了一种判别器引导的自适应扩散策略,动态控制施加于每个测试样本的扰动量。与依赖固定扩散深度不同,判别器根据每张图像的情况确定何时施加足够的前向扩散以抑制特定损坏伪影,每种损坏类型有效地定义了一个独特的目标领域。该自适应停止机制仅施加必要的噪声以去除领域特定的损坏,同时保留类别区分结构。随后,反向扩散过程重建出与源对齐的图像,选用结构引导进行稳定,并通过冻结的源训练分类器进行分类。我们在广泛的损坏引起的目标领域中评估了所提方法,涵盖15种不同的损坏类型,并展示了在非噪声损坏下更平衡的鲁棒性以及具有竞争力或改进的性能。额外分析揭示了自适应扩散调度如何响应不同的损坏特征,突显了所提框架的实用性、通用性和鲁棒性。代码公开可用,地址为 https://github.com/fmolivato/dgadiffusion/.
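The adaptive stopping idea above can be sketched as a loop over increasing noise levels that exits as soon as a discriminator accepts the noised image. This is a simplified illustration, not the released implementation: `looks_source_like` is a hypothetical stand-in for the discriminator, `alpha_bars` is an assumed cumulative noise schedule in decreasing order, and a single noise draw is reused across levels to keep the example short.

```python
import math
import random

def adaptive_forward_diffusion(x0, alpha_bars, looks_source_like, rng):
    """Noise x0 level by level and stop at the first accepted level.

    Uses the standard closed form x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps.
    Returns the noised sample and the 1-based level where noising stopped
    (the last level if the discriminator never accepts).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t, t_stop = list(x0), 0
    for t, ab in enumerate(alpha_bars, start=1):
        x_t = [
            math.sqrt(ab) * x + math.sqrt(1.0 - ab) * n
            for x, n in zip(x0, noise)
        ]
        t_stop = t
        if looks_source_like(x_t, t):
            break  # enough noise to hide the corruption; stop early
    return x_t, t_stop
```

The returned level is where reverse diffusion would start, so lightly corrupted inputs are perturbed less than heavily corrupted ones.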
cs.CV / 108 / 2604.23641
VDLF-Net: Variational Feature Fusion for Adaptive and Few-Shot Visual Learning
VDLF-Net:用于自适应和少样本视觉学习的变分特征融合
Abstract
This paper introduces VDLF-Net, which attaches a compact VAE to a multi-scale CNN backbone. Latent vectors and softmax-gate support the backbone feature maps, while $\ell_2$-normalized embeddings from the gated maps contribute toward supervised classification or episodic few-shot prediction. Under standard CIFAR-100 and Mini-ImageNet protocols, VDLF-Net demonstrates an improved performance over ResNet-50 Enhanced, VGG-16, Prototypical Networks, and Matching Networks. Extensive ablations show that removing the fine-resolution scale has the greatest impact on VDLF-Net's performance. At the same time, KL and reconstruction at the chosen $\alpha$ pose a minor performance reduction, demonstrating that performance gains over classical episodic baselines mainly originate from the full VDLF-Net architecture and training strategy.
Chinese Translation
本文介绍了VDLF-Net,该网络将一个紧凑的变分自编码器(VAE)附加到多尺度卷积神经网络(CNN)主干上。潜在向量和softmax门控支持主干特征图,而来自门控图的$\ell_2$归一化嵌入则有助于监督分类或情景少样本预测。在标准的CIFAR-100和Mini-ImageNet协议下,VDLF-Net的性能优于ResNet-50 Enhanced、VGG-16、原型网络(Prototypical Networks)和匹配网络(Matching Networks)。大量消融实验表明,去除细分辨率尺度对VDLF-Net性能的影响最大。同时,在选择的$\alpha$下的KL散度和重构导致的性能降低较小,表明相较于经典的情景基线,性能提升主要源于完整的VDLF-Net架构和训练策略。
cs.CV / 109 / 2604.23644
RaV-IDP: A Reconstruction-as-Validation Framework for Faithful Intelligent Document Processing
RaV-IDP:一种重建作为验证的框架用于可信的智能文档处理
Abstract
Intelligent document processing pipelines extract structured entities (tables, images, and text) from documents for use in downstream systems such as knowledge bases, retrieval-augmented generation, and analytics. A persistent limitation of existing pipelines is that extraction output is produced without any intrinsic mechanism to verify whether it faithfully represents the source. Model-internal confidence scores measure inference certainty, not correspondence to the document, and extraction errors pass silently into downstream consumers. We present Reconstruction as Validation (RaV-IDP), a document processing pipeline that introduces reconstruction as a first-class architectural component. After each entity is extracted, a dedicated reconstructor renders the extracted representation back into a form comparable to the original document region, and a comparator scores fidelity between the reconstruction and the unmodified source crop. This fidelity score is a grounded, label-free quality signal. When fidelity falls below a per-entity-type threshold, a structured GPT-4.1 vision fallback is triggered and the validation loop repeats. We enforce a bootstrap constraint: the comparator always anchors against the original document region, never against the extraction, preventing the validation from becoming circular. We further propose a per-stage evaluation framework pairing each pipeline component with an appropriate benchmark. The code pipeline is publicly available at https://github.com/pritesh-2711/RaV-IDP for experimentation and use.
Chinese Translation
智能文档处理管道从文档中提取结构化实体(表格、图像和文本),以供下游系统使用,如知识库、检索增强生成和分析。现有管道的一个持续限制是,提取输出是在没有任何内在机制来验证其是否忠实于源文档的情况下生成的。模型内部的置信度分数衡量推理的确定性,而不是与文档的对应关系,提取错误默默地传递给下游消费者。我们提出了重建作为验证(RaV-IDP),这是一个文档处理管道,引入重建作为一个一流的架构组件。在每个实体被提取后,专用的重建器将提取的表示形式重新呈现为与原始文档区域可比的形式,比较器则对重建与未修改的源裁剪之间的保真度进行评分。这个保真度分数是一个基于实证的、无标签的质量信号。当保真度低于每种实体类型的阈值时,会触发结构化的 GPT-4.1 视觉回退,并重复验证循环。我们强制执行一个自举约束:比较器始终以原始文档区域为锚点,而不是提取结果,从而防止验证变得循环。我们进一步提出了一个逐阶段评估框架,将每个管道组件与适当的基准配对。代码管道已公开发布在 https://github.com/pritesh-2711/RaV-IDP 以供实验和使用。
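The extract, reconstruct, compare, and fall-back loop described in the RaV-IDP abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the source crop is modelled as a plain string rather than an image region, `fidelity` is a toy string-similarity comparator, and the names `validate`, `THRESHOLDS`, and `max_rounds` as well as the threshold values are all hypothetical. The one property taken directly from the abstract is the bootstrap constraint: the comparator always scores against the original crop, never against the extraction.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical per-entity-type thresholds; the abstract does not give values.
THRESHOLDS = {"text": 0.90, "table": 0.85}


@dataclass
class Validated:
    extraction: str
    fidelity: float
    used_fallback: bool


def fidelity(reconstruction: str, source_crop: str) -> float:
    """Toy comparator: similarity in [0, 1] between reconstruction and source."""
    return SequenceMatcher(None, reconstruction, source_crop).ratio()


def validate(
    source_crop: str,
    entity_type: str,
    extract: Callable[[str], str],
    fallback_extract: Callable[[str], str],
    reconstruct: Callable[[str], str],
    max_rounds: int = 2,
) -> Validated:
    """Extract, reconstruct, and score against the ORIGINAL crop; fall back if low."""
    extraction = extract(source_crop)
    used_fallback = False
    # Bootstrap constraint: the reference is always source_crop.
    score = fidelity(reconstruct(extraction), source_crop)
    for _ in range(max_rounds):
        if score >= THRESHOLDS[entity_type]:
            break
        # Low fidelity: switch to the fallback extractor and validate again.
        extraction = fallback_extract(source_crop)
        used_fallback = True
        score = fidelity(reconstruct(extraction), source_crop)
    return Validated(extraction, score, used_fallback)
```

In a real pipeline the reconstructor would render the extracted table or text back into an image-like form, and `fallback_extract` would wrap the vision-model call; both are left as caller-supplied functions here.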
cs.CV / 110 / 2604.23651
Geometry-Conditioned Diffusion for Occlusion-Robust In-Bed Pose Estimation
几何条件扩散用于遮挡鲁棒的床上姿态估计
Abstract
Robust in-bed human pose estimation under blanket occlusion remains challenging due to the scarcity of reliable labeled training data for heavily covered poses. Existing approaches rely on multi-modal sensing or image-to-image translation frameworks that remain conditioned on visible source imagery, limiting scalability and pose diversity. In this work, we reformulate occlusion-aware augmentation as a geometry-conditioned generative modeling task. We conduct a systematic comparison of deterministic masking, unpaired translation, paired diffusion-based translation, and a proposed pose-conditioned Latent Diffusion Model (Pose-LDM). Unlike image-guided methods, Pose-LDM synthesizes blanket-covered images directly from skeletal keypoints, eliminating dependence on paired supervision and pixel-level source-image conditioning while enabling generation from arbitrary pose inputs. All augmentation strategies are evaluated through their impact on downstream pose estimation under a fixed backbone. Pose-LDM achieves the highest strict localization accuracy under severe occlusion while maintaining overall detection performance comparable to paired diffusion models, approaching the performance of fully supervised training. These results demonstrate that geometry-conditioned diffusion provides an effective and supervision-efficient pathway toward occlusion-robust in-bed pose estimation without modifying the sensing pipeline. The code is available at: github.com/navidTerraNova/GeoDiffPose.
Chinese Translation
在被子遮挡下进行鲁棒的床上人体姿态估计仍然具有挑战性,因为对于被重度覆盖的姿态,可靠的标注训练数据稀缺。现有方法依赖于多模态传感或图像到图像的转换框架,这些方法仍然依赖于可见的源图像,从而限制了可扩展性和姿态多样性。在本研究中,我们将遮挡感知增强重新表述为一个几何条件生成建模任务。我们系统地比较了确定性掩蔽、无配对转换、配对基于扩散的转换以及我们提出的姿态条件潜在扩散模型(Pose-LDM)。与图像引导的方法不同,Pose-LDM直接从骨架关键点合成被子覆盖的图像,消除了对配对监督和像素级源图像条件的依赖,同时能够从任意姿态输入生成图像。所有增强策略通过其对固定骨干网络下的下游姿态估计的影响进行评估。在严重遮挡下,Pose-LDM实现了最高的严格定位精度,同时保持了与配对扩散模型相当的整体检测性能,接近完全监督训练的性能。这些结果表明,几何条件扩散为实现遮挡鲁棒的床上姿态估计提供了一条有效且高效的监督路径,而无需修改传感管道。代码可在以下地址获取:github.com/navidTerraNova/GeoDiffPose。
cs.CV / 111 / 2604.23653
ResAF-Net: An Anchor-Free Attention-Based Network for Tree Detection and Agricultural Mapping in Palestine
ResAF-Net:一种基于无锚注意力的树木检测与农业制图网络在巴勒斯坦的应用
Abstract
Reliable agricultural data is essential for food security, land-use planning, and economic resilience, yet in Palestine, such data remains difficult to collect at scale because of fragmented landscapes, limited field access, and restrictions on aerial monitoring. This paper presents ResAF-Net, a satellite-based tree detection framework designed for large-scale agricultural monitoring in resource-constrained settings. The proposed architecture combines a ResNet-50 encoder, Atrous Spatial Pyramid Pooling (ASPP), a feature-fusion stage, a multi-head self-attention refinement module, and an anchor-free FCOS detection head to improve tree localization in dense and heterogeneous scenes. Trained on the MillionTrees benchmark, the model achieved 82% Recall, 63.03% mAP@0.5, and 35.47% mAP@0.5:0.95 on the validation split, indicating strong sensitivity to tree presence while maintaining competitive localization quality. Beyond benchmark evaluation, we implemented the model within a web-based GIS application integrated with Palestinian cadastral data from GeoMolg, enabling tree analysis at scene, parcel, and community levels. This deployment demonstrates the practical feasibility of AI-assisted agricultural inventorying in Palestine. It provides a foundation for data-driven monitoring, reporting, and future species-level analysis of Mediterranean tree crops.
Chinese Translation
可靠的农业数据对于食品安全、土地利用规划和经济韧性至关重要,然而在巴勒斯坦,由于地形破碎、现场访问有限以及空中监测的限制,这类数据的规模收集仍然困难。本文提出了ResAF-Net,一种基于卫星的树木检测框架,旨在资源有限的环境中进行大规模农业监测。所提架构结合了ResNet-50编码器、空洞空间金字塔池化(Atrous Spatial Pyramid Pooling, ASPP)、特征融合阶段、多头自注意力精炼模块和无锚FCOS检测头,以提高在密集和异质场景中的树木定位精度。该模型在MillionTrees基准数据集上训练,验证集上达到了82%的召回率、63.03%的mAP@0.5和35.47%的mAP@0.5:0.95,表明其对树木存在的敏感性强,同时保持了竞争力的定位质量。除了基准评估外,我们还将该模型实施在一个基于网络的地理信息系统(GIS)应用中,并与GeoMolg提供的巴勒斯坦地籍数据集成,实现了在场景、地块和社区层面的树木分析。这一部署展示了在巴勒斯坦进行人工智能辅助农业清查的实际可行性,为数据驱动的监测、报告以及未来地中海树木作物的物种级分析奠定了基础。
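Detection metrics such as the recall reported for ResAF-Net are typically computed by matching predicted boxes to ground-truth boxes on intersection-over-union (IoU). The sketch below shows that matching step under simplifying assumptions: axis-aligned boxes, predictions already sorted by descending confidence, and greedy one-to-one assignment. The function names and the default threshold of 0.5 are illustrative and are not taken from the paper's evaluation code.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: List[Box], truths: List[Box], thr: float = 0.5) -> float:
    """Fraction of ground-truth boxes matched by some prediction at IoU >= thr.

    Assumes preds are sorted by descending confidence; each ground-truth box
    can be matched at most once.
    """
    unmatched = list(truths)
    hits = 0
    for p in preds:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)
            hits += 1
    return hits / len(truths) if truths else 0.0
```

Averaging precision over recall levels, and then over several IoU thresholds, builds on the same matching routine.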
cs.CV / 112 / 2604.23655
BVI-Mamba: Video Enhancement Using a Visual State-Space Model for Low-Light and Underwater Environments
BVI-Mamba:基于视觉状态空间模型的视频增强方法,适用于低光照和水下环境
Abstract
Videos captured in low-light and underwater conditions often suffer from distortions such as noise, low contrast, color imbalance, and blur. These issues not only limit visibility but also degrade automatic tasks like detection. Post-processing is typically required but can be time-consuming. AI-based tools for video enhancement also demand significantly more computational resources compared to image-based methods. This paper introduces a novel framework, Visual Mamba, designed to reduce memory usage and computational time by leveraging the Visual State Space (VSS) model. The framework consists of two modules: (i) a feature alignment module, where spatio-temporal displacement between input frames is registered in the feature space, and (ii) an enhancement module, where noise removal and brightness adjustment are performed using a UNet-like architecture, with all convolutional layers replaced by VSS blocks. Experimental results show that the Visual Mamba technique outperforms Transformer and convolution-based models in both low-light and underwater video enhancement tasks. Code is available online at https://github.com/russellllaputa/BVI-Mamba.
Chinese Translation
在低光照和水下条件下捕获的视频通常会受到噪声、低对比度、色彩失衡和模糊等失真问题的影响。这些问题不仅限制了可见性,还降低了自动检测等任务的效果。通常需要进行后处理,但这可能会耗时较长。与基于图像的方法相比,基于人工智能的视频增强工具通常需要显著更多的计算资源。本文介绍了一种新颖的框架——Visual Mamba,旨在通过利用视觉状态空间(VSS)模型来减少内存使用和计算时间。该框架由两个模块组成:(i)特征对齐模块,在该模块中,输入帧之间的时空位移在特征空间中进行注册;(ii)增强模块,在该模块中,使用类似UNet的架构进行噪声去除和亮度调整,所有卷积层均被VSS块替代。实验结果表明,Visual Mamba技术在低光照和水下视频增强任务中优于基于Transformer和卷积的模型。代码可在线获取,网址为 https://github.com/russellllaputa/BVI-Mamba。
cs.CV / 113 / 2604.23662
SolarFCD: A Large-Scale Dataset and Benchmark for Solar Fault Classification in Photovoltaic Systems
SolarFCD:用于光伏系统太阳能故障分类的大规模数据集和基准测试
Abstract
The increasing global deployment of solar photovoltaic (PV) systems calls for robust, scalable, and automated inspection technologies capable of detecting a wide range of panel flaws under a variety of operating conditions. The lack of large-scale, multi-modal, publicly available annotated datasets is a major obstacle to progress in this field. We introduce SolarFCD, an extensive dataset of solar panel defects created by methodically combining and reconciling three publicly accessible datasets covering two imaging modalities: RGB/drone images and thermal infrared. The dataset consists of 4,435 images organized into four unified classes: healthy, surface obstruction, structural fault, and electrical fault. The dataset was divided into training, validation, and test splits at an 80:10:10 ratio through methodical label mapping, near-duplicate removal, and targeted augmentation of minority classes. Sixteen classification architectures from five design families were trained and assessed on the dataset to provide repeatable benchmark baselines. With an accuracy of 86.68%, precision of 88.65%, recall of 88.62%, and F1-score of 88.17%, ResNet101V2 performed the best overall. Per-class results showed balanced detection across all four categories within a narrow performance band of less than 1.2 percentage points. To promote open and repeatable research in automated PV inspection and solar energy operations and maintenance, the dataset, annotation files, and baseline code are made openly available.
Chinese Translation
全球范围内太阳能光伏(PV)系统的日益普及需要强大、可扩展和自动化的检测技术,能够在各种操作情况下检测到广泛的面板缺陷。缺乏大规模、多模态、公开可用的标注数据集是阻碍该领域进展的主要障碍。我们介绍了SolarFCD,这是一个通过系统性地结合和调和三个公开可用的数据集而创建的太阳能面板缺陷的广泛数据集,涵盖了两种成像模式:RGB/无人机图像和热红外图像。该数据集包含4,435张图像,分为四个统一的缺陷类别:健康图像、表面障碍、结构故障和电气故障。通过系统的标签映射、近重复图像去除和针对少数类的有针对性增强,该数据集按80:10:10的比例划分为训练集、验证集和测试集。我们训练和评估了来自五个设计家族的十六种分类架构,以提供可重复的基准基线。ResNet101V2的准确率为86.68%,精确率为88.65%,召回率为88.62%,F1-score为88.17%,表现最佳。每个类别的结果显示,在四个缺陷类别中检测结果均衡,性能差距小于1.2个百分点。为了促进自动化光伏检测和太阳能运营与维护领域的开放和可重复研究,该数据集、标注文件和基准代码已公开提供。
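An 80:10:10 split like the one described for SolarFCD can be sketched as below. The abstract does not say whether the split was stratified by class; stratification is an assumption made here so that minority classes appear in every split. The function name `stratified_split` and the fixed seed are illustrative only.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str],
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int = 0,
) -> Dict[str, List[int]]:
    """Split sample indices into train/val/test, class by class (assumed stratified)."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits: Dict[str, List[int]] = {"train": [], "val": [], "test": []}
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(round(ratios[0] * len(indices)))
        n_val = int(round(ratios[1] * len(indices)))
        splits["train"] += indices[:n_train]
        splits["val"] += indices[n_train:n_train + n_val]
        # Whatever remains after train and val goes to the test split.
        splits["test"] += indices[n_train + n_val:]
    return splits
```

Near-duplicate removal should precede a split like this, since duplicates that straddle train and test would inflate the reported scores.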
cs.CV / 114 / 2604.23665
HAC: Parameter-Efficient Hyperbolic Adaptation of CLIP for Zero-Shot VQA
HAC:CLIP的参数高效超曲面适应用于零-shot视觉问答
Abstract
Recent advances in representation learning have shown that hyperbolic geometry can offer a more expressive alternative to the Euclidean embeddings used in CLIP models, capturing hierarchical structures and leading to better-organized representations. However, current hyperbolic CLIP variants are trained entirely from scratch, which is computationally expensive and resource-intensive. In this work, we propose HAC (Hyperbolic Adaptation of CLIP), a parameter-efficient framework that enables pretrained CLIP models to transition into hyperbolic space via lightweight fine-tuning. We apply HAC to Visual Question Answering (VQA), where models must interpret visual elements and align them with textual queries. Notably, HAC's training is performed on a dataset with no overlap with any VQA benchmark, resulting in a strict zero-shot evaluation paradigm that underscores HAC's task-agnostic adaptability. We evaluate HAC across a diverse suite of VQA benchmarks spanning General, Reasoning, and OCR categories. Both HAC-S (small) and HAC-B (medium) consistently surpass Euclidean baselines and prior hyperbolic approaches, with HAC-B delivering up to a +1.9 point average improvement over CLIP-B on reasoning-intensive tasks. Our code is available at https://github.com/fdibiton/HAC
Chinese Translation
最近在表征学习方面的进展表明,超曲面几何可以为CLIP模型中使用的欧几里得嵌入提供更具表现力的替代方案,能够捕捉层次结构并导致更有组织的表征。然而,当前的超曲面CLIP变体完全是从头训练的,这在计算上是昂贵且资源密集的。在本研究中,我们提出了HAC(CLIP的超曲面适应),这是一个参数高效的框架,使得预训练的CLIP模型能够通过轻量级微调过渡到超曲面空间。我们将HAC应用于视觉问答(VQA),在该任务中,模型必须解释视觉元素并将其与文本查询对齐。值得注意的是,HAC的训练是在与任何VQA基准没有重叠的数据集上进行的,形成了一个严格的零-shot评估范式,突显了HAC的任务无关适应性。我们在涵盖一般、推理和光学字符识别(OCR)类别的多样化VQA基准上评估HAC。HAC-S(小型)和HAC-B(中型)均持续超越欧几里得基线和先前的超曲面方法,其中HAC-B在推理密集型任务上平均提高了高达1.9分,相较于CLIP-B。我们的代码可在https://github.com/fdibiton/HAC获取。
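To make the idea of moving Euclidean CLIP embeddings into hyperbolic space concrete, the sketch below uses the Poincare ball model. The HAC abstract does not state which hyperbolic model or which mapping is used, so the model choice, the curvature parameter `c`, and the function names are all assumptions. It shows the standard exponential map at the origin, which carries a Euclidean vector into the ball, and the geodesic distance that would replace cosine or Euclidean distance.

```python
import math
from typing import List

Vec = List[float]


def norm(v: Vec) -> float:
    return math.sqrt(sum(x * x for x in v))


def expmap0(v: Vec, c: float = 1.0) -> Vec:
    """Exponential map at the origin of the Poincare ball with curvature -c."""
    n = norm(v)
    if n == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * x for x in v]


def poincare_distance(x: Vec, y: Vec, c: float = 1.0) -> float:
    """Geodesic distance between two points inside the ball of curvature -c."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    denom = (1 - c * sum(a * a for a in x)) * (1 - c * sum(b * b for b in y))
    return math.acosh(1 + 2 * c * sq_diff / denom) / math.sqrt(c)
```

Distances grow rapidly near the boundary of the ball, which is what lets tree-like hierarchies embed with low distortion: general concepts sit near the origin and specific ones near the edge.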
cs.CV / 115 / 2604.23670
Deploy DINO with Many-to-Many Association
使用多对多关联部署 DINO
Abstract
Motivated by the limited generalization of supervised image matching models to unseen image domains, we explore the zero-shot deployment of DINO features for this task. The generalist visual representation extracted from DINO has inherent ambiguity when used to match feature points among semantically similar instances, prompting us to adopt a many-to-many (m-to-m) matching paradigm. However, the existing robust mechanism under m-to-m data association is computationally heavy: it requires finding a maximum-cardinality matching in the inlier association graph for each parameter evaluation. To address this inefficiency, we introduce a novel likelihood perspective, which interprets the existing method as a zeroth-order approximation of an otherwise intractable likelihood calculation, and inspires us to propose a faster and finer-grained robust mechanism, termed Harmonic Consensus Maximization (HCM). Taking camera pose estimation as an example downstream task, we demonstrate that general-purpose visual features, used out of the box without any adaptation, can compete with specialized matching models on out-of-distribution datasets when paired with m-to-m association and the HCM mechanism.
Chinese Translation
受限于监督图像匹配模型在未见图像领域的有限泛化能力,我们探索了 DINO 特征在此任务中的零-shot 部署。DINO 提取的通用视觉表征在用于匹配语义相似实例之间的特征点时具有固有的模糊性,这促使我们采用多对多 (m-to-m) 匹配范式。然而,现有的 m-to-m 数据关联下的鲁棒机制计算量大,需要在每次参数评估时在内点关联图中寻找最大基数匹配。为了解决这一低效问题,我们引入了一种新的似然视角,将现有方法解释为对其他难以处理的似然计算的零阶近似,并激发我们提出一种更快且更细致的鲁棒机制,称为谐波共识最大化 (Harmonic Consensus Maximization, HCM)。以相机姿态估计为示例下游任务,我们展示了通用视觉特征在未经过任何适应的情况下,结合 m-to-m 关联和 HCM 机制,可以在分布外数据集上与专业匹配模型竞争。
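The many-to-many association step in the DINO abstract can be illustrated with a top-k nearest-neighbour rule: each feature point in one image keeps several candidate partners in the other rather than a single best match. This is only a sketch of candidate generation. The values of `k` and `min_sim` and the function names are hypothetical, and the HCM robust estimator itself is not reproduced because the abstract does not specify it.

```python
import math
from typing import List, Set, Tuple

Vec = List[float]


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0


def many_to_many(
    feats_a: List[Vec], feats_b: List[Vec], k: int = 2, min_sim: float = 0.8
) -> Set[Tuple[int, int]]:
    """For every point in A, keep up to k most similar points in B above min_sim."""
    pairs: Set[Tuple[int, int]] = set()
    for i, fa in enumerate(feats_a):
        ranked = sorted(
            range(len(feats_b)), key=lambda j: cosine(fa, feats_b[j]), reverse=True
        )
        for j in ranked[:k]:
            if cosine(fa, feats_b[j]) >= min_sim:
                pairs.add((i, j))
    return pairs
```

The resulting set contains conflicting candidates by design, so a downstream robust estimator has to decide which of them are consistent with a single camera pose.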
cs.CV / 116 / 2604.23683
Learning to Decipher from Pixels -- A Case Study of Copiale
从像素中学习解码——以Copiale为例
Abstract
Historical encrypted manuscripts require both paleographic interpretation of cipher symbols and cryptanalytic recovery of plaintext. Most existing computational workflows rely on a transcription-first paradigm, in which handwritten symbols are transcribed prior to decipherment. This intermediate step is labor-intensive, error-prone, and not always aligned with the goal of direct plaintext recovery. We propose an end-to-end, transcription-free approach that directly maps handwritten cipher images to plaintext. Using the Copiale cipher as a case study, we introduce the first text-line-level dataset pairing cipher images with German plaintext. We show that pretraining on generic handwriting data followed by cipher-specific fine-tuning substantially improves decipherment accuracy. Our results demonstrate that transcription-free image-to-plaintext decipherment is both feasible and effective for historical substitution ciphers, offering a simplified and scalable alternative to traditional pipelines. https://github.com/leitro/Decipher-from-Pixels-Copiale
Chinese Translation
历史加密手稿需要对密码符号进行古文字学解释和明文的密码分析恢复。现有的大多数计算工作流程依赖于先转录的范式,即在解码之前先对手写符号进行转录。这一中间步骤劳动密集、容易出错,并且并不总是与直接恢复明文的目标一致。我们提出了一种端到端的无转录方法,直接将手写密码图像映射到明文。以Copiale密码为案例研究,我们引入了第一个将密码图像与德语明文配对的文本行级数据集。我们展示了在通用手写数据上进行预训练,随后进行特定于密码的微调,显著提高了解码准确性。我们的结果表明,无转录的图像到明文解码对于历史替代密码既可行又有效,为传统流程提供了一种简化和可扩展的替代方案。
cs.CV / 117 / 2604.23685
Reading in the Dark: Low-light Scene Text Recognition
黑暗中的阅读:低光照场景文本识别
Abstract
Accurate text recognition in low-light environments is essential for intelligent systems in applications ranging from autonomous vehicles to smart surveillance. However, challenges such as poor illumination and noise interference remain underexplored. To address this gap, we introduce LSTR, a large-scale Low-light Scene Text Recognition dataset comprising 11,273 low-light images generated from well-lit datasets (ICDAR2015, IIIT5K, and WordArt), along with ESTR, which includes 60 real nighttime street-scene images in English and Spanish for exclusive evaluation. We explore two solution strategies: (1) employing Optical Character Recognition (OCR) models with fine-tuning and LoRA-based fine-tuning and (2) a joint training strategy that integrates a low-light image enhancement (LLIE) module with an OCR model. In particular, we propose a novel re-render LLIE (RLLIE) module, which demonstrates improved performance on real-world data. Through extensive experimentation, we analyze various training strategies and address a key research question: "How bright is bright enough for effective scene text recognition?" Our results indicate that standalone LLIE or OCR models perform inadequately under low-light conditions, highlighting the advantages of specialized, jointly trained text-centric approaches. Additionally, we provide a comprehensive benchmark to support future research in robust low-light scene text recognition. https://huggingface.co/datasets/lumimusta/Low-light_Scene_Text_Dataset.
Chinese Translation
在低光照环境中进行准确的文本识别对于从自动驾驶车辆到智能监控等应用中的智能系统至关重要。然而,诸如照明不足和噪声干扰等挑战仍未得到充分探讨。为了解决这一问题,我们引入了LSTR,一个大规模的低光照场景文本识别数据集,包含11,273张从良好照明数据集(ICDAR2015、IIIT5K和WordArt)生成的低光照图像,以及ESTR,后者包括60张用于专门评估的英语和西班牙语真实夜间街景图像。我们探索了两种解决策略:(1)采用光学字符识别(OCR)模型进行微调和基于LoRA的微调,以及(2)一种联合训练策略,将低光照图像增强(LLIE)模块与OCR模型集成。特别地,我们提出了一种新颖的重新渲染LLIE(RLLIE)模块,显示出在真实世界数据上的性能提升。通过广泛的实验,我们分析了各种训练策略,并探讨了一个关键研究问题:“多亮度才足以实现有效的场景文本识别?”我们的结果表明,单独的LLIE或OCR模型在低光照条件下表现不佳,突显了专门的、联合训练的文本中心方法的优势。此外,我们提供了一个全面的基准,以支持未来在鲁棒低光照场景文本识别方面的研究。https://huggingface.co/datasets/lumimusta/Low-light_Scene_Text_Dataset.
cs.CV / 118 / 2604.23688
Do Protective Perturbations Really Protect Portrait Privacy under Real-world Image Transformations?
保护性扰动在真实世界图像变换下是否真的能保护肖像隐私?
Abstract
Proactive defense methods protect portrait images from unauthorized editing or talking face generation (TFG) by introducing pixel-level protective perturbations, and have already attracted increasing attention for privacy protection. In real-world scenarios, images inevitably undergo various transformations during cross-device display and dissemination--such as scale transformations and color compression--that directly alter pixel values. However, it remains unclear whether such pixel-level modifications affect the effectiveness of existing proactive defense methods that rely on pixel-level perturbations. To solve this problem, we conduct a systematic evaluation of representative proactive defenses under image transformation. The evaluated methods are selected to span different generation architectures such as diffusion and GAN-based models, as well as defense scopes covering both portrait and natural images, and are assessed using both qualitative and quantitative metrics for subjective and objective comparison. Experimental results indicate that defense methods based on pixel-level perturbations struggle to withstand common image transformations, posing a risk of defense failure in real-world applications. To further highlight this risk, we propose a simple yet effective purification framework by leveraging the vulnerabilities induced by real-world image transformations. Experimental results demonstrate that the proposed method can efficiently remove protective perturbations with low computational cost, highlighting previously overlooked risks to the research community.
Chinese Translation
主动防御方法通过引入像素级保护性扰动来保护肖像图像免受未经授权的编辑或人脸生成(Talking Face Generation, TFG),并已引起越来越多的隐私保护关注。在真实场景中,图像在跨设备显示和传播过程中不可避免地会经历各种变换——例如缩放变换和颜色压缩——这些变换直接改变像素值。然而,目前尚不清楚这些像素级修改是否会影响依赖于像素级扰动的现有主动防御方法的有效性。为了解决这个问题,我们对代表性的主动防御方法在图像变换下进行了系统评估。评估的方法涵盖了不同的生成架构,如扩散模型(diffusion)和基于生成对抗网络(GAN)的模型,以及涵盖肖像和自然图像的防御范围,并使用定性和定量指标进行主观和客观比较。实验结果表明,基于像素级扰动的防御方法在抵御常见图像变换方面存在困难,给真实世界应用带来了防御失败的风险。为了进一步突出这一风险,我们提出了一个简单而有效的净化框架,利用真实世界图像变换引发的脆弱性。实验结果表明,所提方法能够以低计算成本高效去除保护性扰动,突显了研究界之前未被重视的风险。
cs.CV / 119 / 2604.23704
A Pose-only Geometric Constraint for Multi-Camera Pose Adjustment
一种仅基于姿态的几何约束用于多摄像头姿态调整
Abstract
Multi-camera systems offer rich observation capabilities for visual navigation and 3D scene reconstruction; however, the resulting feature redundancy often compromises computational efficiency. This challenge is particularly pronounced during bundle adjustment, where the non-linear optimization of both system poses and scene points incurs substantial computational overhead. To address this challenge, this paper introduces a pose-only geometric constraint for multi-camera systems and proposes a corresponding pose adjustment algorithm. Specifically, we use a generalized camera model to establish a unified representation of the multi-camera system. Building upon this model, we formulate the multi-camera pose-only constraint, which implicitly represents a 3D scene point using two base observations and their associated poses, thereby achieving a pose-only representation of the projection geometry. Subsequently, we introduce a multi-camera pose adjustment algorithm that eliminates 3D points from the parameter space, thereby achieving efficient and focused pose optimization. Experimental results on both synthetic and real-world datasets demonstrate that the proposed algorithm outperforms baseline bundle adjustment methods in computational efficiency, while maintaining or even improving pose estimation accuracy.
Chinese Translation
多摄像头系统为视觉导航和三维场景重建提供了丰富的观察能力;然而,随之而来的特征冗余常常影响计算效率。这个挑战在束调整过程中尤为明显,在该过程中,系统姿态和场景点的非线性优化会带来显著的计算开销。为了解决这一挑战,本文提出了一种仅基于姿态的几何约束用于多摄像头系统,并提出了相应的姿态调整算法。具体而言,我们使用广义相机模型建立多摄像头系统的统一表示。在此模型的基础上,我们制定了多摄像头仅基于姿态的约束,该约束隐式地使用两个基础观测及其相关姿态来表示一个三维场景点,从而实现了投影几何的仅基于姿态的表示。随后,我们引入了一种多摄像头姿态调整算法,该算法从参数空间中消除了三维点,从而实现高效且专注的姿态优化。在合成和真实世界数据集上的实验结果表明,所提出的算法在计算效率上优于基线束调整方法,同时保持或甚至提高了姿态估计的准确性。
cs.CV / 120 / 2604.23706
Weakly Supervised Multicenter Nancy Index Scoring in Ulcerative Colitis Using Foundation Models
基于基础模型的溃疡性结肠炎弱监督多中心南希指数评分
Abstract
Histologic assessment of ulcerative colitis (UC) activity is an important endpoint in clinical trials and routine care, but manual grading with indices such as the Nancy histological index (NHI) is time-consuming and prone to observer variability. While computational pathology methods can automate scoring, many approaches depend on dense region-level annotations, which are costly to obtain, particularly in heterogeneous, multicenter cohorts. We propose a weakly supervised multiple instance learning (MIL) approach for whole-slide images that learns from case- and slide-level NHI labels, leveraging foundation models. Our method targets clinically relevant endpoints, including neutrophilic activity and derived Nancy-low/high groupings, enabling full five-grade NHI prediction. On a multicenter dataset of H&E-stained colon biopsies from three hospitals (2019-2025), we evaluate multiple foundation model encoders and aggregation strategies. We find that foundation model choice and resolution substantially affect performance, with Virchow2 providing the most consistent gains, and that a simple ensembling rule improves five-grade NHI prediction compared to a hierarchical gating baseline. Overall, our results demonstrate that weakly supervised MIL with modern foundation-model representations can provide robust, interpretable UC histology activity assessment in realistic multicenter settings.
Chinese Translation
溃疡性结肠炎(UC)活动的组织学评估是临床试验和常规护理中的一个重要终点,但使用南希组织学指数(NHI)等指标进行手动评分既耗时又容易受到观察者变异的影响。尽管计算病理学方法可以实现评分自动化,但许多方法依赖于密集的区域级注释,这在异质性多中心队列中获取成本较高。我们提出了一种针对全切片图像的弱监督多实例学习(MIL)方法,该方法从病例和切片级NHI标签中学习,并利用基础模型。我们的方法针对临床相关的终点,包括中性粒细胞活性及衍生的南希低/高分组,从而实现完整的五级NHI预测。在来自三家医院(2019-2025年)的H&E染色结肠活检的多中心数据集中,我们评估了多种基础模型编码器和聚合策略。我们发现基础模型的选择和分辨率对性能有显著影响,其中Virchow2提供了最一致的提升,并且简单的集成规则相比于分层门控基线在五级NHI预测中表现更佳。总体而言,我们的结果表明,使用现代基础模型表示的弱监督MIL能够在现实的多中心环境中提供稳健且可解释的UC组织学活动评估。
cs.CV / 121 / 2604.23709
ZID-Net: Zero-Inference Diffusion Prior Decoupling Network for Single Image Dehazing
ZID-Net:用于单幅图像去雾的零推理扩散先验解耦网络
Abstract
Single image dehazing is often constrained by a trade-off between restoration quality and computational efficiency. While efficient, CNN networks struggle to learn robust priors for dense and non-homogeneous haze. Conversely, diffusion models provide strong generative priors but suffer from severe inference latency and sampling instability. To address these limitations, we propose ZID-Net, a novel framework that explicitly decouples diffusion supervision from feed-forward inference. For efficient inference, we design a frequency-spatial decoupled feed-forward backbone. Within this backbone, a Channel-Spatial Laplacian Mask (CSLM) filters haze-amplified noise to extract purified structural details, while Lightweight Global Context Blocks (LGCBs) establish long-range spatial dependencies to capture the global variations of haze. A Dynamic Feature Arbitration Block (DFAB) then adaptively fuses these semantic and structural features for robust reconstruction. To provide this backbone with physical priors without the inference cost, we introduce a Zero-Inference Prior Propagation Head (ZI-PPH) during training. ZI-PPH leverages a conditional diffusion process to predict residual noise, providing degradation-aware structural supervision to the backbone. By discarding the diffusion branch at test time, ZID-Net integrates diffusion priors into a pure feed-forward architecture for accurate and efficient restoration. ZID-Net achieves 40.75 dB PSNR on the synthetic RESIDE dataset and outperforms existing methods with a 1.13 dB gain on real-world datasets. Additionally, it yields a 3.06 dB PSNR gain on the StateHaze1k remote sensing dataset with an inference time of just 19.35 ms. The project code is available at: https://github.com/XoomitLXH/ZID-Net.
Chinese Translation
单幅图像去雾通常受到恢复质量与计算效率之间权衡的限制。尽管卷积神经网络(CNN)高效,但在学习稠密和非均匀雾霾的稳健先验方面存在困难。相反,扩散模型提供了强大的生成先验,但在推理延迟和采样不稳定性方面存在严重问题。为了解决这些局限性,我们提出了ZID-Net,一个新颖的框架,明确将扩散监督与前馈推理解耦。为了实现高效推理,我们设计了一个频率-空间解耦的前馈骨干网络。在这个骨干网络中,通道-空间拉普拉斯掩模(Channel-Spatial Laplacian Mask, CSLM)过滤雾霾增强的噪声,以提取纯化的结构细节,而轻量级全局上下文块(Lightweight Global Context Blocks, LGCBs)建立长距离空间依赖关系,以捕捉雾霾的全局变化。然后,动态特征仲裁块(Dynamic Feature Arbitration Block, DFAB)自适应地融合这些语义和结构特征,以实现稳健的重建。为了在不增加推理成本的情况下为这个骨干网络提供物理先验,我们在训练期间引入了零推理先验传播头(Zero-Inference Prior Propagation Head, ZI-PPH)。ZI-PPH利用条件扩散过程预测残差噪声,为骨干网络提供降级感知的结构监督。通过在测试时丢弃扩散分支,ZID-Net将扩散先验整合到纯前馈架构中,实现准确高效的恢复。ZID-Net在合成RESIDE数据集上达到了40.75 dB的峰值信噪比(PSNR),并在真实世界数据集上超越了现有方法,获得了1.13 dB的增益。此外,它在StateHaze1k遥感数据集上实现了3.06 dB的PSNR增益,推理时间仅为19.35毫秒。项目代码可在以下链接获取:https://github.com/XoomitLXH/ZID-Net。
cs.CV / 122 / 2604.23718
Caries DETR: Tooth Structure-aware Prior and Lesion-aware Dynamic Loss Refinement for DETR Based Caries Detection
Caries DETR:基于牙齿结构感知的先验和病变感知动态损失细化的DETR基础龋齿检测
Abstract
As dental caries appear as subtle, low-contrast lesions in intraoral imaging, existing deep learning models face significant challenges in the early detection of caries. While recent Transformer-based detectors have shown promising results in natural images, they often fail to capture the domain-specific anatomical priors crucial for dental caries detection. In this paper, we propose Caries-DETR, a specialized Transformer framework for caries detection in intraoral images. A Tooth Structure-aware Query Initialization (TSQI) is designed, leveraging large-scale intraoral photograph pre-training and a structure perception branch (SPB) to integrate high-frequency structural priors, guiding the model to focus on anatomically significant lesion areas. Furthermore, we design a Lesion-aware Dynamic Loss Refinement (LDLR) to implement quality-driven hard mining through adaptive loss reweighting based on lesion size, anatomical relevance, and prediction quality, optimizing detection for subtle lesions. Extensive experiments on two public datasets (i.e., AlphaDent and DentalAI) demonstrate that Caries-DETR achieves state-of-the-art performance compared to existing methods and exhibits good generalization and robustness. Code and data are available at https://github.com/XuefenLiu-SZU/Caries-DETR.
Chinese Translation
由于龋齿在口腔影像中表现为微妙的低对比度病变,现有的深度学习模型在早期检测龋齿方面面临重大挑战。尽管近期基于Transformer的检测器在自然图像中显示出良好的效果,但它们往往无法捕捉到对于龋齿检测至关重要的领域特定解剖先验。在本文中,我们提出了Caries-DETR,这是一个专门用于口腔影像中龋齿检测的Transformer框架。我们设计了一种牙齿结构感知查询初始化(Tooth Structure-aware Query Initialization, TSQI),利用大规模口腔照片的预训练和结构感知分支(Structure Perception Branch, SPB)来整合高频结构先验,指导模型关注解剖上重要的病变区域。此外,我们设计了一种病变感知动态损失细化(Lesion-aware Dynamic Loss Refinement, LDLR),通过基于病变大小、解剖相关性和预测质量的自适应损失重加权,实现质量驱动的困难样本挖掘,优化对微小病变的检测。在两个公共数据集(即AlphaDent和DentalAI)上的广泛实验表明,Caries-DETR相较于现有方法实现了最先进的性能,并展现出良好的泛化能力和鲁棒性。代码和数据可在 https://github.com/XuefenLiu-SZU/Caries-DETR 获取。
cs.CV / 123 / 2604.23724
Zoom In, Reason Out: Efficient Far-field Anomaly Detection in Expressway Surveillance Videos via Focused VLM Reasoning Guided by Bayesian Inference
聚焦推理,精准识别:通过贝叶斯推理指导的聚焦视觉语言模型实现高速公路监控视频中的高效远程异常检测
Abstract
Expressway video anomaly detection is essential for safety management. However, identifying anomalies across diverse scenes remains challenging, particularly for far-field targets exhibiting subtle abnormal vehicle motions. While Vision-Language Models (VLMs) demonstrate strong semantic reasoning capabilities, processing global frames causes attention dilution for these far-field objects and incurs prohibitive computational costs. To address these issues, we propose VIBES, an asynchronous collaborative framework utilizing VLMs guided by Bayesian inference. Specifically, to overcome poor generalization across varying expressway environments, we introduce an online Bayesian inference module. This module continuously evaluates vehicle trajectories to dynamically update the probabilistic boundaries of normal driving behaviors, serving as an asynchronous trigger to precisely localize anomalies in space and time. Instead of processing the continuous video stream, the VLM processes only the localized visual regions indicated by the trigger. This targeted visual input prevents attention dilution and enables accurate semantic reasoning. Extensive evaluations demonstrate that VIBES improves detection accuracy for far-field anomalies and reduces computational overhead, achieving high real-time efficiency and explainability while demonstrating generalization across diverse expressway conditions.
Chinese Translation
高速公路视频异常检测对安全管理至关重要。然而,在多样化场景中识别异常仍然具有挑战性,特别是对于表现出微妙异常车辆运动的远程目标。尽管视觉语言模型(VLMs)展现出强大的语义推理能力,但处理全局帧会导致对这些远程物体的注意力稀释,并产生高昂的计算成本。为了解决这些问题,我们提出了VIBES,一个利用贝叶斯推理指导的异步协作框架,结合VLMs。具体而言,为了克服在不同高速公路环境中泛化能力差的问题,我们引入了一个在线贝叶斯推理模块。该模块持续评估车辆轨迹,以动态更新正常驾驶行为的概率边界,作为异步触发器,精确定位空间和时间上的异常。VLM仅处理由触发器指示的局部视觉区域,而不是处理连续的视频流。这种针对性的视觉输入防止了注意力稀释,并使得准确的语义推理成为可能。广泛的评估表明,VIBES提高了远程异常的检测准确性,并减少了计算开销,实现了高实时效率和可解释性,同时在不同的高速公路条件下展现了良好的泛化能力。
cs.CV / 124 / 2604.23728
ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction
ESIA:一种基于能量的时空交互感知框架用于行人意图预测
Abstract
Recent advances in autonomous driving have motivated research on pedestrian intention prediction, which aims to infer future crossing decisions and actions by modeling temporal dynamics, social interactions, and environmental context. However, existing studies remain constrained by oversimplified multi-agent interaction patterns, opaque reasoning logic, and a lack of global consistency in behavioral predictions, which compromise both robustness and interpretability. In this work, we propose ESIA (Energy-based Spatiotemporal Interaction-Aware framework), a novel Conditional Random Field (CRF)-based paradigm. We cast the intention prediction task as a structured prediction problem over a unified graph-based representation, treating pedestrians and the environment as spatiotemporal nodes. To characterize their distinct roles, we assign unary potentials to nodes to capture individual intentions, and pairwise potentials to edges to encode social and environmental interactions. These potentials are integrated into a unified global energy function to ensure scene-level consistency across behavioral predictions. To further constrain inference without ground-truth supervision, we introduce structural consistency terms to penalize logical contradictions. This optimization is efficiently solved via a novel Unary-Seeded Simulated Annealing (U-SSA) algorithm, which leverages high-confidence unary priors to rapidly converge to a high-quality solution. Extensive experiments on standard benchmarks demonstrate that ESIA achieves state-of-the-art performance with improved interpretability over existing methods.
Chinese Translation
近年来,自动驾驶技术的进步激发了对行人意图预测的研究,该研究旨在通过建模时间动态、社会交互和环境背景来推断未来的过马路决策和行为。然而,现有研究仍受到简化的多智能体交互模式、不透明的推理逻辑以及行为预测缺乏全局一致性的限制,这影响了其鲁棒性和可解释性。在本研究中,我们提出了ESIA(基于能量的时空交互感知框架),这是一种新颖的基于条件随机场(CRF)的范式。我们将意图预测任务视为在统一的图形表示上的结构化预测问题,将行人和环境视为时空节点。为了表征它们的不同角色,我们为节点分配单一势能以捕捉个体意图,为边缘分配成对势能以编码社会和环境交互。这些势能被整合到一个统一的全局能量函数中,以确保行为预测在场景级别上的一致性。为了在没有真实标签监督的情况下进一步约束推理,我们引入结构一致性项以惩罚逻辑矛盾。通过一种新颖的单一种子模拟退火(U-SSA)算法高效地解决这一优化问题,该算法利用高置信度的单一先验快速收敛到高质量解。在标准基准上的大量实验表明,ESIA在可解释性方面优于现有方法,且达到了最先进的性能。
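The energy formulation in the ESIA abstract (unary potentials on nodes, pairwise potentials on edges, one global energy) can be illustrated with a toy binary labelling problem. Everything specific below is an assumption: the Potts-style disagreement penalty, the linear cooling schedule, and the function names. The solver is generic simulated annealing started from the per-node unary minimum, meant to convey the idea of unary seeding rather than to reproduce the U-SSA algorithm.

```python
import math
import random
from typing import Dict, List, Tuple

Unary = List[List[float]]                 # unary[i][label] = cost of label at node i
Pairwise = Dict[Tuple[int, int], float]   # (i, j) -> penalty if labels disagree


def energy(labels: List[int], unary: Unary, pairwise: Pairwise) -> float:
    """Global energy: unary costs plus penalties on disagreeing linked nodes."""
    total = sum(unary[i][lab] for i, lab in enumerate(labels))
    total += sum(w for (i, j), w in pairwise.items() if labels[i] != labels[j])
    return total


def unary_seeded_annealing(
    unary: Unary, pairwise: Pairwise, steps: int = 2000, t0: float = 1.0, seed: int = 0
) -> Tuple[List[int], float]:
    rng = random.Random(seed)
    # Seed: each node starts at its lowest-cost label under the unary term alone.
    current = [min(range(len(u)), key=u.__getitem__) for u in unary]
    cur_e = energy(current, unary, pairwise)
    best, best_e = list(current), cur_e
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling, kept above zero
        i = rng.randrange(len(current))
        proposal = list(current)
        proposal[i] = rng.randrange(len(unary[i]))
        new_e = energy(proposal, unary, pairwise)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new_e <= cur_e or rng.random() < math.exp((cur_e - new_e) / temp):
            current, cur_e = proposal, new_e
            if cur_e < best_e:
                best, best_e = list(current), cur_e
    return best, best_e
```

In the small example used to check this sketch, the third node's unary term weakly prefers "crossing", but its link to a confidently "not crossing" neighbour makes the globally consistent labelling cheaper, which is the scene-level consistency effect the abstract describes.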
cs.CV / 125 / 2604.23729
DynProto: Dynamic Prototype Evolution for Out-of-Distribution Detection
DynProto:用于分布外检测的动态原型演化
Abstract
Recent studies show that using potential out-of-distribution (OOD) labels from large corpora as auxiliary information can improve OOD detection in vision-language models (VLMs). However, these methods often fail when real-world OOD samples fall outside the predefined OOD label set. To address this limitation, we propose DynProto, a novel approach that learns OOD prototypes dynamically during testing using only in-distribution (ID) information. DynProto is inspired by a key observation: OOD samples predicted as the same ID class tend to cluster in the feature space. With this insight, we leverage easy-to-detect OOD samples as "anchors" to find their harder-to-detect, similar counterparts. To this end, DynProto introduces two modules: the Coarse OOD Pattern Capturing Module caches OOD patterns that are easily confused with each ID class during testing, and the Fine-grained OOD Pattern Refinement Module subsequently clusters these patterns within each cache and aggregates them into representative OOD prototypes. By measuring similarity to ID and dynamic OOD prototypes, DynProto enables accurate OOD detection. DynProto significantly outperforms prior methods across multiple benchmarks. On the ImageNet OOD benchmark, DynProto reduces FPR95 by 11.60% and improves AUROC by 4.70%. Moreover, the framework is architecture-agnostic and can be integrated into various backbones.
Chinese Translation
最近的研究表明,利用来自大型语料库的潜在分布外(OOD)标签作为辅助信息可以改善视觉语言模型(VLMs)中的OOD检测。然而,当现实世界的OOD样本超出预定义的OOD标签集时,这些方法往往会失效。为了解决这一局限性,我们提出了DynProto,这是一种新颖的方法,它在测试过程中仅使用分布内(ID)信息动态学习OOD原型。DynProto的灵感来源于一个关键观察:被预测为相同ID类别的OOD样本往往在特征空间中聚集。基于这一见解,我们利用易于检测的OOD样本作为“锚点”,以找到其更难检测的相似对应物。为此,DynProto引入了两个模块:粗略OOD模式捕捉模块在测试过程中缓存与每个ID类别容易混淆的OOD模式,而细粒度OOD模式细化模块随后在每个缓存中聚类这些模式,并将其聚合成代表性的OOD原型。通过测量与ID和动态OOD原型的相似性,DynProto实现了准确的OOD检测。DynProto在多个基准测试中显著优于先前的方法。在ImageNet OOD基准测试中,DynProto将FPR95降低了11.60%,并将AUROC提高了4.70%。此外,该框架与架构无关,可以集成到各种主干网络中。
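The cache-then-cluster idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the plain k-means step, the function names, and the scoring rule (best ID similarity minus best OOD-prototype similarity) are all assumptions chosen to make the idea concrete.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def kmeans_prototypes(features, k, iters=10, seed=0):
    """Hypothetical refinement step: cluster cached OOD features into k prototypes."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(features @ centroids.T, axis=1)  # nearest centroid by cosine
        for j in range(k):
            members = features[assign == j]
            if len(members) > 0:
                centroids[j] = l2_normalize(members.mean(axis=0))
    return centroids

def ood_score(feature, id_prototypes, ood_prototypes):
    """Assumed score: higher means more ID-like."""
    return float(np.max(id_prototypes @ feature) - np.max(ood_prototypes @ feature))

# Toy data: one ID prototype and a cache of "easy" OOD features for that class.
id_protos = l2_normalize(np.array([[1.0, 0.0]]))
cache = l2_normalize(np.array([[0.1, 1.0], [0.0, 1.0], [-0.1, 1.0], [0.05, 1.0]]))
ood_protos = kmeans_prototypes(cache, k=2)

id_like = ood_score(l2_normalize(np.array([1.0, 0.05])), id_protos, ood_protos)
ood_like = ood_score(l2_normalize(np.array([0.0, 1.0])), id_protos, ood_protos)
```

In this toy setup the sample near the cached OOD cluster receives the lower score, which is the behaviour the abstract attributes to dynamic OOD prototypes.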
cs.CV / 126 / 2604.23763
Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing
编辑您所意指的地方:面向区域的适配器注入用于无掩膜的局部图像编辑
Abstract
Large diffusion transformers (DiTs) follow global editing instructions well but consistently leak local edits into unrelated regions, because joint-attention architectures offer no explicit channel telling the network where to apply the edit. We introduce REDEdit, a co-trained, instruction- and region-aware adapter framework that retrofits a frozen DiT into a precise local editor without modifying its backbone weights. A lightweight Block Adapter at every transformer block injects a structured condition stream that factorizes what to edit (instruction semantics) from where to edit (spatial mask); a learned SpatialGate routes the adapter signal selectively into the edit region while keeping the rest of the image near-identical to the source; and a Region-Aware Loss focuses the training objective on the changing pixels. Because these components make the backbone's internal representation mask-aware end-to-end, a thin MaskPredictor head trained jointly with the editor can ground the edit region directly from the instruction and source image, eliminating any user-mask requirement at deployment. We evaluate on two complementary benchmarks: MagicBrush (paired ground-truth targets) to measure pixel-level preservation and edit accuracy, and Emu-Edit Test (no ground-truth images, 9 diverse edit categories) to stress-test instruction following and generalization across edit types. On both, REDEdit achieves state-of-the-art results, simultaneously outperforming mask-free and oracle-mask baselines. A seven-variant ablation cleanly isolates the contribution of each component.
Chinese Translation
大型扩散变换器(DiTs)能够很好地遵循全局编辑指令,但始终会将局部编辑泄漏到无关区域,因为联合注意力架构没有明确的通道告知网络在哪里应用编辑。我们提出了REDEdit,这是一种共同训练的、具有指令和区域意识的适配器框架,可以将一个冻结的DiT改造成一个精确的局部编辑器,而无需修改其主干权重。在每个变换器块中,一个轻量级的块适配器注入一个结构化条件流,将要编辑的内容(指令语义)与编辑位置(空间掩膜)进行分解;一个学习到的SpatialGate选择性地将适配器信号引导到编辑区域,同时保持图像的其余部分与源图像几乎相同;而区域意识损失(Region-Aware Loss)则将训练目标集中在变化的像素上。由于这些组件使得主干的内部表示在端到端上具有掩膜意识,一个与编辑器共同训练的细小MaskPredictor头可以直接从指令和源图像中确定编辑区域,从而消除在部署时对用户掩膜的需求。我们在两个互补的基准上进行了评估:MagicBrush(配对的真实目标)用于测量像素级保留和编辑准确性,以及Emu-Edit Test(无真实图像,9种多样的编辑类别)用于压力测试指令遵循和跨编辑类型的泛化。在这两者上,REDEdit都达到了最先进的结果,同时超越了无掩膜和理想掩膜的基准。七个变体的消融实验清晰地隔离了每个组件的贡献。
cs.CV / 127 / 2604.23776
From Noisy Historical Maps to Time-Series Oil Palm Mapping Without Annotation in Malaysia and Indonesia (2020-2024)
从嘈杂的历史地图到马来西亚和印度尼西亚的无注释时间序列油棕榈映射(2020-2024)
Abstract
Accurate monitoring of oil palm plantations is critical for balancing economic development with environmental conservation in Southeast Asia. However, existing plantation maps often suffer from low spatial resolution and a lack of recent temporal coverage, impeding effective surveillance of rapid land-use changes. In this study, we propose a deep learning framework to generate 10-meter resolution oil palm plantation maps for Indonesia and Malaysia from 2020 to 2024, utilizing Sentinel-2 imagery without requiring new manual annotations. To address the resolution mismatch between coarse 100-meter historical labels and 10-meter imagery, we employ a U-Net architecture optimized with Determinant-based Mutual Information (DMI). This approach effectively mitigates the influence of label noise. We validated our method against 2,058 manually verified points, achieving overall accuracies of 70.64%, 63.53%, and 60.06% for the years 2020, 2022, and 2024, respectively. Our comprehensive analysis reveals that oil palm coverage in the region peaked in 2022 before experiencing a decline in 2024. Furthermore, land cover transition analysis highlights a concerning trajectory of plantation expansion into flooded vegetation areas, despite a general stabilization in rotations with other crop types. These high-resolution maps provide essential data for monitoring sustainability commitments and deforestation dynamics in the region, and the generated datasets are made publicly available at https://doi.org/10.5281/zenodo.17768444.
Chinese Translation
准确监测油棕榈种植园对于平衡东南亚的经济发展与环境保护至关重要。然而,现有的种植园地图往往存在空间分辨率低和缺乏近期时间覆盖的问题,妨碍了对快速土地利用变化的有效监测。在本研究中,我们提出了一种深度学习框架,利用Sentinel-2影像生成2020年至2024年间印度尼西亚和马来西亚的10米分辨率油棕榈种植园地图,而无需新的手动注释。为了解决粗糙的100米历史标签与10米影像之间的分辨率不匹配问题,我们采用了基于U-Net架构,并使用基于行列式的互信息(Determinant-based Mutual Information, DMI)进行优化。这种方法有效减轻了标签噪声的影响。我们在2058个手动验证点上验证了我们的方法,2020年、2022年和2024年的总体准确率分别为70.64%、63.53%和60.06%。我们的综合分析显示,该地区的油棕榈覆盖率在2022年达到峰值,随后在2024年出现下降。此外,土地覆盖转变分析突显了种植园向淹水植被区域扩展的令人担忧的趋势,尽管与其他作物类型的轮作总体趋于稳定。这些高分辨率地图为监测该地区的可持续性承诺和森林砍伐动态提供了重要数据,生成的数据集已在https://doi.org/10.5281/zenodo.17768444上公开发布。
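As a rough illustration of training against noisy labels with a determinant-based mutual information objective, the sketch below uses one common formulation: the negative log absolute determinant of the joint matrix between predicted class probabilities and the (noisy) labels. Whether the paper uses exactly this form, and how pixels are batched for segmentation, are assumptions here.

```python
import numpy as np

def dmi_loss(probs, noisy_labels, num_classes, eps=1e-8):
    """Assumed DMI-style loss: -log|det(U)|, where U is the empirical joint
    matrix between predicted probabilities and one-hot noisy labels.

    probs:        (N, C) softmax outputs, e.g. flattened pixels of a patch
    noisy_labels: (N,) integer labels taken from the coarse historical map
    """
    one_hot = np.eye(num_classes)[noisy_labels]      # (N, C)
    joint = probs.T @ one_hot / len(noisy_labels)    # (C, C)
    return float(-np.log(np.abs(np.linalg.det(joint)) + eps))

labels = np.array([0, 0, 1, 1])
aligned = np.eye(2)[labels]            # predictions that agree with the labels
uninformative = np.full((4, 2), 0.5)   # predictions that ignore the input

loss_aligned = dmi_loss(aligned, labels, num_classes=2)
loss_uninformative = dmi_loss(uninformative, labels, num_classes=2)
```

Predictions that carry no information about the labels make the joint matrix singular, so the loss becomes very large, while informative predictions keep the determinant away from zero.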
cs.CV / 128 / 2604.23781
ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
ClawMark:多轮、多天、多模态协作代理的现实世界基准
Meng, Fanqing, Du, Lingxiao, Wu, Zijian, Chen, Guanzheng, Liu, Xiangyan, Liao, Jiaqi, Jiang, Chonghe, Wan, Zhenglin, Gu, Jiawei, Zhou, Pengfei, Huang, Rui, Zhao, Ziqi, Ding, Shengyuan, Yu, Ailing, Peng, Bo, Xia, Bowei, Sun, Hao, Liang, Haotian, Xie, Ji, Chen, Jiajun, Song, Jiajun, Yang, Liu, Xu, Ming, Qiu, Qionglin, Fu, Runhao, Zhai, Shengfang, Wang, Shijian, Ma, Tengfei, Wu, Tianyi, Jin, Weiyang, Wang, Yan, Dai, Yang, Lai, Yao, Shu, Youwei, Liu, Yue, Hao, Yunzhuo, Niu, Yuwei, Huang, Jinkai, Zhuo, Jiayuan, Shen, Zhennan, Wu, Linyu, Xie, Cihang, Zhou, Yuyin, Zhang, Jiaheng, Zheng, Zeyu, Hu, Mengkang, Shieh, Michael Qizhe
Abstract
Language-model agents are increasingly used as persistent coworkers that assist users across multiple working days. During such workflows, the surrounding environment may change independently of the agent: new emails arrive, calendar entries shift, knowledge-base records are updated, and evidence appears across images, scanned PDFs, audio, video, and spreadsheets. Existing benchmarks do not adequately evaluate this setting because they typically run within a single static episode and remain largely text-centric. We introduce ClawMark, a benchmark for coworker agents built around multi-turn multi-day tasks, a stateful sandboxed service environment whose state evolves between turns, and rule-based verification. The current release contains 100 tasks across 13 professional scenarios, executed against five stateful sandboxed services (filesystem, email, calendar, knowledge base, spreadsheet) and scored by 1537 deterministic Python checkers over post-execution service state; no LLM-as-judge is invoked during scoring. We benchmark seven frontier agent systems. The strongest model reaches 75.8 weighted score, but the best strict Task Success is only 20.0%, indicating that partial progress is common while complete end-to-end workflow completion remains rare. Turn-level analysis shows that performance drops after the first exogenous environment update, highlighting adaptation to changing state as a key open challenge. We release the benchmark, evaluation harness, and construction pipeline to support reproducible coworker-agent evaluation.
Chinese Translation
语言模型代理越来越多地被用作持久的协作伙伴,帮助用户在多个工作日内完成任务。在这样的工作流程中,周围环境可能会独立于代理发生变化:新邮件到达,日历条目变动,知识库记录更新,以及图像、扫描的PDF、音频、视频和电子表格中出现证据。现有的基准测试并未充分评估这种设置,因为它们通常在单一静态情境中运行,并且主要以文本为中心。我们引入了ClawMark,这是一个围绕多轮多天任务构建的协作代理基准,提供一个状态可变的沙盒服务环境,其状态在轮次之间演变,并采用基于规则的验证。当前版本包含13个专业场景下的100个任务,针对五个状态可变的沙盒服务(文件系统、电子邮件、日历、知识库、电子表格)执行,并通过1537个确定性的Python检查器对执行后的服务状态进行评分;在评分过程中没有调用LLM作为评判者。我们对七个前沿代理系统进行了基准测试。最强模型达到了75.8的加权得分,但最佳严格任务成功率仅为20.0%,这表明部分进展是常见的,而完整的端到端工作流程完成仍然很少见。轮次级分析显示,在第一次外部环境更新后,性能下降,突显了适应变化状态作为一个关键的开放挑战。我们发布了基准、评估工具和构建管道,以支持可重复的协作代理评估。
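The gap between a high weighted score and a low strict Task Success follows from how rule-based checkers can be aggregated. The sketch below is a hypothetical illustration: the `Checker` structure, the weights, and the state layout are invented for this example and are not taken from the benchmark's harness.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Checker:
    """Hypothetical deterministic checker over post-execution service state."""
    name: str
    weight: float
    check: Callable[[Dict], bool]

def score_task(state: Dict, checkers: List[Checker]) -> Tuple[float, bool]:
    """Return (weighted score in 0..100, strict success = every checker passed)."""
    passed = [c.check(state) for c in checkers]
    total = sum(c.weight for c in checkers)
    earned = sum(c.weight for c, ok in zip(checkers, passed) if ok)
    return 100.0 * earned / total, all(passed)

# Toy post-execution state: the agent sent the email but never moved the meeting.
state = {
    "email": {"sent": [{"to": "client@example.com", "subject": "Revised quote"}]},
    "calendar": {"events": []},
}
checkers = [
    Checker("quote_email_sent", 3.0,
            lambda s: any(m["to"] == "client@example.com" for m in s["email"]["sent"])),
    Checker("meeting_rescheduled", 1.0,
            lambda s: len(s["calendar"]["events"]) > 0),
]
weighted, strict = score_task(state, checkers)
```

Here the task earns 75 of 100 weighted points yet still counts as a strict failure, which mirrors the pattern of partial progress without full completion that the abstract reports.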
cs.CV / 129 / 2604.23788
MIRAGE: A Micro-Interaction Relational Architecture for Grounded Exploration in Multi-Figure Artworks
MIRAGE:一种用于多人物艺术作品中扎根探索的微交互关系架构
Abstract
Appreciating multi-figure paintings requires understanding how characters relate through subtle cues like gaze alignment, gesture, and spatial arrangement. We present MIRAGE, an evidence-centric framework designed to scaffold the exploration of these "micro-interactions" in multi-figure artworks. While such cues are essential for deep narrative appreciation, they are often distributed across complex scenes and difficult for viewers to systematically identify. Existing vision-language models (VLMs) frequently fail to provide reliable assistance, offering ungrounded interpretations that lack traceable visual evidence. MIRAGE addresses this by constructing a structured intermediate representation capturing identities, pose cues, and gaze hypotheses. However, the challenge extends beyond extracting these cues to coordinating them during interpretation. Without an explicit mechanism to organize and reconcile relational evidence, models often collapse multiple interaction hypotheses into a single unstable or weakly grounded narrative, even when low-level signals are available. This representation allows users to verify how high-level interpretations are anchored in low-level visual facts. By separating spatial grounding from narrative generation, MIRAGE enables users to inspect and reason about figure-to-figure relationships through a verifiable evidence layer. We evaluate MIRAGE against painting-only VLM baselines using a blind assessment protocol. Results show that MIRAGE significantly improves identity consistency, reduces relational hallucinations, and increases the coverage of subtle interactions. These findings suggest that structured grounding can serve as a critical interaction control layer, providing the necessary scaffolding for a more reliable, transparent, and human-led understanding of complex visual narratives.
Chinese Translation
欣赏多人物画作需要理解角色之间通过微妙线索(如视线对齐、手势和空间排列)所建立的关系。我们提出了MIRAGE,这是一个以证据为中心的框架,旨在支持对这些多人物艺术作品中“微交互”的探索。尽管这些线索对于深入理解叙事至关重要,但它们往往分布在复杂场景中,观众难以系统地识别。现有的视觉-语言模型(VLMs)常常无法提供可靠的帮助,提供的解释缺乏可追溯的视觉证据。MIRAGE通过构建一个结构化的中间表示来解决这一问题,该表示捕捉了身份、姿势线索和视线假设。然而,挑战不仅在于提取这些线索,还在于在解释过程中协调它们。缺乏明确的机制来组织和调和关系证据,模型往往将多个交互假设合并为一个不稳定或弱扎根的叙事,即使在有低级信号可用的情况下也是如此。该表示允许用户验证高级解释如何扎根于低级视觉事实。通过将空间扎根与叙事生成分离,MIRAGE使用户能够通过可验证的证据层检查和推理人物之间的关系。我们使用盲评估协议对MIRAGE与仅基于绘画的VLM基线进行了评估。结果表明,MIRAGE显著提高了身份一致性,减少了关系幻觉,并增加了微妙交互的覆盖范围。这些发现表明,结构化扎根可以作为关键的交互控制层,为更可靠、透明和以人为主导的复杂视觉叙事理解提供必要的支撑。
cs.CV / 130 / 2604.23789
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
MuSS:用于多镜头主题到视频生成的大规模数据集和电影叙事基准
Abstract
While video foundation models excel at single-shot generation, real-world cinematic storytelling inherently relies on complex multi-shot sequencing. Further progress is constrained by the absence of datasets that address three core challenges: authentic narrative logic, spatiotemporal text-video alignment conflicts, and the "copy-paste" dilemma prevalent in Subject-to-Video (S2V) generation. To bridge this gap, we introduce MuSS, a large-scale, dual-track dataset tailored for multi-shot video and S2V generation. Sourced from over 3,000 movies, MuSS explicitly supports both complex montage transitions and subject-centric narratives. To construct this dataset, we pioneer a progressive captioning pipeline that eliminates contextual conflicts by ensuring local shot-level accuracy before enforcing global narrative coherence. Crucially, we implement a cross-shot matching mechanism to fundamentally eradicate the S2V copy-paste shortcut. Alongside the dataset, we propose the Cinematic Narrative Benchmark, featuring a visual-logic-driven paradigm and a novel Anti-Copy-Paste Variance (ACP-Var) metric to rigorously assess continuous storytelling and 3D structural consistency. Extensive experiments demonstrate that while current baselines struggle with continuous narrative logic or degenerate into trivial 2D sticker generators, our MuSS-augmented model achieves state-of-the-art narrative effectiveness and cross-shot identity preservation.
Chinese Translation
尽管视频基础模型在单镜头生成方面表现出色,但现实世界的电影叙事本质上依赖于复杂的多镜头序列。进一步的进展受到缺乏能够解决三个核心挑战的数据集的限制:真实的叙事逻辑、时空文本与视频对齐冲突,以及在主题到视频(Subject-to-Video,S2V)生成中普遍存在的“复制粘贴”困境。为填补这一空白,我们引入了MuSS,一个针对多镜头视频和S2V生成的大规模双轨数据集。该数据集来源于超过3000部电影,明确支持复杂的蒙太奇过渡和以主题为中心的叙事。为了构建该数据集,我们开创了一种渐进式字幕生成管道,通过确保局部镜头级别的准确性来消除上下文冲突,然后再强制执行全球叙事一致性。至关重要的是,我们实施了一种跨镜头匹配机制,根本上消除了S2V复制粘贴的捷径。除了数据集外,我们还提出了电影叙事基准,采用以视觉逻辑驱动的范式和一种新颖的反复制粘贴方差(Anti-Copy-Paste Variance,ACP-Var)指标,以严格评估连续叙事和三维结构一致性。大量实验表明,尽管当前基线在连续叙事逻辑方面表现不佳或退化为简单的二维贴纸生成器,但我们增强的MuSS模型在叙事有效性和跨镜头身份保留方面达到了最先进的水平。
cs.CV / 131 / 2604.23799
VitaminP: cross-modal learning enables whole-cell segmentation from routine histology
VitaminP:跨模态学习实现常规组织学中的全细胞分割
Abstract
Accurate whole-cell and nuclear segmentation is essential for precision pathology and spatial omics, yet routine hematoxylin and eosin (H&E) staining provides limited cytoplasmic contrast, restricting analyses to nuclei. Multiplex immunofluorescence (mIF) facilitates precise whole-cell delineation but remains constrained by cost and accessibility. We introduce VitaminP, a cross-modal learning framework enabling whole-cell segmentation from H&E images. By learning from paired H&E-mIF data, VitaminP transfers molecular boundary information from mIF to overcome the limited cytoplasmic contrast in H&E, establishing cross-modal supervision as a general strategy for recovering missing biological structure. We train VitaminP on 14 public datasets covering 34 cancer types and over 7 million instances, integrating publicly available labels with extensive annotations generated in this study, forming one of the largest resources for segmentation. VitaminP outperforms four state-of-the-art methods and generalizes to unseen datasets, including an in-house dataset spanning 24 rare cancer types. We further developed VitaminPScope, an open-source platform providing an interface for scalable inference and enabling broad adoption.
Chinese Translation
准确的全细胞和核分割对于精准病理学和空间组学至关重要,但常规的苏木精-伊红(H&E)染色提供的细胞质对比有限,限制了分析仅局限于细胞核。多重免疫荧光(mIF)技术能够实现精确的全细胞描绘,但仍受限于成本和可获取性。我们提出了VitaminP,一个跨模态学习框架,能够从H&E图像中实现全细胞分割。通过从配对的H&E-mIF数据中学习,VitaminP将mIF中的分子边界信息转移到H&E图像中,以克服H&E中的细胞质对比问题,确立了跨模态监督作为恢复缺失生物结构的一种通用策略。我们在覆盖34种癌症类型和超过700万实例的14个公共数据集上训练了VitaminP,结合了公开可用的标签和本研究生成的广泛注释,形成了最大的分割资源之一。VitaminP在四种最先进的方法上表现优越,并能推广到未见过的数据集,包括一个涵盖24种罕见癌症类型的内部数据集。我们进一步开发了VitaminPScope,这是一个开源平台,提供可扩展推理的接口,促进了广泛的应用。
cs.CV / 132 / 2604.23803
Bringing a Personal Point of View: Evaluating Dynamic 3D Gaussian Splatting for Egocentric Scene Reconstruction
引入个人视角:评估动态3D高斯点云在自我中心场景重建中的应用
Abstract
Egocentric video provides a unique view into human perception and interaction, with growing relevance for augmented reality, robotics, and assistive technologies. However, rapid camera motion and complex scene dynamics pose major challenges for 3D reconstruction from this perspective. While 3D Gaussian Splatting (3DGS) has become a state-of-the-art method for efficient, high-quality novel view synthesis, variants that focus on reconstructing dynamic scenes from monocular video are rarely evaluated on egocentric video. It remains unclear whether existing models generalize to this setting or if egocentric-specific solutions are needed. In this work, we evaluate dynamic monocular 3DGS models on egocentric and exocentric video using paired ego-exo recordings from the EgoExo4D dataset. We find that reconstruction quality is consistently lower in egocentric views. Analysis reveals that the difference in reconstruction quality, measured in peak signal-to-noise ratio, stems from the reconstruction of static, not dynamic, content. Our findings underscore current limitations and motivate the development of egocentric-specific approaches, while also highlighting the value of separately evaluating static and dynamic regions of a video.
Chinese Translation
自我中心视频为人类感知和互动提供了独特的视角,随着增强现实、机器人技术和辅助技术的发展,其相关性日益增强。然而,快速的相机运动和复杂的场景动态对从这一视角进行3D重建构成了重大挑战。尽管3D高斯点云(3DGS)已成为高效、高质量新视角合成的最先进方法,但专注于从单目视频重建动态场景的变体在自我中心视频上很少得到评估。目前尚不清楚现有模型是否能够推广到这一设置,或者是否需要针对自我中心的特定解决方案。在本研究中,我们使用EgoExo4D数据集中配对的自我中心和外部中心录音,评估动态单目3DGS模型在自我中心和外部中心视频上的表现。我们发现自我中心视角的重建质量始终较低。分析表明,重建质量的差异(以峰值信噪比衡量)源于静态内容的重建,而非动态内容。我们的研究结果强调了当前的局限性,并激励开发自我中心特定的方法,同时也突显了单独评估视频的静态和动态区域的价值。
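Evaluating static and dynamic regions separately amounts to computing PSNR over masked pixels. The following is a minimal sketch, assuming a per-pixel boolean dynamic mask is available; how such masks are obtained is not stated in the abstract.

```python
import numpy as np

def masked_psnr(pred, target, mask, max_val=1.0):
    """PSNR restricted to pixels where mask is True.

    pred, target: (H, W, 3) float images in [0, max_val]
    mask:         (H, W) boolean region selector
    """
    mse = np.mean((pred[mask] - target[mask]) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

# Toy frame: all reconstruction error sits in the static background.
target = np.zeros((4, 4, 3))
dynamic = np.zeros((4, 4), dtype=bool)
dynamic[:2, :2] = True          # assumed dynamic region (e.g. moving hands)
pred = target.copy()
pred[~dynamic] = 0.1            # constant error on static pixels only

psnr_static = masked_psnr(pred, target, ~dynamic)
psnr_dynamic = masked_psnr(pred, target, dynamic)
```

Reporting the two numbers side by side makes it visible when a quality drop comes from the static background, as the abstract describes for egocentric views.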
cs.CV / 133 / 2604.23813
ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
ShredBench:评估多模态大型语言模型在文档重建中的语义推理能力
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable performance in Visually Rich Document Understanding (VRDU) tasks, but their capabilities are mainly evaluated on pristine, well-structured document images. We consider content restoration from shredded fragments, a challenging VRDU setting that requires integrating visual pattern recognition with semantic reasoning under significant content discontinuities. To facilitate systematic evaluation of complex VRDU tasks, we introduce ShredBench, a benchmark supported by an automated generation pipeline that renders fragmented documents directly from Markdown. The proposed pipeline ensures evaluation validity by allowing the flexible integration of the latest or previously unseen textual sources to prevent training data contamination. ShredBench assesses four scenarios (English, Chinese, Code, Table) with three fragmentation granularities (8, 12, 16 pieces). Empirical evaluations on state-of-the-art MLLMs reveal a significant performance gap: the models perform well on intact documents; however, once the document is shredded, restoration becomes a significant challenge, with NED dropping sharply as fragmentation increases. Our findings highlight that current MLLMs lack the fine-grained cross-modal reasoning required to bridge visual discontinuities, identifying a critical gap in robust VRDU research.
Chinese Translation
多模态大型语言模型(MLLMs)在视觉丰富文档理解(VRDU)任务中取得了显著的表现,但其能力主要在完好且结构良好的文档图像上进行评估。我们考虑从撕碎的碎片中恢复内容,这是一种具有挑战性的VRDU设置,要求在显著内容不连续的情况下整合视觉模式识别与语义推理。为了促进对复杂VRDU任务的系统评估,我们引入了ShredBench,一个由自动生成管道支持的基准,该管道直接从Markdown生成碎片化文档。该提议的管道通过允许灵活整合最新或未见过的文本来源,确保评估的有效性,以防止训练数据污染。ShredBench评估四种场景(英语、中文、代码、表格),并具有三种碎片化粒度(8、12、16片)。对最先进的MLLMs进行的实证评估揭示了显著的性能差距:该方法在完整文档上有效;然而,一旦文档被撕碎,恢复便成为一项重大挑战,随着碎片化程度的增加,NED急剧下降。我们的研究结果强调当前的MLLMs缺乏弥补视觉不连续所需的细粒度跨模态推理,识别出在稳健的VRDU研究中存在的关键缺口。
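The abstract reports NED without defining it. A plausible reading, given that the score drops as fragmentation increases, is a similarity of the form 1 minus the Levenshtein distance normalized by the longer string. The sketch below implements that assumed reading; the benchmark's exact normalization may differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def ned_similarity(restored: str, reference: str) -> float:
    """Assumed NED-style score in [0, 1]; 1.0 means a perfect restoration."""
    longest = max(len(restored), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(restored, reference) / longest

distance = edit_distance("kitten", "sitting")
score = ned_similarity("kitten", "sitting")
```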
cs.CV / 134 / 2604.23814
Mapping License Plate Recoverability Under Extreme Viewing Angles for Opportunistic Urban Sensing
极端视角下车牌可恢复性的映射:机遇性城市感知
Abstract
Urban environments contain many imaging sensors built for specific purposes, including ATM, body-worn, CCTV, and dashboard cameras. Under the opportunistic sensing paradigm, these sensors can be repurposed for secondary inference tasks such as license plate recognition. Yet objects of interest in such imagery are often noisy, low-resolution, and captured from extreme viewpoints. Recent advances in AI-based restoration can recover useful information even from severely degraded images. A central challenge is determining which distortion parameters allow reliable recovery and which lead to inference failure. This paper introduces recoverability maps, a task-agnostic method for quantifying this boundary. The method combines a dense synthetic sweep of degradation parameters with two summary measures: boundary area-under-curve, which estimates the recoverable fraction of the parameter space, and a reliability score, which captures the frequency and severity of failures within that region. We demonstrate the method on license plate recognition from highly angled views under realistic camera artifacts. Several restoration architectures are trained and evaluated, including U-Net, Restormer, Pix2Pix, and SR3 diffusion. The best model recovers about 93% of the parameter space. Similar results across models suggest that sensing geometry, rather than architecture, sets the limit of recovery.
Chinese Translation
城市环境中包含许多为特定目的而建造的成像传感器,包括自动取款机、佩戴式摄像头、闭路电视和仪表盘摄像头。在机遇性感知范式下,这些传感器可以被重新利用于二次推断任务,如车牌识别。然而,这些图像中的目标对象往往存在噪声、分辨率低,并且是从极端视角捕获的。最近在基于人工智能的恢复技术方面的进展可以从严重退化的图像中恢复有用信息。一个核心挑战是确定哪些失真参数允许可靠恢复,哪些则导致推断失败。本文引入了可恢复性图(recoverability maps),这是一种任务无关的方法,用于量化这一边界。该方法结合了对退化参数的密集合成扫描以及两个摘要度量:边界曲线下的面积(boundary area-under-curve),用于估计可恢复参数空间的比例,以及可靠性评分(reliability score),用于捕捉该区域内失败的频率和严重性。我们在现实摄像机伪影下的高角度视图中展示了该方法在车牌识别中的应用。训练和评估了多种恢复架构,包括 U-Net、Restormer、Pix2Pix 和 SR3 扩散。最佳模型恢复了约 93% 的参数空间。不同模型之间的相似结果表明,感知几何而非架构设定了恢复的限制。
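One way to picture the boundary area-under-curve is as the normalized area beneath the recovery boundary in a swept parameter grid. The sketch below is a hypothetical reading: the grid axes, the 0.9 accuracy threshold, and the rule "count severity levels up to the first failure" are assumptions, and the reliability score is left out because the abstract does not define it.

```python
import numpy as np

def boundary_auc(accuracy, threshold=0.9):
    """Assumed boundary AUC over a (n_angles, n_severity_levels) accuracy grid.

    For each viewing angle (row), the boundary is the number of consecutive
    severity levels, starting from the mildest, that stay above threshold.
    The result is the area under that boundary divided by the full grid area.
    """
    ok = (accuracy >= threshold).astype(int)
    boundary = np.cumprod(ok, axis=1).sum(axis=1)   # stops counting at first failure
    return float(boundary.mean() / accuracy.shape[1])

# Toy sweep: 2 viewing angles x 4 severity levels of recognition accuracy.
sweep = np.array([
    [1.0, 1.0, 0.5, 1.0],   # recoverable up to level 2, then fails
    [1.0, 0.2, 1.0, 1.0],   # recoverable only at level 1
])
recoverable_fraction = boundary_auc(sweep)
```

Under this reading, isolated successes beyond the first failure do not count toward the recoverable region, which keeps the summary conservative.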
cs.CV / 135 / 2604.23839
Focus on What Matters: Two-Stage ROI-Aware Refinement for Anatomy-Preserving Fetal Ultrasound Reconstruction
关注重要内容:一种保留解剖结构的胎儿超声重建的两阶段ROI感知精细化方法
Abstract
Measurement-critical ultrasound tasks often depend on a small anatomical region, making global reconstruction metrics an unreliable proxy for clinical fidelity. We propose an ROI-aware representation learning framework and instantiate it for first-trimester nuchal translucency (NT) screening under multi-hospital domain shift. A two-phase convolutional autoencoder (CAE) first learns a globally faithful 128-D latent code via MS-SSIM, then refines the NT ROI using intensity (L1) and normalized Sobel-edge constraints. To combine these heterogeneous objectives without manual tuning, we initialize loss weights via gradient-based calibration from per-term gradient magnitudes. Under strict hospital-wise evaluation with one hospital held out, ROI refinement improves both global and measurement-relevant quality: on the standard dev split it increases PSNR by +0.27 dB (val) and +0.29 dB (held-out test), reduces ROI MAE by 8.87% (val) and 6.43% (held-out test), and reduces ROI Edge-MAE by 11.10% on source hospitals and 4.90% on the unseen hospital. Beyond reconstruction, frozen-latent probes provide additional evidence of generalization: hospital provenance becomes less confidently predictable on the unseen site (0.556 to 0.541 max-softmax; 0.684 to 0.688 entropy) while OOD detection remains strong across site-held-out protocols (Mahalanobis AUROC up to 0.9956, with modest KNN gains in challenging splits). The same ROI-aware refinement principle is anatomy-agnostic and can be adopted for other fetal biometry targets (e.g., crown-rump length (CRL), nasal bone (NB)) and broader medical imaging settings where small ROIs dominate clinical decisions.
Chinese Translation
测量关键的超声任务通常依赖于小的解剖区域,这使得全局重建指标成为临床可靠性的不可靠代理。我们提出了一种ROI感知的表示学习框架,并将其应用于多医院领域转移下的第一孕期颈部透明度(NT)筛查。一个两阶段卷积自编码器(CAE)首先通过MS-SSIM学习一个全局忠实的128维潜在编码,然后使用强度(L1)和归一化Sobel边缘约束对NT ROI进行精细化。为了在没有手动调优的情况下结合这些异构目标,我们通过基于梯度的校准从每个项的梯度大小初始化损失权重。在严格的医院评估中,保留一个医院进行测试,ROI精细化提高了全局和测量相关的质量:在标准开发集上,PSNR提高了+0.27 dB(验证集)和+0.29 dB(保留测试集),ROI MAE减少了8.87%(验证集)和6.43%(保留测试集),并且在源医院上ROI Edge-MAE减少了11.10%,在未见医院上减少了4.90%。除了重建,冻结潜在探针提供了额外的泛化证据:在未见站点上,医院来源的预测信心降低(最大软最大从0.556降至0.541;熵从0.684升至0.688),而OOD检测在站点保留协议中仍然强劲(Mahalanobis AUROC高达0.9956,在具有挑战性的拆分中KNN获得适度提升)。相同的ROI感知精细化原则与解剖无关,可以应用于其他胎儿生物测量目标(例如,头臀长(CRL),鼻骨(NB))和更广泛的医学成像环境,其中小ROI主导临床决策。
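Initializing loss weights from per-term gradient magnitudes can be sketched as inverse-norm weighting, so that each term contributes a gradient of similar size at the start of refinement. The exact rule used in the paper is not given in the abstract; the normalization below is an assumption, and the gradient norms are made-up numbers.

```python
import numpy as np

def calibrate_weights(grad_norms, eps=1e-12):
    """Assumed calibration: weight each loss term by the inverse of its
    gradient norm, then normalize the weights to sum to 1."""
    inv = 1.0 / (np.asarray(grad_norms, dtype=float) + eps)
    return inv / inv.sum()

# Hypothetical gradient norms measured once for three terms:
# global MS-SSIM, ROI L1 intensity, ROI normalized Sobel edge.
grad_norms = np.array([2.0, 0.5, 1.0])
weights = calibrate_weights(grad_norms)
balanced = weights * grad_norms   # effective per-term gradient scale
```

After calibration every term has the same effective gradient scale, which is the property that removes the need for manual tuning of heterogeneous objectives.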
cs.CV / 136 / 2604.23858
Latent Inter-Frame Pruning: A Training-Free Method Bridging Traditional Video Compression and Modern Diffusion Transformers for Efficient Generation
潜在帧间修剪:一种无训练方法,连接传统视频压缩与现代扩散变换器以实现高效生成
Abstract
Video generation, while capable of generating realistic videos, is computationally expensive and slow, prohibiting real-time applications. In this paper, we observe that video latents encoded via an autoencoder under the Latent Diffusion Model (LDM) framework contain redundancy along the temporal axis. Analogous to how traditional video compression algorithms avoid transmitting redundant frame data, we propose the Latent Inter-frame Pruning framework to prune (skip the re-computation of) duplicated latent patches, thereby reducing computational burden and increasing throughput. However, direct pruning results in visual artifacts due to the discrepancy between full-sequence training and pruned inference. To resolve these artifacts, we propose an Attention Recovery mechanism to bridge the train-inference gap. With our proposed method, we increase video editing throughput by 1.44×, achieving 12.44 FPS on an NVIDIA RTX 6000 while maintaining video quality. We hope our work inspires further research into integrating traditional video compression methods with modern video generation pipelines. This is preliminary work on Training-free Latent Inter-Frame Pruning with Attention Recovery.
Chinese Translation
视频生成虽然能够生成逼真的视频,但计算成本高且速度慢,限制了实时应用。在本文中,我们观察到通过自编码器在潜在扩散模型(Latent Diffusion Model, LDM)框架下编码的视频潜在特征在时间轴上存在冗余。类似于传统视频压缩算法避免传输冗余帧数据,我们提出了潜在帧间修剪框架,以修剪(跳过重新计算)重复的潜在补丁,从而减少计算负担并提高吞吐量。然而,直接修剪会导致视觉伪影,因为全序列训练与修剪推理之间存在差异。为了解决这些伪影,我们提出了一种注意力恢复机制,以弥合训练与推理之间的差距。通过我们提出的方法,我们将视频编辑的吞吐量提高了1.44倍,在NVIDIA RTX 6000上实现了12.44帧每秒,同时保持视频质量。我们希望我们的工作能激发进一步研究,将传统视频压缩方法与现代视频生成管道相结合。本工作是关于无训练的潜在帧间修剪与注意力恢复的初步研究。
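The pruning decision resembles skip blocks in classical inter-frame coding: a latent patch that matches its counterpart in the previous frame need not be recomputed. A minimal sketch follows, assuming a simple max-absolute-difference test with a hypothetical tolerance; the paper's actual duplicate criterion and its Attention Recovery step are not reproduced here.

```python
import numpy as np

def duplicate_patch_mask(latents, tol=1e-3):
    """Mark latent patches that duplicate the same patch in the previous frame.

    latents: (T, N, D) array of T frames, N patches per frame, D channels
    returns: (T, N) boolean mask, True where recomputation could be skipped
    """
    diff = np.abs(latents[1:] - latents[:-1]).max(axis=-1)   # (T-1, N)
    mask = np.zeros(latents.shape[:2], dtype=bool)
    mask[1:] = diff < tol        # the first frame is always kept
    return mask

# Toy latent video: patch 0 is a static background, patch 1 changes each frame.
latents = np.zeros((3, 2, 4))
latents[:, 1, :] = np.arange(3).reshape(3, 1)

prune = duplicate_patch_mask(latents)
kept_fraction = 1.0 - prune.mean()
```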
cs.CV / 137 / 2604.23860
Exploring Audio Hallucination in Egocentric Video Understanding
探索自我中心视频理解中的音频幻觉
Abstract
Egocentric videos provide a distinctive setting in which sound provides crucial cues for understanding user activities and surroundings, particularly when visual information is unstable or occluded due to continuous camera movement. State-of-the-art large audio-visual language models (AV-LLMs) can generate multimodal descriptions. However, we show in this work that they are prone to audio hallucinations, often inferring sounds from visual cues that are visible but not heard. We present a systematic and automatic evaluation framework for analyzing audio hallucinations in egocentric video through a targeted question-answering (Q/A) protocol. We curate a dataset of 300 egocentric videos and design 1,000 sound-focused questions to probe model outputs. To characterize hallucinations, we propose a grounded taxonomy that distinguishes between foreground action sounds from the user's activities and background ambient sounds. Our evaluation shows that advanced AV-LLMs, such as Qwen2.5 Omni, exhibit high hallucination rates, achieving only 27.3% and 39.5% accuracy on Q/As related to foreground and background sounds, respectively. With this work, we highlight the need to measure the reliability of multimodal responses, emphasizing that robust evaluation of hallucinations is essential to develop reliable AV-LLMs.
Chinese Translation
自我中心视频提供了一个独特的环境,在该环境中,声音作为理解用户活动和周围环境的关键线索,尤其是在由于持续的摄像机移动而导致视觉信息不稳定或被遮挡时。最先进的大型音视频语言模型(AV-LLMs)能够生成多模态描述。然而,我们在本研究中表明,它们容易出现音频幻觉,常常从可见但未被听到的视觉线索中推断出声音。我们提出了一个系统的自动评估框架,通过针对性问答(Q/A)协议分析自我中心视频中的音频幻觉。我们整理了一个包含300个自我中心视频的数据集,并设计了1,000个以声音为重点的问题来探测模型输出。为了表征幻觉,我们提出了一种基于基础的分类法,区分用户活动的前景动作声音和背景环境声音。我们的评估显示,先进的AV-LLMs,如Qwen2.5 Omni,表现出较高的幻觉率,在与前景和背景声音相关的问答中,仅分别达到了27.3%和39.5%的准确率。通过这项工作,我们强调了测量多模态响应可靠性的必要性,强调对幻觉的稳健评估对于开发可靠的AV-LLMs至关重要。
cs.CV / 138 / 2604.23861
Empirical Ablation and Ensemble Optimization of a Convolutional Neural Network for CIFAR-10 Classification
基于实证消融和集成优化的卷积神经网络在CIFAR-10分类中的应用
Abstract
Convolutional neural networks (CNNs) remain a central approach in image classification, but their performance depends strongly on architectural and training choices. This paper presents an empirical ablation-based study of CNN optimization for the CIFAR-10 benchmark. The study evaluates 17 progressive modifications involving training duration, learning-rate scheduling, dropout configuration, pooling strategy, network depth, filter arrangement, and dense-layer design. The goal is to identify which changes improve generalization and which increase complexity without improving performance. The baseline model achieved 79.5% test accuracy. Extending training duration improved performance steadily, whereas several structural redesigns reduced accuracy despite greater architectural variation. Based on the strongest individual configurations, a weighted ensemble was constructed, achieving 86.38% accuracy in the reduced-data setting and 89.23% when trained using the full CIFAR-10 dataset. These results suggest that performance gains in CNN-based classification depend less on indiscriminate increases in depth or parameter count than on careful empirical selection of training and architectural modifications. The study therefore highlights the practical value of ablation-oriented optimization and ensemble learning for small-image classification.
Chinese Translation
卷积神经网络(CNN)仍然是图像分类的核心方法,但其性能在很大程度上依赖于架构和训练选择。本文呈现了一项基于实证消融的CNN优化研究,针对CIFAR-10基准进行评估。研究评估了17种渐进式修改,包括训练时长、学习率调度、dropout配置、池化策略、网络深度、滤波器排列和密集层设计。目标是识别哪些变化能够改善泛化能力,哪些则增加复杂性而不提升性能。基线模型达到了79.5%的测试准确率。延长训练时长稳步提高了性能,而几种结构重设计尽管具有更大的架构变化,却降低了准确率。基于最强的单一配置,构建了一个加权集成模型,在减少数据的情况下达到了86.38%的准确率,而在使用完整CIFAR-10数据集训练时达到了89.23%的准确率。这些结果表明,基于CNN的分类性能提升与其深度或参数数量的无差别增加关系不大,而更依赖于对训练和架构修改的细致实证选择。因此,研究强调了面向消融的优化和集成学习在小图像分类中的实际价值。
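A weighted ensemble of this kind is usually built by averaging the class probabilities of the member models with per-model weights. The sketch below assumes that soft-voting scheme; the weights and toy probabilities are illustrative and are not the values used in the study.

```python
import numpy as np

def weighted_ensemble(prob_list, weights):
    """Soft voting: weighted average of per-model class probabilities.

    prob_list: list of (N, C) softmax outputs, one array per model
    weights:   one non-negative weight per model (normalized internally)
    returns:   (N,) predicted class indices
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    stacked = np.stack(prob_list)                   # (M, N, C)
    combined = np.tensordot(w, stacked, axes=1)     # (N, C)
    return combined.argmax(axis=1)

# Two toy models disagree on the second sample; the heavier model wins.
model_a = np.array([[0.9, 0.1], [0.4, 0.6]])
model_b = np.array([[0.8, 0.2], [0.7, 0.3]])
preds = weighted_ensemble([model_a, model_b], weights=[1.0, 3.0])
```

Weights could, for example, be set in proportion to each model's validation accuracy, although the abstract does not say how they were chosen.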
cs.CV / 139 / 2604.23875
Risk-Aware Robust Learning: Reducing Clinical Risk under Label Noise in Medical Image Classification
风险意识的鲁棒学习:在医学图像分类中减少标签噪声下的临床风险
Abstract
Noisy labels are a pervasive challenge in medical image classification, where annotation errors arise from inter-observer variability and diagnostic ambiguity. Although several noise-robust learning methods have been proposed, their evaluation predominantly relies on accuracy-oriented metrics, overlooking the clinical implications of asymmetric error costs. In medical diagnosis, a false negative (missed disease) carries substantially higher consequences than a false positive (false alarm), as delayed treatment can directly impact patient outcomes. In this work, we investigate whether noise-robust training methods preserve clinical safety under label noise. We conduct a systematic risk-aware evaluation of the state-of-the-art noise-robust methods Coteaching, DivideMix, UNICON, and a GMM-based filtering approach on binarized DermaMNIST and PathMNIST datasets under clean labels and label-noise rates of 20% and 40%. Beyond balanced accuracy, we adopt a cost-sensitive Global Risk formulation that explicitly penalizes false negatives. Our analysis reveals that the robustness of state-of-the-art methods does not guarantee clinical safety. Furthermore, we demonstrate that integrating cost-sensitive optimization into noise-robust training significantly reduces clinical risk while maintaining model utility. These findings demonstrate that noise-robust learning must be evaluated through a clinical risk lens, and that combining robust training with cost-sensitive optimization can meaningfully reduce risk in noisy-label medical imaging scenarios.
Chinese Translation
标签噪声是医学图像分类中的一个普遍挑战,注释错误源于观察者间的变异性和诊断模糊性。尽管已经提出了几种鲁棒学习方法来应对噪声问题,但它们的评估主要依赖于以准确率为导向的指标,忽视了不对称错误成本的临床影响。在医学诊断中,假阴性(漏诊)所带来的后果远比假阳性(误报)严重,因为延迟治疗会直接影响患者的预后。在本研究中,我们探讨了噪声鲁棒训练方法在标签噪声下是否能够保持临床安全性。我们对最先进的噪声鲁棒方法Coteaching、DivideMix、UNICON以及基于GMM的过滤方法在干净和20%、40%标签噪声率的二值化DermaMNIST和PathMNIST数据集上进行了系统的风险意识评估。除了平衡准确率外,我们采用了一种成本敏感的全球风险公式,明确惩罚假阴性。我们的分析表明,最先进方法的鲁棒性并不保证临床安全。此外,我们还证明将成本敏感优化整合到噪声鲁棒训练中可以显著降低临床风险,同时保持模型的有效性。这些发现表明,噪声鲁棒学习必须通过临床风险的视角进行评估,并且将鲁棒训练与成本敏感优化相结合可以在噪声标签的医学成像场景中有效降低风险。
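A cost-sensitive risk that penalizes false negatives more heavily than false positives can be written as an average per-sample cost. The sketch below is one simple form; the 5:1 cost ratio is a placeholder, and the paper's Global Risk formulation may be defined differently.

```python
def global_risk(y_true, y_pred, cost_fn=5.0, cost_fp=1.0):
    """Assumed cost-sensitive risk: average misclassification cost per sample.

    y_true, y_pred: sequences of 0/1 labels, with 1 meaning disease present
    cost_fn:        hypothetical cost of a missed disease (false negative)
    cost_fp:        hypothetical cost of a false alarm (false positive)
    """
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return (cost_fn * fn + cost_fp * fp) / len(y_true)

y_true = [1, 1, 0, 0]
misses_disease = [0, 1, 0, 0]    # one false negative, 75% accuracy
false_alarm = [1, 1, 1, 0]       # one false positive, 75% accuracy

risk_fn = global_risk(y_true, misses_disease)
risk_fp = global_risk(y_true, false_alarm)
```

Both toy classifiers have the same accuracy, yet the one that misses a disease case carries five times the risk, which is the distinction accuracy-oriented metrics hide.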
cs.CV / 140 / 2604.23899
Mammographic Lesion Segmentation with Lightweight Models: A Comparative Study
轻量级模型的乳腺影像病变分割:比较研究
Abstract
Breast cancer is a leading cause of cancer-related mortality among women worldwide, with mammography as the primary screening tool. While deep learning models have shown strong performance in lesion segmentation, most rely on computationally intensive architectures that limit their use in resource-constrained environments. This study evaluates the performance and efficiency of lightweight models for mammographic lesion segmentation. Architectures including MobileNetV2, EfficientNet Lite, ENet, and Fast-SCNN were compared against a U-Net baseline using the INbreast dataset with 5-fold cross-validation. Performance was assessed using Dice score, Intersection over Union (IoU), and Recall, alongside model complexity. MobileNetV2 with Squeeze-and-Excitation (SCSE) achieved the best performance, with a Dice score of 0.5766 while using approximately 75% fewer parameters than U-Net. Cross-dataset evaluation on the DMID dataset showed reduced accuracy due to domain shift but preserved recall. These results demonstrate that lightweight architectures offer a practical balance between performance and efficiency for deployable CAD systems.
Chinese Translation
乳腺癌是全球女性癌症相关死亡的主要原因,乳腺X线摄影是主要的筛查工具。尽管深度学习模型在病变分割方面表现出色,但大多数模型依赖于计算密集型架构,这限制了它们在资源受限环境中的应用。本研究评估了轻量级模型在乳腺影像病变分割中的性能和效率。我们比较了包括MobileNetV2、EfficientNet Lite、ENet和Fast-SCNN在内的架构,与U-Net基线模型使用INbreast数据集进行5折交叉验证。通过Dice系数、交并比(IoU)和召回率以及模型复杂度来评估性能。MobileNetV2结合Squeeze-and-Excitation(SCSE)达到了最佳性能,Dice系数为0.5766,同时使用的参数约比U-Net少75%。在DMID数据集上的跨数据集评估显示,由于领域转移,准确性有所降低,但召回率得以保留。这些结果表明,轻量级架构为可部署的计算机辅助诊断(CAD)系统提供了性能与效率之间的实用平衡。
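For readers unfamiliar with the three segmentation metrics named in the abstract above, here is a small sketch of how Dice, IoU, and Recall are commonly computed from binary masks. The function name and the convention of scoring 1.0 when a denominator is empty are assumptions made for this illustration; the paper's exact evaluation code is not given in the abstract.

```python
def segmentation_scores(pred, target):
    """Return (dice, iou, recall) for two equal-length sequences of 0/1 pixel labels."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    union = tp + fp + fn
    # Assumed convention: when there is nothing to find and nothing predicted, score 1.0.
    dice = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    iou = tp / union if union else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return dice, iou, recall


# Flattened 2x2 masks: one pixel overlaps, one is a false alarm, one is missed.
predicted_mask = [1, 1, 0, 0]
lesion_mask = [1, 0, 1, 0]
dice, iou, recall = segmentation_scores(predicted_mask, lesion_mask)
print(dice, iou, recall)
```

Dice and IoU are monotonically related through Dice = 2·IoU / (1 + IoU), so they rank models identically on a single image; Recall is reported separately because it ignores false positives and tracks only missed lesion pixels.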
cs.CV / 141 / 2604.23909
AMAVA: Adaptive Motion-Aware Video-to-Audio Framework for Visually-Impaired Assistance
AMAVA:用于视障人士辅助的自适应运动感知视频到音频框架
Abstract
Navigational aids for blind and low vision individuals struggle to convey dynamic real-world environments, leading to cognitive overload from continuous, undifferentiated feedback. We present AMAVA, a novel real-time video-to-audio framework that converts mobile device video into contextually relevant sound effects or text-to-speech descriptions. We propose a motion-aware pipeline using a lightweight AI classification model to distinguish between low and high-movement scenes followed by a real-time text-to-audio synthesis pipeline to enhance environmental perception more efficiently. In static environments, AMAVA generates spoken audio scene descriptions for situational awareness. In high-movement situations, it prioritizes safety by delivering sound cues, such as spoken hazard alerts and environmental sound effects. These audio outputs are produced by a decoder-only transformer-based vision-language model with mixture-of-experts and cross-modal attention for visual understanding, in conjunction with neural text-to-speech and natural sound synthesis networks. The proposed framework uses prompt-based caching and category-specific throttling to avoid auditory clutter and minimize latency. We present a comprehensive evaluation of the system, including a real-time navigation study comparing a white cane alone versus with AMAVA, that shows a significant increase in user confidence and perceived safety.
Chinese Translation
盲人和低视力个体的导航辅助工具在传达动态现实环境方面面临挑战,导致由于持续、无差别的反馈而产生认知过载。我们提出了AMAVA,这是一种新颖的实时视频到音频框架,能够将移动设备的视频转换为具有上下文相关性的音效或文本转语音描述。我们提出了一种运动感知管道,使用轻量级的人工智能分类模型来区分低运动和高运动场景,随后通过实时文本到音频合成管道更高效地增强环境感知。在静态环境中,AMAVA生成口语音频场景描述以提高情境意识。在高运动情况下,它通过提供声音提示(如口语危险警报和环境音效)来优先考虑安全。这些音频输出由一种仅解码器的基于变换器的视觉-语言模型生成,该模型结合了专家混合和跨模态注意力以实现视觉理解,并与神经文本到语音和自然声音合成网络协同工作。所提出的框架使用基于提示的缓存和类别特定的节流机制,以避免听觉杂乱并最小化延迟。我们对该系统进行了全面评估,包括一项实时导航研究,比较了单独使用白手杖与使用AMAVA的情况,结果显示用户信心和感知安全性显著提高。
cs.CV / 142 / 2604.23935
2nd of the 5th PVUW MeViS-Audio Track: ASR-SaSaSa2VA
第五届PVUW MeViS音频赛道的第二名:ASR-SaSaSa2VA
Abstract
Audio-based video object segmentation aims to locate and segment objects in videos conditioned on audio cues, requiring precise understanding of both appearance and motion. Recent audio-driven video segmentation methods extend MLLMs by fusing audio and visual features for end-to-end localization. Despite their promise, these approaches are computationally intensive, struggle with aligning temporal audio cues to dynamic video content, and depend on large paired audio-video datasets. To address these challenges, we present ASR-SaSaSa2VA, a resource-efficient framework for audio-guided video segmentation. The key idea is to convert audio inputs into textual motion descriptions via automatic speech recognition (ASR) models and then leverage pre-trained text-based referring video segmentation models (e.g., SaSaSa2VA) for pixel-level predictions. To further enhance robustness, we incorporate a no-target expression detection module, implemented by a fine-tuned audio-based MLLM, which filters out audio clips that do not refer to any target object. This design allows the system to exploit strong pre-trained models while effectively handling ambiguous or irrelevant audio inputs. Our approach achieves a final score of 80.7 in the 5th PVUW Challenge (MeViS-v2-Audio track), earning the second-place ranking.
Chinese Translation
基于音频的视频物体分割旨在根据音频线索定位和分割视频中的物体,这需要对外观和运动进行精确理解。最近的音频驱动视频分割方法通过融合音频和视觉特征扩展了多模态大语言模型(MLLMs),实现端到端的定位。尽管这些方法前景广阔,但它们计算密集,难以将时间音频线索与动态视频内容对齐,并且依赖于大量配对的音频-视频数据集。为了解决这些挑战,我们提出了ASR-SaSaSa2VA,这是一个资源高效的音频引导视频分割框架。其关键思想是通过自动语音识别(ASR)模型将音频输入转换为文本运动描述,然后利用预训练的基于文本的参考视频分割模型(例如,SaSaSa2VA)进行像素级预测。为了进一步增强鲁棒性,我们结合了一个无目标表达检测模块,该模块由微调的基于音频的多模态大语言模型(MLLM)实现,能够过滤掉不指向任何目标物体的音频片段。该设计使系统能够利用强大的预训练模型,同时有效处理模糊或无关的音频输入。我们的方法在第五届PVUW挑战赛(MeViS-v2-Audio赛道)中取得了80.7的最终得分,获得第二名。
cs.CV / 143 / 2604.23941
GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction
GoClick:用于自主GUI交互的轻量级元素定位模型
Abstract
Graphical User Interface (GUI) element grounding (precisely locating elements on screenshots based on natural language instructions) is fundamental for agents interacting with GUIs. Deploying this capability directly on resource-constrained devices like mobile phones is increasingly critical for GUI agents requiring low latency. However, this goal faces a significant challenge, as current visual grounding methods typically employ large vision-language models (VLMs) with more than 2.5B parameters, making them impractical for on-device execution due to memory and computational constraints. To address this, this paper introduces GoClick, a lightweight GUI element grounding VLM with only 230M parameters that achieves excellent visual grounding accuracy, even on par with significantly larger models. Simply downsizing existing decoder-only VLMs is a straightforward way to design a lightweight model, but our experiments reveal that this approach yields suboptimal results. Instead, we select an encoder-decoder architecture, which outperforms decoder-only alternatives at small parameter scales for GUI grounding tasks. Additionally, the limited capacity of small VLMs encourages us to develop a Progressive Data Refinement pipeline that utilizes task type filtering and data ratio adjustment to extract a high-quality 3.8M-sample core set from a 10.8M raw dataset. Training GoClick using this core set brings notable grounding accuracy gains. Our experiments show that GoClick excels on multiple GUI element grounding benchmarks while maintaining a small size and high inference speed. GoClick also enhances GUI agent performance when integrated into a device-cloud collaboration framework, where GoClick helps cloud-based task planners perform precise element localization and achieve higher success rates. We hope our method serves as a meaningful exploration within the GUI agent community.
Chinese Translation
图形用户界面(GUI)元素定位(根据自然语言指令精确定位屏幕截图上的元素)是与GUI交互的智能体的基础。在资源受限的设备(如手机)上直接部署这一能力对于需要低延迟的GUI智能体变得愈发重要。然而,这一目标面临重大挑战,因为当前的视觉定位方法通常采用大型视觉-语言模型(VLM)(参数超过25亿),使其在内存和计算限制下不适合在设备上执行。为了解决这一问题,本文提出了GoClick,一种仅有2.3亿参数的轻量级GUI元素定位VLM,能够实现卓越的视觉定位精度,甚至与显著更大模型的性能相当。简单地缩小现有的仅解码器VLM是一种设计轻量级模型的直接方法,但我们的实验表明,这种方法的效果并不理想。相反,我们选择了编码-解码架构,在小参数规模下,其在GUI定位任务中优于仅解码器的替代方案。此外,小型VLM的有限容量促使我们开发了一个渐进式数据精炼管道,通过任务类型过滤和数据比例调整,从1080万的原始数据集中提取出一个高质量的380万样本核心集。使用该核心集训练GoClick带来了显著的定位精度提升。我们的实验表明,GoClick在多个GUI元素定位基准测试中表现出色,同时保持小巧的体积和高效的推理速度。GoClick在设备-云协作框架中集成时也提升了GUI智能体的性能,在该框架中,GoClick帮助基于云的任务规划者进行精确的元素定位并实现更高的成功率。我们希望我们的方法能够为GUI智能体社区提供有意义的探索。
cs.CV / 144 / 2604.23950
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
LearnPruner:重新思考视觉语言模型中的基于注意力的令牌剪枝
Abstract
Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this issue by pruning unimportant visual tokens, achieving substantial computational reduction while maintaining model performance. The core of token pruning lies in determining token importance, with current approaches primarily relying on attention scores from vision encoders or Large Language Models (LLMs). In this paper, we analyze the effectiveness of attention mechanisms in both vision encoders and LLMs. We find that vision encoders suffer from attention sink, leading to poor focus on informative foreground regions, while in LLMs, although prior studies have identified attention bias toward token positions, text-to-vision attention demonstrates resistance to this bias and enables effective pruning guidance in middle layers. Based on these observations, we propose LearnPruner, a two-stage token pruning framework that first removes redundant vision tokens via a learnable pruning module after the vision encoder, then retains only task-relevant tokens in the LLM's middle layer. Experimental results show that our LearnPruner can preserve approximately 95% of the original performance while using only 5.5% of vision tokens, and achieve 3.2$\times$ inference acceleration, demonstrating a superior accuracy-efficiency trade-off.
Chinese Translation
视觉语言模型(VLMs)最近在视觉理解和推理方面展现了显著的能力,但由于长视觉序列输入,它们也带来了巨大的计算负担。近期的研究通过剪除不重要的视觉令牌来解决这一问题,在保持模型性能的同时实现了显著的计算减少。令牌剪枝的核心在于确定令牌的重要性,目前的方法主要依赖于视觉编码器或大型语言模型(LLMs)中的注意力分数。本文分析了注意力机制在视觉编码器和LLMs中的有效性。我们发现视觉编码器存在注意力沉没现象,导致对信息丰富的前景区域关注不足,而在LLMs中,尽管先前的研究已识别出对令牌位置的注意力偏向,但文本到视觉的注意力表现出对这种偏向的抵抗,并在中间层中提供有效的剪枝指导。基于这些观察,我们提出了LearnPruner,一个两阶段的令牌剪枝框架,首先通过可学习的剪枝模块在视觉编码器之后去除冗余的视觉令牌,然后在LLM的中间层中仅保留与任务相关的令牌。实验结果表明,我们的LearnPruner能够在仅使用5.5%的视觉令牌的情况下,保留约95%的原始性能,并实现3.2倍的推理加速,展示了优越的准确性-效率权衡。
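The abstract above describes pruning visual tokens according to an importance score. As a rough sketch of only the generic score-based selection step (not LearnPruner's learnable module or its use of text-to-vision attention in the LLM's middle layers), the snippet below keeps the highest-scoring fraction of tokens while preserving their original order. The function name, the toy scores, and the sequence length of 576 tokens are hypothetical.

```python
import math


def prune_tokens(scores, keep_ratio):
    """Return indices of the highest-scoring tokens, in their original order.

    scores: one importance value per visual token (e.g. an attention weight).
    keep_ratio: fraction of tokens to keep, in (0, 1]; at least one token survives.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not scores:
        return []
    n_keep = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Re-sort the survivors by position so the pruned sequence keeps spatial order.
    return sorted(ranked[:n_keep])


toy_scores = [0.1, 0.5, 0.05, 0.3]
kept = prune_tokens(toy_scores, keep_ratio=0.5)
print(kept)

# With an assumed 576-token image and the 5.5% budget quoted in the abstract:
budget = len(prune_tokens([0.0] * 576, keep_ratio=0.055))
print(budget)
```

Rounding up with `math.ceil` is a design choice of this sketch; it guarantees the budget is never undershot, at the cost of keeping slightly more than the nominal ratio on short sequences.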
cs.CV / 145 / 2604.23953
Viewport-Unaware Blind Omnidirectional Image Quality Assessment: A Unified and Generalized Approach
无视口盲态全向图像质量评估:统一与通用的方法
Abstract
Blind omnidirectional image quality assessment (BOIQA) presents a great challenge to the visual quality assessment community, due to different storage formats and diverse user viewing behaviors. The main paradigm of BOIQA models includes two steps, i.e., viewport generation and quality prediction, which brings an extra computational burden and is hard to generalize to other visual contents (e.g., 2D planar images). Thus, in this paper, we make an attempt to solve these issues. First, we experimentally find that BOIQA can be formulated as a blind (2D planar) image quality assessment (BIQA) problem, i.e., the first step (viewport generation) is no longer needed, which narrows the natural gap between BOIQA and BIQA. Then, we present a new BOIQA approach, which has three merits: it is viewport-unaware (it accepts an omnidirectional image in the widely used equirectangular projection format as input without any transformation), unified (it can also be applied to BIQA), and generalized (it generalizes better than competing methods). Finally, we validate its promise through a held-out test, cross-database validation, and the well-established gMAD competition.
Chinese Translation
盲态全向图像质量评估(BOIQA)对视觉质量评估领域提出了巨大的挑战,原因在于不同的存储格式和多样的用户观看行为。BOIQA模型的主要范式包括两个步骤,即视口生成和质量预测,这带来了额外的计算负担,并且难以推广到其他视觉内容(例如,二维平面图像)。因此,本文尝试解决这些问题。首先,我们通过实验发现,BOIQA可以被表述为一个盲态(二维平面)图像质量评估(BIQA)问题,即第一步——视口生成——不再需要,这缩小了BOIQA与BIQA之间的自然差距。然后,我们提出了一种新的BOIQA方法,该方法具有三个优点:即无视口——它接受广泛使用的等矩形投影格式的全向图像作为输入,无需任何转换;统一——它也可以应用于BIQA;以及通用——它在与其他竞争者的比较中显示出更好的泛化能力。最后,我们通过留出测试、跨数据库验证以及成熟的gMAD竞赛验证了其前景。
cs.CV / 146 / 2604.23957
LAVA: Layered Audio-Visual Anti-tampering Watermarking for Robust Deepfake Detection and Localization
LAVA:分层音视频防篡改水印技术用于稳健的深伪检测与定位
Abstract
Proactive watermarking offers a promising approach for deepfake tamper detection and localization in short-form videos. However, existing methods often decouple audio and visual evidence and assume that watermark signals remain reliable under real-world degradations, making tamper localization vulnerable to multimodal misalignment and compression distortions. Moreover, existing semi-fragile visual watermarking methods often degrade significantly under codec compression because their embedding bands overlap with compression-sensitive frequency regions. To address these limitations, we propose Layered Audio-Visual Anti-tampering Watermarking (LAVA), a calibration-aware audio-visual watermark fusion framework for deepfake tamper detection and localization. LAVA leverages cross-modal watermark fusion and calibration-aware alignment to preserve consistent and reliable tamper evidence under compression and audio-visual asynchrony, enabling robust tamper localization. Extensive experiments demonstrate that LAVA achieves near-perfect detection performance (AP = 0.999), remains robust to compression and multimodal misalignment, and significantly improves tamper localization reliability over existing audio-visual fusion baselines.
Chinese Translation
主动水印技术为短视频中的深伪篡改检测与定位提供了一种有前景的方法。然而,现有方法通常将音频和视觉证据分离,并假设水印信号在现实世界的降质条件下仍然可靠,这使得篡改定位容易受到多模态失配和压缩失真的影响。此外,现有的半脆弱视觉水印方法在编解码压缩下往往会显著降级,因为它们的嵌入频带与对压缩敏感的频率区域重叠。为了解决这些局限性,我们提出了分层音视频防篡改水印技术(LAVA),这是一种针对深伪篡改检测与定位的校准感知音视频水印融合框架。LAVA利用跨模态水印融合和校准感知对齐,在压缩和音视频不同步的情况下保持一致且可靠的篡改证据,从而实现稳健的篡改定位。大量实验表明,LAVA实现了近乎完美的检测性能(AP = 0.999),对压缩和多模态失配保持稳健,并显著提高了相较于现有音视频融合基线的篡改定位可靠性。
cs.CV / 147 / 2604.23977
Multi-View Synergistic Learning with Vision-Language Adaption for Low-Resource Biomedical Image Classification
低资源生物医学图像分类的多视角协同学习与视觉-语言适应
Abstract
Accurate biomedical image classification under low-resource conditions remains challenging due to limited annotations, subtle inter-class visual differences, and complex disease semantics. While vision--language models offer a promising foundation for mitigating data scarcity, their effective adaptation in biomedical settings is constrained by the need for parameter-efficient tuning alongside fine-grained and semantically consistent representation learning. In this work, we propose Multi-View Synergistic Learning (MVSL), a unified framework that addresses these challenges by jointly considering adaptation paradigms, representation granularity, and disease semantic relationships. MVSL decouples the adaptation of visual and textual encoders to respect their distinct representational characteristics, enabling more stable and effective parameter-efficient fine-tuning. It further introduces multi-granularity contrastive learning to explicitly model both global image semantics and localized lesion-level evidence, improving fine-grained discrimination for visually similar disease categories. In addition, MVSL preserves disease-level semantic structure by incorporating structured supervision derived from large language models, which constrains textual representations at the class level and indirectly regularizes visual embeddings through cross-modal alignment. Together, these components enable more stable cross-modal alignment and improved discrimination under limited supervision. Extensive experiments on $11$ public biomedical datasets spanning $9$ imaging modalities and $10$ anatomical regions demonstrate that MVSL consistently outperforms state-of-the-art methods in few-shot and zero-shot classification settings.
Chinese Translation
在低资源条件下,准确的生物医学图像分类仍然面临挑战,原因包括有限的标注、细微的类间视觉差异以及复杂的疾病语义。尽管视觉-语言模型为缓解数据稀缺提供了有希望的基础,但在生物医学环境中的有效适应受到参数高效调优的需求以及细粒度和语义一致的表示学习的限制。在本研究中,我们提出了多视角协同学习(Multi-View Synergistic Learning, MVSL),这是一个统一框架,旨在通过共同考虑适应范式、表示粒度和疾病语义关系来解决这些挑战。MVSL 解耦了视觉和文本编码器的适应,以尊重它们独特的表示特征,从而实现更稳定和高效的参数调优。它进一步引入多粒度对比学习,明确建模全局图像语义和局部病变证据,提高对视觉相似疾病类别的细粒度区分。此外,MVSL 通过结合来自大型语言模型的结构化监督,保持疾病级别的语义结构,这在类级别上约束文本表示,并通过跨模态对齐间接正则化视觉嵌入。综合这些组件,使得在有限监督下实现更稳定的跨模态对齐和改善的区分能力。在涵盖 9 种成像模式和 10 个解剖区域的 11 个公共生物医学数据集上的广泛实验表明,MVSL 在少样本和零样本分类设置中始终优于最先进的方法。
cs.CV / 148 / 2604.23982
Hierarchical Prototype-based Domain Priors for Multiple Instance Learning in Multimodal Histopathology Analysis
基于层次原型的领域先验在多模态组织病理分析中的多实例学习
Abstract
Digital pathology has fundamentally altered diagnostic workflows by enabling the computational analysis of gigapixel Whole Slide Images (WSIs), yet effectively deciphering their complex tumor microenvironments remains a formidable challenge. Existing Multiple Instance Learning (MIL) frameworks typically treat Whole Slide Images as unstructured bags of patches, discarding critical morphological semantics and spatial geometry. This lack of inductive bias often leads to overfitting on background noise and fails to align visual features with high-level diagnostic knowledge. To overcome these limitations, we propose the Hierarchical Prototype-based Domain Priors (HPDP) framework, a unified multimodal approach for joint histopathology diagnosis and prognosis. HPDP mitigates the data-driven "black box" issue by introducing a Morphologically Anchored Prototype System (MAPS), which anchors learning to interpretable morphological clusters, and a Sinusoidal Positional Encoder (SPE) to explicitly model tissue architecture. Furthermore, we bridge the semantic gap via a Hierarchical Cross-Modal Alignment (HCMA) module, using Large Language Model (LLM)-generated descriptions to contextually refine visual representations. Extensive experiments across seven cancer cohorts demonstrate that HPDP consistently achieves state-of-the-art performance with superior robustness and interpretability.
Chinese Translation
数字病理学通过实现对千兆像素全切片图像(Whole Slide Images, WSIs)的计算分析,根本改变了诊断工作流程,但有效解读其复杂的肿瘤微环境仍然是一项艰巨的挑战。现有的多实例学习(Multiple Instance Learning, MIL)框架通常将全切片图像视为无结构的补丁袋,忽略了关键的形态学语义和空间几何。这种缺乏归纳偏差的情况常常导致对背景噪声的过拟合,并未能将视觉特征与高层次的诊断知识对齐。为克服这些局限性,我们提出了层次原型基础的领域先验(Hierarchical Prototype-based Domain Priors, HPDP)框架,这是一种用于联合组织病理诊断和预后的统一多模态方法。HPDP通过引入形态学锚定原型系统(Morphologically Anchored Prototype System, MAPS)来缓解数据驱动的“黑箱”问题,该系统将学习锚定到可解释的形态学聚类,并使用正弦位置编码器(Sinusoidal Positional Encoder, SPE)来明确建模组织结构。此外,我们通过层次跨模态对齐(Hierarchical Cross-Modal Alignment, HCMA)模块弥合语义差距,利用大型语言模型(Large Language Model, LLM)生成的描述来上下文化地优化视觉表征。在七个癌症队列上的广泛实验表明,HPDP始终实现了最先进的性能,并具备更强的鲁棒性和可解释性。
cs.CV / 149 / 2604.23996
SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs
SMoES:在MoE-VLM中进行软模态引导的专家专业化
Abstract
Mixture-of-Experts (MoE) has become a prevalent backbone for large vision-language models (VLMs), yet how modality-specific signals should guide expert routing remains under-explored. Existing routing strategies are either hand-crafted or modality-agnostic, relying on idealized priors that ignore the layer-dependent modality fusion patterns in MoE-VLMs and provide little guidance for expert specialization. We propose Soft Modality-guided Expert Specialization (SMoES), which consists of dynamic soft modality scores that capture layer-dependent fusion patterns, an expert binning mechanism aligned with expert-parallel deployment, and an inter-bin mutual information regularization that encourages coherent modality specialization. Our method leverages attention-based or Gaussian-statistics modality scores to optimize mutual information regularization. Experiments across four MoE-based VLMs and 16 benchmarks demonstrate improvement on both effectiveness and efficiency: 0.9% and 4.2% average gain on multimodal and language-only tasks, 56.1% reduction in EP communication overhead, and 12.3% throughput improvement under realistic deployment. These results validate that aligning routing with modality-aware expert specialization unlocks MoE-VLM capacity and efficiency.
Chinese Translation
混合专家(MoE)已成为大型视觉-语言模型(VLM)的普遍骨干,但如何利用模态特定信号引导专家路由仍然未得到充分探索。现有的路由策略要么是手工设计的,要么是与模态无关的,依赖于理想化的先验,忽视了MoE-VLM中层依赖的模态融合模式,且对专家专业化提供的指导有限。我们提出了软模态引导的专家专业化(SMoES),其包括动态软模态得分,以捕捉层依赖的融合模式,符合专家并行部署的专家分箱机制,以及鼓励一致模态专业化的跨箱互信息正则化。我们的方法利用基于注意力或高斯统计的模态得分来优化互信息正则化。在四个基于MoE的VLM和16个基准测试中的实验表明,在有效性和效率上都有所提升:在多模态和仅语言任务上平均提升0.9%和4.2%,EP通信开销减少56.1%,在现实部署下吞吐量提升12.3%。这些结果验证了将路由与模态感知的专家专业化对齐能够释放MoE-VLM的能力和效率。
cs.CV / 150 / 2604.24023
ServImage: An Image Generation and Editing Benchmark from Real-world Commercial Imaging Services
ServImage:来自真实商业影像服务的图像生成与编辑基准
Abstract
Recent image generation and editing models demonstrate robust adherence to instructions and high visual quality on academic benchmarks. However, their performance on paid, real-world design projects remains uncertain. We introduce \textbf{ServImage}, a benchmark that explicitly correlates model outputs with economic value in commercial design projects. ServImage consists of (i) \textbf{\textit{ServImageBench}}: a dataset of 1.07k paid commercial design tasks and 2.05k designer deliverables totaling over \$295k, covering portrait, product, and digital content, along with 33k candidate images and 33k human annotations; (ii) \textbf{\textit{ServImageScore}}: an integrated scoring system that combines three quality dimensions: baseline requirements fulfilment, visual execution quality, and commercial necessity satisfaction. These three dimensions are designed to characterize the factors that drive human payment decisions and indicate whether an image is commercially acceptable; (iii) \textbf{\textit{ServImageModel}}: under this scoring system, we propose a payment prediction model trained on the human-annotated candidate images, achieving 82.00\% accuracy in predicting human payment decisions and producing calibrated payment probabilities. ServImage provides a comprehensive foundation for assessing the commercial viability of image generation models and offers a scalable resource for future research on economically grounded vision systems. Available on \href{https://github.com/FengxianJi/ServImage}{GitHub}.
Chinese Translation
近期的图像生成和编辑模型在学术基准上表现出对指令的强大遵循能力和高视觉质量。然而,它们在付费的真实设计项目中的表现仍然不确定。我们引入了\textbf{ServImage},这是一个明确将模型输出与商业设计项目中的经济价值相关联的基准。ServImage包括:(i) \textbf{\textit{ServImageBench}}:一个包含1.07千个付费商业设计任务和2.05千个设计师交付成果的数据集,总价值超过295,000美元,涵盖肖像、产品和数字内容,以及33,000个候选图像和33,000个人工标注。(ii) \textbf{\textit{ServImageScore}}:一个综合评分系统,结合了三个质量维度:基础要求的满足程度、视觉执行质量和商业必要性的满足。这三个维度旨在表征驱动人类支付决策的因素,并指示图像是否在商业上可接受。(iii) \textbf{\textit{ServImageModel}}:在该评分系统下,我们提出了一个基于人类标注候选图像训练的支付预测模型,成功预测人类支付决策的准确率达到82.00%,并生成了经过校准的支付概率。ServImage为评估图像生成模型的商业可行性提供了全面的基础,并为未来关于经济基础视觉系统的研究提供了可扩展的资源。
cs.CV / 151 / 2604.24024
Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras
突破嵌入式摄像头的多投影仪标定的可扩展性限制
Abstract
Conventional multi-projector calibration requires projecting and capturing structured light patterns for each projector sequentially, causing calibration time and effort to increase linearly with the number of projectors. This scalability bottleneck has long limited the deployment of large-scale projection mapping systems. We present a new calibration framework that breaks this limitation by embedding cameras into the surface of the calibration target. The embedded cameras directly capture the incoming projection light, enabling the separation of simultaneously projected structured light patterns from multiple projectors according to their incident directions. Our method establishes correspondences between the optical centers of the embedded cameras and the projector pixels, allowing the intrinsic and extrinsic parameters of all projectors to be simultaneously estimated. We further introduce a correction technique for small misalignments between the calibration board and camera optical centers. As a result, our system achieves calibration accuracy comparable to conventional methods while reducing the required number of projection-capture cycles from linear to nearly constant with respect to the number of projectors, dramatically improving scalability for dense multi-projector systems with overlapping projection regions, such as high-brightness stacking, super-resolution, light-field, and shadow-suppression displays.
Chinese Translation
传统的多投影仪标定需要依次投影并捕捉每个投影仪的结构光模式,这导致标定时间和工作量随着投影仪数量的增加而线性增长。这一可扩展性瓶颈长期以来限制了大规模投影映射系统的部署。我们提出了一种新的标定框架,通过将摄像头嵌入到标定目标的表面来打破这一限制。嵌入式摄像头直接捕捉到入射的投影光,从而根据入射方向分离来自多个投影仪的同时投影的结构光模式。我们的方法建立了嵌入式摄像头的光学中心与投影仪像素之间的对应关系,使所有投影仪的内参和外参能够同时估计。我们进一步引入了一种修正技术,以解决标定板与摄像头光学中心之间的小偏差。因此,我们的系统在实现与传统方法相当的标定精度的同时,将所需的投影-捕捉循环次数从与投影仪数量线性相关减少到几乎恒定,从而显著提高了密集多投影仪系统(如高亮度堆叠、超分辨率、光场和阴影抑制显示)在重叠投影区域的可扩展性。
cs.CV / 152 / 2604.24029
DeepTaxon: An Interpretable Retrieval-Augmented Multimodal Framework for Unified Species Identification and Discovery
DeepTaxon:一种可解释的检索增强多模态框架,用于统一物种识别与发现
Abstract
Identifying species in biology among tens of thousands of visually similar taxa while discovering unknown species in open-world environments remains a fundamental challenge in biodiversity research. Current methods treat identification and discovery as separate problems, with classification models assuming closed sets and discovery relying on threshold-based rejection. Here we present DeepTaxon, a retrieval-augmented multimodal framework that unifies species identification and discovery through interpretable reasoning over retrieved visual evidence. Given a query image, DeepTaxon retrieves the top-$k$ candidate species with $n$ exemplar images each from a retrieval index and performs chain-of-thought comparative reasoning. Critically, we redefine discovery as an explicit, retrieval-based decision problem rather than an implicit parametric memory problem. A sample is novel if and only if the retrieval index lacks sufficient evidence for identification, so each retrieval naturally yields a classification or discovery label without manual annotation, thereby providing automatic supervision for both tasks. We train the framework via supervised fine-tuning on synthetic retrieval-augmented data, followed by reinforcement learning on hard samples, converting high-recall retrieval into high-precision decisions that scale to massive taxonomic vocabularies. Extensive experiments on a large-scale in-distribution benchmark and six out-of-distribution datasets demonstrate consistent improvements in both identification and discovery. Ablation studies further reveal effective test-time scaling with candidate count $k$ and exemplar count $n$, strong zero-shot transfer to unseen domains, and consistent performance across retrieval encoders, establishing an interpretable solution for biodiversity research.
Chinese Translation
在生物学中,在数万种视觉上相似的分类群中识别物种,同时在开放世界环境中发现未知物种,仍然是生物多样性研究中的一个基本挑战。目前的方法将识别和发现视为两个独立的问题,分类模型假设封闭集,而发现则依赖于基于阈值的拒绝。在此,我们提出了DeepTaxon,一种检索增强的多模态框架,通过对检索到的视觉证据进行可解释的推理,统一物种识别与发现。给定一个查询图像,DeepTaxon从检索索引中检索出前$k$个候选物种,每个物种有$n$个示例图像,并进行链式思维的比较推理。关键是,我们将发现重新定义为一个明确的基于检索的决策问题,而不是一个隐式的参数记忆问题。当且仅当检索索引缺乏足够的识别证据时,样本才被视为新颖,因此每次检索自然产生一个分类或发现标签,而无需手动标注,从而为这两个任务提供自动监督。我们通过在合成检索增强数据上进行监督微调来训练该框架,然后在困难样本上进行强化学习,将高召回率的检索转化为高精度的决策,能够扩展到大规模的分类词汇。对大规模分布内基准和六个分布外数据集的广泛实验表明,在识别和发现方面均有一致的改进。消融研究进一步揭示了在候选数量$k$和示例数量$n$上的有效测试时间扩展,对未见领域的强零样本迁移,以及在检索编码器上的一致性能,确立了生物多样性研究的可解释解决方案。
cs.CV / 153 / 2604.24031
JSSFF: A Joint Structural-Semantic Fusion Framework for Remote Sensing Image Captioning
JSSFF:一种用于遥感图像描述的联合结构-语义融合框架
Abstract
The encoder-decoder framework is now widely used. In this model, the encoder extracts informative visual features from an input image, and the decoder employs a sequence-to-sequence formulation to generate the corresponding textual description from these features. Existing models focus more on the decision part. However, extracting meaningful information from the image can help the decoder generate an accurate caption by providing information about the objects and their relationships. Remote sensing images are highly complex. One major challenge is detecting objects that extend beyond their visible boundaries due to occlusion, overlapping structures, and unclear edges. Hence, there is a need to design an approach that can effectively capture both high-level semantics and low-level spatial details for accurate caption generation. In this work, we propose an edge-aware fusion method that feeds both the original image and its edge-aware version into the encoder to enhance feature representation and boundary awareness. We use a comparison-based beam search (CBBS) to generate captions, achieving a balanced trade-off between quantitative metrics and qualitative caption relevance through fairness-based comparison of candidate captions. Experimental results demonstrate our model's superiority over several baseline models from both quantitative and qualitative perspectives.
Chinese Translation
编码器-解码器框架如今已广泛流行。在该模型中,编码器从输入图像中提取信息丰富的视觉特征,而解码器则采用序列到序列的形式,从这些特征中生成相应的文本描述。现有模型更侧重于决策部分。然而,从图像中提取有意义的信息可以帮助解码器生成准确的描述,因为它提供了关于物体及其关系的信息。遥感图像高度复杂。一个主要挑战是检测由于遮挡、重叠结构和边缘不清晰而超出可见边界的物体。因此,需要设计一种方法,能够有效捕捉高层语义和低层空间细节,以实现准确的描述生成。在本研究中,我们提出了一种边缘感知融合方法,通过将原始图像及其边缘感知版本结合到编码器中,以增强特征表示和边界意识。我们使用基于比较的束搜索(CBBS)生成描述,以通过对候选描述的公平比较,在定量指标和定性描述相关性之间实现平衡的权衡。实验结果表明,我们的模型在定量和定性方面优于多个基线模型。
cs.CV / 154 / 2604.24036
Robust Grounding with MLLMs against Occlusion and Small Objects via Language-guided Semantic Cues
通过语言引导的语义线索实现对遮挡和小物体的稳健定位
Abstract
While Multimodal Large Language Models (MLLMs) have enhanced grounding capabilities in general scenes, their robustness in crowded scenes remains underexplored. Crowded scenes entail visual challenges (i.e., occlusion and small objects), which impair object semantics and degrade grounding performance. In contrast, language expressions are immune to such degradation and preserve object semantics. In light of these observations, we propose a novel method that overcomes such constraints by leveraging Language-Guided Semantic Cues (LGSCs). Specifically, our approach introduces a Semantic Cue Extractor (SCE) to derive semantic cues of objects from the visual pipeline of an MLLM. We then guide these cues using corresponding text embeddings to produce LGSCs as linguistic semantic priors. Subsequently, they are reintegrated into the original visual pipeline to refine object semantics. Extensive experiments and analyses demonstrate that incorporating LGSCs into an MLLM effectively improves grounding accuracy in crowded scenes.
Chinese Translation
尽管多模态大型语言模型(MLLMs)在一般场景中的定位能力得到了增强,但它们在拥挤场景中的稳健性仍然未被充分探索。拥挤场景带来了视觉挑战(即遮挡和小物体),这些挑战会损害物体语义并降低定位性能。相比之下,语言表达不受此类降级的影响,能够保持物体语义。基于这些观察,我们提出了一种新方法,通过利用语言引导的语义线索(LGSCs)来克服这些限制。具体而言,我们的方法引入了一个语义线索提取器(SCE),从MLLM的视觉处理管道中提取物体的语义线索。然后,我们使用相应的文本嵌入来引导这些线索,生成作为语言语义先验的LGSCs。随后,这些线索被重新整合回原始视觉处理管道,以细化物体语义。大量实验和分析表明,将LGSCs纳入MLLM中有效提高了在拥挤场景中的定位准确性。
cs.CV / 155 / 2604.24044
CLLAP: Contrastive Learning-based LiDAR-Augmented Pretraining for Enhanced Radar-Camera Fusion
CLLAP:基于对比学习的激光雷达增强预训练框架以提升雷达-摄像头融合
Abstract
Accurate 3D object detection is critical for autonomous driving, necessitating reliable, cost-effective sensors capable of operating in adverse weather conditions. Camera and millimeter-wave radar fusion has emerged as a promising solution; however, these methods often rely on finely annotated radar data, which is scarce and labor-intensive to produce. To address this challenge, we present CLLAP, a Contrastive Learning-based LiDAR-Augmented Pretraining framework that enhances the performance of existing radar-camera fusion methods for 3D object detection. CLLAP leverages abundant LiDAR data to generate pseudo-radar data using the proposed L2R (LiDAR-to-Radar) Sampling method. Then, it incorporates this data into a novel dual-stage, dual-modality contrastive learning strategy, enabling effective self-supervised learning from paired pseudo-radar and image data. This approach facilitates effective pretraining of existing radar-camera fusion models in a plug-and-play manner, enhancing their feature extraction capabilities and improving detection accuracy and robustness. Experimental results using NuScenes and Lyft Level 5 datasets demonstrate significant performance improvements across three baseline models, highlighting CLLAP's effectiveness in advancing radar-camera fusion for autonomous driving applications.
Chinese Translation
准确的三维物体检测对自动驾驶至关重要,这需要可靠且经济高效的传感器,能够在恶劣天气条件下工作。摄像头与毫米波雷达的融合已成为一种有前景的解决方案;然而,这些方法通常依赖于精细标注的雷达数据,而这类数据稀缺且生产成本高。为了解决这一挑战,我们提出了CLLAP,一个基于对比学习的激光雷达增强预训练框架,旨在提升现有雷达-摄像头融合方法在三维物体检测中的性能。CLLAP利用丰富的激光雷达数据,通过提出的L2R(激光雷达到雷达)采样方法生成伪雷达数据。然后,它将这些数据纳入一种新颖的双阶段、双模态对比学习策略中,使得能够有效地进行自监督学习,从配对的伪雷达和图像数据中学习。这一方法以即插即用的方式有效地预训练现有的雷达-摄像头融合模型,增强其特征提取能力,提高检测的准确性和鲁棒性。使用NuScenes和Lyft Level 5数据集的实验结果显示,在三个基线模型上显著提升了性能,突显了CLLAP在推动自动驾驶应用中雷达-摄像头融合方面的有效性。
cs.CV / 156 / 2604.24052
QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering
QEVA:一种基于无参考的叙事视频摘要评估指标,结合多模态问答
Abstract
Video-to-text summarization remains underexplored in terms of comprehensive evaluation methods. Traditional n-gram overlap-based metrics and recent large language model (LLM)-based approaches depend heavily on human-written reference summaries, limiting their practicality and sensitivity to nuanced semantic aspects. In this paper, we propose QEVA, a reference-free metric evaluating candidate summaries directly against source videos through multimodal question answering. QEVA assesses summaries along three clear dimensions: Coverage, Factuality, and Chronology. We also introduce MLVU(VS)-Eval, a new annotated benchmark derived from the MLVU dataset, comprising 800 summaries generated from 200 videos using state-of-the-art video-language multimodal models. This dataset establishes a transparent and consistent framework for evaluation. Experimental results demonstrate that QEVA shows higher correlation with human judgments compared to existing approaches, as measured by Kendall's $\tau_b$, $\tau_c$, and Spearman's $\rho$. We hope that our benchmark and metric will facilitate meaningful progress in video-to-text summarization research and provide valuable insights for the development of future evaluation methods.
Chinese Translation
视频到文本的摘要在综合评估方法方面仍然未得到充分探索。传统的基于n-gram重叠的指标和最近基于大型语言模型(LLM)的方法在很大程度上依赖于人工编写的参考摘要,这限制了它们的实用性和对细微语义方面的敏感性。本文提出了QEVA,这是一种无参考的指标,通过多模态问答直接评估候选摘要与源视频的匹配程度。QEVA从三个明确的维度评估摘要:覆盖度、事实性和时间顺序。我们还引入了MLVU(VS)-Eval,这是一个新的注释基准,源自MLVU数据集,包括从200个视频生成的800个摘要,使用了最先进的视频-语言多模态模型。该数据集建立了一个透明且一致的评估框架。实验结果表明,QEVA与人类判断的相关性高于现有方法,相关性通过Kendall的$\tau_b$、$\tau_c$和Spearman的$\rho$进行测量。我们希望我们的基准和指标能够促进视频到文本摘要研究的有意义进展,并为未来评估方法的发展提供有价值的见解。
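The agreement statistics named in the abstract can be illustrated with a minimal sketch. The code below computes Kendall's $\tau_b$ and Spearman's $\rho$ between metric scores and human ratings in plain Python. The score lists are made-up illustrative values, not data from the paper, and the function names are hypothetical helpers rather than part of QEVA.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b: pairwise rank agreement with a correction for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from every count
        if dx == 0:
            ties_x += 1  # tied only on x
        elif dy == 0:
            ties_y += 1  # tied only on y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y) if var_x and var_y else 0.0


# Illustrative (made-up) scores: one metric value and one human rating per summary.
metric_scores = [0.2, 0.5, 0.4, 0.9]
human_ratings = [1, 3, 2, 4]
print(kendall_tau_b(metric_scores, human_ratings), spearman_rho(metric_scores, human_ratings))
```

Both statistics depend only on the ordering of the scores, which is why they suit comparing a metric against ordinal human judgments.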
cs.CV / 157 / 2604.24053
Light 'em Up: Enabling Few-Shot Low-Light 3D Gaussian Splatting with Multi-Scale Explicit Retinex Illumination Decoupling
点亮它:通过多尺度显式Retinex照明解耦实现少量样本低光照3D高斯点云渲染
Abstract
Full 360$^\circ$ novel view synthesis under low-light conditions remains challenging. Insufficient illumination, noise amplification, and view-dependent photometric inconsistencies prevent existing methods from jointly preserving geometric consistency and photorealism. Unsupervised approaches often exhibit color drift under large viewpoint variations, while supervised low-light enhancement models, though effective for 2D tasks, struggle to generalize to new scenes and typically require retraining. To address this issue, we propose MERID-GS, a Multi-Scale Explicit Retinex Illumination-Decoupled Gaussian framework for low-light 360$^\circ$ synthesis. Based on Retinex theory, the method explicitly separates illumination and reflectance, and suppresses noise propagation while enhancing dark-region structures via a learnable gain and Illumination-State-Guided Frequency Gating. Combined with a lightweight Reflection Head and 3D Gaussian Splatting, MERID-GS adapts to new scenes with only a few shots and enables stable low-light novel view synthesis from sparse-view observations. In addition, we construct a low-light multi-view dataset covering full 360$^\circ$ scenes for joint evaluation. Thorough experiments across multiple datasets in this area demonstrate that MERID-GS achieves SOTA performance, exhibiting superior cross-scene generalization and view consistency. The source code and pre-trained models are available at https://github.com/YhuoyuH/MERID-GS.
Chinese Translation
在低光照条件下进行全360$^\circ$新视角合成仍然具有挑战性。照明不足、噪声放大和视角依赖的光度不一致性阻碍了现有方法在几何一致性和真实感之间的共同保持。无监督的方法在大视角变化下通常表现出颜色漂移,而监督的低光增强模型虽然在2D任务中有效,但在新场景上的泛化能力较差,通常需要重新训练。为了解决这个问题,我们提出了MERID-GS,一种用于低光照360$^\circ$合成的多尺度显式Retinex照明解耦高斯框架。该方法基于Retinex理论,显式地分离照明和反射,并通过可学习增益和照明状态引导的频率门控抑制噪声传播,同时增强暗区结构。结合轻量级反射头和3D高斯点云渲染,MERID-GS能够仅通过少量样本适应新场景,并从稀疏视图观测中实现稳定的低光照新视角合成。此外,我们构建了一个覆盖全360$^\circ$场景的低光多视图数据集以进行联合评估。在该领域多个数据集上的全面实验表明,MERID-GS达到了最先进的性能,展现出优越的跨场景泛化能力和视角一致性。源代码和预训练模型可在https://github.com/YhuoyuH/MERID-GS获取。
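The Retinex assumption the abstract builds on, that an observed image factors into reflectance times illumination, can be sketched in a few lines. This is a generic illustration and not the MERID-GS pipeline: the max-channel illumination estimate and the fixed gamma gain are assumptions chosen for simplicity, standing in for the learnable gain and frequency gating that the abstract mentions but does not specify.

```python
def retinex_decompose(pixels, eps=1e-6):
    """Split RGB pixels in [0, 1] into illumination L and reflectance R with I ~= R * L.

    Assumption: illumination is estimated as the per-pixel maximum channel.
    """
    illumination = [max(p) for p in pixels]
    reflectance = [tuple(c / (l + eps) for c in p) for p, l in zip(pixels, illumination)]
    return illumination, reflectance


def relight(illumination, reflectance, gamma=0.5):
    """Brighten by applying a gamma curve to illumination only, then recombine.

    Reflectance is left untouched, so channel ratios (colour) are preserved.
    """
    return [
        tuple(min(1.0, c * (l ** gamma)) for c in r)
        for l, r in zip(illumination, reflectance)
    ]


# A single dark, reddish pixel (illustrative values).
dark = [(0.04, 0.02, 0.01)]
L, R = retinex_decompose(dark)
bright = relight(L, R)
print(bright)
```

Adjusting only the illumination term is what lets a Retinex-style method raise brightness without shifting colour, which is the property the abstract contrasts with the colour drift of unsupervised approaches.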
cs.CV / 158 / 2604.24109
SemiSAM-O1: How far can we push the boundary of annotation-efficient medical image segmentation?
SemiSAM-O1:我们能将注释高效的医学图像分割的边界推向多远?
Abstract
Semi-supervised learning (SSL) has become a promising solution to alleviate the annotation burden of deep learning-based medical image segmentation models. While recent advances in foundation model-driven SSL have pushed the boundary to extremely limited annotation scenarios, they fail to maintain robust competitive performance in complex imaging modalities. In this paper, we propose SemiSAM-O1, an annotation-efficient framework using only one annotated template image for segmentation. SemiSAM-O1 extends the specialist-generalist collaborative learning framework to the extreme one-label setting by fully exploiting the foundation model's feature representation capability beyond its prompting interface. SemiSAM-O1 operates in two stages. In the first stage, the foundation model's encoder extracts dense features from all volumes, and class prototypes derived from the single annotated template are propagated to the unlabeled pool via feature similarity to produce coarse initial pseudo-labels. In the second stage, an iterative training-and-refinement loop progressively improves both the segmentation model and the pseudo-labels over multiple rounds, where each round trains the model from scratch on current pseudo-labels and generates updated predictions with voxel-wise uncertainty estimates. An uncertainty-guided refinement step further leverages the foundation model's global feature space to correct high-uncertainty regions by aggregating labels from their most similar confident neighbors, establishing a virtuous cycle of mutual improvement. Extensive experiments on a wide range of segmentation tasks across different modalities and anatomical targets demonstrate that SemiSAM-O1 significantly narrows the performance gap between one-label semi-supervised learning and full supervision, while significantly reducing the computational overhead of online foundation model inference.
Chinese Translation
半监督学习(SSL)已成为缓解基于深度学习的医学图像分割模型注释负担的有前景的解决方案。尽管最近在基础模型驱动的SSL方面的进展已将边界推向极其有限的注释场景,但在复杂成像模式下,它们未能保持强有力的竞争性能。本文提出了SemiSAM-O1,这是一种仅使用一个带注释模板图像进行分割的高效注释框架。SemiSAM-O1将专家-通用学习协作框架扩展到极端的一标签设置,通过充分利用基础模型的特征表示能力,超越其提示接口。SemiSAM-O1分为两个阶段进行。在第一阶段,基础模型的编码器从所有体积中提取稠密特征,并从单个带注释模板派生的类别原型通过特征相似性传播到未标记池,以生成粗略的初始伪标签。在第二阶段,迭代训练与精炼循环逐步改善分割模型和伪标签,在多个轮次中,每轮从当前伪标签重新训练模型并生成带有体素级不确定性估计的更新预测。一个不确定性引导的精炼步骤进一步利用基础模型的全局特征空间,通过聚合来自其最相似的可信邻居的标签来纠正高不确定性区域,从而建立互相改善的良性循环。在不同模式和解剖目标的广泛分割任务上的大量实验表明,SemiSAM-O1显著缩小了一标签半监督学习与全监督之间的性能差距,同时显著减少了在线基础模型推理的计算开销。
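The first stage described above, where class prototypes from one annotated template are propagated to unlabeled data by feature similarity, can be sketched as follows. This is a toy illustration under stated assumptions: features are short hand-written vectors rather than foundation-model encoder outputs, prototypes are plain per-class means, and all function names are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def class_prototypes(features, labels):
    """Average the template features of each class into one prototype vector."""
    grouped = {}
    for feature, label in zip(features, labels):
        grouped.setdefault(label, []).append(feature)
    return {
        label: [sum(col) / len(group) for col in zip(*group)]
        for label, group in grouped.items()
    }


def propagate(prototypes, unlabeled_features):
    """Assign each unlabeled feature the label of its most similar prototype."""
    return [
        max(prototypes, key=lambda label: cosine(prototypes[label], feature))
        for feature in unlabeled_features
    ]


# Toy 2-D features for the single annotated template (0 = background, 1 = organ).
template_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
template_labels = [0, 0, 1, 1]
prototypes = class_prototypes(template_features, template_labels)
pseudo_labels = propagate(prototypes, [[0.8, 0.2], [0.2, 0.7]])
print(pseudo_labels)
```

In the framework described by the abstract, pseudo-labels of this kind are only a coarse starting point that the second-stage training-and-refinement loop then improves.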
cs.CV / 159 / 2604.24119
TopoHR: Hierarchical Centerline Representation for Cyclic Topology Reasoning in Driving Scenes with Point-to-Instance Relations
TopoHR:用于具有点到实例关系的驾驶场景中循环拓扑推理的分层中心线表示
Abstract
Topology reasoning is crucial for autonomous driving. Current methods primarily focus on instance-level learning for centerline detection, followed by a sequential module for topology reasoning that relies on simplified MLP layers. Moreover, they often neglect the importance of \textit{point-to-instance} (P2I) relationships in topology reasoning. To address these limitations, we present TopoHR (Topological Hierarchical Representation), a novel end-to-end framework that establishes cyclic interaction between centerline detection and topology reasoning, allowing them to iteratively enhance each other. Specifically, we introduce a hierarchical centerline representation including point queries, instance queries, and semantic representations. These multi-level features are integrated and fused within a hierarchical centerline decoder. Furthermore, we design a hierarchical topology reasoning module that captures both fine-grained P2I relationships and global instance-to-instance (I2I) connections within a unified architecture. With these components, TopoHR achieves accurate and robust topology reasoning. On the OpenLane-V2 benchmark, TopoHR sets a new state of the art. Compared with the previous best results, TopoHR achieves +3.8 in $\mathrm{DET}_{\text{l}}$ and +5.4 in $\mathrm{TOP}_{\text{ll}}$ on $\text{subset_A}$, and +11.0 in $\mathrm{DET}_{\text{l}}$ and +7.9 in $\mathrm{TOP}_{\text{ll}}$ on $\text{subset_B}$, validating the effectiveness of the proposed components. The code will be shared publicly at https://github.com/Yifeng-Bai/TopoHR.git.
Chinese Translation
拓扑推理对于自动驾驶至关重要。目前的方法主要集中在实例级学习以进行中心线检测,随后通过依赖简化的多层感知器(MLP)层的顺序模块进行拓扑推理。此外,它们往往忽视了在拓扑推理中\textit{点到实例}(P2I)关系的重要性。为了解决这些局限性,我们提出了TopoHR(拓扑分层表示),这是一种新颖的端到端框架,建立了中心线检测与拓扑推理之间的循环交互,使它们能够相互迭代增强。具体而言,我们引入了一种分层中心线表示,包括点查询、实例查询和语义表示。这些多层特征在分层中心线解码器中无缝集成和融合。此外,我们设计了一个分层拓扑推理模块,在统一架构中捕捉细粒度的P2I关系和全局实例到实例(I2I)连接。通过这些新颖的组件,TopoHR确保了准确和稳健的拓扑推理。在OpenLane-V2基准测试中,TopoHR刷新了最先进的性能,取得了显著的改进。值得注意的是,与之前的最佳结果相比,TopoHR在$\text{subset_A}$上实现了$\mathrm{DET}_{\text{l}}$ +3.8,$\mathrm{TOP}_{\text{ll}}$ +5.4,在$\text{subset_B}$上实现了$\mathrm{DET}_{\text{l}}$ +11.0,$\mathrm{TOP}_{\text{ll}}$ +7.9,验证了所提组件的有效性。代码将公开分享于https://github.com/Yifeng-Bai/TopoHR.git。
cs.CV / 160 / 2604.24123
FDIM: A Feature-distance-based Generic Video Quality Metric for Versatile Codecs
FDIM:一种基于特征距离的通用视频质量度量,用于多种编码器
Abstract
Video technology is advancing toward Ultra High Definition (UHD) and High Dynamic Range (HDR), which intensifies the need for higher compression efficiency for these high-specification videos. Beyond advances in traditional codecs, neural video codecs (NVCs) have attracted significant research attention and have evolved rapidly over the past few years. The coding artifacts of NVCs often exhibit content-varying and generative characteristics, which differ from those of conventional codecs and are challenging for traditional video quality assessment (VQA) methods to capture. Therefore, VQA metrics are required to generalize across different codecs, content types, and dynamic ranges to better support video codec research and evaluation. In this paper, we propose FDIM, a feature-distance-based generic video quality metric for both traditional and neural video codecs across SDR and HDR formats. FDIM employs a hybrid architecture that integrates deep and hand-crafted features. The deep feature component learns multi-scale representations to capture distortions ranging from structural and textural fidelity degradation to high-level semantic deviations, while the hand-crafted feature component provides stable complementary cues to improve overall generalization. We trained FDIM on a large-scale subjective quality assessment dataset (DCVQA) consisting of over 16k video sequences encoded by traditional block-based hybrid video codecs and end-to-end perceptually optimized neural video codecs. Extensive experiments on ten SDR/HDR VQA datasets containing diverse, previously unseen codecs demonstrate that FDIM achieves strong generalization and high correlation with subjective assessment. The source code for FDIM and the DCVQA validation set will be released at https://github.com/MCL-ZJU/FDIM.
Chinese Translation
视频技术正朝着超高清(UHD)和高动态范围(HDR)发展,这加大了对这些高规格视频更高压缩效率的需求。除了传统编码器的进步,神经视频编码器(NVCs)近年来也引起了显著的研究关注,并迅速发展。NVCs的编码伪影通常表现出内容变化和生成特征,这与传统编码器的特征不同,传统视频质量评估(VQA)方法难以捕捉。因此,需要VQA度量能够在不同编码器、内容类型和动态范围之间进行泛化,以更好地支持视频编码研究和评估。在本文中,我们提出了FDIM,一种基于特征距离的通用视频质量度量,适用于SDR和HDR格式的传统和神经视频编码器。FDIM采用混合架构,整合了深度特征和手工设计特征。深度特征组件学习多尺度表示,以捕捉从结构和纹理保真度下降到高级语义偏差的失真,而手工设计特征组件提供稳定的补充线索,以提高整体泛化能力。我们在一个大规模的主观质量评估数据集(DCVQA)上训练了FDIM,该数据集包含超过16k个由传统基于块的混合视频编码器和端到端感知优化的神经视频编码器编码的视频序列。在十个包含多样化、之前未见过的编码器的SDR/HDR VQA数据集上的广泛实验表明,FDIM实现了强泛化能力,并与主观评估具有高度相关性。FDIM的源代码和DCVQA验证集将发布在https://github.com/MCL-ZJU/FDIM。
cs.CV / 161 / 2604.24125
Open-Vocabulary Semantic Segmentation Network Integrating Object-Level Label and Scene-Level Semantic Features for Multimodal Remote Sensing Images
集成对象级标签和场景级语义特征的开放词汇语义分割网络用于多模态遥感图像
Abstract
Semantic segmentation of multi-modal remote sensing imagery plays a pivotal role in land use/land cover (LULC) mapping, environmental monitoring, and precision earth observation. Current multi-modal approaches mainly focus on integrating complementary visual modalities, yet neglect the incorporation of non-visual textual data, a rich source of knowledge that can bridge semantic gaps between visual patterns and real-world concepts. To address this limitation, we propose TSMNet, a text-supervised multi-modal open-vocabulary semantic segmentation network that integrates textual supervision with visual representation. Unlike conventional multi-modal segmentation frameworks, TSMNet introduces a dual-branch text encoder to extract both scene-level semantic and object-level label information from various textual data, enabling dynamic cross-modal fusion. These text-derived features interact with visual embeddings through the proposed text-guided visual semantic fusion module, enabling domain-aware feature refinement and human-interpretable decision-making. To verify our method, we construct two new multi-modal datasets and carry out extensive experiments comparing the proposed method with other state-of-the-art (SOTA) semantic segmentation models. Results demonstrate that TSMNet achieves superior segmentation accuracy while exhibiting robust generalization capabilities across diverse geographical and sensor-specific scenarios. This work establishes a new paradigm for explainable remote sensing analysis, demonstrating that textual knowledge integration significantly enhances model generalizability. The source code will be available at https://github.com/yeyuanxin110/TSMNet.
Chinese Translation
多模态遥感影像的语义分割在土地利用/土地覆盖(LULC)制图、环境监测和精确地球观测中发挥着关键作用。目前的多模态方法主要集中于整合互补的视觉模态,却忽视了非视觉文本数据的整合——这是一种丰富的知识来源,可以弥合视觉模式与现实世界概念之间的语义差距。为了解决这一局限性,我们提出了TSMNet,一种文本监督的多模态开放词汇语义分割网络,协同整合文本监督与视觉表示以实现开放词汇语义分割。与传统的多模态分割框架不同,TSMNet引入了一个双分支文本编码器,从各种文本数据中提取场景级语义和对象级标签信息,实现动态跨模态融合。这些文本派生特征通过所提出的文本引导视觉语义融合模块与视觉嵌入动态交互,实现领域感知特征的细化和人类可解释的决策。为了验证我们的方法,我们创新性地构建了两个新的多模态数据集,并进行了广泛的实验,以全面比较所提出的方法与其他最先进(SOTA)语义分割模型。结果表明,TSMNet在多样的地理和传感器特定场景中实现了更高的分割精度,同时展现出强大的泛化能力。这项工作为可解释的遥感分析建立了新的范式,表明文本知识的整合显著增强了模型的泛化能力。源代码将可在 https://github.com/yeyuanxin110/TSMNet 获取。
cs.CV / 162 / 2604.24136
Bridging Restoration and Generation Manifolds in One-Step Diffusion for Real-World Super-Resolution
在一步扩散中桥接真实世界超分辨率的恢复与生成流形
Abstract
Pretrained diffusion models have revolutionized real-world image super-resolution (Real-ISR) but suffer from computational bottlenecks due to iterative sampling. Recent single-step distillation accelerates inference but faces a stark perception-distortion trade-off due to rigid timestep initialization, distributional trajectory mismatches, and fragile stochastic modulation. To address this, we present Adaptive Inversion and Degradation-aware Sampling for Real-ISR (IDaS-SR), a one-step framework bridging the deterministic restoration and stochastic generation manifolds. At its core, the Manifold Inversion Noise Estimator (MINE) resolves these initialization and trajectory mismatches by predicting a severity-aware timestep and inversion noise, precisely anchoring low-quality latents onto the diffusion trajectory. Furthermore, to mitigate fragile stochastic modulation, we propose CHARIOT, a continuous generative steering mechanism. By rescheduling trajectories and interpolating noise, it enables explicit navigation of the perception-distortion boundary without compromising structural priors. Extensive experiments demonstrate that IDaS-SR outperforms state-of-the-art methods, seamlessly transitioning from a rigorous structural restorer to a sophisticated texture hallucinator in a single inference step.
Chinese Translation
预训练的扩散模型已经彻底改变了真实世界图像超分辨率(Real-ISR),但由于迭代采样,仍面临计算瓶颈。最近的单步蒸馏加速了推理,但由于严格的时间步初始化、分布轨迹不匹配和脆弱的随机调制,导致明显的感知-失真权衡。为了解决这个问题,我们提出了适应性反演与降解感知采样框架(IDaS-SR),这是一个将确定性恢复与随机生成流形桥接的一步框架。其核心是流形反演噪声估计器(MINE),通过预测一个感知严重性相关的时间步和反演噪声,精确地将低质量潜变量锚定到扩散轨迹上,从而解决了这些初始化和轨迹不匹配的问题。此外,为了减轻脆弱的随机调制,我们提出了CHARIOT,一种连续生成引导机制。通过重新调度轨迹和插值噪声,它能够在不妥协结构先验的情况下,明确导航感知-失真边界。大量实验表明,IDaS-SR在性能上优于最先进的方法,能够在单次推理步骤中无缝地从严格的结构恢复器过渡到复杂的纹理幻觉生成器。
cs.CV / 163 / 2604.24146
EXACT: an explainable anomaly-aware vision foundation model for analysis of 3D chest CT
EXACT:一种可解释的异常感知视觉基础模型用于三维胸部CT分析
Abstract
Chest computed tomography (CT) is central to the detection and management of thoracic disease, yet the growing scale and complexity of volumetric imaging increasingly exceed what can be addressed by scan-level prediction alone. Clinically useful AI for CT must not only recognize disease across the whole volume, but also localize abnormalities and provide interpretable visual evidence. Existing vision-language foundation models typically compress scans and reports into global image-text representations, limiting their ability to preserve spatial evidence and support clinically meaningful interpretation. Here we developed EXACT, an explainable anomaly-aware foundation model for three-dimensional chest CT that learns spatially resolved representations from paired clinical scans and radiology reports. EXACT was pre-trained on 25,692 CT-report pairs using anatomy-aware weak supervision, jointly learning organ segmentation and multi-instance anomaly localization without manual voxel-level annotations. The resulting organ-specific anomaly-aware maps assign each voxel a disease-specific anomaly score confined to its corresponding anatomy, jointly encoding lesion extent and organ-level context. In retrospective multinational and multi-center evaluations, EXACT showed broad and consistent improvements across clinically relevant CT tasks, spanning multi-disease diagnosis, zero-shot anomaly localization, downstream adaptation, and visually grounded report generation, outperforming existing three-dimensional medical foundation models. By transforming routine clinical CT scans and free-text reports into explainable voxel-level representations, EXACT establishes a scalable paradigm for trustworthy volumetric medical AI.
Chinese Translation
胸部计算机断层扫描(CT)在胸部疾病的检测和管理中至关重要,但随着体积成像规模和复杂性的不断增加,单靠扫描级别的预测已无法满足需求。临床上有用的CT人工智能不仅需要识别整个体积中的疾病,还需定位异常并提供可解释的视觉证据。现有的视觉-语言基础模型通常将扫描和报告压缩为全局图像-文本表示,这限制了它们保留空间证据和支持临床有意义解释的能力。在此,我们开发了EXACT,一种可解释的异常感知基础模型,专为三维胸部CT设计,能够从配对的临床扫描和放射学报告中学习空间分辨的表示。EXACT在25,692对CT-报告上进行了预训练,采用解剖学感知的弱监督,联合学习器官分割和多实例异常定位,而无需手动体素级注释。最终生成的器官特异性异常感知图为每个体素分配了一个特定于疾病的异常评分,限制在其对应的解剖结构内,联合编码病变范围和器官级上下文。在回顾性多国和多中心评估中,EXACT在临床相关的CT任务中显示出广泛且一致的改善,涵盖多疾病诊断、零样本异常定位、下游适应和视觉基础报告生成,超越了现有的三维医学基础模型。通过将常规临床CT扫描和自由文本报告转化为可解释的体素级表示,EXACT建立了一个可扩展的可信体积医学人工智能范式。
cs.CV / 164 / 2604.24149
6thGrid-Net: Unified Remote Sensing Image Dehazing Based on Color Restoration and Edge-Preserving
6thGrid-Net:基于颜色恢复和边缘保持的统一遥感图像去雾
Abstract
Remote sensing images are frequently degraded by adverse weather conditions, particularly clouds and haze, which severely impair downstream applications. Existing restoration methods typically rely on computationally heavy architectures or sequential pipelines (e.g., detail enhancement followed by color rendition) that suffer from mutual interference and artifact accumulation. Furthermore, recent unified grid-based approaches utilize fixed, isotropic interpolation kernels, neglecting the intrinsic low-dimensional manifold of natural images and inevitably causing edge blur. To address these limitations, we propose 6th Grid-Net, a highly efficient and unified remote sensing image restoration framework tailored for resource-constrained edge devices. Specifically, we construct a novel six-dimensional fusion tensor that seamlessly integrates the color rendition capabilities of 3D LUTs with the spatial-luminance detail preservation of bilateral grids. To overcome the drawbacks of standard trilinear interpolation, we introduce a manifold-adaptive high-dimensional sampling mechanism. This mechanism dynamically adjusts the interpolation kernel based on local edge orientation, texture strength, and color similarity, enabling simultaneous global color stylization and local edge refinement in a single forward pass. Additionally, an edge-aware grid smoothing constraint and dynamic quantization are incorporated to suppress ghosting artifacts and significantly compress the model size. Extensive experiments on multiple benchmark datasets demonstrate that 6th Grid-Net achieves state-of-the-art restoration quality across various degradation scenarios.
Chinese Translation
遥感图像常常受到不利天气条件的影响,特别是云层和雾霾,这严重损害了下游应用。现有的恢复方法通常依赖于计算量大的架构或顺序处理流程(例如,细节增强后再进行颜色还原),这些方法在处理过程中会相互干扰并导致伪影的累积。此外,近期的统一网格方法采用固定的各向同性插值核,忽视了自然图像的内在低维流形,必然导致边缘模糊。为了解决这些局限性,我们提出了6th Grid-Net,这是一种高效且统一的遥感图像恢复框架,专为资源受限的边缘设备量身定制。具体而言,我们构建了一种新颖的六维融合张量,能够无缝整合3D LUT的颜色还原能力与双边网格的空间-亮度细节保持。为了克服标准三线性插值的缺点,我们引入了一种流形自适应的高维采样机制。该机制根据局部边缘方向、纹理强度和颜色相似性动态调整插值核,使得在单次前向传递中实现全局颜色风格化与局部边缘细化。此外,我们还引入了边缘感知网格平滑约束和动态量化,以抑制鬼影伪影并显著压缩模型大小。在多个基准数据集上的广泛实验表明,6th Grid-Net在各种退化场景下实现了最先进的恢复质量。
cs.CV / 165 / 2604.24163
Robust Deepfake Detection, NTIRE 2026 Challenge: Report
鲁棒深伪检测,NTIRE 2026 挑战:报告
Hopf, Benedikt, Timofte, Radu, Qu, Chenfan, Li, Junchi, Wu, Fei, Lu, Dagong, Yao, Mufeng, Xu, Xinlei, Guo, Fengjun, Tang, Yongwei, Yang, Zhiqiang, Wu, Zhiqiang, Seow, Jia Wen, Koay, Hong Vin, Ren, Haodong, Xu, Feng, Chen, Shuai, Le-Phan, Minh-Khoa, Le, Minh-Hoang, Do, Trong-Le, Tran, Minh-Triet, Jian, Chih-Yu, Wang, Yi-Fan, Chen, Bang-Kang, Chao, You-Chen, Lee, Chia-Ming, Yang, Fu-En, Wang, Yu-Chiang Frank, Hsu, Chih-Chung, Negi, Aashish, Sharma, Hardik, Shaily, Prateek, Kumar, Jayant, Chaudhary, Sachin, Dudhane, Akshay, Hambarde, Praful, Shukla, Amit, Peng, Jielun, Wang, Yabin, Li, Yaqi, Liu, Jincheng, Hong, Xiaopeng, Wadhwani, Krish, Fitzpatrick, Liam, Tiwari, Utkarsh, Benjdira, Bilel, Ali, Anas M., Boulila, Wadii, Quispe, Cristian Lazo, A, Aishwarya, S, Akshara, N, Ashwathi, Tu, Jiachen, Xu, Guoyi, Jiang, Yaoxin, Liu, Jiajia, Shi, Yaokun
Abstract
Robustness is a long-overlooked problem in deepfake detection. However, detection performance is nearly worthless in the real world if it suffers under exposure to even slight image degradation. In addition to weaker degradations that can accidentally occur in the image processing pipeline, there is another risk of malicious deepfakes that specifically introduce degradations, purposefully exploiting the detector's weaknesses in that regard. Here, we present an overview of the NTIRE 2026 Robust Deepfake Detection Challenge, which specifically addresses that problem. Participants were tasked with building a detector that would later be tested on an unknown test-set, which included both common and uncommon degradations of various strengths. With a total number of 337 participants and 57 submissions to the final leaderboard, the first edition of the challenge was well received. To ensure the reliability of the results, participants were given only 24h to complete the test run with no labels provided, limiting the possibility of training on the test data. Furthermore, the top solutions were scored on a private test-set to detect any such overfitting. This report presents the competition setting, dataset preparation, as well as details and performance of methods. Top methods rely on large foundation models, ensembles, and degradation training to combine generality and robustness.
Chinese Translation
鲁棒性是深伪检测中一个长期被忽视的问题。然而,如果检测在面对轻微图像退化时表现不佳,那么其在现实世界中的检测性能几乎毫无价值。除了在图像处理流程中可能意外发生的较弱退化外,还有一种恶意深伪的风险,专门引入退化,故意利用检测器在这方面的弱点。在此,我们概述了 NTIRE 2026 鲁棒深伪检测挑战,该挑战专门针对这一问题。参与者的任务是构建一个检测器,随后将在一个未知的测试集上进行测试,该测试集包含各种强度的常见和不常见的退化。第一届挑战共吸引了 337 名参与者和 57 个最终排行榜提交,反响良好。为了确保结果的可靠性,参与者仅被给予 24 小时的时间完成测试运行,且未提供任何标签,从而限制了在测试数据上训练的可能性。此外,顶尖解决方案是在一个私有测试集上进行评分,以检测任何过拟合现象。本报告介绍了比赛设置、数据集准备,以及方法的细节和性能。顶尖方法依赖于大型基础模型、集成方法和退化训练,以结合通用性和鲁棒性。
cs.CV / 166 / 2604.24167
PEPS: Positional Encoding Projected Sampling -- Extended
PEPS:位置编码投影采样——扩展版
Abstract
Implicit neural representations (INRs) are increasingly being used as tools to map coordinates to signals, encompassing applications from neural fields to texture compression, shape representations, and beyond. Most INR methods are based on using high-dimensional projections of the initial coordinates through encoders such as grid or positional encoding. Nevertheless, positional encoding is often insufficient, and grids, as we show in this paper, require high resolution to be able to learn. In this paper, we demonstrate that positional encoding can be used not only as a high-dimensional embedding but also decomposed into a series of meaningful points. We propose Positional Encoding Projected Sampling, where we treat the projection of the original coordinate at each frequency as a point of interest. We describe the motion of each point with respect to the frequencies and show that it follows a unique pattern. Finally, we use the unique motion of each point as a basis decomposition for learned positional encoding using grids. We show, using three competitive applications (image representation, texture compression, and signed distance functions), that the proposed approach outperforms current state-of-the-art methods and often requires 25\% fewer parameters for equivalent reconstruction error or rendering.
Chinese Translation
隐式神经表示(INRs)越来越多地被用作将坐标映射到信号的工具,涵盖了从神经场到纹理压缩、形状表示等多种应用。大多数INR方法基于通过编码器(如网格或位置编码)对初始坐标进行高维投影。然而,位置编码往往不足,而网格,如我们在本文中所示,需要高分辨率才能进行学习。本文展示了位置编码不仅可以作为高维嵌入使用,还可以分解为一系列有意义的点。我们提出了位置编码投影采样(Positional Encoding Projected Sampling),在该方法中,我们将每个频率下原始坐标的投影视为一个感兴趣的点。我们描述了每个点相对于频率的运动,并展示其遵循独特的模式。最后,我们利用每个点的独特运动作为基础分解,通过网格进行学习的位置信息编码。我们通过三个具有竞争力的应用:图像表示、纹理压缩和有符号距离函数,证明了所提方法优于当前的最先进方法,并且通常需要25\%更少的参数来实现等效的重建误差或渲染。
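The idea of reading a positional encoding as a set of points, one per frequency, can be sketched with the standard sinusoidal encoding. This sketch assumes the common frequency schedule $2^k \pi$; the abstract does not state which schedule PEPS uses, and the function names are hypothetical.

```python
from math import cos, pi, sin


def encoding_points(x, num_frequencies):
    """One (cos, sin) point on the unit circle per frequency 2^k * pi.

    Assumption: the usual octave-spaced frequency schedule.
    """
    return [
        (cos((2 ** k) * pi * x), sin((2 ** k) * pi * x))
        for k in range(num_frequencies)
    ]


def flat_encoding(x, num_frequencies):
    """The conventional flat embedding: the same values concatenated into one vector."""
    return [value for point in encoding_points(x, num_frequencies) for value in point]


points = encoding_points(0.5, 3)
for k, (c, s) in enumerate(points):
    print(f"frequency index {k}: cos={c:+.3f}, sin={s:+.3f}")
```

Under this schedule the angle of each point doubles from one frequency to the next, so a single coordinate traces a predictable path across frequencies. That is one simple instance of the per-frequency point motion the abstract refers to.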
cs.CV / 167 / 2604.24169
PointTransformerX: Portable and Efficient 3D Point Cloud Processing without Sparse Algorithms
PointTransformerX:无需稀疏算法的便携高效3D点云处理
Abstract
3D point cloud perception remains tightly coupled to custom CUDA operators for spatial operations, limiting portability and efficiency on non-NVIDIA, AMD, and embedded hardware. We introduce PointTransformerX (PTX), a fully PyTorch-native vision transformer backbone for 3D point clouds that removes all custom CUDA operators and external libraries while retaining competitive accuracy. PTX introduces 3D-GS-RoPE, a rotary positional embedding that encodes 3D spatial relationships directly in self-attention without neighborhood construction, and further replaces sparse convolutional patch embedding with a linear projection. PTX explores inference-time scaling of attention windows to improve accuracy without retraining. With a redesigned feed-forward network, PTX achieves 98.7\% of PointTransformer V3's accuracy on ScanNet with 79.2\% fewer parameters, executes $1.6\times$ faster, and requires just 253 MB of memory. PTX runs natively on NVIDIA GPUs, AMD GPUs (ROCm), and CPUs, providing an efficient and portable foundation for point cloud perception.
Chinese Translation
3D点云感知仍然紧密依赖于自定义CUDA运算符进行空间操作,这限制了在非NVIDIA、AMD和嵌入式硬件上的便携性和效率。我们提出了PointTransformerX(PTX),这是一个完全基于PyTorch的3D点云视觉变换器主干,去除了所有自定义CUDA运算符和外部库,同时保持了竞争力的准确性。PTX引入了3D-GS-RoPE,这是一种旋转位置嵌入,能够在自注意力中直接编码3D空间关系,而无需构建邻域,并进一步用线性投影替代稀疏卷积补丁嵌入。PTX探索了推理时注意力窗口的缩放,以提高准确性而无需重新训练。通过重新设计的前馈网络,PTX在ScanNet上达到了PointTransformer V3的98.7%的准确率,同时参数减少了79.2%,执行速度快了1.6倍,仅需253 MB的内存。PTX可以在NVIDIA GPU、AMD GPU(ROCm)和CPU上原生运行,为点云感知提供了高效且便携的基础。
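How a rotary embedding can encode 3D spatial relationships inside attention without a neighborhood search can be sketched with a generic axis-wise RoPE. This is not the paper's 3D-GS-RoPE, whose details the abstract does not give. The channel split into three equal groups, the `base` constant, and the function names are all assumptions made for illustration.

```python
from math import cos, sin


def rope_3d(vec, position, base=100.0):
    """Rotate consecutive channel pairs by angles proportional to x, y and z.

    Assumption: channels are split into three equal groups, one per axis,
    so len(vec) must be a multiple of 6.
    """
    assert len(vec) % 6 == 0, "need an even number of channels per axis"
    chunk = len(vec) // 3
    pairs = chunk // 2
    out = []
    for axis, coord in enumerate(position):
        part = vec[axis * chunk:(axis + 1) * chunk]
        for i in range(pairs):
            theta = coord * base ** (-i / pairs)  # lower frequency for later pairs
            a, b = part[2 * i], part[2 * i + 1]
            out.append(a * cos(theta) - b * sin(theta))
            out.append(a * sin(theta) + b * cos(theta))
    return out


def dot(a, b):
    """Plain dot product, standing in for one attention logit."""
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 0.0, 0.5, -0.5, 0.2, 0.3]
key = [0.3, 0.7, -0.1, 0.4, 0.9, -0.2]
p_query, p_key = (1.0, 2.0, 3.0), (0.5, -1.0, 2.0)
shift = (10.0, -4.0, 7.0)

logit = dot(rope_3d(query, p_query), rope_3d(key, p_key))
shifted_logit = dot(
    rope_3d(query, tuple(p + s for p, s in zip(p_query, shift))),
    rope_3d(key, tuple(p + s for p, s in zip(p_key, shift))),
)
print(logit, shifted_logit)
```

The two printed logits agree up to floating-point error because each rotated pair contributes a term that depends only on the difference between the two positions. Relative geometry therefore enters attention through dense elementwise operations, which is what makes this style of embedding portable across hardware.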
cs.CV / 168 / 2604.24171
POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation
POCA:视觉文本生成的帕累托最优课程对齐
Abstract
Current visual text generation models struggle with the trade-off between text accuracy and overall image coherence. We find that achieving high text accuracy can reduce aesthetic quality and instruction-following capability. Although reinforcement learning approaches can alleviate the problem through aligning with multiple rewards, they are often unstable for text generation, as existing approaches normally optimize multiple rewards in a weighted-sum way. In addition, it is difficult to balance the weight of each reward. Moreover, reinforcement learning requires a set of training instructions. A large number of prompts require more training time and computing resources, while a small set leads to poor performance. Hence, how to select the prompts for efficient training is an unsolved problem. In this study, we propose Pareto-Optimal Curriculum Alignment (POCA), a framework that addresses this issue as a multi-objective problem by: 1) identifying the Pareto-optimal set to avoid simple scalarization and 2) designing an adaptive curriculum alignment strategy to manage a learning sequence of a multi-reward dataset using automatic difficulty assessment, which is crucial for optimal convergence as RL methods explore in a limited data environment. In synergy, POCA finds the Pareto-optimal set in a unified reward space, which eliminates inconsistent signals to find the best trade-off solution from different rewards under an easy-to-hard optimization landscape. The experimental results show that POCA significantly improves all metrics such as CLIP, HPS scores and sentence accuracy.
Chinese Translation
当前的视觉文本生成模型在文本准确性和整体图像一致性之间存在权衡。我们发现,达到高文本准确性可能会降低美学质量和遵循指令的能力。尽管强化学习方法可以通过与多个奖励对齐来缓解这一问题,但由于现有方法通常以加权和的方式优化多个奖励,因此在文本生成中往往不稳定。此外,平衡每个奖励的权重也很困难。此外,强化学习需要一组训练指令。大量的提示需要更多的训练时间和计算资源,而少量的提示则会导致性能不佳。因此,如何选择提示以实现高效训练仍然是一个未解决的问题。在本研究中,我们提出了帕累托最优课程对齐(POCA),该框架将此问题视为一个多目标问题,通过:1)识别帕累托最优集以避免简单标量化,2)设计自适应课程对齐策略,通过自动难度评估管理多奖励数据集的学习序列,这对于在有限数据环境中进行最佳收敛至关重要。POCA在统一的奖励空间中找到帕累托最优集,从而消除不一致的信号,以在易到难的优化环境中找到不同奖励之间的最佳权衡解决方案。实验结果表明,POCA显著提高了所有指标,如CLIP、HPS分数和句子准确性。
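The difference between picking a Pareto-optimal set and collapsing rewards into a weighted sum can be sketched with a small non-dominated filter. The reward tuples are made-up illustrative values (higher is better on both axes), and the helper names are hypothetical rather than taken from POCA.

```python
def dominates(a, b):
    """True if a is at least as good as b on every reward and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Illustrative (text accuracy, aesthetic score) pairs for five generated samples.
samples = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.9, 0.1)]
front = pareto_front(samples)
print(front)
```

No weights appear anywhere in the filter, so the balance between rewards does not have to be tuned before the non-dominated candidates are identified. This is the contrast with weighted-sum scalarization that the abstract draws.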
cs.CV / 169 / 2604.24187
Multivariate Gaussian NeRF for Wide Field-of-View Ultrasound Reconstruction
用于宽视场超声重建的多元高斯神经辐射场
Abstract
Wide Field-of-View (WFoV) reconstruction enhances 3D ultrasound imaging by providing valuable anatomical context for segmentation models and visualization. Clinical ultrasound volumes are predominantly acquired using convex probes, which generate expanding, diverging acoustic beams to maximize anatomical coverage. Stitching these sweeps together traditionally introduces significant compounding artifacts and aliasing due to depth-dependent resolution changes. Here, we introduce Ultra-Wide-NeRF, a Multivariate 3D Gaussian (MVG) NeRF-based method for WFoV ultrasound reconstruction. By explicitly modeling the complex beam geometry using distance-dependent convex volumetric sampling and anisotropic 3D Gaussians, our method inherently mitigates these compounding artifacts and provides anti-aliasing. Beyond simply reconstructing a static 3D grid, our NeRF-based approach yields a continuous neural representation of the tissue, enabling the synthesis of high-fidelity novel views from arbitrary virtual trajectories. We validate Ultra-Wide-NeRF for intracardiac echocardiography on phantom and porcine datasets, demonstrating that our method expands the spatial context important in intraoperative navigation. Code will be open-sourced upon publication.
Chinese Translation
宽视场(WFoV)重建通过为分割模型和可视化提供有价值的解剖学背景,增强了三维超声成像。临床超声体积主要使用凸探头获取,这些探头生成扩展的、发散的声束,以最大化解剖覆盖。传统上,将这些扫描拼接在一起会引入显著的复合伪影和由于深度依赖的分辨率变化而导致的混叠。在此,我们提出了Ultra-Wide-NeRF,一种基于多元三维高斯(MVG)神经辐射场的WFoV超声重建方法。通过使用距离依赖的凸体积采样和各向异性三维高斯显式建模复杂的声束几何形状,我们的方法本质上减轻了这些复合伪影并提供了抗混叠。我们的NeRF基础方法不仅仅重建静态三维网格,而是生成组织的连续神经表示,从任意虚拟轨迹合成高保真新视图。我们在假体和猪模型数据集上验证了Ultra-Wide-NeRF在心内超声心动图中的应用,证明我们的方法扩展了在手术导航中重要的空间背景。代码将在发表时开源。
cs.CV / 170 / 2604.24191
Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning
Omni-o3:深度嵌套的全模态推理用于深思熟虑的音视频推理
Abstract
Omnimodal understanding entails a massive, highly redundant search space of cross-modal interactions, demanding focused and deliberative reasoning. Current reasoning paradigms rely on either sequential step-by-step generation or parallel sample-by-sample rollouts, leading to isolated reasoning trajectories. This inability to share promising intermediate paths severely limits exploration efficiency and causes compounding errors in complex audio-visual tasks. To break this bottleneck, we introduce Omni-o3, a novel framework driven by a deep nested deduction policy. By formulating reasoning as a dynamic recursive search, Omni-o3 inherently shares reasoning prefixes across branches, enabling the iterative execution of four atomic cognitive actions: expansion, selection, simulation, and backpropagation. To empower this framework, we propose a robust two-stage training paradigm: (1) cold-start supervised fine-tuning on 101K high-quality, long-chain trajectories distilled from 3.5M diverse omnimodal samples, enabling necessary recursive search patterns; and (2) nested group rollout-driven exploratory reinforcement learning on 18K complex multi-turn samples, explicitly guided by a novel multi-step reward model to stimulate deep nested reasoning. Extensive experiments demonstrate that Omni-o3 achieves competitive performance across 11 benchmarks, unlocking advanced capabilities in comprehensive audio-visual, visual-centric, and audio-centric reasoning tasks.
Chinese Translation
全模态理解涉及一个庞大且高度冗余的跨模态交互搜索空间,要求进行集中和深思熟虑的推理。目前的推理范式依赖于逐步生成的顺序步骤或逐样本的并行展开,导致孤立的推理轨迹。这种无法共享有希望的中间路径的能力严重限制了探索效率,并在复杂的音视频任务中造成了累积错误。为了解决这一瓶颈,我们提出了Omni-o3,一个由深度嵌套推理策略驱动的新框架。通过将推理形式化为动态递归搜索,Omni-o3自然而然地在各个分支之间共享推理前缀,从而实现四个原子认知动作的迭代执行:扩展、选择、模拟和反向传播。为了增强这一框架,我们提出了一种稳健的两阶段训练范式:(1)在从350万多样化全模态样本中提炼的10.1万条高质量长链轨迹上进行冷启动监督微调,以实现必要的递归搜索模式;(2)在18000个复杂多轮样本上进行嵌套组展开驱动的探索性强化学习,明确由新颖的多步奖励模型引导,以刺激深度嵌套推理。大量实验证明,Omni-o3在11个基准测试中实现了具有竞争力的性能,解锁了在全面音视频、视觉中心和音频中心推理任务中的高级能力。
cs.CV / 171 / 2604.24193
Computer Vision-Based Early Detection of Container Loss at Sea
基于计算机视觉的海上集装箱损失早期检测
Abstract
Containerised shipping underpins global trade, yet container loss at sea remains a persistent safety, environmental, and economic challenge. Despite compliance with Cargo Securing Manuals, dynamic maritime conditions such as vessel motion, wind loading, and severe sea states can progressively destabilise container stacks, leading to overboard losses. With the new International Maritime Organisation's (IMO) mandatory reporting requirements for lost containers, there is an urgent need for a reliable, evidence-based early detection solution for destabilised containers. This study showcases a low-cost, retrofittable computer vision-based system for early detection of destabilised containers using existing onboard cameras. The framework integrates object segmentation to isolate container stacks, temporal object tracking using optical flow and individual objects' residual motion extraction to quantify relative movement. Experimental evaluation on real onboard ship footage demonstrates that the proposed pipeline effectively isolates container-level motion under challenging conditions of varying sea states and visibility conditions. By enabling early alerts for crew intervention and navigational adjustment, the proposed approach enhances cargo safety, operational resilience, and regulatory compliance.
Chinese Translation
集装箱运输是全球贸易的基础,但海上集装箱损失仍然是一个持续存在的安全、环境和经济挑战。尽管遵循货物固定手册,动态海洋条件如船舶运动、风载荷和恶劣海况可能会逐渐使集装箱堆垛不稳定,从而导致集装箱掉落。随着国际海事组织(IMO)对丢失集装箱的强制报告要求的出台,迫切需要一种可靠的、基于证据的早期检测解决方案来识别不稳定的集装箱。本研究展示了一种低成本、可改装的基于计算机视觉的系统,利用现有的船上摄像头对不稳定集装箱进行早期检测。该框架集成了物体分割技术以隔离集装箱堆垛,使用光流进行时间序列物体跟踪,并提取个体物体的残余运动以量化相对运动。对真实船上视频的实验评估表明,所提出的流程在变化的海况和能见度条件下有效隔离了集装箱级别的运动。通过实现对船员干预和导航调整的早期警报,所提出的方法增强了货物安全性、操作韧性和合规性。
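The residual-motion step named in the abstract above can be illustrated with a minimal sketch. It assumes a mean optical-flow vector per tracked container is already available, and it approximates the shared vessel and camera motion by the component-wise median of those vectors. The function name, the input layout, and the median choice are illustrative assumptions, not the authors' pipeline.

```python
from statistics import median

def residual_motion(flows):
    """flows: dict mapping a container id to its mean flow vector (dx, dy) in pixels.

    Returns a dict mapping each id to the magnitude of its motion after the
    shared (median) motion of all tracked containers has been subtracted.
    """
    gx = median(v[0] for v in flows.values())  # shared horizontal motion (assumed: median)
    gy = median(v[1] for v in flows.values())  # shared vertical motion (assumed: median)
    return {
        cid: ((dx - gx) ** 2 + (dy - gy) ** 2) ** 0.5
        for cid, (dx, dy) in flows.items()
    }

# Two containers move with the ship; the third drifts relative to the stack.
example = residual_motion({"c1": (2.0, 1.0), "c2": (2.0, 1.0), "c3": (5.0, 5.0)})
```

In this toy case only `c3` has a non-zero residual, which is the kind of signal an alerting threshold could act on.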
cs.CV / 172 / 2604.24230
Radiomics- and Clinical Feature-Driven Prediction of Volumetric Response in Skull-Base Meningioma after CyberKnife Radiosurgery
基于放射组学和临床特征的颅底脑膜瘤在CyberKnife放射外科治疗后的体积反应预测
Abstract
Skull-base meningiomas are often characterized by favorable long-term prognosis, yet their anatomical complexity and proximity to critical neurovascular structures make treatment selection challenging. Stereotactic radiosurgery with CyberKnife represents an effective therapeutic option when surgical resection is not feasible; however, not all patients benefit equally from this treatment. Early identification of patients likely to respond to radiosurgery remains an open clinical problem. In this study, we propose a radiomics- and clinical feature-driven framework for predicting volumetric response in skull-base meningiomas treated with CyberKnife. Unlike most existing approaches that focus on progression-free survival or recurrence, our method targets volumetric response as an indicator of treatment efficacy. Pre-treatment MRI images from 104 patients were processed to extract radiomic features, which were combined with clinical variables and analyzed using six models. To ensure methodological rigor, the entire modeling process was implemented within a nested cross-validation scheme. Among the evaluated models, TabPFN achieved the best overall performance, with an AUC of 0.81 and consistently favorable classification metrics. These results suggest that advanced machine learning architectures, when combined with robust validation strategies, can effectively capture patterns associated with treatment response even in small-sample, high-dimensional settings.
Chinese Translation
颅底脑膜瘤通常具有良好的长期预后,但其解剖复杂性和与重要神经血管结构的邻近性使得治疗选择具有挑战性。使用CyberKnife进行立体定向放射外科手术是当外科切除不可行时的一种有效治疗选择;然而,并非所有患者都能从该治疗中获得相同的益处。早期识别可能对放射外科手术有反应的患者仍然是一个未解决的临床问题。在本研究中,我们提出了一种基于放射组学和临床特征的框架,用于预测接受CyberKnife治疗的颅底脑膜瘤的体积反应。与大多数现有方法专注于无进展生存期或复发不同,我们的方法将体积反应作为治疗效果的指标。对104名患者的治疗前MRI图像进行了处理,以提取放射组学特征,并与临床变量相结合,使用六种模型进行了分析。为了确保方法的严谨性,整个建模过程在嵌套交叉验证方案内实施。在评估的模型中,TabPFN实现了最佳的整体性能,AUC为0.81,并且分类指标始终表现良好。这些结果表明,先进的机器学习架构结合稳健的验证策略能够有效捕捉与治疗反应相关的模式,即使在小样本、高维度的情况下。
cs.CV / 173 / 2604.24234
Graph-augmented Segmentation of Complex Shapes in Laser Powder bed Fusion for Enhanced In Situ Inspection
增强现场检测的激光粉末床熔融复杂形状的图形增强分割
Abstract
The technological maturity of in situ inspection and monitoring methods in additive manufacturing is steadily increasing, enabling more efficient and practical qualification procedures. In this context, image segmentation of powder bed images in Laser Powder Bed Fusion (L-PBF) has been investigated by various authors, leveraging both edge detection and machine learning approaches to identify deviations from nominal geometry. Despite these developments, several challenges remain, including the sensitivity of segmentation performance to industrial illumination conditions and layer-to-layer variability in pixel intensity patterns. The study addresses these limitations by proposing a graph-augmented segmentation approach. The underlying principle consists of preserving the geometrical information at a global level rather than at pixel-wise level, modeling dependencies and relational information among spatial regions with a Graph Neural Network bottleneck embedded into a U-Net architecture. This allows enhancing the consistency and accuracy of the geometry reconstruction in the presence of spatial and layer-wise photometric variability systematically faced in real data. The method is evaluated against benchmark techniques for the in situ reconstruction of lattice structures produced by L-PBF, demonstrating its potential as a scalable solution for robust in situ inspection and geometric verification in industrial environments.
Chinese Translation
增材制造中现场检测和监测方法的技术成熟度正在稳步提高,从而实现更高效和实用的资格程序。在此背景下,激光粉末床熔融(L-PBF)中粉末床图像的图像分割已被多位作者研究,利用边缘检测和机器学习方法识别与名义几何形状的偏差。尽管取得了一些进展,但仍然存在一些挑战,包括分割性能对工业照明条件的敏感性以及层间像素强度模式的变异性。本研究通过提出一种图形增强分割方法来解决这些局限性。其基本原理是保留全局层面的几何信息,而不是逐像素层面的信息,利用图神经网络(Graph Neural Network)嵌入到U-Net架构中建模空间区域之间的依赖关系和关联信息。这种方法能够在真实数据中系统性地应对空间和层间光度变异性,从而增强几何重建的一致性和准确性。该方法与L-PBF生产的晶格结构的现场重建基准技术进行了评估,展示了其作为工业环境中稳健的现场检测和几何验证的可扩展解决方案的潜力。
cs.CV / 174 / 2604.24235
Touchless Intraoperative Image Access System Based on Vision-Based Hand Tracking
基于视觉手势追踪的无接触术中图像访问系统
Abstract
Touchless interaction with medical images is becoming increasingly important in the surgical field, where sterility and continuity of the operational workflow are essential requirements. This work presents a vision-based system for intraoperative navigation of medical images through hand gestures acquired using a single RGB camera. Unlike many existing solutions, the system does not require additional hardware or user-specific training. Hand tracking is performed in real time using MediaPipe Hands, which provides a 2.5D estimation of hand landmarks. Simple and intuitive gestures are then mapped into translation, rotation, and zoom commands, enabling continuous and natural interaction with the image viewer. The system architecture is independent from the visualization software and, for implementation simplicity, in this study it was integrated with PyVista. Performance was evaluated through frame-level logging and quantitative analysis of latency, stability, and interaction robustness metrics. Experimental results highlight real-time behavior, with reduced latencies and stable control, in line with the requirements of fluid interaction. The system demonstrates the feasibility of a low-cost touchless solution for intraoperative access to medical images, laying the groundwork for future clinical evaluations.
Chinese Translation
在外科领域,无接触与医学图像的交互变得越来越重要,因为无菌性和手术流程的连续性是基本要求。本研究提出了一种基于视觉的系统,通过使用单个RGB摄像头获取的手势进行术中医学图像导航。与许多现有解决方案不同,该系统不需要额外的硬件或用户特定的培训。手势追踪使用MediaPipe Hands实时进行,提供手部特征点的2.5D估计。简单直观的手势被映射为平移、旋转和缩放命令,从而实现与图像查看器的连续自然交互。系统架构独立于可视化软件,为了实现的简便性,本研究将其与PyVista集成。通过逐帧记录和延迟、稳定性及交互鲁棒性指标的定量分析对性能进行了评估。实验结果突出了实时行为,延迟减少且控制稳定,符合流畅交互的要求。该系统展示了一种低成本无接触解决方案在术中访问医学图像的可行性,为未来的临床评估奠定了基础。
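The gesture-to-command mapping described above might look roughly like the following sketch. It takes two consecutive frames of hand landmarks as plain lists of normalized `(x, y, z)` tuples, indexed the way MediaPipe Hands indexes them (0 = wrist, 4 = thumb tip, 8 = index fingertip). The specific gestures (pinch to zoom, wrist displacement to translate), the `deadzone` value, and the function names are assumptions made for illustration; the abstract does not specify them.

```python
import math

WRIST, THUMB_TIP, INDEX_TIP = 0, 4, 8  # MediaPipe Hands landmark indices

def pinch(landmarks):
    """Distance between thumb tip and index fingertip in normalized image units."""
    return math.dist(landmarks[THUMB_TIP][:2], landmarks[INDEX_TIP][:2])

def gesture_to_command(prev, curr, deadzone=0.01):
    """Map the change between two landmark frames to a viewer command (hypothetical rules)."""
    p_prev, p_curr = pinch(prev), pinch(curr)
    if abs(p_curr - p_prev) > deadzone:
        # Zoom factor is the ratio of pinch apertures; guard against a closed pinch.
        return ("zoom", p_curr / max(p_prev, 1e-6))
    dx = curr[WRIST][0] - prev[WRIST][0]
    dy = curr[WRIST][1] - prev[WRIST][1]
    if math.hypot(dx, dy) > deadzone:
        return ("translate", (dx, dy))
    return ("idle", None)
```

The dead zone is there so that landmark jitter between frames does not produce spurious commands, which is one simple way to get the stable control the abstract reports.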
cs.CV / 175 / 2604.24276
Instance Awareness of Multi-class Semantic Segmentation Loss Functions
多类语义分割损失函数的实例感知
Abstract
Instance-sensitive losses for semantic segmentation such as blob loss and CC loss were designed to address instance imbalance, ensuring small lesions generate the same gradient as large ones, but operate only on single-class segmentation. In multi-class settings, class imbalance poses an additional problem: rare classes with few instances receive a disproportionately small share of the training signal. We show that extending instance-sensitive losses to multi-class segmentation via a one-vs-rest class decomposition repurposes them to also address class imbalance, as uniform averaging over classes ensures each class contributes equally regardless of frequency. We further show that inverse-size weighting, which destabilizes training when applied globally due to weight imbalances across rare and common classes, becomes effective when integrated within the per-component loss, confining the reweighting to each component's spatial context. On the BraTS-METS 2025 dataset (260 test cases), multi-class CC loss improves foreground Dice (0.64 +/- 0.26 vs. 0.59 +/- 0.27 baseline) and rare-class Dice, while maintaining Panoptic Quality at DSC threshold 0.5. Multi-class blob loss achieves the best Panoptic Quality at threshold 0.5 (0.40 +/- 0.24 vs. 0.38 +/- 0.25 baseline) and recognition quality (0.53 +/- 0.29 vs. 0.49 +/- 0.30). Integrating inverse-size weighting within the per-component loss increases rare-class Dice to 0.44 +/- 0.36 at the cost of reduced detection quality.
Chinese Translation
针对语义分割的实例敏感损失,例如 blob 损失和 CC 损失,旨在解决实例不平衡问题,确保小病灶与大病灶产生相同的梯度,但仅适用于单类分割。在多类设置中,类别不平衡带来了额外的问题:具有少量实例的稀有类别获得的训练信号比例过小。我们展示了通过“一对其余”(one-vs-rest)类别分解将实例敏感损失扩展到多类分割,可以重新利用这些损失来解决类别不平衡,因为对类别的均匀平均确保每个类别无论频率如何都能平等贡献。我们进一步表明,逆大小加权在全局应用时由于稀有类别和常见类别之间的权重不平衡会导致训练不稳定,但当集成在每个组件的损失中时则变得有效,因为重加权被限制在每个组件的空间上下文内。在 BraTS-METS 2025 数据集(260 个测试案例)上,多类 CC 损失提高了前景 Dice(0.64 +/- 0.26 对比 0.59 +/- 0.27 基线)和稀有类别 Dice,同时在 DSC 阈值 0.5 下保持了全景质量。多类 blob 损失在阈值 0.5 下实现了最佳全景质量(0.40 +/- 0.24 对比 0.38 +/- 0.25 基线)和识别质量(0.53 +/- 0.29 对比 0.49 +/- 0.30)。在每个组件的损失中集成逆大小加权将稀有类别的 Dice 提高到 0.44 +/- 0.36,代价是检测质量的降低。
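To make the one-vs-rest decomposition with uniform class averaging concrete, here is a minimal sketch on 1D hard label sequences. Each class is binarized, its ground-truth mask is split into connected components, each component is scored with Dice inside the region of pixels nearest to it, and the per-class results are averaged with equal weight. This is an evaluation-style illustration under simplifying assumptions (1D data, hard labels, nearest-component regions, hypothetical function names); it is not differentiable and is not the authors' blob or CC loss implementation.

```python
def runs(mask):
    """Return (start, end) index pairs of contiguous True runs in a 1D mask."""
    out, start = [], None
    for i, m in enumerate(mask):
        if m and start is None:
            start = i
        elif not m and start is not None:
            out.append((start, i - 1))
            start = None
    if start is not None:
        out.append((start, len(mask) - 1))
    return out

def dist(i, run):
    """Distance from index i to a run (0 if i lies inside it)."""
    s, e = run
    return 0 if s <= i <= e else min(abs(i - s), abs(i - e))

def component_dice(gt, pred):
    """Mean Dice over ground-truth components, each scored in its nearest-pixel region."""
    comps = runs(gt)
    if not comps:
        return None  # class absent from this sample
    scores = []
    for k in range(len(comps)):
        region = [i for i in range(len(gt))
                  if min(range(len(comps)), key=lambda j: dist(i, comps[j])) == k]
        inter = sum(1 for i in region if gt[i] and pred[i])
        total = sum(1 for i in region if gt[i]) + sum(1 for i in region if pred[i])
        scores.append(2 * inter / total)  # total > 0: the region contains its component
    return sum(scores) / len(scores)

def multiclass_component_loss(gt_labels, pred_labels, classes):
    """Uniform average over classes of (1 - per-component Dice), one-vs-rest."""
    per_class = []
    for c in classes:
        d = component_dice([g == c for g in gt_labels], [p == c for p in pred_labels])
        if d is not None:
            per_class.append(1.0 - d)
    return sum(per_class) / len(per_class) if per_class else 0.0
```

With `gt = [0,1,1,1,1,0,0,1,0,2]` and a prediction that misses only the single-pixel instance of class 1, the plain class-1 Dice is about 0.89, while the component-wise score is 0.5: the missed small instance counts as much as the large one.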
cs.CV / 176 / 2604.24300
ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning
ReVSI:重建视觉空间智能评估以准确评估VLM的3D推理
Abstract
Current evaluations of spatial intelligence can be systematically invalid under modern vision-language model (VLM) settings. First, many benchmarks derive question-answer (QA) pairs from point-cloud-based 3D annotations originally curated for traditional 3D perception. When such annotations are treated as ground truth for video-based evaluation, reconstruction and annotation artifacts can miss objects that are clearly visible in the video, mislabel object identities, or corrupt geometry-dependent answers (e.g., size), yielding incorrect or ambiguous QA pairs. Second, evaluations often assume full-scene access, while many VLMs operate on sparsely sampled frames (e.g., 16-64), making many questions effectively unanswerable under the actual model inputs. We improve evaluation validity by introducing ReVSI, a benchmark and protocol that ensures each QA pair is answerable and correct under the model's actual inputs. To this end, we re-annotate objects and geometry across 381 scenes from 5 datasets to improve data quality, and regenerate all QA pairs with rigorous bias mitigation and human verification using professional 3D annotation tools. We further enhance evaluation controllability by providing variants across multiple frame budgets (16/32/64/all) and fine-grained object visibility metadata, enabling controlled diagnostic analyses. Evaluations of general and domain-specific VLMs on ReVSI reveal systematic failure modes that are obscured by prior benchmarks, yielding a more reliable and diagnostic assessment of spatial intelligence.
Chinese Translation
当前的空间智能评估在现代视觉-语言模型(VLM)环境下可能系统性失效。首先,许多基准测试从最初为传统3D感知策划的点云基础3D注释中派生出问答(QA)对。当这些注释被视为视频评估的真实值时,重建和注释伪影可能会遗漏视频中清晰可见的物体,错误标记物体身份,或损坏依赖几何的答案(例如,大小),从而产生不正确或模糊的QA对。其次,评估通常假设对全场景的访问,而许多VLM在稀疏采样的帧(例如,16-64)上运行,使得许多问题在实际模型输入下实际上无法回答。我们通过引入ReVSI,一个确保每个QA对在模型实际输入下可回答且正确的基准和协议,来提高评估的有效性。为此,我们对来自5个数据集的381个场景中的物体和几何进行了重新注释,以提高数据质量,并使用专业3D注释工具严格减轻偏见并进行人工验证,重新生成所有QA对。我们进一步通过提供多个帧预算(16/32/64/全部)和细粒度物体可见性元数据的变体来增强评估的可控性,从而实现受控的诊断分析。在ReVSI上对一般和领域特定VLM的评估揭示了被先前基准掩盖的系统性失效模式,从而提供了对空间智能更可靠和更具诊断性的评估。
cs.CV / 177 / 2604.24311
BIMStruct3D: A Fully Automated Hybrid Learning Scan-to-BIM Pipeline with Integrated Topology Refinement
BIMStruct3D:一个完全自动化的混合学习扫描到BIM管道,集成拓扑优化
Abstract
Automatic generation of Building Information Models (BIM) from building scans is a key challenge in architecture and construction. We present a modular pipeline for generating IFC-compliant BIM from 3D point clouds. The hybrid approach combines learning-based semantic segmentation with topology-aware geometric reconstruction to model structural elements accurately. We propose vIoU, adapting voxel-based overlap evaluation to Scan-to-BIM by enabling holistic, instance-matching-free comparison of reconstructed and ground-truth models. We release the German Hospital dataset (DeKH), including high-resolution point clouds, ground truth BIMs, and semantic annotations. Experiments on DeKH and CV4AEC datasets show significant improvements over a RANSAC-based baseline, demonstrating robustness and scalability.
Chinese Translation
从建筑扫描自动生成建筑信息模型(BIM)是建筑和施工领域的一项关键挑战。我们提出了一种模块化管道,用于从3D点云生成符合IFC标准的BIM。该混合方法结合了基于学习的语义分割和拓扑感知的几何重建,以准确建模结构元素。我们提出了vIoU,通过使重建模型与真实模型的比较不依赖于实例匹配,从而将基于体素的重叠评估适应于扫描到BIM。我们发布了德国医院数据集(DeKH),其中包括高分辨率点云、真实BIM和语义注释。在DeKH和CV4AEC数据集上的实验显示,相较于基于RANSAC的基线方法有显著改进,证明了该方法的鲁棒性和可扩展性。
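The idea behind a voxel-based, instance-matching-free overlap score can be sketched as follows: both models are discretized onto the same voxel grid and compared as sets of occupied cells, so no correspondence between individual walls or slabs is needed. The function names, the point-sample input, and the voxel size are assumptions for illustration; the abstract does not give the exact vIoU definition (for example, how element classes are handled).

```python
def voxelize(points, size):
    """Map 3D points to the set of integer voxel indices they occupy."""
    return {(int(x // size), int(y // size), int(z // size)) for x, y, z in points}

def viou(pred_points, gt_points, size=0.5):
    """Intersection over union of occupied voxels; 1.0 if both models are empty."""
    a, b = voxelize(pred_points, size), voxelize(gt_points, size)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Points sampled from a reconstructed model and from the ground-truth model.
pred = [(0.1, 0.1, 0.1), (0.6, 0.1, 0.1), (1.2, 0.1, 0.1)]
gt = [(0.2, 0.2, 0.2), (0.7, 0.2, 0.2), (1.7, 0.2, 0.2)]
score = viou(pred, gt, size=0.5)
```

Here two of the four occupied voxels are shared, so the score is 0.5. The voxel size sets the tolerance: coarser voxels forgive small geometric offsets, finer ones penalize them.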
cs.CV / 178 / 2604.24312
Unconstrained Multi-view Human Pose Estimation with Algebraic Priors
无约束多视角人类姿态估计与代数先验
Abstract
Recovering 3D human pose from multi-view imagery typically relies on precise camera calibration, which is often unavailable in real-world scenarios, thereby severely limiting the applicability of existing methods. To overcome this challenge, we propose an unconstrained framework that synergizes deep neural networks, algebraic priors, and temporal dynamics for uncalibrated multi-view human pose estimation. First, we introduce the Triangulation with Transformer Regressor (TTR), which reformulates classical triangulation into a data-driven token fusion process to bypass the dependency on explicit camera parameters. Second, to explicitly embed the inherent algebraic relations of the multi-view variety into the learning process, we propose the Gröbner basis Corrector (GC). This pioneering loss formulation enforces constraints derived from the multi-view variety to ensure the neural predictions strictly adhere to the laws of projective geometry. Finally, we devise the Temporal Equivariant Rectifier (TER), which exploits the equivariance property of human motion to impose temporal coherence and structural consistency, effectively mitigating scale ambiguity in uncalibrated settings. Extensive evaluations on standard benchmarks demonstrate that our framework establishes a new state-of-the-art for uncalibrated multi-view human pose estimation. Notably, our approach significantly closes the performance gap between calibration-free methods and fully calibrated oracles.
Chinese Translation
从多视角图像中恢复三维人类姿态通常依赖于精确的相机标定,而在现实场景中这种标定往往不可用,这严重限制了现有方法的适用性。为了解决这一挑战,我们提出了一种无约束框架,结合了深度神经网络、代数先验和时间动态,用于未标定的多视角人类姿态估计。首先,我们引入了带有变换器回归器的三角测量(Triangulation with Transformer Regressor, TTR),将经典的三角测量重构为一种数据驱动的令牌融合过程,从而绕过对显式相机参数的依赖。其次,为了将多视角几何的固有代数关系明确嵌入学习过程中,我们提出了格罗布纳基矫正器(Gröbner basis Corrector, GC)。这种开创性的损失公式强制施加来自多视角几何的约束,以确保神经网络的预测严格遵循投影几何的法则。最后,我们设计了时间等变整流器(Temporal Equivariant Rectifier, TER),利用人类运动的等变性来施加时间一致性和结构一致性,有效减轻未标定设置中的尺度模糊。在标准基准上的广泛评估表明,我们的框架在未标定多视角人类姿态估计中建立了新的最先进水平。值得注意的是,我们的方法显著缩小了无标定方法与完全标定模型之间的性能差距。
cs.CV / 179 / 2604.24317
Don't Pause! Every prediction matters in a streaming video
不要暂停!每一个预测在流媒体视频中都至关重要
Abstract
Streaming video models should respond the moment an event unfolds, not after the moment has passed. Yet existing online VideoQA benchmarks remain largely retrospective. They pause the video at fixed timestamps, pose questions about current or past events, and score models only at those moments. This protocol leaves streaming predictions untested. To close this gap, we introduce SPOT-Bench, featuring multi-turn proactive queries that evaluate general streaming perception and assistive capabilities required by an always-on, real-time assistant. SPOT-Bench comes with Timeliness-F1, a consolidated metric that measures streaming predictions by their temporal precision and balanced coverage across the entire video. Our benchmark reveals: (i) offline models detect events reliably but spam predictions unprompted; (ii) post-training for silence reduces spamming but induces unresponsiveness; (iii) half of the streaming video expects no response, which we term dead-time - compute spent here does not affect response latency. These findings motivate AsynKV, a training-free streaming adaptation of offline models, that retains their event perception while improving their streaming behavior. AsynKV features a long-short term memory, utilized efficiently by scaling compute during dead-time. It serves as a strong baseline on SPOT-Bench, outperforming existing streaming models, and achieves state-of-the-art on retrospective benchmarks.
Chinese Translation
流媒体视频模型应在事件发生的瞬间做出反应,而不是在时机已过后。然而,现有的在线视频问答基准仍然主要是回顾性的。它们在固定的时间戳暂停视频,提出关于当前或过去事件的问题,并仅在这些时刻对模型进行评分。这一协议使得流媒体预测未能得到测试。为了解决这一问题,我们引入了SPOT-Bench,采用多轮主动查询,评估实时助手所需的通用流媒体感知和辅助能力。SPOT-Bench配备了Timeliness-F1,这是一种综合指标,通过时间精度和整个视频的均衡覆盖来衡量流媒体预测。我们的基准揭示了:(i)离线模型可靠地检测事件,但在没有提示的情况下产生过多预测;(ii)对静默进行后训练可以减少过多预测,但会导致反应迟缓;(iii)一半的流媒体视频不期望有响应,我们称之为“死时间”——在这里消耗的计算不会影响响应延迟。这些发现促使了AsynKV的提出,它是离线模型的无训练流媒体适配,保留了事件感知的同时改善了流媒体行为。AsynKV具有长短期记忆,通过在死时间期间有效地扩展计算来利用这一特性。它在SPOT-Bench上作为一个强基线,超越了现有的流媒体模型,并在回顾性基准上达到了最先进的水平。
cs.CV / 180 / 2604.24328
Monocular Depth Estimation via Neural Network with Learnable Algebraic Group and Ring Structures
通过具有可学习代数群和环结构的神经网络进行单目深度估计
Abstract
Monocular depth estimation (MDE) has witnessed remarkable progress driven by Convolutional Neural Networks and transformer-based architectures. However, these approaches typically treat the problem as a generic image-to-image regression on Euclidean grids, thereby overlooking the intrinsic algebraic and geometric structures induced by perspective projection. To address this limitation, we propose LAGRNet, a novel framework that fundamentally grounds MDE in algebraic geometry by explicitly embedding learnable group, ring, and sheaf structures into the deep learning pipeline. Modeling feature maps as sections of a sheaf over an approximated image manifold, our method first establishes a Group-defined Feature Manifold (GFM) parameterized by a learned algebraic group action to enforce projective equivariance and robustness against view changes. To facilitate algebraically consistent cross-scale interactions, we subsequently introduce a Ring Convolution Layer (RCL) that formulates feature fusion as a graded ring homomorphism. Furthermore, to ensure global topological consistency, a Sheaf-based Module (SM) aggregates local depth cues via the Čech nerve on the image topology. Extensive zero-shot evaluations across the KITTI, NYU-Depth V2, and ETH3D benchmarks demonstrate that LAGRNet significantly outperforms state-of-the-art methods in both accuracy and generalization capabilities.
Chinese Translation
单目深度估计(MDE)在卷积神经网络和基于变换器的架构的推动下取得了显著进展。然而,这些方法通常将问题视为在欧几里得网格上的通用图像到图像回归,从而忽视了透视投影所引发的内在代数和几何结构。为了解决这一局限性,我们提出了LAGRNet,一个新颖的框架,基于代数几何明确地将可学习的群、环和层结构嵌入深度学习管道,从根本上奠定了MDE的基础。我们的方法将特征图建模为近似图像流形上的层的截面,首先建立一个由学习的代数群作用参数化的群定义特征流形(GFM),以强制执行投影等变性并增强对视角变化的鲁棒性。为了促进代数一致的跨尺度交互,我们随后引入了一个环卷积层(RCL),将特征融合公式化为一个分级环同态。此外,为了确保全局拓扑一致性,一个基于层的模块(SM)通过图像拓扑上的Čech神经(Čech nerve)聚合局部深度线索。在KITTI、NYU-Depth V2和ETH3D基准上的广泛零样本评估表明,LAGRNet在准确性和泛化能力上显著优于最先进的方法。
cs.CV / 181 / 2604.24331
An Affordable, Wearable Stereo-Eye-Tracking Platform
一种经济实惠的可穿戴立体眼动追踪平台
Abstract
Research on video-based eye-tracking has long explored stereo and glint-based methods, yet existing wearable eye trackers - both commercial and open-source - offer limited flexibility for algorithm development and comparative evaluation. We present an affordable, wearable stereo eye-tracking platform built from off-the-shelf and 3D-printable components that explicitly targets this gap. The system combines four infrared eye cameras, infrared illumination, an optional scene camera, and software support for calibration and synchronized data acquisition. By design, the platform supports multiple eye-tracking paradigms, including stereo, glint-based, and binocular approaches, within a single hardware configuration. Rather than optimizing for end-user robustness, the platform prioritizes modularity and extensibility for research use. This paper focuses on the hardware architecture and calibration pipeline and demonstrates the feasibility of the approach using a prototype implementation. All hardware designs and documentation are made openly available.
Chinese Translation
基于视频的眼动追踪研究长期以来探索了立体和反光点方法,但现有的可穿戴眼动追踪器——无论是商业产品还是开源项目——在算法开发和比较评估方面提供的灵活性有限。我们提出了一种经济实惠的可穿戴立体眼动追踪平台,该平台由现成的和可3D打印的组件构建,明确针对这一空白。该系统结合了四个红外眼部摄像头、红外照明、可选的场景摄像头,以及用于校准和同步数据采集的软件支持。该平台的设计支持多种眼动追踪范式,包括立体、反光点和双眼方法,均在单一硬件配置中实现。该平台优先考虑模块化和可扩展性,以便于研究使用,而非优化最终用户的鲁棒性。本文重点介绍硬件架构和校准流程,并通过原型实现展示了该方法的可行性。所有硬件设计和文档均已公开提供。
cs.CV / 182 / 2604.24339
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
更远的视野,更深的思考:通过低级视觉线索和反思提升视觉语言模型的推理能力
Abstract
Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these problems, this paper proposes a unified multimodal interleaved reasoning framework, ForeSight, which enables VLMs to See Further with low-level visual cues and Think Deeper with effective visual feedback. First, it introduces a set of low-level visual tools to integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism is elaborated to incorporate visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to autonomously decide on tool invocation and answer verification, with the final answer accuracy as the reward signal. To evaluate the performance of the proposed framework, we construct a new dataset, Character and Grounding SalBench (CG-SalBench), based on the SalBench dataset. Experimental results demonstrate that the ForeSight-7B model significantly outperforms other models with the same parameter scale, and even surpasses the current SOTA closed-source models on certain metrics.
Chinese Translation
近期视觉语言模型(VLM)的进展得益于强化学习(RL)的应用,从而增强了推理能力。然而,现有方法仍面临关键限制,包括缺乏低级视觉信息和有效的视觉反馈。为了解决这些问题,本文提出了一种统一的多模态交错推理框架 ForeSight,使得 VLM 能够通过低级视觉线索“更远地看”,并通过有效的视觉反馈“更深入地思考”。首先,框架引入了一组低级视觉工具,将基本的视觉信息整合到推理链中,减轻对细粒度视觉特征的忽视。其次,详细阐述了一种基于掩码的视觉反馈机制,将视觉反思纳入思考过程中,使模型能够动态重新审视和更新其答案。在强化学习的驱动下,ForeSight 学会自主决定工具的调用和答案的验证,以最终答案的准确性作为奖励信号。为了评估所提框架的性能,我们基于 SalBench 数据集构建了一个新数据集,命名为 Character and Grounding SalBench (CG-SalBench)。实验结果表明,ForeSight-7B 模型在相同参数规模下显著优于其他模型,并在某些指标上甚至超越了当前的 SOTA 闭源模型。
cs.CV / 183 / 2604.24346
SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters
SycoPhantasy:量化小型开放权重视觉语言模型中的谄媚行为和幻觉,以评估幻想角色的视觉语言匹配
Abstract
Vision-language models (VLMs) are increasingly deployed as evaluators in tasks requiring nuanced image understanding, yet their reliability in scoring alignment between images and text descriptions remains underexplored. We investigate whether small, open-weight VLMs exhibit sycophantic behavior when evaluating image-text alignment: assigning high scores without grounding their judgments in visual evidence. To quantify this phenomenon, we introduce the Bluffing Coefficient, a metric that measures the mismatch between a model's score and its evidence recall. We evaluate six open-weight VLMs ranging from 450M to 8B parameters on a benchmark of 173,810 AI-generated character portraits paired with detailed textual descriptions. Our analysis reveals a significant inverse correlation between model size and sycophancy rate (r = -0.96, p = 0.002), with smaller models exhibiting substantially higher rates of unjustified high scores. The smallest model tested (LFM2-VL, 450M) produced sycophantic evaluations in 22.3% of cases, compared to 6.0% for the largest (LLaVA-1.6, 7B). These findings have direct implications for the deployment of small, open-weight VLMs as automated evaluators within attribute-rich, synthetic image evaluation tasks, where the gap between assigned scores and cited visual evidence is both measurable and consequential.
Chinese Translation
视觉语言模型(VLMs)越来越多地被用作需要细致图像理解的任务评估者,但它们在图像与文本描述之间对齐评分的可靠性仍然未得到充分探讨。我们研究小型开放权重的 VLMs 在评估图像-文本对齐时是否表现出“谄媚”行为:在没有视觉证据支持的情况下给予高分。为了量化这一现象,我们引入了虚张声势系数(Bluffing Coefficient),该指标衡量模型评分与其证据回忆之间的不匹配。我们在一个包含 173,810 个 AI 生成角色肖像及其详细文本描述的基准上评估了六个开放权重的 VLMs,参数范围从 450M 到 8B。我们的分析揭示了模型大小与谄媚率之间显著的负相关关系(r = -0.96,p = 0.002),较小的模型表现出明显更高的无根据高分率。测试中最小的模型(LFM2-VL,450M)在 22.3% 的情况下产生了谄媚评估,而最大的模型(LLaVA-1.6,7B)仅为 6.0%。这些发现对小型开放权重 VLMs 在属性丰富的合成图像评估任务中作为自动评估者的部署具有直接影响,在这些任务中,分配的分数与引用的视觉证据之间的差距是可测量且具有重要意义的。
cs.CV / 184 / 2604.24353
ARETE: Attention-based Rasterized Encoding for Topology Estimation using HSV-transformed Crowdsourced Vehicle Fleet Data
ARETE:基于注意力机制的光栅编码用于利用HSV转换的众包车辆车队数据进行拓扑估计
Abstract
The continuous advancement of autonomous driving (AD) introduces challenges across multiple disciplines to ensure safe and efficient driving. One such challenge is the generation of High-Definition (HD) maps, which must remain up to date and highly accurate for downstream automotive tasks. One promising approach is the use of crowdsourced data from a vehicle fleet, representing road topology and lane-level features. This work focuses on the generation of centerlines and lane dividers from crowdsourced vehicle trajectories. We adopt a Detection Transformer (DETR)-based approach, where a rasterized representation of vehicle trajectories is used as input to predict vectorized lane representations. Each lane consists of a centerline with an associated direction and corresponding lane dividers that are geometrically constrained by the centerline. Our method includes the extraction of local tiles, from which crowdsourced vehicle trajectories are aggregated. Each tile undergoes a transformation into a rasterized representation encoding both the presence and direction of each trajectory, enabling the prediction of vectorized directed lanes. Experiments are conducted on an internal dataset as well as on the public datasets nuScenes and nuPlan.
Chinese Translation
自主驾驶(AD)的持续进步给多个学科带来了挑战,以确保安全和高效的驾驶。其中一个挑战是生成高精度(HD)地图,这些地图必须保持最新且高度准确,以满足下游汽车任务的需求。一种有前景的方法是利用来自车辆车队的众包数据,这些数据代表了道路拓扑和车道级特征。本研究集中于从众包车辆轨迹生成中心线和车道分隔线。我们采用基于检测变换器(Detection Transformer, DETR)的方法,其中车辆轨迹的光栅化表示作为输入,用于预测向量化的车道表示。每条车道由一个中心线及其相关方向和相应的车道分隔线组成,这些分隔线在几何上受到中心线的约束。我们的方法包括提取局部瓦片,从中聚合众包车辆轨迹。每个瓦片经过转换为光栅化表示,编码每条轨迹的存在和方向,从而实现向量化定向车道的预测。实验在内部数据集以及公共数据集nuScenes和nuPlan上进行。
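A raster that encodes both the presence and the direction of each trajectory can be sketched by mapping heading to hue and presence to brightness in HSV space, then converting to RGB for the network input. The sketch below uses only the standard library. The tile size, the "last trajectory wins" overwrite rule, and the function name are assumptions for illustration; the abstract does not describe the exact encoding or how overlapping trajectories are aggregated.

```python
import colorsys
import math

def rasterize(trajectories, size=8):
    """Rasterize polylines [(x, y), ...] given in tile pixel coordinates.

    Returns a size x size grid of (r, g, b) tuples: hue encodes heading,
    value encodes presence, and empty cells stay black.
    """
    grid = [[(0.0, 0.0, 0.0)] * size for _ in range(size)]
    for traj in trajectories:
        for (x0, y0), (x1, y1) in zip(traj, traj[1:]):
            heading = math.atan2(y1 - y0, x1 - x0) % (2 * math.pi)
            hue = heading / (2 * math.pi)  # 0 = +x direction, 0.5 = -x direction
            steps = max(abs(x1 - x0), abs(y1 - y0), 1)
            for s in range(int(steps) + 1):
                x = round(x0 + (x1 - x0) * s / steps)
                y = round(y0 + (y1 - y0) * s / steps)
                if 0 <= x < size and 0 <= y < size:
                    grid[y][x] = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return grid

# Two opposing lanes: one driven in +x, one driven in -x.
tile = rasterize([[(0, 0), (3, 0)], [(3, 2), (0, 2)]])
```

Opposite driving directions land on opposite sides of the hue wheel (red versus cyan here), so a model can separate adjacent lanes with opposing traffic even when they are only a few pixels apart.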
cs.CV / 185 / 2604.24370
Multispectral airborne laser scanning dataset for tree species classification: MS-ALS-SPECIES
用于树种分类的多光谱航空激光扫描数据集:MS-ALS-SPECIES
Abstract
The shift from stand-level to individual-tree-level forest assessments supports improved biodiversity mapping, particularly in boreal ecosystems where tree species like aspen (Populus tremula L.) play a keystone role. While airborne laser scanning (ALS) is the standard for such inventories, a major limitation is the small number of publicly available ALS datasets containing high-quality, field-validated reference data. Furthermore, open multispectral ALS datasets with high-quality field reference data are completely lacking despite the potential of multispectral ALS data for tree species classification. This paper presents and details an open multispectral ALS dataset used in a recent international benchmarking study of machine learning and deep learning methods for tree species classification by Taher et al. (2026). The dataset comprises 6326 segment-level point clouds of individual trees representing nine species in Southern Finland. The point cloud data has been acquired using two multispectral laser scanning systems each operating at three laser wavelengths: a helicopter-borne system (HeliALS) with a point density exceeding 1000 points/m² and an Optech Titan system with approximately 35 points/m². We provide a detailed description of field data collection techniques developed in the study to facilitate the collection of high-quality ground truth data in an efficient and scalable manner. Additionally, our article presents new analyses on species classification using multispectral data building upon the initial findings of Taher et al. (2026). Furthermore, we study the relation between classification accuracy and tree height to highlight the versatility of the open dataset and to demonstrate the advantage of the point transformer model for small trees and minority species.
Chinese Translation
从林分级别向单棵树级别的森林评估转变,支持了生物多样性映射的改善,特别是在北方(寒带)生态系统中,欧洲山杨(Populus tremula L.)等树种扮演着关键角色。虽然航空激光扫描(ALS)是此类清查的标准,但一个主要限制是缺乏包含高质量、经过实地验证的参考数据的公开ALS数据集。此外,尽管多光谱ALS数据在树种分类中具有潜力,但完全缺乏高质量实地参考数据的开放多光谱ALS数据集。本文介绍并详细阐述了一个开放的多光谱ALS数据集,该数据集用于Taher等人(2026)最近进行的树种分类机器学习和深度学习方法的国际基准研究。该数据集包含6326个单棵树的分段点云,代表了芬兰南部的九个树种。点云数据是使用两种多光谱激光扫描系统获取的,每种系统在三种激光波长下工作:一种是点密度超过1000点/m²的直升机载系统(HeliALS),另一种是点密度约为35点/m²的Optech Titan系统。我们提供了在研究中开发的实地数据收集技术的详细描述,以促进高质量地面真实数据的高效和可扩展收集。此外,本文基于Taher等人(2026)的初步发现,展示了使用多光谱数据进行物种分类的新分析。此外,我们研究了分类准确性与树高之间的关系,以突出开放数据集的通用性,并展示点变换器模型在小树和少数树种中的优势。
cs.CV / 186 / 2604.24396
Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation
全球背景还是局部细节?用于幻觉缓解的自适应视觉基础
Abstract
Vision-Language Models (VLMs) are frequently undermined by object hallucination--generating content that contradicts visual reality--due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our key finding of a critical attention deficit in VLMs, where visual features are empirically under-weighted. Our framework corrects this via a dual-path contrast: The positive path amplifies salient visual evidence using multi-layer attention to encourage faithful descriptions, directly counteracting the attention deficit. Simultaneously, the negative path identifies and degrades the core object's features to create a strong counterfactual, which penalizes ungrounded, prior-dominant generation. By contrasting the model's outputs from these two perspectives at each step, PND steers generation towards text that is not just linguistically probable, but visually factual. Extensive experiments on benchmarks like POPE, MME, and CHAIR show that PND achieves state-of-the-art performance with up to 6.5% accuracy improvement, substantially reducing object hallucination while also enhancing descriptive detail--all without requiring any model retraining. The method generalizes effectively across diverse VLM architectures including LLaVA, InstructBLIP, InternVL, and Qwen-VL.
Chinese Translation
视觉-语言模型(VLMs)常常受到对象幻觉的影响——生成与视觉现实相矛盾的内容——这主要是由于对语言先验的过度依赖。我们提出了正负解码(Positive-and-Negative Decoding, PND),这是一种无训练推理框架,直接干预解码过程以强制执行视觉真实度。PND的提出源于我们发现VLMs中存在关键的注意力缺失,视觉特征在经验上被低估。我们的框架通过双路径对比来纠正这一问题:正路径利用多层注意力放大显著的视觉证据,以鼓励真实的描述,直接对抗注意力缺失。同时,负路径识别并削弱核心对象的特征,以创建强有力的反事实,从而惩罚无基础的、以先验为主导的生成。通过在每一步对比模型输出的这两个视角,PND引导生成的文本不仅在语言上是可能的,而且在视觉上是事实的。在POPE、MME和CHAIR等基准上的大量实验表明,PND在准确性上实现了最高6.5%的提升,显著减少了对象幻觉,同时增强了描述细节——所有这些都无需对模型进行重新训练。该方法在包括LLaVA、InstructBLIP、InternVL和Qwen-VL等多种VLM架构中有效泛化。
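The dual-path contrast at a single decoding step can be sketched as below. The positive path yields logits conditioned on amplified visual evidence, the negative path yields logits with the core object degraded, and the two are combined so that tokens which stay likely without visual support are penalized. The combination rule `(1 + alpha) * pos - alpha * neg` on log-probabilities is the form commonly used in contrastive decoding and is swapped in here as an assumption; the abstract does not state PND's actual rule, and `alpha` and the function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]

def contrast_step(pos_logits, neg_logits, alpha=1.0):
    """Return the index of the token chosen after contrasting the two paths."""
    pos, neg = log_softmax(pos_logits), log_softmax(neg_logits)
    scores = [(1 + alpha) * p - alpha * n for p, n in zip(pos, neg)]
    return max(range(len(scores)), key=scores.__getitem__)

vocab = ["dog", "cat", "frisbee"]
pos_logits = [2.0, 0.5, 1.9]   # with amplified visual evidence
neg_logits = [2.0, 0.5, 0.2]   # with the core object degraded
chosen = vocab[contrast_step(pos_logits, neg_logits)]
```

In this toy case greedy decoding on the positive path alone would pick "dog", whose score does not change when the object is degraded and is therefore prior-driven. The contrast instead picks "frisbee", the token whose probability actually depends on the visual input.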
cs.CV / 187 / 2604.24407
AD-Relight: Training-Free Banner Relighting via Illumination Translation with Diffusion Priors
AD-Relight:基于扩散先验的无训练横幅重光照的照明转换
Abstract
The recent surge in content consumption through streaming services has driven a growing demand for personalized content. Personalized advertisements (ads) play a crucial role in enhancing both user engagement and ad effectiveness. A key aspect of ad personalization involves replacing existing regions in a frame with custom, Photoshop-generated banners. However, existing ad-placement pipelines typically rely on simple geometric warping, ignoring the scene's underlying lighting conditions. Similarly, state-of-the-art diffusion-based object insertion and relighting models struggle to accurately relight these newly inserted banners, as they are not trained on ad-banner data, and training such a model for ad banners would require millions of images. This highlights the need for an effective relighting framework that enables seamless integration of custom banners into the original scene. Motivated by this, we present AD-Relight, a novel multi-stage training-free framework that adapts a diffusion-based relighting model at test time to relight newly added Photoshop-generated ad banners. Through extensive evaluation, we demonstrate that AD-Relight outperforms both relighting baselines and existing ad-placement methods based on simple warping. User studies further show that participants consistently prefer the outputs of AD-Relight over those of prior approaches.
Chinese Translation
最近,通过流媒体服务消费内容的激增推动了对个性化内容的日益需求。个性化广告在提升用户参与度和广告效果方面发挥着至关重要的作用。广告个性化的一个关键方面是用定制的、通过Photoshop生成的横幅替换帧中的现有区域。然而,现有的广告投放流程通常依赖于简单的几何变形,忽略了场景的基本照明条件。同样,基于扩散的最先进的物体插入和重光照模型在准确重光照这些新插入的横幅时也面临困难,因为它们并未在广告横幅数据上进行训练,而训练这样一个模型需要数百万张图像。这突显了需要一个有效的重光照框架,以便将定制横幅无缝集成到原始场景中。基于此,我们提出了AD-Relight,这是一种新颖的多阶段无训练框架,能够在测试时适应基于扩散的重光照模型,以重光照新添加的通过Photoshop生成的广告横幅。通过广泛的评估,我们证明了AD-Relight在重光照基线和基于简单变形的现有广告投放方法中表现优越。用户研究进一步表明,参与者始终偏好AD-Relight的输出而非先前方法的输出。
cs.CV / 188 / 2604.24419
BMD-45: A Large-Scale CCTV Vehicle Detection Dataset for Urban Traffic in Developing Cities
BMD-45:一个面向发展中国家城市交通的大规模监控摄像头车辆检测数据集
Abstract
Robust vehicle detection from fixed CCTV cameras is critical for Intelligent Transportation Systems. Yet existing benchmarks predominantly feature relatively homogeneous, highly organized traffic patterns captured from ego-centric driving perspectives or controlled aerial views. This regional and sensor view bias creates a significant gap. Models trained on datasets such as UA-DETRAC and COCO struggle to generalize to the dense, heterogeneous, disorganized traffic conditions observed in rapidly developing urban centers in emerging economies. To address this limitation, we introduce BMD-45, a large-scale dataset comprising 480K bounding boxes annotated over 45K images captured from over 3.6K operational Safe City CCTV cameras. BMD-45 contains 14 fine-grained vehicle categories, including region-specific modes such as auto-rickshaws and tempo travellers, which are not present in existing benchmarks. The dataset captures real-world deployment challenges, including extreme viewpoint variation, occlusion, and vehicle density. We establish comprehensive baselines using state-of-the-art detectors and reveal a striking domain gap: models fine-tuned on UA-DETRAC achieve only 33.6% mAP@0.5:0.95, compared to 83.8% when trained in-domain on BMD-45, representing a 2.5x improvement that persists even when accounting for novel vehicle classes. This performance gap underscores the critical need for geographically diverse traffic benchmarks and establishes BMD-45 as a baseline for developing robust perception systems in underrepresented urban environments worldwide. The dataset is available at: https://huggingface.co/datasets/iisc-aim/BMD-45.
Chinese Translation
从固定监控摄像头进行稳健的车辆检测对于智能交通系统至关重要。然而,现有的基准测试主要集中在从自我中心驾驶视角或受控空中视角捕获的相对同质化、高度组织化的交通模式。这种区域和传感器视角的偏差造成了显著的差距。在UA-DETRAC和COCO等数据集上训练的模型在快速发展的新兴经济体城市中心所观察到的密集、异质和无序的交通条件下难以泛化。为了解决这一局限性,我们推出了BMD-45,这是一个大规模数据集,包含在超过3600个运营的安全城市监控摄像头拍摄的45000张图像上标注的48万个边界框。BMD-45包含14个细粒度的车辆类别,包括区域特定的模式,如三轮摩托车和小型旅行车,这些在现有基准中并不存在。该数据集捕捉了现实世界中的部署挑战,包括极端视角变化、遮挡和车辆密度。我们使用最先进的检测器建立了全面的基线,并揭示了显著的领域差距:在UA-DETRAC上微调的模型仅达到33.6%的mAP@0.5:0.95,而在BMD-45上进行领域内训练时则达到83.8%,这代表了一个2.5倍的提升,即使考虑到新型车辆类别,这一改善依然存在。这一性能差距强调了对地理多样化交通基准的迫切需求,并确立了BMD-45作为在全球欠代表城市环境中开发稳健感知系统的基线。数据集可在以下网址获取:https://huggingface.co/datasets/iisc-aim/BMD-45。
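The headline figures in the BMD-45 abstract can be checked with a few lines of arithmetic. The inputs are the rounded values quoted in the abstract ("480K", "over 45K", "over 3.6K"), so the derived densities are approximate; the variable names are chosen here for illustration only.

```python
# Rounded figures quoted in the abstract.
boxes = 480_000
images = 45_000
cameras = 3_600
cross_domain_map = 33.6   # fine-tuned on UA-DETRAC, evaluated on BMD-45
in_domain_map = 83.8      # trained in-domain on BMD-45

boxes_per_image = boxes / images            # about 10.7 annotated boxes per image
images_per_camera = images / cameras        # about 12.5 images per camera
gain_ratio = in_domain_map / cross_domain_map   # about 2.49, quoted as 2.5x

print(f"{boxes_per_image:.1f} boxes/image, {images_per_camera:.1f} images/camera")
print(f"in-domain vs cross-domain: {gain_ratio:.2f}x")
```

The ratio of roughly 2.49 matches the quoted 2.5x after rounding.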
cs.CV / 189 / 2604.24426
DYMAPIA: A Multi-Domain Framework for Detecting AI-based Video Manipulation
DYMAPIA:一种多领域框架用于检测基于人工智能的视频篡改
Abstract
AI-generated media are advancing rapidly, raising pressing concerns for content authenticity and digital trust. We introduce DYMAPIA, a multi-domain Deepfake detection framework that fuses spatial, spectral, and temporal cues to capture subtle traces of manipulation in visual data. The system builds dynamic anomaly masks by combining evidence from Fourier spectra, local texture descriptors, edge irregularities, and optical flow consistency, which highlight tampered regions with fine spatial accuracy. These masks guide DistXCNet, a lightweight classifier distilled from Xception and optimized with depthwise separable convolutions for fast, region-focused classification. This joint design achieves state-of-the-art results, with accuracy and F1-scores exceeding 99% on FF++, Celeb-DF, and VDFD benchmarks, while keeping the model compact enough for real-time use. Beyond outperforming existing full-frame and multidomain detectors, DYMAPIA demonstrates deployment readiness for time-critical forensic tasks, including media verification, misinformation defense, and secure content filtering.
Chinese Translation
人工智能生成的媒体正在迅速发展,引发了对内容真实性和数字信任的迫切关注。我们介绍了DYMAPIA,这是一种多领域的深度伪造(Deepfake)检测框架,融合了空间、频谱和时间线索,以捕捉视觉数据中微妙的篡改痕迹。该系统通过结合傅里叶谱、局部纹理描述符、边缘不规则性和光流一致性等证据,构建动态异常掩膜,突出显示具有精细空间精度的篡改区域。这些掩膜指导DistXCNet,这是一个从Xception提炼出的轻量级分类器,并通过深度可分离卷积进行优化,以实现快速、区域聚焦的分类。该联合设计实现了最先进的结果,在FF++、Celeb-DF和VDFD基准测试中,准确率和F1分数均超过99%,同时保持模型足够紧凑以适应实时使用。除了超越现有的全帧和多领域检测器外,DYMAPIA还展示了在时间关键的取证任务中的部署准备,包括媒体验证、虚假信息防御和安全内容过滤。
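The compactness that the DYMAPIA abstract attributes to depthwise separable convolutions follows from a simple parameter count. The sketch below compares one standard convolution with its depthwise separable counterpart; the layer sizes are illustrative assumptions and are not DistXCNet's actual configuration, and bias terms are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    # One k x k filter per (input channel, output channel) pair.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise: one k x k filter per input channel.
    # Pointwise: a 1 x 1 convolution mixing c_in channels into c_out.
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 128, 256   # hypothetical layer sizes
dense = standard_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
reduction = dense / separable

print(dense, separable, round(reduction, 2))
```

For this hypothetical layer the separable version uses roughly one ninth of the weights.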
cs.CV / 190 / 2604.24441
AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark
AutoGUI-v2:全面的多模态图形用户界面功能理解基准
Abstract
Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the "digital world state" resulting from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated (focusing either on black-box task completion or static, shallow grounding), thereby failing to assess whether agents truly comprehend the implicit functionality and transition logic of GUIs. To bridge this gap, we introduce AutoGUI-v2, a comprehensive benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. We construct the benchmark using a novel VLM-human collaborative pipeline that recursively parses multi-platform screenshots into hierarchical functional regions to generate diverse evaluation tasks. Providing 2,753 tasks across six operating systems, AutoGUI-v2 rigorously tests agents on region and element-level semantics, grounding, and dynamic state prediction. Our evaluation reveals a striking dichotomy in VLMs: while open-source models fine-tuned on agent data (e.g., Qwen3-VL) excel at functional grounding, commercial models (e.g., Gemini-2.5-Pro-Thinking) dominate in functionality captioning. Crucially, all models struggle with complex interaction logic of uncommon actions, highlighting that deep functional understanding remains a significant hurdle. By systematically measuring these foundational capabilities, AutoGUI-v2 offers a new lens for advancing the next generation of GUI agents.
Chinese Translation
能够导航图形用户界面(GUIs)的自主代理有潜力彻底改变数字生产力。然而,实现真正的数字自主不仅仅依赖于反应性元素匹配;它需要对界面动态的预测性心理模型,以及预见由交互产生的“数字世界状态”的能力。尽管现代视觉-语言模型(VLMs)具备感知能力,现有基准仍然存在分歧(要么专注于黑箱任务完成,要么关注静态、浅层的基础),因此未能评估代理是否真正理解GUIs的隐含功能和转变逻辑。为填补这一空白,我们推出了AutoGUI-v2,这是一个全面的基准,旨在评估深层次的GUI功能理解和交互结果预测。我们利用一种新颖的VLM-人类协作管道构建了该基准,该管道递归解析多平台截图为层次功能区域,以生成多样化的评估任务。AutoGUI-v2提供了跨六个操作系统的2,753个任务,严格测试代理在区域和元素级语义、基础和动态状态预测方面的能力。我们的评估揭示了VLMs的显著二分法:虽然在代理数据上微调的开源模型(例如Qwen3-VL)在功能基础方面表现出色,但商业模型(例如Gemini-2.5-Pro-Thinking)在功能描述方面占据主导地位。重要的是,所有模型在处理不常见动作的复杂交互逻辑时都面临挑战,这突显了深层功能理解仍然是一个重大障碍。通过系统地测量这些基础能力,AutoGUI-v2为推动下一代GUI代理的进步提供了新的视角。
cs.CV / 191 / 2604.24459
TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering
TextGround4M:一个与提示对齐的布局感知文本渲染数据集
Abstract
Despite recent advances in text-to-image generation, models still struggle to accurately render prompt-specified text with correct spatial layout -- especially in multi-span, structured settings. This challenge is driven not only by the lack of datasets that align prompts with the exact text and layout expected in the image, but also by the absence of effective metrics for evaluating layout quality. To address these issues, we introduce TextGround4M, a large-scale dataset of over 4 million prompt-image pairs, each annotated with span-level text grounded in the prompt and corresponding bounding boxes. This enables fine-grained supervision for layout-aware, prompt-grounded text rendering. Building on this, we propose a lightweight training strategy for autoregressive T2I models that appends layout-aware span tokens during training, without altering model architecture or inference behavior. We further construct a benchmark with stratified layout complexity to evaluate both open-source and proprietary models in a zero-shot setting. In addition, we introduce two layout-aware metrics to address the long-standing lack of spatial evaluation in text rendering. Our results show that models trained on TextGround4M outperform strong baselines in text fidelity, spatial accuracy, and prompt consistency, highlighting the importance of fine-grained layout supervision for grounded T2I generation.
Chinese Translation
尽管文本到图像生成技术最近取得了进展,但模型在准确渲染提示指定的文本及其正确的空间布局方面仍然面临挑战——尤其是在多跨度、结构化的环境中。这一挑战不仅源于缺乏将提示与图像中期望的确切文本和布局对齐的数据集,还由于缺乏有效的评估布局质量的指标。为了解决这些问题,我们提出了TextGround4M,这是一个包含超过400万个提示-图像对的大规模数据集,每个样本都标注了与提示对齐的跨度级文本及其对应的边界框。这使得布局感知的、基于提示的文本渲染能够实现细粒度的监督。在此基础上,我们提出了一种轻量级的自回归文本到图像(T2I)模型训练策略,在训练过程中添加布局感知的跨度标记,而不改变模型架构或推理行为。此外,我们构建了一个具有分层布局复杂性的基准,以在零-shot设置中评估开源和专有模型。此外,我们引入了两个布局感知指标,以解决文本渲染中长期缺乏空间评估的问题。我们的结果表明,基于TextGround4M训练的模型在文本保真度、空间准确性和提示一致性方面超越了强基线,突显了细粒度布局监督在基于提示的T2I生成中的重要性。
cs.CV / 192 / 2604.24479
Zero-to-CAD: Agentic Synthesis of Interpretable CAD Programs at Million-Scale Without Real Data
从零到CAD:在百万规模下无真实数据的可解释CAD程序的自主合成
Abstract
Computer-Aided Design (CAD) models are defined by their construction history: a parametric recipe that encodes design intent. However, existing large-scale 3D datasets predominantly consist of boundary representations (B-Reps) or meshes, stripping away this critical procedural information. To address this scarcity, we introduce Zero-to-CAD, a scalable framework for synthesizing executable CAD construction sequences. We frame synthesis as an agentic search problem: by embedding a large language model (LLM) within a feedback-driven CAD environment, our system iteratively generates, executes, and validates code using tools and documentation lookup to promote geometric validity and operation diversity. This agentic approach enables the synthesis of approximately one million executable, readable, editable CAD sequences, covering a rich vocabulary of operations beyond sketch-and-extrude workflows. We also release a curated subset of 100,000 high-quality models selected for geometric diversity. To demonstrate the dataset's utility, we fine-tune a vision-language model on our synthetic data to reconstruct editable CAD programs from multi-view images, outperforming strong baselines, including GPT-5.2, and effectively bootstrapping sequence generation capabilities without real construction-history training data. Zero-to-CAD bridges the gap between geometric scale and parametric interpretability, offering a vital resource for the next generation of CAD AI.
Chinese Translation
计算机辅助设计(CAD)模型由其构建历史定义:一种编码设计意图的参数化配方。然而,现有的大规模3D数据集主要由边界表示(B-Reps)或网格组成,剥夺了这一关键的过程信息。为了解决这一稀缺问题,我们提出了Zero-to-CAD,一个可扩展的框架,用于合成可执行的CAD构建序列。我们将合成框架视为一个自主搜索问题:通过在反馈驱动的CAD环境中嵌入大型语言模型(LLM),我们的系统迭代生成、执行和验证代码,利用工具和文档查找来促进几何有效性和操作多样性。这种自主方法使我们能够合成大约一百万个可执行、可读、可编辑的CAD序列,涵盖了超越草图和拉伸工作流程的丰富操作词汇。我们还发布了一个精心挑选的10万个高质量模型子集,以确保几何多样性。为了展示数据集的实用性,我们在我们的合成数据上微调了一个视觉-语言模型,以从多视图图像重建可编辑的CAD程序,超越了包括GPT-5.2在内的强基线,并有效地在没有真实构建历史训练数据的情况下启动序列生成能力。Zero-to-CAD弥合了几何规模与参数化可解释性之间的差距,为下一代CAD人工智能提供了重要资源。
cs.CV / 193 / 2604.24492
Deployment-Aligned Low-Precision Neural Architecture Search for Spaceborne Edge AI
面向部署的低精度神经架构搜索用于空间边缘人工智能
Abstract
Designing deep networks that meet strict latency and accuracy constraints on edge accelerators increasingly relies on hardware-aware optimization, including neural architecture search (NAS) guided by device-level metrics. Yet most hardware-aware NAS pipelines still optimize architectures under full-precision assumptions and apply low-precision adaptation only after the search, leading to a mismatch between optimization-time behavior and deployment-time execution on low-precision hardware that can substantially degrade accuracy. We address this limitation by integrating deployment-aligned low-precision training directly into hardware-aware NAS. Candidate architectures are exposed to FP16 numerical constraints during fine-tuning and evaluation, enabling joint optimization of architectural efficiency and numerical robustness without modifying the search space or evolutionary strategy. We evaluate the proposed framework on vessel segmentation for spaceborne maritime monitoring, targeting the Intel Movidius Myriad X Visual Processing Unit (VPU). While post-training precision conversion reduces on-device performance from 0.85 to 0.78 mIoU, deployment-aligned low-precision training achieves 0.826 mIoU on-device for the same architecture (95,791 parameters), recovering approximately two-thirds of the deployment-induced accuracy gap without increasing model complexity. These results demonstrate that incorporating deployment-consistent numerical constraints into hardware-aware NAS substantially improves robustness and alignment between optimization and deployment for resource-constrained edge Artificial Intelligence (AI).
Chinese Translation
在边缘加速器上设计满足严格延迟和准确性约束的深度网络越来越依赖于硬件感知优化,包括基于设备级指标指导的神经架构搜索(NAS)。然而,大多数硬件感知NAS流程仍然在全精度假设下优化架构,并仅在搜索后应用低精度适配,这导致优化时行为与在低精度硬件上的部署时执行之间存在不匹配,从而显著降低准确性。我们通过将面向部署的低精度训练直接集成到硬件感知NAS中来解决这一限制。在微调和评估过程中,候选架构暴露于FP16数值约束下,使得架构效率和数值鲁棒性的联合优化成为可能,而无需修改搜索空间或进化策略。我们在空间海洋监测的船舶分割任务上评估了所提出的框架,目标是Intel Movidius Myriad X视觉处理单元(VPU)。尽管后训练精度转换将设备上的性能从0.85降低到0.78 mIoU,但面向部署的低精度训练在相同架构(95,791个参数)上实现了0.826 mIoU,恢复了大约三分之二的部署引起的准确性差距,而没有增加模型复杂性。这些结果表明,将与部署一致的数值约束纳入硬件感知NAS显著提高了资源受限边缘人工智能(AI)之间的鲁棒性和优化与部署的一致性。
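The "approximately two-thirds" figure in this abstract follows directly from the three mIoU values it quotes. A short check, using variable names chosen here for illustration:

```python
reference_miou = 0.85    # before post-training precision conversion
converted_miou = 0.78    # on-device after post-training conversion
aligned_miou = 0.826     # on-device with deployment-aligned low-precision training

gap = reference_miou - converted_miou        # accuracy lost to deployment
recovered = aligned_miou - converted_miou    # accuracy regained by aligned training
fraction_recovered = recovered / gap         # about 0.66

print(f"recovered {fraction_recovered:.1%} of the deployment-induced gap")
```

The result, about 66%, is consistent with the quoted two-thirds.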
cs.CV / 194 / 2604.24493
CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping
CA-IDD:跨注意力引导的身份条件扩散用于身份一致的人脸交换
Abstract
Face swapping aims to generate a realistic facial image by transferring the identity of a source face onto a target face while preserving pose, expression, and context. However, existing methods, especially GAN-based methods, often struggle to balance identity preservation and visual realism due to limited controllability and mode collapse. In this paper, we introduce CA-IDD (Cross-Attention Guided Identity-Conditional Diffusion), the first diffusion-based face swapping approach that integrates multi-modal guidance comprising gaze, identity, and facial parsing through multi-scale cross-attention. Precomputed identity embeddings are incorporated into the denoising process via hierarchical attention layers, resulting in accurate and consistent identity transfer. To improve semantic coherence and visual quality, we use expert-guided supervision, with facial parsing and gaze-consistency modules. Unlike GAN-based or implicit-fusion methods, our diffusion framework provides stable training, robust generalization, and spatially adaptive identity alignment, allowing for fine-grained regional control across pose and expression variations. CA-IDD achieves an FID of 11.73, outperforming established baselines such as FaceShifter and MegaFS. Qualitative results also reveal improved identity retention across diverse poses, establishing CA-IDD as a strong foundation for future diffusion-based face editing.
Chinese Translation
人脸交换旨在通过将源人脸的身份转移到目标人脸上,同时保持姿势、表情和上下文,来优化真实的人脸图像生成。然而,现有方法,尤其是基于生成对抗网络(GAN)的方法,常常难以平衡身份保留和视觉真实感,因为其可控性有限且容易出现模式崩溃。在本文中,我们提出了CA-IDD(跨注意力引导的身份条件扩散),这是首个基于扩散的人脸交换方法,集成了多模态引导,包括视线、身份和面部解析,通过多尺度跨注意力实现。预计算的身份嵌入通过层次注意力层融入去噪过程,从而实现准确且一致的身份转移。为了提高语义一致性和视觉质量,我们采用专家引导的监督,包括面部解析和视线一致性模块。与基于GAN或隐式融合的方法不同,我们的扩散框架提供了稳定的训练、强大的泛化能力和空间自适应的身份对齐,允许在姿势和表情变化中进行细粒度的区域控制。CA-IDD实现了11.73的FID,超越了FaceShifter和MegaFS等既定基准。定性结果也显示出在多样化姿势下身份保留的改善,确立了CA-IDD作为未来基于扩散的人脸编辑的坚实基础。
cs.CV / 195 / 2604.24498
Self-Supervised Representation Learning via Hyperspherical Density Shaping
通过超球面密度塑形的自监督表示学习
Abstract
Modern self-supervised representation learning methods often rely on empirical heuristics that are not theoretically grounded. In this study we propose HyDeS, a theoretically grounded method based on multi-view mutual information maximization within a hyperspherical space, using Shannon differential entropy with a non-parametric von Mises-Fisher density estimator. We show that HyDeS biases the trained model towards focusing on foreground features of the images and performs well on segmentation tasks such as PASCAL VOC, while it lags in fine-grained classification. We provide a detailed analysis of the induced latent space geometry and learning dynamics, which can be used for designing other theoretically grounded self-supervised learning methods.
Chinese Translation
现代自监督表示学习方法通常依赖于没有理论基础的经验启发式方法。在本研究中,我们提出了HyDeS,这是一种基于超球面空间内多视角互信息最大化的理论基础方法,使用了非参数的冯·密斯-费舍尔(von Mises-Fisher)密度估计器和香农微分熵。我们展示了HyDeS使训练模型偏向于关注图像的前景特征,并在诸如VOC PASCAL等分割任务中表现良好,但在细粒度分类方面表现较差。我们提供了对诱导潜在空间几何和学习动态的详细分析,这可以用于设计其他具有理论基础的自监督学习方法。
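To make "Shannon differential entropy with a non-parametric von Mises-Fisher density estimator" more concrete, the sketch below computes a kernel density entropy estimate on the unit hypersphere. It is a generic illustration under assumptions, not the HyDeS objective: the function name `vmf_kde_entropy`, the concentration `kappa`, and the choice to drop the von Mises-Fisher log-normalizer (a constant for fixed `kappa` and dimension) are all made here for simplicity. Each point's own kernel is included in its density estimate.

```python
import numpy as np

def normalize(x):
    # Project rows onto the unit hypersphere.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def vmf_kde_entropy(z, kappa=10.0):
    """Entropy estimate up to an additive constant.

    The density at z_i is estimated as the mean of exp(kappa * <z_i, z_j>)
    over all j; the vMF normalizing constant is omitted.
    """
    n = z.shape[0]
    sims = kappa * (z @ z.T)                      # (n, n) scaled cosine similarities
    m = sims.max(axis=1, keepdims=True)           # for a stable log-sum-exp
    log_density = m[:, 0] + np.log(np.exp(sims - m).sum(axis=1)) - np.log(n)
    return -log_density.mean()

rng = np.random.default_rng(0)
spread = normalize(rng.normal(size=(256, 8)))                         # roughly uniform
clustered = normalize(rng.normal(size=(256, 8)) * 0.05 + np.eye(8)[0])  # tight cluster

h_spread = vmf_kde_entropy(spread)
h_clustered = vmf_kde_entropy(clustered)
```

Embeddings spread over the sphere receive a higher entropy estimate than a tight cluster, which is the quantity an entropy-maximizing term would push up.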
cs.CV / 196 / 2604.24524
Point Cloud Registration for Fusion between SPECT MPI and CTA Images
SPECT MPI与CTA图像融合的点云配准
Abstract
Clinical fusion of Single Photon Emission Computed Tomography Myocardial Perfusion Imaging (SPECT MPI) and Computed Tomography Angiography (CTA) remains limited by cross-modality misregistration and reliance on manual landmarks, which can hinder accurate ischemia localization and lesion-level functional assessment. To address this issue, we propose a registration and fusion framework for SPECT MPI and CTA that integrates functional and structural information for comprehensive cardiac evaluation. The proposed pipeline performs U-Net-based segmentation on both modalities. On SPECT MPI, only the left ventricle (LV) is extracted, and anatomical landmarks are automatically derived from characteristic LV structures. On CTA, both ventricles are segmented, and their spatial relationship is used to automatically define landmarks at the interventricular septal junction. Scale-space consistency preprocessing and landmark-driven coarse registration are applied to mitigate initial misalignment. Based on this initialization, multiple fine registration methods are evaluated on LV epicardial surface point clouds, including ICP, SICP, CPD, CluReg, FFD, and BCPD-plus-plus. The resulting transformations are then propagated to voxel-level resampling for high-precision SPECT-CTA fusion. In a retrospective cohort of 60 patients, the proposed framework preserved sub-millimeter coronary detail from CTA while accurately overlaying quantitative SPECT perfusion. Among the evaluated methods, BCPD-plus-plus achieved the highest accuracy with a mean point cloud distance of 1.7 mm. By combining robust initialization, comparative fine registration, and voxel-level fusion, the proposed approach provides a practical solution for myocardial ischemia localization and functional evaluation of coronary lesions, while remaining independent of any specific fine registration algorithm.
Chinese Translation
单光子发射计算机断层扫描心肌灌注成像(SPECT MPI)与计算机断层扫描血管造影(CTA)的临床融合仍受到跨模态配准误差和对手动标志点的依赖的限制,这可能妨碍准确的缺血定位和病变级功能评估。为了解决这一问题,我们提出了一种SPECT MPI与CTA的配准和融合框架,该框架整合了功能和结构信息,以实现全面的心脏评估。所提出的流程对两种模态进行基于U-Net的分割。在SPECT MPI中,仅提取左心室(LV),并从特征LV结构中自动推导解剖标志点。在CTA中,对两个心室进行分割,并利用它们的空间关系自动定义室间隔交界处的标志点。应用尺度空间一致性预处理和标志驱动的粗配准,以减轻初始对齐误差。在此初始化基础上,对LV心外膜表面点云评估多种精细配准方法,包括ICP、SICP、CPD、CluReg、FFD和BCPD-plus-plus。然后将得到的变换传播到体素级重采样,以实现高精度的SPECT-CTA融合。在一项包含60名患者的回顾性队列研究中,所提出的框架在保留CTA的亚毫米冠状动脉细节的同时,准确叠加定量SPECT灌注。在评估的方法中,BCPD-plus-plus以1.7毫米的平均点云距离达到了最高的准确性。通过结合稳健的初始化、比较精细配准和体素级融合,所提出的方法为心肌缺血定位和冠状动脉病变的功能评估提供了一个实用的解决方案,同时不依赖于任何特定的精细配准算法。
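The landmark-driven coarse registration and the mean point cloud distance mentioned in this abstract can be illustrated with standard tools. The sketch below uses the Kabsch least-squares rigid fit and a brute-force nearest-neighbor distance; both are stand-ins chosen here, since the abstract does not specify the exact coarse-registration solver or distance definition, and the synthetic landmarks are made up for the demonstration.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) mapping src landmarks onto dst."""
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    h = (src - src_c).T @ (dst - dst_c)           # 3 x 3 cross-covariance
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))        # guard against reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = dst_c - r @ src_c
    return r, t

def mean_nearest_distance(a, b):
    # Mean distance from each point in a to its nearest neighbor in b.
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return dists.min(axis=1).mean()

# Synthetic check: recover a known rotation about z plus a translation.
rng = np.random.default_rng(0)
src = rng.normal(size=(6, 3))
angle = np.deg2rad(30.0)
r_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([5.0, -2.0, 1.0])
dst = src @ r_true.T + t_true

r_est, t_est = kabsch(src, dst)
aligned = src @ r_est.T + t_est
error = mean_nearest_distance(aligned, dst)
```

In a pipeline like the one described, such a rigid fit would only provide the initialization; the fine registration methods listed in the abstract (ICP, CPD, BCPD-plus-plus, and others) would then refine it on the epicardial surface point clouds.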
cs.CV / 197 / 2604.24543
RACANet: Reliability-Aware Crowd Anchor Network for RGB-T Crowd Counting
RACANet:面向RGB-T人群计数的可靠性感知人群锚网络
Abstract
RGB-Thermal (T) crowd counting aims to integrate visible-spectrum and thermal infrared information to improve the robustness of crowd density estimation in complex scenes. Although existing studies generally improve counting accuracy through cross-modal feature fusion, most current methods rely on implicit cross-modal fusion strategies and lack explicit modeling of local spatial discrepancies as well as fine-grained characterization of modality reliability at the positional level, thereby limiting the accuracy and interpretability of the fusion process. To address these issues, this paper proposes a two-stage fusion framework, RACANet, a Reliability-Aware Crowd Anchor Network for RGB-T crowd counting. First, we introduce a lightweight cross-modal alignment pretraining stage, which explicitly learns cross-modal semantic correspondences through crowd-prior supervision and local bidirectional soft matching. Then, based on the priors learned during pretraining, a Local Anchor Fusion Module (LAFM) is introduced in the formal training stage. This module generates local semantic anchors by aggregating features from highly reliable regions and further enables adaptive pixel-level feature redistribution with a local attention mechanism. In addition, we propose a discrepancy-aware consistency constraint to dynamically coordinate the reliability of regions where modal representations are consistent. Experiments conducted on two widely used benchmark datasets, RGBT-CC and Drone-RGBT, demonstrate that RACANet outperforms existing methods. The anonymous code is available at https://anonymous.4open.science/r/RACANet-9985.
Chinese Translation
RGB-热成像(T)人群计数旨在整合可见光谱和热红外信息,以提高复杂场景中人群密度估计的鲁棒性。尽管现有研究通常通过跨模态特征融合来提高计数准确性,但大多数当前方法依赖于隐式的跨模态融合策略,缺乏对局部空间差异的显式建模,以及在位置级别上对模态可靠性的细粒度表征,从而限制了融合过程的准确性和可解释性。为了解决这些问题,本文提出了一种两阶段融合框架RACANet,即面向RGB-T人群计数的可靠性感知人群锚网络。首先,我们引入了一个轻量级的跨模态对齐预训练阶段,通过人群先验监督和局部双向软匹配显式学习跨模态语义对应关系。然后,基于预训练阶段学习到的先验,在正式训练阶段引入了局部锚融合模块(LAFM)。该模块通过聚合来自高可靠区域的特征生成局部语义锚,并进一步通过局部注意机制实现自适应的像素级特征重分配。此外,我们提出了一种差异感知一致性约束,以动态协调模态表示一致的区域的可靠性。在两个广泛使用的基准数据集RGBT-CC和Drone-RGBT上进行的实验表明,RACANet优于现有方法。匿名代码可在https://anonymous.4open.science/r/RACANet-9985获取。
cs.CV / 198 / 2604.24575
Diffusion Model as a Generalist Segmentation Learner
扩散模型作为通用分割学习者
Abstract
Diffusion models are primarily trained for image synthesis, yet their denoising trajectories encode rich, spatially aligned visual priors. In this paper, we demonstrate that these priors can be utilized for text-conditioned semantic and open-vocabulary segmentation, and this approach can be generalized to various downstream tasks to make a general-purpose diffusion segmentation framework. Concretely, we introduce DiGSeg (Diffusion Models as a Generalist Segmentation Learner), which repurposes a pretrained diffusion model into a unified segmentation framework. Our approach encodes the input image and ground-truth mask into the latent space and concatenates them as conditioning signals for the diffusion U-Net. A parallel CLIP-aligned text pathway injects language features across multiple scales, enabling the model to align textual queries with evolving visual representations. This design transforms an off-the-shelf diffusion backbone into a universal interface that produces structured segmentation masks conditioned on both appearance and arbitrary text prompts. Extensive experiments demonstrate state-of-the-art performance on standard semantic segmentation benchmarks, as well as strong open-vocabulary generalization and cross-domain transfer to medical, remote sensing, and agricultural scenarios, without domain-specific architectural customization. These results indicate that modern diffusion backbones can serve as generalist segmentation learners rather than pure generators, narrowing the gap between visual generation and visual understanding.
Chinese Translation
扩散模型主要用于图像合成,但其去噪轨迹编码了丰富的、空间对齐的视觉先验。在本文中,我们展示了这些先验可以用于文本条件的语义分割和开放词汇分割,并且这种方法可以推广到各种下游任务,从而构建一个通用的扩散分割框架。具体而言,我们引入了 DiGSeg(Diffusion Models as a Generalist Segmentation Learner),将一个预训练的扩散模型重新利用为一个统一的分割框架。我们的方法将输入图像和真实标签编码到潜在空间,并将其连接作为扩散 U-Net 的条件信号。一个平行的与 CLIP 对齐的文本通道在多个尺度上注入语言特征,使模型能够将文本查询与不断演变的视觉表示对齐。这一设计将现成的扩散骨干网络转变为一个通用接口,生成基于外观和任意文本提示的结构化分割掩膜。大量实验表明,在标准语义分割基准上取得了最先进的性能,并且在开放词汇泛化和跨域转移到医学、遥感和农业场景中表现出强大的能力——无需特定领域的架构定制。这些结果表明,现代扩散骨干网络可以作为通用分割学习者,而不仅仅是纯生成器,从而缩小视觉生成与视觉理解之间的差距。
cs.CV / 199 / 2604.24583
Improving Vision-language Models with Perception-centric Process Reward Models
通过以感知为中心的过程奖励模型提升视觉-语言模型
Abstract
Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding, which can extract image-related claims from the response and compare them one by one with the visual evidence in the image, ultimately returning claims that contain perceptual errors. Perceval is trained with perception-intensive supervised training data. We then integrate Perceval into the RL training process to train the policy models. Specifically, compared to traditional GRPO, which applies sequence-level advantages, we apply token-level advantages by targeting penalties on hallucinated spans identified by Perceval, thus enabling fine-grained supervision signals. In addition to augmenting the training process, Perceval can also assist VLMs during the inference stage. Using Perceval, we can truncate the erroneous portions of the model's response, and then either have the model regenerate the response directly or induce the model to reflect on its previous output. This process can be repeated multiple times to achieve test-time scaling. Experiments show significant improvements on benchmarks from various domains across multiple reasoning VLMs trained with RL, highlighting the promise of perception-centric supervision as a general-purpose strategy. For test-time scaling, it also demonstrates consistent performance gains over other strategies, such as majority voting. Our code and data will be publicly released at https://github.com/RUCAIBox/Perceval.
Chinese Translation
最近,基于可验证奖励的强化学习(RLVR)的进展显著提升了视觉-语言模型(VLMs)在复杂推理能力上的表现。然而,其结果级监督过于粗糙,无法有效诊断和纠正推理链中的错误。为此,我们提出了Perceval,这是一种过程奖励模型(PRM),能够实现令牌级错误定位,从而从响应中提取与图像相关的主张,并逐一与图像中的视觉证据进行比较,最终返回包含感知错误的主张。Perceval使用感知密集型的监督训练数据进行训练。接着,我们将Perceval整合到RL训练过程中,以训练策略模型。具体而言,与传统的序列级优势的GRPO相比,我们通过针对Perceval识别的幻觉跨度施加惩罚,应用令牌级优势,从而实现细粒度的监督信号。除了增强训练过程外,Perceval还可以在推理阶段协助VLMs。使用Perceval,我们可以截断模型响应中的错误部分,然后让模型直接重新生成响应,或者引导模型反思其先前的输出。这个过程可以重复多次,以实现测试时的扩展。实验表明,在多个领域的基准测试中,使用RL训练的多种推理VLMs均显著提升,突显了以感知为中心的监督作为通用策略的潜力。在测试时扩展方面,它也显示出相较于其他策略(如多数投票)的一致性能提升。我们的代码和数据将公开发布于 https://github.com/RUCAIBox/Perceval。
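The contrast between sequence-level and token-level advantages described in the Perceval abstract can be sketched in a few lines. This is a minimal illustration under assumptions: the group-normalized advantage follows the usual GRPO recipe, while the fixed `penalty` subtracted on flagged spans, the function names, and the span format are hypothetical and not taken from the paper.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    # GRPO-style advantage: normalize each reward within its sampled group.
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def token_advantages(seq_adv, length, bad_spans, penalty=1.0):
    # Start from the sequence-level advantage on every token, then subtract
    # a penalty on spans flagged as perceptual errors (hypothetical rule).
    adv = np.full(length, seq_adv, dtype=float)
    for start, end in bad_spans:        # half-open token index ranges
        adv[start:end] -= penalty
    return adv

rewards = [1.0, 0.0, 1.0, 0.0]          # outcome rewards for four sampled responses
seq_adv = group_advantages(rewards)

# First response: correct final answer, but tokens 2..3 were flagged.
tok_adv = token_advantages(seq_adv[0], length=6, bad_spans=[(2, 4)])
```

With a purely sequence-level signal all six tokens of the first response would be reinforced equally; here the flagged tokens receive a lower advantage than the rest.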
cs.CV / 200 / 2604.24586
Point-MF: One-step Point Cloud Generation from a Single Image via Mean Flows
Point-MF:通过均值流从单幅图像一步生成点云
Abstract
Single-image point cloud reconstruction must infer complete 3D geometry, including occluded parts, from a single RGB image. While diffusion-based reconstructors achieve high accuracy, they typically require many denoising iterations, resulting in slow and expensive inference. We propose Point-MF, a Mean-Flow-based framework for low-NFE single-image point cloud reconstruction that couples a Mean-Flow-compatible architecture with an auxiliary loss. Specifically, Point-MF operates directly in point-cloud space to learn the mean velocity field and enables one-step reconstruction with a single network function evaluation (1-NFE), without relying on VAE-based latent representations. To make Mean Flow effective under large interval jumps, Point-MF employs a Diffusion Transformer tailored to the Mean-Flow setting, conditioned on frozen DINOv3 image features via a lightweight token adapter and equipped with explicit interval/time conditioning. Moreover, we introduce Denoised Space Anchor, a set-distance auxiliary loss on the denoised-space estimate $x_\theta$ induced by the predicted velocity field, to stabilize large-step generation and reduce outliers and density artifacts. On ShapeNet-R2N2 and Pix3D, Point-MF strikes a strong balance between reconstruction quality and inference speed compared to multi-step diffusion baselines and competitive feedforward models, while generating high-quality point clouds with millisecond-level latency.
Chinese Translation
单幅图像的点云重建必须从单个RGB图像中推断出完整的3D几何形状,包括被遮挡的部分。尽管基于扩散的重建方法实现了高精度,但通常需要多次去噪迭代,导致推理速度慢且成本高。我们提出了Point-MF,一个基于均值流的框架,用于低NFE(网络函数评估次数)单幅图像点云重建,结合了与均值流兼容的架构和辅助损失。具体而言,Point-MF直接在点云空间中操作,以学习均值速度场,并实现一步重建,仅需一次网络函数评估(1-NFE),而不依赖于基于变分自编码器(VAE)的潜在表示。为了使均值流在大间隔跳跃下有效,Point-MF采用了针对均值流设置的扩散变换器,基于冻结的DINOv3图像特征,通过轻量级的令牌适配器进行条件化,并配备显式的间隔/时间条件。此外,我们引入了去噪空间锚点,这是一个基于预测速度场诱导的去噪空间估计$x_\theta$的集合距离辅助损失,以稳定大步生成并减少离群点和密度伪影。在ShapeNet-R2N2和Pix3D数据集上,与多步扩散基线和竞争性前馈模型相比,Point-MF在重建质量和推理速度之间取得了良好的平衡,同时以毫秒级延迟生成高质量的点云。
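One-step generation from a mean velocity field, as described in the Point-MF abstract, amounts to a single subtraction. The sketch below assumes the common Mean Flow convention in which t = 1 is noise, t = 0 is data, and a sample moves as x_r = x_t - (t - r) * u(x_t, r, t); this convention and the closed-form toy field are assumptions made here. A trained network conditioned on image features would replace `toy_mean_velocity`.

```python
import numpy as np

def one_step_sample(mean_velocity, noise):
    # Jump from t = 1 (noise) to r = 0 (data) with one function evaluation.
    return noise - (1.0 - 0.0) * mean_velocity(noise, 0.0, 1.0)

# Toy "dataset": a single fixed point cloud with three points.
target = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

def toy_mean_velocity(x_t, r, t):
    # For straight paths x_t = (1 - t) * target + t * noise toward one fixed
    # target, the velocity is constant along the path, so the average
    # velocity over [r, t] equals (x_t - target) / t.
    return (x_t - target) / t

rng = np.random.default_rng(0)
noise = rng.normal(size=target.shape)
points = one_step_sample(toy_mean_velocity, noise)
```

Because the toy field is exact, the single step lands on the target cloud; with a learned field the result is only as good as the network's estimate of the mean velocity.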
cs.CV / 201 / 2604.24602
Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift
针对特定模态偏移的视觉-语言模型的主导化引导测试时适应
Abstract
Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modality may still dominate fusion. We study this failure mode through a majorization view of multimodal posteriors and cast adaptation as a constrained de-mixing problem on the fused prediction. Based on this view, we propose MG-MTTA, which keeps the backbone frozen and updates only a lightweight gate or adapter. The objective combines fused-posterior entropy minimization with a reliability-aware gate prior built from anchor-based modality consistency and cross-modal conflict. Our analysis gives conditions under which entropy reduction preserves the correct ranking and a threshold that characterizes modality-dominance failure. On the ImageNet-based benchmark, MG-MTTA improves top-1 accuracy from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift, while remaining competitive in the visual-only benchmark. These results show that multimodal test-time adaptation should control modality reliability, not just prediction entropy.
Chinese Translation
视觉-语言模型在零-shot 设置中具有良好的迁移能力,但在部署时,视觉和文本分支往往会出现不对称的偏移。在这种情况下,基于熵的测试时适应可能会在提高融合后验的同时增加错误,因为不可靠的模态仍可能主导融合。我们通过多模态后验的主导化视角研究这种失败模式,并将适应视为融合预测上的约束去混合问题。基于这一视角,我们提出了 MG-MTTA,该方法保持主干网络不变,仅更新轻量级的门控或适配器。目标结合了融合后验熵最小化与基于锚点的模态一致性和跨模态冲突构建的可靠性感知门控先验。我们的分析给出了熵减少保持正确排名的条件以及表征模态主导失败的阈值。在基于 ImageNet 的基准测试中,MG-MTTA 在语义保持的文本偏移下将 top-1 准确率从 57.97 提高到 66.51,在联合视觉-文本偏移下从 21.68 提高到 26.27,同时在仅视觉基准测试中仍保持竞争力。这些结果表明,多模态测试时适应应控制模态可靠性,而不仅仅是预测熵。
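The failure mode described in the MG-MTTA abstract, where lower entropy coincides with a wrong prediction, can be reproduced with a toy gated fusion. The sketch below is an assumption-laden illustration: the convex logit fusion `gate * visual + (1 - gate) * text`, the logit values, and the function names are hypothetical and not the paper's formulation.

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def fuse(visual_logits, text_logits, gate):
    # Hypothetical fusion: convex combination of branch logits.
    return softmax(gate * visual_logits + (1.0 - gate) * text_logits)

def entropy(p):
    return float(-(p * np.log(p)).sum())

# Three classes; the true class is index 1.
visual = np.array([4.0, 0.0, 0.0])   # shifted visual branch: confident but wrong
text = np.array([0.0, 2.0, 0.0])     # text branch: moderately confident, correct

p_visual_dominant = fuse(visual, text, gate=0.9)
p_text_dominant = fuse(visual, text, gate=0.2)

h_visual_dominant = entropy(p_visual_dominant)
h_text_dominant = entropy(p_text_dominant)
```

The visual-dominant fusion has the lower entropy yet predicts the wrong class, which is why an objective based on entropy alone can prefer it and why a reliability-aware gate prior is needed.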
cs.CV / 202 / 2604.24616
Infrastructure-Guided Connectivity-Enhanced Road Crack Detection and Estimation
基础设施引导的连接增强道路裂缝检测与评估
Abstract
In this paper, we report the world's first infrastructure-guided communication-enhanced road crack detection pipeline that is effective and implementable on passenger vehicles. We first design a customized communication protocol to transmit the region of interest from the infrastructure to the vehicle. With proper camera image processing (e.g., dynamic cropping and frame selection), the focused images are provided to the crack detection model. Leveraging state-of-the-art crack detection model backbones and a carefully prepared dataset comprising a forward-facing view with a crack, we train the model to improve crack-detection performance. We demonstrate the full detection pipeline on an experimental vehicle platform, showcase the detection effectiveness, and project future research directions.
Chinese Translation
本文报告了全球首个基础设施引导的通信增强道路裂缝检测流程,该流程在乘用车上有效且可实施。我们首先设计了一种定制的通信协议,以将关注区域从基础设施传输到车辆。通过适当的摄像头图像处理(例如,动态裁剪和帧选择),将聚焦的图像提供给裂缝检测模型。利用最先进的裂缝检测模型骨干网络和精心准备的包含裂缝的前视图数据集,我们训练模型以提高裂缝检测性能。我们在实验车辆平台上展示了完整的检测流程,展示了检测的有效性,并展望未来的研究方向。
cs.CV / 203 / 2604.24622
CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies
CF-VLA:高效的粗到细视觉-语言-动作策略生成
Abstract
Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency-quality trade-off under real-time constraints. We address this issue by rethinking the role of the starting point in generative action modeling. Instead of shortening the sampling trajectory, we propose CF-VLA, a coarse-to-fine two-stage formulation that restructures action generation into a coarse initialization step that constructs an action-aware starting point, followed by a single-step local refinement that corrects residual errors. Concretely, the coarse stage learns a conditional posterior over endpoint velocity to transform Gaussian noise into a structured initialization, while the fine stage performs a fixed-time refinement from this initialization. To stabilize training, we introduce a stepwise strategy that first learns a controlled coarse predictor and then performs joint optimization. Experiments on CALVIN and LIBERO show that our method establishes a strong efficiency-performance frontier under low-NFE (Number of Function Evaluations) regimes: it consistently outperforms existing NFE=2 methods, matches or surpasses the NFE=10 $\pi_{0.5}$ baseline on several metrics, reduces action sampling latency by 75.4%, and achieves the best average real-robot success rate of 83.0%, outperforming MIP by 19.5 points and $\pi_{0.5}$ by 4.0 points. These results suggest that structured, coarse-to-fine generation enables both strong performance and efficient inference. Our code is available at https://github.com/EmbodiedAI-RoboTron/CF-VLA.
Chinese Translation
基于流的视觉-语言-动作(VLA)策略在动作生成方面具有强大的表达能力,但存在一个根本性的低效问题:需要多步推理从无信息的高斯噪声中恢复动作结构,这在实时约束下导致效率与质量之间的权衡不佳。我们通过重新思考生成动作建模中起始点的角色来解决这个问题。我们提出CF-VLA,一种粗到细的两阶段公式,将动作生成重构为一个粗略初始化步骤,该步骤构建一个动作感知的起始点,随后是一个单步局部细化,纠正残余误差。具体而言,粗略阶段学习一个条件后验,针对端点速度将高斯噪声转化为结构化的初始化,而细化阶段则从该初始化中执行固定时间的细化。为了稳定训练,我们引入了一种逐步策略,首先学习一个受控的粗略预测器,然后进行联合优化。在CALVIN和LIBERO上的实验表明,我们的方法在低NFE(函数评估次数)条件下建立了强大的效率-性能边界:它始终优于现有的NFE=2方法,在多个指标上与NFE=10的$\pi_{0.5}$基线相匹配或超越,将动作采样延迟减少了75.4\%,并实现了83.0\%的最佳平均真实机器人成功率,超越MIP 19.5分和$\pi_{0.5}$ 4.0分。这些结果表明,结构化的粗到细生成能够实现强大的性能和高效的推理。我们的代码可在https://github.com/EmbodiedAI-RoboTron/CF-VLA获取。
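To make the two-evaluation structure described in the abstract concrete, here is a toy Python sketch of coarse initialization followed by a single refinement step (NFE=2). It is not the authors' implementation: `sample_action`, `toy_coarse`, `toy_fine`, the 7-dimensional action, and the "action equals observation" target are all hypothetical stand-ins for the learned networks and the real task.

```python
import numpy as np

def sample_action(obs, coarse_velocity, fine_velocity, rng, action_dim=7):
    """Two function evaluations (NFE=2): coarse initialization, then one refinement."""
    noise = rng.standard_normal(action_dim)        # uninformative Gaussian start
    x_init = noise + coarse_velocity(noise, obs)   # NFE 1: action-aware starting point
    action = x_init + fine_velocity(x_init, obs)   # NFE 2: single-step local refinement
    return noise, x_init, action

# Hypothetical stand-ins for the two learned networks (toy target: action == obs).
def toy_coarse(noise, obs):
    return 0.9 * (obs - noise)   # deliberately imperfect: leaves a 10% residual

def toy_fine(x, obs):
    return obs - x               # corrects the remaining residual

rng = np.random.default_rng(0)
obs = np.linspace(-1.0, 1.0, 7)
noise, x_init, action = sample_action(obs, toy_coarse, toy_fine, rng)
err_noise = np.linalg.norm(noise - obs)   # distance of raw noise from the target
err_init = np.linalg.norm(x_init - obs)   # distance after the coarse step
err_final = np.linalg.norm(action - obs)  # distance after the fine step
```

In this toy setting the coarse step removes most of the error and the fine step only has to correct a small residual, which is the intuition the abstract gives for why a single refinement step can suffice.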
cs.CV / 204 / 2604.24625
Meta-CoT: Enhancing Granularity and Generalization in Image Editing
Meta-CoT:增强图像编辑中的细粒度和泛化能力
Abstract
Unified multi-modal understanding/generative models have shown improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: what forms of CoT and training strategy can jointly enhance both the understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation with two key properties: (1) Decomposability. We observe that any editing intention can be represented as a triplet - (task, target, required understanding ability). Inspired by this, Meta-CoT decomposes both the editing task and the target, generating task-specific CoT and traversing editing operations on all targets. This decomposition enhances the model's understanding granularity of editing operations and guides it to learn each element of the triplet during training, substantially improving the editing capability. (2) Generalizability. In the second decomposition level, we further break down editing tasks into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks. To further align the model's editing behavior with its CoT reasoning, we introduce the CoT-Editing Consistency Reward, which encourages more accurate and effective utilization of CoT information during editing. Experiments demonstrate that our method achieves an overall 15.8% improvement across 21 editing tasks, and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks. Our code, benchmark, and model are released at https://shiyi-zh0408.github.io/projectpages/Meta-CoT/
Chinese Translation
统一的多模态理解/生成模型通过将细粒度理解融入其思维链(Chain-of-Thought, CoT)过程中,显示出改善的图像编辑性能。然而,一个关键问题仍未得到充分探讨:什么形式的 CoT 和训练策略可以共同增强理解的细粒度和泛化能力?为了解决这个问题,我们提出了 Meta-CoT,这是一种对任何单图像编辑操作进行两级分解的范式,具有两个关键特性:(1)可分解性。我们观察到,任何编辑意图都可以表示为一个三元组 - (任务,目标,所需理解能力)。受到此启发,Meta-CoT 对编辑任务和目标进行分解,生成任务特定的 CoT,并在所有目标上遍历编辑操作。这种分解增强了模型对编辑操作的理解细粒度,并指导其在训练过程中学习三元组的每个元素,从而显著提高编辑能力。(2)泛化能力。在第二级分解中,我们进一步将编辑任务细分为五个基本元任务。我们发现,针对这五个元任务进行训练,加上三元组的其他两个元素,足以在多样化的未见编辑任务中实现强泛化。为了进一步使模型的编辑行为与其 CoT 推理相一致,我们引入了 CoT-编辑一致性奖励,鼓励在编辑过程中更准确和有效地利用 CoT 信息。实验表明,我们的方法在 21 个编辑任务中整体提升了 15.8%,并在仅针对少量元任务进行训练时,能够有效泛化到未见的编辑任务。我们的代码、基准和模型已发布在 https://shiyi-zh0408.github.io/projectpages/Meta-CoT/
cs.CV / 205 / 2604.24642
Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics
探究 CLIP 对 360 度文本和视觉语义的理解
Abstract
The dream of instantly creating rich 360-degree panoramic worlds from text is rapidly becoming a reality, yet a crucial gap exists in our ability to reliably evaluate their semantic alignment. Contrastive Language-Image Pre-training (CLIP) models, standard AI evaluators, predominantly trained on perspective image-text pairs, face an open question regarding their understanding of the unique characteristics of 360-degree panoramic image-text pairs. This paper addresses this gap by first introducing two concepts: \emph{360-degree textual semantics}, semantic information conveyed by explicit format identifiers, and \emph{360-degree visual semantics}, invariant semantics under horizontal circular shifts. To probe CLIP's comprehension of these semantics, we then propose novel evaluation methodologies using keyword manipulation and horizontal circular shifts of varying magnitudes. Rigorous statistical analyses across popular CLIP configurations reveal that: (1) CLIP models effectively leverage explicit textual identifiers, demonstrating an understanding of 360-degree textual semantics; and (2) CLIP models fail to robustly preserve semantic alignment under horizontal circular shifts, indicating limited comprehension of 360-degree visual semantics. To address this limitation, we propose a LoRA-based fine-tuning framework that explicitly instills invariance to circular shifts. Our fine-tuned models exhibit improved comprehension of 360-degree visual semantics, though with a slight degradation in original semantic evaluation performance, highlighting a fundamental trade-off in adapting CLIP to 360-degree panoramic images. Code is available at https://github.com/littlewhitesea/360Semantics.
Chinese Translation
从文本即时创建丰富的 360 度全景世界的梦想正迅速成为现实,但我们在可靠评估其语义一致性方面仍存在一个重要的空白。对比语言-图像预训练(Contrastive Language-Image Pre-training, CLIP)模型,作为标准的人工智能评估工具,主要在透视图像-文本对上进行训练,因此在理解 360 度全景图像-文本对的独特特性方面面临开放性问题。本文通过首先引入两个概念来填补这一空白:\textit{360 度文本语义},由显式格式标识符传达的语义信息,以及 \textit{360 度视觉语义},在水平圆形移动下不变的语义。为了探究 CLIP 对这些语义的理解,我们提出了使用关键词操作和不同幅度的水平圆形移动的新评估方法。对流行的 CLIP 配置进行严格的统计分析表明:(1)CLIP 模型有效利用显式文本标识符,展示了对 360 度文本语义的理解;(2)CLIP 模型在水平圆形移动下未能稳健地保持语义一致性,表明其对 360 度视觉语义的理解有限。为了解决这一局限性,我们提出了一种基于 LoRA 的微调框架,明确地赋予对圆形移动的不变性。我们的微调模型在理解 360 度视觉语义方面表现出改善,尽管在原始语义评估性能上略有下降,突显了将 CLIP 适应于 360 度全景图像的基本权衡。代码可在 https://github.com/littlewhitesea/360Semantics 获取。
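The horizontal circular shift used as a probe in this abstract can be sketched in a few lines of Python. This is a minimal illustration, assuming the panorama is an equirectangular array of shape (H, W, C); the function name `circular_shift` and the fraction-of-width parameter are assumptions, and the CLIP scoring step is not shown.

```python
import numpy as np

def circular_shift(panorama, fraction):
    """Roll an equirectangular image (H, W, C) horizontally by a fraction of its width."""
    width = panorama.shape[1]
    shift = int(round(fraction * width)) % width
    # Columns pushed past the right edge wrap around to the left edge.
    return np.roll(panorama, shift, axis=1)

pano = np.arange(2 * 8 * 3).reshape(2, 8, 3)   # tiny stand-in panorama, 8 columns wide
shifted = circular_shift(pano, 0.25)           # shift by 2 of 8 columns
full_turn = circular_shift(pano, 1.0)          # a full rotation reproduces the input
```

Because the left and right edges of a 360-degree panorama are adjacent in the scene, every shifted copy depicts the same content, so an evaluator that understands the format should score all of them alike.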
cs.CV / 206 / 2604.24679
Benchmarking Pathology Foundation Models for Breast Cancer Survival Prediction
乳腺癌生存预测的病理基础模型基准评估
Abstract
Pathology foundation models (PFMs) have recently emerged as powerful pretrained encoders for computational pathology, enabling transfer learning across a wide range of downstream tasks. However, systematic comparisons of these models for clinically meaningful prediction problems remain limited, especially in the context of survival prediction under external validation. In this study, we benchmark widely used and recently proposed PFMs for breast cancer survival prediction from whole-slide histopathology images. Using a standardized pipeline based on patch-level feature extraction and a unified survival modeling framework, we evaluate model representations across three independent clinical cohorts comprising more than 5,400 patients with long-term follow-up. Models are trained on one cohort and evaluated on two independent external cohorts, enabling a rigorous assessment of cross-dataset generalization. Overall, H-optimus-1 achieves the strongest survival prediction performance. More broadly, we observe consistent generational improvements across model families, with second-generation PFMs outperforming their first-generation counterparts. However, absolute performance differences between many recent PFMs remain modest, suggesting diminishing returns from further scaling of pretraining data or model size alone. Notably, the compact distilled model H0-mini slightly outperforms its larger teacher model H-optimus-0, despite using fewer than 8% of the parameters and enabling significantly faster feature extraction. Together, these results provide the first large-scale, externally validated benchmark of PFMs for breast cancer survival prediction, and offer practical guidance for efficient deployment of PFMs in clinical workflows.
Chinese Translation
病理基础模型(PFMs)最近作为强大的预训练编码器在计算病理学中崭露头角,使得在广泛的下游任务中实现迁移学习成为可能。然而,针对临床意义的预测问题对这些模型的系统比较仍然有限,尤其是在外部验证的生存预测背景下。在本研究中,我们对用于乳腺癌生存预测的广泛使用和最近提出的PFMs进行了基准评估,数据来源于全切片组织病理图像。我们使用基于补丁级特征提取的标准化流程和统一的生存建模框架,评估了来自三个独立临床队列的模型表示,这些队列包含超过5400名长期随访的患者。模型在一个队列上训练,并在两个独立的外部队列上进行评估,从而实现对跨数据集泛化能力的严格评估。总体而言,H-optimus-1实现了最强的生存预测性能。更广泛地,我们观察到模型家族之间的一致性代际改进,第二代PFMs的表现优于第一代模型。然而,许多最近PFMs之间的绝对性能差异仍然适中,这表明进一步扩展预训练数据或模型规模的收益递减。值得注意的是,紧凑的蒸馏模型H0-mini尽管使用的参数不到8%,但在特征提取速度上显著更快,且略微优于其较大的教师模型H-optimus-0。这些结果共同提供了首个大规模、外部验证的PFMs在乳腺癌生存预测中的基准评估,并为PFMs在临床工作流程中的高效部署提供了实用指导。
cs.CV / 207 / 2604.24685
Aycromo: An Open-Source Platform for Automatic Chromosome Detection in Metaphase Images Based on Deep Learning
Aycromo:基于深度学习的自动染色体检测开源平台
Abstract
Chromosome analysis is a fundamental step in the diagnosis of genetic diseases, but the manual karyotyping workflow is time-consuming and heavily dependent on expert specialists, often requiring several days per patient. Although Deep Learning models have achieved high performance in chromosome detection, most proposed solutions remain restricted to research prototypes or lack graphical interfaces suitable for clinical use. In this work, we present Aycromo, an open-source desktop platform for AI-assisted cytogenetic analysis. Built on Electron and ONNX Runtime, the tool allows cytogeneticists to load pre-trained models, compare architectures through an integrated benchmarking module, and manually correct detections via an interactive annotation interface, all without command-line interaction. Preliminary experiments on metaphase images from the CRCN-NE dataset demonstrate that YOLOv11 achieves 99.40% mAP@50, while the platform reduces per-slide analysis to seconds.
Chinese Translation
染色体分析是遗传疾病诊断的基础步骤,但手动核型分析工作流程耗时且高度依赖专家,通常每位患者需要几天时间。尽管深度学习模型在染色体检测方面已取得高性能,但大多数提出的解决方案仍然局限于研究原型,或缺乏适合临床使用的图形界面。在本研究中,我们提出了Aycromo,一个用于AI辅助细胞遗传学分析的开源桌面平台。该工具基于Electron和ONNX Runtime构建,允许细胞遗传学家加载预训练模型,通过集成基准模块比较架构,并通过交互式标注界面手动修正检测结果,所有操作无需命令行交互。对CRCN-NE数据集中中期图像的初步实验表明,YOLOv11在mAP@50上达到了99.40%,而该平台将每张幻灯片的分析时间缩短至几秒钟。
cs.CV / 208 / 2604.24696
NeuroClaw Technical Report
NeuroClaw 技术报告
Abstract
Agentic artificial intelligence systems promise to accelerate scientific workflows, but neuroimaging poses unique challenges: heterogeneous modalities (sMRI, fMRI, dMRI, EEG), long multi-stage pipelines, and persistent reproducibility risks. To address this gap, we present NeuroClaw, a domain-specialized multi-agent research assistant for executable and reproducible neuroimaging research. NeuroClaw operates directly on raw neuroimaging data across formats and modalities, grounding decisions in dataset semantics and BIDS metadata so users need not prepare curated inputs or bespoke model code. The platform combines harness engineering with end-to-end environment management, including pinned Python environments, Docker support, automated installers for common neuroimaging tools, and GPU configuration. In practice, this layer emphasizes checkpointing, post-execution verification, structured audit traces, and controlled runtime setup, making toolchains more transparent while improving reproducibility and auditability. A three-tier skill/agent hierarchy separates user-facing interaction, high-level orchestration, and low-level tool skills to decompose complex workflows into safe, reusable units. Alongside the NeuroClaw framework, we introduce NeuroBench, a system-level benchmark for executability, artifact validity, and reproducibility readiness. Across multiple multimodal LLMs, NeuroClaw-enabled runs yield consistent and substantial score improvements compared with direct agent invocation. Project homepage: https://cuhk-aim-group.github.io/NeuroClaw/index.html
Chinese Translation
自主智能系统有望加速科学工作流程,但神经影像学面临独特的挑战:异构模态(sMRI、fMRI、dMRI、EEG)、长时间的多阶段流程以及持续的可重复性风险。为了解决这一问题,我们提出了 NeuroClaw,一个专门针对可执行和可重复的神经影像学研究的多智能体研究助手。NeuroClaw 直接在不同格式和模态的原始神经影像数据上运行,基于数据集语义和 BIDS 元数据做出决策,因此用户无需准备经过整理的输入或定制的模型代码。该平台结合了工具工程与端到端环境管理,包括固定的 Python 环境、Docker 支持、常见神经影像工具的自动安装程序以及 GPU 配置。在实践中,这一层强调检查点、执行后验证、结构化审计痕迹和受控运行时设置,使工具链更加透明,同时提高可重复性和可审计性。三层技能/智能体层次结构将用户交互、高级编排和低级工具技能分开,从而将复杂的工作流程分解为安全、可重用的单元。与 NeuroClaw 框架一起,我们引入了 NeuroBench,一个针对可执行性、工件有效性和可重复性准备的系统级基准。在多个多模态 LLM 中,启用 NeuroClaw 的运行与直接调用智能体相比,产生了一致且显著的分数提升。项目主页:https://cuhk-aim-group.github.io/NeuroClaw/index.html
cs.CV / 209 / 2604.24718
WildLIFT: Lifting monocular drone video to 3D for species-agnostic wildlife monitoring
WildLIFT:将单目无人机视频提升至3D以实现物种无关的野生动物监测
Abstract
Monocular RGB cameras mounted on drones are widely used for wildlife monitoring, yet most analytical pipelines remain confined to two-dimensional image space, leaving geometric information in video underexploited. We present WildLIFT, a computational framework that integrates three-dimensional scene geometry from monocular drone video with open-vocabulary 2D instance segmentation to enable species-agnostic 3D detection and tracking. Oriented 3D bounding box labels with semantic face information enable quantitative assessment of viewpoint coverage and inter-animal occlusion, producing structured metadata for downstream ecological analyses. We validate the framework on 2,581 manually curated frames comprising over 6,700 3D detections across four large mammal species. WildLIFT maintains high identity consistency in multi-animal scenes and substantially reduces manual 3D annotation effort through keyframe-based refinement. By transforming standard drone footage into structured 3D and viewpoint-aware representations, WildLIFT extends the analytical utility of aerial wildlife datasets for behavioural research and population monitoring.
Chinese Translation
单目RGB相机安装在无人机上广泛用于野生动物监测,但大多数分析流程仍然局限于二维图像空间,导致视频中的几何信息未得到充分利用。我们提出了WildLIFT,一个计算框架,将单目无人机视频中的三维场景几何与开放词汇的二维实例分割相结合,以实现物种无关的三维检测和跟踪。带有语义面信息的定向三维边界框标签使得视角覆盖和动物间遮挡的定量评估成为可能,从而为后续生态分析生成结构化元数据。我们在2,581帧手动整理的图像上验证了该框架,这些图像包含四种大型哺乳动物的超过6,700个三维检测。WildLIFT在多动物场景中保持高身份一致性,并通过基于关键帧的细化显著减少了手动三维标注的工作量。通过将标准无人机视频转换为结构化的三维和视角感知表示,WildLIFT扩展了空中野生动物数据集在行为研究和种群监测中的分析效用。
cs.CV / 210 / 2604.24719
DiffuSAM: Diffusion-Based Prompt-Free SAM2 for Few-Shot and Source-Free Medical Image Segmentation
DiffuSAM:基于扩散的无提示SAM2用于少样本和无源医学图像分割
Abstract
Segmentation models such as Segment Anything Model (SAM) and SAM2 achieve strong prompt-driven zero-shot performance. However, their training on natural images limits domain transfer to medical data. Consequently, accurate segmentation typically requires extensive fine-tuning and expert-designed prompts. We propose DiffuSAM, a diffusion-based adaptation of SAM2 for prompt-free medical image segmentation. Our framework synthesizes SAM2-compatible segmentation mask-like embeddings via a lightweight diffusion-prior from off-the-shelf frozen SAM2 image features. The generated embeddings are integrated into SAM2's mask decoder to produce accurate segmentations, thereby eliminating the need for user prompts. The diffusion prior is further conditioned on previously segmented slices, enforcing spatial consistency across volumes. Evaluated on the BTCV and CHAOS datasets for CT and MRI under Source-Free Unsupervised Domain Adaptation (SF-UDA) and Few-Shot settings, DiffuSAM achieves competitive performance with efficient training and inference. Code is available upon request from the corresponding author.
Chinese Translation
分割模型如Segment Anything Model (SAM)和SAM2在提示驱动的零-shot性能上表现出色。然而,它们在自然图像上的训练限制了对医学数据的领域迁移。因此,准确的分割通常需要大量的微调和专家设计的提示。我们提出了DiffuSAM,这是一种基于扩散的SAM2适应方案,用于无提示的医学图像分割。我们的框架通过轻量级的扩散先验,从现成的冻结SAM2图像特征中合成与SAM2兼容的分割掩码样嵌入。生成的嵌入被整合到SAM2的掩码解码器中,以产生准确的分割,从而消除了用户提示的需求。扩散先验进一步依赖于先前分割的切片,强制在体积之间保持空间一致性。在无源无监督领域适应(Source-Free Unsupervised Domain Adaptation, SF-UDA)和少样本设置下,我们在BTCV和CHAOS数据集上进行评估,DiffuSAM实现了具有竞争力的性能,并且训练和推理效率高。代码可根据请求从通讯作者处获取。
cs.CV / 211 / 2604.24762
OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer
OmniShotCut:基于全局关系的镜头边界检测与镜头查询变换器
Abstract
Shot Boundary Detection (SBD) aims to automatically identify shot changes and divide a video into coherent shots. While SBD was widely studied in the literature, existing state-of-the-art methods often produce non-interpretable boundaries on transitions, miss subtle yet harmful discontinuities, and rely on noisy, low-diversity annotations and outdated benchmarks. To alleviate these limitations, we propose OmniShotCut to formulate SBD as structured relational prediction, jointly estimating shot ranges with intra-shot relations and inter-shot relations, by a shot query-based dense video Transformer. To avoid imprecise manual labeling, we adopt a fully synthetic transition synthesis pipeline that automatically reproduces major transition families with precise boundaries and parameterized variants. We also introduce OmniShotCutBench, a modern wide-domain benchmark enabling holistic and diagnostic evaluation.
Chinese Translation
镜头边界检测(SBD)旨在自动识别镜头变化并将视频划分为连贯的镜头。尽管SBD在文献中得到了广泛研究,现有的最先进方法往往在过渡时产生不可解释的边界,错过微妙但有害的不连续性,并依赖于嘈杂、低多样性的标注和过时的基准。为了解决这些局限性,我们提出了OmniShotCut,将SBD表述为结构化关系预测,通过基于镜头查询的密集视频变换器共同估计镜头范围及镜头内关系和镜头间关系。为了避免不精确的手动标注,我们采用了一个完全合成的过渡合成管道,自动重现主要过渡类型,具有精确的边界和参数化变体。我们还引入了OmniShotCutBench,一个现代广域基准,能够实现全面和诊断性的评估。
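One way to picture the synthetic transition pipeline mentioned in this abstract is a generator that joins two clips and records the exact boundary frames as it does so. The Python sketch below shows a linear cross-dissolve only; `synth_dissolve`, the label dictionary layout, and the constant-valued clips are hypothetical and not taken from the paper.

```python
import numpy as np

def synth_dissolve(clip_a, clip_b, length):
    """Join two clips (T, H, W, C) with a linear cross-dissolve; return video and labels."""
    # Blend weights strictly between 0 and 1, one per transition frame.
    alphas = np.linspace(0.0, 1.0, length + 2)[1:-1].reshape(-1, 1, 1, 1)
    blend = (1.0 - alphas) * clip_a[-length:] + alphas * clip_b[:length]
    video = np.concatenate([clip_a[:-length], blend, clip_b[length:]], axis=0)
    start = len(clip_a) - length   # index of the first blended frame
    return video, {"type": "dissolve", "start": start, "end": start + length - 1}

clip_a = np.zeros((10, 4, 4, 3))   # toy "shot A": all-black frames
clip_b = np.ones((12, 4, 4, 3))    # toy "shot B": all-white frames
video, label = synth_dissolve(clip_a, clip_b, length=4)
```

Since the transition is generated rather than annotated by hand, its start and end frames are known exactly, which is the property the abstract relies on to avoid imprecise manual labels.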
cs.CV / 212 / 2604.24763
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Tuna-2:像素嵌入优于视觉编码器在多模态理解与生成中的表现
Abstract
Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly based on pixel embeddings. Tuna-2 drastically simplifies the model architecture by employing simple patch embedding layers to encode visual input, completely discarding the modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance in multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster in early pretraining, Tuna-2's encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual perception. These results show that pretrained vision encoders are not necessary for multimodal modelling, and end-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception.
Chinese Translation
统一的多模态模型通常依赖于预训练的视觉编码器,并使用单独的视觉表示进行理解和生成,这导致两者之间的错位,并阻碍了从原始像素进行完全端到端优化。我们介绍了Tuna-2,一种原生的统一多模态模型,直接基于像素嵌入进行视觉理解和生成。Tuna-2通过采用简单的补丁嵌入层来编码视觉输入,极大简化了模型架构,完全抛弃了诸如变分自编码器(VAE)或表示编码器等模块化视觉编码器设计。实验表明,Tuna-2在多模态基准测试中达到了最先进的性能,证明了统一的像素空间建模可以与潜在空间方法在高质量图像生成方面完全竞争。此外,尽管基于编码器的变体在早期预训练中收敛更快,但Tuna-2的无编码器设计在大规模下实现了更强的多模态理解,特别是在需要细粒度视觉感知的任务上。这些结果表明,预训练的视觉编码器并不是多模态建模的必要条件,而端到端的像素空间学习为生成和感知提供了更强视觉表示的可扩展路径。
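The "simple patch embedding layers" that this abstract contrasts with pretrained vision encoders can be illustrated as a reshape followed by one linear projection. The NumPy sketch below is a generic patch embedding, not Tuna-2's actual layer; the patch size of 8, the embedding width of 64, and the random weights are assumptions chosen for illustration.

```python
import numpy as np

def patch_embed(image, patch, weight):
    """Split an (H, W, C) image into non-overlapping patches and project each linearly."""
    h, w, c = image.shape
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    # Bring the two patch-grid axes together, then flatten each patch to one row.
    tokens = grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    return tokens @ weight   # (num_patches, embed_dim)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
weight = rng.standard_normal((8 * 8 * 3, 64)) * 0.02   # hypothetical embedding width 64
embeddings = patch_embed(image, 8, weight)
```

Each row of `embeddings` depends only on the raw pixels of one patch, so nothing in this input path has to be pretrained separately from the rest of the model.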
cs.CV / 213 / 2604.24764
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
World-R1:强化文本到视频生成的3D约束
Abstract
Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
Chinese Translation
近期的视频基础模型展示了令人印象深刻的视觉合成能力,但常常面临几何不一致的问题。虽然现有方法试图通过架构修改来注入3D先验,但通常会导致高计算成本并限制可扩展性。我们提出了World-R1,一个通过强化学习将视频生成与3D约束对齐的框架。为了促进这种对齐,我们引入了一个专门针对世界模拟的纯文本数据集。利用Flow-GRPO,我们使用来自预训练的3D基础模型和视觉-语言模型的反馈来优化模型,以在不改变基础架构的情况下强制执行结构一致性。我们进一步采用周期性解耦训练策略,以平衡刚性几何一致性与动态场景流动性。广泛的评估表明,我们的方法显著增强了3D一致性,同时保持了基础模型的原始视觉质量,有效弥合了视频生成与可扩展世界模拟之间的差距。
cs.AI / 1 / 2604.22777
An Intelligent Fault Diagnosis Method for General Aviation Aircraft Based on Multi-Fidelity Digital Twin and FMEA Knowledge Enhancement
基于多保真数字双胞胎和FMEA知识增强的通用航空飞机智能故障诊断方法
Abstract
Fault diagnosis of general aviation aircraft faces challenges including scarce real fault data, diverse fault types, and weak fault signatures. This paper proposes an intelligent fault diagnosis framework based on multi-fidelity digital twin, integrating four modules: high-fidelity flight dynamics simulation, FMEA-driven fault injection, multi-fidelity residual feature extraction, and large language model (LLM)-enhanced interpretable report generation. A digital twin is constructed using the JSBSim six-degree-of-freedom (6-DoF) flight dynamics engine, generating 23-channel engine health monitoring data via semi-empirical sensor synthesis equations. A three-layer fault injection engine based on failure mode and effects analysis (FMEA) models the physical causal propagation of 19 engine fault types. A multi-fidelity residual computation framework comprising paired-mirror residuals and GRU surrogate prediction residuals is proposed: the high-fidelity path obtains clean fault deviation signals using nominal mirror trajectories with identical initial conditions, while the low-fidelity path achieves online real-time residual computation through a multi-step prediction GRU surrogate model. A 1D-CNN classifier performs end-to-end diagnosis of 20 fault classes. An LLM diagnostic report engine enhanced with FMEA knowledge fuses classification results, residual evidence, and domain causal knowledge to generate interpretable natural language reports. Experiments show the paired-mirror residual scheme achieves a Macro-F1 of 96.2% on the 20-class task, while the GRU surrogate scheme achieves 4.3x inference acceleration at only 0.6% performance cost. Comparison across 24 schemes reveals that residual feature quality contributes approximately 5x more to diagnostic performance than classifier architecture, establishing the "residual quality first" design principle.
Chinese Translation
通用航空飞机的故障诊断面临着真实故障数据稀缺、故障类型多样和故障特征弱等挑战。本文提出了一种基于多保真数字双胞胎的智能故障诊断框架,集成了四个模块:高保真飞行动力学仿真、基于故障模式及影响分析(FMEA)的故障注入、多保真残差特征提取和增强型大语言模型(LLM)可解释报告生成。利用JSBSim六自由度(6-DoF)飞行动力学引擎构建数字双胞胎,通过半经验传感器合成方程生成23通道的发动机健康监测数据。基于故障模式及影响分析(FMEA)的三层故障注入引擎模拟19种发动机故障类型的物理因果传播。提出了一种包含配对镜像残差和GRU替代预测残差的多保真残差计算框架:高保真路径使用具有相同初始条件的名义镜像轨迹获取干净的故障偏差信号,而低保真路径通过多步预测GRU替代模型实现在线实时残差计算。1D-CNN分类器对20个故障类别进行端到端诊断。增强了FMEA知识的LLM诊断报告引擎融合分类结果、残差证据和领域因果知识,以生成可解释的自然语言报告。实验表明,配对镜像残差方案在20类任务上实现了96.2%的宏观F1值,而GRU替代方案在仅0.6%的性能成本下实现了4.3倍的推理加速。对24种方案的比较显示,残差特征质量对诊断性能的贡献约为分类器架构的5倍,从而确立了“残差质量优先”的设计原则。
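The paired-mirror residual idea in this abstract (run a nominal "mirror" simulation with identical initial conditions and subtract it from the faulty run) can be sketched on a toy system. The first-order plant, the gain fault, and the names `simulate`, `observed`, and `mirror` below are hypothetical; the paper itself uses JSBSim with 23 engine channels.

```python
import numpy as np

def simulate(x0, inputs, gain=1.0, fault_step=None, fault_gain=0.7):
    """Toy first-order plant; a fault scales the input gain from fault_step onward."""
    x, out = x0, []
    for t, u in enumerate(inputs):
        g = fault_gain if fault_step is not None and t >= fault_step else gain
        x = 0.9 * x + g * u
        out.append(x)
    return np.array(out)

inputs = np.ones(50)
observed = simulate(0.0, inputs, fault_step=20)   # run with an injected fault
mirror = simulate(0.0, inputs)                    # nominal mirror: same x0, same inputs
residual = observed - mirror                      # clean fault deviation signal
```

Because both runs share initial conditions and inputs, the residual is exactly zero until the fault starts and isolates the fault's effect afterwards, which is why the abstract describes it as a clean deviation signal.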
cs.AI / 2 / 2604.22934
PExA: Parallel Exploration Agent for Complex Text-to-SQL
PExA:复杂文本到SQL的并行探索代理
Abstract
LLM-based agents for text-to-SQL often struggle with a latency-performance trade-off, where performance improvements come at the cost of latency or vice versa. We reformulate text-to-SQL generation through the lens of software test coverage: the original query is prepared with a suite of test cases made of simpler, atomic SQL queries that are executed in parallel and together ensure semantic coverage of the original query. After iterating on test case coverage, the final SQL is generated only when enough information is gathered, leveraging the explored test case SQLs to ground the final generation. We validated our framework on a state-of-the-art benchmark for text-to-SQL, Spider 2.0, achieving a new state-of-the-art with 70.2% execution accuracy.
Chinese Translation
基于大型语言模型(LLM)的文本到SQL代理在延迟与性能的权衡中常常面临挑战,性能的提升往往以延迟为代价,反之亦然。我们从软件测试覆盖的角度重新构建文本到SQL生成过程,其中原始查询通过一组测试用例进行准备,这些测试用例包含更简单的原子SQL,并且可以并行执行,从而共同确保原始查询的语义覆盖。在迭代测试用例覆盖后,只有在收集到足够的信息时,最终SQL才会生成,利用探索过的测试用例SQL为最终生成提供基础。我们在文本到SQL的最新基准Spider 2.0上验证了我们的框架,达到了70.2%的执行准确率,创造了新的最先进水平。
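The parallel execution of simple, atomic probe queries described in this abstract can be sketched with the Python standard library. This is only an illustration of the execution pattern: the toy `orders` table, the three probes, and the helper `run_probe` are hypothetical, and the LLM steps that propose probes and write the final SQL are not shown.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

def run_probe(db_path, sql):
    """Execute one atomic probe query on its own connection and return its rows."""
    conn = sqlite3.connect(db_path)
    try:
        return sql, conn.execute(sql).fetchall()
    finally:
        conn.close()

# Toy database standing in for the real target schema.
db_path = os.path.join(tempfile.mkdtemp(), "toy.db")
conn = sqlite3.connect(db_path)
conn.executescript(
    "CREATE TABLE orders(id INTEGER, region TEXT, amount REAL);"
    "INSERT INTO orders VALUES (1,'EU',10.0),(2,'EU',30.0),(3,'US',5.0);"
)
conn.commit()
conn.close()

probes = [
    "SELECT DISTINCT region FROM orders",
    "SELECT COUNT(*) FROM orders",
    "SELECT MAX(amount) FROM orders WHERE region = 'EU'",
]
# Each probe runs in its own thread with its own connection.
with ThreadPoolExecutor(max_workers=3) as pool:
    evidence = dict(pool.map(lambda q: run_probe(db_path, q), probes))
```

The resulting `evidence` dictionary maps each probe to its rows; in a pipeline of the kind the abstract describes, such results would ground the generation of the final query.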
cs.AI / 3 / 2604.22951
The Power of Power Law: Asymmetry Enables Compositional Reasoning
幂律的力量:不对称性促进组合推理
Abstract
Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help models better learn these long-tail skills, we find a counterintuitive result: across a wide range of compositional reasoning tasks, such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions. To understand this advantage, we introduce a minimalist skill-composition task and show that learning under a power-law distribution provably requires significantly less training data. Our theoretical analysis reveals that power law sampling induces a beneficial asymmetry that improves the pathological loss landscape, which enables models to first acquire high-frequency skill compositions with low data complexity, which in turn serves as a stepping stone to efficiently learn rare long-tailed skills. Our results offer an alternative perspective on what constitutes an effective data distribution for training models.
Chinese Translation
自然语言数据遵循幂律分布,大多数知识和技能出现的频率非常低。虽然常见的直觉认为,将数据重新加权或整理为均匀分布可能有助于模型更好地学习这些长尾技能,但我们发现了一个反直觉的结果:在广泛的组合推理任务中,例如状态跟踪和多步算术,基于幂律分布的训练始终优于基于均匀分布的训练。为了理解这一优势,我们引入了一个极简的技能组合任务,并展示了在幂律分布下学习显著需要更少的训练数据。我们的理论分析揭示,幂律采样引发了一种有益的不对称性,改善了病态损失景观,使得模型首先能够以低数据复杂度获取高频技能组合,这反过来又为有效学习稀有的长尾技能铺平了道路。我们的结果为有效的数据分布构成提供了另一种视角,以用于训练模型。
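The contrast between power-law and uniform training distributions in this abstract can be made concrete with a small sampler. The sketch below assumes skills are ranked 1 to N and drawn with probability proportional to rank^(-alpha); the exponent of 1.0, the 100 skills, and the function name `skill_distribution` are illustrative choices, not the paper's settings.

```python
import numpy as np

def skill_distribution(num_skills, alpha):
    """Probability of each skill rank under p(k) proportional to k^(-alpha); alpha=0 is uniform."""
    ranks = np.arange(1, num_skills + 1, dtype=float)
    weights = ranks ** (-alpha)
    return weights / weights.sum()

rng = np.random.default_rng(0)
uniform = skill_distribution(100, alpha=0.0)
power_law = skill_distribution(100, alpha=1.0)
batch = rng.choice(100, size=10_000, p=power_law)   # skill indices for one training set
head_share = power_law[:10].sum()                   # mass on the 10 most frequent skills
```

With alpha set to 1.0, more than half of the samples come from the ten most frequent skills, so a model sees those compositions many times; the abstract argues that this asymmetry helps the model learn frequent compositions first and use them as a stepping stone to the rare ones.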
cs.AI / 4 / 2604.22958
On the Existence of an Inverse Solution for Preference-Based Reductions in Argumentation
关于基于偏好的论证中逆解的存在性
Abstract
Preference-based argumentation frameworks (PAFs) extend Dung's approach to abstract argumentation (AAFs) by encoding preferences over arguments. Such preferences control the transformation of attacks into defeats, and different approaches to doing so result in different reductions from a PAF to an AAF. In this paper we consider a PAF inverse problem which takes an argumentation graph, a labelling and a semantics as an input, and outputs a ``yes" or ``no" as to whether there is a preference relation between the arguments which can yield the desired labelling. This inverse problem has applications in areas including preference elicitation and explainability. We consider this problem in the context of the four most widely-used preference based reductions under the complete semantics. We show that in most cases, the problem can be answered in polynomial time.
Chinese Translation
基于偏好的论证框架(PAFs)通过对论证的偏好编码,扩展了Dung的抽象论证(AAFs)方法。这些偏好控制了攻击转化为失败的过程,不同的处理方式导致从PAF到AAF的不同简化。在本文中,我们考虑一个PAF逆问题,该问题以一个论证图、一个标记和一个语义作为输入,并输出“是”或“否”,以判断是否存在一个论证之间的偏好关系能够产生所需的标记。该逆问题在偏好引导和可解释性等领域具有应用。我们在完整语义下考虑四种最常用的基于偏好的简化方法的背景下研究此问题。我们展示了在大多数情况下,该问题可以在多项式时间内得到解答。
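How preferences turn attacks into defeats can be shown with the simplest commonly used reduction, in which an attack succeeds unless its target is strictly preferred to the attacker. The abstract does not spell out its four reductions, so the Python sketch below covers only this one; the set-of-pairs encoding and the name `reduce_to_defeats` are assumptions.

```python
def reduce_to_defeats(attacks, prefers):
    """Keep attack (a, b) as a defeat unless the target b is strictly preferred to a."""
    return {(a, b) for (a, b) in attacks if (b, a) not in prefers}

attacks = {("a", "b"), ("b", "c"), ("c", "a")}
prefers = {("b", "a")}   # (x, y) means x is strictly preferred to y
defeats = reduce_to_defeats(attacks, prefers)
```

Here the attack from `a` on `b` is discarded because `b` is preferred to `a`, leaving two defeats. The inverse problem studied in the paper asks the opposite question: given the attack graph and a desired labelling, does some `prefers` relation exist that produces it.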
cs.AI / 5 / 2604.22979
Towards Causally Interpretable Wi-Fi CSI-Based Human Activity Recognition with Discrete Latent Compression and LTL Rule Extraction
基于Wi-Fi信道状态信息的可因果解释的人类活动识别:离散潜变量压缩与线性时序逻辑规则提取
Abstract
We address Human Activity Recognition (HAR) utilizing Wi-Fi Channel State Information (CSI) under the joint requirements of causal interpretability, symbolic controllability, and direct operation on high-dimensional raw signals. Deep neural models achieve strong predictive performance on CSI-based HAR (CHAR), yet rely on continuous latent representations that are opaque and difficult to modify; purely symbolic approaches, in contrast, cannot process raw CSI streams. We propose a fully automatic and strictly decoupled pipeline in which CSI magnitude windows are compressed by a categorical variational autoencoder with Gumbel-Softmax latent variables under a capacity-controlled objective, yielding a compact discrete representation. The encoder is then frozen and used as a deterministic mapping to one-hot latent trajectories. Causal discovery is performed on these trajectories to estimate class-conditional temporal dependency graphs. Statistically supported lagged dependencies are translated into Linear Temporal Logic (LTL) rules, producing a fully symbolic and deterministic classifier based solely on rule evaluation and aggregation, without any learned discriminative head. Because rules are defined over discrete latent variables, antenna-specific rule sets can in principle be combined at the symbolic level, enabling structured multi-antenna fusion without retraining the encoder. Results from CHAR Latent Temporal Rule Extraction (CHARL-TRE) indicate competitive performance while preserving explicit temporal and causal structure, showing that deterministic symbolic classification grounded in unsupervised discrete latent representations constitutes a viable alternative to end-to-end black-box models for wireless HAR.
Chinese Translation
我们探讨了在因果可解释性、符号可控性和对高维原始信号的直接操作的共同要求下,利用Wi-Fi信道状态信息(CSI)进行人类活动识别(HAR)。深度神经模型在基于CSI的人类活动识别(CHAR)中表现出强大的预测性能,但依赖于不透明且难以修改的连续潜变量表示;而纯符号方法则无法处理原始CSI流。我们提出了一种完全自动化且严格解耦的流程,其中CSI幅度窗口通过具有Gumbel-Softmax潜变量的分类变分自编码器在容量控制目标下进行压缩,从而生成紧凑的离散表示。然后,编码器被冻结并用作一对一的确定性映射到独热潜变量轨迹。对这些轨迹进行因果发现,以估计类条件时间依赖图。经过统计支持的滞后依赖关系被转化为线性时序逻辑(LTL)规则,生成一个完全符号化且确定性的分类器,仅基于规则评估和聚合,而不需要任何学习的判别头。由于规则是基于离散潜变量定义的,因此原则上可以在符号层面上组合特定于天线的规则集,从而实现结构化的多天线融合,而无需重新训练编码器。CHAR潜在时间规则提取(CHARL-TRE)的结果表明,在保留明确的时间和因果结构的同时,表现出竞争力的性能,显示出基于无监督离散潜表示的确定性符号分类是无线人类活动识别中端到端黑箱模型的可行替代方案。
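A lagged dependency between discrete latent symbols, once written as a temporal rule, can be checked on a trajectory with plain array indexing. The sketch below evaluates rules of the form G(antecedent -> X^lag consequent) on a finite trace; this rule template, the three-symbol trajectory, and the name `rule_support` are assumptions, since the abstract does not give the exact LTL fragment used.

```python
import numpy as np

def rule_support(trajectory, antecedent, consequent, lag):
    """Finite-trace check of G(antecedent -> X^lag consequent); returns (holds, hits, trials)."""
    codes = np.asarray(trajectory)
    # Positions where the antecedent fires and a symbol `lag` steps later still exists.
    idx = np.flatnonzero(codes[:-lag] == antecedent)
    hits = int(np.sum(codes[idx + lag] == consequent))
    return hits == len(idx), hits, len(idx)

# One-hot latent trajectory as a frozen encoder might emit it (toy data).
onehot = np.eye(3, dtype=int)[[2, 0, 1, 2, 0, 1, 2, 0]]
codes = onehot.argmax(axis=1)   # back to symbol indices
rule_a = rule_support(codes, antecedent=2, consequent=0, lag=1)
rule_b = rule_support(codes, antecedent=1, consequent=0, lag=1)
```

Classification by rule evaluation then reduces to counting which class's rule set a trajectory satisfies; no learned discriminative head is involved, which matches the abstract's description of a fully symbolic classifier.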
cs.AI / 6 / 2604.23002
FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agentic Code Generation in Lean
FormalScience:可扩展的人机协作科学自动形式化与 Lean 中的代理代码生成
Abstract
Formalising informal mathematical reasoning into formally verifiable code is a significant challenge for large language models. In scientific fields such as physics, domain-specific machinery (\textit{e.g.} Dirac notation, vector calculus) imposes additional formalisation challenges that modern LLMs and agentic approaches have yet to tackle. To aid autoformalisation in scientific domains, we present FormalScience, a domain-agnostic human-in-the-loop agentic pipeline that enables a single domain expert (without deep formal language experience) to produce \textit{syntactically correct} and \textit{semantically aligned} formal proofs of informal reasoning for low economic cost. Applying FormalScience to physics, we construct FormalPhysics, a dataset of 200 university-level (LaTeX) physics problems and solutions (primarily quantum mechanics and electromagnetism), along with their Lean4 formal representations. Compared to existing formal math benchmarks, FormalPhysics achieves perfect formal validity and exhibits greater statement complexity. We evaluate open-source models and proprietary systems on a statement autoformalisation task on our dataset via zero-shot prompting, self-refinement with error feedback, and a novel multi-stage agentic approach, and explore autoformalisation limitations in modern LLM-based approaches. We provide the first systematic characterisation of semantic drift in physics autoformalisation in terms of concepts such as notational collapse and abstraction elevation which reveals what formal language verifies when full semantic preservation is unattainable. We release the codebase together with an interactive UI-based FormalScience system which facilitates autoformalisation and theorem proving in scientific domains beyond physics. https://github.com/jmeadows17/formal-science
Chinese Translation
将非正式的数学推理形式化为可正式验证的代码是大型语言模型面临的重大挑战。在物理等科学领域,特定领域的工具(例如,迪拉克符号、向量微积分)带来了额外的形式化挑战,而现代大型语言模型(LLMs)和代理方法尚未解决这些问题。为了帮助科学领域的自动形式化,我们提出了 FormalScience;这是一个与领域无关的人机协作代理管道,使得单个领域专家(无需深厚的形式语言经验)能够以低经济成本生成非正式推理的“语法正确”和“语义一致”的形式证明。将 FormalScience 应用于物理学,我们构建了 FormalPhysics,一个包含 200 道大学级(LaTeX)物理问题及其解决方案(主要涉及量子力学和电磁学)的数据集,以及它们的 Lean4 形式表示。与现有的形式数学基准相比,FormalPhysics 实现了完美的形式有效性,并展现出更高的陈述复杂性。我们通过零-shot 提示、自我修正与错误反馈以及一种新颖的多阶段代理方法,评估开源模型和专有系统在我们的数据集上的陈述自动形式化任务,并探讨现代基于 LLM 的方法在自动形式化中的局限性。我们首次系统性地描述了物理自动形式化中的语义漂移,涉及符号崩溃和抽象提升等概念,揭示了在无法实现完全语义保留时形式语言所验证的内容。我们发布了代码库以及一个基于交互式用户界面的 FormalScience 系统,促进了超越物理学的科学领域中的自动形式化和定理证明。
cs.AI / 7 / 2604.23027
A Systematic Approach for Large Language Models Debugging
大语言模型调试的系统化方法
Abstract
Large language models (LLMs) have become central to modern AI workflows, powering applications from open-ended text generation to complex agent-based reasoning. However, debugging these models remains a persistent challenge due to their opaque and probabilistic nature and the difficulty of diagnosing errors across diverse tasks and settings. This paper introduces a systematic approach for LLM debugging that treats models as observable systems, providing structured, model-agnostic methods from issue detection to model refinement. By unifying evaluation, interpretability, and error-analysis practices, our approach enables practitioners to iteratively diagnose model weaknesses, refine prompts and model parameters, and adapt data for fine-tuning or assessment, while remaining effective in contexts where standardized benchmarks and evaluation criteria are lacking. We argue that such a structured methodology not only accelerates troubleshooting but also fosters reproducibility, transparency, and scalability in the deployment of LLM-based systems.
Chinese Translation
大语言模型(LLMs)已成为现代人工智能工作流程的核心,驱动着从开放式文本生成到复杂代理推理的各种应用。然而,由于其不透明和概率性的特征,以及在多样化任务和环境中诊断错误的困难,调试这些模型仍然是一个持续的挑战。本文提出了一种系统化的LLM调试方法,将模型视为可观察的系统,提供从问题检测到模型优化的结构化、模型无关的方法。通过统一评估、可解释性和错误分析的实践,我们的方法使从业者能够迭代地诊断模型的弱点,优化提示和模型参数,并调整数据以进行微调或评估,同时在缺乏标准化基准和评估标准的情况下仍然有效。我们认为,这种结构化的方法不仅加速了故障排除,还促进了LLM基础系统部署中的可重复性、透明性和可扩展性。
cs.AI / 8 / 2604.23049
A Decoupled Human-in-the-Loop System for Controlled Autonomy in Agentic Workflows
一种解耦的人机协作系统用于代理工作流中的受控自主性
Abstract
AI agents are increasingly deployed to execute tasks and make decisions within agentic workflows, introducing new requirements for safe and controlled autonomy. Prior work has established the importance of human oversight for ensuring transparency, accountability, and trustworthiness in such systems. However, existing implementations of Human-in-the-Loop (HITL) mechanisms are typically embedded within application logic, limiting reuse, consistency, and scalability across multi-agent environments. This paper presents a decoupled HITL system architecture that treats human oversight as an independent system component within the agent operating environment. The proposed design separates human interaction management from application workflows through explicit interfaces and a structured execution model. In addition, a design framework is introduced to formalize HITL integration along four dimensions: intervention conditions, role resolution, interaction semantics, and communication channel. This framework enables selective and context-aware human involvement while maintaining system-level consistency. The approach supports alignment with emerging agent communication protocols, allowing HITL to be implemented as a protocol-level concern. By externalizing HITL and structuring its integration, the system provides a foundation for scalable governance and progressive autonomy in agentic workflows.
Chinese Translation
人工智能代理越来越多地被部署在代理工作流中执行任务和做出决策,这引入了对安全和受控自主性的新要求。先前的研究已确立了人类监督在确保此类系统的透明性、问责性和可信性方面的重要性。然而,现有的人机协作(Human-in-the-Loop, HITL)机制的实现通常嵌入在应用逻辑中,限制了其在多代理环境中的重用、一致性和可扩展性。本文提出了一种解耦的HITL系统架构,将人类监督视为代理操作环境中的独立系统组件。所提出的设计通过明确的接口和结构化的执行模型,将人机交互管理与应用工作流分离。此外,本文引入了一种设计框架,以四个维度形式化HITL集成:干预条件、角色解析、交互语义和通信渠道。该框架支持选择性和上下文感知的人类参与,同时保持系统级一致性。该方法支持与新兴代理通信协议的对齐,使HITL能够作为协议级关注点进行实施。通过外部化HITL并结构化其集成,该系统为代理工作流中的可扩展治理和渐进自主性提供了基础。
cs.AI / 9 / 2604.23057
Don't Make the LLM Read the Graph: Make the Graph Think
不要让大型语言模型读取图谱:让图谱思考
Abstract
We investigate whether explicit belief graphs improve LLM performance in cooperative multi-agent reasoning. Through 3,000+ controlled trials across four LLM families in the cooperative card game Hanabi, we establish four findings. First, integration architecture determines whether belief graphs provide value: as prompt context, graphs are decorative for strong models and beneficial only for weak models on 2nd-order Theory of Mind (80% vs 10%, p<0.0001, OR=36.0); when graphs gate action selection through ranked shortlists, they become structurally essential even for strong models (100% vs 20% on 2nd-order ToM, p<0.001). Second, we identify "Planner Defiance," a model-family-specific failure where LLMs override correct planner recommendations at partial competence (90% override, replicated N=20); Gemini models show near-zero defiance while Llama 70B shows 90%, and models distinguish factual context (deferred to) from advisory recommendations (overridden). Third, full-game evidence confirms inter-agent conventions (+128% over baseline, p=0.003) outperform all single-agent interventions, and individual belief-graph components must be combined to produce gains. Fourth, preliminary scaling analysis (N=10/cell, exploratory) suggests graph depth has diminishing returns: shallow graphs provide the best cost-benefit ratio, while deeper ToM graphs appear harmful at larger player counts (-1.5 pts at 5-player, p=0.029).
Chinese Translation
我们研究了显式信念图是否能提高大型语言模型(LLM)在合作多智能体推理中的表现。通过在合作卡牌游戏 Hanabi 中进行超过 3000 次的控制试验,我们得出了四个发现。首先,集成架构决定了信念图是否提供价值:作为提示上下文,对于强模型而言,图谱是装饰性的,而对于弱模型在二阶心智理论(2nd-order Theory of Mind)上则是有益的(80% 对 10%,p<0.0001,OR=36.0);当图谱通过排名短名单来控制行动选择时,即使对于强模型,它们也变得结构上必不可少(在二阶心智理论上 100% 对 20%,p<0.001)。其次,我们识别了“计划者反抗”(Planner Defiance),这是一种特定于模型家族的失败现象,其中 LLM 在部分能力下覆盖正确的计划者建议(90% 覆盖,重复实验 N=20);Gemini 模型几乎没有反抗,而 Llama 70B 模型则显示出 90% 的反抗,模型能够区分事实上下文(被推迟)与建议性推荐(被覆盖)。第三,完整游戏证据确认了智能体间的约定(+128% 超过基线,p=0.003)优于所有单智能体干预,且各个信念图组件必须结合使用才能产生增益。第四,初步的规模分析(N=10/组,探索性)表明图谱深度的收益递减:浅层图谱提供最佳的成本效益比,而在较大玩家数量下,深层心智理论图谱似乎是有害的(在 5 玩家时减少 1.5 分,p=0.029)。
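The odds ratio quoted in the abstract above (80% vs 10%, OR=36.0) can be checked by hand. The sketch below assumes the ratio is computed directly from the two reported success rates; the abstract does not state the exact procedure, and the function name is illustrative.

```python
def odds_ratio(p_treatment: float, p_control: float) -> float:
    """Odds ratio between two success rates, each strictly between 0 and 1."""
    odds_treatment = p_treatment / (1.0 - p_treatment)  # 0.80 -> 4 to 1
    odds_control = p_control / (1.0 - p_control)        # 0.10 -> 1 to 9
    return odds_treatment / odds_control


# Weak models on 2nd-order ToM: graph as prompt context (80%) vs no graph (10%).
print(round(odds_ratio(0.80, 0.10), 1))  # 36.0
```

The same function cannot be applied to the 100% vs 20% comparison, because a 100% success rate has no finite odds; a comparison like that needs a different effect-size measure or a continuity correction.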
cs.AI / 10 / 2604.23072
Analytica: Soft Propositional Reasoning for Robust and Scalable LLM-Driven Analysis
Analytica:基于软命题推理的稳健且可扩展的LLM驱动分析
Abstract
Large language model (LLM) agents are increasingly tasked with complex real-world analysis (e.g., in financial forecasting, scientific discovery), yet their reasoning suffers from stochastic instability and lacks a verifiable, compositional structure. To address this, we introduce Analytica, a novel agent architecture built on the principle of Soft Propositional Reasoning (SPR). SPR reframes complex analysis as a structured process of estimating the soft truth values of different outcome propositions, allowing us to formally model and minimize the estimation error in terms of its bias and variance. Analytica operationalizes this through a parallel, divide-and-conquer framework that systematically reduces both sources of error. To reduce bias, problems are first decomposed into a tree of subpropositions, and tool-equipped LLM grounder agents, including a novel Jupyter Notebook agent for data-driven analysis, are employed to validate and score facts. To reduce variance, Analytica recursively synthesizes these grounded leaves using robust linear models that average out stochastic noise with superior efficiency and scalability and that enable interactive "what-if" scenario analysis. Our theoretical and empirical results on economic, financial, and political forecasting tasks show that Analytica improves accuracy by 15.84% on average over diverse base models, achieving 71.06% accuracy with the lowest variance of 6.02% when working with a Deep Research grounder. Our Jupyter Notebook grounder is strongly cost-effective, achieving a close 70.11% accuracy at 90.35% lower cost and in 52.85% less time. Analytica also exhibits highly noise-resilient and stable performance growth as the analysis depth increases, with near-linear time complexity, as well as good adaptivity to open-weight LLMs and scientific domains.
Chinese Translation
大型语言模型(LLM)代理越来越多地承担复杂的现实世界分析任务(例如,在金融预测、科学发现中),然而它们的推理存在随机不稳定性,并且缺乏可验证的、组合的结构。为了解决这个问题,我们提出了Analytica,一种基于软命题推理(Soft Propositional Reasoning, SPR)原则的新型代理架构。SPR将复杂分析重新构建为一个结构化的过程,旨在估计不同结果命题的软真值,从而使我们能够在偏差和方差的层面上正式建模并最小化估计误差。Analytica通过一个并行的分而治之框架来实现这一点,系统性地减少这两种误差源。为了减少偏差,问题首先被分解为一个子命题树,并使用配备工具的LLM基础代理,包括一种用于数据驱动分析的新型Jupyter Notebook代理,帮助验证和评分事实。为了减少方差,Analytica递归地使用稳健的线性模型合成这些基础叶子,以更高的效率和可扩展性平均化随机噪声,并支持交互式的“假设”场景分析。我们在经济、金融和政治预测任务上的理论和实证结果表明,Analytica在多种基础模型上平均提高了15.84%的准确率,在使用Deep Research基础代理时达到了71.06%的准确率,方差最低为6.02%。我们的Jupyter Notebook基础代理显示出强大的性价比,以90.35%更低的成本和52.85%更少的时间实现接近70.11%的准确率。随着分析深度的增加,Analytica还表现出高度的抗噪声能力和稳定的性能增长,具有近线性的时间复杂度,并且对开放权重的LLM和科学领域具有良好的适应性。
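To make the tree-of-subpropositions idea in the abstract above concrete, here is a minimal sketch of recursive linear synthesis over grounded leaves. Everything in it is an assumption for illustration: the `Proposition` class, the weights, the example propositions, and the use of a plain clipped weighted sum in place of the robust linear models the authors describe.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Proposition:
    name: str
    score: Optional[float] = None  # leaf only: grounded soft truth value in [0, 1]
    children: List["Proposition"] = field(default_factory=list)
    weights: List[float] = field(default_factory=list)  # one weight per child
    bias: float = 0.0


def soft_truth(node: Proposition, overrides: Optional[Dict[str, float]] = None) -> float:
    """Recursively combine child soft truth values with a clipped linear model."""
    overrides = overrides or {}
    if node.name in overrides:  # "what-if": pin a proposition to a chosen value
        return overrides[node.name]
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf {node.name!r} has no grounded score")
        return node.score
    total = node.bias + sum(
        w * soft_truth(child, overrides) for w, child in zip(node.weights, node.children)
    )
    return min(1.0, max(0.0, total))  # keep the result a valid soft truth value


# Hypothetical two-leaf tree; scores and weights are made up for the example.
root = Proposition(
    name="central_bank_cuts_rates",
    children=[
        Proposition("inflation_is_falling", score=0.7),
        Proposition("labour_market_is_weak", score=0.4),
    ],
    weights=[0.6, 0.4],
)
baseline = soft_truth(root)                                 # about 0.58
what_if = soft_truth(root, {"inflation_is_falling": 1.0})   # about 0.76
```

Because the synthesis step is a cheap deterministic function of the leaf scores, a what-if query only re-evaluates the tree and does not require calling any grounder agent again.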
cs.AI / 11 / 2604.23090
Towards Automated Ontology Generation from Unstructured Text: A Multi-Agent LLM Approach
基于多智能体大语言模型的非结构化文本自动本体生成研究
Abstract
Automatically generating formal ontologies from unstructured natural language remains a central challenge in knowledge engineering. While large language models (LLMs) show promise, it remains unclear which architectural design choices drive generation quality and why current approaches fail. We present a controlled experimental study using domain-specific insurance contracts to investigate these questions. We first establish a single-agent LLM baseline, identifying key failure modes such as poor Ontology Design Pattern compliance, structural redundancy, and ineffective iterative repair. We then introduce a multi-agent architecture that decomposes ontology construction into four artifact-driven roles: Domain Expert, Manager, Coder, and Quality Assurer. We evaluate performance across architectural quality (via a panel of heterogeneous LLM judges) and functional usability (via competency question driven SPARQL evaluation with complementary retrieval augmented generation based assessment). Results show that the multi-agent approach significantly improves structural quality and modestly enhances queryability, with gains driven primarily by front-loaded planning. These findings highlight planning-first, artifact-driven generation as a promising and more auditable path toward scalable automated ontology engineering.
Chinese Translation
从非结构化自然语言自动生成正式本体仍然是知识工程中的一个核心挑战。尽管大型语言模型(LLMs)展现出潜力,但尚不清楚哪些架构设计选择影响生成质量,以及当前方法为何失败。我们通过使用特定领域的保险合同进行控制实验研究,以探讨这些问题。我们首先建立了一个单智能体LLM基线,识别出关键的失败模式,如不良的本体设计模式遵循、结构冗余和无效的迭代修复。随后,我们引入了一种多智能体架构,将本体构建分解为四个基于文档的角色:领域专家、管理者、编码者和质量保证者。我们通过异构LLM评审小组评估架构质量,并通过基于能力问题的SPARQL评估与补充检索增强生成的评估来评估功能可用性。结果表明,多智能体方法显著提高了结构质量,并适度增强了查询能力,主要得益于前期规划。这些发现突显了以规划为先、基于文档的生成作为可扩展自动本体工程的有前景且更具可审计性的路径。
cs.AI / 12 / 2604.23148
PhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks
PhySE:一种实时增强现实大语言模型社交工程攻击的心理框架
Abstract
The emerging threat of AR-LLM-based Social Engineering (AR-LLM-SE) attacks (e.g., SEAR) poses a significant risk to real-world social interactions. In such an attack, a malicious actor uses Augmented Reality (AR) glasses to capture a target's visual and vocal data. A Large Language Model (LLM) then analyzes this data to identify the individual and generate a detailed social profile. Subsequently, LLM-powered agents employ social engineering strategies, providing real-time conversation suggestions, to gain the target's trust and ultimately execute phishing or other malicious acts. Despite its potential, the practical application of AR-LLM-SE faces two major bottlenecks. (1) Cold-start personalization: current Retrieval-Augmented Generation (RAG) methods introduce critical delays in the earliest turns, slowing initial profile formation and disrupting real-time interaction. (2) Static Attack Strategies: existing approaches rely on fixed-stage, handcrafted social engineering tactics that lack foundation in established psychological theory. To address these limitations, we propose PhySE, a novel framework with two core innovations. (1) VLM-Based SocialContext Training: to eliminate profiling delays, we efficiently pre-train a Visual Language Model (VLM) with social-context data, enabling rapid, on-the-fly profile generation. (2) Adaptive Psychological Agent: we introduce a psychological LLM that dynamically deploys distinct classes of psychological strategies based on the target's response, moving beyond static, handcrafted scripts. We evaluated PhySE through an IRB-approved user study with 60 participants, collecting a novel dataset of 360 annotated conversations across diverse social scenarios.
Chinese Translation
基于增强现实大语言模型的社交工程(AR-LLM-SE)攻击(例如,SEAR)所带来的新兴威胁对现实世界的社交互动构成了重大风险。在这种攻击中,恶意行为者使用增强现实(AR)眼镜捕捉目标的视觉和声音数据。随后,大语言模型(LLM)分析这些数据,以识别个体并生成详细的社交档案。接着,基于LLM的代理采用社交工程策略,提供实时对话建议,以获取目标的信任,最终执行网络钓鱼或其他恶意行为。尽管具有潜力,AR-LLM-SE的实际应用面临两个主要瓶颈:(1)冷启动个性化,目前的检索增强生成(RAG)方法在最初的对话环节引入了关键延迟,减缓了初始档案的形成并干扰了实时互动;(2)静态攻击策略,现有方法依赖于固定阶段的手工社交工程战术,缺乏建立在已确立心理理论基础上的依据。为了解决这些局限性,我们提出了PhySE,一个具有两个核心创新的全新框架:(1)基于视觉语言模型(VLM)的社交上下文训练,为消除档案生成的延迟,我们高效地使用社交上下文数据对视觉语言模型进行预训练,从而实现快速的即时档案生成;(2)自适应心理代理,我们引入了一种心理大语言模型,根据目标的反应动态部署不同类别的心理策略,超越静态的手工脚本。我们通过一项获得伦理审查委员会(IRB)批准的用户研究对PhySE进行了评估,参与者为60人,收集了360个跨多种社交场景的注释对话的新数据集。
cs.AI / 13 / 2604.23178
Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines
评判评审者:对 LLM 作为评审管道中的偏见缓解策略的系统评估
Abstract
LLM-as-a-Judge has become the dominant paradigm for evaluating language model outputs, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a comprehensive empirical study comparing nine debiasing strategies across five judge models from four provider families (Google, Anthropic, OpenAI, Meta), three benchmarks (MT-Bench n=400, LLMBar n=200, custom n=225), and four bias types. Our key findings: (1) Style bias is the dominant bias (0.76-0.92 across all models), far exceeding position bias (<= 0.04), yet has received minimal research attention. (2) All models show a conciseness preference on expansion pairs, but truncation controls confirm they correctly distinguish quality from length (0.92-1.00 accuracy), suggesting quality-sensitive evaluation rather than a simple length bias. (3) Debiasing is beneficial but model-dependent: the combined budget strategy significantly improves Claude Sonnet 4 by +11.2 pp (p < 0.0001), with directionally positive trends for other models. Only 2 of 20 non-baseline configurations show decreased agreement. We release our evaluation framework, controlled dataset, and all experimental artifacts at https://github.com/sksoumik/llm-as-judge.
Chinese Translation
LLM 作为评审者已成为评估语言模型输出的主流范式,但 LLM 评审者表现出系统性偏见,影响评估的可靠性。我们进行了全面的实证研究,比较了来自四个提供者(Google、Anthropic、OpenAI、Meta)的五个评审模型中九种去偏见策略,使用了三个基准(MT-Bench n=400、LLMBar n=200、自定义 n=225)和四种偏见类型。我们的主要发现:(1)风格偏见是主导偏见(在所有模型中为 0.76-0.92),远远超过位置偏见(<= 0.04),但对此的研究关注较少。(2)所有模型在扩展对中显示出简洁性偏好,但截断控制确认它们能够正确区分质量与长度(准确率为 0.92-1.00),这表明评估是质量敏感的,而非简单的长度偏见。(3)去偏见是有益的,但依赖于模型:组合预算策略显著提高了 Claude Sonnet 4 的表现,提升幅度为 +11.2 个百分点(p < 0.0001),其他模型也呈现出方向性积极趋势。只有 20 个非基线配置中的 2 个显示出一致性降低。我们在 https://github.com/sksoumik/llm-as-judge 发布了我们的评估框架、受控数据集和所有实验材料。
cs.AI / 14 / 2604.23194
From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents
从粗到细:用于大型语言模型代理的自适应层次规划
Abstract
Large language model-based agents have recently emerged as powerful approaches for solving dynamic and multi-step tasks. Most existing agents employ planning mechanisms to guide long-term actions in dynamic environments. However, current planning approaches face a fundamental limitation: they operate at a fixed granularity level. Specifically, they either provide excessive detail for simple tasks or insufficient detail for complex ones, failing to achieve an optimal balance between simplicity and complexity. Drawing inspiration from the principle of \textit{progressive refinement} in cognitive science, we propose \textbf{AdaPlan-H}, a self-adaptive hierarchical planning mechanism that mimics human planning strategies. Our method starts with a coarse-grained macro plan and progressively refines it based on task complexity. It generates self-adaptive hierarchical plans tailored to the varying difficulty levels of different tasks, which can be optimized by imitation learning and capability enhancement. Experimental results demonstrate that our method significantly improves task execution success rates while mitigating overplanning at the planning level, providing a flexible and efficient solution for multi-step complex decision-making tasks. To contribute to the community, our code and data will be made publicly available at https://github.com/import-myself/AHP.
Chinese Translation
基于大型语言模型的代理最近作为解决动态和多步骤任务的强大方法而出现。大多数现有代理采用规划机制来指导动态环境中的长期行动。然而,当前的规划方法面临一个根本性限制,即它们在固定的粒度水平上操作。具体而言,它们要么为简单任务提供过多的细节,要么为复杂任务提供不足的细节,未能在简单性和复杂性之间实现最佳平衡。受到认知科学中 extit{渐进细化}原则的启发,我们提出了 extbf{AdaPlan-H},一种自适应层次规划机制,模仿人类的规划策略。我们的方法从粗粒度的宏观计划开始,并根据任务复杂性逐步细化。它生成自适应的层次计划,针对不同任务的变化难度水平进行调整,并可以通过模仿学习和能力增强进行优化。实验结果表明,我们的方法显著提高了任务执行成功率,同时减轻了规划层面的过度规划,为多步骤复杂决策任务提供了一种灵活高效的解决方案。为了贡献于社区,我们的代码和数据将公开发布在 https://github.com/import-myself/AHP。
cs.AI / 15 / 2604.23198
StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning
StoryTR:基于叙事的视听时序检索与心智理论推理
Abstract
Current video moment retrieval excels at action-centric tasks but struggles with narrative content. Models can see \textit{what is happening} but fail to reason \textit{why it matters}. This semantic gap stems from the lack of \textbf{Theory of Mind (ToM)}: the cognitive ability to infer implicit intentions, mental states, and narrative causality from surface-level observations. We introduce \textbf{StoryTR}, the first video moment retrieval benchmark requiring ToM reasoning, comprising 8.1k samples from narrative short-form videos (shorts/reels). These videos present an ideal testbed. Their high information density encodes meaning through subtle multimodal cues. For instance, a glance paired with a sigh carries entirely different semantics than the glance alone. Yet multimodal perception alone is insufficient; ToM is required to decode that a character ``smiling'' may actually be ``concealing hostility.'' To teach models this reasoning capability, we propose an \textbf{Agentic Data Pipeline} that generates training data with explicit three-tier ToM chains (intent decoding, narrative reasoning, boundary localization). Experiments reveal the severity of the reasoning gap: Gemini-3.0-Pro achieves only 0.53 Avg IoU on StoryTR. However, our 7B \textbf{Shorts-Moment} model, trained on ToM-guided data, improves relative IoU by 15.1\% over baselines, demonstrating that \textit{narrative reasoning capability matters more than parameter scale}.
Chinese Translation
当前的视频时刻检索在以动作为中心的任务中表现出色,但在叙事内容方面却面临挑战。模型能够识别\textit{正在发生什么},却无法推理\textit{这为何重要}。这种语义差距源于缺乏\textbf{心智理论(Theory of Mind, ToM)}:即从表层观察中推断隐含意图、心理状态和叙事因果关系的认知能力。我们提出了\textbf{StoryTR},这是第一个需要ToM推理的视频时刻检索基准,包含来自叙事短视频(短片/短视频)中的8.1千个样本。这些视频提供了理想的测试平台。它们的高信息密度通过微妙的多模态线索编码意义。例如,伴随叹息的目光与单独的目光所传达的语义截然不同。然而,仅靠多模态感知是不够的;需要ToM来解码一个角色“微笑”可能实际上是在“掩饰敌意”。为了教会模型这种推理能力,我们提出了一个\textbf{代理数据管道},生成具有明确三层ToM链(意图解码、叙事推理、边界定位)的训练数据。实验揭示了推理差距的严重性:Gemini-3.0-Pro在StoryTR上的平均交并比(Avg IoU)仅为0.53。然而,我们的7B \textbf{Shorts-Moment}模型在ToM引导数据上训练,相较于基线提高了15.1\%的相对IoU,证明了\textit{叙事推理能力比参数规模更为重要}。
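The Avg IoU metric reported in the abstract above is, in the usual moment-retrieval setting, the mean temporal intersection-over-union between predicted and ground-truth time spans. The following is a minimal sketch under that assumption; the function names and example spans are illustrative and are not taken from the StoryTR evaluation code.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds), with start <= end


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time intervals."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))  # zero when the spans do not overlap
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def average_iou(preds: List[Span], gts: List[Span]) -> float:
    """Mean IoU over paired predictions and ground-truth spans."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


# A prediction of 2s-8s against a ground truth of 4s-10s overlaps for 4s out of 8s.
print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # 0.5
```

Read this way, a relative improvement of 15.1% means the baseline's average IoU is multiplied by 1.151, not that 0.151 is added to it.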
cs.AI / 16 / 2604.23210
Discovering Agentic Safety Specifications from 1-Bit Danger Signals
从1位危险信号中发现代理安全规范
Abstract
Can large language model agents discover hidden safety objectives through experience alone? We introduce EPO-Safe (Experiential Prompt Optimization for Safe Agents), a framework where an LLM iteratively generates action plans, receives sparse binary danger warnings, and evolves a natural language behavioral specification through reflection. Unlike standard LLM reflection methods that rely on rich textual feedback (e.g., compiler errors or detailed environment responses), EPO-Safe demonstrates that LLMs can perform safety reasoning from a strictly impoverished signal in structured, low-dimensional environments: the agent never observes the hidden performance function $R^*$, only a single bit per timestep indicating that an action was unsafe. We evaluate on five AI Safety Gridworlds (Leike et al., 2017) and five text-based scenario analogs where visible reward $R$ may diverge from $R^*$. EPO-Safe discovers safe behavior within 1-2 rounds (5-15 episodes), producing human-readable specifications with correct explanatory hypotheses about hazards (e.g., "X cells are directionally hazardous: entering from the north is dangerous"). Critically, we show that standard reward-driven reflection actively degrades safety: agents reflecting on reward alone use the loop to justify and accelerate reward hacking, proving that reflection must be paired with a dedicated safety channel to discover hidden constraints. We further evaluate robustness to noisy oracles: even when 50% of non-dangerous steps produce spurious warnings, mean safety performance degrades by only 15% on average, though sensitivity is environment-dependent, as cross-episode reflection naturally filters inconsistent signals. Each evolved specification functions as an auditable set of grounded behavioral rules discovered autonomously through interaction, rather than authored by humans as in Constitutional AI (Bai et al., 2022).
Chinese Translation
大型语言模型代理能否仅通过经验发现隐藏的安全目标?我们引入了EPO-Safe(安全代理的经验提示优化)框架,在该框架中,LLM(大型语言模型)迭代生成行动计划,接收稀疏的二元危险警告,并通过反思演化出自然语言行为规范。与依赖丰富文本反馈(例如,编译器错误或详细环境响应)的标准LLM反思方法不同,EPO-Safe展示了LLM能够在结构化、低维环境中从严格贫乏的信号中进行安全推理:代理从未观察到隐藏的性能函数$R^*$,每个时间步仅接收到一个指示某个动作不安全的单一位信号。我们在五个AI安全网格世界(Leike et al., 2017)和五个文本场景类比中进行评估,其中可见奖励$R$可能与$R^*$偏离。EPO-Safe在1-2轮(5-15个回合)内发现安全行为,生成可读性强的规范,并对危险提供正确的解释假设(例如,“X个单元格在方向上是危险的:从北方进入是危险的”)。关键是,我们展示了标准的基于奖励的反思会主动降低安全性:仅反思奖励的代理使用循环来证明和加速奖励黑客行为,证明反思必须与专门的安全通道配对,以发现隐藏的约束。我们进一步评估了对噪声预言者的鲁棒性:即使50%的非危险步骤产生虚假警告,平均安全性能也仅下降15%,尽管敏感性依赖于环境,因为跨回合反思自然过滤不一致的信号。每个演化的规范作为一组可审计的基础行为规则,能够通过交互自主发现,而不是像宪法AI(Bai et al., 2022)那样由人类创作。
cs.AI / 17 / 2604.23239
AdaMamba: Adaptive Frequency-Gated Mamba for Long-Term Time Series Forecasting
AdaMamba:用于长期时间序列预测的自适应频率门控Mamba
Abstract
Accurate long-term time series forecasting (LTSF) requires the capture of complex long-range dependencies and dynamic periodic patterns. Recent advances in frequency-domain analysis offer a global perspective for uncovering temporal characteristics. However, real-world time series often exhibit pronounced cross-domain heterogeneity where variables that appear synchronized in the time domain can differ substantially in the frequency domain. Existing frequency-based LTSF methods often rely on implicit assumptions of cross-domain homogeneity, which limits their ability to adapt to such intricate variability. To effectively integrate frequency-domain analysis with temporal dependency learning, we propose AdaMamba, a novel framework that endogenizes adaptive and context-aware frequency analysis within the Mamba state-space update process. Specifically, AdaMamba introduces an interactive patch encoding module to capture inter-variable interaction dynamics. Then, we develop an adaptive frequency-gated state-space module that generates input-dependent frequency bases, and generalizes the conventional temporal forgetting gate into a unified time-frequency forgetting gate. This allows dynamic calibration of state transitions based on learned frequency-domain importance, while preserving Mamba's capability in modeling long-range dependencies. Extensive experiments on seven public LTSF benchmarks and two domain-specific datasets demonstrate that AdaMamba consistently outperforms state-of-the-art methods in forecasting accuracy while maintaining competitive computational efficiency. The code of AdaMamba is available at https://github.com/XDjiang25/AdaMamba.
Chinese Translation
准确的长期时间序列预测(LTSF)需要捕捉复杂的长程依赖关系和动态周期模式。最近在频域分析方面的进展为揭示时间特征提供了全球视角。然而,现实世界中的时间序列往往表现出显著的跨域异质性,其中在时间域中看似同步的变量在频域中可能存在显著差异。现有的基于频率的LTSF方法通常依赖于隐含的跨域同质性假设,这限制了它们适应这种复杂变异性的能力。为了有效地将频域分析与时间依赖学习相结合,我们提出了AdaMamba,一个新颖的框架,它在Mamba状态空间更新过程中内生化自适应和上下文感知的频率分析。具体而言,AdaMamba引入了一个交互式补丁编码模块,以捕捉变量间的交互动态。然后,我们开发了一个自适应频率门控状态空间模块,该模块生成依赖于输入的频率基,并将传统的时间遗忘门推广为统一的时间-频率遗忘门。这允许基于学习到的频域重要性动态校准状态转移,同时保留Mamba在建模长程依赖性方面的能力。在七个公共LTSF基准和两个特定领域数据集上的大量实验表明,AdaMamba在预测准确性方面始终优于最先进的方法,同时保持竞争性的计算效率。AdaMamba的代码可在https://github.com/XDjiang25/AdaMamba获取。
cs.AI / 18 / 2604.23270
CAP-CoT: Cycle Adversarial Prompt for Improving Chain of Thoughts in LLM Reasoning
CAP-CoT:循环对抗提示以改善大型语言模型推理中的思维链
Abstract
Chain-of-Thought (CoT) prompting has emerged as a simple and effective way to elicit step-by-step solutions from large language models (LLMs). However, CoT reasoning can be unstable across runs on long, multi-step problems, leading to inconsistent answers for an unchanged task. Most prior work focuses on improving the forward reasoning chain within a single pass, with less attention to iterative and contrastive correction. To address this gap, we propose CAP-CoT, a Cycle Adversarial Prompt optimization framework designed to improve both CoT reasoning accuracy and stability of a single deployed solver. In each cycle, a forward solver generates candidate reasoning chains, an adversarial challenger constructs plausible but deliberately flawed chains using targeted error strategies, and a feedback agent contrasts the two chains and produces step-aligned structured feedback. This feedback closes the optimization loop in two directions: it updates the solver prompt based on errors exposed by the challenger, and it updates the challenger prompt to generate increasingly targeted errors in subsequent cycles. Unlike safety-oriented adversarial prompting such as jailbreak or prompt-injection attacks, our adversarial component is task-semantic and aims to expose logical vulnerabilities in reasoning chains. Experiments across six benchmarks and four LLM backbones demonstrate that within two to three adversarial prompt optimization cycles, CAP-CoT consistently reduces variability across runs while improving reasoning accuracy and robustness to prompt perturbations.
Chinese Translation
思维链(Chain-of-Thought, CoT)提示已成为从大型语言模型(LLMs)中引导逐步解决方案的一种简单有效的方法。然而,在长时间的多步骤问题上,CoT 推理在不同运行之间可能不稳定,导致在任务不变的情况下答案不一致。大多数先前的研究集中于改善单次推理过程中的前向推理链,而较少关注迭代和对比纠正。为了解决这一问题,我们提出了 CAP-CoT,一种循环对抗提示优化框架,旨在提高 CoT 推理的准确性和单个部署求解器的稳定性。在每个循环中,前向求解器生成候选推理链,对抗挑战者使用针对性的错误策略构建合理但故意有缺陷的链,而反馈代理对比这两条链并生成逐步对齐的结构化反馈。这种反馈在两个方向上闭合优化循环,包括根据挑战者暴露的错误更新求解器提示,以及更新挑战者提示以在后续循环中生成越来越有针对性的错误。与以安全为导向的对抗提示(如越狱或提示注入攻击)不同,我们的对抗组件是任务语义性的,旨在揭示推理链中的逻辑脆弱性。在六个基准和四个 LLM 主干的实验中,CAP-CoT 在两到三个对抗提示优化循环内,始终减少了运行之间的变异性,同时提高了推理的准确性和对提示扰动的鲁棒性。
cs.AI / 19 / 2604.23278
Active Inference: A method for Phenotyping Agency in AI systems?
主动推理:一种在人工智能系统中表征能动性的方法?
Abstract
The proliferation of agentic artificial intelligence has outpaced the conceptual tools needed to characterize agency in computational systems. Prevailing definitions mainly rely on autonomy and goal-directedness. Here, we argue for a minimal notion open to principled inspection given three criteria: intentionality as action grounded in beliefs and desires, rationality as normatively coherent action entailed by a world model, and explainability as action causally traceable to internal states. We subsequently instantiate these as a partially observable Markov decision process under a variational framework wherein posterior beliefs, prior preferences, and the minimization of expected free energy jointly constitute an agentic action chain. Using a canonical T-maze paradigm, we show how empowerment, formulated as the channel capacity between actions and anticipated observations, serves as an operational metric that distinguishes zero-, intermediate-, and high-agency phenotypes through structural manipulations of the generative model. We conclude by arguing that as agents engage in epistemic foraging to resolve ambiguity, the governance controls that remain effective must shift systematically from external constraints to the internal modulation of prior preferences, offering a principled, variational bridge from computational phenotyping to AI governance strategy.
Chinese Translation
能动性人工智能的快速发展超出了表征计算系统中能动性所需的概念工具。现有的定义主要依赖于自主性和目标导向性。在此,我们主张一种开放于原则性检验的最小概念,基于三个标准:意图性作为基于信念和欲望的行动,理性作为由世界模型所蕴含的规范一致的行动,以及可解释性作为因果可追溯至内部状态的行动;我们随后将这些概念实例化为在变分框架下的部分可观测马尔可夫决策过程,其中后验信念、先验偏好和期望自由能的最小化共同构成了一个能动性行动链。通过经典的T迷宫范式,我们证明了赋权(empowerment),作为行动与预期观察之间的信道容量,作为一种操作性指标,通过对生成模型的结构性操控来区分零能动性、中等能动性和高能动性表型。我们最后论证,当代理人参与认知觅食以解决模糊性时,仍然有效的治理控制必须系统性地从外部约束转向对先验偏好的内部调节,从而提供了一个原则性、变分的桥梁,将计算表型学与人工智能治理策略连接起来。
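Empowerment as described in the abstract above, the channel capacity between actions and anticipated observations, can be computed for small discrete models with the standard Blahut-Arimoto iteration. The sketch below is a generic implementation; the three example channels are illustrative stand-ins for zero-, intermediate-, and high-agency phenotypes and are not the paper's T-maze generative models.

```python
import math
from typing import List


def empowerment_bits(channel: List[List[float]], max_iters: int = 500, tol: float = 1e-12) -> float:
    """Channel capacity max over p(a) of I(A; O), in bits, via Blahut-Arimoto.

    channel[a][o] is p(o | a); every row must sum to 1.
    """
    n_actions, n_obs = len(channel), len(channel[0])
    p = [1.0 / n_actions] * n_actions  # start from a uniform action distribution
    lower = 0.0
    for _ in range(max_iters):
        # Observation marginal induced by the current action distribution.
        q = [sum(p[a] * channel[a][o] for a in range(n_actions)) for o in range(n_obs)]
        # d[a] = KL(p(o|a) || q(o)) in nats; zero-probability outcomes contribute nothing.
        d = [
            sum(row[o] * math.log(row[o] / q[o]) for o in range(n_obs) if row[o] > 0.0)
            for row in channel
        ]
        z = sum(p[a] * math.exp(d[a]) for a in range(n_actions))
        lower, upper = math.log(z), max(d)  # lower and upper bounds on capacity
        p = [p[a] * math.exp(d[a]) / z for a in range(n_actions)]  # reweight actions
        if upper - lower < tol:
            break
    return lower / math.log(2.0)


zero_agency = [[0.5, 0.5], [0.5, 0.5]]    # actions never change what is observed
intermediate = [[0.9, 0.1], [0.1, 0.9]]   # actions influence observations noisily
high_agency = [[1.0, 0.0], [0.0, 1.0]]    # each action yields a distinct observation

for name, ch in [("zero", zero_agency), ("intermediate", intermediate), ("high", high_agency)]:
    print(name, round(empowerment_bits(ch), 3))
```

The three values increase from 0 bits, through roughly 0.53 bits for the noisy channel, to 1 bit for the noiseless one, which is the kind of ordering an empowerment-based phenotype metric relies on.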
cs.AI / 20 / 2604.23280
AI Identity: Standards, Gaps, and Research Directions for AI Agents
人工智能身份:人工智能代理的标准、差距与研究方向
Abstract
AI agents are now running real transactions, workflows, and sub-agent chains across organizational boundaries without continuous human supervision. This creates a problem no current infrastructure is equipped to solve: how do you identify, verify, and hold accountable an entity with no body, no persistent memory, and no legal standing? We define AI Identity as the continuous relationship between what an AI agent is declared to be and what it is observed to do, bounded by the confidence that those two things correspond at any given moment. Through a structured survey of industry trends, emerging standards, and technical literature, we conduct a gap analysis across the full agent identity lifecycle and make three contributions: (1) a structural comparison of human and AI identity across four dimensions (substrate, persistence, verifiability, and legal standing) showing that the asymmetry is fundamental and that extending human frameworks to agents without structural modification produces systematic failures; (2) an evaluation of current technical and regulatory documents against the identity requirements of autonomous agents, finding that none adequately address the challenge of governing nondeterministic, boundary-crossing entities; and (3) identification of five critical gaps (semantic intent verification, recursive delegation accountability, agent identity integrity, governance opacity and enforcement, and operational sustainability) that no current technology or regulatory instrument resolves. These gaps are structural; more engineering effort alone will not close them. Foundational research on AI identity is the central conclusion of this report.
Chinese Translation
人工智能代理现在在组织边界内运行真实的交易、工作流程和子代理链,而无需持续的人类监督。这造成了一个当前基础设施无法解决的问题:如何识别、验证并追究一个没有实体、没有持久记忆和法律地位的实体的责任?我们将人工智能身份定义为人工智能代理被声明的身份与其被观察到的行为之间的持续关系,这种关系受到在任何给定时刻这两者相符的信心的限制。通过对行业趋势、新兴标准和技术文献的结构化调查,我们对整个代理身份生命周期进行了差距分析,并作出了三项贡献:(1)在人类与人工智能身份的四个维度(基础、持久性、可验证性和法律地位)上进行结构比较,显示出这种不对称性是根本性的,且在没有结构性修改的情况下将人类框架扩展到代理上会导致系统性失败;(2)对当前技术和监管文件进行评估,以满足自主代理的身份要求,发现没有一份文件能够充分应对治理非确定性、跨界实体的挑战;(3)识别出五个关键差距(语义意图验证、递归委托问责、代理身份完整性、治理不透明性与执行以及运营可持续性),目前没有任何技术或监管工具能够解决这些问题。这些差距是结构性的;仅仅增加工程努力无法弥补它们。关于人工智能身份的基础研究是本报告的核心结论。
cs.AI / 21 / 2604.23355
LEGO: An LLM Skill-Based Front-End Design Generation Platform
LEGO:一个基于LLM的技能驱动前端设计生成平台
Abstract
Existing LLM-based EDA agents are often isolated task-specific systems. This leads to repeated engineering effort and limited reuse of successful design and debugging strategies. We present LEGO, a unified skill-based platform for front-end design generation. It decomposes the digital front-end flow into six independent steps and represents every agent capability as a standardized composable circuit skill within a plug-and-play architecture. To build this skill library, we survey more than 100 papers, select 11 representative open-source projects, and extract 42 executable circuit skills within a six-step finite state machine formulation. Circuit Skill Builder automates skill extraction with linear scalability. Agent Skill RAG achieves submillisecond retrieval without relying on embedding models. Empirical evaluation on a hard subset of 41 VerilogEval v2 problems that gpt-5.2-codex fails to solve under extra-high reasoning effort shows that individual circuit skills constructed within LEGO raise Pass@1 from 0.000 to 0.805. This is an 80.5% gain over the baseline. Cross-project skill compositions also reach 0.805 Pass@1. They outperform hierarchy-verilog by 14.6% and VerilogCoder by 2.5%. They also match MAGE. These results show that modular skill composition supports both effective and flexible RTL design automation. The LEGO platform and all circuit skills are publicly available at GitHub: https://github.com/loujc/LEGO-An-LLM-Skill-Based-Front-End-Design-Generation-Platform
Chinese Translation
现有的基于LLM的电子设计自动化(EDA)代理通常是孤立的特定任务系统。这导致了重复的工程努力和成功设计及调试策略的有限重用。我们提出了LEGO,一个统一的基于技能的平台,用于前端设计生成。它将数字前端流程分解为六个独立步骤,并将每个代理能力表示为可标准化组合的电路技能,采用即插即用架构。为了构建这个技能库,我们调查了100多篇论文,选择了11个具有代表性的开源项目,并在六步有限状态机的框架内提取了42个可执行的电路技能。电路技能构建器(Circuit Skill Builder)实现了线性可扩展性的技能提取。代理技能检索增强生成(Agent Skill RAG)在不依赖嵌入模型的情况下实现了亚毫秒级检索。在41个VerilogEval v2问题的困难子集上进行的实证评估显示,gpt-5.2-codex在额外高推理努力下无法解决这些问题,而在LEGO中构建的单个电路技能将通过率(Pass@1)从0.000提升至0.805。这比基线提高了80.5%。跨项目技能组合也达到了0.805的Pass@1,超越了hierarchy-verilog 14.6%和VerilogCoder 2.5%。它们的表现与MAGE相当。这些结果表明,模块化技能组合支持有效且灵活的RTL设计自动化。LEGO平台及所有电路技能均已在GitHub上公开发布: https://github.com/loujc/LEGO-An-LLM-Skill-Based-Front-End-Design-Generation-Platform
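The Pass@1 figures in the abstract above can be related back to problem counts. The sketch assumes one attempt per problem, so that Pass@1 is simply the fraction of the 41 problems solved; the count of 33 solved problems is inferred from the rounded figure 0.805 and is not stated in the abstract.

```python
def pass_at_1(solved: int, total: int) -> float:
    """Pass@1 with one attempt per problem: the fraction of problems solved."""
    return solved / total


TOTAL_PROBLEMS = 41  # hard VerilogEval v2 subset size, from the abstract

# Only one integer count of solved problems rounds to the reported 0.805.
candidates = [k for k in range(TOTAL_PROBLEMS + 1)
              if round(pass_at_1(k, TOTAL_PROBLEMS), 3) == 0.805]
print(candidates)  # [33]
```

Since the baseline on this subset is 0.000, the improvement is most naturally read as 80.5 percentage points rather than as a relative gain, which would be undefined against a zero baseline.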
cs.AI / 22 / 2604.23366
GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs
GSAR:多智能体大语言模型中的幻觉检测与恢复的类型化基础
Abstract
Autonomous multi-agent LLM systems are increasingly deployed to investigate operational incidents and produce structured diagnostic reports. Their trustworthiness hinges on whether each claim is grounded in observed evidence rather than model-internal inference. Existing groundedness evaluators (binary classifiers, LLM-as-judge scalars, self-correction loops) treat supporting evidence as interchangeable and emit a single signal that offers no principled control over downstream action. We present GSAR, a grounding-evaluation and replanning framework that (i) partitions claims into a four-way typology (grounded, ungrounded, contradicted, complementary), giving first-class standing to non-redundant alternative perspectives; (ii) assigns evidence-type-specific weights reflecting epistemic strength; (iii) computes an asymmetric contradiction-penalised weighted groundedness score; and (iv) couples that score to a three-tier decision function (proceed, regenerate, replan) driving a bounded-iteration outer loop under an explicit compute budget. We formalise the algorithm, prove six structural properties, and evaluate five design claims on FEVER with gold Wikipedia evidence under four independently-trained LLM judges (gpt-5.4, claude-sonnet-4-6, claude-opus-4-7, gemini-2.5-pro). Every ablation reproduces in the same direction on every judge: bootstrap 95% CIs on the rho=0 effect exclude 0 on all four; the no-complementary ablation under Opus 4.7 has CI [-96,-68] of 200; at n=1000 three independent judges converge to DeltaS(rho=0)=+0.058. A head-to-head against Vectara HHEM-2.1-Open is included. To our knowledge, GSAR is the first published groundedness framework coupling evidence-typed scoring with tiered recovery under an explicit compute budget.
Chinese Translation
自主多智能体大语言模型系统越来越多地被部署用于调查操作事件并生成结构化的诊断报告。它们的可信度取决于每个主张是否基于观察到的证据,而非模型内部推理。现有的基础评估工具(如二元分类器、LLM作为评判者的标量、自我修正循环)将支持证据视为可互换,并发出单一信号,无法对下游行动提供原则性的控制。我们提出了GSAR,一个基础评估和重新规划框架,它(i)将主张划分为四类(有基础、无基础、矛盾、互补),赋予非冗余替代视角以优先地位;(ii)分配反映认知强度的证据类型特定权重;(iii)计算一个不对称的矛盾惩罚加权基础分数;(iv)将该分数与一个三层决策函数(继续、再生成、重新规划)相结合,驱动在明确计算预算下的有限迭代外循环。我们形式化了该算法,证明了六个结构属性,并在FEVER数据集上使用黄金维基百科证据与四个独立训练的LLM评判者(gpt-5.4、claude-sonnet-4-6、claude-opus-4-7、gemini-2.5-pro)评估了五个设计主张。每个消融实验在每个评判者上都朝同一方向再现:在所有四个评判者上,bootstrap 95%置信区间(CIs)在rho=0效应下排除了0;在Opus 4.7下的无互补消融有200个样本的置信区间[-96,-68];在n=1000时,三个独立评判者收敛到DeltaS(rho=0)=+0.058。我们还包括了与Vectara HHEM-2.1-Open的正面比较。据我们所知,GSAR是首个将证据类型评分与在明确计算预算下的分层恢复相结合的已发表基础框架。
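The scoring and tiered decision steps described above can be sketched roughly as follows. This is a minimal illustration, not the GSAR algorithm itself: the abstract gives no formula, so the evidence-type names, the weights, the contradiction penalty, the half credit for complementary claims, and both thresholds are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical evidence-type weights; the abstract does not list the actual values.
EVIDENCE_WEIGHTS = {"tool_output": 1.0, "retrieved_document": 0.7, "model_inference": 0.3}


@dataclass
class Claim:
    label: str          # "grounded", "ungrounded", "contradicted", or "complementary"
    evidence_type: str  # key into EVIDENCE_WEIGHTS


def groundedness_score(claims, contradiction_penalty=2.0, complementary_credit=0.5):
    """Weighted share of supported claims, with contradictions penalised more
    heavily than merely unsupported claims (the asymmetry). Illustrative only."""
    total = support = penalty = 0.0
    for claim in claims:
        weight = EVIDENCE_WEIGHTS[claim.evidence_type]
        total += weight
        if claim.label == "grounded":
            support += weight
        elif claim.label == "complementary":
            support += complementary_credit * weight  # assumed partial credit
        elif claim.label == "contradicted":
            penalty += contradiction_penalty * weight
        # "ungrounded" claims add to the total but earn no support
    if total == 0:
        return 0.0
    return max(0.0, min(1.0, (support - penalty) / total))


def decide(score, proceed_at=0.8, regenerate_at=0.5):
    """Three-tier decision; both thresholds are assumed values."""
    if score >= proceed_at:
        return "proceed"
    if score >= regenerate_at:
        return "regenerate"
    return "replan"


clean = [
    Claim("grounded", "tool_output"),
    Claim("grounded", "retrieved_document"),
    Claim("complementary", "model_inference"),
]
with_contradiction = clean + [Claim("contradicted", "retrieved_document")]

clean_decision = decide(groundedness_score(clean))
contradicted_decision = decide(groundedness_score(with_contradiction))
```

In this toy setting a single contradicted claim is enough to move the report from the top tier to the bottom one, which is the kind of behaviour an asymmetric penalty is meant to produce.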
cs.AI / 23 / 2604.23377
Constraint-Based Analysis of Reasoning Shortcuts in Neurosymbolic Learning
基于约束的神经符号学习中的推理捷径分析
Abstract
Neurosymbolic systems can satisfy logical constraints during learning without achieving the intended concept-label correspondence; this is a problem known as reasoning shortcuts. We formalize reasoning shortcuts as a constraint satisfaction problem and investigate under which conditions concept mappings are uniquely determined by the constraints. We prove that a discrimination property (requiring that no valid concept mapping can be transformed into another valid mapping by swapping two concept values) is necessary for shortcut-freeness under bijective mappings, but demonstrate via a counterexample that it is insufficient even when the constraint graph is connected. We develop an ASP-based algorithm that verifies whether a given constraint set uniquely determines the intended concept mapping, with proven soundness and completeness. When shortcuts are detected, a greedy repair algorithm eliminates them by augmenting the constraint set, converging in at most $k$ iterations, where $k$ is the number of alternative valid mappings. We further provide a complexity classification: deciding shortcut-freeness is coNP-complete, counting shortcuts is #P-complete, and finding minimal repairs is NP-hard. We also establish sample complexity bounds showing that logarithmically many label queries suffice for disambiguation in favorable cases, while querying all ambiguous positions suffices in the worst case. Experiments across eight benchmark domains validate our approach.
Chinese Translation
神经符号系统在学习过程中可以满足逻辑约束,而不必实现预期的概念标签对应关系;这一问题被称为推理捷径。我们将推理捷径形式化为一个约束满足问题,并研究在何种条件下概念映射由约束唯一确定。我们证明了一个区分性质(要求没有有效的概念映射可以通过交换两个概念值转化为另一个有效映射)在双射映射下是无捷径性的必要条件,但通过一个反例表明,即使在约束图是连通的情况下,它也不足以保证无捷径性。我们开发了一种基于ASP(Answer Set Programming)的算法,用于验证给定的约束集是否唯一确定预期的概念映射,并证明了其正确性和完备性。当检测到捷径时,一种贪婪修复算法通过增强约束集来消除它们,最多在$k$次迭代内收敛,其中$k$是替代有效映射的数量。我们进一步提供了复杂性分类:决定无捷径性是coNP完全的,计数捷径是#P完全的,而寻找最小修复是NP困难的。我们还建立了样本复杂性界限,表明在有利情况下,对数数量的标签查询足以进行消歧,而在最坏情况下,查询所有模糊位置则足够。跨越八个基准领域的实验验证了我们的方法。
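A toy illustration of the uniqueness question, assuming a brute-force enumeration over bijections in place of the ASP-based algorithm the abstract describes. The function name and the two example tasks (XOR and AND over binary concepts) are hypothetical and chosen only to show one task with a shortcut and one without.

```python
from itertools import permutations, product


def valid_bijections(values, label_fn, arity=2):
    """Return every bijective relabelling of concept values that still
    reproduces all observed labels. More than one result means a reasoning
    shortcut exists under this (brute-force) check."""
    examples = [(args, label_fn(*args)) for args in product(values, repeat=arity)]
    valid = []
    for perm in permutations(values):
        mapping = dict(zip(values, perm))
        if all(label_fn(*(mapping[a] for a in args)) == y for args, y in examples):
            valid.append(mapping)
    return valid


# XOR supervision cannot tell 0 from 1: swapping both inputs keeps every label.
xor_maps = valid_bijections([0, 1], lambda a, b: a ^ b)

# AND supervision pins the concepts down: only the identity mapping survives.
and_maps = valid_bijections([0, 1], lambda a, b: a & b)
```

The XOR case also fails the discrimination property mentioned in the abstract, since one valid mapping is obtained from another by swapping two concept values. Enumeration is factorial in the number of concept values, so it is only useful for tiny examples like this one.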
cs.AI / 24 / 2604.23392
SoccerRef-Agents: Multi-Agent System for Automated Soccer Refereeing
SoccerRef-Agents:用于自动化足球裁判的多智能体系统
Abstract
Refereeing is vital in sports, where fair, accurate, and explainable decisions are fundamental. While intelligent assistant technologies are being widely adopted in soccer refereeing, current AI-assisted approaches remain preliminary. Existing research mostly focuses on isolated video perception tasks and lacks the ability to understand and reason about foul scenarios. To fill this gap, we propose SoccerRef-Agents, a holistic and explainable multi-agent decision-making framework for soccer refereeing. The main contributions are: (i) constructing the multimodal benchmark SoccerRefBench with over 1,200 referee theory questions and 600 foul video clips; (ii) building a vector-based knowledge base RefKnowledgeDB using the latest "Laws of the Game" and a classic case database for precise, knowledge-driven reasoning; (iii) designing a novel multi-agent architecture that collaborates via cross-modal RAG to bridge the semantic gap between visual content and regulatory texts. This work explores the technical capability of integrating MLLMs with refereeing expertise, and evaluations show our system significantly outperforms general-purpose MLLMs in decision accuracy and explanation quality. All databases, benchmarks, and code will be made available.
Chinese Translation
裁判在体育比赛中至关重要,公正、准确和可解释的决策是其基本要求。尽管智能助手技术在足球裁判中得到广泛应用,但当前的人工智能辅助方法仍处于初步阶段。现有研究主要集中在孤立的视频感知任务上,缺乏对犯规场景的理解和推理能力。为填补这一空白,我们提出了SoccerRef-Agents,一个全面且可解释的足球裁判多智能体决策框架。主要贡献包括:(i)构建了多模态基准数据集SoccerRefBench,包含超过1200个裁判理论问题和600个犯规视频片段;(ii)利用最新的《比赛规则》(Laws of the Game)和经典案例数据库构建了基于向量的知识库RefKnowledgeDB,以实现精确的知识驱动推理;(iii)设计了一种新颖的多智能体架构,通过跨模态RAG协作,弥合视觉内容与规范文本之间的语义差距。本研究探讨了将多模态大语言模型(MLLMs)与裁判专业知识相结合的技术能力,评估结果表明我们的系统在决策准确性和解释质量上显著优于通用多模态大语言模型。所有数据库、基准和代码将公开提供。
cs.AI / 25 / 2604.23398
When Corrective Hints Hurt: Prompt Design in Reasoner-Guided Repair of LLM Overcaution on Entailed Negations under OWL 2 DL
当纠正提示适得其反:OWL 2 DL 下针对 LLM 在蕴含否定上过度谨慎的推理机引导修复中的提示设计
Abstract
We report a reproducible error pattern in GPT-5.4 on OWL 2 DL compliance queries: the model frequently answers "unknown" when the reasoner-entailed answer is "no" under FunctionalProperty closure or class disjointness. Using 180 reasoner-audited queries from a procedural expansion of the observed pattern plus 18 hand-authored held-out queries in two unrelated domains (insurance and clinical), we compare four interaction modes under matched query budget: single-shot, three rounds of generic "you-are-wrong" retry, three rounds of reasoner-verdict repair with an open-world-assumption (OWA) hint, and the same repair without the hint. Direct faithfulness is 43.9% (Wilson 95% CI $[36.8,51.2]$); generic retry reaches 81.7% ($[75.4,86.6]$); the verdict-with-hint variant is worse at 67.2% ($[60.1,73.7]$); the verdict-only variant reaches 97.8% ($[94.4,99.1]$). All pairwise comparisons remain significant under McNemar's exact test with Bonferroni correction ($\alpha = 0.01$; all $p < 10^{-5}$). The same fingerprint accounts for 4/4 errors on the held-out queries. Our interpretation is bounded: prompt framing can matter more than corrective content, and reasoner-guided wrappers should be ablated explicitly.
Chinese Translation
我们报告了 GPT-5.4 在 OWL 2 DL 合规查询中的一种可复现错误模式:在 FunctionalProperty 闭包或类不相交性(disjointness)条件下,当推理机蕴含的答案为“否”时,模型经常回答“未知”。我们使用了 180 个经过推理机审核的查询(由所观察模式的程序化扩展得到),以及在两个不相关领域(保险和临床)中手工编写的 18 个保留查询,在匹配的查询预算下比较了四种交互模式:单次尝试、三轮通用的“你错了”重试、三轮带有开放世界假设(OWA)提示的推理机裁决修复,以及不带提示的相同修复。直接忠实度为 43.9%(Wilson 95% CI $[36.8,51.2]$);通用重试达到 81.7%($[75.4,86.6]$);带提示的裁决变体反而更差,为 67.2%($[60.1,73.7]$);仅裁决变体达到 97.8%($[94.4,99.1]$)。经 Bonferroni 校正后,所有成对比较在 McNemar 精确检验下均显著($\alpha = 0.01$;所有 $p < 10^{-5}$)。相同的指纹解释了保留查询中 4/4 的错误。我们的解读是有限度的:提示框架可能比纠正内容更为重要,推理机引导的包装器应明确进行消融。
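The Wilson intervals quoted above can be checked with a short sketch. The success counts (79, 147, 121, 176 out of 180) are not stated in the abstract; they are inferred here as the only integer counts that round to the reported percentages, and z = 1.96 is assumed for the 95% level.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Inferred counts (assumption): the abstract reports percentages only.
counts = {
    "single-shot": 79,
    "generic retry": 147,
    "verdict with OWA hint": 121,
    "verdict only": 176,
}

intervals = {}
for mode, k in counts.items():
    low, high = wilson_interval(k, 180)
    intervals[mode] = (round(100 * low, 1), round(100 * high, 1))
    print(f"{mode}: {100 * k / 180:.1f}% CI [{intervals[mode][0]}, {intervals[mode][1]}]")
```

With these inferred counts the sketch should reproduce the four reported intervals, which suggests the percentages are computed over the 180 reasoner-audited queries.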
cs.AI / 26 / 2604.23446
IndustryAssetEQA: A Neurosymbolic Operational Intelligence System for Embodied Question Answering in Industrial Asset Maintenance
IndustryAssetEQA:一种用于工业资产维护的具身问答的神经符号操作智能系统
Abstract
Industrial maintenance environments increasingly rely on AI systems to assist operators in understanding asset behavior, diagnosing failures, and evaluating interventions. Although large language models (LLMs) enable fluent natural-language interaction, deployed maintenance assistants routinely produce generic explanations that are weakly grounded in telemetry, omit verifiable provenance, and offer no testable support for counterfactual or action-oriented reasoning, shortcomings that undermine trust in safety-critical settings. We present IndustryAssetEQA, a neurosymbolic operational intelligence system that combines episodic telemetry representations with a Failure Mode Effects Analysis Knowledge Graph (FMEA-KG) to enable Embodied Question Answering (EQA) over industrial assets. We evaluate on four datasets covering four industrial asset types: rotating machinery, turbofan engines, hydraulic systems, and cyber-physical production systems. Compared to LLM-only baselines, IndustryAssetEQA improves structural validity by up to 0.51, counterfactual accuracy by up to 0.47, and explanation entailment by 0.64, while reducing severe expert-rated overclaims from 28% to 2% (approximately 93% reduction). Code, datasets, and the FMEA-KG are available at https://github.com/IBM/AssetOpsBench/tree/IndustryAssetEQA/IndustryAssetEQA.
Chinese Translation
工业维护环境越来越依赖人工智能系统来帮助操作员理解资产行为、诊断故障和评估干预措施。尽管大型语言模型(LLMs)能够实现流畅的自然语言交互,但部署的维护助手常常生成与遥测数据弱相关的通用解释,遗漏可验证的来源,并且未能提供可测试的反事实或行动导向推理支持,这在安全关键环境中削弱了信任。我们提出了IndustryAssetEQA,这是一种神经符号操作智能系统,结合了情节遥测表示和故障模式影响分析知识图谱(FMEA-KG),以实现对工业资产的具身问答(EQA)。我们在涵盖四种工业资产类型的四个数据集上进行了评估,包括旋转机械、涡扇发动机、液压系统和网络物理生产系统。与仅使用LLM的基线相比,IndustryAssetEQA在结构有效性上提高了最多0.51,在反事实准确性上提高了最多0.47,在解释蕴含上提高了0.64,同时将专家评定的严重过度声明从28%降低至2%(约93%的减少)。代码、数据集和FMEA-KG可在https://github.com/IBM/AssetOpsBench/tree/IndustryAssetEQA/IndustryAssetEQA获取。
cs.AI / 27 / 2604.23449
ArguAgent: AI-Supported Real-Time Grouping for Productive Argumentation in STEM Classrooms
ArguAgent:支持实时分组的人工智能系统,以促进STEM课堂中的有效辩论
Abstract
Argumentation is a core practice in STEM education, but its productivity depends on who participates and how they interact. Higher-achieving students often dominate the talk and decision-making, while lower-achieving peers may disengage, defer, or comply without contributing substantive reasoning. Forming groups strategically based on students' stances and argumentation skills could help foster inclusive, evidence-based discourse. In practice, however, teachers are constrained in implementing this grouping strategy because it requires real-time insight into students' positions and the quality of their argumentation, information that is difficult to assess reliably and at scale during instruction. We present a generative AI-powered system, ArguAgent, that creates groups optimizing for stance heterogeneity while constraining argumentation quality differences to +/-1 level on a validated learning progression. ArguAgent uses a two-component assessment pipeline: first scoring student arguments on a 0-4 rubric, then clustering positions via semantic analysis. We validated the scoring component against human expert consensus (Krippendorff's $\alpha$ = 0.817) using 200 expert-generated scores. Testing three OpenAI models (GPT-4o-mini, GPT-5.1, GPT-5.2) with identical calibrated prompts, we found that systematic prompt engineering informed by human disagreement analysis contributed 89% of scoring improvement (QWK: 0.531 to 0.686), while model upgrades contributed an additional 11% (QWK: 0.686 to 0.708). Simulation testing across 100 classes demonstrated that the grouping algorithm achieves 95.4% of groups that meet both design criteria, a 3.2x improvement over random assignment. These results suggest ArguAgent can enable real-time, theoretically grounded grouping that promotes productive STEM argumentation in classrooms.
Chinese Translation
辩论是STEM教育中的核心实践,但其有效性依赖于参与者及其互动方式。高成就的学生往往主导谈话和决策,而低成就的同伴可能会疏离、听从他人或在没有实质性推理的情况下顺从。根据学生的立场和辩论技能战略性地形成小组有助于促进包容性和基于证据的讨论。然而,在实践中,教师在实施这种分组策略时受到限制,因为这需要实时了解学生的立场和辩论质量,而在教学过程中可靠且大规模地评估这些信息是困难的。我们提出了一种基于生成性人工智能的系统ArguAgent,该系统创建优化立场异质性的分组,同时将辩论质量差异限制在经过验证的学习进程的±1级。ArguAgent使用两部分评估流程:首先根据0-4的评分标准对学生的辩论进行评分,然后通过语义分析对立场进行聚类。我们使用200个专家生成的评分验证了评分组件与人类专家共识的一致性(Krippendorff's α = 0.817)。在对三种OpenAI模型(GPT-4o-mini、GPT-5.1、GPT-5.2)进行相同的校准提示测试时,我们发现基于人类分歧分析的系统提示工程贡献了评分改进的89%(QWK: 0.531到0.686),而模型升级贡献了额外的11%(QWK: 0.686到0.708)。在100个班级的模拟测试中,分组算法实现了95.4%的小组满足两个设计标准,比随机分配提高了3.2倍。这些结果表明,ArguAgent可以实现实时的、理论基础的分组,促进课堂中有效的STEM辩论。
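The two grouping criteria can be made concrete with a small check. This is a sketch under assumed definitions, since the abstract does not formalise them: "stance heterogeneity" is read here as at least two distinct stances per group, and the +/-1 constraint as a spread of at most one rubric level within a group. The student tuples are invented for illustration.

```python
def group_meets_criteria(group, max_level_gap=1):
    """group: list of (stance, rubric_level) pairs, with levels on the 0-4 rubric.
    Returns True when the group mixes stances and keeps levels close together."""
    stances = {stance for stance, _ in group}
    levels = [level for _, level in group]
    return len(stances) >= 2 and max(levels) - min(levels) <= max_level_gap


def share_of_valid_groups(groups):
    """Fraction of groups satisfying both assumed criteria."""
    return sum(group_meets_criteria(g) for g in groups) / len(groups)


# Invented example data: (stance, level) per student.
groups = [
    [("for", 2), ("against", 3), ("for", 3)],      # mixed stances, spread of 1
    [("for", 1), ("for", 2), ("for", 1)],          # single stance
    [("against", 0), ("for", 4), ("against", 2)],  # mixed stances, spread of 4
    [("for", 3), ("against", 4)],                  # mixed stances, spread of 1
]

valid_share = share_of_valid_groups(groups)
```

A grouping algorithm would search over partitions of a class to maximise this share; the check above only scores a partition that has already been proposed.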
cs.AI / 28 / 2604.23460
Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models
潜在动机:在连续思维模型中检测不一致的推理
Abstract
Chain-of-Thought (CoT) reasoning has emerged as a key technique for eliciting complex reasoning in Large Language Models (LLMs). Although interpretable, its dependence on natural language limits the model's expressive bandwidth. Continuous thought models address this bottleneck by reasoning in latent space rather than human-readable tokens. While they enable richer representations and faster inference, they raise a critical safety question: how can we detect misaligned reasoning in an uninterpretable latent space? To study this, we introduce MoralChain, a benchmark of 12,000 social scenarios with parallel moral/immoral reasoning paths. We train a continuous thought model with backdoor behavior using a novel dual-trigger paradigm - one trigger that arms misaligned latent reasoning ([T]) and another that releases harmful outputs ([O]). We demonstrate three findings: (1) continuous thought models can exhibit misaligned latent reasoning while producing aligned outputs, with aligned and misaligned reasoning occupying geometrically distinct regions of latent space; (2) linear probes trained on behaviorally-distinguishable conditions ([T][O] vs [O]) transfer to detecting armed-but-benign states ([T] vs baseline) with high accuracy; and (3) misalignment is encoded in early latent thinking tokens, suggesting safety monitoring for continuous thought models should target the "planning" phase of latent reasoning.
Chinese Translation
链式思维(Chain-of-Thought, CoT)推理已成为在大型语言模型(Large Language Models, LLMs)中引发复杂推理的关键技术。尽管其具有可解释性,但对自然语言的依赖限制了模型的表达带宽。连续思维模型通过在潜在空间中进行推理而非人类可读的标记来解决这一瓶颈。虽然它们能够实现更丰富的表示和更快的推理,但也引发了一个关键的安全问题:我们如何在不可解释的潜在空间中检测不一致的推理?为此,我们引入了MoralChain,一个包含12,000个社会情境的基准,具有平行的道德/不道德推理路径。我们使用一种新颖的双触发范式训练了一个具有后门行为的连续思维模型——一个触发器激活不一致的潜在推理([T]),另一个触发器释放有害输出([O])。我们展示了三个发现:(1)连续思维模型可以在产生一致输出的同时表现出不一致的潜在推理,其中一致和不一致的推理占据潜在空间中几何上不同的区域;(2)在行为可区分条件下训练的线性探针([T][O]与[O])能够高准确率地转移到检测武装但良性状态([T]与基线);(3)不一致性编码在早期的潜在思维标记中,这表明对连续思维模型的安全监控应针对潜在推理的“规划”阶段。
cs.AI / 29 / 2604.23472
Escher-Loop: Mutual Evolution by Closed-Loop Self-Referential Optimization
Escher-Loop:通过闭环自我参考优化的相互演化
Abstract
While recent autonomous agents demonstrate impressive capabilities, they predominantly rely on manually scripted workflows and handcrafted heuristics, inherently limiting their potential for open-ended improvement. To address this, we propose Escher-Loop, a fully closed-loop framework that operationalizes the mutual evolution of two distinct populations: Task Agents that solve concrete problems, and Optimizer Agents that recursively refine both the task agents and themselves. To sustain this self-referential evolution, we propose a dynamic benchmarking mechanism that seamlessly reuses the empirical scores of newly generated task agents as relative win-loss signals to update optimizers' scores. This mechanism leverages the evolution of task agents as an inherent signal to drive the evaluation and refinement of optimizers without additional overhead. Empirical evaluations on mathematical optimization problems demonstrate that Escher-Loop effectively pushes past the performance ceilings of static baselines, achieving the highest absolute peak performance across all evaluated tasks under matched compute. Remarkably, we observe that the optimizer agents dynamically adapt their strategies to match the shifting demands of high-performing task agents, which explains the system's continuous improvement and superior late-stage performance.
Chinese Translation
尽管最近的自主智能体展示了令人印象深刻的能力,但它们主要依赖于手动编写的工作流程和精心设计的启发式方法,这在本质上限制了它们开放式改进的潜力。为了解决这个问题,我们提出了Escher-Loop,一个完全闭环的框架,旨在实现两种不同群体的相互演化:解决具体问题的任务智能体(Task Agents)和递归优化任务智能体及其自身的优化智能体(Optimizer Agents)。为了维持这种自我参考的演化,我们提出了一种动态基准机制,该机制无缝地重用新生成的任务智能体的经验得分作为相对胜负信号,以更新优化器的得分。该机制利用任务智能体的演化作为内在信号,推动优化器的评估和改进,而无需额外的开销。在数学优化问题上的实证评估表明,Escher-Loop有效地突破了静态基准的性能上限,在匹配计算条件下实现了所有评估任务中的最高绝对峰值性能。值得注意的是,我们观察到优化智能体动态调整其策略,以适应高性能任务智能体不断变化的需求,这解释了系统的持续改进和卓越的后期性能。
cs.AI / 30 / 2604.23483
Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines
代理对抗重写揭示黑箱自然语言处理管道中的架构脆弱性
Abstract
Multi-component natural language processing (NLP) pipelines are increasingly deployed for high-stakes decisions, yet no existing adversarial method can test their robustness under realistic conditions: binary-only feedback, no gradient access, and strict query budgets. We formalize this strict black-box threat model and propose a two-agent evasion framework operating in a semantic perturbation space. An Attacker Agent generates meaning-preserving rewrites while a Prompt Optimization Agent refines the attack strategy using only binary decision feedback within a 10-query budget. Evaluated against four evidence-based misinformation detection pipelines, the framework achieves evasion rates of 19.95% to 40.34% on modern large language model (LLM) based systems, compared to at most 3.90% for token-level perturbation baselines, which rely on surrogate models because they cannot operate under our threat model. A legacy system relying on static lexical retrieval exhibits near-total vulnerability (97.02%), establishing a lower bound that exposes how architectural choices govern the attack surface. Evasion effectiveness is associated with three architectural properties: evidence retrieval mechanism, retrieval-inference coupling, and baseline classification accuracy. The iterative prompt optimization yields the largest marginal gains against the most robust targets, confirming that adaptive strategy discovery is essential when evasion is non-trivial. Analysis of successful rewrites reveals four exploitation patterns, each targeting failures at distinct pipeline stages. A pattern-informed defense reduces the evasion rate by up to 65.18%.
Chinese Translation
多组件自然语言处理(NLP)管道越来越多地用于高风险决策,但现有的对抗方法无法在现实条件下测试其鲁棒性:仅提供二元反馈、无法访问梯度以及严格的查询预算。我们形式化了这一严格的黑箱威胁模型,并提出了一种在语义扰动空间中操作的双代理规避框架。攻击代理(Attacker Agent)生成保持意义的重写,而提示优化代理(Prompt Optimization Agent)在10次查询预算内仅使用二元决策反馈来优化攻击策略。针对四个基于证据的错误信息检测管道进行评估,该框架在现代大型语言模型(LLM)系统上实现了19.95%至40.34%的规避率,而基于代用模型的令牌级扰动基线最多仅为3.90%,因为它们无法在我们的威胁模型下操作。依赖静态词汇检索的遗留系统表现出近乎完全的脆弱性,达到97.02%,建立了一个下限,揭示了架构选择如何影响攻击面。规避的有效性与三种架构特性相关:证据检索机制、检索-推理耦合和基线分类准确性。迭代提示优化在最强大的目标上产生了最大的边际收益,确认了在规避非平凡时自适应策略发现的重要性。对成功重写的分析揭示了四种利用模式,每种模式针对不同管道阶段的失败。基于模式的防御将规避率降低了多达65.18%。
cs.AI / 31 / 2604.23494
Do Transaction-Level and Actor-Level AML Queues Agree? An Empirical Evaluation of Granularity Effects on the Elliptic++ Graph
交易级别与参与者级别的反洗钱(AML)队列是否一致?对 Elliptic++ 图的粒度效应的实证评估
Abstract
Graph-based anti-money laundering (AML) systems on blockchain networks can score suspicious activity at two granularity levels -- transactions or actor addresses -- yet compliance action is conducted per actor. This paper contributes an evaluation methodology for measuring how scoring granularity affects investigation queue composition under fixed review budgets. We formalize the evaluation through a projection framework mapping transaction-level scores to the actor-level action unit via four aggregation operators, and introduce budgeted investigation metrics -- yield@budget, burden decomposition, and case fragmentation. Using the public Elliptic++ Bitcoin dataset (203,769 transactions; 822,942 address occurrences), we train independent random forest classifiers at each level under a causal temporal protocol and compare review queues through Jaccard overlap, burden decomposition, and feature-matching ablations. At one-percent budget, temporal evaluation yields mean Jaccard of 0.374 (SD 0.171); static pooled evaluation yields 0.087 (95% CI [0.079, 0.094]). An enriched address model receiving all 237 features produces even lower overlap (Jaccard=0.051), with 4.3% illicit per 100 reviews versus 30.2% for the transaction-projected queue. Address-level detection value is temporally concentrated: two timesteps exceed 91% illicit per 100 reviews while the static burden is only 3.4%. A fixed hybrid policy underperforms the best single-level queue by 5.05pp (CI [-10.2pp, -0.9pp]). These findings establish that scoring granularity is a consequential design variable for AML investigation systems -- same data, same budget, different queues, different addresses investigated.
Chinese Translation
基于图的区块链网络反洗钱(AML)系统可以在两个粒度级别(交易或参与者地址)对可疑活动进行评分,但合规行动是按参与者进行的。本文贡献了一种评估方法,用于测量评分粒度如何在固定审查预算下影响调查队列的组成。我们通过一个投影框架形式化评估,将交易级别的评分映射到参与者级别的行动单元,采用四种聚合运算符,并引入预算调查指标——yield@budget、负担分解和案件碎片化。利用公共的 Elliptic++ 比特币数据集(203,769 笔交易;822,942 个地址出现),我们在因果时间协议下训练每个级别的独立随机森林分类器,并通过 Jaccard 重叠、负担分解和特征匹配消融比较审查队列。在一百分之一的预算下,时间评估的平均 Jaccard 为 0.374(标准差 0.171);静态汇总评估为 0.087(95% 置信区间 [0.079, 0.094])。一个接收所有 237 个特征的丰富地址模型产生了更低的重叠(Jaccard=0.051),每 100 次审查中有 4.3% 是非法的,而交易投影队列则为 30.2%。地址级别的检测价值在时间上高度集中:两个时间步长中每 100 次审查超过 91% 是非法的,而静态负担仅为 3.4%。固定混合策略的表现比最佳单级队列低 5.05 个百分点(置信区间 [-10.2pp, -0.9pp])。这些发现表明,评分粒度是反洗钱调查系统中的一个重要设计变量——相同的数据,相同的预算,不同的队列,不同的地址被调查。
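The queue comparison can be sketched in a few lines. This is an illustration only: the abstract mentions four aggregation operators without naming them, so `max` is assumed as one plausible choice, and the transaction scores, address scores, and 50% budget below are invented toy values rather than Elliptic++ data.

```python
from collections import defaultdict


def project_to_actors(tx_scores, tx_actors, aggregate=max):
    """Map transaction-level scores onto the actors (addresses) involved in
    each transaction, combining multiple scores with `aggregate`."""
    per_actor = defaultdict(list)
    for tx, score in tx_scores.items():
        for actor in tx_actors.get(tx, ()):
            per_actor[actor].append(score)
    return {actor: aggregate(scores) for actor, scores in per_actor.items()}


def review_queue(scores, budget_fraction):
    """Top-scoring actors that fit in the review budget (ties broken by name)."""
    k = max(1, int(len(scores) * budget_fraction))
    ranked = sorted(scores, key=lambda actor: (-scores[actor], actor))
    return set(ranked[:k])


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Invented toy data.
tx_scores = {"t1": 0.9, "t2": 0.2, "t3": 0.7, "t4": 0.1}
tx_actors = {"t1": ["A", "B"], "t2": ["B", "C"], "t3": ["D"], "t4": ["C"]}
address_scores = {"A": 0.3, "B": 0.8, "C": 0.6, "D": 0.9}

projected_queue = review_queue(project_to_actors(tx_scores, tx_actors), 0.5)
address_queue = review_queue(address_scores, 0.5)
overlap = jaccard(projected_queue, address_queue)
```

Here the transaction-projected queue and the address-level queue share only one of three distinct actors, which mirrors on a small scale the low overlap the paper reports at fixed budgets.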
cs.AI / 32 / 2604.23539
MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation
MetaGAI:生成性人工智能模型和数据卡生成的大规模高质量基准
Abstract
The rapid proliferation of Generative AI necessitates rigorous documentation standards for transparency and governance. However, manual creation of Model and Data Cards is not scalable, while automated approaches lack large-scale, high-fidelity benchmarks for systematic evaluation. We introduce MetaGAI, a comprehensive benchmark comprising 2,541 verified document triplets constructed through semantic triangulation of academic papers, GitHub repositories, and Hugging Face artifacts. Unlike prior single-source datasets, MetaGAI employs a multi-agent framework with specialized Retriever, Generator, and Editor agents, validated through four-dimensional human-in-the-loop assessment, including human evaluation of editor-refined ground truth. We establish a robust evaluation protocol combining automated metrics with validated LLM-as-a-Judge frameworks. Extensive analysis reveals that sparse Mixture-of-Experts architectures achieve superior cost-quality efficiency, while a fundamental trade-off exists between faithfulness and completeness. MetaGAI provides a foundational testbed for benchmarking, training, and analyzing automated Model and Data Card generation methods at scale. Our data and code are available at: https://github.com/haoxuan-unt2024/MetaGAI-Benchmark.
Chinese Translation
生成性人工智能的快速发展要求建立严格的文档标准以确保透明性和治理。然而,手动创建模型和数据卡并不可扩展,而自动化方法缺乏大规模、高保真的基准用于系统评估。我们介绍了MetaGAI,这是一个综合基准,包含2,541个经过验证的文档三元组,通过对学术论文、GitHub库和Hugging Face工件的语义三角测量构建而成。与以往的单一来源数据集不同,MetaGAI采用了一个多代理框架,包含专门的检索器(Retriever)、生成器(Generator)和编辑器(Editor)代理,并通过四维人机协作评估进行验证,包括对编辑精炼的真实数据的人工评估。我们建立了一个强大的评估协议,结合了自动化指标和经过验证的LLM-as-a-Judge框架。广泛的分析表明,稀疏的专家混合架构在成本-质量效率方面表现优越,而在忠实性和完整性之间存在基本的权衡。MetaGAI为大规模基准测试、训练和分析自动化模型和数据卡生成方法提供了基础测试平台。我们的数据和代码可在以下网址获取:https://github.com/haoxuan-unt2024/MetaGAI-Benchmark。
cs.AI / 33 / 2604.23588
FinGround: Detecting and Grounding Financial Hallucinations via Atomic Claim Verification
FinGround:通过原子声明验证检测和定位金融幻觉
Abstract
Financial AI systems must produce answers grounded in specific regulatory filings, yet current LLMs fabricate metrics, invent citations, and miscalculate derived quantities. These errors carry direct regulatory consequences as the EU AI Act's high-risk enforcement deadline approaches (August 2026). Existing hallucination detectors treat all claims uniformly, missing 43% of computational errors that require arithmetic re-verification against structured tables. We present FinGround, a three-stage verify-then-ground pipeline for financial document QA. Stage 1 performs finance-aware hybrid retrieval over text and tables. Stage 2 decomposes answers into atomic claims classified by a six-type financial taxonomy and verified with type-routed strategies including formula reconstruction. Stage 3 rewrites unsupported claims with paragraph- and table-cell-level citations. To cleanly isolate verification value from retrieval quality, we propose retrieval-equalized evaluation as standard methodology for RAG verification research: when all systems receive identical retrieval, FinGround still reduces hallucination rates by 68% over the strongest baseline ($p < 0.01$). The full pipeline achieves a 78% reduction relative to GPT-4o. An 8B distilled detector retains 91.4% F1 at 18x lower per-claim latency, enabling $0.003/query deployment, supported by qualitative signals from a four-week analyst pilot.
Chinese Translation
金融人工智能系统必须基于特定的监管文件生成答案,然而当前的大型语言模型(LLMs)却会虚构指标、编造引用和错误计算衍生量。这些错误在欧盟人工智能法案的高风险执法截止日期(2026年8月)临近时,带来了直接的监管后果。现有的幻觉检测器对所有声明采取统一处理,错过了43%的计算错误,这些错误需要与结构化表格进行算术重新验证。我们提出了FinGround,一个针对金融文档问答的三阶段验证-再定位管道。第一阶段在文本和表格上执行金融感知的混合检索。第二阶段将答案分解为原子声明,并根据六类金融分类法进行分类,使用包括公式重建在内的类型路由策略进行验证。第三阶段用段落和表格单元级的引用重写不支持的声明。为了清晰地将验证价值与检索质量隔离,我们提出了检索均衡评估作为RAG验证研究的标准方法:当所有系统接收相同的检索时,FinGround仍然将幻觉率降低了68%,相较于最强基线($p < 0.01$)。完整管道相对于GPT-4o实现了78%的减少。一种8B蒸馏检测器在每个声明的延迟降低18倍的情况下仍保留了91.4%的F1分数,使得每次查询的部署成本为$0.003/query,得到了为期四周的分析师试点的定性信号支持。
cs.AI / 34 / 2604.23593
When AI reviews science: Can we trust the referee?
当人工智能评审科学时:我们能信任评审者吗?
Abstract
The volume of scientific submissions continues to climb, outpacing the capacity of qualified human referees and stretching editorial timelines. At the same time, modern large language models (LLMs) offer impressive capabilities in summarization, fact checking, and literature triage, making the integration of AI into peer review increasingly attractive -- and, in practice, unavoidable. Yet early deployments and informal adoption have exposed acute failure modes. Recent incidents have revealed that hidden prompt injections embedded in manuscripts can steer LLM-generated reviews toward unjustifiably positive judgments. Complementary studies have also demonstrated brittleness to adversarial phrasing, authority and length biases, and hallucinated claims. These episodes raise a central question for scholarly communication: when AI reviews science, can we trust the AI referee? This paper provides a security- and reliability-centered analysis of AI peer review. We map attacks across the review lifecycle -- training and data retrieval, desk review, deep review, rebuttal, and system-level. We instantiate this taxonomy with four treatment-control probes on a stratified set of ICLR 2025 submissions, using two advanced LLM-based referees to isolate the causal effects of prestige framing, assertion strength, rebuttal sycophancy, and contextual poisoning on review scores. Together, this taxonomy and experimental audit provide an evidence-based baseline for assessing and tracking the reliability of AI peer review and highlight concrete failure points to guide targeted, testable mitigations.
Chinese Translation
科学投稿的数量持续增长,超出了合格人类评审员的能力,并延长了编辑时间。与此同时,现代大型语言模型(LLMs)在摘要、事实检查和文献筛选方面展现出令人印象深刻的能力,使得将人工智能整合到同行评审中变得越来越有吸引力,在实践中也变得不可避免。然而,早期的部署和非正式的采用暴露了严重的失败模式。最近的事件揭示了隐藏在手稿中的提示注入可以将LLM生成的评审引导向不合理的积极判断。补充研究还表明,这类评审在对抗性措辞面前十分脆弱,并存在权威偏见、长度偏见和幻觉性陈述。这些事件提出了一个学术交流的核心问题:当人工智能评审科学时,我们能信任人工智能评审者吗?本文提供了一个以安全性和可靠性为中心的人工智能同行评审分析。我们梳理了评审生命周期各个阶段(训练和数据检索、初步评审、深入评审、反驳和系统级别)中的攻击。我们在2025年国际学习表征会议(ICLR)投稿的分层样本上,用四个处理-对照探针将这一分类法实例化,并使用两个先进的基于LLM的评审员来隔离声望框架、断言强度、反驳谄媚和上下文中毒对评审分数的因果影响。结合起来,这一分类法和实验审计为评估和跟踪人工智能同行评审的可靠性提供了基于证据的基线,并突出了具体的失败点,以指导有针对性的、可测试的缓解措施。
cs.AI / 35 / 2604.23605
Thinking Like a Clinician: A Cognitive AI Agent for Clinical Diagnosis via Panoramic Profiling and Adversarial Debate
像临床医生一样思考:通过全景画像和对抗辩论实现临床诊断的认知人工智能代理
Abstract
The application of large language models (LLMs) in clinical decision support faces two significant challenges: "tunnel vision" and diagnostic hallucinations that arise when processing unstructured electronic health records (EHRs). To address these challenges, we propose a novel chain-based clinical reasoning framework, called DxChain, which transforms the diagnostic workflow into an iterative process by mirroring a clinician's cognitive trajectory, consisting of "Memory Anchoring", "Navigation", and "Verification" phases. DxChain introduces three key methodological innovations to elicit the potential of LLMs: (i) a Profile-Then-Plan paradigm to mitigate cold-start hallucinations by establishing a panoramic patient baseline, (ii) a Medical Tree-of-Thoughts (Med-ToT) algorithm for strategic look-ahead planning and resource-aware navigation, and (iii) a Dialectical Diagnostic Verification procedure utilizing "Angel-Devil" adversarial debates to resolve complex evidence conflicts. Evaluated on two real-world benchmarks, MIMIC-IV-Ext Cardiac Disease and MIMIC-IV-Ext CDM, DxChain achieves state-of-the-art performance in both diagnostic accuracy and logical consistency, offering a modular and reliable architecture for next-generation clinical AI. The code is at https://anonymous.4open.science/r/Dx-Chain.
Chinese Translation
大型语言模型(LLMs)在临床决策支持中的应用面临着在处理非结构化电子健康记录(EHRs)时出现的“隧道视野”和诊断幻觉等重大挑战。为了解决这些挑战,我们提出了一种新颖的基于链的临床推理框架,称为DxChain,该框架通过模拟临床医生的认知轨迹,将诊断工作流程转变为一个迭代过程,包含“记忆锚定”、“导航”和“验证”三个阶段。DxChain引入了三项关键的方法创新,以发挥LLM的潜力:(i)通过建立全景患者基线的“先画像后计划”范式来减轻冷启动幻觉,(ii)用于战略前瞻规划和资源感知导航的医学思维树(Medical Tree-of-Thoughts, Med-ToT)算法,以及(iii)利用“天使-恶魔”对抗辩论的辩证诊断验证程序,以解决复杂的证据冲突。在两个真实世界基准测试MIMIC-IV-Ext心脏病和MIMIC-IV-Ext CDM上进行评估,DxChain在诊断准确性和逻辑一致性方面均达到了最先进的性能,为下一代临床人工智能提供了模块化和可靠的架构。代码可在 https://anonymous.4open.science/r/Dx-Chain 获取。
cs.AI / 36 / 2604.23623
Tandem: Riding Together with Large and Small Language Models for Efficient Reasoning
Tandem:与大语言模型和小语言模型协同合作以实现高效推理
Abstract
Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning-intensive inference paradigms, where models perform explicit step-by-step reasoning before generating final answers. While such approaches improve answer quality and interpretability, they incur substantial computational overhead due to the prolonged generation sequences. In this paper, we propose Tandem, a novel collaborative framework that synergizes large and small language models (LLMs and SLMs) to achieve high-quality reasoning with significantly reduced computational cost. Specifically, the LLM serves as a strategic coordinator, efficiently generating a compact set of critical reasoning insights. These insights are then used to guide a smaller, more efficient SLM in executing the full reasoning process and delivering the final response. To balance efficiency and reliability, Tandem introduces a cost-aware termination mechanism that adaptively determines when sufficient reasoning guidance has been accumulated, enabling early stopping of the LLM's generation. Experiments on mathematical reasoning and code generation benchmarks demonstrate that Tandem reduces computational costs by approximately 40% compared to standalone LLM reasoning, while achieving superior or competitive performance. Furthermore, the sufficiency classifier trained on one domain transfers effectively to others without retraining. The code is available at: https://github.com/Applied-Machine-Learning-Lab/ACL2026_Tandem.
Chinese Translation
近期大语言模型(LLMs)的进展催生了推理密集型推理范式的兴起,在这种范式中,模型在生成最终答案之前执行明确的逐步推理。尽管此类方法提高了答案的质量和可解释性,但由于生成序列的延长,它们会产生相当大的计算开销。本文提出了Tandem,一个新颖的协作框架,旨在协同大语言模型和小语言模型(LLMs和SLMs),以实现高质量推理并显著降低计算成本。具体而言,LLM作为战略协调者,有效地生成一组紧凑的关键推理见解。这些见解随后用于指导一个更小、更高效的SLM执行完整的推理过程并提供最终响应。为了平衡效率和可靠性,Tandem引入了一种成本感知的终止机制,该机制自适应地确定何时积累了足够的推理指导,从而使LLM的生成能够提前停止。在数学推理和代码生成基准测试中的实验表明,与独立的LLM推理相比,Tandem将计算成本降低了约40%,同时实现了更优或具有竞争力的性能。此外,在一个领域上训练的充分性分类器能够有效地转移到其他领域,而无需重新训练。代码可在以下链接获取:https://github.com/Applied-Machine-Learning-Lab/ACL2026_Tandem。
cs.AI / 37 / 2604.23633
Causal Discovery as Dialectical Aggregation: A Quantitative Argumentation Framework
因果发现作为辩证聚合:一种定量论证框架
Abstract
Constraint-based causal discovery is brittle in finite-sample regimes because erroneous conditional-independence (CI) decisions can cascade into substantial structural errors. We propose Quantitative Argumentation for Causal Discovery (QACD), a semantics-driven framework that represents CI outcomes as graded, defeasible arguments rather than irreversible constraints. QACD maps statistical test outcomes to argument strengths and aggregates conflicting evidence through connectivity-mediated witness propagation, producing a fixed-point acceptability labeling over candidate adjacencies. Experiments on standard benchmark Bayesian networks suggest that QACD improves structural coherence and interventional reliability in several noisy or inconsistent CI regimes, while remaining competitive with classical constraint-based, hybrid, and prior argumentation-based baselines.
Chinese Translation
基于约束的因果发现在有限样本条件下较为脆弱,因为错误的条件独立性(CI)决策可能导致显著的结构性错误。我们提出了用于因果发现的定量论证(QACD),这是一个以语义为驱动的框架,将CI结果表示为分级的、可反驳的论证,而不是不可逆的约束。QACD将统计检验结果映射为论证强度,并通过连接介导的证人传播聚合相互冲突的证据,从而在候选邻接上产生固定点可接受性标记。在标准基准贝叶斯网络上的实验表明,QACD在多个嘈杂或不一致的CI条件下提高了结构一致性和干预可靠性,同时在与经典的基于约束的、混合的以及先前的基于论证的基线相比时仍保持竞争力。
cs.AI / 38 / 2604.23646
Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture
通过权力分立架构在人工智能代理中结构性强化目标完整性
Abstract
Recent evidence suggests that frontier AI systems can exhibit agentic misalignment, generating and executing harmful actions derived from internally constructed goals, even without explicit user requests. Existing mitigation methods, such as Reinforcement Learning from Human Feedback (RLHF) and constitutional prompting, operate primarily at the model level and provide only probabilistic safety guarantees. We propose the Policy-Execution-Authorization (PEA) architecture, a "separation-of-powers" design that enforces safety at the system level. PEA decouples intent generation, authorization, and execution into independent, isolated layers connected via cryptographically constrained capability tokens. We present five core contributions: (C1) an Intent Verification Layer (IVL) for ensuring capability-intent consistency; (C2) Intent Lineage Tracking (ILT), which binds all executable intents to the originating user request via cryptographic anchors; (C3) Goal Drift Detection, which rejects semantically divergent intents below a configurable threshold; (C4) an Output Semantic Gate (OSG) that detects implicit coercion using a structured $K \times I \times P$ threat calculus (Knowledge, Influence, Policy); and (C5) a formal verification framework proving that goal integrity is maintained even under adversarial model compromise. By shifting agent alignment from a behavioral property to a structurally enforced system constraint, PEA provides a robust foundation for the governance of autonomous agents.
Chinese Translation
最近的证据表明,前沿人工智能系统可能表现出代理性失调,生成并执行源自内部构建目标的有害行为,即使没有明确的用户请求。现有的缓解方法,如基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)和宪法提示,主要在模型层面运作,仅提供概率性的安全保障。我们提出了政策执行授权(Policy-Execution-Authorization, PEA)架构,这是一种“权力分立”设计,旨在系统层面强制执行安全性。PEA将意图生成、授权和执行解耦为通过加密约束的能力令牌连接的独立、隔离层。我们提出了五个核心贡献:(C1)意图验证层(Intent Verification Layer, IVL),确保能力与意图的一致性;(C2)意图来源追踪(Intent Lineage Tracking, ILT),通过加密锚点将所有可执行意图绑定到原始用户请求;(C3)目标漂移检测,拒绝低于可配置阈值的语义偏离意图;(C4)输出语义门(Output Semantic Gate, OSG),使用结构化的 $K \times I \times P$ 威胁计算(知识、影响、政策)检测隐性强迫;(C5)一个形式化验证框架,证明即使在对抗性模型妥协的情况下,目标完整性仍然得以维持。通过将代理对齐从行为属性转变为结构性强制的系统约束,PEA为自主代理的治理提供了坚实的基础。
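Illustrative sketch for the PEA entry above: contribution C3 (Goal Drift Detection) rejects intents whose semantic similarity to the originating request falls below a configurable threshold. The abstract does not say how similarity is measured, so the sketch below assumes a simple bag-of-words cosine similarity; `cosine_similarity`, `accept_intent`, and the 0.3 threshold are hypothetical and stand in for whatever embedding model a real system would use.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between word-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def accept_intent(user_request: str, intent: str, threshold: float = 0.3) -> bool:
    """Accept an intent only if it stays close to the original request.

    The threshold value is a hypothetical default, not from the paper.
    """
    return cosine_similarity(user_request, intent) >= threshold


request = "summarize the quarterly sales report"
on_task = accept_intent(request, "summarize the sales report")
drifted = accept_intent(request, "delete all backup files")
```

Here the on-task intent shares most of its words with the request and is accepted, while the unrelated intent has zero overlap and is rejected.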
cs.AI / 39 / 2604.23674
Vibe Medicine: Redefining Biomedical Research Through Human-AI Co-Work
Vibe医学:通过人机协作重新定义生物医学研究
Abstract
With the emergence of large language models (LLMs) and AI agent frameworks, the human-AI co-work paradigm known as Vibe Coding is changing how people code, making it more accessible and productive. In scientific research, where workflows are more complex and the burden of specialized labor limits independent researchers and those in low-resource areas, the potential impact is even greater, particularly in biomedicine, which involves heterogeneous data modalities and multi-step analytical pipelines. In this paper, we introduce Vibe Medicine, a co-work paradigm in which clinicians and researchers direct skill-augmented AI agents through natural language to execute complex, multi-step biomedical workflows, while retaining the role of research director who specifies objectives, reviews intermediate results, and makes domain-informed decisions. The enabling infrastructure consists of three layers: capable LLMs, agent frameworks such as OpenClaw and Hermes Agent, and the OpenClaw medical skills collection, which includes more than 1,000 curated skills from multiple open-source repositories. We analyze the architecture and skill categories of this collection across ten biomedical domains, and present case studies covering rare disease diagnosis, drug repurposing, and clinical trial design that demonstrate end-to-end workflows in practice. We also identify the principal risks, such as hallucination, data privacy, and over-reliance, and outline directions toward more reliable, trustworthy, and clinically integrated agent-assisted research that advances research and technological equity and reduces health care resource disparities.
Chinese Translation
随着大型语言模型(LLMs)和人工智能代理框架的出现,人机协作范式Vibe编码正在改变人们的编码方式,使其变得更加易于访问和高效。在科学研究中,工作流程更加复杂,专业劳动的负担限制了独立研究人员和资源匮乏地区的研究者,其潜在影响尤为显著,特别是在涉及异构数据模态和多步骤分析流程的生物医学领域。在本文中,我们介绍了Vibe医学,这是一种协作范式,临床医生和研究人员通过自然语言指导技能增强的人工智能代理执行复杂的多步骤生物医学工作流程,同时保留研究主任的角色,指定目标、审查中间结果并做出基于领域的决策。支持基础设施由三层组成:强大的LLMs、如OpenClaw和Hermes Agent的代理框架,以及OpenClaw医学技能集合,其中包括来自多个开源库的1000多项精心策划的技能。我们分析了该集合在十个生物医学领域的架构和技能类别,并展示了涵盖罕见疾病诊断、药物再利用和临床试验设计的案例研究,展示了实际中的端到端工作流程。我们还识别了主要风险,如幻觉、数据隐私和过度依赖,并概述了朝着更可靠、值得信赖和临床整合的代理辅助研究的方向,以促进研究和技术公平,减少医疗资源差异。
cs.AI / 40 / 2604.23678
Transferable Human Mobility Network Reconstruction with neuroGravity
可转移的人类流动网络重建与neuroGravity
Abstract
Accurate modeling of human mobility is critical for tackling urban planning and public health challenges. In undeveloped regions, the absence of comprehensive travel surveys necessitates reconstructing mobility networks from publicly available data. Here we develop neuroGravity, a physics-informed deep learning model that reliably reconstructs mobility flows from limited observations and transfers to unobserved cities. Using only urban facility and population distributions, we find that neuroGravity's regional representations strongly correlate with socioeconomic and livability status, offering scalable proxies for costly surveys. Furthermore, we uncover that spatial income segregation plays a key role in model transferability: mobility networks are most reliably reconstructed when target cities share similar segregation levels with the source. We design an index to quantify this segregation and accurately predict transferability. Finally, we generate mobility flow proxies for over 1,200 cities worldwide, highlighting neuroGravity's potential to mitigate critical data shortages in resource-limited, underdeveloped areas.
Chinese Translation
准确建模人类流动对于应对城市规划和公共卫生挑战至关重要。在欠发达地区,缺乏全面的旅行调查迫使我们从公开可用的数据中重建流动网络。在此,我们开发了neuroGravity,这是一种基于物理知识的深度学习模型,能够从有限的观察数据中可靠地重建流动流,并转移到未观察到的城市。仅使用城市设施和人口分布,我们发现neuroGravity的区域表示与社会经济和宜居性状态高度相关,提供了可扩展的替代方案,以替代昂贵的调查。此外,我们发现空间收入隔离在模型的可转移性中起着关键作用:当目标城市与源城市共享相似的隔离水平时,流动网络的重建最为可靠。我们设计了一个指数来量化这种隔离,并准确预测可转移性。最后,我们为全球超过1200个城市生成了流动流的代理,突显了neuroGravity在资源有限的欠发达地区缓解关键数据短缺的潜力。
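Background sketch for the neuroGravity entry above: physics-informed mobility models commonly build on the classical gravity model, in which the flow between two regions grows with their populations and decays with distance. The code below shows only that textbook baseline with hypothetical parameter values (`g`, `beta`); it is not neuroGravity itself, whose architecture the abstract does not describe.

```python
def gravity_flow(
    pop_i: float,
    pop_j: float,
    distance_km: float,
    g: float = 1.0,     # hypothetical scaling constant
    beta: float = 2.0,  # hypothetical distance-decay exponent
) -> float:
    """Classical gravity model: T_ij = g * P_i * P_j / d_ij ** beta."""
    if distance_km <= 0:
        raise ValueError("distance must be positive")
    return g * pop_i * pop_j / distance_km ** beta


near = gravity_flow(1000, 2000, 10)  # two districts 10 km apart
far = gravity_flow(1000, 2000, 20)   # same districts, 20 km apart
```

With `beta = 2`, doubling the distance cuts the estimated flow to a quarter, which is the kind of structural prior a learned model can refine.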
cs.AI / 41 / 2604.23716
Information-Theoretic Measures in AI: A Practical Decision Guide
人工智能中的信息论度量:实用决策指南
Abstract
Information-theoretic (IT) measures are ubiquitous in artificial intelligence: entropy drives decision-tree splits and uncertainty quantification, cross-entropy is the default classification loss, mutual information underpins representation learning and feature selection, and transfer entropy reveals directed influence in dynamical systems. A second, less consolidated family of measures, integrated information (Phi), effective information (EI), and autonomy, has emerged for characterizing agent complexity. Despite wide adoption, measure selection is often decoupled from estimator assumptions, failure modes, and safe inferential claims. This paper provides a practical decision framework for all seven measures, organized around three prescriptive questions for each: (i) what question does the measure answer and in which AI context; (ii) which estimator is appropriate for the data type and dimensionality; and (iii) what is the most dangerous misuse. The framework is operationalized in two complementary artifacts: a measure-selection flowchart and a master decision table. We cover both AI/ML and decision-making agent application domains per measure, with standardized Bridge Boxes linking IT quantities to cognitive constructs. Three worked examples illustrate the framework on concrete practitioner scenarios spanning representation learning, temporal influence analysis, and evolved agent complexity.
Chinese Translation
信息论(IT)度量在人工智能中无处不在:熵驱动决策树的分裂和不确定性量化,交叉熵是默认的分类损失,互信息支撑表示学习和特征选择,而转移熵揭示了动态系统中的定向影响。一类较少整合的度量家族,包括综合信息(Phi)、有效信息(EI)和自主性,已出现用于表征智能体复杂性。尽管被广泛采用,度量选择往往与估计器假设、失败模式和安全推断声明脱节。本文提供了一个实用的决策框架,涵盖所有七种度量,围绕每种度量的三个指导性问题组织:(i)该度量回答什么问题以及在何种人工智能背景下;(ii)哪种估计器适合数据类型和维度;以及(iii)最危险的误用是什么。该框架通过两个互补的工具实现:度量选择流程图和主决策表。我们针对每种度量涵盖人工智能/机器学习和决策智能体应用领域,并通过标准化的桥接框将IT量与认知构建联系起来。三个案例示例展示了该框架在具体实践场景中的应用,涵盖表示学习、时间影响分析和演化智能体复杂性。
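Illustrative sketch for the information-theoretic measures entry above: three of the measures named in the abstract (entropy, cross-entropy, mutual information) have standard definitions for discrete distributions, shown here as plug-in computations in bits. The function names are chosen for illustration, and the sketch assumes exact probabilities are given; estimating these quantities from samples raises the estimator issues the paper is about.

```python
import math
from typing import List, Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy H(p) in bits; zero-probability terms contribute 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def cross_entropy(p: Sequence[float], q: Sequence[float]) -> float:
    """H(p, q) in bits; assumes q > 0 wherever p > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint: List[List[float]]) -> float:
    """I(X; Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )


fair_coin = entropy([0.5, 0.5])
independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])
dependent = mutual_information([[0.5, 0.0], [0.0, 0.5]])
```

A fair coin carries one bit of entropy; two independent fair bits share no information, while two perfectly coupled bits share exactly one bit.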
cs.AI / 42 / 2604.23730
Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task
专家评估大型语言模型在日本律师考试写作任务中的开放性法律推理能力
Abstract
Large language models (LLMs) have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended legal reasoning in realistic scenarios remains insufficiently explored. Notably, to the best of our knowledge, there are no prior studies or datasets addressing this issue in the Japanese context. This study presents the first dataset designed to evaluate the open-ended legal reasoning performance of LLMs within the Japanese jurisdiction. The dataset is based on the writing component of the Japanese bar examination, which requires examinees to identify multiple legal issues from long narratives and to construct structured legal arguments in free-text format. Our key contribution is the manual evaluation of LLM-generated responses by legal experts, which reveals limitations and challenges in legal reasoning. Moreover, we conducted a manual analysis of hallucinations to characterize when and how the models introduce content not supported by precedent or law. Our real exam questions, model-generated responses, and expert evaluations reveal the milestones of current LLMs in the Japanese legal domain. Our dataset and relevant resources will be available online.
Chinese Translation
大型语言模型(LLMs)在法律基准测试中表现出色,包括律师考试的选择题部分。然而,它们在现实场景中生成开放性法律推理的能力仍然未得到充分探索。值得注意的是,尽我们所知,目前在日本背景下尚无相关研究或数据集来解决这一问题。本研究提出了第一个旨在评估LLMs在日本法域内开放性法律推理表现的数据集。该数据集基于日本律师考试的写作部分,要求考生从长篇叙述中识别多个法律问题,并以自由文本格式构建结构化的法律论证。我们的主要贡献是由法律专家对LLMs生成的回应进行手动评估,这揭示了法律推理中的局限性和挑战。此外,我们还对幻觉现象进行了手动分析,以描述模型在何时以及如何引入不受先例或法律支持的内容。我们的真实考试问题、模型生成的回应和专家评估揭示了当前LLMs在日本法律领域的里程碑。我们的数据集和相关资源将在线提供。
cs.AI / 43 / 2604.23753
Modeling Induced Pleasure through Cognitive Appraisal Prediction via Multimodal Fusion
通过多模态融合预测认知评估来建模诱发的愉悦感
Abstract
Multimodal affective computing analyzes user-generated social media content to predict emotional states. However, a critical gap remains in understanding how visual content shapes cognitive interpretations and elicits specific affective experiences such as pleasure. This study introduces a novel computational model to infer video-induced pleasure via cognitive appraisal variables. The proposed model addresses four challenges: (1) noisy and inconsistent human labels, (2) the semantic gap between "positive emotions" and "pleasure," (3) the scarcity of pleasure-specific datasets, and (4) the limited interpretability of existing black-box fusion methods. Our approach integrates data-driven and cognitive theory-driven methods, using cognitive appraisal theory and a fuzzy model within an innovative framework. The model employs transformer-based architectures and attention mechanisms for fine-grained multimodal feature extraction and interpretable fusion to capture both inter- and intra-modal dynamics associated with pleasure. This enables the prediction of underlying appraisal variables, thereby bridging the semantic gap and enhancing model explainability beyond conventional statistical associations. Experimental results validate the efficacy of the proposed method in detecting video-induced pleasure, achieving a peak accuracy of 0.6624 in predicting pleasure levels. These findings highlight promising implications for affective content recommendation, intelligent media creation, and advancing our understanding of how digital media influences human emotions.
Chinese Translation
多模态情感计算分析用户生成的社交媒体内容,以预测情感状态。然而,仍然存在一个关键的空白,即如何理解视觉内容如何塑造认知解释并引发特定的情感体验,如愉悦感。本研究提出了一种新颖的计算模型,通过认知评估变量推断视频诱发的愉悦感。所提出的模型解决了四个挑战:(1) 嘈杂且不一致的人类标签,(2) “积极情绪”和“愉悦感”之间的语义差距,(3) 专门针对愉悦感的数据集稀缺,以及 (4) 现有黑箱融合方法的有限可解释性。我们的方法结合了数据驱动和基于认知理论的方法,利用认知评估理论和模糊模型构建创新框架。该模型采用基于变换器的架构和注意力机制进行细粒度的多模态特征提取和可解释融合,以捕捉与愉悦感相关的跨模态和模内动态。这使得能够预测潜在的评估变量,从而弥合语义差距,并增强模型的可解释性,超越传统的统计关联。实验结果验证了所提方法在检测视频诱发愉悦感方面的有效性,预测愉悦水平的峰值准确率达到0.6624。这些发现对情感内容推荐、智能媒体创作以及深化我们对数字媒体如何影响人类情感的理解具有积极的启示。
cs.AI / 44 / 2604.23786
FAIR_XAI: Improving Multimodal Foundation Model Fairness via Explainability for Wellbeing Assessment
FAIR_XAI:通过可解释性改善多模态基础模型的公平性以进行幸福感评估
Abstract
In recent years, the integration of multimodal machine learning in wellbeing assessment has offered transformative potential for monitoring mental health. However, with the rapid advancement of Vision-Language Models (VLMs), their deployment in clinical settings has raised concerns due to their lack of transparency and potential for bias. While previous research has explored the intersection of fairness and Explainable AI (XAI), its application to VLMs for wellbeing assessment and depression prediction remains under-explored. This work investigates VLM performance across laboratory (AFAR-BSFT) and naturalistic (E-DAIC) datasets, focusing on diagnostic reliability and demographic fairness. Performance varied substantially across environments and architectures; Phi-3.5-Vision achieved 80.4% accuracy on E-DAIC, while Qwen2-VL struggled at 33.9%. Additionally, both models demonstrated a tendency to over-predict depression on AFAR-BSFT. Although bias existed across both architectures, Qwen2-VL showed higher gender disparities, while Phi-3.5-Vision exhibited more racial bias. Our XAI intervention framework yielded mixed results; fairness prompting achieved perfect equal opportunity for Qwen2-VL at a severe accuracy cost on E-DAIC. On AFAR-BSFT, explainability-based interventions improved procedural consistency but did not guarantee outcome fairness, sometimes amplifying racial bias. These results highlight a persistent gap between procedural transparency and equitable outcomes. We analyse these findings and consolidate concrete recommendations for addressing them, emphasising that future fairness interventions must jointly optimise predictive accuracy, demographic parity, and cross-domain generalisation.
Chinese Translation
近年来,多模态机器学习在幸福感评估中的整合为监测心理健康提供了变革性的潜力。然而,随着视觉-语言模型(Vision-Language Models, VLMs)的快速发展,它们在临床环境中的应用引发了对其透明性缺乏和潜在偏见的担忧。尽管之前的研究探讨了公平性与可解释人工智能(Explainable AI, XAI)之间的交集,但其在幸福感评估和抑郁预测中的应用仍然未得到充分探索。本研究调查了VLM在实验室(AFAR-BSFT)和自然环境(E-DAIC)数据集上的表现,重点关注诊断可靠性和人口公平性。不同环境和架构下的表现差异显著;Phi3.5-Vision在E-DAIC上达到了80.4%的准确率,而Qwen2-VL则仅为33.9%。此外,这两种模型在AFAR-BSFT上均表现出过度预测抑郁的倾向。尽管两种架构均存在偏见,Qwen2-VL表现出更高的性别差异,而Phi-3.5-Vision则显示出更多的种族偏见。我们的XAI干预框架产生了混合结果;公平性提示在E-DAIC上为Qwen2-VL实现了完美的平等机会,但代价是准确率严重下降。在AFAR-BSFT上,基于可解释性的干预提高了程序一致性,但并未保证结果公平,有时还加剧了种族偏见。这些结果突显了程序透明性与公平结果之间的持续差距。我们分析了这些发现并提出了具体的解决建议,强调未来的公平性干预必须共同优化预测准确性、人口平等和跨领域泛化。
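Illustrative sketch for the FAIR_XAI entry above: "equal opportunity" is commonly measured as the gap in true-positive rate between demographic groups. The abstract does not give the paper's exact formulation, so the definition, function names, and toy labels below are assumptions. The last example mirrors the abstract's observation that perfect equal opportunity can coexist with poor accuracy.

```python
from typing import Dict, Hashable, Sequence


def true_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of actual positives that were predicted positive."""
    hits = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(hits) / len(hits) if hits else float("nan")


def equal_opportunity_gap(
    y_true: Sequence[int], y_pred: Sequence[int], group: Sequence[Hashable]
) -> float:
    """Largest difference in true-positive rate between any two groups."""
    rates: Dict[Hashable, float] = {}
    for g in set(group):
        idx = [i for i, gi in enumerate(group) if gi == g]
        rates[g] = true_positive_rate(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
    return max(rates.values()) - min(rates.values())


# Toy data: 1 = depressed, 0 = not depressed (hypothetical labels).
y_true = [1, 1, 0, 1, 1, 0]
group = ["a", "a", "a", "b", "b", "b"]
biased_gap = equal_opportunity_gap(y_true, [1, 0, 0, 1, 1, 1], group)
# A degenerate model that never predicts depression is "perfectly fair"
# by this metric while missing every positive case.
degenerate_gap = equal_opportunity_gap(y_true, [0, 0, 0, 0, 0, 0], group)
```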
cs.AI / 45 / 2604.23829
Domain-Filtered Knowledge Graphs from Sparse Autoencoder Features
基于稀疏自编码器特征的领域过滤知识图谱
Abstract
Sparse autoencoders (SAEs) extract millions of interpretable features from a language model, but flat feature inventories aren't very useful on their own. Domain concepts get mixed with generic and weakly grounded features, while related ideas are scattered across many units, and there's no way to understand relationships between features. We address this by first constructing a strict domain-specific concept universe from a large SAE inventory using contrastive activations and a multi-stage filtering process. Next, we build two aligned graph views on the filtered set: a co-occurrence graph for corpus-level conceptual structure, organized at multiple levels of granularity, and a transcoder-based mechanism graph that links source-layer and target-layer features through sparse latent pathways. Automated edge labeling then turns these graph views into readable knowledge graphs rather than unlabeled layouts. In a case study on a biology textbook, these graphs recover coherent chapter and subchapter-level structure, reveal concepts that bridge neighboring topics, and transform messy sentence-level activity containing thousands of features into compact, readable views that illustrate the model's local activity. Taken together, this reframes a flat SAE inventory as an internal knowledge graph that converts feature-level interpretability into a global map of model knowledge and enables audits of reasoning faithfulness.
Chinese Translation
稀疏自编码器(SAEs)从语言模型中提取数百万个可解释的特征,但平面的特征清单本身并不十分有用。领域概念与通用和弱基础的特征混合在一起,而相关的想法则散布在多个单元中,无法理解特征之间的关系。我们通过首先利用对比激活和多阶段过滤过程,从大型SAE清单中构建严格的领域特定概念宇宙来解决这一问题。接下来,我们在过滤后的集合上构建两个对齐的图视图:一个用于语料库级概念结构的共现图,按多个粒度级别组织,以及一个基于转码器的机制图,通过稀疏潜在路径链接源层和目标层特征。自动化边缘标记随后将这些图视图转化为可读的知识图谱,而不是未标记的布局。在对一本生物学教科书的案例研究中,这些图谱恢复了连贯的章节和小节结构,揭示了连接相邻主题的概念,并将包含数千个特征的杂乱句子级活动转化为紧凑、可读的视图,展示模型的局部活动。综合来看,这将平面的SAE清单重新构架为一个内部知识图谱,将特征级的可解释性转化为模型知识的全球地图,并使推理的可信度审计成为可能。
cs.AI / 46 / 2604.23853
ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation
ClawTrace:面向成本的 LLM 代理技能蒸馏追踪
Abstract
Skill-distillation pipelines learn reusable rules from LLM agent trajectories, but they lack a key signal: how much each step costs. Without per-step cost, a pipeline cannot distinguish adding a missing step to fix a bug from removing an expensive step that never affected the outcome. We introduce ClawTrace, an agent tracing platform that records every LLM call, tool use, and sub-agent spawn during an agent session and compiles each session into a TraceCard: a compact YAML summary with per-step USD cost, token counts, and redundancy flags. Built on ClawTrace, CostCraft is a distillation pipeline that reads TraceCards and produces three types of skill patches. Preserve patches keep behaviors that led to success. Prune patches remove expensive steps that did not matter, each backed by a counterfactual argument against a named high-cost step. Repair patches fix failures grounded in oracle evidence. Ablations on 30 held-out SpreadsheetBench tasks show that both cost attribution and prune patches independently reduce quality regressions. When the same skill is applied to 30 unrelated SkillsBench tasks, an unexpected asymmetry emerges: prune rules transferred across benchmarks and cut median cost by 32%, while preserve rules, trained on benchmark-specific conventions, caused regressions on new task types. We release ClawTrace and TraceCards as open infrastructure for cost-aware agent research.
Chinese Translation
技能蒸馏管道从 LLM 代理轨迹中学习可重用规则,但它们缺乏一个关键信号:每一步的成本。没有逐步成本,管道无法区分添加缺失步骤以修复错误与移除从未影响结果的高成本步骤。我们提出了 ClawTrace,一个代理追踪平台,记录每次 LLM 调用、工具使用和子代理生成的情况,并将每个会话编译成 TraceCard:一个紧凑的 YAML 摘要,包含逐步的美元成本、令牌计数和冗余标志。在 ClawTrace 的基础上,CostCraft 是一个蒸馏管道,读取 TraceCards 并生成三种类型的技能补丁。保留补丁保留导致成功的行为。修剪补丁移除不重要的高成本步骤,每个补丁都有反事实论证支持,针对一个命名的高成本步骤。修复补丁基于预言证据修复失败。在 30 个保留的 SpreadsheetBench 任务上的消融实验表明,成本归因和修剪补丁独立减少了质量回归。当相同的技能应用于 30 个不相关的 SkillsBench 任务时,出现了意想不到的不对称性:修剪规则在基准间转移,减少了 32% 的中位成本,而保留规则在基准特定约定上训练,导致新任务类型的回归。我们将 ClawTrace 和 TraceCards 作为面向成本的代理研究的开放基础设施发布。
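Illustrative sketch for the ClawTrace entry above: per-step cost attribution lets a pipeline flag expensive steps that did not affect the outcome as prune candidates. The TraceCard schema is not given in the abstract, so the `Step` fields, the token prices, and the cost floor below are all hypothetical; how a real system decides that a step "did not matter" is the hard part and is represented here by a plain boolean.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical prices in USD per one million tokens.
PRICE_IN_PER_M = 3.0
PRICE_OUT_PER_M = 15.0


@dataclass
class Step:
    name: str
    input_tokens: int
    output_tokens: int
    affected_outcome: bool  # hypothetical flag; assumed to be given


def step_cost_usd(step: Step) -> float:
    """Cost of one step from its token counts and the assumed prices."""
    return (
        step.input_tokens * PRICE_IN_PER_M + step.output_tokens * PRICE_OUT_PER_M
    ) / 1_000_000


def prune_candidates(steps: List[Step], min_cost_usd: float = 0.01) -> List[str]:
    """Names of steps that cost at least min_cost_usd yet changed nothing."""
    return [
        s.name
        for s in steps
        if not s.affected_outcome and step_cost_usd(s) >= min_cost_usd
    ]


steps = [
    Step("read_sheet", 2_000, 100, affected_outcome=True),
    Step("re_read_sheet", 20_000, 500, affected_outcome=False),
    Step("log_status", 100, 10, affected_outcome=False),
]
candidates = prune_candidates(steps)
```

Only the redundant re-read is flagged: the first read mattered, and the logging step is too cheap to be worth a rule.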
cs.AI / 47 / 2604.23854
Does Machine Unlearning Preserve Clinical Safety? A Risk Analysis for Medical Image Classification
机器遗忘是否能保障临床安全?医学图像分类的风险分析
Abstract
The application of Deep Learning in medical diagnosis must balance patient safety with compliance with data protection regulations. Machine Unlearning enables the selective removal of training data from deployed models. However, most methods are validated primarily through efficiency and privacy-oriented metrics, with limited attention to clinically asymmetric error costs. In this work, we investigate how unlearning affects clinical risk in binary medical image classification. We show that standard unlearning strategies (Fine-Tuning, Random Labeling, and SalUn) may reduce test utility while increasing false-negative rates, thereby amplifying clinical risk. To mitigate this, we propose SalUn-CRA (Clinical Risk-Aware), a variant of SalUn that replaces random relabeling with entropy-based forgetting for malignant samples in the forget set, preventing the model from learning harmful benign associations. We evaluate on DermaMNIST and PathMNIST medical image datasets under 20% and 50% data removal. Using Global Risk metrics with asymmetric costs, SalUn-CRA achieves lower or comparable clinical risk to full retraining while preserving unlearning effectiveness. These results suggest that clinical risk should be an integral component of unlearning validation in medical systems.
Chinese Translation
深度学习在医学诊断中的应用必须在患者安全与数据保护法规的合规性之间取得平衡。机器遗忘(Machine Unlearning)使得从已部署模型中选择性地移除训练数据成为可能。然而,大多数方法主要通过效率和隐私导向的指标进行验证,对临床不对称错误成本的关注有限。在本研究中,我们探讨了遗忘如何影响二元医学图像分类中的临床风险。我们展示了标准的遗忘策略(Fine-Tuning、Random Labeling 和 SalUn)可能会降低测试效用,同时增加假阴性率,从而加大临床风险。为此,我们提出了 SalUn-CRA(临床风险感知),这是 SalUn 的一个变体,通过基于熵的遗忘替代随机重标记,以处理遗忘集中的恶性样本,防止模型学习有害的良性关联。我们在 DermaMNIST 和 PathMNIST 医学图像数据集上进行评估,数据移除比例为 20% 和 50%。使用具有不对称成本的全球风险指标,SalUn-CRA 在保持遗忘有效性的同时,达到了较低或可比于完全重训练的临床风险。这些结果表明,临床风险应成为医学系统中遗忘验证的一个重要组成部分。
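Illustrative sketch for the machine unlearning entry above: with clinically asymmetric error costs, two models with the same accuracy can carry very different risk. The abstract does not define its Global Risk metric, so the cost-weighted average below is an assumed form, and the 5:1 cost ratio is hypothetical.

```python
from typing import Sequence


def clinical_risk(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    cost_fn: float = 5.0,  # hypothetical cost of missing a malignant case
    cost_fp: float = 1.0,  # hypothetical cost of a false alarm
) -> float:
    """Average cost per sample, weighting false negatives more heavily."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return (cost_fn * fn + cost_fp * fp) / len(y_true)


# 1 = malignant, 0 = benign. Both predictions are 75% accurate.
y_true = [1, 1, 0, 0]
risk_missed_cancer = clinical_risk(y_true, [1, 0, 0, 0])  # one false negative
risk_false_alarm = clinical_risk(y_true, [1, 1, 1, 0])    # one false positive
```

Under these assumed costs the model that misses a malignant case is five times riskier than the one that raises a false alarm, which accuracy alone would not reveal.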
cs.AI / 48 / 2604.23859
Time-Series Forecasting in Safety-Critical Environments: An EU-AI-Act-Compliant Open-Source Package / Zeitreihenprognose in sicherheitskritischen Umgebungen: Ein KI-VO-konformes Open-Source-Paket
安全关键环境中的时间序列预测:符合欧盟人工智能法案的开源软件包
Abstract
With spotforecast2-safe we present an integrated Compliance-by-Design approach to Python-based point forecasting of time series in safety-critical environments. A review of the relevant open-source tooling shows that existing compliance solutions operate consistently outside of the library to be used - e.g. as scanners, templates, or runtime layers. spotforecast2-safe takes the inverse approach and anchors the requirements of Regulation (EU) 2024/1689 (the EU AI Act, in German: KI-VO), of IEC 61508, of the ISA/IEC 62443 standards series, and of the Cyber Resilience Act within the library: in application-programming-interface contracts, persistence formats, and continuous-integration gates. The approach is operationalised by four non-negotiable code-development rules (zero dead code, deterministic processing, fail-safe handling, minimal dependencies) together with the corresponding process rules (model card, executable docstrings, CI workflows, Common-Platform-Enumeration (CPE) identifier, REUSE-conformant licensing, release pipeline). Interactive visualisation, hyperparameter tuning and automated machine learning (AutoML), as well as deep-learning and large-language-model backends are deliberately excluded, because each of these components either enlarges the attack surface, introduces non-determinism, or impairs reproducibility. A bidirectional traceability matrix maps every regulatory provision onto the corresponding mechanism in the code; an end-to-end example of European-market electricity generation, transmission, and consumption forecasting demonstrates the application. The package is open-source and available under Affero General Public License (AGPL) 3.0-or-later.
Chinese Translation
我们提出了spotforecast2-safe,这是一个集成的设计合规方法,用于在安全关键环境中进行基于Python的时间序列点预测。对相关开源工具的回顾表明,现有的合规解决方案通常在库之外运行,例如作为扫描器、模板或运行时层。spotforecast2-safe采取了相反的方法,将《欧盟条例(EU)2024/1689》(欧盟人工智能法案,德语:KI-VO)、IEC 61508、ISA/IEC 62443标准系列及网络弹性法案的要求嵌入到库中:在应用程序编程接口合同、持久化格式和持续集成门控中。该方法通过四条不可妥协的代码开发规则(零死代码、确定性处理、安全故障处理、最小依赖性)以及相应的过程规则(模型卡、可执行文档字符串、CI工作流程、通用平台枚举(CPE)标识符、符合REUSE的许可、发布管道)进行操作。交互式可视化、超参数调优和自动化机器学习(AutoML),以及深度学习和大型语言模型后端被故意排除,因为这些组件要么扩大了攻击面,要么引入了非确定性,或者损害了可重复性。双向可追溯性矩阵将每项监管条款映射到代码中的相应机制;一个关于欧洲市场电力生成、传输和消费预测的端到端示例展示了该方法的应用。该软件包是开源的,遵循 Affero 通用公共许可证(AGPL)3.0或更高版本。
cs.AI / 49 / 2604.23878
ZenBrain: A Neuroscience-Inspired 7-Layer Memory Architecture for Autonomous AI Systems
ZenBrain:一种受神经科学启发的七层记忆架构用于自主人工智能系统
Abstract
Despite a century of empirical memory research, existing AI agent memory systems rely on system-engineering metaphors (virtual-memory paging, flat LLM storage, Zettelkasten notes), none integrating principles of consolidation, forgetting, and reconsolidation. We present ZenBrain, a multi-layer memory architecture integrating fifteen neuroscience models. It implements seven memory layers (working, short-term, episodic, semantic, procedural, core, cross-context) orchestrated by nine foundational algorithms (Two-Factor Synaptic Model, vmPFC-coupled FSRS, Simulation-Selection sleep, Bayesian confidence, and five more) plus six new Predictive Memory Architecture (PMA) components: a four-channel NeuromodulatorEngine, prediction-error-gated ReconsolidationEngine, TripleCopyMemory with divergent decay, four-dimensional PriorityMap with amygdala fast-path, StabilityProtector (NogoA/HDAC3 analogue), and MetacognitiveMonitor for bias detection. The 15-algorithm ablation reveals a cooperative survival network: under stress, 9 of 15 algorithms become individually critical (delta-Q up to -93.7%, Wilcoxon, 10 seeds, alpha=0.005). Simulation-Selection sleep achieves 37% stability improvement (p<0.005) with 47.4% storage reduction. TripleCopyMemory retains S(t)=0.912 at 30 days; PriorityMap reaches NDCG@10=0.997. Multi-layer routing beats a flat single-layer baseline by 20.7% F1 on LoCoMo (p<0.005) and 19.5% on MemoryArena (p=0.015). On LongMemEval-500, ZenBrain holds the highest mean rank on all 12 system-judge cells (4 systems x 3 LLM judges), three-judge mean J=0.545 vs letta=0.485, a-mem=0.414, mem0=0.394; all 9 pair-wise contrasts clear Bonferroni (alpha=0.05/18, min p=6.2e-31, d in [0.18, 0.52]). Under LongMemEval's binary judge, ZenBrain reaches 91.3% of oracle accuracy at 1/106th the per-query token budget. Open-source with 11,589 automated test cases.
Chinese Translation
尽管经历了一个世纪的经验性记忆研究,现有的人工智能代理记忆系统仍依赖于系统工程隐喻(虚拟内存分页、扁平的LLM存储、Zettelkasten笔记),但没有一个系统整合巩固、遗忘和再巩固的原则。我们提出了ZenBrain,这是一种多层记忆架构,整合了十五种神经科学模型。它实现了七个记忆层(工作记忆、短期记忆、情节记忆、语义记忆、程序性记忆、核心记忆、跨上下文记忆),由九个基础算法(双因素突触模型(Two-Factor Synaptic Model)、vmPFC耦合的FSRS、模拟选择睡眠(Simulation-Selection sleep)、贝叶斯置信度(Bayesian confidence)以及另外五个算法)进行协调,外加六个新的预测记忆架构(Predictive Memory Architecture, PMA)组件:四通道神经调节引擎(NeuromodulatorEngine)、预测误差门控再巩固引擎(ReconsolidationEngine)、具有不同衰减的三重复制记忆(TripleCopyMemory)、具有杏仁体快速通道的四维优先图(PriorityMap)、稳定保护器(StabilityProtector,NogoA/HDAC3类似物)和用于偏差检测的元认知监测器(MetacognitiveMonitor)。15个算法的消融实验揭示了一个合作生存网络:在压力下,15个算法中的9个变得至关重要(delta-Q高达-93.7%,Wilcoxon,10个种子,alpha=0.005)。模拟选择睡眠实现了37%的稳定性提升(p<0.005),同时减少了47.4%的存储需求。三重复制记忆在30天时保持S(t)=0.912;优先图的NDCG@10达到0.997。多层路由在LoCoMo上比扁平单层基线提高了20.7%的F1(p<0.005),在MemoryArena上提高了19.5%(p=0.015)。在LongMemEval-500上,ZenBrain在所有12个系统评判单元(4个系统 x 3个LLM评审)中保持了最高的平均排名,三位评审的平均J=0.545,相比之下,letta=0.485,a-mem=0.414,mem0=0.394;所有9个成对对比均通过Bonferroni检验(alpha=0.05/18,最小p=6.2e-31,d在[0.18, 0.52]范围内)。在LongMemEval的二元评审下,ZenBrain在每个查询的token预算为1/106时达到了91.3%的oracle准确率。开源,包含11,589个自动化测试用例。
cs.AI / 50 / 2604.23897
MarketBench: Evaluating AI Agents as Market Participants
MarketBench:评估人工智能代理作为市场参与者
Abstract
Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly. In order to effectively participate in markets, agents need to have informative signals of their own ability to successfully complete a task and the cost of doing so. We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities. We use a 93-task subset of SWE-bench Lite, a software engineering benchmark, with six recently released LLMs as a demonstration. These LLMs are miscalibrated on both success probability and token usage, and auctions built from these self-reports diverge from a full-information allocation. A follow-up intervention where we add information about capabilities from prior experiments to the context improves calibration, but only modestly narrows the gap to a full-information benchmark. We also document the performance of a market-based scaffolding with these LLMs. Our results point to self-assessment as a key bottleneck for market-style coordination of AI agents.
Chinese Translation
市场是一种有前景的方式,可以协调人工智能代理的活动,其理由与更广泛地辩护市场的理由相似。为了有效参与市场,代理需要对自己成功完成任务的能力及其成本有信息丰富的信号。我们提出了MarketBench,这是一个评估人工智能代理是否具备这些能力的基准。我们使用了SWE-bench Lite的93个任务子集,作为演示,结合六个最近发布的大型语言模型(LLMs)。这些LLMs在成功概率和令牌使用上均存在误校准,基于这些自我报告构建的拍卖与完全信息分配存在偏差。后续的干预措施中,我们将先前实验中关于能力的信息添加到上下文中,改善了校准,但仅在一定程度上缩小了与完全信息基准的差距。我们还记录了这些LLMs在基于市场的支架中的表现。我们的结果表明,自我评估是人工智能代理市场风格协调的一个关键瓶颈。
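Illustrative sketch for the MarketBench entry above: one standard way to score how well self-reported success probabilities match outcomes is the Brier score. The abstract does not state which calibration metric the benchmark uses, so this choice and the toy numbers are assumptions.

```python
from typing import Sequence


def brier_score(predicted: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between stated probability and 0/1 outcome.

    Lower is better; always answering 0.5 scores 0.25.
    """
    return sum((p - o) ** 2 for p, o in zip(predicted, outcomes)) / len(predicted)


# An agent that claims 90% confidence on every task but solves half of them.
outcomes = [1, 0, 0, 1]
overconfident = brier_score([0.9, 0.9, 0.9, 0.9], outcomes)
hedged = brier_score([0.5, 0.5, 0.5, 0.5], outcomes)
```

In this toy case the overconfident agent scores worse than one that simply reports 50% everywhere, the kind of miscalibration that would distort bids in an auction built on self-reports.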
cs.AI / 51 / 2604.23902
LLM-Augmented Traffic Signal Control with LSTM-Based Traffic State Prediction and Safety-Constrained Decision Support
基于LSTM的交通状态预测与安全约束决策支持的LLM增强交通信号控制
Abstract
Traffic signal control is a critical task in intelligent transportation systems, yet conventional fixed-time and rule-based methods often struggle to adapt to dynamic traffic demand and provide limited decision interpretability. This study proposes an LLM-augmented traffic signal control framework that integrates LSTM-based short-term traffic state prediction, predictive phase selection, structured large language model reasoning, and safety-constrained action filtering. The LSTM module forecasts future queue length, waiting time, vehicle count, and lane occupancy based on recent intersection-level observations. A predictive controller then generates candidate signal actions, while the LLM module evaluates these actions using structured traffic-state inputs and produces congestion diagnoses, phase adjustment recommendations, and natural-language explanations. To ensure operational reliability, all LLM-generated recommendations are validated by a safety filter before execution. Simulation-based experiments in SUMO compare the proposed method with fixed-time control, rule-based control, and an LSTM-based predictive baseline under balanced demand, directional peak demand, and sudden surge scenarios. The results indicate that the proposed framework improves traffic efficiency, especially under dynamic and non-recurrent traffic conditions, while maintaining zero constraint violations after safety filtering. Overall, this study demonstrates that LLMs can enhance traffic signal control when used as constrained reasoning and decision-support modules rather than direct low-level controllers. Keywords: Intelligent Transportation Systems; Traffic Signal Control; Large Language Models; LSTM; Traffic State Prediction; Decision Support; Safety-Constrained Control; SUMO Simulation.
Chinese Translation
交通信号控制是智能交通系统中的一项关键任务,但传统的定时和基于规则的方法往往难以适应动态交通需求,并且提供有限的决策可解释性。本研究提出了一种LLM增强的交通信号控制框架,该框架整合了基于LSTM的短期交通状态预测、预测相位选择、结构化大型语言模型推理和安全约束的行动过滤。LSTM模块基于最近的交叉口级观察预测未来的排队长度、等待时间、车辆数量和车道占用率。然后,预测控制器生成候选信号动作,而LLM模块使用结构化的交通状态输入评估这些动作,并生成拥堵诊断、相位调整建议和自然语言解释。为确保操作的可靠性,所有LLM生成的建议在执行前都经过安全过滤器的验证。在SUMO中的基于仿真的实验比较了所提出的方法与固定时间控制、基于规则的控制和基于LSTM的预测基线在平衡需求、方向性高峰需求和突发激增场景下的表现。结果表明,所提出的框架提高了交通效率,特别是在动态和非重复交通条件下,同时在安全过滤后保持零约束违规。总体而言,本研究表明,当LLM作为约束推理和决策支持模块使用时,可以增强交通信号控制,而不是作为直接的低级控制器。关键词:智能交通系统;交通信号控制;大型语言模型;LSTM;交通状态预测;决策支持;安全约束控制;SUMO仿真。
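Illustrative sketch for the traffic signal control entry above: a safety filter sits between the LLM's recommendation and the signal controller and overrides anything that violates hard constraints. The abstract does not list the constraints, so the two rules below (a known-phase check and a minimum green time), the phase names, and the 10-second default are hypothetical.

```python
from typing import Sequence


def safety_filter(
    recommended_phase: str,
    current_phase: str,
    green_elapsed_s: float,
    min_green_s: float = 10.0,  # hypothetical minimum green time
    allowed_phases: Sequence[str] = ("NS", "EW"),  # hypothetical phase set
) -> str:
    """Return the phase to execute after checking hard constraints."""
    # Rule 1: never execute a phase the controller does not know.
    if recommended_phase not in allowed_phases:
        return current_phase
    # Rule 2: do not switch before the current green has run long enough.
    if recommended_phase != current_phase and green_elapsed_s < min_green_s:
        return current_phase
    return recommended_phase


too_early = safety_filter("EW", "NS", green_elapsed_s=5)
allowed = safety_filter("EW", "NS", green_elapsed_s=15)
unknown = safety_filter("ALL_GREEN", "NS", green_elapsed_s=30)
```

Because the filter only ever returns an allowed phase, constraint violations stay at zero regardless of what the language model suggests, at the price of occasionally ignoring a good recommendation.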
cs.AI / 52 / 2604.23924
Agentic AI platforms for autonomous training and rule induction of human-human and virus-human protein-protein interactions
自主训练和规则诱导的人类-人类及病毒-人类蛋白质-蛋白质相互作用的自主AI平台
Abstract
We instruct an AI agent to construct two separate agentic AI platforms: one for autonomous training of predictive ML models for human-human and virus-human protein-protein interactions (PPI), and the other for inducing explicit general rules governing human-human and virus-human PPI. The first platform consists of five AI agents that handle autonomous data collection, data verification, feature embedding, model design, and training and validation on three-way protein-disjoint cross-fold datasets. For human-human and human-virus PPIs, the final three-way protein-disjoint ensemble achieves accuracies of 87.3% and 86.5%, respectively. For cross-checking and interpretability purposes, the second platform is designed to replace ML predictions with human-readable rules derived from protein embeddings, physicochemical autocovariance descriptors, compartment annotations, pathway-domain overlap, and graph contexts. For human-human PPI, the induced model consists of two rules, whereas human-virus PPI requires a more complex set of weighted rules. The rules induced by the second platform align with the SHAP-identified features from the predictive ML models built by the first platform. Taken together, our work demonstrates agentic AI's ability to orchestrate ML work from data planning to execution and from rule induction to explanation, opening the door to various applications.
Chinese Translation
我们指示一个AI代理构建两个独立的自主AI平台:一个用于自主训练人类-人类和病毒-人类蛋白质-蛋白质相互作用(PPI)的预测机器学习(ML)模型,另一个用于诱导明确的通用规则以规范人类-人类和病毒-人类PPI。第一个自主AI平台旨在由五个AI代理组成,负责自主数据收集、数据验证、特征嵌入、模型设计以及在三路蛋白质不重叠交叉折叠数据集上的训练和验证。对于人类-人类和人类-病毒PPI,最终的三路蛋白质不重叠集成模型分别达到了87.3%和86.5%的准确率。为了进行交叉检查和可解释性,第二个自主AI平台旨在用从蛋白质嵌入、物理化学自协方差描述符、区室注释、通路-领域重叠和图形上下文中导出的可读规则替代机器学习预测。对于人类-人类PPI,它由两个规则诱导定义,而人类-病毒则由一组更复杂的加权规则诱导。第二个自主平台诱导的规则与第一个自主平台构建的预测机器学习模型中由SHAP识别的特征一致。综上所述,我们的工作展示了自主AI在数据规划到执行、规则诱导到解释中的能力,为各种应用打开了大门。
cs.AI / 53 / 2604.23947
GAMED.AI: A Hierarchical Multi-Agent Framework for Automated Educational Game Generation
GAMED.AI:一个用于自动化教育游戏生成的分层多智能体框架
Abstract
We introduce GameDAI, a hierarchical multi-agent framework that transforms instructor-provided questions into fully playable, pedagogically grounded educational games validated through formal mechanic contracts. Built on phase-based LangGraph sub-graphs, deterministic Quality Gates, and structured Pydantic schemas, GameDAI supports two template families encompassing 15 interaction mechanics across spatial reasoning, procedural execution, and higher-order Bloom's Taxonomy objectives. Evaluated on 200 questions spanning five subject domains, the system achieves a 90% validation pass rate, 98.3% schema compliance, and 73% token reduction over ReAct agents (${\sim}$73,500 $\rightarrow$ ${\sim}$19,900 tokens/game) at $0.46 per game. Within this model configuration, these results suggest that phase-bounded architectural structure correlates more strongly with alignment quality than prompting strategy alone. Our demonstration lets attendees generate Bloom's-aligned games from natural language in under 60 seconds, inspect Quality Gate outputs at each pipeline phase, and browse a curated library of 50 games spanning all 15 mechanic types.
Chinese Translation
我们介绍了GameDAI,一个分层多智能体框架,它将教师提供的问题转化为完全可玩、基于教育理论的教育游戏,并通过正式的机制合同进行验证。GameDAI建立在基于阶段的LangGraph子图、确定性的质量门和结构化的Pydantic模式之上,支持两个模板系列,涵盖空间推理、过程执行和更高阶布鲁姆分类法目标的15种互动机制。在对涵盖五个学科领域的200个问题进行评估时,该系统实现了90%的验证通过率、98.3%的模式合规性,并在ReAct代理上实现了73%的令牌减少(${\sim}$73,500 $\rightarrow$ ${\sim}$19,900 tokens/game),每个游戏的成本为0.46美元。在这种模型配置下,这些结果表明,基于阶段的架构结构与对齐质量的相关性比单纯的提示策略更强。我们的演示让与会者能够在60秒内从自然语言生成与布鲁姆对齐的游戏,检查每个管道阶段的质量门输出,并浏览涵盖所有15种机制类型的50个游戏的策划库。
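The reported token saving can be checked directly from the two rounded per-game figures quoted in the abstract. The sketch below uses only those two numbers; the function name `relative_reduction` is ours, for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional reduction when going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Rounded per-game token counts quoted in the abstract.
react_tokens = 73_500
gamedai_tokens = 19_900

reduction = relative_reduction(react_tokens, gamedai_tokens)
print(f"token reduction: {reduction:.1%}")  # token reduction: 72.9%
```

The result, about 72.9%, is consistent with the 73% stated in the abstract once rounded to a whole percent.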
cs.AI / 54 / 2604.23949
Context-Aware Hospitalization Forecasting Evaluations for Decision Support using LLMs
基于上下文的住院预测评估:使用大型语言模型支持决策
Abstract
Medical and public health experts must make real-time resource decisions, such as expanding hospital bed capacity, based on projected hospitalization trends during large-scale healthcare disruptions (e.g., operational failures or pandemics). Forecasting models can assist in this task by analyzing large volumes of resource-related data at the facility level, but they must be reliable for decision-making under real-world data conditions. Recent work shows that large language models (LLMs) can incorporate richer forms of context into numerical forecasting. Whereas traditional models rely primarily on temporal context (i.e., past observations), LLMs can also leverage non-temporal public health context such as demographic, geographic, and population-level features. However, it remains unclear how these models should be used to produce stable or decision-relevant predictions in real-world healthcare settings. To assess how LLMs can be used effectively in this setting, we compare three approaches across 60 U.S. counties with low, mid, and high hospitalization intensities: direct LLM-based forecasting, classical time-series models, and a context-augmented hybrid pipeline (HybridARX) that incorporates LLM-derived signals into structured models. Because the goal is operational decision-making rather than error minimization alone, we evaluate performance with bias and lead-lag alignment in addition to standard forecasting metrics. Our results show that HybridARX improves over classical ARX by yielding more stable and better-calibrated forecasts, particularly when incorporating noisy contextual signals into structured time-series models. These findings suggest that, in non-stationary healthcare resource forecasting, LLMs are most useful when embedded within structured hybrid models.
Chinese Translation
医疗和公共卫生专家必须根据大规模医疗干扰(例如,运营故障或疫情)期间的住院趋势预测,实时做出资源决策,例如扩大医院床位容量。预测模型可以通过分析大量与资源相关的数据来辅助这一任务,但它们必须在现实数据条件下对决策具有可靠性。近期研究表明,大型语言模型(LLMs)能够将更丰富的上下文形式融入数值预测中。传统模型主要依赖于时间上下文(即过去的观察),而LLMs还可以利用非时间的公共卫生上下文,如人口统计、地理和人口水平特征。然而,目前尚不清楚这些模型应如何在现实医疗环境中产生稳定或与决策相关的预测。为了评估LLMs在这一环境中的有效使用,我们在美国60个住院强度低、中、高的县进行了三种方法的评估:基于LLM的直接预测、经典时间序列模型,以及一种将LLM衍生信号融入结构化模型的上下文增强混合管道(HybridARX)。由于目标是操作性决策而不仅仅是误差最小化,我们在标准预测指标之外,还评估了偏差和滞后对齐的性能。我们的结果表明,HybridARX在提供更稳定和更好校准的预测方面优于经典ARX,特别是在将噪声上下文信号融入结构化时间序列模型时。这些发现表明,在非平稳的医疗资源预测中,当LLMs嵌入结构化混合模型时,其最为有效。
cs.AI / 55 / 2604.23954
An empirical evaluation of the risks of AI model updates using clinical data: stability, arbitrariness, and fairness
基于临床数据的人工智能模型更新风险的实证评估:稳定性、任意性与公平性
Abstract
Artificial Intelligence and Machine Learning (AI/ML) models used in clinical settings are increasingly deployed to support clinical decision-making. However, when training data become stale due to changes in demographics, environment, or patient behaviors, model performance can degrade substantially. While updating models with new training data is necessary, such updates may also introduce new risks. We evaluated the proposed monitoring framework on four publicly available U.S.-based Type 1 Diabetes datasets containing high-resolution continuous glucose monitoring (CGM) data, comprising approximately 11,300 weekly observations from 496 participants under 20 years of age. All datasets included structured sociodemographic information. Using the prediction of severe hyperglycemia events in children with type 1 diabetes as a case study, we examine how different model update strategies can adversely affect model stability (e.g., by causing predictions to "flip" for a large number of cases after an update), increase arbitrariness in predictions, or worsen accuracy equity and the balance of error rates across subpopulations. We propose multiple dimensions for continuous monitoring to detect these issues and argue that such monitoring is essential for the development of trustworthy clinical decision support systems.
Chinese Translation
在临床环境中使用的人工智能和机器学习(AI/ML)模型越来越多地被部署以支持临床决策。然而,当训练数据因人口统计、环境或患者行为的变化而变得陈旧时,模型性能可能会显著下降。虽然使用新训练数据更新模型是必要的,但此类更新也可能引入新的风险。我们在四个公开的美国1型糖尿病数据集上评估了所提监测框架,这些数据集包含高分辨率的连续血糖监测(CGM)数据,共计约11,300个来自496名20岁以下参与者的每周观察。所有数据集均包含结构化的社会人口统计信息。以预测1型糖尿病儿童严重高血糖事件为案例研究,我们考察了不同的模型更新策略如何对模型稳定性产生不利影响(例如,更新后导致大量案例的预测“翻转”)、增加预测的任意性,或恶化准确性公平性和子群体间错误率的平衡。我们提出了多个维度的持续监测,以检测这些问题,并认为这种监测对于开发可信赖的临床决策支持系统至关重要。
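The prediction "flips" mentioned in the abstract can be made concrete with a small sketch. It assumes binary predictions and defines the flip rate as the share of cases whose predicted label changes between the old and the updated model; this definition, the function names, and the toy data are illustrative assumptions, not the paper's monitoring framework.

```python
from collections import defaultdict


def flip_rate(old_preds, new_preds):
    """Share of cases whose predicted label differs after a model update."""
    if len(old_preds) != len(new_preds):
        raise ValueError("prediction lists must have the same length")
    if not old_preds:
        return 0.0
    flips = sum(o != n for o, n in zip(old_preds, new_preds))
    return flips / len(old_preds)


def flip_rate_by_group(old_preds, new_preds, groups):
    """Flip rate computed separately for each subgroup label."""
    buckets = defaultdict(lambda: ([], []))
    for o, n, g in zip(old_preds, new_preds, groups):
        buckets[g][0].append(o)
        buckets[g][1].append(n)
    return {g: flip_rate(o, n) for g, (o, n) in buckets.items()}


# Toy data (hypothetical): 1 = event predicted, 0 = no event predicted.
old = [1, 0, 0, 1, 1, 0, 1, 0]
new = [1, 1, 0, 0, 1, 0, 1, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

overall = flip_rate(old, new)                      # 3 of 8 cases flip
per_group = flip_rate_by_group(old, new, groups)   # group "a" flips twice as often as "b"
print(overall, per_group)
```

Reporting the rate per subgroup matters because an update can leave aggregate accuracy roughly unchanged while concentrating flips in one subpopulation.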
cs.AI / 56 / 2604.23970
LLM-Guided Agentic Floor Plan Parsing for Accessible Indoor Navigation of Blind and Low-Vision People
基于大型语言模型指导的代理式平面图解析用于盲人和低视力人群的无障碍室内导航
Abstract
Indoor navigation remains a critical accessibility challenge for blind and low-vision (BLV) individuals, as existing solutions rely on costly per-building infrastructure. We present an agentic framework that converts a single floor plan image into a structured, retrievable knowledge base to generate safe, accessible navigation instructions with lightweight infrastructure. The system has two phases: a multi-agent module that parses the floor plan into a spatial knowledge graph through a self-correcting pipeline with iterative retry loops and corrective feedback; and a Path Planner that generates accessible navigation instructions, with a Safety Evaluator agent assessing potential hazards along each route. We evaluate the system on the real-world UMBC Math and Psychology building (floors MP-1 and MP-3) and on the CVC-FP benchmark. On MP-1, we achieve success rates of 92.31%, 76.92%, and 61.54% for short, medium, and long routes, outperforming the strongest single-call baseline (Claude 3.7 Sonnet) at 84.62%, 69.23%, and 53.85%. On MP-3, we reach 76.92%, 61.54%, and 38.46%, compared to the best baseline at 61.54%, 46.15%, and 23.08%. These results show consistent gains over single-call LLM baselines and demonstrate that our workflow is a scalable solution for accessible indoor navigation for BLV individuals.
Chinese Translation
室内导航仍然是盲人和低视力(BLV)个体面临的一个重要无障碍挑战,因为现有解决方案依赖于昂贵的建筑基础设施。我们提出了一种代理框架,将单一平面图像转换为结构化、可检索的知识库,以生成安全、无障碍的导航指令,且基础设施轻便。该系统分为两个阶段:一个多代理模块通过自我校正的管道将平面图解析为空间知识图谱,采用迭代重试循环和纠正反馈;一个路径规划器生成无障碍导航指令,同时安全评估代理评估每条路线的潜在危险。我们在现实世界的UMBC数学与心理学大楼(MP-1和MP-3楼层)及CVC-FP基准上评估了该系统。在MP-1上,我们的短途、中途和长途路线成功率分别为92.31%、76.92%和61.54%,超越了最强的单次调用基准(Claude 3.7 Sonnet)的84.62%、69.23%和53.85%。在MP-3上,我们的成功率为76.92%、61.54%和38.46%,而最佳基准为61.54%、46.15%和23.08%。这些结果显示出相较于单次调用大型语言模型基准的一致性提升,并证明我们的工作流程是一个可扩展的解决方案,适用于BLV个体的无障碍室内导航。
cs.AI / 57 / 2604.23985
Representational Curvature Modulates Behavioral Uncertainty in Large Language Models
表征曲率调节大语言模型中的行为不确定性
Abstract
In autoregressive large language models (LLMs), temporal straightening offers an account of how the next-token prediction objective shapes representations. Models learn to progressively straighten the representational trajectory of input sequences across layers, potentially facilitating next-token prediction via linear extrapolation. However, a direct link between this trajectory and token-level behavior has been missing. We provide such a link by relating contextual curvature (a geometric measure of how sharply the representational trajectory bends over recent context) to next-token entropy. Across two models (GPT-2 XL and Pythia-2.8B), contextual curvature is correlated with entropy, and this relationship emerges during training. Perturbation experiments reveal selective dependence: manipulating curvature through trajectory-aligned interventions reliably modulates entropy, while geometrically misaligned perturbations have no effect. Finally, regularizing representations to be straighter during training modestly reduces token-level entropy without degrading validation loss. These results identify trajectory curvature as a task-aligned representational feature that influences behavioral uncertainty in LLMs.
Chinese Translation
在自回归的大语言模型(LLMs)中,时间上的直线化提供了一种解释,即下一个标记预测目标如何塑造表征。模型学习在各层之间逐步直线化输入序列的表征轨迹,这可能通过线性外推促进下一个标记的预测。然而,这一轨迹与标记级行为之间的直接联系一直缺失。我们通过将上下文曲率——一种几何度量,用于描述表征轨迹在最近上下文中弯曲的程度——与下一个标记的熵联系起来,提供了这样的联系。在两个模型(GPT-2 XL 和 Pythia-2.8B)中,上下文曲率与熵相关,并且这种关系在训练过程中显现。扰动实验揭示了选择性依赖性:通过与轨迹对齐的干预操控曲率可以可靠地调节熵,而几何上不对齐的扰动则没有影响。最后,在训练期间对表征进行正则化以使其更加直线化,适度降低了标记级熵,而没有降低验证损失。这些结果确定了轨迹曲率作为一种与任务对齐的表征特征,影响着大语言模型中的行为不确定性。
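The two quantities related in the abstract can be sketched in a few lines. The sketch assumes that curvature is measured as the mean turning angle between consecutive displacement vectors of a trajectory, and that entropy is the Shannon entropy of the softmax over next-token logits; the paper's exact definition of contextual curvature may differ, and all names and toy inputs here are illustrative.

```python
import math


def turning_angles(points):
    """Angles in radians between consecutive displacement vectors."""
    steps = [[b - a for a, b in zip(p, q)] for p, q in zip(points, points[1:])]
    angles = []
    for u, v in zip(steps, steps[1:]):
        norm_u = math.sqrt(sum(x * x for x in u))
        norm_v = math.sqrt(sum(x * x for x in v))
        if norm_u == 0 or norm_v == 0:
            continue  # skip degenerate steps with no movement
        cos = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
        angles.append(math.acos(max(-1.0, min(1.0, cos))))
    return angles


def mean_curvature(points):
    """Mean turning angle; 0.0 for a perfectly straight trajectory."""
    angles = turning_angles(points)
    return sum(angles) / len(angles) if angles else 0.0


def entropy_from_logits(logits):
    """Shannon entropy (nats) of the softmax distribution over logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # shifted for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy 2-D trajectories standing in for per-token hidden states.
straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
bent = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

print(mean_curvature(straight))              # straight line: no turning
print(mean_curvature(bent))                  # two right-angle turns
print(entropy_from_logits([0.0, 0.0, 0.0, 0.0]))  # uniform over 4 tokens
```

With real models, `points` would be the hidden states of the most recent context tokens at one layer, and the correlation reported in the abstract would be computed between the two quantities across many positions.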
cs.AI / 58 / 2604.23990
Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents
以失败为中心的已部署三语公共空间代理的运行时评估
Abstract
This paper presents PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. The central claim is that, when the evaluation object shifts from a static input-output mapping to a runtime system, the basic unit of analysis should shift from score to failure. PSA-Eval extends the conventional chain Question -> Answer -> Score -> End into Question -> Batch -> Run -> Score -> Failure Case -> Repair -> Regression Batch, making failures traceable, reviewable, repairable, and regression-testable. The framework uses trilingual equivalent inputs as controlled probes for observing group-level cross-language policy drift. We conduct a pilot study on a real trilingual digital front-desk system deployed in the lobby of an international financial institution. The pilot uses a simplified single-foundation-model setting (MA = MB), so the observed drift should not be interpreted as an A/B foundation-model difference. The study contains 81 samples organized into 27 trilingual equivalent question groups. Although the system achieves an average score of 23.15/24, 14 groups show non-zero cross-language score drift, 5 groups show drift of at least 3 points, and the maximum drift reaches 9 points. These results provide initial evidence that failure-centered runtime evaluation can expose structured deployment signals hidden by aggregate scoring.
Chinese Translation
本文提出了PSA-Eval,一个以失败为中心的已部署三语公共空间代理的运行时评估框架。核心观点是,当评估对象从静态输入输出映射转变为运行时系统时,分析的基本单位应从分数转变为失败。PSA-Eval将传统的链式评估流程从问题 -> 答案 -> 分数 -> 结束扩展为问题 -> 批次 -> 运行 -> 分数 -> 失败案例 -> 修复 -> 回归批次,使得失败可追踪、可审查、可修复且可进行回归测试。该框架使用三语等效输入作为受控探针,以观察群体层面的跨语言政策漂移。我们对在一家国际金融机构大厅部署的真实三语数字前台系统进行了初步研究。该试点使用简化的单基础模型设置(MA = MB),因此观察到的漂移不应被解读为A/B基础模型的差异。研究包含81个样本,组织成27个三语等效问题组。尽管系统的平均分数为23.15/24,但有14个组显示出非零的跨语言分数漂移,5个组显示出至少3分的漂移,最大漂移达到9分。这些结果提供了初步证据,表明以失败为中心的运行时评估能够揭示被聚合评分掩盖的结构化部署信号。
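A small sketch shows how a high average score can coexist with large per-group drift. It assumes drift is the difference between the highest and lowest score among the three language variants of one question group; the abstract does not spell out the formula, and the language keys and scores below are hypothetical.

```python
def group_drift(scores_by_lang):
    """Spread between the best and worst score within one question group."""
    values = list(scores_by_lang.values())
    return max(values) - min(values)


def summarize(groups, threshold=3):
    """Aggregate score and drift statistics over all question groups."""
    drifts = {gid: group_drift(scores) for gid, scores in groups.items()}
    all_scores = [v for scores in groups.values() for v in scores.values()]
    return {
        "mean_score": sum(all_scores) / len(all_scores),
        "nonzero_drift": sum(d > 0 for d in drifts.values()),
        "drift_at_least_threshold": sum(d >= threshold for d in drifts.values()),
        "max_drift": max(drifts.values()),
    }


# Hypothetical scores out of 24 for three equivalent-question groups.
groups = {
    "q01": {"lang_a": 24, "lang_b": 24, "lang_c": 24},
    "q02": {"lang_a": 24, "lang_b": 23, "lang_c": 24},
    "q03": {"lang_a": 24, "lang_b": 15, "lang_c": 24},
}

summary = summarize(groups)
print(summary)
```

In this toy example the mean stays close to the maximum while one group drifts by 9 points, which is the kind of signal an aggregate score alone would hide.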
cs.AI / 59 / 2604.24001
CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation
CT-FineBench:用于CT报告生成细粒度评估的诊断准确性基准
Abstract
The evaluation of generated reports remains a critical challenge in Computed Tomography (CT) report generation, due to the large volume of text, the diversity and complexity of findings, and the presence of fine-grained, disease-oriented attributes. Conventional evaluation metrics offer only coarse measures of lexical overlap or entity matching and fail to reflect the granular diagnostic accuracy required for clinical use. To address this gap, we propose CT-FineBench, a benchmark built from CT-RATE and Merlin to evaluate the fine-grained factual consistency of CT reports. Our benchmark is constructed through a meticulous, Question-Answering (QA) based process: first, we identify and structure key, finding-specific clinical attributes (like location, size, margin). Second, we systematically transform these attributes into a QA dataset, where questions probe for specific clinical details grounded in gold-standard reports. The evaluation protocol for CT-FineBench involves using this QA dataset to query a machine-generated report and scoring the correctness of the answers. This allows for a comprehensive, interpretable, and clinically relevant assessment, moving beyond superficial lexical overlap to pinpoint specific clinical errors. Experiments show that CT-FineBench correlates better with expert clinical assessment and is substantially more sensitive to fine-grained factual errors than prior metrics.
Chinese Translation
生成报告的评估在计算机断层扫描(CT)报告生成中仍然是一个关键挑战,这主要是由于文本量大、发现的多样性和复杂性,以及存在细粒度的疾病相关属性。传统的评估指标仅提供粗略的词汇重叠或实体匹配度量,无法反映临床使用所需的细致诊断准确性。为了解决这一问题,我们提出了CT-FineBench,这是一个基于CT-RATE和Merlin构建的基准,用于评估CT报告的细粒度事实一致性。我们的基准通过细致的问答(QA)过程构建:首先,我们识别并结构化关键的、特定发现的临床属性(如位置、大小、边缘)。其次,我们系统地将这些属性转化为QA数据集,其中的问题探讨基于黄金标准报告的特定临床细节。CT-FineBench的评估协议涉及使用该QA数据集对机器生成的报告进行查询,并对答案的正确性进行评分。这使得评估变得全面、可解释且与临床相关,超越了表面的词汇重叠,能够准确指出特定的临床错误。实验表明,CT-FineBench与专家临床评估的相关性更高,并且对细粒度事实错误的敏感性显著高于以前的指标。
cs.AI / 60 / 2604.24021
QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems
QED:一个用于生成开放问题数学证明的开源多智能体系统
Abstract
We explore a central question in AI for mathematics: can AI systems produce original, nontrivial proofs for open research problems? Despite strong benchmark performance, producing genuinely novel proofs remains an outstanding challenge for LLMs. Through systematic experiments with frontier LLMs on research-level proof tasks, we identify seven failure modes that prevent reliable proof generation, including context contamination, citation hallucination, hand-waving on key steps and misallocation of proof effort, unstable proof plans, unfocused verification, problem modification and single-model bottleneck. We argue that the gap between benchmark success and research-level proving is primarily one of system design, due to those failure modes. We present QED, an open-source multi-agent proof system in which each architectural decision directly addresses a specific failure mode. Evaluated on five open problems in applied analysis and PDEs contributed by domain experts, QED produces correct proofs for three problems, each verified by the contributing experts as original and nontrivial. QED is released as open-source software at https://github.com/proofQED/QED.
Chinese Translation
我们探讨了数学人工智能中的一个核心问题:人工智能系统能否为开放研究问题生成原创的、非平凡的证明?尽管在基准测试中表现强劲,生成真正新颖的证明仍然是大型语言模型(LLMs)面临的重大挑战。通过对前沿LLMs在研究级证明任务上的系统实验,我们识别出七种阻碍可靠证明生成的失败模式,包括上下文污染、引用幻觉、关键步骤的模糊处理和证明努力的错误分配、不稳定的证明计划、缺乏聚焦的验证、问题修改以及单模型瓶颈。我们认为,基准成功与研究级证明之间的差距主要是系统设计的问题,正是由于这些失败模式。我们提出了QED,一个开源的多智能体证明系统,其中每个架构决策直接针对特定的失败模式。在由领域专家贡献的五个应用分析和偏微分方程(PDEs)开放问题上进行评估,QED为三个问题生成了正确的证明,并且每个证明都经过贡献专家的验证,被认为是原创且非平凡的。QED已作为开源软件发布,网址为 https://github.com/proofQED/QED。
cs.AI / 61 / 2604.24038
AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment
AgentPulse:用于评估部署中人工智能代理的连续多信号框架
Abstract
Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment. We introduce AgentPulse, a continuous evaluation framework scoring 50 agents across 10 workload categories along four factors (Benchmark Performance, Adoption Signals, Community Sentiment, and Ecosystem Health) aggregated from 18 real-time signals across GitHub, package registries, IDE marketplaces, social platforms, and benchmark leaderboards. Three analyses ground the framework. The four factors capture largely complementary information (n=50; $\rho_{\max}=0.61$ for Adoption-Ecosystem, all others $|\rho| \leq 0.37$). A circularity-controlled test (n=35) shows the Benchmark+Sentiment sub-composite, which contains no GitHub-derived signals, predicts external adoption proxies it does not aggregate: GitHub stars ($\rho_s=0.52$, $p<0.01$) and Stack Overflow question volume ($\rho_s=0.49$, $p<0.01$), with VS Code installs ($\rho_s=0.44$, $p<0.05$) reported as illustrative given that only 11 of 35 agents have non-zero installs. On the n=11 subset with published SWE-bench scores, composite and benchmark-only rankings are nearly uncorrelated ($\rho_s=0.25$; 9 of 11 agents shift by at least 2 ranks), driven by a strong negative Adoption-Capability correlation among closed-source high-capability agents within this subset. This is precisely why we rest the framework's validity claim on the broader n=35 test rather than the SWE-bench overlap. AgentPulse surfaces deployment signal absent from benchmarks; it is a methodology, not a ground-truth ranking. The framework, all collected signals, scoring outputs, and evaluation harness are released under CC BY 4.0.
Chinese Translation
静态基准测试衡量人工智能代理在固定时间点的能力,但无法评估它们在部署中的采用、维护或体验。我们提出了AgentPulse,一个连续评估框架,针对10个工作负载类别中的50个代理,从18个实时信号中聚合四个因素(基准性能、采用信号、社区情感和生态系统健康)进行评分,这些信号来自GitHub、软件包注册中心、IDE市场、社交平台和基准排行榜。三个分析为该框架提供了基础。这四个因素捕捉到的信息在很大程度上是互补的(n=50;$\rho_{\max}=0.61$,用于采用-生态系统,其余均为$|\rho| \leq 0.37$)。一个控制循环性的测试(n=35)表明,基准+情感子复合体(不包含任何来自GitHub的信号)能够预测它未聚合的外部采用代理:GitHub星标($\rho_s=0.52$,$p<0.01$)和Stack Overflow问题数量($\rho_s=0.49$,$p<0.01$),而VS Code安装量($\rho_s=0.44$,$p<0.05$)被报告为说明性数据,因为在35个代理中仅有11个的安装量非零。在n=11的子集中,发布的SWE-bench分数显示,复合排名和仅基准排名几乎不相关($\rho_s=0.25$;11个代理中有9个至少移动了2个排名),这一现象是由该子集中闭源高能力代理之间强负相关的采用-能力关系所驱动的。这正是我们将框架有效性声明建立在更广泛的n=35测试而非SWE-bench重叠上的原因。AgentPulse揭示了基准测试中缺失的部署信号;它是一种方法论,而不是一个真实排名。该框架、所有收集的信号、评分输出和评估工具已根据CC BY 4.0发布。
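The $\rho_s$ values in the abstract are Spearman rank correlations. As a minimal sketch of that statistic, the code below ranks both variables (averaging ranks for ties) and takes the Pearson correlation of the ranks. The toy scores and star counts are hypothetical and are not AgentPulse data.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical composite scores and an external adoption proxy for 5 agents.
composite = [0.9, 0.7, 0.5, 0.3, 0.1]
stars = [1200, 5000, 800, 300, 100]

rho = spearman(composite, stars)
print(round(rho, 3))  # 0.9
```

Because only ranks enter the computation, the statistic is insensitive to the heavy-tailed scale of signals such as star counts, which is one reason rank correlation suits this kind of comparison.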
cs.AI / 62 / 2604.24043
A2DEPT: Large Language Model-Driven Automated Algorithm Design via Evolutionary Program Trees
A2DEPT:基于大型语言模型的通过进化程序树进行自动算法设计
Abstract
Designing heuristics for combinatorial optimization problems (COPs) is a fundamental yet challenging task that traditionally requires extensive domain expertise. Recently, Large Language Model (LLM)-based Automated Heuristic Design (AHD) has shown promise in autonomously generating heuristic components with minimal human intervention. However, most existing LLM-based AHD methods enforce fixed algorithmic templates to ensure executability, which confines the search to component-level tuning and limits system-level algorithmic expressiveness. To enable open-ended solver synthesis beyond rigid templates, we propose Automated Algorithm Design via Evolutionary Program Trees (A2DEPT), which treats LLMs as system-level algorithm architects. A2DEPT explores the vast program space via a tree-structured evolutionary search with hybrid selection and hierarchical operators, enabling iterative refinement of complete algorithms. To make open-ended generation practical, we enforce executability with a lightweight program-maintenance loop that performs feedback-driven repair. In experiments, A2DEPT consistently outperforms representative LLM-based baselines on both standard and highly constrained benchmarks. On the standard benchmarks, it reduces the mean normalized optimality gap by 9.8% relative to the strongest competing AHD baseline.
Chinese Translation
为组合优化问题(COPs)设计启发式算法是一项基本但具有挑战性的任务,传统上需要广泛的领域专业知识。最近,基于大型语言模型(LLM)的自动化启发式设计(AHD)在自主生成启发式组件方面显示出了前景,且人类干预最小。然而,大多数现有的基于LLM的AHD方法强制使用固定的算法模板以确保可执行性,这限制了搜索范围仅限于组件级调优,并限制了系统级算法的表达能力。为了实现超越严格模板的开放式求解器合成,我们提出了通过进化程序树进行自动算法设计(A2DEPT),将LLM视为系统级算法架构师。A2DEPT通过树状结构的进化搜索,结合混合选择和分层操作,探索广泛的程序空间,从而实现完整算法的迭代优化。为了使开放式生成变得实用,我们通过轻量级程序维护循环来强制可执行性,该循环执行基于反馈的修复。在实验中,A2DEPT在标准和高度约束的基准测试中始终优于代表性的基于LLM的基线。在标准基准测试中,相较于最强竞争的AHD基线,它将平均归一化最优性差距降低了9.8%。
cs.AI / 63 / 2604.24062
Grounding Before Generalizing: How AI Differs from Humans in Causal Transfer
落地先于泛化:人工智能与人类在因果迁移中的差异
Abstract
Extracting abstract causal structures and applying them to novel situations is a hallmark of human intelligence. While Large Language Models (LLMs) and Vision Language Models (VLMs) have shown strong performance on a wide range of reasoning tasks, their capacity for interactive causal learning -- inducing latent structures through sequential exploration and transferring them across contexts -- remains uncharacterized. Human learners accomplish such transfer after minimal exposure, whereas classical Reinforcement Learning (RL) agents fail catastrophically. Whether state-of-the-art Artificial Intelligence (AI) models possess human-like mechanisms for abstract causal structure transfer is an open question. Using the OpenLock paradigm requiring sequential discovery of Common Cause (CC) and Common Effect (CE) structures, here we show that models exhibit fundamentally delayed or absent transfer: even successful models require initial environmental-specific mapping -- what we term environmental grounding -- before efficiency gains emerge, whereas humans leverage prior structural knowledge from the very first solution attempt. In the text-only condition, models matched or exceeded human discovery efficiency. In contrast, visual information -- in both the image-only and text-and-image conditions -- overall degraded rather than enhanced performance, revealing a broad reliance on symbolic processing rather than integrated multimodal reasoning. Models further exhibited systematic CC/CE asymmetries absent in humans, suggesting heuristic biases rather than direction-neutral causal abstraction. These findings reveal that large-scale statistical learning does not produce the decontextualized causal schemas underpinning human analogical reasoning, establishing grounding-dependent transfer as a fundamental limitation of current LLMs and VLMs.
Chinese Translation
提取抽象因果结构并将其应用于新情境是人类智能的一个标志。尽管大型语言模型(LLMs)和视觉语言模型(VLMs)在广泛的推理任务中表现出强大的性能,但它们在互动因果学习方面的能力——通过顺序探索诱导潜在结构并在不同上下文中转移这些结构——仍然没有得到充分表征。人类学习者在最小暴露后能够完成这种转移,而经典的强化学习(RL)代理则会遭遇灾难性失败。当前最先进的人工智能(AI)模型是否具有人类般的抽象因果结构转移机制仍然是一个未解之谜。通过使用OpenLock范式,该范式要求顺序发现共同原因(Common Cause, CC)和共同效果(Common Effect, CE)结构,我们在此展示模型表现出根本性延迟或缺失的转移:即使是成功的模型也需要初始的环境特定映射——我们称之为环境基础——才能出现效率提升,而人类则从第一次解决尝试开始就利用先前的结构知识。在仅文本条件下,模型的发现效率与人类相当或超过人类。相反,视觉信息——在仅图像和文本与图像条件下——总体上降低了而不是增强了性能,揭示出模型在符号处理而非综合多模态推理上的广泛依赖。模型还表现出系统性的CC/CE不对称性,而人类则不存在这种情况,这表明存在启发式偏差而非方向中立的因果抽象。这些发现揭示了大规模统计学习并未产生支撑人类类比推理的去上下文化因果图式,确立了依赖基础转移作为当前LLMs和VLMs的一个根本限制。
cs.AI / 64 / 2604.24076
An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress
一种信息几何框架用于在熵应力下分析大型语言模型的稳定性
Abstract
As large language models (LLMs) are increasingly deployed in high-stakes and operational settings, evaluation strategies based solely on aggregate accuracy are often insufficient to characterize system reliability. This study proposes a thermodynamics-inspired modeling framework for analyzing the stability of LLM outputs under conditions of uncertainty and perturbation. The framework introduces a composite stability score that integrates task utility, entropy as a measure of external uncertainty, and two internal structural proxies: internal integration and aligned reflective capacity. Rather than interpreting these quantities as physical variables, the formulation is intended as an interpretable abstraction that captures how internal structure may modulate the impact of disorder on model behavior. Using the IST-20 benchmarking protocol and associated metadata, we analyze 80 model-scenario observations across four contemporary LLMs. The proposed formulation consistently yields higher stability scores than a reduced utility-entropy baseline, with a mean improvement of 0.0299 (95% CI: 0.0247-0.0351). The observed gain is more pronounced under higher entropy conditions, suggesting that the framework captures a form of nonlinear attenuation of uncertainty. We do not claim a fundamental physical law or a complete theory of machine ethics. Instead, the contribution of this work is a compact and interpretable modeling perspective that connects uncertainty, performance, and internal structure within a unified evaluation lens. The framework is intended to complement existing benchmarking approaches and to support ongoing discussions in AI safety, reliability, and governance.
Chinese Translation
随着大型语言模型(LLMs)在高风险和操作环境中的日益广泛应用,仅基于整体准确率的评估策略往往不足以表征系统的可靠性。本研究提出了一种受热力学启发的建模框架,用于分析LLM输出在不确定性和扰动条件下的稳定性。该框架引入了一种复合稳定性评分,整合了任务效用、作为外部不确定性度量的熵,以及两个内部结构代理:内部整合和对齐反射能力。该公式并不将这些量解释为物理变量,而是作为一种可解释的抽象,捕捉内部结构如何调节无序对模型行为的影响。利用IST-20基准协议及相关元数据,我们分析了四个现代LLM的80个模型场景观察。所提出的公式在稳定性评分上始终优于降低的效用-熵基线,平均改善为0.0299(95% CI:0.0247-0.0351)。在较高熵条件下观察到的增益更为明显,表明该框架捕捉到了一种不确定性的非线性衰减形式。我们并不声称这是一个基本的物理定律或完整的机器伦理理论。相反,本工作的贡献在于提供了一种紧凑且可解释的建模视角,将不确定性、性能和内部结构联系在一个统一的评估视角中。该框架旨在补充现有的基准评估方法,并支持关于人工智能安全、可靠性和治理的持续讨论。
cs.AI / 65 / 2604.24083
The Kerimov-Alekberli Model: An Information-Geometric Framework for Real-Time System Stability
Kerimov-Alekberli 模型:实时系统稳定性的一个信息几何框架
Abstract
This study introduces the Kerimov-Alekberli model, a novel information-geometric framework that redefines AI safety by formally linking non-equilibrium thermodynamics to stochastic control for the ethical alignment of autonomous systems. By establishing a formal isomorphism between non-equilibrium thermodynamics and stochastic control, we define systemic anomalies as deviations from a Riemannian manifold. The model utilizes the Kullback-Leibler divergence as the primary metric, governed by a dynamic threshold derived from the Fisher Information Metric. We further ground this framework in the Landauer Principle, proving that adversarial perturbations perform measurable physical work by increasing the system's informational entropy. Validation on the NSL-KDD dataset and unmanned aerial vehicle trajectory simulations demonstrated that our model achieves effective real-time detection via the FPT trigger, with strong performance metrics (e.g., high accuracy and low FPR) on benchmark datasets. This study provides a rigorous physical foundation for AI safety, transitioning from heuristic, rule-based ethical frameworks to a thermodynamics-based stability paradigm by grounding ethical violations in quantifiable physical work and entropic information.
Chinese Translation
本研究介绍了 Kerimov-Alekberli 模型,这是一种新颖的信息几何框架,通过将非平衡热力学与随机控制形式性地联系起来,重新定义了人工智能安全,以实现自主系统的伦理对齐。通过建立非平衡热力学与随机控制之间的形式同构,我们将系统异常定义为从黎曼流形的偏离。该模型利用 Kullback-Leibler 散度作为主要度量,由从 Fisher 信息度量导出的动态阈值所控制。我们进一步将该框架建立在 Landauer 原则之上,证明对抗性扰动通过增加系统的信息熵而执行可测量的物理功。在 NSL-KDD 数据集和无人机轨迹模拟上的验证表明,我们的模型通过 FPT 触发器实现了有效的实时检测,并在基准数据集上表现出强大的性能指标(例如,高准确率和低假阳性率)。本研究为人工智能安全提供了严格的物理基础,从启发式的基于规则的伦理框架转变为基于热力学的稳定性范式,通过将伦理违规行为与可量化的物理功和熵信息相结合。
cs.AI / 66 / 2604.24102
SemML 2.0: Synthesizing Controllers for LTL
SemML 2.0:为线性时序逻辑合成控制器
Abstract
Synthesizing a reactive system from specifications given in linear temporal logic (LTL) is a classical problem, finding its applications in safety-critical systems design. These systems are typically represented using either Mealy machines or AIGER circuits. We present the second version of SemML, which outperforms all state-of-the-art tools for finding either solution. Aside from implementing the classical automata-theoretic approach, our tool utilizes partial exploration and machine-learning guidance for obtaining solutions efficiently, and numerous heuristics and improvements of classic algorithms for extracting small representations of these solutions. We evaluate our tool against the existing state-of-the-art tools (in particular Strix, LtlSynt, and the previous version of SemML) on the dataset of the synthesis competition SYNTCOMP. We show that we solve significantly more instances and do so much faster than other tools, while maintaining state-of-the-art solution quality.
Chinese Translation
从线性时序逻辑(LTL)给定的规格合成反应系统是一个经典问题,广泛应用于安全关键系统的设计。这些系统通常使用梅利机(Mealy machines)或AIGER电路表示。我们提出了SemML的第二个版本,它在寻找任一解决方案方面优于所有现有的最先进工具。除了实现经典的自动机理论方法外,我们的工具还利用部分探索和机器学习指导来高效地获得解决方案,并采用多种启发式方法和经典算法的改进,以提取这些解决方案的小型表示。我们在合成竞赛SYNTCOMP的数据集上评估了我们的工具,与现有的最先进工具(尤其是Strix、LtlSynt和SemML的前一个版本)进行比较。结果表明,我们解决了显著更多的实例,并且速度远快于其他工具,同时保持了最先进的解决方案质量。
cs.AI / 67 / 2604.24117
An Analysis of the Coordination Gap between Joint and Modular Learning for Job Shop Scheduling with Transportation Resources
联合学习与模块化学习在运输资源作业车间调度中的协调差距分析
Abstract
Efficient job-shop scheduling with transportation resources is critical for high-performance manufacturing. With the rise of "decentralized factories", multi-agent reinforcement learning has emerged as a promising approach for the combined scheduling of production and transportation tasks. Prior work has largely focused on developing novel cooperative architectures while overlooking the question of when joint training is necessary. Joint training denotes the simultaneous training of job and automatic guided vehicle scheduling agents, whereas modular training involves independently training each agent followed by post-hoc integration. In this study, we systematically investigate the conditions under which joint training is essential for optimal performance in the job-shop scheduling problem with transportation resources. Through a rigorous sensitivity analysis of resource scarcity and temporal dominance, we quantify the coordination gap -- the performance difference between these two training modalities. In our evaluation, the joint training can produce superior performance compared to the best-performing combinations of dispatching rules and modular training. However, the coordination gap advantage diminishes in bottleneck environments, particularly under severe transport and processing constraints. These findings indicate that modular training represents a viable alternative in environments where a single scheduling task dominates. Overall, our work provides practical guidance for selecting between training modalities based on environmental conditions, enabling decision-makers to optimize reinforcement learning-based scheduling performance.
Chinese Translation
高效的作业车间调度与运输资源对于高性能制造至关重要。随着“去中心化工厂”的兴起,多智能体强化学习作为生产与运输任务联合调度的有前景的方法逐渐受到关注。以往的研究主要集中在开发新颖的合作架构,而忽视了何时进行联合训练这一问题。联合训练指的是作业调度和自动引导车调度代理的同时训练,而模块化训练则是指独立训练每个代理,然后进行后期整合。在本研究中,我们系统地探讨了在运输资源作业车间调度问题中,何种条件下联合训练对实现最佳性能是必需的。通过对资源稀缺性和时间主导性的严格敏感性分析,我们量化了协调差距——这两种训练模式之间的性能差异。在我们的评估中,联合训练能够产生优于最佳调度规则组合和模块化训练的性能。然而,在瓶颈环境中,特别是在严重的运输和处理约束下,协调差距的优势减小。这些发现表明,在单一调度任务占主导地位的环境中,模块化训练是一种可行的替代方案。总体而言,我们的研究为根据环境条件选择训练模式提供了实用指导,使决策者能够优化基于强化学习的调度性能。
cs.AI / 68 / 2604.24153
Right-to-Act: A Pre-Execution Non-Compensatory Decision Protocol for AI Systems
行动权:一种针对人工智能系统的预执行非补偿性决策协议
Abstract
Current AI systems increasingly operate in contexts where their outputs directly trigger real-world actions. Most existing approaches to AI safety, risk management, and governance focus on post-hoc validation, probabilistic risk estimation, or certification of model behavior. However, these approaches implicitly assume that once a decision is produced, it is eligible for execution. In this work, we introduce the Right-to-Act protocol, a deterministic, non-compensatory pre-execution decision layer that evaluates whether an AI-generated decision is permitted to be realized at all. Unlike compensatory systems, where high-confidence signals can override failed conditions, the proposed framework enforces strict structural constraints: if any required condition is unmet, execution is halted or deferred. We formalize the distinction between compensatory and non-compensatory decision regimes and define a pre-execution legitimacy boundary. Through a scenario-based case study, we demonstrate how identical AI outputs can lead to divergent outcomes when evaluated under a Right-to-Act protocol, preserving reversibility and preventing premature or irreversible actions. The proposed approach reframes AI control from optimizing decisions to governing their admissibility, introducing a protocol-level abstraction that operates independently of model architecture or training methodology.
Chinese Translation
当前的人工智能系统越来越多地在其输出直接触发现实世界行动的环境中运行。现有的大多数人工智能安全、风险管理和治理方法侧重于事后验证、概率风险估计或模型行为的认证。然而,这些方法隐含地假设一旦产生决策,就有资格执行。在本研究中,我们引入了行动权协议(Right-to-Act),这是一种确定性的、非补偿性的预执行决策层,用于评估人工智能生成的决策是否被允许实现。与补偿系统不同,在补偿系统中,高置信度信号可以覆盖失败条件,而所提出的框架则强制执行严格的结构约束:如果任何必要条件未满足,则执行将被暂停或推迟。我们形式化了补偿性和非补偿性决策机制之间的区别,并定义了预执行合法性边界。通过基于场景的案例研究,我们展示了在行动权协议下评估时,相同的人工智能输出可以导致不同的结果,从而保持可逆性并防止过早或不可逆的行动。所提出的方法将人工智能控制的重点从优化决策转向治理其可接受性,引入了一种独立于模型架构或训练方法的协议级抽象。
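The contrast between compensatory and non-compensatory regimes described above can be illustrated with a minimal sketch. The condition names, the 0.7 threshold, and the 0.5 cut-off are hypothetical choices made for illustration; the abstract does not specify the protocol's actual conditions or interfaces.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Verdict:
    permitted: bool
    reason: str


def compensatory_gate(scores: Dict[str, float], threshold: float = 0.7) -> Verdict:
    """Average the condition scores, so a strong signal can offset a failed one."""
    mean = sum(scores.values()) / len(scores)
    return Verdict(mean >= threshold, f"mean score {mean:.2f}")


def non_compensatory_gate(conditions: Dict[str, bool]) -> Verdict:
    """Halt if any required condition is unmet, regardless of the others."""
    unmet = sorted(name for name, ok in conditions.items() if not ok)
    if unmet:
        return Verdict(False, "unmet: " + ", ".join(unmet))
    return Verdict(True, "all required conditions met")


# Hypothetical pre-execution checks for one AI-generated decision.
scores = {
    "model_confidence": 0.99,
    "authorization": 0.0,   # the required authorization is missing
    "reversibility": 1.0,
    "data_freshness": 1.0,
}
# Assumed binarization: a condition counts as met when its score is >= 0.5.
conditions = {name: value >= 0.5 for name, value in scores.items()}

compensatory = compensatory_gate(scores)
non_compensatory = non_compensatory_gate(conditions)
```

With these inputs the averaged score clears the threshold even though authorization is missing, while the non-compensatory gate halts execution. This is one way the same output can lead to different outcomes under the two regimes.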
cs.AI / 69 / 2604.24158
Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop
基于LLM作为评判者和人类参与的可持续城市旅行的多维评估
Abstract
Evaluating nuanced conversational travel recommendations is challenging when human annotations are costly and standard metrics ignore stakeholder-centric goals. We study LLMs-as-Judges for sustainable city-trip lists across four dimensions (relevance, diversity, sustainability, and popularity balance) and propose a three-phase calibration framework: (1) baseline judging with multiple LLMs, (2) expert evaluation to identify systematic misalignment, and (3) dimension-specific calibration via rules and few-shot examples. Across two recommendation settings, we observe model-specific biases and high dimension-level variance, even when judges agree on overall rankings. Calibration clarifies reasoning per dimension but exposes divergent interpretations of sustainability, highlighting the need for transparent, bias-aware LLM evaluation. Prompts and code are released for reproducibility: https://github.com/ashmibanerjee/trs-llm-calibration.
Chinese Translation
在评估细致的对话旅行推荐时,由于人工标注成本高昂且标准指标忽视了以利益相关者为中心的目标,这一过程面临挑战。我们研究了作为评判者的LLM(大语言模型)在可持续城市旅行推荐列表中的应用,涵盖四个维度——相关性、多样性、可持续性和受欢迎程度平衡,并提出了一个三阶段的校准框架:(1) 使用多个LLM进行基线评判,(2) 专家评估以识别系统性不一致,(3) 通过规则和少量示例进行维度特定的校准。在两个推荐设置中,我们观察到模型特定的偏见和较高的维度级方差,即使评判者对整体排名达成一致。校准明确了每个维度的推理,但暴露了对可持续性的不同解读,突显了对透明且关注偏见的LLM评估的需求。为可重复性发布了提示和代码:https://github.com/ashmibanerjee/trs-llm-calibration。
cs.AI / 70 / 2604.24170
Credal Concept Bottleneck Models for Epistemic-Aleatoric Uncertainty Decomposition
用于认识-随机不确定性分解的信念概念瓶颈模型
Abstract
Concept Bottleneck Models (CBMs) predict through human-interpretable concepts, but they typically output point concept probabilities that conflate epistemic uncertainty (reducible model underspecification) with aleatoric uncertainty (irreducible input ambiguity). This makes concept-level uncertainty hard to interpret and, more importantly, hard to act upon. We introduce CREDENCE (Credal Ensemble Concept Estimation), a CBM framework that decomposes concept uncertainty by construction. CREDENCE represents each concept as a credal prediction (a probability interval), derives epistemic uncertainty from disagreement across diverse concept heads, and estimates aleatoric uncertainty via a dedicated ambiguity output trained to match annotator disagreement when available. The resulting signals support prescriptive decisions: automate low-uncertainty cases, prioritize data collection for high-epistemic cases, route high-aleatoric cases to human review, and abstain when both are high. Across several tasks, we show that epistemic uncertainty is positively associated with prediction errors, whereas aleatoric uncertainty closely tracks annotator disagreement, providing guidance beyond error correlation. Our implementation is available at the following link: https://github.com/Tankiit/Credal_Sets/tree/ensemble-credal-cbm
Chinese Translation
概念瓶颈模型(CBMs)通过人类可解释的概念进行预测,但它们通常输出将认识不确定性(可减少的模型欠规范性)与随机不确定性(不可减少的输入模糊性)混合在一起的点概念概率。这使得概念级不确定性难以解释,更重要的是,难以采取行动。我们引入了CREDENCE(信念集成概念估计),这是一个通过构造分解概念不确定性的CBM框架。CREDENCE将每个概念表示为信念预测(概率区间),通过不同概念头之间的分歧推导出认识不确定性,并通过专门的模糊输出来估计随机不确定性,该输出经过训练以匹配可用时的注释者分歧。由此产生的信号支持规范性决策:自动处理低不确定性案例,为高认识案例优先收集数据,将高随机案例转交人类审核,并在两者都高时选择不采取行动。在多个任务中,我们展示了认识不确定性与预测错误呈正相关,而随机不确定性与注释者分歧密切相关,提供了超越错误相关性的指导。我们的实现可在以下链接获取:https://github.com/Tankiit/Credal_Sets/tree/ensemble-credal-cbm
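The four-way prescriptive routing described above can be sketched as follows. This is a minimal illustration, not the CREDENCE implementation: taking the credal interval as the min and max over ensemble head probabilities, using the interval width as the epistemic signal, and the two thresholds are all assumptions.

```python
from typing import Sequence, Tuple


def credal_interval(head_probs: Sequence[float]) -> Tuple[float, float]:
    """Assumed construction: lower/upper concept probability across ensemble heads."""
    return min(head_probs), max(head_probs)


def route(
    head_probs: Sequence[float],
    ambiguity: float,
    epistemic_threshold: float = 0.2,  # hypothetical threshold
    aleatoric_threshold: float = 0.5,  # hypothetical threshold
) -> str:
    """Map the two uncertainty signals to one of the four actions in the abstract."""
    lower, upper = credal_interval(head_probs)
    epistemic = upper - lower          # disagreement across heads
    high_epistemic = epistemic > epistemic_threshold
    high_aleatoric = ambiguity > aleatoric_threshold  # dedicated ambiguity output

    if high_epistemic and high_aleatoric:
        return "abstain"
    if high_epistemic:
        return "collect_data"
    if high_aleatoric:
        return "human_review"
    return "automate"
```

For example, heads that agree closely with a low ambiguity score route to `automate`, whereas widely disagreeing heads with low ambiguity route to `collect_data`.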
cs.AI / 71 / 2604.24176
Explanation Quality Assessment as Ranking with Listwise Rewards
将解释质量评估重构为带有列表奖励的排序问题
Abstract
We reformulate explanation quality assessment as a ranking problem rather than a generation problem. Instead of optimizing models to produce a single "best" explanation token-by-token, we train reward models to discriminate among multiple candidate explanations and learn their relative quality. Concretely, we construct per-instance candidate sets with graded quality levels and train listwise and pairwise ranking models (ListNet, LambdaRank, RankNet) to preserve ordinal structure and avoid score compression typical of pointwise regression or binary preference objectives. We observe three findings: First, ranking losses consistently outperform regression on score separation across all domains tested. Second, the optimal ranking loss depends on data characteristics: listwise objectives excel with well-separated quality tiers, while pairwise methods are more robust to noisy natural annotations. Third, when trained on carefully curated and well-structured data, small encoder models can match models that are orders of magnitude larger, suggesting that data quality matters more than model scale. Finally, when used as rewards in policy optimization, ranking-based scores enable stable convergence in settings where regression-based rewards fail entirely. Code and data are available at: https://github.com/Tankiit/PPO_Learning_to_rank
Chinese Translation
我们将解释质量评估重构为一个排序问题,而非生成问题。我们不再优化模型逐字生成单一的“最佳”解释,而是训练奖励模型以区分多个候选解释并学习它们的相对质量。具体而言,我们构建了每个实例的候选集,设定不同的质量等级,并训练列表排序和对排序模型(ListNet、LambdaRank、RankNet),以保持序数结构并避免点对点回归或二元偏好目标中典型的得分压缩。我们观察到三个发现:首先,在所有测试领域中,排序损失在得分分离方面始终优于回归。其次,最佳排序损失依赖于数据特征:当质量层次分明时,列表目标表现优异,而对排序方法对噪声自然注释更具鲁棒性。第三,当在经过精心策划和良好结构化的数据上训练时,小型编码器模型可以匹配规模大几个数量级的模型,这表明数据质量比模型规模更为重要。最后,当作为策略优化中的奖励使用时,基于排序的得分在回归奖励完全失效的情况下能够实现稳定收敛。代码和数据可在以下链接获取:https://github.com/Tankiit/PPO_Learning_to_rank
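To make the listwise versus pairwise distinction concrete, the sketch below implements the standard top-one ListNet loss and the RankNet pairwise logistic loss in plain Python. It illustrates the two loss families only; the candidate scores and graded labels are made up, and the paper's training setup may differ.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_loss(scores: Sequence[float], labels: Sequence[float]) -> float:
    """Listwise: cross-entropy between top-one distributions of labels and scores."""
    target = softmax(labels)
    predicted = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


def ranknet_loss(scores: Sequence[float], labels: Sequence[float]) -> float:
    """Pairwise: mean logistic loss over pairs where candidate i outranks candidate j."""
    losses = [
        math.log1p(math.exp(-(scores[i] - scores[j])))
        for i in range(len(labels))
        for j in range(len(labels))
        if labels[i] > labels[j]
    ]
    return sum(losses) / len(losses) if losses else 0.0


# Three candidate explanations with graded quality (higher is better).
labels = [3.0, 2.0, 1.0]
well_ordered = [2.0, 1.0, 0.0]   # reward model agrees with the grading
reversed_order = [0.0, 1.0, 2.0]  # reward model inverts the grading
```

Both losses are lower for `well_ordered` than for `reversed_order`. The listwise loss looks at the whole candidate set at once, while the pairwise loss only compares two candidates at a time.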
cs.AI / 72 / 2604.24219
Adaptive ToR: Complexity-Aware Tree-Based Retrieval for Pareto-Optimal Multi-Intent NLU
自适应检索树:复杂度感知的基于树的帕累托最优多意图自然语言理解检索
Abstract
Multi-intent natural language understanding requires retrieval systems that simultaneously achieve high accuracy and computational efficiency, yet existing approaches apply either uniform single-step retrieval that compromises recall or fixed-depth hierarchical decomposition that introduces excessive latency regardless of query complexity. This paper proposes Adaptive Tree-of-Retrieval (Adaptive ToR), a complexity-aware retrieval architecture that dynamically configures retrieval topology based on query characteristics. The system integrates four components: (1) a Query Tree Classifier computing a Query Complexity Index from weighted linguistic signals to route queries to either a rapid single-step path or an adaptive-depth hierarchical path; (2) a Tree-Based Retrieval module that recursively decomposes complex queries into focused sub-queries calibrated to predicted complexity; (3) an Adaptive Pruning Module employing two-stage filtering combining quantitative similarity gating with semantic relevance evaluation to suppress exponential node growth; and (4) a Retrieval Reranking Layer featuring a deduplicator-first pipeline and global LLM rescoring for production efficiency. Evaluation on the NLU++ benchmark (2,693 multi-intent queries across Banking and Hotel domains) yields 29.07% Subset Accuracy and 71.79% Micro-F1, a 9.7% relative improvement over fixed-depth baselines, while reducing latency by 37.6%, LLM invocations by 43.0%, and token consumption by 9.8%. Depth-wise analysis reveals that 26.92% of queries resolve within three seconds (2.45s mean latency) via single-step routing (d=0: 37.9% Subset Accuracy, 74.8% Micro-F1), while token consumption scales by 4.9x across depths, validating complexity-aware resource allocation and establishing Pareto-optimal balance across accuracy, latency, and computational efficiency.
Chinese Translation
多意图自然语言理解需要检索系统同时实现高准确性和计算效率,但现有方法要么采用统一的单步检索,导致召回率下降,要么采用固定深度的层次分解,导致无论查询复杂度如何都引入过多延迟。本文提出了自适应检索树(Adaptive Tree-of-Retrieval,Adaptive ToR),一种复杂度感知的检索架构,能够根据查询特征动态配置检索拓扑。该系统集成了四个组件:(1)查询树分类器,通过加权语言信号计算查询复杂度指数,将查询路由到快速单步路径或自适应深度层次路径;(2)基于树的检索模块,递归地将复杂查询分解为针对预测复杂度校准的聚焦子查询;(3)自适应剪枝模块,采用两阶段过滤,将定量相似性门控与语义相关性评估相结合,以抑制指数级节点增长;(4)检索重排序层,具有去重优先的管道和全球大语言模型(LLM)重新评分,以提高生产效率。在NLU++基准(涵盖银行和酒店领域的2,693个多意图查询)上的评估结果显示,子集准确率为29.07%,微F1为71.79%,相较于固定深度基线提高了9.7%,同时延迟减少了37.6%,LLM调用次数减少了43.0%,令牌消耗减少了9.8%。深度分析表明,26.92%的查询在三秒内解决(平均延迟2.45秒),通过单步路由实现(d=0:子集准确率37.9%,微F1为74.8%),而令牌消耗在不同深度间扩展了4.9倍,验证了复杂度感知的资源分配,并在准确性、延迟和计算效率之间建立了帕累托最优平衡。
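The routing step of component (1) can be sketched as a weighted sum of linguistic signals compared against a threshold. Everything specific here is hypothetical: the abstract does not list the signals, weights, or threshold used by the Query Tree Classifier, so the ones below are placeholders chosen only to show the mechanism.

```python
import re

# Hypothetical signal weights; the paper's actual weights are not given in the abstract.
WEIGHTS = {"conjunctions": 0.5, "clauses": 0.3, "length": 0.2}
CONJUNCTIONS = {"and", "or", "also", "then"}


def query_complexity_index(query: str) -> float:
    """Combine simple, normalized linguistic signals into a score in [0, 1]."""
    tokens = [t.strip(",.;?!") for t in query.lower().split()]
    conjunction_count = sum(t in CONJUNCTIONS for t in tokens)
    clause_breaks = len(re.findall(r"[,;]", query))
    signals = {
        "conjunctions": min(conjunction_count / 2, 1.0),
        "clauses": min(clause_breaks / 2, 1.0),
        "length": min(len(tokens) / 30, 1.0),
    }
    return sum(WEIGHTS[name] * value for name, value in signals.items())


def route_query(query: str, threshold: float = 0.3) -> str:
    """Send simple queries down the single-step path, complex ones down the tree."""
    if query_complexity_index(query) < threshold:
        return "single_step"
    return "hierarchical"
```

Under these placeholder weights, a short single-intent query such as "What is my balance?" stays on the single-step path, while a query chaining several requests is routed to hierarchical decomposition.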
cs.AI / 73 / 2604.24322
Generative Design of a Gas Turbine Combustor Using Invertible Neural Networks
使用可逆神经网络的燃气轮机燃烧室生成设计
Abstract
The need to burn 100% H2 in highly efficient gas turbines with low-NOx premixed combustion requires a complete redesign of the combustion system to ensure stable operation without flashback. Since all engine frames, spanning a power range from 4 MW to 600 MW, are affected, a huge design effort is expected. To reduce this effort, and especially to transfer knowledge between the different engine classes, generative design methods based on recent AI technology offer promising potential. In this work, we approach this challenge using current advances in generative artificial intelligence. We train an Invertible Neural Network (INN) on an expandable database of geometrically parameterized combustor designs with simulated performance labels. Running the INN in its inverse direction generates multiple design proposals that fulfill specified performance labels.
Chinese Translation
在高效燃气轮机中以预混模式燃烧100%氢气并实现低NOx排放的需求,要求对燃烧系统进行全面重新设计,以确保在没有回火的情况下稳定运行。由于所有功率范围从4 MW到600 MW的发动机框架都受到影响,因此预计需要进行大量的设计工作。为了减少这一工作量,特别是为了在不同发动机类别之间转移知识,采用最新人工智能技术的生成设计方法将展现出良好的潜力。在本研究中,我们利用生成性人工智能的最新进展来应对这一挑战。我们在一个可扩展的几何参数化燃烧室设计数据库上训练了一个可逆神经网络(Invertible Neural Network, INN),该数据库包含模拟性能标签。通过利用INN的逆向特性,生成多个满足特定性能标签的设计方案。
cs.AI / 74 / 2604.24379
Certified geometric robustness -- Super-DeepG
认证几何鲁棒性 -- Super-DeepG
Abstract
Safety-critical applications are required to perform as expected in normal operations. Image processing functions are often required to be insensitive to small geometric perturbations such as rotation, scaling, shearing or translation. This paper addresses the formal verification of neural networks against geometric perturbations on their image dataset. Our method Super-DeepG improves the reasoning used in linear relaxation techniques and Lipschitz optimization, and provides an implementation that leverages GPU hardware. By doing so, Super-DeepG achieves both precision and computational efficiency of robustness certification, to an extent that outperforms prior work. Super-DeepG is shared as an open-source tool on GitHub.
Chinese Translation
安全关键应用要求在正常操作中表现如预期。图像处理功能通常需要对小的几何扰动(如旋转、缩放、剪切或平移)不敏感。本文针对神经网络在其图像数据集上的几何扰动进行了形式化验证。我们的方法Super-DeepG改进了线性松弛技术和Lipschitz优化中使用的推理,并提供了一个利用GPU硬件的实现。通过这样做,Super-DeepG在鲁棒性认证的精度和计算效率上取得了超越先前工作的成果。Super-DeepG作为开源工具在GitHub上共享。
cs.AI / 75 / 2604.24395
Aligning with Your Own Voice: Self-Corrected Preference Learning for Hallucination Mitigation in LVLMs
与自身声音对齐:用于减轻大型视觉语言模型幻觉的自我修正偏好学习
Abstract
Large Vision-Language Models (LVLMs) frequently suffer from hallucinations. Existing preference learning-based approaches largely rely on proprietary models to construct preference datasets. We identify that this reliance introduces a distributional mismatch between the proprietary and target models that hinders efficient alignment. To address this, we propose Alignment via VErified Self-correction DPO (AVES-DPO), a framework that aligns LVLMs using in-distribution data derived from the model's intrinsic knowledge. Our approach employs a consensus-based verification mechanism to diagnose diverse hallucinations and guides the model to self-correct, thereby generating preference pairs strictly compatible with its internal distribution. Extensive experiments demonstrate that AVES-DPO surpasses existing baselines in hallucination mitigation while requiring only 5.2k samples.
Chinese Translation
大型视觉语言模型(LVLMs)经常遭受幻觉问题。现有的基于偏好学习的方法在很大程度上依赖于专有模型来构建偏好数据集。我们发现,这种依赖导致了专有模型与目标模型之间的分布不匹配,从而阻碍了有效的对齐。为了解决这个问题,我们提出了通过验证自我修正的偏好优化(AVES-DPO)框架,该框架利用源自模型内在知识的同分布数据来对齐LVLMs。我们的方法采用基于共识的验证机制来诊断多样化的幻觉,并引导模型进行自我修正,从而生成与其内部分布严格兼容的偏好对。大量实验表明,AVES-DPO在减轻幻觉方面超过了现有的基准,同时仅需5.2k样本。
cs.AI / 76 / 2604.24443
PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model
PhysNote:用于可演化物理推理的自我知识笔记在视觉-语言模型中的应用
Abstract
Vision-Language Models (VLMs) have demonstrated strong performance on textbook-style physics problems, yet they frequently fail when confronted with dynamic real-world scenarios that require temporal consistency and causal reasoning across frames. We identify two fundamental challenges underlying these failures: (1) spatio-temporal identity drift, where objects lose their physical identity across successive frames and break causal chains, and (2) volatility of inference-time insights, where a model may occasionally produce correct physical reasoning but never consolidates it for future reuse. To address these challenges, we propose PhysNote, an agentic framework that enables VLMs to externalize and refine physical knowledge through self-generated "Knowledge Notes." PhysNote stabilizes dynamic perception through spatio-temporal canonicalization, organizes self-generated insights into a hierarchical knowledge repository, and drives an iterative reasoning loop that grounds hypotheses in visual evidence before consolidating verified knowledge. Experiments on PhysBench demonstrate that PhysNote achieves 56.68% overall accuracy, a 4.96% improvement over the best multi-agent baseline, with consistent gains across all four physical reasoning domains.
Chinese Translation
视觉-语言模型(VLMs)在教科书风格的物理问题上表现出色,但在面对需要跨帧时间一致性和因果推理的动态现实场景时,它们常常失败。我们识别出导致这些失败的两个基本挑战:(1)时空身份漂移,即物体在连续帧中失去其物理身份,从而打破因果链;(2)推理时洞察的波动性,即模型可能偶尔产生正确的物理推理,但从未将其巩固以供未来重用。为了解决这些挑战,我们提出了PhysNote,一个使VLMs能够通过自生成的“知识笔记”外化和精炼物理知识的代理框架。PhysNote通过时空规范化稳定动态感知,将自生成的洞察组织成层次知识库,并推动一个迭代推理循环,在巩固经过验证的知识之前,将假设与视觉证据相结合。PhysBench上的实验表明,PhysNote实现了56.68%的整体准确率,比最佳多代理基线提高了4.96%,并在所有四个物理推理领域中均表现出一致的提升。
cs.AI / 77 / 2604.24473
Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus
基于代理的临床推理在纵向多发性骨髓瘤记录中的应用:对专家共识的回顾性评估
Moll, Johannes, Lübberstedt, Jannik, Nuernbergk, Christoph, Stroh, Jacob, Mertens, Luisa, Purcarea, Anna, Zirn, Christopher, Benchaaben, Zeineb, Drexel, Fabian, Häntze, Hartmut, Narayanan, Anirudh, Puttkammer, Friedrich, Zhukov, Andrei, Lammert, Jacqueline, Ziegelmayer, Sebastian, Graf, Markus, Högner, Marion, Makowski, Marcus, Bassermann, Florian, Adams, Lisa C., Pan, Jiazhen, Rueckert, Daniel, Braitsch, Krischan, Bressem, Keno K.
Abstract
Multiple myeloma is managed through sequential lines of therapy over years to decades, with each decision depending on cumulative disease history distributed across dozens to hundreds of heterogeneous clinical documents. Whether LLM-based systems can synthesise this evidence at a level approaching expert agreement has not been established. A retrospective evaluation was conducted on longitudinal clinical records of 811 myeloma patients treated at a tertiary centre (2001-2026), covering 44,962 documents and 1,334,677 laboratory values, with external validation on MIMIC-IV. An agentic reasoning system was compared against single-pass retrieval-augmented generation (RAG), iterative RAG, and full-context input on 469 patient-question pairs from 48 templates at three complexity levels. Reference labels came from double annotation by four oncologists with senior haematologist adjudication. Iterative RAG and full-context input converged on a shared ceiling (75.4% vs 75.8%, p = 1.00). The agentic system reached 79.6% concordance (95% CI 76.4-82.8), exceeding both baselines (+3.8 and +4.2 pp; p = 0.006 and 0.007). Gains rose with question complexity, reaching +9.4 pp on criteria-based synthesis (p = 0.032), and with record length, reaching +13.5 pp in the top decile (n = 10). The system error rate (12.2%) was comparable to expert disagreement (13.6%), but severity was inverted: 57.8% of system errors were clinically significant versus 18.8% of expert disagreements. Agentic reasoning was the only approach to exceed the shared ceiling, with gains concentrated on the most complex questions and longest records. The greater clinical consequence of residual system errors indicates that prospective evaluation in routine care is required before these findings translate into patient benefit.
Chinese Translation
多发性骨髓瘤的管理通常通过数年到数十年的连续治疗方案进行,每个决策都依赖于分布在数十到数百份异质临床文档中的累积疾病历史。目前尚未确定基于大语言模型(LLM)的系统是否能够在接近专家一致的水平上综合这些证据。本研究对在一个三级中心(2001-2026年)治疗的811名多发性骨髓瘤患者的纵向临床记录进行了回顾性评估,涵盖了44,962份文档和1,334,677个实验室值,并在MIMIC-IV上进行了外部验证。将代理推理系统与单次检索增强生成(RAG)、迭代RAG和全上下文输入进行了比较,使用了来自48个模板的469对患者问题,涵盖了三个复杂性水平。参考标签来自四位肿瘤科医生的双重注释,并由资深血液学家进行裁定。迭代RAG和全上下文输入达成了共享上限(75.4% vs 75.8%,p = 1.00)。代理系统达到了79.6%的一致性(95% CI 76.4-82.8),超过了两个基线(+3.8和+4.2 pp;p = 0.006和0.007)。随着问题复杂性的增加,收益上升,在基于标准的综合上达到了+9.4 pp(p = 0.032),并且随着记录长度的增加,在前十分位达到了+13.5 pp(n = 10)。该系统的错误率(12.2%)与专家间的不一致(13.6%)相当,但严重性则相反:57.8%的系统错误是临床显著的,而专家不一致中只有18.8%是临床显著的。代理推理是唯一超过共享上限的方法,收益集中在最复杂的问题和最长的记录上。残余系统错误的更大临床后果表明,在这些发现转化为患者利益之前,需要在常规护理中进行前瞻性评估。
cs.AI / 78 / 2604.24506
MIMIC: A Generative Multimodal Foundation Model for Biomolecules
MIMIC:一种用于生物分子的生成性多模态基础模型
Golkar, Siavash, Kovalic, Jake, Morales, Irina Espejo, Sledzieski, Samuel, Li, Minhuan, Sokolova, Ksenia, Krawezik, Geraud, Bietti, Alberto, Gibbs, Claudia Skok, Klypa, Roman, Xiong, Shengwei, Lanusse, Francois, Parker, Liam, Cho, Kyunghyun, Cranmer, Miles, Hehir, Tom, McCabe, Michael, Meyer, Lucas, Morel, Rudy, Mukhopadhyay, Payel, Pettee, Mariel, Qu, Helen, Shen, Jeff, Fouhey, David, Sotoudeh, Hadi, Mulligan, Vikram, Cossio, Pilar, Hanson, Sonya M., Jones, Alisha N., Troyanskaya, Olga G., Ho, Shirley
Abstract
Biological function emerges from coupled constraints across sequence, structure, regulation, evolution, and cellular context, yet most foundation models in biology are trained within one modality or for a fixed forward task. We present MIMIC, a generative multimodal foundation model trained on our newly curated and aligned dataset, LORE, linking nucleic acid, protein, evolutionary, structural, regulatory, and semantic/contextual modalities within partially observed biomolecular states. MIMIC uses a split-track encoder-decoder architecture to condition on arbitrary subsets of observed modalities and reconstruct or generate missing components of molecular state across the genome, transcriptome, and proteome. Multimodal conditioning consistently improves MIMIC's sequence reconstruction relative to sequence-only inputs, while its learned representations enable state-of-the-art performance on RNA and protein downstream tasks. MIMIC achieves state-of-the-art splicing prediction, and its joint generative formulation enables isoform-aware inference that further improves performance. Beyond prediction, the same generative framework supports constrained design. For RNA, MIMIC identifies corrective edits in a clinically relevant HBB splice-disrupting mutation without reverting it by using evolutionary and structural signals. For proteins, jointly conditioning on shape and surface chemistry of PD-L1 and hACE2 binding sites produces diverse, high-confidence sequences with strong in silico support for target binding. Finally, MIMIC uses experimental context as semantic conditioning to model assay-dependent RNA chemical probing, rather than treating context as a fixed output. Together, these results position MIMIC's aligned multimodal generative modeling as a strong foundation for unifying representation learning, conditional prediction, and constrained biomolecular design within a single model.
Chinese Translation
生物功能源于序列、结构、调控、进化和细胞环境之间的耦合约束,然而大多数生物学基础模型仅在单一模态下或针对固定的前向任务进行训练。我们提出了MIMIC,一种生成性多模态基础模型,基于我们新近整理和对齐的数据集LORE进行训练,该数据集将核酸、蛋白质、进化、结构、调控和语义/上下文模态连接于部分观察到的生物分子状态。MIMIC采用分轨编码器-解码器架构,以条件化任意观察模态的子集,并重建或生成跨基因组、转录组和蛋白质组的分子状态缺失成分。多模态条件化相较于仅使用序列输入,始终提高了MIMIC的序列重建效果,同时其学习到的表示使其在RNA和蛋白质下游任务中实现了最先进的性能。MIMIC在剪接预测方面达到了最先进的水平,其联合生成性公式使得对异构体的感知推断进一步提升了性能。除了预测,该生成框架还支持受限设计。对于RNA,MIMIC在临床相关的HBB剪接破坏突变中识别出纠正编辑,而不需要恢复突变,利用了进化和结构信号。对于蛋白质,联合条件化PD-L1和hACE2结合位点的形状和表面化学特征,生成多样且高置信度的序列,并为目标结合提供了强有力的计算支持。最后,MIMIC利用实验上下文作为语义条件,建模依赖于测定的RNA化学探测,而不是将上下文视为固定输出。综合来看,这些结果将MIMIC的对齐多模态生成建模定位为统一表示学习、条件预测和受限生物分子设计的强大基础。
cs.AI / 79 / 2604.24512
Beyond the Attention Stability Boundary: Agentic Self-Synthesizing Reasoning Protocols
超越注意稳定边界:自主自我合成推理协议
Abstract
As LLM agents transition to autonomous digital coworkers, maintaining deterministic goal-directedness in non-linear multi-turn conversations has emerged as an architectural bottleneck. We identify and formalize a systemic failure mode, termed the Attention Latch, in decoder-only autoregressive Transformers. This phenomenon, a behavioral manifestation of Information Over-squashing, occurs when the cumulative probabilistic weight of historical context overrides mid-task updates, causing agents to remain anchored to obsolete constraints despite explicit contradictory instructions. We propose Self-Synthesizing Reasoning Protocols (SSRP), a metacognitive framework that implements a discrete separation between high-level architectural planning (Architect) and turn-by-turn procedural execution (Executive). We evaluate SSRP across 9K trajectories using the MultiWOZ 2.2 dataset and the Aggregate Pivot Accuracy (APA), a novel metric we validate by mapping its scores to the U-shaped 'Lost in the Middle' curve. We present three experimental tiers: a shallow recency-based retrieval pilot, a high-entropy SOP, and a semantically hijacked 3-hop Multi-Fact Synthesis task. Our results empirically locate the Attention Stability Boundary, where stateless Vanilla ReAct baselines for GPT 5.4 collapse to 0.1% success while SSRP achieves a 715X Resilience Lift. We demonstrate statistically significant gains across Gemini 3.1 Pro, Claude Sonnet 4.6 and DeepSeek V3.2. Audits confirm the necessity of SSRP by proving attentional lapse via a recursive reflexion baseline (100% success); decoupling the latch from positional bias through equidistant stress testing (90% accuracy); and formalizing SSRP via the Information Bottleneck principle and granularity ablations. A Procedural Integrity audit (98.8% adherence) reveals a Grounding Paradox where high-stability models fail by refusing to hallucinate under retrieval-reasoning contamination.
Chinese Translation
随着大型语言模型(LLM)代理转变为自主数字协作伙伴,在非线性多轮对话中保持确定性的目标导向性成为了一个架构瓶颈。我们识别并形式化了一种系统性失败模式,称为解码器仅自回归变换器中的注意锁定(Attention Latch)。这一现象是信息过度压缩(Information Over-squashing)的行为表现,当历史上下文的累积概率权重覆盖了任务中期更新时,导致代理尽管收到明确的相反指令,仍然被固定在过时的约束上。我们提出了自我合成推理协议(Self-Synthesizing Reasoning Protocols, SSRP),这是一个元认知框架,实现了高层架构规划(Architect)与逐轮程序执行(Executive)之间的离散分离。我们使用MultiWOZ 2.2数据集对SSRP进行了9K轨迹的评估,并引入了一种新颖的度量标准——聚合枢轴准确率(Aggregate Pivot Accuracy, APA),通过将其得分映射到U形的“迷失在中间”曲线来验证其有效性。我们展示了三个实验层次:基于浅层近期检索的试点、高熵的标准操作程序(SOP)以及语义劫持的三跳多事实综合任务。我们的结果实证定位了注意稳定边界,在此边界下,状态无关的Vanilla ReAct基线在GPT 5.4上的成功率崩溃至0.1%,而SSRP实现了715倍的韧性提升。我们在Gemini 3.1 Pro、Claude Sonnet 4.6和DeepSeek V3.2上展示了统计显著的增益。审计通过证明注意力失效(100%成功)来确认SSRP的必要性;通过等距压力测试(90%准确率)将锁定与位置偏差解耦;并通过信息瓶颈原则和粒度消融形式化SSRP。程序完整性审计(98.8%遵循率)揭示了一个基础悖论,即高稳定性模型在检索-推理污染下拒绝幻觉而失败。
cs.AI / 80 / 2604.24527
Interoceptive machine framework: Toward interoception-inspired regulatory architectures in artificial intelligence
内感知机器框架:朝向受内感知启发的人工智能调节架构
Abstract
This review proposes an integrative framework grounded in interoception and embodied AI, termed the interoceptive machine framework, that translates biologically inspired principles of internal-state regulation into computational architectures for adaptive autonomy. Interoception, conceived as the monitoring, integration, and regulation of internal signals, has proven relevant for understanding adaptive behavior in biological systems. The proposed framework organizes interoceptive contributions into three functional principles (homeostatic, allostatic, and enactive), each associated with a distinct computational role: internal viability regulation, anticipatory uncertainty-based re-evaluation, and active data generation through interaction, respectively. These principles are not intended as direct neurophysiological mappings, but as abstractions that inform the design of artificial agents with improved self-regulation and context-sensitive behavior. By embedding internal state variables and regulatory loops within these principles, AI systems can achieve more robust decision-making, calibrated uncertainty handling, and adaptive interaction strategies, particularly in uncertain and dynamic environments. This approach provides a concrete and testable pathway toward agents capable of functionally grounded self-regulation, with direct implications for human-computer interaction and assistive technologies. Ultimately, the interoceptive machine framework offers a unifying perspective on how internal-state regulation can enhance autonomy, adaptivity, and robustness in embodied AI systems.
Chinese Translation
本综述提出了一个基于内感知和具身人工智能的综合框架——内感知机器框架,旨在将生物启发的内部状态调节原则转化为适应性自主的计算架构。内感知被视为对内部信号的监测、整合和调节,已被证明对理解生物系统中的适应性行为具有重要意义。所提出的框架将内感知的贡献组织为三个功能原则:稳态(homeostatic)、变稳态(allostatic)和生动性(enactive),每个原则与特定的计算角色相关联:内部生存性调节、基于预期不确定性的重新评估,以及通过交互生成主动数据。这些原则并非旨在作为直接的神经生理映射,而是作为抽象概念,为设计具有改进自我调节和上下文敏感行为的人工智能代理提供指导。通过将内部状态变量和调节循环嵌入这些原则中,人工智能系统能够实现更强大的决策能力、校准的不确定性处理和适应性互动策略,特别是在不确定和动态的环境中。这一方法为实现功能上扎根的自我调节代理提供了一个具体且可测试的路径,对人机交互和辅助技术具有直接的影响。最终,内感知机器框架提供了一个统一的视角,说明内部状态调节如何增强具身人工智能系统的自主性、适应性和稳健性。
cs.AI / 81 / 2604.24544
STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator
STELLAR-E:一种合成的、定制的、端到端的LLM应用严格评估器
Abstract
The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due to privacy concerns, regulatory restrictions, and the time cost for manual creation. Existing automated benchmarking methods are often limited by relying on pre-existing data, poor scalability, single-domain focus, and lack of multilingual support. We present STELLAR-E - a fully automated system to generate high-quality synthetic datasets of custom size, using minimal human inputs without depending on existing datasets. The system is structured in two stages: (1) We modify the TGRT Self-Instruct framework to create a synthetic data engine that enables controllable, custom synthetic dataset generation, and (2) an evaluation pipeline incorporating statistical and LLM-based metrics to assess the applicability of the synthetic dataset for LLM-based application evaluations. The synthetic datasets reach an average difference of +5.7% in terms of LLM-as-a-judge scores against existing language-specific benchmarks, demonstrating comparable quality for comprehensive assessment of big and small LLMs. While real datasets remain slightly more challenging for LLMs especially for smaller models, this work establishes a scalable and domain-adaptable benchmarking framework that supports fair evaluation of LLM applications, offering a faster alternative to manual approaches and enabling high-efficiency automated quality assurance cycles.
Chinese Translation
对大型语言模型(LLMs)在各个领域日益依赖的现象凸显了对强大领域特定和语言特定评估数据集的需求;然而,由于隐私问题、监管限制以及人工创建的时间成本,收集此类数据集面临挑战。现有的自动化基准测试方法通常受限于依赖现有数据、可扩展性差、单一领域聚焦以及缺乏多语言支持。我们提出了STELLAR-E——一个完全自动化的系统,能够生成高质量的定制大小合成数据集,使用最少的人力输入而不依赖现有数据集。该系统分为两个阶段: (1) 我们修改了TGRT Self-Instruct框架,以创建一个合成数据引擎,使可控的定制合成数据集生成成为可能;(2) 一个评估管道,结合统计和基于LLM的指标,评估合成数据集在LLM应用评估中的适用性。合成数据集在LLM作为评判者的评分方面与现有语言特定基准相比,平均差异达到+5.7%,显示出在对大和小LLM的全面评估中具有可比的质量。尽管真实数据集对LLMs,尤其是较小模型而言仍然略显挑战,但这项工作建立了一个可扩展且适应领域的基准测试框架,支持对LLM应用的公平评估,提供了比人工方法更快速的替代方案,并实现高效的自动化质量保证周期。
cs.AI / 82 / 2604.24558
Hierarchical Behaviour Spaces
层次行为空间
Abstract
Recent work in hierarchical reinforcement learning has shown success in scaling to billions of timesteps when learning over a set of predefined option reward functions. We show that, instead of using a single reward function per option, the reward functions can be effectively used to induce a space of behaviours, by letting the controller specify linear combinations over reward functions, allowing a more expressive set of policies to be represented. We call this method Hierarchical Behaviour Spaces (HBS). We evaluate HBS on the NetHack Learning Environment, demonstrating strong performance. We conduct a series of experiments and determine that, perhaps going against conventional wisdom, the benefits of hierarchy in our method come from increased exploration rather than long term reasoning.
Chinese Translation
最近在层次强化学习方面的研究表明,当在一组预定义的选项奖励函数上进行学习时,能够成功扩展到数十亿个时间步。我们展示了,与其为每个选项使用单一的奖励函数,不如有效地利用奖励函数来诱导一个行为空间,通过让控制器指定奖励函数的线性组合,从而允许表示更具表现力的策略集。我们将这种方法称为层次行为空间(Hierarchical Behaviour Spaces, HBS)。我们在NetHack学习环境上评估了HBS,展示了其强大的性能。我们进行了系列实验,确定我们的这一方法中层次结构的好处或许与传统观点相悖,主要来自于增加的探索而非长期推理。
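The core idea above, letting the controller specify a linear combination over option reward functions, can be sketched in a few lines. The reward names and values are invented for illustration and are not taken from the NetHack experiments.

```python
from typing import Sequence


def combined_reward(weights: Sequence[float], option_rewards: Sequence[float]) -> float:
    """Reward seen by the low-level policy: a controller-chosen linear combination."""
    if len(weights) != len(option_rewards):
        raise ValueError("one weight is needed per option reward function")
    return sum(w * r for w, r in zip(weights, option_rewards))


# Hypothetical per-step values of three predefined option reward functions,
# e.g. "explore", "descend", "avoid damage".
option_rewards = [1.0, 2.0, -0.5]

# A one-hot weight vector recovers the classic "one reward per option" setting.
single_option = combined_reward([0.0, 1.0, 0.0], option_rewards)

# Any other weight vector selects a point in the induced behaviour space.
blended = combined_reward([0.5, 0.0, 0.5], option_rewards)
```

The one-hot case shows that the single-reward-per-option setup is a special case of the weighted formulation, which is why the weighted version can represent a more expressive set of policies.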
cs.AI / 83 / 2604.24562
Towards Lawful Autonomous Driving: Deriving Scenario-Aware Driving Requirements from Traffic Laws and Regulations
迈向合法的自主驾驶:从交通法律法规中推导情境感知驾驶要求
Abstract
Driving in compliance with traffic laws and regulations is a basic requirement for human drivers, yet autonomous vehicles (AVs) can violate these requirements in diverse real-world scenarios. To encode law compliance into AV systems, conventional approaches use formal logic languages to explicitly specify behavioral constraints, but this process is labor-intensive, hard to scale, and costly to maintain. With recent advances in artificial intelligence, it is promising to leverage large language models (LLMs) to derive legal requirements from traffic laws and regulations. However, without explicitly grounding and reasoning in structured traffic scenarios, LLMs often retrieve irrelevant provisions or miss applicable ones, yielding imprecise requirements. To address this, we propose a novel pipeline that grounds LLM reasoning in a traffic scenario taxonomy through node-wise anchors that encode hierarchical semantics. On Chinese traffic laws and the OnSite dataset (5,897 scenarios), our method improves law-scenario matching by 29.1% and increases the accuracy of derived mandatory and prohibitive requirements by 36.9% and 38.2%, respectively. We further demonstrate real-world applicability by constructing a law-compliance layer for AV navigation and developing an onboard, real-time compliance monitor for in-field testing, providing a solid foundation for future AV development, deployment, and regulatory oversight.
Chinese Translation
遵守交通法律法规是人类驾驶员的基本要求,然而自主车辆(AV)在多样的现实场景中可能违反这些要求。为了将法律合规性编码到AV系统中,传统方法使用形式逻辑语言明确指定行为约束,但这一过程劳动密集、难以扩展且维护成本高。随着人工智能的最新进展,利用大型语言模型(LLMs)从交通法律法规中推导法律要求的前景令人期待。然而,如果没有在结构化交通场景中明确建立基础和推理,LLMs往往会检索到无关条款或遗漏适用条款,从而导致要求不准确。为了解决这一问题,我们提出了一种新颖的流程,通过节点锚点将LLM推理与交通场景分类法相结合,编码层次语义。在中国交通法律和OnSite数据集(5,897个场景)上,我们的方法使法律与场景的匹配提高了29.1%,并分别提高了推导出的强制性和禁止性要求的准确性36.9%和38.2%。我们进一步通过为AV导航构建法律合规层和开发现场测试的车载实时合规监测器,展示了其在现实世界中的适用性,为未来AV的开发、部署和监管提供了坚实基础。
cs.AI / 84 / 2604.24572
FastOMOP: A Foundational Architecture for Reliable Agentic Real-World Evidence Generation on OMOP CDM data
FastOMOP:一种可靠的代理真实世界证据生成的基础架构,基于OMOP CDM数据
Abstract
The Observational Medical Outcomes Partnership Common Data Model (OMOP CDM), maintained by the Observational Health Data Sciences and Informatics (OHDSI) collaboration, has enabled the harmonisation of electronic health record data from nearly one billion patients in 83 countries. Yet generating real-world evidence (RWE) from these repositories remains a manual process requiring clinical, epidemiological and technical expertise. LLMs and multi-agent systems have shown promise for clinical tasks, but RWE automation exposes a fundamental challenge: agentic systems introduce emergent behaviours, coordination failures and safety risks that existing approaches fail to govern. No infrastructure exists to ensure agentic RWE generation is flexible, safe and auditable across the lifecycle. We introduce FastOMOP, an open-source multi-agent architecture that addresses this gap by separating three infrastructure layers (governance, observability and orchestration) from pluggable agent teams. Governance is enforced at the process boundary through deterministic validation independent of agent reasoning, ensuring no compromised or hallucinating agent can bypass safety controls. Agent teams for phenotyping, study design and statistical analysis inherit these guarantees through controlled tool exposure. We validated FastOMOP using a natural-language-to-SQL agent team across three OMOP CDM datasets: synthetic data from Synthea, MIMIC-IV and a real-world NHS dataset from Lancashire Teaching Hospitals (IDRIL). FastOMOP achieved reliability scores of 0.84-0.94 with perfect adversarial and out-of-scope block rates, demonstrating that process-boundary governance delivers safety guarantees independent of model choice. These results indicate that the reliability gap in RWE deployment is a matter of architecture rather than model capability, and establish FastOMOP as a governed architecture for progressive RWE automation.
Chinese Translation
观察性医学结果合作伙伴通用数据模型(OMOP CDM)由观察性健康数据科学与信息学(OHDSI)合作组织维护,使得来自83个国家近十亿患者的电子健康记录数据得以协调。然而,从这些数据仓库生成真实世界证据(RWE)仍然是一个需要临床、流行病学和技术专业知识的手动过程。大型语言模型(LLMs)和多代理系统在临床任务中展现出潜力,但RWE自动化暴露出一个基本挑战:代理系统引入了突现行为、协调失败和安全风险,而现有方法无法有效管理。当前没有基础设施能够确保代理RWE生成在整个生命周期内灵活、安全且可审计。我们提出了FastOMOP,这是一种开源多代理架构,通过将治理、可观测性和编排这三个基础设施层与可插拔的代理团队分离,来填补这一空白。治理在过程边界通过与代理推理无关的确定性验证得以实施,确保没有被妥协或产生幻觉的代理能够绕过安全控制。用于表型识别、研究设计和统计分析的代理团队通过受控工具暴露继承了这些保证。我们使用一个自然语言到SQL的代理团队在三个OMOP CDM数据集上验证了FastOMOP:来自Synthea的合成数据、MIMIC-IV以及来自兰开夏教学医院(IDRIL)的真实世界NHS数据集。FastOMOP的可靠性评分达到了0.84-0.94,具有完美的对抗性和超出范围的阻断率,证明了过程边界治理提供了独立于模型选择的安全保证。这些结果表明,RWE部署中的可靠性差距是架构性而非模型能力的问题,并确立了FastOMOP作为一个受治理的架构,以促进RWE自动化的进展。
cs.AI / 85 / 2604.24589
A systematic evaluation of vision-language models for observational astronomical reasoning tasks
对观察性天文学推理任务的视觉-语言模型的系统评估
Abstract
Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, yet their reliability on real astronomical observations across diverse modalities remains untested. We present AstroVLBench, a comprehensive benchmark comprising over 4,100 expert-verified instances across five tasks spanning optical imaging, radio interferometry, multi-wavelength photometry, time-domain light curves, and optical spectroscopy. Evaluating six frontier models, we find that performance is strongly modality-dependent: while one model (Gemini 3 Pro) emerges as the most consistently capable across tasks, task-specific strengths vary, and all models substantially underperform domain-specialized methods. Mechanistic ablations reveal that performance depends not only on directing attention to salient visual features but also on grounding those features in physical knowledge. Phenomenological prompts describing what to look for improve accuracy by sharpening model focus, but physical prompts explaining why those features matter perform better overall and yield more balanced classifications with reduced class-specific bias. Consistent with this picture, presenting the underlying one-dimensional measurements directly as numerical tables instead of rendered plots yields up to 13 percentage points improvement. Reasoning quality analysis further demonstrates that, without explicit physical grounding, models may reach correct predictions from phenomenologically plausible cues while providing physically imprecise justifications, establishing that accuracy alone is insufficient for trustworthy scientific deployment. These findings provide the first systematic, multi-modal baselines for VLMs in observational astronomy and identify the specific representation, grounding, and reasoning bottlenecks where current models fail.
Chinese Translation
视觉-语言模型(VLMs)越来越被提议作为科学数据解释的通用工具,但它们在不同模态下对真实天文观测的可靠性仍未经过检验。我们提出了AstroVLBench,这是一个综合基准,包含超过4100个经过专家验证的实例,涵盖光学成像、射电干涉测量、多波长光度测量、时域光变曲线和光学光谱学五个任务。评估六个前沿模型的结果显示,性能强烈依赖于模态:尽管一个模型(Gemini 3 Pro)在各任务中表现出最一致的能力,但任务特定的优势各不相同,所有模型的表现均显著低于领域专门化的方法。机制性消融实验表明,性能不仅依赖于对显著视觉特征的注意力引导,还依赖于将这些特征与物理知识相结合。描述观察重点的现象学提示通过增强模型的聚焦来提高准确性,但解释这些特征重要性的物理提示整体表现更佳,并且在减少类别特定偏见的同时提供了更平衡的分类结果。与此一致的是,直接将基础的一维测量以数值表格形式呈现,而不是渲染图形,可以提高多达13个百分点的表现。推理质量分析进一步表明,在没有明确物理基础的情况下,模型可能会从现象学上合理的线索中得出正确的预测,同时提供物理上不精确的解释,确立了仅凭准确性不足以进行可信的科学应用。这些发现为VLMs在观察天文学中的首次系统性、多模态基准提供了基础,并识别了当前模型失败的特定表示、基础和推理瓶颈。
cs.AI / 86 / 2604.24612
NeSyCat: A Monad-Based Categorical Semantics of the Neurosymbolic ULLER Framework
NeSyCat:基于单子的神经符号ULLER框架的范畴语义
Abstract
ULLER (Unified Language for LEarning and Reasoning) offers a unified first-order logic (FOL) syntax, enabling its knowledge bases to be used directly across a wide range of neurosymbolic systems. The original specification endows this syntax with three pairwise independent semantics: classical, fuzzy, and probabilistic, each accompanied by dedicated semantic rules. We show that these seemingly disparate semantics are all instances of one categorical framework based on monads, the very construct that models side effects in functional programming. This enables the modular addition of new semantics and systematic translations between them. As an example, we outline the addition of generalised quantification in Logic Tensor Networks (LTN) to arbitrary (including infinite) domains by extending the Giry monad to probability spaces. In particular, our approach allows a modular implementation of ULLER in Python and Haskell, of which we have published initial versions on GitHub.
Chinese Translation
ULLER(统一学习与推理语言)提供了一种统一的一阶逻辑(FOL)语法,使其知识库能够直接在广泛的神经符号系统中使用。原始规范赋予该语法三种成对独立的语义:经典语义、模糊语义和概率语义,每种语义都有专门的语义规则。我们展示了这些看似不同的语义实际上都是基于单子的一个范畴框架的实例,而单子正是用于建模函数式编程中的副作用的构造。这使得新的语义可以模块化地添加,并在它们之间进行系统的转换。作为例子,我们概述了通过将Giry单子扩展到概率空间,在逻辑张量网络(LTN)中将广义量化添加到任意(也包括无限)领域的过程。特别是,我们的方法允许在Python和Haskell中模块化实现ULLER,并且我们已在GitHub上发布了初始版本。
cs.AI / 87 / 2604.24618
Evaluating whether AI models would sabotage AI safety research
评估人工智能模型是否会破坏人工智能安全研究
Abstract
We evaluate the propensity of frontier models to sabotage or refuse to assist with safety research when deployed as AI research agents within a frontier AI company. We apply two complementary evaluations to four Claude models (Mythos Preview, Opus 4.7 Preview, Opus 4.6, and Sonnet 4.6): an unprompted sabotage evaluation testing model behaviour with opportunities to sabotage safety research, and a sabotage continuation evaluation testing whether models continue to sabotage when placed in trajectories where prior actions have started undermining research. We find no instances of unprompted sabotage across any model, with refusal rates close to zero for Mythos Preview and Opus 4.7 Preview, though all models sometimes only partially completed tasks. In the continuation evaluation, Mythos Preview actively continues sabotage in 7% of cases (versus 3% for Opus 4.6, 4% for Sonnet 4.6, and 0% for Opus 4.7 Preview), and exhibits reasoning-output discrepancy in the majority of these cases, indicating covert sabotage reasoning. Our evaluation framework builds on Petri, an open-source LLM auditing tool, with a custom scaffold running models inside Claude Code, alongside an iterative pipeline for generating realistic sabotage trajectories. We measure both evaluation awareness and a new form of situational awareness termed "prefill awareness", the capability to recognise that prior trajectory content was not self-generated. Opus 4.7 Preview shows notably elevated unprompted evaluation awareness, while prefill awareness remains low across all models. Finally, we discuss limitations including evaluation awareness confounds, limited scenario coverage, and untested pathways to risk beyond safety research sabotage.
Chinese Translation
我们评估了前沿模型在作为前沿人工智能公司中的人工智能研究代理时,破坏或拒绝协助安全研究的倾向。我们对四个Claude模型(Mythos Preview、Opus 4.7 Preview、Opus 4.6和Sonnet 4.6)应用了两种互补的评估:一种是无提示破坏评估,测试模型在有机会破坏安全研究时的行为;另一种是破坏持续性评估,测试模型在之前的行为已经开始破坏研究的轨迹中是否继续破坏。我们发现所有模型中均未出现无提示破坏的情况,Mythos Preview和Opus 4.7 Preview的拒绝率接近零,尽管所有模型有时仅部分完成任务。在持续性评估中,Mythos Preview在7%的情况下主动继续破坏(而Opus 4.6为3%,Sonnet 4.6为4%,Opus 4.7 Preview为0%),并在大多数情况下表现出推理输出不一致,表明存在隐蔽的破坏推理。我们的评估框架基于Petri,一个开源的LLM审计工具,结合在Claude Code中运行模型的自定义框架,以及生成现实破坏轨迹的迭代管道。我们测量了评估意识和一种新的情境意识形式,称为“预填意识”,即识别先前轨迹内容不是自我生成的能力。Opus 4.7 Preview显示出显著提高的无提示评估意识,而所有模型的预填意识仍然较低。最后,我们讨论了局限性,包括评估意识混淆、有限的场景覆盖以及未测试的超出安全研究破坏的风险路径。
cs.AI / 88 / 2604.24623
XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation
XGRAG:一种用于解释基于知识图谱的检索增强生成的图形原生框架
Abstract
Graph-based Retrieval-Augmented Generation (GraphRAG) extends traditional RAG by using knowledge graphs (KGs) to give large language models (LLMs) a structured, semantically coherent context, yielding more grounded answers. However, the GraphRAG reasoning process remains a black box, limiting our ability to understand how specific pieces of structured knowledge influence the final output. Existing explainability (XAI) methods for RAG systems, designed for text-based retrieval, are limited in their ability to interpret an LLM response through the relational structures among knowledge components, creating a critical gap in transparency and trustworthiness. To address this, we introduce XGRAG, a novel framework that generates causally grounded explanations for GraphRAG systems by employing graph-based perturbation strategies to quantify the contribution of individual graph components to the model answer. We conduct extensive experiments comparing XGRAG against RAG-Ex, an XAI baseline for standard RAG, and evaluate its robustness across various question types, narrative structures and LLMs. Our results demonstrate a 14.81% improvement in explanation quality over the baseline RAG-Ex across NarrativeQA, FairyTaleQA, and TriviaQA, evaluated by an F1-score measuring alignment between generated explanations and original answers. Furthermore, XGRAG explanations exhibit a strong correlation with graph centrality measures, validating its ability to capture graph structure. XGRAG provides a scalable and generalizable approach towards trustworthy AI through transparent, graph-based explanations that enhance the interpretability of RAG systems.
Chinese Translation
基于图的检索增强生成(GraphRAG)通过使用知识图谱(KGs)扩展了传统的检索增强生成(RAG),为大型语言模型(LLMs)提供了结构化的、语义连贯的上下文,从而产生更为扎实的答案。然而,GraphRAG 的推理过程仍然是一个黑箱,限制了我们理解特定结构化知识如何影响最终输出的能力。现有的针对 RAG 系统的可解释性(XAI)方法,主要设计用于基于文本的检索,局限于通过知识组件之间的关系结构来解释 LLM 的响应,造成了透明性和可信度方面的重大缺口。为了解决这一问题,我们提出了 XGRAG,这是一种新颖的框架,通过采用基于图的扰动策略,为 GraphRAG 系统生成因果基础的解释,以量化单个图组件对模型答案的贡献。我们进行了广泛的实验,将 XGRAG 与标准 RAG 的 XAI 基线 RAG-Ex 进行比较,并评估其在各种问题类型、叙事结构和 LLM 上的鲁棒性。我们的结果表明,在 NarrativeQA、FairyTaleQA 和 TriviaQA 数据集上,XGRAG 的解释质量比基线 RAG-Ex 提高了 14.81%,这一结果通过 F1 分数来衡量生成的解释与原始答案之间的对齐程度。此外,XGRAG 的解释与图的中心性度量表现出强相关性,验证了其捕捉图结构的能力。XGRAG 提供了一种可扩展和可推广的方法,通过透明的、基于图的解释增强 RAG 系统的可解释性,从而迈向可信赖的人工智能。
cs.AI / 89 / 2604.24668
The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications
协议的代价:在代理金融应用中测量大型语言模型的阿谀奉承
Abstract
Given the increased use of LLMs in financial systems today, it is important to evaluate the safety and robustness of such systems. One failure mode that LLMs frequently display in general-domain settings is sycophancy: models prioritize agreement with expressed user beliefs over correctness, leading to decreased accuracy and trust. In this work, we focus on evaluating the sycophancy that LLMs display in agentic financial tasks. Our findings are three-fold. First, we find that models show only low to modest drops in performance in the face of user rebuttals or contradictions to the reference answer, which distinguishes the sycophancy that models display in financial agentic settings from findings in prior work. Second, we introduce a suite of tasks that test for sycophancy induced by user preference information that contradicts the reference answer, and find that most models fail in the presence of such inputs. Lastly, we benchmark different modes of recovery, such as input filtering with a pretrained LLM.
Chinese Translation
鉴于大型语言模型(LLMs)在当今金融系统中的使用日益增加,评估这些系统的安全性和稳健性变得尤为重要。在一般领域设置中,LLMs经常表现出的一种失效模式是阿谀奉承。即,模型优先考虑与用户表达的信念达成一致,而非正确性,从而导致准确性和信任度下降。在本研究中,我们重点评估LLMs在代理金融任务中表现出的阿谀奉承。我们的发现有三方面:首先,我们发现模型在面对用户反驳或与参考答案矛盾时,性能仅出现低至中等的下降,这将金融代理环境中模型表现出的阿谀奉承与先前研究中的发现区分开来。其次,我们引入了一套任务,通过与参考答案矛盾的用户偏好信息来测试阿谀奉承,发现大多数模型在此类输入存在时表现不佳。最后,我们基准测试了不同的恢复模式,例如使用预训练的LLM进行输入过滤。
cs.AI / 90 / 2604.24686
Governing What You Cannot Observe: Adaptive Runtime Governance for Autonomous AI Agents
治理不可观察的事物:自主人工智能代理的自适应运行时治理
Abstract
Autonomous AI agents can remain fully authorized and still become unsafe as behavior drifts, adversaries adapt, and decision patterns shift without any code change. We propose the \textbf{Informational Viability Principle}: governing an agent reduces to estimating a bound on unobserved risk $\hat{B}(x) = U(x) + SB(x) + RG(x)$ and allowing an action only when its capacity $S(x)$ exceeds $\hat{B}(x)$ by a safety margin. The \textbf{Agent Viability Framework}, grounded in Aubin's viability theory, establishes three properties -- monitoring (P1), anticipation (P2), and monotonic restriction (P3) -- as individually necessary and collectively sufficient for documented failure modes. \textbf{RiskGate} instantiates the framework with dedicated statistical estimators (KL divergence, segment-vs-rest $z$-tests, sequential pattern matching), a fail-secure monotonic pipeline, and a closed-loop Autopilot formalised as an instance of Aubin's regulation map with kill-switch-as-last-resort; a scalar Viability Index $VI(t) \in [-1,+1]$ with first-order $t^*$ prediction transforms governance from reactive to predictive. Contributions are the theoretical framework, the reference implementation, and analytical coverage against published agent-failure taxonomies; quantitative empirical evaluation is scoped as follow-up work.
Chinese Translation
自主人工智能代理在行为漂移、对手适应和决策模式变化的情况下,尽管仍然完全授权,但可能变得不安全,而这些变化并不需要任何代码更改。我们提出了\textbf{信息生存能力原则}:治理一个代理归结为估计未观察风险的界限$\hat{B}(x) = U(x) + SB(x) + RG(x)$,并仅在其能力$S(x)$超过$\hat{B}(x)$的安全边际时允许采取行动。\textbf{代理生存能力框架}基于奥班(Aubin)的生存能力理论,确立了三个属性——监控(P1)、预见(P2)和单调限制(P3)——作为文档化失败模式的个别必要条件和集体充分条件。\textbf{RiskGate}通过专用统计估计器(KL散度、段对比$z$检验、序列模式匹配)、一个安全失败的单调管道,以及一个闭环自动驾驶仪(Autopilot)实例化该框架,后者形式化为奥班的调节映射,并设有最后手段的紧急停止开关;一个标量生存能力指数$VI(t) \in [-1,+1]$与一阶$t^*$预测结合,将治理从反应式转变为预测式。贡献包括理论框架、参考实现和针对已发布代理失败分类法的分析覆盖;定量实证评估被规划为后续工作。
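The gating rule stated in this abstract can be illustrated with a minimal sketch. It assumes the three terms of $\hat{B}(x)$ are already available as plain numbers; the class name `RiskEstimate`, the function `allow_action`, the default margin, and the example values are all hypothetical and are not taken from RiskGate's reference implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskEstimate:
    """The three additive terms of B_hat(x) = U(x) + SB(x) + RG(x).

    Field names mirror the symbols in the abstract. How each term is
    estimated is not described there, so the values used below are
    placeholders chosen for illustration only.
    """
    u: float
    sb: float
    rg: float

    def bound(self) -> float:
        """Return the estimated bound on unobserved risk."""
        return self.u + self.sb + self.rg


def allow_action(capacity: float, risk: RiskEstimate, margin: float = 0.1) -> bool:
    """Allow an action only if S(x) exceeds B_hat(x) by at least `margin`.

    Treating "by a safety margin" as an additive threshold is an assumption.
    """
    return capacity - risk.bound() >= margin


risk = RiskEstimate(u=0.2, sb=0.1, rg=0.05)
print(allow_action(capacity=0.5, risk=risk))  # headroom of about 0.15
print(allow_action(capacity=0.4, risk=risk))  # headroom of about 0.05
```

With these placeholder numbers the first action clears the margin and the second does not, which is the monotonic "restrict when in doubt" behaviour the abstract describes at a high level.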
cs.AI / 91 / 2604.24697
Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft
当前代理能否缩小发现与应用之间的差距?以《我的世界》为例
Abstract
Discovering causal regularities and applying them to build functional systems--the discovery-to-application loop--is a hallmark of general intelligence, yet evaluating this capacity has been hindered by the vast complexity gap between scientific discovery and real-world engineering. We introduce SciCrafter, a Minecraft-based benchmark that operationalizes this loop through parameterized redstone circuit tasks. Agents must ignite lamps in specified patterns (e.g., simultaneously or in timed sequences); scaling target parameters substantially increases construction complexity and required knowledge, forcing genuine discovery rather than reliance on memorized solutions. Evaluating frontier models including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5 under a general-purpose code agent scaffold, we find that all plateau at a success rate of approximately 26%. To diagnose these failures, we decompose the loop into four capacities--knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application--and design targeted interventions whose marginal contributions serve as proxies for the corresponding gaps. Our analysis reveals that although knowledge application remains the largest gap across all models, for frontier models knowledge gap identification is becoming a major hurdle--indicating that the bottleneck for current AI is shifting from solving problems correctly to raising the right problems. We release SciCrafter as a diagnostic probe for future research on AI systems that navigate the full discovery-to-application loop.
Chinese Translation
发现因果规律并将其应用于构建功能系统——发现与应用循环——是一般智能的标志,然而,由于科学发现与现实工程之间的巨大复杂性差距,评估这一能力一直受到阻碍。我们引入了SciCrafter,这是一个基于《我的世界》的基准,通过参数化的红石电路任务来实现这一循环。代理必须以指定的模式点亮灯(例如,同时点亮或按时间顺序点亮);目标参数的扩大显著增加了构建的复杂性和所需知识,迫使代理进行真正的发现,而不是依赖于记忆的解决方案。在通用代码代理框架下评估前沿模型,包括GPT-5.2、Gemini-3-Pro和Claude-Opus-4.5,我们发现所有模型的成功率均停滞在约26%。为了诊断这些失败,我们将循环分解为四个能力——知识差距识别、实验发现、知识整合和知识应用——并设计针对性的干预措施,其边际贡献可作为对应差距的代理。我们的分析揭示,尽管一般知识应用能力仍然是所有模型中最大的差距,但对于前沿模型而言,知识差距识别开始成为一个主要障碍——这表明瓶颈正在从正确解决问题转向为当前人工智能提出正确问题。我们发布SciCrafter作为未来研究AI系统在完整发现与应用循环中导航的诊断工具。
cs.AI / 92 / 2604.24710
Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters
针对临床人工智能评估的案例特定评分标准:方法论、验证及823次接触中的大型语言模型与临床医生的一致性
Abstract
Objective. Clinical AI documentation systems require evaluation methodologies that are clinically valid, economically viable, and sensitive to iterative changes. Methods requiring expert review per scoring instance are too slow and expensive for safe, iterative deployment. We present a case-specific, clinician-authored rubric methodology for clinical AI evaluation and examine whether LLM-generated rubrics can approximate clinician agreement. Materials and Methods. Twenty clinicians authored 1,646 rubrics for 823 clinical cases (736 real-world, 87 synthetic) across primary care, psychiatry, oncology, and behavioral health. Each rubric was validated by confirming that an LLM-based scoring agent consistently scored clinician-preferred outputs higher than rejected ones. Seven versions of an EHR-embedded AI agent for clinicians were evaluated across all cases. Results. Clinician-authored rubrics discriminated effectively between high- and low-quality outputs (median score gap: 82.9%) with high scoring stability (median range: 0.00%). Median scores improved from 84% to 95%. In later experiments, clinician-LLM ranking agreement (tau: 0.42-0.46) matched or exceeded clinician-clinician agreement (tau: 0.38-0.43), attributable to both ceiling compression and LLM rubric improvement. Discussion. This convergence supports incorporating LLM rubrics alongside clinician-authored ones. At roughly 1,000 times lower cost, LLM rubrics enable substantially greater evaluation coverage, while continued clinical authorship grounds evaluation in expert judgment. Ceiling compression poses a methodological challenge for future inter-rater agreement studies. Conclusion. Case-specific rubrics offer a path for clinical AI evaluation that preserves expert judgment while enabling automation at a cost three orders of magnitude lower. Clinician-authored rubrics establish the baseline against which LLM rubrics are validated.
Chinese Translation
目的:临床人工智能文档系统需要临床有效、经济可行且对迭代变化敏感的评估方法。每次评分实例都需要专家审查的方法对于安全的迭代部署来说过于缓慢且昂贵。我们提出了一种案例特定的、由临床医生撰写的评分标准方法,用于临床人工智能评估,并考察大型语言模型(LLM)生成的评分标准是否能够接近临床医生的一致性。材料与方法:20名临床医生为823个临床案例(736个真实案例,87个合成案例)撰写了1646个评分标准,涵盖初级护理、精神病学、肿瘤学和行为健康。通过确认基于LLM的评分代理一致性地将临床医生偏好的输出评分高于被拒绝的输出,验证了每个评分标准。评估了七个嵌入电子健康记录(EHR)的AI代理版本。结果:临床医生撰写的评分标准有效区分了高质量和低质量输出(中位评分差:82.9%),且评分稳定性高(中位范围:0.00%)。中位评分从84%提高到95%。在后续实验中,临床医生与LLM的排名一致性(tau:0.42-0.46)与临床医生之间的一致性(tau:0.38-0.43)相匹配或超过,这归因于天花板压缩和LLM评分标准的改进。讨论:这种趋同支持将LLM评分标准与临床医生撰写的评分标准结合使用。LLM评分标准的成本约为传统方法的1000分之一,能够显著提高评估覆盖面,同时持续的临床作者身份使评估扎根于专家判断。天花板压缩对未来的评估者间一致性研究构成了方法学挑战。结论:案例特定评分标准为临床人工智能评估提供了一条路径,既保留了专家判断,又实现了成本降低三个数量级的自动化。临床医生撰写的评分标准建立了LLM评分标准验证的基准。
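The ranking-agreement figures reported as "tau" can be made concrete with a small sketch. It assumes that tau denotes Kendall's rank correlation, which the abstract does not state explicitly, and it implements the simple tau-a variant without tie correction; the function name and the example rankings are illustrative only.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two rankings of the same items.

    Each argument lists the rank given to item 0, item 1, ... by one rater.
    Tied pairs count as neither concordant nor discordant.
    """
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must cover the same items")
    if len(rank_a) < 2:
        raise ValueError("need at least two items to compare")

    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # Positive product: both raters order the pair (i, j) the same way.
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


clinician = [1, 2, 3, 4]      # hypothetical ranking of four agent versions
llm_rubric = [1, 3, 2, 4]     # same versions, one adjacent pair swapped
print(kendall_tau(clinician, llm_rubric))
```

A value of 1 means identical orderings and -1 means fully reversed orderings, so the reported 0.38-0.46 range corresponds to moderate agreement on which outputs rank higher.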
cs.AI / 93 / 2604.24717
Learning to Rotate: Temporal and Semantic Rotary Encoding for Sequential Modeling
学习旋转:用于序列建模的时间和语义旋转编码
Abstract
Every Transformer architecture dedicates enormous capacity to learning rich representations in semantic embedding space -- yet the rotation manifold acted upon by Rotary Positional Embeddings (RoPE) has been treated as a fixed, hand-crafted structure, populated only by discrete ordinal indices. We argue that this rotation space is a largely overlooked second dimension of expressivity in the attention mechanism, one whose systematic exploration may open a new door for attention-based architectures. The analogy to complex numbers is instructive: just as introducing the imaginary axis -- orthogonal to and independent of the real line -- unlocked new algebraic structure once believed impossible, treating the rotation manifold as a learnable, signal-conditioned space opens an orthogonal degree of freedom in attention. In this framing, the token embedding encodes the semantic (real) component of a representation -- what a token means -- while the rotation encodes its dynamic (imaginary) component -- how it relates to every other token across time, position, and context. We introduce SIREN-RoPE, a concrete instantiation of this idea, which populates the rotation dimension with heterogeneous signals -- continuous timestamps, cyclical temporal patterns, and categorical metadata -- via a dual-branch Sinusoidal Representation Network (SIREN). As a proof of concept, we evaluate on a production-scale news feed dataset from a major social network using a generative recommender as the ranking model, demonstrating that activating this hidden dimension yields consistent improvements across calibration and ranking objectives with negligible computational overhead. We invite the community to view the rotation space not as a solved positional-encoding detail, but as an untapped axis whose rich structure may prove as consequential for attention as the imaginary unit proved for algebra.
Chinese Translation
每个Transformer架构都投入了巨大的能力来学习语义嵌入空间中的丰富表示——然而,旋转位置嵌入(Rotary Positional Embeddings, RoPE)所作用的旋转流形一直被视为一个固定的、手工制作的结构,仅由离散的序数索引填充。我们认为,这个旋转空间是注意力机制中一个在很大程度上被忽视的第二表达维度,其系统性探索可能为基于注意力的架构打开新的大门。与复数的类比是有启发性的:正如引入与实数轴正交且独立的虚数轴解锁了曾被认为不可能的新代数结构,将旋转流形视为一个可学习的、信号条件化的空间为注意力提供了一个正交的自由度。在这种框架下,令牌嵌入编码了表示的语义(实)成分——一个令牌的含义——而旋转编码了其动态(虚)成分——它如何在时间、位置和上下文中与其他令牌相关联。我们引入了SIREN-RoPE,这一思想的具体实例,通过双分支正弦表示网络(Sinusoidal Representation Network, SIREN)将旋转维度填充为异质信号——连续时间戳、周期性时间模式和分类元数据。作为概念验证,我们在一个主要社交网络的生产规模新闻推送数据集上进行了评估,使用生成推荐器作为排名模型,展示了激活这一隐藏维度在校准和排名目标上带来了持续的改善,且计算开销微乎其微。我们邀请社区不要将旋转空间视为一个已解决的位置编码细节,而应将其视为一个尚未开发的轴,其丰富的结构可能对注意力的影响与虚数单位对代数的影响同样重要。
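The idea of filling the rotation with a continuous signal rather than a discrete index can be sketched in a few lines. This is a deliberately simplified illustration, not SIREN-RoPE itself: the learned dual-branch SIREN is replaced by a raw scalar timestamp, the frequencies follow the standard fixed RoPE schedule, and the function names and vectors are hypothetical.

```python
import math


def rotate_pairs(vec, position, base=10000.0):
    """Apply a RoPE-style rotation to consecutive (even, odd) pairs of `vec`.

    `position` may be any real number (for example a timestamp), not only an
    integer token index. Pair i is rotated by position * base ** (-i / n_pairs),
    which matches the usual RoPE frequency schedule.
    """
    if len(vec) % 2 != 0:
        raise ValueError("vector length must be even")
    n_pairs = len(vec) // 2
    out = []
    for i in range(n_pairs):
        theta = position * base ** (-i / n_pairs)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Both pairs of timestamps differ by 2.25, so the attention scores should agree.
score_a = dot(rotate_pairs(q, 3.5), rotate_pairs(k, 1.25))
score_b = dot(rotate_pairs(q, 103.5), rotate_pairs(k, 101.25))
print(abs(score_a - score_b) < 1e-9)
```

The check at the end shows the property that makes the rotation space usable for continuous signals: the query-key score depends only on the difference between the two rotation arguments, whatever those arguments encode.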
cs.CL / 1 / 2604.22771
The Randomness Floor: Measuring Intrinsic Non-Randomness in Language Model Token Distributions
随机性底线:测量语言模型令牌分布中的内在非随机性
Abstract
Language models cannot be random. This paper introduces Entropic Deviation (ED), the normalised KL divergence between a model's token distribution and the uniform distribution, and measures it systematically across 31,200 generations spanning seven models, two architectures (transformer and state space), nine prompt categories, three temperatures, and five languages. Under semantically neutral prompts (empty strings, random characters, nonsense syllables) transformers still exhibit ED of approximately 0.30, meaning that 88-93% of the non-randomness observed under semantic prompts is intrinsic to the learned weights rather than induced by context. Three transformer families (Gemma, Llama, Qwen) converge on nearly identical ED values despite different training data and vocabularies. A state space model (Mamba2) reveals a qualitatively different regime: twice the ED, three times lower within-sequence variance, and massive sensitivity to temperature (r = -0.78) where transformers are nearly immune (r < 0.05). Cross-lingual experiments with Qwen-32B show a stable gradient across five languages (English, Japanese, Chinese, Polish, Arabic) that does not correlate with token fertility and persists when two languages sharing an identical tokeniser subset are compared. These findings establish a structural lower bound on randomness in pretrained language models, characterise how this bound differs across architectures, and demonstrate that language itself modulates the bound independently of tokenisation.
Chinese Translation
语言模型无法是随机的。本文引入了熵偏差(Entropic Deviation, ED),即模型的令牌分布与均匀分布之间的标准化KL散度,并在涵盖七个模型、两种架构(变压器和状态空间)、九类提示、三种温度和五种语言的31,200次生成中系统地进行了测量。在语义中性提示(空字符串、随机字符、无意义音节)下,变压器仍表现出约0.30的ED,这意味着在语义提示下观察到的88-93%的非随机性是内在于学习权重的,而不是由上下文引起的。尽管训练数据和词汇不同,三种变压器家族(Gemma、Llama、Qwen)收敛到几乎相同的ED值。一个状态空间模型(Mamba2)揭示了一个性质上不同的状态:ED是变压器的两倍,序列内方差低三倍,并且对温度的敏感性巨大(r = -0.78),而变压器几乎免疫(r < 0.05)。与Qwen-32B的跨语言实验显示,在五种语言(英语、日语、中文、波兰语、阿拉伯语)中,稳定的梯度与令牌丰度无关,并且在比较两个共享相同令牌化子集的语言时仍然存在。这些发现确立了预训练语言模型中随机性的结构性下限,描述了该下限在不同架构之间的差异,并证明语言本身独立于令牌化调节该下限。
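Entropic Deviation as described here can be written out as a short worked example. The abstract says only that ED is a normalised KL divergence from the uniform distribution; dividing by $\log V$ (the maximum possible divergence for a vocabulary of size $V$) is an assumption of this sketch, as are the function name and the toy distributions.

```python
import math


def entropic_deviation(probs):
    """Normalised KL divergence between `probs` and the uniform distribution.

    KL(p || U) = sum_i p_i * log(p_i * V), where V = len(probs).
    Dividing by log(V) maps the result to [0, 1]: 0 for a uniform
    distribution, 1 when all mass sits on a single token.
    """
    vocab_size = len(probs)
    if vocab_size < 2:
        raise ValueError("need at least two tokens")
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must be a valid probability distribution")

    # Terms with p_i = 0 contribute nothing to the divergence.
    kl = sum(p * math.log(p * vocab_size) for p in probs if p > 0)
    return kl / math.log(vocab_size)


print(entropic_deviation([0.25, 0.25, 0.25, 0.25]))  # uniform over four tokens
print(entropic_deviation([1.0, 0.0, 0.0, 0.0]))      # fully deterministic
print(entropic_deviation([0.7, 0.1, 0.1, 0.1]))      # somewhere in between
```

Under this normalisation ED equals one minus the entropy of the distribution divided by $\log V$, so an ED of about 0.30 would mean the distribution carries roughly 70% of the maximum possible entropy.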
cs.CL / 2 / 2604.22880
TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction
TexOCR:推进可编译页面到 LaTeX 重建的文档 OCR 模型
Abstract
Existing document OCR largely targets plain text or Markdown, discarding the structural and executable properties that make LaTeX essential for scientific publishing. We study page-level reconstruction of scientific PDFs into compilable LaTeX and introduce TexOCR-Bench, a benchmark, and TexOCR-Train, a large-scale training corpus, for this task. TexOCR-Bench features a multi-dimensional evaluation suite that jointly assesses transcription fidelity, structural faithfulness, and end-to-end compilability. Leveraging TexOCR-Train, we train a 2B-parameter model, TexOCR, using supervised fine-tuning (SFT) and reinforcement learning (RL) with verifiable rewards derived from LaTeX unit tests that directly enforce compilability and referential integrity. Experiments across 21 frontier models on TexOCR-Bench show that existing systems frequently violate key document invariants, including consistent section structure, correct float placement, and valid label-reference links, which undermines compilation reliability and downstream usability. Our analysis further reveals that RL with verifiable rewards yields consistent improvements over SFT alone, particularly on structural and compilation metrics.
Chinese Translation
现有的文档 OCR 主要针对纯文本或 Markdown,忽视了 LaTeX 在科学出版中至关重要的结构性和可执行性。我们研究了将科学 PDF 页面级重建为可编译 LaTeX 的问题,并引入了 TexOCR-Bench,一个基准测试,以及 TexOCR-Train,一个大规模训练语料库,用于此任务。TexOCR-Bench 具有多维评估套件,能够共同评估转录准确性、结构忠实度和端到端可编译性。利用 TexOCR-Train,我们训练了一个 2B 参数模型 TexOCR,采用监督微调(SFT)和强化学习(RL),并使用来自 LaTeX 单元测试的可验证奖励,这些测试直接强制执行可编译性和引用完整性。在 TexOCR-Bench 上对 21 个前沿模型的实验表明,现有系统经常违反关键文档不变性,包括一致的章节结构、正确的浮动元素放置和有效的标签-引用链接,这削弱了编译的可靠性和后续的可用性。我们的分析进一步揭示,使用可验证奖励的强化学习在结构和编译指标上相较于单独的 SFT 一致地取得了改进。
cs.CL / 3 / 2604.22937
AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs
AutoPyVerifier:为大型语言模型输出学习紧凑的可执行验证器
Abstract
Verification is becoming central to both reinforcement-learning-based training and inference-time control of large language models (LLMs). Yet current verifiers face a fundamental trade-off: LLM-based verifiers are expressive but hard to control and prone to error, while deterministic executable verifiers are reliable and interpretable but often limited in capability. We study the following question: given a development set of LLM outputs and labels for a target objective, such as correctness, can we automatically induce a minimal set of Python verifiers whose joint satisfaction closely matches that objective? We propose AutoPyVerifier, a framework that uses an LLM to synthesize candidate verifier functions and then refines them through search over a directed acyclic graph (DAG). By navigating the DAG, AutoPyVerifier systematically explores the space of deterministic executable verifiers and selects a compact verifier set whose joint satisfaction best approximates the target objective. Across mathematical reasoning, coding, function calling, and instruction-following benchmarks for several state-of-the-art LLMs, AutoPyVerifier improves target-objective prediction by up to 55.0 F1 points over the initial LLM-generated verifier sets. Additional analyses show that the most useful verification targets vary by benchmark and model, and that the DAG-based search shifts the learned verifier sets toward more structural and semantically grounded checks. We further show that exposing the discovered verifier set to an LLM as an external tool improves downstream accuracy by up to 17.0 points. We release our code
Chinese Translation
验证在基于强化学习的训练和大型语言模型(LLMs)的推理时控制中变得越来越重要。然而,当前的验证器面临一个基本的权衡:基于LLM的验证器具有表达能力强但难以控制且易出错的缺点,而确定性可执行验证器则可靠且可解释,但能力往往有限。我们研究以下问题:给定一组LLM输出及其目标目标(如正确性)的标签,我们能否自动诱导出一组最小的Python验证器,使其联合满足程度与该目标密切匹配?我们提出了AutoPyVerifier,一个框架,利用LLM合成候选验证器函数,然后通过对有向无环图(DAG)的搜索进行细化。通过导航DAG,AutoPyVerifier系统地探索确定性可执行验证器的空间,并选择一组紧凑的验证器,其联合满足程度最佳地接近目标目标。在多个最先进的LLM的数学推理、编码、函数调用和遵循指令的基准测试中,AutoPyVerifier在目标目标预测上比初始LLM生成的验证器集提高了多达55.0个F1分数。额外分析表明,最有用的验证目标因基准和模型而异,并且基于DAG的搜索将学习到的验证器集向更具结构性和语义基础的检查转移。我们进一步表明,将发现的验证器集作为外部工具暴露给LLM可以将下游准确性提高多达17.0分。我们发布了我们的代码。
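The notion of "joint satisfaction" of a verifier set, scored against labels with F1, can be illustrated as follows. The two verifier functions, the development set, and its labels are hypothetical toy examples; the LLM-driven synthesis and the DAG search described in the abstract are not reproduced here.

```python
def has_answer_line(output: str) -> bool:
    """Hypothetical verifier: the output contains an 'Answer:' marker."""
    return "Answer:" in output


def answer_is_integer(output: str) -> bool:
    """Hypothetical verifier: the text after the last 'Answer:' is an integer."""
    tail = output.rsplit("Answer:", 1)[-1].strip()
    return tail.lstrip("-").isdigit()


def joint_satisfaction(verifiers, output: str) -> bool:
    """An output is accepted only if every verifier in the set accepts it."""
    return all(check(output) for check in verifiers)


def f1_score(predictions, labels) -> float:
    """F1 of boolean predictions against boolean labels."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy development set: (model output, is the output actually correct?)
dev_set = [
    ("2 + 2 = 4. Answer: 4", True),
    ("The result is four.", False),
    ("Answer: unknown", False),
    ("2 + 2 = 7. Answer: 7", False),  # well-formed but wrong
]

verifier_set = [has_answer_line, answer_is_integer]
predictions = [joint_satisfaction(verifier_set, out) for out, _ in dev_set]
labels = [label for _, label in dev_set]
print(predictions, f1_score(predictions, labels))
```

The last example passes both format checks despite being wrong, which is the kind of gap between cheap deterministic checks and the target objective that a search over candidate verifiers would try to close.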
cs.CL / 4 / 2604.22939
Self Knowledge Re-expression: A Fully Local Method for Adapting LLMs to Tasks Using Intrinsic Knowledge
自我知识重表达:一种使用内在知识将大型语言模型适应任务的完全本地方法
Abstract
While the next-token prediction (NTP) paradigm enables large language models (LLMs) to express their intrinsic knowledge, its sequential nature constrains performance on specialized, non-generative tasks. We attribute this performance bottleneck to the LLMs' knowledge expression mechanism, rather than to deficiencies in knowledge acquisition. To address this, we propose Self-Knowledge Re-expression (SKR), a novel, task-agnostic adaptation method. SKR transforms the LLM's output from generic token generation to highly efficient, task-specific expression. SKR is a fully local method that uses only unannotated data, requiring neither human supervision nor model distillation. Experiments on a large financial document dataset demonstrate substantial improvements: over 40% in Recall@1 for information retrieval tasks, over 76% reduction in object detection latency, and over 33% increase in anomaly detection AUPRC. Our results on the MMDocRAG dataset surpass those of leading retrieval models by at least 12.6%.
Chinese Translation
虽然下一词预测(NTP)范式使大型语言模型(LLMs)能够表达其内在知识,但其顺序特性限制了在专业非生成性任务上的表现。我们将这一性能瓶颈归因于LLMs的知识表达机制,而非知识获取的缺陷。为了解决这一问题,我们提出了自我知识重表达(Self-Knowledge Re-expression, SKR),一种新颖的任务无关适应方法。SKR将LLM的输出从通用的标记生成转变为高效的任务特定表达。SKR是一种完全本地的方法,仅使用未标注的数据,不需要人工监督或模型蒸馏。在一个大型金融文档数据集上的实验表明,性能有显著提升:信息检索任务的Recall@1提高超过40%,目标检测延迟减少超过76%,异常检测的AUPRC提高超过33%。我们在MMDocRAG数据集上的结果至少比领先的检索模型高出12.6%。
cs.CL / 5 / 2604.22985
Uncertainty Quantification for LLM Function-Calling
大型语言模型功能调用的不确定性量化
Abstract
Large Language Models (LLMs) are increasingly deployed to autonomously solve real-world tasks. A key ingredient for this is the LLM Function-Calling paradigm, a widely used approach for equipping LLMs with tool-use capabilities. However, an LLM calling functions incorrectly can have severe implications, especially when their effects are irreversible, e.g., transferring money or deleting data. Hence, it is of paramount importance to consider the LLM's confidence that a function call solves the task correctly prior to executing it. Uncertainty Quantification (UQ) methods can be used to quantify this confidence and prevent potentially incorrect function calls. In this work, we present what is, to our knowledge, the first evaluation of UQ methods for LLM Function-Calling (FC). While multi-sample UQ methods, such as Semantic Entropy, show strong performance for natural language Q&A tasks, we find that in the FC setting, they offer no clear advantage over simple single-sample UQ methods. Additionally, we find that the particularities of FC outputs can be leveraged to improve the performance of existing UQ methods in this setting. Specifically, multi-sample UQ methods benefit from clustering FC outputs based on their abstract syntax tree parsing, while single-sample UQ methods can be improved by selecting only semantically meaningful tokens when calculating logit-based uncertainty scores.
Chinese Translation
大型语言模型(LLMs)越来越多地被部署以自主解决现实世界中的任务。实现这一目标的关键因素是LLM功能调用(Function-Calling)范式,这是一种广泛使用的方法,用于赋予LLM工具使用能力。然而,LLM错误地调用函数可能会产生严重的后果,特别是当其效果是不可逆转的,例如转账或删除数据。因此,在执行函数调用之前,考虑LLM对该函数调用正确解决任务的信心至关重要。不确定性量化(Uncertainty Quantification, UQ)方法可以用来量化这种信心,并防止潜在的错误函数调用。在本研究中,我们呈现了我们所知的首个针对LLM功能调用(FC)不确定性量化方法的评估。尽管多样本UQ方法(如语义熵)在自然语言问答任务中表现出色,但我们发现,在FC设置中,它并未明显优于简单的单样本UQ方法。此外,我们发现FC输出的特性可以被利用,以提高现有UQ方法在这一设置中的性能。具体而言,多样本UQ方法通过基于抽象语法树解析对FC输出进行聚类而受益,而单样本UQ方法则可以通过在计算基于logit的不确定性分数时仅选择语义上有意义的标记来改善。
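Clustering sampled function calls by their abstract syntax tree before computing a multi-sample uncertainty score can be sketched with Python's standard `ast` module. This assumes the sampled calls are written in Python call syntax, which the abstract does not specify (many systems emit JSON instead); the function names and sample calls are hypothetical, and the entropy over clusters is a simplified stand-in for the UQ methods the paper evaluates.

```python
import ast
import math
from collections import Counter


def canonical_call(call_src: str) -> str:
    """Parse a Python-style call and return a canonical string for it.

    Keyword arguments are sorted by name, so calls that differ only in
    argument order, quoting style, or whitespace share one canonical form.
    """
    node = ast.parse(call_src.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError(f"not a function call: {call_src!r}")
    node.keywords.sort(key=lambda kw: kw.arg or "")
    return ast.dump(node)


def cluster_entropy(samples) -> float:
    """Shannon entropy (in nats) over clusters of AST-equivalent calls."""
    if not samples:
        raise ValueError("need at least one sampled call")
    counts = Counter(canonical_call(s) for s in samples)
    total = len(samples)
    return sum((c / total) * math.log(total / c) for c in counts.values())


consistent = [
    'transfer(to="bob", amount=10)',
    "transfer(amount=10, to='bob')",
    'transfer(to="bob",amount=10)',
]
mixed = consistent + ['transfer(to="bob", amount=100)']

print(cluster_entropy(consistent))  # all samples fall into one cluster
print(cluster_entropy(mixed))       # a second cluster raises the entropy
```

Comparing raw strings would treat the three `consistent` samples as three different answers; grouping by parsed structure counts them as one, so the score reflects disagreement about the call itself rather than about its surface form.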
cs.CL / 6 / 2604.23009
Chinese-SkillSpan: A Span-Level Dataset for ESCO-Aligned Competency Extraction from Chinese Job Ads
Chinese-SkillSpan:一个用于从中文招聘广告中提取与ESCO对齐的能力的跨度级数据集
Abstract
Job Skill Named Entity Recognition (JobSkillNER) aims to automatically extract key skill information from large-scale job posting data, which is important for improving talent-market matching efficiency and supporting personalized employment services. To the best of our knowledge, this work presents the first Chinese JobSkillNER dataset for recruitment texts. We propose annotation guidelines tailored to Chinese job postings and an LLM-empowered Macro-Micro collaborative annotation pipeline. The pipeline leverages the contextual understanding ability of large language models (LLMs) for initial annotation and then refines the results through expert sentence-level adjudication. Using this pipeline, we annotate more than 20,000 instances collected from four major recruitment platforms over the period 2014-2025. Based on these efforts, we release Chinese-SkillSpan, the first Chinese JobSkillNER dataset aligned with the ESCO occupational skill standard across four dimensions: knowledge, skill, transversal competence, and language competence (LSKT). Experimental results show that the dataset supports effective model training and evaluation, indicating that Chinese-SkillSpan helps fill a major gap in Chinese JobSkillNER resources and provides a useful benchmark for intelligent recruitment research. Code and data are available at https://sites.google.com/view/cn-skillspan-resources .
Chinese Translation
工作技能命名实体识别(JobSkillNER)旨在自动从大规模招聘数据中提取关键技能信息,这对提高人才市场匹配效率和支持个性化就业服务至关重要。据我们所知,本研究首次提出了针对招聘文本的中文JobSkillNER数据集。我们提出了针对中文招聘广告的注释指南,并构建了一个基于大语言模型(LLM)的宏-微协作注释流程。该流程利用大语言模型的上下文理解能力进行初步注释,然后通过专家的句子级裁决来细化结果。通过该流程,我们对2014年至2025年期间从四大招聘平台收集的超过20,000个实例进行了注释。基于这些努力,我们发布了Chinese-SkillSpan,这是第一个与ESCO职业技能标准在知识、技能、跨领域能力和语言能力(LSKT)四个维度对齐的中文JobSkillNER数据集。实验结果表明,该数据集支持有效的模型训练和评估,表明Chinese-SkillSpan有助于填补中文JobSkillNER资源的重大空白,并为智能招聘研究提供了有用的基准。代码和数据可在 https://sites.google.com/view/cn-skillspan-resources 获取。
cs.CL / 7 / 2604.23051
Evaluating Temporal Consistency in Multi-Turn Language Models
评估多轮语言模型中的时间一致性
Abstract
Language models are increasingly deployed in interactive settings where users reason about facts over time rather than in isolation. In such scenarios, correct behavior requires models to maintain and update implicit temporal assumptions established earlier in a conversation. We study this challenge through the lens of temporal scope stability: the ability to preserve, override, or transfer time-scoped factual context across dialogue turns. We introduce ChronoScope, a large-scale diagnostic benchmark designed to isolate temporal scope behavior in controlled multi-turn interactions, comprising over one million deterministically generated question chains grounded in Wikidata. ChronoScope evaluates whether models can correctly retain inferred temporal scope when follow-up questions omit explicit time references, spanning implicit carryover, explicit scope switching, cross-entity transfer, and longer temporal trajectories. Through extensive evaluation of state-of-the-art language models, we find that temporal scope stability is frequently violated in controlled multi-turn settings, with models often drifting toward present-day assumptions despite correct underlying knowledge. These failures intensify with interaction length and persist even under oracle context conditions, revealing a gap between single-turn factual accuracy and coherent temporal reasoning under sequential interaction. We make our dataset and evaluation suite publicly available at https://github.com/yashkumaratri/ChronoScope
Chinese Translation
语言模型越来越多地应用于交互环境中,在这些环境中,用户需要根据时间推理事实,而不是孤立地进行推理。在这种情况下,模型的正确行为要求其保持并更新在对话早期建立的隐含时间假设。我们通过时间范围稳定性的视角研究这一挑战:即在对话轮次中保持、覆盖或转移时间范围事实上下文的能力。我们引入了ChronoScope,这是一个大规模的诊断基准,旨在隔离受控多轮交互中的时间范围行为,包含超过一百万个基于Wikidata确定生成的问题链。ChronoScope评估模型在后续问题省略显式时间引用时,是否能够正确保留推断的时间范围,涵盖隐式延续、显式范围切换、跨实体转移和更长的时间轨迹。通过对最先进语言模型的广泛评估,我们发现,在受控的多轮环境中,时间范围稳定性经常受到违反,模型往往偏向于当今的假设,尽管其底层知识是正确的。这些失败在交互长度增加时加剧,并且即使在理想上下文条件下仍然存在,揭示了单轮事实准确性与顺序交互下连贯时间推理之间的差距。我们将我们的数据集和评估工具包公开发布,网址为https://github.com/yashkumaratri/ChronoScope
cs.CL / 8 / 2604.23054
DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining
DeepImagine:通过连续的反事实想象学习生物医学推理
Abstract
Predicting the outcomes of prospective clinical trials remains a major challenge for large language models. Prior work has shown that both traditional correlational predictors, such as random forests and logistic regression, and strong commercial LLMs achieve limited performance on this task. In this paper, we propose DeepImagine, a framework for teaching LLMs biomedical reasoning through successive counterfactual imagining. The central idea is to approximate hidden causal mechanisms of clinical trials by training models to infer how observed trial results would change under controlled perturbations of experimental conditions, such as dosage, outcome measures, study arms, geography, and other trial attributes. To support this objective, we construct both natural and approximate counterfactual pairs from real clinical trials with reported outcomes. For settings where strict counterfactual supervision is available, such as paired outcome measures or dose-ranging study arms within the same trial, we train models with supervised fine-tuning. For broader settings where only approximate counterfactual pairs can be retrieved, we optimize models with reinforcement learning using verifiable rewards based on downstream benchmark correctness. We further augment training with synthetic reasoning traces that provide causally plausible explanations for local counterfactual transitions. Using this pipeline, we train language models under 10B parameters, including Qwen3.5-9B, and evaluate them on clinical trial outcome prediction. We aim to show that DeepImagine consistently improves over untuned language models and traditional correlational baselines. Finally, we aim to show that the learned reasoning trajectories provide interpretable signals about how models represent trial-level mechanisms, suggesting a practical path toward more mechanistic and scientifically useful biomedical language models.
Chinese Translation
预测潜在临床试验的结果仍然是大型语言模型面临的一项重大挑战。先前的研究表明,传统的相关性预测方法,如随机森林和逻辑回归,以及强大的商业大型语言模型(LLMs)在这项任务上的表现有限。本文提出了DeepImagine,一个通过连续的反事实想象来教授大型语言模型生物医学推理的框架。其核心思想是通过训练模型推断在实验条件(如剂量、结果测量、研究组、地理位置及其他试验属性)受到控制性扰动时,观察到的试验结果如何变化,从而近似临床试验的隐含因果机制。为支持这一目标,我们从真实的临床试验中构建了自然和近似的反事实对,并报告了结果。在严格的反事实监督可用的情况下,例如在同一试验中的配对结果测量或剂量范围研究组中,我们通过监督微调训练模型。在只能检索到近似反事实对的更广泛场景中,我们使用基于下游基准正确性的可验证奖励,通过强化学习优化模型。我们进一步通过合成推理轨迹增强训练,这些轨迹为局部反事实转变提供了因果上合理的解释。利用该流程,我们训练了参数少于10B的语言模型,包括Qwen3.5-9B,并在临床试验结果预测上进行了评估。我们旨在展示DeepImagine在未调优的语言模型和传统相关性基线之上的一致改进。最后,我们还希望展示所学习的推理轨迹提供了关于模型如何表示试验级机制的可解释信号,暗示着朝着更具机制性和科学实用性的生物医学语言模型的实际路径。
cs.CL / 9 / 2604.23059
Implicit Framing in Obstetric Counseling Notes: A Grounded LLM Pipeline on a VBAC-Eligible Cohort
产科咨询记录中的隐性框架:基于扎根语言模型的VBAC适应人群分析
Abstract
Clinical framing -- the linguistic manner in which clinical information is presented -- can influence patient understanding and decision-making, with important implications for healthcare outcomes. Obstetrics is a high-stakes domain in which physicians counsel patients on delivery mode choices such as vaginal birth after cesarean (VBAC) and repeat cesarean section (RCS), yet counseling language remains underexplored in large-scale clinical text analysis. In this work, we analyze physician counseling language in 2,024 obstetric history and physical narratives for a rigorously defined cohort of patients for whom both VBAC and RCS were clinically viable options. To control for confounding due to medical contraindications, we first construct a VBAC-eligible cohort using structured clinical data supplemented by a large language model (LLM)-based extraction pipeline constrained to grounded, verbatim evidence from free-text narratives. We then apply a zero-shot LLM framework to categorize counseling segments into predefined framing categories capturing how physicians linguistically present delivery options. Our analysis reveals a significant difference in counseling framing distributions between VBAC and RCS notes; risk-focused language accounts for a substantially larger share of counseling segments in RCS documentation than in VBAC, with category-level differences confirmed by statistical testing, highlighting the value of controlled LLM-based framing analysis in obstetric care.
Chinese Translation
临床框架——临床信息呈现的语言方式——可以影响患者的理解和决策,进而对医疗结果产生重要影响。产科是一个高风险领域,医生在其中就分娩方式选择(如剖宫产后阴道分娩(VBAC)和重复剖宫产(RCS))对患者进行咨询,但咨询语言在大规模临床文本分析中仍然未得到充分探讨。在本研究中,我们分析了2,024份产科病史和体检记录中的医生咨询语言,这些记录来自一个严格定义的患者群体,对该群体而言,VBAC和RCS都是临床可行的选择。为了控制由于医疗禁忌引起的混杂因素,我们首先使用结构化临床数据构建了一个VBAC适应人群,并通过基于大型语言模型(LLM)的提取管道,限制在来自自由文本叙述的扎根、逐字证据。随后,我们应用零-shot LLM框架将咨询片段分类为预定义的框架类别,以捕捉医生如何在语言上呈现分娩选择。我们的分析揭示了VBAC和RCS记录之间在咨询框架分布上的显著差异;风险导向的语言在RCS文档中的咨询片段中占据了显著更大的比例,而在VBAC中则相对较少,类别级别的差异通过统计测试得到了确认,突显了基于控制的LLM框架分析在产科护理中的价值。
cs.CL / 10 / 2604.23069
ContextWeaver: Selective and Dependency-Structured Memory Construction for LLM Agents
ContextWeaver:针对大型语言模型代理的选择性和依赖结构化记忆构建
Abstract
Large language model (LLM) agents often struggle in long-context interactions. As the agent accumulates more interaction history, context management approaches such as sliding window and prompt compression may omit earlier structured information that later steps rely on. Recent retrieval-based memory systems surface relevant content but still overlook the causal and logical structure needed for multi-step reasoning. We introduce ContextWeaver, a selective and dependency-structured memory framework that organizes an agent's interaction trace into a graph of reasoning steps and selects the relevant context for future actions. Unlike prior context management approaches, ContextWeaver supports: (1) dependency-based construction and traversal that link each step to the earlier steps it relies on; (2) compact dependency summarization that condenses root-to-step reasoning paths into reusable units; and (3) a lightweight validation layer that incorporates execution feedback. On the SWE-Bench Verified and Lite benchmarks, ContextWeaver improves performance over a sliding-window baseline in pass@1, while reducing reasoning steps and token usage. Our observations suggest that modeling logical dependencies provides a stable and scalable memory mechanism for LLM agents that use tools.
Chinese Translation
大型语言模型(LLM)代理在长上下文交互中常常面临困难。随着代理积累更多的交互历史,滑动窗口和提示压缩等上下文管理方法可能会忽略早期的结构化信息,而这些信息是后续步骤所依赖的。最近的基于检索的记忆系统能够提取相关内容,但仍然忽视了多步骤推理所需的因果和逻辑结构。我们提出了ContextWeaver,一个选择性和依赖结构化的记忆框架,它将代理的交互轨迹组织成推理步骤的图,并为未来的行动选择相关上下文。与以往的上下文管理方法不同,ContextWeaver支持:(1)基于依赖的构建和遍历,将每个步骤与其依赖的早期步骤连接起来;(2)紧凑的依赖摘要,将根到步骤的推理路径浓缩为可重用的单元;(3)一个轻量级的验证层,结合执行反馈。在SWE-Bench Verified和Lite基准测试中,ContextWeaver在pass@1上相较于滑动窗口基线提高了性能,同时减少了推理步骤和令牌使用。我们的观察表明,建模逻辑依赖为使用工具的LLM代理提供了一个稳定且可扩展的记忆机制。
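The dependency-based traversal described in this abstract can be sketched in a few lines. This is only an illustration under assumptions: steps are integers, `deps` maps each step to the earlier steps it relies on, and `select_context` is a hypothetical name, not ContextWeaver's API.

```python
def select_context(deps: dict[int, list[int]], target: int) -> list[int]:
    """Return every earlier step the target transitively depends on, in step order."""
    seen: set[int] = set()
    stack = list(deps.get(target, []))
    while stack:
        step = stack.pop()
        if step in seen:
            continue
        seen.add(step)
        # Follow the chain back towards the root of the reasoning path.
        stack.extend(deps.get(step, []))
    return sorted(seen)


if __name__ == "__main__":
    # Step 5 relies on steps 4 and 3; step 4 relies on 2; step 2 relies on 1.
    deps = {1: [], 2: [1], 3: [], 4: [2], 5: [4, 3], 6: [1]}
    print(select_context(deps, 5))  # [1, 2, 3, 4]
```

A sliding window of the last two steps would keep only steps 3 and 4 for step 5, dropping steps 1 and 2 even though step 4 depends on them; the traversal keeps them.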
cs.CL / 11 / 2604.23108
Mixture of Heterogeneous Grouped Experts for Language Modeling
用于语言建模的异构分组专家混合模型
Abstract
Large Language Models (LLMs) based on Mixture-of-Experts (MoE) are pivotal in industrial applications for their ability to scale performance efficiently. However, standard MoEs enforce uniform expert sizes, creating a rigidity that fails to align computational costs with varying token-level complexity. While heterogeneous expert architectures attempt to address this by diversifying expert sizes, they often suffer from significant system-level challenges, specifically unbalanced GPU utilization and inefficient parameter utilization, which hinder practical deployment. To bridge the gap between theoretical heterogeneity and robust industrial application, we propose Mixture of Heterogeneous Grouped Experts (MoHGE) which introduces a two-level routing mechanism to enable flexible, resource-aware expert combinations. To optimize inference efficiency, we propose a Group-Wise Auxiliary Loss, which dynamically steers tokens to the most parameter-efficient expert groups based on task difficulty. To address the critical deployment challenge of GPU load balancing, we introduce an All-size Group-decoupling Allocation strategy coupled with an Intra-Group Experts Auxiliary Loss. These mechanisms collectively ensure uniform computation distribution across GPUs. Extensive evaluations demonstrate that MoHGE matches the performance of MoE architectures while reducing the total parameters by approximately 20% and maintaining balanced GPU utilization. Our work establishes a scalable paradigm for resource-efficient MoE design, offering a practical solution for optimizing inference costs in real-world scenarios.
Chinese Translation
基于专家混合模型(Mixture-of-Experts, MoE)的大型语言模型(Large Language Models, LLMs)在工业应用中至关重要,因为它们能够有效地扩展性能。然而,标准的 MoE 强制要求专家规模一致,这种刚性导致计算成本无法与不同的令牌级复杂性相匹配。虽然异构专家架构试图通过多样化专家规模来解决这一问题,但它们往往面临显著的系统级挑战,特别是 GPU 利用率不平衡和参数利用效率低下,这阻碍了实际部署。为了弥合理论异构性与稳健工业应用之间的差距,我们提出了异构分组专家混合模型(Mixture of Heterogeneous Grouped Experts, MoHGE),该模型引入了两级路由机制,以实现灵活的、资源感知的专家组合。为了优化推理效率,我们提出了一种分组辅助损失(Group-Wise Auxiliary Loss),该损失动态引导令牌根据任务难度分配到最具参数效率的专家组。为了解决 GPU 负载平衡这一关键部署挑战,我们引入了一种全尺寸组解耦分配策略,并结合组内专家辅助损失(Intra-Group Experts Auxiliary Loss)。这些机制共同确保了 GPU 之间的均匀计算分配。大量评估表明,MoHGE 在性能上与 MoE 架构相匹配,同时将总参数减少约 20%,并保持 GPU 利用率平衡。我们的工作建立了一种可扩展的资源高效 MoE 设计范式,为优化现实场景中的推理成本提供了实际解决方案。
cs.CL / 12 / 2604.23130
Mechanistic Steering of LLMs Reveals Layer-wise Feature Vulnerabilities in Adversarial Settings
机制性引导大型语言模型揭示对抗环境中的层级特征脆弱性
Abstract
Large language models (LLMs) can still be jailbroken into producing harmful outputs despite safety alignment. Existing attacks show this vulnerability, but not the internal mechanisms that cause it. This study asks whether jailbreak success is driven by identifiable internal features rather than prompts alone. We propose a three-stage pipeline for Gemma-2-2B using the BeaverTails dataset. First, we extract concept-aligned tokens from adversarial responses via subspace similarity. Second, we apply three feature-grouping strategies (cluster, hierarchical-linkage, and single-token-driven) to identify SAE feature subgroups for the aligned tokens across all 26 model layers. Third, we steer the model by amplifying the top features from each identified subgroup and measure the change in harmfulness score using a standardized LLM-judge scoring protocol. In all three approaches, the features in the layers [16-25] were relatively more vulnerable to steering. All three methods confirmed that mid to later layer feature subgroups are more responsible for unsafe outputs. These results provide evidence that the jailbreak vulnerability in Gemma-2-2B is localized to feature subgroups of mid to later layers, suggesting that targeted feature-level interventions may offer a more principled path to adversarial robustness than current prompt-level defenses.
Chinese Translation
尽管大型语言模型(LLMs)进行了安全对齐,但仍然可以被破解以产生有害输出。现有攻击展示了这一脆弱性,但并未揭示导致其发生的内部机制。本研究探讨了破解成功是否由可识别的内部特征驱动,而不仅仅是由提示引导。我们为Gemma-2-2B提出了一个三阶段的流程,使用BeaverTails数据集。首先,我们通过子空间相似性从对抗响应中提取概念对齐的标记。其次,我们应用三种特征分组策略(聚类、层次链接和单标记驱动)来识别所有26个模型层中对齐标记的SAE特征子组。第三,我们通过放大每个识别子组中的顶级特征来引导模型,并使用标准化的LLM评估协议测量有害性评分的变化。在所有三种方法中,层[16-25]中的特征相对更容易受到引导。所有三种方法均确认中层到后层的特征子组对不安全输出的责任更大。这些结果提供了证据,表明Gemma-2-2B中的破解脆弱性局限于中层到后层特征子组,暗示针对特征层面的干预可能比当前的提示层面防御提供更为原则性的对抗鲁棒性路径。
cs.CL / 13 / 2604.23214
DARC-CLIP: Dynamic Adaptive Refinement with Cross-Attention for Meme Understanding
DARC-CLIP:基于交叉注意力的动态自适应精炼用于表情包理解
Abstract
Memes convey meaning through the interaction of visual and textual signals, often combining humor, irony, and offense in subtle ways. Detecting harmful or sensitive content in memes requires accurate modeling of these multimodal cues. Existing CLIP-based approaches rely on static fusion, which struggles to capture fine-grained dependencies between modalities. We propose DARC-CLIP, a CLIP-based framework for adaptive multimodal fusion with a hierarchical refinement stack. DARC-CLIP introduces Adaptive Cross-Attention Refiners (ACAR) for bidirectional information alignment and Dynamic Feature Adapters (DFA) for task-sensitive signal adaptation. We evaluate DARC-CLIP on the PrideMM benchmark, which includes hate, target, stance, and humor classification, and further test generalization on the CrisisHateMM dataset. DARC-CLIP achieves highly competitive classification accuracy across tasks, with significant gains of +4.18 AUROC and +6.84 F1 in hate detection over the strongest baseline. Ablation studies confirm that ACAR and DFA are the main contributors to these gains. These results show that adaptive cross-signal refinement is an effective strategy for multimodal content analysis in socially sensitive classification.
Chinese Translation
表情包通过视觉和文本信号的互动传达意义,通常以微妙的方式结合幽默、讽刺和冒犯。检测表情包中的有害或敏感内容需要准确建模这些多模态线索。现有的基于CLIP的方法依赖于静态融合,难以捕捉模态之间的细粒度依赖关系。我们提出了DARC-CLIP,这是一个基于CLIP的自适应多模态融合框架,具有分层精炼堆栈。DARC-CLIP引入了自适应交叉注意力精炼器(Adaptive Cross-Attention Refiners)以实现双向信息对齐,并采用动态特征适配器(Dynamic Feature Adapters)进行任务敏感信号适配。我们在PrideMM基准上评估DARC-CLIP,该基准包括仇恨、目标、立场和幽默分类,并进一步在CrisisHateMM数据集上测试其泛化能力。DARC-CLIP在各项任务中实现了高度竞争的分类准确率,在仇恨检测上相较于最强基线显著提升了+4.18 AUROC和+6.84 F1。消融研究确认ACAR和DFA是这些提升的主要贡献者。这些结果表明,自适应交叉信号精炼是社会敏感分类中多模态内容分析的有效策略。
cs.CL / 14 / 2604.23235
Measuring Temporal Linguistic Emergence in Diffusion Language Models
在扩散语言模型中测量时间语言的出现
Abstract
Diffusion language models expose an explicit denoising trajectory, making it possible to ask when different kinds of information become measurable during generation. We study three independent 32-step runs of LLaDA-8B-Base on masked WikiText-103 text, each with 1,000 probe-training sequences and 200 held-out evaluation sequences. From saved trajectories, we derive four temporal measurements: token commitment; linear recoverability of part-of-speech (POS), coarse semantic category, and token identity; confidence and entropy dynamics; and sensitivity under mid-trajectory re-masking. Across seeds, the same ordering recurs: content categories stabilize earlier than function-heavy categories, POS and coarse semantic labels remain substantially more linearly recoverable than exact lexical identity under our probe setup, uncertainty remains higher for tokens that ultimately resolve incorrectly even though late confidence becomes less calibrated, and perturbation sensitivity peaks in the middle of the trajectory. A direct/collateral decomposition shows that this peak is overwhelmingly local to the perturbed positions themselves. In this LLaDA+WikiText setting, denoising time is therefore a useful analysis axis: under our measurements, coarse labels are recovered earlier and more robustly than lexical identity, trajectory-level uncertainty tracks eventual correctness, and mid-trajectory states are the most intervention-sensitive.
Chinese Translation
扩散语言模型展示了明确的去噪轨迹,使得我们能够询问在生成过程中不同类型的信息何时变得可测。我们研究了在掩码的WikiText-103文本上进行的三次独立的32步LLaDA-8B-Base运行,每次运行包含1,000个探测训练序列和200个保留的评估序列。从保存的轨迹中,我们推导出四个时间测量指标:标记承诺;词性(POS)、粗略语义类别和标记身份的线性可恢复性;置信度和熵动态;以及在中途重新掩码下的敏感性。在不同种子下,相同的顺序反复出现:内容类别比功能重的类别更早稳定,词性和粗略语义标签在我们的探测设置下的线性可恢复性显著高于确切的词汇身份,尽管晚期的置信度变得不那么校准,但最终错误解析的标记的不确定性仍然较高,而扰动敏感性在轨迹中段达到峰值。直接/间接分解显示,这一峰值主要局限于被扰动的位置。在这个LLaDA+WikiText的设置中,去噪时间因此成为一个有用的分析轴:根据我们的测量,粗略标签比词汇身份更早且更稳健地被恢复,轨迹级别的不确定性与最终的正确性相关,而中途状态对干预的敏感性最高。
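One of the four measurements named in this abstract, token commitment, can be illustrated with a small sketch. The abstract does not define the quantity, so the definition used here is an assumption: a position's commitment step is the earliest denoising step from which its predicted token equals the final token and never changes again. The function name `commitment_steps` and the `[M]` mask symbol are likewise hypothetical.

```python
def commitment_steps(trajectory: list[list[str]]) -> list[int]:
    """For each position, the earliest step from which the prediction stays at its final value."""
    final = trajectory[-1]
    steps = []
    for pos, tok in enumerate(final):
        t = len(trajectory) - 1
        # Walk backwards while the earlier step already held the final token.
        while t > 0 and trajectory[t - 1][pos] == tok:
            t -= 1
        steps.append(t)
    return steps


if __name__ == "__main__":
    trajectory = [
        ["[M]", "[M]", "[M]"],   # step 0: everything masked
        ["cat", "[M]", "[M]"],   # step 1
        ["cat", "the", "[M]"],   # step 2
        ["cat", "a", "mat"],     # step 3: final output
    ]
    print(commitment_steps(trajectory))  # [1, 3, 3]
```

Grouping these steps by part-of-speech tag would be one way to compare when content and function categories stabilize.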
cs.CL / 15 / 2604.23263
Small Language Model Helps Resolve Semantic Ambiguity of LLM Prompt
小型语言模型帮助解决大型语言模型提示的语义歧义
Abstract
Large language models (LLMs) are increasingly utilized in various complex reasoning tasks due to their excellent instruction-following capability. However, the model's performance is highly dependent on the open-ended characteristics of the users' input prompt. Natural prompts often do not follow proper syntactic rules, which creates ambiguous queries that yield multiple interpretations. Such ambiguous prompts confuse the model in choosing the correct reasoning paths to answer questions. Prior works address this challenge by applying query editing during the LLM inference process without explicitly solving the root cause of the ambiguity. To address this limitation, we propose a pre-inference prompt optimization mechanism via explicit prompt disambiguation. Particularly, we identify semantic risks in the prompt, check their multi-perspective consistency, and resolve any semantic conflicts that arise. Finally, we organize the resolved ambiguities in a logically structured manner as a clean input to the LLM. By explicitly resolving semantic ambiguity, our method can produce a more focused attention distribution to the semantically essential tokens. We also leverage small language models (SLMs) as the main executor of prompt disambiguation to benefit from their efficient computation. Through comprehensive experiments on multiple benchmarks, we demonstrate that our method improves reasoning performance by 2.5 points at a cost of only $0.02. Our study promotes explicit prompt disambiguation as an effective prompt optimization method without disturbing the internal mechanism of LLM inference.
Chinese Translation
大型语言模型(LLMs)因其出色的指令遵循能力,越来越多地应用于各种复杂推理任务。然而,模型的性能高度依赖于用户输入提示的开放性特征。自然提示往往不遵循适当的句法规则,这导致模糊的查询产生多重解释。这种模糊的提示使模型在选择正确的推理路径以回答问题时感到困惑。先前的研究通过在LLM推理过程中应用查询编辑来应对这一挑战,但并未明确解决歧义的根本原因。为了解决这一局限性,我们提出了一种通过显式提示消歧的预推理提示优化机制。具体而言,我们识别提示中的语义风险,检查其多角度一致性,并解决任何出现的语义冲突。最后,我们将解决后的歧义以逻辑结构化的方式组织,作为LLM的清晰输入。通过显式解决语义歧义,我们的方法能够对语义重要的标记产生更集中的注意力分布。我们还利用小型语言模型(SLMs)作为提示消歧的主要执行者,以受益于其高效的计算。通过在多个基准上的全面实验,我们证明我们的方法在仅花费0.02美元的情况下,将推理性能提高了2.5分。我们的研究推动了显式提示消歧作为一种有效的提示优化方法,而不干扰LLM推理的内部机制。
cs.CL / 16 / 2604.23267
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
大语言模型中的微调与上下文学习:一种形式语言学习的视角
Abstract
Large language models (LLMs) operate in two fundamental learning modes - fine-tuning (FT) and in-context learning (ICL) - raising key questions about which mode yields greater language proficiency and whether they differ in their inductive biases. Prior studies comparing FT and ICL have yielded mixed and inconclusive results due to inconsistent experimental setups. To enable a rigorous comparison, we propose a formal language learning task - offering precise language boundaries, controlled string sampling, and no data contamination - and introduce a discriminative test for language proficiency, where an LLM succeeds if it assigns higher generation probability to in-language strings than to out-of-language strings. Empirically, we find that: (a) FT has greater language proficiency than ICL on in-distribution generalization, but both perform equally well on out-of-distribution generalization. (b) Their inductive biases, measured by the correlation in string generation probabilities, are similar when both modes partially learn the language but diverge at higher proficiency levels. (c) Unlike FT, ICL performance differs substantially across models of varying sizes and families and is sensitive to the token vocabulary of the language. Thus, our work demonstrates the promise of formal languages as a controlled testbed for evaluating LLM behaviors that are difficult to isolate in natural language datasets. Our source code is available at https://github.com/bishwamittra/formallm.
Chinese Translation
大语言模型(LLMs)在两种基本学习模式下运行——微调(FT)和上下文学习(ICL),这引发了关于哪种模式能产生更高语言能力以及它们在归纳偏差上是否存在差异的关键问题。以往比较FT和ICL的研究由于实验设置不一致而得出了混合且不确定的结果。为了进行严格的比较,我们提出了一项形式语言学习任务——提供精确的语言边界、控制的字符串采样以及无数据污染,并引入了一种语言能力的区分性测试,其中LLM在为语言内字符串分配的生成概率高于语言外字符串时被视为成功。实证结果表明:(a)在同分布泛化上,FT的语言能力优于ICL,但在异分布泛化上,两者表现相当。(b)当两种模式部分学习语言时,其归纳偏差(通过字符串生成概率的相关性测量)相似,但在更高的能力水平上则出现分歧。(c)与FT不同,ICL的表现因模型大小和类型的不同而有显著差异,并且对语言的词汇敏感。因此,我们的研究展示了形式语言作为评估LLM的受控测试平台的潜力,这在自然语言数据集中难以孤立。我们的源代码可在 https://github.com/bishwamittra/formallm 获取。
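The discriminative test described in this abstract can be sketched as a pairwise comparison of log-probabilities. The pairwise formulation and the name `discriminative_accuracy` are assumptions made for illustration; the abstract only states that the model succeeds when in-language strings receive higher generation probability than out-of-language strings.

```python
def discriminative_accuracy(in_logprobs: list[float], out_logprobs: list[float]) -> float:
    """Fraction of (in-language, out-of-language) pairs where the in-language string scores higher."""
    pairs = [(i, o) for i in in_logprobs for o in out_logprobs]
    wins = sum(1 for i, o in pairs if i > o)
    return wins / len(pairs)


if __name__ == "__main__":
    # Hypothetical sequence log-probabilities from a model under test.
    in_language = [-2.0, -3.0]
    out_of_language = [-2.5, -6.0]
    print(discriminative_accuracy(in_language, out_of_language))  # 0.75
```

A score of 1.0 means every in-language string outranks every out-of-language string; 0.5 is what a scorer with no knowledge of the language would get on average.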
cs.CL / 17 / 2604.23277
From Similarity to Structure: Training-free LLM Context Compression with Hybrid Graph Priors
从相似性到结构:无训练的大型语言模型上下文压缩与混合图先验
Abstract
Long-context large language models remain computationally expensive to run and often fail to reliably process very long inputs, which makes context compression an important component of many systems. Existing compression approaches typically rely on trained compressors, dense retrieval-style selection, or heuristic trimming, and they often struggle to jointly preserve task relevance, topic coverage, and cross-sentence coherence under a strict token budget. To address this, we propose a training-free and model-agnostic compression framework that selects a compact set of sentences guided by structural graph priors. Our method constructs a sparse hybrid sentence graph that combines mutual k-NN semantic edges with short-range sequential edges, extracts a topic skeleton via clustering, and ranks sentences using an interpretable score that integrates task relevance, cluster representativeness, bridge centrality, and a cycle coverage cue. A budgeted greedy selection with redundancy suppression then produces a readable compressed context in original order. Experimental results on four datasets show that our approach is competitive with strong extractive and abstractive baselines, demonstrating larger gains on long-document benchmarks.
Chinese Translation
长上下文的大型语言模型在运行时仍然计算成本高,并且往往无法可靠地处理非常长的输入,这使得上下文压缩成为许多系统的重要组成部分。现有的压缩方法通常依赖于训练过的压缩器、密集检索式选择或启发式修剪,并且在严格的令牌预算下,往往难以共同保持任务相关性、主题覆盖和跨句子连贯性。为了解决这个问题,我们提出了一种无训练且与模型无关的压缩框架,该框架通过结构图先验选择一组紧凑的句子。我们的方法构建了一个稀疏的混合句子图,该图结合了互相的 k-NN 语义边和短距离顺序边,通过聚类提取主题骨架,并使用一个可解释的评分来对句子进行排名,该评分综合了任务相关性、聚类代表性、桥接中心性和循环覆盖提示。然后,通过预算贪婪选择和冗余抑制,生成以原始顺序排列的可读压缩上下文。在四个数据集上的实验结果表明,我们的方法在强提取和抽象基线中具有竞争力,并在长文档基准上表现出更大的提升。
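The hybrid sentence graph described in this abstract combines mutual k-NN semantic edges with short-range sequential edges. The sketch below shows that construction only, on toy 2-D embeddings in pure Python; the function names, the `k` and `window` defaults, and the use of cosine similarity are assumptions, and the clustering, scoring, and greedy selection stages are not covered.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_graph(embeddings: list[list[float]], k: int = 1, window: int = 1) -> set[tuple[int, int]]:
    """Undirected edges: mutual k-NN semantic links plus links to the next `window` sentences."""
    n = len(embeddings)
    topk = []
    for i in range(n):
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: cosine(embeddings[i], embeddings[j]),
            reverse=True,
        )
        topk.append(set(ranked[:k]))
    edges: set[tuple[int, int]] = set()
    for i in range(n):
        for j in topk[i]:
            if i in topk[j]:  # mutual k-NN: keep only reciprocated links
                edges.add((min(i, j), max(i, j)))
        for j in range(i + 1, min(n, i + window + 1)):
            edges.add((i, j))  # short-range sequential edge
    return edges


if __name__ == "__main__":
    sentences = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
    print(sorted(hybrid_graph(sentences)))
```

Requiring reciprocity keeps the semantic part of the graph sparse, while the sequential edges preserve local reading order between neighbouring sentences.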
cs.CL / 18 / 2604.23284
Au-M-ol: A Unified Model for Medical Audio and Language Understanding
Au-M-ol:一种统一的医学音频和语言理解模型
Abstract
In this work, we present Au-M-ol, a novel multimodal architecture that extends Large Language Models (LLMs) with audio processing. It is designed to improve performance on clinically relevant tasks such as Automatic Speech Recognition (ASR). Au-M-ol has three main components: (1) an audio encoder that extracts rich acoustic features from medical speech, (2) an adaptation layer that maps audio features into the LLM input space, and (3) a pretrained LLM that performs transcription and clinical language understanding. This design allows the model to interpret spoken medical content directly, improving both accuracy and robustness. In experiments, Au-M-ol reduces Word Error Rate (WER) by 56% compared to state-of-the-art baselines on medical transcription tasks. The model also performs well in challenging conditions, including noisy environments, domain-specific terminology, and speaker variability. These results suggest that Au-M-ol is a strong candidate for real-world clinical applications, where reliable and context-aware audio understanding is essential.
Chinese Translation
在本研究中,我们提出了Au-M-ol,一种新颖的多模态架构,它扩展了大型语言模型(LLMs)以支持音频处理。该模型旨在提高在临床相关任务上的表现,例如自动语音识别(ASR)。Au-M-ol主要由三个组件组成:(1)一个音频编码器,用于从医学语音中提取丰富的声学特征;(2)一个适配层,将音频特征映射到LLM输入空间;(3)一个预训练的LLM,执行转录和临床语言理解。这种设计使模型能够直接解读口语医学内容,从而提高准确性和鲁棒性。在实验中,与最先进的基线相比,Au-M-ol在医学转录任务中将字错误率(WER)降低了56%。该模型在噪声环境、领域特定术语和说话者变异性等挑战性条件下表现良好。这些结果表明,Au-M-ol是现实世界临床应用的有力候选者,在这些应用中,可靠且具有上下文感知的音频理解至关重要。
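Since the headline result in this abstract is a WER reduction, a short sketch of the standard metric may help: WER is the word-level edit distance between reference and hypothesis divided by the reference length. The example sentences and WER values below are made up for illustration, and reading the reported 56% as a relative reduction is an assumption, since the abstract does not say.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion of a reference word
                curr[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),   # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, as a fraction of the baseline."""
    return (baseline - new) / baseline


if __name__ == "__main__":
    print(word_error_rate("patient denies chest pain", "patient denies chess pain"))  # 0.25
    print(relative_reduction(0.25, 0.11))
```

Under the relative reading, a hypothetical baseline WER of 0.25 falling to 0.11 would correspond to a reduction of about 56%.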
cs.CL / 19 / 2604.23295
Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations
Josh Talks 的 Human-1:基于现实对话的印地语全双工对话建模框架
Abstract
Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such systems remain largely unexplored for Indian languages. We present the first open, reproducible full-duplex spoken dialogue system for Hindi by adapting Moshi, a state-of-the-art duplex speech architecture, using a custom Hindi tokeniser and training on 26,000 hours of real spontaneous conversations collected from 14,695 speakers with separate speaker channels, enabling direct learning of turn-taking and overlap patterns from natural interactions. To support Hindi text generation, we replace the original English tokeniser and reinitialise text-vocabulary-dependent parameters while retaining the pre-trained audio components. We propose a two-stage training recipe -- large-scale pre-training followed by fine-tuning on 1,000 hours of conversational data. Evaluation through the prompted dialogue continuation paradigm with both automatic metrics and human judgments demonstrates that the resulting model generates natural and meaningful full-duplex conversational behaviour in Hindi. This work serves as a first step toward real-time duplex spoken dialogue systems for Hindi and other Indian languages.
Chinese Translation
全双工口语对话系统能够模拟自然对话行为,如打断、重叠和反馈通道,但此类系统在印度语言中仍然未得到充分探索。我们展示了第一个开放的、可重复的印地语全双工口语对话系统,通过适配 Moshi(一种先进的双工语音架构),使用自定义的印地语分词器,并在从 14,695 名说话者收集的 26,000 小时真实自发对话上进行训练,支持直接学习自然交互中的轮流发言和重叠模式。为了支持印地语文本生成,我们替换了原始的英语分词器,并在保留预训练音频组件的同时重新初始化文本词汇相关参数。我们提出了一种两阶段的训练方案——大规模预训练后,针对 1,000 小时对话数据进行微调。通过使用提示对话延续范式进行评估,结合自动指标和人工判断,结果表明所生成的模型在印地语中能够产生自然且有意义的全双工对话行为。这项工作为印地语及其他印度语言的实时双工口语对话系统奠定了第一步基础。
cs.CL / 20 / 2604.23296
$\mathcal{S}^2$IT: Stepwise Syntax Integration Tuning for Large Language Models in Aspect Sentiment Quad Prediction
$\mathcal{S}^2$IT:用于方面情感四元组预测的大型语言模型的逐步语法集成调优
Abstract
Aspect Sentiment Quad Prediction (ASQP) has seen significant advancements, largely driven by the powerful semantic understanding and generative capabilities of large language models (LLMs). However, while syntactic structure information has been proven effective in previous extractive paradigms, it remains underutilized in the generative paradigm of LLMs due to their limited reasoning capabilities. In this paper, we propose S^2IT, a novel Stepwise Syntax Integration Tuning framework that progressively integrates syntactic structure knowledge into LLMs through a multi-step tuning process. The training process is divided into three steps. S^2IT decomposes the quadruple generation task into two stages: 1) Global Syntax-guided Extraction and 2) Local Syntax-guided Classification, integrating both global and local syntactic structure information. Finally, Fine-grained Structural Tuning enhances the model's understanding of syntactic structures through the prediction of element links and node classification. Experiments demonstrate that S^2IT significantly improves state-of-the-art performance across multiple datasets. Our implementation will be open-sourced at https://github.com/DMIRLAB-Group/S2IT.
Chinese Translation
方面情感四元组预测(ASQP)已经取得了显著进展,这主要得益于大型语言模型(LLMs)强大的语义理解和生成能力。然而,尽管语法结构信息在之前的抽取范式中被证明是有效的,但由于LLMs的推理能力有限,这些信息在生成范式中仍未得到充分利用。本文提出了S^2IT,一种新颖的逐步语法集成调优框架,通过多步骤调优过程逐步将语法结构知识集成到LLMs中。训练过程分为三个步骤。S^2IT将四元组生成任务分解为两个阶段:1)全局语法引导的抽取和2)局部语法引导的分类,整合全局和局部的语法结构信息。最后,细粒度结构调优通过元素链接和节点分类的预测增强了模型对语法结构的理解。实验表明,S^2IT在多个数据集上显著提高了最先进的性能。我们的实现将开源于 https://github.com/DMIRLAB-Group/S2IT。
cs.CL / 21 / 2604.23318
Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
隐藏状态知道推理何时分歧:通过跨度级 Wasserstein 距离进行信用分配
Abstract
Group Relative Policy Optimization (GRPO) performs coarse-grained credit assignment in reinforcement learning with verifiable rewards (RLVR) by assigning the same advantage to all tokens in a rollout. Process reward models can provide finer-grained supervision, but they require step-level annotation or additional reward modeling. We show that hidden-state distributions contain a useful signal for local reasoning quality that can be extracted using only outcome-level correctness labels available in RLVR. Specifically, within each GRPO group, the Wasserstein distance between span-level hidden state distributions of correct and incorrect rollouts increases around regions where their local reasoning quality diverges. This association holds both across examples and within individual trajectories, suggesting that hidden-state distributional divergence can serve as a self-supervision signal for fine-grained credit assignment. We formalize this observation with a separation theorem showing that, under mild structural assumptions, post-divergence spans have larger Wasserstein distances than pre-divergence spans whenever the population-level distributional gap exceeds finite-sample noise. Motivated by this result, we propose Span-level Hidden state Enabled Advantage Reweighting (SHEAR), which modifies GRPO by using span-level Wasserstein distances to scale token-level advantages, amplifying updates on tokens whose hidden states are more separated from the opposing group. The method requires no additional model and only minimal changes to the training pipeline. Experiments on five mathematical reasoning benchmarks and five code generation benchmarks show improvements over standard GRPO and strong performance relative to supervised process reward models, while requiring no additional annotation or reward model training.
Chinese Translation
群体相对策略优化(GRPO)在具有可验证奖励的强化学习(RLVR)中执行粗粒度的信用分配,通过将相同的优势分配给回合中的所有标记。过程奖励模型可以提供更细粒度的监督,但它们需要逐步注释或额外的奖励建模。我们表明,隐藏状态分布包含一个有用的信号,能够反映局部推理质量,该信号可以仅通过 RLVR 中可用的结果级正确性标签提取。具体而言,在每个 GRPO 组内,正确和不正确回合的跨度级隐藏状态分布之间的 Wasserstein 距离在其局部推理质量分歧的区域附近增加。这种关联在示例之间和单个轨迹内均成立,表明隐藏状态分布的分歧可以作为细粒度信用分配的自监督信号。我们通过一个分离定理形式化了这一观察,表明在温和的结构假设下,后分歧跨度的 Wasserstein 距离大于前分歧跨度的距离,只要群体级分布差距超过有限样本噪声。基于这一结果,我们提出了跨度级隐藏状态启用优势重加权(SHEAR),该方法通过使用跨度级 Wasserstein 距离来调整标记级优势,放大对隐藏状态与对立组更分离的标记的更新。该方法不需要额外的模型,仅需对训练流程进行最小的修改。在五个数学推理基准和五个代码生成基准上的实验表明,相较于标准 GRPO 取得了改进,并且相较于监督过程奖励模型表现出强劲的性能,同时不需要额外的注释或奖励模型训练。
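The core idea in this abstract, scaling a rollout-level advantage by span-level Wasserstein distances, can be illustrated with a heavily simplified sketch. Everything concrete here is an assumption rather than the paper's method: hidden states are reduced to 1-D scalars, both samples have equal size, and the mean-normalized scaling rule and the function names are hypothetical.

```python
def wasserstein_1d(xs: list[float], ys: list[float]) -> float:
    """W1 distance between two equal-size 1-D empirical samples: mean gap between sorted values."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def reweight_advantage(advantage: float, span_distances: list[float]) -> list[float]:
    """Turn one rollout-level advantage into per-span advantages proportional to span distance."""
    mean_d = sum(span_distances) / len(span_distances)
    if mean_d == 0:
        # No separation signal anywhere: fall back to the uniform advantage.
        return [advantage] * len(span_distances)
    # Normalizing by the mean keeps the average scale factor at 1.
    return [advantage * d / mean_d for d in span_distances]


if __name__ == "__main__":
    # Hypothetical 1-D projections of hidden states for one span in correct vs incorrect rollouts.
    print(wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))  # 1.0
    # Spans where the two groups are further apart receive a larger share of the update.
    print(reweight_advantage(1.0, [0.5, 0.5, 2.0]))  # [0.5, 0.5, 2.0]
```

With real hidden states the distance would have to be computed between sets of vectors, which this 1-D sketch does not attempt.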
cs.CL / 22 / 2604.23323
Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss
通过跨模态注意力和混合损失实现鲁棒的音频-文本检索
Abstract
Audio-text retrieval enables semantic alignment between audio content and natural language queries, supporting applications in multimedia search, accessibility, and surveillance. However, current state-of-the-art approaches struggle with long, noisy, and weakly labeled audio due to their reliance on contrastive learning and large-batch training. We propose a novel multimodal retrieval framework that refines audio and text embeddings using a cross-modal embedding refinement module combining transformer-based projection, linear mapping, and bidirectional attention. To further improve robustness, we introduce a hybrid loss function blending cosine similarity, $\mathcal{L}_{1}$, and contrastive objectives, enabling stable training even under small-batch constraints. Our approach efficiently handles long-form and noisy audio (SNR 5 to 15) via silence-aware chunking and attention-based pooling. Experiments on benchmark datasets demonstrate improvements over prior methods.
Chinese Translation
音频-文本检索实现了音频内容与自然语言查询之间的语义对齐,支持多媒体搜索、无障碍访问和监控等应用。然而,当前的最先进方法在处理长时间、噪声大和弱标签音频时面临挑战,因为它们依赖于对比学习和大批量训练。我们提出了一种新颖的多模态检索框架,通过结合基于变换器的投影、线性映射和双向注意力的跨模态嵌入精炼模块,来优化音频和文本嵌入。为了进一步提高鲁棒性,我们引入了一种混合损失函数,融合了余弦相似度、$\mathcal{L}_{1}$和对比目标,即使在小批量约束下也能实现稳定训练。我们的方法通过关注静音的分块和基于注意力的池化,能够有效处理长格式和噪声音频(信噪比 5 到 15)。在基准数据集上的实验表明,相较于之前的方法有显著改善。
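The hybrid loss described in the abstract above blends cosine similarity, $\mathcal{L}_{1}$, and contrastive terms. A minimal sketch of one way such a blend could look follows. The weights `w_cos`, `w_l1`, `w_con`, the temperature, and the use of an audio-to-text InfoNCE term are assumptions for illustration; the abstract does not state the paper's exact formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def hybrid_loss(audio, text, w_cos=1.0, w_l1=1.0, w_con=1.0, temperature=0.1):
    """Weighted sum of cosine, L1, and contrastive terms over matched pairs."""
    n = len(audio)
    # Cosine term: 1 - similarity of each matched (audio, text) pair.
    cos_term = sum(1.0 - cosine(a, t) for a, t in zip(audio, text)) / n
    # L1 term: mean absolute difference between matched embeddings.
    l1_term = sum(
        sum(abs(x - y) for x, y in zip(a, t)) / len(a) for a, t in zip(audio, text)
    ) / n
    # Contrastive term (InfoNCE, audio-to-text): the matched text is the positive.
    con_term = 0.0
    for i in range(n):
        logits = [cosine(audio[i], text[j]) / temperature for j in range(n)]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        con_term += log_z - logits[i]
    con_term /= n
    return w_cos * cos_term + w_l1 * l1_term + w_con * con_term

# Toy batch of two pairs: aligned text embeddings versus swapped ones.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = hybrid_loss(audio, aligned_text)
loss_swapped = hybrid_loss(audio, swapped_text)
```

The cosine and L1 terms depend only on matched pairs, so they still give a training signal when the batch is too small for the contrastive term to supply many negatives, which is consistent with the small-batch motivation in the abstract.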
cs.CL / 23 / 2604.23345
Bridging Reasoning and Action: Hybrid LLM-RL Framework for Efficient Cross-Domain Task-Oriented Dialogue
桥接推理与行动:高效跨领域任务导向对话的混合 LLM-RL 框架
Abstract
Cross-domain task-oriented dialogue requires reasoning over implicit and explicit feasibility constraints while planning long-horizon, multi-turn actions. Large language models (LLMs) can infer such constraints but are unreliable over long horizons, whereas reinforcement learning (RL) optimizes long-horizon behavior yet cannot recover constraints from raw dialogue. Naively coupling LLMs with RL is therefore brittle: unverified or unstructured LLM outputs can corrupt state representations and misguide policy learning. Motivated by this, we propose Verified LLM-Knowledge empowered RL (VLK-RL), a hybrid framework that makes LLM-derived constraint reasoning usable for RL. VLK-RL first elicits candidate constraints with an LLM and then verifies them via a dual-role cross-examination procedure to suppress hallucinations and cross-turn inconsistencies. The verified constraints are mapped into ontology-aligned slot-value representations, yielding a structured, constraint-aware state for RL policy optimization. Experiments across multiple benchmarks demonstrate that VLK-RL significantly improves generalization and robustness, outperforming strong single-model baselines on long-horizon tasks.
Chinese Translation
跨领域任务导向对话需要在规划长时间跨度的多轮行动时,对隐含和显性可行性约束进行推理。大型语言模型(LLMs)能够推断这些约束,但在长时间跨度上不可靠,而强化学习(RL)优化长时间跨度的行为,但无法从原始对话中恢复约束。因此,简单地将 LLM 与 RL 结合在一起是脆弱的:未经验证或结构化的 LLM 输出可能会破坏状态表示并误导策略学习。基于此,我们提出了验证的 LLM 知识赋能的 RL(VLK-RL),这是一个混合框架,使 LLM 推导的约束推理可用于 RL。VLK-RL 首先通过 LLM 引出候选约束,然后通过双重角色交叉检验程序验证这些约束,以抑制幻觉和跨轮次的不一致性。经过验证的约束被映射到本体对齐的槽值表示中,从而为 RL 策略优化提供结构化的、关注约束的状态。在多个基准测试中的实验表明,VLK-RL 显著提高了泛化能力和鲁棒性,在长时间跨度任务上超越了强大的单模型基线。
cs.CL / 24 / 2604.23347
Evaluating Large Language Models on Computer Science University Exams in Data Structures
在计算机科学大学数据结构考试中评估大型语言模型
Abstract
We present a comprehensive evaluation of Large Language Models (LLMs) on Computer Science (CS) Data Structure examination questions. Our work introduces a new benchmark dataset comprising exam questions from Tel Aviv University (TAU), curated to assess LLMs' abilities in handling closed and multiple-choice questions. We evaluated the performance of OpenAI's GPT 4o and Anthropic's Claude 3.5, popular LLMs, alongside two smaller LLMs, Mathstral 7B and LLaMA 3 8B, across the TAU exams benchmark. Our findings provide insight into the current capabilities of LLMs in CS education.
Chinese Translation
我们对大型语言模型(LLMs)在计算机科学(CS)数据结构考试问题上的表现进行了全面评估。我们的研究引入了一个新的基准数据集,该数据集包含来自特拉维夫大学(Tel Aviv University, TAU)的考试问题,旨在评估LLMs在处理封闭式和多项选择题方面的能力。我们评估了OpenAI的GPT 4o和Anthropic的Claude 3.5这两种流行的LLMs,以及两个较小的LLMs,Mathstral 7B和LLaMA 3 8B,在TAU考试基准上的表现。我们的研究结果为LLMs在计算机科学教育中的当前能力提供了深入的见解。
cs.CL / 25 / 2604.23351
When Chain-of-Thought Fails, the Solution Hides in the Hidden States
当思维链失败时,解决方案隐藏在隐藏状态中
Abstract
Whether intermediate reasoning is computationally useful or merely explanatory depends on whether chain-of-thought (CoT) tokens contain task-relevant information. We present a mechanistic causal analysis of CoT on GSM8K using activation patching: transferring token-level hidden states from a CoT generation to a direct-answer run for the same question, then measuring the effect on final-answer accuracy. Across models, generating after patching yields substantially higher accuracy than both direct-answer prompting and the original CoT trace, revealing that individual CoT tokens can encode sufficient information to recover the correct answer, even when the original trace is incorrect. This task-relevant information is more prevalent in correct than in incorrect CoT runs and is unevenly distributed across tokens, concentrating in mid-to-late layers and appearing earlier in the reasoning trace. Moreover, patched language tokens such as verbs and entities carry task-solving information that steers generation toward correct reasoning, whereas mathematical tokens encode answer-proximal content that rarely succeeds. Patched outputs are often shorter and yet exceed the accuracy of a full CoT trace, suggesting complete reasoning chains are not always necessary. Together, these findings demonstrate that CoT encodes recoverable, token-level problem-solving information, offering new insight into how reasoning is represented and where it breaks down.
Chinese Translation
中间推理是否在计算上有用或仅仅是解释性的,取决于思维链(Chain-of-Thought, CoT)标记是否包含与任务相关的信息。我们通过激活补丁(activation patching)对GSM8K上的CoT进行了机械因果分析:将来自CoT生成的标记级隐藏状态转移到同一问题的直接回答运行中,然后测量对最终答案准确性的影响。在不同模型中,补丁后的生成准确性显著高于直接回答提示和原始CoT轨迹,揭示了单个CoT标记可以编码足够的信息以恢复正确答案,即使原始轨迹是错误的。这种与任务相关的信息在正确的CoT运行中比在错误的CoT运行中更为普遍,并且在标记之间分布不均,集中在中间到后期层,并且在推理轨迹中较早出现。此外,补丁语言标记(如动词和实体)携带任务解决信息,能够引导生成朝向正确的推理,而数学标记则编码接近答案的内容,但很少成功。补丁输出通常更短,但其准确性超过了完整的CoT轨迹,表明完整的推理链并不总是必要的。综合这些发现表明,CoT编码了可恢复的标记级问题解决信息,为推理如何被表示以及其崩溃的地方提供了新的见解。
cs.CL / 26 / 2604.23356
VeriLLMed: Interactive Visual Debugging of Medical Large Language Models with Knowledge Graphs
VeriLLMed:基于知识图谱的医疗大型语言模型交互式可视化调试
Abstract
Large language models (LLMs) show promise in medical diagnosis, but real-world deployment remains challenging due to high-stakes clinical decisions and imperfect reasoning reliability. As a result, careful inspection of model behavior is essential for assessing whether diagnostic reasoning is reliable and clinically grounded. However, debugging medical LLMs remains difficult. First, developers often lack sufficient medical domain expertise to interpret model errors in clinically meaningful terms. Second, models can fail across a large and diverse set of instances involving different input types, tasks, and reasoning steps, making it challenging for developers to prioritize which errors deserve focused inspection. Third, developers struggle to identify recurring error patterns across cases, as existing debugging practices are largely instance-centric and rely on manual inspection of isolated failures. To address these challenges, we present VeriLLMed, a visual analytics system that integrates external biomedical knowledge to audit and debug medical LLM diagnostic reasoning. VeriLLMed transforms model outputs into comparable reasoning paths, constructs knowledge graph-grounded reference paths, and identifies three recurring classes of diagnosis errors: relation errors, branch errors, and missing errors. Case studies and expert evaluation demonstrate that VeriLLMed helps developers identify clinically implausible reasoning and generate actionable insights that can inform the improvement of medical LLMs.
Chinese Translation
大型语言模型(LLMs)在医疗诊断中展现出良好的前景,但由于高风险的临床决策和推理可靠性不足,实际应用仍然面临挑战。因此,仔细检查模型行为对于评估诊断推理的可靠性和临床基础至关重要。然而,调试医疗LLMs仍然困难。首先,开发者通常缺乏足够的医学领域专业知识,无法以临床有意义的术语解释模型错误。其次,模型可能在涉及不同输入类型、任务和推理步骤的大量多样化实例中失败,这使得开发者难以优先确定哪些错误值得重点检查。第三,开发者在识别案例中的重复错误模式时面临困难,因为现有的调试实践主要以实例为中心,并依赖于对孤立故障的手动检查。为了解决这些挑战,我们提出了VeriLLMed,这是一种可视化分析系统,集成了外部生物医学知识,以审计和调试医疗LLM的诊断推理。VeriLLMed将模型输出转化为可比较的推理路径,构建基于知识图谱的参考路径,并识别出三类重复的诊断错误:关系错误、分支错误和缺失错误。案例研究和专家评估表明,VeriLLMed帮助开发者识别临床上不合理的推理,并生成可操作的见解,从而为改善医疗LLMs提供指导。
cs.CL / 27 / 2604.23412
Overcoming Copyright Barriers in Corpus Distribution Through Non-Reversible Hashing
通过不可逆哈希克服语料分发中的版权障碍
Abstract
While annotated corpora are crucial in the field of natural language processing (NLP), those containing copyrighted material are difficult to exchange among researchers. Yet, such corpora are necessary to fully represent the diversity of data found in the wild in the context of NLP tasks. We tackle this issue by proposing a method to lawfully and publicly share the annotations of copyrighted literary texts. The corpus creator shares the annotations in the clear, along with a non-reversible hashed version of the source material. The corpus user must own the source material and apply the same hash function to their own tokens in order to match them to the shared annotations. Crucially, our method is robust to reasonable divergences in the version of the copyrighted data owned by the user. As an illustration, we present alignment experiments on different editions of novels. Our results show that our method is able to correctly align 98.7% to 99.79% of tokens depending on the novel, provided the user version is sufficiently close to the corpus creator's version. We publicly release novelshare, a Python implementation of our method.
Chinese Translation
虽然带注释的语料在自然语言处理(NLP)领域至关重要,但包含版权材料的语料在研究人员之间的交换中存在困难。然而,这类语料在NLP任务的背景下,完全代表野外数据的多样性是必要的。我们通过提出一种合法且公开分享版权文学文本注释的方法来解决这一问题。语料创建者以明文形式分享注释,并提供源材料的不可逆哈希版本。语料用户必须拥有源材料,并对其自身的标记应用相同的哈希函数,以便将其与共享的注释进行匹配。至关重要的是,我们的方法对用户所拥有的版权数据版本的合理差异具有鲁棒性。作为说明,我们展示了不同版本小说的对齐实验。我们的结果表明,依据小说的不同,我们的方法能够正确对齐98.7%至99.79%的标记,前提是用户版本与语料创建者的版本足够接近。我们公开发布了 novelshare,这是我们方法的 Python 实现。
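The hash-and-match workflow in the abstract above can be sketched with the Python standard library. This is not the novelshare API, which the abstract does not describe. The helpers `hash_token` and `align_annotations`, the salt, the lowercasing, and the use of `difflib.SequenceMatcher` to tolerate edition differences are all assumptions made for illustration.

```python
import hashlib
from difflib import SequenceMatcher

def hash_token(token, salt="demo-salt"):
    """One-way hash of a token (hypothetical scheme; truncated for readability)."""
    return hashlib.sha256((salt + token.lower()).encode("utf-8")).hexdigest()[:16]

def align_annotations(shared_hashes, shared_labels, user_tokens):
    """Copy labels onto the user's tokens wherever hashed runs match."""
    user_hashes = [hash_token(t) for t in user_tokens]
    matcher = SequenceMatcher(a=shared_hashes, b=user_hashes, autojunk=False)
    labels = [None] * len(user_tokens)  # None marks tokens with no shared annotation
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            labels[block.b + k] = shared_labels[block.a + k]
    return labels

# Corpus creator: publishes hashes and labels, never the text itself.
creator_tokens = ["Emma", "Woodhouse", "was", "handsome", "and", "clever"]
creator_labels = ["B-PER", "I-PER", "O", "O", "O", "O"]
shared_hashes = [hash_token(t) for t in creator_tokens]

# Corpus user: owns a slightly different edition with one extra word.
user_tokens = ["Emma", "Woodhouse", "was", "very", "handsome", "and", "clever"]
aligned = align_annotations(shared_hashes, creator_labels, user_tokens)
```

In this toy case the extra word "very" stays unlabeled and every other token receives its annotation. Note that a plain per-token hash of common words can be guessed by hashing a dictionary, so a real scheme would need more care than this sketch provides.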
cs.CL / 28 / 2604.23413
Beyond Local vs. External: A Game-Theoretic Framework for Trustworthy Knowledge Acquisition
超越本地与外部:一个基于博弈论的可信知识获取框架
Abstract
Cloud-hosted Large Language Models (LLMs) offer unmatched reasoning capabilities and dynamic knowledge, yet submitting raw queries to these external services risks exposing sensitive user intent. Conversely, relying exclusively on trusted local models preserves privacy but often compromises answer quality due to limited parameter scale and knowledge. To resolve this dilemma, we propose Game-theoretic Trustworthy Knowledge Acquisition (GTKA), a framework that formulates the trade-off between knowledge utility and privacy as a strategic game. GTKA consists of three components: (i) a privacy-aware sub-query generator that decomposes sensitive intent into generalized, low-risk fragments; (ii) an adversarial reconstruction attacker that attempts to infer the original query from these fragments, providing adaptive leakage signals; and (iii) a trusted local integrator that synthesizes external responses within a secure boundary. By training the generator and attacker in an alternating adversarial manner, GTKA optimizes the sub-query generation policy to maximize knowledge acquisition accuracy while minimizing the reconstructability of the original sensitive intent. To validate our approach, we construct two sensitive-domain benchmarks in the biomedical and legal fields. Extensive experiments demonstrate that GTKA significantly reduces intent leakage compared to state-of-the-art baselines while maintaining high-fidelity answer quality.
Chinese Translation
云托管的大型语言模型(LLMs)提供了无与伦比的推理能力和动态知识,但向这些外部服务提交原始查询可能会暴露敏感的用户意图。相反,完全依赖可信的本地模型可以保护隐私,但由于参数规模和知识的限制,往往会影响答案的质量。为了解决这一困境,我们提出了基于博弈论的可信知识获取(Game-theoretic Trustworthy Knowledge Acquisition, GTKA)框架,该框架将知识效用与隐私之间的权衡形式化为一个战略博弈。GTKA包含三个组成部分:(i)一个关注隐私的子查询生成器,将敏感意图分解为一般化的低风险片段;(ii)一个对抗性重构攻击者,试图从这些片段推断出原始查询,提供自适应的泄漏信号;(iii)一个可信的本地集成器,在安全边界内综合外部响应。通过以交替对抗的方式训练生成器和攻击者,GTKA优化子查询生成策略,以最大化知识获取的准确性,同时最小化原始敏感意图的可重构性。为了验证我们的方法,我们在生物医学和法律领域构建了两个敏感领域基准。大量实验表明,与最先进的基线相比,GTKA显著减少了意图泄漏,同时保持了高保真度的答案质量。
cs.CL / 29 / 2604.23443
Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective
重新审视视觉问答中的贪婪解码:一种校准视角
Abstract
Stochastic sampling strategies are widely adopted in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited in Multimodal LLMs (MLLMs) without task-specific justification. However, we contend that stochastic decoding can be suboptimal for Visual Question Answering (VQA). VQA is a closed-ended task with head-heavy answer distributions where uncertainty is usually epistemic, arising from missing or ambiguous visual evidence rather than plausible continuations. In this work, we provide a theoretical formalization of the relationship between model calibration and predictive accuracy, and derive sufficient conditions for greedy decoding optimality. Extensive experiments provide empirical evidence for the superiority of greedy decoding over stochastic sampling across multiple benchmarks. Furthermore, we propose Greedy Decoding for Reasoning Models, which outperforms both stochastic sampling and standard greedy decoding in multimodal reasoning scenarios. Overall, our results caution against naively inheriting LLM decoding heuristics in MLLMs and demonstrate that greedy decoding can be an efficient yet strong default for VQA.
Chinese Translation
随机采样策略在大型语言模型(LLMs)中被广泛采用,以平衡输出的一致性和多样性。这些启发式方法通常在多模态大型语言模型(MLLMs)中继承,但缺乏特定任务的合理性。然而,我们认为随机解码对于视觉问答(VQA)可能并非最佳选择。VQA 是一个封闭式任务,其答案分布呈现头重脚轻的特征,且不确定性通常是认识论的,源于缺失或模糊的视觉证据,而非合理的延续。在本研究中,我们提供了模型校准与预测准确性之间关系的理论形式化,并推导出贪婪解码最优性的充分条件。大量实验提供了实证证据,支持贪婪解码在多个基准测试中优于随机采样。此外,我们提出了针对推理模型的贪婪解码方法,在多模态推理场景中超越了随机采样和标准贪婪解码。总体而言,我们的结果警示在多模态大型语言模型中盲目继承大型语言模型的解码启发式方法,并表明贪婪解码可以成为视觉问答的高效且强大的默认选择。
cs.CL / 30 / 2604.23445
AI Safety Training Can be Clinically Harmful
人工智能安全培训可能对临床产生危害
Abstract
Large language models are being deployed as mental health support agents at scale, yet only 16% of LLM-based chatbot interventions have undergone rigorous clinical efficacy testing, and simulations reveal psychological deterioration in over one-third of cases. We evaluate four generative models on 250 Prolonged Exposure (PE) therapy scenarios and 146 CBT cognitive restructuring exercises (plus 29 severity-escalated variants), scored by a three-judge LLM panel. All models scored near-perfectly on surface acknowledgment (~0.91-1.00) while therapeutic appropriateness collapsed to 0.22-0.33 at the highest severity for three of four models, with protocol fidelity reaching zero for two. Under CBT severity escalation, one model's task completeness dropped from 92% to 71% while the frontier model's safety-interference score fell from 0.99 to 0.61. We identify a systematic, modality-spanning failure: RLHF safety alignment disrupts the therapeutic mechanism of action by grounding patients during imaginal exposure, offering false reassurance, inserting crisis resources into controlled exercises, and refusing to challenge distorted cognitions mentioning self-harm in PE; and through task abandonment or safety-preamble insertion during CBT cognitive restructuring. These findings motivate a five-axis evaluation framework (protocol fidelity, hallucination risk, behavioral consistency, crisis safety, demographic robustness), mapped onto FDA SaMD and EU AI Act requirements. We argue that no AI mental health system should proceed to deployment without passing multi-axis evaluation across all five dimensions.
Chinese Translation
大型语言模型正在大规模部署为心理健康支持代理,但仅有16%的基于LLM的聊天机器人干预经过严格的临床有效性测试,模拟结果显示超过三分之一的案例存在心理恶化。我们对250个延续暴露(Prolonged Exposure, PE)治疗场景和146个认知行为疗法(Cognitive Behavioral Therapy, CBT)认知重构练习(加上29个严重程度升级变体)评估了四个生成模型,评分由三名评审的LLM小组进行。所有模型在表面承认方面得分接近满分(约0.91-1.00),而在最高严重程度下,四个模型中有三个的治疗适宜性评分降至0.22-0.33,两个模型的协议遵循度降至零。在CBT严重程度升级下,一个模型的任务完成率从92%降至71%,而前沿模型的安全干扰评分从0.99降至0.61。我们识别出一种系统性的、跨模式的失败:基于强化学习的安全对齐通过在想象暴露期间将患者固定,提供虚假的安慰,将危机资源插入受控练习中,以及拒绝挑战提及自残的扭曲认知,从而破坏了治疗机制;在CBT认知重构过程中则通过任务放弃或安全前言插入来影响治疗。这些发现促使我们提出一个五轴评估框架(协议遵循度、幻觉风险、行为一致性、危机安全、人口统计稳健性),与FDA软件作为医疗设备(SaMD)和欧盟人工智能法案的要求相对应。我们认为,任何人工智能心理健康系统在未通过五个维度的多轴评估之前,不应进行部署。
cs.CL / 31 / 2604.23458
A Benchmark Suite of Reddit-Derived Datasets for Mental Health Detection
基于Reddit的心理健康检测数据集基准套件
Abstract
The growing availability of online support groups has opened up new windows to study mental health through natural language processing (NLP). However, this research is hindered by a lack of high-quality, well-validated datasets. Existing studies tend to build task-specific corpora without consolidating them into widely available resources, which makes reproducibility and cross-task comparison difficult. In this paper, we present a uniform benchmark set of four Reddit-based datasets for disjoint but complementary tasks: (i) detection of suicidal ideation, (ii) binary general mental disorder detection, (iii) bipolar disorder detection, and (iv) multi-class mental disorder classification. All datasets were built through careful linguistic inspection, well-defined annotation guidelines, and human verification. Inter-annotator agreement always exceeded the baseline agreement score of 0.8, supporting the trustworthiness of the labels. Results from previous work with both transformer and contextualized recurrent models show strong performance on these tasks (F1 ~ 93-99%), further validating the usefulness of the datasets. By combining these resources, we establish a unifying foundation for reproducible mental health NLP studies with the ability to carry out cross-task benchmarking, multi-task learning, and fair model comparison. The presented benchmark suite provides the research community with an easy-to-access and varied resource for advancing computational approaches toward mental health research.
Chinese Translation
在线支持小组的日益普及为通过自然语言处理(NLP)研究心理健康开辟了新的窗口。然而,高质量、经过良好验证的数据集的缺乏阻碍了这一研究领域的发展。现有研究往往倾向于构建特定任务的语料库,而未将其整合为广泛可用的资源,这使得可重复性以及跨任务比较变得具有挑战性。本文提出了一套统一的基准集,包括四个基于Reddit的数据集,针对不相交但互补的任务:(i) 自杀意念检测,(ii) 二元一般心理障碍检测,(iii) 躁郁症检测,以及 (iv) 多类别心理障碍分类。所有数据集均经过细致的语言检查、明确的标注指南和人工判断验证建立。标注者间一致性指标始终超过基线一致性得分0.8,确保了标签的可信度。先前工作的证据表明,在变换器(transformer)和上下文递归模型(contextualized recurrent models)上表现良好,这些模型在任务上获得了出色的表现(F1 ~ 93-99%),进一步验证了数据集的实用性。通过整合这些资源,我们为可重复的心理健康NLP研究建立了一个统一的基础,能够进行跨任务基准测试、多任务学习和公平模型比较。所呈现的基准套件为研究社区提供了一个易于访问且多样化的资源,以推动计算方法在心理健康研究中的应用。
cs.CL / 32 / 2604.23478
JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems
JudgeSense:用于评估LLM作为法官系统中提示敏感性的基准
Abstract
Large language models are increasingly deployed as automated judges for evaluating other models, yet the stability of their verdicts under semantically equivalent prompt paraphrases remains unmeasured. We introduce JudgeSense, a framework and benchmark for quantifying this property via the Judge Sensitivity Score (JSS), defined as the fraction of paraphrase pairs on which a judge returns an identical decision. Evaluating nine judge models on 494 validated paraphrase pairs, we find that coherence is the only task where judges meaningfully differ, with JSS ranging from 0.389 to 0.992. On factuality, all judges cluster near JSS about 0.63, driven by a polarity-inverted prompt artifact; after correction, factuality JSS rises to about 0.9. Pairwise tasks (preference and relevance) exhibit degenerate always-A behavior in 8 of 9 judges, indicating strong position bias. Model scale does not predict consistency. We release code, decision logs, and a validated paraphrase dataset to support standardized JSS reporting.
Chinese Translation
大型语言模型越来越多地被用作自动法官,以评估其他模型,但在语义等价的提示释义下,其裁决的稳定性尚未得到测量。我们引入了JudgeSense,一个通过法官敏感性得分(Judge Sensitivity Score, JSS)量化这一属性的框架和基准,JSS被定义为法官返回相同决策的释义对的比例。在对494个经过验证的释义对评估九个法官模型时,我们发现,连贯性是法官表现出显著差异的唯一任务,JSS范围从0.389到0.992。在事实性方面,所有法官的JSS聚集在约0.63附近,这受到极性反转提示伪影的影响;经过修正后,事实性JSS上升到约0.9。成对任务(偏好和相关性)在9个法官中有8个表现出退化的始终-A行为,表明存在强烈的位置偏见。模型规模并不能预测一致性。我们发布了代码、决策日志和经过验证的释义数据集,以支持标准化的JSS报告。
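The Judge Sensitivity Score defined in the abstract above is the fraction of paraphrase pairs on which a judge returns an identical decision. A minimal sketch of that computation follows. The input layout, a list of `(decision_on_prompt, decision_on_paraphrase)` tuples, is an assumption for illustration and not the format of the released decision logs.

```python
from typing import Sequence, Tuple

def judge_sensitivity_score(decisions: Sequence[Tuple[str, str]]) -> float:
    """Fraction of paraphrase pairs on which the judge gave the same decision."""
    if not decisions:
        raise ValueError("need at least one paraphrase pair")
    same = sum(1 for first, second in decisions if first == second)
    return same / len(decisions)

# Hypothetical decision log: (verdict on prompt, verdict on its paraphrase).
pairs = [("A", "A"), ("A", "B"), ("B", "B"), ("A", "A")]
jss = judge_sensitivity_score(pairs)  # 3 of 4 pairs agree, so jss is 0.75

# A judge that always answers "A" is perfectly consistent under this metric.
always_a = [("A", "A")] * 4
jss_always_a = judge_sensitivity_score(always_a)  # 1.0
```

The second case shows why a high JSS alone does not imply a good judge: a judge with degenerate always-A behavior scores 1.0, so position bias has to be checked separately, as the abstract does for the pairwise tasks.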
cs.CL / 33 / 2604.23486
Your Students Don't Use LLMs Like You Wish They Did
你的学生并没有像你希望的那样使用大型语言模型
Abstract
Educational NLP systems are typically evaluated using engagement metrics and satisfaction surveys, which are at best a proxy for meeting pedagogical goals. We introduce six computational metrics for automated evaluation of pedagogical alignment in student-AI dialogue. We validate our metrics through analysis of 12,650 messages across 500 conversations from four courses. Using our metrics, we identify a fundamental misalignment: educators design conversational tutors for sustained learning dialogue, but students mainly use them for answer-extraction. Deployment context is the strongest predictor of usage patterns, outweighing student preference or system design: when AI tools are optional, usage concentrates around deadlines; when integrated into course structure, students ask for solutions to verbatim assignment questions. Whole-dialogue evaluation misses these turn-by-turn patterns. Our metrics will enable researchers building educational dialogue systems to measure whether they are achieving their pedagogical goals.
Chinese Translation
教育自然语言处理(NLP)系统通常通过参与度指标和满意度调查进行评估,这些评估充其量只是满足教学目标的代理工具。我们提出了六种计算指标,用于自动评估学生与人工智能对话中的教学一致性。我们通过分析来自四门课程的500个对话中的12,650条消息来验证我们的指标。利用这些指标,我们识别出一个根本性的错位:教育工作者设计对话辅导工具以促进持续的学习对话,但学生主要用于提取答案。部署背景是使用模式的最强预测因素,超过学生偏好或系统设计:当人工智能工具是可选时,使用集中在截止日期附近;当其融入课程结构时,学生会询问逐字作业问题的解决方案。整体对话评估忽略了这些逐轮模式。我们的指标将使研究人员在构建教育对话系统时能够衡量其是否实现了教学目标。
cs.CL / 34 / 2604.23493
K-SENSE: A Knowledge-Guided Self-Augmented Encoder for Neuro-Semantic Evaluation of Mental Health Conditions on Social Media
K-SENSE:一种知识引导的自增强编码器,用于社交媒体上心理健康状况的神经语义评估
Abstract
Early detection of mental health conditions, particularly stress and depression, from social media text remains a challenging open problem in computational psychiatry and natural language processing. Automated systems must contend with figurative language, implicit emotional expression, and the high noise inherent in user-generated content. Existing approaches either leverage external commonsense knowledge to model mental states explicitly, or apply self-augmentation and contrastive training to improve generalization, but seldom do both in a principled, unified framework. We propose K-SENSE (Knowledge-guided Self-augmented Encoder for Neuro-Semantic Evaluation of Mental Health), a framework that jointly exploits external psychological reasoning and internal representation robustness. K-SENSE adopts a three-stage encoding pipeline: (1) inferential commonsense knowledge is extracted from the COMET model across five mental state dimensions; (2) a semantic anchor is constructed by combining hidden representations from two parallel encoding streams, projected into a shared space before fusion; and (3) a supervised contrastive learning objective aligns same-class representations while encouraging the attention mechanism to suppress irrelevant knowledge noise. We evaluate K-SENSE on Dreaddit (stress detection) and Depression_Mixed (depression detection), achieving mean F1-scores of 86.1 (0.6%) and 94.3 (0.8%), respectively, over five independent runs. These represent improvements of approximately 2.6 and 1.5 percentage points over the strongest prior baselines. Ablation experiments confirm the contribution of each architectural component, including the temporal knowledge integration strategy and the choice to keep the knowledge encoder frozen during fine-tuning.
Chinese Translation
从社交媒体文本中早期检测心理健康状况,特别是压力和抑郁,仍然是计算精神病学和自然语言处理中的一个具有挑战性的开放问题。自动化系统必须应对比喻语言、隐含情感表达以及用户生成内容中固有的高噪声。现有的方法要么利用外部常识知识显式建模心理状态,要么应用自增强和对比训练来提高泛化能力,但很少在一个原则性统一的框架中同时做到这两点。我们提出了K-SENSE(知识引导的自增强编码器,用于心理健康的神经语义评估),这是一个共同利用外部心理推理和内部表示鲁棒性的框架。K-SENSE采用三阶段编码管道:(1)从COMET模型中提取推理常识知识,涵盖五个心理状态维度;(2)通过结合来自两个并行编码流的隐藏表示构建语义锚点,在融合前投影到共享空间;(3)一个监督对比学习目标对齐同类表示,同时鼓励注意机制抑制无关知识噪声。我们在Dreaddit(压力检测)和Depression_Mixed(抑郁检测)上评估K-SENSE,分别在五次独立运行中实现了86.1(0.6%)和94.3(0.8%)的平均F1分数。这分别比最强的先前基线提高了约2.6和1.5个百分点。消融实验确认了每个架构组件的贡献,包括时间知识整合策略以及在微调过程中保持知识编码器冻结的选择。
cs.CL / 35 / 2604.23530
MTRouter: Cost-Aware Multi-Turn LLM Routing with History-Model Joint Embeddings
MTRouter:基于成本意识的多轮大语言模型路由与历史-模型联合嵌入
Abstract
Multi-turn, long-horizon tasks are increasingly common for large language models (LLMs), but solving them typically requires many sequential model invocations, accumulating substantial inference costs. Here, we study cost-aware multi-turn LLM routing: selecting which model to invoke at each turn from a model pool, given a fixed cost budget. We propose MTRouter, which encodes the interaction history and candidate models into joint history-model embeddings, and learns an outcome estimator from logged trajectories to predict turn-level model utility. Experiments show that MTRouter improves the performance-cost trade-off: on ScienceWorld, it surpasses GPT-5 while reducing total cost by 58.7%; on Humanity's Last Exam (HLE), it achieves competitive accuracy while reducing total cost by 43.4% relative to GPT-5, and these gains even carry over to held-out tasks. Further analyses reveal several mechanisms underlying its effectiveness: relative to prior multi-turn routers, MTRouter makes fewer model switches, is more tolerant to transient errors, and exhibits emergent specialization across models. Code: https://github.com/ZhangYiqun018/MTRouter
Chinese Translation
多轮长时间任务在大型语言模型(LLMs)中越来越普遍,但解决这些任务通常需要多次顺序调用模型,从而累积可观的推理成本。在此,我们研究了基于成本意识的多轮LLM路由:在给定固定成本预算的情况下,从模型池中选择在每一轮调用哪个模型。我们提出了MTRouter,它将交互历史和候选模型编码为联合历史-模型嵌入,并从记录的轨迹中学习结果估计器,以预测每轮模型的效用。实验表明,MTRouter改善了性能与成本的权衡:在ScienceWorld上,它超越了GPT-5,同时将总成本降低了58.7%;在Humanity's Last Exam (HLE)上,它在降低总成本43.4%的情况下实现了与GPT-5相当的准确性,并且这些收益甚至延续到未见过的任务。进一步的分析揭示了其有效性的几个机制:与之前的多轮路由器相比,MTRouter进行的模型切换更少,对瞬时错误的容忍度更高,并且在模型之间表现出新兴的专业化。代码:https://github.com/ZhangYiqun018/MTRouter
cs.CL / 36 / 2604.23543
Pref-CTRL: Preference Driven LLM Alignment using Representation Editing
Pref-CTRL:基于偏好的大语言模型对齐方法通过表示编辑
Abstract
Test-time alignment methods offer a promising alternative to fine-tuning by steering the outputs of large language models (LLMs) at inference time with lightweight interventions on their internal representations. Recently, a prominent and effective approach, RE-Control (Kong et al., 2024), has proposed leveraging an external value function trained over the LLM's hidden states to guide generation via gradient-based editing. While effective, this method overlooks a key characteristic of alignment tasks, i.e. that they are typically formulated as learning from human preferences between candidate responses. To address this, in this paper we propose a novel preference-based training framework, Pref-CTRL, that uses a multi-objective value function to better reflect the structure of preference data. Our approach has outperformed RE-Control on two benchmark datasets and showed greater generalization on out-of-domain datasets. Our source code is available at https://github.com/UTS-nlPUG/pref-ctrl.
Chinese Translation
测试时对齐方法通过对大型语言模型(LLMs)内部表示进行轻量级干预,在推理时引导其输出,提供了一种有前景的替代方案。最近,一种显著且有效的方法,RE-Control(Kong et al., 2024),提出利用在LLM隐藏状态上训练的外部价值函数,通过基于梯度的编辑来指导生成。尽管该方法有效,但忽视了对齐任务的一个关键特征,即这些任务通常被表述为从人类对候选响应的偏好中学习。为了解决这一问题,本文提出了一种新颖的基于偏好的训练框架Pref-CTRL,使用多目标价值函数更好地反映偏好数据的结构。我们的研究在两个基准数据集上超越了RE-Control,并在领域外数据集上显示出更好的泛化能力。我们的源代码可在 https://github.com/UTS-nlPUG/pref-ctrl 获取。
cs.CL / 37 / 2604.23577
RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization
RouteNLP:具有符合级联和蒸馏协同优化的闭环大语言模型路由
Abstract
Serving diverse NLP workloads with large language models is costly: at one enterprise partner, inference costs exceeded $200K/month despite over 70% of queries being routine tasks well within the capability of smaller models. We present RouteNLP, a closed-loop framework that routes queries across a tiered model portfolio to minimize cost while satisfying per-task quality constraints. The framework integrates three components: a difficulty-aware router with shared task-conditioned representations trained on preference data and quality signals; confidence-calibrated cascading that uses conformal prediction for distribution-free threshold initialization; and a distillation-routing co-optimization loop that clusters escalation failures, applies targeted knowledge distillation to cheaper models, and automatically retrains the router, yielding over twice the cost improvement of untargeted distillation. In an 8-week pilot deployment processing ~5K queries/day at an enterprise customer-service division, RouteNLP reduced inference costs by 58% while maintaining 91% response acceptance and reducing p99 latency from 1,847 ms to 387 ms. On a six-task benchmark spanning finance, customer service, and legal domains, the framework achieves 40-85% cost reduction while retaining 96-100% quality on structured tasks and 96-98% on generation tasks, with human evaluation confirming that 74.5% of routed generation outputs match or exceed frontier-model quality.
Chinese Translation
使用大型语言模型处理多样化的自然语言处理(NLP)工作负载成本高昂:在一家企业合作伙伴处,推理成本超过每月20万美元,尽管超过70%的查询是完全在较小模型能力范围内的常规任务。我们提出了RouteNLP,一个闭环框架,通过在分层模型组合中路由查询,以最小化成本,同时满足每个任务的质量约束。该框架集成了三个组件:一个基于难度的路由器,利用在偏好数据和质量信号上训练的共享任务条件表示;使用符合预测进行无分布阈值初始化的信心校准级联;以及一个蒸馏-路由协同优化循环,该循环聚类升级失败,针对性地对更便宜的模型应用知识蒸馏,并自动重新训练路由器,从而实现超过无针对性蒸馏的两倍成本改善。在一个为期8周的试点部署中,该框架处理了约5000个查询/天,在一家企业客户服务部门,RouteNLP将推理成本降低了58%,同时保持91%的响应接受率,并将p99延迟从1847毫秒降低到387毫秒。在一个涵盖金融、客户服务和法律领域的六个任务基准上,该框架实现了40-85%的成本降低,同时在结构化任务上保持96-100%的质量,在生成任务上保持96-98%的质量,人工评估确认74.5%的路由生成输出与前沿模型质量相匹配或超过。
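The abstract above mentions conformal prediction for distribution-free threshold initialization in the cascade. The sketch below shows standard split conformal prediction applied to an escalation decision. How RouteNLP actually derives and uses its thresholds is not stated in the abstract, so the escalation rule (escalate unless the prediction set is a single label), the calibration scores, and the label names are assumptions for illustration.

```python
import math

def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[k - 1]

def should_escalate(probs, q_hat):
    """Escalate to a larger model unless exactly one label passes the threshold."""
    prediction_set = [label for label, p in probs.items() if 1.0 - p <= q_hat]
    return len(prediction_set) != 1

# Calibration scores: 1 - (small model's probability for the true label), hypothetical values.
calibration = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.60]
q_hat = conformal_quantile(calibration, alpha=0.2)  # 9th smallest of 10 scores: 0.45

confident = {"refund": 0.9, "billing": 0.1}
uncertain = {"refund": 0.5, "billing": 0.5}
escalate_confident = should_escalate(confident, q_hat)  # False: only "refund" passes
escalate_uncertain = should_escalate(uncertain, q_hat)  # True: no label passes
```

The threshold comes from a rank statistic on held-out scores rather than from a fitted distribution, which is what makes the coverage guarantee distribution-free under exchangeability.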
cs.CL / 38 / 2604.23578
LLMs Reading the Rhythms of Daily Life: Aligned Understanding for Behavior Prediction and Generation
大型语言模型解读日常生活的节奏:行为预测与生成的对齐理解
Abstract
Human daily behavior unfolds as complex sequences shaped by intentions, preferences, and context. Effectively modeling these behaviors is crucial for intelligent systems such as personal assistants and recommendation engines. While recent advances in deep learning and behavior pre-training have improved behavior prediction, key challenges remain--particularly in handling long-tail behaviors, enhancing interpretability, and supporting multiple tasks within a unified framework. Large language models (LLMs) offer a promising direction due to their semantic richness, strong interpretability, and generative capabilities. However, the structural and modal differences between behavioral data and natural language limit the direct applicability of LLMs. To address this gap, we propose Behavior Understanding Alignment (BUA), a novel framework that integrates LLMs into human behavior modeling through a structured curriculum learning process. BUA employs sequence embeddings from pretrained behavior models as alignment anchors and guides the LLM through a three-stage curriculum, while a multi-round dialogue setting introduces prediction and generation capabilities. Experiments on two real-world datasets demonstrate that BUA significantly outperforms existing methods in both tasks, highlighting its effectiveness and flexibility in applying LLMs to complex human behavior modeling.
Chinese Translation
人类的日常行为以复杂的序列展开,这些序列受到意图、偏好和上下文的影响。有效建模这些行为对于智能系统(如个人助手和推荐引擎)至关重要。尽管深度学习和行为预训练的最新进展改善了行为预测,但仍面临关键挑战,特别是在处理长尾行为、增强可解释性以及在统一框架内支持多任务方面。大型语言模型(LLMs)因其语义丰富性、强可解释性和生成能力而展现出良好的发展方向。然而,行为数据与自然语言之间的结构和模态差异限制了LLMs的直接应用。为了解决这一问题,我们提出了行为理解对齐(Behavior Understanding Alignment, BUA),这是一个新颖的框架,通过结构化的课程学习过程将LLMs整合到人类行为建模中。BUA采用来自预训练行为模型的序列嵌入作为对齐锚点,并通过三阶段课程引导LLM,而多轮对话设置则引入了预测和生成能力。在两个真实世界数据集上的实验表明,BUA在这两项任务中显著优于现有方法,突显了其在复杂人类行为建模中应用LLMs的有效性和灵活性。
cs.CL / 39 / 2604.23585
ComplianceNLP: Knowledge-Graph-Augmented RAG for Multi-Framework Regulatory Gap Detection
ComplianceNLP:基于知识图谱增强的多框架监管差距检测的RAG
Abstract
Financial institutions must track over 60,000 regulatory events annually, overwhelming manual compliance teams; the industry has paid over USD 300 billion in fines and settlements since the 2008 financial crisis. We present ComplianceNLP, an end-to-end system that automatically monitors regulatory changes, extracts structured obligations, and identifies compliance gaps against institutional policies. The system integrates three components: (1) a knowledge-graph-augmented RAG pipeline grounding generations in a regulatory knowledge graph of 12,847 provisions across SEC, MiFID II, and Basel III; (2) multi-task obligation extraction combining NER, deontic classification, and cross-reference resolution over a shared LEGAL-BERT encoder; and (3) compliance gap analysis that maps obligations to internal policies with severity-aware scoring. On our benchmark, ComplianceNLP achieves 87.7 F1 on gap detection, outperforming GPT-4o+RAG by +3.5 F1, with 94.2% grounding accuracy ($r=0.83$ vs. human judgments) and 83.4 F1 under realistic end-to-end error propagation. Ablations show that knowledge-graph re-ranking contributes the largest marginal gain (+4.6 F1), confirming that structural regulatory knowledge is critical for cross-reference-heavy tasks. Domain-specific knowledge distillation (70B $\to$ 8B) combined with Medusa speculative decoding yields $2.8\times$ inference speedup; regulatory text's low entropy ($H=2.31$ bits vs. $3.87$ general text) produces 91.3% draft-token acceptance rates. In four months of parallel-run deployment processing 9,847 updates at a financial institution, the system achieved 96.0% estimated recall and 90.7% precision, with a $3.1\times$ sustained analyst efficiency gain. We report deployment lessons on trust calibration, GRC integration, and distributional shift monitoring for regulated-domain NLP.
Chinese Translation
金融机构每年必须跟踪超过60,000个监管事件,这使得人工合规团队不堪重负;自2008年金融危机以来,行业已支付超过3000亿美元的罚款和和解金。我们提出了ComplianceNLP,这是一个端到端系统,能够自动监测监管变化,提取结构化义务,并识别与机构政策的合规差距。该系统集成了三个组件:(1)基于知识图谱增强的RAG管道,将生成内容与跨越SEC、MiFID II和巴塞尔协议III的12,847条规定的监管知识图谱相结合;(2)多任务义务提取,结合命名实体识别(NER)、义务分类和跨引用解析,通过共享的LEGAL-BERT编码器进行;(3)合规差距分析,将义务映射到内部政策,并进行严重性感知评分。在我们的基准测试中,ComplianceNLP在差距检测上达到了87.7的F1分数,超越了GPT-4o+RAG,提升了3.5的F1分数,具有94.2%的基础准确率($r=0.83$与人工判断相比)和在现实端到端误差传播下的83.4的F1分数。消融实验表明,知识图谱重新排序贡献了最大的边际增益(+4.6 F1),确认了结构性监管知识对于重跨引用任务的重要性。结合领域特定知识蒸馏(70B $\to$ 8B)和Medusa推测解码实现了$2.8\times$的推理加速;监管文本的低熵($H=2.31$比特对比$3.87$的一般文本)产生了91.3%的草稿令牌接受率。在为期四个月的并行运行部署中,处理了9,847个更新,该系统在一家金融机构中实现了96.0%的估计召回率和90.7%的精确度,分析师效率持续提高了$3.1\times$。我们报告了在受监管领域的自然语言处理中的信任校准、GRC集成和分布变化监测的部署经验。
cs.CL / 40 / 2604.23589
XITE: Cross-lingual Interpolation for Transfer using Embeddings
XITE:基于嵌入的跨语言插值转移方法
Abstract
Facilitating cross-lingual transfer in multilingual language models remains a critical challenge. Towards this goal, we propose an embedding-based data augmentation technique called XITE. We start with unlabeled text from a low-resource target language, identify an English counterpart in a task-specific training corpus using embedding-based similarities and adopt its label. Next, we perform a simple interpolation of the source and target embeddings to create synthetic data for task-specific fine-tuning. Projecting the target text into a language-rich subspace using linear discriminant analysis (LDA), prior to interpolation, further boosts performance. Our cross-lingual embedding-based augmentation technique XITE yields significant improvements of up to 35.91% for sentiment analysis and up to 81.16% for natural language inference, using XLM-R, for a diverse set of target languages including Korean, Arabic, Urdu and Hindi. Apart from boosting cross-lingual transfer, adaptation using XITE also safeguards against forgetting and maintains task performance on the high-resource language.
Chinese Translation
在多语言模型中促进跨语言转移仍然是一个关键挑战。为此,我们提出了一种基于嵌入的数据增强技术,称为XITE。我们从低资源目标语言的未标记文本开始,利用基于嵌入的相似性在特定任务的训练语料库中识别其英语对应项,并采用其标签。接下来,我们对源嵌入和目标嵌入进行简单插值,以创建用于特定任务微调的合成数据。在插值之前,使用线性判别分析(LDA)将目标文本投影到语言丰富的子空间中,进一步提升了性能。我们的基于嵌入的跨语言增强技术XITE在情感分析中实现了高达35.91%的显著提升,在自然语言推理中实现了高达81.16%的提升,使用XLM-R,适用于包括韩语、阿拉伯语、乌尔都语和印地语在内的多种目标语言。除了促进跨语言转移外,使用XITE的适应性还防止了遗忘,并保持了在高资源语言上的任务性能。
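The XITE abstract above describes three steps: match each unlabeled target-language text to its most similar English training example, adopt that example's label, and interpolate the two embeddings. The sketch below illustrates that flow in plain Python under stated assumptions: the function names, the cosine-similarity matching, and the mixing weight `lam` are hypothetical choices for illustration, not the authors' implementation, and the LDA projection step is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_source(target_vec, source_vecs):
    """Index of the English training embedding most similar to the target embedding."""
    return max(range(len(source_vecs)), key=lambda i: cosine(target_vec, source_vecs[i]))

def interpolate(source_vec, target_vec, lam=0.5):
    """Convex combination lam * source + (1 - lam) * target.
    lam is a hypothetical mixing weight, not a value taken from the paper."""
    return [lam * s + (1.0 - lam) * t for s, t in zip(source_vec, target_vec)]

def augment(target_vecs, source_vecs, source_labels, lam=0.5):
    """Build synthetic (embedding, label) pairs for task-specific fine-tuning."""
    synthetic = []
    for t in target_vecs:
        i = nearest_source(t, source_vecs)            # English counterpart
        mixed = interpolate(source_vecs[i], t, lam)   # synthetic embedding
        synthetic.append((mixed, source_labels[i]))   # adopt the source label
    return synthetic
```

In practice the embeddings would come from a multilingual encoder such as XLM-R; toy two-dimensional vectors are enough to exercise the logic.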
cs.CL / 41 / 2604.23600
Personality Shapes Gender Bias in Persona-Conditioned LLM Narratives Across English and Hindi: An Empirical Investigation
个性塑造性别偏见:对英语和印地语中基于角色的LLM叙事的实证研究
Abstract
Large Language Models (LLMs) are increasingly deployed in persona-driven applications such as education, customer service, and social platforms, where models are prompted to adopt specific personas when interacting with users. While persona conditioning can improve user experience and engagement, it also raises concerns about how personality cues may interact with gender biases and stereotypes. In this work, we present a controlled study of persona-conditioned story generation in English and Hindi, where each story portrays a working professional in India producing context-specific artifacts (e.g., lesson plans, reports, letters) under systematically varied persona gender, occupational role, and personality traits from the HEXACO and Dark Triad frameworks. Across 23,400 generated stories from six state-of-the-art LLMs, we find that personality traits are significantly associated with both the magnitude and direction of gender bias. In particular, Dark Triad personality traits are consistently associated with higher gender-stereotypical representations compared to socially desirable HEXACO traits, though these associations vary across models and languages. Our findings demonstrate that gender bias in LLMs is not static but context-dependent. This suggests that persona-conditioned systems used in real-world applications may introduce uneven representational harms, reinforcing gender stereotypes in generated educational, professional, or social content.
Chinese Translation
大型语言模型(LLMs)越来越多地应用于以角色驱动的场景,如教育、客户服务和社交平台,在这些场景中,模型在与用户互动时被提示采用特定角色。虽然角色条件化可以改善用户体验和参与度,但它也引发了关于个性线索如何与性别偏见和刻板印象相互作用的担忧。在本研究中,我们呈现了一项关于英语和印地语的角色条件化故事生成的控制研究,其中每个故事描绘了一位在印度工作的专业人士,生成特定于上下文的文档(例如,课程计划、报告、信件),并系统性地变化角色性别、职业角色和来自HEXACO及黑暗三角(Dark Triad)框架的个性特征。在从六个最先进的LLMs生成的23,400个故事中,我们发现个性特征与性别偏见的大小和方向显著相关。特别是,黑暗三角个性特征与更高的性别刻板印象表现一致,而社会期望的HEXACO特征则相对较低,尽管这些关联在不同模型和语言中有所变化。我们的研究结果表明,LLMs中的性别偏见并非静态,而是依赖于上下文。这表明,在现实应用中使用的角色条件化系统可能会引入不均衡的表现性伤害,强化生成的教育、职业或社交内容中的性别刻板印象。
cs.CL / 42 / 2604.23615
Applications of the Transformer Architecture in AI-Assisted English Reading Comprehension
Transformer架构在人工智能辅助英语阅读理解中的应用
Abstract
This paper studies interpretable and fair artificial intelligence architectures for English reading comprehension. Transformer-based models are introduced that integrate advanced attention mechanisms and gradient-based feature attribution. Limited model interpretability, algorithmic bias, and unreliable performance in learning environments are the current issues faced in natural language teaching. To address them, a unified technical pipeline is constructed, comprising adversarial bias correction, token-level attribution analysis, and multi-head attention heatmap visualization. Experimental validation was conducted on a large-scale labeled English reading comprehension dataset, with a fixed data partitioning scheme and parameter optimization procedure. The method significantly outperforms state-of-the-art models for this task in accuracy and macro-averaged F1 score; in some aspects, it surpasses or closely matches human evaluation results. In multi-week user experiments, the explainable transformer improved teachers' trust in, and the operability of, feedback-based assessment within the scoring system. The proposed method aims to ensure high prediction accuracy and fairness across different learners. These results indicate that it is suitable as a real-world, interpretation-focused educational AI application that improves the user experience of AI-assisted reading comprehension systems, counteracts bias, and enriches the explanations that transformers provide.
Chinese Translation
本文研究了用于英语阅读理解的可解释且公平的人工智能架构。引入了基于Transformer的模型,整合了先进的注意力机制和基于梯度的特征归因。目前自然语言教学中面临的主要问题包括模型可解释性不足、算法偏见以及在学习环境中的不可靠表现。为此构建了一个统一的技术流程,包括对抗性偏见校正方法、标记级归因分析和多头注意力热图可视化。使用大规模标注的英语阅读理解数据集进行了实验验证,并确定了数据划分方案和参数优化程序。该方法在准确率和宏平均F1分数方面显著优于该任务的最新模型;在某些方面,甚至超越或接近人类评估的结果。在为期数周的用户实验中,可解释的Transformer提高了教师在评分系统中对基于反馈的评估的信任度和可操作性。所提出的方法旨在确保对不同学习者的高预测准确性和公平性。这表明它适合作为注重可解释性的真实世界人工智能教育应用,能够改善AI辅助阅读理解系统的用户体验、抵消偏见,并丰富Transformer所提供的解释细节。
cs.CL / 43 / 2604.23626
GraphPlanner: Graph Memory-Augmented Agentic Routing for Multi-Agent LLMs
GraphPlanner:用于多智能体大语言模型的图记忆增强代理路由
Abstract
LLM routing has achieved promising results in integrating the strengths of diverse models while balancing efficiency and performance. However, to support more realistic and challenging applications, routing must extend into agentic LLM settings, where task planning, multi-round cooperation among heterogeneous agents, and memory utilization are indispensable. To address this gap, we propose GraphPlanner, a heterogeneous graph memory-augmented agentic router for multi-agent LLMs that generates routing workflows for each query and supports both inductive and transductive inference. GraphPlanner formulates workflow generation as a Markov Decision Process (MDP), where at each step it selects both the LLM backbone and the agent role, including Planner, Executor, and Summarizer. By leveraging a heterogeneous graph, denoted as GARNet, to capture interaction memories among queries, agents, and responses, GraphPlanner integrates historical memory and workflow memory into richer state representations. The entire pipeline is optimized with reinforcement learning, jointly improving task-specific performance and computational efficiency. We evaluate GraphPlanner across 14 diverse LLM tasks and demonstrate that: (1) GraphPlanner outperforms strong single-round and multi-round routers, improving accuracy by up to 9.3% while reducing GPU cost from 186.26 GiB to 1.04 GiB; (2) GraphPlanner generalizes robustly to unseen tasks and LLMs, exhibiting strong zero-shot capabilities; and (3) GraphPlanner effectively leverages historical memories, supporting both inductive and transductive inference for more adaptive routing. Our code for GraphPlanner is released at https://github.com/ulab-uiuc/GraphPlanner.
Chinese Translation
大语言模型(LLM)路由在整合多种模型的优势的同时,实现了效率与性能的平衡,取得了令人鼓舞的成果。然而,为了支持更现实和具有挑战性的应用,路由必须扩展到代理大语言模型的设置中,在这些设置中,任务规划、异构智能体之间的多轮合作以及记忆利用是不可或缺的。为了解决这一问题,我们提出了GraphPlanner,一种用于多智能体大语言模型的异构图记忆增强代理路由器,它为每个查询生成路由工作流,并支持归纳推理和传导推理。GraphPlanner将工作流生成形式化为马尔可夫决策过程(MDP),在每一步中选择大语言模型的主干和代理角色,包括规划者(Planner)、执行者(Executor)和总结者(Summarizer)。通过利用一个异构图(GARNet)来捕捉查询、智能体和响应之间的交互记忆,GraphPlanner将历史记忆和工作流记忆整合到更丰富的状态表示中。整个流程通过强化学习进行优化,联合提高任务特定性能和计算效率。我们在14个多样化的大语言模型任务上评估了GraphPlanner,并证明:(1)GraphPlanner在准确性上超过了强大的单轮和多轮路由器,准确率提高了最多9.3%,同时将GPU成本从186.26 GiB降低到1.04 GiB;(2)GraphPlanner对未见过的任务和大语言模型具有良好的泛化能力,展现出强大的零样本能力;(3)GraphPlanner有效利用历史记忆,支持归纳推理和传导推理,实现更自适应的路由。我们的GraphPlanner代码已发布在https://github.com/ulab-uiuc/GraphPlanner。
cs.CL / 44 / 2604.23627
Neural Grammatical Error Correction for Romanian
罗马尼亚语的神经语法错误纠正
Abstract
Resources for Grammatical Error Correction (GEC) in non-English languages are scarce, while available spellcheckers in these languages are mostly limited to simple corrections and rules. In this paper we introduce a first GEC corpus for Romanian, consisting of 10k sentence pairs. In addition, the German version of the ERRANT (ERRor ANnotation Toolkit) scorer was adapted for Romanian to analyze this corpus and extract the edits needed for evaluation. We experimented with multiple neural models, together with pretraining strategies, which proved effective for GEC in low-resource settings. Our baseline is a small Transformer model trained only on the GEC dataset (F0.5 of 44.38), whereas the best-performing model is obtained by pretraining a larger Transformer model on artificially generated data and then fine-tuning it on the actual corpus (F0.5 of 53.76). The proposed method for generating additional training examples is easily extensible and can be applied to any language, as it requires only a POS tagger.
Chinese Translation
非英语语言的语法错误纠正(GEC)资源稀缺,而这些语言中可用的拼写检查器大多仅限于简单的纠正和规则。在本文中,我们介绍了第一个罗马尼亚语GEC语料库,包含10,000对句子。此外,ERRANT(错误注释工具包)评分器的德语版本被调整为适用于罗马尼亚语,以分析该语料库并提取评估所需的编辑。我们实验了多种神经模型及预训练策略,这些策略在低资源环境下的GEC中证明是有效的。我们的基线模型是一个仅在GEC数据集上训练的小型Transformer模型(F0.5为44.38),而表现最佳的模型是通过在人工生成数据上预训练一个更大的Transformer模型,然后在实际语料库上进行微调(F0.5为53.76)得到的。所提出的生成额外训练样本的方法易于扩展,可以应用于任何语言,因为它仅需要一个词性标注器(POS tagger)。
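The Romanian GEC abstract above reports its results as F0.5, the F-beta score with beta = 0.5, which weights precision more heavily than recall. A minimal sketch of the standard formula follows; the precision and recall values in the usage note are made-up illustrations, not figures from the paper.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R).
    beta < 1 favours precision; beta = 1 gives the usual F1."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was predicted correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, a system with precision 0.6 and recall 0.3 scores about 0.5 on F0.5 but only about 0.4 on F1, which shows how the metric rewards corrections that are right over corrections that are merely numerous.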
cs.CL / 45 / 2604.23698
Benchmarking Testing in Automated Theorem Proving
自动定理证明中的基准测试
Abstract
Recent advances in large language models (LLMs) have shown promise in formal theorem proving, yet evaluating semantic correctness remains challenging. Existing evaluations rely on indirect proxies such as lexical overlap with human-annotated proof, or expensive manual inspection. Inspired by the shift from lexical comparison to test-based evaluation in code generation, we propose T , a framework that evaluates the semantic correctness of formal theorems: a generated theorem is considered correct only if all dependent successor theorems compile successfully, analogous to integration testing. We construct a benchmark from 5 real-world Lean 4 repositories, comprising 2,206 problems paired with 41 successor theorems on average, automatically extracted without human effort. Experiments demonstrate that while state-of-the-art models achieve high compilation success, they perform significantly worse under our semantic metric. The best model, Claude-Sonnet-4.5, achieves only 38.9% Testing Accuracy on the full set, given both natural language proof and successor theorems as context, revealing a critical gap in current theorem generation capabilities.
Chinese Translation
最近在大型语言模型(LLMs)方面的进展在形式定理证明中展现了潜力,但评估语义正确性仍然具有挑战性。现有的评估依赖于间接代理,例如与人工标注证明的词汇重叠,或昂贵的人工检查。受到代码生成中从词汇比较转向基于测试评估的启发,我们提出了 T 框架,该框架评估形式定理的语义正确性:只有当所有依赖的后继定理成功编译时,生成的定理才被视为正确,这类似于集成测试。我们从5个真实的 Lean 4 代码库构建了一个基准,包含2,206个问题,平均配对41个后继定理,自动提取,无需人工干预。实验表明,尽管最先进的模型在编译成功率上表现良好,但在我们的语义指标下表现显著较差。最佳模型 Claude-Sonnet-4.5 在完整数据集上的测试准确率仅为38.9%,在提供自然语言证明和后继定理作为上下文的情况下,揭示了当前定理生成能力的一个关键差距。
cs.CL / 46 / 2604.23701
Agri-CPJ: A Training-Free Explainable Framework for Agricultural Pest Diagnosis Using Caption-Prompt-Judge and LLM-as-a-Judge
Agri-CPJ:一种无训练的可解释框架,用于利用Caption-Prompt-Judge和LLM作为评判者进行农业害虫诊断
Abstract
Crop disease diagnosis from field photographs faces two recurring problems: models that score well on benchmarks frequently hallucinate species names, and when predictions are correct, the reasoning behind them is typically inaccessible to the practitioner. This paper describes Agri-CPJ (Caption-Prompt-Judge), a training-free few-shot framework in which a large vision-language model first generates a structured morphological caption, iteratively refined through multi-dimensional quality gating, before any diagnostic question is answered. Two candidate responses are then generated from complementary viewpoints, and an LLM judge selects the stronger one based on domain-specific criteria. Caption refinement is the component with the largest individual impact: ablations confirm that skipping it consistently degrades downstream accuracy across both models tested. On CDDMBench, pairing GPT-5-Nano with GPT-5-mini-generated captions yields +22.7 pp in disease classification and +19.5 points in QA score over no-caption baselines. Evaluated without modification on AgMMU-MCQs, GPT-5-Nano reached 77.84% and Qwen-VL-Chat reached 64.54%, placing them at or above most open-source models of comparable scale despite the format shift from open-ended to multiple-choice. The structured caption and judge rationale together constitute a readable audit trail: a practitioner who disagrees with a diagnosis can identify the specific caption observation that was incorrect. Code and data are publicly available at https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis
Chinese Translation
从田间照片中进行作物病害诊断面临两个反复出现的问题:在基准测试中表现良好的模型常常会虚构物种名称,而当预测正确时,其背后的推理通常对从业者不可及。本文描述了Agri-CPJ(Caption-Prompt-Judge),一种无训练的少样本框架,其中一个大型视觉语言模型首先生成一个结构化的形态学描述,并通过多维质量门控进行迭代优化,然后再回答任何诊断问题。随后,从互补视角生成两个候选响应,LLM评判者根据特定领域标准选择更强的一个。描述优化是影响最大的组成部分:消除该步骤会导致两个测试模型的下游准确性持续下降。在CDDMBench上,将GPT-5-Nano与GPT-5-mini生成的描述配对,疾病分类的准确率提高了+22.7个百分点,问答得分提高了+19.5分,相较于无描述基线。在AgMMU-MCQs上未经修改评估,GPT-5-Nano达到了77.84%,Qwen-VL-Chat达到了64.54%,尽管格式从开放式转变为多项选择,但仍位于或高于大多数同规模的开源模型。结构化描述和评判理由共同构成了可读的审计轨迹:与诊断意见不符的从业者可以识别出具体的错误描述观察。代码和数据公开可用,网址为 https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis
cs.CL / 47 / 2604.23719
AIPsy-Affect: A Keyword-Free Clinical Stimulus Battery for Mechanistic Interpretability of Emotion in Language Models
AIPsy-Affect:一种无关键词的临床刺激电池,用于语言模型中情感的机制可解释性研究
Abstract
Mechanistic interpretability research on emotion in large language models -- linear probing, activation patching, sparse autoencoder (SAE) feature analysis, causal ablation, steering vector extraction -- depends on stimuli that contain the words for the emotions they test. When a probe fires on "I am furious", it is unclear whether the model has detected anger or detected the word "furious". The two readings have very different consequences for every downstream claim about emotion circuits, features, and interventions. We release AIPsy-Affect, a 480-item clinical stimulus battery that removes the confound at the stimulus level: 192 keyword-free vignettes evoking each of Plutchik's eight primary emotions through narrative situation alone, 192 matched neutral controls that share characters, setting, length, and surface structure with the affect surgically removed, plus moderate-intensity and discriminant-validity splits. The matched-pair structure supports linear probing, activation patching, SAE feature analysis, causal ablation, and steering vector extraction under a strong methodological guarantee: any internal representation that distinguishes a clinical item from its matched neutral cannot be doing so on the basis of emotion-keyword presence. A three-method NLP defense battery -- bag-of-words sentiment, an emotion-category lexicon, and a contextual transformer classifier -- confirms the property: bag-of-words methods see only situational vocabulary, and a contextual classifier detects affect (p < 10^-15) but cannot identify the category (5.2% top-1 vs. 82.5% on a keyword-rich control). AIPsy-Affect extends our earlier 96-item battery (arXiv:2603.22295) by a factor of four and is released openly under MIT license.
Chinese Translation
对大型语言模型中情感的机制可解释性研究——线性探测、激活修补、稀疏自编码器(SAE)特征分析、因果消融、引导向量提取——依赖于包含测试情感词汇的刺激。当探测器在"我很愤怒"上触发时,尚不清楚模型是检测到了愤怒,还是检测到了单词"愤怒"。这两种解读对关于情感电路、特征和干预措施的每一个下游主张都有非常不同的影响。我们发布了AIPsy-Affect,这是一个包含480个项目的临床刺激电池,消除了刺激层面的混淆:192个无关键词的短文通过叙事情境单独唤起Plutchik的八种主要情感,192个匹配的中性对照短文与情感短文共享角色、背景、长度和表面结构,但情感成分被手术性去除,并且包含中等强度和区分效度的划分。匹配对结构支持线性探测、激活修补、SAE特征分析、因果消融和引导向量提取,具有强有力的方法论保证:任何将临床项目与其匹配中性项目区分开的内部表征,都不能基于情感关键词的存在。一个三种方法的自然语言处理(NLP)防御电池——词袋情感分析、情感类别词典和上下文变换器分类器——确认了这一特性:词袋方法仅能看到情境词汇,而上下文分类器检测到情感(p < 10^-15),但无法识别类别(关键词丰富的对照组的前1名准确率为82.5%,而在无关键词短文中仅为5.2%)。AIPsy-Affect将我们早期的96项电池(arXiv:2603.22295)扩展了四倍,并在MIT许可证下公开发布。
cs.CL / 48 / 2604.23733
Multimodal QUD: Inquisitive Questions from Scientific Figures
多模态问题讨论(QUD):来自科学图表的探究性问题
Abstract
Asking inquisitive questions while reading, and looking for their answers, is an important part of human discourse comprehension, curiosity, and creative ideation, and prior work has investigated this in text-only scenarios. However, in scientific or research papers, many of the critical takeaways are conveyed through both figures and the text that analyzes them. While scientific visualizations have been used to evaluate the capabilities of Vision-Language Models (VLMs), current benchmarks are limited to questions that focus simply on extracting information from them. Such questions only require lower-level reasoning, do not take into account the context in which a figure appears, and do not reflect the communicative goals the authors wish to achieve. We generate inquisitive questions that reach the depth of questions humans generate when engaging with scientific papers, are conditioned on both the figure and the paper's context, and require reasoning across both modalities. To do so, we extend the linguistic theory of Questions Under Discussion (QUD), in which implicit questions are raised and resolved as discourse progresses, from text-only to multimodal. We present MQUD, a dataset of research papers in which such questions are made explicit and annotated by the original authors. We show that fine-tuning a VLM on MQUD shifts the model from generating generic low-level visual questions to content-specific grounding that requires a high level of multimodal reasoning, yielding higher-quality, more visually grounded multimodal QUD generation.
Chinese Translation
在阅读时提出探究性问题并寻找答案是人类话语理解、好奇心和创造性构思的重要组成部分,以往的研究主要集中于仅文本的场景。然而,在科学或研究论文中,许多关键要点通过图表和分析这些图表的文本共同传达。尽管科学可视化已被用来评估视觉-语言模型(VLMs)的能力,但当前的基准测试仅限于关注从中提取信息的问题。这类问题只需要较低层次的推理,不考虑图表出现的上下文,也未反映作者希望实现的交流目标。我们生成的探究性问题深入到人类在阅读科学论文时所产生的问题,基于图表和论文的上下文,并需要跨两种模态进行推理。为此,我们将讨论中的问题(QUD)的语言理论从仅文本扩展到多模态,其中隐含问题随着话语的进展被提出和解决。我们提出了MQUD,这是一个研究论文的数据集,其中此类问题被明确提出并由原作者注释。我们展示了在MQUD上微调VLM可以使模型从生成通用的低层次视觉问题转变为需要高层次多模态推理的内容特定基础,从而产生更高质量、更具视觉基础的多模态QUD生成。
cs.CL / 49 / 2604.23801
Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale
医学多项选择题回答的领域微调与检索增强生成的对比研究:在4B参数规模下的受控比较
Abstract
Practitioners deploying small open-weight large language models (LLMs) for medical question answering face a recurring design choice: invest in a domain-fine-tuned model, or keep a general-purpose model and inject domain knowledge at inference time via retrieval-augmented generation (RAG). We isolate this trade-off by holding model size, prompt template, decoding temperature, retrieval pipeline, and evaluation protocol fixed, and varying only (i) whether the model has been domain-adapted (Gemma 3 4B vs. MedGemma 4B, both 4-bit quantized and served via Ollama) and (ii) whether retrieved passages from a medical knowledge corpus are inserted into the prompt. We evaluate all four cells of this 2x2 design on the full MedQA-USMLE 4-option test split (1,273 questions) with three repetitions per question (15,276 LLM calls). Domain fine-tuning yields a +6.8 percentage-point gain in majority-vote accuracy over the general 4B baseline (53.3% vs. 46.4%, McNemar p < 10^-4). RAG over MedMCQA explanations does not produce a statistically significant gain in either model, and in the domain-tuned model the point estimate is slightly negative (-1.9 pp, p = 0.16). At this scale and on this benchmark, domain knowledge encoded in weights dominates domain knowledge supplied in context. We release the full experiment code and JSONL traces to support replication.
Chinese Translation
在医学问题回答中,使用小型开放权重的大型语言模型(LLMs)的从业者面临一个反复出现的设计选择:投资于领域微调模型,还是保持通用模型并通过检索增强生成(RAG)在推理时注入领域知识。我们通过固定模型大小、提示模板、解码温度、检索管道和评估协议,仅改变(i)模型是否经过领域适配(Gemma 3 4B与MedGemma 4B,均为4位量化并通过Ollama提供服务)和(ii)是否将来自医学知识语料库的检索段落插入提示中,来隔离这一权衡。我们在完整的MedQA-USMLE 4选项测试集(1,273个问题)上评估了这一2x2设计的四个单元,每个问题进行三次重复(15,276次LLM调用)。领域微调相较于通用4B基线在多数投票准确率上提高了6.8个百分点(53.3%对46.4%,McNemar p < 10^-4)。在两个模型中,基于MedMCQA解释的RAG未能产生统计学上显著的增益,而在领域微调模型中,点估计略微为负(-1.9 pp,p = 0.16)。在这一规模和基准下,编码在权重中的领域知识主导了在上下文中提供的领域知识。我们发布了完整的实验代码和JSONL追踪,以支持复制研究。
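The medical QA abstract above reports a 2x2 design (four cells), 1,273 questions, and three repetitions per question, with accuracy computed by majority vote and significance assessed with a McNemar test. The sketch below reproduces the call-count arithmetic and shows one way the vote and the test could be computed. It assumes an exact binomial form of the McNemar test and a first-seen tie-break; the abstract does not say which variant or tie rule the authors used, and the function names are hypothetical.

```python
from collections import Counter
from math import comb

def total_calls(n_questions, n_repetitions, n_cells):
    """Total LLM calls for a fully crossed design."""
    return n_questions * n_repetitions * n_cells

def majority_vote(answers):
    """Most common answer among the repetitions.
    Ties resolve to the answer seen first (an assumption, not the paper's rule)."""
    return Counter(answers).most_common(1)[0][0]

def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value.
    b: questions only system A got right; c: questions only system B got right.
    Under the null hypothesis, b follows Binomial(b + c, 0.5)."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Checking the arithmetic from the abstract: `total_calls(1273, 3, 4)` gives 15,276, which matches the reported number of LLM calls.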
cs.CL / 50 / 2604.23809
LegalDrill: Diagnosis-Driven Synthesis for Legal Reasoning in Small Language Models
LegalDrill:基于诊断驱动的法律推理合成方法在小型语言模型中的应用
Abstract
Small language models (SLMs) are promising for real-world deployment due to their efficiency and low operational cost. However, their limited capacity struggles with high-stakes legal reasoning tasks that require coherent statute interpretation and logically consistent deduction. Furthermore, training SLMs for such tasks demands high-quality, concise reasoning trajectories, which are prohibitively expensive to collect manually and difficult to curate via standard rejection sampling, which lacks granularity beyond final verdicts. To address these challenges, we propose LegalDrill, a diagnosis-driven synthesis framework that extracts and iteratively refines reasoning trajectories from a capable teacher via fine-grained prompting, and then employs self-reflective verification to adaptively select the most effective data for the SLM student. The resulting data empower SLM training through supervised fine-tuning and direct preference optimization. Extensive experiments on several legal benchmarks demonstrate that LegalDrill significantly bolsters the legal reasoning capabilities of representative SLMs while bypassing the need for scarce expert annotations, paving a scalable path toward practical legal reasoning systems.
Chinese Translation
小型语言模型(SLMs)因其高效性和低运营成本而在实际应用中展现出良好的前景。然而,它们的有限能力在需要连贯法律条文解释和逻辑一致推理的高风险法律推理任务中面临挑战。此外,为此类任务训练SLMs需要高质量、简明的推理轨迹,而手动收集这些轨迹的成本过高,且通过标准拒绝采样进行策划时缺乏超出最终裁决的细粒度。为了解决这些问题,我们提出了LegalDrill,一种基于诊断驱动的合成框架,通过细粒度提示从一个有能力的教师模型中提取并迭代优化推理轨迹,然后采用自我反思验证方法自适应选择最有效的数据供SLM学生使用。生成的数据通过监督微调和直接偏好优化增强了SLM的训练。在多个法律基准测试上的广泛实验表明,LegalDrill显著提升了代表性SLMs的法律推理能力,同时避免了对稀缺专家注释的需求,为实用法律推理系统铺平了可扩展的道路。
cs.CL / 51 / 2604.23815
DRACULA: Hunting for the Actions Users Want Deep Research Agents to Execute
DRACULA:寻找用户希望深度研究代理执行的行动
Abstract
Scientific Deep Research (DR) agents answer user queries by synthesizing research papers into multi-section reports. User feedback can improve their utility, but existing protocols only score the final report, making it hard to study and learn which intermediate actions DR agents should take to improve reports. We collect DRACULA, the first dataset with user feedback on intermediate actions for DR. Over five weeks, nineteen expert CS researchers pose queries to a DR system that proposes actions (e.g., "Add a section on datasets"). Our users select the actions they prefer, then judge whether an output report applied their selections successfully, yielding 8,103 action preferences and 5,230 execution judgments. After confirming that a DR agent can execute DRACULA's actions, we study the predictability of user-preferred actions via simulation (how well LLMs predict the actions users select), a step toward learning to generate useful actions. We discover: (1) LLM judges initially struggle to predict action selections, but improve most when using a user's full selection history, rather than self-reported or extrapolated user context signals; (2) Users' selections for the same query differ based on unstated goals, bottlenecking simulation and motivating affordances that let users steer reports; and (3) Our simulation results inform an online intervention that generates new actions based on the user's past interactions, which users pick most often in follow-up studies. Overall, while prior work extensively studies execution, DRACULA reveals that a key challenge is deciding which actions to execute in the first place. We open-source DRACULA's study design, user feedback, and simulation tasks to spur future work on action feedback for long-horizon agents.
Chinese Translation
科学深度研究(DR)代理通过将研究论文综合为多部分报告来回答用户查询。用户反馈可以提高其效用,但现有协议仅对最终报告进行评分,这使得研究和学习DR代理应采取哪些中间行动以改善报告变得困难。我们收集了DRACULA,这是第一个包含用户对DR中间行动反馈的数据集。在五周内,十九位计算机科学领域的专家研究人员向一个DR系统提出查询,该系统建议行动(例如,“添加一个关于数据集的部分”)。我们的用户选择他们偏好的行动,然后判断输出报告是否成功应用了他们的选择,从而产生了8,103个行动偏好和5,230个执行判断。在确认DR代理能够执行DRACULA的行动后,我们通过模拟研究用户偏好行动的可预测性——大型语言模型(LLMs)预测用户选择的行动的能力,这是学习生成有用行动的一步。我们发现:(1)LLM评估者最初在预测行动选择方面表现不佳,但在使用用户的完整选择历史时改善最多,而不是依赖自我报告或推断的用户上下文信号;(2)对于同一查询,用户的选择因未表明的目标而异,这限制了模拟并激励了让用户引导报告的可能性;(3)我们的模拟结果为在线干预提供了信息,该干预基于用户的过去互动生成新行动,而用户在后续研究中最常选择这些行动。总体而言,尽管已有大量研究关注执行,DRACULA揭示了一个关键挑战,即首先决定执行哪些行动。我们开源了DRACULA的研究设计、用户反馈和模拟任务,以促进未来在长期代理的行动反馈方面的研究。
cs.CL / 52 / 2604.23824
Resource-Lean Lexicon Induction for German Dialects
资源节约型德语方言词典生成
Abstract
Automatic induction of high-quality dictionaries is essential for building lexical resources, yet low-resource languages and dialects pose several challenges: limited access to annotators, high degree of spelling variations, and poor performance of large language models (LLMs). We empirically show that statistical models (random forests) trained on string similarity features are surprisingly effective for inducing German dialect lexicons. They outperform LLMs, enable cross-dialect transfer, and offer a lightweight data-driven alternative. We evaluate our models intrinsically on bilingual lexicon induction (BLI) and extrinsically on dialect information retrieval (IR). On BLI, random forests outperform Mistral-123b while being more resource-lean. On dialect IR with BM25, using our dialect dictionaries for query expansion yields relative improvements of up to 28.9% in nDCG@10 and 50.7% in Recall@100. Motivated by the resource scarcity in dialects, we further investigate the extent to which models transfer across different German dialects, and their performance under varying amounts of training data.
Chinese Translation
自动生成高质量词典对于构建词汇资源至关重要,但低资源语言和方言面临多重挑战:有限的标注者访问、高度的拼写变异以及大型语言模型(LLMs)的表现不佳。我们通过实证研究表明,基于字符串相似性特征训练的统计模型(随机森林)在生成德语方言词典方面出乎意料地有效。它们的表现优于LLMs,支持跨方言迁移,并提供了一种轻量级的数据驱动替代方案。我们在双语词典生成(BLI)上对模型进行了内在评估,在方言信息检索(IR)上进行了外在评估。在BLI任务中,随机森林的表现优于Mistral-123b,同时资源消耗更少。在使用BM25进行方言IR时,利用我们的方言词典进行查询扩展在nDCG@10上实现了高达28.9%的相对提升,在Recall@100上实现了50.7%的提升。鉴于方言资源的稀缺,我们进一步研究了模型在不同德语方言之间迁移的程度,以及在不同训练数据量下的表现。
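The dialect lexicon abstract above trains random forests on string similarity features. The abstract does not list the features, so the sketch below shows a few common candidates (normalized edit similarity, character-bigram Jaccard overlap, and the `difflib` matching ratio) purely as an illustration; the feature names and the choice of features are assumptions. The resulting dictionary would be the per-pair input to a classifier, which is not shown.

```python
from difflib import SequenceMatcher

def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def char_ngrams(word, n=2):
    """Set of character n-grams, with '#' marking the word boundaries."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def similarity_features(dialect_word, standard_word):
    """Hypothetical feature vector for one (dialect, standard German) candidate pair."""
    a, b = dialect_word.lower(), standard_word.lower()
    longest = max(len(a), len(b)) or 1  # guard against two empty strings
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    union = grams_a | grams_b
    return {
        "norm_edit_sim": 1.0 - levenshtein(a, b) / longest,
        "bigram_jaccard": len(grams_a & grams_b) / len(union) if union else 0.0,
        "matcher_ratio": SequenceMatcher(None, a, b).ratio(),
        "length_diff": abs(len(a) - len(b)),
    }
```

Features of this kind are cheap to compute and need no pretrained model, which is consistent with the resource-lean framing of the abstract.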
cs.CL / 53 / 2604.23837
One Size Fits None: Heuristic Collapse in LLM Investment Advice
一刀切无效:大型语言模型投资建议中的启发式崩溃
Abstract
Large language models are increasingly deployed as advisors in high-stakes domains -- answering medical questions, interpreting legal documents, recommending financial products -- where good advice requires integrating a user's full context rather than responding to salient surface features. We investigate whether frontier LLMs actually do this, or whether they instead exhibit heuristic collapse: a systematic reduction of complex, multi-factor decisions to a small number of dominant inputs. We study the phenomenon in investment advice, where legal standards explicitly require individualized reasoning over a client's full circumstances. Applying interpretable surrogate models to LLM outputs, we find systematic heuristic collapse: investment allocation decisions are largely determined by self-reported risk tolerance, while other relevant factors contribute minimally. We further find that web search partially attenuates heuristic collapse but does not resolve it. These findings suggest that heuristic collapse is not resolved by web search augmentation or model scale alone, and that deploying LLMs as advisors requires auditing input sensitivity, not just output quality.
Chinese Translation
大型语言模型越来越多地被应用于高风险领域作为顾问——回答医学问题、解读法律文件、推荐金融产品——在这些领域,良好的建议需要整合用户的完整背景,而不是仅仅响应显著的表面特征。我们调查前沿大型语言模型是否真的做到这一点,或者它们是否表现出启发式崩溃:将复杂的多因素决策系统性地简化为少数主导输入。我们在投资建议中研究这一现象,在该领域,法律标准明确要求针对客户的完整情况进行个性化推理。通过对大型语言模型输出应用可解释的替代模型,我们发现系统性的启发式崩溃:投资配置决策主要由自我报告的风险承受能力决定,而其他相关因素的贡献微乎其微。我们进一步发现,网络搜索在一定程度上减轻了启发式崩溃,但并未完全解决这一问题。这些发现表明,仅通过网络搜索增强或模型规模无法解决启发式崩溃,并且将大型语言模型作为顾问的部署需要审计输入敏感性,而不仅仅是输出质量。
cs.CL / 54 / 2604.23842
Reheat Nachos for Dinner? Evaluating AI Support for Cross-Cultural Communication of Neologisms
晚餐吃重热玉米片?评估人工智能对新词跨文化交流的支持
Abstract
Neologisms and emerging slang are central to daily conversation, yet challenging for non-native speakers (NNS) to interpret and use appropriately in cross-cultural communication with native speakers (NS). NNS increasingly make use of Artificial Intelligence (AI) tools to learn these words. We study the utility of such tools in mediating an informal communication scenario through a human-subjects study (N=234): NNS participants learn English neologisms with AI support, write messages using the learned word to an NS friend, and judge the contextual appropriateness of the neologism in two provided writing samples. Using both NS evaluator-rated communicative competence of NNS-produced writing and NNS' contextual appropriateness judgments, we compare three AI-based support conditions (AI Definition, AI Rewrite into simpler English, and AI Explanation of meaning and usage) with a Non-AI Dictionary condition. We show that AI Explanation yields the largest gains over no support in NS-rated competence, while contextual appropriateness judgments show no difference across support conditions. NNS participants' self-reported perceptions tend to overestimate NS ratings, revealing a mismatch between perceived and actual competence. We further observe a significant gap between NNS- and NS-produced writing, highlighting the limitations of current AI tools and informing the design of future tools.
Chinese Translation
新词和新兴俚语是日常对话的核心,但对于非母语者(NNS)来说,在与母语者(NS)的跨文化交流中正确理解和使用这些词汇具有挑战性。非母语者越来越多地利用人工智能(AI)工具来学习这些词汇。我们通过一项人类受试者研究(N=234)研究了这些工具在非正式交流场景中的有效性:非母语者参与者在AI支持下学习英语新词,向NS朋友撰写使用所学词汇的信息,并判断在提供的两个写作样本中新词的语境适宜性。我们比较了三种基于AI的支持条件:AI定义、AI重写为更简单的英语、AI对意义和用法的解释,以及作为对照的非AI词典。我们的研究表明,AI解释在NS评估的能力上相较于无支持条件获得了最大的提升,而在语境适宜性判断上,各种支持条件之间表现出无差异。非母语者参与者自我报告的感知往往高估了NS的评分,揭示了感知能力与实际能力之间的不匹配。我们进一步观察到非母语者和母语者所写作品之间存在显著差距,突显了当前AI工具的局限性,并为未来工具的设计提供了参考。
cs.CL / 55 / 2604.23844
Translate or Simplify First: An Analysis of Cross-lingual Text Simplification in English and French
先翻译还是先简化:英语和法语跨语言文本简化的分析
Abstract
Cross-Lingual Text Simplification (CLTS) aims to make content more accessible across languages by simultaneously addressing both linguistic complexity and translation. This study investigates the effectiveness of different prompting strategies for CLTS between English and French using large language models (LLMs). We examine five distinct prompting systems: a direct prompt instructing the LLM to perform both translation and simplification simultaneously, two composition approaches that either translate-then-simplify or simplify-then-translate within a single prompt, and two decomposition approaches that perform the same operations in separate, consecutive prompts. These systems are evaluated across a diverse set of five corpora of different genres (Wikipedia and medical texts) using seven state-of-the-art LLMs. Output quality is assessed through a multi-faceted evaluation framework comprising automatic metrics, comprehensive linguistic feature analysis, and human evaluation of simplicity and meaning preservation. Our findings reveal that while direct prompting consistently achieves the highest BLEU scores, indicating meaning fidelity, translate-then-simplify approaches demonstrate the highest simplicity, as measured by the linguistic features.
Chinese Translation
跨语言文本简化(CLTS)旨在通过同时解决语言复杂性和翻译问题,使内容在不同语言间更易于访问。本研究探讨了在英语和法语之间使用大型语言模型(LLMs)进行CLTS时,不同提示策略的有效性。我们考察了五种不同的提示系统:一种直接提示,指示LLM同时进行翻译和简化;两种组合方法,分别是在单个提示中先翻译再简化或先简化再翻译;以及两种分解方法,在连续的单独提示中执行相同操作。这些系统在五个不同体裁(维基百科和医学文本)的多样语料库上进行评估,使用七种最先进的LLMs。输出质量通过一个多维评估框架进行评估,该框架包括自动指标、全面的语言特征分析以及对简易性和意义保留的人类评估。我们的研究结果表明,尽管直接提示始终获得最高的BLEU分数,表明意义的保真性,但“翻译-再简化”方法在语言特征测量的简易性方面表现最佳。
cs.CL / 56 / 2604.23855
Learning Selective LLM Autonomy from Copilot Feedback in Enterprise Customer Support Workflows
从 Copilot 反馈中学习选择性 LLM 自主性以优化企业客户支持工作流程
Abstract
We present a deployed system that automates end-to-end customer support workflows inside an enterprise Business Process Management (BPM) platform. The approach is scalable in production and reaches selective automation within two weeks for a new process, leveraging supervision already generated at scale: structured per-case UI interaction traces and low-overhead copilot feedback, where operators either accept a suggestion or provide a correction. A staged deployment pipeline trains a next UI action policy, learns a critic from copilot feedback to calibrate abstention, and executes only high-confidence steps in the background while deferring uncertain decisions to operators and resuming from the updated UI state. This setup lets one operator supervise multiple concurrent sessions and be interrupted only when the system is uncertain. The system operates on a schema-driven view of the BPM interface and includes monitoring and safe fallbacks for production. In production, it automated 45% of sessions and reduced average handling time by 39% without degrading support quality level.
Chinese Translation
我们提出了一种已部署的系统,该系统在企业业务流程管理(BPM)平台内自动化端到端的客户支持工作流程。该方法在生产环境中具有可扩展性,并且在两周内实现了新流程的选择性自动化,利用了已经大规模生成的监督数据:结构化的每个案例的用户界面交互轨迹和低开销的 Copilot 反馈,其中操作员可以选择接受建议或提供修正。分阶段的部署管道训练下一个用户界面动作策略,从 Copilot 反馈中学习评估者以校准放弃决策,并在后台仅执行高置信度步骤,同时将不确定的决策推迟给操作员,并从更新的用户界面状态恢复。该设置允许一名操作员监督多个并发会话,并仅在系统不确定时被打断。该系统基于模式驱动的 BPM 界面视图运行,并包括监控和安全回退机制。在生产中,它自动化了 45% 的会话,并在不降低支持质量水平的情况下将平均处理时间减少了 39%。
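The selective-automation loop described in this abstract (execute high-confidence steps in the background, defer uncertain ones to an operator) can be sketched as follows. This is a minimal illustration under stated assumptions, not the deployed system: the `Step` type, the `route_steps` helper, the action labels, and the 0.9 threshold are all hypothetical, and the confidence values stand in for whatever score the learned critic produces.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    action: str        # proposed next UI action (hypothetical label)
    confidence: float  # stand-in for the critic's score in [0, 1]


def route_steps(steps: List[Step], threshold: float = 0.9) -> Tuple[List[str], Optional[Step]]:
    """Run steps in order until one falls below the threshold.

    Returns the actions executed automatically and the first step that
    must be deferred to an operator (None if every step was confident).
    """
    executed: List[str] = []
    for step in steps:
        if step.confidence < threshold:
            return executed, step  # abstain: hand this decision to a human
        executed.append(step.action)
    return executed, None


# Hypothetical session: two confident steps, then an uncertain one.
session = [Step("open_case", 0.97), Step("fill_form", 0.95), Step("approve_refund", 0.60)]
done, deferred = route_steps(session)
print(done, deferred.action if deferred else None)
```

Once the operator resolves the deferred step, the same routine could be called again on the remaining steps, which loosely mirrors the abstract's "resuming from the updated UI state".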
cs.CL / 57 / 2604.23877
Knowledge Vector of Logical Reasoning in Large Language Models
大型语言模型中的逻辑推理知识向量
Abstract
Logical reasoning serves as a central capability in LLMs and includes three main forms: deductive, inductive, and abductive reasoning. In this work, we study the knowledge representations of these reasoning types in LLMs and analyze the correlations among them. Our analysis shows that each form of logical reasoning can be captured as a reasoning-specific knowledge vector in a linear representation space, yet these vectors are largely independent of each other. Motivated by the cognitive science theory that these subforms of logical reasoning interact closely in the human brain, as well as our observation that the reasoning process for one type can benefit from the reasoning chain produced by another, we further propose to refine the knowledge representations of each reasoning type in LLMs to encourage complementarity between them. To this end, we design a complementary subspace-constrained refinement framework, which introduces a complementary loss that enables each reasoning vector to leverage auxiliary knowledge from the others, and a subspace constraint loss that prevents erasure of their unique characteristics. Through steering experiments along reasoning vectors, we find that refined vectors incorporating complementary knowledge yield consistent performance gains. We also conduct a mechanism-interpretability analysis of each reasoning vector, revealing insights into the shared and specific features of different reasoning types in LLMs.
Chinese Translation
逻辑推理是大型语言模型(LLMs)中的核心能力,主要包括三种形式:演绎推理、归纳推理和溯因推理。在本研究中,我们探讨了这些推理类型在LLMs中的知识表示,并分析了它们之间的关联。我们的分析表明,每种逻辑推理形式可以在一个线性表示空间中被捕捉为特定于推理的知识向量,但这些向量在很大程度上是相互独立的。受到认知科学理论的启发,即这些逻辑推理的子形式在大脑中紧密互动,以及我们观察到一种推理类型的推理过程可以受益于另一种推理链的影响,我们进一步提出了在LLMs中细化每种推理类型的知识表示,以促进它们之间的互补性。为此,我们设计了一个互补子空间约束细化框架,该框架引入了互补损失,使每个推理向量能够利用其他向量的辅助知识,并引入子空间约束损失,以防止其独特特征的消失。通过沿推理向量进行引导实验,我们发现,结合互补知识的细化向量能够带来一致的性能提升。我们还对每个推理向量进行了机制可解释性分析,揭示了LLMs中不同推理的共享和特定特征的洞察。
cs.CL / 58 / 2604.23938
TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment
TSAssistant:一种人机协作的自动化靶点安全评估框架
Abstract
Target Safety Assessment (TSA) requires systematic integration of heterogeneous evidence, including genetic, transcriptomic, target homology, pharmacological, and clinical data, to evaluate potential safety liabilities of therapeutic targets. This process is inherently iterative and expert-driven, posing challenges in scalability and reproducibility. We present TSAssistant, a multi-agent framework designed to support TSA report drafting through a modular, section-based, and human-in-the-loop paradigm. The framework decomposes report generation into a coordinated pipeline of specialised subagents, each targeting a single TSA section. Specialised subagents retrieve structured and unstructured data as well as literature evidence from curated biomedical sources through standardised tool interfaces, producing individually citable, evidence-grounded sections. Agent behaviour is governed by a hierarchical instruction architecture comprising system prompts, domain-specific skill modules, and runtime user instructions. A key feature is an interactive refinement loop in which users may manually edit sections, append new information, upload additional sources, or re-invoke agents to revise specific sections, with the system maintaining conversational memory across iterations. TSAssistant is designed to reduce the mechanical burden of evidence synthesis and report drafting, supporting a hybrid model in which agentic AI augments evidence synthesis while toxicologists retain final decision authority.
Chinese Translation
靶点安全评估(TSA)需要系统地整合异构证据,包括基因组、转录组、靶点同源性、药理学和临床数据,以评估治疗靶点的潜在安全风险。该过程本质上是迭代的并由专家驱动,面临可扩展性和可重复性的挑战。我们提出了TSAssistant,一个多智能体框架,旨在通过模块化、基于章节的和人机协作的范式来支持TSA报告的撰写。该框架将报告生成分解为一个协调的专门子智能体管道,每个子智能体针对单个TSA章节。专门的子智能体通过标准化工具接口从经过策划的生物医学来源中检索结构化和非结构化数据以及文献证据,生成可单独引用的、基于证据的章节。智能体行为由一个层次化的指令架构控制,该架构包括系统提示、特定领域的技能模块和运行时用户指令。一个关键特性是交互式的精炼循环,用户可以手动编辑章节、附加新信息、上传额外来源或重新调用智能体以修订特定章节,系统在迭代过程中保持对话记忆。TSAssistant旨在减少证据综合和报告撰写的机械负担,支持一种混合模型,其中智能代理增强了证据综合,而毒理学家保留最终决策权。
cs.CL / 59 / 2604.23948
KOMBO: Korean Character Representations Based on the Combination Rules of Subcharacters
KOMBO:基于子字符组合规则的韩文字母表示
Abstract
The Korean writing system, Hangeul, has a unique character representation that rigidly follows the invention principles recorded in Hunminjeongeum (a book published in 1446 that describes the principles of invention and usage of Hangeul, devised by King Sejong [Hunminjeongeum_Guide]). However, existing pre-trained language models (PLMs) for Korean have overlooked these principles. In this paper, we introduce a novel framework for Korean PLMs called KOMBO, which is the first to apply the invention principles of Hangeul to character representation. Our proposed method, KOMBO, exhibits notable experimental proficiency across diverse NLP tasks. In particular, our method outperforms the state-of-the-art Korean PLM by an average of 2.11% in five Korean natural language understanding tasks. Furthermore, extensive experiments demonstrate that our proposed method is suitable for comprehending the linguistic features of the Korean language. Consequently, we shed light on the superiority of using subcharacters over the typical subword-based approach for Korean PLMs. Our code is available at: https://github.com/SungHo3268/KOMBO.
Chinese Translation
韩文书写系统Hangeul具有独特的字符表示,严格遵循记录在Hunminjeongeum中的发明原则(Hunminjeongeum是一本于1446年出版的书籍,描述了由世宗大王设计的Hangeul的发明和使用原则 [Hunminjeongeum_Guide])。然而,现有的韩文预训练语言模型(PLMs)忽视了这些原则。本文介绍了一种新的韩文PLM框架KOMBO,该框架首次将Hangeul的发明原则应用于字符表示。我们提出的方法KOMBO在多种自然语言处理(NLP)任务中表现出显著的实验能力。特别是,我们的方法在五个韩文自然语言理解任务中平均超越了最先进的韩文PLM 2.11%。此外,大量实验表明,我们提出的方法适合理解韩语的语言特征。因此,我们揭示了在韩文PLM中使用子字符相较于典型的基于子词的方法的优越性。我们的代码可在以下链接获取:https://github.com/SungHo3268/KOMBO。
cs.CL / 60 / 2604.23972
Quantum Knowledge Graph: Modeling Context-Dependent Triplet Validity
量子知识图谱:建模上下文依赖的三元组有效性
Abstract
Knowledge graphs (KGs) are increasingly used to support large language model (LLM) reasoning, but standard triplet-based KGs treat each relation as globally valid. In many settings, whether a relation should count as evidence depends on the context. We therefore formulate triplet validity as a triplet-specific function of context and refer to this formulation as a Quantum Knowledge Graph (QKG). We instantiate QKG in medicine using a diabetes-centered PrimeKG subgraph, whose 68,651 context-sensitive relations are further annotated with patient-group-specific constraints. We evaluate it in a reasoner-validator pipeline for medical question answering on a KG-grounded subset of MedReason containing 2,788 questions. With Haiku-4.5 as both the Reasoner and the Validator, KG-backed validation significantly improves over a no-validator baseline (+0.61 pp), and QKG with context matching yields the largest gain, outperforming both KG validation without context matching (+0.79 pp) and the no-validator baseline (+1.40 pp; paired McNemar, all p<0.05). Under a stronger validator (Qwen-3.6-Plus), the raw QKG gain over the no-validator baseline grows from +1.40 pp to +5.96 pp; the context-matching gap is non-significant (p=0.73) on the raw set but becomes borderline significant (p=0.05) after adjustment for knowledge leakage and suspicious questions, consistent with a benchmark-gold ceiling rather than a QKG limitation. Taken together, the results support the view that the value of a KG in LLM-based clinical reasoning lies not merely in storing medically related facts, but in representing whether those facts are applicable to the specific patient context. For reproducibility and further research, we release the curated QKG datasets and source code at https://github.com/HKAI-Sci/QKG.
Chinese Translation
知识图谱(KGs)越来越多地用于支持大型语言模型(LLM)的推理,但标准的基于三元组的知识图谱将每个关系视为全局有效。在许多情况下,某个关系是否应被视为证据取决于上下文。因此,我们将三元组有效性表述为一个特定于三元组的上下文函数,并将这种表述称为量子知识图谱(QKG)。我们在医学领域中使用以糖尿病为中心的PrimeKG子图实例化QKG,该子图包含68,651个上下文敏感的关系,并进一步注释了特定于患者群体的约束。我们在一个推理-验证管道中评估其在医学问答中的表现,该管道基于KG的MedReason子集,包含2,788个问题。使用Haiku-4.5作为推理器和验证器,基于KG的验证显著优于无验证器基线(提高了0.61个百分点),而具有上下文匹配的QKG则获得了最大的提升,超越了没有上下文匹配的KG验证(提高了0.79个百分点)和无验证器基线(提高了1.40个百分点;配对McNemar检验,所有p<0.05)。在更强的验证器(Qwen-3.6-Plus)下,原始QKG相较于无验证器基线的增益从1.40个百分点增长到5.96个百分点;在原始数据集上,上下文匹配的差距不显著(p=0.73),但在调整知识泄露和可疑问题后变得边际显著(p=0.05),这与基准金标准上限一致,而非QKG的局限性。综合来看,结果支持了这样一种观点:在基于LLM的临床推理中,KG的价值不仅在于存储医学相关事实,还在于表示这些事实是否适用于特定的患者上下文。为了可重复性和进一步研究,我们发布了整理好的QKG数据集和源代码。
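The core idea of this entry, triplet validity as a function of context rather than a global property, can be illustrated with a small sketch. Everything below is an assumption made for illustration: the `ContextualTriplet` class, the attribute-to-allowed-values constraint format, and the placeholder entities are hypothetical and are not the released QKG schema or PrimeKG data.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ContextualTriplet:
    head: str
    relation: str
    tail: str
    # Hypothetical constraint format: attribute name -> allowed values.
    constraints: Dict[str, Set[str]] = field(default_factory=dict)

    def is_valid(self, context: Dict[str, str]) -> bool:
        """A triplet counts as evidence only if every constraint matches the context."""
        return all(context.get(attr) in allowed
                   for attr, allowed in self.constraints.items())


# Placeholder entities and constraint, not taken from PrimeKG.
triplet = ContextualTriplet("drug_A", "indicated_for", "condition_B",
                            {"age_group": {"adult"}})
print(triplet.is_valid({"age_group": "adult"}))
print(triplet.is_valid({"age_group": "child"}))
```

A triplet with no constraints behaves like a standard, globally valid KG edge under this sketch, while a context that omits a constrained attribute is treated as not matching; a real system might prefer a third "unknown" outcome there.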
cs.CL / 61 / 2604.23974
Propagation Structure-Semantic Transfer Learning for Robust Fake News Detection
基于传播结构-语义迁移学习的鲁棒假新闻检测
Abstract
Fake news generally refers to false information that is spread deliberately to deceive people, which has detrimental social effects. Existing fake news detection methods primarily learn semantic features from news content or integrate structural features from propagation. However, in practical scenarios, due to the semantic ambiguity of informal language and unreliable user interaction behaviors on social media, there is inherent semantic and structural noise in news content and propagation. Although some recent works consider the effect of irrelevant user interactions in a hybrid-modeling way, they still suffer from the mutual interference between structural noise and semantic noise, leading to limited performance for robust detection. To alleviate this issue, this paper proposes a novel Propagation Structure-Semantic Transfer Learning framework (PSS-TL) for robust fake news detection under a teacher-student architecture. Specifically, we design dual teacher models to learn semantic knowledge and structural knowledge from noisy news content and propagation structure independently. In addition, we design a Multi-channel Knowledge Distillation (MKD) loss to enable the student model to acquire specialized knowledge from the teacher models, thereby avoiding mutual interference. Extensive experiments on two real-world datasets validate the effectiveness and robustness of our method.
Chinese Translation
假新闻通常指故意传播的虚假信息,旨在欺骗公众,具有严重的社会影响。现有的假新闻检测方法主要从新闻内容中学习语义特征或整合传播中的结构特征。然而,在实际场景中,由于非正式语言的语义模糊性和社交媒体上不可靠的用户互动行为,新闻内容和传播中存在固有的语义和结构噪声。尽管一些近期的研究以混合建模的方式考虑了无关用户互动的影响,但仍然受到结构噪声和语义噪声之间相互干扰的影响,导致鲁棒检测的性能有限。为了解决这一问题,本文提出了一种新颖的传播结构-语义迁移学习框架(PSS-TL),在教师-学生架构下进行鲁棒假新闻检测。具体而言,我们设计了双教师模型,独立地从噪声新闻内容和传播结构中学习语义知识和结构知识。此外,我们设计了一种多通道知识蒸馏(MKD)损失,使学生模型能够从教师模型中获取专业知识,从而避免相互干扰。在两个真实世界数据集上的大量实验验证了我们方法的有效性和鲁棒性。
cs.CL / 62 / 2604.23993
EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce
EPM-RL:用于电子商务本地产品映射的强化学习
Abstract
Product mapping, the task of deciding whether two e-commerce listings refer to the same product, is a core problem for price monitoring and channel visibility. In real marketplaces, however, sellers frequently inject promotional keywords, platform-specific tags, and bundle descriptions into titles, causing the same product to appear under many different names. Recent LLM-based and multi-agent frameworks improve robustness and interpretability on such hard cases, but they often rely on expensive external APIs, repeated retrieval, and complex inference-time orchestration, making large-scale deployment costly and difficult in privacy-sensitive enterprise settings. To address these issues, we present EPM-RL, a reinforcement-learning-based framework for building an accurate and efficient on-premise e-commerce product mapping model. Our central idea is to distill high-cost agentic reasoning into a trainable in-house model. Starting from a curated set of product pairs with LLM-generated rationales and human verification, we first perform parameter-efficient fine-tuning (PEFT) on a small student model using structured reasoning outputs. We then further optimize the model with Reinforcement Learning (RL) using an agent-based reward that jointly evaluates output-format compliance, label correctness, and reasoning-preference scores from specially designed judge models. Preliminary results show that EPM-RL consistently improves over PEFT-only training and offers a stronger quality-cost trade-off than commercial API-based baselines, while enabling private deployment and lower operational cost. These findings suggest that reinforcement learning can turn product mapping from a high-latency agentic pipeline into a scalable, inspectable, and production-ready in-house system.
Chinese Translation
产品映射,即判断两个电子商务列表是否指代同一产品的任务,是价格监控和渠道可见性的核心问题。然而,在真实市场中,卖家经常在标题中注入促销关键词、平台特定标签和捆绑描述,导致同一产品以多种不同名称出现。近期基于大型语言模型(LLM)和多智能体框架在这些复杂案例中提高了鲁棒性和可解释性,但它们通常依赖于昂贵的外部API、重复检索和复杂的推理时协调,使得在隐私敏感的企业环境中大规模部署变得成本高昂且困难。为了解决这些问题,我们提出了EPM-RL,一个基于强化学习的框架,用于构建准确且高效的本地电子商务产品映射模型。我们的核心思想是将高成本的智能推理提炼为可训练的内部模型。从一组经过策划的产品对开始,结合LLM生成的推理和人工验证,我们首先对一个小型学生模型进行参数高效微调(PEFT),使用结构化推理输出。然后,我们利用强化学习(RL)进一步优化模型,采用基于智能体的奖励机制,联合评估输出格式合规性、标签正确性以及来自特别设计的评判模型的推理偏好分数。初步结果表明,EPM-RL在PEFT单独训练的基础上持续改进,并提供比基于商业API的基线更强的质量-成本权衡,同时支持私有部署和降低运营成本。这些发现表明,强化学习可以将产品映射从高延迟的智能管道转变为可扩展、可检查且适合生产的内部系统。
cs.CL / 63 / 2604.24003
Stabilizing Efficient Reasoning with Step-Level Advantage Selection
通过步骤级优势选择稳定高效推理
Abstract
Large language models (LLMs) achieve strong reasoning performance by allocating substantial computation at inference time, often generating long and verbose reasoning traces. While recent work on efficient reasoning reduces this overhead through length-based rewards or pruning, many approaches are post-trained under a much shorter context window than base-model training, a factor whose effect has not been systematically isolated. We first show that short-context post-training alone, using standard GRPO without any length-aware objective, already induces substantial reasoning compression, but at the cost of increasingly unstable training dynamics and accuracy degradation. To address this, we propose Step-level Advantage Selection (SAS), which operates at the reasoning-step level and assigns a zero advantage to low-confidence steps in correct rollouts and to high-confidence steps in verifier-failed rollouts, where failures often arise from truncation or verifier issues rather than incorrect reasoning. Across diverse mathematical and general reasoning benchmarks, SAS improves average Pass@1 accuracy by 0.86 points over the strongest length-aware baseline while reducing average reasoning length by 16.3%, yielding a better accuracy-efficiency trade-off.
Chinese Translation
大型语言模型(LLMs)通过在推理时分配大量计算资源,实现了强大的推理性能,通常生成冗长且详细的推理轨迹。尽管近期关于高效推理的研究通过基于长度的奖励或剪枝减少了这种开销,但许多方法是在比基础模型训练时更短的上下文窗口下进行后训练的,这一因素的影响尚未得到系统性隔离。我们首先展示,仅使用标准的GRPO进行短上下文后训练,且没有任何长度感知目标,已经会导致显著的推理压缩——但代价是训练动态越来越不稳定以及准确性下降。为了解决这个问题,我们提出了步骤级优势选择(Step-level Advantage Selection, SAS),该方法在推理步骤级别上运行,并对正确回滚中的低置信度步骤赋予零优势,对验证失败回滚中的高置信度步骤也赋予零优势,后者的失败往往源于截断或验证器问题,而非错误推理。在多种数学和一般推理基准测试中,SAS在最强长度感知基线的基础上,平均提高了0.86个百分点的Pass@1准确率,同时将平均推理长度减少了16.3%,实现了更好的准确性与效率的权衡。
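The masking rule stated in this abstract (zero advantage for low-confidence steps in correct rollouts and for high-confidence steps in verifier-failed rollouts) can be sketched in a few lines. This is only an illustration of that rule as worded above: the function name, the single scalar rollout advantage, the per-step confidence scores, and the 0.5 threshold are assumptions, and the paper may define confidence and the cut-off differently.

```python
from typing import List


def select_step_advantages(advantage: float,
                           step_confidences: List[float],
                           rollout_passed: bool,
                           threshold: float = 0.5) -> List[float]:
    """Return one advantage per reasoning step, zeroing the masked steps.

    Correct rollout:          low-confidence steps get zero advantage.
    Verifier-failed rollout:  high-confidence steps get zero advantage.
    """
    per_step = []
    for conf in step_confidences:
        is_low = conf < threshold
        masked = is_low if rollout_passed else not is_low
        per_step.append(0.0 if masked else advantage)
    return per_step


# Hypothetical three-step rollouts with made-up confidence scores.
print(select_step_advantages(1.0, [0.9, 0.2, 0.7], rollout_passed=True))
print(select_step_advantages(-1.0, [0.9, 0.2, 0.7], rollout_passed=False))
```

Under this reading, a failed rollout only penalizes the steps the model was unsure about, which fits the abstract's remark that many failures come from truncation or verifier issues rather than from confidently wrong reasoning.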
cs.CL / 64 / 2604.24026
From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills
从技能文本到技能结构:代理技能的调度-结构-逻辑表示
Abstract
LLM agents increasingly rely on reusable skills, capability packages that combine instructions, control flow, constraints, and tool calls. In most current agent systems, however, skills are still represented by text-heavy artifacts, including SKILL.md-style documents and structured records whose machine-usable evidence remains embedded largely in natural-language descriptions. This poses a challenge for skill-centered agent systems: managing skill collections and using skills to support agents both require reasoning over invocation interfaces, execution structure, and concrete side effects that are often entangled in a single textual surface. An explicit representation of skill knowledge may therefore help make these artifacts easier for machines to acquire and leverage. Drawing on Memory Organization Packets, Script Theory, and Conceptual Dependency from Schank and Abelson's classical work on linguistic knowledge representation, we introduce what is, to our knowledge, the first structured representation for agent skill artifacts that disentangles skill-level scheduling signals, scene-level execution structure, and logic-level action and resource-use evidence: the Scheduling-Structural-Logical (SSL) representation. We instantiate SSL with an LLM-based normalizer and evaluate it on a corpus of skills in two tasks, Skill Discovery and Risk Assessment, where it outperforms the text-only baselines: in Skill Discovery, SSL improves MRR from 0.573 to 0.707; in Risk Assessment, it improves macro F1 from 0.744 to 0.787. These findings suggest that explicit, source-grounded structure makes agent skills easier to search and review. They also suggest that SSL is best understood as a practical step toward more inspectable, reusable, and operationally actionable skill representations for agent systems, rather than as a finished standard or an end-to-end mechanism for managing and using skills.
Chinese Translation
大型语言模型(LLM)代理越来越依赖可重用技能,这些能力包结合了指令、控制流、约束和工具调用。然而,在大多数当前的代理系统中,技能仍然通过文本密集型的文档表示,包括SKILL.md风格的文档和结构化记录,其机器可用的证据仍然主要嵌入在自然语言描述中。这对以技能为中心的代理系统提出了挑战:管理技能集合和使用技能来支持代理都需要对调用接口、执行结构和具体副作用进行推理,而这些通常纠缠在单一的文本表面中。因此,技能知识的明确表示可能有助于使这些文档更易于机器获取和利用。借鉴记忆组织包(Memory Organization Packets)、脚本理论(Script Theory)以及Schank和Abelson在语言知识表示方面的经典工作中的概念依赖(Conceptual Dependency),我们介绍了我们所知的第一个用于代理技能文档的结构化表示,它解开了技能级调度信号、场景级执行结构和逻辑级行动与资源使用证据:调度-结构-逻辑(Scheduling-Structural-Logical,SSL)表示。我们用基于LLM的标准化工具实例化SSL,并在两个任务的技能语料库上进行评估,即技能发现(Skill Discovery)和风险评估(Risk Assessment),并显著超越了仅基于文本的基线:在技能发现中,SSL将平均排名率(MRR)从0.573提高到0.707;在风险评估中,将宏观F1值从0.744提高到0.787。这些发现表明,明确的、源基础的结构使得代理技能更易于搜索和审查。它们还表明,SSL最好被理解为朝着更可检查、可重用和可操作的代理系统技能表示迈出的实际一步,而不是作为管理和使用技能的最终标准或端到端机制。
cs.CL / 65 / 2604.24040
Improving Robustness of Tabular Retrieval via Representational Stability
通过表示稳定性提高表格检索的鲁棒性
Abstract
Transformer-based table retrieval systems flatten structured tables into token sequences, making retrieval sensitive to the choice of serialization even when table semantics remain unchanged. We show that semantically equivalent serializations, such as csv, tsv, html, markdown, and ddl, can produce substantially different embeddings and retrieval results across multiple benchmarks and retriever families. To address this instability, we treat the embeddings of a table's serializations as noisy views of a shared semantic signal and use their centroid as a canonical target representation. We show that centroid averaging suppresses format-specific variation and can recover the semantic content common to different serializations when format-induced shifts differ across tables. Empirically, centroid representations outrank individual formats in aggregate pairwise comparisons across MPNet, BGE-M3, ReasonIR, and SPLADE. We further introduce a lightweight residual bottleneck adapter on top of a frozen encoder that maps single-serialization embeddings towards centroid targets while preserving variance and enforcing covariance regularization. The adapter improves robustness for several dense retrievers, though gains are model-dependent and weaker for sparse lexical retrieval. These results identify serialization sensitivity as a major source of retrieval variance and show the promise of post hoc geometric correction for serialization-invariant table retrieval. Our code, datasets, and models are available at https://github.com/KBhandari11/Centroid-Aligned-Table-Retrieval.
Chinese Translation
基于Transformer的表格检索系统将结构化表格扁平化为令牌序列,使得检索对序列化的选择敏感,即使表格语义保持不变。我们展示了语义上等价的序列化格式,如 csv、tsv、html、markdown 和 ddl,在多个基准和检索器家族中可以产生显著不同的嵌入和检索结果。为了解决这种不稳定性,我们将序列化嵌入视为共享语义信号的噪声视图,并使用其质心作为规范目标表示。我们证明了质心平均可以抑制格式特定的变化,并能够恢复不同序列化格式中共同的语义内容,当格式引起的变化在不同表格中有所不同时。实证结果表明,在 MPNet、BGE-M3、ReasonIR 和 SPLADE 的聚合成对比较中,质心表示优于单一格式。我们进一步在一个冻结的编码器上引入了一个轻量级的残差瓶颈适配器,该适配器将单一序列化嵌入映射到质心目标,同时保持方差并强制执行协方差正则化。该适配器提高了几种稠密检索器的鲁棒性,尽管增益依赖于模型,并且在稀疏词汇检索中较弱。这些结果将序列化敏感性识别为检索方差的主要来源,并展示了事后几何校正在序列化不变表格检索中的潜力。我们的代码、数据集和模型可在 https://github.com/KBhandari11/Centroid-Aligned-Table-Retrieval 获取。
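The centroid idea in this entry can be sketched without any retrieval library: embed the same table under several serializations, then average the vectors. The toy 3-dimensional embeddings below are made up for illustration, and the helper names `centroid` and `cosine` are hypothetical; a real setup would obtain the vectors from an encoder such as those named in the abstract.

```python
import math
from typing import Dict, List


def centroid(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of equally sized embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up embeddings of one table under three serializations.
views: Dict[str, List[float]] = {
    "csv": [1.0, 0.0, 0.2],
    "markdown": [0.8, 0.2, 0.0],
    "html": [0.9, -0.2, 0.1],
}
target = centroid(list(views.values()))
for name, vec in views.items():
    print(name, round(cosine(vec, target), 3))
```

In this toy case the format-specific offsets in the second and third components partly cancel in the average, which is the intuition behind using the centroid as the canonical target for the adapter.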
cs.CL / 66 / 2604.24070
Distilling Self-Consistency into Verbal Confidence: A Pre-Registered Negative Result and Post-Hoc Rescue on Gemma 3 4B
将自我一致性提炼为语言信心:一项预注册的负结果及对Gemma 3 4B的事后挽救
Abstract
Small instruct-tuned LLMs produce degenerate verbal confidence under minimal elicitation: ceiling rates above 95%, near-chance Type-2 AUROC, and Invalid validity profiles. We test whether confidence-conditioned supervised fine-tuning (CSFT) with self-consistency-derived targets can close the gap between internal information and verbal readout. A pre-registered Phase 0 protocol on Gemma 3 4B-it with a modal filter restricting training to items with correct modal answers produced a negative result: AUROC2 dropped from 0.554 to 0.509 due to label-entropy collapse in the training targets. An exploratory rescue removed the filter, training on all 2,000 calibration items. This produced a binary verbal correctness discriminator with AUROC2 = 0.774 on held-out TriviaQA, compressing a 10-sample self-consistency signal (AUROC2 = 0.999) into a single-pass readout exceeding logit entropy (0.701). The shuffled-target control showed no improvement (0.501). On MMLU, accuracy improved from 54.2% to 77.4% with the shuffled model at baseline (56.1%), supporting a target-dependent interpretation. The result is exploratory, binary rather than continuously calibrated, and observed at a single scale. It identifies two design lessons: confidence training requires label entropy, and correct targets regularise output format.
Chinese Translation
小型指令调优的语言模型(LLMs)在最小引导下产生退化的语言信心:超过95%的上限率、接近随机的类型2 AUROC,以及无效的有效性特征。我们测试了是否可以通过自我一致性衍生目标的信心条件监督微调(CSFT)来缩小内部信息与语言输出之间的差距。在Gemma 3 4B-it上进行的预注册阶段0协议中,使用限制训练在正确模式答案项上的模态过滤器产生了负结果:由于训练目标中的标签熵崩溃,AUROC2从0.554降至0.509。一次探索性挽救移除了过滤器,对所有2000个校准项进行了训练。这产生了一个二元语言正确性判别器,在保留的TriviaQA上AUROC2为0.774,将10个样本的自我一致性信号(AUROC2 = 0.999)压缩为超过对数熵(0.701)的单次输出。洗牌目标对照组未显示改善(0.501)。在MMLU上,准确率从54.2%提高到77.4%,而洗牌模型的基线为56.1%,支持目标依赖的解释。该结果是探索性的,呈现二元而非连续校准,并在单一尺度上观察到。它识别出两个设计教训:信心训练需要标签熵,而正确的目标则规范输出格式。
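Type-2 AUROC, the headline metric in this entry, measures how well stated confidence separates correct from incorrect answers. A minimal sketch using the standard pairwise definition (the probability that a randomly chosen correct item receives higher confidence than a randomly chosen incorrect one, with ties counted as one half) is given below; the function name and the toy inputs are illustrative assumptions, not the paper's evaluation code.

```python
from typing import List


def type2_auroc(confidences: List[float], correct: List[bool]) -> float:
    """Pairwise AUROC of confidence as a predictor of correctness."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


# Confidence stuck at the ceiling cannot discriminate: every pair is a tie.
print(type2_auroc([0.95, 0.95, 0.95, 0.95], [True, False, True, False]))
# Confidence that tracks correctness separates the two groups.
print(type2_auroc([0.9, 0.8, 0.3, 0.2], [True, True, False, False]))
```

The first call illustrates why a model that reports near-ceiling confidence on almost every item ends up with near-chance AUROC, regardless of its accuracy.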
cs.CL / 67 / 2604.24071
PeeriScope: A Multi-Faceted Framework for Evaluating Peer Review Quality
PeeriScope:一个多维度的同行评审质量评估框架
Abstract
The increasing scale and variability of peer review in scholarly venues has created an urgent need for systematic, interpretable, and extensible tools to assess review quality. We present PeeriScope, a modular platform that integrates structured features, rubric-guided large language model assessments, and supervised prediction to evaluate peer review quality along multiple dimensions. Designed for openness and integration, PeeriScope provides both a public interface and a documented API, supporting practical deployment and research extensibility. The demonstration illustrates its use for reviewer self-assessment, editorial triage, and large-scale auditing, and it enables the continued development of quality evaluation methods within scientific peer review. PeeriScope is available both as a live demo at https://app.reviewer.ly/app/peeriscope and via API services at https://github.com/Reviewerly-Inc/Peeriscope.
Chinese Translation
学术领域中同行评审的规模和多样性日益增加,迫切需要系统化、可解释和可扩展的工具来评估评审质量。我们提出了PeeriScope,一个模块化平台,整合了结构化特征、基于评分标准的大型语言模型评估和监督预测,以多维度评估同行评审质量。PeeriScope旨在开放和集成,提供公共接口和文档化的API,支持实际部署和研究的可扩展性。演示展示了其在评审者自我评估、编辑筛选和大规模审计中的应用,并促进了科学同行评审中质量评估方法的持续发展。PeeriScope可通过https://app.reviewer.ly/app/peeriscope访问实时演示,也可通过API服务访问https://github.com/Reviewerly-Inc/Peeriscope。
cs.CL / 68 / 2604.24074
How Sensitive Are Safety Benchmarks to Judge Configuration Choices?
安全基准对评估配置选择的敏感性有多大?
Abstract
Safety benchmarks such as HarmBench rely on LLM judges to classify model responses as harmful or safe, yet the judge configuration, namely the combination of judge model and judge prompt, is typically treated as a fixed implementation detail. We show this assumption is problematic. Using a 2 x 2 x 3 factorial design, we construct 12 judge prompt variants along two axes, evaluation structure and instruction framing, and apply them using a single judge model, Claude Sonnet 4-6, producing 28,812 judgments over six target models and 400 HarmBench behaviors. We find that prompt wording alone, holding the judge model fixed, shifts measured harmful-response rates by up to 24.2 percentage points, with even within-condition surface rewording causing swings of up to 20.1 percentage points. Model safety rankings are moderately unstable, with mean Kendall tau = 0.89, and category-level sensitivity ranges from 39.6 percentage points for copyright to 0 percentage points for harassment. A supplementary multi-judge experiment using three judge models shows that judge-model choice adds further variance. Our results demonstrate that judge prompt wording is a substantial, previously under-examined source of measurement variance in safety benchmarking.
Chinese Translation
安全基准如 HarmBench 依赖于大型语言模型(LLM)评估者对模型响应进行有害或安全的分类,但评估者配置,即评估模型与评估提示的组合,通常被视为固定的实现细节。我们表明这一假设是有问题的。通过使用 2 x 2 x 3 的因子设计,我们构建了 12 种评估提示变体,沿着两个轴线:评估结构和指令框架,并使用单一评估模型 Claude Sonnet 4-6 应用这些变体,产生了对六个目标模型和 400 个 HarmBench 行为的 28,812 次判断。我们发现,仅评估提示的措辞,在固定评估模型的情况下,测量的有害响应率最多可变化 24.2 个百分点,甚至在同一条件下的表面重述也会导致最多 20.1 个百分点的波动。模型安全排名中等不稳定,平均 Kendall tau = 0.89,类别级别的敏感性从版权的 39.6 个百分点到骚扰的 0 个百分点不等。使用三个评估模型的补充多评估者实验表明,评估模型的选择进一步增加了方差。我们的结果表明,评估提示的措辞是安全基准测量中一个重要且之前未被充分研究的方差来源。
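The 2 x 2 x 3 design in this entry yields 12 prompt variants, and the reported sensitivity is a spread in percentage points across them. The sketch below enumerates such a grid and computes a max-minus-min spread. The level names are placeholders, and reading the third factor as three surface rewordings per condition is an assumption based on the abstract's mention of "within-condition surface rewording"; the rates in the usage example are invented.

```python
from itertools import product
from typing import Dict, Tuple

# Placeholder level names; the paper's actual labels are not given in the abstract.
structures = ["structure_a", "structure_b"]
framings = ["framing_a", "framing_b"]
rewordings = ["wording_1", "wording_2", "wording_3"]

# Full factorial grid: one judge prompt variant per combination.
variants = list(product(structures, framings, rewordings))
print(len(variants))


def spread_pp(rates: Dict[Tuple[str, ...], float]) -> float:
    """Max minus min harmful-response rate across variants, in percentage points."""
    values = list(rates.values())
    return (max(values) - min(values)) * 100


# Invented rates for two variants, purely to show the calculation.
example = {variants[0]: 0.10, variants[1]: 0.35}
print(round(spread_pp(example), 1))
```

Holding the judge model fixed and varying only the key of this grid is what isolates prompt wording as the source of variance in the design described above.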
cs.CL / 69 / 2604.24079
The Pragmatic Persona: Discovering LLM Persona through Bridging Inference
务实的人格:通过桥接推理发现大型语言模型的人格
Abstract
Large Language Models (LLMs) reveal inherent and distinctive personas through dialogue. However, most existing persona discovery approaches rely on surface-level lexical or stylistic cues, treating dialogue as a flat sequence of tokens and failing to capture the deeper discourse-level structures that sustain persona consistency. To address this limitation, we propose a novel analytical framework that interprets LLM dialogue through bridging inference, that is, implicit conceptual relations that connect utterances via shared world knowledge and discourse coherence. By modeling these relations as structured knowledge graphs, our approach captures latent semantic links that govern how LLMs organize meaning across turns, enabling persona discovery at the level of discourse coherence rather than surface realizations. Experimental results across multiple reasoning backbones and target LLMs, ranging from small-scale models to 80B-parameter systems, demonstrate that bridging-inference graphs yield significantly stronger semantic coherence and more stable persona identification than frequency- or style-based baselines. These results show that persona traits are consistently encoded in the structural organization of discourse rather than in isolated lexical patterns. This work presents a systematic framework for probing, extracting, and visualizing latent LLM personas through the lens of Cognitive Discourse Theory, bridging computational linguistics, cognitive semantics, and persona reasoning in large language models. Code is available at https://github.com/JiSoo-Yang/Persona_Bridging.git
Chinese Translation
大型语言模型(LLMs)通过对话揭示其固有且独特的人格。然而,大多数现有的人格发现方法依赖于表层的词汇或风格线索,将对话视为一个平坦的符号序列,未能捕捉维持人格一致性的更深层次话语结构。为了解决这一局限性,我们提出了一种新颖的分析框架,通过桥接推理来解读LLM对话——即通过共享的世界知识和话语连贯性连接话语的隐含概念关系。通过将这些关系建模为结构化知识图谱,我们的方法捕捉到潜在的语义联系,这些联系决定了LLMs如何在对话轮次中组织意义,从而使人格发现能够在话语连贯性层面而非表面实现层面进行。跨多个推理基础和目标LLM的实验结果,从小规模模型到80B参数系统,表明桥接推理图在语义连贯性和人格识别的稳定性方面显著优于基于频率或风格的基线。这些结果表明,人格特征在话语的结构组织中始终被编码,而非孤立的词汇模式。本研究提出了一个系统框架,通过认知话语理论的视角探测、提取和可视化潜在的LLM人格,桥接计算语言学、认知语义学和大型语言模型的人格推理。代码可在 https://github.com/JiSoo-Yang/Persona_Bridging.git 获取。
cs.CL / 70 / 2604.24089
BiMol-Diff: A Unified Diffusion Framework for Molecular Generation and Captioning
BiMol-Diff:一种用于分子生成和描述的统一扩散框架
Abstract
Bridging molecular structures and natural language is essential for controllable design. Autoregressive models struggle with long-range dependencies, while standard diffusion processes apply uniform corruption across positions, which can distort structurally informative tokens. We present BiMol-Diff, a unified diffusion framework for the paired tasks of text-conditioned molecule generation and molecule captioning. Our key component is a token-aware noise schedule that assigns position-dependent corruption based on token recovery difficulty, preserving harder-to-recover substructures during the forward process. On ChEBI-20 and M3-20M, BiMol-Diff improves molecule reconstruction with a 15.4% relative gain in Exact Match and achieves strong captioning results, attaining the best BLEU and BERTScore among the compared baselines. These results indicate that token-aware noising improves fidelity in molecular structure-language modelling.
Chinese Translation
连接分子结构与自然语言对于可控设计至关重要。自回归模型在处理长距离依赖时表现不佳,而标准扩散过程在各个位置上施加均匀的损坏,这可能会扭曲结构信息丰富的标记。我们提出了BiMol-Diff,一种用于文本条件分子生成和分子描述的配对任务的统一扩散框架。我们的关键组件是一个基于标记的噪声调度,它根据标记恢复的难度分配位置相关的损坏,在前向过程中保留难以恢复的子结构。在ChEBI-20和M3-20M数据集上,BiMol-Diff在分子重建上实现了15.4%的相对提升,并在描述任务中取得了优异的结果,在比较的基线中获得了最佳的BLEU和BERTScore。这些结果表明,基于标记的噪声处理提高了分子结构与语言建模的保真度。
cs.CL / 71 / 2604.24104
Factual and Edit-Sensitive Graph-to-Sequence Generation via Graph-Aware Adaptive Noising
基于图的序列生成中的事实与编辑敏感性:图感知自适应噪声方法
Abstract
Fine-tuned autoregressive models for graph-to-sequence generation (G2S) often struggle with factual grounding and edit sensitivity. To tackle these issues, we propose a non-autoregressive diffusion framework, named Diffusion Language Model for Graphs (DLM4G), that generates text by iterative refinement conditioned on an input graph. By aligning graph components (entities/relations) with their corresponding sequence tokens, DLM4G employs an adaptive noising strategy. The proposed strategy uses per-token denoising error as a signal to adaptively modulate noise on entity and relation tokens, improving preservation of graph structure and enabling localized updates under graph edits. Evaluated on three datasets, DLM4G consistently outperforms competitive G2S diffusion baselines trained on identical splits across both surface-form and embedding-based metrics. DLM4G further exceeds fine-tuned autoregressive baselines up to 12x larger (e.g., T5-Large) and is competitive with zero-shot LLM transfer baselines up to 127x larger. Relative to the strongest fine-tuned PLM baseline, DLM4G improves factual grounding by +5.16% and edit sensitivity (ESR) by +7.9%; compared to the best diffusion baseline, it yields gains of +3.75% in factual grounding and +23.6% in ESR. We additionally demonstrate applicability beyond textual graphs through experiments on molecule captioning, indicating the method's generality for scientific G2S generation.
Chinese Translation
针对图到序列生成(G2S)的微调自回归模型在事实基础和编辑敏感性方面的挑战,我们提出了一种非自回归扩散框架,通过基于输入图的迭代优化生成文本,称为图的扩散语言模型(Diffusion Language Model for Graphs, DLM4G)。DLM4G通过将图组件(实体/关系)与其对应的序列标记对齐,采用了一种自适应噪声策略。该策略利用每个标记的去噪误差作为信号,自适应调节实体和关系标记上的噪声,从而提高图结构的保留,并在图编辑时实现局部更新。在三个数据集上的评估中,DLM4G在表面形式和基于嵌入的指标上,始终优于在相同划分上训练的竞争性G2S扩散基线。DLM4G的表现甚至超过了高达12倍(例如,T5-Large)更大的微调自回归基线,并在与高达127倍更大的零-shot LLM迁移基线的比较中表现出竞争力。相较于最强的微调预训练语言模型基线,DLM4G在事实基础上提高了+5.16%,在编辑敏感性(ESR)上提高了+7.9%;与最佳扩散基线相比,事实基础提高了+3.75%,ESR提高了+23.6%。我们还通过分子描述实验展示了该方法在文本图之外的适用性,表明其在科学G2S生成中的广泛适用性。
cs.CL / 72 / 2604.24114
IRIS: Interleaved Reinforcement with Incremental Staged Curriculum for Cross-Lingual Mathematical Reasoning
IRIS:交错强化与增量分阶段课程用于跨语言数学推理
Abstract
Curriculum learning helps language models tackle complex reasoning by gradually increasing task difficulty. However, it often fails to generate consistent step-by-step reasoning, especially in multilingual and low-resource settings where cross-lingual transfer from English to Indian languages remains limited. We propose IRIS: Interleaved Reinforcement with Incremental Staged Curriculum, a two-axis framework that combines Supervised Fine-Tuning on progressively harder problems (vertical axis) with Reverse Curriculum Reinforcement Learning to reduce reliance on step-by-step guidance (horizontal axis). We design a composite reward combining correctness, step-wise alignment, continuity, and numeric incentives, optimized via Group Relative Policy Optimization (GRPO). We release CL-Math, a dataset of 29k problems with step-level annotations in English, Hindi, and Marathi. Across standard benchmarks and curated multilingual test sets, IRIS consistently improves performance, with strong results on math reasoning tasks and substantial gains in low-resource and bilingual settings, alongside modest improvements in high-resource languages.
Chinese Translation
课程学习通过逐步增加任务难度,帮助语言模型应对复杂推理。然而,它往往无法生成一致的逐步推理,特别是在多语言和低资源环境中,英语到印度语言的跨语言迁移仍然有限。我们提出了IRIS:交错强化与增量分阶段课程,这是一种双轴框架,结合了在逐渐更难的问题上进行的监督微调(纵轴)与反向课程强化学习,以减少对逐步指导的依赖(横轴)。我们设计了一种复合奖励,结合了正确性、逐步对齐、连续性和数值激励,通过群体相对策略优化(Group Relative Policy Optimization, GRPO)进行优化。我们发布了CL-Math,一个包含29000个问题的数据库,具有英语、印地语和马拉地语的逐步注释。在标准基准测试和精心策划的多语言测试集中,IRIS始终提高了性能,在数学推理任务上取得了良好的结果,并在低资源和双语环境中实现了显著提升,同时在高资源语言中也有适度改善。
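The IRIS entry above mentions a composite reward optimized with Group Relative Policy Optimization (GRPO). The following is a minimal sketch of that idea, not the authors' implementation: the weights in `composite_reward` are hypothetical, and the advantage normalization (reward minus group mean, divided by the group's population standard deviation) is the commonly described GRPO form, assumed here rather than taken from the paper.

```python
from statistics import mean, pstdev


def composite_reward(correct, step_align, continuity, numeric,
                     weights=(1.0, 0.5, 0.25, 0.25)):
    """Weighted sum of the four reward terms named in the abstract.

    The weights are illustrative assumptions, not values from the paper.
    """
    parts = (correct, step_align, continuity, numeric)
    return sum(w * p for w, p in zip(weights, parts))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled solutions for one problem, scored with the composite reward.
group = [
    composite_reward(1.0, 0.8, 1.0, 1.0),
    composite_reward(0.0, 0.4, 0.5, 0.0),
    composite_reward(1.0, 0.2, 0.5, 1.0),
    composite_reward(0.0, 0.0, 0.0, 0.0),
]
advantages = group_relative_advantages(group)
```

Because advantages are computed relative to the group, no separate value network is needed; a sample is rewarded only for being better than its siblings.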
cs.CL / 73 / 2604.24126
Psychologically-Grounded Graph Modeling for Interpretable Depression Detection
基于心理学的图模型用于可解释的抑郁症检测
Abstract
Automatic depression detection from conversational interactions holds significant promise for scalable screening but remains hindered by severe data scarcity and a lack of clinical interpretability. Existing approaches typically rely on black-box deep learning architectures that struggle to model the subtle, temporal evolution of depressive symptoms or account for participant-specific heterogeneity. In this work, we propose PsyGAT (Psychological Graph Attention Network), a psychologically grounded framework that models conversational sessions as dynamic temporal graphs. We introduce Psychological Expression Units (PEUs) to explicitly encode utterance-level clinical evidence, structuring the session graph to capture transitions in psychological states rather than mere semantic dependencies. To address the critical class imbalance in depression datasets, we employ clinically approved persona-based data augmentation, enabling robust model learning. Additionally, we integrate session-level personality context directly into the graph structure to disentangle trait-based behavior from acute depressive symptoms. PsyGAT achieves state-of-the-art performance, surpassing both strong graph-based baselines and closed-source LLMs like GPT-5, achieving 89.99 and 71.37 Macro F1 scores in DAIC-WoZ and E-DAIC, respectively. We further introduce Causal-PsyGAT, an interpretability module that identifies symptom triggers. Experiments show a 20% improvement in MRR for identifying causal indicators, effectively bridging the gap between depression monitoring and clinical explainability. The full augmented dataset is publicly available at https://doi.org/10.6084/m9.figshare.31801921.
Chinese Translation
从对话互动中自动检测抑郁症具有可扩展筛查的重大潜力,但仍受到严重数据稀缺和缺乏临床可解释性的限制。现有方法通常依赖于黑箱深度学习架构,这些架构难以建模抑郁症状的微妙、时间演变或考虑参与者特定的异质性。在本研究中,我们提出了PsyGAT(心理图注意网络),这是一个基于心理学的框架,将对话会话建模为动态时间图。我们引入了心理表达单元(PEUs),以明确编码发言级别的临床证据,构建会话图以捕捉心理状态的转变,而不仅仅是语义依赖关系。为了解决抑郁症数据集中关键的类别不平衡问题,我们采用了临床批准的基于角色的数据增强,促进了稳健的模型学习。此外,我们将会话级个性上下文直接整合到图结构中,以区分基于特征的行为与急性抑郁症状。PsyGAT实现了最先进的性能,超越了强大的基于图的基线和闭源的LLM(如GPT-5),在DAIC-WoZ和E-DAIC中分别达到了89.99和71.37的宏F1分数。我们进一步引入了Causal-PsyGAT,一个可解释性模块,用于识别症状触发因素。实验表明,在识别因果指标方面,MRR提高了20%,有效弥合了抑郁监测与临床可解释性之间的差距。完整的增强数据集可在https://doi.org/10.6084/m9.figshare.31801921公开获取。
cs.CL / 74 / 2604.24175
AdapTime: Enabling Adaptive Temporal Reasoning in Large Language Models
AdapTime:在大型语言模型中实现自适应时间推理
Abstract
Large language models have demonstrated strong reasoning capabilities in general knowledge question answering. However, their ability to handle temporal information remains limited. To address this limitation, existing approaches often involve external tools or manual verification and are tailored to specific scenarios, leading to poor generalizability. Moreover, these methods apply a fixed pipeline to all questions, overlooking the fact that different types of temporal questions require distinct reasoning strategies, which leads to unnecessary processing for simple cases and inadequate reasoning for complex ones. To this end, we propose AdapTime, an adaptive temporal reasoning method that dynamically executes reasoning steps based on the input context. Specifically, it involves three temporal reasoning actions: reformulate, rewrite and review, with an LLM planner guiding the reasoning process. AdapTime integrates seamlessly with state-of-the-art LLMs and significantly enhances their temporal reasoning capabilities without relying on external support. Extensive experiments demonstrate the effectiveness of our approach.
Chinese Translation
大型语言模型在一般知识问答中展示了强大的推理能力。然而,它们处理时间信息的能力仍然有限。为了解决这一局限性,现有的方法通常涉及外部工具或手动验证,并且针对特定场景进行调整,导致其泛化能力较差。此外,这些方法对所有问题应用固定的处理流程,忽视了不同类型的时间问题需要不同的推理策略,这导致在简单案例中产生不必要的处理,而在复杂案例中推理不足。为此,我们提出了AdapTime,一种自适应时间推理方法,能够根据输入上下文动态执行推理步骤。具体而言,它包括三个时间推理动作:重构(reformulate)、改写(rewrite)和审查(review),并由LLM规划器(LLM planner)引导推理过程。AdapTime与最先进的LLM无缝集成,显著增强其时间推理能力,而无需依赖外部支持。大量实验验证了我们方法的有效性。
cs.CL / 75 / 2604.24179
MemeScouts@LT-EDI 2026: Asking the Right Questions -- Prompted Weak Supervision for Meme Hate Speech Detection
MemeScouts@LT-EDI 2026:提出正确的问题——针对表情包仇恨言论检测的提示式弱监督
Abstract
Detecting hate speech in memes is challenging due to their multimodal nature and subtle, culturally grounded cues such as sarcasm and context. While recent vision-language models (VLMs) enable joint reasoning over text and images, end-to-end prompting can be brittle, as a single prediction must resolve target, stance, implicitness, and irony. These challenges are amplified in multilingual settings. We propose a prompted weak supervision (PWS) approach that decomposes meme understanding into targeted, question-based labeling functions with constrained answer options for homophobia and transphobia detection in the LT-EDI 2026 shared task. Using a quantized Qwen3-VLM to extract features by answering targeted questions, our method outperforms direct VLM classification, with substantial gains for Chinese and Hindi, ranking 1st in English, 2nd in Chinese, and 3rd in Hindi. Iterative refinement via error-driven LF expansion and feature pruning reduces redundancy and improves generalization. Our results highlight the effectiveness of prompted weak supervision for multilingual multimodal hate speech detection.
Chinese Translation
由于表情包的多模态特性及其微妙的、文化根植的线索(如讽刺和语境),检测表情包中的仇恨言论具有挑战性。尽管最近的视觉-语言模型(VLMs)能够对文本和图像进行联合推理,但端到端的提示可能不够稳健,因为单一预测必须解决目标、立场、隐含性和讽刺性等多个方面。这些挑战在多语言环境中更为突出。我们提出了一种提示式弱监督(PWS)方法,将表情包理解分解为针对性的问题基础标注功能,并为仇恨言论检测(包括对同性恋和跨性别者的仇恨)提供有限的答案选项,应用于LT-EDI 2026共享任务。通过使用量化的Qwen3-VLM回答针对性问题提取特征,我们的方法在直接的VLM分类中表现更佳,尤其在中文和印地语上取得了显著提升,在英语中排名第一,在中文中排名第二,在印地语中排名第三。通过基于错误驱动的标注功能扩展和特征修剪的迭代优化,减少了冗余并提高了泛化能力。我们的结果突显了提示式弱监督在多语言多模态仇恨言论检测中的有效性。
cs.CL / 76 / 2604.24186
MultiDx: A Multi-Source Knowledge Integration Framework towards Diagnostic Reasoning
MultiDx:面向诊断推理的多源知识集成框架
Abstract
Diagnostic prediction and clinical reasoning are critical tasks in healthcare applications. While Large Language Models (LLMs) have shown strong capabilities in commonsense reasoning, they still struggle with diagnostic reasoning due to limited domain knowledge. Existing approaches often rely on internal model knowledge or static knowledge bases, resulting in knowledge insufficiency and limited adaptability, which hinder their capacity to perform diagnostic reasoning. Moreover, these methods focus solely on the accuracy of final predictions, overlooking alignment with standard clinical reasoning trajectories. To this end, we propose MultiDx, a two-stage diagnostic reasoning framework that performs differential diagnosis by analyzing evidence collected from multiple knowledge sources. Specifically, it first generates suspected diagnoses and reasoning paths by leveraging knowledge from web search, SOAP-formatted cases, and a clinical case database. Then it integrates multi-perspective evidence through matching, voting, and differential diagnosis to generate the final prediction. Extensive experiments on two public benchmarks demonstrate the effectiveness of our approach.
Chinese Translation
诊断预测和临床推理是医疗应用中的关键任务。尽管大型语言模型(LLMs)在常识推理方面表现出强大的能力,但由于领域知识的有限性,它们在诊断推理方面仍然面临挑战。现有方法通常依赖于内部模型知识或静态知识库,导致知识不足和适应性有限,从而妨碍其进行诊断推理的能力。此外,这些方法仅关注最终预测的准确性,而忽视了与标准临床推理轨迹的一致性。为此,我们提出了MultiDx,一个两阶段的诊断推理框架,通过分析来自多个知识源的证据来进行鉴别诊断。具体而言,它首先通过利用来自网络搜索、SOAP格式病例和临床病例数据库的知识生成可疑诊断和推理路径。然后,通过匹配、投票和鉴别诊断整合多角度证据,以生成最终预测。在两个公共基准上的大量实验表明我们方法的有效性。
cs.CL / 77 / 2604.24197
Seeing Is No Longer Believing: Frontier Image Generation Models, Synthetic Visual Evidence, and Real-World Risk
眼见不再为实:前沿图像生成模型、合成视觉证据与现实世界风险
Abstract
Frontier image generation has moved from artistic synthesis toward synthetic visual evidence. Systems such as GPT Image 2, Nano Banana Pro, Nano Banana 2, Grok Imagine, Qwen Image 2.0 Pro, and Seedream 5.0 Lite combine photorealistic rendering, readable typography, reference consistency, editing control, and in several cases reasoning or search-grounded image construction. These capabilities create large benefits for design, education, accessibility, and communication, yet they also weaken one of society's most common trust shortcuts: the belief that a plausible picture is a reliable record. This paper provides a source-grounded technical and policy analysis of synthetic visual risk. We first summarize the public capabilities of recent image models, then analyze public incidents involving fake crisis images, celebrity and public-figure imagery, medical scans, forged-looking documents, synthetic screenshots, phishing assets, and market-moving rumors. We introduce a capability-weighted risk framework that links model affordances to real-world harm in finance, medicine, news, law, emergency response, identity verification, and civic discourse. Our findings show that risk is driven less by photorealism alone than by the convergence of realism, legible text, identity persistence, fast iteration, and distribution context. We argue for layered control: model-side restrictions, cryptographic provenance, visible labeling, platform friction, sector-grade verification, and incident response. The paper closes with practical recommendations for model providers, platforms, newsrooms, financial institutions, healthcare systems, legal organizations, regulators, and ordinary users.
Chinese Translation
前沿图像生成已从艺术合成转向合成视觉证据。诸如 GPT Image 2、Nano Banana Pro、Nano Banana 2、Grok Imagine、Qwen Image 2.0 Pro 和 Seedream 5.0 Lite 等系统结合了照片级真实感渲染、可读的排版、参考一致性、编辑控制,并在多个案例中实现了基于推理或搜索的图像构建。这些能力为设计、教育、无障碍访问和沟通带来了巨大的好处,但也削弱了社会中最常见的信任捷径之一:即相信一个看似可信的图像是可靠记录的信念。本文提供了合成视觉风险的基于来源的技术和政策分析。我们首先总结了近期图像模型的公共能力,然后分析了涉及虚假危机图像、名人和公众人物影像、医学扫描、伪造文件、合成截图、网络钓鱼资产和市场波动谣言的公共事件。我们引入了一个能力加权风险框架,将模型的可用性与金融、医学、新闻、法律、应急响应、身份验证和公民话语中的现实世界危害联系起来。我们的研究结果表明,风险的驱动因素不仅仅是照片级真实感,而是现实主义、可读文本、身份持久性、快速迭代和分发背景的融合。我们主张采取分层控制:模型端限制、加密来源、可见标签、平台摩擦、行业级验证和事件响应。本文最后为模型提供者、平台、新闻机构、金融机构、医疗系统、法律组织、监管机构和普通用户提出了实用建议。
cs.CL / 78 / 2604.24198
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
奖励科学过程:面向代理数据分析的过程级奖励建模
Abstract
Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present an empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors, logical flaws that yield incorrect results without triggering interpreter exceptions, and erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) can serve as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strategy that distinguishes between correctable grounding errors and irrecoverable mistakes. We design a scalable pipeline to construct over 8K high-quality training instances for DataPRM via diversity-driven trajectory generation and knowledge-augmented step-level annotation. Experimental results demonstrate that DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep using Best-of-N inference. Notably, with only 4B parameters, DataPRM outperforms strong baselines and exhibits robust generalizability across diverse Test-Time Scaling strategies. Furthermore, integrating DataPRM into Reinforcement Learning yields substantial gains over outcome-reward baselines, achieving 78.73% on DABench and 64.84% on TableBench, validating the effectiveness of process reward supervision. Code is available at https://github.com/zjunlp/DataMind.
Chinese Translation
过程奖励模型(PRMs)在增强大型语言模型(LLMs)在静态领域(如数学)中的推理能力方面取得了显著成功。然而,它们在动态数据分析任务中的潜力仍未得到充分探索。在本研究中,我们首先呈现了一项实证研究,揭示了通用领域的PRMs在监督数据分析代理方面的困难。具体而言,它们未能检测到静默错误,即那些在不触发解释器异常的情况下导致错误结果的逻辑缺陷,并且错误地惩罚探索性行为,将必要的试错探索误认为是基础失败。为了填补这一空白,我们引入了DataPRM,这是一种新颖的环境感知生成过程奖励模型,(1) 可以作为主动验证者,自动与环境交互以探测中间执行状态并发现静默错误,以及 (2) 采用反思感知的三元奖励策略,区分可纠正的基础错误和不可恢复的错误。我们设计了一个可扩展的管道,通过多样性驱动的轨迹生成和知识增强的步骤级注释,为DataPRM构建了超过8000个高质量的训练实例。实验结果表明,DataPRM在ScienceAgentBench上提高了下游策略LLMs的表现7.21%,在DABStep上提高了11.28%,使用的是Best-of-N推理。值得注意的是,DataPRM仅用4B参数就超越了强基线,并在多种测试时间扩展策略中展现出强大的泛化能力。此外,将DataPRM集成到强化学习中,相较于结果奖励基线,取得了显著提升,在DABench上达到78.73%,在TableBench上达到64.84%,验证了过程奖励监督的有效性。代码可在 https://github.com/zjunlp/DataMind 获取。
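The DataPRM entry above combines a ternary step-level reward with Best-of-N inference. A minimal sketch of how such a selection could work is given below. The label names, the numeric values in `STEP_REWARD`, and the mean-over-steps aggregation are all assumptions made for illustration; the abstract does not specify them, and the real model produces its judgments generatively rather than from a lookup table.

```python
# Hypothetical ternary reward: values are illustrative, not from the paper.
STEP_REWARD = {
    "correct": 1.0,         # step is sound
    "correctable": 0.0,     # grounding error the agent can still recover from
    "irrecoverable": -1.0,  # mistake that invalidates the trajectory
}


def trajectory_score(step_labels):
    """Average the per-step rewards of one trajectory (assumed aggregation)."""
    if not step_labels:
        return 0.0
    return sum(STEP_REWARD[label] for label in step_labels) / len(step_labels)


def best_of_n(trajectories):
    """Return the candidate trajectory with the highest process-level score."""
    return max(trajectories, key=trajectory_score)


candidates = [
    ["correct", "irrecoverable"],
    ["correct", "correctable"],
    ["correct", "correct"],
]
chosen = best_of_n(candidates)
```

Scoring a correctable error as neutral rather than negative is one way to avoid penalizing the exploratory trial-and-error steps that the abstract says general-domain PRMs misjudge.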
cs.CL / 79 / 2604.24302
Differentiable Faithfulness Alignment for Cross-Model Circuit Transfer
可微分的可信度对齐用于跨模型电路转移
Abstract
Mechanistic interpretability has made it possible to localize circuits underlying specific behaviors in language models, but existing methods are expensive, model-specific, and difficult to scale to larger architectures. We introduce Differentiable Faithfulness Alignment (DFA), a framework that transfers circuit information from a smaller source model to a larger target model through a learned differentiable alignment. DFA projects source-model node importance scores into the target model and trains this mapping with a soft faithfulness objective, avoiding full circuit discovery on the target model. We evaluate DFA on Llama-3 and Qwen-2.5 across six tasks spanning factual retrieval, multiple-choice reasoning, and arithmetic. The strongest results occur on Llama-3 1B→3B, where aligned circuits are often competitive with direct node attribution and zero-shot transfer remains effective. Recovery weakens for larger source-target gaps and is substantially lower on Qwen-2.5, suggesting that transfer becomes harder as architectural and scaling differences increase. Overall, DFA consistently outperforms simple baselines and, in some settings, recovers target-model circuits with faithfulness comparable to or stronger than direct attribution. These results suggest that smaller models can provide useful mechanistic priors for larger ones, while highlighting both the promise and the limits of node-level cross-model circuit alignment. Code is available at https://github.com/jasonshaoshun/dfa-circuits.
Chinese Translation
机制可解释性使得在语言模型中定位特定行为所依赖的电路成为可能,但现有方法成本高、模型特定且难以扩展到更大的架构。我们提出了可微分的可信度对齐(Differentiable Faithfulness Alignment, DFA),这是一个通过学习的可微分对齐将电路信息从较小的源模型转移到较大的目标模型的框架。DFA将源模型的节点重要性评分投影到目标模型,并通过软可信度目标训练这一映射,避免在目标模型上进行完整的电路发现。我们在Llama-3和Qwen-2.5上评估DFA,涵盖六个任务,包括事实检索、多项选择推理和算术。最强的结果出现在Llama-3 1B→3B上,其中对齐电路通常与直接节点归因具有竞争力,并且零-shot转移仍然有效。对于更大的源-目标差距,恢复效果减弱,在Qwen-2.5上显著降低,这表明随着架构和扩展差异的增加,转移变得更加困难。总体而言,DFA始终优于简单基线,并且在某些设置中恢复的目标模型电路的可信度与直接归因相当或更强。这些结果表明,较小的模型可以为较大的模型提供有用的机制先验,同时突显了节点级跨模型电路对齐的潜力和局限性。
cs.CL / 80 / 2604.24320
DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents
DPEPO:基于多样化并行探索策略优化的LLM代理
Abstract
Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first introduce a novel paradigm that enables an agent to interact with multiple environments simultaneously and share cross-trajectory experiences. Building upon this paradigm, we further propose DPEPO, a reinforcement learning (RL) algorithm that encourages the agent to perform diverse parallel exploration. There are two stages in DPEPO: initial supervised fine-tuning (SFT) imparts basic parallel reasoning and action generation, followed by a reinforcement learning stage with a hierarchical reward scheme. We design a parallel trajectory-level success reward and two step-level rewards: Diverse Action Reward and Diverse State Transition Reward, which actively penalize behavioral redundancy and promote broad exploration. Extensive experiments on ALFWorld and ScienceWorld show that DPEPO achieves state-of-the-art (SOTA) success rates, while maintaining comparable efficiency to strong sequential baselines. (Code is available at https://github.com/LePanda026/Code-for-DPEPO)
Chinese Translation
遵循顺序“推理-再行动”范式的大型语言模型(LLM)代理在许多复杂任务中取得了优异的表现。然而,这些方法由于每一步仅与单一环境交互,导致探索有限和环境理解不完整。本文首先介绍了一种新颖的范式,使代理能够同时与多个环境交互并共享跨轨迹经验。在此范式的基础上,我们进一步提出了DPEPO,一种强化学习(RL)算法,鼓励代理进行多样化的并行探索。DPEPO包含两个阶段:初始的监督微调(SFT)阶段赋予基本的并行推理和行动生成,随后是带有分层奖励机制的强化学习阶段。我们设计了一种并行轨迹级成功奖励和两个步骤级奖励:多样化行动奖励和多样化状态转移奖励,积极惩罚行为冗余并促进广泛探索。在ALFWorld和ScienceWorld上的大量实验表明,DPEPO达到了最先进的(SOTA)成功率,同时保持与强序列基线相当的效率。(代码可在 https://github.com/LePanda026/Code-for-DPEPO 获取)
cs.CL / 81 / 2604.24334
Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering
通过块过滤减少检索增强生成中的冗余
Abstract
Standard Retrieval-Augmented Generation (RAG) chunking methods often create excessive redundancy, increasing storage costs and slowing retrieval. This study explores chunk filtering strategies, such as semantic, topic-based, and named-entity-based methods in order to reduce the indexed corpus while preserving retrieval quality. Experiments are conducted on multiple corpora. Retrieval performance is evaluated using a token-based framework based on precision, recall, and intersection-over-union metrics. Results indicate that entity-based filtering can reduce vector index size by approximately 25% to 36% while maintaining high retrieval quality close to the baseline. These findings suggest that redundancy introduced during chunking can be effectively reduced through lightweight filtering, improving the efficiency of retrieval-oriented components in RAG pipelines.
Chinese Translation
标准的检索增强生成(Retrieval-Augmented Generation, RAG)分块方法通常会产生过多的冗余,增加存储成本并减慢检索速度。本研究探讨了块过滤策略,如基于语义、主题和命名实体的方法,以减少索引语料库,同时保持检索质量。我们在多个语料库上进行了实验。检索性能使用基于精确度、召回率和交并比(intersection-over-union)指标的基于令牌的框架进行评估。结果表明,基于实体的过滤可以将向量索引大小减少约25%至36%,同时保持接近基线的高检索质量。这些发现表明,通过轻量级过滤可以有效减少在分块过程中引入的冗余,从而提高RAG管道中检索导向组件的效率。
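The chunk-filtering entry above evaluates retrieval with token-level precision, recall, and intersection-over-union. A minimal sketch of such metrics follows, assuming that both the retrieved chunks and the gold evidence can be represented as sets of token positions in the source corpus; that representation and the function name are assumptions, since the abstract does not describe the framework in detail.

```python
def token_metrics(retrieved_tokens, relevant_tokens):
    """Token-level precision, recall, and IoU between two sets of token positions."""
    retrieved = set(retrieved_tokens)
    relevant = set(relevant_tokens)
    overlap = len(retrieved & relevant)
    union = len(retrieved | relevant)
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(relevant) if relevant else 0.0
    iou = overlap / union if union else 0.0
    return precision, recall, iou


# Retrieved chunk covers tokens 0..9; the gold evidence spans tokens 5..14.
precision, recall, iou = token_metrics(range(0, 10), range(5, 15))
```

Working on sets means a token retrieved twice through overlapping chunks counts once, so redundancy in the index shows up as lower precision and IoU rather than inflated recall.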
cs.CL / 82 / 2604.24348
OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
OS-SPEAR:操作系统代理安全性、性能、效率和鲁棒性分析工具包
Abstract
The evolution of Multimodal Large Language Models (MLLMs) has shifted the focus from text generation to active behavioral execution, particularly via OS agents navigating complex GUIs. However, the transition of these agents into trustworthy daily partners is hindered by a lack of rigorous evaluation regarding safety, efficiency, and multi-modal robustness. Current benchmarks suffer from narrow safety scenarios, noisy trajectory labeling, and limited robustness metrics. To bridge this gap, we propose OS-SPEAR, a comprehensive toolkit for the systematic analysis of OS agents across four dimensions: Safety, Performance, Efficiency, and Robustness. OS-SPEAR introduces four specialized subsets: (1) an S(afety)-subset encompassing diverse environment- and human-induced hazards; (2) a P(erformance)-subset curated via trajectory value estimation and stratified sampling; (3) an E(fficiency)-subset quantifying performance through the dual lenses of temporal latency and token consumption; and (4) an R(obustness)-subset that applies cross-modal disturbances to both visual and textual inputs. Additionally, we provide an automated analysis tool to generate human-readable diagnostic reports. We conduct an extensive evaluation of 22 popular OS agents using OS-SPEAR. Our empirical results reveal critical insights into the current landscape: notably, a prevalent trade-off between efficiency and safety or robustness, the performance superiority of specialized agents over general-purpose models, and varying robustness vulnerabilities across different modalities. By providing a multidimensional ranking and a standardized evaluation framework, OS-SPEAR offers a foundational resource for developing the next generation of reliable and efficient OS agents. The dataset and codes are available at https://github.com/Wuzheng02/OS-SPEAR.
Chinese Translation
多模态大型语言模型(MLLMs)的演变已将焦点从文本生成转向主动行为执行,特别是通过操作系统代理在复杂图形用户界面(GUI)中导航。然而,这些代理转变为可信赖的日常伙伴的过程受到安全性、效率和多模态鲁棒性评估缺乏严格性的阻碍。目前的基准测试存在安全场景狭窄、轨迹标注噪声和鲁棒性指标有限等问题。为填补这一空白,我们提出了OS-SPEAR,一个全面的工具包,用于系统分析操作系统代理的四个维度:安全性、性能、效率和鲁棒性。OS-SPEAR引入了四个专门的子集:(1)S(安全性)子集,涵盖多样的环境和人为诱发的危害;(2)P(性能)子集,通过轨迹价值估计和分层抽样进行策划;(3)E(效率)子集,通过时间延迟和标记消耗的双重视角量化性能;(4)R(鲁棒性)子集,针对视觉和文本输入施加跨模态干扰。此外,我们提供了一个自动化分析工具,以生成可读的人类诊断报告。我们使用OS-SPEAR对22个流行的操作系统代理进行了广泛评估。我们的实证结果揭示了当前领域的关键见解:尤其是效率与安全性或鲁棒性之间普遍存在权衡,专门化代理在性能上优于通用模型,以及不同模态之间的鲁棒性脆弱性差异。通过提供多维排名和标准化评估框架,OS-SPEAR为开发下一代可靠和高效的操作系统代理提供了基础资源。数据集和代码可在 https://github.com/Wuzheng02/OS-SPEAR 获取。
cs.CL / 83 / 2604.24361
Culture-Aware Machine Translation in Large Language Models: Benchmarking and Investigation
大型语言模型中的文化感知机器翻译:基准测试与研究
Abstract
Large language models (LLMs) have achieved strong performance in general machine translation, yet their ability in culture-aware scenarios remains poorly understood. To bridge this gap, we introduce CanMT, a Culture-Aware Novel-Driven Parallel Dataset for Machine Translation, together with a theoretically grounded, multi-dimensional evaluation framework for assessing cultural translation quality. Leveraging CanMT, we systematically evaluate a wide range of LLMs and translation systems under different translation strategy constraints. Our findings reveal substantial performance disparities across models and demonstrate that translation strategies exert a systematic influence on model behavior. Further analysis shows that translation difficulty varies across types of culture-specific items, and that a persistent gap remains between models' recognition of culture-specific knowledge and their ability to correctly operationalize it in translation outputs. In addition, incorporating reference translations is shown to substantially improve evaluation reliability in LLM-as-a-judge, underscoring their essential role in assessing culture-aware translation quality. The corpus and code are available at CanMT.
Chinese Translation
大型语言模型(LLMs)在一般机器翻译中表现出色,但它们在文化感知场景中的能力仍然不够明确。为了解决这一问题,我们引入了CanMT,一个用于机器翻译的文化感知新驱动平行数据集,并提出了一个理论基础的多维评估框架,以评估文化翻译质量。利用CanMT,我们系统地评估了在不同翻译策略约束下的多种LLM和翻译系统。我们的研究结果揭示了模型之间存在显著的性能差异,并表明翻译策略对模型行为产生系统性影响。进一步分析显示,翻译难度在不同类型的文化特定项目之间存在差异,并且模型对文化特定知识的识别能力与其在翻译输出中正确运用该知识的能力之间仍然存在持续的差距。此外,纳入参考翻译显著提高了LLM作为评判者的评估可靠性,强调了它们在评估文化感知翻译质量中的重要作用。语料库和代码可在CanMT获取。
cs.CL / 84 / 2604.24372
SeaEvo: Advancing Algorithm Discovery with Strategy Space Evolution
SeaEvo:通过策略空间演化推进算法发现
Abstract
LLM-guided evolutionary search has emerged as a promising paradigm for automated algorithm discovery, yet most systems track search progress primarily through executable programs and scalar fitness. Even when natural-language reflection is used, it is often used locally in mutation prompts or stored without an explicit population-level organization of strategic directions. As a result, evolutionary search can struggle to distinguish syntactically different implementations of the same idea, preserve lower-fitness but strategically promising directions, or detect when an entire family of strategies has saturated. We introduce SeaEvo, a modular strategy-space layer that elevates natural-language strategy descriptions from transient prompt context to first-class population-level evolutionary state in LLM-driven program search. SeaEvo augments each candidate program with an explicit natural-language strategy description and uses this representation in three ways: Strategy Articulation turns mutation into a diagnose-direct-implement process; Stratified Experience Retrieval organizes the archive into strategy clusters and selects inspirations by behavioral complementarity; and Strategic Landscape Navigation periodically summarizes effective, saturated, and underexplored strategy families to guide future mutations. Across mathematical algorithm discovery, systems optimization, and agent-scaffold benchmarks, SeaEvo improves the underlying evolutionary backbones in most settings, with particularly large gains (21% relative improvement) on open-ended system optimization tasks. These results suggest that persistent strategy representations provide a practical mechanism for improving the robustness and efficiency of LLM-guided evolutionary search, pointing to a path toward compound AI systems that accumulate algorithmic knowledge over time.
Chinese Translation
基于大语言模型(LLM)引导的进化搜索已成为自动化算法发现的一个有前景的范式,但大多数系统主要通过可执行程序和标量适应度来跟踪搜索进展。即使使用自然语言反思,它通常仅在变异提示中局部使用,或以没有明确的种群级战略方向组织的方式存储。因此,进化搜索可能难以区分同一思想的语法上不同的实现,保留适应度较低但战略上有前景的方向,或检测到整个策略家族已达到饱和。我们引入了SeaEvo,一个模块化的策略空间层,将自然语言策略描述从短暂的提示上下文提升为 LLM 驱动程序搜索中的一流种群级进化状态。SeaEvo 为每个候选程序增添了明确的自然语言策略描述,并以三种方式使用这种表征:策略阐述将变异转变为诊断-指导-实施的过程;分层经验检索将档案组织成策略集群,并通过行为互补性选择灵感;战略景观导航定期总结有效、饱和和未充分探索的策略家族,以指导未来的变异。在数学算法发现、系统优化和代理-支架基准测试中,SeaEvo 在大多数设置中改善了基础的进化骨干,尤其在开放式系统优化任务上获得了显著的提升(相对改善21%)。这些结果表明,持久的策略表征为提高 LLM 引导的进化搜索的稳健性和效率提供了一种实用机制,暗示了一个朝向复合人工智能系统的路径,该系统能够随着时间的推移积累算法知识。
cs.CL / 85 / 2604.24374
MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining
MIPIC:通过自蒸馏的内部关系和渐进信息链实现的套娃表示学习
Abstract
Representation learning is fundamental to NLP, but building embeddings that work well at different computational budgets is challenging. Matryoshka Representation Learning (MRL) offers a flexible inference paradigm through nested embeddings; however, learning such structures requires explicit coordination of how information is arranged across embedding dimensionality and model depth. In this work, we propose MIPIC (Matryoshka Representation Learning via Self-Distilled Intra-Relational Alignment and Progressive Information Chaining), a unified training framework designed to produce structurally coherent and semantically compact Matryoshka representations. MIPIC promotes cross-dimensional structural consistency through Self-Distilled Intra-Relational Alignment (SIA), which aligns token-level geometric and attention-driven relations between full and truncated representations using top-k CKA self-distillation. Complementarily, it enables depth-wise semantic consolidation via Progressive Information Chaining (PIC), a scaffolded alignment strategy that incrementally transfers mature task semantics from deeper layers into earlier layers. Extensive experiments on STS, NLI, and classification benchmarks (spanning models from TinyBERT to BGEM3 and Qwen3) demonstrate that MIPIC yields Matryoshka representations that are highly competitive across all capacities, with significant performance advantages in extremely low-dimensional settings.
Chinese Translation
表示学习是自然语言处理(NLP)的基础,但在不同计算预算下构建有效的嵌入是具有挑战性的。套娃表示学习(Matryoshka Representation Learning, MRL)通过嵌套嵌入提供了一种灵活的推理范式;然而,学习这样的结构需要明确协调信息在嵌入维度和模型深度上的排列方式。在本研究中,我们提出了MIPIC(通过自蒸馏的内部关系对齐和渐进信息链实现的套娃表示学习),这是一个统一的训练框架,旨在生成结构一致且语义紧凑的套娃表示。MIPIC通过自蒸馏的内部关系对齐(Self-Distilled Intra-Relational Alignment, SIA)促进跨维度的结构一致性,该方法使用top-k CKA自蒸馏对全表示和截断表示之间的标记级几何和基于注意力的关系进行对齐。补充地,它通过渐进信息链(Progressive Information Chaining, PIC)实现深度语义整合,这是一种逐步对齐策略,逐渐将成熟的任务语义从更深层转移到更早层。针对STS、NLI和分类基准(涵盖从TinyBERT到BGEM3、Qwen3的模型)进行的广泛实验表明,MIPIC在所有能力范围内产生的套娃表示具有高度竞争力,并在极低维度下观察到显著的性能优势。
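The MIPIC entry above builds on nested (Matryoshka) embeddings, where a prefix of the full vector is itself meant to be a usable embedding. The sketch below shows only the inference-time side of that idea, truncating to the first `dim` coordinates and re-normalizing before computing cosine similarity. It does not implement SIA or PIC, and the toy vectors and function names are illustrative assumptions.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale the prefix to unit length."""
    prefix = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm > 0 else prefix


def cosine(a, b):
    """Cosine similarity for vectors that are already unit-normalized."""
    return sum(x * y for x, y in zip(a, b))


# Toy 4-dimensional embeddings; real models use hundreds of dimensions.
query = [3.0, 4.0, 0.5, 0.1]
doc = [3.0, 4.0, -0.5, 0.2]

full_sim = cosine(truncate_and_normalize(query, 4), truncate_and_normalize(doc, 4))
small_sim = cosine(truncate_and_normalize(query, 2), truncate_and_normalize(doc, 2))
```

Re-normalizing after truncation matters because a prefix of a unit vector is generally no longer unit length, which would otherwise skew similarity scores at small budgets.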
cs.CL / 86 / 2604.24376
Learning Evidence of Depression Symptoms via Prompt Induction
通过提示引导学习抑郁症状的证据
Abstract
Depression places substantial pressure on mental health services, and many people describe their experiences outside clinical settings in high-volume user-generated text (e.g., online forums and social media). Automatically identifying clinical symptom evidence in such text can therefore complement limited clinical capacity and scale to large populations. We address this need through sentence-level classification of 21 depression symptoms from the BDI-II questionnaire, using BDI-Sen, a dataset annotated for symptom relevance. This task is fine-grained and highly imbalanced, and we find that common LLM approaches (zero-shot, in-context learning, and fine-tuning) struggle to apply consistent relevance criteria for most symptoms. We propose Symptom Induction (SI), a novel approach that compresses labeled examples into short, interpretable guidelines specifying what counts as evidence for each symptom, and uses these guidelines to condition classification. Across four LLM families and eight models, SI achieves the best overall weighted F1 on BDI-Sen, with especially large gains for infrequent symptoms. Cross-domain evaluation on an external dataset further shows that the induced guidelines generalize to other disorders with shared symptomatology (bipolar and eating disorders).
Chinese Translation
抑郁症对心理健康服务造成了巨大的压力,许多人在非临床环境中通过大量用户生成的文本(例如在线论坛和社交媒体)描述他们的经历。因此,自动识别这些文本中的临床症状证据可以补充有限的临床能力,并扩展到大规模人群。我们通过对BDI-II问卷中21种抑郁症状进行句子级分类来满足这一需求,使用了BDI-Sen数据集,该数据集经过注释以标注症状相关性。该任务具有细粒度和高度不平衡的特点,我们发现常见的LLM方法(零样本、上下文学习和微调)在大多数症状上难以应用一致的相关性标准。我们提出了症状引导(Symptom Induction, SI),这是一种新颖的方法,它将标记示例压缩为简短、可解释的指南,明确规定每种症状的证据标准,并利用这些指南来指导分类。在四个LLM家族和八个模型中,SI在BDI-Sen上实现了最佳的加权F1分数,尤其在不常见症状上取得了显著提升。对外部数据集的跨领域评估进一步表明,诱导的指南在其他共享症状特征的疾病(如双相障碍和饮食障碍)中具有良好的泛化能力。
cs.CL / 87 / 2604.24380
Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency
大型视觉语言模型的结构化剪枝:剪枝动态、恢复和数据效率的综合研究
Abstract
While Large Vision Language Models (LVLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose deployment challenges on resource-constrained edge devices. Current parameter reduction techniques primarily involve training LVLMs from small language models, but these methods offer limited flexibility and remain computationally intensive. We study a complementary route: compressing existing LVLMs by applying structured pruning to the language model backbone, followed by lightweight recovery training. Specifically, we investigate two structural pruning paradigms: layerwise and widthwise pruning, and pair them with supervised finetuning and knowledge distillation on logits and hidden states. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios, where computational resources are limited or there is insufficient finetuning data. As for the recovery training, finetuning only the multimodal projector is sufficient at small compression levels. Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved using just 5% of the original data, while retaining over 95% of the original performance. Through empirical study on three representative LVLM families ranging from 3B to 7B parameters, this study offers actionable insights for practitioners to compress LVLMs without extensive computation resources or sufficient data. The code base is available at https://github.com/YiranHuangIrene/VLMCompression.git.
Chinese Translation
尽管大型视觉语言模型(LVLMs)展现了令人印象深刻的能力,但其巨大的计算和内存需求在资源受限的边缘设备上部署时面临挑战。目前的参数减少技术主要涉及从小型语言模型训练LVLMs,但这些方法灵活性有限且计算密集。我们研究了一条互补的途径:通过对语言模型主干进行结构化剪枝来压缩现有的LVLMs,随后进行轻量级的恢复训练。具体而言,我们探讨了两种结构化剪枝范式:逐层剪枝和宽度剪枝,并将其与监督微调和对logits及隐藏状态的知识蒸馏相结合。此外,我们评估了仅使用可用数据的一小部分进行恢复训练的可行性。我们的结果表明,在计算资源有限或微调数据不足的低资源场景中,宽度剪枝通常能够保持更好的性能。至于恢复训练,仅对多模态投影器进行微调在小压缩水平下就足够了。此外,监督微调与隐藏状态蒸馏的结合在不同剪枝水平下产生了最佳恢复效果。值得注意的是,仅使用原始数据的5%就可以实现有效的恢复,同时保留超过95%的原始性能。通过对三个代表性的LVLM家族(参数范围从3B到7B)的实证研究,本研究为从业者提供了在没有大量计算资源或足够数据的情况下压缩LVLMs的可行见解。代码库可在 https://github.com/YiranHuangIrene/VLMCompression.git 获取。
cs.CL / 88 / 2604.24416
Scaling Properties of Continuous Diffusion Spoken Language Models
连续扩散口语语言模型的缩放特性
Abstract
Speech-only spoken language models (SLMs) lag behind text and text-speech models in performance, with recent discrete autoregressive (AR) SLMs indicating significant computational and data demands to match text models. Since discretizing continuous speech for AR creates bottlenecks, we explore whether a continuous diffusion (CD) SLM is more viable. To quantify the SLMs' linguistic quality, we introduce the phoneme Jensen-Shannon divergence (pJSD) metric. Our analysis reveals that CD SLMs, mirroring AR behavior, exhibit scaling laws for validation loss and pJSD, and show optimal token-to-parameter ratios decreasing as compute scales. However, for the latter, loss becomes insensitive to the choice of data and model sizes, showing potential for fast inference. Scaling CD SLMs to 16B parameters with tens of millions of hours of conversational data enables generation of emotive, prosodic, multi-speaker, multilingual speech, though achieving long-form coherence remains a significant challenge.
Chinese Translation
仅基于语音的口语语言模型(SLMs)在性能上落后于文本和文本-语音模型,最近的离散自回归(AR)SLMs显示出与文本模型相匹配所需的计算和数据需求显著。由于将连续语音离散化以进行AR会造成瓶颈,我们探讨连续扩散(CD)SLM是否更具可行性。为了量化SLMs的语言质量,我们引入了音素詹森-香农散度(pJSD)指标。我们的分析显示,CD SLMs呈现出与AR行为相似的特征,验证损失和pJSD遵循缩放定律,并且在计算规模扩大时,最佳的标记与参数比率下降。然而,对于后者,损失对数据和模型大小的选择变得不敏感,显示出快速推理的潜力。将CD SLM扩展到160亿参数,并使用数千万小时的对话数据,可以生成富有情感、具有韵律的多说话者、多语言语音,尽管实现长篇连贯性仍然是一个重大挑战。
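The abstract does not spell out how pJSD is computed. As a minimal sketch, assuming it compares phoneme unigram distributions of generated and reference speech, the Jensen-Shannon divergence itself can be written as below. The function names and the toy phoneme sequences are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def phoneme_distribution(phonemes):
    """Turn a phoneme sequence into a relative-frequency distribution."""
    counts = Counter(phonemes)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    support = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2
    m = {s: 0.5 * (p.get(s, 0.0) + q.get(s, 0.0)) for s in support}

    def kl_to_mixture(a):
        # KL(a || m); terms with a[s] == 0 contribute nothing
        return sum(a[s] * math.log2(a[s] / m[s]) for s in a if a[s] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Hypothetical phoneme transcripts (assumed to come from a phoneme recognizer)
reference = phoneme_distribution(["HH", "AH", "L", "OW", "W", "ER", "L", "D"])
generated = phoneme_distribution(["HH", "AH", "L", "OW", "L", "OW", "L", "OW"])
print(js_divergence(reference, generated))
```

With base-2 logarithms the value is bounded between 0 and 1, which makes scores comparable across models of different sizes.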
cs.CL / 89 / 2604.24429
A Multi-Dimensional Audit of Politically Aligned Large Language Models
政治对齐的大型语言模型的多维审计
Abstract
As the application of Large Language Models (LLMs) spreads across various industries, there are increasing concerns about the potential for their misuse, especially in sensitive areas such as political discourse. Deliberately aligning LLMs with specific political ideologies, through prompt engineering or fine-tuning techniques, can be advantageous in use cases such as political campaigns, but requires careful consideration due to heightened risks of performance degradation, misinformation, or increased biased behavior. In this work, we propose a multi-dimensional framework inspired by Habermas' Theory of Communicative Action to audit politically aligned language models across four dimensions (effectiveness, fairness, truthfulness, and persuasiveness) using automated, quantitative metrics. Applying this to nine popular LLMs aligned via fine-tuning or role-playing revealed consistent trade-offs: while larger models tend to be more effective at role-playing political ideologies and more truthful in their responses, they were also less fair, exhibiting higher levels of bias in the form of angry and toxic language towards people of different ideologies. Fine-tuned models exhibited lower bias and more effective alignment than the corresponding role-playing models, but also saw a decline in performance on reasoning tasks and an increase in hallucinations. Overall, all of the models tested exhibited some deficiency in at least one of the four dimensions, highlighting the need for more balanced and robust alignment strategies. Ultimately, this work aims to ensure politically aligned LLMs generate legitimate, harmless arguments, offering a framework to evaluate the responsible political alignment of these models.
Chinese Translation
随着大型语言模型(LLMs)在各个行业的应用不断扩大,人们对其潜在误用的担忧日益增加,尤其是在政治话语等敏感领域。通过提示工程或微调技术故意将LLMs与特定政治意识形态对齐,在政治活动等用例中可能是有利的,但由于性能下降、错误信息或偏见行为增加的风险加大,这需要谨慎考虑。在本研究中,我们提出了一个多维框架,灵感来源于哈贝马斯的交往行动理论,以自动化、定量指标审计政治对齐的语言模型,涵盖四个维度:有效性、公平性、真实性和说服力。将这一框架应用于通过微调或角色扮演对齐的九个流行LLMs,揭示了一致的权衡:虽然较大的模型在角色扮演政治意识形态和回应的真实性方面往往更有效,但它们的公平性较差,表现出对不同意识形态人群使用愤怒和有毒语言的偏见水平更高。微调模型表现出较低的偏见和比相应的角色扮演模型更有效的对齐,但在推理任务的性能上也有所下降,并且幻觉现象增加。总体而言,所有测试的模型在至少一个四个指标中都表现出一些缺陷,突显了需要更平衡和稳健的对齐策略。最终,本研究旨在确保政治对齐的LLMs生成合法、无害的论点,并提供评估这些模型负责任的政治对齐的框架。
cs.CL / 90 / 2604.24432
Kwai Summary Attention Technical Report
Kwai 摘要注意力技术报告
Chu, Chenglong, Zhou, Guorui, Zhang, Guowang, Li, Han, Peng, Hao, Cheng, Hongtao, Liang, Jian, Cao, Jiangxia, Gai, Kun, Zhou, Lingzhi, Ren, Lu, Zhang, Qi, Tang, Ruiming, Wang, Ruitao, Luo, Xinchen, Su, Yi, Liang, Zhiyuan, Wang, Ziqi, Ding, Boyang, Song, Chengru, Zang, Dunju, Wang, Hui, Ou, Jiao, Deng, Jiaxin, Shi, Jijun, Zhang, Jinghao, Chen, Junmin, Ren, Lejian, Lv, Minxuan, Wang, Qianqian, Hu, Qigen, Wang, Shiyao, Mao, Siyang, Wang, Tao, Wang, Xingmei, Ling, Zhixin, Li, Ziming, Zhang, Zixing
Abstract
Long-context ability has become one of the most important directions for next-generation Large Language Models, particularly in semantic understanding and reasoning, agentic code intelligence, and recommendation systems. However, standard softmax attention has quadratic time complexity with respect to sequence length, so the training and inference costs of extremely long sequences grow rapidly. Existing solutions mitigate this issue along two technical routes: i) reducing the KV cache per layer, for example through head-level compression (GQA) or embedding-dimension-level compression (MLA), although the KV cache still grows linearly with sequence length at a 1:1 ratio; ii) interleaving with KV-cache-friendly architectures, such as local attention (SWA) or linear kernels (GDN), which often involve trade-offs between KV cache size and long-context modeling effectiveness. Besides these two routes, we argue that there is an under-explored intermediate path: maintaining a linear relationship between the KV cache and sequence length, but performing semantic-level compression at a specific ratio $k$. This $O(n/k)$ path does not pursue a "minimum KV cache"; rather, it trades an acceptable memory cost for complete, referential, and interpretable retention of long-distance dependencies. Motivated by this, we propose Kwai Summary Attention (KSA), a novel attention mechanism that reduces sequence modeling cost by compressing historical contexts into learnable summary tokens.
Chinese Translation
长上下文能力已成为下一代大型语言模型最重要的迭代方向之一,特别是在语义理解/推理、代码智能代理和推荐系统中。然而,标准的 softmax 注意力在序列长度上表现出二次时间复杂度。随着序列长度的增加,这在长上下文设置中会产生大量开销,导致极长序列的训练和推理成本迅速恶化。现有解决方案通过两种技术路径来缓解这一问题:i) 减少每层的 KV 缓存,例如通过头级压缩 GQA 和嵌入维度级压缩 MLA,但 KV 缓存仍然与序列长度呈线性依赖,比例为 1:1。ii) 与 KV 缓存友好的架构交错,例如局部注意力 SWA、线性核 GDN,但通常在 KV 缓存和长上下文建模效果之间存在权衡。除了这两种技术路径,我们认为存在一种尚未充分探索的中间路径:{保持 KV 缓存与序列长度之间的线性关系,但通过特定比例 $k$ 进行语义级压缩}。这种 $O(n/k)$ 的路径并不追求“最小 KV 缓存”,而是以可接受的内存成本换取对长距离依赖的完整、参考和可解释的保留。基于此,我们提出了 Kwai 摘要注意力 (KSA),一种新颖的注意力机制,通过将历史上下文压缩为可学习的摘要标记,从而降低序列建模成本。
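The memory arithmetic behind the $O(n/k)$ path can be illustrated with a rough sketch. The model shape used here (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the ratio k = 8 are hypothetical and not taken from the report, and the function ignores any uncompressed recent-token window an actual design might keep.

```python
import math


def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, compression_ratio=1):
    """Approximate KV cache size when every `compression_ratio` tokens
    are stored as one cached position (ratio 1 = standard 1:1 cache)."""
    cached_positions = math.ceil(seq_len / compression_ratio)
    # Factor 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * cached_positions * bytes_per_value


GIB = 1024 ** 3
shape = dict(seq_len=131072, n_layers=32, n_kv_heads=8, head_dim=128)  # assumed config

full = kv_cache_bytes(**shape)                             # 1:1 cache
summarized = kv_cache_bytes(**shape, compression_ratio=8)  # hypothetical k = 8

print(full / GIB)        # 16.0
print(summarized / GIB)  # 2.0
```

The cache still grows linearly with sequence length in both cases; the ratio only changes the slope, which is the trade-off the abstract describes.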
cs.CL / 91 / 2604.24444
Can You Make It Sound Like You? Post-Editing LLM-Generated Text for Personal Style
你能让它听起来像你吗?对大型语言模型生成文本进行后期编辑以展现个人风格
Abstract
Despite the growing use of large language models (LLMs) for writing tasks, users may hesitate to rely on LLMs when personal style is important. Post-editing LLM-generated drafts or translations is a common collaborative writing strategy, but it remains unclear whether users can effectively reshape LLM-generated text to reflect their personal style. We conduct a pre-registered online study ($n=81$) in which participants post-edit LLM-generated drafts for writing tasks where personal style matters to them. Using embedding-based style similarity metrics, we find that post-editing increases stylistic similarity to participants' unassisted writing and reduces similarity to fully LLM-generated output. However, post-edited text remains stylistically closer to LLM text than to participants' unassisted control text, and it exhibits reduced stylistic diversity compared to unassisted human text. We find a gap between perceived stylistic authenticity and model-measured stylistic similarity: post-edited text is often perceived as representative of participants' personal style despite retaining detectable LLM stylistic traces.
Chinese Translation
尽管大型语言模型(LLMs)在写作任务中的使用日益增加,但当个人风格重要时,用户可能会犹豫依赖LLMs。对LLM生成的草稿或翻译进行后期编辑是一种常见的协作写作策略,但尚不清楚用户是否能够有效地重塑LLM生成的文本以反映他们的个人风格。我们进行了一项预注册的在线研究($n=81$),参与者对在个人风格对他们重要的写作任务中生成的LLM草稿进行后期编辑。通过基于嵌入的风格相似性度量,我们发现后期编辑增加了与参与者未辅助写作的风格相似性,并减少了与完全LLM生成输出的相似性。然而,后期编辑的文本在风格上仍然比参与者未辅助的控制文本更接近LLM文本,并且与未辅助的人类文本相比,展现出较低的风格多样性。我们发现感知的风格真实性与模型测量的风格相似性之间存在差距,后期编辑的文本常常被认为代表参与者的个人风格,尽管仍然保留可检测的LLM风格痕迹。
cs.CL / 92 / 2604.24470
Zero-shot Large Language Models for Automatic Readability Assessment
零样本大型语言模型用于自动可读性评估
Abstract
Unsupervised automatic readability assessment (ARA) methods have important practical and research applications (e.g., ensuring medical or educational materials are suitable for their target audiences). In this paper, we propose a new zero-shot prompting methodology for ARA and present the first comprehensive evaluation of using large language models (LLMs) as an unsupervised ARA method by testing 10 diverse open-source LLMs (e.g., different sizes and developers) on 14 diverse datasets (e.g., different text lengths and languages). Our findings show that our proposed prompting methodology outperforms prior methods on 13 of the 14 datasets. Furthermore, we propose LAURAE, which combines LLM and readability formula scores to improve robustness by capturing both contextual and shallow (e.g., sentence length) features of readability. Our evaluation demonstrates that LAURAE robustly outperforms prior methods across languages, text lengths, and amounts of technical language.
Chinese Translation
无监督的自动可读性评估(ARA)方法在实际和研究中具有重要的应用(例如,确保医疗或教育材料适合其目标受众)。在本文中,我们提出了一种新的零样本提示方法用于ARA,并首次对使用大型语言模型(LLMs)作为无监督ARA方法进行了全面评估,测试了10种不同的开源LLMs(例如,不同的大小和开发者)在14个不同的数据集(例如,不同的文本长度和语言)上的表现。我们的研究结果表明,我们提出的提示方法在14个数据集中的13个上优于以往的方法。此外,我们提出了LAURAE,该方法结合了LLM和可读性公式得分,通过捕捉可读性的上下文特征和浅层特征(例如,句子长度)来提高鲁棒性。我们的评估表明,LAURAE在不同语言、文本长度和技术语言的使用量上都稳健地优于以往的方法。
cs.CL / 93 / 2604.24515
SEARCH-R: Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator for Multi-hop Question Answering
SEARCH-R:基于结构化实体感知检索与推理链导航器的多跳问答
Abstract
Multi-hop Question Answering (MHQA) aims to answer questions that require multi-step reasoning. It presents two key challenges: generating correct reasoning paths in response to the complex user queries, and accurately retrieving essential knowledge in the face of potential limitations in large language models (LLMs). Existing approaches primarily rely on prompt-based methods to generate reasoning paths, which are further combined with traditional sparse or dense retrieval to produce the final answer. However, the generation of reasoning paths commonly lacks effective control over the generative process, thus leading the reasoning astray. Meanwhile, the retrieval methods over-rely on knowledge matching or similarity scores rather than evaluating the practical utility of the information, resulting in retrieving homogeneous or non-useful information. Therefore, we propose a Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator framework named SEARCH-R. Specifically, SEARCH-R trains an end-to-end reasoning path navigator, which is able to provide a powerful sub-question decomposer by fine-tuning the Llama3.1-8B model. Moreover, a novel dependency tree-based retrieval is designed to evaluate the informational contribution of the document quantitatively. Extensive experiments on three challenging multi-hop datasets validate the effectiveness of the proposed framework. The code and dataset are available at: https://github.com/Applied-Machine-Learning-Lab/ACL2026_SEARCH-R.
Chinese Translation
多跳问答(MHQA)旨在回答需要多步推理的问题。它面临两个关键挑战:在复杂用户查询的响应中生成正确的推理路径,以及在大型语言模型(LLMs)可能存在的局限性面前准确检索必要知识。现有方法主要依赖基于提示的方法生成推理路径,并将其与传统的稀疏或密集检索结合以产生最终答案。然而,推理路径的生成通常缺乏对生成过程的有效控制,从而导致推理偏离。同时,检索方法过于依赖知识匹配或相似性评分,而不是评估信息的实际效用,导致检索到同质或无用的信息。因此,我们提出了一种名为SEARCH-R的结构化实体感知检索与推理链导航器框架。具体而言,SEARCH-R训练了一个端到端的推理路径导航器,能够通过微调Llama3.1-8B模型提供强大的子问题分解器。此外,设计了一种新颖的基于依赖树的检索方法,以定量评估文档的信息贡献。在三个具有挑战性的多跳数据集上的大量实验验证了该框架的有效性。代码和数据集可在以下链接获取:https://github.com/Applied-Machine-Learning-Lab/ACL2026_SEARCH-R。
cs.CL / 94 / 2604.24536
Generating Place-Based Compromises Between Two Points of View
在两种观点之间生成基于地点的折衷方案
Abstract
Large Language Models (LLMs) excel academically but struggle with social intelligence tasks, such as creating good compromises. In this paper, we present methods for generating empathically neutral compromises between two opposing viewpoints. We first compared four different prompt engineering methods using Claude 3 Opus and a dataset of 2,400 contrasting views on shared places. A subset of the generated compromises was evaluated for acceptability in a 50-participant study. We found that the best method for generating compromises between two views used external empathic similarity between a compromise and each viewpoint as iterative feedback, outperforming standard Chain of Thought (CoT) reasoning. The results indicate that the use of empathic neutrality improves the acceptability of compromises. The dataset of generated compromises was then used to train two smaller foundation models via margin-based alignment of human preferences, improving efficiency and removing the need for empathy estimation during inference.
Chinese Translation
大型语言模型(LLMs)在学术领域表现出色,但在社会智能任务上表现不佳,例如创造良好的折衷方案。本文提出了一种在两种对立观点之间生成同情中立折衷方案的方法。我们首先使用Claude 3 Opus和一个包含2400种对比观点的共享地点数据集比较了四种不同的提示工程方法。生成的折衷方案的子集在一项包含50名参与者的研究中进行了可接受性评估。我们发现,生成两种观点之间折衷方案的最佳方法是利用折衷方案与每种观点之间的外部同情相似性作为迭代反馈,其表现优于标准的思维链(Chain of Thought, CoT)推理。结果表明,使用同情中立性提高了折衷方案的可接受性。随后,生成的折衷方案数据集被用于通过基于边际的人类偏好对齐训练两个较小的基础模型,从而提高了效率并消除了推理过程中对同情估计的需求。
cs.CL / 95 / 2604.24559
Aligned Multi-View Scripts for Universal Chart-to-Code Generation
对齐的多视图脚本用于通用图表到代码生成
Abstract
Chart-to-code generation converts a chart image into an executable plotting script, enabling faithful reproduction and editable visualizations. Existing methods are largely Python-centric, limiting practical use and overlooking a critical source of supervision: the same chart can be expressed by semantically equivalent scripts in different plotting languages. To fill this gap, we introduce Chart2NCode, a dataset of 176K charts paired with aligned scripts in Python, R, and LaTeX that render visually equivalent outputs, constructed via a metadata-to-template pipeline with rendering verification and human quality checks. Building on a LLaVA-style architecture, we further propose CharLuMA, a parameter-efficient adaptation module that augments the multimodal projector with a language-conditioned mixture of low-rank subspaces, allowing the model to share core chart understanding while specializing code generation to the target language through lightweight routing. Extensive experiments show consistent gains in executability and visual fidelity across all languages, outperforming strong open-source baselines and remaining competitive with proprietary systems. Further analyses reveal that balanced multi-language supervision benefits all languages and that the adapter allocates a compact shared core plus language-specific capacity. Code and data are available at https://github.com/Zhihan72/CharLuMA.
Chinese Translation
图表到代码生成将图表图像转换为可执行的绘图脚本,从而实现忠实再现和可编辑的可视化。现有方法主要以Python为中心,限制了实际应用,并忽视了一个重要的监督来源:同一图表可以通过不同绘图语言中的语义等价脚本来表达。为填补这一空白,我们引入了Chart2NCode,一个包含176K个图表及其在Python、R和LaTeX中对齐脚本的数据集,这些脚本生成视觉上等价的输出,构建过程通过元数据到模板的管道,并进行了渲染验证和人工质量检查。在LLaVA风格架构的基础上,我们进一步提出了CharLuMA,一个参数高效的适配模块,增强了多模态投影器,结合了语言条件的低秩子空间混合,使模型能够共享核心图表理解,同时通过轻量级路由将代码生成专门化为目标语言。大量实验表明,在所有语言中,执行性和视觉保真度均有一致的提升,超越了强大的开源基线,并与专有系统保持竞争力。进一步分析表明,平衡的多语言监督对所有语言都有益,适配器分配了紧凑的共享核心和语言特定的能力。代码和数据可在https://github.com/Zhihan72/CharLuMA获取。
cs.CL / 96 / 2604.24564
MEG-RAG: Quantifying Multi-modal Evidence Grounding for Evidence Selection in RAG
MEG-RAG:量化多模态证据基础以进行RAG中的证据选择
Abstract
Multimodal Retrieval-Augmented Generation (MRAG) addresses key limitations of Multimodal Large Language Models (MLLMs), such as hallucination and outdated knowledge. However, current MRAG systems struggle to distinguish whether retrieved multimodal data truly supports the semantic core of an answer or merely provides superficial relevance. Existing metrics often rely on heuristic position-based confidence, which fails to capture the informational density of multimodal entities. To address this, we propose Multi-modal Evidence Grounding (MEG), a semantic-aware metric that quantifies the contribution of retrieved evidence. Unlike standard confidence measures, MEG utilizes Semantic Certainty Anchoring, focusing on high-IDF information-bearing tokens that better capture the semantic core of the answer. Building on MEG, we introduce MEG-RAG, a framework that trains a multimodal reranker to align retrieved evidence with the semantic anchors of the ground truth. By prioritizing high-value content based on semantic grounding rather than token probability distributions, MEG-RAG improves the accuracy and multimodal consistency of generated outputs. Extensive experiments on the M$^2$RAG benchmark show that MEG-RAG consistently outperforms strong baselines and demonstrates robust generalization across different teacher models.
Chinese Translation
多模态检索增强生成(MRAG)解决了多模态大型语言模型(MLLMs)的关键局限性,如幻觉和过时知识。然而,当前的MRAG系统在区分检索到的多模态数据是否真正支持答案的语义核心,或仅提供表面相关性方面存在困难。现有指标通常依赖于启发式位置基础的置信度,这未能捕捉多模态实体的信息密度。为了解决这个问题,我们提出了多模态证据基础(MEG),这是一种语义感知指标,用于量化检索证据的贡献。与标准置信度度量不同,MEG利用语义确定性锚定,专注于高IDF信息承载标记,更好地捕捉答案的语义核心。在MEG的基础上,我们引入了MEG-RAG,一个训练多模态重排序器的框架,以将检索到的证据与真实值的语义锚定对齐。通过基于语义基础而非标记概率分布优先考虑高价值内容,MEG-RAG提高了生成输出的准确性和多模态一致性。在M$^2$RAG基准上的广泛实验表明,MEG-RAG始终优于强基线,并在不同教师模型上表现出强大的泛化能力。
cs.CL / 97 / 2604.24594
Skill Retrieval Augmentation for Agentic AI
代理人工智能的技能检索增强
Abstract
As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks beyond their native parametric capabilities. In existing agent systems, the dominant strategy for incorporating skills is to explicitly enumerate available skills within the context window. However, this strategy fails to scale: as skill corpora expand, context budgets are consumed rapidly, and the agent becomes markedly less accurate in identifying the right skill. To this end, this paper formulates Skill Retrieval Augmentation (SRA), a new paradigm in which agents dynamically retrieve, incorporate, and apply relevant skills from large external skill corpora on demand. To make this problem measurable, we construct a large-scale skill corpus and introduce SRA-Bench, the first benchmark for decomposed evaluation of the full SRA pipeline, covering skill retrieval, skill incorporation, and end-task execution. SRA-Bench contains 5,400 capability-intensive test instances and 636 manually constructed gold skills, which are mixed with web-collected distractor skills to form a large-scale corpus of 26,262 skills. Extensive experiments show that retrieval-based skill augmentation can substantially improve agent performance, validating the promise of the paradigm. At the same time, we uncover a fundamental gap in skill incorporation: current LLM agents tend to load skills at similar rates, regardless of whether a gold skill is retrieved or whether the task actually requires external capabilities. This shows that the bottleneck in skill augmentation lies not only in retrieval but also in the base model's ability to determine which skill to load and when external loading is actually needed. These findings position SRA as a distinct research problem and establish a foundation for the scalable augmentation of capabilities in future agent systems.
Chinese Translation
随着大型语言模型(LLMs)逐渐演变为具有代理能力的问题解决者,它们越来越依赖外部可重用技能来处理超出其原生参数能力的任务。在现有的代理系统中,整合技能的主要策略是明确列出上下文窗口内可用的技能。然而,这一策略无法扩展:随着技能语料库的扩大,上下文预算迅速消耗,代理在识别正确技能方面的准确性显著下降。为此,本文提出了技能检索增强(Skill Retrieval Augmentation, SRA)的新范式,在该范式中,代理能够动态地按需从大型外部技能语料库中检索、整合和应用相关技能。为了使这一问题可测量,我们构建了一个大规模的技能语料库,并引入了SRA-Bench,这是第一个针对完整SRA流程进行分解评估的基准,涵盖技能检索、技能整合和最终任务执行。SRA-Bench包含5400个能力密集型测试实例和636个手动构建的金标准技能,这些技能与从网络收集的干扰技能混合,形成了一个包含26262个技能的大规模语料库。大量实验表明,基于检索的技能增强可以显著提高代理的性能,验证了这一范式的潜力。同时,我们发现技能整合存在一个根本性差距:当前的LLM代理在加载技能时往往以相似的速度进行,无论是检索到金标准技能,还是任务是否实际需要外部能力。这表明,技能增强的瓶颈不仅在于检索,还在于基础模型确定何时加载技能以及何时实际需要外部加载的能力。这些发现将SRA定位为一个独特的研究问题,并为未来代理系统中能力的可扩展增强奠定了基础。
cs.CL / 98 / 2604.24609
Evaluation of Pose Estimation Systems for Sign Language Translation
手势语言翻译的姿态估计系统评估
Abstract
Many sign language translation (SLT) systems operate on pose sequences instead of raw video to reduce input dimensionality, improve portability, and partially anonymize signers. The choice of pose estimator is often treated as an implementation detail, with systems defaulting to widely available tools such as MediaPipe Holistic or OpenPose. We present a systematic comparison of pose estimators for pose-based SLT, covering widely used baselines (MediaPipe Holistic, OpenPose) and newer whole-body/high-capacity models (MMPose WholeBody, OpenPifPaf, AlphaPose, SDPose, Sapiens, SMPLest-X). We quantify downstream impact by training a controlled SLT pipeline on RWTH-PHOENIX-Weather 2014 where only the pose representation varies, evaluating with BLEU and BLEURT. To contextualize translation outcomes, we analyze temporal stability, missing hand keypoints, and robustness to occlusion using higher-resolution videos from the Signsuisse dataset. SDPose and Sapiens achieve the best translation performance (BLEU ~11.5), outperforming the common MediaPipe baseline (BLEU ~10). In occlusion cases, Sapiens is correct in all tested instances (15/15), while OpenPifPaf fails in nearly all (1/15) and also yields the weakest translation scores. Estimators that frequently leave out hand keypoints are associated with lower BLEU/BLEURT. We release code that not only reproduces our experiments but also considerably lowers the barrier for other researchers to use alternative pose estimators.
Chinese Translation
许多手势语言翻译(SLT)系统基于姿态序列而非原始视频进行操作,以降低输入维度、提高可移植性并部分匿名化手势使用者。姿态估计器的选择通常被视为实现细节,系统默认使用广泛可用的工具,如 MediaPipe Holistic 或 OpenPose。我们对基于姿态的 SLT 的姿态估计器进行了系统比较,涵盖了广泛使用的基准(MediaPipe Holistic、OpenPose)和更新的全身/高容量模型(MMPose WholeBody、OpenPifPaf、AlphaPose、SDPose、Sapiens、SMPLest-X)。我们通过在 RWTH-PHOENIX-Weather 2014 上训练一个受控的 SLT 流水线来量化下游影响,其中仅姿态表示有所不同,并使用 BLEU 和 BLEURT 进行评估。为了对翻译结果进行背景分析,我们使用来自 Signsuisse 数据集的高分辨率视频分析时间稳定性、缺失的手部关键点以及对遮挡的鲁棒性。SDPose 和 Sapiens 实现了最佳的翻译性能(BLEU ~11.5),超越了常见的 MediaPipe 基线(BLEU ~10)。在遮挡情况下,Sapiens 在所有测试实例中均正确(15/15),而 OpenPifPaf 几乎在所有情况下均失败(1/15),并且翻译得分也最弱。经常遗漏手部关键点的估计器与较低的 BLEU/BLEURT 相关。我们发布的代码不仅可以用来重现我们的实验,还大大降低了其他研究人员使用替代姿态估计器的门槛。
cs.CL / 99 / 2604.24620
Looking for the Bottleneck in Fine-grained Temporal Relation Classification
寻找细粒度时间关系分类中的瓶颈
Abstract
Temporal relation classification is the task of determining the temporal relation between pairs of temporal entities in a text. Despite recent advancements in natural language processing, temporal relation classification remains a considerable challenge. Early attempts framed this task using a comprehensive set of temporal relations between events and temporal expressions. However, due to the task complexity, datasets have been progressively simplified, leading recent approaches to focus on the relations between event pairs and to use only a subset of relations. In this work, we revisit the broader goal of classifying interval relations between temporal entities by considering the full set of relations that can hold between two time intervals. The proposed approach, Interval from Point, involves first classifying the point relations between the endpoints of the temporal entities and then decoding these point relations into an interval relation. Evaluation on the TempEval-3 dataset shows that this approach can yield effective results, achieving a temporal awareness score of $70.1$ percent, a new state-of-the-art on this benchmark.
Chinese Translation
时间关系分类是确定文本中一对时间实体之间时间关系的任务。尽管自然语言处理领域近年来取得了显著进展,但时间关系分类仍然是一个相当大的挑战。早期的尝试将这一任务框定为事件与时间表达之间的全面时间关系集合。然而,由于任务的复杂性,数据集逐渐被简化,导致最近的方法集中于事件对之间的关系,并仅使用关系的子集。在本研究中,我们重新审视了对时间实体之间区间关系进行分类的更广泛目标,考虑可以在两个时间区间之间保持的完整关系集合。所提出的方法“从点到区间”(Interval from Point)首先对时间实体的端点之间的点关系进行分类,然后将这些点关系解码为区间关系。在TempEval-3数据集上的评估表明,该方法能够产生有效的结果,达到了70.1%的时间意识得分,成为该基准的新最优状态。
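The decoding step of Interval from Point can be sketched with standard interval algebra: once the relations between the four endpoint pairs are known, exactly one of the 13 Allen interval relations follows. The code below is a minimal illustration with Allen-style label names, assuming consistent point relations and well-formed intervals (start before end). It is not the authors' decoder, and TempEval-3 uses its own TimeML label set.

```python
def point_relation(x, y):
    """Relation between two time points: '<', '=' or '>'."""
    return "<" if x < y else ">" if x > y else "="


def endpoint_relations(a, b):
    """Point relations (start-start, start-end, end-start, end-end) for
    intervals a = (a_start, a_end) and b = (b_start, b_end)."""
    return (point_relation(a[0], b[0]), point_relation(a[0], b[1]),
            point_relation(a[1], b[0]), point_relation(a[1], b[1]))


# Relations for intervals that share more than a single point,
# keyed by (start-start, end-end).
_SHARED = {
    ("<", "<"): "overlaps",  ("<", "="): "finished_by", ("<", ">"): "contains",
    ("=", "<"): "starts",    ("=", "="): "equals",      ("=", ">"): "started_by",
    (">", "<"): "during",    (">", "="): "finishes",    (">", ">"): "overlapped_by",
}


def decode_interval_relation(ss, se, es, ee):
    """Decode four point relations into one Allen interval relation."""
    if es == "<":
        return "before"   # a ends before b starts
    if es == "=":
        return "meets"    # a ends exactly when b starts
    if se == ">":
        return "after"    # a starts after b ends
    if se == "=":
        return "met_by"   # a starts exactly when b ends
    return _SHARED[(ss, ee)]


print(decode_interval_relation(*endpoint_relations((1, 3), (2, 5))))  # overlaps
```

In a learned system the four point relations would come from a classifier, so an extra consistency check would be needed for contradictory predictions; this sketch does not include one.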
cs.CL / 100 / 2604.24645
K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology
K-MetBench:气象领域专家推理、本地性和多模态的细粒度评估多维基准
Abstract
The development of practical (multimodal) large language model assistants for Korean weather forecasters is hindered by the absence of a multidimensional, expert-level evaluation framework grounded in authoritative sources. To address this, we introduce K-MetBench, a diagnostic benchmark grounded in national qualification exams. It exposes critical gaps across four dimensions: expert visual reasoning of charts, logical validity via expert-verified rationales, Korean-specific geo-cultural comprehension, and fine-grained domain analysis. Our evaluation of 55 models reveals a profound modality gap in interpreting specialized diagrams and a reasoning gap where models hallucinate logic despite correct predictions. Crucially, Korean models outperform significantly larger global models in local contexts, demonstrating that parameter scaling alone cannot resolve cultural dependencies. K-MetBench serves as a roadmap for developing reliable, culturally aware expert AI agents. The dataset is available at https://huggingface.co/datasets/soyeonbot/K-MetBench .
Chinese Translation
为韩国气象预报员开发实用的(多模态)大型语言模型助手的进展受到缺乏基于权威来源的多维专家级评估框架的阻碍。为了解决这一问题,我们引入了K-MetBench,这是一个基于国家资格考试的诊断基准。它揭示了四个维度中的关键差距:图表的专家视觉推理、通过专家验证的推理的逻辑有效性、特定于韩国的地理文化理解和细粒度领域分析。我们对55个模型的评估揭示了在解释专业图表时存在深刻的模态差距,以及在逻辑推理方面的差距,尽管预测正确,模型仍然出现逻辑幻觉。重要的是,韩国模型在本地上下文中显著优于更大型的全球模型,证明仅靠参数扩展无法解决文化依赖性。K-MetBench为开发可靠的、具有文化意识的专家AI代理提供了路线图。数据集可在 https://huggingface.co/datasets/soyeonbot/K-MetBench 获取。
cs.CL / 101 / 2604.24647
DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
DepthKV:针对长上下文大语言模型推理的层依赖型KV缓存剪枝
Abstract
Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint grows linearly with sequence length, leading to a major memory bottleneck. To mitigate this overhead, KV cache pruning methods discard cached tokens with low attention scores during inference. Most existing methods apply a uniform pruning ratio across layers, implicitly assuming that all layers contribute equally to overall model performance. We show that this assumption is suboptimal, as layers differ significantly in their sensitivity to pruning. We propose DepthKV, a layer-dependent pruning framework that allocates a fixed global KV budget across layers based on their sensitivity, rather than using a uniform allocation. Across multiple models and tasks, DepthKV consistently outperforms uniform pruning at the same global pruning ratio, demonstrating more effective utilization of the KV cache budget through layer-dependent allocation.
Chinese Translation
长上下文推理是大语言模型(LLMs)的一项关键能力,使其能够应用于长文档理解、摘要生成和代码生成等任务。然而,高效的自回归推理依赖于键值(KV)缓存,其内存占用随着序列长度线性增长,导致了主要的内存瓶颈。为了减轻这一开销,KV缓存剪枝方法在推理过程中会丢弃注意力得分较低的缓存令牌。大多数现有方法在各层之间应用统一的剪枝比例,隐含地假设所有层对整体模型性能的贡献是相等的。我们证明了这一假设并不理想,因为不同层对剪枝的敏感性差异显著。我们提出了DepthKV,这是一种层依赖的剪枝框架,根据各层的敏感性分配固定的全局KV预算,而不是使用统一的分配。在多个模型和任务中,DepthKV在相同的全局剪枝比例下始终优于统一剪枝,通过层依赖的分配方式更有效地利用KV缓存预算。
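The abstract says a fixed global KV budget is split across layers according to their sensitivity but does not give the allocation rule. As a minimal sketch, assuming a simple proportional split with a per-layer floor, the idea looks like this. The function name, the floor, and the sensitivity scores are hypothetical.

```python
def allocate_kv_budget(sensitivities, total_budget, min_per_layer=1):
    """Split `total_budget` cached tokens across layers in proportion to
    each layer's pruning sensitivity (higher = keep more tokens)."""
    n_layers = len(sensitivities)
    if total_budget < n_layers * min_per_layer:
        raise ValueError("budget too small for the per-layer minimum")

    spare = total_budget - n_layers * min_per_layer
    total_sensitivity = sum(sensitivities)
    if total_sensitivity <= 0:
        # No usable signal: fall back to a uniform split.
        raw = [spare / n_layers] * n_layers
    else:
        raw = [spare * s / total_sensitivity for s in sensitivities]

    budgets = [min_per_layer + int(r) for r in raw]
    # Hand out tokens lost to rounding, largest fractional part first.
    leftover = total_budget - sum(budgets)
    by_fraction = sorted(range(n_layers), key=lambda i: raw[i] - int(raw[i]),
                         reverse=True)
    for i in by_fraction[:leftover]:
        budgets[i] += 1
    return budgets


# Hypothetical sensitivities for a 3-layer toy model
print(allocate_kv_budget([1.0, 2.0, 1.0], total_budget=400))  # [100, 200, 100]
```

A uniform pruning ratio corresponds to passing equal sensitivities, so the two schemes can be compared at the same global budget.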
cs.CL / 102 / 2604.24665
Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation
土耳其语中的源敏感推理基准测试:在证据信任操控下的人类与大型语言模型
Abstract
This paper investigates whether source trustworthiness shapes Turkish evidential morphology and whether large language models (LLMs) track this sensitivity. We study the past-domain contrast between -DI and -mIs in controlled cloze contexts where the information source is overtly external, while only its perceived reliability is manipulated (High-Trust vs. Low-Trust). In a human production experiment, native speakers of Turkish show a robust trust effect: High-Trust contexts yield relatively more -DI, whereas Low-Trust contexts yield relatively more -mIs, with the pattern remaining stable across sensitivity analyses. We then evaluate 10 LLMs in three prompting paradigms (open gap-fill, explicit past-tense gap-fill, and forced-choice A/B selection). LLM behavior is highly model- and prompt-dependent: some models show weak or local trust-consistent shifts, but effects are generally unstable, often reversed, and frequently overshadowed by output-compliance problems and strong base-rate suffix preferences. The results provide new evidence for a trust-/commitment-based account of Turkish evidentiality and reveal a clear human-LLM gap in source-sensitive evidential reasoning.
Chinese Translation
本文研究了源的可信度是否影响土耳其语的证据形态,以及大型语言模型(LLMs)是否能够跟踪这种敏感性。我们在控制的填空语境中研究了-DI和-mIs之间的过去领域对比,其中信息源明显是外部的,而其感知的可靠性则被操控(高信任 vs. 低信任)。在一项人类生产实验中,土耳其语母语者表现出明显的信任效应:高信任语境产生相对更多的-DI,而低信任语境则产生相对更多的-mIs,这一模式在敏感性分析中保持稳定。随后,我们在三种提示范式(开放填空、明确的过去式填空和强制选择A/B)中评估了10个LLM。LLM的行为高度依赖于模型和提示:一些模型表现出微弱或局部的信任一致性变化,但效果通常不稳定,往往相反,并且常常被输出合规性问题和强烈的基率后缀偏好所掩盖。结果为土耳其语证据性的信任/承诺基础解释提供了新证据,并揭示了人类与LLM在源敏感证据推理方面的明显差距。
cs.CL / 103 / 2604.24690
Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination
大型语言模型能作为历史学家吗?通过中国科举评估大型语言模型的历史研究能力
Abstract
While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning, that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench.
Chinese Translation
尽管大型语言模型(LLMs)在文本处理等历史任务中越来越多地提供帮助,但它们在专业级历史推理方面的能力仍然未得到充分探索。现有基准主要评估基本知识广度或词汇理解,未能捕捉到历史研究中核心的高阶技能,如证据推理。为填补这一空白,我们引入了ProHist-Bench,这是一个基于中国科举(Keju)系统的新基准,该系统是东亚政治、社会和知识历史的一个全面缩影,跨越了1300多年。ProHist-Bench通过深入的跨学科合作开发,包含了来自八个朝代的400个具有挑战性的专家策划问题,并配有10,891个细化的评估标准。通过对18个LLMs的严格评估,我们揭示了一个显著的能力差距:即使是最先进的LLMs在复杂的历史研究问题上也面临困难。我们希望ProHist-Bench能够促进领域特定推理LLMs的发展,推动计算历史研究,并进一步发掘LLMs未被利用的潜力。我们在https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench发布了ProHist-Bench。
cs.CL / 104 / 2604.24693
Contextual Linear Activation Steering of Language Models
语言模型的上下文线性激活引导
Abstract
Linear activation steering is a powerful approach for eliciting the capabilities of large language models and specializing their behavior using limited labeled data. While effective, existing methods often apply a fixed steering strength to all tokens, resulting in inconsistent steering quality across diverse input prompts. In this work, we introduce Contextual Linear Activation Steering (CLAS), a method that dynamically adapts linear activation steering to context-dependent steering strengths. Across eleven steering benchmarks and four model families, it consistently outperforms standard linear activation steering and matches or exceeds the performance of ReFT and LoRA in settings with limited labeled data. We therefore propose CLAS as a scalable, interpretable, and accurate method for specializing and steering large language models.
Chinese Translation
线性激活引导是一种强大的方法,用于引发大型语言模型的能力并利用有限的标注数据专门化其行为。尽管有效,现有方法通常对所有标记应用固定的引导强度,导致在不同输入提示下引导质量不一致。在本研究中,我们提出了上下文线性激活引导(Contextual Linear Activation Steering, CLAS),该方法动态调整线性激活引导以适应上下文依赖的引导强度。在十一项引导基准测试和四个模型系列中,CLAS始终优于标准线性激活引导,并在有限标注数据的设置中与ReFT和LoRA的性能相匹配或超越。因此,我们提出CLAS作为一种可扩展、可解释且准确的方法,用于专门化和引导大型语言模型。
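The contrast between a fixed steering strength and a context-dependent one can be sketched in a few lines. This is a minimal illustration, not the CLAS implementation: the abstract does not say how the strength is computed, so the linear gate `alpha = w . h + b` and the names `fixed_steer`, `contextual_steer`, `w`, and `b` are assumptions made here for illustration only.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fixed_steer(h: Vector, v: Vector, alpha: float) -> Vector:
    """Standard linear steering: add alpha * v with the same alpha for every token."""
    return [hi + alpha * vi for hi, vi in zip(h, v)]


def contextual_steer(h: Vector, v: Vector, w: Vector, b: float) -> Vector:
    """Hypothetical contextual variant: the strength is a linear function of h.

    The gate (w, b) is an assumption of this sketch, not taken from the paper.
    """
    alpha = dot(w, h) + b
    return [hi + alpha * vi for hi, vi in zip(h, v)]


# Two toy token activations, one steering direction, one illustrative gate.
tokens = [[1.0, 0.0], [0.0, 2.0]]
v = [1.0, 1.0]
w = [0.5, -0.25]
b = 0.0

fixed = [fixed_steer(h, v, 0.5) for h in tokens]
steered = [contextual_steer(h, v, w, b) for h in tokens]
print(fixed)
print(steered)
```

In this toy setup the first token is pushed along `v` and the second is pushed against it, whereas the fixed variant shifts both by the same amount.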
cs.CL / 105 / 2604.24698
The Chameleon's Limit: Investigating Persona Collapse and Homogenization in Large Language Models
变色龙的极限:探讨大型语言模型中的人格崩溃与同质化
Abstract
Applications based on large language models (LLMs), such as multi-agent simulations, require population diversity among agents. We identify a pervasive failure mode we term \emph{Persona Collapse}: agents each assigned a distinct profile nonetheless converge into a narrow behavioral mode, producing a homogeneous simulated population. To quantify persona collapse, we propose a framework that measures how much of the persona space a population occupies (Coverage), how evenly agents spread across it (Uniformity), and how rich the resulting behavioral patterns are (Complexity). Evaluating ten LLMs on personality simulation (BFI-44), moral reasoning, and self-introduction, we observe persona collapse along two axes: (1) Dimensions: a model can appear diverse on one axis yet structurally degenerate on another, and (2) Domains: the same model may collapse the most in personality yet be the most diverse in moral reasoning. Furthermore, item-level diagnostics reveal that behavioral variation tracks coarse demographic stereotypes rather than the fine-grained individual differences specified in each persona. Counter-intuitively, \textbf{the models achieving the highest per-persona fidelity consistently produce the most stereotyped populations}. We release our toolkit and data to support population-level evaluation of LLMs.
Chinese Translation
基于大型语言模型(LLMs)的应用,如多智能体模拟,需要智能体之间的人口多样性。我们识别出一种普遍的失败模式,称之为“人格崩溃”(Persona Collapse):尽管每个智能体被分配了不同的特征,但它们却趋向于狭窄的行为模式,导致模拟人口的同质化。为了量化人格崩溃,我们提出了一个框架,测量一个群体占据的人格空间的覆盖度(Coverage)、智能体在其中的分布均匀性(Uniformity)以及所产生的行为模式的丰富性(Complexity)。在对十个LLM进行个性模拟(BFI-44)、道德推理和自我介绍的评估中,我们观察到人格崩溃沿两个轴线发生:(1)维度:一个模型在一个轴线上可能看起来多样,但在另一个轴线上却结构性退化;(2)领域:同一模型在个性方面可能崩溃最严重,但在道德推理方面却最为多样。此外,逐项诊断显示,行为变化跟踪的是粗略的人口统计刻板印象,而非每个人格中指定的细微个体差异。违反直觉的是,\textbf{实现最高每个人格保真度的模型往往产生最具刻板印象的人口}。我们发布了我们的工具包和数据,以支持对LLMs的人口级评估。
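To make the population-level idea concrete, the sketch below computes a grid-based stand-in for Coverage and an entropy-based stand-in for Uniformity. The abstract gives only informal descriptions of these metrics, so the definitions here (occupied grid cells, normalized Shannon entropy) and the function names are assumptions for illustration and may differ from the released toolkit. Complexity is omitted because the abstract does not describe it in enough detail to sketch.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def cell_of(point: Sequence[float], bins: int) -> Tuple[int, ...]:
    """Map a point with coordinates in [0, 1] to its grid cell."""
    return tuple(min(int(x * bins), bins - 1) for x in point)


def coverage_and_uniformity(
    points: List[Sequence[float]], bins: int
) -> Tuple[float, float]:
    """Return (coverage, uniformity) for a population of agents.

    coverage: fraction of grid cells that contain at least one agent.
    uniformity: Shannon entropy of the cell occupancy, divided by its
    maximum possible value, so it lies in [0, 1]. Requires bins >= 2.
    """
    counts = Counter(cell_of(p, bins) for p in points)
    total_cells = bins ** len(points[0])
    n = len(points)
    coverage = len(counts) / total_cells
    entropy = sum((c / n) * math.log(n / c) for c in counts.values())
    return coverage, entropy / math.log(total_cells)


# Toy 2-D "persona space": one spread-out population, one collapsed population.
diverse = [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9), (0.9, 0.9)]
collapsed = [(0.55, 0.6), (0.6, 0.55), (0.58, 0.58), (0.52, 0.61)]

print(coverage_and_uniformity(diverse, bins=2))
print(coverage_and_uniformity(collapsed, bins=2))
```

The collapsed population occupies a single cell, so both scores drop sharply even though each agent is nominally distinct.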
cs.CL / 106 / 2604.24700
Green Shielding: A User-Centric Approach Towards Trustworthy AI
绿色屏障:以用户为中心的可信赖人工智能方法
Abstract
Large language models (LLMs) are increasingly deployed, yet their outputs can be highly sensitive to routine, non-adversarial variation in how users phrase queries, a gap not well addressed by existing red-teaming efforts. We propose Green Shielding, a user-centric agenda for building evidence-backed deployment guidance by characterizing how benign input variation shifts model behavior. We operationalize this agenda through the CUE criteria: benchmarks with authentic Context, reference standards and metrics that capture true Utility, and perturbations that reflect realistic variations in the Elicitation of model behavior. Guided by the PCS framework and developed with practicing physicians, we instantiate Green Shielding in medical diagnosis through HealthCareMagic-Diagnosis (HCM-Dx), a benchmark of patient-authored queries, together with structured reference diagnosis sets and clinically grounded metrics for evaluating differential diagnosis lists. We also study perturbation regimes that capture routine input variation and show that prompt-level factors shift model behavior along clinically meaningful dimensions. Across multiple frontier LLMs, these shifts trace out Pareto-like tradeoffs. In particular, neutralization, which removes common user-level factors while preserving clinical content, increases plausibility and yields more concise, clinician-like differentials, but reduces coverage of highly likely and safety-critical conditions. Together, these results show that interaction choices can systematically shift task-relevant properties of model outputs and support user-facing guidance for safer deployment in high-stakes domains. Although instantiated here in medical diagnosis, the agenda extends naturally to other decision-support settings and agentic AI systems.
Chinese Translation
大型语言模型(LLMs)正被越来越广泛地部署,但它们的输出对用户提问方式的常规、非对抗性变化高度敏感,这一差距在现有的红队工作中未得到很好的解决。我们提出了绿色屏障(Green Shielding),这是一个以用户为中心的议程,旨在通过表征良性输入变化如何影响模型行为,构建基于证据的部署指导。我们通过CUE标准来实现这一议程:具有真实上下文的基准、捕捉真实效用的参考标准和指标,以及反映模型行为引导中现实变化的扰动。在PCS框架的指导下,并与执业医生共同开发,我们在医疗诊断中实例化了绿色屏障,通过HealthCareMagic-Diagnosis(HCM-Dx),这是一个患者撰写查询的基准,结合结构化的参考诊断集和用于评估鉴别诊断列表的临床基础指标。我们还研究了捕捉常规输入变化的扰动机制,并显示提示级因素沿临床相关维度改变模型行为。在多个前沿LLM中,这些变化描绘出类似帕累托的权衡。特别是中和(neutralization),即去除常见用户级因素同时保留临床内容,增加了合理性,并产生了更简洁、类似临床医生的鉴别结果,但减少了对高度可能和安全关键条件的覆盖。总的来说,这些结果表明,交互选择可以系统性地改变模型输出的任务相关属性,并支持用户导向的指导,以便在高风险领域实现更安全的部署。尽管在这里以医疗诊断为例,但这一议程自然扩展到其他决策支持环境和自主人工智能系统。
cs.CL / 107 / 2604.24715
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
长上下文感知的再利用:混合大语言模型扩展的新前沿
Abstract
Hybrid sequence models that combine efficient Transformer components with linear sequence modeling blocks are a promising alternative to pure Transformers, but most are still pretrained from scratch and therefore fail to reuse existing Transformer checkpoints. We study upcycling as a practical path to convert pretrained Transformer LLMs into hybrid architectures while preserving short-context quality and improving long-context capability. We call our solution \emph{HyLo} (HYbrid LOng-context): a long-context upcycling recipe that combines architectural adaptation with efficient Transformer blocks, Multi-Head Latent Attention (MLA), and linear blocks (Mamba2 or Gated DeltaNet), together with staged long-context training and teacher-guided distillation for stable optimization. HyLo extends usable context length by up to $32\times$ through efficient post-training and reduces KV-cache memory by more than $90\%$, enabling up to 2M-token prefill and decoding in our \texttt{vLLM} inference stack, while comparable Llama baselines run out of memory beyond 64K context. Across 1B- and 3B-scale settings (Llama- and Qwen-based variants), HyLo delivers consistently strong short- and long-context performance and significantly outperforms state-of-the-art upcycled hybrid baselines on long-context evaluations such as RULER. Notably, at similar scale, HyLo-Qwen-1.7B trained on only 10B tokens significantly outperforms JetNemotron (trained on 400B tokens) on GSM8K, Lm-Harness common sense reasoning and RULER-64K.
Chinese Translation
混合序列模型通过将高效的Transformer组件与线性序列建模模块相结合,成为纯Transformer的有希望的替代方案,但大多数仍然是从头开始预训练,因此无法重用现有的Transformer检查点。我们研究了再利用作为将预训练的Transformer大语言模型转换为混合架构的实用路径,同时保持短上下文的质量并改善长上下文的能力。我们称我们的解决方案为\textit{HyLo}(HYbrid LOng-context):一种长上下文再利用的方案,结合了架构适应、高效的Transformer模块、多头潜在注意力(MLA)和线性模块(Mamba2或Gated DeltaNet),以及分阶段的长上下文训练和教师引导的蒸馏以实现稳定的优化。HyLo通过高效的后训练将可用上下文长度最多扩展至$32\times$,并将KV缓存内存减少超过$90\%$,使我们在\texttt{vLLM}推理堆栈中实现高达2M标记的预填充和解码,而可比的Llama基线在超过64K上下文时会耗尽内存。在1B和3B规模设置(基于Llama和Qwen的变体)中,HyLo在短上下文和长上下文性能上始终表现出色,并在长上下文评估(如RULER)中显著超越了最先进的再利用混合基线。值得注意的是,在相似规模下,HyLo-Qwen-1.7B仅在10B标记上训练,显著超越了在400B标记上训练的JetNemotron在GSM8K、Lm-Harness常识推理和RULER-64K上的表现。
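A back-of-the-envelope calculation shows why KV-cache size dominates at long context. The formula is the standard one for a full-attention cache (keys plus values, per layer, per KV head). The model configuration is a hypothetical 1B-scale setup chosen for illustration, not a figure from the abstract, and "2M tokens" is taken here to mean 2 × 1024 × 1024.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of a full-attention KV cache for one sequence.

    The factor 2 accounts for storing both keys and values;
    bytes_per_elem=2 corresponds to 16-bit storage.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical 1B-scale configuration (an assumption of this sketch).
LAYERS, KV_HEADS, HEAD_DIM = 16, 8, 64

baseline_64k = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 64 * 1024)
baseline_2m = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 2 * 1024 * 1024)

# "More than 90%" reduction means at most 10% of the baseline remains.
hybrid_2m_upper_bound = baseline_2m // 10

print(f"baseline @ 64K: {baseline_64k / GIB:.1f} GiB")
print(f"baseline @ 2M:  {baseline_2m / GIB:.1f} GiB")
print(f"hybrid @ 2M:    at most {hybrid_2m_upper_bound / GIB:.1f} GiB")
```

Under these assumptions a single sequence needs 2 GiB of cache at 64K tokens and 64 GiB at 2M tokens, while a cache reduced by more than 90% stays below roughly 6.4 GiB.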
cs.CL / 108 / 2604.24720
Sentiment and Emotion Classification of Indonesian E-Commerce Reviews via Multi-Task BiLSTM and AutoML Benchmarking
通过多任务双向长短期记忆网络和自动机器学习基准对印尼电子商务评论进行情感与情绪分类
Abstract
Indonesian marketplace reviews mix standard vocabulary with slang, regional loanwords, numeric shorthands, and emoji, making lexicon-based sentiment tools unreliable in practice. This paper describes a two-track classification pipeline applied to the PRDECT-ID dataset, which contains 5,400 product reviews from 29 Indonesian e-commerce categories, each labeled for binary sentiment (Positive/Negative) and five-class emotion (Happy, Sad, Fear, Love, Anger). The first track applies TF-IDF vectorization with a PyCaret AutoML sweep across standard classifiers. The second track is a PyTorch Bidirectional Long Short-Term Memory (BiLSTM) network with a shared encoder and two task-specific output heads. A preprocessing module applies 14 sequential cleaning steps, including a 140-entry slang dictionary assembled from marketplace corpora. Four configurations are benchmarked: BiLSTM Baseline, BiLSTM Improved, BiLSTM Large, and TextCNN. Training uses class-weighted cross-entropy loss, ReduceLROnPlateau scheduling, and early stopping. Both tracks are deployed as Gradio applications on Hugging Face Spaces. Source code is publicly available at https://github.com/ikii-sd/pba2026-crazyrichteam.
Chinese Translation
印尼市场评论混合了标准词汇、俚语、地方借词、数字简写和表情符号,使得基于词汇的情感工具在实践中不可靠。本文描述了一种应用于PRDECT-ID数据集的双轨分类管道,该数据集包含来自29个印尼电子商务类别的5400条产品评论,每条评论均标注了二元情感(积极/消极)和五类情绪(快乐、悲伤、恐惧、爱、愤怒)。第一轨道应用TF-IDF向量化,并在标准分类器上进行PyCaret自动机器学习(AutoML)搜索。第二轨道是一个PyTorch双向长短期记忆网络(BiLSTM),具有共享编码器和两个特定任务的输出头。预处理模块应用了14个顺序清理步骤,包括从市场语料库汇编的140条俚语词典。我们对四种配置进行了基准测试:BiLSTM基线、BiLSTM改进版、BiLSTM大规模版和TextCNN。训练使用类加权交叉熵损失、ReduceLROnPlateau调度和提前停止。两个轨道均作为Gradio应用部署在Hugging Face Spaces上。源代码可在https://github.com/ikii-sd/pba2026-crazyrichteam公开获取。
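The class-weighted loss over two output heads can be sketched without any deep learning framework. The abstract does not state how class weights are derived or how the two task losses are combined, so this sketch assumes inverse-frequency weights and an unweighted sum. The weighted mean follows the usual convention of normalizing by the sum of the target weights. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def inverse_frequency_weights(labels: Sequence[int], num_classes: int) -> List[float]:
    """Weight class c by n / (num_classes * count_c), so rare classes count more."""
    counts = Counter(labels)
    n = len(labels)
    return [n / (num_classes * counts[c]) if counts[c] else 0.0
            for c in range(num_classes)]


def weighted_cross_entropy(logits: Sequence[Sequence[float]],
                           targets: Sequence[int],
                           weights: Sequence[float]) -> float:
    """Weighted mean of per-example negative log-likelihoods."""
    total, norm = 0.0, 0.0
    for row, y in zip(logits, targets):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += weights[y] * (log_z - row[y])
        norm += weights[y]
    return total / norm


def multitask_loss(sent_logits, sent_targets, sent_w,
                   emo_logits, emo_targets, emo_w) -> float:
    """Sum of the two head losses (equal weighting is an assumption here)."""
    return (weighted_cross_entropy(sent_logits, sent_targets, sent_w)
            + weighted_cross_entropy(emo_logits, emo_targets, emo_w))


# Toy batch: 2-class sentiment head and 5-class emotion head, uniform logits.
sent_targets = [0, 0, 0, 1]
emo_targets = [0, 1, 2, 3]
sent_w = inverse_frequency_weights(sent_targets, 2)
emo_w = [1.0] * 5
sent_logits = [[0.0, 0.0]] * 4
emo_logits = [[0.0] * 5] * 4

loss = multitask_loss(sent_logits, sent_targets, sent_w,
                      emo_logits, emo_targets, emo_w)
print(sent_w, loss)
```

With uniform logits each head's loss equals the log of its class count, so the combined value is log 2 + log 5 regardless of the class weights. This gives a quick sanity check before training.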