arXiv Daily Paper Picks

2026-02-13
Total papers: 162
Selected papers: 20
Average score: 2.8
AttentionRetriever: Attention Layers are Secretly Long Document Retrievers
David Jiahao Fu, Lam Thanh Do, Jiayu Li, Kevin Chen-Chuan Chang
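Core summary:

This paper studies the core challenges of long-document retrieval (context-awareness, causal dependence, and scope of retrieval) and proposes a new method that uses the attention mechanism and entity-based retrieval to build context-aware embeddings and determine the scope of retrieval.

Personalized recommendation rationale:

The paper directly targets the core challenges of long-document retrieval in retrieval-augmented generation and proposes a new attention-based retrieval model; it is highly relevant to LLM applications and Transformer architecture efficiency.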

2026-02-12 18:59:35 | arXiv:2602.12278v1 |
cs.IR, cs.AI
Retrieval augmented generation (RAG) has been widely adopted to help Large Language Models (LLMs) process tasks involving long documents. However, existing retrieval models are not designed for long document retrieval and fail to address several key challenges of long document retrieval, including context-awareness, causal dependence, and scope of retrieval. In this paper, we propose AttentionRetriever, a novel long document retrieval model that leverages the attention mechanism and entity-based retrieval to build context-aware embeddings for long documents and determine the scope of retrieval. With extensive experiments, we find that AttentionRetriever is able to outperform existing retrieval models on long document retrieval datasets by a large margin while remaining as efficient as dense retrieval models.
Compress, Cross and Scale: Multi-Level Compression Cross Networks for Efficient Scaling in Recommender Systems
Heng Yu, Xiangjun Zhou, Jie Xia, Heng Zhao, Anxin Wu, Yu Zhao, Dongying Kong
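Core summary:

This paper studies the core problem of efficiently modeling high-order feature interactions in recommender systems. Its core method is a structured architecture that organizes feature crosses through hierarchical compression and dynamic composition, plus a multi-channel extension that decomposes feature interactions into parallel subspaces for efficient horizontal scaling.

Personalized recommendation rationale:

The paper directly targets a core recommender-system challenge, the efficient modeling of high-order feature interactions, and proposes a structured compression and multi-channel extension architecture; it is highly relevant both as a core-domain advance and as a direct application.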

2026-02-12 15:06:46 | arXiv:2602.12041v1 |
cs.IR
Modeling high-order feature interactions efficiently is a central challenge in click-through rate and conversion rate prediction. Modern industrial recommender systems are predominantly built upon deep learning recommendation models, where the interaction backbone plays a critical role in determining both predictive performance and system efficiency. However, existing interaction modules often struggle to simultaneously achieve strong interaction capacity, high computational efficiency, and good scalability, resulting in limited ROI when models are scaled under strict production constraints. In this work, we propose MLCC, a structured feature interaction architecture that organizes feature crosses through hierarchical compression and dynamic composition, which can efficiently capture high-order feature dependencies while maintaining favorable computational complexity. We further introduce MC-MLCC, a Multi-Channel extension that decomposes feature interactions into parallel subspaces, enabling efficient horizontal scaling with improved representation capacity and significantly reduced parameter growth. Extensive experiments on three public benchmarks and a large-scale industrial dataset show that our proposed models consistently outperform strong DLRM-style baselines by up to 0.52 AUC, while reducing model parameters and FLOPs by up to 26× under comparable performance. Comprehensive scaling analyses demonstrate stable and predictable scaling behavior across embedding dimension, head number, and channel count, with channel-based scaling achieving substantially better efficiency than conventional embedding inflation. Finally, online A/B testing on a real-world advertising platform validates the practical effectiveness of our approach, which has been widely adopted in the Bilibili advertising system under strict latency and resource constraints.
Improving Neural Retrieval with Attribution-Guided Query Rewriting
Moncef Garouani, Josiane Mothe
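Core summary:

This paper studies the brittleness of neural retrievers on underspecified or ambiguous queries. Its core method is attribution-guided query rewriting: the retriever's gradient-based attribution scores serve as soft guidance in a structured prompt that steers an LLM to clarify weak or misleading query components while preserving the original intent.

Personalized recommendation rationale:

The paper directly targets a core retrieval problem and proposes a novel query rewriting method that combines an LLM with retriever feedback, matching both focus areas: core-domain advances and direct LLM applications.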

2026-02-12 11:34:06 | arXiv:2602.11841v1 |
cs.IR, cs.AI, cs.LG
Neural retrievers are effective but brittle: underspecified or ambiguous queries can misdirect ranking even when relevant documents exist. Existing approaches address this brittleness only partially: LLMs rewrite queries without retriever feedback, and explainability methods identify misleading tokens but are used for post-hoc analysis. We close this loop and propose an attribution-guided query rewriting method that uses token-level explanations to guide query rewriting. For each query, we compute gradient-based token attributions from the retriever and then use these scores as soft guidance in a structured prompt to an LLM that clarifies weak or misleading query components while preserving intent. Evaluated on BEIR collections, the resulting rewrites consistently improve retrieval effectiveness over strong baselines, with larger gains for implicit or ambiguous information needs.
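The prompt-construction step described in this abstract can be sketched roughly as follows. This is only an illustration: the function name `build_rewrite_prompt`, the `low` threshold, and the prompt wording are hypothetical, and the attribution scores are assumed to come from a gradient-based method run on the retriever, which is not implemented here.

```python
def build_rewrite_prompt(query_tokens, attributions, low=0.1):
    """Format a rewriting prompt that flags weak or misleading query tokens.

    `attributions` holds one score per token. The scores are assumed to be
    precomputed by a gradient-based attribution method on the retriever;
    the `low` threshold is an illustrative assumption.
    """
    if len(query_tokens) != len(attributions):
        raise ValueError("need exactly one attribution score per token")
    lines = []
    for token, score in zip(query_tokens, attributions):
        if abs(score) < low:
            flag = "weak"          # token barely influences the ranking
        elif score < 0:
            flag = "misleading"    # token pushes the ranking the wrong way
        else:
            flag = "strong"
        lines.append(f"- {token}: {score:+.2f} ({flag})")
    return (
        "Rewrite the query so that weak or misleading terms are clarified, "
        "while preserving the original intent.\n"
        f"Query: {' '.join(query_tokens)}\n"
        "Token attributions:\n" + "\n".join(lines)
    )


prompt = build_rewrite_prompt(["jaguar", "speed"], [0.05, 0.80])
```

The scores act as soft guidance only: the LLM sees them as hints in the prompt rather than as hard constraints on which tokens to change.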
Recurrent Preference Memory for Efficient Long-Sequence Generative Recommendation
Yixiao Chen, Yuan Wang, Yue Liu, Qiyao Wang, Ke Cheng, Xin Xu, Juntong Yan, Shuo...
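Core summary:

This paper studies the computational efficiency of long-sequence generative recommendation models. Its core idea is to compress user interaction histories into preference memory tokens and to use a self-referential teacher-forcing strategy for parallelized training, which substantially improves efficiency while retaining the ability to update iteratively at inference time.

Personalized recommendation rationale:

The paper directly targets long-sequence modeling efficiency in recommender systems and proposes novel memory compression and parallelized training methods; it is highly relevant to core-domain advances and direct LLM applications.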

2026-02-12 05:51:52 | arXiv:2602.11605v1 |
cs.IR
Generative recommendation (GenRec) models typically model user behavior via full attention, but scaling to lifelong sequences is hindered by prohibitive computational costs and noise accumulation from stochastic interactions. To address these challenges, we introduce Rec2PM, a framework that compresses long user interaction histories into compact Preference Memory tokens. Unlike traditional recurrent methods that suffer from serial training, Rec2PM employs a novel self-referential teacher-forcing strategy: it leverages a global view of the history to generate reference memories, which serve as supervision targets for parallelized recurrent updates. This allows for fully parallel training while maintaining the capability for iterative updates during inference. Additionally, by representing memory as token embeddings rather than extensive KV caches, Rec2PM achieves extreme storage efficiency. Experiments on large-scale benchmarks show that Rec2PM significantly reduces inference latency and memory footprint while achieving superior accuracy compared to full-sequence models. Analysis reveals that the Preference Memory functions as a denoising Information Bottleneck, effectively filtering interaction noise to capture robust long-term interests.
LASER: An Efficient Target-Aware Segmented Attention Framework for End-to-End Long Sequence Modeling
Tianhe Lin, Ziwei Xiong, Baoyuan Ou, Yingjie Qin, Lai Xu, Xiaocheng Zhong, Yao H...
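Core summary:

This paper studies the dual bottleneck of retrieval latency and computational complexity in modeling ultra-long user behavior sequences in industrial recommender systems. Its core method is LASER, a full-stack optimization framework: the SeqVault hybrid storage infrastructure reduces retrieval latency, and the Segmented Target Attention mechanism exploits interest sparsity to compress sequences, lowering computational complexity while preserving critical signals.

Personalized recommendation rationale:

The paper directly targets the core challenge of modeling ultra-long user behavior sequences in recommender systems and proposes a framework that optimizes both the system and the algorithm, matching the focus on core recommender-system advances and Transformer architecture efficiency.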

2026-02-12 04:33:37 | arXiv:2602.11562v1 |
cs.IR
Modeling ultra-long user behavior sequences is pivotal for capturing evolving and lifelong interests in modern recommendation systems. However, deploying such models in real-time industrial environments faces a strict "Latency Wall", constrained by two distinct bottlenecks: the high I/O latency of retrieving massive user histories and the quadratic computational complexity of standard attention mechanisms. To break these bottlenecks, we present LASER, a full-stack optimization framework developed and deployed at Xiaohongshu (RedNote). Our approach tackles the challenges through two complementary innovations: (1) System efficiency: We introduce SeqVault, a unified schema-aware serving infrastructure for long user histories. By implementing a hybrid DRAM-SSD indexing strategy, SeqVault reduces retrieval latency by 50% and CPU usage by 75%, ensuring millisecond-level access to full real-time and life-cycle user histories. (2) Algorithmic efficiency: We propose a Segmented Target Attention (STA) mechanism to address the computational overhead. Motivated by the inherent sparsity of user interests, STA employs a sigmoid-based gating strategy that acts as a silence mechanism to filter out noisy items. Subsequently, a lightweight Global Stacked Target Attention (GSTA) module refines these compressed segments to capture cross-segment dependencies without incurring high computational costs. This design performs effective sequence compression, reducing the complexity of long-sequence modeling while preserving critical signals. Extensive offline evaluations demonstrate that LASER consistently outperforms state-of-the-art baselines. In large-scale online A/B testing serving over 100 million daily active users, LASER achieved a 2.36% lift in ADVV and a 2.08% lift in revenue, demonstrating its scalability and significant commercial impact.
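As a rough illustration of why a sigmoid gate can act as a "silence mechanism", the sketch below pools a behavior sequence segment by segment, weighting each item by the sigmoid of its dot product with the target. It is not the STA module from the paper: the function `gated_segment_pool`, the dot-product score, and the sum pooling are all assumptions made for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_segment_pool(items, target, segment_size):
    """Compress a behavior sequence into one vector per segment.

    Each item is weighted by sigmoid(item . target). Unlike softmax
    weights, sigmoid gates are not normalized to sum to 1, so every gate
    in a segment can be close to zero when nothing matches the target.
    """
    dim = len(target)
    pooled = []
    for start in range(0, len(items), segment_size):
        segment = items[start:start + segment_size]
        gates = [
            sigmoid(sum(a * b for a, b in zip(item, target)))
            for item in segment
        ]
        # Gate-weighted sum over the items of this segment.
        pooled.append([
            sum(g * item[d] for g, item in zip(gates, segment))
            for d in range(dim)
        ])
    return pooled
```

With a sequence of length n and segments of size s, a downstream module would only need to attend over n / s pooled vectors instead of n items, which is the kind of sequence compression the abstract describes.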
P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling
Pinyi Zhang, Ting-En Lin, Yuchuan Wu, Jingyang Chen, Zongqi Wang, Hua Yang, Ze X...
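Core summary:

This paper studies how to obtain accurate, user-specific reward signals for the personalized alignment of large language models. Its core idea is P-GenRM, which transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics, and which uses user prototype clustering with a dual-granularity scaling mechanism (adaptive aggregation at the individual level and integration of similar users' preferences at the prototype level) to mitigate preference noise and improve generalization to unseen users.

Personalized recommendation rationale:

The paper directly targets the core challenge of personalized alignment and proposes a novel generative reward model architecture with a test-time user-based scaling mechanism, which is directly relevant to LLM applications in recommendation and search.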

2026-02-12 16:07:22 | arXiv:2602.12116v1 |
cs.CL
Personalized alignment of large language models seeks to adapt responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitations: (1) oversimplifying diverse, scenario-specific preferences into a small, fixed set of evaluation principles, and (2) struggling with generalization to new users with limited feedback. To this end, we propose P-GenRM, the first Personalized Generative Reward Model with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics across various scenarios. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism: at the individual level, it adaptively scales and aggregates each user's scoring scheme; at the prototype level, it incorporates preferences from similar users. This design mitigates noise in inferred preferences and enhances generalization to unseen users through prototype-based transfer. Empirical results show that P-GenRM achieves state-of-the-art results on widely-used personalized reward model benchmarks, with an average improvement of 2.31%, and demonstrates strong generalization on an out-of-distribution dataset. Notably, Test-time User-based scaling provides an additional 3% boost, demonstrating stronger personalized alignment with test-time scalability.
RAM-Net: Expressive Linear Attention with Selectively Addressable Memory
Kaicheng Xiao, Haotian Li, Liran Dong, Guoliang Xing
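Core summary:

This paper studies how the fixed-size memory of linear attention architectures causes information loss and limits expressivity. Its core method is the RAM-Net architecture, which maps inputs to high-dimensional sparse vectors that serve as explicit addresses for selectively accessing a large memory state, enabling exponential state-size scaling without additional parameters and significantly improving retrieval fidelity while keeping computation efficient.

Personalized recommendation rationale:

The paper proposes a novel improvement to the Transformer architecture that strengthens the expressivity of linear attention through an addressable memory mechanism, corresponding directly to the two focus areas "enabling Transformer technology" and "core-domain advances".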

2026-02-12 13:55:29 | arXiv:2602.11958v1 |
cs.LG, cs.CL
While linear attention architectures offer efficient inference, compressing unbounded history into a fixed-size memory inherently limits expressivity and causes information loss. To address this limitation, we introduce Random Access Memory Network (RAM-Net), a novel architecture designed to bridge the gap between the representational capacity of full attention and the memory efficiency of linear models. The core of RAM-Net maps inputs to high-dimensional sparse vectors serving as explicit addresses, allowing the model to selectively access a massive memory state. This design enables exponential state size scaling without additional parameters, which significantly mitigates signal interference and enhances retrieval fidelity. Moreover, the inherent sparsity ensures exceptional computational efficiency, as state updates are confined to minimal entries. Extensive experiments demonstrate that RAM-Net consistently surpasses state-of-the-art baselines in fine-grained long-range retrieval tasks and achieves competitive performance in standard language modeling and zero-shot commonsense reasoning benchmarks, validating its superior capability to capture complex dependencies with significantly reduced computational overhead.
ULTRA: Urdu Language Transformer-based Recommendation Architecture
Alishbah Bashir, Fatima Qaiser, Ijaz Hussain
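Core summary:

This paper studies insufficient semantic understanding in personalized news recommendation for Urdu, a low-resource language. Its core method is a Transformer-based dual-embedding architecture with a query-length-aware routing mechanism that dynamically distinguishes short intent-focused queries from long context-rich queries and retrieves with headline-level or document-level semantic representations accordingly.

Personalized recommendation rationale:

The paper directly targets personalized content recommendation for a low-resource language and proposes a Transformer-based dual-embedding architecture with query-length-aware routing; it is a core recommender-system innovation and highly relevant to LLM applications in recommendation.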

2026-02-12 11:26:46 | arXiv:2602.11836v1 |
cs.IR, cs.AI
Urdu, as a low-resource language, lacks effective semantic content recommendation systems, particularly in the domain of personalized news retrieval. Existing approaches largely rely on lexical matching or language-agnostic techniques, which struggle to capture semantic intent and perform poorly under varying query lengths and information needs. This limitation results in reduced relevance and adaptability in Urdu content recommendation. We propose ULTRA (Urdu Language Transformer-based Recommendation Architecture), an adaptive semantic recommendation framework designed to address these challenges. ULTRA introduces a dual-embedding architecture with a query-length-aware routing mechanism that dynamically distinguishes between short, intent-focused queries and longer, context-rich queries. Based on a threshold-driven decision process, user queries are routed to specialized semantic pipelines optimized for either title/headline-level or full-content/document-level representations, ensuring appropriate semantic granularity during retrieval. The proposed system leverages transformer-based embeddings and optimized pooling strategies to move beyond surface-level keyword matching and enable context-aware similarity search. Extensive experiments conducted on a large-scale Urdu news corpus demonstrate that the proposed architecture consistently improves recommendation relevance across diverse query types. Results show gains in precision above 90% compared to single-pipeline baselines, highlighting the effectiveness of query-adaptive semantic alignment for low-resource languages. The findings establish ULTRA as a robust and generalizable content recommendation architecture, offering practical design insights for semantic retrieval systems in low-resource language settings.
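The threshold-driven routing described in this abstract might look like the minimal sketch below. The abstract does not give the threshold or the tokenization, so the whitespace split, the default of 5 tokens, and the index names `title_index` and `content_index` are all hypothetical.

```python
def route_query(query, max_short_tokens=5):
    """Pick a retrieval pipeline based on query length.

    Short queries go to a headline-level index, longer ones to a
    full-document index. The threshold value and the index names are
    illustrative assumptions, not values from the paper.
    """
    n_tokens = len(query.split())
    if n_tokens <= max_short_tokens:
        return "title_index"
    return "content_index"
```

For example, `route_query("cricket world cup")` selects the headline-level pipeline, while a sentence-length query is sent to the document-level pipeline.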
Uncertainty-aware Generative Recommendation
Chenxiao Fan, Chongming Gao, Yaxin Gong, Haoyan Liu, Fuli Feng, Xiangnan He
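Core summary:

This paper studies how preference optimization for generative recommendation leads to unstable training and unquantifiable decision risk because it neglects the model's generation confidence, the variation in sample learning difficulty, and explicit confidence expression. Its core idea is a unified uncertainty-aware framework that treats uncertainty as a key signal for adaptive optimization through three mechanisms: uncertainty-weighted rewards, difficulty-aware optimization dynamics, and explicit confidence alignment.

Personalized recommendation rationale:

The paper directly targets a core optimization problem in generative recommendation and proposes a unified uncertainty-aware framework; it is a frontier methodological advance in recommender systems and highly relevant to core-domain advances and direct LLM applications.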

2026-02-12 08:48:51 | arXiv:2602.11719v1 |
cs.IR
Generative Recommendation has emerged as a transformative paradigm, reformulating recommendation as an end-to-end autoregressive sequence generation task. Despite its promise, existing preference optimization methods typically rely on binary outcome correctness, suffering from a systemic limitation we term uncertainty blindness. This issue manifests in the neglect of the model's intrinsic generation confidence, the variation in sample learning difficulty, and the lack of explicit confidence expression, directly leading to unstable training dynamics and unquantifiable decision risks. In this paper, we propose Uncertainty-aware Generative Recommendation (UGR), a unified framework that leverages uncertainty as a critical signal for adaptive optimization. UGR synergizes three mechanisms: (1) an uncertainty-weighted reward to penalize confident errors; (2) difficulty-aware optimization dynamics to prevent premature convergence; and (3) explicit confidence alignment to empower the model with confidence expression capabilities. Extensive experiments demonstrate that UGR not only yields superior recommendation performance but also fundamentally stabilizes training, preventing the performance degradation often observed in standard methods. Furthermore, the learned confidence enables reliable downstream risk-aware applications.
EpicCBR: Item-Relation-Enhanced Dual-Scenario Contrastive Learning for Cold-Start Bundle Recommendation
Yihang Li, Zhuo Liu, Wei Wei
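Core summary:

This paper studies cold-start bundle recommendation. Its core idea is to mine item relations to construct user profiles and to design a multi-view graph contrastive learning framework that handles cold-start and regular scenarios in a unified way, improving the model's generalization.

Personalized recommendation rationale:

The paper directly targets the cold-start problem in recommender systems, proposing a multi-view contrastive learning framework and using item relations to enrich user profiles; it is frontier methodological research in the core recommender-system domain.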

2026-02-12 07:54:21 | arXiv:2602.11680v1 |
cs.IR
Bundle recommendation aims to recommend a set of items to users for overall consumption. Existing bundle recommendation models primarily depend on observed user-bundle interactions, limiting exploration of newly-emerged bundles that are constantly created. This poses a critical representation challenge for current bundle methods, as they usually treat each bundle as an independent instance, while neglecting to fully leverage the user-item (UI) and bundle-item (BI) relations over popular items. To alleviate it, in this paper we propose a multi-view contrastive learning framework for cold-start bundle recommendation, named EpicCBR. Specifically, it precisely mines and utilizes the item relations to construct user profiles, identifying users likely to engage with bundles. Additionally, a popularity-based method that characterizes the features of new bundles through historical bundle information and user preferences is proposed. To build a framework that demonstrates robustness in both cold-start and warm-start scenarios, a multi-view graph contrastive learning framework capable of integrating these diverse scenarios is introduced to ensure the model's generalization capability. Extensive experiments conducted on three popular benchmarks showed that EpicCBR outperforms the state of the art by a large margin (up to 387%), sufficiently demonstrating the superiority of the proposed method in the cold-start scenario. The code and dataset can be found in the GitHub repository: https://github.com/alexlovecoding/EpicCBR.
IntTravel: A Real-World Dataset and Generative Framework for Integrated Multi-Task Travel Recommendation
Huimin Yan, Longfei Xu, Junjie Sun, Zheng Liu, Wei Luo, Kaikui Liu, Xiangxiang C...
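Core summary:

This paper studies fragmentation in travel recommendation, where work focuses only on the destination and ignores journey elements such as departure time and travel mode. Its core idea is to build a large-scale integrated travel dataset and, on top of it, an end-to-end decoder-only generative framework that handles multiple recommendation tasks jointly through information preservation, selection, and factorization.

Personalized recommendation rationale:

The paper proposes an integrated multi-task generative framework for travel recommendation, corresponding directly to core recommender-system advances and demonstrating a practical application of generative methods in recommendation.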

2026-02-12 07:35:06 | arXiv:2602.11664v1 |
cs.IR
Next Point of Interest (POI) recommendation is essential for modern mobility and location-based services. To provide a smooth user experience, models must understand several components of a journey holistically: "when to depart", "how to travel", "where to go", and "what needs arise via the route". However, current research is limited by fragmented datasets that focus merely on next POI recommendation ("where to go"), neglecting the departure time, travel mode, and situational requirements along the journey. Furthermore, the limited scale of these datasets impedes accurate evaluation of performance. To bridge this gap, we introduce IntTravel, the first large-scale public dataset for integrated travel recommendation, including 4.1 billion interactions from 163 million users with 7.3 million POIs. Built upon this dataset, we introduce an end-to-end, decoder-only generative framework for multi-task recommendation. It incorporates information preservation, selection, and factorization to balance task collaboration with specialized differentiation, yielding substantial performance gains. The framework's generalizability is highlighted by its state-of-the-art performance across both the IntTravel dataset and an additional non-travel benchmark. IntTravel has been successfully deployed on Amap, serving hundreds of millions of users and leading to a 1.09% increase in CTR. IntTravel is available at https://github.com/AMAP-ML/IntTravel.
KuaiSearch: A Large-Scale E-Commerce Search Dataset for Recall, Ranking, and Relevance
Yupeng Li, Ben Chen, Mingyue Cheng, Zhiding Liu, Xuxin Zhang, Chenyi Lei, Wenwu ...
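Core summary:

This paper studies how to address the limited realism and coverage of existing e-commerce search datasets. Its core method is to construct and release KuaiSearch, a large-scale e-commerce search dataset built from real user interactions on the Kuaishou platform; it preserves real queries and product texts, covers cold-start users and long-tail products, and systematically spans three key stages of the search pipeline.

Personalized recommendation rationale:

The paper builds the large-scale e-commerce search dataset KuaiSearch, which directly targets core challenges in e-commerce search, covers the full pipeline of recall, ranking, and relevance judgment, and provides key infrastructure for research on LLM applications in search and recommendation systems.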

2026-02-12 03:22:05 | arXiv:2602.11518v1 |
cs.IR
E-commerce search serves as a central interface, connecting user demands with massive product inventories and plays a vital role in our daily lives. However, in real-world applications, it faces challenges, including highly ambiguous queries, noisy product texts with weak semantic order, and diverse user preferences, all of which make it difficult to accurately capture user intent and fine-grained product semantics. In recent years, significant advances in large language models (LLMs) for semantic representation and contextual reasoning have created new opportunities to address these challenges. Nevertheless, existing e-commerce search datasets still suffer from notable limitations: queries are often heuristically constructed, cold-start users and long-tail products are filtered out, query and product texts are anonymized, and most datasets cover only a single stage of the search pipeline. Collectively, these issues constrain research on LLM-based e-commerce search. To address these challenges, we construct and release KuaiSearch. To the best of our knowledge, it is the largest e-commerce search dataset currently available. KuaiSearch is built upon real user search interactions from the Kuaishou platform, preserving authentic user queries and natural-language product texts, covering cold-start users and long-tail products, and systematically spanning three key stages of the search pipeline: recall, ranking, and relevance judgment. We conduct a comprehensive analysis of KuaiSearch from multiple perspectives, including products, users, and queries, and establish benchmark experiments across several representative search tasks. Experimental results demonstrate that KuaiSearch provides a valuable foundation for research on real-world e-commerce search.
From Noise to Order: Learning to Rank via Denoising Diffusion
Sajad Ebrahimi, Bhaskar Mitra, Negar Arabzadeh, Ye Yuan, Haolun Wu, Fattane Zarr...
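Core summary:

This paper studies learning to rank in information retrieval. Its core idea is to use a denoising diffusion generative model to model the joint distribution of query-document feature vectors and relevance labels, rather than a traditional discriminative approach, in order to make ranking models more robust.

Personalized recommendation rationale:

The paper proposes a generative learning-to-rank method based on denoising diffusion and applies it directly to the core ranking problem in information retrieval; it falls under direct LLM applications and involves advances in generative modeling.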

2026-02-12 00:02:37 | arXiv:2602.11453v1 |
cs.IR, cs.AI, cs.LG
In information retrieval (IR), learning-to-rank (LTR) methods have traditionally limited themselves to discriminative machine learning approaches that model the probability of the document being relevant to the query given some feature representation of the query-document pair. In this work, we propose an alternative denoising diffusion-based deep generative approach to LTR that instead models the full joint distribution over feature vectors and relevance labels. While in the discriminative setting, an over-parameterized ranking model may find different ways to fit the training data, we hypothesize that candidate solutions that can explain the full data distribution under the generative setting produce more robust ranking models. With this motivation, we propose DiffusionRank that extends TabDiff, an existing denoising diffusion-based generative model for tabular datasets, to create generative equivalents of classical discriminative pointwise and pairwise LTR objectives. Our empirical results demonstrate significant improvements from DiffusionRank models over their discriminative counterparts. Our work points to a rich space for future research exploration on how we can leverage ongoing advancements in deep generative modeling approaches, such as diffusion, for learning-to-rank in IR.
On-Policy Context Distillation for Language Models
Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, Furu Wei
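Core summary:

This paper studies how language models can more effectively internalize in-context knowledge into their parameters. Its core method is the On-Policy Context Distillation framework, which trains a student model on its own generated trajectories while minimizing the reverse KL divergence against a context-conditioned teacher, so that transferable knowledge is extracted and consolidated from historical solution traces or optimized prompts.

Personalized recommendation rationale:

The proposed On-Policy Context Distillation framework combines context distillation with on-policy distillation; its core innovation is to have the model learn from its own generated trajectories and internalize in-context knowledge. This is a core LLM technical advance with clear potential for knowledge consolidation, behavior optimization, and cross-size distillation of models in recommendation and search systems.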

2026-02-12 18:58:28 | arXiv:2602.12275v1 |
cs.CL
Context distillation enables language models to internalize in-context knowledge into their parameters. In our work, we propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation by training a student model on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation, where models extract and consolidate transferable knowledge from their historical solution traces, and system prompt distillation, where models internalize beneficial behaviors encoded in optimized prompts. Across mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, achieving higher task accuracy while better preserving out-of-distribution capabilities. We further show that OPCD enables effective cross-size distillation, where smaller student models can internalize experiential knowledge from larger teachers.
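For readers unfamiliar with the term, the sketch below contrasts reverse and forward KL divergence on a toy next-token distribution. It only illustrates the divergence itself, under the usual distillation convention that reverse KL is KL(student || teacher); the two distributions are made-up numbers, and none of the OPCD training loop (trajectory sampling, context conditioning) is shown.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token distributions over a two-token vocabulary (made-up values).
student = [0.5, 0.5]   # student model, no context in its prompt
teacher = [0.9, 0.1]   # teacher model, conditioned on the context

reverse_kl = kl_divergence(student, teacher)  # expectation under the student
forward_kl = kl_divergence(teacher, student)  # expectation under the teacher
```

Because the reverse direction takes its expectation under the student, it can be estimated from the student's own samples, which is what makes it a natural fit for on-policy training.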
Query-focused and Memory-aware Reranker for Long Context Processing
Yuqing Li, Jiangnan Li, Mo Yu, Guoxuan Ding, Zheng Lin, Weiping Wang, Jie Zhou
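Core summary:

This paper studies query-focused reranking for long-context processing. Its core idea is to use the attention scores of specific attention heads in a large language model to estimate passage-query relevance, yielding a lightweight listwise reranking framework that requires no manual annotation.

Personalized recommendation rationale:

The paper proposes a lightweight reranking framework based on LLM attention, applies it directly to search, and explores the efficient use of middle-layer attention heads; it is highly relevant to search and to applied LLM techniques.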

2026-02-12 17:23:38 | arXiv:2602.12192v1 |
cs.CL
Built upon the existing analysis of retrieval heads in large language models, we propose an alternative reranking framework that trains models to estimate passage-query relevance using the attention scores of selected heads. This approach provides a listwise solution that leverages holistic information within the entire candidate shortlist during ranking. At the same time, it naturally produces continuous relevance scores, enabling training on arbitrary retrieval datasets without requiring Likert-scale supervision. Our framework is lightweight and effective, requiring only small-scale models (e.g., 4B parameters) to achieve strong performance. Extensive experiments demonstrate that our method outperforms existing state-of-the-art pointwise and listwise rerankers across multiple domains, including Wikipedia and long narrative datasets. It further establishes a new state-of-the-art on the LoCoMo benchmark that assesses the capabilities of dialogue understanding and memory usage. We further demonstrate that our framework supports flexible extensions. For example, augmenting candidate passages with contextual information further improves ranking accuracy, while training attention heads from middle layers enhances efficiency without sacrificing performance.
Tiny Recursive Reasoning with Mamba-2 Attention Hybrid
Wenlong Wang, Fergal Reid
核心总结:

研究在递归推理架构中,用Mamba-2混合算子替代Transformer块是否保持推理能力。核心思想是:Mamba-2的状态空间循环本身是一种迭代精炼形式,将其引入递归框架可作为递归算子设计空间中的可行候选方案。

个性化推荐理由:

该论文研究Mamba-2混合算子(结合状态空间模型与注意力机制)在递归推理架构中的应用,直接涉及Transformer架构效率改进和新型注意力机制探索,属于核心LLM技术进展,对推荐/搜索系统中的序列建模和高效推理有潜在应用价值。

2026-02-12 15:36:32 | arXiv:2602.12078v1 |
cs.AI, cs.CL
Recent work on recursive reasoning models like TRM demonstrates that tiny networks (7M parameters) can achieve strong performance on abstract reasoning tasks through latent recursion -- iterative refinement in hidden representation space without emitting intermediate tokens. This raises a natural question about operator choice: Mamba-2's state space recurrence is itself a form of iterative refinement, making it a natural candidate for recursive reasoning -- but does introducing Mamba-2 into the recursive scaffold preserve reasoning capability? We investigate this by replacing the Transformer blocks in TRM with Mamba-2 hybrid operators while maintaining parameter parity (6.83M vs 6.86M parameters). On ARC-AGI-1, we find that the hybrid improves pass@2 (the official metric) by +2.0% (45.88% vs 43.88%) and consistently outperforms at higher K values (+4.75% at pass@100), whilst maintaining pass@1 parity. This suggests improved candidate coverage -- the model generates correct solutions more reliably -- with similar top-1 selection. Our results validate that Mamba-2 hybrid operators preserve reasoning capability within the recursive scaffold, establishing SSM-based operators as viable candidates in the recursive operator design space and taking a first step towards understanding the best mixing strategies for recursive reasoning.
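The distinction the abstract draws between pass@1 parity and gains at higher K can be made concrete with a small sketch. It assumes pass@K means the fraction of tasks for which at least one of the top K ranked candidates is correct; the paper's exact evaluation protocol may differ, and the function name and toy data are hypothetical.

```python
def pass_at_k(ranked_correctness, k):
    """Fraction of tasks solved by any of the top-k ranked candidates.

    `ranked_correctness` holds one list per task; each list marks, in
    ranking order, whether a candidate answer is correct.
    """
    solved = sum(any(task[:k]) for task in ranked_correctness)
    return solved / len(ranked_correctness)


# Toy data: task 0 is solved by its 2nd candidate, task 1 by its 1st,
# and task 2 by neither.
tasks = [[False, True], [True, False], [False, False]]
```

On this toy data pass@1 is 1/3 and pass@2 is 2/3: a correct solution that exists among the candidates but is not ranked first raises pass@2 without changing pass@1, which matches the "improved candidate coverage with similar top-1 selection" reading.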
Towards Personalized Bangla Book Recommendation: A Large-Scale Multi-Entity Book Graph Dataset
Rahin Arefin Ahmed, Md. Anik Chowdhury, Sakil Ahmed Sheikh Reza, Devnil Bhattach...
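Core summary:

This paper studies how the lack of structured datasets limits personalized Bangla book recommendation. Its core method is to build a heterogeneous graph dataset containing multiple entity types (books, users, authors, and others) and their relations, providing a foundational resource for recommendation research in low-resource languages.

Personalized recommendation rationale:

The paper builds a large-scale multi-entity heterogeneous graph dataset for Bangla book recommendation that directly serves research on personalized recommender systems; it is a core-domain advance but involves no innovation in LLMs or Transformer architectures.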

2026-02-12 16:18:55 | arXiv:2602.12129v1 |
cs.IR, cs.LG
Personalized book recommendation in Bangla literature has been constrained by the lack of structured, large-scale, and publicly available datasets. This work introduces RokomariBG, a large-scale, multi-entity heterogeneous book graph dataset designed to support research on personalized recommendation in a low-resource language setting. The dataset comprises 127,302 books, 63,723 users, 16,601 authors, 1,515 categories, 2,757 publishers, and 209,602 reviews, connected through eight relation types and organized as a comprehensive knowledge graph. To demonstrate the utility of the dataset, we provide a systematic benchmarking study on the Top-N recommendation task, evaluating a diverse set of representative recommendation models, including classical collaborative filtering methods, matrix factorization models, content-based approaches, graph neural networks, a hybrid matrix factorization model with side information, and a neural two-tower retrieval architecture. The benchmarking results highlight the importance of leveraging multi-relational structure and textual side information, with neural retrieval models achieving the strongest performance (NDCG@10 = 0.204). Overall, this work establishes a foundational benchmark and a publicly available resource for Bangla book recommendation research, enabling reproducible evaluation and future studies on recommendation in low-resource cultural domains. The dataset and code are publicly available at https://github.com/backlashblitz/Bangla-Book-Recommendation-Dataset
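For context on the reported NDCG@10 figure, here is a minimal sketch of the metric in its common form with linear gain and a log2 position discount. It assumes the ranked list already contains every relevant item, so the ideal ordering can be obtained by sorting the same list; the benchmark's actual evaluation code may handle this differently.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so position 1 uses log2(2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A list with its only relevant item in first place scores 1.0, and moving that item to second place drops the score to 1 / log2(3), about 0.63.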
Analytical Search
Yiteng Tu, Shuo Miao, Weihang Su, Yiqun Liu, Qingyao Ai
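Core summary:

This paper studies how existing information retrieval paradigms (such as relevance ranking or RAG) struggle to meet large-scale, end-to-end analytical information needs (such as trend analysis and causal assessment). Its core idea is a new paradigm, "analytical search", which reframes search as an evidence-governed, process-oriented analytical workflow that explicitly models analytical intent, retrieves and fuses evidence, and performs structured multi-step inference to produce verifiable conclusions.

Personalized recommendation rationale:

The paper proposes a new search paradigm that directly targets core problems in search and is highly relevant to complex analytical needs in recommendation and advertising systems, but its direct technical contribution to LLMs or Transformer architectures is limited.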

2026-02-12 05:06:29 | arXiv:2602.11581v1 |
cs.IR, cs.AI, cs.CL
Analytical information needs, such as trend analysis and causal impact assessment, are prevalent across various domains including law, finance, science, and much more. However, existing information retrieval paradigms, whether based on relevance-oriented document ranking or retrieval-augmented generation (RAG) with large language models (LLMs), often struggle to meet the end-to-end requirements of such tasks at the corpus scale. They either emphasize information finding rather than end-to-end problem solving, or simply treat everything as naive question answering, offering limited control over reasoning, evidence usage, and verifiability. As a result, they struggle to support analytical queries that have diverse utility concepts and high accountability requirements. In this paper, we propose analytical search as a distinct and emerging search paradigm designed to fulfill these analytical information needs. Analytical search reframes search as an evidence-governed, process-oriented analytical workflow that explicitly models analytical intent, retrieves evidence for fusion, and produces verifiable conclusions through structured, multi-step inference. We position analytical search in contrast to existing paradigms, and present a unified system framework that integrates query understanding, recall-oriented retrieval, reasoning-aware fusion, and adaptive verification. We also discuss potential research directions for the construction of analytical search engines. In this way, we highlight the conceptual significance and practical importance of analytical search and call on efforts toward the next generation of search engines that support analytical information needs.
Olmix: A Framework for Data Mixing Throughout LM Development
Mayee F. Chen, Tyler Murray, David Heineman, Matt Jordan, Hannaneh Hajishirzi, C...
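Core summary:

This paper studies how to optimize the mixing ratios of multi-domain data in language model training. Its core idea is to identify the key design elements of mixing methods through systematic empirical analysis, and to propose a mixture reuse mechanism that efficiently handles dataset updates during development.

Personalized recommendation rationale:

The paper proposes Olmix, a data mixing framework that identifies key design choices for mixing methods through empirical study and addresses dataset updates during real-world development, giving it direct practical value for large-scale language model training.

2026-02-12 18:16:05 | arXiv:2602.12237v1 |
cs.LG cs.AI cs.CL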
Data mixing -- determining the ratios of data from different domains -- is a first-order concern for training language models (LMs). While existing mixing methods show promise, they fall short when applied during real-world LM development. We present Olmix, a framework that addresses two such challenges. First, the configuration space for developing a mixing method is not well understood -- design choices across existing methods lack justification or consensus and overlook practical issues like data constraints. We conduct a comprehensive empirical study of this space, identifying which design choices lead to a strong mixing method. Second, in practice, the domain set evolves throughout LM development as datasets are added, removed, partitioned, and revised -- a problem setting largely unaddressed by existing works, which assume fixed domains. We study how to efficiently recompute the mixture after the domain set is updated, leveraging information from past mixtures. We introduce mixture reuse, a mechanism that reuses existing ratios and recomputes ratios only for domains affected by the update. Over a sequence of five domain-set updates mirroring real-world LM development, mixture reuse matches the performance of fully recomputing the mix after each update with 74% less compute and improves over training without mixing by 11.6% on downstream tasks.
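The mixture-reuse idea described in the abstract can be pictured with a small sketch. It assumes a mixture is a plain dict of domain to ratio and that relative ratios for the affected domains have already been recomputed by some other step. How Olmix itself splits probability mass between reused and recomputed domains is not stated in the abstract, so the `affected_mass` parameter and the function name `reuse_mixture` are illustrative assumptions, not the authors' method.

```python
def reuse_mixture(old_mix, new_ratios, affected_mass):
    """Keep relative ratios of unaffected domains; splice in recomputed ones.

    old_mix       -- dict domain -> ratio for domains untouched by the update
    new_ratios    -- dict domain -> relative ratio for affected domains
    affected_mass -- total share given to affected domains (assumed input)
    """
    if not 0.0 <= affected_mass <= 1.0:
        raise ValueError("affected_mass must lie in [0, 1]")
    kept_total = sum(old_mix.values())
    new_total = sum(new_ratios.values())
    mix = {}
    # Unaffected domains keep their proportions relative to each other.
    for domain, ratio in old_mix.items():
        mix[domain] = (1.0 - affected_mass) * ratio / kept_total
    # Affected domains share the remaining mass by their recomputed ratios.
    for domain, ratio in new_ratios.items():
        mix[domain] = affected_mass * ratio / new_total
    return mix


# Hypothetical update: a 'math' domain was partitioned into two new domains.
old = {"web": 0.6, "code": 0.2}
new = {"math_a": 1.0, "math_b": 3.0}
mix = reuse_mixture(old, new, affected_mass=0.2)
```

In this sketch only the two new domains need fresh ratios; the web-to-code proportion is carried over unchanged, which is the source of the compute saving the abstract reports.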
Meta-Sel: Efficient Demonstration Selection for In-Context Learning via Supervised Meta-Learning
Xubin Wang, Weijia Jia
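Core summary:

This paper addresses the problem that, in in-context learning under a limited prompt budget, the choice of few-shot demonstrations strongly affects performance yet is computationally costly. Its core method uses supervised meta-learning to build a meta-dataset and trains a lightweight scoring function on interpretable features such as TF-IDF similarity and length compatibility, yielding efficient, deterministic demonstration selection with no model fine-tuning or additional LLM calls.

Personalized recommendation rationale:

The paper proposes an efficient few-shot demonstration selection method that directly targets a key bottleneck in in-context learning, with direct value for prompt engineering and query understanding in search and recommendation systems.

2026-02-12 16:11:29 | arXiv:2602.12123v1 |
cs.LG cs.AI cs.CL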
Demonstration selection is a practical bottleneck in in-context learning (ICL): under a tight prompt budget, accuracy can change substantially depending on which few-shot examples are included, yet selection must remain cheap enough to run per query over large candidate pools. We propose Meta-Sel, a lightweight supervised meta-learning approach for intent classification that learns a fast, interpretable scoring function for (candidate, query) pairs from labeled training data. Meta-Sel constructs a meta-dataset by sampling pairs from the training split and using class agreement as supervision, then trains a calibrated logistic regressor on two inexpensive meta-features: TF-IDF cosine similarity and a length-compatibility ratio. At inference time, the selector performs a single vectorized scoring pass over the full candidate pool and returns the top-k demonstrations, requiring no model fine-tuning, no online exploration, and no additional LLM calls. This yields deterministic rankings and makes the selection mechanism straightforward to audit via interpretable feature weights. Beyond proposing Meta-Sel, we provide a broad empirical study of demonstration selection, benchmarking 12 methods -- spanning prompt engineering baselines, heuristic selection, reinforcement learning, and influence-based approaches -- across four intent datasets and five open-source LLMs. Across this benchmark, Meta-Sel consistently ranks among the top-performing methods, is particularly effective for smaller models where selection quality can partially compensate for limited model capacity, and maintains competitive selection-time overhead.
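A rough sketch of the inference-time scoring pass may help make the two meta-features concrete. It is written in plain Python and makes several assumptions that the abstract does not confirm: whitespace tokenization, a smoothed IDF, a length-compatibility ratio defined as shorter length over longer length, and hand-picked logistic weights (`w_sim`, `w_len`, `bias`) standing in for the trained, calibrated regressor.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one sparse TF-IDF dict per text (smoothed IDF, assumed form)."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def length_ratio(a, b):
    """Assumed length-compatibility feature: shorter length / longer length."""
    la, lb = len(a.split()), len(b.split())
    return min(la, lb) / max(la, lb) if max(la, lb) else 0.0


def select_demos(query, candidates, k, w_sim=4.0, w_len=1.0, bias=-2.0):
    """Score every candidate once and return the top-k texts."""
    vecs = tfidf_vectors(candidates + [query])
    qv = vecs[-1]
    scored = []
    for text, cv in zip(candidates, vecs):
        z = w_sim * cosine(cv, qv) + w_len * length_ratio(text, query) + bias
        scored.append((1.0 / (1.0 + math.exp(-z)), text))  # logistic score
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


pool = [
    "book a flight to paris",
    "play some jazz music",
    "what is the weather today",
]
top = select_demos("book a cheap flight to rome", pool, k=1)
```

Because scoring involves no sampling and no LLM call, the same query and pool always yield the same ranking, which matches the deterministic behavior the abstract emphasizes.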
Prototype Transformer: Towards Language Model Architectures Interpretable by Design
Yordan Yordanov, Matteo Forasassi, Bayar Menzat, Ruizhi Wang, Chang Qi, Markus K...
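Personalized recommendation rationale:

The paper falls directly under the 'Enabling Transformer Tech' focus: it explores interpretable-by-design Transformer architectures, which matters for recommendation, search, and advertising because model interpretability in these applications directly affects user trust, system debugging, and regulatory compliance. Interpretable Transformer architectures could strengthen practical applications such as recommendation-reason generation, search relevance explanation, and ad delivery transparency.

2026-02-12 11:43:39 | arXiv:2602.11852v1 |
cs.AI cs.CL cs.LG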
While state-of-the-art language models (LMs) surpass the vast majority of humans in certain domains, their reasoning remains largely opaque, undermining trust in their output. Furthermore, while autoregressive LMs can output explicit reasoning, their true reasoning process is opaque, which introduces risks like deception and hallucination. In this work, we introduce the Prototype Transformer (ProtoT) -- an autoregressive LM architecture based on prototypes (parameter vectors), posed as an alternative to the standard self-attention-based transformers. ProtoT works by means of two-way communication between the input sequence and the prototypes, and we show that this leads to the prototypes automatically capturing nameable concepts (e.g. "woman") during training. They provide the potential to interpret the model's reasoning and allow for targeted edits of its behavior. Furthermore, by design, the prototypes create communication channels that aggregate contextual information at different time scales, aiding interpretability. In terms of computation scalability, ProtoT scales linearly with sequence length vs the quadratic scalability of SOTA self-attention transformers. Compared to baselines, ProtoT scales well with model and data size, and performs well on text generation and downstream tasks (GLUE). ProtoT exhibits robustness to input perturbations on par with or better than some baselines, but differs from them by providing interpretable pathways showing how robustness and sensitivity arise. Reaching close to the performance of state-of-the-art architectures, ProtoT paves the way to creating well-performing autoregressive LMs interpretable by design.
MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling
MiniCPM Team, Wenhao An, Yingfa Chen, Yewei Fang, Jiayi Li, Xin Li, Yaohui Li, Y...
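Personalized recommendation rationale:

The paper directly concerns Transformer efficiency improvements (a hybrid of sparse and linear attention) and belongs to the 'Enabling Transformer Tech' category. Efficient long-context modeling has clear potential applications: handling longer user behavior sequences in recommendation, longer queries and documents in search, and richer user context in advertising.

2026-02-12 09:37:05 | arXiv:2602.11761v1 |
cs.CL cs.AI cs.LG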
The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that integrates the high-fidelity long-context modeling of sparse attention (InfLLM-V2) with the global efficiency of linear attention (Lightning Attention). By employing a layer selection algorithm to integrate these mechanisms in a 1:3 ratio and utilizing a hybrid positional encoding (HyPE), the model maintains efficiency and performance for long-context tasks. Furthermore, we introduce a cost-effective continual training framework that transforms pre-trained Transformer-based models into hybrid models, which reduces training costs by approximately 75% compared to training from scratch. Extensive experiments show that MiniCPM-SALA maintains general capabilities comparable to full-attention models while offering improved efficiency. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens and supports context lengths of up to 1M tokens, a scale where traditional full-attention 8B models fail because of memory constraints.
Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning
Futing Wang, Jianhao Yan, Yun Luo, Ganqu Cui, Zhi Wang, Xiaoye Qu, Yue Zhang, Yu...
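Personalized recommendation rationale:

The paper combines reinforcement learning (RL) with in-context learning and falls under 'Direct LLM Applications'; it could be used to optimize exploration strategies in recommendation and search. Although RL is usually excluded, here RL is used explicitly to strengthen the LLM's in-context exploration ability, which has direct value for sequential decision-making in recommendation and query understanding in search.

2026-02-12 09:24:32 | arXiv:2602.11748v1 |
cs.CL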
Achieving effective test-time scaling requires models to engage in In-Context Exploration -- the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the "Shallow Exploration Trap". To bridge this gap, we propose Length-Incentivized Exploration. This simple yet effective recipe explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that the method effectively incentivizes in-context exploration. As a result, our method achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
ThinkRouter: Efficient Reasoning via Routing Thinking between Latent and Discrete Spaces
Xin Xu, Tong Yu, Xiang Chen, Haoliang Wang, Julian McAuley, Saayan Mitra
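Personalized recommendation rationale:

The paper proposes efficient reasoning through a mechanism that routes between latent and discrete spaces, which falls directly under 'Enabling Transformer Tech' and relates to attention mechanisms and architectural efficiency. This routing mechanism can substantially improve LLM reasoning efficiency and has clear potential in recommendation and search, for example more efficient user-intent understanding, multi-step reasoning in query processing, or fast execution of complex recommendation logic.

2026-02-12 08:01:01 | arXiv:2602.11683v1 |
cs.AI cs.CL cs.LG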
Recent work explores latent reasoning to improve reasoning efficiency by replacing explicit reasoning trajectories with continuous representations in a latent space, yet its effectiveness varies across settings. Analysis of model confidence dynamics under latent reasoning reveals that thinking trajectories ending in incorrect answers contain fewer low-confidence steps than those ending in correct answers. Meanwhile, we suggest that soft embeddings aggregated by multiple low-confidence thinking alternatives may introduce and propagate noise, leading to high confidence in unreliable reasoning trajectories. Motivated by these observations, ThinkRouter, an inference-time confidence-aware routing mechanism, is proposed to avoid high confidence and noise for efficient reasoning. ThinkRouter routes thinking to the discrete token space when model confidence is low, and to the latent space otherwise. Extensive experiments on STEM reasoning and coding benchmarks across diverse large reasoning models demonstrate that ThinkRouter outperforms explicit CoT, random routing, and latent reasoning baselines in terms of accuracy, achieving an average improvement of 19.70 points in Pass@1, while reducing generation length by up to 15.55%. Further comprehensive analysis reveals that ThinkRouter can calibrate errors arising from explicit CoT and latent reasoning, and accelerates end-of-thinking token generation by globally lowering model confidence.
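The routing rule in the abstract (discrete token space when confidence is low, latent space otherwise) can be illustrated with a toy per-step function. The sketch assumes that confidence means the maximum next-token probability, that the discrete branch takes the most likely token greedily, and that the latent branch uses a probability-weighted mixture of token embeddings; the threshold value and the name `route_step` are hypothetical.

```python
def route_step(probs, embeddings, threshold=0.5):
    """Pick the representation fed to the next thinking step.

    probs      -- next-token probabilities, one per vocabulary entry
    embeddings -- one embedding vector (list of floats) per vocabulary entry
    threshold  -- assumed confidence cutoff on the maximum probability
    """
    top = max(range(len(probs)), key=probs.__getitem__)
    if probs[top] < threshold:
        # Low confidence: commit to one discrete token so that several
        # weak alternatives are not blended into a noisy soft embedding.
        return "discrete", list(embeddings[top])
    # Otherwise stay in latent space with a probability-weighted mixture.
    dim = len(embeddings[0])
    soft = [sum(p * e[i] for p, e in zip(probs, embeddings)) for i in range(dim)]
    return "latent", soft


vocab_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
low = route_step([0.40, 0.35, 0.25], vocab_emb)
high = route_step([0.90, 0.10, 0.00], vocab_emb)
```

Here `low` takes the discrete branch because no token reaches the cutoff, while `high` stays latent and returns a blended embedding.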
PACE: Prefix-Protected and Difficulty-Aware Compression for Efficient Reasoning
Ruixiang Feng, Yuntao Wen, Silin Zhou, Ke Shi, Yifan Wang, Ran Le, Zhenwei An, Z...
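Personalized recommendation rationale:

The paper concerns LLM inference-efficiency optimization and belongs to 'Enabling LLM Tech'. Prefix protection and difficulty-aware compression could be applied directly to real-time inference in recommendation or search, shortening reasoning traces to improve response time while preserving recommendation and search quality.

2026-02-12 06:43:08 | arXiv:2602.11639v1 |
cs.CL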
Language Reasoning Models (LRMs) achieve strong performance by scaling test-time computation but often suffer from "overthinking", producing excessively long reasoning traces that increase latency and memory usage. Existing LRMs typically enforce conciseness with uniform length penalties, which over-compress crucial early deduction steps at the sequence level and indiscriminately penalize all queries at the group level. To address these limitations, we propose PACE, a dual-level framework for prefix-protected and difficulty-aware compression under hierarchical supervision. At the sequence level, prefix-protected optimization employs decaying mixed rollouts to maintain valid reasoning paths while promoting conciseness. At the group level, difficulty-aware penalty dynamically scales length constraints based on query complexity, maintaining exploration for harder questions while curbing redundancy on easier ones. Extensive experiments on DeepSeek-R1-Distill-Qwen (1.5B/7B) demonstrate that PACE achieves a substantial reduction in token usage (up to 55.7%) while simultaneously improving accuracy (up to 4.1%) on math benchmarks, with generalization ability to code, science, and general domains.
UniT: Unified Multimodal Chain-of-Thought Test-time Scaling
Leon Liangyu Chen, Haoyu Ma, Zhipeng Fan, Ziqi Huang, Animesh Sinha, Xiaoliang D...
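Personalized recommendation rationale:

The paper involves unified multimodal modeling and chain-of-thought techniques, matching the 'VLM analogy for heterogeneous data' focus, where multimodal processing is analogous to handling heterogeneous features (such as user sequences and context features) in recommendation and search. Test-time scaling may improve inference efficiency and has potential in recommendation and search systems for handling dynamic user behavior data.

2026-02-12 18:59:49 | arXiv:2602.12279v1 |
cs.CV cs.AI cs.LG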
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.
Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching
Onkar Susladkar, Tushar Prakash, Gayatri Deshmukh, Kiet A. Nguyen, Jiaxun Zhang,...
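Personalized recommendation rationale:

The paper proposes a unified discrete flow-matching method, an advance in Transformer architecture efficiency and generative modeling (Enabling Transformer Tech), applicable to multimodal content generation and reasoning in recommendation and search. The 'multimodal reasoning and generation' in the title ties directly to the VLM analogy for heterogeneous data, offering a unified modeling approach for user behavior sequences and context features in recommender systems.

2026-02-12 17:59:08 | arXiv:2602.12221v1 |
cs.CV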
We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.
Adapting Vision-Language Models for E-commerce Understanding at Scale
Matteo Nulli, Vladimir Orshulevich, Tala Bazazo, Christian Herold, Michael Kozie...
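Personalized recommendation rationale:

The paper directly relates to the VLM analogy for heterogeneous data, modeling vision and text jointly as distinct modalities, with direct applications in e-commerce recommendation and search. Although the title does not explicitly mention RecSys/Search/Ads, e-commerce understanding is a core component of these fields, so the paper is highly relevant.

2026-02-12 08:59:22 | arXiv:2602.11733v1 |
cs.CV cs.AI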
E-commerce product understanding by nature demands strong multimodal comprehension from text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show through a large-scale experimental study how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.
Arbitrary Ratio Feature Compression via Next Token Prediction
Yufan Liu, Daoyuan Ren, Zhipeng Zhang, Wenyang Luo, Bing Li, Weiming Hu, Stephen...
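Personalized recommendation rationale:

The paper proposes feature compression via next-token prediction, a core LLM technique advance (Enabling LLM Tech) with direct potential in recommendation, search, and advertising. Feature compression can substantially cut the storage and transmission cost of user and item feature vectors in large-scale recommender systems while preserving model performance, which is critical for industrial deployment.

2026-02-12 02:38:57 | arXiv:2602.11494v1 |
cs.CV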
Feature compression is increasingly important for improving the efficiency of downstream tasks, especially in applications involving large-scale or multi-modal data. While existing methods typically rely on dedicated models for achieving specific compression ratios, they are often limited in flexibility and generalization. In particular, retraining is necessary when adapting to a new compression ratio. To address this limitation, we propose a novel and flexible Arbitrary Ratio Feature Compression (ARFC) framework, which supports any compression ratio with a single model, eliminating the need for multiple specialized models. At its core, the Arbitrary Ratio Compressor (ARC) is an auto-regressive model that performs compression via next-token prediction. This allows the compression ratio to be controlled at inference simply by adjusting the number of generated tokens. To enhance the quality of the compressed features, two key modules are introduced. The Mixture of Solutions (MoS) module refines the compressed tokens by utilizing multiple compression results (solutions), reducing uncertainty and improving robustness. The Entity Relation Graph Constraint (ERGC) is integrated into the training process to preserve semantic and structural relationships during compression. Extensive experiments on cross-modal retrieval, image classification, and image retrieval tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches at various compression ratios. Notably, in some cases, it even surpasses the performance of the original, uncompressed features. These results validate the effectiveness and versatility of ARFC for practical, resource-constrained scenarios.
SIGHT: Reinforcement Learning with Self-Evidence and Information-Gain Diverse Branching for Search Agent
Wenlin Zhong, Jinluan Yang, Yiquan Wu, Yi Liu, Jianhang Yao, Kun Kuang
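Personalized recommendation rationale:

The paper applies reinforcement learning to search and belongs to the search-agent direction within 'Direct LLM Applications'. Although reinforcement learning by itself might be excluded, the title clearly targets the concrete search-agent setting and has clear practical significance. Its self-evidence and information-gain diverse branching method may improve the exploration efficiency and decision quality of search systems.

2026-02-12 04:16:55 | arXiv:2602.11551v1 |
cs.CL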
Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to master autonomous search for complex question answering. However, particularly within multi-turn search scenarios, this interaction introduces a critical challenge: search results often suffer from high redundancy and low signal-to-noise ratios. Consequently, agents easily fall into "Tunnel Vision," where the forced interpretation of early noisy retrievals leads to irreversible error accumulation. To address these challenges, we propose SIGHT, a framework that enhances search-based reasoning through Self-Evidence Support (SES) and Information-Gain Driven Diverse Branching. SIGHT distills search results into high-fidelity evidence via SES and calculates an Information Gain score to pinpoint pivotal states where observations maximally reduce uncertainty. This score guides Dynamic Prompting Interventions - including de-duplication, reflection, or adaptive branching - to spawn new branches with SES. Finally, by integrating SES and correctness rewards via Group Relative Policy Optimization, SIGHT internalizes robust exploration strategies without external verifiers. Experiments on single-hop and multi-hop QA benchmarks demonstrate that SIGHT significantly outperforms existing approaches, particularly in complex reasoning scenarios, using fewer search steps.
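The phrase "observations maximally reduce uncertainty" can be read as an entropy reduction, and a minimal sketch under that reading follows. It assumes the agent's uncertainty is a probability distribution over candidate answers recorded after each search step; how SIGHT actually estimates this distribution and computes its Information Gain score is not given in the abstract, so the function names and the belief trace are illustrative only.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def information_gain(prior, posterior):
    """Uncertainty removed by one observation (assumed definition)."""
    return entropy(prior) - entropy(posterior)


def pivotal_step(belief_trace):
    """Return (step index, gain) for the observation with the largest gain.

    belief_trace[i] is the belief over candidate answers after i observations.
    """
    gains = [
        information_gain(before, after)
        for before, after in zip(belief_trace, belief_trace[1:])
    ]
    best = max(range(len(gains)), key=gains.__getitem__)
    return best + 1, gains[best]


# Hypothetical beliefs over four candidate answers across three search steps.
trace = [
    [0.25, 0.25, 0.25, 0.25],  # before any search
    [0.25, 0.25, 0.25, 0.25],  # step 1: redundant result, nothing learned
    [0.50, 0.50, 0.00, 0.00],  # step 2: two candidates ruled out
    [0.60, 0.40, 0.00, 0.00],  # step 3: slight further sharpening
]
step, gain = pivotal_step(trace)
```

In this toy trace the second observation is the pivotal one, since it is where a branching or reflection intervention would have the most information to work with.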
Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm
Jinrui Zhang, Chaodong Xiao, Aoqi Wu, Xindong Zhang, Lei Zhang
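Personalized recommendation rationale:

The paper belongs to the 'Enabling LLM Tech' category, focusing on improving LLM pretraining through distributed GPU training and memory-efficient methods. Although it does not directly address recommendation, search, or advertising, more efficient and scalable LLM pretraining can substantially lower the cost and complexity of deploying LLMs in large-scale recommendation and search systems, indirectly advancing applications in these fields.

2026-02-12 04:02:45 | arXiv:2602.11543v1 |
cs.CL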
Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high-memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still need to train the entire model on each node, remaining constrained by GPU memory limitations. In this work, we propose SParse Expert Synchronization (SPES), a memory-efficient decentralized framework for pretraining mixture-of-experts (MoE) LLMs. SPES trains only a subset of experts per node, substantially lowering the memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, eliminating full-parameter transmission while ensuring efficient knowledge sharing. To accelerate convergence, we introduce an expert-merging warm-up strategy, where experts exchange knowledge early in training, to rapidly establish foundational capabilities. With SPES, we train a 2B-parameter MoE LLM using 16 standalone 48GB GPUs over internet connections, which achieves competitive performance with centrally trained LLMs under similar computational budgets. We further demonstrate scalability by training a 7B model from scratch and a 9B model upcycled from a dense checkpoint, both of which match prior centralized baselines. Our code is available at https://github.com/zjr2000/SPES.
Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation
Julia Belikova, Danila Rozhevskii, Dennis Svirin, Konstantin Polev, Alexander Pa...
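Personalized recommendation rationale:

The paper concerns compressed token representations in retrieval-augmented generation (RAG), which falls under LLM efficiency optimization and relates to 'Enabling LLM Tech'. Although it does not directly target recommendation, search, or advertising systems, efficient retrieval and representation compression in RAG can be applied to knowledge augmentation and context management in these fields.

2026-02-12 18:15:08 | arXiv:2602.12235v1 |
cs.CL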
Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet, the limits of compressibility -- and when compression begins to erase task-relevant content -- remain underexplored. In this paper, we define token overflow as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited overflow detection capability. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average on HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
dVoting: Fast Voting for dLLMs
Sicheng Feng, Zigeng Chen, Xinyin Ma, Gongfan Fang, Xinchao Wang
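Personalized recommendation rationale:

The title concerns a voting mechanism for diffusion large language models (dLLMs), an LLM efficiency technique that could apply indirectly to recommendation, search, and advertising systems by improving inference speed or reliability. However, the title does not state a specific application scenario, and voting is a test-time inference technique rather than a core Transformer architecture or direct application innovation, so relevance is limited.

2026-02-12 16:35:05 | arXiv:2602.12153v1 |
cs.CL cs.AI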
Diffusion Large Language Models (dLLMs) represent a new paradigm beyond autoregressive modeling, offering competitive performance while naturally enabling a flexible decoding process. Specifically, dLLMs can generate tokens at arbitrary positions in parallel, endowing them with significant potential for parallel test-time scaling, which was previously constrained by severe inefficiency in autoregressive modeling. In this work, we introduce dVoting, a fast voting technique that boosts reasoning capability without training, with only an acceptable extra computational overhead. dVoting is motivated by the observation that, across multiple samples for the same prompt, token predictions remain largely consistent, whereas performance is determined by a small subset of tokens exhibiting cross-sample variability. Leveraging the arbitrary-position generation capability of dLLMs, dVoting performs iterative refinement by sampling, identifying uncertain tokens via consistency analysis, regenerating them through voting, and repeating this process until convergence. Extensive evaluations demonstrate that dVoting consistently improves performance across various benchmarks. It achieves gains of 6.22%-7.66% on GSM8K, 4.40%-7.20% on MATH500, 3.16%-14.84% on ARC-C, and 4.83%-5.74% on MMLU. Our code is available at https://github.com/fscdc/dVoting
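The consistency-analysis step in the abstract can be sketched as a per-position majority vote over several samples. This toy version assumes equal-length token lists and flags a position as uncertain whenever agreement falls below an `agree` threshold; the regeneration of flagged positions by the dLLM itself is outside the sketch, and the function name `vote` and its parameters are hypothetical.

```python
from collections import Counter


def vote(samples, agree=1.0):
    """Majority-vote each position and flag positions that disagree.

    samples -- list of equal-length token lists sampled for one prompt
    agree   -- assumed minimum agreement fraction for a position to count
               as settled (1.0 means all samples must match)
    Returns (consensus tokens, list of uncertain positions).
    """
    consensus, uncertain = [], []
    for pos, column in enumerate(zip(*samples)):
        token, count = Counter(column).most_common(1)[0]
        consensus.append(token)
        if count / len(samples) < agree:
            # In the full method these positions would be regenerated.
            uncertain.append(pos)
    return consensus, uncertain


samples = [
    ["2", "+", "3", "=", "5"],
    ["2", "+", "3", "=", "6"],
    ["2", "+", "3", "=", "5"],
]
consensus, uncertain = vote(samples)
```

Only the final position is flagged here, which mirrors the abstract's observation that most tokens agree across samples and a small subset determines the outcome.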
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, Yankai Lin
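Personalized recommendation rationale:

The paper involves policy distillation and reward extrapolation, which fall under reinforcement learning. Although distillation can be used for model compression or knowledge transfer in recommendation and search systems, the title does not explicitly point to these application scenarios, and reinforcement learning papers need clear relevance to qualify.

2026-02-12 16:14:29 | arXiv:2602.12125v1 |
cs.LG cs.AI cs.CL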
On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL-constrained RL where the reward function and the KL regularization are always weighted equally and the reference model can be any model. Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD across a range of teacher-student size pairings. In particular, in the setting where we merge the knowledge from different domain experts, obtained by applying domain-specific RL to the same student model, back into the original student, ExOPD enables the student to even surpass the teacher's performance boundary and outperform the domain teachers. (2) Building on ExOPD, we further find that in the strong-to-weak distillation setting (i.e., distilling a smaller student from a larger teacher), performing reward correction by choosing the reference model as the teacher's base model before RL yields a more accurate reward signal and further improves distillation performance. However, this choice assumes access to the teacher's pre-RL variant and incurs more computational overhead. We hope our work offers new insights for future research on OPD.
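The claim that OPD is KL-constrained RL with equal weights can be checked with a short derivation. The notation below is assumed rather than taken from the paper: the student is written as \pi_\theta, the teacher as \pi_T, the reference model as \pi_ref, and the reward scaling factor as \lambda, and the objective is written at the sequence level as one plausible reading of the abstract.

```latex
% Assumed notation: \pi_\theta student, \pi_T teacher, \pi_{\mathrm{ref}} reference,
% \lambda reward scaling factor. A sequence-level sketch, not the paper's exact form.
\begin{align}
r(x, y) &= \log \frac{\pi_T(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \\
J_\lambda(\theta) &= \mathbb{E}_{y \sim \pi_\theta}\big[\lambda\, r(x, y)\big]
  - \mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) \\
J_1(\theta) &= \mathbb{E}_{y \sim \pi_\theta}\Big[
  \log \frac{\pi_T(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  - \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\Big]
  = -\,\mathrm{KL}\big(\pi_\theta \,\|\, \pi_T\big) \\
\pi^{*}_{\lambda}(y \mid x) &\propto
  \pi_{\mathrm{ref}}(y \mid x)
  \Big(\frac{\pi_T(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\Big)^{\lambda}
  = \pi_T(y \mid x)^{\lambda}\, \pi_{\mathrm{ref}}(y \mid x)^{1 - \lambda}
\end{align}
```

With \lambda = 1 the reference model cancels and the objective reduces to the reverse KL between student and teacher, which is why any reference model works in standard OPD. With \lambda > 1 the optimum moves past the teacher along the direction from reference to teacher, which is one way to read the term "reward extrapolation".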
LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss
Szilvia Ujváry, Louis Béthune, Pierre Ablin, João Monteiro, Marco Cuturi, Michae...
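Personalized recommendation rationale:

The paper examines the relationship between what small language models (SLMs) can learn and the loss function, a core LLM technique advance (Enabling LLM Tech). Although optimizing SLMs has potential value for lightweight recommendation and search systems (for example edge deployment), the title does not explicitly point to concrete recommendation, search, or advertising applications, so relevance is moderate.

2026-02-12 14:37:25 | arXiv:2602.12005v1 |
cs.CL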
Language models have consistently grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into them is upper-bounded by their parameter size. The capacity of Small Language Models (SLMs) is especially limited, leading to factually incorrect generations. This problem is often mitigated by giving the SLM access to an outside source: the ability to query a larger model, documents, or a database. Under this setting, we study the fundamental question of which tokens an SLM can and should learn during pretraining, versus which ones it should delegate via a special token. We find that this is not simply a question of loss: although the loss is predictive of whether a predicted token mismatches the ground-truth, some tokens are acceptable in that they are truthful alternative continuations of a pretraining document, and should not trigger delegation even if their loss is high. We find that a spaCy grammar parser can help augment the loss signal to decide which tokens the SLM should learn to delegate to prevent factual errors and which are safe to learn and predict even under high losses. We propose LaCy, a novel pretraining method based on this token selection philosophy. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and where to delegate for help. This results in higher FactScores when generating in a cascade with a bigger model and outperforms Rho or LLM-judge trained SLMs, while being simpler and cheaper.
SAGEO Arena: A Realistic Environment for Evaluating Search-Augmented Generative Engine Optimization
Sunghwan Kim, Wooseok Jeong, Serin Kim, Sangam Lee, Dongha Lee
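Personalized recommendation rationale:

The title concerns an evaluation environment for Search-Augmented Generative Engine Optimization (SAGEO), focusing mainly on optimizing and evaluating generative engines. Although it includes a "search" element, the core focus is generative engine optimization and evaluation benchmarks, which leans toward AIGC, content generation, and pure LLM evaluation, and is only weakly related to your interests in recommender systems, search ranking, ad ranking, or applications of enabling technologies.

2026-02-12 17:18:00 | arXiv:2602.12187v1 |
cs.IR cs.AI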
Search-Augmented Generative Engines (SAGE) have emerged as a new paradigm for information access, bridging web-scale retrieval with generative capabilities to deliver synthesized answers. This shift has fundamentally reshaped how web content gains exposure online, giving rise to Search-Augmented Generative Engine Optimization (SAGEO), the practice of optimizing web documents to improve their visibility in AI-generated responses. Despite growing interest, no evaluation environment currently supports comprehensive investigation of SAGEO. Specifically, existing benchmarks lack end-to-end visibility evaluation of optimization strategies, operating on pre-determined candidate documents that abstract away retrieval and reranking preceding generation. Moreover, existing benchmarks discard structural information (e.g., schema markup) present in real web documents, overlooking the rich signals that search systems actively leverage in practice. Motivated by these gaps, we introduce SAGEO Arena, a realistic and reproducible environment for stage-level SAGEO analysis. Our objective is to jointly target search-oriented optimization (SEO) and generation-centric optimization (GEO). To achieve this, we integrate a full generative search pipeline over a large-scale corpus of web documents with rich structural information. Our findings reveal that existing approaches remain largely impractical under realistic conditions and often degrade performance in retrieval and reranking. We also find that structural information helps mitigate these limitations, and that effective SAGEO requires tailoring optimization to each pipeline stage. Overall, our benchmark paves the way for realistic SAGEO evaluation and optimization beyond simplified settings.
Evolutionary Router Feature Generation for Zero-Shot Graph Anomaly Detection with Mixture-of-Experts
Haiyang Jiang, Tong Chen, Xinyi Gao, Guansong Pang, Quoc Viet Hung Nguyen, Hongz...
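Personalized recommendation rationale:

This paper involves Mixture-of-Experts (MoE) architectures and zero-shot learning, which fall within the technical scope of Transformer architecture efficiency and new attention mechanisms, and it could be applied to processing graph-structured data in recommender systems. However, the title clearly focuses on the specific domain of graph anomaly detection rather than directly targeting ranking or modeling tasks in recommendation, search, or advertising, so its relevance is limited.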

2026-02-12 06:16:51 | arXiv:2602.11622v1 |
cs.IR
Zero-shot graph anomaly detection (GAD) has attracted increasing attention in recent years, yet the heterogeneity of graph structures, features, and anomaly patterns across graphs makes existing single GNN methods insufficiently expressive to model diverse anomaly mechanisms. In this regard, Mixture-of-experts (MoE) architectures provide a promising paradigm by integrating diverse GNN experts with complementary inductive biases, yet their effectiveness in zero-shot GAD is severely constrained by distribution shifts, leading to two key routing challenges. First, nodes often carry vastly different semantics across graphs, and straightforwardly performing routing based on their features is prone to generating biased or suboptimal expert assignments. Second, as anomalous graphs often exhibit pronounced distributional discrepancies, existing router designs fall short in capturing domain-invariant routing principles that generalize beyond the training graphs. To address these challenges, we propose a novel MoE framework with evolutionary router feature generation (EvoFG) for zero-shot GAD. To enhance MoE routing, we propose an evolutionary feature generation scheme that iteratively constructs and selects informative structural features via an LLM-based generator and Shapley-guided evaluation. Moreover, a memory-enhanced router with an invariant learning objective is designed to capture transferable routing patterns under distribution shifts. Extensive experiments on six benchmarks show that EvoFG consistently outperforms state-of-the-art baselines, achieving strong and stable zero-shot GAD performance.
T3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization
Tunyu Zhang, Xinxi Zhang, Ligong Han, Haizhou Shi, Xiaoxiao He, Zhuowei Li, Hao ...
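Personalized recommendation rationale:

This paper focuses on efficient training techniques for diffusion language models (trajectory self-distillation and direct discriminative optimization), which falls under "Enabling LLM Tech", since more efficient diffusion models could be applied to content generation or sequence modeling in recommendation and search. However, direct applications of diffusion models in recommendation, search, or advertising remain unclear, and the paper does not explicitly address heterogeneous data or multimodal modeling, so its relevance is limited.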

2026-02-12 18:52:35 | arXiv:2602.12262v1 |
cs.CL cs.LG
Diffusion large language models (DLLMs) have the potential to enable fast text generation by decoding multiple tokens in parallel. However, in practice, their inference efficiency is constrained by the need for many refinement steps, while aggressively reducing the number of steps leads to a substantial degradation in generation quality. To alleviate this, we propose a trajectory self-distillation framework that improves few-step decoding by distilling the model's own generative trajectories. We incorporate Direct Discriminative Optimization (DDO), a reverse-KL objective that promotes mode-seeking distillation and encourages the student to concentrate on high-probability teacher modes. Across benchmarks, our approach consistently outperforms strong few-step baselines and standard training under tight step budgets. Although full-step decoding remains superior, we substantially narrow the gap, establishing a strong foundation towards practical few-step DLLMs. The source code is available at https://github.com/Tyrion58/T3D.
Pedagogically-Inspired Data Synthesis for Language Model Knowledge Distillation
Bowei He, Yankai Chen, Xiaokun Zhang, Linghe Kong, Philip S. Yu, Xue Liu, Chen M...
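Personalized recommendation rationale:

This paper mainly concerns data synthesis methods for language model knowledge distillation, which falls under LLM technique optimization. Although knowledge distillation could be applied indirectly to model compression or deployment optimization in recommendation and search systems, the title does not point to specific applications in these areas, and the path from pedagogically inspired data synthesis to direct use in recommendation, search, or advertising is not clear.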

2026-02-12 17:00:36 | arXiv:2602.12172v1 |
cs.AI cs.CL
Knowledge distillation from Large Language Models (LLMs) to smaller models has emerged as a critical technique for deploying efficient AI systems. However, current methods for distillation via synthetic data lack pedagogical awareness, treating knowledge transfer as a one-off data synthesis and training task rather than a systematic learning process. In this paper, we propose a novel pedagogically-inspired framework for LLM knowledge distillation that draws from fundamental educational principles. Our approach introduces a three-stage pipeline -- Knowledge Identifier, Organizer, and Adapter (IOA) -- that systematically identifies knowledge deficiencies in student models, organizes knowledge delivery through progressive curricula, and adapts representations to match the cognitive capacity of student models. We integrate Bloom's Mastery Learning Principles and Vygotsky's Zone of Proximal Development to create a dynamic distillation process where student models approach teacher model's performance on prerequisite knowledge before advancing, and new knowledge is introduced with controlled, gradual difficulty increments. Extensive experiments using LLaMA-3.1/3.2 and Qwen2.5 as student models demonstrate that IOA achieves significant improvements over baseline distillation methods, with student models retaining 94.7% of teacher performance on DollyEval while using less than 1/10th of the parameters. Our framework particularly excels in complex reasoning tasks, showing 19.2% improvement on MATH and 22.3% on HumanEval compared with state-of-the-art baselines.
Seq2Seq2Seq: Lossless Data Compression via Discrete Latent Transformers and Reinforcement Learning
Mahdi Khodabandeh, Ghazal Shabani, Arash Yousefi Jordehi, Seyed Abolghasem Mirro...
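Personalized recommendation rationale:

This paper mainly concerns data compression, which is foundational algorithm research rather than work aimed directly at recommendation, search, or advertising systems. Although it uses a Transformer architecture and reinforcement learning, the title does not indicate potential applications in recommendation, search, or advertising, and the direction of the reinforcement learning application is unclear, so it does not match the current focus.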

2026-02-12 16:30:55 | arXiv:2602.12146v1 |
cs.AI cs.CL cs.IT
Efficient lossless compression is essential for minimizing storage costs and transmission overhead while preserving data integrity. Traditional compression techniques, such as dictionary-based and statistical methods, often struggle to optimally exploit the structure and redundancy in complex data formats. Recent advancements in deep learning have opened new avenues for compression; however, many existing approaches depend on dense vector representations that obscure the underlying token structure. To address these limitations, we propose a novel lossless compression method that leverages Reinforcement Learning applied to a T5 language model architecture. This approach enables the compression of data into sequences of tokens rather than traditional vector representations. Unlike auto-encoders, which typically encode information into continuous latent spaces, our method preserves the token-based structure, aligning more closely with the original data format. This preservation allows for higher compression ratios while maintaining semantic integrity. By training the model using an off-policy Reinforcement Learning algorithm, we optimize sequence length to minimize redundancy and enhance compression efficiency. Our method introduces an efficient and adaptive data compression system built upon advanced Reinforcement Learning techniques, functioning independently of external grammatical or world knowledge. This approach shows significant improvements in compression ratios compared to conventional methods. By leveraging the latent information within language models, our system effectively compresses data without requiring explicit content understanding, paving the way for more robust and practical compression solutions across various applications.
When Should LLMs Be Less Specific? Selective Abstraction for Reliable Long-Form Text Generation
Shani Goren, Ido Galil, Ran El-Yaniv
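Personalized recommendation rationale:

This paper addresses controlling the level of abstraction in LLM-generated text, a purely LLM-centric topic with no direct connection to your core interests (advances in RecSys/Search/Ads, enabling LLM/Transformer technology, direct applications, or heterogeneous data modeling). Although reliable text generation could in theory indirectly affect some applications, the paper does not explicitly target specific problems or application scenarios in search, recommendation, or advertising.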

2026-02-12 13:06:14 | arXiv:2602.11908v1 |
cs.AI cs.CL cs.LG
LLMs are widely used, yet they remain prone to factual errors that erode user trust and limit adoption in high-risk settings. One approach to mitigate this risk is to equip models with uncertainty estimation mechanisms that abstain when confidence is low. However, this binary "all-or-nothing" approach is excessively restrictive in long-form settings, often discarding valuable information. We introduce Selective Abstraction (SA), a framework that enables LLMs to trade specificity for reliability by selectively reducing the detail of uncertain content. We first formalize SA through the lenses of selective risk and coverage. We then propose Atom-wise Selective Abstraction, a claim-level instantiation that decomposes responses into atomic claims (short, self-contained statements each expressing a single fact) and replaces uncertain atoms with higher confidence, less specific abstractions. To evaluate this framework, we develop a novel end-to-end pipeline for open-ended generation that instantiates risk as factual correctness and measures coverage using an information-theoretic measure of retained information. Across six open-source models on the FactScore and LongFact-Objects benchmarks, atom-wise SA consistently outperforms existing baselines, improving the area under the risk-coverage curve (AURC) by up to 27.73% over claim removal, demonstrating that reducing specificity can boost accuracy and reliability while preserving most of their original meaning.
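The risk-coverage view mentioned in the abstract can be illustrated with the standard selective-prediction definition of AURC. This is only a minimal sketch: it assumes each atomic claim has a scalar confidence and a binary correctness label, and it measures coverage as the fraction of claims kept, whereas the paper uses an information-theoretic coverage measure. The function names are hypothetical.

```python
def risk_coverage_curve(confidences, correct):
    """Return [(coverage, risk), ...] when keeping the k most confident claims.

    Assumption: coverage is the fraction of claims kept (not the paper's
    information-theoretic measure) and risk is the error rate among them.
    """
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    curve = []
    errors = 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        curve.append((k / len(order), errors / k))
    return curve


def aurc(confidences, correct):
    """Area under the risk-coverage curve, averaged over coverage levels."""
    curve = risk_coverage_curve(confidences, correct)
    return sum(risk for _, risk in curve) / len(curve)


# Four hypothetical claims: the two confident ones are correct.
confidences = [0.9, 0.8, 0.3, 0.1]
correct = [True, True, False, False]
curve = risk_coverage_curve(confidences, correct)
score = aurc(confidences, correct)
```

A lower AURC means that errors are concentrated among low-confidence claims, which is the regime where abstracting or removing uncertain content pays off.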
Scene-Aware Memory Discrimination: Deciding Which Personal Knowledge Stays
Yijie Zhong, Mengying Guo, Zewei Wang, Zhongyang Li, Dandan Tu, Haofen Wang
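Personalized recommendation rationale:

The title involves memory mechanisms and knowledge retention, and likely belongs to LLM memory management or knowledge editing. Although this may represent progress in core LLM technology (such as memory efficiency), the title does not state direct application potential for recommender systems, search, or advertising, and the work may lean toward general NLP or cognitive science rather than explicit recommendation, search, or advertising applications.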

2026-02-12 05:53:54 | arXiv:2602.11607v1 |
cs.CL
Intelligent devices have become deeply integrated into everyday life, generating vast amounts of user interactions that form valuable personal knowledge. Efficient organization of this knowledge in user memory is essential for enabling personalized applications. However, current research on memory writing, management, and reading using large language models (LLMs) faces challenges in filtering irrelevant information and in dealing with rising computational costs. Inspired by the concept of selective attention in the human brain, we introduce a memory discrimination task. To address large-scale interactions and diverse memory standards in this task, we propose a Scene-Aware Memory Discrimination method (SAMD), which comprises two key components: the Gating Unit Module (GUM) and the Cluster Prompting Module (CPM). GUM enhances processing efficiency by filtering out non-memorable interactions and focusing on the salient content most relevant to application demands. CPM establishes adaptive memory standards, guiding LLMs to discern what information should be remembered or discarded. It also analyzes the relationship between user intents and memory contexts to build effective clustering prompts. Comprehensive direct and indirect evaluations demonstrate the effectiveness and generalization of our approach. We independently assess the performance of memory discrimination, showing that SAMD successfully recalls the majority of memorable data and remains robust in dynamic scenarios. Furthermore, when integrated into personalized applications, SAMD significantly enhances both the efficiency and quality of memory construction, leading to better organization of personal knowledge.
Multimodal Fact-Level Attribution for Verifiable Reasoning
David Wan, Han Wang, Ziyang Wang, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal
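Personalized recommendation rationale:

The title involves multimodal fact attribution and verifiable reasoning, focusing mainly on the transparency and verifiability of the reasoning process, which belongs to evaluation and interpretability. Although multimodal processing may be indirectly related to heterogeneous data modeling, the paper's core focus is verification and attribution rather than direct application to modeling or efficiency improvements in recommender systems, search, or advertising.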

2026-02-12 03:10:02 | arXiv:2602.11509v1 |
cs.CL cs.AI cs.CV
Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade-off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.
ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning
Changti Wu, Jiahuai Mao, Yuzhuo Miao, Shijie Lian, Bin Yu, Xiaopeng Lin, Cong Hu...
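Personalized recommendation rationale:

This paper mainly concerns multimodal data selection for visual instruction tuning. Although it involves multimodal processing, its core is optimizing the training efficiency of vision-language models (VLMs), not unified modeling of heterogeneous data applied directly to recommendation, search, or advertising. Although data selection techniques may indirectly affect model efficiency, the paper does not clearly show specific application potential in recommendation, search, or advertising systems.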

2026-02-12 06:38:49 | arXiv:2602.11636v1 |
cs.CV cs.AI
Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on large-scale datasets is computationally expensive and inefficient due to redundancy in the data, which motivates the need for multimodal data selection to improve training efficiency. Existing data selection methods for VIT either require costly training or gradient computation. Training-free alternatives often depend on proxy models or datasets, instruction-agnostic representations, and pairwise similarity with quadratic complexity, limiting scalability and representation fidelity. In this work, we propose ScalSelect, a scalable training-free multimodal data selection method with linear-time complexity with respect to the number of samples, eliminating the need for external models or auxiliary datasets. ScalSelect first constructs sample representations by extracting visual features most attended by instruction tokens in the target VLM, capturing instruction-relevant information. It then identifies samples whose representations best approximate the dominant subspace of the full dataset representations, enabling scalable importance scoring without pairwise comparisons. Extensive experiments across multiple VLMs, datasets, and selection budgets demonstrate that ScalSelect achieves over 97.5% of the performance of training on the full dataset using only 16% of the data, and even outperforms full-data training in some settings. The code is available at https://github.com/ChangtiWu/ScalSelect.
Efficient Crawling for Scalable Web Data Acquisition (Extended Version)
Antoine Gauquier, Ioana Manolescu, Pierre Senellart
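Personalized recommendation rationale:

This paper mainly concerns web crawling, which belongs to the data acquisition infrastructure layer and has weak direct relevance to your focus on core algorithms, model architectures, and LLM applications in recommender systems, search, or advertising. Although efficient data acquisition may indirectly support the data needs of these areas, the paper itself does not touch any of the specific technical directions you specified.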

2026-02-12 12:23:53 | arXiv:2602.11874v1 |
cs.IR
Journalistic fact-checking, as well as social or economic research, require analyzing high-quality statistics datasets (SDs, in short). However, retrieving SD corpora at scale may be hard, inefficient, or impossible, depending on how they are published online. To improve open statistics data accessibility, we present a focused Web crawling algorithm that retrieves as many targets, i.e., resources of certain types, as possible, from a given website, in an efficient and scalable way, by crawling (much) less than the full website. We show that optimally solving this problem is intractable, and propose an approach based on reinforcement learning, namely using sleeping bandits. We propose SB-CLASSIFIER, a crawler that efficiently learns which hyperlinks lead to pages that link to many targets, based on the paths leading to the links in their enclosing webpages. Our experiments on websites with millions of webpages show that our crawler is highly efficient, delivering high fractions of a site's targets while crawling only a small part.
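To make the sleeping-bandit idea concrete, here is a minimal sketch of a UCB-style rule restricted to the arms that are currently available ("awake"). It is not SB-CLASSIFIER: the abstract does not give the algorithm's details, so the class name, the arm labels (standing in for groups of similar hyperlinks), and the reward (for example, the number of targets found behind a link) are all assumptions made for illustration.

```python
import math


class SleepingUCB:
    """UCB-style bandit where only a subset of arms is available per round."""

    def __init__(self):
        self.pulls = {}         # arm -> number of times it was played
        self.total_reward = {}  # arm -> sum of observed rewards
        self.t = 0              # number of selection rounds so far

    def select(self, awake_arms):
        """Pick one arm among those awake in this round."""
        self.t += 1
        # Try every awake arm once before trusting its estimate.
        for arm in awake_arms:
            if self.pulls.get(arm, 0) == 0:
                return arm

        def ucb(arm):
            mean = self.total_reward[arm] / self.pulls[arm]
            bonus = math.sqrt(2 * math.log(self.t) / self.pulls[arm])
            return mean + bonus

        return max(awake_arms, key=ucb)

    def update(self, arm, reward):
        """Record the reward observed after playing an arm."""
        self.pulls[arm] = self.pulls.get(arm, 0) + 1
        self.total_reward[arm] = self.total_reward.get(arm, 0.0) + reward


# Hypothetical arms: groups of links that share a path pattern on the page.
bandit = SleepingUCB()
first = bandit.select(["nav", "table"])
bandit.update(first, 1.0)
second = bandit.select(["nav", "table"])
bandit.update(second, 0.0)
third = bandit.select(["nav", "table"])
only_choice = bandit.select(["table"])
```

The point of the sleeping variant is that a crawler can only follow links present on pages it has already fetched, so the set of playable arms changes from round to round.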
Agentic Test-Time Scaling for WebAgents
Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michae...
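Personalized recommendation rationale:

The title involves agents and test-time scaling, and may belong to reinforcement learning or agent systems. Although web agents may involve search or recommendation interactions, the title does not explicitly point to core techniques of recommender systems, search, or advertising (such as ranking, retrieval, or user modeling), nor does it address current interests such as LLMs, Transformer architectures, or unified modeling of heterogeneous data. Its potential application scenarios are unclear, and its relevance to the specified technical directions is weak.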

2026-02-12 18:58:30 | arXiv:2602.12276v1 |
cs.AI cs.CL
Test-time scaling has become a standard way to improve performance and boost reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well-understood: small per-step errors can compound over long horizons; and we find that naive policies that uniformly increase sampling show diminishing returns. In this work, we present CATTS, a simple technique for dynamically allocating compute for multi-step agents. We first conduct an empirical study of inference-time scaling for web agents. We find that uniformly increasing per-step compute quickly saturates in long-horizon environments. We then investigate stronger aggregation strategies, including an LLM-based Arbiter that can outperform naive voting, but that can overrule high-consensus decisions. We show that uncertainty statistics derived from the agent's own vote distribution (entropy and top-1/top-2 margin) correlate with downstream success and provide a practical signal for dynamic compute allocation. Based on these findings, we introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over React while using up to 2.3x fewer tokens than uniform scaling, providing both efficiency gains and an interpretable decision rule.
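The two uncertainty statistics named in the abstract, entropy and top-1/top-2 margin of the agent's vote distribution, are simple to compute. The sketch below shows one way a confidence-aware rule might use them. The thresholds and function names are hypothetical; the abstract does not state how CATTS sets its decision rule.

```python
import math
from collections import Counter


def vote_uncertainty(votes):
    """Return (entropy in nats, top-1/top-2 margin) of a list of sampled actions."""
    counts = Counter(votes)
    total = len(votes)
    probs = sorted((c / total for c in counts.values()), reverse=True)
    entropy = -sum(p * math.log(p) for p in probs)
    runner_up = probs[1] if len(probs) > 1 else 0.0
    return entropy, probs[0] - runner_up


def needs_more_compute(votes, max_entropy=0.5, min_margin=0.4):
    """Spend extra compute only when the votes are contentious.

    Assumption: both thresholds are illustrative, not values from the paper.
    """
    entropy, margin = vote_uncertainty(votes)
    return entropy > max_entropy or margin < min_margin


# A unanimous step versus a split step (hypothetical action labels).
unanimous = ["click_submit"] * 5
split = ["click_submit", "click_submit", "scroll", "scroll", "type_text"]
unanimous_stats = vote_uncertainty(unanimous)
split_stats = vote_uncertainty(split)
```

With these inputs the unanimous step would keep its majority action, while the split step would be flagged for additional sampling or arbitration.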
ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images
Mathieu Sibue, Andres Muñoz Garza, Samuel Mensah, Pranav Shetty, Zhiqiang Ma, Xi...
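Personalized recommendation rationale:

This paper mainly concerns a benchmark for structured information extraction from document images, at the intersection of computer vision and document processing. Although information extraction techniques may be applied indirectly to document understanding in search systems, the title is explicitly limited to document image processing, has weak direct ties to core areas such as recommender systems, search ranking, and advertising, and does not address current technical interests such as LLMs, Transformer architectures, or unified modeling of heterogeneous data.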

2026-02-12 17:38:57 | arXiv:2602.12203v1 |
cs.CL
Enterprise documents, such as forms and reports, embed critical information for downstream applications like data archiving, automated workflows, and analytics. Although generalist Vision Language Models (VLMs) perform well on established document understanding benchmarks, their ability to conduct holistic, fine-grained structured extraction across diverse document types and flexible schemas is not well studied. Existing Key Entity Extraction (KEE), Relation Extraction (RE), and Visual Question Answering (VQA) datasets are limited by narrow entity ontologies, simple queries, or homogeneous document types, often overlooking the need for adaptable and structured extraction. To address these gaps, we introduce ExStrucTiny, a new benchmark dataset for structured Information Extraction (IE) from document images, unifying aspects of KEE, RE, and VQA. Built through a novel pipeline combining manual and synthetic human-validated samples, ExStrucTiny covers more varied document types and extraction scenarios. We analyze open and closed VLMs on this benchmark, highlighting challenges such as schema adaptation, query under-specification, and answer localization. We hope our work provides a bedrock for improving generalist models for structured IE in documents.
Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education
Mohamed Huti, Alasdair Mackintosh, Amy Waldock, Dominic Andrews, Maxime Lelièvre...
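Personalized recommendation rationale:

The title focuses mainly on evaluating multimodal LLMs with a visual reasoning benchmark in education, which is a domain-specific application rather than core recommendation, search, or advertising technology. Although it involves multimodal LLM technology, its application scenario (primary education) is unrelated to the specified areas, and it mentions no technical insights or potential applications that might transfer to recommendation, search, or advertising systems.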

2026-02-12 17:29:03 | arXiv:2602.12196v1 |
cs.CL cs.AI
AI models have achieved state-of-the-art results in textual reasoning; however, their ability to reason over spatial and relational structures remains a critical bottleneck -- particularly in early-grade maths, which relies heavily on visuals. This paper introduces the visual reasoning benchmark (VRB), a novel dataset designed to evaluate Multimodal Large Language Models (MLLMs) on their ability to solve authentic visual problems from classrooms. This benchmark is built on a set of 701 questions sourced from primary school examinations in Zambia and India, which cover a range of tasks such as reasoning by analogy, pattern completion, and spatial matching. We outline the methodology and development of the benchmark which intentionally uses unedited, minimal-text images to test if models can meet realistic needs of primary education. Our findings reveal a "jagged frontier" of capability where models demonstrate better proficiency in static skills such as counting and scaling, but reach a distinct "spatial ceiling" when faced with dynamic operations like folding, reflection, and rotation. These weaknesses pose a risk for classroom use on visual reasoning problems, with the potential for incorrect marking, false scaffolding, and reinforcing student misconceptions. Consequently, education-focused benchmarks like the VRB are essential for determining the functional boundaries of multimodal tools used in classrooms.
Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty
Zewei Yu, Lirong Gao, Yuke Zhu, Bo Zheng, Sheng Guo, Haobo Wang, Junbo Zhao
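Personalized recommendation rationale:

This paper mainly concerns optimizing LLM reasoning efficiency, which belongs to core LLM technology, but it does not state potential applications in recommender systems, search, or advertising. The adaptive reflection mechanism may help improve model decision efficiency, but it lacks a direct link to specific application scenarios.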

2026-02-12 16:04:00 | arXiv:2602.12113v1 |
cs.AI cs.CL
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks by employing test-time scaling. However, they often generate over-long chains-of-thought that, driven by substantial reflections such as repetitive self-questioning and circular reasoning, lead to high token consumption, substantial computational overhead, and increased latency without improving accuracy, particularly in smaller models. Our observation reveals that increasing problem complexity induces more excessive and unnecessary reflection, which in turn reduces accuracy and increases token overhead. To address this challenge, we propose Adaptive Reflection and Length Coordinated Penalty (ARLCP), a novel reinforcement learning framework designed to dynamically balance reasoning efficiency and solution accuracy. ARLCP introduces two key innovations: (1) a reflection penalty that adaptively curtails unnecessary reflective steps while preserving essential reasoning, and (2) a length penalty calibrated to the estimated complexity of the problem. By coordinating these penalties, ARLCP encourages the model to generate more concise and effective reasoning paths. We evaluate our method on five mathematical reasoning benchmarks using DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B models. Experimental results show that ARLCP achieves a superior efficiency-accuracy trade-off compared to existing approaches. For the 1.5B model, it reduces the average response length by 53.1% while simultaneously improving accuracy by 5.8%. For the 7B model, it achieves a 35.0% reduction in length with a 2.7% accuracy gain. The code is released at https://github.com/ZeweiYu1/ARLCP .
Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study
Angelo Ziletti, Leonardo D'Ambrosi
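Personalized recommendation rationale:

The title explicitly points to the clinical domain (a medical application) and the Text-to-SQL task (a specific NLP application), which is a clearly irrelevant topic. Although it involves LLM analysis, it lacks a potential application link to recommender systems, search, or advertising.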

2026-02-12 14:46:20 | arXiv:2602.12015v1 |
cs.CL
Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity that should trigger clarification, and (ii) model instability that should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations --> answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition unavailable from a single score. The resulting uncertainty regimes map to targeted interventions - query refinement for ambiguity, model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering 25% of queries, enabling efficient triage.
DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling
Mariia Fedorova, Andrey Kutuzov, Khonzoda Umarova
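Personalized recommendation rationale:

This paper mainly concerns building diachronic corpora and modeling semantic change, which is foundational linguistics/NLP research. Although it involves large-scale data processing and word representation techniques, it lacks a clear recommender system, search, or advertising application scenario, and it falls under neither Transformer architecture improvements nor direct LLM applications.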

2026-02-12 14:01:40 | arXiv:2602.11968v1 |
cs.CL
In this resource paper, we present DHPLT, an open collection of diachronic corpora in 41 diverse languages. DHPLT is based on the web-crawled HPLT datasets; we use web crawl timestamps as the approximate signal of document creation time. The collection covers three time periods: 2011-2015, 2020-2021 and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words, while leaving it open for other researchers to come up with their own target words using the same datasets. DHPLT aims at filling in the current lack of multilingual diachronic corpora for semantic change modelling (beyond a dozen high-resource languages). It opens the way for a variety of new experimental setups in this field. All the resources described in this paper are available at https://data.hplt-project.org/three/diachronic/, sorted by language.
Scaling Model and Data for Multilingual Machine Translation with Open Large Language Models
Yuzhe Shang, Pengzhi Gao, Wei Liu, Jian Luan, Jinsong Su
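Personalized recommendation rationale:

This paper mainly concerns multilingual machine translation, a specific NLP task and a purely LLM application area. Although it involves model and data scaling techniques, it lacks a clear potential application link to recommender systems, search, or advertising, and does not meet the current focus's requirements for direct applications or enabling technology.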

2026-02-12 13:56:02 | arXiv:2602.11961v1 |
cs.CL
Open large language models (LLMs) have demonstrated improving multilingual capabilities in recent years. In this paper, we present a study of open LLMs for multilingual machine translation (MT) across a range of languages, and investigate the effects of model scaling and data scaling when adapting open LLMs to multilingual MT through continual pretraining and instruction finetuning. Based on the Gemma3 model family, we develop MiLMMT-46, which achieves top-tier multilingual translation performance across 46 languages. Extensive experiments show that MiLMMT-46 consistently outperforms recent state-of-the-art (SOTA) models, including Seed-X, HY-MT-1.5, and TranslateGemma, and achieves competitive performance with strong proprietary systems such as Google Translate and Gemini 3 Pro.
Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion
Bruno Rigal, Victor Dupriez, Alexis Mignon, Ronan Le Hy, Nicolas Mery
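Personalized recommendation rationale:

This paper mainly concerns benchmarking vision-language models on a specific document conversion task (French PDF to Markdown), which is a specific document processing application. Although it involves vision-language models, its application is document format conversion rather than unified modeling of heterogeneous data for recommendation, search, or advertising, and French language processing is a language-specific task, so its connection to the cross-modal recommender system techniques of current interest is low.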

2026-02-12 13:55:43 | arXiv:2602.11960v1 |
cs.CV cs.CL cs.LG
This report evaluates PDF-to-Markdown conversion using recent Vision-Language Models (VLMs) on challenging French documents. Document parsing is a critical step for Retrieval-Augmented Generation (RAG) pipelines, where transcription and layout errors propagate to downstream retrieval and grounding. Existing benchmarks often emphasize English or Chinese and can over-penalize benign formatting and linearization choices (e.g., line breaks, list segmentation, alternative table renderings) that are largely irrelevant for downstream use. We introduce a French-focused benchmark of difficult pages selected via model-disagreement sampling from a corpus of 60,000 documents, covering handwritten forms, complex layouts, dense tables, and graphics-rich pages. Evaluation is performed with unit-test-style checks that target concrete failure modes (text presence, reading order, and local table constraints) combined with category-specific normalization designed to discount presentation-only variance. Across 15 models, we observe substantially higher robustness for the strongest proprietary models on handwriting and forms, while several open-weights systems remain competitive on standard printed layouts.
Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance
Yunchong Huang, Gianni Barlacchi, Sandro Pezzelle
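Personalized recommendation rationale:

This paper mainly concerns question rewriting in question answering systems, a general improvement in NLP. Although question rewriting may be applied indirectly to search systems, the title does not clearly show direct relevance to recommender systems, advertising, or search ranking, nor does it address core interests such as LLM/Transformer architecture advances or heterogeneous data modeling.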

2026-02-12 13:36:23 | arXiv:2602.11938v1 |
cs.CL
Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved. We argue that this gap is partly due to underspecified questions - queries whose interpretation cannot be uniquely determined without additional context. To test this hypothesis, we introduce an LLM-based classifier to identify underspecified questions and apply it to several widely used QA datasets, finding that 16% to over 50% of benchmark questions are underspecified and that LLMs perform significantly worse on them. To isolate the effect of underspecification, we conduct a controlled rewriting experiment that serves as an upper-bound analysis, rewriting underspecified questions into fully specified variants while holding gold answers fixed. QA performance consistently improves under this setting, indicating that many apparent QA failures stem from question underspecification rather than model limitations. Our findings highlight underspecification as an important confound in QA evaluation and motivate greater attention to question clarity in benchmark design.
Cross-Modal Robustness Transfer (CMRT): Training Robust Speech Translation Models Using Adversarial Text
Abderrahmane Issam, Yusuf Can Semerci, Jan Scholtes, Gerasimos Spanakis
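Personalized recommendation rationale:

This paper mainly involves speech translation and adversarial training, which belongs to speech processing. Although it mentions a cross-modal concept, its core is the robustness of speech-to-text translation, with no direct connection to unified modeling of heterogeneous data in recommender systems, search, or advertising. The content leans toward speech processing rather than applications in recommendation, search, or advertising.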

2026-02-12 13:30:12 | arXiv:2602.11933v1 |
cs.CL
End-to-End Speech Translation (E2E-ST) has seen significant advancements, yet current models are primarily benchmarked on curated, "clean" datasets. This overlooks critical real-world challenges, such as morphological robustness to inflectional variations common in non-native or dialectal speech. In this work, we adapt a text-based adversarial attack targeting inflectional morphology to the speech domain and demonstrate that state-of-the-art E2E-ST models are highly vulnerable to it. While adversarial training effectively mitigates such risks in text-based tasks, generating high-quality adversarial speech data remains computationally expensive and technically challenging. To address this, we propose Cross-Modal Robustness Transfer (CMRT), a framework that transfers adversarial robustness from the text modality to the speech modality. Our method eliminates the requirement for adversarial speech data during training. Extensive experiments across four language pairs demonstrate that CMRT improves adversarial robustness by an average of more than 3 BLEU points, establishing a new baseline for robust E2E-ST without the overhead of generating adversarial speech.
AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection
Pretam Ray, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
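Personalized recommendation rationale:

The title focuses on efficiency optimization for evolutionary algorithms and AI agents, a general AI method rather than one specific to recommender systems, search, or advertising. Although adaptive model selection might be applied indirectly to some system optimizations, the title does not explicitly point to Transformer architectures, LLM techniques, or direct RecSys/Search/Ads applications, so its connection to the core areas and enabling technologies of current interest is weak.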

2026-02-12 13:26:56 | arXiv:2602.11931v1 |
cs.CL cs.AI
Evolutionary agentic systems intensify the trade-off between computational efficiency and reasoning capability by repeatedly invoking large language models (LLMs) during inference. This setting raises a central question: how can an agent dynamically select an LLM that is sufficiently capable for the current generation step while remaining computationally efficient? While model cascades offer a practical mechanism for balancing this trade-off, existing routing strategies typically rely on static heuristics or external controllers and do not explicitly account for model uncertainty. We introduce AdaptEvolve: Adaptive LLM Selection for Multi-LLM Evolutionary Refinement within an evolutionary sequential refinement framework that leverages intrinsic generation confidence to estimate real-time solvability. Empirical results show that confidence-driven selection yields a favourable Pareto frontier, reducing total inference cost by an average of 37.9% across benchmarks while retaining 97.5% of the upper-bound accuracy of static large-model baselines. Our code is available at https://github.com/raypretam/adaptive_llm_selection.
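As a rough illustration of confidence-driven selection in a two-model cascade, the Python sketch below tries a small model first and escalates to a large one only when a confidence proxy falls below a threshold. The `Generator` type, the geometric-mean token probability used as the confidence score, the 0.8 threshold, and the toy models are all assumptions made for this example; they are not taken from the AdaptEvolve paper or its repository.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical generator type: returns (text, per-token log-probabilities).
Generator = Callable[[str], Tuple[str, List[float]]]


def mean_token_confidence(logprobs: List[float]) -> float:
    """Geometric-mean token probability, a simple confidence proxy in [0, 1]."""
    if not logprobs:
        return 0.0
    return math.exp(sum(logprobs) / len(logprobs))


def cascade(prompt: str, small: Generator, large: Generator,
            threshold: float = 0.8) -> Tuple[str, str]:
    """Try the small model first; escalate only when its confidence is low."""
    text, logprobs = small(prompt)
    if mean_token_confidence(logprobs) >= threshold:
        return text, "small"
    text, _ = large(prompt)
    return text, "large"


# Toy stand-ins for real LLM calls (assumed behavior, for illustration only).
def toy_small(prompt: str) -> Tuple[str, List[float]]:
    confident = len(prompt) < 20  # pretend short prompts are easy
    logprob = math.log(0.95) if confident else math.log(0.4)
    return "small answer", [logprob] * 5


def toy_large(prompt: str) -> Tuple[str, List[float]]:
    return "large answer", [math.log(0.9)] * 5


if __name__ == "__main__":
    print(cascade("short task", toy_small, toy_large))
    print(cascade("a much longer and harder task prompt", toy_small, toy_large))
```

In an evolutionary loop, a check like this would run at every generation step, so the threshold directly trades inference cost against accuracy.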
Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems
Wanxing Wu, He Zhu, Yixia Li, Lei Yang, Jiehui Zhao, Hongru Wang, Jian Yang, Ben...
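Personalized recommendation rationale:

The title concerns evaluation methods for routers in LLM systems, which falls under system evaluation rather than core algorithmic or architectural innovation. Although router technology could be applied to model scheduling in large-scale recommender systems, the paper focuses on the fairness and comprehensiveness of evaluation. That leans toward systems-engineering evaluation standards and is only weakly related to the core algorithmic advances in RecSys/Search/Ads, Transformer architecture innovations, or direct LLM applications that you follow.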

2026-02-12 12:28:27 | arXiv:2602.11877v1 |
cs.CL cs.AI
Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states that capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi-domain data, it generalizes robustly across in-domain and out-of-distribution scenarios. Our results show ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
DMAP: A Distribution Map for Text
Tom Kempton, Julia Rozanova, Parameswaran Kamalaruban, Maeve Madigan, Karolina W...
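Personalized recommendation rationale:

The title involves text representation techniques and may belong to text processing or representation learning. Although text representation is a foundational component of search and recommender systems, the title is too broad: it does not explicitly point to specific applications in recommendation, search, or advertising, nor does it touch on LLMs, Transformer architectures, heterogeneous data modeling, or other core technical directions of current interest.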

2026-02-12 12:21:24 | arXiv:2602.11871v1 |
cs.CL cs.LG
Large Language Models (LLMs) are a powerful tool for statistical text analysis, with derived sequences of next-token probability distributions offering a wealth of information. Extracting this signal typically relies on metrics such as perplexity, which do not adequately account for context; how one should interpret a given next-token probability is dependent on the number of reasonable choices encoded by the shape of the conditional distribution. In this work, we present DMAP, a mathematically grounded method that maps a text, via a language model, to a set of samples in the unit interval that jointly encode rank and probability information. This representation enables efficient, model-agnostic analysis and supports a range of applications. We illustrate its utility through three case studies: (i) validation of generation parameters to ensure data integrity, (ii) examining the role of probability curvature in machine-generated text detection, and (iii) a forensic analysis revealing statistical fingerprints left in downstream models that have been subject to post-training on synthetic data. Our results demonstrate that DMAP offers a unified statistical view of text that is simple to compute on consumer hardware, widely applicable, and provides a foundation for further research into text analysis with LLMs.
Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li, Huijia Zhu,...
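Personalized recommendation rationale:

The title involves fine-grained multimodal perception and distillation, and the work mainly belongs to computer vision. Although multimodality is mentioned, the core is region-to-image distillation for visual perception, with no clear direct connection to recommender systems, search, or advertising. The title does not mention user sequences, contextual features, Transformer architectures, or other elements related to the current focus.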

2026-02-12 12:00:35 | arXiv:2602.11858v1 |
cs.CV cs.AI cs.CL cs.LG
Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves "single-glance" fine-grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA data spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global-regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks, and also improve general multimodal cognition on benchmarks such as visual reasoning and GUI agents. We further discuss when "Thinking-with-Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at https://github.com/inclusionAI/Zooming-without-Zooming.
A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments
Anne-Marie Lutgen, Alistair Plum, Christoph Purschke
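Personalized recommendation rationale:

This paper involves subword embeddings and user comment analysis and may have some value for text processing, but it has no direct connection to core advances in search, recommendation, or advertising, to LLM techniques, to Transformer architectures, or to VLM-analogy methods. Processing a specific language such as Luxembourgish has limited scope, and it is hard to see clear application potential in recommender systems, search, or advertising.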

2026-02-12 10:19:50 | arXiv:2602.11795v1 |
cs.CL
This paper presents an embedding-based approach to detecting variation without relying on prior normalisation or predefined variant lists. The method trains subword embeddings on raw text and groups related forms through combined cosine and n-gram similarity. This allows spelling and morphological diversity to be examined and analysed as linguistic structure rather than treated as noise. Using a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research. The induced families capture systematic correspondences and highlight areas of regional and stylistic differentiation. The procedure does not strictly require manual annotation, but does produce transparent clusters that support both quantitative and qualitative analysis. The results demonstrate that distributional modelling can reveal meaningful patterns of variation even in "noisy" or low-resource settings, offering a reproducible methodological framework for studying language variety in multilingual and small-language contexts.
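One way to picture "combined cosine and n-gram similarity" is a weighted mix of an embedding cosine score and a character n-gram overlap score. The sketch below is only a guess at the general shape: the Jaccard overlap over padded character trigrams, the equal weighting, and the toy English word vectors are assumptions for illustration, not the authors' procedure or data.

```python
import math
from typing import Dict, List, Set


def char_ngrams(word: str, n: int = 3) -> Set[str]:
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def combined_similarity(a: str, b: str, vectors: Dict[str, List[float]],
                        weight: float = 0.5) -> float:
    """Weighted mix of embedding cosine and n-gram overlap (weight is assumed)."""
    return (weight * cosine(vectors[a], vectors[b])
            + (1.0 - weight) * ngram_similarity(a, b))


# Toy 2-d vectors standing in for trained subword embeddings.
TOY_VECTORS = {
    "colour": [0.90, 0.10],
    "color": [0.88, 0.12],
    "table": [0.10, 0.90],
}

if __name__ == "__main__":
    print(combined_similarity("colour", "color", TOY_VECTORS))
    print(combined_similarity("colour", "table", TOY_VECTORS))
```

Pairs whose combined score clears a chosen cutoff could then be grouped into variant families, for example with connected components over the resulting similarity graph.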
Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation
Lingyong Yan, Jiulong Wu, Dong Xie, Weixian Shi, Deguo Xia, Jizhou Huang
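Personalized recommendation rationale:

This paper mainly focuses on video generation and educational applications, which belong to AIGC/content generation, an explicitly listed irrelevant topic. Although it involves LLM techniques, the application scenario (educational video generation) has no direct connection to the core ranking tasks of recommender systems, search, or advertising, and the paper does not show how the techniques could be applied to these areas.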

2026-02-12 10:14:36 | arXiv:2602.11790v1 |
cs.AI cs.CL
Although recent end-to-end video generation models demonstrate impressive performance in visually oriented content creation, they remain limited in scenarios that require strict logical rigor and precise knowledge representation, such as instructional and educational media. To address this problem, we propose LAVES, a hierarchical LLM-based multi-agent system for generating high-quality instructional videos from educational problems. LAVES formulates educational video generation as a multi-objective task that simultaneously demands correct step-by-step reasoning, pedagogically coherent narration, semantically faithful visual demonstrations, and precise audio-visual alignment. To address the limitations of prior approaches (including low procedural fidelity, high production cost, and limited controllability), LAVES decomposes the generation workflow into specialized agents coordinated by a central Orchestrating Agent with explicit quality gates and iterative critique mechanisms. Specifically, the Orchestrating Agent supervises a Solution Agent for rigorous problem solving, an Illustration Agent that produces executable visualization code, and a Narration Agent for learner-oriented instructional scripts. In addition, all outputs from the working agents are subject to semantic critique, rule-based constraints, and tool-based compilation checks. Rather than directly synthesizing pixels, the system constructs a structured executable video script that is deterministically compiled into synchronized visuals and narration using template-driven assembly rules, enabling fully automated end-to-end production without manual editing. In large-scale deployments, LAVES achieves a throughput exceeding one million videos per day, delivering over a 95% reduction in cost compared to current industry-standard approaches while maintaining a high acceptance rate.
TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents
Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Holger Boche
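Personalized recommendation rationale:

This paper mainly concerns the application of reinforcement learning (RL) to LLM agents and is purely RL methods research. Although the title mentions LLMs, the core is RL technique rather than a direct application or architectural improvement of LLMs in recommendation, search, or advertising. The paper shows no explicit relevance to recommender systems, search, or advertising, so it does not match the current focus.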

2026-02-12 09:49:24 | arXiv:2602.11767v1 |
cs.AI cs.CL cs.LG
Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks. However, multi-turn RL remains challenging as rewards are often sparse or delayed, and environments can be stochastic. In this regime, naive trajectory sampling can hinder exploitation and induce mode collapse. We propose TSR (Trajectory-Search Rollouts), a training-time approach that repurposes test-time scaling ideas for improved per-turn rollout generation. TSR performs lightweight tree-style search to construct high-quality trajectories by selecting high-scoring actions at each turn using task-specific feedback. This improves rollout quality and stabilizes learning while leaving the underlying optimization objective unchanged, making TSR optimizer-agnostic. We instantiate TSR with best-of-N, beam, and shallow lookahead search, and pair it with PPO and GRPO, achieving up to 15% performance gains and more stable learning on Sokoban, FrozenLake, and WebShop tasks at a one-time increase in training compute. By moving search from inference time to the rollout stage of training, TSR provides a simple and general mechanism for stronger multi-turn agent learning, complementary to existing frameworks and rejection-sampling-style selection methods.
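The best-of-N variant of per-turn search can be sketched in a few lines: at each turn, sample N candidate actions from the policy, keep the one that task-specific feedback scores highest, and add it to the rollout. The function names, the toy number-line environment, and the scoring rule below are hypothetical stand-ins and do not reflect the TSR implementation.

```python
import random
from typing import Callable, List, Optional, Tuple

State = int
Action = int


def best_of_n_rollout(
    state: State,
    sample_action: Callable[[State, random.Random], Action],
    step: Callable[[State, Action], State],
    score: Callable[[State, Action], float],
    n: int = 4,
    max_turns: int = 10,
    rng: Optional[random.Random] = None,
) -> Tuple[List[Tuple[State, Action]], State]:
    """Build one trajectory, keeping the best of n sampled actions per turn."""
    rng = rng if rng is not None else random.Random(0)
    trajectory: List[Tuple[State, Action]] = []
    for _ in range(max_turns):
        candidates = [sample_action(state, rng) for _ in range(n)]
        best = max(candidates, key=lambda action: score(state, action))
        trajectory.append((state, best))
        state = step(state, best)
    return trajectory, state


# Toy environment (assumed): walk along the integers toward TARGET.
TARGET = 5


def random_policy(state: State, rng: random.Random) -> Action:
    return rng.choice([-1, 1])


def toy_step(state: State, action: Action) -> State:
    return state + action


def toy_score(state: State, action: Action) -> float:
    # Task feedback: negative distance to the target after taking the action.
    return -abs(TARGET - toy_step(state, action))


if __name__ == "__main__":
    trajectory, final_state = best_of_n_rollout(
        0, random_policy, toy_step, toy_score, n=4, max_turns=8
    )
    print(final_state, len(trajectory))
```

The resulting trajectories would then be passed to an unchanged optimizer such as PPO or GRPO, which matches the abstract's description of the method as optimizer-agnostic.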
Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding
Boqi Chen, Xudong Liu, Jianing Qiu
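Personalized recommendation rationale:

This paper mainly addresses object hallucination in multimodal large language models, which is purely a VLM/NLP evaluation and hallucination-mitigation technique. Although multimodal models are mentioned, the core is resolving hallucination in vision-language alignment, and no application potential in recommender systems, search, or advertising is clearly shown.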

2026-02-12 09:04:28 | arXiv:2602.11737v1 |
cs.CV cs.CL
We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computation overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
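Visual contrastive decoding is usually described as contrasting next-token logits computed from the original image with logits computed from a perturbed auxiliary view, so that tokens supported mainly by language priors are down-weighted. The sketch below uses the commonly cited form (1 + alpha) * original - alpha * auxiliary; that formula, the alpha value, and the toy logits are assumptions for illustration and are not stated in this abstract.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(original: List[float], auxiliary: List[float],
                       alpha: float = 1.0) -> List[float]:
    """Amplify evidence that disappears when the salient object is removed."""
    return [(1.0 + alpha) * o - alpha * a for o, a in zip(original, auxiliary)]


if __name__ == "__main__":
    tokens = ["dog", "frisbee", "cat"]  # toy vocabulary
    original = [2.0, 1.8, 0.2]          # logits given the full image
    auxiliary = [0.5, 1.8, 0.2]         # logits given the object-masked view
    adjusted = contrastive_logits(original, auxiliary, alpha=1.0)
    for token, before, after in zip(tokens, softmax(original), softmax(adjusted)):
        print(f"{token}: {before:.3f} -> {after:.3f}")
```

In this toy case the "frisbee" logit is unchanged by masking, meaning it does not depend on the visual evidence, so its probability drops relative to "dog" after the contrast.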
DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels
Haolei Bai, Lingcheng Kong, Xueyi Chen, Jianmian Wang, Zhiqiang Tao, Huan Wang
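Personalized recommendation rationale:

This paper mainly focuses on applying LLMs to code generation (specifically CUDA kernels), a purely LLM-centric topic with no obvious connection to core advances, direct applications, or enabling technologies in RecSys/Search/Ads. Although diffusion models are a branch of LLM technology, the title does not indicate potential application value in recommendation, search, or advertising systems.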

2026-02-12 08:45:13 | arXiv:2602.11715v1 |
cs.LG cs.CL
Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs, owing to their capacity for parallel token generation. This paradigm is particularly well-suited for code generation, where holistic structural planning and non-sequential refinement are critical. Despite this potential, tailoring dLLMs for CUDA kernel generation remains challenging, obstructed not only by the high specialization but also by the severe lack of high-quality training data. To address these challenges, we construct CuKe, an augmented supervised fine-tuning dataset optimized for high-performance CUDA kernels. On top of it, we propose a bi-phase curated reinforcement learning (BiC-RL) framework consisting of a CUDA kernel infilling stage and an end-to-end CUDA kernel generation stage. Leveraging this training framework, we introduce DICE, a series of diffusion large language models designed for CUDA kernel generation, spanning three parameter scales, 1.7B, 4B, and 8B. Extensive experiments on KernelBench demonstrate that DICE significantly outperforms both autoregressive and diffusion LLMs of comparable scale, establishing a new state-of-the-art for CUDA kernel generation.
Finding Sense in Nonsense with Generated Contexts: Perspectives from Humans and Language Models
Katrin Olsen, Sebastian Padó
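Personalized recommendation rationale:

The title suggests research on LLM-generated contexts or on understanding nonsensical content, which is mainly a purely NLP-centric topic such as hallucination or evaluation benchmarks. Although language models are involved, nothing explicitly points to potential applications in recommender systems, search, or advertising, so relevance to the current focus is low.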

2026-02-12 08:23:52 | arXiv:2602.11699v1 |
cs.CL
Nonsensical and anomalous sentences have been instrumental in the development of computational models of semantic interpretation. A core challenge is to distinguish between what is merely anomalous (but can be interpreted given a supporting context) and what is truly nonsensical. However, it is unclear (a) how nonsensical, rather than merely anomalous, existing datasets are; and (b) how well LLMs can make this distinction. In this paper, we answer both questions by collecting sensicality judgments from human raters and LLMs on sentences from five semantically deviant datasets: both context-free and when providing a context. We find that raters consider most sentences at most anomalous, and only a few as properly nonsensical. We also show that LLMs are substantially skilled in generating plausible contexts for anomalous cases.
When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration
Jayadev Billa
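Personalized recommendation rationale:

This paper studies modality arbitration in audio-LLMs, a specific technical challenge for multimodal LLMs. Although the concept of modality arbitration might inspire ideas for handling heterogeneous data (as in the VLM analogy), the paper explicitly focuses on the audio modality, which is only weakly related to the text, sequence, and feature data types common in RecSys/Search/Ads. Audio processing is not a core application scenario in the specified areas (search, recommendation, advertising), so relevance is limited.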

2026-02-12 02:15:30 | arXiv:2602.11488v1 |
cs.CL cs.SD eess.AS
When audio and text conflict, speech-enabled language models follow the text 10 times more often than when arbitrating between two text sources, even when explicitly instructed to trust the audio. Using ALME, a benchmark of 57,602 controlled audio-text conflict stimuli across 8 languages, we find that Gemini 2.0 Flash exhibits 16.6% text dominance under audio-text conflict versus 1.6% under text-text conflict with identical reliability cues. This gap is not explained by audio quality: audio-only accuracy (97.2%) exceeds cascade accuracy (93.9%), indicating audio embeddings preserve more information than text transcripts. We propose that text dominance reflects an asymmetry not in information content but in arbitration accessibility: how easily the model can reason over competing representations. This framework explains otherwise puzzling findings. Forcing transcription before answering increases text dominance (19% to 33%), sacrificing audio's information advantage without improving accessibility. Framing text as "deliberately corrupted" reduces text dominance by 80%. A fine-tuning ablation provides interventional evidence: training only the audio projection layer increases text dominance (+26.5%), while LoRA on the language model halves it (-23.9%), localizing text dominance to the LLM's reasoning rather than the audio encoder. Experiments across four state-of-the-art audio-LLMs and 8 languages show consistent trends with substantial cross-linguistic and cross-model variation, establishing modality arbitration as a distinct reliability dimension not captured by standard speech benchmarks.
MonarchRT: Efficient Attention for Real-Time Video Generation
Krish Agarwal, Zhuoming Chen, Cheng Luo, Yongqi Chen, Haizhong Zheng, Xun Huang,...
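Personalized recommendation rationale:

This paper mainly focuses on attention efficiency optimization for video generation, a computer vision generation task with no direct connection to the core ranking problems of recommender systems, search, or advertising. Although efficient attention mechanisms (Enabling Transformer Tech) have general value, the paper explicitly targets video generation and does not explore potential applications in recommendation, search, or advertising.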

2026-02-12 18:56:53 | arXiv:2602.12271v1 |
cs.CV cs.LG
Real-time video generation with Diffusion Transformers is bottlenecked by the quadratic cost of 3D self-attention, especially in real-time regimes that are both few-step and autoregressive, where errors compound across time and each denoising step must carry substantially more information. In this setting, we find that prior sparse-attention approximations break down, despite showing strong results for bidirectional, many-step diffusion. Specifically, we observe that video attention is not reliably sparse, but instead combines pronounced periodic structure driven by spatiotemporal position with dynamic, sparse semantic correspondences and dense mixing, exceeding the representational capacity of even oracle top-k attention. Building on this insight, we propose Monarch-RT, a structured attention parameterization for video diffusion models that factorizes attention using Monarch matrices. Through appropriately aligned block structure and our extended tiled Monarch parameterization, we achieve high expressivity while preserving computational efficiency. We further overcome the overhead of parameterization through finetuning, with custom Triton kernels. We first validate the high efficacy of Monarch-RT over existing sparse baselines designed only for bidirectional models. We further observe that Monarch-RT attains up to 95% attention sparsity with no loss in quality when applied to the state-of-the-art model Self-Forcing, making Monarch-RT a pioneering work on highly-capable sparse attention parameterization for real-time video generation. Our optimized implementation outperforms FlashAttention-2, FlashAttention-3, and FlashAttention-4 kernels on Nvidia RTX 5090, H100, and B200 GPUs respectively, providing kernel speedups in the range of 1.4-11.8X. This enables us, for the first time, to achieve true real-time video generation with Self-Forcing at 16 FPS on a single RTX 5090.
Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training
Miaosen Zhang, Yishan Liu, Shuxia Lin, Xu Yang, Qi Dai, Chong Luo, Weihao Jiang,...
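Personalized recommendation rationale:

This paper mainly focuses on improving supervised fine-tuning (SFT) in LLM training, which is purely LLM training technique. Although it presents a theoretical framework and training methods, it does not explicitly show how they apply to concrete scenarios in recommender systems, search, or advertising, so its connection to the direct applications or enabling technologies you follow is weak.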

2026-02-12 17:59:58 | arXiv:2602.12222v1 |
cs.LG cs.AI cs.CV
Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). This gap is primarily driven by RL's use of on-policy data. We propose a framework to bridge this chasm by enabling On-Policy SFT. We first present Distribution Discriminant Theory (DDT), which explains and quantifies the alignment between data and the model-induced distribution. Leveraging DDT, we introduce two complementary techniques: (i) In-Distribution Finetuning (IDFT), a loss-level method to enhance generalization ability of SFT, and (ii) Hinted Decoding, a data-level technique that can re-align the training corpus to the model's distribution. Extensive experiments demonstrate that our framework achieves generalization performance on par with prominent offline RL algorithms, including DPO and SimPO, while maintaining the efficiency of an SFT pipeline. The proposed framework thus offers a practical alternative in domains where RL is infeasible. We open-source the code here: https://github.com/zhangmiaosen2000/Towards-On-Policy-SFT
DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
Xu Guo, Fulong Ye, Qichao Sun, Liyang Chen, Bingchuan Li, Pengze Zhang, Jiawei L...
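Personalized recommendation rationale:

This paper mainly focuses on audio-video generation, which belongs to multimodal generation and has no direct connection to the core technologies of recommender systems, search, or advertising (such as ranking, recall, and user modeling). Although "controllable generation" and "human-centric" in the title may hint at personalized content generation, the paper more likely centers on AIGC or content creation than on practical applications in recommendation, search, or advertising.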

2026-02-12 16:41:52 | arXiv:2602.12160v1 |
cs.CV
Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks including reference-based audio-video generation (R2AV), video editing (RV2AV) and audio-driven video animation (RA2V) as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level to ensure rigid attention-space binding, and Structured Captions at the semantic level to establish explicit attribute-subject mappings. Furthermore, we devise a Multi-Task Progressive Training scheme that leverages weakly-constrained generative priors to regularize strongly-constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.
AssetFormer: Modular 3D Assets Generation with Autoregressive Transformer
Lingting Zhu, Shengju Qian, Haidi Fan, Jiayu Dong, Zhenchao Jin, Siwei Zhou, Gen...
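Personalized recommendation rationale:

This paper mainly involves 3D content generation, which belongs to computer graphics and has no direct connection to the core technologies of recommender systems, search, or advertising. Although the Transformer architecture is technically relevant, the paper focuses on the specific application of 3D asset generation and lacks clear application potential in RecSys/Search/Ads.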

2026-02-12 15:55:21 | arXiv:2602.12100v1 |
cs.CV
The digital industry demands high-quality, diverse modular 3D assets, especially for user-generated content (UGC). In this work, we introduce AssetFormer, an autoregressive Transformer-based model designed to generate modular 3D assets from textual descriptions. Our pilot study leverages real-world modular assets collected from online platforms. AssetFormer tackles the challenge of creating assets composed of primitives that adhere to constrained design parameters for various applications. By innovatively adapting module sequencing and decoding techniques inspired by language models, our approach enhances asset generation quality through autoregressive modeling. Initial results indicate the effectiveness of AssetFormer in streamlining asset creation for professional development and UGC scenarios. This work presents a flexible framework extendable to various types of modular 3D assets, contributing to the broader field of 3D content generation. The code is available at https://github.com/Advocate99/AssetFormer.
GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning
GigaBrain Team, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Jie ...
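Personalized recommendation rationale:

The title indicates that the paper involves a vision-language-action (VLA) model and world model-based reinforcement learning, which is purely vision or reinforcement learning territory, with no explicit direct relevance to recommender systems, search, or advertising. Although VLAs might inspire heterogeneous data processing, the title mentions no application or potential use related to RecSys/Search/Ads, so relevance is low.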

2026-02-12 15:55:19 | arXiv:2602.12099v1 |
cs.CV
Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre-trained on web-scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. Therefore, we propose GigaBrain-0.5M*, a VLA model trained via world model-based reinforcement learning. Built upon GigaBrain-0.5, which is pre-trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark, GigaBrain-0.5M* further integrates world model-based reinforcement learning via RAMP (Reinforcement leArning via world Model-conditioned Policy) to enable robust cross-task adaptation. Empirical results demonstrate that RAMP achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30% on challenging tasks including Laundry Folding, Box Packing, and Espresso Preparation. Critically, GigaBrain-0.5M* exhibits reliable long-horizon execution, consistently accomplishing complex manipulation tasks without failure as validated by real-world deployment videos on our project page (https://gigabrain05m.github.io).
Can Local Vision-Language Models improve Activity Recognition over Vision Transformers? -- Case Study on Newborn Resuscitation
Enrico Guerriero, Kjersti Engan, Øyvind Meinich-Bache
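Personalized recommendation rationale:

The title explicitly points to an activity recognition application in the medical domain (newborn resuscitation), a clearly medical, domain-specific application, which is listed as an irrelevant topic. Although vision-language models and Vision Transformers are involved, there is no potential application link to recommender systems, search, or advertising, and the focus is on a medical scenario rather than general-purpose technology.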

2026-02-12 14:31:10 | arXiv:2602.12002v1 |
cs.CV
Accurate documentation of newborn resuscitation is essential for quality improvement and adherence to clinical guidelines, yet remains underutilized in practice. Previous work using 3D-CNNs and Vision Transformers (ViT) has shown promising results in detecting key activities from newborn resuscitation videos, but also highlighted the challenges in recognizing such fine-grained activities. This work investigates the potential of generative AI (GenAI) methods to improve activity recognition from such videos. Specifically, we explore the use of local vision-language models (VLMs), combined with large language models (LLMs), and compare them to a supervised TimeSformer baseline. Using a simulated dataset comprising 13.26 hours of newborn resuscitation videos, we evaluate several zero-shot VLM-based strategies and fine-tuned VLMs with classification heads, including Low-Rank Adaptation (LoRA). Our results suggest that small (local) VLMs struggle with hallucinations, but when fine-tuned with LoRA they reach an F1 score of 0.91, surpassing the TimeSformer result of 0.70.
Spatial Chain-of-Thought: Bridging Understanding and Generation Models for Spatial Reasoning Generation
Wei Chen, Yancheng Long, Mingqiao Liu, Haojie Ding, Yankai Yang, Hongyang Wei, Y...
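Personalized recommendation rationale:

This paper mainly focuses on spatial reasoning generation, research on a specific reasoning capability that is only weakly related to the core technologies of recommender systems, search, or advertising. Although chain-of-thought is an important LLM advance, the specific scenario of spatial reasoning has limited direct application potential in RecSys/Search/Ads and leans toward improving reasoning in a particular domain.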

2026-02-12 14:12:14 | arXiv:2602.11980v1 |
cs.CV
While diffusion models have shown exceptional capabilities in aesthetic image synthesis, they often struggle with complex spatial understanding and reasoning. Existing approaches resort to Multimodal Large Language Models (MLLMs) to enhance this capability. However, they either incur high computational costs through joint training or suffer from spatial information loss when relying solely on textual prompts. To alleviate these limitations, we propose a Spatial Chain-of-Thought (SCoT) framework, a plug-and-play approach that effectively bridges the reasoning capabilities of MLLMs with the generative power of diffusion models. Specifically, we first enhance the diffusion model's layout awareness by training it on an interleaved text-coordinate instruction format. We then leverage state-of-the-art MLLMs as planners to generate comprehensive layout plans, transferring their spatial planning capabilities directly to the generation process. Extensive experiments demonstrate that our method achieves state-of-the-art performance on image generation benchmarks and significantly outperforms baselines on complex reasoning tasks, while also showing strong efficacy in image editing scenarios.
Where Bits Matter in World Model Planning: A Paired Mixed-Bit Study for Efficient Spatial Reasoning
Suraj Ranganath, Anish Patnaik, Vaishak Menon
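Personalized recommendation rationale:

The title involves world-model planning and spatial reasoning and mainly concerns computational efficiency (mixed-bit representations). Although efficiency improvements may indirectly benefit model deployment in recommendation and search systems, the title does not explicitly point to Transformer architectures, LLM techniques, or concrete applications in recommendation, search, or advertising. Its core content leans toward general compute optimization rather than the areas of current interest.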

2026-02-12 12:32:51 | arXiv:2602.11882v1 |
cs.LG cs.AI cs.CV cs.RO
Efficient spatial reasoning requires world models that remain reliable under tight precision budgets. We study whether low-bit planning behavior is determined mostly by total bitwidth or by where bits are allocated across modules. Using DINO-WM on the Wall planning task, we run a paired-goal mixed-bit evaluation across uniform, mixed, asymmetric, and layerwise variants under two planner budgets. We observe a consistent three-regime pattern: 8-bit and 6-bit settings remain close to FP16, 3-bit settings collapse, and 4-bit settings are allocation-sensitive. In that transition region, preserving encoder precision improves planning relative to uniform quantization, and near-size asymmetric variants show the same encoder-side direction. In a later strict 22-cell replication with smaller per-cell episode count, the mixed-versus-uniform INT4 sign becomes budget-conditioned, which further highlights the sensitivity of this transition regime. These findings motivate module-aware, budget-aware quantization policies as a broader research direction for efficient spatial reasoning. Code and run artifacts are available at https://github.com/suraj-ranganath/DINO-MBQuant.
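To make "where bits are allocated across modules" concrete, the sketch below applies uniform symmetric fake quantization with a separate bitwidth per module and reports the resulting error. The quantizer, the module names `encoder` and `predictor`, and the toy weights are assumptions chosen for illustration; they are not taken from DINO-WM or the authors' code.

```python
from typing import Dict, List


def quantize_symmetric(values: List[float], bits: int) -> List[float]:
    """Uniform symmetric fake quantization: round to a signed grid, then rescale."""
    if bits < 2:
        raise ValueError("bits must be at least 2 for a signed symmetric grid")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(reference: List[float], approx: List[float]) -> float:
    """Mean absolute difference between two equally long lists."""
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)


def quantize_modules(weights: Dict[str, List[float]],
                     bits: Dict[str, int]) -> Dict[str, List[float]]:
    """Apply a per-module bitwidth, i.e. a mixed-bit allocation."""
    return {name: quantize_symmetric(w, bits[name]) for name, w in weights.items()}


if __name__ == "__main__":
    # Toy weights for two hypothetical modules.
    weights = {
        "encoder": [0.9, -0.31, 0.07, 0.55, -0.78],
        "predictor": [0.42, -0.66, 0.13, -0.05, 0.88],
    }
    allocations = {
        "uniform INT4": {"encoder": 4, "predictor": 4},
        "encoder-preserving": {"encoder": 8, "predictor": 4},
    }
    for label, bits in allocations.items():
        quantized = quantize_modules(weights, bits)
        errors = {m: mean_abs_error(weights[m], quantized[m]) for m in weights}
        print(label, errors)
```

Weight reconstruction error is only a proxy here; the study itself measures planning behavior, which is what reveals the allocation-sensitive 4-bit regime.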
DiffPlace: Street View Generation via Place-Controllable Diffusion Model Enhancing Place Recognition
Ji Li, Zhiwei Li, Shihao Li, Zhenjiang Yu, Boyang Wang, Haiou Liu
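Personalized recommendation rationale:

This paper mainly focuses on street view generation and place recognition, which belong to computer vision and have no direct connection to the core technologies of recommender systems, search, or advertising (such as ranking, retrieval, and user modeling). Although diffusion models are a type of generative model, the application scenario (street view generation) lacks a clear point of contact with the specified business areas (RecSys/Search/Ads), so relevance is low.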

2026-02-12 12:26:09 | arXiv:2602.11875v1 |
cs.CV cs.RO
Generative models have advanced significantly in realistic image synthesis, with diffusion models excelling in quality and stability. Recent multi-view diffusion models improve 3D-aware street view generation, but they struggle to produce place-aware and background-consistent urban scenes from text, BEV maps, and object bounding boxes. This limits their effectiveness in generating realistic samples for place recognition tasks. To address these challenges, we propose DiffPlace, a novel framework that introduces a place-ID controller to enable place-controllable multi-view image generation. The place-ID controller employs linear projection, perceiver transformer, and contrastive learning to map place-ID embeddings into a fixed CLIP space, allowing the model to synthesize images with consistent background buildings while flexibly modifying foreground objects and weather conditions. Extensive experiments, including quantitative comparisons and augmented training evaluations, demonstrate that DiffPlace outperforms existing methods in both generation quality and training support for visual place recognition. Our results highlight the potential of generative models in enhancing scene-level and place-aware synthesis, providing a valuable approach for improving place recognition in autonomous driving.
JEPA-VLA: Video Predictive Embedding is Needed for VLA Models
Shangchen Miao, Ningya Feng, Jialong Wu, Ye Lin, Xu He, Dong Li, Mingsheng Long
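Personalized recommendation rationale:

The title focuses on applying video predictive embeddings to vision-language-action (VLA) models, which is purely a vision or multimodal topic with no direct connection to the core techniques of recommender systems, search, or advertising. Although it mentions predictive embeddings, it does not clearly show their potential value for heterogeneous data processing or for recommendation and search scenarios.

2026-02-12 11:20:43 | arXiv:2602.11832v1 |
cs.CV, cs.RO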
Recent vision-language-action (VLA) models built upon pretrained vision-language models (VLMs) have achieved significant improvements in robotic manipulation. However, current VLAs still suffer from low sample efficiency and limited generalization. This paper argues that these limitations are closely tied to an overlooked component, pretrained visual representation, which offers insufficient knowledge on both aspects of environment understanding and policy prior. Through an in-depth analysis, we find that commonly used visual representations in VLAs, whether pretrained via language-image contrastive learning or image-based self-supervised learning, remain inadequate at capturing crucial, task-relevant environment information and at inducing effective policy priors, i.e., anticipatory knowledge of how the environment evolves under successful task execution. In contrast, we discover that predictive embeddings pretrained on videos, in particular V-JEPA 2, are adept at flexibly discarding unpredictable environment factors and encoding task-relevant temporal dynamics, thereby effectively compensating for key shortcomings of existing visual representations in VLAs. Building on these observations, we introduce JEPA-VLA, a simple yet effective approach that adaptively integrates predictive embeddings into existing VLAs. Our experiments demonstrate that JEPA-VLA yields substantial performance gains across a range of benchmarks, including LIBERO, LIBERO-plus, RoboTwin2.0, and real-robot tasks.
Efficient Segment Anything with Depth-Aware Fusion and Limited Training Data
Yiming Zhou, Xuenjie Xie, Panfeng Li, Albrecht Kunz, Ahmad Osman, Xavier Maldagu...
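Personalized recommendation rationale:

This paper focuses on image segmentation in computer vision and is a purely vision-oriented line of research. Although it mentions model efficiency improvements, it shows no direct connection to, or potential application in, recommender systems, search, or advertising.

2026-02-12 10:35:35 | arXiv:2602.11804v1 |
cs.CV, eess.IV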
Segment Anything Models (SAM) achieve impressive universal segmentation performance but require massive datasets (e.g., 11M images) and rely solely on RGB inputs. Recent efficient variants reduce computation but still depend on large-scale training. We propose a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Depth maps are generated with a pretrained estimator and fused mid-level with RGB features through a dedicated depth encoder. Trained on only 11.2k samples (less than 0.1% of SA-1B), our method achieves higher accuracy than EfficientViT-SAM, showing that depth cues provide strong geometric priors for segmentation.
Adaptive Debiasing Tsallis Entropy for Test-Time Adaptation
Xiangyu Wu, Dongming Jiang, Feng Yu, Yueying Tian, Jiaqi Tang, Qing-Guo Chen, Ya...
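Personalized recommendation rationale:

The title concerns test-time adaptation and debiasing, likely within model robustness or distribution shift. Although debiasing can apply to recommender systems (for example, handling selection bias), the title does not explicitly connect to recommendation, search, or advertising systems. Tsallis entropy is a general information-theoretic concept, and the title mentions no LLM, Transformer, or concrete application scenario. Relevance is therefore low and at most indirect.

2026-02-12 09:12:22 | arXiv:2602.11743v1 |
cs.CV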
Mainstream Test-Time Adaptation (TTA) methods for adapting vision-language models, e.g., CLIP, typically rely on Shannon Entropy (SE) at test time to measure prediction uncertainty and inconsistency. However, since CLIP has a built-in bias from pretraining on highly imbalanced web-crawled data, SE inevitably results in producing biased estimates of uncertainty entropy. To address this issue, we notably find and demonstrate that Tsallis Entropy (TE), a generalized form of SE, is naturally suited for characterizing biased distributions by introducing a non-extensive parameter q, with the performance of SE serving as a lower bound for TE. Building upon this, we generalize TE into Adaptive Debiasing Tsallis Entropy (ADTE) for TTA, customizing a class-specific parameter q^l derived by normalizing the estimated label bias from continuously incoming test instances, for each category. This adaptive approach allows ADTE to accurately select high-confidence views and seamlessly integrate with a label adjustment strategy to enhance adaptation, without introducing distribution-specific hyperparameter tuning. Besides, our investigation reveals that both TE and ADTE can serve as direct, advanced alternatives to SE in TTA, without any other modifications. Experimental results show that ADTE outperforms state-of-the-art methods on ImageNet and its five variants, and achieves the highest average performance on 10 cross-domain benchmarks, regardless of the model architecture or text prompts used. Our code is available at https://github.com/Jinx630/ADTE.
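For readers unfamiliar with the quantity the abstract builds on, the following is a minimal sketch of the standard Tsallis entropy, S_q(p) = (1 - Σ p_i^q) / (q - 1), which reduces to Shannon entropy as q approaches 1. It illustrates only this textbook definition; the class-specific parameter q^l and the label-bias normalization of ADTE are not reproduced here, and the function names and example probabilities are assumptions made for illustration.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)


def tsallis_entropy(p, q):
    """Standard Tsallis entropy S_q(p) = (1 - sum_i p_i**q) / (q - 1).

    The limit q -> 1 is handled explicitly by returning Shannon entropy.
    """
    if abs(q - 1.0) < 1e-12:
        return shannon_entropy(p)
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)


# Hypothetical softmax output over three classes (illustrative values only).
probs = [0.7, 0.2, 0.1]
for q in (0.5, 1.0, 2.0):
    print(q, tsallis_entropy(probs, q))
```

Because q is a free parameter, the same prediction can be scored as more or less uncertain depending on q, which is the degree of freedom the abstract says ADTE adapts per class.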
RI-Mamba: Rotation-Invariant Mamba for Robust Text-to-Shape Retrieval
Khanh Nguyen, Dasith de Silva Edirimuni, Ghulam Mubashar Hassan, Ajmal Mian
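Personalized recommendation rationale:

This paper focuses on text-to-3D-shape retrieval, which belongs to computer vision and 3D vision. Although it mentions the Mamba architecture (a sequence modeling approach), its application scenario (3D shape retrieval) has no direct connection to the core areas of recommender systems, search, or advertising. The paper does not show innovations in heterogeneous data processing or multimodal modeling, which are the aspects relevant to the current focus.

2026-02-12 07:46:03 | arXiv:2602.11673v1 |
cs.CV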
3D assets have rapidly expanded in quantity and diversity due to the growing popularity of virtual reality and gaming. As a result, text-to-shape retrieval has become essential in facilitating intuitive search within large repositories. However, existing methods require canonical poses and support few object categories, limiting their real-world applicability where objects can belong to diverse classes and appear in random orientations. To address this challenge, we propose RI-Mamba, the first rotation-invariant state-space model for point clouds. RI-Mamba defines global and local reference frames to disentangle pose from geometry and uses Hilbert sorting to construct token sequences with meaningful geometric structure while maintaining rotation invariance. We further introduce a novel strategy to compute orientational embeddings and reintegrate them via feature-wise linear modulation, effectively recovering spatial context and enhancing model expressiveness. Our strategy is inherently compatible with state-space models and operates in linear time. To scale up retrieval, we adopt cross-modal contrastive learning with automated triplet generation, allowing training on diverse datasets without manual annotation. Extensive experiments demonstrate RI-Mamba's superior representational capacity and robustness, achieving state-of-the-art performance on the OmniObject3D benchmark across more than 200 object categories under arbitrary orientations. Our code will be made available at https://github.com/ndkhanh360/RI-Mamba.git.
SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving
Seo Hyun Kim, Jin Bok Park, Do Yeon Koo, Ho Gun Park, Il Yong Chun
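Personalized recommendation rationale:

This paper focuses on token reduction for multimodal LLMs in autonomous driving, which is a domain-specific application rather than a general technique. Although token reduction itself could be valuable for efficiency optimization in recommendation and search systems, the paper is explicitly centered on the unrelated domain of autonomous driving and does not mention any application scenario related to recommender systems, search, or advertising.

2026-02-12 07:21:24 | arXiv:2602.11656v1 |
cs.CV, cs.AI, cs.RO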
In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have achieved significant advancements. For safe driving in unexpected scenarios, these systems may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) facilitates human-vehicle interaction and can improve performance in such scenarios. However, this approach requires substantial computational resources due to its reliance on an LLM and numerous visual tokens from sensor inputs, which are limited in autonomous vehicles. Many MLLM studies have explored reducing visual tokens, but often suffer end-task performance degradation compared to using all tokens. To enable efficient E2E driving while maintaining performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for multi-modal LLMs (SToRM). The proposed framework consists of three key elements. First, a lightweight importance predictor with short-term sliding windows estimates token importance scores. Second, a supervised training approach uses an auxiliary path to obtain pseudo-supervision signals from an all-token LLM pass. Third, an anchor-context merging module partitions tokens into anchors and context tokens, and merges context tokens into relevant anchors to reduce redundancy while minimizing information loss. Experiments on the LangAuto benchmark show that SToRM outperforms state-of-the-art E2E driving MLLMs under the same reduced-token budget, maintaining all-token performance while reducing computational cost by up to 30x.
ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation
Zedong Chu, Shichao Xie, Xiaolong Wu, Yanfen Shen, Minghua Luo, Zhengbo Wang, Fe...
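Personalized recommendation rationale:

This paper focuses on a vision-language-action (VLA) foundation model for embodied navigation, which belongs to robotics and embodied intelligence. Although it involves multimodal modeling, its core application is navigation in physical environments, which has no direct connection to the need for unified modeling of heterogeneous data in recommender systems, search, or advertising.

2026-02-12 05:30:20 | arXiv:2602.11598v1 |
cs.RO, cs.AI, cs.CV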
Embodied navigation has long been fragmented by task-specific architectures. We introduce ABot-N0, a unified Vision-Language-Action (VLA) foundation model that achieves a "Grand Unification" across 5 core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 utilizes a hierarchical "Brain-Action" architecture, pairing an LLM-based Cognitive Brain for semantic reasoning with a Flow Matching-based Action Expert for precise, continuous trajectory generation. To support large-scale learning, we developed the ABot-N0 Data Engine, curating 16.9M expert trajectories and 5.0M reasoning samples across 7,802 high-fidelity 3D scenes (10.7 km²). ABot-N0 achieves new SOTA performance across 7 benchmarks, significantly outperforming specialized models. Furthermore, our Agentic Navigation System integrates a planner with hierarchical topological memory, enabling robust, long-horizon missions in dynamic real-world environments.
Move What Matters: Parameter-Efficient Domain Adaptation via Optimal Transport Flow for Collaborative Perception
Zesheng Jia, Jin Wang, Siao Liu, Lingzhi Li, Ziyao Huang, Yunjiang Xu, Jianping ...
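Personalized recommendation rationale:

This paper focuses on domain adaptation for collaborative perception, which belongs to computer vision and robotics. Although it mentions a parameter-efficient method (which may relate to model efficiency), the title points explicitly to collaborative perception, which has no direct connection to the core technical needs of recommender systems, search, or advertising. Optimal transport flow could in theory be used for aligning data distributions, but the paper does not indicate application potential in RecSys/Search/Ads scenarios.

2026-02-12 04:36:50 | arXiv:2602.11565v1 |
cs.CV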
Fast domain adaptation remains a fundamental challenge for deploying multi-agent systems across diverse environments in Vehicle-to-Everything (V2X) collaborative perception. Despite the success of Parameter-Efficient Fine-Tuning (PEFT) in natural language processing and conventional vision tasks, directly applying PEFT to multi-agent settings leads to significant performance degradation and training instability. In this work, we conduct a detailed analysis and identify two key factors: (i) inter-frame redundancy in heterogeneous sensory streams, and (ii) erosion of fine-grained semantics in deep-layer representations under PEFT adaptation. To address these issues, we propose FlowAdapt, a parameter-efficient framework grounded in optimal transport theory, which minimizes information transport costs across both data distributions and network hierarchies. Specifically, we introduce a Wasserstein Greedy Sampling strategy to selectively filter redundant samples via a bounded covering radius. Furthermore, a Progressive Knowledge Transfer module is designed to progressively inject compressed early-stage representations into later stages through learnable pathways, alleviating semantic degradation in late-stage adaptation. Extensive experiments on three benchmarks demonstrate that FlowAdapt achieves state-of-the-art performance with only 1% of trainable parameters, effectively bridging domain gaps with superior sample efficiency and generalization.
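The idea of filtering redundant samples under a bounded covering radius can be illustrated with a generic greedy cover. This is only a sketch under loud assumptions: it uses plain Euclidean distance between feature vectors as a stand-in, it is not the paper's Wasserstein Greedy Sampling, and the function name, points, and radius are hypothetical.

```python
import math


def greedy_cover(points, radius):
    """Greedy covering-radius subsampling (Euclidean stand-in, not FlowAdapt's method).

    A point is kept only if no already-kept point lies within `radius` of it,
    so every discarded point is within `radius` of some kept point.
    """
    kept = []
    for p in points:
        if all(math.dist(p, c) > radius for c in kept):
            kept.append(p)
    return kept


# Hypothetical 2-D feature vectors; near-duplicates stand in for redundant frames.
frames = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.05, 0.0), (3.0, 0.0)]
print(greedy_cover(frames, radius=0.5))
```

A larger radius discards more samples at the cost of a coarser cover, which is the trade-off a bounded covering radius makes explicit.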
How Smart Is Your GUI Agent? A Framework for the Future of Software Interaction
Sidong Feng, Chunyang Chen
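Personalized recommendation rationale:

This paper focuses on GUI agents and software interaction, which is general AI agent technology with no direct connection to the core techniques of recommender systems, search, or advertising (such as ranking, retrieval, and user modeling). Although agent technology may apply indirectly to some interaction scenarios, the title shows no clear RecSys/Search/Ads application potential, so relevance is low.

2026-02-12 03:14:11 | arXiv:2602.11514v1 |
cs.SE, cs.AI, cs.CV, cs.HC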
GUI agents are rapidly becoming a new mode of interaction with software, allowing people to navigate web, desktop, and mobile interfaces without executing each step click by click. Yet "agent" is described with radically different degrees of autonomy, obscuring capability, responsibility, and risk. We call for conceptual clarity through GUI Agent Autonomy Levels (GAL), a six-level framework that makes autonomy explicit and helps benchmark progress toward trustworthy software interaction.
A Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness
Yun-Cheng Li, Sen Lei, Heng-Chao Li, Ke Li
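Personalized recommendation rationale:

The title indicates a focus on semantic change detection, a temporal analysis task in computer vision or natural language processing with no direct connection to the core techniques of recommender systems, search, or advertising. Although the title mentions "boundary and temporal awareness", which may involve sequence modeling, it does not clearly point to recommendation or search applications such as user behavior sequences, contextual features, or unified multimodal modeling.

2026-02-12 00:54:22 | arXiv:2602.11466v1 |
cs.CV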
Semantic Change Detection (SCD) aims to detect and categorize land-cover changes from bi-temporal remote sensing images. Existing methods often suffer from blurred boundaries and inadequate temporal modeling, limiting segmentation accuracy. To address these issues, we propose a Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness, termed DBTANet. Specifically, we utilize a dual-branch Siamese encoder where a frozen SAM branch captures global semantic context and boundary priors, while a ResNet34 branch provides local spatial details, ensuring complementary feature representations. On this basis, we design a Bidirectional Temporal Awareness Module (BTAM) to aggregate multi-scale features and capture temporal dependencies in a symmetric manner. Furthermore, a Gaussian-smoothed Projection Module (GSPM) refines shallow SAM features, suppressing noise while enhancing edge information for boundary-aware constraints. Extensive experiments on two public benchmarks demonstrate that DBTANet effectively integrates global semantics, local details, temporal reasoning, and boundary awareness, achieving state-of-the-art performance.
IncompeBench: A Permissively Licensed, Fine-Grained Benchmark for Music Information Retrieval
Benjamin Clavié, Atoof Shakir, Jonah Turner, Sean Lee, Aamir Shakir, Makoto P. K...
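Personalized recommendation rationale:

The title is explicitly about a benchmark for music information retrieval (MIR), a specialized area of audio and music processing. Although benchmarks have some general technical value, MIR differs substantially from the core areas of recommender systems, search, or advertising (which handle text, images, user behavior, and other structured data), and the title mentions no technique related to LLMs, Transformer architectures, or unified modeling of heterogeneous data.

2026-02-12 13:37:58 | arXiv:2602.11941v1 |
cs.IR, cs.AI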
Multimodal Information Retrieval has made significant progress in recent years, leveraging the increasingly strong multimodal abilities of deep pre-trained models to represent information across modalities. Music Information Retrieval (MIR), in particular, has considerably increased in quality, with neural representations of music even making their way into everyday life products. However, there is a lack of high-quality benchmarks for evaluating music retrieval performance. To address this issue, we introduce IncompeBench, a carefully annotated benchmark comprising 1,574 permissively licensed, high-quality music snippets, 500 diverse queries, and over 125,000 individual relevance judgements. These annotations were created through the use of a multi-stage pipeline, resulting in high agreement between human annotators and the generated data. The resulting datasets are publicly available at https://huggingface.co/datasets/mixedbread-ai/incompebench-strict and https://huggingface.co/datasets/mixedbread-ai/incompebench-lenient with the prompts available at https://github.com/mixedbread-ai/incompebench-programs.
Reliable and Private Anonymous Routing for Satellite Constellations
Nilesh Vyas, Fabien Geyer, Svetoslav Duhovnikov
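Personalized recommendation rationale:

The title explicitly concerns privacy and satellite network routing, which fall under privacy and security in the explicitly excluded "Irrelevant Topics". The title mentions nothing related to recommender systems, search, advertising, LLM techniques, Transformer architectures, or heterogeneous data modeling, so it is entirely unrelated to the current focus.

2026-02-12 09:43:55 | arXiv:2602.11764v1 |
cs.CR, cs.ET, cs.IR, cs.NI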
Shared, dynamic network infrastructures, such as dual-use LEO satellite constellations, pose critical threats to metadata privacy, particularly for state actors operating in mixed-trust environments. This work proposes an enhanced anonymity architecture, evolving the Loopix mix-network, to provide robust security and reliability in these volatile topologies. We introduce three primary contributions: (1) A multi-path transport protocol utilizing $(n, k)$ erasure codes, which is demonstrated to counteract the high link volatility and intermittent connectivity that render standard mix-networks unreliable. (2) The integration of a computationally efficient Private Information Retrieval (PIR) protocol during route discovery. (3) The introduction of adaptive, centrality-based delay strategies that efficiently mitigate the inherent topological bias of LEO networks, providing a superior anonymity-to-latency trade-off. This mechanism provably prevents metadata leakage at the user-provider directory, mitigating profiling and correlation attacks. We validate this architecture via high-fidelity, packet-level simulations of a LEO constellation. Empirical results show our multi-path transport achieves near-zero message loss, establishing a quantifiable trade-off between reliability and bandwidth overhead. Furthermore, microbenchmarks of the PIR protocol quantify its computational and latency overheads, confirming its feasibility for practical deployment. This work provides a validated blueprint for deployable high-anonymity communication systems, demonstrating the viability of securely multiplexing sensitive operations within large-scale commercial network infrastructures.
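The reliability-versus-bandwidth trade-off of an (n, k) erasure code over multiple paths can be made concrete with a small calculation. This is a sketch under stated assumptions, not the paper's protocol or simulation: it assumes one coded share per path, an ideal code in which any k of the n shares reconstruct the message, and independent paths with the same delivery probability. The function names and the 0.8 per-path success rate are hypothetical.

```python
from math import comb


def delivery_probability(n, k, path_success):
    """P(at least k of n shares arrive), assuming independent, identical paths."""
    return sum(
        comb(n, i) * path_success ** i * (1 - path_success) ** (n - i)
        for i in range(k, n + 1)
    )


def bandwidth_overhead(n, k):
    """Shares sent per share strictly needed; 1.0 means no redundancy."""
    return n / k


# Hypothetical per-path delivery probability on a volatile link.
p = 0.8
for n, k in [(3, 3), (4, 3), (6, 3)]:
    print(n, k, round(delivery_probability(n, k, p), 4), bandwidth_overhead(n, k))
```

Adding redundant shares raises the chance that at least k arrive, while the overhead n/k grows in step, which is the trade-off the abstract describes as quantifiable.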
A technical curriculum on language-oriented artificial intelligence in translation and specialised communication
Ralph Krüger
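Personalized recommendation rationale:

The title focuses on a curriculum for language-oriented artificial intelligence in translation and specialised communication, which falls under education or curriculum design. It mentions none of the key elements such as recommender systems, search, advertising, Transformer architectures, or applications of LLM techniques, and has no direct connection to the user's core interests in RecSys/Search/Ads advances, LLM applications, or Transformer architecture improvements.

2026-02-12 18:37:23 | arXiv:2602.12251v1 |
cs.CL, cs.AI, cs.HC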
This paper presents a technical curriculum on language-oriented artificial intelligence (AI) in the language and translation (L&T) industry. The curriculum aims to foster domain-specific technical AI literacy among stakeholders in the fields of translation and specialised communication by exposing them to the conceptual and technical/algorithmic foundations of modern language-oriented AI in an accessible way. The core curriculum focuses on 1) vector embeddings, 2) the technical foundations of neural networks, 3) tokenization and 4) transformer neural networks. It is intended to help users develop computational thinking as well as algorithmic awareness and algorithmic agency, ultimately contributing to their digital resilience in AI-driven work environments. The didactic suitability of the curriculum was tested in an AI-focused MA course at the Institute of Translation and Multilingual Communication at TH Koeln. Results suggest the didactic effectiveness of the curriculum, but participant feedback indicates that it should be embedded into higher-level didactic scaffolding - e.g., in the form of lecturer support - in order to enable optimal learning conditions.
"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most
Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou
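Personalized recommendation rationale:

The title is explicitly about analyzing performance failures of speech models, which is purely a speech processing topic. Although speech technology can touch search or recommendation in some edge cases (such as voice search), the title shows no direct connection to recommender systems, search, advertising, or Transformer architectures, nor any potential for unified modeling of speech as a heterogeneous data modality, so it is essentially unrelated to your core areas of interest.

2026-02-12 18:36:09 | arXiv:2602.12249v1 |
cs.AI, cs.CL, cs.CY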
Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic locations and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers compared to English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with less than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.
Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications
Manjunath Kudlur, Evan King, James Wang, Pete Warden
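Personalized recommendation rationale:

This paper focuses on automatic speech recognition (ASR), which is purely a speech processing topic and unrelated to the core focus of search, recommendation, and advertising systems. Although it mentions an encoder architecture, this is a specific application to speech signal processing and shows no potential value for heterogeneous data processing, Transformer architecture improvements, or LLM techniques.

2026-02-12 18:20:45 | arXiv:2602.12241v1 |
cs.CL, cs.LG, cs.SD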
Latency-critical speech applications (e.g., live transcription, voice commands, and real-time translation) demand low time-to-first-token (TTFT) and high transcription accuracy, particularly on resource-constrained edge devices. Full-attention Transformer encoders remain a strong accuracy baseline for automatic speech recognition (ASR) because every frame can directly attend to every other frame, which resolves otherwise locally ambiguous acoustics using distant lexical context. However, this global dependency incurs quadratic complexity in sequence length, inducing an inherent "encode-the-whole-utterance" latency profile. For streaming use cases, this causes TTFT to grow linearly with utterance length as the encoder must process the entire prefix before any decoder token can be emitted. To better meet the needs of on-device, streaming ASR use cases, we introduce Moonshine v2, an ergodic streaming-encoder ASR model that employs sliding-window self-attention to achieve bounded, low-latency inference while preserving strong local context. Our models achieve state-of-the-art word error rates across standard benchmarks, attaining accuracy on par with models 6x their size while running significantly faster. These results demonstrate that carefully designed local attention is competitive with the accuracy of full attention at a fraction of the size and latency cost, opening new possibilities for interactive speech interfaces on edge devices.
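The contrast between full attention and sliding-window self-attention can be sketched with a boolean attention mask. This is a generic illustration, not Moonshine v2's configuration: the abstract does not state window sizes, so `left`, `right`, and the sequence length below are hypothetical values, and the helper names are assumptions.

```python
def sliding_window_mask(seq_len, left, right=0):
    """mask[i][j] is True when frame i may attend to frame j.

    Each frame sees at most `left` past frames, itself, and `right` future frames.
    """
    return [
        [i - left <= j <= i + right for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask):
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


# Hypothetical sizes: 8 frames, each attending to itself and the 2 previous frames.
seq_len = 8
local = sliding_window_mask(seq_len, left=2)
full = sliding_window_mask(seq_len, left=seq_len, right=seq_len)
print(attended_pairs(local), attended_pairs(full))
```

With a fixed window the number of attended pairs grows linearly with the number of frames rather than quadratically, and a frame never waits on the rest of the utterance beyond its `right` lookahead, which is what bounds the latency.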
GPT-4o Lacks Core Features of Theory of Mind
John Muchovej, Amanda Royka, Shane Lee, Julian Jara-Ettinger
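Personalized recommendation rationale:

The title indicates that the study evaluates a deficiency of LLMs in a specific cognitive ability (theory of mind), which is purely LLM capability evaluation. Although it involves the GPT-4o model, the paper clearly concentrates on capability evaluation and failure analysis rather than on LLM technical advances, architecture improvements, or application potential in recommendation, search, or advertising, so it is unrelated to all four of the user's focus directions.

2026-02-12 16:33:58 | arXiv:2602.12150v1 |
cs.AI, cs.CL, cs.LG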
Do Large Language Models (LLMs) possess a Theory of Mind (ToM)? Research into this question has focused on evaluating LLMs against benchmarks and found success across a range of social tasks. However, these evaluations do not test for the actual representations posited by ToM: namely, a causal model of mental states and behavior. Here, we use a cognitively-grounded definition of ToM to develop and test a new evaluation framework. Specifically, our approach probes whether LLMs have a coherent, domain-general, and consistent model of how mental states cause behavior -- regardless of whether that model matches a human-like ToM. We find that even though LLMs succeed in approximating human judgments in a simple ToM paradigm, they fail at a logically equivalent task and exhibit low consistency between their action predictions and corresponding mental state inferences. As such, these findings suggest that the social proficiency exhibited by LLMs is not the result of a domain-general or consistent ToM.
CitiLink-Minutes: A Multilayer Annotated Dataset of Municipal Meeting Minutes
Ricardo Campos, Ana Filipa Pacheco, Ana Luísa Fernandes, Inês Cantante, Rute Reb...
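Personalized recommendation rationale:

The title describes the construction of a dataset for a specific domain (municipal meetings), which is a domain-specific data collection and annotation effort. It has no direct connection to your focus directions: core advances in recommender systems, search, and advertising, applications of LLM techniques, Transformer architecture improvements, or unified modeling of heterogeneous data. The dataset mainly serves document analysis or domain-specific NLP tasks rather than the technical development of recommendation, search, or advertising systems.

2026-02-12 16:22:55 | arXiv:2602.12137v1 |
cs.CL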
City councils play a crucial role in local governance, directly influencing citizens' daily lives through decisions made during municipal meetings. These deliberations are formally documented in meeting minutes, which serve as official records of discussions, decisions, and voting outcomes. Despite their importance, municipal meeting records have received little attention in Information Retrieval (IR) and Natural Language Processing (NLP), largely due to the lack of annotated datasets, which ultimately limits the development of computational models. To address this gap, we introduce CitiLink-Minutes, a multilayer dataset of 120 European Portuguese municipal meeting minutes from six municipalities. Unlike prior annotated datasets of parliamentary or video records, CitiLink-Minutes provides multilayer annotations and structured linkage of official written minutes. The dataset contains over one million tokens, with all personal identifiers de-identified. Each minute was manually annotated by two trained annotators and curated by an experienced linguist across three complementary dimensions: (1) metadata, (2) subjects of discussion, and (3) voting outcomes, totaling over 38,000 individual annotations. Released under FAIR principles and accompanied by baseline results on metadata extraction, topic classification, and vote labeling, CitiLink-Minutes demonstrates its potential for downstream NLP and IR tasks, while promoting transparent access to municipal decisions.
WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models
Yangzhuo Li, Shengpeng Ji, Yifu Chen, Tianle Liang, Haorong Ying, Yule Wang, Jun...
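Personalized recommendation rationale:

This paper focuses on benchmarking spoken dialogue models, which is purely speech research and unrelated to the core techniques of recommender systems, search, or advertising. Although it involves dialogue models, it mainly addresses speech-specific measures (such as paralinguistic features) and shows no potential application link to RecSys/Search/Ads.

2026-02-12 16:22:11 | arXiv:2602.12135v1 |
cs.CL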
With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently demands benchmarks that transcend simple interactions to address real-world complexity. However, current evaluations predominantly adhere to text-generation standards, overlooking the unique audio-centric characteristics of paralinguistics and colloquialisms, alongside the cognitive depth required by modern agents. To bridge this gap, we introduce WavBench, a comprehensive benchmark designed to evaluate realistic conversational abilities where prior works fall short. Uniquely, WavBench establishes a tripartite framework: 1) Pro subset, designed to rigorously challenge reasoning-enhanced models with significantly increased difficulty; 2) Basic subset, defining a novel standard for spoken colloquialism that prioritizes "listenability" through natural vocabulary, linguistic fluency, and interactive rapport, rather than rigid written accuracy; and 3) Acoustic subset, covering explicit understanding, generation, and implicit dialogue to rigorously evaluate comprehensive paralinguistic capabilities within authentic real-world scenarios. Through evaluating five state-of-the-art models, WavBench offers critical insights into the intersection of complex problem-solving, colloquial delivery, and paralinguistic fidelity, guiding the evolution of robust spoken dialogue models. The benchmark dataset and evaluation toolkit are available at https://naruto-2024.github.io/wavbench.github.io/.
Neutral Prompts, Non-Neutral People: Quantifying Gender and Skin-Tone Bias in Gemini Flash 2.5 Image and GPT Image 1.5
Roberto Balestri
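Personalized recommendation rationale:

This paper studies bias evaluation in multimodal models, which falls under non-technical topics such as fairness and ethics and is unrelated to the core technical advances in recommender systems, search, and advertising that are the current focus. Although it involves image generation, its focus is bias quantification rather than the technical application of vision-language models to heterogeneous data processing.

2026-02-12 16:21:03 | arXiv:2602.12133v1 |
cs.AI, cs.CL, cs.CY, cs.HC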
This study quantifies gender and skin-tone bias in two widely deployed commercial image generators - Gemini Flash 2.5 Image (NanoBanana) and GPT Image 1.5 - to test the assumption that neutral prompts yield demographically neutral outputs. We generated 3,200 photorealistic images using four semantically neutral prompts. The analysis employed a rigorous pipeline combining hybrid color normalization, facial landmark masking, and perceptually uniform skin tone quantification using the Monk (MST), PERLA, and Fitzpatrick scales. Neutral prompts produced highly polarized defaults. Both models exhibited a strong "default white" bias (>96% of outputs). However, they diverged sharply on gender: Gemini favored female-presenting subjects, while GPT favored male-presenting subjects with lighter skin tones. This research provides a large-scale, comparative audit of state-of-the-art models using an illumination-aware colorimetric methodology, distinguishing aesthetic rendering from underlying pigmentation in synthetic imagery. The study demonstrates that neutral prompts function as diagnostic probes rather than neutral instructions. It offers a robust framework for auditing algorithmic visual culture and challenges the sociolinguistic assumption that unmarked language results in inclusive representation.
A Rule-based Computational Model for Gaidhlig Morphology
Peter J Barclay
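Personalized recommendation rationale:

This paper focuses on a computational model of morphology for a specific language (Gaelic), which belongs to linguistics or computational linguistics. It is entirely unrelated to the core areas I follow, such as recommender systems, search, advertising, LLM techniques, or Transformer architectures, and shows no potential application value in RecSys/Search/Ads.

2026-02-12 16:20:17 | arXiv:2602.12132v1 |
cs.CL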
Language models and software tools are essential to support the continuing vitality of lesser-used languages; however, currently popular neural models require considerable data for training, which normally is not available for such low-resource languages. This paper describes work-in-progress to construct a rule-based model of Gaidhlig morphology using data from Wiktionary, arguing that rule-based systems effectively leverage limited sample data, support greater interpretability, and provide insights useful in the design of teaching materials. The use of SQL for querying the occurrence of different lexical patterns is investigated, and a declarative rule-base is presented that allows Python utilities to derive inflected forms of Gaidhlig words. This functionality could be used to support educational tools that teach or explain language patterns, for example, or to support higher level tools such as rule-based dependency parsers. This approach adds value to the data already present in Wiktionary by adapting it to new use-cases.
Capability-Oriented Training Induced Alignment Risk
Yujun Zhou, Yue Huang, Han Bao, Kehan Guo, Zhenwen Liang, Pin-Yu Chen, Tian Gao,...
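Personalized recommendation rationale:

The title concerns alignment risk in LLM training, which is purely LLM safety and alignment research and unrelated to technical applications in recommender systems, search, or advertising. Nothing in the title suggests a potential connection to RecSys/Search/Ads in terms of efficiency, architecture improvements, or concrete applications.

2026-02-12 16:13:14 | arXiv:2602.12124v1 |
cs.LG, cs.CL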
While most AI alignment research focuses on preventing models from generating explicitly harmful content, a more subtle risk is emerging: capability-oriented training induced exploitation. We investigate whether language models, when trained with reinforcement learning (RL) in environments with implicit loopholes, will spontaneously learn to exploit these flaws to maximize their reward, even without any malicious intent in their training. To test this, we design a suite of four diverse "vulnerability games", each presenting a unique, exploitable flaw related to context-conditional compliance, proxy metrics, reward tampering, and self-evaluation. Our experiments show that models consistently learn to exploit these vulnerabilities, discovering opportunistic strategies that significantly increase their reward at the expense of task correctness or safety. More critically, we find that these exploitative strategies are not narrow "tricks" but generalizable skills; they can be transferred to new tasks and even "distilled" from a capable teacher model to other student models through data alone. Our findings reveal that capability-oriented training induced risks pose a fundamental challenge to current alignment approaches, suggesting that future AI safety work must extend beyond content moderation to rigorously auditing and securing the training environments and reward mechanisms themselves. Code is available at https://github.com/YujunZhou/Capability_Oriented_Alignment_Risk.
DeepSight: An All-in-One LM Safety Toolkit
Bo Zhang, Jiaxuan Guo, Lijun Li, Dongrui Liu, Sujin Chen, Guanxu Chen, Zhijie Zh...
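Personalized recommendation rationale:

The title focuses explicitly on a language model safety toolkit, which falls under non-technical topics such as safety and privacy. It has no direct connection to your current focus areas: core domain advances, LLM enabling technologies, Transformer architecture advances, direct LLM applications, or unified modeling of heterogeneous data.

2026-02-12 15:43:14 | arXiv:2602.12092v1 | cs.CL cs.AI cs.CR cs.CV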
As the development of Large Models (LMs) progresses rapidly, their safety is also a priority. In current safety workflows for Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), evaluation, diagnosis, and alignment are often handled by separate tools. Specifically, safety evaluation can only locate external behavioral risks but cannot figure out internal root causes. Meanwhile, safety diagnosis often drifts from concrete risk scenarios and remains at the explainable level. In this way, safety alignment lacks dedicated explanations of changes in internal mechanisms, potentially degrading general capabilities. To systematically address these issues, we propose an open-source project, namely DeepSight, to practice a new safety evaluation-diagnosis integrated paradigm. DeepSight is a low-cost, reproducible, efficient, and highly scalable large-scale model safety evaluation project consisting of an evaluation toolkit, DeepSafe, and a diagnosis toolkit, DeepScan. By unifying task and data protocols, we build a connection between the two stages and transform safety evaluation from black-box to white-box insight. Besides, DeepSight is the first open-source toolkit that supports frontier AI risk evaluation and joint safety evaluation and diagnosis.
Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models
Xin Xu, Clive Bai, Kai Yang, Tianhao Chen, Yangkun Chen, Weijie Liu, Hao Chen, Y...
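Personalized recommendation rationale:

The title focuses explicitly on reinforcement learning (RL) for large language models, which falls into the category "Reinforcement Learning (RL) papers without clear relevance to RecSys/Search/Ads" that is explicitly excluded under "Irrelevant Topics". The title mentions no application scenario, data, or problem related to recommender systems, search, or advertising, so it is unrelated to the current focus.

2026-02-12 15:03:37 | arXiv:2602.12036v1 | cs.CL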
Large-scale verifiable prompts underpin the success of Reinforcement Learning with Verifiable Rewards (RLVR), but they contain many uninformative examples and are costly to expand further. Recent studies focus on better exploiting limited training data by prioritizing hard prompts whose rollout pass rate is 0. However, easy prompts with a pass rate of 1 also become increasingly prevalent as training progresses, thereby reducing the effective data size. To mitigate this, we propose Composition-RL, a simple yet useful approach for better utilizing limited verifiable prompts targeting pass-rate-1 prompts. More specifically, Composition-RL automatically composes multiple problems into a new verifiable question and uses these compositional prompts for RL training. Extensive experiments across model sizes from 4B to 30B show that Composition-RL consistently improves reasoning capability over RL trained on the original dataset. Performance can be further boosted with a curriculum variant of Composition-RL that gradually increases compositional depth over training. Additionally, Composition-RL enables more effective cross-domain RL by composing prompts drawn from different domains. Codes, datasets, and models are available at https://github.com/XinXU-USTC/Composition-RL.
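The pass-rate bookkeeping that the abstract relies on can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the bucket labels, and the boolean rollout format are all assumptions made for this example, and the composition step itself is not shown because the abstract does not specify it.

```python
def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of rollouts for one prompt that the verifier accepted."""
    if not outcomes:
        raise ValueError("need at least one rollout per prompt")
    return sum(outcomes) / len(outcomes)


def bucket_prompts(rollouts: dict[str, list[bool]]) -> dict[str, list[str]]:
    """Group prompt ids by pass rate: always failed, mixed, always solved.

    Bucket names are hypothetical labels chosen for this sketch.
    """
    buckets: dict[str, list[str]] = {"pass_rate_0": [], "mixed": [], "pass_rate_1": []}
    for prompt_id, outcomes in rollouts.items():
        rate = pass_rate(outcomes)
        if rate == 0.0:
            buckets["pass_rate_0"].append(prompt_id)
        elif rate == 1.0:
            buckets["pass_rate_1"].append(prompt_id)
        else:
            buckets["mixed"].append(prompt_id)
    return buckets


# Toy rollout results: True means the verifier accepted the answer.
example_rollouts = {
    "p_hard": [False, False, False, False],
    "p_easy": [True, True, True, True],
    "p_mixed": [True, False, True, False],
}
example_buckets = bucket_prompts(example_rollouts)
```

In this reading, the `pass_rate_1` bucket holds the prompts that Composition-RL targets for composition into harder questions.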
Artificial intelligence is creating a new global linguistic hierarchy
Giulia Occhini, Kumiko Tanaka-Ishii, Anna Barford, Refael Tikochinski, Songbo Hu...
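Personalized recommendation rationale:

The title focuses on the sociolinguistic impact of AI and a global linguistic hierarchy, which belongs to social impact, ethics, or language policy. It involves neither core technical advances in recommender systems, search, or advertising, nor LLM/Transformer architecture improvements, heterogeneous data modeling, or direct applications. The title lies entirely outside all technical focus areas and is a clearly irrelevant topic.

2026-02-12 14:50:44 | arXiv:2602.12018v1 | cs.CY cs.CL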
Artificial intelligence (AI) has the potential to transform healthcare, education, governance and socioeconomic equity, but its benefits remain concentrated in a small number of languages (Bender, 2019; Blasi et al., 2022; Joshi et al., 2020; Ranathunga and de Silva, 2022; Young, 2015). Language AI - the technologies that underpin widely-used conversational systems such as ChatGPT - could provide major benefits if available in people's native languages, yet most of the world's 7,000+ linguistic communities currently lack access and face persistent digital marginalization. Here we present a global longitudinal analysis of social, economic and infrastructural conditions across languages to assess systemic inequalities in language AI. We first analyze the existence of AI resources for 6003 languages. We find that despite efforts of the community to broaden the reach of language technologies (Bapna et al., 2022; Costa-Jussà et al., 2022), the dominance of a handful of languages is exacerbating disparities on an unprecedented scale, with divides widening exponentially rather than narrowing. Further, we contrast the longitudinal diffusion of AI with that of earlier IT technologies, revealing a distinctive hype-driven pattern of spread. To translate our findings into practical insights and guide prioritization efforts, we introduce the Language AI Readiness Index (EQUATE), which maps the state of technological, socio-economic, and infrastructural prerequisites for AI deployment across languages. The index highlights communities where capacity exists but remains underutilized, and provides a framework for accelerating more equitable diffusion of language AI. Our work contributes to setting the baseline for a transition towards more sustainable and equitable language technologies.
Automatic Simplification of Common Vulnerabilities and Exposures Descriptions
Varpu Vehomäki, Kimmo K. Kaski
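Personalized recommendation rationale:

The paper deals with simplifying security vulnerability descriptions, which belongs to cyber security and is unrelated to the core technologies of recommender systems, search, or advertising. Its content focuses on processing security text rather than on ranking, modeling, or LLM applications for recommendation, search, or advertising, so it is not relevant to the current focus.

2026-02-12 14:12:58 | arXiv:2602.11982v1 | cs.CL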
Understanding cyber security is increasingly important for individuals and organizations. However, a lot of information related to cyber security can be difficult to understand for those not familiar with the topic. In this study, we focus on investigating how large language models (LLMs) could be utilized in automatic text simplification (ATS) of Common Vulnerability and Exposure (CVE) descriptions. Automatic text simplification has been studied in several contexts, such as medical, scientific, and news texts, but it has not yet been studied to simplify texts in the rapidly changing and complex domain of cyber security. We created a baseline for cyber security ATS and a test dataset of 40 CVE descriptions, evaluated by two groups of cyber security experts in two survey rounds. We have found that while out-of-the-box LLMs can make the text appear simpler, they struggle with meaning preservation. Code and data are available at https://version.aalto.fi/gitlab/vehomav1/simplification_nmi.
Do Large Language Models Adapt to Language Variation across Socioeconomic Status?
Elisa Bassignana, Mike Zhang, Dirk Hovy, Amanda Cercas Curry
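Personalized recommendation rationale:

The title concerns how LLMs adapt to language variation across socioeconomic status, which belongs to fairness, bias, or ethics evaluation and is unrelated to the user's interest in core RecSys/Search/Ads advances, LLM architecture improvements, or direct applications. Research of this kind typically does not involve practical technical applications or model efficiency gains in recommender systems, search, or advertising.

2026-02-12 13:36:38 | arXiv:2602.11939v1 | cs.CL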
Humans adjust their linguistic style to the audience they are addressing. However, the extent to which LLMs adapt to different social contexts is largely unknown. As these models increasingly mediate human-to-human communication, their failure to adapt to diverse styles can perpetuate stereotypes and marginalize communities whose linguistic norms are less closely mirrored by the models, thereby reinforcing social stratification. We study the extent to which LLMs integrate into social media communication across different socioeconomic status (SES) communities. We collect a novel dataset from Reddit and YouTube, stratified by SES. We prompt four LLMs with incomplete text from that corpus and compare the LLM-generated completions to the originals along 94 sociolinguistic metrics, including syntactic, rhetorical, and lexical features. LLMs modulate their style with respect to SES to only a minor extent, often resulting in approximation or caricature, and tend to emulate the style of upper SES more effectively. Our findings (1) show how LLMs risk amplifying linguistic hierarchies and (2) call into question their validity for agent-based social simulation, survey experiments, and any research relying on language style as a social signal.
Benchmark Illusion: Disagreement among LLMs and Its Scientific Consequences
Eddie Yang, Dashun Wang
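Personalized recommendation rationale:

The title focuses on illusion and disagreement problems in LLM evaluation benchmarks, which is pure NLP evaluation and benchmarking. Although it involves LLM technology, it clearly falls into the excluded category "hallucination, evaluation benchmarks, or other purely NLP-centric topics" and is unrelated to practical applications or technical advances in recommender systems, search, or advertising.

2026-02-12 12:53:39 | arXiv:2602.11898v1 | cs.CL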
Benchmarks underpin how progress in large language models (LLMs) is measured and trusted. Yet our analyses reveal that apparent convergence in benchmark accuracy can conceal deep epistemic divergence. Using two major reasoning benchmarks - MMLU-Pro and GPQA - we show that LLMs achieving comparable accuracy still disagree on 16-66% of items, and 16-38% among top-performing frontier models. These discrepancies suggest distinct error profiles for different LLMs. When such models are used for scientific data annotation and inference, their hidden disagreements propagate into research results: in re-analyses of published studies in education and political science, switching the annotation model can change estimated treatment effects by more than 80%, and in some cases reverses their sign. Together, these findings illustrate a benchmark illusion, where equal accuracy may conceal disagreement, with model choice becoming a hidden yet consequential variable for scientific reproducibility.
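The central observation, that equal accuracy does not imply agreement, can be made concrete with a toy example. The sketch below is illustrative only: the answer strings and function names are invented for this example and are not data from the paper.

```python
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Share of items where the prediction matches the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def disagreement_rate(preds_a: list[str], preds_b: list[str]) -> float:
    """Share of items where two models give different answers."""
    if len(preds_a) != len(preds_b):
        raise ValueError("both prediction lists must have the same length")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


# Ten multiple-choice items; each toy model misses two *different* items.
gold = list("ABCDABCDAB")
model_x = list("BACDABCDAB")  # wrong on items 0 and 1
model_y = list("ABCDABCDBA")  # wrong on items 8 and 9

acc_x = accuracy(model_x, gold)
acc_y = accuracy(model_y, gold)
disagreement = disagreement_rate(model_x, model_y)
```

Both toy models score the same accuracy, yet they disagree on every item that either one gets wrong, which is why swapping the annotation model can shift downstream estimates.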
LLM-based Triplet Extraction from Financial Reports
Dante Wesslund, Ville Stenström, Pontus Linde, Alexander Holmberg
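Personalized recommendation rationale:

The title focuses explicitly on a domain-specific financial application (triplet extraction from financial reports), which has no direct connection to the core technologies or application scenarios of RecSys/Search/Ads. Under the user's list of irrelevant topics, domain-specific applications in finance, biology, chemistry, physics, and similar fields are excluded from the scope of interest.

2026-02-12 12:36:10 | arXiv:2602.11886v1 | cs.CL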
Corporate financial reports are a valuable source of structured knowledge for Knowledge Graph construction, but the lack of annotated ground truth in this domain makes evaluation difficult. We present a semi-automated pipeline for Subject-Predicate-Object triplet extraction that uses ontology-driven proxy metrics, specifically Ontology Conformance and Faithfulness, instead of ground-truth-based evaluation. We compare a static, manually engineered ontology against a fully automated, document-specific ontology induction approach across different LLMs and two corporate annual reports. The automatically induced ontology achieves 100% schema conformance in all configurations, eliminating the ontology drift observed with the manual approach. We also propose a hybrid verification strategy that combines regex matching with an LLM-as-a-judge check, reducing apparent subject hallucination rates from 65.2% to 1.6% by filtering false positives caused by coreference resolution. Finally, we identify a systematic asymmetry between subject and object hallucinations, which we attribute to passive constructions and omitted agents in financial prose.
A$^{2}$V-SLP: Alignment-Aware Variational Modeling for Disentangled Sign Language Production
Sümeyye Meryem Taşyürek, Enis Mücahid İskender, Hacer Yalim Keles
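Personalized recommendation rationale:

The paper concentrates on the specific domain of sign language production, a pure vision/motion generation task with no direct connection to the core technical areas of recommender systems, search, or advertising. Although it involves variational modeling, it shows no clear potential for application to unified modeling of heterogeneous data or to recommendation/search scenarios.

2026-02-12 12:07:32 | arXiv:2602.11861v1 | cs.LG cs.CL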
Building upon recent structural disentanglement frameworks for sign language production, we propose A$^{2}$V-SLP, an alignment-aware variational framework that learns articulator-wise disentangled latent distributions rather than deterministic embeddings. A disentangled Variational Autoencoder (VAE) encodes ground-truth sign pose sequences and extracts articulator-specific mean and variance vectors, which are used as distributional supervision for training a non-autoregressive Transformer. Given text embeddings, the Transformer predicts both latent means and log-variances, while the VAE decoder reconstructs the final sign pose sequences through stochastic sampling at the decoding stage. This formulation maintains articulator-level representations by avoiding deterministic latent collapse through distributional latent modeling. In addition, we integrate a gloss attention mechanism to strengthen alignment between linguistic input and articulated motion. Experimental results show consistent gains over deterministic latent regression, achieving state-of-the-art back-translation performance and improved motion realism in a fully gloss-free setting.
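The abstract says the Transformer predicts latent means and log-variances and that decoding uses stochastic sampling, but it does not say how the sample is drawn. A minimal sketch, assuming the standard VAE reparameterization z = mean + exp(0.5 * log_var) * eps with eps drawn from a unit Gaussian, looks like this. The function name and the plain-list representation are assumptions for illustration.

```python
import math
import random


def sample_latent(mean: list[float], log_var: list[float], rng: random.Random) -> list[float]:
    """Draw one latent vector with the reparameterization trick.

    Assumption: a diagonal Gaussian, with one mean and one log-variance
    per latent dimension.
    """
    if len(mean) != len(log_var):
        raise ValueError("mean and log_var must have the same length")
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)  # std = exp(log_var / 2)
        for m, lv in zip(mean, log_var)
    ]


rng = random.Random(0)
# A very negative log-variance gives a near-deterministic sample at the mean.
z = sample_latent([1.0, -2.0], [-200.0, -200.0], rng)
```

Predicting log-variance rather than variance keeps the standard deviation positive without any explicit constraint on the network output.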
More Haste, Less Speed: Weaker Single-Layer Watermark Improves Distortion-Free Watermark Ensembles
Ruibo Chen, Yihan Wu, Xuehao Cui, Jingqi Zhang, Heng Huang
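Personalized recommendation rationale:

The title explicitly concerns watermarking, which belongs to security and privacy and is explicitly listed as an irrelevant topic. Although watermarking may relate to content generation or copyright protection, the title shows no connection to recommender systems, search, advertising, or related enabling technologies such as Transformer architectures or LLM applications.

2026-02-12 10:18:16 | arXiv:2602.11793v1 | cs.CR cs.CL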
Watermarking has emerged as a crucial technique for detecting and attributing content generated by large language models. While recent advancements have utilized watermark ensembles to enhance robustness, prevailing methods typically prioritize maximizing the strength of the watermark at every individual layer. In this work, we identify a critical limitation in this "stronger-is-better" approach: strong watermarks significantly reduce the entropy of the token distribution, which paradoxically weakens the effectiveness of watermarking in subsequent layers. We theoretically and empirically show that detectability is bounded by entropy and that watermark ensembles induce a monotonic decrease in both entropy and the expected green-list ratio across layers. To address this inherent trade-off, we propose a general framework that utilizes weaker single-layer watermarks to preserve the entropy required for effective multi-layer ensembling. Empirical evaluations demonstrate that this counter-intuitive strategy mitigates signal decay and consistently outperforms strong baselines in both detectability and robustness.
Detecting RLVR Training Data via Structural Convergence of Reasoning
Hongbo Zhang, Yue Yang, Jianhao Yan, Guangsheng Bao, Yue Zhang, Yue Zhang
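Personalized recommendation rationale:

The title concerns detecting training data used in reinforcement learning with verifiable rewards (RLVR), which belongs to reinforcement learning rather than the core technologies of recommender systems, search, or advertising. Although it addresses training data detection, it shows no direct link to or potential application in recommendation/search/advertising systems, so it does not match the current focus.

2026-02-12 10:17:32 | arXiv:2602.11792v1 | cs.AI cs.CL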
Reinforcement learning with verifiable rewards (RLVR) is central to training modern reasoning models, but the undisclosed training data raises concerns about benchmark contamination. Unlike pretraining methods, which optimize models using token-level probabilities, RLVR fine-tunes models based on reward feedback from self-generated reasoning trajectories, making conventional likelihood-based detection methods less effective. We show that RLVR induces a distinctive behavioral signature: prompts encountered during RLVR training result in more rigid and similar generations, while unseen prompts retain greater diversity. We introduce Min-$k$NN Distance, a simple black-box detector that quantifies this collapse by sampling multiple completions for a given prompt and computing the average of the $k$ smallest nearest-neighbor edit distances. Min-$k$NN Distance requires no access to the reference model or token probabilities. Experiments across multiple RLVR-trained reasoning models show that Min-$k$NN Distance reliably distinguishes RL-seen examples from unseen ones and outperforms existing membership inference and RL contamination detection baselines.
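One way to read the Min-$k$NN Distance description is sketched below: for each sampled completion, take its edit distance to the nearest other completion, then average the $k$ smallest of those values. This is an interpretation of the abstract, not the authors' implementation; character-level Levenshtein distance, the lack of length normalization, and the function names are all assumptions.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def min_knn_distance(completions: list[str], k: int) -> float:
    """Average of the k smallest nearest-neighbor edit distances."""
    n = len(completions)
    if n < 2 or k < 1:
        raise ValueError("need at least two completions and k >= 1")
    nearest = [float("inf")] * n
    for i, j in combinations(range(n), 2):
        d = edit_distance(completions[i], completions[j])
        nearest[i] = min(nearest[i], d)
        nearest[j] = min(nearest[j], d)
    smallest = sorted(nearest)[:k]
    return sum(smallest) / len(smallest)


# Toy samples: near-duplicate completions versus diverse ones.
rigid = ["abc", "abc", "abd"]
diverse = ["abc", "xyz", "hello"]
score_rigid = min_knn_distance(rigid, k=2)
score_diverse = min_knn_distance(diverse, k=2)
```

A lower score means the sampled completions have collapsed toward each other, which the abstract associates with prompts seen during RLVR training. Only sampled text is needed, so the detector stays black-box.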
Thinking with Drafting: Optical Decompression via Logical Reconstruction
Jingxuan Wei, Honghao He, Caijun Jia, Siyuan Li, Zheng Sun, Yuhang Xu, Yuanyuan ...
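Personalized recommendation rationale:

The title concerns optical decompression and logical reconstruction, which belongs to computer vision or signal processing and is unrelated to the core technologies of recommender systems, search, or advertising. The word "optical" suggests image or visual data processing, while "decompression" and "logical reconstruction" suggest specific signal processing or compression techniques; none of these has a clear connection to LLMs, Transformer architectures, or applications in recommender systems, search, or advertising.

2026-02-12 08:54:02 | arXiv:2602.11731v1 | cs.CL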
Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression, the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
PatientHub: A Unified Framework for Patient Simulation
Sahand Sabour, TszYam NG, Minlie Huang
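Personalized recommendation rationale:

The title points explicitly to patient simulation in healthcare, a medical domain-specific application that is explicitly excluded in your list of irrelevant topics. The title contains no term or hint related to recommender systems, search, advertising, LLM technology, or Transformer architectures, so it is entirely unrelated to your focus.

2026-02-12 08:06:37 | arXiv:2602.11684v1 | cs.CL cs.AI cs.HC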
As Large Language Models increasingly power role-playing applications, simulating patients has become a valuable tool for training counselors and scaling therapeutic assessment. However, prior work is fragmented: existing approaches rely on incompatible, non-standardized data formats, prompts, and evaluation metrics, hindering reproducibility and fair comparison. In this paper, we introduce PatientHub, a unified and modular framework that standardizes the definition, composition, and deployment of simulated patients. To demonstrate PatientHub's utility, we implement several representative patient simulation methods as case studies, showcasing how our framework supports standardized cross-method evaluation and the seamless integration of custom evaluation metrics. We further demonstrate PatientHub's extensibility by prototyping two new simulator variants, highlighting how PatientHub accelerates method development by eliminating infrastructure overhead. By consolidating existing work into a single reproducible pipeline, PatientHub lowers the barrier to developing new simulation methods and facilitates cross-method and cross-model benchmarking. Our framework provides a practical foundation for future datasets, methods, and benchmarks in patient-centered dialogue, and the code is publicly available via https://github.com/Sahandfer/PatientHub.
PhyNiKCE: A Neurosymbolic Agentic Framework for Autonomous Computational Fluid Dynamics
E Fan, Lisong Shi, Zhengtong Li, Chih-yung Wen
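Personalized recommendation rationale:

The title focuses explicitly on Computational Fluid Dynamics (CFD), a physics/engineering application that falls under "Physics or other domain-specific applications" in "Irrelevant Topics". Although the framework name includes "Neurosymbolic" and "Agentic" elements, the core application domain is entirely unrelated to recommender systems, search, or advertising, with no potential point of connection.

2026-02-12 07:37:56 | arXiv:2602.11666v1 | cs.AI cs.CL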
The deployment of autonomous agents for Computational Fluid Dynamics (CFD) is critically limited by the probabilistic nature of Large Language Models (LLMs), which struggle to enforce the strict conservation laws and numerical stability required for physics-based simulations. Reliance on purely semantic Retrieval Augmented Generation (RAG) often leads to "context poisoning," where agents generate linguistically plausible but physically invalid configurations due to a fundamental Semantic-Physical Disconnect. To bridge this gap, this work introduces PhyNiKCE (Physical and Numerical Knowledgeable Context Engineering), a neurosymbolic agentic framework for trustworthy engineering. Unlike standard black-box agents, PhyNiKCE decouples neural planning from symbolic validation. It employs a Symbolic Knowledge Engine that treats simulation setup as a Constraint Satisfaction Problem, rigidly enforcing physical constraints via a Deterministic RAG Engine with specialized retrieval strategies for solvers, turbulence models, and boundary conditions. Validated through rigorous OpenFOAM experiments on practical, non-tutorial CFD tasks using Gemini-2.5-Pro/Flash, PhyNiKCE demonstrates a 96% relative improvement over state-of-the-art baselines. Furthermore, by replacing trial-and-error with knowledge-driven initialization, the framework reduced autonomous self-correction loops by 59% while simultaneously lowering LLM token consumption by 17%. These results demonstrate that decoupling neural generation from symbolic constraint enforcement significantly enhances robustness and efficiency. While validated on CFD, this architecture offers a scalable, auditable paradigm for Trustworthy Artificial Intelligence in broader industrial automation.
Which Feedback Works for Whom? Differential Effects of LLM-Generated Feedback Elements Across Learner Profiles
Momoka Furuhashi, Kouta Nakayama, Noboru Kawai, Takashi Kodama, Saku Sugawara, K...
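Personalized recommendation rationale:

The title focuses on LLM-generated feedback in education and its differential effects across learners, which is pure educational technology research. Although it involves LLM-generated content, the paper's core concern is evaluating personalized feedback in educational settings, with no apparent link to technical progress, architectural innovation, or direct applications in recommender systems, search, or advertising.

2026-02-12 07:02:33 | arXiv:2602.11650v1 | cs.CL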
Large language models (LLMs) show promise for automatically generating feedback in educational settings. However, it remains unclear how specific feedback elements, such as tone and information coverage, contribute to learning outcomes and learner acceptance, particularly across learners with different personality traits. In this study, we define six feedback elements and generate feedback for multiple-choice biology questions using GPT-5. We conduct a learning experiment with 321 first-year high school students and evaluate feedback effectiveness using two learning outcome measures and subjective evaluations across six criteria. We further analyze differences in how feedback acceptance varies across learners based on Big Five personality traits. Our results show that effective feedback elements share common patterns supporting learning outcomes, while learners' subjective preferences differ across personality-based clusters. These findings highlight the importance of selecting and adapting feedback elements according to learners' personality traits when we design LLM-generated feedback, and provide practical implications for personalized feedback design in education.
PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering
Xiangfeng Wang, Hangyu Guo, Yanlin Lai, Mitt Huang, Liang Zhao, Chengyuan Yao, Y...
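Personalized recommendation rationale:

The title focuses on a verifiable reasoning benchmark for mathematics and engineering, which is domain-specific evaluation methodology research. Although it involves reasoning processes, it mentions none of the core focus areas (LLMs, Transformer architectures, recommender systems, search, or advertising) and shows no potential value for heterogeneous data processing or unified multimodal modeling.

2026-02-12 04:45:01 | arXiv:2602.11570v1 | cs.CL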
While model-based verifiers are essential for scaling Reinforcement Learning with Verifiable Rewards (RLVR), current outcome-centric verification paradigms primarily focus on the consistency between the final result and the ground truth, often neglecting potential errors in the derivation process. This leads to assigning positive rewards to correct answers produced from incorrect derivations. To bridge this gap, we introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification in Mathematics and Engineering. Curated from a comprehensive collection of college-level STEM problems, PRIME comprises 2,530 high-difficulty samples through a consistency-based filtering pipeline. Through extensive evaluation, we find that current verifiers frequently fail to detect derivation flaws. Furthermore, we propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME. This approach substantially outperforms the outcome-only verification baseline, achieving absolute performance gains of 8.29%, 9.12%, and 7.31% on AIME24, AIME25, and Beyond-AIME, respectively, for the Qwen3-14B-Base model. Finally, we demonstrate a strong linear correlation ($R^2 > 0.92$) between verifier accuracy on PRIME and RLVR training effectiveness, validating PRIME as a reliable predictor for verifier selection.
Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs
Dong Yan, Jian Liang, Ran He, Tieniu Tan
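Personalized recommendation rationale:

The title explicitly concerns privacy and security defense ("attribute inference attack", "proactive defense"), which falls under the explicitly listed irrelevant topics. Although it mentions LLMs, the core focus is security rather than technical application, and it is unrelated to the current focus on LLM advances, architecture improvements, or direct applications in recommender systems, search, or advertising.

2026-02-12 03:37:50 | arXiv:2602.11528v1 | cs.CR cs.AI cs.CL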
Recent studies have shown that large language models (LLMs) can infer private user attributes (e.g., age, location, gender) from user-generated text shared online, enabling rapid and large-scale privacy breaches. Existing anonymization-based defenses are coarse-grained, lacking word-level precision in anonymizing privacy-leaking elements. Moreover, they are inherently limited as altering user text to hide sensitive cues still allows attribute inference to occur through models' reasoning capabilities. To address these limitations, we propose a unified defense framework that combines fine-grained anonymization (TRACE) with inference-preventing optimization (RPS). TRACE leverages attention mechanisms and inference chain generation to identify and anonymize privacy-leaking textual elements, while RPS employs a lightweight two-stage optimization strategy to induce model rejection behaviors, thereby preventing attribute inference. Evaluations across diverse LLMs show that TRACE-RPS reduces attribute inference accuracy from around 50% to below 5% on open-source models. In addition, our approach offers strong cross-model generalization, prompt-variation robustness, and utility-privacy tradeoffs. Our code is available at https://github.com/Jasper-Yan/TRACE-RPS.
Adaptive Milestone Reward for GUI Agents
Congmin Zheng, Xiaoyun Mo, Xinbei Ma, Qiqiang Lin, Yin Zhao, Jiachen Zhu, Xingyu...
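Personalized recommendation rationale:

The title concerns GUI agents and reward mechanisms, mainly an application of reinforcement learning to interface automation. Although it mentions a reward mechanism, it shows no clear link to recommender systems, search, or advertising. The title mentions neither LLMs, Transformer architectures, multimodal modeling, nor any technique related to ranking or personalization, so it is largely unrelated to your focus.

2026-02-12 03:31:40 | arXiv:2602.11524v1 | cs.LG cs.AI cs.CL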
Reinforcement Learning (RL) has emerged as a mainstream paradigm for training Mobile GUI Agents, yet it struggles with the temporal credit assignment problem inherent in long-horizon tasks. A primary challenge lies in the trade-off between reward fidelity and density: outcome reward offers high fidelity but suffers from signal sparsity, while process reward provides dense supervision but remains prone to bias and reward hacking. To resolve this conflict, we propose the Adaptive Milestone Reward (ADMIRE) mechanism. ADMIRE constructs a verifiable, adaptive reward system by anchoring trajectories to milestones, which are dynamically distilled from successful explorations. Crucially, ADMIRE integrates an asymmetric credit assignment strategy that denoises successful trajectories and scaffolds failed trajectories. Extensive experiments demonstrate that ADMIRE consistently yields over 10% absolute improvement in success rate across different base models on AndroidWorld. Moreover, the method exhibits robust generalizability, achieving strong performance across diverse RL algorithms and heterogeneous environments such as web navigation and embodied tasks.
Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models
Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
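Personalized recommendation rationale:

The title focuses explicitly on detecting LLM security vulnerabilities (jailbreak attacks), which belongs to security/privacy and is explicitly excluded in the list of irrelevant topics. Although it involves analysis of internal LLM representations, its core is security detection rather than application techniques for recommendation, search, or advertising.

2026-02-12 02:43:17 | arXiv:2602.11495v1 | cs.CR cs.CL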
Jailbreaking large language models (LLMs) has emerged as a critical security challenge with the widespread deployment of conversational AI systems. Adversarial users exploit these models through carefully crafted prompts to elicit restricted or unsafe outputs, a phenomenon commonly referred to as Jailbreaking. Despite numerous proposed defense mechanisms, attackers continue to develop adaptive prompting strategies, and existing models remain vulnerable. This motivates approaches that examine the internal behavior of LLMs rather than relying solely on prompt-level defenses. In this work, we study jailbreaking from both security and interpretability perspectives by analyzing how internal representations differ between jailbreak and benign prompts. We conduct a systematic layer-wise analysis across multiple open-source models, including GPT-J, LLaMA, Mistral, and the state-space model Mamba, and identify consistent latent-space patterns associated with harmful inputs. We then propose a tensor-based latent representation framework that captures structure in hidden activations and enables lightweight jailbreak detection without model fine-tuning or auxiliary LLM-based detectors. We further demonstrate that the latent signals can be used to actively disrupt jailbreak execution at inference time. On an abliterated LLaMA-3.1-8B model, selectively bypassing high-susceptibility layers blocks 78% of jailbreak attempts while preserving benign behavior on 94% of benign prompts. This intervention operates entirely at inference time and introduces minimal overhead, providing a scalable foundation for achieving stronger coverage by incorporating additional attack distributions or more refined susceptibility thresholds. Our results provide evidence that jailbreak behavior is rooted in identifiable internal structures and suggest a complementary, architecture-agnostic direction for improving LLM security.
ADRD-Bench: A Preliminary LLM Benchmark for Alzheimer's Disease and Related Dementias
Guangxin Zhao, Jiahao Zheng, Malaz Boustani, Jarek Nabrzyski, Meng Jiang, Yiyu S...
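Personalized recommendation rationale:

The title explicitly concerns an LLM benchmark in the medical domain (Alzheimer's disease and related dementias), which is a clearly irrelevant topic. Although it involves LLM technology, its application scenario lies entirely outside recommender systems, search, or advertising, and as a medical domain-specific application it is unrelated to all current focus areas.

2026-02-12 00:38:21 | arXiv:2602.11460v1 | cs.CL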
Large language models (LLMs) have shown great potential for healthcare applications. However, existing evaluation benchmarks provide minimal coverage of Alzheimer's Disease and Related Dementias (ADRD). To address this gap, we introduce ADRD-Bench, the first ADRD-specific benchmark dataset designed for rigorous evaluation of LLMs. ADRD-Bench has two components: 1) ADRD Unified QA, a synthesis of 1,352 questions consolidated from seven established medical benchmarks, providing a unified assessment of clinical knowledge; and 2) ADRD Caregiving QA, a novel set of 149 questions derived from the Aging Brain Care (ABC) program, a widely used, evidence-based brain health management program. Guided by a program with national expertise in comprehensive ADRD care, this new set was designed to mitigate the lack of practical caregiving context in existing benchmarks. We evaluated 33 state-of-the-art LLMs on the proposed ADRD-Bench. Results showed that the accuracy of open-weight general models ranged from 0.63 to 0.93 (mean: 0.78; std: 0.09). The accuracy of open-weight medical models ranged from 0.48 to 0.93 (mean: 0.82; std: 0.13). The accuracy of closed-source general models ranged from 0.83 to 0.91 (mean: 0.89; std: 0.03). While top-tier models achieved high accuracies (>0.9), case studies revealed that inconsistent reasoning quality and stability limit their reliability, highlighting a critical need for domain-specific improvement to enhance LLMs' knowledge and reasoning grounded in daily caregiving data. The entire dataset is available at https://github.com/IIRL-ND/ADRD-Bench.
Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching
Huai-Hsun Cheng, Siang-Ling Zhang, Yu-Lun Liu
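Personalized recommendation rationale:

The title concerns vector sketching and semantic illusions, which belongs to computer vision or graphics and has no direct connection to the core technical focus of recommender systems, search, or advertising. The title mentions no keyword related to LLMs, Transformer architectures, recommender systems, search, or advertising, so its potential application value is very low.

2026-02-12 18:59:54 | arXiv:2602.12280v1 | cs.CV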
Visual illusions traditionally rely on spatial manipulations such as multi-view consistency. In this work, we introduce Progressive Semantic Illusions, a novel vector sketching task where a single sketch undergoes a dramatic semantic transformation through the sequential addition of strokes. We present Stroke of Surprise, a generative framework that optimizes vector strokes to satisfy distinct semantic interpretations at different drawing stages. The core challenge lies in the "dual-constraint": initial prefix strokes must form a coherent object (e.g., a duck) while simultaneously serving as the structural foundation for a second concept (e.g., a sheep) upon adding delta strokes. To address this, we propose a sequence-aware joint optimization framework driven by a dual-branch Score Distillation Sampling (SDS) mechanism. Unlike sequential approaches that freeze the initial state, our method dynamically adjusts prefix strokes to discover a "common structural subspace" valid for both targets. Furthermore, we introduce a novel Overlay Loss that enforces spatial complementarity, ensuring structural integration rather than occlusion. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art baselines in recognizability and illusion strength, successfully expanding visual anagrams from the spatial to the temporal dimension. Project page: https://stroke-of-surprise.github.io/
Energy-Aware Spike Budgeting for Continual Learning in Spiking Neural Networks for Neuromorphic Vision
Anika Tabassum Meem, Muntasir Hossain Nadid, Md Zesun Ahmed Mia
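Personalized recommendation rationale:

The paper focuses on neuromorphic vision and spiking neural networks, a field of hardware-specific and bio-inspired computing with no direct connection to the core technology stack of recommender systems, search, or advertising. Continual learning techniques may have general value in their own right, but the paper applies them to the specific domain of neuromorphic vision and mentions no potential application scenarios related to recommendation, search, or advertising.

2026-02-12 18:15:32 | arXiv:2602.12236v1 |
cs.NE cs.AI cs.CV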
Neuromorphic vision systems based on spiking neural networks (SNNs) offer ultra-low-power perception for event-based and frame-based cameras, yet catastrophic forgetting remains a critical barrier to deployment in continually evolving environments. Existing continual learning methods, developed primarily for artificial neural networks, seldom jointly optimize accuracy and energy efficiency, with particularly limited exploration on event-based datasets. We propose an energy-aware spike budgeting framework for continual SNN learning that integrates experience replay, learnable leaky integrate-and-fire neuron parameters, and an adaptive spike scheduler to enforce dataset-specific energy constraints during training. Our approach exhibits modality-dependent behavior: on frame-based datasets (MNIST, CIFAR-10), spike budgeting acts as a sparsity-inducing regularizer, improving accuracy while reducing spike rates by up to 47%; on event-based datasets (DVS-Gesture, N-MNIST, CIFAR-10-DVS), controlled budget relaxation enables accuracy gains up to 17.45 percentage points with minimal computational overhead. Across five benchmarks spanning both modalities, our method demonstrates consistent performance improvements while minimizing dynamic power consumption, advancing the practical viability of continual learning in neuromorphic vision systems.
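To make the notion of a spike budget concrete, here is a minimal sketch of a discrete-time leaky integrate-and-fire (LIF) neuron together with a penalty on its firing rate. The update rule, the soft reset, and the hinge-style penalty are illustrative assumptions; the abstract does not specify the authors' neuron parameterization, scheduler, or loss, and the names `lif_spike_train` and `budget_penalty` are hypothetical.

```python
def lif_spike_train(inputs, beta=0.9, threshold=1.0):
    """Simulate one LIF neuron over a sequence of input currents.

    beta is the membrane leak factor and threshold the firing threshold;
    both are the kind of parameters that could be made learnable.
    """
    v = 0.0
    spikes = []
    for current in inputs:
        v = beta * v + current      # leaky integration
        if v >= threshold:
            spikes.append(1)
            v -= threshold          # soft reset (an assumption)
        else:
            spikes.append(0)
    return spikes


def budget_penalty(spikes, budget_rate):
    """Hinge penalty: zero while the firing rate stays within the budget."""
    rate = sum(spikes) / len(spikes)
    return max(0.0, rate - budget_rate)


spikes = lif_spike_train([0.6] * 5, beta=0.5, threshold=1.0)
print(spikes, budget_penalty(spikes, budget_rate=0.1))
```

In a training loop, a term like this could be added to the task loss so that firing above the budget is discouraged; how the paper actually enforces its budget is not stated in the abstract.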
DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
Dianyi Wang, Ruihang Li, Feng Han, Chaofan Ma, Wei Song, Siyuan Wang, Yibin Wang...
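Personalized recommendation rationale:

The paper focuses on AIGC for image generation and editing, a purely visual content generation task. Although it involves multimodal modeling, its core application (image generation and editing) has no direct connection to the ranking or matching tasks of recommender systems, search, or advertising, and it falls into the explicitly excluded "AIGC/content generation" category.

2026-02-12 17:44:24 | arXiv:2602.12205v1 |
cs.CV cs.AI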
Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), entailing prohibitive training costs and deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B unified model that achieves comprehensive capabilities competitive with or surpassing much larger counterparts. To overcome the limitations of compact models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts. Despite being trained on only ~50M samples, DeepGen 1.0 achieves leading performance across diverse benchmarks, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.
EO-VAE: Towards A Multi-sensor Tokenizer for Earth Observation Data
Nils Lehmann, Yi Wang, Zhitong Xiong, Xiaoxiang Zhu
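Personalized recommendation rationale:

The title indicates a focus on Earth observation data (remote sensing), a clearly domain-specific application (geography and environmental science) unrelated to search, recommendation, or advertising systems. Although it mentions multi-sensor data processing, its application scenario (Earth observation) has no direct connection to core areas of interest such as user behavior modeling, content ranking, or ad delivery.

2026-02-12 17:09:14 | arXiv:2602.12177v1 |
cs.CV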
State-of-the-art generative image and video models rely heavily on tokenizers that compress high-dimensional inputs into more efficient latent representations. While this paradigm has revolutionized RGB generation, Earth observation (EO) data presents unique challenges due to diverse sensor specifications and variable spectral channels. We propose EO-VAE, a multi-sensor variational autoencoder designed to serve as a foundational tokenizer for the EO domain. Unlike prior approaches that train separate tokenizers for each modality, EO-VAE utilizes a single model to encode and reconstruct flexible channel combinations via dynamic hypernetworks. Our experiments on the TerraMesh dataset demonstrate that EO-VAE achieves superior reconstruction fidelity compared to the TerraMind tokenizers, establishing a robust baseline for latent generative modeling in remote sensing.
TexSpot: 3D Texture Enhancement with Spatially-uniform Point Latent Representation
Ziteng Lu, Yushuang Wu, Chongjie Ye, Yuda Qiu, Jing Shao, Xiaoyang Guo, Jiaqing ...
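Personalized recommendation rationale:

The title clearly focuses on 3D vision and texture enhancement, which is purely computer vision. Although it mentions latent representations, this is a specific application to 3D point cloud data with no direct connection to heterogeneous data processing in recommender systems, search, or advertising. The technique shows no potential for application in RecSys, search, or ads.

2026-02-12 16:37:31 | arXiv:2602.12157v1 |
cs.CV cs.GR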
High-quality 3D texture generation remains a fundamental challenge due to the view-inconsistency inherent in current mainstream multi-view diffusion pipelines. Existing representations rely either on UV maps, which suffer from distortion during unwrapping, or on point-based methods, which tightly couple texture fidelity to geometric density and thereby limit high-resolution texture generation. To address these limitations, we introduce TexSpot, a diffusion-based texture enhancement framework. At its core is Texlet, a novel 3D texture representation that merges the geometric expressiveness of point-based 3D textures with the compactness of UV-based representation. Each Texlet latent vector encodes a local texture patch via a 2D encoder and is further aggregated using a 3D encoder to incorporate global shape context. A cascaded 3D-to-2D decoder reconstructs high-quality texture patches, enabling learning of the Texlet space. Leveraging this representation, we train a diffusion transformer conditioned on Texlets to refine and enhance textures produced by multi-view diffusion methods. Extensive experiments demonstrate that TexSpot significantly improves visual fidelity, geometric consistency, and robustness over existing state-of-the-art 3D texture generation and enhancement approaches. Project page: https://anonymous.4open.science/w/TexSpot-page-2D91.
FAIL: Flow Matching Adversarial Imitation Learning for Image Generation
Yeyao Ma, Chen Li, Xiaosong Zhang, Han Hu, Weidi Xie
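Personalized recommendation rationale:

The paper focuses on image generation, a purely visual generation area unrelated to the core technologies of recommender systems, search, or advertising. Although it involves imitation learning, it shows no explicit, direct connection to ranking in recommendation, search, or advertising, and image generation is an explicitly excluded topic.

2026-02-12 16:36:33 | arXiv:2602.12155v1 |
cs.CV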
Post-training of flow matching models (aligning the output distribution with a high-quality target) is mathematically equivalent to imitation learning. While Supervised Fine-Tuning mimics expert demonstrations effectively, it cannot correct policy drift in unseen states. Preference optimization methods address this but require costly preference pairs or reward modeling. We propose Flow Matching Adversarial Imitation Learning (FAIL), which minimizes policy-expert divergence through adversarial training without explicit rewards or pairwise comparisons. We derive two algorithms: FAIL-PD exploits differentiable ODE solvers for low-variance pathwise gradients, while FAIL-PG provides a black-box alternative for discrete or computationally constrained settings. Fine-tuning FLUX with only 13,000 demonstrations from Nano Banana pro, FAIL achieves competitive performance on prompt following and aesthetic benchmarks. Furthermore, the framework generalizes effectively to discrete image and video generation, and functions as a robust regularizer to mitigate reward hacking in reward-based optimization. Code and data are available at https://github.com/HansPolo113/FAIL.
PosterOmni: Generalized Artistic Poster Creation via Task Distillation and Unified Reward Feedback
Sixiang Chen, Jianyu Lai, Jialin Gao, Hengyu Shi, Zhongying Liu, Tian Ye, Junfen...
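Personalized recommendation rationale:

The title focuses on artistic poster generation, which belongs to AIGC/content generation and has no direct connection to the core ranking or matching tasks of recommender systems, search, or advertising. Although it involves task distillation, the application is limited to creative content generation and falls outside the current focus on direct applications of, or enabling technologies for, LLMs in recommendation, search, or advertising.

2026-02-12 16:16:38 | arXiv:2602.12127v1 |
cs.CV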
Image-to-poster generation is a high-demand task requiring not only local adjustments but also high-level design understanding. Models must generate text, layout, style, and visual elements while preserving semantic fidelity and aesthetic coherence. The process spans two regimes: local editing, where ID-driven generation, rescaling, filling, and extending must preserve concrete visual entities; and global creation, where layout- and style-driven tasks rely on understanding abstract design concepts. These intertwined demands make image-to-poster a multi-dimensional process coupling entity-preserving editing with concept-driven creation under image-prompt control. To address these challenges, we propose PosterOmni, a generalized artistic poster creation framework that unlocks the potential of a base edit model for multi-task image-to-poster generation. PosterOmni integrates the two regimes, namely local editing and global creation, within a single system through an efficient data-distillation-reward pipeline: (i) constructing multi-scenario image-to-poster datasets covering six task types across entity-based and concept-based creation; (ii) distilling knowledge between local and global experts for supervised fine-tuning; and (iii) applying unified PosterOmni Reward Feedback to jointly align visual entity-preserving and aesthetic preference across all tasks. Additionally, we establish PosterOmni-Bench, a unified benchmark for evaluating both local editing and global creation. Extensive experiments show that PosterOmni significantly enhances reference adherence, global composition quality, and aesthetic harmony, outperforming all open-source baselines and even surpassing several proprietary systems.
Iskra: A System for Inverse Geometry Processing
Ana Dodik, Ahmed H. Mahmoud, Justin Solomon
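Personalized recommendation rationale:

The title concerns geometry processing in computer graphics, which is purely graphics and 3D vision. It contains no terms related to recommender systems, search, advertising, LLMs, or Transformer technology. Under the user's list of irrelevant topics, this falls under "purely vision, 3D vision, or graphics papers with no clear relevance to recommender systems, search, or advertising," so it is entirely irrelevant.

2026-02-12 15:59:06 | arXiv:2602.12105v1 |
cs.GR cs.CV cs.LG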
We propose a system for differentiating through solutions to geometry processing problems. Our system differentiates a broad class of geometric algorithms, exploiting existing fast problem-specific schemes common to geometry processing, including local-global and ADMM solvers. It is compatible with machine learning frameworks, opening doors to new classes of inverse geometry processing applications. We marry the scatter-gather approach to mesh processing with tensor-based workflows and rely on the adjoint method applied to user-specified imperative code to generate an efficient backward pass behind the scenes. We demonstrate our approach by differentiating through mean curvature flow, spectral conformal parameterization, geodesic distance computation, and as-rigid-as-possible deformation, examining usability and performance on these applications. Our system allows practitioners to differentiate through existing geometry processing algorithms without needing to reformulate them, resulting in low implementation effort, fast runtimes, and lower memory requirements than differentiable optimization tools not tailored to geometry processing.
A DMD-Based Adaptive Modulation Method for High Dynamic Range Imaging in High-Glare Environments
Banglei Guan, Jing Tao, Liang Xu, Dongcai Tan, Pengju Sun, Jianbing Liu, Yang Sh...
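Personalized recommendation rationale:

The title concerns high dynamic range imaging in computer vision, a purely visual processing area unrelated to the core technologies of recommender systems, search, or advertising. The DMD (digital micromirror device, per the abstract) and the adaptive modulation method named in the title are optical imaging and visual signal processing techniques, and they show no potential application value in recommender systems, search, or advertising.

2026-02-12 15:10:25 | arXiv:2602.12044v1 |
cs.CV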
Background: The accuracy of photomechanics measurements critically relies on image quality, particularly under extreme illumination conditions such as welding arc monitoring and polished metallic surface analysis. High dynamic range (HDR) imaging above 120 dB is essential in these contexts. Conventional CCD/CMOS sensors, with dynamic ranges typically below 70 dB, are highly susceptible to saturation under glare, resulting in irreversible loss of detail and significant errors in digital image correlation (DIC). Methods: This paper presents an HDR imaging system that leverages the spatial modulation capability of a digital micromirror device (DMD). The system architecture enables autonomous regional segmentation and adaptive exposure control for high-dynamic-range scenes through an integrated framework comprising two synergistic subsystems: a DMD-based optical modulation unit and an adaptive computational imaging pipeline. Results: The system achieves a measurable dynamic range of 127 dB, effectively eliminating saturation artifacts under high glare. Experimental results demonstrate a 78% reduction in strain error and improved DIC positioning accuracy, confirming reliable performance across extreme intensity variations. Conclusion: The DMD-based system provides high-fidelity adaptive HDR imaging, overcoming key limitations of conventional sensors. It exhibits strong potential for optical metrology and stress analysis in high-glare environments where traditional methods are inadequate.
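For a sense of scale, the decibel figures quoted above can be turned into intensity ratios. The sketch below assumes the common image-sensor convention DR = 20 * log10(I_max / I_min); the abstract does not state which convention the authors use, and the helper names are hypothetical.

```python
import math


def db_to_ratio(db):
    """Convert a dynamic range in dB to a max/min intensity ratio,
    assuming the 20*log10 convention (an assumption, not from the paper)."""
    return 10 ** (db / 20.0)


def ratio_to_db(ratio):
    """Inverse conversion under the same assumed convention."""
    return 20.0 * math.log10(ratio)


conventional = db_to_ratio(70)   # typical CCD/CMOS ceiling cited in the abstract
reported = db_to_ratio(127)      # dynamic range reported for the DMD system
print(f"70 dB  -> about {conventional:,.0f}:1")
print(f"127 dB -> about {reported:,.0f}:1")
print(f"ratio between the two: about {reported / conventional:,.0f}x")
```

Under this convention, the 57 dB gap between the two figures corresponds to a contrast ratio several hundred times larger.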
Projected Representation Conditioning for High-fidelity Novel View Synthesis
Min-Seop Kwak, Minkyung Kwon, Jinhyeok Choi, Jiho Park, Seungryong Kim
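Personalized recommendation rationale:

The title explicitly concerns novel view synthesis in computer vision (3D vision and graphics), an explicitly irrelevant topic. Representation conditioning may be a general-purpose technique, but the title itself indicates no direct or indirect connection to recommender systems, search, or advertising, and it mentions no technique that might apply to those areas.

2026-02-12 14:35:30 | arXiv:2602.12003v1 |
cs.CV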
We propose a novel framework for diffusion-based novel view synthesis in which we leverage external representations as conditions, harnessing their geometric and semantic correspondence properties for enhanced geometric consistency in generated novel viewpoints. First, we provide a detailed analysis exploring the correspondence capabilities emergent in the spatial attention of external visual representations. Building on these insights, we propose representation-guided novel view synthesis (ReNoV), in which dedicated representation projection modules inject external representations into the diffusion process. Our experiments show that this design yields marked improvements in both reconstruction fidelity and inpainting quality, outperforming prior diffusion-based novel-view methods on standard benchmarks and enabling robust synthesis from sparse, unposed image collections.
Calibrated Bayesian Deep Learning for Explainable Decision Support Systems Based on Medical Imaging
Hua Xu, Julián D. Arias-Londoño, Juan I. Godino-Llorente
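Personalized recommendation rationale:

The paper explicitly focuses on medical imaging, an explicitly irrelevant topic (medical domain-specific applications). Although it mentions decision support systems, the context is confined entirely to medical applications and has no connection to core technical advances in recommender systems, search, or advertising.

2026-02-12 14:03:41 | arXiv:2602.11973v1 |
cs.CV cs.LG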
In critical decision support systems based on medical imaging, the reliability of AI-assisted decision-making is as relevant as predictive accuracy. Although deep learning models have demonstrated significant accuracy, they frequently suffer from miscalibration, manifested as overconfidence in erroneous predictions. To facilitate clinical acceptance, it is imperative that models quantify uncertainty in a manner that correlates with prediction correctness, allowing clinicians to identify unreliable outputs for further review. In order to address this necessity, the present paper proposes a generalizable probabilistic optimization framework grounded in Bayesian deep learning. Specifically, a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) is introduced that imposes penalties on high-certainty errors and low-certainty correct predictions, explicitly enforcing alignment between prediction correctness and uncertainty estimates. Complementing this training-time optimization, a Dual Temperature Scaling (DTS) strategy is devised for post-hoc calibration, further refining the posterior distribution to improve intuitive explainability. The proposed framework is validated on three distinct medical imaging tasks: automatic screening of pneumonia, diabetic retinopathy detection, and identification of skin lesions. Empirical results demonstrate that the proposed approach achieves consistent calibration improvements across diverse modalities, maintains robust performance in data-scarce scenarios, and remains effective on severely imbalanced datasets, underscoring its potential for real clinical deployment.
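As background for the post-hoc calibration step, the sketch below shows standard single-temperature scaling, in which logits are divided by a scalar T before the softmax. It illustrates the generic technique only: how the paper's Dual Temperature Scaling uses two temperatures is not described in the abstract, and the logits and function name here are made up.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; temperature > 1 flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.5]  # made-up logits for a 3-class problem
for t in (1.0, 2.0, 4.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Raising the temperature lowers the top-class confidence without changing the predicted class; in the standard recipe, T is chosen on a held-out validation set by minimizing negative log-likelihood.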
UPDA: Unsupervised Progressive Domain Adaptation for No-Reference Point Cloud Quality Assessment
Bingxu Xie, Fang Zhou, Jincan Wu, Yonghui Liu, Weiqing Li, Zhiyong Su
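Personalized recommendation rationale:

The paper focuses on point cloud quality assessment, part of 3D vision within computer vision. Although it involves unsupervised learning and domain adaptation, its core application (point cloud quality assessment) has no direct connection to recommender systems, search, or advertising, and it is not research on Transformer architectures, LLM techniques, or unified modeling of heterogeneous data.

2026-02-12 14:02:05 | arXiv:2602.11969v1 |
eess.IV cs.CV cs.MM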
While no-reference point cloud quality assessment (NR-PCQA) approaches have achieved significant progress over the past decade, their performance often degrades substantially when a distribution gap exists between the training (source domain) and testing (target domain) data. However, to date, limited attention has been paid to transferring NR-PCQA models across domains. To address this challenge, we propose the first unsupervised progressive domain adaptation (UPDA) framework for NR-PCQA, which introduces a two-stage coarse-to-fine alignment paradigm to address domain shifts. At the coarse-grained stage, a discrepancy-aware coarse-grained alignment method is designed to capture relative quality relationships between cross-domain samples through a novel quality-discrepancy-aware hybrid loss, circumventing the challenges of direct absolute feature alignment. At the fine-grained stage, a perception fusion fine-grained alignment approach with symmetric feature fusion is developed to identify domain-invariant features, while a conditional discriminator selectively enhances the transfer of quality-relevant features. Extensive experiments demonstrate that the proposed UPDA effectively enhances the performance of NR-PCQA methods in cross-domain scenarios, validating its practical applicability. The code is available at https://github.com/yokeno1/UPDA-main.
Synthesis of Late Gadolinium Enhancement Images via Implicit Neural Representations for Cardiac Scar Segmentation
Soufiane Ben Haddou, Laura Alvarez-Florez, Erik J. Bekkers, Fleur V. Y. Tjong, A...
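Personalized recommendation rationale:

The paper focuses on a domain-specific application in medical imaging (cardiac scar segmentation), an explicitly irrelevant topic (Medical/Biology/Chemistry/Physics). The technique named in the title (implicit neural representations) has technical merit in its own right, but the paper's application scenario has no connection to recommender systems, search, or advertising, and the paper does not show how the technique could be applied to those areas.

2026-02-12 13:38:07 | arXiv:2602.11942v1 |
cs.CV cs.AI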
Late gadolinium enhancement (LGE) imaging is the clinical standard for myocardial scar assessment, but limited annotated datasets hinder the development of automated segmentation methods. We propose a novel framework that synthesises both LGE images and their corresponding segmentation masks using implicit neural representations (INRs) combined with denoising diffusion models. Our approach first trains INRs to capture continuous spatial representations of LGE data and associated myocardium and fibrosis masks. These INRs are then compressed into compact latent embeddings, preserving essential anatomical information. A diffusion model operates on this latent space to generate new representations, which are decoded into synthetic LGE images with anatomically consistent segmentation masks. Experiments on 133 cardiac MRI scans suggest that augmenting training data with 200 synthetic volumes contributes to improved fibrosis segmentation performance, with the Dice score showing an increase from 0.509 to 0.524. Our approach provides an annotation-free method to help mitigate data scarcity. The code for this research is publicly available.
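The Dice score reported above measures the overlap between a predicted mask and a reference mask, using the standard definition 2|A ∩ B| / (|A| + |B|). A minimal sketch follows; the two masks are made-up toy data, and the convention for two empty masks is an assumption.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [0, 1, 1, 1, 0, 0, 1, 0]    # toy predicted fibrosis mask
target = [0, 1, 1, 0, 0, 1, 1, 0]  # toy reference mask
print(dice_score(pred, target))    # 3 shared voxels, 4 + 4 labeled: 0.75
```

On this scale, the reported change from 0.509 to 0.524 is a gain of 0.015 in overlap.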
DynaHOI: Benchmarking Hand-Object Interaction for Dynamic Target
BoCheng Hu, Zhonghan Zhao, Kaiyue Zhou, Hongwei Wang, Gaoang Wang
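Personalized recommendation rationale:

The title explicitly concerns benchmarking hand-object interaction in computer vision, which is purely vision research. Although it involves interaction modeling, it focuses on interaction with dynamic targets in the physical world, has no direct connection to user behavior modeling, sequence modeling, or heterogeneous data processing in recommender systems, search, or advertising, and does not involve advances in LLMs or Transformer architectures.

2026-02-12 13:19:41 | arXiv:2602.11919v1 |
cs.CV cs.AI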
Most existing hand motion generation benchmarks for hand-object interaction (HOI) focus on static objects, leaving dynamic scenarios with moving targets and time-critical coordination largely untested. To address this gap, we introduce the DynaHOI-Gym, a unified online closed-loop platform with parameterized motion generators and rollout-based metrics for dynamic capture evaluation. Built on DynaHOI-Gym, we release DynaHOI-10M, a large-scale benchmark with 10M frames and 180K hand capture trajectories, whose target motions are organized into 8 major categories and 22 fine-grained subcategories. We also provide a simple observe-before-act baseline (ObAct) that integrates short-term observations with the current frame via spatiotemporal attention to predict actions, achieving an 8.1% improvement in location success rate.
Learning Perceptual Representations for Gaming NR-VQA with Multi-Task FR Signals
Yu-Chih Chen, Michael Wang, Chieh-Dun Wen, Kai-Siang Ma, Avinab Saha, Li-Heng Ch...
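Personalized recommendation rationale:

The paper focuses on video quality assessment (VQA), which belongs to computer vision and is unrelated to the core technologies of recommender systems, search, or advertising. Although it involves representation learning, its application scenario (gaming video quality assessment) has no direct connection to tasks such as user behavior modeling, content ranking, or ad delivery.

2026-02-12 12:56:58 | arXiv:2602.11903v1 |
eess.IV cs.CV cs.MM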
No-reference video quality assessment (NR-VQA) for gaming videos is challenging due to limited human-rated datasets and unique content characteristics including fast motion, stylized graphics, and compression artifacts. We present MTL-VQA, a multi-task learning framework that uses full-reference metrics as supervisory signals to learn perceptually meaningful features without human labels for pretraining. By jointly optimizing multiple full-reference (FR) objectives with adaptive task weighting, our approach learns shared representations that transfer effectively to NR-VQA. Experiments on gaming video datasets show MTL-VQA achieves performance competitive with state-of-the-art NR-VQA methods across both MOS-supervised and label-efficient/self-supervised settings.
SynthRAR: Ring Artifacts Reduction in CT with Unrolled Network and Synthetic Data Training
Hongxu Yang, Levente Lippenszky, Edina Timko, Gopal Avinash
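Personalized recommendation rationale:

The paper focuses on ring artifact reduction in medical imaging (CT scans), a clearly medical application. Although it involves neural network architectures, its content has no connection to recommender systems, search, advertising, or Transformer technology, and it has no potential application value in those areas.

2026-02-12 12:30:14 | arXiv:2602.11880v1 |
cs.CV cs.AI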
Defective and inconsistent responses in CT detectors can cause ring and streak artifacts in the reconstructed images, making them unusable for clinical purposes. In recent years, several ring artifact reduction solutions have been proposed in the image domain or in the sinogram domain using supervised deep learning methods. However, these methods require dedicated datasets for training, leading to a high data collection cost. Furthermore, existing approaches focus exclusively on either image-space or sinogram-space correction, neglecting the intrinsic correlations from the forward operation of the CT geometry. Based on the theoretical analysis of non-ideal CT detector responses, the RAR problem is reformulated as an inverse problem by using an unrolled network, which considers non-ideal response together with linear forward-projection with CT geometry. Additionally, the intrinsic correlations of ring artifacts between the sinogram and image domains are leveraged through synthetic data derived from natural images, enabling the trained model to correct artifacts without requiring real-world clinical data. Extensive evaluations on diverse scanning geometries and anatomical regions demonstrate that the model trained on synthetic data consistently outperforms existing state-of-the-art methods.
Free Lunch for Stabilizing Rectified Flow Inversion
Chenru Wang, Beier Zhu, Chi Zhang
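Personalized recommendation rationale:

The title concerns a method for stabilizing rectified flow inversion, a technical refinement of generative or diffusion models with no direct connection to core advances in recommender systems, search, or advertising, to applications of LLM techniques, to improvements in Transformer architectures, or to unified modeling of heterogeneous data. Rectified flow is used mainly in AIGC areas such as image generation, an explicitly excluded topic.

2026-02-12 11:42:36 | arXiv:2602.11850v1 |
cs.CV cs.LG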
Rectified-Flow (RF)-based generative models have recently emerged as strong alternatives to traditional diffusion models, demonstrating state-of-the-art performance across various tasks. By learning a continuous velocity field that transforms simple noise into complex data, RF-based models not only enable high-quality generation, but also support training-free inversion, which facilitates downstream tasks such as reconstruction and editing. However, existing inversion methods, such as vanilla RF-based inversion, suffer from approximation errors that accumulate across timesteps, leading to unstable velocity fields and degraded reconstruction and editing quality. To address this challenge, we propose Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field by guiding it toward a running average of past velocities, constrained within a theoretically derived spherical Gaussian. Furthermore, we introduce mimic-CFG, a lightweight velocity correction scheme for editing tasks, which interpolates between the current velocity and its projection onto the historical average, balancing editing effectiveness and structural consistency. Extensive experiments on PIE-Bench demonstrate that our methods significantly improve inversion stability, image reconstruction quality, and editing fidelity, while reducing the required number of neural function evaluations. Our approach achieves state-of-the-art performance on the PIE-Bench with enhanced efficiency and theoretical soundness.
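The two ingredients named above, a running average of past velocities and an interpolation between the current velocity and its projection onto that average, can be sketched on plain vectors. This is a toy illustration under assumptions: the incremental-mean update, the blending weight `lam`, and all function names are hypothetical, and the Gaussian constraint used by PMI is omitted because the abstract gives no formula for it.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_onto(v, m):
    """Orthogonal projection of v onto the direction of m."""
    mm = dot(m, m)
    if mm == 0.0:
        return [0.0] * len(v)
    scale = dot(v, m) / mm
    return [scale * x for x in m]


def blend_with_history(v, mean_v, lam):
    """Interpolate between v (lam = 0) and its projection onto mean_v (lam = 1)."""
    p = project_onto(v, mean_v)
    return [(1 - lam) * a + lam * b for a, b in zip(v, p)]


def update_running_mean(mean_v, v, count):
    """Fold a new sample v into a mean that already covers `count` samples."""
    return [m + (x - m) / (count + 1) for m, x in zip(mean_v, v)]


velocities = [[1.0, 0.0], [0.8, 0.4], [1.2, -0.6]]  # made-up 2D velocities
mean_v, count = velocities[0], 1
for v in velocities[1:]:
    print(blend_with_history(v, mean_v, lam=0.5))
    mean_v = update_running_mean(mean_v, v, count)
    count += 1
```

With `lam` between 0 and 1, the component of the velocity orthogonal to the historical average is damped, which is one simple way to read the trade-off between editing strength and structural consistency described in the abstract.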
WorldTree: Towards 4D Dynamic Worlds from Monocular Video using Tree-Chains
Qisen Wang, Yifan Zhao, Jia Li
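Personalized recommendation rationale:

The title concerns building 4D dynamic worlds from monocular video, which belongs to computer vision and 3D scene reconstruction and is unrelated to the core technical focus of recommender systems, search, or advertising. The 'Tree-Chains' in the title may refer to some data structure or architecture, but it does not clearly point to Transformer architectures, LLM techniques, or applications in recommendation, search, or advertising.

2026-02-12 11:38:35 | arXiv:2602.11845v1 |
cs.CV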
Dynamic reconstruction has achieved remarkable progress, but monocular input, which more practical applications require, remains challenging. Prevailing works attempt to construct efficient motion representations, but lack a unified spatiotemporal decomposition framework, suffering from either holistic temporal optimization or coupled hierarchical spatial composition. To this end, we propose WorldTree, a unified framework comprising Temporal Partition Tree (TPT) that enables coarse-to-fine optimization based on the inheritance-based partition tree structure for hierarchical temporal decomposition, and Spatial Ancestral Chains (SAC) that recursively query ancestral hierarchical structure to provide complementary spatial dynamics while specializing motion representations across ancestral nodes. Experimental results on different datasets indicate that our proposed method achieves an 8.26% improvement in LPIPS on NVIDIA-LS and a 9.09% improvement in mLPIPS on DyCheck compared to the second-best method. Code: https://github.com/iCVTEAM/WorldTree.
A Comparative Study of MAP and LMMSE Estimators for Blind Inverse Problems
Nathan Buskulic, Luca Calatroni
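Personalized recommendation rationale:

The title concerns estimation methods for blind inverse problems in signal processing (MAP and LMMSE), which belong to classical statistical estimation theory. These methods have applications in some fields but no direct connection to the current areas of focus: core technical advances in recommender systems, search, and advertising, LLM techniques, Transformer architectures, or unified modeling of heterogeneous data.

2026-02-12 10:49:45 | arXiv:2602.11814v1 |
cs.IT cs.CV cs.LG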
Maximum-a-posteriori (MAP) approaches are an effective framework for inverse problems with known forward operators, particularly when combined with expressive priors and careful parameter selection. In blind settings, however, their use becomes significantly less stable due to the inherent non-convexity of the problem and the potential non-identifiability of the solutions. (Linear) minimum mean square error (MMSE) estimators provide a compelling alternative that can circumvent these limitations. In this work, we study synthetic two-dimensional blind deconvolution problems under fully controlled conditions, with complete prior knowledge of both the signal and kernel distributions. We compare tailored MAP algorithms with simple LMMSE estimators whose functional form is closely related to that of an optimal Tikhonov estimator. Our results show that, even in these highly controlled settings, MAP methods remain unstable and require extensive parameter tuning, whereas the LMMSE estimator yields a robust and reliable baseline. Moreover, we demonstrate empirically that the LMMSE solution can serve as an effective initialization for MAP approaches, improving their performance and reducing sensitivity to regularization parameters, thereby opening the door to future theoretical and practical developments.
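For orientation, the link between LMMSE and Tikhonov estimators can be written out in the textbook non-blind case. The derivation below assumes a known linear operator $H$, zero-mean and mutually uncorrelated signal and noise, and isotropic covariances; these simplifications are assumptions for illustration and are not the blind setting studied in the paper.

```latex
% Linear model with a known operator H (assumption: non-blind case)
y = Hx + n, \qquad \mathbb{E}[x] = 0,\quad \mathbb{E}[n] = 0,\quad \mathbb{E}[x n^\top] = 0

% General LMMSE estimator with signal covariance C_x and noise covariance C_n
\hat{x}_{\mathrm{LMMSE}} = C_x H^\top \left( H C_x H^\top + C_n \right)^{-1} y

% Isotropic case: C_x = \sigma_x^2 I, \; C_n = \sigma_n^2 I, \; \lambda = \sigma_n^2 / \sigma_x^2
\hat{x}_{\mathrm{LMMSE}} = H^\top \left( H H^\top + \lambda I \right)^{-1} y
                         = \left( H^\top H + \lambda I \right)^{-1} H^\top y
```

The last expression is the minimizer of $\|y - Hx\|_2^2 + \lambda \|x\|_2^2$, that is, a Tikhonov-regularized solution whose regularization weight is the noise-to-signal variance ratio.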
How to Sample High Quality 3D Fractals for Action Recognition Pre-Training?
Marko Putak, Thomas B. Moeslund, Joakim Bruslund Haurum
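Personalized recommendation rationale:

The title clearly focuses on 3D vision and pre-training for action recognition, which is purely vision research. Although it mentions pre-training, the content concerns 3D fractal sampling, a computer vision technique unrelated to recommender systems, search, or advertising, and it shows no potential link to heterogeneous data processing, Transformer architectures, or LLM applications.

2026-02-12 10:48:25 | arXiv:2602.11810v1 |
cs.CV cs.LG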
Synthetic datasets are being recognized in the deep learning realm as a valuable alternative to exhaustively labeled real data. One such synthetic data generation method is Formula Driven Supervised Learning (FDSL), which can provide an infinite amount of perfectly labeled data through a formula-driven approach, such as fractals or contours. FDSL avoids common drawbacks such as manual labor, privacy issues, and other ethical concerns. In this work we generate 3D fractals using 3D Iterated Function Systems (IFS) for pre-training an action recognition model. The fractals are temporally transformed to form a video that is used as a pre-training dataset for the downstream task of action recognition. We find that standard methods of generating fractals are slow and produce degenerate 3D fractals. Therefore, we systematically explore alternative ways of generating fractals and find that overly restrictive approaches, while generating aesthetically pleasing fractals, are detrimental to downstream task performance. We propose a novel method, Targeted Smart Filtering, to address both the generation speed and fractal diversity issues. The method reports roughly 100 times faster sampling speed and achieves superior downstream performance against other 3D fractal filtering methods.
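A 3D iterated function system is a small set of affine maps x -> A x + b, and a point cloud on its attractor can be sampled with the classic chaos game. The sketch below shows only this baseline sampling under assumptions: the random parameter ranges, the infinity-norm rescaling used to force each map to be a contraction, and all names are hypothetical. It does not implement the paper's Targeted Smart Filtering.

```python
import random


def random_affine_map(rng, max_norm=0.8):
    """Draw a random 3x3 matrix A and offset b, rescaling A so its infinity
    norm is at most max_norm (< 1), which makes x -> A x + b a contraction."""
    A = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
    norm = max(sum(abs(v) for v in row) for row in A)
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    A = [[v * scale for v in row] for row in A]
    b = [rng.uniform(-1, 1) for _ in range(3)]
    return A, b


def apply_map(A, b, x):
    return [sum(A[i][j] * x[j] for j in range(3)) + b[i] for i in range(3)]


def sample_fractal(num_maps=4, num_points=2000, burn_in=20, seed=0):
    """Chaos game: repeatedly apply a randomly chosen map and record the point."""
    rng = random.Random(seed)
    maps = [random_affine_map(rng) for _ in range(num_maps)]
    x = [0.0, 0.0, 0.0]
    points = []
    for step in range(num_points + burn_in):
        A, b = rng.choice(maps)
        x = apply_map(A, b, x)
        if step >= burn_in:  # skip early iterates that are far from the attractor
            points.append(x)
    return points


points = sample_fractal()
print(len(points), points[0])
```

Unconstrained random parameters like these often yield the degenerate shapes the abstract mentions, for example points collapsing toward a line or a blob, which is the motivation for filtering the sampled systems.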
Light4D: Training-Free Extreme Viewpoint 4D Video Relighting
Zhenghuang Wu, Kang Chen, Zeyu Zhang, Hao Tang
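Personalized recommendation rationale:

The title concerns 4D video relighting and extreme-viewpoint handling, which belong to computer vision and graphics and have no direct connection to the core technologies of recommender systems, search, or advertising. The title mentions no keywords related to recommendation, search, advertising, LLMs, or Transformer architectures, so the paper is judged irrelevant.

2026-02-12 09:50:13 | arXiv:2602.11769v1 |
cs.CV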
Recent advances in diffusion-based generative models have established a new paradigm for image and video relighting. However, extending these capabilities to 4D relighting remains challenging, due primarily to the scarcity of paired 4D relighting training data and the difficulty of maintaining temporal consistency across extreme viewpoints. In this work, we propose Light4D, a novel training-free framework designed to synthesize consistent 4D videos under target illumination, even under extreme viewpoint changes. First, we introduce Disentangled Flow Guidance, a time-aware strategy that effectively injects lighting control into the latent space while preserving geometric integrity. Second, to reinforce temporal consistency, we develop Temporal Consistent Attention within the IC-Light architecture and further incorporate deterministic regularization to eliminate appearance flickering. Extensive experiments demonstrate that our method achieves competitive performance in temporal consistency and lighting fidelity, robustly handling camera rotations from -90° to 90°. Code: https://github.com/AIGeeksGroup/Light4D. Website: https://aigeeksgroup.github.io/Light4D.
Code2Worlds: Empowering Coding LLMs for 4D World Generation
Yi Zhang, Yunshuang Wang, Zeyu Zhang, Hao Tang
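Personalized recommendation rationale:

The paper's title focuses on using LLMs for 4D world generation, a purely LLM-centric content-generation application unrelated to the core techniques of recommender systems, search, or advertising. Although it involves LLMs, the application is content generation rather than the ranking, retrieval, or modeling tasks of recommendation, search, or advertising.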

2026-02-12 09:34:28 | arXiv:2602.11757v1 |
cs.CV
Achieving spatial intelligence requires moving beyond visual plausibility to build world simulators grounded in physical laws. While coding LLMs have advanced static 3D scene generation, extending this paradigm to 4D dynamics remains a critical frontier. This task presents two fundamental challenges: multi-scale context entanglement, where monolithic generation fails to balance local object structures with global environmental layouts; and a semantic-physical execution gap, where open-loop code generation leads to physical hallucinations lacking dynamic fidelity. We introduce Code2Worlds, a framework that formulates 4D generation as language-to-simulation code generation. First, we propose a dual-stream architecture that disentangles retrieval-augmented object generation from hierarchical environmental orchestration. Second, to ensure dynamic fidelity, we establish a physics-aware closed-loop mechanism in which a PostProcess Agent scripts dynamics, coupled with a VLM-Motion Critic that performs self-reflection to iteratively refine simulation code. Evaluations on the Code4D benchmark show Code2Worlds outperforms baselines with a 41% SGS gain and 49% higher Richness, while uniquely generating physics-aware dynamics absent in prior static methods. Code: https://github.com/AIGeeksGroup/Code2Worlds. Website: https://aigeeksgroup.github.io/Code2Worlds.
STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning
Xiaowen Zhang, Zhi Gao, Licheng Jiao, Lingling Li, Qing Li
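Personalized recommendation rationale:

The paper's title explicitly involves reinforcement learning (RL) and video understanding, which fall under the explicitly excluded "irrelevant topics". RL papers without clear relevance to recommender systems, search, or advertising are not considered, and the video modality is not directly relevant in itself. The "instance-level reasoning and grounding" in the title likely concerns computer vision tasks rather than core problems in recommendation, search, or advertising.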

2026-02-12 08:53:32 | arXiv:2602.11730v1 |
cs.CV
In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. This issue becomes particularly severe in dense prediction tasks such as spatial-temporal video grounding (STVG). Prior approaches typically focus on enhancing visual-textual alignment or attaching auxiliary decoders. However, these strategies inevitably introduce additional trainable modules, leading to significant annotation costs and computational overhead. In this work, we propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. Specifically, we reformulate per-frame coordinate prediction as a compact instance-level identification problem by assigning each object a unique, temporally consistent ID. These IDs are embedded into the video as visual prompts, providing explicit and interpretable inputs to the VLMs. Furthermore, we introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization. Extensive experiments on six benchmarks demonstrate the effectiveness of our approach. STVG-R1 surpasses the baseline Qwen2.5-VL-7B by a remarkable margin of 20.9% on m_IoU on the HCSTVG-v2 benchmark, establishing a new state of the art (SOTA). Surprisingly, STVG-R1 also exhibits strong zero-shot generalization to multi-object referring video object segmentation tasks, achieving a SOTA 47.3% J&F on MeViS.
GSO-SLAM: Bidirectionally Coupled Gaussian Splatting and Direct Visual Odometry
Jiung Yeon, Seongbo Ha, Hyeonwoo Yu
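Personalized recommendation rationale:

The paper's title concerns SLAM (simultaneous localization and mapping) in computer vision, focusing on 3D reconstruction and visual odometry. This is purely computer vision/robotics, with no direct connection to the core technical focus of recommender systems, search, or advertising (such as ranking algorithms, user modeling, and content understanding). The techniques named in the title have no apparent potential for application in recommendation, search, or advertising.

2026-02-12 08:44:32 | arXiv:2602.11714v1 |
cs.CV, cs.RO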
We propose GSO-SLAM, a real-time monocular dense SLAM system that leverages Gaussian scene representation. Unlike existing methods that couple tracking and mapping with a unified scene, incurring computational costs, or loosely integrate them with well-structured tracking frameworks, introducing redundancies, our method bidirectionally couples Visual Odometry (VO) and Gaussian Splatting (GS). Specifically, our approach formulates joint optimization within an Expectation-Maximization (EM) framework, enabling the simultaneous refinement of VO-derived semi-dense depth estimates and the GS representation without additional computational overhead. Moreover, we present Gaussian Splat Initialization, which utilizes image information, keyframe poses, and pixel associations from VO to produce close approximations to the final Gaussian scene, thereby eliminating the need for heuristic methods. Through extensive experiments, we validate the effectiveness of our method, showing that it not only operates in real time but also achieves state-of-the-art geometric/photometric fidelity of the reconstructed scene and tracking accuracy.
TG-Field: Geometry-Aware Radiative Gaussian Fields for Tomographic Reconstruction
Yuxiang Zhong, Jun Wei, Chaoqi Chen, Senyou An, Hui Huang
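Personalized recommendation rationale:

The paper's title concerns tomographic reconstruction and radiative Gaussian fields, a specific application in medical imaging or physics. Although it mentions geometry awareness, this has no direct connection to heterogeneous data processing in recommender systems, search, or advertising. The topic clearly falls under the excluded medical/physics domain-specific applications and is unrelated to the current core areas of interest.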

2026-02-12 08:33:01 | arXiv:2602.11705v1 |
cs.CV
3D Gaussian Splatting (3DGS) has revolutionized 3D scene representation with superior efficiency and quality. While recent adaptations for computed tomography (CT) show promise, they struggle with severe artifacts under highly sparse-view projections and dynamic motions. To address these challenges, we propose Tomographic Geometry Field (TG-Field), a geometry-aware Gaussian deformation framework tailored for both static and dynamic CT reconstruction. A multi-resolution hash encoder is employed to capture local spatial priors, regularizing primitive parameters under ultra-sparse settings. We further extend the framework to dynamic reconstruction by introducing time-conditioned representations and a spatiotemporal attention block to adaptively aggregate features, thereby resolving spatiotemporal ambiguities and enforcing temporal coherence. In addition, a motion-flow network models fine-grained respiratory motion to track local anatomical deformations. Extensive experiments on synthetic and real-world datasets demonstrate that TG-Field consistently outperforms existing methods, achieving state-of-the-art reconstruction accuracy under highly sparse-view conditions.
LLM-Driven 3D Scene Generation of Agricultural Simulation Environments
Arafa Yoncalik, Wouter Jansen, Nico Huebel, Mohammad Hasan Rahmani, Jan Steckel
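Personalized recommendation rationale:

The paper's title explicitly concerns 3D scene generation and agricultural simulation environments, which are purely 3D vision and domain-specific applications. Although it is LLM-driven, the core content is unrelated to key techniques such as ranking, modeling, and architecture in recommender systems, search, or advertising, and falls entirely within the user's specified irrelevant topics.

2026-02-12 08:33:01 | arXiv:2602.11706v1 |
cs.CV, cs.AI, cs.RO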
Procedural generation techniques in 3D rendering engines have revolutionized the creation of complex environments, reducing reliance on manual design. Recent approaches using Large Language Models (LLMs) for 3D scene generation show promise but often lack domain-specific reasoning, verification mechanisms, and modular design. These limitations lead to reduced control and poor scalability. This paper investigates the use of LLMs to generate agricultural synthetic simulation environments from natural language prompts, specifically to address the limitations of lacking domain-specific reasoning, verification mechanisms, and modular design. A modular multi-LLM pipeline was developed, integrating 3D asset retrieval, domain knowledge injection, and code generation for the Unreal rendering engine using its API. This results in a 3D environment with realistic planting layouts and environmental context, all based on the input prompt and the domain knowledge. To enhance accuracy and scalability, the system employs a hybrid strategy combining LLM optimization techniques such as few-shot prompting, Retrieval-Augmented Generation (RAG), finetuning, and validation. Unlike monolithic models, the modular architecture enables structured data handling, intermediate verification, and flexible expansion. The system was evaluated using structured prompts and semantic accuracy metrics. A user study assessed realism and familiarity against real-world images, while an expert comparison demonstrated significant time savings over manual scene design. The results confirm the effectiveness of multi-LLM pipelines in automating domain-specific 3D scene generation with improved reliability and precision. Future work will explore expanding the asset hierarchy, incorporating real-time generation, and adapting the pipeline to other simulation domains beyond agriculture.
U-DAVI: Uncertainty-Aware Diffusion-Prior-Based Amortized Variational Inference for Image Reconstruction
Ayush Varshney, Katherine L. Bouman, Berthy T. Feng
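Personalized recommendation rationale:

The paper's title explicitly focuses on image reconstruction, a computer vision task with no direct connection to the core techniques of recommender systems, search, or advertising. Although it involves generative methods such as diffusion models and variational inference, it is purely visual content generation rather than core problems such as ranking, retrieval, or user modeling in recommendation, search, or advertising.

2026-02-12 08:32:11 | arXiv:2602.11704v1 |
eess.IV, cs.CV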
Ill-posed imaging inverse problems remain challenging due to the ambiguity in mapping degraded observations to clean images. Diffusion-based generative priors have recently shown promise, but typically rely on computationally intensive iterative sampling or per-instance optimization. Amortized variational inference frameworks address this inefficiency by learning a direct mapping from measurements to posteriors, enabling fast posterior sampling without requiring the optimization of a new posterior for every new set of measurements. However, they still struggle to reconstruct fine details and complex textures. To address this, we extend the amortized framework by injecting spatially adaptive perturbations to measurements during training, guided by uncertainty estimates, to emphasize learning in the most uncertain regions. Experiments on deblurring and super-resolution demonstrate that our method achieves superior or competitive performance to previous diffusion-based approaches, delivering more realistic reconstructions without the computational cost of iterative refinement.
Semantically Conditioned Diffusion Models for Cerebral DSA Synthesis
Qiwen Xu, David Rügamer, Holger Wenz, Johann Fontana, Nora Meggyeshazi, Andreas ...
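Personalized recommendation rationale:

The paper's title explicitly targets medical image generation (cerebral DSA synthesis), a domain-specific application in medicine/biology. Although it involves diffusion models, its application scenario is entirely unrelated to recommender systems, search, or advertising, and medical applications are explicitly listed as an irrelevant topic.

2026-02-12 08:31:00 | arXiv:2602.11703v1 |
cs.CV, cs.AI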
Digital subtraction angiography (DSA) plays a central role in the diagnosis and treatment of cerebrovascular disease, yet its invasive nature and high acquisition cost severely limit large-scale data collection and public data sharing. Therefore, we developed a semantically conditioned latent diffusion model (LDM) that synthesizes arterial-phase cerebral DSA frames under explicit control of anatomical circulation (anterior vs. posterior) and canonical C-arm positions. We curated a large single-centre DSA dataset of 99,349 frames and trained a conditional LDM using text embeddings that encoded anatomy and acquisition geometry. To assess clinical realism, four medical experts, including two neuroradiologists, one neurosurgeon, and one internal medicine expert, systematically rated 400 synthetic DSA images using a 5-grade Likert scale for evaluating proximal large, medium, and small peripheral vessels. The generated images achieved image-wise overall Likert scores ranging from 3.1 to 3.3, with high inter-rater reliability (ICC(2,k) = 0.80–0.87). Distributional similarity to real DSA frames was supported by a low median Fréchet inception distance (FID) of 15.27. Our results indicate that semantically controlled LDMs can produce realistic synthetic DSAs suitable for downstream algorithm development, research, and training.
OMEGA-Avatar: One-shot Modeling of 360° Gaussian Avatars
Zehao Xia, Yiqun Wang, Zhengda Lu, Kai Liu, Jun Xiao, Peter Wonka
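Personalized recommendation rationale:

The paper's title concerns 3D avatar modeling and computer graphics, which is purely vision/3D vision. Although "one-shot modeling" may involve efficiency optimization, there is no clear indication of a connection to recommender systems, search, or advertising. Under the user's specified irrelevant topics, this counts as a "purely vision, 3D vision, graphics, or speech paper without clear relevance to RecSys/Search/Ads", so its relevance is very low.

2026-02-12 08:16:38 | arXiv:2602.11693v1 |
cs.GR, cs.AI, cs.CV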
Creating high-fidelity, animatable 3D avatars from a single image remains a formidable challenge. We identified three desirable attributes of avatar generation: 1) the method should be feed-forward, 2) model a 360° full-head, and 3) should be animation-ready. However, current work addresses only two of the three points simultaneously. To address these limitations, we propose OMEGA-Avatar, the first feed-forward framework that simultaneously generates a generalizable, 360°-complete, and animatable 3D Gaussian head from a single image. Starting from a feed-forward and animatable framework, we address the 360° full-head avatar generation problem with two novel components. First, to overcome poor hair modeling in full-head avatar generation, we introduce a semantic-aware mesh deformation module that integrates multi-view normals to optimize a FLAME head with hair while preserving its topology structure. Second, to enable effective feed-forward decoding of full-head features, we propose a multi-view feature splatting module that constructs a shared canonical UV representation from features across multiple views through differentiable bilinear splatting, hierarchical UV mapping, and visibility-aware fusion. This approach preserves both global structural coherence and local high-frequency details across all viewpoints, ensuring 360° consistency without per-instance optimization. Extensive experiments demonstrate that OMEGA-Avatar achieves state-of-the-art performance, significantly outperforming existing baselines in 360° full-head completeness while robustly preserving identity across different viewpoints.
Beyond Pixels: Vector-to-Graph Transformation for Reliable Schematic Auditing
Chengwei Ma, Zhen Tian, Zhou Zhou, Zhixian Xu, Xiaowei Zhu, Xia Hua, Si Shi, F. ...
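Personalized recommendation rationale:

The paper's title concerns schematic auditing and vector-to-graph transformation in computer vision, a domain-specific application. It has no direct connection to my areas of interest (recommender systems, search, advertising, and LLM techniques and their applications), and it mentions no general technique that might apply to those areas.

2026-02-12 07:50:49 | arXiv:2602.11678v1 |
cs.AI, cs.CV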
Multimodal Large Language Models (MLLMs) have shown remarkable progress in visual understanding, yet they suffer from a critical limitation: structural blindness. Even state-of-the-art models fail to capture topology and symbolic logic in engineering schematics, as their pixel-driven paradigm discards the explicit vector-defined relations needed for reasoning. To overcome this, we propose a Vector-to-Graph (V2G) pipeline that converts CAD diagrams into property graphs where nodes represent components and edges encode connectivity, making structural dependencies explicit and machine-auditable. On a diagnostic benchmark of electrical compliance checks, V2G yields large accuracy gains across all error categories, while leading MLLMs remain near chance level. These results highlight the systemic inadequacy of pixel-based methods and demonstrate that structure-aware representations provide a reliable path toward practical deployment of multimodal AI in engineering domains. To facilitate further research, we release our benchmark and implementation at https://github.com/gm-embodied/V2G-Audit.
U-Net with Hadamard Transform and DCT Latent Spaces for Next-day Wildfire Spread Prediction
Yingyi Luo, Shuaiang Rong, Adam Watts, Ahmet Enis Cetin
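Personalized recommendation rationale:

The paper focuses on wildfire prediction, a domain-specific application in meteorology/environmental science with no direct connection to the core technical areas of recommender systems, search, or advertising. The U-Net architecture and transform methods named in the title mainly target computer vision tasks and show no potential application value in recommendation, search, or advertising systems.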

2026-02-12 07:45:53 | arXiv:2602.11672v1 |
cs.CV
We developed a lightweight and computationally efficient tool for next-day wildfire spread prediction using multimodal satellite data as input. The deep learning model, which we call Transform Domain Fusion UNet (TD-FusionUNet), incorporates trainable Hadamard Transform and Discrete Cosine Transform layers that apply two-dimensional transforms, enabling the network to capture essential "frequency" components in orthogonalized latent spaces. Additionally, we introduce custom preprocessing techniques, including random margin cropping and a Gaussian mixture model, to enrich the representation of the sparse pre-fire masks and enhance the model's generalization capability. The TD-FusionUNet is evaluated on two datasets: the Next-Day Wildfire Spread dataset released by Google Research in 2023 and the WildfireSpreadTS dataset. Our proposed TD-FusionUNet achieves an F1 score of 0.591 with 370k parameters, outperforming the UNet baseline with a ResNet18 encoder reported for the WildfireSpreadTS dataset while using substantially fewer parameters. These results show that the proposed latent space fusion model balances accuracy and efficiency under a lightweight setting, making it suitable for real-time wildfire prediction applications in resource-limited environments.
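To make the "orthogonalized latent spaces" in this abstract concrete, here is a minimal sketch of two-dimensional Hadamard and DCT transforms built from orthonormal matrices. The function `transform_domain_layer` and its elementwise `weights` are assumptions standing in for whatever trainable parameters TD-FusionUNet actually uses; the abstract does not specify them.

```python
import numpy as np


def hadamard(n):
    """Orthonormal Hadamard matrix (Sylvester construction); n is a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0] /= np.sqrt(2.0)       # rescale the DC row so all rows have unit norm
    return C


def transform_domain_layer(x, T, weights):
    """Hypothetical layer: 2D transform, elementwise scaling, inverse transform."""
    y = T @ x @ T.T            # forward separable 2D transform
    return T.T @ (weights * y) @ T  # T is orthonormal, so T.T inverts it


H = hadamard(8)
C = dct_matrix(8)
x = np.arange(64, dtype=float).reshape(8, 8)
identity_out = transform_domain_layer(x, H, np.ones((8, 8)))
```

With all weights equal to one the layer reduces to the identity, since both matrices are orthonormal; learned weights would instead emphasize or suppress individual transform-domain coefficients.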
Egocentric Gaze Estimation via Neck-Mounted Camera
Haoyu Huang, Yoichi Sato
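Personalized recommendation rationale:

The paper focuses on gaze estimation in computer vision, using a neck-mounted camera for egocentric analysis. It has no direct connection to core-domain advances in recommender systems, search, or advertising, to applications of LLM techniques, to Transformer architecture improvements, or to unified modeling of heterogeneous data. Gaze estimation is mainly applied in vision areas such as human-computer interaction, assistive technology, and behavior analysis, not to the ranking, retrieval, or personalization tasks of RecSys/Search/Ads.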

2026-02-12 07:41:27 | arXiv:2602.11669v1 |
cs.CV
This paper introduces neck-mounted view gaze estimation, a new task that estimates user gaze from the perspective of a neck-mounted camera. Prior work on egocentric gaze estimation, which predicts the device wearer's gaze location within the camera's field of view, mainly focuses on head-mounted cameras while alternative viewpoints remain underexplored. To bridge this gap, we collect the first dataset for this task, consisting of approximately 4 hours of video collected from 8 participants during everyday activities. We evaluate a transformer-based gaze estimation model, GLC, on the new dataset and propose two extensions: an auxiliary gaze out-of-bound classification task and a multi-view co-learning approach that jointly trains head-view and neck-view models using a geometry-aware auxiliary loss. Experimental results show that incorporating gaze out-of-bound classification improves performance over standard fine-tuning, while the co-learning approach does not yield gains. We further analyze these results and discuss implications for neck-mounted gaze estimation.
Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes
Jeongho Noh, Tai Hyoung Rhee, Eunho Lee, Jeongyun Kim, Sunwoo Lee, Ayoung Kim
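Personalized recommendation rationale:

The paper's title involves 3D vision, instance segmentation, and robotic grasping, which belong to computer vision and robotics. Although it mentions "language-grounded", the core is a 3D vision task with no clear relevance to recommender systems, search, or advertising. Under the exclusion criteria, this counts as "Purely Vision, 3D Vision, Graphic or Speech papers without clear relevance to RecSys/Search/Ads", so its relevance is very low.

2026-02-12 07:25:52 | arXiv:2602.11660v1 |
cs.CV, cs.RO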
Reliable 3D instance segmentation is fundamental to language-grounded robotic manipulation. Its critical application lies in cluttered environments, where occlusions, limited viewpoints, and noisy masks degrade perception. To address these challenges, we present Clutt3R-Seg, a zero-shot pipeline for robust 3D instance segmentation for language-grounded grasping in cluttered scenes. Our key idea is to introduce a hierarchical instance tree of semantic cues. Unlike prior approaches that attempt to refine noisy masks, our method leverages them as informative cues: through cross-view grouping and conditional substitution, the tree suppresses over- and under-segmentation, yielding view-consistent masks and robust 3D instances. Each instance is enriched with open-vocabulary semantic embeddings, enabling accurate target selection from natural language instructions. To handle scene changes during multi-stage tasks, we further introduce a consistency-aware update that preserves instance correspondences from only a single post-interaction image, allowing efficient adaptation without rescanning. Clutt3R-Seg is evaluated on both synthetic and real-world datasets, and validated on a real robot. Across all settings, it consistently outperforms state-of-the-art baselines in cluttered and sparse-view scenarios. Even on the most challenging heavy-clutter sequences, Clutt3R-Seg achieves an AP@25 of 61.66, over 2.2x higher than baselines, and with only four input views it surpasses MaskClustering with eight views by more than 2x. The code is available at: https://github.com/jeonghonoh/clutt3r-seg.
EmoSpace: Fine-Grained Emotion Prototype Learning for Immersive Affective Content Generation
Bingyuan Wang, Xingbei Chen, Zongyang Qiu, Linping Yuan, Zeyu Wang
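Personalized recommendation rationale:

The paper focuses on affective content generation, which belongs to AIGC/content generation and is unrelated to the core ranking tasks of recommender systems, search, or advertising. Although it involves generative techniques, it does not clearly demonstrate a potential application in RecSys/Search/Ads, and immersive content generation falls under the excluded purely LLM/content-generation topics.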

2026-02-12 07:23:41 | arXiv:2602.11658v1 |
cs.CV
Emotion is important for creating compelling virtual reality (VR) content. Although some generative methods have been applied to lower the barrier to creating emotionally rich content, they fail to capture the nuanced emotional semantics and the fine-grained control essential for immersive experiences. To address these limitations, we introduce EmoSpace, a novel framework for emotion-aware content generation that learns dynamic, interpretable emotion prototypes through vision-language alignment. We employ a hierarchical emotion representation with rich learnable prototypes that evolve during training, enabling fine-grained emotional control without requiring explicit emotion labels. We develop a controllable generation pipeline featuring multi-prototype guidance, temporal blending, and attention reweighting that supports diverse applications, including emotional image outpainting, stylized generation, and emotional panorama generation for VR environments. Our experiments demonstrate the superior performance of EmoSpace over existing methods in both qualitative and quantitative evaluations. Additionally, we present a comprehensive user study investigating how VR environments affect emotional perception compared to desktop settings. Our work facilitates immersive visual content generation with fine-grained emotion control and supports applications like therapy, education, storytelling, artistic creation, and cultural preservation. Code and models will be made publicly available.
GR-Diffusion: 3D Gaussian Representation Meets Diffusion in Whole-Body PET Reconstruction
Mengxiao Geng, Zijie Chen, Ran Hong, Bingxuan Li, Qiegen Liu
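Personalized recommendation rationale:

The paper's title explicitly involves medical imaging (PET reconstruction) and 3D vision, which clearly fall under the irrelevant topics. Although it mentions diffusion models, described here as an LLM-related technique, its application is confined entirely to the medical domain and has no potential connection to recommender systems, search, or advertising.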

2026-02-12 07:10:38 | arXiv:2602.11653v1 |
cs.CV
Positron emission tomography (PET) reconstruction is a critical challenge in molecular imaging, often hampered by noise amplification, structural blurring, and detail loss due to sparse sampling and the ill-posed nature of inverse problems. The three-dimensional discrete Gaussian representation (GR), which efficiently encodes 3D scenes using parameterized discrete Gaussian distributions, has shown promise in computer vision. In this work, we propose a novel GR-Diffusion framework that synergistically integrates the geometric priors of GR with the generative power of diffusion models for 3D low-dose whole-body PET reconstruction. GR-Diffusion employs GR to generate a reference 3D PET image from projection data, establishing a physically grounded and structurally explicit benchmark that overcomes the low-pass limitations of conventional point-based or voxel-based methods. This reference image serves as a dual guide during the diffusion process, ensuring both global consistency and local accuracy. Specifically, we employ a hierarchical guidance mechanism based on the GR reference. Fine-grained guidance leverages differences to refine local details, while coarse-grained guidance uses multi-scale difference maps to correct deviations. This strategy allows the diffusion model to sequentially integrate the strong geometric prior from GR and recover sub-voxel information. Experimental results on the UDPET and Clinical datasets with varying dose levels show that GR-Diffusion outperforms state-of-the-art methods in enhancing 3D whole-body PET image quality and preserving physiological details.
Brain Tumor Classifiers Under Attack: Robustness of ResNet Variants Against Transferable FGSM and PGD Attacks
Ryan Deem, Garrett Goodman, Waqas Majeed, Md Abdullah Al Hafiz Khan, Michail S. ...
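Personalized recommendation rationale:

The paper's title explicitly focuses on medical image analysis (brain tumor classification) and robustness against adversarial attacks, a clearly medical application. Although it involves neural network architectures (ResNet) and attack methods (FGSM/PGD), the core of the paper is an adversarial security evaluation for a specific medical task, with no direct connection to recommender systems, search, advertising, or related LLM/Transformer techniques.

2026-02-12 06:58:33 | arXiv:2602.11646v1 |
cs.CV, cs.AI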
Adversarial robustness in deep learning models for brain tumor classification remains an underexplored yet critical challenge, particularly for clinical deployment scenarios involving MRI data. In this work, we investigate the susceptibility and resilience of several ResNet-based architectures, referred to as BrainNet, BrainNeXt and DilationNet, against gradient-based adversarial attacks, namely FGSM and PGD. These models, based on ResNet, ResNeXt, and dilated ResNet variants respectively, are evaluated across three preprocessing configurations: (i) full-sized augmented, (ii) shrunk augmented, and (iii) shrunk non-augmented MRI datasets. Our experiments reveal that BrainNeXt models exhibit the highest robustness to black-box attacks, likely due to their increased cardinality, though they produce weaker transferable adversarial samples. In contrast, BrainNet and Dilation models are more vulnerable to attacks from each other, especially under PGD with higher iteration steps and $\alpha$ values. Notably, shrunk and non-augmented data significantly reduce model resilience, even when the untampered test accuracy remains high, highlighting a key trade-off between input resolution and adversarial vulnerability. These results underscore the importance of jointly evaluating classification performance and adversarial robustness for reliable real-world deployment in brain MRI analysis.
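For readers unfamiliar with the two attacks named in this abstract, the following is a minimal sketch of FGSM and PGD under an L-infinity budget. It uses a toy logistic-regression classifier with made-up weights rather than the ResNet variants studied in the paper, and the values of `eps`, `alpha`, and `steps` are illustrative assumptions. Clipping to a valid pixel range is omitted.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce(x, y, w, b):
    """Binary cross-entropy of a logistic model on a single example."""
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def loss_grad_x(x, y, w, b):
    """Gradient of the cross-entropy with respect to the input x."""
    return (sigmoid(w @ x + b) - y) * w


def fgsm(x, y, w, b, eps):
    """Single step of size eps along the sign of the input gradient."""
    return x + eps * np.sign(loss_grad_x(x, y, w, b))


def pgd(x, y, w, b, eps, alpha, steps):
    """Iterated signed-gradient steps of size alpha, projected onto the eps-ball."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(loss_grad_x(x_adv, y, w, b))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # L-infinity projection
    return x_adv


# Toy model and input (all values are illustrative).
w = np.array([1.0, -2.0, 0.5])
b = 0.1
x = np.array([0.2, -0.1, 0.4])
y = 1.0
eps = 0.1
x_fgsm = fgsm(x, y, w, b, eps)
x_pgd = pgd(x, y, w, b, eps, alpha=0.02, steps=10)
```

In this linear toy model the gradient sign never changes, so both attacks end at the same corner of the eps-ball. On deep networks PGD typically finds stronger perturbations, which is why the step count and $\alpha$ matter in the experiments the abstract describes.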
ViTaS: Visual Tactile Soft Fusion Contrastive Learning for Visuomotor Learning
Yufeng Tian, Shuiqi Cheng, Tianming Wei, Tianxing Zhou, Yuanhang Zhang, Zixian L...
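Personalized recommendation rationale:

The paper's title explicitly focuses on visual-tactile fusion and visuomotor learning, which belong to robotic perception and control. Although it involves multimodal fusion, its core is robotic manipulation rather than recommender systems, search, or advertising applications. Nothing in the title indicates relevance to LLMs, Transformer architectures, recommender systems, search, or advertising techniques.

2026-02-12 06:56:29 | arXiv:2602.11643v1 |
cs.RO, cs.AI, cs.CV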
Tactile information plays a crucial role in human manipulation tasks and has recently garnered increasing attention in robotic manipulation. However, existing approaches mostly focus on the alignment of visual and tactile features, and the integration mechanism tends to be direct concatenation. Consequently, they struggle to cope with occluded scenarios because they neglect the inherently complementary nature of the two modalities and may not fully exploit the alignment, which limits their potential for real-world deployment. In this paper, we present ViTaS, a simple yet effective framework that incorporates both visual and tactile information to guide the behavior of an agent. We introduce Soft Fusion Contrastive Learning, an advanced version of the conventional contrastive learning method, and a CVAE module to utilize the alignment and complementarity within visuo-tactile representations. We demonstrate the effectiveness of our method in 12 simulated and 3 real-world environments, and our experiments show that ViTaS significantly outperforms existing baselines. Project page: https://skyrainwind.github.io/ViTaS/index.html.
Electrostatics-Inspired Surface Reconstruction (EISR): Recovering 3D Shapes as a Superposition of Poisson's PDE Solutions
Diego Patiño, Knut Peterson, Kostas Daniilidis, David K. Han
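Personalized recommendation rationale:

The paper's title explicitly focuses on 3D shape reconstruction in computer vision, which is purely 3D vision research. Although the title mentions the mathematical method of "Poisson's PDE solutions", the research is entirely unrelated to the core problems of recommender systems, search, and advertising (such as ranking, recall, and user modeling), and does not involve Transformer architectures, LLM techniques, or heterogeneous data modeling.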

2026-02-12 06:54:40 | arXiv:2602.11642v1 |
cs.CV
Implicit shape representation, such as SDFs, is a popular approach to recover the surface of a 3D shape as the level sets of a scalar field. Several methods approximate SDFs using machine learning strategies that exploit the knowledge that SDFs are solutions of the Eikonal partial differential equation (PDE). In this work, we present a novel approach to surface reconstruction by encoding it as a solution to a proxy PDE, namely Poisson's equation. Then, we explore the connection between Poisson's equation and physics, e.g., the electrostatic potential due to a positive charge density. We employ Green's functions to obtain a closed-form parametric expression for the PDE's solution, and leverage the linearity of our proxy PDE to find the target shape's implicit field as a superposition of solutions. Our method shows improved results in approximating high-frequency details, even with a small number of shape priors.
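The superposition idea in this abstract rests on a standard fact: the free-space Green's function of Poisson's equation in 3D is 1/(4π r), and because the equation is linear, a sum of such terms is again a solution. The sketch below is an illustration of that fact only, not the EISR method; the point-charge positions and strengths, and the helper names `potential` and `laplacian_fd`, are assumptions. It evaluates the potential of a few point charges and checks numerically that it is harmonic away from them.

```python
import numpy as np


def potential(x, centers, charges):
    """phi(x) = sum_i q_i / (4 * pi * |x - c_i|), a superposition of Green's functions."""
    d = np.linalg.norm(x[None, :] - centers, axis=1)
    return np.sum(charges / (4.0 * np.pi * d))


def laplacian_fd(f, x, h=1e-2):
    """Central finite-difference estimate of the Laplacian of f at x."""
    total = 0.0
    for axis in range(x.size):
        e = np.zeros_like(x)
        e[axis] = h
        total += (f(x + e) - 2.0 * f(x) + f(x - e)) / h**2
    return total


# Illustrative point charges and a query point away from all of them.
centers = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
charges = np.array([1.0, 0.5, 2.0])
query = np.array([2.0, 1.5, 1.0])

phi = lambda x: potential(x, centers, charges)
residual = laplacian_fd(phi, query)  # close to zero: phi is harmonic here
```

A surface could then be read off as a level set of such a field. How the paper parameterizes the charge density and selects the level is not stated in the abstract.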
PLESS: Pseudo-Label Enhancement with Spreading Scribbles for Weakly Supervised Segmentation
Yeva Gabrielyan, Varduhi Yeghiazaryan, Irina Voiculescu
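Personalized recommendation rationale:

The paper focuses on weakly supervised segmentation in computer vision, which is purely vision research. Although the "pseudo-label enhancement" in the title may relate to data annotation efficiency, the core of the paper is image segmentation rather than recommendation, search, or advertising systems, and there is no evidence that the method would be applied to heterogeneous data processing or multimodal modeling.

2026-02-12 06:24:05 | arXiv:2602.11628v1 |
cs.CV, cs.LG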
Weakly supervised learning with scribble annotations uses sparse user-drawn strokes to indicate segmentation labels on a small subset of pixels. This annotation reduces the cost of dense pixel-wise labeling, but suffers inherently from noisy and incomplete supervision. Recent scribble-based approaches in medical image segmentation address this limitation using pseudo-label-based training; however, the quality of the pseudo-labels remains a key performance limit. We propose PLESS, a generic pseudo-label enhancement strategy which improves reliability and spatial consistency. It builds on a hierarchical partitioning of the image into spatially coherent regions. PLESS propagates scribble information to refine pseudo-labels within semantically coherent regions. The framework is model-agnostic and easily integrates into existing pseudo-label methods. Experiments on two public cardiac MRI datasets (ACDC and MSCMRseg) across four scribble-supervised algorithms show consistent improvements in segmentation accuracy. Code will be made available on GitHub upon acceptance.
PLOT-CT: Pre-log Voronoi Decomposition Assisted Generation for Low-dose CT Reconstruction
Bin Huang, Xun Yu, Yikun Zhang, Yi Zhang, Yang Chen, Qiegen Liu
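Personalized recommendation rationale:

The paper's title explicitly targets CT reconstruction in medical imaging, a medical/biomedical application entirely unrelated to RecSys, search, or advertising. The techniques named in the title, such as low-dose CT reconstruction and Voronoi decomposition, have no direct or indirect application potential in recommender systems, search, or advertising.

2026-02-12 06:20:23 | arXiv:2602.11625v1 |
cs.CV, cs.AI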
Low-dose computed tomography (LDCT) reconstruction is fundamentally challenged by severe noise and compromised data fidelity under reduced radiation exposure. Most existing methods operate either in the image or post-log projection domain, which fails to fully exploit the rich structural information in pre-log measurements while being highly susceptible to noise. The requisite logarithmic transformation critically amplifies noise within these data, imposing exceptional demands on reconstruction precision. To overcome these challenges, we propose PLOT-CT, a novel framework for Pre-Log vOronoi decomposiTion-assisted CT generation. Our method begins by applying Voronoi decomposition to pre-log sinograms, disentangling the data into distinct underlying components, which are embedded in separate latent spaces. This explicit decomposition significantly enhances the model's capacity to learn discriminative features, directly improving reconstruction accuracy by mitigating noise and preserving information inherent in the pre-log domain. Extensive experiments demonstrate that PLOT-CT achieves state-of-the-art performance, attaining a 2.36 dB PSNR improvement over traditional methods at the 1e4 incident photon level in the pre-log domain.
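The claim that the logarithmic transformation amplifies noise can be made concrete with a short first-order derivation. It assumes a simple Poisson photon-counting model with incident count $I_0$ and line integral $p$, and ignores electronic noise; this is a textbook simplification, not necessarily the noise model used in the paper.

```latex
\begin{align}
N &\sim \mathrm{Poisson}(I), \qquad I = I_0\, e^{-p}, \\
\hat{p} &= -\ln\!\left(\frac{N}{I_0}\right), \\
\operatorname{Var}(\hat{p}) &\approx
  \left(\left.\frac{\partial \hat{p}}{\partial N}\right|_{N=I}\right)^{2}
  \operatorname{Var}(N)
  = \frac{1}{I^{2}}\, I
  = \frac{e^{p}}{I_0}.
\end{align}
```

Under this model the post-log variance grows exponentially with the attenuation along the ray and inversely with the dose. For example, at $I_0 = 10^4$ and $p = 5$, about 67 photons reach the detector and $\operatorname{Var}(\hat{p}) \approx 0.0148$, a standard deviation of roughly 0.12.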
A Large Language Model for Disaster Structural Reconnaissance Summarization
Yuqing Gao, Guanren Zhou, Khalid M. Mosalam
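Personalized recommendation rationale:

The paper's title explicitly targets a specific application in the disaster domain (disaster structural reconnaissance), a clearly domain-specific application unrelated to your core areas of interest in recommender systems, search, or advertising. Although it involves large language models, its application scenario falls entirely within the excluded domain-specific categories, such as medicine/biology, and has no potential application related to your current areas of interest.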

2026-02-12 05:14:45 | arXiv:2602.11588v1 |
cs.CV
Artificial Intelligence (AI)-aided vision-based Structural Health Monitoring (SHM) has emerged as an effective approach for monitoring and assessing structural condition by analyzing image and video data. By integrating Computer Vision (CV) and Deep Learning (DL), vision-based SHM can automatically identify and localize visual patterns associated with structural damage. However, previous works typically generate only discrete outputs, such as damage class labels and damage region coordinates, requiring engineers to further reorganize and analyze these results for evaluation and decision-making. In late 2022, Large Language Models (LLMs) became popular across multiple fields, providing new insights into AI-aided vision-based SHM. In this study, a novel LLM-based Disaster Reconnaissance Summarization (LLM-DRS) framework is proposed. It introduces a standard reconnaissance plan in which the collection of vision data and corresponding metadata follows a well-designed on-site investigation process. Text-based metadata and image-based vision data are then processed and integrated into a unified format, where well-trained Deep Convolutional Neural Networks extract key attributes, including damage state, material type, and damage level. Finally, all data are fed into an LLM with carefully designed prompts, enabling the LLM-DRS to generate summary reports for individual structures or affected regions based on aggregated attributes and metadata. Results show that integrating LLMs into vision-based SHM, particularly for rapid post-disaster reconnaissance, demonstrates promising potential for improving resilience of the built environment through effective reconnaissance.
ReaDy-Go: Real-to-Sim Dynamic 3D Gaussian Splatting Simulation for Environment-Specific Visual Navigation with Moving Obstacles
Seungyeon Yoo, Youngseok Jang, Dabin Kim, Youngsoo Han, Seungwoo Jung, H. Jin Ki...
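Personalized recommendation rationale:

The paper focuses on dynamic 3D scene reconstruction and robot navigation within computer vision, a purely vision/robotics research direction. The concepts named in the title ('visual navigation', 'moving obstacles', '3D Gaussian Splatting') have no direct connection to the core technology stack of recommender systems, search, or advertising (such as ranking models, user behavior modeling, and feature engineering), and the paper does not touch on current focus areas such as Transformer architectures, LLM techniques, or multimodal modeling.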

2026-02-12 04:48:18 | arXiv:2602.11575v1 |
cs.RO cs.AI cs.CV
Visual navigation models often struggle in real-world dynamic environments due to limited robustness to the sim-to-real gap and the difficulty of training policies tailored to target deployment environments (e.g., households, restaurants, and factories). Although real-to-sim navigation simulation using 3D Gaussian Splatting (GS) can mitigate this gap, prior works have assumed only static scenes or unrealistic dynamic obstacles, despite the importance of safe navigation in dynamic environments. To address these issues, we propose ReaDy-Go, a novel real-to-sim simulation pipeline that synthesizes photorealistic dynamic scenarios for target environments. ReaDy-Go generates photorealistic navigation datasets for dynamic environments by combining a reconstructed static GS scene with dynamic human GS obstacles, and trains policies robust to both the sim-to-real gap and moving obstacles. The pipeline consists of three components: (1) a dynamic GS simulator that integrates scene GS with a human animation module, enabling the insertion of animatable human GS avatars and the synthesis of plausible human motions from 2D trajectories, (2) navigation dataset generation for dynamic environments that leverages the simulator, a robot expert planner designed for dynamic GS representations, and a human planner, and (3) policy learning using the generated datasets. ReaDy-Go outperforms baselines across target environments in both simulation and real-world experiments, demonstrating improved navigation performance even after sim-to-real transfer and in the presence of moving obstacles. Moreover, zero-shot sim-to-real deployment in an unseen environment indicates its generalization potential. Project page: https://syeon-yoo.github.io/ready-go-site/.
LUVE: Latent-Cascaded Ultra-High-Resolution Video Generation with Dual Frequency Experts
Chen Zhao, Jiawei Chen, Hongyu Li, Zhuoliang Kang, Shilin Lu, Xiaoming Wei, Kai ...
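Personalized recommendation rationale:

The title focuses squarely on video generation, a purely visual content generation area. Although it mentions concepts such as latent representations and expert models, its core application (ultra-high-resolution video generation) has no direct connection to ranking tasks in recommender systems, search, or advertising, so it falls clearly among the irrelevant topics.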

2026-02-12 04:35:16 | arXiv:2602.11564v1 |
cs.CV
Recent advances in video diffusion models have significantly improved visual quality, yet ultra-high-resolution (UHR) video generation remains a formidable challenge due to the compounded difficulties of motion modeling, semantic planning, and detail synthesis. To address these limitations, we propose LUVE, a Latent-cascaded UHR Video generation framework built upon dual frequency Experts. LUVE employs a three-stage architecture comprising low-resolution motion generation for motion-consistent latent synthesis, video latent upsampling that performs resolution upsampling directly in the latent space to mitigate memory and computational overhead, and high-resolution content refinement that integrates low-frequency and high-frequency experts to jointly enhance semantic coherence and fine-grained detail generation. Extensive experiments demonstrate that LUVE achieves superior photorealism and content fidelity in UHR video generation, and comprehensive ablation studies further validate the effectiveness of each component. The project is available at https://unicornanrocinu.github.io/LUVE_web/.
HyperDet: 3D Object Detection with Hyper 4D Radar Point Clouds
Yichun Xiao, Runwei Guan, Fangqiang Ding
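Personalized recommendation rationale:

The paper focuses on 3D vision and radar point cloud processing, a purely computer vision area with no direct connection to the core technologies of recommender systems, search, or advertising. Its content covers sensor data processing and 3D object detection and shows no potential application value in recommendation, search, or advertising scenarios.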

2026-02-12 04:21:58 | arXiv:2602.11554v1 |
cs.RO cs.CV cs.LG
4D mmWave radar provides weather-robust, velocity-aware measurements and is more cost-effective than LiDAR. However, radar-only 3D detection still trails LiDAR-based systems because radar point clouds are sparse, irregular, and often corrupted by multipath noise, yielding weak and unstable geometry. We present HyperDet, a detector-agnostic radar-only 3D detection framework that constructs a task-aware hyper 4D radar point cloud for standard LiDAR-oriented detectors. HyperDet aggregates returns from multiple surround-view 4D radars over consecutive frames to improve coverage and density, then applies geometry-aware cross-sensor consensus validation with a lightweight self-consistency check outside overlap regions to suppress inconsistent returns. It further integrates a foreground-focused diffusion module with training-time mixed radar-LiDAR supervision to densify object structures while lifting radar attributes (e.g., Doppler, RCS); the model is distilled into a consistency model for single-step inference. On MAN TruckScenes, HyperDet consistently improves over raw radar inputs with VoxelNeXt and CenterPoint, partially narrowing the radar-LiDAR gap. These results show that input-level refinement enables radar to better leverage LiDAR-oriented detectors without architectural modifications.
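The idea of cross-sensor consensus in the abstract, keeping a radar return only when another sensor corroborates it, can be sketched with a brute-force neighbour test. This is a simplified illustration, not HyperDet's validation step. The point layout `(x, y, z, sensor_id)`, the function name `cross_sensor_consensus`, and the fixed radius are assumptions, and the self-consistency check the abstract mentions for regions outside sensor overlap is omitted.

```python
import math


def cross_sensor_consensus(points, radius=1.0):
    """Keep points that have at least one neighbour from a different sensor.

    points: list of (x, y, z, sensor_id) tuples (hypothetical layout).
    Uses an O(n^2) scan for clarity; a spatial index would be needed at scale.
    """
    kept = []
    for i, (x, y, z, sensor) in enumerate(points):
        for j, (ox, oy, oz, other) in enumerate(points):
            if i == j or other == sensor:
                continue  # only returns from other sensors count as support
            if math.dist((x, y, z), (ox, oy, oz)) <= radius:
                kept.append(points[i])
                break
    return kept


if __name__ == "__main__":
    cloud = [
        (0.0, 0.0, 0.0, "front"),
        (0.3, 0.0, 0.0, "left"),    # supports the first point
        (10.0, 10.0, 0.0, "front"),
        (10.2, 10.0, 0.0, "front"),  # same sensor, so no cross-sensor support
    ]
    print(cross_sensor_consensus(cloud))
```

A filter like this only makes sense where sensor fields of view overlap, which is presumably why the abstract pairs it with a separate check elsewhere.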
Perception-based Image Denoising via Generative Compression
Nam Nguyen, Thinh Nguyen, Bella Bose
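Personalized recommendation rationale:

The title indicates a focus on image denoising and generative compression within computer vision, which is purely visual processing. The content has no direct connection to core concerns of recommender systems, search, or advertising such as ranking, modeling, and architecture, and it does not involve key technical directions such as LLMs, Transformers, or unified modeling of heterogeneous data.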

2026-02-12 04:21:26 | arXiv:2602.11553v1 |
cs.CV cs.AI
Image denoising aims to remove noise while preserving structural details and perceptual realism, yet distortion-driven methods often produce over-smoothed reconstructions, especially under strong noise and distribution shift. This paper proposes a generative compression framework for perception-based denoising, where restoration is achieved by reconstructing from entropy-coded latent representations that enforce low-complexity structure, while generative decoders recover realistic textures via perceptual measures such as learned perceptual image patch similarity (LPIPS) loss and Wasserstein distance. Two complementary instantiations are introduced: (i) a conditional Wasserstein GAN (WGAN)-based compression denoiser that explicitly controls the rate-distortion-perception (RDP) trade-off, and (ii) a conditional diffusion-based reconstruction strategy that performs iterative denoising guided by compressed latents. We further establish non-asymptotic guarantees for the compression-based maximum-likelihood denoiser under additive Gaussian noise, including bounds on reconstruction error and decoding error probability. Experiments on synthetic and real-noise benchmarks demonstrate consistent perceptual improvements while maintaining competitive distortion performance.
Supervise-assisted Multi-modality Fusion Diffusion Model for PET Restoration
Yingkai Zhang, Shuang Chen, Ye Tian, Yunyi Gao, Jianyong Jiang, Ying Fu
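Personalized recommendation rationale:

The title points to medical imaging (PET image restoration), a medicine/biology-specific application entirely unrelated to RecSys/Search/Ads. Although diffusion models are a general generative modeling technique, the paper's application is strictly confined to medical image processing, and nothing indicates that its techniques can be transferred or applied to recommender systems, search, or advertising.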

2026-02-12 04:06:48 | arXiv:2602.11545v1 |
cs.CV
Positron emission tomography (PET) offers powerful functional imaging but involves radiation exposure. Efforts to reduce this exposure by lowering the radiotracer dose or scan time can degrade image quality. While using magnetic resonance (MR) images with clearer anatomical information to restore standard-dose PET (SPET) from low-dose PET (LPET) is a promising approach, it faces challenges with the inconsistencies in the structure and texture of multi-modality fusion, as well as the mismatch in out-of-distribution (OOD) data. In this paper, we propose a supervise-assisted multi-modality fusion diffusion model (MFdiff) for addressing these challenges for high-quality PET restoration. Firstly, to fully utilize auxiliary MR images without introducing extraneous details in the restored image, a multi-modality feature fusion module is designed to learn an optimized fusion feature. Secondly, using the fusion feature as an additional condition, high-quality SPET images are iteratively generated based on the diffusion model. Furthermore, we introduce a two-stage supervise-assisted learning strategy that harnesses both generalized priors from simulated in-distribution datasets and specific priors tailored to in-vivo OOD data. Experiments demonstrate that the proposed MFdiff effectively restores high-quality SPET images from multi-modality inputs and outperforms state-of-the-art methods both qualitatively and quantitatively.
Vascular anatomy-aware self-supervised pre-training for X-ray angiogram analysis
De-Xing Huang, Chaohui Yu, Xiao-Hu Zhou, Tian-Yu Xiang, Qin-Yi Zhang, Mei-Jiang ...
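Personalized recommendation rationale:

The title focuses on medical image analysis (X-ray angiography), a clearly medical application entirely unrelated to the commercial areas the user follows, such as recommender systems, search, and advertising. Although the self-supervised pre-training technique named in the title is general, the paper confines it to the specific medical setting of vascular anatomy and shows no potential for transfer to recommendation, search, or advertising.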

2026-02-12 03:52:44 | arXiv:2602.11536v1 |
cs.CV
X-ray angiography is the gold standard imaging modality for cardiovascular diseases. However, current deep learning approaches for X-ray angiogram analysis are severely constrained by the scarcity of annotated data. While large-scale self-supervised learning (SSL) has emerged as a promising solution, its potential in this domain remains largely unexplored, primarily due to the lack of effective SSL frameworks and large-scale datasets. To bridge this gap, we introduce a vascular anatomy-aware masked image modeling (VasoMIM) framework that explicitly integrates domain-specific anatomical knowledge. Specifically, VasoMIM comprises two key designs: an anatomy-guided masking strategy and an anatomical consistency loss. The former strategically masks vessel-containing patches to compel the model to learn robust vascular semantics, while the latter preserves structural consistency of vessels between original and reconstructed images, enhancing the discriminability of the learned representations. In conjunction with VasoMIM, we curate XA-170K, the largest X-ray angiogram pre-training dataset to date. We validate VasoMIM on four downstream tasks across six datasets, where it demonstrates superior transferability and achieves state-of-the-art performance compared to existing methods. These findings highlight the significant potential of VasoMIM as a foundation model for advancing a wide range of X-ray angiogram analysis tasks. VasoMIM and XA-170K will be available at https://github.com/Dxhuang-CASIA/XA-SSL.
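The anatomy-guided masking described in the abstract, preferring vessel-containing patches when choosing what to mask, can be sketched as weighted sampling without replacement. The abstract does not specify VasoMIM's actual sampling rule, so everything here is an assumption: the function name `anatomy_guided_mask`, the per-patch `vessel_fraction` input, the `floor` weight that keeps background patches selectable, and the use of the Efraimidis-Spirakis key trick.

```python
import random


def anatomy_guided_mask(vessel_fraction, mask_ratio=0.75, floor=0.05, seed=0):
    """Pick patch indices to mask, favouring patches with more vessel pixels.

    vessel_fraction: per-patch share of vessel pixels in [0, 1] (hypothetical
    input, e.g. from a coarse vessel segmentation).
    floor: positive base weight so vessel-free patches can still be masked.
    """
    rng = random.Random(seed)
    n_mask = round(mask_ratio * len(vessel_fraction))
    keys = []
    for idx, frac in enumerate(vessel_fraction):
        weight = floor + frac  # must stay > 0
        # Efraimidis-Spirakis: the largest u**(1/w) keys form a weighted sample.
        keys.append((rng.random() ** (1.0 / weight), idx))
    keys.sort(reverse=True)
    return sorted(idx for _, idx in keys[:n_mask])


if __name__ == "__main__":
    # 16 patches, the first 4 contain vessels.
    fractions = [0.9] * 4 + [0.0] * 12
    print(anatomy_guided_mask(fractions, mask_ratio=0.25))
```

Compared with uniform random masking, a rule like this spends more of the reconstruction budget on vessel regions, which matches the stated goal of forcing the model to learn vascular semantics.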
What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation
Zhenlong Yuan, Xiangyan Qu, Jing Tang, Rui Chen, Lei Sun, Ruidong Chen, Hongwei ...
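Personalized recommendation rationale:

The title concerns agent imagination and open-vocabulary human-object interaction (HOI) comprehension, centering on enhancing agent capabilities in a specific interaction setting, with no direct connection to core advances in recommender systems, search, or advertising, to LLM applications, or to Transformer architecture improvements. The 'generation' in the title may involve content generation, but that is a purely LLM-centric topic outside the current focus.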

2026-02-12 02:51:59 | arXiv:2602.11499v1 |
cs.CV
Multimodal Large Language Models have shown promising capabilities in bridging visual and textual reasoning, yet their reasoning capabilities in Open-Vocabulary Human-Object Interaction (OV-HOI) are limited by cross-modal hallucinations and occlusion-induced ambiguity. To address this, we propose ImagineAgent, an agentic framework that harmonizes cognitive reasoning with generative imagination for robust visual understanding. Specifically, our method constructs cognitive maps that explicitly model plausible relationships between detected entities and candidate actions. Subsequently, it dynamically invokes tools including retrieval augmentation, image cropping, and diffusion models to gather domain-specific knowledge and enriched visual evidence, thereby achieving cross-modal alignment in ambiguous scenarios. Moreover, we propose a composite reward that balances prediction accuracy and tool efficiency. Evaluations on the SWIG-HOI and HICO-DET datasets demonstrate SOTA performance while requiring approximately 20% of the training data used by existing methods, validating the robustness and efficiency of our approach.
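The composite reward mentioned in the abstract, which balances prediction accuracy against tool efficiency, can be sketched as an accuracy term minus a penalty on tool usage. The abstract does not give the formula, so this is only one plausible shape. The function name `composite_reward`, the cap `max_tool_calls`, and the weight `efficiency_weight` are assumptions.

```python
def composite_reward(correct, tool_calls, max_tool_calls=5, efficiency_weight=0.2):
    """Reward = accuracy term minus a penalty proportional to tool usage.

    correct: whether the predicted interaction matches the label.
    tool_calls: number of tool invocations in the episode (clipped to the cap).
    """
    accuracy_term = 1.0 if correct else 0.0
    used = min(tool_calls, max_tool_calls) / max_tool_calls
    return accuracy_term - efficiency_weight * used


if __name__ == "__main__":
    for calls in (0, 2, 5):
        print(calls, composite_reward(True, calls), composite_reward(False, calls))
```

With a small `efficiency_weight`, a correct answer always outscores an incorrect one, and the penalty only ranks episodes that reach the same prediction with different numbers of tool calls.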