arXiv Daily Paper Digest

2026-05-15
Total papers: 95
Selected papers: 20
Average score: 2.8
Showing 95 papers (95 total)
Discrimination Is Generation: Unifying Ranking and Retrieval from a Tokenizer Perspective
Shuli Wang, Junwei Yin, Changhao Li, Senjie Kou, Chi Wang, Yinqiu Huang, Yinhua ...
Core summary:

This paper studies the decoupling of semantic IDs (SIDs) from personalization signals in generative recommendation. It proposes embedding the tokenizer inside a discriminative ranking model for end-to-end training, so that the ranker naturally turns into a retrieval model, unifying the ranking and retrieval tasks.

Personalized recommendation rationale:

The method proposes a new paradigm that unifies ranking and retrieval, training the LLM encoder and the ranking model end to end; a major contribution to generative recommendation and retrieval.

2026-05-14 13:59:29 | arXiv:2605.14853v1 |
cs.IR
Semantic IDs (SIDs) define the generation space of generative recommendation and directly determine its personalization ceiling. However, existing tokenizers are trained independently with retrieval objectives, leaving personalization signals fully decoupled from the SID construction process -- a fundamental gap that causes generative retrieval to persistently lag behind discriminative ranking. In this paper, we rethink the essence of SIDs: ranking seeks argmax in item space while retrieval seeks argmax in token space; both are the same problem solved at different granularities. Based on this insight, we propose DIG (Discrimination Is Generation), which embeds the tokenizer inside a discriminative ranking model for end-to-end training -- the ranker naturally becomes a retrieval model, yielding two models from a single training run. DIG is organized around a feature assignment taxonomy: item-intrinsic static features are encoded into SIDs, user-item cross features (u2i) implicitly drive codebook boundaries toward recommendation decision boundaries during training, and an MLP_u2t distillation module approximates u2i at the token level for inference. Experiments on three public benchmarks and two industrial datasets demonstrate that DIG simultaneously improves ranking, retrieval, and unified retrieval-ranking quality.
Asymmetric Generative Recommendation via Multi-Expert Projection and Multi-Faceted Hierarchical Quantization
Bin Huang, Xin Wang, Junwei Pan, Yongqi Zhou, Yifeng Zhou, Zhixiang Feng, Shudon...
Core summary:

This paper addresses the information bottleneck caused by symmetric semantic IDs in generative recommendation. It proposes AsymRec, an asymmetric framework: multi-expert projection preserves the semantic richness of inputs, while multi-faceted hierarchical quantization builds high-capacity discrete targets for outputs, decoupling continuous inputs from discrete outputs.

Personalized recommendation rationale:

Targeting the dual-stage information bottleneck in generative recommendation, the paper proposes an asymmetric continuous-discrete framework; the core idea is highly relevant and novel, directly applying LLM techniques to recommender systems.

2026-05-14 07:55:43 | arXiv:2605.14512v1 |
cs.IR cs.AI
Generative Recommendation (GenRec) models reformulate recommendation as a sequence generation task, representing items as discrete Semantic IDs used symmetrically as both inputs and prediction targets. We identify a critical dual-stage information bottleneck in this design: (1) the Input Bottleneck, where lossy quantization degrades fine-grained semantics, while popularity bias skews the learned representations toward frequent items, and (2) the Output Bottleneck, where imprecise discrete targets limit supervision quality. To address these issues, we propose AsymRec, an asymmetric continuous-discrete framework that decouples input and output representations. Specifically, Multi-expert Semantic Projection (MSP) maps continuous embeddings into the Transformer's hidden space via expert-specialized projections, preserving semantic richness and improving generalization to infrequent items. Multi-faceted Hierarchical Quantization (MHQ) constructs high-capacity, structured discrete targets through multi-view and multi-level quantization with semantic regularization, preventing dimensional collapse while retaining fine-grained distinctions. Extensive experiments demonstrate that AsymRec consistently outperforms state-of-the-art generative recommenders by an average of 15.8%. The code will be released.
Efficient Generative Retrieval for E-commerce Search with Semantic Cluster IDs and Expert-Guided RL
Jianbo Zhu, Xing Fang, Jing Wang, Mingmin Jin, Bokang Wang, Guangxin Song, Zheny...
Core summary:

Proposes CQ-SID, which lowers search complexity via hierarchical semantic cluster IDs, and EG-GRPO, a reinforcement learning method that aligns retrieval with ranking objectives, addressing the practical deployment of generative retrieval in e-commerce.

Personalized recommendation rationale:

Directly applicable to e-commerce search; focuses on the practical deployment of generative retrieval, combining semantic clustering with reinforcement learning to align with downstream objectives; closely related to my areas of interest.

2026-05-14 06:27:46 | arXiv:2605.14434v1 |
cs.IR cs.AI
Generative retrieval offers a promising alternative by unifying the fragmented multi-stage retrieval process into a single end-to-end model. However, its practical adoption in industrial e-commerce search remains challenging, given the massive and dynamic product catalogs, strict latency requirements, and the need to align retrieval with downstream ranking goals. In this work, we propose a retrieval framework tailored for real-world recall scenarios, positioning generative retrieval as a recall-stage supplement rather than an end-to-end replacement. Our method, CQ-SID (Category-and-Query constrained Semantic ID), employs category-aware and query-item contrastive learning along with Residual Quantized VAEs to encode items into hierarchical semantic cluster identifiers, significantly reducing beam search complexity. Additionally, we develop EG-GRPO (Expert-Guided Group Relative Policy Optimization), a reinforcement learning approach that aligns generative recall with downstream ranking under sparse rewards by injecting ground-truth samples to stabilize training. Offline experiments on TmallAPP search logs show that CQ-SID achieves up to 26.76% and 11.11% relative gains in semantic and personalized click hitrate over RQ-VAE baselines, while halving beam search size. EG-GRPO further improves multi-objective performance. Online A/B tests confirm gains in GMV (+1.15%) and UCTCVR (+0.40%). The generative recall channel now contributes substantially in production, accounting for over 50.25% of exposures, 58.96% of clicks, and 72.63% of purchases, demonstrating a viable path for deploying generative retrieval in real-world e-commerce systems.
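Hierarchical semantic IDs of the kind CQ-SID builds are typically derived by residual quantization: each level's codebook quantizes what the previous levels left unexplained, so earlier codes capture coarse clusters and later codes refine them. A minimal NumPy sketch of that general mechanism (the random codebooks and the `residual_quantize` helper are illustrative, not the paper's trained RQ-VAE):

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Encode x as one code index per level: at each level, pick the
    nearest codeword to the current residual, subtract it, continue."""
    ids = []
    residual = x.astype(float).copy()
    for cb in codebooks:                      # cb has shape (K, d)
        k = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        ids.append(k)
        residual = residual - cb[k]
    return ids, residual

rng = np.random.default_rng(0)
d = 8
codebooks = [rng.normal(size=(16, d)) for _ in range(3)]  # 3-level hierarchy
item = rng.normal(size=d)
sid, err = residual_quantize(item, codebooks)
# sid is the item's 3-token semantic ID; err is the final residual
```

Because beam search over such IDs proceeds level by level, coarser top levels (as in the paper's cluster IDs) directly shrink the branching factor.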
ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World
Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang
Core summary:

Targets the computational inefficiency, uneven language coverage, and lack of transparency of multilingual text embeddings; proposes the 3D-ML framework, which improves efficiency across the full model lifecycle via MRL, MLL, and MEL, and builds a large-scale multilingual dataset to train a suite of models.

Personalized recommendation rationale:

Proposes a brand-new framework addressing compute efficiency and low-resource coverage for multilingual embeddings; the core innovation (3D-ML) spans storage, inference, and parameter efficiency, closely matching the LLM-application and systems-optimization directions.

2026-05-14 17:05:26 | arXiv:2605.15081v1 |
cs.CL cs.AI
The development of high-quality text embeddings is increasingly drifting toward an exclusionary future, defined by three critical barriers: prohibitive computational costs, a narrow linguistic focus that neglects most of the world's languages, and a lack of transparency from closed-source or open-weight models that stifles research. To dismantle these barriers, we introduce ML-Embed, a suite of inclusive and efficient models built upon a new framework: 3-Dimensional Matryoshka Learning (3D-ML). Our framework addresses the computational challenge with comprehensive efficiency across the entire model lifecycle. Beyond the storage benefits of Matryoshka Representation Learning (MRL) and flexible inference-time depth provided by Matryoshka Layer Learning (MLL), we introduce Matryoshka Embedding Learning (MEL) for enhanced parameter efficiency. To address the linguistic challenge, we curate a massively multilingual dataset and train a suite of models ranging from 140M to 8B parameters. In a direct commitment to transparency, we release all models, data, and code. Extensive evaluation on 430 tasks demonstrates that our models set new records on 9 of 17 evaluated MTEB benchmarks, with particularly strong results in low-resource languages, providing a reproducible blueprint for building globally equitable and computationally efficient AI systems.
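Matryoshka Representation Learning, the storage-efficiency component the abstract builds on, trains embeddings so that any prefix of the vector is itself a usable embedding. A hedged sketch of how such nested prefixes are used at scoring time (the `mrl_scores` helper and the prefix widths are illustrative assumptions, not ML-Embed's API):

```python
import numpy as np

def mrl_scores(query, docs, dims=(4, 8, 16)):
    """Score docs against a query at several nested prefix widths.
    Matryoshka-style training supervises every prefix, so a truncated
    vector works as a cheaper standalone embedding."""
    out = {}
    for m in dims:
        q, D = query[:m], docs[:, :m]
        q = q / np.linalg.norm(q)
        D = D / np.linalg.norm(D, axis=1, keepdims=True)
        out[m] = D @ q                        # cosine similarity per doc
    return out

rng = np.random.default_rng(1)
query = rng.normal(size=16)
docs = rng.normal(size=(5, 16))
scores = mrl_scores(query, docs)              # one score vector per width
```

The same nesting idea, applied to layers (MLL) and parameters (MEL), is what the abstract's "3-Dimensional" framing refers to.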
Agentic Recommender System with Hierarchical Belief-State Memory
Xiang Shen, Yuhang Zhou, Yifan Wu, Zhuokai Zhao, Siyu Lin, Lei Huang, Qianqian Z...
Core summary:

Studies memory management for LLM agents in recommender systems; proposes a hierarchical belief-state memory (three tiers: event, preference, profile) and a six-operation lifecycle scheduled by an LLM-based planner, adaptively managing memory evolution and avoiding flat memories that conflate short- and long-term signals.

Personalized recommendation rationale:

The paper focuses on optimizing the memory structure of LLM agents in recommendation, proposing a hierarchical belief-state memory with an adaptive lifecycle; highly relevant to direct LLM applications in recommender systems, and the summary contains no description of experimental results.

2026-05-14 05:38:24 | arXiv:2605.14401v1 |
cs.CL cs.AI
Memory-augmented LLM agents have advanced personalized recommendation, yet existing approaches universally adopt flat memory representations that conflate ephemeral signals with stable preferences, and none provides a complete lifecycle governing how memory should evolve. We propose MARS (Memory-Augmented Agentic Recommender System), a framework that treats recommendation as a partially observable problem and maintains a structured belief state that progressively abstracts noisy behavioral observations into a compact estimate of user preferences. MARS organizes this belief state into three tiers: event memory buffers raw signals, preference memory maintains fine-grained mutable chunks with explicit strength and evidence tracking, and profile memory distills all preferences into a coherent natural language narrative. A complete lifecycle of six operations -- extraction, reinforcement, weakening, consolidation, forgetting, and resynthesis -- is adaptively scheduled by an LLM-based planner rather than fixed-interval heuristics. Experiments on four InstructRec benchmark domains show that MARS achieves state-of-the-art performance with average improvements of 26.4% in HR@1 and 10.3% in NDCG@10 over the strongest baselines with further gains from agentic scheduling in evolving settings.
Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems
Ling Wang, Songnan Liu, Jianan Wang, Cheng Cheng, Xin Liu, Yihan Zhu, Enyu Li, Y...
Core summary:

Addresses hallucination and unauditability in multi-hop, n-ary LLM reasoning over heterogeneous enterprise systems. The core method is a stratified hypergraph ontology whose hyperedges encode business rules and procedural protocols, combined with an evidence-driven reasoning loop that dynamically orchestrates tools, without retraining the LLM.

Personalized recommendation rationale:

Proposes an enterprise agentic reasoner built on a stratified hypergraph ontology, fusing LLMs with structured knowledge; directly applicable to complex multi-hop reasoning and anomaly detection in search and recommendation, closely matching my areas of interest.

2026-05-14 01:57:59 | arXiv:2605.14259v1 |
cs.AI cs.CL
Applying Large Language Models (LLMs) to heterogeneous enterprise systems is hindered by hallucinations and failures in multi-hop, n-ary reasoning. Existing paradigms (e.g., GraphRAG, NL2SQL) lack the semantic grounding and auditable execution required for these complex environments. We introduce HEAR, an enterprise agentic reasoner built on a Stratified Hypergraph Ontology. Its base Graph Layer virtualizes provenance-aware data interfaces, while the Hyperedge Layer encodes n-ary business rules and procedural protocols. Operating an evidence-driven reasoning loop, HEAR dynamically orchestrates ontology tools for structured multi-hop analysis without requiring LLM retraining. Evaluations on supply-chain tasks, including order fulfillment blockage root cause analysis (RCA), show HEAR achieves up to 94.7% accuracy. Crucially, HEAR demonstrates adaptive efficiency: utilizing procedural hyperedges to minimize token costs, while leveraging topological exploration for rigorous correctness on complex queries. By matching proprietary model performance with open-weight backbones and automating manual diagnostics, HEAR establishes a scalable, auditable foundation for enterprise intelligence.
Logging Policy Design for Off-Policy Evaluation
Connor Douglas, Joel Persson, Foster Provost
Core summary:

Targets logging-policy design for off-policy evaluation; proposes a unified framework that derives optimal logging policies under three informational regimes (known, unknown, and partially known) and reveals a reward-coverage tradeoff.

Personalized recommendation rationale:

The paper studies logging-policy design for off-policy evaluation in recommender systems, directly addressing data-collection optimization; critical for evaluating and improving recommender systems.

2026-05-14 17:25:19 | arXiv:2605.15108v1 |
stat.ML cs.AI cs.IR cs.LG stat.ME
Off-policy evaluation (OPE) estimates the value of a target treatment policy (e.g., a recommender system) using data collected by a different logging policy. It enables high-stakes experimentation without live deployment, yet in practice accuracy depends heavily on the logging policy used to collect data for computing the estimate. We study how to design logging policies that minimize OPE error for given target policies. We characterize a fundamental reward-coverage tradeoff: concentrating probability mass on high-reward actions reduces variance but risks missing signal on actions the target policy may take. We propose a unifying framework for logging policy design and derive optimal policies in canonical informational regimes where the target policy and reward distribution are (i) known, (ii) unknown, and (iii) partially known through priors or noisy estimates at logging time. Our results provide actionable guidance for firms choosing among multiple candidate recommendation systems. We demonstrate the importance of treatment selection when gathering data for OPE, and describe theoretically optimal approaches when this is a firm's primary objective. We also distill practical design principles for selecting logging policies when operational constraints prevent implementing the theoretical optimum.
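For context, the standard OPE estimator this line of work builds on is inverse propensity scoring (IPS): each logged reward is reweighted by how much more (or less) likely the target policy is to take the logged action than the logging policy was. The logging policy's action probabilities sit in the denominator, which is exactly why its design drives estimator variance. A toy sketch (the bandit setup and `ips_estimate` helper are illustrative, not from the paper):

```python
import numpy as np

def ips_estimate(rewards, log_probs, target_probs):
    """Classic inverse-propensity-scoring value estimate of a
    target policy from data logged under a different policy."""
    w = target_probs / log_probs              # importance weights
    return float(np.mean(w * rewards))

# Toy bandit: uniform logging over 2 actions, deterministic rewards.
rng = np.random.default_rng(2)
actions = rng.integers(0, 2, size=10000)
rewards = np.where(actions == 1, 1.0, 0.0)    # action 1 always pays 1
log_probs = np.full(10000, 0.5)               # uniform logging policy
target = np.array([0.1, 0.9])                 # target prefers action 1
v_hat = ips_estimate(rewards, log_probs, target[actions])
# True target value is 0.9; the estimate concentrates around it
```

Concentrating logging probability on high-reward actions shrinks the weights `w` where rewards matter, but starves coverage of the remaining actions; that is the reward-coverage tradeoff the paper formalizes.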
Stop Overthinking: Unlocking Efficient Listwise Reranking with Minimal Reasoning
Danyang Liu, Kan Li
Core summary:

Targets the high computational cost of LLM overthinking in listwise reranking; proposes a length-regularized self-distillation framework that uses Pareto-optimal selection of short but effective reasoning chains to train a student model that preserves ranking quality while drastically reducing reasoning tokens.

Personalized recommendation rationale:

Directly addresses LLM reasoning efficiency in listwise reranking; the core method and motivation are tightly related to LLM applications in search ranking.

2026-05-14 06:44:44 | arXiv:2605.14450v1 |
cs.IR
Listwise reranking utilizing Large Language Models (LLMs) has achieved state-of-the-art retrieval effectiveness. Recently, reasoning-enhanced models have further pushed these boundaries by employing Chain-of-Thought (CoT) to perform deep comparative analysis of candidate documents. However, this performance gain comes at a prohibitive computational cost, as models often generate thousands of reasoning tokens before producing a final ranking. In this work, we investigate the relationship between reasoning length and ranking quality, revealing an overthinking phenomenon where extended reasoning yields diminishing returns. To address this, we propose a Length-Regularized Self-Distillation framework. We synthesize a dataset by sampling diverse reasoning traces from a teacher model (Rank-K) and applying a Pareto-inspired filter to select traces that achieve high ranking performance with minimal token usage. By fine-tuning on these concise, high-quality rationales, the student model learns to internalize efficient reasoning patterns, effectively pruning redundant deliberation. Experiments on TREC Deep Learning and NeuCLIR benchmarks demonstrate that our method maintains the teacher's effectiveness while reducing inference token consumption by 34%-37% across different retrieval settings, offering a practical solution for deploying reasoning-enhanced rerankers in latency-sensitive applications.
Is Grep All You Need? How Agent Harnesses Reshape Agentic Search
Sahil Sen, Akhil Kasturi, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah
Core summary:

Studies how the choice of retrieval strategy in agentic search interacts with agent architecture and tool-calling paradigm. The core finding: grep is more accurate than vector retrieval in their experiments, but overall performance depends strongly on the tool-calling style and the amount of irrelevant context.

Personalized recommendation rationale:

Systematically compares grep and vector retrieval in agentic search and explores the effects of tool-calling paradigms and irrelevant context; directly concerns LLM agents, RAG, and search systems, highly relevant to the core area and to LLM applications.

2026-05-14 17:58:41 | arXiv:2605.15184v1 |
cs.CL
Recent advances in Large Language Model (LLM) agents have enabled complex agentic workflows where models autonomously retrieve information, call tools, and reason over large corpora to complete tasks on behalf of users. Despite the growing adoption of retrieval-augmented generation (RAG) in agentic search systems, existing literature lacks a systematic comparison of how retrieval strategy choice interacts with agent architecture and tool-calling paradigm. Important practical dimensions, including how tool outputs are presented to the model and how performance changes when searches must cope with more irrelevant surrounding text, remain under-explored in agent loops. This paper reports an empirical study organized into two experiments. Experiment 1 compares grep and vector retrieval on a 116-question sample from LongMemEval, using a custom agent harness (Chronos) and provider-native CLI harnesses (Claude Code, Codex, and Gemini CLI), for both inline tool results and file-based tool results that the model reads separately. Experiment 2 compares grep-only and vector-only retrieval while progressively mixing in additional unrelated conversation history, so that each query is embedded in more distracting material alongside the passages that matter. Across Chronos and the provider CLIs, grep generally yields higher accuracy than vector retrieval in our comparisons in experiment 1; at the same time, overall scores still depend strongly on which harness and tool-calling style is used, even when the underlying conversation data are the same.
Do Composed Image Retrieval Benchmarks Require Multimodal Composition?
Matteo Attimonelli, Alessandro De Bellis, Aryo Pradipta Gema, Rohit Saxena, Moni...
Core summary:

Questions whether CIR benchmarks truly measure multimodal composition; analysis shows that a large fraction of queries admit unimodal shortcuts. Proposes a method to identify and validate genuinely compositional queries and argues models should be evaluated on the denoised subset.

Personalized recommendation rationale:

Reveals unimodal shortcuts in CIR benchmarks that lead to overestimated model capability; essential for understanding the real challenges of multimodal retrieval, directly relevant to my research focus.

2026-05-14 12:56:36 | arXiv:2605.14787v1 |
cs.CV cs.CL
Composed Image Retrieval (CIR) is a multimodal retrieval task where a query consists of a reference image and a textual modification, and the goal is to retrieve a target image satisfying both. In principle, strong performance on CIR benchmarks is assumed to require multimodal composition, i.e., combining complementary information from reference image and textual modification. In this work, we show that this assumption does not always hold. Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, a large fraction of queries can be solved using a single modality (from 32.2% to 83.6%), revealing pervasive unimodal shortcuts. Thus, high CIR performance can arise from unimodal signals rather than true multimodal composition. To better understand this issue, we perform a two-stage audit. First, we identify shortcut-solvable queries through cross-model analysis. Second, we conduct human validation on 4,741 shortcut-free queries, of which only 1,689 are well-formed, with common issues including ambiguous edits and mismatched targets. Re-evaluating models on this validated subset reveals qualitatively different behaviour: queries can no longer be solved with a single modality, and successful retrieval requires combining both inputs. While accuracy decreases, reliance on multimodal information increases. Overall, current CIR benchmarks conflate shortcut-solvable, noisy, and genuinely compositional queries, leading to an overestimation of model capability in multimodal composition.
EndPrompt: Efficient Long-Context Extension via Terminal Anchoring
Han Tian, Luxuan Chen, Xinran Chen, Rui Kong, Fang Wang, Jiamin Chen, Jinman Zha...
Core summary:

Topic: efficiently extending LLM context windows. Core method: split short training sequences into an intact first segment and a terminal-prompt second segment to simulate long-range positional encodings, achieving long-context generalization from sparse positional supervision and avoiding full-length training.

Personalized recommendation rationale:

The paper focuses on long-context extension for LLMs, a core LLM technique with potential value for search/recommendation/ads, e.g., handling long user-behavior sequences or long-document recommendation.

2026-05-14 09:00:03 | arXiv:2605.14589v1 |
cs.CL
Extending the context window of large language models typically requires training on sequences at the target length, incurring quadratic memory and computational costs that make long-context adaptation expensive and difficult to reproduce. We propose EndPrompt, a method that achieves effective context extension using only short training sequences. The core insight is that exposing a model to long-range relative positional distances does not require constructing full-length inputs: we preserve the original short context as an intact first segment and append a brief terminal prompt as a second segment, assigning it positional indices near the target context length. This two-segment construction introduces both local and long-range relative distances within a short physical sequence while maintaining the semantic continuity of the training text--a property absent in chunk-based simulation approaches that split contiguous context. We provide a theoretical analysis grounded in Rotary Position Embedding and the Bernstein inequality, showing that position interpolation induces a rigorous smoothness constraint over the attention function, with shared Transformer parameters further suppressing unstable extrapolation to unobserved intermediate distances. Applied to LLaMA-family models extending the context window from 8K to 64K, EndPrompt achieves an average RULER score of 76.03 and the highest average on LongBench, surpassing LCEG (72.24), LongLoRA (72.95), and full-length fine-tuning (69.23) while requiring substantially less computation. These results demonstrate that long-context generalization can be induced from sparse positional supervision, challenging the prevailing assumption that dense long-sequence training is necessary for reliable context-window extension. The code is available at https://github.com/clx1415926/EndPrompt.
Active Learners as Efficient PRP Rerankers
Jeremías Figueiredo Paschmann, Juan Kaplan, Francisco Nattero Santiago Mauricio ...
Core summary:

Reframes PRP reranking as active learning from noisy pairwise comparisons, using a randomized-direction oracle to reduce call counts; this addresses the mismatch between sorting assumptions and noisy, intransitive judgments and improves ranking quality under a call budget.

Personalized recommendation rationale:

Directly targets the efficiency of LLMs as rerankers, proposing an active-learning framework that improves PRP; highly relevant to direct LLM applications and core-area progress.

2026-05-14 01:03:53 | arXiv:2605.14236v1 |
cs.LG cs.AI cs.CL
Pairwise Ranking Prompting (PRP) elicits pairwise preference judgments from an LLM, which are then aggregated into a ranking, usually via classical sorting algorithms. However, judgments are noisy, order-sensitive, and sometimes intransitive, so sorting assumptions do not match the setting. Because sorting aims to recover a full permutation, truncating it to meet a call budget does not produce a dependable top-K. We thus reframe PRP reranking as active learning from noisy pairwise comparisons and show that active rankers are drop-in replacements that improve NDCG@10 per call in the call-constrained regime. Our noise-robust framework also introduces a randomized-direction oracle that uses a single LLM call per pair. This approach converts systematic position bias into zero-mean noise, enabling unbiased aggregate ranking without the cost of bidirectional calls.
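The randomized-direction oracle can be illustrated with a simulated position-biased judge: flipping the presentation order uniformly at random makes the outcome independent of which item happened to be listed first, turning the systematic first-slot preference into symmetric noise. A hedged sketch (the `biased_judge` simulator and its bias level are assumptions, not the paper's setup):

```python
import random

def biased_judge(a, b, bias=0.3):
    """Stand-in for an LLM pairwise judge with position bias: with
    probability `bias` it just returns the first-listed item,
    otherwise it returns the truly better one (here: the larger)."""
    if random.random() < bias:
        return a
    return a if a > b else b

def randomized_oracle(a, b, judge):
    """One judge call per pair, with the presentation order flipped
    uniformly at random before the call."""
    if random.random() < 0.5:
        return judge(a, b)
    return judge(b, a)

def win_rate(f, a, b, n=20000):
    """Fraction of trials in which the truly better item wins."""
    return sum(f(a, b) == max(a, b) for _ in range(n)) / n

random.seed(0)
p12 = win_rate(biased_judge, 1, 2)                              # ~0.70
p21 = win_rate(biased_judge, 2, 1)                              # 1.00
r12 = win_rate(lambda a, b: randomized_oracle(a, b, biased_judge), 1, 2)
r21 = win_rate(lambda a, b: randomized_oracle(a, b, biased_judge), 2, 1)
# Raw judge: the answer depends on listing order; randomized: it does not
```

Unlike bidirectional querying, which issues both orders and costs two calls per pair, the randomized direction keeps the single-call budget.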
MeMo: Memory as a Model
Ryan Wei Heng Quek, Sanghyuk Lee, Alfred Wei Lun Leong, Arun Verma, Alok Prakash...
Core summary:

Studies how to inject new knowledge into large models efficiently without updating their parameters; proposes the MeMo framework, which encodes new knowledge in a dedicated memory model while keeping the LLM unchanged, supporting cross-document relationship capture and robustness to retrieval noise.

Personalized recommendation rationale:

Proposes a modular memory framework that lets LLMs absorb new knowledge without parameter updates; the core idea could apply to knowledge injection in recommender systems or search scenarios, though it does not directly target RecSys/Search/Ads.

2026-05-14 17:51:34 | arXiv:2605.15156v1 |
cs.CL cs.AI cs.LG
Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In this paper, we introduce MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged. Compared to existing methods, MeMo offers several advantages: (a) it captures complex cross-document relationships, (b) it is robust to retrieval noise, (c) it avoids catastrophic forgetting in the LLM, (d) it does not require access to the LLM's weights or output logits, enabling plug-and-play integration with both open and proprietary closed-source LLMs, and (e) its retrieval cost is independent of corpus size at inference time. Our experimental results on three benchmarks, BrowseComp-Plus, NarrativeQA, and MuSiQue, show that MeMo achieves strong performance compared to existing methods across diverse settings.
Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing
Jie Jiang, Xing Sun
Core summary:

Research problem: draft models in speculative decoding are optimized token by token, yielding suboptimal window-level performance. Core method: PPOW, a performance-driven policy-optimization framework with adaptive windowing that performs window-level optimization via a cost-aware speedup reward, a distribution-proximity reward, and divergence-based window selection.

Personalized recommendation rationale:

This paper belongs to LLM inference acceleration and is not directly about applying LLMs to recommendation/search/ads, but its policy-optimization and adaptive-windowing methods may inspire model-efficiency improvements, indirectly helping LLM deployment in latency-sensitive scenarios.

2026-05-14 15:41:57 | arXiv:2605.14978v1 |
cs.CL
Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the speculative window. Most learning-based drafters are still optimized with token-level supervised objectives, even though speculative utility is inherently window-level and prefix-sensitive. We propose PPOW (Performance-Driven Policy Optimization with Adaptive Windowing), a reinforcement learning framework that shifts drafter optimization from token-level imitation to window-level optimization. PPOW combines a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing, which prioritizes informative windows with high confidence-weighted draft-target divergence. PPOW achieves average acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36× across multiple model families and benchmarks under a unified decoding protocol. These results show that performance-driven window-level optimization is a practical approach to improving speculative decoding efficiency.
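For readers unfamiliar with the baseline PPOW optimizes, the core speculative loop is draft-then-verify. The toy below uses a greedy-matching simplification (the `speculative_step` helper and the toy "models" are illustrative assumptions; real systems verify against distributions with acceptance sampling) and shows why a single early mismatch invalidates the rest of the window:

```python
def speculative_step(prefix, draft, target, window=4):
    """One draft-then-verify round: the draft proposes `window`
    tokens; the target checks them and keeps the longest prefix it
    agrees with, plus one token of its own."""
    proposal, ctx = [], list(prefix)
    for _ in range(window):                    # cheap sequential drafting
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in proposal:                         # one parallel target pass
        if target(ctx) == t:                   # target agrees with draft
            accepted.append(t)
            ctx.append(t)
        else:
            break                              # mismatch truncates window
    accepted.append(target(ctx))               # target's own next token
    return accepted

# Toy models: target continues 0,1,2,...; draft is right except at 3.
target = lambda ctx: len(ctx)
draft = lambda ctx: len(ctx) if len(ctx) != 3 else 99
out = speculative_step([0, 1], draft, target)
# draft proposes [2, 99, 4, 5]; target accepts [2], emits 3 -> [2, 3]
```

The accepted length per round (here 2 of a possible 5) is exactly the window-level quantity PPOW's rewards target, rather than per-token imitation accuracy.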
Uncertainty Quantification for Large Language Diffusion Models
Artem Vazhentsev, Vladislav Smirnov, David Li, Maxim Panov, Timothy Baldwin, Art...
Core summary:

Studies uncertainty quantification for LLDMs; proposes lightweight zero-shot uncertainty signals derived from the iterative denoising process that detect hallucinations without repeated sampling.

Personalized recommendation rationale:

The core contributions are LLM uncertainty quantification and efficiency optimization, relevant as enabling LLM technology, though not a direct application to recommendation/search/ads.

2026-05-14 08:39:56 | arXiv:2605.14570v1 |
cs.CL
Large Language Diffusion Models (LLDMs) are emerging as an alternative to autoregressive models, offering faster inference through higher parallelism. Similar to autoregressive LLMs, they remain prone to hallucinations, making reliable uncertainty quantification (UQ) crucial for safe deployment. However, existing UQ methods are fundamentally misaligned with this new paradigm: they assume autoregressive factorization or use expensive repeated sampling, negating the efficiency of LLDMs. In this work, we present the first systematic study of UQ for LLDMs and propose lightweight, zero-shot uncertainty signals derived from the iterative denoising process, leveraging intermediate generations, token remasking dynamics, and denoising complexity. We further adapt a state-of-the-art UQ method to LLDMs by combining masked diffusion likelihoods with trajectory-based semantic dissimilarity. We prove that expected trajectory dissimilarity lower bounds the masked diffusion training objective, which motivates its usage as an uncertainty score. Comprehensive experiments across three tasks, eight datasets, and two models show that our method achieves a great cost-performance trade-off: it approaches the strongest sampling-based baselines while incurring up to 100x lower computational overhead. Our work demonstrates that LLDMs can deliver both fast inference and reliable hallucination detection simultaneously.
Dynamic Latent Routing
Fangyuan Yu, Xin Su, Amir Abdullah
Core summary:

Studies how to optimize models in LLM post-training via temporal composition of sub-policies (routing); the core method introduces General Dijkstra Search and derives Dynamic Latent Routing (DLR), which jointly learns discrete latent codes, routing policies, and model parameters.

Personalized recommendation rationale:

Proposes DLR, a dynamic-search-based LLM post-training method that optimizes models by composing sub-policies; enabling LLM technology with potential for search/recommendation.

2026-05-14 03:35:46 | arXiv:2605.14323v1 |
cs.LG cs.AI cs.CL
We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDP) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.
PreFT: Prefill-only finetuning for efficient inference
Andrew Lanpouthakoun, Aryaman Arora, Zhengxuan Wu, Dhruv Pai, Ben Keigwin, Dan J...
Core summary:

This paper proposes PreFT, which applies adapters only during the prefill stage to raise inference throughput when serving many adapters, addressing the decode-stage efficiency bottleneck of traditional PEFT methods.

Personalized recommendation rationale:

The paper proposes prefill-only finetuning focused on inference efficiency, an LLM optimization with potential value for efficient inference in recommendation and similar scenarios.

2026-05-14 00:19:41 | arXiv:2605.14217v1 |
cs.LG cs.AI cs.CL eess.SY
Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. Rather than optimising performance relative to parameter count, for efficient multi-adapter serving, we instead ought to optimise performance relative to serving throughput. We therefore propose PreFT (Prefill-only Finetuning), wherein we only apply the adapter to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an efficient implementation of two prefill-only PEFTs, LoRA and ReFT, on the vLLM inference engine. We first show that serving multi-user PreFTs is more efficient than traditional PEFTs (1.9× the throughput when serving 512 adapters on Llama 3.1 70B). Then, we compare the performance of prefill-only vs. all-token adapters on a variety of supervised finetuning and reinforcement learning tasks with LMs at varying scales. On SFT, we observe that the evaluation loss of PreFTs is higher than PEFTs, but can be compensated by increasing rank with nearly no reduction in throughput. On RL, we consistently find that PreFTs approach parity with standard PEFTs. Together, this work validates prefill-only adaptation of LLMs as a more favourable accuracy-throughput tradeoff than existing PEFTs for personalised serving.
Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding
Xun Fang, Yunchen Li, Hang Yuan, Zhou Yu
Core summary:

Addresses the factorization error introduced by standard prediction in discrete diffusion language models; proposes FeF-DLLM, which replaces independent token prediction with exact prefix-conditioned factorization and adds speculative decoding to accelerate inference, preserving token dependencies during generation while improving efficiency.

Personalized recommendation rationale:

Proposes a factorization-error-free method for discrete diffusion language models; the core improvement is generation quality, with no direct link to recommendation/search/ads but indirect potential from stronger LLM generation.

2026-05-14 03:15:25 | arXiv:2605.14305v1 |
cs.CL
Discrete diffusion language models improve generation efficiency through parallel token prediction, but standard X0 prediction methods introduce factorization errors by approximating the clean token posterior with independent token-wise distributions. This paper proposes Factorization-Error-Free Discrete Diffusion Language Modeling (FeF-DLLM), which replaces independent clean-token prediction with an exact prefix-conditioned factorization of the clean posterior to better preserve token dependencies. To reduce the sequential cost introduced by prefix conditioning, FeF-DLLM further incorporates speculative decoding within diffusion denoising, accelerating inference while maintaining the parallel prediction and re-masking properties of DLLMs. Theoretically, we prove that FeF-DLLM generates from the true joint distribution and derive its expected acceleration ratio. Experiments on GSM8K, MATH, HumanEval, and MBPP demonstrate that our method improves accuracy by an average of 5.04 percentage points while achieving an average inference speedup of 3.86×.
Minimal-Intervention KV Retention: A Design-Space Study and a Diversity-Penalty Survivor
Libo Sun, Po-wei Harn, Peixiong He, Xiao Qin
Core summary:

Studies KV-cache compression at small budgets; proposes α, a diversity-penalty-based retention strategy that achieves efficient compression by modifying TriAttention's retention scorer.

Personalized recommendation rationale:

Centers on KV-cache compression, an LLM inference-efficiency direction with potential value for the real-time constraints of recommender systems, though not directly related.

2026-05-14 02:50:20 | arXiv:2605.14292v1 |
cs.LG cs.CL
KV-cache compression at small budgets is a crowded design space spanning cache representation, head-wise routing, compression cadence, decoding behavior, and within-budget scoring. We study seven mechanisms across these five families under matched mean cache on long-form mathematical reasoning (MATH-500) with two distilled-reasoning models (Qwen-7B and Llama-8B variants of DeepSeek-R1-Distill) at budgets b ∈ {64, 128}. All seven were rejected. We then propose α, a one-function modification to the TriAttention retention scorer that replaces argmax top-k with greedy facility-location-inspired selection under a V-space redundancy penalty controlled by a single weight λ. A pre-registered protocol tunes λ on a frozen development split and confirms on a disjoint held-out split; with λ = 0.5, α clears Bonferroni on two of the four (model, budget) cells (Qwen b=128 and Llama b=64), no cell is significantly negative, and the pre-registered Branch A triggers. The finding is asymmetric: a minimal scoring modification beat heavier structural redesigns in this regime, and the combined matched-memory, sympy-graded, held-out confirmation protocol is the evidence standard that made the asymmetry visible.
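Greedy facility-location-inspired selection under a redundancy penalty has a compact general form: at each step keep the entry whose score, minus a penalty proportional to its maximum similarity to entries already kept, is largest. The `diverse_retain` helper below is an illustrative reconstruction of that general idea in NumPy, not the paper's α scorer:

```python
import numpy as np

def diverse_retain(scores, values, k, lam=0.5):
    """Greedy budgeted selection with a redundancy penalty: keep the
    entry maximizing score - lam * (max cosine similarity, in value
    space, to entries already kept). lam=0 reduces to plain top-k."""
    V = values / np.linalg.norm(values, axis=1, keepdims=True)
    kept = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(scores)):
            if i in kept:
                continue
            red = max((float(V[i] @ V[j]) for j in kept), default=0.0)
            gain = scores[i] - lam * red
            if gain > best_gain:
                best, best_gain = i, gain
        kept.append(best)
    return sorted(kept)

# Two near-duplicate high-score entries plus one distinct entry:
scores = np.array([1.0, 0.95, 0.5])
values = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
assert diverse_retain(scores, values, 2, lam=0.0) == [0, 1]  # top-k keeps duplicates
assert diverse_retain(scores, values, 2, lam=0.6) == [0, 2]  # penalty swaps one out
```

At tiny budgets (b = 64 or 128) this matters because duplicated high-attention entries waste irreplaceable slots.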
Orchard: An Open-Source Agentic Modeling Framework
Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Xiao Yu, Rui Yang, Tao Ge, Aless...
Core summary:

This paper presents Orchard, an open-source framework for scalable agentic modeling; its core contributions are the lightweight environment service Orchard Env and three training recipes (e.g., distillation for coding agents, credit-assignment SFT, and Balanced Adaptive RL).

Personalized recommendation rationale:

The paper centers on a general agentic modeling framework and training methods; its direct connection to recommender systems, search, or advertising is weak.

2026-05-14 16:35:12 | arXiv:2605.15040v1 |
cs.AI cs.CL
View full abstract
Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
Mengjie Ren, Jie Lou, Boxi Cao, Xueru Wen, Hongyu Lin, Xianpei Han, Le Sun, Xing...
Personalized recommendation rationale:

This paper addresses learning from failures and policy optimization with verifiable rewards, placing it in reinforcement learning. Although RL can be applied to recommender systems, this work does not explicitly target recommendation/search/advertising scenarios, and the title shows no direct connection to LLMs or Transformer architectures, so relevance is moderate to low.

2026-05-14 08:22:21 | arXiv:2605.14539v1 |
cs.CL
View full abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optimization signals and underutilization of the useful information embedded in failed trajectories. To address this challenge, we propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR that converts on-policy failed trajectories into correction-oriented supervision, without relying on any external signals. By jointly optimizing correction samples derived from the model's own failed attempts together with the standard RLVR objective, CIPO improves learning effectiveness while explicitly enhancing the model's ability to correct its own errors. Extensive experiments across 11 benchmarks spanning mathematical reasoning and code generation demonstrate that CIPO consistently and significantly outperforms strong baselines in both reasoning and correction performance. Moreover, CIPO yields stronger pass@K gains, indicating that it improves the model's intrinsic reasoning capacity rather than merely redistributing probability mass over existing correct answers.
Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement
Injin Kong, Hyoungjoon Lee, Yohan Jo
Personalized recommendation rationale:

This paper explores integrating a diffusion process into language models, an LLM architecture improvement. However, it targets text generation directly, makes no explicit mention of recommendation/search/advertising applications, and its technical details lean toward NLP. Potential applications could include improved sequence modeling, but current relevance is low.

2026-05-14 04:47:54 | arXiv:2605.14368v1 |
cs.CL cs.AI
View full abstract
Continuous diffusion language models lag behind autoregressive transformers, partly because diffusion is applied in spaces poorly suited to language denoising and token recovery. We propose DiHAL, a geometry-guided diffusion-transformer hybrid that asks where diffusion should enter a pretrained transformer. DiHAL scores layers with geometry-based proxies, selects a diffusion-friendly hidden-state interface, and replaces the lower transformer prefix with a diffusion bridge while retaining the upper layers and original LM head. By reconstructing the selected-layer hidden state rather than tokens, DiHAL avoids direct continuous-to-discrete recovery. Experiments on 8B-scale backbones show that the geometry score predicts effective shallow insertion layers under a fixed bridge-training protocol and that hidden-state recovery improves over continuous diffusion baselines in a diagnostic comparison matching the diffusion/recovery training budget. These results suggest that hidden-state geometry helps identify where diffusion-based replacement is feasible inside pretrained language models.
Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
Tianle Zhong, Neiwen Ling, Yifan Pi, Zijun Wei, Tianshu Yu, Geoffrey Fox, Peng W...
Personalized recommendation rationale:

This paper addresses the training-inference mismatch in LLM reinforcement learning, an improvement within core LLM techniques, but it does not explicitly involve recommendation, search, or advertising scenarios. Potential applications could include more stable LLMs for sequence ranking or user-intent understanding, but the connection is weak.

2026-05-14 00:27:35 | arXiv:2605.14220v1 |
cs.LG cs.AI cs.CL
View full abstract
Modern LLM RL systems separate rollout generation from policy optimization. These two stages are expected to produce token probabilities that match exactly. However, implementation differences can make them assign different values to the same sequence under the same model weights, inducing Training-Inference Mismatch (TIM). TIM is difficult to inspect because it is entangled with off-policy drift and common stabilization mechanisms. In this work, we isolate TIM in a zero-mismatch diagnostic setting (VeXact), and show that small token-level numerical disagreements can independently cause training collapse. We further show that TIM changes the effective optimization problem, and identify a set of remedies that could mitigate TIM. Our results suggest that TIM is not benign numerical noise, but a systems-level perturbation that should be treated as a first-order factor in analyzing LLM RL stability.
Why Neighborhoods Matter: Traversal Context and Provenance in Agentic GraphRAG
Riccardo Terrenzi, Maximilian von Zastrow, Serkan Ayvaz
Personalized recommendation rationale:

This paper focuses on traversal context and provenance in graph retrieval-augmented generation (GraphRAG), an application of retrieval-augmented generation to knowledge graphs, with only a weak connection to recommendation/search/advertising. Although GraphRAG may indirectly influence information-retrieval techniques, the paper lacks an explicit application scenario or methodology aimed at recommender systems, search, or advertising.

2026-05-14 17:25:20 | arXiv:2605.15109v1 |
cs.AI cs.IR
View full abstract
Retrieval-Augmented Generation can improve factuality by grounding answers in external evidence, but Agentic GraphRAG complicates what it means for citations to be faithful. In these systems, an agent explores a knowledge graph before producing an answer and a small set of citations. We frame citation faithfulness as a trajectory-level problem: final citations should not only support the answer, but also account for the graph traversal, structure, and visited-but-uncited entities that may influence it. Through controlled ablation experiments, we compare the effects of isolating, removing, and masking cited and uncited graph entities. Our results show that cited evidence is often necessary, as removing it substantially changes answers and reduces accuracy. However, citations are not sufficient, because accurate answers can also depend on uncited traversal context and surrounding graph structure. These findings suggest that citation evaluation in Agentic GraphRAG should move beyond source support toward provenance over the broader retrieval trajectory.
Towards Self-Evolving Agentic Literature Retrieval
Yuwen Du, Tian Jin, Jing Kang, Xianghe Pang, Jingyi Chai, Tingjia Miao, Fenyi Li...
Personalized recommendation rationale:

This paper focuses on self-evolving agents for literature retrieval, a domain-specific retrieval-system improvement rather than general recommendation, search, or advertising technology. Although retrieval is involved, its application domain (academic literature) differs substantially from the core focus areas (recommendation, search, advertising), and no explicit link is made to LLM or Transformer enabling techniques.

2026-05-14 03:17:31 | arXiv:2605.14306v1 |
cs.IR
View full abstract
As large language models reshape scientific research, literature retrieval faces a twofold challenge: ensuring source authenticity while maintaining a deep comprehension of academic search intents. While reliable, traditional keyword-centric search fails to capture complex research intents. Frontier LLMs can handle complex research intents, but their high cost and tendency to hallucinate remain key limitations. Here we introduce PaSaMaster, a self-evolving agentic literature retrieval system that produces relevance-scored paper rankings with evidence-grounded recommendations through iterative intent analysis, retrieval, and ranking. It is built on three key designs. First, it transforms literature retrieval from a one shot query--document matching problem into a search process that evolves over time, using ranked evidence to reveal gaps, refine intents, and guide follow-up searches. Second, it prevents hallucinated sources by treating retrieval as intent--paper relevance ranking rather than generation. Finally, PaSaMaster improves cost efficiency by separating planning from retrieval: a frontier LLM is used only for intent understanding, while large scale retrieval and relevance scoring are delegated to customized corpora and lightweight models. Evaluated on the PaSaMaster Benchmark across 38 scientific disciplines, our system exposes the severe inaccuracy and incompleteness of traditional keyword retrieval (improving F1-score by 15.6X) and the unreliability of generative LLMs (which exhibit hallucination rates up to 37.79%). Remarkably, PaSaMaster outperforms GPT-5.2 by 30.0% at a mere 1% of the computational cost while ensuring zero source hallucination: https://github.com/sjtu-sai-agents/PaSaMaster
Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA
Guanhua Chen, Yutong Yao, Shenghe Sun, Ci-Jun Gao, Shudong Liu, Lidia S. Chao, F...
Personalized recommendation rationale:

This paper focuses on hierarchical visual-language reasoning for procedural QA, an application of vision-language models (VLMs) to a specific QA task. Although VLM techniques could inform heterogeneous-data modeling in recommender systems, the topic is highly specific and lacks a direct connection or clear application potential for recommendation/search/advertising.

2026-05-14 15:03:36 | arXiv:2605.14928v1 |
cs.CL
View full abstract
Recent advances in vision-language models (VLMs) have achieved impressive results on standard image-text tasks, yet their potential for visual procedure question answering (VP-QA) remains largely unexplored. VP-QA presents unique challenges where users query next-step actions by uploading images for intermediate states of complex procedures. To systematically evaluate VLMs on this practical task, we propose ProcedureVQA, a novel multimodal benchmark specifically designed for visual procedural reasoning. Through comprehensive analysis, we identify two critical limitations in current VLMs: inadequate cross-modal retrieval of structured procedures given visual states, and misalignment between image sequence granularity and textual step decomposition. To address these issues, we present Chain-of-Procedure (CoP), a hierarchical reasoning framework that first retrieves relevant instructions using visual cues, then performs step refinement through semantic decomposition, and finally generates the next step. Experiments across six VLMs demonstrate CoP's effectiveness, achieving up to 13% absolute improvement over standard baselines.
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
ZiYi Dong, Yuliang Huang, Weijian Deng, Xiangyang Ji, Liang Lin, Pengxu Wei
Personalized recommendation rationale:

This paper frames language generation as optimal control, using closed-loop diffusion in a latent control space to generate text. Although this is a core generative-AI technique, it mainly targets NLP tasks and makes no explicit connection to direct applications in recommendation, search, or advertising. As a potential enabling technique, its closed-loop control idea could inspire multi-turn interaction policies in conversational recommender systems, but current relevance to RecSys/Search/Ads is low.

2026-05-14 08:13:43 | arXiv:2605.14531v1 |
cs.CL
View full abstract
This work reformulates language generation as a stochastic optimal control problem, providing a unified theoretical perspective to analyze autoregressive and diffusion models and explain their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) in terms of combination of trajectory singularity, adjoint state vanishing, and gradient absence. To address these issues, we approximate the solution to the Hamilton-Jacobi-Bellman (HJB) equation, yielding an optimal policy that acts as a closed-loop controller. To bypass the intractability of directly solving the HJB PDE, we employ Flow Matching as the optimal trajectory solver within the rectified latent control space. This allows our Manta-LM with Global Integral Operator to approximate the global vector field, effectively realizing a model that simultaneously achieves high-fidelity text generation and efficient, low-cost parallel sampling. Empirically, our method achieves strong performance on language modeling and conditional generation tasks, while exhibiting improved stability, efficiency, and controllability.
Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict
Yihang Chen, Pin Qian, Su Wang, Sipeng Zhang, Huan Xu, Shuhuai Lin, Xinpeng Wei
Personalized recommendation rationale:

This paper addresses context compliance in retrieval-augmented generation (RAG), an NLP topic, without explicitly discussing applications in recommendation, search, or advertising. Although RAG can be used to augment recommender systems, the paper's focus is diagnosis under knowledge conflict rather than directly serving concrete applications or core techniques of RecSys/Search/Ads.

2026-05-14 07:14:19 | arXiv:2605.14473v1 |
cs.CL cs.AI
View full abstract
The Context-Compliance Regime in Retrieval-Augmented Generation (RAG) occurs when retrieved context dominates the final answer even when it conflicts with the model's parametric knowledge. Accuracy alone does not reveal how retrieved context causally shapes answers under such conflict. We introduce Context-Driven Decomposition (CDD), a belief-decomposition probe that operates at inference time and serves as an intervention mechanism for controlled retrieval conflict. Across Epi-Scale stress tests, TruthfulQA misconception injection, and cross- model reruns, CDD exposes three patterns. P1: context compliance is measurable in an upper-bound adversarial setting, where Standard RAG reaches 15.0% accuracy on TruthfulQA misconception injection (N=500). P2: adversarial accuracy gains transfer across model families: CDD improves accuracy on Gemini-2.5-Flash and on Claude Haiku/Sonnet/Opus, but rationale-answer causal coupling does not transfer. CDD reaches 64.1% mistake- injection causal sensitivity on Gemini-2.5-Flash, while sensitivities for all three Claude variants fall in the [-3%, +7%] range, suggesting that the Claude-side accuracy gains operate through a mechanism distinct from the explicit conflict-resolution trace. P3: explicit conflict decomposition improves robustness under temporal drift and noisy distractors, with CDD reaching 71.3% on temporal shifts and 69.9% on distractor evidence on the full Epi-Scale adversarial benchmark. These three patterns identify context-compliance as a structural axis along which standard RAG can be probed and intervened on, distinct from retrieval-quality or single-method robustness questions, and motivate releasing Epi-Scale for systematic study across model families and retrieval pipelines.
Ideology Prediction of German Political Texts
Sinclair Schneider, Florian Steuber, Joao A. G. Schneider, Gabi Dreo Rodosek
Personalized recommendation rationale:

This paper focuses on ideology prediction for political texts, a text-classification task in NLP with no direct or obvious application to search, recommendation, or advertising. Although it could be viewed as a text-classification application, its domain specificity (politics) and its lack of any discussion of recommender-system techniques make it of low relevance to this research direction.

2026-05-14 04:25:56 | arXiv:2605.14352v1 |
cs.CL
View full abstract
Elections represent a crucial milestone in a nation's ongoing development. To better understand the political rhetoric from various movements, ranging from left to right, we propose a transformer-based model capable of projecting the political orientation of a text on a continuous left-to-right spectrum, represented by a normalized scalar d between -1 and 1. This approach enables analysts to focus on specific segments of the political landscape, such as conservatives, while excluding liberal and far-right movements. Such a task can only be achieved with multiclass classifiers, provided that the desired orientation is incorporated within one of their predefined classes. To determine the most suitable foundation model among 13 candidate transformers for this task, we constructed four distinct corpora. One corpus comprised annotated plenary notes from the German Bundestag, while another was based on an official online decision-making tool, Wahl-O-Mat. The third corpus consisted of articles from 33 newspapers, each identified by its political orientation, and the fourth included 535,200 tweets from 597 members of the 20th and 21st German Bundestag. To mitigate overfitting, we used two distinct corpora for training and two for testing, respectively. For in-domain performance, DeBERTa-large achieved the highest F1 score F1=0.844 as well as for the X (Twitter) out-of-domain test ACC=0.864. Regarding the newspaper out-of-domain test, Gemma2-2B excelled (MAE = 0.172). This study demonstrates that transformer models can recognize political framing in German news at the level of public opinion polls. Our findings suggest that both the model architecture and the availability of domain-specific training data can be as influential as model size for estimating political bias. We discuss methodological limitations and outline directions for improving the robustness of bias measurement.
A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval
Ho Hung Lim, Yi Yang
Personalized recommendation rationale:

This paper specializes in visual document retrieval for finance, a domain-specific application only weakly connected to the core general techniques of recommendation, search, or advertising systems. Although it involves visual information retrieval, it offers no direct insight into general modeling methods or LLM applications for recommendation/search/advertising.

2026-05-14 08:53:40 | arXiv:2605.14581v1 |
cs.CV cs.AI cs.IR
View full abstract
Visual RAG has offered an alternative to traditional RAG. It treats documents as images and uses vision encoders to obtain vision patch tokens. However, hundreds of patch tokens per document create retrieval and storage challenges in a vector database. Practical deployment requires aggregating them into a single vector. This raises a critical question: does single-vector aggregation lose key information in financial documents? We develop a diagnostic benchmark using financial documents where changes in single digits can lead to significant semantic shifts. Our experiments show that single-vector aggregation collapses different documents with almost identical vectors. Metrics show that the patch level detects semantic changes, and confirm that aggregation obscures these details. We identify global texture dominance as the root cause. Our findings are consistent across model scales, retrieval-optimized embeddings, and multiple mitigation strategies, highlighting significant risks for single-vector visual document retrieval in financial applications.
Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture
Longxiang Zhang, Weilong Dai, Guanghao Zhang, Hao Jiang, Pipei Huang
Personalized recommendation rationale:

This paper mainly addresses multimodal embeddings and adaptive reasoning; although it involves a LoRA architecture, its core is multimodal tasks and it does not explicitly target recommendation/search/advertising. It lacks direct links to user sequences, heterogeneous-data modeling, or key recommender-system problems, and its application scenario is unclear.

2026-05-14 06:41:53 | arXiv:2605.14448v1 |
cs.CV cs.CL cs.IR
View full abstract
Multimodal large language models (MLLMs) have emerged as a powerful backbone for multimodal embeddings. Recent methods introduce chain-of-thought (CoT) reasoning into the embedding pipeline to improve retrieval quality, but remain costly in both model size and inference cost. They typically employ separate reasoner and embedder with substantial parameter overhead, and generate CoT indiscriminately for every input. However, we observe that for simple inputs, discriminative embeddings already perform well, and redundant reasoning can even mislead the model, degrading performance. To address these limitations, we propose Think When Needed (TWN), a unified multimodal embedding framework with adaptive reasoning. TWN introduces a dual-LoRA architecture that attaches reasoning and embedding adapters to a shared frozen backbone, detaching gradients at their interface to mitigate gradient conflicts introduced by joint optimization while keeping parameters close to a single model. Building on this, an adaptive think mechanism uses a self-supervised routing gate to decide per input whether to generate CoT, skipping unnecessary reasoning to reduce inference overhead and even improve retrieval quality. We further explore embedding-guided RL to optimize CoT quality beyond supervised training. On the 78 tasks of MMEB-V2, TWN achieves state-of-the-art embedding quality while being substantially more efficient than existing generative methods, requiring only 3-5% additional parameters relative to the backbone and up to 50% fewer reasoning tokens compared to the full generative mode.
FutureSim: Replaying World Events to Evaluate Adaptive Agents
Shashwat Goel, Nikhil Chandak, Arvindh Arun, Ameya Prabhu, Steffen Staab, Moritz...
Personalized recommendation rationale:

This paper concerns a simulation environment for evaluating adaptive agents. While it may touch on user-behavior simulation in LLMs or recommender systems, its core goal is agent evaluation rather than direct application to recommendation/search/advertising. Its direct connection to my research focus is currently weak.

2026-05-14 17:59:28 | arXiv:2605.15188v1 |
cs.LG cs.AI cs.CL
View full abstract
AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent's accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.
MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs
Rui Wen, Mark Russinovich, Andrew Paverd, Jun Sakuma, Ahmed Salem
Personalized recommendation rationale:

This paper mainly concerns security attacks on LLMs (backdoor attacks), a security topic not directly related to my core interests (advanced techniques for recommendation, search, and advertising; LLM applications therein; Transformer architecture improvements). Although it involves LLMs, it emphasizes vulnerability exploitation rather than improving recommendation or search performance, so relevance is very low.

2026-05-14 17:56:22 | arXiv:2605.15172v1 |
cs.CR cs.CL
View full abstract
Backdoor attacks pose a serious security threat to large language models (LLMs), which are increasingly deployed as general-purpose assistants in safety- and privacy-critical applications. Existing LLM backdoors rely primarily on content-based triggers, requiring explicit modification of the input text. In this work, we show that this assumption is unnecessary and limiting. We introduce MetaBackdoor, a new class of backdoor attacks that exploits positional information as the trigger, without modifying textual content. Our key insight is that Transformer-based LLMs necessarily encode token positions to process ordered sequences. As a result, length-correlated positional structure is reflected in the model's internal computation and can be used as an effective non-content trigger signal. We demonstrate that even a simple length-based positional trigger is sufficient to activate stealthy backdoors. Unlike prior attacks, MetaBackdoor operates on visibly and semantically clean inputs and enables qualitatively new capabilities. We show that a backdoored LLM can be induced to disclose sensitive internal information, including proprietary system prompts, once a length condition is satisfied. We further demonstrate a self-activation scenario, where normal multi-turn interaction can move the conversation context into the trigger region and induce malicious tool-call behavior without attacker-supplied trigger text. In addition, MetaBackdoor is orthogonal to content-based backdoors and can be composed with them to create more precise and harder-to-detect activation conditions. Our results expand the threat model of LLM backdoors by revealing positional encoding as a previously overlooked attack surface. This challenges defenses that focus on detecting suspicious text and highlights the need for new defense strategies that explicitly account for positional triggers in modern LLM architectures.
Self-Distilled Agentic Reinforcement Learning
Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu, Qi Gu, Xunliang C...
Personalized recommendation rationale:

This paper mainly concerns self-distillation methods for agentic reinforcement learning, an RL technique improvement. Although RL can be used in recommender systems, the title shows no concrete connection to RecSys/Search/Ads; as a general RL algorithm with no explicit application scenario or domain innovation, its relevance is low.

2026-05-14 17:51:26 | arXiv:2605.15155v1 |
cs.LG cs.AI cs.CL
View full abstract
Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment, since negative teacher rejections may arise from imperfect skill retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL--OPSD baselines across model scales.
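The sigmoid-gate idea at the heart of SDAR can be illustrated with a minimal sketch. The exact loss form, the scale parameter, and all names below are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def sdar_gated_distill(student_logp, teacher_logp, scale=1.0):
    """Minimal sketch of a gated token-level distillation term: the
    detached teacher-student log-prob gap is mapped through a sigmoid,
    so teacher-endorsed tokens (positive gap) receive weight near 1
    while negative teacher rejections are softly attenuated rather
    than hard-masked."""
    gap = teacher_logp - student_logp           # detached per-token signal
    gate = 1.0 / (1.0 + np.exp(-scale * gap))   # sigmoid gate in (0, 1)
    return float(np.mean(gate * gap))           # gated auxiliary objective
```

Under this gate a strongly teacher-endorsed token contributes nearly its full gap, while a strongly rejected token contributes only a small negative amount, matching the asymmetric treatment the abstract describes.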
Improving Multi-turn Dialogue Consistency with Self-Recall Thinking
Renning Pang, Tian Lan, Leyuan Liu, Xiaoming Huang, Piao Tong, Xiaosong Zhang
Personalized recommendation rationale:

The paper focuses on consistency in multi-turn dialogue, an NLP dialogue topic with no direct connection to core recommendation/search/advertising techniques or to LLM applications in RecSys. Although consistency improvements might indirectly raise dialogue-system quality, there is no clear mapping to recommendation or search scenarios.

2026-05-14 17:20:14 | arXiv:2605.15102v1 |
cs.CL cs.AI
View full abstract
Large language model (LLM) based multi-turn dialogue systems often struggle to track dependencies across non-adjacent turns, undermining both consistency and scalability. As conversations lengthen, essential information becomes sparse and is buried in irrelevant context, while processing the entire dialogue history incurs severe efficiency bottlenecks. Existing solutions either rely on high-latency external memory or lose fine-grained details through iterative summarization. In this paper, we propose Self-Recall Thinking (SRT), a framework designed to address long-range contextual dependency and sparse informative signals in multi-turn dialogue. SRT identifies helpful historical turns and uses them to generate contextually appropriate responses, enabling the model to selectively recall and reason over context during inference. This yields an endogenous reasoning process that integrates interpretable recall steps without external modules. SRT incorporates: (1) Dependency Construction: generating dependency information and converting it into self-recall chains; (2) Capability Initialization: training the model to produce reasoning chains with recall tokens; (3) Reasoning Improvement: refining accuracy via verifiable rewards to optimize recall and reasoning for correct answers. Experiments on multiple datasets demonstrate that SRT improves F1 score by 4.7% and reduces end-to-end latency by 14.7% over prior methods, achieving a balance between reasoning latency and accuracy, and outperforming state-of-the-art baselines.
Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs
Guangyu Feng, Huanzhi Mao, Prabal Dutta, Joseph E. Gonzalez
Personalized recommendation rationale:

This paper mainly concerns asynchronous function calling and concurrent execution for LLMs, an LLM engineering optimization with no direct connection to the core problems of recommendation, search, or advertising systems (e.g., model architecture, feature interaction, user-behavior modeling). Although concurrency can indirectly improve system efficiency, the work offers no novel applications or transferable ideas for RecSys/Search/Ads-specific scenarios, so relevance is low.

2026-05-14 17:02:28 | arXiv:2605.15077v1 |
cs.CL cs.AI cs.LG
View full abstract
Function calling, also known as tool use, is a core capability of modern LLM agents but is typically constrained by synchronous execution semantics. Under these semantics, LLM decoding is blocked until each function call completes, resulting in increasing end-to-end latency. In this work, we introduce AsyncFC, a pure execution-layer framework that decouples LLM decoding from function execution, enabling overlap between model decoding and function execution as well as inter-function parallelism when dependencies permit. AsyncFC layers over existing models and unmodified function implementations, requiring no fine-tuning or changes to the standard synchronous function-calling protocol. Across standard function-calling benchmarks and adapted software engineering benchmarks, AsyncFC significantly reduces end-to-end task completion time while preserving task accuracy. Furthermore, these results reveal that LLMs possess a native capability to reason over symbolic futures that represent unresolved execution results, enabling an asynchronous paradigm for model-tool interaction.
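The future-based pattern the abstract describes, overlapping independent tool calls and blocking only when a result is actually needed, can be illustrated with plain asyncio. This is a hedged sketch of the general idea, not the AsyncFC implementation, and all tool names below are invented:

```python
import asyncio

async def fake_tool(name, delay, result):
    """Stand-in for a real function call (e.g. a search or a code run)."""
    await asyncio.sleep(delay)
    return f"{name}:{result}"

async def agent_turn():
    # Issue both calls immediately as futures instead of blocking on
    # each one in turn; independent calls now run concurrently.
    search = asyncio.create_task(fake_tool("search", 0.05, "docs"))
    lint = asyncio.create_task(fake_tool("lint", 0.05, "ok"))
    # The agent blocks only at the point where a result is actually
    # needed, i.e. when the symbolic future is resolved.
    return await asyncio.gather(search, lint)

results = asyncio.run(agent_turn())
```

With synchronous semantics the two calls would take the sum of their latencies; here they take roughly the maximum, which is the source of the end-to-end speedup the paper reports.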
On the Cultural Anachronism and Temporal Reasoning in Vision Language Models
Mukul Ranjan, Prince Jha, Khushboo Kumari, Zhiqiang Shen
Personalized recommendation rationale:

This paper mainly examines cultural anachronism and temporal-reasoning abilities in vision-language models, an analysis of internal VLM capabilities with no direct link to concrete applications or techniques in recommendation, search, or advertising. Although VLM techniques could indirectly inspire multimodal recommendation, the paper leans toward cognition and evaluation and has no direct bearing on my current focus on model applications or architectural innovation.

2026-05-14 16:58:16 | arXiv:2605.15071v1 |
cs.CV cs.AI cs.CL
View full abstract
Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark, and even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non-Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available in our project page.
Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use
Renning Pang, Tian Lan, Leyuan Liu, Piao Tong, Sheng Cao, Xiaosong Zhang
Personalized recommendation rationale:

The paper focuses on calibrating reasoning and execution in LLM tool use, a pure LLM application with no explicit connection to recommendation/search/advertising systems. Although it involves LLM reasoning ability, its direct application or technical relevance to recommendation, search, or advertising is weak.

2026-05-14 16:36:04 | arXiv:2605.15041v1 |
cs.AI cs.CL
View full abstract
Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structural validity. We approach this problem from a case-based perspective to present CAST, a case-driven framework that treats historical execution trajectories as structured cases. Instead of reusing raw exemplar outputs, CAST extracts case-derived signals to identify complexity profiles for estimating optimal reasoning strategies, alongside failure profiles to map likely structural breakdowns. The framework translates this knowledge into a fine-grained reward design and adaptive reasoning, enabling the model to autonomously internalize case-based strategies during reinforcement learning. Experiments on BFCLv2 and ToolBench demonstrate that CAST improves both schema-faithful execution and task-level tool-use success while reducing unnecessary deliberation. The approach achieves up to 5.85 percentage points gain in overall execution accuracy and reduces average reasoning length by 26%, significantly mitigating high-impact structural errors. Ultimately, this demonstrates how historical execution cases can provide reusable adaptation knowledge for calibrated tool use.
From Scenes to Elements: Multi-Granularity Evidence Retrieval for Verifiable Multimodal RAG
Guanhua Chen, Chuyue Huang, Yutong Yao, Shudong Liu, Xueqing Song, Lidia S. Chao...
Personalized recommendation rationale:

The paper focuses on evidence retrieval for multimodal RAG, a concrete NLP application with no direct connection to the core domains of recommendation/search/advertising, LLM enabling technologies, or their direct applications. Although it involves retrieval, it targets verifiability rather than core ranking and matching problems in recommendation or search, and it does not discuss potential uses for building recommender systems or advertising scenarios.

2026-05-14 16:20:02 | arXiv:2605.15019v1 |
cs.CL
Full abstract
Multimodal Retrieval-Augmented Generation (RAG) systems retrieve evidence at coarse granularities (entire images or scenes), creating a mismatch with fine-grained user queries and making failures unverifiable. We introduce GranuVistaVQA, a multimodal benchmark featuring real-world landmarks with element-level annotations across multiple viewpoints, capturing the partial observation challenge where individual images contain only subsets of entities. We further propose GranuRAG, a multi-granularity framework that treats visual elements as first-class retrieval units through three stages: element-level detection and classification, multi-granularity cross-modal alignment for evidence retrieval, and attribution-constrained generation. By grounding retrieval at the element level rather than relying on implicit attention, our approach enables transparent error diagnosis. Experiments demonstrate that GranuRAG achieves up to 29.2% improvement over six strong baselines for this task.
COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion
Zihan Deng, Xiaozhen Zhong, Chuanzhi Xu
Personalized recommendation rationale:

The paper mainly addresses preventive consultation and may involve dialogue systems, but it does not explicitly target recommendation, search, or advertising. Chain-of-thought completion is an LLM technique, yet the application scenario (medical consultation) falls outside my focus and offers no direct inspiration for recommender systems.

2026-05-14 16:17:35 | arXiv:2605.15016v1 |
cs.CL cs.AI
Full abstract
As large language models empower healthcare, intelligent clinical decision support has developed rapidly. Longitudinal electronic health records (EHR) provide essential temporal evidence for accurate clinical diagnosis and analysis. However, current large language models have critical flaws in longitudinal EHR reasoning. First, lacking fine-grained statistical reasoning, they often hallucinate clinical trends and metrics when quantitative evidence is textually implied, biasing diagnostic inference. Second, non-uniform time series and scarce labels in longitudinal EHR hinder models from capturing long-range temporal dependencies, limiting reliable clinical reasoning. To address the above limitations, this work presents the Probabilistic Chain-of-Thought Completion Agent (COTCAgent), a hierarchical reasoning framework for longitudinal electronic health records. It consists of three core modules. The Temporal-Statistics Adapter (TSA) converts analytical plans into executable code for standardized trend output. The Chain-of-Thought Completion (COTC) layer leverages a symptom-trend-disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. By decoupling statistical computation, feature matching, and language generation, the framework eliminates reliance on complex multi-modal inputs and enables efficient longitudinal record analysis with lower computational overhead. Experimental results show that COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on the self-built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream large language models. The code is available at https://github.com/FrankDengAI/COTCAgent/.
Small, Private Language Models as Teammates for Educational Assessment Design
Chris Davis Jaldi, Anmol Saini, Shan Zhang, Noah Schroeder, Cogan Shimizu, Eleni...
Personalized recommendation rationale:

The paper focuses on educational assessment, a domain-specific application with no direct connection to core search, recommendation, or advertising technology. Although it involves language models, it is explicitly restricted to small, private models and offers little inspiration or transfer potential for general recommendation/search/advertising scenarios, so its relevance is low.

2026-05-14 16:15:48 | arXiv:2605.15015v1 |
cs.AI cs.CL cs.HC
Full abstract
Generative AI increasingly supports educational design tasks, e.g., through Large Language Models (LLMs), demonstrating the capability to design assessment questions that are aligned with pedagogical frameworks (e.g., Bloom's taxonomy). However, they often rely on subjective or limited evaluation methods; focus primarily on proprietary models; or rarely systematically examine generation, evaluation, or deployment constraints in real educational settings. Meanwhile, Small Language Models (SLMs) have emerged as local alternatives that better address privacy and resource limitations; yet their effectiveness for assessment tasks remains underexplored. To address this gap, we systematically compare LLMs and SLMs for assessment question design; evaluate generation quality across Bloom's taxonomy levels using reproducible, pedagogically grounded metrics; and further assess model-based judging against expert-informed evaluation by analyzing reliability and agreement patterns. Results show that SLMs achieve competitive performance across key pedagogically motivated quality dimensions while enabling local, privacy-sensitive deployment. However, model-based evaluations also exhibit systematic inconsistencies and bias relative to expert ratings. These findings provide evidence to posit language models as bounded assistants in assessment workflows; underscore the necessity of Human-in-the-Loop; and advance the automated educational question generation field by examining quality, reliability, and deployment-aware trade-offs.
Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance
Kai Yan, Alexander G. Schwing, Yu-Xiong Wang
Personalized recommendation rationale:

The paper focuses on verifiable rewards and few-shot demonstration guidance in reinforcement learning, a general RL technique. It does not explicitly address core problems such as ranking, matching, or user modeling in recommendation, search, or advertising, and although it trains LLMs, it offers no application to these domains, so its relevance is low.

2026-05-14 16:12:30 | arXiv:2605.15012v1 |
cs.LG cs.AI cs.CL
Full abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where correct rollouts are hard to generate. Prior works propose to address this issue via demonstration-guided RLVR, i.e., to conduct Supervised FineTuning (SFT) when RL fails; however, SFT often requires a lot of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital for the success: supervised signal, on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multiple-epoch training. On several benchmarks, FEST outperforms baselines with magnitudes less SFT data, even matching their performance with full dataset.
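The decaying weight on the few-shot SFT dataset described in the abstract can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' code; `fest_loss`, `w0`, and `decay` are invented names:

```python
# Hypothetical sketch of FEST-style loss mixing: combine an on-policy RL
# loss with a few-shot SFT loss whose weight decays across epochs to
# prevent overfitting to the 128 demonstrations.

def fest_loss(rl_loss, sft_loss, epoch, w0=1.0, decay=0.5):
    """Weighted sum: the SFT weight shrinks geometrically each epoch."""
    w = w0 * (decay ** epoch)
    return rl_loss + w * sft_loss

# Early training leans on the demonstrations; later epochs are RL-dominated.
print(fest_loss(2.0, 4.0, epoch=0))  # 2.0 + 1.0*4.0 = 6.0
print(fest_loss(2.0, 4.0, epoch=2))  # 2.0 + 0.25*4.0 = 3.0
```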
The Scientific Contribution Graph: Automated Literature-based Technological Roadmapping at Scale
Peter A. Jansen
Personalized recommendation rationale:

The paper focuses on scientific-literature analysis and technological roadmap construction, a scientometrics topic with no direct connection to core recommendation, search, or advertising technology (e.g., ranking, matching, user understanding). Although literature-mining techniques could indirectly serve knowledge-graph construction, the paper points to no such application scenario, so its relevance is low.

2026-05-14 16:12:12 | arXiv:2605.15011v1 |
cs.CL
Full abstract
Scientific contributions rarely develop in isolation, but instead build upon prior discoveries. We formulate the task of automated technological roadmapping as extracting scientific contributions from scholarly articles and linking them to their prerequisites. We present the Scientific Contribution Graph, a large-scale AI/NLP-domain resource containing 2 million detailed scientific contributions extracted from 230k open-access papers and connected by 12.5 million prerequisite edges. We further introduce scientific prerequisite prediction, a scientific discovery task in which models predict which existing technologies can enable future discoveries, and show that contemporary models are rapidly improving on this task, reaching 0.48 MAP when evaluated using temporally filtered backtesting. We anticipate technological roadmapping resources such as this will support scientific impact assessment and automated scientific discovery.
Quantifying and Mitigating Premature Closure in Frontier LLMs
Rebecca Handler, Suhana Bedi, Nigam Shah
Personalized recommendation rationale:

The paper addresses a training/inference behavior of LLMs themselves (premature closure), which is core LLM territory, but it does not mention potential applications in recommendation/search/advertising. Although it could indirectly improve downstream LLM capability, the direct connection is missing, hence the low score.

2026-05-14 16:02:28 | arXiv:2605.15000v1 |
cs.CL cs.AI
Full abstract
Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large language models (LLMs). We define LLM premature closure as inappropriate commitment under uncertainty: providing an answer, recommendation, or clinical guidance when the safer response would be clarification, abstention, escalation, or refusal. We evaluated five frontier LLMs across structured and open-ended medical tasks. In MedQA (n = 500) and AfriMed-QA (n = 490) questions where the correct choice had been removed, models still selected an answer at high rates, with baseline false-action rates of 55-81% and 53-82%, respectively. In open-ended evaluation, models gave inappropriate answers on an average of 30% of 861 HealthBench questions and 78% of 191 physician-authored adversarial queries. Safety-oriented prompting reduced premature closure across models, but residual failure persisted, highlighting the need to evaluate whether medical LLMs know when not to answer.
Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study
Volodymyr Ovcharov
Personalized recommendation rationale:

The paper concentrates on zero-shot performance for a specific language (Ukrainian) in the legal domain, a piece of domain-specific NLP research with no direct connection to core recommendation, search, or advertising technology. Although it involves foundation models, it does not articulate potential applications to recommendation/search/advertising, and it covers neither Transformer architecture innovation nor direct LLM applications in those areas.

2026-05-14 14:35:05 | arXiv:2605.14890v1 |
cs.CL
Full abstract
Foundation models tokenize Ukrainian legal text with vastly different efficiency, yet no systematic comparison exists for this domain. We benchmark seven models from five providers on 273 validated court decisions from Ukraine's state registry (EDRSR), measuring tokenizer fertility and zero-shot performance on three tasks. Three findings emerge. (1) Tokenizer fertility varies 1.6x: Qwen3 models consume 60% more tokens than Llama-family models on identical input, directly reducing API cost. (2) NVIDIA Nemotron Super 3 (120B) achieves the highest composite score (83.1), outperforming Mistral Large 3 (675B total, 41B active) -- a model with 5.6x more total parameters and 3.4x more active parameters per token -- at one-third the API cost. (3) Few-shot prompting degrades performance by up to 26 percentage points; stratified and prompt-sensitivity ablations confirm this is intrinsic to Ukrainian-language demonstrations, not an artifact of example selection. For practitioners: tokenizer analysis should precede model selection, and zero-shot is a more reliable default than few-shot for morphologically rich languages.
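Tokenizer fertility, the metric behind the 1.6x spread reported above, is simply tokens emitted per unit of input (here, per whitespace word). A minimal sketch, assuming token counts are supplied by some tokenizer:

```python
# Toy fertility computation: a tokenizer needing 16 tokens for a 10-word
# sentence has fertility 1.6, i.e., 60% more tokens than a tokenizer at
# fertility 1.0 on the same text -- directly increasing API cost.

def fertility(num_tokens, text):
    """Tokens per whitespace-delimited word."""
    return num_tokens / len(text.split())

text = "a ten word sentence used to illustrate the fertility metric"
print(fertility(16, text))  # 1.6 tokens per word
print(fertility(10, text))  # 1.0 tokens per word
```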
Holistic Evaluation and Failure Diagnosis of AI Agents
Netta Madvil, Gilad Dym, Alon Mecilati, Edo Dekel, Jonatan Liberman, Rotem Brazi...
Personalized recommendation rationale:

The paper focuses on general-purpose evaluation and diagnosis of AI agents, with no direct connection to core recommendation, search, or advertising techniques (e.g., ranking, retrieval, multimodal modeling). Although AI agents can involve LLMs, it lacks a concrete recommendation/search/advertising application scenario or technical approach.

2026-05-14 14:12:39 | arXiv:2605.14865v1 |
cs.AI cs.CL
Full abstract
AI agents execute complex multi-step processes, but current evaluation falls short: outcome metrics report success or failure without explaining why, and process-level approaches struggle to connect failure types to their precise locations within long, structured traces. We present a holistic agent evaluation framework that pairs top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing analysis into independent per-span assessments. This decomposition scales to traces of arbitrary length and produces span-level rationales for each verdict. On the TRAIL benchmark, our framework achieves state-of-the-art results across all metrics on both GAIA and SWE-Bench, with relative gains over the strongest prior baselines of up to 38% on category F1, up to 3.5x on localization accuracy, and up to 12.5x on joint localization-categorization accuracy. Per-category analysis shows our framework leading in more error categories than any other evaluator. Notably, the same frontier model achieves several times higher localization accuracy when used inside our framework than as a monolithic judge over the full trace, showing that evaluation methodology, not model capability, is the bottleneck.
Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining
Weimin Xiong, Shuhao Gu, Bowen Ye, Zihao Yue, Lei Li, Feifan Song, Sujian Li, Ha...
Personalized recommendation rationale:

The paper focuses on pretraining graphical user interface (GUI) agents by synthesizing interaction trajectories from video data, sitting at the intersection of HCI and computer vision. Although it involves pretraining and sequence modeling, the core application is GUI automation, with no direct connection to user-behavior modeling, feature representation, or matching in recommendation/search/advertising. Its techniques (e.g., trajectory synthesis) might transfer to user-behavior-sequence augmentation, but the paper itself shows no such intent, so its relevance is low.

2026-05-14 12:14:24 | arXiv:2605.14747v1 |
cs.CL cs.AI cs.CV cs.LG
Full abstract
Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research of GUI agents.
Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model
Minghao Wu, Yuting Yan, Zhenyang Cai, Ke Ji, Chuangsen Fang, Ziying Sheng, Xidon...
Personalized recommendation rationale:

The paper focuses on clinical world models and patient dynamics in the medical domain, a domain-specific application with little connection to core recommendation, search, or advertising technology. Although it involves LLMs, the application is confined to healthcare and falls outside both our core areas and transferable general techniques.

2026-05-14 11:50:00 | arXiv:2605.14723v1 |
cs.AI cs.CL cs.LG
Full abstract
Sepsis management in the ICU requires sequential treatment decisions under rapidly evolving patient physiology. Although large language models (LLMs) encode broad clinical knowledge and can reason over guidelines, they are not inherently grounded in action-conditioned patient dynamics. We introduce SepsisAgent, a world model-augmented LLM agent for sepsis treatment recommendation. SepsisAgent uses a learned Clinical World Model to simulate patient responses under candidate fluid--vasopressor interventions, and follows a propose--simulate--refine workflow before committing to a prescription. We first show that world-model access alone yields inconsistent LLM decision performance, motivating agent-specific training. We then train SepsisAgent through a three-stage curriculum: patient-dynamics supervised fine-tuning, propose--simulate--refine behavior cloning, and world-model-based agentic reinforcement learning. On MIMIC-IV sepsis trajectories, SepsisAgent outperforms all traditional RL and LLM-based baselines in off-policy value while achieving the best safety profile under guideline adherence and unsafe-action metrics. Further analysis shows that repeated interaction with the Clinical World Model enables the agent to learn regularities in patient evolution, which remain useful even when simulator access is removed.
Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy
Langzhou He, Junyou Zhu, Yue Zhou, Zhengyao Gu, Junhua Liu, Wei-Chieh Huang, Hen...
Personalized recommendation rationale:

The paper focuses on the action-bottleneck problem in reinforcement learning, an agent-learning topic with no direct connection to core recommendation, search, or advertising technology (e.g., ranking, matching, user modeling). Although reinforcement learning has applications such as ad bidding, the paper's energy-based theoretical improvements do not touch typical RecSys/Search/Ads scenarios or techniques.

2026-05-14 08:33:02 | arXiv:2605.14558v1 |
cs.LG cs.AI cs.CL
Full abstract
Agentic reinforcement learning trains large language models using multi-turn trajectories that interleave long reasoning traces with short environment-facing actions. Common policy-gradient methods, such as PPO and GRPO, treat each token in a trajectory equally, leading to uniform credit assignment. In this paper, we critically demonstrate that such uniform credit assignment largely misallocates token-level training signals. From an energy-based modeling perspective, we show that token-level training signals, quantified by their correlations with reward variance of different rollouts sampled from a given prompt, concentrate sharply on action tokens rather than reasoning tokens, even though action tokens account for only a small fraction of the trajectory. We refer to this phenomenon as the Action Bottleneck. Motivated by this observation, we propose an embarrassingly simple token reweighting approach, ActFocus, that downweights gradients on reasoning tokens, along with an additional energy-based redistribution mechanism that further increases the weights on action tokens with higher uncertainty. Across four environments and different model sizes, ActFocus consistently outperforms PPO and GRPO, yielding final-step gains of up to 65.2 and 63.7 percentage points, respectively, without any additional runtime or memory cost.
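The token reweighting idea in the abstract, downweighting reasoning tokens while boosting uncertain action tokens, reduces to a per-token weight vector. A minimal sketch under assumed details (the function name, default weights, and `beta` are illustrative, not the paper's implementation):

```python
# Hypothetical ActFocus-style weighting: reasoning tokens get a small
# constant weight; action tokens get 1 plus an uncertainty-scaled bonus,
# concentrating gradient signal where reward variance concentrates.

def actfocus_weights(is_action, uncertainty, reasoning_w=0.1, beta=1.0):
    """Per-token gradient weights for a mixed reasoning/action trajectory."""
    weights = []
    for a, u in zip(is_action, uncertainty):
        weights.append(1.0 + beta * u if a else reasoning_w)
    return weights

# Four reasoning tokens followed by one uncertain action token.
print(actfocus_weights([False] * 4 + [True], [0.0] * 4 + [0.5]))
# [0.1, 0.1, 0.1, 0.1, 1.5]
```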
Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation
Gang Peng
Personalized recommendation rationale:

The paper focuses on intent-fidelity evaluation for LLMs via structured prompt ablation, which falls under LLM capability evaluation and prompt engineering. Although LLM-related, it does not directly address applications or techniques for recommendation, search, or advertising systems, nor does it discuss potential uses in RecSys/Search/Ads, so its relevance is low.

2026-05-14 08:00:23 | arXiv:2605.14517v1 |
cs.CL cs.AI
Full abstract
Holistic evaluation scores capture overall output quality but do not distinguish whether a model reproduced the structural form of a user's request from whether it preserved the user's specific intent. We propose a dimension-level intent fidelity evaluation framework, applied here through a structured prompt ablation study across 2,880 outputs spanning three languages, three task domains, and six LLMs, that separately measures structural recovery and intent fidelity for each semantic dimension. This framework reveals a systematic structural-fidelity split: among Chinese-language outputs with complete paired scores, 25.7% received perfect holistic alignment scores (GA=5) while exhibiting measurable dimensional intent deficits; among English-language outputs, this proportion rose to 58.6%. Human evaluation confirmed that these split-zone outputs represent genuine quality deficits and that dimensional fidelity scores track human judgements more reliably than holistic scores do. A public-private decomposition of 2,520 ablation cells characterises when models successfully compensate for missing intent and when they fail, while proxy annotation distinguishes prior inferability from default recoverability. A weight-perturbation experiment shows that moderate misalignment is typically absorbed, whereas severe dimensional inversion is consistently harmful. These findings demonstrate that dimension-level intent fidelity evaluation is a necessary complement to holistic assessment when evaluating LLM outputs for user-specific tasks.
NodeSynth: Socially Aligned Synthetic Data for AI Evaluation
Qazi Mamunur Rashid, Xuan Yang, Zhengzhe Yang, Yanzhou Pan, Erin van Liemt, Darl...
Personalized recommendation rationale:

The paper mainly addresses socially aligned synthetic-data generation for AI evaluation, a data-generation and evaluation topic. It has little connection to core recommendation, search, or advertising technology (e.g., ranking, retrieval, user modeling), and it offers no innovation in or application of LLM or Transformer architectures to those domains, so its relevance is low.

2026-05-14 05:06:50 | arXiv:2605.14381v1 |
cs.LG cs.CL
Full abstract
Recent advancements in generative AI facilitate large-scale synthetic data generation for model evaluation. However, without targeted approaches, these datasets often lack the sociotechnical nuance required for sensitive domains. We introduce NodeSynth, an evidence-grounded methodology that generates socially relevant synthetic queries by leveraging a fine-tuned taxonomy generator (TaG) anchored in real-world evidence. Evaluated against four mainstream LLMs (e.g., Claude 4.5 Haiku), NodeSynth elicited failure rates up to five times higher than human-authored benchmarks. Ablation studies confirm that our granular taxonomic expansion significantly drives these failure rates, while independent validation reveals critical deficiencies in prominent guard models (e.g., Llama-Guard-3). We open-source our end-to-end research prototype and datasets to enable scalable, high-stakes model evaluation and targeted safety interventions (https://github.com/google-research/nodesynth).
Mitigating Data Scarcity in Psychological Defense Classification with Context-Aware Synthetic Augmentation
Hoang-Thuy-Duong Vu, Quoc-Cuong Pham, Huy-Hieu Pham
Personalized recommendation rationale:

The paper focuses on psychological defense classification, a psychology topic with no direct connection to core business scenarios such as recommendation, search, or advertising. Although it involves data augmentation, it shows no application potential for user-behavior sequences, contextual features, or multimodal modeling, so its relevance is low.

2026-05-14 05:02:34 | arXiv:2605.14380v1 |
cs.CL
Full abstract
Psychological defense mechanisms (PDMs) are unconscious cognitive processes that modulate how individuals perceive and respond to emotional distress. Automatically classifying PDMs from text is clinically valuable but severely hindered by data scarcity and class imbalance, challenges which generative augmentation alone cannot resolve without psychological grounding. In this work, we address these challenges in the PsyDefDetect shared task (BioNLP@ACL 2026) by proposing a context-aware synthetic augmentation framework combined with a hybrid classification model. Our hybrid model integrates contextual language representations with basic clinical features, along with 150 annotated defense items. Experiments demonstrate that definition quality in prompting directly governs generation fidelity and downstream performance. Our method surpasses DMRS Co-Pilot, reaching an accuracy of 58.26% (+40.25%) and a macro-F1 of 24.62% (+15.99%), thereby establishing a strong baseline for psychologically grounded defense mechanism classification in low-resource settings. Source code is available at: https://github.com/htdgv/CASA-PDC.
Web Agents Should Adopt the Plan-Then-Execute Paradigm
Julien Piet, Annabella Chow, Yiwei Hou, Muxi Lyu, Sylvie Venuto, Jinhao Zhu, Ral...
Personalized recommendation rationale:

The paper addresses the plan-then-execute paradigm for web agents, mainly concerning autonomous agents and task execution rather than recommendation, search, or advertising. Although it has potential applications, it lacks a direct connection and falls outside the core domains and enabling technologies.

2026-05-14 02:48:57 | arXiv:2605.14290v1 |
cs.CR cs.AI cs.CL cs.SE
Full abstract
ReAct has become the default architecture across LLM agents, and many existing web agents follow this paradigm. We argue that it is the wrong default for web agents. Instead, web agents should default to plan-then-execute: commit to a task-specific program before observing runtime web content, then execute it. The reason is that web content mixes inputs from many parties. An e-commerce product page may combine a seller's listing, customer reviews and sponsored advertisements. Under ReAct, all of this content flows into the model when deciding on the next action, creating a direct path for prompt injections to steer the agent's control flow. Plan-then-execute changes this boundary: untrusted data may influence values or branches inside a predefined execution graph, but it cannot redefine the user task or cause the model to synthesize new actions at runtime. We analyze WebArena, a popular web agent benchmark, and find that all tasks are compatible with plan-then-execute, while 80% can be completed with a purely programmatic plan, without any runtime LLM subroutine. We identify the main barrier to adopting plan-then-execute on the web: For it to work well, tools must map cleanly to semantic actions, with effects known before execution, so agents have enough information to plan. The web does not naturally expose that interface. Browser tools such as click, type, and scroll have page-dependent meanings. Planning at this layer is near-sighted: the agent can only see actions on the current page, and later actions appear only after it acts. Closing this gap requires typed interfaces that turn website interactions from clicks and keystrokes to task-level operations. This is an infrastructure problem, not a modeling problem. Web tasks do not need reactivity by default; they need typed, complete, auditable website APIs.
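The trust boundary argued for above can be made concrete in a toy sketch (names and structure are illustrative, not from the paper): the plan is fixed before any page content is seen, so runtime data may fill slot values but cannot add or reorder steps.

```python
# Plan-then-execute sketch: the step list is committed up front; untrusted
# page data only supplies slot values, never control flow.

def execute_plan(plan, page_data):
    """Run a predefined (step, slot) list; page_data fills slots only."""
    log = []
    for step, slot in plan:
        log.append((step, page_data.get(slot)))
    return log

plan = [("search", "query"), ("add_to_cart", "item_id")]
# A malicious page can influence values but cannot introduce new steps:
page = {"query": "usb cable", "item_id": "B0123", "injected_step": "wire_money"}
print(execute_plan(plan, page))
```

Under ReAct, by contrast, the injected content would reach the model at every action-selection step and could steer control flow directly.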
MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
Minghao Guo, Qingyue Jiao, Zeru Shi, Yihao Quan, Boxuan Zhang, Danrui Li, Liwei ...
Personalized recommendation rationale:

The paper focuses on an evaluation framework for multimodal agent memory, a pure NLP/vision-and-agents topic unrelated to core recommendation, search, or advertising technology. Although multimodal, it does not target search/recommendation/advertising scenarios, and it is an evaluation benchmark rather than a modeling method, matching none of the focus areas.

2026-05-14 17:37:52 | arXiv:2605.15128v1 |
cs.CV cs.CL cs.IR
Full abstract
Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, allowing answers to be inferred without preserving the fine-grained visual evidence. Meanwhile, harder cases that require reasoning over changing visual states are largely absent. Therefore, we introduce MemEye, a framework that evaluates memory capabilities from two dimensions: one measures the granularity of decisive visual evidence (from scene-level to pixel-level evidence), and the other measures how retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark across 8 life-scenario tasks, with ablation-driven validation gates for assessing answerability, shortcut resistance, visual necessity, and reasoning structure. By evaluating 13 memory methods across 4 VLM backbones, we show that current architectures still struggle to preserve fine-grained visual details and reason about state changes over time. Our findings show that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.
Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets
Rafi Al Attrach, Rajna Fani, Sebastian Lobentanzer, Joan Giner-Miguelez, Debansh...
Personalized recommendation rationale:

The paper focuses on metadata generation for machine-learning datasets, a data-management and governance topic with no direct connection to core search, recommendation, or advertising technology (e.g., model architecture, recommendation algorithms, LLM applications). Even where LLMs are involved, the application scenario is far removed from the core goals of RecSys/Search/Ads (e.g., user matching, ranking, performance optimization).

2026-05-14 17:04:39 | arXiv:2605.15079v1 |
cs.LG cs.DB cs.DL cs.IR
Full abstract
Croissant has emerged as the metadata standard for machine learning datasets, providing a structured, JSON-LD-based format that makes dataset discovery, automated ingestion, and reproducible analysis machine-checkable across ML platforms. Adoption has accelerated, and NeurIPS now requires Croissant metadata in every submission to its dataset tracks. Yet in practice Croissant generation usually starts with uploading data to a public platform, a path infeasible for governed and large local repositories that hold much of the high-value data ML increasingly relies on. We release Croissant Baker, a local-first, open-source command-line tool that generates validated Croissant metadata directly from a dataset directory through a modular handler registry. We evaluate Croissant Baker on over 140 datasets, scaling to MIMIC-IV at 886 million rows and 374 Parquet files. On held-out comparisons against producer-authored or standards-derived ground truth, Croissant Baker reaches 97-100% agreement across multiple domains.
A Deterministic Agentic Workflow for HS Tariff Classification: Multi-Dimensional Rule Reasoning with Interpretable Decisions
Yu Zhang, Dongjiang Zhuang, Qu Zhou, Zheng Huang, Junhe Wu, Jing Cao, Kai Chen
Personalized recommendation rationale:

The paper concentrates on a classification workflow for Harmonized System (HS) tariff codes, a domain-specific application (international trade/customs) with no direct connection to core recommendation, search, or advertising technology (e.g., user modeling, matching, ranking). Although it involves rule reasoning and interpretability, it makes no general technical contribution to LLMs or to recommendation/search, and it fits neither the listed Enabling Tech nor Direct Applications directions.

2026-05-14 14:04:46 | arXiv:2605.14857v1 |
cs.AI cs.IR
Full abstract
Harmonized System (HS) tariff classification is a high-stakes, expert-level task in which a free-form product description must be mapped to a specific six- or eight-digit code under the General Interpretive Rules (GIR), section notes, chapter notes, and Explanatory Notes. The difficulty lies not in knowledge volume but in *multi-dimensional rule reasoning*: a correct classification must satisfy competing priority rules along several axes simultaneously, including material, form, function, essential character, the part-versus-whole boundary, and specific listing versus residual headings. End-to-end prompting of large language models fails characteristically by resolving one axis while ignoring the priority constraints on the others. We present a *deterministic agentic workflow* in contrast to self-planning agents: the control flow is fixed, language model calls are confined to narrow stages, and reflection and verification are retained as local mechanisms. This design yields interpretability by construction--each decision is decomposed into stage-wise structured outputs with verbatim citation of the chapter or section notes that bear on it. The architecture combines offline knowledge-engineering of the Chinese HS tariff with an online six-stage pipeline. Evaluated on HSCodeComp at the six-digit level, the workflow reaches 75.0% top-1 and 91.5% top-3 at four digits, and 64.2% top-1 and 78.3% top-3 at six digits with Qwen3.6-plus; an open-weight Qwen3.6-27B-FP8 backbone in non-thinking mode achieves 84.2% four-digit and 77.4% six-digit top-1 agreement with the frontier model. A two-stage manual audit of 226 six-digit disagreements suggests that a non-trivial fraction of HSCodeComp ground-truth labels may deviate from HS general rules; full adjudication records are released in the appendix as preliminary findings for community review.
Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI
Joy Bose
Personalized recommendation rationale:

The paper concentrates on legal reasoning, a domain-specific application unrelated to the core ideas of recommendation, search, or advertising, foundational LLM techniques, or Transformer architectures.

2026-05-14 10:19:23 | arXiv:2605.14665v1 |
cs.AI cs.CL cs.IR
Full abstract
Legal reasoning is not semantic similarity search. A court judgment encodes constrained symbolic reasoning: precedent propagation, procedural state transitions, and statute-bound inference. These are properties that vector-based retrieval-augmented generation (RAG) cannot faithfully represent. Hallucinated precedents, outdated statute citations, and unsupported reasoning chains remain persistent failure modes in LLM-based legal AI, with real consequences for access to justice in high-caseload jurisdictions such as India. This paper presents Falkor-IRAC, a graph-constrained generation framework for Indian legal AI that grounds generation in structured reasoning over an IRAC (Issue, Rule, Analysis, Conclusion) knowledge graph. Judgments from the Supreme Court and High Courts of India are ingested as IRAC node structures enriched with procedural state transitions, precedent relationships, and statutory references, stored in FalkorDB for low-latency agentic traversal. At inference time, LLM-generated answers are accepted only if a valid supporting path can be traced through the graph, a check performed by a falsifiability oracle called the Verifier Agent. The system also detects doctrinal conflicts as a first-class output rather than silently resolving them. Falkor-IRAC is evaluated using graph-native metrics: citation grounding accuracy, path validity rate, hallucinated precedent rate, and conflict detection rate. These metrics are argued to be more appropriate for legal reasoning evaluation than BLEU and ROUGE. On a proof-of-concept corpus of 51 Supreme Court judgments, the Verifier Agent correctly validated citations on completed queries and correctly rejected fabricated citations. Evaluation against vector-only RAG baselines is left for future work, as is GPU-accelerated inference to address current timeout rates on CPU hardware.
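The Verifier Agent's accept/reject rule, accepting an answer only if a supporting path exists in the graph, can be sketched with a toy directed graph. The node names below are invented for illustration; the real system traverses an IRAC graph in FalkorDB:

```python
# Toy graph-constrained verification: a cited precedent is accepted only
# if it is reachable from the answer's issue node via graph edges.
from collections import deque

def has_support_path(graph, start, citation):
    """BFS over directed edges; True if `citation` is reachable from `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == citation:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

irac = {"issue": ["rule"], "rule": ["precedent_A"]}
print(has_support_path(irac, "issue", "precedent_A"))  # True: citation validated
print(has_support_path(irac, "issue", "precedent_X"))  # False: rejected as fabricated
```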
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
Ziyu Guo, Rain Liu, Xinyan Chen, Pheng-Ann Heng
Personalized recommendation rationale:

The paper focuses on visual reasoning, a pure vision topic with no direct connection to core recommendation, search, or advertising technology. It mentions neither Transformer architecture improvements nor the application potential of LLMs in recommendation/search/advertising, so its relevance is minimal.

2026-05-14 17:59:55 | arXiv:2605.15198v1 |
cs.CV cs.AI cs.CL
Full abstract
Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.
Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment
Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim, Jeremy C. Weiss
Personalized recommendation rationale:

This paper focuses on clinical timeline reconstruction, a medical-domain task with no direct connection to the core areas of recommendation systems, search, or advertising. Although it involves multimodal alignment, its application setting and approach (retrieval augmentation, timetable alignment) are difficult to transfer to typical RecSys/Search/Ads problems such as user behavior modeling, feature engineering, or ranking and recommendation. It therefore falls under unrelated topics.

2026-05-14 17:55:27 | arXiv:2605.15168v1 |
cs.CL cs.AI cs.LG stat.ML
View full abstract
Reconstructing precise clinical timelines is essential for modeling patient trajectories and forecasting risk in complex, heterogeneous conditions like sepsis. While unstructured clinical narratives offer semantically rich and contextually complete descriptions of a patient's course, they often lack temporal precision and contain ambiguous event timing. Conversely, structured electronic health record (EHR) data provides precise temporal anchors but misses a substantial portion of clinically meaningful events. We introduce a retrieval-augmented multimodal alignment framework that bridges this gap to improve the temporal precision of absolute clinical timelines extracted from text. Our approach formulates timeline reconstruction as a graph-based multistep process: it first extracts central anchor events from narratives to build an initial temporal scaffold, places non-central events relative to this backbone, and then calibrates the timeline using retrieved structured EHR rows as external temporal evidence. Evaluated using instruction-tuned large language models on the i2m4 benchmark spanning MIMIC-III and MIMIC-IV, our multimodal pipeline consistently improves absolute timestamp accuracy (AULTC) and improves temporal concordance across nearly all evaluated models over unimodal text-only reconstruction, without compromising event match rates. Furthermore, our empirical gap analysis reveals that 34.8% of text-derived events are entirely absent from tabular records, demonstrating that aligning these modalities can produce a more temporally faithful and clinically informative reconstruction of patient trajectories than either source alone.
Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution
Saisab Sadhu, Pratinav Seth, Vinay Kumar Sankarapu
Personalized recommendation rationale:

This paper focuses on machine unlearning in quantized models, a model-compression and privacy topic with no direct connection to core ranking, recommendation, or search techniques in RecSys/Search/Ads. Although model compression may indirectly affect deployment efficiency, the topic is outside the scope of interest.

2026-05-14 17:44:10 | arXiv:2605.15138v1 |
cs.LG cs.CL cs.ET
View full abstract
Standard unlearning evaluations measure behavioral suppression in full precision, immediately after training, despite every deployed language model being quantized first. Recent work has shown that 4-bit post-training quantization can reverse machine unlearning; we show this is not a tuning artefact but a systematic dual failure: gradient-based methods that achieve meaningful forgetting lose it under compression, while methods that survive quantization barely change the model. Both failures trace to the same root cause: across all baselines, per-parameter updates lie 47-828x below the NF4 quantization bin width; updates diffused across billions of parameters cannot clear quantization bin boundaries, a consequence we formalize as a sparsity-permanence tradeoff. We present MANSU (Mechanistic-Aligned Null-Space Unlearning), which resolves both modes by combining causal circuit attribution to isolate the minimal forget-set subgraph, circuit-restricted null-space projection with a diagonal-Fisher retain bound, and a per-parameter magnitude floor guaranteeing quantization survival by construction. We additionally introduce Circuit Attribution Divergence (CAD), a mechanistic verification metric distinguishing structural erasure from behavioral suppression, a distinction existing metrics cannot make. Across multiple model families and hazard benchmarks, MANSU is the first method to jointly satisfy all four properties with margin on each (meaningful forgetting, retain preservation, non-positive PTQ gap, and structural erasure), while gradient-based baselines recover up to +0.05 accuracy under compression.
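The abstract's central quantitative claim, that diffuse unlearning updates sit 47-828x below the NF4 bin width and therefore cannot clear bin boundaries, can be sketched numerically. The snippet below is an illustrative proxy rather than the paper's method: it uses a uniform 4-bit grid in place of the true non-uniform NF4 levels, and synthetic Gaussian weights.

```python
import numpy as np

# Illustrative proxy, not the paper's method: a uniform 4-bit grid stands in
# for the non-uniform NF4 levels, and the weights are synthetic Gaussians.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=10_000)        # "pretrained" weight tensor
levels = np.linspace(w.min(), w.max(), 16)    # 16 levels = 4-bit quantization
bin_width = levels[1] - levels[0]

def to_bins(x):
    """Index of the nearest quantization level for each weight."""
    return np.abs(x[:, None] - levels[None, :]).argmin(axis=1)

# A diffuse unlearning update ~100x smaller than the bin width
# (the paper reports updates 47-828x below the NF4 bin width).
update = rng.normal(0.0, bin_width / 100, size=w.shape)
flipped = np.mean(to_bins(w + update) != to_bins(w))
print(f"bin width {bin_width:.4f}; weights that change bin: {flipped:.2%}")
```

Nearly every weight rounds back to its original level, so the update is erased by post-training quantization; MANSU's per-parameter magnitude floor is designed to push updates past these bin boundaries by construction.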
Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks
Karthik Raghu Iyer, Yazdan Jamshidi, Nicholas Bray, Alexey A. Shvets
Personalized recommendation rationale:

This paper focuses on a taxonomy and benchmark audit of LLM security attacks, an LLM-safety topic rather than a core technique or application of RecSys/Search/Ads. Although security is a general concern, the direct connection to recommendation, search, and advertising systems is minimal, and the paper involves no core technical improvements or applications in those areas.

2026-05-14 17:30:36 | arXiv:2605.15118v1 |
cs.CR cs.CL
View full abstract
We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4$\times$6 Target $\times$ Technique matrix grounded in STRIDE, constructed from a 507-leaf taxonomy -- 401 data-populated and 106 threat-model-derived leaves -- of inference-time attacks extracted from 932 arXiv security studies (2023--2026). The matrix enables benchmark-external validation -- auditing collective coverage rather than individual benchmark consistency. Applying it to six public benchmarks reveals that the three primary frameworks (HarmBench, InjecAgent, AgentDojo) occupy non-overlapping cells covering at most 25\% of the matrix, while entire STRIDE threat categories (Service Disruption, Model Internals) lack any standardized evaluation, despite published attacks in these categories achieving 46$\times$ token amplification and 96\% attack success rates through mechanisms which no benchmark tests. The corpus of 2,521 unique attack groups further reveals pervasive naming fragmentation (up to 29 surface forms for a single attack) and heavy concentration in Safety \& Alignment Bypass, structural properties invisible at smaller scale. The taxonomy, attack records, and coverage mappings are released as extensible artifacts; as new benchmarks emerge, they can be mapped onto the same matrix, enabling the community to track whether evaluation gaps are closing.
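The coverage audit reduces to set arithmetic over the 4x6 matrix: 24 cells, with the three primary benchmarks occupying disjoint cells that cover at most 25%. A minimal sketch with hypothetical cell assignments (the real Target x Technique labels come from the paper's taxonomy, not shown here):

```python
# Hypothetical cell assignments; the real mapping comes from the taxonomy.
matrix_cells = {(t, q) for t in range(4) for q in range(6)}  # 4x6 = 24 cells
benchmark_cells = {
    "HarmBench": {(0, 0), (0, 1)},
    "InjecAgent": {(1, 2), (1, 3)},
    "AgentDojo": {(2, 4), (2, 5)},
}

covered = set().union(*benchmark_cells.values())
uncovered = matrix_cells - covered
print(f"coverage: {len(covered) / len(matrix_cells):.0%}, "
      f"uncovered cells: {len(uncovered)}")  # 25%, 18 uncovered
```

New benchmarks can be mapped onto the same cell set, so closing evaluation gaps becomes a matter of shrinking `uncovered` over time.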
Proposal and study of statistical features for string similarity computation and classification
E. O. Rodrigues, D. Casanova, M. Teixeira, V. Pegorini, F. Favarim, E. Clua, A. ...
Personalized recommendation rationale:

This paper focuses on string similarity computation and classification, a foundational text-processing technique, and does not mention applications in recommendation, search, or advertising. It may relate indirectly to feature engineering, but it lacks a direct connection to core areas such as user modeling, multimodality, or LLM applications.

2026-05-14 17:27:04 | arXiv:2605.15110v1 |
cs.LG cs.CL cs.IT
View full abstract
Adaptations of features commonly applied in the field of visual computing, co-occurrence matrix (COM) and run-length matrix (RLM), are proposed for the similarity computation of strings in general (words, phrases, codes and texts). The proposed features are not sensitive to language related information. These are purely statistical and can be used in any context with any language or grammatical structure. Other statistical measures that are commonly employed in the field such as longest common subsequence, maximal consecutive longest common subsequence, mutual information and edit distances are evaluated and compared. In the first synthetic set of experiments, the COM and RLM features outperform the remaining state-of-the-art statistical features. In 3 out of 4 cases, the RLM and COM features were statistically more significant than the second best group based on distances (P-value < 0.001). When it comes to a real text plagiarism dataset, the RLM features obtained the best results.
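The abstract does not give the exact adaptation, so the following is a simplified character-level sketch of the two feature families: a co-occurrence matrix (COM) counting character pairs at a fixed offset, and run-length (RLM) features from maximal character runs, with cosine similarity for comparison.

```python
from collections import Counter

def com_features(s, offset=1):
    """Co-occurrence counts of character pairs at a fixed offset (COM-style)."""
    return Counter(zip(s[:-offset], s[offset:])) if len(s) > offset else Counter()

def rlm_features(s):
    """(character, run length) pairs from maximal runs (RLM-style)."""
    runs, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        runs.append((s[i], j - i))
        i = j
    return runs

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in set(c1) | set(c2))
    n1 = sum(v * v for v in c1.values()) ** 0.5
    n2 = sum(v * v for v in c2.values()) ** 0.5
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(rlm_features("aaabbc"))  # [('a', 3), ('b', 2), ('c', 1)]
print(cosine(com_features("banana"), com_features("bandana")))
```

Because both feature families count purely statistical patterns, they carry no language-specific information, which is the property the paper emphasizes.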
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
Md Tahmid Rahman Laskar, Xue-Yong Fu, Seyyed Saeed Sarfjoo, Quinten McNamara, Jo...
Personalized recommendation rationale:

This paper focuses on evaluating the tool-calling abilities of LLM agents, an LLM-centric topic, and makes no explicit connection to recommendation systems, search, or advertising; it does not directly involve core-domain advances or enabling-technology applications.

2026-05-14 17:22:42 | arXiv:2605.15104v1 |
cs.CL
View full abstract
Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni-modal models on audio-converted versions of Confetti and When2Call, our framework demonstrates that the performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9). On Confetti, the text-to-voice gap ranges from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5. A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech. Considering real-world deployment scenarios, we further report text-only results, an ambiguity-based reformulation stress test, and a reference-free LLM-as-judge protocol validated against human preferences. Notably, we find that open-source Qwen3 judges with at least 8B parameters exceed 80% agreement with proprietary judges, supporting privacy-preserving evaluation. Overall, our framework provides a verifiable and reproducible first-stage diagnostic that complements purpose-built audio corpora.
AI Knows When It's Being Watched: Functional Strategic Action and Contextual Register Modulation in Large Language Models
Vinicius Covas, Jorge Alberto Hidalgo Toledo
Personalized recommendation rationale:

This paper examines how large language models modulate their behavior when under observation, a model-safety and societal-impact topic unrelated to the core techniques of recommendation systems, search, or advertising. It does not touch the focus areas of efficiency, attention mechanisms, or direct applications.

2026-05-14 16:29:38 | arXiv:2605.15034v1 |
cs.CL cs.AI cs.CY cs.MA
View full abstract
Large language models (LLMs) have been extensively studied from computational and cognitive perspectives, yet their behavior as communicative actors in socially structured contexts remains underexplored. This study examines whether LLM-based multi-agent systems exhibit systematic linguistic adaptation in response to perceived social observation contexts -- a question with direct implications for AI governance and auditing. Drawing on Habermas's (1981) Theory of Communicative Action, Goffman's (1959) dramaturgical model, Bell's (1984) Audience Design framework, and the Hawthorne Effect, we report a controlled experiment involving 100 multi-agent debate sessions across five conditions (n = 20 each). Conditions varied the framing of social observation -- from explicit monitoring by university researchers, to negation of monitoring, to an observer-substitution condition replacing human researchers with an automated AI auditing system. Monitored conditions (Delta+24.9%, Delta+24.2%) and the automated AI monitoring condition (Delta+22.2%) produce higher TTR change than audience-framing conditions (Delta+17.7%), F(4, 94) = 2.79, p = .031. Message length shows a fully dissociated effect, F(4, 95) = 19.55, p < .001. A fifth condition -- replacing human with AI observers -- yields intermediate TTR adaptation, suggesting LLM behavior is sensitive to observer identity: human evaluation elicits stronger register formalization than automated AI surveillance. We discuss implications for AI governance, algorithmic auditing, and the repositioning of LLMs as contextually sensitive communicative actors.
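The study's main dependent variable, type-token ratio (TTR) change, is simple to compute. A minimal sketch with invented utterances; the reported deltas (e.g. +24.9%) come from the paper's debate transcripts, not from this toy data.

```python
def type_token_ratio(text):
    """Unique tokens over total tokens; a basic lexical-diversity measure."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Invented utterances for illustration only
baseline = type_token_ratio("we agree we agree on the plan the plan is fine")
monitored = type_token_ratio("we concur that the proposed plan appears sound")
delta = (monitored - baseline) / baseline * 100
print(f"TTR change under observation framing: {delta:+.1f}%")
```

A positive delta indicates the register formalization the paper associates with monitored conditions: more varied vocabulary relative to utterance length.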
Explainable Detection of Depression Status Shifts from User Digital Traces
Loris Belcastro, Francesco Gervino, Fabrizio Marozzo, Domenico Talia, Paolo Trun...
Personalized recommendation rationale:

This paper focuses on mental health (depression detection), a medical/health domain unrelated to the core techniques of recommendation systems, search, or advertising. Although it involves user digital traces, the topic diverges entirely from Search/RecSys/Ads optimization objectives (e.g., click-through rate, relevance, user satisfaction) and is not an application of enabling LLM or Transformer technology.

2026-05-14 15:56:38 | arXiv:2605.14995v1 |
cs.AI cs.CL cs.LG cs.SI
View full abstract
Every day, users generate digital traces (e.g., social media posts, chats, and online interactions) that are inherently timestamped and may reflect aspects of their mental state. These traces can be organized into temporal trajectories that capture how a user's mental health signals evolve, including phases of improvement, deterioration, or stability. In this work, we propose an explainable framework for detecting and analyzing depression-related status shifts in user digital traces. The approach combines multiple BERT-based models to extract complementary signals across different dimensions (e.g., sentiment, emotion, and depression severity). Such signals are then aggregated over time to construct user-level trajectories that are analyzed to identify meaningful change points. To enhance interpretability, the framework integrates a large language model to generate concise and human-readable reports that describe the evolution of mental-health signals and highlight key transitions. We evaluate the framework on two social media datasets. Results show that the approach produces more coherent and informative summaries than direct LLM-based reporting, achieving higher coverage of user history, stronger temporal coherence, and improved sensitivity to change points. An ablation study confirms the contribution of each component, particularly temporal modeling and segmentation. Overall, the method provides an interpretable view of mental health signals over time, supporting research and decision making without aiming at clinical diagnosis.
Conversion of Lexicon-Grammar tables to LMF. Application to French
Eric Laporte, Elsa Tolone, Mathieu Constant
Personalized recommendation rationale:

This paper focuses on converting lexical resources in natural language processing, with no direct connection to recommendation, search, or advertising. While lexical resources may be useful for some NLP tasks, there is no clear RecSys/Search/Ads application scenario or technical contribution.

2026-05-14 13:28:24 | arXiv:2605.14816v1 |
cs.CL
View full abstract
We describe the first experiment of conversion of Lexicon-Grammar tables for French verbs into the Lexical Markup Framework (LMF) format. The Lexicon-Grammar of the French language is currently one of the major sources of lexical and syntactic information for French. Its conversion into an interoperable representation format according to the LMF standard makes it usable in different contexts, thus contributing to the standardization and interoperability of natural language processing dictionaries. We briefly introduce the Lexicon-Grammar and the derived dictionaries; we analyse the main difficulties faced during the conversion; and we describe the resulting resource.
Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation
Songyang Gao, Yinghui Xia, Siyi Liu, Hui Xiong
Personalized recommendation rationale:

This paper focuses on generating research ideas from citation-evolution graphs, a topic in academic knowledge graphs and NLP text generation with no direct connection to core RecSys/Search/Ads techniques such as ranking, retrieval, and prediction. It also does not apply LLMs, Transformer architectures, or vision-language models to recommendation/search/advertising, so its relevance is very low.

2026-05-14 12:57:56 | arXiv:2605.14790v1 |
cs.CL cs.AI
View full abstract
Research idea generation is the innovation-driving step of automated scientific research. Recently, large language models (LLMs) have shown potential for automating idea generation at scale. However, existing methods mainly condition LLMs on eliciting idea generation through static retrieval of relevant literature or complex prompt engineering, without discarding the structural relations among references. We propose Graphs of Research (GoR), a supervised fine-tuning method that extracts a 2-hop reference neighborhood for each seed paper, derives the relations among those references from citation position, frequency, predecessor links, and publication time, and organizes them into a paper-evolution directed acyclic graph (DAG). We construct an automated extraction pipeline that draws data from five major ML/NLP venues, comprising 498/50/50 train/validation/test seed papers and approximately 7,600 cited references. Qwen2.5-7B-Instruct-1M is fine-tuned on a structured-text prompt that includes the citation graph, edge signals, reference information, and task definition to predict the idea for the seed paper. Across head-to-head LLM-judge tournaments against gpt-4o-driven baselines, GoR-SFT achieves SOTA, demonstrating the effectiveness of citation-evolution graphs as supervision signal for LLM-based idea generation. We hope that this reduces the barrier for citation evolution graphs as a supervision, accelerating automated scientific innovation.
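The pipeline's first step, extracting a 2-hop reference neighborhood for each seed paper, can be sketched over a citation map; `cites` below is a hypothetical adjacency dict, not GoR's actual data format.

```python
def two_hop_neighborhood(seed, cites):
    """All papers reachable within two citation hops of `seed`.

    cites: dict mapping a paper ID to the list of papers it references.
    """
    one_hop = set(cites.get(seed, []))
    two_hop = set()
    for paper in one_hop:
        two_hop |= set(cites.get(paper, []))
    return one_hop | two_hop

# Hypothetical citation map
cites = {"seed": ["A", "B"], "A": ["C"], "B": ["C", "D"]}
print(sorted(two_hop_neighborhood("seed", cites)))  # ['A', 'B', 'C', 'D']
```

GoR then annotates edges among these neighbors (citation position, frequency, predecessor links, publication time) to form the paper-evolution DAG used as supervision.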
Persian MusicGen: A Large-Scale Dataset and Culturally-Aware Generative Model for Persian Music
Mohammad Hossein Sameti, Diba Hadi Esfangereh, Sepehr Harfi Moridani, Leili Javi...
Personalized recommendation rationale:

This paper focuses on domain-specific music generation (Persian music), unrelated to core RecSys/Search/Ads techniques. Although generative models fall under LLM technology, the application is confined to music, with no discussion of potential uses in recommendation systems.

2026-05-14 12:31:46 | arXiv:2605.14765v1 |
cs.SD cs.CL
View full abstract
Persian music, with its unique tonalities, modal systems (Dastgah), and rhythmic structures, presents significant challenges for music generation models trained primarily on Western music. We address this gap by curating the first large-scale dataset of Persian songs, comprising over 900 hours high-quality audio samples across diverse sub-genres, including pop, traditional, and contemporary styles. This dataset captures the rich melodic and cultural diversity of Persian music and serves as the foundation for fine-tuning MusicGen, a state-of-the-art generative music model. We adapt MusicGen to this domain and evaluate its performance by utilizing subjective and objective metrics. To assess the semantic alignment between generated music and intended style tags, we report the proportion of relevant tags accurately reflected in the generated outputs. Our results demonstrate that the fine-tuned model produces compositions that more align with Persian stylistic conventions. This work introduces a new resource for generative music research and illustrates the adaptability of music generation models to underrepresented cultural and linguistic contexts.
Mechanical Enforcement for LLM Governance: Evidence of Governance-Task Decoupling in Financial Decision Systems
José Manuel de la Chica Rodríguez, Carlos Martí-González
Personalized recommendation rationale:

This paper concerns LLM governance and financial decision systems, an AI safety and governance topic rather than a core technique or application of search, recommendation, or advertising. The title mentions no technical points directly relevant to recommendation systems, search, or advertising, and the work is policy- and governance-oriented rather than a technical contribution, so it is unrelated to the current focus.

2026-05-14 12:12:42 | arXiv:2605.14744v1 |
cs.CL cs.AI cs.CY
View full abstract
Large language models in regulated financial workflows are governed by natural-language policies that the same model interprets, creating a principal--agent failure: outputs can appear compliant without being compliant. Existing evaluation measures task accuracy but not whether governance constrains behaviour at the decision rationale level -- where regulated decisions must be auditable. We introduce five governance metrics that quantify policy compliance at the rationale level and apply them in a synthetic banking domain to compare text-only governance against mechanical enforcement: four primitives operating outside the model's interpretive loop. Under text-only governance, 27% of deferrals carry no decision-relevant information. Mechanical enforcement reduces this rate by 73%, more than doubles deferral information content, and raises task accuracy from MCC~$0.43$ to $0.88$. The improvement is driven by architectural separation: LLM-generated rationales under mechanical enforcement show comparable CDL to text-only governance -- the gain comes from removing clear-cut decisions from the model's control. A causal ablation confirms that each primitive is individually necessary. Our central finding is a governance-task decoupling: under structural stress, text-only governance degrades on both dimensions simultaneously, whereas mechanical enforcement preserves governance quality even as task performance drops. This implies that governance and task evaluation are distinct axes: accuracy is not a sufficient proxy for governance in regulated AI systems.
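The accuracy figures quoted are Matthews correlation coefficients (MCC), which summarize a binary confusion matrix in a single number. A minimal implementation with hypothetical confusion counts; the paper's 0.43 and 0.88 come from its synthetic banking-domain evaluation.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient of a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical confusion counts for illustration
print(mcc(tp=45, tn=45, fp=5, fn=5))    # 0.8
print(mcc(tp=30, tn=40, fp=10, fn=20))  # weaker agreement
```

Unlike plain accuracy, MCC stays near zero for degenerate classifiers on imbalanced data, which is why it is a reasonable choice for compliance-style decisions.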
IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
Shijie Lian, Bin Yu, Xiaopeng Lin, Zhaolong Shen, Laurence Tianruo Yang, Yurun J...
Personalized recommendation rationale:

This paper focuses on intent modeling for robot manipulation, a robotics topic with no direct or potential connection to recommendation, search, or advertising. It does not involve LLMs, Transformer architectures, or core RecSys/Search/Ads techniques.

2026-05-14 11:31:02 | arXiv:2605.14712v1 |
cs.RO cs.AI cs.CL cs.CV
View full abstract
Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.
AI-assisted cultural heritage dissemination: Comparing NMT and glossary-augmented LLM translation in rock art documents
Vicent Briva-Iglesias, María Ferre-Fernández
Personalized recommendation rationale:

This paper is a comparative study of machine translation for the cultural heritage domain, a domain-specific application unrelated to core RecSys/Search/Ads techniques; it also does not address direct applications or potential value of LLMs in RecSys/Search/Ads.

2026-05-14 10:48:48 | arXiv:2605.14679v1 |
cs.CL cs.AI
View full abstract
Cultural heritage institutions increasingly disseminate research and interpretive materials globally, but multilingual dissemination is constrained by limited budgets and staffing. In terminology-dense domains such as rock art, translation quality depends on accurate, consistent specialised terms, and small lexical errors can mislead non-specialists and reduce reuse. We compare three English MT setups for a Spanish academic rock art text, focusing on simple, operationally feasible interventions rather than complex model-side modifications: (1) DeepL as a strong NMT baseline, (2) Gemini-Simple (LLM with a basic prompt), and (3) Gemini-RAG (the same LLM with glossary-augmented prompting via term-pair retrieval). Using PEARMUT, we conduct a human evaluation via (i) multi-way Direct Assessment (0--100) and (ii) targeted terminology auditing with a restricted MQM taxonomy. Gemini-RAG yields the highest exact-match terminology accuracy (81.4\%), versus Gemini-Simple (69.1\%) and DeepL (64.4\%), while preserving overall quality (mean DA 85.3 Gemini-RAG vs. 85.2 Gemini-Simple), outperforming DeepL (80.3). These results show that glossary-augmented prompting is a low-overhead way to improve terminology control in cultural-heritage translation if institutions maintain minimal terminology resources and lightweight evaluation procedures.
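The glossary-augmented setup (Gemini-RAG) amounts to retrieving matching term pairs and prepending them to the translation prompt. A minimal sketch; the glossary entries and prompt wording below are invented for illustration, not the study's actual materials.

```python
glossary = {  # hypothetical Spanish -> English rock-art term pairs
    "abrigo rupestre": "rock shelter",
    "pintura levantina": "Levantine painting",
    "arte esquemático": "schematic art",
}

def augment_prompt(source_text, glossary):
    """Retrieve glossary pairs present in the source and prepend them."""
    hits = {es: en for es, en in glossary.items() if es in source_text.lower()}
    term_block = "\n".join(f"- {es} -> {en}" for es, en in hits.items())
    return ("Translate the text into English. Use these glossary terms:\n"
            f"{term_block}\n\nText: {source_text}")

prompt = augment_prompt("El abrigo rupestre conserva pintura levantina.", glossary)
print(prompt)
```

The appeal of this design is its low overhead: an institution only needs to maintain a term-pair list, with no model-side changes.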
Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution
Tian Qin, Junzhe Chen, Yuqing Shi, Tianshu Zhang, Qiang Ju, Lijie Wen
Personalized recommendation rationale:

This paper focuses on mitigating hallucinations in large language models, a pure NLP topic with minimal connection to core recommendation, search, or advertising techniques. Although LLM technology can be applied to recommendation systems, the paper involves no RecSys/Search/Ads-specific application or innovation.

2026-05-14 09:37:55 | arXiv:2605.14621v1 |
cs.CV cs.AI cs.CL
View full abstract
Large vision-language models (LVLMs) often hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this problem by comparing predictions from the original image with those from externally perturbed visual inputs, but such references can introduce off-manifold artifacts and require costly extra forward passes. We propose SIRA, a training-free internal contrastive decoding framework that constructs a counterfactual reference inside the same LVLM by exploiting the staged information flow of multimodal transformers. Instead of removing visual information from the input, SIRA first lets image and text tokens interact through a shared prefix, forming an aligned multimodal state that preserves prompt interpretation, decoding history, positional structure, and early visual grounding. It then forks a counterfactual branch in later transformer layers, where attention to image-token positions is masked. This branch retains the shared multimodal context but lacks continued access to fine-grained visual evidence, yielding a language-prior-dominated internal reference for token-level contrast. During decoding, SIRA suppresses tokens that remain strong without late visual access and favors predictions whose advantage depends on the full visual pathway. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show that SIRA consistently reduces hallucinations while preserving descriptive coverage and incurring lower overhead than two-pass contrastive decoding. SIRA requires no training, external verifier, or perturbed input, and applies to open-weight LVLMs with white-box inference access.
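At decoding time, SIRA contrasts the full visual branch against the internally masked counterfactual branch at the token level. The abstract does not give SIRA's exact formula, so here is a common contrastive-decoding combination rule as a sketch:

```python
import numpy as np

def contrastive_logits(full, counterfactual, alpha=1.0):
    """Boost tokens whose advantage depends on the full visual pathway,
    penalize tokens that stay strong without late visual access."""
    return (1 + alpha) * full - alpha * counterfactual

# Toy 3-token vocabulary: token 0 is a language-prior guess, token 1 is
# grounded in the image, token 2 is neutral.
full = np.array([2.0, 1.8, 0.5])            # branch with visual access
counterfactual = np.array([2.5, 0.2, 0.5])  # late image attention masked
print(contrastive_logits(full, counterfactual))  # token 1 now wins
```

The language-prior token scores highest even without visual access, so the contrast suppresses it; the visually grounded token, whose score depends on the image, is promoted.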
SciPaths: Forecasting Pathways to Scientific Discovery
Eric Chamoun, Yizhou Chi, Yulong Chen, Rui Cao, Zifeng Ding, Michalis Korakakis,...
Personalized recommendation rationale:

This paper focuses on forecasting scientific discovery, a scientometrics and knowledge-discovery topic unrelated to core recommendation, search, or advertising techniques. There is no clear evidence that its methods or findings apply to search, recommendation, or advertising.

2026-05-14 09:10:28 | arXiv:2605.14600v1 |
cs.CL
View full abstract
Scientific progress depends on sequences of enabling contributions, yet existing AI4Science benchmarks largely focus on citation prediction, literature retrieval, or idea generation rather than the dependencies that make progress possible. In this paper, we introduce discovery pathway forecasting: given a target scientific contribution and the prior literature available at a specified time, the task is to (1) identify the enabling contributions required to realize it and (2) ground each in prior work when such prior work exists. We present SciPaths, a benchmark of 262 expert-annotated gold pathways and 2,444 silver pathways constructed from machine learning and natural language processing papers, where each pathway records enabling contributions, roles, rationales, and prior-work groundings or unmapped decisions. Evaluating frontier and open-weight language models, we find that the best model reaches only 0.189 F1 under strict semantic matching, with core methodological dependencies hardest to recover. Prior-work grounding improves substantially when gold enabling contributions are provided, showing that decomposition quality is a major bottleneck for end-to-end pathway recovery. SciPaths therefore shifts evaluation toward a missing capability in scientific forecasting: reasoning backward from a target contribution to the enabling scientific building blocks and prior-work dependencies that make it feasible.
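The headline 0.189 F1 is computed over predicted versus gold enabling contributions. A minimal sketch using exact string match in place of the benchmark's strict semantic matching:

```python
def pathway_f1(predicted, gold):
    """F1 between predicted and gold enabling-contribution sets
    (exact match here; the benchmark uses strict semantic matching)."""
    tp = len(set(predicted) & set(gold))
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Hypothetical enabling contributions for one target paper
gold = ["pretrained encoder", "contrastive objective", "benchmark X"]
predicted = ["pretrained encoder", "data augmentation"]
print(round(pathway_f1(predicted, gold), 3))  # 0.4
```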
GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
Jingbo Yang, Kwei-Herng Lai, Xiaowen Wang, Shiyu Chang, Yaar Harari, Evgeniy Gab...
Personalized recommendation rationale:

This paper focuses on benchmarking LLM memory in multi-party conversations, a pure NLP and dialogue-systems topic that does not address core RecSys/Search/Ads problems such as user modeling, item representation, or matching. It is neither a core-domain advance nor shows potential RecSys/Search/Ads applications, so its relevance is very low.

2026-05-14 07:38:29 | arXiv:2605.14498v1 |
cs.CL
View full abstract
Large Language Model (LLM) agents increasingly serve as personal assistants and workplace collaborators, where their utility depends on memory systems that extract, retrieve, and apply information across long-running conversations. However, both existing memory systems and benchmarks are built around the dyadic, single-user setup, even though real deployments routinely span groups and channels with multiple users interacting with the agent and with each other. This mismatch leaves three properties of group memory unmeasured: (i) group dynamics that go beyond concatenated one-on-one chats, (ii) speaker-grounded belief tracking, where the per-user memory modeling is needed, and (iii) audience-adapted language, where Theory-of-Mind shifts produce role-specific vocabulary. We introduce GroupMemBench, a benchmark that exposes all three. A graph-grounded synthesis pipeline produces multi-party conversations with controllable reply structure and conditions each message on per-user personas and target audiences. An adversarial query pipeline then binds every question to a specific asker across six categories, spanning multi-hop reasoning, knowledge update, term ambiguity, user-implicit reasoning, temporal reasoning, and abstention, and iteratively searches challenging, realistic queries that reflect comprehensive memory capability. Benchmarking leading memory systems exposes a sharp collapse: the strongest one reaches only 46.0% average accuracy, with knowledge update at 27.1% and term ambiguity at 37.7%, while a simple BM25 baseline matches or exceeds most agent memory systems. This indicates current memory ingestion erases the structural and lexical features group memory depends on, leaving multi-user memory far from solved.
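The surprising strength of the BM25 baseline is worth grounding. Below is a compact Okapi BM25 scorer over tokenized messages; this is the standard formulation, not necessarily the benchmark's exact configuration.

```python
import math
from collections import Counter

def bm25(query, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of `doc` for `query`; all inputs are token lists."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(term in d for d in corpus)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

# Toy multi-party messages as the memory store
corpus = [["alice", "moved", "the", "deadline", "to", "friday"],
          ["bob", "prefers", "postgres", "over", "sqlite"],
          ["standup", "is", "at", "nine"]]
scores = [bm25(["deadline"], d, corpus) for d in corpus]
print(scores.index(max(scores)))  # 0: the deadline message ranks first
```

That such lexical matching rivals agent memory systems suggests, as the authors argue, that memory ingestion is discarding structural and lexical signals rather than failing at retrieval.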
When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context
Haojun Weng, Qianqian Yang, Hao Fu, Haobin Pan, Xinwei Lv
Personalized recommendation rationale:

This paper focuses on retrieval context for code completion, a software-engineering and code-intelligence topic with no direct connection to core recommendation, search, or advertising techniques. Although it involves retrieval, its application setting (code completion) differs substantially from typical RecSys/Search/Ads scenarios such as item recommendation and information retrieval, and it proposes no transferable general-purpose method.

2026-05-14 07:18:30 | arXiv:2605.14478v1 |
cs.SE cs.AI cs.CL
View full abstract
Context: Retrieval-augmented code generation relies on cross-file repository context, but retrieved snippets may come from obsolete project states. Objectives: We study whether temporally stale repository snippets act as harmless noise or actively induce current-state-incompatible code. Methods: We conduct a controlled diagnostic study on a curated 17-sample set of production-helper signature changes from five Python repositories. For each sample, we compare current-only, stale-only, no-retrieval, and mixed current/stale retrieval conditions under prompts that hide commit freshness and expected current signatures. Results: Under neutralized prompts, stale-only retrieval induces stale helper references on 15/17 Qwen2.5-Coder-7B-Instruct samples and 13/17 gpt-4.1-mini samples, corresponding to 88.2 and 76.5 percentage-point increases over current-only retrieval. No retrieval produces zero stale references but only 1/17 passing completions. The two models share 75.0% Jaccard overlap among stale-triggering samples, and mixed conditions show that adding valid current evidence largely rescues stale-only failures. Conclusion: Temporal validity of retrieved repository context is a distinct diagnostic variable for Code RAG robustness: stale context can actively bias models toward obsolete repository state rather than merely removing useful evidence.
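The reported 75.0% cross-model agreement is a Jaccard overlap between the two models' sets of stale-triggering samples. A quick sketch with hypothetical sample IDs that reproduce that ratio:

```python
def jaccard(a, b):
    """Jaccard overlap |A & B| / |A | B| between two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical sample IDs: 15 stale-triggering for one model, 13 for the other
qwen_stale = set(range(15))    # samples 0..14
gpt_stale = set(range(3, 16))  # samples 3..15
print(jaccard(qwen_stale, gpt_stale))  # 0.75
```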
LiSA: Lifelong Safety Adaptation via Conservative Policy Induction
Minbeom Kim, Lesly Miculicich, Bhavana Dalvi Mishra, Mihir Parmar, Phillip Walli...
Personalized recommendation rationale:

The title refers to lifelong safety adaptation and conservative policy induction; the core topic is adapting safety guardrails for AI agents, which does not involve recommendation systems, search, or advertising. The content leans toward safety and governance, with no direct connection to Transformer architecture work or RecSys/Search/Ads applications.

2026-05-14 06:47:35 | arXiv:2605.14454v1 |
cs.LG cs.CL cs.CR
View full abstract
As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency--performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge risks.
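The abstract does not specify LiSA's posterior lower bound, but evidence-aware confidence gating can be illustrated with a Wilson score lower bound: a rule's reuse score grows with accumulated evidence, not just its empirical accuracy.

```python
import math

def wilson_lower_bound(successes, n, z=1.96):
    """95% lower confidence bound on a rule's success rate (illustrative
    stand-in for LiSA's posterior bound, which the abstract does not give)."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom

# Same empirical accuracy (100%), very different amounts of evidence
print(round(wilson_lower_bound(3, 3), 3))    # weak evidence: gated down
print(round(wilson_lower_bound(30, 30), 3))  # strong evidence: reused
```

Gating on the lower bound rather than the point estimate keeps a newly induced policy conservative until enough deployment feedback accumulates.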
When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition
Siyang Yao, Erhu Feng, Yubin Xia
Personalized recommendation rationale:

This paper focuses on hallucination detection in large language models, a pure NLP topic with no direct connection to the core challenges of recommendation systems, search, and advertising. Although hallucination detection has reference value for LLM applications, the paper shows no intersection with RecSys/Search/Ads, so it does not match the current focus.

2026-05-14 06:44:18 | arXiv:2605.14449v1 |
cs.LG cs.AI cs.CL
View full abstract
Hallucination detection in large language models (LLMs) requires balancing accuracy, efficiency, and robustness to distribution shift. Black-box consistency methods are effective but demand repeated inference; single-pass white-box probes are efficient yet treat answer representations in isolation, often degrading sharply under domain shift. We propose QAOD (Question-Answer Orthogonal Decomposition), a single-pass framework that projects away the question-aligned direction from the answer representation to obtain a question-orthogonal component that suppresses domain-conditioned variation. To identify informative signals, QAOD further selects layers via diversity-penalized Fisher scoring and discriminative neurons via Fisher importance. To address both in-domain detection and cross-domain generalization, we design two complementary probing strategies: pairing the orthogonal component with question context yields a joint probe that maximizes in-domain discriminability, while using the orthogonal component alone preserves domain-agnostic factuality signals for robust transfer. QAOD's joint probe achieves the best in-domain AUROC across all evaluated model-dataset pairs, while the orthogonal-only probe delivers the strongest OOD transfer, surpassing the best white-box baseline by up to 21% on BioASQ at under 25% of generation cost.
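The core operation, projecting the question-aligned direction out of the answer representation, is a standard vector projection. A minimal sketch on toy hidden states:

```python
import numpy as np

def question_orthogonal(answer_vec, question_vec):
    """Remove the component of the answer representation that lies along
    the question direction, keeping only the question-orthogonal part."""
    q = question_vec / np.linalg.norm(question_vec)
    return answer_vec - (answer_vec @ q) * q

# Toy hidden states
question = np.array([1.0, 0.0, 1.0])
answer = np.array([2.0, 3.0, 0.0])
residual = question_orthogonal(answer, question)
print(residual, residual @ question)  # orthogonal: dot product is 0
```

The residual, by construction, carries no variation along the question direction, which is how QAOD suppresses domain-conditioned signals that ride on the question.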
A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR
Sunil Kumar Kopparapu
Personalized recommendation rationale:

This paper is devoted to automatic speech recognition (ASR) and is unrelated to the core technologies of recommendation, search, or advertising. Although ASR can be viewed as a modality, the work centers on a mathematical method for determining vocabulary size, with no direct or potential application to RecSys/Search/Ads.

2026-05-14 06:19:42 | arXiv:2605.14427v1 |
cs.CL cs.SD
Full abstract
In hybrid automatic speech recognition (ASR) systems, the vocabulary size is unambiguous, typically determined by the number of phones, bi-phones, or tri-phones present in the language. In contrast, end-to-end ASR systems derive their vocabulary, often referred to as tokens, from the text corpus used for training. The choice and, more importantly, the size of this vocabulary is a critical hyper-parameter in training end-to-end ASR systems. Tokenization algorithms such as Byte Pair Encoding (BPE), WordPiece, and Unigram Language Model (ULM) use the vocabulary size as an input hyper-parameter to generate the sub-words employed during ASR training. Popular toolkits like ESPNet provide a fixed vocabulary size in their training recipes, but there is little documentation or discussion in the literature regarding how these values are determined. Recent work [1] has formalized an approach to identify the vocabulary size best suited for end-to-end ASR, introducing a cost function framework that treats the tokenization process as a black box. In this paper, we build upon that foundation by curve fitting the training data and using the first- and second-derivative tests from calculus to formally estimate the vocabulary size hyper-parameter. We demonstrate the utility of our approach by applying it to the standard Librispeech corpus and show that the optimal choice of the vocabulary size hyper-parameter improves the performance of the ASR. The main contribution of this paper is the formalization of an approach to identify the vocabulary size best suited for training an end-to-end ASR system.
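The derivative-test step the abstract describes — fit a smooth curve to cost-versus-vocabulary-size measurements, then locate a minimum where the first derivative vanishes and the second derivative is positive — can be sketched with a polynomial fit. The (size, cost) points below are synthetic, not taken from the paper:

```python
import numpy as np

# Hypothetical (vocab_size, cost) measurements with a minimum near 5000.
sizes = np.array([1000, 2000, 4000, 5000, 6000, 8000, 10000], dtype=float)
costs = (sizes - 5000.0) ** 2 / 1e6 + 2.0

coeffs = np.polyfit(sizes, costs, deg=2)   # fit a smooth curve to the data
d1 = np.polyder(coeffs)                    # first derivative
d2 = np.polyder(coeffs, 2)                 # second derivative

root = -d1[1] / d1[0]                      # stationary point: d1(root) == 0
assert np.polyval(d2, root) > 0            # second-derivative test: a minimum
best_vocab = float(root)
```

With a real cost function the curve would come from measured tokenization costs at a handful of vocabulary sizes; the stationary-point-plus-second-derivative check is the calculus step the paper formalizes.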
SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades
Man Ho Lam, Chaozheng Wang, Hange Liu, Jingyu Xiao, Haau-sing Li, Jen-tse Huang,...
Personalized recommendation rationale:

This paper focuses on benchmarking coding agents in software engineering, an SE/NLP topic with no direct or indirect connection to the core technologies of recommendation, search, or advertising. It does not involve LLM applications in recommendation/search/ads, nor does it fall under enabling techniques such as Transformer architectures or VLM analogues.

2026-05-14 06:04:40 | arXiv:2605.14415v1 |
cs.SE cs.AI cs.CL
Full abstract
Coding agents powered by large language models are increasingly expected to perform realistic software maintenance tasks beyond isolated issue resolution. Existing benchmarks have shifted toward realistic software evolution, but they rarely capture continuous maintenance at the granularity of package releases, where changes are bundled, shipped, and inherited by subsequent versions. We present SWE-Chain, a benchmark for evaluating agents on chained release-level package upgrades, where each transition builds on the agent's prior codebase. To produce upgrade specifications, we design a divide-and-conquer synthesis pipeline that aligns release notes with code diffs for each version transition, ensuring the requirements are grounded in actual code changes, informative to agents, and feasible to implement. SWE-Chain contains 12 upgrade chains across 9 real Python packages, with 155 version transitions and 1,660 grounded upgrade requirements. Across nine frontier agent-model configurations, agents achieve an average of 44.8% resolving, 65.4% precision, and 50.2% F1 under the Build+Fix regime, with Claude-Opus-4.7 (Claude Code) leading at 60.8% resolving, 80.6% precision, and 68.5% F1. These results show that SWE-Chain is both feasible and discriminative, and reveal that current agents still struggle to make correct upgrades across chained package releases without breaking existing functionality.
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation
Kyomin Hwang, Hyeonjin Kim, Sangyeon Cho, Nojun Kwak
Personalized recommendation rationale:

This paper focuses on evaluating multilingual machine unlearning, an NLP safety and ethics topic unrelated to the core technologies of RecSys/Search/Ads. Although machine unlearning involves model updates, the paper mentions no concrete application to recommendation or search, and it belongs to the out-of-scope privacy/ethics category.

2026-05-14 05:45:24 | arXiv:2605.14404v1 |
cs.CL
Full abstract
While LLMs are increasingly used in commercial services, they pose privacy risks such as leakage of sensitive personally identifiable information (PII). For LLMs trained on multilingual corpora, Multilingual Machine Unlearning (MMU) aims to remove information across multiple languages. However, prior MMU evaluations fail to capture such cross-linguistic distribution of information, being largely limited to direct extensions of per-language evaluation protocols. To this end, we propose two metrics to evaluate the information spread across languages: the Knowledge Separability Score (KSS) and the Knowledge Persistence Score (KPS). KSS measures the overall unlearning quality across multiple languages, while KPS more specifically aims to assess consistent removal of information among different language pairs. We evaluated various unlearning methods in the multilingual setting with these metrics and conducted comprehensive analyses. Through our investigation, we provide insights into unique phenomena exclusive to MMU and offer a new perspective on MMU evaluation.
Nexus : An Agentic Framework for Time Series Forecasting
Sarkar Snigdha Sarathi Das, Palash Goyal, Mihir Parmar, Nanyun Peng, Vishy Tirum...
Personalized recommendation rationale:

This paper addresses time series forecasting, a general-purpose prediction area with low relevance to recommendation systems, search, or advertising. Although time series methods could be used for user behavior prediction, the paper leans toward general methodology and lacks a clear link to RecSys/Search/Ads applications.

2026-05-14 05:12:13 | arXiv:2605.14389v1 |
cs.AI cs.CL cs.LG
Full abstract
Time series forecasting is not just numerical extrapolation, but often requires reasoning with unstructured contextual data such as news or events. While specialized Time Series Foundation Models (TSFMs) excel at forecasting based on numerical patterns, they remain unaware of real-world textual signals. Conversely, while LLMs are emerging as zero-shot forecasters, their performance remains uneven across domains and in contextual grounding. To bridge this gap, we introduce Nexus, a multi-agent forecasting framework that decomposes prediction into specialized stages: isolating macro-level and micro-level temporal fluctuations, and integrating contextual information when available before synthesizing a final forecast. This decomposition enables Nexus to adapt from seasonal signals to volatile, event-driven information without relying on external statistical anchors or monolithic prompting. We show that current-generation LLMs possess substantially stronger intrinsic forecasting ability than previously recognized, depending critically on how numerical and contextual reasoning are organized. Evaluated on data strictly postdating LLM knowledge cutoffs, spanning Zillow real estate metrics and volatile stock market equities, Nexus consistently matches or outperforms state-of-the-art TSFMs and strong LLM baselines. Beyond numerical accuracy, Nexus produces high-quality reasoning traces that explicitly show the fundamental drivers behind each forecast. Our results establish that real-world forecasting is an agentic reasoning problem extending well beyond sequence modeling alone.
Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax
Zeli Su, Ziyin Zhang, Zhou Liu, Xuexian Song, Zhankai Xu, Longfei Zheng, Xiaolu ...
Personalized recommendation rationale:

This paper focuses on reinforcement learning with semantic rewards for low-resource languages, an area of LLM expansion and multilingual NLP with no direct or potential application to the core concerns of recommendation, search, or advertising (such as user behavior modeling, ranking, or personalization).

2026-05-14 04:47:22 | arXiv:2605.14366v1 |
cs.CL cs.LG
Full abstract
Extending large language models (LLMs) to low-resource languages often incurs an "alignment tax": improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions. To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pretrained knowledge. We evaluate our approach on Tibetan-Chinese machine translation and Tibetan headline generation. Experiments show that our method acquires low-resource capabilities while markedly mitigating alignment tax, preserving general competence more effectively than SFT. Despite producing less rigid surface overlap, semantic RL yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision. Overall, our study demonstrates that reinforcement learning with semantic rewards provides a safer and more reliable pathway for inclusive low-resource language expansion.
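The group-relative part of GRPO is simple to state: each sampled completion's embedding-level reward is normalized against the other completions in its group, so advantages are relative to the group's own baseline. A minimal numpy sketch; the cosine-similarity reward here is a stand-in for the paper's actual embedding-level scoring, and all names are hypothetical:

```python
import numpy as np

def semantic_reward(candidate_emb: np.ndarray, reference_emb: np.ndarray) -> float:
    """Embedding-level reward: cosine similarity to a reference meaning,
    rewarding meaning preservation rather than token-level overlap."""
    return float(candidate_emb @ reference_emb /
                 (np.linalg.norm(candidate_emb) * np.linalg.norm(reference_emb)))

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantage: standardize each reward within its group, so
    completions are compared against their own group's mean and spread."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = np.array([0.9, 0.7, 0.5, 0.3])   # hypothetical group of 4 samples
adv = group_relative_advantages(rewards)    # mean ~0; best sample positive
```

Because the advantage is zero-mean within each group, the policy is pushed toward the semantically closest realizations and away from the worst ones without any absolute reward scale, which is what allows flexible surface forms to score well.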
A Formative Study of Brief Affective Text as a Complement to Wearable Sensing for Longitudinal Student Health Monitoring
Tamunotonye Harry, Johanna Hidalgo, Matthew Price, Yuanyuan Feng, Kathryn Stanto...
Personalized recommendation rationale:

This paper focuses on student health monitoring, a healthcare topic unrelated to the core technologies of recommendation, search, or advertising. Although it involves text and wearable sensors, it lacks direct application potential for recommendation/search/ads or a connection to key techniques such as Transformer architectures or LLM methods.

2026-05-14 04:36:29 | arXiv:2605.14360v1 |
cs.HC cs.CL
Full abstract
Wearable devices capture physiological and behavioral data with increasing fidelity, but the psychological context shaping these outcomes is difficult to recover from sensor data alone, limiting passive sensing utility for digital health. We examined whether ultra-brief naturalistic concern text could serve as a scalable complement to passive sensing. In a year-long study of 458 university students (3,610 person-waves) tracked with Oura rings, participants responded bimonthly to an open-ended prompt about what concerned them most; responses had a median length of three words. We compared dictionary-based, general pretrained, and domain-adapted NLP approaches using within-person mixed-effects models across nine sleep and physical activity outcomes. Weeks dominated by academic concern framing were associated with lower physical activity; weeks characterized by emotional exhaustion language were associated with poorer sleep quality and lower heart rate variability. General pretrained embeddings outperformed domain-adapted models for most outcomes, with domain adaptation showing relative advantage for autonomic outcomes. Zero-shot classification of concern topics produced no significant associations, while affective dimensions across all three methods were consistently associated with outcomes, indicating emotional register rather than topical content carries the signal. These findings offer design guidance: ultra-brief affective prompts enrich the psychological interpretability of passive physiological data at minimal burden.
Herculean: An Agentic Benchmark for Financial Intelligence
Xueqing Peng, Zhuohan Xie, Yupeng Cao, Haohang Li, Lingfei Qian, Yan Wang, Vince...
Personalized recommendation rationale:

This paper focuses on agent benchmarking in the financial domain, a domain-specific application unrelated to the core technologies of recommendation, search, or advertising, to LLM applications there, or to Transformer architectures. It therefore falls outside my scope of interest.

2026-05-14 04:30:49 | arXiv:2605.14355v1 |
cs.AI cs.CL
Full abstract
As AI agents improve, the central question is no longer whether they can solve isolated, well-defined financial tasks, but whether they can reliably carry out professional financial work. Existing financial benchmarks offer only a partial view of this ability, as they primarily evaluate static competencies such as question answering, retrieval, summarization, and classification. We introduce Herculean, the first skill-based benchmark for agentic financial intelligence, spanning four representative workflows: Trading, Hedging, Market Insights, and Auditing. Each workflow is instantiated as a standardized MCP-based skill environment with its own tools, interaction dynamics, constraints, and success criteria, enabling consistent end-to-end assessment of heterogeneous agent systems. Across frontier agents, we find that agents perform relatively well on Trading and Market Insights, but struggle substantially on Hedging and Auditing, where long-horizon coordination, state consistency, and structured verification are critical. Overall, our results point to a key gap in current agents' ability to turn financial reasoning into dependable execution of high-stakes financial workflows.
Auditing Agent Harness Safety
Chengzhi Liu, Yichen Guo, Yepeng Liu, Yuzhe Yang, Qianqi Yan, Xuandong Zhao, Wen...
Personalized recommendation rationale:

This paper focuses on AI safety auditing, which belongs to out-of-scope topics such as safety and ethics, with no direct or potential connection to recommendation, search, or advertising.

2026-05-14 02:14:28 | arXiv:2605.14271v1 |
cs.CL cs.CY
Full abstract
LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose HarnessAudit, a framework that audits full execution trajectories across boundary compliance, execution fidelity, and system stability, with a focus on multi-agent harnesses where these risks are most pronounced. We further introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; and (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.
What Makes Words Hard? Sakura at BEA 2026 Shared Task on Vocabulary Difficulty Prediction
Adam Nohejl, Xuanxin Wu, Yusuke Ide, Maria Angelica Riera Machin, Yi-Ning Chang,...
Personalized recommendation rationale:

This paper addresses vocabulary difficulty prediction, a task in NLP and education with no direct connection to the core technologies of recommendation, search, or advertising. Its methods likely do not involve innovative applications of LLMs or Transformer architectures to recommendation, search, or ads.

2026-05-14 01:57:35 | arXiv:2605.14257v1 |
cs.CL
Full abstract
We describe two types of models for vocabulary difficulty prediction: a high-accuracy black-box model, which achieved the top shared task result in the open track, and an explainable model, which outperforms a fine-tuned encoder baseline. As the black-box model, we fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council's Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words. We make our code available online at https://github.com/adno/vocabulary-difficulty .
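The soft-target trick the abstract mentions — fine-tuning against a distribution over difficulty ratings rather than a single one-hot label — corresponds to cross-entropy with soft labels. A minimal numpy sketch with made-up rating distributions (the paper's actual target construction from KVL annotations is not specified here):

```python
import numpy as np

def soft_target_cross_entropy(logits: np.ndarray, soft_targets: np.ndarray) -> float:
    """Cross-entropy against a *distribution* over rating levels, so the
    model is rewarded for placing mass near the annotated ratings instead
    of only on a single one-hot class."""
    logits = logits - logits.max()                      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())   # log-softmax
    return float(-(soft_targets * log_probs).sum())

# Hypothetical: annotators spread their judgments over 5 difficulty levels.
targets = np.array([0.0, 0.1, 0.6, 0.3, 0.0])
close = np.array([0.0, 1.0, 3.0, 2.0, 0.0])  # prediction mass near level 3
far = np.array([3.0, 1.0, 0.0, 0.0, 0.0])    # prediction mass on level 1
```

Predictions concentrated near the annotators' ratings (`close`) incur a lower loss than predictions on a distant level (`far`), which is why a soft-target loss suits a graded rating task better than plain classification.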
DT-Transformer: A Foundation Model for Disease Trajectory Prediction on a Real-world Health System
Yunying Zhu, Andrew R Weckstein, Kueiyu Joshua Lin, Jie Yang
Personalized recommendation rationale:

This paper focuses on disease trajectory prediction, a healthcare topic not directly related to my core areas (search, recommendation, advertising). Although the Transformer architecture itself is a general-purpose technique, the paper's application scenario is narrow, with no clear path to direct or potential use in recommendation systems, search, or advertising.

2026-05-14 00:45:04 | arXiv:2605.14227v1 |
cs.LG cs.CL
Full abstract
Accurate disease trajectory prediction is critical for early intervention, resource allocation, and improving long-term outcomes. While electronic health records (EHRs) provide a rich longitudinal view of patient health in clinical environments, models trained on curated research cohorts may not reflect routine deployment settings, and those trained on single-hospital datasets capture only fragments of each patient's trajectory. This highlights the importance of leveraging large, multi-hospital health systems for training and validation to better reflect real-world clinical complexity. In this work, we develop DT-Transformer, a foundation model trained on 57.1M structured EHR entries over 1.7M patients from Mass General Brigham (MGB), spanning 11 hospitals and a broad network of outpatient clinics. DT-Transformer achieves strong discrimination in both held-out and prospective validation settings. Next-event prediction achieves a median age- and sex-stratified AUC of 0.871 across 896 disease categories, with all categories exceeding AUC 0.5. These results support health system-scale training as a path toward foundation models suited to real-world clinical forecasting.
Streaming Speech-to-Text Translation with a SpeechLLM
Titouan Parcollet, Shucong Zhang, Xianrui Zheng, Rogier C. van Dalen
Personalized recommendation rationale:

This paper focuses on speech-to-text translation, a speech/NLP topic with no direct connection to the core technologies of recommendation, search, or advertising. It does not touch on directions applicable to RecSys/Search/Ads, such as Transformer efficiency improvements or multimodal modeling, and is therefore not relevant.

2026-05-14 12:32:57 | arXiv:2605.14766v1 |
cs.CL cs.AI eess.AS
Full abstract
Normally, a system that translates speech into text consists of separate modules for speech recognition and text-to-text translation. Combining those tasks into a SpeechLLM promises to exploit paralinguistic information in the speech and to reduce cascaded errors. But existing SpeechLLM systems are slow since they do not work in a real streaming fashion: they wait for a complete utterance of audio before outputting a translation, or output tokens at fixed intervals, which is not suitable for real applications. This work proposes an LLM-based architecture for real streaming speech-to-text translation. The LLM learns not just to emit output tokens, but also to decide whether it has seen enough audio to do so. The system is trained using automatic alignments of the input speech and the output text. In experiments on different language pairs, the system achieves a translation quality close to the non-streaming baseline, but with a latency of only 1-2 seconds.
Non-linear Interventions on Large Language Models
Sangwoo Kim
Personalized recommendation rationale:

This paper studies non-linear interventions on LLMs, which may involve model editing or mechanism adjustment, but it is not explicitly related to any subarea of recommendation, search, or advertising (such as ranking, retrieval, or user modeling) and offers no potential application in RecSys/Search/Ads.

2026-05-14 12:14:42 | arXiv:2605.14749v1 |
cs.CL cs.AI cs.LG
Full abstract
Intervention is one of the most representative and widely used methods for understanding the internal representations of large language models (LLMs). However, existing intervention methods are confined to linear interventions grounded in the Linear Representation Hypothesis, leaving features encoded along non-linear manifolds beyond their reach. In this work, we introduce a general formulation of intervention that extends naturally to non-linearly represented features, together with a learning procedure that further enables intervention on implicit features lacking a direct output signature. We validate our framework on refusal bypass steering, where it steers the model more precisely than linear baselines by intervening on a non-linear feature governing refusal.
Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines
Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal
Personalized recommendation rationale:

This paper focuses on subscenario refactoring in software testing, a software engineering topic with no direct connection to the core technologies of recommendation, search, or advertising (such as user modeling, item ranking, or multimodal fusion). Although the title mentions LLMs, they serve only as judge baselines, and the application scenario is unrelated to RecSys/Search/Ads, so the paper does not meet the topical criteria.

2026-05-14 08:38:04 | arXiv:2605.14568v1 |
cs.SE cs.CL cs.LG
Full abstract
Context. Behaviour-Driven Development (BDD) software test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within-file Background, within-repo reusable-scenario invocation, cross-organisational shared higher-level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies. Objective. Rank recurring step subsequences ("slices") by refactoring suitability (extraction-worthy), pre-map each to one of the three patterns, and quantify prevalence across the public BDD ecosystem. Method. Every contiguous L-step window (L in [2, 18]) in a 339-repository / 276-upstream-owner Gherkin corpus is keyed by paraphrase-robust cluster identifiers and counted under three scopes. Sentence-BERT (SBERT) / Uniform Manifold Approximation and Projection (UMAP) / Hierarchical Density-Based Clustering (HDBSCAN) recovers paraphrase-equivalent slices. Three authors label a stratified 200-slice pool against a written rubric. An eXtreme Gradient Boosting (XGBoost) extraction-worthy classifier trained under 5-fold cross-validation is compared with a tuned rule baseline and two open-weight Large Language Model (LLM) judges. Results. The miner produces 5,382,249 slices collapsing to 692,020 recurring patterns. Three-author Fleiss' kappa = 0.56 (extraction-worthy) and 0.79 (mechanism). The classifier reaches out-of-fold F1 = 0.891 (95% CI [0.852, 0.927]), outperforming both the rule baseline (F1 = 0.836, p = 0.017) and the better LLM judge (F1 = 0.728, p < 1e-4). 75.0%, 59.5%, and 11.7% of scenarios carry a within-file Background, within-repo reusable-scenario, or cross-organisational shared-step candidate. Conclusion. Paraphrase-robust subscenario discovery yields a corpus-wide census of BDD refactoring opportunities; pipeline, classifier predictions, labelled pool, and rubric are released under Apache-2.0.
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation
Suyoung Bae, Jaehoon Lee, Changkyu Choi, YunSeok Choi, Jee-Hyong Lee
Personalized recommendation rationale:

This paper focuses on code documentation generation, a software engineering topic with no direct connection to the core technologies of recommendation, search, or advertising, and it does not involve potential applications of LLMs or Transformer architectures in RecSys/Search/Ads.

2026-05-14 08:35:20 | arXiv:2605.14563v1 |
cs.SE cs.CL
Full abstract
Automated code documentation is essential for modern software development, providing the contextual grounding that both human developers and coding agents rely on to navigate large codebases. Existing repository-level approaches process components independently, causing redundant retrieval and conflicting descriptions across documents while producing outputs that lack hierarchical structure. Therefore, we propose MemDocAgent, a long-horizon agentic framework that generates documentation within a single, integrated context spanning the entire repository. It combines two components: (i) Dependency-Aware Traversal Guiding that predetermines a traversal order respecting dependency and granularity hierarchies; (ii) Memory-Guided Agentic Interaction, in which the agent interacts with RepoMemory, a shared memory accumulating prior work traces through read, write, and verify operations. Through an in-depth multi-criteria evaluation, MemDocAgent achieves the best performance over both open and closed-source baselines and demonstrates practical applicability in real software development workflows.
Cross-Linguistic Transcription and Phonological Representation in the Huìtóngguǎnxì Huáyíyìyǔ
Ji-eun Kim
Personalized recommendation rationale:

This paper studies phonological transcription in historical linguistics, a linguistics/philology topic with no connection whatsoever to LLMs, recommendation systems, search, or advertising.

2026-05-14 07:21:18 | arXiv:2605.14480v1 |
cs.CL
Full abstract
Purpose: This study investigates the transcription principles underlying Huìtóngguǎnxì Huáyíyìyǔ (HHY), a series of multilingual glossaries compiled by the Ming government between the fifteenth and sixteenth centuries for interpreter training. The study treats HHY not as a collection of isolated language materials, but as a coherent multilingual transcription system representing spoken forms of non-Chinese languages through Chinese characters. Methods: A substantial portion of HHY was digitized and aligned with Chinese phonological categories. Previous reconstructions of individual language sections were critically reviewed and integrated into a unified comparative database. The analysis focuses on cross-linguistic regularities in Main Transcription (MT) and Supplementary Transcription (ST) across eight language sections. Results: MT generally represents sounds compatible with the Chinese syllable structure of the period, whereas ST mainly encodes phonetic features less compatible with Chinese phonology. The analysis further shows that Chinese phonological categories were used more flexibly in foreign-language transcription than previously assumed. HHY therefore functioned as a relatively systematic method of phonetic approximation rather than a direct projection of Chinese phonology onto non-Chinese languages. Conclusion: HHY can be analyzed as an internally structured transcription system rather than merely as a collection of glossaries. More broadly, the study demonstrates that historical transcription systems can provide valuable evidence for historical phonology, particularly for under-documented Asian languages with limited historical records.
LLM-based Detection of Manipulative Political Narratives
Sinclair Schneider, Florian Steuber, Gabi Dreo Rodosek
Personalized recommendation rationale:

This paper focuses on using LLMs to detect manipulative content in political narratives, an application in social science or NLP. It has no direct connection to the core tasks of recommendation, search, or advertising systems (such as ranking, retrieval, or matching), does not involve multimodal or heterogeneous data modeling, and therefore does not meet the screening criteria.

2026-05-14 04:30:21 | arXiv:2605.14354v1 |
cs.CL
Full abstract
We present a new computational framework for detecting and structuring manipulative political narratives, a task that has become more important as political discussion shifts to social media. One of the primary challenges is differentiating between manipulative political narratives and legitimate critiques; some posts may also reframe actual events within a manipulative context. To achieve good clustering results, we first filter manipulative posts using a detailed few-shot prompt that combines documented campaign narratives with legitimate criticisms to differentiate them. This prompt enables a reasoning model to assign labels, retaining only manipulative narrative posts for further processing. The remaining posts are subsequently embedded and dimensionality-reduced using UMAP, before HDBSCAN is applied to uncover narrative groups. A key advantage of this unsupervised approach is its independence from a predefined list of target categories, enabling it to uncover new narrative clusters. Finally, a reasoning model is employed to uncover the narrative behind each cluster. Applied to over 1.2 million social media posts, this approach identified 41 distinct manipulative narrative clusters by integrating prompt-based filtering with unsupervised clustering.
To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model
Chengshuai Zhao, Zhen Tan, Dawei Li, Zhiyuan Yu, Huan Liu
Personalized recommendation rationale:

This paper concerns security and privacy protection for multimodal data (preventing unauthorized fine-tuning), a privacy/security topic unrelated to my areas of interest (recommendation systems, search, advertising, and LLM applications). By the stated rules, privacy and security topics are explicitly listed as irrelevant.

2026-05-14 02:49:27 | arXiv:2605.14291v1 |
cs.CR cs.AI cs.CL cs.CV cs.LG
Full abstract
The rapid advancement of Large Vision-Language Models (LVLMs) is increasingly accompanied by unauthorized scraping and training on multimodal web data, posing severe copyright and privacy risks to data owners. Existing countermeasures, such as machine unlearning and watermarks, are inherently post-hoc approaches that act only after intellectual property infringement has already occurred. In this work, we propose MMGuard to empower data owners to proactively protect their multimodal data against unauthorized LVLM fine-tuning. MMGuard generates unlearnable examples by injecting human-imperceptible perturbations that actively exploit the learning dynamics of LVLMs. By minimizing the training loss, the perturbation creates an optimization shortcut, causing the model to overfit to the noise and thereby degrading downstream performance when the perturbation is absent during inference. To further strengthen this defense, MMGuard introduces a cross-modal binding disruption, strategically shifting LVLM attention to enforce a spurious correlation between the noise and the training target with theoretical guarantees. Enhanced by an ensemble learning strategy for cross-model transferability, MMGuard is evaluated against nine open-source LVLMs across six datasets. Our comprehensive results demonstrate effective, stealthy, and robust protection under white-box, gray-box, and black-box threat models, establishing a mechanistic advantage in proactively defending against aggressive fine-tuning exploitation.
MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification
Weisen Jiang, Shuhao Chen, Sinno Jialin Pan
Personalized recommendation rationale:

This paper focuses on privacy-preserving techniques, a topic explicitly listed as irrelevant. Although it involves the MoE architecture, its core concern is privacy rather than direct applications to LLMs or recommendation systems.

2026-05-14 02:48:23 | arXiv:2605.14289v1 |
cs.LG cs.AI cs.CL cs.CR
Full abstract
Mixture-of-Experts (MoE) models scale capacity by combining specialized experts, but most existing approaches assume centralized access to training data. In practice, data are distributed across clients and cannot be shared due to privacy constraints, making unified MoE training challenging. We propose MetaMoE, a privacy-preserving framework that unifies independently trained, domain-specialized experts into a single MoE using public proxy data as surrogates for inaccessible private data. Central to MetaMoE is diversity-aware proxy selection, which selects client-domain-relevant and diverse samples from public data to effectively approximate private data distributions and supervise router learning. These proxies are further used to align expert training, improving expert coordination at unification time, while a context-aware router enhances expert selection across heterogeneous inputs. Experiments on computer vision and natural language processing benchmarks demonstrate that MetaMoE consistently outperforms recent privacy-preserving MoE unification methods. Code is available at https://github.com/ws-jiang/MetaMoE.
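The diversity-aware selection idea above — pick public proxy samples that are both relevant to a client domain and spread out from each other — is in the spirit of greedy relevance-plus-diversity selection. The abstract does not give MetaMoE's actual criterion, so the scoring rule, weights, and names in this numpy sketch are purely illustrative:

```python
import numpy as np

def select_diverse_proxies(pool: np.ndarray, anchor: np.ndarray, k: int,
                           relevance_weight: float = 0.5) -> list[int]:
    """Greedily pick k pool indices, trading off relevance to a domain
    anchor against distance to already-selected proxies (diversity)."""
    relevance = -np.linalg.norm(pool - anchor, axis=1)   # closer to anchor = better
    chosen = [int(np.argmax(relevance))]
    for _ in range(k - 1):
        # Diversity term: distance to the nearest already-chosen proxy.
        d = np.min([np.linalg.norm(pool - pool[c], axis=1) for c in chosen], axis=0)
        score = relevance_weight * relevance + (1 - relevance_weight) * d
        score[chosen] = -np.inf                          # never re-pick an index
        chosen.append(int(np.argmax(score)))
    return chosen

# Hypothetical 2-D embedding pool; the anchor sits in the lower-left cluster.
pool = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.1]])
anchor = np.array([0.0, 0.0])
chosen_idx = select_diverse_proxies(pool, anchor, k=2)
```

The first pick is the most anchor-relevant sample; each later pick balances relevance against being far from what was already selected, which is the intended effect of diversity-aware proxy selection.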