arXiv Daily Paper Digest

2025-10-29
Total papers: 152
Featured papers: 20
Average score: 2.6
Showing 152 papers (of 152 total)
Iterative Critique-Refine Framework for Enhancing LLM Personalization
Durga Prasad Maram, Dhruvin Gandhi, Zonghai Yao, Gayathri Akkinapalli, Franck De...
Core Summary:

Studies how to improve the quality of personalized text generation with LLMs. The core method is a training-free iterative critique-refine framework in which profile-grounded feedback lets the LLM generator continually improve its output.

Recommendation Rationale:

Directly targets the core challenge of personalized LLM generation with a training-free iterative refinement framework; it has direct application value for personalized content generation in recommendation and search.

2025-10-28 14:36:22 | arXiv:2510.24469v1 |
cs.CL cs.AI cs.IR
Full Abstract:
Personalized text generation requires models not only to produce coherent text but also to align with a target user's style, tone, and topical focus. Existing retrieval-augmented approaches such as LaMP and PGraphRAG enrich profiles with user and neighbor histories, but they stop at generation and often yield outputs that drift in tone, topic, or style. We present PerFine, a unified, training-free critique-refine framework that enhances personalization through iterative, profile-grounded feedback. In each iteration, an LLM generator produces a draft conditioned on the retrieved profile, and a critic LLM - also conditioned on the same profile - provides structured feedback on tone, vocabulary, sentence structure, and topicality. The generator then revises, while a novel knockout strategy retains the stronger draft across iterations. We further study additional inference-time strategies such as Best-of-N and Topic Extraction to balance quality and efficiency. Across Yelp, Goodreads, and Amazon datasets, PerFine consistently improves personalization over PGraphRAG, with GEval gains of +7-13%, steady improvements over 3-5 refinement iterations, and scalability with increasing critic size. These results highlight that post-hoc, profile-aware feedback offers a powerful paradigm for personalized LLM generation that is both training-free and model-agnostic.
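The critique-refine loop with the knockout strategy fits in a few lines. Below is a minimal sketch in which generate, critique, and is_better are hypothetical callables standing in for the generator LLM, the profile-conditioned critic, and the draft-comparison judge; the paper's actual prompts and judging protocol will differ.

```python
# Minimal sketch of a PerFine-style loop; generate/critique/is_better are
# hypothetical callables wrapping the generator LLM, the profile-grounded
# critic LLM, and a draft-vs-draft judge.
def critique_refine(profile, task, generate, critique, is_better, n_iters=3):
    best = generate(profile, task, feedback=None)    # initial draft
    for _ in range(n_iters):
        feedback = critique(profile, best)           # structured feedback on
                                                     # tone, vocabulary, etc.
        revised = generate(profile, task, feedback)  # generator revises
        if is_better(profile, revised, best):        # knockout: keep the
            best = revised                           # stronger draft
    return best
```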
MiniOneRec: An Open-Source Framework for Scaling Generative Recommendation
Xiaoyu Kong, Leheng Sheng, Junfei Tan, Yuxin Chen, Jiancan Wu, An Zhang, Xiang W...
Core Summary:

Studies the scaling behavior and deployment feasibility of generative recommender systems. The core method is a complete open-source framework that replaces traditional embedding tables with Semantic ID sequences and validates parameter efficiency with a lightweight post-training pipeline.

Recommendation Rationale:

Directly targets core problems of generative recommendation, provides a complete open-source framework, validates scaling laws, and proposes a lightweight post-training recipe; highly relevant to LLM applications in recommender systems.

2025-10-28 13:58:36 | arXiv:2510.24431v1 |
cs.IR cs.AI
Full Abstract:
The recent success of large language models (LLMs) has renewed interest in whether recommender systems can achieve similar scaling benefits. Conventional recommenders, dominated by massive embedding tables, tend to plateau as embedding dimensions grow. In contrast, the emerging generative paradigm replaces embeddings with compact Semantic ID (SID) sequences produced by autoregressive Transformers. Yet most industrial deployments remain proprietary, leaving two fundamental questions open: (1) Do the expected scaling laws hold on public benchmarks? (2) What is the minimal post-training recipe that enables competitive performance? We present MiniOneRec, to the best of our knowledge, the first fully open-source generative recommendation framework, which provides an end-to-end workflow spanning SID construction, supervised fine-tuning, and recommendation-oriented reinforcement learning. We generate SIDs via a Residual Quantized VAE and post-train Qwen backbones ranging from 0.5B to 7B parameters on the Amazon Review dataset. Our experiments reveal a consistent downward trend in both training and evaluation losses with increasing model size, validating the parameter efficiency of the generative approach. To further enhance performance, we propose a lightweight yet effective post-training pipeline that (1) enforces full-process SID alignment and (2) applies reinforcement learning with constrained decoding and hybrid rewards. Together, these techniques yield significant improvements in both ranking accuracy and candidate diversity.
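To make the Semantic ID idea concrete, here is a sketch of the residual-quantization step at the heart of an RQ-VAE: an item embedding becomes a short tuple of codebook indices that an autoregressive model can treat as tokens. The random codebooks below are stand-ins; in the paper they are learned end to end.

```python
# Sketch of residual quantization for Semantic IDs. Codebooks here are
# random stand-ins; an RQ-VAE learns them jointly with the encoder.
import numpy as np

rng = np.random.default_rng(0)
num_levels, codebook_size, dim = 3, 256, 64
codebooks = rng.normal(size=(num_levels, codebook_size, dim))

def encode_sid(item_embedding):
    residual, sid = item_embedding.copy(), []
    for level in range(num_levels):
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(dists.argmin())          # nearest code at this level
        sid.append(idx)
        residual -= codebooks[level][idx]  # quantize what remains
    return tuple(sid)                      # e.g. (17, 203, 5) -> item tokens

print(encode_sid(rng.normal(size=dim)))
```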
From Time and Place to Preference: LLM-Driven Geo-Temporal Context in Recommendations
Yejin Kim, Shaghayegh Agah, Mayur Nankani, Neeraj Sharma, Feifei Peng, Maria Pei...
Core Summary:

Studies the problem that conventional timestamp handling in recommender systems ignores real-world context. The core method uses an LLM to generate geo-temporal embeddings from a timestamp and coarse location, capturing semantic signals such as holidays, seasonal patterns, and local/global events.

Recommendation Rationale:

Directly applies LLMs to spatio-temporal context modeling in recommender systems, squarely matching the focus on core-domain progress and direct LLM application.

2025-10-28 13:57:23 | arXiv:2510.24430v1 |
cs.IR
Full Abstract:
Most recommender systems treat timestamps as numeric or cyclical values, overlooking real-world context such as holidays, events, and seasonal patterns. We propose a scalable framework that uses large language models (LLMs) to generate geo-temporal embeddings from only a timestamp and coarse location, capturing holidays, seasonal trends, and local/global events. We then introduce a geo-temporal embedding informativeness test as a lightweight diagnostic, demonstrating on MovieLens, LastFM, and a production dataset that these embeddings provide predictive signal consistent with the outcomes of full model integrations. Geo-temporal embeddings are incorporated into sequential models through (1) direct feature fusion with metadata embeddings or (2) an auxiliary loss that enforces semantic and geo-temporal alignment. Our findings highlight the need for adaptive or hybrid recommendation strategies, and we release a context-enriched MovieLens dataset to support future research.
DUET: Dual Model Co-Training for Entire Space CTR Prediction
Yutian Xiao, Meng Yuan, Fuzhen Zhuang, Wei Chen, Shukuan Wang, Shanqi Liu, Chao ...
Core Summary:

Studies the trade-off between computational efficiency and model expressiveness in the pre-ranking stage of recommender systems. The core method scores candidate subsets holistically with a set-wise prediction framework and mitigates sample selection bias through a dual-model co-training mechanism that exploits pseudo-labels on unexposed items.

Recommendation Rationale:

Directly targets core challenges of the pre-ranking stage; set-wise prediction and dual-model co-training balance efficiency against expressiveness, making it highly relevant to core recommender-system progress.

2025-10-28 12:46:33 | arXiv:2510.24369v1 |
cs.IR
Full Abstract:
The pre-ranking stage plays a pivotal role in large-scale recommender systems but faces an intrinsic trade-off between model expressiveness and computational efficiency. Owing to the massive candidate pool and strict latency constraints, industry systems often rely on lightweight two-tower architectures, which are computationally efficient yet limited in estimation capability. As a result, they struggle to capture the complex synergistic and suppressive relationships among candidate items, which are essential for producing contextually coherent and diverse recommendation lists. Moreover, this simplicity further amplifies the Sample Selection Bias (SSB) problem, as coarse-grained models trained on biased exposure data must generalize to a much larger candidate space with distinct distributions. To address these issues, we propose DUET (DUal Model Co-Training for Entire Space CTR Prediction), a set-wise pre-ranking framework that achieves expressive modeling under tight computational budgets. Instead of scoring items independently, DUET performs set-level prediction over the entire candidate subset in a single forward pass, enabling information-aware interactions among candidates while amortizing the computational cost across the set. Moreover, a dual model co-training mechanism extends supervision to unexposed items via mutual pseudo-label refinement, effectively mitigating SSB. Validated through extensive offline experiments and online A/B testing, DUET consistently outperforms state-of-the-art baselines and achieves improvements across multiple core business metrics. At present, DUET has been fully deployed in Kuaishou and Kuaishou Lite Apps, serving the main traffic for hundreds of millions of users.
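The dual-model co-training reads naturally as mutual pseudo-labeling. The sketch below is abstract: the model interface (predict, confidence, fit) and the confidence threshold tau are assumptions for illustration, not DUET's actual training recipe.

```python
# Abstract sketch of dual-model co-training on unexposed items; the model
# interface and threshold are hypothetical stand-ins.
def co_training_step(model_a, model_b, exposed, labels, unexposed, tau=0.9):
    # Each model pseudo-labels unexposed candidates for its peer,
    # extending supervision beyond biased exposure data (mitigates SSB).
    pseudo_for_a = [(x, model_b.predict(x)) for x in unexposed
                    if model_b.confidence(x) > tau]
    pseudo_for_b = [(x, model_a.predict(x)) for x in unexposed
                    if model_a.confidence(x) > tau]
    model_a.fit(list(zip(exposed, labels)) + pseudo_for_a)
    model_b.fit(list(zip(exposed, labels)) + pseudo_for_b)
```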
WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking
Zhengwei Tao, Haiyang Shen, Baixuan Li, Wenbiao Yin, Jialong Wu, Kuan Li, Zhongw...
Core Summary:

Studies the low search efficiency of LLM agents in information seeking. The core method constructs high-coverage tree-structured reasoning tasks and generates efficient solution trajectories, improving search performance through task variants and trajectory filtering.

Recommendation Rationale:

Directly targets the efficiency of LLM agents in information seeking with a systematic approach to task construction and trajectory optimization; highly relevant to efficiency optimization in search and recommendation.

2025-10-28 17:51:42 | arXiv:2510.24697v1 |
cs.CL
Full Abstract:
Large Language Model (LLM)-based agents have emerged as a transformative approach for open-ended problem solving, with information seeking (IS) being a core capability that enables autonomous reasoning and decision-making. While prior research has largely focused on improving retrieval depth, we observe that current IS agents often suffer from low search efficiency, which in turn constrains overall performance. A key factor underlying this inefficiency is the sparsity of target entities in training tasks, which limits opportunities for agents to learn and generalize efficient search behaviors. To address these challenges, we propose WebLeaper, a framework for constructing high-coverage IS tasks and generating efficient solution trajectories. We formulate IS as a tree-structured reasoning problem, enabling a substantially larger set of target entities to be embedded within a constrained context. Leveraging curated Wikipedia tables, we propose three variants for synthesizing IS tasks, Basic, Union, and Reverse-Union, to systematically increase both IS efficiency and efficacy. Finally, we curate training trajectories by retaining only those that are simultaneously accurate and efficient, ensuring that the model is optimized for both correctness and search performance. Extensive experiments on both basic and comprehensive settings, conducted on five IS benchmarks, BrowserComp, GAIA, xbench-DeepSearch, WideSearch, and Seal-0, demonstrate that our method consistently achieves improvements in both effectiveness and efficiency over strong baselines.
Repurposing Synthetic Data for Fine-grained Search Agent Supervision
Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song...
Core Summary:

Studies the waste of entity information when training LLM search agents. The core idea is a dense reward function built on the entity match rate, letting the model learn from near-miss failures.

Recommendation Rationale:

Directly targets training optimization for LLM search agents, using entity information to improve the RL reward function; highly relevant to search and to LLM applications.

2025-10-28 17:50:40 | arXiv:2510.24694v1 |
cs.CL cs.AI
Full Abstract:
LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples-those with substantially correct reasoning but a flawed final answer-from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these "near-misses". Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.
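The dense reward is simple to state. Below is a minimal sketch under stated assumptions (string containment as the entity matcher, a single scaling factor alpha); E-GRPO's exact matching and scaling details are in the paper.

```python
# Sketch of an entity-aware dense reward: incorrect rollouts earn partial
# credit proportional to the fraction of ground-truth entities surfaced
# during reasoning. Matching by substring is an illustrative assumption.
def e_grpo_reward(answer_correct, reasoning_text, gt_entities, alpha=0.5):
    if answer_correct:
        return 1.0
    matched = sum(e.lower() in reasoning_text.lower() for e in gt_entities)
    match_rate = matched / max(len(gt_entities), 1)
    return alpha * match_rate  # "near-misses" get graded, not zeroed

print(e_grpo_reward(False, "... visited Marie Curie and Warsaw ...",
                    ["Marie Curie", "Warsaw", "1891"]))  # -> 0.333...
```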
InteractComp: Evaluating Search Agents With Ambiguous Queries
Mingyi Deng, Lijun Huang, Yani Fan, Jiayi Zhang, Fashen Ren, Jinyi Bai, Fuzhen Y...
Core Summary:

Studies how search agents handle ambiguous user queries. The core method is a dedicated benchmark that tests whether agents can recognize ambiguity and resolve it through proactive interaction.

Recommendation Rationale:

Directly targets a core search problem, interaction over ambiguous queries, with a dedicated evaluation benchmark; highly relevant to improving search systems.

2025-10-28 17:35:54 | arXiv:2510.24668v1 |
cs.CL cs.AI
Full Abstract:
Language agents have demonstrated remarkable potential in web search and information retrieval. However, these search agents assume user queries are complete and unambiguous, an assumption that diverges from reality where users begin with incomplete queries requiring clarification through interaction. Yet most agents lack interactive mechanisms during the search process, and existing benchmarks cannot assess this capability. To address this gap, we introduce InteractComp, a benchmark designed to evaluate whether search agents can recognize query ambiguity and actively interact to resolve it during search. Following the principle of easy to verify, interact to disambiguate, we construct 210 expert-curated questions across 9 domains through a target-distractor methodology that creates genuine ambiguity resolvable only through interaction. Evaluation of 17 models reveals striking failure: the best model achieves only 13.73% accuracy despite 71.50% with complete context, exposing systematic overconfidence rather than reasoning deficits. Forced interaction produces dramatic gains, demonstrating latent capability current strategies fail to engage. Longitudinal analysis shows interaction capabilities stagnated over 15 months while search performance improved seven-fold, revealing a critical blind spot. This stagnation, coupled with the immediate feedback inherent to search tasks, makes InteractComp a valuable resource for both evaluating and training interaction capabilities in search agents. The code is available at https://github.com/FoundationAgents/InteractComp.
Relative Scaling Laws for LLMs
William Held, David Hall, Percy Liang, Diyi Yang
Core Summary:

Studies how language-model scaling laws differ across heterogeneous data subpopulations. The core idea is relative scaling laws that track how performance gaps between test distributions evolve with model scale, rather than focusing only on absolute error.

Recommendation Rationale:

Proposes relative scaling laws to analyze how performance disparities across data subpopulations evolve with scale, offering direct reference value for handling user-group disparities and fairness in recommender systems.

2025-10-28 16:55:22 | arXiv:2510.24626v1 |
cs.CL
Full Abstract:
Scaling laws describe how language models improve with additional data, parameters, and compute. While widely used, they are typically measured on aggregate test sets. Aggregate evaluations yield clean trends but average over heterogeneous subpopulations, obscuring performance disparities. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than focusing solely on absolute error. Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from $10^{18}$--$10^{20}$ FLOPs on standard pretraining datasets, we find diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; and clusters of AI risk behaviours split, with capability- and influence-related risks increasing during pretraining while adversarial risks do not. These results show that although scaling improves overall performance, it is not a universal equalizer. To support further study, we release all model checkpoints from this work to enable practitioners to measure relative alongside traditional scaling laws, in order to better prioritize robustness challenges in light of the bitter lesson.
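Measuring a relative scaling law amounts to fitting a per-distribution trend and tracking the gap across scale. A toy sketch with synthetic stand-in losses (the numbers below are invented for illustration):

```python
# Toy sketch of a relative scaling law: fit a per-distribution trend in
# log-log space and track the loss gap across scale. Losses are invented.
import numpy as np

flops = np.logspace(18, 20, 5)
loss = {"domain_A": np.array([3.2, 2.9, 2.6, 2.4, 2.2]),
        "domain_B": np.array([3.8, 3.3, 2.9, 2.6, 2.4])}

for name, y in loss.items():
    slope, _ = np.polyfit(np.log10(flops), np.log10(y), 1)
    print(name, "scaling exponent:", round(slope, 3))

gap = loss["domain_B"] - loss["domain_A"]  # the relative trajectory
print("gap across scale:", gap)            # here: converging toward parity
```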
Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs
Siheng Xiong, Joe Zou, Faramarz Fekri, Yae Jee Cho
Core Summary:

Studies the excessive quadratic cost of attention in long-context LLMs. The core idea is a data-driven dynamic hierarchical sparse attention framework that adaptively segments the sequence into chunks, computes chunk-level similarities, and upsamples them into token-level importance scores to dynamically predict attention sparsity patterns.

Recommendation Rationale:

Directly targets Transformer efficiency with a dynamic hierarchical sparse attention mechanism; valuable for long-context modeling in on-device LLMs and a close fit to the Enabling Transformer Tech and Enabling LLM Tech focus areas.

2025-10-28 16:34:18 | arXiv:2510.24606v1 |
cs.CL
Full Abstract:
The quadratic cost of attention hinders the scalability of long-context LLMs, especially in resource-constrained settings. Existing static sparse methods such as sliding windows or global tokens utilize the sparsity of attention to reduce its cost, but adapt poorly to content-dependent variations in attention due to their staticity. While previous work has proposed several dynamic approaches to improve flexibility, they still depend on predefined templates or heuristic mechanisms. Such strategies reduce generality and prune tokens that remain contextually important, limiting their accuracy across diverse tasks. To tackle these bottlenecks of existing methods for long-context modeling, we introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that dynamically predicts attention sparsity online without retraining. Our proposed DHSA adaptively segments sequences into variable-length chunks, then computes chunk representations by aggregating the token embeddings within each chunk. To avoid the bias introduced by varying chunk lengths, we apply length-normalized aggregation that scales the averaged embeddings by the square root of the chunk size. Finally, DHSA upsamples the chunk-level similarity scores to token-level similarities to calculate importance scores that determine which token-level interactions should be preserved. Our experiments on Gemma2 with the Needle-in-a-Haystack Test and LongBench show that DHSA matches dense attention in accuracy, while reducing prefill latency by 20-60% and peak memory usage by 35%. Compared to other representative baselines such as block sparse attention, DHSA achieves consistently higher accuracy (6-18% relative gains) with comparable or lower cost, offering an efficient and adaptable solution for long-context on-device LLMs.
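The core pipeline (length-normalized chunk aggregation, chunk similarity, token-level upsampling) fits in a few lines. The sketch below simplifies by using fixed-size chunks in place of DHSA's adaptive segmentation:

```python
# Sketch of DHSA's chunk pipeline under simplifying assumptions (fixed-size
# chunks instead of learned, variable-length segmentation).
import torch

def token_importance(x, chunk_size=4):
    # x: (seq_len, dim) token embeddings; seq_len divisible by chunk_size
    chunks = x.view(-1, chunk_size, x.size(-1))
    # length-normalized aggregation: mean scaled by sqrt(chunk length)
    reps = chunks.mean(dim=1) * chunk_size ** 0.5      # (n_chunks, dim)
    sim = torch.softmax(reps @ reps.T, dim=-1)         # chunk-level scores
    per_chunk = sim.mean(dim=0)                        # pooled importance
    return per_chunk.repeat_interleave(chunk_size)     # token-level scores

print(token_importance(torch.randn(16, 32)).shape)  # torch.Size([16])
```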
Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way
Yicun Yang, Cong Wang, Shaobo Wang, Zichen Wen, Biqing Qi, Hanlin Xu, Linfeng Zh...
Core Summary:

Studies the fixed generation length problem of diffusion LLMs. The core idea is to train the model to accurately predict the [EOS] token, so a diffusion LLM can natively infer variable-length text in a block-diffusion manner while retaining global bidirectional attention and high parallelism.

Recommendation Rationale:

Proposes variable generation lengths for diffusion LLMs, directly improving generation efficiency; valuable for real-time responsiveness and user experience in search and recommendation systems.

2025-10-28 16:32:43 | arXiv:2510.24605v1 |
cs.CL
Full Abstract:
Diffusion-based large language models (dLLMs) have exhibited substantial potential for parallel text generation, which may enable more efficient generation compared to autoregressive models. However, current dLLMs suffer from fixed generation lengths, which indicates the generation lengths of dLLMs have to be determined before decoding as a hyper-parameter, leading to issues in efficiency and flexibility. To solve these problems, in this work, we propose to train a diffusion LLM with native variable generation lengths, abbreviated as dLLM-Var. Concretely, we aim to train a model to accurately predict the [EOS] token in the generated text, which makes a dLLM be able to natively infer in a block diffusion manner, while still maintaining the ability of global bi-directional (full) attention and high parallelism. Experiments on standard benchmarks demonstrate that our method achieves a 30.1x speedup over traditional dLLM inference paradigms and a 2.4x speedup relative to autoregressive models such as Qwen and Llama. Our method achieves higher accuracy and faster inference, elevating dLLMs beyond mere academic novelty and supporting their practical use in real-world applications. Codes and models have been released.
Beyond Neural Incompatibility: Easing Cross-Scale Knowledge Transfer in Large Language Models through Latent Semantic Alignment
Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang
Core Summary:

Studies the difficulty of cross-scale knowledge transfer between LLMs. The core idea is latent semantic alignment rather than direct parameter reuse, using activations as the medium for layer-wise knowledge transfer across models.

Recommendation Rationale:

Directly addresses cross-scale knowledge transfer in LLMs; the latent semantic alignment approach has significant application value for model deployment and optimization in recommendation, search, and ads.

2025-10-28 09:25:40 | arXiv:2510.24208v1 |
cs.CL cs.LG
Full Abstract:
Large Language Models (LLMs) encode vast amounts of knowledge in their massive parameters, which is accessible to locate, trace, and analyze. Despite advances in neural interpretability, it is still not clear how to transfer knowledge in a fine-grained manner, namely parametric knowledge transfer (PKT). A key problem is enabling effective and efficient knowledge transfer across LLMs of different scales, which is essential for achieving greater flexibility and broader applicability in transferring knowledge between LLMs. Due to neural incompatibility, referring to the architectural and parametric differences between LLMs of varying scales, existing methods that directly reuse layer parameters are severely limited. In this paper, we identify the semantic alignment in latent space as the fundamental prerequisite for LLM cross-scale knowledge transfer. Instead of directly using the layer parameters, our approach takes activations as the medium of layer-wise knowledge transfer. Leveraging the semantics in latent space, our approach is simple and outperforms prior work, better aligning model behaviors across varying scales. Evaluations on four benchmarks demonstrate the efficacy of our method. Further analysis reveals the key factors easing cross-scale knowledge transfer and provides insights into the nature of latent semantic alignment.
Reinforcement Learning for Long-Horizon Multi-Turn Search Agents
Vivek Kalyan, Martin Andrews
Core Summary:

Studies how to optimize agent decision-making in complex multi-turn search tasks. The core idea is to train LLM agents with reinforcement learning so they improve multi-turn decision-making from experience, and to study how long interaction horizons affect performance.

Recommendation Rationale:

Applies reinforcement learning to multi-turn search agents, squarely within core progress in recommendation and search, and explores optimization over long interaction horizons.

2025-10-28 07:00:42 | arXiv:2510.24126v1 |
cs.CL
Full Abstract:
Large Language Model (LLM) agents can leverage multiple turns and tools to solve complex tasks, with prompt-based approaches achieving strong performance. This work demonstrates that Reinforcement Learning (RL) can push capabilities significantly further by learning from experience. Through experiments on a legal document search benchmark, we show that our RL-trained 14 Billion parameter model outperforms frontier class models (85% vs 78% accuracy). In addition, we explore turn-restricted regimes, during training and at test-time, that show these agents achieve better results if allowed to operate over longer multi-turn horizons.
Optimizing Retrieval for RAG via Reinforced Contrastive Learning
Jiawei Zhou, Lei Chen
Core Summary:

Studies retrieval optimization for RAG, where relevance is hard to define in advance. The core method uses reinforced contrastive learning so the retriever dynamically explores and optimizes relevance within the RAG environment, achieving self-improvement.

Recommendation Rationale:

Optimizes RAG retrieval through reinforced contrastive learning, directly targeting core retrieval problems; highly relevant to search and recommendation.

2025-10-28 17:18:30 | arXiv:2510.24652v1 |
cs.CL cs.IR
Full Abstract:
As retrieval-augmented generation (RAG) becomes increasingly widespread, the role of information retrieval (IR) is shifting from retrieving information for human users to retrieving contextual knowledge for artificial intelligence (AI) systems, where relevance becomes difficult to define or annotate beforehand. To address this challenge, we propose R3, a Retrieval framework optimized for RAG through trial-and-feedback Reinforced contrastive learning. Unlike prior approaches that rely on annotated or synthetic data for supervised fine-tuning, R3 enables the retriever to dynamically explore and optimize relevance within the RAG environment. During training, the retrieved results interact with the environment to produce contrastive signals that automatically guide the retriever's self-improvement. Extensive experiments across diverse tasks demonstrate that R3 improves RAG performance by 5.2% over the original retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving comparable results to LLM-augmented retrieval and RAG systems built on post-trained or instruction-tuned LLMs. It is both efficient and practical, requiring only 4 GPUs and completing training within a single day.
ParallelMuse: Agentic Parallel Thinking for Deep Information Seeking
Baixuan Li, Dingchu Zhang, Jialong Wu, Wenbiao Yin, Zhengwei Tao, Yida Zhao, Liw...
Core Summary:

Studies how to resolve the inefficient exploration and difficult long-horizon reasoning integration that information-seeking agents face in parallel thinking. The core method is a two-stage paradigm of functionality-specified partial rollout and compressed reasoning aggregation: uncertainty-guided path reuse and branching improve exploration efficiency, and reasoning redundancy is exploited for lossless compression when synthesizing the final answer.

Recommendation Rationale:

The parallel-thinking approach and two-stage paradigm apply directly to deep exploration and reasoning optimization in search and recommendation systems and can markedly improve information retrieval efficiency.

2025-10-28 17:51:50 | arXiv:2510.24698v1 |
cs.CL cs.AI
Full Abstract:
Parallel thinking expands exploration breadth, complementing the deep exploration of information-seeking (IS) agents to further enhance problem-solving capability. However, conventional parallel thinking faces two key challenges in this setting: inefficiency from repeatedly rolling out from scratch, and difficulty in integrating long-horizon reasoning trajectories during answer generation, as limited context capacity prevents full consideration of the reasoning process. To address these issues, we propose ParallelMuse, a two-stage paradigm designed for deep IS agents. The first stage, Functionality-Specified Partial Rollout, partitions generated sequences into functional regions and performs uncertainty-guided path reuse and branching to enhance exploration efficiency. The second stage, Compressed Reasoning Aggregation, exploits reasoning redundancy to losslessly compress information relevant to answer derivation and synthesize a coherent final answer. Experiments across multiple open-source agents and benchmarks demonstrate up to 62% performance improvement with a 10--30% reduction in exploratory token consumption.
Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance
Yujie Wei, Shiwei Zhang, Hangjie Yuan, Yujin Han, Zhekai Chen, Jiayu Wang, Difan...
Core Summary:

Studies insufficient expert specialization in vision MoE. The core idea is a two-step routing mechanism, conditional routing followed by prototypical routing, that explicitly partitions image tokens by functional role and semantic content to promote expert specialization.

Recommendation Rationale:

Proposes an innovative routing mechanism for MoE efficiency in Transformer architectures; although focused on visual generation, the routing-guidance approach carries direct lessons for recommender systems that handle heterogeneous data.

2025-10-28 17:59:02 | arXiv:2510.24711v1 |
cs.CV
Full Abstract:
Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. Despite its notable success in large language models (LLMs), existing attempts to apply MoE to Diffusion Transformers (DiTs) have yielded limited gains. We attribute this gap to fundamental differences between language and visual tokens. Language tokens are semantically dense with pronounced inter-token variation, while visual tokens exhibit spatial redundancy and functional heterogeneity, hindering expert specialization in vision MoE. To this end, we present ProMoE, an MoE framework featuring a two-step router with explicit routing guidance that promotes expert specialization. Specifically, this guidance encourages the router to partition image tokens into conditional and unconditional sets via conditional routing according to their functional roles, and refine the assignments of conditional image tokens through prototypical routing with learnable prototypes based on semantic content. Moreover, the similarity-based expert allocation in latent space enabled by prototypical routing offers a natural mechanism for incorporating explicit semantic guidance, and we validate that such guidance is crucial for vision MoE. Building on this, we propose a routing contrastive loss that explicitly enhances the prototypical routing process, promoting intra-expert coherence and inter-expert diversity. Extensive experiments on ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives. Code and models will be made publicly available.
What do vision-language models see in the context? Investigating multimodal in-context learning
Gabriel O. dos Santos, Esther Colombini, Sandra Avila
Core Summary:

Studies the core mechanisms of multimodal in-context learning in vision-language models. The key finding is that current VLMs rely mainly on textual cues and fail to effectively integrate visual information, revealing a key limitation in multimodal fusion.

Recommendation Rationale:

Systematically studies the multimodal in-context learning mechanisms of VLMs, directly connected to the focus on VLM-style unified modeling of heterogeneous data, and offers architectural insights for LLM applications in recommendation and search.

2025-10-28 11:55:24 | arXiv:2510.24331v1 |
cs.LG cs.CV
Full Abstract:
In-context learning (ICL) enables Large Language Models (LLMs) to learn tasks from demonstration examples without parameter updates. Although it has been extensively studied in LLMs, its effectiveness in Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic study of ICL in VLMs, evaluating seven models spanning four architectures on three image captioning benchmarks. We analyze how prompt design, architectural choices, and training strategies influence multimodal ICL. To our knowledge, we are the first to analyze how attention patterns in VLMs vary with an increasing number of in-context demonstrations. Our results reveal that training on image-text interleaved data enhances ICL performance but does not imply effective integration of visual and textual information from demonstration examples. In contrast, instruction tuning improves instruction-following but can reduce reliance on in-context demonstrations, suggesting a trade-off between instruction alignment and in-context adaptation. Attention analyses further show that current VLMs primarily focus on textual cues and fail to leverage visual information, suggesting a limited capacity for multimodal integration. These findings highlight key limitations in the ICL abilities of current VLMs and provide insights for enhancing their ability to learn from multimodal in-context examples.
Pie: A Programmable Serving System for Emerging LLM Applications
In Gim, Zhiyao Ma, Seung-seob Lee, Lin Zhong
Core Summary:

Studies the inability of traditional LLM serving systems to support the diverse inference strategies of emerging applications. The core method is Pie, a programmable serving system that decomposes the generation loop into fine-grained service handlers and lets user-written inferlet programs control application-level KV-cache policies and generation logic.

Recommendation Rationale:

Proposes a programmable LLM serving system that decouples the generation loop to enable application-level optimization, directly improving LLM serving efficiency for complex workflows; valuable for LLM deployment in recommendation and search systems.

2025-10-28 04:17:55 | arXiv:2510.24051v1 |
cs.CL
Full Abstract:
Emerging large language model (LLM) applications involve diverse reasoning strategies and agentic workflows, straining the capabilities of existing serving systems built on a monolithic token generation loop. This paper introduces Pie, a programmable LLM serving system designed for flexibility and efficiency. Pie decomposes the traditional generation loop into fine-grained service handlers exposed via an API and delegates control of the generation process to user-provided programs, called inferlets. This enables applications to implement new KV cache strategies, bespoke generation logic, and seamlessly integrate computation and I/O entirely within the application, without requiring modifications to the serving system. Pie executes inferlets using WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows Pie matches state-of-the-art performance on standard tasks (3-12% latency overhead) while significantly improving latency and throughput (1.3x-3.4x higher) on agentic workflows by enabling application-specific optimizations.
UtilGen: Utility-Centric Generative Data Augmentation with Dual-Level Task Adaptation
Jiyu Guo, Shuo Yang, Yiming Huang, Yancheng Long, Xiaobo Xia, Xiu Su, Bo Zhao, Z...
Core Summary:

Studies the neglect of task-specific requirements in generative data augmentation. The core idea is a dual-level optimization framework (model level and instance level) driven by downstream-task feedback to generate task-oriented, high-utility synthetic data.

Recommendation Rationale:

The task-utility-oriented data generation framework is highly relevant to personalized data augmentation in recommender systems; its dual-level optimization strategy can transfer to user-sequence generation and feature augmentation scenarios.

2025-10-28 10:17:11 | arXiv:2510.24262v1 |
cs.CV cs.LG
Full Abstract:
Data augmentation using generative models has emerged as a powerful paradigm for enhancing performance in computer vision tasks. However, most existing augmentation approaches primarily focus on optimizing intrinsic data attributes -- such as fidelity and diversity -- to generate visually high-quality synthetic data, while often neglecting task-specific requirements. Yet, it is essential for data generators to account for the needs of downstream tasks, as training data requirements can vary significantly across different tasks and network architectures. To address these limitations, we propose UtilGen, a novel utility-centric data augmentation framework that adaptively optimizes the data generation process to produce task-specific, high-utility training data via downstream task feedback. Specifically, we first introduce a weight allocation network to evaluate the task-specific utility of each synthetic sample. Guided by these evaluations, UtilGen iteratively refines the data generation process using a dual-level optimization strategy to maximize the synthetic data utility: (1) model-level optimization tailors the generative model to the downstream task, and (2) instance-level optimization adjusts generation policies -- such as prompt embeddings and initial noise -- at each generation round. Extensive experiments on eight benchmark datasets of varying complexity and granularity demonstrate that UtilGen consistently achieves superior performance, with an average accuracy improvement of 3.87% over previous SOTA. Further analysis of data influence and distribution reveals that UtilGen produces more impactful and task-relevant synthetic data, validating the effectiveness of the paradigm shift from visual characteristics-centric to task utility-centric data augmentation.
SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs
Jinhong Deng, Wen Li, Joey Tianyi Zhou, Yang He
Core Summary:

Studies the efficiency problem caused by redundant visual tokens in multimodal LLMs. The core idea is the SCOPE score, which jointly optimizes token saliency and semantic coverage to iteratively select the most representative set of visual tokens.

Recommendation Rationale:

Proposes a new visual token pruning method that jointly models saliency and coverage to improve multimodal LLM efficiency; frontier work on Transformer architecture efficiency.

2025-10-28 09:29:37 | arXiv:2510.24214v1 |
cs.CV
Full Abstract:
Multimodal Large Language Models (MLLMs) typically process a large number of visual tokens, leading to considerable computational overhead, even though many of these tokens are redundant. Existing visual token pruning methods primarily focus on selecting the most salient tokens based on attention scores, resulting in the semantic incompleteness of the selected tokens. In this paper, we propose a novel visual token pruning strategy, called Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE), to jointly model both the saliency and coverage of the selected visual tokens to better preserve semantic completeness. Specifically, we introduce a set-coverage for a given set of selected tokens, computed based on the token relationships. We then define a token-coverage gain for each unselected token, quantifying how much additional coverage would be obtained by including it. By integrating the saliency score into the token-coverage gain, we propose our SCOPE score and iteratively select the token with the highest SCOPE score. We conduct extensive experiments on multiple vision-language understanding benchmarks using the LLaVA-1.5 and LLaVA-Next models. Experimental results demonstrate that our method consistently outperforms prior approaches. Our code is available at https://github.com/kinredon/SCOPE.
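The greedy selection behind the SCOPE score can be sketched as follows. The coverage definition below (elementwise max similarity to any selected token) is a simplification of the paper's set-coverage, and lam is an assumed trade-off weight:

```python
# Sketch of SCOPE-style greedy token selection: pick the token whose
# saliency plus marginal coverage gain is highest, then update coverage.
import torch

def scope_select(sim, saliency, k, lam=1.0):
    # sim: (n, n) token-token similarity; saliency: (n,) attention scores
    n = sim.size(0)
    selected, covered = [], torch.zeros(n)
    mask = torch.zeros(n, dtype=torch.bool)
    for _ in range(k):
        gain = (sim - covered).clamp(min=0).sum(dim=1)  # coverage gain
        score = saliency + lam * gain                   # SCOPE-style score
        score = score.masked_fill(mask, -float("inf"))  # no re-selection
        best = int(score.argmax())
        selected.append(best)
        mask[best] = True
        covered = torch.maximum(covered, sim[best])     # update coverage
    return selected

print(scope_select(torch.rand(10, 10), torch.rand(10), k=3))
```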
Beyond Line-Level Filtering for the Pretraining Corpora of LLMs
Chanwoo Park, Suyoung Park, Yelim Ahn, Jongmin Kim, Jongyeon Park, Jaejin Lee
Core Summary:

Studies the problem that traditional line-level filtering discards valuable content from LLM pretraining corpora. The core idea is pattern-aware line-level deduplication and trailing-punctuation filtering, which use the sequential distribution of lines within documents to retain structurally important content.

Recommendation Rationale:

Focuses on improving data cleaning for LLM pretraining corpora, a core LLM technology area; it indirectly affects data quality for search and recommendation but does not directly involve recommendation algorithms or Transformer architecture innovations.

2025-10-28 07:24:32 | arXiv:2510.24139v1 |
cs.CL cs.AI
Full Abstract:
While traditional line-level filtering techniques, such as line-level deduplication and trailing-punctuation filters, are commonly used, these basic methods can sometimes discard valuable content, negatively affecting downstream performance. In this paper, we introduce two methods, pattern-aware line-level deduplication (PLD) and pattern-aware trailing punctuation filtering (PTF), by enhancing the conventional filtering techniques. Our approach not only considers line-level signals but also takes into account their sequential distribution across documents, enabling us to retain structurally important content that might otherwise be removed. We evaluate these proposed methods by training small language models (1B parameters) in both English and Korean. The results demonstrate that our methods consistently improve performance on multiple-choice benchmarks and significantly enhance generative question-answering accuracy on both SQuAD v1 and KorQuAD v1.
SpecKD: Speculative Decoding for Effective Knowledge Distillation of LLMs
Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren
Recommendation Rationale:

Proposes speculative decoding for knowledge distillation, within the Enabling LLM Tech scope, focused on improving LLM inference efficiency. In search and recommendation, this kind of acceleration applies directly to real-time serving scenarios, lowering LLM deployment cost and improving response latency; it carries significant value for large-scale online services.

2025-10-28 03:02:22 | arXiv:2510.24021v1 |
cs.CL cs.AI
Full Abstract:
Knowledge Distillation (KD) has become a cornerstone technique for compressing Large Language Models (LLMs) into smaller, more efficient student models. However, conventional KD approaches typically apply the distillation loss uniformly across all tokens, regardless of the teacher's confidence. This indiscriminate mimicry can introduce noise, as the student is forced to learn from the teacher's uncertain or high-entropy predictions, which may ultimately harm student performance, especially when the teacher is much larger and more powerful. To address this, we propose Speculative Knowledge Distillation (SpecKD), a novel, plug-and-play framework that introduces a dynamic, token-level gating mechanism inspired by the "propose-and-verify" paradigm of speculative decoding. At each step, the student's token proposal is verified against the teacher's distribution; the distillation loss is selectively applied only to "accepted" tokens, while "rejected" tokens are masked out. Extensive experiments on diverse text generation tasks show that SpecKD consistently and significantly outperforms strong KD baselines, leading to more stable training and more capable student models, and achieving state-of-the-art results.
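A minimal sketch of the token-level gating idea, under an assumed acceptance rule (the student's proposed token must have non-trivial probability under the teacher); the paper's propose-and-verify criterion may differ:

```python
# Sketch of SpecKD-style gated distillation: apply the KD loss only at
# positions where the student's proposal is plausible under the teacher.
import torch
import torch.nn.functional as F

def speckd_loss(student_logits, teacher_logits, accept_p=0.05):
    # logits: (seq_len, vocab)
    t_prob = teacher_logits.softmax(-1)
    proposal = student_logits.argmax(-1)                       # student tokens
    accepted = t_prob.gather(-1, proposal[:, None]).squeeze(-1) > accept_p
    kd = F.kl_div(student_logits.log_softmax(-1), t_prob,
                  reduction="none").sum(-1)                    # per-token KL
    return (kd * accepted).sum() / accepted.sum().clamp(min=1)

s, t = torch.randn(8, 100), torch.randn(8, 100)
print(speckd_loss(s, t))
```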
Teaching LLMs to Abstain via Fine-Grained Semantic Confidence Reward
Hao An, Yang Xu
Recommendation Rationale:

Concerns LLM confidence calibration and abstention, within the Enabling LLM Tech scope. In search and recommendation, accurate confidence estimation is critical for result ranking, uncertainty handling, and rejecting low-quality recommendations, and can markedly improve system reliability and user experience.

2025-10-28 03:00:35 | arXiv:2510.24020v1 |
cs.CL cs.AI
Full Abstract:
Mitigating hallucinations in Large Language Models (LLMs) is critical for their reliable deployment. Existing methods typically fine-tune LLMs to abstain from answering questions beyond their knowledge scope. However, these methods often rely on coarse-grained signals to guide LLMs to abstain, such as overall confidence or uncertainty scores on multiple sampled answers, which may result in an imprecise awareness of the model's own knowledge boundaries. To this end, we propose a novel reinforcement learning framework built on a Fine-grained Semantic Confidence Reward, which guides LLMs to abstain via sample-specific confidence. Specifically, our method operates by sampling multiple candidate answers and conducting semantic clustering, then training the LLM to retain answers within high-confidence clusters and discard those within low-confidence ones, thereby promoting accurate post-hoc abstention. Additionally, we propose a new metric for evaluating the reliability of abstention fine-tuning tasks more comprehensively. Our method significantly enhances reliability in both in-domain and out-of-distribution benchmarks.
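The sample-and-cluster confidence signal can be sketched as below; the semantic-equivalence test is a user-supplied assumption (in practice an NLI or embedding model), and the greedy clustering is an illustration rather than the paper's exact procedure:

```python
# Sketch of semantic-cluster confidence: sample several answers, cluster
# them by meaning, and treat each cluster's relative size as confidence.
def cluster_confidence(answers, same_meaning):
    """same_meaning(a, b) -> True if two answers are semantically
    equivalent; here it is a hypothetical user-supplied callable."""
    clusters = []
    for a in answers:
        for c in clusters:
            if same_meaning(a, c[0]):
                c.append(a)
                break
        else:
            clusters.append([a])
    return {c[0]: len(c) / len(answers) for c in clusters}

conf = cluster_confidence(["Paris", "paris", "Lyon"],
                          lambda a, b: a.lower() == b.lower())
print(conf)  # {'Paris': 0.666..., 'Lyon': 0.333...}
```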
UHKD: A Unified Framework for Heterogeneous Knowledge Distillation via Frequency-Domain Representations
Fengming Yu, Haiwei Pan, Kejia Zhang, Jian Guan, Haiying Jiang
Recommendation Rationale:

Proposes a unified framework for heterogeneous knowledge distillation, a model compression and efficiency technique directly applicable to large-scale model deployment in recommendation, search, and ads. The frequency-domain representation may offer new modeling ideas for heterogeneous user-behavior sequences and contextual features, analogous to how VLMs handle heterogeneous data, helping improve inference efficiency in industrial recommender systems.

2025-10-28 06:41:43 | arXiv:2510.24116v1 |
cs.CV
Full Abstract:
Knowledge distillation (KD) is an effective model compression technique that transfers knowledge from a high-performance teacher to a lightweight student, reducing cost while maintaining accuracy. In visual applications, where large-scale image models are widely used, KD enables efficient deployment. However, architectural diversity introduces semantic discrepancies that hinder the use of intermediate representations. Most existing KD methods are designed for homogeneous models and degrade in heterogeneous scenarios, especially when intermediate features are involved. Prior studies mainly focus on the logits space, making limited use of the semantic information in intermediate layers. To address this limitation, Unified Heterogeneous Knowledge Distillation (UHKD) is proposed as a framework that leverages intermediate features in the frequency domain for cross-architecture transfer. Fourier transform is applied to capture global feature information, alleviating representational discrepancies between heterogeneous teacher-student pairs. A Feature Transformation Module (FTM) produces compact frequency-domain representations of teacher features, while a learnable Feature Alignment Module (FAM) projects student features and aligns them via multi-level matching. Training is guided by a joint objective combining mean squared error on intermediate features with Kullback-Leibler divergence on logits. Experiments on CIFAR-100 and ImageNet-1K demonstrate gains of 5.59% and 0.83% over the latest method, highlighting UHKD as an effective approach for unifying heterogeneous representations and enabling efficient utilization of visual knowledge.
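A sketch of the joint objective under simplifying assumptions: a single aligned feature pair with identity alignment standing in for the learned FTM/FAM modules, frequency magnitudes via a 2-D FFT, and the usual temperature-scaled KL on logits:

```python
# Simplified UHKD-style loss: MSE between frequency-domain feature
# magnitudes plus temperature-scaled KL divergence on logits.
import torch
import torch.nn.functional as F

def uhkd_loss(f_student, f_teacher, z_student, z_teacher, beta=1.0, tau=4.0):
    fs = torch.fft.rfft2(f_student).abs()   # frequency-domain magnitudes
    ft = torch.fft.rfft2(f_teacher).abs()
    feat_loss = F.mse_loss(fs, ft)
    kd_loss = F.kl_div((z_student / tau).log_softmax(-1),
                       (z_teacher / tau).softmax(-1),
                       reduction="batchmean") * tau * tau
    return feat_loss + beta * kd_loss

fs, ft = torch.randn(2, 8, 16, 16), torch.randn(2, 8, 16, 16)
print(uhkd_loss(fs, ft, torch.randn(2, 10), torch.randn(2, 10)))
```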
Enhancing Pre-trained Representation Classifiability can Boost its Interpretability
Shufan Shen, Zhaobo Qi, Junshu Sun, Qingming Huang, Qi Tian, Shuhui Wang
Recommendation Rationale:

Focuses on improving the classifiability and interpretability of pre-trained representations, within the Enabling LLM Tech scope. In recommendation and search, more interpretable representations are crucial for understanding user preferences, item characteristics, and model decisions, helping build more transparent and trustworthy systems.

2025-10-28 06:21:06 | arXiv:2510.24105v1 |
cs.CV cs.LG
Full Abstract:
The visual representation of a pre-trained model prioritizes the classifiability on downstream tasks, while the widespread applications for pre-trained visual models have posed new requirements for representation interpretability. However, it remains unclear whether the pre-trained representations can achieve high interpretability and classifiability simultaneously. To answer this question, we quantify the representation interpretability by leveraging its correlation with the ratio of interpretable semantics within the representations. Given the pre-trained representations, only the interpretable semantics can be captured by interpretations, whereas the uninterpretable part leads to information loss. Based on this fact, we propose the Inherent Interpretability Score (IIS) that evaluates the information loss, measures the ratio of interpretable semantics, and quantifies the representation interpretability. In the evaluation of the representation interpretability with different classifiability, we surprisingly discover that the interpretability and classifiability are positively correlated, i.e., representations with higher classifiability provide more interpretable semantics that can be captured in the interpretations. This observation further supports two benefits to the pre-trained representations. First, the classifiability of representations can be further improved by fine-tuning with interpretability maximization. Second, with the classifiability improvement for the representations, we obtain predictions based on their interpretations with less accuracy degradation. The discovered positive correlation and corresponding applications show that practitioners can unify the improvements in interpretability and classifiability for pre-trained vision models. Codes are available at https://github.com/ssfgunner/IIS.
Enhancing CLIP Robustness via Cross-Modality Alignment
Xingyu Zhu, Beier Zhu, Shuo Wang, Kesen Zhao, Hanwang Zhang
Recommendation Rationale:

Falls under the VLM-as-analogy-for-heterogeneous-data theme, studying cross-modality alignment, which relates directly to modeling user sequences and contextual features as distinct modalities under a unified framework. Stronger cross-modal alignment can be applied to unified representation learning over heterogeneous behavior and content features in search and recommendation, improving the robustness of multimodal retrieval and recommendation.

2025-10-28 03:47:44 | arXiv:2510.24038v1 |
cs.CV
Full Abstract:
Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt optimization; they often overlook the gaps in CLIP's encoded features, which is shown as the text and image features lie far apart from each other. This misalignment is significantly amplified under adversarial perturbations, leading to severe degradation in classification performance. To address this problem, we propose Cross-modality Alignment, dubbed COLA, an optimal transport-based framework that explicitly addresses adversarial misalignment by restoring both global image-text alignment and local structural consistency in the feature space. (1) COLA first projects adversarial image embeddings onto a subspace spanned by class text features, effectively filtering out non-semantic distortions while preserving discriminative information. (2) It then models images and texts as discrete distributions over multiple augmented views and refines their alignment via OT, with the subspace projection seamlessly integrated into the cost computation. This design ensures stable cross-modal alignment even under adversarial conditions. COLA is training-free and compatible with existing fine-tuned models. Extensive evaluations across 14 zero-shot classification benchmarks demonstrate the effectiveness of COLA, especially with an average improvement of 6.7% on ImageNet and its variants under PGD adversarial attacks, while maintaining high accuracy on clean samples.
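COLA's first step, projecting an adversarial image embedding onto the subspace spanned by the class text features, is a least-squares projection; the OT refinement over augmented views is omitted in this sketch:

```python
# Sketch of COLA's subspace projection: filter non-semantic distortion by
# projecting the image embedding onto the span of the class text features.
import torch

def project_to_text_subspace(img_emb, text_feats):
    # img_emb: (dim,); text_feats: (num_classes, dim), rows span the subspace
    T = text_feats.T                                   # (dim, num_classes)
    coeffs = torch.linalg.lstsq(T, img_emb[:, None]).solution
    return (T @ coeffs).squeeze(-1)                    # projected embedding

img = torch.randn(512)
texts = torch.randn(10, 512)
print(project_to_text_subspace(img, texts).shape)  # torch.Size([512])
```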
Metadata-Driven Retrieval-Augmented Generation for Financial Question Answering
Michail Dadopoulos, Anestis Ladas, Stratos Moschidis, Ioannis Negkakis
Recommendation Rationale:

Involves retrieval-augmented generation (RAG), an important LLM advance with direct application potential in search and recommendation. Metadata-driven retrieval is analogous to how recommender systems handle user features and contextual information, suggesting approaches for heterogeneous data modeling. Although the application domain is financial QA, the core technical contribution could transfer to general search and recommendation systems.

2025-10-28 13:16:36 | arXiv:2510.24402v1 |
cs.IR cs.AI cs.CE
Full Abstract:
Retrieval-Augmented Generation (RAG) struggles on long, structured financial filings where relevant evidence is sparse and cross-referenced. This paper presents a systematic investigation of advanced metadata-driven Retrieval-Augmented Generation (RAG) techniques, proposing and evaluating a novel, multi-stage RAG architecture that leverages LLM-generated metadata. We introduce a sophisticated indexing pipeline to create contextually rich document chunks and benchmark a spectrum of enhancements, including pre-retrieval filtering, post-retrieval reranking, and enriched embeddings, benchmarked on the FinanceBench dataset. Our results reveal that while a powerful reranker is essential for precision, the most significant performance gains come from embedding chunk metadata directly with text ("contextual chunks"). Our proposed optimal architecture combines LLM-driven pre-retrieval optimizations with these contextual embeddings to achieve superior performance. Additionally, we present a custom metadata reranker that offers a compelling, cost-effective alternative to commercial solutions, highlighting a practical trade-off between peak performance and operational efficiency. This study provides a blueprint for building robust, metadata-aware RAG systems for financial document analysis.
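The paper's strongest lever, "contextual chunks", amounts to embedding chunk metadata together with the text. A minimal sketch (the metadata fields below are invented for illustration):

```python
# Sketch of a "contextual chunk": prepend LLM-generated metadata to the
# chunk text before embedding, so the vector carries document context.
def contextual_chunk(chunk_text, metadata):
    header = " | ".join(f"{k}: {v}" for k, v in metadata.items())
    return f"[{header}]\n{chunk_text}"

doc_meta = {"company": "ACME Corp", "filing": "10-K 2023",
            "section": "Liquidity and Capital Resources"}
text = contextual_chunk("Cash and equivalents totaled $1.2B ...", doc_meta)
# At indexing time, embed(text) rather than embed(chunk_text).
print(text)
```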
Resource-Efficient LLM Application for Structured Transformation of Unstructured Financial Contracts
Maruf Ahmed Mridul, Oshani Seneviratne
Recommendation Rationale:

Focuses on a domain-specific financial application (processing financial contracts), which falls outside my focus areas. Although it applies LLMs, the scenario is financial document processing rather than recommendation, search, or ads, and the title indicates no progress on Transformer efficiency, multimodal modeling, or direct application to my core domains.

2025-10-28 01:49:10 | arXiv:2510.23990v1 |
cs.IR
Full Abstract:
The transformation of unstructured legal contracts into standardized, machine-readable formats is essential for automating financial workflows. The Common Domain Model (CDM) provides a standardized framework for this purpose, but converting complex legal documents like Credit Support Annexes (CSAs) into CDM representations remains a significant challenge. In this paper, we present an extension of the CDMizer framework, a template-driven solution that ensures syntactic correctness and adherence to the CDM schema during contract-to-CDM conversion. We apply this extended framework to a real-world task, comparing its performance with a benchmark developed by the International Swaps and Derivatives Association (ISDA) for CSA clause extraction. Our results show that CDMizer, when integrated with a significantly smaller, open-source Large Language Model (LLM), achieves competitive performance in terms of accuracy and efficiency against larger, proprietary models. This work underscores the potential of resource-efficient solutions to automate legal contract transformation, offering a cost-effective and scalable approach that can meet the needs of financial institutions with constrained resources or strict data privacy requirements.
Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents
Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yihen...
Recommendation Rationale:

Mainly concerns dataset unification and fine-tuning protocols for LLM agents, a general LLM technology. Although efficient fine-tuning could indirectly help optimize the LLM components of recommendation or search systems, the focus is general-purpose agents rather than RecSys/Search/Ads, so relevance is weak.

2025-10-28 17:53:13 | arXiv:2510.24702v1 |
cs.CL cs.AI
Full Abstract:
Public research results on large-scale supervised finetuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue that the bottleneck is not a lack of underlying data sources, but that a large variety of data is fragmented across heterogeneous formats, tools, and interfaces. To this end, we introduce the agent data protocol (ADP), a light-weight representation language that serves as an "interlingua" between agent datasets in diverse formats and unified agent training pipelines downstream. The design of ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic workflows, while remaining simple to parse and train on without engineering at a per-dataset level. In experiments, we unified a broad collection of 13 existing agent training datasets into ADP format, and converted the standardized ADP data into training-ready formats for multiple agent frameworks. We performed SFT on these data, and demonstrated an average performance gain of ~20% over corresponding base models, and delivers state-of-the-art or near-SOTA performance on standard coding, browsing, tool use, and research benchmarks, without domain-specific tuning. All code and data are released publicly, in the hope that ADP could help lower the barrier to standardized, scalable, and reproducible agent training.
AgentFrontier: Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis
Xuanzhong Chen, Zile Qiao, Guoxin Chen, Liangcai Su, Zhen Zhang, Xinyu Wang, Pen...
Recommendation Rationale:

Focuses on capability expansion and data synthesis for LLM agents, squarely within LLM-agent research. Although ZPD-guided data synthesis might indirectly inspire user-interaction modeling in recommendation or search, the paper shows no explicit connection to RecSys/Search/Ads, and agent capability expansion is a general LLM application rather than a domain-specific one.

2025-10-28 17:50:47 | arXiv:2510.24695v1 |
cs.CL
Full Abstract:
Training large language model agents on tasks at the frontier of their capabilities is key to unlocking advanced reasoning. We introduce a data synthesis approach inspired by the educational theory of the Zone of Proximal Development (ZPD), which defines this frontier as tasks an LLM cannot solve alone but can master with guidance. To operationalize this, we present the AgentFrontier Engine, an automated pipeline that synthesizes high-quality, multidisciplinary data situated precisely within the LLM's ZPD. This engine supports both continued pre-training with knowledge-intensive data and targeted post-training on complex reasoning tasks. From the same framework, we derive the ZPD Exam, a dynamic and automated benchmark designed to evaluate agent capabilities on these frontier tasks. We train AgentFrontier-30B-A3B model on our synthesized data, which achieves state-of-the-art results on demanding benchmarks like Humanity's Last Exam, even surpassing some leading proprietary agents. Our work demonstrates that a ZPD-guided approach to data synthesis offers a scalable and effective path toward building more capable LLM agents.
SPICE: Self-Play In Corpus Environments Improves Reasoning
Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xi...
Recommendation Rationale:

Focuses on improving LLM reasoning through self-play, within the Enabling LLM Tech scope. Although stronger reasoning may indirectly benefit complex query handling in search and recommendation, the title shows no explicit connection to RecSys/Search/Ads, and the application scope of the reasoning improvements is fairly broad.

2025-10-28 17:46:16 | arXiv:2510.24684v1 |
cs.CL
Full Abstract:
Self-improving systems require environmental interaction for continuous adaptation. We introduce SPICE (Self-Play In Corpus Environments), a reinforcement learning framework where a single model acts in two roles: a Challenger that mines documents from a large corpus to generate diverse reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics, the Challenger creates an automatic curriculum at the frontier of the Reasoner's capability, while corpus grounding provides the rich, near-inexhaustible external signal necessary for sustained improvement. Unlike existing ungrounded self-play methods that offer more limited benefits, SPICE achieves consistent gains across mathematical (+8.9%) and general reasoning (+9.8%) benchmarks on multiple model families. Our analysis reveals how document grounding is a key ingredient in SPICE to continuously generate its own increasingly challenging goals and achieve them, enabling sustained self-improvement.
Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation
Snegha A, Sayambhu Sen, Piyush Singh Pasi, Abhishek Singhania, Preethi Jyothi
Recommendation Rationale:

Focuses on cross-lingual transfer learning, a technical improvement within NLP. Although prefix-based adaptation may help multilingual search systems somewhat, its core focus is language transfer rather than core problems in recommendation or ads. The technique could serve as an enabler for multilingual search, but the application scenarios are limited and indirect.

2025-10-28 16:48:03 | arXiv:2510.24619v1 |
cs.CL cs.AI cs.LG I.2.7
Full Abstract:
With the release of new large language models (LLMs) like Llama and Mistral, zero-shot cross-lingual transfer has become increasingly feasible due to their multilingual pretraining and strong generalization capabilities. However, adapting these decoder-only LLMs to new tasks across languages remains challenging. While parameter-efficient fine-tuning (PeFT) techniques like Low-Rank Adaptation (LoRA) are widely used, prefix-based techniques such as soft prompt tuning, prefix tuning, and Llama Adapter are less explored, especially for zero-shot transfer in decoder-only models. We present a comprehensive study of three prefix-based methods for zero-shot cross-lingual transfer from English to 35+ high- and low-resource languages. Our analysis further explores transfer across linguistic families and scripts, as well as the impact of scaling model sizes from 1B to 24B. With Llama 3.1 8B, prefix methods outperform LoRA-baselines by up to 6% on the Belebele benchmark. Similar improvements were observed with Mistral v0.3 7B as well. Despite using only 1.23M learning parameters with prefix tuning, we achieve consistent improvements across diverse benchmarks. These findings highlight the potential of prefix-based techniques as an effective and scalable alternative to LoRA, particularly in low-resource multilingual settings.
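For reference, a minimal sketch of prefix-style adaptation: learnable prefix vectors join each attention layer's keys and values while the base weights stay frozen. This is a generic simplification of prefix tuning / Llama-Adapter, not the paper's exact setup:

```python
# Minimal prefix-adaptation sketch: learnable prefix key/value vectors are
# prepended per layer; only they receive gradients.
import torch
import torch.nn as nn

class PrefixAttention(nn.Module):
    def __init__(self, dim, prefix_len=10):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        for p in self.attn.parameters():
            p.requires_grad_(False)                    # freeze base weights
        self.prefix_k = nn.Parameter(torch.randn(1, prefix_len, dim) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(1, prefix_len, dim) * 0.02)

    def forward(self, x):                              # x: (batch, seq, dim)
        b = x.size(0)
        k = torch.cat([self.prefix_k.expand(b, -1, -1), x], dim=1)
        v = torch.cat([self.prefix_v.expand(b, -1, -1), x], dim=1)
        out, _ = self.attn(x, k, v)                    # prefixes are the only
        return out                                     # trainable parameters

print(PrefixAttention(64)(torch.randn(2, 5, 64)).shape)  # (2, 5, 64)
```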
SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models
Ken Gu, Advait Bhat, Mike A Merrill, Robert West, Xin Liu, Daniel McDuff, Tim Al...
Recommendation Rationale:

Focuses on disentangling reasoning ability from stored knowledge inside language models, a core LLM topic. Although disentangled reasoning might offer some inspiration for interpretability in recommendation and search systems, the work leans toward basic NLP mechanism research rather than direct application to recommendation, search, or ads, and the path to application is not clear.

2025-10-28 13:47:23 | arXiv:2510.24427v1 |
cs.CL
Full Abstract:
Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent knowledge advantage gap, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.
Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning
Zhiheng Xi, Jixuan Huang, Xin Guo, Boyang Hong, Dingwen Yang, Xiaoran Fan, Shuo ...
Recommendation Rationale:

Uses reinforcement learning to train language models for critique generation, a pure LLM application with no direct connection to ranking optimization in recommendation, search, or ads. Although reinforcement learning itself is used in recommender systems, the paper focuses on critique generation as a specific NLP task and lacks clear RecSys/Search/Ads application potential.

2025-10-28 11:37:01 | arXiv:2510.24320v1 |
cs.CL cs.AI
Full Abstract:
Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor's outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic's helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a 9.02% gain on in-domain tasks and a 5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
Juntian Zhang, Song Jin, Chuanqi Cheng, Yuhan Liu, Yankai Lin, Xun Zhang, Yufei ...
Personalized recommendation rationale:

This paper concerns the self-evolution of visual perception abilities in vision-language models (VLMs), which makes it purely vision-language multimodal research. Although VLM techniques are involved, the core focus is improving visual perception, with no direct connection to heterogeneous data processing in recommendation, search, or advertising. Under the VLM-as-heterogeneous-data analogy, the work may offer some inspiration for handling multimodal data, but its application potential is limited and unclear.

2025-10-28 10:42:57 | arXiv:2510.24285v1 |
cs.CVcs.AIcs.CL
Full abstract
The limited capacity for fine-grained visual perception presents a critical bottleneck for Vision-Language Models (VLMs) in real-world applications. Addressing this is challenging due to the scarcity of high-quality data and the limitations of existing methods: supervised fine-tuning (SFT) often compromises general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual reasoning over visual perception. To bridge this gap, we propose a novel two-stage task that structures visual perception learning as a coarse-to-fine progressive process. Based on this task formulation, we develop ViPER, a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. By synergistically integrating image-level and instance-level reconstruction with a two-stage reinforcement learning strategy, ViPER establishes a closed-loop training paradigm, where internally synthesized data directly fuel the enhancement of perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the Qwen-Viper series. With an average gain of 1.7% on seven comprehensive benchmarks spanning various tasks and up to 6.0% on fine-grained perception, Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability. Beyond enabling self-improvement in perceptual capabilities, ViPER provides concrete evidence for the reciprocal relationship between generation and understanding, a breakthrough to developing more autonomous and capable VLMs.
From Memorization to Reasoning in the Spectrum of Loss Curvature
Jack Merullo, Srihita Vatsavaya, Lucius Bushnaq, Owen Lewis
Personalized recommendation rationale:

This paper studies spectral analysis of loss curvature and the transition from memorization to reasoning, which falls under core LLM technical progress. Loss-curvature analysis helps in understanding model generalization and might indirectly inform model optimization in recommendation or search, but the paper shows no direct application potential in RecSys/Search/Ads, so its relevance is limited.

2025-10-28 10:09:35 | arXiv:2510.24256v1 |
cs.CLcs.LG
Full abstract
We characterize how memorization is represented in transformer models and show that it can be disentangled in the weights of both language models (LMs) and vision transformers (ViTs) using a decomposition based on the loss landscape curvature. This insight is based on prior theoretical and empirical work showing that the curvature for memorized training points is much sharper than for non-memorized ones, meaning that ordering weight components from high to low curvature can reveal a distinction without explicit labels. This motivates a weight editing procedure that suppresses recitation of untargeted memorized data far more effectively than a recent unlearning method (BalancedSubnet), while maintaining lower perplexity. Since the basis of curvature has a natural interpretation for shared structure in model weights, we analyze the editing procedure extensively for its effect on downstream tasks in LMs, and find that fact retrieval and arithmetic are specifically and consistently negatively affected, even though open-book fact retrieval and general logical reasoning are conserved. We posit that these tasks rely heavily on specialized directions in weight space rather than general-purpose mechanisms, regardless of whether those individual datapoints are memorized. We support this by showing a correspondence between task data's activation strength on the low-curvature components that we edit out and the drop in task performance after the edit. Our work enhances the understanding of memorization in neural networks with practical applications towards removing it, and provides evidence for idiosyncratic, narrowly used structures involved in solving tasks like math and fact retrieval.
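The editing procedure reduces to ordering weight components by curvature and dropping the sharpest ones. A toy sketch under the assumption that a per-component curvature estimate is already available; this is not the paper's code:

```python
import numpy as np

# Toy sketch of curvature-ordered weight editing. Assumes `components`
# decomposes a weight matrix (W = components.sum(axis=0)) and `curvature`
# holds a per-component loss-curvature estimate; both are stand-ins.

def edit_weights(components: np.ndarray, curvature: np.ndarray, k: int) -> np.ndarray:
    """Drop the k sharpest-curvature components, which the paper associates
    with memorized training points, and rebuild the weight matrix."""
    order = np.argsort(curvature)[::-1]   # sharpest curvature first
    keep = order[k:]                      # discard the top-k components
    return components[keep].sum(axis=0)

rng = np.random.default_rng(0)
components = rng.normal(size=(8, 4, 4))   # e.g., 8 additive pieces of W
curvature = rng.uniform(size=8)           # hypothetical curvature estimates
W_edited = edit_weights(components, curvature, k=2)
```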
Success and Cost Elicit Convention Formation for Efficient Communication
Saujas Vaduguru, Yilun Hua, Yoav Artzi, Daniel Fried
Personalized recommendation rationale:

This paper studies how communicative conventions form. Although efficient communication is a general concept, no concrete technical link to recommendation, search, or advertising is established. As fundamental communication research, its potential applications are too broad, and it lacks direct relevance to LLMs, Transformer architectures, or heterogeneous data modeling.

2025-10-28 03:06:07 | arXiv:2510.24023v1 |
cs.CL
Full abstract
Humans leverage shared conversational context to become increasingly successful and efficient at communicating over time. One manifestation of this is the formation of ad hoc linguistic conventions, which allow people to coordinate on short, less costly utterances that are understood using shared conversational context. We present a method to train large multimodal models to form conventions, enabling efficient communication. Our approach uses simulated reference games between models, and requires no additional human-produced data. In repeated reference games involving photographs and tangram images, our method enables models to communicate efficiently with people: reducing the message length by up to 41% while increasing success by 15% over the course of the interaction. Human listeners respond faster when interacting with our model that forms conventions. We also show that training based on success or cost alone is insufficient - both are necessary to elicit convention formation.
TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents
Yizhu Jiao, Sha Li, Sizhe Zhou, Heng Ji, Jiawei Han
Personalized recommendation rationale:

This paper focuses on LLM-based information extraction and database integration, an application of LLMs to data processing. Information extraction could help search systems understand user queries or document content, but the title does not indicate a direct connection to recommendation, search ranking, or advertising, so the application scenario is rather indirect.

2025-10-28 02:49:40 | arXiv:2510.24014v1 |
cs.CL
Full abstract
The task of information extraction (IE) is to extract structured knowledge from text. However, it is often not straightforward to utilize IE output due to the mismatch between the IE ontology and the downstream application needs. We propose a new formulation of IE, TEXT2DB, that emphasizes the integration of IE output and the target database (or knowledge base). Given a user instruction, a document set, and a database, our task requires the model to update the database with values from the document set to satisfy the user instruction. This task requires understanding user instructions for what to extract and adapting to the given DB/KB schema for how to extract on the fly. To evaluate this new task, we introduce a new benchmark featuring common demands such as data infilling, row population, and column addition. In addition, we propose an LLM agent framework OPAL (Observe-Plan-Analyze LLM), which includes an Observer component that interacts with the database, a Planner component that generates a code-based plan with calls to IE models, and an Analyzer component that provides feedback regarding code quality before execution. Experiments show that OPAL can successfully adapt to diverse database schemas by generating different code plans and calling the required IE models. We also highlight difficult cases such as dealing with large databases with complex dependencies and extraction hallucination, which we believe deserve further investigation. Source code: https://github.com/yzjiao/Text2DB
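The Observe-Plan-Analyze loop can be summarized as control flow. A schematic sketch with the three components stubbed as callables; signatures and the feedback format are assumptions, not the released API:

```python
# Schematic of an OPAL-style Observe-Plan-Analyze loop (illustrative only).
# `observer`, `planner`, and `analyzer` are assumed callables; the feedback
# dict format is invented for this sketch.

def opal_step(instruction, documents, db, observer, planner, analyzer,
              max_revisions: int = 3):
    schema = observer(db)                                  # inspect DB schema/state
    plan = planner(instruction, documents, schema)         # code plan + IE calls
    for _ in range(max_revisions):
        feedback = analyzer(plan)                          # critique before execution
        if feedback.get("ok"):
            break
        plan = planner(instruction, documents, schema, feedback=feedback)
    return plan                                            # executed downstream
```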
Decoupled MeanFlow: Turning Flow Models into Flow Maps for Accelerated Sampling
Kyungmin Lee, Sihyun Yu, Jinwoo Shin
Personalized recommendation rationale:

This paper targets sampling acceleration for flow models, an efficiency topic in generative modeling. Faster sampling might indirectly benefit content-generation components of recommendation or search systems, but that connection is indirect and weak, with no clear direct application scenario. The work is a generic generative-model improvement rather than an optimization specific to recommendation, search, or advertising.

2025-10-28 14:43:48 | arXiv:2510.24474v1 |
cs.CV
Full abstract
Denoising generative models, such as diffusion and flow-based models, produce high-quality samples but require many denoising steps due to discretization error. Flow maps, which estimate the average velocity between timesteps, mitigate this error and enable faster sampling. However, their training typically demands architectural changes that limit compatibility with pretrained flow models. We introduce Decoupled MeanFlow, a simple decoding strategy that converts flow models into flow map models without architectural modifications. Our method conditions the final blocks of diffusion transformers on the subsequent timestep, allowing pretrained flow models to be directly repurposed as flow maps. Combined with enhanced training techniques, this design enables high-quality generation in as few as 1 to 4 steps. Notably, we find that training flow models and subsequently converting them is more efficient and effective than training flow maps from scratch. On ImageNet 256x256 and 512x512, our models attain 1-step FID of 2.16 and 2.12, respectively, surpassing prior art by a large margin. Furthermore, we achieve FID of 1.51 and 1.68 when increasing the steps to 4, which nearly matches the performance of flow models while delivering over 100x faster inference.
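The key operational difference from a standard flow sampler is that the network predicts the average velocity over an interval, so one update spans the whole gap. A generic few-step flow-map sampler, assuming a `model(x, t, s)` callable; the timestep convention is an assumption:

```python
import torch

# Generic few-step flow-map sampling in the spirit of Decoupled MeanFlow.
# `model(x, t, s)` is assumed to return the average velocity over [t, s];
# here noise sits at t=0 and data at t=1 (a convention, not the paper's).

@torch.no_grad()
def flow_map_sample(model, x: torch.Tensor, steps: int = 4) -> torch.Tensor:
    ts = torch.linspace(0.0, 1.0, steps + 1)
    for t, s in zip(ts[:-1], ts[1:]):
        u = model(x, t, s)        # predicted mean velocity over the interval
        x = x + (s - t) * u       # exact jump if u matches the true average
    return x
```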
Tongyi DeepResearch Technical Report
Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangy...
Personalized recommendation rationale:

The title is too broad to convey any concrete technical content or research direction. The work is likely LLM-related (given that "Tongyi" names an AI model family), but it provides too little detail to judge its relevance to recommendation, search, or advertising. From the title alone, its potential connection to my focus areas cannot be assessed.

2025-10-28 17:53:02 | arXiv:2510.24701v1 |
cs.CLcs.AIcs.IRcs.LGcs.MA
Full abstract
We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
AgentFold: Long-Horizon Web Agents with Proactive Context Management
Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai...
Personalized recommendation rationale:

This paper addresses long-horizon planning and context management for web agents, a general agent topic. Context-management techniques might indirectly apply to user-session management in recommendation or search, but the title clearly targets web-navigation agents; the link to core RecSys/Search/Ads technology is weak, and there is no direct LLM innovation for recommendation or search.

2025-10-28 17:51:50 | arXiv:2510.24699v1 |
cs.CLcs.AIcs.LG
Full abstract
LLM-based web agents show immense promise for information seeking, yet their effectiveness on long-horizon tasks is hindered by a fundamental trade-off in context management. Prevailing ReAct-based agents suffer from context saturation as they accumulate noisy, raw histories, while methods that fixedly summarize the full history at each step risk the irreversible loss of critical details. To address these issues, we introduce AgentFold, a novel agent paradigm centered on proactive context management, inspired by the human cognitive process of retrospective consolidation. AgentFold treats its context as a dynamic cognitive workspace to be actively sculpted, rather than a passive log to be filled. At each step, it learns to execute a "folding" operation, which manages its historical trajectory at multiple scales: it can perform granular condensations to preserve vital, fine-grained details, or deep consolidations to abstract away entire multi-step sub-tasks. The results on prominent benchmarks are striking: with simple supervised fine-tuning (without continual pre-training or RL), our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp and 47.3% on BrowseComp-ZH. Notably, this performance not only surpasses or matches open-source models of a dramatically larger scale, such as the DeepSeek-V3.1-671B-A37B, but also surpasses leading proprietary agents like OpenAI's o4-mini.
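The folding operation is essentially a multi-scale in-place rewrite of the context. A schematic sketch with an assumed `summarize` LLM call and an invented action format; the paper's actual interface may differ:

```python
# Schematic of AgentFold-style context folding (illustrative only).
# `summarize` is an assumed LLM call; the `action` dict format is invented.

def fold(context: list, action: dict, summarize) -> list:
    if action["scale"] == "granular":
        context[-1] = summarize(context[-1])       # condense the latest step only
    else:
        i, j = action["span"]                      # deep consolidation of a sub-task
        context[i:j] = [summarize(context[i:j])]   # abstract the whole span away
    return context
```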
OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning
Ziyou Hu, Zhengliang Shi, Minghang Zhu, Haitao Li, Teng Sun, Pengjie Ren, Suzan ...
Personalized recommendation rationale:

This paper concerns learning reward functions in reinforcement learning. Although agentic tasks are involved, the link to core recommendation, search, or advertising technology is weak. RL reward mechanisms could in some settings inform personalized recommendation policy optimization, but the title shows no direct connection to, or application in, the RecSys/Search/Ads domain.

2025-10-28 17:02:46 | arXiv:2510.24636v1 |
cs.CL
Full abstract
Reward models (RMs) have become essential for aligning large language models (LLMs), serving as scalable proxies for human evaluation in both training and inference. However, existing RMs struggle on knowledge-intensive and long-form tasks, where evaluating correctness requires grounding beyond the model's internal knowledge. This limitation hinders them from reliably discriminating subtle quality differences, especially when external evidence is necessary. To address this, we introduce OpenRM, a tool-augmented long-form reward model that systematically judges open-ended responses by invoking external tools to gather relevant evidence. We train OpenRM with Group Relative Policy Optimization (GRPO) on over 27K synthesized pairwise examples generated through a controllable data synthesis framework. The training objective jointly supervises intermediate tool usage and final outcome accuracy, incentivizing our reward model to learn effective evidence-based judgment strategies. Extensive experiments on three newly-collected datasets and two widely-used benchmarks demonstrate that OpenRM substantially outperforms existing reward modeling approaches. As a further step, we integrate OpenRM into both inference-time response selection and training-time data selection. This yields consistent gains in downstream LLM alignment tasks, highlighting the potential of tool-augmented reward models for scaling reliable long-form evaluation.
"Mm, Wat?" Detecting Other-initiated Repair Requests in Dialogue
Anh Ngo, Nicolas Rollet, Catherine Pelachaud, Chloe Clavel
Personalized recommendation rationale:

This paper studies detecting repair requests in dialogue, a dialogue-systems and NLP topic. Dialogue technology might indirectly support conversational interaction in search or recommendation, but the paper targets the specific phenomenon of repair initiation; its link to the core technical focus of recommendation, search, or advertising is weak, and it does not clearly involve LLMs, Transformer architectures, or heterogeneous data modeling.

2025-10-28 16:58:26 | arXiv:2510.24628v1 |
cs.CL
Full abstract
Maintaining mutual understanding is a key component in human-human conversation to avoid conversation breakdowns, in which repair, particularly Other-Initiated Repair (OIR, when one speaker signals trouble and prompts the other to resolve), plays a vital role. However, Conversational Agents (CAs) still fail to recognize user repair initiation, leading to breakdowns or disengagement. This work proposes a multimodal model to automatically detect repair initiation in Dutch dialogues by integrating linguistic and prosodic features grounded in Conversation Analysis. The results show that prosodic cues complement linguistic features and significantly improve the results of pretrained text and audio embeddings, offering insights into how different features interact. Future directions include incorporating visual cues, exploring multilingual and cross-context corpora to assess the robustness and generalizability.
ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization
Guoxin Chen, Jing Wu, Xinjie Chen, Wayne Xin Zhao, Ruihua Song, Chengxi Li, Kai ...
Personalized recommendation rationale:

This paper addresses autoformalization, converting natural-language mathematical statements into formal proof systems, which is essentially mathematical reasoning and theorem proving. Although sequence-optimization techniques are involved, the core application is mathematical automation and formal verification, with no direct link to recommendation, search, or advertising. Prospective sequence optimization might, in marginal cases, inspire sequence modeling methods, but the practical application potential is very limited.

2025-10-28 16:22:54 | arXiv:2510.24592v1 |
cs.CL
Full abstract
Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for using formal mathematical reasoning to solve math problems stated in natural language. While Large Language Models can generate syntactically correct formal statements, they often fail to preserve the original problem's semantic intent. This limitation arises from the LLM approaches' treating autoformalization as a simplistic translation task which lacks mechanisms for self-reflection and iterative refinement that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess its semantic fidelity, and self-correct identified errors through progressive refinement. To effectively train this reflective model, we introduce Prospective Bounded Sequence Optimization (PBSO), which employs different rewards at different sequence positions to ensure that the model develops both accurate autoformalization and correct semantic validations, preventing superficial critiques that would undermine the purpose of reflection. Extensive experiments across four autoformalization benchmarks demonstrate that ReForm achieves an average improvement of 17.2 percentage points over the strongest baselines. To further ensure evaluation reliability, we introduce ConsistencyCheck, a benchmark of 859 expert-annotated items that not only validates LLMs as judges but also reveals that autoformalization is inherently difficult: even human experts produce semantic errors in up to 38.5% of cases.
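PBSO's distinguishing feature is that reward depends on sequence position. A minimal sketch of position-dependent token rewards, with segment spans and binary outcome signals assumed for illustration:

```python
# Sketch of position-dependent rewards in the spirit of PBSO: tokens in the
# formal-statement span are rewarded for formalization accuracy, tokens in
# the self-critique span for a faithful semantic verdict. Spans and the two
# outcome booleans are assumptions for this illustration.

def pbso_token_rewards(num_tokens: int, statement_span: tuple, critique_span: tuple,
                       formal_ok: bool, verdict_ok: bool) -> list:
    rewards = [0.0] * num_tokens
    for i in range(*statement_span):
        rewards[i] = 1.0 if formal_ok else -1.0    # autoformalization quality
    for i in range(*critique_span):
        rewards[i] = 1.0 if verdict_ok else -1.0   # non-superficial critique
    return rewards

r = pbso_token_rewards(12, (0, 8), (8, 12), formal_ok=True, verdict_ok=False)
```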
Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs
Huanyu Zhang, Wenshan Wu, Chengzu Li, Ning Shang, Yan Xia, Yangyu Huang, Yifan Z...
Personalized recommendation rationale:

This paper concerns sketching visual thoughts in multimodal large language models (MLLMs), which is purely vision-language interaction research. Although multimodal modeling is involved, the core is visual reasoning and thought visualization, with no direct connection to heterogeneous data processing in recommendation, search, or advertising. The technique targets visual understanding and reasoning tasks, with no clear application potential in RecSys/Search/Ads.

2025-10-28 15:26:20 | arXiv:2510.24514v1 |
cs.CVcs.CL
Full abstract
While Multimodal Large Language Models (MLLMs) excel at visual understanding, they often struggle in complex scenarios that require visual planning and imagination. Inspired by how humans use sketching as a form of visual thinking to develop and communicate ideas, we introduce Latent Sketchpad, a framework that equips MLLMs with an internal visual scratchpad. The internal visual representations of MLLMs have traditionally been confined to perceptual understanding. We repurpose them to support generative visual thought without compromising reasoning ability. Building on frontier MLLMs, our approach integrates visual generation directly into their native autoregressive reasoning process. It allows the model to interleave textual reasoning with the generation of visual latents. These latents guide the internal thought process and can be translated into sketch images for interpretability. To realize this, we introduce two components: a Context-Aware Vision Head that autoregressively produces visual representations, and a pretrained Sketch Decoder that renders these into human-interpretable images. We evaluate the framework on our new dataset MazePlanning. Experiments across various MLLMs show that Latent Sketchpad delivers comparable or even superior reasoning performance to their backbone. It further generalizes across distinct frontier MLLMs, including Gemma3 and Qwen2.5-VL. By extending the model's textual reasoning to visual thinking, our framework opens new opportunities for richer human-computer interaction and broader applications. More details and resources are available on our project page: https://latent-sketchpad.github.io/.
CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration?
Qing Zong, Jiayu Liu, Tianshi Zheng, Chunyang Li, Baixuan Xu, Haochen Shi, Weiqi...
Personalized recommendation rationale:

This paper addresses uncertainty and confidence calibration in LLMs, a pure NLP evaluation topic outside my areas of focus. Confidence calibration could in theory have indirect implications for recommender-system reliability, but the title indicates no concrete application in RecSys/Search/Ads and leans toward a pure LLM evaluation problem.

2025-10-28 15:16:06 | arXiv:2510.24505v1 |
cs.CL
Full abstract
Accurate confidence calibration in Large Language Models (LLMs) is critical for safe use in high-stakes domains, where clear verbalized confidence enhances user trust. Traditional methods that mimic reference confidence expressions often fail to capture the reasoning needed for accurate confidence assessment. We propose natural language critiques as a solution, ideally suited for confidence calibration, as precise gold confidence labels are hard to obtain and often require multiple generations. This paper studies how natural language critiques can enhance verbalized confidence, addressing: (1) What to critique: uncertainty (question-focused) or confidence (answer-specific)? Analysis shows confidence suits multiple-choice tasks, while uncertainty excels in open-ended scenarios. (2) How to critique: self-critique or critique calibration training? We propose Self-Critique, enabling LLMs to critique and optimize their confidence beyond mere accuracy, and CritiCal, a novel Critique Calibration training method that leverages natural language critiques to improve confidence calibration, moving beyond direct numerical optimization. Experiments show that CritiCal significantly outperforms Self-Critique and other competitive baselines, even surpassing its teacher model, GPT-4o, in complex reasoning tasks. CritiCal also shows robust generalization in out-of-distribution settings, advancing LLM's reliability.
A word association network methodology for evaluating implicit biases in LLMs compared to humans
Katherine Abramski, Giulio Rossetti, Massimo Stella
Personalized recommendation rationale:

This paper focuses on evaluating implicit biases in LLMs, a fairness-and-ethics topic unrelated to my core interests. Although LLM technology is involved, the focus is bias measurement rather than model architecture improvements or recommendation/search/advertising applications.

2025-10-28 15:03:18 | arXiv:2510.24488v1 |
cs.CLcs.AI
Full abstract
As Large language models (LLMs) become increasingly integrated into our lives, their inherent social biases remain a pressing concern. Detecting and evaluating these biases can be challenging because they are often implicit rather than explicit in nature, so developing evaluation methods that assess the implicit knowledge representations of LLMs is essential. We present a novel word association network methodology for evaluating implicit biases in LLMs based on simulating semantic priming within LLM-generated word association networks. Our prompt-based approach taps into the implicit relational structures encoded in LLMs, providing both quantitative and qualitative assessments of bias. Unlike most prompt-based evaluation methods, our method enables direct comparisons between various LLMs and humans, providing a valuable point of reference and offering new insights into the alignment of LLMs with human cognition. To demonstrate the utility of our methodology, we apply it to both humans and several widely used LLMs to investigate social biases related to gender, religion, ethnicity, sexual orientation, and political party. Our results reveal both convergences and divergences between LLM and human biases, providing new perspectives on the potential risks of using LLMs. Our methodology contributes to a systematic, scalable, and generalizable framework for evaluating and comparing biases across multiple LLMs and humans, advancing the goal of transparent and socially responsible language technologies.
Talk2Ref: A Dataset for Reference Prediction from Scientific Talks
Frederik Broy, Maike Züfle, Jan Niehues
Personalized recommendation rationale:

This paper focuses on predicting references for scientific talks, a domain-specific NLP task far from the core concerns of recommendation, search, or advertising. Although it is a prediction task, it lacks clear cross-domain potential and does not transfer directly to user-behavior modeling or content understanding in recommendation, search, or advertising scenarios.

2025-10-28 14:50:03 | arXiv:2510.24478v1 |
cs.CL
Full abstract
Scientific talks are a growing medium for disseminating research, and automatically identifying relevant literature that grounds or enriches a talk would be highly valuable for researchers and students alike. We introduce Reference Prediction from Talks (RPT), a new task that maps long, and unstructured scientific presentations to relevant papers. To support research on RPT, we present Talk2Ref, the first large-scale dataset of its kind, containing 6,279 talks and 43,429 cited papers (26 per talk on average), where relevance is approximated by the papers cited in the talk's corresponding source publication. We establish strong baselines by evaluating state-of-the-art text embedding models in zero-shot retrieval scenarios, and propose a dual-encoder architecture trained on Talk2Ref. We further explore strategies for handling long transcripts, as well as training for domain adaptation. Our results show that fine-tuning on Talk2Ref significantly improves citation prediction performance, demonstrating both the challenges of the task and the effectiveness of our dataset for learning semantic representations from spoken scientific content. The dataset and trained models are released under an open license to foster future research on integrating spoken scientific communication into citation recommendation systems.
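The dual-encoder architecture reduces to embedding talks and papers separately and ranking by similarity. A minimal scoring sketch, assuming precomputed embeddings; encoders and dimensions are stand-ins:

```python
import torch
import torch.nn.functional as F

# Minimal dual-encoder retrieval scorer of the kind trained on Talk2Ref:
# the talk transcript and candidate papers are embedded independently and
# papers are ranked by cosine similarity. Embeddings here are stand-ins.

def rank_papers(talk_emb: torch.Tensor, paper_embs: torch.Tensor) -> torch.Tensor:
    """talk_emb: (d,), paper_embs: (n, d); returns paper indices, best first."""
    scores = F.cosine_similarity(talk_emb.unsqueeze(0), paper_embs, dim=-1)
    return scores.argsort(descending=True)

ranked = rank_papers(torch.randn(384), torch.randn(100, 384))
```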
Mitigating Hallucination in Large Language Models (LLMs): An Application-Oriented Survey on RAG, Reasoning, and Agentic Systems
Yihan Li, Xiyuan Fu, Ghanshyam Verma, Paul Buitelaar, Mingming Liu
Personalized recommendation rationale:

This paper centers on mitigating LLM hallucination, a purely NLP-centric theme that falls squarely in the excluded-topics category. Although RAG, reasoning, and agentic systems are mentioned, the core focus is hallucination mitigation rather than direct applications or enabling technologies for recommendation, search, or advertising.

2025-10-28 14:48:57 | arXiv:2510.24476v1 |
cs.CLcs.AI
Full abstract
Hallucination remains one of the key obstacles to the reliable deployment of large language models (LLMs), particularly in real-world applications. Among various mitigation strategies, Retrieval-Augmented Generation (RAG) and reasoning enhancement have emerged as two of the most effective and widely adopted approaches, marking a shift from merely suppressing hallucinations to balancing creativity and reliability. However, their synergistic potential and underlying mechanisms for hallucination mitigation have not yet been systematically examined. This survey adopts an application-oriented perspective of capability enhancement to analyze how RAG, reasoning enhancement, and their integration in Agentic Systems mitigate hallucinations. We propose a taxonomy distinguishing knowledge-based and logic-based hallucinations, systematically examine how RAG and reasoning address each, and present a unified framework supported by real-world applications, evaluations, and benchmarks.
Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices
Špela Vintar, Taja Kuzman Pungeršek, Mojca Brglez, Nikola Ljubešić
Personalized recommendation rationale:

This paper concerns LLM benchmarking and evaluation practices, a pure evaluation-benchmark category that is explicitly excluded from my interests. Benchmarks matter for LLM development, but the paper neither applies LLMs concretely to recommendation, search, or advertising nor advances core LLM technology that would enable those areas.

2025-10-28 14:13:44 | arXiv:2510.24450v1 |
cs.CLcs.AI
Full abstract
While new benchmarks for large language models (LLMs) are being developed continuously to catch up with the growing capabilities of new models and AI in general, using and evaluating LLMs in non-English languages remains a little-charted landscape. We give a concise overview of recent developments in LLM benchmarking, and then propose a new taxonomy for the categorization of benchmarks that is tailored to multilingual or non-English use scenarios. We further propose a set of best practices and quality standards that could lead to a more coordinated development of benchmarks for European languages. Among other recommendations, we advocate for a higher language and culture sensitivity of evaluation methods.
Comprehensive and Efficient Distillation for Lightweight Sentiment Analysis Models
Guangyu Xie, Yice Zhang, Jianzhu Bao, Qianlong Wang, Yang Sun, Bingbing Wang, Ru...
Personalized recommendation rationale:

This paper concerns distillation techniques for sentiment-analysis models, a specific NLP application. Knowledge distillation is an important model-compression technique, but sentiment analysis is only weakly tied to the core business needs of recommendation, search, or advertising, and the paper demonstrates no application potential in those settings, so its relevance is low.

2025-10-28 13:46:48 | arXiv:2510.24425v1 |
cs.CL
Full abstract
Recent efforts leverage knowledge distillation techniques to develop lightweight and practical sentiment analysis models. These methods are grounded in human-written instructions and large-scale user texts. Despite the promising results, two key challenges remain: (1) manually written instructions are limited in diversity and quantity, making them insufficient to ensure comprehensive coverage of distilled knowledge; (2) large-scale user texts incur high computational cost, hindering the practicality of these methods. To this end, we introduce COMPEFFDIST, a comprehensive and efficient distillation framework for sentiment analysis. Our framework consists of two key modules: attribute-based automatic instruction construction and difficulty-based data filtering, which correspondingly tackle the aforementioned challenges. Applying our method across multiple model series (Llama-3, Qwen-3, and Gemma-3), we enable 3B student models to match the performance of 20x larger teacher models on most tasks. In addition, our approach greatly outperforms baseline methods in data efficiency, attaining the same performance level with only 10% of the data.
Text Simplification with Sentence Embeddings
Matthew Shardlow
Personalized recommendation rationale:

This paper focuses on text simplification, a pure text-processing task. Sentence embeddings are a basic NLP component, but the title restricts the work to the simplification application, which has little connection to core ranking, retrieval, or user-modeling tasks in recommendation, search, or advertising. Text simplification mainly improves readability rather than meeting key technical needs of recommendation or search.

2025-10-28 12:41:10 | arXiv:2510.24365v1 |
cs.CL
Full abstract
Sentence embeddings can be decoded to give approximations of the original texts used to create them. We explore this effect in the context of text simplification, demonstrating that reconstructed text embeddings preserve complexity levels. We experiment with a small feed forward neural network to effectively learn a transformation between sentence embeddings representing high-complexity and low-complexity texts. We provide comparison to a Seq2Seq and LLM-based approach, showing encouraging results in our much smaller learning setting. Finally, we demonstrate the applicability of our transformation to an unseen simplification dataset (MedEASI), as well as datasets from languages outside the training data (ES,DE). We conclude that learning transformations in sentence embedding space is a promising direction for future research and has potential to unlock the ability to develop small, but powerful models for text simplification and other natural language generation tasks.
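The paper's core component is a small feed-forward map between embedding spaces. A minimal sketch, with the embedding dimension and the training objective assumed (MSE toward the simple-sentence embedding):

```python
import torch
import torch.nn as nn

# Sketch of the paper's central idea: a small feed-forward network mapping
# sentence embeddings of complex text toward embeddings of simplified text,
# from which text can then be decoded. Dimension and loss are assumptions.

class SimplifyTransform(nn.Module):
    def __init__(self, dim: int = 768, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, complex_emb: torch.Tensor) -> torch.Tensor:
        return self.net(complex_emb)

model = SimplifyTransform()
loss_fn = nn.MSELoss()   # fit predicted embeddings to paired simple embeddings
```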
Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
Lingyue Fu, Bolun Zhang, Hao Guan, Yaoming Zhu, Lin Qiu, Weiwen Liu, Xuezhi Cao,...
Personalized recommendation rationale:

This paper concerns benchmarking and evaluation methods for LLM code agents, a pure LLM-evaluation topic. Although the agent concept appears, the core is an evaluation framework for code-generation tasks, with no direct link to practical applications in recommendation, search, or advertising, and no demonstrated potential for heterogeneous data processing or Transformer architecture improvements.

2025-10-28 12:26:45 | arXiv:2510.24358v1 |
cs.SEcs.CL
Full abstract
Recent advances in code agents have enabled automated software development at the project level, supported by large language models (LLMs) and widely adopted tools. However, existing benchmarks for code agent evaluation face two major limitations: high annotation cost and expertise requirements, and rigid evaluation metrics that rely primarily on unit tests. To address these challenges, we propose an agent-driven benchmark construction pipeline that leverages human supervision to efficiently generate diverse and challenging project-level tasks. Based on this approach, we introduce PRDBench, a novel benchmark comprising 50 real-world Python projects across 20 domains, each with structured Product Requirement Document (PRD) requirements, comprehensive evaluation criteria, and reference implementations. PRDBench features rich data sources, high task complexity, and flexible metrics. We further employ an Agent-as-a-Judge paradigm to score agent outputs, enabling the evaluation of various test types beyond unit tests. Extensive experiments on PRDBench demonstrate its effectiveness in assessing the capabilities of both code agents and evaluation agents, providing a scalable and robust framework for annotation and evaluation.
LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability
Zikai Xiao, Fei Huang, Jianhong Tu, Jianhui Wei, Wen Ma, Yuxuan Zhou, Jian Wu, B...
Personalized recommendation rationale:

This paper focuses on benchmarking and evaluating long-form text generation, a pure LLM evaluation-benchmark category. Despite the mention of real-world relevance, the core is verifying and evaluating generated content rather than practical applications in recommendation, search, or advertising, so it has little connection to the field progress, enabling technologies, or direct applications I currently follow.

2025-10-28 12:11:12 | arXiv:2510.24345v1 |
cs.CLcs.AI
Full abstract
Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce LongWeave, which balances real-world and verifiable assessment with Constraint-Verifier Evaluation (CoV-Eval). CoV-Eval constructs tasks by first defining verifiable targets within real-world scenarios, then systematically generating corresponding queries, textual materials, and constraints based on these targets. This ensures that tasks are both realistic and objectively assessable, enabling rigorous assessment of model capabilities in meeting complex real-world constraints. LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven distinct tasks. Evaluation on 23 LLMs shows that even state-of-the-art models encounter significant challenges in long-form generation as real-world complexity and output length increase.
Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards
Shangyu Xing, Siyuan Wang, Chenyuan Yang, Xinyu Dai, Xiang Ren
Personalized recommendation rationale:

Although this paper addresses exploration techniques in reinforcement learning, it focuses on RL with verifiable rewards, a problem setting that differs from the typical setups in recommendation, search, or advertising. The paper does not show how these exploration techniques would apply to those domains, so its relevance is low.

2025-10-28 11:12:02 | arXiv:2510.24302v1 |
cs.CL
Full abstract
Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of sampled trajectories during group rollouts. Homogeneous trajectories and their associated rewards would diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations are likely to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promote trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct continuations. Specifically, LATR iteratively operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibit prolonged similarity during simulation. Compared with stochastic sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are publicly available at https://github.com/starreeze/latr.
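Stage (1) of LATR, branching at high-uncertainty steps, can be sketched directly from the abstract; the entropy threshold and branching factor are assumed knobs, and the lookahead and pruning stages are omitted:

```python
import torch

# Sketch of LATR-style branching at uncertain generation steps (stage 1 only;
# lookahead simulation and similarity pruning are omitted). Threshold and k
# are assumed hyperparameters.

def maybe_branch(logits: torch.Tensor, entropy_threshold: float = 2.0, k: int = 3) -> list:
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum()
    if entropy > entropy_threshold:
        return probs.topk(k).indices.tolist()   # fork on top-k candidate tokens
    return [int(probs.argmax())]                # confident: keep a single path
```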
MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference
Mădălina Zgreabăn, Tejaswini Deoskar, Lasha Abzianidze
Personalized recommendation rationale:

This paper focuses on a test methodology for natural language inference (NLI), a pure NLP evaluation-benchmark category. In theory, NLI could support query understanding in search or user-intent understanding in recommendation, but the title makes clear that the concern is testing and generalization evaluation rather than core model architecture or direct applications, so its relevance to my current focus is low.

2025-10-28 10:58:59 | arXiv:2510.24295v1 |
cs.CL
Full abstract
In recent years, many generalization benchmarks have shown language models' lack of robustness in natural language inference (NLI). However, manually creating new benchmarks is costly, while automatically generating high-quality ones, even by modifying existing benchmarks, is extremely difficult. In this paper, we propose a methodology for automatically generating high-quality variants of original NLI problems by replacing open-class words, while crucially preserving their underlying reasoning. We dub our generalization test MERGE (Minimal Expression-Replacements GEneralization), which evaluates the correctness of models' predictions across reasoning-preserving variants of the original problem. Our results show that NLI models perform 4-20% worse on variants, suggesting low generalizability even on such minimally altered problems. We also analyse how the word class of the replacements, word probability, and plausibility influence NLI models' performance.
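A toy illustration of reasoning-preserving variant generation as described; real MERGE variants require POS-aware replacement and plausibility filtering, and the word lists here are invented:

```python
import itertools

# Toy MERGE-style variants: swap open-class words for same-category
# alternatives while leaving the entailment pattern (All -> Some) intact.
# Word lists are invented; real variants need POS tagging and filtering.

premise = "All {animal}s in the {place} are sleeping."
hypothesis = "Some {animal}s in the {place} are sleeping."
animals, places = ["dog", "parrot"], ["garden", "warehouse"]

variants = [
    (premise.format(animal=a, place=p), hypothesis.format(animal=a, place=p))
    for a, p in itertools.product(animals, places)
]   # every variant preserves the original entailment label
```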
Can LLMs Translate Human Instructions into a Reinforcement Learning Agent's Internal Emergent Symbolic Representation?
Ziqi Ma, Sao Mai Nguyen, Philippe Xu
Personalized recommendation rationale:

This paper sits at the intersection of LLMs and reinforcement learning, specifically symbolic representation learning, which is pure NLP and RL research. Although LLM technology is involved, there is no clear recommendation, search, or advertising scenario, and the direction of the RL application is unclear, so its relevance is low.

2025-10-28 10:13:43 | arXiv:2510.24259v1 |
cs.CLcs.RO
Full abstract
Emergent symbolic representations are critical for enabling developmental learning agents to plan and generalize across tasks. In this work, we investigate whether large language models (LLMs) can translate human natural language instructions into the internal symbolic representations that emerge during hierarchical reinforcement learning. We apply a structured evaluation framework to measure the translation performance of commonly seen LLMs -- GPT, Claude, Deepseek and Grok -- across different internal symbolic partitions generated by a hierarchical reinforcement learning algorithm in the Ant Maze and Ant Fall environments. Our findings reveal that although LLMs demonstrate some ability to translate natural language into a symbolic representation of the environment dynamics, their performance is highly sensitive to partition granularity and task complexity. The results expose limitations in current LLMs capacity for representation alignment, highlighting the need for further research on robust alignment between language and internal agent representations.
Exploring the Influence of Relevant Knowledge for Natural Language Generation Interpretability
Iván Martínez-Murillo, Paloma Moreda, Elena Lloret
Personalized recommendation rationale:

This paper focuses on interpretability of natural language generation, a purely NLP-centric topic unrelated to core recommendation, search, or advertising technology. Although LLMs involve natural language generation, the paper's concern is the NLP-specific question of interpretability and does not point clearly to practical applications in recommendation, search, or advertising.

2025-10-28 08:34:01 | arXiv:2510.24179v1 |
cs.CL
Full abstract
This paper explores the influence of external knowledge integration in Natural Language Generation (NLG), focusing on a commonsense generation task. We extend the CommonGen dataset by creating KITGI, a benchmark that pairs input concept sets with retrieved semantic relations from ConceptNet and includes manually annotated outputs. Using the T5-Large model, we compare sentence generation under two conditions: with full external knowledge and with filtered knowledge where highly relevant relations were deliberately removed. Our interpretability benchmark follows a three-stage method: (1) identifying and removing key knowledge, (2) regenerating sentences, and (3) manually assessing outputs for commonsense plausibility and concept coverage. Results show that sentences generated with full knowledge achieved 91% correctness across both criteria, while filtering reduced performance drastically to 6%. These findings demonstrate that relevant external knowledge is critical for maintaining both coherence and concept coverage in NLG. This work highlights the importance of designing interpretable, knowledge-enhanced NLG systems and calls for evaluation frameworks that capture the underlying reasoning beyond surface-level metrics.
VC4VG: Optimizing Video Captions for Text-to-Video Generation
Yang Du, Zhuoran Lin, Kaiqiang Song, Biao Wang, Zhicheng Zheng, Tiezheng Ge, Bo ...
Personalized recommendation rationale:

This paper focuses on optimizing video captions to improve text-to-video generation quality, a content-generation topic. Video recommendation systems may involve content understanding, but this work mainly targets generation rather than ranking or retrieval tasks in recommendation/search/advertising, so its overlap with my current focus is limited.

2025-10-28 07:19:01 | arXiv:2510.24134v1 |
cs.CVcs.AIcs.CL
Full abstract
Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/qyr0403/VC4VG to support further research.
Squrve: A Unified and Modular Framework for Complex Real-World Text-to-SQL Tasks
Yihan Wang, Peiyu Liu, Runyu Chen, Jiaxing Pu, Wei Xu
Personalized recommendation rationale:

This paper focuses on text-to-SQL conversion, a database-query application with low relevance to the core technical directions of recommendation, search, or advertising. Text-to-SQL might indirectly serve certain data-querying scenarios, but the work lacks clear direct application potential or architectural innovation, so it does not meet the bar of core field progress or enabling technology that I follow.

2025-10-28 06:16:38 | arXiv:2510.24102v1 |
cs.CL
Full abstract
Text-to-SQL technology has evolved rapidly, with diverse academic methods achieving impressive results. However, deploying these techniques in real-world systems remains challenging due to limited integration tools. To close this gap, we introduce Squrve, a unified, modular, and extensible Text-to-SQL framework designed to bring together research advances and real-world applications. Squrve first establishes a universal execution paradigm that standardizes invocation interfaces, then proposes a multi-actor collaboration mechanism based on seven abstracted effective atomic actor components. Experiments on widely adopted benchmarks demonstrate that the collaborative workflows consistently outperform the original individual methods, thereby opening up a new effective avenue for tackling complex real-world queries. The code is available at https://github.com/Satissss/Squrve.
Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation
Xinwei Wu, Heng Liu, Jiang Zhou, Xiaohu Zhao, Linlong Xu, Longyue Wang, Weihua L...
Personalized recommendation rationale:

This paper addresses hallucination in multilingual translation, a pure NLP evaluation-benchmark category unrelated to my core interests. Multilingual capability has potential applications in search and recommendation, but the paper's focus is hallucination detection and benchmark construction rather than practical system improvement or architectural innovation.

2025-10-28 05:17:18 | arXiv:2510.24073v1 |
cs.CL
Full abstract
Large Language Models (LLMs) have advanced machine translation but remain vulnerable to hallucinations. Unfortunately, existing MT benchmarks are not capable of exposing failures in multilingual LLMs. To disclose hallucinations in multilingual LLMs, we introduce a diagnostic framework with a taxonomy that separates Instruction Detachment from Source Detachment. Guided by this taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark across 11 English-to-X directions. We employ 4 frontier LLMs to generate candidates and scrutinize these candidates with an ensemble of LLM judges and expert validation. In this way, we curate 5,435 high-quality instances. We have evaluated 17 LLMs on HalloMTBench. Results reveal distinct "hallucination triggers": unique failure patterns reflecting model scale, source-length sensitivity, linguistic biases, and Reinforcement Learning (RL)-amplified language mixing. HalloMTBench offers a forward-looking testbed for diagnosing LLM translation failures. HalloMTBench is available at https://huggingface.co/collections/AIDC-AI/marco-mt.
GraphNet: A Large-Scale Computational Graph Dataset for Tensor Compiler Research
Xinqi Li, Yiqun Liu, Shan Jiang, Enrong Zheng, Huaijin Zheng, Wenhao Dai, Haodon...
Personalized recommendation rationale:

This paper focuses on tensor compilers and a computational-graph dataset, a low-level systems-optimization topic. Compiler technology may indirectly affect LLM inference efficiency, but the paper itself centers on dataset construction rather than application to recommendation, search, or advertising systems, so its relevance to my core areas is weak.

2025-10-28 03:36:05 | arXiv:2510.24035v1 |
cs.LGcs.CL
Full abstract
We introduce GraphNet, a dataset of 2.7K real-world deep learning computational graphs with rich metadata, spanning six major task categories across multiple deep learning frameworks. To evaluate tensor compiler performance on these samples, we propose the benchmark metric Speedup Score S(t), which jointly considers runtime speedup and execution correctness under tunable tolerance levels, offering a reliable measure of general optimization capability. Furthermore, we extend S(t) to the Error-aware Speedup Score ES(t), which incorporates error information and helps compiler developers identify key performance bottlenecks. In this report, we benchmark the default tensor compilers, CINN for PaddlePaddle and TorchInductor for PyTorch, on computer vision (CV) and natural language processing (NLP) samples to demonstrate the practicality of GraphNet. The full construction pipeline with graph extraction and compiler evaluation tools is available at https://github.com/PaddlePaddle/GraphNet .
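The abstract's S(t) metric couples speedup with correctness under a tolerance. A minimal sketch of such a metric; the exact aggregation and tolerance semantics in the paper may differ:

```python
import numpy as np

# Sketch of a speedup-with-correctness score in the spirit of GraphNet's
# S(t): runtime speedup is credited only when compiled outputs match the
# reference within tolerance t. Using t for both rtol and atol is a
# simplification of this sketch, not necessarily the paper's definition.

def speedup_score(ref_time: float, opt_time: float,
                  ref_out: np.ndarray, opt_out: np.ndarray, t: float) -> float:
    correct = np.allclose(ref_out, opt_out, rtol=t, atol=t)
    return (ref_time / opt_time) if correct else 0.0
```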
META-RAG: Meta-Analysis-Inspired Evidence-Re-Ranking Method for Retrieval-Augmented Generation in Evidence-Based Medicine
Mengzhou Sun, Sendong Zhao, Jianyu Chen, Haochun Wang, Bin Qin
Personalized recommendation rationale:

This paper applies retrieval-augmented generation in the medical domain, a clearly domain-specific medical application that is explicitly excluded from my topics of interest. RAG itself is a general-purpose technique, but the paper specializes it to evidence-based medicine, with no direct relevance or potential application to recommendation, search, or advertising.

2025-10-28 02:18:09 | arXiv:2510.24003v1 |
cs.CL
Full abstract
Evidence-based medicine (EBM) holds a crucial role in clinical application. Given suitable medical articles, doctors effectively reduce the incidence of misdiagnoses. Researchers find it efficient to use large language model (LLM) techniques like RAG for EBM tasks. However, EBM maintains stringent requirements for evidence, and RAG applications in EBM struggle to efficiently distinguish high-quality evidence. Therefore, inspired by the meta-analysis used in EBM, we provide a new method to re-rank and filter the medical evidence. This method presents multiple principles to filter the best evidence for LLMs to diagnose. We employ a combination of several EBM methods to emulate the meta-analysis, which includes reliability analysis, heterogeneity analysis, and extrapolation analysis. These processes allow users to retrieve the best medical evidence for the LLMs. Ultimately, we evaluate these high-quality articles and show an accuracy improvement of up to 11.4% in our experiments. Our method successfully enables RAG to extract higher-quality and more reliable evidence from the PubMed dataset. This work can reduce the infusion of incorrect knowledge into responses and help users receive more effective replies.
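The meta-analysis-inspired re-ranking amounts to combining several evidence-quality signals into one ordering key. A schematic sketch; the three scorers and their weights are assumptions for illustration:

```python
# Schematic of meta-analysis-inspired evidence re-ranking: combine per-document
# reliability, heterogeneity, and extrapolation scores into a single ranking
# key. The scorer callables and weights are assumptions, not the paper's.

def rerank_evidence(docs, reliability, heterogeneity, extrapolation,
                    w=(0.5, 0.3, 0.2)):
    def key(doc):
        return (w[0] * reliability(doc)
                + w[1] * (1.0 - heterogeneity(doc))   # prefer consistent evidence
                + w[2] * extrapolation(doc))
    return sorted(docs, key=key, reverse=True)
```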
Generative View Stitching
Chonghyuk Song, Michal Stary, Boyuan Chen, George Kopanas, Vincent Sitzmann
Personalized recommendation rationale:

The title suggests view-synthesis work in computer vision or graphics, likely producing coherent scene representations from multiple viewpoints. Generative methods are conceptually related to my interests, but the paper has no direct connection to recommendation, search, or advertising, nor any clear application of Transformer architectures or LLM technology to those areas.

2025-10-28 17:59:58 | arXiv:2510.24718v1 |
cs.CVcs.LG
Full abstract
Autoregressive video diffusion models are capable of long rollouts that are stable and consistent with history, but they are unable to guide the current generation with conditioning from the future. In camera-guided video generation with a predefined camera trajectory, this limitation leads to collisions with the generated scene, after which autoregression quickly collapses. To address this, we propose Generative View Stitching (GVS), which samples the entire sequence in parallel such that the generated scene is faithful to every part of the predefined camera trajectory. Our main contribution is a sampling algorithm that extends prior work on diffusion stitching for robot planning to video generation. While such stitching methods usually require a specially trained model, GVS is compatible with any off-the-shelf video model trained with Diffusion Forcing, a prevalent sequence diffusion framework that we show already provides the affordances necessary for stitching. We then introduce Omni Guidance, a technique that enhances the temporal consistency in stitching by conditioning on both the past and future, and that enables our proposed loop-closing mechanism for delivering long-range coherence. Overall, GVS achieves camera-guided video generation that is stable, collision-free, frame-to-frame consistent, and closes loops for a variety of predefined camera paths, including Oscar Reutersvärd's Impossible Staircase. Results are best viewed as videos at https://andrewsonga.github.io/gvs.
Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?
Yihao Li, Saeed Salehi, Lyle Ungar, Konrad P. Kording
Personalized recommendation rationale:

This paper studies the object-binding phenomenon in vision Transformers, a pure analysis of vision architectures with no direct tie to core recommendation, search, or advertising technology. Although the Transformer architecture underlies those fields, this study targets properties of visual perception and offers no clear RecSys/Search/Ads scenario or technology-transfer path.

2025-10-28 17:57:05 | arXiv:2510.24709v1 |
cs.CVcs.AIcs.LGq-bio.NC
Full abstract
Object binding, the brain's ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high-level object representations, stores those objects efficiently and compositionally in memory, and supports human reasoning about individual object instances. While prior work often imposes object-centric attention (e.g., Slot Attention) explicitly to probe these benefits, it remains unclear whether this ability naturally emerges in pre-trained Vision Transformers (ViTs). Intuitively, they could: recognizing which patches belong to the same object should be useful for downstream prediction and thus guide attention. Motivated by the quadratic nature of self-attention, we hypothesize that ViTs represent whether two patches belong to the same object, a property we term IsSameObject. We decode IsSameObject from patch embeddings across ViT layers using a similarity probe, which reaches over 90% accuracy. Crucially, this object-binding capability emerges reliably in self-supervised ViTs (DINO, MAE, CLIP), but is markedly weaker in ImageNet-supervised models, suggesting that binding is not a trivial architectural artifact, but an ability acquired through specific pretraining objectives. We further discover that IsSameObject is encoded in a low-dimensional subspace on top of object features, and that this signal actively guides attention. Ablating IsSameObject from model activations degrades downstream performance and works against the learning objective, implying that emergent object binding naturally serves the pretraining objective. Our findings challenge the view that ViTs lack object binding and highlight how symbolic knowledge of "which parts belong together" emerges naturally in a connectionist system.
MIC-BEV: Multi-Infrastructure Camera Bird's-Eye-View Transformer with Relation-Aware Fusion for 3D Object Detection
Yun Zhang, Zhaoliang Zheng, Johnson Liu, Zhiyu Huang, Zewei Zhou, Zonglin Meng, ...
Personalized recommendation rationale:

Although this paper involves Transformer architecture and multimodal fusion, its core application is 3D object detection for autonomous driving using multi-camera infrastructure data. That has little connection to the technology stack or application scenarios of recommendation, search, or advertising, and no clear cross-domain application potential.

2025-10-28 17:49:42 | arXiv:2510.24688v1 |
cs.CV
Full abstract
Infrastructure-based perception plays a crucial role in intelligent transportation systems, offering global situational awareness and enabling cooperative autonomy. However, existing camera-based detection models often underperform in such scenarios due to challenges such as multi-view infrastructure setup, diverse camera configurations, degraded visual inputs, and various road layouts. We introduce MIC-BEV, a Transformer-based bird's-eye-view (BEV) perception framework for infrastructure-based multi-camera 3D object detection. MIC-BEV flexibly supports a variable number of cameras with heterogeneous intrinsic and extrinsic parameters and demonstrates strong robustness under sensor degradation. The proposed graph-enhanced fusion module in MIC-BEV integrates multi-view image features into the BEV space by exploiting geometric relationships between cameras and BEV cells alongside latent visual cues. To support training and evaluation, we introduce M2I, a synthetic dataset for infrastructure-based object detection, featuring diverse camera configurations, road layouts, and environmental conditions. Extensive experiments on both M2I and the real-world dataset RoScenes demonstrate that MIC-BEV achieves state-of-the-art performance in 3D object detection. It also remains robust under challenging conditions, including extreme weather and sensor degradation. These results highlight the potential of MIC-BEV for real-world deployment. The dataset and source code are available at: https://github.com/HandsomeYun/MIC-BEV.
SAGE: Structure-Aware Generative Video Transitions between Diverse Clips
Mia Kan, Yilin Liu, Niloy Mitra
Personalized recommendation rationale:

This paper focuses on video content generation and transition techniques, a computer-vision and video-processing topic. Generative-model techniques can transfer to other fields, but the paper shows no clear link to recommendation, search, or advertising. Video transitions mainly serve content creation and editing rather than core ranking, retrieval, or user-behavior modeling tasks.

2025-10-28 17:35:02 | arXiv:2510.24667v1 |
cs.CV
Full abstract
Video transitions aim to synthesize intermediate frames between two clips, but naive approaches such as linear blending introduce artifacts that limit professional use or break temporal coherence. Traditional techniques (cross-fades, morphing, frame interpolation) and recent generative inbetweening methods can produce high-quality plausible intermediates, but they struggle with bridging diverse clips involving large temporal gaps or significant semantic differences, leaving a gap for content-aware and visually coherent transitions. We address this challenge by drawing on artistic workflows, distilling strategies such as aligning silhouettes and interpolating salient features to preserve structure and perceptual continuity. Building on this, we propose SAGE (Structure-Aware Generative vidEo transitions) as a zero-shot approach that combines structural guidance, provided via line maps and motion flow, with generative synthesis, enabling smooth, semantically consistent transitions without fine-tuning. Extensive experiments and comparison with current alternatives, namely [FILM, TVG, DiffMorpher, VACE, GI], demonstrate that SAGE outperforms both classical and generative baselines on quantitative metrics and user studies for producing transitions between diverse clips. Code to be released on acceptance.
OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming ...
Personalized recommendation rationale:

This paper benchmarks tool invocation in computer-use agents, a general AI-agent evaluation topic. Tool-invocation techniques might indirectly apply to search systems, but the focus is an OS-interaction benchmark rather than core recommendation/search/advertising technology, and it does not clearly involve LLM architectural innovation, recommender algorithms, or Transformer efficiency improvements.

2025-10-28 15:56:36 | arXiv:2510.24563v1 |
cs.CV
Full abstract
With advances in decision-making and reasoning capabilities, multimodal agents show strong potential in computer application scenarios. Past evaluations have mainly assessed GUI interaction skills, while tool invocation abilities, such as those enabled by the Model Context Protocol (MCP), have been largely overlooked. Comparing agents with integrated tool invocation to those evaluated only on GUI interaction is inherently unfair. We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents' tool invocation, GUI operation, and decision-making abilities in a real-world environment. We design a novel automated code-generation pipeline to create tools and combine them with a curated selection from existing tools. Rigorous manual validation yields 158 high-quality tools (covering 7 common applications), each verified for correct functionality, practical applicability, and versatility. Extensive evaluations of state-of-the-art multimodal agents on OSWorld-MCP show that MCP tools generally improve task success rates (e.g., from 8.3% to 20.4% for OpenAI o3 at 15 steps, from 40.1% to 43.3% for Claude 4 Sonnet at 50 steps), underscoring the importance of assessing tool invocation capabilities. However, even the strongest models have relatively low tool invocation rates (only 36.3%), indicating room for improvement and highlighting the benchmark's challenge. By explicitly measuring MCP tool usage skills, OSWorld-MCP deepens understanding of multimodal agents and sets a new standard for evaluating performance in complex, tool-assisted environments. Our code, environment, and data are publicly available at https://osworld-mcp.github.io.
Rethinking Visual Intelligence: Insights from Video Pretraining
Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alex...
Personalized recommendation rationale:

This paper concerns visual intelligence and video pretraining, a pure computer-vision topic with no direct connection to core recommendation, search, or advertising technology. Although the VLM analogy is loosely suggestive for heterogeneous data processing, the title indicates no RecSys/Search/Ads application or technology-transfer path.

2025-10-28 14:12:11 | arXiv:2510.24448v1 |
cs.CVcs.AI 68T07 68T45 68T20 I.2.10; I.4.8; I.5.1; I.2.6
Full abstract
Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.
A Hybrid Approach for Visual Multi-Object Tracking
Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen
Personalized recommendation rationale:

This paper is devoted to multi-object tracking in computer vision, a purely visual task with no direct connection to the core technologies of recommendation, search, or advertising. Although visual techniques can serve as auxiliary features in specific scenarios, the paper shows no clear application potential for recommendation/search/ads, so its relevance is low.

2025-10-28 13:22:24 | arXiv:2510.24410v1 |
cs.CV cs.RO
Full abstract
This paper proposes a visual multi-object tracking method that jointly employs stochastic and deterministic mechanisms to ensure identifier consistency for unknown and time-varying target numbers under nonlinear dynamics. A stochastic particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes and mitigate divergence through proposed fitness measures incorporating motion consistency, appearance similarity, and social-interaction cues with neighboring targets. Deterministic association further enforces identifier consistency via a proposed cost matrix incorporating spatial consistency between particles and current detections, detection confidences, and track penalties. Subsequently, a novel scheme is proposed for the smooth updating of target states while preserving their identities, particularly for weak tracks during interactions with other targets and prolonged occlusions. Moreover, velocity regression over past states provides trend-seed velocities, enhancing particle sampling and state updates. The proposed tracker is designed to operate flexibly for both pre-recorded videos and camera live streams, where future frames are unavailable. Experimental results confirm superior performance compared to state-of-the-art trackers. The source-code reference implementations of both the proposed method and compared trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack2
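The abstract's core loop, a particle filter whose particles are steered by a PSO-style fitness mixing motion consistency with appearance similarity, can be illustrated compactly. A minimal NumPy sketch under simplifying assumptions (2D constant-velocity particles, precomputed appearance scores, no social-interaction term), not the authors' implementation:

```python
import numpy as np

def fitness(particles, velocities, detection_xy, appearance_sim,
            w_motion=0.5, w_app=0.5):
    """Toy PSO-style fitness mixing motion consistency with appearance
    similarity (the paper also adds social-interaction cues)."""
    predicted = particles + velocities              # constant-velocity step
    motion = np.exp(-np.linalg.norm(predicted - detection_xy, axis=1))
    return w_motion * motion + w_app * appearance_sim

rng = np.random.default_rng(0)
particles = rng.normal([10.0, 5.0], 1.0, size=(100, 2))   # (N, 2) positions
velocities = rng.normal(0.0, 0.1, size=(100, 2))
appearance = rng.uniform(0.4, 1.0, size=100)              # precomputed sims

f = fitness(particles, velocities, np.array([10.5, 5.2]), appearance)
g_best = particles[np.argmax(f)]       # fittest particle = distribution mode
# PSO-style pull toward the global best, plus exploration noise.
particles += 0.5 * (g_best - particles) + rng.normal(0.0, 0.05, particles.shape)
```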
Decoupling What to Count and Where to See for Referring Expression Counting
Yuda Zou, Zijian Zhang, Yongchao Xu
Personalized recommendation rationale:

This paper addresses referring expression counting in computer vision, which involves vision-language interaction. Despite some technical similarity to VLMs, it lacks a clear application scenario in recommendation, search, or advertising. Its core concern is visual grounding and counting rather than modeling the heterogeneous data modalities of recommender systems.

2025-10-28 12:51:53 | arXiv:2510.24374v1 |
cs.CV
Full abstract
Referring Expression Counting (REC) extends class-level object counting to the fine-grained subclass-level, aiming to enumerate objects matching a textual expression that specifies both the class and distinguishing attribute. A fundamental challenge, however, has been overlooked: annotation points are typically placed on class-representative locations (e.g., heads), forcing models to focus on class-level features while neglecting attribute information from other visual regions (e.g., legs for "walking"). To address this, we propose W2-Net, a novel framework that explicitly decouples the problem into "what to count" and "where to see" via a dual-query mechanism. Specifically, alongside the standard what-to-count (w2c) queries that localize the object, we introduce dedicated where-to-see (w2s) queries. The w2s queries are guided to seek and extract features from attribute-specific visual regions, enabling precise subclass discrimination. Furthermore, we introduce Subclass Separable Matching (SSM), a novel matching strategy that incorporates a repulsive force to enhance inter-subclass separability during label assignment. W2-Net significantly outperforms the state-of-the-art on the REC-8K dataset, reducing counting error by 22.5% (validation) and 18.0% (test), and improving localization F1 by 7% and 8%, respectively. Code will be available.
NVSim: Novel View Synthesis Simulator for Large Scale Indoor Navigation
Mingyu Jeong, Eunsung Kim, Sehun Park, Andrew Jaeyong Choi
Personalized recommendation rationale:

This paper concerns novel view synthesis and indoor-navigation simulation in computer vision, a purely visual application area. Although indoor navigation is loosely adjacent to location-aware recommendation, the title clearly centers on visual simulation and navigation, showing no direct relevance to recommendation, search, or advertising and no involvement of LLM or Transformer architecture advances.

2025-10-28 11:57:33 | arXiv:2510.24335v1 |
cs.RO cs.CV
Full abstract
We present NVSim, a framework that automatically constructs large-scale, navigable indoor simulators from only common image sequences, overcoming the cost and scalability limitations of traditional 3D scanning. Our approach adapts 3D Gaussian Splatting to address visual artifacts on sparsely observed floors, a common issue in robotic traversal data. We introduce Floor-Aware Gaussian Splatting to ensure a clean, navigable ground plane, and a novel mesh-free traversability checking algorithm that constructs a topological graph by directly analyzing rendered views. We demonstrate our system's ability to generate valid, large-scale navigation graphs from real-world data. A video demonstration is available at https://youtu.be/tTiIQt6nXC8
Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning
Ivica Dimitrovski, Vlatko Spasev, Ivan Kitanovski
Personalized recommendation rationale:

This paper targets remote sensing image scene classification, a purely visual application area with no direct connection to recommendation, search, or advertising. Although CLIP itself has multimodal potential, the paper is confined to the specific visual domain of remote sensing and lacks a clear RecSys/Search/Ads application scenario.

2025-10-28 11:39:22 | arXiv:2510.24321v1 |
cs.CV cs.AI
Full abstract
Remote sensing applications increasingly rely on deep learning for scene classification. However, their performance is often constrained by the scarcity of labeled data and the high cost of annotation across diverse geographic and sensor domains. While recent vision-language models like CLIP have shown promise by learning transferable representations at scale by aligning visual and textual modalities, their direct application to remote sensing remains suboptimal due to significant domain gaps and the need for task-specific semantic adaptation. To address this critical challenge, we systematically explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We evaluate several representative methods, including Context Optimization, Conditional Context Optimization, Multi-modal Prompt Learning, and Prompting with Self-Regulating Constraints. These approaches reflect complementary design philosophies: from static context optimization to conditional prompts for enhanced generalization, multi-modal prompts for joint vision-language adaptation, and semantically regularized prompts for stable learning without forgetting. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Through extensive experiments on multiple benchmark remote sensing datasets, including cross-dataset generalization tests, we demonstrate that prompt learning consistently outperforms both baselines in few-shot scenarios. Notably, Prompting with Self-Regulating Constraints achieves the most robust cross-domain performance. Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery, providing a strong foundation for future research in this field.
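Context Optimization, the simplest of the prompt-learning methods the abstract benchmarks, reduces to training a few context vectors placed in front of frozen class-name embeddings while both encoders stay fixed. A self-contained PyTorch sketch with stand-in modules (the real method uses CLIP's pretrained text and image towers; all sizes here are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    """CoOp-style prompt: only the context vectors are trainable; the
    class-name embeddings and both encoders stay frozen."""

    def __init__(self, class_embeds, n_ctx=4, dim=512):
        super().__init__()
        self.register_buffer("class_embeds", class_embeds)  # (C, L_name, d)
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self):
        n_cls = self.class_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        return torch.cat([ctx, self.class_embeds], dim=1)    # (C, L, d)

# Frozen stand-in for CLIP's text tower (the real method uses CLIP).
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(512, nhead=8, batch_first=True), num_layers=1)
for p in text_encoder.parameters():
    p.requires_grad_(False)

learner = PromptLearner(class_embeds=torch.randn(10, 3, 512))  # 10 classes
image_feats = F.normalize(torch.randn(32, 512), dim=-1)  # frozen image tower
labels = torch.randint(0, 10, (32,))

text_feats = F.normalize(text_encoder(learner()).mean(dim=1), dim=-1)
logits = 100.0 * image_feats @ text_feats.t()   # temperature-scaled cosine
F.cross_entropy(logits, labels).backward()      # gradients reach only ctx
```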
MC-SJD : Maximal Coupling Speculative Jacobi Decoding for Autoregressive Visual Generation Acceleration
Junhyuk So, Hyunho Kook, Chaeyeon Jang, Eunhyeok Park
Personalized recommendation rationale:

This paper centers on decoding acceleration for visual generation, a purely visual-generation application. Although it involves autoregressive models and decoding acceleration, it lacks a clear application scenario or technology-transfer path for recommendation, search, or advertising. Its core use case is visual content generation, which falls under the excluded AIGC/content-generation category.

2025-10-28 09:26:27 | arXiv:2510.24211v1 |
cs.CV
Full abstract
While autoregressive (AR) modeling has recently emerged as a new paradigm in visual generation, its practical adoption is severely constrained by the slow inference speed of per-token generation, which often requires thousands of steps to produce a single sample. To address this challenge, we propose MC-SJD, a training-free, lossless parallel decoding framework designed to accelerate AR visual generation by extending the recently introduced Speculative Jacobi Decoding (SJD). Although SJD shows strong potential for accelerating AR generation, we demonstrate that token instability across iterations significantly reduces the acceptance rate, a limitation that primarily arises from the independent sampling process used during draft token generation. To overcome this, we introduce MC-SJD, an information-theoretic approach based on coupling, which substantially accelerates standard SJD by maximizing the probability of sampling identical draft tokens across consecutive iterations, all while preserving its lossless property. Remarkably, this method requires only a single-line modification to the existing algorithm, yet achieves substantial performance gains, delivering up to a ~4.2x acceleration in image generation and ~13.3x acceleration in video generation compared to standard AR decoding, without any degradation in output quality.
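The coupling idea behind MC-SJD is classical, even though the paper's exact algorithm is not reproduced here: draw consecutive draft tokens from a maximal coupling of the two draft distributions, so that P(x = y) attains its theoretical maximum, the overlap sum_i min(p_i, q_i). A minimal NumPy sketch:

```python
import numpy as np

def maximal_coupling_sample(p, q, rng):
    """Sample (x, y) with x ~ p and y ~ q while maximizing P(x == y).

    P(x == y) equals the overlap sum_i min(p_i, q_i), the theoretical
    maximum for any coupling of two categorical distributions.
    """
    overlap = np.minimum(p, q)
    w = overlap.sum()
    if rng.random() < w:
        x = rng.choice(len(p), p=overlap / w)    # shared draw: x == y
        return x, x
    # Residual draws: the marginals stay exactly p and q.
    x = rng.choice(len(p), p=(p - overlap) / (1.0 - w))
    y = rng.choice(len(q), p=(q - overlap) / (1.0 - w))
    return x, y

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])   # draft distribution at iteration t
q = np.array([0.5, 0.4, 0.1])   # draft distribution at iteration t+1
matches = sum(x == y for x, y in (maximal_coupling_sample(p, q, rng)
                                  for _ in range(10_000)))
print(matches / 10_000)          # ~0.9 = sum(min(p, q))
```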
Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning
Aodi Wu, Xubo Luo
个性化推荐理由:

虽然该论文涉及视觉语言模型,但其应用领域明确限定于自动驾驶,这与推荐系统、搜索或广告领域无关。论文标题中提到的任务特定提示和空间推理技术主要针对自动驾驶场景中的视觉理解问题,没有显示出在推荐系统、搜索或广告中的潜在应用价值。

2025-10-28 07:43:30 | arXiv:2510.24152v1 |
cs.CVcs.AI
查看完整摘要
This technical report presents our solution for the RoboSense Challenge at IROS 2025, which evaluates Vision-Language Models (VLMs) on autonomous driving scene understanding across perception, prediction, planning, and corruption detection tasks. We propose a systematic framework built on four core components. First, a Mixture-of-Prompts router classifies questions and dispatches them to task-specific expert prompts, eliminating interference across diverse question types. Second, task-specific prompts embed explicit coordinate systems, spatial reasoning rules, role-playing, Chain-of-Thought/Tree-of-Thought reasoning, and few-shot examples tailored to each task. Third, a visual assembly module composes multi-view images with object crops, magenta markers, and adaptive historical frames based on question requirements. Fourth, we configure model inference parameters (temperature, top-p, message roles) per task to optimize output quality. Implemented on Qwen2.5-VL-72B, our approach achieves 70.87% average accuracy on Phase-1 (clean data) and 72.85% on Phase-2 (corrupted data), demonstrating that structured prompting and spatial grounding substantially enhance VLM performance on safety-critical autonomous driving tasks. Code and prompt are available at https://github.com/wuaodi/UCAS-CSU-phase2.
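The Mixture-of-Prompts router the abstract describes amounts to question classification plus dispatch to an expert prompt. A toy sketch (keyword routing stands in for the paper's classifier, and the prompt texts are invented for illustration, not the authors' templates):

```python
# Mixture-of-Prompts router sketch: classify the question, then dispatch
# to a task-specific expert prompt so question types do not interfere.
EXPERT_PROMPTS = {
    "perception": ("You are a driving perception expert. Use the ego "
                   "coordinate system (x forward, y left). ..."),
    "prediction": "Predict each agent's future motion step by step. ...",
    "planning": "Plan a safe maneuver; check traffic rules first. ...",
}

def route(question: str) -> str:
    q = question.lower()
    if any(k in q for k in ("where", "visible", "located")):
        task = "perception"
    elif any(k in q for k in ("will", "next", "future")):
        task = "prediction"
    else:
        task = "planning"
    return EXPERT_PROMPTS[task] + "\n\nQuestion: " + question

print(route("Where is the pedestrian located relative to the ego vehicle?"))
```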
ETC: training-free diffusion models acceleration with Error-aware Trend Consistency
Jiajian Xie, Hubery Yin, Chen Li, Zhou Zhao, Shengyu Zhang
个性化推荐理由:

该论文专注于扩散模型的加速技术,属于生成模型优化领域。虽然扩散模型在AIGC和内容生成中有应用,但根据用户明确的排除标准('AIGC, Content generation, Summarization, or other purely LLM-centric topics'),这属于不相关主题。该技术没有明显的推荐系统、搜索或广告应用潜力。

2025-10-28 07:08:09 | arXiv:2510.24129v1 |
cs.CV
查看完整摘要
Diffusion models have achieved remarkable generative quality but remain bottlenecked by costly iterative sampling. Recent training-free methods accelerate diffusion process by reusing model outputs. However, these methods ignore denoising trends and lack error control for model-specific tolerance, leading to trajectory deviations under multi-step reuse and exacerbating inconsistencies in the generated results. To address these issues, we introduce Error-aware Trend Consistency (ETC), a framework that (1) introduces a consistent trend predictor that leverages the smooth continuity of diffusion trajectories, projecting historical denoising patterns into stable future directions and progressively distributing them across multiple approximation steps to achieve acceleration without deviating; (2) proposes a model-specific error tolerance search mechanism that derives corrective thresholds by identifying transition points from volatile semantic planning to stable quality refinement. Experiments show that ETC achieves a 2.65x acceleration over FLUX with negligible (-0.074 SSIM score) degradation of consistency.
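The mechanism, reusing cached denoiser outputs by extrapolating their recent trend and falling back to a real model call when the trajectory turns volatile, can be sketched in a few lines. A toy version (a linear trend predictor and a fixed tolerance stand in for ETC's learned trend projection and its searched, model-specific thresholds):

```python
import numpy as np

def sample_with_trend_reuse(model_eps, x, timesteps, tol=0.05):
    """Toy trend-consistency reuse for an epsilon-prediction sampler.
    model_eps(x, t) is the expensive denoiser call; tol stands in for
    the paper's searched, model-specific error tolerance."""
    history, calls = [], 0
    for t in timesteps:
        if len(history) >= 2:
            trend = history[-1] - history[-2]
            if np.linalg.norm(trend) <= tol * np.linalg.norm(history[-1]):
                eps = history[-1] + trend          # extrapolate, skip a call
            else:
                eps = model_eps(x, t); calls += 1  # volatile: real call
        else:
            eps = model_eps(x, t); calls += 1      # warm-up calls
        history.append(eps)
        x = x - 0.1 * eps                          # placeholder update rule
    return x, calls

fake_eps = lambda x, t: 0.1 * x + 0.01 * t         # stand-in denoiser
x_out, n_calls = sample_with_trend_reuse(fake_eps, np.ones(4), np.linspace(1, 0, 50))
print(f"{n_calls} real denoiser calls over 50 steps")
```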
Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
William Yang, Xindi Wu, Zhiwei Deng, Esin Tureci, Olga Russakovsky
个性化推荐理由:

该论文主要关注计算机视觉领域的细粒度分类和合成数据生成,属于纯粹的视觉研究方向。虽然合成数据生成技术可能在某些边缘场景下间接应用于推荐或搜索系统的数据增强,但论文标题明确聚焦于视觉对象分类,与推荐系统、搜索广告的核心技术栈缺乏直接关联。

2025-10-28 05:40:14 | arXiv:2510.24078v1 |
cs.CV
查看完整摘要
Text-to-image (T2I) models are increasingly used for synthetic dataset generation, but generating effective synthetic training data for classification remains challenging. Fine-tuning a T2I model with a few real examples can help improve the quality of synthetic training data; however, it may also cause overfitting and reduce diversity in the generated samples. We propose a fine-tuning strategy BOB (BeyondOBjects) to mitigate these concerns for fine-grained classification. Given a small set of real examples, we first extract class-agnostic attributes such as scene background and object pose. We then explicitly condition on these attributes during fine-tuning of the T2I model and marginalize them out during generation. This design mitigates overfitting, preserves the T2I model's generative prior, reduces estimation errors, and further minimizes unintended inter-class associations. Extensive experiments across multiple T2I models, backbones, and datasets show that our method achieves state-of-the-art performance in low-shot fine-grained classification when augmented with synthetic data. Concretely, BOB outperforms DataDream by 7.4% on the Aircraft dataset (from 50.0% to 57.4% when fine-tuning a CLIP classifier with five real images augmented with 100 synthetic images). In three of the four benchmarks, fine-tuning downstream models with 5 real images augmented with BOB achieves better performance than fine-tuning with 10 real images. Collectively, BOB outperforms prior art in 18 of 24 experimental settings, with 2+% accuracy improvements in 14 of these settings.
Kernelized Sparse Fine-Tuning with Bi-level Parameter Competition for Vision Models
Shufan Shen, Junshu Sun, Shuhui Wang, Qingming Huang
个性化推荐理由:

该论文主要关注计算机视觉模型的参数高效微调技术,虽然稀疏微调和双层优化是通用的模型优化方法,但论文明确限定于视觉模型应用,没有明确展示在推荐系统、搜索或广告中的潜在应用价值。核心技术创新与Transformer架构效率或LLM技术没有直接关联。

2025-10-28 03:39:18 | arXiv:2510.24037v1 |
cs.CVcs.LG
查看完整摘要
Parameter-efficient fine-tuning (PEFT) aims to adapt pre-trained vision models to downstream tasks. Among PEFT paradigms, sparse tuning achieves remarkable performance by adjusting only the weights most relevant to downstream tasks, rather than densely tuning the entire weight matrix. Current methods follow a two-stage paradigm. First, they locate task-relevant weights by gradient information, which overlooks the parameter adjustments during fine-tuning and limits performance. Second, they update only the located weights by applying a sparse mask to the gradient of the weight matrix, which results in high memory usage due to the storage of all weight matrices in the optimizer. In this paper, we propose a one-stage method named SNELLA to overcome the above limitations. For memory usage, SNELLA selectively updates the weight matrix by adding it to another sparse matrix that is merged by two low-rank learnable matrices. We extend the low-rank decomposition by introducing nonlinear kernel functions, thereby increasing the rank of the resulting merged matrix to prevent the interdependency among weight updates, enabling better adaptation to downstream tasks. For locating task-relevant weights, we propose an adaptive bi-level sparsity allocation mechanism that encourages weights to compete across and inside layers based on their importance scores in an end-to-end manner. Extensive experiments are conducted on classification, segmentation, and generation tasks using different pre-trained vision models. The results show that SNELLA achieves SOTA performance with low memory usage. Notably, SNELLA obtains 1.8% (91.9% vs. 90.1%) higher Top-1 accuracy on the FGVC benchmark compared to SPT-LoRA. Compared to previous methods, SNELLA achieves a memory reduction of 31.1%-39.9% across models with parameter scales from 86M to 632M. Our source codes are available at https://github.com/ssfgunner/SNELL.
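The memory trick in the abstract, updating the frozen weight through a sparse delta merged from two low-rank matrices passed through a nonlinear kernel, is easy to sketch. A minimal PyTorch module (tanh as the kernel and a fixed random mask are illustrative stand-ins for SNELLA's kernel choice and its learned bi-level sparsity allocation):

```python
import torch
import torch.nn as nn

class KernelizedSparseDelta(nn.Module):
    """A SNELLA-style update: the frozen weight plus a sparse delta
    merged from two low-rank matrices through a nonlinear kernel."""

    def __init__(self, weight, rank=8, sparsity=0.05):
        super().__init__()
        out_f, in_f = weight.shape
        self.register_buffer("weight", weight)            # frozen W
        self.A = nn.Parameter(torch.randn(out_f, rank) * 0.02)
        self.B = nn.Parameter(torch.zeros(rank, in_f))    # delta = 0 at init
        # Fixed random mask standing in for the learned bi-level
        # sparsity allocation described in the abstract.
        self.register_buffer("mask", (torch.rand(out_f, in_f) < sparsity).float())

    def forward(self, x):
        # The elementwise kernel lifts the merged matrix above rank(A @ B).
        delta = torch.tanh(self.A @ self.B) * self.mask
        return x @ (self.weight + delta).t()

layer = KernelizedSparseDelta(torch.randn(64, 128))
out = layer(torch.randn(4, 128))      # only A and B carry gradients
out.sum().backward()
```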
TeleEgo: Benchmarking Egocentric AI Assistants in the Wild
Jiaqi Yan, Ruilong Ren, Jingren Liu, Shuning Xu, Ling Wang, Yiheng Wang, Yun Wan...
个性化推荐理由:

该论文主要关注以自我为中心的人工智能助手基准测试,这属于通用AI助手评估领域,与搜索、推荐或广告系统的核心技术进展没有直接关联。虽然以自我为中心的视角可能涉及用户行为理解,但论文焦点是基准测试而非具体的推荐或搜索算法改进,因此相关性较低。

2025-10-28 01:24:24 | arXiv:2510.23981v1 |
cs.CV
查看完整摘要
Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce TeleEgo, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work & study, lifestyle & routines, social activities, and outings & culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human refinement. TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose two key metrics, Real-Time Accuracy and Memory Persistence Time, to jointly assess correctness, temporal responsiveness, and long-term retention. TeleEgo provides a realistic and comprehensive evaluation to advance the development of practical AI assistants.
Efficient Cost-and-Quality Controllable Arbitrary-scale Super-resolution with Fourier Constraints
Kazutoshi Akita, Norimichi Ukita
个性化推荐理由:

该论文专注于计算机视觉中的超分辨率技术,主要涉及图像处理而非推荐系统、搜索或广告的核心领域。虽然傅里叶约束和效率优化是技术上的进步,但缺乏明确的机制或应用场景将其与异构数据建模、序列推荐或广告排序等核心关注领域联系起来。

2025-10-28 01:19:54 | arXiv:2510.23978v1 |
cs.CV
查看完整摘要
Cost-and-Quality (CQ) controllability in arbitrary-scale super-resolution is crucial. Existing methods predict Fourier components one by one using a recurrent neural network. However, this approach leads to performance degradation and inefficiency due to independent prediction. This paper proposes predicting multiple components jointly to improve both quality and efficiency.
Reasoning Visual Language Model for Chest X-Ray Analysis
Andriy Myronenko, Dong Yang, Baris Turkbey, Mariam Aboian, Sena Azamat, Esra Akc...
个性化推荐理由:

该论文属于医学影像分析领域,专注于胸部X光这一特定医疗应用场景。虽然涉及视觉语言模型技术,但其应用领域明确限定在医疗诊断,与推荐系统、搜索或广告等商业应用场景没有直接关联,因此相关性较低。

2025-10-28 00:48:00 | arXiv:2510.23968v1 |
cs.CV
查看完整摘要
Vision-language models (VLMs) have shown strong promise for medical image analysis, but most remain opaque, offering predictions without the transparent, stepwise reasoning clinicians rely on. We present a framework that brings chain-of-thought (CoT) reasoning to chest X-ray interpretation. Inspired by reasoning-first training paradigms, our approach is designed to learn how experts reason, not just what they conclude, by aligning intermediate steps with observable image evidence and radiology workflow. Beyond accuracy, the explicit reasoning traces support clinical auditability: they reveal why a conclusion was reached, which alternatives were considered, and where uncertainty remains, enabling quality assurance, error analysis, and safer human-AI collaboration. Our model couples high-fidelity visual encoding with a two-stage training recipe: a reasoning-style supervised fine-tuning (SFT) followed by reinforcement learning (RL) that uses verifiable rewards over a list of X-ray abnormalities. The model outputs reasoning that mirrors radiologists' systematic thought process, uncertainty, and differential diagnosis. In out-of-distribution evaluation, the approach achieves competitive multi-label classification while improving interpretability. In a reader study with expert radiologists, full reasoning traces increased confidence, supported error auditing, and reduced time to finalize reports. We release code and the model NV-Reason-CXR-3B to support community progress toward trustworthy, explainable AI in chest radiography and other medical imaging tasks where reasoning quality is as critical as prediction quality.
SafeVision: Efficient Image Guardrail with Robust Policy Adherence and Explainability
Peiyang Xu, Minzhou Pan, Zhaorun Chen, Shuang Yang, Chaowei Xiao, Bo Li
个性化推荐理由:

该论文主要关注图像安全护栏系统,属于计算机视觉安全应用领域。虽然提到了策略遵循和可解释性,但这些技术主要针对图像内容安全过滤,与推荐系统、搜索或广告的核心排名和建模问题没有直接关联。论文的技术方向更偏向视觉内容安全而非推荐/搜索/广告的核心技术栈。

2025-10-28 00:35:59 | arXiv:2510.23960v1 |
cs.CVcs.AIcs.CR
查看完整摘要
With the rapid proliferation of digital media, the need for efficient and transparent safeguards against unsafe content is more critical than ever. Traditional image guardrail models, constrained by predefined categories, often misclassify content due to their pure feature-based learning without semantic reasoning. Moreover, these models struggle to adapt to emerging threats, requiring costly retraining for new threats. To address these limitations, we introduce SafeVision, a novel image guardrail that integrates human-like reasoning to enhance adaptability and transparency. Our approach incorporates an effective data collection and generation framework, a policy-following training pipeline, and a customized loss function. We also propose a diverse QA generation and training strategy to enhance learning effectiveness. SafeVision dynamically aligns with evolving safety policies at inference time, eliminating the need for retraining while ensuring precise risk assessments and explanations. Recognizing the limitations of existing unsafe image benchmarks, which either lack granularity or cover limited risks, we introduce VisionHarm, a high-quality dataset comprising two subsets: VisionHarm Third-party (VisionHarm-T) and VisionHarm Comprehensive(VisionHarm-C), spanning diverse harmful categories. Through extensive experiments, we show that SafeVision achieves state-of-the-art performance on different benchmarks. SafeVision outperforms GPT-4o by 8.6% on VisionHarm-T and by 15.5% on VisionHarm-C, while being over 16x faster. SafeVision sets a comprehensive, policy-following, and explainable image guardrail with dynamic adaptation to emerging threats.
Neural USD: An object-centric framework for iterative editing and control
Alejandro Escontrela, Shrinu Kushagra, Sjoerd van Steenkiste, Yulia Rubanova, Al...
个性化推荐理由:

该论文标题表明其聚焦于对象中心的神经表示和编辑框架,主要涉及3D场景建模与图形学领域。虽然对象中心表示在概念上可能与推荐系统中的物品建模有微弱关联,但该论文明确专注于图形编辑和控制任务,与搜索、推荐或广告的核心技术栈缺乏直接联系,且未涉及Transformer架构或LLM技术。

2025-10-28 00:19:42 | arXiv:2510.23956v1 |
cs.CVcs.AI
查看完整摘要
Amazing progress has been made in controllable generative modeling, especially over the last few years. However, some challenges remain. One of them is precise and iterative object editing. In many of the current methods, trying to edit the generated image (for example, changing the color of a particular object in the scene or changing the background while keeping other elements unchanged) by changing the conditioning signals often leads to unintended global changes in the scene. In this work, we take the first steps to address the above challenges. Taking inspiration from the Universal Scene Descriptor (USD) standard developed in the computer graphics community, we introduce the "Neural Universal Scene Descriptor" or Neural USD. In this framework, we represent scenes and objects in a structured, hierarchical manner. This accommodates diverse signals, minimizes model-specific constraints, and enables per-object control over appearance, geometry, and pose. We further apply a fine-tuning approach which ensures that the above control signals are disentangled from one another. We evaluate several design considerations for our framework, demonstrating how Neural USD enables iterative and incremental workflows. More information at: https://escontrela.me/neural_usd.
MetricX-25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared Task
Juraj Juraska, Tobias Domhan, Mara Finkelstein, Tetsuji Nakagawa, Geza Kovacs, D...
个性化推荐理由:

这篇论文涉及机器翻译评估基准和翻译系统提交,属于纯粹的机器翻译领域,与推荐系统、搜索或广告的核心技术进展无关。虽然翻译技术可能在某些跨语言搜索场景中有间接应用,但论文焦点是WMT评估任务,不属于当前关注的核心领域、使能技术或直接应用范畴。

2025-10-28 17:56:20 | arXiv:2510.24707v1 |
cs.CL
查看完整摘要
In this paper, we present our submissions to the unified WMT25 Translation Evaluation Shared Task. For the Quality Score Prediction subtask, we create a new generation of MetricX with improvements in the input format and the training protocol, while for the Error Span Detection subtask we develop a new model, GemSpanEval, trained to predict error spans along with their severities and categories. Both systems are based on the state-of-the-art multilingual open-weights model Gemma 3, fine-tuned on publicly available WMT data. We demonstrate that MetricX-25, adapting Gemma 3 to an encoder-only architecture with a regression head on top, can be trained to effectively predict both MQM and ESA quality scores, and significantly outperforms its predecessor. Our decoder-only GemSpanEval model, on the other hand, we show to be competitive in error span detection with xCOMET, a strong encoder-only sequence-tagging baseline. With error span detection formulated as a generative task, we instruct the model to also output the context for each predicted error span, thus ensuring that error spans are identified unambiguously.
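The MetricX-25 recipe, turning a decoder-only LM into an encoder with a regression head that predicts a quality score, follows a common pattern. A toy sketch (a small bidirectional Transformer stands in for the adapted Gemma 3 backbone; the pooling choice and score range are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QualityRegressor(nn.Module):
    """Encoder-only quality-score regressor: bidirectional encoding,
    mask-aware mean pooling, and a scalar regression head."""

    def __init__(self, vocab=32000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)

    def forward(self, ids, attn_mask):
        h = self.encoder(self.embed(ids),
                         src_key_padding_mask=~attn_mask.bool())
        m = attn_mask.unsqueeze(-1).float()
        pooled = (h * m).sum(1) / m.sum(1).clamp(min=1.0)
        return self.head(pooled).squeeze(-1)

model = QualityRegressor()
ids = torch.randint(0, 32000, (8, 64))           # "source [SEP] hypothesis"
mask = torch.ones(8, 64, dtype=torch.long)
scores = model(ids, mask)                        # one score per segment
loss = F.mse_loss(scores, torch.rand(8) * 25.0)  # MQM-style target range
loss.backward()
```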
ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?
Shuqing Li, Jiayi Yan, Chenyu Niu, Jen-tse Huang, Yun Peng, Wenxuan Wang, Yepang...
个性化推荐理由:

该论文关注LLMs在物理设备操控和VR游戏中的能力测试,属于纯粹的机器人学或人机交互应用场景。虽然涉及LLMs,但没有任何与推荐系统、搜索或广告相关的潜在应用,完全超出了您关注的领域范围。

2025-10-28 17:55:42 | arXiv:2510.24706v1 |
cs.CLcs.AIcs.HCcs.SE
查看完整摘要
Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces a benchmark, ComboBench, evaluating LLMs' capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs, including GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash, compared against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs' VR manipulation capabilities. We release all materials at https://sites.google.com/view/combobench.
STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
Zihan Liu, Zhikang Niu, Qiuyang Xiao, Zhisheng Zheng, Ruoqi Yuan, Yuhang Zang, Y...
个性化推荐理由:

该论文专注于音频领域的时空推理基准测试,属于特定模态(音频)的评估框架。虽然涉及时空建模,但缺乏与推荐系统、搜索或广告领域的明确关联,且不涉及Transformer架构改进或LLM技术。音频4D智能与异构数据统一建模的VLM类比没有直接联系。

2025-10-28 17:50:34 | arXiv:2510.24693v1 |
cs.SDcs.CLeess.AS
查看完整摘要
Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence that is defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks where caption-only answering reduces accuracy slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on linguistically hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. Our STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
Dissecting Role Cognition in Medical LLMs via Neuronal Ablation
Xun Liang, Huayi Lai, Hanyu Wang, Wentao Zhang, Linfeng Zhang, Yanfang Chen, Fei...
个性化推荐理由:

该论文专注于医学领域的大语言模型角色认知分析,属于医学和生物学应用范畴,明确属于无关主题。神经元消融技术虽然是LLM分析工具,但论文的应用场景和核心问题与推荐系统、搜索或广告领域无关。

2025-10-28 17:40:53 | arXiv:2510.24677v1 |
cs.CLcs.AI
查看完整摘要
Large language models (LLMs) have gained significant traction in medical decision support systems, particularly in the context of medical question answering and role-playing simulations. A common practice, Prompt-Based Role Playing (PBRP), instructs models to adopt different clinical roles (e.g., medical students, residents, attending physicians) to simulate varied professional behaviors. However, the impact of such role prompts on model reasoning capabilities remains unclear. This study introduces the RP-Neuron-Activated Evaluation Framework (RPNA) to evaluate whether role prompts induce distinct, role-specific cognitive processes in LLMs or merely modify linguistic style. We test this framework on three medical QA datasets, employing neuron ablation and representation analysis techniques to assess changes in reasoning pathways. Our results demonstrate that role prompts do not significantly enhance the medical reasoning abilities of LLMs. Instead, they primarily affect surface-level linguistic features, with no evidence of distinct reasoning pathways or cognitive differentiation across clinical roles. Despite superficial stylistic changes, the core decision-making mechanisms of LLMs remain uniform across roles, indicating that current PBRP methods fail to replicate the cognitive complexity found in real-world medical practice. This highlights the limitations of role-playing in medical AI and emphasizes the need for models that simulate genuine cognitive processes rather than linguistic imitation. We have released the related code in the following repository: https://github.com/IAAR-Shanghai/RolePlay_LLMDoctor
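Neuron ablation of the kind the abstract applies is typically implemented with forward hooks that zero selected hidden units and compare model behavior before and after. A minimal PyTorch illustration (a toy MLP stands in for an LLM block; this is the generic probe, not the RPNA pipeline itself):

```python
import torch
import torch.nn as nn

def ablate_neurons(module, neuron_idx):
    """Zero selected output units of a module via a forward hook,
    the standard ablation probe."""
    def hook(_mod, _inp, out):
        out = out.clone()
        out[..., neuron_idx] = 0.0
        return out                 # returning a tensor replaces the output
    return module.register_forward_hook(hook)

# Toy "LLM block": compare outputs with and without ablated hidden units.
mlp = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
x = torch.randn(2, 16)
baseline = mlp(x)

handle = ablate_neurons(mlp[0], neuron_idx=[0, 5, 9])   # kill 3 hidden units
ablated = mlp(x)
handle.remove()

print((baseline - ablated).abs().mean())   # behavioral shift from ablation
```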
MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation
Parker Riley, Daniel Deutsch, Mara Finkelstein, Colten DiIanni, Juraj Juraska, M...
个性化推荐理由:

该论文专注于机器翻译评估方法,属于纯NLP评估领域,与推荐系统、搜索或广告的核心技术进展无关。论文内容涉及翻译质量评估的协作标注技术,没有展示在推荐、搜索或广告领域的潜在应用价值。

2025-10-28 17:29:59 | arXiv:2510.24664v1 |
cs.CL
查看完整摘要
Human evaluation of machine translation is in an arms race with translation model quality: as our models get better, our evaluation methods need to be improved to ensure that quality gains are not lost in evaluation noise. To this end, we experiment with a two-stage version of the current state-of-the-art translation evaluation paradigm (MQM), which we call MQM re-annotation. In this setup, an MQM annotator reviews and edits a set of pre-existing MQM annotations, that may have come from themselves, another human annotator, or an automatic MQM annotation system. We demonstrate that rater behavior in re-annotation aligns with our goals, and that re-annotation results in higher-quality annotations, mostly due to finding errors that were missed during the first pass.
Evolving Diagnostic Agents in a Virtual Clinical Environment
Pengcheng Qiu, Chaoyi Wu, Junwei Liu, Qiaoyu Zheng, Yusheng Liao, Haowen Wang, Y...
个性化推荐理由:

该论文标题明确指向医疗领域的诊断应用,这属于明确的无关主题范畴。论文聚焦于临床环境中的诊断智能体,与推荐系统、搜索或广告领域没有任何技术关联,也不涉及LLM或Transformer架构在相关领域的潜在应用。

2025-10-28 17:19:47 | arXiv:2510.24654v1 |
cs.CL
查看完整摘要
In this paper, we present a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning, enabling them to manage multi-turn diagnostic processes, adaptively select examinations, and commit to final diagnoses. Unlike instruction-tuned models trained on static case summaries, our method acquires diagnostic strategies through interactive exploration and outcome-based feedback. Our contributions are fourfold: (i) We present DiagGym, a diagnostics world model trained with electronic health records that emits examination outcomes conditioned on patient history and recommended examination, serving as a virtual clinical environment for realistic diagnosis training and evaluation; (ii) We train DiagAgent via end-to-end, multi-turn reinforcement learning to learn diagnostic policies that optimize both information yield and diagnostic accuracy; (iii) We introduce DiagBench, a diagnostic benchmark comprising 750 cases with physician-validated examination recommendations and 99 cases annotated with 973 physician-written rubrics on diagnosis process; (iv) we demonstrate superior performance across diverse diagnostic settings. DiagAgent significantly outperforms 10 state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o, as well as two prompt-engineered agents. In single-turn settings, DiagAgent achieves 9.34% higher diagnostic accuracy and 44.03% improvement in examination recommendation hit ratio. In end-to-end settings, it delivers 15.12% increase in diagnostic accuracy and 23.09% boost in examination recommendation F1 score. In rubric-based evaluation, it surpasses the next-best model, Claude-sonnet-4, by 7.1% in weighted rubric score. These findings indicate that learning policies in interactive clinical environments confers dynamic and clinically meaningful diagnostic management abilities unattainable through passive training alone.
Quantifying the Effects of Word Length, Frequency, and Predictability on Dyslexia
Hugo Rydel-Johnston, Alex Kafkas
个性化推荐理由:

该论文研究阅读障碍的认知因素,属于医学/心理学领域,与推荐系统、搜索或广告的技术核心无关。论文内容不涉及LLM技术、Transformer架构改进,也没有在RecSys/Search/Ads领域的潜在应用价值。

2025-10-28 17:15:31 | arXiv:2510.24647v1 |
cs.CLq-bio.NC
查看完整摘要
We ask where, and under what conditions, dyslexic reading costs arise in a large-scale naturalistic reading dataset. Using eye-tracking aligned to word-level features (word length, frequency, and predictability), we model how each feature influences dyslexic time costs. We find that all three features robustly change reading times in both typical and dyslexic readers, and that dyslexic readers show stronger sensitivities to each, especially predictability. Counterfactual manipulations of these features substantially narrow the dyslexic-control gap by about one third, with predictability showing the strongest effect, followed by length and frequency. These patterns align with dyslexia theories that posit heightened demands on linguistic working memory and phonological encoding, and they motivate further work on lexical complexity and parafoveal preview benefits to explain the remaining gap. In short, we quantify when extra dyslexic costs arise, how large they are, and offer actionable guidance for interventions and computational models for dyslexics.
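The modeling step the abstract describes, regressing reading times on word length, frequency, and predictability with group-specific sensitivities and then narrowing the group gap through counterfactual feature manipulation, maps onto an interaction regression. A sketch on synthetic data (all effect sizes are invented; the paper fits eye-tracking data, not this toy):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "length": rng.integers(2, 12, n),        # word length in characters
    "logfreq": rng.normal(0.0, 1.0, n),      # standardized log frequency
    "pred": rng.uniform(0.0, 1.0, n),        # cloze predictability
    "dyslexic": rng.integers(0, 2, n),       # group indicator
})
# Synthetic reading times with a stronger predictability slope for the
# dyslexic group; all coefficients are illustrative.
df["rt"] = (250 + 12 * df["length"] - 20 * df["logfreq"] - 80 * df["pred"]
            + df["dyslexic"] * (60 + 6 * df["length"] - 60 * df["pred"])
            + rng.normal(0, 40, n))

# Group-by-feature interactions capture the extra dyslexic sensitivity.
fit = smf.ols("rt ~ (length + logfreq + pred) * dyslexic", data=df).fit()
print(fit.params.filter(like="dyslexic"))

def group_gap(d):
    yhat = np.asarray(fit.predict(d))
    grp = d["dyslexic"].to_numpy()
    return yhat[grp == 1].mean() - yhat[grp == 0].mean()

# Counterfactual: raising predictability narrows the simulated gap.
cf = df.assign(pred=np.minimum(df["pred"] + 0.2, 1.0))
print(group_gap(df), group_gap(cf))
```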
ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?
Christine Ye, Sihan Yuan, Suchetha Cooray, Steven Dillmann, Ian L. V. Roque, Dal...
个性化推荐理由:

该论文关注AI代理在特定科学领域(天体物理学)的论文复现能力,这属于纯粹的AI能力评估基准研究。与推荐系统、搜索或广告的核心技术进展、Transformer架构改进或LLM直接应用均无关联,也不涉及异构数据的统一建模。

2025-10-28 16:21:19 | arXiv:2510.24591v1 |
cs.CLastro-ph.IM
查看完整摘要
Frontier AI agents show increasing promise as scientific research assistants, and may eventually be useful for extended, open-ended research workflows. However, in order to use agents for novel research, we must first assess the underlying faithfulness and correctness of their work. To evaluate agents as research assistants, we introduce ReplicationBench, an evaluation framework that tests whether agents can replicate entire research papers drawn from the astrophysics literature. Astrophysics, where research relies heavily on archival data and computational study while requiring little real-world experimentation, is a particularly useful testbed for AI agents in scientific research. We split each paper into tasks which require agents to replicate the paper's core contributions, including the experimental setup, derivations, data analysis, and codebase. Each task is co-developed with the original paper authors and targets a key scientific result, enabling objective evaluation of both faithfulness (adherence to original methods) and correctness (technical accuracy of results). ReplicationBench is extremely challenging for current frontier language models: even the best-performing language models score under 20%. We analyze ReplicationBench trajectories in collaboration with domain experts and find a rich, diverse set of failure modes for agents in scientific research. ReplicationBench establishes the first benchmark of paper-scale, expert-validated astrophysics research tasks, reveals insights about agent performance generalizable to other domains of data-driven science, and provides a scalable framework for measuring AI agents' reliability in scientific research.
BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation
Raphaël Bagat, Irina Illina, Emmanuel Vincent
个性化推荐理由:

该论文专注于语音领域的自监督学习和域自适应技术,属于语音处理范畴。虽然涉及自监督学习,但论文明确针对语音域(whisper domain),与推荐系统、搜索或广告的核心技术领域没有直接关联,也不涉及Transformer架构改进或LLM技术在推荐/搜索/广告中的应用。

2025-10-28 16:01:24 | arXiv:2510.24570v1 |
cs.CL
查看完整摘要
Automatic Speech Recognition (ASR) systems, despite large multilingual training, struggle in out-of-domain and low-resource scenarios where labeled data is scarce. We propose BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a novel framework designed to adapt Whisper's encoder using unlabeled data. Unlike traditional self-supervised learning methods, BEARD uniquely combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder, ensuring the encoder's complementarity with the pre-trained decoder. Our experiments focus on the ATCO2 corpus from the challenging Air Traffic Control (ATC) communications domain, characterized by non-native speech, noise, and specialized phraseology. Using about 5,000 hours of untranscribed speech for BEARD and 2 hours of transcribed speech for fine-tuning, the proposed approach significantly outperforms previous baseline and fine-tuned model, achieving a relative improvement of 12% compared to the fine-tuned model. To the best of our knowledge, this is the first work to use a self-supervised learning objective for domain adaptation of Whisper.
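BEST-RQ, which BEARD combines with distillation, derives its self-supervised targets from a frozen random projection quantized against a frozen random codebook; the encoder learns to predict those discrete ids at masked frames. A compact PyTorch sketch (dimensions and the added distillation term are illustrative, with random tensors standing in for real encoder states):

```python
import torch
import torch.nn.functional as F

def best_rq_targets(features, proj, codebook):
    """BEST-RQ-style targets: a frozen random projection of the input
    features, quantized to the nearest entry of a frozen random codebook."""
    z = F.normalize(features @ proj, dim=-1)     # (B, T, code_dim)
    codes = F.normalize(codebook, dim=-1)        # (V, code_dim)
    return (z @ codes.t()).argmax(dim=-1)        # (B, T) discrete ids

torch.manual_seed(0)
proj = torch.randn(80, 16)         # frozen random projection, never trained
codebook = torch.randn(8192, 16)   # frozen random codebook
feats = torch.randn(4, 100, 80)    # e.g., log-mel frames

targets = best_rq_targets(feats, proj, codebook)
mask = torch.rand(4, 100) < 0.4    # masked positions to predict

# Stand-ins for the student encoder's outputs and the frozen teacher.
student_logits = torch.randn(4, 100, 8192, requires_grad=True)
student_h = torch.randn(4, 100, 80, requires_grad=True)
teacher_h = torch.randn(4, 100, 80)

loss = (F.cross_entropy(student_logits[mask], targets[mask])   # BEST-RQ
        + F.mse_loss(student_h, teacher_h))                    # distillation
loss.backward()
```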
Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts
Seyoung Song, Nawon Kim, Songeun Chae, Kiwoong Park, Jiho Jin, Haneul Yoo, Kyung...
个性化推荐理由:

该论文主要涉及韩语历史语料库的构建,属于特定语言的历史文本收集工作,与推荐系统、搜索、广告或LLM技术没有直接关联。这种语料库建设工作主要服务于语言学研究和历史分析,不涉及任何推荐、搜索算法或LLM架构的改进。

2025-10-28 15:43:26 | arXiv:2510.24541v1 |
cs.CL
查看完整摘要
The history of the Korean language is characterized by a discrepancy between its spoken and written forms and a pivotal shift from Chinese characters to the Hangul alphabet. However, this linguistic evolution has remained largely unexplored in NLP due to a lack of accessible historical corpora. To address this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages, as well as under-represented writing systems like Korean-style Sinitic (Idu) and Hanja-Hangul mixed script. This corpus contains 18 million documents and 5 billion tokens from 19 sources, ranging from the 7th century to 2025. We leverage this resource to quantitatively analyze major linguistic shifts: (1) Idu usage peaked in the 1860s before declining sharply; (2) the transition from Hanja to Hangul was a rapid transformation starting around 1890; and (3) North Korea's lexical divergence causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates. This work provides a foundational resource for quantitative diachronic analysis by capturing the history of the Korean language. Moreover, it can serve as a pre-training corpus for large language models, potentially improving their understanding of Sino-Korean vocabulary in modern Hangul as well as archaic writing systems.
Dark & Stormy: Modeling Humor in the Worst Sentences Ever Written
Venkata S Govindarajan, Laura Biester
个性化推荐理由:

这篇论文专注于幽默建模和文本质量分析,属于纯粹的NLP应用领域。该研究没有展示与推荐系统、搜索或广告的明确关联,也不涉及能够赋能这些领域的LLM或Transformer技术进步。

2025-10-28 15:42:03 | arXiv:2510.24538v1 |
cs.CL
查看完整摘要
Textual humor is enormously diverse and computational studies need to account for this range, including intentionally bad humor. In this paper, we curate and analyze a novel corpus of sentences from the Bulwer-Lytton Fiction Contest to better understand "bad" humor in English. Standard humor detection models perform poorly on our corpus, and an analysis of literary devices finds that these sentences combine features common in existing humor datasets (e.g., puns, irony) with metaphor, metafiction and simile. LLMs prompted to synthesize contest-style sentences imitate the form but exaggerate the effect by over-using certain literary devices, and including far more novel adjective-noun bigrams than human writers. Data, code and analysis are available at https://github.com/venkatasg/bulwer-lytton
Levée d'ambiguïtés par grammaires locales (Disambiguation with Local Grammars)
Eric G. C. Laporte
个性化推荐理由:

该论文标题涉及自然语言处理中的歧义消解和局部语法技术,属于传统的NLP方法。虽然歧义消解在搜索系统中可能有基础应用,但该技术过于传统且特定,与现代LLM、推荐系统或广告技术的核心进展没有直接关联,也不属于Transformer架构改进或异构数据统一建模等前沿方向。

2025-10-28 15:38:22 | arXiv:2510.24530v1 |
cs.CL
查看完整摘要
Many words are ambiguous in terms of their part of speech (POS). However, when a word appears in a text, this ambiguity is generally much reduced. Disambiguating POS involves using context to reduce the number of POS associated with words, and is one of the main challenges of lexical tagging. The problem of labeling words by POS frequently arises in natural language processing, for example for spelling correction, grammar or style checking, expression recognition, text-to-speech conversion, text corpus analysis, etc. Lexical tagging systems are thus useful as an initial component of many natural language processing systems. A number of recent lexical tagging systems produce multiple solutions when the text is lexically ambiguous or the uniquely correct solution cannot be found. These contributions aim to guarantee a zero silence rate: the correct tag(s) for a word must never be discarded. This objective is unrealistic for systems that tag each word uniquely. This article concerns a lexical disambiguation method adapted to the objective of a zero silence rate and implemented in Silberztein's INTEX system (1993). We present here a formal description of this method. We show that to verify a local disambiguation grammar in this framework, it is not sufficient to consider the transducer paths separately: one needs to verify their interactions. Similarly, if a combination of multiple transducers is used, the result cannot be predicted by considering them in isolation. Furthermore, when examining the initial labeling of a text as produced by INTEX, ideas for disambiguation rules come spontaneously, but grammatical intuitions may turn out to be inaccurate, often due to an unforeseen construction or ambiguity. If a zero silence rate is targeted, local grammars must be carefully tested. This is where a detailed specification of what a grammar will do once applied to texts would be necessary.
SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space
Viktoriia Zinkovich, Anton Antonov, Andrei Spiridonov, Denis Shepelev, Andrey Mo...
个性化推荐理由:

该论文关注文本自编码器潜在空间中的对抗性攻击和鲁棒性评估,这属于安全性和对抗性防御领域,属于明确列出的无关主题。论文标题表明其核心是评估模型在对抗性攻击下的鲁棒性,而非推荐系统、搜索或广告领域的技术进展或应用。

2025-10-28 14:09:05 | arXiv:2510.24446v1 |
cs.CLcs.CV
查看完整摘要
Multimodal large language models (MLLMs) have shown impressive capabilities in vision-language tasks such as reasoning segmentation, where models generate segmentation masks based on textual queries. While prior work has primarily focused on perturbing image inputs, semantically equivalent textual paraphrases, crucial in real-world applications where users express the same intent in varied ways, remain underexplored. To address this gap, we introduce a novel adversarial paraphrasing task: generating grammatically correct paraphrases that preserve the original query meaning while degrading segmentation performance. To evaluate the quality of adversarial paraphrases, we develop a comprehensive automatic evaluation protocol validated with human studies. Furthermore, we introduce SPARTA, a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder, guided by reinforcement learning. SPARTA achieves significantly higher success rates, outperforming prior methods by up to 2x on both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive baselines to assess the robustness of advanced reasoning segmentation models. We reveal that they remain vulnerable to adversarial paraphrasing, even under strict semantic and grammatical constraints. All code and data will be released publicly upon acceptance.
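The black-box setup is simple to picture: encode the query into the autoencoder's latent space, perturb, decode a paraphrase, and keep whatever most degrades the victim's segmentation score. A stripped-down random-search stand-in for SPARTA's RL-guided optimizer (the semantic and grammatical constraints the paper enforces are omitted; encode, decode, and score are user-supplied callables):

```python
import numpy as np

def latent_paraphrase_search(encode, decode, score, query,
                             steps=200, sigma=0.1, seed=0):
    """Black-box search in a text autoencoder's latent space: perturb the
    latent, decode a paraphrase, keep the candidate that most degrades
    the target score. Random search stands in for RL guidance."""
    rng = np.random.default_rng(seed)
    z = encode(query)
    best_text, best_score = query, score(query)
    for _ in range(steps):
        cand = decode(z + sigma * rng.standard_normal(z.shape))
        s = score(cand)              # e.g., segmentation IoU (black box)
        if s < best_score:           # lower = more adversarial
            best_text, best_score = cand, s
    return best_text, best_score
```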
Law in Silico: Simulating Legal Society with LLM-Based Agents
Yiding Wang, Yuxuan Chen, Fanxu Meng, Xifan Chen, Xiaolei Yang, Muhan Zhang
个性化推荐理由:

该论文专注于法律领域的模拟应用,属于特定领域应用而非核心推荐系统、搜索或广告技术的进展。虽然涉及LLM技术,但其应用场景与我的关注领域(RecSys/Search/Ads)没有直接关联,也不涉及Transformer架构改进或异构数据统一建模等核心技术方向。

2025-10-28 14:07:10 | arXiv:2510.24442v1 |
cs.AIcs.CLcs.CYcs.MA
查看完整摘要
Since real-world legal experiments are often costly or infeasible, simulating legal societies with Artificial Intelligence (AI) systems provides an effective alternative for verifying and developing legal theory, as well as supporting legal administration. Large Language Models (LLMs), with their world knowledge and role-playing capabilities, are strong candidates to serve as the foundation for legal society simulation. However, the application of LLMs to simulate legal systems remains underexplored. In this work, we introduce Law in Silico, an LLM-based agent framework for simulating legal scenarios with individual decision-making and institutional mechanisms of legislation, adjudication, and enforcement. Our experiments, which compare simulated crime rates with real-world data, demonstrate that LLM-based agents can largely reproduce macro-level crime trends and provide insights that align with real-world observations. At the same time, micro-level simulations reveal that a well-functioning, transparent, and adaptive legal system offers better protection of the rights of vulnerable individuals.
Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content
Abdullah Mushtaq, Rafay Naeem, Ezieddin Elmahjub, Ibrahim Ghaznavi, Shawqi Al-Ma...
个性化推荐理由:

该论文专注于宗教内容生成的真实性评估,属于特定领域应用而非核心推荐系统、搜索或广告技术。虽然涉及LLM生成内容,但关注的是宗教内容的忠实性这一狭窄领域,与我的技术焦点无关。

2025-10-28 14:05:55 | arXiv:2510.24438v1 |
cs.CLcs.AIcs.CYcs.MA
查看完整摘要
Large language models are increasingly used for Islamic guidance, but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs. Our dual-agent framework uses a quantitative agent for citation verification and six-dimensional scoring (e.g., Structure, Islamic Consistency, Citations) and a qualitative agent for five-dimensional side-by-side comparison (e.g., Tone, Depth, Originality). GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82). Despite relatively strong performance, models still fall short in reliably producing accurate Islamic content and citations -- a paramount requirement in faith-sensitive writing. GPT-4o had the highest mean quantitative score (3.90/5), while Ansari AI led qualitative pairwise wins (116/200). Fanar, though trailing, introduces innovations for Islamic and Arabic contexts. This study underscores the need for community-driven benchmarks centering Muslim perspectives, offering an early step toward more reliable AI in Islamic knowledge and other high-stakes domains such as medicine, law, and journalism.
LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data
Julian Valline, Cedric Lothritz, Jordi Cabot
个性化推荐理由:

该论文专注于特定小语种(卢森堡语)的指令微调数据集构建,属于纯NLP领域的数据集工作。虽然涉及指令微调技术,但针对的是特定语言场景,没有展示在推荐系统、搜索或广告领域的潜在应用价值,与当前关注的核心技术方向无关。

2025-10-28 14:02:55 | arXiv:2510.24434v1 |
cs.CL
查看完整摘要
The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource linguistic settings due to a lack of high-quality training data. We introduce LuxIT, a novel, monolingual instruction tuning dataset for Luxembourgish developed to mitigate this challenge. We synthesize the dataset from a corpus of native Luxembourgish texts, utilizing DeepSeek-R1-0528, chosen for its shown proficiency in Luxembourgish. Following generation, we apply a quality assurance process, employing an LLM-as-a-judge approach. To investigate the practical utility of the dataset, we fine-tune several smaller-scale LLMs on LuxIT. Subsequent benchmarking against their base models on Luxembourgish language proficiency examinations, however, yields mixed results, with performance varying significantly across different models. LuxIT represents a critical contribution to Luxembourgish natural language processing and offers a replicable monolingual methodology, though our findings highlight the need for further research to optimize its application.
OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
Qiushi Sun, Mukai Li, Zhoumianze Liu, Zhihui Xie, Fangzhi Xu, Zhangyue Yin, Kanz...
个性化推荐理由:

该论文专注于移动GUI智能体的安全验证,属于移动应用测试和自动化领域。虽然提到了智能体技术,但核心关注点是安全增强和验证方法,这属于被排除的隐私安全范畴,与推荐系统、搜索或广告的核心技术进展没有直接关联。

2025-10-28 13:22:39 | arXiv:2510.24411v1 |
cs.AIcs.CLcs.CVcs.HC
查看完整摘要
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents.
Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants
Hunzalah Hassan Bhatti, Firoj Alam
个性化推荐理由:

该论文主要关注阿拉伯文化问答基准的构建,属于特定语言和文化的NLP评估基准。这与我的关注点(推荐系统、搜索广告中的核心进展、LLM技术应用、Transformer架构改进或异构数据统一建模)完全无关,不涉及任何推荐、搜索或广告相关的技术或应用。

2025-10-28 11:52:51 | arXiv:2510.24328v1 |
cs.CLcs.AI68T50F.2.2; I.2.7
查看完整摘要
Large Language Models (LLMs) are increasingly used to answer everyday questions, yet their performance on culturally grounded and dialectal content remains uneven across languages. We propose a comprehensive method that (i) translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects, (ii) converts them into open-ended questions (OEQs), (iii) benchmarks a range of zero-shot and fine-tuned LLMs under both MCQ and OEQ settings, and (iv) generates chain-of-thought (CoT) rationales to fine-tune models for step-by-step reasoning. Using this method, we extend an existing dataset in which QAs are parallelly aligned across multiple language varieties, making it, to our knowledge, the first of its kind. We conduct extensive experiments with both open and closed models. Our findings show that (i) models underperform on Arabic dialects, revealing persistent gaps in culturally grounded and dialect-specific knowledge; (ii) Arabic-centric models perform well on MCQs but struggle with OEQs; and (iii) CoT improves judged correctness while yielding mixed n-gram-based metrics. The developed dataset will be publicly released to support further research on culturally and linguistically inclusive evaluation.
Evaluating LLMs on Generating Age-Appropriate Child-Like Conversations
Syed Zohaib Hassan, Pål Halvorsen, Miriam S. Johnson, Pierre Lison
个性化推荐理由:

该论文主要关注LLM在生成儿童对话方面的评估,这属于纯粹的NLP评估基准研究,与推荐系统、搜索或广告的核心技术无关。论文没有涉及任何推荐算法、搜索排序、广告投放或Transformer架构改进等关键技术领域。

2025-10-28 10:00:52 | arXiv:2510.24250v1 |
cs.CL
查看完整摘要
Large Language Models (LLMs), predominantly trained on adult conversational data, face significant challenges when generating authentic, child-like dialogue for specialized applications. We present a comparative study evaluating five different LLMs (GPT-4, RUTER-LLAMA-2-13b, GPTSW, NorMistral-7b, and NorBloom-7b) to generate age-appropriate Norwegian conversations for children aged 5 and 9 years. Through a blind evaluation by eleven education professionals using both real child interview data and LLM-generated text samples, we assessed authenticity and developmental appropriateness. Our results show that evaluators achieved strong inter-rater reliability (ICC=0.75) and demonstrated higher accuracy in age prediction for younger children (5-year-olds) compared to older children (9-year-olds). While GPT-4 and NorBloom-7b performed relatively well, most models generated language perceived as more linguistically advanced than the target age groups. These findings highlight critical data-related challenges in developing LLM systems for specialized applications involving children, particularly in low-resource languages where comprehensive age-appropriate lexical resources are scarce.
Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations
Ahmad Ghannam, Naif Alharthi, Faris Alasmary, Kholood Al Tabash, Shouq Sadah, La...
个性化推荐理由:

该论文专注于阿拉伯语变音符号恢复的特定NLP任务,属于语音和文本处理领域。虽然涉及多模态建模,但其应用场景(阿拉伯语文本处理)与推荐系统、搜索或广告的核心技术需求没有直接关联,且不涉及LLM、Transformer架构改进或异构数据统一建模等关键技术方向。

2025-10-28 09:58:18 | arXiv:2510.24247v1 |
cs.CL
查看完整摘要
In this work, we tackle the Diacritic Restoration (DR) task for Arabic dialectal sentences using a multimodal approach that combines both textual and speech information. We propose a model that represents the text modality using an encoder extracted from our own pre-trained model named CATT. The speech component is handled by the encoder module of the OpenAI Whisper base model. Our solution is designed following two integration strategies. The former consists of fusing the speech tokens with the input at an early stage, where the 1500 frames of the audio segment are averaged over 10 consecutive frames, resulting in 150 speech tokens. To ensure embedding compatibility, these averaged tokens are processed through a linear projection layer prior to merging them with the text tokens. Contextual encoding is guaranteed by the CATT encoder module. The latter strategy relies on cross-attention, where text and speech embeddings are fused. The cross-attention output is then fed to the CATT classification head for token-level diacritic prediction. To further improve model robustness, we randomly deactivate the speech input during training, allowing the model to perform well with or without speech. Our experiments show that the proposed approach achieves a word error rate (WER) of 0.25 and a character error rate (CER) of 0.9 on the development set. On the test set, our model achieved WER and CER scores of 0.55 and 0.13, respectively.
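The early-fusion strategy is fully specified by the abstract: mean-pool the 1500 audio frames over windows of 10 into 150 speech tokens, project them for embedding compatibility, concatenate with the text tokens, and randomly drop the speech input during training. A sketch with assumed widths (512 for the Whisper-base encoder, 768 for CATT, and the 0.3 dropout rate are illustrative):

```python
import torch
import torch.nn as nn

# Early-fusion sketch: 1500 audio frames -> 150 averaged speech tokens,
# linear projection for embedding compatibility, then concatenation with
# the text tokens. Module names and widths are stand-ins.
speech_frames = torch.randn(2, 1500, 512)     # Whisper-base encoder output
text_tokens = torch.randn(2, 40, 768)         # CATT text embeddings

# Randomly deactivate speech during training so the model also works
# text-only at inference (per the abstract; 0.3 is an assumed rate).
if torch.rand(()) < 0.3:
    speech_frames = torch.zeros_like(speech_frames)

speech_tokens = speech_frames.reshape(2, 150, 10, 512).mean(dim=2)
project = nn.Linear(512, 768)
fused = torch.cat([project(speech_tokens), text_tokens], dim=1)  # (2, 190, 768)
# 'fused' would then feed the CATT encoder and its token-level
# diacritic classification head.
```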
Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?
Teague McMillan, Gabriele Dominici, Martin Gjoreski, Marc Langheinrich
个性化推荐理由:

该论文主要关注LLM的忠实性(faithfulness)和透明推理,这属于纯粹的NLP评估和可解释性研究范畴。虽然忠实性在通用NLP中很重要,但论文没有展示与推荐系统、搜索或广告的直接应用潜力,且专注于评估基准而非实际应用技术。

2025-10-28 09:43:49 | arXiv:2510.24236v1 |
cs.CL
查看完整摘要
Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. In healthcare settings, such unfaithfulness is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. We study how inference and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets-BBQ (social bias) and MedQA (medical licensing questions), and manipulate the number and type of few-shot examples, prompting strategies, and training procedure. Our results show: (i) both the quantity and quality of few-shot examples significantly impact model faithfulness; (ii) faithfulness is sensitive to prompting design; (iii) the instruction-tuning phase improves measured faithfulness on MedQA. These findings offer insights into strategies for enhancing the interpretability and trustworthiness of LLMs in sensitive domains.
HACK: Hallucinations Along Certainty and Knowledge Axes
Adi Simhi, Jonathan Herzig, Itay Itzhak, Dana Arad, Zorik Gekhman, Roi Reichart,...
个性化推荐理由:

该论文标题明确聚焦于幻觉分析,这属于纯粹的NLP中心话题,被明确列为无关主题。论文标题表明其研究的是LLM幻觉的评估和诊断方法,没有任何迹象表明该技术有在推荐系统、搜索或广告领域的潜在应用。

2025-10-28 09:34:31 | arXiv:2510.24222v1 |
cs.CL I.2.7
查看完整摘要
Hallucinations in LLMs present a critical barrier to their reliable usage. Existing research usually categorizes hallucination by their external properties rather than by the LLMs' underlying internal properties. This external focus overlooks that hallucinations may require tailored mitigation strategies based on their underlying mechanism. We propose a framework for categorizing hallucinations along two axes: knowledge and certainty. Since parametric knowledge and certainty may vary across models, our categorization method involves a model-specific dataset construction process that differentiates between those types of hallucinations. Along the knowledge axis, we distinguish between hallucinations caused by a lack of knowledge and those occurring despite the model having the knowledge of the correct response. To validate our framework along the knowledge axis, we apply steering mitigation, which relies on the existence of parametric knowledge to manipulate model activations. This addresses the lack of existing methods to validate knowledge categorization by showing a significant difference between the two hallucination types. We further analyze the distinct knowledge and hallucination patterns between models, showing that different hallucinations do occur despite shared parametric knowledge. Turning to the certainty axis, we identify a particularly concerning subset of hallucinations where models hallucinate with certainty despite having the correct knowledge internally. We introduce a new evaluation metric to measure the effectiveness of mitigation methods on this subset, revealing that while some methods perform well on average, they fail disproportionately on these critical cases. Our findings highlight the importance of considering both knowledge and certainty in hallucination analysis and call for targeted mitigation approaches that consider the hallucination underlying factors.
MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations
Aaron Scott, Maike Züfle, Jan Niehues
个性化推荐理由:

该论文聚焦于德语讽刺检测数据集构建,属于情感分析/讽刺检测的NLP特定领域研究,与推荐系统、搜索或广告的核心技术进展无直接关联。作为数据集论文,它不涉及LLM技术进展、Transformer架构改进,也没有展示在RecSys/Search/Ads领域的潜在应用价值。

2025-10-28 08:33:45 | arXiv:2510.24178v1 |
cs.CLcs.AI
查看完整摘要
Sarcasm is a complex form of figurative language in which the intended meaning contradicts the literal one. Its prevalence in social media and popular culture poses persistent challenges for natural language understanding, sentiment analysis, and content moderation. With the emergence of multimodal large language models, sarcasm detection extends beyond text and requires integrating cues from audio and vision. We present MuSaG, the first German multimodal sarcasm detection dataset, consisting of 33 minutes of manually selected and human-annotated statements from German television shows. Each instance provides aligned text, audio, and video modalities, annotated separately by humans, enabling evaluation in unimodal and multimodal settings. We benchmark nine open-source and commercial models, spanning text, audio, vision, and multimodal architectures, and compare their performance to human annotations. Our results show that while humans rely heavily on audio in conversational settings, models perform best on text. This highlights a gap in current multimodal models and motivates the use of MuSaG for developing models better suited to realistic scenarios. We release MuSaG publicly to support future research on multimodal sarcasm detection and human-model alignment.
Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean
Chanwoo Park, Suyoung Park, JiA Kang, Jongyeon Park, Sangho Kim, Hyunji M. Park,...
个性化推荐理由:

该论文聚焦于韩语多步推理基准测试,属于纯粹的NLP评估基准范畴。虽然涉及LLM技术,但主要关注语言理解和推理能力的评测,与推荐系统、搜索或广告的核心技术发展没有直接关联,也不具备在这些领域的潜在应用价值。

2025-10-28 07:42:59 | arXiv:2510.24150v1 |
cs.CLcs.AI
查看完整摘要
We present Ko-MuSR, the first benchmark to comprehensively evaluate multistep, soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability. Evaluations of four large language models -- two multilingual and two Korean-specialized -- show that multilingual models outperform Korean-focused ones even in Korean reasoning tasks, indicating cross-lingual generalization of reasoning ability. Carefully designed prompting strategies, which combine few-shot examples, reasoning traces, and task-specific hints, further boost accuracy, approaching human-level performance. Ko-MuSR offers a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting strategies.
RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects
Md. Rezuwan Hassan, Azmol Hossain, Kanij Fatema, Rubayet Sabbir Faruque, Tanmoy ...
个性化推荐理由:

该论文专注于孟加拉语方言的语音语料库构建,属于纯粹的语音处理领域,与搜索、推荐或广告系统没有明显关联。语音处理本身不在关注范围内,且没有证据表明该技术有潜力应用于相关领域。

2025-10-28 06:08:42 | arXiv:2510.24096v1 |
cs.CL
查看完整摘要
The Bengali language, spoken extensively across South Asia and among diasporic communities, exhibits considerable dialectal diversity shaped by geography, culture, and history. Phonological and pronunciation-based classifications broadly identify five principal dialect groups: Eastern Bengali, Manbhumi, Rangpuri, Varendri, and Rarhi. Within Bangladesh, further distinctions emerge through variation in vocabulary, syntax, and morphology, as observed in regions such as Chittagong, Sylhet, Rangpur, Rajshahi, Noakhali, and Barishal. Despite this linguistic richness, systematic research on the computational processing of Bengali dialects remains limited. This study seeks to document and analyze the phonetic and morphological properties of these dialects while exploring the feasibility of building computational models particularly Automatic Speech Recognition (ASR) systems tailored to regional varieties. Such efforts hold potential for applications in virtual assistants and broader language technologies, contributing to both the preservation of dialectal diversity and the advancement of inclusive digital tools for Bengali-speaking communities. The dataset created for this study is released for public use.
Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures
Tyler A. Chang, Catherine Arnett, Abdelrahman Eldesokey, Abdelrahman Sadallah, A...
个性化推荐理由:

该论文专注于多语言物理常识推理的评估基准,属于纯粹的NLP评估基准范畴。虽然涉及多语言能力,但物理常识推理与推荐系统、搜索或广告的核心技术需求没有直接关联,且缺乏明确的跨模态或推荐应用潜力。

2025-10-28 05:46:25 | arXiv:2510.24081v1 |
cs.CL
查看完整摘要
To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (up to a 37% accuracy gap, despite random chance at 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded.
PICOs-RAG: PICO-supported Query Rewriting for Retrieval-Augmented Generation in Evidence-Based Medicine
Mengzhou Sun, Sendong Zhao, Jianyu Chen, Bin Qin
个性化推荐理由:

该论文专注于循证医学领域的特定应用,属于医疗领域的专业应用,与推荐系统、搜索或广告的核心技术进展无关。虽然提到了检索增强生成(RAG)技术,但其应用场景被严格限定在医学领域,不符合当前关注的通用技术趋势或直接应用需求。

2025-10-28 02:01:05 | arXiv:2510.23998v1 |
cs.CL
查看完整摘要
Evidence-based medicine (EBM) research has always been of paramount importance. It is important to find appropriate medical theoretical support for the needs of physicians or patients to reduce the occurrence of medical accidents. This process is often carried out by humans querying relevant literature databases, which lacks objectivity and efficiency. Therefore, researchers utilize retrieval-augmented generation (RAG) to search for evidence and generate responses automatically. However, current RAG methods struggle to handle complex queries in real-world clinical scenarios. For example, when queries lack certain information or use imprecise language, the model may retrieve irrelevant evidence and generate unhelpful answers. To address this issue, we present PICOs-RAG, which expands user queries into a better format. Our method expands and normalizes the queries into professional ones and uses the PICO format, a search strategy tool from EBM, to extract the most important information used for retrieval. This approach significantly enhances retrieval efficiency and relevance, yielding up to an 8.8% improvement over the baseline under our evaluation. PICOs-RAG thereby turns large language models into helpful and reliable medical assistants for EBM.
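摘要的核心流程是“先将查询改写为 PICO 结构,再检索并生成”。以下为该流程的假设性示意,其中 llm 与 retrieve 均为外部传入的占位接口:

```python
from typing import Callable, List

PICO_PROMPT = (
    "Rewrite the clinical question into PICO format.\n"
    "P (Population) / I (Intervention) / C (Comparison) / O (Outcome)\n\n"
    "Question: {q}"
)

def pico_rag(query: str,
             llm: Callable[[str], str],
             retrieve: Callable[[str], List[str]]) -> str:
    # 1) 用 LLM 将信息不全或措辞不精确的查询规范化为 PICO 结构
    pico_query = llm(PICO_PROMPT.format(q=query))
    # 2) 用结构化后的查询检索证据文献
    evidence = "\n".join(retrieve(pico_query))
    # 3) 将证据拼入提示,生成最终回答
    return llm(f"Answer using the evidence below.\n{evidence}\n\nQuestion: {query}")
```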
M-Eval: A Heterogeneity-Based Framework for Multi-evidence Validation in Medical RAG Systems
Mengzhou Sun, Sendong Zhao, Jianyu Chen, Haochun Wang, Bin Qin
个性化推荐理由:

该论文专注于医疗领域的RAG系统评估,属于明确的医疗领域特定应用,与我的关注点无关。论文标题明确提到'Medical RAG Systems',这属于被排除的医疗应用范畴,与推荐系统、搜索或广告没有任何关联。

2025-10-28 01:57:40 | arXiv:2510.23995v1 |
cs.CL
查看完整摘要
Retrieval-augmented Generation (RAG) has demonstrated potential in enhancing medical question-answering systems through the integration of large language models (LLMs) with external medical literature. LLMs can retrieve relevant medical articles to generate more professional responses efficiently. However, current RAG applications still face problems. They generate incorrect information, such as hallucinations, and they fail to use external knowledge correctly. To solve these issues, we propose a new method named M-Eval. This method is inspired by the heterogeneity analysis approach used in Evidence-Based Medicine (EBM). Our approach can check for factual errors in RAG responses using evidence from multiple sources. First, we extract additional medical literature from external knowledge bases. Then, we retrieve the evidence documents generated by the RAG system. We use heterogeneity analysis to check whether the evidence supports different viewpoints in the response. In addition to verifying the accuracy of the response, we also assess the reliability of the evidence provided by the RAG system. Our method shows an improvement of up to 23.31% accuracy across various LLMs. This work can help detect errors in current RAG-based medical systems. It also makes the applications of LLMs more reliable and reduces diagnostic errors.
emg2speech: synthesizing speech from electromyography using self-supervised speech models
Harshavardhana T. Gowda, Lee M. Miller
个性化推荐理由:

该论文专注于从肌电图(EMG)信号合成语音的生物医学应用,这属于医疗领域的特定应用。虽然涉及自监督模型,但核心技术与推荐系统、搜索或广告没有任何关联,也没有展示在这些领域的潜在应用价值。

2025-10-28 00:50:15 | arXiv:2510.23969v1 |
cs.SDcs.CLeess.AS
查看完整摘要
We present a neuromuscular speech interface that translates electromyographic (EMG) signals collected from orofacial muscles during speech articulation directly into audio. We show that self-supervised speech (SS) representations exhibit a strong linear relationship with the electrical power of muscle action potentials: SS features can be linearly mapped to EMG power with a correlation of r = 0.85. Moreover, EMG power vectors corresponding to different articulatory gestures form structured and separable clusters in feature space. This relationship (SS features → linear mapping → EMG power → gesture-specific clustering → articulatory movements) highlights that SS models implicitly encode articulatory mechanisms. Leveraging this property, we directly map EMG signals to SS feature space and synthesize speech, enabling end-to-end EMG-to-speech generation without explicit articulatory models and vocoder training.
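摘要的关键量化结论是自监督语音特征可线性映射到 EMG 功率(r = 0.85)。下面用合成数据演示这类线性拟合与相关系数的计算方式(仅为方法示意,非原文数据):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
ss_feats  = rng.normal(size=(2000, 256))                 # 自监督语音特征
true_w    = rng.normal(size=(256, 16))
emg_power = ss_feats @ true_w + 0.3 * rng.normal(size=(2000, 16))  # EMG 功率

pred = LinearRegression().fit(ss_feats, emg_power).predict(ss_feats)
r = np.corrcoef(pred.ravel(), emg_power.ravel())[0, 1]  # 整体 Pearson 相关
print(f"r = {r:.2f}")
```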
Uncovering the Potential Risks in Unlearning: Danger of English-only Unlearning in Multilingual LLMs
Kyomin Hwang, Hyeonjin Kim, Seungyeon Kim, Sunghyun Wee, Nojun Kwak
个性化推荐理由:

该论文聚焦于LLM的遗忘机制和多语言处理,属于纯粹的NLP安全研究范畴。虽然涉及多语言LLM,但核心关注的是遗忘机制的风险评估,与推荐系统、搜索或广告的排名、建模或架构改进没有直接关联,也不属于核心LLM技术进展或Transformer架构创新。

2025-10-28 00:05:00 | arXiv:2510.23949v1 |
cs.CLcs.AI
查看完整摘要
There have been a couple of studies showing that attempting to erase multilingual knowledge using only English data is insufficient for multilingual LLMs. However, their analyses remain highly performance-oriented. In this paper, we switch the point of view to evaluation, and address an additional blind spot which reveals itself when the multilingual LLM is fully finetuned with a parallel multilingual dataset before unlearning. Here, language confusion occurs, whereby a model responds in a language different from that of the input prompt. Language confusion is a problematic phenomenon in unlearning, causing the standard reference-based metrics to fail. We tackle this phenomenon in three steps: (1) introduce the N-gram-based Language-Mix (N-Mix) score to quantitatively show that language confusion is pervasive and consistent in multilingual LLMs, (2) demonstrate that reference-based metrics result in false negatives when the N-Mix score is high, and (3) suggest the need for a new type of unlearning evaluation that can directly assess the content of the generated sentences. We call this type of metric a semantic-based metric.
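摘要提出的 N-Mix 分数用字符 n-gram 度量回复中的语言混用程度。原文公式未在摘要中给出,以下是按这一思路的假设性实现(以 Unicode 名称粗略区分文字系统):

```python
import unicodedata

def _script(ch: str) -> str:
    # 粗略地用 Unicode 名称前缀区分文字系统(假设性简化)
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return "OTHER"
    for s in ("HANGUL", "CJK", "HIRAGANA", "KATAKANA", "LATIN"):
        if name.startswith(s):
            return s
    return "OTHER"

def n_mix(response: str, prompt_script: str, n: int = 3) -> float:
    """回复中含有与提示语言文字系统不一致字符的 n-gram 占比。"""
    grams = [response[i:i + n] for i in range(len(response) - n + 1)]
    grams = [g for g in grams if any(ch.isalpha() for ch in g)]
    if not grams:
        return 0.0
    mixed = sum(any(_script(ch) not in (prompt_script, "OTHER") for ch in g)
                for g in grams)
    return mixed / len(grams)

print(n_mix("이 질문에 대한 answer is unclear", "HANGUL"))
```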
Uniform Discrete Diffusion with Metric Path for Video Generation
Haoge Deng, Ting Pan, Fan Zhang, Yang Liu, Zhuoyan Luo, Yufeng Cui, Wenxuan Wang...
个性化推荐理由:

该论文专注于视频生成领域的扩散模型技术,属于纯粹的视觉内容生成范畴。虽然扩散模型是重要的生成技术,但论文标题明确限定于视频生成应用,与推荐系统、搜索或广告的核心技术需求没有直接关联,也不涉及处理异构数据或Transformer架构改进。

2025-10-28 17:59:57 | arXiv:2510.24717v1 |
cs.CV
查看完整摘要
Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that bridges the gap with continuous approaches for the scalable video generation. At its core, URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation, while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies versatile tasks within a single model, including interpolation and image-to-video generation. Extensive experiments on challenging video and image generation benchmarks demonstrate that URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods. Code and models are available at https://github.com/baaivision/URSA
Group Relative Attention Guidance for Image Editing
Xuanpu Zhang, Xuesong Niu, Ruidong Chen, Dan Song, Jianhao Zeng, Penghui Du, Hao...
个性化推荐理由:

该论文专注于图像编辑技术,属于纯粹的计算机视觉领域,与推荐系统、搜索或广告的核心技术没有直接关联。注意力机制虽然与Transformer相关,但本文的应用场景(图像编辑)在指定的无关主题范围内,没有展示出在RecSys/Search/Ads领域的潜在应用价值。

2025-10-28 17:22:44 | arXiv:2510.24657v1 |
cs.CV
查看完整摘要
Recently, image editing based on Diffusion-in-Transformer models has undergone rapid development. However, existing editing methods often lack effective control over the degree of editing, limiting their ability to achieve more customized results. To address this limitation, we investigate the MM-Attention mechanism within the DiT model and observe that the Query and Key tokens share a bias vector that is only layer-dependent. We interpret this bias as representing the model's inherent editing behavior, while the delta between each token and its corresponding bias encodes the content-specific editing signals. Based on this insight, we propose Group Relative Attention Guidance, a simple yet effective method that reweights the delta values of different tokens to modulate the focus of the model on the input image relative to the editing instruction, enabling continuous and fine-grained control over editing intensity without any tuning. Extensive experiments conducted on existing image editing frameworks demonstrate that GRAG can be integrated with as few as four lines of code, consistently enhancing editing quality. Moreover, compared to the commonly used Classifier-Free Guidance, GRAG achieves smoother and more precise control over the degree of editing. Our code will be released at https://github.com/little-misfit/GRAG-Image-Editing.
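按摘要的描述,GRAG 把每层 Query/Key 的共享偏置视为模型固有编辑行为,把 token 与偏置之差(delta)视为内容相关信号,并对 delta 分组重加权。以下为该思路的假设性示意(以 token 均值近似层级偏置):

```python
import torch

def grag_reweight(tokens: torch.Tensor, image_mask: torch.Tensor,
                  scale: float = 1.2) -> torch.Tensor:
    """tokens: (N, d) 的 Query/Key;image_mask: (N,) 布尔张量。
    以 token 均值近似仅依赖层的偏置向量,对不同组的 delta 做相对重加权(假设性实现)。"""
    bias = tokens.mean(dim=0, keepdim=True)     # 近似层级偏置
    delta = tokens - bias                       # 内容相关的编辑信号
    w = torch.where(image_mask.unsqueeze(-1),
                    torch.tensor(scale), torch.tensor(1.0 / scale))
    return bias + w * delta                     # 重加权后还原 token

tokens = torch.randn(10, 64)
mask = torch.tensor([True] * 6 + [False] * 4)   # 例:前 6 个为图像 token
out = grag_reweight(tokens, mask)
```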
Eye-Tracking, Mouse Tracking, Stimulus Tracking, and Decision-Making Datasets in Digital Pathology
Veronica Thai, Rui Li, Meng Ling, Shuning Jiang, Jeremy Wolfe, Raghu Machiraju, ...
个性化推荐理由:

该论文聚焦于数字病理学领域的用户行为数据收集,属于医学领域的专业应用。虽然涉及追踪和决策数据,但这些技术和方法与推荐系统、搜索或广告的核心技术栈没有直接关联,也不涉及LLM或Transformer架构的进展。

2025-10-28 17:18:43 | arXiv:2510.24653v1 |
cs.CVcs.HC J.3
查看完整摘要
Interpretation of giga-pixel whole-slide images (WSIs) is an important but difficult task for pathologists. Their diagnostic accuracy is estimated to average around 70%. Adding a second pathologist does not substantially improve decision consistency. The field lacks adequate behavioral data to explain diagnostic errors and inconsistencies. To fill in this gap, we present PathoGaze1.0, a comprehensive behavioral dataset capturing the dynamic visual search and decision-making processes of the full diagnostic workflow during cancer diagnosis. The dataset comprises 18.69 hours of eye-tracking, mouse interaction, stimulus tracking, viewport navigation, and diagnostic decision data (EMSVD) collected from 19 pathologists interpreting 397 WSIs. The data collection process emphasizes ecological validity through an application-grounded testbed, called PTAH. In total, we recorded 171,909 fixations, 263,320 saccades, and 1,867,362 mouse interaction events. In addition, such data could also be used to improve the training of both pathologists and AI systems that might support human experts. All experiments were preregistered at https://osf.io/hj9a7, and the complete dataset along with analysis code is available at https://go.osu.edu/pathogaze.
A Dual-Branch CNN for Robust Detection of AI-Generated Facial Forgeries
Xin Zhang, Yuqi Song, Fei Zuo
个性化推荐理由:

该论文专注于AI生成面部伪造的检测,属于计算机视觉中的伪造检测领域,与推荐系统、搜索或广告的核心技术无关。虽然涉及AI生成内容,但这是纯粹的安全/验证应用,而非排名、推荐或广告投放等核心业务场景。

2025-10-28 17:06:40 | arXiv:2510.24640v1 |
cs.CV
查看完整摘要
The rapid advancement of generative AI has enabled the creation of highly realistic forged facial images, posing significant threats to AI security, digital media integrity, and public trust. Face forgery techniques, ranging from face swapping and attribute editing to powerful diffusion-based image synthesis, are increasingly being used for malicious purposes such as misinformation, identity fraud, and defamation. This growing challenge underscores the urgent need for robust and generalizable face forgery detection methods as a critical component of AI security infrastructure. In this work, we propose a novel dual-branch convolutional neural network for face forgery detection that leverages complementary cues from both spatial and frequency domains. The RGB branch captures semantic information, while the frequency branch focuses on high-frequency artifacts that are difficult for generative models to suppress. A channel attention module is introduced to adaptively fuse these heterogeneous features, highlighting the most informative channels for forgery discrimination. To guide the network's learning process, we design a unified loss function, FSC Loss, that combines focal loss, supervised contrastive loss, and a frequency center margin loss to enhance class separability and robustness. We evaluate our model on the DiFF benchmark, which includes forged images generated from four representative methods: text-to-image, image-to-image, face swap, and face edit. Our method achieves strong performance across all categories and outperforms average human accuracy. These results demonstrate the model's effectiveness and its potential contribution to safeguarding AI ecosystems against visual forgery attacks.
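以下是“RGB 分支 + 频域分支 + 通道注意力融合”结构的最小 PyTorch 示意(层宽、分类头均为假设,摘要中的 FSC 损失未包含):

```python
import torch
import torch.nn as nn

class DualBranchForgeryNet(nn.Module):
    def __init__(self, c: int = 32):
        super().__init__()
        self.rgb = nn.Sequential(nn.Conv2d(3, c, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(c, c, 3, padding=1), nn.ReLU())
        self.freq = nn.Sequential(nn.Conv2d(3, c, 3, padding=1), nn.ReLU())
        # SE 风格的通道注意力,用于自适应融合两路异构特征
        self.attn = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(2 * c, 2 * c, 1), nn.Sigmoid())
        self.head = nn.Linear(2 * c, 2)  # 真/伪二分类

    def forward(self, x):
        # 频域分支:FFT 幅度谱突出生成模型难以抑制的高频伪影
        amp = torch.fft.fft2(x).abs().log1p()
        f = torch.cat([self.rgb(x), self.freq(amp)], dim=1)
        f = f * self.attn(f)                   # 通道注意力加权
        return self.head(f.mean(dim=(2, 3)))  # 全局平均池化后分类

logits = DualBranchForgeryNet()(torch.randn(2, 3, 64, 64))
```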
GroundLoc: Efficient Large-Scale Outdoor LiDAR-Only Localization
Nicolai Steinke, Daniel Goehring
个性化推荐理由:

该论文专注于激光雷达定位技术,属于纯粹的机器人感知和自动驾驶领域。虽然提到了大规模系统,但没有任何与推荐系统、搜索或广告相关的技术元素或潜在应用场景。

2025-10-28 16:51:50 | arXiv:2510.24623v1 |
cs.ROcs.CV
查看完整摘要
In this letter, we introduce GroundLoc, a LiDAR-only localization pipeline designed to localize a mobile robot in large-scale outdoor environments using prior maps. GroundLoc employs a Bird's-Eye View (BEV) image projection focusing on the perceived ground area and utilizes the place recognition network R2D2, or alternatively, the non-learning approach Scale-Invariant Feature Transform (SIFT), to identify and select keypoints for BEV image map registration. Our results demonstrate that GroundLoc outperforms state-of-the-art methods on the SemanticKITTI and HeLiPR datasets across various sensors. In the multi-session localization evaluation, GroundLoc reaches an Average Trajectory Error (ATE) well below 50 cm on all Ouster OS2 128 sequences while meeting online runtime requirements. The system supports various sensor models, as evidenced by evaluations conducted with Velodyne HDL-64E, Ouster OS2 128, Aeva Aeries II, and Livox Avia sensors. The prior maps are stored as 2D raster image maps, which can be created from a single drive and require only 4 MB of storage per square kilometer. The source code is available at https://github.com/dcmlr/groundloc.
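GroundLoc 流水线的第一步是把地面点云投影为 BEV 图像再做关键点配准。下面用 numpy 演示一个最小的 BEV 栅格化(分辨率与范围为假设值):

```python
import numpy as np

def lidar_to_bev(points: np.ndarray, res: float = 0.2,
                 xy_range: float = 40.0) -> np.ndarray:
    """points: (N, 3) 的 x/y/z;返回以每格最大高度填充的 BEV 灰度图。"""
    size = int(2 * xy_range / res)
    bev = np.zeros((size, size), dtype=np.float32)
    m = (np.abs(points[:, 0]) < xy_range) & (np.abs(points[:, 1]) < xy_range)
    pts = points[m]
    ix = ((pts[:, 0] + xy_range) / res).astype(int)
    iy = ((pts[:, 1] + xy_range) / res).astype(int)
    np.maximum.at(bev, (iy, ix), pts[:, 2])   # 每个栅格取最大高度
    return bev

bev = lidar_to_bev(np.random.uniform(-50, 50, size=(10000, 3)))
```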
Physics-Inspired Gaussian Kolmogorov-Arnold Networks for X-ray Scatter Correction in Cone-Beam CT
Xu Jiang, Huiying Pan, Ligen Shi, Jianing Sun, Wenfeng Xu, Xing Zhao
个性化推荐理由:

该论文专注于医学影像领域的X射线散射校正技术,属于医疗物理应用范畴,与推荐系统、搜索或广告完全无关。论文标题明确指向CT扫描和X射线物理,属于明确的医学/物理领域应用,完全超出了相关技术范围。

2025-10-28 16:13:14 | arXiv:2510.24579v1 |
cs.CV I.4.5; I.5
查看完整摘要
Cone-beam CT (CBCT) employs a flat-panel detector to achieve three-dimensional imaging with high spatial resolution. However, CBCT is susceptible to scatter during data acquisition, which introduces CT value bias and reduced tissue contrast in the reconstructed images, ultimately degrading diagnostic accuracy. To address this issue, we propose a deep learning-based scatter artifact correction method inspired by physical prior knowledge, leveraging the fact that the observed point scatter probability density distribution exhibits rotational symmetry in the projection domain. The method uses Gaussian Radial Basis Functions (RBF) to model the point scatter function and embeds it into the Kolmogorov-Arnold Networks (KAN) layer, which provides efficient nonlinear mapping capabilities for learning high-dimensional scatter features. By incorporating the physical characteristics of the scattered photon distribution together with the complex function mapping capacity of KAN, the model improves its ability to accurately represent scatter. The effectiveness of the method is validated through both synthetic and real-scan experiments. Experimental results show that the model can effectively correct the scatter artifacts in the reconstructed images and is superior to the current methods in terms of quantitative metrics.
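摘要将高斯 RBF 嵌入 KAN 层来建模具有旋转对称性的点散射函数。以下是单变量高斯 RBF 层的最小 PyTorch 示意(中心数与带宽均为假设):

```python
import torch
import torch.nn as nn

class GaussianRBFLayer(nn.Module):
    """phi_j(x) = exp(-(x - c_j)^2 / (2 s_j^2)) 的可学习线性组合。"""
    def __init__(self, n_centers: int = 16):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(-1, 1, n_centers))
        self.log_s   = nn.Parameter(torch.zeros(n_centers))   # 带宽(对数参数化)
        self.weight  = nn.Parameter(torch.randn(n_centers) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (...,) 标量输入;在最后增加中心维度做广播
        d = x.unsqueeze(-1) - self.centers
        phi = torch.exp(-0.5 * (d / self.log_s.exp()) ** 2)
        return phi @ self.weight    # 加权求和得到散射响应

y = GaussianRBFLayer()(torch.linspace(-1, 1, 100))
```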
Local Performance vs. Out-of-Distribution Generalization: An Empirical Analysis of Personalized Federated Learning in Heterogeneous Data Environments
Mortesa Hussaini, Jan Theiß, Anthony Stein
个性化推荐理由:

该论文明确涉及联邦学习(在无关主题中明确排除)和个性化学习,但缺乏与推荐系统、搜索或广告的直接联系。虽然提到了异构数据环境,但核心焦点是联邦学习框架下的性能与泛化权衡,这超出了当前关注的技术范畴。

2025-10-28 15:15:14 | arXiv:2510.24503v1 |
cs.LGcs.AIcs.CVcs.DCcs.MA
查看完整摘要
In the context of Federated Learning with heterogeneous data environments, local models tend to converge to their own local model optima during local training steps, deviating from the overall data distributions. Aggregation of these local updates, e.g., with FedAvg, often does not align with the global model optimum (client drift), resulting in an update that is suboptimal for most clients. Personalized Federated Learning approaches address this challenge by exclusively focusing on the average local performances of clients' models on their own data distribution. Generalization to out-of-distribution samples, which is a substantial benefit of FedAvg and represents a significant component of robustness, appears to be inadequately incorporated into the assessment and evaluation processes. This study involves a thorough evaluation of Federated Learning approaches, encompassing both their local performance and their generalization capabilities. Therefore, we examine different stages within a single communication round to enable a more nuanced understanding of the considered metrics. Furthermore, we propose and incorporate a modified approach of FedAvg, designated as Federated Learning with Individualized Updates (FLIU), extending the algorithm by a straightforward individualization step with an adaptive personalization factor. We evaluate and compare the approaches empirically using MNIST and CIFAR-10 under various distributional conditions, including benchmark IID and pathological non-IID, as well as additional novel test environments with Dirichlet distribution specifically developed to stress the algorithms on complex data heterogeneity.
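摘要提出的 FLIU 在 FedAvg 聚合后增加一个带自适应个性化因子的个体化步骤。以下为该思路的假设性示意(alpha 的自适应规则摘要未给出,这里用本地/全局损失比作占位):

```python
import numpy as np

def individualized_update(w_global: np.ndarray, w_local: np.ndarray,
                          loss_local: float, loss_global: float) -> np.ndarray:
    """alpha 越大越偏向本地模型;此处用损失比自适应(假设性规则)。"""
    alpha = np.clip(loss_global / (loss_global + loss_local), 0.0, 1.0)
    return alpha * w_local + (1.0 - alpha) * w_global

w_g = np.zeros(10)            # FedAvg 聚合得到的全局参数
w_l = np.ones(10)             # 某客户端的本地参数
w_i = individualized_update(w_g, w_l, loss_local=0.4, loss_global=0.6)
```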
Fast and accurate neural reflectance transformation imaging through knowledge distillation
Tinsae G. Dulecha, Leonardo Righetto, Ruggero Pintus, Enrico Gobbetti, Andrea Gi...
个性化推荐理由:

该论文涉及计算机视觉中的反射变换成像技术,属于纯粹的视觉处理领域,与推荐系统、搜索或广告没有明显关联。知识蒸馏虽然是模型压缩技术,但论文的应用场景(神经反射变换成像)在推荐/搜索/广告领域缺乏直接的应用潜力。

2025-10-28 15:00:07 | arXiv:2510.24486v1 |
cs.CVcs.GR
查看完整摘要
Reflectance Transformation Imaging (RTI) is very popular for its ability to visually analyze surfaces by enhancing surface details through interactive relighting, starting from only a few tens of photographs taken with a fixed camera and variable illumination. Traditional methods like Polynomial Texture Maps (PTM) and Hemispherical Harmonics (HSH) are compact and fast, but struggle to accurately capture complex reflectance fields using few per-pixel coefficients and fixed bases, leading to artifacts, especially in highly reflective or shadowed areas. The NeuralRTI approach, which exploits a neural autoencoder to learn a compact function that better approximates the local reflectance as a function of light directions, has been shown to produce superior quality at comparable storage cost. However, as it performs interactive relighting with custom decoder networks with many parameters, the rendering step is computationally expensive and not feasible at full resolution for large images on limited hardware. Earlier attempts to reduce costs by directly training smaller networks have failed to produce valid results. For this reason, we propose to reduce its computational cost through a novel solution based on Knowledge Distillation (DisK-NeuralRTI). ...
Kineo: Calibration-Free Metric Motion Capture From Sparse RGB Cameras
Charles Javerliat, Pierre Raimbaud, Guillaume Lavoué
个性化推荐理由:

该论文专注于计算机视觉中的运动捕捉技术,涉及3D视觉和人体动作分析。这与我的关注领域(推荐系统、搜索、广告)没有直接关联,也不涉及Transformer架构、LLM技术或异构数据建模。该技术主要应用于动画、游戏、体育分析等领域,在RecSys/Search/Ads中缺乏明确的应用潜力。

2025-10-28 14:30:47 | arXiv:2510.24464v1 |
cs.CV
查看完整摘要
Markerless multiview motion capture is often constrained by the need for precise camera calibration, limiting accessibility for non-experts and in-the-wild captures. Existing calibration-free approaches mitigate this requirement but suffer from high computational cost and reduced reconstruction accuracy. We present Kineo, a fully automatic, calibration-free pipeline for markerless motion capture from videos captured by unsynchronized, uncalibrated, consumer-grade RGB cameras. Kineo leverages 2D keypoints from off-the-shelf detectors to simultaneously calibrate cameras, including Brown-Conrady distortion coefficients, and reconstruct 3D keypoints and dense scene point maps at metric scale. A confidence-driven spatio-temporal keypoint sampling strategy, combined with graph-based global optimization, ensures robust calibration at a fixed computational cost independent of sequence length. We further introduce a pairwise reprojection consensus score to quantify 3D reconstruction reliability for downstream tasks. Evaluations on EgoHumans and Human3.6M demonstrate substantial improvements over prior calibration-free methods. Compared to previous state-of-the-art approaches, Kineo reduces camera translation error by approximately 83-85%, camera angular error by 86-92%, and world mean-per-joint error (W-MPJPE) by 83-91%. Kineo is also efficient in real-world scenarios, processing multi-view sequences faster than their duration in specific configuration (e.g., 36min to process 1h20min of footage). The full pipeline and evaluation code are openly released to promote reproducibility and practical adoption at https://liris-xr.github.io/kineo/.
A Critical Study towards the Detection of Parkinsons Disease using ML Technologies
Vivek Chetia, Abdul Taher Khan, Rahish Gogoi, David Kapsian Khual, Purnendu Bika...
个性化推荐理由:

该论文专注于医学领域的帕金森病检测应用,属于明确的医疗领域特定应用。这与我的关注点(推荐系统、搜索、广告及相关技术)完全无关,没有任何潜在的应用关联。

2025-10-28 14:24:34 | arXiv:2510.24456v1 |
cs.CV
查看完整摘要
The proposed solution is a deep learning technique that can classify three types of tea leaf diseases, of which two are caused by pests and one by pathogens (infectious organisms) and environmental conditions, and can also show the leaf area damaged by a disease. The diseases are Red Rust, Helopeltis, and Red Spider Mite, respectively. In this paper we evaluate two object detection models, SSD MobileNet V2 and Faster R-CNN ResNet50 V1. SSD MobileNet V2 achieved a precision of 0.209 and a recall of 0.02 over the IoU range 0.50:0.95, with a final mAP of 20.9%. Faster R-CNN ResNet50 V1 achieved a precision of 0.252 and a recall of 0.044 over the IoU range 0.50:0.95, with an mAP of 25%, which is better than SSD. We also used Mask R-CNN for object instance segmentation, implementing a custom method to calculate the diseased portion of the leaves. Keywords: Tea Leaf Disease, Deep Learning, Red Rust, Helopeltis and Red Spider Mite, SSD MobileNet V2, Faster R-CNN ResNet50 V1 and Mask RCNN.
Deeply-Conditioned Image Compression via Self-Generated Priors
Zhineng Zhao, Zhihai He, Zikun Zhou, Siwei Ma, Yaowei Wang
个性化推荐理由:

该论文专注于图像压缩技术,属于纯粹的计算机视觉领域,与推荐系统、搜索或广告的核心技术栈没有直接关联。虽然压缩技术可能间接影响存储效率,但论文本身不涉及用户行为建模、排序算法或任何与推荐/搜索/广告相关的应用场景。

2025-10-28 14:04:19 | arXiv:2510.24437v1 |
cs.CV
查看完整摘要
Learned image compression (LIC) has shown great promise for achieving high rate-distortion performance. However, current LIC methods are often limited in their capability to model the complex correlation structures inherent in natural images, particularly the entanglement of invariant global structures with transient local textures within a single monolithic representation. This limitation precipitates severe geometric deformation at low bitrates. To address this, we introduce a framework predicated on functional decomposition, which we term Deeply-Conditioned Image Compression via self-generated priors (DCIC-sgp). Our central idea is to first encode a potent, self-generated prior to encapsulate the image's structural backbone. This prior is subsequently utilized not as mere side-information, but to holistically modulate the entire compression pipeline. This deep conditioning, most critically of the analysis transform, liberates it to dedicate its representational capacity to the residual, high-entropy details. This hierarchical, dependency-driven approach achieves an effective disentanglement of information streams. Our extensive experiments validate this assertion; visual analysis demonstrates that our method substantially mitigates the geometric deformation artifacts that plague conventional codecs at low bitrates. Quantitatively, our framework establishes highly competitive performance, achieving significant BD-rate reductions of 14.4%, 15.7%, and 15.1% against the VVC test model VTM-12.1 on the Kodak, CLIC, and Tecnick datasets.
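摘要强调自生成先验不是作为辅助信息,而是对整个压缩流水线做深度调制。这类调制常用 FiLM 式的逐通道仿射来实现;以下为一个假设性示意(并非原文结构):

```python
import torch
import torch.nn as nn

class PriorModulation(nn.Module):
    """由先验特征回归 (gamma, beta),对主干特征做逐通道仿射调制。"""
    def __init__(self, c_prior: int, c_main: int):
        super().__init__()
        self.to_gb = nn.Conv2d(c_prior, 2 * c_main, 1)

    def forward(self, main: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gb(prior).chunk(2, dim=1)
        return main * (1 + gamma) + beta

mod = PriorModulation(c_prior=8, c_main=32)
out = mod(torch.randn(1, 32, 16, 16), torch.randn(1, 8, 16, 16))
```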
XAI Evaluation Framework for Semantic Segmentation
Reem Hammoud, Abdul karim Gizzini, Ali J. Ghandour
个性化推荐理由:

该论文专注于计算机视觉领域的语义分割可解释性评估,属于纯粹的视觉研究方向。虽然XAI(可解释AI)是重要技术方向,但该论文没有展示任何与推荐系统、搜索或广告的潜在应用关联,完全落在不相关主题范围内。

2025-10-28 13:27:38 | arXiv:2510.24414v1 |
cs.CV
查看完整摘要
Ensuring transparency and trust in artificial intelligence (AI) models is essential, particularly as they are increasingly applied in safety-critical and high-stakes domains. Explainable AI (XAI) has emerged as a promising approach to address this challenge, yet the rigorous evaluation of XAI methods remains crucial for optimizing the trade-offs between model complexity, predictive performance, and interpretability. While extensive progress has been achieved in evaluating XAI techniques for classification tasks, evaluation strategies tailored to semantic segmentation remain relatively underexplored. This work introduces a comprehensive and systematic evaluation framework specifically designed for assessing XAI in semantic segmentation, explicitly accounting for both spatial and contextual task complexities. The framework employs pixel-level evaluation strategies and carefully designed metrics to provide fine-grained interpretability insights. Simulation results using recently adapted class activation mapping (CAM)-based XAI schemes demonstrate the efficiency, robustness, and reliability of the proposed methodology. These findings contribute to advancing transparent, trustworthy, and accountable semantic segmentation models.
50 Years of Water Body Monitoring: The Case of Qaraaoun Reservoir, Lebanon
Ali Ahmad Faour, Nabil Amacha, Ali J. Ghandour
个性化推荐理由:

这篇论文专注于环境科学和水体监测领域,与推荐系统、搜索、广告或LLM技术没有任何关联。这是一个纯粹的领域特定应用,属于明确的无关主题范畴。

2025-10-28 13:23:32 | arXiv:2510.24413v1 |
cs.CV
查看完整摘要
The sustainable management of the Qaraaoun Reservoir, the largest surface water body in Lebanon located in the Bekaa Plain, depends on reliable monitoring of its storage volume despite frequent sensor malfunctions and limited maintenance capacity. This study introduces a sensor-free approach that integrates open-source satellite imagery, advanced water-extent segmentation, and machine learning to estimate the reservoir surface area and volume in near real time. Sentinel-2 and Landsat images are processed, where surface water is delineated using a newly proposed water segmentation index. A machine learning model based on Support Vector Regression (SVR) is trained on a curated dataset that includes water surface area, water level, and water volume calculations using a reservoir bathymetry survey. The model is then able to estimate reservoir volume relying solely on surface area extracted from satellite imagery, without the need for ground measurements. Water segmentation using the proposed index aligns with ground truth for more than 95 percent of the shoreline. Hyperparameter tuning with GridSearchCV yields an optimized SVR performance with error under 1.5 percent of full reservoir capacity and coefficients of determination exceeding 0.98. These results demonstrate the robustness and cost-effectiveness of the method, offering a practical solution for continuous, sensor-independent monitoring of reservoir storage. The proposed methodology can be replicated for other water bodies, and the resulting 50 years of time-series data is valuable for research on climate change and environmental patterns.
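摘要的建模部分(由卫星影像提取的水面面积经 SVR + GridSearchCV 估算库容)可直接用 scikit-learn 搭出框架;以下为合成数据上的最小示意:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(1)
area = rng.uniform(2.0, 12.0, size=(200, 1))                # km^2,由影像分割提取
volume = 18.0 * area[:, 0] ** 1.3 + rng.normal(0, 2, 200)   # 假设性的面积-库容关系

grid = GridSearchCV(SVR(),
                    {"C": [1, 10, 100], "gamma": ["scale", 0.1, 1.0]},
                    cv=5, scoring="r2")
grid.fit(area, volume)
print(grid.best_params_, round(grid.best_score_, 3))
```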
GenTrack: A New Generation of Multi-Object Tracking
Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen
个性化推荐理由:

该论文专注于计算机视觉中的多目标跟踪技术,属于纯粹的视觉领域研究。虽然跟踪技术在广义上可能与用户行为分析相关,但该论文标题没有显示出与推荐系统、搜索或广告的直接关联,也没有涉及LLM、Transformer架构或异构数据处理等核心技术。

2025-10-28 13:13:20 | arXiv:2510.24399v1 |
cs.CVcs.RO
查看完整摘要
This paper introduces a novel multi-object tracking (MOT) method, dubbed GenTrack, whose main contributions include: a hybrid tracking approach employing both stochastic and deterministic manners to robustly handle unknown and time-varying numbers of targets, particularly in maintaining target identity (ID) consistency and managing nonlinear dynamics, leveraging particle swarm optimization (PSO) with some proposed fitness measures to guide stochastic particles toward their target distribution modes, enabling effective tracking even with weak and noisy object detectors, integration of social interactions among targets to enhance PSO-guided particles as well as improve continuous updates of both strong (matched) and weak (unmatched) tracks, thereby reducing ID switches and track loss, especially during occlusions, a GenTrack-based redefined visual MOT baseline incorporating a comprehensive state and observation model based on space consistency, appearance, detection confidence, track penalties, and social scores for systematic and efficient target updates, and the first-ever publicly available source-code reference implementation with minimal dependencies, featuring three variants, including GenTrack Basic, PSO, and PSO-Social, facilitating flexible reimplementation. Experimental results have shown that GenTrack provides superior performance on standard benchmarks and real-world scenarios compared to state-of-the-art trackers, with integrated implementations of baselines for fair comparison. Potential directions for future work are also discussed. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack
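摘要用 PSO 将随机粒子引向目标分布峰值。标准 PSO 的速度/位置更新如下(惯性与加速系数取常用值;实际适应度应由检测置信度、外观相似度等构成,此处用占位函数):

```python
import numpy as np

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=None):
    """x, v, pbest: (n_particles, dim);gbest: (dim,)。返回更新后的 (x, v)。"""
    rng = rng or np.random.default_rng()
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v

# 占位适应度:离目标位置越近越好
target = np.array([5.0, -3.0])
fitness = lambda x: -np.linalg.norm(x - target, axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 2)); v = np.zeros_like(x); pbest = x.copy()
for _ in range(50):
    gbest = pbest[np.argmax(fitness(pbest))]
    x, v = pso_step(x, v, pbest, gbest, rng=rng)
    better = fitness(x) > fitness(pbest)
    pbest[better] = x[better]
print(pbest[np.argmax(fitness(pbest))])  # 应收敛到 target 附近
```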
Unsupervised Detection of Post-Stroke Brain Abnormalities
Youwan Mahé, Elise Bannier, Stéphanie Leplaideur, Elisa Fromont, Francesca Galas...
个性化推荐理由:

该论文专注于医学领域的脑卒中检测应用,属于明确的无关主题范畴。标题明确指向医学影像分析和神经科学应用,与推荐系统、搜索或广告领域没有任何技术关联或潜在应用价值。

2025-10-28 13:13:01 | arXiv:2510.24398v1 |
cs.CV
查看完整摘要
Post-stroke MRI not only delineates focal lesions but also reveals secondary structural changes, such as atrophy and ventricular enlargement. These abnormalities, increasingly recognised as imaging biomarkers of recovery and outcome, remain poorly captured by supervised segmentation methods. We evaluate REFLECT, a flow-based generative model, for unsupervised detection of both focal and non-lesional abnormalities in post-stroke patients. Using dual-expert central-slice annotations on ATLAS data, performance was assessed at the object level with Free-Response ROC analysis for anomaly maps. Two models were trained on lesion-free slices from stroke patients (ATLAS) and on healthy controls (IXI) to test the effect of training data. On ATLAS test subjects, the IXI-trained model achieved higher lesion segmentation (Dice = 0.37 vs 0.27) and improved sensitivity to non-lesional abnormalities (FROC = 0.62 vs 0.43). Training on fully healthy anatomy improves the modelling of normal variability, enabling broader and more reliable detection of structural abnormalities.
When are radiology reports useful for training medical image classifiers?
Herman Bergström, Zhongqi Yue, Fredrik D. Johansson
个性化推荐理由:

该论文专注于医学图像分类和放射学报告,这属于医学领域的特定应用,与推荐系统、搜索或广告无关。论文标题明确指向医疗领域的技术问题,完全超出了我关注的领域范围。

2025-10-28 13:01:42 | arXiv:2510.24385v1 |
cs.CV
查看完整摘要
Medical images used to train machine learning models are often accompanied by radiology reports containing rich expert annotations. However, relying on these reports as inputs for clinical prediction requires the timely manual work of a trained radiologist. This raises a natural question: when can radiology reports be leveraged during training to improve image-only classification? Prior works are limited to evaluating pre-trained image representations by fine-tuning them to predict diagnostic labels, often extracted from reports, ignoring tasks with labels that are weakly associated with the text. To address this gap, we conduct a systematic study of how radiology reports can be used during both pre-training and fine-tuning, across diagnostic and prognostic tasks (e.g., 12-month readmission), and under varying training set sizes. Our findings reveal that: (1) Leveraging reports during pre-training is beneficial for downstream classification tasks where the label is well-represented in the text; however, pre-training through explicit image-text alignment can be detrimental in settings where it's not; (2) Fine-tuning with reports can lead to significant improvements and even have a larger impact than the pre-training method in certain settings. These results provide actionable insights into when and how to leverage privileged text data to train medical image classifiers while highlighting gaps in current research.
A Luminance-Aware Multi-Scale Network for Polarization Image Fusion with a Multi-Scene Dataset
Zhuangfan Huang, Xiaosong Li, Gao Wang, Tao Ye, Haishu Tan, Huafeng Li
个性化推荐理由:

该论文专注于偏振图像融合的计算机视觉任务,涉及多尺度网络和亮度感知处理,属于纯粹的视觉处理领域。论文内容与推荐系统、搜索或广告的核心技术没有直接关联,也没有涉及Transformer架构、LLM技术或异构数据建模等当前关注的技术方向。

2025-10-28 12:57:42 | arXiv:2510.24379v1 |
cs.CV
查看完整摘要
Polarization image fusion combines S0 and DOLP images to reveal surface roughness and material properties through complementary texture features, which has important applications in camouflage recognition, tissue pathology analysis, surface defect detection and other fields. To integrate complementary information from different polarized images in complex luminance environments, we propose a luminance-aware multi-scale network (MLSN). In the encoder stage, we generate a multi-scale spatial weight matrix through a brightness branch, which dynamically injects luminance into the feature maps, solving the problem of inherent contrast differences in polarized images. A global-local feature fusion mechanism is designed at the bottleneck layer to perform windowed self-attention computation, balancing global context and local details through residual linking in the feature-dimension restructuring stage. In the decoder stage, to further improve adaptability to complex lighting, we propose a brightness-enhancement module that establishes the mapping relationship between luminance distribution and texture features, realizing nonlinear luminance correction of the fusion result. We also present MSP, a dataset of 1000 pairs of polarized images covering 17 types of indoor and outdoor complex lighting scenes. MSP provides four-direction polarization raw maps, addressing the scarcity of high-quality datasets in polarization image fusion. Extensive experiments on the MSP, PIF and GAND datasets verify that the proposed MLSN outperforms state-of-the-art methods in subjective and objective evaluations, with the MS-SSIM and SD metrics exceeding the average values of other methods by 8.57%, 60.64%, 10.26%, 63.53%, 22.21%, and 54.31%, respectively. The source code and dataset are available at https://github.com/1hzf/MLS-UNet.
Stroke Lesion Segmentation in Clinical Workflows: A Modular, Lightweight, and Deployment-Ready Tool
Yann Kerverdo, Florent Leray, Youwan Mahé, Stéphanie Leplaideur, Francesca Galas...
个性化推荐理由:

该论文专注于医学图像分割领域,具体针对卒中病灶的临床工作流程应用。这与我的关注点完全无关,因为医学、生物学和其他特定领域应用被明确列为不相关主题,且该技术没有明显的推荐系统、搜索或广告应用潜力。

2025-10-28 12:56:48 | arXiv:2510.24378v1 |
cs.CV
查看完整摘要
Deep learning frameworks such as nnU-Net achieve state-of-the-art performance in brain lesion segmentation but remain difficult to deploy clinically due to heavy dependencies and monolithic design. We introduce StrokeSeg, a modular and lightweight framework that translates research-grade stroke lesion segmentation models into deployable applications. Preprocessing, inference, and postprocessing are decoupled: preprocessing relies on the Anima toolbox with BIDS-compliant outputs, and inference uses ONNX Runtime with Float16 quantisation, reducing model size by about 50%. StrokeSeg provides both graphical and command-line interfaces and is distributed as Python scripts and as a standalone Windows executable. On a held-out set of 300 sub-acute and chronic stroke subjects, segmentation performance was equivalent to the original PyTorch pipeline (Dice difference below 10^-3), demonstrating that high-performing research pipelines can be transformed into portable, clinically usable tools.
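摘要提到的部署路径是 ONNX Runtime 推理加 Float16 量化(模型体积约减半)。以下为这一路径的通用示意(模型文件与输入形状均为假设;Float16 转换借助 onnxconverter-common 包):

```python
import numpy as np
import onnx
import onnxruntime as ort
from onnxconverter_common import float16

# 1) 载入 FP32 模型并转换为 FP16(权重体积约减半)
model = onnx.load("strokeseg_fp32.onnx")            # 假设的模型文件
onnx.save(float16.convert_float_to_float16(model), "strokeseg_fp16.onnx")

# 2) 用 ONNX Runtime 推理
sess = ort.InferenceSession("strokeseg_fp16.onnx",
                            providers=["CPUExecutionProvider"])
name = sess.get_inputs()[0].name
x = np.random.rand(1, 1, 128, 128, 128).astype(np.float16)  # 假设的输入形状
mask = sess.run(None, {name: x})[0]
```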
Adaptive Knowledge Transferring with Switching Dual-Student Framework for Semi-Supervised Medical Image Segmentation
Thanh-Huy Nguyen, Hoang-Thien Nguyen, Ba-Thinh Lam, Vi Vu, Bach X. Nguyen, Jianh...
个性化推荐理由:

该论文专注于医学图像分割领域,属于明确的无关主题范畴。虽然涉及知识迁移和半监督学习技术,但医学图像处理与推荐系统、搜索或广告领域没有直接关联,且论文标题明确指向医学应用场景。

2025-10-28 12:42:33 | arXiv:2510.24366v1 |
cs.CV
查看完整摘要
Teacher-student frameworks have emerged as a leading approach in semi-supervised medical image segmentation, demonstrating strong performance across various tasks. However, the learning effects are still limited by the strong correlation and unreliable knowledge transfer process between teacher and student networks. To overcome this limitation, we introduce a novel switching Dual-Student architecture that strategically selects the most reliable student at each iteration to enhance dual-student collaboration and prevent error reinforcement. We also introduce a strategy of Loss-Aware Exponential Moving Average to dynamically ensure that the teacher absorbs meaningful information from students, improving the quality of pseudo-labels. Our plug-and-play framework is extensively evaluated on 3D medical image segmentation datasets, where it outperforms state-of-the-art semi-supervised methods, demonstrating its effectiveness in improving segmentation accuracy under limited supervision.
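摘要的两个关键机制是每轮切换更可靠的学生,以及按损失调整动量的 EMA 教师更新。以下为假设性示意(动量随损失变化的具体公式摘要未给出,这里用简单的线性缩放占位):

```python
import copy
import torch

@torch.no_grad()
def loss_aware_ema(teacher, student, loss: float, base_m: float = 0.99) -> None:
    # 学生损失越低(越可靠),动量越小,教师吸收得越多——假设性规则
    m = min(0.999, base_m + 0.01 * loss)
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1.0 - m)

def select_student(students, losses):
    # 切换策略:选取当前迭代损失更低的学生来指导教师
    return students[int(torch.tensor(losses).argmin())]

s1 = torch.nn.Linear(4, 2); s2 = copy.deepcopy(s1)
teacher = copy.deepcopy(s1)
best = select_student([s1, s2], losses=[0.52, 0.47])
loss_aware_ema(teacher, best, loss=0.47)
```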
Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes
Jonas Hein, Lazaros Vlachopoulos, Maurits Geert Laurent Olthof, Bastian Sigrist,...
个性化推荐理由:

该论文专注于医学手术场景中的声源定位技术,属于医疗领域的特定应用。标题中提到的'手术操作'和'空间映射'明确表明这是医疗领域的研究,与推荐系统、搜索或广告的核心技术进展完全无关。该研究不涉及任何LLM技术、Transformer架构改进或推荐系统相关的数据处理方法。

2025-10-28 11:55:45 | arXiv:2510.24332v1 |
cs.SDcs.CVeess.ASeess.IV
查看完整摘要
Purpose: Surgical scene understanding is key to advancing computer-aided and intelligent surgical systems. Current approaches predominantly rely on visual data or end-to-end learning, which limits fine-grained contextual modeling. This work aims to enhance surgical scene representations by integrating 3D acoustic information, enabling temporally and spatially aware multimodal understanding of surgical environments. Methods: We propose a novel framework for generating 4D audio-visual representations of surgical scenes by projecting acoustic localization information from a phased microphone array onto dynamic point clouds from an RGB-D camera. A transformer-based acoustic event detection module identifies relevant temporal segments containing tool-tissue interactions which are spatially localized in the audio-visual scene representation. The system was experimentally evaluated in a realistic operating room setup during simulated surgical procedures performed by experts. Results: The proposed method successfully localizes surgical acoustic events in 3D space and associates them with visual scene elements. Experimental evaluation demonstrates accurate spatial sound localization and robust fusion of multimodal data, providing a comprehensive, dynamic representation of surgical activity. Conclusion: This work introduces the first approach for spatial sound localization in dynamic surgical scenes, marking a significant advancement toward multimodal surgical scene representations. By integrating acoustic and visual data, the proposed framework enables richer contextual understanding and provides a foundation for future intelligent and autonomous surgical systems.
Training-free Source Attribution of AI-generated Images via Resynthesis
Pietro Bongini, Valentina Molinari, Andrea Costanzo, Benedetta Tondi, Mauro Barn...
个性化推荐理由:

该论文专注于AI生成图像的来源检测和归因问题,这属于内容认证和溯源领域。虽然涉及AI生成内容,但主要关注图像来源识别而非推荐、搜索或广告系统的核心算法改进。论文没有展示在推荐系统、搜索排序或广告技术中的潜在应用价值,与当前关注的LLM技术、Transformer架构或异构数据建模等方向无关。

2025-10-28 10:39:04 | arXiv:2510.24278v1 |
cs.CVcs.AI
查看完整摘要
Synthetic image source attribution is a challenging task, especially in data scarcity conditions requiring few-shot or zero-shot classification capabilities. We present a new training-free one-shot attribution method based on image resynthesis. A prompt describing the image under analysis is generated, then it is used to resynthesize the image with all the candidate sources. The image is attributed to the model which produced the resynthesis closest to the original image in a proper feature space. We also introduce a new dataset for synthetic image attribution consisting of face images from commercial and open-source text-to-image generators. The dataset provides a challenging attribution framework, useful for developing new attribution models and testing their capabilities on different generative architectures. The dataset structure allows to test approaches based on resynthesis and to compare them to few-shot methods. Results from state-of-the-art few-shot approaches and other baselines show that the proposed resynthesis method outperforms existing techniques when only a few samples are available for training or fine-tuning. The experiments also demonstrate that the new dataset is a challenging one and represents a valuable benchmark for developing and evaluating future few-shot and zero-shot methods.
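摘要的免训练归因流程为:为待测图像生成描述,用每个候选生成器按描述重新合成,再在特征空间中取与原图最近者。以下为假设性示意(caption、generators、embed 均为外部传入的占位接口):

```python
from typing import Callable, Dict
import numpy as np

def attribute_by_resynthesis(image,
                             caption: Callable,               # 图像 -> 文本描述
                             generators: Dict[str, Callable],  # 名称 -> (prompt -> 图像)
                             embed: Callable) -> str:          # 图像 -> 特征向量
    prompt = caption(image)
    f0 = embed(image)
    dists = {}
    for name, gen in generators.items():
        f = embed(gen(prompt))                 # 用候选模型重新合成并编码
        dists[name] = np.linalg.norm(f0 - f)   # 原图与重合成图的特征距离
    return min(dists, key=dists.get)           # 距离最近的模型即归因结果
```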
DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation
Jingyi Tian, Le Wang, Sanping Zhou, Sen Wang, Jiayi Li, Gang Hua
个性化推荐理由:

该论文专注于机器人操控领域的3D动力学学习,属于机器人技术范畴。虽然涉及3D视觉和动态建模,但缺乏与推荐系统、搜索或广告领域的直接关联或潜在应用。论文的核心技术(掩码未来渲染)主要针对机器人操作任务,不属于当前关注的核心领域。

2025-10-28 10:17:11 | arXiv:2510.24261v1 |
cs.ROcs.AIcs.CV
查看完整摘要
Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
DeshadowMamba: Deshadowing as 1D Sequential Similarity
Zhaotong Yang, Yi Chen, Yanying Li, Shengfeng He, Yangyang Xu, Junyu Dong, Jian ...
个性化推荐理由:

该论文专注于计算机视觉中的去阴影任务,采用序列建模方法。这与推荐系统、搜索或广告的核心技术领域完全无关,也不涉及任何可能应用于这些领域的LLM或Transformer技术。

2025-10-28 10:14:23 | arXiv:2510.24260v1 |
cs.CV
查看完整摘要
Recent deep models for image shadow removal often rely on attention-based architectures to capture long-range dependencies. However, their fixed attention patterns tend to mix illumination cues from irrelevant regions, leading to distorted structures and inconsistent colors. In this work, we revisit shadow removal from a sequence modeling perspective and explore the use of Mamba, a selective state space model that propagates global context through directional state transitions. These transitions yield an efficient global receptive field while preserving positional continuity. Despite its potential, directly applying Mamba to image data is suboptimal, since it lacks awareness of shadow-non-shadow semantics and remains susceptible to color interference from nearby regions. To address these limitations, we propose CrossGate, a directional modulation mechanism that injects shadow-aware similarity into Mamba's input gate, allowing selective integration of relevant context along transition axes. To further ensure appearance fidelity, we introduce ColorShift regularization, a contrastive learning objective driven by global color statistics. By synthesizing structured informative negatives, it guides the model to suppress color contamination and achieve robust color restoration. Together, these components adapt sequence modeling to the structural integrity and chromatic consistency required for shadow removal. Extensive experiments on public benchmarks demonstrate that DeshadowMamba achieves state-of-the-art visual quality and strong quantitative performance.
Delving into Cascaded Instability: A Lipschitz Continuity View on Image Restoration and Object Detection Synergy
Qing Zhao, Weijian Deng, Pengxu Wei, ZiYi Dong, Hannan Lu, Xiangyang Ji, Liang L...
个性化推荐理由:

该论文专注于计算机视觉领域的图像恢复和目标检测协同问题,属于纯粹的视觉技术研究。虽然提到了级联结构和稳定性分析,但其核心内容与推荐系统、搜索或广告的排序任务没有直接关联。论文的技术视角(Lipschitz连续性)和具体应用场景(图像处理)均超出了当前关注的技术范畴。

2025-10-28 09:41:42 | arXiv:2510.24232v1 |
cs.CV
查看完整摘要
To improve detection robustness in adverse conditions (e.g., haze and low light), image restoration is commonly applied as a pre-processing step to enhance image quality for the detector. However, the functional mismatch between restoration and detection networks can introduce instability and hinder effective integration -- an issue that remains underexplored. We revisit this limitation through the lens of Lipschitz continuity, analyzing the functional differences between restoration and detection networks in both the input space and the parameter space. Our analysis shows that restoration networks perform smooth, continuous transformations, while object detectors operate with discontinuous decision boundaries, making them highly sensitive to minor perturbations. This mismatch introduces instability in traditional cascade frameworks, where even imperceptible noise from restoration is amplified during detection, disrupting gradient flow and hindering optimization. To address this, we propose Lipschitz-regularized object detection (LROD), a simple yet effective framework that integrates image restoration directly into the detector's feature learning, harmonizing the Lipschitz continuity of both tasks during training. We implement this framework as Lipschitz-regularized YOLO (LR-YOLO), extending seamlessly to existing YOLO detectors. Extensive experiments on haze and low-light benchmarks demonstrate that LR-YOLO consistently improves detection stability, optimization smoothness, and overall accuracy.
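摘要主张在训练中对检测器的 Lipschitz 连续性做正则化。一个常见的可操作形式是对输入梯度范数的惩罚项;以下仅为该思想的通用示意,并非 LR-YOLO 的原始实现:

```python
import torch
import torch.nn as nn

def lipschitz_penalty(model, x: torch.Tensor) -> torch.Tensor:
    """惩罚 ||d f(x) / d x||,鼓励输出随输入扰动平滑变化。"""
    x = x.clone().requires_grad_(True)
    out = model(x).pow(2).sum()                     # 标量化输出以便求梯度
    (grad,) = torch.autograd.grad(out, x, create_graph=True)
    return grad.flatten(1).norm(dim=1).mean()

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(),
                      nn.Flatten(), nn.LazyLinear(10))
x = torch.randn(2, 3, 32, 32)
task_loss = model(x).pow(2).mean()                  # 占位的检测任务损失
total = task_loss + 0.1 * lipschitz_penalty(model, x)
total.backward()
```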
Benchmarking Microsaccade Recognition with Event Cameras: A Novel Dataset and Evaluation
Waseem Shariff, Timothy Hanley, Maciej Stec, Hossein Javidnia, Peter Corcoran
Personalized recommendation rationale:

This paper focuses on benchmarking and dataset creation for microsaccade recognition with event cameras, which belongs to computer vision and biomedical engineering. It has no direct connection to the core technical areas of recommendation, search, or advertising, nor does it involve LLMs, Transformer architectures, heterogeneous data modeling, or other technical directions of current interest.

2025-10-28 09:41:30 | arXiv:2510.24231v1 |
cs.CV
View full abstract
Microsaccades are small, involuntary eye movements vital for visual perception and neural processing. Traditional microsaccade studies typically use eye trackers or frame-based analysis, which, while precise, are costly and limited in scalability and temporal resolution. Event-based sensing offers a high-speed, low-latency alternative by capturing fine-grained spatiotemporal changes efficiently. This work introduces a pioneering event-based microsaccade dataset to support research on small eye movement dynamics in cognitive computing. Using Blender, we render high-fidelity eye movement scenarios and simulate microsaccades with angular displacements from 0.5 to 2.0 degrees, divided into seven distinct classes. These are converted to event streams using v2e, preserving the natural temporal dynamics of microsaccades, with durations ranging from 0.25 ms to 2.25 ms. We evaluate the dataset using Spiking-VGG11, Spiking-VGG13, and Spiking-VGG16, and propose Spiking-VGG16Flow, an optical-flow-enhanced variant implemented in SpikingJelly. The models achieve around 90 percent average accuracy, successfully classifying microsaccades by angular displacement, independent of event count or duration. These results demonstrate the potential of spiking neural networks for fine motion recognition and establish a benchmark for event-based vision research. The dataset, code, and trained models will be publicly available at https://waseemshariff126.github.io/microsaccades/ .
Beyond Inference Intervention: Identity-Decoupled Diffusion for Face Anonymization
Haoxin Yang, Yihong Lin, Jingdan Kang, Xuemiao Xu, Yue Li, Cheng Xu, Shengfeng H...
Personalized recommendation rationale:

This paper focuses on face anonymization in computer vision, using diffusion models for image processing. Although it involves generative models, its application scenario (face anonymization) has no direct connection to the core technical needs of recommendation, search, or advertising, and it does not involve heterogeneous data modeling or Transformer architecture improvements. The technique mainly serves privacy protection, which falls under the excluded off-topic categories.

2025-10-28 09:28:12 | arXiv:2510.24213v1 |
cs.CV
View full abstract
Face anonymization aims to conceal identity information while preserving non-identity attributes. Mainstream diffusion models rely on inference-time interventions such as negative guidance or energy-based optimization, which are applied post-training to suppress identity features. These interventions often introduce distribution shifts and entangle identity with non-identity attributes, degrading visual fidelity and data utility. To address this, we propose ID²Face, a training-centric anonymization framework that removes the need for inference-time optimization. The rationale of our method is to learn a structured latent space where identity and non-identity information are explicitly disentangled, enabling direct and controllable anonymization at inference. To this end, we design a conditional diffusion model with an identity-masked learning scheme. An Identity-Decoupled Latent Recomposer uses an Identity Variational Autoencoder to model identity features, while non-identity attributes are extracted from same-identity pairs and aligned through bidirectional latent alignment. An Identity-Guided Latent Harmonizer then fuses these representations via soft-gating conditioned on noisy feature prediction. The model is trained with a recomposition-based reconstruction loss to enforce disentanglement. At inference, anonymization is achieved by sampling a random identity vector from the learned identity space. To further suppress identity leakage, we introduce an Orthogonal Identity Mapping strategy that enforces orthogonality between sampled and source identity vectors. Experiments demonstrate that ID²Face outperforms existing methods in visual quality, identity suppression, and utility preservation.
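The Orthogonal Identity Mapping step admits a simple reading: project the sampled identity vector onto the complement of the source identity direction. A minimal sketch under that assumption (not the paper's code; vector dimensions are illustrative):

```python
import torch
import torch.nn.functional as F

def orthogonal_identity(sampled_id, source_id):
    """Hedged sketch: remove from a sampled identity vector its component
    along the source identity (a Gram-Schmidt step), so the two vectors
    are orthogonal. The paper's exact strategy may differ."""
    source = F.normalize(source_id, dim=-1)
    proj = (sampled_id * source).sum(dim=-1, keepdim=True) * source
    return sampled_id - proj

z = torch.randn(4, 512)   # sampled identity vectors (dimension assumed)
s = torch.randn(4, 512)   # source identity vectors
z_orth = orthogonal_identity(z, s)
# Cosine similarity with the source direction is ~0 after projection.
print((F.normalize(z_orth, dim=-1) * F.normalize(s, dim=-1)).sum(-1))
```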
CLFSeg: A Fuzzy-Logic based Solution for Boundary Clarity and Uncertainty Reduction in Medical Image Segmentation
Anshul Kaushal, Kunal Jangid, Vinod K. Kurmi
Personalized recommendation rationale:

This paper focuses on medical image segmentation, a clearly medical application with no relation to the areas of interest here (recommendation, search, advertising). The proposed fuzzy-logic method targets boundary clarity in medical images and has no direct application potential in recommendation, search, or advertising.

2025-10-28 09:06:27 | arXiv:2510.24202v1 |
cs.CV
View full abstract
Accurate polyp and cardiac segmentation for early detection and treatment is essential for the diagnosis and treatment planning of cancer-like diseases. Traditional convolutional neural network (CNN) based models have shown limited generalizability and robustness and an inability to handle uncertainty, which degrades segmentation performance. To solve these problems, this paper introduces CLFSeg, an encoder-decoder based framework built around a Fuzzy-Convolutional (FC) module that leverages convolutional layers and fuzzy logic. This module enhances segmentation performance by identifying local and global features while minimizing uncertainty, noise, and ambiguity in boundary regions, ensuring computing efficiency. To handle the class imbalance problem while focusing on tiny and boundary regions of interest, binary cross-entropy (BCE) with Dice loss is incorporated. Our proposed model exhibits exceptional performance on four publicly available datasets, including CVC-ColonDB, CVC-ClinicDB, EtisLaribPolypDB, and ACDC. Extensive experiments and visual studies show CLFSeg surpasses the existing SOTA performance and focuses on relevant regions of interest in anatomical structures. The proposed CLFSeg improves performance while ensuring computing efficiency, which makes it a potential solution for real-world medical diagnostic scenarios. Project page is available at https://visdomlab.github.io/CLFSeg/
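The BCE-with-Dice objective mentioned in the abstract is standard and easy to sketch; the 50/50 weighting and smoothing constant below are assumptions, not CLFSeg's published setting:

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, targets, smooth=1.0, bce_weight=0.5):
    """Sketch of a combined BCE + Dice objective for binary segmentation.
    logits, targets: (batch, 1, H, W), targets in {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + targets.sum(dim=(1, 2, 3))
    dice = (2 * inter + smooth) / (union + smooth)   # per-sample Dice score
    return bce_weight * bce + (1 - bce_weight) * (1 - dice).mean()
```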
Vanish into Thin Air: Cross-prompt Universal Adversarial Attacks for SAM2
Ziqi Zhou, Yifan Hu, Yufei Song, Zijing Li, Shengshan Hu, Leo Yu Zhang, Dezhong ...
Personalized recommendation rationale:

This paper studies adversarial attacks against SAM2 (Segment Anything Model 2), which belongs to computer vision security. Although SAM2 itself is a vision foundation model, the paper focuses on security vulnerabilities and attack methods, which fall under the explicitly excluded privacy/security category. The work shows no potential application to recommendation, search, or advertising.

2025-10-28 08:59:11 | arXiv:2510.24195v1 |
cs.CV
View full abstract
Recent studies reveal the vulnerability of the image segmentation foundation model SAM to adversarial examples. Its successor, SAM2, has attracted significant attention due to its strong generalization capability in video segmentation. However, its robustness remains unexplored, and it is unclear whether existing attacks on SAM can be directly transferred to SAM2. In this paper, we first analyze the performance gap of existing attacks between SAM and SAM2 and highlight two key challenges arising from their architectural differences: directional guidance from the prompt and semantic entanglement across consecutive frames. To address these issues, we propose UAP-SAM2, the first cross-prompt universal adversarial attack against SAM2 driven by dual semantic deviation. For cross-prompt transferability, we begin by designing a target-scanning strategy that divides each frame into k regions, each randomly assigned a prompt, to reduce prompt dependency during optimization. For effectiveness, we design a dual semantic deviation framework that optimizes a UAP by distorting the semantics within the current frame and disrupting the semantic consistency across consecutive frames. Extensive experiments on six datasets across two segmentation tasks demonstrate the effectiveness of the proposed method for SAM2. The comparative results show that UAP-SAM2 significantly outperforms state-of-the-art (SOTA) attacks by a large margin.
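As a hedged sketch of the target-scanning idea, one could split each frame into k regions and draw one random point prompt per region, so the optimized perturbation is not tied to any single prompt; the horizontal-strip layout below is an assumption made purely for illustration:

```python
import torch

def target_scan_prompts(height, width, k):
    """Hedged sketch: divide a frame into k regions (horizontal strips here,
    an assumption) and sample one random (x, y) point prompt per region."""
    bounds = torch.linspace(0, height, k + 1).long()
    prompts = []
    for i in range(k):
        y0 = bounds[i].item()
        y1 = max(bounds[i + 1].item(), y0 + 1)   # ensure a non-empty range
        y = torch.randint(y0, y1, (1,)).item()
        x = torch.randint(0, width, (1,)).item()
        prompts.append((x, y))
    return prompts

print(target_scan_prompts(480, 640, k=4))   # four randomized point prompts
```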
MSRANetV2: An Explainable Deep Learning Architecture for Multi-class Classification of Colorectal Histopathological Images
Ovi Sarkar, Md Shafiuzzaman, Md. Faysal Ahamed, Golam Mahmud, Muhammad E. H. Cho...
Personalized recommendation rationale:

This paper focuses on medical image analysis, specifically the classification of colorectal histopathological images, a clearly off-topic area (medical/biological applications). Its content concerns computer vision for medical diagnosis and has no direct or indirect connection to recommendation, search, or advertising.

2025-10-28 07:22:34 | arXiv:2510.24136v1 |
cs.CV
View full abstract
Colorectal cancer (CRC) is a leading worldwide cause of cancer-related mortality, and prompt, precise detection plays a paramount role in improving patient outcomes. Conventional diagnostic methods such as colonoscopy and histological examination routinely exhibit subjectivity, are extremely time-consuming, and are susceptible to variation. Through the development of digital pathology, deep learning algorithms have become a powerful approach to enhancing diagnostic precision and efficiency. In our work, we proposed a convolutional neural network architecture named MSRANetV2, specially optimized for the classification of colorectal tissue images. The model employs a ResNet50V2 backbone, extended with residual attention mechanisms and squeeze-and-excitation (SE) blocks, to extract deep semantic and fine-grained spatial features. With channel alignment and upsampling operations, MSRANetV2 effectively fuses multi-scale representations, thereby enhancing the robustness of the classification. We evaluated our model with a five-fold stratified cross-validation strategy on two publicly available datasets: CRC-VAL-HE-7K and NCT-CRC-HE-100K. On the 7K dataset, the proposed model achieved average precision, recall, F1-score, AUC, and test accuracy of 0.9884 ± 0.0151, 0.9900 ± 0.0151, 0.9900 ± 0.0145, 0.9999 ± 0.00006, and 0.9905 ± 0.0025, respectively. On the 100K dataset, the corresponding values were 0.9904 ± 0.0091, 0.9900 ± 0.0071, 0.9900 ± 0.0071, 0.9997 ± 0.00016, and 0.9902 ± 0.0006. Additionally, Grad-CAM visualizations were incorporated to enhance model interpretability by highlighting medically relevant tissue areas. These findings validate that MSRANetV2 is a reliable, interpretable, and high-performing architecture for classifying CRC tissues.
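The squeeze-and-excitation (SE) blocks the abstract attaches to the ResNet50V2 backbone follow a well-known pattern; a minimal PyTorch version (the reduction ratio is an assumption) looks like this:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation block: pool global context per
    channel, then learn per-channel gates to reweight the feature map."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global context
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # per-channel gates in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # excitation: reweight channels
```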
Compositional Image Synthesis with Inference-Time Scaling
Minsuk Ji, Sanghyeok Lee, Namhyuk Ahn
Personalized recommendation rationale:

This paper focuses on image synthesis in computer vision, a purely visual generation task. Although it mentions inference-time scaling as an efficiency technique, its core application is image generation rather than recommendation, search, or advertising, and there is no obvious potential transfer to heterogeneous data processing, multimodal modeling, or other recommendation/search-adjacent scenarios.

2025-10-28 07:16:21 | arXiv:2510.24133v1 |
cs.CVcs.AI
View full abstract
Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts, and we inject these layouts into the image generation process, where an object-centric vision-language model (VLM) judge reranks multiple candidates to iteratively select the most prompt-aligned outcome. By unifying explicit layout-grounding with self-refine-based inference-time scaling, our framework achieves stronger scene alignment with prompts compared to recent text-to-image models. The code is available at https://github.com/gcl-inha/ReFocus.
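The generate-then-judge loop can be sketched abstractly; all three callables below (llm_layout, gen_image, vlm_score) and the loop structure itself are hypothetical stand-ins rather than the ReFocus API:

```python
def refocus_style_loop(prompt, llm_layout, gen_image, vlm_score,
                       n_candidates=4, n_rounds=3):
    """Hedged sketch of layout-grounded generation with iterative VLM
    judging: an LLM proposes an explicit layout, several layout-conditioned
    candidates are generated, and a VLM judge keeps the best each round."""
    layout = llm_layout(prompt)                  # explicit object layout
    best_img, best_score = None, float("-inf")
    for _ in range(n_rounds):
        candidates = [gen_image(prompt, layout) for _ in range(n_candidates)]
        scored = [(vlm_score(prompt, c), c) for c in candidates]
        score, img = max(scored, key=lambda t: t[0])
        if score > best_score:                   # knock out weaker drafts
            best_img, best_score = img, score
    return best_img
```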
DogMo: A Large-Scale Multi-View RGB-D Dataset for 4D Canine Motion Recovery
Zan Wang, Siyu Chen, Luya Mo, Xinfeng Gao, Yuxin Shen, Lebin Ding, Wei Liang
Personalized recommendation rationale:

This paper focuses on 4D motion recovery and dataset construction in computer vision, specifically for canine motion analysis. It is entirely unrelated to the core technical areas of recommendation, search, or advertising, and does not involve LLMs, Transformer architectures, or heterogeneous data modeling. It is a purely visual application with no apparent recommendation/search/advertising potential.

2025-10-28 06:41:49 | arXiv:2510.24117v1 |
cs.CV
View full abstract
We present DogMo, a large-scale multi-view RGB-D video dataset capturing diverse canine movements for the task of motion recovery from images. DogMo comprises 1.2k motion sequences collected from 10 unique dogs, offering rich variation in both motion and breed. It addresses key limitations of existing dog motion datasets, including the lack of multi-view and real 3D data, as well as limited scale and diversity. Leveraging DogMo, we establish four motion recovery benchmark settings that support systematic evaluation across monocular and multi-view, RGB and RGB-D inputs. To facilitate accurate motion recovery, we further introduce a three-stage, instance-specific optimization pipeline that fits the SMAL model to the motion sequences. Our method progressively refines body shape and pose through coarse alignment, dense correspondence supervision, and temporal regularization. Our dataset and method provide a principled foundation for advancing research in dog motion recovery and open up new directions at the intersection of computer vision, computer graphics, and animal behavior modeling.
ZTRS: Zero-Imitation End-to-end Autonomous Driving with Trajectory Scoring
Zhenxin Li, Wenhao Yao, Zi Wang, Xinglong Sun, Jingde Chen, Nadine Chang, Maying...
Personalized recommendation rationale:

This paper focuses on autonomous driving, covering trajectory scoring and end-to-end learning. Although trajectory scoring is conceptually similar to ranking in recommender systems, the paper is squarely aimed at autonomous driving, an application entirely different from recommendation, search, or advertising, and mentions no techniques related to LLMs, Transformers, or heterogeneous data processing.

2025-10-28 06:26:36 | arXiv:2510.24108v1 |
cs.ROcs.CV
View full abstract
End-to-end autonomous driving maps raw sensor inputs directly into ego-vehicle trajectories to avoid cascading errors from perception modules and to leverage rich semantic cues. Existing frameworks largely rely on Imitation Learning (IL), which can be limited by sub-optimal expert demonstrations and covariate shift during deployment. On the other hand, Reinforcement Learning (RL) has recently shown potential in scaling up with simulations, but is typically confined to low-dimensional symbolic inputs (e.g. 3D objects and maps), falling short of full end-to-end learning from raw sensor data. We introduce ZTRS (Zero-Imitation End-to-End Autonomous Driving with Trajectory Scoring), a framework that combines the strengths of both worlds: sensor inputs without losing information and RL training for robust planning. To the best of our knowledge, ZTRS is the first framework that eliminates IL entirely by only learning from rewards while operating directly on high-dimensional sensor data. ZTRS utilizes offline reinforcement learning with our proposed Exhaustive Policy Optimization (EPO), a variant of policy gradient tailored for enumerable actions and rewards. ZTRS demonstrates strong performance across three benchmarks: Navtest (generic real-world open-loop planning), Navhard (open-loop planning in challenging real-world and synthetic scenarios), and HUGSIM (simulated closed-loop driving). Specifically, ZTRS achieves the state-of-the-art result on Navhard and outperforms IL-based baselines on HUGSIM. Code will be available at https://github.com/woxihuanjiangguo/ZTRS.
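One plausible reading of Exhaustive Policy Optimization is that, with an enumerable trajectory set and a reward available for every candidate, the expected reward under a softmax policy can be computed exactly and maximized directly instead of being estimated by sampling. The sketch below reflects that reading, not the paper's published algorithm:

```python
import torch

def epo_style_loss(scores, rewards):
    """Hedged sketch of a policy-gradient objective over an enumerable
    action set: the expected reward under the softmax policy is exact,
    so we minimize its negation directly.
    scores:  (batch, n_trajectories) logits from the planner
    rewards: (batch, n_trajectories) simulator/rule-based rewards"""
    probs = torch.softmax(scores, dim=-1)
    expected_reward = (probs * rewards).sum(dim=-1)
    return -expected_reward.mean()    # gradient ascent on expected reward
```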
OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation
Agus Gunawan, Samuel Teodoro, Yun Chen, Soo Ye Kim, Jihyong Oh, Munchurl Kim
Personalized recommendation rationale:

This paper focuses on text-to-image generation and manipulation, a purely AIGC/content-generation area unrelated to ranking and retrieval in search, recommendation, or advertising. Although the title mentions "controllable manipulation", this concerns image editing rather than recommendation- or search-related applications.

2025-10-28 06:06:52 | arXiv:2510.24093v1 |
cs.CV
View full abstract
Recent advancements in diffusion-based text synthesis have demonstrated significant performance in inserting and editing text within images via inpainting. However, despite the potential of text inpainting methods, three key limitations hinder their applicability to broader Text Image Manipulation (TIM) tasks: (i) the inability to remove text, (ii) the lack of control over the style of rendered text, and (iii) a tendency to generate duplicated letters. To address these challenges, we propose OmniText, a training-free generalist capable of performing a wide range of TIM tasks. Specifically, we investigate two key properties of cross- and self-attention mechanisms to enable text removal and to provide control over both text styles and content. Our findings reveal that text removal can be achieved by applying self-attention inversion, which mitigates the model's tendency to focus on surrounding text, thus reducing text hallucinations. Additionally, we redistribute cross-attention, as increasing the probability of certain text tokens reduces text hallucination. For controllable inpainting, we introduce novel loss functions in a latent optimization framework: a cross-attention content loss to improve text rendering accuracy and a self-attention style loss to facilitate style customization. Furthermore, we present OmniText-Bench, a benchmark dataset for evaluating diverse TIM tasks. It includes input images, target text with masks, and style references, covering diverse applications such as text removal, rescaling, repositioning, and insertion and editing with various styles. Our OmniText framework is the first generalist method capable of performing diverse TIM tasks. It achieves state-of-the-art performance across multiple tasks and metrics compared to other text inpainting methods and is comparable with specialist methods.
ResNet: Enabling Deep Convolutional Neural Networks through Residual Learning
Xingyu Liu, Kun Ming Goh
Personalized recommendation rationale:

Although this paper presents ResNet, a landmark deep learning architecture, it focuses on convolutional neural networks for computer vision and has no direct connection to the core technical stack of recommendation, search, or advertising. Residual learning mainly addresses vanishing gradients in deep network training and has limited application in today's Transformer-dominated recommendation and search architectures.

2025-10-28 03:36:15 | arXiv:2510.24036v1 |
cs.CVcs.AI
View full abstract
Convolutional Neural Networks (CNNs) have revolutionized computer vision, but training very deep networks has been challenging due to the vanishing gradient problem. This paper explores Residual Networks (ResNet), introduced by He et al. (2015), which overcome this limitation by using skip connections. ResNet enables the training of networks with hundreds of layers by allowing gradients to flow directly through shortcut connections that bypass intermediate layers. In our implementation on the CIFAR-10 dataset, ResNet-18 achieves 89.9% accuracy compared to 84.1% for a traditional deep CNN of similar depth, while also converging faster and training more stably.
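The core idea is compact enough to sketch directly: a minimal CIFAR-style residual block in PyTorch, in which the shortcut path lets gradients bypass the convolutional stack (layer sizes are illustrative, not this paper's exact configuration):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two-layer residual block in the style of He et al. (2015):
    output = ReLU(F(x) + shortcut(x)), so gradients can flow through
    the identity path even when F is poorly conditioned."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Projection shortcut when the shape changes, identity otherwise.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))   # the skip connection
```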
AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts
Yufan Liu, Wanqian Zhang, Huashan Chen, Lin Wang, Xiaojun Jia, Zheng Lin, Weipin...
Personalized recommendation rationale:

This paper focuses on adversarial testing and red-teaming of text-to-image models, which belongs to AIGC and content generation. Although LLM techniques are involved, the core application is evaluating image generation models, with no direct connection to ranking or modeling tasks in recommendation, search, or advertising, and no obvious potential transfer to RecSys/Search/Ads.

2025-10-28 03:32:14 | arXiv:2510.24034v1 |
cs.CV
View full abstract
Despite rapid advancements in text-to-image (T2I) models, their safety mechanisms are vulnerable to adversarial prompts, which maliciously generate unsafe images. Current red-teaming methods for proactively assessing such vulnerabilities usually require white-box access to T2I models, rely on inefficient per-prompt optimization, and inevitably generate semantically meaningless prompts that are easily blocked by filters. In this paper, we propose APT (AutoPrompT), a black-box framework that leverages large language models (LLMs) to automatically generate human-readable adversarial suffixes for benign prompts. We first introduce a pipeline that alternates between adversarial suffix optimization and fine-tuning the LLM with the optimized suffix. Furthermore, we integrate a dual-evasion strategy in the optimization phase, enabling the bypass of both perplexity-based filters and blacklist word filters: (1) we constrain the LLM to generate human-readable prompts through auxiliary LLM perplexity scoring, which starkly contrasts with prior token-level gibberish, and (2) we introduce banned-token penalties to suppress the explicit generation of blacklisted tokens. Extensive experiments demonstrate the excellent red-teaming performance of our human-readable, filter-resistant adversarial prompts, as well as superior zero-shot transferability, which enables instant adaptation to unseen prompts and exposes critical vulnerabilities even in commercial APIs (e.g., Leonardo.Ai).
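The banned-token penalty admits a simple form: add to the loss the probability mass the LLM places on blacklisted tokens. A hedged sketch (APT's exact penalty may differ):

```python
import torch

def banned_token_penalty(logits, banned_ids, weight=1.0):
    """Hedged sketch: penalize the total probability mass assigned to
    blacklisted tokens, discouraging their explicit generation.
    logits: (batch, seq_len, vocab); banned_ids: 1-D LongTensor of ids."""
    probs = torch.softmax(logits, dim=-1)
    banned_mass = probs[..., banned_ids].sum(dim=-1)   # (batch, seq_len)
    return weight * banned_mass.mean()
```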
Listening without Looking: Modality Bias in Audio-Visual Captioning
Yuchi Ishikawa, Toranosuke Manabe, Tatsuya Komatsu, Yoshimitsu Aoki
Personalized recommendation rationale:

This paper studies modality bias in audio-visual captioning, a multimodal learning problem. Although it touches on multimodal modeling concepts, it focuses on audio-visual content generation and has no direct connection to the core techniques of recommendation, search, or advertising, nor to advances in LLMs or Transformer architectures.

2025-10-28 03:06:28 | arXiv:2510.24024v1 |
eess.AScs.CVeess.IV
View full abstract
Audio-visual captioning aims to generate holistic scene descriptions by jointly modeling sound and vision. While recent methods have improved performance through sophisticated modality fusion, it remains unclear to what extent the two modalities are complementary in current audio-visual captioning models and how robust these models are when one modality is degraded. We address these questions by conducting systematic modality robustness tests on LAVCap, a state-of-the-art audio-visual captioning model, in which we selectively suppress or corrupt the audio or visual streams to quantify sensitivity and complementarity. The analysis reveals a pronounced bias toward the audio stream in LAVCap. To evaluate how balanced audio-visual captioning models are in their use of both modalities, we augment AudioCaps with textual annotations that jointly describe the audio and visual streams, yielding the AudioVisualCaps dataset. In our experiments, we report LAVCap baseline results on AudioVisualCaps. We also evaluate the model under modality robustness tests on AudioVisualCaps and the results indicate that LAVCap trained on AudioVisualCaps exhibits less modality bias than when trained on AudioCaps.
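The modality robustness test the abstract describes is straightforward to express: suppress one stream at a time and compare caption quality. In the hedged sketch below, model and caption_metric are hypothetical stand-ins for an audio-visual captioner and a scoring function:

```python
import torch

def modality_robustness_test(model, audio, video, caption_metric, refs):
    """Hedged sketch of a modality-ablation protocol: zero out one input
    stream at a time and score the resulting captions against references,
    to quantify how much each modality actually contributes."""
    results = {}
    for name, a, v in [
        ("both", audio, video),
        ("audio_only", audio, torch.zeros_like(video)),
        ("video_only", torch.zeros_like(audio), video),
    ]:
        captions = model(a, v)
        results[name] = caption_metric(captions, refs)
    return results   # a small both -> audio_only drop signals audio bias
```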
Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks
Mirali Purohit, Bimal Gajera, Vatsal Malaviya, Irish Mehta, Kunal Kasodekar, Jac...
Personalized recommendation rationale:

This paper focuses on benchmark evaluation for Mars science tasks, a domain-specific application (planetary science) entirely unrelated to recommendation, search, or advertising. Its evaluation of foundation models for Mars science is a clearly domain-specific exercise outside the technical scope of current interest.

2025-10-28 02:34:08 | arXiv:2510.24010v1 |
cs.CVcs.AIcs.LG
View full abstract
Foundation models have enabled rapid progress across many specialized domains by leveraging large-scale pre-training on unlabeled data, demonstrating strong generalization to a variety of downstream tasks. While such models have gained significant attention in fields like Earth Observation, their application to Mars science remains limited. A key enabler of progress in other domains has been the availability of standardized benchmarks that support systematic evaluation. In contrast, Mars science lacks such benchmarks and standardized evaluation frameworks, which have limited progress toward developing foundation models for Martian tasks. To address this gap, we introduce Mars-Bench, the first benchmark designed to systematically evaluate models across a broad range of Mars-related tasks using both orbital and surface imagery. Mars-Bench comprises 20 datasets spanning classification, segmentation, and object detection, focused on key geologic features such as craters, cones, boulders, and frost. We provide standardized, ready-to-use datasets and baseline evaluations using models pre-trained on natural images, Earth satellite data, and state-of-the-art vision-language models. Results from all analyses suggest that Mars-specific foundation models may offer advantages over general-domain counterparts, motivating further exploration of domain-adapted pre-training. Mars-Bench aims to establish a standardized foundation for developing and comparing machine learning models for Mars science. Our data, models, and code are available at: https://mars-bench.github.io/.
Towards the Automatic Segmentation, Modeling and Meshing of the Aortic Vessel Tree from Multicenter Acquisitions: An Overview of the SEG.A. 2023 Segmentation of the Aorta Challenge
Yuan Jin, Antonio Pepe, Gian Marco Melito, Yuxuan Chen, Yunsu Byeon, Hyeseong Ki...
Personalized recommendation rationale:

This paper focuses on an aortic segmentation challenge in medical imaging, a domain-specific medical application. Although it involves segmentation techniques, it is purely medical image processing with no connection to recommendation, search, or advertising; its content falls entirely within the excluded medical/biological category.

2025-10-28 02:33:45 | arXiv:2510.24009v1 |
cs.CV
View full abstract
The automated analysis of the aortic vessel tree (AVT) from computed tomography angiography (CTA) holds immense clinical potential, but its development has been impeded by a lack of shared, high-quality data. We launched the SEG.A. challenge to catalyze progress in this field by introducing a large, publicly available, multi-institutional dataset for AVT segmentation. The challenge benchmarked automated algorithms on a hidden test set, with subsequent optional tasks in surface meshing for computational simulations. Our findings reveal a clear convergence on deep learning methodologies, with 3D U-Net architectures dominating the top submissions. A key result was that an ensemble of the highest-ranking algorithms significantly outperformed individual models, highlighting the benefits of model fusion. Performance was strongly linked to algorithmic design, particularly the use of customized post-processing steps, and the characteristics of the training data. This initiative not only establishes a new performance benchmark but also provides a lasting resource to drive future innovation toward robust, clinically translatable tools.
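The ensembling result suggests a simple fusion rule; averaging per-model foreground probabilities and thresholding is the most common choice, sketched below under that assumption (the challenge ensemble may have used a different rule):

```python
import torch

def ensemble_segmentation(prob_maps, threshold=0.5):
    """Hedged sketch: fuse several models' predictions by averaging their
    foreground probability maps, then thresholding into a binary mask.
    prob_maps: list of (D, H, W) foreground probabilities in [0, 1]."""
    mean_prob = torch.stack(prob_maps, dim=0).mean(dim=0)
    return (mean_prob > threshold).float()   # fused binary mask
```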
AdvBlur: Adversarial Blur for Robust Diabetic Retinopathy Classification and Cross-Domain Generalization
Heethanjan Kanagalingam, Thenukan Pathmanathan, Mokeeshan Vathanakumar, Tharmaku...
Personalized recommendation rationale:

This paper focuses on diabetic retinopathy classification in medical image analysis, a clearly medical application. Although it involves adversarial training and cross-domain generalization, these techniques, as applied in this medical diagnosis scenario, have no direct connection to recommendation, search, or advertising. The core problem is a domain-specific medical classification task, squarely within the off-topic categories.

2025-10-28 02:10:54 | arXiv:2510.24000v1 |
cs.CV
View full abstract
Diabetic retinopathy (DR) is a leading cause of vision loss worldwide, yet early and accurate detection can significantly improve treatment outcomes. While numerous deep learning (DL) models have been developed to predict DR from fundus images, many face challenges in maintaining robustness under distributional variations caused by differences in acquisition devices, demographic disparities, and imaging conditions. This paper addresses this critical limitation by proposing AdvBlur, a novel DR classification approach. Our method integrates adversarially blurred images into the dataset and employs a dual-loss framework to address domain generalization. This approach effectively mitigates the impact of unseen distributional variations, as evidenced by comprehensive evaluations across multiple datasets. Additionally, we conduct extensive experiments to explore the effects of factors such as camera type, low-quality images, and dataset size. Furthermore, we perform ablation studies on the blurred images and the loss function to validate our choices. The experimental results demonstrate the effectiveness of our proposed method, achieving competitive performance compared to state-of-the-art domain generalization DR models on unseen external datasets.
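One natural reading of the dual-loss framework is a supervised term on both views plus a consistency term between them; the sketch below follows that assumption rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def dual_loss(model, images, blurred_images, labels, consistency_weight=0.5):
    """Hedged sketch of a dual-loss setup: classify both the clean and the
    adversarially blurred view, plus a KL consistency term so predictions
    agree across views. Weights and exact terms are assumptions."""
    logits_clean = model(images)
    logits_blur = model(blurred_images)
    ce = F.cross_entropy(logits_clean, labels) + F.cross_entropy(logits_blur, labels)
    consistency = F.kl_div(
        F.log_softmax(logits_blur, dim=-1),
        F.softmax(logits_clean, dim=-1),
        reduction="batchmean",
    )
    return ce + consistency_weight * consistency
```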
Synergistic Neural Forecasting of Air Pollution with Stochastic Sampling
Yohan Abeysinghe, Muhammad Akhtar Munir, Sanoojan Baliah, Ron Sarafian, Fahad Sh...
Personalized recommendation rationale:

This paper focuses on air pollution forecasting in environmental science, a domain-specific application. Although it uses neural forecasting methods, its core content is unrelated to ranking problems in recommendation, search, or advertising, and shows no potential application of LLMs, Transformer architectures, or multimodal modeling to commercial systems.

2025-10-28 01:18:00 | arXiv:2510.23977v1 |
cs.LGcs.CV
View full abstract
Air pollution remains a leading global health and environmental risk, particularly in regions vulnerable to episodic air pollution spikes from wildfires, urban haze, and dust storms. Accurate forecasting of particulate matter (PM) concentrations is essential to enable timely public health warnings and interventions, yet existing models often underestimate rare but hazardous pollution events. Here, we present SynCast, a high-resolution neural forecasting model that integrates meteorological and air composition data to improve predictions of both average and extreme pollution levels. Built on a regionally adapted transformer backbone and enhanced with a diffusion-based stochastic refinement module, SynCast captures the nonlinear dynamics driving PM spikes more accurately than existing approaches. Leveraging harmonized ERA5 and CAMS datasets, our model shows substantial gains in forecasting fidelity across multiple PM variables (PM1, PM2.5, PM10), especially under extreme conditions. We demonstrate that conventional loss functions underrepresent distributional tails (rare pollution events) and show that SynCast, guided by domain-aware objectives and extreme value theory, significantly enhances performance in highly impacted regions without compromising global accuracy. This approach provides a scalable foundation for next-generation air quality early warning systems and supports climate-health risk mitigation in vulnerable regions.
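The abstract's point about underrepresented distributional tails can be illustrated with a crude tail-weighted loss; SynCast's EVT-guided objective is more principled, so treat this only as a sketch of the motivation (quantile and weight are assumptions):

```python
import torch

def tail_weighted_mse(pred, target, tail_quantile=0.95, tail_weight=5.0):
    """Hedged sketch: upweight errors on samples whose true pollution level
    exceeds a high quantile, so rare spikes are not averaged away."""
    threshold = torch.quantile(target, tail_quantile)
    weights = torch.where(target > threshold,
                          torch.full_like(target, tail_weight),
                          torch.ones_like(target))
    return (weights * (pred - target) ** 2).mean()
```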