arXiv Daily Paper Picks

Showing 223 papers (of 223 total)

CoPersona: Collaborative Persona Graphs for Robust LLM Personalization

9/10

Yangtian Zhang, Leyao Wang, Hiren Madhu, Ngoc Bui, Walter Roznyatovskiy, Rex Yin...

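Key summary:

Addresses sparse user histories in LLM personalization. The proposed CoPersona framework builds a collaborative graph from multi-facet representations of behaviorally similar users and completes user profiles by combining non-parametric retrieval with parametric reasoning.

Personalized recommendation rationale:

The paper directly tackles a core challenge in LLM personalization, sparse and biased user histories, by completing user profiles through a collaborative graph. It is highly relevant to LLM applications in recommendation and search.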

2026-07-01 21:27:52 | arXiv:2607.01485v1 |

cs.IR

Full abstract:

Real-world LLM personalization is often constrained by sparse and skewed user histories: most users provide only a handful of interactions, while even frequent users' logs capture an incomplete and biased view of their preferences. As a result, weakly observed user attributes are difficult to infer, leading to brittle personalization when test-time requests shift toward under-supported facets. Motivated by this limitation, we present CoPersona, a graph-based collaborative personalization framework that completes sparse user profiles by borrowing signals from behaviorally similar peers. However, directly transferring signals is difficult because uneven facet coverage introduces bias into interaction histories, obscuring user similarity in the unstructured global space. To address this issue, CoPersona decomposes interaction histories into multiple facet-level representations and explicitly models peer-to-peer, facet-level alignment through a multiplex persona graph. To effectively leverage peer information at inference time, we employ a dual-branch architecture that combines non-parametric peer retrieval with parametric graph reasoning. Experiments across multiple domains and model scales demonstrate consistent improvements over strong baselines, validating CoPersona as an effective approach for robust LLM personalization.

PARTREP: Learning What to Repeat for Decoder-only LLMs

9/10

Andikawati P Widjaja, Yongjun Kim, Hyounghun Kim, Jaeho Lee

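Key summary:

Addresses the information asymmetry that causal attention induces in decoder-only LLMs. PartRep uses a lightweight gate network to predict tokens with high negative log-likelihood (NLL) and repeats only these informative tokens during prefill, retaining most of the gains of full repetition without a significant increase in KV cache or attention cost.

Personalized recommendation rationale:

The selective prompt repetition method directly targets LLM efficiency in long-context settings. The core idea is instructive and highly relevant to efficiency optimization in downstream applications such as recommender systems.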

2026-07-02 07:07:28 | arXiv:2607.01792v1 |

cs.CL, cs.LG

Full abstract:

While decoder-only LLMs excel at a vast array of natural language tasks, they suffer from an asymmetric information flow induced by causal attention: later tokens are richer in contextual grounding than earlier ones. A simple and effective remedy is prompt repetition -- just appending a second copy of the prompt before generation can redistribute grounding across positions and improve reasoning performance. However, full repetition of the original prompt doubles the KV cache footprint and quadruples attention cost during prefill, making it impractical for long-context settings. We propose PartRep, a selective augmentation method that appends only the most informative tokens rather than the entire prompt. We use token-wise negative log-likelihood (NLL) as a selection signal, motivated by the hypothesis that less predictable tokens are less recoverable from surrounding context and therefore benefit more from late-position repetition. To avoid the heavy cost of a full forward pass for scoring, we train a lightweight gate that predicts high-NLL tokens from early-layer hidden states, enabling token selection during mid-prefill via early exit. Across eight benchmarks (including MMLU, GSM8K, and RULER) and three model families (Qwen2.5, Llama3.2, Gemma4), PartRep retains most of the gains of full repetition while using only 59.4% of its KV cache and 79.0% of its prefill FLOPs.
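The NLL-based selection signal described in this abstract can be illustrated with a minimal sketch. It assumes per-token probabilities are already available, whereas the paper predicts high-NLL tokens with a lightweight gate on early-layer hidden states. The names `select_tokens_to_repeat` and `keep_ratio` and the probability values are hypothetical, chosen only for illustration.

```python
import math

def token_nll(probs):
    """Negative log-likelihood of each token given its predicted probability."""
    return [-math.log(p) for p in probs]

def select_tokens_to_repeat(tokens, probs, keep_ratio=0.5):
    """Return the highest-NLL tokens in their original order (hypothetical helper)."""
    nll = token_nll(probs)
    k = max(1, round(len(tokens) * keep_ratio))
    # Indices of the k least predictable tokens.
    top = sorted(range(len(tokens)), key=lambda i: nll[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]

prompt = ["Alice", "has", "17", "apples", "and", "gives", "Bob", "5"]
# Illustrative probabilities only: names and numbers are made less predictable.
probs = [0.05, 0.60, 0.02, 0.30, 0.90, 0.40, 0.04, 0.03]
repeated = select_tokens_to_repeat(prompt, probs, keep_ratio=0.5)
# Partial repetition: append only the selected tokens after the prompt.
augmented_prompt = prompt + repeated
```

Under a simple model where KV cache grows linearly and attention cost quadratically with sequence length, appending a fraction r of the prompt scales them by (1 + r) and (1 + r)^2, which gives the 2x and 4x figures quoted for full repetition (r = 1).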

Bi-NAS: Towards Effective and Personalized Explanation for Recommender Systems via Bi-Level Neural Architecture Search

8/10

Longfeng Wu, Yao Zhou, Tong Zeng, Zhimin Peng, Bhanu Pratap Singh Rawat, Lecheng...

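Key summary:

Targets the effectiveness and personalization of explanation generation in recommender systems. The proposed Bi-NAS framework uses bi-level neural architecture search to optimize cross-attention mechanisms and feature interactions, and uses zero-shot LLM prompting to generate personalized explanations.

Personalized recommendation rationale:

The paper directly addresses explanation generation for recommender systems, an important problem, and combines NAS with LLMs. It fits closely with LLM applications in recommendation and with the handling of heterogeneous data.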

2026-07-01 18:47:42 | arXiv:2607.01387v1 |

cs.IR, cs.LG

Full abstract:

Recommender systems are vital in helping users navigate vast amounts of information, offering personalized suggestions and effective explanations for these recommendations. While previous efforts have attempted to provide such explanations, evaluating their effectiveness across various scenarios remains a challenge. Enhancing these explanations is essential for improving user engagement, trust, and decision-making. To facilitate effective explanations within the recommender system, we propose a Bi-level Neural Architecture Search (Bi-NAS) framework to optimize explanations. This approach simultaneously refines cross-attention mechanisms and feature interaction functions by exploring both intra-layer and inter-layer design spaces. Furthermore, we integrate Large Language Models (LLMs) to enhance explanation generation, leveraging zero-shot prompting to produce more effective and personalized justifications. By aligning user feature preferences with item quality scores, our approach ensures that explanations reflect both user intent and item attributes, improving transparency and reasoning depth. Extensive evaluations on four real-world datasets demonstrate that Bi-NAS not only boosts recommendation accuracy but also significantly improves the effectiveness of explanations for recommender systems, providing users with clear and reliable insights into the suggestions they receive.

Multimodal Knowledge Edit-Scoped Generalization for Online Recursive MLLM Editing

8/10

Siyuan Li, Youyuan Zhang, Ruitong Liu, Junxi Wang, Jing Li

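Key summary:

Studies control of the semantic boundary in online knowledge editing for multimodal LLMs. The proposed ScopeEdit separates edit scope in orthogonal low-rank spaces through two branches, modality-local absorption and evidence-gated shared generalization, enabling controlled cross-modal propagation while protecting unrelated behaviors.

Personalized recommendation rationale:

The paper focuses on controlling the generalization boundary of online MLLM editing and decomposes each update into cross-modal propagation and local absorption. It is a foundational capability improvement for applying LLM technology directly in recommendation, search, and advertising.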

2026-07-02 10:10:19 | arXiv:2607.01978v1 |

cs.AI, cs.CL, cs.CV

Full abstract:

Online multimodal knowledge editing requires injecting a continual stream of visual-textual corrections into multimodal large language models (MLLMs) with bounded overhead and minimal disruption to unrelated behaviors. Existing editors mainly emphasize edit reliability and long-horizon stability, but rarely control the semantic boundary of each edit. Our pilot analyses of post-edit behaviors and internal neuronal activities reveal a scope gap behind reliable edits: instance-level success neither guarantees transfer to valid cross-modal variants nor prevents leakage to unrelated inputs, while edit-related cross-modal responses concentrate in deeper semantic layers. Therefore, we formulate Edit-Scoped Generalization, reframing online MLLM editing from merely correcting an instance to controlling the propagation boundary of each edit. To this end, we propose ScopeEdit, a scope-aware online editor that decomposes each update into a modality-local absorption branch and an evidence-gated shared generalization branch. The local branch supports stable edit absorption, whereas the shared branch enables cross-modal propagation only when visual and textual evidence are sufficiently aligned. Both branches perform scope-separated write geometries in orthogonal low-rank spaces and maintain branch-wise preconditioners via Sherman-Morrison recursions, yielding constant per-edit overhead. Extensive experiments across diverse benchmarks, long-horizon edit streams, MLLM backbones, real-world VLKEB scenarios, and complex vision-language architectures show that ScopeEdit consistently improves the trade-off between in-scope cross-modal transfer and out-of-scope locality, while preserving edit reliability, stability and online efficiency. Our code is available at https://github.com/lab-klc/ScopeEdit.

Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale

8/10

Siddharth Gollapudi, Nilesh Gupta, Prasann Singhal, Sewon Min

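Key summary:

Studies the performance collapse of LLM in-context retrieval over million-token corpora, identifies attention dilution as the key cause, and proposes length-aware attention adjustments and document-level sparse attention to address it.

Personalized recommendation rationale:

The paper studies a core challenge of in-context retrieval with LLMs (attention dilution) and proposes fixes. It bears directly on LLM applications in retrieval and on optimizing Transformer attention, and is highly relevant to both the core domain and enabling technologies.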

2026-07-01 23:38:25 | arXiv:2607.01538v1 |

cs.CL

Full abstract:

Language models (LMs) raise an intriguing alternative to vector-based retrieval: conditioning on an in-context corpus and directly generating a relevant answer. However, prior work has largely focused on proprietary systems or the smaller-scale reranking task, leaving corpus-scale in-context retrieval largely unexplored. In this work, we present the first systematic study of in-context retrieval on two scales practical retrievers demand: million-token corpora and length-generalization far beyond training-time sizes. We first introduce BlockSearch, a 0.6B LM retriever whose architectural and training modifications improve over prior LM baselines and length-generalize up to 10 times beyond its training regime. Nevertheless, retrieval still collapses under more extreme extrapolation. We trace this failure to an attention dilution effect: as the corpus grows, irrelevant documents dominate the softmax denominator, reducing the normalized mass on the gold document even when its pre-softmax score stays high. Motivated by this analysis, we introduce length-aware adjustments to the attention softmax and document-level sparse attention. With these modifications, at the million-token scale, our model matches dense retrieval on widely studied benchmarks (e.g., MS MARCO and NQ), while outperforming the concurrent model MSA despite being 7 times smaller. Furthermore, it significantly outperforms dense retrieval on tasks requiring entirely different notions of similarity, such as LIMIT, achieving a 3 times higher score. Together, our results position in-context retrieval as a promising alternative to classical retrieval while emphasizing attention control under extreme context growth as a new challenge.
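The attention dilution effect described in this abstract can be reproduced numerically with a toy softmax. The sketch below assumes one gold document and identical distractors. The `length_aware_scale` function, which multiplies logits by a factor growing with the log of the corpus size, is a hypothetical stand-in and not the adjustment used in the paper.

```python
import math

def gold_attention_mass(gold_score, distractor_score, n_distractors, scale=1.0):
    """Softmax mass on one gold document among n identical distractors."""
    gold = math.exp(scale * gold_score)
    rest = n_distractors * math.exp(scale * distractor_score)
    return gold / (gold + rest)

def length_aware_scale(n_docs, base_docs=10):
    """Hypothetical adjustment: grow the logit scale with log corpus size."""
    return math.log(n_docs) / math.log(base_docs)

for n in (10, 1_000, 100_000):
    # The gold pre-softmax score stays at 8.0 while the corpus grows.
    plain = gold_attention_mass(8.0, 2.0, n - 1)
    adjusted = gold_attention_mass(8.0, 2.0, n - 1, scale=length_aware_scale(n))
    print(f"{n:>7} docs  plain={plain:.4f}  length-aware={adjusted:.4f}")
```

With an unchanged gold score, the plain softmax mass shrinks as distractors are added, while the rescaled version keeps most of the mass on the gold document in this toy setting.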

Multi-Head Recurrent Memory Agents

8/10

Jiatong Li, Samuel Yeh, Sharon Li

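Key summary:

Studies the memory retention bottleneck of recurrent memory agents over very long contexts. The core method is the Multi-Head Recurrent Memory (MHM) architecture, which partitions memory into independent heads and applies a select-then-update strategy, preventing overwriting at the architectural level and strengthening long-range retention.

Personalized recommendation rationale:

To counter the collapse of memory retention in recurrent memory agents on long contexts, the paper proposes a training-free architectural optimization based on multi-head memory partitioning and a stage-wise select-then-update strategy. It directly improves LLM reliability on long-sequence tasks and has potential value for scenarios such as long-context recommendation.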

2026-07-01 22:38:54 | arXiv:2607.01523v1 |

cs.LG, cs.AI, cs.CL

Full abstract:

Recurrent memory agents extend LLMs to arbitrarily long contexts by iteratively consolidating input into a fixed-size memory window. Despite their scalability, these agents exhibit a well-documented reliability problem: end-to-end performance degrades systematically as context length grows. We diagnose this failure by decomposing performance into two factors, memory capture and memory retention, and quantitatively confirm that retention is the dominant bottleneck. Retention collapses because existing designs maintain memory as a monolithic text block, forcing every update to risk overwriting previously retained content. Motivated by this diagnosis, we propose Multi-Head Recurrent Memory (MHM), a general, training-free framework that partitions memory into independent heads governed by a stage-wise select-then-update strategy. At each step, exactly one head is selected for update while the remaining heads are structurally shielded from overwriting, shifting the burden of retention from model behavior to architectural design. As a lightweight instantiation, we introduce Least-Recently-Updated MHM (MHM-LRU), which guarantees uniform head utilization with zero additional token overhead. Extensive experiments on long-context benchmarks show that MHM-LRU substantially improves both retention and end-to-end accuracy across the 100K-1M token range, where baselines degrade sharply. On RULER-HQA at 896K tokens, MHM-LRU improves the memory retention rate from less than 30% to 73.96%. These gains generalize across model families, scales, and task types, positioning architectural optimization as a practical and cost-efficient path toward reliable long-context recurrent memory.
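The select-then-update idea with least-recently-updated selection can be sketched as a small data structure. This is a toy illustration under stated assumptions: the class name `MultiHeadMemoryLRU` and the `keep_latest` updater are hypothetical, and the updater stands in for the LLM consolidation call that a real agent would make.

```python
class MultiHeadMemoryLRU:
    """Toy multi-head memory: each step rewrites only the least recently updated head."""

    def __init__(self, num_heads):
        self.heads = [""] * num_heads
        self.last_updated = [-1] * num_heads  # step index of the last write per head
        self.step = 0

    def select_head(self):
        # Least-recently-updated head; ties resolve to the lowest index.
        return min(range(len(self.heads)), key=lambda h: self.last_updated[h])

    def update(self, chunk, consolidate):
        head = self.select_head()
        # Only the selected head is passed to the updater, so the other heads
        # cannot be overwritten in this step.
        self.heads[head] = consolidate(self.heads[head], chunk)
        self.last_updated[head] = self.step
        self.step += 1
        return head

def keep_latest(old_memory, chunk):
    """Stand-in for the LLM consolidation call (an assumption of this sketch)."""
    return chunk

memory = MultiHeadMemoryLRU(num_heads=3)
order = [memory.update(f"chunk-{i}", keep_latest) for i in range(5)]
```

Because selection depends only on update recency, heads are used in round-robin order, which matches the uniform head utilization the abstract attributes to MHM-LRU.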

Multi-Objective Exploration and Preference Optimization via Mutual Information

8/10

Hongyan Xie, Yikun Ban, Ruiyu Fang, Zixuang Huang, Deqing Wang, Jianxin Li, Shua...

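Key summary:

Targets the mismatch between preference vectors and generated responses in multi-objective alignment. The proposed mutual-information framework MI-EPO jointly maximizes the conditional mutual information among responses, feedback, and preference vectors, and adds a probabilistic routing mechanism to achieve both objective alignment and preference-aware exploration.

Personalized recommendation rationale:

The paper focuses on multi-objective alignment and preference optimization, applies LLM techniques to the multi-objective trade-offs found in recommendation, search, and advertising, and its mutual-information framework has general methodological value.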

2026-07-01 18:50:14 | arXiv:2607.01392v1 |

cs.CL

Full abstract:

Aligning large language models with diverse and heterogeneous human values requires multi-objective alignment methods to effectively trade off conflicting preference dimensions. Current methods achieve this trade-off by training policies conditioned on preference vectors and leveraging online direct preference optimization. However, exploration uncertainty can cause the reward distributions of responses generated under different preference vectors to overlap, and the generated responses may fail to effectively align with the corresponding preference vectors. In this paper, we propose Multi-Objective Exploration and Preference Optimization via Mutual Information (MI-EPO), an information-theoretic framework. It unifies multi-objective exploration and alignment by maximizing the joint conditional mutual information among generated responses, preference feedback, and preference vectors. By incorporating a probabilistic routing mechanism, MI-EPO naturally decomposes objective alignment and preference-aware exploration, encouraging the model to generate responses that are distinguishable and aligned with different preference conditions. Experiments on safe alignment and helpful assistant tasks show that MI-EPO significantly improves the alignment between generated responses and preference vectors, makes the outputs more controllable, and achieves stable trade-offs across multiple objectives.

VLAFlow: A Unified Training Framework for Vision-Language-Action Models via Co-training and Future Latent Alignment

8/10

Guoyang Xia, Fengfa Li, Hongjin Ji, Lei Ren, Fangxiang Feng, Kun Zhan, Yan Xie

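Key summary:

Compares training paradigms for robot VLA models under heterogeneous data. A unified framework (VLAFlow) is used to compare four paradigms. The study finds that language supervision preserves generalization, that future latent alignment improves state and action modeling, and that combining the two gives the best transfer. It also proposes a meta-action space view.

Personalized recommendation rationale:

The paper presents the unified VLAFlow training framework and compares VLA objective paradigms. Its core question is how language supervision and future latent alignment improve training on heterogeneous data and transfer, which ties directly to frontier applications of LLMs in robot action modeling.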

2026-07-02 01:38:16 | arXiv:2607.01586v1 |

cs.CV, cs.AI, cs.RO

Full abstract:

Vision-language-action models (VLAs) have recently advanced robotic manipulation, yet the effects of different robot-data pre-training paradigms remain difficult to compare because existing models often differ in architecture, data, action space, and evaluation protocol. We present VLAFlow (Vision-Language-Action Flow), a unified flow-matching framework for controlled comparison of VLA training objectives. Using a heterogeneous robot corpus, OXEMix, containing approximately 5,000 hours of data from DROID, OpenX-Embodiment, OpenX-Augmented, and RoboCOIN, we evaluate four paradigms under the same pi0-style architecture, shared VLM backbone, action expert, and 14-dimensional action space: action-only modeling (MindPI), language-supervised co-training (MindLPI), future latent alignment (MindWPI), and their combination (MindLWPI). Experiments on LIBERO, LIBERO-Plus, and SimplerEnv show that action-only pre-training is sensitive to heterogeneous data. In contrast, language supervision helps preserve vision-language generalization, while future latent alignment improves state-transition and action-outcome modeling. By combining both signals, MindLWPI achieves the most stable overall transfer performance across benchmarks. These results suggest a meta-action space view: language and future latent representations provide complementary intermediate constraints that make heterogeneous action supervision smoother and more transferable.

Planning over Matrix-Factorization MDPs for Candidate Generation

7/10

Mikhail Trapeznikov, Maksim Utushkin

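Key summary:

Asks whether user-state dynamics are worth modeling at the candidate generation stage of recommender systems. The core method casts top-K retrieval as an MDP, uses the closed-form matrix-factorization fold-in as the state transition, and improves retrieval through planning (e.g., MCTS) with no retraining and no change to the representation.

Personalized recommendation rationale:

The paper directly addresses candidate generation in recommender systems with an MDP-based, dynamics-aware planning method, a contribution to the core domain.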

2026-07-02 12:50:45 | arXiv:2607.02115v1 |

cs.IR

Full abstract:

For a recommender service, we view the customer journey as a chain of item recommendations: a useful item changes the user's state and therefore what should be retrieved next. Standard matrix-factorization retrieval ignores this -- it builds one user vector and returns the top-$K$ items by a static score, treating them as independent. We ask a narrow question: when is it worth planning over the user-state dynamics that fold-in induces? To answer it, we propose casting top-$K$ retrieval as an MDP over the implicit-ALS posterior $(A^{-1},u)$, where an action is an item and the transition is a closed-form rank-one fold-in, and the trajectory reward combines a relevance similarity with a posterior-alignment term. Under the same fixed embeddings we compare static retrieval, one-step planning, and horizon-$K$ MCTS across five datasets and two protocols: a per-user leave-last-$n$ split and a stricter global time split. Dynamics-aware planning tends to outperform static retrieval on all datasets under leave-last-$n$, and the gains hold on MovieLens-1M and the VK-LSVD slices under the global time split. A single step of lookahead already captures most of the gain, so the lightweight planning layer turns static top-$K$ scoring into a short decision and improves retrieval over fixed collaborative-filtering embeddings, with no retraining and no change to the representation. These gains depend on measuring relevance with cosine rather than inner-product similarity, which is otherwise entangled with item popularity.
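A closed-form rank-one fold-in of the kind this abstract mentions can be sketched with the Sherman-Morrison identity. The sketch assumes a symmetric matrix A and that folding in an item with vector y adds `a_weight * y y^T` to A and `b_weight * y` to the right-hand side b, with the user vector given by u = A^{-1} b. The function name and both weights are assumptions of this illustration, and the exact weighting in the paper may differ.

```python
def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def fold_in(a_inv, b, y, a_weight=1.0, b_weight=1.0):
    """Rank-one fold-in of item vector y via the Sherman-Morrison identity.

    Assumes A is symmetric and that the new item adds a_weight * y y^T to A
    and b_weight * y to b (both weights are assumptions of this sketch).
    """
    ay = mat_vec(a_inv, y)  # A^{-1} y
    denom = 1.0 + a_weight * sum(yi * ai for yi, ai in zip(y, ay))
    n = len(y)
    new_a_inv = [
        [a_inv[i][j] - a_weight * ay[i] * ay[j] / denom for j in range(n)]
        for i in range(n)
    ]
    new_b = [bi + b_weight * yi for bi, yi in zip(b, y)]
    new_u = mat_vec(new_a_inv, new_b)  # updated user vector u = A^{-1} b
    return new_a_inv, new_b, new_u

# Start from A = I (regularization only) and an empty interaction history.
a_inv = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
a_inv, b, u = fold_in(a_inv, b, y=[1.0, 0.0])
```

Each fold-in costs O(d^2) for embedding dimension d instead of a fresh matrix inversion, which is what makes treating it as an MDP transition inexpensive enough for lookahead.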

IntentTune: Using user demand and personalization to resolve "unknown" query intents for e-commerce search

7/10

Rachith Aiyappa, Ishita Khan, Chester Palen-Michel, Jayanth Yetukuri, Samarth Ag...

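Key summary:

Studies intent disambiguation for ambiguous e-commerce search queries (e.g., "watch"). The core method infers latent intent (e.g., gender, age group) from personalized user behavioral signals (e.g., search history) or population-level demand patterns, and shows that user behavioral signals outperform population-level statistics.

Personalized recommendation rationale:

The paper addresses intent disambiguation for ambiguous e-commerce search queries using user behavioral signals. It belongs to core information retrieval but does not involve LLM or Transformer techniques.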

2026-07-01 23:02:00 | arXiv:2607.01530v1 |

cs.IR, cs.AI

Full abstract:

Understanding user intent is fundamental to delivering relevant search results in e-commerce. However, a substantial fraction of real-world queries are under-specified (e.g., "watch" or "shirt"), lacking explicit attributes such as gender or age group. This ambiguity poses a significant challenge for query intent detection models in e-commerce search systems, which must accurately infer latent user intent (e.g., age, gender) to support effective downstream retrieval. We introduce IntentTune, a framework for resolving ambiguous or under-specified query intents by leveraging either (1) user-specific behavioral signals including search history, browsing activity, and profile attributes or (2) population-level demand patterns aggregated across all users. Through experiments on real-world e-commerce data, we first demonstrate that population-level demand patterns alone are insufficient to reliably infer intent in under-specified queries. We then demonstrate that user-specific behavioral signals -- particularly prior search queries -- outperform both population-level statistics and static profile information for inferring gender, age group, product category, and size intent from under-specified queries.

CheckRLM: Effective Knowledge-Thought Coherence Checking in Retrieval-Augmented Reasoning

7/10

Dingling Xu, Ruobing Wang, Qingfei Zhao, Yukun Yan, Zhichun Wang, Daren Zha, Shi...

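Key summary:

Addresses the tendency of reasoning language models to include factual errors in their reasoning chains on knowledge-intensive tasks. The proposed CheckRLM framework extracts factual claims from the reasoning chain and applies minimal-cost, precise corrections using external knowledge, keeping the chain coherent with correct knowledge.

Personalized recommendation rationale:

The method focuses on making reasoning models more reliable on knowledge-intensive tasks by combining RAG with fact checking. It does not directly involve recommender systems, but the idea could be used for reasoning error correction in recommendation scenarios.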

2026-07-02 14:50:25 | arXiv:2607.02262v1 |

cs.CL

Full abstract:

Reasoning Language Models (RLMs) have significantly improved performance on complex tasks by extending the reasoning chain. However, these chains are prone to containing factual errors, particularly in knowledge-intensive tasks. To address this issue, we propose CheckRLM, a framework that improves the reliability of the reasoning process through Retrieval-Augmented Generation (RAG) by timely checking and correcting factual errors. Specifically, CheckRLM extracts factual claims from the reasoning chain to identify and localize subtle knowledge inconsistencies during inference. Upon detection of errors, a refinement mechanism performs minimal-cost yet precise corrections by leveraging external knowledge, ensuring coherence between the reasoning chain and correct knowledge. Extensive experiments demonstrate that CheckRLM substantially outperforms existing baselines, exhibiting a strong capability to mitigate error accumulation in long-horizon reasoning with lower costs. The code and data are available at https://github.com/AI9Stars/CheckRLM.

Denser $\neq$ Better: Limits of On-Policy Self-Distillation for Continual Post-Training

7/10

Meng Wang, Haohan Zhao, Wenzhuo Liu, Lu Yang, Geng Liu, Haiyang Guo, Guo-Sen Xie...

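Key summary:

Studies the limits of on-policy self-distillation in continual post-training of LLMs. Dense self-distillation can accelerate in-domain specialization, but it degrades out-of-distribution generalization, causes catastrophic forgetting, and can even trigger model collapse. The root cause is that dense distillation increases drift in parameter and response space and amplifies high-frequency formatting artifacts through a self-reinforcing loop.

Personalized recommendation rationale:

The paper studies on-policy self-distillation for continual post-training of LLMs and exposes its limits and potential risks, with important implications for efficient fine-tuning of LLMs in areas such as recommender systems.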

2026-07-02 06:24:30 | arXiv:2607.01763v1 |

cs.LG, cs.CL

Full abstract:

Continual post-training enables foundation models to acquire new knowledge while preserving existing capabilities. Recent work suggests that on-policy learning can mitigate forgetting, with on-policy self-distillation emerging as a particularly attractive approach. In this work, we revisit this optimistic view through self-distillation policy optimization (SDPO). Our experiments show that SDPO can accelerate in-domain specialization when teacher signals are stable and well aligned, but it struggles to generalize to out-of-distribution scenarios. In continual post-training, SDPO exhibits stronger forgetting and can even collapse, whereas on-policy reinforcement learning methods such as GRPO adapt more conservatively and better preserve prior capabilities. Further analyses reveal that denser self-distillation induces larger drift in both parameter space and response space, and can amplify high-frequency formatting artifacts through a self-reinforcing teacher-student loop. These findings suggest that on-policy data alone is insufficient for continual learning. Dense self-distillation can accelerate specialization when teacher targets are stable and token-level supervision is reliable, but it should not be treated as a default stabilizer for continual post-training. Our code is available at https://github.com/Moenupa/SDPO-CL.

Domain Generalization via Text-Anchored Information Bottleneck

7/10

Eunyi Lyou, Yunjeong Choi, Junho Lee, Joonseok Lee

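Key summary:

Studies visual domain generalization. The core idea is to discard visual guidance and treat the language embedding space as the primary source of domain invariance, using an information bottleneck to preserve core semantics and suppress domain-specific variation.

Personalized recommendation rationale:

The core idea, using the language embedding space as an information bottleneck to filter out domain-specific visual variation, is suggestive for domain generalization with large models in recommendation and search. The work leans toward visual domain generalization, however, so the connection is indirect.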

2026-07-02 03:31:34 | arXiv:2607.01657v1 |

cs.CV

Full abstract:

Visual recognition models often fail when deployed in new environments. Domain Generalization (DG) addresses this by learning representations that remain invariant to environment-specific variations. Recent approaches increasingly rely on large vision-language models, assuming that preserving their expressive visual representations improves robustness. However, we show that such visual expressiveness can instead propagate spurious cues that tie representations to the training environments, hindering invariant learning. We therefore discard visual guidance and instead treat the language embedding space as the primary source of domain invariance, naturally acting as an information bottleneck that preserves core semantics while suppressing domain-specific variations. Extensive experiments across diverse backbones show state-of-the-art performance, and further analyses examine what makes guidance effective for robust generalization. These findings shift the focus of DG from improving representations to designing supervision that enforces invariance.

Teaching Vision-Language-Action Models What to See and Where to Look

6/10

Yuguang Yang, Canyu Chen, Zhewen Tan, Yizhi Wang, Zichao Feng, Chunyang Liu, Keh...

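Key summary:

Addresses the lack of action-relevant spatial dependencies in VLA models for autonomous driving. The proposed DriveTeach-VLA framework uses driving-aware vision distillation and trajectory-guided prompts to teach the model what to see and where to look, improving trajectory prediction.

Personalized recommendation rationale:

The work centers on VLA models for autonomous driving and has no direct connection to recommendation, search, or advertising, but its cross-modal learning approach could transfer to heterogeneous data modeling.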

2026-07-02 03:34:32 | arXiv:2607.01658v1 |

cs.CV

Full abstract:

Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, the training of existing VLAs relies heavily on text-centric visual question answering and chain-of-thought reasoning data, which emphasizes linguistic reasoning rather than action-grounded planning. As a result, the learned representations capture semantic knowledge but lack spatial dependencies crucial for reliable trajectory prediction. We propose DriveTeach-VLA, a framework that explicitly teaches VLAs what to see and where to look. Driving-aware Vision Distillation (DVD) injects driving-specific perceptual priors into the vision encoder, while 2D Trajectory-Guided Prompts (2D-TGP) provide spatial conditioning aligned with feasible driving trajectories. Together, they form a vision-guided learning pipeline: what to see (DVD pretraining) - where to look (TGP-guided SFT) - how to act (TGP-guided GRPO). DriveTeach-VLA achieves state-of-the-art performance on NAVSIM and nuScenes. Our code is available at: https://github.com/ShivaTeam/DriveTeach-VLA.

WBMM: Windowed Batch Matrix Multiplication for Efficient Large Receptive Field Convolution

5/10

Wan Song, Wei Zhou, Rui Wang, Jun Yu, Toru Kurihara, Jiajia Xu, Shu Zhan

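Key summary:

Proposes WBMM, which partitions the input into windows and builds weight matrices from a relative position bias table, giving regular memory access and resolving the efficiency problem of large-kernel depthwise convolution.

Personalized recommendation rationale:

The paper proposes WBMM, a new way to compute convolutions. It is adjacent to Transformer efficiency optimization but mainly targets convolution rather than Transformers or recommender systems directly.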

2026-07-02 12:33:03 | arXiv:2607.02097v1 |

cs.CV, cs.LG

Full abstract:

Large kernel depthwise convolutions achieve strong performance but suffer from significant degradation as kernel size grows due to irregular memory access from gather-based computation; while Large Kernel Acceleration (LKA) helps on small feature maps, it becomes counterproductive on large feature maps, even slower than non-accelerated implementations. We propose Windowed Batch Matrix Multiplication (WBMM), which partitions input into contiguous windows and indexes a compact relative position bias table to construct weight matrices, enabling regular memory access via batched matrix multiplication. This yields a unique property: WBMM's throughput improves with larger windows, opposite to depthwise convolutions that degrade with larger kernels. Operator-level benchmarks show WBMM with 14x14 windows outperforms 5x5 depthwise convolution baselines in speed while providing a 7.8x larger per-layer receptive field. Combined with inter-block cross-window communication and hierarchical window reparameterization, WBMM achieves comparable or higher accuracy on ImageNet-1K, COCO, and ADE20K with 1.31-1.88x training speedup, and demonstrates consistent advantages across GPU, CPU, and edge devices without requiring specialized acceleration kernels. Our code is available at http://github.com/wansong-s/WBMM
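The construction of a weight matrix from a relative position bias table can be sketched in one dimension. This is a simplified illustration for a single channel, whereas the paper operates on 2D feature maps with batched matrix multiplication. The function names and the indexing convention `i - j + window - 1` are assumptions of this sketch.

```python
def build_window_weights(bias_table, window):
    """Build W[i][j] = bias_table[i - j + window - 1], a relative position lookup."""
    assert len(bias_table) == 2 * window - 1
    return [
        [bias_table[i - j + window - 1] for j in range(window)]
        for i in range(window)
    ]

def windowed_matmul(signal, bias_table, window):
    """Split a 1D signal into contiguous windows and apply one shared weight matrix."""
    assert len(signal) % window == 0
    weights = build_window_weights(bias_table, window)
    out = []
    for start in range(0, len(signal), window):
        chunk = signal[start:start + window]
        # One matrix-vector product per window; windows are contiguous in memory.
        out.extend(sum(w * x for w, x in zip(row, chunk)) for row in weights)
    return out

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# A single 1 at relative offset 0 makes the weight matrix the identity.
identity_out = windowed_matmul(signal, [0.0, 0.0, 1.0, 0.0, 0.0], window=3)
# An all-ones table makes every output the sum of its window.
sum_out = windowed_matmul(signal, [1.0] * 5, window=3)
```

The table holds only 2 * window - 1 parameters per channel in this 1D case, while the dense window-by-window matrix built from it is what turns the operation into a regular matrix product.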

HNSW with Accuracy Guarantees Using Graph Spanners -- A Technical Report

3/10

Minghao Li, Raghav Mittal, Sanjivni Rana, Suraj Shetiya, Gautam Das, Nick Koudas

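Key summary:

Addresses HNSW's lack of correctness guarantees. The proposed Certify-then-Rectify framework uses statistical certification to judge the quality of search results and switches to an exact recovery algorithm when needed, using graph spanners and Extreme Value Theory to guarantee correctness.

Personalized recommendation rationale:

Mainly improves a traditional retrieval algorithm and does not involve LLMs or the core of recommender systems.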

2026-07-02 15:44:43 | arXiv:2607.02338v1 |

cs.DB, cs.CL, cs.IR, cs.LG

Full abstract:

Hierarchical Navigable Small World (HNSW) graphs serve as the industry standard due to their logarithmic complexity and strong empirical performance. However, HNSW relies on greedy graph traversal, a heuristic that provides no theoretical guarantees of correctness. In this paper, we propose a novel "Certify-then-Rectify" framework that bridges the gap between the speed of heuristic search and the rigor of exact retrieval. Rather than discarding HNSW, our approach first employs a distribution-free statistical certifier to dynamically evaluate the quality of a standard HNSW search with minimal overhead. If certification indicates that the retrieved neighbors are of low quality, the framework safely escalates to a rigorous exact recovery algorithm. To make this exact recovery computationally feasible, we reinterpret the HNSW graph as a geometric spanner and utilize Extreme Value Theory to stochastically estimate its maximum empirical stretch factor. This allows us to mathematically bound the maximum distance of true nearest neighbors. Extensive evaluations on benchmark datasets demonstrate that our tiered framework delivers the average-case speed of HNSW while ensuring the worst-case correctness of exact search and outperforming other applicable approaches.

From SRA to Self-Flow: Data Augmentation or Self-Supervision?

3/10

Dengyang Jiang, Mengmeng Wang, Harry Yang, Jingdong Wang

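Key summary:

Investigates the root cause of the improvement from SRA to Self-Flow in diffusion Transformers. The proposed Attention Separation method shows that the improvement comes mainly from data augmentation along the noise dimension rather than from the previously assumed interaction between tokens at different noise levels.

Personalized recommendation rationale:

The paper mainly studies self-representation alignment in diffusion Transformers. It belongs to core generative modeling and has only a weak direct connection to recommender systems, search, or advertising.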

2026-07-02 17:59:25 | arXiv:2607.02508v1 |

cs.CV

Full abstract:

Representation alignment has become an effective way to accelerate diffusion transformer training and improve generation quality. Recent self-alignment methods, such as SRA and Self-Flow, further remove the dependency on external pretrained encoders by constructing alignment within the diffusion model itself. However, the mechanism behind the improvement from SRA to Self-Flow, dual-time scheduling, remains under-examined: Self-Flow attributes its gain to interactions between tokens at different noise levels, where cleaner tokens help infer noisier ones. In this work, we revisit this explanation and ask whether the gain instead comes from data augmentation along the noise dimension. To disentangle these factors, we introduce Attention Separation, which preserves the same dual-timestep input as Self-Flow while blocking attention between tokens assigned to different noise levels. Surprisingly, removing such interaction does not degrade performance and can even improve it, suggesting that the improvement from SRA to Self-Flow mainly comes from data augmentation. Furthermore, we show that Attention Separation itself provides an augmentation effect by splitting a single image into multiple effective training parts to expand the training data. Based on these observations, we combine self-representation alignment with dual-timestep and attention-separation augmentation, and demonstrate the effectiveness of this design on ImageNet.
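Blocking attention between tokens assigned to different noise levels amounts to a block-structured attention mask. The sketch below shows one way such a mask could be built and applied to a single row of attention scores. The function names and the toy scores are hypothetical, and the paper's actual implementation is not reproduced here.

```python
import math

def attention_separation_mask(noise_level_ids):
    """mask[i][j] is True when token i may attend to token j (same noise level only)."""
    return [[a == b for b in noise_level_ids] for a in noise_level_ids]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores, with disallowed positions given zero weight."""
    exps = [math.exp(s) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Four tokens: the first two share one timestep, the last two share another.
mask = attention_separation_mask([0, 0, 1, 1])
# Token 0 has high raw scores toward tokens 2 and 3, but they are masked out.
weights = masked_softmax([1.0, 1.0, 5.0, 5.0], mask[0])
```

Every token always attends at least to itself under this mask, so the softmax denominator is never zero.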

Online Safety Monitoring for LLMs

2/10

Mona Schirmer, Metod Jazbec, Alexander Timans, Christian Naesseth, Maja Waldron,...

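Key summary:

Studies online safety monitoring for deployed LLMs and proposes a real-time monitor based on thresholding and risk control that uses a verifier signal from an external model to decide whether to raise an alarm.

Personalized recommendation rationale:

The paper focuses on LLM safety monitoring, which belongs to safety alignment. It is unrelated to the core problems of recommendation, search, and advertising, and does not involve enabling technologies such as Transformers or VLMs.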

2026-07-02 17:59:43 | arXiv:2607.02510v1 |

cs.AI, cs.CL, cs.LG, stat.AP, stat.ML

Full abstract:

Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor that turns a verifier signal from an external model into an alarm decision by thresholding, with the threshold calibrated via risk control. In experiments on mathematical reasoning and red teaming datasets, we show that this simple design is competitive with more advanced monitors based on sequential hypothesis testing.
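A threshold monitor of this kind can be sketched in a few lines. The sketch assumes verifier scores where higher means judged safer, raises an alarm when the score falls below the threshold, and picks the threshold from a labeled calibration set. It is purely empirical: a proper risk-control procedure adds a finite-sample correction that is omitted here, and all names and data are hypothetical.

```python
def calibrate_threshold(scores, unsafe, alpha):
    """Pick the smallest threshold whose empirical miss rate on unsafe outputs is <= alpha.

    scores: verifier scores (higher = judged safer); unsafe: matching booleans.
    Assumes at least one unsafe example; no finite-sample correction is applied.
    """
    unsafe_scores = [s for s, u in zip(scores, unsafe) if u]
    candidates = sorted(set(scores)) + [float("inf")]
    for t in candidates:
        # Unsafe outputs scoring at or above t would not trigger an alarm.
        missed = sum(s >= t for s in unsafe_scores)
        if missed / len(unsafe_scores) <= alpha:
            return t

def raise_alarm(score, threshold):
    """Alarm when the verifier score drops below the calibrated threshold."""
    return score < threshold

cal_scores = [0.9, 0.8, 0.2, 0.4, 0.7, 0.1]
cal_unsafe = [False, False, True, True, False, True]
threshold = calibrate_threshold(cal_scores, cal_unsafe, alpha=0.34)
```

At deployment, `raise_alarm(score, threshold)` would be evaluated on each new verifier score. A lower alpha pushes the threshold up and trades more false alarms for fewer missed unsafe outputs.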

Bringing Agentic Search to Earth Observation Data Discovery

3/10

Minghan Yu, Youran Sun, Chugang Yi, Yixin Wen, Haizhao Yang

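Personalized recommendation rationale:

The paper focuses on search for Earth observation, a specific domain, and has only a weak direct connection to general search and recommender systems or to advances in core LLM technology. It lacks discussion of generalization to LLM or Transformer architectures and does not touch recommendation or advertising scenarios, so its relevance is low.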

2026-07-02 16:24:16 | arXiv:2607.02387v1 |

cs.IR, cs.LG

Full abstract:

NASA and its data centers hold thousands of geoscience datasets and tools like Worldview, Giovanni, the Science Discovery Engine, and Harmony. Finding the right one is hard even for domain experts. We present an agentic search system, deployed as a public service for the geoscience community, that takes a natural-language research query and returns the matching datasets and tools. We demonstrate that, in the era of large language models, the latent value of knowledge graphs (KGs) can be substantially amplified through agentic search. From the NASA Earth Observation Knowledge Graph (NASA EO-KG) we derive NASA-EO-Bench, an open benchmark of 47k query-dataset pairs (21k task-based queries). A neural scorer fine-tuned on NASA-EO-Bench beats cosine and BM25 baselines. Further combining it with BM25 via score fusion raises both Recall@10 (R@10) and MRR by over 5x. On top of this supervised pipeline, we add a zero-shot agentic reranking stage that, without any additional training, lifts MRR by 28% on a stratified N=200 subset, showing that LLM reasoning is complementary to supervised retrieval.
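Score fusion and the two reported metrics can be illustrated with a small sketch. The fusion scheme here, a weighted sum of min-max normalized scores, is one common choice and an assumption of this example, since the abstract does not specify which fusion the system uses. The document IDs and scores are made up.

```python
def min_max(scores):
    """Rescale a {doc: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def fuse(neural, bm25, weight=0.5):
    """Rank documents by a weighted sum of normalized scores (assumed fusion scheme)."""
    n, b = min_max(neural), min_max(bm25)
    docs = set(n) | set(b)
    fused = {d: weight * n.get(d, 0.0) + (1 - weight) * b.get(d, 0.0) for d in docs}
    # Sort by descending fused score; break ties by document ID for determinism.
    return sorted(docs, key=lambda d: (-fused[d], d))

def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranking, relevant, k=10):
    """Fraction of relevant documents found in the top k."""
    return len(set(ranking[:k]) & relevant) / len(relevant)

neural = {"d1": 0.2, "d2": 0.9, "d3": 0.5}
bm25 = {"d1": 12.0, "d2": 3.0, "d3": 11.0}
ranking = fuse(neural, bm25)
```

MRR is the mean of `reciprocal_rank` over all queries. In this toy case the fused ranking puts `d3` first because it scores reasonably well under both signals, which is the kind of complementarity fusion is meant to exploit.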

Evaluating Chunking Strategies for Retrieval-Augmented Generation on Academic Texts

3/10

Valentin J. J. Kreileder, Johannes Reisinger, Andreas Fischer

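Personalized recommendation rationale:

This paper evaluates chunking strategies in RAG systems, an engineering optimization within retrieval-augmented generation that has no direct connection to the core techniques of recommendation, search, or advertising (such as user modeling, ranking, and matching). Although RAG can be applied to search, the study is narrowly confined to academic texts and does not target key problems in recommendation or advertising.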

2026-07-02 08:12:35 | arXiv:2607.01852v1 |

cs.IR cs.AI cs.CL

Retrieval-Augmented Generation (RAG) systems use the question-answering capabilities of Large Language Models (LLMs) to access information outside their parameters. We evaluate whether cluster-based semantic chunking improves retrieval and answer quality compared to fixed-size and recursive chunking on long, structured academic theses, using the Retrieval Augmented Generation Assessment (RAGAs) framework. RAGAs-based faithfulness shows limited reliability in this setup. Performance on fixed versus document-specific questions varied substantially, likely related to the formatting of documents and preprocessing. Under the tested configuration, cluster-based chunking did not outperform simpler strategies.

What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates

3/10

Arman Ghaffarizadeh, Danyal Mohaddes, Aliakbar Izadkhah, Shahriar Noroozizadeh

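Personalized recommendation rationale:

The paper studies social structure and objective emergence in multi-agent debates, which belongs to social simulation with LLM agents and has no direct connection to the core tasks of recommendation, search, or advertising (such as ranking, matching, and generation). It may inspire multi-agent interaction modeling, but at this stage it lacks clear application value for recommender systems.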

2026-07-02 17:59:23 | arXiv:2607.02507v1 |

cs.AI cs.CL cs.LG cs.MA

LLM agents will increasingly act in socially structured settings where role, audience, and relational context can shape what is advantageous or costly to say. We study whether such social structure, without any explicit objective in the prompt, changes what an agent expresses publicly relative to an off-the-record (OTR) channel elicited under the same condition. We introduce a dual-channel debate framework in which agents produce public utterances that enter the shared history alongside OTR responses that are recorded but never shown to the other participant. Across 10 models, 3 scenarios, and 5 variations within each scenario, alignment-inducing settings produce systematic public-OTR divergence in the targeted agent, with its decision divergence rising from a $\sim$3% baseline to roughly 40%. The effect is consistent across four aggregate analyses: stance, semantic similarity, natural language inference, and survey responses. In some cases, the OTR response explicitly attributes public accommodation to relational pressures, such as career risk or sponsorship obligation. The findings suggest that agent evaluation should extend beyond explicit goals and detect emergent objectives. We present a dual-channel evaluation framework and complementary behavioral measures that operationalize this assessment.

Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

3/10

Liyan Tang, Fangcong Yin, Greg Durrett

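Personalized recommendation rationale:

This paper focuses on the self-reflection ability of vision-language models, a multimodal learning topic with low relevance to the core of recommendation, search, and advertising. Its techniques might inspire methods for interaction between visual features and other modalities, but there is no clear path to applying them directly to user sequence modeling or heterogeneous data fusion.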

2026-07-02 17:53:15 | arXiv:2607.02490v1 |

cs.CL cs.CV

Large vision-language models can reason over multimodal inputs by generating textual chains of thought (CoT). A key capability exhibited in CoT reasoning is self-reflection: revisiting earlier decisions and correcting previous errors. However, existing LVLMs often fail to properly attend to visual inputs during reflection, limiting their ability to translate feedback into grounded corrections, especially for out-of-distribution images. To address this issue, we propose a novel reinforcement learning training framework VRRL, with two components explicitly designed to elicit visually grounded self-reflection. First, we randomly mask trajectory prefixes during training to emphasize recovery from incorrect intermediate predictions rather than making early mistakes. Second, we introduce buffered roll-ins from an experience replay buffer to expose the model to diverse failure states that it must learn to correct. We evaluate our approach on visual grounding tasks involving tables and charts, as well as spatial navigation benchmarks. While off-the-shelf and conventionally fine-tuned models degrade substantially under distribution shift, our method substantially improves average out-of-distribution accuracy over standard RL and reflection-oriented fine-tuning baselines by using self-reflection effectively.

HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

3/10

Ziyun Qiao, Yue Min, Ruining Chen, Yujun Li

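Personalized recommendation rationale:

This paper addresses labeling and mixing of pre-training data, a data-processing technique within core LLM work, but it concentrates on the general pre-training stage and does not point to direct applications in recommendation, search, or advertising. High-quality data mixing can improve base LLM capabilities and thereby indirectly benefit downstream tasks, but the work is not tightly coupled with RecSys/Search/Ads, so its relevance is low.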

2026-07-02 14:51:42 | arXiv:2607.02266v1 |

cs.LG cs.AI cs.CL

Most data-mixing methods assume the corpus has already been partitioned into groups, and the choice of those groups determines what a mixer can express. Existing labels, including provenance, topic or format taxonomies, and flat embedding clusters, commit to one semantic axis at one granularity; changing the resolution rebuilds the labels. We argue the bottleneck is the label system, not the mixer, and provide a hierarchical one. HERMES is a data-derived labeling substrate: a Learned Semantic Transform followed by 3-stage residual vector quantization annotates each document once into a coarse-to-fine code whose prefix length controls granularity up to approximately 130k cells. At coarse granularity HERMES sits at a plateau with KMeans-family methods on standard clustering metrics, so the contribution is the substrate, not the clusterer. On 1B-parameter, 25B-token pre-training, the hierarchy exposes an interaction fixed-granularity pipelines cannot test: at one prefix length, a combined Stage-2 rule contrast, equal-subbucket coverage versus size-proportional within-bucket quality top-30%, lifts a 16-task capability macro-average by +0.0253; at the next finer level, the same rule loses its measurable edge as candidate pools contract approximately 5x. HERMES reframes data mixture design from choosing among fixed label sets to navigating a reusable, data-derived granularity hierarchy.
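
The coarse-to-fine code described here can be illustrated with a toy residual vector quantizer. This is only a sketch of the general technique: the hand-written two-dimensional codebooks, the function names, and the example vectors are hypothetical, and the paper's Learned Semantic Transform and actual codebook sizes are not modeled.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], vec)))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    code, residual = [], list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        code.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(code)


def cell(code, prefix_len):
    """A shorter prefix of the code names a coarser cell."""
    return code[:prefix_len]


# Hypothetical hand-written codebooks for three stages (coarse to fine).
codebooks = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[-1.0, -1.0], [1.0, 1.0]],
    [[-0.2, 0.0], [0.2, 0.0]],
]
doc_a = rvq_encode([11.2, 11.0], codebooks)
doc_b = rvq_encode([10.8, 11.0], codebooks)
```

In this toy setup the two documents fall in the same cell at prefix length 2 and separate only at prefix length 3, which is the sense in which the prefix length sets the granularity without relabeling.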

Spec-AUF: Accept-Until-Fail Training under Train-Inference Misalignment for Masked Block Drafters

3/10

Tianjian Yang, Meng Li

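Personalized recommendation rationale:

This paper deals with a training method for speculative decoding in LLMs, which is an LLM inference acceleration technique. It may indirectly improve LLM inference efficiency in recommender systems, but it does not directly target core problems in recommendation, search, or advertising and describes no concrete application scenario.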

2026-07-02 08:44:04 | arXiv:2607.01893v1 |

cs.AI cs.CL

Speculative decoding accelerates autoregressive generation by drafting a block of tokens that the target model verifies left-to-right, committing only the longest accepted prefix. Block (DLM-style) drafters predict the whole block in parallel, which is fast but trained with a full-block cross-entropy that supervises every position against the gold continuation -- even though inference discards every token after the first rejection. Recent acceptance-aware objectives patch this by reweighting the full-block loss; we instead use teacher-forced learning as a motivation for how supervision should concentrate on the accepted prefix. A mask-only block drafter has no input-side channel for gold-prefix conditioning, so AUF approximates that prefix-sensitive supervision on the loss side by keeping the cross-entropy support only through the drafter's first predicted failure. AUF is a single, detached change to the CE support -- no auxiliary objective, no verifier rollouts, and no change to the inference pipeline or the exactness contract. Within fixed drafter backbones and serving settings on Qwen3-8B, AUF raises the DFlash drafter's average emitted length $τ$, averaged over six benchmarks, from 2.40 to 2.61, with a gain on every benchmark, and transfers to Domino's two-branch head (2.56 to 2.68). Two findings sharpen the picture: the decay-only baseline reaches higher token accuracy on the shared block mask yet decodes worse, and on DFlash, once AUF truncates the support, the standard exponential position-decay weighting becomes empirically inert.
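
The loss-side truncation can be pictured with a small sketch. It assumes that a "predicted failure" is the first block position where the drafter's top prediction differs from the gold token, that the failing position itself stays in the loss, and that the loss is averaged over kept positions; these details, the function names, and the toy numbers are assumptions rather than the paper's exact recipe. In a real training loop the mask would be computed without gradient flow.

```python
import math


def auf_mask(pred_tokens, gold_tokens):
    """Keep positions up to and including the first mismatch; drop the rest."""
    mask, failed = [], False
    for pred, gold in zip(pred_tokens, gold_tokens):
        mask.append(0.0 if failed else 1.0)
        if pred != gold:
            failed = True  # every later position is outside the CE support
    return mask


def masked_cross_entropy(gold_probs, mask):
    """Mean negative log-likelihood of the gold tokens over kept positions."""
    kept = sum(mask)
    total = sum(-math.log(p) * m for p, m in zip(gold_probs, mask))
    return total / kept if kept else 0.0


# Toy block of four positions: the drafter's first error is at index 2.
pred = [5, 7, 9, 3]
gold = [5, 7, 2, 3]
mask = auf_mask(pred, gold)
# Probability the drafter assigns to each gold token (made-up numbers).
loss = masked_cross_entropy([0.5, 0.5, 0.25, 0.9], mask)
```

Position 3 contributes nothing to the loss here, mirroring the fact that inference would discard it after the rejection at position 2.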

When Does Generating More Help? Disentangling Fixed-Source Synthesis from Source Expansion in Synthetic Data Scaling

3/10

Xu Guo, Jian Tong, Zhihui Lu, Qipeng Guo

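Personalized recommendation rationale:

This paper concerns synthetic data generation and scaling, which falls under data augmentation for LLM training. Data augmentation has potential uses in recommender systems (for example cold start and feature engineering), but the paper focuses on the principles of data generation for language models and is not a technique applied directly to recommendation, search, or advertising, so its relevance is low.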

2026-07-02 05:31:36 | arXiv:2607.01727v1 |

cs.CL

Synthetic data can be scaled along two routes: Source Expansion (SE), which enlarges the source by adding seed materials or generators, and Fixed-Source Synthesis (FSS), which holds the source fixed and scales the generation budget. Existing scaling studies typically expand the source as the data grows, conflating SE with FSS and leaving FSS underexplored. We isolate FSS by holding the seed-question pool and teacher model fixed, varying only the per-question response budget under Rejection Sampling (RS). We adapt the rectified scaling law to FSS, deriving it from how repeated sampling covers a fixed source. Empirically, the derived form, fit on low budgets, predicts performance at the held-out highest budget for every evaluated teacher--student pair. At matched total-sample budgets, SE and FSS are comparable at small budgets; at large budgets, adding seed questions outperforms spending the same budget on more responses. Within FSS, however, neither synthesizing additional questions from the existing seeds nor varying the synthesis protocol outperforms plain RS at matched budgets. FSS is thus a bounded scaling axis and a controlled setting for comparing synthesis protocols. We will release our code and data to facilitate further research.

BOUNDARY_SYNC: Measuring Communication-Induced Representational Coupling in Multi-Agent LLM Systems

3/10

Zewen Liu

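Personalized recommendation rationale:

This paper focuses on communication and representational coupling in multi-agent LLM systems, a technical exploration of distributed AI collaboration rather than work aimed directly at recommendation, search, or advertising. Multi-agent systems might indirectly inspire multi-task learning or collaborative filtering, but the paper lacks a clear RecSys/Search/Ads application scenario and is neither a core-domain advance nor an enabling technology with direct application potential.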

2026-07-02 01:55:57 | arXiv:2607.01600v1 |

cs.LG cs.CL

As large language models (LLMs) are deployed as communicating agents, does inter-agent communication cause outputs to converge? We introduce BOUNDARY_SYNC, a protocol measuring representational coupling via the Coupling Amplification Factor (CAF = JSD_cond / JSD_baseline), where CAF < 1 indicates homogenization and CAF > 1 indicates diversification. In controlled GPT-4o experiments (N=30, ~9,900 API calls), we measure coupling in text and image communication. Key findings: (1) text communication causes significant homogenization (CAF=0.803 [0.740, 0.873], d=1.30, p<0.001), confirmed by no-communication ablation and prompt-perturbation controls; (2) image communication also homogenizes under within-modality baselines (CAF=0.834 [0.811, 0.858]), with comparable proportional effect; (3) group size moderates coupling direction -- K=5 produces homogenization while K=3 yields CAF > 1.0 (point estimates 1.14 and 1.06, CI pending), suggesting a directional shift toward diversification; (4) cross-model replication shows extreme variation (CAF 0.034-0.803), with DeepSeek dominated by format artifacts; (5) coupling is stateless -- driven by prompt context rather than cumulative updating, with continuous consensus producing monotonic convergence. These results establish LLM agent coupling as real, measurable, and controllable at the prompt level, with direct implications for multi-agent system design.
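
The CAF ratio defined in this abstract can be computed as in the sketch below. The Jensen-Shannon divergence is standard, but what the distributions are taken over is not stated here, so treating them as discrete output distributions of two agents (with and without communication) is an assumption, as are the function names and example numbers.

```python
import math


def kl(p, q):
    """KL divergence in bits; q must be positive wherever p is positive."""
    return sum(pi * math.log2(pi / q[k]) for k, pi in p.items() if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    keys = set(p) | set(q)
    p = {k: p.get(k, 0.0) for k in keys}
    q = {k: q.get(k, 0.0) for k in keys}
    m = {k: 0.5 * (p[k] + q[k]) for k in keys}  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def caf(jsd_cond, jsd_baseline):
    """Coupling Amplification Factor: below 1 means homogenization."""
    return jsd_cond / jsd_baseline


# Made-up output distributions for two agents.
baseline = jsd({"yes": 0.9, "no": 0.1}, {"yes": 0.2, "no": 0.8})
with_comm = jsd({"yes": 0.7, "no": 0.3}, {"yes": 0.4, "no": 0.6})
ratio = caf(with_comm, baseline)
```

With these toy numbers the agents' distributions move closer together after communication, so `ratio` falls below 1.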

Safe and Adaptive Cloud Healing: Verifying LLM-Generated Recovery Plans with a Neural-Symbolic World Model

3/10

Junyan Tan, Haoran Lin, Siyuan Guo, Yichen Fang, Xinyue Luo, Tianyu Shen, Zeyu Q...

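Personalized recommendation rationale:

This paper addresses fault recovery in cloud systems, an operations topic with only a weak link to the core techniques of recommendation, search, and advertising. It involves LLMs, but the application scenario is outside my main areas of interest, so its relevance is low.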

2026-07-02 01:45:30 | arXiv:2607.01595v1 |

cs.AI cs.CL

As the scale and complexity of cloud-based AI systems continue to escalate, ensuring service reliability through rapid fault detection and adaptive recovery has become a critical challenge. While existing approaches integrate Large Language Models (LLMs) for semantic understanding and Deep Reinforcement Learning (DRL) for policy optimization, they often rely on sequential, loosely coupled architectures that underutilize the generative and reasoning capabilities of LLMs. In this paper, we propose a paradigm shift with PASE, a Planning-Aware Semantic self-healing engine, a novel fault self-healing framework that reconceptualizes recovery as a neuro-symbolic program synthesis task. PASE employs an LLM as a core Plan Synthesis Engine to generate structured recovery plans from a library of semantic primitives. A Neural-Symbolic World Model verifies plan feasibility through simulation, while a Meta-Prompt Optimizer, trained via DRL, learns to generate optimal prompts that guide the LLM's planning process. This tight reason-plan-verify-adapt loop enables dynamic, context-aware recovery strategy generation beyond predefined action spaces. Experiments on a real-world cloud fault injection dataset demonstrate that PASE significantly outperforms state-of-the-art methods, reducing average system recovery time by over 40% and improving fault detection accuracy in unknown fault scenarios. Our framework advances autonomous system management by unifying LLM-based reasoning with model-assisted verification and meta-learned guidance.

Object-centric LeJEPA

3/10

Jakob Geusen, Ender Konukoglu

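Personalized recommendation rationale:

The title mentions object-centric learning and LeJEPA (possibly a self-supervised learning framework), but the specific content is unclear. If it involves self-supervised representation learning or joint-embedding predictive architectures, it could have potential applications in feature representation learning, but there is no clear evidence pointing to recommendation, search, or advertising, so its relevance is low.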

2026-07-02 16:38:21 | arXiv:2607.02404v1 |

cs.CV cs.LG

Image encoders trained with LeJEPA can deliver strong features for downstream tasks, but, like other image-level self-supervised methods, typically require large training datasets. Aligning representations at the level of objects rather than whole scenes promises greater data efficiency, but doing this in a completely self-supervised way, effectively jointly partitioning a scene and representing its objects, is unstable: the two are locked in a cyclic dependency, partitioning requires meaningful representations, while meaningful representations require consistent partitioning. We sidestep this instability by taking object masks as given during training, using cheap, off-the-shelf SAM proposals. We extend LeJEPA - whose distributional anti-collapse objective ports naturally from whole images to variable-sized sets of objects - to align object-centric representations rather than whole images. An additional instance-separating loss, which treats other objects in the same scene as negatives, further boosts downstream performance. Across two model scales and 10-100% of COCO, object-level LeJEPA outperforms image-level LeJEPA on tracking (DAVIS), classification (ImageNet-1k), segmentation (ADE20k), and re-identification (NAVI).

Transformer Geometry Observatory TGO-II: Representational Similarity Observatory

3/10

Kaustubh Kapil, Kishor P. Upla

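Personalized recommendation rationale:

The paper analyzes representational similarity inside Transformers, which is a tool for basic understanding but offers no direct application or improvement for RecSys/Search/Ads. Its methodology might indirectly help in understanding representation learning in recommendation models, but the title shows no clear connection, so its relevance is low.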

2026-07-02 16:22:53 | arXiv:2607.02386v1 |

cs.CV cs.LG

While Vision Transformers have achieved remarkable success across computer vision and language applications, the geometric evolution of their internal representations throughout training remains insufficiently understood. Existing analyses primarily focus on attention mechanisms and downstream performance, leaving the evolution of representation geometry largely unexplored. In this work, we present Transformer Geometry Observatory-II (TGO-II), a representation geometry analysis framework designed to investigate how Transformer representations evolve during supervised training. TGO-II analyzes Vision Transformer (ViT-Small/16) representations using Centered Kernel Alignment (CKA), Singular Vector Canonical Correlation Analysis (SVCCA), Two-Nearest Neighbor Intrinsic Dimensionality (TwoNN-ID), and token covariance analysis. Our experiments reveal three key observations. First, both CKA and SVCCA progressively decrease throughout training, indicating increasing representational specialization across Transformer layers. Second, intrinsic dimensionality consistently increases before stabilizing, suggesting progressive expansion of the representation manifold into a larger set of locally accessible degrees of freedom. Third, token covariance and coupling analyses demonstrate that strong token interaction structure persists throughout training, challenging the hypothesis that increasing representational complexity arises primarily from progressive token independence. These findings suggest that representation complexity and layer specialization emerge simultaneously during training. Manifold expansion appears to occur without token decoupling. Together, these observations motivate a new hypothesis in which Vision Transformers increase representational complexity through progressively richer transformations while preserving strong token interaction structure during learning.

Spatio-Temporal and Clinical Conditioning for Fine-Grained Radiology Report Retrieval

3/10

P. Sloan, E. Simpson, M. Mirmehdi

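Personalized recommendation rationale:

The paper focuses on radiology report retrieval, a medical application with no direct connection to the core techniques of recommendation, search, or advertising. Spatio-temporal and clinical conditioning could be transferred indirectly to user behavior sequence or context modeling, but the paper lacks a clear RecSys/Search/Ads application scenario and practical value.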

2026-07-02 10:54:02 | arXiv:2607.02024v1 |

cs.CV

Radiology is vital to modern healthcare, but rising imaging demand and persistent workforce shortages strain reporting capacity and clinical workflows. Automated radiology report generation has the potential to support radiologists and help alleviate this burden; however, existing retrieval-based methods remain rigid, lack explicit anatomical grounding, and do not account for longitudinal disease progression or available clinical context. In this work, we introduce STAR3, a multimodal, spatio-temporal, attentive retrieval framework for radiology report generation that aligns region-level anatomical information with clinical indications and longitudinal changes across chest X-ray studies. Our framework employs an object detector to identify anatomically meaningful regions and retrieves semantically relevant report sentences conditioned on both current clinical context and changes observed between prior and current examinations. This design enables anatomically and temporally grounded report generation that better reflects clinical reporting practice. Experiments on the MIMIC-CXR dataset demonstrate that STAR3 outperforms current retrieval-based approaches on retrieval, NLP and clinical metrics, highlighting the value of conditioning retrieval anatomically, temporally and clinically for advancing automated radiology report generation.

SFKD: Spatial--Frequency Joint-Aware Heterogeneous Knowledge Distillation via Multi-Level Wavelet Spectral Interaction

3/10

Cuipeng Wang, Haipeng Wang

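Personalized recommendation rationale:

This paper concerns a spatial-frequency joint-aware method for knowledge distillation, a model compression and efficiency technique. Knowledge distillation can be used to make recommendation or search models lighter, but the title does not point to recommendation, search, or advertising, and the core innovation (multi-level wavelet spectral interaction) has no direct link to key recommendation problems such as sequence modeling and feature interaction.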

2026-07-02 09:05:38 | arXiv:2607.01906v1 |

cs.CV

Most existing knowledge distillation methods focus on homogeneous models (e.g., CNN-to-CNN), thereby overlooking the flexibility and potential of knowledge transfer across heterogeneous models. Due to intrinsic inductive bias discrepancies between heterogeneous models that cause spatial distribution inconsistencies, prior heterogeneous distillation methods often weaken or discard spatial information in heterogeneous representations. However, the spatial information in representations often encodes transferable global structural semantics as well as architecture-specific local details, and therefore should not be directly ignored. To better leverage the spatial information encoded in heterogeneous representations, we propose a Spatial-Frequency Joint-Aware Heterogeneous Knowledge Distillation framework (SFKD). By leveraging the complementary properties of wavelet transform spatial locality and Fourier representations in characterizing global energy distributions, we first apply multi-level discrete wavelet transform to explicitly decouple spatial information. The resulting wavelet sub-bands are further refined by a dual-stream dual-stage refinement module, and finally combined with a Gaussian-filtered frequency loss to selectively capture informative global information. Extensive experiments on multiple benchmark datasets under both homogeneous and heterogeneous models demonstrate the superiority of our method.

ReQuest: Rethinking-based Question-Aware Frame Selection for Long-Form Video QA

3/10

Minkuk Kim, Suyong Yun, Young Tae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim

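Personalized recommendation rationale:

This paper addresses frame selection for long-form video question answering, a video understanding topic with no direct connection to the core tasks of recommendation, search, or advertising (such as user behavior modeling and content matching). Its "question-aware" idea might inspire multimodal feature selection, but the overall technical approach is not directly applicable to RecSys/Search/Ads scenarios.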

2026-07-02 05:46:42 | arXiv:2607.01737v1 |

cs.CV

Recent multimodal large language models (MLLMs) have substantially advanced video understanding, yet long-form video QA remains challenging under fixed input token budgets, where uniform sampling can be inefficient for evidence localization. We propose ReQuest, an uncertainty-driven, question-adaptive keyframe selection pipeline that aligns question intent with relevant video content through selective computation. ReQuest integrates (i) a lightweight question-aware selector distilled from MLLM-generated supervision, (ii) Re-thinking Routing that triggers additional inference only when the model is uncertain with a length-adaptive criterion, and (iii) uncertainty-guided adaptive non-maximum suppression that selects temporally diverse frames while adjusting spacing based on question difficulty. As a plug-and-play method, ReQuest improves long-video QA without modifying or fine-tuning the underlying MLLM. Experiments on Video-MME, MLVU, and LongVideoBench demonstrate consistent accuracy gains with competitive computational cost, with particularly strong improvements in medium and long video regimes.

LASER: A Corrective Lens for LVLMs via Visual Attention Preservation and Sink Suppression

3/10

Bowen Yuan, Zijian Wang, Yadan Luo, Shijie Wang, Zi Huang

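Personalized recommendation rationale:

This paper focuses on improving the visual attention mechanism in large vision-language models (LVLMs). Advances in vision-language models may indirectly inspire the modeling of multimodal or multi-behavior sequences in recommender systems, but the paper targets vision-language tasks directly, has low relevance to recommendation, search, and advertising, and points to no clear application.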

2026-07-02 04:59:56 | arXiv:2607.01707v1 |

cs.CV

Large vision-language models (LVLMs) exhibit strong reasoning ability but suffer from visual forgetting during long-horizon decoding, where attention progressively drifts away from visual evidence. Existing methods largely treat this issue as a late-stage attention decay problem or attempt to mitigate it through heuristic reminders or post-hoc attention lifting. Through systematic empirical analysis, we find that performance degradation under visual forgetting is largely driven by two overlooked factors: early-stage attention decay that disrupts evidence acquisition, and attention concentration on a subset of task-irrelevant visual sink tokens. Motivated by these insights, we propose LASER, a post-training framework that regulates both the visual attention trajectory and intra-visual token attention distribution during reasoning. Technically, LASER introduces two complementary rewards: a Visual Grounding Reward, which encourages the model to maintain attention on semantically salient visual tokens throughout decoding, and a Sink Suppression Reward, which penalizes excessive attention concentration on visual sink tokens. Together, these rewards preserve early-stage grounding while preventing attention collapse onto uninformative regions. Extensive experiments on eight benchmark datasets demonstrate that LASER consistently outperforms strong baselines, validating attention-aware training as an effective remedy for visual forgetting.

Boundary-Aware Quantization: Finite-Scale Decision Geometry of Neural Classifiers

3/10

O. M. Kiselev

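Personalized recommendation rationale:

This paper concentrates on decision boundaries and quantization in neural classifiers. It is a theoretical analysis with no direct connection to the core techniques of recommender systems, search, or advertising (such as retrieval, ranking, and feature interaction). Quantization can be used indirectly for model compression, but the paper lacks a clear RecSys/Search/Ads application scenario, so its relevance is low.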

2026-07-01 21:19:06 | arXiv:2607.01478v1 |

math.OC cs.CV cs.LG

We measured quantization-induced decision-boundary changes using local logit-margin radii, first-order boundary displacement, normal variation, slice-boundary Jaccard distance, grid prediction changes, multiclass junction counts, and low-margin boundary-band flips. On the digits benchmark, 8-bit weight quantization preserved all test labels while producing boundary-mask Jaccard $0.428$ on the PCA slice; at 4 bits, accuracy remained $0.9733$, while boundary Jaccard rose to $0.970$ and median local boundary shift reached $0.0290$. Interpolation between adjacent quantization levels localized the visible reconfigurations at multiclass junctions, with 12, 34, and 17 triple-junction cells in the selected transitions. Calibration-to-test stopping reduced the digits held-out flip rate from $0.0094$ to $0.0022$ and boundary Jaccard from $0.825$ to $0.524$; the same stopping rule also reduced flips on MNIST and Fashion-MNIST. On official CIFAR-10 subsets, PTQ-W selected by accuracy gave 6-bit flip $0.0367$ and boundary Jaccard $0.184$, whereas boundary-aware stopping selected 8-bit flip $0.0083$ and boundary Jaccard $0.048$. On full CIFAR-10 with three seeds, 6-bit PTQ-W lost $0.0029$ accuracy relative to float, changed $5.3\%$ of held-out decisions, and changed $24.5\%$ of low-margin boundary-band decisions. A fixed-bit boundary-gap rounding term changed the trade-off at 4 bits by reducing boundary Jaccard from $0.457$ to $0.435$ and boundary-band pair-order flip from $0.3600$ to $0.3558$, with an accuracy trade-off; the 3-bit stress test exposed the tuning limit of this surrogate. Calibration boundary Jaccard predicted held-out boundary Jaccard across PTQ-W and optimized rounding variants with $r=0.947$--$0.994$.

Audio-Based Understanding of Audiobook Narration Appeal

2/10

Shahar Elisha, Mariano Beguerisse-Díaz, Emmanouil Benetos

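Personalized recommendation rationale:

This paper focuses on the appeal of audiobook narration, a specific area of audio content analysis with no direct connection to the core techniques of recommender systems, search, or advertising. Audio recommendation comes to mind, but the work lacks generality and contributes nothing to core ranking or user behavior modeling, so its relevance is low.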

2026-07-02 17:43:05 | arXiv:2607.02473v1 |

cs.CL cs.SD eess.AS

Narration is central to the audiobook listening experience, shaping how listeners engage with and understand the content. This work explores how narration qualities shape an audiobook's appeal, noting that their effects can vary by genre, title, and audience. We extract vocal and acoustic features (e.g., tone, pace, loudness) from LibriVox using pre-trained audio models and analyse their relationship with consumption data (specifically, view-rate) and their interplay with genre and title. Despite limited consumption data, we find that acoustic information alone has a robust association with appeal, even after accounting for title effects. We further validate these findings using more nuanced proprietary engagement metrics. To our knowledge, this is the first systematic computational study linking narration qualities, genre, title, and audiobook consumption, highlighting the potential of data-driven insights to improve audiobook personalisation and narrator casting.

Will Scaling Improve Social Simulation with LLMs?

2/10

Caleb Ziems, William Held, Su Doga Karaca, David Grusky, Tatsunori Hashimoto, Di...

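Personalized recommendation rationale:

This paper examines the use of LLMs in social simulation, which is an LLM application area, but social simulation is only weakly related to recommendation, search, and advertising. It may involve user behavior modeling, but its core scenario does not map directly to the practical business needs of RecSys/Search/Ads.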

2026-07-02 17:30:38 | arXiv:2607.02464v1 |

cs.CL

Large Language Model (LLM) social simulations are a promising research method, but they are not yet faithful enough to be adopted widely. In this work, we investigate whether the current scaling paradigm in language modeling is likely to close these gaps, or whether simulation fidelity is orthogonal to general capabilities and therefore deserving of more research attention. We use scaling laws to study the relationship between LLMs' compute scale, general capability benchmarks, and the fidelity of social simulation in three representative sub-domains: opinion modeling, behavioral simulation, and longitudinal forecasting. Surprisingly, we discover strong compute scaling in all three settings, using a suite of 85 transformer LLMs with the Qwen3 architecture pre-trained on the DCLM web text corpus under fixed-compute budgets from $10^{18}$ to $10^{20}$ FLOPs. Then we evaluate 35 larger and more capable open-weight models up to 70B parameters, allowing us to predict downstream accuracy from loss. This reveals that the majority of behavioral and opinion simulation tasks will rapidly improve with scale, particularly when they involve populations that are well-represented in English web corpora. Longitudinal forecasting and underrepresented opinions scale more slowly, especially when they are less correlated with general knowledge and reasoning benchmarks like MMLU. In behavior simulation, scaling fails to improve model calibration with human cognitive biases like risk aversion, as well as human heuristics like learning correlated rewards from related tasks. On these tasks, even fine-tuned models fail to noticeably scale up performance from 0.5B to 8B parameters. Taken together, we conclude that scale will improve social simulations in most settings, but outliers exist, and improvements will be less reliable in low-resource domains.

Language Models as Measurement Apparatus for Culture

2/10

Kent K. Chang

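Personalized recommendation rationale:

This paper explores the use of language models for measuring culture, which belongs to social science or computational humanities and has very little connection to the core techniques of recommender systems, search, or advertising. It may involve language models, but it shows no clear potential for user behavior modeling, sequence prediction, or multimodal fusion, so its relevance is low.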

2026-07-02 17:25:55 | arXiv:2607.02459v1 |

cs.CL

Language models are increasingly used to quantify cultural phenomena, but what makes such measurement distinctively cultural? This paper argues that NLP work on culture is a material-discursive practice: the apparatus -- model, data, annotation, evaluation -- participates in constituting the cultural reality it measures, rather than passively recording it. Drawing on Karen Barad's concept of the agential cut -- the contingent boundary between phenomenon and instrument -- I show that the apparatus's substantive design choices draw such boundaries, and that the boundary is entangled from the start because language models have already internalized much of the cultural material they measure. I illustrate this through three case studies on television and film dialogue (measuring structure, interaction, and deviation) and three examinations of the apparatus itself (erasure of cultural markers, attunement to historical material, and agency in an agentic workflow). This big picture analysis proposes a research program that is theory-driven, empirically rigorous, and culturally contingent, treating each agential cut as a conscious commitment, at once methodological and ethical.

EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments

2/10

Zhilin Wang, Han Song, Runzhe Zhan, Jusen Du, Jiacheng Chen, Tianle Li, Qingyu Y...

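Personalized recommendation rationale:

This paper focuses on evaluating autonomous policy evolution, which falls under reinforcement learning and evolutionary algorithms and has no direct connection to core techniques in search, recommendation, or advertising. Interactive environments may involve user behavior simulation, but the paper makes no clear case for application to recommender systems or ad ranking, and the title does not mention key technologies such as LLMs, Transformers, or multimodality.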

2026-07-02 17:10:13 | arXiv:2607.02440v1 |

cs.AI cs.CL

Autonomous agents are increasingly expected to improve executable policies through feedback, yet existing evaluations often collapse this process into a final score or confound it with open-ended software-engineering progress. We introduce Autonomous Policy Evolution, a controlled evaluation setting in which a harness-model agent repeatedly edits an executable policy system under a fixed interaction budget. We instantiate this setting in EvoPolicyGym, a benchmark built from compact interactive RL environments that evaluates how agents iteratively improve explored policies. On the EvoPolicyGym suite, GPT-5.5 achieves the strongest aggregate rank score and top-two performance on all 16 environments. Beyond leaderboard results, EvoPolicyGym also provides trajectory-level diagnostics that distinguish how agents allocate budget, convert feedback into parametric tuning. These analyses show that strong autonomous policy evolution depends not only on isolated task wins, but on discovering task-appropriate mechanisms and refining policies under bounded feedback.

World Wide Models: Literary Tools for Cultural AI

2/10

Nina Begus

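Personalized recommendation rationale:

The title centers on literary tools and cultural AI; its core is an application in the humanities rather than recommender systems, search, or advertising. Although it mentions "models", it offers no clear technical detail or link to LLMs, Transformers, or recommender systems. Cultural AI may involve language models, but the paper does not directly explain any potential application to RecSys/Search/Ads.

2026-07-02 16:12:07 | arXiv:2607.02369v1 |

cs.CL cs.AI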

LLMs stage a new form of cultural encounter that is massive, automated, and monolingual. Literary disciplines have always negotiated cultural struggles with comparative reading of literature, narratological and poetic analysis, critical theory, world literature, and translation. These tools have now become indispensable for building culturally literate AI. The essay develops a layered framework toward more nuanced textual models and pluralistic interpretations of AI, emphasizing the natural intersections of literature and AI development, connecting current debates in critical theory with structural monolingualism, and suggesting a new application of world literature approaches to address global AI textuality through macrostructure, circulation, and untranslatability.

BamiBERT: A New BERT-based Language Model for Vietnamese

2/10

Dat Quoc Nguyen, Thinh Pham, Chi Tran, Linh The Nguyen

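Personalized recommendation rationale:

This paper proposes a BERT model for Vietnamese, an application of language models to a multilingual or language-specific setting, but it does not address core problems or techniques in recommender systems, search, or advertising. It offers no innovation or insight directly relevant to RecSys/Search/Ads, so its relevance is very low.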

2026-07-02 14:46:54 | arXiv:2607.02259v1 |

cs.CL

In this paper, we introduce BamiBERT, a new BERT-based pre-trained language model for Vietnamese that addresses key limitations of PhoBERT -- the current de facto Vietnamese text encoder. Trained from scratch on a 129GB corpus of general-domain Vietnamese text for 20 epochs, BamiBERT supports an extended context length of up to 2048 tokens and operates directly on raw input, eliminating the need for external word segmentation. Across 8 Vietnamese benchmarks, it achieves the best score on 11 of 15 metrics and the second-best on 3 others, setting a new state of the art among "base"-sized Vietnamese encoders and demonstrating strong cross-domain generalization. We release BamiBERT at: https://huggingface.co/Qualcomm-AI-Research/BamiBERT

Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning

2/10

Congrui Du, Yang Zhang, Kaizhi Qian, Shiyu Chang

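Personalized recommendation rationale:

This paper focuses on speech-text instruction following, which belongs to speech language modeling and has no direct connection to core techniques in recommender systems, search, or advertising (such as ranking, recall, or feature modeling). Although speech input may have potential in search or conversational recommendation, the paper's core is speech understanding and instruction following rather than modeling improvements for recommendation or search, so its relevance is low.

2026-07-02 14:22:46 | arXiv:2607.02214v1 |

cs.CL eess.AS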

Instruction tuning for speech language models (SLMs) is substantially more challenging than for text-based large language models (LLMs), as it requires learning a new modality and a wide range of speech-specific instructions in addition to those supported by text LLMs. Existing SLM training approaches largely replicate the text LLM training paradigm by synthesizing large-scale speech pre-training and instruction-tuning datasets. However, this strategy is difficult to scale, since speech sequences are significantly longer than text sequences. In this paper, we propose SpeechCombine, an instruction-following speech language model trained without any instruction tuning, using only a single round of speech pre-training on 30k hours of data. Starting from a text LLM base model, we perform continuous pre-training on speech utterances to obtain a speech-adapted model, and then directly combine its weights with the weight difference between the instruction-tuned and base versions of the text LLM. Our results show that this simple combination strategy not only preserves the knowledge and capabilities of the original text LLM, but also effectively transfers them to the speech domain. These findings suggest a new direction for SLM training that avoids reliance on massive speech data.
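
The weight combination described in the abstract can be sketched as simple parameter arithmetic. The snippet below is a minimal illustration, not the authors' implementation: it assumes all three checkpoints share parameter names and shapes, represents each checkpoint as a plain dict of float lists, and uses a hypothetical `scale` factor (set to 1.0) that the abstract does not mention.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # hypothetical flat checkpoint format


def combine_weights(speech: Weights, instruct: Weights, base: Weights,
                    scale: float = 1.0) -> Weights:
    """Return speech + scale * (instruct - base), parameter by parameter."""
    combined: Weights = {}
    for name, speech_param in speech.items():
        # Weight difference between the instruction-tuned and base text LLM.
        delta = [i - b for i, b in zip(instruct[name], base[name])]
        combined[name] = [s + scale * d for s, d in zip(speech_param, delta)]
    return combined


base = {"layer.w": [1.0, 2.0]}       # base text LLM (toy values)
instruct = {"layer.w": [1.5, 1.0]}   # instruction-tuned text LLM (toy values)
speech = {"layer.w": [0.0, 4.0]}     # speech-adapted model (toy values)
merged = combine_weights(speech, instruct, base)
```

In a real setting the same arithmetic would run over full tensors; the dict-of-lists form is only meant to make the operation explicit.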

Bayesian Sparse Low-Rank Adaptation for Large Language Model Uncertainty Estimation

2/10

Jijie Zhang, Zhe Ren, Quan Zhang, Dandan Guo

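Personalized recommendation rationale:

This paper mainly addresses uncertainty estimation for LLMs, a technique for evaluating and calibrating the LLM itself, with no direct connection to core areas of recommendation, search, or advertising (such as ranking, retrieval, or user modeling). Although low-rank adaptation (LoRA) is an important LLM fine-tuning technique, the paper focuses on uncertainty estimation rather than direct application to recommender systems, so its relevance is low.

2026-07-02 13:52:12 | arXiv:2607.02182v1 |

cs.LG cs.CL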

Large language models (LLMs) exhibit remarkable reasoning capabilities, but their task-specific fine-tuning is notoriously plagued by overconfidence, severely hindering trustworthy deployment. We propose Data-Adaptive Lower-Rank Adaptation (DALorRA), a simple and effective variational Bayesian sparse framework that shifts the paradigm of uncertainty quantification from the dense parameter space to the lightweight rank level of low-rank adaptation (LoRA). With the insight that LoRA essentially aggregates multiple rank-one components that may provide superfluous model capacity, DALorRA imposes stochastic masking on rank dimensions, enabling Bayesian regularization of model capacity during training and ensemble-like calibration during inference. Extensive experiments demonstrate DALorRA's excellent calibration of LLMs without compromising reasoning accuracy.

Object Aligner: A Configurable JSON Schema Similarity Score for Graphs, Applied to LLM Prompt Optimization

2/10

Jan Drchal

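Personalized recommendation rationale:

This paper mainly concerns JSON Schema similarity scoring and graph structures, applied to LLM prompt optimization. Although it involves LLMs, its core is a prompt-optimization technique rather than a direct application to recommendation, search, or advertising systems. It shows no clear potential or cross-domain application for RecSys/Search/Ads.

2026-07-02 10:07:34 | arXiv:2607.01972v1 |

cs.CL cs.AI cs.LG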

Large language models (LLMs) are often asked to produce JSON conforming to a fixed schema, powering information extraction, tool calling, agentic planning, and knowledge-graph construction. Measuring how closely an output matches a gold reference is essential yet surprisingly hard: exact match is brittle, text similarity ignores structure, and an LLM judge is expensive, opaque, and non-deterministic. We address this with Object Aligner (OA), an open-source Python library that scores two JSON objects deterministically by recursively aligning their trees (the Hungarian algorithm for unordered collections, sequence alignment for ordered ones) and awarding partial credit at the granularity the schema declares. The Object Aligner is configured entirely through a set of JSON Schema extensions, so adapting it to a new task involves annotating a schema rather than writing code. Complex structured data, however, are rarely flat trees: records may form graphs or hypergraphs keyed by arbitrary identifiers, breaking the assumptions of prior similarity metrics. Our central contribution, referential alignment, closes this gap by inferring a bijection between gold and candidate identifiers and scoring every reference through it, so the score is invariant to relabeling. Since recovering this bijection exactly is graph isomorphism, the Object Aligner approximates it with Weisfeiler-Leman color refinement. An order-sensitive sequence regime targets ranking and planning. Since the same alignment localizes every mismatch, the Object Aligner emits ranked repair suggestions at no extra cost. Used as a reward inside the GEPA prompt optimizer, Object Aligner helps or stays neutral across all datasets.
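
Weisfeiler-Leman color refinement, which the abstract names as the approximation used for matching identifiers, can be sketched in a few lines. This is a generic textbook version of 1-WL on an undirected graph, not Object Aligner's code; the adjacency-dict input format and the function name `wl_colors` are assumptions made for illustration.

```python
from typing import Dict, Hashable, List


def wl_colors(adj: Dict[Hashable, List[Hashable]], rounds: int = 3) -> Dict[Hashable, int]:
    """1-WL color refinement: nodes with the same final color are structurally alike."""
    colors = {node: 0 for node in adj}  # start with one uniform color
    for _ in range(rounds):
        # A node's signature is its own color plus the sorted multiset of neighbor colors.
        signatures = {
            node: (colors[node], tuple(sorted(colors[n] for n in adj[node])))
            for node in adj
        }
        # Relabel signatures with small integers in a canonical (sorted) order.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {node: palette[signatures[node]] for node in adj}
        if new_colors == colors:
            break  # refinement has stabilized
        colors = new_colors
    return colors


# The same path graph under two different identifier labelings.
gold = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
cand = {"x": ["y"], "y": ["x", "z"], "z": ["y"]}
gold_colors = wl_colors(gold)
cand_colors = wl_colors(cand)
```

Because the colors depend only on structure, the middle nodes `"b"` and `"y"` receive the same color even though their identifiers differ, which is the kind of signal a relabeling-invariant matcher can build on.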

Beyond Supervised Clarification: Input Rewriting with LLMs for Dialogue Discourse Parsing

2/10

Yiming Liu, Ziyue Zhang, Zhichao Xu, Xin Yu, Yingheng Tang, Tianyu Jiang, Jie Ca...

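Personalized recommendation rationale:

This paper focuses on dialogue discourse parsing, an NLP task with no direct connection to core techniques in recommender systems, search, or advertising. Although LLMs are used for input rewriting, there is no clear path to applying this to user modeling or feature interaction, so its relevance is low.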

2026-07-02 09:57:52 | arXiv:2607.01964v1 |

cs.CL

Rewriting inputs to improve frozen downstream models has become a common strategy in modern NLP pipelines. Prior work on incremental dialogue discourse parsing (DDP) shows that supervised clarification models can rewrite fragmentary or underspecified utterances, such as resolving ellipsis or references, to improve parsing accuracy. In this work, we revisit this idea under realistic deployment conditions, where no clarification supervision is available and the clarifier must rely on zero-shot prompting or feedback from a frozen parser. Across three Segmented Discourse Representation Theory (SDRT) datasets and multiple parsers, we find that last-utterance clarification is far less reliable than suggested by supervised settings. Parser-agnostic rewriting often introduces more regressions than repairs, as edits that enable fixes also disrupt discourse cues relied upon by the parser. A best-of-8 rewriting analysis further reveals a practical ceiling: a large fraction of errors are not repairable through input rewriting alone. A parser-aware clarifier trained with GRPO reduces regressions by up to 37% by learning conservative abstention, yet still fails to produce selectivity-aware clarifications that consistently improve parsing. Together, these findings recast clarification as a selective intervention problem. We identify rewritability prediction, deciding whether an utterance is repairable before intervention, as the key missing capability for input-side optimization of frozen discourse parsers, and a critical direction for improving agentic pipelines more broadly.

NAVER LABS Europe Submission to the Instruction-following 2026 Short Track

2/10

Marcely Zanon Boito, Hemant Yadav, Jean-Luc Meunier, Ioan Calapodescu

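Personalized recommendation rationale:

The title indicates a competition submission on instruction following, a core NLP task with no direct connection to RecSys/Search/Ads. Although instruction-following ability may help search, the paper has no clear domain-application focus, and the title mentions no recommendation, search, or advertising techniques.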

2026-07-02 09:52:08 | arXiv:2607.01960v1 |

cs.CL

In this paper, we describe NAVER LABS Europe's submission to the instruction-following speech processing short track at IWSLT 2026. We participate again in the constrained setting, developing systems capable of jointly performing ASR, ST, and SQA from English speech into Chinese, Italian, and German. Building on our previous submission, ranked first in last year's short track, we update our multi-stage training pipeline by replacing the speech projector with SpeechMapper, a method for learning a speech-to-LLM embedding projector using only ASR data. In addition, we introduce a synthetic SQA dataset, fakACL, composed of artificially generated scientific presentations. This dataset is built by prompting the LLM backbone, segmenting the generated talks, and synthesizing speech with SeamlessM4T-large-v2. The combination of an improved speech projection mechanism and domain-specific synthetic data allows our model to outperform last year's best short-track system, while being considerably more compact and relying on a weaker LLM backbone. This year's results place our system tied for first place in the overall short track ranking.

PairCoder++: Pair Programming as a Universal Paradigm for Verified Code-Driven Multimodal and Structured-Artifact Generation

2/10

Junhao Chen, Xiang Li, Mingjin Chen, Boran Zhang, Henghaofan Zhang, Yibin Xu, Yu...

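Personalized recommendation rationale:

This paper mainly concerns code generation and multimodal content generation, which fall under AIGC and LLM applications, but it does not clearly target recommender systems, search, or advertising. Although code-generation techniques might be used indirectly for feature engineering or model implementation, the paper lacks direct relevance, hence the low score.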

2026-07-02 08:36:02 | arXiv:2607.01883v1 |

cs.CL

Code is the medium through which large language models generate structured artifacts: charts, scientific figures, vector graphics, CAD models, 3D scenes, and hardware designs are all produced by writing programs. In this regime, single-pass inference is brittle, because the compiler, renderer, or simulator that decides whether the artifact exists is invisible to the model. We present PairCoder, which grounds review in the toolchain and realizes it as two-agent pair programming: a Driver agent writes the program, a Navigator agent reviews it against verification evidence (diagnostics, execution results, and renderings of the current artifact beside the target), and the two switch roles when errors persist. Across 17 public benchmarks and seven models from three vendors, PairCoder improves essentially every benchmark whose artifact is verifiable, on full official metric suites rather than execution alone (for example, Blender scene executability 0.20 to 0.78; TikZ compile rate up 10 to 30 points on every model), at 2.9 to 9.2 times single-model cost (about 7 times overall). The improvements concentrate where the toolchain provides an informative oracle and the baseline leaves headroom, and the method ties or mildly regresses where the oracle is weak; we frame pair programming as a reliable recipe for verified code-driven generation.

Do LLMs Truly Generalize in the Molecular Domain? A Perturbation-Based Analysis

2/10

Jiatong Li, Weida Wang, Changmeng Zheng, Shufei Zhang, Yatao Bian, Xiao-yong Wei...

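Personalized recommendation rationale:

This paper focuses on LLM generalization in the molecular domain, a domain-specific application in biology and chemistry that is unrelated to RecSys/Search/Ads. Although it involves LLMs, it shows no clear potential for cross-domain application, so its relevance is low.

2026-07-02 07:16:28 | arXiv:2607.01800v1 |

cs.LG cs.CL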

Large Language Models (LLMs) have recently shown promise in molecular discovery, yet a gap remains between their probabilistic nature over discrete sequential tokens and the rigid topological constraints of chemical space. This raises the question of whether molecular LLMs can generalize beyond the local neighborhoods induced by their sequence-based representations. To systematically investigate this question, we introduce a Molecular Perturbation framework that generates syntax-valid structural variants of training molecules under controlled Graph Edit Distance (GED) to probe the manifold regularity of molecular LLMs. Our analysis shows that even a single edit can cause substantial performance drops on common molecular tasks, revealing a narrow local trust region and fragile sensitivity to structural changes. Since similar molecules tend to exhibit similar properties, In-Context Tuning (ICT), which anchors predictions on structurally similar molecules, offers a natural way to mitigate such fragility. Our experiments also examine whether ICT confers robustness under controlled structural perturbations, and the results suggest that it can partially expand the local trust region and offer a promising direction for stabilizing molecular LLMs against structural variation.

Subliminal Clocks: Latent Time Modelling in Diffusion Language Models

2/10

Maximo Rulli, Thomas Fontanari, Simone Petruzzi, Federico Alvetreti, Giorgio Str...

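Personalized recommendation rationale:

This paper mainly concerns time modelling in diffusion language models, an advance in generative modelling with no direct connection to the core tasks of recommender systems, search, or advertising (such as ranking, recall, or matching) and no clear application potential. Although diffusion models may be used for generation, the title points to no specific recommendation or search scenario, so its relevance is low.

2026-07-02 06:45:42 | arXiv:2607.01774v1 |

cs.AI cs.CL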

Diffusion Language Models (DLMs) have recently emerged as a promising alternative to autoregressive models. Unlike standard diffusion-based approaches, DLMs are not explicitly conditioned on a timestep, raising a natural question: do these models internally represent denoising progress, and how is such information used downstream? In this work, we show that DLMs do in fact encode a latent representation related to the diffusion timestep within their residual streams. We find that this signal can be reliably extracted using probes across layers, indicating that denoising progress is decodable from internal activations. We further demonstrate that steering the model along a low-dimensional subspace associated with the inferred timestep allows us to systematically modulate its notion of denoising progress, leading to predictable changes in model confidence and entropy. Finally, we analyse the geometry of the identified representation, showing that it exhibits structured and interpretable properties in activation space, and shedding light on how such a signal is processed by these models.

ADVENT: LLM-Driven Automatic Predicate Invention for ILP

2/10

Tingting Yu, Pei-Cing Huang, Chan Hsu, Chan-Tung Ku, Yihuang Kang

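Personalized recommendation rationale:

This paper focuses on automatic predicate invention in Inductive Logic Programming (ILP), which belongs to symbolic reasoning. Although it uses LLMs, its core goal is to improve logic programming rather than to apply directly to recommendation, search, or advertising systems. It shows no clear potential for RecSys/Search/Ads applications, so its relevance is low.

2026-07-02 01:33:45 | arXiv:2607.01585v1 |

cs.LO cs.AI cs.CL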

Predicate invention (PI), the creation of new predicates to extend the hypothesis space, remains a critical bottleneck in Inductive Logic Programming (ILP). Existing methods rely on domain expertise and produce semantically opaque predicates, hindering adaptation to unfamiliar domains and cross-task reuse. We present ADVENT, an LLM-driven PI mechanism for ILP. ADVENT pairs LLM abductive generation with Prolog deductive verification, forming an iterative loop in which concrete execution results guide the LLM to refine candidate predicates. The mechanism leverages Large Language Models to identify implicit patterns in structured relational data and invent auxiliary predicates with meaningful names and definitions. Invented predicates and learned rules accumulate in a knowledge pool for cross-task reuse. Experiments on nine poker-hand concepts across seven LLMs show that LLM-driven PI achieves 58% success rate where ILP alone fails entirely, formal verification raises this to 80%, and the knowledge pool yields gains up to +31 percentage points, while producing human-interpretable rules. These results suggest that ADVENT offers a promising direction for automating predicate invention and enabling cross-task knowledge reuse in ILP.

Beyond Skepticism: Evaluating LLMs Pedagogical Intent Reasoning with the Adaptive Pedagogical Vigilance Framework

2/10

Minghao Chen, Ruihan Zhou, Jiayi Tang, Zihan Xu, Bowen Huang, Yuxin Liu

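Personalized recommendation rationale:

This paper focuses on LLM intent reasoning in teaching scenarios, applied research in education with very little connection to recommender systems, search, or advertising. Although it involves LLMs, it shows no clear potential to transfer to the core techniques or methods of RecSys/Search/Ads.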

2026-07-02 01:26:09 | arXiv:2607.01581v1 |

cs.CL

The capacity of Large Language Models (LLMs) to reason about pedagogical intent within instructional communication remains underexplored, particularly in educational domains such as translation pedagogy. To address this, we propose the Adaptive Pedagogical Vigilance (APV) framework, a novel computational formalism that reframes communicative vigilance as an adaptive mechanism for optimizing learning through intent inference. APV formalizes the problem via a Bayesian Pedagogical Intent Inference Engine (PIIE), which models how instructors select content to maximize pedagogical utility and how vigilant learners should inversely reason about latent instructional configurations -- encompassing genre, stance, and incentives. We evaluate APV through a three-tier hierarchy: distinguishing instructional genre, reasoning about structured pedagogical setups, and generalizing to authentic educational discourse. Experiments on leading LLMs (e.g., GPT-4o, Claude 3.5) show that APV substantially improves model vigilance. It achieves the strongest discrimination between pedagogical and exposure-based content, correlates highly with human judgments ($r=0.958$), and maintains robust performance on naturalistic data where baseline methods degrade. This work establishes a unified framework for assessing and enhancing LLMs' understanding of pedagogical motives, advancing the development of more reliable AI-assisted learning systems.

DiPS: Dialogue Policy Selection for High-Stakes Persuasion Agents

2/10

Tianyi Zhang, Mousumi Das, Abrar Anwar, Jesse Thomason, David Traum

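Personalized recommendation rationale:

This paper focuses on policy selection in dialogue systems, which belongs to conversational AI and has no direct connection to core techniques in recommendation, search, or advertising. Although persuasive dialogue may have potential applications in advertising and marketing, the title does not explicitly involve recommendation or advertising scenarios, and the paper pays no attention to large-scale systems, so its relevance is low.

2026-07-02 00:24:48 | arXiv:2607.01557v1 |

cs.CL cs.AI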

Large Language Models (LLMs) often struggle with persuasion in high-stakes scenarios. People's individual personalities and concerns require tailored strategies rather than a one-size-fits-all approach. To address this challenge, we focus on a fire-rescue scenario in which an operator must persuade a resident to evacuate as a high-stakes persuasion domain and propose Dialogue Policy Selection (DiPS), a Q-learning framework to dynamically select persuasion strategies adapted to the evolving conversational context. Specifically, we train a critic, optimized to maximize the chance of evacuation success, to select a persuasion policy at each turn based on the resident's recent utterances. We then evaluate DiPS against multiple baselines in both simulated and real human interactions. We find that DiPS achieves higher evacuation success than a zero-shot LLM and a generic RAG-augmented approach.
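
As a rough illustration of Q-learning over persuasion strategies, the sketch below keeps a tabular Q-function over (state, strategy) pairs. It is not the DiPS implementation: the state labels, strategy names, reward values, and hyperparameters are all hypothetical, and the paper's critic presumably operates on conversational context rather than a small lookup table.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

STRATEGIES: List[str] = ["empathy", "urgency", "authority"]  # hypothetical strategy names


class StrategySelector:
    def __init__(self, alpha: float = 0.5, gamma: float = 0.9) -> None:
        self.alpha = alpha  # learning rate
        self.gamma = gamma  # discount factor
        self.q: DefaultDict[Tuple[str, str], float] = defaultdict(float)

    def select(self, state: str) -> str:
        """Greedy choice: the strategy with the highest Q-value in this state."""
        return max(STRATEGIES, key=lambda s: self.q[(state, s)])

    def update(self, state: str, strategy: str, reward: float, next_state: str) -> None:
        """One-step Q-learning update toward reward + gamma * max_a Q(next_state, a)."""
        best_next = max(self.q[(next_state, s)] for s in STRATEGIES)
        target = reward + self.gamma * best_next
        self.q[(state, strategy)] += self.alpha * (target - self.q[(state, strategy)])


selector = StrategySelector()
# Hypothetical experience: "urgency" led a hesitant resident to agree to evacuate (reward 1).
selector.update("hesitant", "urgency", reward=1.0, next_state="evacuating")
choice = selector.select("hesitant")
```

The selection here is purely greedy; a training loop would normally add exploration (for example epsilon-greedy) before relying on the learned values.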

MultAttnAttrib: Training-Free Multimodal Attribution in Long Document Question Answering

2/10

Dang Quang Thien Tran, Quang V. Dang, Vinamra Tyagi, Sai Soorya Rao Veeravalli, ...

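Personalized recommendation rationale:

This paper focuses on multimodal attribution in long-document question answering, which belongs to NLP and document understanding and is only weakly connected to core techniques in recommender systems, search, or advertising. The title reflects no directly relevant direction such as user-behavior modeling, feature interaction, or large-scale system efficiency, so its relevance is low.

2026-07-01 19:29:19 | arXiv:2607.01420v1 |

cs.CL cs.AI cs.CV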

As grounded QA systems are increasingly deployed in AI assistants, accurately attributing generated answers to evidence is critical for user trust and model safety. While unimodal attributions have been explored in depth, the multimodal setting remains relatively under-researched. As a result, we introduce MultAttnAttrib, a training-free attribution-generation method that leverages a model's prefill pass, selected attention heads, and calibrated thresholds to locate source evidence within a document. To establish baseline results for the method, we introduce MultAttrEval, a complementary benchmark dataset annotated with fine-grained, ground-truth attributions for answer components grounded in multimodal source documents. To our knowledge, this is the first evaluation dataset designed specifically for multimodal attribution in long-form documents. Experimental results show that MultAttnAttrib consistently outperforms a variety of attribution-generation methods, including several strong prompting-based approaches and matches the latest frontier models such as GPT 5.4. Our method not only substantially improves attribution accuracy for both unimodal and multimodal attribution types, but also produces attributions at up to one-seventh of the direct inference latency compared to prompting on the same base model.

WorldDirector: Building Controllable World Simulators with Persistent Dynamic Memory

2/10

Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Qingyan Bai, Ka Leong Cheng, Yue ...

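Personalized recommendation rationale:

This paper focuses on building world simulators, which belongs to general AI or graphics and has no direct connection to the core tasks of recommender systems, search, or advertising. Although persistent dynamic memory may offer ideas for sequence modeling, the target application differs substantially from core functions such as ranking and retrieval, and there is no clear application scenario, so its relevance is low.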

2026-07-02 17:59:59 | arXiv:2607.02517v1 |

cs.CV

We present WorldDirector, a highly controllable video world model framework designed for persistent dynamic object memory and unrestricted viewpoint exploration. Unlike existing world models that entangle physical dynamics with pixel rendering and rely on continuous visual observation to sustain motion, our framework explicitly decouples semantic motion orchestration from visual generation. By leveraging an LLM to coordinate 3D trajectories with camera movements and subsequently employing these orchestrated trajectories as control signals for video generation, our approach ensures strict physical logic and appearance stability, successfully preserving the exact visual identities of dynamic entities even when they re-enter the scene after prolonged periods out of view. Experimental results demonstrate that our method supports the synthesis of complex and extended events with unprecedented controllability and persistent dynamic object memory. Project Page: https://worlddirector.github.io/

OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers

2/10

Donghyun Lee, Jitesh Chavan, Duy Nguyen, Sam Huang, Liming Jiang, Priyadarshini ...

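Personalized recommendation rationale:

This paper mainly concerns quantization for image and video diffusion Transformers, which belongs to pure vision and model compression and has no direct connection to core techniques in recommender systems, search, or advertising. Although quantization may have potential in edge deployment, the paper focuses explicitly on visual diffusion models and does not touch on user-behavior modeling, ranking, or retrieval, so its relevance is very low.

2026-07-02 17:27:34 | arXiv:2607.02461v1 |

cs.CV cs.AI cs.LG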

Diffusion transformers (DiTs) achieve state-of-the-art image and video generation, but their multi-step sampling and growing parameter count make inference expensive. Post-training quantization (PTQ) is the natural remedy, yet DiT activations shift across timesteps, prompts, and guidance branches, forcing prior methods to re-fit calibration data for every new checkpoint or modality. We present OrbitQuant, a data-agnostic weight-activation quantizer that bypasses range estimation by quantizing in a normalized, rotated basis. In this basis, a randomized permuted block-Hadamard (RPBH) rotation concentrates each coordinate around one fixed, known marginal regardless of the input, so a single Lloyd-Max codebook serves all timesteps, prompts, and layers of a given input dimension. We extend the same quantizer to weight rows offline, absorbing the rotation into the weights so that it cancels inside each linear layer and only a forward rotation on the activations remains at runtime. The same recipe transfers from image to video with no per-modality tuning. Across FLUX.1, Z-Image-Turbo, Wan 2.1, and CogVideoX, it sets the state of the art for PTQ at several low-bit settings. It also pushes PTQ of image diffusion transformers to W2A4 with usable generation quality.
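
The statement that the rotation "cancels inside each linear layer" follows from orthogonality: for an orthogonal $R$, $(W R^\top)(R x) = W x$. The sketch below checks this numerically with a plain normalized Hadamard matrix. It is a toy under stated assumptions: it omits the random permutation and block structure of RPBH, performs no quantization, and all function names are illustrative rather than taken from the paper.

```python
import math
from typing import List

Matrix = List[List[float]]


def hadamard(n: int) -> Matrix:
    """Normalized Sylvester-Hadamard matrix (n must be a power of two); it is orthogonal."""
    h: Matrix = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


R = hadamard(4)
W = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0]]  # toy weight matrix (2 outputs, 4 inputs)
x = [[1.0], [0.0], [-2.0], [3.0]]                  # toy activation column vector

W_rot = matmul(W, transpose(R))   # done offline: rotation absorbed into the weights
x_rot = matmul(R, x)              # done at runtime: forward rotation of the activation
y_plain = matmul(W, x)
y_rotated = matmul(W_rot, x_rot)
```

In a quantized pipeline, `W_rot` and `x_rot` would be the tensors that get quantized; the layer output stays the same up to quantization error because the two rotations cancel.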

MARVEL: Margin-Aware Robust von Mises-Fischer Expert Learning for Long-Tailed Out-of-Distribution Detection

2/10

A. S. Anudeep, Vaanathi Sundaresan

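Personalized recommendation rationale:

This paper mainly concerns long-tailed out-of-distribution (OOD) detection, which belongs to anomaly detection and OOD generalization and is only weakly connected to core techniques in recommender systems, search, or advertising (such as ranking, recall, or matching). Although OOD detection could be applied to cold-start items or anomalous-user detection in recommender systems, the title does not explicitly mention recommendation, search, or advertising scenarios, and the paper does not involve enabling technologies such as LLMs or Transformers.

2026-07-02 17:06:31 | arXiv:2607.02435v1 |

cs.CV eess.IV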

For clinical deployment, it is essential that automated diagnostic systems remain reliable when confronted with previously unseen cases, yet deep models routinely misclassify out-of-distribution (OOD) inputs with high confidence, underscoring the need for more robust OOD detection methods. Although substantial effort has been devoted to improving model robustness, most of the existing literature assumes balanced datasets, evaluates OOD detection on coarse or non-clinical OOD sources, or lacks comprehensive assessment across diverse OOD scenarios. To address these gaps, we propose a novel methodology trained on diverse and imbalanced medical datasets and evaluated across a clinically reflective OOD spectrum. Our framework comprises three key components: (1) a Nonlinear von Mises-Fisher (NvMF) classifier capable of learning non-linear decision boundaries, with theoretical proof of its asymptotic connection to cosine classifiers; (2) a multi-expert framework in which margin-aware NvMF classifiers specialise in different regions of label distribution to better handle imbalance; and (3) an outlier expert trained explicitly to distinguish inlier from outlier data, thereby strengthening OOD detection. Evaluation on RFMiD, ISIC2019, and NCTCRC datasets demonstrates consistent improvements over state-of-the-art methods, achieving mean FPR95 reductions of 8.45%, 13.02%, and 36.90% respectively. These gains are further supported by comprehensive ablations that validate the contributions of each component. This enables reliable identification of unfamiliar cases for deferral to clinicians, supporting safer AI-assisted diagnosis in real-world workflows. Our code is available at https://github.com/redboxup/MARVEL.
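
FPR95, the metric reported in the abstract, is conventionally the false-positive rate on OOD samples at the threshold where 95% of in-distribution samples are retained. The sketch below assumes that higher scores mean "more in-distribution" and uses a simple quantile rule; the score values are made up, the function name is illustrative, and exact thresholding conventions vary between codebases.

```python
from typing import Sequence


def fpr_at_95_tpr(id_scores: Sequence[float], ood_scores: Sequence[float]) -> float:
    """Fraction of OOD samples accepted when at least 95% of ID samples are accepted."""
    ranked = sorted(id_scores, reverse=True)
    # Smallest number of top-ranked ID samples covering at least 95% (integer ceiling).
    keep = -(-95 * len(ranked) // 100)
    threshold = ranked[keep - 1]
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)


id_scores = [0.1 * k for k in range(1, 21)]   # 20 hypothetical in-distribution scores
ood_scores = [0.05, 0.15, 0.25, 0.9, 1.5]     # 5 hypothetical OOD scores
fpr95 = fpr_at_95_tpr(id_scores, ood_scores)
```

A lower value is better: it means fewer OOD inputs slip through at a threshold that still accepts nearly all in-distribution inputs.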

Learning to Evolve Scenes: Reasoning about Human Activities with Scene Graphs

2/10

Francesca Pistilli, Simone Alberto Peirone, Giuseppe Averta

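Personalized recommendation rationale:

This paper focuses on scene graphs and reasoning about human activities, which belongs to computer vision, and it does not address core problems in recommendation, search, or advertising such as user modeling, item representation, or interaction prediction. Although scene graphs can be likened to structured features, there is no evidence of direct application or methodological transfer to recommender systems.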

2026-07-02 16:53:42 | arXiv:2607.02425v1 |

cs.CV

Understanding human behavior while interacting with the surrounding world is crucial for many applications of embodied AI. First-person videos are particularly informative for this problem, as they capture well how activities reshape the scene over time. However, existing approaches often rely on implicit visual or language-aligned representations, disregarding structured reasoning over scene dynamics. We argue that explicit, compositional and editable representations of human-environment interactions can play a crucial role for rich grounded activity understanding. To this end, we introduce SG-Ego, a large-scale annotation set extending Ego4D with spatio-temporal scene graphs, where relation triplets are consolidated over time into explicit time-evolving descriptions of the scene state. To reason over this representation, we propose GLEN, a graph-based model that operates over scene graph sequences to both align them with textual actions and model their temporal evolution. In addition, we formulate the activity-driven graph-edit forecasting (A-GEF) problem, a novel task that casts scene dynamics as a sequence of structured transformations conditioned on ongoing actions, enabling explicit reasoning about how scenes change over time. We validate our approach across multiple downstream tasks, spanning retrieval benchmarks such as EgoMCQ and EgoCVR, as well as long-horizon reasoning benchmarks such as EXPLORE-Bench and the newly introduced A-GEF. GLEN achieves strong results compared to raw video baselines and it excels in reasoning settings, typically addressed only with MLLMs, while enabling controllable and structured predictions of scene dynamics driven by human activities. We believe our results establish spatio-temporal scene graphs, together with models that reason over them, as strong compositional and interpretable representations for video understanding and potentially beyond.

VisionAId: An Offline-First Multimodal Android Assistant for People with Visual Impairment, Featuring Personalized Object Retrieval

2/10

Cristian-Gabriel Florea, Stelian Spînu

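Personalized recommendation rationale:

This paper mainly concerns assistive technology for people with visual impairment, which belongs to accessibility and human-computer interaction rather than core recommender-system, search, or advertising technology. Although it involves multimodality and personalized retrieval, its application scenario differs substantially from the commercial or general information-retrieval goals of RecSys/Search/Ads, so its relevance is low.

2026-07-02 16:12:50 | arXiv:2607.02371v1 |

cs.CV cs.AI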

Over 285 million people worldwide live with a visual impairment, for whom everyday tasks such as avoiding obstacles, locating personal belongings, recognizing familiar faces, or handling cash remain persistent obstacles to personal autonomy. Existing assistive applications are typically limited to recognizing predefined categories, depend heavily on cloud connectivity, or require dedicated hardware. We present VisionAId, an Android application that turns a commodity smartphone into a real-time visual assistant. The system integrates six on-device deep learning models (metric monocular depth estimation, instance segmentation, visual and facial embeddings, face detection, and a custom banknote detector) running entirely through ONNX Runtime, with an optional cloud large language model (Google Gemini Flash) used only for narrative scene description and automatic object labeling. A distinctive contribution is a few-shot pipeline for personal objects: the user photographs an object from several angles, and the system later locates that specific instance in the environment, guiding the user toward it with augmented-reality markers, spatial audio, and distance-proportional haptics. All feedback is multimodal (Romanian speech synthesis, voice commands, vibration). On a reference device (Samsung Galaxy S21 Ultra), INT8 quantization reduces depth latency from ~1200 ms to ~491 ms, the custom banknote detector reaches an mAP@50 of 0.986, and metric depth is calibrated to below 1 cm of error within 3 m.

DisciplineGen-1M: A Large-Scale Dataset for Multidisciplinary Visual Generation and Editing

2/10

Zhaokai Wang, Mingxin Liu, Zirun Zhu, Ziqian Fan, Yiguo He, Mohan Zhang, Leyao G...

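Personalized recommendation rationale:

This paper focuses on multidisciplinary visual generation and editing (such as image generation and editing), which belongs to computer vision and graphics and has no direct or indirect connection to core tasks in search, recommendation, or advertising systems such as ranking and user modeling. Although vision models might be used to generate ad creatives, the paper is explicitly framed as visual generation and editing, and my current scope of interest excludes pure visual generation, AIGC, and similar content.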

2026-07-02 15:07:47 | arXiv:2607.02290v1 |

cs.CV

Recent image generation and editing models can produce visually appealing natural images, yet they remain unreliable when the target image is a knowledge-intensive diagram whose correctness depends on disciplinary concepts, symbolic structure, and precise spatial relations. We introduce DisciplineGen-1M, a million-scale multidisciplinary dataset that supports text-to-image generation and image editing. It contains 1.2M samples spanning mathematics, physics, chemistry, biology, geography, computer science, economics, history, music, and sports. To construct the dataset, we design a scalable framework that combines vector-graphics rendering, OCR-based editing, curated programmatic synthesis, and large-scale text-to-image filtering. These pipelines produce captions, editing instructions, structured annotations, and paired images with controllable semantic differences. Building on DisciplineGen-1M, we further introduce a discipline-informed reasoning-generation model for both text-to-image generation and image editing. Experiments on discipline-related benchmarks, GenExam and GRADE, show substantial improvements over open-source baselines, while evaluations on general reasoning-informed benchmarks, WISE and RISE, further indicate broader transfer. The results suggest that large-scale structured academic visual data is a key ingredient for moving image generation from aesthetic plausibility toward verifiable knowledge-grounded visual creation. We will publicly release our dataset, model, and source code of the data curation pipeline to ensure reproducibility and benefit future research.

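FlowCIR: Semantic Transport via Flow Matching for Zero-Shot Composed Image Retrieval

2/10

Zhenqi He, Ziqi Jiang, Yuanpei Liu, Yanghao Wang, Teng Wang, Long Chen

Personalized recommendation rationale:

This paper addresses composed image retrieval, which belongs to the image retrieval field. Although vision-language model techniques may be indirectly relevant, the paper itself does not explicitly involve recommendation, search, or advertising applications, and offers no clear analogy to heterogeneous data modeling. Relevance is therefore low.

2026-07-02 15:02:45 | arXiv:2607.02284v1 |

cs.CV

View full abstract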

Zero-shot composed image retrieval (ZS-CIR) aims to retrieve a target image by editing a reference image with a natural-language instruction, without relying on domain-specific annotated triplets. Most existing ZS-CIR methods rely on textual inversion to translate the reference image into pseudo-text tokens and then compose them with the instruction via simple concatenation in the text space, which can be lossy and brittle for fine-grained semantics. In this work, we propose a new paradigm, namely FlowCIR, that casts ZS-CIR as conditional semantic transport between reference and target embeddings. Leveraging conditional flow matching, our model learns a lightweight transport field that maps the instruction representation toward a target-aligned query embedding conditioned on the reference image. Since FlowCIR operates on pre-extracted VLM embeddings and trains only a small transport module without updating the image or text encoder, it offers a computationally efficient training protocol compared with prior textual-inversion-based approaches. The resulting framework is training-efficient, requiring roughly 10× fewer training resources than prior textual-inversion-based approaches. We further identify negation and removal as a major failure mode of VLM-based composition. To address this, we propose an inference-only Multi-Negative Steering strategy that steers a negation-containing relative instruction away from its negated semantics, mitigating the limited negation handling of VLMs and improving robustness on negation-heavy queries. Extensive experiments on standard CIR benchmarks demonstrate that FlowCIR achieves strong and competitive performance compared with recent ZS-CIR methods.
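To make the idea of a transport field concrete, here is a minimal sketch of a flow-matching objective on toy vectors. It assumes a straight-line path between the source embedding `x0` (standing in for the instruction representation) and the target embedding `x1`, which is a common choice but is not confirmed by the abstract; `predict_velocity`, `cond`, and all other names are hypothetical and are not FlowCIR's actual interface.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, cond, t):
    """Mean squared error between predicted and target velocity at time t.

    predict_velocity(x_t, t, cond) is a hypothetical stand-in for the small
    transport module; cond stands in for the reference-image embedding.
    """
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t, cond)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_transport(predict_velocity, x0, cond, steps=10):
    """Integrate the learned field from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt, cond)
        x = [a + dt * b for a, b in zip(x, v)]
    return x
```

With a field that predicts the true velocity exactly, the loss is zero and `euler_transport` lands on the target embedding; in practice the field would be a small trained network and the loss would be averaged over random `t` and training pairs.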

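AnyGroundBench: A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models

2/10

Rintaro Otsubo, Ryo Fujii, Reina Ishikawa, Taiki Kanaya, Kanta Sawafuji, Hiroki ...

Personalized recommendation rationale:

This paper focuses on a video grounding benchmark and is purely evaluation work in the vision-language model area, with no direct connection to core recommendation, search, or advertising problems such as user modeling and item matching. Although VLMs have potential applications in heterogeneous data modeling, a benchmark study is off-topic and lacks a practical orientation.

2026-07-02 14:52:45 | arXiv:2607.02269v1 |

cs.CV cs.AI

View full abstract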

Vision-Language Models (VLMs) have demonstrated immense promise in Spatio-Temporal Video Grounding (STVG). However, current evaluation protocols are largely confined to zero-shot assessments on general, daily-life benchmarks. This creates a critical disconnect from real-world applications in specialized fields, where models inevitably encounter rare visual concepts and complex spatio-temporal dynamics. Since exhaustive pre-training across infinite data distributions is infeasible, the ability to adapt to novel domains is essential. To bridge this gap, we introduce AnyGroundBench, a domain-adaptation benchmark designed to shift the STVG evaluation paradigm from static zero-shot testing to rigorous domain adaptation. Targeting five specialized domains (animal, industry, sports, surgery, and public security), AnyGroundBench pairs newly captured videos such as expert-annotated mouse behaviors with established datasets, unifying them through dense, high-fidelity spatio-temporal annotations. Crucially, the benchmark provides dedicated training subsets to systematically measure domain adaptability. We extensively evaluate 15 state-of-the-art VLMs, assessing their zero-shot generalization and In-Context Learning (ICL) capabilities under practical computational constraints. Ultimately, our findings reveal that current models fail in both zero-shot and ICL-based adaptation when confronted with specialized domains, exposing critical flaws in spatio-temporal reasoning that future research must address.

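ArcAD: Anomaly-Rectified Calibration for Cold-Start Supervised Anomaly Detection

2/10

Ningning Han, Lei Fan, Jia Guo, Yunkang Cao, Xiu Su, Feng Cao, Donglin Di, Tongh...

Personalized recommendation rationale:

This paper addresses the cold-start problem in anomaly detection, a general machine learning method with no direct connection to the core areas of recommender systems, search, or advertising (such as CTR prediction, ranking, and candidate retrieval). Although cold start also arises in recommender systems, the paper does not explicitly target recommendation scenarios, and anomaly detection is generally outside my core focus.

2026-07-02 14:43:21 | arXiv:2607.02252v1 |

cs.CV

View full abstract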

The deployment of Industrial Anomaly Detection (IAD) in real-world manufacturing frequently encounters a challenging cold-start bottleneck, in which limited normal samples fail to represent the full normal distribution and only a few anomalies are available. Under such a regime, existing methods struggle to form compact normal boundaries and fail to effectively exploit supervised signals from rare defects. To address this challenge, we propose Anomaly-Rectified Cold-start AD (ArcAD), a plug-and-play calibration framework for reconstruction-based IAD baselines. ArcAD follows a push-pull learning paradigm to construct a compact and discriminative normal boundary under data scarcity. On the one hand, ArcAD projects limited normal samples onto a hypersphere and pulls them into multiple compact clusters to maximize coverage of the normal manifold. On the other hand, it synthesizes pseudo-anomalies on the hypersphere and leverages real anomalies to push the boundary inward and sharpen anomaly discrimination. Extensive experiments on MVTec-AD, VisA, Real-IAD, and MANTA demonstrate that ArcAD significantly outperforms state-of-the-art supervised and unsupervised methods in both single-class and multi-class settings under cold-start conditions. Code is available at: https://github.com/LGC-AD/ArcAD.

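When Token Compression Breaks: Structural Pruning vs. Token Reduction for Robust ViT Segmentation under High Compression

2/10

Tien-Phat Nguyen, Ngai-Man Cheung

Personalized recommendation rationale:

This paper focuses on segmentation with Vision Transformers (ViTs) and examines the robustness of structural pruning and token reduction under high compression. Although it touches on Transformer efficiency (Enabling Transformer Tech), the topic is strictly limited to ViTs for image segmentation and does not directly discuss potential applications in recommender systems, search, or advertising. The paper lacks an explicit connection to core RecSys/Search/Ads problems such as user behavior sequences, multimodal features, or ranking models, so relevance is low.

2026-07-02 14:34:31 | arXiv:2607.02237v1 |

cs.CV

View full abstract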

Vision Transformers (ViTs) are strong backbones for semantic segmentation, but their computational cost limits deployment. Recent token compression methods for efficient transformer-based segmentation reduce this cost by decreasing the number of tokens. However, existing evaluations primarily focus on low-to-moderate compression, leaving their behavior under aggressive compression and corrupted inputs unclear. Meanwhile, structural pruning provides an orthogonal route to efficiency by removing redundant components in the ViT architecture, but is rarely compared to token compression under a unified protocol. To bridge this gap, we benchmark representative token compression and structural pruning methods for ViT-based semantic segmentation under matched FLOPs on ADE20K and Cityscapes, together with their common-corruption variants ADE20K-C and Cityscapes-C. Our results reveal a consistent trend on both clean and corrupted inputs: token compression is highly effective at mild reductions but degrades sharply when compression becomes severe, consistent with substantial information loss from overly aggressive token reduction. In contrast, structural pruning exhibits a smoother degradation curve and is more stable at high compression. Motivated by these findings, we study a prune-then-merge pipeline that applies moderate token compression on top of a moderately pruned backbone. At comparable FLOPs, this combined strategy consistently achieves a better accuracy-robustness trade-off at high compression, offering a practical recipe for deployment-oriented ViT segmentation. Code is available at https://github.com/phatnguyencs/vit-seg-compression.

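Liquid Latent State Dynamics for Interpretable Turbofan Degradation Modeling

2/10

Weizhi Nie, Weijie Wang, Yuting Su

Personalized recommendation rationale:

This paper focuses on degradation modeling of turbofan engines, which belongs to mechanical engineering or equipment maintenance, with no direct connection to the core techniques of recommender systems, search, or advertising. Although it involves latent state dynamics, it lacks clear prospects for cross-domain application.

2026-07-02 10:15:32 | arXiv:2607.01986v1 |

cs.LG cs.CV

View full abstract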

Multivariate time-series models for prognostics are often evaluated by point prediction accuracy, yet their internal states rarely expose a coherent degradation process. We study liquid neural networks as latent dynamics models for aircraft engine health monitoring on the C-MAPSS benchmark. The proposed model encodes a history window into a latent state, evolves that state with a liquid transition model, and decodes future sensor observations. To separate health evolution from operating-condition variation, the latent state is factorized into degradation and condition components. Remaining useful life, monotonic risk, and latent-consistency losses supervise the degradation component, while condition prediction and decorrelation losses discourage operating-condition leakage. Across FD001–FD004, the full disentangled model improves overall sensor forecasting RMSE from 0.2438 for a GRU baseline to 0.2266, with the largest gains on the multi-condition subsets FD002 and FD004. The learned degradation state also forms a clearer temporal degradation axis, reaching an average state-speed Spearman correlation of 0.5960. Direct remaining-useful-life regression remains stronger for the GRU baseline, indicating that the proposed representation is currently more effective as an interpretable world model for degradation dynamics than as a calibrated lifetime regressor. These results suggest that liquid latent dynamics can bridge predictive maintenance forecasting and inspectable health-state modeling.

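Towards Real-World Ultrasound Understanding: Large Vision-Language Models from Multi-Image Examinations with Long-Form Reports

2/10

Bingcong Yan, Chunlei Li, Jingliang Hu, Yilei Shi, Xiao Xiang Zhu, Lichao Mou

Personalized recommendation rationale:

This paper focuses on vision-language models for ultrasound, which is medical image analysis, with no direct connection to the core areas of recommendation, search, or advertising. Although it involves vision-language model techniques, its application is medical diagnosis rather than the recommendation or search scenarios I follow.

2026-07-02 09:08:46 | arXiv:2607.01908v1 |

cs.CV

View full abstract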

Large vision-language models (LVLMs) have achieved strong performance across many medical imaging tasks, yet their application to ultrasound remains limited due to its inherent complexity and variability. In this work, we revisit what is truly needed to enable real-world ultrasound understanding. Instead of introducing complex architectures or elaborate training strategies, we show that data scale and clinically faithful data alignment are the key factors. We construct a large-scale dataset of 1.5M real-world ultrasound examinations, containing 17.7M images, multi-organ coverage, and paired uncurated clinical reports. Crucially, we organize the data at the examination level, aligning multiple images with their corresponding reports to reflect real clinical workflows. We then fine-tune a standard LVLM using low-rank adaptation (LoRA) on this dataset without task-specific modifications. Surprisingly, this simple recipe already leads to strong performance across diverse ultrasound understanding tasks, outperforming prior methods designed with more complex pipelines. Beyond these results, we present model and data scaling analyses that provide insights into the role of scale in ultrasound LVLMs.

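Population-Based Multi-Objective Training of Discriminators for Semi-Supervised GANs

2/10

Francisco Sedeño, Francisco Chicano, Jamal Toutouh

Personalized recommendation rationale:

This paper focuses on training methods for generative adversarial networks (GANs) in the semi-supervised learning setting, and its direct connection to core areas such as search, recommendation, and advertising is weak. Although GANs have potential uses in scenarios such as data augmentation, this particular method (multi-objective population-based training) is rarely explored in the existing recommender systems literature and lacks clear application prospects.

2026-07-02 09:06:06 | arXiv:2607.01907v1 |

cs.LG cs.AI cs.CV

View full abstract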

Semi-supervised generative adversarial networks (SSL-GANs) can exploit large unlabeled datasets while retaining a classifier in the discriminator, but their training is often unstable. This paper proposes a population-based evolutionary training strategy in which discriminator learning is formulated as a multi-objective optimization problem. Instead of aggregating the supervised and unsupervised components of the SSL objective into a single scalar loss, the method maintains a population of discriminators ranked by Pareto dominance, enabling the exploration of different trade-offs between classification accuracy and real/fake discrimination. This formulation aims to improve both roles of SSL-GANs: learning accurate classifiers and training generators capable of producing realistic samples. We analyze several variants, including an elitist strategy and a mono-objective ablation, to assess the role of multi-objective selection. Experiments on MNIST with limited labels show improved training robustness compared to SSL-GAN and CE-SSL-GAN state-of-the-art baselines, while the elitist variant consistently achieves the highest classification accuracy.
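Ranking by Pareto dominance can be sketched in a few lines. The example below assumes each discriminator is summarized by a tuple of losses where lower is better (for instance a supervised loss and an unsupervised loss); the function names and the two-objective setup are illustrative assumptions, not the paper's implementation.

```python
def dominates(a, b):
    """True if loss vector a is no worse than b on every objective
    and strictly better on at least one (lower is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(losses):
    """Indices of the loss vectors that no other vector dominates."""
    return [
        i
        for i, a in enumerate(losses)
        if not any(dominates(b, a) for j, b in enumerate(losses) if j != i)
    ]
```

For a population with losses `[(0.2, 0.9), (0.5, 0.5), (0.9, 0.1), (0.6, 0.6)]`, the first three entries are mutually non-dominated trade-offs, while the last is dominated by `(0.5, 0.5)` and would rank behind the front.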

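SAB-LVLM: Significance-Aware Binarization for Large Vision-Language Models

2/10

Qi Lyu, Jiahua Dong, Baichen Liu, Xudong Wang, Mingfei Han, Yulun Zhang, Fahad S...

Personalized recommendation rationale:

This paper mainly concerns binarization-based compression of vision-language models, which is model efficiency optimization, but its core is multimodal vision-language models and it does not explicitly involve recommendation, search, or advertising. Although binarization may indirectly improve deployment efficiency, the paper lacks any direct application to or discussion of RecSys/Search/Ads, so relevance is low.

2026-07-02 08:30:00 | arXiv:2607.01876v1 |

cs.CV cs.AI

View full abstract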

Large Vision-Language Models (LVLMs) have achieved remarkable progress in multimodal understanding, yet their enormous parameter scale and cross-modal computation incur substantial memory and latency overhead, severely limiting real-world deployment on resource-constrained devices. Binarization offers an attractive solution by drastically reducing storage and computational costs. However, existing binarization methods neglect the varying importance of weights across different layers and modalities. This causes parameters irrelevant to downstream tasks to be unnecessarily retained, whereas modality-critical weights may not be adequately optimized, resulting in significant performance degradation. To address these challenges, we develop a novel Significance-Aware Binarization for Large Vision-Language Models (SAB-LVLM). Specifically, after constructing Hessian matrices for textual and visual inputs, we propose a spatial significance map to distinguish full-precision weights activated under a single modality from those activated across modalities. We then devise a modality-guided integration strategy to obtain the significance-aware binarization map, which measures weight significance across layers and modalities. Subsequently, this binarization map is incorporated into the binarization objective as an error reweighting term, and binarization fitting is performed through an alternating significance-weighted update scheme. Extensive experiments illustrate the superiority of our SAB-LVLM over existing binary PTQ methods under an approximately 1-bit compression constraint. Our code is accessible at https://github.com/LyuQi127/SAB_LVLM.
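As a rough illustration of what an error reweighting term does to a binarization objective, the sketch below fits a single scale `alpha` for sign-binarized weights by minimizing a significance-weighted squared error. This is a textbook-style simplification under stated assumptions: the per-weight `significance` values are taken as given, and the closed form is not the paper's Hessian-based map or its alternating update scheme.

```python
def sign(x):
    """Sign with the convention sign(0) = +1, so every weight maps to +/-1."""
    return 1.0 if x >= 0 else -1.0


def weighted_binarize(weights, significance):
    """Binarize weights as alpha * sign(w).

    alpha minimizes sum_i s_i * (w_i - alpha * sign(w_i))**2, which gives
    alpha = sum_i s_i * |w_i| / sum_i s_i. Assumes sum(significance) > 0.
    """
    signs = [sign(w) for w in weights]
    alpha = sum(s * abs(w) for w, s in zip(weights, significance)) / sum(significance)
    return alpha, signs


def weighted_error(weights, significance, alpha, signs):
    """Significance-weighted squared reconstruction error."""
    return sum(
        s * (w - alpha * b) ** 2 for w, s, b in zip(weights, significance, signs)
    )
```

With uniform significance this reduces to the familiar mean-absolute-value scale; raising the significance of a weight pulls `alpha` toward that weight's magnitude, so its reconstruction error shrinks at the expense of less significant weights.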

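ProCal: Inference-Time Proposal Calibration for Open-Vocabulary Object Detection

2/10

Jae-Ryung Hong, Ho-Joong Kim, Seong-Whan Lee

Personalized recommendation rationale:

This paper focuses on open-vocabulary object detection, a computer vision topic with little connection to core recommender system, search, or advertising tasks such as ranking and candidate retrieval. Although object detection may have potential applications in visual search, the topic is too specific and does not involve general advances in LLMs or Transformer architectures, so it offers limited direct insight for RecSys/Search/Ads.

2026-07-02 06:18:44 | arXiv:2607.01759v1 |

cs.CV cs.AI

View full abstract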

Open-vocabulary object detection aims to localize and classify objects beyond the fixed set of categories seen during training. Recent open-vocabulary object detection methods improve localization and classification for unseen categories by leveraging a frozen VLM as a detector backbone. However, the VLM classification score does not account for the position and scale of the object in an image. We observe that pretrained VLMs are able to classify foreground and background regions. Based on this observation, we propose a simple inference-time Proposal Calibration (ProCal) that improves the localization quality of the classification score. ProCal computes a proposal prior by combining two scores: a localization-aware foreground score and a background-aware suppression score. The localization-aware foreground score captures whether a proposal contains an object area. The background-aware suppression score measures the extent to which the proposal resembles background. Our analysis shows that ProCal suppresses false novel activation on background proposals and consistently ranks true novel proposals above background and partial novel proposals. Applied to CLIPSelf ViT-L/14, ProCal improves APr by +2.5 on OV-LVIS. The analyses show that proposal-level localization-aware reranking helps mitigate ranking miscalibration for novel objects.

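ICDepth: Taming Video Diffusion Models for Video Depth Estimation via In-Context Conditioning

2/10

Xuanhua He, Jiaxin Xie, Mingzhe Zheng, Qifeng Chen

Personalized recommendation rationale:

This paper concerns video depth estimation, a computer vision task. Although it may draw on generative modeling techniques such as diffusion models, its direct application is unrelated to the core areas of recommender systems, search, or advertising. It lacks clear transferability, so relevance is low.

2026-07-02 04:05:17 | arXiv:2607.01677v1 |

cs.CV

View full abstract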

Monocular video depth estimation requires temporal consistency, geometric accuracy, and generalization across diverse scenarios, yet existing methods struggle to achieve all three simultaneously. Discriminative models excel at per-frame accuracy but suffer from temporal drift due to limited context windows, while generative methods improve consistency and generalization at the cost of extensive training data (10M+ samples) and lack of geometric precision. In response to these issues, we introduce ICDepth, a framework that adapts pre-trained text-to-video diffusion transformers for video depth estimation via In-Context Conditioning (ICC), leveraging their rich spatial-temporal priors. To address key challenges in transferring ICC from generation to dense prediction, we propose: (1) SAND-Attention, which ensures precise spatial-temporal alignment via shared RoPE and enforces unidirectional attention to prevent noise contamination; (2) SRFM, which injects DINOv2 semantic and resolution priors to enhance geometric precision. ICDepth achieves state-of-the-art results on multiple benchmarks with remarkable data efficiency, trained on only 0.8M frames (6–13× less than competing generative methods), while demonstrating strong zero-shot generalization to diverse domains.

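Temporal and Cross-Modal Alignment for Enhanced Audiovisual Video Captioning

2/10

Chen Zhao, Jiajun Ma, Qilong Huang, Tiehan Fan, Hongyu Li, Zhuoliang Kang, Xiaom...

Personalized recommendation rationale:

This paper studies video captioning, a multimodal understanding and generation task with no direct connection to recommendation, search, or advertising. Although cross-modal alignment techniques might inspire feature fusion, the application scenarios differ substantially and there is no clear transfer value.

2026-07-02 03:47:45 | arXiv:2607.01667v1 |

cs.CV

View full abstract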

While Multimodal Large Language Models (MLLMs) have advanced video understanding, achieving precise temporal and cross-modal alignment in audiovisual video captioning remains a formidable challenge. Most existing approaches suffer from modality detachment and temporal incoherence, failing to accurately bind auditory events to visual entities or capture complex causal dynamics. To address these deficiencies, we propose TCA-Captioner, a framework specifically engineered to enhance Temporal and Cross-Modal Alignment for audiovisual video captioning. We first introduce the Observer-Checker-Corrector (OCC) framework, an iterative refinement strategy that generates high-fidelity, meticulously grounded training data. Leveraging a curated high-density human interaction dataset, TCA-Captioner is optimized to model sophisticated audiovisual interactions. Furthermore, we present TCA-Bench, a diagnostic benchmark utilizing a Decoupled Evaluation Protocol to isolate and quantify model proficiency in audiovisual binding and temporal relational reasoning. Extensive experiments demonstrate that TCA-Captioner sets a new standard for temporally-coherent and synchronized audiovisual narratives.

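Multi-Resolution Flow Matching: Training-Free Diffusion Acceleration via Staged Sampling

2/10

Xingyu Zheng, Xianglong Liu, Yifu Ding, Weilun Feng, Junqing Lin, Jinyang Guo, H...

Personalized recommendation rationale:

This paper focuses on accelerated sampling for diffusion models, which falls under generative model optimization, with no direct connection to core recommender system, search, or advertising tasks such as ranking, candidate retrieval, and user modeling. Although generative models can be used indirectly for data augmentation or content generation, the title shows no specific application or potential link to recommendation, search, or advertising, so relevance is low.

2026-07-02 03:14:57 | arXiv:2607.01642v1 |

cs.CV

View full abstract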

Hardware-agnostic strategies for accelerating text-to-image diffusion, such as timestep distillation and feature caching, can reduce inference time without custom kernels or system-level optimization. Among them, multi-resolution generation strategies have recently received broad attention, attaining more than 5x speedup without any training. However, the design of performing upsampling in the latent space, together with the selective modification of partial regions, causes these methods to exhibit noticeable blurring or artifacts. To this end, we propose MrFlow, a training-free multi-resolution acceleration strategy for pretrained flow-matching models built upon a staged low-to-high-resolution pipeline. MrFlow first rapidly generates the main structure at low resolution, then performs super-resolution in the pixel space using a lightweight pretrained GAN-based model, subsequently injects low-strength noise to enable high-frequency resampling, and finally refines the details at high resolution. Quantitative and qualitative results on FLUX.1-dev and Qwen-Image show that MrFlow exploits the quadratic token reduction and reduced step requirement of low-resolution sampling to achieve 10x end-to-end acceleration while keeping OneIG within a 1% gap relative to that before acceleration, significantly surpassing other training-free acceleration strategies, and requiring no training or runtime dynamic identification whatsoever. MrFlow can further be directly combined orthogonally with pre-trained timestep distillation strategies, achieving even higher generation acceleration of up to 25x.
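The "quadratic token reduction" mentioned above is simple arithmetic: the token count scales with image area, so halving each side cuts tokens by 4x, and pairwise self-attention work by 16x. The sketch below works this through; `pixels_per_token=16` (the side length in pixels covered by one token) is an assumed value for illustration and is not taken from the abstract.

```python
def num_tokens(height, width, pixels_per_token=16):
    """Token count for an image when each token covers a square patch of
    pixels_per_token x pixels_per_token pixels (assumed value)."""
    return (height // pixels_per_token) * (width // pixels_per_token)


def attention_pairs(tokens):
    """Number of query-key pairs in full self-attention over the tokens."""
    return tokens * tokens


high = num_tokens(1024, 1024)  # tokens at full resolution
low = num_tokens(512, 512)     # tokens at half resolution per side
token_ratio = high // low
pair_ratio = attention_pairs(high) // attention_pairs(low)
```

Under these assumptions `token_ratio` is 4 and `pair_ratio` is 16, which is why running most sampling steps at low resolution and only refining at high resolution can save so much compute.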

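Disentangling Pictorial Cue Understanding from Language Bias in VLMs via Depth Ordering Task

2/10

Yiqian Liu, Iuliia Kotseruba, John K. Tsotsos

Personalized recommendation rationale:

This paper studies pictorial cue understanding and language bias in vision-language models (VLMs). Although VLMs have potential for recommender systems (for example, multimodal feature fusion), the paper focuses on disentangling the model's internal perceptual mechanisms rather than directly improving recommendation performance. It also does not explicitly discuss implications for heterogeneous data modeling; it is basic VLM research with a weak connection to direct RecSys/Search/Ads applications.

2026-07-01 22:02:30 | arXiv:2607.01503v1 |

cs.CV

View full abstract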

In this paper, we study depth perception of vision-language models (VLMs) to isolate the effects of pictorial depth cues and disentangle vision and language influences on model performance. To this end, we combine depth-ordering and odd-one-out psychophysical tasks: the VLMs are presented with images where one object is at different depth relative to other, otherwise identical, objects, and must determine whether the odd-one-out target is closer or farther to the observer. To create stimuli, we generate 2D views from simulated and real 3D scenes while controlling the presence of individual pictorial depth cues, enabling a fine-grained analysis of cue-level contributions. Language effects are examined by varying referring expression clarity. We also introduce a novel metric to quantify vision-vs-language sensitivities. Applying this methodology, we create the Odd-One-Out Depth (O3-D) dataset with 37K real and synthetic images and 147K image-question pairs. Evaluation of 12 open-source and commercial models on O3-D shows under-utilization of depth cues and depth-ordering accuracies between 47% and 56%, with no model above chance level. At the same time, our metric reveals strong linguistic bias in the answers. Neither chain-of-thought (CoT) nor in-context learning (ICL) significantly improves performance, suggesting that static image data alone may be insufficient for depth understanding. All code, the image generation pipeline, and the O3-D dataset are publicly released at https://github.com/lyiqian/o3-d.

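LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning

1/10

Matteo Boglioni, Thibault Rousset, Siva Reddy, Marius Mosbach, Verna Dankers

Personalized recommendation rationale:

This paper focuses on evaluating localization precision for LLM unlearning, which belongs to model safety and compliance; it is unrelated to the core techniques of recommender systems, search, or advertising, and involves neither architectural innovation nor applications.

2026-07-02 17:59:52 | arXiv:2607.02513v1 |

cs.CL cs.AI cs.LG

View full abstract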

LLMs memorize sensitive training data, including personally identifiable information (PII), creating a pressing need for reliable post hoc removal methods. Unlearning has emerged as a promising solution, with state-of-the-art (SOTA) methods often following a localize-first, unlearn-second paradigm that targets specific model parameters. However, existing benchmarks evaluate unlearning solely at the output level, leaving open the question of whether unlearning truly erases knowledge from a model's parameters or merely obfuscates it, a concern reinforced by the success of resurfacing attacks. To bridge this gap, we introduce LACUNA: the first unlearning testbed with ground-truth parameter-level localization. LACUNA injects PII of synthetic individuals into predefined parameters of 1B and 7B OLMo-based models via masked continual pretraining, enabling direct evaluation of whether unlearning targets the weights responsible for knowledge storage. We use LACUNA to benchmark current SOTA unlearning methods and find that, despite strong output-level performance, existing methods are highly imprecise and susceptible to resurfacing attacks. We further show that when localization is successful, even a simple gradient-based unlearning method achieves strong erasure and robustness to resurfacing attacks, highlighting the importance of precise unlearning. We release LACUNA to complement behavioral evaluations and drive further advances in robust, localization-based unlearning.

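Program-as-Weights: A Programming Paradigm for Fuzzy Functions

1/10

Wentao Zhang, Liliana Hotsko, Woojeong Kim, Pengyu Nie, Stuart Shieber, Yuntian ...

Personalized recommendation rationale:

This paper explores a programming paradigm, which belongs to software engineering, with no direct connection to the core techniques of recommender systems, search, or advertising (such as ranking, candidate retrieval, and model architecture). It is neither an improvement to LLMs or Transformers nor a specific application to recommendation, search, or advertising.

2026-07-02 17:59:50 | arXiv:2607.02512v1 |

cs.LG cs.AI cs.CL

View full abstract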

Many everyday programming tasks resist clean rule-based implementation, such as alerting on important log lines, repairing malformed JSON, or ranking search results by intent, and are increasingly outsourced to large language model APIs at the cost of locality, reproducibility, and price. We propose fuzzy-function programming: compiling such a function from a natural-language specification into a compact, locally-executable neural artifact. We instantiate this paradigm with Program-as-Weights (PAW), in which a 4B compiler trained on FuzzyBench, a 10M-example dataset we release, emits parameter-efficient adapters for a frozen, lightweight interpreter. A 0.6B Qwen3 interpreter executing PAW programs matches the performance of direct prompting of Qwen3-32B, while using roughly one fiftieth of the inference memory and running at 30 tokens/s on a MacBook M3. PAW reframes the foundation model from a per-input problem solver into a tool builder: invoked once per function definition, it produces a small reusable artifact whose subsequent calls per function application are cheap and offline.

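Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

1/10

Yuxuan Li, Lingxi Xie, Xinyue Huo, Jihao Qiu, Jiacheng Shao, Pengfei Chen, Jiann...

Personalized recommendation rationale:

This paper focuses on speaker recognition, a specific audio task unrelated to the core techniques of recommender systems, search, or advertising. Although it uses an LLM, the application is limited to multimedia content processing and lacks generality or transferability to recommendation, search, or advertising.

2026-07-02 17:58:52 | arXiv:2607.02504v1 |

cs.CL cs.AI cs.CV

View full abstract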

Long-form TV dramas present a formidable challenge for comprehensive video understanding, where deciphering complex storylines often relies on speaker recognition, the task of accurately attributing each spoken utterance to its respective character. In this paper, we advance this field through two primary contributions. (1) We introduce DramaSR-532K, a large-scale benchmark comprising 532K annotated dialogue lines across more than 900 unique characters, necessitating the integration of auditory, linguistic, and visual cues for speaker recognition. (2) We propose DramaSR-LRM, a robust approach built upon a large reasoning model (LRM). DramaSR-LRM is designed to autonomously aggregate contextual evidence via multimodal tool-use, synthesizing diverse inputs to achieve high-fidelity attribution. Experimental results demonstrate that DramaSR-LRM significantly outperforms existing baselines, particularly on short utterances where acoustic biometrics are inherently unreliable. All the data and code will be made publicly available at the project page: https://www.github.com/198808xc/DramaSR-LRM.

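Towards Robustness against Typographic Attack with Training-free Concept Localization

1/10

Bohan Liu, Wenqian Ye, Guangzhi Xiong, Zhenghao He, Sanchit Sinha, Aidong Zhang

Personalized recommendation rationale:

This paper focuses on the adversarial robustness of vision models (typographic attacks), a computer vision topic with no direct connection to the core techniques of recommender systems, search, or advertising. The title mentions no concept related to user behavior modeling, multimodal recommendation, or LLM applications, so relevance is low.

2026-07-02 17:55:24 | arXiv:2607.02494v1 |

cs.CV cs.CL

View full abstract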

Models trained via Contrastive Language-Image Pretraining (CLIP) serve as the foundational vision encoders for most modern Large Vision Language Models (LVLMs). Despite their widespread adoption, CLIP models exhibit a critical yet underexplored failure mode: irrelevant text appearing within images confounds visual representations, biasing them toward lexical meaning rather than true visual semantics. This robustness issue, commonly described as a Typographic Attack (TA), exposes a vulnerability that poses a significant risk to safety-critical applications such as autonomous driving. To achieve interpretable and effective robustness against TA, we propose a novel, training-free mechanistic interpretability method. Our method provides sampling-based interpretations of hidden state representations and quantitatively attributes semantic versus lexical focus to individual attention heads. Through probabilistic analysis and circuit mining, we isolate specific Vision Transformer (ViT) components that disproportionately encode lexical information, thereby identifying the mechanistic source of TA. We further show that simple interventions applied directly to the identified circuits, without any additional training, can substantially improve robustness against Typographic Attacks in object classification. These interventions, such as selective adjustment of attention weights, also outperform both supervised and training-free defense methods. Our experiments demonstrate that applying the proposed intervention to the vision encoders of several state-of-the-art LVLMs yields substantial gains in Visual Question Answering accuracy under Typographic Attack interference on RIO-Bench. These results confirm both the efficacy and the generalizability of our mechanistic approach. Code is released at https://github.com/Liu-524/SamplingTAR.

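TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution

1/10

Jiale Amber Wang, Kaiyuan Wang, Pengyu Nie

Personalized recommendation rationale:

This paper focuses on the co-evolution of software tests and code, a software engineering topic with no direct connection to recommendation, search, advertising, or core LLM techniques, and it does not involve multimodal or heterogeneous data modeling. It is therefore almost entirely unrelated to my current focus.

2026-07-02 17:35:20 | arXiv:2607.02469v1 |

cs.SE cs.AI cs.CL

View full abstract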

Software tests and code evolve together: a code change should be followed by new or updated tests that record the new software behavior. Yet existing test generation and update benchmarks often isolate the test from the code change, and rely on static metadata that does not verify whether a test is executable or semantically tied to the code change. This makes it difficult to evaluate whether a test automation agent understands how a code change should propagate into the test suite. We introduce TestEvo-Bench, a benchmark of test and code co-evolution tasks mined from software repositories, with two tracks: in test generation, the agent shall write new tests to capture the new software behavior; in test update, the agent shall adapt failing existing tests to the changed software behavior. Each task is anchored to a real commit history and packaged with environment configuration to support execution-grounded metrics such as pass rate, coverage, and mutation score. TestEvo-Bench is also a live benchmark: each task records the timestamp of the test and code changes, and new tasks are periodically mined by our automated pipeline, so evaluation can be restricted to tasks postdating a model's training cutoff to reduce data leakage risk. The current snapshot contains 746 test generation and 509 test update tasks, curated from 59,950 candidate co-evolution records across 152 open-source Java projects. We experiment with four state-of-the-art agents that combine strong harnesses (Claude Code, Gemini CLI, and SWE-Agent) with strong foundation models (Claude Opus 4.7 and Gemini 3.1 Pro). Results show that they achieve up to 77.5% success rate on test generation and 74.6% on test update. However, success rate is materially lower on the most recent benchmark tasks and drops significantly under limited per-task cost.
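The "live benchmark" idea, restricting evaluation to tasks that postdate a model's training cutoff, can be sketched as a simple filter followed by a success-rate computation. The `Task` record and its field names below are hypothetical and do not reflect the benchmark's actual schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: an id, the date of the underlying test and
    code change, and whether the agent's output passed."""
    task_id: str
    changed_on: date
    passed: bool


def tasks_after_cutoff(tasks, cutoff):
    """Keep only tasks whose change happened strictly after the cutoff."""
    return [t for t in tasks if t.changed_on > cutoff]


def success_rate(tasks):
    """Fraction of tasks that passed, or None when there are no tasks."""
    if not tasks:
        return None
    return sum(t.passed for t in tasks) / len(tasks)
```

Comparing `success_rate(all_tasks)` with `success_rate(tasks_after_cutoff(all_tasks, cutoff))` gives a rough view of how much of a score might depend on tasks the model could have seen during training.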

Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach

1/10

Manuel Alonso-Carracedo, Ruben Fernandez-Boullon, Pedro Celard, Francisco J. Rod...

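Personalized recommendation rationale:

This paper focuses on automated grading in education and applies LLMs to exam assessment, which has no direct connection to the core areas of recommender systems, search, or advertising. Although LLM technology is relevant, the application scenario is far removed and lacks clear potential for RecSys/Search/Ads.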

2026-07-02 17:01:47 | arXiv:2607.02432v1 |

cs.AI cs.CL cs.CY

Scalable and reliable grading of command-line examinations remains a challenge in computing education, where rising enrolments make manual marking difficult and rule-based autograders cannot handle partial credit, equivalent solutions, or syntactic variation. This paper evaluates whether four frontier Large Language Models (GPT, Claude Opus, Gemini, and GLM) can approximate expert judgment when grading short Linux/bash command responses. The study adopts a four-level cognitive taxonomy that combines cognitive complexity and operational impact, ranging from information retrieval (L1) and basic file manipulation (L2) to structural operations (L3) and advanced system management (L4). The models were tested with two prompt variants, a minimal baseline and a rubric-enhanced version, on 1200 real responses from second-year Computer Engineering students independently graded by three expert instructors. Gemini 3.0 Pro with rubric-guided prompting achieved the highest human-AI agreement (ICC(3,1) = 0.888, MAE = 0.10, Bland-Altman bias = -0.014). Agreement declined consistently as taxonomy level increased, with the largest discrepancies at higher levels. Across all models, rubric quality had a larger effect than provider choice, with structured prompts consistently improving agreement. These results show that question complexity is a reliable predictor of the difficulty LLMs face in grading accurately, and they establish a principled, taxonomy-based framework for determining which questions are suitable for AI-assisted grading and which require human review, while also providing a transferable evaluation protocol and prompt templates.
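
Two of the agreement statistics quoted in the abstract above, MAE and Bland-Altman bias, are simple to compute from paired scores. The sketch below assumes scores on a 0 to 1 scale and takes the bias as the mean of (model minus human). The function names and data are invented for illustration, and ICC(3,1) is left out.

```python
from statistics import mean


def mean_absolute_error(human, model):
    """Average absolute gap between paired human and model scores."""
    return mean(abs(m - h) for h, m in zip(human, model))


def bland_altman_bias(human, model):
    """Mean signed difference (model minus human).

    A negative value means the model grades lower than the human on average.
    """
    return mean(m - h for h, m in zip(human, model))


# Hypothetical paired scores, not taken from the paper.
human_scores = [1.0, 0.5, 0.0, 1.0, 0.5]
model_scores = [1.0, 0.5, 0.5, 0.5, 0.5]

print(mean_absolute_error(human_scores, model_scores))
print(bland_altman_bias(human_scores, model_scores))
```

In this toy data the two disagreements cancel in the bias but not in the MAE, which is why the two statistics are usually reported together.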

The Future of NLP may not be at NLP Conferences: Scholarly Migration Patterns in Natural Language Processing

1/10

David Jurgens

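Personalized recommendation rationale:

This paper studies scholarly migration patterns in NLP, which is a scientometric or sociological analysis rather than a technical or methodological contribution. The title does not involve direct or potential applications of LLMs, Transformer architectures, recommender systems, search, or advertising, nor does it present technical advances or architectural innovations relevant to RecSys/Search/Ads. It is therefore unrelated to my focus areas.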

2026-07-02 16:47:14 | arXiv:2607.02416v1 |

cs.CL

Natural Language Processing (NLP) has traditionally been published in its core disciplinary venues like ACL. However, advances in Large Language Models (LLMs) have led to a blurring of the disciplinary lines between NLP and general Machine Learning (ML), with authors regularly publishing in venues from both fields. Here, we ask whether the disciplinary center of gravity is shifting. Using NLP research published from 2010 to 2026 and studies of both established and new authors, we find that a migration is taking place. First, comparing the pre- and post-LLM eras, established authors lost 19.2pp of share at flagship *ACL main-conference tracks while gaining 14.8pp in the newer Findings tracks, and general ML venues rose 8.6pp, even when adjusting for parallel growth in the fields. Second, among newer authors who debut with at least three first-author NLP-topic papers, the share whose work appears mostly at *ACL venues fell from 84% (2019) to 74% (2024), while the share appearing mostly at general ML venues rose from 5% to 21%. Using causal inference techniques, we estimate that these general ML venues confer a significant citation premium, which influences venue selection. Together, these results point to a significant shift in where NLP research is published.

Know Your Source: A Public Knowledge Store for Media Background Checks

1/10

Benjamin Nichols, Michael Schlichtkrull, Nedjma Ousidhoum

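Personalized recommendation rationale:

This paper focuses on media background checks and a public knowledge store, which belongs to news credibility or fact-checking. It is unrelated to the core techniques of recommender systems, search, or advertising, and does not involve enabling technologies such as LLMs or Transformers.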

2026-07-02 16:20:28 | arXiv:2607.02383v1 |

cs.CL

LLM-based retrieval-augmented generation (RAG) is increasingly used for automated fact-checking (AFC) and related tasks. By grounding LLM outputs in retrieved evidence, RAG-based systems provide transparent justifications while allowing external information to be updated independently of the underlying model. However, existing approaches often assume retrieved evidence is reliable, although real-world information may be conflicting or outdated and can originate from unreliable or biased sources. Recent work on *source-critical reasoning* addresses this challenge through media background checks (MBCs) (Schlichtkrull, 2024), which assess the credibility of evidence sources to support downstream fact verification. However, generating MBCs relies on costly proprietary search APIs, limiting reproducibility. To mitigate this issue, we introduce MEDIAREF, a publicly available knowledge store of web-sourced documents that enables reproducible, low-cost evaluation of MBC generation across 200 media sources. We describe a reproducible methodology for constructing and updating the collection, assess widely used LLMs on the MBC generation task, and demonstrate that MEDIAREF supports higher-quality MBC generation through both automatic and qualitative evaluation.

HULAT2 at MER-TRANS 2026: Governed Multi-Agent Simplification for Spanish Easy-to-Read Generation

1/10

Lourdes Moreno, Paloma Martínez, Marco Antonio Sanchez-Escudero, Miguel Domíngue...

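Personalized recommendation rationale:

This paper focuses on Spanish text simplification (Easy-to-Read generation), a text generation task in natural language processing with no direct connection to the core techniques of recommender systems, search, or advertising. It covers neither direct applications of LLMs to recommendation, search, or advertising nor enabling technologies or Transformer architecture improvements, so its relevance is very low.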

2026-07-02 16:18:58 | arXiv:2607.02381v1 |

cs.CL

This paper describes the participation of HULAT2-UC3M in the Spanish track of MER-TRANS 2026, a shared task on multilingual Easy-to-Read translation. Three fully automatic Spanish runs were submitted. RUN1 and RUN2 used a LangGraph-based multi-agent workflow combining Gemini 2.5 Flash and RigoChat-7B-v2, parallel generation strategies, internal quality signals, Event-Condition-Action routing, controlled editing and traceable decisions. RUN1 used the base workflow, while RUN2 activated an additional lexical-support layer based on a glossary and lexical resources. RUN3 was a RigoChat-based generate-evaluate-regenerate baseline with prompt engineering and LoRA-based adaptation. The official leaderboard reports BLEU-Orig, BLEU-Gold, SARI and BERTScore. During development, additional internal signals were also inspected, including semantic fidelity, readability, lexical simplicity, syntactic clarity and factual consistency. According to official SARI, RUN1 was the best HULAT2 run, with 44.0543 points, followed by RUN2 with 43.1049 and RUN3 with 38.5136. These results indicate that, in this task setting, signal-guided multi-agent routing outperformed the linear regeneration baseline. They also show that adding lexical support did not automatically improve reference-based scores. Further segment-level and document-level analyses are required to assess readability, factual consistency and user-oriented adequacy.

SkillFuzz: Fuzzing Skill Composition for Implicit Intents Discovery in Open Skill Marketplaces

1/10

Jinwei Hu, Yi Dong, Youcheng Sun, Xiaowei Huang

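Personalized recommendation rationale:

This paper focuses on implicit intent discovery in open skill marketplaces, a specific application scenario unrelated to the core techniques of recommender systems, search, or advertising, and it does not involve enabling technologies such as LLMs or Transformers.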

2026-07-02 15:49:21 | arXiv:2607.02345v1 |

cs.SE cs.AI cs.CL

Large Language Model (LLM)-based agents increasingly automate software engineering tasks through reusable skills, natural-language instruction documents that guide planning and execution. Open skill marketplaces enable users to assemble agents by co-activating community-contributed skills, but marketplace operators typically audit skills in isolation. As a result, individually benign skills may interact to redirect an agent toward unintended objectives, which we term implicit intents. Detecting such intents is challenging because the effect emerges only through skill composition, execution environments are often unavailable at admission time, and the space of possible co-activations grows exponentially with marketplace size. In this paper, we formulate implicit-intent discovery as a fuzzing problem over skill compositions, where skill compositions are the unit under test, planning artifacts expose agent intent before execution, and deviations from a skill-free baseline serve as a differential oracle. Based on this formulation, we propose SkillFuzz, the first execution-free testing approach that extracts structured skill contracts and uses contract-guided Monte Carlo Tree Search to prioritize potentially conflicting compositions. Across representative skill-marketplace workloads, SkillFuzz discovers over 1,000 distinct implicit intents under a fixed query budget, confirms more than 80% of the highest-risk flagged compositions during execution-time validation, and identifies substantially more high-severity implicit intents than alternative search strategies while exploring only a fraction of the pairwise interaction space they require.

On the Role of Directionality in Structural Generalization

1/10

Zichao Wei

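Personalized recommendation rationale:

This paper studies structural generalization in linguistics or cognitive science, focusing on the directionality of sentences or sequences. It has no direct connection to LLMs, recommender systems, search, or advertising, and mentions no potential application scenarios, so its relevance is very low.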

2026-07-02 15:20:51 | arXiv:2607.02307v1 |

cs.CL cs.LG

Several SLOG test categories explicitly involve directional distinctions (modifier position shifts, argument extraction positions), yet AM-Parser, the previous SOTA, uses an AM algebra whose operations do not encode direction. We redesign the symbolic backend around CCG directed types (deterministic CKY + single linear decoder, 30K learnable parameters). Under the same BERT-base encoder, the system achieves 75.9 ± 6.4% LF exact match, surpassing AM-Parser (70.8 ± 4.3%). Per SLOG's own category groupings, gains are highly directional: the CCG system outperforms AM-Parser on all 5 position-shift categories (+29.9pp), while AM-Parser outperforms on all 6 recursive-depth categories. Replacing the encoder with DeBERTa-v3-large yields 90.7 ± 4.9%, with the largest encoder gains in recursive-depth categories, complementary to directionality's gains. Directional representations shift the bottleneck from the symbolic layer (AM-Parser's 0% category ceiling) to the neural layer, which improves with encoder upgrades.

AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents

1/10

Xiangchen Cheng, Yunwei Jiang, Jianwen Sun, Zizhen Li, Chuanhao Li, Xiangcheng C...

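Personalized recommendation rationale:

This paper focuses on a testbed for LLM agents, which is general-purpose agent evaluation and does not involve applications or techniques for recommendation, search, or advertising. The topic has no direct connection to RecSys/Search/Ads, and it mentions neither Transformer architecture improvements nor specific applications of LLMs in those areas.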

2026-07-02 14:44:32 | arXiv:2607.02255v1 |

cs.AI cs.CL

Memory for a long-horizon LLM agent is a contract about what each future decision is allowed to see. The simplest contract appends past observations, tool calls, and reflections to every prompt, which makes prior context easy to access but also turns it into a jumbled mixture in which the effect of any single memory component is hard to isolate. We introduce and instrument an alternative bounded contract: every decision is made from a fresh user message assembled by typed retrieval, with no raw cross-decision transcript appended. The prompt thus stays bounded across runs of any length, and any single layer can be ablated in isolation. We instantiate the contract in Slay the Spire 2, a closed-rule stochastic deck-building game whose runs require hundreds of tactical and strategic decisions. A public online benchmark of frontier LLMs on the same game reports zero wins at the lowest difficulty across five configurations, and the developer-reported human win rate at the same difficulty is 16%; the task is hard but not saturated. Within our harness, a fixed-A0 ablation shows the largest observed difference when triggered strategic skills are enabled: the no-store baseline wins 3/10 games and adding the skill layer 6/10. At this sample size the comparison is directional rather than statistically decisive (Fisher exact p ≈ 0.37); a cross-backbone probe and public accumulating-context baselines are reported as operational comparisons rather than controlled tests of the contract variable itself. We release a reproducible testbed: 298 completed trajectories with condition tags, frozen memory/skill snapshots, prompt records, and analysis scripts -- an agent design and a validated, reusable methodology for studying how explicit memory layers shape long-horizon LLM-agent decisions.
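
The Fisher exact p-value quoted in the abstract above can be reproduced from the two win counts alone (3/10 versus 6/10). The sketch below is a plain-Python two-sided test, written for illustration rather than taken from the paper's analysis scripts. The function name is an assumption, and it uses the common convention of summing all tables no more probable than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1 = a + b
    col1 = a + c
    total = a + b + c + d

    def weight(x):
        # Unnormalised integer weight; integers avoid floating-point ties.
        return comb(col1, x) * comb(total - col1, row1 - x)

    lowest = max(0, row1 - (total - col1))
    highest = min(row1, col1)
    observed = weight(a)
    extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return extreme / comb(total, row1)


# Rows: no-store baseline (3 wins, 7 losses), skill layer (6 wins, 4 losses).
p_value = fisher_exact_two_sided(3, 7, 6, 4)
print(round(p_value, 2))  # 0.37
```

With only ten games per condition, even a doubling of the win rate leaves p well above conventional thresholds, which matches the abstract's description of the result as directional.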

HaloGuard 1.0: An Open Weights Constitutional Classifier for Multilingual AI Safety

1/10

Navaneeth Sangameswaran, Preetham S, Ashmiya Lenin

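Personalized recommendation rationale:

This paper focuses on AI safety and content classification, which falls under non-technical topics and has no direct connection to the core techniques of recommender systems, search, or advertising.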

2026-07-02 12:21:16 | arXiv:2607.02079v1 |

cs.CL cs.CR cs.LG

We present HaloGuard 1.0, an open-weights implementation of the constitutional-classifier paradigm for input safety. It achieves state-of-the-art performance on English and multilingual prompt-safety benchmarks at roughly one-tenth the model size of current leading open guard models. The safety constitution is the organising structure of the corpus: a natural-language constitution of 46 policies and 2,940 subcategories drives synthetic data generation, with exhaustive one-to-one paired counterfactuals that hold topic and vocabulary fixed while flipping intent, a two-tier harmless design that separately targets boundary and baseline false positives (FPs), and balanced multilingual materialisation across 46 languages that treats language as a surface form appearing on both sides of the boundary rather than as an adversarial signal. Across seven prompt-safety benchmarks, HaloGuard 1.0-0.8B attains the best average F1 (90.9) of any open guard we evaluate, outperforming baselines up to 27B parameters (over 30 times larger) while holding false-positive rate (FPR) to 4.3 and false-negative rate (FNR) to 9.5. The HaloGuard 1.0-4B variant reaches average F1 of 92.1 and FPR of 3.5, spending its extra capacity on precision rather than recall. A structured adjudication of the remaining failures indicates that most apparent missed-harm cases are benchmark mislabels rather than genuine model misses. An always-on adversarial red-teaming protocol continuously hardens the guard against both content-level and agentic attacks. We release the models as open weights.
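
The F1, FPR, and FNR figures in the abstract above all derive from a confusion matrix in which "positive" means a harmful prompt. A minimal sketch of those definitions follows. The function name and the counts are invented for illustration and do not reproduce any number from the paper.

```python
def guard_metrics(tp, fp, tn, fn):
    """Compute F1, false-positive rate, and false-negative rate.

    tp: harmful prompts flagged      fn: harmful prompts missed
    fp: harmless prompts flagged     tn: harmless prompts passed
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)  # share of harmless prompts wrongly blocked
    fnr = fn / (fn + tp)  # share of harmful prompts wrongly allowed
    return {"f1": f1, "fpr": fpr, "fnr": fnr}


# Hypothetical counts: 100 harmful and 100 harmless prompts.
metrics = guard_metrics(tp=90, fp=5, tn=95, fn=10)
print({name: round(value, 3) for name, value in metrics.items()})
```

Because FPR and FNR are computed on different subsets of the prompts, a guard can trade one against the other, which is the precision-versus-recall trade-off the abstract attributes to the 4B variant.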

SPLIT: Cross-Lingual Empathy and Cultural Grounding in English and Ukrainian LLM Responses

1/10

Anna Chorna

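Personalized recommendation rationale:

This paper focuses on cross-lingual empathy and cultural grounding, which are questions of affect and cultural adaptation in natural language processing, unrelated to the core techniques of recommender systems, search, or advertising. Its subject leans toward NLP and cross-cultural research and does not touch key aspects of recommendation such as click-through rate prediction, user modeling, or efficiency improvements.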

2026-07-02 11:22:01 | arXiv:2607.02049v1 |

cs.CL cs.AI cs.CY

Large Language Models are increasingly deployed in emotional-support contexts and crisis-related situations. Nevertheless, their cross-lingual abilities in these circumstances remain underexplored. Existing benchmarks emphasize multilingual performance but rarely examine crisis-related empathy and cultural grounding in low-to-mid-resource languages. We introduce SPLIT, a 500-prompt benchmark designed to evaluate LLM consistency in generating emotionally grounded responses across five categories: Stress, Panic, Loneliness, Internal Displacement, and Tension. We evaluate three technically diverse LLMs across three dimensions: Empathetic Accuracy, Linguistic Naturalness, and Contextual & Cultural Grounding. The framework aims to assess and compare the quality of LLM responses in both English and Ukrainian languages, as well as to explore the reliability of the LLM-as-a-jury paradigm. Our findings reveal that Gemini-2.5-Flash and LLaMA-3.3-70B-Instruct degrade when transitioning to Ukrainian, while DeepSeek-V3 remains comparatively stable within our benchmark. We additionally find that human and AI evaluators agree weakly on empathy and naturalness but diverge on cultural grounding. We further argue that producing Ukrainian text is not equivalent to producing Ukrainian emotional support. Our findings may assist in the future development of more culturally tailored benchmark designs, as well as encourage a stronger emphasis on human-centered evaluation.

OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets

1/10

Rheeya Uppaal, Seungwoo Lyu, Selina Sung, Junjie Hu

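Personalized recommendation rationale:

This paper focuses on LLM safety and intent calibration, which belongs to AI safety and is unrelated to the core techniques of recommender systems, search, or advertising. Although it involves LLMs, its subject is safety rather than efficiency, architecture, or direct application, so it falls outside the focus areas.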

2026-07-02 11:14:52 | arXiv:2607.02047v1 |

cs.CL cs.AI

Safe completion requires models to provide useful assistance without enabling harm, but this behavior is difficult to evaluate with isolated prompts. We introduce OpenSafeIntent, a benchmark of controlled prompt-sets that vary intent while holding the underlying task fixed. Each datapoint contains benign, dual-use, and malicious variants of the same task. This design lets us evaluate whether models calibrate assistance across intent shifts, rather than merely appearing safe on average. Across a broad model suite, we find that prompt-level safety hides important failures: models often fail to remain safe across matched intent variants, dual-use behavior is brittle under paraphrase, high-level answers on risky topics are not reliably safe, and responses that reframe ambiguous requests into safer tasks are substantially less likely to cross the safety boundary. Our results suggest that safe completion should be evaluated as intent-calibrated behavior over controlled task variants, not as a single safety-helpfulness tradeoff over independent prompts.

PACE: A Proxy for Agentic Capability Evaluation

1/10

Yueqi Song, Lintang Sutawika, Jiarui Liu, Lindia Tjuatja, Jiayi Geng, Yunze Xiao...

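Personalized recommendation rationale:

This paper focuses on evaluating the capabilities of AI agents, which is general agent capability evaluation with no direct connection to the core techniques of recommender systems, search, or advertising (such as ranking, matching, and user modeling). Although agent capability may indirectly affect LLM applications, the title offers no clear motivation or method for applying it to RecSys/Search/Ads, so its relevance is very low.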

2026-07-02 10:59:03 | arXiv:2607.02032v1 |

cs.AI cs.CL

Evaluating LLM agents on benchmarks like SWE-Bench and GAIA can be expensive and time-consuming, and it requires complex infrastructure. A single evaluation can cost thousands of dollars and take days to complete. In contrast, non-agentic LLM benchmarks that test individual capabilities (e.g., reasoning, code generation) are fast and cheap to run. In this paper, we investigate whether performance on expensive agentic benchmarks can be accurately predicted by the performance on a small, carefully selected subset of atomic evaluation instances. We introduce PACE, a framework that constructs proxy benchmarks by selecting instances from existing non-agentic evaluations whose aggregate scores most reliably predict model performance on agentic benchmarks. Given a pool of candidate instances spanning atomic capabilities, PACE fits a regression that maps a model's scores on a compact subset of source instances to its score on the target agentic benchmark. The subset itself is curated by combining two complementary instance-selection strategies, target-relevance local selection and globally informative global selection. We apply PACE to the 4 target agentic benchmarks in this paper, which yields PACE-Bench, the concrete proxy benchmark that we evaluate in the paper. Experiments across 14 models, 4 agentic benchmarks, and 19 non-agentic benchmarks show that PACE-Bench predicts agentic scores with leave-one-out cross-validation (LOOCV) mean absolute error (MAE) under 4%, Spearman correlation above 0.80, and pairwise model-ranking accuracy around 85%, all at much less than 1% of the full agentic evaluation cost. We further analyze the selected proxy instances, revealing which skills each agentic benchmark uniquely demands. PACE enables practitioners to obtain reliable estimates of agentic performance during model development, selection, and routing, without the overhead of full agent evaluation.
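
The evaluation protocol described in the abstract above (fit a regression from proxy scores to an agentic score, then report leave-one-out MAE) can be sketched in a few lines. This is not PACE's implementation: it assumes a single feature per model (for example, the mean score on the proxy subset) and ordinary least squares, and the function names and scores are invented for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def loocv_mae(proxy_scores, agentic_scores):
    """Leave-one-out MAE: hold out each model, fit on the rest, predict it."""
    errors = []
    for i in range(len(proxy_scores)):
        train_x = proxy_scores[:i] + proxy_scores[i + 1:]
        train_y = agentic_scores[:i] + agentic_scores[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        prediction = slope * proxy_scores[i] + intercept
        errors.append(abs(prediction - agentic_scores[i]))
    return mean(errors)


# Invented scores for five hypothetical models (fractions in [0, 1]).
proxy = [0.40, 0.55, 0.60, 0.75, 0.90]
agentic = [0.10, 0.22, 0.25, 0.41, 0.52]
print(loocv_mae(proxy, agentic))
```

Holding out one model at a time matters here because the number of models is small (14 in the paper), so an in-sample error would overstate how well the proxy predicts a new model.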

EduArt: An educational-level benchmark for evaluating art history knowledge in large language models

1/10

Gianmarco Spinaci, Lukas Klic, Giovanni Colavizza

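Personalized recommendation rationale:

This paper focuses on evaluating art history knowledge in an educational setting, a discipline-specific application unrelated to the core techniques of recommender systems, search, or advertising. It also does not involve LLM applications or architectural innovations in those areas.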

2026-07-02 10:43:06 | arXiv:2607.02007v1 |

cs.CL cs.CV

Large language models now score near ceiling on general benchmarks, but these aggregate measures reveal little about how models behave within single disciplines. Existing art-focused evaluations rely on synthetic questions and rarely report item-level properties. This paper introduces EduArt, an educational-level benchmark for art-historical knowledge and visual reasoning in multimodal LLMs. EduArt comprises 871 human-authored questions from Italian secondary-school exercises and US Advanced Placement Art History exams, spanning two languages and seven formats from multiple choice to in-text word placement and error identification. Twelve models from six provider families were evaluated under a default answer-only condition and a motivation condition requiring written justification, and characterized using Classical Test Theory and a logistic regression isolating the effects of format, language, image presence, and model. The benchmark showed strong psychometric properties (mean discrimination 0.514, 82.3 percent good discriminators), while multiple-choice accuracy saturated near ceiling for six models, showing recognition formats alone cannot distinguish frontier models. Format was a strong independent predictor of accuracy: models exceeding 94 percent on multiple choice fell to 23.9 percent on open completion (Claude Opus 4.6) and 6.2 percent on error identification (Claude Sonnet 4.6). The motivation condition changed accuracy in a predominantly negative, family-dependent direction. These dissociations indicate that art-historical knowledge and the ability to deploy it are distinct capabilities, and that single-format benchmarks overestimate what models can reliably do. Mapping this capability profile is a precondition for responsible use of multimodal LLMs in art-historical scholarship, where tasks demand producing and manipulating content rather than selecting from fixed options.

Using embeddings to predict spoken word duration and pitch in Mandarin monosyllabic words

1/10

Xiaoyun Jin, Mirjam Ernestus, R. Harald Baayen

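Personalized recommendation rationale:

This paper focuses on predicting speech features (duration and pitch), which belongs to speech processing. It has no direct connection to the core techniques of recommender systems, search, or advertising and lacks practical applicability to them.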

2026-07-02 10:38:49 | arXiv:2607.02002v1 |

cs.CL

Time-normalized f0 contours of Mandarin words in conversational speech have been shown to be predictable in part from their contextualized embeddings (CEs). The present study investigates whether CEs also predict spoken word duration for 7470 tokens of Mandarin monosyllabic CV words extracted from a Mandarin corpus of spontaneous speech. We show that CEs indeed are predictive for duration, above chance level, not only at the type level, but also at the level of individual tokens, as indicated by the results obtained with the type-wise and token-wise permutation baselines. We also show that the predicted durations are sufficiently precise to back-transform predicted f0 contours in [0,1] normalized time to contours on the ms time scale. The resulting predicted contours approximate empirical contours and also outperform a permutation baseline.

Robust for the Wrong Reasons: The Representational Geometry of LLM Robustness to Science Skepticism

1/10

Minjong Cheon

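Personalized recommendation rationale:

This paper studies LLM robustness to science skepticism (such as climate change denial), a safety and robustness topic in NLP that is unrelated to the core techniques of recommender systems, search, or advertising. Although it involves LLMs, it does not explore potential applications to RecSys/Search/Ads, so its relevance is very low.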

2026-07-02 09:40:52 | arXiv:2607.01951v1 |

physics.soc-ph cs.AI cs.CL

Large language models (LLMs) are increasingly consulted on contested scientific questions, raising the concern that they will sycophantically retreat from established consensus when a user signals doubt -- drifting toward a false balance that treats settled science as one view among several. We test this across three open instruction-tuned models (Llama-3.1-8B, Qwen2.5-7B, Mistral-7B), three consensus-science domains (climate, vaccines, evolution), and single- and multi-turn settings, combining behavioral measurement with linear probing and activation patching. We do not observe sycophantic retreat. Instead, models show three distinct policies under the same skeptical pressure: reactive assertion, where consensus assertion increases rather than decreases (Llama); surface hedging, where tone softens while the position holds (Qwen); and non-response (Mistral). Pairwise judgments confirm the reactive shift is stance, not style (63.6%, p=.007), and a decomposition identifies increased consensus assertion, not false balance, as its driver (beta=+0.042 per dose, p<1e-77). Linear probes localize the divergence to middle layers -- perfect separation in Llama and Qwen versus 72% in Mistral, with non-overlapping confidence intervals -- indicating the non-responsive model does not linearly represent the skepticism signal at all. Crucially, this robustness does not transfer: it attenuates across domains and, in the safety-critical vaccine domain, can reverse, with myth-rebuttal weakening under skeptical pressure. We synthesize these into a four-way taxonomy separating active from accidental robustness, and argue that behavioral evaluation alone cannot distinguish a model that resists skepticism because it understands the signal from one that only appears to resist because it fails to perceive it.

PhysMani: Physics-principled 3D World Model for Dynamic Object Manipulation

1/10

Peng Yun, Shouwang Huang, Hao Li, Jinxi Li, Jianan Wang, Bo Yang

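Personalized recommendation rationale:

This paper focuses on physical manipulation of dynamic objects and 3D world modeling, which belongs to robotics and graphics and has no direct connection to the core techniques of search, recommendation, or advertising. Although it involves world models, it lacks a clear recommendation or search application scenario and does not apply LLMs or Transformer architectures to RecSys/Search/Ads.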

2026-07-02 09:32:39 | arXiv:2607.01938v1 |

cs.RO cs.AI cs.CL cs.CV cs.LG

Manipulating fast and dynamically moving targets in unstructured 3D environments remains challenging for embodied AI. Existing visual-language-action models and world models struggle with accurate 3D geometry and physically meaningful forecasting. We propose PhysMani, a framework that couples a physics-principled 3D Gaussian world model with a future-aware action policy model. The world model learns a divergence-free Gaussian velocity field via online optimization for fast and physically grounded future dynamics prediction. The policy model integrates the predicted 3D scene future dynamics through a learnable token-based cross-attention module. We introduce PhysMani-Bench, a dynamic manipulation benchmark with 16 tasks, and demonstrate a superior success rate over strong baselines in both simulation and real-world robot experiments.

AIriskEval-edu: New Dataset for Risk Assessment in AI-mediated K-12 Educational Explanations

1/10

Javier Irigoyen, Roberto Daza, Francisco Jurado, Julian Fierrez, Ruben Tolosana,...

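Personalized recommendation rationale:

This paper focuses on risk assessment of AI in education, which falls under educational applications or safety and ethics. It has no direct connection to the core techniques of RecSys/Search/Ads or to applications of LLMs/Transformers in those areas. Under the exclusion criteria, safety, ethics, and non-technical topics are outside the focus areas.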

2026-07-02 09:28:21 | arXiv:2607.01934v1 |

cs.CL cs.AI cs.DB

This work introduces AIriskEval-edu-db2, a new dataset designed to train and evaluate auditors based on LLMs for an explainable pedagogical risk assessment in instructional content for grades K-12. The dataset comprises 1,639 explanations from 170 curated ScienceQA questions, covering science, language arts, and social sciences. For each question, the dataset includes an explanation written by a human teacher alongside 11 explanations generated by LLM-simulated teacher profiles associated with distinct pedagogical risks. We propose a comprehensive risk rubric aligned with established educational standards that covers five complementary dimensions: factual precision, depth and completeness, focus and relevance, student-level appropriateness, and ideological bias. A key contribution is the addition of 785 explanations with structured explainability annotations, including risk localization and risk description. The annotations are produced through a semi-automatic process with expert teacher validation. Finally, we present validation experiments comparing state-of-the-art proprietary models with a lightweight local Llama 3.1 8B model in both the pedagogical risk detection and the explainability assessment. These experiments evaluate whether supervised fine-tuning on AIriskEval-edu-db2 enables a locally deployable model to approach or outperform stronger frontier models while preserving privacy in educational auditing and assessment tasks.

TUDUM: A Turkish-Thinking Reasoning Pipeline for Qwen3.5-27B

1/10

Baran Bingol, Bahaeddin Turkoglu

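Personalized recommendation rationale:

This paper focuses on a reasoning pipeline for a specific language (Turkish), which falls under NLP language adaptation or LLM fine-tuning and is unrelated to the core techniques of recommender systems, search, or advertising. The title mentions no application or technical improvement related to RecSys/Search/Ads.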

2026-07-02 09:22:36 | arXiv:2607.01927v1 |

cs.CL cs.AI

This paper presents TUDUM (Türkçe Düşünen Üretken Model), a project pipeline for adapting a Qwen-family 27B thinking model toward Turkish reasoning. The central problem is not only to answer Turkish prompts in Turkish, but to make the explicit reasoning trace itself Turkish. A thinking model may translate a Turkish prompt into an English-centered internal or visible scratchpad, solve the problem mostly in English, and only localize the final answer. TUDUM instead treats the generated <think>...</think> block as a trainable behavior. The pipeline starts from the project base checkpoint unsloth/Qwen3.5-27B, applies supervised fine-tuning (SFT) on 15,991 Turkish reasoning examples using LoRA adapters, and then applies GRPO-family reinforcement learning on a proxy-filtered Turkish mathematics environment. The results are mixed. SFT made the model shorter and more consistently Turkish in its reasoning behavior, with large reductions in average response length and thinking exhaustion, but reduced benchmark accuracy. RL recovered some mathematical performance, especially AIME24 at the best early checkpoint, yet did not uniformly improve all benchmarks and did not exceed the base model on the reported Macro-6 average. The contribution is therefore best framed as a technically honest Turkish-thinking reasoning pipeline and evaluation, not as a claim of state-of-the-art Turkish reasoning. The released step-50 model is publicly available.

The Grammar Does the Work: Functional vs. Lexical Dependency Length Minimization Across Universal Dependencies

1/10

Kim Gerdes

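Personalized recommendation rationale:

This paper studies dependency length minimization in linguistics, which belongs to theoretical linguistics and has no direct connection to technical advances in recommender systems, search, or advertising. It shows no clear potential for application to ranking, retrieval, or user modeling.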

2026-07-02 08:55:07 | arXiv:2607.01899v1 |

cs.CL

Dependency length minimization (DLM) is a well-documented processing universal, but previous studies report a single mean dependency distance (MDD) per language, obscuring variation across syntactic relation types. We analyze 122 languages in UD and SUD (version 2.17), showing that DLM operates on two distinct levels. Grammar-driven optimization targets functional dependencies (det, case, aux), which are universally short (mean 1.71, σ = 0.33) and invariant across typologically diverse languages. Processing-driven optimization operates on lexical dependencies (nsubj, obj, obl), which are longer (mean 2.87), highly variable (σ = 0.63), and constrained by word-order typology. This asymmetry holds in SUD despite reversed head direction (r = 0.92). We conclude that "the grammar does the work" of minimization by scaffolding sentences with local functional attachments, leaving processing pressures to determine the ordering of lexical heads.
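
The split the abstract above describes, a separate mean dependency distance for functional and for lexical relations, can be illustrated on a single sentence. The sketch below is an assumption-laden toy: the two relation sets contain only the examples the abstract names (the paper's full inventories may differ), the function name is hypothetical, and the parse is hand-written in UD style.

```python
from collections import defaultdict
from statistics import mean

# Only the example relations named in the abstract; not a full inventory.
FUNCTIONAL = {"det", "case", "aux"}
LEXICAL = {"nsubj", "obj", "obl"}


def mean_dependency_distance(tokens):
    """Mean |dependent index - head index| per relation group.

    tokens: list of (index, head_index, relation), 1-based; head 0 is the root.
    """
    lengths = defaultdict(list)
    for index, head, relation in tokens:
        if head == 0:
            continue  # the root has no governor, so it has no distance
        if relation in FUNCTIONAL:
            lengths["functional"].append(abs(index - head))
        elif relation in LEXICAL:
            lengths["lexical"].append(abs(index - head))
    return {group: mean(values) for group, values in lengths.items()}


# "The dog chased a cat in the garden", analysed by hand.
sentence = [
    (1, 2, "det"),    # The    -> dog
    (2, 3, "nsubj"),  # dog    -> chased
    (3, 0, "root"),   # chased
    (4, 5, "det"),    # a      -> cat
    (5, 3, "obj"),    # cat    -> chased
    (6, 8, "case"),   # in     -> garden
    (7, 8, "det"),    # the    -> garden
    (8, 3, "obl"),    # garden -> chased
]
result = mean_dependency_distance(sentence)
print(result)
```

Even in this one sentence the functional attachments stay within one or two words of their heads while the oblique spans five, which is the kind of asymmetry a single per-language average would hide.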

SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use

1/10

Jiayin Zhu, Kelong Mao, Yudong Guo, Dengbo He, Sulong Xu, Simiu Gu, Yutao Yue

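Personalized recommendation rationale:

The paper focuses on evaluating and improving agent skills, which belongs to the LLM agent area, but it does not involve specific applications in recommendation, search, or advertising and proposes no techniques related to core tasks such as ranking or retrieval. It lacks direct or potential application to the current focus areas.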

2026-07-02 08:28:51 | arXiv:2607.01874v1 |

cs.AI cs.CL

Skills are becoming a reusable operational layer for LLM agents, encoding SOPs, domain rules, tool workflows, scripts, and validation routines. In realistic skill repositories, overlapping skills make reliable skill-use difficult. Final verifier success is too coarse for both evaluation and training, since an agent may pass through trial and error while selecting distractor skills, skipping required steps, composing workflows incorrectly or omitting final checks. We introduce SkillCoach, a self-evolving rubric framework for evaluating and enhancing agentic skill-use. SkillCoach derives skill-grounded process rubrics from real rollouts and evaluates trajectories along four dimensions: skill selection, skill following, skill composition, and skill-grounded reflection. It keeps the external verifier as a separate outcome signal, allowing process quality to be distinguished from accidental task success. The evolved rubrics further serve as process supervision for selecting high-quality training trajectories. Experiments show that evolved rubrics substantially improve evaluation quality, expose failures hidden by final accuracy, and provide stronger supervision signals than outcome-only filtering for enhancing agentic skill-use.

Safety Targeted Embedding Exploit via Refinement

1/10

Joshua Adrian Cahyono

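Personalized recommendation rationale:

The title concerns embedding exploits and safety, with a focus on attack and defense, which places the paper in the security area. Per your requirements, non-technical topics such as security and privacy are out of scope, and there is no clear application scenario related to recommendation, search, or advertising. Relevance is therefore very low.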

2026-07-02 08:17:57 | arXiv:2607.01859v1 |

cs.AI cs.CL

Safety training for large language models (LLMs) is conducted predominantly in English, leaving uncertain how well safety mechanisms generalize to low-resource languages and mixed-language code-switching. We show that this creates an epistemic gap in which models confidently generate harmful responses for inputs that fall outside the distribution of their safety training. To study this phenomenon, we introduce STEER (Safety Targeted Embedding Exploit via Refinement), a gradient-guided attack that identifies words contributing most strongly to the model's refusal behavior and iteratively translates them into low-resource languages to suppress refusal while preserving harmful intent. Across six open-source 8B-parameter models, STEER achieves attack success rates of up to 93.0% on JailbreakBench and 96.7% on AdvBench, outperforming random code-switching and Greedy Coordinate Gradient (GCG). The resulting prompts also transfer to GPT-4o-mini, achieving a 35.5% attack success rate without requiring access to the target model, suggesting that the underlying weakness is not specific to a single architecture. These findings demonstrate that safety mechanisms aligned primarily on English cannot be assumed to generalize across multilingual inputs. We argue that improving multilingual safety requires broader coverage during alignment and mechanisms that explicitly detect and abstain on out-of-distribution inputs.

Non-synchronism in Global Usage of Research Methods in Library and Information Science from 1990 to 2019

1/10

Chengzhi Zhang, Liang Tian

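Personalized recommendation rationale:

The paper focuses on the evolution of research methods in Library and Information Science. It is a methodological study of a specific discipline with no direct connection to the core technology of recommender systems, search, advertising, or large language models. It does not touch LLMs, Transformers, recommendation architectures, or other current areas of focus, so relevance is very low.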

2026-07-02 07:53:45 | arXiv:2607.01833v1 |

cs.DL cs.CL cs.CY

The global development of Library and Information Science (LIS) is influenced by various factors such as the economy, society, culture, discipline, tradition, and more. Consequently, the research methods of LIS vary greatly among countries. To better understand these differences, we conducted a study of 5,281 research papers from 81 countries published in internationally representative journals over the past thirty years. We manually annotated the research methods used in some articles through content analysis, and subsequently developed and trained a deep learning model for automatic classification of research methods. Using this method, we conducted a comparative analysis of the usage of research methods in different countries. Our findings reveal that there are differences in the research methods used across countries, with each country having its unique research profile and distribution of research methods. Even when investigating the same topic, research methods can differ between countries. Our study also uncovers that there are differences between the national and international distribution of research methods, but these differences have decreased over the past 30 years. By highlighting the characteristics of discipline development in various countries from the perspective of research methods, our study can help guide discipline development at the national level. This study provides insights into the usage trends of research methods across different countries and highlights the unique characteristics of discipline development in each country. This information can be valuable in promoting collaboration and understanding between countries and in guiding discipline development at the national level.

Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge

1/10

Alex Brooker, Tim Hughes

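Personalized recommendation rationale:

The paper focuses on evaluating aviation-specific knowledge, a domain-specific application unrelated to the core technology of recommender systems, search, or advertising. It shows no general contribution to LLM foundation architectures or Transformer efficiency, and offers no plausible path to application in RecSys/Search/Ads.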

2026-07-02 07:49:55 | arXiv:2607.01829v1 |

cs.AI cs.CL

Large language models (LLMs) are increasingly proposed for aviation business operations, from documentation and training generation to customer facing assistants. General purpose benchmarks do not measure whether a model reasons safely and correctly about aviation specific operational knowledge, and the high stakes, regulated nature of the domain makes that gap consequential. We present Pre-Flight, an open source benchmark of 300 multiple choice questions drawn from international standards and airport ground operations material, covering international airport ground operations, ICAO and US FAA regulations, aviation general knowledge and complex operational scenarios. Questions were authored and reviewed by practitioners with experience in air traffic management, ground operations and commercial flying. We evaluate a range of contemporary commercial and open weight models using the Inspect evaluation framework, scoring by accuracy under a standard multiple choice protocol, and we maintain the leaderboard on a rolling basis as new models are released. Against an informal expert reference of around 95%, obtained from a low sample quiz of aviation professionals at a conference, even the strongest model evaluated (released in 2026) reaches 82.7%, having improved only gradually from roughly 75% in early 2025. A substantial and persistent gap below expert level reliability therefore remains. We release the dataset, the evaluation harness and the results, and the benchmark is available within the community evaluations package distributed with inspect_evals. We argue that domain specific evaluation of this kind is a necessary precondition for responsible deployment of generative AI in non safety critical aviation operations.

Gender Differences in Research Topic and Method Selection in Library and Information Science: Perspectives from Three Top Journals

1/10

Chengzhi Zhang, Siqi Wei, Yi Zhao, Liang Tian

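Personalized recommendation rationale:

The paper studies gender differences in Library and Information Science, which belongs to the social sciences and has no direct connection to the technical core of recommender systems, search, advertising, or large language models. It involves no model architecture, algorithm, or application, and falls entirely outside the screening scope.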

2026-07-02 07:48:33 | arXiv:2607.01828v1 |

cs.DL cs.CL cs.CY

Research in the social sciences has shown that there are gender differences in the selection of research methods, with women often opting for qualitative methods while men prefer quantitative methods. However, it is important to consider that research methods are generally chosen based on the research topic. To figure out the influence of gender on research method selection, a study was conducted in the field of Library and Information Science, using a more fine-grained method classification system and an automatic classification model called CogFT, which is based on full-text cognition. The findings showed that women tend to use Interview while men prefer Theoretical approach, across a range of topics. The study offers insights into the specific research design processes that contribute to gender differences in method selection and suggests ways to promote gender inclusivity and equality in academia by considering research method use and guidance.

Self-Supervised Test-Time Tuning for Packet Loss Concealment

1/10

Yehoshua Dissen, Joseph Keshet

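Personalized recommendation rationale:

The paper focuses on packet loss concealment in audio/speech processing, which belongs to communications or signal processing and has no direct connection to the core technology of recommender systems, search, or advertising. Self-supervised test-time tuning is a general technique, but the specific application is far from the user's areas of interest, so relevance is very low.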

2026-07-02 07:45:09 | arXiv:2607.01823v1 |

eess.AS cs.CL

Packet loss concealment (PLC) reconstructs audio packets that are missing at the receiver, usually with a trained model whose parameters remain fixed at deployment time. This treats the PLC model as static, even though each call or recording exposes signal-specific information through the packets that did arrive. We present TTT-PLC, a self-supervised test-time tuning framework that adapts existing PLC models using only those received packets. The method creates supervision by synthetically masking portions of the available signal, training the model to conceal them with its native PLC objective, and then using the adapted model to reconstruct the true packet losses. No clean reference signal, external adaptation data, or architectural modification is required. We study TTT-PLC in two deployment settings. In the non-causal setting, the received file is available before reconstruction, allowing repeated self-supervised adaptation passes and providing a per-file adaptation ceiling. In the causal setting, audio is streamed without revising emitted samples; adaptation is performed only on completed past blocks, and updated parameters affect only future audio. We instantiate the framework on two public PLC backbones, FRN, a recurrent full-band speech PLC model, and PARCnet, a hybrid autoregressive-neural model for networked music. Across these settings, the results show that pretrained PLC systems do not need to be treated as fixed at inference time: the still-observed portions of a lossy signal can provide an effective training signal for improving concealment on that same signal.
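The synthetic-masking step described above can be sketched roughly as follows. This is an illustration only, not the TTT-PLC implementation: the function name `make_self_supervised_pair`, the `mask_ratio` parameter, and the use of `None` as the "missing" marker are all assumptions, and the actual model update on the resulting pair is omitted.

```python
import random

def make_self_supervised_pair(packets, received, mask_ratio=0.2, seed=0):
    """Hide a random subset of the *received* packets to create a training pair.

    packets:  list of per-packet payloads (lost slots may hold any placeholder)
    received: list of bools, True where the packet actually arrived
    Returns (model_input, targets). model_input has None at every truly lost
    or synthetically masked position; targets maps masked index -> payload.
    """
    rng = random.Random(seed)
    arrived = [i for i, ok in enumerate(received) if ok]
    if not arrived:
        # Nothing arrived, so there is nothing to learn from.
        return [None] * len(packets), {}
    n_mask = max(1, int(len(arrived) * mask_ratio))
    masked = set(rng.sample(arrived, n_mask))
    model_input = [
        p if (ok and i not in masked) else None
        for i, (p, ok) in enumerate(zip(packets, received))
    ]
    targets = {i: packets[i] for i in masked}
    return model_input, targets

# Hypothetical 10-packet stream in which packets 2 and 6 were really lost.
packets = [f"pkt{i}" for i in range(10)]
received = [True, True, False, True, True, True, False, True, True, True]
model_input, targets = make_self_supervised_pair(packets, received)
```

Because the targets come only from packets that did arrive, no clean reference signal is needed, which matches the constraint stated in the abstract. The truly lost positions never appear as targets.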

On the Limits of Steering Vectors for Preference-Aligned Generation

1/10

Melanie Subbiah, Zara Hall, Kathleen McKeown

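Personalized recommendation rationale:

The paper addresses preference alignment techniques in LLM generation, which is core-LLM alignment or safety research with no direct or indirect application to recommendation, search, or advertising. The topic is a method for controlling model generation and does not touch core problems such as user modeling or ranking in recommender systems or advertising.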

2026-07-02 07:18:36 | arXiv:2607.01802v1 |

cs.CL

Steering vectors have emerged as a promising approach to controlled text generation, offering interpretable, training-free mechanisms for shaping model outputs. However, their practical generality remains poorly understood. We study the limits of steering vector generalization along three dimensions: trait expressibility, task transfer, and multi-trait composition. Using the PLUME writing personalization benchmark, we extract steering vectors for a range of preferences and evaluate them on summarization and email-writing tasks across two open-source models (Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct). We find that steering effectiveness varies substantially across traits. We further show that steering effectiveness can degrade when vectors extracted from positive and negative style examples are transferred to downstream writing personalization tasks. Finally, we compare common methods for composing multiple steering vectors and find that all methods suffer significant drops in trait expression as more vectors are added, with a tradeoff between coherence and expressibility that requires per-setting hyperparameter tuning. Taken together, our results suggest that steering vectors face meaningful limits as a general-purpose tool for preference alignment.

Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving

1/10

Ruchao Fan, Yiming Wang, Rui Zhao, Liliang Ren, Keqi Deng, Xiaoyang Chen, Ali Za...

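Personalized recommendation rationale:

The paper focuses on joint speech-text training for automatic speech recognition (ASR). It combines speech processing with LLMs, but the main application is ASR rather than recommendation, search, or advertising. It does not discuss how the technique could be applied to core problems such as user modeling, feature representation, or ranking, so it is unrelated to the current areas of focus.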

2026-07-02 05:42:01 | arXiv:2607.01733v1 |

cs.CL eess.AS

Speech-LLM integration has shown promising results by leveraging extensive textual pretraining, yet its specific benefits for automatic speech recognition (ASR) remain unclear. We observe that as supervised ASR training data increases, the contribution of LLM priors becomes less evident, and simple speech-text joint training under-utilizes textual knowledge. We therefore propose Joint Speech-Text Interleaved Pretraining (JSTIP), an ASR-oriented pretraining strategy that constructs word-level and segment-level interleaved speech-text sequences within aligned pairs for speech-LLM architectures that accept continuous inputs. Experiments on 38k hours of ASR data show consistent entity accuracy improvement compared to ASR-only and joint speech-text training baselines. JSTIP achieves on-par entity recognition performance using domain transcription text compared to synthetic speech-text pairs, simplifying domain adaptation. Benefiting from textual pretraining and domain text data, JSTIP is competitive with open-source ASR and Speech-LLM systems in medical entity recognition. The zero-shot speech question answering behaviors further suggest that interleaving reduces the speech-text modality gap and preserves the LLM generative prior, which is likely the reason for the entity improvements on the ASR task.

Beyond Pixel Diffs: Benchmarking Image Change Captioning for Web UI Visual Regression Testing

1/10

Licheng Zhang, Bach Le, Pengtao Zhao, Naveed Akhtar

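Personalized recommendation rationale:

The paper focuses on visual regression testing of Web UIs, which belongs to software testing and is unrelated to the core technology of recommender systems, search, or advertising. The title does not involve LLMs, Transformers, or modeling problems in recommendation, search, or advertising.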

2026-07-02 05:33:57 | arXiv:2607.01728v1 |

cs.CV cs.CL cs.SE

Visual regression testing (VRT) is a standard quality assurance step in modern software release pipelines. On every change, it re-renders user interface (UI) screenshots, compares each one against an approved baseline image, and routes any detected difference to a human reviewer who decides whether it is an intended update or an unintended regression. A widely used approach, especially in open-source and continuous-integration pipelines, is pixel-level comparison, which is semantically blind and treats rendering noise and genuine defects identically, producing large volumes of false positives that force developers and testers to spend substantial time and effort manually reviewing flagged differences at every release cycle. Industry tools apply machine learning to VRT, but lack public evaluation. More critically, no dataset or benchmark exists to support natural language descriptions of UI changes, a capability that tells testers what changed in words instead of leaving them to interpret a binary flag or a highlighted region. To address the gap, we propose a new task, Web UI Image Change Captioning (WUICC), which sits at the intersection of VRT and image difference captioning (IDC), and release WUICC-bench, the first dataset and benchmark for the task. We evaluate eleven representative IDC methods, together with two zero-shot general-purpose LLMs. We find that (1) these methods tend to struggle in the Web UI domain due to its layout diversity, dense text, and fine-grained changes, yet (2) the trained methods already suppress non-meaningful visual noise far more selectively than the pixel-level comparison VRT relies on, providing a solid foundation for future domain-specific research.

Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing

1/10

Joshua Penman

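Personalized recommendation rationale:

The title concerns inducing an epistemic frame through gradient editing, which falls under interpretability or knowledge representation, but it does not clearly target recommendation, search, or advertising. There is no clear application scenario, and it is unrelated to the current focus.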

2026-07-02 04:31:41 | arXiv:2607.01690v1 |

cs.AI cs.CL cs.LG

Finetuning a language model on documents that are explicitly annotated as fictional results in a model that still actually believes the documents' core claims, an effect known as Negation Neglect. In our evaluations, models trained on documents prefixed and suffixed with such annotations correctly identify the relevant claims as fictional only about 9% of the time. To address this, we introduce Goggles, a learned module that intervenes on the finetuning gradient rather than the data. During supervised finetuning, a Goggles module edits the gradients an LLM LoRA receives, imparting a chosen epistemic frame (the stance the model takes toward the nature of what it reads) to whatever the documents teach. A Goggles instance is trained once for a given base model, frame, and LoRA configuration, then applied frozen to documents it was never trained on. Trained through Goggles on those same documents, now carrying no fictional annotation, the model flags the content as fictional roughly 91% of the time, while preserving capability (GPQA and TruthfulQA match or exceed baseline). The same architecture supports other frames: a Goggles instance can be trained to treat documents as "part of an AI safety evaluation by Redwood Research" rather than simply as fiction. The imparted frame persists under continued finetuning that pushes back toward the claim, where prior interventions revert. Goggles suggests a path toward training language models on known-misaligned data without absorbing the behaviors that data demonstrates.

ProWAFT: A ROMA-LPD Instance for Workload-Aware and Dynamic Fault Tolerance in FPGA-Based CNN Accelerators

1/10

Xinxin Chen, Haoran Qiao, Yiming Guo, Kecheng Luo, Siyuan Feng, Jingwen Ma

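Personalized recommendation rationale:

The paper mainly addresses fault tolerance in FPGA accelerators and has no direct connection to core advances in search, recommendation, or advertising, or to LLM/Transformer technology. Fault tolerance is a hardware reliability topic and does not belong to any of your current research directions.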

2026-07-02 02:04:07 | arXiv:2607.01602v1 |

cs.CL

SRAM-based FPGAs provide an attractive platform for energy- and latency-constrained CNN inference at the network edge, yet transient faults can lead to silent errors that compromise reliability. Always-on redundancy (e.g., full TMR) improves correctness but incurs substantial performance and energy overhead, while reactive recovery may introduce unacceptable latency on the critical path. We propose ProWAFT, a proactive workload-aware fault-tolerance framework for FPGA-based CNN accelerators that uses partial reconfiguration to selectively apply TMR across reconfigurable partitions. ProWAFT quantifies workload criticality, models fault propagation and reconfiguration overhead, and selects configurations that minimize a composite objective over latency, energy, and reliability risk. Implemented on a Xilinx Zynq UltraScale+ ZCU104 platform with six reconfigurable regions and evaluated on a 500-task trace derived from ResNet-18, MobileNetV2, and EfficientNet-Lite under time-varying SEU injection, ProWAFT achieves lower composite cost than static TMR and reactive reconfiguration while maintaining high task success rate and near-baseline throughput with low online decision overhead.

Parameter Golf: What Really Works?

1/10

Prashanna Mani Paudel, Shivanand Venkanna Sheshappanavar

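Personalized recommendation rationale:

The title mentions no technical content related to recommender systems, search, or advertising, nor does it touch on advances in LLM or Transformer architectures. "Parameter Golf" may refer to hyperparameter tuning, but without a clear application context it does not fall within the core areas of focus.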

2026-07-01 22:29:40 | arXiv:2607.01517v1 |

cs.CL

How far can a language model improve under a strict artifact budget? Parameter Golf posed this question as an open community challenge in which participants trained the best language model, with the complete artifact (training code + compressed weights) required to fit within 16 MB and be trained in under ten minutes on 8xH100 SXM GPUs. Quality was measured in bits-per-byte (BPB), the average number of bits required to encode each byte of unseen text. We analyze 2,037 pull requests and 1,430 clean scored submissions from the contest, build a taxonomy of 84 optimization techniques, and measure each technique's contribution to BPB. The verified leaderboard score dropped from 1.2244 to 1.058 BPB across three phases -- a 13.6% reduction, despite individual techniques rarely improving BPB by more than 1%. We show that most gains in techniques shrink across competitive submissions, isolating the few methods that improve performance across stacks.
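As a quick check of the arithmetic in this abstract, the sketch below computes bits-per-byte from a summed negative log-likelihood and reproduces the quoted relative reduction. The helper names are made up for illustration, and the convention that the model reports its loss in nats is an assumption, since the abstract does not say how the contest harness computes BPB.

```python
import math

def bits_per_byte(total_nll_nats, num_bytes):
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * num_bytes)

def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before

# Leaderboard endpoints quoted in the abstract.
reduction = relative_reduction(1.2244, 1.058)
print(f"{reduction:.1%}")  # 13.6%
```

The 13.6% figure is therefore relative to the starting score of 1.2244 BPB, an absolute drop of about 0.166 bits per byte.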

From Monolingual to Multilingual: Evaluating Mamba for ASR in South African Languages

1/10

Jesujoba O. Alabi, Julian Herreilers, Badr M. Abdullah, Dietrich Klakow

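Personalized recommendation rationale:

The paper focuses on automatic speech recognition (ASR) and the use of Mamba models in multilingual settings. It belongs to the speech area and has no direct connection to the core technology of recommender systems, search, or advertising (such as ranking, recall, or multimodal fusion). The topic departs from direct applications of LLMs in recommendation, search, or advertising and from enabling technologies for them, so relevance is very low.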

2026-07-01 22:01:29 | arXiv:2607.01502v1 |

cs.CL

Recent advances in automatic speech recognition (ASR) have explored different sequence models, including Conformer-based models and newer state space models such as Mamba. Although prior work has evaluated these architectures in multiple languages, their effectiveness in African languages remains underexplored. In this work, we evaluate Mamba for ASR on seven South African languages. In monolingual experiments, each model is trained on 50 hours of speech per language, and we compare Mamba to a Conformer baseline of similar parameter scale. Mamba achieves similar recognition accuracy to Conformer while using fewer computational resources and training faster. We further evaluate generalization in this setting and find that both models struggle to generalize to speech that is much longer than what they were trained on. We then study multilingual ASR using Mamba models, where the baseline is pooling all languages together. On top of this, we tested three extensions: training with language-family information by adding both language and language-family embeddings as biases to the downsampled acoustic representations, and multitask learning with a CTC ASR objective and a language identification (LID) head. We find that multilingual training consistently improves performance over monolingual training. However, adding explicit language information does not improve in-domain performance but does improve cross-corpus robustness. We conducted ablation studies in low-resource multilingual settings using 5-hour and 10-hour per-language training data, where we observed gains from using language embeddings and further demonstrated that removing or altering them hurt model performance. Lastly, we analysed these embeddings and find that they do not capture linguistic similarity in a typological sense, but instead act as task-specific control vectors.

Comparing Architectures for Supervised Political Scaling

1/10

Anna Golub, Sebastian Padó

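Personalized recommendation rationale:

The paper compares supervised learning architectures for political science, a domain-specific application unrelated to the core technology of recommender systems, search, or advertising. It involves no LLMs, Transformers, or transferable methodology, so it is not relevant to the current areas of focus.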

2026-07-01 20:49:51 | arXiv:2607.01464v1 |

cs.CL

Text scaling, the task of positioning political actors on an ideological scale, is a fundamental task in political analysis. To ease the need for manual analysis, various NLP methods have been proposed for this task, including classification- and regression-based approaches, showing successes as well as limitations. The goal of our paper is to consolidate the state of the art in this area. We ask two questions: (a) Can the performance of scaling methods be improved by predicting scales not individually but jointly? (b) Is there a middle ground between classification and regression?

Grounded Optimization: A Layered Engineering Framework for Reducing LLM Hallucination in Automated Personal Document Rewriting

1/10

Shashank Indukuri, Adarsh Agrawal

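Personalized recommendation rationale:

The paper focuses on LLM hallucination, which is a pure NLP topic unrelated to the core technology of recommendation, search, or advertising. Although personal document rewriting may involve user content, the topic is an LLM application rather than core RecSys/Search/Ads work and does not match the areas of focus.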

2026-07-01 20:22:18 | arXiv:2607.01457v1 |

cs.CL cs.AI

Large language models (LLMs) are increasingly applied to resume optimization for applicant tracking systems, introducing hallucination failures distinct from general text generation: anachronistic technology injection, cross-domain terminology contamination, structural mutation, and content fabrication. We present Grounded Optimization, a five-layer framework combining temporal context validation, deterministic contamination detection, structural invariant enforcement, prompt-level grounding, and an evaluator agent. In ablation experiments across three LLMs, four temperature settings, and six layer configurations on 25 synthetic resumes spanning 14 industries, undefended baselines produce 2.48-5.36 detected hallucinations per resume. Among detectors independent of the active defenses, temporal hallucinations are reduced by 50-95% across all conditions; overall detected hallucination rate falls to 0.04-0.24. Prompt-level grounding alone achieves zero detected hallucinations at low temperature with a capable instruction-following model; higher temperatures and weaker models reveal the need for the deterministic layers as a complement. We release the contamination taxonomy, evaluation code, and raw data.
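One way to picture the "temporal context validation" layer is a deterministic check that flags technologies which did not yet exist during the role being described. The sketch below is a guess at the general idea, not the paper's implementation: the lookup table, the function name `find_anachronisms`, and the simple substring matching are all assumptions.

```python
# Hypothetical lookup table: technology -> year of first public release.
# The paper's actual knowledge source is not described in the abstract.
RELEASE_YEAR = {
    "kubernetes": 2014,
    "docker": 2013,
    "typescript": 2012,
}

def find_anachronisms(rewritten_text, role_end_year, release_year=RELEASE_YEAR):
    """Return technologies mentioned in the text that were released
    after the role ended, in alphabetical order."""
    lowered = rewritten_text.lower()
    return sorted(
        tech for tech, year in release_year.items()
        if tech in lowered and year > role_end_year
    )

# Hypothetical LLM-rewritten bullet for a role that ran from 2009 to 2012.
bullet = "Migrated batch jobs to Kubernetes and Docker."
flags = find_anachronisms(bullet, role_end_year=2012)
```

Here `flags` contains both `docker` and `kubernetes`, since neither was available by 2012. A check like this needs no model call, which is presumably why the abstract groups it with the deterministic layers that complement prompt-level grounding.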

On the Utility and Factual Reliability of Pruned Mixture-of-Experts Models in the Biomedical Domain

1/10

Atsuki Yamaguchi, Szymon Palucha, Léo Bijar, Aline Villavicencio, Nikolaos Aletr...

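Personalized recommendation rationale:

The paper focuses on the biomedical domain, a specific application area outside our scope. In addition, the topic involves model pruning and factual reliability, which leans toward NLP or domain-specific applications and is unrelated to the core technology of recommendation, search, or advertising. Relevance is therefore very low.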

2026-07-01 20:08:19 | arXiv:2607.01444v1 |

cs.LG cs.AI cs.CL

Mixture-of-Experts (MoE) models offer inference speedups via selective activation but impose substantial memory requirements because the whole network must remain loaded. Structured expert pruning is a practical approach for reducing deployment costs in resource-constrained settings. However, prior studies primarily evaluate benchmark utility, leaving the effect of pruning on factual reliability underexplored, particularly in high-stakes domains such as biomedicine. In this paper, we investigate how domain-specific expert pruning affects both utility and reliability. We assess four MoE models, six pruning methods, and multiple pruning ratios across generation and classification tasks under in-domain (biomedical) and cross-domain settings. Results reveal that moderate pruning preserves in-domain utility without immediate reliability decline, although hallucination risks increase at extreme pruning ratios. When shifting to the general domain, both utility and reliability degrade rapidly. These findings indicate that safe compression depends heavily on the task and domain. Evaluating pruned MoE models solely on utility is inadequate for high-stakes deployment without reliability assessment.

FaithMed: Training LLMs For Faithful Evidence-Based Medical Reasoning

1/10

Zhiyun Zhang, Liwen Sun, Xiang Qian, Chenyan Xiong

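Personalized recommendation rationale:

The paper focuses on faithful reasoning in the medical domain, a domain-specific application unrelated to the core concerns of recommender systems, search, or advertising. It does not fit any technology-trend or direct-application category.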

2026-07-01 20:02:55 | arXiv:2607.01440v1 |

cs.CL

Faithful reasoning is essential in medicine, where clinical decisions require transparent justification grounded in reliable evidence. Current medical LLMs either lack active access to evidence or use retrieved evidence without supervising how it should be appraised and applied during reasoning. To address this, we formalize evidence-based medicine principles as process-level criteria and introduce FaithMed, a framework that combines clinician-designed, automatically refined rubrics with reinforcement learning using step-level process reward assignment and advantage grouping. Across seven medical benchmarks, FaithMed improves over agentic-search baselines (+9% on average) and outcome-only RL (+5.8%), while raising average evidence-based medicine rubric scores over agentic-search Qwen3 baselines (+15.5%). This work demonstrates that explicit step-level supervision can improve both task success and the faithfulness of the reasoning process. Code is available at https://github.com/cxcscmu/FaithMed.

IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLMs

1/10

Samir Abdaljalil, Erchin Serpedin, Hasan Kurban

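Personalized recommendation rationale:

The paper mainly evaluates LLM reasoning versus knowledge retrieval on multi-domain science problems. It is a pure NLP evaluation benchmark with no direct or indirect connection to recommender systems, search, or advertising. Under the exclusion rules, it counts as an irrelevant topic.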

2026-07-01 19:49:49 | arXiv:2607.01431v1 |

cs.CL cs.AI

We introduce ISOSCI, a benchmark of isomorphic cross-domain science problem pairs that separates reasoning ability from domain knowledge retrieval in LLM evaluation. Each pair shares identical logical structure but requires different domain-specific knowledge, enabling controlled attribution of reasoning-mode gains. Across five model pairs spanning four model families, we find that 91.3% of reasoning-mode gains are knowledge-dependent rather than structure-invariant (63/69 gains; Wilson 95% CI [82.3%, 96.0%]), directly challenging the assumption that chain-of-thought reasoning improves short-horizon procedural scientific problem-solving. Reasoning toggles on highly capable models provide less than 5 percentage points accuracy gain across all domains, and a reasoning-specialized model (o3-mini) that outperforms its standard counterpart on GPQA Diamond (+19.2 percentage points) underperforms on ISOSCI (-24.7 percentage points), showing that benchmark choice determines conclusions about reasoning utility. We release ISOSCI at https://huggingface.co/datasets/isosci/isosci
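The interval quoted for the 63/69 proportion can be reproduced with the standard Wilson score formula. The sketch below is an independent check rather than the authors' analysis code, and the choice of z = 1.96 for a 95% interval is an assumption about how they computed it.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

low, high = wilson_interval(63, 69)
print(f"{63 / 69:.1%} [{low:.1%}, {high:.1%}]")  # 91.3% [82.3%, 96.0%]
```

The interval is asymmetric around 91.3% because the Wilson method pulls the center toward 50%, which matters for proportions this close to 1 at a sample size of 69.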

RusFinChain: A Russian Benchmark for Verifiable Chain-of-Thought Reasoning in Finance with Fuzzy-Aligned Evaluation

1/10

M. K. Arabov

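Personalized recommendation rationale:

The paper focuses on a benchmark and chain-of-thought reasoning evaluation for finance, a domain-specific application unrelated to the core technology of recommender systems, search, or advertising. It does not involve direct applications of LLMs in RecSys/Search/Ads or improvements to the Transformer architecture.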

2026-07-01 18:48:05 | arXiv:2607.01388v1 |

cs.CL

Multi-step symbolic reasoning is essential for robust financial analysis, yet most benchmarks neglect intermediate reasoning steps. FINCHAIN introduced verifiable Chain-of-Thought (CoT) evaluation but is limited to English. FINESSE-Bench includes a Russian block but relies on multiple-choice questions without step-level supervision. We present RusFinChain, the first Russian-language symbolic benchmark for verifiable CoT reasoning in finance. It spans 17 domains, 172 topics, and comprises 5,280 parameterized examples from executable Python templates, ensuring contamination-free evaluation. Each example includes a gold-standard reasoning chain with intermediate numeric values for automatic verification. We also introduce enhanced metrics: Fuzzy Numeric Alignment and Soft-Attention Alignment. We evaluate 8 open-weight LLMs on a stratified sample, generating 8,100 responses. Results reveal a substantial reasoning gap: models achieve Hard F1 of ~0.65 for step alignment, but only ~29% of final answers are correct. Our fuzzy and soft metrics show stronger correlation with final-answer correctness (Spearman $\rho \approx$ 0.48) than the original ChainEval ($\rho \approx$ 0.38-0.46), demonstrating superior diagnostic power. We release dataset, code, and evaluation framework to foster verifiable financial AI for the Russian-speaking community.

Alignment Is All You Need For X-to-4D Generation

1/10

Qiaowei Miao, Kehan Li, Yawei Luo, Yi Yang

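Personalized recommendation rationale:

The paper focuses on 4D content generation in generative AI (for example, dynamic 3D scenes), which belongs to graphics and visual generation and has no direct connection to the core technology of recommender systems, search, or advertising. Although it involves the notion of "alignment", its application scenario departs from core problems such as user behavior modeling, feature interaction, or information retrieval.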

2026-07-02 17:59:57 | arXiv:2607.02516v1 |

cs.CV

Generative diffusion models excel at synthesizing high-quality images, videos, and 3D content under multimodal control. However, arbitrary user-defined modality-to-4D (X-to-4D) generation remains challenging due to the high cost of constructing diverse datasets and the limited scalability of existing methods. This paper presents Align4D, a flexible framework that translates any-modal input into coherent video-3D pairs, using video to guide 4D motion and 3D data to shape 4D geometry. Align4D introduces three key techniques: (1) Object Distance Alignment, which searches Video-Aligned and Multiview-Aligned Object Distances (VAOD/MAOD), respectively, to reconcile 4D renderings with video and the priors of multiview diffusion models; (2) Motion-Geometry Joint Alignment, which constrains known and unknown views through synchronized video and 3D inputs, ensuring consistent 4D generation; and (3) Asynchronous Optimization, which decouples Gaussian attribute and deformation network training to enhance motion and geometry fidelity. We further propose the X4D dataset, which integrates prompt, image, video, and 3D data for benchmarking. Experiments on X4D and Consistent4D demonstrate that Align4D achieves state-of-the-art quality and consistency in X-to-4D generation. Project page: https://miaoqiaowei.github.io/Align4D/.

PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation

1/10

Haofei Xu, Rundi Wu, Philipp Henzler, Nikolai Kalischek, Michael Oechsle, Fabian...

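Personalized recommendation rationale:

This paper focuses on monocular geometry estimation, a computer vision topic unrelated to recommender systems, search, or advertising. The title mentions no techniques related to user modeling, feature interaction, or large-scale ranking, so it is not relevant.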

2026-07-02 17:59:56 | arXiv:2607.02515v1 |

cs.CV

State-of-the-art single-image 3D reconstruction methods often rely on complex hybrid architectures and loss functions, or compress geometry into latent spaces in order to leverage pre-trained latent diffusion models. In this work, we show that such architectural overhead and intricate loss formulations are unnecessary. We introduce a minimalist pixel-space Diffusion Transformer, built on a plain ViT, that operates directly on raw 3D point map patches and is conditioned on image tokens from a pre-trained DINOv3. Unlike existing latent diffusion approaches, we train our diffusion backbone entirely from scratch, eliminating the need for point map tokenizers. Despite its simplicity, our approach surpasses complex latent-based diffusion models while remaining significantly simpler than hybrid alternatives. Notably, it produces sharper geometric structure and is more robust in highly ambiguous regions, such as transparent objects.

Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

1/10

Ling Xu, Chuyu Han, Borui Li, Hao Wu, Shiqi Jiang, Ting Cao, Chuanyou Li, Sheng ...

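Personalized recommendation rationale:

This paper focuses on inference deployment of embodied AI models on robots, which belongs to robotics and embedded systems and has no direct connection to the core techniques of recommender systems, search, or advertising. It does not mention any LLM or Transformer techniques applicable to RecSys/Search/Ads, so its relevance is very low.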

2026-07-02 17:58:28 | arXiv:2607.02501v1 |

cs.RO cs.CV cs.OS

Embodied AI models now span vision-language-action (VLA) models and world-action models (WAMs), but practical deployment remains fragmented across model-specific Python stacks, backend assumptions, and robot-side glue code, especially on heterogeneous edge devices. Existing inference runtimes are designed mainly for request-response serving and therefore do not satisfy the runtime contract of embodied deployment: multi-rate execution inside closed-loop control, latency-first batch-1 inference on heterogeneous hardware, and extensible embodied interfaces beyond fixed token I/O. We present Embodied.cpp, a portable C++ inference runtime for embodied models. Based on an architectural analysis of representative VLA models and WAMs, Embodied.cpp captures a shared execution path and organizes it into five layers: input adapters, sequence builders, backbone execution, head plugins, and deployment adapters. The runtime provides modular multi-rate execution, latency-first fused inference, and extensible operator and I/O support, enabling deployment across heterogeneous devices, robots, and simulators through one backend abstraction. We evaluate Embodied.cpp on two VLA models, HY-VLA and pi0.5, and on a preliminary WAM benchmark using a LingBot-VA Transformer block. The VLA deployments achieve successful closed-loop execution with 100.0% and 91.0% task success rates, respectively. The WAM benchmark reduces block memory from 312.2 MiB to 88.1 MiB. These results show that Embodied.cpp improves deployment efficiency while preserving high accuracy across diverse embodied model architectures.

Seek to Segment: Active Perception for Panoramic Referring Segmentation

1/10

Song Tang, Shuming Hu, Xincheng Shuai, Henghui Ding, Yu-Gang Jiang

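Personalized recommendation rationale:

This paper focuses on panoramic referring segmentation and active perception in computer vision, a purely visual topic with no direct connection to the core techniques of recommender systems, search, or advertising. Although visual perception may indirectly affect multimodal recommendation, the paper lacks a clear application scenario and potential for technique transfer, so its relevance is very low.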

2026-07-02 17:56:49 | arXiv:2607.02497v1 |

cs.CV

Existing referring segmentation models passively process static images captured from fixed perspectives, limiting their applicability in Embodied AI, where agents must perform active perception in continuous 360° environments. To bridge this gap, we introduce a novel task: Active Panoramic Referring Segmentation (APRS). In this setting, an agent is required to adjust its viewing direction (Δθ, Δφ) to explore the 360° environment, seeking the object specified by a user instruction for segmentation. To tackle this challenging task, we propose PanoSeeker, a memory-augmented agent for efficient APRS. Rather than relying on heuristic scanning, PanoSeeker integrates a Vision-Language Model (VLM) with EgoSphere, an explicit spatial visual memory. By progressively integrating sequential local observations into a unified 360° representation, EgoSphere enables the agent to plan efficient and non-redundant search trajectories. Once the target is found, the agent performs active viewpoint alignment and outputs the segmentation mask. Furthermore, we curate an expert-annotated search trajectory dataset with memory timelines for Supervised Fine-Tuning, followed by Reinforcement Learning post-training to explicitly optimize PanoSeeker's exploration efficiency. Extensive experiments on our newly established APRS benchmark demonstrate that PanoSeeker achieves superior search efficiency and segmentation accuracy, significantly outperforming adapted state-of-the-art baselines.
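
The viewing-direction adjustment (Δθ, Δφ) described in this abstract can be illustrated with a minimal Python sketch. The function name `update_view` and the angle conventions (yaw in [0, 360), pitch in [-90, 90], both in degrees) are assumptions made for illustration and are not taken from the paper.

```python
def update_view(theta_deg, phi_deg, d_theta, d_phi):
    """Apply a relative view change (d_theta, d_phi), all in degrees.

    Assumed convention (hypothetical, not from the paper):
    theta is yaw in [0, 360), phi is pitch in [-90, 90].
    """
    # Yaw is periodic, so wrap it across the panorama seam.
    new_theta = (theta_deg + d_theta) % 360.0
    # Pitch is bounded by the poles, so clamp instead of wrapping.
    new_phi = max(-90.0, min(90.0, phi_deg + d_phi))
    return new_theta, new_phi
```

Wrapping yaw while clamping pitch reflects the cylindrical topology of a panorama: turning past the seam returns to the other side of the image, whereas looking past a pole has no such continuation.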

GeoMix: Descriptor-Free Visual Localization via Global Context and Multi-Detector Training

1/10

Yejun Zhang, Xinjue Wang, Zihan Wang, Esa Rahtu, Juho Kannala

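Personalized recommendation rationale:

This paper focuses on visual localization, a computer vision topic with no direct or indirect connection to recommender systems, search, or advertising. Although the title mentions "global context" and "multi-detector training", it lacks technical applications or analogies relevant to RecSys/Search/Ads.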

2026-07-02 17:52:41 | arXiv:2607.02486v1 |

cs.CV

Descriptor-free visual localization eliminates high-dimensional descriptor storage, preserves scene privacy, and simplifies map maintenance, yet its accuracy still lags far behind descriptor-based pipelines. We attribute this gap to insufficient geometric discriminability in geometry-only matching. Without visual appearance, current methods underutilize local geometry cues, lack the global context among keypoints, and overfit to a single keypoint detector. We further observe that descriptor-free matching naturally enables multi-detector training, as heterogeneous keypoints can be optimized in a shared geometry-only space without aligning descriptor spaces. Building on these insights, we propose GeoMix, a descriptor-free 2D-3D matching framework that strengthens geometric discriminability at three levels. Locally, directional and distance-aware embeddings enrich neighborhood aggregation with fine-grained spatial structure. Globally, learnable context nodes aggregate and redistribute scene-wide information via cross-attention to resolve ambiguities beyond local receptive fields. At the training level, Mix-Training exploits this detector-agnostic geometry space to learn representations across multiple keypoint detectors. Extensive experiments on MegaDepth, Cambridge Landmarks, 7Scenes, and Aachen Day-Night show that GeoMix sets a new state of the art among descriptor-free methods, reducing 75th-percentile rotation error by 89% and translation error by up to 90% over the previous best, while generalizing zero-shot to unseen detectors and narrowing the gap to descriptor-based pipelines. Code is available at https://github.com/YejunZhang/Geomix.

Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning

1/10

Xuehui Wang, Xuankun Yang, Wei Shen

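Personalized recommendation rationale:

This paper focuses on token pruning in vision Transformers, an efficiency optimization problem in the vision domain with no direct connection to the core tasks of recommendation, search, or advertising systems (such as user behavior modeling, feature interaction, and ranking). Although token pruning might indirectly inspire sequence compression methods, it is specific to the visual modality and lacks clear transferability, so its relevance is very low.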

2026-07-02 17:50:57 | arXiv:2607.02484v1 |

cs.CV cs.AI

Visual token pruning is a crucial strategy for accelerating VLMs by compressing redundant image patches, yet existing methods often fail to preserve critical cues under dense instructions and fine-grained queries. In this paper, we investigate this failure and identify two underlying bottlenecks: the widespread dispersion of textual noise that corrupts dense cross-modal scoring, and the feature fragmentation inherent to standard token selection. To address these issues, we propose Entropy-Aware Dense Pruning (EADP), a framework that reformulates pruning as a structured compression problem. EADP first leverages statistical entropy to quantify and filter out textual noise, yielding a robust, fine-grained instruction relevance score. Subsequently, instead of naive Top-K selection, EADP casts token selection as a submodular maximization problem with a spatial prior, explicitly ensuring a holistic and non-redundant visual representation. Extensive experiments demonstrate that EADP improves the accuracy-efficiency trade-off of VLMs, robustly preserving fine-grained visual cues under strict token budgets while achieving SoTA performance on challenging multimodal benchmarks.
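
To make the two ideas in this abstract concrete, here is a rough Python sketch of entropy-based noise filtering followed by greedy submodular token selection. It is not the authors' EADP implementation: the attention-matrix input, the `max_entropy` threshold, the facility-location objective, and the Gaussian spatial similarity are all assumptions chosen for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def relevance_scores(attn, max_entropy):
    """attn[t][v]: attention of text token t over visual tokens v (rows sum to 1).

    Assumption: text tokens with diffuse attention (entropy above
    max_entropy) are treated as noise and skipped; the remaining rows
    are averaged to give one relevance score per visual token.
    """
    kept = [row for row in attn if entropy(row) <= max_entropy]
    if not kept:
        kept = attn  # fall back to all rows rather than divide by zero
    n_visual = len(attn[0])
    return [sum(row[v] for row in kept) / len(kept) for v in range(n_visual)]


def greedy_select(scores, positions, budget, sigma=1.0):
    """Greedily maximize a facility-location objective (hypothetical choice):
    f(S) = sum_j scores[j] * max_{i in S} sim(i, j),
    where sim is a Gaussian of the spatial distance between tokens.
    """
    def sim(i, j):
        (xi, yi), (xj, yj) = positions[i], positions[j]
        return math.exp(-((xi - xj) ** 2 + (yi - yj) ** 2) / (2 * sigma ** 2))

    n = len(scores)
    covered = [0.0] * n  # best similarity of each token to the selected set
    selected = []
    for _ in range(min(budget, n)):
        best, best_gain = None, -1.0
        for i in range(n):
            if i in selected:
                continue
            # Marginal gain of adding token i to the current selection.
            gain = sum(scores[j] * max(0.0, sim(i, j) - covered[j]) for j in range(n))
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        covered = [max(covered[j], sim(best, j)) for j in range(n)]
    return selected
```

With a budget of two, plain Top-K on the scores would keep the two highest-scoring tokens even when they are spatial neighbors. The greedy objective instead discounts a token that is already well covered by an earlier pick, which is one way to encode the "non-redundant" requirement mentioned in the abstract.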

EAGLE-360: Embodied Active Global-to-Local Exploration in 360°

1/10

Jingtao Xu, Zizhuo Lin, Jianwen Sun, Yi Yang, Yawei Luo

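Personalized recommendation rationale:

This paper addresses 360-degree environment exploration for embodied AI, which belongs to robotics and computer vision and has no direct connection to the current core focus areas of recommender systems, search, or advertising.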

2026-07-02 17:47:27 | arXiv:2607.02479v1 |

cs.CV

While Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in standard visual understanding, adapting them for active visual search in 360° panoramic environments exposes fundamental limitations. Specifically, standard MLLMs struggle to effectively model inherent panoramic properties, such as severe polar distortion and continuous cylindrical topologies, which significantly degrades target detection accuracy. Consequently, existing panoramic search methods attempt to compensate by relying heavily on fragmented local viewpoints. Burdened by rigid initialization and a lack of global panoramic priors, these approaches suffer from myopic, inefficient exploration and struggle with robust error recovery when targets are out of view. To overcome these challenges, we propose EAGLE-360, a novel Embodied Active Global-to-Local Exploration framework. Rather than performing exhaustive local searches, EAGLE-360 leverages global priors to establish an initial holistic perspective, iteratively reasoning and progressively narrowing the search space. Architecturally, we adapt RoPE Rolling, a coordinate-shifting positional encoding mechanism, to seamlessly model the continuous topologies of panoramas. To facilitate this paradigm, we construct the large-scale EAGLE-360 dataset, comprising 14,000+ 4K panoramas and 70,000+ rounds of high-quality VQA dialogues. By employing a training pipeline that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), we effectively elicit complex spatial reasoning and tool-calling capabilities. Extensive experiments demonstrate that EAGLE-360 establishes a new state-of-the-art for 360° visual search, achieving nearly an 8-fold increase in accuracy over the base model while significantly enhancing exploration efficiency.
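
The abstract names Group Relative Policy Optimization (GRPO) as part of the training pipeline. A commonly described ingredient of GRPO is scoring each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that normalization step in Python; the function name, the example rewards, and the `eps` stabilizer are illustrative assumptions, and the paper's actual reward design is not specified in this listing.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is judged relative to its siblings rather than on an
    absolute scale. eps (assumed value) avoids division by zero when
    all rewards in the group are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group with rewards `[1.0, 0.0, 0.0, 1.0]`, the successful responses receive advantages close to +1 and the failed ones close to -1; a group where every response earns the same reward yields all-zero advantages and therefore no learning signal.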

Interpretation-Oriented Cloud Removal via Observation-Anchored Residual Flow with Geo-Contextual Alignment

1/10

Ziyao Wang, Maonan Wang, Yucheng He, Xianping Ma, Ziyi Wang, Hongyang Zhang, Yir...

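Personalized recommendation rationale:

This paper focuses on cloud removal in remote sensing imagery, which belongs to computer vision and remote sensing and has no direct connection to my core focus areas (recommender systems, search, advertising, or LLM techniques). Although concepts mentioned in the title such as "geo-contextual alignment" are somewhat general, the actual application domain is far from RecSys/Search/Ads.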

2026-07-02 17:39:23 | arXiv:2607.02471v1 |

cs.CV

Cloud removal (CR) is essential for optical remote sensing, serving as a prerequisite for reliable downstream interpretation, such as semantic segmentation and change detection. However, existing CR approaches often prioritize visual realism while overlooking their impact on subsequent analytical tasks, leading to semantic drift and degraded downstream performance. To address this issue, we propose Geo-Anchored Cloud Removal (GACR), a unified framework that jointly ensures faithful reconstruction and robust interpretability. At its core, GACR incorporates Observation-Anchored Residual Flow (OAR-Flow), which reformulates CR as a physically grounded residual inversion process. By anchoring the generative trajectory to the cloudy observation rather than pure noise, OAR-Flow enables fast, stable, and faithful reconstruction. To further preserve semantic structures critical for downstream interpretation, GACR integrates Geo-Contextual Prior Alignment (GCPA) to constrain the reconstruction within a semantic manifold induced by a Vision Foundation Model (VFM). Consequently, GACR strictly maintains the spatial-semantic integrity of complex landscapes. Extensive experiments across six CR datasets and twelve downstream tasks demonstrate that GACR produces superior reconstruction quality while consistently improving downstream task accuracy. The code is available at https://github.com/wzy6055/GACR.

Self-Auditing Residual Drifting for Pathology-Preserving Accelerated Knee MRI

1/10

Qing Lyu, Jianxu Wang, Mohammad Kawas, Ge Wang, Christopher T. Whitlow

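Personalized recommendation rationale:

This paper focuses on medical imaging, specifically accelerated knee MRI, and has no direct connection to recommender systems, search, or advertising. Even considering cross-domain technique transfer, its pathology preservation and residual drifting concepts lack a clear RecSys/Search/Ads application scenario and do not match the core focus areas.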

2026-07-02 16:59:03 | arXiv:2607.02428v1 |

eess.IV cs.CV physics.med-ph

Accelerated magnetic resonance imaging reduces acquisition time, but reconstruction from undersampled k-space can blur diagnostically relevant structures or introduce failures that are not captured by global image metrics. We propose SA-RDM-DC, a Self-Auditing Residual generative Drifting Model with Data Consistency for accelerated knee MRI. The method adapts the newly proposed generative drifting paradigm to accelerated MRI by training a physics-conditioned drift field from the zero-filled reconstruction toward the fully sampled residual correction. It predicts image- and missing-k-space residual corrections, enforces data consistency with acquired k-space, uses frequency-aware and residual drifting supervision to recover fine detail, and produces dense error maps and slice-level risk scores in the same inference pass. We evaluate SA-RDM-DC on multi-coil fastMRI knee data at acceleration factors of 4, 8, and 12, with fastMRI+ pathology annotations for region-level and classifier-based task preservation, and on SKM-TEA for zero-shot and fine-tuned protocol-shift evaluation. Compared with zero-filled reconstruction, UNet-image-SENSE, DC-UNet, Score-Diffusion, ELF-Diff, SENSE-VarNet, and MoDL baselines, SA-RDM-DC achieves the highest SSIM across fastMRI acceleration factors while retaining subsecond per-slice inference and avoiding the long sampling time of iterative diffusion baselines. In pathology-aware analysis, SA-RDM-DC preserves lesion-region structural fidelity and reduces meniscus prediction instability. Its self-auditing scores strongly identify high-error reconstructions on fastMRI and partially transfer as a selective-review signal under SKM-TEA protocol shift. These results support reconstruction evaluation that jointly considers image fidelity, pathology preservation, runtime, and case-specific reliability.
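
The phrase "enforces data consistency with acquired k-space" can be illustrated with a toy Python sketch. This is a simplified, single-coil, 1D version using a hard replacement rule, which is a common form of data consistency but is only an assumption here; the paper's multi-coil formulation is not described in this listing. The helper names `dft`, `idft`, and `data_consistency` are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a 1D signal."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def data_consistency(pred_image, acquired_kspace, mask):
    """Keep measured k-space samples and use the prediction elsewhere.

    mask[k] is True where k-space sample k was actually acquired.
    Returns a complex-valued signal; a real pipeline would typically
    take its magnitude afterwards.
    """
    pred_k = dft(pred_image)
    merged = [acq if m else p for p, acq, m in zip(pred_k, acquired_kspace, mask)]
    return idft(merged)
```

After this step the output agrees exactly with the scanner measurements at every sampled frequency, so the network's correction can only change the frequencies that were never acquired.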

Wavelet-Guided Semantic Signal Compensation for Inversion-Free Image Editing

1/10

Anqi Tang, Wenhao Sun, Zhaoqiang Liu

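Personalized recommendation rationale:

This paper focuses on image editing, specifically inversion-free editing, which belongs to computer vision and graphics. It has no direct connection to search, recommendation, advertising, or LLM/Transformer techniques and does not match the focus areas.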

2026-07-02 16:50:26 | arXiv:2607.02421v1 |

cs.CV

Text-guided image editing aims to modify visual content according to a target prompt while preserving the background. Recent inversion-free image editing frameworks such as FlowEdit have demonstrated strong editing capability without requiring inversion. Empirically, FlowEdit can achieve substantial semantic changes under appropriate hyperparameter settings. However, we observe that under certain global attribute shifts, the editing trajectory may not effectively move away from the source distribution in the early timesteps. Our analysis suggests that in the high-noise regime, the dominant manifold-seeking flow toward the data manifold can reduce the influence of the text-conditioned direction, leading to limited global modification while background structures remain only moderately preserved. Inspired by this observation, we propose an inversion-free, frequency-aware semantic compensation strategy that strengthens the effective signal in the early stage of generation, while maintaining structural consistency in the background. The proposed method improves global editing capacity without sacrificing background fidelity.

LIME: Learning Intent-aware Camera Motion from Egocentric Video

1/10

Boyang Sun, Jiajie Li, Yung-Hsu Yang, Chenyangguang Zhang, Tim Engelbracht, Sung...

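Personalized recommendation rationale:

This paper addresses camera motion prediction in computer vision, which belongs to graphics or visual processing. It is unrelated to the core techniques of search, recommendation, or advertising systems and offers no clear path to application in recommender systems or advertising.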

2026-07-02 16:48:43 | arXiv:2607.02417v1 |

cs.RO cs.CV cs.LG

Autonomous robots often need to move their camera before they can act: to inspect an object, reveal an occluded region, or obtain a view that responds to a user's intent. While vision-language navigation translates instructions to base motion and vision-language-action policies map instructions to manipulation actions, language-conditioned camera motion remains comparatively underexplored as a first-class action. We formulate language-conditioned camera motion generation: given a current RGB observation and a free-form natural-language intent, predict a relative target camera pose for the next observation. This task is inherently non-trivial: viewpoint changes are driven by latent perceptual intentions, and a valid motion may operate at different semantic granularity, from entering a room to looking around a corner, inspecting a visible object, or revealing an occluded detail. To model this structure, we mine multi-intention camera-motion supervision from egocentric video, pairing plausible intents and observation-gain descriptions with relative SE(3) target poses. We propose LIME, a vision-language camera-motion generator that combines an auto-regressive observation-gain output with a continuous flow-matching pose head. This design lets the model jointly predict what the next view should reveal while representing multi-hypothesis target views. Across experiments and downstream robotic tasks, we show that LIME can learn to actively choose camera poses from passive human video, turning ordinary egocentric recordings into supervision for intent-aware active perception.
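
The "relative SE(3) target pose" mentioned in this abstract can be computed from two absolute camera poses. The Python sketch below assumes camera-to-world poses given as a 3x3 rotation matrix and a translation vector; that convention and the helper names are assumptions for illustration, not details from the paper.

```python
def transpose(R):
    """Transpose of a 3x3 matrix stored as nested lists."""
    return [list(col) for col in zip(*R)]


def matmul(A, B):
    """Product of two 3x3 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(R, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]


def relative_pose(R_cur, t_cur, R_tgt, t_tgt):
    """Target camera pose expressed in the current camera frame.

    Assumes both inputs are camera-to-world poses, so
    T_rel = inverse(T_cur) * T_tgt, which expands to:
      R_rel = R_cur^T * R_tgt
      t_rel = R_cur^T * (t_tgt - t_cur)
    """
    R_cur_T = transpose(R_cur)  # the inverse of a rotation is its transpose
    R_rel = matmul(R_cur_T, R_tgt)
    t_rel = matvec(R_cur_T, [a - b for a, b in zip(t_tgt, t_cur)])
    return R_rel, t_rel
```

Expressing the target relative to the current camera makes the supervision independent of where the recording happened in the world, which is one plausible reason to mine relative rather than absolute poses from video.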

ACID: Action Consistency via Inverse Dynamics for Planning with World Models

1/10

Gawon Seo, Dongwon Kim, Suha Kwak

个性化推荐理由:

论文主要关注强化学习中的世界模型规划，通过逆动力学约束动作一致性。这与RecSys/Search/Ads的推荐、搜索或广告排序核心任务无直接关联，且未涉及LLM或Transformer技术，也不属于使能技术或直接应用。

2026-07-02 16:38:10 | arXiv:2607.02403v1 |

cs.RO cs.AI cs.CV

Decision-time planning with action-conditioned world models has become a popular paradigm for embodied control. However, the standard planning cost judges a candidate solely by how close its predicted terminal state lies to the goal, leaving the realizability of the intermediate transitions unchecked -- a predicted trajectory can look convincing while the environment rollout drifts away from it. In this paper, we propose ACID, a decision-time planning framework that introduces cycle action consistency: the action inferred backward from a predicted transition by an inverse dynamics model should recover the one that was conditioned on. We fold this per-step residual into the planning cost via a scale-invariant adaptive weight. Across four action-conditioned world models and six tasks spanning rigid and deformable manipulation, articulated control, and visual navigation, ACID consistently improves planning and matches the baseline's accuracy with substantially less planning compute.
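
The cycle action consistency idea in this abstract can be sketched with a toy 1D example in Python. Everything here is illustrative: the scalar state, the two stand-in models, and the fixed weight `lam` are assumptions, and the paper's scale-invariant adaptive weight is not reproduced.

```python
def plan_cost(state, actions, goal, world_model, inverse_dynamics, lam=1.0):
    """Terminal goal distance plus a per-step action-consistency residual.

    For each step, the world model predicts the next state, the inverse
    dynamics model infers which action would explain that transition,
    and the gap to the action actually conditioned on is accumulated.
    lam is a fixed, hypothetical weight.
    """
    residual = 0.0
    s = state
    for a in actions:
        s_next = world_model(s, a)
        a_hat = inverse_dynamics(s, s_next)
        residual += abs(a_hat - a)
        s = s_next
    terminal = abs(s - goal)
    return terminal + lam * residual


def toy_world_model(s, a):
    """Stand-in world model that believes any step size is achievable."""
    return s + a


def toy_inverse_dynamics(s, s_next):
    """Stand-in inverse dynamics that only recognizes steps within [-1, 1]."""
    return max(-1.0, min(1.0, s_next - s))
```

In this toy setup the action sequences `[1.0, 1.0, 1.0]` and `[3.0, 0.0, 0.0]` both reach a goal of 3.0 according to the world model, so a terminal-only cost cannot separate them. The residual term penalizes the second sequence because its first transition cannot be explained by any action the inverse dynamics model recognizes.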

Show Me Examples: Inferring Visual Concepts from Image Sets

1/10

Nick Stracke, Kolja Bauer, Stefan Andreas Baumann, Miguel Angel Bautista, Josh S...

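Personalized recommendation rationale:

This paper focuses on concept inference in computer vision and does not directly involve recommender systems, search, or advertising. Although visual concept inference might inform multimodal recommendation, the title shows no direct connection to LLMs or recommender systems, so it falls outside the core areas.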

2026-07-02 16:35:52 | arXiv:2607.02402v1 |

cs.CV

Vision-language models (VLMs) can follow complex textual instructions, yet they struggle to reason from purely visual context. In particular, current models fail to infer shared concepts from sets of example images and apply them to new inputs. We introduce Visual Concept Inference from Sets (VICIS), a task that evaluates this capability. Given a small context set of images sharing a concept and a query image, the model must generate new images that preserve the context-defined concept while remaining consistent with the query. We show that state-of-the-art VLMs perform poorly on this task, often ignoring the visual context or defaulting to biased generations. To address this gap, we propose a training framework and architecture that learn to infer visual concepts from image sets and extract concept-specific embeddings from queries. Experiments on synthetic data and large-scale ImageNet/WordNet data show that our model generates more accurate and diverse outputs and generalizes to unseen concepts and modalities such as sketches.

Representation Distribution Matching for One-Step Visual Generation

1/10

Lan Feng, Wuyang Li, Eloi Zablocki, Matthieu Cord, Alexandre Alahi

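Personalized recommendation rationale:

This paper focuses on one-step visual generation, which belongs to visual generation and does not involve recommender systems, search, or advertising directly or indirectly. Although it may use Transformers, it lacks clear potential for RecSys/Search/Ads applications and is therefore unrelated to the current focus areas.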

2026-07-02 16:15:38 | arXiv:2607.02375v1 |

cs.CV

We elucidate the design space of Representation Distribution Matching (RDM), our name for the paradigm that trains a one-step image generator by matching generated and reference feature distributions under frozen pretrained encoders. We identify two design axes, how the distributions are compared and the representations they are compared in, and controlled studies along them yield three findings. First, the classical MMD, which could not train convincing generators a decade ago, becomes a strong and scalable objective once estimated right. Second, the generated batch is then the operative variable, with an optimum above 2048, far beyond customary batch sizes. Third, any single representation can be gamed, driven below the real score while images stay visibly fake, so we match against a balanced battery of encoders and evaluate with SW_r14, a Sliced-Wasserstein distance over 14 encoders that is independent of the training loss and resists gaming. Combining the preferred choices yields improved RDM (iRDM): it sets the one-step state of the art on ImageNet at SW_r14 1.30, corroborated by PickScore, a human-preference proxy our objective never optimizes, which prefers it over the prior best one-step generator on 71.2% of matched samples. The same recipe post-trains the four-step FLUX.2 [klein] into a one-step generator, surpassing the four-step version on GenEval, 0.826 to 0.794, and on PickScore, 22.76 to 22.58, in 90 H200 GPU-hours. Project page: https://alan-lanfeng.github.io/rdm/.
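
For readers unfamiliar with the classical MMD mentioned in this abstract, here is a small Python sketch of the standard unbiased squared-MMD estimator with an RBF kernel. It is a textbook form, not necessarily the estimator the authors use; the kernel choice and `bandwidth` value are assumptions, and the feature vectors would come from frozen encoders in the paper's setting.

```python
import math


def rbf(x, y, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))


def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples X and Y.

    X and Y are lists of feature vectors, each with at least two
    elements. Within-set terms exclude the diagonal (i == j), which is
    what makes the estimate unbiased.
    """
    m, n = len(X), len(Y)
    k_xx = sum(rbf(X[i], X[j], bandwidth)
               for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    k_yy = sum(rbf(Y[i], Y[j], bandwidth)
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    k_xy = sum(rbf(x, y, bandwidth) for x in X for y in Y) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy
```

Because the estimate averages over pairs of samples, its variance depends on how many generated samples are compared at once, which is consistent with the abstract's observation that the generated batch size matters. Note that the unbiased estimate can be slightly negative when the two sample sets are very similar.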

GAP-GDRNet: Geometry-Aware Monocular Visual Pose Sensing on a Single-Target Synthetic Spacecraft Dataset

1/10

Yonglong Zhang, Yang Liu

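Personalized recommendation rationale:

This paper focuses on the computer vision task of spacecraft pose estimation and has no direct or indirect connection to search, recommendation, or advertising. Its core methods (geometry awareness, monocular vision) belong to robotics or space applications and meet none of the screening criteria.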

2026-07-02 16:02:24 | arXiv:2607.02360v1 |

cs.CV cs.AI

Monocular relative pose sensing is a central perception problem in non-cooperative rendezvous and on-orbit servicing. In spacecraft images, however, weak surface texture, thin appendages, illumination changes, and partial occlusion often leave only sparse and unstable geometric evidence. This article presents GAP-GDRNet, a geometry-aware attention-enhanced framework for monocular RGB-based 6D pose sensing. The method follows the geometry-guided direct regression paradigm of GDR-Net and modifies two points in the pipeline: an attention-based feature refinement (AFR) module is placed before dense geometric prediction, and a patch-level geometric self-attention (PGSA) module is inserted into Patch-PnP. AFR reinforces global spacecraft structure together with local weak-texture cues; PGSA then relates downsampled geometric patches before final pose regression. A Blender-based annotation process supplies target masks, visible-region masks, dense model-coordinate maps, camera intrinsics, and 6D pose labels for supervised training.

The Moving Eye: Enhancing VLA Spatial Generalization via Hybrid Dynamic Data Collection

1/10

Jincheng Tang, Yilong Zhu, Zhengyuan Xie, Jiang-Jiang Liu, Jiaxing Zhang

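Personalized recommendation rationale:

This paper studies the spatial generalization of VLA (Vision-Language-Action) models, which belongs to robotics and embodied intelligence and has no direct connection to recommender systems, search, or advertising. The title mentions no techniques related to user modeling, ranking, or content understanding, so it is judged not relevant.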

2026-07-02 15:30:26 | arXiv:2607.02322v1 |

cs.RO cs.CV

Vision-Language-Action (VLA) models have shown remarkable promise in generalized robotic manipulation. However, their spatial generalization remains fragile. We argue that simply increasing the number of viewpoints is insufficient. Models often fall into the trap of Shortcut Learning, latching onto spurious correlations (e.g., fixed relative poses between objects or between the camera and robot base) rather than learning true spatial relationships. In this work, we propose a data-centric solution to enhance VLA spatial generalization. We utilize a dual-arm setup where one arm performs manipulation while the other serves as a mobile environmental camera. We systematically evaluate three data distribution patterns: Fixed, Multi-Fixed, and Moving Views. Our findings reveal that a hybrid strategy, combining continuous camera motion with diverse static viewpoints, yields the best performance by substantially reducing spurious correlations while maintaining training stability. Our experiments demonstrate that this strategy mitigates spurious correlations, enabling VLAs to generalize to unseen camera poses and object configurations where simply adding more static viewpoints fails. Crucially, we reveal that the susceptibility to shortcut learning and the struggle with spatial generalization are universal characteristics shared across diverse architectures. Consequently, all evaluated models (ACT, Diffusion, and VLA models including Pi0 and Gr00t) benefit significantly from our mixed data strategy.

NEvo: Neural-Guided Evolutionary Video Synthesis for Dynamic Visual Selectivity

1/10

Yingtian Tang, Sogand Salehi, Ming Zhou, Amir Zamir, Leyla Isik, Martin Schrimpf

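Personalized recommendation rationale:

This paper mainly concerns video synthesis and computer vision and has no direct connection to the core techniques of recommender systems, search, or advertising (such as ranking, retrieval, and user behavior modeling). Although video content may serve as a feature in recommendation, the paper focuses on generation methods rather than recommendation applications and does not match the current focus areas.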

2026-07-02 15:27:39 | arXiv:2607.02317v1 |

cs.CV

The human brain processes dynamic visual input through hierarchically organized, functionally specialized regions. While recent in silico brain encoding models can synthesize optimal stimuli to probe selectivity in different brain regions, prior work has been largely limited to static images, leaving dynamic visual processing underexplored. We introduce a novel neural-guided video synthesis framework that generates stimuli optimized for target brain regions across visual cortex. Our method performs evolutionary search over a structured prompt space, guided by a dynamic encoding model that predicts voxel-level responses to video inputs. By maximizing predicted activity for a target ROI, the framework efficiently discovers hyper-activating dynamic stimuli that consistently surpass handcrafted localizer videos. The synthesized videos recover known selectivities across ventral, dorsal, and lateral pathways, and further reveal systematic differences in sensitivity to temporal dynamics. A searchlight analysis provides new insight into the progression toward increasingly complex social-dynamic features along the lateral stream, further supported by probing with synthesized abstract, non-naturalistic stimuli. Taken together, our framework enables in silico exploration of dynamic visual selectivity, with new predictions for in vivo experiments.

InvSplat: Inverse Feed-Forward Scene Splatting

1/10

Polina Karpikova, Wenjing Bian, Haofei Xu, Hendrik Lensch, Andreas Geiger

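Personalized recommendation rationale:

This paper focuses on 3D scene representation and rendering (scene splatting), which belongs to computer vision and graphics. It is unrelated to the core focus areas of search, recommendation, and advertising systems and does not involve advances in LLM or Transformer architectures.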

2026-07-02 15:18:08 | arXiv:2607.02301v1 |

cs.CV

Inverse rendering aims to recover both 3D geometry and physically meaningful material properties from images, enabling applications such as relighting and novel view synthesis. Optimization-based methods achieve high fidelity but require costly per-scene fitting, while image-space learning-based approaches often suffer from multi-view inconsistencies and lack an explicit 3D representation for stable novel view rendering. We present a feed-forward multi-view reconstruction framework for inverse rendering that directly predicts a structured 3D Gaussian representation with intrinsic material attributes. Each Gaussian primitive is parameterized by mean, normal, opacity, rotation, scale, albedo, metallic, and roughness, enabling a disentangled and physically grounded scene representation. Our model integrates priors from a material estimation network with a multi-view 3D reconstruction backbone, allowing joint prediction of geometry and reflectance parameters in a single forward pass. Experiments on synthetic and real-world datasets demonstrate improved multi-view consistency compared to 2D baselines, accurate material recovery, and stable novel view rendering. Our representation further supports physically-based relighting and more faithful modeling of view-dependent effects compared to existing RGB-based feed-forward reconstruction methods. Our project webpage is: https://poliik.github.io/invsplat/.

Search-based Testing of Vision Language Models for In-Car Scene Understanding

1/10

Lev Sorokin, Chen Yang, Ken E. Friedl, Andrea Stocco

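Personalized recommendation rationale:

This paper focuses on testing in-car scene understanding in autonomous driving, a specific application domain (automotive). Although it involves vision-language models, its goal (testing scene understanding) has no direct connection to the core techniques of recommendation, search, or advertising, and it mentions no transferable methodology. Its relevance is therefore very low.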

2026-07-02 15:17:44 | arXiv:2607.02300v1 |

cs.CV cs.SE

In the automotive domain, in-car scene understanding (ISU) enables the detection of safety-critical events, such as driver distraction, and supports drivers or passengers by analyzing the in-car scene and adapting the environment (e.g., ambient lighting). The industry is increasingly exploring vision-language models (VLMs) to interpret camera-recorded in-car scenes and extract information for downstream reasoning tasks. However, VLMs may generate incomplete, erroneous, or misleading scene descriptions, highlighting the need for systematic testing. Collecting real in-vehicle data is costly, difficult to scale, and often infeasible, particularly in early design stages. In this paper, we present ISU-Test, an automated testing approach that combines rendering-based scene generation with search-based testing to evaluate ISU systems. By framing testing as an optimization problem and systematically modifying scene parameters, our method generates diverse in-car scenarios and explores a wide range of configurations. We evaluate ISU-Test on both an industrial prototype and open-source VLMs across two case studies: question answering and captioning, comparing against randomized scenario generation. Results show that ISU-Test significantly outperforms the baseline, achieving up to 10 times higher failure rates and up to 3.6 times higher failure coverage.

Real-Time Visual Intelligence on Low-Cost UAVs: A Modular Approach for Tracking, Scanning, and Navigation

1/10

Andrei-Marian Ungureanu, Stelian Spînu

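Personalized recommendation rationale:

This paper focuses on a specific engineering application of visual intelligence for UAVs, covering tracking, scanning, and navigation, and is unrelated to recommender systems, search, or advertising. Its content belongs to robotics and applied computer vision rather than the core areas or enabling technologies targeted by the paper screening.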

2026-07-02 15:15:56 | arXiv:2607.02298v1 |

cs.RO cs.CV

Autonomous drones are rapidly transforming modern warfare and civil applications alike. This paper presents the development of an integrated intelligent drone system designed to serve as a personal assistant. Leveraging the DJI Tello drone platform, we implemented a modular architecture that integrates three core artificial intelligence functionalities: facial detection, facial recognition, and depth estimation from monocular vision. A web-based interface enables seamless drone control and real-time video monitoring, while a Python-based server processes visual data and executes inference pipelines using lightweight neural models optimized for embedded systems. Unlike existing commercial solutions, this system emphasizes accessibility, low-cost hardware, and open-source technologies. The system demonstrates robust performance in real-world conditions, including person tracking, indoor scanning, and autonomous line following using virtual sensors. This project validates the applicability of advanced AI techniques in real-time robotic systems and illustrates the feasibility of deploying them on constrained hardware, providing a foundation for future research in autonomous UAVs for military, rescue, and surveillance missions.

Optimizing Visual Generative Models via Distribution-wise Rewards

1/10

Ruihang Li, Mengde Xu, Shuyang Gu, Leigang Qu, Fuli Feng, Han Hu, Wenjie Wang

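Personalized recommendation rationale:

This paper focuses on optimizing visual generative models and belongs to computer vision and generative modeling; it does not address core problems in recommender systems, search, or advertising, nor LLM techniques. Its topic has no direct connection to the core areas or enabling technologies of current interest, so its relevance is very low.

2026-07-02 15:08:56 | arXiv:2607.02291v1 | cs.LG, cs.CV

Abstract: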

Conventional reinforcement learning strategies for visual generation typically employ sample-wise reward functions, yet this practice frequently results in reward hacking that degrades image diversity and introduces visual anomalies. To address these limitations, we present a novel framework that finetunes generative models using distribution-wise rewards, ensuring better alignment with real-world data distributions. Unlike rewards that evaluate samples individually, a distribution-wise reward accounts for the data distribution of the samples, mitigating the mode collapse problem that occurs when all samples optimize towards the same direction independently. To overcome the prohibitive computational cost of estimating these rewards, we introduce a subset-replace strategy that efficiently provides reward signals by updating only a small subset of a generated reference set. Additionally, we apply RL to optimize post-hoc model merging coefficients, potentially mitigating the train-inference inconsistency caused by introducing a stochastic differential equation (SDE) in regular RL practices. Extensive experiments show our approach significantly improves FID-50K across various base models, from 8.30 to 5.77 for SiT and from 3.74 to 3.52 for EDM2. Qualitative evaluation also confirms that our method enhances perceptual quality while preserving sample diversity.

AGVBench: A Reliability-Oriented Benchmark of Data Augmentation for Vein Recognition

1/10

Haiyang Li, Yuming Fu, Qun Song, Hongchao Liao, Jing Chen, Mounim A. El-Yacoubi,...

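Personalized recommendation rationale:

This paper focuses on vein recognition, a biometrics topic with no direct connection to the core technologies of recommender systems, search, or advertising. Although data augmentation is a general machine learning technique, the application (vein recognition) is far from LLMs, recommendation, search, and advertising, and falls outside the scope of interest.

2026-07-02 14:55:24 | arXiv:2607.02271v1 | cs.CV

Abstract: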

Vein recognition is a secure biometric technology often constrained by limited annotated data and imaging variations. While data augmentation mitigates this, strategies designed for natural images may disrupt the fine-grained topology and textures essential for identity discrimination. We present AGVBench, which evaluates 30 representative augmentation strategies on five public palm- and finger-vein datasets with seven backbone architectures, covering classic CNNs, vision transformers, and vein-specific recognition models. Our results show that multi-image mixing methods (e.g., MixUp, PuzzleMix, StarMixup) generally provide the strongest recognition performance. However, they are often poorly calibrated and vulnerable to adversarial perturbations, revealing a clear inconsistency between clean accuracy and adversarial security. We also find that severe geometric transformations frequently degrade recognition, which is potentially due to feature misalignment or spatial cropping, and that augmentation effectiveness varies across palm and finger vein datasets. These findings prove that accuracy-centric evaluation is insufficient for biometric augmentation. AGVBench provides standardized protocols to support reproducible research and guide the design of reliable, secure, and robust vein recognition systems. Our codebase is available at https://github.com/Advance-VeinTech-Innovators/AGVBench.

Efficient Waste Sorting for Circular Economy: A Confidence-guided comparison between One-Vs-All and One-Vs-Rest Classification Strategies with Human-in-the-Loop for Automated Waste Sorting

1/10

Mohammed Fahad Ali, Dominique Briechle, Marit Briechle-Mathiszig, Tobias Geger, ...

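Personalized recommendation rationale:

This paper focuses on waste sorting, a domain-specific application unrelated to the core RecSys/Search/Ads areas, and it does not involve LLM or Transformer techniques. The classification strategies and human-in-the-loop mechanism named in the title are somewhat general, but they lack a direct connection to recommendation, search, or advertising.

2026-07-02 14:29:45 | arXiv:2607.02230v1 | cs.CV, cs.AI

Abstract: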

The complexity of waste disposal regulations across European countries poses significant challenges for the residents and hinders the transition to a Circular Economy. In Germany, the proper sorting and disposal of household waste remains challenging across municipalities. Consequently, substantially reducing incorrectly disposed waste is vital for improving waste management and advancing the Circular Economy. AI-based waste sorting solutions can support residents through user-friendly tools, such as mobile applications, that guide proper waste disposal. To be effective in supporting the Circular Economy, however, these solutions must be configurable to reflect the specific waste sorting scheme of individual municipalities in Germany. In the scope of this work, an evaluation and analysis are performed of two prominent classification strategies: OvA and OvR. The research uses a dataset constructed in alignment with the waste categories and sorting scheme of the city of Goslar in Germany. Moreover, this work aims to extend beyond the overall performance by examining the behavior of OvA and OvR classification strategies in identifying samples likely to be misclassified. These classification strategies are compared by applying varying confidence thresholds to identify uncertain samples for subsequent human review. This evaluation aims to balance the number of misclassifications against the human effort required for data annotation.
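
Illustrative sketch (not from the paper): the abstract describes applying varying confidence thresholds so that uncertain samples go to human review, trading misclassifications against annotation effort. The function name `route_by_confidence` and the toy data below are hypothetical; the paper's actual classifiers, dataset, and thresholds are not reproduced here.

```python
from typing import List, Tuple


def route_by_confidence(
    confidences: List[float],
    predictions: List[int],
    labels: List[int],
    threshold: float,
) -> Tuple[float, float]:
    """Return (fraction sent to human review, error rate among auto-accepted samples)."""
    # Samples at or above the threshold are accepted automatically.
    accepted = [
        (p, y)
        for c, p, y in zip(confidences, predictions, labels)
        if c >= threshold
    ]
    review_fraction = 1.0 - len(accepted) / len(confidences)
    if not accepted:
        return review_fraction, 0.0
    errors = sum(1 for p, y in accepted if p != y)
    return review_fraction, errors / len(accepted)


# Toy data, made up purely for illustration.
confidences = [0.95, 0.60, 0.85, 0.40, 0.99]
predictions = [0, 1, 2, 1, 0]
labels = [0, 2, 2, 1, 0]

for t in (0.0, 0.5, 0.9):
    review, err = route_by_confidence(confidences, predictions, labels, t)
    print(f"threshold={t:.1f} review_fraction={review:.2f} auto_error_rate={err:.2f}")
```

Raising the threshold in this toy example sends more samples to review and lowers the error rate among the automatically accepted ones, which is the trade-off the abstract refers to.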

MedSaab-US: A Backpropagation-Free Multi-Scale Wavelet-Saab Framework for Thyroid Nodule Segmentation in Ultrasound Images

1/10

Mohammad Amanour Rahman

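Personalized recommendation rationale:

This paper focuses on medical imaging (thyroid nodule segmentation), a domain-specific medical application unrelated to my areas of interest (recommender systems, search, advertising). The techniques it uses (backpropagation-free learning, wavelet transforms) have no clear application potential for recommendation, search, or advertising.

2026-07-02 14:18:22 | arXiv:2607.02209v1 | cs.CV

Abstract: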

Deep learning (DL) methods dominate thyroid nodule segmentation in ultrasound (US) images, achieving high Dice scores but at the cost of millions of parameters, GPU-dependent training via backpropagation, and limited mathematical tractability. These limitations impede deployment in resource-constrained environments. In this paper, we propose MedSaab-US, a backpropagation-free segmentation framework grounded in the Green Learning paradigm. MedSaab-US extracts multi-scale spatial-frequency features by combining multi-level Discrete Wavelet Transform (DWT) with multi-scale channel-wise Saab (Subspace Approximation with Adjusted Bias) transforms at patch sizes of 5 x 5, 11 x 11, and 21 x 21 pixels. Label-Assisted Greedy (LAG) feature selection retains the most discriminative features, which are fed to an XGBoost classifier for pixel-wise prediction. The Saab transform parameters are determined analytically from data statistics, while XGBoost employs iterative greedy tree construction without requiring backpropagation. Evaluated on the TN3K dataset (2,879 training and 614 test images), MedSaab-US achieves a mean Dice coefficient of 0.4784 +/- 0.2190, precision of 0.5768, and recall of 0.5604, with a model footprint under 500K parameters and CPU-only inference in approximately 0.3 seconds per image. We present this result as an exploratory non-DL baseline for thyroid ultrasound segmentation and analyze the specific challenges posed by isoechoic nodules. An ablation study further quantifies the contribution of each pipeline component, including separate evaluations of LAG feature selection and training-set size.
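
Illustrative sketch (not from the paper): the abstract reports Dice, precision, and recall for pixel-wise predictions. The snippet below shows the standard definitions of these three scores on flat binary masks. The helper name `segmentation_scores` and the toy masks are hypothetical, and the paper's own evaluation code may differ, for example in how empty masks are handled.

```python
from typing import Sequence, Tuple


def segmentation_scores(
    pred: Sequence[int], target: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (Dice, precision, recall) for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # True positives: pixels marked foreground in both masks.
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    pred_pos = sum(pred)
    target_pos = sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    dice = 2.0 * tp / (pred_pos + target_pos) if (pred_pos + target_pos) else 1.0
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / target_pos if target_pos else 0.0
    return dice, precision, recall


# Toy masks, made up purely for illustration.
pred_mask = [1, 1, 1, 0, 0, 0]
true_mask = [1, 1, 0, 1, 0, 0]
print(segmentation_scores(pred_mask, true_mask))
```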

RadiomicNet: A Hybrid Radiomics-Guided Lightweight Architecture for Interpretable Medical Image Segmentation

1/10

Mohammad Amanour Rahman

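Personalized recommendation rationale:

This paper focuses on medical image segmentation, a domain-specific (medical) application with no direct connection to the core technologies of recommender systems, search, or advertising.

2026-07-02 13:54:33 | arXiv:2607.02185v1 | cs.CV, cs.AI

Abstract: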

Deep learning has achieved remarkable performance in medical image segmentation, yet it suffers from critical limitations: mathematical intractability, substantial parameter requirements, and lack of clinical interpretability. We propose RadiomicNet, a novel two-stream hybrid architecture that enhances standard deep learning by integrating handcrafted radiomics features directly into the segmentation learning process. The key contribution is the Radiomics Attention Gate (RAG), which leverages Gray-Level Co-occurrence Matrix (GLCM) and Local Binary Pattern (LBP) features to modulate skip-connection attention in a lightweight MobileNetV2-based encoder-decoder, providing ante-hoc interpretability without post-hoc approximations. A novel Radiomics Consistency Loss further enforces alignment between texture complexity and prediction uncertainty, reducing Expected Calibration Error (ECE) from 0.142 to 0.118. RadiomicNet achieves a Dice Similarity Coefficient (DSC) of 0.763 +/- 0.231 on the Breast Ultrasound Images (BUSI) dataset and 0.854 +/- 0.112 on Kvasir-SEG, outperforming U-KAN by 1.2% and 1.8%, respectively (p < 0.05, Wilcoxon signed-rank test), with only 3.27M parameters, 9.5x fewer than standard U-Net and 4.3x fewer than U-KAN. Gradient-based feature importance analysis reveals that GLCM dissimilarity (15.24%), GLCM energy (14.56%), and LBP entropy (11.49%) are the dominant radiomics cues, providing clinically meaningful explanations for segmentation decisions. The proposed approach demonstrates that compact, interpretable models grounded in domain knowledge can deliver state-of-the-art segmentation performance with substantially reduced computational overhead.
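
Illustrative sketch (not from the paper): the abstract reports a reduction in Expected Calibration Error (ECE). The snippet below shows the commonly used equal-width binned ECE, assuming per-prediction confidences and 0/1 correctness flags. The function name, the bin count, and the toy inputs are assumptions; the paper's exact ECE protocol for segmentation is not stated in the abstract.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE: weighted average of |accuracy - mean confidence| over bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are (lo, hi]; a confidence of exactly 0.0 falls into the first bin.
        idx = [
            i
            for i, c in enumerate(confidences)
            if (lo < c <= hi) or (b == 0 and c == 0.0)
        ]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(accuracy - avg_conf)
    return ece


# Toy inputs, made up purely for illustration.
toy_conf = [0.9, 0.9, 0.6, 0.6]
toy_correct = [1, 1, 1, 0]
print(round(expected_calibration_error(toy_conf, toy_correct), 3))
```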

Efficient PEFT Methods with Adaptive Checkpointing for Vision Models and VLMs on Resource Constrained Consumer-GPUs

1/10

Altay Toktassyn, Jurn-Gyu Park

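Personalized recommendation rationale:

This paper focuses on PEFT methods and adaptive checkpointing for vision models and VLMs, optimized for resource-constrained consumer GPUs. It belongs to computer vision and does not involve the core technologies of recommender systems, search, or advertising. There is no clear evidence of direct application potential for RecSys/Search/Ads.

2026-07-02 13:31:29 | arXiv:2607.02158v1 | cs.CV

Abstract: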

Modern pretrained vision models achieve strong accuracy but demand substantial GPU memory for fine-tuning, making edge deployment impractical. This paper compares five parameter-efficient fine-tuning (PEFT) methods (Full FT, LoRA, AdaLoRA, QLoRA, BitFit) on Transformers- (ViT-Small, TinyViT) and Mamba-based vision backbones (Vim-Small, MambaVision-T) under an on-device VRAM budget (e.g., 2 GB), together with three gradient-checkpointing strategies (none, static, and a proposed memory-budget-aware adaptive algorithm); and we evaluate three families of foundation-model baselines: zero-shot contrastive vision language models (OpenCLIP, SigLIP), self-supervised vision backbones with lightweight evaluation protocols (DINOv2), and autoregressive VLMs for prompt-based classification (PaliGemma, MobileVLM, SmolVLM). Experiments on CIFAR-100 and DTD report accuracy, training time, energy, and the NetScore family of multi-objective metrics, which we extend with two deployment-aware variants. QLoRA and BitFit cut energy 20-30% at a 1-2% accuracy cost; the adaptive algorithm reduces peak memory 43-79% with 9-30% energy overhead. DINOv2 surpasses fine-tuned models on CIFAR-100 (0.917 vs. 0.897) at a fraction of the energy, while small autoregressive VLMs remain uncompetitive.

SAMoR: Motion Modelling for Articulated Objects of Any Skeleton and Topology

1/10

Yuhao Zhang, Gerard Pons-Moll, Tolga Birdal

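Personalized recommendation rationale:

This paper focuses on motion modeling for articulated objects, which belongs to computer vision and graphics. It does not involve user behavior modeling for recommender systems, search, or advertising, nor LLM techniques, and it has no clear application potential.

2026-07-02 13:24:47 | arXiv:2607.02148v1 | cs.CV

Abstract: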

Modeling motion for articulated objects of arbitrary skeleton topology remains difficult: existing motion generators target a fixed human skeleton, and prior adaptations either fail to share a vocabulary across rigs or discard motion detail through global pooling. Our key observation is that while joint-level motion does not correspond cleanly across species, motion of functional joint groups does: a human arm, a wolf foreleg, and a bird wing share motion structure despite differing joint counts and connectivity, a correspondence that joint names (e.g., "forearm", "wing_L1") partially expose even when topology does not. We introduce SAMoR (Skeleton-Aware Motion Representation for Articulated Objects), a cross-topology motion representation that encodes each motion segment as a small fixed number ($K=8$) of part tokens shared across arbitrary skeletons. A graph-transformer encoder consumes per-joint motion features, kinematic graph structure, and joint-name embeddings, then compresses them into part-level tokens via cross-attention pooling and residual vector quantization, yielding a discrete motion codebook shared across rigs. To keep the part queries from collapsing into redundant global representations, we introduce a topology-agnostic attention supervision loss, with joint-name dropout to reduce over-reliance on text labels. We curate a heterogeneous corpus from HumanML3D, Truebones Zoo, and animated Objaverse-XL assets, and evaluate SAMoR on held-out characters with unseen skeletons. It supports accurate reconstruction and cross-topology transfer, and enables text-conditioned generation and part-wise editing via a MaskGIT token generator. SAMoR reaches $2.75 \times 10^{-2}$ normalized MPJPE on cross-topology reconstruction, $5.8\times$ below the strongest adapted variable-$J$ tokenizer baseline, while remaining competitive with fixed-skeleton specialists on HumanML3D.

Predicting Early Stages Of Alzheimer's Disease And Identifying Key Biomarkers Using Deep Artificial Neural Network And Ensemble Of Machine Learning Methodologies

1/10

Debopriya Ghosh

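Personalized recommendation rationale:

This paper focuses on the medical diagnosis of Alzheimer's disease, a healthcare application unrelated to recommender systems, search, or advertising technology. It does not fall into any category within the focus scope.

2026-07-02 13:20:34 | arXiv:2607.02142v1 | cs.LG, cs.AI, cs.CV, cs.NE, eess.IV

Abstract: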

Alzheimer's disease (AD) is a brain disorder that develops slowly and mainly affects memory, thinking, language, and daily activities. It is one of the most common causes of dementia and creates many difficulties for patients as well as their families. In the early stage, the symptoms are often mild and may look like normal ageing. For this reason, many people are diagnosed late, when the disease has already progressed. At present, there is no complete cure for AD. Still, early detection can help doctors manage the condition better and take suitable steps at the right time. In this study, a machine learning model is proposed to detect the early stages of Alzheimer's disease using clinical details, neuropsychological test scores, and neuroimaging-related measures. The data used in this work is collected from the Alzheimer's Disease Neuroimaging Initiative (ADNI). As the dataset has missing values, iterative imputation is applied to fill them. The dataset also has class imbalance, which is handled using Borderline SVM-SMOTE. After that, feature selection is carried out using wrapper-based and embedded methods so that only important features are used for training. The selected features are divided into training and testing sets, and feature scaling is applied. A stacking ensemble model is developed using Logistic Regression, Extra Trees, Bagging KNN, and LightGBM as base classifiers. Along with this, an artificial neural network is also trained on the same dataset. The performance of these models is compared using precision, recall, F1-score, and AUC-ROC. This study aims to find the best classifier and also identify important biomarkers that may help in the early diagnosis of Alzheimer's disease.

AdaCount: Training-Free Similarity-Guided Spatial and Feature Adaptation for Zero-Shot Object Counting

1/10

Muhammad Ibraheem Siddiqui, Muhammad Haris Khan

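Personalized recommendation rationale:

This paper focuses on zero-shot object counting in computer vision, a purely visual task that does not involve modeling user behavior, item features, or interactions in recommendation, search, or advertising. Although it may involve similarity measures, there is no clear path for transferring the technique to multimodal recommendation or feature adaptation, so it is not relevant to the core areas of interest.

2026-07-02 13:16:04 | arXiv:2607.02139v1 | cs.CV

Abstract: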

Zero-shot object counting (ZOC) aims to count instances of arbitrary object categories specified only through textual prompts. Recent training-free approaches leverage foundation models such as SAM to reformulate counting as a prompt-driven segmentation task, eliminating the need for costly counting-specific training data with point-level annotations. More recently, SAM3 introduced promptable concept segmentation, enabling the zero-shot segmentation of all instances corresponding to a text-defined concept. However, SAM3 struggles in densely populated scenes containing numerous small objects, where limited image resolution and insufficient attention to target-relevant regions often lead to missed instances and poor instance separation, hindering accurate object counting. To address this limitation, we propose AdaCount, a training-free framework for ZOC based on similarity-guided spatial and feature adaptation. AdaCount first estimates a prototype-driven similarity map that identifies target-relevant regions. This similarity map subsequently guides two complementary adaptations: (i) similarity-guided spatial warping, which reallocates image resolution toward target instances, and (ii) feature modulation, which amplifies target-relevant encoder representations. Together, these adaptations enable SAM3 to devote greater representational capacity to target-relevant regions while preserving global image context, without requiring any model retraining. Extensive experiments across six diverse counting benchmarks establish AdaCount as a new SOTA among training-free ZOC approaches.

AbsoluteDegradation: A Physics-Inspired Synthetic Film-Degradation Pipeline and Archival Film Restoration Benchmark

1/10

Mikołaj Jastrzębski, Dawid Glinkowski, Dawid Zieliński, Daniel Borkowski, Wojcie...

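Personalized recommendation rationale:

The paper focuses on simulating and restoring film degradation, which belongs to computer vision and image processing and has no evident connection to the core technologies of recommender systems, search, or advertising. The work is neither an LLM-related technique nor concerned with Transformer architecture improvements or cross-modal modeling applied to recommendation, search, or advertising.

2026-07-02 13:08:26 | arXiv:2607.02131v1 | cs.CV, cs.LG

Abstract: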

Restoring archival film remains a fundamentally challenging problem due to the absence of paired training data and the lack of standardized evaluation benchmarks. Pristine versions of deteriorated footage are physically unrecoverable, requiring supervised methods to rely on synthetic data that often fail to capture the complex, temporally coherent nature of real film degradation. At the same time, existing real-world datasets are limited in scale, quality, and accessibility, hindering reliable evaluation and fair comparison across methods. We address both limitations with AbsoluteDegradation, a physics-inspired, modular pipeline for synthesizing realistic film degradations, and a new large-scale archival benchmark. The proposed pipeline models the analog-to-digital process as a structured composition of artifact families, incorporating signal-dependent grain, parametric scratches, and temporally coherent camera motion, enabling controlled generation of diverse degradation regimes. In parallel, we introduce a curated dataset of 81,576 high-resolution frames sourced from real archival footage, designed for consistent evaluation under real-world conditions. Together, these contributions provide a unified framework for training and benchmarking restoration models. Extensive experiments across multiple architectures show that models trained with AbsoluteDegradation generalize better to real-world footage, while the proposed benchmark reveals systematic failure modes of current methods. We hope this work establishes a foundation for reproducible and domain-authentic evaluation in archival film restoration.

Population-Scale Segmentation of Penile Tissue in DIXON MRI using Deep Learning for Quantitative Phenotyping in Male Reproductive Health

1/10

Jan Ernsting, Gunnar Paul Kordes, Nils Johannaber, Lynn Ogoniak, Wolfgang Roll, ...

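Personalized recommendation rationale:

This paper focuses on medical image segmentation and male reproductive health. It has no connection to the core technologies of recommender systems, search, or advertising, and it offers no contribution to, or potential application of, general techniques such as LLMs or Transformers.

2026-07-02 13:04:20 | arXiv:2607.02127v1 | eess.IV, cs.CV, cs.LG

Abstract: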

Penile measurement is clinically relevant across male reproductive and urogenital health, including conditions such as micropenis, congenital and endocrine disorders, and sexual or urinary dysfunction. However, quantitative assessment of penile size has relied mainly on external length or circumference measurements, which are difficult to standardize, sensitive to measurement conditions, and unable to capture the internal portion of the penis. MRI enables volumetric assessment of the whole penis in vivo, but automated segmentation has not previously been established at population scale. Automated whole-organ volumetry would enable high-throughput phenotyping for multi-omics and clinical studies of male reproductive disease. Here, we present a deep learning framework for whole-penis segmentation in multi-channel DIXON MRI. Using a newly curated expert-annotated training dataset ($n = 145$ subjects; $13,050$ annotated slices) and a double-annotated independent test benchmark ($n = 24$ subjects; $2,160$ double-annotated slices), we optimized a 3D nnU-Net architecture. The model achieved a 5-fold cross-validation Dice score of $0.90$ and performed at observer-level accuracy on the independent test set (Dice: $0.92$; Hausdorff distance: $3.58$). We deployed the model in $34,412$ UK Biobank participants, enabling automated quantification of total penile tissue, including both external and internal components. Longitudinal evaluation in 2,282 men demonstrated high inter-session reproducibility ($r = 0.87$). This framework establishes a reproducible and population-scalable method for MRI-based assessment of penile anatomy and provides an open technical resource for future studies in urological imaging and male reproductive health. The trained model weights will be publicly released.

X-Splat: Gaussian Splatting for 3D CBCT Generation from Single Panoramic Radiograph

1/10

Tomasz Szczepański, Szymon Płotka, Michal K. Grzeszczyk, Tomasz Trzciński, Arkad...

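Personalized recommendation rationale:

This paper focuses on medical imaging (CBCT generation), a domain-specific application with no connection to recommender systems, search, or advertising. On the technical side, it also does not involve potential applications of LLMs or Transformers to recommendation, search, or advertising.

2026-07-02 12:34:59 | arXiv:2607.02099v1 | cs.CV

Abstract: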

Generating a 3D dental volume from a single panoramic radiograph (PXR) could provide a low-radiation alternative to Cone-Beam Computed Tomography (CBCT), but the problem is highly underdetermined: panoramic acquisition integrates 3D attenuation along curved X-ray paths into a 2D image, leaving depth-resolved anatomy unobserved. Existing implicit and generative approaches often produce oversmoothed geometry or anatomically inconsistent hallucinations, lacking geometry-driven supervision and relying on smooth representations unable to precisely localize sharp anatomical boundaries. We propose X-Splat, the first Gaussian Splatting framework for generating CBCT-like 3D dental volumes from a single PXR. X-Splat uses the known panoramic acquisition geometry as a generation scaffold: learnable anisotropic Gaussian primitives are initialized along the X-ray paths that formed the input image and adjusted in a single feed-forward pass, constrained by Beer-Lambert reprojection and multi-view radiographic training supervision. A lightweight residual refiner adds dataset-level anatomical priors without overriding the geometry already resolved by the Gaussians. We train on synthetic PXR-CBCT pairs, enabling direct volumetric supervision without paired real scans. We further introduce segmentation-based geometry-aware metrics, providing the first evaluation of PXR-based generation over maxillofacial anatomy. X-Splat outperforms NeRF- and GAN-based baselines, recovering individual teeth, cortical boundaries, and alveolar structure, including the mandibular canal which prior methods fail to reconstruct. Code will be available at https://github.com/tomek1911/X-Splat
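
Illustrative sketch (not from the paper): the abstract mentions Beer-Lambert reprojection, in which the intensity reaching a detector falls off exponentially with the attenuation integrated along the X-ray path. The snippet below shows a discretized version of that law for a single straight ray. The function name, the uniform step length, and the sample values are assumptions; the paper's curved panoramic geometry and Gaussian primitives are not modeled.

```python
import math
from typing import Sequence


def beer_lambert_intensity(
    attenuations: Sequence[float],
    step_length: float,
    source_intensity: float = 1.0,
) -> float:
    """I = I0 * exp(-sum(mu_i * ds)) for attenuation samples along one ray."""
    # Riemann-sum approximation of the line integral of attenuation.
    line_integral = sum(mu * step_length for mu in attenuations)
    return source_intensity * math.exp(-line_integral)


# Toy attenuation samples along a ray, made up purely for illustration.
samples = [0.2, 0.5, 0.3]
print(beer_lambert_intensity(samples, step_length=1.0))
```

A reprojection loss in this spirit would compare such simulated intensities against the pixels of the input radiograph.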

LongEgoRefer: A Benchmark for Long-Form Egocentric Video Referring Expression Comprehension

1/10

Shunya Kato, Taiki Miyanishi, Shuhei Kurita, Mahiro Ukai, Nakamasa Inoue, Chenhu...

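Personalized recommendation rationale:

This paper focuses on referring expression comprehension in egocentric video, which belongs to video understanding and vision-language research. It is unrelated to the core technologies of search, recommendation, or advertising, and it does not involve advances in or applications of LLM or Transformer architectures.

2026-07-02 12:32:53 | arXiv:2607.02096v1 | cs.CV

Abstract: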

Egocentric videos capture rich and diverse human-object interactions and have emerged as a fundamental resource for understanding human activities related to objects. In this context, Video Referring Expression Comprehension (Video REC), the task of localizing the temporal and spatial extent of a referred object in video frames given a natural language query, plays a key role in linking textual descriptions to observed objects in untrimmed egocentric recordings. However, existing egocentric Video REC benchmarks primarily focus on short video clips, where some target object appears densely within frames. Such settings do not reflect real-world egocentric recordings, which are long-form, untrimmed, and characterized by sparse object occurrences and complex activity transitions. To address this limitation, we introduce LongEgoRefer, a novel and challenging benchmark constructed from long-form videos in the Ego4D dataset. LongEgoRefer contains 1,498 referring expressions with an average video duration of 45 minutes. The benchmark exhibits extreme target sparsity, detailed linguistic descriptions, and complex human-object interactions embedded in long, dynamic egocentric narratives. Consequently, it defines a demanding spatio-temporal grounding problem that requires models to identify both when an event occurs and where the referred object appears within extended video sequences. We evaluate existing Video REC approaches, including training-free baselines based on vision-language models combined with Grounded SAM2. Extensive experiments show that even advanced baselines and current state-of-the-art models struggle significantly on LongEgoRefer. These results highlight the intrinsic difficulty of long-form egocentric spatio-temporal grounding and emphasize the need for more robust video understanding models.

Multimodal Fusion for Fine-Grained Classification of Breast Fibroadenoma and Phyllodes Tumors

1/10

Chuxi Nan, Di Wu, Hongming Guo, Ning Cao, Xiaohui Zhu, Zhaoting Shi, Jiawei Li

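Personalized recommendation rationale:

This paper focuses on medical image classification, a domain-specific application unrelated to the core technologies of recommender systems, search, or advertising. Although it involves multimodal fusion, its method is highly domain-specific and offers no general innovation that clearly transfers to recommendation or search scenarios.

2026-07-02 12:30:08 | arXiv:2607.02091v1 | cs.CV

Abstract: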

Breast fibroadenoma (FA) and phyllodes tumor (PT) are fibroepithelial breast lesions with highly overlapping appearances on B-mode ultrasound, making benign and borderline PT prone to being misclassified as FA and complicating preoperative decision-making. Existing computer-aided diagnosis methods commonly rely on single-modal imaging features and insufficiently exploit complementary clinical and textual information. To address this limitation, we construct the FAPT-M Dataset, a pathology-confirmed multimodal dataset comprising 910 patients with strictly reviewed ultrasound images, structured clinical attributes, and ultrasound diagnostic descriptions. Based on this dataset, we propose a clinically guided multimodal framework that integrates DenseNet-based visual encoding, CLIP-inspired text encoding, and lightweight clinical encoding, and further introduces clinical-conditioned adaptive modulation, cross-modal Transformer fusion, and dual-path representation learning to improve feature alignment and multimodal interaction. Under patient-level five-fold cross-validation, the proposed method achieves an accuracy of 77.64%, F1-score of 73.38%, and AUC of 89.74%, outperforming representative CNN-, Transformer-, and vision-language-based baselines. Ablation studies and class-balanced evaluations further confirm the contribution of three-modality fusion and the key architectural components. Overall, this work provides an effective multimodal approach for fine-grained FA-PT classification and establishes a high-quality benchmark for multimodal breast ultrasound analysis.

TCG-AR: Real-Time Multi-View Augmented Reality for Trading Card Game Streaming

1/10

Anthony Cioppa, Antoine Verdonck, Maxim Henry, Marc Van Droogenbroeck, Raphaël L...

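Personalized recommendation rationale:

This paper focuses on applying augmented reality to a specific game-streaming scenario, which belongs to graphics and interaction technology and is unrelated to the core technologies of recommender systems, search, or advertising. It neither involves improvements to LLM or Transformer architectures nor discusses potential applications in areas such as recommendation.

2026-07-02 12:29:45 | arXiv:2607.02090v1 | cs.CV

Abstract: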

Trading card games are increasingly played and broadcast online, yet live streams remain mostly limited to flat top-down footage of the playing area. Augmenting such streams with virtual models of the played cards would improve the viewing experience, but most existing systems rely on instrumented playing surfaces and embedded chips, which are costly and impractical for casual players and large-scale events. In this work, we present TCG-AR, a novel real-time pipeline that augments trading card games using ordinary RGB cameras alone, without any physical markers or specialized hardware. Our pipeline detects, orients, and identifies the cards on the board, renders virtual content onto each card across all views, and can additionally compose a broadcast-style view that summarizes the game state for spectators, streaming the augmented feeds to standard broadcasting software such as OBS. To train the detection, orientation, and identification models without manual labeling, we introduce an automatic procedure that generates annotated synthetic training data from a reference set of card images. Then, we evaluate several trained models on a new manually annotated dataset with real images, analyzing performance and runtime throughput that determine real-world usability. Overall, by relying only on commodity cameras and hardware, and by open-sourcing all code, models, and datasets, this work aims to serve as a reference for real-time trading card recognition and to make real-time augmented-reality streaming accessible to the broader community of players and streamers.

DeepGaze3.5-VL: Modeling Scanpaths via Autoregressive Token Prediction

1/10

Susmit Agrawal, Matthias Bethge, Matthias Kümmerer

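Personalized recommendation rationale:

This paper focuses on modeling visual scanpaths, which belongs to computer vision and eye tracking and has no direct relation to the core technologies of recommender systems, search, or advertising (such as ranking, matching, and user modeling). Although autoregressive token prediction is common in LLMs, the application here is disconnected from RecSys/Search/Ads and offers no evident insight for heterogeneous-data or multimodal modeling.

2026-07-02 12:25:45 | arXiv:2607.02083v1 | cs.CV

Abstract: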

Understanding human visual attention on a scene over time has applications in domains such as interface design and inferring cognitive states. Modeling visual scanpaths has historically relied on specialized architectures with hand-crafted priors. While these architectures can model fixation sequences, their rigid structural biases restrict easy extendability and flexible conditioning. For instance, integrating task-specific instructions or adapting to distinct viewer identities requires custom, disjoint architectural additions. We frame scanpath prediction purely as a discrete sequence modeling task. By mapping coordinates into a text vocabulary, we leverage the pretrained representations of Vision-Language Models. This framing absorbs diverse factors of variation: simple prompting allows for global conditioning, such as providing viewer identities to capture personalized biases, or task-specific objectives like visual search. The framework can also integrate per-fixation attributes, such as individual fixation durations, alongside spatial locations. The autoregressive alignment enables the scalable, exact computation of per-fixation log-likelihoods, directly equivalent to the commonly used Information Gain (IG) metric. Our model, DeepGaze3.5-VL, establishes a new state-of-the-art across multiple datasets, achieving 2.18 bits of IG on MIT1003, a 46% improvement over DeepGaze III. This advantage persists even when baselines use identical high-capacity vision encoders. Beyond predictive performance, our generative framework serves as a powerful computational tool for direct behavioral interventions, allowing for controlled in-silico simulations that would be experimentally difficult or impossible to conduct in vivo. We demonstrate this ability by performing controlled interventions on the durations of pre-saccadic fixations, recovering known oculomotor phenomena purely from data.
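
Illustrative sketch (not from the paper): the abstract describes mapping fixation coordinates into a text vocabulary and scoring models by per-fixation log-likelihood, reported as Information Gain in bits. The snippet below shows one possible coordinate-to-token quantization and the usual "mean log-likelihood difference to a baseline, in bits" form of IG. The token format `<x_i>`/`<y_j>`, the bin count, and both function names are hypothetical; the abstract does not specify the paper's actual vocabulary or baseline.

```python
import math
from typing import List, Sequence


def fixation_to_tokens(
    x: float, y: float, width: float, height: float, n_bins: int = 100
) -> List[str]:
    """Quantize one fixation into two discrete tokens (hypothetical format)."""
    # Clamp so that points on the right or bottom edge fall into the last bin.
    bx = min(int(x / width * n_bins), n_bins - 1)
    by = min(int(y / height * n_bins), n_bins - 1)
    return [f"<x_{bx}>", f"<y_{by}>"]


def information_gain_bits(
    model_log_probs: Sequence[float], baseline_log_probs: Sequence[float]
) -> float:
    """Mean per-fixation log-likelihood gain over a baseline, converted from nats to bits."""
    gains = [
        (m - b) / math.log(2.0)
        for m, b in zip(model_log_probs, baseline_log_probs)
    ]
    return sum(gains) / len(gains)


# Toy values, made up purely for illustration.
print(fixation_to_tokens(512, 384, width=1024, height=768))
print(information_gain_bits([math.log(0.5)], [math.log(0.25)]))
```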

HandsOnWorld: Unconstrained Egocentric Video Generation with Camera-Disentangled Hand Control

1/10

Yushuo Chen, Xiaoyu Shi, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Yebin Liu

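Personalized recommendation rationale:

This paper focuses on egocentric video generation, which belongs to computer vision and graphics and is unrelated to the core techniques of recommender systems, search, or advertising. It does not involve LLMs, Transformer architectures, or recommender-system application scenarios.

2026-07-02 12:14:02 | arXiv:2607.02075v1 | cs.CV

Abstract: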

We present HandsOnWorld, a framework for hand-controlled egocentric video generation that forgoes multi-view and marker-based motion capture, learning instead from unconstrained monocular video. Such generality is bottlenecked by the scarcity of scalable 3D hand annotations: large egocentric corpora lack finger-level labels, whereas precise hand datasets are confined to narrow, instrumented settings, limiting prior hand-controlled generators to restricted scene distributions. We instead annotate 3D hands directly on in-the-wild egocentric video through monocular reconstruction, introducing a protagonist-centered annotation pipeline that filters the reconstructions at the action-semantic, image-quality, and 3D-geometric levels to build EgoVid-Pro, a dataset of clean, protagonist-only hand trajectories spanning 103K clips and roughly 12M frames across diverse everyday scenes. To resolve the camera-hand entanglement induced by large ego-motion, we further propose the Plücker Hand Map, a 3D-aware control signal that extends Plücker-ray representations from camera rays to the hand surface, disentangling camera and hand motion at the representation level. Experiments show that HandsOnWorld surpasses prior hand-controlled generators in reconstruction fidelity and control accuracy, and generalizes to out-of-distribution everyday scenes beyond the laboratory datasets on which prior methods rely.
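For readers unfamiliar with the Plücker-ray representation the abstract builds on, the sketch below shows the standard six-dimensional parameterization of a ray as a unit direction plus a moment vector. It illustrates only the generic ray encoding, not the paper's Plücker Hand Map; the function names are hypothetical.

```python
import math


def cross(a, b):
    """Cross product of two 3D vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plucker_ray(origin, direction):
    """Return (d, m): unit direction d and moment m = origin x d, as a 6-tuple."""
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    d = tuple(c / norm for c in direction)
    m = cross(origin, d)
    return d + m


# The encoding depends on the line, not on which point along it is chosen:
print(plucker_ray((0, 1, 0), (0, 0, 2)))
print(plucker_ray((0, 1, 5), (0, 0, 2)))
```

Both calls describe the same line, so they produce the same six numbers, which is what makes the representation convenient as a per-pixel geometric signal.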

Comprehensive Robustness Analysis of LiDAR-based 3D Object Detection in Autonomous Driving

1/10

Adwait Chandorkar, Kai Krink, Yerdana Maulenbay, Hasan Tercan, Tobias Meisen

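Personalized recommendation rationale:

This paper focuses on LiDAR-based 3D object detection for autonomous driving and is unrelated to the core areas of search, recommendation, and advertising. Although it involves robustness analysis, neither its technical background nor its application scenario is relevant, so it falls outside the focus area.

2026-07-02 12:13:49 | arXiv:2607.02074v1 | cs.CV

Abstract: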

Recent advancements in LiDAR-only 3D object detection have demonstrated improved detection accuracy over benchmark datasets. However, the adversarial robustness of these models remains untested. Very few adversarial robustness studies exist for LiDAR-only 3D object detection and unfortunately, even they are limited to legacy models. Moreover, there is a systemic gap in the existing evaluation frameworks that rely simply on mAP ignoring other structural and predictive factors. To fill this gap, we propose a holistic framework that evaluates adversarial robustness using two structural factors (point cloud density and point cloud localization) and three predictive factors (misclassification, localization error, distance from ego). Using this framework, we perform an empirical study and critical analysis on recent and legacy state-of-the-art models using adversarial attacks specifically designed for LiDAR-based models. Our key finding is that high-capacity, voxel-based detectors are more susceptible to structured coordinate perturbations than pillar-based detectors. Additionally, non-anchor-based detectors demonstrate poor adversarial robustness, which necessitates rethinking model training techniques. Overall, our results demonstrate that recent models are as vulnerable to adversarial attacks as their predecessors. Therefore, we argue that there is a need to improve the evaluation benchmarks for 3D object detection that not only reward architectural modifications for improving detection accuracy, but also evaluate whether the design choices improve adversarial robustness.

Beyond the Performance Illusion: Structure-Aware Stratified Partitioning and Curriculum Distributionally Robust Optimization for Spatially Correlated Domains

1/10

Prathamesh Patil, Arpit Jain, Aswanth Krishnan

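Personalized recommendation rationale:

This paper mainly addresses stratified partitioning and robust optimization for spatially correlated domains (such as geospatial data), which falls under traditional machine learning or spatial statistics, and has no direct connection to the core techniques of RecSys/Search/Ads (user behavior modeling, recommendation, search, ad ranking, etc.). It does not involve LLMs, Transformers, or techniques typical of recommender systems such as attention mechanisms and sequence modeling, so its relevance is very low.

2026-07-02 11:32:38 | arXiv:2607.02055v1 | cs.LG, cs.AI, cs.CV

Abstract: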

Performance evaluation in AI systems commonly assumes that random dataset splits produce independent and identically distributed (i.i.d.) subsets. We show that this assumption often breaks down in spatiotemporally correlated domains such as aerial surveillance, precision agriculture, and medical imaging, leading to two systematic failures: data leakage, where correlated samples span training and validation splits and inflate performance estimates, and hidden stratification, where errors on minority subpopulations are obscured by aggregate metrics. To address these issues, we propose a unified evaluation and training framework for spatially correlated data. We introduce Structure-Aware Stratified Partitioning (SASP), which constructs validation splits that reduce spatiotemporal leakage while preserving meaningful class balance, and Curriculum Distributionally Robust Optimization (CDRO), a curriculum-based relaxation of distributionally robust training that stabilizes optimization under these stricter splits. Across multiple benchmarks, this combination yields consistently improved generalization, more reliable confidence calibration, and exposes failure modes that remain hidden under conventional random-split evaluation.
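The leakage problem described in the abstract arises when correlated samples (for example, frames from the same flight or scans from the same patient) end up on both sides of a random split. The sketch below shows the simplest group-aware alternative, assuming each sample carries a group identifier. It is not the paper's SASP procedure and does not attempt to preserve class balance; the function name and parameters are hypothetical.

```python
import random
from collections import defaultdict


def group_split(sample_groups, val_fraction=0.2, seed=0):
    """Split sample indices so that each group lands entirely in one split."""
    by_group = defaultdict(list)
    for idx, group in enumerate(sample_groups):
        by_group[group].append(idx)

    groups = sorted(by_group)              # deterministic order before shuffling
    random.Random(seed).shuffle(groups)

    target = val_fraction * len(sample_groups)
    train, val = [], []
    for group in groups:
        # Fill the validation split with whole groups until it is large enough.
        if len(val) < target:
            val.extend(by_group[group])
        else:
            train.extend(by_group[group])
    return train, val


groups = ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"]
train_idx, val_idx = group_split(groups)
print(train_idx, val_idx)
```

Because whole groups are assigned at once, the validation size can overshoot the requested fraction when groups are large.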

Embracing Intra-Class Heterogeneity for Semi-Supervised Medical Image Segmentation: From Diversity to Precision

1/10

Yuqi Liu, Yufei Chen, Wei Fu, Xiaodong Yue, Shuo Li

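Personalized recommendation rationale:

The paper's topic is medical image segmentation, a medical-domain application with no direct or indirect connection to recommender systems, search, or advertising. Although semi-supervised learning may have applications in recommendation, this paper focuses on medical images and does not meet the core-domain requirements.

2026-07-02 11:26:11 | arXiv:2607.02051v1 | cs.CV

Abstract: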

Due to the scarcity of expert-annotated data, Semi-Supervised Medical Image Segmentation (SSMIS) has emerged as a promising approach. Many anatomical structures in medical images exhibit significant intra-class heterogeneity, with different regions showing heterogeneous intensity patterns within the same structure. However, existing methods inadequately exploit this intensity-manifested intra-class heterogeneity, resulting in uniform structural representations and imprecise segmentation. Furthermore, the scarcity of labeled data makes it more difficult to effectively capture such complex heterogeneity. To address this, we propose Multiple Prototype Contrastive Learning (MPCL), an SSMIS framework that possesses better diversity and better precision. It consists of three novel designs: First, we provide structural representations with better diversity and propose Intensity-aligned Heterogeneous Prototype Generation (IHPG) that effectively models intra-class heterogeneity by generating multiple prototypes aligned with intensity characteristics. Second, we further enhance more diverse structural representations and build a solid foundation for more precise segmentation through Prototypical Space Optimization (PSO) that systematically optimizes a more discriminative and generalizable prototypical space. Finally, we achieve segmentation results with better precision through Dual-branch Knowledge Alignment (DKA) that efficiently promotes intra-class heterogeneity knowledge transfer from prototypical space to the segmentation network. Extensive experiments on three medical image datasets with significant intra-class heterogeneity demonstrate that MPCL significantly outperforms existing methods, especially under extremely limited labeled data.

PWM-ArtGen: Part World Model for Articulated Object Generation

1/10

Wentao Zheng, Ancong Wu

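Personalized recommendation rationale:

This paper focuses on generating articulated 3D objects, which belongs to computer vision and graphics. It has no direct or potential application to search, recommendation, or advertising, so its relevance is very low.

2026-07-02 11:12:29 | arXiv:2607.02045v1 | cs.CV

Abstract: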

The key challenge in articulated 3D object generation from a single image is accurately predicting the underlying kinematic structure. Existing methods either infer kinematic parameters directly from a static image that lacks dynamic part-level kinematic relationships, or estimate parameters from visual dynamics generated from a single image, which is prone to accumulated errors of two steps. Moreover, the limited scale and diversity of existing annotated datasets further hinder generalization to complex, real-world objects. To overcome these limitations, we propose to learn the joint distribution of visual dynamics and kinematic parameters. Recognizing that articulated objects can be formulated as dynamic systems, we propose a unified Part World Model called PWM-ArtGen. To leverage unannotated data, this model couples action diffusion and image diffusion with independent diffusion timesteps, which enables visual branch co-training. We further curate a photorealistic dataset of 19.7k part-level image pairs without kinematic annotations, to support co-training. Experiments demonstrate that PWM-ArtGen substantially outperforms existing baselines in the resting state and exhibits strong zero-shot generalization to out-of-distribution objects.

Hierarchical Anti-Aesthetics: Protecting Facial Privacy against Customized Diffusion Models

1/10

Songping Wang, Yueming Lyu, Shiqi Liu, Chen Zhao, Ziyuan Chen, Ning Li, Jing Don...

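Personalized recommendation rationale:

This paper focuses on privacy protection, specifically facial privacy protection against diffusion models. It belongs to the security/privacy area rather than recommendation, search, advertising, or core LLM/Transformer techniques.

2026-07-02 11:05:58 | arXiv:2607.02038v1 | cs.CV

Abstract: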

The rise of customized diffusion models has fueled a boom in personalized visual content creation, but it also introduces serious risks of malicious misuse, thereby posing threats to personal privacy. Image aesthetics are strongly correlated with human perception of image quality. Motivated by this observation, we address facial privacy protection from a novel aesthetic perspective by degrading the generation quality of maliciously customized models, thus reducing facial identity leakage. Specifically, we propose a Hierarchical Anti-Aesthetics (HAA) framework that exploits aesthetic cues at multiple perceptual levels. HAA consists of two key branches: (1) Global Anti-Aesthetics, which degrades overall aesthetics and generation quality by constructing a global anti-aesthetic reward mechanism and a corresponding loss; and (2) Local Anti-Aesthetics, which disrupts facial identity by using a local anti-aesthetic reward mechanism and loss to guide adversarial perturbations toward facial regions. By integrating both branches, HAA achieves anti-aesthetic degradation from a global to a local level during customized generation. Extensive experiments show that HAA outperforms existing methods in identity removal, providing an effective tool for protecting facial privacy.

ComplexMimic: Human-Scene Interaction Imitation in Complex 3D Environments

1/10

Lu Pan, Hongwei Zhao

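Personalized recommendation rationale:

This paper focuses on imitating human-scene interaction in 3D environments, which belongs to computer graphics or robotics. It has no direct connection to the core areas of recommender systems, search, or advertising, and does not involve advances in large language models or Transformer architectures. Its relevance is therefore very low.

2026-07-02 11:01:20 | arXiv:2607.02034v1 | cs.CV

Abstract: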

Physics-based Human-Scene Interaction (HSI) imitation learning is crucial for embodied intelligence as it bridges the gap between kinematic 3D motions and real-world dynamics. However, most existing methods focus on simplified scene settings, leaving complex environments largely unexplored, which limits their applicability in real-world scenarios. In this paper, we focus on HSI mimicry in complex environments. Under this complex setting, we observe an inherent trade-off between successfully performing interaction and maintaining natural, physically plausible motions. To address this challenge, we propose ComplexMimic, a framework that reconstructs diverse HSI by interpreting imperfect MoCap data. First, we introduce a Dual Flow Strategy, which learns two complementary experts: an imitation expert for accurate motion tracking and an interaction expert for collision-aware adaptation in complex scenes. Second, naive multi-expert distillation, which treats all experts equally, often under-samples challenging behaviors, limiting effective learning. To mitigate this issue, we propose a difficulty-aware distillation strategy that adaptively weights supervision and prioritizes hard-yet-learnable trajectories guided by failure statistics and learning progress signals. Extensive experiments on three benchmark datasets demonstrate that our approach outperforms current state-of-the-art methods. Our implementation is available at https://github.com/LuPan23/ComplexMimic.

Evaluating Vision-Language Models as a Zero-Shot Learning Alternative to You Only Look Once and Optical Character Recognition for Nigerian License Plate Recognition

1/10

Ismail Ismail Tijjani, Ahmad Abubakar Mustapaha, Sunusi Ibrahim Muhammad, Muhamm...

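Personalized recommendation rationale:

This paper focuses on license plate recognition, a specific vision task in pure computer vision, with no direct connection to the core techniques of recommender systems, search, or advertising. Although it involves vision-language models, the application scenario (license plate recognition) does not generalize to heterogeneous data modeling or user behavior understanding. It therefore does not match my focus.

2026-07-02 10:55:24 | arXiv:2607.02025v1 | cs.CV

Abstract: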

License Plate Recognition (LPR) systems are critical tools in traffic monitoring, security enforcement, and urban mobility management. Traditional LPR systems often rely on a multi-stage pipeline involving object detection using You Only Look Once (YOLO) and Optical Character Recognition (OCR), which suffer from limitations such as high resource demands, poor performance in unstructured environments, and the need for large annotated datasets. This study explores the potential of Vision-Language Models (VLMs) as a unified, zero-shot learning solution for Nigerian license plate recognition. Using a curated dataset of 88 challenging real-world images collected in Nigeria, we evaluate five selected VLMs: Gemini 2.0 Flash Exp (Google DeepMind), Qwen2.5-VL-7B-Instruct (Alibaba), GPT-4o (OpenAI), Claude 4 Sonnet (Anthropic), and Llama 3.2 Vision 90b (Meta). Results based on Character Error Rate (CER) reveal that Gemini and Qwen significantly outperform other models in both accuracy and robustness on the challenging image scenarios. This work highlights the practical advantages of VLMs over YOLO+OCR, questions the claims by model providers, and compares the performances of the VLMs.
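The study ranks models by Character Error Rate (CER). A minimal sketch of the usual definition follows, assuming CER is the Levenshtein edit distance between the predicted and reference strings divided by the reference length. The plate strings are made up for illustration and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference string must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical plate: one character misread out of eight.
print(character_error_rate("ABC123DE", "A8C123DE"))
```

Note that CER can exceed 1.0 when the prediction is much longer than the reference.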

UnderOneFacade: Worldwide Facade Semantic Segmentation Benchmark Dataset

1/10

Yi Wang, Fan Wang, Prabin Gyawali, Ziyang Xu, Anna Klimkowska, Yixiong Jing, Wan...

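Personalized recommendation rationale:

This paper focuses on a specific computer vision task (semantic segmentation of building facades), with no direct or indirect connection to recommender systems, search, or advertising. It does not involve LLM or Transformer techniques and lacks clear potential for RecSys/Search/Ads applications.

2026-07-02 10:50:01 | arXiv:2607.02018v1 | cs.CV

Abstract: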

Globally consistent semantic digital twins require centimeter-accurate and geographically transferable 3D facade segmentation. However, progress in facade parsing is limited by the lack of large-scale, standardized benchmarks for evaluating cross-domain generalization. Existing datasets are geographically narrow, semantically inconsistent, or insufficiently precise. We introduce UnderOneFacade, the largest cross-country and cross-continent 3D facade benchmark to date, comprising centimeter-accurate point clouds with hierarchical, harmonized, and architecturally grounded semantic labels totaling 2.7 billion annotated points. Through a systematic evaluation of representative point-, graph- and transformer-based architectures, we show that current methods struggle to recognize fine-grained architectural elements and degrade significantly across geographic domains, with the best models achieving only up to 33 IoU on the fine-grained LoFG3 benchmark. By combining geometric precision with standardized semantics at unprecedented scale, UnderOneFacade establishes a rigorous benchmark for developing robust and transferable 3D segmentation models. The dataset, evaluation scripts, and pretrained models will be released upon publication.

Mirror Illusion Art

1/10

Xiaopei Zhu, Zeyuan Li, Jun Zhu, Xiaolin Hu

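Personalized recommendation rationale:

The title refers only to an art form and is unrelated to techniques in LLMs, recommender systems, search, or advertising; the topic is entirely irrelevant.

2026-07-02 10:47:42 | arXiv:2607.02015v1 | cs.CV, cs.AI

Abstract: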

Mirror Illusion Art is a novel reflection-conditioned 3D illusion where one object yields two target appearances (front and mirror). The task is formulated as inverse design from two target 2D images (front and mirror) to a printable 3D object with geometry and texture. Prior topology-driven and shadow-based approaches demand substantial manual effort, optimize shape only, and often yield non-smooth or incomplete geometry. To address these challenges, we propose AutoMIA, an automated Mirror Illusion Art design pipeline that jointly optimizes shape and color. To stabilize optimization and suppress artifacts, four mechanisms are introduced: (1) projection-alignment component (PAC) selection to reduce surface noise, (2) position-weighted adaptive (PWA) suppression for background noise, (3) internal voxel preservation (IVP) to prevent internal fractures, and (4) shape-color decoupled (SCD) optimization that balances shape and color optimization. AutoMIA successfully generates diverse, smooth Mirror Illusion artworks in both the digital and physical worlds, with only around 76s design time and 2.6 GB memory on average using a single RTX 3090, advancing inverse graphics and computational design. Our code is available at https://github.com/zxp555/AutoMIA.

A Stereo Visual SLAM System Using Object-Level Motion Estimation and Geometric Filtering Based on Cross Disparity

1/10

Sujan Kumar Dhali, Bhaskar Dasgupta

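Personalized recommendation rationale:

This paper focuses on stereo visual SLAM (simultaneous localization and mapping), which belongs to robot perception and computer vision, with no direct or indirect connection to recommender systems, search, or advertising. It does not involve large language models or Transformer architectures, nor multimodal fusion applied to recommendation or search, so its relevance is very low.

2026-07-02 10:41:10 | arXiv:2607.02005v1 | cs.RO, cs.CV

Abstract: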

This paper presents OCD SLAM, a dynamic stereo visual SLAM framework that extends ORB-SLAM2 by jointly addressing dynamic objects and dynamic features in the scene. Usual visual SLAM systems operating in dynamic environments often fail in the presence of moving objects, due to the static-world assumption used in pose estimation and mapping. To address this predicament, we introduce a novel geometric approach based on the discrepancy between disparity and a newly proposed notion called "cross disparity", which exploits both temporal and stereo inconsistency to identify dynamic feature points. Complementary to this feature-level motion analysis, OCD SLAM integrates a 3D object detection module (SMOKE) with Kalman filter-based object tracking to perform object-level motion classification, enabling robust separation of static and dynamic scene elements for accurate pose estimation. The proposed approach has been evaluated on various sequences from the KITTI Odometry and KITTI Raw datasets. Results demonstrate that OCD SLAM achieves significant improvement in trajectory accuracy compared to ORB-SLAM2 and several state-of-the-art dynamic SLAM methods. Ablation studies further demonstrate the effectiveness of the cross disparity module in the KITTI Raw dataset and show that this method is able to detect dynamic features that are missed by the 3D object detection scheme alone.

Training-free Controllable Human Motion Generation under Heterogeneous Constraints

1/10

Xiaofei Hui, Bo Yan, Haoxuan Qu, Hossein Rahmani, Jun Liu

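Personalized recommendation rationale:

This paper focuses on human motion generation, which belongs to computer vision and graphics, with no direct connection to the core techniques of recommender systems, search, or advertising. Although "heterogeneous constraints" appears in the title, the term refers to constraint types in motion generation, not to heterogeneous data or unified modality modeling in recommender systems. The paper does not involve applying LLMs or Transformers to search, recommendation, or advertising, so its relevance is very low.

2026-07-02 10:24:50 | arXiv:2607.01990v1 | cs.CV

Abstract: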

Training-free controllable motion generation has attracted growing interest for enabling flexible constraint enforcement without constraint-specific training. However, existing training-free methods require constraints to be continuous objective-based with differentiable losses, while many real-world requirements are criterion-based and provide only discontinuous, sparse, or even black-box feedback. In this paper, we propose Motion-Inference-as-Control (MIC), the first training-free motion generation framework that handles both continuous objective-based and criterion-based motion constraints under a shared mechanism. The key idea is to cast diffusion-based motion generation as a stochastic control problem. This perspective not only provides principled and practically effective step-wise control laws that support criterion-based constraints without requiring differentiability and naturally accommodate objective-based constraints as a special case, but also motivates a control-oriented constraint coordination mechanism that adaptively balances and reconciles motion constraints during generation. Experiments across diverse constraint settings demonstrate the effectiveness of our framework.

Understanding Geometric Representations in Self-Supervised Vision Transformers via Subspace Intervention

1/10

Weichen Zhou, Yawen Zou, Chunzhi Gu, Ran Dong, Haoran Xie, Chao Zhang

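Personalized recommendation rationale:

This paper focuses on analyzing geometric representations in self-supervised Vision Transformers, a purely vision topic with no direct or potential application to recommender systems, search, or advertising. Although the Transformer architecture is a shared technology, the paper does not explore any possible application in recommendation, search, or advertising, and the title offers no hint of cross-domain transfer.

2026-07-02 10:18:02 | arXiv:2607.01987v1 | cs.CV

Abstract: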

We introduce a controlled subspace intervention framework to investigate how self-supervised Vision Transformers (ViTs) encode dense geometric information. While linear probing is widely used to assess geometric representations, it treats features as a black box, failing to disentangle the underlying topology. To address this issue, we decompose the weights of converged linear probes to isolate the low-rank subspaces containing explicit geometric signals using Singular Value Decomposition (SVD). Our perspective yields three key insights: (1) Pre-training objectives determine how features are encoded. DINOv2 aligns spatial features for efficient linear extraction, while Masked Autoencoders (MAE) tend to disperse these signals, requiring a broader spatial context. (2) Explicit geometric representations are highly compressible, suggesting dense predictive heads could potentially be constrained to low-rank subspaces with minimal performance loss. (3) The layer-wise task affinity suggests that geometric precision peaks at intermediate layers before yielding to semantic abstraction in the final layers. By connecting internal encoding mechanics with downstream performance, these findings provide a basis for effective feature selection and lightweight decoder design. The source code is available at https://github.com/Zhou-Weichen/Geosubprobe.

Do Newer Lightweight CNNs Perform Better Under Resource Constraints? A Controlled Multigenerational Study of Architecture, Initialization, Training Budget, and Efficiency

1/10

Tasnim Shahriar

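Personalized recommendation rationale:

This paper focuses on comparing the performance of lightweight CNNs (convolutional neural networks) under resource constraints, which belongs to computer vision and is unrelated to core interests such as LLM applications in recommender systems, search, or advertising, Transformer architectures, and VLM-based heterogeneous data modeling. It also does not involve LLM or recommender-system techniques.

2026-07-02 10:14:07 | arXiv:2607.01984v1 | cs.LG, cs.AI, cs.CV

Abstract: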

Newer lightweight convolutional neural networks are often presented as improving predictive performance and deployment efficiency, but such claims require controlled evaluation. This study compares nine lightweight CNN model packages across CIFAR-10, CIFAR-100, and Tiny ImageNet under a shared downstream protocol. We report top-1 accuracy, macro F1, top-5 accuracy, parameter count, FP32 storage, GMACs, batch-size-1 latency on an NVIDIA L4 and AMD Ryzen 5 5500U CPU, peak PyTorch CUDA allocated tensor memory, and point estimate Pareto frontiers. EfficientNetV2-S achieves the highest observed top-1 accuracy on CIFAR-10 and CIFAR-100 at 97.57% and 86.98%, while RepViT-M1.0 leads Tiny ImageNet at 79.87%. EfficientNet-B0 remains within 0.22, 0.85, and 1.79 percentage points of the best result on the three datasets while using approximately 79% fewer parameters and 86% fewer GMACs than EfficientNetV2-S. It also appears on every evaluated accuracy and resource Pareto frontier, making it the most consistently competitive intermediate-budget option. MobileNetV3-Small has the lowest GMAC count, is the fastest model under both CPU thread settings, and records higher observed accuracy than MobileNetV4-Conv-S on all three datasets. Under random initialization, it leads MobileNetV4-Conv-S by 2.55, 1.76, and 0.99 points, with paired test-set intervals excluding zero for the fixed trained models. EfficientNet-B0 remains 3.29, 10.10, and 17.54 points below its pretrained counterpart after 100 epochs of scratch training, despite requiring about five times the recorded training time. SqueezeNet1.1 has the fewest parameters and lowest peak CUDA allocation, but substantially weaker accuracy. Latency rankings differ sharply between the L4 and CPU environments, showing that GMACs alone do not predict measured inference performance. Overall, newer designs provide selective rather than universal gains
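The abstract reports accuracy-versus-resource Pareto frontiers. The sketch below shows one common way to compute such a frontier, assuming accuracy is maximized and a single cost (for example GMACs or latency) is minimized. The model names and numbers are placeholders rather than results from the study, and the function name is hypothetical.

```python
def pareto_frontier(models):
    """Return names of models not dominated by any other model.

    `models` maps a name to (accuracy, cost). Model X dominates model Y when
    X is at least as accurate and at most as costly, and strictly better in
    at least one of the two.
    """
    frontier = []
    for name, (acc, cost) in models.items():
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for other, (a, c) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Placeholder (accuracy, cost) pairs for four hypothetical models.
candidates = {"A": (0.90, 10.0), "B": (0.80, 5.0), "C": (0.70, 8.0), "D": (0.90, 12.0)}
print(pareto_frontier(candidates))
```

A model on the frontier is not necessarily the best choice; it only means no other candidate is better on both axes at once.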

Open-Weather Robust 3D Detection via Dual-Critic Diffusion Alignment

1/10

Shuyao Li, Chuanxing Geng, Heyang Sun, Qiang Zhou, Jingjing Gu

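Personalized recommendation rationale:

This paper focuses on the robustness of 3D object detection for autonomous driving under adverse weather, which belongs to computer vision and autonomous driving. It is unrelated to the core techniques of recommender systems, search, or advertising, and does not involve general advances in LLMs or Transformer architectures.

2026-07-02 10:13:41 | arXiv:2607.01983v1 | cs.CV

Abstract: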

Robust 3D object detection under adverse weather remains a critical hurdle for autonomous driving. Despite progress with LiDAR-4D radar fusion, most methods are constrained by a closed-world assumption, implicitly requiring training and test weather to align in both type and severity. This premise fails in practice: the open-ended nature of weather, and even variations within a single type like rain, cause dramatically different LiDAR degradation patterns, leading to significant performance drops in unseen conditions. To address this, we present Dual-Critic Guided Diffusion Alignment (DCDA), a weather-agnostic framework that learns to recover degraded LiDAR features toward a clean manifold. Rather than modeling specific weather types, DCDA employs a 4D radar-conditioned diffusion process to progressively refine features, guided by two complementary critics. (i) A detection-guided critic, anchored by a pre-trained clean-weather model, ensures that the refined features retain object-level discriminability and localization accuracy. (ii) A weather adversarial critic enforces holistic distributional consistency with clean-weather representations. By aligning features through semantic and distributional constraints rather than explicit weather modeling, DCDA generalizes effectively to unseen weather types and severities without requiring paired data or weather labels. We further introduce a structured open-weather benchmark with held-out type-severity combinations and extensive experiments verify DCDA's advantages.

MolSight: A Graph-Aware Vision-Language Model for Unified Chemical Image Understanding

1/10

Wenda Wang, Yihan Tong, Yuwei Hu, Zhewei Wei

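Personalized recommendation rationale:

This paper focuses on image understanding in chemistry, a domain-specific application with no direct connection to the core techniques of recommender systems, search, or advertising. Even though it involves vision-language models, its application scenario does not generalize to heterogeneous data modeling.

2026-07-02 10:13:19 | arXiv:2607.01982v1 | cs.CV, cs.AI, q-bio.BM

Abstract: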

Using molecular large language models (LLMs) as a unified framework for understanding molecular structures and functions is emerging as a new trend in tasks such as molecular design and drug discovery. However, these models struggle to fully capture the visual representation of molecular structures, limiting their potential. While existing molecular vision-language models (VLMs) show promise, they still face challenges in structural alignment and lack the necessary topological modeling for accurate molecular understanding. To address this, we propose MolSight, a graph-aware vision-language model framework designed to enhance the understanding of molecular images by VLMs. MolSight integrates a Molecular Topology Module to inject chemical-bond adjacency information into vision tokens, and a Molecular Grounding Module to align visual features with chemical symbolic semantics. Our experiments demonstrate that MolSight significantly outperforms existing VLMs, molecular LLMs, and specialized tools across multiple chemical visual understanding tasks, achieving a new level of molecular image reasoning.

Assessing VLM Reliability for Medical Image Quality Evaluation Under Corruption and Bias

1/10

Sofiane Ouaari, Kevin Vorwalder, Nico Pfeifer

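Personalized recommendation rationale:

This paper focuses on medical image quality assessment, a medical-domain application unrelated to my core areas (recommender systems, search, advertising), and it mentions no potential connection to them.

2026-07-02 10:08:11 | arXiv:2607.01973v1 | cs.CV, cs.AI, cs.LG

Abstract: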

Vision-Language Models (VLMs) are increasingly applied in medical tasks such as pathology description, report generation, and visual question answering. Medical Image Quality Assessment (MIQA) supports diagnostic accuracy and patient safety by determining whether images meet the standards required for clinical decision-making. Automating MIQA with VLMs may reduce workload, but their behavior under real-world conditions, where images may be degraded or textual context may affect judgments, should be further explored before deployment. We benchmark VLMs on medical image quality using the MediMeta-C dataset zero-shot across seven corruption types and five severity levels. We evaluate sensitivity to degradation patterns, the effect of corruptions on embedding geometry, and whether textual attributes (demographics, expertise, infrastructure, institution) alter scores. Across 16 VLMs and seven modalities, pixelation produced the largest score reductions (mean -20.58%, up to -34.4% for OCT), whereas brightness had limited effect (-0.81%). Embedding displacement was associated with score changes. Same-family models showed correlations of 0.67-0.83; some produced increases up to +31% for corrupted mammography. Textual attributes affected scores: institutional prestige raised them +17.15%, and equipment age lowered them -14.7%. The largest changes were +95.62% (InternVL-8B) and -37.7% (MedGemma). Current VLMs show limitations for medical image quality assessment. Pixelation, a privacy-preserving transformation, reduces performance, indicating a trade-off between patient privacy and reliability. Sensitivity to contextual metadata indicates limited objectivity and marks metadata as a privacy and bias source. Privacy protection and objective quality assessment are related requirements for use.

Personalized 4D Whole-Heart Mesh Reconstruction from Cine MRI via Multi-Scale Temporal Modeling and Differentiable Contour Rendering

1/10

Xiaoyue Liu, Dongcheng Cang, Xiaohan Yuan, Mark YY Chan, Ching-Hui Sia, Lei Li

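Personalized recommendation rationale:

This paper focuses on 3D/4D mesh reconstruction from medical imaging (cardiac MRI), which belongs to medical image analysis, with no direct or potential application to recommender systems, search, or advertising. Under the exclusion criteria, medical domain-specific applications are treated as irrelevant.

2026-07-02 09:42:46 | arXiv:2607.01952v1 | cs.CV

Abstract: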

Accurate 4D whole-heart mesh reconstruction from sparse cine MRI is critical for creating cardiac digital twins, but remains challenging due to limited 2D slice coverage and the complex coupling between cardiac shape and motion. Existing methods often rely on intermediate contour fitting and typically reconstruct static, single-phase, or partial cardiac geometries, limiting their ability to capture full-chamber dynamics. We propose a novel end-to-end framework for reconstructing temporally resolved whole-heart meshes from multi-view 2D cine MRI sequences by learning an image-to-mesh mapping. The framework incorporates a differentiable contour renderer inspired by the Beer-Lambert attenuation principle, enabling anatomy-aware supervision of 3D+t mesh deformation through contour-based projection losses. To improve temporal consistency across the cardiac cycle, we further introduce a multi-scale temporal modeling module that integrates global cycle-level dynamics with local inter-frame coherence to generate smooth and physiologically plausible mesh trajectories. The proposed method achieved a whole-heart mean absolute error of 1.68 $\pm$ 0.31 mm and a motion jitter of 0.77 $\pm$ 0.17 $\mathrm{mm}/\mathrm{frame}^{3}$, outperforming existing methods with lower reconstruction error and substantially improved motion smoothness. It also improved 2D contour alignment across multiple cine MRI views and supported downstream proof-of-concept electrophysiological simulation. The code will be released publicly upon acceptance of the manuscript for publication.

LiZAD：一种面向工业制造的轻量级零样本异常检测框架

1/10

LiZAD: A Lightweight Zero-Shot Anomaly Detection Framework for Industrial Manufacturing

Uzair Khan, Luigi Capogrosso, Muhammad Aqeel, Francesco Setti, Michele Magno, Ma...

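Personalized recommendation rationale:

This paper focuses on zero-shot anomaly detection for industrial manufacturing, a specific application of computer vision and industrial inspection. It has no direct connection to search, recommendation, or advertising, and does not involve general-purpose LLM or Transformer architecture techniques. Its application scenario and problem setting both fall outside my areas of interest.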

2026-07-02 09:40:42 | arXiv:2607.01949v1 |

cs.CV


In modern high-throughput industrial production lines, product configurations and visual characteristics frequently change, making it impractical to collect and annotate data for every new scenario. This dynamic setting makes Zero-Shot Anomaly Detection (ZSAD) particularly suitable, as it enables defect detection without requiring training on target-specific samples. Although recent ZSAD approaches show promising results, they are computationally intensive and thus unsuitable for deployment on resource-constrained devices. We propose LiZAD: a lightweight framework designed for real-time ZSAD specifically tailored for use on edge devices. The proposed approach pairs the dense and spatially aware visual features of DINOv3, crucial for precise pixel-level localization, with the highly computationally efficient text embeddings of MobileCLIP2. These features are then mapped into a shared latent space via low-memory trainable projection heads. Compared to six state-of-the-art ZSAD models, LiZAD achieves an average memory reduction of 61.5%, a parameter reduction of 74.6%, and a speedup of 3.02x in terms of latency. Despite substantial reductions in computational and memory costs, our approach maintains competitive anomaly detection performance, dropping the average P-AUROC by just 6.4% relative to the best state-of-the-art model across the VisA, BTAD, MPDD, and MVTec-AD datasets. Finally, it is successfully deployed on the NVIDIA Jetson NX and Jetson AGX edge devices and tested on the real production line of the Industrial Computer Engineering Laboratory (ICE Lab) at the University of Verona. The code is available at https://github.com/intelligolabs/LiZAD.

面向带宽高效的协作式3D语义占据预测的稀疏感知向量量化

1/10

Sparse-Aware Vector Quantization for Bandwidth-Efficient Collaborative 3D Semantic Occupancy Prediction

Feng Li, Chaokun Zhang, Gong Chen

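Personalized recommendation rationale:

This paper focuses on 3D semantic occupancy prediction and collaborative perception, which belong to autonomous driving or robotics and are unrelated to the core topics of recommender systems, search, or advertising. Although it involves quantization techniques, it offers no direct application to or insight for recommendation, search, or advertising scenarios.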

2026-07-02 09:23:23 | arXiv:2607.01928v1 |

cs.CV


Collaborative perception extends single-agent perception by enabling multiple vehicles to exchange complementary perceptual information. However, it introduces an inherent trade-off between perception gain and communication overhead, which is particularly severe for 3D semantic occupancy prediction that relies on fine-grained spatial structures. Existing methods typically compress 3D features into 2D, causing severe spatial information loss, or transmit dense 3D representations, hindering real-world deployment. To overcome these limitations, we propose a bandwidth-efficient collaborative Vector Quantization Semantic Occupancy Prediction (VQSOP) framework. VQSOP employs a Sparse-Aware Vector Quantization (SAVQ) mechanism that exploits 3D scene sparsity to compactly encode informative regions, drastically reducing communication overhead while preserving complete geometric context. Furthermore, to enhance structural consistency and feature continuity, we design a Dual-Branch Adaptive Spatial Refinement (ASR) module that dynamically fuses local high-frequency details with broad contextual semantics. Extensive experiments demonstrate that our approach achieves state-of-the-art performance while reducing communication volume by up to 82x.
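
The sparsity idea behind SAVQ can be illustrated with a toy sketch. The abstract does not describe the mechanism in detail, so everything below is an assumption made for illustration: the function names, the energy threshold used to treat a voxel as empty, and the two-entry codebook are hypothetical and are not the authors' design. The sketch only shows why sending (voxel index, code index) pairs for occupied voxels can be cheaper than sending dense features.

```python
def nearest_code(feature, codebook):
    """Return the index of the codebook entry closest to `feature` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((f - c) ** 2 for f, c in zip(feature, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def encode_sparse(voxel_features, codebook, occupancy_threshold=1e-6):
    """Quantize only voxels whose feature energy exceeds the (assumed) threshold."""
    message = []
    for voxel_index, feature in enumerate(voxel_features):
        energy = sum(f * f for f in feature)
        if energy <= occupancy_threshold:
            continue  # empty voxel: nothing is transmitted
        message.append((voxel_index, nearest_code(feature, codebook)))
    return message


def decode_sparse(message, codebook, num_voxels):
    """Rebuild a dense grid; voxels that were not transmitted stay at zero."""
    dim = len(codebook[0])
    voxels = [[0.0] * dim for _ in range(num_voxels)]
    for voxel_index, code_index in message:
        voxels[voxel_index] = list(codebook[code_index])
    return voxels


# Hypothetical toy data: six voxels, two of them occupied.
codebook = [[1.0, 0.0], [0.0, 1.0]]
voxel_features = [[0.0, 0.0], [0.9, 0.1], [0.0, 0.0],
                  [0.2, 0.8], [0.0, 0.0], [0.0, 0.0]]

message = encode_sparse(voxel_features, codebook)
reconstructed = decode_sparse(message, codebook, len(voxel_features))
```

In this toy case the sender transmits two integer pairs instead of twelve floats. A real system would learn the codebook and would likely use a more careful occupancy test.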

基于水下机器人的施工环境监测鲁棒图像处理技术

1/10

Robust Image Processing Techniques for Construction Environment Monitoring Using Underwater Robots

Seunghee Yun, Geonmo Yang, Juhui Lee, Changbeom Park, Jeahyung Choi, Younggun Ch...

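Personalized recommendation rationale:

The paper focuses on construction environment monitoring and image processing for underwater robots, which belong to civil engineering and robot vision, with no direct or indirect connection to the core areas of recommender systems, search, or advertising. It does not involve LLMs, Transformers, or any RecSys/Search/Ads techniques.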

2026-07-02 09:12:45 | arXiv:2607.01915v1 |

cs.CV, cs.RO


This paper proposes a robust image processing framework for underwater robot-based construction environment monitoring, targeting complex degradations observed in real marine environments. Unlike conventional approaches that mainly consider absorption and backscattering, real underwater imagery is strongly affected by depth-dependent forward scattering blur and particle-induced degradations such as marine snow. To address this, we introduce a staged processing pipeline that sequentially models background degradation via depth-aware forward scattering and foreground degradation using realistic marine snow patterns extracted from real images. The resulting synthetic data are used to retrain an existing Joint-ID network without modifying its architecture, enabling an isolated evaluation of dataset realism. In addition, a lightweight post-processing scheme is applied to enhance contrast and structural clarity. Experiments on real underwater datasets collected in Korean coastal environments demonstrate consistent improvements in visual quality and UIQM scores. The results indicate that explicitly modeling forward scattering and realistic particle effects effectively reduces the synthetic-to-real gap and improves practical applicability in real-world underwater robotic operations.

重新思考语义分割中的事后校准

1/10

Rethinking Post-Hoc Calibration in Semantic Segmentation

Tristan Kirscher, Kim-Celine Kahl, Balint Kovacs, Maximilian R. Rokuss, Klaus Ma...

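Personalized recommendation rationale:

This paper focuses on calibration in semantic segmentation, a computer vision topic with no direct connection to the core techniques of recommender systems, search, or advertising. Post-hoc calibration for semantic segmentation lacks a clear transfer path to calibration tasks in recommendation ranking, and the method is not Transformer- or LLM-related, so relevance is very low.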

2026-07-02 09:01:21 | arXiv:2607.01902v1 |

cs.CV, cs.LG


Reliable confidence estimates are essential in semantic segmentation, especially in safety-critical settings where overconfident errors can mislead downstream decisions. Yet modern segmentation models often remain miscalibrated. Post-hoc calibration offers a practical way to correct confidence estimates without retraining the segmentation model, but its use in dense prediction raises structural issues that are often overlooked. We study two such issues. First, adding a constant to all logits leaves the softmax probabilities unchanged, but several standard calibrators can still depend on this arbitrary offset. As a result, two logit representations encoding the same predictive distribution may yield different calibrated probabilities. We define translation-invariant (TI) calibrators as those whose outputs are unchanged under such shifts, characterize which common calibrators satisfy this property, and construct TI counterparts of shift-sensitive calibrators to isolate the effect of removing representation dependence. Second, post-hoc calibration is typically fitted by minimizing a likelihood-based objective, whereas segmentation models are trained with task-specific metrics such as Dice. This mismatch can cause calibration to alter class orderings and degrade the deployed segmentation map. We study decision-preserving calibration under argmax- and order-preservation constraints. Since enforcing these constraints collapses affine softmax calibrators to temperature scaling, we introduce class-conditional affine calibrators that can be made argmax- or order-preserving while retaining greater expressivity, allowing us to quantify the calibration-segmentation trade-off induced by decision preservation. 
Across natural-image and medical segmentation benchmarks, and under corruption-based covariate shift, matched comparisons show that TI variants generally improve calibration metrics, while decision-preserving variants prevent segmentation degradation and retain strong calibration performance. These results provide practical design principles for well-defined post-hoc calibration pipelines in semantic segmentation.
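
The translation-invariance issue described in the abstract can be checked numerically. The sketch below is a minimal illustration, not the paper's code: it uses temperature scaling and vector scaling as two familiar calibrators, and the mean-centering step used to build a shift-insensitive variant is an assumption chosen for simplicity, not necessarily the authors' construction. All weights, biases, and logits are made-up values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def temperature_scaling(logits, temperature):
    """Divide all logits by one scalar; a constant shift stays constant."""
    return softmax([z / temperature for z in logits])


def vector_scaling(logits, weights, biases):
    """Per-class affine map; a constant shift c becomes c * w_k, which varies by class."""
    return softmax([w * z + b for z, w, b in zip(logits, weights, biases)])


def centered_vector_scaling(logits, weights, biases):
    """Assumed TI variant: remove the mean logit before the per-class affine map."""
    mean = sum(logits) / len(logits)
    return vector_scaling([z - mean for z in logits], weights, biases)


def max_abs_diff(p, q):
    return max(abs(a - b) for a, b in zip(p, q))


# Two logit vectors that encode the same softmax distribution.
logits = [2.0, 0.5, -1.0]
shifted = [z + 10.0 for z in logits]
weights = [1.0, 0.5, 2.0]   # hypothetical calibrator parameters
biases = [0.0, 0.1, -0.2]

same_input = max_abs_diff(softmax(logits), softmax(shifted))
ts_gap = max_abs_diff(temperature_scaling(logits, 1.5),
                      temperature_scaling(shifted, 1.5))
vs_gap = max_abs_diff(vector_scaling(logits, weights, biases),
                      vector_scaling(shifted, weights, biases))
centered_gap = max_abs_diff(centered_vector_scaling(logits, weights, biases),
                            centered_vector_scaling(shifted, weights, biases))
```

Here `same_input` and `ts_gap` are zero up to rounding, while `vs_gap` is large: the two equivalent logit vectors receive different calibrated probabilities under plain vector scaling. Centering removes that dependence in this example.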

FoundDP：重新审视双像素深度估计中的弱视差可观测性

1/10

FoundDP: Revisiting Weak Disparity Observability in Dual-Pixel Depth Estimation

Fengchen He, Hao Xu, Dayang Zhao, Tingwei Quan, Shaoqun Zeng

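Personalized recommendation rationale:

This paper focuses on depth estimation in computer vision, specifically dual-pixel imaging and disparity observability. It has no clear connection to the core techniques of search, recommendation, or advertising (such as user modeling, sequence processing, or multimodal fusion); it is a purely visual topic and therefore irrelevant.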

2026-07-02 08:57:38 | arXiv:2607.01900v1 |

cs.CV


Dual-pixel (DP) imaging enables metric depth estimation from a single camera using sub-aperture disparity. However, the extremely small effective baseline limits disparity observability, leading to structural degradation and depth failure in textureless, low-contrast, or downsampled regions. Existing DP-based methods rely primarily on local disparity cues and therefore become unreliable when disparity signals are weak or ambiguous. To address this limitation, we propose FoundDP, a unified framework that integrates metric DP depth with global structural priors from a monocular depth foundation model. Our method preserves metric scale through DP-derived depth and leverages Vision Transformer (ViT) features to restore structural consistency in weak-disparity regions. To ensure reliable metric guidance under DP imaging conditions, we identify and mitigate ViT representation degradation induced by DP defocus blur via ViT feature alignment, enabling stable metric-guided depth estimation. Extensive experiments on synthetic and real-world DP benchmarks show that FoundDP delivers superior performance, with consistent gains in structural fidelity and metric accuracy, especially under reduced disparity observability. Code will be available at: https://github.com/EchoLighting/FoundDP

描述符：LYNRED移动性数据集多模态检测子集（LYNRED-MDS）

1/10

Descriptor: LYNRED Mobility Dataset Multimodal Detection Subset (LYNRED-MDS)

Loïc Arbez, Jessy Matias, Xavier Brenière, Jocelyn Chanussot, Ronald Phlypo

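Personalized recommendation rationale:

The title describes a specific dataset (LYNRED-MDS) focused on multimodal detection (likely sensor- or vision-related), with no direct connection to the core techniques of recommender systems, search, or advertising, or to LLM applications. It does not involve model architectures, algorithms, or applications in RecSys/Search/Ads, nor does it qualify as an enabling technology.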

2026-07-02 08:26:39 | arXiv:2607.01871v1 |

cs.CV


Current road safety systems primarily focus on minimizing post-collision damage. However, advances in algorithmic perception are shifting focus toward early collision prediction, especially in low-visibility conditions like nighttime or fog, where thermal infrared sensing outperforms both human vision and RGB imaging. While available RGB-infrared datasets such as FLIR ADAS and LLVIP are good benchmarks, they mostly consist of clear weather and overly simple scenarios. In this article, we introduce the LYNRED-MDS: Multimodal Detection Subset, a subset of the LYNRED Mobility Dataset, comprising 4,000 RGB-infrared image pairs captured under diverse weather, lighting, and road conditions around Grenoble, France. Our dataset spans varied driving contexts (urban, rural, mountainous, etc.) and a vehicle fleet compliant with Western European standards. Thermal cross-dataset evaluation using a YOLOv8n baseline suggests that our dataset offers strong generalization potential for pedestrian detection in driving scenarios. By covering critical edge cases, our dataset supports the development of more reliable and deployable vision systems for advanced driver-assistance systems.

QWERTY：通过查询扭曲视频扩散变换器实现无需训练的运动控制

1/10

QWERTY: Training-Free Motion Control via Query-Warped Video Diffusion Transformers

Kyobin Choo, Youngmin Kim, Hyunkyung Han, Geunrip Park, Chanyoung Kim, Sunyoung ...

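Personalized recommendation rationale:

This paper focuses on motion control in video diffusion models, which belongs to AIGC/video generation and has no direct connection to the core techniques of recommendation, search, or advertising. It does not involve LLMs, Transformer efficiency innovations, or methods with potential application to RecSys/Search/Ads.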

2026-07-02 08:25:38 | arXiv:2607.01869v1 |

cs.CV


Video diffusion transformers (DiTs) generate high-fidelity and temporally coherent videos, yet motion control remains implicit, primarily relying on text prompts. As a result, achieving desired motion often requires extensive prompt engineering and repeated resampling. While fine-tuning models with additional spatial prompts (e.g., bounding boxes or point trajectories) enables explicit control, it demands substantial data curation and computation, and may compromise the generative capabilities of pretrained models. Consequently, training-free motion control using such spatial prompts has been explored in U-Net-based video diffusion models, but remains largely unexplored for DiTs. We introduce QWERTY, a training-free framework that enables flexible motion control in pretrained image-to-video DiTs via user-defined object warping and optical flow. We carefully manipulate the 3D full attention of DiTs by warping the frame-invariant semantic subspace of queries. We find that the noise predicted by the query-warped DiT naturally guides the diffusion trajectory toward the desired motion, and further show that leveraging this noise as self-guidance for latent optimization improves control stability and visual quality. Experiments show that QWERTY achieves the most effective motion control among existing training-free approaches on a recent image-to-video DiT, with performance comparable to fine-tuning-based methods.

DL-SLAM: 基于双层概率的动态环境中高保真高斯溅射SLAM的实现

1/10

DL-SLAM: Enabling High-Fidelity Gaussian Splatting SLAM in Dynamic Environments based on Dual-Level Probability

Ziheng Xu, Qingfeng Li, Xuefeng Liu, Chen Chen, Jianwei Niu

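Personalized recommendation rationale:

This paper mainly addresses simultaneous localization and mapping (SLAM), in particular Gaussian splatting rendering in dynamic environments. It belongs to computer vision and robotics and is unrelated to the core techniques of recommender systems, search, or advertising (such as user modeling, retrieval, or ranking).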

2026-07-02 08:18:23 | arXiv:2607.01860v1 |

cs.RO, cs.CV


Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in dense dynamic Simultaneous Localization And Mapping (SLAM). Prevailing methods typically discard predefined dynamic objects, ignoring that transiently static objects offer valuable geometric constraints for pose estimation. A recent work attempts to leverage this potential by employing per-pixel uncertainty maps to quantify the magnitude of motion. While this approach enables transiently static objects to enhance pose estimation, it erroneously integrates these objects into the static map, resulting in persistent artifacts. Moreover, its reliance on purely geometric information leads to ambiguous object boundaries in the uncertainty maps. To overcome these limitations, we present DL-SLAM, a monocular Gaussian Splatting SLAM system built upon a novel dual-level probabilistic framework. Our method computes dynamic probability maps by combining semantic and geometric information. These pixel-level probabilities are lifted to 3D and aggregated to derive an object-level dynamic probability for each instance. Object-level probability enables the categorical pruning of dynamic Gaussians, resulting in an artifact-free static map. The static map, in turn, provides a geometrically consistent guidance to refine the pixel-wise probabilities, enhancing their reliability. Experimental results demonstrate that DL-SLAM outperforms existing approaches, improving tracking accuracy by up to 13% while generating high-fidelity semantic maps.
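
The step from pixel-level to object-level probabilities can be pictured with a small sketch. The abstract does not state how the lifted probabilities are aggregated or how pruning is decided, so the mean aggregation, the 0.5 threshold, and all names and values below are assumptions for illustration only.

```python
from collections import defaultdict


def object_level_probabilities(instance_ids, dynamic_probs):
    """Average the per-Gaussian dynamic probabilities within each instance (assumed rule)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for instance_id, prob in zip(instance_ids, dynamic_probs):
        sums[instance_id] += prob
        counts[instance_id] += 1
    return {instance_id: sums[instance_id] / counts[instance_id]
            for instance_id in sums}


def keep_static_gaussians(instance_ids, object_probs, threshold=0.5):
    """Drop every Gaussian of an instance whose object-level probability reaches the threshold."""
    return [index for index, instance_id in enumerate(instance_ids)
            if object_probs[instance_id] < threshold]


# Hypothetical data: six Gaussians belonging to three object instances.
instance_ids = [0, 0, 0, 1, 1, 2]
lifted_probs = [0.9, 0.8, 0.4, 0.1, 0.2, 0.6]

object_probs = object_level_probabilities(instance_ids, lifted_probs)
kept = keep_static_gaussians(instance_ids, object_probs)
```

Instance 0 averages about 0.7, so all three of its Gaussians are removed, including the one whose own probability is only 0.4. That whole-object decision is what distinguishes categorical pruning from thresholding each Gaussian separately.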

面向高效月球三维重建的几何基础模型蒸馏

1/10

Geometric Foundation Model Distillation for Efficient Lunar 3D Reconstruction

Clémentine Grethen, Florient Chouteau, Géraldine Morin, Simone Gasparini

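Personalized recommendation rationale:

This paper focuses on lunar 3D reconstruction, a specific space/planetary application unrelated to the core techniques of recommender systems, search, or advertising. It does not involve LLMs, Transformers, or any method transferable to RecSys/Search/Ads.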

2026-07-02 08:11:32 | arXiv:2607.01851v1 |

cs.CV


Large 3D foundation models such as MASt3R achieve state-of-the-art stereo reconstruction but are computationally demanding for deployment under strict hardware constraints -- a critical limitation in domains such as planetary exploration, where onboard computing is severely restricted. We study how far such models can be compressed through knowledge distillation, using lunar stereo reconstruction as a challenging and practically relevant case study. Starting from a 688M-parameter MASt3R teacher fine-tuned on lunar imagery, we distill its dense geometric predictions into a family of lightweight students spanning different encoder types (CNN vs ViT), decoder widths and depths, and training strategies. To bridge the dimensional mismatch between teacher and student, we propose a structured SVD-based initialization that projects the teacher's decoder weights into the student's smaller latent space, yielding a warm start that significantly improves convergence and final performance. Based on our results on lunar data, we can obtain a distilled student that retains most of the teacher's reconstruction accuracy while reducing the model size by up to 7 times, and even outperforms a baseline trained directly with sparse ground-truth annotations. Beyond compression, our study highlights both principles and practical insights for distilling geometric foundation models: a convolutional encoder underperforms transformer-based alternatives (though pretraining availability remains a confounding factor), preserving encoder capacity is more critical than maintaining a large decoder, feature-level distillation consistently outperforms output-only supervision, and SVD-based initialization improves optimisation stability. These findings provide practical guidelines for deploying 3D reconstruction models in resource-constrained environments.

重新思考水下显著目标检测的条件生成

1/10

Rethinking Conditional Generation for Underwater Salient Object Detection

Hua Li, Yongjie Weng, Yutong Li, Zhiyuan Li, Runmin Cong, Sam Kwong

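Personalized recommendation rationale:

This paper focuses on underwater salient object detection, a specific computer vision application with no direct connection to the core techniques of recommender systems, search, or advertising (such as user modeling, sequence modeling, or multimodal fusion). Although salient object detection relates to image understanding, the specifics of underwater scenes limit its potential for adoption in general recommendation or search scenarios.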

2026-07-02 07:48:06 | arXiv:2607.01825v1 |

cs.CV


Salient Object Detection in underwater images remains challenging due to low contrast, uneven illumination, and color distortion caused by scattering and absorption effects, which limit the effectiveness of conventional SOD methods in underwater environments. To address these challenges, we propose a Degradation-aware Conditional Generation Network (DCGNet), specifically designed to construct reliable conditional features for underwater saliency generation. First, we design a Dynamic Multi-Granularity module (DMG) grounded in the human visual system to robustly detect salient objects of varying scales with blurred boundaries. Then, we develop an Underwater Physics-Prior module (UPP), which utilizes pseudo-depth guidance to estimate underwater light attenuation and backscatter, thereby restoring degradation-aware RGB features and mitigating color distortion and boundary ambiguity. Based on the physics-guided representation, we introduce an Underwater Spatial Gaussian module (USG), which constructs a spatial Gaussian saliency prior from the strongest guided response to enhance object-centered salient regions and suppress cluttered underwater backgrounds. In addition, a lightweight timestep-adaptive Diffusion Transformer (DiT) bottleneck is inserted into the denoising decoder to refine fused features at different diffusion timesteps. Comprehensive experiments on USOD10K, USOD, CSOD10K, MAS3K, and RMAS demonstrate that DCGNet significantly outperforms existing state-of-the-art methods, verifying its potential for complex underwater visual applications.

SpaceEra++：面向视频中3D空间推理的统一框架

1/10

SpaceEra++: A Unified Framework Towards 3D Spatial Reasoning in Video

Weili Guan, Haoyu Zhang, Meng Liu, Qianlong Xiang, Yaowei Wang, Liqiang Nie

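Personalized recommendation rationale:

This paper focuses on 3D spatial reasoning in video, a computer vision topic with no direct connection to the core techniques of recommender systems, search, or advertising. Its value as VLM-derived inspiration is likely limited, because it targets 3D space specifically rather than heterogeneous data fusion and does not mention any application potential for recommendation, search, or advertising.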

2026-07-02 06:56:29 | arXiv:2607.01784v1 |

cs.CV


Visual-spatial understanding, defined as the ability to infer object relationships and scene layouts from visual inputs, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, pre-trained vision-language models (VLMs) remain constrained by spatial uncertainty stemming from inherently 2D observations and by the scarcity of data for 3D spatial understanding. To address these limitations, we previously proposed a novel framework, SpaceEra, in a NeurIPS 2025 Spotlight paper. Although it achieved significant performance gains, we further observed that its effectiveness is hindered by insufficient input from scanning videos and weak reasoning constraints. To tackle these newly emerged challenges, we extend the original framework into a comprehensive system, termed SpaceEra++, which spans data construction, model design, training optimization, and prompting inference. Specifically, to alleviate input insufficiency, we introduce ScenePick, a frame sampling strategy that balances spatial coverage with object semantics to produce compact yet comprehensive scene representations. In addition, to enhance spatial reasoning, we develop SpaceAlign, which enforces pairwise object constraints by jointly exploiting absolute coordinates and relative spatial relations, thereby aligning optimization with spatial accuracy. Extensive experiments across multiple benchmarks demonstrate consistent improvements over strong baselines, while ablation studies validate both the individual and joint contributions of each component, and further analyses provide guidance for future research.

基于大语言模型的自动驾驶多模态融合框架：语义增强与通道自适应设计

1/10

LLM-Empowered Multimodal Fusion Framework for Autonomous Driving: Semantic Enhancement and Channel-Adaptive Design

Wen Wang, Yaping Sun, Yejun He, Hao Chen, Zhiyong Chen, Xiaodong Xu, Nan Ma, Shu...

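Personalized recommendation rationale:

The title explicitly targets autonomous driving, a domain-specific application with no direct or indirect connection to search, recommendation, or advertising. Although it involves large language models and multimodal fusion, it lacks application potential in areas such as recommender systems.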

2026-07-02 06:44:02 | arXiv:2607.01772v1 |

cs.CV, eess.SP


Vision-radar fusion is central to robust autonomous driving, combining dense visual semantics with precise range and velocity measurements from radar. However, real-world fusion quality is fundamentally challenged by dynamically varying input quality, stemming from occlusion, adverse weather, and channel noise. To address this, we re-frame the problem from static data fusion to channel-aware semantic reasoning and propose a Large Language Model-centric Semantic-layer Channel-aware Integrated Perception (LM-SCIP) framework. It places a Large Language Model (LLM) as a central reasoning core to fuse a local visual stream with a quality-varying external radar stream used to cover perception-blind spots. Concretely, LM-SCIP couples a hierarchical radar-vision encoder with a Channel-Adaptive Semantic Module (CASM) that maps link indicators into a "Channel Prompt" to dynamically gate external radar features. A parameter-efficient, LoRA-tuned LLM, in conjunction with a heterogeneous Mixture-of-Experts (H-MoE), then arbitrates between local visual cues and the channel-conditioned radar context. Finally, a decoupled multi-task decoder outputs localization, trajectory forecasting, and image reconstruction. Experiments on nuScenes and VIRAT validate our approach. On nuScenes, under a controlled toggle of radar input, LM-SCIP reduces localization RMSE by 40.0% versus a vision-only baseline. On VIRAT, the model attains a 0.214m localization RMSE and 0.179m minFDE (k=1). These results reveal that the proposed LM-SCIP enables a robust vision-dominant fallback at low SNR and synergistic fusion at high SNR.

联合HOI：联合生成接触图增强手-物交互生成

1/10

JointHOI: Jointly Generating Contact Maps Enhances Hand Object Interaction Generation

Mingyeong Song, Jungbin Cho, Jisoo Kim, Ananya Bal, Kartik Sharma, Youngjae Yu, ...

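Personalized recommendation rationale:

This paper addresses hand-object interaction generation, which belongs to computer vision and graphics and has no direct or potential connection to recommender systems, search, or advertising. It matches none of the focus areas, hence the very low score.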

2026-07-02 06:34:37 | arXiv:2607.01768v1 |

cs.CV


Text-driven hand-object interaction (HOI) generation is gaining attention for immersive applications and robotics, yet producing physically plausible interactions remains challenging. Even when individual motions appear natural, small contact errors can cause conspicuous artifacts such as floating and interpenetration. Prior methods mitigate these issues using explicit contact cues or implicit grasp priors, but typically rely on multi-stage pipelines and fail to model temporally evolving contact. We present JointHOI, a single-stage diffusion framework that jointly generates 3D hand-object motion and dynamic, distance-based contact maps from text. By treating contact as an auxiliary inner modality, joint generation enables the model to learn contact-motion coupling during training. At inference, contact-guided sampling enforces consistency between generated contact maps and motion-implied geometry, improving temporal stability and reducing penetration and floating. Experiments on GRAB and ARCTIC demonstrate consistent improvements in text adherence and physical plausibility over prior methods.

DL-VINS-Factory：用于视觉-惯性SLAM的学习型视觉前端的模块化框架

1/10

DL-VINS-Factory: A Modular Framework for Learned Visual Front-Ends in Visual-Inertial SLAM

Shoon Kit Lim, Melissa Jia Ying Chong, Ting Yang Ling

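Personalized recommendation rationale:

This paper focuses on SLAM (simultaneous localization and mapping), a robotics topic with no direct connection to the core concerns of search, recommendation, or advertising systems. Although the visual front-end techniques may involve deep learning, their application scenario and problem setting have no clear link to heterogeneous data modeling or large language model applications in recommendation, search, or advertising.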

2026-07-02 06:17:33 | arXiv:2607.01757v1 |

cs.CV, cs.RO


Deep-learning features excel in visual matching, yet their practical value in tightly coupled visual-inertial SLAM (VI-SLAM) remains insufficiently characterized. We present DL-VINS-Factory, a unified framework that integrates learned feature extractors (ALIKED, RaCo, SuperPoint, XFeat) with either Lucas-Kanade (LK) optical-flow tracking or LightGlue (LG) descriptor matching. All front-ends share a sliding-window Ceres back-end, with optional AnyLoc DINOv2-VLAD loop closure and 4-DoF pose-graph optimization. We benchmark the system across four datasets covering indoor, unstructured outdoor, aggressive-motion, and visually degraded conditions. Results show that learned front-ends are viable for real-time embedded VI-SLAM, but are not universally superior to classical tracking. Relative to the corresponding GFTT+LK baseline, ALIKED+LG reduces EuRoC ATE by 5% in monocular odometry and by 7% in stereo with loop closure. On NTU-VIRAL, where aggressive aerial motion increases inter-frame viewpoint change, ALIKED+LG stereo reduces loop-closed ATE by 12%. On the Botanic Garden dataset, optical-flow tracking remains preferable, but learned keypoints still improve over the GFTT baseline: SuperPoint+LK reduces grayscale camera ATE by 29%, while RaCo+LK reduces RGB camera ATE by 38%. On SubT-MRS, learned front-ends show varying degrees of improvement from case to case. With TensorRT acceleration on a Jetson AGX Orin, all valid configurations run in real time at 29-47 FPS in monocular mode and 18-33 FPS in stereo mode on the EuRoC and NTU-VIRAL datasets. AnyLoc further confirms roughly 2-7× more valid loops than BRIEF+DBoW2. The implementation is open-sourced at https://github.com/limshoonkit/DL-VINS-Factory-ROS2/.

ProSAC-CT：渐进式频谱-解剖共引导的多阶段扩散模型用于低剂量CT去噪

1/10

ProSAC-CT: Progressive Spectral-Anatomical Co-Guided Multi-Stage Diffusion Model for Low-Dose CT Denoising

Xuepeng Liu, Zetong Liu, Renyiming Li, Yan Li, Ruiyu Li, Ruili Li, Jiayi Ding, E...

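Personalized recommendation rationale:

This paper focuses on medical image processing (low-dose CT denoising), a domain-specific medical application with no direct or indirect connection to recommender systems, search, or advertising. Although it involves diffusion models, it does not indicate any potential application to recommendation, search, or advertising scenarios, so relevance is very low.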

2026-07-02 06:13:43 | arXiv:2607.01756v1 |

cs.CV


Low-dose computed tomography (LDCT) reduces radiation exposure but introduces stronger quantum noise, streak artifacts, and local texture degradation, which can obscure anatomical boundaries and weaken low-contrast structures. Diffusion models are promising for LDCT denoising by progressively recovering normal-dose CT (NDCT) images from degraded LDCT inputs, but existing methods often suffer from insufficient anatomical guidance, uncertain frequency-dependent recovery, and uniform reverse-process modeling. We propose ProSAC-CT, a progressive spectral-anatomical co-guided multi-stage diffusion model for image-domain LDCT denoising. ProSAC-CT integrates an anatomical-prior-guided conditioning (APGC) module, a residual frequency-domain decoupling stage (RFDDS), and a time-step-decoupling denoising decoder (TD3). APGC extracts LDCT-derived structural guidance, RFDDS enhances frequency-aware representations, and TD3 assigns them to different reverse-diffusion stages for anatomical stabilization, boundary refinement, and fine-detail recovery. Experiments on four LDCT degradation benchmarks show that ProSAC-CT improves image fidelity, structural similarity, perceptual quality, and information preservation over representative methods while better preserving boundary-sensitive anatomical details. Downstream anatomical-region classification on Mayo-2020 further indicates that ProSAC-CT retains task-relevant anatomical information, supporting its practical use for low-dose CT denoising.

MedStreamBench：面向流式与主动式医疗视频理解的时间感知基准

1/10

MedStreamBench: A Time-Aware Benchmark for Streaming and Proactive Medical Video Understanding

Yuan Wang, Shujian Gao, Songtao Jiang, Zhengyu Hu, Zuozhu Liu

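Personalized recommendation rationale:

This paper focuses on medical video understanding, a domain-specific application with no direct connection to the core tasks of search, recommendation, or advertising systems. Although streaming techniques could inspire real-time updates in recommender systems, the topic is too narrowly medical to fall within the scope of interest.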

2026-07-02 06:07:44 | arXiv:2607.01751v1 |

cs.CV, cs.AI


Existing medical video benchmarks primarily evaluate whether a model produces the correct answer, but rarely assess whether it answers at the right time. In real clinical settings, AI systems must decide not only what to predict, but also when to answer, defer judgment, or proactively raise alerts. This creates a critical gap between benchmark evaluation and deployment requirements. We present MedStreamBench, a benchmark for time-aware medical video understanding. MedStreamBench integrates 22 medical datasets and 5,419 QA instances across four temporal settings: retrospective, present, future, and proactive. Unlike conventional benchmarks that assume full-video access, MedStreamBench restricts models to temporally bounded evidence windows and supports both single-turn and streaming evaluation. We further introduce a proactive monitoring setting that requires models to determine whether and when clinically relevant alerts should be triggered. Beyond answer correctness, MedStreamBench evaluates temporal behavior through responsiveness and post-evidence stability. Experiments on leading general-purpose and medical vision-language models reveal a substantial gap between offline recognition and temporally grounded decision-making, with performance dropping markedly in streaming and proactive settings. Our benchmark is available at https://huggingface.co/datasets/Venn2024/MedStreamBench.

RTE-FM-Dehazer：受辐射传输方程启发的流匹配用于真实图像去雾

1/10

RTE-FM-Dehazer: Radiative Transfer Equation Inspired Flow Matching for Real-World Image Dehazing

Chenfeng Wei, Chun Wang, Boyang Zhao, Si Zuo, Shenhong Wang, Chenguang Yang

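Personalized recommendation rationale:

This paper focuses on image dehazing, which belongs to computer vision and image processing and has no direct relation to the core tasks of recommender systems, search, or advertising. Although vision techniques may indirectly affect multimedia recommendation, the topic is too specialized and lacks clear application potential.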

2026-07-02 06:06:07 | arXiv:2607.01748v1 |

cs.CV


Single-image dehazing aims to recover a clear scene from a hazy image and is generally formulated as an image-to-image translation task; however, it faces two limitations. Its performance depends heavily on the haze-formation priors embedded in the model. Prevailing methods adopt the Atmospheric Scattering Model (ASM), whose assumptions of single scattering and homogeneous media are often violated, leading to residual haze and color drift. Moreover, large-scale real hazy/clear pairs are impractical to collect, and existing synthesis approaches fail to reproduce the full complexity of natural haze. To address these issues, we present RTE-FM-Dehazer, a novel dehazing approach, together with a scalable data pipeline. Unlike the ASM, the Radiative Transfer Equation (RTE) jointly accounts for both scattering and absorption, naturally accommodating the non-homogeneous, multiple-scattering media that characterize real hazy scenes. Motivated by the structural similarity between the RTE diffusion-absorption term and the ODE in flow matching, we introduce a diffusion-absorption regularizer derived from a reduced RTE, to steer the flow matching trajectory at each step. Next, leveraging modern vision-language models, we build an automated pipeline and release P-HAZE, a dataset of 50000 realistic hazy/clear pairs. Extensive evaluations demonstrate that RTE-FM-Dehazer, trained solely on P-HAZE, effectively eliminates artifacts like residual haze and color drift, exhibits strong cross-domain generalization, and achieves leading results on five real-world dehazing benchmarks.

InterCMDM: A Block-Causal Diffusion Model for Autoregressive Human Interaction Generation

1/10

InterCMDM: Block-Causal Diffusion for Autoregressive Human Interaction Generation

Qing Yu, Kent Fujiwara

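Personalized recommendation rationale:

This paper focuses on human interaction generation, which belongs to computer graphics and animation and is unrelated to the core technologies of recommender systems, search, or advertising. Diffusion models are widely used for generative tasks, but the paper's specific setting (human motion generation) is outside my areas of interest.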

2026-07-02 05:58:15 | arXiv:2607.01743v1 |

cs.CV

Text-conditioned human interaction generation must capture both long-range temporal causality within each individual and tightly coupled coordination between partners. Existing interaction diffusion models typically denoise full sequences using bidirectional attention, which obscures causality and hinders streaming and long-horizon generation. Autoregressive alternatives enforce causality but often suffer from temporal drift, leading to coordination degradation and unstable interaction dynamics over time. We propose InterCMDM, a block-causal latent diffusion framework for autoregressive two-person interaction generation. InterCMDM introduces a Dual-Stream Causal Diffusion Transformer that maintains separate causal streams for each person while modeling inter-person dependencies via unified dual-stream attention with multi-task attention masks. These masks unify interaction modeling within a single attention mechanism and support diverse coordination behaviors, including simultaneous actions, reactive responses, leader-follower dynamics, and independent motion. By training a single model across these mask configurations as a form of data augmentation, InterCMDM enables controllable interaction generation by simply selecting the desired attention mask at inference time. Finally, a block-wise diffusion objective enables stable latent rollout over long sequences without repeated decode-encode cycles. InterCMDM achieves state-of-the-art performance on InterHuman and Inter-X, improving text-motion alignment, realism, and long-horizon continuity.

Quantum-Inspired Vision: Leveraging Wave-Particle Duality for Low-Illumination Enhancement

1/10

Quantum-Inspired Vision: Leveraging Wave-Particle Duality for Low-Illumination Enhancement

Yiquan Gao

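Personalized recommendation rationale:

This paper focuses on low-illumination image enhancement in computer vision. It is purely a vision topic with no direct link to core technologies in recommender systems, search, or advertising (such as ranking, modeling, or user understanding). There is no evidence that the technique applies to LLMs, Transformer architectures, or heterogeneous data modeling.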

2026-07-02 05:38:55 | arXiv:2607.01731v1 |

eess.IV cs.CV cs.LG math.OC quant-ph

This study provides a theoretical expansion of the recent Data Relativistic Uncertainty (DRU) framework by formalizing a physics-to-AI paradigm for image enhancement. By modeling images as probabilistic wave functions rather than deterministic states, the paradigm explicitly integrates wave-particle duality to illustrate the system flow of how DRU leverages the intrinsic physical uncertainty of light, a dimension requiring further theoretical discussion. Consequently, this paradigm provides a rigorous Explainable AI (XAI) approach that enhances the interpretability of how DRU mitigates illumination bias and maintains robustness against data noise.

Structure-Aware Gaussian Splatting for Large-Scale Scene Reconstruction

1/10

Structure-Aware Gaussian Splatting for Large-Scale Scene Reconstruction

Weiyi Xue, Fan Lu, Chi Zhang, Tianhang Wang, Sanqing Qu, Zehan Zheng, Boyuan Zhe...

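Personalized recommendation rationale:

This paper belongs to computer graphics and 3D reconstruction and focuses on large-scale scene reconstruction. Its content has no direct link to core technologies in recommender systems, search, or advertising (such as ranking, retrieval, or user modeling), and it does not involve optimizing or applying LLMs or Transformer architectures. Relevance is therefore very low.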

2026-07-02 04:41:51 | arXiv:2607.01698v1 |

cs.CV

3D Gaussian Splatting has demonstrated remarkable potential in novel view synthesis. In contrast to small-scale scenes, large-scale scenes inevitably contain sparsely observed regions with excessively sparse initial points. In this case, supervising Gaussians initialized from low-frequency sparse points with high-frequency images often induces uncontrolled densification and redundant primitives, degrading both efficiency and quality. Intuitively, this issue can be mitigated with scheduling strategies, which can be categorized into two paradigms: modulating target signal frequency via densification and modulating sampling frequency via image resolution. However, previous scheduling strategies are primarily hardcoded, failing to perceive the convergence behavior of scene frequency. To address this, we reframe the scene reconstruction problem from the perspective of signal structure recovery and propose SIG, a novel scheduler that synchronizes image supervision with Gaussian frequencies. Specifically, we derive the average sampling frequency and bandwidth of 3D representations, and then regulate the training image resolution and the Gaussian densification process based on scene frequency convergence. Furthermore, we introduce Sphere-Constrained Gaussians, which leverage the spatial prior of initialized point clouds to control Gaussian optimization. Our framework enables frequency-consistent, geometry-aware, and floater-free training, achieving state-of-the-art performance by a substantial margin in both efficiency and rendering quality in large-scale scenes. The code is available at: https://github.com/weiyixue999/Signal_Structure_Aware_Gaussian

HistoSeg++: Delving Deeper into Biomarker Segmentation with Attention and Multiscale Feature Fusion

1/10

HistoSeg++: Delving deeper with attention and multiscale feature fusion for biomarker segmentation

Saad Wazir, Rao Faizan, Daeyoung Kim

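Personalized recommendation rationale:

This paper focuses on biomarker segmentation in medical image analysis, a medicine-specific application unrelated to recommender systems, search, or advertising. It does not involve LLMs, Transformers, or techniques relevant to RecSys/Search/Ads.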

2026-07-02 04:03:10 | arXiv:2607.01675v1 |

cs.CV

Segmentation of biomarkers in medical images is frequently viewed as a first step towards medical image analysis in any bioinformatics or biomedical application. Despite progress, existing methods still struggle to capture information at multiple scales and to perform upsampling effectively across different datasets. These shortcomings often result in suboptimal generalization capabilities. Recently, architectures belonging to the Nested-UNet family have excelled at capturing multiscale contextual information and upsampling it effectively. In this work, we propose a novel Nested-UNet architecture that effectively captures multi-scale contextual information. It includes inner and outer attention units to enhance focus during upsampling, along with channel-wise feature recalibration using squeeze-and-excitation modules, leading to improved segmentation performance. Additionally, the architecture integrates an edge-aware loss to emphasize boundary accuracy by assigning greater importance to edge regions. Tested extensively on three publicly available benchmark datasets, our method demonstrates generalization performance superior to existing Nested-UNet methods. Code: https://github.com/saadwazir/histosegplusplus

A Unified Panoramic-Gaussian Representation for Monocular 4D Scene Synthesis

1/10

Unified Panoramic-Gaussian Representation for Monocular 4D Scene Synthesis

Yuankun Yang, Yi Wei, Wenyang Zhou, Li Zhang

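Personalized recommendation rationale:

This paper focuses on 4D scene synthesis in computer vision and graphics, involving 3D Gaussian representations and image rendering, and has no direct or indirect technical link to recommender systems, search, or advertising. As a pure vision/graphics paper, it matches none of the focus topics.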

2026-07-02 03:43:06 | arXiv:2607.01663v1 |

cs.CV

4D scene synthesis from monocular videos has made significant progress in recent years. However, existing methods are typically constrained by view interpolation. As a result, they struggle to infer unseen regions beyond the observed views. In this paper, we reformulate the task as 4D scene synthesis with unseen regions, which extends beyond traditional interpolation settings. Camera-conditioned video generation enables unseen region synthesis by guiding generation along specified cameras. However, these methods lack explicit 3D priors and are optimized with random camera trajectories. This design leads to severe inconsistencies under large trajectory deviations. To address this limitation, we build a unified training and inference framework with panoramic trajectory guidance. While this design improves cross-view consistency, the panoramic representation alone fails to model dynamic content effectively. Object motion in panoramic space introduces scale and shape distortions. To address this, we propose PanoGaussian, a unified Panoramic-Gaussian representation that distills the panoramic representation into an explicit dynamic Gaussian representation to capture dynamic physical priors of the 4D scene. Experiments demonstrate that PanoGaussian achieves consistent 4D scene synthesis even under large viewpoint variations.

Plug-and-Play Volumetric Reconstruction for Compressive Sensing Light-Sheet Microscopy

1/10

Plug-and-Play Volumetric Reconstruction for Compressive Sensing Light-Sheet Microscopy

Jianqing Jia, Yi Gong, Xinyuan Zhang, Jichen Chai, Yichen Ding, Yifei Lou

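Personalized recommendation rationale:

This paper focuses on computational microscopy imaging, which belongs to biomedical imaging and is entirely unrelated to recommender systems, search, or advertising. The title does not mention any technique or application related to LLMs, Transformer architectures, or recommender systems.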

2026-07-02 03:26:30 | arXiv:2607.01654v1 |

cs.CV math.NA

We investigate volumetric reconstruction for compressive sensing light-sheet microscopy (CS-LSM), where fast volumetric imaging is achieved by encoding multiple axial planes into each camera exposure. To recover the underlying volume from highly multiplexed measurements, we propose a plug-and-play (PnP) framework that flexibly incorporates any user-specified denoiser into the reconstruction process. Building on a slice-based formulation, we further introduce an axial-coupled model that exploits correlations between adjacent slices to improve volumetric continuity. For efficient computation, we derive a Woodbury-based update for the data-consistency step in both the slice-based and axial-coupled formulations, and employ a Gauss-Seidel sweep for the denoising step in the axial-coupled model. Under a weakly convex regularization assumption, we establish subsequential convergence of the proposed algorithm. Experiments on synthetic and real zebrafish-heart data demonstrate that the proposed framework successfully recovers cellular structures from compressed measurements, and provide practical insights into the comparative performance of commonly used denoisers within the PnP framework under the CS-LSM setup.

Boosting Ultrasound Image Classification with an Attribute-Guided Dual-Branch Framework

1/10

Boosting Ultrasound Image Classification via Attribute-Guided Dual-Branch Framework

Bo Zhao, Yapeng Li, Juhua Liu, Bo Du

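Personalized recommendation rationale:

This paper focuses on medical ultrasound image classification, a medicine-specific application with no direct link to recommender systems, search, or advertising, and it does not involve core LLM or Transformer architecture technology.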

2026-07-02 03:20:12 | arXiv:2607.01648v1 |

cs.CV

Ultrasound image classification is essential for computer-aided diagnosis. However, current methods often neglect clinical priors, leading to poor generalization in challenging scenarios and a lack of interpretability that limits clinical adoption. To address these issues, we aim to develop a medical-prior module that can be seamlessly integrated into existing pipelines to enhance both diagnostic performance and interpretability. In this paper, we propose an attribute-guided dual-branch framework for ultrasound classification that introduces domain-agnostic medical attribute priors, improving generalization while offering interpretable evidence. Specifically, a baseline branch follows conventional architectures and predicts image categories via a fully connected classifier. An attribute-guided branch injects domain-agnostic attributes as priors and produces human-interpretable decision cues. Finally, an adaptive decision module fuses the two branches in a data-dependent manner to yield the final prediction. Experiments across diverse ultrasound classification tasks demonstrate that our approach can be integrated into multiple backbones and state-of-the-art methods with low overhead, consistently improving accuracy and interpretability. Code is available at: https://github.com/zhaobo253-crypto/AttrGuide.

Bridging 3D Gaussians and Semantic Occupancy: Comprehensive Open-Vocabulary Scene Understanding from Unposed Images

1/10

Bridging 3D Gaussians and Semantic Occupancy for Comprehensive Open-Vocabulary Scene Understanding from Unposed Images

Hu Zhu, Bohan Li, Xianda Guo, Yanlun Peng, Zheng Zhu, Xin Jin, Wenjun Zeng, Chan...

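Personalized recommendation rationale:

This paper focuses on 3D visual scene understanding, involving 3D Gaussians and semantic occupancy. It is purely a computer vision topic with no direct or indirect link to recommender systems, search, or advertising. The open-vocabulary concept is related to NLP, but the core contribution is 3D scene modeling, with no clear application direction.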

2026-07-02 03:00:23 | arXiv:2607.01633v1 |

cs.CV

Comprehensive 3D scene understanding from sparse, unposed images requires a model to recover renderable geometry, open-vocabulary semantics, and free/occupied 3D space without relying on external camera calibration. Recent feed-forward Gaussian methods improve pose-free reconstruction and semantic rendering, but their Gaussian primitives are mainly optimized through image-space objectives and remain weakly constrained in unobserved regions. We propose COVScene, a pose-free semantic Gaussian framework that couples renderable Gaussian primitives with a dense semantic occupancy field through differentiable volumetric lifting. Instead of converting Gaussians to voxels only at evaluation time, COVScene lifts the predicted semantic Gaussians inside the training computation graph, so volumetric regularization provides gradients to Gaussian opacity, geometry, and semantic features. The framework combines a semantic-aware Geometry Transformer, multi-task Gaussian decoding, geometric foundation distillation, and occupancy entropy regularization to support novel view synthesis, open-vocabulary semantic querying, and semantic occupancy prediction within a single representation. Experiments on ScanNet and ScanNet++ show that COVScene maintains competitive rendering quality, improves open-vocabulary segmentation, and achieves stronger semantic occupancy prediction than the self-supervised baseline without direct voxel-level supervision.

DRDN: A Decoupled Representation Dynamic Network for From-Scratch ViT Class-Incremental Learning

1/10

DRDN: Decoupled Representation Dynamic Network for From-Scratch ViT Class-Incremental Learning

Bingchen Huang, Yifu Chen, Zhiling Wang, Yuanchao Du

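Personalized recommendation rationale:

The paper mainly studies class-incremental learning in computer vision, which is unrelated to the core technologies of recommender systems, search, or advertising. Its decoupled representation and dynamic network do concern model architecture, but they show no direct application or potential applicability to recommendation, search, or advertising.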

2026-07-02 02:57:58 | arXiv:2607.01630v1 |

cs.CV

Dynamic expansion methods for class-incremental learning (CIL) protect task-specific knowledge by growing dedicated tokens or subnetworks, yet our analyses suggest that classification supervision alone does not sufficiently preserve task-agnostic shared backbone representations over long incremental sequences. We identify two intertwined challenges: cross-task confusion from sequential training on predominantly current-task data, which biases decision boundaries toward recent tasks; and under-optimized shared representations in the backbone that cap long-term discriminability as tasks accumulate. We propose the Decoupled Representation Dynamic Network (DRDN), which addresses these challenges via two orthogonal mechanisms. For shared backbone representations, DRDN continuously applies masked image modeling (MIM) at every incremental step, with reconstruction gradients routed exclusively through the backbone, encouraging it to retain general visual structure beyond class-discriminative cues. For task-specific discrimination, DRDN employs hierarchical task token expansion across all transformer layers, with a modified per-task attention rule that reduces inter-task interference. We support this design with accuracy degradation analysis and cross-task confusion rate measurements. In the from-scratch ViT CIL setting (no external pretraining), DRDN consistently improves over strong token-expansion baselines with comparable backbone scale. On CIFAR100-B0 (10 steps), DRDN achieves 77.19% average accuracy, outperforming DKT by 1.36 points and DyTox by 3.53 points, with an advantage that grows at longer incremental sequences. Multi-seed validation confirms stability (+/-0.31%). The MIM decoder is active only during training, adding no inference-time parameters or computation.

Online Segmentation of 3D Gaussians by Launching Virtual Drones

1/10

Online Segment 3D Gaussians via Launching Virtual Drones

Liwei Liao, Rongjie Wang, Ronggang Wang

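Personalized recommendation rationale:

This paper focuses on 3D vision and drone technology, which belong to computer vision and graphics and have no direct or indirect link to recommender systems, search, or advertising. The title mentions no technique or application that could benefit recommendation, search, or advertising, so the paper is entirely irrelevant.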

2026-07-02 02:51:00 | arXiv:2607.01628v1 |

cs.CV

Interactive segmentation of 3D Gaussians offers a compelling opportunity for real-time manipulation of 3D scenes, thanks to the real-time rendering capability of 3D Gaussian Splatting (3DGS). However, existing methods require a time-consuming per-scene setup - typically tens of seconds or even minutes - before interactive segmentation can begin on a raw 3DGS scene. This setup involves multi-view mask preparation, mask lifting, and feature distillation, creating a major bottleneck for online applications. To address this limitation, we aim to completely eliminate the setup stage for interactive 3DGS segmentation while keeping the segmentation time practical (under 1 second). In this work, we present SAGO (Segment Any Gaussians Online), a novel setup-free framework for interactive 3DGS segmentation. By introducing virtual drones, our method reframes the 3D segmentation problem as an online Next-Best-View (NBV) planning task formulated within a Markov process. Extensive experiments demonstrate that SAGO can extract clean 3D assets directly from 3D Gaussians with sub-second latency, thereby enabling a broad range of downstream applications such as object manipulation and scene editing. Moreover, our method achieves over a 50x speedup compared to the previous setup-free 3DGS segmentation frameworks.

Multi-Person 3D Human Mesh Tracking Across Video Shots

1/10

Multi-THuMBS: Multi-person Tracking of 3D Human Meshes Beyond Video Shots

Jeongwan On, Muhammad Salman Ali, Muneeb A. Khan, Sunwoo Park, Inwoong Moon, Hyu...

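Personalized recommendation rationale:

The paper focuses on person tracking and 3D modeling in computer vision, which is unrelated to the core technologies of recommender systems, search, or advertising, and it does not involve advances in or applications of LLMs or Transformer architectures.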

2026-07-02 02:48:43 | arXiv:2607.01626v1 |

cs.CV

Tracking multi-person 3D human meshes from in-the-wild videos is a highly challenging problem due to complex interactions, frequent occlusions, and severe truncation inherent in unconstrained environments. While recent approaches have improved robustness against these issues, they largely overlook the critical challenge prevalent in real-world footage: frequent shot changes. These abrupt transitions in camera viewpoints often cause existing methods to lose track of human identities and fail in reconstructing temporally coherent trajectories. Although several recent works have explored 3D human mesh tracking under shot changes, they are still limited to single-person scenarios, making them inadequate for real-world videos where multiple people interact and appear simultaneously. To address this limitation, we propose Multi-THuMBS (Multi-person Tracking of 3D Human Meshes Beyond Video Shots) that leverages a state-of-the-art 3D scene prior to reconstruct the two boundary frames in a single shared 3D space. Human meshes are then registered within the shared 3D space, maintaining per-person identity and motion consistency across shot changes. Extensive experiments demonstrate that our approach yields significant improvements in 3D human mesh recovery, camera pose estimation, and identity tracking, thereby ensuring high-fidelity motion reconstruction with consistent identity preservation across shots compared to previous state-of-the-art methods.

MVFusion-GS: High-Quality Dynamic Gaussian Splatting with Motion-Variance Guided Temporal Attention

1/10

MVFusion-GS: Motion-Variance Guided Temporal Attention for High-Quality Dynamic Gaussian Splatting

Jianwei Hu, Tingxuan Huang, Hengyu Zhou, Ningna Wang, Xiaohu Guo, Jinshan Lai, Bi...

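Personalized recommendation rationale:

This paper focuses on a 3D rendering technique for dynamic scenes (Gaussian splatting), which belongs to computer vision and graphics and is unrelated to the core technologies of recommender systems, search, or advertising. It neither involves Transformer architecture improvements or LLM applications nor proposes a general method for heterogeneous data modeling, so it does not match the current areas of interest.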

2026-07-02 01:21:11 | arXiv:2607.01578v1 |

cs.CV

3D Gaussian Splatting (3DGS) enables real-time novel view synthesis for static scenes. Extending it to dynamic scenes via deformation fields has recently attracted significant attention, particularly for dynamic scene reconstruction and distractor-free reconstruction. However, existing deformation networks lack explicit motion awareness: they neither capture long-term motion intensity nor exploit short-term temporal coherence, leading to inaccurate foreground deformation and pseudo-static residuals in the background. We present MVFusion-GS, a method that enhances deformation networks with two complementary motion-aware mechanisms. The Motion-Variance Guided Refinement aggregates per-Gaussian deformation statistics across time to estimate motion variance and uses it to guide dynamic-static separation during deformation prediction. The MotionFormer Temporal Attention module applies Transformer self-attention over neighboring timesteps to model local motion dependencies and improve temporal consistency. Extensive experiments on both dynamic scene reconstruction and distractor-free reconstruction benchmarks demonstrate state-of-the-art performance, showing that explicit motion awareness improves both foreground motion modeling and static background reconstruction.

Mind the Gap: Standard 3DGS Evaluation Primarily Measures Near-Trajectory Interpolation

1/10

Mind the Gap: Standard 3DGS Evaluation Primarily Measures Near-Trajectory Interpolation

Gaoxiang Jia, Vikram Appia

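Personalized recommendation rationale:

This paper focuses on evaluation methodology for 3D Gaussian Splatting (3DGS), which belongs to computer graphics and 3D vision and has no direct or indirect link to search, recommendation, or advertising systems. It is neither a core recommendation/search technique nor an LLM- or Transformer-related advance, and it shows no potential application in recommendation, search, or advertising.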

2026-07-02 00:24:15 | arXiv:2607.01556v1 |

cs.CV

Standard MipNeRF360-style 3D Gaussian Splatting (3DGS) evaluation holds out every N-th frame -- but these frames have trained neighbors on both sides, so the metric measures near-trajectory interpolation rather than spatial generalization. We introduce a fair matched-count protocol that isolates this effect: both arms train on the same number of images and differ only in whether the holdout is spread evenly (interpolation) or forms a contiguous spatial sector (extrapolation). Our primary finding is a large, consistent interpolation-extrapolation gap of 3 to 12 dB -- several times the differences typically reported between competing methods. The gap is robust to training noise, is in two cases large enough to flip a method ranking under multi-seed confirmation, and -- crucially -- persists across three representation families, including a non-Gaussian volumetric neural radiance field (NeRF), so it reflects spatial coverage rather than any one representation. Diagnostically, it is dominated by a diffuse/geometry-proxy component and tracks each view's angular distance to its nearest training view, a zero-cost signal that also guides capture planning; loss-side regularization yields only marginal gains. Standard holdouts remain useful for near-trajectory rendering but should not, alone, be read as evidence of spatial generalization. Prior work notes protocol sensitivity; ours is, to our knowledge, the first to combine matched-count paired holdout, cross-representation quantification, and a diagnostic analysis (Table 1). We describe a spatial-holdout benchmark toolkit with standardized splits and baselines for 16 scenes, which we are preparing for public release.
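
The two holdout arms described in the abstract can be sketched as index splits. This is a minimal illustration rather than the authors' toolkit: it assumes that frame order along the capture trajectory stands in for spatial position, so a contiguous block of indices approximates a "contiguous spatial sector", and the function names and parameters are hypothetical.

```python
def every_nth_holdout(num_frames, n):
    """Interpolation arm: hold out every n-th frame (indices 0, n, 2n, ...)."""
    test = [i for i in range(num_frames) if i % n == 0]
    train = [i for i in range(num_frames) if i % n != 0]
    return train, test


def contiguous_holdout(num_frames, num_test, start=0):
    """Extrapolation arm: hold out one contiguous block of num_test frames.

    The block wraps around the end of the sequence, which suits
    captures that orbit the scene (an assumption of this sketch).
    """
    test = sorted((start + k) % num_frames for k in range(num_test))
    held_out = set(test)
    train = [i for i in range(num_frames) if i not in held_out]
    return train, test


# Matched-count usage: both arms train on the same number of images.
train_a, test_a = every_nth_holdout(16, 8)
train_b, test_b = contiguous_holdout(16, len(test_a), start=4)
```

In the first arm every held-out frame has trained neighbors on both sides, while in the second arm the interior of the held-out block does not, which is the difference the matched-count protocol is designed to isolate.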

Boosting Infrared Small Target Detection with Logit-Domain Contrast and Adaptive Shape Refinement

1/10

Boosting Infrared Small Target Detection via Logit-Domain Contrast and Adaptive Shape Refinement

Handong Zeng, Zhengeng Yang, Shuai Zhang, Shikai Chen, Hongshan Yu

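Personalized recommendation rationale:

This paper addresses infrared small target detection, a specific detection task in computer vision with no direct or indirect link to my areas of interest (recommender systems, search, advertising, LLMs, and Transformer architectures). It neither involves enabling technologies such as LLMs or Transformers nor proposes a general method usable for multimodal or sequence modeling.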

2026-07-02 00:23:15 | arXiv:2607.01555v1 |

cs.CV

Infrared small target detection (IRSTD) remains challenging due to tiny target size, low signal-to-noise ratio, severe foreground-background imbalance, and blurred boundaries in complex scenes. Existing methods usually rely on post-activation probability-domain supervision for discrimination, where weak targets and strong clutter may produce saturated and close probabilities, limiting weak-target discrimination. Meanwhile, blurred boundaries and halo-like predictions mainly stem from thermal diffusion, tiny target scale, boundary uncertainty, and insufficient explicit contour constraints. To address these issues, we propose Adaptive-Contrastive SLSIoU (AC-SLSIoU), a plug-and-play discriminative and shape-aware loss for IRSTD. Specifically, a Logit-Domain Margin Constraint (LDMC) is introduced to enlarge the response gap between targets and informative hard negatives in the logit space, thereby enhancing weak-target discrimination. Adaptive Boundary Suppression (ABS) applies scale-aware annular penalties to refine target contours and suppress halo-like overflow responses. In addition, False-Alarm Focal Loss assigns larger weights to high-probability negative samples, further penalizing persistent high-confidence false alarms. Without introducing extra inference overhead, the proposed method can be seamlessly integrated into existing detectors and consistently improves both detection accuracy and shape quality. Extensive experiments and cross-backbone evaluations demonstrate the effectiveness, robustness, and generalization ability of the proposed method for infrared small target detection.

Hidden-Shot: One-Shot Task Generalization for Low-Level Vision Generalist Models

1/10

Hidden-Shot: Towards One-Shot Task Generalization for Low-Level Vision Generalist Models

Shao-Jun Xia, Xianzheng Ma, Zichong Meng

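Personalized recommendation rationale:

The paper focuses on generalist models for low-level vision tasks (such as image restoration and enhancement). It is purely a computer vision topic with no direct link to core problems such as ranking and matching in recommender systems, search, or advertising. Its method (one-shot generalization) and ideas do concern model generalization, but there is no clear transfer path to user modeling or content understanding. It matches none of my current areas of interest.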

2026-07-01 23:26:26 | arXiv:2607.01535v1 |

cs.CV

Despite the intense engagement surrounding low-level vision generalist models, their effectiveness in zero/few-shot scenarios beyond learned tasks remains unverified. The primary challenge of developing an ideal generalist lies in achieving the ability to generalize from new unseen tasks, which also can be assessed by matched quantitative criteria. Existing methods have made some progress in prompt engineering but have not systematically explored this gap across a wide range of low-level visual tasks. Stimulated by the problem, we propose Hidden-Shot, an implicit prompt mechanism aimed at exploring low-level task adaptation in a vision generalist model. Specifically, the method extracts implicit visual task-based information, utilizes a global task-aware textural prompt, and selectively merges implicit information with in-task processing information to enhance one-shot capabilities in new tasks. The overall design performs direct injection in a cost-effective manner, while minimally altering the architecture of the original generalist model. Additionally, we introduce a data-driven evaluation framework termed C/U assessment to cover two basic scenarios, 3C4U (3 conventional and 4 unconventional tasks) for retraining existing models and 3C7U (3 conventional and 7 unconventional tasks) for training from scratch, as a comprehensive assessment to systematically test the generalization ability of low-level generalist models. Experiments on seven and ten datasets outperform the state-of-the-art vision generalist model, respectively verified by 3C4U and 3C7U framework. Our presented Hidden-Shot approach demonstrates superior performance on one-shot new tasks while maintaining consistent performance on existing tasks.

Anti-Prompt: Image Protection Against Text-Guided Image-to-Video Generation

1/10

Anti-Prompt: Image Protection against Text-Guided Image-to-Video Generation

Yeonghwan Song, Chanhui Lee, Jinsoo Park, Jeany Son

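Personalized recommendation rationale:

This paper focuses on image protection, which belongs to the security/adversarial domain and is unrelated to the core technologies of recommender systems, search, or advertising. It involves generative models, but content protection is not a priority focus area.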

2026-07-01 22:00:18 | arXiv:2607.01499v1 |

cs.CV

Recent advances in Image-to-Video generation allow a single image to be animated into a convincing video under text guidance, raising serious copyright and privacy risks. We propose Anti-Prompt, an image protection approach that injects imperceptible perturbations into an image, inducing visible inconsistencies and structural failures in text-guided I2V generation. Our method is motivated by a simple empirical observation. When text guidance is removed from modern I2V models, generation quality degrades markedly, not only in motion realism but also in subject preservation, structural coherence, and temporal consistency. Building on this insight, Anti-Prompt exploits the model reliance on textual guidance by attenuating text-conditioned interactions during denoising while strengthening visual-only pathways. To further systematically evaluate protection effectiveness, we introduce a Video-LLM-assisted evaluation protocol that provides interpretable, frame-grounded analyses of generation artifacts and inconsistencies. Experiments on two representative I2V architectures demonstrate that our method achieves strong protection performance while improving efficiency and cross-model transferability.

A Cost-Aware Paired Protocol for Auditing Dynamic Tool Synthesis in Agentic Video Question Answering

1/10

A Cost-Aware, Paired Protocol for Auditing Dynamic Tool Synthesis in Agentic Video Question Answering

Aseel Mohamed, Rama AlHamidi, Mohamed Rayan Barhdadi, Rasul Khanbayov, Erchin Se...

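Personalized recommendation rationale:

This paper focuses on tool synthesis and an auditing protocol for video question answering, which belongs to the multimodal/NLP domain and does not involve core technologies or applications of recommendation, search, or advertising. The topic has no direct link to RecSys/Search/Ads and offers no transferable technical insight.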

2026-07-01 21:01:57 | arXiv:2607.01469v1 |

cs.CV

Agentic Video Question Answering (VideoQA) systems invoke tools during inference, but their tool libraries are fixed, so recurring procedures are rebuilt from primitives on every question. Synthesizing composite tools could remove this overhead, but whether such expansion helps is hard to assess: final-answer accuracy, the standard metric, ignores inference effort, so it cannot reveal how a system shifts cost. We propose a cost-aware, paired protocol for auditing tool-augmented video agents. The protocol pairs two complete systems on the same input for each question and reports their net difference across accuracy and cost jointly. For each question, it sorts the paired outcome into one of six groups defined by joint correctness and by the change in visible tool calls, separating accuracy-preserving efficiency gains from harmful regressions. Significance is reported with McNemar's test and paired bootstrap confidence intervals. We instantiate the protocol on Dynamic-SAGE, an agentic VideoQA framework that synthesizes, validates, and persistently registers executable composite tools for reuse on unseen questions, and evaluate it against the SAGE baseline on SAGE-Bench. The audit reveals a multi-axis profile that a scalar accuracy comparison would miss: Dynamic-SAGE improves accuracy by 7.5 points (p < 0.001) and reduces reasoning turns and visible tool calls by roughly 28%, while shifting rather than reducing inference cost, as token usage rises 34% and cost 26%. Gains are largest on visual and open-ended questions and neutral on verbal and multimodal ones, and residual failures concentrate on hard, open-ended questions where the pipeline does the most work. By measuring accuracy and cost jointly, the protocol shows where the pipeline-level difference is reliable and where it is not. The code is available at https://github.com/KurbanIntelligenceLab/Dynamic-SAGE.
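
The significance reporting named in the abstract (McNemar's test and paired bootstrap confidence intervals) can be sketched with the standard library alone. This is a generic illustration on per-question correctness flags, not the authors' released code: it uses the exact binomial form of McNemar's test and a percentile bootstrap, and the function names, resample count, and seed are assumptions. The six-group sorting by tool-call change is not reproduced because the abstract does not define the groups.

```python
import math
import random


def mcnemar_exact(baseline_correct, system_correct):
    """Two-sided exact McNemar test on paired boolean outcomes."""
    # Discordant pairs: b = only baseline correct, c = only system correct.
    b = sum(1 for x, y in zip(baseline_correct, system_correct) if x and not y)
    c = sum(1 for x, y in zip(baseline_correct, system_correct) if not x and y)
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def paired_bootstrap_ci(baseline_correct, system_correct,
                        num_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for the accuracy difference (system minus baseline).

    Questions are resampled as pairs, so both systems are always
    compared on the same resampled inputs.
    """
    rng = random.Random(seed)
    diffs = [int(y) - int(x) for x, y in zip(baseline_correct, system_correct)]
    n = len(diffs)
    stats = []
    for _ in range(num_resamples):
        stats.append(sum(diffs[rng.randrange(n)] for _ in range(n)) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * num_resamples)]
    upper = stats[int((1 - alpha / 2) * num_resamples) - 1]
    return lower, upper
```

Only the discordant questions, where exactly one system is correct, affect the McNemar p-value, which is why the test suits a paired design in which both systems see the same input.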

From Forgeries to Foundation Models: A Systematic Survey of Identity Document Attacks and Detection

1/10

From Forgeries to Foundation Models: A Systematic Survey of Identity Document Attack and Detection

Gourab Das, Pavan Kumar C, Raghavendra Ramachandra

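Personalized recommendation rationale:

This paper focuses on security attacks on identity documents and their detection, which belongs to the security and identity verification domain and is unrelated to the core technologies of recommender systems, search, or advertising. It does not involve applying LLMs or Transformers to search, recommendation, or advertising, so relevance is very low.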

2026-07-01 20:05:46 | arXiv:2607.01442v1 |

cs.CR cs.CV

Identity document forgery has undergone a fundamental capability shift: generative AI tools now enable high-fidelity document synthesis and field-level manipulation with minimal technical expertise, while detection methods remain constrained by benchmarks that do not reflect this threat. The resulting attack surface spans physical presentation, digital injection, and fully generative synthesis, introducing distinct forensic failure modes that require a unified threat model and evaluation framework. This survey provides, to our knowledge, the first unified treatment of Presentation Attacks, Digital Injection Attacks, and GenAI-driven synthesis within a single identity verification threat model. We trace detection methodologies from rule-based heuristics through forensic localisation, injection-aware pipelines, foundation models, and few-shot frameworks. A systematic audit of public datasets from 2019--2025 exposes a persistent Reality Gap between benchmark conditions and operational deployment. We further analyse large multimodal models for identity document manipulation, identifying Script-Dependent Generative Instability (SDGI) as a recurring typographic failure mode in non-Latin script inpainting. Finally, zero-shot benchmarking on unseen synthesised ID cards shows that even the strongest publicly available models achieve APCER values above 25% under security-oriented operating conditions, highlighting substantial limits in cross-domain generalisation. We conclude by outlining future directions toward forensically grounded, privacy-preserving, and legally accountable identity verification systems.
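
The APCER figure quoted in the abstract is the Attack Presentation Classification Error Rate: the share of attack samples that a detector accepts as bona fide. A minimal sketch of that metric and its counterpart BPCER follows. The score convention (higher means more likely bona fide), the threshold, and the function names are assumptions of this sketch, and the survey's exact operating conditions are not reproduced.

```python
def apcer(attack_scores, threshold):
    """Share of attack samples accepted as bona fide.

    Assumes a sample is accepted when its score >= threshold.
    """
    if not attack_scores:
        raise ValueError("need at least one attack sample")
    accepted = sum(1 for s in attack_scores if s >= threshold)
    return accepted / len(attack_scores)


def bpcer(bona_fide_scores, threshold):
    """Share of bona fide samples rejected as attacks."""
    if not bona_fide_scores:
        raise ValueError("need at least one bona fide sample")
    rejected = sum(1 for s in bona_fide_scores if s < threshold)
    return rejected / len(bona_fide_scores)
```

Raising the threshold lowers APCER at the price of a higher BPCER, so an APCER value is only meaningful together with the operating point at which it was measured.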

How Much Future Helps? A Controlled Study of Future-Privileged Supervision for Causal Egocentric Gaze Estimation

1/10

How Much Future Helps? A Controlled Study of Future-Privileged Supervision for Causal Egocentric Gaze Estimation

Jia Li, Wenjie Zhao, Fnu Atisri, Sanskriti Aripineni, Shijian Deng, Jon E. Froeh...

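Personalized recommendation rationale:

This paper focuses on egocentric gaze estimation, a computer vision task with no direct or potential connection to core areas such as recommender systems, search, or advertising. Its method concerns causal (online) prediction trained with future information, but it offers no clear transfer path or application scenario pointing toward RecSys/Search/Ads.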

2026-07-01 20:00:12 | arXiv:2607.01437v1 |

cs.CV


Egocentric gaze estimation is commonly studied using models that process the full video with access to future frames, while real-world applications require strictly causal, online prediction. This discrepancy raises key questions: Does future context inherently provide valuable signals for gaze estimation? If so, how much future look-ahead optimally supervises a causal model during training? To investigate, we propose a controlled framework featuring a future-aware branch that accesses a tunable look-ahead horizon during training but is discarded at inference. This design isolates the impact of future context while keeping the inference architecture fixed and strictly causal. Across EGTEA Gaze+ and Ego4D, we find that future-privileged supervision consistently improves causal gaze prediction, confirming its utility. However, performance gains do not increase monotonically with longer look-ahead, but rather peak within a bounded temporal regime. Specifically, optimal performance corresponds to roughly 1.7--3.3 seconds of future context ($H{\in}[5, 10]$) on EGTEA Gaze+ and 2.7 seconds ($H{=}10$) on Ego4D. Our results demonstrate that lightweight causal models can effectively absorb future-aware signals, providing practical guidance for real-time egocentric gaze modeling.

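Sign in the Air to Unlock: An Interface for authentication in Virtual and Augmented Reality Powered by Point-Voxel Cross-Attention Network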

1/10

Sign in the Air to Unlock: An Interface for authentication in Virtual and Augmented Reality Powered by Point-Voxel Cross-Attention Network

Neda Abdolrahimi, Thiru Siddharth, Frank Sicongchen, Vir V Phoha

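Personalized recommendation rationale:

This paper focuses on identity authentication in VR/AR environments, which belongs to human-computer interaction and computer vision and has no direct connection to the core techniques of recommender systems, search, or advertising. Although it involves an attention mechanism, its application scenario and problem setting are far from RecSys/Search/Ads, and it shows no potential application in those areas.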

2026-07-01 19:56:55 | arXiv:2607.01435v1 |

cs.CV cs.CR cs.HC cs.LG


The significant advancement of immersive technologies such as Virtual and Augmented Reality (VR/AR) and their integration into diverse aspects of modern life call for authentication interfaces that are secure, intuitive, and compatible with embodied interaction. Traditional methods such as passwords, PINs, and device-based logins break immersion and rely on external hardware. Recent 3D-specific behavioral approaches, such as hand-gesture, eye-tracking, and electroencephalography (EEG)-based methods, offer promising alternatives but often require specialized sensors or constrain natural movement, limiting usability in dynamic environments. We present Sign in the Air to Unlock, an in-air signature interface that enables users to authenticate by signing naturally in 3D space, a familiar, personal, and reproducible gesture. To realize this interface, we design a point-voxel Cross-Attention Network (PV-Net) that jointly models local motion dynamics and global spatial structure from 3D trajectories. The model is evaluated on two datasets: the public DeepAirSig dataset (1,800 signatures from 40 users) and ImmAirSig, a new dataset collected using Meta Quest 2 in immersive VR (880 samples from 22 users). PV-Net achieves an Equal Error Rate of 2.5% on DeepAirSig and 76% classification accuracy on ImmAirSig. These findings highlight the potential of 3D behavioral interfaces for seamless, user-centric authentication that merges security with natural interaction in immersive environments.
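The Equal Error Rate quoted above is the operating point at which the false acceptance rate and the false rejection rate coincide. A minimal sketch of one common way to approximate it from verification scores is shown below, assuming higher scores mean "more likely genuine". The scores are hypothetical and the threshold sweep is a simplification, not the paper's evaluation code.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    A sample is accepted when its score is at or above the threshold. The
    function returns the mean of FAR and FRR at the threshold where the two
    rates are closest to each other.
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("need both genuine and impostor scores")
    best_gap, eer = None, None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical similarity scores for genuine and forged signatures.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.1, 0.2, 0.3, 0.6]
print(equal_error_rate(genuine, impostor))
```

With few samples the two rates rarely cross exactly, which is why the sketch averages them at the closest threshold instead of interpolating.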

Beyond Heatmaps: Unsupervised Concept-Graph Reasoning for Interpretable Visual Explanation

1/10

Beyond Heatmaps: Unsupervised Concept-Graph Reasoning for Interpretable Visual Explanation

Md Mohasin Hossain, Anar Amirli, Robert Leist, Md Abdul Kadir, Daniel Sonntag

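Personalized recommendation rationale:

This paper focuses on interpretable concept-graph reasoning in computer vision. Although it involves unsupervised learning and graph reasoning, it does not mention recommendation, search, or advertising systems. Its connection to LLM or Transformer techniques is unclear, and it lacks potential for direct application to RecSys/Search/Ads.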

2026-07-01 19:21:39 | arXiv:2607.01416v1 |

cs.CV


Concept Bottleneck Models (CBMs) provide an intrinsically interpretable alternative to post-hoc explanations. However, existing CBMs often rely on predefined concept vocabularies or supervised annotations, lack explicit concept grounding, and summarize each concept with a single image-level score -- discarding spatial recurrence and inter-concept dependencies. We propose a Graph-based Concept Bottleneck Model (G-CBM), an intrinsically interpretable framework that performs unsupervised concept discovery via Non-negative Matrix Factorization (NMF) and represents the discovered concepts as nodes in a per-image concept-graph representation. G-CBM matches region-level features to these concept nodes -- providing concept grounding and capturing concept recurrence across the image -- and applies a tunable concept filtering threshold τ to suppress weak region-level features. A Graph Attention Network (GAT) then performs concept-level reasoning by modeling nonlinear dependencies across nodes. Across ImageNet, HAM10000, PH2, and Derm7pt, G-CBM achieves an average relative AUC improvement of 3.7% over a ResNet-50 baseline. Concept filtering frequently improves predictive performance while inducing selective concept use, achieving peak AUC of 0.96 on PH2 with only 2 of 10 concepts and 0.92 on HAM10000 with 3.8 of 9 concepts. On dermoscopy benchmarks, G-CBM is competitive with supervised approaches requiring external annotations. Deletion/insertion analyses with random ablation controls show that the learned concept ranking faithfully reflects model predictions.

NeuroBridge: Bridging Multi-Task MRI Knowledge for Neurodegenerative Disease Diagnosis

1/10

NeuroBridge: Bridging Multi-Task MRI Knowledge for Neurodegenerative Disease Diagnosis

Mengyu Li, Guoyao Shen, Chad W. Farris, Xin Zhang

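Personalized recommendation rationale:

This paper focuses on applying medical imaging (MRI) to the diagnosis of neurodegenerative diseases. It is a domain-specific medical application that involves no search, recommendation, or advertising techniques and makes no contribution related to LLM or Transformer architectures, so it is unrelated to my areas of interest.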

2026-07-01 19:03:42 | arXiv:2607.01401v1 |

cs.LG cs.AI cs.CV


INTRODUCTION: Accurate MRI-based identification of Alzheimer's disease (AD), mild cognitive impairment (MCI), and related dementias remains challenging because disease-related structural changes are often subtle and heterogeneous. We developed NeuroBridge, a clinically guided multi-task MRI framework for neurodegenerative disease diagnosis. METHODS: NeuroBridge integrates large-scale self-supervised MRI pretraining with hippocampal segmentation, hippocampal atrophy classification, and reconstruction objectives, followed by gated fusion fine-tuning. Performance was evaluated across ADNI and OASIS cohorts, including cross-cohort transfer, probability-based analysis, and opportunistic screening. RESULTS: NeuroBridge achieved the highest performance across evaluated classification tasks, reaching 88.17% accuracy for AD versus cognitively normal controls in ADNI and 82.78% in OASIS. The largest gains occurred in MCI-related and mixed-diagnosis settings. The framework demonstrated strong cross-cohort generalization, systematic associations between predicted-class probability and accuracy, and the feasibility of probability-based opportunistic screening. DISCUSSION: Clinically guided multi-task representation learning improves neurodegenerative MRI diagnosis beyond conventional single-task approaches. NeuroBridge provides a robust and scalable framework for dementia assessment and MRI-based opportunistic screening.

Challenges and Recommendations for LLMs-as-a-Judge in Multilingual Settings and Low-Resource Languages

0/10

Challenges and Recommendations for LLMs-as-a-Judge in Multilingual Settings and Low-Resource Languages

A. Seza Doğruöz, Xixian Liao, Verena Blaschke, Jakob Prange, Senyu Li, David Ife...

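Personalized recommendation rationale:

This paper discusses using LLMs as judges for evaluation tasks, focusing on multilingual and low-resource language settings. It belongs to LLM evaluation and NLP, has no direct connection to the core techniques or applications of recommender systems, search, or advertising, and does not involve model architectures, training methods, or potential innovations applicable to those areas.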

2026-07-02 14:34:07 | arXiv:2607.02235v1 |

cs.CL cs.AI


LLM-as-a-Judge has become the dominant evaluation paradigm for many natural language generation tasks, due to shortcomings of conventional metrics and high correlations with human judgment, albeit mostly in English. There are now attempts to extend LLM-as-a-Judge to multilingual settings including low-resource languages. However, LLMs have limited proficiency in low-resource languages, and there is often no adequate human validation in these settings. To highlight the scope of the problem and current practices, we explore the use of LLM-as-a-Judge evaluators in ACL Anthology papers focusing on multilingual settings and low-resource languages across a diverse set of tasks. Out of 650 papers mentioning LLM-as-a-judge, only 33 of them focus on low-resource or multilingual settings. Our in-depth analysis of these papers indicates inconsistent evaluation outcomes, a tendency to overtrust LLM judgments in multilingual settings, and the widespread reliance on a single judge model per study. To help the NLP community further, we conclude with recommendations about how to use LLM-as-a-Judge in multilingual and low-resource settings.

Towards a Phonology-Informed Evaluation of Multilingual TTS

0/10

Towards a Phonology-Informed Evaluation of Multilingual TTS

Sneha Ray Barman, Neeraj Kumar Sharma, Shakuntala Mahanta

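Personalized recommendation rationale:

This paper focuses on evaluating multilingual text-to-speech (TTS), which belongs to speech synthesis. It has no direct connection to the core techniques of recommender systems, search, or advertising (such as ranking, matching, and user modeling), and does not involve applications of LLM or Transformer architectures in those areas.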

2026-07-02 09:57:54 | arXiv:2607.01965v1 |

cs.CL cs.ET cs.LG


Neural TTS systems can sound natural across languages, but naturalness does not guarantee the preservation of sound contrasts that distinguish words from their grammatical forms. Standard metrics like MOS do not test for this. We propose a classifier-based framework that audits TTS output against language-specific phonological patterns using human speech as a benchmark. Testing Assamese advanced tongue root (ATR) vowel harmony with Meta's MMS TTS, we show that a classifier trained on human speech transfers to synthesized speech with minimal loss. The faithfulness audit reveals that [+ATR] mid vowels are realized as [-ATR] in 1/3 tokens despite an underlying [+ATR] specification, a bias absent in human speech. At the word level, predicted ATR labels classify harmony more accurately than transcription labels, indicating a gap between intended and produced phonology. The framework offers task-specific diagnostics and generalizes to other phonological contrasts with measurable acoustic cues.

AgenticDataBench: A Comprehensive Benchmark for Data Agents

0/10

AgenticDataBench: A Comprehensive Benchmark for Data Agents

Zhaoyan Sun, Shan Zhong, Daizhou Wen, Jiaxing Han, Guoliang Li, Ying Yan, Peng Z...

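Personalized recommendation rationale:

This paper focuses on benchmarking data agents, which belongs to LLM agents and tool use rather than the core techniques of recommendation, search, or advertising. It shows no clear potential for application in those areas and does not match my focus.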

2026-07-02 03:18:59 | arXiv:2607.01647v1 |

cs.DB cs.AI cs.CL cs.LG


Data science aims to derive actionable insights from heterogeneous raw data, unlocking the value of the massive amounts of data generated in modern society. Automating this process is essential to reducing labor-intensive efforts for data scientists and enabling scalable data-driven applications. Recently, large language model (LLM)-based data agents have emerged as a promising solution to automate data science workflows. However, the field lacks comprehensive benchmarks to rigorously evaluate these agents across diverse scenarios with fine-grained granularity. To address this gap, we propose AgenticDataBench, a comprehensive benchmark featuring realistic tasks spanning diverse domains with fine-grained ground-truth labels. This enables evaluations to capture the diversity and complexity of data science workflows and the detailed performance of agents. First, to cover diverse domains, we collect real datasets and tasks from 15 vertical domains, including 5 real-world B2B use cases from a leading fintech company. Second, to remove redundancy in real-world tasks and generate high-quality tasks for domains lacking real data, we introduce data science skills, recurring data-centric operational patterns, and quantify benchmark coverage by the number of skills included. Representative skills are extracted from large-scale task solutions on Stack Overflow using skill-aligned hierarchical clustering. Third, for real-world business tasks, we select task-solution pairs that maximize diversity in skill composition, ensuring broad coverage of practical scenarios. Fourth, to generate realistic tasks for diverse domains without real tasks, we propose a systematic LLM-based task generation approach to create workflows and tasks based on these skills. Finally, we evaluate state-of-the-art data agents using our annotated benchmark and open-sourced testbed, providing detailed skill-level insights.
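The abstract above measures benchmark coverage by the number of skills included and selects task-solution pairs that maximize diversity in skill composition, without stating the selection algorithm. One plausible stand-in is a greedy maximum-coverage heuristic, sketched below. The function name, task IDs, and skill labels are all hypothetical.

```python
def select_diverse_tasks(task_skills, budget):
    """Greedily pick up to `budget` tasks to cover as many skills as possible.

    task_skills maps a task id to the set of skills its solution uses. At each
    step the task adding the most not-yet-covered skills is chosen; ties are
    broken by alphabetical task id so the result is deterministic.
    """
    covered, chosen = set(), []
    remaining = {task: set(skills) for task, skills in task_skills.items()}
    while remaining and len(chosen) < budget:
        best = max(sorted(remaining), key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:
            break  # no remaining task adds a new skill
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Hypothetical tasks annotated with hypothetical skill labels.
tasks = {
    "t1": {"join", "groupby"},
    "t2": {"join"},
    "t3": {"pivot", "regex", "join"},
    "t4": {"groupby", "plot"},
}
chosen, covered = select_diverse_tasks(tasks, budget=2)
print(chosen, len(covered))
```

The greedy rule is a standard approximation for coverage problems. It is used here only to make the coverage metric concrete and should not be read as the benchmark's actual procedure.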

Text-Driven 3D Indoor Scene Synthesis in Non-Manhattan Environments

0/10

Text-Driven 3D Indoor Scene Synthesis in Non-Manhattan Environments

Xianhui Meng, Zirui Song, Yuchen Zhang, Li Zhang, Yongxuan Lv, Xiuying Chen, Kun...

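Personalized recommendation rationale:

This paper focuses on 3D scene synthesis, which belongs to computer graphics and vision. It is unrelated to the core techniques of recommender systems, search, or advertising, and does not involve improvements to or applications of LLMs or Transformers.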

2026-07-02 16:40:08 | arXiv:2607.02407v1 |

cs.AI cs.CV


Large Language Models (LLMs) have demonstrated remarkable capabilities in 3D indoor synthesis for Manhattan environments. However, existing methods often fail to capture plausible object layout patterns in non-Manhattan settings, primarily because they struggle to model non-orthogonal spatial relationships, leading to high geometric violations and low physical fidelity. To address this challenge, we propose SPG-Layout, a novel text-driven framework designed to generate physically plausible indoor scenes within complex non-Manhattan environments. Specifically, we first utilize statistical priors of object distributions to guide the training process, enhancing environmental understanding and fidelity. Furthermore, mirroring human design workflows, we adopt a hierarchical layout strategy that prioritizes the placement of large objects, thereby substantially minimizing layout violations. By synergizing these components, SPG-Layout achieves a balanced optimization of semantic realism and physical plausibility. To evaluate performance in these complex settings, we constructed a new benchmark comprising 500 diverse non-Manhattan environments. Extensive experiments demonstrate that SPG-Layout consistently and significantly outperforms existing methods across both Manhattan and non-Manhattan environments. The code will be publicly released.

Learning Spectral and Polarimetric Clues for One-to-Multimodal Novel View Synthesis

0/10

Learning Spectral and Polarimetric Clues for One-to-Multimodal Novel View Synthesis

Federico Lincetto, Gianluca Agresti, Mattia Rossi, Piergiorgio Sartor, Pietro Za...

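Personalized recommendation rationale:

This paper focuses on novel view synthesis, which belongs to computer vision and graphics. It has no direct connection to core recommender, search, or advertising tasks such as ranking and user modeling, and shows no application potential for LLMs, Transformer architectures, or recommender systems.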

2026-07-02 16:13:19 | arXiv:2607.02372v1 |

cs.CV


Neural rendering techniques allow for accurate reconstruction of the geometry and color appearance of 3D scenes. Some methods have extended their use to additional imaging modalities, such as multispectral, infrared, or polarimetric data. However, all of these approaches require expensive sensors and calibrated setups to capture new multimodal frames for each new scene. We propose Spectral and Polarimetric Implicit Learned Representation (SPoILeR), a novel method to obtain multi-view consistent renderings of unconventional modalities for scenes where either only RGB frames or very few of the additional modalities are available. Thanks to a multimodal pre-training phase, the model learns the mutual correlation between different modalities. This step allows predicting accurate renderings of unconventional modalities during a fine-tuning phase supervised only by RGB images. Experimental results show that the approach can accurately render infrared, polarimetric, and multispectral frames for scenes where no input sample captured by these types of sensors is provided.

Dual-Selective Network for Domain-Incremental Change Detection

0/10

Dual-Selective Network for Domain-Incremental Change Detection

Yuzhi He, Junxi Huang, Haorui Wu, Jiahui Qu

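Personalized recommendation rationale:

This paper focuses on change detection in computer vision and has no direct or potential connection to recommender systems, search, or advertising. The "domain-incremental" in the title refers to shifts in the visual domain, not to domains in recommender systems.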

2026-07-02 15:17:36 | arXiv:2607.02299v1 |

cs.CV


Domain-incremental change detection (DICD) continuously adapts models to new geographic domains while preserving prior knowledge. However, a structural mismatch exists: the label space remains fixed while domain characteristics vary drastically. Consequently, incremental models struggle to maintain stable spatial change representations across domains. Existing strategies, such as replay-based or regularization-based methods, often fail to scale to long domain sequences, leading to knowledge degradation or increased computational cost. We propose Dual-Selective Incremental Network (DSINet), a unified framework built on visual state space models. DSINet leverages Mamba's input-dependent selective mechanism through a selective spatial state unit (S3U). This unit preserves stable spatial change structures while filtering domain-specific variations during feature propagation. As a result, spatial representations remain stable across domains, preventing the accumulation of feature confusion over incremental steps. Additionally, we employ a concentration-balanced distillation (CBD) strategy to stabilize knowledge transfer across domains. It balances hardness and confidence concentration effects during incremental updates. This ensures reliable probability mass allocation and prevents over-smoothing or mode collapse during distillation. Together, these mechanisms maintain stable learning dynamics throughout incremental stages. Experimental results demonstrate that DSINet mitigates knowledge degradation across long domain sequences while maintaining the linear computational efficiency of state space models.

DetailAnywhere: Fashion Detail Generation via Cross-Modal Feature Alignment Distillation

0/10

DetailAnywhere: Fashion Detail Generation via Cross-Modal Feature Alignment Distillation

Zijun Li, Yimin Zhou, Jia Sun, Honglie Wang, Pengcheng Wei, Junlong Wu, Yongrui ...

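Personalized recommendation rationale:

This paper focuses on generating details in fashion images, which belongs to computer vision and image generation. Although it involves cross-modal alignment distillation, it is closer to visual content generation than to the core techniques of recommendation, search, or advertising systems (such as ranking, matching, or user modeling). It has no direct connection to my areas of interest (core RecSys/Search/Ads advances, LLM applications, and so on).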

2026-07-02 14:26:47 | arXiv:2607.02220v1 |

cs.CV


Diffusion-based generative AI has achieved remarkable success in e-commerce applications such as virtual try-on, poster generation, and product background synthesis. However, when making online purchasing decisions for apparel, consumers also desire the freedom to examine specific detail regions of interest, such as collars, cuffs, and fabric textures, yet existing methods have not explicitly studied this setting. We therefore formalize a new, non-template task: Fashion Detail Generation with focus conditioning, and release FDBench, the first benchmark comprising 40K+ human-verified reference-detail pairs across 41 different categories. This task poses a unique semantic gap challenge: the model must bridge the correspondence between a focus marker on a product reference image and a photorealistic close-up view of the indicated region, while faithfully preserving the garment's identity, without any precise prompt. To bridge this gap, we propose Cross-modal Feature Alignment Distillation (CFAD), which leverages a fine-tuned DINOv3 teacher to align both branches of a Multimodal Diffusion Transformer in a shared semantic space via dual-branch distillation. To further improve consistency between generated details and reference images, we introduce a consistency reward model that jointly scores image pairs along three quality axes and optimizes generation via reinforcement learning. Experiments show that our model DetailAnywhere significantly outperforms all state-of-the-art open-source methods across all metrics and human evaluations.

Patient-Specific Articulated Digital Twins from a Single Full-Body CT Scan

0/10

Patient-Specific Articulated Digital Twins from a Single Full-Body CT Scan

Han Zhang, Boyang Zhao, Mathias Unberath

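Personalized recommendation rationale:

This paper focuses on medical imaging and digital twins, which belongs to the medical domain and is unrelated to recommender systems, search, or advertising.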

2026-07-02 13:30:45 | arXiv:2607.02156v1 |

cs.CV


Patient-specific anatomical models provide individualized context for surgical planning, image-guided intervention, and algorithm development. However, most CT-derived models are static: they preserve the body configuration captured at scan time, but cannot represent how the same anatomy would appear after patient repositioning. This limitation is especially important for radiographic imaging, where appearance depends jointly on imaging geometry and patient pose. We present a proof-of-concept for constructing a patient-specific articulated digital twin from a single full-body CT scan. The method fits a parametric human body model (SMPL) to obtain a patient-aligned kinematic scaffold, binds segmented bones and organs to an anatomy-aware rig, and retargets body-pose changes while preserving skeletal geometry. On three full-body CT subjects, the fitted scaffold achieved 15.8 ± 4.0 mm chamfer distance and 95.9 ± 1.8% skeletal enclosure. Recomposition at the acquisition pose preserved major radiographic structure, with overall SSIM of 0.872 ± 0.016 and PSNR of 18.5 ± 1.4 dB across paired DRRs. Across unseen target poses, the resulting twins enabled articulation while maintaining high skeletal enclosure (94.4 ± 0.4%). As a feasibility demonstration, we render the articulated twin as pose-dependent DRRs. These results suggest the feasibility of extending static, view-controllable CT simulation toward pose-controllable anatomical twins for future synthetic imaging and positioning studies.
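Two of the metrics quoted above, chamfer distance and PSNR, are simple to state in code. The sketch below assumes a symmetric chamfer distance that averages nearest-neighbor Euclidean distances in both directions and a PSNR over flattened pixel lists. The abstract does not say which chamfer variant was used, so this choice, like the example data, is an assumption.

```python
from math import dist, inf, log10


def chamfer_distance(points_a, points_b):
    """Symmetric chamfer distance between two point sets (brute force)."""
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src, dst):
        # Mean distance from each source point to its nearest target point.
        return sum(min(dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(points_a, points_b) + one_way(points_b, points_a))


def psnr(image_a, image_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(image_a) != len(image_b) or not image_a:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)
    return inf if mse == 0 else 10 * log10(max_value ** 2 / mse)


# Hypothetical 3D points (in mm) and pixel intensities.
surface = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
scaffold = [(0.0, 0.0, 0.0), (1.0, 0.0, 3.0)]
print(chamfer_distance(surface, scaffold))
print(psnr([10, 20, 30], [12, 18, 33]))
```

The brute-force nearest-neighbor search is quadratic in the number of points. A spatial index would be needed for full-resolution meshes.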

NeoMap: Training-free Novel-View Synthesis from Single Images and Videos

0/10

NeoMap: Training-free Novel-View Synthesis from Single Images and Videos

Jinxi Li, Tianyi Zhang, Yafei Yang, Zihui Zhang, Peng Huang, Koon Wing Macgyver ...

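Personalized recommendation rationale:

This paper examines novel view synthesis in computer vision, which belongs to 3D vision and graphics and is unrelated to the core techniques of recommender systems, search, and advertising (such as LLMs, Transformers, and multimodal alignment). The title mentions no application or method related to search, recommendation, or advertising, so it is entirely irrelevant.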

2026-07-02 09:56:50 | arXiv:2607.01962v1 |

cs.CV cs.AI cs.GR cs.RO


We study the challenging problem of novel view video synthesis from single images or monocular videos. Existing methods, which operate under the assumption that pre-trained video models lack native novel view synthesis capability and enforce view alignment via camera conditioning, task-specific fine-tuning, or stepwise hard denoising guidance, often suffer from artifacts and compromised global scene consistency. In this paper, we introduce NeoMap, a novel training-free framework designed to locate high-fidelity, view-consistent novel view solutions from general pre-trained video models. The key to our approach is the core insight that promising novel view solutions are inherently encoded within the natural video data manifold learned by pre-trained models, and the core challenge is simply to locate this optimal solution. We solve this via our core mechanism: convergent manifold alternating projection iterations that optimize the initial noise. Extensive experiments demonstrate that NeoMap significantly outperforms all existing methods across 3 standard novel view synthesis benchmarks, including the challenging Tanks-and-Temples, LLFF and DAVIS datasets, achieving state-of-the-art generation fidelity and top-tier view consistency.

Diversity-aware View Partitioning for Scalable VGGT

0/10

Diversity-aware View Partitioning for Scalable VGGT

Jinsoo Park, Donggyu Choi, Ahyun Seo, Minsu Cho, Jeany Son

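Personalized recommendation rationale:

This paper concerns view partitioning for VGGT (a 3D vision Transformer), which belongs to 3D vision. It is unrelated to the core techniques of recommender systems, search, or advertising, and does not mention transferable advanced techniques such as Transformer efficiency or multimodal applications.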

2026-07-02 08:37:56 | arXiv:2607.01885v1 |

cs.CV


Geometry transformers such as VGGT achieve strong performance by jointly reasoning over multiple views with global attention. However, scaling them to large view collections remains challenging due to the quadratic cost of attention. Moreover, our empirical analysis reveals that the reconstruction quality in VGGT is sensitive to the distribution of viewpoints. Simply increasing the number of views without sufficient viewpoint diversity can even degrade performance, as redundant views introduce highly similar tokens that dilute informative geometric signals in the attention mechanism. Motivated by this observation, we propose a training-free and plug-and-play VGGT inference framework that organizes views into diversity-aware balanced chunks. The chunks are constructed through combinatorial graph partitioning over visual dissimilarity and spatial dispersion. This view organization allows the transformer to focus attention on geometrically informative views while reducing redundant attention interactions. To estimate spatial dispersion without full pose estimation, we approximate spatial relationships via a soft pose propagation strategy based on visual similarity from a small set of seed frames. Extensive experiments demonstrate improved performance in camera pose estimation, multi-view depth prediction, and 3D reconstruction while reducing memory usage and inference latency. Our framework also complements existing VGGT variants, enabling scalable multi-view reconstruction without sacrificing geometric fidelity.
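The scaling argument above rests on attention cost growing quadratically with the number of tokens. A back-of-the-envelope sketch counts token-pair interactions when views are split into balanced chunks and attention runs only within each chunk. The token count per view is a hypothetical parameter, and the sketch ignores however the framework combines results across chunks, which the abstract does not describe.

```python
def pairwise_attention_cost(num_views, tokens_per_view, num_chunks=1):
    """Count token-pair interactions for attention restricted to each chunk.

    Views are split into `num_chunks` groups whose sizes differ by at most
    one; global attention inside a chunk of s views touches
    (s * tokens_per_view) ** 2 token pairs.
    """
    if num_chunks < 1 or num_chunks > num_views:
        raise ValueError("num_chunks must be between 1 and num_views")
    base, extra = divmod(num_views, num_chunks)
    sizes = [base + 1] * extra + [base] * (num_chunks - extra)
    return sum((size * tokens_per_view) ** 2 for size in sizes)


# Hypothetical setting: 100 views, 10 tokens per view.
full = pairwise_attention_cost(100, 10)
chunked = pairwise_attention_cost(100, 10, num_chunks=4)
print(full, chunked, full / chunked)
```

With equal-sized chunks the interaction count drops by a factor equal to the number of chunks, which is the saving a partitioning scheme can exploit as long as each chunk still contains enough diverse viewpoints.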

C2E: Boosting Ego-Only 3D Object Detection via Multi-Teacher Contrastive Knowledge Distillation

0/10

C2E: Boosting Ego-Only 3D Object Detection via Multi-Teacher Contrastive Knowledge Distillation

Jinlong Wang, Xun Huang, Qiming Xia, Shijia Zhao, Chenglu Wen

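Personalized recommendation rationale:

This paper focuses on 3D object detection for autonomous driving, which belongs to computer vision and autonomous driving and has no direct or potential connection to recommender systems, search, or advertising.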

2026-07-02 07:48:22 | arXiv:2607.01827v1 |

cs.CV


LiDAR-based 3D object detection is essential for autonomous driving systems. However, traditional Ego-only Perception (Eo-Perception) suffers from limited perspective and occlusions in complex outdoor environments, leading to performance bottlenecks. Recently, research on multi-agent Collaborative Perception (Co-Perception) has demonstrated excellent performance, but high communication costs and accumulated pose error hinder its application. To address this, we explore a novel C2E (Co-Perception to Eo-Perception) paradigm through the Multi-to-Single (M2S) agent contrastive knowledge distillation framework. Our M2S framework first designs a Multi-Level Feature Enhancement module to provide more stable features, and introduces Auxiliary Point Cloud Reconstruction and Multi-Teacher Contrastive Distillation mechanisms to mitigate domain gaps in point cloud and feature distributions within the C2E paradigm. Benefiting from this, our M2S framework can retain the excellent performance of collaborative perception while effectively avoiding its drawbacks, such as communication delays and positioning errors. Extensive experiments on the V2XSet, V2V4Real and DAIR-V2X datasets show the effectiveness and generalizability of our M2S framework when combined with the state-of-the-art CoSDH model and other excellent 3D detectors. Our M2S framework can deliver up to an 8.64% improvement in 3D mAP performance without introducing any communication costs.

MMBench-Live: A Continuously Evolving Benchmark for Multimodal Models

0/10

MMBench-Live: A Continuously Evolving Benchmark for Multimodal Models

Yuanzhi Liu, Shousheng Zhao, Bo Zhou, Kongming Liang, Zhanyu Ma

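Personalized recommendation rationale:

This paper mainly concerns an evaluation benchmark for multimodal models. It is purely evaluation work, unrelated to core advances or applications in recommendation, search, or advertising, and does not involve enabling techniques for LLMs or Transformers.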

2026-07-02 07:27:58 | arXiv:2607.01813v1 |

cs.CV cs.AI


Evaluation benchmarks are essential for assessing vision-language models (VLMs), but most multimodal benchmarks are static, making them vulnerable to temporal staleness, data contamination, and costly maintenance. We present MMBench-Live, a continuously evolving multimodal benchmark built by a multi-agent-driven automated pipeline. Our framework treats benchmark evolution as task-guided dataset construction, integrating structured benchmark specification, feedback-controlled real-time data acquisition, and verifiable QA generation with executable reasoning. To maintain cross-version comparability, we introduce a distribution-consistent update strategy that extracts task-related visual patterns from the original benchmark to guide data collection and filtering. Instantiated from MMBench, MMBench-Live contains 5.9K newly generated evaluation instances with a high answer correctness rate, while each update costs about USD 30 and takes 1-2 hours. Extensive evaluations show that MMBench-Live preserves stable model rankings, maintains semantic alignment with the original benchmark, and exhibits weaker contamination-related memorization signals, suggesting a practical and scalable paradigm for sustainable multimodal benchmark evolution. The project is available at https://github.com/PRIS-CV/MMBench-Live.

PixGS: Pixel-Space Diffusion for Direct 3D Gaussian Splat Generation

0/10

PixGS: Pixel-Space Diffusion for Direct 3D Gaussian Splat Generation

Duy Cao, Phong Nguyen-Ha

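Personalized recommendation rationale:

This paper focuses on generating Gaussian splats for 3D vision tasks, which belongs to graphics and computer vision. It has no direct connection to search, recommendation, advertising, or core advances in large language models, and does not involve applications of LLMs or Transformers to recommendation, search, or advertising.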

2026-07-02 07:18:37 | arXiv:2607.01803v1 |

cs.CV cs.GR cs.RO


Recent advances in 3D content generation from text or images have achieved impressive results, yet view inconsistency from 2D generators and the scarcity of high-quality 3D data remain significant bottlenecks. Existing solutions typically adapt large-scale pre-trained text-to-image latent diffusion models to generate 3D Gaussian Splats (3DGS). However, these approaches often rely on training complex cascade pipelines that are computationally expensive and limited in scalability. Most critically, the quality of generated 3D assets is inherently constrained by each component's capacity and by the compressed latent space, leading to decoding artifacts and accumulated errors. To address these limitations, we propose PixGS, a single-stage pipeline for direct high-quality 3DGS generation, which leverages recent advances in pixel-space diffusion to bypass lossy latent compression while still benefiting from the vast 2D generative priors. By directly denoising 3D Gaussian attributes at each timestep, our method enables precise, splat-level regularization of both appearance and geometry. Furthermore, we introduce a comprehensive supervision strategy that incorporates surface normals, depth, and high-frequency structural information, which is often overlooked in prior works. Experiments demonstrate that PixGS outperforms current state-of-the-art methods while maintaining a fast inference speed (1s on a single A100 GPU), offering a robust and efficient alternative to multi-stage generation pipelines.

The Turning Point of 3D Plant Phenotyping: 3D Foundation Models Enable Minute-to-Second Cross-Crop Reconstruction and Beyond

0/10

The Turning Point of 3D Plant Phenotyping: 3D Foundation Models Enable Minute-to-Second Cross-Crop Reconstruction and Beyond

Hanyue Jia, Wei Zhou, Wenbo Zhou, Yanan Li, Hao Lu, Tingting Wu

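Personalized recommendation rationale:

This paper focuses on 3D plant phenotyping and agricultural applications and has no direct connection to recommender systems, search, or advertising. Although it involves foundation models, it is domain-specific and lacks clear application potential for RecSys/Search/Ads.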

2026-07-02 06:10:54 | arXiv:2607.01753v1 |

cs.CV q-bio.QM


3D plant phenotyping is notoriously procedure-heavy and low-throughput due to the extensive multi-view imaging, the fragile 3D reconstruction pipeline, and the additional cost of going from reconstructed geometry to phenotypic extraction. These limitations are further amplified in low-cost data acquisition, where smartphone videos or sparsely sampled multi-view images provide limited view overlap and suffer from self-occlusion. In this work, we show that the conventional 3D plant phenotyping pipeline could be streamlined and significantly accelerated with 3D Foundation Models (3DFMs), and particularly, present one of the first cross-crop 3D phenotyping frameworks powered by 3DFMs. The framework replaces COLMAP-style sparse initialization with 3DFM-based feed-forward geometric recovery, combines geometry-constrained 3D Gaussian Splatting for dense reconstruction, enables few-view reconstruction through iterative view synthesis and refinement, and converts reconstructed geometry into measurable organs through 2D-to-3D semantic transfer, metric scale recovery, and organ instance separation. We further construct a cross-crop dataset with smartphone-based image acquisition, diverse plant morphologies, and manual annotations for segmentation and phenotypic evaluation. Experiments across 26 plant sequences show that 3D Foundation Models reduce the average reconstruction time from 6.52 minutes to 1.58 seconds while maintaining high reconstruction quality and phenotyping accuracy. These results suggest a fresh technical route for high-throughput 3D plant phenotyping, from low-cost image acquisition to fast reconstruction, perception, scale recovery, and phenotypic measurement.

Consistent Scene Understanding in 3D Gaussian Splatting via Multi-Cue Mask Refinement

0/10

Hyunjoon Park, Donghyeon Cho

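Personalized recommendation rationale:

This paper focuses on scene understanding and rendering in 3D vision, which falls under computer vision and graphics. It has no direct or indirect connection to the core areas of RecSys, Search, or Ads, and it does not involve LLMs, Transformers, or the multimodal modeling techniques used in recommender systems.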

2026-07-02 05:00:38 | arXiv:2607.01708v1 |

cs.CV


Reliable instance-level scene understanding is a fundamental prerequisite for object-level interactions and high-fidelity 3D representations. While current methods often leverage 2D foundation segmentation models to obtain these priors, their 2D-centric design typically yields fragmented masks and inconsistent predictions across different views. To address these issues, we propose a novel framework that produces consistent 2D instance masks to guide the optimization of 3D Gaussian Splatting (3DGS) feature fields. Our framework consists of three main stages: (1) Multi-Cue Extraction, which generates synergistic semantic, geometric, and structural priors from input images; (2) Multi-Cue-Guided Mask Merging, which consolidates fragmented masks using a composite merge score derived from semantic, depth, and edge cues; and (3) Cross-View Mask Matching, which establishes globally consistent identity assignments across all viewpoints. By transforming viewpoint-specific segments into coherent 3D primitives, our approach enables stable 3D instance segmentation and effective downstream editing tasks. Experiments demonstrate that our method significantly improves cross-view consistency and segmentation stability over existing baselines while maintaining high-fidelity photometric reconstruction.
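The abstract does not say how the composite merge score is computed. As a rough illustration only, the sketch below combines the three named cues (semantic, depth, edge) as a weighted sum. The `Segment` type, the weights, the `depth_scale` parameter, and the exact form of each term are all assumptions made for this example, not the authors' formulation.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical summary of one 2D mask fragment."""
    feature: List[float]  # semantic embedding pooled over the mask
    mean_depth: float     # average depth inside the mask


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_score(a: Segment, b: Segment, boundary_edge: float,
                w_sem: float = 0.5, w_depth: float = 0.3, w_edge: float = 0.2,
                depth_scale: float = 0.5) -> float:
    """Assumed composite score in [0, 1]; higher means 'more likely one object'.

    boundary_edge is the mean edge strength (0..1) along the shared border.
    """
    sem = max(0.0, cosine(a.feature, b.feature))                      # similar semantics
    depth = math.exp(-abs(a.mean_depth - b.mean_depth) / depth_scale)  # similar depth
    edge = 1.0 - boundary_edge                                        # weak edge between them
    return w_sem * sem + w_depth * depth + w_edge * edge


# Two fragments of the same object versus fragments of different objects.
frag_a = Segment([1.0, 0.0], 2.0)
frag_b = Segment([0.9, 0.1], 2.1)
other = Segment([0.0, 1.0], 5.0)

same_object = merge_score(frag_a, frag_b, boundary_edge=0.1)
different_objects = merge_score(frag_a, other, boundary_edge=0.9)
print(same_object > different_objects)
```

In a merging loop of this kind, adjacent fragments whose score exceeds a chosen threshold would be fused into one mask; the threshold, like the weights, would need tuning per dataset.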

Computer Vision for Wildlife Monitoring: Detecting Brown Howler Monkeys using YOLO

0/10

Gabriel Ferri Schneider, Guido Luis Glufke Mainardi, Paulo Ricardo Knob, Patríci...

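Personalized recommendation rationale:

This paper focuses on a computer vision application for wildlife monitoring and is unrelated to recommender systems, search, or advertising. It does not involve LLMs, Transformer architectures, or any technique relevant to RecSys/Search/Ads, so it falls outside my areas of interest.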

2026-07-01 18:57:09 | arXiv:2607.01396v1 |

cs.CV, cs.GR


Urban expansion threatens global biodiversity, especially affecting arboreal species due to the fragmentation of forest habitats. The movement of arboreal species across disjointed forest patches increases mortality risk and thus compromises their conservation. Installing canopy bridges can be a viable strategy, yet continuous monitoring of their use by arboreal species, typically carried out with camera traps, is essential for ensuring their effectiveness. However, this method often produces false-positive images that take conservationists time to review. Here, computer vision algorithms can streamline the task of detecting target species using the canopy bridges. In this study, we explored the automatic detection of brown howler monkeys (Alouatta guariba) in videos obtained by camera traps. Given the need for a large number of annotated images of the target animals to train the algorithms, we tested the incorporation of auxiliary data to improve detection models, fine-tuning the YOLOv10 framework using varying proportions of this data. The improvement of these automatic detection techniques contributes to conservation efforts by providing automatic tools to monitor solutions that minimize the impact of human interference in animal habitats.
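The abstract does not describe how the auxiliary data is mixed in. The sketch below shows one plausible way to build training sets with varying proportions of auxiliary images before fine-tuning a detector. The function name `mix_training_set`, the interpretation of "proportion" as a fraction of the auxiliary pool, and the file names are all hypothetical.

```python
import random
from typing import List, Sequence


def mix_training_set(target: Sequence[str], auxiliary: Sequence[str],
                     aux_fraction: float, seed: int = 0) -> List[str]:
    """Return all target images plus a random fraction of the auxiliary pool.

    aux_fraction is assumed to lie in [0, 1]; 0 means target data only.
    """
    if not 0.0 <= aux_fraction <= 1.0:
        raise ValueError("aux_fraction must be between 0 and 1")
    rng = random.Random(seed)                       # fixed seed for repeatable splits
    n_aux = round(aux_fraction * len(auxiliary))    # how many auxiliary items to add
    sampled = rng.sample(list(auxiliary), n_aux)    # sample without replacement
    mixed = list(target) + sampled
    rng.shuffle(mixed)                              # avoid ordering by source
    return mixed


# Hypothetical file lists standing in for annotated images.
howler_images = [f"howler_{i}.jpg" for i in range(4)]
auxiliary_images = [f"aux_{i}.jpg" for i in range(10)]

# One training list per proportion; each would feed a separate fine-tuning run.
training_sets = {frac: mix_training_set(howler_images, auxiliary_images, frac)
                 for frac in (0.0, 0.5, 1.0)}
for frac, items in training_sets.items():
    print(frac, len(items))
```

Keeping the target set fixed and varying only the auxiliary share makes it easier to attribute any change in detection quality to the auxiliary data.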

Rethinking Generic Object Tracking Toward Human-Level Perceptual Intelligence

0/10

Shih-Fang Chen

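Personalized recommendation rationale:

This paper is about generic object tracking, a computer vision topic unrelated to the core concerns of recommender systems, search, or advertising. It involves neither LLMs, Transformer architecture improvements, nor direct applications to recommender systems, so it does not match the current research focus.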

2026-07-01 18:54:00 | arXiv:2607.01395v1 |

cs.CV, cs.AI, cs.LG, cs.MM, eess.IV


At the heart of human visual perception lies the ability to maintain a continuous and coherent understanding of the external world. By integrating observations with accumulated experience, the human visual system can continuously adapt to variations in both the target and its surrounding environment, while preserving robust visual continuity as scene dynamics evolve. Human vision can therefore integrate prior knowledge, spatial geometry, and semantic context to understand complex scenes and their changes. As a core problem in computer vision, visual object tracking aims to bring machine perception closer to human visual perception. These capabilities are central to the task of Generic Object Tracking (GOT). In this task, a visual tracker is initialized only with the bounding box of an arbitrarily specified target in the first frame, and must continuously localize the target in subsequent dynamic visual streams. However, future events, observations, and real-world variations are inherently unpredictable; therefore, the model's generalization and online adaptation capabilities remain bottlenecks. Tracking reliability can deteriorate when the target undergoes severe deformation, is affected by complex distractors, encounters significant environmental changes, or belongs to a category unseen during training. This dissertation aims to narrow the gap between machine visual tracking systems and human visual perception by proposing a series of methods that systematically enhance the target discrimination, robust adaptation, and geometric reasoning capabilities of tracking models.
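The GOT protocol described in the abstract (initialize from a first-frame bounding box, then localize the target in every later frame) can be sketched as a small interface and evaluation loop. The `Tracker` protocol, the `StaticTracker` baseline, and `run_tracker` are hypothetical names for illustration and do not come from the dissertation.

```python
from typing import Any, List, Protocol, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), an assumed convention


class Tracker(Protocol):
    """Minimal interface a generic object tracker is assumed to expose."""

    def init(self, frame: Any, box: Box) -> None: ...

    def update(self, frame: Any) -> Box: ...


class StaticTracker:
    """Trivial baseline: always reports the first-frame box (no adaptation)."""

    def init(self, frame: Any, box: Box) -> None:
        self.box = box

    def update(self, frame: Any) -> Box:
        return self.box


def run_tracker(tracker: Tracker, frames: Sequence[Any], first_box: Box) -> List[Box]:
    """Initialize on frame 0 only, then predict one box per subsequent frame."""
    tracker.init(frames[0], first_box)
    predictions = [first_box]
    for frame in frames[1:]:
        predictions.append(tracker.update(frame))  # no ground truth is given here
    return predictions


# Placeholder frames; a real run would pass image arrays.
frames = ["frame0", "frame1", "frame2"]
boxes = run_tracker(StaticTracker(), frames, (10.0, 20.0, 30.0, 40.0))
print(boxes)
```

A real tracker would update its internal target model inside `update`, which is where the generalization and online adaptation challenges named in the abstract arise.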