arXiv Daily Paper Digest

2025-12-17
Total papers: 140
Selected papers: 20
Average score: 2.4
Showing 140 papers (140 total)
RecGPT-V2 Technical Report
Chao Yi, Dian Chen, Gaoyang Guo, Jiakai Tang, Jian Wu, Jing Yu, Mao Zhang, Wen C...
Core summary:

The paper studies how to overcome the efficiency, diversity, generalization, and evaluation limitations of existing LLM-based recommender systems. The core idea is a hierarchical multi-agent system that collaboratively reasons about user intent, combined with hybrid representation compression, meta-prompt generation, constrained reinforcement learning, and an agent-as-a-judge framework, to deliver recommendations that are efficient, diverse, and aligned with human preferences.

Personalized recommendation rationale:

The paper directly targets LLM applications in recommender systems, introducing multi-agent collaboration, hybrid-representation reasoning, meta-prompting, and reinforcement learning, which fits squarely with the focus on direct LLM applications and core-domain advances.

2025-12-16 15:40:44 | arXiv:2512.14503v1 |
cs.IR cs.CL
Full abstract:
Large language models (LLMs) have demonstrated remarkable potential in transforming recommender systems from implicit behavioral pattern matching to explicit intent reasoning. While RecGPT-V1 successfully pioneered this paradigm by integrating LLM-based reasoning into user interest mining and item tag prediction, it suffers from four fundamental limitations: (1) computational inefficiency and cognitive redundancy across multiple reasoning routes; (2) insufficient explanation diversity in fixed-template generation; (3) limited generalization under supervised learning paradigms; and (4) simplistic outcome-focused evaluation that fails to match human standards. To address these challenges, we present RecGPT-V2 with four key innovations. First, a Hierarchical Multi-Agent System restructures intent reasoning through coordinated collaboration, eliminating cognitive duplication while enabling diverse intent coverage. Combined with Hybrid Representation Inference that compresses user-behavior contexts, our framework reduces GPU consumption by 60% and improves exclusive recall from 9.39% to 10.99%. Second, a Meta-Prompting framework dynamically generates contextually adaptive prompts, improving explanation diversity by +7.3%. Third, constrained reinforcement learning mitigates multi-reward conflicts, achieving +24.1% improvement in tag prediction and +13.0% in explanation acceptance. Fourth, an Agent-as-a-Judge framework decomposes assessment into multi-step reasoning, improving human preference alignment. Online A/B tests on Taobao demonstrate significant improvements: +2.98% CTR, +3.71% IPV, +2.19% TV, and +11.46% NER. RecGPT-V2 establishes both the technical feasibility and commercial viability of deploying LLM-powered intent reasoning at scale, bridging the gap between cognitive exploration and industrial utility.
AsarRec: Adaptive Sequential Augmentation for Robust Self-supervised Sequential Recommendation
Kaike Zhang, Qi Cao, Fei Sun, Xinran Liu
Core summary:

The paper addresses the performance degradation caused by noisy user behavior in sequential recommendation. The core idea is to unify basic augmentation operations as structured transformation matrices and to learn adaptive transformation matrices, yielding a robust self-supervised learning framework.

Personalized recommendation rationale:

The paper tackles a core recommender-system problem, proposing an adaptive data-augmentation framework to improve robustness; it falls under Core Domain Advances and Direct LLM Applications.

2025-12-16 03:29:11 | arXiv:2512.14047v1 |
cs.IR
Full abstract:
Sequential recommender systems have demonstrated strong capabilities in modeling users' dynamic preferences and capturing item transition patterns. However, real-world user behaviors are often noisy due to factors such as human errors, uncertainty, and behavioral ambiguity, which can lead to degraded recommendation performance. To address this issue, recent approaches widely adopt self-supervised learning (SSL), particularly contrastive learning, by generating perturbed views of user interaction sequences and maximizing their mutual information to improve model robustness. However, these methods heavily rely on their pre-defined static augmentation strategies (where the augmentation type remains fixed once chosen) to construct augmented views, leading to two critical challenges: (1) the optimal augmentation type can vary significantly across different scenarios; (2) inappropriate augmentations may even degrade recommendation performance, limiting the effectiveness of SSL. To overcome these limitations, we propose an adaptive augmentation framework. We first unify existing basic augmentation operations into a unified formulation via structured transformation matrices. Building on this, we introduce AsarRec (Adaptive Sequential Augmentation for Robust Sequential Recommendation), which learns to generate transformation matrices by encoding user sequences into probabilistic transition matrices and projecting them into hard semi-doubly stochastic matrices via a differentiable Semi-Sinkhorn algorithm. To ensure that the learned augmentations benefit downstream performance, we jointly optimize three objectives: diversity, semantic invariance, and informativeness. Extensive experiments on three benchmark datasets under varying noise levels validate the effectiveness of AsarRec, demonstrating its superior robustness and consistent improvements.
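The Semi-Sinkhorn step is the most concrete part of the method, and a rough sketch helps make it tangible. The snippet below is an illustrative approximation, not the authors' code: it only shows standard log-domain Sinkhorn normalization turning a learned score matrix into a soft transformation matrix that can be applied to an item-embedding sequence (the paper's "hard semi-doubly stochastic" projection is more involved).

```python
import torch

def sinkhorn(scores: torch.Tensor, n_iters: int = 10, tau: float = 0.5) -> torch.Tensor:
    """Iteratively normalize a score matrix toward a doubly stochastic matrix (log domain)."""
    log_t = scores / tau
    for _ in range(n_iters):
        log_t = log_t - torch.logsumexp(log_t, dim=1, keepdim=True)  # normalize rows
        log_t = log_t - torch.logsumexp(log_t, dim=0, keepdim=True)  # normalize columns
    return log_t.exp()

# Apply the learned soft transformation to one item-embedding sequence of length L.
L, d = 6, 16
item_emb = torch.randn(L, d)
scores = torch.randn(L, L, requires_grad=True)   # in the paper these come from an encoder
T = sinkhorn(scores)                              # approximately doubly stochastic
augmented_view = T @ item_emb                     # differentiable "augmented" sequence
```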
From Feature Interaction to Feature Generation: A Generative Paradigm of CTR Prediction Models
Mingjia Yin, Junwei Pan, Hao Wang, Ximei Wang, Shangyu Zhang, Jie Jiang, Defu Li...
Core summary:

The paper studies embedding dimensional collapse and information redundancy in CTR prediction models caused by over-reliance on feature interactions over raw ID embeddings. The core idea is a supervised feature generation framework: an encoder builds hidden representations for each feature, a decoder regenerates all feature embeddings from them, and the click supervision signal drives the paradigm shift from feature interaction to feature generation.

Personalized recommendation rationale:

The paper directly targets CTR prediction, a core recommender-system task, proposing a paradigm shift from discriminative feature interaction to generative feature generation; it falls under Core Domain Advances and Direct LLM Applications.

2025-12-16 03:17:18 | arXiv:2512.14041v1 |
cs.IR
Full abstract:
Click-Through Rate (CTR) prediction, a core task in recommendation systems, aims to estimate the probability of users clicking on items. Existing models predominantly follow a discriminative paradigm, which relies heavily on explicit interactions between raw ID embeddings. However, this paradigm inherently renders them susceptible to two critical issues: embedding dimensional collapse and information redundancy, stemming from the over-reliance on feature interactions over raw ID embeddings. To address these limitations, we propose a novel Supervised Feature Generation (SFG) framework, shifting the paradigm from discriminative "feature interaction" to generative "feature generation". Specifically, SFG comprises two key components: an Encoder that constructs hidden embeddings for each feature, and a Decoder tasked with regenerating the feature embeddings of all features from these hidden representations. Unlike existing generative approaches that adopt self-supervised losses, we introduce a supervised loss to utilize the supervised signal, i.e., click or not, in the CTR prediction task. This framework exhibits strong generalizability: it can be seamlessly integrated with most existing CTR models, reformulating them under the generative paradigm. Extensive experiments demonstrate that SFG consistently mitigates embedding collapse and reduces information redundancy, while yielding substantial performance gains across various datasets and base models. The code is available at https://github.com/USTC-StarTeam/GE4Rec.
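Since the encoder/decoder/supervised-loss split is the heart of the abstract, here is a minimal toy sketch of that training setup; module names, sizes, and the simple linear layers are invented for illustration, and the real SFG design (see the linked repository) is certainly richer.

```python
import torch
import torch.nn as nn

class SFGHead(nn.Module):
    """Toy supervised-feature-generation wrapper (illustrative, not the paper's code)."""
    def __init__(self, n_fields: int, emb_dim: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Linear(n_fields * emb_dim, hidden)   # hidden representation per example
        self.decoder = nn.Linear(hidden, n_fields * emb_dim)   # regenerate all feature embeddings
        self.ctr_head = nn.Linear(hidden, 1)                   # click/no-click supervision

    def forward(self, feat_emb: torch.Tensor):
        # feat_emb: (batch, n_fields, emb_dim) raw ID embeddings
        flat = feat_emb.flatten(1)
        h = torch.relu(self.encoder(flat))
        recon = self.decoder(h).view_as(feat_emb)
        logit = self.ctr_head(h).squeeze(-1)
        return recon, logit

model = SFGHead(n_fields=4, emb_dim=8)
feats = torch.randn(32, 4, 8)
clicks = torch.randint(0, 2, (32,)).float()
recon, logit = model(feats)
# The generation (reconstruction) loss is tied to the supervised CTR task.
loss = nn.functional.binary_cross_entropy_with_logits(logit, clicks) \
       + nn.functional.mse_loss(recon, feats)
```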
DTRec: Learning Dynamic Reasoning Trajectories for Sequential Recommendation
Yifan Shao, Peilin Zhou, Shoujin Wang, Weizhi Zhang, Xu Cai, Sunghun Kim
Core summary:

The paper studies how to improve reasoning-based methods for sequential recommendation. Its core idea is the DTRec framework, which guides the reasoning direction with hierarchical process supervision and introduces an adaptive reasoning-halting mechanism to dynamically adjust reasoning depth, overcoming the limitations of static reasoning trajectories.

Personalized recommendation rationale:

The paper targets a core recommender-system problem with a novel dynamic-reasoning-trajectory method, matching both the core-domain-advances and direct-LLM-applications focus areas.

2025-12-16 03:04:43 | arXiv:2512.14036v1 |
cs.IR
Full abstract:
Inspired by advances in LLMs, reasoning-enhanced sequential recommendation performs multi-step deliberation before making final predictions, unlocking greater potential for capturing user preferences. However, current methods are constrained by static reasoning trajectories that are ill-suited for the diverse complexity of user behaviors. They suffer from two key limitations: (1) a static reasoning direction, which uses flat supervision signals misaligned with human-like hierarchical reasoning, and (2) a fixed reasoning depth, which inefficiently applies the same computational effort to all users, regardless of pattern complexity. This rigidity leads to suboptimal performance and significant computational waste. To overcome these challenges, we propose DTRec, a novel and effective framework that explores the Dynamic reasoning Trajectory for Sequential Recommendation along both direction and depth. To guide the direction, we develop Hierarchical Process Supervision (HPS), which provides coarse-to-fine supervisory signals to emulate the natural, progressive refinement of human cognitive processes. To optimize the depth, we introduce the Adaptive Reasoning Halting (ARH) mechanism that dynamically adjusts the number of reasoning steps by jointly monitoring three indicators. Extensive experiments on three real-world datasets demonstrate the superiority of our approach, achieving up to a 24.5% performance improvement over strong baselines while simultaneously reducing computational cost by up to 41.6%.
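The adaptive-halting idea lends itself to a small sketch. The indicators below (a confidence score and the relative change of the reasoning state) are stand-ins chosen for illustration; the paper monitors three indicators of its own.

```python
import torch

def reason_with_halting(state: torch.Tensor, step_fn, max_steps: int = 8,
                        conf_threshold: float = 0.9, eps: float = 1e-3):
    """Run reasoning steps until confidence is high enough or the state stops changing.

    step_fn(state) -> (new_state, confidence) stands in for one reasoning step.
    """
    for step in range(1, max_steps + 1):
        new_state, confidence = step_fn(state)
        delta = (new_state - state).norm() / (state.norm() + 1e-8)
        state = new_state
        if confidence >= conf_threshold or delta < eps:
            break
    return state, step

# Dummy step function: moves the state toward a target and reports a confidence score.
target = torch.ones(16)
def dummy_step(s):
    s_new = 0.5 * s + 0.5 * target
    return s_new, float(1.0 - (s_new - target).norm() / target.norm())

final_state, used_steps = reason_with_halting(torch.zeros(16), dummy_step)
print(used_steps)  # fewer steps for "easy" states, more for "hard" ones
```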
Intent-Guided Reasoning for Sequential Recommendation
Yifan Shao, Peilin Zhou
Core summary:

Topic: addressing the unstable, surface-level reasoning that arises in reasoning-enhanced sequential recommendation when the next target item is the only supervision. Core idea: an intent-guided reasoning framework that extracts high-level intents, decouples reasoning into intent deliberation and decision-making, and applies intent-consistency regularization, anchoring the reasoning process to explicit intent representations.

Personalized recommendation rationale:

The paper targets a core problem in sequential recommendation, proposing an intent-guided reasoning framework that improves robustness by decoupling the reasoning process; it falls under core-domain advances and direct LLM applications.

2025-12-16 03:00:32 | arXiv:2512.14034v1 |
cs.IR
Full abstract:
Sequential recommendation systems aim to capture users' evolving preferences from their interaction histories. Recent reasoning-enhanced methods have shown promise by introducing deliberate, chain-of-thought-like processes with intermediate reasoning steps. However, these methods rely solely on the next target item as supervision, leading to two critical issues: (1) reasoning instability--the process becomes overly sensitive to recent behaviors and spurious interactions like accidental clicks, and (2) surface-level reasoning--the model memorizes item-to-item transitions rather than understanding intrinsic behavior patterns. To address these challenges, we propose IGR-SR, an Intent-Guided Reasoning framework for Sequential Recommendation that anchors the reasoning process to explicitly extracted high-level intents. Our framework comprises three key components: (1) a Latent Intent Distiller (LID) that efficiently extracts multi-faceted intents using a frozen encoder with learnable tokens, (2) an Intent-aware Deliberative Reasoner (IDR) that decouples reasoning into intent deliberation and decision-making via a dual-attention architecture, and (3) an Intent Consistency Regularization (ICR) that ensures robustness by enforcing consistent representations across different intent views. Extensive experiments on three public datasets demonstrate that IGR-SR achieves an average 7.13% improvement over state-of-the-art baselines. Critically, under 20% behavioral noise, IGR-SR degrades only 10.4% compared to 16.2% and 18.6% for competing methods, validating the effectiveness and robustness of intent-guided reasoning.
VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse
Ying Nie, Kai Han, Hongguang Li, Hang Zhou, Tianyu Guo, Enhua Wu, Xinghao Chen, ...
Core summary:

The paper studies how to increase LLM representational capacity under a fixed parameter budget. Its core idea is an adaptive FFN with a width-reuse path and a depth-reuse path, where a difficulty-aware gate dynamically allocates compute, emulating sparse mixture-of-experts routing and deeper iterative processing without adding parameters.

Personalized recommendation rationale:

The paper proposes a new FFN architecture that achieves LLM parameter efficiency via adaptive wide-and-deep reuse, squarely in the core area of Transformer efficiency, with clear value for deploying large-scale recommender systems.

2025-12-16 16:08:23 | arXiv:2512.14531v1 |
cs.CL
Full abstract:
The rapid scaling of Large Language Models (LLMs) has achieved remarkable performance, but it also leads to prohibitive memory costs. Existing parameter-efficient approaches such as pruning and quantization mainly compress pretrained models without enhancing architectural capacity, thereby hitting the representational ceiling of the base model. In this work, we propose VersatileFFN, a novel feed-forward network (FFN) that enables flexible reuse of parameters in both width and depth dimensions within a fixed parameter budget. Inspired by the dual-process theory of cognition, VersatileFFN comprises two adaptive pathways: a width-versatile path that generates a mixture of sub-experts from a single shared FFN, mimicking sparse expert routing without increasing parameters, and a depth-versatile path that recursively applies the same FFN to emulate deeper processing for complex tokens. A difficulty-aware gating dynamically balances the two pathways, steering "easy" tokens through the efficient width-wise route and allocating deeper iterative refinement to "hard" tokens. Crucially, both pathways reuse the same parameters, so all additional capacity comes from computation rather than memory. Experiments across diverse benchmarks and model scales demonstrate the effectiveness of the method. The code will be available at https://github.com/huawei-noah/noah-research/tree/master/VersatileFFN.
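A minimal sketch of the wide-and-deep reuse idea, assuming a single shared FFN and a sigmoid difficulty gate; this is not the released implementation, just the control flow the abstract describes (all extra capacity comes from reusing the same weights).

```python
import torch
import torch.nn as nn

class VersatileFFNSketch(nn.Module):
    """Illustrative width/depth parameter reuse with a difficulty-aware gate (not the official code)."""
    def __init__(self, d_model: int, d_ff: int, depth_iters: int = 2):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.gate = nn.Linear(d_model, 1)    # scores token "difficulty"
        self.depth_iters = depth_iters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        wide = self.ffn(x)                    # width path: one pass through the shared FFN
        deep = x
        for _ in range(self.depth_iters):     # depth path: reuse the same FFN recursively
            deep = deep + self.ffn(deep)
        alpha = torch.sigmoid(self.gate(x))   # "hard" tokens lean on the deep path
        return (1 - alpha) * wide + alpha * deep

layer = VersatileFFNSketch(d_model=32, d_ff=64)
out = layer(torch.randn(4, 10, 32))           # extra capacity from compute, not from new parameters
```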
A Unified Sparse Attention via Multi-Granularity Compression
Siran Liu, Zane Cao, Yongchao He
Core summary:

The paper targets the quadratic computational bottleneck of Transformer self-attention on long sequences. Its core idea is the notion of composite tokens: sparse attention is constructed dynamically via multi-granularity compression and block-level selection, enabling efficient, hardware-friendly GPU execution.

Personalized recommendation rationale:

The paper proposes a unified sparse-attention mechanism that tackles the Transformer compute bottleneck via multi-granularity compression, squarely within the core area of Transformer architectural efficiency.

2025-12-16 04:42:31 | arXiv:2512.14082v1 |
cs.CL
Full abstract:
Efficient long-context understanding and reasoning are increasingly vital for large language model (LLM) applications such as multi-turn dialogue and program analysis. However, the core self-attention mechanism scales quadratically with sequence length, creating a fundamental computational bottleneck. Existing sparse attention methods alleviate this issue but face trade-offs: training-based methods are costly and cannot be directly applied as acceleration plugins for other models, while inference-time methods often compromise efficiency or cross-modal generality. To address these limitations, we present UniSparse, a unified mechanism that introduces the notion of composite tokens--compact representations that aggregate multi-granularity contextual information. Building on this abstraction, UniSparse dynamically constructs sparse attention through multi-granularity compression and block-level selection, enabling efficient and hardware-friendly execution on GPU. Across multiple modalities and tasks ranging from synthetic benchmarks to real-world applications, UniSparse consistently surpasses state-of-the-art sparse attention methods (e.g., MInference, XAttention, FlexPrefill) in both accuracy and efficiency, achieving at least 99% of full-attention accuracy and up to 2.61x faster attention computation than FlashAttention.
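The block-level selection logic can be illustrated compactly. The sketch below uses mean pooling as a single-granularity compression and plain top-k block selection for one query vector; the actual method builds multi-granularity composite tokens and runs in fused GPU kernels.

```python
import torch

def block_sparse_attention(q, k, v, block_size: int = 16, top_blocks: int = 2):
    """Toy block-selection attention: score mean-pooled key blocks, attend only to the top ones.

    q: (d,); k, v: (T, d). Only the selection logic is shown, not an efficient kernel.
    """
    T, d = k.shape
    n_blocks = T // block_size
    k_blocks = k[: n_blocks * block_size].view(n_blocks, block_size, d)
    v_blocks = v[: n_blocks * block_size].view(n_blocks, block_size, d)

    compressed = k_blocks.mean(dim=1)            # one coarse "composite token" per block
    block_scores = compressed @ q                # query-to-block relevance
    chosen = block_scores.topk(min(top_blocks, n_blocks)).indices

    k_sel = k_blocks[chosen].reshape(-1, d)      # only selected blocks enter attention
    v_sel = v_blocks[chosen].reshape(-1, d)
    attn = torch.softmax(k_sel @ q / d ** 0.5, dim=0)
    return attn @ v_sel

out = block_sparse_attention(torch.randn(64), torch.randn(128, 64), torch.randn(128, 64))
```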
Dynamic Context Selection for Retrieval-Augmented Generation: Mitigating Distractors and Positional Bias
Malika Iratni, Mohand Boughanem, Taoufiq Dkaki
Core summary:

Topic: information omission, distractor documents, and positional bias caused by fixed retrieval strategies in retrieval-augmented generation. Core idea: after analyzing distractor and position effects, a query-based dynamic context-size classifier predicts the optimal number of documents to retrieve, improving generation quality.

Personalized recommendation rationale:

The paper addresses a core retrieval-strategy problem in RAG with a dynamic context-selection method that directly improves retrieval-augmented generation, and is highly relevant to retrieval and ranking optimization in search and recommendation.

2025-12-16 11:30:40 | arXiv:2512.14313v1 |
cs.IR
Full abstract:
Retrieval Augmented Generation (RAG) enhances language model performance by incorporating external knowledge retrieved from large corpora, which makes it highly suitable for tasks such as open domain question answering. Standard RAG systems typically rely on a fixed top k retrieval strategy, which can either miss relevant information or introduce semantically irrelevant passages, known as distractors, that degrade output quality. Additionally, the positioning of retrieved passages within the input context can influence the model attention and generation outcomes. Context placed in the middle tends to be overlooked, which is an issue known as the "lost in the middle" phenomenon. In this work, we systematically analyze the impact of distractors on generation quality, and quantify their effects under varying conditions. We also investigate how the position of relevant passages within the context window affects their influence on generation. Building on these insights, we propose a context-size classifier that dynamically predicts the optimal number of documents to retrieve based on query-specific informational needs. We integrate this approach into a full RAG pipeline, and demonstrate improved performance over fixed k baselines.
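A context-size classifier of this kind can be prototyped in a few lines. The sketch below is a hypothetical stand-in, not the paper's model: TF-IDF features and a logistic-regression classifier trained on queries labeled with whichever retrieval depth worked best offline, then used to pick k per query instead of a fixed top-k.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: queries labeled with the retrieval depth that worked best offline.
queries = ["capital of France",
           "compare GDP growth of EU countries since 2000",
           "who wrote Hamlet",
           "summarize the causes and consequences of the 2008 crisis"]
best_k = [1, 8, 1, 8]   # small k for factoid queries, larger k for multi-document needs

context_size_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
context_size_clf.fit(queries, best_k)

def retrieve(query: str, retriever):
    k = int(context_size_clf.predict([query])[0])   # per-query k instead of a fixed top-k
    return retriever(query, k)
```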
Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Yuxiong He,...
Core summary:

The paper studies how to efficiently convert autoregressive language models into parallel decoders to speed up inference. Its core idea, the Jacobi Forcing training paradigm, trains the model on its own parallel-decoding trajectories, smoothly shifting an autoregressive model into a parallel decoder while preserving its pretrained causal-inference properties.

Personalized recommendation rationale:

The paper introduces a new parallel-decoding acceleration method aimed directly at the core bottleneck of LLM inference efficiency, a frontier advance in Transformer efficiency with clear value for real-time search and recommendation systems.

2025-12-16 18:45:18 | arXiv:2512.14681v1 |
cs.CL
Full abstract:
Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup compared to AR models due to a pretrain-to-posttrain mismatch. Specifically, the masked data distribution in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders the integration of exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm where models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. The models trained under this paradigm, Jacobi Forcing Model, achieves 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on Jacobi Forcing Models' trajectory characteristics, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.
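Jacobi-style parallel decoding itself is easy to sketch: initialize a block of draft tokens, re-predict every position in one parallel pass, and iterate to a fixed point that matches greedy autoregressive output. The toy below uses a fake next-token function purely to show the fixed-point loop; it says nothing about the paper's training recipe.

```python
import torch

def jacobi_decode_block(prefix, next_token_fn, block_len: int = 8, max_iters: int = 16):
    """Fixed-point (Jacobi) iteration over a block of draft tokens.

    next_token_fn(tokens) returns the greedy next-token prediction at every position,
    computed in a single parallel forward pass.
    """
    draft = torch.zeros(block_len, dtype=torch.long)              # arbitrary initialization
    for _ in range(max_iters):
        seq = torch.cat([prefix, draft])
        preds = next_token_fn(seq)[len(prefix) - 1: len(seq) - 1]  # prediction for each draft slot
        if torch.equal(preds, draft):                              # fixed point == greedy AR output
            break
        draft = preds
    return draft

# Toy "model": the next token is (previous token + 1) mod 100.
def toy_next_token(seq):
    return (seq + 1) % 100

out = jacobi_decode_block(torch.tensor([5]), toy_next_token)
print(out)  # converges to the tokens greedy AR decoding would produce, in fewer sequential passes
```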
Dual Language Models: Balancing Training Efficiency and Overfitting Resilience
David Samuel, Lucas Georges Gabriel Charpentier
Core summary:

The paper studies how to balance training efficiency and overfitting resilience in language models. Its core method combines the training efficiency of autoregressive modeling with the overfitting resistance of masked-diffusion modeling through dual-objective training, and explores the optimal mixing ratio between the two objectives.

Personalized recommendation rationale:

By combining autoregressive and masked-diffusion training objectives, the paper offers a new way to balance training efficiency and generalization in language models, tied directly to Transformer architecture optimization and core LLM advances.

2025-12-16 16:25:33 | arXiv:2512.14549v1 |
cs.CL cs.AI
Full abstract:
This paper combines autoregressive and masked-diffusion training objectives without any architectural modifications, resulting in flexible language models that outperform single-objective models. Autoregressive modeling has been a popular approach, partly because of its training efficiency; however, that comes at the cost of sensitivity to overfitting. On the other hand, masked-diffusion models are less efficient to train while being more resilient to overfitting. In this work, we demonstrate that dual-objective training achieves the best of both worlds. To derive the optimal ratio between both objectives, we train and evaluate 50 language models under varying levels of data repetition. We show that it is optimal to combine both objectives under all evaluated settings and that the optimal ratio is similar whether targeting autoregressive or masked-diffusion downstream performance.
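The dual-objective recipe reduces to mixing two losses on the same batch. The sketch below is an assumption-laden illustration: the 0.7/0.3 mixing ratio, the 30% mask rate, and the tiny stand-in model are placeholders, not the paper's tuned values.

```python
import torch
import torch.nn.functional as F

def dual_objective_loss(model, tokens, mask_id: int, ar_ratio: float = 0.7):
    """Blend an autoregressive loss and a masked-denoising loss on the same batch (illustrative).

    model(input_ids) -> logits of shape (batch, seq, vocab).
    """
    # Autoregressive objective: predict token t+1 from tokens <= t.
    ar_logits = model(tokens[:, :-1])
    ar_loss = F.cross_entropy(ar_logits.reshape(-1, ar_logits.size(-1)),
                              tokens[:, 1:].reshape(-1))

    # Masked-diffusion-style objective: corrupt a random subset and reconstruct it.
    mask = torch.rand_like(tokens, dtype=torch.float) < 0.3
    corrupted = tokens.masked_fill(mask, mask_id)
    md_logits = model(corrupted)
    md_loss = F.cross_entropy(md_logits[mask], tokens[mask])

    return ar_ratio * ar_loss + (1 - ar_ratio) * md_loss

# Tiny stand-in "model" so the sketch runs end to end.
vocab, mask_id = 100, 0
dummy = torch.nn.Sequential(torch.nn.Embedding(vocab, 32), torch.nn.Linear(32, vocab))
loss = dual_objective_loss(dummy, torch.randint(1, vocab, (4, 16)), mask_id)
```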
SASQ: Static Activation Scaling for Quantization-Aware Training in Large Language Models
Shizhuo Mao, Song Chen, Yi Kang
Core summary:

The paper studies the accuracy-efficiency trade-off of activation quantization in LLM deployment and proposes the SASQ framework: by optimizing only the quantization factors (leaving pretrained weights untouched) and adaptively truncating outliers, it achieves accurate static inference while preserving deployment efficiency.

Personalized recommendation rationale:

The paper proposes a lightweight training framework dedicated to LLM activation quantization, balancing deployment efficiency and accuracy via static quantization-factor optimization and directly improving LLM inference efficiency on edge devices; a core LLM technology advance.

2025-12-16 15:12:34 | arXiv:2512.14481v1 |
cs.CL cs.AI
Full abstract:
Large language models (LLMs) excel at natural language tasks but face deployment challenges due to their growing size outpacing GPU memory advancements. Model quantization mitigates this issue by lowering weight and activation precision, but existing solutions face fundamental trade-offs: dynamic quantization incurs high computational overhead and poses deployment challenges on edge devices, while static quantization sacrifices accuracy. Existing approaches of quantization-aware training (QAT) further suffer from weight training costs. We propose SASQ: a lightweight QAT framework specifically tailored for activation quantization factors. SASQ exclusively optimizes only the quantization factors (without changing pre-trained weights), enabling static inference with high accuracy while maintaining deployment efficiency. SASQ adaptively truncates some outliers, thereby reducing the difficulty of quantization while preserving the distributional characteristics of the activations. SASQ not only surpasses existing SOTA quantization schemes but also outperforms the corresponding FP16 models. On LLaMA2-7B, it achieves 5.2% lower perplexity than QuaRot and 4.7% lower perplexity than the FP16 model on WikiText2.
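The core trick, training only the activation-quantization factors with outlier clipping, can be sketched as a small module with a single learnable clip value and a straight-through estimator; this is a generic illustration of the idea, not SASQ's exact formulation.

```python
import torch
import torch.nn as nn

class StaticActQuant(nn.Module):
    """Learnable static activation quantizer: only the clipping range trains, weights stay frozen."""
    def __init__(self, init_clip: float = 6.0, n_bits: int = 8):
        super().__init__()
        self.clip = nn.Parameter(torch.tensor(init_clip))   # the only trainable quantization factor
        self.qmax = 2 ** (n_bits - 1) - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        clip = self.clip.abs()
        scale = clip / self.qmax
        x = torch.clamp(x, -clip, clip)                      # adaptively truncate outliers
        q = torch.round(x / scale)
        x_hat = q * scale                                    # dequantized activation
        # Straight-through estimator: forward uses x_hat, gradients flow to the clip parameter.
        return x + (x_hat - x).detach()

quant = StaticActQuant()
act = torch.randn(2, 8) * 3
print(quant(act))
```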
RePo: Language Models with Context Re-Positioning
Huayang Li, Tianyu Zhao, Richard Sproat
Core summary:

The paper studies how the fixed positional indexing of standard Transformer architectures introduces extraneous cognitive load when processing context. Its core idea, the RePo mechanism, uses a differentiable module to learn context-dependent token positions rather than predefined integer positions, better capturing the intrinsic structure of the input.

Personalized recommendation rationale:

The proposed RePo mechanism assigns positional encodings dynamically via a differentiable module, directly improving a core Transformer component; it is a frontier advance in Transformer efficiency and novel attention mechanisms, with clear potential for long sequences and structured data in recommendation and search.

2025-12-16 13:30:30 | arXiv:2512.14391v1 |
cs.LG cs.AI cs.CL
Full abstract:
In-context learning is fundamental to modern Large Language Models (LLMs); however, prevailing architectures impose a rigid and fixed contextual structure by assigning linear or constant positional indices. Drawing on Cognitive Load Theory (CLT), we argue that this uninformative structure increases extraneous cognitive load, consuming finite working memory capacity that should be allocated to deep reasoning and attention allocation. To address this, we propose RePo, a novel mechanism that reduces extraneous load via context re-positioning. Unlike standard approaches, RePo utilizes a differentiable module, f_φ, to assign token positions that capture contextual dependencies, rather than relying on a pre-defined integer range. By continually pre-training on the OLMo-2 1B backbone, we demonstrate that RePo significantly enhances performance on tasks involving noisy contexts, structured data, and longer context length, while maintaining competitive performance on general short-context tasks. Detailed analysis reveals that RePo successfully allocates higher attention to distant but relevant information, assigns positions in dense and non-linear space, and captures the intrinsic structure of the input context. Our code is available at https://github.com/SakanaAI/repo.
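To make the idea of a differentiable position assigner concrete, here is a hypothetical sketch: a tiny network predicts a positive increment per token, and a cumulative sum gives monotone, context-dependent real-valued positions that a RoPE/ALiBi-style encoding could consume. The module names and the monotonicity choice are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn

class ContextRePositioner(nn.Module):
    """Hedged sketch: a differentiable module predicts real-valued token positions from content."""
    def __init__(self, d_model: int):
        super().__init__()
        self.f_phi = nn.Sequential(nn.Linear(d_model, d_model), nn.Tanh(), nn.Linear(d_model, 1))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model). Predict a positive step per token and accumulate it,
        # so positions stay ordered but spacing adapts to the context instead of being 0, 1, 2, ...
        delta = torch.nn.functional.softplus(self.f_phi(h)).squeeze(-1)
        return delta.cumsum(dim=1)   # continuous positions to feed a positional encoding

positions = ContextRePositioner(d_model=32)(torch.randn(2, 10, 32))
```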
Ladder Up, Memory Down: Low-Cost Fine-Tuning With Side Nets
Estelle Zheng, Nathan Cerisara, Sébastien Warichet, Emmanuel Helbert, Christophe...
Core summary:

The paper tackles the memory bottleneck of fine-tuning large language models. The core idea is to replace conventional parameter updates with a lightweight side-network architecture, sharply reducing backward-pass memory while keeping compute scaling comparable.

Personalized recommendation rationale:

The paper presents a parameter-efficient fine-tuning method whose side-network architecture substantially lowers memory usage, a key enabling technique for Transformer efficiency and LLM deployment.

2025-12-16 09:47:34 | arXiv:2512.14237v1 |
cs.CL cs.LG
Full abstract:
Fine-tuning large language models (LLMs) is often limited by the memory available on commodity GPUs. Parameter-efficient fine-tuning (PEFT) methods such as QLoRA reduce the number of trainable parameters, yet still incur high memory usage induced by the backward pass in the full model. We revisit Ladder Side Tuning (LST), a rarely explored PEFT technique that adds a lightweight side network, and show that it matches QLoRA's compute scaling slope while cutting peak memory by 50%. Across different downstream benchmarks spanning natural language understanding, mathematical and LLM-critic tasks, LST has competitive performance with QLoRA's accuracy on average while being much more memory-efficient. This efficiency enables fine-tuning of 7B-parameter models on a single 12 GB consumer GPU with 2k-token contexts, requiring no gradient checkpointing, conditions under which QLoRA exhausts memory. Beyond memory efficiency, we also establish scaling laws showing that LST scales similarly to QLoRA. We exploit Ladder's architectural flexibility by introducing xLadder, a depth-extended variant that increases effective depth via cross-connections and shortens chain-of-thought (CoT) at fixed parameter count. Ladder is strong when memory is the bottleneck; xLadder builds on this by enabling deeper reasoning without additional memory overhead.
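The memory saving comes from never backpropagating through the frozen backbone. A minimal ladder-side-tuning sketch, with linear layers standing in for transformer blocks, shows where the gradient path is cut; it is illustrative only.

```python
import torch
import torch.nn as nn

class LadderSideNet(nn.Module):
    """Minimal ladder-side-tuning sketch: a frozen backbone feeds a small trainable side net."""
    def __init__(self, backbone_layers, d_model: int, d_side: int = 32, n_out: int = 2):
        super().__init__()
        self.backbone = nn.ModuleList(backbone_layers)
        for p in self.backbone.parameters():
            p.requires_grad_(False)                        # no gradients through the big model
        self.down = nn.ModuleList(nn.Linear(d_model, d_side) for _ in backbone_layers)
        self.side = nn.ModuleList(nn.Linear(d_side, d_side) for _ in backbone_layers)
        self.head = nn.Linear(d_side, n_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                              # backbone runs inference-only
            hiddens, h = [], x
            for layer in self.backbone:
                h = layer(h)
                hiddens.append(h)
        s = torch.zeros(x.size(0), self.head.in_features, device=x.device)
        for hid, down, side in zip(hiddens, self.down, self.side):
            s = torch.relu(side(s) + down(hid))            # ladder connection from each layer
        return self.head(s)

backbone = [nn.Linear(64, 64) for _ in range(4)]           # stand-in for transformer blocks
model = LadderSideNet(backbone, d_model=64)
model(torch.randn(8, 64)).sum().backward()                 # only side-net parameters get gradients
```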
CogMem: A Cognitive Memory Architecture for Sustained Multi-Turn Reasoning in Large Language Models
Yiran Zhang, Jincheng Hu, Mark Dras, Usman Naseem
Core summary:

The paper studies reasoning bias and memory decay in multi-turn LLM interactions. The core idea is a cognitively inspired three-layer memory architecture (long-term memory, direct-access memory, and a focus-of-attention mechanism) that sustains coherent multi-turn reasoning through structured, persistent memory.

Personalized recommendation rationale:

The proposed CogMem architecture uses layered memory to counter memory decay and context bloat in multi-turn LLM reasoning; it is a core LLM architectural enhancement with clear value for optimizing multi-turn interactions in recommendation and search.

2025-12-16 06:01:08 | arXiv:2512.14118v1 |
cs.CL
Full abstract:
Large language models (LLMs) excel at single-turn reasoning but often lose accuracy and coherence over extended, multi-turn interactions. Recent evaluations such as TurnBench highlight recurring failure modes: reasoning bias, task drift, hallucination, overconfidence, and memory decay. Current approaches typically append full conversational histories, causing unbounded context growth, higher computational costs, and degraded reasoning efficiency. We introduce CogMem, a cognitively inspired, memory-augmented LLM architecture that supports sustained iterative reasoning through structured, persistent memory. CogMem incorporates three layers: a Long-Term Memory (LTM) that consolidates cross-session reasoning strategies; a Direct Access (DA) memory that maintains session-level notes and retrieves relevant long-term memories; and a Focus of Attention (FoA) mechanism that dynamically reconstructs concise, task-relevant context at each turn. Experiments on TurnBench show that this layered design mitigates reasoning failures, controls context growth, and improves consistency across extended reasoning chains, moving toward more reliable, human-like reasoning in LLMs.
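The three-layer memory is essentially a data-structure idea, sketched below with invented method names and a naive word-overlap scorer for the focus-of-attention step; the point is only that each turn rebuilds a small context instead of appending the full history.

```python
from dataclasses import dataclass, field

@dataclass
class CogMemSketch:
    """Toy three-layer memory in the spirit of the paper (names and scoring are assumptions)."""
    long_term: list = field(default_factory=list)       # cross-session strategies
    direct_access: list = field(default_factory=list)   # session-level notes

    def consolidate(self, strategy: str):
        self.long_term.append(strategy)

    def note(self, observation: str):
        self.direct_access.append(observation)

    def focus_of_attention(self, query: str, budget: int = 3) -> str:
        # Rebuild a compact, task-relevant context each turn instead of appending full history.
        scored = sorted(self.long_term + self.direct_access,
                        key=lambda m: len(set(m.lower().split()) & set(query.lower().split())),
                        reverse=True)
        return "\n".join(scored[:budget])

mem = CogMemSketch()
mem.consolidate("verify each constraint before committing to an answer")
mem.note("turn 3: candidate answer violated the row constraint")
prompt_context = mem.focus_of_attention("which constraint did the last candidate violate?")
```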
What Affects the Effective Depth of Large Language Models?
Yi Hu, Cai Zhou, Muhan Zhang
Core summary:

The paper examines the diminishing returns of scaling LLM depth. Its central finding is that current LLMs broadly underuse their layers: effective depth does not grow meaningfully with model scale, training regime, or task difficulty, revealing an inherent architectural inefficiency.

Personalized recommendation rationale:

The paper studies a core LLM architectural-efficiency question; its findings on effective depth and layer under-utilization offer useful guidance for improving Transformer efficiency and for optimizing models in recommendation and search.

2025-12-16 04:07:17 | arXiv:2512.14064v1 |
cs.CL
Full abstract:
The scaling of large language models (LLMs) emphasizes increasing depth, yet performance gains diminish with added layers. Prior work introduces the concept of "effective depth", arguing that deeper models fail to fully utilize their layers for meaningful computation. Building on this, we systematically study how effective depth varies with model scale, training type, and task difficulty. First, we analyze the model behavior of Qwen-2.5 family (1.5B-32B) and find that while the number of effective layers grows with model size, the effective depth ratio remains stable. Besides, comparisons between base and corresponding long-CoT models show no increase in effective depth, suggesting that improved reasoning stems from longer context rather than deeper per-token computation. Furthermore, evaluations across tasks of varying difficulty indicate that models do not dynamically use more layers for harder problems. Our results suggest that current LLMs underuse available depth across scales, training paradigms and tasks of varying difficulties, pointing out research opportunities on increasing the layer utilization rate of LLMs, model pruning, and early exiting. Our code is released at https://github.com/AheadOFpotato/what_affects_effective_depth.
HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
HyperAI Team, Yuchen Liu, Kaiyang Han, Zhiqiang Xia, Yuhang Dong, Chen Song, Kan...
Core summary:

The paper studies how to deploy multimodal large language models efficiently on edge devices. Its core method uses an image-tiling strategy to cap peak memory and introduces a Visual Resolution Compressor and Dual Consistency Learning, enabling dynamic switching between visual encoding branches to eliminate redundant computation.

Personalized recommendation rationale:

The paper focuses on Transformer efficiency and dynamic multimodal modeling, mapping directly to the "enabling Transformer tech" and "VLM analogy for heterogeneous data" focus areas, with a clear edge-device deployment scenario.

2025-12-16 03:36:41 | arXiv:2512.14052v1 |
cs.CV cs.CL
Full abstract:
Current multimodal large language models possess strong perceptual and reasoning capabilities; however, high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.
Unleashing the Power of Image-Tabular Self-Supervised Learning via Breaking Cross-Tabular Barriers
Yibing Fu, Yunpeng Zhao, Zhitao Zeng, Cheng Chen, Yueming Jin
Core summary:

The paper studies how to break cross-tabular barriers for effective multimodal self-supervised learning. Its core method is a semantics-aware tabular modeling mechanism that uses column headers as semantic cues and a prototype-guided mixture-of-linear-layers module to handle heterogeneous tabular data, learning transferable cross-cohort medical knowledge representations.

Personalized recommendation rationale:

The proposed cross-tabular self-supervised framework CITab handles heterogeneous tabular data through semantics-aware modeling and a prototype-guided mixture-of-linear layer, corresponding directly to the "VLM analogy for heterogeneous data" focus, with clear potential for transfer and scalability.

2025-12-16 02:47:08 | arXiv:2512.14026v1 |
cs.CV
Full abstract:
Multi-modal learning integrating medical images and tabular data has significantly advanced clinical decision-making in recent years. Self-Supervised Learning (SSL) has emerged as a powerful paradigm for pretraining these models on large-scale unlabeled image-tabular data, aiming to learn discriminative representations. However, existing SSL methods for image-tabular representation learning are often confined to specific data cohorts, mainly due to their rigid tabular modeling mechanisms when modeling heterogeneous tabular data. This inter-tabular barrier hinders the multi-modal SSL methods from effectively learning transferrable medical knowledge shared across diverse cohorts. In this paper, we propose a novel SSL framework, namely CITab, designed to learn powerful multi-modal feature representations in a cross-tabular manner. We design the tabular modeling mechanism from a semantic-awareness perspective by integrating column headers as semantic cues, which facilitates transferrable knowledge learning and the scalability in utilizing multiple data sources for pretraining. Additionally, we propose a prototype-guided mixture-of-linear layer (P-MoLin) module for tabular feature specialization, empowering the model to effectively handle the heterogeneity of tabular data and explore the underlying medical concepts. We conduct comprehensive evaluations on Alzheimer's disease diagnosis task across three publicly available data cohorts containing 4,461 subjects. Experimental results demonstrate that CITab outperforms state-of-the-art approaches, paving the way for effective and scalable cross-tabular multi-modal learning.
MMGR: Multi-Modal Generative Reasoning
Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Pa...
Core summary:

The paper addresses the lack of reliable evaluation of physical, logical, and spatial reasoning in current video/image generation models. Its core contribution is MMGR, a structured evaluation framework built on five reasoning abilities (physical, logical, 3D spatial, 2D spatial, temporal) for diagnosing reasoning deficits of generative models in abstract reasoning, embodied navigation, and physical commonsense.

Personalized recommendation rationale:

MMGR's core idea of evaluating generative models through structured reasoning abilities is closely related to the VLM-analogy direction of unified modeling over heterogeneous data, and offers a methodological reference for assessing user-intent understanding and content consistency in search and recommendation.

2025-12-16 18:58:04 | arXiv:2512.14691v1 |
cs.CL cs.CV
Full abstract:
Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.
Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chen...
Core summary:

The paper studies how to efficiently convert pretrained autoregressive language models into diffusion language models for faster inference. Its core methods are: (1) a block-wise attention pattern that better preserves the pretrained weight distribution, and (2) a position-dependent masking strategy that closes the train-test distribution gap.

Personalized recommendation rationale:

The paper focuses on Transformer efficiency (improved attention patterns) and AR-to-diffusion conversion, tying directly to the "enabling Transformer tech" and "enabling LLM tech" focus areas.

2025-12-16 04:12:17 | arXiv:2512.14067v1 |
cs.CL cs.AI cs.LG
Full abstract:
Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.
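The position-dependent masking strategy is simple to illustrate: later positions get a higher masking probability so training better mimics the largely left-to-right order in which tokens are resolved at test time. The linear schedule below is an assumed placeholder, not the paper's profile.

```python
import torch

def position_dependent_mask(seq_len: int, batch: int, p_min: float = 0.1, p_max: float = 0.9):
    """Mask later tokens with higher probability to mimic left-to-right test-time decoding."""
    probs = torch.linspace(p_min, p_max, seq_len)   # later positions masked more often
    return torch.rand(batch, seq_len) < probs       # boolean mask per training example

mask = position_dependent_mask(seq_len=12, batch=4)
```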
Structure-Aware Decoding Mechanisms for Complex Entity Extraction with Large-Scale Language Models
Zhimin Qiu, Di Wu, Feng Liu, Chenrui Hu, Yuxiao Wang
Core summary:

The paper tackles the difficulty of preserving semantic integrity and structural consistency in complex nested and overlapping entity extraction. Its core method introduces a candidate-span generation mechanism and structured attention modeling, using hierarchical structural constraints during decoding to jointly model entity boundaries, hierarchical relations, and cross-dependencies.

Personalized recommendation rationale:

The paper proposes an LLM-based structure-aware decoding mechanism for complex entity extraction; its core techniques (structured attention modeling, hierarchical constraints) fall under Transformer attention-mechanism optimization and have direct value for information extraction in search and recommendation systems.

2025-12-16 00:40:06 | arXiv:2512.13980v1 |
cs.CL
Full abstract:
This paper proposes a structure-aware decoding method based on large language models to address the difficulty of traditional approaches in maintaining both semantic integrity and structural consistency in nested and overlapping entity extraction tasks. The method introduces a candidate span generation mechanism and structured attention modeling to achieve unified modeling of entity boundaries, hierarchical relationships, and cross-dependencies. The model first uses a pretrained language model to obtain context-aware semantic representations, then captures multi-granular entity span features through candidate representation combinations, and introduces hierarchical structural constraints during decoding to ensure consistency between semantics and structure. To enhance stability in complex scenarios, the model jointly optimizes classification loss and structural consistency loss, maintaining high recognition accuracy under multi-entity co-occurrence and long-sentence dependency conditions. Experiments conducted on the ACE 2005 dataset demonstrate significant improvements in Accuracy, Precision, Recall, and F1-Score, particularly in nested and overlapping entity recognition, where the model shows stronger boundary localization and structural modeling capability. This study verifies the effectiveness of structure-aware decoding in complex semantic extraction tasks, provides a new perspective for developing language models with hierarchical understanding, and establishes a methodological foundation for high-precision information extraction.
Step-Tagging: Toward controlling the generation of Language Reasoning Models through step monitoring
Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, John D. Kelleher
Personalized recommendation rationale:

The paper concerns controlling language-model generation and may fall under core LLM technology, but the title does not indicate a specific application to recommendation, search, or advertising. Step-monitoring could help improve the generation of recommendation explanations or search results, though the relevance is indirect.

2025-12-16 12:01:16 | arXiv:2512.14332v1 |
cs.CL cs.AI
Full abstract:
The field of Language Reasoning Models (LRMs) has been very active over the past few years, with advances in training and inference techniques enabling LRMs to reason longer and more accurately. However, a growing body of studies shows that LRMs are still inefficient, over-generating verification and reflection steps. To address this challenge, we introduce the Step-Tagging framework, a lightweight sentence-classifier enabling real-time annotation of the type of reasoning steps that an LRM is generating. To monitor reasoning behaviors, we introduced ReasonType: a novel taxonomy of reasoning steps. Building on this framework, we demonstrated that online monitoring of the count of specific steps can produce effective interpretable early stopping criteria of LRM inferences. We evaluate the Step-tagging framework on three open-source reasoning models across standard benchmark datasets: MATH500, GSM8K, AIME and non-mathematical tasks (GPQA and MMLU-Pro). We achieve 20 to 50% token reduction while maintaining comparable accuracy to standard generation, with the largest gains observed on more computation-heavy tasks. This work offers a novel way to increase control over the generation of LRMs, and a new tool to study behaviors of LRMs.
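An early-stopping rule built on step tags can be as simple as counting reflective steps. The tag names and threshold below are invented for illustration and only gesture at how a ReasonType-style taxonomy might be consumed.

```python
from collections import Counter

# Hypothetical step tags in the spirit of a ReasonType-like taxonomy (names are assumptions).
REFLECTIVE_TAGS = {"verification", "reflection"}

def should_stop_early(step_tags, max_reflective: int = 3) -> bool:
    """Stop generation once the model has produced too many verification/reflection steps."""
    counts = Counter(step_tags)
    return sum(counts[t] for t in REFLECTIVE_TAGS) >= max_reflective

tags_so_far = ["derivation", "verification", "derivation", "verification", "reflection"]
print(should_stop_early(tags_so_far))  # True: further re-checking is unlikely to change the answer
```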
OUSAC: Optimized Guidance Scheduling with Adaptive Caching for DiT Acceleration
Ruitong Sun, Tianze Yang, Wei Niu, Jin Sun
Personalized recommendation rationale:

The paper concerns acceleration of diffusion Transformers (DiT) and sits in the Transformer efficiency area, related to "Enabling Transformer Tech". However, diffusion models mainly serve AIGC tasks such as image generation, so the link to recommendation, search, or advertising is weak unless its techniques (e.g., scheduling and caching optimization) can be shown to transfer to sequence modeling or inference acceleration in those settings.

2025-12-16 05:11:54 | arXiv:2512.14096v1 |
cs.CV
Full abstract:
Diffusion models have emerged as the dominant paradigm for high-quality image generation, yet their computational expense remains substantial due to iterative denoising. Classifier-Free Guidance (CFG) significantly enhances generation quality and controllability but doubles the computation by requiring both conditional and unconditional forward passes at every timestep. We present OUSAC (Optimized gUidance Scheduling with Adaptive Caching), a framework that accelerates diffusion transformers (DiT) through systematic optimization. Our key insight is that variable guidance scales enable sparse computation: adjusting scales at certain timesteps can compensate for skipping CFG at others, enabling both fewer total sampling steps and fewer CFG steps while maintaining quality. However, variable guidance patterns introduce denoising deviations that undermine standard caching methods, which assume constant CFG scales across steps. Moreover, different transformer blocks are affected at different levels under dynamic conditions. This paper develops a two-stage approach leveraging these insights. Stage-1 employs evolutionary algorithms to jointly optimize which timesteps to skip and what guidance scale to use, eliminating up to 82% of unconditional passes. Stage-2 introduces adaptive rank allocation that tailors calibration efforts per transformer block, maintaining caching effectiveness under variable guidance. Experiments demonstrate that OUSAC significantly outperforms state-of-the-art acceleration methods, achieving 53% computational savings with 15% quality improvement on DiT-XL/2 (ImageNet 512x512), 60% savings with 16.1% improvement on PixArt-alpha (MSCOCO), and 5x speedup on FLUX while improving CLIP Score over the 50-step baseline.
SPARQL-LLM: Real-Time SPARQL Query Generation from Natural Language Questions
Panayiotis Smeros, Vincent Emonet, Ruijie Wang, Ana-Claudia Sima, Tarcisio Mende...
Personalized recommendation rationale:

The paper applies LLMs to structured query generation and counts as a direct LLM application. However, SPARQL is mainly used for knowledge-graph querying, which is only weakly related to the typical data formats of recommendation, search, or advertising (e.g., user-item interactions, contextual features), so the potential application scenarios are limited.

2025-12-16 10:39:46 | arXiv:2512.14277v1 |
cs.IR cs.AI cs.CL
Full abstract:
The advent of large language models is contributing to the emergence of novel approaches that promise to better tackle the challenge of generating structured queries, such as SPARQL queries, from natural language. However, these new approaches mostly focus on response accuracy over a single source while ignoring other evaluation criteria, such as federated query capability over distributed data stores, as well as runtime and cost to generate SPARQL queries. Consequently, they are often not production-ready or easy to deploy over (potentially federated) knowledge graphs with good accuracy. To mitigate these issues, in this paper, we extend our previous work and describe and systematically evaluate SPARQL-LLM, an open-source and triplestore-agnostic approach, powered by lightweight metadata, that generates SPARQL queries from natural language text. First, we describe its architecture, which consists of dedicated components for metadata indexing, prompt building, and query generation and execution. Then, we evaluate it based on a state-of-the-art challenge with multilingual questions, and a collection of questions from three of the most prevalent knowledge graphs within the field of bioinformatics. Our results demonstrate a substantial increase of 24% in the F1 Score on the state-of-the-art challenge, adaptability to high-resource languages such as English and Spanish, as well as ability to form complex and federated bioinformatics queries. Furthermore, we show that SPARQL-LLM is up to 36x faster than other systems participating in the challenge, while costing a maximum of $0.01 per question, making it suitable for real-time, low-cost text-to-SPARQL applications. One such application deployed over real-world decentralized knowledge graphs can be found at https://www.expasy.org/chat.
Neurosymbolic Inference On Foundation Models For Remote Sensing Text-to-image Retrieval With Complex Queries
Emanuele Mezzi, Gertjan Burghouts, Maarten Kruithof
Personalized recommendation rationale:

The title concerns foundation models and text-to-image retrieval, which is loosely related to direct LLM applications, but it targets remote sensing, a clearly domain-specific application outside the scope of interest. Despite the foundation-model angle, there is no clear potential connection to recommendation, search, or advertising.

2025-12-16 05:33:44 | arXiv:2512.14102v1 |
cs.CV cs.AI cs.IR
Full abstract:
Text-to-image retrieval in remote sensing (RS) has advanced rapidly with the rise of large vision-language models (LVLMs) tailored for aerial and satellite imagery, culminating in remote sensing large vision-language models (RS-LVLMS). However, limited explainability and poor handling of complex spatial relations remain key challenges for real-world use. To address these issues, we introduce RUNE (Reasoning Using Neurosymbolic Entities), an approach that combines Large Language Models (LLMs) with neurosymbolic AI to retrieve images by reasoning over the compatibility between detected entities and First-Order Logic (FOL) expressions derived from text queries. Unlike RS-LVLMs that rely on implicit joint embeddings, RUNE performs explicit reasoning, enhancing performance and interpretability. For scalability, we propose a logic decomposition strategy that operates on conditioned subsets of detected entities, guaranteeing shorter execution time compared to neural approaches. Rather than using foundation models for end-to-end retrieval, we leverage them only to generate FOL expressions, delegating reasoning to a neurosymbolic inference module. For evaluation we repurpose the DOTA dataset, originally designed for object detection, by augmenting it with more complex queries than in existing benchmarks. We show the LLM's effectiveness in text-to-logic translation and compare RUNE with state-of-the-art RS-LVLMs, demonstrating superior performance. We introduce two metrics, Retrieval Robustness to Query Complexity (RRQC) and Retrieval Robustness to Image Uncertainty (RRIU), which evaluate performance relative to query complexity and image uncertainty. RUNE outperforms joint-embedding models in complex RS retrieval tasks, offering gains in performance, robustness, and explainability. We show RUNE's potential for real-world RS applications through a use case on post-flood satellite image retrieval.
TiME: Tiny Monolingual Encoders for Efficient NLP Pipelines
David Schulmeister, Valentin Hartmann, Lars Klein, Robert West
Personalized recommendation rationale:

The paper concerns efficient encoder architectures and could count as a Transformer efficiency improvement (Enabling Transformer Tech), but the title is explicitly scoped to NLP pipelines and does not suggest applications to recommendation, search, or advertising. Monolingual encoders also have limited use in scenarios such as cross-lingual recommendation, so relevance is low.

2025-12-16 18:02:58 | arXiv:2512.14645v1 |
cs.CL cs.LG
Full abstract:
Today, a lot of research on language models is focused on large, general-purpose models. However, many NLP pipelines only require models with a well-defined, small set of capabilities. While large models are capable of performing the tasks of those smaller models, they are simply not fast enough to process large amounts of data or offer real-time responses. Furthermore, they often use unnecessarily large amounts of energy, leading to sustainability concerns and problems when deploying them on battery-powered devices. In our work, we show how to train small models for such efficiency-critical applications. As opposed to many off-the-shelf NLP pipelines, our models use modern training techniques such as distillation, and offer support for low-resource languages. We call our models TiME (Tiny Monolingual Encoders) and comprehensively evaluate them on a range of common NLP tasks, observing an improved trade-off between benchmark performance on one hand, and throughput, latency and energy consumption on the other. Along the way, we show that distilling monolingual models from multilingual teachers is possible, and likewise distilling models with absolute positional embeddings from teachers with relative positional embeddings.
From Context to EDUs: Faithful and Structured Context Compression via Elementary Discourse Unit Decomposition
Yiqing Zhou, Yu Lei, Shuzheng Si, Qingyan Sun, Wei Wang, Yifei Wu, Hao Wen, Gang...
Personalized recommendation rationale:

The paper concerns context compression, which has potential value in LLM applications and could improve context-handling efficiency in recommendation or search systems. However, the title focuses on elementary discourse unit decomposition, a specific NLP technique, rather than core problems in recommendation, search, or advertising, so the application scenario is not well defined.

2025-12-16 09:52:58 | arXiv:2512.14244v1 |
cs.CL cs.AI
Full abstract:
Managing extensive context remains a critical bottleneck for Large Language Models (LLMs), particularly in applications like long-document question answering and autonomous agents where lengthy inputs incur high computational costs and introduce noise. Existing compression techniques often disrupt local coherence through discrete token removal or rely on implicit latent encoding that suffers from positional bias and incompatibility with closed-source APIs. To address these limitations, we introduce the EDU-based Context Compressor, a novel explicit compression framework designed to preserve both global structure and fine-grained details. Our approach reformulates context compression as a structure-then-select process. First, our LingoEDU transforms linear text into a structural relation tree of Elementary Discourse Units (EDUs) which are anchored strictly to source indices to eliminate hallucination. Second, a lightweight ranking module selects query-relevant sub-trees for linearization. To rigorously evaluate structural understanding, we release StructBench, a manually annotated dataset of 248 diverse documents. Empirical results demonstrate that our method achieves state-of-the-art structural prediction accuracy and significantly outperforms frontier LLMs while reducing costs. Furthermore, our structure-aware compression substantially enhances performance across downstream tasks ranging from long-context tasks to complex Deep Search scenarios.
Astraea: A State-Aware Scheduling Engine for LLM-Powered Agents
Hongqiu Ni, Jiabao Zhang, Guopeng Li, Zilong Wang, Ruiqi Wu, Chi Zhang, Haisheng...
Personalized recommendation rationale:

The paper focuses on a scheduling system for LLM-powered agents, which is systems optimization rather than core algorithmic innovation. LLM agents may find uses in recommendation and search (e.g., conversational recommendation), but the title does not point to those areas, and a scheduling engine is infrastructure rather than a direct application or architectural advance.

2025-12-16 06:55:10 | arXiv:2512.14142v1 |
cs.CL
Full abstract:
Large Language Models (LLMs) are increasingly being deployed as intelligent agents. Their multi-stage workflows, which alternate between local computation and calls to external network services like Web APIs, introduce a mismatch between their execution pattern and the scheduling granularity of existing inference systems such as vLLM. Existing systems typically focus on per-segment optimization, which prevents them from minimizing the end-to-end latency of the complete agentic workflow, i.e., the global Job Completion Time (JCT) over the entire request lifecycle. To address this limitation, we propose Astraea, a service engine designed to shift the optimization from local segments to the global request lifecycle. Astraea employs a state-aware, hierarchical scheduling algorithm that integrates a request's historical state with future predictions. It dynamically classifies requests by their I/O and compute intensive nature and uses an enhanced HRRN policy to balance efficiency and fairness. Astraea also implements an adaptive KV cache manager that intelligently handles the agent state during I/O waits based on the system memory pressure. Extensive experiments show that Astraea reduces average JCT by up to 25.5% compared to baseline methods. Moreover, our approach demonstrates strong robustness and stability under high load across various model scales.
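For readers unfamiliar with HRRN, the base policy Astraea builds on, the classic response-ratio rule is easy to state in code: priority = (waiting time + expected service time) / expected service time, so long-waiting short jobs rise to the front. This is textbook HRRN, not Astraea's enhanced, state-aware version.

```python
def hrrn_pick(jobs, now):
    """Highest-Response-Ratio-Next: priority = (wait + expected_service) / expected_service."""
    def ratio(job):
        wait = now - job["arrival"]
        return (wait + job["expected_service"]) / job["expected_service"]
    return max(jobs, key=ratio)

jobs = [{"id": "req-1", "arrival": 0.0, "expected_service": 4.0},
        {"id": "req-2", "arrival": 2.0, "expected_service": 1.0}]
print(hrrn_pick(jobs, now=5.0)["id"])  # "req-2": short job that has already waited a while
```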
CAPRMIL: Context-Aware Patch Representations for Multiple Instance Learning
Andreas Lolos, Theofilos Christodoulou, Aris L. Moustakas, Stergios Christodouli...
Personalized recommendation rationale:

The paper concerns multiple instance learning (MIL) and context-aware representations, which might indirectly apply to sequence modeling or feature aggregation in recommendation and search systems. However, the title does not address core problems in recommendation, search, or advertising, nor key focus areas such as Transformers, LLMs, or unified modeling of heterogeneous data, so relevance is limited.

2025-12-16 16:16:45 | arXiv:2512.14540v1 |
cs.CV cs.AI
Full abstract:
In computational pathology, weak supervision has become the standard for deep learning due to the gigapixel scale of WSIs and the scarcity of pixel-level annotations, with Multiple Instance Learning (MIL) established as the principal framework for slide-level model training. In this paper, we introduce a novel setting for MIL methods, inspired by proceedings in Neural Partial Differential Equation (PDE) Solvers. Instead of relying on complex attention-based aggregation, we propose an efficient, aggregator-agnostic framework that removes the complexity of correlation learning from the MIL aggregator. CAPRMIL produces rich context-aware patch embeddings that promote effective correlation learning on downstream tasks. By projecting patch features -- extracted using a frozen patch encoder -- into a small set of global context/morphology-aware tokens and utilizing multi-head self-attention, CAPRMIL injects global context with linear computational complexity with respect to the bag size. Paired with a simple Mean MIL aggregator, CAPRMIL matches state-of-the-art slide-level performance across multiple public pathology benchmarks, while reducing the total number of trainable parameters by 48%-92.8% versus SOTA MILs, lowering FLOPs during inference by 52%-99%, and ranking among the best models on GPU memory efficiency and training time. Our results indicate that learning rich, context-aware instance representations before aggregation is an effective and scalable alternative to complex pooling for whole-slide analysis. Our code is available at https://github.com/mandlos/CAPRMIL
SuperCLIP: CLIP with Simple Classification Supervision
Weiheng Zhao, Zilong Huang, Jiashi Feng, Xinggang Wang
Personalized recommendation rationale:

The paper improves vision-language models in computer vision, focusing on adding classification supervision to CLIP. Although CLIP is itself a vision-language model, the title does not address heterogeneous data handling or applications in recommendation, search, or advertising, nor Transformer architecture improvements or LLM advances. Its potential applications are likely confined to visual content understanding, so the link to the core recommendation/search/advertising areas is weak.

2025-12-16 15:11:53 | arXiv:2512.14480v1 |
cs.CV
Full abstract:
Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text, and this issue becomes even more pronounced when dealing with long and detailed captions. This stems from CLIP's training objective, which optimizes only global image-text similarity and overlooks token-level supervision - limiting its ability to achieve fine-grained visual-text alignment. To address this, we propose SuperCLIP, a simple yet effective framework that augments contrastive learning with classification-based supervision. By adding only a lightweight linear layer to the vision encoder, SuperCLIP leverages token-level cues to enhance visual-textual alignment - with just a 0.077% increase in total FLOPs, and no need for additional annotated data. Experiments show that SuperCLIP consistently improves zero-shot classification, image-text retrieval, and purely visual tasks. These gains hold regardless of whether the model is trained on original web data or rich re-captioned data, demonstrating SuperCLIP's ability to recover textual supervision in both cases. Furthermore, SuperCLIP alleviates CLIP's small-batch performance drop through classification-based supervision that avoids reliance on large batch sizes. Code and models will be made open source.
Improving Semantic Uncertainty Quantification in LVLMs with Semantic Gaussian Processes
Joseph Hoche, Andrei Bursuc, David Brellmann, Gilles Louppe, Pavel Izmailov, Ang...
Personalized recommendation rationale:

The paper focuses on uncertainty quantification in vision-language models (VLMs), a VLM-specific improvement. VLM techniques can inspire analogies for handling heterogeneous data, but the title is explicitly about semantic uncertainty across vision-language modalities and is only indirectly related to heterogeneous data in recommendation, search, or advertising. For the "VLM analogy for heterogeneous data" focus it offers underlying technical reference without a direct application link.

2025-12-16 08:15:24 | arXiv:2512.14177v1 |
cs.CV
Full abstract:
Large Vision-Language Models (LVLMs) often produce plausible but unreliable outputs, making robust uncertainty estimation essential. Recent work on semantic uncertainty estimates relies on external models to cluster multiple sampled responses and measure their semantic consistency. However, these clustering methods are often fragile, highly sensitive to minor phrasing variations, and can incorrectly group or separate semantically similar answers, leading to unreliable uncertainty estimates. We propose Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that quantifies semantic uncertainty by analyzing the geometric structure of answer embeddings, avoiding brittle clustering. SGPU maps generated answers into a dense semantic space, computes the Gram matrix of their embeddings, and summarizes their semantic configuration via the eigenspectrum. This spectral representation is then fed into a Gaussian Process Classifier that learns to map patterns of semantic consistency to predictive uncertainty, and that can be applied in both black-box and white-box settings. Across six LLMs and LVLMs on eight datasets spanning VQA, image classification, and textual QA, SGPU consistently achieves state-of-the-art calibration (ECE) and discriminative (AUROC, AUARC) performance. We further show that SGPU transfers across models and modalities, indicating that its spectral representation captures general patterns of semantic uncertainty.
Selective, Controlled and Domain-Agnostic Unlearning in Pretrained CLIP: A Training- and Data-Free Approach
Ashish Mishra, Gyanaranjan Nayak, Tarun Kumar, Arpit Shah, Suparna Bhattacharya,...
Personalized recommendation rationale:

The paper concerns unlearning in CLIP, at the intersection of computer vision and language models, without a direct application to recommendation, search, or advertising. CLIP techniques may inspire multimodal recommendation, but the focus here is model editing rather than core recommendation algorithms, and it is a specific technical improvement rather than a foundational architectural advance.

2025-12-16 05:54:13 | arXiv:2512.14113v1 |
cs.CV
Full abstract:
Pretrained models like CLIP have demonstrated impressive zero-shot classification capabilities across diverse visual domains, spanning natural images, artistic renderings, and abstract representations. However, real-world applications often demand the removal (or "unlearning") of specific object classes without requiring additional data or retraining, or affecting the model's performance on unrelated tasks. In this paper, we propose a novel training- and data-free unlearning framework that enables three distinct forgetting paradigms: (1) global unlearning of selected objects across all domains, (2) domain-specific knowledge removal (e.g., eliminating sketch representations while preserving photo recognition), and (3) complete unlearning in selective domains. By leveraging a multimodal nullspace through synergistic integration of text prompts and synthesized visual prototypes derived from CLIP's joint embedding space, our method efficiently removes undesired class information while preserving the remaining knowledge. This approach overcomes the limitations of existing retraining-based methods and offers a flexible and computationally efficient solution for controlled model forgetting.
SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding
Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao, Linfeng Zhang, Biqi...
Personalized recommendation rationale:

This paper involves vision-language models (VLMs) and has some relevance to the focus on 'VLM analogies for handling heterogeneous data', but the title is clearly centered on vision-language understanding rather than unified modeling of heterogeneous data such as contextual features and user sequences. The diffusion approach may offer efficiency advantages, but no potential application to recommendation, search, or advertising is made explicit.

2025-12-16 04:12:52 | arXiv:2512.14068v1 |
cs.CVcs.AI
Full abstract
Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an integrated framework for efficient and stable training. This framework unifies three components: (1) Asynchronous Block-wise Noise Scheduling to diversify supervision within each batch; (2) Effective Mask Ratio Scaling for unbiased loss normalization under stochastic masking; and (3) a Progressive Beta Noise Curriculum that increases effective mask coverage while preserving corruption diversity. Experiments on 21 single-image, multi-image, and video benchmarks show that SDAR-VL consistently improves training efficiency, convergence stability, and task performance over conventional block diffusion. On this evaluation suite, SDAR-VL sets a new state of the art among diffusion-based vision-language models and, under matched settings, matches or surpasses strong AR baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V, establishing block-wise diffusion as a practical backbone for VLU.
ChartAgent: A Chart Understanding Framework with Tool Integrated Reasoning
Boran Wang, Xinming Wang, Yi Chen, Xiang Li, Jian Xu, Jing Yuan, Chenglin Liu
Personalized recommendation rationale:

This paper focuses on chart understanding and tool-integrated reasoning, an AI application for a specific modality (charts). Although chart data may appear in some recommendation or search scenarios (e.g., e-commerce product comparison charts), the title does not clearly target core problems in recommendation, search, or advertising such as ranking, recall, or user modeling. As an 'enabling technology', chart understanding might indirectly help with structured data presentation, but the potential application is neither direct nor explicit.

2025-12-16 03:17:04 | arXiv:2512.14040v1 |
cs.CVcs.LG
Full abstract
With their high information density and intuitive readability, charts have become the de facto medium for data analysis and communication across disciplines. Recent multimodal large language models (MLLMs) have made notable progress in automated chart understanding, yet they remain heavily dependent on explicit textual annotations, and their performance degrades markedly when key numerals are absent. To address this limitation, we introduce ChartAgent, a chart understanding framework grounded in Tool-Integrated Reasoning (TIR). Inspired by human cognition, ChartAgent decomposes complex chart analysis into a sequence of observable, replayable steps. Supporting this architecture is an extensible, modular tool library comprising more than a dozen core tools, such as key-element detection, instance segmentation, and optical character recognition (OCR), which the agent dynamically orchestrates to achieve systematic visual parsing across diverse chart types. Leveraging TIR's transparency and verifiability, ChartAgent moves beyond the black-box paradigm by standardizing and consolidating intermediate outputs into a structured Evidence Package, providing traceable and reproducible support for final conclusions. Experiments show that ChartAgent substantially improves robustness under sparse annotation settings, offering a practical path toward trustworthy and extensible systems for chart understanding.
PushGen: Push Notifications Generation with LLM
Shifu Bie, Jiangxia Cao, Zixiao Luo, Yichuan Zou, Lei Liang, Lu Zhang, Linxun Ch...
Personalized recommendation rationale:

The title concerns using LLMs to generate push notifications, which is a content-generation application rather than a core ranking, retrieval, or modeling task in recommendation, search, or advertising. Push notifications are related to recommender systems, but the title clearly targets generation rather than ranking or optimization, so its relevance to the direct LLM applications of interest (e.g., ranking, retrieval augmentation) is low.

2025-12-16 15:23:28 | arXiv:2512.14490v1 |
cs.IR
Full abstract
We present PushGen, an automated framework for generating high-quality push notifications comparable to human-crafted content. With the rise of generative models, there is growing interest in leveraging LLMs for push content generation. Although LLMs make content generation straightforward and cost-effective, maintaining stylistic control and reliable quality assessment remains challenging, as both directly impact user engagement. To address these issues, PushGen combines two key components: (1) a controllable category prompt technique to guide LLM outputs toward desired styles, and (2) a reward model that ranks and selects generated candidates. Extensive offline and online experiments demonstrate its effectiveness, which has been deployed in large-scale industrial applications, serving hundreds of millions of users daily.
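A minimal generate-then-rank sketch in the spirit of the two components above; `CATEGORY_PROMPTS`, `llm_generate`, and `reward_model` are hypothetical placeholders, not the deployed industrial system or its prompt wording.

```python
from typing import Callable, List

CATEGORY_PROMPTS = {
    "urgency": "Write a push notification that stresses limited-time availability.",
    "curiosity": "Write a push notification that teases the content without spoiling it.",
}

def generate_push(item_title: str,
                  category: str,
                  llm_generate: Callable[[str], str],
                  reward_model: Callable[[str], float],
                  n_candidates: int = 8) -> str:
    """Sample several candidates under a style-controlling prompt and keep the highest-reward one."""
    prompt = f"{CATEGORY_PROMPTS[category]}\nItem: {item_title}\nNotification:"
    candidates: List[str] = [llm_generate(prompt) for _ in range(n_candidates)]
    return max(candidates, key=reward_model)
```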
A Comparative Analysis of Retrieval-Augmented Generation Techniques for Bengali Standard-to-Dialect Machine Translation Using LLMs
K. M. Jubair Sami, Dipto Sumit, Ariyan Hossain, Farig Sadeque
Personalized recommendation rationale:

This paper focuses on machine translation for a specific language (Bengali), a purely NLP application. Although it involves retrieval-augmented generation (RAG), the application scenario is machine translation rather than recommendation/search/advertising systems, and it does not show how these techniques would transfer to the areas of interest.

2025-12-16 08:18:18 | arXiv:2512.14179v1 |
cs.CLcs.AIcs.IR
Full abstract
Translating from a standard language to its regional dialects is a significant NLP challenge due to scarce data and linguistic variation, a problem prominent in the Bengali language. This paper proposes and compares two novel RAG pipelines for standard-to-dialectal Bengali translation. The first, a Transcript-Based Pipeline, uses large dialect sentence contexts from audio transcripts. The second, a more effective Standardized Sentence-Pairs Pipeline, utilizes structured local_dialect:standard_bengali sentence pairs. We evaluated both pipelines across six Bengali dialects and multiple LLMs using BLEU, ChrF, WER, and BERTScore. Our findings show that the sentence-pair pipeline consistently outperforms the transcript-based one, reducing Word Error Rate (WER) from 76% to 55% for the Chittagong dialect. Critically, this RAG approach enables smaller models (e.g., Llama-3.1-8B) to outperform much larger models (e.g., GPT-OSS-120B), demonstrating that a well-designed retrieval strategy can be more crucial than model size. This work contributes an effective, fine-tuning-free solution for low-resource dialect translation, offering a practical blueprint for preserving linguistic diversity.
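A minimal sketch of the sentence-pair pipeline described above: retrieve the most similar standard-Bengali sentences from a dialect:standard bank and feed them to the LLM as few-shot examples. The `embed()` and `translate_llm()` callables and the prompt wording are assumed placeholders, not the paper's exact pipeline.

```python
import numpy as np

def build_rag_prompt(source_sentence, pair_bank, embed, k=5):
    """pair_bank: list of (dialect_sentence, standard_sentence) tuples."""
    query = np.asarray(embed([source_sentence]))                     # (1, d)
    keys = np.asarray(embed([std for _, std in pair_bank]))          # (n, d)
    sims = (query @ keys.T).ravel()
    top = np.argsort(-sims)[:k]
    examples = "\n".join(f"Standard: {pair_bank[i][1]}\nDialect: {pair_bank[i][0]}" for i in top)
    return (f"Translate standard Bengali into the regional dialect, following the examples.\n"
            f"{examples}\nStandard: {source_sentence}\nDialect:")

# translation = translate_llm(build_rag_prompt(sentence, pair_bank, embed))
```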
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, Limin Wang
Personalized recommendation rationale:

The title indicates a focus on video temporal grounding, a computer-vision task with no direct connection to the core concerns of recommendation, search, or advertising. Although multimodal LLM techniques are involved, the application scenario (video temporal grounding) differs markedly from typical RecSys/Search/Ads problems such as ranking, retrieval, and user-behavior modeling, and any potential application is unclear.

2025-12-16 18:59:58 | arXiv:2512.14698v1 |
cs.CVcs.AIcs.CLcs.MM
Full abstract
This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in TimeLens models, a family of MLLMs with state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All codes, data, and models will be released to facilitate future research.
JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction
Atsuyuki Miyai, Shota Onohara, Jeonghun Baek, Kiyoharu Aizawa
Personalized recommendation rationale:

The title indicates a benchmark for Japanese multi-discipline multimodal understanding built around image-based questions, which falls squarely into the evaluation-benchmark category. Although multimodality is mentioned, the core contribution is benchmark construction and evaluation, with no direct connection to core advances in recommendation, search, or advertising, LLM applications, or Transformer architecture innovations.

2025-12-16 17:33:00 | arXiv:2512.14620v1 |
cs.CLcs.AIcs.CV
Full abstract
This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.
Towards Nepali-language LLMs: Efficient GPT training with a Nepali BPE tokenizer
Adarsha Shrestha, Basanta Pokharel, Binit Shrestha, Smriti Adhikari, Dinesh Goth...
Personalized recommendation rationale:

This paper focuses on LLM training and tokenizer optimization for a specific language (Nepali), a language-specific engineering effort rather than a core LLM advance or architectural innovation. Efficient training methods could indirectly benefit multilingual recommender systems, but the paper itself does not address application potential in recommendation, search, or advertising, so its relevance to the four focus directions is low.

2025-12-16 16:53:11 | arXiv:2512.14585v1 |
cs.CLcs.AI
Full abstract
Nepali, a low-resource language spoken by over 32 million people, continues to face challenges in natural language processing (NLP) due to its complex grammar, agglutinative morphology, and limited availability of high-quality corpora. Most efforts to date have centered on basic encoder architectures; they remain insufficient for Nepali-specific text generation. This study presents a GPT-2-based Nepali language model trained using several training strategies inspired by GPT-3, including optimized learning rate schedules, batch scaling, and architectural refinements. A custom 16k Byte-Pair Encoding (BPE) tokenizer was trained exclusively on Nepali text to ensure more consistent segmentation and improved input representation. The model was pretrained on a combined dataset comprising a 10.75GB cleaned NepBERTa corpus and additional web-scraped Nepali news articles. FlashAttention was integrated to reduce memory usage and stabilize training. After two epochs, the model achieved a training loss of 3.168177, a validation loss of 3.081982, and a final perplexity of 21.80, demonstrating its capability to generate coherent Nepali news-style text.
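A minimal sketch of training a 16k Byte-Pair Encoding tokenizer with the HuggingFace `tokenizers` library; the corpus path and special tokens are illustrative, and the paper's exact cleaning pipeline is not reproduced here.

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["nepali_corpus.txt"],           # cleaned Nepali text (assumed path)
    vocab_size=16000,                       # the 16k vocabulary mentioned above
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"],
)
tokenizer.save_model("nepali-bpe-16k")      # writes vocab.json and merges.txt

ids = tokenizer.encode("नेपाली भाषा").ids    # sanity check on a short Nepali phrase
```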
Polypersona: Persona-Grounded LLM for Synthetic Survey Responses
Tejaswani Dash, Dinesh Karri, Anudeep Vurity, Gautam Datla, Tazeem Ahmad, Saima ...
Personalized recommendation rationale:

This paper concerns LLM-generated synthetic content (survey responses), which falls under AIGC/content generation, a topic explicitly excluded as irrelevant. Although the 'persona' concept may evoke user profiles in recommender systems, the core focus is generating synthetic survey data rather than ranking, retrieval, or modeling applications in recommendation/search/advertising.

2025-12-16 16:33:23 | arXiv:2512.14562v1 |
cs.CLcs.AI
Full abstract
This paper introduces PolyPersona, a generative framework for synthesizing persona-conditioned survey responses across multiple domains. The framework instruction-tunes compact chat models using parameter-efficient LoRA adapters with 4-bit quantization under a resource-adaptive training setup. A dialogue-based data pipeline explicitly preserves persona cues, ensuring consistent behavioral alignment across generated responses. Using this pipeline, we construct a dataset of 3,568 synthetic survey responses spanning ten domains and 433 distinct personas, enabling controlled instruction tuning and systematic multi-domain evaluation. We evaluate the generated responses using a multi-metric evaluation suite that combines standard text generation metrics, including BLEU, ROUGE, and BERTScore, with survey-specific metrics designed to assess structural coherence, stylistic consistency, and sentiment alignment. Experimental results show that compact models such as TinyLlama 1.1B and Phi-2 achieve performance comparable to larger 7B to 8B baselines, with a highest BLEU score of 0.090 and ROUGE-1 of 0.429. These findings demonstrate that persona-conditioned fine-tuning enables small language models to generate reliable and coherent synthetic survey data. The proposed framework provides an efficient and reproducible approach for survey data generation, supporting scalable evaluation while facilitating bias analysis through transparent and open protocols.
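A minimal sketch of persona-conditioned instruction tuning with 4-bit quantization and LoRA adapters (a QLoRA-style setup); the base model name, LoRA targets, and hyperparameters are illustrative assumptions, not the paper's released configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                                            quantization_config=bnb, device_map="auto")
tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()

# Each training example pairs a persona description with a survey question and the
# persona-consistent answer, e.g. (hypothetical format):
# "Persona: 34-year-old nurse from Texas...\nQuestion: ...\nAnswer: ..."
```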
Agreement Between Large Language Models and Human Raters in Essay Scoring: A Research Synthesis
Hongli Li, Che Han Chen, Kevin Fan, Chiho Young-Johnson, Soyoung Lim, Yali Feng
Personalized recommendation rationale:

This paper examines agreement between LLMs and human raters on an assessment task (essay scoring), which is purely LLM evaluation research. Although LLM techniques are involved, the focus is evaluation methodology and agreement validation rather than direct LLM applications or architectural improvements for recommendation, search, or advertising, so its relevance to the current focus is low.

2025-12-16 16:33:07 | arXiv:2512.14561v1 |
cs.CL
Full abstract
Despite the growing promise of large language models (LLMs) in automatic essay scoring (AES), empirical findings regarding their reliability compared to human raters remain mixed. Following the PRISMA 2020 guidelines, we synthesized 65 published and unpublished studies from January 2022 to August 2025 that examined agreement between LLMs and human raters in AES. Across studies, reported LLM-human agreement was generally moderate to good, with agreement indices (e.g., Quadratic Weighted Kappa, Pearson correlation, and Spearman's rho) mostly ranging between 0.30 and 0.80. Substantial variability in agreement levels was observed across studies, reflecting differences in study-specific factors as well as the lack of standardized reporting practices. Implications and directions for future research are discussed.
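The agreement indices cited above (Quadratic Weighted Kappa, Pearson, Spearman) can be computed directly with scikit-learn and SciPy; the scores below are made-up illustrative ratings on a 1-6 rubric, not data from any of the synthesized studies.

```python
from sklearn.metrics import cohen_kappa_score
from scipy.stats import pearsonr, spearmanr

human_scores = [3, 4, 2, 5, 4, 3, 6, 2]   # illustrative human ratings
llm_scores   = [3, 4, 3, 5, 5, 3, 5, 2]   # illustrative LLM ratings

qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
pearson, _ = pearsonr(human_scores, llm_scores)
spearman, _ = spearmanr(human_scores, llm_scores)
print(f"QWK={qwk:.3f}  Pearson={pearson:.3f}  Spearman={spearman:.3f}")
```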
Effect of Document Packing on the Latent Multi-Hop Reasoning Capabilities of Large Language Models
Gabriele Prato, Shagun Sodhani, Alessandro Sordoni, Sarath Chandar
Personalized recommendation rationale:

This paper studies how document packing affects the latent multi-hop reasoning ability of LLMs, which is LLM capability evaluation. Although it involves LLM techniques, the focus is benchmarking capabilities rather than technical advances or concrete applications, with no direct connection to the technical progress, application innovation, or heterogeneous-data modeling of interest in RecSys/Search/Ads.

2025-12-16 14:16:23 | arXiv:2512.14427v1 |
cs.CLcs.AIcs.LG
Full abstract
The standard practice for training large language models involves packing multiple documents together to optimize computational efficiency. However, the impact of this process on the models' capabilities remains largely unexplored. To address this gap, we investigate how different document-packing strategies influence the latent multi-hop reasoning abilities of LLMs. Our findings indicate that packing can improve model performance compared to training on individual documents, at the expense of more compute. To further understand the underlying mechanisms, we conduct an ablation study, identifying key factors that explain the advantages of packing. Ultimately, our research deepens the understanding of LLM training dynamics and provides practical insights for optimizing model development.
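A minimal sketch of the standard "pack multiple documents per sequence" preprocessing the paper studies: concatenate tokenized documents separated by an EOS token and cut the stream into fixed-length training examples. The tokenizer, sequence length, and EOS id are assumptions for illustration.

```python
from typing import Iterable, List

def pack_documents(docs: Iterable[List[int]], seq_len: int, eos_id: int) -> List[List[int]]:
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)          # document boundary marker
    # Drop the trailing remainder so every example has exactly seq_len tokens.
    return [stream[i:i + seq_len] for i in range(0, len(stream) - seq_len + 1, seq_len)]

# Example: three tiny "documents" packed into sequences of length 8 with eos_id=0.
packed = pack_documents([[5, 6, 7], [8, 9], [10, 11, 12, 13]], seq_len=8, eos_id=0)
print(packed)   # [[5, 6, 7, 0, 8, 9, 0, 10]]
```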
MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives
Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan, Hengshuang Zhao
Personalized recommendation rationale:

This paper focuses on maintaining consistency in long video narratives, a computer-vision problem with no direct connection to the core technical focus of recommendation, search, or advertising. Although it involves memory mechanisms and efficiency optimization, it lacks clear cross-modal modeling or sequence-processing applications and cannot be directly applied to unified heterogeneous-data modeling or recommendation/search scenarios.

2025-12-16 18:59:59 | arXiv:2512.14699v1 |
cs.CV
Full abstract
The core challenge for streaming video generation is maintaining the content consistency in long context, which poses high requirement for the memory design. Most existing solutions maintain the memory by compressing historical frames with predefined strategies. However, different to-generate video chunks should refer to different historical cues, which is hard to satisfy with fixed strategies. In this work, we propose MemFlow to address this problem. Specifically, before generating the coming chunk, we dynamically update the memory bank by retrieving the most relevant historical frames with the text prompt of this chunk. This design enables narrative coherence even if new event happens or scenario switches in future frames. In addition, during generation, we only activate the most relevant tokens in the memory bank for each query in the attention layers, which effectively guarantees the generation efficiency. In this way, MemFlow achieves outstanding long-context consistency with negligible computation burden (7.9% speed reduction compared with the memory-free baseline) and keeps the compatibility with any streaming video generation model with KV cache.
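A minimal sketch of the prompt-conditioned memory retrieval described above: before generating a chunk, score cached frame features against the chunk's text embedding and keep only the top-k frames as the memory bank. The frame and prompt encoders are assumed to exist; only the relevance selection is illustrated.

```python
import torch
import torch.nn.functional as F

def update_memory_bank(frame_feats: torch.Tensor,      # (N, D) features of all past frames
                       prompt_feat: torch.Tensor,      # (D,) embedding of the next chunk's prompt
                       k: int = 16) -> torch.Tensor:
    """Return the k historical frame features most relevant to the upcoming chunk."""
    frame_feats = F.normalize(frame_feats, dim=-1)
    prompt_feat = F.normalize(prompt_feat, dim=-1)
    scores = frame_feats @ prompt_feat                  # cosine relevance, shape (N,)
    topk = torch.topk(scores, k=min(k, frame_feats.shape[0])).indices
    return frame_feats[topk]                            # memory bank for the next chunk
```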
Spherical Leech Quantization for Visual Tokenization and Generation
Yue Zhao, Hanwen Jiang, Zhenlin Xu, Chutong Yang, Ehsan Adeli, Philipp Krähenbüh...
Personalized recommendation rationale:

This paper concerns visual tokenization and generation, a vision-domain technique. Although the quantization in the title may offer efficiency advantages, the paper is clearly focused on visual applications and shows no direct connection to recommendation, search, or advertising. Quantization could serve as an enabling technology for model compression in recommender systems, but the title gives no clear hint of cross-modal or multimodal modeling, making its relevance low.

2025-12-16 18:59:57 | arXiv:2512.14697v1 |
cs.CVcs.AIcs.LGeess.SP
Full abstract
Non-parametric quantization has received much attention due to its efficiency on parameters and scalability to a large codebook. In this paper, we present a unified formulation of different non-parametric quantization methods through the lens of lattice coding. The geometry of lattice codes explains the necessity of auxiliary loss terms when training auto-encoders with certain existing lookup-free quantization variants such as BSQ. As a step forward, we explore a few possible candidates, including random lattices, generalized Fibonacci lattices, and densest sphere packing lattices. Among all, we find the Leech lattice-based quantization method, which is dubbed as Spherical Leech Quantization ($Λ_{24}$-SQ), leads to both a simplified training recipe and an improved reconstruction-compression tradeoff thanks to its high symmetry and even distribution on the hypersphere. In image tokenization and compression tasks, this quantization approach achieves better reconstruction quality across all metrics than BSQ, the best prior art, while consuming slightly fewer bits. The improvement also extends to state-of-the-art auto-regressive image generation frameworks.
EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models
Zechen Bai, Chen Gao, Mike Zheng Shou
Personalized recommendation rationale:

This paper involves vision-language-action models, a multimodal topic, but its relevance to direct applications in recommendation, search, or advertising is weak. Test-time training could be suggestive for model adaptation, yet the paper mainly targets robot control and embodied intelligence, with no clear RecSys/Search/Ads application scenario.

2025-12-16 18:26:38 | arXiv:2512.14666v1 |
cs.ROcs.CV
Full abstract
Achieving truly adaptive embodied intelligence requires agents that learn not just by imitating static demonstrations, but by continuously improving through environmental interaction, which is akin to how humans master skills through practice. Vision-Language-Action (VLA) models have advanced robotic manipulation by leveraging large language models, yet remain fundamentally limited by Supervised Finetuning (SFT): requiring hundreds of demonstrations per task, rigidly memorizing trajectories, and failing to adapt when deployment conditions deviate from training. We introduce EVOLVE-VLA, a test-time training framework enabling VLAs to continuously adapt through environment interaction with minimal or zero task-specific demonstrations. The key technical challenge is replacing oracle reward signals (unavailable at test time) with autonomous feedback. We address this through a learned progress estimator providing dense feedback, and critically, we design our framework to "tame" this inherently noisy signal via two mechanisms: (1) an accumulative progress estimation mechanism smoothing noisy point-wise estimates, and (2) a progressive horizon extension strategy enabling gradual policy evolution. EVOLVE-VLA achieves substantial gains: +8.6% on long-horizon tasks, +22.0% in 1-shot learning, and enables cross-task generalization, achieving 20.8% success on unseen tasks without task-specific demonstration training (vs. 0% for pure SFT). Qualitative analysis reveals emergent capabilities absent in demonstrations, including error recovery and novel strategies. This work represents a critical step toward VLAs that truly learn and adapt, moving beyond static imitation toward continuous self-improvement.
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yun...
Personalized recommendation rationale:

The title focuses on geometric consistency for real-time interactive world modeling, which belongs mainly to computer vision and 3D modeling. Although it mentions 'world modeling', it lacks a direct connection to recommendation, search, or advertising and does not involve LLM techniques, Transformer architecture advances, or heterogeneous-data processing methods.

2025-12-16 17:22:46 | arXiv:2512.14614v1 |
cs.CVcs.GR
Full abstract
This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware model. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. Project page and online demo can be found: https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.
CLNet: Cross-View Correspondence Makes a Stronger Geo-Localizationer
Xianwei Cao, Dou Quan, Shuang Wang, Ning Huyan, Wei Wang, Yunan Li, Licheng Jiao
Personalized recommendation rationale:

The title concerns geo-localization, a cross-view matching problem in computer vision. Geo-localization could have potential uses in search systems (e.g., location-aware search), but the title does not clearly target core ranking problems in recommendation, search, or advertising, nor does it touch on the technical directions of current interest such as LLMs, Transformer architectures, or unified heterogeneous-data modeling.

2025-12-16 16:31:41 | arXiv:2512.14560v1 |
cs.CVcs.AI
Full abstract
Image retrieval-based cross-view geo-localization (IRCVGL) aims to match images captured from significantly different viewpoints, such as satellite and street-level images. Existing methods predominantly rely on learning robust global representations or implicit feature alignment, which often fail to model explicit spatial correspondences crucial for accurate localization. In this work, we propose a novel correspondence-aware feature refinement framework, termed CLNet, that explicitly bridges the semantic and geometric gaps between different views. CLNet decomposes the view alignment process into three learnable and complementary modules: a Neural Correspondence Map (NCM) that spatially aligns cross-view features via latent correspondence fields; a Nonlinear Embedding Converter (NEC) that remaps features across perspectives using an MLP-based transformation; and a Global Feature Recalibration (GFR) module that reweights informative feature channels guided by learned spatial cues. The proposed CLNet can jointly capture both high-level semantics and fine-grained alignments. Extensive experiments on four public benchmarks, CVUSA, CVACT, VIGOR, and University-1652, demonstrate that our proposed CLNet achieves state-of-the-art performance while offering better interpretability and generalizability.
A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning
Zixin Zhang, Kanghao Chen, Hanqing Wang, Hongfei Zhang, Harold Haodong Chen, Che...
Personalized recommendation rationale:

The title mainly concerns an agentic framework and affordance reasoning, falling under general AI agent research. Agent techniques could indirectly apply to recommendation or search systems, but the title mentions no core components, architectures, or application scenarios related to recommendation, search, or advertising, nor the technical directions of current interest such as LLMs, Transformers, or heterogeneous-data processing.

2025-12-16 14:27:47 | arXiv:2512.14442v1 |
cs.CVcs.RO
Full abstract
Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on training over annotated datasets, which leads to poor generalization on novel objects and unseen environments. In this paper, we move beyond this paradigm by proposing A4-Agent, a training-free agentic framework that decouples affordance prediction into a three-stage pipeline. Our framework coordinates specialized foundation models at test time: (1) a Dreamer that employs generative models to visualize how an interaction would look; (2) a Thinker that utilizes large vision-language models to decide what object part to interact with; and (3) a Spotter that orchestrates vision foundation models to precisely locate where the interaction area is. By leveraging the complementary strengths of pre-trained models without any task-specific fine-tuning, our zero-shot framework significantly outperforms state-of-the-art supervised methods across multiple benchmarks and demonstrates robust generalization to real-world settings.
The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention Synergy
Zhuo Chen, Fanyue Wei, Runze Xu, Jingjing Li, Lixin Duan, Angela Yao, Wen Li
Personalized recommendation rationale:

This paper focuses on attention mechanisms for image editing, a computer-vision topic with no direct connection to core RecSys/Search/Ads techniques. Attention is a key component of Transformer architectures, but the paper targets the specific application of image editing and lacks clear application potential for recommendation, search, or advertising.

2025-12-16 14:08:00 | arXiv:2512.14423v1 |
cs.CV
Full abstract
Training-free image editing with large diffusion models has become practical, yet faithfully performing complex non-rigid edits (e.g., pose or shape changes) remains highly challenging. We identify a key underlying cause: attention collapse in existing attention sharing mechanisms, where either positional embeddings or semantic features dominate visual content retrieval, leading to over-editing or under-editing. To address this issue, we introduce SynPS, a method that Synergistically leverages Positional embeddings and Semantic information for faithful non-rigid image editing. We first propose an editing measurement that quantifies the required editing magnitude at each denoising step. Based on this measurement, we design an attention synergy pipeline that dynamically modulates the influence of positional embeddings, enabling SynPS to balance semantic modifications and fidelity preservation. By adaptively integrating positional and semantic cues, SynPS effectively avoids both over- and under-editing. Extensive experiments on public and newly curated benchmarks demonstrate the superior performance and faithfulness of our approach.
Broadening View Synthesis of Dynamic Scenes from Constrained Monocular Videos
Le Jiang, Shaotong Zhu, Yedi Luo, Shayda Moezzi, Sarah Ostadabbas
Personalized recommendation rationale:

This paper focuses on view synthesis for dynamic scenes, purely computer-vision research. View synthesis could in principle supply 3D content for product display or ad creatives in recommendation/search systems, but the title is explicitly limited to 'dynamic scenes' and 'monocular videos', lacks a direct link to core ranking problems in recommendation, search, or advertising, and mentions nothing related to Transformer architectures, LLM techniques, or heterogeneous-data modeling.

2025-12-16 13:43:41 | arXiv:2512.14406v1 |
cs.CV
Full abstract
In dynamic Neural Radiance Fields (NeRF) systems, state-of-the-art novel view synthesis methods often fail under significant viewpoint deviations, producing unstable and unrealistic renderings. To address this, we introduce Expanded Dynamic NeRF (ExpanDyNeRF), a monocular NeRF framework that leverages Gaussian splatting priors and a pseudo-ground-truth generation strategy to enable realistic synthesis under large-angle rotations. ExpanDyNeRF optimizes density and color features to improve scene reconstruction from challenging perspectives. We also present the Synthetic Dynamic Multiview (SynDM) dataset, the first synthetic multiview dataset for dynamic scenes with explicit side-view supervision-created using a custom GTA V-based rendering pipeline. Quantitative and qualitative results on SynDM and real-world datasets demonstrate that ExpanDyNeRF significantly outperforms existing dynamic NeRF methods in rendering fidelity under extreme viewpoint shifts. Further details are provided in the supplementary materials.
Optimizing Rank for High-Fidelity Implicit Neural Representations
Julian McGinnis, Florian A. Hölzl, Suprosanna Shit, Florentin Bieder, Paul Fried...
Personalized recommendation rationale:

The title concerns rank optimization for implicit neural representations (INRs), which belongs to computer vision and neural rendering and has no direct connection to core recommendation, search, or advertising techniques. Ranking is a key problem in recommendation/search, but 'rank' here refers to the rank of learned representations in image/signal reconstruction rather than item ranking, and INR techniques are mainly used for visual content generation and compression, with no clear application potential in the listed focus areas.

2025-12-16 12:52:30 | arXiv:2512.14366v1 |
cs.CV
Full abstract
Implicit Neural Representations (INRs) based on vanilla Multi-Layer Perceptrons (MLPs) are widely believed to be incapable of representing high-frequency content. This has directed research efforts towards architectural interventions, such as coordinate embeddings or specialized activation functions, to represent high-frequency signals. In this paper, we challenge the notion that the low-frequency bias of vanilla MLPs is an intrinsic, architectural limitation to learn high-frequency content, but instead a symptom of stable rank degradation during training. We empirically demonstrate that regulating the network's rank during training substantially improves the fidelity of the learned signal, rendering even simple MLP architectures expressive. Extensive experiments show that using optimizers like Muon, with high-rank, near-orthogonal updates, consistently enhances INR architectures even beyond simple ReLU MLPs. These substantial improvements hold across a diverse range of domains, including natural and medical images, and novel view synthesis, with up to 9 dB PSNR improvements over the previous state-of-the-art. Our project page, which includes code and experimental results, is available at: (https://muon-inrs.github.io).
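The stable rank the abstract monitors is a standard quantity, the squared Frobenius norm over the squared spectral norm of a weight matrix. A small sketch for tracking it on an MLP-based INR follows; the toy network below is an arbitrary example, not the paper's architecture.

```python
import torch

def stable_rank(weight: torch.Tensor) -> float:
    """||W||_F^2 / ||W||_2^2, always <= min(weight.shape)."""
    fro_sq = weight.pow(2).sum()
    spec = torch.linalg.matrix_norm(weight, ord=2)      # largest singular value
    return (fro_sq / spec.pow(2)).item()

# Toy coordinate MLP: 2D coordinates in, RGB out.
mlp = torch.nn.Sequential(torch.nn.Linear(2, 256), torch.nn.ReLU(),
                          torch.nn.Linear(256, 256), torch.nn.ReLU(),
                          torch.nn.Linear(256, 3))
for name, param in mlp.named_parameters():
    if param.ndim == 2:
        print(name, round(stable_rank(param.data), 2))
```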
Unified Semantic Transformer for 3D Scene Understanding
Sebastian Koch, Johanna Wald, Hide Matsuki, Pedro Hermosilla, Timo Ropinski, Fed...
Personalized recommendation rationale:

The title is clearly focused on 3D scene understanding, a computer-vision topic. Although a Transformer architecture is used, there is no clear connection to recommendation, search, or advertising, and the title mentions no technique or concept that could plausibly apply to those areas.

2025-12-16 12:49:35 | arXiv:2512.14364v1 |
cs.CV
Full abstract
Holistic 3D scene understanding involves capturing and parsing unstructured 3D environments. Due to the inherent complexity of the real world, existing models have predominantly been developed and limited to be task-specific. We introduce UNITE, a Unified Semantic Transformer for 3D scene understanding, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and only takes a few seconds to infer the full 3D semantic geometry. Our approach is capable of directly predicting multiple semantic attributes, including 3D scene segmentation, instance embeddings, open-vocabulary features, as well as affordance and articulations, solely from RGB images. The method is trained using a combination of 2D distillation, heavily relying on self-supervision and leverages novel multi-view losses designed to ensure 3D view consistency. We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models, in many cases, surpassing methods that operate on ground truth 3D geometry. See the project website at unite-page.github.io
Enhancing Interpretability for Vision Models via Shapley Value Optimization
Kanglong Fan, Yunqiao Yang, Chen Ma
Personalized recommendation rationale:

This paper focuses on an interpretability method (Shapley values) for vision models, purely computer-vision research. Interpretability matters in recommendation/search systems, but the paper does not address heterogeneous-data modeling, sequence processing, or any concrete application scenario related to recommendation/search/advertising, so its relevance is low.

2025-12-16 12:33:04 | arXiv:2512.14354v1 |
cs.CVcs.AI
Full abstract
Deep neural networks have demonstrated remarkable performance across various domains, yet their decision-making processes remain opaque. Although many explanation methods are dedicated to bringing the obscurity of DNNs to light, they exhibit significant limitations: post-hoc explanation methods often struggle to faithfully reflect model behaviors, while self-explaining neural networks sacrifice performance and compatibility due to their specialized architectural designs. To address these challenges, we propose a novel self-explaining framework that integrates Shapley value estimation as an auxiliary task during training, which achieves two key advancements: 1) a fair allocation of the model prediction scores to image patches, ensuring explanations inherently align with the model's decision logic, and 2) enhanced interpretability with minor structural modifications, preserving model performance and compatibility. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art interpretability.
Semantic Mismatch and Perceptual Degradation: A New Perspective on Image Editing Immunity
Shuai Dong, Jie Zhang, Guoying Zhao, Shiguang Shan, Xilin Chen
Personalized recommendation rationale:

The title focuses on image editing immunity, a computer-vision topic with no direct connection to core recommendation, search, or advertising techniques. Although it mentions semantic and perceptual concepts, nothing suggests relevance to unified heterogeneous-data modeling, Transformer architectures, or LLM applications, so its relevance to the current focus is low.

2025-12-16 11:34:48 | arXiv:2512.14320v1 |
cs.CVcs.AIcs.CYcs.LG
Full abstract
Text-guided image editing via diffusion models, while powerful, raises significant concerns about misuse, motivating efforts to immunize images against unauthorized edits using imperceptible perturbations. Prevailing metrics for evaluating immunization success typically rely on measuring the visual dissimilarity between the output generated from a protected image and a reference output generated from the unprotected original. This approach fundamentally overlooks the core requirement of image immunization, which is to disrupt semantic alignment with attacker intent, regardless of deviation from any specific output. We argue that immunization success should instead be defined by the edited output either semantically mismatching the prompt or suffering substantial perceptual degradations, both of which thwart malicious intent. To operationalize this principle, we propose Synergistic Intermediate Feature Manipulation (SIFM), a method that strategically perturbs intermediate diffusion features through dual synergistic objectives: (1) maximizing feature divergence from the original edit trajectory to disrupt semantic alignment with the expected edit, and (2) minimizing feature norms to induce perceptual degradations. Furthermore, we introduce the Immunization Success Rate (ISR), a novel metric designed to rigorously quantify true immunization efficacy for the first time. ISR quantifies the proportion of edits where immunization induces either semantic failure relative to the prompt or significant perceptual degradations, assessed via Multimodal Large Language Models (MLLMs). Extensive experiments show our SIFM achieves the state-of-the-art performance for safeguarding visual content against malicious diffusion-based manipulation.
Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
Xiaoqian Shen, Min-Hung Chen, Yu-Chiang Frank Wang, Mohamed Elhoseiny, Ryo Hachi...
Personalized recommendation rationale:

The title involves video understanding and reinforcement learning but shows no clear, direct connection to recommendation, search, or advertising. Temporal analysis could potentially apply to user-behavior sequence modeling, yet the title mentions no recommendation/search/advertising concept and the reinforcement-learning application is unclear, so relevance is low.

2025-12-16 10:34:39 | arXiv:2512.14273v1 |
cs.CV
Full abstract
Grounded video question answering (GVQA) aims to localize relevant temporal segments in videos and generate accurate answers to a given question; however, large video-language models (LVLMs) exhibit limited temporal awareness. Although existing approaches based on Group Relative Policy Optimization (GRPO) attempt to improve temporal grounding, they still struggle to faithfully ground their answers in the relevant video evidence, leading to temporal mislocalization and hallucinations. In this work, we present Zoom-Zero, a coarse-to-fine framework that first localizes query-relevant segments and then temporally zooms into the most salient frames for finer-grained visual verification. Our method addresses the limits of GRPO for the GVQA task with two key innovations: (i) a zoom-in accuracy reward that validates the fidelity of temporal grounding prediction and facilitates fine-grained visual verification on grounded frames; (ii) token-selective credit assignment, which attributes rewards to the tokens responsible for temporal localization or answer generation, mitigating GRPO's issue in handling multi-faceted reward signals. Our proposed method advances grounded video question answering, improving temporal grounding by 5.2% on NExT-GQA and 4.6% on ReXTime, while also enhancing average answer accuracy by 2.4%. Additionally, the coarse-to-fine zoom-in during inference further benefits long-form video understanding by preserving critical visual details without compromising global context, yielding an average improvement of 6.4% on long-video benchmarks.
Enhancing Visual Programming for Visual Reasoning via Probabilistic Graphs
Wentao Wan, Kaiyu Wu, Qingyang Ma, Nan Kang, Yunjie Chen, Liang Lin, Keze Wang
Personalized recommendation rationale:

This paper focuses on visual programming and visual reasoning, research within the vision domain. Although probabilistic graphs are mentioned, the core application scenario is visual reasoning rather than recommendation, search, or advertising. The paper does not demonstrate potential for heterogeneous-data processing, Transformer architecture improvements, or LLM applications relevant here, so its connection to the areas of interest is weak.

2025-12-16 10:07:40 | arXiv:2512.14257v1 |
cs.CV
Full abstract
Recently, Visual Programming (VP) based on large language models (LLMs) has rapidly developed and demonstrated significant potential in complex Visual Reasoning (VR) tasks. Previous works to enhance VP have primarily focused on improving the quality of LLM-generated visual programs. However, they have neglected to optimize the VP-invoked pre-trained models, which serve as modules for the visual sub-tasks decomposed from the targeted tasks by VP. The difficulty is that there are only final labels of targeted VR tasks rather than labels of sub-tasks. Besides, the non-differentiable nature of VP impedes the direct use of efficient gradient-based optimization methods to leverage final labels for end-to-end learning of the entire VP framework. To overcome these issues, we propose EVPG, a method to Enhance Visual Programming for visual reasoning via Probabilistic Graphs. Specifically, we creatively build a directed probabilistic graph according to the variable dependency relationships during the VP executing process, which reconstructs the non-differentiable VP executing process into a differentiable exact probability inference process on this directed probabilistic graph. As a result, this enables the VP framework to utilize the final labels for efficient, gradient-based optimization in end-to-end supervised learning on targeted VR tasks. Extensive and comprehensive experiments demonstrate the effectiveness and advantages of our EVPG, showing significant performance improvements for VP on three classical complex VR tasks: GQA, NLVRv2, and Open Images.
History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation
Xichen Ding, Jianzhe Gao, Cong Pan, Wenguan Wang, Jie Qin
Personalized recommendation rationale:

This paper focuses on vision-and-language navigation (VLN), a robotics navigation task not directly related to the core concerns of recommendation, search, or advertising. Although a Transformer architecture is involved, its application scenario (aerial navigation) differs substantially from typical RecSys/Search/Ads problem domains (user-behavior modeling, content ranking, ad matching), so the potential application value is limited.

2025-12-16 09:16:07 | arXiv:2512.14222v1 |
cs.CVcs.RO
Full abstract
Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments based on linguistic instructions. While successful navigation demands both global environmental reasoning and local scene comprehension, existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. To address this limitation, this work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline. Specifically, HETT first predicts coarse-grained target positions by fusing spatial landmarks and historical context, then refines actions via fine-grained visual analysis. In addition, a historical grid map is designed to dynamically aggregate visual features into a structured spatial memory, enhancing comprehensive scene awareness. Additionally, the CityNav dataset annotations are manually refined to enhance data quality. Experiments on the refined CityNav dataset show that HETT delivers significant performance gains, while extensive ablation studies further verify the effectiveness of each component.
Erasing CLIP Memories: Non-Destructive, Data-Free Zero-Shot class Unlearning in CLIP Models
Ashish Mishra, Tarun Kumar, Gyanaranjan Nayak, Arpit Shah, Suparna Bhattacharya,...
Personalized recommendation rationale:

This paper focuses on memory erasure and class unlearning in CLIP models, at the intersection of computer vision and language models. Although CLIP is a vision-language model, the work targets the specific technical problem of model editing and forgetting, with weak ties to core advances, architectural innovations, or direct applications in recommendation, search, or advertising.

2025-12-16 06:37:41 | arXiv:2512.14137v1 |
cs.CV
Full abstract
We introduce a novel, closed-form approach for selective unlearning in multimodal models, specifically targeting pretrained models such as CLIP. Our method leverages nullspace projection to erase the target class information embedded in the final projection layer, without requiring any retraining or the use of images from the forget set. By computing an orthonormal basis for the subspace spanned by target text embeddings and projecting these directions, we dramatically reduce the alignment between image features and undesired classes. Unlike traditional unlearning techniques that rely on iterative fine-tuning and extensive data curation, our approach is both computationally efficient and surgically precise. This leads to a pronounced drop in zero-shot performance for the target classes while preserving the overall multimodal knowledge of the model. Our experiments demonstrate that even a partial projection can balance between complete unlearning and retaining useful information, addressing key challenges in model decontamination and privacy preservation.
Pairwise Comparison for Bias Identification and Quantification
Fabian Haak, Philipp Schaer
Personalized recommendation rationale:

The title explicitly concerns bias identification and quantification, which falls under fairness and ethics, topics explicitly excluded alongside privacy and other non-technical themes. Bias certainly exists in recommendation, search, and advertising, but since such topics have been marked as out of scope, the paper is unrelated to the technical research focus.

2025-12-16 16:36:55 | arXiv:2512.14565v1 |
cs.IR
Full abstract
Linguistic bias in online news and social media is widespread but difficult to measure. Yet, its identification and quantification remain difficult due to subjectivity, context dependence, and the scarcity of high-quality gold-label datasets. We aim to reduce annotation effort by leveraging pairwise comparison for bias annotation. To overcome the costliness of the approach, we evaluate more efficient implementations of pairwise comparison-based rating. We achieve this by investigating the effects of various rating techniques and the parameters of three cost-aware alternatives in a simulation environment. Since the approach can in principle be applied to both human and large language model annotation, our work provides a basis for creating high-quality benchmark datasets and for quantifying biases and other subjective linguistic aspects. The controlled simulations include latent severity distributions, distance-calibrated noise, and synthetic annotator bias to probe robustness and cost-quality trade-offs. In applying the approach to human-labeled bias benchmark datasets, we then evaluate the most promising setups and compare them to direct assessment by large language models and unmodified pairwise comparison labels as baselines. Our findings support the use of pairwise comparison as a practical foundation for quantifying subjective linguistic aspects, enabling reproducible bias analysis. We contribute an optimization of comparison and matchmaking components, an end-to-end evaluation including simulation and real-data application, and an implementation blueprint for cost-aware large-scale annotation
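Pairwise-comparison ratings of this kind are commonly fit with a Bradley-Terry model; the sketch below shows that standard approach under the assumption that each comparison names the item judged more biased. It does not reproduce the paper's matchmaking or cost-aware variants.

```python
import numpy as np
from collections import defaultdict

def bradley_terry(comparisons, n_items, iters=200):
    """comparisons: list of (winner, loser) index pairs, e.g. 'winner was judged more biased'."""
    wins = np.zeros(n_items)
    pair_counts = defaultdict(int)
    for w, l in comparisons:
        wins[w] += 1
        pair_counts[(min(w, l), max(w, l))] += 1
    p = np.ones(n_items)
    for _ in range(iters):                         # standard MM updates for Bradley-Terry
        denom = np.zeros(n_items)
        for (i, j), n in pair_counts.items():
            denom[i] += n / (p[i] + p[j])
            denom[j] += n / (p[i] + p[j])
        p = np.where(denom > 0, wins / np.maximum(denom, 1e-12), p)
        p = p / p.sum()                            # fix the arbitrary scale
    return p                                       # higher score = judged more biased more often

scores = bradley_terry([(0, 1), (0, 2), (1, 2), (0, 1)], n_items=3)
print(scores)
```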
Spoken DialogSum: An Emotion-Rich Conversational Dataset for Spoken Dialogue Summarization
Yen-Ju Lu, Kunxiao Gao, Mingrui Liang, Helin Wang, Thomas Thebaud, Laureano Moro...
Personalized recommendation rationale:

The title clearly points to a spoken dialogue summarization dataset, purely NLP/speech processing. Although it involves dialogue and emotion, nothing indicates relevance to recommendation, search, or advertising, nor does it involve direct LLM applications in those areas, Transformer architecture improvements, or unified heterogeneous-data modeling.

2025-12-16 18:54:20 | arXiv:2512.14687v1 |
cs.CLcs.AIcs.LGeess.AS
Full abstract
Recent audio language models can follow long conversations. However, research on emotion-aware or spoken dialogue summarization is constrained by the lack of data that links speech, summaries, and paralinguistic cues. We introduce Spoken DialogSum, the first corpus aligning raw conversational audio with factual summaries, emotion-rich summaries, and utterance-level labels for speaker age, gender, and emotion. The dataset is built in two stages: first, an LLM rewrites DialogSum scripts with Switchboard-style fillers and back-channels, then tags each utterance with emotion, pitch, and speaking rate. Second, an expressive TTS engine synthesizes speech from the tagged scripts, aligned with paralinguistic labels. Spoken DialogSum comprises 13,460 emotion-diverse dialogues, each paired with both a factual and an emotion-focused summary. The dataset is available online at https://fatfat-emosum.github.io/EmoDialog-Sum-Audio-Samples/. Baselines show that an Audio-LLM raises emotional-summary ROUGE-L by 28% relative to a cascaded ASR-LLM system, confirming the value of end-to-end speech modeling.
Segmental Attention Decoding With Long Form Acoustic Encodings
Pawel Swietojanski, Xinwei Li, Mingbin Xu, Takaaki Hori, Dogan Can, Xiaodan Zhua...
Personalized recommendation rationale:

The title explicitly concerns acoustic encodings and attention decoding in speech processing, i.e., speech recognition. Although attention is mentioned, it is a specific application to speech signals, with no direct connection to Transformer architecture advances in recommendation, search, or advertising. There is no evidence the method would transfer to heterogeneous data or sequence modeling in recommender systems.

2025-12-16 18:12:37 | arXiv:2512.14652v1 |
eess.AScs.CL
Full abstract
We address the fundamental incompatibility of attention-based encoder-decoder (AED) models with long-form acoustic encodings. AED models trained on segmented utterances learn to encode absolute frame positions by exploiting limited acoustic context beyond segment boundaries, but fail to generalize when decoding long-form segments where these cues vanish. The model loses ability to order acoustic encodings due to permutation invariance of keys and values in cross-attention. We propose four modifications: (1) injecting explicit absolute positional encodings into cross-attention for each decoded segment, (2) long-form training with extended acoustic context to eliminate implicit absolute position encoding, (3) segment concatenation to cover diverse segmentations needed during training, and (4) semantic segmentation to align AED-decoded segments with training segments. We show these modifications close the accuracy gap between continuous and segmented acoustic encodings, enabling auto-regressive use of the attention decoder.
Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies
Ekaterina Artemova, Laurie Burchell, Daryna Dementieva, Shu Okabe, Mariya Shmato...
Personalized recommendation rationale:

The title concerns corpus construction and inclusivity for low-resource language technologies, a specific NLP area with no direct connection to core technical advances in recommendation, search, or advertising, LLM applications, Transformer architecture improvements, or unified heterogeneous-data modeling. The 'inclusive language technologies' framing leans toward ethics and social impact, which is explicitly out of scope.

2025-12-16 16:44:17 | arXiv:2512.14576v1 |
cs.CLcs.AI
Full abstract
This tutorial (https://tum-nlp.github.io/low-resource-tutorial) is designed for NLP practitioners, researchers, and developers working with multilingual and low-resource languages who seek to create more equitable and socially impactful language technologies. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages -- from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning. The tutorial presents strategies for tackling the challenges of data scarcity and cultural variance, offering hands-on methods and modeling frameworks. We will focus on fair, reproducible, and community-informed development approaches, grounded in real-world scenarios. We will showcase a diverse set of use cases covering over 10 languages from different language families and geopolitical contexts, including both digitally resource-rich and severely underrepresented languages.
VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models
Nguyen Tien Dong, Minh-Anh Nguyen, Thanh Dat Hoang, Nguyen Tuan Ngoc, Dao Xuan Q...
Personalized recommendation rationale:

The title clearly points to a legal-domain benchmark, a domain-specific application (law) unrelated to the core RecSys/Search/Ads areas. Although it involves LLM evaluation, it is a pure NLP evaluation benchmark and shows no potential application value for recommendation, search, or advertising systems.

2025-12-16 16:28:32 | arXiv:2512.14554v1 |
cs.CLcs.AI
Full abstract
The rapid advancement of large language models (LLMs) has enabled new possibilities for applying artificial intelligence within the legal domain. Nonetheless, the complexity, hierarchical organization, and frequent revisions of Vietnamese legislation pose considerable challenges for evaluating how well these models interpret and utilize legal knowledge. To address this gap, Vietnamese Legal Benchmark (VLegal-Bench) is introduced, the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom's cognitive taxonomy, VLegal-Bench encompasses multiple levels of legal understanding through tasks designed to reflect practical usage scenarios. The benchmark comprises 10,450 samples generated through a rigorous annotation pipeline, where legal experts label and cross-validate each instance using our annotation system to ensure every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows, including general legal questions and answers, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving tailored to Vietnamese law. By providing a standardized, transparent, and cognitively informed evaluation framework, VLegal-Bench establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts and supports the development of more reliable, interpretable, and ethically aligned AI-assisted legal systems.
Linguists should learn to love speech-based deep learning models
Marianne de Heer Kloots, Paul Boersma, Willem Zuidema
Personalized recommendation rationale:

The title focuses on speech processing and linguistics, clearly an out-of-scope topic. Although deep learning models are involved, the core concerns (speech, linguistics) have no direct connection to the specified core areas of recommendation, search, and advertising, and no potentially transferable technique is mentioned.

2025-12-16 15:42:22 | arXiv:2512.14506v1 |
cs.CLcs.SDeess.ASq-bio.NC
Full abstract
Futrell and Mahowald present a useful framework bridging technology-oriented deep learning systems and explanation-oriented linguistic theories. Unfortunately, the target article's focus on generative text-based LLMs fundamentally limits fruitful interactions with linguistics, as many interesting questions on human language fall outside what is captured by written text. We argue that audio-based deep learning models can and should play a crucial role.
C-ing Clearly: Enhanced Binary Code Explanations using C code
Teodor Poncu, Ioana Pintilie, Marius Dragoi, Dragos Tantaru, Florin Brad
Personalized recommendation rationale:

The title concerns explanations of binary code using C code, which falls under program analysis or software engineering, with no direct connection to core recommendation, search, or advertising techniques. The content likely centers on code understanding or reverse engineering, topics outside every current technical focus, including core recommender-system advances, LLM techniques, Transformer architectures, and heterogeneous-data modeling.

2025-12-16 15:36:29 | arXiv:2512.14500v1 |
cs.CLcs.LG
Full abstract
Large Language Models (LLMs) typically excel at coding tasks involving high-level programming languages, as opposed to lower-level programming languages, such as assembly. We propose a synthetic data generation method named C-ing Clearly, which leverages the corresponding C code to enhance an LLM's understanding of assembly. By fine-tuning on data generated through our method, we demonstrate improved LLM performance for binary code summarization and vulnerability detection. Our approach demonstrates consistent gains across different LLM families and model sizes.
Inflation Attitudes of Large Language Models
Nikoleta Anesti, Edward Hill, Andreas Joseph
Personalized recommendation rationale:

The title indicates a study of LLM attitudes toward an economic concept (inflation), which is LLM behavior analysis at the intersection with social science and entirely unrelated to the technical focus (recommender systems, search and advertising, Transformer architectures, LLM application techniques). The study shows no potential application value for RecSys/Search/Ads.

2025-12-16 11:21:46 | arXiv:2512.14306v1 |
cs.CLecon.EM
Full abstract
This paper investigates the ability of Large Language Models (LLMs), specifically GPT-3.5-turbo (GPT), to form inflation perceptions and expectations based on macroeconomic price signals. We compare the LLM's output to household survey data and official statistics, mimicking the information set and demographic characteristics of the Bank of England's Inflation Attitudes Survey (IAS). Our quasi-experimental design exploits the timing of GPT's training cut-off in September 2021 which means it has no knowledge of the subsequent UK inflation surge. We find that GPT tracks aggregate survey projections and official statistics at short horizons. At a disaggregated level, GPT replicates key empirical regularities of households' inflation perceptions, particularly for income, housing tenure, and social class. A novel Shapley value decomposition of LLM outputs suited for the synthetic survey setting provides well-defined insights into the drivers of model outputs linked to prompt content. We find that GPT demonstrates a heightened sensitivity to food inflation information similar to that of human respondents. However, we also find that it lacks a consistent model of consumer price inflation. More generally, our approach could be used to evaluate the behaviour of LLMs for use in the social sciences, to compare different models, or to assist in survey design.
Two CFG Nahuatl for automatic corpora expansion
Juan-José Guzmán-Landa, Juan-Manuel Torres-Moreno, Miguel Figueroa-Saavedra, Lig...
Personalized recommendation rationale:

The title concerns context-free grammars (CFGs) and corpus expansion for a specific language (Nahuatl), which belongs to linguistics or computational linguistics. It does not touch on the core areas of current interest such as recommendation, search, advertising, LLM techniques, Transformer architectures, or unified heterogeneous-data modeling; the topic has no direct connection to any listed technical focus.

2025-12-16 09:49:31 | arXiv:2512.14239v1 |
cs.CL
Full abstract
The aim of this article is to introduce two Context-Free Grammars (CFGs) for Nawatl corpora expansion. Nawatl is an Amerindian language (it is a National Language of Mexico) of the $π$-language type, i.e. a language with few digital resources. For this reason, the corpora available for training Large Language Models (LLMs) are virtually non-existent, posing a significant challenge. The goal is to produce a substantial number of syntactically valid artificial Nawatl sentences and thereby to expand the corpora for the purpose of learning non-contextual embeddings. For this objective, we introduce two new Nawatl CFGs and use them in generative mode. Using these grammars, it is possible to expand the Nawatl corpus significantly and subsequently to use it to learn embeddings and to evaluate their relevance in a sentence semantic similarity task. The results show an improvement compared to the results obtained using only the original corpus without artificial expansion, and also demonstrate that economic embeddings often perform better than some LLMs.
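A CFG can be used in generative mode with NLTK to enumerate syntactically valid sentences for corpus expansion. The toy grammar below only illustrates the mechanism; its productions and terminals are placeholder words and it is not one of the two grammars introduced in the paper.

```python
from nltk import CFG
from nltk.parse.generate import generate

toy_grammar = CFG.fromstring("""
  S  -> NP VP
  NP -> Det N
  VP -> V NP | V
  Det -> 'in' | 'ce'
  N  -> 'cihuatl' | 'tlacatl' | 'chichi'
  V  -> 'quitta' | 'quicua'
""")

# Enumerate syntactically valid sentences (bounded by depth and count) to expand the corpus.
synthetic_sentences = [" ".join(s) for s in generate(toy_grammar, depth=5, n=50)]
print(len(synthetic_sentences), synthetic_sentences[:3])
```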
Multilingual and Continuous Backchannel Prediction: A Cross-lingual Study
Koji Inoue, Mikey Elmers, Yahui Fu, Zi Haur Pang, Taiga Mori, Divesh Lala, Keiko...
Personalized recommendation rationale:

The title concerns backchannel prediction in multilingual dialogue, a speech-processing/dialogue-systems topic unrelated to core recommendation, search, or advertising techniques. The paper does not touch on the areas of current interest such as Transformer architectures for RecSys, LLM techniques, or unified heterogeneous-data modeling, nor does it show potential application value for recommendation/search/advertising.

2025-12-16 04:50:22 | arXiv:2512.14085v1 |
cs.CLcs.HCcs.SD
Full abstract
We present a multilingual, continuous backchannel prediction model for Japanese, English, and Chinese, and use it to investigate cross-linguistic timing behavior. The model is Transformer-based and operates at the frame level, jointly trained with auxiliary tasks on approximately 300 hours of dyadic conversations. Across all three languages, the multilingual model matches or surpasses monolingual baselines, indicating that it learns both language-universal cues and language-specific timing patterns. Zero-shot transfer with two-language training remains limited, underscoring substantive cross-lingual differences. Perturbation analyses reveal distinct cue usage: Japanese relies more on short-term linguistic information, whereas English and Chinese are more sensitive to silence duration and prosodic variation; multilingual training encourages shared yet adaptable representations and reduces overreliance on pitch in Chinese. A context-length study further shows that Japanese is relatively robust to shorter contexts, while Chinese benefits markedly from longer contexts. Finally, we integrate the trained model into a real-time processing software, demonstrating CPU-only inference. Together, these findings provide a unified model and empirical evidence for how backchannel timing differs across languages, informing the design of more natural, culturally-aware spoken dialogue systems.
Scalable Frameworks for Real-World Audio-Visual Speech Recognition
Sungnyun Kim
Personalized recommendation rationale:

This paper focuses on audio-visual speech recognition, a multimodal topic that clearly falls within vision and speech, with no direct connection to core recommendation, search, or advertising techniques. Although it involves multimodal modeling, it mentions nothing related to user-behavior sequences, contextual features, or recommendation/search/advertising applications, and sits entirely within the excluded topics.

2025-12-16 04:50:13 | arXiv:2512.14083v1 |
eess.AScs.CLcs.LG
查看完整摘要
The practical deployment of Audio-Visual Speech Recognition (AVSR) systems is fundamentally challenged by significant performance degradation in real-world environments, characterized by unpredictable acoustic noise and visual interference. This dissertation posits that a systematic, hierarchical approach is essential to overcome these challenges, achieving robust scalability at the representation, architecture, and system levels. At the representation level, we investigate methods for building a unified model that learns audio-visual features inherently robust to diverse real-world corruptions, thereby enabling generalization to new environments without specialized modules. To address architectural scalability, we explore how to efficiently expand model capacity while ensuring the adaptive and reliable use of multimodal inputs, developing a framework that intelligently allocates computational resources based on the input characteristics. Finally, at the system level, we present methods to expand the system's functionality through modular integration with large-scale foundation models, leveraging their powerful cognitive and generative capabilities to maximize final recognition accuracy. By systematically providing solutions at each of these three levels, this dissertation aims to build a next-generation, robust, and scalable AVSR system with high reliability in real-world applications.
Grammar Search for Multi-Agent Systems
Mayank Singh, Vikas Yadav, Shiva Krishna Reddy Malay, Shravan Nayak, Sai Rajeswa...
个性化推荐理由:

该论文标题涉及多智能体系统,属于分布式AI或协作智能领域,与推荐系统、搜索或广告的核心技术没有直接关联。标题中的“语法搜索”可能指形式化方法或程序合成,这些主题不在当前关注范围内,且未提及任何与LLM、Transformer架构或异构数据处理相关的技术。

2025-12-16 04:37:07 | arXiv:2512.14079v1 |
cs.AIcs.CLcs.MA
查看完整摘要
Automatic search for Multi-Agent Systems has recently emerged as a key focus in agentic AI research. Several prior approaches have relied on LLM-based free-form search over the code space. In this work, we propose a more structured framework that explores the same space through a fixed set of simple, composable components. We show that, despite lacking the generative flexibility of LLMs during the candidate generation stage, our method outperforms prior approaches on four out of five benchmarks across two domains: mathematics and question answering. Furthermore, our method offers additional advantages, including a more cost-efficient search process and the generation of modular, interpretable multi-agent systems with simpler logic.
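下面的草图示意"在固定的可组合组件集合上做结构化搜索"的思路:枚举组件序列并按基准准确率挑选最优流水线。组件名称、llm_call 桩函数与玩具基准均为假设,并非论文的实际组件集合或搜索空间。

```python
import itertools

def llm_call(prompt):
    """Placeholder for an LLM call; returns a canned answer for illustration."""
    return "42" if "2*21" in prompt else "unsure"

# Hypothetical composable components; the paper's actual component set differs.
def solve(question, state):
    return llm_call(f"Solve: {question}")

def critique(question, state):
    return state if state != "unsure" else llm_call(f"Solve carefully: {question}")

def aggregate(question, state):
    return state  # a real aggregator would merge multiple candidate answers

COMPONENTS = {"solve": solve, "critique": critique, "aggregate": aggregate}

def run_pipeline(pipeline, question):
    state = ""
    for name in pipeline:
        state = COMPONENTS[name](question, state)
    return state

def grammar_search(benchmark, max_len=3):
    """Enumerate component sequences (a very simple 'grammar') and keep the
    pipeline with the highest accuracy on the benchmark."""
    best, best_acc = None, -1.0
    for length in range(1, max_len + 1):
        for pipeline in itertools.product(COMPONENTS, repeat=length):
            acc = sum(run_pipeline(pipeline, q) == a for q, a in benchmark) / len(benchmark)
            if acc > best_acc:
                best, best_acc = pipeline, acc
    return best, best_acc

if __name__ == "__main__":
    toy_benchmark = [("What is 2*21?", "42")]
    print(grammar_search(toy_benchmark))
```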
CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives
Zihan Wang, Jiashun Wang, Jeff Tan, Yiwen Zhao, Jessica Hodgins, Shubham Tulsian...
个性化推荐理由:

该论文标题涉及计算机视觉中的真实到仿真转换和平面场景基元,属于纯粹的视觉/3D视觉领域,与推荐系统、搜索或广告没有明确关联。标题中提到的单目视频处理、接触引导和场景基元等技术在您的关注领域中缺乏直接应用潜力。

2025-12-16 18:59:50 | arXiv:2512.14696v1 |
cs.CVcs.GRcs.RO
查看完整摘要
We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Prior work on joint human-scene reconstruction relies on data-driven priors and joint optimization with no physics in the loop, or recovers noisy geometry with artifacts that cause motion tracking policies with scene interactions to fail. In contrast, our key insight is to recover convex, clean, and simulation-ready geometry by fitting planar primitives to a point cloud reconstruction of the scene, via a simple clustering pipeline over depth, normals, and flow. To reconstruct scene geometry that might be occluded during interactions, we make use of human-scene contact modeling (e.g., we use human posture to reconstruct the occluded seat of a chair). Finally, we ensure that human and scene reconstructions are physically-plausible by using them to drive a humanoid controller via reinforcement learning. Our approach reduces motion tracking failure rates from 55.2% to 6.9% on human-centric video benchmarks (EMDB, PROX), while delivering a 43% faster RL simulation throughput. We further validate it on in-the-wild videos including casually-captured videos, Internet videos, and even Sora-generated videos. This demonstrates CRISP's ability to generate physically-valid human motion and interaction environments at scale, greatly advancing real-to-sim applications for robotics and AR/VR.
Native and Compact Structured Latents for 3D Generation
Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng, Hon...
个性化推荐理由:

该论文标题明确聚焦于3D生成技术,属于计算机图形学领域,与推荐系统、搜索或广告的核心技术无直接关联。虽然结构化潜在表示在推荐系统中可能有应用,但论文的3D生成焦点使其完全偏离了指定的关注领域。

2025-12-16 18:58:28 | arXiv:2512.14692v1 |
cs.CVcs.AI
查看完整摘要
Recent advancements in 3D generative modeling have significantly improved generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper presents an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.
VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image
Sicheng Xu, Guojun Chen, Jiaolong Yang, Yizhong Zhang, Yu Deng, Steve Lin, Baini...
个性化推荐理由:

该论文专注于3D视觉和音频驱动的头像生成,属于纯粹的视觉/图形领域,与推荐系统、搜索或广告的核心技术无关。虽然涉及生成技术,但属于AIGC/内容生成范畴,且没有展示与RecSys/Search/Ads相关的潜在应用。

2025-12-16 18:44:00 | arXiv:2512.14677v1 |
cs.CVcs.AI
查看完整摘要
We propose VASA-3D, an audio-driven, single-shot 3D head avatar generator. This research tackles two major challenges: capturing the subtle expression details present in real human faces, and reconstructing an intricate 3D head avatar from a single portrait image. To accurately model expression details, VASA-3D leverages the motion latent of VASA-1, a method that yields exceptional realism and vividness in 2D talking heads. A critical element of our work is translating this motion latent to 3D, which is accomplished by devising a 3D head model that is conditioned on the motion latent. Customization of this model to a single image is achieved through an optimization framework that employs numerous video frames of the reference head synthesized from the input image. The optimization takes various training losses robust to artifacts and limited pose coverage in the generated training data. Our experiment shows that VASA-3D produces realistic 3D talking heads that cannot be achieved by prior art, and it supports the online generation of 512x512 free-viewpoint videos at up to 75 FPS, facilitating more immersive engagements with lifelike 3D avatars.
ART: Articulated Reconstruction Transformer
Zizhang Li, Cheng Zhang, Zhengqin Li, Henry Howard-Jenkins, Zhaoyang Lv, Chen Ge...
个性化推荐理由:

该论文标题表明其专注于3D视觉中的关节重建任务,属于纯粹的计算机视觉领域。虽然使用了Transformer架构,但缺乏与推荐系统、搜索或广告领域的直接或潜在应用关联,因此与您的关注点高度不相关。

2025-12-16 18:35:23 | arXiv:2512.14671v1 |
cs.CV
查看完整摘要
We introduce ART, Articulated Reconstruction Transformer -- a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as part-based prediction. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable for simulation. Trained on a large-scale, diverse dataset with per-part supervision, and evaluated across diverse benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.
Enhancing Visual Sentiment Analysis via Semiotic Isotopy-Guided Dataset Construction
Marco Blanchini, Giovanna Maria Dimitri, Benedetta Tondi, Tarcisio Lancioni, Mau...
个性化推荐理由:

该论文专注于视觉情感分析,属于纯粹的计算机视觉领域,与推荐系统、搜索或广告的核心技术没有直接关联。虽然视觉内容可能出现在某些推荐场景中,但该论文的方法论和核心贡献完全围绕视觉模态,没有涉及文本、序列建模或多模态融合等与推荐/搜索/广告相关的关键技术。

2025-12-16 18:26:22 | arXiv:2512.14665v1 |
cs.CV
查看完整摘要
Visual Sentiment Analysis (VSA) is a challenging task due to the vast diversity of emotionally salient images and the inherent difficulty of acquiring sufficient data to capture this variability comprehensively. Key obstacles include building large-scale VSA datasets and developing effective methodologies that enable algorithms to identify emotionally significant elements within an image. These challenges are reflected in the limited generalization performance of VSA algorithms and models when trained and tested across different datasets. Starting from a pool of existing data collections, our approach enables the creation of a new larger dataset that not only contains a wider variety of images than the original ones, but also permits training new models with improved capability to focus on emotionally relevant combinations of image elements. This is achieved through the integration of the semiotic isotopy concept within the dataset creation process, providing deeper insights into the emotional content of images. Empirical evaluations show that models trained on a dataset generated with our method consistently outperform those trained on the original data collections, achieving superior generalization across major VSA benchmarks
WaveSim: A Wavelet-based Multi-scale Similarity Metric for Weather and Climate Fields
Gabriele Accarino, Viviana Acquaviva, Sara Shamekh, Duncan Watson-Parris, David ...
个性化推荐理由:

该论文标题明确指向气象和气候领域的特定相似度度量方法,属于气象学/气候学的专业应用。虽然提到了相似度度量这一通用概念,但没有任何迹象表明该方法与推荐系统、搜索、广告或相关使能技术(如LLM、Transformer架构)有任何关联。该论文完全属于被排除的“气象学或其他领域特定应用”范畴。

2025-12-16 18:15:53 | arXiv:2512.14656v1 |
physics.ao-phcs.CVphysics.data-an
查看完整摘要
We introduce WaveSim, a multi-scale similarity metric for the evaluation of spatial fields in weather and climate applications. WaveSim exploits wavelet transforms to decompose input fields into scale-specific wavelet coefficients. The metric is built by multiplying three orthogonal components derived from these coefficients: Magnitude, which quantifies similarities in the energy distribution of the coefficients, i.e., the intensity of the field; Displacement, which captures spatial shift by comparing the centers of mass of normalized energy distributions; and Structure, which assesses pattern organization independent of location and amplitude. Each component yields a scale-specific similarity score ranging from 0 (no similarity) to 1 (perfect similarity), which are then combined across scales to produce an overall similarity measure. We first evaluate WaveSim using synthetic test cases, applying controlled spatial and temporal perturbations to systematically assess its sensitivity and expected behavior. We then demonstrate its applicability to physically relevant case studies of key modes of climate variability in Earth System Models. Traditional point-wise metrics lack a mechanism for attributing errors to physical scales or modes of dissimilarity. By operating in the wavelet domain and decomposing the signal along independent axes, WaveSim bypasses these limitations and provides an interpretable and diagnostically rich framework for assessing similarity in complex fields. Additionally, the WaveSim framework allows users to place emphasis on a specific scale or component, and lends itself to user-specific model intercomparison, model evaluation, and calibration and training of forecasting systems. We provide a PyTorch-ready implementation of WaveSim, along with all evaluation scripts, at: https://github.com/gabrieleaccarino/wavesim.
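下面给出 Magnitude 分量的一个极简示意(假设已安装 PyWavelets):按尺度比较小波细节系数的能量,并给出 [0, 1] 区间的相似度。能量比的具体形式是这里的假设,Displacement 与 Structure 分量未实现;准确定义请以论文的公开代码为准。

```python
import numpy as np
import pywt  # PyWavelets

def magnitude_similarity(field_a, field_b, wavelet="haar", level=3):
    """Sketch of a per-scale Magnitude-style score: compare the energy of
    wavelet detail coefficients at each scale; 1.0 means identical energy."""
    coeffs_a = pywt.wavedec2(field_a, wavelet, level=level)
    coeffs_b = pywt.wavedec2(field_b, wavelet, level=level)
    scores = []
    # Index 0 holds the approximation coefficients; iterate detail scales only.
    for da, db in zip(coeffs_a[1:], coeffs_b[1:]):
        ea = sum(np.sum(np.square(c)) for c in da)
        eb = sum(np.sum(np.square(c)) for c in db)
        scores.append(2.0 * np.sqrt(ea * eb) / (ea + eb + 1e-12))
    return scores  # one similarity value per scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(64, 64))
    y = x + 0.1 * rng.normal(size=(64, 64))
    print(magnitude_similarity(x, y))
```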
ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking
Lihong Wang, Liangqi Li, Weiwei Feng, Jiamin Wu, Changtao Miao, Tieru Wu, Rui Ma...
个性化推荐理由:

该论文标题表明其专注于视觉-语言交互和数学推理,属于多模态推理领域。虽然涉及视觉和语言模态,但核心是数学思维链推理,与推荐系统、搜索或广告的异构数据处理没有直接关联。该技术主要针对数学问题求解,缺乏在推荐/搜索/广告领域的明确应用潜力。

2025-12-16 18:13:54 | arXiv:2512.14654v1 |
cs.CV
查看完整摘要
CoT has significantly enhanced the reasoning ability of LLMs, but it faces challenges when extended to multimodal domains, particularly in mathematical tasks. Existing MLLMs typically perform textual reasoning solely from a single static mathematical image, overlooking dynamic visual acquisition during reasoning. In contrast, humans repeatedly examine the visual image and employ step-by-step reasoning to prove intermediate propositions. This strategy of decomposing the problem-solving process into key logical nodes adheres to Miller's Law in cognitive science. Inspired by this insight, we propose a ViRC framework for multimodal mathematical tasks, introducing a Reason Chunking mechanism that structures multimodal mathematical CoT into consecutive Critical Reasoning Units (CRUs) to simulate human expert problem-solving patterns. CRUs ensure intra-unit textual coherence for intermediate proposition verification while integrating visual information across units to generate subsequent propositions and support structured reasoning. To this end, we present the CRUX dataset by using three visual tools and four reasoning patterns to provide explicitly annotated CRUs across multiple reasoning paths for each mathematical problem. Leveraging the CRUX dataset, we propose a progressive training strategy inspired by human cognitive learning, which includes Instructional SFT, Practice SFT, and Strategic RL, aimed at further strengthening the Reason Chunking ability of the model. The resulting ViRC-7B model achieves an 18.8% average improvement over baselines across multiple mathematical benchmarks. Code is available at https://github.com/Leon-LihongWang/ViRC.
Adaptable Segmentation Pipeline for Diverse Brain Tumors with Radiomic-guided Subtyping and Lesion-Wise Model Ensemble
Daniel Capellán-Martín, Abhijeet Parida, Zhifan Jiang, Nishad Kulkarni, Krithika...
个性化推荐理由:

该论文标题明确聚焦于医学影像分析(脑肿瘤分割与亚型分类),属于明确的医学领域应用。虽然提到了模型集成技术,但其核心应用场景(脑肿瘤诊断)与搜索、推荐、广告等商业系统完全无关,且未涉及任何Transformer架构、LLM技术或多模态建模在商业领域的潜在应用。

2025-12-16 18:09:48 | arXiv:2512.14648v1 |
cs.CVeess.IV
查看完整摘要
Robust and generalizable segmentation of brain tumors on multi-parametric magnetic resonance imaging (MRI) remains difficult because tumor types differ widely. The BraTS 2025 Lighthouse Challenge benchmarks segmentation methods on diverse high-quality datasets of adult and pediatric tumors: multi-consortium international pediatric brain tumor segmentation (PED), preoperative meningioma tumor segmentation (MEN), meningioma radiotherapy segmentation (MEN-RT), and segmentation of pre- and post-treatment brain metastases (MET). We present a flexible, modular, and adaptable pipeline that improves segmentation performance by selecting and combining state-of-the-art models and applying tumor- and lesion-specific processing before and after training. Radiomic features extracted from MRI help detect tumor subtype, ensuring a more balanced training. Custom lesion-level performance metrics determine the influence of each model in the ensemble and optimize post-processing that further refines the predictions, enabling the workflow to tailor every step to each case. On the BraTS testing sets, our pipeline achieved performance comparable to top-ranked algorithms across multiple challenges. These findings confirm that custom lesion-aware processing and model selection yield robust segmentations yet without locking the method to a specific network architecture. Our method has the potential for quantitative tumor measurement in clinical practice, supporting diagnosis and prognosis.
A Multicenter Benchmark of Multiple Instance Learning Models for Lymphoma Subtyping from HE-stained Whole Slide Images
Rao Muhammad Umer, Daniel Sens, Jonathan Noll, Christian Matek, Lukas Wolfseher,...
个性化推荐理由:

该论文标题明确指向医学影像分析(淋巴瘤亚型分型),属于明确的医学领域应用,完全在用户列出的不相关主题范围内。虽然涉及多实例学习模型,但该技术应用于病理图像分析,与推荐系统、搜索或广告领域无直接或潜在关联。

2025-12-16 17:58:03 | arXiv:2512.14640v1 |
cs.CVcs.AI
查看完整摘要
Timely and accurate lymphoma diagnosis is essential for guiding cancer treatment. Standard diagnostic practice combines hematoxylin and eosin (HE)-stained whole slide images with immunohistochemistry, flow cytometry, and molecular genetic tests to determine lymphoma subtypes, a process requiring costly equipment, skilled personnel, and causing treatment delays. Deep learning methods could assist pathologists by extracting diagnostic information from routinely available HE-stained slides, yet comprehensive benchmarks for lymphoma subtyping on multicenter data are lacking. In this work, we present the first multicenter lymphoma benchmarking dataset covering four common lymphoma subtypes and healthy control tissue. We systematically evaluate five publicly available pathology foundation models (H-optimus-1, H0-mini, Virchow2, UNI2, Titan) combined with attention-based (AB-MIL) and transformer-based (TransMIL) multiple instance learning aggregators across three magnifications (10x, 20x, 40x). On in-distribution test sets, models achieve multiclass balanced accuracies exceeding 80% across all magnifications, with all foundation models performing similarly and both aggregation methods showing comparable results. The magnification study reveals that 40x resolution is sufficient, with no performance gains from higher resolutions or cross-magnification aggregation. However, on out-of-distribution test sets, performance drops substantially to around 60%, highlighting significant generalization challenges. To advance the field, larger multicenter studies covering additional rare lymphoma subtypes are needed. We provide an automated benchmarking pipeline to facilitate such future research.
AMD-HookNet++: Evolution of AMD-HookNet with Hybrid CNN-Transformer Feature Enhancement for Glacier Calving Front Segmentation
Fei Wu, Marcel Dreier, Nora Gourmelon, Sebastian Wind, Jianlin Zhang, Thorsten S...
个性化推荐理由:

该论文专注于冰川崩解前沿分割的计算机视觉任务,属于纯粹的视觉应用领域。虽然涉及Transformer架构,但论文内容明确针对冰川监测这一地球科学特定领域,与推荐系统、搜索或广告的核心技术进展无直接关联,也不具备明显的跨模态建模潜力。

2025-12-16 17:57:52 | arXiv:2512.14639v1 |
cs.CV
查看完整摘要
The dynamics of glaciers and ice shelf fronts significantly impact the mass balance of ice sheets and coastal sea levels. To effectively monitor glacier conditions, it is crucial to consistently estimate positional shifts of glacier calving fronts. AMD-HookNet firstly introduces a pure two-branch convolutional neural network (CNN) for glacier segmentation. Yet, the local nature and translational invariance of convolution operations, while beneficial for capturing low-level details, restrict the model's ability to maintain long-range dependencies. In this study, we propose AMD-HookNet++, a novel advanced hybrid CNN-Transformer feature enhancement method for segmenting glaciers and delineating calving fronts in synthetic aperture radar images. Our hybrid structure consists of two branches: a Transformer-based context branch to capture long-range dependencies, which provides global contextual information in a larger view, and a CNN-based target branch to preserve local details. To strengthen the representation of the connected hybrid features, we devise an enhanced spatial-channel attention module to foster interactions between the hybrid CNN-Transformer branches through dynamically adjusting the token relationships from both spatial and channel perspectives. Additionally, we develop a pixel-to-pixel contrastive deep supervision to optimize our hybrid model by integrating pixelwise metric learning into glacier segmentation. Through extensive experiments and comprehensive quantitative and qualitative analyses on the challenging glacier segmentation benchmark dataset CaFFe, we show that AMD-HookNet++ sets a new state of the art with an IoU of 78.2 and an HD95 of 1,318 m, while maintaining a competitive MDE of 367 m. More importantly, our hybrid model produces smoother delineations of calving fronts, resolving the issue of jagged edges typically seen in pure Transformer-based approaches.
Distill Video Datasets into Images
Zhenghao Zhao, Haoxuan Wang, Kai Wang, Yuzhang Shang, Yuan Hong, Yan Yan
个性化推荐理由:

该论文标题聚焦于视频数据到图像的转换,属于计算机视觉领域,与推荐系统、搜索或广告的核心技术无直接关联。虽然数据蒸馏技术可能具有通用性,但标题未表明其在异构数据处理、Transformer架构或LLM应用方面的潜力,因此与当前关注点高度不相关。

2025-12-16 17:33:41 | arXiv:2512.14621v1 |
cs.CV
查看完整摘要
Dataset distillation aims to synthesize compact yet informative datasets that allow models trained on them to achieve performance comparable to training on the full dataset. While this approach has shown promising results for image data, extending dataset distillation methods to video data has proven challenging and often leads to suboptimal performance. In this work, we first identify the core challenge in video set distillation as the substantial increase in learnable parameters introduced by the temporal dimension of video, which complicates optimization and hinders convergence. To address this issue, we observe that a single frame is often sufficient to capture the discriminative semantics of a video. Leveraging this insight, we propose Single-Frame Video set Distillation (SFVD), a framework that distills videos into highly informative frames for each class. Using differentiable interpolation, these frames are transformed into video sequences and matched with the original dataset, while updates are restricted to the frames themselves for improved optimization efficiency. To further incorporate temporal information, the distilled frames are combined with sampled real videos during the matching process through a channel reshaping layer. Extensive experiments on multiple benchmarks demonstrate that SFVD substantially outperforms prior methods, achieving improvements of up to 5.3% on MiniUCF, thereby offering a more effective solution.
FakeRadar: Probing Forgery Outliers to Detect Unknown Deepfake Videos
Zhaolun Li, Jichang Li, Yinqi Cai, Junye Chen, Xiaonan Luo, Guanbin Li, Rushi La...
个性化推荐理由:

该论文专注于深度伪造视频检测,属于计算机视觉安全领域,与推荐系统、搜索或广告的核心技术无关。虽然涉及异常检测技术,但该技术并未与RecSys/Search/Ads的具体应用场景结合,且论文主题明确排除在关注范围之外。

2025-12-16 17:11:45 | arXiv:2512.14601v1 |
cs.CVcs.AI
查看完整摘要
In this paper, we propose FakeRadar, a novel deepfake video detection framework designed to address the challenges of cross-domain generalization in real-world scenarios. Existing detection methods typically rely on manipulation-specific cues, performing well on known forgery types but exhibiting severe limitations against emerging manipulation techniques. This poor generalization stems from their inability to adapt effectively to unseen forgery patterns. To overcome this, we leverage large-scale pretrained models (e.g. CLIP) to proactively probe the feature space, explicitly highlighting distributional gaps between real videos, known forgeries, and unseen manipulations. Specifically, FakeRadar introduces Forgery Outlier Probing, which employs dynamic subcluster modeling and cluster-conditional outlier generation to synthesize outlier samples near boundaries of estimated subclusters, simulating novel forgery artifacts beyond known manipulation types. Additionally, we design Outlier-Guided Tri-Training, which optimizes the detector to distinguish real, fake, and outlier samples using proposed outlier-driven contrastive learning and outlier-conditioned cross-entropy losses. Experiments show that FakeRadar outperforms existing methods across various benchmark datasets for deepfake video detection, particularly in cross-domain evaluations, by handling the variety of emerging manipulation techniques.
TUMTraf EMOT: Event-Based Multi-Object Tracking Dataset and Baseline for Traffic Scenarios
Mengyu Li, Xingcheng Zhou, Guang Chen, Alois Knoll, Hu Cao
个性化推荐理由:

该论文标题明确聚焦于计算机视觉领域的事件驱动多目标跟踪,属于纯视觉研究方向,且限定在交通场景这一特定应用领域。根据用户列出的无关主题,明确排除了'纯视觉论文若无明确推荐/搜索/广告相关性则不相关',该工作未展示与推荐系统、搜索或广告的任何潜在联系,因此完全不相关。

2025-12-16 17:05:39 | arXiv:2512.14595v1 |
cs.CV
查看完整摘要
In Intelligent Transportation Systems (ITS), multi-object tracking is primarily based on frame-based cameras. However, these cameras tend to perform poorly under dim lighting and high-speed motion conditions. Event cameras, characterized by low latency, high dynamic range and high temporal resolution, have considerable potential to mitigate these issues. Compared to frame-based vision, there are far fewer studies on event-based vision. To address this research gap, we introduce an initial pilot dataset tailored for event-based ITS, covering vehicle and pedestrian detection and tracking. We establish a tracking-by-detection benchmark with a specialized feature extractor based on this dataset, achieving excellent performance.
LLM-driven Knowledge Enhancement for Multimodal Cancer Survival Prediction
Chenyu Zhao, Yingxue Xu, Fengtao Zhou, Yihui Wang, Hao Chen
个性化推荐理由:

该论文标题明确聚焦于医学领域的癌症生存预测,属于明确的医学/生物学应用范畴,这被列为不相关主题。虽然提到了多模态和知识增强,但这些技术被应用于医学领域而非推荐系统、搜索或广告领域,因此与当前关注点完全无关。

2025-12-16 17:03:56 | arXiv:2512.14594v1 |
cs.CV
查看完整摘要
Current multimodal survival prediction methods typically rely on pathology images (WSIs) and genomic data, both of which are high-dimensional and redundant, making it difficult to extract discriminative features from them and align different modalities. Moreover, using a simple survival follow-up label is insufficient to supervise such a complex task. To address these challenges, we propose KEMM, an LLM-driven Knowledge-Enhanced Multimodal Model for cancer survival prediction, which integrates expert reports and prognostic background knowledge. 1) Expert reports, provided by pathologists on a case-by-case basis and refined by a large language model (LLM), offer succinct and clinically focused diagnostic statements. This information may typically suggest different survival outcomes. 2) Prognostic background knowledge (PBK), generated concisely by the LLM, provides valuable prognostic background knowledge on different cancer types, which also enhances survival prediction. To leverage this knowledge, we introduce the knowledge-enhanced cross-modal (KECM) attention module. KECM can effectively guide the network to focus on discriminative and survival-relevant features from highly redundant modalities. Extensive experiments on five datasets demonstrate that KEMM achieves state-of-the-art performance. The code will be released upon acceptance.
FoodLogAthl-218: Constructing a Real-World Food Image Dataset Using Dietary Management Applications
Mitsuki Watanabe, Sosuke Amano, Kiyoharu Aizawa, Yoko Yamakata
个性化推荐理由:

该论文标题表明这是一个专注于食物图像数据集的构建工作,属于计算机视觉领域,与RecSys/Search/Ads的核心技术(如排序、召回、用户建模)或LLM/Transformer的架构进展没有直接关联。它可能涉及数据收集或标注方法,但未提及任何可应用于推荐、搜索或广告系统的模型、算法或架构创新。

2025-12-16 16:43:20 | arXiv:2512.14574v1 |
cs.CVcs.MM
查看完整摘要
Food image classification models are crucial for dietary management applications because they reduce the burden of manual meal logging. However, most publicly available datasets for training such models rely on web-crawled images, which often differ from users' real-world meal photos. In this work, we present FoodLogAthl-218, a food image dataset constructed from real-world meal records collected through the dietary management application FoodLog Athl. The dataset contains 6,925 images across 218 food categories, with a total of 14,349 bounding boxes. Rich metadata, including meal date and time, anonymized user IDs, and meal-level context, accompany each image. Unlike conventional datasets-where a predefined class set guides web-based image collection-our data begins with user-submitted photos, and labels are applied afterward. This yields greater intra-class diversity, a natural frequency distribution of meal types, and casual, unfiltered images intended for personal use rather than public sharing. In addition to (1) a standard classification benchmark, we introduce two FoodLog-specific tasks: (2) an incremental fine-tuning protocol that follows the temporal stream of users' logs, and (3) a context-aware classification task where each image contains multiple dishes, and the model must classify each dish by leveraging the overall meal context. We evaluate these tasks using large multimodal models (LMMs). The dataset is publicly available at https://huggingface.co/datasets/FoodLog/FoodLogAthl-218.
Test Time Optimized Generalized AI-based Medical Image Registration Method
Sneha Sree C., Dattesh Shanbhag, Sudhanya Chatterjee
个性化推荐理由:

该论文标题明确聚焦于医学图像配准,属于医学领域的特定应用,与用户关注的推荐系统、搜索、广告等核心领域无关。论文内容涉及医学图像处理,属于用户明确排除的'Medical, Biology, Chemistry, Physics or other domain-specific applications'类别,因此相关性极低。

2025-12-16 16:29:27 | arXiv:2512.14556v1 |
eess.IVcs.CV
查看完整摘要
Medical image registration is critical for aligning anatomical structures across imaging modalities such as computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound. Among existing techniques, non-rigid registration (NRR) is particularly challenging due to the need to capture complex anatomical deformations caused by physiological processes like respiration or contrast-induced signal variations. Traditional NRR methods, while theoretically robust, often require extensive parameter tuning and incur high computational costs, limiting their use in real-time clinical workflows. Recent deep learning (DL)-based approaches have shown promise; however, their dependence on task-specific retraining restricts scalability and adaptability in practice. These limitations underscore the need for efficient, generalizable registration frameworks capable of handling heterogeneous imaging contexts. In this work, we introduce a novel AI-driven framework for 3D non-rigid registration that generalizes across multiple imaging modalities and anatomical regions. Unlike conventional methods that rely on application-specific models, our approach eliminates anatomy- or modality-specific customization, enabling streamlined integration into diverse clinical environments.
TAT: Task-Adaptive Transformer for All-in-One Medical Image Restoration
Zhiwen Yang, Jiaju Zhang, Yang Yi, Jian Liang, Bingzheng Wei, Yan Xu
个性化推荐理由:

该论文标题明确属于医学领域(医学图像复原),属于明确的无关主题。虽然提到了Transformer架构,但其应用完全限定在医学图像处理,与推荐系统、搜索或广告领域无直接关联。

2025-12-16 16:25:47 | arXiv:2512.14550v1 |
cs.CV
查看完整摘要
Medical image restoration (MedIR) aims to recover high-quality medical images from their low-quality counterparts. Recent advancements in MedIR have focused on All-in-One models capable of simultaneously addressing multiple different MedIR tasks. However, due to significant differences in both modality and degradation types, using a shared model for these diverse tasks requires careful consideration of two critical inter-task relationships: task interference, which occurs when conflicting gradient update directions arise across tasks on the same parameter, and task imbalance, which refers to uneven optimization caused by varying learning difficulties inherent to each task. To address these challenges, we propose a task-adaptive Transformer (TAT), a novel framework that dynamically adapts to different tasks through two key innovations. First, a task-adaptive weight generation strategy is introduced to mitigate task interference by generating task-specific weight parameters for each task, thereby eliminating potential gradient conflicts on shared weight parameters. Second, a task-adaptive loss balancing strategy is introduced to dynamically adjust loss weights based on task-specific learning difficulties, preventing task domination or undertraining. Extensive experiments demonstrate that our proposed TAT achieves state-of-the-art performance in three MedIR tasks--PET synthesis, CT denoising, and MRI super-resolution--both in task-specific and All-in-One settings. Code is available at https://github.com/Yaziwel/TAT.
HiFi-Portrait: Zero-shot Identity-preserved Portrait Generation with High-fidelity Multi-face Fusion
Yifang Xu, Benxiang Zhai, Yunzhuo Sun, Ming Li, Yang Li, Sidan Du
个性化推荐理由:

该论文专注于肖像生成和身份保持技术,属于AIGC/内容生成领域,与推荐系统、搜索或广告的核心技术无关。论文标题明确指向图像生成任务,没有涉及推荐、搜索或广告中所需的排序、检索、用户建模或特征工程等关键技术。

2025-12-16 16:17:46 | arXiv:2512.14542v1 |
cs.CV
查看完整摘要
Recent advancements in diffusion-based technologies have made significant strides, particularly in identity-preserved portrait generation (IPG). However, when using multiple reference images from the same ID, existing methods typically produce lower-fidelity portraits and struggle to customize face attributes precisely. To address these issues, this paper presents HiFi-Portrait, a high-fidelity method for zero-shot portrait generation. Specifically, we first introduce the face refiner and landmark generator to obtain fine-grained multi-face features and 3D-aware face landmarks. The landmarks include the reference ID and the target attributes. Then, we design HiFi-Net to fuse multi-face features and align them with landmarks, which improves ID fidelity and face control. In addition, we devise an automated pipeline to construct an ID-based dataset for training HiFi-Portrait. Extensive experimental results demonstrate that our method surpasses the SOTA approaches in face similarity and controllability. Furthermore, our method is also compatible with previous SDXL-based works.
DASP: Self-supervised Nighttime Monocular Depth Estimation with Domain Adaptation of Spatiotemporal Priors
Yiheng Huang, Junhong Chen, Anqi Ning, Zhanhong Liang, Nick Michiels, Luc Claese...
个性化推荐理由:

该论文专注于计算机视觉中的夜间单目深度估计,属于纯粹的视觉任务,与推荐系统、搜索或广告领域没有直接关联。论文内容涉及域自适应和自监督学习,但这些技术并未明确指向或应用于推荐系统、搜索或广告场景。

2025-12-16 16:11:57 | arXiv:2512.14536v1 |
cs.CV
查看完整摘要
Self-supervised monocular depth estimation has achieved notable success under daytime conditions. However, its performance deteriorates markedly at night due to low visibility and varying illumination, e.g., insufficient light causes textureless areas, and moving objects bring blurry regions. To this end, we propose a self-supervised framework named DASP that leverages spatiotemporal priors for nighttime depth estimation. Specifically, DASP consists of an adversarial branch for extracting spatiotemporal priors and a self-supervised branch for learning. In the adversarial branch, we first design an adversarial network where the discriminator is composed of four devised spatiotemporal priors learning blocks (SPLB) to exploit the daytime priors. In particular, the SPLB contains a spatial-based temporal learning module (STLM) that uses orthogonal differencing to extract motion-related variations along the time axis and an axial spatial learning module (ASLM) that adopts local asymmetric convolutions with global axial attention to capture the multiscale structural information. By combining STLM and ASLM, our model can acquire sufficient spatiotemporal features to restore textureless areas and estimate the blurry regions caused by dynamic objects. In the self-supervised branch, we propose a 3D consistency projection loss to bilaterally project the target frame and source frame into a shared 3D space, and calculate the 3D discrepancy between the two projected frames as a loss to optimize the 3D structural consistency and daytime priors. Extensive experiments on the Oxford RobotCar and nuScenes datasets demonstrate that our approach achieves state-of-the-art performance for nighttime depth estimation. Ablation studies further validate the effectiveness of each component.
Native Intelligence Emerges from Large-Scale Clinical Practice: A Retinal Foundation Model with Deployment Efficiency
Jia Guo, Jiawei Du, Shengzhu Yang, Shuai Lu, Wenquan Cheng, Kaiwen Zhang, Yihua ...
个性化推荐理由:

该论文标题明确涉及医学领域(视网膜、临床实践),属于明确的无关主题。虽然提到了基础模型,但上下文完全是医学应用,与推荐系统、搜索或广告领域没有任何关联。

2025-12-16 15:33:08 | arXiv:2512.14499v1 |
cs.CV
查看完整摘要
Current retinal foundation models remain constrained by curated research datasets that lack authentic clinical context, and require extensive task-specific optimization for each application, limiting their deployment efficiency in low-resource settings. Here, we show that these barriers can be overcome by building clinical native intelligence directly from real-world medical practice. Our key insight is that large-scale telemedicine programs, where expert centers provide remote consultations across distributed facilities, represent a natural reservoir for learning clinical image interpretation. We present ReVision, a retinal foundation model that learns from the natural alignment between 485,980 color fundus photographs and their corresponding diagnostic reports, accumulated through a decade-long telemedicine program spanning 162 medical institutions across China. Through extensive evaluation across 27 ophthalmic benchmarks, we demonstrate that ReVision enables deployment efficiency with minimal local resources. Without any task-specific training, ReVision achieves zero-shot disease detection with an average AUROC of 0.946 across 12 public benchmarks and 0.952 on 3 independent clinical cohorts. When minimal adaptation is feasible, ReVision matches extensively fine-tuned alternatives while requiring orders of magnitude fewer trainable parameters and labeled examples. The learned representations also transfer effectively to new clinical sites, imaging domains, imaging modalities, and systemic health prediction tasks. In a prospective reader study with 33 ophthalmologists, ReVision's zero-shot assistance improved diagnostic accuracy by 14.8% across all experience levels. These results demonstrate that clinical native intelligence can be directly extracted from clinical archives without any further annotation to build medical AI systems suited to various low-resource settings.
SignIT: A Comprehensive Dataset and Multimodal Analysis for Italian Sign Language Recognition
Alessia Micieli, Giovanni Maria Farinella, Francesco Ragusa
个性化推荐理由:

该论文专注于特定领域(意大利手语)的数据集构建和多模态分析,属于语言/视觉交叉领域,与推荐系统、搜索或广告的核心技术进展、Transformer架构改进或LLM直接应用均无明确关联。其多模态性质虽涉及视觉和语言,但处理的是特定手语模态,而非推荐/搜索/广告中常见的异构数据(如用户序列和上下文特征)的统一建模问题。

2025-12-16 15:21:33 | arXiv:2512.14489v1 |
cs.CV
查看完整摘要
In this work we present SignIT, a new dataset to study the task of Italian Sign Language (LIS) recognition. The dataset is composed of 644 videos covering 3.33 hours. We manually annotated videos considering a taxonomy of 94 distinct sign classes belonging to 5 macro-categories: Animals, Food, Colors, Emotions and Family. We also extracted 2D keypoints related to the hands, face and body of the users. With the dataset, we propose a benchmark for the sign recognition task, adopting several state-of-the-art models and showing how temporal information, 2D keypoints and RGB frames can influence the performance of these models. Results show the limitations of these models on this challenging LIS dataset. We release data and annotations at the following link: https://fpv-iplab.github.io/SignIT/.
TACK Tunnel Data (TTD): A Benchmark Dataset for Deep Learning-Based Defect Detection in Tunnels
Andreas Sjölander, Valeria Belloni, Robel Fekadu, Andrea Nascetti
个性化推荐理由:

该论文标题明确聚焦于隧道缺陷检测的计算机视觉应用,属于纯粹的视觉/基础设施检测领域。虽然涉及深度学习技术,但没有任何内容表明与推荐系统、搜索、广告或相关Transformer架构有潜在联系,完全属于用户指定的不相关主题范畴。

2025-12-16 15:10:16 | arXiv:2512.14477v1 |
cs.CVcs.AI
查看完整摘要
Tunnels are essential elements of transportation infrastructure, but are increasingly affected by ageing and deterioration mechanisms such as cracking. Regular inspections are required to ensure their safety, yet traditional manual procedures are time-consuming, subjective, and costly. Recent advances in mobile mapping systems and Deep Learning (DL) enable automated visual inspections. However, their effectiveness is limited by the scarcity of tunnel datasets. This paper introduces a new publicly available dataset containing annotated images of three different tunnel linings, capturing typical defects: cracks, leaching, and water infiltration. The dataset is designed to support supervised, semi-supervised, and unsupervised DL methods for defect detection and segmentation. Its diversity in texture and construction techniques also enables investigation of model generalization and transferability across tunnel types. By addressing the critical lack of domain-specific data, this dataset contributes to advancing automated tunnel inspection and promoting safer, more efficient infrastructure maintenance strategies.
S2D: Sparse-To-Dense Keymask Distillation for Unsupervised Video Instance Segmentation
Leon Sick, Lukas Hoyer, Dominik Engel, Pedro Hermosilla, Timo Ropinski
个性化推荐理由:

该论文专注于计算机视觉中的视频实例分割任务,属于纯粹的视觉领域研究。虽然标题中提到“蒸馏”技术,但论文内容明显属于视觉模态处理,与推荐系统、搜索或广告的核心技术栈(用户行为建模、排序算法、特征工程等)没有直接关联,也不涉及LLM、Transformer架构或异构数据统一建模等当前关注的技术方向。

2025-12-16 14:26:30 | arXiv:2512.14440v1 |
cs.CV
查看完整摘要
In recent years, the state-of-the-art in unsupervised video instance segmentation has heavily relied on synthetic video data, generated from object-centric image datasets such as ImageNet. However, video synthesis by artificially shifting and scaling image instance masks fails to accurately model realistic motion in videos, such as perspective changes, movement by parts of one or multiple instances, or camera motion. To tackle this issue, we propose an unsupervised video instance segmentation model trained exclusively on real video data. We start from unsupervised instance segmentation masks on individual video frames. However, these single-frame segmentations exhibit temporal noise and their quality varies through the video. Therefore, we establish temporal coherence by identifying high-quality keymasks in the video by leveraging deep motion priors. The sparse keymask pseudo-annotations are then used to train a segmentation model for implicit mask propagation, for which we propose a Sparse-To-Dense Distillation approach aided by a Temporal DropLoss. After training the final model on the resulting dense labelset, our approach outperforms the current state-of-the-art across various benchmarks.
VICTOR: Dataset Copyright Auditing in Video Recognition Systems
Quan Yuan, Zhikun Zhang, Linkang Du, Min Chen, Mingyang Sun, Yunjun Gao, Shibo H...
个性化推荐理由:

该论文标题明确涉及版权审计,这属于隐私、安全或公平性等非技术性主题,这些主题被明确列为不相关。标题中没有提到推荐系统、搜索、广告、LLM技术、Transformer架构或异构数据建模等任何相关技术领域。

2025-12-16 14:26:01 | arXiv:2512.14439v1 |
cs.CRcs.CV
查看完整摘要
Video recognition systems are increasingly being deployed in daily life, such as content recommendation and security monitoring. To enhance video recognition development, many institutions have released high-quality public datasets with open-source licenses for training advanced models. At the same time, these datasets are also susceptible to misuse and infringement. Dataset copyright auditing is an effective solution to identify such unauthorized use. However, existing dataset copyright solutions primarily focus on the image domain; the complex nature of video data leaves dataset copyright auditing in the video domain unexplored. Specifically, video data introduces an additional temporal dimension, which poses significant challenges to the effectiveness and stealthiness of existing methods. In this paper, we propose VICTOR, the first dataset copyright auditing approach for video recognition systems. We develop a general and stealthy sample modification strategy that enhances the output discrepancy of the target model. By modifying only a small proportion of samples (e.g., 1%), VICTOR amplifies the impact of published modified samples on the prediction behavior of the target models. Then, the difference in the model's behavior for published modified and unpublished original samples can serve as a key basis for dataset auditing. Extensive experiments on multiple models and datasets highlight the superiority of VICTOR. Finally, we show that VICTOR is robust in the presence of several perturbation mechanisms to the training videos or the target models.
Score-Based Turbo Message Passing for Plug-and-Play Compressive Imaging
Chang Cai, Hao Jiang, Xiaojun Yuan, Ying-Jun Angela Zhang
个性化推荐理由:

该论文标题明确涉及压缩成像,这是计算机视觉和信号处理领域的技术,与推荐系统、搜索或广告的核心领域没有直接关联。虽然消息传递算法有时用于概率图模型,但该论文专注于成像应用,没有显示出在推荐系统、搜索或广告中的潜在应用。

2025-12-16 14:24:12 | arXiv:2512.14435v1 |
cs.CV
查看完整摘要
Message-passing algorithms have been adapted for compressive imaging by incorporating various off-the-shelf image denoisers. However, these denoisers rely largely on generic or hand-crafted priors and often fall short in accurately capturing the complex statistical structure of natural images. As a result, traditional plug-and-play (PnP) methods often lead to suboptimal reconstruction, especially in highly underdetermined regimes. Recently, score-based generative models have emerged as a powerful framework for accurately characterizing sophisticated image distribution. Yet, their direct use for posterior sampling typically incurs prohibitive computational complexity. In this paper, by exploiting the close connection between score-based generative modeling and empirical Bayes denoising, we devise a message-passing framework that integrates a score-based minimum mean-squared error (MMSE) denoiser for compressive image recovery. The resulting algorithm, named score-based turbo message passing (STMP), combines the fast convergence of message passing with the expressive power of score-based generative priors. For practical systems with quantized measurements, we further propose quantized STMP (Q-STMP), which augments STMP with a component-wise MMSE dequantization module. We demonstrate that the asymptotic performance of STMP and Q-STMP can be accurately predicted by a set of state-evolution (SE) equations. Experiments on the FFHQ dataset demonstrate that STMP strikes a significantly better performance-complexity tradeoff compared with competing baselines, and that Q-STMP remains robust even under 1-bit quantization. Remarkably, both STMP and Q-STMP typically converge within 10 iterations.
LCMem: A Universal Model for Robust Image Memorization Detection
Mischa Dombrowski, Felix Nützel, Bernhard Kainz
个性化推荐理由:

该论文标题聚焦于图像记忆检测,属于计算机视觉领域,与推荐系统、搜索或广告的核心技术无直接关联。虽然标题提到“通用模型”,但未表明其与Transformer架构、LLM技术或异构数据处理有任何联系,因此不符合当前关注的任何技术方向。

2025-12-16 14:06:58 | arXiv:2512.14421v1 |
cs.CV
查看完整摘要
Recent advances in generative image modeling have achieved visual realism sufficient to deceive human experts, yet their potential for privacy-preserving data sharing remains insufficiently understood. A central obstacle is the absence of reliable memorization detection mechanisms, limited quantitative evaluation, and poor generalization of existing privacy auditing methods across domains. To address this, we propose to view memorization detection as a unified problem at the intersection of re-identification and copy detection, whose complementary goals cover both identity consistency and augmentation-robust duplication, and introduce the Latent Contrastive Memorization Network (LCMem), a cross-domain model evaluated jointly on both tasks. LCMem achieves this through a two-stage training strategy that first learns identity consistency before incorporating augmentation-robust copy detection. Across six benchmark datasets, LCMem achieves improvements of up to 16 percentage points on re-identification and 30 percentage points on copy detection, enabling substantially more reliable memorization detection at scale. Our results show that existing privacy filters provide limited performance and robustness, highlighting the need for stronger protection mechanisms. We show that LCMem sets a new standard for cross-domain privacy auditing, offering reliable and scalable memorization detection. Code and model are publicly available at https://github.com/MischaD/LCMem.
DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning
Nakamasa Inoue, Kanoko Goto, Masanari Oi, Martyna Gruszka, Mahiro Ukai, Takumi H...
个性化推荐理由:

该论文专注于图像字幕生成的自动评估方法,属于纯粹的计算机视觉和自然语言处理交叉领域。虽然涉及评估技术,但属于您指定的无关主题中的'纯粹视觉或NLP中心主题',与推荐系统、搜索或广告的核心技术、LLM应用或Transformer架构进展没有直接关联。

2025-12-16 14:06:35 | arXiv:2512.14420v1 |
cs.CVcs.AI
查看完整摘要
Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.
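下面用一个极简草图示意"带高斯先验的测试时解析解"这一类做法:把解码器的原始分数与先验均值按精度加权平均。损失形式、参数名与数值均为这里的假设,并非论文中 ATT 损失的实际定义。

```python
def att_score(raw_score, prior_mean=3.0, prior_var=1.0, obs_var=0.5):
    """Hypothetical adaptive test-time adjustment: minimize the quadratic loss
    (s - raw_score)^2 / obs_var + (s - prior_mean)^2 / prior_var over s.
    The closed-form minimizer is a precision-weighted average."""
    w_obs, w_prior = 1.0 / obs_var, 1.0 / prior_var
    return (w_obs * raw_score + w_prior * prior_mean) / (w_obs + w_prior)

if __name__ == "__main__":
    # Raw scores are pulled toward the Gaussian prior mean at test time.
    for s in (1.0, 3.0, 5.0):
        print(s, "->", round(att_score(s), 3))
```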
EcoScapes: LLM-Powered Advice for Crafting Sustainable Cities
Martin Röhn, Nora Gourmelon, Vincent Christlein
个性化推荐理由:

该论文标题涉及城市规划与可持续发展,属于特定领域应用(城市规划/环境科学),与推荐系统、搜索或广告的核心技术进展无关。虽然提到了LLM技术,但应用场景完全偏离了指定的技术领域,没有展示在推荐、搜索或广告中的潜在应用价值。

2025-12-16 12:58:16 | arXiv:2512.14373v1 |
cs.CV
查看完整摘要
Climate adaptation is vital for the sustainability and sometimes the mere survival of our urban areas. However, small cities often struggle with limited personnel resources and integrating vast amounts of data from multiple sources for a comprehensive analysis. To overcome these challenges, this paper proposes a multi-layered system combining specialized LLMs, satellite imagery analysis and a knowledge base to aid in developing effective climate adaptation strategies. The corresponding code can be found at https://github.com/Photon-GitHub/EcoScapes.
A Comprehensive Safety Metric to Evaluate Perception in Autonomous Systems
Georg Volk, Jörg Gamerdinger, Alexander von Bernuth, Oliver Bringmann
个性化推荐理由:

该论文标题明确聚焦于自动驾驶系统的感知安全评估,属于特定领域应用(自动驾驶),与我的关注领域(推荐系统、搜索、广告)无直接关联。论文内容涉及安全度量标准,属于被排除的“非技术性话题”范畴,且没有显示出对推荐/搜索/广告领域的潜在应用价值。

2025-12-16 12:53:00 | arXiv:2512.14367v1 |
cs.ROcs.CV
查看完整摘要
Complete perception of the environment and its correct interpretation is crucial for autonomous vehicles. Object perception is the main component of automotive surround sensing. Various metrics already exist for the evaluation of object perception. However, objects can be of different importance depending on their velocity, orientation, distance, size, or the potential damage that could be caused by a collision due to a missed detection. Thus, these additional parameters have to be considered for safety evaluation. We propose a new safety metric that incorporates all these parameters and returns a single easily interpretable safety assessment score for object perception. This new metric is evaluated with both real world and virtual data sets and compared to state of the art metrics.
Mimicking Human Visual Development for Learning Robust Image Representations
Ankita Raj, Kaashika Prajaapat, Tapan Kumar Gandhi, Chetan Arora
个性化推荐理由:

该论文标题明确聚焦于视觉表示学习,属于纯粹的计算机视觉研究范畴,与推荐系统、搜索或广告的核心技术无直接关联。虽然视觉表示学习在多媒体推荐中有潜在应用,但该标题未提及任何与序列建模、用户行为、上下文特征或异构数据处理相关的概念,无法建立与所列关注领域的直接联系。

2025-12-16 12:41:04 | arXiv:2512.14360v1 |
cs.CV
查看完整摘要
The human visual system is remarkably adept at adapting to changes in the input distribution; a capability modern convolutional neural networks (CNNs) still struggle to match. Drawing inspiration from the developmental trajectory of human vision, we propose a progressive blurring curriculum to improve the generalization and robustness of CNNs. Human infants are born with poor visual acuity, gradually refining their ability to perceive fine details. Mimicking this process, we begin training CNNs on highly blurred images during the initial epochs and progressively reduce the blur as training advances. This approach encourages the network to prioritize global structures over high-frequency artifacts, improving robustness against distribution shifts and noisy inputs. Challenging prior claims that blurring in the initial training epochs imposes a stimulus deficit and irreversibly harms model performance, we reveal that early-stage blurring enhances generalization with minimal impact on in-domain accuracy. Our experiments demonstrate that the proposed curriculum reduces mean corruption error (mCE) by up to 8.30% on CIFAR-10-C and 4.43% on ImageNet-100-C datasets, compared to standard training without blurring. Unlike static blur-based augmentation, which applies blurred images randomly throughout training, our method follows a structured progression, yielding consistent gains across various datasets. Furthermore, our approach complements other augmentation techniques, such as CutMix and MixUp, and enhances both natural and adversarial robustness against common attack methods. Code is available at https://github.com/rajankita/Visual_Acuity_Curriculum.
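下面是渐进式模糊课程的一个训练循环草图(PyTorch / torchvision):训练初期使用较大的高斯模糊 sigma,并随 epoch 线性衰减。线性调度、起止 sigma 与核大小均为这里的假设,论文的具体调度可能不同。

```python
import torch
from torch import nn
from torchvision.transforms import GaussianBlur

def blur_sigma(epoch, total_epochs, sigma_start=4.0, sigma_end=0.01):
    """Linearly anneal the Gaussian-blur sigma from heavy to nearly none."""
    t = min(epoch / max(total_epochs - 1, 1), 1.0)
    return sigma_start + t * (sigma_end - sigma_start)

def train(model, loader, total_epochs=30, lr=1e-3, device="cpu"):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(total_epochs):
        # Heavy blur in early epochs, progressively sharper images later.
        blur = GaussianBlur(kernel_size=9, sigma=blur_sigma(epoch, total_epochs))
        model.train()
        for images, labels in loader:
            images = blur(images.to(device))
            loss = loss_fn(model(images), labels.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()

# Usage (assuming a classification model and a DataLoader already exist):
# train(my_model, train_loader, total_epochs=30)
```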
HGS: Hybrid Gaussian Splatting with Static-Dynamic Decomposition for Compact Dynamic View Synthesis
Kaizhe Zhang, Yijie Zhou, Weizhan Zhang, Caixia Yan, Haipeng Du, yugui xie, Yu-H...
个性化推荐理由:

该论文专注于计算机视觉中的动态视图合成技术,属于纯粹的视觉/3D视觉领域。虽然标题中提到“动态”和“合成”,但这与推荐系统、搜索或广告中的异构数据处理没有直接关联。该技术没有明显的潜在应用场景可以转移到推荐系统、搜索或广告领域。

2025-12-16 12:29:00 | arXiv:2512.14352v1 |
cs.CVcs.CG
查看完整摘要
Dynamic novel view synthesis (NVS) is essential for creating immersive experiences. Existing approaches have advanced dynamic NVS by introducing 3D Gaussian Splatting (3DGS) with implicit deformation fields or indiscriminately assigned time-varying parameters, surpassing NeRF-based methods. However, due to excessive model complexity and parameter redundancy, they incur large model sizes and slow rendering speeds, making them inefficient for real-time applications, particularly on resource-constrained devices. To obtain a more efficient model with fewer redundant parameters, in this paper, we propose Hybrid Gaussian Splatting (HGS), a compact and efficient framework explicitly designed to disentangle static and dynamic regions of a scene within a unified representation. The core innovation of HGS lies in our Static-Dynamic Decomposition (SDD) strategy, which leverages Radial Basis Function (RBF) modeling for Gaussian primitives. Specifically, for dynamic regions, we employ time-dependent RBFs to effectively capture temporal variations and handle abrupt scene changes, while for static regions, we reduce redundancy by sharing temporally invariant parameters. Additionally, we introduce a two-stage training strategy tailored for explicit models to enhance temporal coherence at static-dynamic boundaries. Experimental results demonstrate that our method reduces model size by up to 98% and achieves real-time rendering at up to 125 FPS at 4K resolution on a single RTX 3090 GPU. It further sustains 160 FPS at 1352 * 1014 on an RTX 3050 and has been integrated into the VR system. Moreover, HGS achieves comparable rendering quality to state-of-the-art methods while providing significantly improved visual fidelity for high-frequency details and abrupt scene changes.
Towards Transferable Defense Against Malicious Image Edits
Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen
个性化推荐理由:

该论文标题关注计算机视觉领域的恶意图像编辑防御,属于安全/隐私范畴,与用户当前关注的推荐系统、搜索、广告核心进展、LLM技术应用及Transformer架构改进等焦点完全无关。即使考虑VLM类比,该论文也缺乏对异构数据统一建模的明确关联。

2025-12-16 12:10:16 | arXiv:2512.14341v1 |
cs.CVcs.AIcs.CYcs.LG
查看完整摘要
Recent approaches employing imperceptible perturbations in input images have demonstrated promising potential to counter malicious manipulations in diffusion-based image editing systems. However, existing methods suffer from limited transferability in cross-model evaluations. To address this, we propose Transferable Defense Against Malicious Image Edits (TDAE), a novel bimodal framework that enhances image immunity against malicious edits through coordinated image-text optimization. Specifically, at the visual defense level, we introduce FlatGrad Defense Mechanism (FDM), which incorporates gradient regularization into the adversarial objective. By explicitly steering the perturbations toward flat minima, FDM amplifies immune robustness against unseen editing models. For textual enhancement protection, we propose an adversarial optimization paradigm named Dynamic Prompt Defense (DPD), which periodically refines text embeddings to align the editing outcomes of immunized images with those of the original images, then updates the images under optimized embeddings. Through iterative adversarial updates to diverse embeddings, DPD enforces the generation of immunized images that seek a broader set of immunity-enhancing features, thereby achieving cross-model transferability. Extensive experimental results demonstrate that our TDAE achieves state-of-the-art performance in mitigating malicious edits under both intra- and cross-model evaluations.
Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure
Jooyeol Yun, Jaegul Choo
个性化推荐理由:

该论文标题涉及矢量图形动画和语义结构分层,属于计算机图形学领域。虽然提到了语义结构,但这与推荐系统、搜索或广告中的异构数据处理没有直接关联。该技术主要针对图形动画生成,属于纯粹的视觉/图形领域,没有明确的推荐/搜索/广告应用潜力。

2025-12-16 12:03:46 | arXiv:2512.14336v1 |
cs.CV
查看完整摘要
Scalable Vector Graphics (SVG) are central to modern web design, and the demand to animate them continues to grow as web environments become increasingly dynamic. Yet automating the animation of vector graphics remains challenging for vision-language models (VLMs) despite recent progress in code generation and motion planning. VLMs routinely mis-handle SVGs, since visually coherent parts are often fragmented into low-level shapes that offer little guidance of which elements should move together. In this paper, we introduce a framework that recovers the semantic structure required for reliable SVG animation and reveals the missing layer that current VLM systems overlook. This is achieved through a statistical aggregation of multiple weak part predictions, allowing the system to stably infer semantics from noisy predictions. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence. Our experiments demonstrate substantial gains over existing approaches, suggesting that semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.
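
The "statistical aggregation of multiple weak part predictions" can be read as a voting problem over noisy per-element labels. A minimal sketch, assuming each weak prediction is a dict from SVG element id to part label (names are hypothetical):

```python
from collections import Counter

def aggregate_part_labels(predictions, min_agreement=0.5):
    """Majority-vote aggregation of weak part predictions for SVG elements."""
    votes = {}
    for pred in predictions:
        for elem, label in pred.items():
            votes.setdefault(elem, Counter())[label] += 1
    groups = {}
    for elem, counter in votes.items():
        label, count = counter.most_common(1)[0]
        if count / sum(counter.values()) >= min_agreement:
            groups[elem] = label   # stable semantic group
        else:
            groups[elem] = None    # too noisy to decide
    return groups

print(aggregate_part_labels([
    {"path1": "wing", "path2": "body"},
    {"path1": "wing", "path2": "wing"},
    {"path1": "wing", "path2": "body"},
]))
```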
Dual Attention Guided Defense Against Malicious Edits
Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen
个性化推荐理由:

该论文标题涉及安全防御机制(对抗恶意编辑),这属于明确排除的“安全、隐私”等主题范畴。标题中的“注意力”可能指注意力机制,但核心焦点是防御而非推荐/搜索/广告系统的架构改进或应用,与当前关注的推荐系统、搜索、广告、LLM技术、Transformer架构或异构数据统一建模等方向无关。

2025-12-16 12:01:28 | arXiv:2512.14333v1 |
cs.CVcs.AIcs.CYcs.LG
查看完整摘要
Recent progress in text-to-image diffusion models has transformed image editing via text prompts, yet this also introduces significant ethical challenges from potential misuse in creating deceptive or harmful content. While current defenses seek to mitigate this risk by embedding imperceptible perturbations, their effectiveness is limited against malicious tampering. To address this issue, we propose a Dual Attention-Guided Noise Perturbation (DANP) immunization method that adds imperceptible perturbations to disrupt the model's semantic understanding and generation process. DANP functions over multiple timesteps to manipulate both cross-attention maps and the noise prediction process, using a dynamic threshold to generate masks that identify text-relevant and irrelevant regions. It then reduces attention in relevant areas while increasing it in irrelevant ones, thereby misguides the edit towards incorrect regions and preserves the intended targets. Additionally, our method maximizes the discrepancy between the injected noise and the model's predicted noise to further interfere with the generation. By targeting both attention and noise prediction mechanisms, DANP exhibits impressive immunity against malicious edits, and extensive experiments confirm that our method achieves state-of-the-art performance.
From YOLO to VLMs: Advancing Zero-Shot and Few-Shot Detection of Wastewater Treatment Plants Using Satellite Imagery in MENA Region
Akila Premarathna, Kanishka Hewageegana, Garcia Andarcia Mariangel
个性化推荐理由:

该论文标题明确聚焦于卫星图像中的污水处理厂检测,属于计算机视觉在特定地理和环境领域的应用。虽然提到了视觉语言模型(VLMs),但其应用场景(污水处理厂检测、卫星图像、MENA地区)与推荐系统、搜索或广告的核心技术领域(用户行为建模、内容排序、广告投放优化等)无直接关联。该研究属于纯粹的视觉应用,没有展示出在推荐/搜索/广告领域的潜在应用价值。

2025-12-16 11:28:55 | arXiv:2512.14312v1 |
cs.CVcs.AI
查看完整摘要
In regions of the Middle East and North Africa (MENA), there is a high demand for wastewater treatment plants (WWTPs), crucial for sustainable water management. Precise identification of WWTPs from satellite images enables environmental monitoring. Traditional methods like YOLOv8 segmentation require extensive manual labeling. But studies indicate that vision-language models (VLMs) are an efficient alternative to achieving equivalent or superior results through inherent reasoning and annotation. This study presents a structured methodology for VLM comparison, divided into zero-shot and few-shot streams specifically to identify WWTPs. The YOLOv8 was trained on a governmental dataset of 83,566 high-resolution satellite images from Egypt, Saudi Arabia, and UAE: ~85% WWTPs (positives), 15% non-WWTPs (negatives). Evaluated VLMs include LLaMA 3.2 Vision, Qwen 2.5 VL, DeepSeek-VL2, Gemma 3, Gemini, and Pixtral 12B (Mistral), used to identify WWTP components such as circular/rectangular tanks, aeration basins and distinguish confounders via expert prompts producing JSON outputs with confidence and descriptions. The dataset comprises 1,207 validated WWTP locations (198 UAE, 354 KSA, 655 Egypt) and equal non-WWTP sites from field/AI data, as 600mx600m Geo-TIFF images (Zoom 18, EPSG:4326). Zero-shot evaluations on WWTP images showed several VLMs out-performing YOLOv8's true positive rate, with Gemma-3 highest. Results confirm that VLMs, particularly with zero-shot, can replace YOLOv8 for efficient, annotation-free WWTP classification, enabling scalable remote sensing.
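
Since the VLMs are prompted to emit JSON with a confidence and description, a small amount of glue code is needed to turn free-form model output into usable predictions. A hedged sketch of such a prompt and parser follows; the prompt wording and JSON schema are assumptions based on the abstract, not the study's exact protocol.

```python
import json

PROMPT = """You are an expert in satellite image analysis.
Decide whether the image contains a wastewater treatment plant (WWTP).
Look for circular or rectangular tanks and aeration basins, and distinguish
confounders such as fountains or swimming pools.
Answer strictly as JSON: {"is_wwtp": true/false, "confidence": 0-1, "description": "..."}"""

def parse_vlm_answer(raw_text):
    """Parse the JSON answer a VLM is instructed to return; None if malformed."""
    try:
        obj = json.loads(raw_text)
        return bool(obj["is_wwtp"]), float(obj["confidence"]), obj.get("description", "")
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None  # malformed output: skip the image or re-query the model

print(parse_vlm_answer('{"is_wwtp": true, "confidence": 0.87, "description": "two circular tanks"}'))
```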
PSMamba: Progressive Self-supervised Vision Mamba for Plant Disease Recognition
Abdullah Al Mamun, Miaohua Zhang, David Ahmedt-Aristizabal, Zeeshan Hayder, Moha...
个性化推荐理由:

该论文标题明确聚焦于植物病害识别这一特定领域应用,属于明确的无关主题(Medical, Biology, Chemistry, Physics or other domain-specific applications)。虽然涉及Mamba架构(一种序列模型),但其应用场景与推荐系统、搜索或广告领域无直接关联,且未提及任何潜在的跨领域应用可能性。

2025-12-16 11:27:19 | arXiv:2512.14309v1 |
cs.CV
查看完整摘要
Self-supervised Learning (SSL) has become a powerful paradigm for representation learning without manual annotations. However, most existing frameworks focus on global alignment and struggle to capture the hierarchical, multi-scale lesion patterns characteristic of plant disease imagery. To address this gap, we propose PSMamba, a progressive self-supervised framework that integrates the efficient sequence modelling of Vision Mamba (VM) with a dual-student hierarchical distillation strategy. Unlike conventional single teacher-student designs, PSMamba employs a shared global teacher and two specialised students: one processes mid-scale views to capture lesion distributions and vein structures, while the other focuses on local views to capture fine-grained cues such as texture irregularities and early-stage lesions. This multi-granular supervision facilitates the joint learning of contextual and detailed representations, with consistency losses ensuring coherent cross-scale alignment. Experiments on three benchmark datasets show that PSMamba consistently outperforms state-of-the-art SSL methods, delivering superior accuracy and robustness in both domain-shifted and fine-grained scenarios.
SS4D: Native 4D Generative Model via Structured Spacetime Latents
Zhibing Li, Mengchen Zhang, Tong Wu, Jing Tan, Jiaqi Wang, Dahua Lin
个性化推荐理由:

该论文标题表明其专注于4D(三维空间+时间)生成模型,这属于计算机视觉和图形学领域,与推荐系统、搜索或广告的核心技术无直接关联。虽然标题提到“结构化”和“潜在变量”,但缺乏明确的连接点表明其在异构数据处理(如VLM类比)或Transformer架构效率方面的应用潜力。

2025-12-16 10:45:06 | arXiv:2512.14284v1 |
cs.CV
查看完整摘要
We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. Unlike prior approaches that construct 4D representations by optimizing over 3D or video generative models, we train a generator directly on 4D data, achieving high fidelity, temporal coherence, and structural consistency. At the core of our method is a compressed set of structured spacetime latents. Specifically, (1) To address the scarcity of 4D training data, we build on a pre-trained single-image-to-3D model, preserving strong spatial consistency. (2) Temporal consistency is enforced by introducing dedicated temporal layers that reason across frames. (3) To support efficient training and inference over long video sequences, we compress the latent sequence along the temporal axis using factorized 4D convolutions and temporal downsampling blocks. In addition, we employ a carefully designed training strategy to enhance robustness against occlusion
TUN: Detecting Significant Points in Persistence Diagrams with Deep Learning
Yu Chen, Hongwei Lin
个性化推荐理由:

该论文标题涉及拓扑数据分析中的持久性图与深度学习结合,属于数学与计算拓扑领域,与推荐系统、搜索、广告的核心技术或LLM/Transformer架构无直接关联。标题中未提及任何与异构数据处理、序列建模或多模态学习相关的内容,无法推断其在推荐/搜索/广告领域的潜在应用。

2025-12-16 10:35:17 | arXiv:2512.14274v1 |
cs.CVcs.LGmath.AT
查看完整摘要
Persistence diagrams (PDs) provide a powerful tool for understanding the topology of the underlying shape of a point cloud. However, identifying which points in PDs encode genuine signals remains challenging. This challenge directly hinders the practical adoption of topological data analysis in many applications, where automated and reliable interpretation of persistence diagrams is essential for downstream decision-making. In this paper, we study automatic significance detection for one-dimensional persistence diagrams. Specifically, we propose Topology Understanding Net (TUN), a multi-modal network that combines enhanced PD descriptors with self-attention, a PointNet-style point cloud encoder, learned fusion, and per-point classification, alongside stable preprocessing and imbalance-aware training. It provides an automated and effective solution for identifying significant points in PDs, which are critical for downstream applications. Experiments show that TUN outperforms classic methods in detecting significant points in PDs, illustrating its effectiveness in real-world applications.
DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance
Shreedhar Govil, Didier Stricker, Jason Rambach
个性化推荐理由:

该论文标题明确指向计算机视觉和自动驾驶领域,专注于驾驶员注意力检测和视线追踪。虽然涉及注意力机制,但这是纯粹的视觉应用,与推荐系统、搜索或广告中的用户行为建模、内容排序或广告投放没有任何直接或间接的关联。

2025-12-16 10:23:00 | arXiv:2512.14266v1 |
cs.CV
查看完整摘要
Predicting driver attention is a critical problem for developing explainable autonomous driving systems and understanding driver behavior in mixed human-autonomous vehicle traffic scenarios. Although significant progress has been made through large-scale driver attention datasets and deep learning architectures, existing works are constrained by narrow frontal field-of-view and limited driving diversity. Consequently, they fail to capture the full spatial context of driving environments, especially during lane changes, turns, and interactions involving peripheral objects such as pedestrians or cyclists. In this paper, we introduce DriverGaze360, a large-scale 360$^\circ$ field of view driver attention dataset, containing $\sim$1 million gaze-labeled frames collected from 19 human drivers, enabling comprehensive omnidirectional modeling of driver gaze behavior. Moreover, our panoramic attention prediction approach, DriverGaze360-Net, jointly learns attention maps and attended objects by employing an auxiliary semantic segmentation head. This improves spatial awareness and attention prediction across wide panoramic inputs. Extensive experiments demonstrate that DriverGaze360-Net achieves state-of-the-art attention prediction performance on multiple metrics on panoramic driving images. Dataset and method available at https://av.dfki.de/drivergaze360.
Elastic3D: Controllable Stereo Video Conversion with Guided Latent Decoding
Nando Metzger, Prune Truong, Goutam Bhat, Konrad Schindler, Federico Tombari
个性化推荐理由:

该论文标题明确指向3D视觉和视频处理领域,专注于立体视频转换技术。虽然提到了'可控'和'引导解码'等技术概念,但缺乏与推荐系统、搜索或广告领域的明确联系,也没有展示如何将这些视觉技术应用于异构数据处理或多模态建模。

2025-12-16 09:46:23 | arXiv:2512.14236v1 |
cs.CV
查看完整摘要
The growing demand for immersive 3D content calls for automated monocular-to-stereo video conversion. We present Elastic3D, a controllable, direct end-to-end method for upgrading a conventional video to a binocular one. Our approach, based on (conditional) latent diffusion, avoids artifacts due to explicit depth estimation and warping. The key to its high-quality stereo video output is a novel, guided VAE decoder that ensures sharp and epipolar-consistent stereo video output. Moreover, our method gives the user control over the strength of the stereo effect (more precisely, the disparity range) at inference time, via an intuitive, scalar tuning knob. Experiments on three different datasets of real-world stereo videos show that our method outperforms both traditional warping-based and recent warping-free baselines and sets a new standard for reliable, controllable stereo video conversion. Please check the project page for the video samples https://elastic3d.github.io.
4D-RaDiff: Latent Diffusion for 4D Radar Point Cloud Generation
Jimmie Kwok, Holger Caesar, Andras Palffy
个性化推荐理由:

该论文专注于4D雷达点云生成,属于计算机视觉和传感器数据处理领域,与推荐系统、搜索或广告的核心技术无关。虽然涉及扩散模型,但其应用场景(雷达点云生成)与文本、序列或异构数据建模没有直接关联,无法应用于推荐、搜索或广告领域。

2025-12-16 09:43:05 | arXiv:2512.14235v1 |
cs.CV
查看完整摘要
Automotive radar has shown promising developments in environment perception due to its cost-effectiveness and robustness in adverse weather conditions. However, the limited availability of annotated radar data poses a significant challenge for advancing radar-based perception systems. To address this limitation, we propose a novel framework to generate 4D radar point clouds for training and evaluating object detectors. Unlike image-based diffusion, our method is designed to consider the sparsity and unique characteristics of radar point clouds by applying diffusion to a latent point cloud representation. Within this latent space, generation is controlled via conditioning at either the object or scene level. The proposed 4D-RaDiff converts unlabeled bounding boxes into high-quality radar annotations and transforms existing LiDAR point cloud data into realistic radar scenes. Experiments demonstrate that incorporating synthetic radar data of 4D-RaDiff as data augmentation method during training consistently improves object detection performance compared to training on real data only. In addition, pre-training on our synthetic data reduces the amount of required annotated radar data by up to 90% while achieving comparable object detection performance.
ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang, Ali Sartaz Khan, Shri...
个性化推荐理由:

该论文标题主要涉及对话代理和三维虚拟身体技术,属于人机交互和计算机图形学领域。虽然提到了“行为智能”,但这与推荐系统、搜索或广告中的核心算法、架构改进或直接应用没有明显关联。该技术更偏向于虚拟形象和交互界面,而非排名、检索或个性化推荐等核心任务。

2025-12-16 09:41:21 | arXiv:2512.14234v1 |
cs.CV
查看完整摘要
Human communication is inherently multimodal and social: words, prosody, and body language jointly carry intent. Yet most prior systems model human behavior as a translation task (co-speech gesture or text-to-motion) that maps a fixed utterance to motion clips, without requiring agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks where speech, text, and motion are trained or inferred in isolation. We introduce ViBES (Voice in Behavioral Expression and Synchrony), a conversational 3D agent that jointly plans language and movement and executes dialogue-conditioned body actions. Concretely, ViBES is a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. The model processes interleaved multimodal token streams with hard routing by modality (parameters are split per expert), while sharing information through cross-expert attention. By leveraging strong pretrained speech-language models, the agent supports mixed-initiative interaction: users can speak, type, or issue body-action directives mid-conversation, and the system exposes controllable behavior hooks for streaming responses. We further benchmark on multi-turn conversation with automatic metrics of dialogue-motion alignment and behavior quality, and observe consistent gains over strong co-speech and text-to-motion baselines. ViBES goes beyond "speech-conditioned motion generation" toward agentic virtual bodies where language, prosody, and movement are jointly generated, enabling controllable, socially competent 3D interaction. Code and data will be made available at: ai.stanford.edu/~juze/ViBES/
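
The mixture-of-modality-experts backbone described above, shared attention over an interleaved token stream plus hard routing of each token to a per-modality feed-forward expert, can be sketched in a few lines of PyTorch. Dimensions, the three-way modality split, and layer placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MoMEBlock(nn.Module):
    """Sketch of a mixture-of-modality-experts layer (speech / face / body)."""
    def __init__(self, dim=256, n_heads=4, n_modalities=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_modalities)
        ])
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, modality_ids):
        h, _ = self.attn(x, x, x)        # shared attention over the interleaved stream
        x = self.norm(x + h)
        out = torch.zeros_like(x)
        for m, expert in enumerate(self.experts):
            mask = modality_ids == m     # hard routing: parameters split per modality
            if mask.any():
                out[mask] = expert(x[mask])
        return self.norm(x + out)

block = MoMEBlock()
tokens = torch.randn(2, 10, 256)
mods = torch.randint(0, 3, (2, 10))
print(block(tokens, mods).shape)         # (2, 10, 256)
```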
Multi-View MRI Approach for Classification of MGMT Methylation in Glioblastoma Patients
Rawan Alyahya, Asrar Alruwayqi, Atheer Alqarni, Asma Alkhaldi, Metab Alkubeyyer,...
个性化推荐理由:

该论文标题明确指向医学影像分析(MRI)和特定癌症(胶质母细胞瘤)的生物标志物分类,属于医学/生物学领域应用。这与用户关注的推荐系统、搜索、广告、LLM技术及Transformer架构等核心方向完全无关,也不涉及任何异构数据处理或多模态建模的类比应用。

2025-12-16 09:37:20 | arXiv:2512.14232v1 |
cs.CV
查看完整摘要
The presence of MGMT promoter methylation significantly affects how well chemotherapy works for patients with Glioblastoma Multiforme (GBM). Currently, confirmation of MGMT promoter methylation relies on invasive brain tumor tissue biopsies. In this study, we explore radiogenomics techniques, a promising approach in precision medicine, to identify genetic markers from medical images. Using MRI scans and deep learning models, we propose a new multi-view approach that considers spatial relationships between MRI views to detect MGMT methylation status. Importantly, our method extracts information from all three views without using a complicated 3D deep learning model, avoiding issues associated with high parameter count, slow convergence, and substantial memory demands. We also introduce a new technique for tumor slice extraction and show its superiority over existing methods based on multiple evaluation metrics. By comparing our approach to state-of-the-art models, we demonstrate the efficacy of our method. Furthermore, we share a reproducible pipeline of published models, encouraging transparency and the development of robust diagnostic tools. Our study highlights the potential of non-invasive methods for identifying MGMT promoter methylation and contributes to advancing precision medicine in GBM treatment.
OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving
Tao Tang, Enhui Ma, xia zhou, Letian Wang, Tianyi Yan, Xueyang Zhang, Kun Zhan, ...
个性化推荐理由:

该论文专注于自动驾驶领域的多模态传感器生成,属于纯粹的视觉/3D视觉应用,与推荐系统、搜索或广告领域没有直接关联。论文内容涉及特定领域(自动驾驶)的生成技术,不符合当前关注的任何技术方向。

2025-12-16 09:18:15 | arXiv:2512.14225v1 |
cs.CV
查看完整摘要
Autonomous driving has seen remarkable advancements, largely driven by extensive real-world data collection. However, acquiring diverse and corner-case data remains costly and inefficient. Generative models have emerged as a promising solution by synthesizing realistic sensor data. However, existing approaches primarily focus on single-modality generation, leading to inefficiencies and misalignment in multimodal sensor data. To address these challenges, we propose OmniGen, which generates aligned multimodal sensor data in a unified framework. Our approach leverages a shared Bird's Eye View (BEV) space to unify multimodal features and designs a novel generalizable multimodal reconstruction method, UAE, to jointly decode LiDAR and multi-view camera data. UAE achieves multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction. Furthermore, we incorporate a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation. Our comprehensive experiments demonstrate that OmniGen achieves the desired performance in unified multimodal sensor data generation with multimodal consistency and flexible sensor adjustments.
DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos
Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Ziyuan L...
个性化推荐理由:

该论文标题明确涉及机器人演示视频生成和深度编码轨迹,属于机器人学领域。根据您的关注点,这属于无关主题中的'Purely Vision、3D Vision, Graphic或Speech papers without clear relevance to RecSys/Search/Ads',与推荐系统、搜索或广告的核心进展、LLM技术、Transformer架构或异构数据统一建模均无直接关联。

2025-12-16 09:11:36 | arXiv:2512.14217v1 |
cs.CVcs.RO
查看完整摘要
Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance the spatio-temporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot's joint angles. Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.
Beyond a Single Light: A Large-Scale Aerial Dataset for Urban Scene Reconstruction Under Varying Illumination
Zhuoxiao Li, Wenzong Ma, Taoyu Wu, Jinjing Zhu, Zhenchao Q, Shuai Zhang, Jing Ou...
个性化推荐理由:

该论文标题明确聚焦于计算机视觉领域中的城市场景重建和航空数据集,属于纯粹的视觉/3D视觉研究范畴。虽然提到了“大规模数据集”,但核心内容涉及光照变化下的场景重建,这与推荐系统、搜索或广告的排名、建模、架构或应用没有直接关联,也不符合任何指定的技术焦点领域。

2025-12-16 08:47:56 | arXiv:2512.14200v1 |
cs.CV
查看完整摘要
Recent advances in Neural Radiance Fields and 3D Gaussian Splatting have demonstrated strong potential for large-scale UAV-based 3D reconstruction tasks by fitting the appearance of images. However, real-world large-scale captures are often based on multi-temporal data capture, where illumination inconsistencies across different times of day can lead to significant color artifacts, geometric inaccuracies, and inconsistent appearance. Due to the lack of UAV datasets that systematically capture the same areas under varying illumination conditions, this challenge remains largely underexplored. To fill this gap, we introduce SkyLume, a large-scale, real-world UAV dataset specifically designed for studying illumination-robust 3D reconstruction in urban scene modeling: (1) We collect data from 10 urban regions, comprising more than 100k high-resolution UAV images (four oblique views and nadir), where each region is captured at three periods of the day to systematically isolate illumination changes. (2) To support precise evaluation of geometry and appearance, we provide per-scene LiDAR scans and accurate 3D ground-truth for assessing depth, surface normals, and reconstruction quality under varying illumination. (3) For the inverse rendering task, we introduce the Temporal Consistency Coefficient (TCC), a metric that measures cross-time albedo stability and directly evaluates the robustness of the disentanglement of light and material. We aim for this resource to serve as a foundation that advances research and real-world evaluation in large-scale inverse rendering, geometry reconstruction, and novel view synthesis.
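
The abstract does not spell out how the Temporal Consistency Coefficient is computed, so the following is only a plausible stand-in: score cross-time albedo stability as the mean pairwise correlation of albedo maps estimated from the same region at different times of day.

```python
import numpy as np

def temporal_consistency_coefficient(albedo_maps):
    """Illustrative cross-time albedo stability score (not the paper's exact TCC).

    albedo_maps: list of (H, W, 3) albedo estimates of the same region captured
    at different times of day; higher mean pairwise correlation = more stable.
    """
    flat = [a.reshape(-1).astype(np.float64) for a in albedo_maps]
    scores = []
    for i in range(len(flat)):
        for j in range(i + 1, len(flat)):
            scores.append(np.corrcoef(flat[i], flat[j])[0, 1])
    return float(np.mean(scores))
```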
Fracture Morphology Classification: Local Multiclass Modeling for Multilabel Complexity
Cassandra Krause, Mattias P. Heinrich, Ron Keuth
个性化推荐理由:

该论文标题明确聚焦于医学影像分析中的骨折形态分类问题,属于医学/生物学领域的特定应用。标题中提到的“局部多类建模”和“多标签复杂性”是计算机视觉中的技术方法,但整体内容与推荐系统、搜索、广告等核心领域以及LLM/Transformer技术无关,也不涉及异构数据统一建模的VLM类比思想。

2025-12-16 08:47:00 | arXiv:2512.14196v1 |
cs.CV
查看完整摘要
Between 15% and 45% of children experience a fracture during their growth years, making accurate diagnosis essential. Fracture morphology, alongside location and fragment angle, is a key diagnostic feature. In this work, we propose a method to extract fracture morphology by automatically assigning global AO codes to corresponding fracture bounding boxes. This approach enables the use of public datasets and reformulates the global multilabel task into a local multiclass one, improving the average F1 score by 7.89%. However, performance declines when using imperfect fracture detectors, highlighting challenges for real-world deployment. Our code is available on GitHub.
Establishing Stochastic Object Models from Noisy Data via Ambient Measurement-Integrated Diffusion
Jianwei Sun, Xiaoning Lei, Wenhao Cai, Xichen Xu, Yanshu Wang, Hu Gao
个性化推荐理由:

该论文标题涉及从噪声数据建立随机对象模型,使用扩散方法处理环境测量数据。这属于信号处理、统计建模或物理测量领域,与推荐系统、搜索、广告、LLM技术或Transformer架构的核心进展没有直接关联。标题中未提及任何可能应用于RecSys/Search/Ads的潜在技术。

2025-12-16 08:33:08 | arXiv:2512.14187v1 |
cs.GRcs.CV
查看完整摘要
Task-based measures of image quality (IQ) are critical for evaluating medical imaging systems, which must account for randomness including anatomical variability. Stochastic object models (SOMs) provide a statistical description of such variability, but conventional mathematical SOMs fail to capture realistic anatomy, while data-driven approaches typically require clean data rarely available in clinical tasks. To address this challenge, we propose AMID, an unsupervised Ambient Measurement-Integrated Diffusion with noise decoupling, which establishes clean SOMs directly from noisy measurements. AMID introduces a measurement-integrated strategy aligning measurement noise with the diffusion trajectory, and explicitly models the coupling between measurement and diffusion noise across steps; an ambient loss is then designed on this basis to learn clean SOMs. Experiments on real CT and mammography datasets show that AMID outperforms existing methods in generation fidelity and yields more reliable task-based IQ evaluation, demonstrating its potential for unsupervised medical imaging analysis.
Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere
Francesco Di Sario, Daniel Rebain, Dor Verbin, Marco Grangetto, Andrea Tagliasac...
个性化推荐理由:

该论文标题涉及计算机视觉中的球面几何和外观建模,属于纯粹的视觉技术范畴。虽然提到了可微分划分,但没有明确指向推荐系统、搜索或广告领域的潜在应用,也不涉及LLM、Transformer架构或异构数据统一建模等关注点。

2025-12-16 08:21:41 | arXiv:2512.14180v1 |
cs.CV
查看完整摘要
Radiance field methods (e.g. 3D Gaussian Splatting) have emerged as a powerful paradigm for novel view synthesis, yet their appearance modeling often relies on Spherical Harmonics (SH), which impose fundamental limitations. SH struggle with high-frequency signals, exhibit Gibbs ringing artifacts, and fail to capture specular reflections - a key component of realistic rendering. Although alternatives like spherical Gaussians offer improvements, they add significant optimization complexity. We propose Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting. SV partitions the directional domain into learnable regions with smooth boundaries, providing an intuitive and stable parameterization for view-dependent effects. For diffuse appearance, SV achieves competitive results while keeping optimization simpler than existing alternatives. For reflections - where SH fail - we leverage SV as learnable reflection probes, taking reflected directions as input following principles from classical graphics. This formulation attains state-of-the-art results on synthetic and real-world datasets, demonstrating that SV offers a principled, efficient, and general solution for appearance modeling in explicit 3D representations.
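
A differentiable partition of the sphere with smooth boundaries can be obtained by softmax-weighting each direction against learnable Voronoi sites. The sketch below is a minimal version of that idea, with a temperature controlling boundary sharpness; it is not the paper's exact parameterization.

```python
import torch

def spherical_voronoi_weights(directions, sites, temperature=0.1):
    """Soft assignment of view directions to learnable Voronoi sites on the sphere.

    directions: (N, 3) unit view directions; sites: (K, 3) learnable unit vectors.
    A softmax over negative geodesic distance gives a differentiable partition
    with smooth region boundaries.
    """
    cos = directions @ sites.T                            # (N, K) cosine similarity
    geo = torch.acos(cos.clamp(-1 + 1e-6, 1 - 1e-6))      # geodesic distance
    return torch.softmax(-geo / temperature, dim=-1)

dirs = torch.nn.functional.normalize(torch.randn(5, 3), dim=-1)
sites = torch.nn.functional.normalize(torch.randn(8, 3), dim=-1)
weights = spherical_voronoi_weights(dirs, sites)          # (5, 8), rows sum to 1
appearance = weights @ torch.randn(8, 3)                  # blend per-region appearance values
```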
FastDDHPose: Towards Unified, Efficient, and Disentangled 3D Human Pose Estimation
Qingyuan Cai, Linxin Zhang, Xuecai Hu, Saihui Hou, Yongzhen Huang
个性化推荐理由:

该论文专注于计算机视觉中的3D人体姿态估计,属于纯粹的视觉任务,与推荐系统、搜索或广告的核心技术焦点无关。论文标题中提到的统一、高效、解耦等概念主要针对视觉模型优化,没有显示出在异构数据处理、Transformer架构改进或LLM应用方面的潜在价值。

2025-12-16 07:47:06 | arXiv:2512.14162v1 |
cs.CV
查看完整摘要
Recent approaches for monocular 3D human pose estimation (3D HPE) have achieved leading performance by directly regressing 3D poses from 2D keypoint sequences. Despite the rapid progress in 3D HPE, existing methods are typically trained and evaluated under disparate frameworks, lacking a unified framework for fair comparison. To address these limitations, we propose Fast3DHPE, a modular framework that facilitates rapid reproduction and flexible development of new methods. By standardizing training and evaluation protocols, Fast3DHPE enables fair comparison across 3D human pose estimation methods while significantly improving training efficiency. Within this framework, we introduce FastDDHPose, a Disentangled Diffusion-based 3D Human Pose Estimation method which leverages the strong latent distribution modeling capability of diffusion models to explicitly model the distributions of bone length and bone direction while avoiding further amplification of hierarchical error accumulation. Moreover, we design an efficient Kinematic-Hierarchical Spatial and Temporal Denoiser that encourages the model to focus on kinematic joint hierarchies while avoiding unnecessary modeling of overly complex joint topologies. Extensive experiments on Human3.6M and MPI-INF-3DHP show that the Fast3DHPE framework enables fair comparison of all methods while significantly improving training efficiency. Within this unified framework, FastDDHPose achieves state-of-the-art performance with strong generalization and robustness in in-the-wild scenarios. The framework and models will be released at: https://github.com/Andyen512/Fast3DHPE
CIS-BA: Continuous Interaction Space Based Backdoor Attack for Object Detection in the Real-World
Shuxin Zhao, Bo Lang, Nan Xiao, Yilang Zhang
个性化推荐理由:

该论文涉及计算机视觉领域的目标检测和安全攻击(后门攻击),与推荐系统、搜索或广告的核心技术焦点无关。虽然标题提到“真实世界”,但这属于安全/隐私范畴,已被明确列为不相关主题。

2025-12-16 07:37:46 | arXiv:2512.14158v1 |
cs.CVcs.CR
查看完整摘要
Object detection models deployed in real-world applications such as autonomous driving face serious threats from backdoor attacks. Despite their practical effectiveness, existing methods are inherently limited in both capability and robustness due to their dependence on single-trigger-single-object mappings and fragile pixel-level cues. We propose CIS-BA, a novel backdoor attack paradigm that redefines trigger design by shifting from static object features to continuous inter-object interaction patterns that describe how objects co-occur and interact in a scene. By modeling these patterns as a continuous interaction space, CIS-BA introduces space triggers that, for the first time, enable a multi-trigger-multi-object attack mechanism while achieving robustness through invariant geometric relations. To implement this paradigm, we design CIS-Frame, which constructs space triggers via interaction analysis, formalizes them as class-geometry constraints for sample poisoning, and embeds the backdoor during detector training. CIS-Frame supports both single-object attacks (object misclassification and disappearance) and multi-object simultaneous attacks, enabling complex and coordinated effects across diverse interaction states. Experiments on MS-COCO and real-world videos show that CIS-BA achieves over 97% attack success under complex environments and maintains over 95% effectiveness under dynamic multi-trigger conditions, while evading three state-of-the-art defenses. In summary, CIS-BA extends the landscape of backdoor attacks in interaction-intensive scenarios and provides new insights into the security of object detection systems.
Incentivizing Tool-augmented Thinking with Images for Medical Image Analysis
Yankai Jiang, Yujie Zhang, Peng Zhang, Yichen Li, Jintai Chen, Xiaoming Shi, Shi...
个性化推荐理由:

该论文标题明确涉及医学图像分析,这属于明确的无关主题(Medical domain-specific applications)。虽然提到了工具增强思维,但核心应用领域与推荐系统、搜索或广告无关,且没有迹象表明该技术可迁移到这些领域。

2025-12-16 07:37:23 | arXiv:2512.14157v1 |
cs.AIcs.CV
查看完整摘要
Recent reasoning based medical MLLMs have made progress in generating step by step textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on fine-grained visual regions to achieve precise grounding and diagnosis. We introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when additional visual evidence is needed, (ii) determine where to probe and ground within the medical image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved, multimodal chain of thought. In contrast to prior approaches limited by the performance ceiling of specialized tools, Ophiuchus integrates the model's inherent grounding and perception capabilities with external tools, thereby fostering higher-level reasoning. The core of our method is a three-stage training strategy: cold-start training with tool-integrated reasoning data to achieve basic tool selection and adaptation for inspecting key regions; self-reflection fine-tuning to strengthen reflective reasoning and encourage revisiting tool outputs; and Agentic Tool Reinforcement Learning to directly optimize task-specific rewards and emulate expert-like diagnostic behavior. Extensive experiments show that Ophiuchus consistently outperforms both closed-source and open-source SOTA methods across diverse medical benchmarks, including VQA, detection, and reasoning-based segmentation. Our approach illuminates a path toward medical AI agents that can genuinely "think with images" through tool-integrated reasoning. Datasets, codes, and trained models will be released publicly.
TorchTraceAP: A New Benchmark Dataset for Detecting Performance Anti-Patterns in Computer Vision Models
Hanning Chen, Keyu Man, Kevin Zhu, Chenguang Zhu, Haonan Li, Tongbo Luo, Xizhou ...
个性化推荐理由:

该论文标题明确聚焦于计算机视觉模型的性能反模式检测和基准数据集创建,属于纯粹的计算机视觉领域研究。根据用户列出的无关主题,明确排除了'Purely Vision、3D Vision、Graphic或Speech papers without clear relevance to RecSys/Search/Ads',且论文标题中没有任何内容表明与推荐系统、搜索或广告有潜在关联。

2025-12-16 06:54:20 | arXiv:2512.14141v1 |
cs.CVcs.AI
查看完整摘要
Identifying and addressing performance anti-patterns in machine learning (ML) models is critical for efficient training and inference, but it typically demands deep expertise spanning system infrastructure, ML models and kernel development. While large tech companies rely on dedicated ML infrastructure engineers to analyze torch traces and benchmarks, such resource-intensive workflows are largely inaccessible to computer vision researchers in general. Among the challenges, pinpointing problematic trace segments within lengthy execution traces remains the most time-consuming task, and is difficult to automate with current ML models, including LLMs. In this work, we present the first benchmark dataset specifically designed to evaluate and improve ML models' ability to detect anti-patterns in traces. Our dataset contains over 600 PyTorch traces from diverse computer vision models (classification, detection, segmentation, and generation) collected across multiple hardware platforms. We also propose a novel iterative approach: a lightweight ML model first detects trace segments with anti-patterns, followed by a large language model (LLM) for fine-grained classification and targeted feedback. Experimental results demonstrate that our method significantly outperforms unsupervised clustering and rule-based statistical techniques for detecting anti-pattern regions. Our method also effectively compensates for the LLM's limited context length and reasoning inefficiencies.
SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing
Han Zou, Yan Zhang, Ruiqi Yu, Cong Xie, Jie Huang, Zhenpeng Zhan
个性化推荐理由:

该论文标题聚焦于草图编辑和重绘工具,属于计算机图形学或创意设计领域。虽然涉及语义编辑,但缺乏与推荐系统、搜索或广告的明确联系,也不涉及LLM、Transformer架构或异构数据处理等当前关注的技术方向。

2025-12-16 06:50:44 | arXiv:2512.14140v1 |
cs.CV
查看完整摘要
Sketch editing is central to digital illustration, yet existing image editing systems struggle to preserve the sparse, style-sensitive structure of line art while supporting both high-level semantic changes and precise local redrawing. We present SketchAssist, an interactive sketch drawing assistant that accelerates creation by unifying instruction-guided global edits with line-guided region redrawing, while keeping unrelated regions and overall composition intact. To enable this assistant at scale, we introduce a controllable data generation pipeline that (i) constructs attribute-addition sequences from attribute-free base sketches, (ii) forms multi-step edit chains via cross-sequence sampling, and (iii) expands stylistic coverage with a style-preserving attribute-removal model applied to diverse sketches. Building on this data, SketchAssist employs a unified sketch editing framework with minimal changes to DiT-based editors. We repurpose the RGB channels to encode the inputs, enabling seamless switching between instruction-guided edits and line-guided redrawing within a single input interface. To further specialize behavior across modes, we integrate a task-guided mixture-of-experts into LoRA layers, routing by text and visual cues to improve semantic controllability, structural fidelity, and style preservation. Extensive experiments show state-of-the-art results on both tasks, with superior instruction adherence and style/structure preservation compared to recent baselines. Together, our dataset and SketchAssist provide a practical, controllable assistant for sketch creation and revision.
Consistent Instance Field for Dynamic Scene Understanding
Junyi Wu, Van Nguyen Nguyen, Benjamin Planche, Jiachen Tao, Changchang Sun, Zhon...
个性化推荐理由:

该论文标题涉及动态场景理解,这属于计算机视觉领域,特别是3D视觉或场景理解方向。虽然场景理解在广义上可能与某些推荐系统或广告的上下文理解相关,但该标题没有明确指向推荐系统、搜索或广告的核心问题,也没有涉及LLM、Transformer架构或异构数据处理。根据您的关注点,这属于不相关的纯视觉论文。

2025-12-16 06:12:11 | arXiv:2512.14126v1 |
cs.CV
查看完整摘要
We introduce Consistent Instance Field, a continuous and probabilistic spatio-temporal representation for dynamic scene understanding. Unlike prior methods that rely on discrete tracking or view-dependent features, our approach disentangles visibility from persistent object identity by modeling each space-time point with an occupancy probability and a conditional instance distribution. To realize this, we introduce a novel instance-embedded representation based on deformable 3D Gaussians, which jointly encode radiance and semantic information and are learned directly from input RGB images and instance masks through differentiable rasterization. Furthermore, we introduce new mechanisms to calibrate per-Gaussian identities and resample Gaussians toward semantically active regions, ensuring consistent instance representations across space and time. Experiments on HyperNeRF and Neu3D datasets demonstrate that our method significantly outperforms state-of-the-art methods on novel-view panoptic segmentation and open-vocabulary 4D querying tasks.
SportsGPT: An LLM-driven Framework for Interpretable Sports Motion Assessment and Training Guidance
Wenbo Tian, Ruting Lin, Hongxian Zheng, Yaodong Yang, Geng Wu, Zihao Zhang, Zhan...
个性化推荐理由:

该论文标题明确聚焦于体育领域的运动评估和训练指导应用,属于特定领域(体育)的LLM应用。虽然涉及LLM技术,但未提及任何与推荐系统、搜索或广告相关的潜在应用场景,完全偏离了所有关注领域。

2025-12-16 06:05:55 | arXiv:2512.14121v1 |
cs.CVcs.AI
查看完整摘要
Existing intelligent sports analysis systems mainly focus on "scoring and visualization," often lacking automatic performance diagnosis and interpretable training guidance. Recent advances in Large Language Models (LLMs) and motion analysis techniques provide new opportunities to address the above limitations. In this paper, we propose SportsGPT, an LLM-driven framework for interpretable sports motion assessment and training guidance, which establishes a closed loop from motion time-series input to professional training guidance. First, given a set of high-quality target models, we introduce MotionDTW, a two-stage time-series alignment algorithm designed for accurate keyframe extraction from skeleton-based motion sequences. Subsequently, we design a Knowledge-based Interpretable Sports Motion Assessment Model (KISMAM) to obtain a set of interpretable assessment metrics (e.g., insufficient extension) by contrasting the keyframes with the target models. Finally, we propose SportsRAG, a RAG-based training guidance model based on Qwen3. Leveraging a 6B-token knowledge base, it prompts the LLM to generate professional training guidance by retrieving domain-specific QA pairs. Experimental results demonstrate that MotionDTW significantly outperforms traditional methods with lower temporal error and higher IoU scores. Furthermore, ablation studies validate KISMAM and SportsRAG, confirming that SportsGPT surpasses general LLMs in diagnostic accuracy and professionalism.
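
MotionDTW is described as a two-stage time-series alignment; the classic dynamic-time-warping recursion it presumably builds on is easy to write down. The sketch below aligns two skeleton feature sequences and returns the frame correspondence that keyframe transfer would rely on (plain DTW only, not the paper's two-stage variant).

```python
import numpy as np

def dtw_align(seq_a, seq_b):
    """Classic DTW alignment between two motion sequences of shape (T, D)."""
    Ta, Tb = len(seq_a), len(seq_b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack to recover the frame correspondence used for keyframe transfer.
    path, i, j = [], Ta, Tb
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else ((i - 1, j) if step == 1 else (i, j - 1))
    return cost[Ta, Tb], path[::-1]
```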
MFE-GAN: Efficient GAN-based Framework for Document Image Enhancement and Binarization with Multi-scale Feature Extraction
Rui-Yang Ju, KokSheik Wong, Yanlin Jin, Jen-Shiun Chiang
个性化推荐理由:

该论文专注于文档图像处理领域的GAN应用,属于计算机视觉中的特定任务。虽然涉及多尺度特征提取技术,但其核心应用(文档图像增强与二值化)与推荐系统、搜索或广告的排名、建模、架构等核心技术焦点没有直接关联,也不属于LLM或Transformer架构的使能技术。

2025-12-16 05:54:27 | arXiv:2512.14114v1 |
cs.CV
查看完整摘要
Document image enhancement and binarization are commonly performed prior to document analysis and recognition tasks for improving the efficiency and accuracy of optical character recognition (OCR) systems. This is because directly recognizing text in degraded documents, particularly in color images, often results in unsatisfactory recognition performance. To address these issues, existing methods train independent generative adversarial networks (GANs) for different color channels to remove shadows and noise, which, in turn, facilitates efficient text information extraction. However, deploying multiple GANs results in long training and inference times. To reduce both training and inference times of document image enhancement and binarization models, we propose MFE-GAN, an efficient GAN-based framework with multi-scale feature extraction (MFE), which incorporates Haar wavelet transformation (HWT) and normalization to process document images before feeding them into GANs for training. In addition, we present novel generators, discriminators, and loss functions to improve the model's performance, and we conduct ablation studies to demonstrate their effectiveness. Experimental results on the Benchmark, Nabuco, and CMATERdb datasets demonstrate that the proposed MFE-GAN significantly reduces the total training and inference times while maintaining comparable performance with respect to state-of-the-art (SOTA) methods. The implementation of this work is available at https://ruiyangju.github.io/MFE-GAN.
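
The Haar wavelet transform plus normalization step used to preprocess document images can be sketched with the PyWavelets package; the channel layout and normalization choice below are assumptions, not MFE-GAN's exact pipeline.

```python
import numpy as np
import pywt

def haar_multiscale_features(gray_image):
    """Haar-wavelet preprocessing before feeding a document image to the GAN.

    gray_image: (H, W) array in [0, 1]. Returns the low-frequency band plus the
    three detail bands, each normalized to zero mean / unit variance.
    """
    cA, (cH, cV, cD) = pywt.dwt2(gray_image, "haar")
    bands = []
    for band in (cA, cH, cV, cD):
        bands.append((band - band.mean()) / (band.std() + 1e-8))
    return np.stack(bands, axis=0)   # (4, H/2, W/2) multi-scale input

features = haar_multiscale_features(np.random.rand(64, 64))
print(features.shape)   # (4, 32, 32)
```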
ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models
Ruishu Zhu, Zhihao Huang, Jiacheng Sun, Ping Luo, Hongyuan Zhang, Xuelong Li
个性化推荐理由:

这篇论文专注于多视角图像生成技术,属于计算机视觉领域的特定应用。虽然提到了多模态扩散模型,但其核心是3D视觉和图像生成,与推荐系统、搜索或广告的排名、检索、个性化等核心任务没有直接关联。该技术缺乏在RecSys/Search/Ads领域的明确应用潜力。

2025-12-16 05:15:07 | arXiv:2512.14099v1 |
cs.CV
查看完整摘要
Multi-view image generation from a single image and text description remains challenging due to the difficulty of maintaining geometric consistency across different viewpoints. Existing approaches typically rely on 3D-aware architectures or specialized diffusion models that require extensive multi-view training data and complex geometric priors. In this work, we introduce ViewMask-1-to-3, a pioneering approach to apply discrete diffusion models to multi-view image generation. Unlike continuous diffusion methods that operate in latent spaces, ViewMask-1-to-3 formulates multi-view synthesis as a discrete sequence modeling problem, where each viewpoint is represented as visual tokens obtained through MAGVIT-v2 tokenization. By unifying language and vision through masked token prediction, our approach enables progressive generation of multiple viewpoints through iterative token unmasking with text input. ViewMask-1-to-3 achieves cross-view consistency through simple random masking combined with self-attention, eliminating the requirement for complex 3D geometric constraints or specialized attention architectures. Our approach demonstrates that discrete diffusion provides a viable and simple alternative to existing multi-view generation methods, ranking first on average across GSO and 3D-FUTURE datasets in terms of PSNR, SSIM, and LPIPS, while maintaining architectural simplicity.
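
Formulating multi-view synthesis as discrete sequence modeling means generation proceeds by progressively unmasking visual tokens. A minimal sketch of such an unmasking loop follows, assuming a logits_fn that scores the full interleaved token sequence; the schedule and confidence rule are illustrative.

```python
import torch

def iterative_unmask(logits_fn, seq_len, mask_id, steps=8):
    """Generate a token sequence by committing the most confident masked positions.

    logits_fn(tokens) is assumed to return a (seq_len, vocab) tensor of logits
    for the full interleaved multi-view sequence; the final step commits
    whatever positions remain masked.
    """
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for step in range(steps):
        masked = tokens == mask_id
        if not masked.any():
            break
        conf, pred = logits_fn(tokens).softmax(-1).max(-1)
        conf = conf.masked_fill(~masked, -1.0)              # never re-commit tokens
        k = max(1, int(masked.sum().item()) // (steps - step))
        idx = conf.topk(k).indices
        tokens[idx] = pred[idx]
    return tokens

# Toy usage with a random "model"; a real system would call the diffusion
# transformer here, conditioned on the input image and text.
vocab, mask_id, L = 512, 512, 24
out = iterative_unmask(lambda t: torch.randn(L, vocab), L, mask_id)
```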
AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation
Sisi Dai, Kai Xu
个性化推荐理由:

该论文专注于4D人-物交互生成,属于计算机视觉中的动作生成/合成领域。虽然涉及零样本学习和生成技术,但其核心应用场景(人-物交互)与推荐系统、搜索或广告的排序任务无直接关联,且未提及任何潜在的跨模态应用或推荐相关技术迁移。

2025-12-16 05:10:19 | arXiv:2512.14095v1 |
cs.CV
查看完整摘要
Despite significant progress in text-driven 4D human-object interaction (HOI) generation with supervised methods, the scalability remains limited by the scarcity of large-scale 4D HOI datasets. To overcome this, recent approaches attempt zero-shot 4D HOI generation with pre-trained image diffusion models. However, interaction cues are minimally distilled during the generation process, restricting their applicability across diverse scenarios. In this paper, we propose AnchorHOI, a novel framework that thoroughly exploits hybrid priors by incorporating video diffusion models beyond image diffusion models, advancing 4D HOI generation. Nevertheless, directly optimizing high-dimensional 4D HOI with such priors remains challenging, particularly for human pose and compositional motion. To address this challenge, AnchorHOI introduces an anchor-based prior distillation strategy, which constructs interaction-aware anchors and then leverages them to guide generation in a tractable two-step process. Specifically, two tailored anchors are designed for 4D HOI generation: anchor Neural Radiance Fields (NeRFs) for expressive interaction composition, and anchor keypoints for realistic motion synthesis. Extensive experiments demonstrate that AnchorHOI outperforms previous methods with superior diversity and generalization.
Quality-Aware Framework for Video-Derived Respiratory Signals
Nhi Nguyen, Constantino Álvarez Casado, Le Nguyen, Manuel Lage Cañellas, Miguel ...
个性化推荐理由:

该论文标题涉及从视频中提取呼吸信号的质量评估,属于医疗或生理监测领域,与推荐系统、搜索或广告的核心技术无关。即使考虑多模态数据处理,其应用场景和数据类型与RecSys/Search/Ads中的用户行为序列、上下文特征等异构数据有本质区别,且未涉及LLM、Transformer架构或相关应用。

2025-12-16 05:04:24 | arXiv:2512.14093v1 |
cs.CVeess.SP
查看完整摘要
Video-based respiratory rate (RR) estimation is often unreliable due to inconsistent signal quality across extraction methods. We present a predictive, quality-aware framework that integrates heterogeneous signal sources with dynamic assessment of reliability. Ten signals are extracted from facial remote photoplethysmography (rPPG), upper-body motion, and deep learning pipelines, and analyzed using four spectral estimators: Welch's method, Multiple Signal Classification (MUSIC), Fast Fourier Transform (FFT), and peak detection. Segment-level quality indices are then used to train machine learning models that predict accuracy or select the most reliable signal. This enables adaptive signal fusion and quality-based segment filtering. Experiments on three public datasets (OMuSense-23, COHFACE, MAHNOB-HCI) show that the proposed framework achieves lower RR estimation errors than individual methods in most cases, with performance gains depending on dataset characteristics. These findings highlight the potential of quality-driven predictive modeling to deliver scalable and generalizable video-based respiratory monitoring solutions.
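
Of the four spectral estimators mentioned, Welch's method is the most standard; a minimal sketch of turning one extracted respiratory signal into a breaths-per-minute estimate with SciPy is shown below. The breathing band and segment length are assumptions, not the paper's settings.

```python
import numpy as np
from scipy.signal import welch

def respiratory_rate_welch(signal, fs, lo=0.1, hi=0.5):
    """Estimate respiratory rate (breaths/min) from one extracted waveform.

    signal: 1-D respiratory signal (e.g. rPPG- or motion-derived); fs: sampling
    rate in Hz. The dominant spectral peak inside an assumed breathing band is
    converted to breaths per minute.
    """
    freqs, psd = welch(signal, fs=fs, nperseg=min(len(signal), 8 * int(fs)))
    band = (freqs >= lo) & (freqs <= hi)
    peak_freq = freqs[band][np.argmax(psd[band])]
    return 60.0 * peak_freq

t = np.arange(0, 60, 1 / 30.0)                        # 60 s of video at 30 fps
sig = np.sin(2 * np.pi * 0.25 * t) + 0.1 * np.random.randn(len(t))
print(respiratory_rate_welch(sig, fs=30.0))           # ~15 breaths/min
```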
ProtoFlow: Interpretable and Robust Surgical Workflow Modeling with Learned Dynamic Scene Graph Prototypes
Felix Holm, Ghazal Ghazaei, Nassir Navab
个性化推荐理由:

该论文专注于手术工作流建模这一特定医疗领域应用,属于明确的医疗领域论文。虽然涉及工作流建模和场景图技术,但缺乏与推荐系统、搜索或广告领域的潜在应用联系。

2025-12-16 04:59:58 | arXiv:2512.14092v1 |
cs.CVcs.AI
查看完整摘要
Purpose: Detailed surgical recognition is critical for advancing AI-assisted surgery, yet progress is hampered by high annotation costs, data scarcity, and a lack of interpretable models. While scene graphs offer a structured abstraction of surgical events, their full potential remains untapped. In this work, we introduce ProtoFlow, a novel framework that learns dynamic scene graph prototypes to model complex surgical workflows in an interpretable and robust manner. Methods: ProtoFlow leverages a graph neural network (GNN) encoder-decoder architecture that combines self-supervised pretraining for rich representation learning with a prototype-based fine-tuning stage. This process discovers and refines core prototypes that encapsulate recurring, clinically meaningful patterns of surgical interaction, forming an explainable foundation for workflow analysis. Results: We evaluate our approach on the fine-grained CAT-SG dataset. ProtoFlow not only outperforms standard GNN baselines in overall accuracy but also demonstrates exceptional robustness in limited-data, few-shot scenarios, maintaining strong performance when trained on as few as one surgical video. Our qualitative analyses further show that the learned prototypes successfully identify distinct surgical sub-techniques and provide clear, interpretable insights into workflow deviations and rare complications. Conclusion: By uniting robust representation learning with inherent explainability, ProtoFlow represents a significant step toward developing more transparent, reliable, and data-efficient AI systems, accelerating their potential for clinical adoption in surgical training, real-time decision support, and workflow optimization.
GaussianPlant: Structure-aligned Gaussian Splatting for 3D Reconstruction of Plants
Yang Yang, Risa Shinoda, Hiroaki Santo, Fumio Okura
个性化推荐理由:

该论文专注于植物三维重建的计算机视觉技术,属于纯粹的视觉/3D视觉领域。虽然标题中提到“结构对齐”和“高斯泼溅”,但这些是计算机图形学中的特定方法,没有明确指向推荐系统、搜索或广告领域的潜在应用。该主题完全属于被排除的“纯粹视觉/3D视觉”类别。

2025-12-16 04:55:06 | arXiv:2512.14087v1 |
cs.CV
查看完整摘要
We present a method for jointly recovering the appearance and internal structure of botanical plants from multi-view images based on 3D Gaussian Splatting (3DGS). While 3DGS exhibits robust reconstruction of scene appearance for novel-view synthesis, it lacks structural representations underlying those appearances (e.g., branching patterns of plants), which limits its applicability to tasks such as plant phenotyping. To achieve both high-fidelity appearance and structural reconstruction, we introduce GaussianPlant, a hierarchical 3DGS representation, which disentangles structure and appearance. Specifically, we employ structure primitives (StPs) to explicitly represent branch and leaf geometry, and appearance primitives (ApPs) to the plants' appearance using 3D Gaussians. StPs represent a simplified structure of the plant, i.e., modeling branches as cylinders and leaves as disks. To accurately distinguish the branches and leaves, StP's attributes (i.e., branches or leaves) are optimized in a self-organized manner. ApPs are bound to each StP to represent the appearance of branches or leaves as in conventional 3DGS. StPs and ApPs are jointly optimized using a re-rendering loss on the input multi-view images, as well as the gradient flow from ApP to StP using the binding correspondence information. We conduct experiments to qualitatively evaluate the reconstruction accuracy of both appearance and structure, as well as real-world experiments to qualitatively validate the practical performance. Experiments show that the GaussianPlant achieves both high-fidelity appearance reconstruction via ApPs and accurate structural reconstruction via StPs, enabling the extraction of branch structure and leaf instances.
Bridging Fidelity-Reality with Controllable One-Step Diffusion for Image Super-Resolution
Hao Chen, Junyang Chen, Jinshan Pan, Jiangxin Dong
个性化推荐理由:

该论文专注于图像超分辨率的扩散模型技术,属于纯粹的计算机视觉领域。虽然扩散模型是生成式AI的重要技术,但论文内容仅限于图像处理,没有涉及推荐系统、搜索或广告领域的应用潜力,也不涉及多模态建模或Transformer架构的改进。

2025-12-16 03:56:02 | arXiv:2512.14061v1 |
cs.CV
查看完整摘要
Recent diffusion-based one-step methods have shown remarkable progress in the field of image super-resolution, yet they remain constrained by three critical limitations: (1) inferior fidelity performance caused by the information loss from compression encoding of low-quality (LQ) inputs; (2) insufficient region-discriminative activation of generative priors; (3) misalignment between text prompts and their corresponding semantic regions. To address these limitations, we propose CODSR, a controllable one-step diffusion network for image super-resolution. First, we propose an LQ-guided feature modulation module that leverages original uncompressed information from LQ inputs to provide high-fidelity conditioning for the diffusion process. We then develop a region-adaptive generative prior activation method to effectively enhance perceptual richness without sacrificing local structural fidelity. Finally, we employ a text-matching guidance strategy to fully harness the conditioning potential of text prompts. Extensive experiments demonstrate that CODSR achieves superior perceptual quality and competitive fidelity compared with state-of-the-art methods with efficient one-step inference.
Real-time prediction of workplane illuminance distribution for daylight-linked controls using non-intrusive multimodal deep learning
Zulin Zhuang, Yu Bian
个性化推荐理由:

该论文标题涉及建筑环境中的照度预测和控制系统,属于建筑自动化或环境工程领域。虽然提到了多模态深度学习,但其应用场景(日光联动控制)与推荐系统、搜索或广告的核心技术领域没有直接关联,也不符合任何指定的关注点(如RecSys核心进展、LLM技术、Transformer架构改进等)。

2025-12-16 03:52:27 | arXiv:2512.14058v1 |
cs.CVcs.AI
查看完整摘要
Daylight-linked controls (DLCs) have significant potential for energy savings in buildings, especially when abundant daylight is available and indoor workplane illuminance can be accurately predicted in real time. Most existing studies on indoor daylight predictions were developed and tested for static scenes. This study proposes a multimodal deep learning framework that predicts indoor workplane illuminance distributions in real time from non-intrusive images with temporal-spatial features. By extracting image features only from the side-lit window areas rather than interior pixels, the approach remains applicable in dynamically occupied indoor spaces. A field experiment was conducted in a test room in Guangzhou (China), where 17,344 samples were collected for model training and validation. The model achieved R2 > 0.98 with RMSE < 0.14 on the same-distribution test set and R2 > 0.82 with RMSE < 0.17 on an unseen-day test set, indicating high accuracy and acceptable temporal generalization.
FacEDiT: Unified Talking Face Editing and Generation via Facial Motion Infilling
Kim Sung-Bin, Joohyun Chang, David Harwath, Tae-Hyun Oh
个性化推荐理由:

该论文专注于计算机视觉领域的面部生成和编辑技术,属于纯粹的视觉内容生成范畴。虽然标题中提到“统一”和“生成”,但这与推荐系统、搜索或广告中的异构数据统一建模没有直接关联,也不涉及LLM技术、Transformer架构进展或这些技术在RecSys/Search/Ads领域的潜在应用。

2025-12-16 03:49:52 | arXiv:2512.14056v1 |
cs.CVcs.AI
查看完整摘要
Talking face editing and face generation have often been studied as distinct problems. In this work, we propose viewing both not as separate tasks but as subtasks of a unifying formulation, speech-conditional facial motion infilling. We explore facial motion infilling as a self-supervised pretext task that also serves as a unifying formulation of dynamic talking face synthesis. To instantiate this idea, we propose FacEDiT, a speech-conditional Diffusion Transformer trained with flow matching. Inspired by masked autoencoders, FacEDiT learns to synthesize masked facial motions conditioned on surrounding motions and speech. This formulation enables both localized generation and edits, such as substitution, insertion, and deletion, while ensuring seamless transitions with unedited regions. In addition, biased attention and temporal smoothness constraints enhance boundary continuity and lip synchronization. To address the lack of a standard editing benchmark, we introduce FacEDiTBench, the first dataset for talking face editing, featuring diverse edit types and lengths, along with new evaluation metrics. Extensive experiments validate that talking face editing and generation emerge as subtasks of speech-conditional motion infilling; FacEDiT produces accurate, speech-aligned facial edits with strong identity preservation and smooth visual continuity while generalizing effectively to talking face generation.
Expert Switching for Robust AAV Landing: A Dual-Detector Framework in Simulation
Humaira Tasnim, Ashik E Rasul, Bruce Jo, Hyung-Jin Yoon
个性化推荐理由:

该论文标题涉及航空器(AAV)着陆的模拟控制框架,属于特定领域(航空/机器人)的应用研究。标题中未提及推荐系统、搜索、广告、LLM、Transformer架构或异构数据处理等与您关注领域相关的任何技术要素。

2025-12-16 03:41:59 | arXiv:2512.14054v1 |
cs.ROcs.CV
查看完整摘要
Reliable helipad detection is essential for Autonomous Aerial Vehicle (AAV) landing, especially under GPS-denied or visually degraded conditions. While modern detectors such as YOLOv8 offer strong baseline performance, single-model pipelines struggle to remain robust across the extreme scale transitions that occur during descent, where helipads appear small at high altitude and large near touchdown. To address this limitation, we propose a scale-adaptive dual-expert perception framework that decomposes the detection task into far-range and close-range regimes. Two YOLOv8 experts are trained on scale-specialized versions of the HelipadCat dataset, enabling one model to excel at detecting small, low-resolution helipads and the other to provide high-precision localization when the target dominates the field of view. During inference, both experts operate in parallel, and a geometric gating mechanism selects the expert whose prediction is most consistent with the AAV's viewpoint. This adaptive routing prevents the degradation commonly observed in single-detector systems when operating across wide altitude ranges. The dual-expert perception module is evaluated in a closed-loop landing environment that integrates CARLA's photorealistic rendering with NASA's GUAM flight-dynamics engine. Results show substantial improvements in alignment stability, landing accuracy, and overall robustness compared to single-detector baselines. By introducing a scale-aware expert routing strategy tailored to the landing problem, this work advances resilient vision-based perception for autonomous descent and provides a foundation for future multi-expert AAV frameworks.
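
The geometric gating between the far- and close-range experts can be thought of as a simple rule on viewpoint cues. The toy sketch below routes on altitude and apparent helipad size; the actual gate in the paper checks consistency between each expert's prediction and the AAV viewpoint, and these thresholds are purely illustrative.

```python
def select_expert(altitude_m, bbox_area_ratio, switch_altitude=25.0, switch_area=0.05):
    """Route a frame to the far- or close-range YOLO expert (illustrative thresholds).

    altitude_m: current AAV altitude; bbox_area_ratio: helipad bounding-box area
    as a fraction of the frame from the previous detection.
    """
    if altitude_m > switch_altitude and bbox_area_ratio < switch_area:
        return "far_expert"      # small, low-resolution helipad
    return "close_expert"        # helipad dominates the field of view

print(select_expert(altitude_m=60.0, bbox_area_ratio=0.01))   # far_expert
print(select_expert(altitude_m=8.0, bbox_area_ratio=0.30))    # close_expert
```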
SELECT: Detecting Label Errors in Real-world Scene Text Data
Wenjun Liu, Qian Wu, Yifeng Hu, Yuke Li
Personalized recommendation rationale:

The title focuses on label error detection in scene text data, a specialized computer vision task with no direct connection to the core techniques of recommendation, search, or advertising. Although text processing is part of search systems, this paper concentrates on assessing the quality of text recognition in visual scenes rather than on text understanding, retrieval, or ranking techniques directly relevant to RecSys/Search/Ads.

2025-12-16 03:32:30 | arXiv:2512.14050v1 |
cs.CV
Full abstract
We introduce SELECT (Scene tExt Label Errors deteCTion), a novel approach that leverages multi-modal training to detect label errors in real-world scene text datasets. Utilizing an image-text encoder and a character-level tokenizer, SELECT addresses the issues of variable-length sequence labels, label sequence misalignment, and character-level errors, outperforming existing methods in accuracy and practical utility. In addition, we introduce Similarity-based Sequence Label Corruption (SSLC), a process that intentionally introduces errors into the training labels to mimic real-world error scenarios during training. SSLC can not only change the sequence length but also accounts for the visual similarity between characters during corruption. Our method is the first to detect label errors in real-world scene text datasets while successfully accounting for variable-length labels. Experimental results demonstrate the effectiveness of SELECT in detecting label errors and improving STR accuracy on real-world text datasets, showcasing its practical utility.
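As a toy illustration of the corruption idea behind SSLC (not the authors' procedure), the snippet below substitutes visually similar characters and randomly inserts or deletes characters so the label length can change; the confusion table and probabilities are made up:

```python
# Toy sketch of similarity-based sequence label corruption (SSLC-style idea;
# the confusion table and probabilities are illustrative assumptions).
import random

VISUALLY_SIMILAR = {"0": "O", "O": "0", "1": "l", "l": "1", "5": "S", "S": "5"}

def corrupt_label(label: str, p_sub=0.15, p_del=0.05, p_ins=0.05) -> str:
    out = []
    for ch in label:
        r = random.random()
        if r < p_del:
            continue                                   # deletion shortens the label
        if r < p_del + p_sub:
            out.append(VISUALLY_SIMILAR.get(ch, ch))   # visually similar substitution
        else:
            out.append(ch)
        if random.random() < p_ins:
            out.append(random.choice("0O1lS5"))        # insertion lengthens the label
    return "".join(out)

if __name__ == "__main__":
    random.seed(0)
    print(corrupt_label("SOLO15"))
```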
OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving
Zhenguo Zhang, Haohan Zhen, Yishen Wang, Le Xu, Tianchen Deng, Xuefeng Chen, Qu ...
Personalized recommendation rationale:

This paper focuses on autonomous driving, a clearly unrelated topic (not recommendation/search/advertising). Although the title mentions multi-modality and reinforcement learning, the application setting is autonomous driving, with no direct connection to the recommendation, search, or advertising domains you follow.

2025-12-16 03:19:28 | arXiv:2512.14044v1 |
cs.CV cs.AI
Full abstract
The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. Thus we introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is a Reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model's significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.
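One way to picture an annotation-free, consistency-based grounding reward (a hedged sketch, not the Clip-GRPO algorithm itself) is to score the agreement between an embedding of the zoomed-in crop and an embedding of the concurrent reasoning text; `embed_image` and `embed_text` stand in for any CLIP-style encoders:

```python
# Hedged sketch of a grounding-consistency reward: score how well the image crop
# the model "zoomed into" matches the text of its reasoning step. This is an
# illustration of the general idea, not the paper's Clip-GRPO implementation.
import numpy as np

def grounding_reward(crop, reasoning_step, embed_image, embed_text) -> float:
    v = embed_image(crop)           # e.g. a CLIP-style image embedding of the zoomed region
    t = embed_text(reasoning_step)  # embedding of the concurrent textual reasoning
    v = v / (np.linalg.norm(v) + 1e-8)
    t = t / (np.linalg.norm(t) + 1e-8)
    return float(np.dot(v, t))      # higher when visual focus and text agree
```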
ASAP-Textured Gaussians: Enhancing Textured Gaussians with Adaptive Sampling and Anisotropic Parameterization
Meng Wei, Cheng Zhang, Jianmin Zheng, Hamid Rezatofighi, Jianfei Cai
Personalized recommendation rationale:

The title points squarely at 3D Gaussian Splatting and texture rendering in computer graphics, a purely vision/graphics topic. Although a parameterization method is mentioned, there is no indication of relevance to recommendation, search, or advertising, nor does it involve Transformer architectures, LLM techniques, or multi-modal modeling.

2025-12-16 03:13:27 | arXiv:2512.14039v1 |
cs.CV
Full abstract
Recent advances have equipped 3D Gaussian Splatting with texture parameterizations to capture spatially varying attributes, improving the performance of both appearance modeling and downstream tasks. However, the added texture parameters introduce significant memory efficiency challenges. Rather than proposing new texture formulations, we take a step back to examine the characteristics of existing textured Gaussian methods and identify two key limitations in common: (1) Textures are typically defined in canonical space, leading to inefficient sampling that wastes textures' capacity on low-contribution regions; and (2) texture parameterization is uniformly assigned across all Gaussians, regardless of their visual complexity, resulting in over-parameterization. In this work, we address these issues through two simple yet effective strategies: adaptive sampling based on the Gaussian density distribution and error-driven anisotropic parameterization that allocates texture resources according to rendering error. Our proposed ASAP Textured Gaussians, short for Adaptive Sampling and Anisotropic Parameterization, significantly improve the quality efficiency tradeoff, achieving high-fidelity rendering with far fewer texture parameters.
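A toy sketch of error-driven, anisotropic resource allocation under assumed budgets: Gaussians with larger rendering error get more texels, with extra resolution along the elongated axis. The constants and the allocation rule are illustrative, not the paper's:

```python
# Toy sketch of error-driven anisotropic texture allocation (illustrative only).
import numpy as np

def allocate_texture(errors, aspect_ratios, base=2, max_res=32):
    # errors: per-Gaussian rendering error; aspect_ratios: major/minor footprint axis ratio
    e = errors / (errors.max() + 1e-8)                       # normalize error to [0, 1]
    res_major = np.clip((base + e * max_res) * np.sqrt(aspect_ratios), base, max_res)
    res_minor = np.clip(res_major / aspect_ratios, 1, max_res)
    return res_major.astype(int), res_minor.astype(int)

if __name__ == "__main__":
    err = np.array([0.01, 0.30, 0.90])   # low-, mid-, high-error Gaussians
    ar = np.array([1.0, 2.0, 4.0])       # increasingly elongated footprints
    print(allocate_texture(err, ar))
```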
ACE-SLAM: Scene Coordinate Regression for Neural Implicit Real-Time SLAM
Ignacio Alzugaray, Marwan Taher, Andrew J. Davison
Personalized recommendation rationale:

The title concerns SLAM (Simultaneous Localization And Mapping), a computer vision and robotics topic with no direct bearing on the core technical focus of recommendation, search, or advertising. Neural implicit representations and scene coordinate regression are mainly applied to 3D scene reconstruction and robot navigation, with no clear potential application to recommendation/search/ads.

2025-12-16 02:56:50 | arXiv:2512.14032v1 |
cs.CV cs.AI eess.IV
Full abstract
We present a novel neural RGB-D Simultaneous Localization And Mapping (SLAM) system that learns an implicit map of the scene in real time. For the first time, we explore the use of Scene Coordinate Regression (SCR) as the core implicit map representation in a neural SLAM pipeline, a paradigm that trains a lightweight network to directly map 2D image features to 3D global coordinates. SCR networks provide efficient, low-memory 3D map representations, enable extremely fast relocalization, and inherently preserve privacy, making them particularly suitable for neural implicit SLAM. Our system is the first to achieve strict real-time operation in neural implicit RGB-D SLAM by relying on an SCR-based representation. We introduce a novel SCR architecture specifically tailored for this purpose and detail the critical design choices required to integrate SCR into a live SLAM pipeline. The resulting framework is simple yet flexible, seamlessly supporting both sparse and dense features, and operates reliably in dynamic environments without special adaptation. We evaluate our approach on established synthetic and real-world benchmarks, demonstrating competitive performance against the state of the art. Project Page: https://github.com/ialzugaray/ace-slam
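The core SCR idea, a small network regressing 3D world coordinates from 2D image features, can be sketched as below; the architecture, feature dimension, and direct-coordinate loss are assumptions and differ from ACE-SLAM's actual design:

```python
# Minimal sketch of scene coordinate regression (illustrative stand-in only).
import torch
import torch.nn as nn

class SCRHead(nn.Module):
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),             # predicted (x, y, z) in the world frame
        )

    def forward(self, features):              # features: (N, feat_dim) per-pixel descriptors
        return self.mlp(features)

if __name__ == "__main__":
    feats = torch.randn(1024, 512)            # features of 1024 sampled pixels
    gt_xyz = torch.randn(1024, 3)             # their 3D coordinates from depth + pose
    head = SCRHead()
    loss = (head(feats) - gt_xyz).abs().mean()  # direct coordinate regression loss
    print(loss.item())
```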
Robust Single-shot Structured Light 3D Imaging via Neural Feature Decoding
Jiaheng Li, Qiyu Dai, Lihan Li, Praneeth Chakravarthula, He Sun, Baoquan Chen, W...
Personalized recommendation rationale:

This paper focuses on 3D imaging techniques in computer vision and is purely vision research. Although structured-light 3D imaging could in principle be used to acquire 3D data of products or environments, the title clearly focuses on the imaging algorithm itself and mentions no application scenarios or technology-transfer possibilities related to recommendation, search, or advertising.

2025-12-16 02:47:38 | arXiv:2512.14028v1 |
cs.CV
Full abstract
We consider the problem of active 3D imaging using single-shot structured light systems, which are widely employed in commercial 3D sensing devices such as Apple Face ID and Intel RealSense. Traditional structured light methods typically decode depth correspondences through pixel-domain matching algorithms, resulting in limited robustness under challenging scenarios like occlusions, fine-structured details, and non-Lambertian surfaces. Inspired by recent advances in neural feature matching, we propose a learning-based structured light decoding framework that performs robust correspondence matching within feature space rather than the fragile pixel domain. Our method extracts neural features from the projected patterns and captured infrared (IR) images, explicitly incorporating their geometric priors by building cost volumes in feature space, achieving substantial performance improvements over pixel-domain decoding approaches. To further enhance depth quality, we introduce a depth refinement module that leverages strong priors from large-scale monocular depth estimation models, improving fine detail recovery and global structural coherence. To facilitate effective learning, we develop a physically-based structured light rendering pipeline, generating nearly one million synthetic pattern-image pairs with diverse objects and materials for indoor settings. Experiments demonstrate that our method, trained exclusively on synthetic data with multiple structured light patterns, generalizes well to real-world indoor environments, effectively processes various pattern types without retraining, and consistently outperforms both commercial structured light systems and passive stereo RGB-based depth estimation methods. Project page: https://namisntimpot.github.io/NSLweb/.
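A rough sketch of feature-space correspondence matching: correlate captured-image features with horizontally shifted pattern features to form a cost volume over candidate disparities. Shapes and the plain dot-product cost are assumptions, not the paper's cost-volume construction:

```python
# Rough sketch of a feature-space cost volume for structured-light decoding
# (illustrative only; not the paper's architecture).
import torch

def feature_cost_volume(ir_feat, pat_feat, max_disp=64):
    # ir_feat, pat_feat: (B, C, H, W) features of captured IR image and projected pattern
    B, C, H, W = ir_feat.shape
    cost = ir_feat.new_zeros(B, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            cost[:, d] = (ir_feat * pat_feat).mean(dim=1)
        else:
            # Correlate image features with the pattern shifted by d pixels.
            cost[:, d, :, d:] = (ir_feat[..., d:] * pat_feat[..., :-d]).mean(dim=1)
    return cost                                   # argmax over dim=1 gives the correspondence

if __name__ == "__main__":
    vol = feature_cost_volume(torch.randn(1, 32, 48, 64), torch.randn(1, 32, 48, 64))
    disparity = vol.argmax(dim=1)                 # (B, H, W) integer disparity estimate
    print(vol.shape, disparity.shape)
```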