arXiv Daily Paper Digest

2026-03-31
Total papers: 145
Selected papers: 7
Average score: 1.7
Showing 145 papers (of 145 total)
RCLRec: Reverse Curriculum Learning for Modeling Sparse Conversions in Generative Recommendation
Yulei Huang, Hao Deng, Haibo Xing, Jinxin Hu, Chuanfei Xu, Zulong Chen, Yu Zhang...
Core summary:

This paper addresses the modeling of sparse conversion behaviors in recommender systems. The core idea is to construct a reverse curriculum: a conversion-related subsequence is selected from the user's history as an additional supervision signal, and a joint generation objective sharpens the model's focus on the critical decision process.

Personalized recommendation rationale:

The paper directly tackles the difficulty of modeling sparse conversion objectives in recommender systems, proposing a generative recommendation framework based on reverse curriculum learning; it counts as both core-domain progress and a direct LLM application.

2026-03-30 07:41:33 | arXiv:2603.28124v1 |
cs.IR
Conversion objectives in large-scale recommender systems are sparse, making them difficult to optimize. Generative recommendation (GR) partially alleviates data sparsity by organizing multi-type behaviors into a unified token sequence with shared representations, but conversion signals remain insufficiently modeled. While recent behavior-aware GR models encode behavior types and employ behavior-aware attention to highlight decision-related intermediate behaviors, they still rely on standard attention over the full history and provide no additional supervision for conversions, leaving conversion sparsity largely unresolved. To address these challenges, we propose RCLRec, a reverse curriculum learning-based GR framework for sparse conversion supervision. For each conversion target, RCLRec constructs a short curriculum by selecting a subsequence of conversion-related items from the history in reverse. Their semantic tokens are fed to the decoder as a prefix, together with the target conversion tokens, under a joint generation objective. This design provides additional instance-specific intermediate supervision, alleviating conversion sparsity and focusing the model on the user's critical decision process. We further introduce a curriculum quality-aware loss to ensure that the selected curricula are informative for conversion prediction. Experiments on offline datasets and an online A/B test show that RCLRec achieves superior performance, with +2.09% advertising revenue and +1.86% orders in online deployment.
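The curriculum construction described above can be sketched in a few lines; the `relevance` scores, item names, and the `build_reverse_curriculum` helper are illustrative assumptions, not RCLRec's actual implementation:

```python
# Sketch of a reverse-curriculum prefix: select the k most conversion-related
# items from the history, then order them in reverse chronological order so
# their tokens can precede the conversion target as extra supervision.

def build_reverse_curriculum(history, relevance, k=3):
    """history: item ids, oldest first.
    relevance: item id -> conversion-relatedness score (assumed given)."""
    related = sorted(history, key=lambda i: relevance.get(i, 0.0), reverse=True)[:k]
    positions = {item: idx for idx, item in enumerate(history)}
    # most recent related item first, walking the history backwards
    return sorted(related, key=lambda i: positions[i], reverse=True)

history = ["view_a", "click_b", "view_c", "cart_d", "click_e"]
relevance = {"cart_d": 0.9, "click_b": 0.7, "click_e": 0.6,
             "view_a": 0.1, "view_c": 0.1}
prefix = build_reverse_curriculum(history, relevance, k=3)
print(prefix)
```

In the paper, the semantic tokens of such a prefix are fed to the decoder together with the target conversion tokens under a joint generation objective.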
ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning
Huanxuan Liao, Zhongtao Jiang, Yupu Hao, Yuqiao Tan, Shizhu He, Jun Zhao, Kun Xu...
Core summary:

The paper targets the computational bottleneck in multimodal LLMs caused by the explosion of visual tokens under high-resolution inputs. The core idea is ResAdapt, a framework in which a lightweight Allocator learns, before encoding, how much visual budget to assign to each frame; allocation is formulated as a contextual bandit and trained with cost-aware policy optimization.

Personalized recommendation rationale:

The paper proposes an input-side adaptive-resolution framework that dynamically assigns visual budgets via a lightweight Allocator, directly optimizing the input efficiency of multimodal LLMs. It is highly relevant to core research directions in Transformer architecture efficiency, applied LLM techniques, and heterogeneous data processing.

2026-03-30 15:57:32 | arXiv:2603.28610v1 |
cs.CV cs.AI cs.CL
Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an Input-side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual-token interface while receiving an operator-transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy-cost learning signal. Across budget-controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low-budget operating points and often lies on or near the efficiency-accuracy frontier, with the clearest gains on reasoning-intensive benchmarks under aggressive compression. Notably, ResAdapt supports up to 16x more frames at the same visual budget while delivering over 15% performance gain. Code is available at https://github.com/Xnhyacinth/ResAdapt.
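The accuracy-cost trade-off the Allocator optimizes can be illustrated with a generic cost-penalized bandit; the reward shaping `accuracy - lam * cost`, the resolution arms, and the epsilon-greedy policy are assumptions standing in for CAPO, whose details go beyond this abstract:

```python
import random

# Generic epsilon-greedy contextual bandit with a cost-aware reward,
# illustrating the accuracy-cost signal an input-side allocator optimizes.

RESOLUTIONS = [224, 448, 896]           # arms: visual budget per frame
COST = {224: 0.1, 448: 0.4, 896: 1.0}   # normalized compute cost

def make_allocator(epsilon=0.1, lam=0.5):
    totals, counts = {}, {}
    def choose(context):
        if random.random() < epsilon:
            return random.choice(RESOLUTIONS)
        avg = {r: totals.get((context, r), 0.0) / max(counts.get((context, r), 1), 1)
               for r in RESOLUTIONS}
        return max(RESOLUTIONS, key=lambda r: avg[r])
    def update(context, resolution, accuracy):
        reward = accuracy - lam * COST[resolution]   # cost-aware signal
        totals[(context, resolution)] = totals.get((context, resolution), 0.0) + reward
        counts[(context, resolution)] = counts.get((context, resolution), 0) + 1
    return choose, update

random.seed(0)
choose, update = make_allocator()
def env_accuracy(context, r):       # toy env: simple frames need no detail
    return 0.9 if (context == "simple" or r >= 448) else 0.6
for _ in range(500):
    ctx = random.choice(["simple", "complex"])
    r = choose(ctx)
    update(ctx, r, env_accuracy(ctx, r))
print(choose("simple"), choose("complex"))
```

Under this toy environment, the allocator learns to spend a small budget on "simple" frames and a mid-sized one on "complex" frames.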
Compressing Transformer Language Models via Matrix Product Operator Decomposition: A Case Study on PicoGPT
Younes Javanmard, Tanmoy Pandit, Masoud Mardani
Core summary:

The paper addresses the high deployment cost of Transformer language models caused by their large parameter counts. The core method applies Matrix Product Operator (MPO) decomposition to factorize weight matrices into chains of low-rank cores, with compression controlled through the bond dimension.

Personalized recommendation rationale:

The paper studies MPO decomposition as a core technique for Transformer model compression, squarely within Transformer architecture efficiency, with clear application value for deploying recommendation/search systems under resource constraints.

2026-03-30 14:57:47 | arXiv:2603.28534v1 |
cs.CL physics.data-an
Transformer-based language models achieve strong performance across NLP tasks, but their quadratic parameter scaling with hidden dimension makes deployment on resource-constrained hardware expensive. We study Matrix Product Operator (MPO) decomposition as a principled compression method for transformers. MPO factorises weight matrices into chains of low-rank cores, with approximation quality controlled by the bond dimension chi. We replace every nn.Linear layer in PicoGPT, a GPT-2-style character-level language model with about 1M parameters, with an MPOLinear module parameterised as an MPO chain. Cores are initialised either by TT-SVD from pretrained dense weights or from random initialisation, and trained using standard PyTorch autograd without a custom backward pass. We derive balanced factorisation schemes for the five distinct weight shapes in PicoGPT and evaluate bond dimensions chi in {4, 8, 16, 32} on Tiny Shakespeare. MPO compression achieves up to 13x compression per transformer block at chi = 4. At chi = 16, the model uses 191,872 parameters instead of 1,020,224 while retaining 97.7% of baseline token accuracy (51.6% vs 52.8%). Reconstruction error follows the expected trend and is lower for three-site than two-site factorisations at the same bond dimension. The chi = 8 model gives the best accuracy per parameter, exceeding the dense baseline by 2.7x on this metric. These results show that MPO parameterisation is a practical and theoretically grounded alternative to low-rank methods and unstructured pruning for transformer compression.
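How the bond dimension chi controls model size can be illustrated by counting MPO core parameters; the 64x64 layer and the (4, 4, 4) factorizations below are toy choices, not PicoGPT's actual shapes:

```python
# Parameter count of an MPO (tensor-train) factorization of a weight matrix.
# A (m, n) weight with m = m1*m2*m3 and n = n1*n2*n3 becomes three cores of
# shape (chi_prev, m_k, n_k, chi_next), with the boundary bonds fixed to 1.

def mpo_params(m_factors, n_factors, chi):
    bonds = [1] + [chi] * (len(m_factors) - 1) + [1]
    return sum(bonds[k] * m_factors[k] * n_factors[k] * bonds[k + 1]
               for k in range(len(m_factors)))

dense = 64 * 64                               # a toy 64x64 linear layer
for chi in (4, 8, 16, 32):
    p = mpo_params((4, 4, 4), (4, 4, 4), chi)
    print(f"chi={chi:2d}: {p:5d} params, {dense / p:.1f}x compression")
```

For this toy layer the compression only materializes at small chi; larger layers admit larger useful bond dimensions, which is why the trade-off between chi and accuracy per parameter is the central knob in the paper.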
IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression
Zhongping Ji
Core summary:

The paper studies how to reduce the compute and storage overhead of online vector quantization for LLM KV caches. The core method leverages quaternion algebra and the isoclinic decomposition of SO(4) to design IsoQuant, a hardware-aligned blockwise rotation framework that decorrelates features efficiently via the closed-form transform T(v) = q_L v q̄_R.

Personalized recommendation rationale:

The paper proposes a hardware-aligned KV-cache compression method based on quaternions and SO(4) isoclinic rotations, directly targeting the key bottleneck of LLM inference efficiency; it is frontier work on Transformer architecture efficiency.

2026-03-30 13:37:45 | arXiv:2603.28430v1 |
cs.LG cs.CL
Orthogonal feature decorrelation is effective for low-bit online vector quantization, but dense random orthogonal transforms incur prohibitive $O(d^2)$ storage and compute. RotorQuant reduces this cost with blockwise $3$D Clifford rotors, yet the resulting $3$D partition is poorly aligned with modern hardware and offers limited local mixing. We propose \textbf{IsoQuant}, a blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of $SO(4)$. It represents each $4$D block as a quaternion and applies a closed-form transform $T(v)=q_L v \overline{q_R}$. This yields two main variants: \emph{IsoQuant-Full}, which realizes the full $SO(4)$ rotation, and \emph{IsoQuant-Fast}, which keeps only one isoclinic factor for lower cost; the framework also admits a lightweight $2$D special case. At $d=128$, IsoQuant-Full reduces forward rotation cost from about $2{,}408$ FMAs in RotorQuant to $1{,}024$, while IsoQuant-Fast further reduces it to $512$. Across $18$ fused CUDA settings with $d \in {128,256,512}$, bit widths ${2,3,4}$, and FP16/FP32 execution, IsoQuant achieves mean kernel-level speedups of about $4.5\times$--$4.7\times$ over RotorQuant while maintaining comparable reconstruction MSE, with peak speedups above $6\times$. Current validation is limited to the stage-1 quantize--dequantize path on synthetic normalized vectors; end-to-end KV-cache evaluation remains future work.
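The closed-form transform $T(v)=q_L v \overline{q_R}$ can be sketched in a few lines; the quaternion layout (w, x, y, z) and the toy inputs are assumptions:

```python
import math

# Sketch of the IsoQuant transform T(v) = q_L * v * conj(q_R): each 4-D block
# is treated as a quaternion and rotated by two unit quaternions, realizing a
# full SO(4) rotation in closed form.

def qmul(a, b):
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

def qconj(q):
    w, x, y, z = q
    return (w, -x, -y, -z)

def qnormalize(q):
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def isoclinic_rotate(v, q_left, q_right):
    """Full SO(4) rotation of a 4-D block; q_left/q_right must be unit."""
    return qmul(qmul(q_left, v), qconj(q_right))

q_L = qnormalize((1.0, 2.0, -1.0, 0.5))
q_R = qnormalize((0.3, -0.7, 0.4, 1.1))
v = (0.25, -1.5, 3.0, 0.75)
rotated = isoclinic_rotate(v, q_L, q_R)
# rotations preserve the Euclidean norm of the block
print(math.dist(rotated, (0, 0, 0, 0)), math.dist(v, (0, 0, 0, 0)))
```

IsoQuant-Fast corresponds to keeping only one isoclinic factor, i.e. fixing q_R (or q_L) to the identity quaternion (1, 0, 0, 0), halving the multiply count.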
On the Accuracy Limits of Sequential Recommender Systems: An Entropy-Based Approach
En Xu, Jingtao Ding, Yong Li
Core summary:

The paper studies how to estimate the intrinsic accuracy ceiling of sequential recommender systems on given data. The core idea is a training-free, entropy-based estimator that is agnostic to candidate-set size, quantifying the intrinsic predictability of the data to bound the theoretical limit of model performance.

Personalized recommendation rationale:

The paper proposes a method for assessing the intrinsic accuracy ceiling of sequential recommenders, a core-domain advance; it uses entropy theory to analyze predictability bounds and offers theoretical guidance for model development and data decisions.

2026-03-30 01:57:30 | arXiv:2603.27952v1 |
cs.IR
Sequential recommender systems have achieved steady gains in offline accuracy, yet it remains unclear how close current models are to the intrinsic accuracy limit imposed by the data. A reliable, model-agnostic estimate of this ceiling would enable principled difficulty assessment and headroom estimation before costly model development. Existing predictability analyses typically combine entropy estimation with Fano's inequality inversion; however, in recommendation they are hindered by sensitivity to candidate-space specification and distortion from Fano-based scaling in low-predictability regimes. We develop an entropy-induced, training-free approach for quantifying accuracy limits in sequential recommendation, yielding a candidate-size-agnostic estimate. Experiments on controlled synthetic generators and diverse real-world benchmarks show that the estimator tracks oracle-controlled difficulty more faithfully than baselines, remains insensitive to candidate-set size, and achieves high rank consistency with best-achieved offline accuracy across state-of-the-art sequential recommenders (Spearman rho up to 0.914). It also supports user-group diagnostics by stratifying users by novelty preference, long-tail exposure, and activity, revealing systematic predictability differences. Furthermore, predictability can guide training data selection: training sets constructed from high-predictability users yield strong downstream performance under reduced data budgets. Overall, the proposed estimator provides a practical reference for assessing attainable accuracy limits, supporting user-group diagnostics, and informing data-centric decisions in sequential recommendation.
Domain-Invariant Prompt Learning for Vision-Language Models
Arsham Gholamzadeh Khoee, Yinan Yu, Robert Feldt
Core summary:

The paper studies prompt learning for vision-language models under domain generalization. The core idea is an adversarial training mechanism that forces the model to learn domain-invariant prompt vectors, improving generalization to unseen data distributions.

Personalized recommendation rationale:

The paper proposes an adversarial training method for learning domain-invariant prompts, a frontier advance on Transformer efficiency and generalization, with direct inspiration for handling multi-domain data in recommendation/search systems.

2026-03-30 15:18:31 | arXiv:2603.28555v1 |
cs.CV cs.AI
Large pre-trained vision-language models like CLIP have transformed computer vision by aligning images and text in a shared feature space, enabling robust zero-shot transfer via prompting. Soft-prompting, such as Context Optimization (CoOp), effectively adapts these models for downstream recognition tasks by learning a set of context vectors. However, CoOp lacks explicit mechanisms for handling domain shifts across unseen distributions. To address this, we propose Domain-invariant Context Optimization (DiCoOp), an extension of CoOp optimized for domain generalization. By employing an adversarial training approach, DiCoOp forces the model to learn domain-invariant prompts while preserving discriminative power for classification. Experimental results show that DiCoOp consistently surpasses CoOp in domain generalization tasks across diverse visual domains.
Efficient Inference of Large Vision Language Models
Surendra Pathak
Core summary:

The paper systematically surveys how to resolve the inference-efficiency bottleneck of large vision-language models, caused by the large number of visual tokens produced from high-resolution inputs. Its core contribution is a four-dimensional taxonomy (visual token compression, memory management, efficient architecture design, and advanced decoding strategies) that organizes existing optimization techniques.

Personalized recommendation rationale:

The paper systematically surveys efficient inference techniques for large vision-language models. Although focused on the visual modality, its optimization framework (e.g., attention optimization, architecture design) offers direct analogies for handling heterogeneous data sequences in recommendation/search.

2026-03-30 02:23:37 | arXiv:2603.27960v1 |
cs.LG cs.CL cs.CV
Although Large Vision Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, their scalability and deployment are constrained by massive computational requirements. In particular, the massive amount of visual tokens from high-resolution input data aggravates the situation due to the quadratic complexity of attention mechanisms. To address these issues, the research community has developed several optimization frameworks. This paper presents a comprehensive survey of the current state-of-the-art techniques for accelerating LVLM inference. We introduce a systematic taxonomy that categorizes existing optimization frameworks into four primary dimensions: visual token compression, memory management and serving, efficient architectural design, and advanced decoding strategies. Furthermore, we critically examine the limitations of these current methodologies and identify critical open problems to inspire future research directions in efficient multimodal systems.
Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model
Athos Georgiou
Personalized recommendation rationale:

Although this paper involves vision-language models (VLMs) and unified multimodal modeling, its core focus is document retrieval and generation, which falls mainly under information retrieval and content generation. Its potential applications in recommendation/search/ads are limited, since document retrieval and generation are only weakly linked to the core ranking, personalization, and ad-serving tasks of those domains.

2026-03-30 15:17:41 | arXiv:2603.28554v1 |
cs.CV cs.AI cs.IR
Visual document understanding typically requires separate retrieval and generation models, doubling memory and system complexity. We present Hydra, a dual-head approach that provides both ColBERT-style late-interaction retrieval and autoregressive generation from a single vision-language model (VLM). A single LoRA adapter, trained only for retrieval, is toggled at inference: enabling it produces multi-vector embeddings; disabling it recovers the base model's generation quality -- byte-identical outputs in 100% of 10,500 greedy and stochastic samples, with max delta-ANLS = 0.0044 across 15,301 samples on four VQA benchmarks (three informative; ChartQA is near-zero for both models under greedy decoding) when compared against an independent base-model pipeline. We identify three engineering requirements (attention-mode restoration, lm_head preservation, KV-cache-aware decoding) whose omission silently breaks generation despite correct weight recovery. On ViDoRe V1, Hydra (4B) is within 1 percentage point of a controlled single-head baseline in a single training run, with higher aggregate scores on V2 and V3 that are concentrated on a subset of tasks; multi-seed experiments are needed to confirm these trends. The single-model design reduces peak GPU memory by 41%, though adapter switching introduces throughput overhead under concurrent serving loads. An ablation shows that GritLM-style joint training provides no benefit within the LoRA-based (r=16) training regime. A proof-of-concept extension to Qwen2.5-Omni-3B demonstrates that the mechanism generalizes to audio retrieval and video embedding, with speech generation.
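Hydra's retrieval head scores documents with ColBERT-style late interaction; a minimal MaxSim sketch follows, where the toy 2-D embeddings and unnormalized dot products are assumptions:

```python
# ColBERT-style late-interaction ("MaxSim") scoring: each query token keeps
# its own embedding and is matched against its best document token, so
# score(Q, D) = sum over query vectors of (max over doc vectors of dot product).

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vecs, doc_vecs):
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [(1.0, 0.0), (0.0, 1.0)]        # two query-token embeddings
doc_a = [(0.9, 0.1), (0.2, 0.8)]        # matches both query tokens well
doc_b = [(0.1, 0.1), (0.0, 0.2)]        # matches neither well
scores = {name: maxsim_score(query, d) for name, d in [("a", doc_a), ("b", doc_b)]}
print(scores)
```

In Hydra, the multi-vector embeddings feeding this score come from the base VLM with the retrieval LoRA adapter enabled; disabling the adapter recovers generation.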
GEAKG: Generative Executable Algorithm Knowledge Graphs
Camilo Chacón Sartori, José H. García, Andrei Voicu Tomut, Christian Blum
Personalized recommendation rationale:

The paper concerns knowledge graph generation, which may relate to knowledge representation for core recommendation or search, but the title does not make a direct connection to LLMs, Transformers, or recommendation/search/ads explicit. If it involves generative modeling of algorithmic knowledge graphs, it may have potential applications in knowledge-enhanced recommendation or query understanding for search, but the relevance is weak and uncertain.

2026-03-30 00:42:48 | arXiv:2603.27922v1 |
cs.AI cs.IR
In the context of algorithms for problem solving, procedural knowledge -- the know-how of algorithm design and operator composition -- remains implicit in code, lost between runs, and must be re-engineered for each new domain. Knowledge graphs (KGs) have proven effective for organizing declarative knowledge, yet current KG paradigms provide limited support for representing procedural knowledge as executable, learnable graph structures. We introduce \textit{Generative Executable Algorithm Knowledge Graphs} (GEAKG), a class of KGs whose nodes store executable operators, whose edges encode learned composition patterns, and whose traversal generates solutions. A GEAKG is \emph{generative} (topology and operators are synthesized by a Large Language Model), \emph{executable} (every node is runnable code), and \emph{transferable} (learned patterns generalize zero-shot across domains). The framework is domain-agnostic at the engine level: the same three-layer architecture and Ant Colony Optimization (ACO)-based learning engine can be instantiated across domains, parameterized by a pluggable ontology (\texttt{RoleSchema}). Two case studies -- sharing no domain-specific framework code -- provide concrete evidence for this framework hypothesis: (1)~Neural Architecture Search across 70 cross-dataset transfer pairs on two tabular benchmarks, and (2)~Combinatorial Optimization, where knowledge learned on the Traveling Salesman Problem transfers zero-shot to scheduling and assignment domains. Taken together, the results support that algorithmic expertise can be explicitly represented, learned, and transferred as executable knowledge graphs.
GraphWalker: Agentic Knowledge Graph Question Answering via Synthetic Trajectory Curriculum
Shuwen Xu, Yao Xu, Jiaxiang Liu, Chenhao Yuan, Wenshuo Peng, Jun Zhao, Kang Liu
Personalized recommendation rationale:

The paper focuses on agentic methods for knowledge graph question answering, a domain-specific NLP application. Although it involves graph-structured data and agent techniques, it lacks a direct link, or a stated potential application, to recommendation, search, or advertising.

2026-03-30 14:56:59 | arXiv:2603.28533v1 |
cs.CL
Agentic knowledge graph question answering (KGQA) requires an agent to iteratively interact with knowledge graphs (KGs), posing challenges in both training data scarcity and reasoning generalization. Specifically, existing approaches often restrict agent exploration: prompting-based methods lack autonomous navigation training, while current training pipelines usually confine reasoning to predefined trajectories. To this end, this paper proposes \textit{GraphWalker}, a novel agentic KGQA framework that addresses these challenges through \textit{Automated Trajectory Synthesis} and \textit{Stage-wise Fine-tuning}. GraphWalker adopts a two-stage SFT training paradigm: First, the agent is trained on structurally diverse trajectories synthesized from constrained random-walk paths, establishing a broad exploration prior over the KG; Second, the agent is further fine-tuned on a small set of expert trajectories to develop reflection and error recovery capabilities. Extensive experiments demonstrate that our stage-wise SFT paradigm unlocks a higher performance ceiling for a lightweight reinforcement learning (RL) stage, enabling GraphWalker to achieve state-of-the-art performance on CWQ and WebQSP. Additional results on GrailQA and our constructed GraphWalkerBench confirm that GraphWalker enhances generalization to out-of-distribution reasoning paths. The code is publicly available at https://github.com/XuShuwenn/GraphWalker
Entropic Claim Resolution: Uncertainty-Driven Evidence Selection for RAG
Davide Di Gioia
Personalized recommendation rationale:

The paper focuses on uncertainty-driven evidence selection in retrieval-augmented generation (RAG), which falls under applied LLM techniques, but it does not explicitly address applications in recommendation, search, or advertising. Although RAG may indirectly support knowledge enhancement in those domains, the title shows no concrete connection to RecSys/Search/Ads, so relevance is low.

2026-03-30 13:49:03 | arXiv:2603.28444v1 |
cs.AI cs.CL
Current Retrieval-Augmented Generation (RAG) systems predominantly rely on relevance-based dense retrieval, sequentially fetching documents to maximize semantic similarity with the query. However, in knowledge-intensive and real-world scenarios characterized by conflicting evidence or fundamental query ambiguity, relevance alone is insufficient for resolving epistemic uncertainty. We introduce Entropic Claim Resolution (ECR), a novel inference-time algorithm that reframes RAG reasoning as entropy minimization over competing semantic answer hypotheses. Unlike action-driven agentic frameworks (e.g., ReAct) or fixed-pipeline RAG architectures, ECR sequentially selects atomic evidence claims by maximizing Expected Entropy Reduction (EER), a decision-theoretic criterion for the value of information. The process dynamically terminates when the system reaches a mathematically defined state of epistemic sufficiency (H <= epsilon, subject to epistemic coherence). We integrate ECR into a production-grade multi-strategy retrieval pipeline (CSGR++) and analyze its theoretical properties. Our framework provides a rigorous foundation for uncertainty-aware evidence selection, shifting the paradigm from retrieving what is most relevant to retrieving what is most discriminative.
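The Expected Entropy Reduction criterion can be sketched for a two-hypothesis toy case; the likelihood tables and claim names below are invented for illustration and are not taken from the paper's pipeline:

```python
import math

# Sketch of EER claim selection: given a prior over competing answer
# hypotheses and per-claim likelihoods, pick the claim whose observation is
# expected to shrink posterior entropy the most; stop when H <= epsilon.

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def posterior(prior, likelihood, outcome):
    joint = {h: prior[h] * (likelihood[h] if outcome else 1 - likelihood[h])
             for h in prior}
    z = sum(joint.values())
    return {h: p / z for h, p in joint.items()}

def eer(prior, likelihood):
    """likelihood[h] = P(claim is observed true | hypothesis h)."""
    expected = 0.0
    for outcome in (True, False):
        p_outcome = sum(prior[h] * (likelihood[h] if outcome else 1 - likelihood[h])
                        for h in prior)
        if p_outcome > 0:
            expected += p_outcome * entropy(posterior(prior, likelihood, outcome))
    return entropy(prior) - expected   # expected drop in uncertainty

prior = {"A": 0.5, "B": 0.5}
claims = {"discriminative": {"A": 0.9, "B": 0.1},
          "uninformative":  {"A": 0.5, "B": 0.5}}
ranked = sorted(claims, key=lambda c: eer(prior, claims[c]), reverse=True)
print(ranked)
```

This is the sense in which ECR retrieves what is most discriminative rather than what is most relevant: the uninformative claim has zero expected entropy reduction regardless of how semantically similar it is to the query.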
Merge and Conquer: Instructing Multilingual Models by Adding Target Language Weights
Eneko Valero, Maria Ribalta i Albado, Oscar Sainz, Naiara Perez, German Rigau
Personalized recommendation rationale:

The paper concerns multilingual model techniques, within the scope of core LLM technology, and could indirectly apply to multilingual scenarios in search or recommendation by improving multilingual capability. However, the title states no direct application to recommendation, search, or advertising, so relevance is limited.

2026-03-30 10:46:50 | arXiv:2603.28263v1 |
cs.CL cs.AI
Large Language Models (LLMs) remain heavily centered on English, with limited performance in low-resource languages. Existing adaptation approaches, such as continual pre-training, demand significant computational resources. In the case of instructed models, high-quality instruction data is also required, both of which are often inaccessible for low-resource language communities. Under these constraints, model merging offers a lightweight alternative, but its potential in low-resource contexts has not been systematically explored. In this work, we explore whether it is possible to transfer language knowledge to an instruction-tuned LLM by merging it with a language-specific base model, thereby eliminating the need of language-specific instructions and repeated fine-tuning processes whenever stronger instructed variants become available. Through experiments covering four Iberian languages (Basque, Catalan, Galician, and Spanish) and two model families, we show that merging enables effective instruction following behavior in new languages and even supports multilingual capability through the combination of multiple language-specific models. Our results indicate that model merging is a viable and efficient alternative to traditional adaptation methods for low-resource languages, achieving competitive performance while greatly reducing computational cost.
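One lightweight merging recipe in the spirit of the abstract is task arithmetic: extract the instruction-tuning "delta" from the instructed model and add it, scaled, to a language-specific base model with the same architecture. Whether the paper uses exactly this formula is not stated here, so treat the sketch, including the `merge` helper and toy weights, as an assumption:

```python
# Task-arithmetic-style merge (illustration only): transfer instruction
# following to a language-adapted base without language-specific instructions.
#   merged = lang_base + alpha * (instructed - base)

def merge(base, instructed, lang_base, alpha=1.0):
    """All arguments: dict name -> list of weights with matching shapes."""
    return {name: [lb + alpha * (it - b)
                   for lb, it, b in zip(lang_base[name], instructed[name], base[name])]
            for name in base}

base       = {"w": [1.0, 2.0], "b": [0.0]}
instructed = {"w": [1.5, 2.5], "b": [0.1]}   # base + instruction tuning
lang_base  = {"w": [0.8, 2.2], "b": [0.0]}   # base adapted to a new language
merged = merge(base, instructed, lang_base)
print(merged)
```

The appeal of such recipes, as the abstract notes, is that no repeated fine-tuning is needed when a stronger instructed variant of the same base becomes available.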
Categorical Perception in Large Language Model Hidden States: Structural Warping at Digit-Count Boundaries
Jon-Paul Cacioli
Personalized recommendation rationale:

The paper studies categorical perception in LLM hidden states, a foundational advance in core LLM technology that may reveal properties of internal model representations. While such foundational understanding might indirectly inform the modeling of user behavior sequences or item features in recommendation/search, the title states no direct application to these domains, so relevance is limited.

2026-03-30 10:34:58 | arXiv:2603.28258v1 |
cs.CL cs.AI
Categorical perception (CP) -- enhanced discriminability at category boundaries -- is among the most studied phenomena in perceptual psychology. This paper reports that analogous geometric warping occurs in the hidden-state representations of large language models (LLMs) processing Arabic numerals. Using representational similarity analysis across six models from five architecture families, the study finds that a CP-additive model (log-distance plus a boundary boost) fits the representational geometry better than a purely continuous model at 100% of primary layers in every model tested. The effect is specific to structurally defined boundaries (digit-count transitions at 10 and 100), absent at non-boundary control positions, and absent in the temperature domain where linguistic categories (hot/cold) lack a tokenisation discontinuity. Two qualitatively distinct signatures emerge: "classic CP" (Gemma, Qwen), where models both categorise explicitly and show geometric warping, and "structural CP" (Llama, Mistral, Phi), where geometry warps at the boundary but models cannot report the category distinction. This dissociation is stable across boundaries and is a property of the architecture, not the stimulus. Structural input-format discontinuities are sufficient to produce categorical perception geometry in LLMs, independently of explicit semantic category knowledge.
AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding
Haozhe Qi, Kevin Qu, Mahdi Rad, Rui Wang, Alexander Mathis, Marc Pollefeys
Personalized recommendation rationale:

The paper focuses on long-video understanding with multimodal LLMs, at the intersection of vision and language. Although the title mentions adaptive token selection, an efficiency technique, the core application scenario (long-video understanding) is only weakly related to recommendation, search, or advertising and leans toward pure vision-language modeling.

2026-03-30 17:14:15 | arXiv:2603.28696v1 |
cs.CV cs.AI
Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token
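The entropy-as-relevance idea can be sketched as follows; the direction of the mapping (low probe entropy treated as high relevance), the proportional allocation rule, and the stopping threshold are illustrative assumptions rather than the paper's exact formulas:

```python
import math

# Sketch of entropy-driven budget allocation and early stopping in the spirit
# of AdaptToken: groups whose probe responses have low entropy receive more of
# the global token budget, and processing stops once certainty is sufficient.

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def allocate_budget(group_entropies, total_budget, max_entropy):
    # weight each group by how informative (low-entropy) its probe was
    weights = [max_entropy - h for h in group_entropies]
    z = sum(weights) or 1.0
    return [int(total_budget * w / z) for w in weights]

def early_stop(running_entropy, epsilon=0.3):
    return running_entropy <= epsilon

# three video groups; probe distributions over a toy answer vocabulary
probes = [[0.97, 0.01, 0.01, 0.01],    # very certain -> informative group
          [0.4, 0.3, 0.2, 0.1],
          [0.25, 0.25, 0.25, 0.25]]    # uniform -> uninformative group
ents = [entropy(p) for p in probes]
budgets = allocate_budget(ents, total_budget=1000, max_entropy=2.0)
print(budgets, early_stop(ents[0]))
```

AdaptToken-Lite corresponds to the `early_stop` branch: remaining groups are skipped entirely once the model's answer distribution is sufficiently peaked.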
ConceptWeaver: Weaving Disentangled Concepts with Flow
Jintao Chen, Aiming Hao, Xiaoqing Chen, Chengyu Bai, Chubin Chen, Yanxun Li, Jia...
Personalized recommendation rationale:

The title involves concept disentanglement and flow models, foundational machine-learning techniques that might fall under "enabling LLM tech" or "enabling Transformer tech". However, the title is too abstract and does not specify any concrete application to recommendation, search, or advertising, so relevance is low.

2026-03-30 14:28:07 | arXiv:2603.28493v1 |
cs.CV
Pre-trained flow-based models excel at synthesizing complex scenes yet lack a direct mechanism for disentangling and customizing their underlying concepts from one-shot real-world sources. To demystify this process, we first introduce a novel differential probing technique to isolate and analyze the influence of individual concept tokens on the velocity field over time. This investigation yields a critical insight: the generative process is not monolithic but unfolds in three distinct stages. An initial \textbf{Blueprint Stage} establishes low-frequency structure, followed by a pivotal \textbf{Instantiation Stage} where content concepts emerge with peak intensity and become naturally disentangled, creating an optimal window for manipulation. A final concept-insensitive refinement stage then synthesizes fine-grained details. Guided by this discovery, we propose \textbf{ConceptWeaver}, a framework for one-shot concept disentanglement. ConceptWeaver learns concept-specific semantic offsets from a single reference image using a stage-aware optimization strategy that aligns with the three-stage framework. These learned offsets are then deployed during inference via our novel ConceptWeaver Guidance (CWG) mechanism, which strategically injects them at the appropriate generative stage. Extensive experiments validate that ConceptWeaver enables high-fidelity, compositional synthesis and editing, demonstrating that understanding and leveraging the intrinsic, staged nature of flow models is key to unlocking precise, multi-granularity content manipulation.
Explaining CLIP Zero-shot Predictions Through Concepts
Onat Ozdemir, Anders Christensen, Stephan Alaniz, Zeynep Akata, Emre Akbas
Personalized recommendation rationale:

The paper focuses on the interpretability of the CLIP model, at the intersection of computer vision and NLP and thus related to VLMs. Although VLM techniques can be analogized to heterogeneous data processing, this work targets interpretability rather than the modeling techniques themselves, so its direct application potential for recommendation, search, or advertising is limited.

2026-03-30 09:31:33 | arXiv:2603.28211v1 |
cs.CV
Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and lack the ability to generalize to unseen classes. We introduce EZPC that bridges these two paradigms by explaining CLIP's zero-shot predictions through human-understandable concepts. Our method projects CLIP's joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. The model learns this projection via a combination of alignment and reconstruction objectives, ensuring that concept activations preserve CLIP's semantic structure while remaining interpretable. Extensive experiments on five benchmark datasets, CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k, demonstrate that our approach maintains CLIP's strong zero-shot classification accuracy while providing meaningful concept-level explanations. By grounding open-vocabulary predictions in explicit semantic concepts, our method offers a principled step toward interpretable and trustworthy vision-language models. Code is available at https://github.com/oonat/ezpc.
Efficient Domain Adaptation for Text Line Recognition via Decoupled Language Models
Arundhathi Dev, Justin Zhan
Personalized recommendation rationale:

The paper focuses on domain adaptation for text recognition, at the intersection of computer vision and OCR. Although it mentions language models, its application scenario (text line recognition) has no direct connection to the core tasks of recommendation, search, or advertising (ranking, matching, user modeling). The decoupled language-model technique may be suggestive for multimodal data handling, but the title gives no indication of potential applications in RecSys/Search/Ads.

2026-03-30 04:39:26 | arXiv:2603.28028v1 |
cs.CV cs.LG
Optical character recognition remains critical infrastructure for document digitization, yet state-of-the-art performance is often restricted to well-resourced institutions by prohibitive computational barriers. End-to-end transformer architectures achieve strong accuracy but demand hundreds of GPU hours for domain adaptation, limiting accessibility for practitioners and digital humanities scholars. We present a modular detection-and-correction framework that achieves near-SOTA accuracy with single-GPU training. Our approach decouples lightweight visual character detection (domain-agnostic) from domain-specific linguistic correction using pretrained sequence models including T5, ByT5, and BART. By training the correctors entirely on synthetic noise, we enable annotation-free domain adaptation without requiring labeled target images. Evaluating across modern clean handwriting, cursive script, and historical documents, we identify a critical "Pareto frontier" in architecture selection: T5-Base excels on modern text with standard vocabulary, whereas ByT5-Base dominates on historical documents by reconstructing archaic spellings at the byte level. Our results demonstrate that this decoupled paradigm matches end-to-end transformer accuracy while reducing compute by approximately 95%, establishing a viable, resource-efficient alternative to monolithic OCR architectures.
CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments
Yi Yu, Guangquan Hu, Chenghuang Shen, Xingyan Liu, Jing Gu, Hangyi Sun, Junzhuo ...
Personalized recommendation rationale:

The paper concerns an evaluation benchmark for LLM agents in cloud-service environments; it belongs to the benchmark category rather than core recommendation/search/ads progress or enabling techniques. Although it involves LLM technology, the focus is evaluation rather than direct application or architectural innovation, so it is only weakly related to enabling LLM trends, Transformer architecture advances, or direct LLM applications.

2026-03-30 15:26:00 | arXiv:2603.28569v1 |
cs.LG cs.AI cs.IR cs.PF
The increasing agentic capabilities of Large Language Models (LLMs) have enabled their deployment in real-world applications, such as cloud services, where customer-assistant interactions exhibit high technical complexity and long-horizon dependencies, making robustness and resolution efficiency critical for customer satisfaction. However, existing benchmarks for LLM-based agents largely rely on synthetic environments that fail to capture the diversity and unpredictability of authentic customer inputs, often ignoring the resolution efficiency essential for real-world deployment. To bridge this gap, we introduce CirrusBench, a novel evaluation framework distinguished by its foundation in real-world data from authentic cloud service tickets. CirrusBench preserves the intricate multi-turn logical chains and realistic tool dependencies inherent to technical service environments. Moving beyond execution correctness, we introduce novel Customer-Centric metrics to define agent success, quantifying service quality through metrics such as the Normalized Efficiency Index and Multi-Turn Latency to explicitly measure resolution efficiency. Experiments utilizing our framework reveal that while state-of-the-art models demonstrate strong reasoning capabilities, they frequently struggle in complex, realistic multi-turn tasks and fail to meet the high-efficiency standards required for customer service, highlighting critical directions for the future development of LLM-based agents in practical technical service applications. CirrusBench evaluation framework is released at: https://github.com/CirrusAI
Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models
Luigi Curini, Alfio Ferrara, Giovanni Pagano, Sergio Picascia
Personalized recommendation rationale:

Although this paper involves vision-language models (VLMs), its application scenario (transcribing Italian parliamentary speeches) has no direct connection to the fields of recommendation systems, search, or advertising. Its core is a domain-specific transcription task, not an exploration of unified modeling of heterogeneous data or the application potential of VLMs in recommendation/search/ads.

2026-03-30 07:06:49 | arXiv:2603.28103v1 |
cs.DL cs.AI cs.IR
Full abstract
Parliamentary proceedings represent a rich yet challenging resource for computational analysis, particularly when preserved only as scanned historical documents. Existing efforts to transcribe Italian parliamentary speeches have relied on traditional Optical Character Recognition pipelines, resulting in transcription errors and limited semantic annotation. In this paper, we propose a pipeline based on Vision-Language Models for the automatic transcription, semantic segmentation, and entity linking of Italian parliamentary speeches. The pipeline employs a specialised OCR model to extract text while preserving reading order, followed by a large-scale Vision-Language Model that performs transcription refinement, element classification, and speaker identification by jointly reasoning over visual layout and textual content. Extracted speakers are then linked to the Chamber of Deputies knowledge base through SPARQL queries and a multi-strategy fuzzy matching procedure. Evaluation against an established benchmark demonstrates substantial improvements both in transcription quality and speaker tagging.
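The entity-linking step described above (SPARQL lookup plus multi-strategy fuzzy matching) can be illustrated with a minimal sketch. Here `difflib` stands in for the paper's actual matching procedure, and the names and cutoff value are hypothetical, not taken from the paper:

```python
import difflib

def link_speaker(name, kb_names, cutoff=0.75):
    """Link an extracted speaker name to a knowledge-base entry:
    try an exact (case-insensitive) match first, then fall back to
    fuzzy matching above a similarity cutoff."""
    exact = [k for k in kb_names if k.lower() == name.lower()]
    if exact:
        return exact[0]
    close = difflib.get_close_matches(name, kb_names, n=1, cutoff=cutoff)
    return close[0] if close else None
```

A multi-strategy pipeline like the one in the paper would layer further heuristics (initials, surname-only matches) before giving up; the exact-then-fuzzy cascade above shows only the basic shape.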
Adaptive Block-Scaled Data Types
Jack Cook, Hyemin S. Lee, Kathryn Le, Junxian Guo, Giovanni Traverso, Anantha P....
Personalized recommendation rationale:

The title concerns the optimization of data types, likely a system-level efficiency improvement, but it is not explicitly tied to Transformer architectures, LLM techniques, or recommendation/search/advertising systems. Although efficient data types may indirectly support large-scale model training, the lack of an explicit connection makes the relevance low.

2026-03-30 17:59:33 | arXiv:2603.28765v1 |
cs.CL
Full abstract
NVFP4 has grown increasingly popular as a 4-bit format for quantizing large language models due to its hardware support and its ability to retain useful information with relatively few bits per parameter. However, the format is not without limitations: recent work has shown that NVFP4 suffers from its error distribution, resulting in large amounts of quantization error on near-maximal values in each group of 16 values. In this work, we leverage this insight to design new Adaptive Block-Scaled Data Types that can adapt to the distribution of their input values. For four-bit quantization, our proposed IF4 (Int/Float 4) data type selects between FP4 and INT4 representations for each group of 16 values, which are then scaled by an E4M3 scale factor as is done with NVFP4. The selected data type is denoted using the scale factor's sign bit, which is currently unused in NVFP4, and we apply the same insight to design formats for other bit-widths, including IF3 and IF6. When used to quantize language models, we find that IF4 outperforms existing 4-bit block-scaled formats, achieving lower loss during quantized training and achieving higher accuracy on many tasks in post-training quantization. We additionally design and evaluate an IF4 Multiply-Accumulate (MAC) unit to demonstrate that IF4 can be implemented efficiently in next-generation hardware accelerators. Our code is available at https://github.com/mit-han-lab/fouroversix.
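The per-group adaptive selection can be sketched as follows. This is a hypothetical illustration, not the released fouroversix code: each group is quantized with both an FP4-style (E2M1) grid and an INT4 grid under a shared scale, and the lower-error representation wins. The E2M1 value set and the squared-error criterion are assumptions, and packing the choice into the E4M3 scale factor's sign bit is not modeled:

```python
# Hypothetical sketch of IF4's per-group choice between FP4 and INT4.
FP4_GRID = sorted({s * v for v in (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
                   for s in (1.0, -1.0)})
INT4_GRID = [float(i) for i in range(-8, 8)]

def quantize_to_grid(values, grid, scale):
    """Snap each value to the nearest grid point under the given scale."""
    return [scale * min(grid, key=lambda g: abs(v / scale - g)) for v in values]

def if4_quantize_group(group):
    """Quantize one group both ways and keep the lower-error data type."""
    amax = max(abs(v) for v in group) or 1.0
    fp4 = quantize_to_grid(group, FP4_GRID, amax / 6.0)    # 6 = max |E2M1|
    int4 = quantize_to_grid(group, INT4_GRID, amax / 7.0)  # 7 = max INT4 magnitude
    sqerr = lambda q: sum((a - b) ** 2 for a, b in zip(group, q))
    return ("fp4", fp4) if sqerr(fp4) <= sqerr(int4) else ("int4", int4)
```

The FP4 branch tends to win on groups with a few near-maximal outliers among small values (its grid is denser near zero), while a uniform INT4 grid suits evenly spread magnitudes; the adaptive choice takes whichever fits each group.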
ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining
Anuj Diwan, Eunsol Choi, David Harwath
Personalized recommendation rationale:

This paper concerns speech-text multimodal pretraining, an audio-processing topic with weak ties to the core techniques of recommendation, search, or advertising. Although dual-encoder architectures have applications in cross-modal retrieval, the title clearly focuses on speech and stylistic language, with no explicit RecSys/Search/Ads application scenario or technology-transfer path.

2026-03-30 17:50:07 | arXiv:2603.28737v1 |
eess.AS cs.AI cs.CL cs.SD
Full abstract
We introduce ParaSpeechCLAP, a dual-encoder contrastive model that maps speech and text style captions into a common embedding space, supporting a wide range of intrinsic (speaker-level) and situational (utterance-level) descriptors (such as pitch, texture and emotion) far beyond the narrow set handled by existing models. We train specialized ParaSpeechCLAP-Intrinsic and ParaSpeechCLAP-Situational models alongside a unified ParaSpeechCLAP-Combined model, finding that specialization yields stronger performance on individual style dimensions while the unified model excels on compositional evaluation. We further show that ParaSpeechCLAP-Intrinsic benefits from an additional classification loss and class-balanced training. We demonstrate our models' performance on style caption retrieval, speech attribute classification and as an inference-time reward model that improves style-prompted TTS without additional training. ParaSpeechCLAP outperforms baselines on most metrics across all three applications. Our models and code are released at https://github.com/ajd12342/paraspeechclap .
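The dual-encoder contrastive objective behind CLAP-style models such as this one can be sketched as a symmetric InfoNCE loss over cosine similarities. The stand-in embeddings and temperature value below are illustrative, not the paper's configuration:

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def contrastive_loss(speech_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the speech-text cosine-similarity
    matrix; matched pairs sit on the diagonal."""
    s = [l2_normalize(v) for v in speech_embs]
    t = [l2_normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(si, tj)) / temperature
               for tj in t] for si in s]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]  # -log softmax at the diagonal target
        return total / len(rows)

    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))
```

Training pulls each speech clip toward its own style caption and away from every other caption in the batch (and symmetrically for captions), which is what places both modalities in a common embedding space.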
SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, Ondr...
Personalized recommendation rationale:

This paper concerns reinforcement learning for robots and has no direct connection to the core areas of recommendation, search, or advertising. Although video-language reasoning is mentioned, the application scenario is robot control rather than heterogeneous-data modeling, matching neither the VLM analogy nor any current focus area.

2026-03-30 17:46:31 | arXiv:2603.28730v1 |
cs.RO cs.CL cs.CV
Full abstract
Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today's strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE-R1, we develop a large-scale video trajectory and reasoning synthesis pipeline that generates temporally grounded CoT traces aligned with continuous progress supervision. This data is combined with foundational spatial and multi-frame temporal reasoning, and used to train the model with a hybrid framework that couples supervised fine-tuning with RL from verifiable rewards. Across four different simulation environments and a real-robot setting, SOLE-R1 enables zero-shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. SOLE-R1 succeeds on 24 unseen tasks and substantially outperforms strong vision-language rewarders, including GPT-5 and Gemini-3-Pro, while exhibiting markedly greater robustness to reward hacking.
Moving Beyond Review: Applying Language Models to Planning and Translation in Reflection
Seyed Parsa Neshaei, Richard Lee Davis, Tanja Käser
Personalized recommendation rationale:

The title concerns applying language models to planning and translation in reflection scenarios, a direct LLM application to a specific cognitive task. Although language-model techniques are involved, the "reflection" scenario has no direct connection to the core areas of recommendation, search, or advertising (user-behavior modeling, content matching, ranking optimization, etc.), and no clear path to applying it there is shown. The "planning" and "translation" tasks in the title lean toward general NLP or cognitive-science applications rather than the specific needs of RecSys/Search/Ads.

2026-03-30 15:42:38 | arXiv:2603.28596v1 |
cs.HC cs.AI cs.CL
Full abstract
Reflective writing is known to support the development of students' metacognitive skills, yet learners often struggle to engage in deep reflection, limiting learning gains. Although large language models (LLMs) have been shown to improve writing skills, their use as conversational agents for reflective writing has produced mixed results and has largely focused on providing feedback on reflective texts, rather than support during planning and organizing. In this paper, inspired by the Cognitive Process Theory of writing (CPT), we propose the first application of LLMs to the planning and translation steps of reflective writing. We introduce Pensée, a tool to explore the effects of explicit AI support during these stages by scaffolding structured reflection planning using a conversational agent, and supporting translation by automatically extracting key concepts. We evaluate Pensée in a controlled between-subjects experiment (N=93), manipulating AI support across writing phases. Results show significantly greater reflection depth and structural quality when learners receive support during planning and translation stages of CPT, though these effects reduce in a delayed post-test. Analyses of learner behavior and perceptions further illustrate how CPT-aligned conversational support shapes reflection processes and learner experience, contributing empirical evidence for theory-driven uses of LLMs in AI-supported reflective writing.
Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification
Masnun Nuha Chowdhury, Nusrat Jahan Beg, Umme Hunny Khan, Syed Rifat Raiyan, Md ...
Personalized recommendation rationale:

This paper centers on a multi-agent debate architecture for verifying controversial claims, a general-purpose LLM application rather than something specific to recommendation/search/advertising. Although RAG is involved, the focus is the debate framework and role-switching mechanism, with no clear demonstration of how these techniques apply to core problems in recommendation, search, or advertising (ranking, matching, personalization, etc.).

2026-03-30 14:23:15 | arXiv:2603.28488v1 |
cs.CL cs.AI cs.MA
Full abstract
Large language models (LLMs) remain unreliable for high-stakes claim verification due to hallucinations and shallow reasoning. While retrieval-augmented generation (RAG) and multi-agent debate (MAD) address this, they are limited by one-pass retrieval and unstructured debate dynamics. We propose a courtroom-style multi-agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation. Our approach integrates specialized roles (e.g., Plaintiff, Defense, Judge) with Progressive RAG (P-RAG) to dynamically expand and refine the evidence pool during the debate. Furthermore, we employ evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation to enforce calibration, robustness, and diversity. In zero-shot evaluations on the Check-COVID benchmark, PROClaim achieves 81.7% accuracy, outperforming standard multi-agent debate by 10.0 percentage points, with P-RAG driving the primary performance gains (+7.5 pp). We ultimately demonstrate that structural deliberation and model heterogeneity effectively mitigate systematic biases, providing a robust foundation for reliable claim verification. Our code and data are publicly available at https://github.com/mnc13/PROClaim.
MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin, Yao Xiao, Yibo Wang, Lei ...
Personalized recommendation rationale:

The title concerns benchmarking multimodal deep-research agents, focusing mainly on evaluation frameworks rather than core technical advances. Although multimodal, it has no explicit connection to heterogeneous-data processing in recommendation, search, or advertising, and benchmarking falls under evaluation, which is only weakly related to your interests in architectural innovation, LLM applications, or Transformer technology advances.

2026-03-30 13:16:03 | arXiv:2603.28407v1 |
cs.AI cs.CL
Full abstract
Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments, and process-centric evaluation that audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking the highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai, Zixian Huang, Sheng Yuan, Qi...
Personalized recommendation rationale:

The title concerns evolutionary kernel optimization; per the abstract, these are GPU compute kernels rather than kernel methods, making this general performance-engineering work. The title does not point to recommendation, search, or advertising, nor to Transformer architectures or the VLM analogy, and although the framework uses LLMs to generate kernels, it leans toward general-purpose optimization tooling, so direct relevance to the current core areas is weak.

2026-03-30 12:12:49 | arXiv:2603.28342v1 |
cs.CL cs.LG
Full abstract
We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and Maca on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with Nvidia Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
The Necessity of Setting Temperature in LLM-as-a-Judge
Lujun Li, Lama Sleem, Yangjie Xu, Yewei Song, Aolin Jia, Jerome Francois, Radu S...
Personalized recommendation rationale:

This paper discusses temperature settings when using an LLM as a judge, which falls under LLM evaluation and benchmarking and is unrelated to your focus areas of core-domain advances, enabling technology, or direct applications. Although the temperature parameter may affect output diversity when LLMs are used in recommendation/search, the title clearly centers on the "judge" role, which is closer to NLP evaluation-benchmark topics than to concrete technical applications in RecSys/Search/Ads.

2026-03-30 11:31:29 | arXiv:2603.28304v1 |
cs.CL
Full abstract
LLM-as-a-Judge has emerged as an effective and low-cost paradigm for evaluating text quality and factual correctness. Prior studies have shown substantial agreement between LLM judges and human experts, even on tasks that are difficult to assess automatically. In practice, researchers commonly employ fixed temperature configurations during the evaluation process, with values of 0.1 and 1.0 being the most prevalent choices, a convention that is largely empirical rather than principled. However, recent research suggests that LLM performance exhibits non-trivial sensitivity to temperature settings, that lower temperatures do not universally yield optimal outcomes, and that such effects are highly task-dependent. This raises a critical research question: does temperature influence judge performance in LLM-centric evaluation? To address this, we systematically investigate the relationship between temperature and judge performance through a series of controlled experiments, and further adopt a causal inference framework within our empirical statistical analysis to rigorously examine the direct causal effect of temperature on judge behavior, offering actionable engineering insights for the design of LLM-centric evaluation pipelines.
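The sensitivity at issue is easy to see in a toy softmax over two verdict tokens. The logits below are made up purely to show how T=0.1 versus T=1.0 changes a judge's sampling distribution:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax: low T sharpens toward the argmax,
    high T flattens toward uniform."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

verdict_logits = [2.0, 1.0]  # hypothetical preference for verdict "correct"
cold = softmax_with_temperature(verdict_logits, 0.1)
warm = softmax_with_temperature(verdict_logits, 1.0)
```

At T=0.1 the judge is near-deterministic (about 0.99995 on the preferred verdict), while at T=1.0 the same logits leave roughly 27% probability on the other verdict, so repeated judging runs can disagree.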
Beyond Cosine Similarity: Zero-Initialized Residual Complex Projection for Aspect-Based Sentiment Analysis
Yijin Wang, Fandi Sun
Personalized recommendation rationale:

This paper focuses on aspect-based sentiment analysis, a specific NLP task with no direct connection to core advances in recommendation, search, or advertising. Although it proposes a new similarity-computation method, its application is confined to sentiment analysis, with no clear demonstration of potential value in recommendation/search/advertising systems.

2026-03-30 09:23:04 | arXiv:2603.28205v1 |
cs.CL
Full abstract
Aspect-Based Sentiment Analysis (ABSA) is fundamentally challenged by representation entanglement, where aspect semantics and sentiment polarities are often conflated in real-valued embedding spaces. Furthermore, standard contrastive learning suffers from false-negative collisions, severely degrading performance on high-frequency aspects. In this paper, we propose a novel framework featuring a Zero-Initialized Residual Complex Projection (ZRCP) and an Anti-collision Masked Angle Loss, inspired by quantum projection and entanglement ideas. Our approach projects textual features into a complex semantic space, systematically utilizing the phase to disentangle sentiment polarities while allowing the amplitude to encode the semantic intensity and lexical richness of subjective descriptions. To tackle the collision bottleneck, we introduce an anti-collision mask that elegantly preserves intra-polarity aspect cohesion while expanding the inter-polarity discriminative margin by over 50%. Experimental results demonstrate that our framework achieves a state-of-the-art Macro-F1 score of 0.8851. Deep geometric analyses further reveal that explicitly penalizing the complex amplitude catastrophically over-regularizes subjective representations, proving that our unconstrained-amplitude and phase-driven objective is crucial for robust, fine-grained sentiment disentanglement.
From Reviews to Requirements: Can LLMs Generate Human-Like User Stories?
Shadman Sakib, Oishy Fatema Akhand, Tasnia Tasneem, Shohel Ahmed
Personalized recommendation rationale:

This paper studies LLMs' ability to generate requirements documents from user reviews, a content-generation topic with low direct relevance to your focus areas of core recommendation/search/advertising advances, foundational LLM technology, Transformer architecture improvements, or unified modeling of heterogeneous data. Although LLMs are involved, the paper centers on requirements engineering rather than core tasks such as ranking, retrieval, or user-behavior modeling in recommendation/search/advertising.

2026-03-30 08:31:18 | arXiv:2603.28163v1 |
cs.CL
Full abstract
App store reviews provide a constant flow of real user feedback that can help improve software requirements. However, these reviews are often messy, informal, and difficult to analyze manually at scale. Although automated techniques exist, many do not perform well when replicated and often fail to produce clean, backlog-ready user stories for agile projects. In this study, we evaluate how well large language models (LLMs) such as GPT-3.5 Turbo, Gemini 2.0 Flash, and Mistral 7B Instruct can generate usable user stories directly from raw app reviews. Using the Mini-BAR dataset of 1,000+ health app reviews, we tested zero-shot, one-shot, and two-shot prompting methods. We evaluated the generated user stories using both human judgment (via the RUST framework) and a RoBERTa classifier fine-tuned on UStAI to assess their overall quality. Our results show that LLMs can match or even outperform humans in writing fluent, well-formatted user stories, especially when few-shot prompts are used. However, they still struggle to produce independent and unique user stories, which are essential for building a strong agile backlog. Overall, our findings show how LLMs can reliably turn unstructured app reviews into actionable software requirements, providing developers with clear guidance to turn user feedback into meaningful improvements.
Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation
Xinran Zhang
Personalized recommendation rationale:

This paper focuses on LLM evaluation methodology (specifically atomic decomposition in question-answering evaluation), a purely LLM-evaluation-and-benchmarking topic. Although LLM techniques are involved, the core is evaluation methodology rather than application potential of LLMs in recommendation/search/advertising, with very low relevance to your interests in direct applications, architectural innovation, or cross-modal modeling.

2026-03-30 03:55:26 | arXiv:2603.28005v1 |
cs.CL
Full abstract
Atomic decomposition -- breaking a candidate answer into claims before verifying each against a reference -- is a widely adopted design for LLM-based reference-grounded judges. However, atomic prompts are typically richer and longer, making it unclear whether any advantage comes from decomposition or from richer prompting. We study this for benchmark-style completeness-sensitive reference-support classification: classifying a candidate as fully supported, partially supported, or unsupported relative to a supplied reference. We compare a self-decomposing atomic judge (single-prompt decompose-and-verify) against a prompt-controlled holistic judge with the same inputs and a similarly detailed rubric. On 200 source examples per dataset across TruthfulQA, ASQA, and QAMPARI, with four model families, source-level paired tests, cluster bootstrap, and aggregation across three pre-frozen prompt variants per design family, we find the holistic judge matches or exceeds the atomic judge on two of three benchmarks: ASQA and QAMPARI favor holistic across all four families (statistically reliable in three of four), while TruthfulQA shows a small atomic edge. The holistic advantage is concentrated in partially_supported cases -- incompleteness detection. A sensitivity check against human annotations confirms the ranking under both benchmark-completeness and human factual-correctness standards. Our finding is specific to the self-decomposing single-prompt pattern on three QA-style benchmarks with 200 source examples each; multi-stage atomic pipelines and non-QA tasks remain untested. Among perturbations examined, reference-quality degradation produced the largest accuracy drops for both judge families.
CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models
Kesheng Chen, Yamin Hu, Qi Zhou, Zhenqian Zhu, Wenjian Luo
Personalized recommendation rationale:

The title clearly targets a hallucination-evaluation benchmark for vision-language models (VLMs), which is pure benchmark research. Although VLMs are mentioned, the paper's focus is evaluation rather than modeling techniques, and it explicitly involves hallucination, an NLP-centric topic listed as irrelevant. The work shows no direct application potential for recommendation, search, or advertising.

2026-03-30 03:04:53 | arXiv:2603.27982v1 |
cs.CV cs.AI cs.CL
Full abstract
Vision-language models (VLMs) achieve strong performance on many benchmarks, yet a basic reliability question remains underexplored: when visual evidence conflicts with commonsense, do models follow what is shown or what commonsense suggests? A characteristic failure in this setting is that the model overrides visual evidence and outputs the commonsense alternative. We term this phenomenon commonsense-driven hallucination (CDH). To evaluate it, we introduce CDH-Bench, a benchmark designed to create explicit visual evidence-commonsense conflicts. CDH-Bench covers three dimensions: counting anomalies, relational anomalies, and attribute anomalies. We evaluate frontier VLMs under binary Question Answering (QA) and multiple-choice QA, and report metrics including Counterfactual Accuracy (CF-Acc), Commonsense Accuracy (CS-Acc), Counterfactual Accuracy Drop (CFAD), Commonsense Collapse Rate (CCR), and Relative Prior Dependency (RPD). Results show that even strong models remain vulnerable to prior-driven normalization under visual evidence-commonsense conflict. CDH-Bench provides a controlled diagnostic of visual fidelity under visual evidence-commonsense conflict.
On the Role of Encoder Depth: Pruning Whisper and LoRA Fine-Tuning in SLAM-ASR
Ganesh Pavan Kartikeya Bharadwaj Kolluri, Michael Kampouridis, Ravi Shekhar
Personalized recommendation rationale:

The title concerns model compression (pruning) and parameter-efficient fine-tuning (LoRA) in automatic speech recognition (ASR), a domain-specific application rather than a core recommendation/search/advertising area. Although model-compression techniques might indirectly apply to recommender systems, the paper clearly targets speech recognition (SLAM-ASR) and the Whisper model, with no explicit link to heterogeneous-data modeling, Transformer architecture innovation, or direct LLM applications in recommendation/search.

2026-03-30 03:02:22 | arXiv:2603.27981v1 |
cs.CL cs.SD
Full abstract
Automatic speech recognition (ASR) has advanced rapidly in recent years, driven by large-scale pretrained models and end-to-end architectures such as SLAM-ASR. A key component of SLAM-ASR systems is the Whisper speech encoder, which provides robust acoustic representations. While model pruning has been explored for the full Whisper encoder-decoder architecture, its impact within the SLAM-ASR setting remains under-investigated. In this work, we analyze the effects of layer pruning in the Whisper encoder when used as the acoustic backbone of SLAM-ASR. We further examine the extent to which LoRA-based fine-tuning can recover performance degradation caused by pruning. Experiments conducted across three Whisper variants (Small, Medium, Large-v2), three languages representing distinct resource levels (Danish, Dutch, English), and over 200 training runs demonstrate that pruning two encoder layers causes only 2-4% WER degradation, and that combining this pruning with LoRA adaptation consistently outperforms the unpruned baseline while reducing total parameters by 7-14%. Moreover, our error analysis reveals that LoRA primarily compensates through the language model's linguistic priors, reducing total word errors by 11-21% for Dutch and English, with substitutions and deletions showing the largest reductions. However, for low-resource Danish, the reduction is smaller (4-7%), and LoRA introduces increased insertion errors, indicating that compensation effectiveness depends on the LLM's pre-existing language proficiency and available training data.
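The two ingredients studied above, dropping top encoder layers and LoRA's zero-initialized low-rank update W + (alpha/r)·BA, can be sketched with plain lists. The shapes, hyperparameters, and names below are assumptions for illustration, not the paper's setup:

```python
def prune_top_layers(encoder_layers, k):
    """Encoder-depth pruning: keep all but the last k layers."""
    return encoder_layers[: len(encoder_layers) - k]

def lora_adapted_weight(w, a, b, alpha=16, r=2):
    """W + (alpha / r) * (B @ A) for nested-list matrices.
    With B zero-initialized (as in LoRA), this equals W at the start
    of fine-tuning, so training begins from the frozen model."""
    scale = alpha / r
    rows, cols, inner = len(b), len(a[0]), len(a)
    return [[w[i][j] + scale * sum(b[i][t] * a[t][j] for t in range(inner))
             for j in range(cols)] for i in range(rows)]
```

Only A and B (rank-r factors) are trained, which is why LoRA can cheaply compensate for the capacity removed by pruning two encoder layers.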
EnsemJudge: Enhancing Reliability in Chinese LLM-Generated Text Detection through Diverse Model Ensembles
Zhuoshang Wang, Yubing Ren, Guoyu Zhao, Xiaowei Zhu, Hao Li, Yanan Cao
Personalized recommendation rationale:

This paper focuses on detecting LLM-generated text, a matter of evaluation benchmarks and reliability, topics explicitly listed as irrelevant. Although LLM techniques are involved, the core concern is detection rather than practical application in recommendation systems, search, or advertising, and there is no evidence that the method transfers directly to those areas or solves their concrete problems.

2026-03-30 01:50:46 | arXiv:2603.27949v1 |
cs.CL
Full abstract
Large Language Models (LLMs) are widely applied across various domains due to their powerful text generation capabilities. While LLM-generated texts often resemble human-written ones, their misuse can lead to significant societal risks. Detecting such texts is an essential technique for mitigating LLM misuse, and many detection methods have shown promising results across different datasets. However, real-world scenarios often involve out-of-domain inputs or adversarial samples, which can affect the performance of detection methods to varying degrees. Furthermore, most existing research has focused on English texts, with limited work addressing Chinese text detection. In this study, we propose EnsemJudge, a robust framework for detecting Chinese LLM-generated text by incorporating tailored strategies and ensemble voting mechanisms. We trained and evaluated our system on a carefully constructed Chinese dataset provided by NLPCC2025 Shared Task 1. Our approach outperformed all baseline methods and achieved first place in the task, demonstrating its effectiveness and reliability in Chinese LLM-generated text detection. Our code is available at https://github.com/johnsonwangzs/MGT-Mini.
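The ensemble-voting core of such a system can be sketched in a few lines. The label set and tie-breaking rule below are assumptions for the demo; the paper's tailored strategies are richer than plain majority voting:

```python
from collections import Counter

def ensemble_vote(member_labels, tie_break="llm"):
    """Majority vote over per-detector labels ('human' / 'llm');
    ties fall back to a fixed label (an assumed rule for this sketch)."""
    counts = Counter(member_labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_break
    return counts[0][0]
```

Voting over diverse detectors smooths out the individual models' failures on out-of-domain or adversarial inputs, which is the robustness argument the abstract makes.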
PoseDreamer: Scalable and Photorealistic Human Data Generation Pipeline with Diffusion Models
Lorenza Prospero, Orest Kupyn, Ostap Viniavskyi, João F. Henriques, Christian Ru...
Personalized recommendation rationale:

This paper focuses on generating photorealistic human data with diffusion models, a visual content-generation topic. Although diffusion models are LLM-adjacent generative techniques, the work targets human pose and image generation, with no explicit recommendation/search/advertising application scenario, and it does not address heterogeneous-data modeling or Transformer architecture improvements.

2026-03-30 17:59:18 | arXiv:2603.28763v1 |
cs.CV
Full abstract
Acquiring labeled datasets for 3D human mesh estimation is challenging due to depth ambiguities and the inherent difficulty of annotating 3D geometry from monocular images. Existing datasets are either real, with manually annotated 3D geometry and limited scale, or synthetic, rendered from 3D engines that provide precise labels but suffer from limited photorealism, low diversity, and high production costs. In this work, we explore a third path: generated data. We introduce PoseDreamer, a novel pipeline that leverages diffusion models to generate large-scale synthetic datasets with 3D mesh annotations. Our approach combines controllable image generation with Direct Preference Optimization for control alignment, curriculum-based hard sample mining, and multi-stage quality filtering. Together, these components naturally maintain correspondence between 3D labels and generated images, while prioritizing challenging samples to maximize dataset utility. Using PoseDreamer, we generate more than 500,000 high-quality synthetic samples, achieving a 76% improvement in image-quality metrics compared to rendering-based datasets. Models trained on PoseDreamer achieve performance comparable to or superior to those trained on real-world and traditional synthetic datasets. In addition, combining PoseDreamer with synthetic datasets results in better performance than combining real-world and synthetic datasets, demonstrating the complementary nature of our dataset. We will release the full dataset and generation code.
On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers
Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or
Personalized recommendation rationale:

This paper centers on diversity mechanisms in diffusion Transformers, an AIGC/content-generation topic with no direct connection to core tasks such as recommendation, search, or ad ranking. Although diffusion models excel at generation, the paper does not show clear potential application value in recommendation systems, search, or advertising.

2026-03-30 17:59:13 | arXiv:2603.28762v1 |
cs.CV cs.AI cs.GR cs.LG
Full abstract
Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path. In contrast, acting on spatially-committed intermediate latents tends to disrupt the forming visual structure, leading to artifacts. In this work, we propose to apply repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer's forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This allows for redirecting the guidance trajectory after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing a small computational overhead while remaining effective even in modern "Turbo" and distilled models where traditional trajectory-based interventions typically fail.
Stepwise Credit Assignment for GRPO on Flow-Matching Models
Yash Savani, Branislav Kveton, Yuchen Liu, Yilin Wang, Jing Shi, Subhojyoti Mukh...
Personalized recommendation rationale:

The title concerns credit assignment in reinforcement learning (GRPO, i.e. Group Relative Policy Optimization) applied to flow-matching models, which are RL techniques. Although reinforcement learning may have applications in recommender systems, the title indicates no direct connection to recommendation, search, or advertising, and mentions none of the current core focuses such as Transformer architectures, LLM technology, or heterogeneous-data processing.

2026-03-30 17:35:14 | arXiv:2603.28718v1 |
cs.LG cs.AI cs.CV
Full abstract
Flow-GRPO successfully applies reinforcement learning to flow models, but uses uniform credit assignment across all steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve details and textures (high-frequency details). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit based on each step's reward improvement. By leveraging Tweedie's formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves reward quality while preserving stochasticity for policy gradients.
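The gain-based credit rule can be written down directly: each step is credited with the improvement in an intermediate reward estimate (e.g., computed on Tweedie-denoised states) rather than receiving a uniform share of the final reward. This is a schematic of the credit rule only, not the full GRPO update:

```python
def stepwise_gains(reward_estimates):
    """Per-step advantages as reward improvements: gain[t] = r[t] - r[t-1].
    reward_estimates[t] is an assumed intermediate reward for step t."""
    return [reward_estimates[t] - reward_estimates[t - 1]
            for t in range(1, len(reward_estimates))]
```

The gains telescope: they sum to final-minus-initial reward, so total credit matches the trajectory's overall improvement, while steps that degrade the sample receive negative advantage instead of sharing the terminal reward.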
Divide and Restore: A Modular Task-Decoupled Framework for Universal Image Restoration
Joanna Wiekiera, Martyna Zur
Personalized recommendation rationale:

This paper focuses on image restoration, a computer-vision task, and is purely a vision-processing technique. Although the title mentions a "universal" and "modular" framework, the core application scenario is image restoration, with no direct connection to core recommendation/search/advertising tasks such as ranking, matching, or user modeling, and the technique shows no potential value for heterogeneous-data processing (e.g., user sequences and contextual features) or recommendation/search/advertising system architecture.

2026-03-30 16:45:16 | arXiv:2603.28658v1 |
cs.CV
Full abstract
Restoring images affected by various types of degradation, such as noise, blur, or improper exposure, remains a significant challenge in computer vision. While recent trends favor complex monolithic all-in-one architectures, these models often suffer from negative task interference and require extensive joint training cycles on high-end computing clusters. In this paper, we propose a modular, task-decoupled image restoration framework based on an explicit diagnostic routing mechanism. The architecture consists of a lightweight Convolutional Neural Network (CNN) classifier that evaluates the input image and dynamically directs it to a specialized restoration node. A key advantage of this framework is its model-agnostic extensibility: while we demonstrate it using three independent U-Net experts, the system allows for the integration of any restoration method tailored to specific tasks. By isolating reconstruction paths, the framework prevents feature conflicts and significantly reduces training overhead. Unlike monolithic models, adding new degradation types in our framework only requires training a single expert and updating the router, rather than a full system retraining. Experimental results demonstrate that this computationally accessible approach offers a scalable and efficient solution for multi-degradation restoration on standard local hardware. The code will be published upon paper acceptance.
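The diagnostic-routing pattern described above reduces to a classifier plus a dictionary of experts, and extending it means registering one new entry. The toy "images" (pixel lists), threshold classifier, and experts below are all hypothetical stand-ins:

```python
def restore(image, classify, experts):
    """Route an input to the restoration expert matching its predicted
    degradation type; new types only require registering a new expert."""
    kind = classify(image)
    if kind not in experts:
        raise ValueError(f"no expert registered for degradation '{kind}'")
    return experts[kind](image)

# Stand-in experts and a stand-in threshold classifier.
experts = {
    "overexposed": lambda img: [min(255, int(px * 0.8)) for px in img],
    "noisy": lambda img: [(a + b) // 2 for a, b in zip(img, img[1:] + img[-1:])],
}
classify = lambda img: "overexposed" if sum(img) / len(img) > 200 else "noisy"
```

Because each expert is trained and invoked in isolation, the paths cannot interfere with one another, which is the framework's answer to the negative task interference seen in monolithic all-in-one models.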
ELViS: Efficient Visual Similarity from Local Descriptors that Generalizes Across Domains
Pavel Suma, Giorgos Kordopatis-Zilos, Yannis Kalantidis, Giorgos Tolias
Personalized recommendation rationale:

This paper focuses on visual similarity computation and local descriptor techniques, within computer vision. Although visual similarity has potential applications in recommendation and search (e.g., image search, visual recommendation), the title does not indicate a direct connection to recommender systems, search, or advertising, nor does it involve LLMs, Transformer architectures, heterogeneous data modeling, or other core technologies of current interest.

2026-03-30 15:53:42 | arXiv:2603.28603v1 |
cs.CV
View full abstract
Large-scale instance-level training data is scarce, so models are typically trained on domain-specific datasets. Yet in real-world retrieval, they must handle diverse domains, making generalization to unseen data critical. We introduce ELViS, an image-to-image similarity model that generalizes effectively to unseen domains. Unlike conventional approaches, our model operates in similarity space rather than representation space, promoting cross-domain transfer. It leverages local descriptor correspondences, refines their similarities through an optimal transport step with data-dependent gains that suppress uninformative descriptors, and aggregates strong correspondences via a voting process into an image-level similarity. This design injects strong inductive biases, yielding a simple, efficient, and interpretable model. To assess generalization, we compile a benchmark of eight datasets spanning landmarks, artworks, products, and multi-domain collections, and evaluate ELViS as a re-ranking method. Our experiments show that ELViS outperforms competing methods by a large margin in out-of-domain scenarios and on average, while requiring only a fraction of their computational cost. Code available at: https://github.com/pavelsuma/ELViS/
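The similarity-space pipeline above (refine descriptor similarities via optimal transport, then aggregate correspondences into an image-level score) can be sketched with plain entropic Sinkhorn iterations. This is a generic sketch under assumed uniform marginals; the paper's actual formulation adds data-dependent gains to suppress uninformative descriptors and a voting-based aggregation, neither of which is reproduced here, and both function names are hypothetical.

```python
import numpy as np

def sinkhorn_refine(sim, eps=0.1, n_iters=100):
    # Entropic optimal transport: alternately rescale rows and columns of
    # K = exp(sim / eps) so the transport plan approaches uniform marginals.
    K = np.exp(sim / eps)
    u = np.ones(K.shape[0])
    v = np.ones(K.shape[1])
    for _ in range(n_iters):
        u = 1.0 / (K @ v)
        v = 1.0 / (K.T @ u)
    return u[:, None] * K * v[None, :]

def image_similarity(sim):
    # Aggregate: weight each local-descriptor correspondence by its
    # transport mass and sum into a single image-level similarity.
    P = sinkhorn_refine(sim)
    return float((P * sim).sum())
```

Because the transport plan is (approximately) doubly stochastic, a descriptor that matches many others weakly ends up contributing little mass to any single correspondence, which is one way such a step suppresses uninformative matches.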
StreamingVLA: Streaming Vision-Language-Action Model with Action Flow Matching and Adaptive Early Observation
Yiran Shi, Dongqi Guo, Tianchen Zhao, Feng Gao, Liangzhi Shi, Chao Yu, ZhiJian M...
Personalized recommendation rationale:

The title indicates a vision-language-action (VLA) model centered on action generation and streaming processing for robot control or embodied intelligence. Although it involves multimodal vision-language concepts, its application scenarios (action generation, streaming observation) are only weakly related to the typical tasks of recommendation, search, or advertising (e.g., ranking, retrieval, user modeling). While the idea of unified multimodal modeling is loosely analogous to heterogeneous data processing, the paper's specific technical directions (action flow matching, adaptive early observation) lean toward robotics rather than information retrieval.

2026-03-30 15:23:27 | arXiv:2603.28565v1 |
cs.RO cs.CV
View full abstract
Vision-language-action (VLA) models have demonstrated exceptional performance in natural language-driven perception and control. However, the high computational cost of VLA models poses significant efficiency challenges, particularly for resource-constrained edge platforms in real-world deployments. Moreover, since the different stages of VLA (observation, action generation, and execution) must proceed sequentially, each waiting for the completion of the preceding stage, the system suffers from frequent halting and high latency. To address this, we conduct a systematic analysis to identify the challenges for fast and fluent generation, and propose enabling VLAs to asynchronously parallelize across VLA stages in a "streaming" manner. First, we eliminate the reliance on action chunking and adopt action flow matching, which learns the trajectory of action flows rather than denoising chunk-wise actions. This overlaps the latency of action generation and execution. Second, we design an action saliency-aware adaptive observation mechanism, thereby overlapping the latency of execution and observation. Without sacrificing performance, StreamingVLA achieves substantial speedup and improves the fluency of execution. It achieves a 2.4 $\times$ latency speedup and reduces execution halting by 6.5 $\times$.
Bridging the Geometry Mismatch: Frequency-Aware Anisotropic Serialization for Thin-Structure SSMs
Jin Bai, Huiyao Zhang, Qi Wen, Ningyang Li, Shengyang Li, Atta ur Rahman, Xiaoli...
Personalized recommendation rationale:

The title concerns serialization methods for state-space models (SSMs), a sequence modeling technique that could serve as an alternative or complement to Transformer architectures. However, the emphasis on "thin structures" and "geometry mismatch" suggests a focus on modeling specific structures (physical, geometric, or 3D data) rather than general recommendation/search/advertising sequence data. Although SSMs show promise for long-sequence processing, the paper's specific technical direction is only weakly connected to the unified heterogeneous data modeling, LLM applications, or Transformer efficiency improvements of current interest.

2026-03-30 14:39:04 | arXiv:2603.28503v1 |
cs.CV
View full abstract
The segmentation of thin linear structures is inherently topology-critical, where minor local errors can sever long-range connectivity. While recent State-Space Models (SSMs) offer efficient long-range modeling, their isotropic serialization (e.g., raster scanning) creates a geometry mismatch for anisotropic targets, causing state propagation across rather than along the structure trajectories. To address this, we propose FGOS-Net, a framework based on frequency-geometric disentanglement. We first decompose features into a stable topology carrier and directional high-frequency bands, leveraging the latter to explicitly correct spatial misalignments induced by downsampling. Building on this calibrated topology, we introduce frequency-aligned scanning that elevates serialization to a geometry-conditioned decision, preserving direction-consistent traces. Coupled with an active probing strategy to selectively inject high-frequency details and suppress texture ambiguity, FGOS-Net consistently outperforms strong baselines across four challenging benchmarks. Notably, it achieves 91.3% mIoU and 97.1% clDice on DeepCrack while running at 80 FPS with only 7.87 GFLOPs.
INSID3: Training-Free In-Context Segmentation with DINOv3
Claudia Cuttano, Gabriele Trivigno, Christoph Reich, Daniel Cremers, Carlo Mason...
Personalized recommendation rationale:

The title concerns segmentation, a computer-vision task, and the DINOv3 vision foundation model, making this purely visual research. Although it mentions 'in-context', this is a computer-vision term, entirely different from the 'context' of user contextual features and sequence data in recommendation/search/advertising systems. The paper demonstrates no potential application to, or technology transfer toward, recommender systems, search, or advertising.

2026-03-30 14:16:37 | arXiv:2603.28480v1 |
cs.CV
View full abstract
In-context segmentation (ICS) aims to segment arbitrary concepts, e.g., objects, parts, or personalized instances, given one annotated visual example. Existing work relies on (i) fine-tuning vision foundation models (VFMs), which improves in-domain results but harms generalization, or (ii) combining multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: Can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. We introduce INSID3, a training-free approach that segments concepts at varying granularities only from frozen DINOv3 features, given an in-context example. INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +7.5% mIoU, while using 3x fewer parameters and without any mask or category-level supervision. Code is available at https://github.com/visinf/INSID3 .
CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains
Wenhan Wang, Zhixiang Zhou, Zhongtian Ma, Yanzhu Chen, Ziyu Lin, Hao Sheng, Peng...
Personalized recommendation rationale:

This paper focuses on a multimodal agent for cultural reasoning over porcelain, a domain-specific application rather than core recommendation/search/advertising technology. Although it involves multimodal processing, its application scenario (Chinese porcelain culture) is only weakly related to the listed focus areas (RecSys/Search/Ads) and may lean toward AIGC or content generation.

2026-03-30 14:13:47 | arXiv:2603.28474v1 |
cs.CV cs.AI
View full abstract
The connoisseurship of antique Chinese porcelain demands extensive historical expertise, material understanding, and aesthetic sensitivity, making it difficult for non-specialists to engage. To democratize cultural-heritage understanding and assist expert connoisseurship, we introduce CiQi-Agent -- a domain-specific Porcelain Connoisseurship Agent for intelligent analysis of antique Chinese porcelain. CiQi-Agent supports multi-image porcelain inputs and enables vision tool invocation and multimodal retrieval-augmented generation, performing fine-grained connoisseurship analysis across six attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape. Beyond attribute classification, it captures subtle visual details, retrieves relevant domain knowledge, and integrates visual and textual evidence to produce coherent, explainable connoisseurship descriptions. To achieve this capability, we construct a large-scale, expert-annotated dataset CiQi-VQA, comprising 29,596 porcelain specimens, 51,553 images, and 557,940 visual question-answering pairs, and further establish a comprehensive benchmark CiQi-Bench aligned with the six attributes mentioned above. CiQi-Agent is trained through supervised fine-tuning, reinforcement learning, and a tool-augmented reasoning framework that integrates two categories of tools: a vision tool and multimodal retrieval tools. Experimental results show that CiQi-Agent (7B) outperforms all competitive open- and closed-source models across all six attributes on CiQi-Bench, achieving on average 12.2% higher accuracy than GPT-5. The model and dataset have been released and are publicly available at https://huggingface.co/datasets/SII-Monument-Valley/CiQi-VQA.
$R_{dm}$: Re-conceptualizing Distribution Matching as a Reward for Diffusion Distillation
Linqian Fan, Peiqin Sun, Tiancheng Wen, Shun Lu, Chengru Song
Personalized recommendation rationale:

This paper focuses on distillation techniques for diffusion models, within generative model optimization. Although diffusion models are used in content generation, the title does not indicate a direct connection to recommender systems, search, or advertising, nor does it mention Transformer architectures, multimodal modeling, or specific application scenarios in those areas.

2026-03-30 14:01:31 | arXiv:2603.28460v1 |
cs.CV cs.LG
View full abstract
Diffusion models achieve state-of-the-art generative performance but are fundamentally bottlenecked by their slow iterative sampling process. While diffusion distillation techniques enable high-fidelity few-step generation, traditional objectives often restrict the student's performance by anchoring it solely to the teacher. Recent approaches have attempted to break this ceiling by integrating Reinforcement Learning (RL), typically through a simple summation of distillation and RL objectives. In this work, we propose a novel paradigm by reconceptualizing distribution matching as a reward, denoted as $R_{dm}$. This unified perspective bridges the algorithmic gap between Diffusion Matching Distillation (DMD) and RL, providing several key benefits. (1) Enhanced optimization stability: we introduce Group Normalized Distribution Matching (GNDM), which adapts standard RL group normalization to stabilize $R_{dm}$ estimation. By leveraging group-mean statistics, GNDM establishes a more robust and effective optimization direction. (2) Seamless reward integration: our reward-centric formulation inherently supports adaptive weighting mechanisms, allowing flexible combination of DMD with external reward models. (3) Improved sampling efficiency: by aligning with RL principles, the framework readily incorporates importance sampling (IS), leading to a significant boost in sampling efficiency. Extensive experiments demonstrate that GNDM outperforms vanilla DMD, reducing the FID by 1.87. Furthermore, our multi-reward variant, GNDMR, surpasses existing baselines by achieving a strong balance between aesthetic quality and fidelity, reaching a peak HPS of 30.37 and a low FID-SD of 12.21. Overall, $R_{dm}$ provides a flexible, stable, and efficient framework for real-time high-fidelity synthesis. Code will be released upon publication.
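The "group-mean statistics" idea behind GNDM follows the standard group normalization used in GRPO-style RL: each sample's reward is centered by its group's mean and scaled by the group's standard deviation. The sketch below shows only this generic normalization step, not the paper's full $R_{dm}$ estimation; the function name and `eps` stabilizer are assumptions.

```python
import numpy as np

def group_normalized_advantage(rewards, eps=1e-8):
    # GRPO-style group normalization: advantages are the group's rewards
    # centered by the group mean and scaled by the group std, so the
    # optimization direction depends on relative, not absolute, reward.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

The normalized advantages always have zero mean within a group, which is what makes the estimate robust to a shifting or miscalibrated reward scale.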
EdgeDiT: Hardware-Aware Diffusion Transformers for Efficient On-Device Image Generation
Sravanth Kodavanti, Manjunath Arveti, Sowmya Vajrala, Srinivas Miriyala, Vikram ...
Personalized recommendation rationale:

This paper focuses on on-device image generation, which falls under AIGC/content generation, an explicitly excluded topic. Although it touches on Transformer architecture efficiency (potentially an "enabling Transformer technique"), its core application is image generation rather than ranking or modeling tasks in recommendation/search/advertising, so its relevance is very low.

2026-03-30 13:14:30 | arXiv:2603.28405v1 |
cs.CV cs.AI
View full abstract
Diffusion Transformers (DiT) have established a new state-of-the-art in high-fidelity image synthesis; however, their massive computational complexity and memory requirements hinder local deployment on resource-constrained edge devices. In this paper, we introduce EdgeDiT, a family of hardware-efficient generative transformers specifically engineered for mobile Neural Processing Units (NPUs), such as the Qualcomm Hexagon and Apple Neural Engine (ANE). By leveraging a hardware-aware optimization framework, we systematically identify and prune structural redundancies within the DiT backbone that are particularly taxing for mobile data-flows. Our approach yields a series of lightweight models that achieve a 20-30% reduction in parameters, a 36-46% decrease in FLOPs, and a 1.65-fold reduction in on-device latency without sacrificing the scaling advantages or the expressive capacity of the original transformer architecture. Extensive benchmarking demonstrates that EdgeDiT offers a superior Pareto-optimal trade-off between Frechet Inception Distance (FID) and inference latency compared to both optimized mobile U-Nets and vanilla DiT variants. By enabling responsive, private, and offline generative AI directly on-device, EdgeDiT provides a scalable blueprint for transitioning large-scale foundation models from high-end GPUs to the palm of the user.
Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models
Tao Xia, Jiawei Liu, Yukun Zhang, Ting Liu, Wei Wang, Lei Zhang
Personalized recommendation rationale:

This paper focuses on text-guided image editing, a computer-vision task with no direct connection to core recommendation, search, or advertising technology. Although visual autoregressive models may have potential uses in some multimodal recommendation settings, the title clearly targets image editing and lacks an explicit RecSys/Search/Ads application scenario.

2026-03-30 12:35:33 | arXiv:2603.28367v1 |
cs.CV
View full abstract
Visual autoregressive (VAR) models have recently emerged as a promising family of generative models, enabling a wide range of downstream vision tasks such as text-guided image editing. By shifting the editing paradigm from noise manipulation in diffusion-based methods to token-level operations, VAR-based approaches achieve better background preservation and significantly faster inference. However, existing VAR-based editing methods still face two key challenges: accurately localizing editable tokens and maintaining structural consistency in the edited results. In this work, we propose a novel text-guided image editing framework rooted in an analysis of intermediate feature distributions within VAR models. First, we introduce a coarse-to-fine token localization strategy that can refine editable regions, balancing editing fidelity and background preservation. Second, we analyze the intermediate representations of VAR models and identify structure-related features, by which we design a simple yet effective feature injection mechanism to enhance structural consistency between the edited and source images. Third, we develop a reinforcement learning-based adaptive feature injection scheme that automatically learns scale- and layer-specific injection ratios to jointly optimize editing fidelity and structure preservation. Extensive experiments demonstrate that our method achieves superior structural consistency and editing quality compared with state-of-the-art approaches, across both local and global editing scenarios.
Integrating Multimodal Large Language Model Knowledge into Amodal Completion
Heecheol Yun, Eunho Yang
Personalized recommendation rationale:

The title involves multimodal large language models (an LLM technology) and the vision task of amodal completion. Although LLM technology is a related area, amodal completion is purely a computer-vision task with no clearly demonstrated application potential for recommendation, search, or advertising. The title mentions no keywords or application scenarios related to recommendation, search, or advertising.

2026-03-30 12:03:47 | arXiv:2603.28333v1 |
cs.CV cs.AI
View full abstract
With the widespread adoption of autonomous vehicles and robotics, amodal completion, which reconstructs the occluded parts of people and objects in an image, has become increasingly crucial. Just as humans infer hidden regions based on prior experience and common sense, this task inherently requires physical knowledge about real-world entities. However, existing approaches either depend solely on the image generation ability of visual generative models, which lack such knowledge, or leverage it only during the segmentation stage, preventing it from explicitly guiding the completion process. To address this, we propose AmodalCG, a novel framework that harnesses the real-world knowledge of Multimodal Large Language Models (MLLMs) to guide amodal completion. Our framework first assesses the extent of occlusion to selectively invoke MLLM guidance only when the target object is heavily occluded. If guidance is required, the framework further incorporates MLLMs to reason about both the (1) extent and (2) content of the missing regions. Finally, a visual generative model integrates this guidance and iteratively refines imperfect completions that may arise from inaccurate MLLM guidance. Experimental results on various real-world images show impressive improvements compared to all existing works, suggesting MLLMs as a promising direction for addressing challenging amodal completion.
Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes
Luke Palmer, Petar Palasek, Hazem Abdelkawy
Personalized recommendation rationale:

This paper focuses on gaze simulation in dynamic scenes, which belongs to computer vision and eye tracking. Although gaze data could in principle be applied to advertising effectiveness measurement, the title does not indicate a direct connection to recommender systems, search, or advertising, nor does it mention LLMs, Transformer architectures, heterogeneous data modeling, or other core technologies of current interest.

2026-03-30 11:41:11 | arXiv:2603.28319v1 |
cs.CV
View full abstract
Accurately modelling human attention is essential for numerous computer vision applications, particularly in the domain of automotive safety. Existing methods typically collapse gaze into saliency maps or scanpaths, treating gaze dynamics only implicitly. We instead formulate gaze modelling as an autoregressive dynamical system and explicitly unroll raw gaze trajectories over time, conditioned on both gaze history and the evolving environment. Driving scenes are represented as gaze-centric graphs processed by the Affinity Relation Transformer (ART), a heterogeneous graph transformer that models interactions between driver gaze, traffic objects, and road structure. We further introduce the Object Density Network (ODN) to predict next-step gaze distributions, capturing the stochastic and object-centric nature of attentional shifts in complex environments. We also release Focus100, a new dataset of raw gaze data from 30 participants viewing egocentric driving footage. Trained directly on raw gaze, without fixation filtering, our unified approach produces more natural gaze trajectories, scanpath dynamics, and saliency maps than existing attention models, offering valuable insights for the temporal modelling of human attention in dynamic environments.
DiffAttn: Diffusion-Based Drivers' Visual Attention Prediction with LLM-Enhanced Semantic Reasoning
Weimin Liu, Qingkun Li, Jiyuan Qiu, Wenjun Wang, Joshua H. Meng
Personalized recommendation rationale:

The title involves visual attention prediction and diffusion models, which belong to computer vision rather than core recommendation, search, or advertising technology. Although LLM enhancement is mentioned, it is applied to visual attention prediction in driving scenarios, with only weak links to direct applications or enabling technologies for RecSys/Search/Ads.

2026-03-30 10:24:20 | arXiv:2603.28251v1 |
cs.CV cs.AI
View full abstract
Drivers' visual attention provides critical cues for anticipating latent hazards and directly shapes decision-making and control maneuvers, where its absence can compromise traffic safety. To emulate drivers' perception patterns and advance visual attention prediction for intelligent vehicles, we propose DiffAttn, a diffusion-based framework that formulates this task as a conditional diffusion-denoising process, enabling more accurate modeling of drivers' attention. To capture both local and global scene features, we adopt Swin Transformer as encoder and design a decoder that combines a Feature Fusion Pyramid for cross-layer interaction with dense, multi-scale conditional diffusion to jointly enhance denoising learning and model fine-grained local and global scene contexts. Additionally, a large language model (LLM) layer is incorporated to enhance top-down semantic reasoning and improve sensitivity to safety-critical cues. Extensive experiments on four public datasets demonstrate that DiffAttn achieves state-of-the-art (SoTA) performance, surpassing most video-based, top-down-feature-driven, and LLM-enhanced baselines. Our framework further supports interpretable driver-centric scene understanding and has the potential to improve in-cabin human-machine interaction, risk perception, and drivers' state measurement in intelligent vehicles.
ToLL: Topological Layout Learning with Structural Multi-view Augmentation for 3D Scene Graph Pretraining
Yucheng Huang, Luping Ji, Xiangwei Jiang, Wen Li, Mao Ye
Personalized recommendation rationale:

This paper involves 3D scene understanding and graph representation learning, within computer vision and graphics. Although it mentions graph structure learning and pretraining, its core application scenario (3D scene graphs) has no direct connection to heterogeneous data processing in recommendation, search, or advertising, and the paper demonstrates no clear potential application value in those areas.

2026-03-30 08:43:52 | arXiv:2603.28178v1 |
cs.CV
View full abstract
3D Scene Graph (3DSG) generation plays a pivotal role in spatial understanding and semantic-affordance perception. However, its generalizability is often constrained by data scarcity. Current solutions primarily focus on cross-modal assisted representation learning and object-centric generation pre-training. The former relies heavily on predicate annotations, while the latter's predicate learning may be bypassed due to strong object priors. Consequently, they often cannot provide a label-free and robust self-supervised proxy task for 3DSG fine-tuning. To bridge this gap, we propose a Topological Layout Learning (ToLL) framework for 3DSG pretraining. In detail, we design an Anchor-Conditioned Topological Geometry Reasoning module, with a GNN that recovers the global layout of zero-centered subgraphs from the spatial priors of sparse anchors. This process is strictly modulated by predicate features, thereby enforcing predicate relation learning. Furthermore, we construct a Structural Multi-view Augmentation to avoid semantic corruption, and enhance representations via self-distillation. Extensive experiments on the 3DSSG dataset demonstrate that ToLL improves representation quality, outperforming state-of-the-art baselines.
RecycleLoRA: Rank-Revealing QR-Based Dual-LoRA Subspace Adaptation for Domain Generalized Semantic Segmentation
Chanseul Cho, Seokju Yun, Jeaseong Jeon, Seungjae Moon, Youngmin Ro
Personalized recommendation rationale:

This paper focuses on semantic segmentation in computer vision, purely visual research with no direct connection to recommender systems, search, or advertising. Although LoRA (Low-Rank Adaptation) is mentioned, it is applied to domain generalization for vision tasks, with no demonstrated potential application value in RecSys/Search/Ads.

2026-03-30 08:05:39 | arXiv:2603.28142v1 |
cs.CV cs.AI
View full abstract
Domain Generalized Semantic Segmentation (DGSS) aims to maintain robust performance across unseen target domains. Vision Foundation Models (VFMs) offer rich multi-domain knowledge that can enhance generalization. However, strategies for actively exploiting the rich subspace structures within VFMs remain under-explored, with many existing methods focusing primarily on preserving pre-trained knowledge. Furthermore, their LoRA components often suffer from limited representational diversity and inefficient parameter utilization. We propose RecycleLoRA, which addresses both challenges by employing Rank-Revealing QR Decomposition (RRQR) to systematically exploit VFM's subspace structures and enhance LoRA's representational richness. Our main adapter leverages minor subspace directions identified by RRQR to learn diverse and independent features, achieving competitive performance even when used alone. We further introduce a sub adapter that carefully refines major directions with minimal adjustments, providing complementary improvements to the main adapter's strong baseline performance. This design enables the dual adapters to learn distinct representations without requiring additional regularization losses. Our systematic exploitation of pre-trained subspace structures through RRQR-based initialization leads to superior domain generalization performance. RecycleLoRA achieves state-of-the-art performance on both synthetic-to-real generalization and real-to-real generalization tasks without complex architectures or additional inference latency.
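The RRQR step above can be illustrated with SciPy's column-pivoted QR, which orders orthonormal directions by how much of the weight matrix's energy they capture. This is a minimal sketch of the subspace split only, under the assumption that "major" and "minor" directions correspond to the first and remaining pivoted columns of Q; the dual-adapter initialization itself and the function name are not from the paper.

```python
import numpy as np
from scipy.linalg import qr

def split_major_minor(W, rank):
    # Column-pivoted (rank-revealing) QR of a pre-trained weight matrix W:
    # the first `rank` columns of Q span the dominant ("major") subspace,
    # the remainder the "minor" directions that a LoRA adapter could use
    # to learn diverse, independent features.
    Q, R, piv = qr(W, mode='economic', pivoting=True)
    return Q[:, :rank], Q[:, rank:]
```

Because Q is orthonormal, the two returned subspaces are mutually orthogonal by construction, which is what lets the dual adapters learn distinct representations without an extra regularization loss.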
Robust Remote Sensing Image-Text Retrieval with Noisy Correspondence
Qiya Song, Yiqiang Xie, Yuan Sun, Renwei Dian, Xudong Kang
Personalized recommendation rationale:

This paper concerns image-text retrieval, a multimodal retrieval topic with some similarity to text-item matching in recommendation, search, or advertising. However, its specific domain (remote sensing imagery) and application scenario make it only weakly relevant to the core RecSys/Search/Ads areas or LLM technologies of current interest, with unclear potential applications.

2026-03-30 07:55:07 | arXiv:2603.28134v1 |
cs.CV
View full abstract
As a pivotal task that bridges remote visual and linguistic understanding, Remote Sensing Image-Text Retrieval (RSITR) has attracted considerable research interest in recent years. However, almost all RSITR methods implicitly assume that image-text pairs are matched perfectly. In practice, acquiring a large set of well-aligned data pairs is often prohibitively expensive or even infeasible. In addition, we also notice that remote sensing datasets (e.g., RSITMD) truly contain some inaccurate or mismatched image-text descriptions. Based on the above observations, we reveal an important but untouched problem in RSITR, i.e., Noisy Correspondence (NC). To overcome these challenges, we propose a novel Robust Remote Sensing Image-Text Retrieval (RRSITR) paradigm that designs a self-paced learning strategy to mimic human cognitive learning patterns, thereby learning from easy to hard on multi-modal data with NC. Specifically, we first divide all training sample pairs into three categories based on the loss magnitude of each pair, i.e., clean sample pairs, ambiguous sample pairs, and noisy sample pairs. Then, we estimate the reliability of each training pair by assigning it a weight based on the value of its loss. Further, we design a new multi-modal self-paced function to dynamically regulate the training sequence and weights of the samples, thus establishing a progressive learning process. Finally, for noisy sample pairs, we present a robust triplet loss that dynamically adjusts the soft margin based on semantic similarity, thereby enhancing robustness against noise. Extensive experiments on three popular benchmark datasets demonstrate that the proposed RRSITR significantly outperforms the state-of-the-art methods, especially at high noise rates. The code is available at: https://github.com/MSFLabX/RRSITR
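The "robust triplet loss with a dynamic soft margin" can be sketched as below. This is an illustrative guess at the mechanism, not the paper's exact formula: the linear margin schedule, the `base_margin` value, and the function name are all assumptions, with only the idea (margin shrinks as the pair's semantic similarity rises, so likely-noisy pairs are separated less aggressively) taken from the abstract.

```python
import numpy as np

def robust_triplet_loss(d_ap, d_an, sem_sim, base_margin=0.2):
    # d_ap / d_an: anchor-positive and anchor-negative distances.
    # sem_sim in [0, 1]: semantic similarity of the "negative" pair; a
    # high value suggests a noisy correspondence, so the soft margin
    # shrinks and the pair is pushed apart less aggressively.
    margin = base_margin * (1.0 - np.clip(sem_sim, 0.0, 1.0))
    return np.maximum(0.0, d_ap - d_an + margin)
```

With `sem_sim = 0` this reduces to the standard hinge-based triplet loss with a fixed margin, so clean pairs are handled exactly as usual.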
$AutoDrive\text{-}P^3$: Unified Chain of Perception-Prediction-Planning Thought via Reinforcement Fine-Tuning
Yuqi Ye, Zijian Zhang, Junhong Lin, Shangkun Sun, Changhao Peng, Wei Gao
Personalized recommendation rationale:

This paper focuses on the perception-prediction-planning chain in autonomous driving, a domain-specific application rather than core recommendation, search, or advertising technology. Although reinforcement learning is involved, the paper demonstrates no direct connection to, or potential application in, RecSys/Search/Ads.

2026-03-30 07:28:41 | arXiv:2603.28116v1 |
cs.RO cs.CV
View full abstract
Vision-language models (VLMs) are increasingly being adopted for end-to-end autonomous driving systems due to their exceptional performance in handling long-tail scenarios. However, current VLM-based approaches suffer from two major limitations: 1) Some VLMs directly output planning results without chain-of-thought (CoT) reasoning, bypassing crucial perception and prediction stages which creates a significant domain gap and compromises decision-making capability; 2) Other VLMs can generate outputs for perception, prediction, and planning tasks but employ a fragmented decision-making approach where these modules operate separately, leading to a significant lack of synergy that undermines true planning performance. To address these limitations, we propose ${AutoDrive\text{-}P^3}$, a novel framework that seamlessly integrates $\textbf{P}$erception, $\textbf{P}$rediction, and $\textbf{P}$lanning through structured reasoning. We introduce the ${P^3\text{-}CoT}$ dataset to facilitate coherent reasoning and propose ${P^3\text{-}GRPO}$, a hierarchical reinforcement learning algorithm that provides progressive supervision across all three tasks. Specifically, ${AutoDrive\text{-}P^3}$ progressively generates CoT reasoning and answers for perception, prediction, and planning, where perception provides essential information for subsequent prediction and planning, while both perception and prediction collectively contribute to the final planning decisions, enabling safer and more interpretable autonomous driving. Additionally, to balance inference efficiency with performance, we introduce dual thinking modes: detailed thinking and fast thinking. Extensive experiments on both open-loop (nuScenes) and closed-loop (NAVSIMv1/v2) benchmarks demonstrate that our approach achieves state-of-the-art performance in planning tasks. Code is available at https://github.com/haha-yuki-haha/AutoDrive-P3.
SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting
Alexander Prutsch, Christian Fruhwirth-Reisinger, David Schinagl, Horst Possegge...
Personalized recommendation rationale:

The title focuses on motion forecasting, a spatiotemporal sequence prediction problem in autonomous driving or robotics. Although it involves sequence modeling and prediction, it has no direct connection to the core tasks of recommendation, search, or advertising (user behavior prediction, content ranking, ad delivery). The "short-window streaming" and "robust prediction" techniques may offer some reference value for temporal data processing, but a clear RecSys/Search/Ads application scenario is missing.

2026-03-30 06:47:19 | arXiv:2603.28091v1 |
cs.CV cs.RO
View full abstract
In dynamic traffic environments, motion forecasting models must be able to accurately estimate future trajectories continuously. Streaming-based methods are a promising solution, but despite recent advances, their performance often degrades when exposed to heterogeneous observation lengths. To address this, we propose a novel streaming-based motion forecasting framework that explicitly focuses on evolving scenes. Our method incrementally processes incoming observation windows and leverages an instance-aware context streaming to maintain and update latent agent representations across inference steps. A dual training objective further enables consistent forecasting accuracy across diverse observation horizons. Extensive experiments on Argoverse 2, nuScenes, and Argoverse 1 demonstrate the robustness of our approach under evolving scene conditions and also on the single-agent benchmarks. Our model achieves state-of-the-art performance in streaming inference on the Argoverse 2 multi-agent benchmark, while maintaining minimal latency, highlighting its suitability for real-world deployment.
To View Transform or Not to View Transform: NeRF-based Pre-training Perspective
Hyeonjun Jeong, Juyeb Shin, Dongsuk Kum
Personalized recommendation rationale:

The title clearly focuses on neural radiance fields (NeRF) and view transformation, core techniques of computer vision and 3D reconstruction. Although it takes a pre-training perspective, NeRF is mainly applied to 3D scene reconstruction, view synthesis, and graphics, with no direct connection to the core recommendation/search/advertising stack (e.g., Transformer architectures, sequence modeling, feature engineering). Even allowing for potential cross-domain applications, the transfer path from this 3D vision technology to recommendation/search is too indirect and unclear to match any technical direction of current interest.

2026-03-30 06:46:34 | arXiv:2603.28090v1 |
cs.CV
View full abstract
Neural radiance fields (NeRFs) have emerged as a prominent pre-training paradigm for vision-centric autonomous driving, which enhances 3D geometry and appearance understanding in a fully self-supervised manner. To apply NeRF-based pre-training to 3D perception models, recent approaches have simply applied NeRFs to volumetric features obtained from view transformation. However, coupling NeRFs with view transformation inherits conflicting priors; view transformation imposes discrete and rigid representations, whereas radiance fields assume continuous and adaptive functions. When these opposing assumptions are forced into a single pipeline, the misalignment surfaces as blurry and ambiguous 3D representations that ultimately limit 3D scene understanding. Moreover, the NeRF network used for pre-training is discarded during downstream tasks, resulting in inefficient utilization of the 3D representations enhanced through NeRF. In this paper, we propose a novel NeRF-Resembled Point-based 3D detector (NeRP3D) that can learn continuous 3D representations and thus avoid the misaligned priors from view transformation. NeRP3D preserves the pre-trained NeRF network regardless of the task, inheriting the principle of continuous 3D representation learning and leading to greater potential for both scene reconstruction and detection tasks. Experiments on the nuScenes dataset demonstrate that our proposed approach significantly improves over previous state-of-the-art methods, on not only pretext scene reconstruction tasks but also downstream detection tasks.
GEMS: Agent-Native Multimodal Generation with Memory and Skills
Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Yu Cheng, Yang Yang
Personalized recommendation rationale:

The title involves agents, multimodal generation, memory, and skills, concepts that belong mainly to general-purpose AI agents rather than technology specific to recommendation, search, or advertising. Although multimodal processing can in theory be analogized to handling heterogeneous data, the title does not point to concrete RecSys/Search/Ads scenarios such as user sequence modeling or contextual feature fusion, so its direct relevance is low and its potential applications are unclear.

2026-03-30 06:42:55 | arXiv:2603.28088v1 |
cs.CV
View full abstract
Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent-Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundational models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to effectively handle diverse downstream applications. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of the agent harness in extending model capabilities beyond their original limits.
MolmoPoint: Better Pointing for VLMs with Grounding Tokens
Christopher Clark, Yue Yang, Jae Sung Park, Zixian Ma, Jieyu Zhang, Rohun Tripat...
Personalized recommendation rationale:

The paper focuses on improving pointing techniques for vision-language models (VLMs), a purely vision-language line of research. Although the "grounding tokens" in the title may involve multimodal representations, there is no clear application potential for recommender systems, search, or advertising. The work sits at the intersection of computer vision and natural language processing rather than in your core RecSys/Search/Ads areas or their enabling technologies.

2026-03-30 06:15:06 | arXiv:2603.28069v1 |
cs.CV cs.AI
Full abstract
Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting visual tokens. Using this method, we set a new state-of-the-art on image pointing (70.7% on PointBench), set a new state-of-the-art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improve video pointing (59.1% human preference win rate vs. a text coordinate baseline) and tracking (+6.3% gain on Molmo2Track). We additionally show that our method achieves much higher sample efficiency and discuss the qualitative differences that emerge from this design change.
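The core mechanism, a pointing token that selects a visual token via cross-attention rather than emitting text coordinates, can be illustrated with a minimal numpy sketch. These are toy values, not the authors' model; the query is constructed to match token 27.

```python
import numpy as np

rng = np.random.default_rng(0)
d, grid = 16, 8                      # embedding dim, 8x8 grid of visual tokens
keys = rng.normal(size=(grid * grid, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-norm token keys

query = keys[27]                     # pointing-token query, aimed at token 27

scores = keys @ query                # cross-attention logits over visual tokens
selected = int(np.argmax(scores))    # coarse selection: one token = one patch
row, col = divmod(selected, grid)    # grid coordinates of the chosen patch
```

The paper's second and third pointing tokens would then refine the location to a subpatch within the chosen patch, and a no-more-points class would terminate the sequence.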
CLIP-AUTT: Test-Time Personalization with Action Unit Prompting for Fine-Grained Video Emotion Recognition
Muhammad Osama Zeeshan, Masoumeh Sharafi, Benoît Savary, Alessandro Lameiras Koe...
Personalized recommendation rationale:

The paper targets video emotion recognition, a computer-vision topic with no direct connection to core techniques for recommender systems, search, or advertising. Although it mentions test-time personalization, its application scenario (video emotion analysis) and core method (action-unit prompting) have no clear relevance to the directions of current interest, such as LLM techniques, Transformer architecture improvements, or unified modeling of heterogeneous data.

2026-03-30 03:39:42 | arXiv:2603.27999v1 |
cs.CV
Full abstract
Personalization in emotion recognition (ER) is essential for an accurate interpretation of subtle and subject-specific expressive patterns. Recent advances in vision-language models (VLMs) such as CLIP demonstrate strong potential for leveraging joint image-text representations in ER. However, CLIP-based methods either depend on CLIP's contrastive pretraining or on LLMs to generate descriptive text prompts, which are noisy, computationally expensive, and fail to capture fine-grained expressions, leading to degraded performance. In this work, we leverage Action Units (AUs) as structured textual prompts within CLIP to model fine-grained facial expressions. AUs encode the subtle muscle activations underlying expressions, providing localized and interpretable semantic cues for more robust ER. We introduce CLIP-AU, a lightweight AU-guided temporal learning method that integrates interpretable AU semantics into CLIP. It learns generic, subject-agnostic representations by aligning AU prompts with facial dynamics, enabling fine-grained ER without CLIP fine-tuning or LLM-generated text supervision. Although CLIP-AU models fine-grained AU semantics, it does not adapt to subject-specific variability in subtle expressions. To address this limitation, we propose CLIP-AUTT, a video-based test-time personalization method that dynamically adapts AU prompts to videos from unseen subjects. By combining entropy-guided temporal window selection with prompt tuning, CLIP-AUTT enables subject-specific adaptation while preserving temporal consistency. Our extensive experiments on three challenging video-based subtle ER datasets, BioVid, StressID, and BAH, indicate that CLIP-AU and CLIP-AUTT outperform state-of-the-art CLIP-based FER and TTA methods, achieving robust and personalized subtle ER.
With a Little Help From My Friends: Collective Manipulation in Risk-Controlling Recommender Systems
Giovanni De Toni, Cristian Consonni, Erasmo Purificato, Emilia Gomez, Bruno Lepr...
Personalized recommendation rationale:

The title concerns risk control and manipulation in recommender systems, which falls under fairness, ethics, and safety, explicitly listed among the irrelevant topics. Although recommender systems are mentioned, the core focus is non-technical manipulation and risk-control issues, unrelated to the technical directions of current interest such as core recommender-system algorithm advances, LLM applications, or Transformer architecture improvements.

2026-03-30 14:14:48 | arXiv:2603.28476v1 |
cs.IR cs.LG cs.SI
Full abstract
Recommendation systems have become central gatekeepers of online information, shaping user behaviour across a wide range of activities. In response, users increasingly organize and coordinate to steer algorithmic outcomes toward diverse goals, such as promoting relevant content or limiting harmful material, relying on platform affordances -- such as likes, reviews, or ratings. While these mechanisms can serve beneficial purposes, they can also be leveraged for adversarial manipulation, particularly in systems where such feedback directly informs safety guarantees. In this paper, we study this vulnerability in recently proposed risk-controlling recommender systems, which use binary user feedback (e.g., "Not Interested") to provably limit exposure to unwanted content via conformal risk control. We empirically demonstrate that their reliance on aggregate feedback signals makes them inherently susceptible to coordinated adversarial user behaviour. Using data from a large-scale online video-sharing platform, we show that a small coordinated group (comprising only 1% of the user population) can induce up to a 20% degradation in nDCG for non-adversarial users by exploiting the affordances provided by risk-controlling recommender systems. We evaluate simple, realistic attack strategies that require little to no knowledge of the underlying recommendation algorithm and find that, while coordinated users can significantly harm overall recommendation quality, they cannot selectively suppress specific content groups through reporting alone. Finally, we propose a mitigation strategy that shifts guarantees from the group level to the user level, showing empirically how it can reduce the impact of adversarial coordinated behaviour while ensuring personalized safety for individuals.
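The attack surface is easy to see in a toy simulation (illustrative only; the paper's conformal risk control machinery is more involved). Here items are suppressed when their aggregate "not interested" rate exceeds a tolerance, so a coordinated 1% of users who flag everything can push borderline items over the line.

```python
import numpy as np

rng = np.random.default_rng(7)
n_items, n_users = 500, 10_000
true_rate = rng.uniform(0.0, 0.2, size=n_items)   # true "not interested" rates
reports = rng.binomial(n_users, true_rate)        # honest aggregate feedback

alpha = 0.15                                      # tolerated unwanted-rate
suppressed_honest = (reports / n_users) > alpha

n_adv = n_users // 100                            # 1% coordinated adversaries
reports_attacked = reports + n_adv                # they flag every item
suppressed_attacked = (reports_attacked / n_users) > alpha

extra = int(suppressed_attacked.sum() - suppressed_honest.sum())
```

Over-suppression of content that honest users wanted is exactly the ranking-quality degradation the paper measures; its proposed mitigation calibrates per user instead of over this global aggregate.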
Quid est VERITAS? A Modular Framework for Archival Document Analysis
Leonardo Bassanini, Ludovico Biancardi, Alfio Ferrara, Andrea Gamberini, Sergio ...
Personalized recommendation rationale:

The title centers on archival document analysis, a domain-specific application (history/archival studies) with no direct connection to the current focus areas: core advances in recommender systems, search, or advertising; LLM/Transformer architecture innovation; or unified modeling of heterogeneous data. Its modular framework design may involve document processing, but nothing indicates a potential link to RecSys/Search/Ads applications.

2026-03-30 07:14:51 | arXiv:2603.28108v1 |
cs.DL cs.AI cs.IR
Full abstract
The digitisation of historical documents has traditionally been conceived as a process limited to character-level transcription, producing flat text that lacks the structural and semantic information necessary for substantive computational analysis. We present VERITAS (Vision-Enhanced Reading, Interpretation, and Transcription of Archival Sources), a modular, model-agnostic framework that reconceptualises digitisation as an integrated workflow encompassing transcription, layout analysis, and semantic enrichment. The pipeline is organised into four stages - Preprocessing, Extraction, Refinement, and Enrichment - and employs a schema-driven architecture that allows researchers to declaratively specify their extraction objectives. We evaluate VERITAS on the critical edition of Bernardino Corio's Storia di Milano, a Renaissance chronicle of over 1,600 pages. Results demonstrate that the pipeline achieves a 67.6% relative reduction in word error rate compared to a commercial OCR baseline, with a threefold reduction in end-to-end processing time when accounting for manual correction. We further illustrate the downstream utility of the pipeline's output by querying the transcribed corpus through a retrieval-augmented generation system, demonstrating its capacity to support historical inquiry.
EpiScreen: Early Epilepsy Detection from Electronic Health Records with Large Language Models
Shuang Zhou, Kai Yu, Zaifu Zhan, Huixue Zhou, Min Zeng, Feng Xie, Zhiyi Sha, Rui...
Personalized recommendation rationale:

The title clearly targets epilepsy detection in the medical domain, which falls under your explicitly excluded "Medical, Biology, Chemistry, Physics or other domain-specific applications". Although it uses large language models, the application scenario is entirely unrelated to recommender systems, search, or advertising, with no potential applicability.

2026-03-30 17:16:08 | arXiv:2603.28698v1 |
cs.CL
Full abstract
Epilepsy and psychogenic non-epileptic seizures often present with similar seizure-like manifestations but require fundamentally different management strategies. Misdiagnosis is common and can lead to prolonged diagnostic delays, unnecessary treatments, and substantial patient morbidity. Although prolonged video-electroencephalography is the diagnostic gold standard, its high cost and limited accessibility hinder timely diagnosis. Here, we developed a low-cost, effective approach, EpiScreen, for early epilepsy detection by utilizing routinely collected clinical notes from electronic health records. Through fine-tuning large language models on labeled notes, EpiScreen achieved an AUC of up to 0.875 on the MIMIC-IV dataset and 0.980 on a private cohort of the University of Minnesota. In a clinician-AI collaboration setting, EpiScreen-assisted neurologists outperformed unaided experts by up to 10.9%. Overall, this study demonstrates that EpiScreen supports early epilepsy detection, facilitating timely and cost-effective screening that may reduce diagnostic delays and avoid unnecessary interventions, particularly in resource-limited regions.
The Ultimate Tutorial for AI-driven Scale Development in Generative Psychometrics: Releasing AIGENIE from its Bottle
Lara Russell-Lasalandra, Hudson Golino, Luis Eduardo Garrido, Alexander P. Chris...
Personalized recommendation rationale:

The title clearly points to AIGC (AI-generated content) and generative applications, an explicitly irrelevant topic. "Generative Psychometrics" in the title indicates a domain-specific application (psychological measurement) rather than core technology for recommender systems, search, or advertising. Nothing suggests the work involves recommender-system architectures, Transformer efficiency improvements, LLMs for recommendation, or unified modeling of heterogeneous data.

2026-03-30 16:25:37 | arXiv:2603.28643v1 |
cs.AI cs.CL cs.HC
Full abstract
Psychological scale development has traditionally required extensive expert involvement, iterative revision, and large-scale pilot testing before psychometric evaluation can begin. The `AIGENIE` R package implements the AI-GENIE framework (Automatic Item Generation with Network-Integrated Evaluation), which integrates large language model (LLM) text generation with network psychometric methods to automate the early stages of this process. The package generates candidate item pools using LLMs, transforms them into high-dimensional embeddings, and applies a multi-step reduction pipeline -- Exploratory Graph Analysis (EGA), Unique Variable Analysis (UVA), and bootstrap EGA -- to produce structurally validated item pools entirely *in silico*. This tutorial introduces the package across six parts: installation and setup, understanding Application Programming Interfaces (APIs), text generation, item generation, the `AIGENIE` function, and the `GENIE` function. Two running examples illustrate the package's use: the Big Five personality model (a well-established construct) and AI Anxiety (an emerging construct). The package supports multiple LLM providers (OpenAI, Anthropic, Groq, HuggingFace, and local models), offers a fully offline mode with no external API calls, and provides the `GENIE()` function for researchers who wish to apply the psychometric reduction pipeline to existing item pools regardless of their origin. The `AIGENIE` package is freely available on R-universe at https://laralee.r-universe.dev/AIGENIE.
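The package itself is in R, but the redundancy-reduction step at the heart of the pipeline (UVA drops items that are statistically interchangeable) can be mimicked in a few lines of Python. Toy embeddings stand in for the LLM embeddings, and a simple cosine-similarity threshold stands in for UVA's network-based criterion.

```python
import numpy as np

items = [
    "I enjoy meeting new people.",
    "I like getting to know strangers.",   # near-duplicate of the first
    "I keep my workspace organized.",
    "I often feel anxious about AI.",
]

rng = np.random.default_rng(3)
base = rng.normal(size=(3, 8))                       # three latent constructs
emb = np.stack([base[0],
                base[0] + 0.05 * rng.normal(size=8), # redundant item
                base[1],
                base[2]])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)    # unit-norm embeddings

def reduce_pool(items, emb, threshold=0.95):
    """Greedily keep an item only if it is not too similar to any kept item."""
    kept = []
    for i in range(len(items)):
        if all(float(emb[i] @ emb[j]) < threshold for j in kept):
            kept.append(i)
    return [items[i] for i in kept]

pool = reduce_pool(items, emb)
```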
Training data generation for context-dependent rubric-based short answer grading
Pavel Šindelář, Dávid Slivka, Christopher Bouma, Filip Prášil, Ondřej Bojar
Personalized recommendation rationale:

The paper focuses on automatic grading systems in education, a domain-specific application rather than core RecSys/Search/Ads technology. Although it involves data generation, it targets educational assessment, an unrelated field, and shows no potential application value for recommender systems, search, or advertising.

2026-03-30 14:59:53 | arXiv:2603.28537v1 |
cs.CL
Full abstract
Every 4 years, the PISA test is administered by the OECD to test the knowledge of teenage students worldwide and allow for comparisons of educational systems. However, having to avoid language differences and annotator bias makes the grading of student answers challenging. For these reasons, it would be interesting to compare methods of automatic student answer grading. To train some of these methods, which require machine learning, or to compute parameters or select hyperparameters for those that do not, a large amount of domain-specific data is needed. In this work, we explore a small number of methods for creating a large-scale training dataset using only a relatively small confidential dataset as a reference, leveraging a set of very simple derived text formats to preserve confidentiality. Using these methods, we successfully created three surrogate datasets that are, at the very least, superficially more similar to the reference dataset than purely the result of prompt-based generation. Early experiments suggest one of these approaches might also lead to improved model training.
EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces
Léane Jourdan, Julien Aubert-Béduchaud, Yannis Chupin, Marah Baccari, Florian Bo...
Personalized recommendation rationale:

The title centers on building a dataset of scientific-writing revisions, a data-collection effort in a specific domain (scientific writing) with no direct connection to the core focus areas of recommender systems, search, advertising, LLM techniques, or Transformer architectures. It is a pure dataset-construction effort and involves no technique or method likely to transfer to RecSys/Search/Ads.

2026-03-30 14:47:04 | arXiv:2603.28515v1 |
cs.CL
Full abstract
Scientific writing is an iterative process that generates rich revision traces, yet publicly available resources typically expose only final or near-final versions of papers. This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing. We introduce EarlySciRev, a dataset of early-stage scientific text revisions automatically extracted from arXiv LaTeX source files. Our key observation is that commented-out text in LaTeX often preserves discarded or alternative formulations written by the authors themselves. By aligning commented segments with nearby final text, we extract paragraph-level candidate revision pairs and apply LLM-based filtering to retain genuine revisions. Starting from 1.28M candidate pairs, our pipeline yields 578k validated revision pairs, grounded in authentic early drafting traces. We additionally provide a human-annotated benchmark for revision detection. EarlySciRev complements existing resources focused on late-stage revisions or synthetic rewrites and supports research on scientific writing dynamics, revision modelling, and LLM-assisted editing.
TIEG-Youpu Solution for NeurIPS 2022 WikiKG90Mv2-LSC
Feng Nie, Zhixiu Ye, Sifa Xie, Shuang Wu, Xin Yuan, Liang Yao, Jiazhen Peng, Xu ...
Personalized recommendation rationale:

The title describes a solution to a specific competition, involving knowledge-graph completion, a domain-specific application. It does not touch on core advances in recommender systems, search, or advertising; LLM/Transformer techniques; multimodal modeling; or any other direction related to the current focus areas.

2026-03-30 14:45:30 | arXiv:2603.28512v1 |
cs.CL
Full abstract
WikiKG90Mv2, the subject of the NeurIPS 2022 LSC challenge, is a large encyclopedic knowledge graph. Embedding knowledge graphs into continuous vector spaces is important for many practical applications, such as knowledge acquisition, question answering, and recommendation systems. Compared to existing knowledge graphs, WikiKG90Mv2 is at a much larger scale, comprising more than 90 million entities. Both efficiency and accuracy must be considered when building graph embedding models for knowledge graphs at this scale. To this end, we follow the retrieve-then-re-rank pipeline and make novel modifications in both the retrieval and re-ranking stages. Specifically, we propose a priority-infilling retrieval model to obtain candidates that are structurally and semantically similar. We then propose an ensemble-based re-ranking model with neighbor-enhanced representations to produce final link-prediction results over the retrieved candidates. Experimental results show that our method outperforms existing baselines, improving validation-set MRR from 0.2342 to 0.2839.
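The retrieve-then-re-rank pattern the team follows is a standard two-stage design: a cheap scorer prunes the huge entity space to a handful of candidates, and a heavier model orders only those. A generic sketch, with toy vectors and a hypothetical neighbor-derived boost rather than the team's actual models:

```python
def retrieve(query_vec, entity_vecs, k=3):
    # Stage 1: cheap dot-product scoring over all candidate entities.
    scored = [(sum(q * e for q, e in zip(query_vec, vec)), eid)
              for eid, vec in entity_vecs.items()]
    scored.sort(reverse=True)
    return [eid for _, eid in scored[:k]]

def rerank(candidates, neighbor_boost):
    # Stage 2: re-score only the retrieved candidates with richer
    # (here: neighbor-derived) features.
    return sorted(candidates, key=lambda eid: neighbor_boost.get(eid, 0.0),
                  reverse=True)

entity_vecs = {"e1": [1.0, 0.0], "e2": [0.9, 0.1],
               "e3": [0.0, 1.0], "e4": [0.8, 0.3]}
candidates = retrieve([1.0, 0.0], entity_vecs)          # prune to top 3
ranked = rerank(candidates, {"e2": 2.0, "e1": 1.0, "e4": 0.5})
```

The split matters at 90M+ entities because only the first stage ever touches the full graph; the expensive features are computed for k items, not 90 million.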
Structural-Ambiguity-Aware Translation from Natural Language to Signal Temporal Logic
Kosei Fushimi, Kazunobu Serizawa, Junya Ikemoto, Kazumune Hashimoto
Personalized recommendation rationale:

The paper focuses on translating natural language into formal logic, in the area of formal verification and temporal logic. Although it involves natural language processing, its core is formal methods and verification techniques, with no direct connection to practical applications in recommender systems, search, or advertising; the paper shows no potential application value in those settings.

2026-03-30 13:33:11 | arXiv:2603.28426v1 |
cs.CL cs.SC
Full abstract
Signal Temporal Logic (STL) is widely used to specify timed and safety-critical tasks for cyber-physical systems, but writing STL formulas directly is difficult for non-expert users. Natural language (NL) provides a convenient interface, yet its inherent structural ambiguity makes one-to-one translation into STL unreliable. In this paper, we propose an \textit{ambiguity-preserving} method for translating NL task descriptions into STL candidate formulas. The key idea is to retain multiple plausible syntactic analyses instead of forcing a single interpretation at the parsing stage. To this end, we develop a three-stage pipeline based on Combinatory Categorial Grammar (CCG): ambiguity-preserving $n$-best parsing, STL-oriented template-based semantic composition, and canonicalization with score aggregation. The proposed method outputs a deduplicated set of STL candidates with plausibility scores, thereby explicitly representing multiple possible formal interpretations of an ambiguous instruction. In contrast to existing one-best NL-to-logic translation methods, the proposed approach is designed to preserve attachment and scope ambiguity. Case studies on representative task descriptions demonstrate that the method generates multiple STL candidates for genuinely ambiguous inputs while collapsing unambiguous or canonically equivalent derivations to a single STL formula.
LombardoGraphia: Automatic Classification of Lombard Orthography Variants
Edoardo Signoroni, Pavel Rychlý
Personalized recommendation rationale:

The paper concerns classifying orthographic variants in historical linguistics, a highly specialized linguistics topic. It has no direct or indirect connection to recommender systems, search, advertising, or related enabling technologies (such as LLMs or Transformer architectures), and falls entirely outside all of your focus areas.

2026-03-30 13:28:13 | arXiv:2603.28418v1 |
cs.CL
Full abstract
Lombard, an underresourced language variety spoken by approximately 3.8 million people in Northern Italy and Southern Switzerland, lacks a unified orthographic standard. Multiple orthographic systems exist, creating challenges for NLP resource development and model training. This paper presents the first study of automatic Lombard orthography classification and LombardoGraphia, a curated corpus of 11,186 Lombard Wikipedia samples tagged across 9 orthographic variants, and models for automatic orthography classification. We curate the dataset, processing and filtering raw Wikipedia content to ensure text suitable for orthographic analysis. We train 24 traditional and neural classification models with various features and encoding levels. Our best models achieve 96.06% and 85.78% overall and average class accuracy, though performance on minority classes remains challenging due to data imbalance. Our work provides crucial infrastructure for building variety-aware NLP resources for Lombard.
Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design
Bin Zhu, Qianghuai Jia, Tian Lan, Junyang Ren, Feng Gu, Feihu Jiang, Longyue Wan...
Personalized recommendation rationale:

The title centers on verification-centric design for research agents, which belongs to general AI agent systems and is unrelated to core RecSys/Search/Ads technology. The title mentions no keywords directly tied to Transformer architectures, LLM techniques, multimodal modeling, or the RecSys/Search/Ads domain, so it matches none of your focus areas.

2026-03-30 12:42:02 | arXiv:2603.28376v1 |
cs.CL cs.AI
Full abstract
Deep research agents autonomously conduct open-ended investigations, integrating complex information retrieval with multi-step reasoning across diverse sources to solve real-world problems. To sustain this capability on long-horizon tasks, reliable verification is critical during both training and inference. A major bottleneck in existing paradigms stems from the lack of explicit verification mechanisms in QA data synthesis, trajectory construction, and test-time scaling. Errors introduced at each stage propagate downstream and degrade the overall agent performance. To address this, we present Marco DeepResearch, a deep research agent optimized with a verification-centric framework design at three levels: \textbf{(1)~QA Data Synthesis:} We introduce verification mechanisms to graph-based and agent-based QA synthesis to control question difficulty while ensuring answers are unique and correct; \textbf{(2)~Trajectory Construction:} We design a verification-driven trajectory synthesis method that injects explicit verification patterns into training trajectories; and \textbf{(3)~Test-time scaling:} We use Marco DeepResearch itself as a verifier at inference time and effectively improve performance on challenging questions. Extensive experimental results demonstrate that our proposed Marco DeepResearch agent significantly outperforms 8B-scale deep research agents on most challenging benchmarks, such as BrowseComp and BrowseComp-ZH. Crucially, under a maximum budget of 600 tool calls, Marco DeepResearch even surpasses or approaches several 30B-scale agents, like Tongyi DeepResearch-30B.
Tailoring AI-Driven Reading Scaffolds to the Distinct Needs of Neurodiverse Learners
Soufiane Jhilal, Eleonora Pasqua, Caterina Marchesi, Riccardo Corradi, Martina G...
Personalized recommendation rationale:

The title clearly focuses on an educational application for neurodiverse learners, a domain-specific (medical/educational) application entirely unrelated to core RecSys/Search/Ads advances, foundational LLM techniques, Transformer architecture improvements, or direct applications. The "AI-Driven Reading Scaffolds" in the title belong to educational technology, outside the technical areas of current interest.

2026-03-30 12:38:41 | arXiv:2603.28370v1 |
cs.CL cs.HC
Full abstract
Neurodiverse learners often require reading supports, yet increasing scaffold richness can sometimes overload attention and working memory rather than improve comprehension. Grounded in the Construction-Integration model and a contingent scaffolding perspective, we examine how structural versus semantic scaffolds shape comprehension and reading experience in a supervised inclusive context. Using an adapted reading interface, we compared four modalities: unmodified text, sentence-segmented text, segmented text with pictograms, and segmented text with pictograms plus keyword labels. In a within-subject pilot with 14 primary-school learners with special educational needs and disabilities, we measured reading comprehension using standardized questions and collected brief child- and therapist-reported experience measures alongside open-ended feedback. Results highlight heterogeneous responses as some learners showed patterns consistent with benefits from segmentation and pictograms, while others showed patterns consistent with increased coordination costs when visual scaffolds were introduced. Experience ratings showed limited differences between modalities, with some apparent effects linked to clinical complexity, particularly for perceived ease of understanding. Open-ended feedback of the learners frequently requested simpler wording and additional visual supports. These findings suggest that no single scaffold is universally optimal, reinforcing the need for calibrated, adjustable scaffolding and provide design implications for human-AI co-regulation in supervised inclusive reading contexts.
Not All Subjectivity Is the Same! Defining Desiderata for the Evaluation of Subjectivity in NLP
Urja Khurana, Michiel van der Meer, Enrico Liscio, Antske Fokkens, Pradeep K. Mu...
Personalized recommendation rationale:

The paper focuses on benchmarks and methodology for evaluating subjectivity in NLP, a purely NLP-evaluation topic. Although it involves language processing, the title clearly points to NLP evaluation benchmarks (explicitly listed under "Irrelevant Topics" as evaluation benchmarks) and shows no technical application or architectural innovation relevant to recommender systems, search, or advertising.

2026-03-30 12:21:32 | arXiv:2603.28351v1 |
cs.CL
Full abstract
Subjective judgments are part of several NLP datasets and recent work is increasingly prioritizing models whose outputs reflect this diversity of perspectives. Such responses allow us to shed light on minority voices, which are frequently marginalized or obscured by dominant perspectives. It remains a question whether our evaluation practices align with these models' objectives. This position paper proposes seven evaluation desiderata for subjectivity-sensitive models, rooted in how subjectivity is represented in NLP data and models. The desiderata are constructed in a top-down approach, keeping in mind the user-centric impact of such models. We scan the experimental setup of 60 papers and show that various aspects of subjectivity are still understudied: the distinction between ambiguous and polyphonic input, whether subjectivity is effectively expressed to the user, and a lack of interplay between different desiderata, amongst other gaps.
Coconstructions in spoken data: UD annotation guidelines and first results
Ludovica Pannitto, Sylvain Kahane, Kaja Dobrovoljc, Elena Battaglia, Bruno Guill...
Personalized recommendation rationale:

The paper focuses on linguistic analysis and annotation guidelines for spoken data, a purely linguistic line of research. It does not involve core RecSys/Search/Ads technology and shows no potential application value in LLMs, Transformer architectures, or heterogeneous-data modeling.

2026-03-30 10:45:15 | arXiv:2603.28261v1 |
cs.CL
Full abstract
The paper proposes annotation guidelines for syntactic dependencies that span across speaker turns - including collaborative coconstructions proper, wh-question answers, and backchannels - in spoken language treebanks within the Universal Dependencies framework. Two representations are proposed: a speaker-based representation following the segmentation into speech turns, and a dependency-based representation with dependencies across speech turns. New propositions are also put forward to distinguish between reformulations and repairs, and to promote elements in unfinished phrases.
Versteasch du mi? Computational and Socio-Linguistic Perspectives on GenAI, LLMs, and Non-Standard Language
Verena Platzgummer, John McCrae, Sina Ahmadi
Personalized recommendation rationale:

The title centers on sociolinguistic and computational analysis of non-standard language, a purely linguistic NLP topic. Although LLMs are mentioned, the core concerns (non-standard language processing, sociolinguistics) have no direct link to applications in recommender systems, search, or advertising, nor do they involve enabling technologies such as Transformer architecture improvements or cross-modal modeling.

2026-03-30 09:34:41 | arXiv:2603.28213v1 |
cs.CL
Full abstract
The design of Large Language Models and generative artificial intelligence has been shown to be "unfair" to less-spoken languages and to deepen the digital language divide. Critical sociolinguistic work has also argued that these technologies are not only made possible by prior socio-historical processes of linguistic standardisation, often grounded in European nationalist and colonial projects, but also exacerbate epistemologies of language as "monolithic, monolingual, syntactically standardized systems of meaning". In our paper, we draw on earlier work on the intersections of technology and language policy and bring our respective expertise in critical sociolinguistics and computational linguistics to bear on an interrogation of these arguments. We take two different complexes of non-standard linguistic varieties in our respective repertoires--South Tyrolean dialects, which are widely used in informal communication in South Tyrol, Italy, as well as varieties of Kurdish--as starting points to an interdisciplinary exploration of the intersections between GenAI and linguistic variation and standardisation. We discuss both how LLMs can be made to deal with nonstandard language from a technical perspective, and whether, when or how this can contribute to "democratic and decolonial digital and machine learning strategies", which has direct policy implications.
DongYuan: An LLM-Based Framework for Integrative Chinese and Western Medicine Spleen-Stomach Disorders Diagnosis
Hua Li, Yingying Li, Xiaobin Feng, Xinyi Fu, Lifeng Dong, Qingfeng Yang, Yanzhe ...
Personalized recommendation rationale:

The title clearly targets the medical domain (integrative Chinese and Western medicine diagnosis of spleen-stomach disorders), which falls squarely under the irrelevant topic "Medical, Biology, Chemistry, Physics or other domain-specific applications". Although LLM techniques are used, the application scenario is entirely unrelated to recommender systems, search, or advertising, with no potential applicability.

2026-03-30 08:56:24 | arXiv:2603.28191v1 |
cs.CL
Full abstract
The clinical burden of spleen-stomach disorders is substantial. While large language models (LLMs) offer new potential for medical applications, they face three major challenges in the context of integrative Chinese and Western medicine (ICWM): a lack of high-quality data, the absence of models capable of effectively integrating the reasoning logic of traditional Chinese medicine (TCM) syndrome differentiation with that of Western medical (WM) disease diagnosis, and the shortage of a standardized evaluation benchmark. To address these interrelated challenges, we propose DongYuan, an ICWM spleen-stomach diagnostic framework. Specifically, three ICWM datasets (SSDF-Syndrome, SSDF-Dialogue, and SSDF-PD) were curated to fill the gap in high-quality data for spleen-stomach disorders. We then developed SSDF-Core, a core diagnostic LLM that acquires robust ICWM reasoning capabilities through a two-stage training regimen of supervised fine-tuning (SFT) and direct preference optimization (DPO), and complemented it with SSDF-Navigator, a pluggable consultation navigation model designed to optimize clinical inquiry strategies. Additionally, we established SSDF-Bench, a comprehensive evaluation benchmark focused on ICWM diagnosis of spleen-stomach disorders. Experimental results demonstrate that SSDF-Core significantly outperforms 12 mainstream baselines on SSDF-Bench. DongYuan lays a solid methodological foundation and provides practical technical references for the future development of intelligent ICWM diagnostic systems.
Does Claude's Constitution Have a Culture?
Parham Pourdavood
Personalized recommendation rationale:

The title examines the constitution and cultural properties of an AI model (Claude), a topic in AI ethics, values, and governance, entirely unrelated to your focus areas: technical advances in RecSys/Search/Ads, LLM applications, Transformer architectures, or heterogeneous-data modeling.

2026-03-30 07:38:46 | arXiv:2603.28123v1 |
cs.CY cs.AI cs.CL
Full abstract
Constitutional AI (CAI) aligns language models with explicitly stated normative principles, offering a transparent alternative to implicit alignment through human feedback alone. However, because constitutions are authored by specific groups of people, the resulting models may reflect particular cultural perspectives. We investigate this question by evaluating Anthropic's Claude Sonnet on 55 World Values Survey items, selected for high cross-cultural variance across six value domains and administered as both direct survey questions and naturalistic advice-seeking scenarios. Comparing Claude's responses to country-level data from 90 nations, we find that Claude's value profile most closely resembles those of Northern European and Anglophone countries, but on a majority of items extends beyond the range of all surveyed populations. When users provide cultural context, Claude adjusts its rhetorical framing but not its substantive value positions, with effect sizes indistinguishable from zero across all twelve tested countries. An ablation removing the system prompt increases refusals but does not alter the values expressed when responses are given, and replication on a smaller model (Claude Haiku) confirms the same cultural profile across model sizes. These findings suggest that when a constitution is authored within the same cultural tradition that dominates the training data, constitutional alignment may codify existing cultural biases rather than correct them--producing a value floor that surface-level interventions cannot meaningfully shift. We discuss the compounding nature of this risk and the need for globally representative constitution-authoring processes.
MOSS-VoiceGenerator: Create Realistic Voices with Natural Language Descriptions
Kexin Huang, Liwei Fan, Botian Jiang, Yaozhou Jiang, Qian Tu, Jie Zhu, Yuqian Zh...
Personalized recommendation rationale:

The paper focuses on speech-generation technology, a pure speech-processing topic. Although the title mentions natural-language descriptions, these are used to control voice-generation parameters rather than applied to core tasks in recommender systems, search, or advertising. The technique has no obvious potential application for improving recommendation, search, or advertising systems.

2026-03-30 06:40:59 | arXiv:2603.28086v1 |
cs.SD cs.AI cs.CL
Full abstract
Voice design from natural language aims to generate speaker timbres directly from free-form textual descriptions, allowing users to create voices tailored to specific roles, personalities, and emotions. Such controllable voice creation benefits a wide range of downstream applications, including storytelling, game dubbing, role-play agents, and conversational assistants, making it a significant task for modern Text-to-Speech models. However, existing models are largely trained on carefully recorded studio data, which produces speech that is clean and well-articulated, yet lacks the lived-in qualities of real human voices. To address these limitations, we present MOSS-VoiceGenerator, an open-source instruction-driven voice generation model that creates new timbres directly from natural language prompts. Motivated by the hypothesis that exposure to real-world acoustic variation produces more perceptually natural voices, we train on large-scale expressive speech data sourced from cinematic content. Subjective preference studies demonstrate its superiority in overall performance, instruction-following, and naturalness compared to other voice design models.
Who Wrote the Book? Detecting and Attributing LLM Ghostwriters
Anudeex Shetty, Qiongkai Xu, Olga Ohrimenko, Jey Han Lau
Personalized recommendation rationale:

The title centers on detecting and attributing LLM-generated content, which falls under LLM hallucination, evaluation, or content-authenticity verification. It is unrelated to the core advances, enabling technologies, or direct applications in recommendation, search, or advertising that you follow.

2026-03-30 05:41:12 | arXiv:2603.28054v1 |
cs.CL
View full abstract
In this paper, we introduce GhostWriteBench, a dataset for LLM authorship attribution. It comprises long-form texts (50K+ words per book) generated by frontier LLMs, and is designed to test generalisation across multiple out-of-distribution (OOD) dimensions, including domain and unseen LLM author. We also propose TRACE -- a novel fingerprinting method that is interpretable and lightweight -- that works for both open- and closed-source models. TRACE creates the fingerprint by capturing token-level transition patterns (e.g., word rank) estimated by another lightweight language model. Experiments on GhostWriteBench demonstrate that TRACE achieves state-of-the-art performance, remains robust in OOD settings, and works well in limited training data scenarios.
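TRACE's fingerprint is described as a summary of token-level transition patterns (e.g., word rank) estimated by a lightweight language model. The sketch below illustrates only the general rank-histogram idea with a toy bigram "model" standing in for that scorer; it is not the paper's implementation:

```python
from collections import Counter

def token_rank(prev, tok, bigram_counts, vocab):
    """Rank of `tok` among all candidate next words after `prev`,
    under a toy bigram model (1 = most likely)."""
    scores = {w: bigram_counts.get((prev, w), 0) for w in vocab}
    ranked = sorted(vocab, key=lambda w: (-scores[w], w))
    return ranked.index(tok) + 1

def fingerprint(tokens, bigram_counts, vocab, max_rank=3):
    """Normalized histogram of per-token ranks: a crude stylistic
    fingerprint of whichever model produced `tokens`."""
    ranks = [min(token_rank(p, t, bigram_counts, vocab), max_rank)
             for p, t in zip(tokens, tokens[1:])]
    counts = Counter(ranks)
    total = len(ranks)
    return [counts.get(r, 0) / total for r in range(1, max_rank + 1)]

vocab = ["a", "b", "c"]
bigrams = Counter([("a", "b"), ("a", "b"), ("b", "c")])
print(fingerprint(["a", "b", "c"], bigrams, vocab))  # → [1.0, 0.0, 0.0]
```

Two texts from the same LLM author would then be expected to yield similar rank histograms, so attribution reduces to comparing fingerprint vectors.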
Transfer Learning for an Endangered Slavic Variety: Dependency Parsing in Pomak Across Contact-Shaped Dialects
Sercan Karakaş
Personalized recommendation rationale:

This paper focuses on dependency parsing for an endangered Slavic variety, a language-specific NLP application. It has no direct connection to the core areas of current interest (recommender systems, search and advertising, LLM techniques, Transformer architectures, heterogeneous data modeling) and involves no technique likely to transfer to those areas.

2026-03-30 04:54:13 | arXiv:2603.28033v1 |
cs.CL
View full abstract
This paper presents new resources and baselines for Dependency Parsing in Pomak, an endangered Eastern South Slavic language with substantial dialectal variation and no widely adopted standard. We focus on the variety spoken in Turkey (Uzunköprü) and ask how well a dependency parser trained on the existing Pomak Universal Dependencies treebank, which was built primarily from the variety that is spoken in Greece, transfers across dialects. We run two experimental phases. First, we train a parser on the Greek-variety UD data and evaluate zero-shot transfer to Turkish-variety Pomak, quantifying the impact of phonological and morphosyntactic differences. Second, we introduce a new manually annotated Turkish-variety Pomak corpus of 650 sentences and show that, despite its small size, targeted fine-tuning substantially improves accuracy; performance is further boosted by cross-variety transfer learning that combines the two dialects.
Top-down string-to-dependency Neural Machine Translation
Shuhei Kondo, Katsuhito Sudoh, Yuji Matsumoto
Personalized recommendation rationale:

This paper focuses on a specific neural machine translation technique (string-to-dependency decoding) and is pure NLP research. Although it involves sequence modeling, it lacks any clear connection to recommendation, search, or advertising, and mentions no Transformer architecture improvement or LLM technique that could apply to those areas.

2026-03-30 01:21:38 | arXiv:2603.27938v1 |
cs.CL
View full abstract
Most modern neural machine translation (NMT) models are based on an encoder-decoder framework with an attention mechanism. While they perform well on standard datasets, they can have trouble translating long inputs that are rare or unseen during training. Incorporating target syntax is one approach to dealing with such length-related problems. We propose a novel syntactic decoder that generates a target-language dependency tree in a top-down, left-to-right order. Experiments show that the proposed top-down string-to-tree decoding generalizes better than conventional sequence-to-sequence decoding in translating long inputs that are not observed in the training data.
Gen-Searcher: Reinforcing Agentic Search for Image Generation
Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, ...
Personalized recommendation rationale:

The title explicitly focuses on image generation, which falls under AIGC/content generation, an explicitly irrelevant topic. Although "search" appears, it refers to search on behalf of a generative model rather than a retrieval system in recommendation/search/advertising. There is no evidence the technique would apply to recommender systems, search, or advertising.

2026-03-30 17:59:56 | arXiv:2603.28767v1 |
cs.CV
View full abstract
Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, thus often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, as the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models from multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.
HandX: Scaling Bimanual Motion and Interaction Generation
Zimu Zhang, Yucheng Zhang, Xiyan Xu, Ziyin Wang, Sirui Xu, Kai Zhou, Bing Zhou, ...
Personalized recommendation rationale:

The title concerns bimanual motion generation in robotics or computer vision, a purely robotics/3D-vision direction. It mentions no keyword relevant to recommendation, search, or advertising (e.g., user behavior, ranking, personalization, attention mechanisms) and does not involve LLMs, Transformer architectures, or multimodal modeling techniques. The work focuses on generating physical interactions and has no direct connection to the RecSys/Search/Ads core areas or enabling technologies you follow.

2026-03-30 17:59:49 | arXiv:2603.28766v1 |
cs.CV
View full abstract
Synthesizing human motion has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior, finger articulation, contact timing, and inter-hand coordination, and existing resources lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. To fill this gap, we present HandX, a unified foundation spanning data, annotation, and evaluation. We consolidate and filter existing datasets for quality, and collect a new motion-capture dataset targeting underrepresented bimanual interactions with detailed finger dynamics. For scalable annotation, we introduce a decoupled strategy that extracts representative motion features, e.g., contact events and finger flexion, and then leverages reasoning from large language models to produce fine-grained, semantically rich descriptions aligned with these features. Building on the resulting data and annotations, we benchmark diffusion and autoregressive models with versatile conditioning modes. Experiments demonstrate high-quality dexterous motion generation, supported by our newly proposed hand-focused metrics. We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion. Our dataset is released to support future research.
SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild
Patrick Rim, Kevin Harris, Braden Copple, Shangchen Han, Xu Xie, Ivan Shugurov, ...
Personalized recommendation rationale:

The title explicitly focuses on 3D vision (3D hands and objects) and scene capture, which is pure computer vision research. Although it involves in-the-wild scenes, it mentions no application scenario, technique, or data modality related to recommendation, search, or advertising. Under the user-specified list of irrelevant topics, it falls into "purely vision, 3D vision, graphics, or speech papers without clear relevance to RecSys/Search/Ads", so its relevance is minimal.

2026-03-30 17:58:27 | arXiv:2603.28760v1 |
cs.CV cs.RO
View full abstract
Accurate 3D understanding of human hands and objects during manipulation remains a significant challenge for egocentric computer vision. Existing hand-object interaction datasets are predominantly captured in controlled studio settings, which limits both environmental diversity and the ability of models trained on such data to generalize to real-world scenarios. To address this challenge, we introduce a novel marker-less multi-camera system that allows for nearly unconstrained mobility in genuinely in-the-wild conditions, while still having the ability to generate precise 3D annotations of hands and objects. The capture system consists of a lightweight, back-mounted, multi-camera rig that is synchronized and calibrated with a user-worn VR headset. For 3D ground-truth annotation of hands and objects, we develop an ego-exo tracking pipeline and rigorously evaluate its quality. Finally, we present SHOW3D, the first large-scale dataset with 3D annotations that show hands interacting with objects in diverse real-world environments, including outdoor settings. Our approach significantly reduces the fundamental trade-off between environmental realism and accuracy of 3D annotations, which we validate with experiments on several downstream tasks. show3d-dataset.github.io
FlowIt: Global Matching for Optical Flow with Confidence-Guided Refinement
Sadra Safadoust, Fabio Tosi, Matteo Poggi, Fatma Güney
Personalized recommendation rationale:

This paper focuses on optical flow estimation, a purely computer-vision technique. Although optical flow has applications in video understanding, the title shows no clear connection to recommendation, search, or advertising, nor does it involve currently followed technical directions such as LLMs, Transformer architectures, or multimodal modeling.

2026-03-30 17:58:12 | arXiv:2603.28759v1 |
cs.CV
View full abstract
We present FlowIt, a novel architecture for optical flow estimation designed to robustly handle large pixel displacements. At its core, FlowIt leverages a hierarchical transformer architecture that captures extensive global context, enabling the model to effectively model long-range correspondences. To overcome the limitations of localized matching, we formulate the flow initialization as an optimal transport problem. This formulation yields a highly robust initial flow field, alongside explicitly derived occlusion and confidence maps. These cues are then seamlessly integrated into a guided refinement stage, where the network actively propagates reliable motion estimates from high-confidence regions into ambiguous, low-confidence areas. Extensive experiments across the Sintel, KITTI, Spring, and LayeredFlow datasets validate the efficacy of our approach. FlowIt achieves state-of-the-art results on the competitive Sintel and KITTI benchmarks, while simultaneously establishing new state-of-the-art cross-dataset zero-shot generalization performance on Sintel, Spring, and LayeredFlow.
SonoWorld: From One Image to a 3D Audio-Visual Scene
Derong Jin, Xiyi Chen, Ming C. Lin, Ruohan Gao
Personalized recommendation rationale:

The title focuses on generating a 3D audio-visual scene from a single image, which belongs to computer vision and multimedia processing and has no direct connection to the core techniques of recommendation, search, or advertising. Although it is multimodal (vision and audio), it lacks a clear recommendation, search, or advertising application and leans toward pure visual and audio generation rather than unified modeling of heterogeneous data or LLM applications.

2026-03-30 17:57:47 | arXiv:2603.28757v1 |
cs.CV cs.MM cs.SD
View full abstract
Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website: https://humathe.github.io/sonoworld/
Pandora: Articulated 3D Scene Graphs from Egocentric Vision
Alan Yu, Yun Chang, Christopher Xie, Luca Carlone
Personalized recommendation rationale:

This paper focuses on building articulated 3D scene graphs from egocentric vision, which is pure computer vision and 3D vision, unrelated to the core techniques of recommendation, search, or advertising. Although "scene graph" could in principle relate to structured data representation, the paper's concrete application (egocentric vision, 3D reconstruction) clearly falls outside the technical scope you follow and offers no hint of a potential application to recommendation/search/advertising.

2026-03-30 17:47:07 | arXiv:2603.28732v1 |
cs.RO cs.CV
View full abstract
Robotic mapping systems typically approach building metric-semantic scene representations from the robot's own sensors and cameras. However, these "first person" maps inherit the robot's own limitations due to its embodiment or skillset, which may leave many aspects of the environment unexplored. For example, the robot might not be able to open drawers or access wall cabinets. In this sense, the map representation is not as complete, and requires a more capable robot to fill in the gaps. We narrow these blind spots in current methods by leveraging egocentric data captured as a human naturally explores a scene wearing Project Aria glasses, giving a way to directly transfer knowledge about articulation from the human to any deployable robot. We demonstrate that, by using simple heuristics, we can leverage egocentric data to recover models of articulate object parts, with quality comparable to those of state-of-the-art methods based on other input modalities. We also show how to integrate these models into 3D scene graph representations, leading to a better understanding of object dynamics and object-container relationships. We finally demonstrate that these articulated 3D scene graphs enhance a robot's ability to perform mobile manipulation tasks, showcasing an application where a Boston Dynamics Spot is tasked with retrieving concealed target items, given only the 3D scene graph as input.
DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing
Kailai Feng, Yuxiang Wei, Bo Chen, Yang Pan, Hu Ye, Songwei Liu, Chenqian Yan, Y...
Personalized recommendation rationale:

This paper focuses on on-device image generation and editing, which is pure visual content generation. Although it involves lightweight-model techniques, its core application (image generation/editing) has no direct connection to ranking tasks in recommendation, search, or advertising, and it does not fall under unified modeling of heterogeneous data.

2026-03-30 17:30:25 | arXiv:2603.28713v1 |
cs.CV
View full abstract
Diffusion models have made significant progress in both text-to-image (T2I) generation and text-guided image editing. However, these models are typically built with billions of parameters, leading to high latency and increased deployment challenges. While on-device diffusion models improve efficiency, they largely focus on T2I generation and lack support for image editing. In this paper, we propose DreamLite, a compact unified on-device diffusion model (0.39B) that supports both T2I generation and text-guided image editing within a single network. DreamLite is built on a pruned mobile U-Net backbone and unifies conditioning through in-context spatial concatenation in the latent space. It concatenates images horizontally as input, using a (target | blank) configuration for generation tasks and (target | source) for editing tasks. To stabilize the training of this compact model, we introduce a task-progressive joint pretraining strategy that sequentially targets T2I, editing, and joint tasks. After high-quality SFT and reinforcement learning, DreamLite achieves GenEval (0.72) for image generation and ImgEdit (4.11) for image editing, outperforming existing on-device models and remaining competitive with several server-side models. By employing step distillation, we further reduce denoising processing to just 4 steps, enabling our DreamLite could generate or edit a 1024 x 1024 image in less than 1s on a Xiaomi 14 smartphone. To the best of our knowledge, DreamLite is the first unified on-device diffusion model that supports both image generation and image editing.
Why Aggregate Accuracy is Inadequate for Evaluating Fairness in Law Enforcement Facial Recognition Systems
Khalid Adnan Alsayed
Personalized recommendation rationale:

The title explicitly focuses on fairness evaluation, which falls under "fairness" in your specified list of irrelevant topics. Although it involves facial recognition, the core concern is fairness evaluation methodology rather than technical architecture, efficiency gains, or recommendation/search/advertising applications. The title suggests no technical content related to recommender systems, search technology, ad ranking, Transformer architectures, or LLM applications.

2026-03-30 16:56:54 | arXiv:2603.28675v1 |
cs.CV cs.AI cs.LG
View full abstract
Facial recognition systems are increasingly deployed in law enforcement and security contexts, where algorithmic decisions can carry significant societal consequences. Despite high reported accuracy, growing evidence demonstrates that such systems often exhibit uneven performance across demographic groups, leading to disproportionate error rates and potential harm. This paper argues that aggregate accuracy is an insufficient metric for evaluating the fairness and reliability of facial recognition systems in high-stakes environments. Through analysis of subgroup-level error distribution, including false positive rate (FPR) and false negative rate (FNR), the paper demonstrates how aggregate performance metrics can obscure critical disparities across demographic groups. Empirical observations show that systems with similar overall accuracy can exhibit substantially different fairness profiles, with subgroup error rates varying significantly despite a single aggregate metric. The paper further examines the operational risks associated with accuracy-centric evaluation practices in law enforcement applications, where misclassification may result in wrongful suspicion or missed identification. It highlights the importance of fairness-aware evaluation approaches and model-agnostic auditing strategies that enable post-deployment assessment of real-world systems. The findings emphasise the need to move beyond accuracy as a primary metric and adopt more comprehensive evaluation frameworks for responsible AI deployment.
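The paper's core argument, that aggregate accuracy can hide subgroup disparities, can be illustrated with a small computation of per-group FPR and FNR alongside overall accuracy; the records below are hypothetical, not the paper's data:

```python
def subgroup_rates(records):
    """Per-group false positive rate (FPR) and false negative rate (FNR)
    from (group, y_true, y_pred) records, plus aggregate accuracy."""
    groups, correct = {}, 0
    for g, y, p in records:
        s = groups.setdefault(g, {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
        s["pos" if y else "neg"] += 1
        if p and not y:
            s["fp"] += 1      # false positive: predicted match, no true match
        if y and not p:
            s["fn"] += 1      # false negative: missed a true match
        correct += (y == p)
    rates = {g: {"FPR": s["fp"] / max(s["neg"], 1),
                 "FNR": s["fn"] / max(s["pos"], 1)}
             for g, s in groups.items()}
    return correct / len(records), rates

# Hypothetical records: decent aggregate accuracy, errors concentrated in group B.
records = [("A", 1, 1), ("A", 0, 0), ("A", 0, 0), ("A", 1, 1),
           ("B", 1, 0), ("B", 0, 1), ("B", 1, 1), ("B", 0, 0)]
acc, rates = subgroup_rates(records)
print(acc, rates)  # acc 0.75, yet group A is error-free and group B is not
```

Here a single 75% accuracy figure obscures that every error falls on group B, which is exactly the failure mode of accuracy-centric evaluation the abstract describes.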
Sim-to-Real Fruit Detection Using Synthetic Data: Quantitative Evaluation and Embedded Deployment with Isaac Sim
Martina Hutter-Mironovova
Personalized recommendation rationale:

This paper mainly concerns fruit detection in computer vision, involving synthetic data generation and robotic deployment, a pure vision application. Its content has no direct connection to core advances in recommendation, search, or advertising, LLM techniques, Transformer architectures, or unified modeling of heterogeneous data, and it matches none of the currently followed technical directions.

2026-03-30 16:52:29 | arXiv:2603.28670v1 |
cs.CV cs.RO
View full abstract
This study investigates the effectiveness of synthetic data for sim-to-real transfer in object detection under constrained data conditions and embedded deployment requirements. Synthetic datasets were generated in NVIDIA Isaac Sim and combined with limited real-world fruit images to train YOLO-based detection models under real-only, synthetic-only, and hybrid regimes. Performance was evaluated on two test datasets: an in-domain dataset with conditions matching the training data and a domain shift dataset containing real fruit and different background conditions. Results show that models trained exclusively on real data achieve the highest accuracy, while synthetic-only models exhibit reduced performance due to a domain gap. Hybrid training strategies significantly improve performance compared to synthetic-only approaches and achieve results close to real-only training while reducing the need for manual annotation. Under domain shift conditions, all models show performance degradation, with hybrid models providing improved robustness. The trained models were successfully deployed on a Jetson Orin NX using TensorRT optimization, achieving real-time inference performance. The findings highlight that synthetic data is most effective when used in combination with real data and that deployment constraints must be considered alongside detection accuracy.
Industrial3D: A Terrestrial LiDAR Point Cloud Dataset and CrossParadigm Benchmark for Industrial Infrastructure
Chao Yin, Hongzhe Yue, Qing Han, Difeng Hu, Zhenyu Liang, Fangzhou Lin, Bing Sun...
Personalized recommendation rationale:

The title clearly points to a 3D point cloud dataset and an industrial infrastructure benchmark, pure 3D vision/point cloud processing. Although it involves dataset construction, it mentions no technique or application scenario related to recommendation, search, or advertising, nor does it show how 3D data processing might translate into potential applications in those areas.

2026-03-30 16:46:40 | arXiv:2603.28660v1 |
cs.CV
View full abstract
Automated semantic understanding of dense point clouds is a prerequisite for Scan-to-BIM pipelines, digital twin construction, and as-built verification--core tasks in the digital transformation of the construction industry. Yet for industrial mechanical, electrical, and plumbing (MEP) facilities, this challenge remains largely unsolved: TLS acquisitions of water treatment plants, chiller halls, and pumping stations exhibit extreme geometric ambiguity, severe occlusion, and extreme class imbalance that architectural benchmarks (e.g., S3DIS or ScanNet) cannot adequately represent. We present Industrial3D, a terrestrial LiDAR dataset comprising 612 million expertly labelled points at 6 mm resolution from 13 water treatment facilities. At 6.6x the scale of the closest comparable MEP dataset, Industrial3D provides the largest and most demanding testbed for industrial 3D scene understanding to date. We further establish the first industrial cross-paradigm benchmark, evaluating nine representative methods across fully supervised, weakly supervised, unsupervised, and foundation model settings under a unified benchmark protocol. The best supervised method achieves 55.74% mIoU, whereas zero-shot Point-SAM reaches only 15.79%--a 39.95 percentage-point gap that quantifies the unresolved domain-transfer challenge for industrial TLS data. Systematic analysis reveals that this gap originates from a dual crisis: statistical rarity (215:1 imbalance, 3.5x more severe than S3DIS) and geometric ambiguity (tail-class points share cylindrical primitives with head-class pipes) that frequency-based re-weighting alone cannot resolve. Industrial3D, along with benchmark code and pre-trained models, will be publicly available at https://github.com/pointcloudyc/Industrial3D.
TGIF2: Extended Text-Guided Inpainting Forgery Dataset & Benchmark
Hannes Mareen, Dimitrios Karageorgiou, Paschalis Giakoumoglou, Peter Lambert, Sy...
Personalized recommendation rationale:

The title clearly concerns an inpainting forgery dataset and benchmark, which is computer vision and unrelated to the core technical focus of recommendation, search, or advertising. Although "text-guided" is mentioned, it mainly targets image generation/editing tasks and shows no potential application value for recommender, search, or advertising systems.

2026-03-30 15:59:16 | arXiv:2603.28613v1 |
cs.CV cs.AI cs.CR cs.MM
View full abstract
Generative AI has made text-guided inpainting a powerful image editing tool, but at the same time a growing challenge for media forensics. Existing benchmarks, including our text-guided inpainting forgery (TGIF) dataset, show that image forgery localization (IFL) methods can localize manipulations in spliced images but not in fully regenerated (FR) images, while synthetic image detection (SID) methods can detect fully regenerated images but cannot perform localization. With new generative inpainting models emerging and the open problem of localization in FR images remaining, updated datasets and benchmarks are needed. We introduce TGIF2, an extended version of TGIF that captures recent advances in text-guided inpainting and enables a deeper analysis of forensic robustness. TGIF2 augments the original dataset with edits generated by FLUX.1 models, as well as with random non-semantic masks. Using the TGIF2 dataset, we conduct a forensic evaluation spanning IFL and SID, including fine-tuning IFL methods on FR images and generative super-resolution attacks. Our experiments show that both IFL and SID methods degrade on FLUX.1 manipulations, highlighting limited generalization. Additionally, while fine-tuning improves localization on FR images, evaluation with random non-semantic masks reveals object bias. Furthermore, generative super-resolution significantly weakens forensic traces, demonstrating that common image enhancement operations can undermine current forensic pipelines. In summary, TGIF2 provides an updated dataset and benchmark, which enables new insights into the challenges posed by modern inpainting and AI-based image enhancements. TGIF2 is available at https://github.com/IDLabMedia/tgif-dataset.
Unsafe2Safe: Controllable Image Anonymization for Downstream Utility
Mih Dinh, SouYoung Jin
Personalized recommendation rationale:

The title focuses on image anonymization, a privacy-protection direction within computer vision, unrelated to the recommendation, search, and advertising core areas or the LLM/Transformer techniques you follow. Although "downstream utility" is mentioned, it does not clearly point to RecSys/Search/Ads applications, and privacy protection is among the topics you explicitly exclude.

2026-03-30 15:54:47 | arXiv:2603.28605v1 |
cs.CV cs.CY cs.LG
View full abstract
Large-scale image datasets frequently contain identifiable or sensitive content, raising privacy risks when training models that may memorize and leak such information. We present Unsafe2Safe, a fully automated pipeline that detects privacy-prone images and rewrites only their sensitive regions using multimodally guided diffusion editing. Unsafe2Safe operates in two stages. Stage 1 uses a vision-language model to (i) inspect images for privacy risks, (ii) generate paired private and public captions that respectively include and omit sensitive attributes, and (iii) prompt a large language model to produce structured, identity-neutral edit instructions conditioned on the public caption. Stage 2 employs instruction-driven diffusion editors to apply these dual textual prompts, producing privacy-safe images that preserve global structure and task-relevant semantics while neutralizing private content. To measure anonymization quality, we introduce a unified evaluation suite covering Quality, Cheating, Privacy, and Utility dimensions. Across MS-COCO, Caltech101, and MIT Indoor67, Unsafe2Safe reduces face similarity, text similarity, and demographic predictability by large margins, while maintaining downstream model accuracy comparable to training on raw data. Fine-tuning diffusion editors on our automatically generated triplets (private caption, public caption, edit instruction) further improves both privacy protection and semantic fidelity. Unsafe2Safe provides a scalable, principled solution for constructing large, privacy-safe datasets without sacrificing visual consistency or downstream utility.
Detection of Adversarial Attacks in Robotic Perception
Ziad Sharawy, Mohammad Nakshbandiand, Sorin Mihai Grigorescu
Personalized recommendation rationale:

The title concerns detecting adversarial attacks in robotic perception, a security/privacy area matching the topics you explicitly exclude. The content likely involves adversarial defense or robustness, but these are not listed as relevant in your interests, and there is no clear potential application to recommendation, search, or advertising.

2026-03-30 15:41:49 | arXiv:2603.28594v1 |
cs.CV cs.AI cs.CR cs.RO
View full abstract
Deep Neural Networks (DNNs) achieve strong performance in semantic segmentation for robotic perception but remain vulnerable to adversarial attacks, threatening safety-critical applications. While robustness has been studied for image classification, semantic segmentation in robotic contexts requires specialized architectures and detection strategies.
ORSIFlow: Saliency-Guided Rectified Flow for Optical Remote Sensing Salient Object Detection
Haojing Chen, Yutong Li, Zhihang Liu, Tao Tan, Haoyu Bian, Qiuju Ma
Personalized recommendation rationale:

This paper focuses on salient object detection in optical remote sensing imagery, a computer vision topic. Although it involves saliency detection, its application domain (remote sensing) and core method (rectified flow) have no direct connection to the current focus areas of recommendation, search, or advertising (LLM techniques, Transformer architectures, heterogeneous data modeling, etc.).

2026-03-30 15:33:26 | arXiv:2603.28584v1 |
cs.CV
View full abstract
Optical Remote Sensing Image Salient Object Detection (ORSI-SOD) remains challenging due to complex backgrounds, low contrast, irregular object shapes, and large variations in object scale. Existing discriminative methods directly regress saliency maps, while recent diffusion-based generative approaches suffer from stochastic sampling and high computational cost. In this paper, we propose ORSIFlow, a saliency-guided rectified flow framework that reformulates ORSI-SOD as a deterministic latent flow generation problem. ORSIFlow performs saliency mask generation in a compact latent space constructed by a frozen variational autoencoder, enabling efficient inference with only a few steps. To enhance saliency awareness, we design a Salient Feature Discriminator for global semantic discrimination and a Salient Feature Calibrator for precise boundary refinement. Extensive experiments on multiple public benchmarks show that ORSIFlow achieves state-of-the-art performance with significantly improved efficiency. Codes are available at: https://github.com/Ch3nSir/ORSIFlow.
Navigating the Mirage: A Dual-Path Agentic Framework for Robust Misleading Chart Question Answering
Yanjie Zhang, Yafei Li, Rui Sheng, Zixin Chen, Yanna Lin, Huamin Qu, Lei Chen, Y...
Personalized recommendation rationale:

The title concerns chart question answering and handling misleading information, a visual question answering (VQA) topic with no direct connection to the core techniques of recommendation, search, or advertising. Although an agentic framework is mentioned, no application potential in recommendation/search/advertising scenarios is indicated, and chart QA is a specific vision application rather than unified modeling of heterogeneous data.

2026-03-30 15:32:24 | arXiv:2603.28583v1 |
cs.CV cs.AI cs.MM
View full abstract
Despite the success of Vision-Language Models (VLMs), misleading charts remain a significant challenge due to their deceptive visual structures and distorted data representations. We present ChartCynics, an agentic dual-path framework designed to unmask visual deception via a "skeptical" reasoning paradigm. Unlike holistic models, ChartCynics decouples perception from verification: a Diagnostic Vision Path captures structural anomalies (e.g., inverted axes) through strategic ROI cropping, while an OCR-Driven Data Path ensures numerical grounding. To resolve cross-modal conflicts, we introduce an Agentic Summarizer optimized via a two-stage protocol: Oracle-Informed SFT for reasoning distillation and Deception-Aware GRPO for adversarial alignment. This pipeline effectively penalizes visual traps and enforces logical consistency. Evaluations on two benchmarks show that ChartCynics achieves 74.43% and 64.55% accuracy, providing an absolute performance boost of ~29% over the Qwen3-VL-8B backbone, outperforming state-of-the-art proprietary models. Our results demonstrate that specialized agentic workflows can grant smaller open-source models superior robustness, establishing a new foundation for trustworthy chart interpretation.
XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs
Chengyin Hu, Jiaju Han, Xuemeng Sun, Qike Zhang, Yiwei Wei, Ang Li, Chunlei Meng...
Personalized recommendation rationale:

This paper focuses on adversarial attacks against vision-language models (VLMs), a security/attack area unrelated to the directions you follow: core advances in recommendation, search, or advertising, LLM applications, Transformer architecture improvements, or unified modeling of heterogeneous data. Although VLMs are mentioned, the study concerns adversarial perturbations rather than anything analogous to VLM applications in recommendation/search/advertising.

2026-03-30 15:24:34 | arXiv:2603.28568v1 |
cs.CV
View full abstract
Vision-language models (VLMs) rely on a shared visual-textual representation space to perform tasks such as zero-shot classification, image captioning, and visual question answering (VQA). While this shared space enables strong cross-task generalization, it may also introduce a common vulnerability: small visual perturbations can propagate through the shared embedding space and cause correlated semantic failures across tasks. This risk is particularly important in interactive and decision-support settings, yet it remains unclear whether VLMs are robust to highly constrained, sparse, and geometrically fixed perturbations. To address this question, we propose X-shaped Sparse Pixel Attack (XSPA), an imperceptible structured attack that restricts perturbations to two intersecting diagonal lines. Compared with dense perturbations or flexible localized patches, XSPA operates under a much stricter attack budget and thus provides a more stringent test of VLM robustness. Within this sparse support, XSPA jointly optimizes a classification objective, cross-task semantic guidance, and regularization on perturbation magnitude and along-line smoothness, inducing transferable misclassification as well as semantic drift in captioning and VQA while preserving visual subtlety. Under the default setting, XSPA modifies only about 1.76% of image pixels. Experiments on the COCO dataset show that XSPA consistently degrades performance across all three tasks. Zero-shot accuracy drops by 52.33 points on OpenAI CLIP ViT-L/14 and 67.00 points on OpenCLIP ViT-B/16, while GPT-4-evaluated caption consistency decreases by up to 58.60 points and VQA correctness by up to 44.38 points. These results suggest that even highly sparse and visually subtle perturbations with fixed geometric priors can substantially disrupt cross-task semantics in VLMs, revealing a notable robustness gap in current multimodal systems.
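The X-shaped support described above (perturbations restricted to two intersecting diagonal lines) can be sketched as a boolean mask; the `thickness` parameter and the corner-to-corner rescaling for non-square images are assumptions for illustration, not details from the paper:

```python
import numpy as np

def x_mask(h, w, thickness=1):
    """Boolean mask supporting perturbations only on the two image
    diagonals (an 'X'), each widened to roughly `thickness` pixels."""
    ys, xs = np.mgrid[0:h, 0:w]
    # Distance from the main diagonal and the anti-diagonal, after
    # rescaling columns so both lines run corner to corner.
    main = np.abs(ys - xs * (h - 1) / max(w - 1, 1))
    anti = np.abs(ys - (h - 1 - xs * (h - 1) / max(w - 1, 1)))
    return (main < thickness) | (anti < thickness)

mask = x_mask(224, 224, thickness=2)
print(mask.mean())  # fraction of pixels the attack may modify
```

An attack under this budget would optimize a perturbation tensor and apply it as `x_adv = x + delta * mask`, so only the few percent of pixels on the X are ever changed.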
Curriculum-Guided Myocardial Scar Segmentation for Ischemic and Non-ischemic Cardiomyopathy
Nivetha Jayakumar, Jonathan Pan, Shuo Wang, Bishow Paudel, Nisha Hosadurg, Crist...
Personalized recommendation rationale:

The title clearly belongs to medical image segmentation, focusing on myocardial scar detection, a specific medical application. This falls squarely under "Medical, Biology, Chemistry, Physics or other domain-specific applications" explicitly excluded in the irrelevant topics, and has no connection to recommendation, search, advertising, or related enabling technologies.

2026-03-30 15:22:00 | arXiv:2603.28560v1 |
cs.CV
View full abstract
Identification and quantification of myocardial scar is important for diagnosis and prognosis of cardiovascular diseases. However, reliable scar segmentation from Late Gadolinium Enhancement Cardiac Magnetic Resonance (LGE-CMR) images remains a challenge due to variations in contrast enhancement across patients, suboptimal imaging conditions such as post contrast washout, and inconsistencies in ground truth annotations on diffuse scars caused by inter observer variability. In this work, we propose a curriculum learning-based framework designed to improve segmentation performance under these challenging conditions. The method introduces a progressive training strategy that guides the model from high-confidence, clearly defined scar regions to low confidence or visually ambiguous samples with limited scar burden. By structuring the learning process in this manner, the network develops robustness to uncertain labels and subtle scar appearances that are often underrepresented in conventional training pipelines. Experimental results show that the proposed approach enhances segmentation accuracy and consistency, particularly for cases with minimal or diffuse scar, outperforming standard training baselines. This strategy provides a principled way to leverage imperfect data for improved myocardial scar quantification in clinical applications. Our code is publicly available on GitHub.
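The progressive high-confidence-to-ambiguous schedule described above can be sketched as cumulative curriculum stages over the training set; the confidence thresholds and sample labels below are hypothetical, not the paper's protocol:

```python
def curriculum_stages(samples, confidences, thresholds=(0.9, 0.6, 0.0)):
    """Split training samples into progressive stages: the model first
    sees only high-confidence cases, then successively adds more
    ambiguous ones (cumulative pools, easiest first)."""
    order = sorted(range(len(samples)), key=lambda i: -confidences[i])
    return [[samples[i] for i in order if confidences[i] >= t]
            for t in thresholds]

samples = ["clear_scar", "diffuse_scar", "minimal_scar", "ambiguous"]
conf =    [0.95,          0.7,            0.5,            0.2]
for k, stage in enumerate(curriculum_stages(samples, conf)):
    print(f"stage {k}: {stage}")
```

Training then proceeds stage by stage, so uncertain labels and subtle scars only enter the loss once the model has fit the unambiguous cases.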
MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures
Tim Strohmeyer, Lucas Morin, Gerhard Ingmar Meijer, Valéry Weber, Ahmed Nassar, ...
Personalized recommendation rationale:

This paper focuses on chemical structure recognition, a chemistry-specific application unrelated to recommender systems, search, or advertising. Although it involves multimodal recognition, it handles chemical structures rather than the heterogeneous data relevant to RecSys/Search/Ads (e.g., contextual features and user sequences).

2026-03-30 15:11:17 | arXiv:2603.28550v1 |
cs.CV
Full abstract
Automatically extracting chemical structures from documents is essential for the large-scale analysis of the literature in chemistry. Automatic pipelines have been developed to recognize molecules represented either in figures or in text independently. However, methods for recognizing chemical structures from multimodal descriptions (Markush structures) lag behind in precision and cannot be used for automatic large-scale processing. In this work, we present MarkushGrapher-2, an end-to-end approach for the multimodal recognition of chemical structures in documents. First, our method employs a dedicated OCR model to extract text from chemical images. Second, the text, image, and layout information are jointly encoded through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. Finally, the resulting encodings are effectively fused through a two-stage training strategy and used to auto-regressively generate a representation of the Markush structure. To address the lack of training data, we introduce an automatic pipeline for constructing a large-scale dataset of real-world Markush structures. In addition, we present IP5-M, a large manually-annotated benchmark of real-world Markush structures, designed to advance research on this challenging task. Extensive experiments show that our approach substantially outperforms state-of-the-art models in multimodal Markush structure recognition, while maintaining strong performance in molecule structure recognition. Code, models, and datasets are released publicly.
Seen2Scene: Completing Realistic 3D Scenes with Visibility-Guided Flow
Quan Meng, Yujin Chen, Lei Li, Matthias Nießner, Angela Dai
Personalized recommendation rationale:

The title clearly targets 3D scene completion in computer vision, a purely vision/3D-vision line of research. Despite the mention of 'realistic scenes', there is no clear connection to recommender systems, search, or advertising, and the title does not touch Transformer architectures, LLM techniques, or heterogeneous data modeling among the current areas of focus.

2026-03-30 15:08:38 | arXiv:2603.28548v1 |
cs.CV
Full abstract
We present Seen2Scene, the first flow matching-based approach that trains directly on incomplete, real-world 3D scans for scene completion and generation. Unlike prior methods that rely on complete and hence synthetic 3D data, our approach introduces visibility-guided flow matching, which explicitly masks out unknown regions in real scans, enabling effective learning from real-world, partial observations. We represent 3D scenes using truncated signed distance field (TSDF) volumes encoded in sparse grids and employ a sparse transformer to efficiently model complex scene structures while masking unknown regions. We employ 3D layout boxes as an input conditioning signal, and our approach is flexibly adapted to various other inputs such as text or partial scans. By learning directly from real-world, incomplete 3D scans, Seen2Scene enables realistic 3D scene completion for complex, cluttered real environments. Experiments demonstrate that our model produces coherent, complete, and realistic 3D scenes, outperforming baselines in completion accuracy and generation quality.
GEditBench v2: A Human-Aligned Benchmark for General Image Editing
Zhangqi Jiang, Zheng Sun, Xianfang Zeng, Yufeng Yang, Xuanyang Zhang, Yongliang ...
Personalized recommendation rationale:

The title clearly concerns an image-editing benchmark, i.e., purely vision research. Per the user's list of irrelevant topics, 'purely Vision, 3D Vision, Graphics, or Speech papers without clear relevance to RecSys/Search/Ads' should be excluded. The benchmark shows no direct or prospective connection to recommender systems, search, or advertising.

2026-03-30 15:08:32 | arXiv:2603.28547v1 |
cs.CV
Full abstract
Recent advances in image editing have enabled models to handle complex instructions with impressive realism. However, existing evaluation frameworks lag behind: current benchmarks suffer from narrow task coverage, while standard metrics fail to adequately capture visual consistency, i.e., the preservation of identity, structure and semantic coherence between edited and original images. To address these limitations, we introduce GEditBench v2, a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond predefined tasks. Furthermore, we propose PVC-Judge, an open-source pairwise assessment model for visual consistency, trained via two novel region-decoupled preference data synthesis pipelines. Besides, we construct VCReward-Bench using expert-annotated preference pairs to assess the alignment of PVC-Judge with human judgments on visual consistency evaluation. Experiments show that our PVC-Judge achieves state-of-the-art evaluation performance among open-source models and even surpasses GPT-5.1 on average. Finally, by benchmarking 16 frontier editing models, we show that GEditBench v2 enables more human-aligned evaluation, revealing critical limitations of current models, and providing a reliable foundation for advancing precise image editing.
ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation
Yu Sun, Meng Cao, Ping Yang, Rongtao Xu, Yunxiao Yan, Runze Xu, Liang Ma, Roy Ga...
Personalized recommendation rationale:

The title clearly centers on evaluating robot manipulation, i.e., robotics. Although reasoning is involved, the core subject is physical robot manipulation rather than algorithmic reasoning in recommendation, search, or advertising. The topic has no direct connection to the LLM techniques, recommender architectures, Transformer improvements, or heterogeneous data handling that you follow.

2026-03-30 15:06:41 | arXiv:2603.28545v1 |
cs.RO cs.CV
Full abstract
Vision-Language-Action (VLA) models and world models have recently emerged as promising paradigms for general-purpose robotic intelligence, yet their progress is hindered by the lack of reliable evaluation protocols that reflect real-world deployment. Existing benchmarks are largely simulator-centric, which provide controllability but fail to capture the reality gap caused by perception noise, complex contact dynamics, hardware constraints, and system latency. Moreover, fragmented real-world evaluations across different robot platforms prevent fair and reproducible comparison. To address these challenges, we introduce ManipArena, a standardized evaluation framework designed to bridge simulation and real-world execution. ManipArena comprises 20 diverse tasks across 10,812 expert trajectories emphasizing reasoning-oriented manipulation tasks requiring semantic and spatial reasoning, supports multi-level generalization through controlled out-of-distribution settings, and incorporates long-horizon mobile manipulation beyond tabletop scenarios. The framework further provides rich sensory diagnostics, including low-level motor signals, and synchronized real-to-sim environments constructed via high-quality 3D scanning. Together, these features enable fair, realistic, and reproducible evaluation for both VLA and world model approaches, providing a scalable foundation for diagnosing and advancing embodied intelligence systems.
RAD-LAD: Rule and Language Grounded Autonomous Driving in Real-Time
Anurag Ghosh, Srinivasa Narasimhan, Manmohan Chandraker, Francesco Pittaluga
Personalized recommendation rationale:

The title clearly focuses on autonomous driving, a crossover of computer vision and robotics with no direct connection to the core technical stack of recommender systems, search, or advertising. The 'language' aspect of the title may involve natural language processing, but the overall application scenario is far removed from the areas of interest such as user behavior modeling, content ranking, or ad delivery.

2026-03-30 14:50:37 | arXiv:2603.28522v1 |
cs.RO cs.AI cs.CV cs.LG
Full abstract
We present LAD, a real-time language–action planner with an interruptible architecture that produces a motion plan in a single forward pass (~20 Hz) or generates textual reasoning alongside a motion plan (~10 Hz). LAD is fast enough for real-time closed-loop deployment, achieving ~3x lower latency than prior driving language models while setting a new learning-based state of the art on nuPlan Test14-Hard and InterPlan. We also introduce RAD, a rule-based planner designed to address structural limitations of PDM-Closed. RAD achieves state-of-the-art performance among rule-based planners on nuPlan Test14-Hard and InterPlan. Finally, we show that combining RAD and LAD enables hybrid planning that captures the strengths of both approaches. This hybrid system demonstrates that rules and learning provide complementary capabilities: rules support reliable maneuvering, while language enables adaptive and explainable decision-making.
Generalizable Detection of AI Generated Images with Large Models and Fuzzy Decision Tree
Fei Wu, Guanghao Ding, Zijian Niu, Zhenrui Wang, Lei Yang, Zhuosheng Zhang, Shil...
Personalized recommendation rationale:

This paper focuses on detecting AI-generated images, a content-authenticity problem unrelated to the core techniques of recommendation, search, or advertising (ranking, retrieval, user modeling). Although large models are involved, the application is image content analysis rather than a RecSys/Search/Ads task, and fuzzy decision trees are classical machine learning rather than the Transformer architectures or LLM applications of current interest.

2026-03-30 14:43:14 | arXiv:2603.28508v1 |
cs.CV
Full abstract
The malicious use and widespread dissemination of AI-generated images pose a serious threat to the authenticity of digital content. Existing detection methods exploit low-level artifacts left by common manipulation steps within the generation pipeline, but they often lack generalization due to model-specific overfitting. Recently, researchers have resorted to Multimodal Large Language Models (MLLMs) for AIGC detection, leveraging their high-level semantic reasoning and broad generalization capabilities. While promising, MLLMs lack the fine-grained perceptual sensitivity to subtle generation artifacts, making them inadequate as standalone detectors. To address this issue, we propose a novel AI-generated image detection framework that synergistically integrates lightweight artifact-aware detectors with MLLMs via a fuzzy decision tree. The decision tree treats the outputs of basic detectors as fuzzy membership values, enabling adaptive fusion of complementary cues from semantic and perceptual perspectives. Extensive experiments demonstrate that the proposed method achieves state-of-the-art accuracy and strong generalization across diverse generative models.
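A minimal sketch of the fusion idea, treating each detector's output as a fuzzy membership in the "AI-generated" class. The 0.8 confidence threshold and equal weighting are illustrative assumptions; the paper's fuzzy decision tree is more elaborate than this single rule:

```python
def fuzzy_fuse(artifact_score, semantic_score, w=0.5):
    """Combine two detectors' fuzzy memberships for the 'AI-generated' class.

    artifact_score: low-level artifact-aware detector output in [0, 1]
    semantic_score: MLLM semantic-reasoning output in [0, 1]
    A confident vote from either branch is trusted outright (fuzzy OR via
    max); otherwise a weighted mean arbitrates between the two cues.
    """
    strong = max(artifact_score, semantic_score)
    if strong >= 0.8:  # either branch is confident -> trust it
        return strong
    return w * artifact_score + (1 - w) * semantic_score
```

The point of the design is complementarity: the artifact branch catches subtle generation traces the MLLM misses, while the MLLM generalizes across unseen generators.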
MRI-to-CT synthesis using drifting models
Qing Lyu, Jianxu Wang, Jeremy Hudson, Ge Wang, Chirstopher T. Whitlow
Personalized recommendation rationale:

This paper concerns MRI-to-CT synthesis in medical imaging, a medical/biological domain-specific application with no relevance to recommender systems, search, or advertising. Even at the technical level, medical image synthesis has no direct bearing on the heterogeneous data processing found in RecSys/Search/Ads.

2026-03-30 14:34:32 | arXiv:2603.28498v1 |
eess.IV cs.AI cs.CV
Full abstract
Accurate MRI-to-CT synthesis could enable MR-only pelvic workflows by providing CT-like images with bone details while avoiding additional ionizing radiation. In this work, we investigate recently proposed drifting models for synthesizing pelvis CT images from MRI and benchmark them against convolutional neural networks (UNet, VAE), a generative adversarial network (WGAN-GP), a physics-inspired probabilistic model (PPFM), and diffusion-based methods (FastDDPM, DDIM, DDPM). Experiments are performed on two complementary datasets: Gold Atlas Male Pelvis and the SynthRAD2023 pelvis subset. Image fidelity and structural consistency are evaluated with SSIM, PSNR, and RMSE, complemented by qualitative assessment of anatomically critical regions such as cortical bone and pelvic soft-tissue interfaces. Across both datasets, the proposed drifting model achieves high SSIM and PSNR and low RMSE, surpassing strong diffusion baselines and conventional CNN-, VAE-, GAN-, and PPFM-based methods. Visual inspection shows sharper cortical bone edges, improved depiction of sacral and femoral head geometry, and reduced artifacts or over-smoothing, particularly at bone-air-soft tissue boundaries. Moreover, the drifting model attains these gains with one-step inference and inference times on the order of milliseconds, yielding a more favorable accuracy-efficiency trade-off than iterative diffusion sampling while remaining competitive in image quality. These findings suggest that drifting models are a promising direction for fast, high-quality pelvic synthetic CT generation from MRI and warrant further investigation for downstream applications such as MRI-only radiotherapy planning and PET/MR attenuation correction.
Post-hoc Self-explanation of CNNs
Ahcène Boubekki, Line H. Clemmensen
Personalized recommendation rationale:

This paper focuses on interpretability methods for convolutional neural networks, i.e., purely vision research. Although interpretability may be of indirect value to recommender systems, the paper involves no Transformer architectures, LLM techniques, multimodal modeling, or anything directly tied to RecSys/Search/Ads, and is therefore largely irrelevant to the current focus.

2026-03-30 14:05:37 | arXiv:2603.28466v1 |
cs.CV stat.ML
Full abstract
Although standard Convolutional Neural Networks (CNNs) can be mathematically reinterpreted as Self-Explainable Models (SEMs), their built-in prototypes do not on their own accurately represent the data. Replacing the final linear layer with a $k$-means-based classifier addresses this limitation without compromising performance. This work introduces a common formalization of $k$-means-based post-hoc explanations for the classifier, the encoder's final output (B4), and combinations of intermediate feature activations. The latter approach leverages the spatial consistency of convolutional receptive fields to generate concept-based explanation maps, which are supported by gradient-free feature attribution maps. Empirical evaluation with a ResNet34 shows that using shallower, less compressed feature activations, such as those from the last three blocks (B234), results in a trade-off between semantic fidelity and a slight reduction in predictive performance.
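The k-means-based replacement for the final linear layer amounts to nearest-centroid classification over encoder features. A minimal sketch (the toy centroids and labels below are ours, not from the paper):

```python
import numpy as np

def kmeans_head_predict(features, centroids, centroid_labels):
    """Classify encoder outputs by their nearest k-means centroid,
    standing in for the CNN's final linear layer.

    features: (n, d) array of encoder (B4) outputs
    centroids: (k, d) array of k-means centroids fit on training features
    centroid_labels: (k,) class label assigned to each centroid
    """
    # Squared Euclidean distance from every feature to every centroid: (n, k)
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroid_labels[d2.argmin(axis=1)]

# Illustrative 2-D example: one centroid per class
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
labels = np.array([0, 1])
preds = kmeans_head_predict(np.array([[1.0, 1.0], [9.0, 9.0]]), centroids, labels)
```

Because the centroids live in the same feature space as the data, they double as prototypes: the explanation for a prediction is the centroid (and hence the cluster of training examples) the input landed nearest to.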
Decoupling Wavelet Sub-bands for Single Source Domain Generalization in Fundus Image Segmentation
Shramana Dey, Varun Ajith, Abhirup Banerjee, Sushmita Mitra
Personalized recommendation rationale:

The title clearly involves medical image processing (fundus image segmentation) and computer-vision techniques (wavelet transforms, domain generalization), i.e., a clearly medical application. While domain generalization could in principle transfer across fields, this paper targets a specific medical imaging task and has no direct connection to recommender systems, search, advertising, or related LLM techniques.

2026-03-30 14:04:45 | arXiv:2603.28463v1 |
cs.CV
Full abstract
Domain generalization in fundus imaging is challenging due to variations in acquisition conditions across devices and clinical settings. The inability to adapt to these variations causes performance degradation on unseen domains for deep learning models. Besides, obtaining annotated data across domains is often expensive and privacy constraints restricts their availability. Although single-source domain generalization (SDG) offers a realistic solution to this problem, the existing approaches frequently fail to capture anatomical topology or decouple appearance from anatomical features. This research introduces WaveSDG, a new wavelet-guided segmentation network for SDG. It decouples anatomical structure from domain-specific appearance through a wavelet sub-band decomposition. A novel Wavelet-based Invariant Structure Extraction and Refinement (WISER) module is proposed to process encoder features by leveraging distinct semantic roles of each wavelet sub-band. The module refines low-frequency components to anchor global anatomy, while selectively enhancing directional edges and suppressing noise within the high-frequency sub-bands. Extensive ablation studies validate the effectiveness of the WISER module and its decoupling strategy. Our evaluations on optic cup and optic disc segmentation across one source and five unseen target datasets show that WaveSDG consistently outperforms seven state-of-the-art methods. Notably, it achieves the best balanced Dice score and lowest 95th percentile Hausdorff distance with reduced variance, indicating improved accuracy, robustness, and cross-domain stability.
FeDMRA: Federated Incremental Learning with Dynamic Memory Replay Allocation
Tiantian Wang, Xiang Xiang, Simon S. Du
Personalized recommendation rationale:

The title explicitly mentions federated learning, which is on your exclusion list of clearly irrelevant topics. Although incremental learning and memory replay could relate to continual learning in recommender systems, the federated-learning core makes the paper entirely irrelevant.

2026-03-30 13:58:36 | arXiv:2603.28455v1 |
cs.LG cs.AI cs.CV cs.DC stat.ML
Full abstract
In federated healthcare systems, Federated Class-Incremental Learning (FCIL) has emerged as a key paradigm, enabling continuous adaptive model learning among distributed clients while safeguarding data privacy. However, in practical applications, data across agent nodes within the distributed framework often exhibits non-independent and identically distributed (non-IID) characteristics, rendering traditional continual learning methods inapplicable. To address these challenges, this paper covers more comprehensive incremental task scenarios and proposes a dynamic memory allocation strategy for exemplar storage based on the data replay mechanism. This strategy fully taps into the inherent potential of data heterogeneity, while taking into account the performance fairness of all participating clients, thereby establishing a balanced and adaptive solution to mitigate catastrophic forgetting. Unlike the fixed allocation of client exemplar memory, the proposed scheme emphasizes the rational allocation of limited storage resources among clients to improve model performance. Furthermore, extensive experiments are conducted on three medical image datasets, and the results demonstrate significant performance improvements compared to existing baseline models.
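The abstract does not spell out the allocation rule, but its core idea, splitting a fixed exemplar budget across clients in proportion to their data heterogeneity, can be sketched as follows. The heterogeneity score is an assumed input (e.g., label-distribution divergence), and largest-remainder rounding is our choice for keeping the total exact:

```python
def allocate_memory(total_budget, heterogeneity):
    """Split a global exemplar-replay budget across clients.

    heterogeneity: per-client non-IID scores; more heterogeneous clients
    receive more replay slots. Largest-remainder rounding guarantees the
    integer allocations sum exactly to total_budget.
    """
    total = sum(heterogeneity)
    raw = [total_budget * h / total for h in heterogeneity]
    alloc = [int(r) for r in raw]
    # Hand leftover slots to the clients with the largest fractional parts
    remainders = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in remainders[: total_budget - sum(alloc)]:
        alloc[i] += 1
    return alloc
```

Unlike a fixed per-client quota, this lets the scarce storage follow the clients where catastrophic forgetting is most likely.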
GeoHCC: Local Geometry-Aware Hierarchical Context Compression for 3D Gaussian Splatting
Xuan Deng, Xiandong Meng, Hengyu Man, Qiang Zhu, Tiange Zhang, Debin Zhao, Xiaop...
Personalized recommendation rationale:

The title clearly centers on a 3D computer-vision technique (3D Gaussian Splatting), i.e., purely vision research. Compression is mentioned, but it is a compression method specific to 3D geometric data, with no direct connection to core areas of interest such as Transformer architectures, LLM applications, or heterogeneous data modeling in RecSys/Search/Ads.

2026-03-30 13:39:35 | arXiv:2603.28431v1 |
cs.CV cs.AI
Full abstract
Although 3D Gaussian Splatting (3DGS) enables high-fidelity real-time rendering, its prohibitive storage overhead severely hinders practical deployment. Recent anchor-based 3DGS compression schemes reduce redundancy through context modeling, yet overlook explicit geometric dependencies, leading to structural degradation and suboptimal rate-distortion performance. In this paper, we propose GeoHCC, a geometry-aware 3DGS compression framework that incorporates inter-anchor geometric correlations into anchor pruning and entropy coding for compact representation. We first introduce Neighborhood-Aware Anchor Pruning (NAAP), which evaluates anchor importance via weighted neighborhood feature aggregation and merges redundant anchors into salient neighbors, yielding a compact yet geometry-consistent anchor set. Building upon this optimized structure, we further develop a hierarchical entropy coding scheme, in which coarse-to-fine priors are exploited through a lightweight Geometry-Guided Convolution (GG-Conv) operator to enable spatially adaptive context modeling and rate-distortion optimization. Extensive experiments demonstrate that GeoHCC effectively resolves the structure preservation bottleneck, maintaining superior geometric integrity and rendering fidelity over state-of-the-art anchor-based approaches.
Tele-Catch: Adaptive Teleoperation for Dexterous Dynamic 3D Object Catching
Weiguang Zhao, Junting Dong, Rui Zhang, Kailin Li, Qin Zhao, Kaizhu Huang
Personalized recommendation rationale:

The title focuses on robot teleoperation and dynamic object catching, i.e., robotics. Although 3D perception and adaptive control are involved, it mentions nothing about recommender systems, search, advertising, or related LLM/Transformer techniques, and has no direct connection to the current areas of interest.

2026-03-30 13:34:14 | arXiv:2603.28427v1 |
cs.RO cs.CV
Full abstract
Teleoperation is a key paradigm for transferring human dexterity to robots, yet most prior work targets objects that are initially static, such as grasping or manipulation. Dynamic object catch, where objects move before contact, remains underexplored. Pure teleoperation in this task often fails due to timing, pose, and force errors, highlighting the need for shared autonomy that combines human input with autonomous policies. To this end, we present Tele-Catch, a systematic framework for dexterous hand teleoperation in dynamic object catching. At its core, we design DAIM, a dynamics-aware adaptive integration mechanism that realizes shared autonomy by fusing glove-based teleoperation signals into the diffusion policy denoising process. It adaptively modulates control based on the interaction object state. To improve policy robustness, we introduce DP-U3R, which integrates unsupervised geometric representations from point cloud observations into diffusion policy learning, enabling geometry-aware decision making. Extensive experiments demonstrate that Tele-Catch significantly improves accuracy and robustness in dynamic catching tasks, while also exhibiting consistent gains across distinct dexterous hand embodiments and previously unseen object categories.
From Pixels to Reality: Physical-Digital Patch Attacks on Real-World Camera
Victoria Leonenkova, Ekaterina Shumitskaya, Dmitriy Vatolin, Anastasia Antsifero...
Personalized recommendation rationale:

This paper discusses physical-digital attacks against cameras, a computer-vision security topic. Although cameras and image processing are involved, the focus is security attacks rather than the core techniques of recommendation, search, or advertising. The topic has no direct connection to the LLM techniques, recommender architectures, Transformer improvements, or heterogeneous data modeling you follow, and security topics are explicitly listed as irrelevant.

2026-03-30 13:31:54 | arXiv:2603.28425v1 |
cs.CV
Full abstract
This demonstration presents Digital-Physical Adversarial Attacks (DiPA), a new class of practical adversarial attacks against pervasive camera-based authentication systems, where an attacker displays an adversarial patch directly on a smartphone screen instead of relying on printed artifacts. This digital-only physical presentation enables rapid deployment, removes the need for total-variation regularization, and improves patch transferability in black-box conditions. DiPA leverages an ensemble of state-of-the-art face-recognition models (ArcFace, MagFace, CosFace) to enhance transfer across unseen commercial systems. Our interactive demo shows a real-time dodging attack against a deployed face-recognition camera, preventing authorized users from being recognized while participants dynamically adjust patch patterns and observe immediate effects on the sensing pipeline. We further demonstrate DiPA's superiority over existing physical attacks in terms of success rate, feature-space distortion, and reductions in detection confidence, highlighting critical vulnerabilities at the intersection of mobile devices, pervasive vision, and sensor-driven authentication infrastructures.
Unified Restoration-Perception Learning: Maritime Infrared-Visible Image Fusion and Segmentation
Weichao Cai, Weiliang Huang, Biao Xue, Chao Huang, Fei Yuan, Bob Zhang
Personalized recommendation rationale:

This paper focuses on infrared-visible image fusion and segmentation in the maritime domain, a purely vision task with no direct connection to recommender systems, search, or advertising. The 'unified learning' in the title targets visual data from specific sensor modalities rather than the heterogeneous data common in RecSys/Search/Ads (e.g., user sequences, contextual features), so it matches none of the focus areas.

2026-03-30 13:26:05 | arXiv:2603.28414v1 |
cs.CV
Full abstract
Marine scene understanding and segmentation plays a vital role in maritime monitoring and navigation safety. However, prevalent factors like fog and strong reflections in maritime environments cause severe image degradation, significantly compromising the stability of semantic perception. Existing restoration and enhancement methods typically target specific degradations or focus solely on visual quality, lacking end-to-end collaborative mechanisms that simultaneously improve structural recovery and semantic effectiveness. Moreover, publicly available infrared-visible datasets are predominantly collected from urban scenes, failing to capture the authentic characteristics of coupled degradations in marine environments. To address these challenges, the Infrared-Visible Maritime Ship Dataset (IVMSD) is proposed to cover various maritime scenarios under diverse weather and illumination conditions. Building upon this dataset, a Multi-task Complementary Learning Framework (MCLF) is proposed to collaboratively perform image restoration, multimodal fusion, and semantic segmentation within a unified architecture. The framework includes a Frequency-Spatial Enhancement Complementary (FSEC) module for degradation suppression and structural enhancement, a Semantic-Visual Consistency Attention (SVCA) module for semantic-consistent guidance, and a cross-modality guided attention mechanism for selective fusion. Experimental results on IVMSD demonstrate that the proposed method achieves state-of-the-art segmentation performance, significantly enhancing robustness and perceptual quality under complex maritime conditions.
SVH-BD : Synthetic Vegetation Hyperspectral Benchmark Dataset for Emulation of Remote Sensing Images
Chedly Ben Azizi, Claire Guilloteau, Gilles Roussel, Matthieu Puigt
Personalized recommendation rationale:

The title clearly concerns remote sensing imagery and vegetation hyperspectral data, a dataset-construction effort in a specific domain (remote sensing / Earth science), with no direct connection to core advances in recommendation, search, or advertising, LLM applications, Transformer architecture improvements, or unified modeling of heterogeneous data. The work falls under the clearly irrelevant topics (medical, biology, chemistry, physics, or other domain-specific applications).

2026-03-30 13:02:42 | arXiv:2603.28390v1 |
cs.CV eess.SP
Full abstract
This dataset provides a large collection of 10,915 synthetic hyperspectral image cubes paired with pixel-level vegetation trait maps, designed to support research in radiative transfer emulation, vegetation trait retrieval, and uncertainty quantification. Each hyperspectral cube contains 211 bands spanning 400–2500 nm at 10 nm resolution and a fixed spatial layout of 64 × 64 pixels, offering continuous simulated surface reflectance spectra suitable for emulator development and machine-learning tasks requiring high spectral detail. Vegetation traits were derived by inverting Sentinel-2 Level-2A surface reflectance using a PROSAIL-based lookup-table approach, followed by forward PROSAIL simulations to generate hyperspectral reflectance under physically consistent canopy and illumination conditions. The dataset covers four ecologically diverse regions (East Africa, Northern France, Eastern India, and Southern Spain) and includes 5th and 95th percentile uncertainty maps as well as Sentinel-2 scene classification layers. This resource enables benchmarking of inversion methods, development of fast radiative transfer emulators, and studies of spectral–biophysical relationships under controlled yet realistic environmental variability.
AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation
Milton Zhou, Sizhong Qin, Yongzhi Li, Quan Chen, Peng Jiang
Personalized recommendation rationale:

The title clearly targets advertisement video editing, i.e., 'ads creative generation', which is explicitly listed as an irrelevant topic. Although multimodal and generative techniques are involved, the core application (ad video editing) is unrelated to the recommendation, search, or ad-ranking techniques of current interest.

2026-03-30 12:35:11 | arXiv:2603.28366v1 |
cs.CV
Full abstract
Short-form videos have become a primary medium for digital advertising, requiring scalable and efficient content creation. However, current workflows and AI tools remain disjoint and modality-specific, leading to high production costs and low overall efficiency. To address this issue, we propose AutoCut, an end-to-end advertisement video editing framework based on multimodal discretization and controllable editing. AutoCut employs dedicated encoders to extract video and audio features, then applies residual vector quantization to discretize them into unified tokens aligned with textual representations, constructing a shared video-audio-text token space. Built upon a foundation model, we further develop a multimodal large language model for video editing through combined multimodal alignment and supervised fine-tuning, supporting tasks covering video selection and ordering, script generation, and background music selection within a unified editing framework. Finally, a complete production pipeline converts the predicted token sequences into deployable long video outputs. Experiments on real-world advertisement datasets show that AutoCut reduces production cost and iteration time while substantially improving consistency and controllability, paving the way for scalable video creation.
SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering
Jiho Park, Sieun Choi, Jaeyoon Seo, Minho Sohn, Yeana Kim, Jihie Kim
Personalized recommendation rationale:

The title focuses on evaluating sketch abstraction efficiency via visual question answering, i.e., purely computer-vision research. Although it is multimodal (vision + language), its core is sketch understanding and commonsense reasoning, with no direct connection to heterogeneous data modeling, Transformer architecture improvements, or LLM applications in RecSys/Search/Ads.

2026-03-30 12:30:51 | arXiv:2603.28363v1 |
cs.CV
Full abstract
A sketch is a distilled form of visual abstraction that conveys core concepts through simplified yet purposeful strokes while omitting extraneous detail. Despite its expressive power, quantifying the efficiency of semantic abstraction in sketches remains challenging. Existing evaluation methods that rely on reference images, low-level visual features, or recognition accuracy do not capture abstraction, the defining property of sketches. To address these limitations, we introduce SEA (Sketch Evaluation metric for Abstraction efficiency), a reference-free metric that assesses how economically a sketch represents class-defining visual elements while preserving semantic recognizability. These elements are derived per class from commonsense knowledge about features typically depicted in sketches. SEA leverages a visual question answering model to determine the presence of each element and returns a quantitative score that reflects semantic retention under visual economy. To support this metric, we present CommonSketch, the first semantically annotated sketch dataset, comprising 23,100 human-drawn sketches across 300 classes, each paired with a caption and element-level annotations. Experiments show that SEA aligns closely with human judgments and reliably discriminates levels of abstraction efficiency, while CommonSketch serves as a benchmark providing systematic evaluation of element-level sketch understanding across various vision-language models.
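One plausible instantiation of such a score, not the paper's actual formula: multiply the fraction of class-defining elements the VQA model confirms (semantic retention) by a stroke-budget term (visual economy). The economy term and the `max_strokes` constant are illustrative assumptions:

```python
def sea_score(element_present, stroke_count, max_strokes=100):
    """Abstraction-efficiency score: semantic retention under visual economy.

    element_present: booleans, one per class-defining element, e.g. answers
    from a VQA model asked "does the sketch show <element>?"
    stroke_count: number of strokes, used as a proxy for visual cost.
    """
    retention = sum(element_present) / len(element_present)
    economy = 1.0 - min(stroke_count, max_strokes) / max_strokes
    return retention * economy
```

Under this scheme, a sketch that keeps most class-defining elements with very few strokes scores highest; adding strokes without recovering new elements only lowers the score.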
Optimized Weighted Voting System for Brain Tumor Classification Using MRI Images
Ha Anh Vu
Personalized recommendation rationale:

The title clearly involves the medical domain (brain tumor classification) and a specific modality (MRI images), which are clearly irrelevant topics. The 'weighted voting system' may involve ensemble learning, but the entire study focuses on medical image analysis, with no direct connection to recommender systems, search, or advertising.

2026-03-30 12:24:54 | arXiv:2603.28357v1 |
cs.CV cs.LG
Full abstract
The accurate classification of brain tumors from MRI scans is essential for effective diagnosis and treatment planning. This paper presents a weighted ensemble learning approach that combines deep learning and traditional machine learning models to improve classification performance. The proposed system integrates multiple classifiers, including ResNet101, DenseNet121, Xception, CNN-MRI, and ResNet50 with edge-enhanced images, SVM, and KNN with HOG features. A weighted voting mechanism assigns higher influence to models with better individual accuracy, ensuring robust decision-making. Image processing techniques such as Balance Contrast Enhancement, K-means clustering, and Canny edge detection are applied to enhance feature extraction. Experimental evaluations on the Figshare and Kaggle MRI datasets demonstrate that the proposed method achieves state-of-the-art accuracy, outperforming existing models. These findings highlight the potential of ensemble-based learning for improving brain tumor classification, offering a reliable and scalable framework for medical image analysis.
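The weighted voting mechanism can be sketched directly: each model votes for a class, and the vote carries that model's validation accuracy as its weight (the class names and accuracy values below are placeholders, not the paper's):

```python
def weighted_vote(predictions, accuracies):
    """Weighted majority vote across an ensemble.

    predictions: per-model predicted class labels for one sample
    accuracies: per-model accuracy scores, used as vote weights so that
    stronger individual models carry more influence on the final decision
    """
    scores = {}
    for label, acc in zip(predictions, accuracies):
        scores[label] = scores.get(label, 0.0) + acc
    return max(scores, key=scores.get)

# Two weaker models agreeing can still outvote one stronger dissenter
decision = weighted_vote(['glioma', 'meningioma', 'glioma'], [0.9, 0.95, 0.8])
```

This is the standard accuracy-weighted soft arbitration; ties are broken by dictionary order of first insertion.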
VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning
Li-Heng Chen, Ke Cheng, Yahui Liu, Lei Shi, Shi-Sheng Huang, Hongbo Fu
Personalized recommendation rationale:

This paper focuses on driving video generation, i.e., purely visual content generation unrelated to the core techniques of recommendation, search, or advertising. Although the title mentions 'visual-language reasoning', it is a driving-specific application and shows no potential for unified modeling of heterogeneous data or for RecSys/Search/Ads applications.

2026-03-30 12:22:22 | arXiv:2603.28353v1 |
cs.CV
Full abstract
Driving video generation has achieved much progress in controllability, video resolution, and length, but fails to support fine-grained object-level controllability for diverse driving videos, while preserving the spatiotemporal consistency, especially in long video generation. In this paper, we present a new driving video generation technique, called VistaGEN, which enables fine-grained control of specific entities, including 3D objects, images, and text descriptions, while maintaining spatiotemporal consistency in long video sequences. Our key innovation is the incorporation of multiview visual-language reasoning into the long driving video generation. To this end, we inject visual-language features into a multiview video generator to enable fine-grained controllability. More importantly, we propose a multiview vision-language evaluator (MV-VLM) to intelligently and automatically evaluate spatiotemporal consistency of the generated content, thus formulating a novel generation-evaluation-regeneration closed-loop generation mechanism. This mechanism ensures high-quality, coherent outputs, facilitating the creation of complex and reliable driving scenarios. Besides, within the closed-loop generation, we introduce an object-level refinement module to refine the unsatisfied results evaluated from the MV-VLM and then feed them back to the video generator for regeneration. Extensive evaluation shows that our VistaGEN achieves diverse driving video generation results with fine-grained controllability, especially for long-tail objects, and much better spatiotemporal consistency than previous approaches.
SFDemorpher: Generalizable Face Demorphing for Operational Morphing Attack Detection
Raul Ismayilov, Luuk Spreeuwers
Personalized recommendation rationale:

This paper focuses on face morphing attack detection in computer vision, a biometric-security direction. Although face-processing techniques are involved, the core is security defense (morphing attack detection), which falls under the clearly irrelevant topics (security, privacy). The technique shows no prospective applications in recommendation, search, or advertising.

2026-03-30 11:48:45 | arXiv:2603.28322v1 |
cs.CV
Full abstract:
Face morphing attacks compromise biometric security by creating document images that verify against multiple identities, posing significant risks from document issuance to border control. Differential Morphing Attack Detection (D-MAD) offers an effective countermeasure, particularly when employing face demorphing to disentangle identities blended in the morph. However, existing methods lack operational generalizability due to limited training data and the assumption that all document inputs are morphs. This paper presents SFDemorpher, a framework designed for the operational deployment of face demorphing for D-MAD that performs identity disentanglement within joint StyleGAN latent and high-dimensional feature spaces. We introduce a dual-pass training strategy handling both morphed and bona fide documents, leveraging a hybrid corpus with predominantly synthetic identities to enhance robustness against unseen distributions. Extensive evaluation confirms state-of-the-art generalizability across unseen identities, diverse capture conditions, and 13 morphing techniques, spanning both border verification and the challenging document enrollment stage. Our framework achieves superior D-MAD performance by widening the margin between the score distributions of bona fide and morphed samples while providing high-fidelity visual reconstructions facilitating explainability.
Prototype-Enhanced Multi-View Learning for Thyroid Nodule Ultrasound Classification
Yangmei Chen, Zhongyuan Zhang, Xikun Zhang, Xinyu Hao, Mingliang Hou, Renqiang L...
Personalized recommendation rationale:

The title clearly places this paper in the medical domain (thyroid nodule ultrasound classification), an explicitly out-of-scope topic. Although it mentions the technical concept of multi-view learning, the work is devoted to a specific medical application scenario and has no direct connection to the core areas of interest: recommender systems, search, or advertising.

2026-03-30 11:37:26 | arXiv:2603.28315v1 |
cs.CV cs.LG
Full abstract:
Thyroid nodule classification using ultrasound imaging is essential for early diagnosis and clinical decision-making; however, despite promising performance on in-distribution data, existing deep learning methods often exhibit limited robustness and generalisation when deployed across different ultrasound devices or clinical environments. This limitation is mainly attributed to the pronounced heterogeneity of thyroid ultrasound images, which can lead models to capture spurious correlations rather than reliable diagnostic cues. To address this challenge, we propose PEMV-thyroid, a Prototype-Enhanced Multi-View learning framework that accounts for data heterogeneity by learning complementary representations from multiple feature perspectives and refining decision boundaries through a prototype-based correction mechanism with mixed prototype information. By integrating multi-view representations with prototype-level guidance, the proposed approach enables more stable representation learning under heterogeneous imaging conditions. Extensive experiments on multiple thyroid ultrasound datasets demonstrate that PEMV-thyroid consistently outperforms state-of-the-art methods, particularly in cross-device and cross-domain evaluation scenarios, leading to improved diagnostic accuracy and generalisation performance in real-world clinical settings. The source code is available at https://github.com/chenyangmeii/Prototype-Enhanced-Multi-View-Learning.
DinoDental: Benchmarking DINOv3 as a Unified Vision Encoder for Dental Image Analysis
Kun Tang, Xinquan Yang, Mianjie Zheng, Xuefen Liu, Xuguang Li, Xiaoqi Guo, Ruiha...
Personalized recommendation rationale:

This paper focuses on dental medical image analysis, a clearly medical/biological domain application. Although DINOv3 is a vision foundation model, the content revolves entirely around dental diagnosis, an unrelated topic with no direct connection to the core areas of interest: recommender systems, search, or advertising.

2026-03-30 11:23:57 | arXiv:2603.28297v1 |
cs.CV
Full abstract:
The scarcity and high cost of expert annotations in dental imaging present a significant challenge for the development of AI in dentistry. DINOv3, a state-of-the-art, self-supervised vision foundation model pre-trained on 1.7 billion images, offers a promising pathway to mitigate this issue. However, its reliability when transferred to the dental domain, with its unique imaging characteristics and clinical subtleties, remains unclear. To address this, we introduce DinoDental, a unified benchmark designed to systematically evaluate whether DINOv3 can serve as a reliable, off-the-shelf encoder for comprehensive dental image analysis without requiring domain-specific pre-training. Constructed from multiple public datasets, DinoDental covers a wide range of tasks, including classification, detection, and instance segmentation on both panoramic radiographs and intraoral photographs. We further analyze the model's transfer performance by scaling its size and input resolution, and by comparing different adaptation strategies, including frozen features, full fine-tuning, and the parameter-efficient Low-Rank Adaptation (LoRA) method. Our experiments show that DINOv3 can serve as a strong unified encoder for dental image analysis across both panoramic radiographs and intraoral photographs, remaining competitive across tasks while showing particularly clear advantages for intraoral image understanding and boundary-sensitive dense prediction. Collectively, DinoDental provides a systematic framework for comprehensively evaluating DINOv3 in dental analysis, establishing a foundational benchmark to guide efficient and effective model selection and adaptation for the dental AI community.
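The benchmark above compares frozen features, full fine-tuning, and Low-Rank Adaptation (LoRA). As a reminder of the LoRA mechanism it refers to, here is a minimal generic sketch (my illustration, not DinoDental's code; `matmul`, `lora_forward`, and all shapes are assumptions): the pretrained weight stays frozen while a trainable low-rank product is added to it.

```python
def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices (lists of lists)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_forward(x, W, A, B, scale=1.0):
    """Forward pass with frozen weight W plus a trainable low-rank update A @ B.

    The rank equals the inner dimension shared by A and B; only A and B
    receive gradients during adaptation, W never changes.
    """
    base = matmul(x, W)               # frozen pretrained path
    update = matmul(matmul(x, A), B)  # low-rank adapter path
    return [[p + scale * q for p, q in zip(pr, qr)] for pr, qr in zip(base, update)]
```

With rank 1 on a 2x2 weight, the adapter adds only 4 trainable numbers instead of 4 full weights, which is the parameter-efficiency argument the benchmark evaluates.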
TerraSky3D: Multi-View Reconstructions of European Landmarks in 4K
Mattia D'Urso, Yuxi Hu, Christian Sormann, Mattia Rossi, Friedrich Fraundorfer
Personalized recommendation rationale:

This paper focuses on 3D reconstruction in computer vision, purely visual research. Although the title mentions "landmarks", these are physical buildings, not a concept from recommendation/search/ads. The technique shows no direct connection or potential application to recommender systems, search, or advertising and falls entirely under the excluded "pure vision, 3D vision" category.

2026-03-30 11:08:51 | arXiv:2603.28287v1 |
cs.CV
Full abstract:
Despite the growing data needs of increasingly sophisticated 3D reconstruction pipelines, we can still observe a scarcity of suitable public datasets. Existing 3D datasets are either low resolution, limited to a small number of scenes, based on images of varying quality because they are retrieved from the internet, or limited to specific capturing scenarios. Motivated by this lack of suitable 3D datasets, we captured TerraSky3D, a high-resolution large-scale 3D reconstruction dataset comprising 50,000 images divided into 150 ground, aerial, and mixed scenes. The dataset focuses on European landmarks and comes with curated calibration data, camera poses, and depth maps. TerraSky3D answers the need for a challenging dataset that can be used to train and evaluate 3D reconstruction-related pipelines.
TwinMixing: A Shuffle-Aware Feature Interaction Model for Multi-Task Segmentation
Minh-Khoi Do, Huy Che, Dinh-Duy Phan, Duc-Khai Lam, Duc-Lung Vu
Personalized recommendation rationale:

The title clearly targets multi-task segmentation in computer vision, involving techniques such as feature interaction and shuffle operations. Although feature interaction also matters in recommender systems, this paper is devoted to a visual segmentation task and indicates no direct link to recommendation, search, or advertising. The title mentions neither LLMs, Transformer architecture advances, nor unified modeling of heterogeneous data, the techniques relevant to your areas of interest.

2026-03-30 09:54:44 | arXiv:2603.28233v1 |
cs.CV cs.AI
Full abstract:
Accurate and efficient perception is essential for autonomous driving, where segmentation tasks such as drivable-area and lane segmentation provide critical cues for motion planning and control. However, achieving high segmentation accuracy while maintaining real-time performance on low-cost hardware remains a challenging problem. To address this issue, we introduce TwinMixing, a lightweight multi-task segmentation model designed explicitly for drivable-area and lane segmentation. The proposed network features a shared encoder and task-specific decoders, enabling both feature sharing and task specialization. Within the encoder, we propose an Efficient Pyramid Mixing (EPM) module that enhances multi-scale feature extraction through a combination of grouped convolutions, depthwise dilated convolutions and channel shuffle operations, effectively expanding the receptive field while minimizing computational cost. Each decoder adopts a Dual-Branch Upsampling (DBU) Block composed of a learnable transposed convolution-based Fine detailed branch and a parameter-free bilinear interpolation-based Coarse grained branch, achieving detailed yet spatially consistent feature reconstruction. Extensive experiments on the BDD100K dataset validate the effectiveness of TwinMixing across three configurations - tiny, base, and large. Among them, the base configuration achieves the best trade-off between accuracy and computational efficiency, reaching 92.0% mIoU for drivable-area segmentation and 32.3% IoU for lane segmentation with only 0.43M parameters and 3.95 GFLOPs. Moreover, TwinMixing consistently outperforms existing segmentation models on the same tasks, as illustrated in Fig. 1. Thanks to its compact and modular design, TwinMixing demonstrates strong potential for real-time deployment in autonomous driving and embedded perception systems. The source code: https://github.com/Jun0se7en/TwinMixing.
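The EPM module above combines grouped convolutions with channel shuffle operations. For reference, a minimal sketch of the channel-shuffle primitive (the standard ShuffleNet-style operation; this illustration and the function name are my assumptions, not the paper's code):

```python
def channel_shuffle(channels, groups):
    """ShuffleNet-style channel shuffle: conceptually reshape the channel axis
    to (groups, channels_per_group), transpose, and flatten, so that stacked
    grouped convolutions can exchange information across groups.

    `channels` is any flat list of per-channel items (e.g. feature maps).
    """
    n = len(channels)
    assert n % groups == 0, "channel count must divide evenly into groups"
    per_group = n // groups
    # After the shuffle, position k holds one channel from each group in turn.
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]
```

The operation is a pure permutation, so it is parameter-free and costless at inference, which is why it pairs well with grouped convolutions in lightweight encoders like the one described.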
Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal
Kazuma Ikeda, Ryosei Hara, Rokuto Nagata, Ozora Sako, Zihao Ding, Takahiro Kado,...
Personalized recommendation rationale:

The title clearly concerns a LiDAR dataset and ghost detection, which is purely 3D vision / sensor data processing and unrelated to the core technical focus of recommendation, search, or advertising. Nothing in the title suggests a connection to Transformer architectures, LLM techniques, or recommendation/search/ads applications, so the topic is entirely out of scope.

2026-03-30 09:46:55 | arXiv:2603.28224v1 |
cs.CV
Full abstract:
LiDAR has become an essential sensing modality in autonomous driving, robotics, and smart-city applications. However, ghost points (or ghosts), which are false reflections caused by multi-path laser returns from glass and reflective surfaces, severely degrade 3D mapping and localization accuracy. Prior ghost removal relies on geometric consistency in dense point clouds, failing on mobile LiDAR's sparse, dynamic data. We address this by exploiting full-waveform LiDAR (FWL), which captures complete temporal intensity profiles rather than just peak distances, providing crucial cues for distinguishing ghosts from genuine reflections in mobile scenarios. As this is a new task, we present Ghost-FWL, the first and largest annotated mobile FWL dataset for ghost detection and removal. Ghost-FWL comprises 24K frames across 10 diverse scenes with 7.5 billion peak-level annotations, which is 100x larger than existing annotated FWL datasets. Benefiting from this large-scale dataset, we establish a FWL-based baseline model for ghost detection and propose FWL-MAE, a masked autoencoder for efficient self-supervised representation learning on FWL data. Experiments show that our baseline outperforms existing methods in ghost removal accuracy, and our ghost removal further enhances downstream tasks such as LiDAR-based SLAM (66% trajectory error reduction) and 3D object detection (50x false positive reduction). The dataset and code are publicly available and can be accessed via the project page: https://keio-csg.github.io/Ghost-FWL
A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps
Xuanlong Yu, Youyang Sha, Longfei Liu, Xi Shen, Di Yang
Personalized recommendation rationale:

This paper focuses on cross-domain few-shot object detection, a purely visual task. Although it mentions fine-tuning and a decoder architecture, it lacks any clear connection to recommendation, search, or advertising and does not involve LLM techniques or Transformer architecture advances.

2026-03-30 08:46:10 | arXiv:2603.28182v1 |
cs.CV
Full abstract:
Few-shot object detection (FSOD) is challenging due to unstable optimization and limited generalization arising from the scarcity of training samples. To address these issues, we propose a hybrid ensemble decoder that enhances generalization during fine-tuning. Inspired by ensemble learning, the decoder comprises a shared hierarchical layer followed by multiple parallel decoder branches, where each branch employs denoising queries either inherited from the shared layer or newly initialized to encourage prediction diversity. This design fully exploits pretrained weights without introducing additional parameters, and the resulting diverse predictions can be effectively ensembled to improve generalization. We further leverage a unified progressive fine-tuning framework with a plateau-aware learning rate schedule, which stabilizes optimization and achieves strong few-shot adaptation without complex data augmentations or extensive hyperparameter tuning. Extensive experiments on CD-FSOD, ODinW-13, and RF100-VL validate the effectiveness of our approach. Notably, on RF100-VL, which includes 100 datasets across diverse domains, our method achieves an average performance of 41.9 in the 10-shot setting, significantly outperforming the recent approach SAM3, which obtains 35.7. We further construct a mixed-domain test set from CD-FSOD to evaluate robustness to out-of-distribution (OOD) samples, showing that our proposed modules lead to clear improvement gains. These results highlight the effectiveness, generalization, and robustness of the proposed method. Code is available at: https://github.com/Intellindust-AI-Lab/FT-FSOD.
ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization
Bingchen Li, Zhixin Wang, Fan Li, Jiaqi Xu, Jiaming Guo, Renjing Pei, Xin Li, Zh...
Personalized recommendation rationale:

This paper focuses on old photo colorization in computer vision, a purely image-processing topic. It involves none of the techniques of recommendation, search, or advertising, and shows no potential application to unified modeling of heterogeneous data.

2026-03-30 08:29:16 | arXiv:2603.28162v1 |
cs.CV
Full abstract:
Old photos preserve invaluable historical memories, making their restoration and colorization highly desirable. While existing restoration models can address some degradation issues like denoising and scratch removal, they often struggle with accurate colorization. This limitation arises from the unique degradation inherent in old photos, such as faded brightness and altered color hues, which are different from modern photo distributions, creating a substantial domain gap during colorization. In this paper, we propose a novel old photo colorization framework based on the generative diffusion model FLUX. Our approach introduces a structure-color decoupling strategy that separates structure preservation from color restoration, enabling accurate colorization of old photos while maintaining structural consistency. We further enhance the model with a progressive Direct Preference Optimization (Pro-DPO) strategy, which allows the model to learn subtle color preferences through coarse-to-fine transitions in color augmentation. Additionally, we address the limitations of text-based prompts by introducing visual semantic prompts, which extract fine-grained semantic information directly from old photos, helping to eliminate the color bias inherent in old photos. Experimental results on both synthetic and real datasets demonstrate that our approach outperforms existing state-of-the-art colorization methods, including closed-source commercial models, producing high-quality and vivid colorization.
Event-Based Method for High-Speed 3D Deformation Measurement under Extreme Illumination Conditions
Banglei Guan, Yifei Bian, Zibin Liu, Haoyang Li, Xuanyu Bai, Taihang Lei, Bin Li...
Personalized recommendation rationale:

The title clearly centers on event-camera technology, 3D deformation measurement, and handling extreme illumination conditions, i.e., pure vision / 3D vision. Although it mentions "measurement", this has no direct connection to user behavior modeling, content ranking, or feature engineering in recommendation, search, or advertising. The work most likely focuses on sensor technology or a specific vision task, with no clear path to RecSys/Search/Ads applications, and is therefore irrelevant to all of the user's specified focus areas.

2026-03-30 08:25:09 | arXiv:2603.28159v1 |
cs.CV
Full abstract:
Background: Large engineering structures, such as space launch towers and suspension bridges, are subjected to extreme forces that cause high-speed 3D deformation and compromise safety. These structures typically operate under extreme illumination conditions. Traditional cameras often struggle to handle strong light intensity, leading to overexposure due to their limited dynamic range. Objective: Event cameras have emerged as a compelling alternative to traditional cameras in high dynamic range and low-latency applications. This paper presents an integrated method, from calibration to measurement, using a multi-event camera array for high-speed 3D deformation monitoring of structures in extreme illumination conditions. Methods: Firstly, the proposed method combines the characteristics of the asynchronous event stream and temporal correlation analysis to extract the corresponding marker center point. Subsequently, the method achieves rapid calibration by solving the Kruppa equations in conjunction with a parameter optimization framework. Finally, by employing a unified coordinate transformation and linear intersection, the method enables the measurement of 3D deformation of the target structure. Results: Experiments confirmed that the relative measurement error is below 0.08%. Field experiments under extreme illumination conditions, including self-calibration of a multi-event camera array and 3D deformation measurement, verified the performance of the proposed method. Conclusions: This paper addressed the critical limitation of traditional cameras in measuring high-speed 3D deformations under extreme illumination conditions. The experimental results demonstrate that, compared to other methods, the proposed method can accurately measure 3D deformations of structures under harsh lighting conditions, and the relative error of the measured deformation is less than 0.1%.
ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS Models
Yuhuan Xie, Aoxuan Pan, Yi-Hua Huang, Chirui Chang, Peng Dai, Xin Yu, Xiaojuan Q...
Personalized recommendation rationale:

This paper focuses on 3D-aware image editing and computer graphics, purely vision/3D-vision work. Although it involves 3D modeling, nothing in the title indicates any direct or indirect connection to recommender systems, search, or advertising. The technique mainly targets content generation and image editing, which fall under the explicitly excluded unrelated topics.

2026-03-30 08:15:29 | arXiv:2603.28152v1 |
cs.CV
Full abstract:
Achieving precise, object-level control in image editing remains challenging: 2D methods lack 3D awareness and often yield ambiguous or implausible results, while existing 3D-aware approaches rely on heavy optimization or incomplete monocular reconstructions. We present ObjectMorpher, a unified, interactive framework that converts ambiguous 2D edits into geometry-grounded operations. ObjectMorpher lifts target instances with an image-to-3D generator into editable 3D Gaussian Splatting (3DGS), enabling fast, identity-preserving manipulation. Users drag control points; a graph-based non-rigid deformation with as-rigid-as-possible (ARAP) constraints ensures physically sensible shape and pose changes. A composite diffusion module harmonizes lighting, color, and boundaries for seamless reintegration. Across diverse categories, ObjectMorpher delivers fine-grained, photorealistic edits with superior controllability and efficiency, outperforming 2D drag and 3D-aware baselines on KID, LPIPS, SIFID, and user preference.
BlankSkip: Early-exit Object Detection onboard Nano-drones
Carlo Marra, Beatrice Alessandra Motetti, Alessio Burrello, Enrico Macii, Massim...
Personalized recommendation rationale:

The title concerns object detection onboard drones, a computer-vision topic unrelated to the core concerns of recommendation, search, or advertising. Although it mentions "early exit", an efficiency-oriented technique, both the application scenario (drones) and the task (object detection) fall outside the specified technical scope, with no clear potential application value.

2026-03-30 08:14:46 | arXiv:2603.28149v1 |
cs.CV
Full abstract:
Deploying tiny computer vision Deep Neural Networks (DNNs) on-board nano-sized drones is key for achieving autonomy, but is complicated by the extremely tight constraints of their computational platforms (approximately 10 MiB memory, 1 W power budget). Early-exit adaptive DNNs that dial down the computational effort for "easy-to-process" input frames represent a promising way to reduce the average inference latency. However, while this approach is extensively studied for classification, its application to dense tasks like object detection (OD) is not straightforward. In this paper, we propose BlankSkip, an adaptive network for on-device OD that leverages a simple auxiliary classification task for early exit, i.e., identifying frames with no objects of interest. With experiments using a real-world nano-drone platform, the Bitcraze Crazyflie 2.1, we achieve up to 24% average throughput improvement with a limited 0.015 mean Average Precision (mAP) drop compared to a static MobileNet-SSD detector, on a state-of-the-art nano-drones OD dataset.
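The abstract's core idea, using an auxiliary "blank frame" classifier as an early exit in front of a dense detector, can be sketched in a few lines. This is a hedged illustration of the control flow only (function names, the toy classifier/detector, and the threshold are all my assumptions, not BlankSkip's implementation):

```python
def make_early_exit_detector(blank_classifier, full_detector, threshold=0.5):
    """Wrap a cheap 'no objects of interest' classifier around an expensive
    detection head: frames judged empty skip the heavy computation entirely."""
    def detect(frame):
        p_blank = blank_classifier(frame)  # probability the frame is empty
        if p_blank >= threshold:
            return []                      # early exit: report no detections
        return full_detector(frame)        # fall through to the full detector
    return detect

# Toy stand-ins for the real networks (illustrative only).
blank_clf = lambda frame: 0.9 if sum(frame) == 0 else 0.1
full_det = lambda frame: [("obj", i) for i, v in enumerate(frame) if v > 0]
detect = make_early_exit_detector(blank_clf, full_det)
```

The average-latency gain depends on how often frames take the cheap path, which matches the paper's framing of throughput improvement on "easy-to-process" inputs at a small mAP cost.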
Intelligent Road Condition Monitoring using 3D In-Air SONAR Sensing
Amber Cassimon, Robin Kerstens, Walter Daems, Jan Steckel
Personalized recommendation rationale:

This paper concerns 3D sensing and road condition monitoring, a domain-specific application (transportation/infrastructure) unrelated to the core techniques of recommendation, search, or advertising. It involves no LLMs, Transformer architectures, recommendation algorithms, or any foundational technique that could transfer to those domains.

2026-03-30 08:05:28 | arXiv:2603.28141v1 |
cs.CV cs.LG
Full abstract:
In this paper, we investigate the capabilities of in-air 3D SONAR sensors for the monitoring of road surface conditions. Concretely, we consider two applications: Road material classification and Road damage detection and classification. While such tasks can be performed with other sensor modalities, such as camera sensors and LiDAR sensors, these sensor modalities tend to fail in harsh sensing conditions, such as heavy rain, smoke or fog. By using a sensing modality that is robust to such interference, we enable the creation of opportunistic sensing applications, where vehicles performing other tasks (garbage collection, mail delivery, etc.) can also be used to monitor the condition of the road. For these tasks, we use a single dataset, in which different types of damages are annotated, with labels including the material of the road surface. In the material classification task, we differentiate between three different road materials: Asphalt, Concrete and Element roads. In the damage detection and classification task, we determine if there is damage, and what type of damage (independent of material type), without localizing the damage. We are successful in determining the road surface type from SONAR sensor data, with F1 scores approaching 90% on the test set, but find that for the detection of damages performance lags, with F1 score around 75%. From this, we conclude that SONAR sensing is a promising modality to include in opportunistic sensing-based pavement management systems, but that further research is needed to reach the desired accuracy.
MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios
Zhang Li, Zhibo Lin, Qiang Liu, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiajun Song...
Personalized recommendation rationale:

The title clearly focuses on a benchmark for multilingual document parsing, which belongs to document processing, information extraction, or NLP evaluation. It has no direct connection to the current areas of interest: core recommendation/search/advertising techniques, LLM applications, Transformer architecture advances, or unified modeling of heterogeneous data.

2026-03-30 07:47:46 | arXiv:2603.28130v1 |
cs.CV cs.AI
Full abstract:
We introduce Multilingual Document Parsing Benchmark, the first benchmark for multilingual digital and photographed document parsing. Document parsing has made remarkable strides, yet almost exclusively on clean, digital, well-formatted pages in a handful of dominant languages. No systematic benchmark exists to evaluate how models perform on digital and photographed documents across diverse scripts and low-resource languages. MDPBench comprises 3,400 document images spanning 17 languages, diverse scripts, and varied photographic conditions, with high-quality annotations produced through a rigorous pipeline of expert model labeling, manual correction, and human verification. To ensure fair comparison and prevent data leakage, we maintain separate public and private evaluation splits. Our comprehensive evaluation of both open-source and closed-source models uncovers a striking finding: while closed-source models (notably Gemini3-Pro) prove relatively robust, open-source alternatives suffer dramatic performance collapse, particularly on non-Latin scripts and real-world photographed documents, with an average drop of 17.8% on photographed documents and 14.0% on non-Latin scripts. These results reveal significant performance imbalances across languages and conditions, and point to concrete directions for building more inclusive, deployment-ready parsing systems. Source available at https://github.com/Yuliang-Liu/MultimodalOCR.
SVGS: Single-View to 3D Object Editing via Gaussian Splatting
Pengcheng Xue, Yan Tian, Qiutao Song, Ziyi Wang, Linyang He, Weiping Ding, Mahmo...
Personalized recommendation rationale:

This paper focuses on single-view 3D object editing in 3D vision and graphics, a purely computer-vision research direction. Although Gaussian splatting is currently a popular 3D reconstruction method, the content has no direct connection to the core techniques of recommendation, search, or advertising (ranking, retrieval, user modeling) or to LLM/Transformer architecture advances, nor does it involve unified modeling of heterogeneous data or LLM applications in recommendation/search.

2026-03-30 07:45:03 | arXiv:2603.28126v1 |
cs.CV
Full abstract:
Text-driven 3D scene editing has attracted considerable interest due to its convenience and user-friendliness. However, methods that rely on implicit 3D representations, such as Neural Radiance Fields (NeRF), while effective in rendering complex scenes, are hindered by slow processing speeds and limited control over specific regions of the scene. Moreover, existing approaches, including Instruct-NeRF2NeRF and GaussianEditor, which utilize multi-view editing strategies, frequently produce inconsistent results across different views when executing text instructions. This inconsistency can adversely affect the overall performance of the model, complicating the task of balancing the consistency of editing results with editing efficiency. To address these challenges, we propose a novel method termed Single-View to 3D Object Editing via Gaussian Splatting (SVGS), which is a single-view text-driven editing technique based on 3D Gaussian Splatting (3DGS). Specifically, in response to text instructions, we introduce a single-view editing strategy grounded in multi-view diffusion models, which reconstructs 3D scenes by leveraging only those views that yield consistent editing results. Additionally, we employ sparse 3D Gaussian Splatting as the 3D representation, which significantly enhances editing efficiency. We conducted a comparative analysis of SVGS against existing baseline methods across various scene settings, and the results indicate that SVGS outperforms its counterparts in both editing capability and processing speed, representing a significant advancement in 3D editing technology. For further details, please visit our project page at: https://amateurc.github.io/svgs.github.io.
MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding
Guangjing Yang, Ziyuan Qin, Chaoran Zhang, Chenlin Du, Jinlin Wang, Wanran Sun, ...
Personalized recommendation rationale:

The title clearly concerns medical visual grounding, a medical-domain application unrelated to the user's areas of interest (recommendation, search, advertising). GRPO (Group Relative Policy Optimization, a reinforcement-learning method) and curriculum reward scheduling are technical details, but the application scenario is entirely medical, so the paper matches none of the focus areas.

2026-03-30 07:31:21 | arXiv:2603.28120v1 |
cs.CV
Full abstract:
Medical visual grounding serves as a crucial foundation for fine-grained multimodal reasoning and interpretable clinical decision support. Despite recent advances in reinforcement learning (RL) for grounding tasks, existing approaches such as Group Relative Policy Optimization (GRPO) suffer from severe reward sparsity when directly applied to medical images, primarily due to the inherent difficulty of localizing small or ambiguous regions of interest, which is further exacerbated by the rigid and suboptimal nature of fixed IoU-based reward schemes in RL. This leads to vanishing policy gradients and stagnated optimization, particularly during early training. To address this challenge, we propose MedLoc-R1, a performance-aware reward scheduling framework that progressively tightens the reward criterion in accordance with model readiness. MedLoc-R1 introduces a sliding-window performance tracker and a multi-condition update rule that automatically adjust the reward schedule from dense, easily obtainable signals to stricter, fine-grained localization requirements, while preserving the favorable properties of GRPO without introducing auxiliary networks or additional gradient paths. Experiments on three medical visual grounding benchmarks demonstrate that MedLoc-R1 consistently improves both localization accuracy and training stability over GRPO-based baselines. Our framework offers a general, lightweight, and effective solution for RL-based grounding in high-stakes medical applications. Code & checkpoints are available at https://github.com/MembrAI/MedLoc-R1.
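The "sliding-window performance tracker" plus "multi-condition update rule" described above can be sketched abstractly. This is a speculative illustration of the scheduling pattern only, not MedLoc-R1's algorithm: the class name, thresholds, window size, and promotion rule are all my assumptions.

```python
from collections import deque

class CurriculumRewardScheduler:
    """Track recent localization success in a sliding window and tighten the
    IoU criterion once the model appears ready (hypothetical sketch)."""

    def __init__(self, thresholds=(0.1, 0.3, 0.5, 0.7), window=100, promote_at=0.8):
        self.thresholds = thresholds  # IoU cutoffs from lenient to strict
        self.stage = 0
        self.hits = deque(maxlen=window)
        self.promote_at = promote_at

    @property
    def iou_threshold(self):
        return self.thresholds[self.stage]

    def reward(self, iou):
        hit = iou >= self.iou_threshold
        self.hits.append(hit)
        # Multi-condition update: a full window observed AND a high recent
        # success rate AND a stricter stage still available.
        if (len(self.hits) == self.hits.maxlen
                and sum(self.hits) / len(self.hits) >= self.promote_at
                and self.stage < len(self.thresholds) - 1):
            self.stage += 1     # tighten the localization criterion
            self.hits.clear()   # restart tracking at the new stage
        return 1.0 if hit else 0.0
```

Starting from a dense, easily satisfied reward and tightening only on demonstrated performance is what keeps early policy gradients from vanishing under a sparse fixed-IoU reward.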
Attention Frequency Modulation: Training-Free Spectral Modulation of Diffusion Cross-Attention
Seunghun Oh, Unsang Park
Personalized recommendation rationale:

The title clearly involves diffusion models and cross-attention, placing it in generative AI and image generation, which is entirely unrelated to my areas of interest (core advances in recommendation, search, and advertising, plus LLM/Transformer techniques). Diffusion models are mainly used for content generation, and my focus explicitly excludes purely generation-centric topics such as AIGC and content generation.

2026-03-30 07:24:41 | arXiv:2603.28114v1 |
cs.CV cs.LG
Full abstract:
Cross-attention is the primary interface through which text conditions latent diffusion models, yet its step-wise multi-resolution dynamics remain under-characterized, limiting principled training-free control. We cast diffusion cross-attention as a spatiotemporal signal on the latent grid by summarizing token-softmax weights into token-agnostic concentration maps and tracking their radially binned Fourier power over denoising. Across prompts and seeds, encoder cross-attention exhibits a consistent coarse-to-fine spectral progression, yielding a stable time-frequency fingerprint of token competition. Building on this structure, we introduce Attention Frequency Modulation (AFM), a plug-and-play inference-time intervention that edits token-wise pre-softmax cross-attention logits in the Fourier domain: low- and high-frequency bands are reweighted with a progress-aligned schedule and can be adaptively gated by token-allocation entropy, before the token softmax. AFM provides a continuous handle to bias the spatial scale of token-competition patterns without retraining, prompt editing, or parameter updates. Experiments on Stable Diffusion show that AFM reliably redistributes attention spectra and produces substantial visual edits while largely preserving semantic alignment. Finally, we find that entropy mainly acts as an adaptive gain on the same frequency-based edit rather than an independent control axis.
Contour-Guided Query-Based Feature Fusion for Boundary-Aware and Generalizable Cardiac Ultrasound Segmentation
Zahid Ullah, Sieun Choi, Jihie Kim
Personalized recommendation rationale:

This paper focuses on medical image segmentation (cardiac ultrasound), a clearly medical application entirely unrelated to my areas of interest (recommendation, search, advertising). Although the feature-fusion and boundary-aware techniques in the title have technical value, they lack any clear path or potential application for transfer to recommendation/search/advertising.

2026-03-30 07:17:40 | arXiv:2603.28110v1 |
cs.CV
Full abstract:
Accurate cardiac ultrasound segmentation is essential for reliable assessment of ventricular function in intelligent healthcare systems. However, echocardiographic images are challenging due to low contrast, speckle noise, irregular boundaries, and domain shifts across devices and patient populations. Existing methods, largely based on appearance-driven learning, often fail to preserve boundary precision and structural consistency under these conditions. To address these issues, we propose a Contour-Guided Query Refinement Network (CGQR-Net) for boundary-aware cardiac ultrasound segmentation. The framework integrates multi-resolution feature representations with contour-derived structural priors. An HRNet backbone preserves high-resolution spatial details while capturing multi-scale context. A coarse segmentation is first generated, from which anatomical contours are extracted and encoded into learnable query embeddings. These contour-guided queries interact with fused feature maps via cross-attention, enabling structure-aware refinement that improves boundary delineation and reduces noise artifacts. A dual-head supervision strategy jointly optimizes segmentation and boundary prediction to enforce structural consistency. The proposed method is evaluated on the CAMUS dataset and further validated on the CardiacNet dataset to assess cross-dataset generalization. Experimental results demonstrate improved segmentation accuracy, enhanced boundary precision, and robust performance across varying imaging conditions. These results highlight the effectiveness of integrating contour-level structural information with feature-level representations for reliable cardiac ultrasound segmentation.
RAWIC: Bit-Depth Adaptive Lossless Raw Image Compression
Chunhang Zheng, Tongda Xu, Mingli Xie, Yan Wang, Dou Li
Personalized recommendation rationale:

This paper focuses on image compression, a computer-vision topic with no direct connection to the core techniques of recommendation, search, or advertising. Although images may serve as content in recommendation/ads, the technique itself is pure image processing and involves no LLMs, Transformer architectures, multimodal modeling, or core recommendation/advertising algorithms.

2026-03-30 07:10:35 | arXiv:2603.28105v1 |
cs.CV
Full abstract:
Raw images preserve linear sensor measurements and high bit-depth information crucial for advanced vision tasks and photography applications, yet their storage remains challenging due to large file sizes, varying bit depths, and sensor-dependent characteristics. Existing learned lossless compression methods mainly target 8-bit sRGB images, while raw reconstruction approaches are inherently lossy and rely on camera-specific assumptions. To address these challenges, we introduce RAWIC, a bit-depth-adaptive learned lossless compression framework for Bayer-pattern raw images. We first convert single-channel Bayer data into a four-channel RGGB format and partition it into patches. For each patch, we compute its bit depth and use it as auxiliary input to guide compression. A bit-depth-adaptive entropy model is then designed to estimate patch distributions conditioned on their bit depths. This architecture enables a single model to handle raw images from diverse cameras and bit depths. Experiments show that RAWIC consistently surpasses traditional lossless codecs, achieving an average 7.7% bitrate reduction over JPEG-XL. Our code is available at https://github.com/chunbaobao/RAWIC.
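Two preprocessing steps from the abstract, packing the single-channel Bayer mosaic into four RGGB planes and computing a patch's bit depth as auxiliary input, are concrete enough to sketch. This is my hedged reconstruction of those standard operations (function names, the RGGB tile layout assumption, and the bit-depth floor are illustrative, not RAWIC's code):

```python
def bayer_to_rggb(mosaic):
    """Split an HxW RGGB Bayer grid (list of lists) into four half-resolution
    planes R, G1, G2, B, following the 2x2 RGGB tile layout."""
    r  = [row[0::2] for row in mosaic[0::2]]  # even rows, even cols
    g1 = [row[1::2] for row in mosaic[0::2]]  # even rows, odd cols
    g2 = [row[0::2] for row in mosaic[1::2]]  # odd rows, even cols
    b  = [row[1::2] for row in mosaic[1::2]]  # odd rows, odd cols
    return r, g1, g2, b

def patch_bit_depth(values, floor=8):
    """Smallest bit depth covering the largest sample in a patch, used as the
    auxiliary conditioning input for the entropy model (illustrative rule)."""
    return max(floor, max(values).bit_length())
```

Conditioning the entropy model on this per-patch bit depth is what lets a single network handle 10-, 12-, and 14-bit sensors without retraining.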
Octree-based Learned Point Cloud Geometry Compression: A Lossy Perspective
Kaiyu Zheng, Wei Gao, Huiming Zheng
Personalized recommendation rationale:

This paper focuses on point cloud geometry compression, which falls under computer vision and 3D data processing. Although it involves learned methods and compression, point cloud data is fundamentally different from the heterogeneous data found in recommendation, search, or advertising (e.g., user sequences, contextual features), and the paper indicates no application potential in those scenarios.

2026-03-30 06:53:56 | arXiv:2603.28095v1 |
cs.CV
Full abstract
Octree-based context learning has recently become a leading method in point cloud compression. However, its potential for lossy compression remains unexplored. The traditional lossy compression paradigm, which pairs a lossless octree representation with quantization step adjustment, can cause severe distortion because quantization discards many points. We therefore analyze the data characteristics of different point clouds and propose tailored lossy approaches. For object point clouds, which suffer under quantization step adjustment, we propose a new leaf-node lossy compression method that performs bit-wise coding and binary prediction on leaf nodes. For LiDAR point clouds, we explore variable-rate approaches and propose a simple but effective rate control method. Experimental results demonstrate that the proposed leaf-node lossy compression method significantly outperforms the previous octree-based method on object point clouds, and the proposed rate control method achieves about 1% bit error without fine-tuning on LiDAR point clouds.
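The "missing points" failure mode of quantization-step adjustment that the abstract targets is easy to reproduce. This toy sketch is our own, not the paper's code: coarsening the quantization step collapses nearby points onto the same grid cell.

```python
import numpy as np

def quantize_points(points: np.ndarray, step: float) -> np.ndarray:
    """Quantize coordinates to a voxel grid of size `step` and drop
    duplicates; the coarser the step, the more points collapse."""
    return np.unique(np.floor(points / step).astype(np.int64), axis=0)

pts = np.array([[0.10, 0.20, 0.30],
                [0.15, 0.22, 0.31],   # near-duplicate of the first point
                [2.00, 2.00, 2.00]])
print(len(quantize_points(pts, 0.01)))  # 3: fine step keeps all points
print(len(quantize_points(pts, 1.0)))   # 2: coarse step merges the pair
```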
LogiStory: A Logic-Aware Framework for Multi-Image Story Visualization
Chutian Meng, Fan Ma, Chi Zhang, Jiaxu Miao, Yi Yang, Yueting Zhuang
Personalized recommendation rationale:

The title concerns multi-image story visualization, a computer vision and content generation topic unrelated to the core technical focus of recommendation, search, or advertising. Although a "logic-aware framework" is mentioned, it targets visual storytelling generation and points to no application potential in recommendation, search, or advertising.

2026-03-30 06:37:12 | arXiv:2603.28082v1 |
cs.CV cs.MA
Full abstract
Generating coherent and communicative visual sequences, such as image sequences and videos, remains a significant challenge for current multimodal systems. Despite advances in visual quality and the integration of world knowledge, existing models still struggle to maintain logical flow, often resulting in disjointed actions, fragmented narratives, and unclear storylines. We attribute these issues to the lack of attention to visual logic, a critical yet underexplored dimension of visual sequence generation that we define as the perceptual and causal coherence among characters, actions, and scenes over time. To bridge this gap, we propose a logic-aware multi-image story visualization framework, LogiStory. The framework is built around the central innovation of explicitly modeling visual logic in story visualization. To realize this idea, we design a multi-agent system that grounds roles, extracts causal chains, and verifies story-level consistency, transforming narrative coherence from an implicit byproduct of image generation into an explicit modeling objective. This design effectively bridges structured story planning with visual generation, enhancing both narrative clarity and visual quality in story visualization. Furthermore, to evaluate the generation capacity, we construct LogicTale, a benchmark comprising richly annotated stories, emphasizing causal reasoning, and visual logic interpretability. We establish comprehensive automatic and human evaluation protocols designed to measure both visual logic and perceptual quality. Experiments demonstrate that our approach significantly improves the narrative logic of generated visual stories. This work provides a foundational step towards modeling and enforcing visual logic in general image sequence and video generation tasks.
AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation
Zhaohe Liao, Kaixun Jiang, Zhihang Liu, Yujie Wei, Junqiu Yu, Quanhao Li, Hong-T...
Personalized recommendation rationale:

The title clearly targets the evaluation of academic illustration generation, a pure visual content generation and evaluation topic. Although visual-logical consistency is mentioned, this is an evaluation benchmark for a specific illustration-generation task, with no direct connection to heterogeneous data processing, LLM applications, or Transformer architecture advances in recommendation, search, or advertising. The work belongs to AIGC/content generation, an explicitly listed irrelevant topic.

2026-03-30 06:14:40 | arXiv:2603.28068v1 |
cs.CV
Full abstract
Although image generation has boosted various applications through its rapid evolution, whether state-of-the-art models can produce ready-to-use academic illustrations for papers remains largely unexplored. Directly comparing or evaluating the illustration with a VLM is naive but requires oracle multi-modal understanding ability, which is unreliable for long and complex texts and illustrations. To address this, we propose AIBench, the first benchmark that uses VQA to evaluate the logical correctness of academic illustrations and VLMs to assess aesthetics. In detail, we design four levels of questions derived from a logic diagram summarized from the method section of the paper, which query whether the generated illustration aligns with the paper at different scales. Our VQA-based approach yields more accurate and detailed evaluations of visual-logical consistency while relying less on the ability of the judge VLM. With our high-quality AIBench, we conduct extensive experiments and conclude that the performance gap between models on this task is significantly larger than on general tasks, reflecting their varying complex-reasoning and high-density generation abilities. Further, logic and aesthetics are hard to optimize simultaneously, as in handcrafted illustrations. Additional experiments show that test-time scaling on both abilities significantly boosts performance on this task.
4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction
Renjie Wu, Hongdong Li, Jose M. Alvarez, Miaomiao Liu
Personalized recommendation rationale:

This paper focuses on dynamic-scene 3D reconstruction in computer vision, a purely visual/3D vision topic. Although the title mentions "high-fidelity" and "dynamic scenes", this has no direct connection to the core techniques of recommendation, search, or advertising (ranking, retrieval, user modeling, feature engineering, etc.). The paper involves nothing related to Transformer architectures, LLM techniques, multimodal learning, or recommendation/search/advertising applications.

2026-03-30 06:09:37 | arXiv:2603.28064v1 |
cs.CV
Full abstract
This paper addresses the problem of dynamic scene surface reconstruction using Gaussian Splatting (GS), aiming to recover temporally consistent geometry. While existing GS-based dynamic surface reconstruction methods can yield superior reconstruction, they are typically limited to either a single object or objects with only small deformations, struggling to maintain temporally consistent surface reconstruction of large deformations over time. We propose ``\textit{4DSurf}'', a novel and unified framework for generic dynamic surface reconstruction that does not require specifying the number or types of objects in the scene, can handle large surface deformations and temporal inconsistency in reconstruction. The key innovation of our framework is the introduction of Gaussian deformations induced Signed Distance Function Flow Regularization that constrains the motion of Gaussians to align with the evolving surface. To handle large deformations, we introduce an Overlapping Segment Partitioning strategy that divides the sequence into overlapping segments with small deformations and incrementally passes geometric information across segments through the shared overlapping timestep. Experiments on two challenging dynamic scene datasets, Hi4D and CMU Panoptic, demonstrate that our method outperforms state-of-the-art surface reconstruction methods by 49\% and 19\% in Chamfer distance, respectively, and achieves superior temporal consistency under sparse-view settings.
Object Detection Based on Distributed Convolutional Neural Networks
Liang Sun
Personalized recommendation rationale:

The title targets object detection in computer vision using a distributed convolutional neural network architecture. Although distributed computing is involved, the core content is a pure vision task with no direct connection to the ranking, retrieval, or modeling problems central to recommendation, search, or advertising. The title mentions no related application scenario or technical intersection.

2026-03-30 05:30:45 | arXiv:2603.28050v1 |
cs.CV
Full abstract
Based on the Distributed Convolutional Neural Network (DisCNN), we propose a straightforward object detection method. The magnitudes of the output-vector components of a DisCNN for a specific positive class are positively monotonic in the presence probabilities of the positive features. Thus, by identifying all high-scoring patches across all possible scales and overlapping them to form a bounding box, the positive object can be detected. The essential idea is to detect the object by detecting its features at multiple scales, ranging from specific sub-features to abstract features composed of these sub-features. Training a DisCNN requires only object-centered image data with positive and negative class labels. Detection for multiple positive classes can run in parallel to significantly accelerate it, and the method is also fast for single-object detection thanks to its lightweight model architecture.
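The patch-scoring detection idea can be sketched as follows. Here `score_fn` stands in for the DisCNN's per-class output, and all names, scales, and thresholds are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

def detect_by_patch_scores(score_fn, image, scales=(32, 64),
                           stride=16, threshold=0.9):
    """Slide square patches at several scales, keep those the classifier
    scores above `threshold`, and merge them into one enclosing box."""
    h, w = image.shape[:2]
    hits = []
    for s in scales:
        for y in range(0, h - s + 1, stride):
            for x in range(0, w - s + 1, stride):
                if score_fn(image[y:y + s, x:x + s]) >= threshold:
                    hits.append((x, y, x + s, y + s))
    if not hits:
        return None
    x0, y0, x1, y1 = zip(*hits)
    return (min(x0), min(y0), max(x1), max(y1))

img = np.zeros((128, 128))
img[32:96, 32:96] = 1.0          # synthetic "object"
box = detect_by_patch_scores(lambda p: p.mean(), img)
print(box)  # (32, 32, 96, 96)
```

Because each (class, scale) pass is independent, the scoring loops parallelize naturally, which matches the abstract's claim about multi-class acceleration.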
Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting
Zhen Zou, Xiaoxiao Ma, Mingde Yao, Jie Huang, LinJiang Huang, Feng Zhao
Personalized recommendation rationale:

The title clearly points to visual generation (single-step visual autoregressive generation), a pure visual content generation technique. Although an autoregressive architecture is involved, the paper focuses on generation in the visual modality and shows no relevance to recommendation, search, or advertising. The 'Anti-Symmetric Drifting' in the title appears to be a concrete technical improvement for visual generation, placing the work in AIGC/content generation, an explicitly listed irrelevant topic.

2026-03-30 05:29:00 | arXiv:2603.28049v1 |
cs.CV
Full abstract
Autoregressive (AR)-diffusion hybrid paradigms combine AR's structured semantic modeling with diffusion's high-fidelity synthesis, yet suffer from a dual speed bottleneck: the sequential AR stage and the iterative multi-step denoising of the diffusion vision decoding stage. Existing methods address each in isolation without a unified design principle. We observe that the per-position \emph{prediction entropy} of continuous-space AR models naturally encodes spatially varying generation uncertainty, which simultaneously governs draft prediction quality in the AR stage and reflects the corrective effort required by the vision decoding stage; this has not been fully explored before. Since entropy is inherently tied to both bottlenecks, it serves as a natural unifying signal for joint acceleration. In this work, we propose \textbf{Drift-AR}, which leverages this entropy signal to accelerate both stages: 1) for AR acceleration, we introduce Entropy-Informed Speculative Decoding, which aligns draft--target entropy distributions via a causal-normalized entropy loss, resolving the entropy mismatch that causes excessive draft rejection; 2) for visual decoder acceleration, we reinterpret entropy as the \emph{physical variance} of the initial state for an anti-symmetric drifting field -- high-entropy positions activate stronger drift toward the data manifold while low-entropy positions yield vanishing drift -- enabling single-step (1-NFE) decoding without iterative denoising or distillation. Moreover, both stages share the same entropy signal, which is computed once at no extra cost. Experiments on MAR, TransDiff, and NextStep-1 demonstrate 3.8--5.5$\times$ speedup with genuine 1-NFE decoding, matching or surpassing original quality. Code will be available at https://github.com/aSleepyTree/Drift-AR.
Event6D: Event-based Novel Object 6D Pose Tracking
Jae-Young Kang, Hoonehee Cho, Taeyeop Lee, Minjun Kang, Bowen Wen, Youngho Kim, ...
Personalized recommendation rationale:

The title clearly belongs to 6D pose estimation in computer vision, focusing on event cameras and object tracking. Although the visual modality is involved, the title indicates no connection to recommendation, search, or advertising, nor does it mention Transformer architectures, LLM techniques, or heterogeneous data processing. By the exclusion criteria, this falls under "purely vision papers with no clear relevance to RecSys/Search/Ads".

2026-03-30 05:11:57 | arXiv:2603.28045v1 |
cs.CV
Full abstract
Event cameras provide microsecond latency, making them suitable for 6D object pose tracking in fast, dynamic scenes where conventional RGB and depth pipelines suffer from motion blur and large pixel displacements. We introduce EventTrack6D, an event-depth tracking framework that generalizes to novel objects without object-specific training by reconstructing both intensity and depth at arbitrary timestamps between depth frames. Conditioned on the most recent depth measurement, our dual reconstruction recovers dense photometric and geometric cues from sparse event streams. Our EventTrack6D operates at over 120 FPS and maintains temporal consistency under rapid motion. To support training and evaluation, we introduce a comprehensive benchmark suite: a large-scale synthetic dataset for training and two complementary evaluation sets, including real and simulated event datasets. Trained exclusively on synthetic data, EventTrack6D generalizes effectively to real-world scenarios without fine-tuning, maintaining accurate tracking across diverse objects and motion patterns. Our method and datasets validate the effectiveness of event cameras for event-based 6D pose tracking of novel objects. Code and datasets are publicly available at https://chohoonhee.github.io/Event6D.
CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence
Tianle Zeng, Hanxuan Chen, Yanci Wen, Hong Zhang
Personalized recommendation rationale:

The title focuses on drone simulation environments and embodied-intelligence infrastructure, belonging to robotics, autonomous driving, and simulation. Although AI techniques are involved, it mentions none of the core areas of recommendation, search, or advertising, nor current directions such as Transformer architectures, LLM techniques, or unified modeling of heterogeneous data. Its content is closest to the listed irrelevant topics (pure vision, robotics applications), so relevance is minimal.

2026-03-30 04:49:29 | arXiv:2603.28032v1 |
cs.RO cs.AI cs.CV cs.HC
Full abstract
The convergence of low-altitude economies, embodied intelligence, and air-ground cooperative systems creates growing demand for simulation infrastructure capable of jointly modeling aerial and ground agents within a single physically coherent environment. Existing open-source platforms remain domain-segregated: driving simulators lack aerial dynamics, while multirotor simulators lack realistic ground scenes. Bridge-based co-simulation introduces synchronization overhead and cannot guarantee strict spatial-temporal consistency. We present CARLA-Air, an open-source infrastructure that unifies high-fidelity urban driving and physics-accurate multirotor flight within a single Unreal Engine process. The platform preserves both CARLA and AirSim native Python APIs and ROS 2 interfaces, enabling zero-modification code reuse. Within a shared physics tick and rendering pipeline, CARLA-Air delivers photorealistic environments with rule-compliant traffic, socially-aware pedestrians, and aerodynamically consistent UAV dynamics, synchronously capturing up to 18 sensor modalities across all platforms at each tick. The platform supports representative air-ground embodied intelligence workloads spanning cooperation, embodied navigation and vision-language action, multi-modal perception and dataset construction, and reinforcement-learning-based policy training. An extensible asset pipeline allows integration of custom robot platforms into the shared world. By inheriting AirSim's aerial capabilities -- whose upstream development has been archived -- CARLA-Air ensures this widely adopted flight stack continues to evolve within a modern infrastructure. Released with prebuilt binaries and full source: https://github.com/louiszengCN/CarlaAir
Effort-Based Criticality Metrics for Evaluating 3D Perception Errors in Autonomous Driving
Sharang Kaul, Simon Bultmann, Mario Berk, Abhinav Valada
Personalized recommendation rationale:

The title clearly focuses on evaluating 3D perception errors in autonomous driving, a pure computer vision application. Although evaluation methods are mentioned, the content involves 3D vision in a specific domain (autonomous driving) and has no direct connection to core recommendation/search/advertising techniques, LLM applications, Transformer architecture advances, or unified modeling of heterogeneous data.

2026-03-30 04:46:03 | arXiv:2603.28029v1 |
cs.CV cs.RO
Full abstract
Criticality metrics such as time-to-collision (TTC) quantify collision urgency but conflate the consequences of false-positive (FP) and false-negative (FN) perception errors. We propose two novel effort-based metrics: False Speed Reduction (FSR), the cumulative velocity loss from persistent phantom detections, and Maximum Deceleration Rate (MDR), the peak braking demand from missed objects under a constant-acceleration model. These longitudinal metrics are complemented by Lateral Evasion Acceleration (LEA), adapted from prior lateral evasion kinematics and coupled with reachability-based collision timing to quantify the minimum steering effort to avoid a predicted collision. A reachability-based ellipsoidal collision filter ensures only dynamically plausible threats are scored, with frame-level matching and track-level aggregation. Evaluation of different perception pipelines on nuScenes and Argoverse~2 shows that 65-93% of errors are non-critical, and Spearman correlation analysis confirms that all three metrics capture safety-relevant information inaccessible to established time-based, deceleration-based, or normalized criticality measures, enabling targeted mining of the most critical perception failures.
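The constant-acceleration model behind the MDR metric reduces to a one-line formula, a = v^2 / (2 d). This sketch uses our own function name and assumes the ego vehicle must come to a full stop exactly within the available gap:

```python
def max_deceleration_rate(ego_speed: float, gap: float) -> float:
    """Peak constant deceleration (m/s^2) needed to stop within `gap`
    metres of a suddenly revealed missed object: a = v^2 / (2 d)."""
    if gap <= 0:
        return float("inf")
    return ego_speed ** 2 / (2 * gap)

# 20 m/s (72 km/h) with 50 m of clearance requires 4 m/s^2 of braking
print(max_deceleration_rate(20.0, 50.0))  # 4.0
# Halving the gap doubles the braking demand
print(max_deceleration_rate(20.0, 25.0))  # 8.0
```

A false negative that leaves only a short gap therefore scores a high MDR, while one far down the road scores low, which is exactly the consequence-weighting the abstract argues time-based metrics miss.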
Adapting SAM to Nuclei Instance Segmentation and Classification via Cooperative Fine-Grained Refinement
Jingze Su, Tianle Zhu, Jiaxin Cai, Zhiyi Wang, Qi Li, Xiao Zhang, Tong Tong, Shu...
Personalized recommendation rationale:

This paper focuses on nuclei segmentation and classification in medical image analysis, a clearly biological/medical application. Although model adaptation techniques are involved, the content is entirely unrelated to core advances in recommendation, search, or advertising, LLM applications, or Transformer architecture improvements, and does not involve multimodal-modeling analogies for heterogeneous data processing.

2026-03-30 04:39:07 | arXiv:2603.28027v1 |
cs.CV
Full abstract
Nuclei instance segmentation is critical in computational pathology for cancer diagnosis and prognosis. Recently, the Segment Anything Model has demonstrated exceptional performance in various segmentation tasks, leveraging its rich priors and powerful global context modeling capabilities derived from large-scale pre-training on natural images. However, directly applying SAM to the medical imaging domain faces significant limitations: it lacks sufficient perception of the local structural features that are crucial for nuclei segmentation, and full fine-tuning for downstream tasks requires substantial computational costs. To efficiently transfer SAM's robust prior knowledge to nuclei instance segmentation while supplementing its task-aware local perception, we propose a parameter-efficient fine-tuning framework, named Cooperative Fine-Grained Refinement of SAM, consisting of three core components: 1) a Multi-scale Adaptive Local-aware Adapter, which enables effective capability transfer by augmenting the frozen SAM backbone with minimal parameters and instilling a powerful perception of local structures through dynamically generated, multi-scale convolutional kernels; 2) a Hierarchical Modulated Fusion Module, which dynamically aggregates multi-level encoder features to preserve fine-grained spatial details; and 3) a Boundary-Guided Mask Refinement, which integrates multi-context boundary cues with semantic features through explicit supervision, producing a boundary-focused signal to refine initial mask predictions for sharper delineation. These three components work cooperatively to enhance local perception, preserve spatial details, and refine boundaries, enabling SAM to perform accurate nuclei instance segmentation directly.
SegRGB-X: General RGB-X Semantic Segmentation Model
Jiong Liu, Yingjie Xu, Xingcheng Zhou, Rui Song, Walter Zimmer, Alois Knoll, Hu ...
Personalized recommendation rationale:

The title indicates a focus on semantic segmentation in computer vision, purely visual research. Although "RGB-X" may involve multimodal data, the core of the paper is visual segmentation, with no clear direct connection or potential application to recommendation, search, or advertising.

2026-03-30 04:32:11 | arXiv:2603.28023v1 |
cs.CV
Full abstract
Semantic segmentation across arbitrary sensor modalities faces significant challenges due to diverse sensor characteristics, and the traditional configurations for this task result in redundant development efforts. We address these challenges by introducing a universal arbitrary-modal semantic segmentation framework that unifies segmentation across multiple modalities. Our approach features three key innovations: (1) the Modality-aware CLIP (MA-CLIP), which provides modality-specific scene understanding guidance through LoRA fine-tuning; (2) Modality-aligned Embeddings for capturing fine-grained features; and (3) the Domain-specific Refinement Module (DSRM) for dynamic feature adjustment. Evaluated on five diverse datasets with different complementary modalities (event, thermal, depth, polarization, and light field), our model surpasses specialized multi-modal methods and achieves state-of-the-art performance with a mIoU of 65.03%. The codes will be released upon acceptance.
Physically Inspired Gaussian Splatting for HDR Novel View Synthesis
Huimin Zeng, Yue Bai, Hailing Wang, Yun Fu
Personalized recommendation rationale:

The title clearly involves 3D reconstruction and novel view synthesis in computer vision, purely visual research. Although Gaussian splatting is an innovation in 3D representation, nothing in the title suggests a potential application connection to recommendation, search, or advertising. By the explicit exclusion criteria, this falls under "Purely Vision, 3D Vision, Graphics, or Speech papers without clear relevance to RecSys/Search/Ads".

2026-03-30 04:27:39 | arXiv:2603.28020v1 |
cs.CV
Full abstract
High dynamic range novel view synthesis (HDR-NVS) reconstructs scenes with dynamic details by fusing multi-exposure low dynamic range (LDR) views, yet it struggles to capture ambient illumination-dependent appearance. Implicitly supervising HDR content by constraining tone-mapped results fails in correcting abnormal HDR values, and results in limited gradients for Gaussians in under/over-exposed regions. To this end, we introduce PhysHDR-GS, a physically inspired HDR-NVS framework that models scene appearance via intrinsic reflectance and adjustable ambient illumination. PhysHDR-GS employs a complementary image-exposure (IE) branch and Gaussian-illumination (GI) branch to faithfully reproduce standard camera observations and capture illumination-dependent appearance changes, respectively. During training, the proposed cross-branch HDR consistency loss provides explicit supervision for HDR content, while an illumination-guided gradient scaling strategy mitigates exposure-biased gradient starvation and reduces under-densified representations. Experimental results across realistic and synthetic datasets demonstrate our superiority in reconstructing HDR details (e.g., a PSNR gain of 2.04 dB over HDR-GS), while maintaining real-time rendering speed (up to 76 FPS). Code and models are available at https://huimin-zeng.github.io/PhysHDR-GS/.
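Tone-mapped supervision of HDR content, which the abstract argues is insufficient on its own, is often implemented with a mu-law compressor in HDR deep-learning work. A minimal sketch follows; this is a common choice we supply for illustration, not necessarily the mapping this paper uses, and the value of mu is arbitrary:

```python
import numpy as np

def mu_law_tonemap(hdr: np.ndarray, mu: float = 5000.0) -> np.ndarray:
    """Compress HDR radiance in [0, 1] to an LDR-like range, allocating
    most of the output range to the dark end where detail concentrates."""
    return np.log1p(mu * hdr) / np.log1p(mu)

x = np.array([0.0, 0.01, 1.0])
print(mu_law_tonemap(x))  # 0 and 1 map to themselves; 0.01 lands near 0.46
```

Losses computed only on such compressed values give vanishing gradients for extreme radiance, which is the exposure-biased gradient starvation the PhysHDR-GS scaling strategy targets.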
Energy-Aware Imitation Learning for Steering Prediction Using Events and Frames
Hu Cao, Jiong Liu, Xingzhuo Yan, Rui Song, Yan Xia, Walter Zimmer, Guang Chen, A...
Personalized recommendation rationale:

The title clearly involves computer vision (event cameras and frame processing) and robotic control (steering prediction), belonging to autonomous driving. Although imitation learning is mentioned, there is no indication of any connection to recommendation, search, or advertising. The technical content focuses entirely on visual perception and robotic decision-making, an explicitly listed irrelevant topic.

2026-03-30 03:58:47 | arXiv:2603.28008v1 |
cs.CV
Full abstract
In autonomous driving, relying solely on frame-based cameras can lead to inaccuracies caused by factors like long exposure times, high-speed motion, and challenging lighting conditions. To address these issues, we introduce a bio-inspired vision sensor known as the event camera. Unlike conventional cameras, event cameras capture sparse, asynchronous events that provide a complementary modality to mitigate these challenges. In this work, we propose an energy-aware imitation learning framework for steering prediction that leverages both events and frames. Specifically, we design an Energy-driven Cross-modality Fusion Module (ECFM) and an energy-aware decoder to produce reliable and safe predictions. Extensive experiments on two public real-world datasets, DDD20 and DRFuser, demonstrate that our method outperforms existing state-of-the-art (SOTA) approaches. The codes and trained models will be released upon acceptance.
DipGuava: Disentangling Personalized Gaussian Features for 3D Head Avatars from Monocular Video
Jeonghaeng Lee, Seok Keun Choi, Zhixuan Li, Weisi Lin, Sanghoon Lee
Personalized recommendation rationale:

This paper focuses on 3D head avatar creation and Gaussian features in computer vision, a purely visual/3D vision topic. Although personalized modeling is involved, there is no clear application or technical connection to recommendation, search, or advertising. The topic falls under the explicitly listed irrelevant topics ("Purely Vision, 3D Vision, Graphics").

2026-03-30 03:50:23 | arXiv:2603.28003v1 |
cs.CV
Full abstract
While recent 3D head avatar creation methods attempt to animate facial dynamics, they often fail to capture personalized details, limiting realism and expressiveness. To fill this gap, we present DipGuava (Disentangled and Personalized Gaussian UV Avatar), a novel 3D Gaussian head avatar creation method that successfully generates avatars with personalized attributes from monocular video. DipGuava is the first method to explicitly disentangle facial appearance into two complementary components, trained in a structured two-stage pipeline that significantly reduces learning ambiguity and enhances reconstruction fidelity. In the first stage, we learn a stable geometry-driven base appearance that captures global facial structure and coarse expression-dependent variations. In the second stage, the personalized residual details not captured in the first stage are predicted, including high-frequency components and nonlinearly varying features such as wrinkles and subtle skin deformations. These components are fused via dynamic appearance fusion that integrates residual details after deformation, ensuring spatial and semantic alignment. This disentangled design enables DipGuava to generate photorealistic, identity-preserving avatars, consistently outperforming prior methods in both visual quality and quantitative performance, as demonstrated in extensive experiments.