Scene understanding is a core task in computer vision, aiming to extract semantic information from images to identify objects, scene categories, and their interrelationships. Although advances in Vision-Language Models (VLMs) have driven progress in this field, existing VLMs still struggle to adapt to unseen, complex wide-area scenes. To address this challenge, this paper proposes a Hierarchical Coresets Selection (HCS) mechanism to advance the adaptation of VLMs in complex wide-area scene understanding. HCS progressively refines the selected regions based on a theoretically guaranteed importance function that accounts for utility, representativeness, robustness, and synergy. Without requiring additional fine-tuning, it enables VLMs to achieve rapid understanding of unseen scenes at any scale using minimal interpretable regions, while mitigating insufficient feature density. HCS is a plug-and-play method compatible with any VLM. Experiments demonstrate that HCS achieves superior performance and universality across various tasks.
Compared to structured urban or indoor scenes, wide-area settings—such as deep-sea regions—exhibit greater semantic diversity and sparser object distributions. These scenes often contain rare or unseen objects (e.g., marine species), leading to pronounced long-tail distributions, and are dominated by high-frequency, homogeneous backgrounds that obscure critical semantics. Although VLMs are effective at semantic extraction, their global attention mechanisms tend to focus on frequent patterns, making them less reliable in non-uniform, incomplete scenarios. Moreover, training on high-resolution visual data incurs significant computational costs, limiting scalability.
Inspiration from Coreset Theory
We revisit scene understanding from a compression and selection perspective. We propose an adaptive, interpretable region selection mechanism to enable efficient VLM reasoning in wide-area scenes. Drawing on data compression and coreset theory, we view feature selection as identifying a small, weighted subset that approximates the full dataset's learning outcome. Effective scene understanding thus requires filtering out redundant regions and retaining task-critical areas, yielding a compact, interpretable representation. Coreset theory supports this by constructing small surrogate subsets that preserve accuracy while improving scalability and efficiency.
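The coreset idea above can be made concrete with a small sketch: sampling regions in proportion to their loss contribution and reweighting them so that the weighted subset sum remains an unbiased estimate of the full sum. The function below is illustrative only (its name, inputs, and sampling rule are assumptions, not the paper's algorithm).

```python
import numpy as np

def sample_coreset(losses, m, rng=None):
    """Illustrative importance-sampled coreset (not the paper's method).

    Regions with larger loss contribution are sampled more often, and the
    weights 1/(m * p_i) keep the weighted subset sum an unbiased estimate
    of the full-dataset sum.
    """
    rng = np.random.default_rng(rng)
    p = losses / losses.sum()              # sampling distribution
    idx = rng.choice(len(losses), size=m, p=p)
    weights = 1.0 / (m * p[idx])           # unbiasedness correction
    return idx, weights
```

A useful property of sampling proportional to the contribution itself is that each weighted term equals total/m, so the weighted subset sum reproduces the full sum exactly; this is the sense in which a small surrogate subset can preserve the full dataset's learning outcome.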
This work proposes a plug-and-play Hierarchical Coresets Selection (HCS) framework to enable efficient and interpretable wide-area scene understanding with VLMs. HCS identifies a small subset of informative image regions, termed the coreset, that approximates the full dataset in terms of predictive performance while significantly reducing computational cost and redundancy.
Importance Function
To overcome the limitations of conventional sensitivity-based coreset methods, HCS introduces a multi-dimensional importance function that evaluates each region from four complementary perspectives: utility (loss reduction and spatial compactness), representativeness (distributional alignment with global features), robustness (stability under perturbations), and synergy (redundancy-aware complementarity among regions). These scores are integrated into a unified selection criterion for ranking region importance.
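One simple way to integrate such per-region scores into a single ranking criterion is min-max normalization followed by a weighted sum; the sketch below uses this scheme for illustration (the paper's actual fusion rule and weights may differ).

```python
import numpy as np

def combine_importance(utility, representativeness, robustness, synergy,
                       weights=(0.25, 0.25, 0.25, 0.25)):
    """Fuse four per-region score arrays into one importance ranking.

    Each component is min-max normalized to [0, 1] so that no single
    term dominates, then combined with a weighted sum. The equal weights
    are an illustrative assumption.
    """
    def norm(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    comps = [norm(s) for s in (utility, representativeness, robustness, synergy)]
    return sum(w * c for w, c in zip(weights, comps))
```

Regions would then be ranked by the returned score, with the top candidates entering the coreset.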
Selection Strategy
Built upon this function, HCS performs hierarchical region refinement. It begins with coarse-level region partitioning and importance scoring, selecting top candidates that satisfy the coreset approximation condition. Then, it recursively refines the candidate regions at finer scales, ensuring boundary accuracy and inter-region complementarity. The final coreset is assembled by aggregating the highest-scoring subregions.
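The coarse-to-fine refinement can be sketched as a recursive quadtree-style search: partition a region, keep the top-k children under a scoring function, and recurse to a fixed depth. The helper names and quadrant split below are illustrative assumptions, not the paper's exact procedure.

```python
def split(box):
    """Split an (x0, y0, x1, y1) box into its four quadrants."""
    x0, y0, x1, y1 = box
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    return [(x0, y0, mx, my), (mx, y0, x1, my),
            (x0, my, mx, y1), (mx, my, x1, y1)]

def hierarchical_select(box, score_fn, max_depth=2, k=2, depth=0):
    """Recursively refine a region, keeping the k highest-scoring
    children at each level; returns the selected finest subregions."""
    if depth == max_depth:
        return [box]
    children = sorted(split(box), key=score_fn, reverse=True)[:k]
    leaves = []
    for child in children:
        leaves.extend(hierarchical_select(child, score_fn, max_depth, k, depth + 1))
    return leaves
```

In practice `score_fn` would be the multi-dimensional importance function described above, and the aggregated leaves form the final coreset.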
Without requiring retraining, HCS enables VLMs to focus on the most informative and structurally diverse regions, achieving high-fidelity scene understanding with minimal data. The framework is theoretically grounded, scalable, and applicable to diverse wide-area visual domains.
@misc{wang2025advancing,
  title={Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection},
  author={Jingyao Wang and Yiming Chen and Lingyu Si and Changwen Zheng},
  year={2025},
  eprint={2507.13061},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.13061},
}