General Keypoint Detection: Few-Shot and Zero-Shot
Date
2025
Authors
Lu, Changsheng
Abstract
Keypoint detection is a fundamental research topic in computer vision. Despite extensive research on keypoint detection over the past several decades, existing state-of-the-art deep-learning-based methods perform well only on specific object types and parts, and struggle to recognize novel keypoints on unseen objects. This closed-set detection paradigm limits the model's generality to diverse classes in an open-world scenario. Moreover, these models typically require large amounts of annotated data for training before they can recognize a new class, leading to the problem of ``no intelligence without human annotations''. To mitigate the above issues, we raise an interesting yet challenging question: how can we detect both base and novel keypoints on unseen species given few or even no annotated samples? To answer this question, we have conducted a series of research works, as follows:
Firstly, inspired by few-shot learning, we introduce a new task called few-shot keypoint detection (FSKD), whose goal is to detect the corresponding keypoints in a query image given one or a few support images with keypoint annotations. Moreover, we propose a versatile FSKD model with uncertainty learning, which can detect a varying number of keypoints of different types while also providing uncertainty estimates for these keypoints. We show the effectiveness of our FSKD on downstream tasks such as few-shot fine-grained visual recognition (FGVR) and semantic alignment (SA). In SA, we further contribute a novel yet realistic thin-plate-spline warping that exploits keypoint uncertainty.
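The support–query matching at the core of FSKD can be illustrated in a few lines: extract a prototype feature at the annotated support keypoint, correlate it with the query feature map, and take the peak. Below is a minimal NumPy sketch with toy feature maps; the function names and shapes are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def extract_prototype(support_feat, kp_xy):
    # support_feat: (H, W, C) feature map; kp_xy: annotated (x, y) location
    x, y = kp_xy
    return support_feat[y, x]  # (C,) prototype for this keypoint type

def detect_keypoint(query_feat, prototype):
    # Cosine similarity between the prototype and every query location
    q = query_feat / (np.linalg.norm(query_feat, axis=-1, keepdims=True) + 1e-8)
    p = prototype / (np.linalg.norm(prototype) + 1e-8)
    heat = q @ p                                  # (H, W) similarity heatmap
    y, x = np.unravel_index(np.argmax(heat), heat.shape)
    return (x, y), heat

rng = np.random.default_rng(0)
support = rng.normal(size=(8, 8, 16))
query = support.copy()                            # toy case: identical images
proto = extract_prototype(support, (3, 5))
loc, heat = detect_keypoint(query, proto)         # peak at the same location
```

A real model would replace the raw feature lookup with learned attentive features and regress an offset plus an uncertainty from the heatmap peak.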
Secondly, we explore ways to enhance the performance and robustness of FSKD models. To achieve this, we propose a novel visual prior guided vision transformer for FSKD. The visual prior maps are unsupervised and class-agnostic, which helps suppress background noise and encourages context learning on the foreground. Moreover, we investigate i) a transductive extension of FSKD and ii) FSKD with masking and alignment (MAA), to improve keypoint representation learning.
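The visual-prior guidance can be pictured as re-weighting the transformer's patch tokens by an unsupervised saliency map before attention, so background tokens contribute less. A toy sketch, with hypothetical shapes rather than the thesis code:

```python
import numpy as np

def modulate_tokens(tokens, prior):
    """Down-weight background patch tokens with a class-agnostic,
    unsupervised prior map (illustrative simplification)."""
    # tokens: (N, C) flattened patch tokens; prior: (N,) saliency in [0, 1]
    return tokens * prior[:, None]

tokens = np.ones((4, 2))                     # four toy patch tokens
prior = np.array([1.0, 0.5, 0.0, 1.0])       # third patch is pure background
out = modulate_tokens(tokens, prior)         # background token is zeroed
```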
Thirdly, despite the versatility of existing FSKD models, we observe that two important issues still hinder the progress of FSKD: i) the scalability issue w.r.t. the number of keypoints and ii) the large domain shift of keypoints between seen and unseen species. To address the first issue, we propose a novel light-weight FSKD model based on simultaneous modulation and detection, which significantly reduces the computational cost without sacrificing performance. To overcome the second issue, we improve our light-weight model with mean-feature-based contrastive learning (MFCL). Despite its simplicity, MFCL promotes invariance learning among the same type of keypoints across different species in noisy scenarios. As a result, the domain shift is bridged.
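The intuition behind MFCL can be shown with an InfoNCE-style objective that pulls a keypoint feature toward the mean feature of its own keypoint type and pushes it away from the means of other types. This is a minimal sketch under assumed shapes, not the thesis loss:

```python
import numpy as np

def mfcl_loss(feat, mean_feats, label, tau=0.1):
    """InfoNCE-style contrast of one keypoint feature against per-type
    mean features (illustrative; names and shapes are assumptions)."""
    # feat: (C,) keypoint feature; mean_feats: (K, C) one mean per type
    f = feat / np.linalg.norm(feat)
    m = mean_feats / np.linalg.norm(mean_feats, axis=1, keepdims=True)
    logits = m @ f / tau            # (K,) similarities to each type's mean
    logits -= logits.max()          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[label])

means = np.eye(3)                   # toy per-type mean features
loss_pos = mfcl_loss(np.array([1.0, 0.0, 0.0]), means, label=0)  # correct type
loss_neg = mfcl_loss(np.array([1.0, 0.0, 0.0]), means, label=1)  # wrong type
```

Averaging features of the same keypoint type across species before contrasting is what makes the objective robust to per-sample noise.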
Lastly, if we treat the support image and keypoints as a visual prompt, we notice that the above FSKD models align well with the popular ``prompt''-based models used in various vision and language tasks. Consequently, we unify zero-shot keypoint detection (ZSKD) and few-shot keypoint detection (FSKD) into one framework capable of accepting visual prompts, text prompts, or both. We expand prompt diversity across three aspects: modality, semantics (seen vs. unseen), and language, to enable more general keypoint detection. A multimodal prototype set is proposed to support both visual and textual prompting. To infer the keypoint locations of unseen texts, we propose visual and text interpolation, and add pairs of auxiliary keypoints and texts to training. This significantly improves zero-shot novel keypoint detection and the spatial reasoning of our model. Additionally, we find that a large language model (LLM) is a good text parser: coupled with an LLM, our keypoint detection model handles diverse text prompts well.
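The interpolation idea for auxiliary training pairs can be sketched simply: an auxiliary keypoint location is interpolated between two annotated keypoints, and it is paired with the matching interpolation of their prompt embeddings. The following is a hypothetical simplification of that pairing, not the thesis code:

```python
import numpy as np

def interpolate(a, b, alpha=0.5):
    # Linear interpolation between two vectors (coordinates or embeddings)
    return (1 - alpha) * np.asarray(a, dtype=float) + alpha * np.asarray(b, dtype=float)

def make_auxiliary_pair(kp_a, kp_b, emb_a, emb_b, alpha=0.5):
    """Pair an interpolated keypoint location with an interpolated prompt
    embedding, giving an auxiliary (keypoint, text) training sample."""
    xy = interpolate(kp_a, kp_b, alpha)
    emb = interpolate(emb_a, emb_b, alpha)
    return xy, emb / np.linalg.norm(emb)       # re-normalize the embedding

# Midpoint between two annotated keypoints and their toy embeddings
xy, emb = make_auxiliary_pair((0, 0), (2, 4), [1.0, 0.0], [0.0, 1.0])
```

Training on such pairs gives the model supervision for prompts that fall between the named keypoints, which is what improves spatial reasoning for unseen texts.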
Type
Thesis (PhD)