General Keypoint Detection: Few-Shot and Zero-Shot

Date

2025

Authors

Lu, Changsheng

Abstract

Keypoint detection is a fundamental research topic in computer vision. Despite extensive research over the past several decades, existing state-of-the-art deep-learning-based methods perform well only on specific object types and parts, and struggle to recognize novel keypoints on unseen objects. This closed-set detection paradigm limits model generality across the diverse classes of an open-world scenario. Moreover, these models typically require large amounts of annotated data for training before they can recognize a new class, leading to the problem of "no intelligence without human annotations". To mitigate these issues, we raise an interesting yet challenging question: how can we detect both base and novel keypoints on unseen species given few or even no annotated samples? To answer this question, we have conducted a series of research works, as follows.

Firstly, inspired by few-shot learning, we introduce a new task called few-shot keypoint detection (FSKD), whose goal is to detect the corresponding keypoints in a query image given one or a few support images with keypoint annotations. We propose a versatile FSKD model with uncertainty learning, which can detect a varying number of keypoints of different types while also providing uncertainty estimates for these keypoints. We show the effectiveness of our FSKD model on downstream tasks such as few-shot fine-grained visual recognition (FGVR) and semantic alignment (SA). For SA, we further contribute a novel yet realistic thin-plate-spline warping that exploits keypoint uncertainty.

Secondly, we explore ways to enhance the performance and robustness of FSKD models. To this end, we propose a novel visual-prior-guided vision transformer for FSKD. The visual prior maps are unsupervised and class-agnostic, which helps suppress background noise and encourages context learning on the foreground. Moreover, we investigate i) a transductive extension of FSKD and ii) FSKD with masking and alignment (MAA), to improve keypoint representation learning.

Thirdly, despite the versatility of existing FSKD models, two important issues still hinder the progress of FSKD: i) scalability w.r.t. the number of keypoints, and ii) the large domain shift of keypoints between seen and unseen species. To address the first issue, we propose a novel lightweight FSKD model based on simultaneous modulation and detection, which significantly reduces the computational cost without sacrificing performance. To overcome the second issue, we improve our lightweight model with mean-feature-based contrastive learning (MFCL). Despite its simplicity, MFCL promotes invariance learning among keypoints of the same type across different species in noisy scenarios, thereby bridging the domain shift.

Lastly, if we treat the support image and keypoints as a visual prompt, the above FSKD models align well with the popular "prompt"-based models used in various vision and language tasks. Consequently, we unify zero-shot keypoint detection (ZSKD) and few-shot keypoint detection (FSKD) into one framework capable of accepting visual prompts, text prompts, or both. We expand prompt diversity across three aspects: modality, semantics (seen vs. unseen), and language, to enable more general keypoint detection. A multimodal prototype set is proposed to support both visual and textual prompting. To infer the keypoint locations of unseen texts, we propose visual and text interpolation, and add the resulting pairs of auxiliary keypoints and texts into training. This significantly improves zero-shot novel keypoint detection and the spatial reasoning of our model. Additionally, we discover that large language models (LLMs) are good text parsers: coupled with an LLM, our keypoint detection model can handle diverse text prompts well.
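To make the FSKD formulation concrete, below is a minimal PyTorch sketch of the one-shot setting: a prototype is sampled from the support features at each annotated keypoint, correlated against the query feature map, and a soft-argmax over the resulting heatmap yields the predicted keypoint together with a spatial-variance uncertainty. The encoder, temperature, and variance-style uncertainty are illustrative assumptions, not the thesis' actual architecture.

    import torch
    import torch.nn.functional as F

    def fskd_forward(encoder, support_img, support_kps, query_img):
        # support_img, query_img: (1, 3, H, W); support_kps: (K, 2) as (x, y) in [0, 1].
        # encoder: any backbone mapping an image to a (1, C, h, w) feature map (assumption).
        fs, fq = encoder(support_img), encoder(query_img)

        # Sample a C-dim prototype per support keypoint (grid_sample expects [-1, 1] coords).
        grid = (support_kps * 2 - 1).view(1, -1, 1, 2)                      # (1, K, 1, 2)
        protos = F.grid_sample(fs, grid, align_corners=True)                # (1, C, K, 1)
        protos = F.normalize(protos.squeeze(-1).permute(0, 2, 1), dim=-1)   # (1, K, C)

        # Cosine correlation of each prototype with every query location -> heatmaps.
        fq_flat = F.normalize(fq.flatten(2), dim=1)                         # (1, C, h*w)
        heat = F.softmax(torch.bmm(protos, fq_flat) * 20.0, dim=-1)         # (1, K, h*w)

        # Soft-argmax gives coordinates; the spatial spread serves as uncertainty.
        h, w = fq.shape[-2:]
        ys, xs = torch.meshgrid(torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij")
        coords = torch.stack([xs.flatten(), ys.flatten()], dim=-1)          # (h*w, 2)
        mu = heat @ coords                                                  # (1, K, 2) predictions
        diff = coords[None, None] - mu[:, :, None]                          # (1, K, h*w, 2)
        var = (heat.unsqueeze(-1) * diff ** 2).sum(dim=(2, 3))              # (1, K) uncertainty
        return mu.squeeze(0), var.squeeze(0)

    # Toy usage with a stand-in backbone:
    enc = torch.nn.Conv2d(3, 32, kernel_size=3, stride=4, padding=1)
    kps, unc = fskd_forward(enc, torch.rand(1, 3, 64, 64), torch.rand(5, 2), torch.rand(1, 3, 64, 64))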
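The uncertainty-aware thin-plate-spline warping used for semantic alignment can be approximated with the classical regularized TPS: adding each keypoint's uncertainty to the diagonal of the kernel matrix lets the spline deviate more from unreliable correspondences. The NumPy sketch below follows that standard construction; the exact formulation in the thesis may differ.

    import numpy as np

    def tps_fit(src, dst, lam):
        # src, dst: (n, 2) control points; lam: (n,) per-point regularizers.
        # Larger lam[i] lets the warp deviate more from an uncertain correspondence.
        n = src.shape[0]
        d = np.linalg.norm(src[:, None] - src[None, :], axis=-1)     # (n, n) pairwise distances
        K = np.where(d > 0, d ** 2 * np.log(d + 1e-12), 0.0)         # TPS kernel U(r) = r^2 log r
        K += np.diag(lam)                                            # uncertainty relaxation
        P = np.hstack([np.ones((n, 1)), src])                        # affine basis (n, 3)
        A = np.block([[K, P], [P.T, np.zeros((3, 3))]])              # (n+3, n+3) linear system
        b = np.vstack([dst, np.zeros((3, 2))])
        sol = np.linalg.solve(A, b)
        return sol[:n], sol[n:]                                      # radial weights, affine part

    def tps_warp(pts, src, w, a):
        # Apply the fitted spline to arbitrary points pts (m, 2).
        d = np.linalg.norm(pts[:, None] - src[None, :], axis=-1)     # (m, n)
        U = np.where(d > 0, d ** 2 * np.log(d + 1e-12), 0.0)
        return U @ w + np.hstack([np.ones((len(pts), 1)), pts]) @ a

Setting lam[i] from a detector's per-keypoint uncertainty (e.g. the spatial variance above) makes confident keypoints behave like hard constraints while noisy ones are only softly matched.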
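One plausible reading of mean-feature-based contrastive learning is an InfoNCE-style loss that pulls each keypoint feature toward a momentum-updated mean feature of its keypoint type, regardless of species, and pushes it away from the means of other types. The class below is a hypothetical sketch in that spirit, not the thesis' exact loss or momentum schedule.

    import torch
    import torch.nn.functional as F

    class MeanFeatureContrast(torch.nn.Module):
        # Contrast keypoint features against per-type mean features (MFCL-style sketch).

        def __init__(self, num_types, dim, momentum=0.99, tau=0.1):
            super().__init__()
            self.momentum, self.tau = momentum, tau
            self.register_buffer("means", F.normalize(torch.randn(num_types, dim), dim=-1))

        def forward(self, feats, types):
            # feats: (B, D) keypoint features; types: (B,) keypoint-type indices.
            feats = F.normalize(feats, dim=-1)
            logits = feats @ self.means.t() / self.tau           # (B, num_types)
            loss = F.cross_entropy(logits, types)                # pull to own mean, push from others

            with torch.no_grad():                                # momentum-update the type means
                for t in types.unique():
                    batch_mean = feats[types == t].mean(dim=0)
                    self.means[t] = F.normalize(
                        self.momentum * self.means[t] + (1 - self.momentum) * batch_mean, dim=0)
            return loss

    # Toy usage: 17 keypoint types, 128-dim features.
    mfcl = MeanFeatureContrast(num_types=17, dim=128)
    loss = mfcl(torch.randn(32, 128), torch.randint(0, 17, (32,)))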
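Finally, the visual and text interpolation used to create auxiliary keypoint-text pairs can be illustrated with a simple linear scheme: an auxiliary keypoint is placed between two annotated keypoints, and its text embedding is interpolated correspondingly. The helper below is hypothetical; the interpolation weight, the embedding source, and the pairing strategy are all assumptions.

    import torch

    def make_auxiliary_pair(kp_a, kp_b, txt_a, txt_b, alpha=0.5):
        # kp_a, kp_b: (2,) image coordinates of two annotated keypoints.
        # txt_a, txt_b: (D,) text embeddings of their names (e.g. from a text encoder).
        kp_aux = alpha * kp_a + (1 - alpha) * kp_b       # interpolated keypoint location
        txt_aux = alpha * txt_a + (1 - alpha) * txt_b    # matching interpolated text prototype
        return kp_aux, txt_aux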

Type

Thesis (PhD)
