Retrieving Images through Bi-modal Visual and Language Queries

dc.contributor.author: Liu, Zheyuan
dc.date.accessioned: 2024-02-02T03:35:14Z
dc.date.available: 2024-02-02T03:35:14Z
dc.date.issued: 2024
dc.description.abstract: Image retrieval has been studied for decades and has become a common tool in present-day computer systems. Conventional image retrieval uses input queries of simple forms, e.g., short text or another image, as seen in today's online search engines. Such queries are straightforward to obtain but can be imprecise or inadequate for conveying user intentions. With language alone, it can be difficult to describe an entire image down to the details without being verbose. Alternatively, systems that accept an image as input are usually designed to retrieve visually similar targets, but this leaves no room for further modifications to the source image, and a perfect image that depicts one's intention can be hard to obtain. A more natural setup is therefore to combine the two forms of queries so that text and image complement each other. We thus arrive at a form of image retrieval where the user provides a reference image along with text stating further desired modifications to said image. The system should then return an image that reflects all modifications stated by the user while remaining similar to the reference image in other aspects. This task is termed composed image retrieval (CIR) and is the topic of this thesis. Compared to conventional retrieval setups, CIR allows a more interactive user experience while presenting a valuable opportunity for studying the fine-grained, sometimes ambiguous human intentions in the vision-and-language domain.

In this thesis, we make several contributions to the task of CIR, with the aim of better understanding the visiolinguistic reasoning within it and thus building a more efficient retrieval experience. We begin by introducing the first CIR dataset with generic images and high-quality human annotations. Compared to existing datasets that are synthesized, repurposed from other tasks, or focused on specific domains (e.g., fashion products), ours better facilitates the study of the complex vision-and-language reasoning present in human intentions. Along with the dataset, we propose a method that leverages vision-and-language pre-trained (VLP) networks, and we test it on both an existing fashion dataset and our newly proposed one to demonstrate its capability. Collectively, this work extends the task of CIR into the generic image domain.

Next, we present a data augmentation scheme that better utilizes training data for CIR. Our method requires no additional annotations and can be implemented on the fly, characteristics that are particularly favourable given the high annotation cost of existing CIR datasets. Specifically, we exploit the directionality of CIR queries and design a training process that uses not only the forward queries from reference to target but also the reversed queries from target to reference. The one element missing from this pipeline is the reversed modification text, and obtaining such data would require collecting additional annotations. To resolve this, we instead directly infer the embeddings of said text by leveraging the special text tokens in VLP networks, binding the concept of text directionality through fine-tuning.

Lastly, we propose to rethink the existing training and testing pipeline of CIR, in which a reference image-text query is jointly embedded before being compared against candidate target images via simple feature distance metrics such as cosine similarity. Instead, we advocate using VLP networks to exhaustively score each query-candidate pair, which allows more sophisticated attention mechanisms to be applied when assessing the relevance between each candidate and the query. We design an architecture based upon a recent VLP model that accepts such query-candidate pairs, along with a two-stage scheme that balances inference speed against retrieval accuracy.
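The conventional pipeline the abstract contrasts against (embed the bimodal query once, then rank all candidate images by cosine similarity) can be sketched as follows. This is a minimal illustration only: the embeddings here are random stand-ins, not outputs of the thesis's models or of any actual VLP network, and all names are hypothetical.

```python
import numpy as np

def cosine_similarity(query, candidates):
    """Cosine similarity between one query vector and each row of a candidate matrix."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return c @ q

# Toy stand-ins for learned embeddings. In a real CIR system, `query_embedding`
# would be the joint embedding of the reference image and the modification text.
rng = np.random.default_rng(0)
query_embedding = rng.normal(size=128)                # fused image+text query
candidate_embeddings = rng.normal(size=(1000, 128))   # gallery of target images

# Plant a near-duplicate of the query at index 42 so retrieval has a clear winner.
candidate_embeddings[42] = query_embedding + 0.01 * rng.normal(size=128)

scores = cosine_similarity(query_embedding, candidate_embeddings)
ranking = np.argsort(-scores)  # candidate indices, best match first
print(ranking[0])
```

Because each candidate is reduced to a single precomputed vector, this scheme is fast but limited to one dot product per candidate; the exhaustive query-candidate scoring advocated in the abstract trades some of that speed for richer cross-attention between query and candidate, hence the two-stage scheme.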
dc.identifier.uri: http://hdl.handle.net/1885/313081
dc.language.iso: en_AU
dc.title: Retrieving Images through Bi-modal Visual and Language Queries
dc.type: Thesis (PhD)
local.contributor.authoremail: u5689359@anu.edu.au
local.contributor.supervisor: Gould, Stephen
local.contributor.supervisorcontact: u4971180@anu.edu.au
local.identifier.doi: 10.25911/045M-ZJ40
local.mintdoi: mint
local.thesisANUonly.author: 330e32e9-620a-47ed-8d4c-b4854cbe38fc
local.thesisANUonly.key: 2c51dd1e-6895-1f49-4770-0b0622894f61
local.thesisANUonly.title: 000000021097_TC_1

Files

Original bundle
Name: Liu_thesis_2024.pdf
Size: 28.19 MB
Format: Adobe Portable Document Format
Description: Thesis Material