Retrieving Images through Bi-modal Visual and Language Queries

Liu, Zheyuan

Date

2024

Abstract

Image retrieval has been studied for decades and has become a common tool in present-day computer systems. Conventional image retrieval uses input queries of simple forms, e.g., short text or another image, as one would see in online search engines nowadays. These types of queries are straightforward to obtain but can be imprecise or inadequate to convey user intentions. With language, it can be difficult to describe an entire image down to the details without being verbose. Alternatively, systems that accept an image as input are usually designed to retrieve targets that are visually similar, but they leave no room for further modifications to the source image, even though a perfect image that depicts one's intention can be hard to obtain. Given this, perhaps a more natural setup would be to combine the two forms of queries so that text and image can complement each other. We therefore arrive at a form of image retrieval where the user provides a reference image as well as some text that states further desired modifications to said image. The system should then return an image that reflects all modifications stated by the user, while still remaining similar to the reference image in other aspects. This task is termed composed image retrieval (CIR) and is the topic of this thesis. Compared to conventional retrieval setups, CIR allows a more interactive user experience, while presenting a valuable opportunity for studying the fine-grained, sometimes ambiguous human intentions in the vision-and-language domain.

In this thesis, we make several contributions to the task of CIR, with the aim of better understanding the visiolinguistic reasoning involved and thus building a more efficient retrieval experience. We begin by introducing the first dataset on CIR with generic images and high-quality human annotations. Compared to existing datasets that are synthesized, repurposed from other tasks, or focused on specific domains (e.g., fashion products), ours can better facilitate the study of complex vision-and-language reasoning present in human intentions. Along with the dataset, we propose a method that leverages vision-and-language pre-trained (VLP) networks. We test our method on both the existing fashion dataset and our newly proposed one to demonstrate its capability. Collectively, this work extends the task of CIR into the generic image domain.

Next, we present a data augmentation scheme that better utilizes training data for CIR. Our method requires no additional annotations and can be implemented on the fly. These characteristics are particularly valuable given the high annotation cost of existing CIR datasets. Specifically, we exploit the directionality of CIR queries and design a training process that uses not only the forward queries from reference to target but also the reversed queries from target to reference. One element missing in the pipeline is the reversed modification text, and obtaining such data would require collecting additional annotations. To resolve this, we instead directly infer the embeddings of said text by leveraging the special text tokens in VLP networks to bind the concept of directionality of text through fine-tuning.

Last, we propose to re-think the existing training and testing pipeline of CIR, where a reference image-text query is jointly embedded before being compared against candidate target images via simple feature distance metrics such as cosine similarity. Instead, we advocate for using VLP networks to exhaustively score each query-candidate pair. Using VLP networks in this manner allows more sophisticated attention mechanisms to be applied when assessing the relevance of each candidate to the query. We design an architecture based upon a recent VLP model that accepts such query-candidate pairs, along with a two-stage scheme that balances inference speed with retrieval accuracy.
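As a rough illustration of the two-stage idea described in the abstract, the following Python sketch first filters candidates by cosine similarity against a jointly embedded query and then re-ranks only the survivors with a more expensive pair scorer. It is a minimal sketch under stated assumptions, not the thesis implementation: the function names, the `top_k` parameter, the toy embeddings, and the `pair_scorer` callable (standing in for a VLP network that scores a query-candidate pair) are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def two_stage_retrieve(
    query_embedding: Vector,
    candidate_embeddings: List[Vector],
    pair_scorer: Callable[[int], float],
    top_k: int,
) -> List[Tuple[int, float]]:
    """Return (candidate index, stage-2 score) pairs, best first.

    Stage 1 ranks every candidate by cosine similarity to the jointly
    embedded query and keeps the top_k. Stage 2 calls the hypothetical
    pair_scorer only on those survivors and re-orders them by its score.
    """
    stage1 = sorted(
        range(len(candidate_embeddings)),
        key=lambda i: cosine_similarity(query_embedding, candidate_embeddings[i]),
        reverse=True,
    )[:top_k]
    reranked = [(i, pair_scorer(i)) for i in stage1]
    reranked.sort(key=lambda item: item[1], reverse=True)
    return reranked


# Toy data (assumed): 2-D embeddings and made-up pair scores.
query = [1.0, 0.0]
candidates = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.4], [-1.0, 0.0]]
toy_scores = {0: 0.3, 1: 0.9, 2: 0.8, 3: 0.1}
results = two_stage_retrieve(query, candidates, toy_scores.__getitem__, top_k=2)
print(results)  # [(2, 0.8), (0, 0.3)]
```

In this toy run, candidate 1 has the highest pair score but is discarded in stage 1, which illustrates the trade-off such a scheme makes: a smaller `top_k` reduces the number of expensive pair evaluations at the risk of filtering out a candidate the second stage would have preferred.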

URI

http://hdl.handle.net/1885/313081

Collections

Open Access Theses

Type

Thesis (PhD)

DOI

10.25911/045M-ZJ40

Downloads

File

Description

Liu_thesis_2024.pdf (28.19 MB)

Thesis Material
