Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder

dc.contributor.author: Liu, Zheyuan [en]
dc.contributor.author: Sun, Weixuan [en]
dc.contributor.author: Teney, Damien [en]
dc.contributor.author: Gould, Stephen [en]
dc.date.accessioned: 2025-05-23T10:21:31Z
dc.date.available: 2025-05-23T10:21:31Z
dc.date.issued: 2024 [en]
dc.description.abstract: Composed image retrieval aims to find an image that best matches a given multi-modal user query consisting of a reference image and text pair. Existing methods commonly pre-compute image embeddings over the entire corpus and compare these to a reference image embedding modified by the query text at test time. Such a pipeline is very efficient at test time since fast vector distances can be used to evaluate candidates, but modifying the reference image embedding guided only by a short textual description can be difficult, especially independent of potential candidates. An alternative approach is to allow interactions between the query and every possible candidate, i.e., reference-text-candidate triplets, and pick the best from the entire set. Though this approach is more discriminative, for large-scale datasets the computational cost is prohibitive since pre-computation of candidate embeddings is no longer possible. We propose to combine the merits of both schemes using a two-stage model. Our first stage adopts the conventional vector distancing metric and performs a fast pruning among candidates. Meanwhile, our second stage employs a dual-encoder architecture, which effectively attends to the input triplet of reference-text-candidate and re-ranks the candidates. Both stages utilize a vision-and-language pre-trained network, which has proven beneficial for various downstream tasks. Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task. Our implementation is available at https://github.com/Cuberick-Orion/Candidate-Reranking-CIR. [en]
dc.description.status: Peer-reviewed [en]
dc.identifier.scopus: 85219552903 [en]
dc.identifier.uri: http://www.scopus.com/inward/record.url?scp=85219552903&partnerID=8YFLogxK [en]
dc.identifier.uri: https://hdl.handle.net/1885/733752013
dc.language.iso: en [en]
dc.rights: Publisher Copyright: © 2024, Transactions on Machine Learning Research. All rights reserved. [en]
dc.source: Transactions on Machine Learning Research [en]
dc.title: Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder [en]
dc.type: Journal article [en]
dspace.entity.type: Publication [en]
local.contributor.affiliation: Liu, Zheyuan; Australian National University [en]
local.contributor.affiliation: Sun, Weixuan; COVID 19 Extension Scholarship, The Australian National University [en]
local.contributor.affiliation: Teney, Damien; Australian Institute for Machine Learning (AIML) [en]
local.contributor.affiliation: Gould, Stephen; School of Computing, ANU College of Systems and Society, The Australian National University [en]
local.identifier.citationvolume: 2024 [en]
local.identifier.pure: 64aa5678-3f60-486a-a52b-f4fdd4303c6b [en]
local.identifier.url: https://www.scopus.com/pages/publications/85219552903 [en]
local.type.status: Published [en]
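
The two-stage scheme described in the abstract (fast pruning by vector distance, then re-ranking of the shortlist with a model that sees each candidate) can be illustrated with a minimal Python sketch. This is not the authors' implementation, which is available at the repository linked in the abstract. All function names here are hypothetical, cosine similarity is an assumed choice of vector distance, and `score_fn` is a stand-in for the dual-encoder re-ranker that would score a reference-text-candidate triplet.

```python
import numpy as np


def first_stage_prune(query_emb, corpus_embs, k):
    """Stage 1 (assumed form): keep the k candidates whose pre-computed
    embeddings have the highest cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q
    k = min(k, len(sims))
    # Indices of the k most similar candidates, best first.
    return np.argsort(-sims)[:k]


def second_stage_rerank(candidate_ids, score_fn):
    """Stage 2 (assumed form): re-order the shortlist using a more expensive
    scorer. score_fn(candidate_id) is a hypothetical placeholder for a model
    that attends jointly to the reference image, the text, and the candidate."""
    scores = np.array([score_fn(int(i)) for i in candidate_ids])
    order = np.argsort(-scores, kind="stable")
    return candidate_ids[order]


def two_stage_retrieve(query_emb, corpus_embs, score_fn, k=50):
    """Prune the corpus cheaply, then re-rank only the surviving candidates."""
    shortlist = first_stage_prune(query_emb, corpus_embs, k)
    return second_stage_rerank(shortlist, score_fn)


# Toy usage: four 2-D candidate embeddings and a made-up re-ranking score.
corpus = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])
toy_scores = {0: 0.2, 1: 0.9}  # hypothetical stage-2 scores
ranked = two_stage_retrieve(query, corpus, lambda i: toy_scores.get(i, 0.0), k=2)
```

The expensive scorer runs only on the `k` shortlisted candidates, so its cost is independent of corpus size, while the first stage can still rely on pre-computed embeddings.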
