Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder

Liu, Zheyuan; Sun, Weixuan; Teney, Damien; Gould, Stephen

dc.contributor.author	Liu, Zheyuan	en
dc.contributor.author	Sun, Weixuan	en
dc.contributor.author	Teney, Damien	en
dc.contributor.author	Gould, Stephen	en
dc.date.accessioned	2025-05-23T10:21:31Z
dc.date.available	2025-05-23T10:21:31Z
dc.date.issued	2024	en
dc.description.abstract	Composed image retrieval aims to find an image that best matches a given multi-modal user query consisting of a reference image and text pair. Existing methods commonly pre-compute image embeddings over the entire corpus and compare these to a reference image embedding modified by the query text at test time. Such a pipeline is very efficient at test time since fast vector distances can be used to evaluate candidates, but modifying the reference image embedding guided only by a short textual description can be difficult, especially independent of potential candidates. An alternative approach is to allow interactions between the query and every possible candidate, i.e., reference-text-candidate triplets, and pick the best from the entire set. Though this approach is more discriminative, for large-scale datasets the computational cost is prohibitive since pre-computation of candidate embeddings is no longer possible. We propose to combine the merits of both schemes using a two-stage model. Our first stage adopts the conventional vector distancing metric and performs a fast pruning among candidates. Meanwhile, our second stage employs a dual-encoder architecture, which effectively attends to the input triplet of reference-text-candidate and re-ranks the candidates. Both stages utilize a vision-and-language pre-trained network, which has proven beneficial for various downstream tasks. Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task. Our implementation is available at https://github.com/Cuberick-Orion/Candidate-Reranking-CIR.	en
dc.description.status	Peer-reviewed	en
dc.identifier.scopus	85219552903	en
dc.identifier.uri	http://www.scopus.com/inward/record.url?scp=85219552903&partnerID=8YFLogxK	en
dc.identifier.uri	https://hdl.handle.net/1885/733752013
dc.language.iso	en	en
dc.rights	Publisher Copyright: © 2024, Transactions on Machine Learning Research. All rights reserved.	en
dc.source	Transactions on Machine Learning Research	en
dc.title	Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder	en
dc.type	Journal article	en
dspace.entity.type	Publication	en
local.contributor.affiliation	Liu, Zheyuan; Australian National University	en
local.contributor.affiliation	Sun, Weixuan; COVID 19 Extension Scholarship, The Australian National University	en
local.contributor.affiliation	Teney, Damien; Australian Institute for Machine Learning (AIML)	en
local.contributor.affiliation	Gould, Stephen; School of Computing, ANU College of Systems and Society, The Australian National University	en
local.identifier.citationvolume	2024	en
local.identifier.pure	64aa5678-3f60-486a-a52b-f4fdd4303c6b	en
local.identifier.url	https://www.scopus.com/pages/publications/85219552903	en
local.type.status	Published	en
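
The two-stage scheme described in the abstract (fast pruning by vector distance, then re-ranking of the surviving candidates with a model that sees the full reference-text-candidate triplet) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the use of cosine similarity, and the `score_triplet` callable standing in for the second-stage dual-encoder are all assumptions made for this example.

```python
import numpy as np


def prune_candidates(query_emb, corpus_embs, k):
    """Stage 1 (assumed form): keep the k corpus items most similar to the query.

    query_emb   -- 1-D array, the text-modified reference embedding
    corpus_embs -- 2-D array of pre-computed candidate embeddings, one per row
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q  # cosine similarity of every candidate to the query
    return np.argsort(-sims, kind="stable")[:k]


def rerank(candidate_ids, score_triplet):
    """Stage 2 (assumed form): re-order the pruned candidates.

    score_triplet is a hypothetical callable that takes a candidate index and
    returns a relevance score for the (reference, text, candidate) triplet.
    """
    scores = np.array([score_triplet(int(i)) for i in candidate_ids])
    order = np.argsort(-scores, kind="stable")
    return candidate_ids[order]


def two_stage_retrieve(query_emb, corpus_embs, score_triplet, k):
    """Run the cheap pruning pass, then the expensive pass on k items only."""
    return rerank(prune_candidates(query_emb, corpus_embs, k), score_triplet)


# Toy usage with made-up 2-D embeddings and a made-up scorer.
corpus = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])
toy_scores = {0: 0.2, 1: 0.8}  # hypothetical second-stage scores
ranking = two_stage_retrieve(query, corpus, lambda i: toy_scores.get(i, 0.0), k=2)
print(ranking.tolist())  # [1, 0]
```

In this toy run the first stage keeps candidates 0 and 1, and the second stage swaps their order, so the expensive scorer is called on only k items rather than on the whole corpus.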

Collections

ANU Research Publications
