Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder
| dc.contributor.author | Liu, Zheyuan | en |
| dc.contributor.author | Sun, Weixuan | en |
| dc.contributor.author | Teney, Damien | en |
| dc.contributor.author | Gould, Stephen | en |
| dc.date.accessioned | 2025-05-23T10:21:31Z | |
| dc.date.available | 2025-05-23T10:21:31Z | |
| dc.date.issued | 2024 | en |
| dc.description.abstract | Composed image retrieval aims to find an image that best matches a given multi-modal user query consisting of a reference image and text pair. Existing methods commonly pre-compute image embeddings over the entire corpus and compare these to a reference image embedding modified by the query text at test time. Such a pipeline is very efficient at test time since fast vector distances can be used to evaluate candidates, but modifying the reference image embedding guided only by a short textual description can be difficult, especially independent of potential candidates. An alternative approach is to allow interactions between the query and every possible candidate, i.e., reference-text-candidate triplets, and pick the best from the entire set. Though this approach is more discriminative, for large-scale datasets the computational cost is prohibitive since pre-computation of candidate embeddings is no longer possible. We propose to combine the merits of both schemes using a two-stage model. Our first stage adopts the conventional vector distancing metric and performs a fast pruning among candidates. Meanwhile, our second stage employs a dual-encoder architecture, which effectively attends to the input triplet of reference-text-candidate and re-ranks the candidates. Both stages utilize a vision-and-language pre-trained network, which has proven beneficial for various downstream tasks. Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task. Our implementation is available at https://github.com/Cuberick-Orion/Candidate-Reranking-CIR. | en |
| dc.description.status | Peer-reviewed | en |
| dc.identifier.scopus | 85219552903 | en |
| dc.identifier.uri | http://www.scopus.com/inward/record.url?scp=85219552903&partnerID=8YFLogxK | en |
| dc.identifier.uri | https://hdl.handle.net/1885/733752013 | |
| dc.language.iso | en | en |
| dc.rights | Publisher Copyright: © 2024, Transactions on Machine Learning Research. All rights reserved. | en |
| dc.source | Transactions on Machine Learning Research | en |
| dc.title | Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder | en |
| dc.type | Journal article | en |
| dspace.entity.type | Publication | en |
| local.contributor.affiliation | Liu, Zheyuan; Australian National University | en |
| local.contributor.affiliation | Sun, Weixuan; COVID 19 Extension Scholarship, The Australian National University | en |
| local.contributor.affiliation | Teney, Damien; Australian Institute for Machine Learning (AIML) | en |
| local.contributor.affiliation | Gould, Stephen; School of Computing, ANU College of Systems and Society, The Australian National University | en |
| local.identifier.citationvolume | 2024 | en |
| local.identifier.pure | 64aa5678-3f60-486a-a52b-f4fdd4303c6b | en |
| local.identifier.url | https://www.scopus.com/pages/publications/85219552903 | en |
| local.type.status | Published | en |
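
The two-stage scheme described in the abstract (fast pruning by vector distance, then re-ranking of the surviving candidates) can be illustrated with a minimal sketch. This is not the authors' implementation, which is available at the repository linked in the abstract. The function names, the use of cosine similarity for the first stage, and the `rerank_score` callback are all assumptions made for illustration; in the paper the second stage is a dual-encoder model attending to reference-text-candidate triplets, which is represented here only by a placeholder scoring function.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_retrieve(
    query_embedding: Vector,
    candidate_embeddings: List[Vector],
    rerank_score: Callable[[int], float],
    top_k: int,
) -> List[int]:
    """Return candidate indices, best first (hypothetical helper).

    Stage 1: rank all candidates by similarity to the (text-modified)
    query embedding and keep only the top_k. Candidate embeddings can be
    pre-computed, so this stage is cheap.
    Stage 2: re-order the shortlist using rerank_score, a stand-in for a
    more expensive model that scores each query/candidate combination.
    """
    ranked = sorted(
        range(len(candidate_embeddings)),
        key=lambda i: cosine(query_embedding, candidate_embeddings[i]),
        reverse=True,
    )
    shortlist = ranked[:top_k]
    return sorted(shortlist, key=rerank_score, reverse=True)


# Toy usage: four 2-D candidate embeddings and an assumed re-ranker.
candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
toy_scores = {0: 0.2, 1: 0.9, 2: 1.0, 3: 0.5}
result = two_stage_retrieve([1.0, 0.0], candidates, toy_scores.__getitem__, top_k=2)
print(result)  # [1, 0]
```

In the toy run, candidate 2 has the highest re-ranking score but is pruned in the first stage, so the expensive scorer is only called on the two shortlisted candidates. This is the trade-off the abstract describes: the second stage is more discriminative but only affordable on a small candidate set.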