Dual-path Convolutional Image-Text Embeddings with Instance Loss

dc.contributor.author: Zheng, Zhedong
dc.contributor.author: Zheng, Liang
dc.contributor.author: Garret, Michael
dc.contributor.author: Yang, Yi
dc.contributor.author: Xu, Mingliang
dc.contributor.author: Shen, Yi-Dong
dc.date.accessioned: 2024-01-10T00:28:28Z
dc.date.issued: 2020
dc.date.updated: 2022-09-25T08:16:36Z
dc.description.abstract: Matching images and sentences demands a fine understanding of both modalities. In this article, we propose a new system to discriminatively embed the image and text into a shared visual-textual space. In this field, most existing works apply the ranking loss to pull the positive image/text pairs close and push the negative pairs apart from each other. However, directly deploying the ranking loss on heterogeneous features (i.e., text and image features) is less effective, because appropriate triplets are hard to find at the beginning of training, so naively applying the ranking loss may prevent the network from learning the inter-modal relationship. To address this problem, we propose the instance loss, which explicitly considers the intra-modal data distribution. It is based on the unsupervised assumption that each image/text group can be viewed as a class of its own, so the network can learn fine-grained distinctions from every image/text group. Experiments show that the instance loss offers better weight initialization for the ranking loss, so that more discriminative embeddings can be learned. In addition, existing works usually rely on off-the-shelf features, i.e., word2vec and fixed visual features. As a minor contribution, this article therefore constructs an end-to-end dual-path convolutional network to learn the image and text representations. End-to-end learning allows the system to learn directly from the data and fully utilize the supervision. On two generic retrieval datasets (Flickr30k and MSCOCO), experiments demonstrate that our method yields competitive accuracy compared to state-of-the-art methods. Moreover, in language-based person retrieval, we improve the state of the art by a large margin. The code has been made publicly available.
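As a rough illustration of the instance-loss idea described in the abstract (not the authors' actual implementation, which uses dual-path convolutional networks), the sketch below treats each image/text pair as its own class and classifies both modalities with a shared weight matrix, so the two embeddings of a pair are pulled toward a common direction. All names, shapes, and values here are illustrative assumptions.

```python
import math

def softmax_cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def instance_loss(img_feat, txt_feat, shared_W, instance_id):
    """Instance-loss sketch: each image/text pair is its own 'class'.

    Both modalities are scored against the SAME classifier weights
    (shared_W[k] is the weight vector for instance k), which encourages
    the image and text embeddings of one pair to agree.
    """
    img_logits = [sum(w * f for w, f in zip(wk, img_feat)) for wk in shared_W]
    txt_logits = [sum(w * f for w, f in zip(wk, txt_feat)) for wk in shared_W]
    return (softmax_cross_entropy(img_logits, instance_id)
            + softmax_cross_entropy(txt_logits, instance_id))
```

In the paper this classification objective serves as a pretraining/initialization signal before (or alongside) the ranking loss; the sketch reduces it to plain Python for clarity.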
dc.format.mimetype: application/pdf
dc.identifier.issn: 1551-6857
dc.identifier.uri: http://hdl.handle.net/1885/311306
dc.language.iso: en_AU
dc.publisher: Association for Computing Machinery, Inc.
dc.rights: © 2020 The authors
dc.source: ACM Transactions on Multimedia Computing, Communications and Applications
dc.subject: Image-sentence retrieval
dc.subject: cross-modal retrieval
dc.subject: language-based person search
dc.subject: convolutional neural networks
dc.title: Dual-path Convolutional Image-Text Embeddings with Instance Loss
dc.type: Journal article
local.bibliographicCitation.issue: 2
local.bibliographicCitation.lastpage: 23
local.bibliographicCitation.startpage: 1
local.contributor.affiliation: Zheng, Zhedong, University of Technology Sydney
local.contributor.affiliation: Zheng, Liang, College of Engineering and Computer Science, ANU
local.contributor.affiliation: Garret, Michael, Edith Cowan University
local.contributor.affiliation: Yang, Yi, University of Technology Sydney
local.contributor.affiliation: Xu, Mingliang, Zhengzhou University
local.contributor.affiliation: Shen, Yi-Dong, Institute of Software, Chinese Academy of Sciences
local.contributor.authoruid: Zheng, Liang, u1064892
local.description.embargo: 2099-12-31
local.description.notes: Imported from ARIES
local.identifier.absfor: 460304 - Computer vision
local.identifier.absfor: 461103 - Deep learning
local.identifier.ariespublication: a383154xPUB14151
local.identifier.citationvolume: 16
local.identifier.doi: 10.1145/3383184
local.identifier.scopusID: 2-s2.0-85086077560
local.identifier.thomsonID: WOS:000583710600012
local.publisher.url: https://dl.acm.org/
local.type.status: Published Version

Downloads

Original bundle
Name: 3383184.pdf
Size: 13.85 MB
Format: Adobe Portable Document Format