SMAN: Stacked Multimodal Attention Network for Cross-Modal Image-Text Retrieval.


Journal

IEEE transactions on cybernetics
ISSN: 2168-2275
Abbreviated title: IEEE Trans Cybern
Country: United States
NLM ID: 101609393

Publication information

Publication date:
Feb 2022
History:
pubmed: 10 5 2020
medline: 19 2 2022
entrez: 10 5 2020
Status: ppublish

Abstract

This article addresses cross-modal image-text retrieval, an interdisciplinary topic spanning the computer vision and natural language processing communities. Existing methods based on global representation alignment fail to pinpoint the semantically meaningful portions of images and texts, while local representation alignment schemes carry a heavy computational burden because they exhaustively aggregate the similarities between visual fragments and textual words. We propose a stacked multimodal attention network (SMAN) that uses a stacked multimodal attention mechanism to exploit the fine-grained interdependencies between image and text, and maps the aggregation of attended fragments into a common space for measuring cross-modal similarity. Specifically, we sequentially employ intramodal information and multimodal information as guidance to perform multiple-step attention reasoning, so that the fine-grained correlation between image and text can be modeled. As a consequence, the network can discover the semantically meaningful visual regions, or words in a sentence, that contribute to measuring cross-modal similarity more precisely. Moreover, we present a novel bidirectional ranking loss that pulls paired multimodal instances closer together. Doing so makes full use of pairwise supervised information to preserve the manifold structure of heterogeneous pairwise data. Extensive experiments on two benchmark datasets demonstrate that SMAN consistently yields competitive performance compared to state-of-the-art methods.
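
The abstract does not give the formula for its bidirectional ranking loss, so the sketch below is only an illustration of the general idea, not the authors' loss. It implements the common sum-of-hinges bidirectional ranking loss used in image-text retrieval: within a batch of matched pairs, each image should score its own text above every other text by a margin, and each text should do the same for its own image. The function name, the `margin` default, and the use of cosine similarity are all assumptions made for this example.

```python
import numpy as np


def bidirectional_ranking_loss(img, txt, margin=0.2):
    """Sum-of-hinges bidirectional ranking loss (illustrative, not SMAN's loss).

    img, txt: arrays of shape (n, d); row i of img is paired with row i of txt.
    margin:   assumed hyperparameter, the required gap between a matched pair
              and any mismatched pair.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)

    sim = img @ txt.T      # sim[i, j] = similarity of image i and text j
    pos = np.diag(sim)     # similarities of the matched pairs

    # Image-to-text direction: text j is a negative for image i (j != i).
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])
    # Text-to-image direction: image i is a negative for text j (i != j).
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])

    # Matched pairs sit on the diagonal and are excluded from the penalty.
    off_diagonal = 1.0 - np.eye(sim.shape[0])
    return float(((cost_i2t + cost_t2i) * off_diagonal).sum())
```

With well-separated matched pairs the loss is zero; it grows when a mismatched pair scores within `margin` of, or above, the matched pair in either retrieval direction.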

Identifiers

pubmed: 32386178
doi: 10.1109/TCYB.2020.2985716

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

1086-1097

Authors

MeSH classifications