Fusion of Multi-Modal Features to Enhance Dense Video Caption.
dense video caption
feature extraction
multi-modal feature fusion
neural network
video captioning
Journal
Sensors (Basel, Switzerland)
ISSN: 1424-8220
Abbreviated title: Sensors (Basel)
Country: Switzerland
NLM ID: 101204366
Publication information
Publication date: 14 Jun 2023
History:
received: 11 May 2023
revised: 30 May 2023
accepted: 07 Jun 2023
medline: 10 Jul 2023
pubmed: 08 Jul 2023
entrez: 08 Jul 2023
Status: epublish
Abstract
Dense video captioning aims to help computers analyze the content of a video by generating abstract captions for a sequence of video frames. However, most existing methods use only the visual features of the video and ignore the audio features, which are also essential for understanding it. In this paper, we propose a fusion model built on the Transformer framework that integrates both visual and audio features of a video for captioning. We use multi-head attention to handle the differing sequence lengths of the modalities involved in our approach. We also introduce a Common Pool that stores the generated features and aligns them with the time steps, filtering the information and eliminating redundancy based on confidence scores. Moreover, we use an LSTM as the decoder to generate the description sentences, which reduces the memory footprint of the entire network. Experiments show that our method is competitive on the ActivityNet Captions dataset.
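The sketch below illustrates, in PyTorch, the general kind of fusion the abstract describes: visual and audio feature sequences of different lengths are combined with multi-head cross-attention, and an LSTM decodes the caption tokens. All module names, dimensions, and the residual fusion step are illustrative assumptions for exposition; they are not the authors' published implementation (the Common Pool and confidence-based filtering are omitted).

import torch
import torch.nn as nn

class AVFusionCaptioner(nn.Module):
    """Hypothetical audio-visual fusion captioner (assumed architecture)."""
    def __init__(self, vis_dim=1024, aud_dim=128, d_model=512,
                 vocab_size=10000, n_heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)   # project visual features
        self.aud_proj = nn.Linear(aud_dim, d_model)   # project audio features
        # Cross-attention: visual queries attend over audio keys/values,
        # which accommodates differing sequence lengths between the streams.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, vis_feats, aud_feats, captions):
        # vis_feats: (B, Tv, vis_dim); aud_feats: (B, Ta, aud_dim); Tv may differ from Ta
        v = self.vis_proj(vis_feats)
        a = self.aud_proj(aud_feats)
        fused, _ = self.cross_attn(query=v, key=a, value=a)   # (B, Tv, d_model)
        fused = fused + v                                      # residual fusion (assumed)
        # Initialise the LSTM hidden state from the pooled fused features.
        ctx = fused.mean(dim=1, keepdim=True)                  # (B, 1, d_model)
        h0 = ctx.transpose(0, 1).contiguous()                  # (1, B, d_model)
        c0 = torch.zeros_like(h0)
        tok = self.embed(captions)                             # (B, L, d_model)
        dec_out, _ = self.decoder(tok, (h0, c0))
        return self.out(dec_out)                               # (B, L, vocab_size)

if __name__ == "__main__":
    model = AVFusionCaptioner()
    vis = torch.randn(2, 40, 1024)      # e.g. 40 visual frame features
    aud = torch.randn(2, 25, 128)       # e.g. 25 audio segment features
    caps = torch.randint(0, 10000, (2, 12))
    print(model(vis, aud, caps).shape)  # torch.Size([2, 12, 10000])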
Identifiers
pubmed: 37420732
pii: s23125565
doi: 10.3390/s23125565
pmc: PMC10304565
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Grants
Agency: Macao Polytechnic University
ID: RP/ESCA-03/2020
Agency: Macao Polytechnic University
ID: RP/FCA-06/2023
Agency: National Natural Science Foundation of China
ID: 61872025