Fusion of Multi-Modal Features to Enhance Dense Video Caption.

dense video caption; feature extraction; multi-modal feature fusion; neural network; video captioning

Journal

Sensors (Basel, Switzerland)
ISSN: 1424-8220
Abbreviated title: Sensors (Basel)
Country: Switzerland
NLM ID: 101204366

Publication information

Publication date:
14 Jun 2023
History:
Received: 11 May 2023
Revised: 30 May 2023
Accepted: 7 Jun 2023
Medline: 10 Jul 2023
PubMed: 8 Jul 2023
Entrez: 8 Jul 2023
Status: epublish

Abstract

Dense video captioning is a task that aims to help computers analyze the content of a video by generating abstract captions for a sequence of video frames. However, most existing methods use only the visual features of the video and ignore the audio features, which are also essential for understanding the video. In this paper, we propose a fusion model built on the Transformer framework that integrates both the visual and audio features of the video for captioning. We use multi-head attention to handle the differing sequence lengths between the models involved in our approach. We also introduce a Common Pool that stores the generated features and aligns them with the time steps, filtering the information and eliminating redundancy based on confidence scores. Moreover, we use an LSTM as the decoder to generate the description sentences, which reduces the memory footprint of the entire network. Experiments show that our method is competitive on the ActivityNet Captions dataset.
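
The abstract describes the architecture only at a high level. Below is a minimal PyTorch sketch of the kind of audio-visual fusion it outlines: multi-head attention relating the two modalities' sequences of different lengths, a confidence-scored pooling step standing in for the Common Pool, and an LSTM decoder. The module names, feature dimensions, and pooling heuristic are illustrative assumptions, not the authors' published implementation.

# A minimal sketch of Transformer-style audio-visual fusion with an LSTM decoder.
# All names, dimensions, and the pooling heuristic are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn


class AudioVisualFusionCaptioner(nn.Module):
    def __init__(self, vis_dim=1024, aud_dim=128, d_model=512,
                 n_heads=8, vocab_size=10000, hidden=512):
        super().__init__()
        # Project the two modalities into a shared embedding space so that
        # attention can relate frame and audio sequences of different lengths.
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.aud_proj = nn.Linear(aud_dim, d_model)
        # Multi-head attention: visual features attend to audio features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # "Common Pool" stand-in: score each fused time step, keep the top-k.
        self.confidence = nn.Linear(d_model, 1)
        # Lightweight LSTM decoder over word embeddings plus the pooled context.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.LSTM(d_model * 2, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, vis_feats, aud_feats, captions, k=8):
        # vis_feats: (B, T_v, vis_dim), aud_feats: (B, T_a, aud_dim),
        # captions: (B, L) token ids
        v = self.vis_proj(vis_feats)
        a = self.aud_proj(aud_feats)
        fused, _ = self.cross_attn(query=v, key=a, value=a)    # (B, T_v, d_model)
        # Keep the k most confident fused steps, average into one context vector.
        scores = self.confidence(fused).squeeze(-1)            # (B, T_v)
        top = scores.topk(min(k, fused.size(1)), dim=1).indices
        pooled = torch.gather(
            fused, 1, top.unsqueeze(-1).expand(-1, -1, fused.size(-1))
        ).mean(dim=1)                                          # (B, d_model)
        # Concatenate the context to every word embedding and decode with the LSTM.
        w = self.embed(captions)                               # (B, L, d_model)
        ctx = pooled.unsqueeze(1).expand(-1, w.size(1), -1)
        h, _ = self.decoder(torch.cat([w, ctx], dim=-1))
        return self.out(h)                                     # (B, L, vocab_size)


if __name__ == "__main__":
    model = AudioVisualFusionCaptioner()
    logits = model(torch.randn(2, 32, 1024), torch.randn(2, 20, 128),
                   torch.randint(0, 10000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 12, 10000])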

Identifiers

pubmed: 37420732
pii: s23125565
doi: 10.3390/s23125565
pmc: PMC10304565

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Grants

Agency: Macao Polytechnic University
ID: RP/ESCA-03/2020
Agency: Macao Polytechnic University
ID: RP/FCA-06/2023
Agency: National Natural Science Foundation of China
ID: 61872025

References

Nature. 2015 May 28;521(7553):436-44
pubmed: 26017442
Neural Comput. 1997 Nov 15;9(8):1735-80
pubmed: 9377276
Neural Netw. 2022 Feb;146:120-129
pubmed: 34852298
Int J Parallel Emergent Distrib Syst. 2020;35(2):112-117
pubmed: 33719359
IEEE Trans Image Process. 2018 Mar;27(3):1060-1075
pubmed: 29053461
IEEE Trans Image Process. 2022;31:5257-5271
pubmed: 35881604

Authors

Xuefei Huang (X)

Faculty of Applied Sciences, Macao Polytechnic University, Macau 999078, China.

Ka-Hou Chan (KH)

Faculty of Applied Sciences, Macao Polytechnic University, Macau 999078, China.
Engineering Research Centre of Applied Technology on Machine Translation and Artificial Intelligence of Ministry of Education, Macao Polytechnic University, Macau 999078, China.

Weifan Wu (W)

Faculty of Applied Sciences, Macao Polytechnic University, Macau 999078, China.

Hao Sheng (H)

Faculty of Applied Sciences, Macao Polytechnic University, Macau 999078, China.
State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing 100191, China.
Beihang Hangzhou Innovation Institute Yuhang, Yuhang District, Hangzhou 310023, China.

Wei Ke (W)

Faculty of Applied Sciences, Macao Polytechnic University, Macau 999078, China.
Engineering Research Centre of Applied Technology on Machine Translation and Artificial Intelligence of Ministry of Education, Macao Polytechnic University, Macau 999078, China.

MeSH classifications