Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model.
Gaussian mixture model
MFCC
SyncNet
diarization error rate
speaker diarization
speech activity detection
Journal
Sensors (Basel, Switzerland)
ISSN: 1424-8220
Abbreviated title: Sensors (Basel)
Country: Switzerland
NLM ID: 101204366
Publication information
Publication date:
25 Nov 2019
History:
received: 27 Oct 2019
revised: 21 Nov 2019
accepted: 21 Nov 2019
entrez: 29 Nov 2019
pubmed: 30 Nov 2019
medline: 30 Nov 2019
Status:
epublish
Abstract
Speaker diarization systems aim to determine "who spoke when?" in multi-speaker recordings. The datasets usually consist of meetings, TV/talk shows, telephone conversations, and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique that finds the active speaker through an audio-visual synchronization model. A pre-trained audio-visual synchronization model is used to find the synchronization between a visible person and the respective audio. For that purpose, short video segments comprising face-only regions are acquired using a face detection technique and are then fed to the pre-trained model. This model is a two-stream network that matches audio frames with their respective visual input segments. Based on the high-confidence video segments inferred by the model, the respective audio frames are used to train Gaussian mixture model (GMM)-based clusters. This method helps in generating speaker-specific clusters with high probability. We tested our approach on a popular subset of the AMI meeting corpus consisting of 5.4 h of audio recordings and 5.8 h of a different set of multimodal recordings. The proposed method shows a significant improvement in terms of diarization error rate (DER) when compared to conventional and fully supervised audio-based speaker diarization. The results of the proposed technique are very close to those of complex state-of-the-art multimodal diarization, which shows the significance of such a simple yet effective technique.
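The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each segment already carries a face-track label and a synchronization confidence score, it uses a hypothetical threshold (`CONF_THRESHOLD`), and it substitutes a single diagonal Gaussian per speaker for a full GMM to stay dependency-free. The toy two-dimensional frames stand in for MFCC vectors.

```python
import math
from collections import defaultdict

# Hypothetical cut-off on the synchronization confidence; not taken from the paper.
CONF_THRESHOLD = 0.8


def fit_diag_gaussian(frames):
    """Fit one diagonal Gaussian (per-dimension mean and variance) to feature frames.

    Simplification: a single component stands in for a full GMM.
    """
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Floor the variance so that degenerate dimensions do not cause division by zero.
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6)
           for d in range(dim)]
    return mean, var


def log_likelihood(frame, model):
    """Log-density of one frame under a diagonal Gaussian."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(frame, mean, var))


def train_speaker_models(segments, threshold=CONF_THRESHOLD):
    """Pool frames from high-confidence segments and fit one model per speaker.

    segments: iterable of (speaker_id, sync_confidence, list_of_frames).
    """
    pooled = defaultdict(list)
    for speaker, confidence, frames in segments:
        if confidence >= threshold:
            pooled[speaker].extend(frames)
    return {spk: fit_diag_gaussian(frames) for spk, frames in pooled.items()}


def assign_frames(frames, models):
    """Label each frame with the speaker whose model gives the highest likelihood."""
    return [max(models, key=lambda s: log_likelihood(f, models[s])) for f in frames]


# Toy data: speaker ids, confidences, and frames are made up for illustration.
segments = [
    ("A", 0.95, [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]),
    ("B", 0.90, [[5.0, 5.1], [4.8, 5.2], [5.1, 4.9]]),
    ("B", 0.30, [[0.1, 0.0]]),  # low confidence: excluded from training
]
models = train_speaker_models(segments)
labels = assign_frames([[0.05, 0.0], [4.9, 5.0]], models)
```

Training only on high-confidence segments keeps poorly synchronized frames (such as the third segment above) from contaminating a speaker's model. The remaining frames are then labeled by maximum likelihood.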
Identifiers
pubmed: 31775385
pii: s19235163
doi: 10.3390/s19235163
pmc: PMC6929047
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Conflict of interest statement
The authors declare no conflict of interest.