Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model.

Gaussian mixture model; MFCC; SyncNet; diarization error rate; speaker diarization; speech activity detection

Journal

Sensors (Basel, Switzerland)
ISSN: 1424-8220
Abbreviated title: Sensors (Basel)
Country: Switzerland
NLM ID: 101204366

Publication information

Publication date:
25 Nov 2019
History:
received: 27 10 2019
revised: 21 11 2019
accepted: 21 11 2019
entrez: 29 11 2019
pubmed: 30 11 2019
medline: 30 11 2019
Status: epublish

Abstract

Speaker diarization systems aim to determine 'who spoke when?' in multi-speaker recordings. The datasets usually consist of meetings, TV/talk shows, telephone conversations, and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique that finds the active speaker through an audio-visual synchronization model. A pre-trained audio-visual synchronization model is used to measure the synchronization between a visible person and the respective audio. For that purpose, short video segments comprising face-only regions are acquired using a face detection technique and are then fed to the pre-trained model. This model is a two-stream network that matches audio frames with their respective visual input segments. The audio frames corresponding to the video segments that the model infers with high confidence are used to train Gaussian mixture model (GMM)-based clusters. This method helps in generating speaker-specific clusters with high probability. We tested our approach on a popular subset of the AMI meeting corpus consisting of 5.4 h of audio recordings and a different set of 5.8 h of multimodal recordings. The proposed method shows a significant improvement in terms of diarization error rate (DER) when compared to conventional and fully supervised audio-based speaker diarization. The results of the proposed technique are very close to those of complex state-of-the-art multimodal diarization, which shows the significance of such a simple yet effective technique.
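
The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the synchronization model has already produced a confidence score per face-track segment, that one feature vector (for example MFCCs) is available per audio frame, that one GMM is trained per speaker, and that every frame is then assigned to the speaker whose GMM scores it highest. The names `DiagGMM` and `diarize`, the segment tuple layout, the threshold of 0.5, and the two mixture components are all hypothetical choices made for illustration.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


class DiagGMM:
    """Small diagonal-covariance GMM fitted with EM (illustrative only)."""

    def __init__(self, n_components=2, n_iter=20, var_floor=1e-3, seed=0):
        self.k, self.n_iter = n_components, n_iter
        self.var_floor, self.seed = var_floor, seed

    def _log_joint(self, x):
        # log(weight_j) + log N(x | mean_j, var_j) for each component j
        return [math.log(w) + log_gauss_diag(x, m, v)
                for w, m, v in zip(self.weights, self.means, self.vars)]

    def fit(self, frames):
        n, dim = len(frames), len(frames[0])
        rng = random.Random(self.seed)
        # Initialise means from random frames, variances from the global variance.
        self.means = [list(f) for f in rng.sample(frames, self.k)]
        g_mean = [sum(f[d] for f in frames) / n for d in range(dim)]
        g_var = [max(sum((f[d] - g_mean[d]) ** 2 for f in frames) / n,
                     self.var_floor) for d in range(dim)]
        self.vars = [list(g_var) for _ in range(self.k)]
        self.weights = [1.0 / self.k] * self.k
        for _ in range(self.n_iter):
            # E-step: responsibility of each component for each frame.
            resp = []
            for f in frames:
                logs = self._log_joint(f)
                norm = logsumexp(logs)
                resp.append([math.exp(v - norm) for v in logs])
            # M-step: re-estimate weights, means and floored variances.
            for j in range(self.k):
                nj = sum(r[j] for r in resp) + 1e-10
                self.weights[j] = nj / n
                self.means[j] = [
                    sum(r[j] * f[d] for r, f in zip(resp, frames)) / nj
                    for d in range(dim)]
                self.vars[j] = [
                    max(sum(r[j] * (f[d] - self.means[j][d]) ** 2
                            for r, f in zip(resp, frames)) / nj,
                        self.var_floor)
                    for d in range(dim)]
        return self

    def score(self, x):
        """Log-likelihood of one frame under the mixture."""
        return logsumexp(self._log_joint(x))


def diarize(frames, sync_segments, threshold=0.5, n_components=2):
    """Label every audio frame with the best-scoring speaker model.

    frames: one feature vector (e.g. MFCCs) per audio frame.
    sync_segments: (start_frame, end_frame, speaker_id, confidence) tuples,
        where confidence is an assumed, precomputed synchronization score.
    """
    training = {}
    for start, end, speaker, confidence in sync_segments:
        if confidence >= threshold:  # keep only high-confidence segments
            training.setdefault(speaker, []).extend(frames[start:end])
    models = {spk: DiagGMM(n_components).fit(data)
              for spk, data in training.items()}
    return [max(models, key=lambda spk: models[spk].score(f)) for f in frames]
```

Calling `diarize(frames, segments)` returns one speaker label per frame. Low-confidence segments contribute no training data, so an unreliable face-to-audio match cannot contaminate a speaker's model. A real system would also need speech activity detection and a feature extractor, both of which this sketch leaves out.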

Identifiers

pubmed: 31775385
pii: s19235163
doi: 10.3390/s19235163
pmc: PMC6929047

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Conflict of interest statement

The authors declare no conflict of interest.

Authors

Rehan Ahmad (R)

Department of Electrical Engineering, International Islamic University, Islamabad 44000, Pakistan.

Syed Zubair (S)

Analytics Camp, Islamabad 44000, Pakistan.

Hani Alquhayz (H)

Department of Computer Science and Information, College of Science in Zulfi, Majmaah University, Al-Majmaah 11952, Saudi Arabia.

Allah Ditta (A)

Division of Science & Technology, University of Education, Township, Lahore 54770, Pakistan.

MeSH classifications