Multi-Task Transformer with Adaptive Cross-Entropy Loss for Multi-Dialect Speech Recognition.

Keywords: adaptive cross-entropy loss; multi-dialect speech recognition; multi-task Transformer

Journal

Entropy (Basel, Switzerland)
ISSN: 1099-4300
Abbreviated title: Entropy (Basel)
Country: Switzerland
NLM ID: 101243874

Publication information

Publication date:
8 Oct 2022
History:
received: 6 Aug 2022
revised: 2 Oct 2022
accepted: 7 Oct 2022
medline: 8 Jul 2023
pubmed: 8 Jul 2023
entrez: 8 Jul 2023
Status: epublish

Abstract

At present, most multi-dialect speech recognition models use a hard-parameter-sharing multi-task structure, which makes it difficult to see how one task contributes to the others. In addition, balancing multi-task learning requires manually tuning the weights of the multi-task objective function, which is difficult and costly because many weight combinations must be tried to find the optimal one. In this paper, we propose a multi-dialect acoustic model that combines soft-parameter-sharing multi-task learning with the Transformer, and we introduce several auxiliary cross-attentions so that the auxiliary task (dialect ID recognition) can provide dialect information to the multi-dialect speech recognition task. Furthermore, we use an adaptive cross-entropy loss function as the multi-task objective, which automatically balances the learning of the two tasks according to each task's proportion of the total loss during training. The optimal weight combination can therefore be found without any manual intervention. Finally, for the two tasks of multi-dialect (including low-resource dialect) speech recognition and dialect ID recognition, experimental results show that, compared with a single-dialect Transformer, a single-task multi-dialect Transformer, and a multi-task Transformer with hard parameter sharing, our method significantly reduces the average syllable error rate for Tibetan multi-dialect speech recognition and the character error rate for Chinese multi-dialect speech recognition.
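The abstract states that the adaptive cross-entropy loss weights the tasks by each task's share of the total loss, but it does not give the exact formulation. Below is a minimal, hypothetical sketch of one plausible loss-proportion weighting scheme (function names and the exact normalization are assumptions, not the paper's definition):

```python
def adaptive_weights(task_losses):
    # Hypothetical sketch: weight each task in proportion to its share
    # of the total loss, so the currently harder (higher-loss) task
    # receives more emphasis. Not the paper's exact formula.
    total = sum(task_losses)
    return [loss / total for loss in task_losses]

def combined_loss(task_losses):
    # Weighted sum of the per-task cross-entropy losses, e.g. the
    # ASR loss and the dialect-ID loss, using the adaptive weights.
    weights = adaptive_weights(task_losses)
    return sum(w * loss for w, loss in zip(weights, task_losses))

# Example: ASR loss 3.0, dialect-ID loss 1.0 ->
# weights [0.75, 0.25], combined loss 0.75*3.0 + 0.25*1.0 = 2.5
```

In a training loop, the weights would be recomputed from detached loss values at each step, so the balance shifts automatically as one task converges faster than the other.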

Identifiers

pubmed: 37420449
pii: e24101429
doi: 10.3390/e24101429
pmc: PMC9601745

Publication types

Journal Article

Languages

eng

Grants

Agency: National Natural Science Foundation of China
ID: 61976236
Agency: National Social Science Foundation key projects
ID: 20&ZD279


Authors

Zhengjia Dan (Z)

School of Information Engineering, Minzu University of China, Beijing 100081, China.

Yue Zhao (Y)

School of Information Engineering, Minzu University of China, Beijing 100081, China.

Xiaojun Bi (X)

School of Information Engineering, Minzu University of China, Beijing 100081, China.

Licheng Wu (L)

School of Information Engineering, Minzu University of China, Beijing 100081, China.

Qiang Ji (Q)

Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180-3590, USA.

MeSH classifications