BioBBC: a multi-feature model that enhances the detection of biomedical entities.

Keywords: BiLSTM; BioBERT; Biomedical named entity recognition; Machine learning; NER; Natural language processing; PubMedBERT

Journal

Scientific reports
ISSN: 2045-2322
Abbreviated title: Sci Rep
Country: England
NLM ID: 101563288

Publication information

Publication date:
02 Apr 2024
History:
received: 31 Oct 2023
accepted: 27 Mar 2024
medline: 03 Apr 2024
pubmed: 03 Apr 2024
entrez: 02 Apr 2024
Status: epublish

Abstract

The rapid increase in biomedical publications necessitates efficient systems that automatically handle Biomedical Named Entity Recognition (BioNER) in unstructured text. However, accurately detecting biomedical entities is challenging because of the complexity of their names and the frequent use of abbreviations. In this paper, we propose BioBBC, a deep learning (DL) model that uses multi-feature embeddings and builds on the BERT-BiLSTM-CRF architecture to address the BioNER task. BioBBC consists of three main layers: an embedding layer, a Bidirectional Long Short-Term Memory (BiLSTM) layer, and a Conditional Random Fields (CRF) layer. BioBBC takes sentences from the biomedical domain as input and identifies the biomedical entities mentioned within the text. The embedding layer generates enriched contextual representation vectors of the input by encoding the text with four types of embeddings: part-of-speech (POS) tag embedding, character-level embedding, BERT embedding, and data-specific embedding. The BiLSTM layer produces additional syntactic and semantic feature representations. Finally, the CRF layer identifies the best possible tag sequence for the input sentence. The model is designed and optimized to detect different types of biomedical entities. In our experiments, it outperformed state-of-the-art (SOTA) models, with significant improvements on six benchmark BioNER datasets.
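
The final step described in the abstract, in which the CRF layer picks the best tag sequence, is commonly implemented with Viterbi decoding. The sketch below is a generic illustration of that idea in plain Python, not the authors' implementation. The BIO tag set, the function name viterbi_decode, and every score in the toy example are assumptions made for illustration only. In a real model the emission scores would come from the BiLSTM outputs and the transition scores would be learned by the CRF.

```python
from typing import Dict, List


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence for one sentence.

    emissions[t][tag]      : score of assigning `tag` to token t (hypothetical values)
    transitions[prev][cur] : score of moving from tag `prev` to tag `cur`
    """
    if not emissions:
        return []

    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    # back[t - 1][tag] = previous tag on that best path
    back = []

    for t in range(1, len(emissions)):
        scores = {}
        pointers = {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[t - 1][p] + transitions[p][cur])
            scores[cur] = best[t - 1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)

    # Start from the best final tag and follow the back-pointers to the start.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy example with made-up scores for the tokens "the breast cancer".
tags = ["B", "I", "O"]
transitions = {
    "B": {"B": -1.0, "I": 1.0, "O": 0.0},
    "I": {"B": -1.0, "I": 0.5, "O": 0.0},
    "O": {"B": 0.0, "I": -10.0, "O": 0.5},  # O -> I is an invalid BIO move
}
emissions = [
    {"B": 0.1, "I": 0.0, "O": 2.0},  # "the"
    {"B": 1.0, "I": 1.2, "O": 0.5},  # "breast"
    {"B": 0.2, "I": 1.5, "O": 0.4},  # "cancer"
]

# Picking the top emission per token would give O, I, I, which breaks the BIO scheme.
greedy = [max(tags, key=lambda tag: e[tag]) for e in emissions]
path = viterbi_decode(emissions, transitions, tags)
print(greedy)
print(path)
```

In this toy case the per-token choice yields an I tag directly after O, while the decoded path respects the transition penalty and starts the entity with B. This is the kind of sequence-level consistency a CRF layer adds on top of token-level scores.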

Identifiers

pubmed: 38565624
doi: 10.1038/s41598-024-58334-x
pii: 10.1038/s41598-024-58334-x

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

7697

Grants

Agency: King Abdullah University of Science and Technology
ID: BAS/1/1059-01-01, BAS/1/1624-01-01
Agency: King Abdullah University of Science and Technology
ID: FCC/1/1976-47-01
Agency: King Abdullah University of Science and Technology
ID: FCC/1/1976-44-01, FCC/1/1976-45-01

Copyright information

© 2024. The Author(s).


Authors

Hind Alamro (H)

Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.
Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.
College of Computing, Umm Al-Qura University, Mecca, Saudi Arabia.

Takashi Gojobori (T)

Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.
Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.

Magbubah Essack (M)

Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia. magbubah.essack@kaust.edu.sa.
Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia. magbubah.essack@kaust.edu.sa.

Xin Gao (X)

Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia. xin.gao@kaust.edu.sa.
Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia. xin.gao@kaust.edu.sa.

MeSH classifications