FastHPOCR: Pragmatic, fast and accurate concept recognition using the Human Phenotype Ontology.
Journal
Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944
Publication information
Publication date:
24 Jun 2024
History:
received: 25 Mar 2024
revised: 18 May 2024
accepted: 19 Jun 2024
medline: 24 Jun 2024
pubmed: 24 Jun 2024
entrez: 24 Jun 2024
Status:
aheadofprint
Abstract
Human Phenotype Ontology (HPO)-based phenotype concept recognition underpins a faster and more effective mechanism to create patient phenotype profiles or to document novel phenotype-centred knowledge statements. While the increasing adoption of large language models (LLMs) for natural language understanding has led to several LLM-based solutions, we argue that their intrinsic resource-intensive nature is not suitable for realistic management of the phenotype concept recognition (CR) lifecycle. Consequently, we propose to go back to the basics and adopt a dictionary-based approach that enables both an immediate refresh of the ontological concepts and efficient re-analysis of past data. We developed a dictionary-based approach that uses a large pre-built collection of clusters of morphologically equivalent tokens to address lexical variability, together with a more effective concept recognition step that restricts entity boundary detection strictly to candidates consisting of tokens belonging to ontology concepts. Our method achieves state-of-the-art results (0.76 F1 on the GSC+ corpus) and processes 10,000 publication abstracts in 5 s. FastHPOCR is available as a Python package installable via pip. The source code is available at https://github.com/tudorgroza/fast_hpo_cr. A Java implementation of FastHPOCR will be made available as part of the Fenominal Java library available at https://github.com/monarch-initiative/fenominal. The up-to-date GSC-2024 corpus is available at https://github.com/tudorgroza/code-for-papers/tree/main/gsc-2024.
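The two ideas named in the abstract (token clusters for lexical variability, and candidate spans limited to ontology tokens) can be illustrated with a minimal Python sketch. This is not the FastHPOCR implementation or its API: the cluster table, the three-entry label dictionary, and the function names `build_index`, `candidate_spans` and `annotate` are all hypothetical, and the greedy longest-match strategy is an assumption made for the example.

```python
import re

# Hypothetical toy clusters: every surface form maps to one cluster key.
# A real system would load a large pre-built collection instead.
CLUSTERS = {
    "seizure": "seizure", "seizures": "seizure",
    "delay": "delay", "delayed": "delay",
    "development": "development", "developmental": "development",
    "global": "global",
    "microcephaly": "microcephaly", "microcephalic": "microcephaly",
}

# Toy subset of ontology labels, for illustration only.
LABELS = {
    "HP:0001250": "Seizure",
    "HP:0001263": "Global developmental delay",
    "HP:0000252": "Microcephaly",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(labels):
    """Map the set of cluster keys of each label to its concept ID."""
    index = {}
    for concept_id, label in labels.items():
        key = frozenset(CLUSTERS.get(t, t) for t in tokenize(label))
        index[key] = concept_id
    return index


def candidate_spans(tokens):
    """Yield maximal runs of consecutive tokens found in the ontology vocabulary."""
    run = []
    for tok in tokens:
        if tok in CLUSTERS:
            run.append(CLUSTERS[tok])
        else:
            if run:
                yield run
            run = []
    if run:
        yield run


def annotate(text, index):
    """Greedy longest-match lookup inside each candidate span."""
    found = []
    for run in candidate_spans(tokenize(text)):
        i = 0
        while i < len(run):
            for j in range(len(run), i, -1):
                hit = index.get(frozenset(run[i:j]))
                if hit:
                    found.append(hit)
                    i = j
                    break
            else:
                i += 1  # no concept starts at this token
    return found


index = build_index(LABELS)
print(annotate("Delayed global development, seizures and microcephalic.", index))
```

Because the index is an ordinary dictionary built from the label list, refreshing it after an ontology update in this sketch amounts to calling `build_index` again, which mirrors the "immediate refresh" argument made in the abstract.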
Identifiers
pubmed: 38913850
pii: 7698025
doi: 10.1093/bioinformatics/btae406
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Copyright information
© The Author(s) 2024. Published by Oxford University Press.