FastHPOCR: Pragmatic, fast and accurate concept recognition using the Human Phenotype Ontology.


Journal

Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944

Publication information

Publication date:
24 Jun 2024
History:
received: 25 03 2024
revised: 18 05 2024
accepted: 19 06 2024
medline: 24 6 2024
pubmed: 24 6 2024
entrez: 24 6 2024
Status: aheadofprint

Abstract

Human Phenotype Ontology (HPO)-based phenotype concept recognition underpins a faster and more effective mechanism to create patient phenotype profiles or to document novel phenotype-centred knowledge statements. While the increasing adoption of large language models (LLMs) for natural language understanding has led to several LLM-based solutions, we argue that their intrinsic resource-intensive nature is not suitable for realistic management of the phenotype concept recognition (CR) lifecycle. Consequently, we propose to go back to the basics and adopt a dictionary-based approach that enables both an immediate refresh of the ontological concepts and efficient re-analysis of past data. We developed a dictionary-based approach that uses a large pre-built collection of clusters of morphologically equivalent tokens to address lexical variability, together with a more effective concept recognition step that restricts entity boundary detection strictly to candidates consisting of tokens belonging to ontology concepts. Our method achieves state-of-the-art results (0.76 F1 on the GSC+ corpus) and a processing efficiency of 10,000 publication abstracts in 5 s. FastHPOCR is available as a Python package installable via pip. The source code is available at https://github.com/tudorgroza/fast_hpo_cr. A Java implementation of FastHPOCR will be made available as part of the Fenominal Java library available at https://github.com/monarch-initiative/fenominal. The up-to-date GSC-2024 corpus is available at https://github.com/tudorgroza/code-for-papers/tree/main/gsc-2024.
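The two ideas in the abstract (token clusters for lexical variability, and candidate spans built only from ontology tokens) can be illustrated with a minimal sketch. This is not the FastHPOCR code or its API. The names TOY_LABELS, TOY_CLUSTERS, cluster_of, build_index and annotate are hypothetical, the three concepts are a hand-picked toy dictionary, and the hand-written cluster map stands in for the pre-built cluster collection the abstract mentions.

```python
import re

# Toy dictionary (assumption): a few HPO-style IDs with labels and synonyms.
TOY_LABELS = {
    "HP:0001250": ["Seizure"],
    "HP:0001252": ["Hypotonia", "Muscular hypotonia"],
    "HP:0000252": ["Microcephaly"],
}

# Hand-written stand-in for clusters of morphologically equivalent tokens:
# every surface form maps to one cluster identifier.
TOY_CLUSTERS = {
    "seizure": "seizure", "seizures": "seizure",
    "hypotonia": "hypotonia", "hypotonic": "hypotonia",
    "muscular": "muscle", "muscle": "muscle",
    "microcephaly": "microcephaly", "microcephalic": "microcephaly",
}

TOKEN = re.compile(r"[A-Za-z]+")


def cluster_of(token):
    """Return the cluster id of a token, or None if it is not an ontology token."""
    return TOY_CLUSTERS.get(token.lower())


def build_index(labels):
    """Map each label, expressed as a tuple of cluster ids, to its concept ID."""
    index = {}
    for hpo_id, names in labels.items():
        for name in names:
            key = tuple(cluster_of(t) for t in TOKEN.findall(name))
            if None not in key:
                index[key] = hpo_id
    return index


def annotate(text, index):
    """Return (start, end, matched_text, concept_id) tuples found in text."""
    tokens = [(m.start(), m.end(), cluster_of(m.group()))
              for m in TOKEN.finditer(text)]
    max_len = max(len(key) for key in index)
    hits = []
    i = 0
    while i < len(tokens):
        if tokens[i][2] is None:
            i += 1
            continue
        # A candidate run contains only consecutive ontology tokens, so
        # entity boundaries never extend into out-of-vocabulary text.
        j = i
        while j < len(tokens) and tokens[j][2] is not None:
            j += 1
        k = i
        while k < j:
            # Greedy longest match inside the run.
            for n in range(min(max_len, j - k), 0, -1):
                key = tuple(t[2] for t in tokens[k:k + n])
                if key in index:
                    start, end = tokens[k][0], tokens[k + n - 1][1]
                    hits.append((start, end, text[start:end], index[key]))
                    k += n
                    break
            else:
                k += 1
        i = j
    return hits


if __name__ == "__main__":
    index = build_index(TOY_LABELS)
    sentence = "The patient was hypotonic, with muscle hypotonia and seizures."
    for hit in annotate(sentence, index):
        print(hit)
```

In this sketch, refreshing the concepts amounts to calling build_index again on the new labels, which is the kind of cheap update a dictionary-based design allows; a real system would load the labels and clusters from an ontology release rather than hard-code them.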

Identifiers

pubmed: 38913850
pii: 7698025
doi: 10.1093/bioinformatics/btae406

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Copyright information

© The Author(s) 2024. Published by Oxford University Press.

Authors

Tudor Groza (T)

Rare Care Centre, Perth Children's Hospital, Nedlands, WA 6009, Australia.
Telethon Kids Institute, Nedlands, WA 6009, Australia.
School of Electrical Engineering, Computing and Mathematical Sciences, Curtin University, Kent St, Bentley WA 6102, Australia.
SingHealth Duke-NUS Institute of Precision Medicine, 5 Hospital Drive Level 9, Singapore 169609, Singapore.

Dylan Gration (D)

Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital, 374 Bagot Road, Subiaco WA 6008, Australia.

Gareth Baynam (G)

Rare Care Centre, Perth Children's Hospital, Nedlands, WA 6009, Australia.
Telethon Kids Institute, Nedlands, WA 6009, Australia.
Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital, 374 Bagot Road, Subiaco WA 6008, Australia.
Faculty of Health and Medical Sciences, University of Western Australia, 35 Stirling Hwy, Crawley WA 6009, Australia.

Peter N Robinson (PN)

Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany.
The Jackson Laboratory for Genomic Medicine, Farmington 06032 CT, USA.

MeSH classifications