HAMAP as SPARQL rules: A portable annotation pipeline for genomes and proteomes.
SPARQL
function
prediction
protein
Journal
GigaScience
ISSN: 2047-217X
Abbreviated title: Gigascience
Country: United States
NLM ID: 101596872
Publication information
Publication date: 2020-02-01
History:
received: 2019-06-28
revised: 2019-11-30
accepted: 2020-01-13
entrez: 2020-02-09
pubmed: 2020-02-09
medline: 2021-01-28
Status: ppublish
Abstract
Genome and proteome annotation pipelines are generally custom built and not easily reusable by other groups. This leads to duplication of effort, increased costs, and suboptimal annotation quality. One way to address these issues is to encourage the adoption of annotation standards and technological solutions that enable the sharing of biological knowledge and tools for genome and proteome annotation. Here we demonstrate one approach to generate portable genome and proteome annotation pipelines that users can run without recourse to custom software. This proof of concept uses our own rule-based annotation pipeline HAMAP, which provides functional annotation for protein sequences to the same depth and quality as UniProtKB/Swiss-Prot, and the World Wide Web Consortium (W3C) standards Resource Description Framework (RDF) and SPARQL (a recursive acronym for the SPARQL Protocol and RDF Query Language). We translate complex HAMAP rules into the W3C standard SPARQL 1.1 syntax, and then apply them to protein sequences in RDF format using freely available SPARQL engines. This approach supports the generation of annotation that is identical to that generated by our own in-house pipeline, using standard, off-the-shelf solutions, and is applicable to any genome or proteome annotation pipeline. HAMAP SPARQL rules are freely available for download from the HAMAP FTP site, ftp://ftp.expasy.org/databases/hamap/sparql/, under the CC-BY-ND 4.0 license. The annotations generated by the rules are under the CC-BY 4.0 license. A tutorial and supplementary code to use HAMAP as SPARQL are available on GitHub at https://github.com/sib-swiss/HAMAP-SPARQL, and general documentation about HAMAP can be found on the HAMAP website at https://hamap.expasy.org.
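The rule-application step described in the abstract can be pictured with a small toy sketch. It is not HAMAP's rule format and does not use a SPARQL engine: it only mimics, in plain Python, the general shape of a SPARQL CONSTRUCT-style rule, where a set of conditions is matched against triples describing a protein and, when all of them hold, annotation triples are produced. Every predicate, identifier, and the rule itself (`ex:matchesSignature`, `sig:FAM_A`, `Rule`, `apply_rule`, and so on) is a hypothetical name chosen for illustration.

```python
# Toy illustration only: the predicates, identifiers, and the rule below are
# hypothetical and are NOT HAMAP's actual vocabulary or rule content.
from typing import NamedTuple

Triple = tuple[str, str, str]  # (subject, predicate, object)


class Rule(NamedTuple):
    # (predicate, object) pairs that a protein must have for the rule to fire
    conditions: list[tuple[str, str]]
    # (predicate, object) pairs asserted for each protein that satisfies them
    annotations: list[tuple[str, str]]


def apply_rule(rule: Rule, triples: set[Triple]) -> set[Triple]:
    """Return new annotation triples for every subject meeting all conditions.

    This plays the role of a CONSTRUCT { ... } WHERE { ... } query: the
    conditions are the WHERE pattern, the annotations are the template.
    """
    subjects = {s for s, _, _ in triples}
    produced: set[Triple] = set()
    for s in subjects:
        if all((s, p, o) in triples for p, o in rule.conditions):
            produced.update((s, p, o) for p, o in rule.annotations)
    return produced


# Two made-up proteins that match the same signature but differ in taxonomy.
proteins: set[Triple] = {
    ("protein:P1", "ex:matchesSignature", "sig:FAM_A"),
    ("protein:P1", "ex:inTaxon", "taxon:Bacteria"),
    ("protein:P2", "ex:matchesSignature", "sig:FAM_A"),
    ("protein:P2", "ex:inTaxon", "taxon:Archaea"),
}

# A made-up rule: annotate only bacterial members of the family.
rule = Rule(
    conditions=[
        ("ex:matchesSignature", "sig:FAM_A"),
        ("ex:inTaxon", "taxon:Bacteria"),
    ],
    annotations=[("ex:recommendedName", "Example protein A")],
)

annotated = apply_rule(rule, proteins)
```

In this sketch only `protein:P1` receives the annotation, because `protein:P2` fails the taxonomic condition. In the approach the abstract describes, the equivalent matching would be expressed in SPARQL 1.1 and run by an off-the-shelf SPARQL engine over protein data in RDF.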
Identifiers
pubmed: 32034905
pii: 5731417
doi: 10.1093/gigascience/giaa003
pmc: PMC7007698
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Grants
Agency: NHGRI NIH HHS
ID: U24 HG007822
Country: United States
Copyright information
© The Author(s) 2020. Published by Oxford University Press.