AntiRef: reference clusters of human antibody sequences.
Journal
Bioinformatics advances
ISSN: 2635-0041
Abbreviated title: Bioinform Adv
Country: England
NLM ID: 9918282081306676
Publication information
Publication date:
2023
History:
received: 2023-03-17
revised: 2023-08-02
accepted: 2023-08-21
medline: 2023-10-27
pubmed: 2023-10-27
entrez: 2023-10-27
Status:
epublish
Abstract
Genetic biases in the human antibody repertoire result in publicly available antibody sequence datasets that contain many duplicate or highly similar sequences. Available datasets are further skewed by the predominance of studies focused on specific disease states, primarily cancer, autoimmunity, and a small number of infectious diseases that includes HIV, influenza, and SARS-CoV-2. These biases and redundancies are a barrier to rapid similarity searches and reduce the efficiency with which these datasets can be used to train statistical or machine-learning models. Identity-based clustering provides a solution; however, the extremely large size of available antibody sequence datasets makes such clustering operations computationally intensive and potentially out of reach for many scientists and researchers who would benefit from such data. Antibody Reference Clusters (AntiRef), which is modeled after UniRef, provides clustered datasets of filtered human antibody sequences. Due to the modular nature of recombined antibody genes, the clustering thresholds used by UniRef for general protein sequences are suboptimal for antibody clustering. Starting with an input dataset of ∼451M full-length, productive human antibody sequences, AntiRef provides reference datasets clustered at a range of antibody-optimized identity thresholds. AntiRef90 is one-third the size of the input dataset and less than half the size of the non-redundant AntiRef100. AntiRef datasets are available on Zenodo (zenodo.org/record/7474336). All code used to generate AntiRef is available on GitHub (github.com/briney/antiref). The AntiRef versioning scheme (current version: v2022.12.14) refers to the date on which sequences were retrieved from OAS.
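The identity-based clustering described in the abstract can be illustrated with a minimal sketch. This is not the AntiRef pipeline: the greedy, longest-first strategy, the function names `identity` and `greedy_cluster`, and the use of Python's `difflib.SequenceMatcher` as a stand-in identity measure are all assumptions made for illustration, and the sequences are made-up toy strings rather than entries from the dataset.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Approximate fractional identity between two sequences.

    Assumption: SequenceMatcher.ratio() is used as a simple stand-in for
    an alignment-based identity score; it is not what AntiRef uses.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold):
    """Greedy incremental clustering (illustrative only).

    Exact duplicates are collapsed first. Sequences are then visited
    longest-first; each one joins the first existing representative whose
    identity is >= threshold, otherwise it becomes a new representative.
    Returns a dict mapping representative -> list of member sequences.
    """
    clusters = {}
    for seq in sorted(set(seqs), key=lambda s: (-len(s), s)):
        for rep in clusters:
            if identity(rep, seq) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


# Hypothetical toy input: one exact duplicate, one near-duplicate
# (single substitution), and one unrelated sequence.
toy_seqs = [
    "CARDYYGSSYFDYW",
    "CARDYYGSSYFDYW",
    "CARDYYGSSYFDVW",
    "CAKGGWELLRFDPW",
]

nonredundant = greedy_cluster(toy_seqs, threshold=1.0)
clustered_90 = greedy_cluster(toy_seqs, threshold=0.9)
print(len(toy_seqs), len(nonredundant), len(clustered_90))
```

On the toy input, the 100% threshold only removes the exact duplicate, while the 90% threshold also merges the near-duplicate into one cluster. The nested loop compares each sequence against every existing representative, which is why clustering at the scale of hundreds of millions of sequences calls for dedicated tools and precomputed reference sets rather than a naive approach like this one.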
Identifiers
pubmed: 37886711
doi: 10.1093/bioadv/vbad109
pii: vbad109
pmc: PMC10598580
Publication types
Journal Article
Languages
eng
Pagination
vbad109
Grants
Agency: NIAID NIH HHS
ID: U19 AI135995
Country: United States
Copyright information
© The Author(s) 2023. Published by Oxford University Press.
Conflict of interest statement
None declared.