DiscoverY: a classifier for identifying Y chromosome sequences in male assemblies.


Journal

BMC genomics
ISSN: 1471-2164
Abbreviated title: BMC Genomics
Country: England
NLM ID: 100965258

Publication information

Publication date:
09 Aug 2019
History:
received: 28 Mar 2019
accepted: 26 Jul 2019
entrez: 11 Aug 2019
pubmed: 11 Aug 2019
medline: 4 Jan 2020
Status: epublish

Abstract

Although the Y chromosome plays an important role in male sex determination and fertility, it is currently understudied due to its haploid and repetitive nature. Methods to isolate Y-specific contigs from a whole-genome assembly broadly fall into two categories. The first involves retrieving Y-contigs using proportion sharing with a female, but such a strategy is prone to false positives in the absence of a high-quality, complete female reference. A second strategy uses the ratio of depth of coverage from male and female reads to select Y-contigs, but such a method requires high-depth sequencing of a female and cannot utilize existing female references. We develop a k-mer-based method called DiscoverY, which combines proportion sharing with a female with depth of coverage from male reads to classify contigs as Y-chromosomal. We evaluate the performance of DiscoverY on human and gorilla genomes, across different sequencing platforms including Illumina, 10X, and PacBio. In cases where the male and female data are of high quality, DiscoverY has high precision and recall and outperforms existing methods. For cases when a high-quality female reference is not available, we quantify the effect of using a draft reference or even just raw sequencing reads from a female. DiscoverY is an effective method to isolate Y-specific contigs from a whole-genome assembly. However, regions homologous to the X chromosome remain difficult to detect.
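
The two signals named in the abstract (the proportion of a contig's k-mers shared with a female, and depth of coverage from male reads) can be illustrated with a toy sketch. This is not the DiscoverY implementation: the function names, the use of the median k-mer count as a depth proxy, and the threshold parameters `max_shared`, `min_depth`, and `max_depth` are all assumptions made for illustration, and reverse complements are ignored for brevity.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def contig_features(contig, female_kmers, male_counts, k):
    """Return (shared, depth) for one contig.

    shared: fraction of the contig's k-mers present in the female k-mer set.
    depth:  median count of the contig's k-mers among the male reads
            (a simple stand-in for depth of coverage; an assumption).
    """
    ks = list(kmers(contig, k))
    if not ks:
        return 0.0, 0.0
    shared = sum(1 for km in ks if km in female_kmers) / len(ks)
    depth = median(male_counts.get(km, 0) for km in ks)
    return shared, depth


def looks_y_linked(contig, female_kmers, male_counts, k,
                   max_shared, min_depth, max_depth):
    """Hypothetical rule: little sharing with the female AND male depth in range."""
    shared, depth = contig_features(contig, female_kmers, male_counts, k)
    return shared <= max_shared and min_depth <= depth <= max_depth


# Toy data: k = 4, one female sequence, three male reads.
K = 4
female_kmers = set(kmers("ACGTACGTACGT", K))
male_counts = Counter(
    km for read in ["ACGTACGT", "ACGTACGT", "TTGGCCAA"] for km in kmers(read, K)
)

candidate = "TTGGCCAA"   # absent from the female, covered once by male reads
autosomal = "ACGTACGT"   # fully shared with the female, covered twice

print(contig_features(candidate, female_kmers, male_counts, K))
print(contig_features(autosomal, female_kmers, male_counts, K))
```

In this toy setting the first contig shares no k-mers with the female and sits at roughly half the depth of the second, which is the pattern a haploid Y-linked contig would be expected to show in a male sample. Any real thresholds would need to be chosen from the data.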

Identifiers

pubmed: 31399045
doi: 10.1186/s12864-019-5996-3
pii: 10.1186/s12864-019-5996-3
pmc: PMC6688218

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

641

Grants

Agency: NIGMS NIH HHS
ID: R01 GM130691
Country: United States
Agency: Directorate for Computer and Information Science and Engineering
ID: CCF-1439057, IIS-1453527, and IIS-1421908
Agency: Directorate for Biological Sciences
ID: DBI-1356529

Authors

Samarth Rangavittal (S)

Department of Biology, Pennsylvania State University, University Park, PA, 16802, USA.

Natasha Stopa (N)

Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, 16802, USA.

Marta Tomaszkiewicz (M)

Department of Biology, Pennsylvania State University, University Park, PA, 16802, USA.

Kristoffer Sahlin (K)

Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, 16802, USA.

Kateryna D Makova (KD)

Department of Biology, Pennsylvania State University, University Park, PA, 16802, USA. kmakova@bx.psu.edu.
The Genome Sciences Institute of the Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, 16802, USA. kmakova@bx.psu.edu.

Paul Medvedev (P)

Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, 16802, USA. pzm11@cse.psu.edu.
The Genome Sciences Institute of the Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, 16802, USA. pzm11@cse.psu.edu.
Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA, 16802, USA. pzm11@cse.psu.edu.
