UniAligner: a parameter-free framework for fast sequence alignment.


Journal

Nature methods
ISSN: 1548-7105
Abbreviated title: Nat Methods
Country: United States
NLM ID: 101215604

Publication information

Publication date: 2023-09
History:
received: 2022-09-13
accepted: 2023-07-05
medline: 2023-09-08
pubmed: 2023-08-15
entrez: 2023-08-14
Status: ppublish

Abstract

Even though recent advances in 'complete genomics' have revealed previously inaccessible genomic regions, analysis of variation in centromeres and other extra-long tandem repeats (ETRs) faces an algorithmic challenge, since there are currently no tools for accurate sequence comparison of ETRs. Counterintuitively, classical alignment approaches, such as the Smith-Waterman algorithm, fail to construct biologically adequate alignments of ETRs. We present UniAligner, a parameter-free sequence alignment algorithm with sequence-dependent alignment scoring that automatically changes for any pair of compared sequences. UniAligner prioritizes matches of rare substrings that are more likely to be relevant to the evolutionary relationship between two sequences. We apply UniAligner to estimate the mutation rates in human centromeres and to quantify the extremely high rate of large duplications and deletions in centromeres. This high rate suggests that centromeres may represent some of the most rapidly evolving regions of the human genome with respect to their structural organization.
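
The idea of favoring matches of rare substrings can be illustrated with a toy sketch. This is not the UniAligner algorithm or its scoring scheme, which the abstract does not specify. It only assumes, for illustration, that a shared k-mer is weighted by the inverse of the product of its occurrence counts in the two sequences, so that a k-mer occurring once in each sequence outweighs one repeated many times. The function names, the fixed k, and the weighting formula are all hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def rare_match_weights(a, b, k):
    """Weight each k-mer shared by a and b by 1 / (count in a * count in b).

    Hypothetical scoring chosen for illustration only: rarer shared
    k-mers receive larger weights, repeated ones receive smaller weights.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = ca.keys() & cb.keys()
    return {kmer: 1.0 / (ca[kmer] * cb[kmer]) for kmer in shared}


def top_anchors(a, b, k, n=5):
    """Return the n highest-weighted shared k-mers (ties broken alphabetically)."""
    weights = rare_match_weights(a, b, k)
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    # Two short tandem-repeat-like strings sharing a repeated unit "ACG".
    a = "ACGACGACGTTT"
    b = "ACGACGTTTACG"
    for kmer, weight in top_anchors(a, b, k=3):
        print(kmer, round(weight, 3))
```

In this toy input the repeated unit ACG receives a much lower weight than the k-mers that occur once in each string, which is the qualitative behavior the abstract describes for rare substrings.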

Identifiers

pubmed: 37580559
doi: 10.1038/s41592-023-01970-4
pii: 10.1038/s41592-023-01970-4

Publication types

Journal Article
Research Support, U.S. Gov't, Non-P.H.S.

Languages

eng

Citation subsets

IM

Pagination

1346-1354

Copyright information

© 2023. The Author(s), under exclusive licence to Springer Nature America, Inc.

Authors

Andrey V Bzikadze (AV)

Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, La Jolla, CA, USA.

Pavel A Pevzner (PA)

Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA. ppevzner@ucsd.edu.
