Pacybara: Accurate long-read sequencing for barcoded mutagenized allelic libraries.


Journal

Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944

Publication information

Publication date:
03 Apr 2024
History:
received: 19 Dec 2023
revised: 05 Mar 2024
accepted: 02 Apr 2024
medline: 04 Apr 2024
pubmed: 04 Apr 2024
entrez: 03 Apr 2024
Status: ahead of print

Abstract

Long-read sequencing technologies, an attractive solution for many applications, often suffer from higher error rates. Alignment of multiple reads can improve base-calling accuracy, but some applications, e.g. sequencing mutagenized libraries where multiple distinct clones differ by one or a few variants, require the use of barcodes or unique molecular identifiers. Unfortunately, sequencing errors can interfere with correct barcode identification, and a given barcode sequence may be linked to multiple independent clones within a given library. Here we focus on the target application of sequencing mutagenized libraries in the context of multiplexed assays of variant effects (MAVEs). MAVEs are increasingly used to create comprehensive genotype-phenotype maps that can aid clinical variant interpretation. Many MAVE methods use long-read sequencing of barcoded mutant libraries for accurate association of barcode with genotype. Existing long-read sequencing pipelines do not account for inaccurate sequencing or non-unique barcodes. Here, we describe Pacybara, which handles these issues by clustering long reads based on the similarities of (error-prone) barcodes while also detecting barcodes that have been associated with multiple genotypes. Pacybara also detects recombinant (chimeric) clones and reduces false-positive indel calls. In three example applications, we show that Pacybara identifies and correctly resolves these issues. Pacybara, freely available at https://github.com/rothlab/pacybara, is implemented using R, Python and bash for Linux. It runs on GNU/Linux HPC clusters via Slurm, PBS, or GridEngine schedulers. A single-machine simplex version is also available. Supplementary data are available at Bioinformatics online.
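
The clustering idea described in the abstract can be illustrated with a minimal Python sketch. This is not Pacybara's implementation: the function names (`edit_distance`, `cluster_reads`), the greedy seed-based strategy, the thresholds (`max_dist`, `min_minor_frac`) and the toy reads are all assumptions made for illustration. The sketch groups reads whose barcodes lie within a small edit distance of a more abundant barcode, then flags clusters in which a second genotype accounts for a substantial share of the reads, which is one plausible sign that a barcode is linked to more than one clone.

```python
from collections import Counter, defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with the standard two-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def cluster_reads(reads, max_dist=1, min_minor_frac=0.3):
    """Greedy barcode clustering (hypothetical helper, not Pacybara's API).

    reads: iterable of (barcode, genotype) pairs, one pair per long read.
    Returns {seed_barcode: {"genotype", "reads", "ambiguous"}}.
    """
    reads = list(reads)
    counts = Counter(bc for bc, _ in reads)

    # Visit barcodes from most to least abundant; abundant ones become seeds,
    # rarer ones within max_dist of a seed are treated as its error variants.
    seeds, assignment = [], {}
    for bc, _ in counts.most_common():
        match = next((s for s in seeds if edit_distance(bc, s) <= max_dist), None)
        if match is None:
            seeds.append(bc)
            match = bc
        assignment[bc] = match

    # Tally genotypes per cluster.
    genotypes = defaultdict(Counter)
    for bc, gt in reads:
        genotypes[assignment[bc]][gt] += 1

    result = {}
    for seed, gts in genotypes.items():
        ranked = gts.most_common()
        total = sum(gts.values())
        # Share of reads supporting the second most common genotype.
        minor = ranked[1][1] / total if len(ranked) > 1 else 0.0
        result[seed] = {
            "genotype": ranked[0][0],
            "reads": total,
            "ambiguous": minor >= min_minor_frac,
        }
    return result


# Toy input (invented): one barcode with a single-base read error,
# and one barcode that appears with two different genotypes.
demo = [
    ("ACGTACGT", "A12V"), ("ACGTACGT", "A12V"), ("ACGTACGT", "A12V"),
    ("ACGTACGA", "A12V"),
    ("TTTTCCCC", "G5D"), ("TTTTCCCC", "G5D"),
    ("TTTTCCCC", "L30P"), ("TTTTCCCC", "L30P"),
]
result = cluster_reads(demo)
```

In this toy input the erroneous barcode `ACGTACGA` is absorbed into the `ACGTACGT` cluster, while the `TTTTCCCC` cluster is flagged as ambiguous because its reads split evenly between two genotypes. A real pipeline would also need to use alignment and base-quality information, which this sketch ignores.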

Identifiers

pubmed: 38569896
pii: 7639979
doi: 10.1093/bioinformatics/btae182

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Copyright information

© The Author(s) 2024. Published by Oxford University Press.

Authors

Jochen Weile (J)

Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON M5G 1X5, Canada.
Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada.
Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 3E1, Canada.
Department of Computer Science, University of Toronto, Toronto, ON M5S 2E4.

Gabrielle Ferra (G)

Department of Genome Sciences, University of Washington, Seattle, WA, USA.

Gabriel Boyle (G)

Department of Genome Sciences, University of Washington, Seattle, WA, USA.

Sriram Pendyala (S)

Department of Genome Sciences, University of Washington, Seattle, WA, USA.

Clara Amorosi (C)

Department of Genome Sciences, University of Washington, Seattle, WA, USA.

Chiann-Ling Yeh (CL)

Department of Genome Sciences, University of Washington, Seattle, WA, USA.

Atina G Cote (AG)

Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON M5G 1X5, Canada.
Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada.
Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 3E1, Canada.
Department of Computer Science, University of Toronto, Toronto, ON M5S 2E4.

Nishka Kishore (N)

Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON M5G 1X5, Canada.
Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada.
Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 3E1, Canada.
Department of Computer Science, University of Toronto, Toronto, ON M5S 2E4.

Daniel Tabet (D)

Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON M5G 1X5, Canada.
Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada.
Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 3E1, Canada.
Department of Computer Science, University of Toronto, Toronto, ON M5S 2E4.

Warren van Loggerenberg (W)

Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON M5G 1X5, Canada.
Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada.
Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 3E1, Canada.
Department of Computer Science, University of Toronto, Toronto, ON M5S 2E4.

Ashyad Rayhan (A)

Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON M5G 1X5, Canada.
Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada.
Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 3E1, Canada.
Department of Computer Science, University of Toronto, Toronto, ON M5S 2E4.

Douglas M Fowler (DM)

Department of Genome Sciences, University of Washington, Seattle, WA, USA.
Department of Bioengineering, University of Washington, Seattle, WA, USA.
Brotman Baty Institute for Precision Medicine, Seattle, WA 98195, USA.

Maitreya J Dunham (MJ)

Department of Genome Sciences, University of Washington, Seattle, WA, USA.

Frederick P Roth (FP)

Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON M5G 1X5, Canada.
Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada.
Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 3E1, Canada.
Department of Computer Science, University of Toronto, Toronto, ON M5S 2E4.
Department of Computational & Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA.

MeSH classifications