CHARR efficiently estimates contamination from DNA sequencing data.


Journal

bioRxiv : the preprint server for biology
Abbreviated title: bioRxiv
Country: United States
NLM ID: 101680187

Publication information

Publication date:
28 Jun 2023
History:
pubmed: 10 Jul 2023
medline: 10 Jul 2023
entrez: 10 Jul 2023
Status: epublish

Abstract

DNA sample contamination is a major issue in clinical and research applications of whole genome and exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and often not retained or shared widely. We propose a new metric to estimate DNA sample contamination from variant-level whole genome and exome sequence data, CHARR, Contamination from Homozygous Alternate Reference Reads, which leverages the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and thus can be computed from single-sample gVCFs or callsets in VCF or BCF formats, as well as efficiently stored variant calls in Hail VDS format. Our results demonstrate that CHARR accurately recapitulates results from existing tools with substantially reduced costs, improving the accuracy and efficiency of downstream analyses of ultra-large whole genome and exome sequencing datasets.
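The abstract does not give the CHARR formula, so the following is only a minimal sketch of the general idea it describes: at homozygous alternate calls, reads supporting the reference allele are unexpected, and their average fraction can serve as a rough contamination signal. The function name `charr_like_score`, the `(ref_reads, alt_reads)` input layout, and the `min_depth` threshold are all hypothetical choices made for illustration. The published method may apply filtering or normalization that this record does not describe.

```python
from typing import Iterable, Optional, Tuple


def charr_like_score(
    hom_alt_depths: Iterable[Tuple[int, int]],
    min_depth: int = 10,  # assumed threshold, not taken from the source
) -> Optional[float]:
    """Mean fraction of reference reads across homozygous alternate calls.

    Each item is a hypothetical (ref_reads, alt_reads) pair, such as the
    allelic depths recorded for one homozygous alternate genotype.
    Returns None when no call passes the depth threshold.
    """
    fractions = []
    for ref_reads, alt_reads in hom_alt_depths:
        total = ref_reads + alt_reads
        if total < min_depth:
            # Skip low-coverage calls, where the fraction is too noisy.
            continue
        fractions.append(ref_reads / total)
    if not fractions:
        return None
    return sum(fractions) / len(fractions)


if __name__ == "__main__":
    # Three made-up homozygous alternate calls, each with 20 reads.
    example_calls = [(1, 19), (0, 20), (2, 18)]
    print(charr_like_score(example_calls))
```

Because the sketch needs only per-genotype read counts, it is consistent with the abstract's point that variant-level files suffice and the underlying BAM/CRAM reads are not required.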

Identifiers

pubmed: 37425834
doi: 10.1101/2023.06.28.545801
pmc: PMC10327099

Publication types

Preprint

Languages

eng

Grants

Agency: NIMH NIH HHS
ID: R37 MH107649
Country: United States
Agency: NHGRI NIH HHS
ID: U24 HG011450
Country: United States

Comments and corrections

Type: UpdateIn

Authors

Wenhan Lu (W)

Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA.
Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.

Laura D Gauthier (LD)

Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.
Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.

Timothy Poterba (T)

Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA.
Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.

Edoardo Giacopuzzi (E)

Human Technopole, Viale Rita Levi-Montalcini 1, 20157 Milano, Italy.

Julia K Goodrich (JK)

Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA.

Christine R Stevens (CR)

Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA.
Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.

Daniel King (D)

Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA.
Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.

Mark J Daly (MJ)

Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA.
Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.
Institute for Molecular Medicine Finland, Helsinki, Finland.

Benjamin M Neale (BM)

Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA.
Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.
Novo Nordisk Foundation Center, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.

Konrad J Karczewski (KJ)

Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA.
Novo Nordisk Foundation Center, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.

MeSH classifications