Application Health Monitoring for Extreme-scale Resiliency using Cooperative Fault Management.

Fault-tolerance; exascale resiliency; heterogeneous systems; molecular dynamics; quantum chemistry calculations; silent errors

Journal

Concurrency and Computation: Practice & Experience
ISSN: 1532-0626
Abbreviated title: Concurr Comput
Country: England
NLM ID: 101526872

Publication information

Publication date:
25 Jan 2020
History:
entrez: 26 Apr 2021
pubmed: 25 Jan 2020
medline: 25 Jan 2020
Status: ppublish

Abstract

Resiliency is and will be a critical factor in determining scientific productivity on current and

Identifiers

pubmed: 33897303
doi: 10.1002/cpe.5449
pmc: PMC8064409
mid: NIHMS1589821

Publication types

Journal Article

Languages

eng

Grants

Agency: NIGMS NIH HHS
ID: R01 GM105978
Country: United States
Agency: NIGMS NIH HHS
ID: R21 GM083946
Country: United States

Authors

Pratul K Agarwal (PK)

Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA.
Biochemistry and Cellular and Molecular Biology, University of Tennessee, Knoxville, Tennessee, USA.

Thomas Naughton (T)

Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA.

Byung H Park (BH)

Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA.

David E Bernholdt (DE)

Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA.

Joshua J Hursey (JJ)

Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA.
Present address: IBM Systems, IBM, Rochester, Minnesota, USA.

Al Geist (A)

Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA.

MeSH classifications