AdaSampling for Positive-Unlabeled and Label Noise Learning With Bioinformatics Applications.
Journal
IEEE transactions on cybernetics
ISSN: 2168-2275
Abbreviated title: IEEE Trans Cybern
Country: United States
NLM ID: 101609393
Publication information
Publication date:
May 2019
History:
pubmed: 12 July 2018
medline: 18 June 2019
entrez: 12 July 2018
Status:
ppublish
Abstract
Class labels are required for supervised learning but may be corrupted or missing in various applications. In binary classification, for example, when only a subset of positive instances is labeled while the rest are unlabeled, positive-unlabeled (PU) learning is required to learn a model from both positive and unlabeled data. Similarly, when class labels are corrupted by mislabeled instances, methods are needed for learning in the presence of class label noise (LN). Here we propose adaptive sampling (AdaSampling), a framework for both PU learning and learning with class LN. By iteratively estimating the class mislabeling probability with an adaptive sampling procedure, the proposed method progressively reduces the risk of selecting mislabeled instances for model training and subsequently constructs highly generalizable models even when a large proportion of mislabeled instances is present in the data. We demonstrate the utility of the proposed methods using simulation and benchmark data, and compare them to alternative approaches that are commonly used for PU learning and/or learning with LN. We then introduce two novel bioinformatics applications where AdaSampling is used to: 1) identify kinase substrates from mass spectrometry-based phosphoproteomics data and 2) predict transcription factor target genes by integrating various next-generation sequencing data.
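The iterative idea described in the abstract can be illustrated with a minimal sketch for the PU case. This is not the authors' implementation: the nearest-centroid base learner, the function names (`fit_centroids`, `predict_pos_prob`, `ada_sampling_pu`), the fixed number of iterations, the sample size, and the probability floor are all assumptions chosen to keep the example short. It only shows the general loop of sampling unlabeled instances as negatives in proportion to their current estimated probability of being negative, retraining, and re-estimating those probabilities.

```python
import math
import random


def fit_centroids(X, y):
    """Hypothetical base learner: store the per-class feature means."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents


def predict_pos_prob(cents, x):
    """Soft score in [0, 1]: logistic of the squared-distance difference."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, cents[0]))
    d1 = sum((a - b) ** 2 for a, b in zip(x, cents[1]))
    z = max(-50.0, min(50.0, (d0 - d1) / 2.0))  # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(-z))


def ada_sampling_pu(positives, unlabeled, n_iter=5, seed=0):
    """Sketch of an adaptive sampling loop for PU data (assumed design)."""
    rng = random.Random(seed)
    # prob_neg[i]: current estimate that unlabeled[i] is truly negative.
    # Initially every unlabeled instance is treated as a negative.
    prob_neg = [1.0] * len(unlabeled)
    model = None
    for _ in range(n_iter):
        # Draw negatives with replacement, weighted by prob_neg, so that
        # likely-mislabeled (hidden positive) instances are picked less often.
        idx = rng.choices(range(len(unlabeled)), weights=prob_neg,
                          k=len(positives))
        X = positives + [unlabeled[i] for i in idx]
        y = [1] * len(positives) + [0] * len(idx)
        model = fit_centroids(X, y)
        # Re-estimate the probabilities; the floor keeps all weights positive.
        prob_neg = [max(1e-6, 1.0 - predict_pos_prob(model, x))
                    for x in unlabeled]
    return model, prob_neg


if __name__ == "__main__":
    # Synthetic demo data (assumed, for illustration only).
    data_rng = random.Random(1)
    pos = [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(30)]
    hidden_pos = [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(30)]
    neg = [[data_rng.gauss(-3, 1), data_rng.gauss(-3, 1)] for _ in range(60)]
    _, p = ada_sampling_pu(pos, hidden_pos + neg)
    print("mean P(negative), hidden positives:", sum(p[:30]) / 30)
    print("mean P(negative), true negatives:  ", sum(p[30:]) / 60)
```

In this sketch, unlabeled instances that resemble the labeled positives receive a low estimated probability of being negative and are therefore rarely drawn into the negative training set in later iterations. Any probabilistic classifier could stand in for the centroid learner.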
Identifiers
pubmed: 29993676
doi: 10.1109/TCYB.2018.2816984
Chemical substances
Phosphoproteins
0
Proteins
0
Transcription Factors
0
Phosphotransferases
EC 2.7.-
Publication types
Journal Article
Languages
eng
Citation subsets
IM