STAR_outliers: a python package that separates univariate outliers from non-normal distributions.

Outliers Software Statistics

Journal

BioData Mining
ISSN: 1756-0381
Abbreviated title: BioData Min
Country: England
NLM ID: 101319161

Publication information

Publication date:
04 Sep 2023
History:
received: 30 Sep 2021
accepted: 21 Aug 2023
medline: 5 Sep 2023
pubmed: 5 Sep 2023
entrez: 4 Sep 2023
Status: epublish

Abstract

No existing univariate outlier detection algorithm transforms and models arbitrarily shaped distributions in order to remove univariate outliers. Some algorithms model skew, fewer model kurtosis, and none model bimodality or monotonicity. To overcome these limitations, we have implemented an algorithm for Skew and Tail-heaviness Adjusted Removal of Outliers (STAR_outliers) that robustly removes univariate outliers from distributions with many different shape profiles, including extreme skew, extreme kurtosis, bimodality, and monotonicity. We show that STAR_outliers removes simulated outliers with greater recall and precision than several general algorithms, and that it models the outlier bounds of real data distributions with greater accuracy.

Background: Reliably removing univariate outliers from arbitrarily shaped distributions is a difficult task. Incorrectly assuming unimodality or overestimating tail heaviness leaves outliers in the data, while underestimating tail heaviness removes regular data from the tails. Skew often produces one heavy tail and one light tail, and we show that several sophisticated outlier removal algorithms often fail to remove outliers from the light tail. Multivariate outlier detection algorithms have recently become popular, but having tested PyOD's multivariate outlier removal algorithms, we found them to be inadequate for univariate outlier removal. They usually do not allow univariate input, and they do not fit their distributions of outliership scores with a model on which an outlier threshold can be accurately established. Thus, there is a need for a flexible outlier removal algorithm that can model arbitrarily shaped univariate distributions.

Results: To model arbitrarily shaped univariate distributions effectively, we have combined several well-established algorithms into a new algorithm called STAR_outliers.
STAR_outliers removes more simulated true outliers and fewer non-outliers than several other univariate algorithms. These include several normality-assuming outlier removal methods, PyOD's isolation forest (IF) outlier removal algorithm (ACM Transactions on Knowledge Discovery from Data (TKDD) 6:3, 2012) with default settings, and an IQR-based algorithm by Verardi and Vermandele that removes outliers while accounting for skew and kurtosis (Verardi and Vermandele, Journal de la Société Française de Statistique 157:90-114, 2016). Because the IF algorithm's default model fit the outliership scores poorly, we also evaluated the isolation forest algorithm with an alternative thresholding model: removing as many data points as STAR_outliers does, in order of decreasing outliership score. We also compared these algorithms on the publicly available 2018 National Health and Nutrition Examination Survey (NHANES) data by setting the outlier threshold to keep values falling within the main 99.3 percent of the fitted model's domain. On average, STAR_outliers removes a percentage of values from these features that is significantly closer to 0.7 percent than the percentage removed by the other outlier removal methods.

Conclusions: STAR_outliers is an easily implemented Python package for removing outliers that outperforms multiple commonly used methods of univariate outlier removal.
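
The following is a minimal illustrative sketch, not the STAR_outliers implementation or its API. The abstract does not say why 99.3 percent was chosen as the retained fraction; one familiar reference point is that classic Tukey fences (Q1 - 1.5 IQR, Q3 + 1.5 IQR) keep about 99.3 percent of an exactly normal distribution. The sketch checks that figure and then applies the same symmetric fences to a simulated right-skewed (lognormal) sample to show the failure mode described in the Background: far more than 0.7 percent is removed from the heavy tail, while a planted point deep in the light tail is not flagged. The helper names, the lognormal parameters, and the planted value are assumptions chosen for illustration.

```python
import random
from statistics import NormalDist, quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def removed_fractions(values, lower, upper):
    """Fraction of values below the lower fence and above the upper fence."""
    n = len(values)
    below = sum(v < lower for v in values) / n
    above = sum(v > upper for v in values) / n
    return below, above


# 1) Exact normal distribution: the fences keep about 99.3 percent of the mass.
std_normal = NormalDist()
q3_normal = std_normal.inv_cdf(0.75)          # Q1 is -q3_normal by symmetry
upper_normal = q3_normal + 1.5 * (2 * q3_normal)
coverage = std_normal.cdf(upper_normal) - std_normal.cdf(-upper_normal)
print(f"normal mass kept by the fences: {coverage:.4f}")

# 2) Right-skewed sample (assumed lognormal): heavy right tail, light left tail.
rng = random.Random(0)
skewed = [rng.lognormvariate(0.0, 1.0) for _ in range(20000)]
lower_fence, upper_fence = tukey_fences(skewed)
low_frac, high_frac = removed_fractions(skewed, lower_fence, upper_fence)
print(f"removed from light tail: {low_frac:.4f}")
print(f"removed from heavy tail: {high_frac:.4f}")

# 3) A planted value about 9 log-standard-deviations below the median is an
#    extreme light-tail outlier, yet the symmetric fences do not flag it.
planted = 1e-4
planted_flagged = planted < lower_fence or planted > upper_fence
print(f"planted light-tail outlier flagged: {planted_flagged}")
```

A shape-aware method would instead aim to remove close to 0.7 percent in total while still treating the planted light-tail value as an outlier, which is the behavior the abstract reports for STAR_outliers.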

Identifiers

pubmed: 37667378
doi: 10.1186/s13040-023-00342-0
pii: 10.1186/s13040-023-00342-0
pmc: PMC10476292

Publication types

Journal Article

Languages

English

Pagination

25

Grants

Agency: NIH HHS
ID: LM010098
Country: United States

Copyright information

© 2023. BioMed Central Ltd., part of Springer Nature.

References

Osborne J, Overbay A. The power of outliers (and why researchers should always check for them). Practical Assessment, Research, and Evaluation. 2004;9(6). https://doi.org/10.7275/qf69-7k43
Zhao Y, Nasrullah Z, Li Z. PyOD: a python toolbox for scalable outlier detection. J Mach Learn Res. 2019;20(96):1–7.
Liu F, Ting K, Zhou Z. Isolation-based anomaly detection. ACM Trans Knowl Discov Data. 2012;6(1):3. https://doi.org/10.1145/2133360.2133363
Li Z, Zhao Y, Hu X, Botta N, Ionescu C, Chen H. ECOD: unsupervised outlier detection using empirical cumulative distribution functions. IEEE Trans Knowl Data Eng. 2022.
Li Z, Zhao Y, Botta N, Ionescu C, Hu X. COPOD: copula-based outlier detection. In: IEEE International Conference on Data Mining (ICDM). IEEE; 2020.
Kriegel H, et al. Angle-based outlier detection in high-dimensional data. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2008. p. 444–52.
Hubert M, Vandervieren E. An adjusted boxplot for skewed distributions. Comput Stat Data Anal. 2008;52(12):5186–201. https://doi.org/10.1016/j.csda.2007.11.008
Yang J, Rahardja S, Fränti P. Outlier detection: how to threshold outlier scores? In: Proceedings of the International Conference on Artificial Intelligence, Information Processing and Cloud Computing - AIIPCC '19. 2019. https://doi.org/10.1145/3371425.3371427
Buzzi-Ferraris G, Manenti F. Outlier detection in large data sets. Comput Chem Eng. 2011;35(2):388–90. https://doi.org/10.1016/j.compchemeng.2010.11.004
Verardi V, Vermandele C. Outlier identification for skewed and/or heavy-tailed unimodal multivariate distributions. Journal de la Société Française de Statistique. 2016;157(2):90–114. https://researchportal.unamur.be/en/publications/outlier-identification-for-skewed-andor-heavy-tailed-unimodal-mul
Chen W, Peters G, Gerlach R, Sisson S. Dynamic quantile function models. SSRN Electron J. 2017. https://doi.org/10.2139/ssrn.2999451
Peters G, Chen W, Gerlach R. Estimating quantile families of loss distributions for non-life insurance modelling via L-moments. Risks. 2016;4(2):14. https://doi.org/10.3390/risks4020014
Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention. 2017–2018. https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2017
Hartigan J, Hartigan P. The dip test of unimodality. Ann Stat. 1985;13(1):70–84. https://doi.org/10.1214/aos/1176346577
Xu Y, Iglewicz B, Chervoneva I. Robust estimation of the parameters of g-and-h distributions, with applications to outlier detection. Comput Stat Data Anal. 2014;75:66–80. https://doi.org/10.1016/j.csda.2014.01.003 (PMID: 24665144; PMCID: PMC3961718)

Authors

John T Gregg (JT)

Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA.

Jason H Moore (JH)

Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, 90069, USA. jason.moore@csmc.edu.

MeSH classifications