flowSim: Near duplicate detection for flow cytometry data.
bioinformatics
flow cytometry
machine learning
near duplicate detection
redundant information
similar images
Journal
Cytometry. Part A : the journal of the International Society for Analytical Cytology
ISSN: 1552-4930
Abbreviated title: Cytometry A
Country: United States
NLM ID: 101235694
Publication information
Publication date:
11 2023
History:
revised: 2023-06-22
received: 2023-04-21
accepted: 2023-07-11
medline: 2023-11-10
pubmed: 2023-08-02
entrez: 2023-08-02
Status:
ppublish
Abstract
The analysis of large amounts of data is important for the development of machine learning (ML) models. flowSim is the first algorithm designed to visualize, detect, and remove highly redundant information in flow cytometry (FCM) training sets, with the aims of decreasing the computational time for training and increasing the performance of ML algorithms by reducing overfitting. flowSim performs near-duplicate image detection by combining community detection algorithms with density analysis of the marker expression values. Compared with consensus manual clustering on a dataset of 160 images of bivariate FCM data, flowSim clustering had a mean Adjusted Rand Index of 0.90, demonstrating its efficiency in identifying similar patterns. flowSim selectively discarded near-duplicate files in datasets constructed with known redundancy, and removed 92.6% of FCM images in a dataset of over 500,000 drawn from public repositories.
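The general idea of grouping near-duplicate bivariate plots by their density can be illustrated with a minimal sketch. This is not the flowSim implementation, which is not described in detail here. It makes several assumptions: marker values are already scaled to [0, 1], each plot is summarized by a normalized 8 x 8 histogram, similarity is measured by histogram intersection, the 0.8 threshold is arbitrary, and plain connected components stand in for a community detection algorithm. All function names and the synthetic data are hypothetical.

```python
import random
from itertools import combinations

BINS = 8  # hypothetical grid resolution per axis


def density_grid(points, bins=BINS):
    """Normalized 2D histogram of (x, y) values assumed to lie in [0, 1]."""
    grid = [0.0] * (bins * bins)
    for x, y in points:
        # Clamp so that values outside [0, 1) fall into the nearest edge bin.
        i = min(bins - 1, max(0, int(x * bins)))
        j = min(bins - 1, max(0, int(y * bins)))
        grid[i * bins + j] += 1.0
    total = sum(grid) or 1.0
    return [count / total for count in grid]


def similarity(p, q):
    """Histogram intersection: 1.0 for identical densities, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(p, q))


def group_near_duplicates(grids, threshold=0.8):
    """Link plots with similarity >= threshold and return connected components.

    Connected components are a simple stand-in for community detection.
    """
    parent = list(range(len(grids)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(grids)), 2):
        if similarity(grids[i], grids[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(grids)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def sample_plot(rng, centers, n=2000, spread=0.05):
    """Synthetic bivariate plot: Gaussian populations around the given centers."""
    points = []
    for k in range(n):
        cx, cy = centers[k % len(centers)]
        points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
    return points


rng = random.Random(0)
pattern_a = [(0.25, 0.25), (0.75, 0.75)]  # two populations on one diagonal
pattern_b = [(0.25, 0.75), (0.75, 0.25)]  # two populations on the other diagonal
plots = [sample_plot(rng, pattern_a) for _ in range(3)]
plots += [sample_plot(rng, pattern_b) for _ in range(2)]

grids = [density_grid(p) for p in plots]
groups = group_near_duplicates(grids)
representatives = [g[0] for g in groups]  # keep one plot per group
print(groups, representatives)
```

Keeping only the first plot of each group is the simplest way to discard redundancy in this sketch. A real pipeline would likely choose representatives and thresholds more carefully.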
Identifiers
pubmed: 37530476
doi: 10.1002/cyto.a.24776
Publication types
Journal Article
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
889-901
Grants
Agency: NIAID NIH HHS
ID: U19 AI118608
Country: United States
Copyright information
© 2023 The Authors. Cytometry Part A published by Wiley Periodicals LLC on behalf of International Society for Advancement of Cytometry.