Comparison of three statistical approaches for feature selection for fine-scale genetic population assignment in four pig breeds.

Boruta; Least absolute shrinkage and selection operator; Pig breeds; Principal component analysis; Random forest; Single-nucleotide polymorphism

Journal

Tropical animal health and production
ISSN: 1573-7438
Abbreviated title: Trop Anim Health Prod
Country: United States
NLM ID: 1277355

Publication information

Publication date:
10 Jul 2021
History:
received: 10 Dec 2020
accepted: 18 Jun 2021
entrez: 10 Jul 2021
pubmed: 11 Jul 2021
medline: 14 Jul 2021
Status: epublish

Abstract

Assigning animals to their breeds using breed-informative single-nucleotide polymorphisms (SNPs) is required in many fields; for instance, it supports the traceability and authentication of meat and other livestock products. SNP information for several pig breeds is now accessible thanks to dense SNP chips, which cover a large number of molecular markers distributed across the entire genome. Identifying the breed of origin of an industrial pork sample therefore means analyzing a large panel of genetic markers, the size of which depends on the SNP chip used, and analyzing such large datasets is labor-intensive. This motivates less dense chips built from a reduced number of breed-informative SNPs, whose genotyping data would take less time and effort to analyze.
The objective of this study is to find the most informative SNPs for discriminating between four pig breeds: Duroc, Landrace, Large White, and Pietrain. The Illumina Porcine 60K SNP chip was used to genotype SNPs distributed across the individuals' genomes. First, we applied three statistical approaches for feature selection: (i) principal component analysis (PCA), (ii) least absolute shrinkage and selection operator (LASSO), and (iii) random forest (RF) with the Boruta algorithm. Each approach identified its own set of most discriminating SNPs. We then combined the results into a final panel containing only the SNPs that appear in all three sets.
PCA, LASSO, and random forest with the Boruta algorithm selected 28,816, 50, and 286 SNPs, respectively. PCA selected far more SNPs than Boruta or LASSO because it chooses variables so as to preserve as much of the information in the data as possible. A downside of LASSO regression is that, among a group of correlated variables, it tends to select only one and ignore the others regardless of their importance. In contrast, the Boruta algorithm accounts for the interdependence between SNPs and selects informative variables even when they are correlated and have the same effect. The three panels shared 23 SNPs; the distribution of individuals according to these SNPs showed each breed grouped in a well-defined cluster with no overlap.
The biological pathways represented by the 23 breed-informative SNPs resulting from the combination of PCA, LASSO, and Boruta should be explored in further analysis. These results are promising for applying this method to other livestock species.
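
The combination step in the abstract, keeping only the SNPs selected by all three methods, amounts to a set intersection. The following is a minimal sketch in Python, assuming each method's output has already been reduced to a list of SNP identifiers. The identifiers and the function name `consensus_panel` are hypothetical placeholders, not the study's markers or code.

```python
from functools import reduce


def consensus_panel(*panels):
    """Return the sorted SNP identifiers that are present in every panel."""
    if not panels:
        return []
    # Intersect the panels pairwise; duplicates within a panel are ignored.
    shared = reduce(set.intersection, (set(p) for p in panels))
    return sorted(shared)


# Hypothetical SNP identifiers, for illustration only (not the study's data).
pca_panel = ["snp_a", "snp_b", "snp_c", "snp_d", "snp_e"]
lasso_panel = ["snp_b", "snp_d", "snp_f"]
boruta_panel = ["snp_b", "snp_c", "snp_d", "snp_g"]

shared = consensus_panel(pca_panel, lasso_panel, boruta_panel)
print(shared)  # ['snp_b', 'snp_d']
```

Because an intersection can never be larger than its smallest input, the 50-SNP LASSO panel bounds the size of the final panel, which is consistent with the 23 shared SNPs reported.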

Identifiers

pubmed: 34245361
doi: 10.1007/s11250-021-02824-x
pii: 10.1007/s11250-021-02824-x

Chemical substances

Genetic Markers 0

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

395

Copyright information

© 2021. The Author(s), under exclusive licence to Springer Nature B.V.

Authors

Ichrak Hayah (I)

Plant and Microbial Biotechnologies, Biodiversity, and Environment (BioBio), Mohammed V University in Rabat, 4 Ibn Battouta Avenue, B.P. 1014 RP, Rabat, Morocco.

Mouna Ababou (M)

Laboratory of Human Pathologies, Genomic Center of Human Pathologies, Mohammed V University in Rabat, 4 Ibn Battouta Avenue, B.P. 1014 RP, Rabat, Morocco.

Sara Botti (S)

PTP Science Park, Via Einstein - Loc. Cascina Codazza, 26900, Lodi, Italy.

Bouabid Badaoui (B)

Plant and Microbial Biotechnologies, Biodiversity, and Environment (BioBio), Mohammed V University in Rabat, 4 Ibn Battouta Avenue, B.P. 1014 RP, Rabat, Morocco. bouabidbadaoui@gmail.com.
