Feature selection by integrating document frequency with genetic algorithm for Amharic news document classification.
Chi-square
Document frequency
Extra tree classifier
Feature selection
Genetic algorithm
Information gain
Text classification
Journal
PeerJ. Computer science
ISSN: 2376-5992
Abbreviated title: PeerJ Comput Sci
Country: United States
NLM ID: 101660598
Publication information
Publication date: 2022
History:
received: 2021-12-15
accepted: 2022-04-04
entrez: 2022-05-31
pubmed: 2022-06-01
medline: 2022-06-01
Status:
epublish
Abstract
Text classification is the process of categorizing documents into a predefined set of categories based on their content. Text classification algorithms typically represent documents as collections of words and therefore deal with a large number of features. Selecting appropriate features becomes important when the initial feature set is large. In this paper, we present a feature selection method for Amharic text classification that is a hybrid of document frequency (DF) and a genetic algorithm (GA). We evaluate this method on Amharic news documents obtained from the Ethiopian News Agency (ENA), using 13 categories. Our experimental results show that the proposed method outperforms other feature selection methods used for Amharic news document classification. Combining the proposed method with the Extra Tree Classifier (ETC) improves classification accuracy: accuracy is up to 1% higher than with the hybrid of DF, information gain (IG), chi-square (CHI), and principal component analysis (PCA), 2.47% higher than with GA, and 3.86% higher than with a hybrid of DF, IG, and CHI.
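The abstract gives no implementation details, so the following is only a minimal sketch of the general idea, assuming a DF threshold followed by a GA search over binary feature masks. All names (`document_frequency`, `df_filter`, `genetic_select`, `toy_fitness`), the threshold, the GA settings, and the toy corpus are hypothetical. The toy fitness is a stand-in for whatever evaluation the authors use; the Extra Tree Classifier is not reproduced here.

```python
import random
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def df_filter(docs, min_df=2):
    """Keep terms found in at least min_df documents (sorted for determinism)."""
    df = document_frequency(docs)
    return sorted(term for term, count in df.items() if count >= min_df)

def genetic_select(n_features, fitness, pop_size=20, generations=30,
                   mutation_rate=0.05, seed=0):
    """Evolve binary masks over n_features and return the best mask seen."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Truncation selection: the fitter half become parents.
        parents = sorted(population, key=fitness, reverse=True)[:pop_size // 2]
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # One-point crossover, then bit-flip mutation.
            cut = rng.randint(1, n_features - 1) if n_features > 1 else 0
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population + [best], key=fitness)
    return best

# Toy corpus (hypothetical); each document is a list of tokens.
docs = [
    ["sport", "goal", "team"],
    ["sport", "match", "team"],
    ["economy", "bank", "market"],
    ["economy", "market", "trade"],
]

# Stage 1: DF filter removes terms that occur in fewer than 2 documents.
vocab = df_filter(docs, min_df=2)

def toy_fitness(mask):
    """Stand-in objective: documents covered by selected terms, minus a size penalty."""
    selected = {term for term, bit in zip(vocab, mask) if bit}
    covered = sum(1 for tokens in docs if selected & set(tokens))
    return covered - 0.1 * len(selected)

# Stage 2: GA searches for a subset of the DF-filtered vocabulary.
best = genetic_select(len(vocab), toy_fitness)
selected_terms = [term for term, bit in zip(vocab, best) if bit]
print(vocab, selected_terms)
```

In a realistic setting the fitness function would presumably score each mask by training and validating a classifier on the selected features, which is far more expensive than the toy objective used here; that is one reason a cheap DF filter before the GA can be attractive.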
Identifiers
pubmed: 35634124
doi: 10.7717/peerj-cs.961
pii: cs-961
pmc: PMC9137894
Publication types
News
Languages
eng
Pagination
e961
Copyright information
© 2022 Endalie et al.
Conflict of interest statement
The authors declare that they have no competing interests.
References
PLoS One. 2021 May 21;16(5):e0251902
pubmed: 34019571
Comput Intell Neurosci. 2021 Jul 27;2021:3774607
pubmed: 34354742