Dataset for file fragment classification of textual file formats.
Classification
File formats
File fragments
Textual file formats
Journal
BMC research notes
ISSN: 1756-0500
Abbreviated title: BMC Res Notes
Country: England
NLM ID: 101462768
Publication information
Publication date:
11 Dec 2019
History:
received: 21 Oct 2019
accepted: 29 Nov 2019
entrez: 13 Dec 2019
pubmed: 13 Dec 2019
medline: 8 May 2020
Status:
epublish
Abstract
Classification of textual file formats is a topic of interest in network forensics. A few datasets of files with textual formats are publicly available, but there is no public dataset of file fragments of textual file formats. As a result, a major research challenge in file fragment classification of textual file formats is comparing the performance of the developed methods on the same datasets. In this study, we present a dataset that contains file fragments of five textual file formats: binary file format for Word 97-Word 2003, Microsoft Word open XML format, portable document format, rich text file, and standard text document. The dataset contains file fragments in three languages: English, Persian, and Chinese. For each pair of file format and language, 1500 file fragments are provided, so the dataset contains 22,500 file fragments in total.
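The total in the abstract follows from the dataset layout: five formats, three languages, and 1500 fragments per pair. A minimal sketch of that arithmetic is given here. The short format labels (`doc`, `docx`, `pdf`, `rtf`, `txt`) and the function name are illustrative assumptions, not the dataset's actual directory or file names.

```python
from itertools import product

# Formats and languages as listed in the abstract. The short labels are
# assumed for illustration and may not match the dataset's own naming.
FORMATS = ["doc", "docx", "pdf", "rtf", "txt"]
LANGUAGES = ["English", "Persian", "Chinese"]
FRAGMENTS_PER_PAIR = 1500  # stated in the abstract


def fragment_counts():
    """Return the expected fragment count for every (format, language) pair."""
    return {
        (fmt, lang): FRAGMENTS_PER_PAIR
        for fmt, lang in product(FORMATS, LANGUAGES)
    }


counts = fragment_counts()
total = sum(counts.values())
print(f"{len(counts)} pairs, {total} fragments in total")
```

A table like this could serve as a sanity check after downloading the dataset, by comparing the expected count per pair against the number of fragment files actually found.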
Identifiers
pubmed: 31829258
doi: 10.1186/s13104-019-4837-4
pii: 10.1186/s13104-019-4837-4
pmc: PMC6907108
Publication types
Dataset
Journal Article
Languages
eng
Citation subsets
IM
Pagination
801
References
BMC Res Notes. 2019 Dec 11;12(1):801
pubmed: 31829258