Application of Data Mining Classification Techniques on Text Documents

Besimi, Nuhi
This thesis presents the application of several classification techniques to text documents. As more and more textual data become available on the Web, learning from examples and classifying text has become an important topic. Text classification usually requires deeper analysis and more pre-processing than basic classification techniques applied to regular datasets. We use four algorithms to describe and evaluate the process of text classification: 1) the Naïve Bayes classifier, 2) k-NN (k-Nearest Neighbors), 3) the Centroid classifier, and 4) SVM (Support Vector Machine). All algorithms were adapted to handle text data. The thesis also describes the process of information retrieval from the Web. Thanks to social networks, news sites, blogs, and other sources, the quantity of data on the Web is increasing rapidly. These data appear in different formats such as text, images, videos, datasets, and XML files. In addition, the structure of the data differs from site to site. To overcome this problem, we propose an algorithm for scraping the Web and integrating textual data from different sources. The gathered data are then pre-processed, which includes cleaning (removing noise and irrelevant features) and transformation (converting the data into a format suitable for further processing). Finally, we present empirical results on the performance of the four text classification algorithms. The algorithms are tested and evaluated on textual documents (news) scraped from the Web. We find that all algorithms achieve reasonable results in terms of accuracy and performance. SVM shows a considerable improvement in execution time compared with the other three algorithms. The Naïve Bayes classifier is considered a fast classifier. The Centroid classifier performs slightly better than k-NN, even though the two are very similar.
Supervised Learning, Documents Classification, Information Retrieval, Text Cleaning, Text Transformation
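
The abstract does not show the thesis's own implementation, so the following is only a minimal sketch of how a centroid classifier for text might look, covering the cleaning, transformation, and classification steps mentioned above. All function names, the choice of raw term frequencies, the use of cosine similarity, and the four toy training documents are assumptions made for illustration, not details taken from the thesis.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Cleaning (assumed): lowercase the text and keep alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())


def vectorize(text):
    # Transformation (assumed): represent a document as raw term frequencies.
    return Counter(tokenize(text))


def cosine(a, b):
    # Cosine similarity between two sparse term-weight mappings.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(docs):
    # Average the term-frequency vectors of all documents in each class.
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in docs:
        sums[label].update(vectorize(text))
        counts[label] += 1
    return {
        label: {t: w / counts[label] for t, w in vec.items()}
        for label, vec in sums.items()
    }


def classify(text, centroids):
    # Assign the class whose centroid is most similar to the document.
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy training set standing in for scraped news articles.
TRAINING_DOCS = [
    ("The team won the football match", "sports"),
    ("The striker scored a late goal", "sports"),
    ("Parliament passed the new budget law", "politics"),
    ("The minister announced an election date", "politics"),
]

centroids = train_centroids(TRAINING_DOCS)
predicted = classify("The goal decided the match", centroids)
```

In this sketch each class is summarized by a single averaged vector, so classifying a document costs one similarity computation per class. A k-NN classifier built on the same vectors would instead compare the document against every training document, which is one way to see why the two methods are closely related.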