Application of Data Mining Classification Techniques on Text Documents

Besimi, Nuhi

dc.contributor.advisor	Szathmáry, László
dc.contributor.author	Besimi, Nuhi
dc.contributor.department	DE--Faculty of Informatics	hu_HU
dc.date.accessioned	2015-04-30T08:03:05Z
dc.date.available	2015-04-30T08:03:05Z
dc.date.created	2015-04-30
dc.description.abstract	This thesis presents the application of various classification techniques to text documents. As more and more textual data becomes available on the Web, learning from examples and classifying text has become an important topic. Text classification usually requires deeper analysis and more pre-processing than basic classification techniques applied to regular datasets. We use four algorithms to describe and evaluate the process of text classification: 1) the Naïve Bayes classifier, 2) k-NN (k-Nearest Neighbors), 3) the Centroid classifier, and 4) SVM (Support Vector Machine). All algorithms were adapted to handle text data. The thesis also describes the process of information retrieval from the Web. Thanks to social networks, news sites, blogs, and other sources, the quantity of data on the Web is increasing rapidly. These data appear in different formats such as text, images, videos, datasets, and XML files. In addition, the structure of the data differs from site to site. To overcome this problem, we propose an algorithm for scraping the Web and integrating textual data from different sources. The gathered data are then pre-processed, which includes cleaning (removing noise and irrelevant features) and transformation (converting the data into a format suitable for further processing). Finally, we present empirical results on the performance of the four text classification algorithms. The algorithms are tested and evaluated on textual documents (news) scraped from the Web. We find that all algorithms achieve reasonable results in terms of accuracy and performance. SVM shows a considerable improvement in execution time compared with the other three algorithms. The Naïve Bayes classifier is considered a fast classifier. The Centroid classifier performs slightly better than k-NN, even though the two are very similar.	hu_HU
dc.description.course	Computer Science	hu_HU
dc.description.degree	MSc/MA	hu_HU
dc.format.extent	53	hu_HU
dc.identifier.uri	http://hdl.handle.net/2437/211754
dc.language.iso	en	hu_HU
dc.subject	Supervised Learning	hu_HU
dc.subject	Documents Classification	hu_HU
dc.subject	Information Retrieval	hu_HU
dc.subject	Text Cleaning	hu_HU
dc.subject	Text Transformation	hu_HU
dc.subject.dspace	DEENK Topic List::Informatics	hu_HU
dc.title	Application of Data Mining Classification Techniques on Text Documents	hu_HU

Collections

Student theses (Faculty of Informatics)
