Application of Data Mining Classification Techniques on Text Documents

dc.contributor.advisor: Szathmáry, László
dc.contributor.author: Besimi, Nuhi
dc.contributor.department: DE--Informatikai Kar [hu_HU]
dc.date.accessioned: 2015-04-30T08:03:05Z
dc.date.available: 2015-04-30T08:03:05Z
dc.date.created: 2015-04-30
dc.description.abstract: This thesis presents the application of various classification techniques to text documents. As more and more textual data becomes available on the Web, learning from examples and classifying text has become an important topic. Text classification usually requires deeper analysis and more pre-processing than basic classification techniques applied to regular datasets. We use four algorithms to describe and evaluate the process of text classification: 1) the Naïve Bayes classifier, 2) k-NN (k-Nearest Neighbors), 3) the Centroid classifier, and 4) SVM (Support Vector Machine). All algorithms were adapted to handle text data. The thesis also describes the process of information retrieval from the Web. Thanks to social networks, news sites, blogs, and other sources, the quantity of data on the Web is increasing rapidly. These data appear in different formats such as text, images, videos, datasets, and XML files. In addition, the structure of the data differs from site to site. To overcome this problem, we propose an algorithm for scraping the Web and integrating textual data from different sources. The gathered data are then pre-processed, which includes cleaning (removing noise and irrelevant features) and transformation (converting the data into a format suitable for further processing). Finally, we present empirical results on the performance of the four text classification algorithms. The algorithms are tested and evaluated on textual documents (news) scraped from the Web. We find that all algorithms achieve reasonable accuracy and performance. SVM shows a considerable improvement in execution time compared with the other three algorithms. The Naïve Bayes classifier is considered a fast classifier. The Centroid classifier performs slightly better than k-NN, even though the two are very similar. [hu_HU]
dc.description.course: Computer Science [hu_HU]
dc.description.degree: MSc/MA [hu_HU]
dc.format.extent: 53 [hu_HU]
dc.identifier.uri: http://hdl.handle.net/2437/211754
dc.language.iso: en [hu_HU]
dc.subject: Supervised Learning [hu_HU]
dc.subject: Documents Classification [hu_HU]
dc.subject: Information Retrieval [hu_HU]
dc.subject: Text Cleaning [hu_HU]
dc.subject: Text Transformation [hu_HU]
dc.subject.dspace: DEENK Témalista::Informatika [hu_HU]
dc.title: Application of Data Mining Classification Techniques on Text Documents [hu_HU]