AUN Digital Repository

Clustering news articles using K-means and N-grams

dc.contributor.author BISANDU, Desmond Bala
dc.date.accessioned 2019-04-25T15:28:53Z
dc.date.available 2019-04-25T15:28:53Z
dc.date.issued 2018-05
dc.identifier.other A00019335
dc.identifier.uri http://hdl.handle.net/123456789/573
dc.description A research thesis submitted in partial fulfillment of the requirements for the award of the degree of Master of Science (M.Sc.) in Computer Science. en_US
dc.description.abstract Document clustering is an automatic, unsupervised machine learning technique that aims to group related items into clusters or subsets. The goal is to create clusters with high internal coherence that differ substantially from one another: items within the same cluster should be highly similar, while remaining highly dissimilar to items in other clusters. Automatic document clustering plays a significant role in many fields, including data mining and information retrieval. This thesis aims to improve the overall efficiency of a document clustering technique using N-grams and an efficient similarity measure, and it improves the purity and accuracy of the resulting clusters. The preprocessing method is based on N-grams (sequences of N consecutive characters), which give no special consideration to stop-words or punctuation but create an overlap among the contents of a document; this overlap makes the representation tolerant of errors and thereby greatly increases the quality of the clusters. The approach clusters news articles based on their N-gram representation, thereby reducing noise and increasing the probability of occurrence of the sequences within the articles. The proposed clustering technique has parameters that can be adjusted at the document representation level to improve the efficiency and quality of the generated clusters. Experiments carried out in the R programming environment on the real-world Reuters-21578 and 20 Newsgroups datasets demonstrated the effectiveness of the proposed clustering technique at different levels of N-grams in terms of the accuracy and purity of the generated clusters. The results also showed that, on average, the proposed clustering technique performs better than the baseline technique in both accuracy and purity, with the best results when the N-gram window is 3. en_US
dc.language.iso en_US en_US
dc.publisher American University of Nigeria, School of Information Technology and Computing en_US
dc.relation.ispartofseries Graduate Research Thesis;GRT 2018
dc.subject Data set, Document clustering, Machine learning techniques, N-grams en_US
dc.title Clustering news articles using K-means and N-grams en_US
dc.type Thesis en_US
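
The abstract describes representing each article by its overlapping character N-grams and judging the clusters by purity. The following is a minimal sketch of those two ideas only, not the thesis's implementation: the thesis reports using R, whereas this sketch uses Python, and the function names (char_ngrams, cosine_similarity, purity), the lowercasing step, and the choice of cosine similarity are all assumptions made for illustration. The record does not say which similarity measure the thesis uses.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Return a Counter of overlapping character n-grams in text.

    Lowercasing is an assumption of this sketch, not stated in the abstract.
    """
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse n-gram count vectors.

    Cosine is a stand-in; the thesis's similarity measure is not named here.
    """
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def purity(cluster_ids, true_labels):
    """Fraction of items carrying the majority true label of their cluster."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)


# Two headlines that differ by one misspelled word still share most trigrams.
profile_a = char_ngrams("stock markets rally")
profile_b = char_ngrams("stock markts rally")
print(cosine_similarity(profile_a, profile_b))

# Cluster 0 holds labels a, a; cluster 1 holds b, a: purity is 3/4.
print(purity([0, 0, 1, 1], ["a", "a", "b", "a"]))  # 0.75
```

In a full pipeline, profiles like these would be turned into fixed-length vectors and passed to a K-means routine; that step is omitted here to keep the sketch short.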

This item appears in the following Collection(s)

  • School of Information Technology and Computing
    Collection of research theses and dissertations written by graduate students in the School of Information Technology and Computing for the following programmes: MSc Computer Science, MSc Information Systems, PhD Computer Science, etc.