Abstract
Today with increase using social media, a lot of researchers have interested in topic extraction from Twitter. Twitter is an unstructured short text and messy that it is critical to find topics from tweets. While topic modeling algorithms such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) are originally designed to derive topics from large documents such as articles, and books. They are often less efficient when applied to short text content like Twitter. Luckily, Twitter has many features that represent the interaction between users. Tweets have rich user-generated hashtags as keywords. In this paper, we exploit the hashtags feature to improve topics learned from Twitter content without modifying the basic topic model of LSA and LDA. Users who share the same hashtag at most discuss the same topic. We compare the performance of the two methods (LSA and LDA) using the topic coherence (with and without hashtags). The experiment result on the Twitter dataset showed that LSA has better coherence score with hashtags than that do not incorporate hashtags. In contrast, our experiments show that the LDA has a better coherence score without incorporating hashtags. Finally, LDA has a better coherence score than LSA and the best coherence result obtained from the LDA method was (0.6047) and the LSA method was (0.4744) but the number of topics in LDA was higher than LSA. Thus, LDA may cause the same tweets to discuss the same subject set into different clustering.
Export citation and abstract BibTeX RIS
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
A post-publication change was made to this article on 04 August 2023 to correct an author affiliation.