How Large Language Models Enhance Topic Modeling on User-Generated Content
| dc.contributor.author | Bui M.P. | |
| dc.contributor.author | Nguyen M.T.N. | |
| dc.contributor.correspondence | Bui M.P. | |
| dc.contributor.other | Mahidol University | |
| dc.date.accessioned | 2025-11-02T18:27:33Z | |
| dc.date.available | 2025-11-02T18:27:33Z | |
| dc.date.issued | 2025-01-01 | |
| dc.description.abstract | Understanding user-generated content (UGC) is crucial for obtaining actionable insights in domains such as e-commerce and hospitality. However, the noisy and redundant nature of such content presents challenges for topic modeling methods like Latent Semantic Analysis (LSA). In this paper, we investigate whether preprocessing user reviews with large language models (LLMs) can improve topic modeling performance. Specifically, we compare two input variants: (1) raw reviews and (2) ChatGPT-generated summaries, produced via the API as concise keyphrases. We apply LSA with varimax rotation to each variant and evaluate the resulting topic models using multiple criteria, including topic coherence (c_v), average pairwise Jaccard overlap, and cluster compactness via silhouette scores. Unlike prior work that employs LLMs primarily for post hoc topic labeling or interpretation, our method integrates an LLM directly into the preprocessing pipeline to reshape noisy input into structured, standardized summaries. While ChatGPT-based preprocessing results in lower c_v coherence scores, likely due to reduced lexical redundancy, it significantly improves topic separation, cluster quality, and topical specificity, leading to more interpretable and well-structured topic models overall. | |
| dc.identifier.citation | Journal of Physics Conference Series Vol.3114 No.1 (2025) | |
| dc.identifier.doi | 10.1088/1742-6596/3114/1/012011 | |
| dc.identifier.eissn | 17426596 | |
| dc.identifier.issn | 17426588 | |
| dc.identifier.scopus | 2-s2.0-105019743703 | |
| dc.identifier.uri | https://repository.li.mahidol.ac.th/handle/123456789/112896 | |
| dc.rights.holder | SCOPUS | |
| dc.subject | Physics and Astronomy | |
| dc.title | How Large Language Models Enhance Topic Modeling on User-Generated Content | |
| dc.type | Conference Paper | |
| mu.datasource.scopus | https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105019743703&origin=inward | |
| oaire.citation.issue | 1 | |
| oaire.citation.title | Journal of Physics Conference Series | |
| oaire.citation.volume | 3114 | |
| oairecerif.author.affiliation | Faculty of Science, Mahidol University | |
| oairecerif.author.affiliation | University of Economics Ho Chi Minh City | |
| oairecerif.author.affiliation | MHESI | |
