DataDecon: Data Cleansing Tools for Large Language Model with Efficient Decontamination Techniques
dc.contributor.author | Yuenyong S. | |
dc.contributor.author | Buppodom N. | |
dc.contributor.author | Sangkaew K. | |
dc.contributor.author | Boonmeeprakob K. | |
dc.contributor.author | Boonkwan P. | |
dc.contributor.author | Jaroenkantasima J. | |
dc.contributor.author | Khlaisamniang P. | |
dc.contributor.author | Lertpiya A. | |
dc.contributor.author | Piyatumrong A. | |
dc.contributor.author | Rojratchadakorn P. | |
dc.contributor.author | Rugsujarit T. | |
dc.contributor.author | Saengsukhiran T. | |
dc.contributor.author | Saetan K. | |
dc.contributor.author | Sukprapa I. | |
dc.contributor.author | Thavornmongkol T. | |
dc.contributor.author | Thongthungwong N. | |
dc.contributor.author | Triamamornwooth P. | |
dc.contributor.author | Utupon C. | |
dc.contributor.author | Viriyayudhakorn K. | |
dc.contributor.author | Witchutanon P. | |
dc.contributor.author | Wongprayon S. | |
dc.contributor.author | Supnithi T. | |
dc.contributor.correspondence | Yuenyong S. | |
dc.contributor.other | Mahidol University | |
dc.date.accessioned | 2025-02-12T18:21:57Z | |
dc.date.available | 2025-02-12T18:21:57Z | |
dc.date.issued | 2024-01-01 | |
dc.description.abstract | Large language models (LLMs) play an important role in modern NLP technology because they are versatile across a wide array of NLP tasks. However, constructing an LLM is challenging due to concealed construction pipelines and the lack of cleansed datasets and disclosed hyperparameter settings, which makes the process almost irreproducible. This paper presents an efficient pipeline for constructing an LLM tailored to a low-to-medium-resource language whose data have a high level of contamination, together with tools to cleanse the dataset. Following our pipeline, we constructed OpenThaiGPT, an LLM for Thai, using only open-source datasets such as CC100, OSCAR, and mC4, and achieved state-of-the-art accuracy on our downstream tasks. We disclose the data statistics and all hyperparameter settings for reproducibility. | |
dc.identifier.citation | 19th International Joint Symposium on Artificial Intelligence and Natural Language Processing, iSAI-NLP 2024 (2024) | |
dc.identifier.doi | 10.1109/iSAI-NLP64410.2024.10799278 | |
dc.identifier.scopus | 2-s2.0-85216587816 | |
dc.identifier.uri | https://repository.li.mahidol.ac.th/handle/20.500.14594/104253 | |
dc.rights.holder | SCOPUS | |
dc.subject | Computer Science | |
dc.subject | Engineering | |
dc.title | DataDecon: Data Cleansing Tools for Large Language Model with Efficient Decontamination Techniques | |
dc.type | Conference Paper | |
mu.datasource.scopus | https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85216587816&origin=inward | |
oaire.citation.title | 19th International Joint Symposium on Artificial Intelligence and Natural Language Processing, iSAI-NLP 2024 | |
oairecerif.author.affiliation | King Mongkut's University of Technology North Bangkok | |
oairecerif.author.affiliation | Chulalongkorn University | |
oairecerif.author.affiliation | Kasetsart University | |
oairecerif.author.affiliation | King Mongkut's Institute of Technology Ladkrabang | |
oairecerif.author.affiliation | University of Liverpool | |
oairecerif.author.affiliation | Mahidol University | |
oairecerif.author.affiliation | Thailand National Electronics and Computer Technology Center | |
oairecerif.author.affiliation | Sirindhorn International Institute of Technology, Thammasat University | |
oairecerif.author.affiliation | Artificial Intelligence Association of Thailand | |
oairecerif.author.affiliation | Kasikorn Business-Technology Group | |
oairecerif.author.affiliation | Big Data Institute (Public Organization) | |
oairecerif.author.affiliation | Faculty of Engineering |