DataDecon: Data Cleansing Tools for Large Language Model with Efficient Decontamination Techniques
Issued Date
2024-01-01
Resource Type
Scopus ID
2-s2.0-85216587816
Journal Title
19th International Joint Symposium on Artificial Intelligence and Natural Language Processing, iSAI-NLP 2024
Rights Holder(s)
SCOPUS
Bibliographic Citation
19th International Joint Symposium on Artificial Intelligence and Natural Language Processing, iSAI-NLP 2024 (2024)
Suggested Citation
Yuenyong S., Buppodom N., Sangkaew K., Boonmeeprakob K., Boonkwan P., Jaroenkantasima J., Khlaisamniang P., Lertpiya A., Piyatumrong A., Rojratchadakorn P., Rugsujarit T., Saengsukhiran T., Saetan K., Sukprapa I., Thavornmongkol T., Thongthungwong N., Triamamornwooth P., Utupon C., Viriyayudhakorn K., Witchutanon P., Wongprayon S., Supnithi T. DataDecon: Data Cleansing Tools for Large Language Model with Efficient Decontamination Techniques. 19th International Joint Symposium on Artificial Intelligence and Natural Language Processing, iSAI-NLP 2024 (2024). doi:10.1109/iSAI-NLP64410.2024.10799278 Retrieved from: https://repository.li.mahidol.ac.th/handle/20.500.14594/104253
Title
DataDecon: Data Cleansing Tools for Large Language Model with Efficient Decontamination Techniques
Author(s)
Yuenyong S.
Buppodom N.
Sangkaew K.
Boonmeeprakob K.
Boonkwan P.
Jaroenkantasima J.
Khlaisamniang P.
Lertpiya A.
Piyatumrong A.
Rojratchadakorn P.
Rugsujarit T.
Saengsukhiran T.
Saetan K.
Sukprapa I.
Thavornmongkol T.
Thongthungwong N.
Triamamornwooth P.
Utupon C.
Viriyayudhakorn K.
Witchutanon P.
Wongprayon S.
Supnithi T.
Author's Affiliation
King Mongkut's University of Technology North Bangkok
Chulalongkorn University
Kasetsart University
King Mongkut's Institute of Technology Ladkrabang
University of Liverpool
Mahidol University
Thailand National Electronics and Computer Technology Center
Sirindhorn International Institute of Technology, Thammasat University
Artificial Intelligence Association of Thailand
Kasikorn Business-Technology Group
Big Data Institute (Public Organization)
Faculty of Engineering
Corresponding Author(s)
Other Contributor(s)
Abstract
Large language models (LLMs) play an important role in modern NLP technology, as they are versatile across a wide array of NLP tasks. However, constructing an LLM is challenging: concealed construction pipelines, the lack of cleansed datasets, and undisclosed hyperparameter settings make it almost irreproducible. This paper presents an efficient pipeline for constructing an LLM tailored to a low-to-medium-resourced language with a high level of data contamination, along with tools to cleanse the dataset. Following our pipeline, we constructed OpenThaiGPT, an LLM for Thai, using only open-source datasets such as CC100, OSCAR, and mC4, and achieved state-of-the-art accuracy on our downstream tasks. We disclose the data statistics and all hyperparameter settings for reproducibility.
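As a minimal illustration of the kind of decontamination the abstract describes, the sketch below removes training documents that share a word-level n-gram with any benchmark example. This is a common baseline technique, not necessarily the exact method used by DataDecon; the function names, the n-gram size, and the exact-match criterion are all assumptions made for this example.

```python
from typing import Iterable, List, Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in `text` (lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(corpus: Iterable[str],
                  benchmarks: Iterable[str],
                  n: int = 8) -> List[str]:
    """Drop any corpus document sharing an n-gram with a benchmark example.

    Note: a simple exact n-gram overlap baseline, shown for illustration only;
    the paper's actual decontamination techniques may differ.
    """
    contaminated: Set[Tuple[str, ...]] = set()
    for example in benchmarks:
        contaminated |= ngrams(example, n)
    return [doc for doc in corpus if not (ngrams(doc, n) & contaminated)]
```

In practice, pipelines of this sort often trade exact matching for approximate methods (e.g. MinHash-based near-duplicate detection) to scale to web-crawled corpora such as CC100, OSCAR, and mC4.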