DataDecon: Data Cleansing Tools for Large Language Model with Efficient Decontamination Techniques

Yuenyong S.; Buppodom N.; Sangkaew K.; Boonmeeprakob K.; Boonkwan P.; Jaroenkantasima J.; Khlaisamniang P.; Lertpiya A.; Piyatumrong A.; Rojratchadakorn P.; Rugsujarit T.; Saengsukhiran T.; Saetan K.; Sukprapa I.; Thavornmongkol T.; Thongthungwong N.; Triamamornwooth P.; Utupon C.; Viriyayudhakorn K.; Witchutanon P.; Wongprayon S.; Supnithi T.

dc.contributor.author	Yuenyong S.
dc.contributor.author	Buppodom N.
dc.contributor.author	Sangkaew K.
dc.contributor.author	Boonmeeprakob K.
dc.contributor.author	Boonkwan P.
dc.contributor.author	Jaroenkantasima J.
dc.contributor.author	Khlaisamniang P.
dc.contributor.author	Lertpiya A.
dc.contributor.author	Piyatumrong A.
dc.contributor.author	Rojratchadakorn P.
dc.contributor.author	Rugsujarit T.
dc.contributor.author	Saengsukhiran T.
dc.contributor.author	Saetan K.
dc.contributor.author	Sukprapa I.
dc.contributor.author	Thavornmongkol T.
dc.contributor.author	Thongthungwong N.
dc.contributor.author	Triamamornwooth P.
dc.contributor.author	Utupon C.
dc.contributor.author	Viriyayudhakorn K.
dc.contributor.author	Witchutanon P.
dc.contributor.author	Wongprayon S.
dc.contributor.author	Supnithi T.
dc.contributor.correspondence	Yuenyong S.
dc.contributor.other	Mahidol University
dc.date.accessioned	2025-02-12T18:21:57Z
dc.date.available	2025-02-12T18:21:57Z
dc.date.issued	2024-01-01
dc.description.abstract	Large language models (LLMs) play an important role in modern NLP technology because they are versatile across a wide array of NLP tasks. However, constructing an LLM is challenging: construction pipelines are concealed, cleansed datasets are lacking, and hyperparameter settings go unreported, which makes existing models almost impossible to reproduce. This paper presents an efficient pipeline for constructing an LLM tailored to a low-to-medium-resource language with a high level of data contamination, together with tools to cleanse the dataset. Following our pipeline, we constructed OpenThaiGPT, an LLM for Thai, using only open-source datasets such as CC100, OSCAR, and mC4, and achieved state-of-the-art accuracy on our downstream tasks. We disclose the data statistics and all hyperparameter settings for reproducibility.
dc.identifier.citation	19th International Joint Symposium on Artificial Intelligence and Natural Language Processing, iSAI-NLP 2024 (2024)
dc.identifier.doi	10.1109/iSAI-NLP64410.2024.10799278
dc.identifier.scopus	2-s2.0-85216587816
dc.identifier.uri	https://repository.li.mahidol.ac.th/handle/20.500.14594/104253
dc.rights.holder	SCOPUS
dc.subject	Computer Science
dc.subject	Engineering
dc.title	DataDecon: Data Cleansing Tools for Large Language Model with Efficient Decontamination Techniques
dc.type	Conference Paper
mu.datasource.scopus	https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85216587816&origin=inward
oaire.citation.title	19th International Joint Symposium on Artificial Intelligence and Natural Language Processing, iSAI-NLP 2024
oairecerif.author.affiliation	King Mongkut's University of Technology North Bangkok
oairecerif.author.affiliation	Chulalongkorn University
oairecerif.author.affiliation	Kasetsart University
oairecerif.author.affiliation	King Mongkut's Institute of Technology Ladkrabang
oairecerif.author.affiliation	University of Liverpool
oairecerif.author.affiliation	Mahidol University
oairecerif.author.affiliation	Thailand National Electronics and Computer Technology Center
oairecerif.author.affiliation	Sirindhorn International Institute of Technology, Thammasat University
oairecerif.author.affiliation	Artificial Intelligence Association of Thailand
oairecerif.author.affiliation	Kasikorn Business-Technology Group
oairecerif.author.affiliation	Big Data Institute (Public Organization)
oairecerif.author.affiliation	Faculty of Engineering

Collections

Scopus 2024
