DataDecon: Data Cleansing Tools for Large Language Model with Efficient Decontamination Techniques

dc.contributor.author: Yuenyong S.
dc.contributor.author: Buppodom N.
dc.contributor.author: Sangkaew K.
dc.contributor.author: Boonmeeprakob K.
dc.contributor.author: Boonkwan P.
dc.contributor.author: Jaroenkantasima J.
dc.contributor.author: Khlaisamniang P.
dc.contributor.author: Lertpiya A.
dc.contributor.author: Piyatumrong A.
dc.contributor.author: Rojratchadakorn P.
dc.contributor.author: Rugsujarit T.
dc.contributor.author: Saengsukhiran T.
dc.contributor.author: Saetan K.
dc.contributor.author: Sukprapa I.
dc.contributor.author: Thavornmongkol T.
dc.contributor.author: Thongthungwong N.
dc.contributor.author: Triamamornwooth P.
dc.contributor.author: Utupon C.
dc.contributor.author: Viriyayudhakorn K.
dc.contributor.author: Witchutanon P.
dc.contributor.author: Wongprayon S.
dc.contributor.author: Supnithi T.
dc.contributor.correspondence: Yuenyong S.
dc.contributor.other: Mahidol University
dc.date.accessioned: 2025-02-12T18:21:57Z
dc.date.available: 2025-02-12T18:21:57Z
dc.date.issued: 2024-01-01
dc.description.abstract: Large language models (LLMs) play an important role in modern NLP technology, as they are versatile across a wide array of NLP tasks. However, constructing an LLM is challenging due to concealed construction pipelines, the lack of cleansed datasets, and undisclosed hyperparameter settings, which make such models almost irreproducible. This paper presents an efficient pipeline for constructing an LLM tailored to a low-to-medium-resource language with a high level of data contamination, together with tools to cleanse the dataset. Following our pipeline, we constructed OpenThaiGPT, an LLM for Thai, using only open-source datasets such as CC100, OSCAR, and mC4, and achieved state-of-the-art accuracy on our downstream tasks. We disclose the data statistics and all hyperparameter settings for reproducibility.
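
The abstract describes decontamination only at a high level. As context, the sketch below illustrates a generic word-level n-gram overlap filter of the kind commonly used to drop training documents that overlap with evaluation benchmarks. It is an illustrative assumption written in Python, not the DataDecon implementation described in the paper; the function names (ngrams, is_contaminated, decontaminate) and parameters (n = 8, threshold = 0.01) are hypothetical.

# Illustrative sketch only: generic n-gram overlap decontamination.
# NOT the DataDecon tool; all names and thresholds here are hypothetical.
from typing import Iterable, List, Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    # Word-level n-grams over a lightly normalized (lowercased) document.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(doc: str, benchmark_ngrams: Set[Tuple[str, ...]],
                    n: int = 8, threshold: float = 0.01) -> bool:
    # Flag a document if the fraction of its n-grams that also appear in any
    # benchmark reaches the threshold.
    doc_ngrams = ngrams(doc, n)
    if not doc_ngrams:
        return False
    overlap = len(doc_ngrams & benchmark_ngrams) / len(doc_ngrams)
    return overlap >= threshold

def decontaminate(corpus: Iterable[str], benchmarks: Iterable[str]) -> List[str]:
    # Build the benchmark n-gram index once, then keep only clean documents.
    bench_ngrams: Set[Tuple[str, ...]] = set()
    for b in benchmarks:
        bench_ngrams |= ngrams(b)
    return [doc for doc in corpus if not is_contaminated(doc, bench_ngrams)]

Indexing the benchmark n-grams in a set keeps the per-document check roughly linear in the document's length, which matters when the corpus being filtered is web-scale (e.g., mC4 or OSCAR).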
dc.identifier.citation: 19th International Joint Symposium on Artificial Intelligence and Natural Language Processing, iSAI-NLP 2024 (2024)
dc.identifier.doi: 10.1109/iSAI-NLP64410.2024.10799278
dc.identifier.scopus: 2-s2.0-85216587816
dc.identifier.uri: https://repository.li.mahidol.ac.th/handle/20.500.14594/104253
dc.rights.holder: SCOPUS
dc.subject: Computer Science
dc.subject: Engineering
dc.title: DataDecon: Data Cleansing Tools for Large Language Model with Efficient Decontamination Techniques
dc.type: Conference Paper
mu.datasource.scopus: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85216587816&origin=inward
oaire.citation.title: 19th International Joint Symposium on Artificial Intelligence and Natural Language Processing, iSAI-NLP 2024
oairecerif.author.affiliation: King Mongkut's University of Technology North Bangkok
oairecerif.author.affiliation: Chulalongkorn University
oairecerif.author.affiliation: Kasetsart University
oairecerif.author.affiliation: King Mongkut's Institute of Technology Ladkrabang
oairecerif.author.affiliation: University of Liverpool
oairecerif.author.affiliation: Mahidol University
oairecerif.author.affiliation: Thailand National Electronics and Computer Technology Center
oairecerif.author.affiliation: Sirindhorn International Institute of Technology, Thammasat University
oairecerif.author.affiliation: Artificial Intelligence Association of Thailand
oairecerif.author.affiliation: Kasikorn Business-Technology Group
oairecerif.author.affiliation: Big Data Institute (Public Organization)
oairecerif.author.affiliation: Faculty of Engineering
