TranSentCut-transformer based Thai sentence segmentation

dc.contributor.author: Yuenyong S.
dc.contributor.other: Mahidol University
dc.date.accessioned: 2023-06-26T18:12:59Z
dc.date.available: 2023-06-26T18:12:59Z
dc.date.issued: 2022-05-01
dc.description.abstract: We propose TranSentCut, a sentence segmentation model for Thai based on the transformer architecture. Sentence segmentation is difficult for Thai because, unlike many other languages, Thai has no end-of-sentence marker. Existing methods make use of POS tags, which are not easy to label and must be provided for every word in the data. This limits the applicability and performance of sentence segmentation on open-domain text, because the only high-quality Thai corpus that has both sentence-boundary and POS labels was constructed mostly from academic articles. Our approach uses only raw text for training, and the only labelling required is to place each sentence on its own line in a text file. This makes new datasets much easier to construct. Comparison with existing methods shows that our proposed model is competitive with the most recent state of the art when evaluated on in-domain texts, and improves significantly over existing publicly available libraries when applied to out-of-domain input texts.
dc.identifier.citation: Songklanakarin Journal of Science and Technology Vol.44 No.3 (2022), 852-860
dc.identifier.issn: 01253395
dc.identifier.scopus: 2-s2.0-85137518799
dc.identifier.uri: https://repository.li.mahidol.ac.th/handle/20.500.14594/87650
dc.rights.holder: SCOPUS
dc.subject: Multidisciplinary
dc.title: TranSentCut-transformer based Thai sentence segmentation
dc.type: Article
mu.datasource.scopus: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85137518799&origin=inward
oaire.citation.endPage: 860
oaire.citation.issue: 3
oaire.citation.startPage: 852
oaire.citation.title: Songklanakarin Journal of Science and Technology
oaire.citation.volume: 44
oairecerif.author.affiliation: Musashino University
oairecerif.author.affiliation: Mahidol University
oairecerif.author.affiliation: Thammasat University
