TranSentCut-transformer based Thai sentence segmentation
Issued Date
2022-05-01
ISSN
01253395
Scopus ID
2-s2.0-85137518799
Journal Title
Songklanakarin Journal of Science and Technology
Volume
44
Issue
3
Start Page
852
End Page
860
Rights Holder(s)
SCOPUS
Bibliographic Citation
Songklanakarin Journal of Science and Technology Vol.44 No.3 (2022), 852-860
Suggested Citation
Yuenyong S. TranSentCut-transformer based Thai sentence segmentation. Songklanakarin Journal of Science and Technology Vol.44 No.3 (2022), 852-860. Retrieved from: https://repository.li.mahidol.ac.th/handle/20.500.14594/87650
Title
TranSentCut-transformer based Thai sentence segmentation
Abstract
We propose TranSentCut, a sentence segmentation model for Thai based on the transformer architecture. Sentence segmentation is difficult for Thai because, unlike many other languages, it has no end-of-sentence marker. Existing methods rely on part-of-speech (POS) tags, which are hard to label and must be assigned to every word in the data. This limits the applicability and performance of sentence segmentation on open-domain text, because the only high-quality Thai corpus with both sentence boundary and POS labels was constructed mostly from academic articles. Our approach uses only raw text for training, and the only labelling required is to place each sentence on its own line in a text file. This makes new datasets much easier to construct. Comparison with existing methods shows that our proposed model is competitive with the most recent state of the art when evaluated on in-domain texts, and improves significantly over existing publicly available libraries when applied to out-of-domain input texts.
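The one-sentence-per-line labelling described in the abstract can be converted into boundary labels mechanically. The sketch below illustrates one way this might be done; it is not the paper's actual preprocessing. It assumes that sentences are rejoined with a single space and that every space in the joined text is a candidate boundary. The function names `build_example` and `label_spaces` are hypothetical.

```python
from typing import Iterable, List, Tuple


def build_example(lines: Iterable[str], joiner: str = " ") -> Tuple[str, List[int]]:
    """Join one-sentence-per-line input into a single string.

    Returns the joined text and the character offsets at which each
    joiner starts, i.e. the positions of the true sentence boundaries.
    Assumption: sentences are separated by `joiner` in running text.
    """
    sentences = [ln.strip() for ln in lines if ln.strip()]
    text = joiner.join(sentences)
    boundaries: List[int] = []
    pos = 0
    for sentence in sentences[:-1]:
        pos += len(sentence)
        boundaries.append(pos)  # offset of the joiner that ends this sentence
        pos += len(joiner)
    return text, boundaries


def label_spaces(text: str, boundaries: List[int]) -> List[Tuple[int, int]]:
    """Label every space in `text`.

    Each item is (offset, label), where label is 1 if the space is a
    sentence boundary and 0 if it is an ordinary in-sentence space.
    """
    boundary_set = set(boundaries)
    return [
        (i, 1 if i in boundary_set else 0)
        for i, ch in enumerate(text)
        if ch == " "
    ]
```

Calling `build_example` on the lines of a training file and passing the result to `label_spaces` yields one binary label per space. A classifier could then be trained on those labels without any POS annotation.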