Thai Named Entity Recognition Using BiLSTM-CNN-CRF Enhanced by TCC

Sornlertlamvanich V.; Yuenyong S.

Issued Date

2022-01-01

Resource Type

Article

eISSN

21693536

DOI

10.1109/ACCESS.2022.3175201

Scopus ID

2-s2.0-85130489064

Journal Title

IEEE Access

Volume

10

Start Page

53043

End Page

53052

Rights Holder(s)

SCOPUS

Bibliographic Citation

IEEE Access Vol.10 (2022), 53043-53052

Suggested Citation

Sornlertlamvanich V., Yuenyong S. Thai Named Entity Recognition Using BiLSTM-CNN-CRF Enhanced by TCC. IEEE Access Vol.10 (2022), 53043-53052. doi:10.1109/ACCESS.2022.3175201 Retrieved from: https://repository.li.mahidol.ac.th/handle/123456789/84392

Author(s)

Sornlertlamvanich V.
Yuenyong S.

Author's Affiliation

Musashino University
Mahidol University
Thammasat University

Other Contributor(s)

Mahidol University

Abstract

The languages spoken in Asia share common morphological analysis errors in word segmentation, which normally propagate to higher-level processing such as part-of-speech (POS) tagging, syntactic parsing, word extraction, and named entity recognition (NER), the task addressed in this research. We introduce the Thai character cluster (TCC) to reduce the errors propagated from word segmentation and POS tagging by incorporating it into the character representation layer of a bidirectional long short-term memory (BiLSTM) network for NER. The initial NER model is trained on the original THAI-NEST named-entity (NE) tagged corpus using the best-performing BiLSTM-CNN-CRF model (a combination of BiLSTM, a convolutional neural network (CNN), and a conditional random field (CRF)) with word, POS, and TCC embeddings. We identify annotation errors and improve the consistency of the NE annotation through our holdout method, retraining the model on the corrected training set. After the iteration, the overall annotation F1-score reaches 89.22%, an improvement of 16.21% over the model trained on the original corpus. The results suggest that our iterative verification is a promising method for low-resource language modeling. As a result, a new NE silver-standard corpus for the Thai NER task, called the Bangkok Data NE tagged Corpus (BKD), is generated. The consistency of the annotation is checked and revised in line with the improved scope of NE detection provided by TCC, which can recover from errors in word segmentation.
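
The holdout-based verification mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the procedure in detail, so the fold assignment by sentence index, the BIO-style tags, the toy corpus, and the names `train_majority_tagger`, `predict`, and `flag_disagreements` are all assumptions made for illustration. A trivial majority-tag lookup stands in for the BiLSTM-CNN-CRF model. The idea shown is only the general one: train on part of the corpus, predict on the held-out part, and flag tokens where the prediction disagrees with the existing annotation so they can be reviewed.

```python
from collections import Counter, defaultdict


def train_majority_tagger(sentences):
    """Toy stand-in for the NER model (assumption): most frequent tag per token."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for token, tag in sent:
            counts[token][tag] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def predict(model, tokens, default="O"):
    """Tag each token; unseen tokens fall back to the outside tag."""
    return [model.get(token, default) for token in tokens]


def flag_disagreements(corpus, n_folds=2):
    """Hold out each fold in turn, train on the rest, and collect
    (sentence_idx, token_idx, token, annotated_tag, predicted_tag)
    for every token where the model disagrees with the annotation."""
    flagged = []
    for fold in range(n_folds):
        train = [s for i, s in enumerate(corpus) if i % n_folds != fold]
        model = train_majority_tagger(train)
        for i, sent in enumerate(corpus):
            if i % n_folds != fold:
                continue  # only check sentences in the held-out fold
            tokens = [token for token, _ in sent]
            predicted = predict(model, tokens)
            for j, ((token, gold), pred) in enumerate(zip(sent, predicted)):
                if pred != gold:
                    flagged.append((i, j, token, gold, pred))
    return flagged


# Hypothetical toy corpus: sentence 2 tags "Bangkok" inconsistently.
toy_corpus = [
    [("Bangkok", "B-LOC"), ("is", "O"), ("big", "O")],
    [("visit", "O"), ("Bangkok", "B-LOC")],
    [("Bangkok", "O"), ("is", "O"), ("hot", "O")],
    [("in", "O"), ("Bangkok", "B-LOC")],
    [("near", "O"), ("Bangkok", "B-LOC")],
]

for item in flag_disagreements(toy_corpus, n_folds=2):
    print(item)
```

In this sketch the flagged tokens are only candidates for review. Correcting them and retraining on the revised training set would correspond to one round of the iteration the abstract describes.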

Keyword(s)

Computer Science

URI

https://repository.li.mahidol.ac.th/handle/123456789/84392

Collections

Scopus 2022
