ThonburianTTS: Enhancing Neural Flow Matching Models for Authentic Thai Text-to-Speech

Aung T.; Sriwirote P.; Thavornmongkol T.; Pipatsrisawat K.; Achakulvisut T.; Aung Z.H.

Issued Date

2025-01-01

Resource Type

Conference Paper

DOI

10.1109/iSAI-NLP66160.2025.11320472

Scopus ID

2-s2.0-105032741498

Journal Title

2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing, iSAI-NLP 2025

Rights Holder(s)

SCOPUS

Bibliographic Citation

2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing, iSAI-NLP 2025 (2025)

Suggested Citation

Aung T., Sriwirote P., Thavornmongkol T., Pipatsrisawat K., Achakulvisut T., Aung Z.H. ThonburianTTS: Enhancing Neural Flow Matching Models for Authentic Thai Text-to-Speech. 2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing, iSAI-NLP 2025 (2025). doi:10.1109/iSAI-NLP66160.2025.11320472 Retrieved from: https://repository.li.mahidol.ac.th/handle/123456789/115796

Title

ThonburianTTS: Enhancing Neural Flow Matching Models for Authentic Thai Text-to-Speech

Author(s)

Aung T.
Sriwirote P.
Thavornmongkol T.
Pipatsrisawat K.
Achakulvisut T.
Aung Z.H.

Author's Affiliation

Mahidol University
King Mongkut's Institute of Technology Ladkrabang
Mahidol University, Faculty of Dentistry
Looloo Technology

Corresponding Author(s)

Aung T.

Other Contributor(s)

Mahidol University

Abstract

We introduce ThonburianTTS, a fine-tuned Thai text-to-speech (TTS) system based on the E2-TTS and F5-TTS architectures, designed to improve pronunciation accuracy, alignment robustness, and zero-shot speaker adaptation for the Thai language. Our models are trained on both Thai script and International Phonetic Alphabet (IPA) transcriptions to evaluate the impact of phonetic input on synthesis quality. We evaluate performance using objective metrics: Word Error Rate (WER), Syllable Error Rate (SylER), Character Error Rate (CER), Speech Naturalness (MOSNet), Speaker Similarity (SIM-O), and Synthesis Speed (RTF). Our best model, F5-TTS trained on Thai script, achieves a WER of 25.72%, a SylER of 14.17%, a CER of 8.70%, a MOSNet score of 3.9451, and a SIM-O score of 88.30%. While IPA-based models yield comparable or higher scores in naturalness and speaker similarity, they underperform in the accuracy-related metrics WER, SylER, and CER. We also show that increasing the Number of Function Evaluations (NFE) improves model accuracy. ThonburianTTS outperforms strong baselines such as MMS-TTS and PyThaiTTS in both intelligibility and speaker similarity, highlighting the effectiveness of flow matching-based architectures for high-quality TTS in tonal, low-resource languages like Thai. The code and pretrained models are available at https://github.com/biodatlab/thonburian-tts.
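
For readers unfamiliar with the error-rate metrics named in the abstract, the following is a minimal sketch of how WER, SylER, CER, and RTF are conventionally defined. It is not the authors' evaluation code. It assumes the reference and hypothesis have already been split into tokens (words, syllables, or characters); Thai word and syllable segmentation is a separate step that is not shown here. The function names `edit_distance`, `error_rate`, and `real_time_factor` are illustrative only.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalised by reference length.

    Gives WER for word tokens, SylER for syllable tokens, CER for characters.
    """
    if len(ref) == 0:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF: time spent synthesising divided by the duration of the audio produced."""
    return synthesis_seconds / audio_seconds


# Hypothetical toy inputs, not data from the paper.
wer = error_rate(["a", "b", "c", "d"], ["a", "x", "c"])  # 1 substitution + 1 deletion
cer = error_rate(list("kitten"), list("sitting"))
rtf = real_time_factor(0.5, 2.0)
```

In TTS evaluation the hypothesis is typically the transcript produced by running a speech recognizer on the synthesized audio, so these scores reflect recognizer errors as well as synthesis errors. An RTF below 1 means synthesis runs faster than real time.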

Keyword(s)

Computer Science
Engineering

URI

https://repository.li.mahidol.ac.th/handle/123456789/115796

Collections

Scopus 2025
