ThonburianTTS: Enhancing Neural Flow Matching Models for Authentic Thai Text-to-Speech
Issued Date
2025-01-01
Scopus ID
2-s2.0-105032741498
Journal Title
2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2025)
Rights Holder(s)
SCOPUS
Bibliographic Citation
2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2025) (2025)
Suggested Citation
Aung T., Sriwirote P., Thavornmongkol T., Pipatsrisawat K., Achakulvisut T., Aung Z.H. ThonburianTTS: Enhancing Neural Flow Matching Models for Authentic Thai Text-to-Speech. 2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2025) (2025). doi:10.1109/iSAI-NLP66160.2025.11320472 Retrieved from: https://repository.li.mahidol.ac.th/handle/123456789/115796
Title
ThonburianTTS: Enhancing Neural Flow Matching Models for Authentic Thai Text-to-Speech
Abstract
We introduce ThonburianTTS, a fine-tuned Thai text-to-speech (TTS) system based on the E2-TTS and F5-TTS architectures, designed to improve pronunciation accuracy, alignment robustness, and zero-shot speaker adaptation for the Thai language. Our models are trained on both Thai script and International Phonetic Alphabet (IPA) transcriptions to evaluate the impact of phonetic input on synthesis quality. We evaluate performance using objective metrics: Word Error Rate (WER), Syllable Error Rate (SylER), Character Error Rate (CER), speech naturalness (MOSNet), speaker similarity (SIM-O), and synthesis speed (real-time factor, RTF). Our best model, F5-TTS trained on Thai script, achieves a WER of 25.72%, a SylER of 14.17%, a CER of 8.70%, a MOSNet score of 3.9451, and a SIM-O score of 88.30%. While IPA-based models yield comparable or higher scores in naturalness and speaker similarity, they underperform on the accuracy-related metrics (WER, SylER, and CER). We also show that increasing the Number of Function Evaluations (NFE) improves model accuracy. ThonburianTTS outperforms strong baselines such as MMS-TTS and PyThaiTTS in both intelligibility and speaker similarity, highlighting the effectiveness of flow-matching-based architectures for high-quality TTS in tonal, low-resource languages like Thai. The code and pretrained models are available at https://github.com/biodatlab/thonburian-tts.
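For readers unfamiliar with the accuracy and speed metrics named in the abstract, the following is a minimal sketch of how WER, CER, and RTF are commonly computed. It is not taken from the paper or its repository. It assumes the standard definition of an error rate as token-level edit distance divided by reference length, and RTF as synthesis time divided by audio duration. The function names (`edit_distance`, `error_rate`, `wer`, `cer`, `real_time_factor`) are hypothetical. Because Thai is written without spaces between words, `wer` here takes already-tokenized word lists; the choice of tokenizer, and the ASR system that transcribes the synthesized audio, are left outside the sketch. SylER would follow the same pattern with syllable tokens.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens: Sequence[str], hyp_tokens: Sequence[str]) -> float:
    """Edit distance normalized by the number of reference tokens."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref_words: Sequence[str], hyp_words: Sequence[str]) -> float:
    """Word error rate over pre-tokenized word lists (tokenizer is assumed)."""
    return error_rate(ref_words, hyp_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, counting each Unicode code point as a token."""
    return error_rate(list(reference), list(hypothesis))


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Time spent synthesizing divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds
```

Under these definitions, an error rate can exceed 1.0 when the hypothesis contains many insertions, and an RTF below 1.0 means synthesis runs faster than real time. Note that counting code points treats Thai vowel and tone marks as separate characters; an evaluation could instead count grapheme clusters, which would give different CER values.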
