ThonburianTTS: Enhancing Neural Flow Matching Models for Authentic Thai Text-to-Speech
| dc.contributor.author | Aung T. | |
| dc.contributor.author | Sriwirote P. | |
| dc.contributor.author | Thavornmongkol T. | |
| dc.contributor.author | Pipatsrisawat K. | |
| dc.contributor.author | Achakulvisut T. | |
| dc.contributor.author | Aung Z.H. | |
| dc.contributor.correspondence | Aung T. | |
| dc.contributor.other | Mahidol University | |
| dc.date.accessioned | 2026-03-20T18:20:38Z | |
| dc.date.available | 2026-03-20T18:20:38Z | |
| dc.date.issued | 2025-01-01 | |
| dc.description.abstract | We introduce ThonburianTTS, a fine-tuned Thai text-to-speech (TTS) system based on the E2-TTS and F5-TTS architectures, designed to improve pronunciation accuracy, alignment robustness, and zero-shot speaker adaptation for the Thai language. Our models are trained on both Thai script and International Phonetic Alphabet (IPA) transcriptions to evaluate the impact of phonetic input on synthesis quality. We evaluate performance using objective metrics, including Word Error Rate (WER), Syllable Error Rate (SylER), Character Error Rate (CER), Speech Naturalness (MOSNet), Speaker Similarity (SIM-O), and Synthesis Speed (RTF). Our best model, F5-TTS trained on Thai script, achieves a WER of 25.72%, SylER of 14.17%, CER of 8.70%, a MOSNet score of 3.9451, and a SIM-O score of 88.30%. While IPA-based models yield comparable or higher scores in naturalness and speaker similarity, they underperform in accuracy-related metrics such as WER, SylER, and CER. We also show that increasing the Number of Function Evaluations (NFE) leads to improved model accuracy. ThonburianTTS outperforms strong baselines such as MMS-TTS and PyThaiTTS in both intelligibility and speaker similarity, highlighting the effectiveness of flow-matching-based architectures for high-quality TTS in tonal, low-resource languages like Thai. The code and pretrained models are available at https://github.com/biodatlab/thonburian-tts. | |
| dc.identifier.citation | 2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing, iSAI-NLP 2025 (2025) | |
| dc.identifier.doi | 10.1109/iSAI-NLP66160.2025.11320472 | |
| dc.identifier.scopus | 2-s2.0-105032741498 | |
| dc.identifier.uri | https://repository.li.mahidol.ac.th/handle/123456789/115796 | |
| dc.rights.holder | SCOPUS | |
| dc.subject | Computer Science | |
| dc.subject | Engineering | |
| dc.title | ThonburianTTS: Enhancing Neural Flow Matching Models for Authentic Thai Text-to-Speech | |
| dc.type | Conference Paper | |
| mu.datasource.scopus | https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105032741498&origin=inward | |
| oaire.citation.title | 2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing, iSAI-NLP 2025 | |
| oairecerif.author.affiliation | Mahidol University | |
| oairecerif.author.affiliation | King Mongkut's Institute of Technology Ladkrabang | |
| oairecerif.author.affiliation | Mahidol University, Faculty of Dentistry | |
| oairecerif.author.affiliation | Looloo Technology |
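
The error-rate metrics named in the abstract (WER, SylER, CER) are conventionally defined as an edit distance between a reference transcript and a recognized transcript, divided by the reference length, and RTF is conventionally synthesis time divided by audio duration. The sketch below illustrates those standard definitions only; it is not the authors' evaluation code. The function names are hypothetical, and the inputs are assumed to be already segmented into space-separated words, since Thai script does not mark word boundaries and a real pipeline would need a tokenizer first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the number of reference units."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(ref, hyp):
    # Assumption: words are already separated by spaces.
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    # Assumption: spaces are ignored when comparing characters.
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


def rtf(synthesis_seconds, audio_seconds):
    """Real-time factor: values below 1 mean faster than real time."""
    return synthesis_seconds / audio_seconds
```

A syllable error rate follows the same pattern by passing lists of syllables to `error_rate`; how syllables are segmented is left open here.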
