Chai-Calibrated Hybrid Assessment for IELTS Speaking with Human-Referenced Validation
Issued Date
2025-01-01
Scopus ID
2-s2.0-105032755968
Journal Title
2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2025)
Rights Holder(s)
SCOPUS
Bibliographic Citation
2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2025) (2025)
Suggested Citation
Polasa P., Laoaree S., Thanadunpremdet T., Rodjananant N., Kritsuthikul N. Chai-Calibrated Hybrid Assessment for IELTS Speaking with Human-Referenced Validation. 2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2025) (2025). doi:10.1109/iSAI-NLP66160.2025.11320542. Retrieved from: https://repository.li.mahidol.ac.th/handle/123456789/115785
Title
Chai-Calibrated Hybrid Assessment for IELTS Speaking with Human-Referenced Validation
Abstract
We present CHAI, a rubric-aligned framework for IELTS Speaking that combines an accent-aware ASR backbone with self-supervised speech representations to deliver criterion-level feedback. CHAI adopts a dual-agent design: a low-latency Coach for live turn-taking (Whisper-TH large) and a read-only Judge for independent scoring (Whisper-base). Evidence integrates pronunciation similarity from HuBERT-style embeddings with alignment/timing cues, prosody, and transcript-derived indicators to estimate bands for Fluency & Coherence, Lexical Resource, and Grammatical Range & Accuracy. Two certified IELTS examiners (Experts A and B) and a small crowd panel (Crowd mean) serve as human references in a classroom-style evaluation with Thai EFL learners across three role-play scenarios (restaurant, airport, job interview). Agreement is reported on the band scale using mean absolute error (MAE) as the primary metric, with latency tracked for usability. The hybrid fusion with a lightweight human prior yields the lowest overall MAE (0.410), outperforming single-model baselines and each human reference considered individually (Expert A: 0.430; Expert B: 0.451; Crowd mean: 0.512); per-criterion MAE likewise favors the hybrid (F/C 0.409, LR 0.402, GRA 0.418). Latency supports near-real-time classroom feedback for ∼10 s turns. Despite the focus on a Thai-centric corpus and sensitivity to ASR timing and fairness, the results indicate that model-human hybridization is a practical pathway to consistent, scalable, IELTS-aligned feedback.
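Illustrative example (not from the paper): a minimal Python sketch of the band-scale MAE agreement metric and a simple weighted blend of a model band estimate with a lightweight human prior, as described in the abstract. The prior weight, helper names, and toy scores below are assumptions for illustration only.

```python
# Minimal sketch (not the paper's implementation) of band-scale MAE and a
# hybrid score that blends a model estimate with a lightweight human prior.
# The weighting constant and helper names are illustrative assumptions.

from statistics import mean


def round_to_half_band(score: float) -> float:
    """Snap a raw score to the IELTS half-band scale (e.g., 6.0, 6.5, 7.0)."""
    return round(score * 2) / 2


def hybrid_band(model_score: float, human_prior: float, prior_weight: float = 0.2) -> float:
    """Blend the model's band estimate with a human-referenced prior.

    prior_weight is an assumed hyperparameter; the abstract only states that a
    'lightweight human prior' is fused with the model output.
    """
    fused = (1 - prior_weight) * model_score + prior_weight * human_prior
    return round_to_half_band(fused)


def band_mae(predicted: list[float], reference: list[float]) -> float:
    """Mean absolute error on the band scale, the paper's primary agreement metric."""
    return mean(abs(p - r) for p, r in zip(predicted, reference))


if __name__ == "__main__":
    # Toy example: three per-criterion estimates (F/C, LR, GRA) vs. one examiner's bands.
    model_bands = [6.5, 6.0, 7.0]
    crowd_bands = [6.5, 6.5, 6.5]
    examiner_bands = [6.0, 6.5, 7.0]

    hybrid = [hybrid_band(m, c) for m, c in zip(model_bands, crowd_bands)]
    print("Hybrid bands:", hybrid)
    print("MAE vs. examiner:", round(band_mae(hybrid, examiner_bands), 3))
```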
