Performance of large language models on Thailand's national medical licensing examination: a cross-sectional study
| dc.contributor.author | Saowaprut P. | |
| dc.contributor.author | Wabina R.S. | |
| dc.contributor.author | Yang J. | |
| dc.contributor.author | Siriwat L. | |
| dc.contributor.correspondence | Saowaprut P. | |
| dc.contributor.other | Mahidol University | |
| dc.date.accessioned | 2025-05-26T18:09:20Z | |
| dc.date.available | 2025-05-26T18:09:20Z | |
| dc.date.issued | 2025-01-01 | |
| dc.description.abstract | PURPOSE: This study aimed to evaluate the feasibility of general-purpose large language models (LLMs) in addressing inequities in medical licensure exam preparation for Thailand's National Medical Licensing Examination (ThaiNLE), which currently lacks standardized public study materials. METHODS: We assessed 4 multi-modal LLMs (GPT-4, Claude 3 Opus, Gemini 1.0/1.5 Pro) using a 304-question ThaiNLE Step 1 mock examination (10.2% image-based), applying deterministic API configurations and 5 inference repetitions per model. Performance was measured via micro- and macro-accuracy metrics compared against historical passing thresholds. RESULTS: All models exceeded passing scores, with GPT-4 achieving the highest accuracy (88.9%; 95% confidence interval, 88.7-89.1), surpassing Thailand's national average by more than 2 standard deviations. Claude 3.5 Sonnet (80.1%) and Gemini 1.5 Pro (72.8%) ranked second and third, respectively. Models demonstrated robustness across 17 of 20 medical domains, but variability was noted in genetics (74.0%) and cardiovascular topics (58.3%). While models demonstrated proficiency with images (Gemini 1.0 Pro: +9.9% vs. text), text-only accuracy remained superior (GPT-4o: 90.0% vs. 82.6%). CONCLUSION: General-purpose LLMs show promise as equitable preparatory tools for ThaiNLE Step 1. However, domain-specific knowledge gaps and inconsistent multi-modal integration warrant refinement before clinical deployment. | |
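The METHODS section scores models with both micro- and macro-accuracy. As a minimal sketch of the distinction (the function name, topic labels, and toy data below are hypothetical, not from the study): micro-accuracy pools all questions, while macro-accuracy averages per-topic accuracies so that small topics weigh equally.

```python
from collections import defaultdict

def micro_macro_accuracy(results):
    """results: list of (topic, is_correct) pairs, one per question.

    Micro-accuracy = overall fraction correct across all questions.
    Macro-accuracy = unweighted mean of per-topic accuracies.
    """
    micro = sum(1 for _, ok in results if ok) / len(results)

    by_topic = defaultdict(list)
    for topic, ok in results:
        by_topic[topic].append(ok)
    macro = sum(sum(v) / len(v) for v in by_topic.values()) / len(by_topic)
    return micro, macro

# Hypothetical toy data: topics of unequal size make the two metrics diverge.
demo = [("genetics", True), ("genetics", True), ("genetics", False),
        ("cardiovascular", True)]
micro, macro = micro_macro_accuracy(demo)
# micro = 3/4 = 0.75; macro = mean(2/3, 1/1) ≈ 0.833
```

With 5 inference repetitions per model, the study would compute these metrics per run and report means with confidence intervals, as in the RESULTS.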
| dc.identifier.citation | Journal of educational evaluation for health professions Vol.22 (2025) , 16 | |
| dc.identifier.doi | 10.3352/jeehp.2025.22.16 | |
| dc.identifier.eissn | 1975-5937 | |
| dc.identifier.pmid | 40354784 | |
| dc.identifier.scopus | 2-s2.0-105005377591 | |
| dc.identifier.uri | https://repository.li.mahidol.ac.th/handle/123456789/110365 | |
| dc.rights.holder | SCOPUS | |
| dc.subject | Medicine | |
| dc.title | Performance of large language models on Thailand's national medical licensing examination: a cross-sectional study | |
| dc.type | Article | |
| mu.datasource.scopus | https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105005377591&origin=inward | |
| oaire.citation.title | Journal of educational evaluation for health professions | |
| oaire.citation.volume | 22 | |
| oairecerif.author.affiliation | Ramathibodi Hospital |
