Evaluating DeepSeek for Automated Screening of Chinese Educational Literature: A Reliability-Centered Study of a Domestic LLM in a High-Ambiguity Domain
| dc.contributor.author | Yang W.Z. | |
| dc.contributor.author | Yang Y.H. | |
| dc.contributor.author | Precharattana M. | |
| dc.contributor.correspondence | Yang W.Z. | |
| dc.contributor.other | Mahidol University | |
| dc.date.accessioned | 2026-04-16T18:33:15Z | |
| dc.date.available | 2026-04-16T18:33:15Z | |
| dc.date.issued | 2025-01-01 | |
| dc.description.abstract | The automation of literature screening using large language models (LLMs) has advanced within English biomedical contexts, yet the performance of domestically developed LLMs in non-English, non-medical domains remains underexplored. This study evaluates DeepSeek, a leading Chinese-developed LLM, for screening Chinese educational literature, a domain characterized by theoretical nuance and cultural-linguistic specificity. Using a corpus of 177 Chinese abstracts on K-12 programming education, we assessed DeepSeek under six configurations formed by crossing three prompting strategies (zero-shot, few-shot, full-shot) with two operational modes (R1: high-throughput; V3: semantic-depth). Screening reliability was measured via repeated-measures consistency using weighted Kappa, observed agreement, serious error rate (SER), and Uncertain-class stability. Results showed substantial performance variability (weighted Kappa: 0.36-0.70). Few-shot prompting with R1 mode achieved the highest consistency (κ_w = 0.70) but a high SER (0.14), dominated by exclusion errors. In contrast, few-shot prompting with V3 mode minimized SER (0.06) via conservative reclassification to "uncertain." Few-shot prompting reduced SER by 47% versus full-shot and improved consistency by 25% over zero-shot. While no significant differences emerged between R1 and V3 modes, descriptive trends favored R1 for throughput and V3 for error reduction. This study confirms DeepSeek's viability for Chinese educational literature screening, achieving reliability comparable to Western medical LLMs under optimized setups, though error profiles diverge markedly, underscoring the need for domain-specific evaluation frameworks. These findings offer practical guidance for configuring domestic LLMs and introduce a reliability-centered methodology suitable for low-resource, high-ambiguity domains. | |
| dc.identifier.citation | International Conference on Engineering and Emerging Technologies ICEET (2025) | |
| dc.identifier.doi | 10.1109/ICEET67911.2025.11424128 | |
| dc.identifier.eissn | 28313682 | |
| dc.identifier.issn | 24092983 | |
| dc.identifier.scopus | 2-s2.0-105035367801 | |
| dc.identifier.uri | https://repository.li.mahidol.ac.th/handle/123456789/116227 | |
| dc.rights.holder | SCOPUS | |
| dc.subject | Computer Science | |
| dc.subject | Physics and Astronomy | |
| dc.subject | Engineering | |
| dc.title | Evaluating DeepSeek for Automated Screening of Chinese Educational Literature: A Reliability-Centered Study of a Domestic LLM in a High-Ambiguity Domain | |
| dc.type | Conference Paper | |
| mu.datasource.scopus | https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105035367801&origin=inward | |
| oaire.citation.title | International Conference on Engineering and Emerging Technologies ICEET | |
| oairecerif.author.affiliation | Mahidol University | |
| oairecerif.author.affiliation | Jiaying University | |
| oairecerif.author.affiliation | Guangzhou City Polytechnic |
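
The abstract names three reliability measures for repeated screening runs: weighted Kappa, observed agreement, and serious error rate (SER). The record does not give their formulas, so the following is only a minimal sketch under stated assumptions: the three labels are treated as ordinal (exclude < uncertain < include), the Kappa weights are linear, and a "serious error" is taken to be a direct include/exclude flip between two runs. The function names and the sample labels are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Assumed ordinal ordering of screening decisions (not stated in the record).
LABELS = ["exclude", "uncertain", "include"]


def observed_agreement(run_a, run_b):
    """Fraction of abstracts that received the same label in both runs."""
    return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)


def weighted_kappa(run_a, run_b, labels=LABELS):
    """Cohen's Kappa with linear disagreement weights |i - j| / (k - 1)."""
    n = len(run_a)
    k = len(labels)
    idx = {label: i for i, label in enumerate(labels)}

    pairs = Counter((idx[a], idx[b]) for a, b in zip(run_a, run_b))
    row = Counter(idx[a] for a in run_a)  # marginal counts, first run
    col = Counter(idx[b] for b in run_b)  # marginal counts, second run

    def weight(i, j):
        return abs(i - j) / (k - 1)

    # Weighted disagreement actually observed between the two runs.
    observed = sum(weight(i, j) * c for (i, j), c in pairs.items()) / n
    # Weighted disagreement expected by chance from the marginals.
    expected = sum(
        weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k)
    ) / (n * n)

    if expected == 0:  # both runs used one identical label throughout
        return 1.0
    return 1.0 - observed / expected


def serious_error_rate(run_a, run_b):
    """Assumed definition: share of items flipped between include and exclude."""
    flips = sum({a, b} == {"include", "exclude"} for a, b in zip(run_a, run_b))
    return flips / len(run_a)


if __name__ == "__main__":
    # Hypothetical labels for six abstracts screened twice.
    first = ["include", "include", "exclude", "uncertain", "exclude", "include"]
    second = ["include", "uncertain", "exclude", "uncertain", "include", "include"]
    print(observed_agreement(first, second))
    print(weighted_kappa(first, second))
    print(serious_error_rate(first, second))
```

With linear weights, a shift from "include" to "uncertain" counts as half a disagreement, while an include/exclude flip counts as a full one. That weighting is one way to read the abstract's observation that conservative reclassification to "uncertain" lowers SER, though the paper's actual weighting scheme may differ.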
