Evaluating DeepSeek for Automated Screening of Chinese Educational Literature: A Reliability-Centered Study of a Domestic LLM in a High-Ambiguity Domain
Issued Date
2025-01-01
Resource Type
ISSN
2409-2983
eISSN
2831-3682
Scopus ID
2-s2.0-105035367801
Journal Title
International Conference on Engineering and Emerging Technologies (ICEET)
Rights Holder(s)
SCOPUS
Bibliographic Citation
International Conference on Engineering and Emerging Technologies (ICEET) (2025)
Suggested Citation
Yang W.Z., Yang Y.H., Precharattana M. Evaluating DeepSeek for Automated Screening of Chinese Educational Literature: A Reliability-Centered Study of a Domestic LLM in a High-Ambiguity Domain. International Conference on Engineering and Emerging Technologies (ICEET) (2025). doi:10.1109/ICEET67911.2025.11424128. Retrieved from: https://repository.li.mahidol.ac.th/handle/123456789/116227
Title
Evaluating DeepSeek for Automated Screening of Chinese Educational Literature: A Reliability-Centered Study of a Domestic LLM in a High-Ambiguity Domain
Author(s)
Author's Affiliation
Corresponding Author(s)
Other Contributor(s)
Abstract
The automation of literature screening using large language models (LLMs) has advanced within English biomedical contexts, yet the performance of domestically developed LLMs in non-English, non-medical domains remains underexplored. This study evaluates DeepSeek, a leading Chinese-developed LLM, for screening Chinese educational literature, a domain characterized by theoretical nuance and cultural-linguistic specificity. Using a corpus of 177 Chinese abstracts on K-12 programming education, we assessed DeepSeek under six configurations combining three prompting strategies (zero-shot, few-shot, full-shot) and two operational modes (R1: high-throughput; V3: semantic-depth). Screening reliability was measured via repeated-measures consistency using weighted Kappa, observed agreement, serious error rate (SER), and Uncertain-class stability. Results showed substantial performance variability (weighted Kappa: 0.36-0.70). Few-shot prompting with R1 mode achieved the highest consistency (κ_w = 0.70) but a high SER (0.14), dominated by exclusion errors. In contrast, few-shot prompting with V3 mode minimized SER (0.06) through conservative reclassification to "Uncertain." Few-shot prompting reduced SER by 47% versus full-shot and improved consistency by 25% over zero-shot. While no significant differences emerged between R1 and V3 modes, descriptive trends favored R1 for throughput and V3 for error reduction. This study confirms DeepSeek's viability for Chinese educational literature screening, achieving reliability comparable to that of Western medical LLMs under optimized setups, though error profiles diverge markedly, underscoring the need for domain-specific evaluation frameworks. These findings offer practical guidance for configuring domestic LLMs and introduce a reliability-centered methodology suitable for low-resource, high-ambiguity domains.
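The reliability metrics named in the abstract (weighted Kappa, observed agreement, SER) can be sketched in pure Python for a three-class screening task. This is an illustrative reconstruction, not the authors' code: the category names, the use of linear disagreement weights, and the definition of a "serious error" as a direct Include/Exclude flip between two screening passes are all assumptions made for the example.

```python
# Hypothetical sketch of repeated-measures reliability metrics for a
# three-class screening task (Include / Uncertain / Exclude).
# Assumptions (not from the paper): linear disagreement weights for kappa,
# and "serious error" defined as a direct Include<->Exclude flip.

CATS = ["Include", "Uncertain", "Exclude"]
IDX = {c: i for i, c in enumerate(CATS)}

def weighted_kappa(pass1, pass2, n_cats=3):
    """Linear-weighted Cohen's kappa between two label sequences."""
    n = len(pass1)
    # Observed confusion matrix of paired labels.
    obs = [[0.0] * n_cats for _ in range(n_cats)]
    for a, b in zip(pass1, pass2):
        obs[IDX[a]][IDX[b]] += 1
    # Marginal totals used for the chance-expected matrix.
    row = [sum(obs[i]) for i in range(n_cats)]
    col = [sum(obs[i][j] for i in range(n_cats)) for j in range(n_cats)]
    # Linear weights: w[i][j] = |i - j| / (n_cats - 1), 0 on the diagonal.
    num = den = 0.0
    for i in range(n_cats):
        for j in range(n_cats):
            w = abs(i - j) / (n_cats - 1)
            num += w * obs[i][j]
            den += w * row[i] * col[j] / n
    return 1.0 - num / den

def observed_agreement(pass1, pass2):
    """Fraction of items labeled identically in both passes."""
    return sum(a == b for a, b in zip(pass1, pass2)) / len(pass1)

def serious_error_rate(pass1, pass2):
    """Fraction of items flipping directly between Include and Exclude."""
    serious = {("Include", "Exclude"), ("Exclude", "Include")}
    return sum((a, b) in serious for a, b in zip(pass1, pass2)) / len(pass1)
```

On this convention, a shift from Include to Uncertain lowers agreement but does not count toward SER, which matches the abstract's observation that the V3 mode reduced SER by reclassifying borderline items as Uncertain rather than excluding them outright.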
