Evaluating DeepSeek for Automated Screening of Chinese Educational Literature: A Reliability-Centered Study of a Domestic LLM in a High-Ambiguity Domain

Yang W.Z.; Yang Y.H.; Precharattana M.

dc.contributor.author	Yang W.Z.
dc.contributor.author	Yang Y.H.
dc.contributor.author	Precharattana M.
dc.contributor.correspondence	Yang W.Z.
dc.contributor.other	Mahidol University
dc.date.accessioned	2026-04-16T18:33:15Z
dc.date.available	2026-04-16T18:33:15Z
dc.date.issued	2025-01-01
dc.description.abstract	The automation of literature screening using large language models (LLMs) has advanced within English biomedical contexts, yet the performance of domestically developed LLMs in non-English, non-medical domains remains underexplored. This study evaluates DeepSeek, a leading Chinese-developed LLM, for screening Chinese educational literature, a domain characterized by theoretical nuance and cultural-linguistic specificity. Using a corpus of 177 Chinese abstracts on K-12 programming education, we assessed DeepSeek under six configurations formed by crossing three prompting strategies (zero-shot, few-shot, full-shot) with two operational modes (R1: high-throughput; V3: semantic-depth). Screening reliability was measured via repeated-measures consistency using weighted Kappa, observed agreement, serious error rate (SER), and Uncertain-class stability. Results showed substantial performance variability (weighted Kappa: 0.36-0.70). Few-shot prompting in R1 mode achieved the highest consistency (κ_w = 0.70) but a high SER (0.14), dominated by exclusion errors. In contrast, few-shot prompting in V3 mode minimized SER (0.06) via conservative reclassification to "uncertain." Few-shot prompting reduced SER by 47% versus full-shot and improved consistency by 25% over zero-shot. While no significant differences emerged between R1 and V3 modes, descriptive trends favored R1 for throughput and V3 for error reduction. This study confirms DeepSeek's viability for Chinese educational literature screening, achieving reliability comparable to Western medical LLMs under optimized setups, though error profiles diverge markedly, underscoring the need for domain-specific evaluation frameworks. These findings offer practical guidance for configuring domestic LLMs and introduce a reliability-centered methodology suitable for low-resource, high-ambiguity domains.
dc.identifier.citation	International Conference on Engineering and Emerging Technologies ICEET (2025)
dc.identifier.doi	10.1109/ICEET67911.2025.11424128
dc.identifier.eissn	28313682
dc.identifier.issn	24092983
dc.identifier.scopus	2-s2.0-105035367801
dc.identifier.uri	https://repository.li.mahidol.ac.th/handle/123456789/116227
dc.rights.holder	SCOPUS
dc.subject	Computer Science
dc.subject	Physics and Astronomy
dc.subject	Engineering
dc.title	Evaluating DeepSeek for Automated Screening of Chinese Educational Literature: A Reliability-Centered Study of a Domestic LLM in a High-Ambiguity Domain
dc.type	Conference Paper
mu.datasource.scopus	https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105035367801&origin=inward
oaire.citation.title	International Conference on Engineering and Emerging Technologies ICEET
oairecerif.author.affiliation	Mahidol University
oairecerif.author.affiliation	Jiaying University
oairecerif.author.affiliation	Guangzhou City Polytechnic
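
The abstract names three of its reliability measures (weighted Kappa, observed agreement, and serious error rate) without defining them in this record. The following is a minimal sketch of how such measures could be computed for two repeated screening runs over the same abstracts. It is not the authors' implementation: the three-label ordering, the linear disagreement weights, and the reading of a "serious error" as an include/exclude flip between runs are all assumptions made for illustration, and the function names are hypothetical.

```python
# Ordinal label order is an assumption made for this sketch; the abstract
# mentions an "uncertain" class but does not list the full label set.
LABELS = ["exclude", "uncertain", "include"]


def observed_agreement(run_a, run_b):
    """Share of abstracts that received the same label in both runs."""
    return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)


def weighted_kappa(run_a, run_b, labels=LABELS):
    """Cohen's kappa with linear disagreement weights (assumed weighting)."""
    n = len(run_a)
    k = len(labels)
    index = {label: i for i, label in enumerate(labels)}

    # Joint proportion table: rows are run A labels, columns are run B labels.
    joint = [[0.0] * k for _ in range(k)]
    for a, b in zip(run_a, run_b):
        joint[index[a]][index[b]] += 1.0 / n

    row = [sum(joint[i]) for i in range(k)]
    col = [sum(joint[i][j] for i in range(k)) for j in range(k)]

    d_obs = 0.0  # weighted disagreement actually observed
    d_exp = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            d_obs += weight * joint[i][j]
            d_exp += weight * row[i] * col[j]

    if d_exp == 0.0:
        # Both runs used one identical label throughout; treating this as
        # perfect agreement is a convention chosen for this sketch.
        return 1.0
    return 1.0 - d_obs / d_exp


def serious_error_rate(run_a, run_b):
    """Share of abstracts that flip between include and exclude (assumed definition)."""
    flips = sum({a, b} == {"include", "exclude"} for a, b in zip(run_a, run_b))
    return flips / len(run_a)


if __name__ == "__main__":
    # Made-up labels for six abstracts, purely to show the call pattern.
    first = ["include", "include", "exclude", "uncertain", "exclude", "include"]
    second = ["include", "exclude", "exclude", "uncertain", "uncertain", "include"]
    print(observed_agreement(first, second))
    print(weighted_kappa(first, second))
    print(serious_error_rate(first, second))
```

Under these assumptions, a shift from "exclude" to "uncertain" costs half as much as a flip from "exclude" to "include" in the Kappa, and does not count toward the serious error rate at all. That is one way a configuration could trade consistency for a lower SER, as the abstract reports for few-shot prompting in V3 mode.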

Collections

Scopus 2025
