Evaluating DeepSeek for Automated Screening of Chinese Educational Literature: A Reliability-Centered Study of a Domestic LLM in a High-Ambiguity Domain
Issued Date
2025-01-01
Resource Type
ISSN
2409-2983
eISSN
2831-3682
Scopus ID
2-s2.0-105035367801
Journal Title
International Conference on Engineering and Emerging Technologies (ICEET)
Rights Holder(s)
SCOPUS
Bibliographic Citation
International Conference on Engineering and Emerging Technologies (ICEET) (2025)
Suggested Citation
Yang W.Z., Yang Y.H., Precharattana M. Evaluating DeepSeek for Automated Screening of Chinese Educational Literature: A Reliability-Centered Study of a Domestic LLM in a High-Ambiguity Domain. International Conference on Engineering and Emerging Technologies (ICEET) (2025). doi:10.1109/ICEET67911.2025.11424128. Retrieved from: https://repository.li.mahidol.ac.th/handle/123456789/116227
Title
Evaluating DeepSeek for Automated Screening of Chinese Educational Literature: A Reliability-Centered Study of a Domestic LLM in a High-Ambiguity Domain
Author(s)
Author's Affiliation
Corresponding Author(s)
Other Contributor(s)
Abstract
The automation of literature screening using large language models (LLMs) has advanced within English biomedical contexts, yet the performance of domestically developed LLMs in non-English, non-medical domains remains underexplored. This study evaluates DeepSeek, a leading Chinese-developed LLM, for screening Chinese educational literature, a domain characterized by theoretical nuance and cultural-linguistic specificity. Using a corpus of 177 Chinese abstracts on K-12 programming education, we assessed DeepSeek under six configurations combining three prompting strategies (zero-shot, few-shot, full-shot) and two operational modes (R1: high-throughput; V3: semantic-depth). Screening reliability was measured via repeated-measures consistency using weighted Kappa, observed agreement, serious error rate (SER), and Uncertain-class stability. Results showed substantial performance variability (weighted Kappa: 0.36-0.70). Few-shot prompting with R1 mode achieved the highest consistency (κ_w = 0.70) but a high SER (0.14), dominated by exclusion errors. In contrast, few-shot prompting with V3 mode minimized SER (0.06) through conservative reclassification to "Uncertain." Few-shot prompting reduced SER by 47% versus full-shot and improved consistency by 25% over zero-shot. While no significant differences emerged between R1 and V3 modes, descriptive trends favored R1 for throughput and V3 for error reduction. This study confirms DeepSeek's viability for Chinese educational literature screening, achieving reliability comparable to that of Western medical LLMs under optimized setups, though error profiles diverge markedly, underscoring the need for domain-specific evaluation frameworks. These findings offer practical guidance for configuring domestic LLMs and introduce a reliability-centered methodology suitable for low-resource, high-ambiguity domains.
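The reliability metrics named in the abstract (weighted Kappa, observed agreement, SER) can be sketched in pure Python for a three-class screening task. This is an illustrative reconstruction, not the authors' code: the category names, the use of linear disagreement weights, and the definition of a "serious error" as a direct Include/Exclude flip between two screening passes are all assumptions made for the example.

```python
# Hypothetical sketch of repeated-measures reliability metrics for a
# three-class screening task (Include / Uncertain / Exclude).
# Assumptions (not from the paper): linear disagreement weights for kappa,
# and "serious error" defined as a direct Include<->Exclude flip.

CATS = ["Include", "Uncertain", "Exclude"]
IDX = {c: i for i, c in enumerate(CATS)}

def weighted_kappa(pass1, pass2, n_cats=3):
    """Linear-weighted Cohen's kappa between two label sequences."""
    n = len(pass1)
    # Observed confusion matrix of paired labels.
    obs = [[0.0] * n_cats for _ in range(n_cats)]
    for a, b in zip(pass1, pass2):
        obs[IDX[a]][IDX[b]] += 1
    # Marginal totals used for the chance-expected matrix.
    row = [sum(obs[i]) for i in range(n_cats)]
    col = [sum(obs[i][j] for i in range(n_cats)) for j in range(n_cats)]
    # Linear weights: w[i][j] = |i - j| / (n_cats - 1), 0 on the diagonal.
    num = den = 0.0
    for i in range(n_cats):
        for j in range(n_cats):
            w = abs(i - j) / (n_cats - 1)
            num += w * obs[i][j]
            den += w * row[i] * col[j] / n
    return 1.0 - num / den

def observed_agreement(pass1, pass2):
    """Fraction of items labeled identically in both passes."""
    return sum(a == b for a, b in zip(pass1, pass2)) / len(pass1)

def serious_error_rate(pass1, pass2):
    """Fraction of items flipping directly between Include and Exclude."""
    serious = {("Include", "Exclude"), ("Exclude", "Include")}
    return sum((a, b) in serious for a, b in zip(pass1, pass2)) / len(pass1)
```

On this convention, a shift from Include to Uncertain lowers agreement but does not count toward SER, which matches the abstract's observation that the V3 mode reduced SER by reclassifying borderline items as Uncertain rather than excluding them outright.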
