Integrating Agentic Artificial Intelligence to Automate International Classification of Diseases, Tenth Revision, Medical Coding
Issued Date
2026-03-01
eISSN
2227-9709
Scopus ID
2-s2.0-105033869908
Journal Title
Informatics
Volume
13
Issue
3
Rights Holder(s)
SCOPUS
Bibliographic Citation
Informatics Vol.13 No.3 (2026)
Suggested Citation
Akkhawatthanakun K., Narupiyakul L., Wongpatikaseree K., Hnoohom N., Termritthikun C., Muneesawang P. Integrating Agentic Artificial Intelligence to Automate International Classification of Diseases, Tenth Revision, Medical Coding. Informatics Vol.13 No.3 (2026). doi:10.3390/informatics13030039. Retrieved from: https://repository.li.mahidol.ac.th/handle/123456789/116029
Abstract
Automating ICD-10 coding from discharge summaries remains demanding because coders must analyze clinical narratives while justifying their decisions. This study compares three automation patterns: PLM-ICD as a standalone deep learning system emitting 15 codes per case, LLM-only generation with full autonomy, and a hybrid approach in which PLM-ICD drafts candidates that an agentic LLM audits, accepting or rejecting each code. All strategies were evaluated on 19,801 MIMIC-IV discharge summaries using four LLMs spanning compact (Qwen2.5-3B-Instruct, Llama-3.2-3B-Instruct, Phi-4-mini-instruct) to large-scale (Sonnet-4.5) models. Precision guided the evaluation because human coders still supply any missing diagnoses. PLM-ICD alone reached 55.8% precision while always surfacing 15 suggestions. LLM-only generation lagged severely (1.5–34.6% precision) and produced inconsistent output sizes. The agentic audit delivered the best trade-off: compact LLMs reviewed the 15 candidates, discarded those with weak evidence, and returned 2–8 high-confidence codes. Llama-3.2-3B-Instruct, for example, improved from 1.5% precision as a generator to 55.1% as a verifier while trimming false positives by 73%. These results show that positioning LLMs as quality controllers, rather than primary generators, yields reliable support for clinical coding teams; formal recall/F1 reporting remains future work for fully autonomous implementations.
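The draft-then-audit pattern the abstract describes can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `audit_candidates`, `stub_judge`, and the keyword table are hypothetical names, and the stub stands in for a real LLM verifier call (in the paper, a compact model such as Llama-3.2-3B-Instruct reviewing PLM-ICD's 15 candidates).

```python
# Hypothetical sketch of the hybrid pattern: a drafting model proposes a
# slate of candidate ICD-10 codes, and a verifier accepts or rejects each.
from typing import Callable

def audit_candidates(
    summary: str,
    candidates: list[str],
    llm_judge: Callable[[str, str], bool],
) -> list[str]:
    """Return only the candidates the verifier accepts as evidence-supported."""
    return [code for code in candidates if llm_judge(summary, code)]

# Stub verifier: accepts a code only if an associated keyword (a toy
# stand-in for clinical evidence) appears in the discharge summary.
# In the study, this role is played by an agentic LLM prompt, not a lookup.
KEYWORDS = {"E11.9": "diabetes", "I10": "hypertension", "J18.9": "pneumonia"}

def stub_judge(summary: str, code: str) -> bool:
    return KEYWORDS.get(code, "") in summary.lower()

summary = "Admitted with community-acquired pneumonia; history of hypertension."
drafted = ["E11.9", "I10", "J18.9"]  # e.g. top candidates from the drafting model
accepted = audit_candidates(summary, drafted, stub_judge)
print(accepted)  # → ['I10', 'J18.9']
```

The design point is that the verifier only prunes the drafted slate, never invents codes, which is why precision rises while the output shrinks to a handful of high-confidence suggestions.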
