Automating candidate gene prioritization with large language models: from naive scoring to literature-grounded validation

dc.contributor.author: Khan T.
dc.contributor.author: Toufiq M.
dc.contributor.author: Yurieva M.
dc.contributor.author: Indrawattana N.
dc.contributor.author: Jittmittraphap A.
dc.contributor.author: Kosoltanapiwat N.
dc.contributor.author: Pumirat P.
dc.contributor.author: Sukphopetch P.
dc.contributor.author: Vanaporn M.
dc.contributor.author: Palucka K.
dc.contributor.author: Syed Ahamed Kabeer B.
dc.contributor.author: Rinchai D.
dc.contributor.author: Chaussabel D.
dc.contributor.correspondence: Khan T.
dc.contributor.other: Mahidol University
dc.date.accessioned: 2025-11-01T18:06:24Z
dc.date.available: 2025-11-01T18:06:24Z
dc.date.issued: 2025-10-02
dc.description.abstract: MOTIVATION: Identifying promising therapeutic targets from thousands of genes in transcriptomic studies remains a major bottleneck in biomedical research. While large language models (LLMs) show potential for gene prioritization, they suffer from hallucination and lack systematic validation against expert knowledge. RESULTS: We developed a two-stage computational framework that combines LLM-based screening with literature validation for systematic gene prioritization. Starting with 10 824 genes from the BloodGen3 repertoire, we applied multi-criteria evaluation for sepsis relevance, followed by retrieval-augmented generation using 6346 curated sepsis publications. A novel faithfulness evaluation system verified that LLM predictions aligned with retrieved literature evidence. The framework identified 609 sepsis-relevant genes with >94% filtering efficiency, demonstrating strong enrichment for inflammatory pathways including TNF-α signaling, complement activation, and interferon responses. Literature validation yielded 30 ultra-high confidence therapeutic candidates, including both established sepsis genes (IL10, TREM1, S100A9, NLRP3) and novel targets warranting investigation. Benchmark validation against expert-curated databases achieved 71.2% recall, with systematic correlation between computational confidence and evidence quality. The final candidate set balanced discovery (11 novel genes) with validation (19 known genes), maintaining biological coherence throughout the filtering process. This framework demonstrates that rigorous methodology can transform unreliable LLM outputs into systematically validated biological insights. By combining computational efficiency with literature grounding, the approach provides a practical tool for prioritizing experimental validation efforts. The modular design enables adaptation to other diseases through knowledge base substitution, offering a systematic approach to literature-guided biomarker discovery. AVAILABILITY AND IMPLEMENTATION: Source code and implementation details are available at https://github.com/taushifkhan/llm-geneprioritization-framework, the vector database at https://doi.org/10.5281/zenodo.15802241, and an interactive demonstration at https://llm-geneprioritization.streamlit.app/.
dc.identifier.citation: Bioinformatics Oxford England Vol.41 No.10 (2025)
dc.identifier.doi: 10.1093/bioinformatics/btaf541
dc.identifier.eissn: 13674811
dc.identifier.pmid: 41071041
dc.identifier.scopus: 2-s2.0-105019820264
dc.identifier.uri: https://repository.li.mahidol.ac.th/handle/123456789/112879
dc.rights.holder: SCOPUS
dc.subject: Mathematics
dc.subject: Biochemistry, Genetics and Molecular Biology
dc.subject: Computer Science
dc.title: Automating candidate gene prioritization with large language models: from naive scoring to literature-grounded validation
dc.type: Article
mu.datasource.scopus: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105019820264&origin=inward
oaire.citation.issue: 10
oaire.citation.title: Bioinformatics Oxford England
oaire.citation.volume: 41
oairecerif.author.affiliation: St. Jude Children's Research Hospital
oairecerif.author.affiliation: Siriraj Hospital
oairecerif.author.affiliation: The Jackson Laboratory
oairecerif.author.affiliation: Faculty of Tropical Medicine, Mahidol University
oairecerif.author.affiliation: Saveetha Medical College and Hospital
oairecerif.author.affiliation: Sidra Medicine
