Machine learning-guided discovery of thermophilic carbonic anhydrases from environmental metagenomes

dc.contributor.author: Pairoh S.
dc.contributor.author: Mhuantong W.
dc.contributor.author: Boonyapakron K.
dc.contributor.author: Yuvaniyama J.
dc.contributor.author: Kanokratana P.
dc.contributor.author: Bunterngsook B.
dc.contributor.author: Lekakarn H.
dc.contributor.author: Arunrattanamook N.
dc.contributor.author: Laothanachareon T.
dc.contributor.author: Champreda V.
dc.contributor.correspondence: Pairoh S.
dc.contributor.other: Mahidol University
dc.date.accessioned: 2025-11-29T18:22:20Z
dc.date.available: 2025-11-29T18:22:20Z
dc.date.issued: 2025-12-01
dc.description.abstract: Thermophilic carbonic anhydrases (CAs) are promising biocatalysts for carbon capture, utilization, and storage (CCUS) because of their stability and efficiency at elevated temperatures. This study presents a machine learning (ML)-guided approach to discovering thermostable γ-class CAs (γ-CAs) in metagenomic datasets derived from Fang Hot Spring, Northern Thailand. Two sets of protein descriptors, dipeptide composition (DPC) and physicochemical/biochemical properties (AAindex), were used to train classification models. Fourteen ML algorithms were systematically evaluated for each feature set. AdaBoost achieved the best performance for the DPC-based model, while LightGBM performed best with AAindex-based features. External validation with known CA sequences confirmed the ability of the models to discriminate thermophilic from non-thermophilic proteins. Applying the optimized models, we screened 1,534 predicted CAs and identified three high-confidence candidates (TtCA, CrCA, and ToCA). These were heterologously expressed in E. coli, purified, and biochemically validated. All candidates exhibited carbonic anhydrase activity, trimeric oligomeric structures, and high melting temperatures (Tm ranging from 97.0 °C to 109.1 °C). Although their hydration activity was modest compared to α-class CAs, their thermal robustness highlights their potential for industrial CO₂ capture. This study demonstrates an approach in which ML integrated with metagenomics enables efficient discovery and validation of robust enzymes from extreme environments, providing a scalable strategy for CCUS applications.
dc.identifier.citation: Scientific Reports Vol.15 No.1 (2025)
dc.identifier.doi: 10.1038/s41598-025-24713-1
dc.identifier.eissn: 20452322
dc.identifier.pmid: 41266391
dc.identifier.scopus: 2-s2.0-105022522498
dc.identifier.uri: https://repository.li.mahidol.ac.th/handle/123456789/113291
dc.rights.holder: SCOPUS
dc.subject: Multidisciplinary
dc.title: Machine learning-guided discovery of thermophilic carbonic anhydrases from environmental metagenomes
dc.type: Article
mu.datasource.scopus: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105022522498&origin=inward
oaire.citation.issue: 1
oaire.citation.title: Scientific Reports
oaire.citation.volume: 15
oairecerif.author.affiliation: Thammasat University
oairecerif.author.affiliation: Faculty of Science, Mahidol University
oairecerif.author.affiliation: Thailand National Center for Genetic Engineering and Biotechnology
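
Note on the DPC descriptor: the abstract names dipeptide composition (DPC) as one of the two feature sets but does not say how it was computed. The sketch below follows the commonly used definition, a 400-element vector holding the relative frequency of each ordered pair of adjacent standard amino acids. The function name dipeptide_composition, the choice to skip pairs containing non-standard residues, and the normalization by the number of valid pairs are assumptions made for illustration, not details taken from the study.

```python
from itertools import product

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All 400 ordered dipeptides, in a fixed order so vectors are comparable.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-element list of dipeptide relative frequencies.

    Assumption: pairs containing a non-standard residue (e.g. 'X')
    are skipped, and frequencies are normalized by the number of
    valid pairs rather than by len(sequence) - 1.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    valid_pairs = 0

    # Slide a window of width 2 along the sequence.
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            valid_pairs += 1

    if valid_pairs == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[dp] / valid_pairs for dp in DIPEPTIDES]


if __name__ == "__main__":
    # Toy sequence, not one of the enzymes described in the record.
    features = dipeptide_composition("ACAC")
    print(len(features), features[DIPEPTIDES.index("AC")])
```

A vector of this kind, computed per protein, is the sort of input a classifier such as AdaBoost would be trained on; the training setup itself is not described in this record.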
