Machine learning-guided discovery of thermophilic carbonic anhydrases from environmental metagenomes
Issued Date
2025-12-01
eISSN
20452322
Scopus ID
2-s2.0-105022522498
Pubmed ID
41266391
Journal Title
Scientific Reports
Volume
15
Issue
1
Rights Holder(s)
SCOPUS
Bibliographic Citation
Scientific Reports Vol.15 No.1 (2025)
Suggested Citation
Pairoh S., Mhuantong W., Boonyapakron K., Yuvaniyama J., Kanokratana P., Bunterngsook B., Lekakarn H., Arunrattanamook N., Laothanachareon T., Champreda V. Machine learning-guided discovery of thermophilic carbonic anhydrases from environmental metagenomes. Scientific Reports Vol.15 No.1 (2025). doi:10.1038/s41598-025-24713-1 Retrieved from: https://repository.li.mahidol.ac.th/handle/123456789/113291
Abstract
Thermophilic carbonic anhydrases (CAs) are promising biocatalysts for carbon capture, utilization, and storage (CCUS) because they remain stable and efficient at elevated temperatures. This study presents a machine learning (ML)-guided approach to discovering thermostable γ-class CAs (γ-CAs) in metagenomic datasets derived from Fang Hot Spring, Northern Thailand. Two sets of protein descriptors, dipeptide composition (DPC) and physicochemical/biochemical properties (AAindex), were used to train classification models. Fourteen ML algorithms were systematically evaluated for each feature set. AdaBoost achieved the best performance for the DPC-based model, while LightGBM performed best with AAindex-based features. External validation with known CA sequences confirmed the ability of the models to discriminate thermophilic from non-thermophilic proteins. Applying the optimized models, we screened 1,534 predicted CAs and identified three high-confidence candidates (TtCA, CrCA, and ToCA). These were heterologously expressed in E. coli, purified, and biochemically validated. All candidates exhibited carbonic anhydrase activity, trimeric oligomeric structures, and high melting temperatures (Tm ranging from 97.0 °C to 109.1 °C). Although their hydration activity was modest compared to that of α-class CAs, their thermal robustness highlights their potential for industrial CO₂ capture. This study demonstrates that integrating ML with metagenomics enables efficient discovery and validation of robust enzymes from extreme environments, providing a scalable strategy for CCUS applications.
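The dipeptide composition (DPC) descriptor named in the abstract is commonly defined as the relative frequency of each of the 400 ordered pairs of standard amino acids in a sequence. The following is a minimal sketch of that common definition, not the authors' implementation. The function name `dipeptide_composition`, the choice to skip pairs containing non-standard residues, and normalizing by the number of counted pairs are all assumptions made for illustration.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All 400 ordered dipeptides, in a fixed order so feature vectors are comparable.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-element list of dipeptide frequencies for a protein sequence.

    Assumption: overlapping pairs that contain a non-standard residue
    (for example X or B) are skipped, and frequencies are normalized by the
    number of pairs actually counted.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    # Slide a window of width 2 over the sequence.
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short or contains no valid pairs.
        return [0.0] * len(DIPEPTIDES)
    return [counts[dp] / total for dp in DIPEPTIDES]


# Hypothetical toy sequence, for illustration only (not from the study).
features = dipeptide_composition("MKTAYIAK")
```

A vector like `features` could then be passed to any classifier that accepts fixed-length numeric input. The AAindex-based descriptors mentioned in the abstract would need a separate table of per-residue property values, which is not sketched here.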
