Taxonomy-based prompt engineering to generate synthetic drug-related patient portal messages
Issued Date
2024-12-01
Resource Type
ISSN
15320464
Scopus ID
2-s2.0-85211075327
Journal Title
Journal of Biomedical Informatics
Volume
160
Rights Holder(s)
SCOPUS
Bibliographic Citation
Journal of Biomedical Informatics Vol.160 (2024)
Suggested Citation
Wang N., Treewaree S., Zirikly A., Lu Y.L., Nguyen M.H., Agarwal B., Shah J., Stevenson J.M., Taylor C.O. Taxonomy-based prompt engineering to generate synthetic drug-related patient portal messages. Journal of Biomedical Informatics Vol.160 (2024). doi:10.1016/j.jbi.2024.104752 Retrieved from: https://repository.li.mahidol.ac.th/handle/123456789/102348
Abstract
Objective: The objectives of this study were to: (1) create a corpus of synthetic drug-related patient portal messages to address the current lack of publicly available datasets for model development, (2) assess differences in language use and linguistic characteristics among the synthetic patient portal messages, and (3) assess the accuracy of patient-reported drug side effects for different racial groups. Methods: We leveraged a taxonomy for patient- and clinician-generated content to guide prompt engineering for synthetic drug-related patient portal messages. We generated two groups of messages: the first group (200 messages) used a subset of the taxonomy relevant to a broad range of drug-related messages, and the second group (250 messages) used a subset of the taxonomy relevant to a narrow range of messages focused on side effects. Each prompt also included one of five racial groups. Next, we assessed linguistic characteristics among message parts (subject, beginning, body, ending) across different prompt specifications (urgency, patient portal taxa, race). We also assessed the performance and frequency of patient-reported side effects across different racial groups and compared them to data in a real-world data source (SIDER). Results: The study generated 450 synthetic patient portal messages, and we assessed linguistic patterns, the accuracy of drug-side effect pairs, and the frequency of pairs compared to real-world data. Linguistic analysis revealed variations in language usage and politeness, and analysis of positive predictive values identified differences in symptoms reported based on the urgency levels and racial groups in the prompt. We also found that low-incidence SIDER drug-side effect pairs were observed less frequently in our dataset. Conclusion: This study demonstrates the potential of synthetic patient portal messages as a valuable resource for healthcare research. After creating a corpus of synthetic drug-related patient portal messages, we identified significant language differences and provided evidence that drug-side effect pairs observed in messages are comparable to what is expected in real-world settings.
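The two steps named in the abstract, assembling prompts from taxonomy-derived specifications and scoring reported drug-side effect pairs by positive predictive value, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the abstract does not give the prompt wording, the taxonomy values, the racial group labels, or the exact matching rule against SIDER, so `URGENCY`, `TAXA`, `GROUPS`, `build_prompt`, `drug_x`, and the toy reference set are all hypothetical placeholders, not the authors' implementation.

```python
from itertools import product

# Hypothetical prompt specifications. The study's real taxonomy values and
# group labels are not listed in the abstract; these are placeholders.
URGENCY = ["low", "high"]
TAXA = ["side effect report", "refill request"]
GROUPS = ["group A", "group B"]


def build_prompt(urgency, taxon, group, drug):
    """Assemble one generation prompt from a combination of specifications."""
    return (
        "Write a patient portal message with a subject, beginning, body, "
        "and ending. "
        f"Message type: {taxon}. Urgency: {urgency}. "
        f"Patient race: {group}. Drug: {drug}."
    )


def positive_predictive_value(reported_pairs, reference_pairs):
    """Fraction of reported (drug, side effect) pairs found in the reference.

    Returns None when nothing was reported, since PPV is undefined there.
    """
    reported = list(reported_pairs)
    if not reported:
        return None
    true_positives = sum(1 for pair in reported if pair in reference_pairs)
    return true_positives / len(reported)


# One prompt per combination of the hypothetical specifications.
prompts = [
    build_prompt(urgency, taxon, group, "drug_x")
    for urgency, taxon, group in product(URGENCY, TAXA, GROUPS)
]

# Toy stand-in for a reference source such as SIDER (assumed exact matching).
reference = {("drug_x", "nausea"), ("drug_x", "headache")}
# Pairs extracted from generated messages (made-up values).
reported = [
    ("drug_x", "nausea"),
    ("drug_x", "rash"),
    ("drug_x", "headache"),
    ("drug_x", "nausea"),
]
ppv = positive_predictive_value(reported, reference)
```

In this toy setup three of the four reported pairs appear in the reference set, so `ppv` is 0.75. Computing the same value separately per urgency level or per group in the prompt would give the kind of comparison the Results section describes, under the assumption that pairs are matched exactly.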
