Synthesis of Tax Return Datasets for Development of Tax Evasion Detection
Issued Date
2023-01-01
eISSN
21693536
Scopus ID
2-s2.0-85160271058
Journal Title
IEEE Access
Rights Holder(s)
SCOPUS
Bibliographic Citation
IEEE Access (2023)
Suggested Citation
Visitpanya N., Samanchuen T. Synthesis of Tax Return Datasets for Development of Tax Evasion Detection. IEEE Access (2023). doi:10.1109/ACCESS.2023.3276761 Retrieved from: https://repository.li.mahidol.ac.th/handle/20.500.14594/82939
Title
Synthesis of Tax Return Datasets for Development of Tax Evasion Detection
Author(s)
Visitpanya N., Samanchuen T.
Abstract
Datasets are an essential part of data science processes. However, obtaining a dataset, especially a tax return dataset, is challenging as privacy concerns become more prominent in daily life. We therefore adopt data synthesis: we start from publicly available data and augment it using a Generative Adversarial Network (GAN) and the Synthetic Minority Oversampling TEchnique (SMOTE). The synthetic data are evaluated using a correlation matrix, Principal Component Analysis (PCA), and a quality score. In addition, fundamental machine learning models identified from a literature review are used to detect tax evasion. The data are gathered from the financial statements of companies listed on the Stock Exchange of Thailand (SET). Our results indicate that synthetic datasets with an average quality score of 0.86 can train models that yield approximately 0.95 accuracy and an F1-score of 0.93. Additionally, increasing the number of instances mitigates the effects of class imbalance and high variance. The expected benefits include the use of open data for analysis and the application of synthetic datasets. Future research could consider the statistical behavior of different business sectors, multiclass labeling for more advanced recommendations, and the implementation of unsupervised models.
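The oversampling idea behind SMOTE, one of the two augmentation techniques named in the abstract, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `smote_like_oversample`, the feature columns, and all numeric values are hypothetical and chosen only for illustration. The sketch assumes numeric features and creates each synthetic row by interpolating between a minority-class row and one of its nearest minority-class neighbours.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a sampled
    minority row and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [row for j, row in enumerate(minority) if j != i]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class rows (made-up values, not from the paper):
# [revenue, expense, tax paid]
flagged = [
    [120.0, 95.0, 2.1],
    [135.0, 110.0, 2.4],
    [150.0, 131.0, 2.0],
    [110.0, 90.0, 1.8],
]
new_rows = smote_like_oversample(flagged, n_new=6)
for row in new_rows:
    print([round(v, 2) for v in row])
```

Because every synthetic row lies on a line segment between two existing minority rows, each generated value stays within the range already observed for its column. That makes the new rows plausible, but they add no information beyond the original minority samples.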