Synthesis of Tax Return Datasets for Development of Tax Evasion Detection

Visitpanya N.; Samanchuen T.

Issued Date

2023-01-01

Resource Type

Article

eISSN

2169-3536

DOI

10.1109/ACCESS.2023.3276761

Scopus ID

2-s2.0-85160271058

Journal Title

IEEE Access

Rights Holder(s)

SCOPUS

Bibliographic Citation

IEEE Access (2023)

Suggested Citation

Visitpanya N., Samanchuen T. Synthesis of Tax Return Datasets for Development of Tax Evasion Detection. IEEE Access (2023). doi:10.1109/ACCESS.2023.3276761 Retrieved from: https://repository.li.mahidol.ac.th/handle/20.500.14594/82939

Title

Synthesis of Tax Return Datasets for Development of Tax Evasion Detection

Author(s)

Visitpanya N.
Samanchuen T.

Author's Affiliation

Mahidol University

Other Contributor(s)

Mahidol University

Abstract

Datasets are an essential part of data science processes. However, obtaining a dataset, especially a tax return dataset, is challenging as privacy concerns become more prominent in daily life. We therefore adopt data synthesis: publicly available data are augmented using a Generative Adversarial Network (GAN) and the Synthetic Minority Oversampling TEchnique (SMOTE). The synthetic data are evaluated using a correlation matrix, Principal Component Analysis (PCA), and a quality score. In addition, based on a literature review, fundamental machine learning models are used to detect tax evasion. The data are gathered from the financial statements of companies registered with the Stock Exchange of Thailand (SET). Our results indicate that synthetic datasets with an average quality score of 0.86 can train models that yield approximately 0.95 accuracy and a 0.93 F1-score. Additionally, increasing the number of instances mitigates the effects of class imbalance and high variance. The expected benefits include the use of open data for analysis and the application of synthetic datasets. Future research could consider the statistical behavior of different business sectors, multiclass labeling for more advanced recommendations, and the implementation of unsupervised models.
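As a rough illustration of the SMOTE idea named in the abstract, the sketch below creates synthetic minority-class rows by interpolating between a sampled row and one of its nearest minority neighbours. It is not the authors' implementation: the function name `smote_like_oversample`, its parameters, and the two-feature example rows are all hypothetical, and only the Python standard library is assumed.

```python
import math
import random


def smote_like_oversample(minority_rows, n_new, k=3, seed=0):
    """Return n_new synthetic rows built by SMOTE-style interpolation.

    Hypothetical helper for illustration only; not taken from the article.
    """
    rng = random.Random(seed)
    rows = [[float(v) for v in row] for row in minority_rows]
    if len(rows) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    k = min(k, len(rows) - 1)

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(rows))
        base = rows[i]
        # Indices of the k nearest other minority rows (Euclidean distance).
        nearest = sorted(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: math.dist(base, rows[j]),
        )[:k]
        neighbour = rows[rng.choice(nearest)]
        gap = rng.random()  # interpolation weight in [0, 1)
        # The new row lies on the segment between base and neighbour.
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Made-up minority-class rows with two numeric features (assumed data).
example_rows = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 4.0]]
new_rows = smote_like_oversample(example_rows, n_new=10, k=2, seed=1)
print(len(new_rows))
```

Because each synthetic row is a convex combination of two real minority rows, every generated value stays within the range already observed for that feature.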

Keyword(s)

Computer Science

URI

https://repository.li.mahidol.ac.th/handle/20.500.14594/82939

Collections

Scopus 2023
