Comparative Analysis of Data Imputation Methods on F1 Performance Across Multiple Classification Algorithms
Issued Date
2025-01-01
Resource Type
Scopus ID
2-s2.0-105032729101
Journal Title
Icsec 2025 29th International Computer Science and Engineering Conference 2025
Start Page
90
End Page
93
Rights Holder(s)
SCOPUS
Bibliographic Citation
Icsec 2025 29th International Computer Science and Engineering Conference 2025 (2025), 90-93
Suggested Citation
Tangworakitthaworn P., Fujita K., Wiphaalongkot N. Comparative Analysis of Data Imputation Methods on F1 Performance Across Multiple Classification Algorithms. Icsec 2025 29th International Computer Science and Engineering Conference 2025 (2025), 90-93. doi:10.1109/ICSEC67360.2025.11298078 Retrieved from: https://repository.li.mahidol.ac.th/handle/123456789/115809
Title
Comparative Analysis of Data Imputation Methods on F1 Performance Across Multiple Classification Algorithms
Author(s)
Author's Affiliation
Corresponding Author(s)
Other Contributor(s)
Abstract
A significant issue in developing machine learning models is the quality and completeness of the datasets. Suitable datasets should not contain missing values, because these can reduce predictive accuracy and introduce bias. This research project presents a comparative analysis of data imputation methods on F1 performance across multiple classification algorithms: Logistic Regression, Random Forest, and Linear Support Vector Machine (SVM). The imputation methods applied in this project are divided into five modes: Mode1, imputation by AI without a data description, which fills in the missing data by random imputation; Mode2, imputation by AI with a data description, which fills in the missing data by model-based (iterative) imputation; Mode3, imputation by the mean; Mode4, imputation by the KNN algorithm; and Mode5, imputation by the median. The datasets used for the comparison cover different amounts of missing data, ranging from 50,000 to 200,000 missing entries. The findings revealed that Mode2 (AI with Data Description) was the most effective method for high percentages of missing data, while Mode1 (AI without Data Description) was the least effective.
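The kind of comparison the abstract describes can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data are synthetic, the 20% missing rate and the completely-at-random missingness are arbitrary choices, `RandomSampleImputer` and `compare` are hypothetical helpers, and scikit-learn's `IterativeImputer` merely stands in for the model-based (iterative) imputation of Mode2. The abstract does not say how the "AI" in Mode1 and Mode2 was implemented, so those two entries are approximations by technique only.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


class RandomSampleImputer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: fill each gap with a random observed value
    drawn from the same column (a simple form of random imputation)."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Remember the non-missing values of every column seen during fit.
        self.observed_ = [col[~np.isnan(col)] for col in X.T]
        return self

    def transform(self, X):
        X = np.array(X, dtype=float)  # copy, so the caller's array is untouched
        rng = np.random.default_rng(self.random_state)
        for j, pool in enumerate(self.observed_):
            gaps = np.isnan(X[:, j])
            X[gaps, j] = rng.choice(pool, size=int(gaps.sum()))
        return X


def compare(missing_rate=0.2, seed=0):
    """Return {(imputer_name, classifier_name): F1 on a held-out split}."""
    # Synthetic binary classification data (assumption: not the paper's datasets).
    X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                               random_state=seed)
    # Knock out entries completely at random (assumed missingness mechanism).
    rng = np.random.default_rng(seed)
    X[rng.random(X.shape) < missing_rate] = np.nan
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)

    imputers = {
        "random": RandomSampleImputer(random_state=seed),
        "iterative": IterativeImputer(random_state=seed),
        "mean": SimpleImputer(strategy="mean"),
        "knn": KNNImputer(n_neighbors=5),
        "median": SimpleImputer(strategy="median"),
    }
    classifiers = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=seed),
        "linear_svm": LinearSVC(),
    }

    results = {}
    for imp_name, imputer in imputers.items():
        for clf_name, clf in classifiers.items():
            # Impute, scale, classify; the imputer is fitted on training data only.
            model = make_pipeline(clone(imputer), StandardScaler(), clone(clf))
            model.fit(X_tr, y_tr)
            results[(imp_name, clf_name)] = f1_score(y_te, model.predict(X_te))
    return results


if __name__ == "__main__":
    for (imp_name, clf_name), f1 in sorted(compare().items()):
        print(f"{imp_name:10s} {clf_name:20s} F1={f1:.3f}")
```

Placing the imputer inside the pipeline means it is fitted on the training split alone, which avoids leaking test-set statistics into the imputed values. Calling `compare` with several `missing_rate` values gives a rough analogue of varying the amount of missing data, although the resulting scores say nothing about the paper's reported findings.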
