Publication: DATA++: An Automated Tool for Intelligent Data Augmentation Using Wikidata
Issued Date
2019-07-01
Other identifier(s)
2-s2.0-85074238695
Rights
Mahidol University
Rights Holder(s)
SCOPUS
Bibliographic Citation
JCSSE 2019 - 16th International Joint Conference on Computer Science and Software Engineering: Knowledge Evolution Towards Singularity of Man-Machine Intelligence. (2019), 91-96
Suggested Citation
Waran Taveekarn, Chatchanin Yimudom, Supisara Sukkanta, Steven Lynden, Wudhichart Sawangphol, Suppawong Tuarob. DATA++: An Automated Tool for Intelligent Data Augmentation Using Wikidata. JCSSE 2019 - 16th International Joint Conference on Computer Science and Software Engineering: Knowledge Evolution Towards Singularity of Man-Machine Intelligence. (2019), 91-96. doi:10.1109/JCSSE.2019.8864152. Retrieved from: https://repository.li.mahidol.ac.th/handle/20.500.14594/50629
Abstract
© 2019 IEEE. Technology now influences the lives of many people, and artificial intelligence is one of its most influential elements. Creative feature engineering is an important part of machine learning methodology: it manipulates existing data, by modifying its dimensions, to make it work more efficiently. Pulling useful information from external sources and combining it with existing data, however, is cumbersome, since data engineers must manually find external data sources and process them. The ability to modify and enrich existing data automatically using external open data sources could therefore prove crucial to data engineers and scientists looking to enrich their datasets. In this paper, we propose a method that automatically augments a given structured dataset by inferring relevant dimensions from an external data source with respect to the target attribute. Specifically, our proposed algorithm first creates Bloom filters for every instance of data items. These filters are then used to retrieve relevant information from the linked open data source, which is later processed into additional columns in the target dataset. A case study of three real-world datasets, using Wikidata as the external data source, empirically validates the proposed method on both regression and classification tasks. The experimental results show that the datasets augmented by our proposed algorithm yield an average correlation improvement of 23.11% for the regression task and an ROC improvement of 86.50% for the classification task.
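The abstract names three steps: build a Bloom filter per data instance, use it to find relevant external records, and turn the matches into new columns. The record does not give the paper's implementation, so the following is only a minimal sketch of that general idea under stated assumptions. The `BloomFilter` class, the `augment` function, the in-memory `external` dictionary standing in for a linked open data source, and the sample rows are all hypothetical and are not taken from the paper.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; bit positions come from double hashing a SHA-256 digest."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int serves as the bit array

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def normalize(text):
    return text.strip().lower()


def augment(rows, external, properties):
    """Return copies of rows with extra columns from matching external records.

    Assumption: a record matches a row when its label equals one of the
    row's string values after normalization.
    """
    augmented = []
    for row in rows:
        values = {normalize(v) for v in row.values() if isinstance(v, str)}
        row_filter = BloomFilter()
        for value in values:
            row_filter.add(value)

        new_row = dict(row)
        for prop in properties:
            new_row[prop] = None  # default when nothing matches

        for label, record in external.items():
            key = normalize(label)
            # Cheap Bloom filter test first, then an exact check to rule
            # out false positives.
            if key in row_filter and key in values:
                for prop in properties:
                    new_row[prop] = record.get(prop)
                break
        augmented.append(new_row)
    return augmented


# Hypothetical toy data; a real run would query Wikidata instead.
rows = [{"city": "Bangkok", "sales": 10}, {"city": "Atlantis", "sales": 5}]
external = {"Bangkok": {"country": "Thailand", "continent": "Asia"}}
result = augment(rows, external, ["country", "continent"])
```

In this sketch the Bloom filter acts only as a cheap pre-check before the exact comparison, which is the usual reason to use one when the candidate set is large. How the paper selects columns relevant to the target attribute is not described in this record and is not modeled here.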