Publication: Deep Learning-based Extraction of Algorithmic Metadata in Full-Text Scholarly Documents
Issued Date
2020-11-01
ISSN
03064573
Other identifier(s)
2-s2.0-85085523063
Rights
Mahidol University
Rights Holder(s)
SCOPUS
Bibliographic Citation
Information Processing and Management. Vol.57, No.6 (2020)
Suggested Citation
Iqra Safder, Saeed Ul Hassan, Anna Visvizi, Thanapon Noraset, Raheel Nawaz, Suppawong Tuarob. Deep Learning-based Extraction of Algorithmic Metadata in Full-Text Scholarly Documents. Information Processing and Management. Vol.57, No.6 (2020). doi:10.1016/j.ipm.2020.102269 Retrieved from: https://repository.li.mahidol.ac.th/handle/20.500.14594/57817
Abstract
© 2020 Elsevier Ltd. Advances in search engines for traditional text documents have enabled the effective, resource-efficient retrieval of massive amounts of textual information. However, such conventional search methodologies often suffer from poor retrieval accuracy, especially when documents exhibit unique properties that call for specialized and deeper semantic extraction. Recently, AlgorithmSeer, a search engine for algorithms, has been proposed; it extracts pseudo-codes and shallow textual metadata from scientific publications and treats them as traditional documents so that conventional search engine methodology can be applied. However, such a system fails to support user queries that seek algorithm-specific information, such as the datasets on which algorithms operate, the performance of algorithms, and their runtime complexity. In this paper, a set of enhancements to the previously proposed algorithm search engine is presented. Specifically, we propose a set of methods to automatically identify and extract algorithmic pseudo-codes and the sentences that convey related algorithmic metadata using a set of machine-learning techniques. In an experiment with over 93,000 text lines, we introduce 60 novel features, comprising content-based, font-style-based, and structure-based feature groups, to extract algorithmic pseudo-codes. Our proposed pseudo-code extraction method achieves a 93.32% F1-score, outperforming the state-of-the-art techniques by 28%. Additionally, we propose a method to extract algorithm-related sentences using deep neural networks and achieve an accuracy of 78.5%, outperforming a rule-based model and a support vector machine model by 28% and 16%, respectively.
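To make the idea of line-level features more concrete, the following is a minimal Python sketch of how a single text line might be turned into a few content-based and structure-based features. It is an illustration only: the abstract does not list the paper's 60 features, so the keyword list, the function name `line_features`, and every feature shown here are assumptions, and font-style features are omitted because they would require layout information from the PDF.

```python
import re

# Hypothetical keyword list; the paper's actual features are not given in the abstract.
PSEUDOCODE_KEYWORDS = {
    "for", "while", "if", "else", "then", "do", "end", "return", "repeat", "until",
}


def line_features(line: str) -> dict:
    """Compute a few illustrative features for one text line (assumed, not the paper's)."""
    stripped = line.strip()
    tokens = re.findall(r"[A-Za-z_]+", stripped.lower())
    n_chars = len(stripped)
    n_keywords = sum(t in PSEUDOCODE_KEYWORDS for t in tokens)
    n_symbols = sum(not c.isalnum() and not c.isspace() for c in stripped)
    return {
        # Content-based: share of word tokens that are control-flow keywords.
        "keyword_ratio": n_keywords / len(tokens) if tokens else 0.0,
        # Content-based: share of characters that are neither alphanumeric nor whitespace.
        "symbol_ratio": n_symbols / n_chars if n_chars else 0.0,
        # Structure-based: number of leading spaces.
        "indent": len(line) - len(line.lstrip(" ")),
        # Structure-based: line begins with a step number such as "3:" or "3.".
        "starts_with_line_number": bool(re.match(r"\d+\s*[:.)]", stripped)),
    }


if __name__ == "__main__":
    print(line_features("    3: while i < n do"))
    print(line_features("The results are shown in Table 2."))
```

Feature dictionaries of this kind could then be passed to any standard classifier to label each line as pseudo-code or ordinary text; which classifier the authors used is not stated in the abstract.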