Publication:
An exhaustive, non-euclidean, non-parametric data mining tool for Unraveling the complexity of biological systems - novel insights into malaria

dc.contributor.author: Cheikh Loucoubar [en_US]
dc.contributor.author: Richard Paul [en_US]
dc.contributor.author: Avner Bar-Hen [en_US]
dc.contributor.author: Augustin Huret [en_US]
dc.contributor.author: Adama Tall [en_US]
dc.contributor.author: Cheikh Sokhna [en_US]
dc.contributor.author: Jean François Trape [en_US]
dc.contributor.author: Alioune Badara Ly [en_US]
dc.contributor.author: Joseph Faye [en_US]
dc.contributor.author: Abdoulaye Badiane [en_US]
dc.contributor.author: Gaoussou Diakhaby [en_US]
dc.contributor.author: Fatoumata Diène Sarr [en_US]
dc.contributor.author: Aliou Diop [en_US]
dc.contributor.author: Anavaj Sakuntabhai [en_US]
dc.contributor.author: Jean François Bureau [en_US]
dc.contributor.other: Institut Pasteur, Paris [en_US]
dc.contributor.other: Universite Paris Descartes [en_US]
dc.contributor.other: Institut Pasteur de Dakar [en_US]
dc.contributor.other: Ecole des hautes etudes en sante publique [en_US]
dc.contributor.other: Institute of Health and Science [en_US]
dc.contributor.other: Institut de Recherche pour le Developpement Dakar [en_US]
dc.contributor.other: UGB [en_US]
dc.contributor.other: Mahidol University [en_US]
dc.date.accessioned: 2018-05-03T07:55:46Z
dc.date.available: 2018-05-03T07:55:46Z
dc.date.issued: 2011-09-09 [en_US]
dc.description.abstract: Complex, high-dimensional data sets pose significant analytical challenges in the post-genomic era. Such data sets are not exclusive to genetic analyses and are also pertinent to epidemiology. There has been considerable effort to develop hypothesis-free data mining and machine learning methodologies. However, current methodologies lack exhaustivity and general applicability. Here we use a novel non-parametric, non-euclidean data mining tool, HyperCube®, to explore exhaustively a complex epidemiological malaria data set by searching for over density of events in m-dimensional space. Hotspots of over density correspond to strings of variables, rules, that determine, in this case, the occurrence of Plasmodium falciparum clinical malaria episodes. The data set contained 46,837 outcome events from 1,653 individuals and 34 explanatory variables. The best predictive rule contained 1,689 events from 148 individuals and was defined as: individuals present during 1992-2003, aged 1-5 years old, having hemoglobin AA, and having had previous Plasmodium malariae malaria parasite infection ≤10 times. These individuals had 3.71 times more P. falciparum clinical malaria episodes than the general population. We validated the rule in two different cohorts. We compared and contrasted the HyperCube® rule with the rules using variables identified by both traditional statistical methods and non-parametric regression tree methods. In addition, we tried all possible sub-stratified quantitative variables. No other model with equal or greater representativity gave a higher Relative Risk. Although three of the four variables in the rule were intuitive, the effect of number of P. malariae episodes was not. HyperCube® efficiently sub-stratified quantitative variables to optimize the rule and was able to identify interactions among the variables, tasks not easy to perform using standard data mining methods. Search of local over density in m-dimensional space, explained by easily interpretable rules, is thus seemingly ideal for generating hypotheses for large datasets to unravel the complexity inherent in biological systems. © 2011 Loucoubar et al. [en_US]
dc.identifier.citation: PLoS ONE. Vol.6, No.9 (2011) [en_US]
dc.identifier.doi: 10.1371/journal.pone.0024085 [en_US]
dc.identifier.issn: 19326203 [en_US]
dc.identifier.other: 2-s2.0-80052604865 [en_US]
dc.identifier.uri: https://repository.li.mahidol.ac.th/handle/20.500.14594/11269
dc.rights: Mahidol University [en_US]
dc.rights.holder: SCOPUS [en_US]
dc.source.uri: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=80052604865&origin=inward [en_US]
dc.subject: Agricultural and Biological Sciences [en_US]
dc.subject: Biochemistry, Genetics and Molecular Biology [en_US]
dc.subject: Medicine [en_US]
dc.title: An exhaustive, non-euclidean, non-parametric data mining tool for Unraveling the complexity of biological systems - novel insights into malaria [en_US]
dc.type: Article [en_US]
dspace.entity.type: Publication
mu.datasource.scopus: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=80052604865&origin=inward [en_US]
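
The abstract describes a rule as a conjunction of conditions on explanatory variables and scores it by how much more frequent the outcome is inside the rule than in the general population. The sketch below illustrates that idea only; it is not HyperCube® and does not reproduce the study's search procedure or its data. The record fields (year, age, hb, pm_count, episode), the toy observations, and the function names are all hypothetical, and taking the whole data set as the baseline for the ratio is an assumption drawn from the phrase "than the general population".

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Rule = Callable[[Record], bool]

# Hypothetical toy observations for illustration; NOT data from the study.
# "episode" is 1 if a clinical episode was recorded for that observation.
TOY_RECORDS: List[Record] = [
    {"year": 1995, "age": 3, "hb": "AA", "pm_count": 2, "episode": 1},
    {"year": 1996, "age": 4, "hb": "AA", "pm_count": 0, "episode": 1},
    {"year": 1998, "age": 2, "hb": "AS", "pm_count": 1, "episode": 0},
    {"year": 2005, "age": 3, "hb": "AA", "pm_count": 0, "episode": 0},
    {"year": 1999, "age": 30, "hb": "AA", "pm_count": 12, "episode": 0},
    {"year": 2000, "age": 12, "hb": "AA", "pm_count": 3, "episode": 0},
    {"year": 2001, "age": 5, "hb": "AA", "pm_count": 11, "episode": 0},
    {"year": 1993, "age": 1, "hb": "AA", "pm_count": 4, "episode": 0},
]


def example_rule(r: Record) -> bool:
    """Conjunction shaped like the rule quoted in the abstract (field names assumed)."""
    return (
        1992 <= r["year"] <= 2003
        and 1 <= r["age"] <= 5
        and r["hb"] == "AA"
        and r["pm_count"] <= 10
    )


def relative_risk(records: List[Record], rule: Rule,
                  outcome: str = "episode") -> Tuple[float, int]:
    """Return (outcome rate inside the rule / rate in all records, rule support)."""
    inside = [r for r in records if rule(r)]
    if not records or not inside:
        raise ValueError("need at least one record overall and inside the rule")
    rate_all = sum(r[outcome] for r in records) / len(records)
    if rate_all == 0:
        raise ValueError("baseline outcome rate is zero; ratio undefined")
    rate_in = sum(r[outcome] for r in inside) / len(inside)
    return rate_in / rate_all, len(inside)


rr, support = relative_risk(TOY_RECORDS, example_rule)
print(f"support={support}, relative risk={rr:.2f}")
```

Reporting the support alongside the ratio mirrors the abstract's comparison of rules "with equal or greater representativity": a high ratio on very few observations is weak evidence, so the two numbers are best read together.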