Data-driven analysis of collections of big datasets by the Bi-CoPaM method yields field-specific novel insights
Abu-Jamous B., Liu C., Roberts DJ., Brattico E., Nandi AK.
© Springer Nature Singapore Pte Ltd. 2017. Massive amounts of data have recently been, and are increasingly being, generated from various fields, such as bioinformatics, neuroscience and social networks. Many of these big datasets were generated to answer specific research questions, and were analysed accordingly. However, the scope of information contained in these datasets can usually answer much broader questions than what was originally intended. Moreover, many existing big datasets are related to each other but have different detailed specifications, and the mutual information that can be extracted from them collectively has been not commonly considered. To bridge this gap between the fast pace of data generation and the slower pace of data analysis, and to exploit the massive amounts of existing data, we suggest employing data-driven explorations to analyse collections of related big datasets. This approach aims at extracting field-specific novel findings which can be revealed from the data without being driven by specific questions or hypotheses. To realise this paradigm, we introduced the binarisation of consensus partition matrices (Bi- CoPaM) method, with the ability of analysing collections of heterogeneous big datasets to identify clusters of consistently correlated objects. We demonstrate the power of data-driven explorations by applying the Bi-CoPaM to two collections of big datasets from two distinct fields, namely bioinformatics and neuroscience. In the first application, the collective analysis of forty yeast gene expression datasets identified a novel cluster of genes and some new biological hypotheses regarding their function and regulation. In the other application, the analysis of 1,856 big fMRI datasets identified three functionally connected neural networks related to visual, reward and auditory systems during affective processing. These experiments reveal the broad applicability of this paradigm to various fields, and thus encourage exploring the large amounts of partially exploited existing datasets, preferably as collections of related datasets, with a similar approach.