How big is big data for you, and are you interested in getting even more data to analyze?
Jeroen Raes (VIB-KU Leuven Center for Microbiology): “The microbiome field is still in the development phase, so the datasets aren’t as big as they are, for example, in genetics GWAS studies. This being said, in our Flemish Gut Flora Project, there are about 3,500 individuals for which we have microbiome and genetics data, and we will be generating metabolomics data in the future. In addition to all the clinical and questionnaire data, this is becoming quite an impressive dataset, I’d say – and the multiomics integration won’t be straightforward. Things will become even bigger in the future – for the trials we are currently planning, we will be collecting between 15,000 and 20,000 samples. The biggest challenge there is not the data analysis, but the logistics! Yet, it’s still not enough – in our recent Science paper, we estimated that at least 40,000 individuals need to be sampled to have a complete view of gut biodiversity in the healthy population. So, we still have quite some work ahead of us!”
Do you combine big datasets for better insights, or do you work the other way around – by slicing up big data into smaller portions for easier analysis – or perhaps both?
Diether Lambrechts (VIB-KU Leuven Center for Cancer Biology): “We reduce data size from bulk, noisy data (FASTQ/BAM files) to small, high-information datasets (VCF files). In general, we never slice, because data has cost us money and we don’t want to lose information. What we do is different from, for example, analyzing mouse clicks on a website. We combine data for better insights: vertically with other cohorts (e.g. compare/combine our results with TCGA) and horizontally with other data types (e.g. a new HRD predictor based on Alexandrov mutational signatures and SNP array profiles). Also, we should be careful in claiming that we use ‘big data’. Facebook and Google do: they use MapReduce algorithms on large databases that scale out horizontally on multiple servers. We mostly work with data that fits in RAM. The books and the hype around ‘big data’ really refer to the first scale of data.”
Do you need big software to analyze big data?
Stuart Maudsley (VIB-UAntwerp Center for Molecular Neurology): “There are currently three levels of accessible big data analytics – the giant level of Google BigQuery and IBM Watson (now interestingly being offered directly to public users), the intermediate level of data-specific organizations such as Envision, Neural Designer and Quire, and then finally the laboratory-based efforts that generate in-house analytical platforms both for internal and minimal-level consumer use. Profound insights from big data analytics at the ‘low’ lab level can still compete effectively at the top end, especially when one considers the potential ‘clarity’ of the empirical data used at the lab level compared to the indiscriminate mass-level (and often ‘greying-out’) data corpora used by the mega-scale players. So, to conclude, for the present time (Watson is coming, however), intelligent, focused small-scale platforms can still beat giant corporations – it’s still always down to the input data quality, and this is best curated in-house, near the end of your specific high-dimensionality pipeline.”
In which fields are most of the big data sets produced in life sciences?
Lieven Sterck (VIB-UGent Center for Plant Systems Biology): “Nowadays, due to the ease and fast pace of generating data, big data is present in all fields of life sciences. Whether it is DNA/RNA sequencing, proteomics, metabolomics, patient (meta-)information or microbiomics, generating data is becoming cheaper and more feasible, which eventually leads to big data in every domain. Not surprisingly, most of the (really) big data can probably still be found in the medical and pharmacology fields, mainly due to the availability of both academic and industrial funding to create those data sets as well as the nature of those research fields. Nonetheless, in the plant field, data is becoming more and more accessible. A nice example of this is in plant breeding. We are currently experiencing a vastly increasing supply of information in related areas, from plant genomes, water management, soil composition, fertilization, climate and automated phenotyping via drone technology to crop protection systems. The expanding ways and advances in technologies by which we can get and make use of this data are paving the way for big data to make its introduction into both farming practices and crop genetics. However, regardless of the field, it is clear that big data itself is useless unless we are able to turn it into knowledge.”