
Getting Beyond Proof of Principle for Big Data Technologies in Bioinformatics: MapReduce Algorithmic, Programming, and Architectural Issues

30 November 2017
San Francesco - Via della Quarquonia 1 (Classroom 1)
High Performance Computing (HPC) in Bioinformatics has a classic architectural paradigm: the shared-memory multi-processor. With the advent of Cloud Computing, this new way of managing Big Data has also been considered for Bioinformatics, initially through Proof of Concept results investigating the advantages of the new computational paradigm. These have been followed by an increasing number of specific Bioinformatics tasks developed mainly with the MapReduce programming paradigm, which is in turn supported by the Hadoop and Spark middleware.

A careful analysis of the State of the Art indicates that the main advantage of those Big Data Technologies is the perception of boundless scalability, at least in terms of time. However, how effectively the computing resources are used in the Cloud is rather cloudy, as most of the available software almost entirely delegates the management of the distributed workload to the powerful primitives of Hadoop and Spark. On a private cloud, i.e., a physical computing cluster that can be configured at will by the user, one can show that carefully designed MapReduce algorithms largely outperform those that naively "delegate" to Hadoop and Spark. In the public cloud, e.g., virtual clusters (for instance, created via OpenStack) with a dynamic and instance-dependent allocation of physical resources, issues of an architectural nature, or related to the configuration of the virtual cluster, largely oblivious to the end user, may translate into a lack of data locality that results in poor MapReduce performance with respect to the resources used.

In order to obtain resource-effective, portable, Cloud-based software for Bioinformatics pipelines, the issues mentioned above must be carefully studied and accounted for, in particular to have an impact on Personalized Medicine. As a matter of fact, the need is so pressing, and apparently the expected demand so high, that Edico Genome and Amazon have started a collaboration that makes available Bioinformatics pipelines highly engineered to take advantage of FPGA programmability and the Cloud. The objective is to take the already highly performing shared-memory multi-processor solutions offered by Edico Genome and make them "real time". Fortunately, this is only the "high end" of the spectrum, where the transition from the old HPC paradigm to the new Cloud Computing one has gone beyond Proof of Concept.
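To make the "delegate vs. design" contrast concrete, the following is a minimal PySpark sketch of a hypothetical k-mer counting job; the input path, k-mer length, and partition count are illustrative assumptions, not material from the talk.

    # Minimal PySpark sketch (hypothetical): counting k-mers in sequencing reads.
    # The input path, K, and NUM_PARTITIONS below are illustrative assumptions.
    from pyspark import SparkContext

    sc = SparkContext(appName="kmer-count-sketch")

    K = 21                # assumed k-mer length
    NUM_PARTITIONS = 64   # sized to the physical cluster, not left to defaults

    reads = sc.textFile("hdfs:///data/reads.txt")  # hypothetical input location

    def kmers(read):
        """Emit (k-mer, 1) pairs for a single read."""
        return [(read[i:i + K], 1) for i in range(len(read) - K + 1)]

    # "Naive" version: partitioning, shuffling, and reducer placement are
    # delegated entirely to Spark's defaults.
    naive_counts = reads.flatMap(kmers).reduceByKey(lambda a, b: a + b)

    # "Carefully designed" version: the number of reduce partitions is fixed
    # explicitly so that the shuffle volume per worker matches the available
    # physical resources, one of the knobs that affects data locality.
    tuned_counts = (reads
                    .flatMap(kmers)
                    .reduceByKey(lambda a, b: a + b,
                                 numPartitions=NUM_PARTITIONS))

    print(tuned_counts.take(5))
    sc.stop()

In a real deployment one would also examine locality-related configuration such as Spark's spark.locality.wait, but even the partition count alone illustrates how much is otherwise left to the middleware.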
Speaker: 
Raffaele Giancarlo
Units: 
SysMA