ADAM
I’m approaching the end of my first semester here in the Computer & Information Sciences M.S. program, and I’m also approaching a possible thesis topic. I’m trying to wrap my head around it, so I figure writing out my thoughts and questions will help!
ADAM is a framework under active development by the Big Data Genomics research group. I’ve been playing around with ADAM for a few months now, and I have an idea of some of the things it does well and some of the things that are still experimental, and I’m finally trying to really dive deep into the software to understand what it does, what it can’t do, and how it works. I’ve been keeping an eye on GitHub issues of course, and have even offered some minor contributions myself, but my focus right now is on the 2015 paper: Rethinking Data-Intensive Science Using Scalable Analytics Systems, Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data.
I feel like a newb on two fronts right now. The most obvious issue is that I’m so new to software, having only just started really learning to program almost exactly a year ago. I had done some R statistics and simple website building, but a year ago was when I really started to build my career as a developer. Anyway, I’ve found this quite exciting, and I’ve been pretty confident in my ability to learn quickly.

The second issue, though, is that I’m trying to do work on what seems to be the cutting edge of bioinformatic processing, but I’m not actually that familiar with how bioinformatics data crunching has been done in the past. I’m a biologist, but I’m not really familiar with RNA-seq, which is one of the first applications of the ADAM software. Anyway, the first part of the paper lays out the problems in genomics data processing that ADAM is solving, and for now at least I need to take it for granted that current pipelines are a mess. I’ve sort of skirted around this issue while trying to mix and match the myriad genomics tools that are out there, but until reading this paper I assumed that every genomicist has his or her own pipeline, finely tuned and fully functional. This paper suggests that a fully functional and finely tuned pipeline doesn’t yet exist, as evidenced by the constant creation of awkward custom file formats to solve immediate needs in the absence of an overall design. I really like the idea that ADAM starts from an overarching design that anticipates future needs and provides a framework for future solutions. Perhaps I should simply be thankful that I don’t have to wrestle with the haphazard tools of the soon-to-be genomics past.
The stated purpose of ADAM is to democratize scientific data analysis by creating a framework that can use commodity hardware and open-source software to crunch big data. Apache Spark is the change-maker here, because it allows distributed processing over lots of smaller computers and, for some applications, circumvents the need for expensive, top-of-the-line hardware. The growing accessibility of distributed storage and processing via AWS will keep pushing those boundaries too, letting people do important and difficult work without a lot of capital.
ADAM has already gone a long way toward achieving this goal by developing a full stack with separation of concerns engineered into the design, so that layers are compartmentalized and therefore interchangeable. I first learned about this important engineering principle last semester in 362, so it’s really cool to see that solid engineering can solve a major problem in real life.

I’m still trying to get my head around what exactly Apache Avro and Apache Parquet do, but I’m most focused right now on RDDs themselves. Resilient Distributed Datasets are (were?) the key to Spark’s ability to distribute processing. I say are, because they really are the fundamental building block of Spark, but I say perhaps were because Spark 2.0 has made the DataFrame and Dataset abstractions central, and those seem to let users work more easily with the underlying RDDs. A lot of what goes on under the hood of ADAM seems to be a well-structured configuration of how RDDs are created and transformed within the framework, but I’m not sure how much of this machinery is still critical in Spark 2.0. For example, ADAM tuned up RDD joins to work better with the genomic coordinates of a mapped genome, but Spark SQL has since been introduced to make transformations on DataFrames and Datasets more efficient, so I’m not sure yet whether these efforts overlap one another (I sketch out what I mean by a coordinate-aware join below).

It sounds like the Parquet storage format actually does a lot more for the pipeline than simply store the data. The ADAM paper suggests that the Parquet partitions actually help Spark understand how best to distribute the data, but again I’m not yet sure how applicable or necessary this feature is with Spark 2.0.
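To make the RDD-versus-Dataset question concrete for myself, here’s a rough sketch of the kind of thing I mean. The `Read` and `Feature` case classes, the toy data, and the overlap predicate are all my own simplifications, not ADAM’s actual schemas or its region join implementation; this is just the naive relational form of an interval-overlap join that Spark SQL’s optimizer can reason about.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical, stripped-down record types -- not ADAM's actual Avro schemas.
case class Read(contig: String, start: Long, end: Long, sequence: String)
case class Feature(contig: String, start: Long, end: Long, name: String)

object OverlapJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("overlap-join-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The "classic" way: an RDD of case-class records, transformed with closures
    // that Spark cannot look inside.
    val readsRdd = spark.sparkContext.parallelize(Seq(
      Read("chr1", 100, 150, "ACGT"),
      Read("chr1", 400, 450, "TTGA")))
    println(s"chr1 reads: ${readsRdd.filter(_.contig == "chr1").count()}")

    // The Spark 2.0 way: the same records as Datasets, so the Catalyst optimizer
    // can see the schema and plan the work.
    val reads = readsRdd.toDS()
    val features = Seq(Feature("chr1", 120, 300, "geneA")).toDS()

    // A naive coordinate-overlap join written as a plain relational predicate:
    // two intervals on the same contig overlap when each one starts before the
    // other ends. ADAM's region join does something smarter (it partitions by
    // genomic coordinates), but this is the shape of the problem.
    val overlaps = reads.join(features,
      reads("contig") === features("contig") &&
        reads("start") < features("end") &&
        features("start") < reads("end"))

    overlaps.show()
    spark.stop()
  }
}
```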
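Similarly, here’s a hedged sketch of what I understand the Parquet point to be: because the format is columnar and carries statistics, Spark can skip most of a file when a query only touches a couple of fields. The paths, the `Read` case class, and the partitioning choice are made up for illustration; ADAM’s actual Parquet schemas and partitioning strategy are exactly what I still need to dig into.

```scala
import org.apache.spark.sql.SparkSession

// Made-up record type and paths, purely for illustration.
case class Read(contig: String, start: Long, end: Long, sequence: String)

object ParquetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val reads = Seq(
      Read("chr1", 100, 150, "ACGT"),
      Read("chr20", 400, 450, "TTGA")).toDS()

    // Write the records out as Parquet, partitioned on disk by contig, so each
    // contig ends up in its own directory of columnar files.
    reads.write.mode("overwrite").partitionBy("contig").parquet("/tmp/reads.parquet")

    // Read back only two columns for one contig. Spark prunes whole partitions
    // for the contig predicate, pushes the start predicate down into the Parquet
    // reader, and never deserializes the sequence column at all.
    val slice = spark.read.parquet("/tmp/reads.parquet")
      .select("contig", "start")
      .filter($"contig" === "chr20" && $"start" > 200)

    slice.explain() // the physical plan shows the pruned partitions and pushed filters
    slice.show()
    spark.stop()
  }
}
```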
I plan to meet with a few folks to help me better understand this paper, but I’m also hoping to get plugged into the ADAM development group more directly. If I can immerse myself in their discussions more consistently, then I will hopefully build a sort of DAG of my own to help me get going on my thesis as efficiently as possible.