Genomics is also in the middle of a massive technological revolution; over the past decade, the sequencers used by scientists have improved in cost, quality, and speed at exponential rates. Fifteen years ago, it took billions of dollars and years of work by an international consortium of researchers to produce a single human genome; today, one sequencing center can sequence a human genome in about a day for roughly $1,000. Thousands of human genomes have been sequenced, and projects to sequence hundreds of thousands or millions of genomes are already underway.
Even as the experimental machinery of genomics has advanced, however, its computational support (the tools and methods that convert raw data into clinical findings and research discoveries) has not kept pace. Genomics software today runs much as it did ten years ago: discrete tools, ad hoc scripts for workflow, files instead of databases, file formats in place of data models, and little to no parallelism.
Spark is an ideal platform for organizing large genomics analysis pipelines and workflows. Its compatibility with the Hadoop platform makes it easy to deploy and support within existing bioinformatics IT infrastructures, and its support for languages such as R, Python, and SQL eases the learning curve for practicing bioinformaticians. Widespread use of Spark for genomics, however, will require adapting and rewriting many of the methods, tools, and algorithms in regular use today.
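To make that fit concrete, here is a minimal sketch (plain Spark, not ADAM's API) of parallelizing one simple genomics computation: counting aligned reads per contig. The `Read` case class, the input path, and the tab-separated layout are hypothetical stand-ins for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type standing in for an aligned read.
case class Read(contig: String, start: Long, sequence: String)

object ReadsPerContig {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reads-per-contig"))

    // Parse a tab-separated dump of aligned reads (hypothetical path and
    // layout): contig, 0-based start position, read sequence.
    val reads = sc.textFile("hdfs:///data/reads.tsv").map { line =>
      val Array(contig, start, seq) = line.split('\t')
      Read(contig, start.toLong, seq)
    }

    // The per-contig count is a classic map/reduce; Spark distributes it
    // across the cluster with no explicit workflow scripting.
    val counts = reads.map(r => (r.contig, 1L)).reduceByKey(_ + _)
    counts.collect().sortBy(_._1).foreach { case (contig, n) =>
      println(s"$contig\t$n")
    }

    sc.stop()
  }
}
```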
This talk will present ADAM, an open-source library for bioinformatics analysis, written for Spark and hosted by the AMPLab. We will discuss both the places where Spark's ability to parallelize an analysis pipeline is a natural fit for genomics methods and some methods that have proven more difficult to adapt. We will also cover ADAM's use of technologies like Avro, for schema specification, and Parquet, for compressed columnar storage, in conjunction with its Spark-based workflows.
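As a flavor of what the Avro/Parquet combination buys, here is a minimal sketch, again not ADAM's actual schema or API, of querying a Parquet file of read records through Spark's DataFrame API (assuming Spark 2.x or later; the path and column names are assumptions). Because the Avro-derived schema is embedded in the Parquet file and the data is stored column by column, the query needs no parsing code and scans only the columns it touches.

```scala
import org.apache.spark.sql.SparkSession

object QueryReads {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("query-reads").getOrCreate()

    // The record schema (contig, start, mapq, ...) travels with the file
    // in the Parquet footer; nothing format-specific to write here.
    val reads = spark.read.parquet("hdfs:///data/sample.reads.parquet")

    // Column-oriented storage means this scan reads only the mapq and
    // contig columns, skipping the (much larger) sequence data.
    reads.filter(reads("mapq") >= 30)
      .groupBy("contig")
      .count()
      .show()

    spark.stop()
  }
}
```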