The Global Alliance for Genomics and Health. Interview with David Haussler
“A main challenge facing clinical and genomic data sharing efforts is the lack of harmonized methods and interoperable approaches that would enable such sharing. This barrier is one of the main motives for the formation of the Global Alliance for Genomics and Health.”
I have interviewed David Haussler, director of the Center for Biomolecular Science & Engineering at the University of California, Santa Cruz. David is one of eight organizing committee members of the Global Alliance for Genomics and Health.
Q1. What is the Global Alliance for Genomics and Health?
David Haussler: The Global Alliance for Genomics and Health is a partnership of more than 180 of the world’s leading stakeholders working together to create a common framework of harmonized approaches to enable the responsible and effective sharing of genomic and clinical data. The Global Alliance is made of up of a diverse, international group of organizations working in healthcare, biomedical research, disease and patient advocacy, life science, and information technology, who come together with the goal of accelerating progress in medicine and human health.
Q2. What are the main objectives of the Data Working group?
David Haussler: The Data Working Group is focused on the interoperability and scalability of formats and interfaces for genomic information. The main near-term objective of the Data Working Group is to establish a role as the international coordinating body and frontrunner for organizing, developing and aligning the computer formats and application programming interfaces (APIs) used to represent and exchange genomic data on individuals.
This includes stewardship of existing file formats used to store genomic information (BAM and VCF files) and engaging the community in devising forward-looking data models and APIs for representing, submitting, exchanging, and querying genomic data.
Q3. What are the main challenges (technical and non-technical) in representing and exchanging genomic data on individuals?
David Haussler: A main challenge facing clinical and genomic data sharing efforts is the lack of harmonized methods and interoperable approaches that would enable such sharing. This barrier is one of the main motives for the formation of the Global Alliance for Genomics and Health.
Currently, the ad hoc use of different data formats and technologies in different systems, lack of alignment between approaches to ethics and national legislation across jurisdictions, and the challenges of devising secure systems for controlled sharing of data puts the world on track to create Balkanized data sets and not be able to learn from aggregated information.
It is the hope of the Global Alliance that by addressing these technical, regulatory and other barriers at the outset, we will reverse the current course and enable medical progress through large-scale data aggregation and analysis.
Q4. What do you mean with “responsible” data sharing?
David Haussler: The meaning of responsible data sharing comes down to respect for the privacy and the data sharing preferences of participants. One of the core missions of the Global Alliance is to promote the highest standards for ethics and ensure that participants have a choice to securely share their genomic and clinical data as much as they want to, including not at all.
Aligning with this mission, two of the four initial Working Groups are focused on aspects of this responsible sharing: the Security Working Group and the Regulatory and Ethics Working Group.
The Regulatory and Ethics Working Group is in the process of drafting an International Code of Conduct, which will support the establishment of a set of ethical principles and practices for research seeking to share genomic and clinical data. The Security Working Group aims to support a technology environment that provides assurance to patients, researchers, clinicians, and other stakeholders that data are shared, annotated, and interpreted only by those with appropriate authorization to do so. All work done by the Global Alliance, including in the Data Working Group, is closely tied to ensure that any data sharing is done in a manner that respects privacy and security, while still retaining essential attributes to enable effective analysis.
Q5. What are the plans of the Data Working group to overcome such challenges?
David Haussler: Initially, the Data Working Group will take a role in overseeing the current BAM, CRAM, and VCF format standards to provide a governance and support structure for these efforts.
In the near-term, we will work with the international community to develop formal data models, APIs, and reference implementations of those APIs for representing, submitting, exchanging, querying, and analyzing genomic data in a scalable and potentially distributed fashion. This work will be consistent with the security model developed by the Alliance’s Security Working Ground, the clinical data framework developed by the Alliance’s Clinical Working Group, and the International Code of Conduct developed by the Alliance’s Regulatory and Ethics Working Group.
The Data Working Group in conjunction with partner organisations has also contributed to the startup of a project known as “Beacon”, which fosters the development of ‘beacons’: any institution or site that provides a simple yes or no in response to query regarding the presence of a specific human genetic variant in their genetic data. This open web service is designed both to be technically simple, so that it is easy to implement, and to not return information that could be construed as violating anyone’s privacy, so that it is available as a public, unrestricted web resource.
Q6. Why new APIs are needed? and what are the key areas in which these new APIs will be used?
David Haussler: We need to switch from file formats to APIs so that new architectures can be employed for storage and access to genomics data as we scale to thousands and eventually millions of genomes. APIs allow third parties to write code with standardized methods for utilization of genomic data that do not require download or parsing of large files and that are broadly compatible across many institutional systems.
Specifically, APIs are needed for and will be used in these four key areas:
Reference variants. This API represents a reference genome structure consisting of typical human chromosomal DNA sequences and well-established human polymorphisms including larger structural variations. It defines mechanisms for mapping other information to the reference, including individual genomes, RNA data, and annotation. It should support mapping of DNA or RNA reads from a BAM file or equivalent, individual genome variants as described in a VCF file or equivalent, and various types of reference genome annotation as found in a genome browser or in one of the existing human genetic variation databases.
Read data. This API represents collections of primary data collected from sequencing machines, covering functions currently supported by FASTQ and BAM file formats, and including a query interface over groups of samples. It addresses issues of efficient interaction with large databases, the relationship of reads to a reference genome, lossy or loss-free data compression, and error correction.
Expression, methylation, and other epigenetic data. It is also necessary to have APIs that represent gene expression and the epigenetic state of the DNA or chromatin in a tissue sample. We plan to build an API specifically for gene expression, and establish a framework in which other external groups can create APIs for other types of epigenetic and functional data. These APIs will interact with the reference variant and read data APIs.
Metadata. Metadata is general information about a sample, such as tissue type, including how, when, and where information was extracted from that sample, such as the name of the sequencing center. We intend that there be a single sample metadata schema that is shared by all data models, used universally across expert working groups so that there is maximum compatibility. Optional fields will allow customization as necessary so that it does not force the specification of too much information for any given API or project.
Q7. How do you plan to create a “shared” data representation, storage, and analysis of genomic data?
David Haussler: The Global Alliance intends to enable the sharing of genomic and clinical data, but will not itself store, analyze, or interpret data. By undertaking work such as the development of APIs for representing, submitting, exchanging, and querying genomic data, we seek to create a common framework of interoperable approaches, lifting up best practices and creating new methods where none exist, that will enable more effective, responsible sharing of genomic and clinical data and facilitate large-scale research by entities throughout the world. The Global Alliance also seeks to catalyze data sharing projects that drive and demonstrate the value of data sharing, and to convene stakeholders from different sectors and localities to share information, establish best practices, and enable interoperability across the broadest possible group.
Q8. Why InterSystems joined the Global Alliance for Genomics and Health and what will be their contribution?
David Haussler: To answer this, I point you towards the statement of Paul Grabscheid, Vice President of Strategy comments at Intersystems, when the company joined the Global Alliance: http://www.intersystems.com/who-we-are/newsroom/news-item/intersystems-joins-global-alliance-for-genomics-and-health/.
On membership generally, since the Global Alliance’s initial formation with 70 partners in June of 2013, the group has brought on many more highly esteemed research and health institutions with broadened international representation, including partners from over 40 leading life science and information technology companies, world leaders in cloud computing, biotechnology, and healthcare generally, and additional respected disease and patient advocacy groups.
Q9. What are the progress and deliverables so far?
David Haussler: The Data Working Group has formed its first four task teams:
(1) File Formats Task Team,
(2) Reference Variation Task Team,
(3) Read Store Task Team, and
(4) Metadata Task Team.
Work from each Task Team is addressed below and is available at https://github.com/ga4gh unless otherwise noted:
File Formats. The developers of the current VCF, BAM, and CRAM file formats have been engaged in a File Formats Task Team led by Ewan Birney of the EBI to govern, maintain and extend these formats. A pre-existing official specification and software development site has been endorsed at https://github.com/samtools/hts-specs and will be used by the Task Team to address suggestions from the developer community for file format modifications.
Reference Variation. The Reference Variation Task Team, co-led by Gil McVean of Oxford University and Benedict Paten of UC Santa Cruz, held its organizing meeting in Hinxton, UK on March 3, 2014.
The team aims to compare existing reference structures such as the GRC reference genome and the dbSNP database alongside newer graph-based approaches, with the near-term goal of delivering one or more new or enhanced reference structures with pilot implementations.
Read Store. The Read Task Team, led by Dave Patterson of UC Berkeley, involves members from various companies, government, and academic institutions. It has compared in detail APIs from NCBI, EBI, Google, SMART/HL7 FHIR, and UC Berkeley, has established a publicly readable mailing list with discussions, designed and released an initial v0.1 API, and is currently working on the v0.5 API. All work is open and issues/comments may be raised by any members of the public through mechanisms provided by the GitHub open source software development environment.
Metadata. The Metadata Task Team, is led by Helen Parkinson of EBI and Tanya Barrett of NCBI, which held its organizing meeting April 17, 2014. In the near term, the team aims to create a single sample metadata schema that is shared by all data models, used universally across expert working groups so that there is maximum compatibility.
In addition to these task teams, to develop APIs in the context of major ongoing research projects, the Data Working Group currently interacts with three projects from outside the Global Alliance: the ICGC/TCGA Pan-Cancer Whole Genome Analysis project, Matchmaker Exchange, and Beacon.
Q10. What is the Beacon project, and how does it relate to your work at the Global Alliance for Genomics and Health?
David Haussler: In order to root the activities of the Global Alliance in real-world problems and to demonstrate the value of interoperable approaches to data sharing, the Alliance supports specific projects, of which the Beacon project is one. Ongoing engagement between these projects and Working Groups is intended to encourage a focus on the needs of projects currently advancing science and medicine, and crosscutting engagement of the Working Groups with one another and with stakeholders in the community.
The Beacon project, led by Jim Ostell of the NCBI, is a project that was created in order to test the willingness of international sites to share genetic data in the simplest of all technical contexts. It is defined as a simple public web service that any institution can implement as a service.
A site offering this service is called a “beacon.” This open web service is designed both to be technically simple (so that it is easy to implement) and to not return information that could be construed as violating anyone’s privacy (so that there is no good excuse for not implementing it as a public, unrestricted web resource).
A goal of the Data Working Group is to foster the development of more than a dozen independent “beacons” in the near-term, and in collaboration with the other Alliance Working Groups, to gain initial direct experience with the barriers to international genetic data sharing through this Beacon project.
There are currently 4 beacons running at the following locations: UC Berkeley (http://beacon.eecs.berkeley.edu), NCBI (http://www.ncbi.nlm.nih.gov/projects/genome/beacon/), UC Santa Cruz (http://hgwdev-max.cse.ucsc.edu/cgi-bin/beacon/query), and EMBL-EBI (http://www.ebi.ac.uk/eva/beacon).
This is still very much early days. Both the interface and the rules for engagement with beacons are rapidly evolving.
David Haussler’s research lies at the interface of mathematics, computer science, and molecular biology. He develops new statistical and algorithmic methods to explore the molecular function and evolution of the human genome, integrating cross-species comparative and high-throughput genomics data to study gene structure, function, and regulation. He is credited with pioneering the use of hidden Markov models (HMMs), stochastic context-free grammars, and the discriminative kernel method for analyzing DNA, RNA, and protein sequences. He was the first to apply the latter methods to the genome-wide search for gene expression biomarkers in cancer, now a major effort of his laboratory.
As a collaborator on the international Human Genome Project, his team posted the first publicly available computational assembly of the human genome sequence on the Internet on July 7, 2000. Following this, his team developed the UCSC Genome Browser, a web-based tool that is used extensively in biomedical research and serves as the platform for several large-scale genomics projects, including NHGRI’s ENCODE project to use omics methods to explore the function of every base in the human genome, NIH’s Mammalian Gene Collection, NHGRI’s 1000 genomes project to explore human genetic variation, and NCI’s Cancer Genome Atlas (TCGA) project to explore the genomic changes in cancer.
His group’s informatics work on cancer genomics, including the UCSC Cancer Genomics Browser, provides a complete analysis pipeline from raw DNA reads through the detection and interpretation of mutations and altered gene expression in tumor samples. His group collaborates with researchers at medical centers nationally, including members of the Stand Up To Cancer “Dream Teams” and the Cancer Genome Atlas, to discover molecular causes of cancer and pioneer a new personalized, genomics-based approach to cancer treatment.
The UCSC Cancer Genomics Hub (CGHub), a product of the Haussler lab, is a secure repository for storing, cataloging, and accessing cancer genome sequences, alignments, and mutation information for 25 cancer types from TCGA, the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) project, and other related projects. The current planned capacity of this data center is five petabytes. The CGHub will serve as a platform to aggregate other large-scale cancer genomics information, growing to provide the statistical power to attack the complexity of cancer.
He co-founded the Genome 10K Project to assemble a genomic zoo—a collection of DNA sequences representing the genomes of 10,000 vertebrate species—to capture genetic diversity as a resource for the life sciences and for worldwide conservation efforts.
Haussler is an organizing member of the Global Alliance for Genomics and Health, through which research, health care, and disease advocacy organizations that have taken the first steps to standardize and enable secure sharing of genomic and clinical data.
Haussler received his PhD in computer science from the University of Colorado at Boulder. He is a member of the National Academy of Sciences and the American Academy of Arts and Sciences and a fellow of AAAS and AAAI. He has won a number of awards, including the 2011 Weldon Memorial Prize from University of Oxford, the 2009 ASHG Curt Stern Award in Human Genetics, the 2008 Senior Scientist Accomplishment Award from the International Society for Computational Biology, the 2005 Dickson Prize for Science from Carnegie Mellon University, and the 2003 ACM/AAAI Allen Newell Award in Artificial Intelligence.
Follow ODBMS.org on Twitter: @odbmsorg