“A main challenge facing clinical and genomic data sharing efforts is the lack of harmonized methods and interoperable approaches that would enable such sharing. This barrier is one of the main motives for the formation of the Global Alliance for Genomics and Health.”
I have interviewed David Haussler, director of the Center for Biomolecular Science & Engineering at the University of California, Santa Cruz. David is one of eight organizing committee members of the Global Alliance for Genomics and Health.
Q1. What is the Global Alliance for Genomics and Health?
David Haussler: The Global Alliance for Genomics and Health is a partnership of more than 180 of the world’s leading stakeholders working together to create a common framework of harmonized approaches to enable the responsible and effective sharing of genomic and clinical data. The Global Alliance is made of up of a diverse, international group of organizations working in healthcare, biomedical research, disease and patient advocacy, life science, and information technology, who come together with the goal of accelerating progress in medicine and human health.
Q2. What are the main objectives of the Data Working group?
David Haussler: The Data Working Group is focused on the interoperability and scalability of formats and interfaces for genomic information. The main near-term objective of the Data Working Group is to establish a role as the international coordinating body and frontrunner for organizing, developing and aligning the computer formats and application programming interfaces (APIs) used to represent and exchange genomic data on individuals.
This includes stewardship of existing file formats used to store genomic information (BAM and VCF files) and engaging the community in devising forward-looking data models and APIs for representing, submitting, exchanging, and querying genomic data.
Q3. What are the main challenges (technical and non-technical) in representing and exchanging genomic data on individuals?
David Haussler: A main challenge facing clinical and genomic data sharing efforts is the lack of harmonized methods and interoperable approaches that would enable such sharing. This barrier is one of the main motives for the formation of the Global Alliance for Genomics and Health.
Currently, the ad hoc use of different data formats and technologies in different systems, lack of alignment between approaches to ethics and national legislation across jurisdictions, and the challenges of devising secure systems for controlled sharing of data puts the world on track to create Balkanized data sets and not be able to learn from aggregated information.
It is the hope of the Global Alliance that by addressing these technical, regulatory and other barriers at the outset, we will reverse the current course and enable medical progress through large-scale data aggregation and analysis.
Q4. What do you mean with “responsible” data sharing?
David Haussler: The meaning of responsible data sharing comes down to respect for the privacy and the data sharing preferences of participants. One of the core missions of the Global Alliance is to promote the highest standards for ethics and ensure that participants have a choice to securely share their genomic and clinical data as much as they want to, including not at all.
Aligning with this mission, two of the four initial Working Groups are focused on aspects of this responsible sharing: the Security Working Group and the Regulatory and Ethics Working Group.
The Regulatory and Ethics Working Group is in the process of drafting an International Code of Conduct, which will support the establishment of a set of ethical principles and practices for research seeking to share genomic and clinical data. The Security Working Group aims to support a technology environment that provides assurance to patients, researchers, clinicians, and other stakeholders that data are shared, annotated, and interpreted only by those with appropriate authorization to do so. All work done by the Global Alliance, including in the Data Working Group, is closely tied to ensure that any data sharing is done in a manner that respects privacy and security, while still retaining essential attributes to enable effective analysis.
Q5. What are the plans of the Data Working group to overcome such challenges?
David Haussler: Initially, the Data Working Group will take a role in overseeing the current BAM, CRAM, and VCF format standards to provide a governance and support structure for these efforts.
In the near-term, we will work with the international community to develop formal data models, APIs, and reference implementations of those APIs for representing, submitting, exchanging, querying, and analyzing genomic data in a scalable and potentially distributed fashion. This work will be consistent with the security model developed by the Alliance’s Security Working Ground, the clinical data framework developed by the Alliance’s Clinical Working Group, and the International Code of Conduct developed by the Alliance’s Regulatory and Ethics Working Group.
The Data Working Group in conjunction with partner organisations has also contributed to the startup of a project known as “Beacon”, which fosters the development of ‘beacons’: any institution or site that provides a simple yes or no in response to query regarding the presence of a specific human genetic variant in their genetic data. This open web service is designed both to be technically simple, so that it is easy to implement, and to not return information that could be construed as violating anyone’s privacy, so that it is available as a public, unrestricted web resource.
Q6. Why new APIs are needed? and what are the key areas in which these new APIs will be used?
David Haussler: We need to switch from file formats to APIs so that new architectures can be employed for storage and access to genomics data as we scale to thousands and eventually millions of genomes. APIs allow third parties to write code with standardized methods for utilization of genomic data that do not require download or parsing of large files and that are broadly compatible across many institutional systems.
Specifically, APIs are needed for and will be used in these four key areas:
Reference variants. This API represents a reference genome structure consisting of typical human chromosomal DNA sequences and well-established human polymorphisms including larger structural variations. It defines mechanisms for mapping other information to the reference, including individual genomes, RNA data, and annotation. It should support mapping of DNA or RNA reads from a BAM file or equivalent, individual genome variants as described in a VCF file or equivalent, and various types of reference genome annotation as found in a genome browser or in one of the existing human genetic variation databases.
Read data. This API represents collections of primary data collected from sequencing machines, covering functions currently supported by FASTQ and BAM file formats, and including a query interface over groups of samples. It addresses issues of efficient interaction with large databases, the relationship of reads to a reference genome, lossy or loss-free data compression, and error correction.
Expression, methylation, and other epigenetic data. It is also necessary to have APIs that represent gene expression and the epigenetic state of the DNA or chromatin in a tissue sample. We plan to build an API specifically for gene expression, and establish a framework in which other external groups can create APIs for other types of epigenetic and functional data. These APIs will interact with the reference variant and read data APIs.
Metadata. Metadata is general information about a sample, such as tissue type, including how, when, and where information was extracted from that sample, such as the name of the sequencing center. We intend that there be a single sample metadata schema that is shared by all data models, used universally across expert working groups so that there is maximum compatibility. Optional fields will allow customization as necessary so that it does not force the specification of too much information for any given API or project.
Q7. How do you plan to create a “shared” data representation, storage, and analysis of genomic data?
David Haussler: The Global Alliance intends to enable the sharing of genomic and clinical data, but will not itself store, analyze, or interpret data. By undertaking work such as the development of APIs for representing, submitting, exchanging, and querying genomic data, we seek to create a common framework of interoperable approaches, lifting up best practices and creating new methods where none exist, that will enable more effective, responsible sharing of genomic and clinical data and facilitate large-scale research by entities throughout the world. The Global Alliance also seeks to catalyze data sharing projects that drive and demonstrate the value of data sharing, and to convene stakeholders from different sectors and localities to share information, establish best practices, and enable interoperability across the broadest possible group.
Q8. Why InterSystems joined the Global Alliance for Genomics and Health and what will be their contribution?
David Haussler: To answer this, I point you towards the statement of Paul Grabscheid, Vice President of Strategy comments at Intersystems, when the company joined the Global Alliance: http://www.intersystems.com/who-we-are/newsroom/news-item/intersystems-joins-global-alliance-for-genomics-and-health/.
On membership generally, since the Global Alliance’s initial formation with 70 partners in June of 2013, the group has brought on many more highly esteemed research and health institutions with broadened international representation, including partners from over 40 leading life science and information technology companies, world leaders in cloud computing, biotechnology, and healthcare generally, and additional respected disease and patient advocacy groups.
Q9. What are the progress and deliverables so far?
David Haussler: The Data Working Group has formed its first four task teams:
(1) File Formats Task Team,
(2) Reference Variation Task Team,
(3) Read Store Task Team, and
(4) Metadata Task Team.
Work from each Task Team is addressed below and is available at https://github.com/ga4gh unless otherwise noted:
File Formats. The developers of the current VCF, BAM, and CRAM file formats have been engaged in a File Formats Task Team led by Ewan Birney of the EBI to govern, maintain and extend these formats. A pre-existing official specification and software development site has been endorsed at https://github.com/samtools/hts-specs and will be used by the Task Team to address suggestions from the developer community for file format modifications.
Reference Variation. The Reference Variation Task Team, co-led by Gil McVean of Oxford University and Benedict Paten of UC Santa Cruz, held its organizing meeting in Hinxton, UK on March 3, 2014.
The team aims to compare existing reference structures such as the GRC reference genome and the dbSNP database alongside newer graph-based approaches, with the near-term goal of delivering one or more new or enhanced reference structures with pilot implementations.
Read Store. The Read Task Team, led by Dave Patterson of UC Berkeley, involves members from various companies, government, and academic institutions. It has compared in detail APIs from NCBI, EBI, Google, SMART/HL7 FHIR, and UC Berkeley, has established a publicly readable mailing list with discussions, designed and released an initial v0.1 API, and is currently working on the v0.5 API. All work is open and issues/comments may be raised by any members of the public through mechanisms provided by the GitHub open source software development environment.
Metadata. The Metadata Task Team, is led by Helen Parkinson of EBI and Tanya Barrett of NCBI, which held its organizing meeting April 17, 2014. In the near term, the team aims to create a single sample metadata schema that is shared by all data models, used universally across expert working groups so that there is maximum compatibility.
In addition to these task teams, to develop APIs in the context of major ongoing research projects, the Data Working Group currently interacts with three projects from outside the Global Alliance: the ICGC/TCGA Pan-Cancer Whole Genome Analysis project, Matchmaker Exchange, and Beacon.
Q10. What is the Beacon project, and how does it relate to your work at the Global Alliance for Genomics and Health?
David Haussler: In order to root the activities of the Global Alliance in real-world problems and to demonstrate the value of interoperable approaches to data sharing, the Alliance supports specific projects, of which the Beacon project is one. Ongoing engagement between these projects and Working Groups is intended to encourage a focus on the needs of projects currently advancing science and medicine, and crosscutting engagement of the Working Groups with one another and with stakeholders in the community.
The Beacon project, led by Jim Ostell of the NCBI, is a project that was created in order to test the willingness of international sites to share genetic data in the simplest of all technical contexts. It is defined as a simple public web service that any institution can implement as a service.
A site offering this service is called a “beacon.” This open web service is designed both to be technically simple (so that it is easy to implement) and to not return information that could be construed as violating anyone’s privacy (so that there is no good excuse for not implementing it as a public, unrestricted web resource).
A goal of the Data Working Group is to foster the development of more than a dozen independent “beacons” in the near-term, and in collaboration with the other Alliance Working Groups, to gain initial direct experience with the barriers to international genetic data sharing through this Beacon project.
There are currently 4 beacons running at the following locations: UC Berkeley (http://beacon.eecs.berkeley.edu), NCBI (http://www.ncbi.nlm.nih.gov/projects/genome/beacon/), UC Santa Cruz (http://hgwdev-max.cse.ucsc.edu/cgi-bin/beacon/query), and EMBL-EBI (http://www.ebi.ac.uk/eva/beacon).
This is still very much early days. Both the interface and the rules for engagement with beacons are rapidly evolving.
David Haussler’s research lies at the interface of mathematics, computer science, and molecular biology. He develops new statistical and algorithmic methods to explore the molecular function and evolution of the human genome, integrating cross-species comparative and high-throughput genomics data to study gene structure, function, and regulation. He is credited with pioneering the use of hidden Markov models (HMMs), stochastic context-free grammars, and the discriminative kernel method for analyzing DNA, RNA, and protein sequences. He was the first to apply the latter methods to the genome-wide search for gene expression biomarkers in cancer, now a major effort of his laboratory.
As a collaborator on the international Human Genome Project, his team posted the first publicly available computational assembly of the human genome sequence on the Internet on July 7, 2000. Following this, his team developed the UCSC Genome Browser, a web-based tool that is used extensively in biomedical research and serves as the platform for several large-scale genomics projects, including NHGRI’s ENCODE project to use omics methods to explore the function of every base in the human genome, NIH’s Mammalian Gene Collection, NHGRI’s 1000 genomes project to explore human genetic variation, and NCI’s Cancer Genome Atlas (TCGA) project to explore the genomic changes in cancer.
His group’s informatics work on cancer genomics, including the UCSC Cancer Genomics Browser, provides a complete analysis pipeline from raw DNA reads through the detection and interpretation of mutations and altered gene expression in tumor samples. His group collaborates with researchers at medical centers nationally, including members of the Stand Up To Cancer “Dream Teams” and the Cancer Genome Atlas, to discover molecular causes of cancer and pioneer a new personalized, genomics-based approach to cancer treatment.
The UCSC Cancer Genomics Hub (CGHub), a product of the Haussler lab, is a secure repository for storing, cataloging, and accessing cancer genome sequences, alignments, and mutation information for 25 cancer types from TCGA, the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) project, and other related projects. The current planned capacity of this data center is five petabytes. The CGHub will serve as a platform to aggregate other large-scale cancer genomics information, growing to provide the statistical power to attack the complexity of cancer.
He co-founded the Genome 10K Project to assemble a genomic zoo—a collection of DNA sequences representing the genomes of 10,000 vertebrate species—to capture genetic diversity as a resource for the life sciences and for worldwide conservation efforts.
Haussler is an organizing member of the Global Alliance for Genomics and Health, through which research, health care, and disease advocacy organizations that have taken the first steps to standardize and enable secure sharing of genomic and clinical data.
Haussler received his PhD in computer science from the University of Colorado at Boulder. He is a member of the National Academy of Sciences and the American Academy of Arts and Sciences and a fellow of AAAS and AAAI. He has won a number of awards, including the 2011 Weldon Memorial Prize from University of Oxford, the 2009 ASHG Curt Stern Award in Human Genetics, the 2008 Senior Scientist Accomplishment Award from the International Society for Computational Biology, the 2005 Dickson Prize for Science from Carnegie Mellon University, and the 2003 ACM/AAAI Allen Newell Award in Artificial Intelligence.
Follow ODBMS.org on Twitter: @odbmsorg
“Analysis of big data can identify the subtle differences that explain why similar-seeming patients have different outcomes, and predictive decision support can help physicians guide more patients on the path to recovery”–Steve Nathan.
Why using predictive analytics in healthcare? On this topic, I have interviewed Steve Nathan, CEO at Amara Health Analytics.
Q1. Amara provides real-time predictive analytics to support clinicians in the early detection of critical disease states. Can you tell us a bit more of what are these predictive analytics and which data sets do you use?
Steve Nathan: Our system runs on hospital data, including labs, pharmacy, real-time vitals, ADT, EMR, and clinical narrative text. We have developed domain-specific natural language processing (NLP) and machine learning to find predictive signal that is beyond the reach of traditional approaches to clinical analytics.
Q2. How large are the data sets you analyse?
Steve Nathan: For an average 500 bed hospital we analyse about 100 million data items in real-time annually. We expect this volume of data to increase dramatically as more centers deploy hospital-wide automated real-time streaming of data from patient monitors into the EMR.
Q3. Why early detection is important?
Steve Nathan: Let’s look at the example of sepsis, which is the body’s systemic toxic response to an infection. Clinicians know how to treat sepsis once identified, but it’s often difficult to identify early because it can mimic other conditions and there are a multitude of clinical variables involved, some of which are subjective in nature.
At the later stages of sepsis, mortality increases almost 8% with every hour of delayed treatment. So technology that can assist clinicians in early identification is important.
Q4. What is your back-end platform for real-time data processing and analytics?
Steve Nathan: The major components of the backend are Mirth (open source HL7 parsing/transform), a data aggregator, a real-time data streaming engine, Jess rules engine, the core analytics engine (including NLP pipeline), and Cassandra NoSQL database from DataStax.
Q5 What are the main technical challenges you encounter when you analyse big data sets that are composed of previous patient histories and medical records?
Steve Nathan: Input data comes from a wide variety of centers and is wildly heterogeneous. Issues range from proprietary health IT systems and incompatible usage of available standards like HL7, to wide variations in clinical language used by physicians and nurses in their notes, to varying patterns of diagnostic testing by different providers. And of course every patient is unique. So much of our system is dedicated to producing a standardized timeline for each patient in which every variable has a consistent interpretation, no matter where it originally came from. We then use these patient timelines as input for machine learning, and there are some unique challenges there in developing predictive models that are appropriate for use in real-time decision support.
Q6. Why did you choose a NoSQL database for this task?
Steve Nathan: The primary motivation was flexibility of data representation. We wanted to be able to change schemas dynamically. Also, DataStax’s integration of Cassandra with Solr was very important for us because we do a lot text mining as part of our overall approach.
Q7. What are the lessons learned so far?
Steve Nathan: Processing huge amounts of historical patient data in simulations to refine predictive models is very challenging to do with good performance. We must be able to run many years worth of archived data through the system in a small number of hours, so the system architecture must be designed with this processing mode in mind.
It is important to consider early on how to upgrade the live system with no down time while avoiding the possibility of losing any data.
Q8. What about data protection issues?
Steve Nathan: We comply with HIPAA regulations and go to great lengths to protect data. This includes work processes and policies for all employees, contractors, and business associates. And, importantly, it includes the architecture and deployment of our systems – for example all data is transferred via VPN, and every VPN connection is terminated in a VM that contains a single protected dataset, rather than terminating at a DC router.
Q9. Which relationships exist between Amara and UC San Diego?
Steve Nathan: There is no formal relationship between Amara and UCSD. Amara’s co-founder and chairman, Dr. Ramamohan Paturi, is a professor of Computer Science at UCSD.
Qx. Anything else you wish to add?
Steve Nathan: Analysis of big data can identify the subtle differences that explain why similar-seeming patients have different outcomes, and predictive decision support can help physicians guide more patients on the path to recovery.
Steve Nathan has over 25 years in the enterprise software industry. As CEO of Parity Computing beginning in 2008, Steve led the company’s product expansion with a pioneering analytics platform for enhancing biomedical research productivity. He then led the formation of Amara Health Analytics, the concept and strategy for the Clinical Vigilance™ product line, and the spinout of Amara as an independent company.
Steve has held key leadership positions at recognized technology innovators Sun Microsystems and Cray Research, as well as at start-ups Celerity Computing, Alignent Software, and Exist Global. As General Manager at Sun Microsystems, he had P&L responsibility for messaging, portal, and web infrastructure in the iPlanet business unit, where he grew the annual revenue of this product line from $15M to $150M in three years.
Steve holds B.S. degrees from the University of California at Riverside in Computer Science, Mathematics, and Psychology; and he is a 1999 graduate of the Stanford University Business Executive Program.
Follow ODBMS.org and ODBMS Industry Watch on Twitter:
“It’s unlikely that a big data benchmark will gain wide recognition until a clear “playing field” has emerged and focused the competitive pressure.” –Francois Raab
On the topic of constructing big data benchmarks I have interviewed Francois Raab and Yanpei Chen. Francois is the original author of the TPC-C Benchmark. He is currently the President of InfoSizing, Inc.
Yanpei is a member of the Performance Engineering Team at Cloudera.
Q1.There have been a number of attempts at constructing big data benchmarks. None of them has yet gained wide recognition and usage. Why?
Yanpei: Many big data benchmarks are just like big data systems – new, and with room to improve and grow.
In more detail big data systems:
- rapidly evolve, so it’s important to define performance in ways that matter for end customers.
- consist of many interdependent components, so it’s difficult to measure performance in a reliable fashion.
- service diverse business needs using diverse implementations, so benchmarks need to accommodate different system implementations.
Francois: It’s unlikely that a big data benchmark will gain wide recognition until a clear “playing field” has emerged and focused the competitive pressure. There are 3 phases in the evolution of a new technology. First, the technology is introduced and applied to a wide array of solutions without a proven return on investment. Next, a “killer app” emerges from the early adopters and its rapid growth draws all the vendors into competing on a common playing field. Lastly, some technologies emerge as clear winners in the race and the market start to consolidate around a few dominant vendors. Big data has not entered the second phase yet.
Q2. Is it possible to build a truly representative big data benchmark?
To me, the rise of “big data” in part comes from our increased ability to instrument, measure, and ultimately derive value from large scale systems – technology systems, financial systems, medical systems, or physical systems touching day-to-day life. Big data systems, as a special case of technology systems, also deal with ever increasing instrumentation and measurement. Over time, I am absolutely confident that we will increase our understanding of big data systems, and with it, improve the quality of our big data benchmarks.
Cloudera’s broad customers base gives us visibility into big data deployments across telecom, banking, retail, manufacturing, media, government, healthcare, and many other industry sectors. We’re in a great position to identify representative use cases.
Francois: A benchmark is a somewhat abstract (i.e. simplified) model of a real life scenario. The question we face today is to identify a scenario that Fortune 500 companies would widely recognize as relevant to their operations and vital to their competitive survival. Once that critical mass has been reached it will quickly spread to the entire commercial data processing landscape and a successful big data benchmark will be built based on that scenario.
Q3. How would you define a Big Data Benchmark ?
Yanpei: The key properties of good big data benchmarks are a re-cast of the same properties for benchmarks of more established systems.
A good big data benchmark should be representative of real-life use cases; it should generate performance insights immediately relevant to diverse and evolving big data use cases. The benchmark should also be scalable; it should stress big data systems today, as well as the vastly improved systems in the future. The benchmark should be portable, meaning it should accommodate systems with different implementations that achieve the same end-goal. The benchmark should also be verifiable, in that the results can be checked by independent auditors if needed, and end-users can reproduce on their own systems the winning configurations and result.
Q4. Can you give some examples of Successful Benchmarks ?
Francois: The success of a benchmark can be measured by its number of published results and by its longevity over shifts in the underlying technologies. By that measure TPC-C and TPC-H are leading the field. While it can be argued that they have lost relevance over their two decades lifetime, they still encapsulate critical elements at the core of the application domains they represent (transaction processing and decision support).
Q5. One of the main purposes of a benchmark is to evaluate and contrast the merits of various implementations of the same set of requirements. How do you do this with Big Data?
Yanpei: You construct benchmarks that are portable. In other words, you specify implementation-independent requirements.
Best illustrated by example – TPC-C. TPC-C specifies five operations – New Order, Payment, Delivery, Order-Status, and Stock-Level. It also describes the interdependencies between these operations. For example, every New Order will be accompanied by Payment, but only one in ten New Orders will trigger an Order Status. TPC-C describes the load that the system under test should handle – many concurrent operations arriving in randomized order with randomized inter-arrival time, but at controlled relative frequencies. TPC-C also specifies the initial content of all the datasets, as well as how the content grows over the execution of the benchmark. This is an implementation-independent set of requirements – “handle these operations on these data sets.” The underlying system could be a relational database, or a key-value store like HBase.
Francois: Benchmarks can be defined one of two way: by creating a kit to be deployed on technology specific platforms or by specifying a set of technology agnostic requirements to be implemented at will. Because big data has first emerged from the MapReduce paradigm, we have seen a number of technology centric benchmarks (also called component benchmarks) that put a narrow focus on one or more components of a predefined solution. But we should soon expect to see a big data application emerge as the new must-have in commercial data centers.
Q6. In a recent position paper you argued for building future big data benchmarks using what you call a “functional workload model”. What is it?
Francois: We introduced a couple of terms in that position paper to highlight the core concepts underlying representative, scalable, portable, and verifiable big data benchmarks.
The “functional workload model” is a way to specify such benchmarks. It contains three things – the “functions of abstraction”, the load pattern serviced by the system, and the data sets being acted upon.
“Functions of abstraction” describes “what is being computed” without specifying “how the computation should be done.”
The intent is an abstract, functional description that allows the benchmark to be portable across systems of different compute paradigms. “What is being computed” should be justified by empirical evidence, either system traces or industry-wide surveys, with emphasis on identifying the common computation goals.
The load pattern describes “what is the serviced load” without specifying “how it is serviced.” It outlines the execution frequency, distribution, arrival rate, bursts and averages over time of each individual function of abstraction.
The data sets describe “what is the data and the relationships within the data” without specifying “how it is represented.” It is in terms of the structure and interdependence between data elements, initial size and contents, how it evolves over the course of the workload execution, and how it is expected to scale with the system size and load volume.
These concepts help us routinely identify shortcomings in haphazardly specified benchmarks. For example, some of the most often-cited big data benchmarks contain artificial functions of abstraction that do not match any common use cases.
Or, a multi-job, multi-query load pattern is missing altogether, or the data sets are represented in unrealistic formats that inflate performance advantages.
Q7. Why did you select TPC-C as a starting point for your work?
Yanpei: Because TPC-C already has a functional workload model within its specification. And because Francois wrote TPC-C.
Francois: The functional workload model is the underlying structure on which TPC-C was built. Subsequent TPC benchmarks, like TPC-H and TPC-E, were also built based on a functional workload model.
Q8. How does your functional workload model compares with TPC-C ?
Yanpei: TPC-C already uses the functional workload concept.
Q9. For your functions of abstractions concept to be useful, it must be applicable to different types of big data systems. Two important examples are relational databases and MapReduce. How do you do that? How does your work compare with other MapReduce-Specific Benchmarks ?
Yanpei: Best illustrated by example.
Suppose we discover that sorting data is a common operation in real-life production use cases. We would then define “sort” as a function of abstraction. We would define it in the same fashion as the official Sort Benchmark – the input data is of size X, format Y, and the system is asked to produce output sorted by order Z.
A relational database implementation could do, say, “insert into TABLE … ” followed by “select * from TABLE ordered by COLUMN”. A MapReduce implementation would use the IdentityMapper and IdentityReducer, and rely on the implicit shuffle-sort in MapReduce.
This is obvious for sort, because the sort operation has traditionally been defined in a system-independent way.
In contrast, many of the existing MapReduce and relational database performance measurement tools are specified in ways that do not translate across different types of systems. The many SQL-on-Hadoop systems are fast removing the boundary.
The functions of abstraction concept allows us to understand use-case at a level above than any SQL-only or Hadoop-only specifications.
Q10. What are in your opinion the Emerging Big Data Application Domains?
Francois: Everyone wants to figure out which application domain will become the big data killer app. Today, no commercial data center can live without on-line transaction processing or without decision support systems.
Which big data application will become indispensable tomorrow? That is the million dollar question! Once we know that, a standard big data benchmark will soon follow.
Yanpei: The maturation of the Hadoop platform has been relentless. Its role has changed as the platform has gotten more secure, more reliable, more powerful, and (especially) more real-time. It’s no longer a system used for just big batch jobs. Instead, it has become the first place that data lands. It scales and it can store anything – no data need be discarded. It’s used to pre-process data before delivering it to an enterprise data warehouse, a document repository, an analytic engine, a CRM or ERP application, or other specialized system. Most significantly, it has begun to take over some of the work previously done by those traditional platforms, because it can do real-time search and analysis on the data directly, in place, and without further Extract-Transform-Load (ETL).
This leads to the emergence of the enterprise data hub (EDH), a new architecture to complement existing investments and help put data at the center of an organisation’s business. An enterprise data hub allows storage of any amount and type of data, for as long as is needed, and accessible in any way needed.
Additional necessary attributes of EDHs include: It’s Secure and Compliant, offering perimeter security and encryption, plus fine-grained (row and column-level), role-based access controls over data, just like a data warehouse. It’s Governed, enabling users to do data discovery, data auditing, and data lineage, thus understanding what data is in their EDH and how the data are used.
It’s Unified and Manageable, providing native high-availability, fault-tolerance, self-healing storage, automated replication, and disaster recovery, as well as advanced workload management capabilities to enable multiple speciacialist systems to analyze the same data set. And it’s Open, ensuring that customers are not locked in to any particular vendor’s license agreement, that you can choose what tools to use with your EDH, and nobody can hold your data or applications hostage.
The emergence of EDHs pose both challenges and opportunities for defining big data benchmarks. As Francois alluded to, the representative scenarios typically involve application domains whose performance has traditionally been measured separately, such as the case for on-line transaction processing and decision support systems. How to define and measure performance for such concurrent application domains present both a challenge and an opportunity.
Further, to compare different EDHs, it becomes necessary to quantify characteristics that are previously yes/no checks – which is the more secure EDH? the better governed? the more unified and more manageable? the more open? How to quantify such characteristics will stretch our performance thinking and measurement methodology into new territory.
Q11. Future Work ?
Yanpei: We have a strong Performance Engineering Team at Cloudera. We insist on systematic, fair, and repeatable tests both for our internal performance assessment and competitive studies. We are also engaged with community efforts to define big data benchmarks. Look for our future posts on the Cloudera Developer Blog!
Francois Raab is a recognized, award winning expert in the field of performance engineering, benchmark design and system testing. He is the original author of the TPC-C Benchmark, the most successful industry standard measure of OLTP performance. He was also co-author of “The Benchmark Handbook” (pub. Morgan Kaufmann). Francois is accredited as a Certified Benchmark Auditor by the Transaction Processing Performance Council. His consulting services are retained by most major system vendors as well as Fortune-500 IT organizations. With over 30 years of experience in the field of databases and commercial data processing, Francois is a leading member of the performance measurement, system sizing and technology evaluation community. He is currently the President of InfoSizing, Inc.
Yanpei Chen is a member of the Performance Engineering Team at Cloudera, where he works on internal and competitive performance measurement and optimization. His work touches upon multiple interconnected computation frameworks, including Cloudera Search, Cloudera Impala, Apache Hadoop, Apache HBase, and Apache Hive. He is the lead author of the Statistical Workload Injector for MapReduce (SWIM), an open source tool that allows someone to synthesize and replay MapReduce production workloads. SWIM has become a standard MapReduce performance measurement tool used to certify many Cloudera partners. He received his doctorate at the UC Berkeley AMP Lab, where he worked on performance-driven, large-scale system design and evaluation.
-New (August 18, 2014): TPCx-HS: First Vendor-Neutral, Industry Standard Big Data Benchmark.
The Transaction Processing Performance Council (TPC) announced the immediate availability of TPCx-HS, developed to provide verifiable performance, price/performance, availability, and optional energy consumption metrics of big data systems.
- The Fifth Workshop on Big Data Benchmarking (5th WBDB) August 5-6, 2014, Potsdam, Germany: Program and Videos of all talks.
Follow ODBMS.org and ODBMS Industry Watch on Twitter: @odbmsorg
“The problem we had was that, if we tried to store the XML in traditional relational databases, we were losing the structure of our articles or having to redesign them.”–David Leeming
On making information accessible, I have interviewed David Leeming, Solutions Manager at the Royal Society of Chemistry. The Royal Society of Chemistry is the world’s leading chemistry community, advancing excellence in the chemical sciences, with 49,000 members worldwide.
Q1. What is the main business of The Royal Society of Chemistry?
David Leeming: The Royal Society of Chemistry is the world’s leading chemistry community, advancing excellence in the chemical sciences. With over 49,000 members and a knowledge business that spans the globe, we are the UK’s professional body for chemical scientists; a not-for-profit organisation with 170 years of history and an international vision of the future. We promote, support and celebrate chemistry. We work to shape the future of the chemical sciences – for the benefit of science and humanity.
Q2. What is your role at The Royal Society of Chemistry? And what are your current projects?
David Leeming: I manage the solutions team within the Royal Society of Chemistry’s recently formed Strategic Innovation Group, whose goal is to develop and deliver digital products and platforms that support and enhance our strategy and goals. As well as delivering platforms such as our Publishing Platform and our platform for teaching and learning resources, Learn Chemistry, my team also trials new technologies to scope what the organisation could be doing next in digital delivery to connect the chemistry community with high-quality research and chemical data.
Q3. Why did you digitize and convert over 1 million pages of written word and chemical formulae into XML?
David Leeming: Our aim is to connect the world with the chemical sciences, so we need to be able to make the chemistry research that we publish available and accessible to researchers all over the world. With the age of digital technology and demands from researchers to get their hands on data in new formats, rather than the traditional hard copy journals, we undertook a project to digitise our vast library of scientific data, news and literature and make it available online. By getting the XML for our articles and data, we are able to improve the discoverability and functionality of the platforms we use to deliver our content, by using the content structure. We are also able to hold only one copy of an article and transform the XML into other format layouts without having to hold multiple copies of the article.
Q4. What kind of new products and services were you expecting to develop when you converted the data in XML?
David Leeming: We can enhance and semantically enrich our content and analyse it to discover niche domains or areas of chemistry research that might benefit researchers who may not normally think to look in our journals.
For example, since launch in 2010 we have launched a number of new journals in environmental science.
On Learn Chemistry we have been able to produce mini-sites for specific initiatives such as a collection of Chemistry of Sport education resources that coincided with the 2012 Olympic Games.
Q5. What technical challenges did you encounter when trying to store and manage these XML files?
David Leeming: Before we discovered MarkLogic the problem we had was that, if we tried to store the XML in traditional relational databases, we were losing the structure of our articles or having to redesign them.
The XML was basically stored in flat files, and we needed to build indexes from this. Keeping this up to date and versioned was difficult. Since we’ve been using MarkLogic, we have been able to store our content in its original structure within the database, giving us full version control and adding search capabilities. The initial technical difficulties we faced when switching over to MarkLogic were finding experienced developers and getting traditional database administrators to think a bit differently about the way developers use the database and not to be concerned about traditional data modelling.
Q6. Specifically, how do you manage the logical associations between different types of XML content? And how do you query(update) them?
David Leeming: We have two ways of separating content. One is by product – this was our original method. We stored the XML documents for each product in their own database. We have now changed this method to store by content type. So, for instance, a journal article is stored together with other journal articles, book chapters are stored together, and so on. We keep the original XML schema and different documents have different schemas, but on loading the data we add a ‘meta’ record that is common across all data objects. This holds identifiers and describes what each data object is. This aids in the searching and querying we perform on the data to ensure the correct content is accessed and updated.
Q7. What results did you obtain in using a NoSQL database for this task?
David Leeming: Using the NoSQL database meant that we didn’t need to spend 6 months or so building a data model that doesn’t really fit the structure of the documents we currently have or may have in the future. Each time we get a different XML document, there’s no need to change the underlying data model, so it is very easy to implement.
Q8. What are your plans ahead?
David Leeming: Going forward we are starting to add greater semantic enrichment to data sets to provide better discoverability across chemistry data that will enable increased collaboration among researchers.
David Leeming is the Royal Society of Chemistry’s Solutions Manager. He is an experienced leader, project manager and business analyst in the digital publishing business, with expertise in building innovative platforms and solutions. He works with a number of technologies, with specific interest in online publishing, XML, sprint methodologies and MarkLogic. He manages teams of highly skilled engineers, analysts and an outsourced development unit based in India.
Follow ODBMS.org and ODBMS Industry Watch on Twitter:
“My own personal opinion is that data analysis is much less important than data re-analysis. It’s hard for a data team to get things right on the very first try, and the team shouldn’t be faulted for their honest efforts. When everything is available for review, and when more data is added over time, you’ll increase your chances of converging to someplace near the truth.”–Jules J. Berman.
I have interviewed Jules J. Berman, former President of the Association for Pathology Informatics. The focus of the interview is on how to manage Big Data.
Q1. In your experience what are the common mistakes that endanger most Big Data projects?
Jules J. Berman: Overconfidence is the biggest culprit. The creators of Big Data resources like to believe that they have collected all the data relevant to their domain, that all of the data is accurate, and that the data is organized in a manner that supports meaningful data searches. The Big Data analysts like to believe that their results and conclusions are correct. Hah!
Q2. How do you organize large volumes of complex data? Any insights you could give us on this?
Jules J. Berman: Large volumes of Big Data are organized the same way that humans organize the large volumes of complex data held in their brains: through classification. We could not cope with all the sensory input we receive each day if we did not bin visual objects into categories.
There is a science to constructing classifications, and if the science is misapplied, then the complex data objects held in a Big Data resource cannot be sensibly retrieved, or collected with objects to which they are logically related. Novices to the field make two common errors: confusing properties with classes (e.g., creating red-colored objects as a new class), or assigning a part of an object as a subclass (e.g., making “legs” a subclass of “person”). Just like any other science, the science of classification must be studied, practiced, and mastered.
Q3. You have been working on data permanence: what does it mean in practice? How can it be achieved when the content of the data is constantly changing?
Jules J. Berman: Everyone knows the slogan from Orwell’s masterpiece, 1984: “Big Brother is watching you”. If you’ve read the book, you’ll remember that there was another major theme; one that involved data mutability. The minions of Big Brother were constantly fiddling with collected data to distort reality. Because Big Brother held all the data, Big Brother could create perceptions of reality that suited the totalitarian state.
I see the problem of data mutability (i.e., the ability to modify, delete, or fabricate data) as being much more important than issues related to over-surveillance. In hospitals, the regrettable act of “retro-noting” (i.e., inserting patient notes out of sequence to cover omissions, or to justify billing, or to eradicate errors), is an example of data mutability.
The solution involves employing time stamps and metadata, and procedures that block data erasures. Data mutability, and the related topic of missing legacy data, are two of my favorite issues, and they are both covered in my book, “Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information.”
Q4. Data verification: what are the challenges?
Jules J. Berman: The biggest challenge involves getting data analysts to take the topic of data verification seriously. I personally know data scientists who have the attitude that data verification is “not my job.” They believe that they have no control over the data; their job is to do the best they can with the data they receive.
I think that we really must get everybody on board with the idea that data needs to be verified. The task of creating verified sets of data is child’s play compared with the professional issues instigated by recalcitrant data scientists.
Q5. Data validation: what are the challenges?
Jules J. Berman: There are many ways of thinking about validation, but my perception is that most people in the field are approaching validation as a post-analytic process, wherein old conclusions are tested on new data, or tested on alternate data sources, or are re-calculated on a regular basis. The validation process is aimed at determining whether what seems true for me today will be true for you and me, today and tomorrow.
Like anything else in Big Data, it requires work and vigilance, and a delay in gratification.
Q6. Are there any general methods for data verification and validation that can be specifically applied to Big Data resources?
Jules J. Berman: There’s a large literature out there on this subject. In my opinion, the methods are not as important as the documentation. Protocols must be written, actions must be recommended, and steps must be taken to implement corrections. If you’re serious about Big Data, you must be serious about documenting everything: how you found errors, what you did to correct the errors, what you did to make sure that future errors of the same kind will not occur, what you did to monitor the occurrence of future errors of the same type. It never seems to end, but it’s just part of the job.
Q7. How would you find relationships among data objects held in disparate Big Data resources: Could you give us some examples?
Jules J. Berman: In my book Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information. published in May, I give real-world examples for all of the points raised in this interview, but my favorite “reach” into disparate data involves some inventive research into the sinking of the Titanic.
Here’s an excerpt from my book: “A recent headline story explains how century old tidal data plausibly explained the appearance the iceberg that sank the titanic, on April 15, 1912. Records show that several months earlier, in 1912, the moon earth and sun aligned to produce a strong tidal pull, and this happened when the moon was the closest to the earth in 1,400 years. The resulting tidal surge was sufficient to break the January Labrador ice sheet, sending an unusual number of icebergs towards the open North Atlantic waters.
The Labrador icebergs arrived in the commercial shipping lanes four months later, in time for a fateful rendezvous with the Titanic. Back in January 1912, when tidal measurements were being collected, nobody foresaw that the data would be examined a century later.”
Of course, the finest tool for finding relationships among data objects held in disparate Big Data resources is the human brain. Good data analysts spend lots of time surveying the data held in various resources. When you spend the time, the inspirational moments will come, and you will begin to synthesize new relationships among data from different knowledge domains. Typically, analysis follows inspiration; not vice versa.
Q8. Data integration: how can data be extracted and integrated with data from other resources?
Jules J. Berman: Of course, standards, specifications, and metadata play an important role.
The Holy Grail in the Big Data field involves finding and implementing standard methods for organizing and tagging data, so that every piece of data held on any computer, can be linked and combined into a virtual Super-Big Data resource.
On a less grand scale, it’s always nice when workers in a common field collect their data in a standard form.
In most cases, I’ve been favoring specifications over standards. Data standards seldom, if ever “fit” your data correctly, are prone to re-versioning, often cost money, and usually come with a fine-print license that restricts how the standards are used and how your annotated data are distributed. Specifications are recommendations for describing data; RDF is a good example. Specifications provide the flexibility required for complex data, but the structure required for data integration.
A smart data manager can do a lot more with a specification than with a standard.
Q9. What about Big Data sharing?
Jules J. Berman: Data sharing is absolutely essential to the field of data science. If the data upon which your assertions are based is unavailable to the public, then why would anyone believe your results and conclusions?
In the Big Data realm, there are lots of things that can go wrong with a data analysis project. The chances that any new analysis is correct, on first pass, is slim-to-none. Everything must be repeated over and over, critiqued, and validated on fresh data.
My own personal opinion is that data analysis is much less important than data re-analysis. It’s hard for a data team to get things right on the very first try, and the team shouldn’t be faulted for their honest efforts. When everything is available for review, and when more data is added over time, you’ll increase your chances of converging to someplace near the truth.
Jules Berman received two baccalaureate degrees from MIT; in Mathematics, and in Earth and Planetary Sciences. He received the Ph.D. from Temple University, and the M.D. from the U. of Miami.
He received post-doctoral training at NIH and residency training at Geo. Washington U Med Ctr. He is board certified in anatomic pathology and in cytopathology. He served as Chief of Anatomic Pathology, Surgical Pathology and Cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and the Johns Hopkins Medical Institutions. In 1998, he became a Medical Officer at the U.S. National Cancer Institute and served as the Program Director for Pathology Informatics in the Institute’s Cancer Diagnosis Program. In 2006, Jules Berman was President of the Association for Pathology Informatics. In 2011 he received the Lifetime Achievement Award from the Association for Pathology Informatics. Today, Jules Berman is a free-lance writer. He has first-authored more than 100 articles and 11 book titles in science and medicine.
- Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, Jules J Berman, Ph.D., M.D. Paperback: 288 pages, Morgan Kaufmann; 1 edition (June 13, 2013), ISBN-10: 0124045766
Follow ODBMS.org on Twitter: @odbmsorg
“A true columnar store is not only about the way you store data, but the engine and the optimizations that are enabled by the column store”–Shilpa Lawande.
On the subject of column stores, and what are the important features in the new release of the HP Vertica Analytics Platform, I have interviewed Shilpa Lawande, VP Engineering & Customer Experience at HP Vertica.
Q1. Back in 2011 I did an interview with you  at the time Vertica was just acquired by HP. What is new in the current version of Vertica?
Shilpa Lawande: We’ve come a long way since 2011 and our innovation engine is going strong!
From “Bulldozer” to “Crane” and now “Dragline,” we’ve built on our columnar-compressed, MPP share-nothing core, expanded security and manageability, dramatically expanded data ingestion capabilities, and what’s most exciting is that we’ve added a host of advanced analytics functions and extensibility APIs to the HP Vertica Analytics Platform itself. One key innovation is our ability to ingest and auto-schematize semi-structured data using HP Vertica Flex Zone, which takes away much of the friction in the analytic life-cycle from exploration to production.
We’ve also grown a vibrant community of practitioners and an ecosystem of complementary tools, including Hadoop.
Dragline, our next release of the HP Vertica Analytics Platform addresses the needs of the most demanding, analytic-driven organizations by providing many new features, including:
- Project Maverick’s Live Aggregate Projections to speed up queries that rely on resource- intensive aggregate functions like SUM, MIN/MAX, and COUNT.
- Dynamic mixed workload management, which identifies and adapts to varying query complexities — simple and ad-hoc queries as well as long-running advanced queries — and dynamically assigns the appropriate amount of resources to ensure the needs of all data consumers
- HP Vertica Pulse, which helps organizations leverage an in-database sentiment analysis tool that scores short data posts, including social data, such as Twitter feeds or product reviews, to gauge the most popular topics of interest, analyze how sentiment changes over time and identify advocates and detractors.
- HP Vertica Place, which stores and analyzes geospatial data in real time, including locations, networks and regions.
- An expanded SQL-on-Hadoop offering that gives users freedom to pick their data formats and where to store it, including HDFS, but still benefit from the power of the Vertica analytic engine. OF course, there’s a lot more to the “Dragline” release, but these are the highlights.
Q2. Vertica is referred to as an analytics platform. How does it differentiate with respect to conventional relational database systems (RDBMSes)?
Shilpa Lawande: Good question. First, let me clear the misconception that column stores are not relational – Vertica is a relational database, an RDBMS – it speaks tables and columns and standard SQL and ODBC, and like your favorite RDBMS, talks to a variety of BI tools. Now, there are many variations in the database market from low-cost solutions that lack advanced analytics to high-end solutions that can’t handle big data.
HP Vertica is the only one purpose-built for big data analytics – most conventional RDBMS were purpose- built for OLTP and then retrofitted for analytics. Vertica’s core architecture with columnar storage, a columnar engine, aggressive use of data compression, our scale-out architecture, and, most importantly, our unique hybrid load architecture enables what we call real-time analytics, which gives us the edge over the competition.
You can keep loading your data throughout the day — not in batch at night — and you can query the data as it comes in, without any specialized indexes, materialized views, or other pre- processing. And we have a huge and ever-growing library of features and functions to explore and perform analytics on big data–both structured and semi-structured. All of these core capabilities add up to a powerful analytics platform–far beyond a conventional relational database.
Q3. Vertica is column-based. Could you please explain what are the main technological differences with respect to a conventional relational database system?
Shilpa Lawande: It’s about performance. A conventional RDBMS is bottlenecked with disk I/O.
The reason for this is that with a traditional database, data is stored on disks in a row-wise manner, so even if the query needs only a few columns, the entire row must be retrieved from disk. In analytic workloads, often there are hundreds of columns in the data and only a few are used in the query, so row-oriented databases simply don’t scale as the data sets get large.
Vendors who offer this type of database often require that you create indexes and materialized views to retrieve the relevant data in a reasonable about of time. With columnar storage, you store data for each column separately, so that you can grab just the columns you need to answer the query. This can speed query times immensely, where hour-long queries can happen in minutes or seconds. Furthermore, Vertica stores and processes the data sorted, which enables us to do all manner of interesting optimizations to queries that further boost performance.
Some of the traditional database vendors out there claim they now have columnar store, but a true columnar store is not only about the way you store data, but the engine and the optimizations that are enabled by the column store.
For instance, an optimization called late materialized allows Vertica to delay retrieval of columns as late as possible in query processing so that minimal I/O and data movement is done until absolutely necessary. Vertica is the only engine that is true columnar; everything else out there is a retrofit of a general purpose engine that can read some kind of a columnar format.
Q4. What is so special of Vertica data compression?
Shilpa Lawande: The capability of Vertica to store data in columns allows us to take advantage of the similar traits in data. This gives us not only a footprint reduction in the disk needed to store data, but also an I/O performance boost — compressed data takes a shorter time to load. But, even more importantly, we use various encoding techniques on the data itself that enable us to process the data without expanding it first.
We have over a dozen schemes for how we store the data to optimize its storage, retrieval, and processing.
Q5. Vertica is designed for massively parallel processing (MPP). What is it?
Shilpa Lawande: Vertica is a database designed to run on a cluster of industry-standard hardware.
There are no special- purpose hardware components. The database is based on a shared-nothing architecture, where many nodes each store part of the database and do part of the work in processing queries. We optimize the processing so much as to minimize data traffic over the network. We have built-in high availability to handle node failures. We also have a sophisticated elasticity mechanism that allows us to efficiently add and remove nodes from the cluster. This enables us to scale-out to very large data sizes and handle very large data problems. In other words, it is massively parallel processing!
Q6. In the past, columnar databases were said to be slow to load. Is it still true now?
Shilpa Lawande: This may have been true with older unsophisticated columnar databases. We have customers loading over 35 TB data / hour into Vertica, so I think we’ve put that one squarely to rest.
Q7. Who are the users ready to try column-based “data slicers”? And for what kind of use cases?
Shilpa Lawande: Vertica is a technology broadly applicable in many industries and in many business situations. Here are just a few of them.
Data Warehouse Modernization – the customer has some underperforming solution for data warehouse in place and they want to replace or augment their current analytics with a solution that will scale and deliver faster analytics at an overall lower TCO that requires substantially less hardware resources.
Hadoop Acceleration – the customer has bought into Hadoop for a data lake solution and would like a more expressive and faster SQL-on-Hadoop solution or an analytic platform that can offer real-time analytics for production use.
Predictive analytics – the customer has some kind of machine data, clickstream logs, call detail records, security event data, network performance data, etc. over long periods of time and they would like to get value out of this data via predictive analytics. Use-cases include website personalization, network performance optimization, security thread forensics, quality control, predictive maintenance, etc.
Q8. What are the typical indicators which are used to measure how well systems are running and analyzing data in the enterprise? In other words, how “good” is the value derived from analyzing (Big) Data?
Shilpa Lawande: There are many, many advantages and places to derive value from big data.
First, just having the ability to answer your daily analytics faster can be a huge boost for the organization. For example, we had one brick-and-mortar retailer who wanted to brief sales associates and managers daily on what the hottest selling products were, who had inventory and other store trends. With their legacy analytics system, they could not deliver analytics fast enough to have these analytics on hand. With Vertica, they now provide very detailed (and I might add graphically pleasing) analytics across all of their stores, right in the hands of the store manager via a tablet device. The analytics has boosted sales performance and efficiency across the chain. The user experience they get wouldn’t be possible without the speed of Vertica.
But what is most exciting to me is when Vertica is used to save lives and the environment. We have a client in the medical field who has used Vertica analytics to better detect infections in newborn infants by leveraging the data they have from the NICU. It’s difficult to detect infections in newborns because they don’t often run a fever, nor can they explain how they feel. The estimate is that this big data analytics has saved the lives of hundreds of newborn babies in the first year of use. Another example is the HP Earth Insights project, which used Vertica to create an early warning system to identify species threatened by destruction of tropical forests around the world.
This project done in cooperation with Conservation International is making an amazing difference to scientists and helping inform and influence policy decisions around our environment.
There are a LOT of great use cases like these coming out of the Vertica community.
Q9. What are the main technical challenges when analyzing data at speed?
Shilpa Lawande: In an analytics system, you tend to have a lot going on at the same time. There are data loads, both in batch and trickle loads. There is daily and regular analytics for generating daily reports. There may be data discovery where users are trying to find value in data. Of course, there are dashboards that executives rely upon to stay up to date. Finally, you may have ad-hoc queries that come in and try to take away resources. So perhaps the biggest challenge is dealing with all of these workloads and coming up with the most efficient way to manage it all.
We’ve invested a lot of resources in this area and the fruit of that labor is very much evident in the “Dragline” release.
Q10. Do you have some concrete example of use cases where HP Vertica is used to analyze data at speed?
Shilpa Lawande: Yes, we have many, see here.
Q11. How HP Vertica differs with respect to other analytical platforms offered by competitors such as IBM, Teradata, to in-memory databases such as SAP HANA?
Shilpa Lawande: Vertica offers everything that’s good about legacy data warehouse technologies like the ability to use your favorite visualization tools, standard SQL, and advanced analytic functionality.
In general, the legacy databases you mentioned are pretty good at handling analysis of business data, but they are still playing catch-up when it comes to big data – the volume, variety, and velocity. A row store simply cannot deliver the analytical performance and scale of an MPP columnar platform like Vertica.
In-memory databases are a good acceleration solution for some classes of business analytics, but, again, when it comes to very large data problems, the economics of putting all the data in memory simply do not work. That said, Vertica itself has an in-memory component which is at the core of our high-speed loading architecture, so I believe we have the best of both worlds – ability to use memory where it matters and still support petabyte scales!
Shilpa Lawande has been an integral part of the Vertica engineering team from its inception to its acquisition by HP in 2011. Shilpa brings over 15 years of experience in databases, data warehousing and grid computing to HP/Vertica.
Besides being responsible for Vertica’s Engineering team, Shilpa also manages the Customer Experience organization for Vertica including Customer Support, Training and Professional Services. Prior to Vertica, she was a key member of the Oracle Server Technologies group where she worked directly on several data warehousing and self-managing features in the Oracle Database.
Shilpa is a co-inventor on several patents on query optimization, materialized views and automatic index tuning for databases. She has also co-authored two books on data warehousing using the Oracle database as well as a book on Enterprise Grid Computing. She has been named to the 2012 Women to Watch list by Mass High Tech and awarded HP Software Business Unit Leader of the year in 2012.
Shilpa has a Masters in Computer Science from the University of Wisconsin-Madison and a Bachelors in Computer Science and Engineering from the Indian Institute of Technology, Mumbai.
Follow ODBMS.org on Twitter: @odbmsorg
” Hadoop continue to mature with regards to structuring data and interactive query, so future overlap between Hadoop and OLAP will increase.”– John Schroeder.
I have interviewed John Schroeder, CEO and Cofounder of MapR Technologies. Main topics of the interview are managing Big Data projects and how the Hadoop market is evolving.
Q1. What are the most common problems and challenges encountered in Big Data projects?
John Schroeder: First of all there is no single Big Data use case. Applications cut across industries and involve a wide variety of data sources. These projects can result in revenue gains, cost reductions or risk mitigation. While the challenges for these projects also vary, we see customers embracing our platform to deal with common challenges in meeting mission critical service levels, addressing real-time response pressures, and supporting multiple users and applications.
Q2. How do you see the Hadoop market evolving?
John Schroeder: We have leading customers in diverse industries who are using Hadoop to drive operational analytics, customer examples include performing 100B ad auctions a day, fraud detection for over 100 million card holders and real-time adjustments to improve fleet efficiency. These examples require the right architecture to support streaming writes so data can be constantly writing to the system while analysis is being conducted; high performance to meet the business needs and real-time operations; and the ability to perform online database operations to react to the business situation and impact business as it happens not producing a batch to report days or weeks later.
Q3. Is Hadoop really replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?
John Schroeder: Hadoop’s impact is more disruptive than a replacement for OLAP technologies that have been in the market since the 90s. Customers deploy use cases on Hadoop that were not feasible or cost effective using these traditional technologies. For example, the use of clustering algorithms and recommendation engines that can be run much more frequently against much larger datasets open opportunities for use cases that drive new revenue streams.
Hadoop is also more powerful for unstructured data. So while we do see customers offload data warehouse processing on MapR, most MapR customers are deploying net new use cases. The business impact is the net new growth of analytic use cases is being done on Hadoop.
Hadoop is not currently a direct replacement to OLAP or an Enterprise Data Warehouse, for that matter. These technologies will continue to have their place. Hadoop does not require schema definition or structuring of data. In fact, acting as a Datahub, Hadoop can be quite complementary to these by offloading processing and data from these systems. The average cost to store data in a data warehouse is $16,000/terabyte. The cost for MapR is less than $1000/terabyte. OLAP engines leverage data that has been transformed and processed into precise schemas. They can perform very well for well understood problems. One of the benefits of Hadoop is that you don’t need to understand the questions you are going to ask ahead of time, you can combine many different data types and determine required analysis you need after the data is in place. Hadoop continue to mature with regards to structuring data and interactive query, so future overlap between Hadoop and OLAP will increase.
Q4. Organizations embracing Hadoop often struggle to empower large groups of business analysts who require sophisticated SQL and BI tools to do their jobs. How do you handle this problem?
John Schroeder: MapR has the broadest support concerning SQL-in-Hadoop and SQL-on-Hadoop. Hive, Drill, Spark and Impala continue to mature as technologies. We are consultative to our customers assisting them to select the technology best suited to their use case. These technologies are rapidly evolving so we assist in “future proofing” the SQL technology selection to reduce technology lock in. In the case of large groups of business analysts and users we’re very excited about our partnership with HP Vertica. HP Vertica runs natively within the MapR platform and it provides full 100% ANSI SQL support to users. MapR also supports a broad range of SQL solutions designed specifically for Hadoop.
MapR also provides a standard file-based interface so any tool that uses enterprise storage systems can easily access data directly in MapR.
With MapR, you are in charge. You decide what you want to use to query your data; we focus on providing a reliable, scalable and affordable platform with full enterprise support.
Q5. How do you define the Total Cost of Ownership for Big Data architecture?
John Schroeder: There are many factors that drive TCO. The cost of storing data in MapR can be 50 to 100 times cheaper than other analytic platforms. MapR has innovated at the architecture level to drive many important areas to result in a much lower TCO, these include hardware performance and efficiency that results in a much smaller footprint which saves on hardware, operations and management costs. We have had customers tell us that they would need to deploy clusters 2-5 times larger with other distributions for the same workloads. We have also spent a great deal of time on the underlying data platform to provide high availability, reliability, and serviceability to make a MapR deployment extremely efficient. When customers are deploying an in-Hadoop database, MapR provides many TCO advantages. Our M7 Database Edition is an in-Hadoop NoSQL database that addresses HBase limitations by eliminating region servers, eliminating compactions and automating table management to support continuous, low latency on-line applications.
Q6. Is YARN expanding Hadoop use cases in the enterprise? And if yes, how?
John Schroeder: Much has been talked about Hadoop 2.x and YARN and how it promises to expand Hadoop beyond MapReduce. YARN’s promise is to enable multiple execution frameworks to run on top of Hadoop, thereby expanding the Hadoop use cases beyond batch into interactive, real-time and others. At its core, YARN is a resource allocation framework that allows for execution frameworks such as classical MapReduce, and also newer ones like interactive SQL-on-Hadoop, streaming, and others to ask for and receive CPU and memory resources on the cluster for a period of time. YARN’s power is in making the resource allocation of a Hadoop cluster a more streamlined and centralized decision, thereby allowing for more efficient cluster use and more importantly, opening up Hadoop for emerging use cases. We’re happy to include YARN in MapR’s distribution and have uniquely enhanced YARN to allow both Map Reduce V1 and Map Reduce V2 applications to run simultaneously on the same cluster to reduce the barrier to YARN adoption.
Q7. Do you have any metrics to define how good is the “value” that can be derived by analyzing Big Data?
John Schroeder: We have customers that get 50X the performance at 1/50 the cost. We have other customers that have ROI over 1000X because of better approaches to drive revenue. We have other customers whose entire business model is built on the advantages that Hadoop provides. Earlier, I pointed out operational workloads that allow customers to dramatically transform their businesses, these are the applications that really drive value for organizations.
Beyond top line or cost savings value is the ability to support use cases that were not feasible before MapR.
MapR is key to Rubicon running Internet ad exchanges and comScore’s ability to measure what people do as they navigate the digital world.
Q8. What are the benefits of MapR’s Hadoop Distribution on the Google Compute Engine at Google I/O?
John Schroeder: Through the Google Compute Engine infrastructure, MapR makes big data accessible to any size business by leveraging the Google Compute Engine to provide a high performance, scalable, predictable, and easy to provision Hadoop infrastructure.
With respect to the scale and performance advantages, using MapR, Google was able to demonstrate a significant Hadoop price/performance breakthrough. We were able to run the Hadoop TeraSort benchmark to sort 1TB of data in a world-record setting time of 54 seconds on a 1003-node cluster that Google provided for our use. This broke the previous world record with approximately one third the number of cores.
Q9. You recently announced the early access release of the new HP Vertica Analytics Platform on MapR. What are the benefits of such cooperation for the enterprise?
John Schroeder: MapR and Vertica together demonstrate technical leadership in providing the best-of-breed SQL-on-Hadoop solution for enterprises. HP Vertica and MapR produce a comprehensive, tightly integrated, scalable, open-standards big data platform solution. There is no need to manage a dual cluster environment.
MapR is the only platform that could integrate an MPP analytic platform natively on Hadoop without requiring connectors or external tables in order for the MPP platform to interact with Hadoop data. With this integration, HP Vertica works as a native application on top of MapR, sharing the cluster resources with other Hadoop frameworks and applications.
The storage utilization of each application is dynamic and grows to the needs of business without requiring pre-allocation of file system space for HP Vertica. The architecture also allows customers to leverage MapR’s consistent snapshots and mirroring to provide point-in-time recovery and disaster recovery for HP Vertica with practically no effort.
For analysts, data scientists, and business users wanting more analytical power and faster ability to drive business decisions and execution, HP Vertica delivers the industry’s most advanced SQL-on-Hadoop analytics directly on MapR for higher performance and lower TCO.
Qx Anything else you wish to add?
John Schroeder: Two additional thoughts: data agility and operations.
MapR is investing engineering resources for data agility by decreasing time to value from data. Apache Drill is the only interactive SQL project that is architected for both centrally structured and self-describing data. Requiring DBAs-like work to structure new data sources and the cumbersome process for altering structure, delays time to value from new or changed data. Drill supports query of data structured in HCatalog, but also can query data structures using data-interchange formats like JSON.
Many use cases have batch, interactive and real-time (operational) aspects. Ad exchanges have to store and analyze auctions, but they also have to provide information like yield estimates in real-time to publishers and brands.
Credit fraud has analytic aspects but also have to interact during a credit card swipe. Investment in MapR’s M7 in-Hadoop NoSQL database has, and continues, to provide technology to support those real-time operations and avoid the cost and complexity of a second non-Hadoop platform. We aren’t going to replace and OLTP database, but we can cover many of the operational use cases.
John Schroeder, CEO and Cofounder, MapR Technologies. John has served as MapR’s Chief Executive Officer and Chairman of the Board since founding the company in 2009. Prior to founding MapR, John held executive positions in a number of enterprise software companies with a focus on data, storage and business intelligence at both private and public companies including: CEO of Calista Technologies (now Microsoft), CEO of Rainfinity (now EMC), SVP of Products and Marketing at Brio Technologies (BRYO) and General Manager at Compuware (CPWR).
Follow ODBMS.org and ODBMS Industry Watch on Twitter: @odbmsorg
“There are four things that motivate open source development teams:
1. The challenge/puzzle of programming, 2. Need for the software, 3. Personal advancement, 4. Belief in open source” — Bruce Momjian.
On PostgreSQL and the challenges of motivating and managing open source teams, I have interviewed Bruce Momjian, Senior Database Architect at EnterpriseDB, and Co-founder of the PostgreSQL Global Development Group and Core Contributor.
Q1. How did you manage to transform PostgreSQL from an abandoned academic project into a commercially viable, now enterprise relational database?
Bruce Momjian: Ever since I was a developer of database applications, I have been interested in how SQL databases work internally. In 1996, Postgres was the only open source database available that was like the commercial ones I used at work. And I could look at the Postgres code and see how SQL was processed.
At the time, I also kind of had a boring job, or at least a non-challenging one, writing database reports and applications. So getting to know Postgres was exciting and helped me grow as a developer. I started getting more involved with Postgres. I took over the Postgres website from the university, reading bug reports and fixes, and I started interacting with other developers through the website. Fortunately, I got to know other developers who also found Postgres attractive and exciting, and together we assembled a group of people with similar interests.
We had a small user community at the time, but found enough developers to keep the database feature set moving forward.
As we got more users, we got more developers. Then commercial support opportunities began to grow, helping foster a rich ecosystem of users, developers and support companies that created a self-reinforcing structure that in turn continued to drive growth.
You look at Postgres now and it looks like there was some grand plan. But in fact, it was just a matter of setting up the some structure and continuing to make sure all the aspects continued to efficiently reinforce each other.
Q2. What are the current activities and projects of the PostgreSQL community Global Development Group?
Bruce Momjian: We are working on finalizing the version 9.4 beta, which will feature greatly improved JSON capabilities and lots of other stuff. We hope to release 9.4 in September/October of this year.
Q3. How do you manage motivating and managing open source teams?
Bruce Momjian: There are four things that motivate open source development teams:
* The challenge/puzzle of programming
* Need for the software
* Personal advancement
* Belief in open source
Our developers are motivated by a combination of these. Plus, our community members are very supportive of one another, meaning that working on Postgres is seen as something that gives contributors a sense of purpose and value.
I couldn’t tell you the mix of motivations that drive any one individual and I doubt many people you were to ask could answer the question simply. But it’s clear that a mixture of these motivations really drives everything we do, even if we can’t articulate exactly which are most important at any one time.
Q4. What is the technical roadmap for PostgreSQL?
Bruce Momjian: We are continuing to work on handling NoSQL-like workloads better. Our plans for expanded JSON support in 9.4 are part of that.
We need greater parallelism, particularly to allow a single query to make better user of all the server’s resources.
We have made some small steps in this area in v9.3 and plan to in v9.4, but there is still much work to be done.
We are also focused on data federation, allowing Postgres to access data from other data sources.
We already support many data interfaces, but they need improvement, and we need to add the ability to push joins and aggregates to foreign data sources, where applicable.
I should add that Evan Quinn, an analyst at Enterprise Management Associates, wrote a terrific research report, PostgreSQL: The Quite Giant of Enterprise Database, and included some information on plans for 9.4.
Q5. How do managers and executives view Postgres, and particularly how they make Postgres deployment decisions?
Bruce Momjian: In the early years, users that had little or no money, and organizations with heavy data needs and small profit margins drove the adoption of Postgres. These users were willing to overlook Postgres’ limitations.
Now that Postgres has filled out its feature set, almost every user segment using relational databases is considering Postgres. For management, cost savings are key. For engineers, it’s Postgres’ technology, ease of use and flexibility.
Q6. What is your role at EnterpriseDB?
Bruce Momjian: My primary responsibility at EnterpriseDB is to help the Postgres community.
EnterpriseDB supports my work as a core team member and I play an active role in the overall decision-making and the organization of community initiatives. I also travel frequently to conferences worldwide, delivering presentations on advances in Postgres and leading Postgres training sessions. At EnterpriseDB, I occasionally do trainings, help with tech support, attend conferences as a Postgres ambassador and visit customers. And of course, do some PR and interviews, like this one.
Q7. Has PostgreSQL still something in common with the original Ingres project at the University of California, Berkeley?
Bruce Momjian: Not really. There is no Ingres code in Postgres though I think the psql terminal SQL tool is similar to the one in Ingres.
Q8. If you had to compare PostgreSQL with MySQL and MariaDB, what would be the differentiators?
Bruce Momjian: The original focus of MySQL was simple read-only queries. While it has improved since then, it has struggled to go beyond that. Postgres has always targeted the middle-level SQL workload, and is now targeting high-end usage, and the simple-usage cases of NoSQL.
MySQL certainly has greater adoption and application support. But in almost every other measure, Postgres is a better choice for most users. The good news is that people are finally starting to realize that.
Q9. How do you see the database market evolving? And how do you position PostgreSQL in such database market?
Bruce Momjian: We are in close communication with our user community, with our developers reading and responding to email requests daily. That keeps our focus on users’ needs. Postgres, being an object-relational, extensible database, is well suited to being expanded to meet changing user workloads. I don’t think any other database has that level of flexibility and strong developer/user community interaction.
Qx. Would you like to add something?
Bruce Momjian: We had a strong PG NYC conference recently. I posted a summary about it here.
I think that conference highlights some significant trends for Postgres in the months ahead.
Bruce Momjian co-founded in 1996 the PostgreSQL community Global Development Group, the organization of volunteers that steer the development and release of the PostgreSQL open source database. Bruce played a key role in organizing like-minded database professionals to shepherd PostgreSQL from an abandoned academic project into a commercially viable, now enterprise-class relational database. He dedicates the bulk of his time organizing, educating and evangelizing within the open source database community while acting as Senior Database Architect for EnterpriseDB. Bruce began his career as a high school math and computer science teacher and still serves as an adjunct profession at Drexel University. After leaving high school education, Bruce worked for more than a decade as a database consultant building specialized applications for law firms. He then went on to work for the PostgreSQL community with the support of several private companies before joining EnterpriseDB in 2006 to continue his work in the community. Bruce holds a master’s degree from Arcadia University and earned his bachelor’s degree at Columbia University.
Follow ODBMS.org on Twitter: @odbmsorg
“The Internet of Things is a good fit for NoSQL technologies, as you face the challenge of dealing with huge volumes of data over time. For businesses that wish to scale their IoT implementations and make use of the data that these networks create, NoSQL solutions are a better fit than RDBMS options.”–Mike Williams
I have interviewed Mike Williams, software director for i20 Water. Mike has an interesting use case for NoSQL in the area of water distribution networks. We also discussed how NoSQL can be used for the Internet of Things.
Q1. What is the business of i20 Water?
Mike Williams: i2O is the world’s leading developer of Smart Pressure Management solutions for water distribution networks.
Q2. What are the main benefits for utility companies?
Mike Williams: i2O’s Smart Pressure Management solutions optimise the performance of water distribution networks through improving network visibility, and enabling the remote control and automatic optimisation of network pressures.
These technology-enabled best practices deliver benefits in six key areas, with customers typically achieving return on investment in 6-18 months.
The opportunities for savings fall into two areas: on the network side, we see leakage reduction, energy savings and a big reduction in burst pipes.
For our utility customers, there are business-level returns as well based on improved customer service and operational cost savings. We also see customers being able to extend the life of their assets across the network as well, so they see a real long-term benefit to being able to control water pressure more accurately.
Q3. Do you have any metrics to share with us on the estimated volume of water saved by using your software?
Mike Williams: The best metric we can give is the simple volume of water that we help customers save: we currently help our customers to save over 235 Million Litres of water every day.
Q4. How can big data be used to reduce water leakage?
Mike Williams: The i2O system monitors and controls water pressure throughout a zone or network. This enables water companies to fully optimise water pressures remotely and automatically to meet agreed customer service levels throughout the network. The Big Data is all of this time-series data of pressures and flows (and other metrics) for lots of locations over years’ worth of points.
The solution continuously learns key characteristics within a Zone and then automatically controls the pressure within the Zone to achieve a stable target pressure at the critical point. This is achieved through a sophisticated mathematical algorithm, which automatically generates a control model. The control model is supplied ‘over the air’ from i2O software to the Pressure Reducing Valve (PRV) controller.
The control model is automatically updated if the software detects a significant change in the head loss characteristics. This ensures that no more pressure than is required enters the network on an ongoing basis. The i2O Automatic Optimisation solution is the world’s first – and most widely deployed – system for automatically optimising and remotely controlling water pressure in your network.
i2O’s Automatic PRV Optimisation has delivered significant results in hundreds of zones for major water companies worldwide.
Q5. Could you please describe your IT back-end infrastructure? What are the main challenges you have?
Mike Williams: We have an eco-system of distributed loosely coupled Services, each with access to numerous dedicated and tailored data stores.
These services collaborate through an Event-Driven Architecture (EDA) and a distributed Event Broker to deliver business services to our customers.
The main challenges are around the scalability of these services and the data stores where the vast amount of time-series data are held.
Due to the way we have architected our data, we have the ability to replay these events over history when we make changes to our products and services – we can use this historical data to analyse how devices on our networks respond to changes in circumstances, and measure what difference our new features would provide.
Q6. Why did you make the shift to NoSQL? What were you doing previously?
Mike Williams: The challenge was the overall scale–up of the whole platform. From both a technical and a business stand-point, we needed to scale up for the next years. Based on the number of devices that our customers had in place, and the number of new customers that we were projected to win, our RDBMS was not able to cope by design. Storing time-series data is a specialist need that column-oriented data stores are better suited to than traditional RDBMS row-oriented technologies. We were (and for some customers, still are) using MS SQL Server as the only data store. NoSQL technologies also better fit our data modelling needs such as searching, where we employ the ElasticSearch NoSQL solution in a clustered manner.
Q7. What is your experience of using a NoSQL database so far?
Mike Williams: So far it has been very positive. We had a learning curve to go up and the technologies themselves (Cassandra and ElasticSearch) have matured greatly in the last 18 months that we have been using them.
Q8. What does the future hold for Cassandra in your organisation? How are you using other database types as well?
Mike Williams: We moved over to using Cassandra as our main data store for time-series data – this was because it provides better support for columnar data, as well as meeting the requirement that we had around scale. We are committed to Cassandra for the foreseeable future and so it’s future is to grow alongside our business. We also use ElasticSearch as mentioned, as well as PostgresSQL when our specific data models dictate the use of tabular, relational data.
Q9. Do you work on the cross-over with the Internet of Things?
Mike Williams: Indeed we do as we produce intelligent devices that communicate with our Platform over the Internet (GPRS mobile network). The devices automatically sample for pressure, temperature and other information that can then be compared across the network.
The Internet of Things is a good fit for NoSQL technologies, as you face the challenge of dealing with huge volumes of data over time. For businesses that wish to scale their IoT implementations and make use of the data that these networks create, NoSQL solutions are a better fit than RDBMS options.
The ability to capture information from across our network has two key value propositions: the first is for our customers right now, as they can manage their water pressure more effectively. The second is the long term value that the data can provide. By being able to model and re-use historical data, we can offer much more value to customers than they can achieve by themselves. We can add new features to our platform, and demonstrate how these new features can provide greater opportunities to save money and water for customers.
Mike Williams is the software director for i20 Water. He has 25 years of experience working for innovative high-tech companies in finance, payments, process engineering and the environment, where he has focused on solving problems and translating business challenges into tangible technical solutions. Before joining i2O Water Mike was the Chief Software Architect and head of development for Bottomline Technologies, the leading supplier of banking transaction and payments software.
Mike is also an Agile Coach and has helped transform numerous businesses as they become Agile in their approaches to business as a whole and not just software development. Mike is the organiser and founder of the Agile South Coast group.
Follow ODBMS.org on Twitter: @odbmsorg
“Spring for Apache Hadoop together with Spring XD is used by many large organizations for developing new big data apps for stream processing and using HDFS for storage.”–Thomas Risberg.
On Spring for Apache Hadoop, I have interviewed Thomas Risberg, Software Engineer focusing on Big Data at Pivotal.
Q1. What is new with the current Spring for Apache Hadoop?
Thomas Risberg: The main focus for Spring for Apache Hadoop version 2.0 is to support new distributions based on Hadoop v2 including support for YARN. We provide backwards compatibility with Hadoop v1 based distributions, so as an end-user you can choose when to move to a new Hadoop version.
In Spring for Apache Hadoop 2.0 we are adding YARN application development support in addition to improvements in the HDFS and MapReduce support. We are introducing a Spring Boot based programming model for easy YARN app development. The main goal is to provide a simplified development experience so the developer can focus on getting the business logic implemented and not having to worry about the “plumbing”.
Q2. How do you measure the “effectiveness” of how Spring for Apache Hadoop simplifies developing Apache Hadoop?
Thomas Risberg: The main differentiator would be developer productivity where Spring for Apache Hadoop provides support for the infrastructure plumbing code and configuration allowing the developer to focus on code that brings business value. The most common measurement would be how long completing a project would take compared to the same project being developed without Spring.
Q3. What is the rationale for offering a unified configuration model and an APIs for using HDFS, MapReduce, Pig, and Hive?
Thomas Risberg: Big Data workflow development usually involves parts that are executed on Hadoop and parts that are executed or at least interact with resources outside of Hadoop. So, a unified configuration strategy will help developers when they move between different parts of the workflow. The basic configuration used across all Hadoop components is based on Spring and therefore similar whether working on a Hive job or an FTP file transfer job.
Q4. How does Spring for Apache Hadoop relate to the Spring Data project, and in general to the overall Spring ecosystem of projects?
Thomas Risberg: Spring for Apache Hadoop is part of the Spring Data umbrella project, but not part of the “release train” that most other Spring Data projects are part of. The reason is that the coupling is not very tight with other Spring Data projects. Spring for Apache Hadoop doesn’t directly use other Spring Data components although anyone using Spring for Apache Hadoop can use the Spring Data MongoDB project when exporting data from HDFS to MongoDB.
Q5. What about the integration with other software systems that are not part of the Spring ecosystem?
Thomas Risberg: In terms of Hadoop we have integration with Hive, Pig and HBase. We also use features from projects outside of the Apache Hadoop family. One example is Kite SDK which is a project that started out at Cloudera but is now a separate project that can be used with any Hadoop distribution. Other examples would be a JSON library like Jackson etc. The whole Spring IO platform uses 689 third party libraries so managing all of this is crucial for anyone using Spring. That is the main motivating factor behind the new Spring IO platform that provides a unified set of dependency versions across all Spring projects.
Q6. Could you give us some detail on how you handle big data ingest/export: e.g. from enterprise databases into Hadoop and vice versa? How is this different than conventional ETL?
Thomas Risberg: We rely on Spring Batch functionality for this task, so in case HDFS ingest isn’t different than loading data into any other data store. It’s simply a different batch writer. Spring Batch is a proven technology that is the bases for JSR-352 and is now certified as a JSR-352 compliant implementation. We support import/export with most relational databases that have a JDBC driver and also with many NoSQL stores that have Spring Data support like MongoDB, Cassandra, Couchbase or Redis.
Q7. What is the main contribution of Spring Data Hadoop to the Hadoop workflow and security?
Thomas Risberg: Spring for Apache Hadoop allows the developer to treat Hadoop workloads the same way as they would approach any workflow problem. Just because Hadoop is involved doesn’t have to mean that you need to use new and different tools from what you are used to. In terms of security we use what Hadoop itself provides and haven’t so far attempted to integrate that with Spring Security.
Q8. Do you also offer tools for analyzing Big Data? If yes, which ones?
Thomas Risberg: Spring XD provides integration with PMML analytics via a plug-in module. That module integrates with the JPMML-Evaluator library that provides support for a wide range of model types and is interoperable with models exported from R, Rattle, KNIME, and RapidMiner. Pivotal also provides MADLib, developed in collaboration with researchers at UC Berkeley and a growing world wide user community. This library is typically used with Pivotal’s Greenplum database or HAWQ which is the SQL engine that is part of Pivotal’s Hadoop distribution.
Q9. In which situations Spring Data Hadoop can add value, and in which situations would it be a poor choice?
Thomas Risberg: It definitely adds a lot of value if you are already using Spring in your workflow and just want to add some Hadoop functionality. It also makes sense if you are using Java and would like to take advantage of Spring’s dependency injection approach when developing your enterprise applications. It would make less sense for an organization that is not using Java as their development language or someone that already have a working solution using other tools that they are happy with.
Q10 Who is currently using Spring Data Hadoop and for which projects/business problems?
Thomas Risberg: I can’t name names, but Spring for Apache Hadoop together with Spring XD is used by many large organizations for developing new big data apps for stream processing and using HDFS for storage. Industries include telecommunications, equipment manufacturing, retail and finance. I’ve mentioned Spring XD and Spring for Apache Hadoop is a key component of this project. Spring XD is Pivotal’s new Spring project providing a unified, distributed, and extensible system for data ingestion, real time analytics, batch processing, and data export. The Spring XD project’s goal is to simplify the development of big data applications.
Thomas Risberg, Software Engineer focusing on Big Data, Pivotal, New Hampshire, USA
My current focus is on the “Spring XD”, “Spring for Apache Hadoop” and “Spring Data JDBC Extensions” projects. I’m a co-author of “Spring Data, Modern Data Access for Enterprise Java” published by O’Reilly Media in 2013 and “Professional Java Development with the Spring Framework” published by Wiley in 2005.
Follow ODBMS.org on Twitter: @odbmsorg