Integrating Enterprise Search with Analytics. Interview with Jonathan Ellis.
“Enterprise Search implies being able to search multiple types of data generated by an enterprise. DSE2 takes this to the next level by integrating this with a real time database AND powerful analytics.” — Jonathan Ellis.
I wanted to learn more about DataStax Enterprise 2.0, the new release of the commercial edition of Cassandra. I interviewed Jonathan Ellis, CTO and co-founder of DataStax and project chair of Apache Cassandra.
Q1. What are the new main features of DataStax Enterprise 2.0 (DSE 2.0)?
Jonathan Ellis: The one I’m most excited about is the integration of Solr support. This is not a side-by-side system like Solandra, where Solr has to maintain a separate copy of the data to be indexed, but full integration with Cassandra: you insert your data once, and access it via Cassandra, Solr, or Hadoop.
Search is an increasingly important ingredient for modern applications, and DSE2 is the first to offer fully integrated, scalable search to developers.
DSE2 also includes Apache Sqoop for easy migration of relational data into Cassandra, and plug-and-play log indexing for your application.
Q2. How does the integration of Cassandra with Apache Hadoop and Apache Solr work technically?
Jonathan Ellis: Basically, Cassandra offers a pluggable index API, and we created a Solr-backed index implementation with this.
We wrote about the technical details here.
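As a rough illustration of the pluggable-index idea (this is not DSE's code, nor Cassandra's real index API), the toy Python sketch below has a store that notifies every registered index on each insert, so data is written once and can then be reached both by key and by text lookup. All class and method names here are hypothetical.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy stand-in for a search-backed index: maps each word to row keys."""

    def __init__(self, column):
        self.column = column
        self.postings = defaultdict(set)
        self._indexed = {}  # row key -> words currently indexed for that row

    def on_insert(self, key, row):
        # Drop the postings left by any earlier version of this row first.
        for word in self._indexed.pop(key, set()):
            self.postings[word].discard(key)
        words = set(str(row.get(self.column, "")).lower().split())
        for word in words:
            self.postings[word].add(key)
        self._indexed[key] = words

    def search(self, word):
        return sorted(self.postings.get(word.lower(), set()))


class Store:
    """Toy key-value store with a hook for pluggable indexes."""

    def __init__(self):
        self.rows = {}
        self.indexes = []

    def register_index(self, index):
        self.indexes.append(index)

    def insert(self, key, row):
        self.rows[key] = row          # single write path
        for index in self.indexes:    # every registered index sees the write
            index.on_insert(key, row)

    def get(self, key):
        return self.rows.get(key)


store = Store()
bio_index = InvertedIndex("bio")
store.register_index(bio_index)
store.insert("u1", {"name": "Ada", "bio": "distributed systems engineer"})
store.insert("u2", {"name": "Lin", "bio": "search engineer"})

by_key = store.get("u1")["name"]         # lookup through the store
by_text = bio_index.search("engineer")   # lookup through the index
```

The point of the design is that the application never writes to the index directly, so there is no second copy to keep in sync by hand.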
Q3. What exactly is Enterprise Search, and why did you choose Apache Solr?
Jonathan Ellis: Enterprise Search implies being able to search multiple types of data generated by an enterprise. DSE2 takes this to the next level by integrating this with a real time database AND powerful analytics.
Solr is the gold standard for search, much the same way that Hadoop is the gold standard for big data map/reduce and analytics. There’s an ecosystem of tools that build on Solr, so offering true Solr support is much more powerful than implementing a proprietary full-text search engine.
Q4. What are the main business benefits of such integration?
Jonathan Ellis: First, developers and administrators have one database and vendor to concern themselves with instead of multiple databases and many software suppliers. Second, the built-in technical benefits of running both Solr and Hadoop on top of Cassandra yield continuous uptime for critical applications, as well as future-proofing those same apps where growing data volumes and increased user traffic are concerned.
Finally, customers save anywhere from 80 to 90% over traditional RDBMS vendors by going with DSE. For example, Constant Contact estimated that a new project they had in the works would take $2.5 million and 9 months on traditional relational technology, but with Cassandra, they delivered it in 3 months for $250,000. That’s one third the time and one tenth the cost; not bad!
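The Constant Contact figures can be checked with a few lines of arithmetic. The short Python sketch below uses only the numbers quoted in the answer; the variable names are illustrative.

```python
traditional_cost, traditional_months = 2_500_000, 9
cassandra_cost, cassandra_months = 250_000, 3

cost_ratio = cassandra_cost / traditional_cost        # one tenth of the estimate
time_ratio = cassandra_months / traditional_months    # one third of the estimate
savings_pct = 100 * (traditional_cost - cassandra_cost) / traditional_cost

print(f"cost ratio: {cost_ratio:.2f}, time ratio: {time_ratio:.2f}, saved: {savings_pct:.0f}%")
```

For this one project the saving works out to 90%, the top of the 80 to 90% range quoted.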
Q5. It looks like you are attempting to compete with Google. Is this correct?
Jonathan Ellis: DSE2 is about providing search as a building block for applications, not necessarily delivering an off-the-shelf search appliance.
Compared to Google’s AppEngine product, it’s fair to say that DSE 2.0 provides a similar, scalable platform to build applications on. DSE 2.0 is actually ahead of the game there: Google has announced but not yet delivered full-text search for AppEngine.
Another useful comparison is to Amazon Web Services: DSE 2.0 gives you the equivalent of Amazon’s DynamoDB, S3, Elastic Map/Reduce, and CloudSearch in a single, integrated stack. So instead of having to insert documents once in S3 and again in CloudSearch, you just add them to DSE (with any of the Solr, Cassandra, or Hadoop APIs), and you don’t have to write code to keep multiple copies in sync when updates happen.
Q6. How do you manage to run real-time, analytics and search operations in the same database cluster, without performance or resource contention problems?
Jonathan Ellis: DSE offers elastic workload partitioning: your analytics jobs run against their own copies of the data, kept in sync by Cassandra replication, so they don’t interfere with your real time queries. When your workload changes, you can re-provision existing nodes from the real time side to the analytical, or vice versa.
Q7. You do not require ETL software to move data between systems. How does it work instead?
Jonathan Ellis: All DSE nodes are part of a single logical Cassandra cluster. DSE tells Cassandra how many copies to keep for which workload partitions, and Cassandra keeps them in sync with its battle-tested replication.
So your real time nodes will have access to new analytical output almost instantly, and you never have to write ETL code to move real time data into your analytical cluster.
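To make the idea of workload partitions concrete, here is a toy Python model, assuming a cluster split into a "realtime" group and an "analytics" group with a configured number of copies per group. It is a sketch of the concept only: the replica-selection rule, class names, and node names are hypothetical and much simpler than Cassandra's actual replication.

```python
import hashlib


def pick_replicas(key, nodes, count):
    """Choose `count` nodes for a key by walking the list from a hashed start."""
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


class Cluster:
    def __init__(self, groups, copies):
        self.groups = groups    # workload name -> list of node names
        self.copies = copies    # workload name -> replicas to keep there
        self.data = {n: {} for ns in groups.values() for n in ns}

    def write(self, key, value):
        # One logical write; replication fans it out to every workload group.
        for workload, nodes in self.groups.items():
            for node in pick_replicas(key, nodes, self.copies[workload]):
                self.data[node][key] = value

    def read(self, workload, key):
        # Serve the read from a replica inside the requested workload group.
        nodes = self.groups[workload]
        for node in pick_replicas(key, nodes, self.copies[workload]):
            if key in self.data[node]:
                return self.data[node][key]
        return None


cluster = Cluster(
    groups={"realtime": ["rt1", "rt2", "rt3"], "analytics": ["an1", "an2"]},
    copies={"realtime": 2, "analytics": 1},
)
cluster.write("order:42", {"total": 99})

realtime_view = cluster.read("realtime", "order:42")
analytics_view = cluster.read("analytics", "order:42")
```

In this model the application writes once, and an analytics job reading from its own group sees the same row without any export step, which is the property the answer describes.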
Q8. Could you give us some examples of Big Data applications that are currently powered by DSE 2.0?
Jonathan Ellis: A recent example is Healthx, which develops and manages online portals and applications for the healthcare market. They handle things such as enrollment, reporting, claims management, and business intelligence.
They have to manage countless health groups, individual members, doctors, diagnoses, and a lot more. Data comes in very fast, from all over, changes constantly, and is accessed all the time.
Healthx especially likes the new search capabilities in DSE 2.0. In addition to being able to handle real-time and analytic work, their users can now easily perform lightning-fast searches for things like, ‘find me a podiatrist who is female, speaks German, and has an office close to where I live.’
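A query like that one could plausibly be expressed in standard Solr syntax as a fielded query plus a spatial filter. The Python sketch below only builds the request URL and sends nothing. The host, core name (`providers`), field names, and coordinates are all assumptions made for illustration and are not taken from Healthx or DSE.

```python
from urllib.parse import urlencode

params = {
    # Fielded query: every clause must match.
    "q": "specialty:podiatrist AND gender:female AND languages:German",
    # Solr's geofilt filter keeps documents within d kilometres of point pt.
    "fq": "{!geofilt sfield=office_location pt=39.10,-84.51 d=15}",
    "wt": "json",
}

# Hypothetical endpoint; adjust host and core to the actual deployment.
url = "http://localhost:8983/solr/providers/select?" + urlencode(params)
print(url)
```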
Q9. What about Big Data applications which also need to use Relational Data? Is it possible to integrate DSE 2.0 with a Relational Database? If yes, how? How do you handle query of data from various sources?
Jonathan Ellis: Most customers start by migrating their highest-volume, most-frequently-accessed data to DSE (e.g. with the Sqoop tool I mentioned), and leave the rest in a relational system. So RDBMS interoperability is very common at that level.
It’s also possible to perform analytical queries that mix data from DSE and relational sources, or even a legacy HDFS cluster.
Q10. How can developers use DSE 2.0 for storing, indexing and searching web logs?
Jonathan Ellis: We ship a log4j appender with DSE2, so if your log data is coming from Java, it’s trivial to start streaming and indexing that into DSE. For non-Java systems, we’re looking at supporting ingestion through tools like Flume.
Q11. How do you adjust performance and capacity for various workloads depending on the application needs?
Jonathan Ellis: Currently reprovisioning nodes for different workloads is a manual, operator-driven procedure, made easy with our OpsCenter management tool. We’re looking at delivering automatic adaptation to changing workloads in a future release.
Q12. How is DSE 2.0 influenced by DataStax’s partnership with Pentaho Corporation (February 28, 2012) and their Pentaho Kettle?
Jonathan Ellis: A question we get frequently is, “I’m sold on Cassandra and DSE, but I need to not only move data from my existing RDBMSs to you, but transform the data so that it fits into my new Cassandra data model. How can I do that?” With Sqoop, we can extract and load, but nothing else. The free Pentaho solution provides very powerful transformation capabilities to massage the incoming data in nearly every way under the sun before it’s inserted into Cassandra. It does it very fast too, and with a visual user interface.
Q13. Anything else to add?
Jonathan Ellis: DSE 2.0 is available for download now and is free to use, without any restrictions, for development purposes. Once you move to production, we do require a subscription, but I think you’ll find that the cost associated with DSE is much less than any RDBMS vendor.
Jonathan Ellis is CTO and co-founder of DataStax (formerly Riptano), the commercial leader in products and support for Apache Cassandra. Prior to DataStax, Jonathan built a multi-petabyte, scalable storage system based on Reed-Solomon encoding for backup provider Mozy. Jonathan is project chair of Apache Cassandra.