On Apache Ignite, Apache Spark and MySQL. Interview with Nikita Ivanov
“Spark and Ignite can complement each other very well. Ignite can provide shared storage for Spark so state can be passed from one Spark application or job to another. Ignite can also be used to provide distributed SQL with indexing that accelerates Spark SQL by up to 1,000x.”–Nikita Ivanov.
I have interviewed Nikita Ivanov,CTO of GridGain.
Main topics of the interview are Apache Ignite, Apache Spark and MySQL, and how well they perform on big data analytics.
Q1. What are the main technical challenges of SaaS development projects?
Nikita Ivanov: SaaS requires that the applications be highly responsive, reliable and web-scale. SaaS development projects face many of the same challenges as software development projects including a need for stability, reliability, security, scalability, and speed. Speed is especially critical for modern businesses undergoing the digital transformation to deliver real-time services to their end users. These challenges are amplified for SaaS solutions which may have hundreds, thousands, or tens of thousands of concurrent users, far more than an on-premise deployment of enterprise software.
Fortunately, in-memory computing offers SaaS developers solutions to the challenges of speed, scale and reliability.
Q2. In your opinion, what are the limitations of MySQL® when it comes to big data analytics?
Nikita Ivanov: MySQL was originally designed as a single-node system and not with the modern data center concept in mind. MySQL installations cannot scale to accommodate big data using MySQL on a single node. Instead, MySQL must rely on sharding, or splitting a data set over multiple nodes or instances, to manage large data sets. However, most companies manually shard their database, making the creation and maintenance of their application much more complex. Manually creating an application that can then perform cross-node SQL queries on the sharded data multiplies the level of complexity and cost.
MySQL was also not designed to run complicated queries against massive data sets. MySQL optimizer is quite limited, executing a single query at a time using a single thread. A MySQL query can neither scale among multiple CPU cores in a single system nor execute distributed queries across multiple nodes.
Q3. What solutions exist to enhance MySQL’s capabilities for big data analytics?
Nikita Ivanov: For companies which require real-time analytics, they may attempt to manually shard their database. Tools such as Vitess, a framework YouTube released for MySQL sharding, or ProxySQL are often used to help implement sharding.
To speed up queries, caching solutions such as Memcached and Redis are often deployed.
Many companies turn to data warehousing technologies. These solutions require ETL processes and a separate technology stack which must be deployed and managed. There are many external solutions, such as Hadoop and Apache Spark, which are quite popular. Vertica and ClickHouse have also emerged as analytics solutions for MySQL.
Apache Ignite offers speed, scale and reliability because it was built from the ground up as a high performant and highly scalable distributed in-memory computing platform.
In contrast to the MySQL single-node design, Apache Ignite automatically distributes data across nodes in a cluster eliminating the need for manual sharding. The cluster can be deployed on-premise, in the cloud, or in a hybrid environment. Apache Ignite easily integrates with Hadoop and Spark, using in-memory technology to complement these technologies and achieve significantly better performance and scale. The Apache Ignite In-Memory SQL Grid is highly optimized and easily tuned to execute high performance ANSI-99 SQL queries. The In-Memory SQL Grid offer access via JDBC/ODBC and the Ignite SQL API for external SQL commands or integration with analytics visualization software such as Tableau.
Q4. What is exactly Apache® Ignite™?
Nikita Ivanov: Apache Ignite is a high-performance, distributed in-memory platform for computing and transacting on large-scale data sets in real-time. It is 1,000x faster than systems built using traditional database technologies that are based on disk or flash technologies. It can also scale out to manage petabytes of data in memory.
Apache Ignite includes the following functionality:
· Data grid – An in-memory key value data cache that can be queried
· SQL grid – Provides the ability to interact with data in-memory using ANSI SQL-99 via JDBC or ODBC APIs
· Compute grid – A stateless grid that provides high-performance computation in memory using clusters of computers and massive parallel processing
· Service grid – A service grid in which grid service instances are deployed across the distributed data and compute grids
· Streaming analytics – The ability to consume an endless stream of information and process it in real-time
· Advanced clustering – The ability to automatically discover nodes, eliminating the need to restart the entire cluster when adding new nodes
Q5. How Apache Ignite differs from other in-memory data platforms?
Nikita Ivanov: Most in-memory computing solutions fall into one of three types: in-memory data grids, in-memory databases, or a streaming analytics engine.
Apache Ignite is a full-featured in-memory computing platform which includes an in-memory data grid, in-memory database capabilities, and a streaming analytics engine. Furthermore, Apache Ignite supports distributed ACID compliant transactions and ANSI SQL-99 including support for DML and DDL via JDBC/ODBC.
Q6. Can you use Apache® Ignite™ for Real-Time Processing of IoT-Generated Streaming Data?
Nikita Ivanov: Yes, Apache Ignite can ingest and analyze streaming data using its streaming analytics engine which is built on a high-performance and scalable distributed architecture. Because Apache Ignite natively integrates with Apache Spark, it is also possible to deploy Spark for machine learning at in-memory computing speeds.
Apache Ignite supports both high volume OLTP and OLAP use cases, supporting Hybrid Transactional Analytical Processing (HTAP) use cases, while achieving performance gains of 1000x or greater over systems which are built on disk-based databases.
Q7. How do you stream data to an Apache Ignite cluster from embedded devices?
Nikita Ivanov: It is very easy to stream data to an Apache Ignite cluster from embedded devices.
The Apache Ignite streaming functionality allows for processing never-ending streams of data from embedded devices in a scalable and fault-tolerant manner. Apache Ignite can handle millions of events per second on a moderately sized cluster for embedded devices generating massive amounts of data.
Q8. Is this different then using Apache Kafka?
Nikita Ivanov: Apache Kafka is a distributed streaming platform that lets you publish and subscribe to data streams. Kafka is most commonly used to build a real-time streaming data pipeline that reliably transfers data between applications. This is very different from Apache Ignite, which is designed to ingest, process, analyze and store streaming data.
Q9. How do you conduct real-time data processing on this stream using Apache Ignite?
Nikita Ivanov: Apache Ignite includes a connector for Apache Kafka so it is easy to connect Apache Kafka and Apache Ignite. Developers can either push data from Kafka directly into Ignite’s in-memory data cache or present the streaming data to Ignite’s streaming module where it can be analyzed and processed before being stored in memory.
This versatility makes the combination of Apache Kafka and Apache Ignite very powerful for real-time processing of streaming data.
Q10. Is this different then using Spark Streaming?
Nikita Ivanov: Spark Streaming enables processing of live data streams. This is merely one of the capabilities that Apache Ignite supports. Although Apache Spark and Apache Ignite utilize the power of in-memory computing, they address different use cases. Spark processes but doesn’t store data. It loads the data, processes it, then discards it. Ignite, on the other hand, can be used to process data and it also provides a distributed in-memory key-value store with ACID compliant transactions and SQL support.
Spark is also for non-transactional, read-only data while Ignite supports non-transactional and transactional workloads. Finally, Apache Ignite also supports purely computational payloads for HPC and MPP use cases while Spark works only on data-driven payloads.
Spark and Ignite can complement each other very well. Ignite can provide shared storage for Spark so state can be passed from one Spark application or job to another. Ignite can also be used to provide distributed SQL with indexing that accelerates Spark SQL by up to 1,000x.
Qx. Is there anything else you wish to add?
Nikita Ivanov: The world is undergoing a digital transformation which is driving companies to get closer to their customers. This transformation requires that companies move from big data to fast data, the ability to gain real-time insights from massive amounts of incoming data. Whether that data is generated by the Internet of Things (IoT), web-scale applications, or other streaming data sources, companies must put architectures in place to make sense of this river of data. As companies make this transition, they will be moving to memory-first architectures which ingest and process data in-memory before offloading to disk-based datastores and increasingly will be applying machine learning and deep learning to make understand the data. Apache Ignite continues to evolve in directions that will support and extend the abilities of memory-first architectures and machine learning/deep learning systems.
Nikita IvanovFounder & CTO, GridGain,
Nikita Ivanov is founder of Apache Ignite project and CTO of GridGain Systems, started in 2007. Nikita has led GridGain to develop advanced and distributed in-memory data processing technologies – the top Java in-memory data fabric starting every 10 seconds around the world today. Nikita has over 20 years of experience in software application development, building HPC and middleware platforms, contributing to the efforts of other startups and notable companies including Adaptec, Visa and BEA Systems. He is an active member of Java middleware community, contributor to the Java specification. He’s also a frequent international speaker with over two dozen of talks on various developer conferences globally.
Follow ODBMS.org on Twitter: @odbmsorg