Supporting the Fast Data Paradigm with Apache Spark
BY Stephen Dillon, Data Architect, Schneider Electric
One of the latest, and most misunderstood, narratives to come out of the Big Data domain concerns the Fast Data paradigm.
Fast Data is beginning to be embraced by the mainstream at a time when many are still debating what Big Data is and is not, so it is no surprise that Fast Data is misunderstood as well. Companies have reached a point where they are desperately seeking to derive a benefit from all of their data and are trying to understand how to effect change via more complex and demanding analytics, all in real (i.e., immediate) time. The fact is, companies have a lot of data that they simply do not know how to process effectively, and IoT promises to keep collecting more of it, more frequently, while demanding more effective processing and analytics, whether it is Big, Micro, or Dark data. This is where the Fast Data paradigm and Apache Spark serve us well.
The concept of Fast Data itself is actually not new, although the term has only entered common parlance in recent years.
Ask any data engineering professional who has worked in the field over the last few decades, and they will tell you data was fast before it was “big.” They can recite verse after verse about how the community has sought to conquer it via practices such as scaling up servers, partitioning data on single nodes, and data warehousing solutions. The advent of Big Data taught us about the three V’s (Volume, Velocity, and Variety) and how to support them via a horizontal scale-out architecture.
However, Fast Data is defined by more than just the frequency of data ingestion or the performance gains found by scaling data out across a distributed cluster and writing targeted queries. It also incorporates real-time data processing, deriving actionable insights quickly, and the speed of delivery of the results, all while leveraging more complex analytics.
As a proponent of in-memory databases over the years, I’ve long sought to convince the community that Fast Data was on the near horizon and that simply storing data and performing batch analytics was not enough. I proposed that analytics would demand better processing of data than we had ever seen, as well as different types of analytics, such as those in the graph domain, and that all of this would be magnified by the advent of IoT. To derive actionable insights, we need the ability to process the data quickly as it is ingested (streamed) and often to join it via queries against batch data (i.e., data at rest).
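To make the idea of joining streamed data against data at rest concrete, here is a minimal sketch in plain Python. It deliberately does not use Spark's API; it only illustrates the pattern under stated assumptions. The device readings, the DEVICE_SITES lookup table, and the function names are all hypothetical and exist purely for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical "data at rest": a lookup table mapping device id to site.
DEVICE_SITES: Dict[str, str] = {"d1": "Boston", "d2": "Paris"}

# Hypothetical streamed record: (device_id, temperature).
Reading = Tuple[str, float]


def micro_batches(stream: Iterable[Reading], size: int) -> Iterator[List[Reading]]:
    """Group an incoming stream into small batches as records arrive."""
    batch: List[Reading] = []
    for reading in stream:
        batch.append(reading)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, partially filled batch
        yield batch


def enrich(batch: List[Reading], reference: Dict[str, str]) -> List[Tuple[str, float]]:
    """Inner-join streamed readings with the at-rest table on device id."""
    return [(reference[dev], temp) for dev, temp in batch if dev in reference]


def run_pipeline(
    stream: Iterable[Reading], reference: Dict[str, str], size: int = 2
) -> Iterator[Dict[str, float]]:
    """After each micro-batch, yield the running average temperature per site."""
    totals = defaultdict(lambda: [0.0, 0])  # site -> [sum, count]
    for batch in micro_batches(stream, size):
        for site, temp in enrich(batch, reference):
            totals[site][0] += temp
            totals[site][1] += 1
        yield {site: total / count for site, (total, count) in totals.items()}


if __name__ == "__main__":
    readings = [("d1", 20.0), ("d2", 30.0), ("d1", 22.0), ("d9", 99.0), ("d2", 34.0)]
    for snapshot in run_pipeline(readings, DEVICE_SITES):
        print(snapshot)
```

Each snapshot is available as soon as its micro-batch has been processed, rather than after the whole data set has been stored and analyzed in a later batch job. The reading from the unknown device d9 has no match in the at-rest table, so the inner join drops it.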
The buzz around IoT has recently brought the importance of advanced processing and analytical capabilities, such as the integration of machine learning and graph-based analytics to discover the unknown unknowns, into the mainstream consciousness. There are of course numerous solutions available that can handle one or more of these needs, such as Apache Apex, VoltDB, Apache Storm, Kafka, MemSQL, and Apache Ignite, to name just a few, but one has proven able to handle all of these demands at an effectively lower cost: Apache Spark.
There has been a lot of hype surrounding Apache Spark, and rightfully so. It is a fast, general compute engine (not a database) for processing distributed data that provides up to 100x better performance than traditional MapReduce on Hadoop when run in memory. In short, it is MapReduce on steroids. Spark’s bundle of unified APIs supporting SQL, streaming, machine learning, and graph data processing is what really sets it apart from its competitors, and it often integrates well with other solutions. Instead of creating a mix-and-match combination of multiple solutions to support each capability, developers are able to learn one API and apply their knowledge across the Spark stack, thus increasing developer productivity and lowering the total cost of ownership. As an open-source solution that also works extremely well with Hadoop, it provides a low cost of entry into the Fast Data market. It is fully supported by the major Hadoop vendors such as MapR, Cloudera, and Hortonworks, is compatible with numerous third-party solutions such as Kafka, and has libraries for integrating with data sources such as S3, HBase, Cassandra, and MongoDB.
Fast Data has finally been embraced by the consciousness of the mainstream, thanks in large part to the rise of IoT. Databricks, the company formed by the creators of Spark, plans to release Spark 2.0 in May, and demonstrations at the recent Strata Hadoop conference in San Jose, California, have shown it is even more efficient, possesses an enhanced streaming capability, and is even easier to use than version 1.6 due to the unification of the DataFrame and Dataset APIs.
If you are interested in learning more about Apache Spark, you may download it for free to try it out.
Databricks also provides free online training materials via its site, as well as a community edition of its commercial offering for exploring Spark in a clustered environment. Apache Spark has come along at the right time with the right set of capabilities to support these advanced data needs, and it continues to evolve as an important part of the Fast Data paradigm.