Deep Learning Pipelines for Apache Spark

Deep Learning Pipelines provides high-level APIs for scalable deep learning in Python with Apache Spark.

The library comes from Databricks and leverages Spark for its two strongest facets:

  1. In the spirit of Spark and Spark MLlib, it provides easy-to-use APIs that enable deep learning in very few lines of code.
  2. It uses Spark’s powerful distributed engine to scale out deep learning on massive datasets.

Currently, TensorFlow and TensorFlow-backed Keras workflows are supported. The focus is on model inference/scoring and transfer learning on image data at scale; hyper-parameter tuning is in the works.

Furthermore, it provides tools for data scientists and machine learning experts to turn deep learning models into SQL functions that a much wider group of users can call. It does not perform single-model distributed training. That remains an area of active research, and the library aims instead to provide the most practical solutions for the majority of deep learning use cases.
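
The idea of exposing a model as a SQL function can be illustrated without Spark. The sketch below is not this library's API: it uses Python's built-in sqlite3 module as a stand-in SQL engine, and `toy_model`, `toy_model_udf`, and the `images` table are hypothetical names invented for the example. It only shows the pattern of registering a scoring function once so that anyone who can write SQL can apply it.

```python
import sqlite3


def toy_model(pixel_mean):
    """Hypothetical stand-in for a trained model: scores a single input value."""
    return 1.0 if pixel_mean > 0.5 else 0.0


def build_connection():
    """Create an in-memory database with the model registered as a SQL function."""
    conn = sqlite3.connect(":memory:")
    # Register the Python callable under the SQL name toy_model_udf (one argument).
    conn.create_function("toy_model_udf", 1, toy_model)
    conn.execute("CREATE TABLE images (name TEXT, pixel_mean REAL)")
    conn.executemany(
        "INSERT INTO images VALUES (?, ?)",
        [("dark.png", 0.2), ("bright.png", 0.9)],
    )
    return conn


def score_all(conn):
    """Apply the registered function from plain SQL and return (name, score) rows."""
    rows = conn.execute(
        "SELECT name, toy_model_udf(pixel_mean) FROM images ORDER BY name"
    )
    return rows.fetchall()


if __name__ == "__main__":
    print(score_all(build_connection()))
```

In a Spark setting the registered function would run on the distributed engine instead of a local database, but the user-facing step, calling the model by name inside a query, is the same.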

For an overview of the library, see the Databricks blog post introducing Deep Learning Pipelines. For the various use cases the package serves, see the Quick user guide section below.

The library is in its early days, and we welcome everyone’s feedback and contributions.

Maintainers: Bago Amirbekian, Joseph Bradley, Sue Ann Hong, Tim Hunter, Philip Yang

GitHub LINK

Sample notebook LINK
