**Spark MLlib: Machine Learning Library for Analytics**

by Dr. Christopher Burdorf, software engineer at NBC Universal.

— December 11, 2014

The Spark cluster computing framework has an extensive machine learning library called MLlib, which is optimized for distributed computing (https://spark.apache.org/mllib/).

The library contains multiple components: basic statistics, classification and regression, collaborative filtering, clustering, dimension reduction, feature extraction, and optimization.

The K-means clustering algorithm is a popular choice, and it works quite well for data that falls into disjoint sets that can be clustered into separate groups. However, if a plot of your data shows no visible division between those groups, then K-means clustering is unlikely to do much for you, because it is a fairly simple algorithm. The algorithm is composed of the following steps:

1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.

2. Assign each object to the group that has the closest centroid.

3. When all objects have been assigned, recalculate the positions of the K centroids.

4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
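
The four steps above can be sketched in a few lines of plain Python. This is a single-machine illustration only, not MLlib's distributed implementation; the function names (`kmeans`, `assign`, `mean_point`, `squared_distance`), the choice of K random input points as the initial centroids, and the `max_iters` safety cap are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centroids):
    """Step 2: put each point in the cluster of its closest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 1: use K randomly chosen input points as the initial centroids
    # (an assumption; other initialization schemes exist).
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        clusters = assign(points, centroids)
        # Step 3: recompute each centroid as the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, clusters = kmeans(data, k=2)
    print(centroids)
    print(clusters)
```

On well-separated data like the toy set in the usage example, the two groups are recovered; on data with no visible division the same loop still terminates, but the resulting clusters carry little meaning.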

The Naive Bayes algorithm is a classification method that applies Bayes’ theorem under the assumption that features are independent of one another. Independence in probability theory means that, given two events, the occurrence of one does not affect the probability of the other. Thus, if you have dependent events, like removing colored marbles from a bag, where each marble you remove changes the chances of drawing a certain color, Naive Bayes will not account for that. However, this paper shows how dependencies can be handled with Naive Bayes:

http://www.aaai.org/Papers/FLAIRS/2004/Flairs04-097.pdf.
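
To make the independence assumption concrete, here is a minimal sketch of a Naive Bayes classifier over categorical features in plain Python. It is not MLlib's API: the names `train_naive_bayes` and `predict_label`, the toy spam/ham data, and the use of add-one (Laplace) smoothing are assumptions for illustration. The point to notice is that each feature contributes its own log-probability term, added up independently of the other features.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (label, features), features being a tuple of
    categorical values. Returns the counts needed for prediction."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # keyed by (label, feature position)
    values_seen = defaultdict(set)       # keyed by feature position
    for label, features in examples:
        label_counts[label] += 1
        for pos, value in enumerate(features):
            value_counts[(label, pos)][value] += 1
            values_seen[pos].add(value)
    return label_counts, value_counts, values_seen


def predict_label(model, features, smoothing=1.0):
    """Return the label with the highest posterior score."""
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        # Log prior: how common the label is overall.
        score = math.log(count / total)
        for pos, value in enumerate(features):
            seen = value_counts[(label, pos)][value]
            options = len(values_seen[pos])
            # Independence assumption: each feature adds its own term,
            # regardless of the values of the other features.
            score += math.log((seen + smoothing) /
                              (count + smoothing * options))
        scores[label] = score
    return max(scores, key=scores.get)


if __name__ == "__main__":
    training = [
        ("spam", ("offer", "link")),
        ("spam", ("offer", "nolink")),
        ("ham", ("hello", "nolink")),
        ("ham", ("hello", "link")),
        ("ham", ("hello", "nolink")),
    ]
    model = train_naive_bayes(training)
    print(predict_label(model, ("offer", "link")))
```

Working in log space avoids numerical underflow when many per-feature probabilities are multiplied together.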

Nonetheless, in my personal experience, Naive Bayes can be difficult to train when there are many varying features.

I have had better experience with Support Vector Machines (SVMs).

According to Wikipedia, support vector machines are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. An SVM takes training examples labeled with categories and builds a model that assigns new examples to a category. It can be used as a linear classifier, or as a non-linear one by using what is known as the kernel trick. The foundation of SVMs is quite theoretical and is well described in this video

(http://youtu.be/eHsErlPJWUU).

Essentially, SVMs compute a separating hyperplane for classification by solving a quadratic programming optimization problem.
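
As a rough illustration of what "computing a hyperplane" means, the sketch below trains a linear SVM in plain Python. Note the swap: rather than calling a quadratic programming solver, it minimizes the equivalent regularized hinge loss with simple subgradient descent, which is easier to show in a few lines. The function names (`train_linear_svm`, `classify`, `dot`), the hyperparameter values, and the toy data are assumptions for the example, and labels are assumed to be -1 or +1.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(examples, epochs=1000, learning_rate=0.01, reg=0.01):
    """examples: list of (features, label) with label in {-1, +1}.
    Returns (w, b) describing the hyperplane dot(w, x) + b = 0."""
    w = [0.0] * len(examples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            margin = y * (dot(w, x) + b)
            if margin < 1:
                # Point is misclassified or inside the margin: the hinge
                # loss is active, so move the hyperplane toward fixing it.
                w = [wi - learning_rate * (reg * wi - y * xi)
                     for wi, xi in zip(w, x)]
                b += learning_rate * y
            else:
                # Point is safely classified: only apply the regularizer,
                # which shrinks w and so favors a wider margin.
                w = [wi - learning_rate * reg * wi for wi in w]
    return w, b


def classify(w, b, x):
    """Which side of the hyperplane does x fall on?"""
    return 1 if dot(w, x) + b >= 0 else -1


if __name__ == "__main__":
    data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1),
            ((3, 3), 1), ((3, 4), 1), ((4, 3), 1)]
    w, b = train_linear_svm(data)
    print(w, b)
    print(classify(w, b, (3.5, 3.5)))
```

The learned `w` is the normal vector of the hyperplane and `b` is its offset; classifying a new example is just a matter of checking the sign of `dot(w, x) + b`.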

In Spark’s MLlib, SVMs take training data in the same manner as Naive Bayes.

From what I’ve seen, though, SVMs are more robust at learning classifications for complex data (data with many training factors).

Deep learning is a technique that has gained a lot of interest recently and has broken a lot of machine learning records. Deep learning is based on a hierarchy of neural networks that go from more specific to more general.

Google has built massive deep learning systems that have been used in its speech recognition systems for Android (http://youtu.be/W15K9PegQt0).

While MLlib does not have support for deep learning, there is a library called H2O that does, and it has been interfaced to Spark through Sparkling Water (https://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html).

Thus, Spark with MLlib and Sparkling Water provide an extensive toolkit for machine learning analytics.