element61 is a certified Databricks partner with extensive experience in developing Machine Learning algorithms and set-ups. On this page we give a rough sketch of how to leverage Machine Learning in Databricks. Feel free to contact us if you want to know more.
Why run Machine Learning (ML) in Databricks?
Databricks offers a managed Spark environment built for both Data Discovery and Data Engineering at scale. Running plain Python or R scripts for machine learning is often limiting because they run on a single machine, which makes processing and transforming the data very time-consuming. As a result, users often work with only a smaller subset or a sample of the data, which can lead to less accurate results. This is where Spark comes in handy: it offers a platform for data science that leverages parallel processing.
Machine Learning in Apache Spark
To benefit from the speed and compute advantages of Spark for Machine Learning, the open-source community has developed Spark MLlib, Apache Spark's scalable machine learning library. MLlib can be used for machine learning in Databricks and consists of common machine learning algorithms as well as Machine Learning workflow utilities, including ML Pipelines.
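As a minimal sketch of how an MLlib algorithm is used (the Spark session and the small toy DataFrame below are illustrative assumptions, not part of the original example), a model such as logistic regression can be fitted directly on a distributed Spark DataFrame:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Illustrative toy DataFrame with a "features" vector column and a "label" column
training = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1, 0.1]), 1.0),
     (Vectors.dense([2.0, 1.0, -1.0]), 0.0),
     (Vectors.dense([2.0, 1.3, 1.0]), 1.0),
     (Vectors.dense([0.0, 1.2, -0.5]), 0.0)],
    ["features", "label"])

# Fit a logistic regression model; Spark distributes the training work
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)
print(model.coefficients)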
Complementary to MLlib, other Spark libraries exist. Most notable is:
- sparkdl - When a data scientist wants to use Deep Learning on Databricks, it is recommended to leverage this Spark library to get solid versions of TensorFlow and Keras working on a Spark backend. Read more about sparkdl on its GitHub; a short sketch follows the imports below.
import pyspark.ml
import sparkdl  # specific library for Deep Learning
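As a hedged illustration of how sparkdl plugs into the regular MLlib pipeline API (assuming the sparkdl library is installed on the cluster and an image_df DataFrame with "image" and "label" columns exists; both are assumptions for this sketch), a pre-trained network can be used as a featurizer in front of a classic classifier:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from sparkdl import DeepImageFeaturizer

# Illustrative: image_df is assumed to hold images in an "image" column plus a "label" column
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3")
lr = LogisticRegression(maxIter=20, regParam=0.05, labelCol="label")
pipeline = Pipeline(stages=[featurizer, lr])
model = pipeline.fit(image_df)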
Apache Spark's ML Pipelines
One solid feature embedded in Apache Spark v2.3 and above is ML Pipelines. With ML Pipelines, a data scientist can stage various transformations and model trainings into pipelines and apply these pipelines to various dataframes. This simplifies combining multiple algorithms into a single workflow/pipeline and saves time as your project stays structured and organized.
One can get started easily by importing the Pipeline class from the default pyspark library.
from pyspark.ml import Pipeline
For example, one pipeline could contain the tokenizer, the hashing transformer as well as the actual algorithm training:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# PySpark code
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
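As a short usage sketch of applying such a pipeline to dataframes (the small training and test DataFrames below, and the existing spark session, are illustrative assumptions), the whole pipeline is fitted in one call and the resulting model can then transform new data:

# Illustrative training data with id, text and label columns
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

# Fit the whole pipeline (tokenizer -> hashingTF -> logistic regression) in one call
model = pipeline.fit(training)

# Apply the fitted pipeline to new data
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n")
], ["id", "text"])
predictions = model.transform(test)
predictions.select("id", "text", "prediction").show()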
To fully get started, read the Spark documentation or contact us.
We can help!
element61 has extensive experience in Machine Learning and Databricks. Contact us for coaching, trainings, advice and implementation.