As a Data Scientist, Apache Spark allows you to run Machine Learning at scale. By distributing computation across a cluster, Spark can accelerate your development time, simplify your machine learning lifecycle and speed up hyperparameter tuning. Additionally, Databricks offers you a familiar environment with notebooks, Python and SQL: an environment where you can collaborate with colleagues while versioning your work and working directly on your enterprise data lake.
In this course we don't focus on basic programming with Spark, but rather on how to use Spark for Machine Learning. This course highlights some of the key differences between Spark ML and single-node libraries such as scikit-learn. We'll cover in depth how you can use both to scale your Machine Learning development.
This is a one-day training. We'll cover - at minimum - the following topics:
- cleansing data and imputing missing values using Spark DataFrames (SQL and PySpark)
- using your familiar Python packages (scikit-learn, XGBoost) at scale with Spark
- using the Spark ML library
- leveraging User-Defined Functions (UDFs)
- using Koalas
The day will consist of 50% theory and 50% hands-on exercises (a Databricks environment is provided).
Note: in this course we specifically don't cover MLflow, as it deserves a dedicated training day to cover all its features. If interested, please join our Using MLflow with Databricks course.
- You are a Data Scientist interested in the cloud for data & analytics workloads
- You have experience with Python and working knowledge of machine learning and data science
- You are familiar with the concepts of Spark - e.g. by first following the Apache Spark Programming with Databricks course
- € 625 per day
Interested to know more?
For more information, please reach out to email@example.com and we can give you more details and practical information.
The full element61 Training schedule (incl. when which training runs) can be found here.
For a complete overview of all courses, visit our academy page.