Azure Databricks is a managed version of the Databricks platform optimized for running on Azure.
What is Databricks?
Databricks was originally developed by the creators of Apache Spark and aims to deliver a unified platform where data scientists & engineers can work together to build end-to-end machine learning solutions from data discovery up to production.
Databricks is a platform where users can log in and work. It is built on top of Apache Spark and can be deployed on-premises or in the cloud, giving users the compute power they need in an abstracted and simplified way.
Azure Databricks offers all the components and capabilities of Databricks and Apache Spark, with the option to integrate them with other Microsoft Azure services.
What is Azure Databricks?
Designed together with Microsoft, Azure Databricks is a managed version of Databricks that gives Azure customers one-click setup, streamlined workflows and shared, interactive workspaces for collaboration.
It enables fast collaboration between data scientists, data engineers, and business analysts through the Databricks platform. Azure Databricks is tightly connected with Azure storage & compute resources such as Azure Blob Storage, Azure Data Lake Store, Azure SQL Data Warehouse & HDInsight.
Azure Databricks Notebook interface
Why do I need Databricks & Spark?
Spark is an open-source framework suitable for large-scale data processing. In essence it provides you the ability to run computations on huge datasets very fast.
It does this by splitting the data into partitions, distributing them across a cluster and processing them in parallel. The technology is mature enough to do this with both batch and streaming data (Spark Streaming).
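To make the idea of partitioned, parallel processing concrete, here is a minimal sketch in plain Python. It is not Spark code: threads on one machine stand in for the workers of a cluster, and the function names (split_into_partitions, partial_sum_of_squares, parallel_sum_of_squares) are invented for this illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal the records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def partial_sum_of_squares(partition):
    """Work done independently on a single partition."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(records, num_partitions=4):
    """Process every partition concurrently, then combine the partial results."""
    partitions = split_into_partitions(records, num_partitions)
    # Assumption: threads are only a stand-in for cluster workers.
    # A real Spark cluster ships each partition to a different machine.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_sum_of_squares, partitions))
    return sum(partials)


result = parallel_sum_of_squares(list(range(1, 101)))
print(result)
```

In PySpark the same computation would typically be written along the lines of sc.parallelize(data).map(lambda x: x * x).sum(), with Spark taking care of the partitioning and of distributing the work.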
Databricks includes Spark and thus gives data scientists & engineers quick access to a fully managed Spark environment, where they can run analytics on data they couldn’t possibly process on their local laptop.
Azure Databricks Workspace
Azure Databricks supports Python, Scala, R and SQL, along with machine learning and deep learning libraries such as TensorFlow, PyTorch and scikit-learn, for building big data analytics and AI solutions. In Azure Databricks notebooks, the user can switch between programming languages with a simple magic command at the top of a cell (%python, %scala, %r or %sql), so several languages can be used in one notebook.
Running a job on a cluster in Azure Databricks means running a notebook, either manually or on a schedule. Azure Databricks also allows different users in the organization to collaborate on shared projects in one workspace.
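As an illustration of what scheduling a notebook can look like, the sketch below builds a job definition as a plain Python dictionary. The field names are assumed to follow the Databricks Jobs REST API (version 2.1) and are not taken from this article. The helper build_notebook_job, the job name, the notebook path and the cluster ID are all hypothetical. Nothing is sent to a workspace; the code only prepares the payload.

```python
import json


def build_notebook_job(name, notebook_path, cluster_id, cron, timezone="UTC"):
    """Return a job definition that runs one notebook on a schedule.

    Assumption: field names follow the Databricks Jobs REST API 2.1.
    """
    return {
        "name": name,
        "tasks": [
            {
                "task_key": "run_notebook",
                # Hypothetical ID of an already running cluster.
                "existing_cluster_id": cluster_id,
                "notebook_task": {"notebook_path": notebook_path},
            }
        ],
        "schedule": {
            # Quartz cron syntax: second minute hour day-of-month month day-of-week.
            "quartz_cron_expression": cron,
            "timezone_id": timezone,
        },
    }


job = build_notebook_job(
    name="nightly-report",                 # hypothetical job name
    notebook_path="/Shared/nightly_report",  # hypothetical notebook path
    cluster_id="0000-000000-example",      # hypothetical cluster ID
    cron="0 0 6 * * ?",                    # every day at 06:00
)
payload = json.dumps(job, indent=2)
print(payload)
```

In practice such a payload would be posted to the workspace's job creation endpoint with an access token, or the same schedule would be set through the Jobs page in the workspace UI.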
At the time of writing, Azure Databricks doesn’t integrate with Git or any other version control tool. As such, it is not yet suited to serve as a team platform where collaboration is integrated with data engineering work.
Another limitation is that it currently only supports HDInsight and not Azure Batch or AZTK.
element61 has built up expertise in creating and deploying AI solutions in production using Azure Databricks. Our knowledge and experience can help your organization build a scalable big data solution in the cloud.
Azure Databricks is a cloud analytics platform that meets the needs of both data engineers and data scientists who want to build a full end-to-end big data solution and deploy it in production. Data engineers can use it to set up the whole architecture: creating clusters, scheduling and running jobs, connecting to data sources and so on. Data scientists can use it for machine learning and real-time analytics. Business users can also use the data transformed in Azure Databricks directly in Power BI or other analytics tools for their reporting needs, simply by connecting the tool to the cluster.
For more information, see https://azure.microsoft.com/en-us/services/databricks/ or contact us.