What is Airflow?
Microsoft recently announced a much-awaited feature in Azure Data Factory: Managed Airflow. Airflow is an open-source data orchestration platform that offers great flexibility. It comes with a UI that gives a clear view of DAGs (directed acyclic graphs, Airflow's representation of data pipelines) and their runs.
As we believe Airflow is complementary to Azure Data Factory, we are quite excited by this release. Airflow provides strong data orchestration capabilities, great flexibility thanks to its pure-Python nature, and triggers to cover all kinds of scenarios (custom triggers can also be developed if needed). Data Factory, on the other hand, is effortless to start with and offers seamless integration with on-premises data sources (see our article on how these tools complement each other).
What is Azure bringing to the table?
The main downside of (unmanaged) Airflow is its complex setup. To run, the tool needs a web server to serve the Airflow UI, a relational database to store all DAGs and metadata, a scheduler to schedule tasks, and finally workers to execute them. To run this in a production-grade environment, Kubernetes is often chosen as the host, which requires specific knowledge to work with and ongoing maintenance to keep it up and running.
By providing Airflow as a service in Azure Data Factory, Microsoft removes this setup burden. The service also addresses security (built-in Azure AD authentication and metadata encryption), automatic updates, and monitoring (through Azure Monitor).
Managed Airflow in practice
In a couple of clicks, an Airflow integration runtime can be set up. After creating a data factory in Azure, a user will find an Airflow tab in the ‘Manage’ section, where they can decide on:
- The type of authentication (Azure AD authentication is an option) to access the Airflow UI
- The compute size (number of schedulers and workers) - additional workers can also be added separately
- The Airflow version (currently up to 2.4.3)
- Configuration overrides
- Environment variables
- Airflow requirements, i.e. additional Python libraries required to run the DAGs. For the moment, these packages must be provided manually (copy-pasted into a box), as a requirements text file inside the imported folder will not be recognized
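As an illustration, the contents pasted into that requirements box follow the usual pip requirements format, one package per line (the packages below are hypothetical examples, not a list from the article):

```
apache-airflow-providers-microsoft-azure
pandas
azure-storage-blob
```

Pinning versions (e.g. `pandas==1.5.2`) is generally advisable so the managed environment stays reproducible across restarts.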
Once the Airflow integration runtime is up and running (this takes a couple of minutes), a folder can be manually imported into the integration runtime. This folder must be stored in an Azure Data Lake Storage Gen2 account and should contain a ‘dag’ subfolder (with the actual Python files) and a ‘plugins’ subfolder (which can be empty; it allows customizing the Airflow installation). The integration runtime will parse the imported files and show the imported DAGs in Airflow's iconic UI (accessible via the ‘Monitor’ button).
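Sketched out, the imported folder on the storage account would look roughly like this (the ‘dag’ and ‘plugins’ subfolder names come from the setup described above; the container and file names are hypothetical):

```
adls-container/
└── airflow-import/        # folder selected when importing into the runtime
    ├── dag/               # Python DAG files parsed by the runtime
    │   └── example_etl.py
    └── plugins/           # may be empty; custom operators, hooks, etc.
```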
To test a DAG in the managed environment, a user therefore has to manually upload the Python file to a storage account and then import it into the integration runtime. We think the user-friendliness of this could be improved, for example by allowing users to link the Airflow environment to an Azure DevOps repository. This would allow users to work collaboratively and test their DAGs in a local setup before pushing them to the remote repo, which could trigger an automatic import into the integration runtime.
Cost-wise, a small compute size (D2v4, < 50 DAGs) should cost $0.49 per hour, while a large compute size (D4v4, < 1000 DAGs) costs $0.99 per hour. Additional nodes cost $0.055 per hour (small) and $0.22 per hour (large). To compare this with an Airflow setup we have running, a Kubernetes cluster with 2 Standard_DS2_v2 nodes (which fits the small compute scenario), the price for one month would be 236 euros, while the same setup in Managed Airflow would be 357 euros. There is definitely a surcharge for having Airflow managed by Azure, but it could be worth not having to manage Airflow yourself.
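The monthly figures above follow from simple per-hour arithmetic; a quick sketch using the article's rates and an assumed ~730 hours of continuous running per month:

```python
# Rough monthly cost estimate for Managed Airflow.
# Rates are the article's quoted per-hour prices; 730 hours ≈ one month.
HOURS_PER_MONTH = 730

small_rate = 0.49  # $/hour, D2v4, < 50 DAGs
large_rate = 0.99  # $/hour, D4v4, < 1000 DAGs

small_monthly = small_rate * HOURS_PER_MONTH
large_monthly = large_rate * HOURS_PER_MONTH

print(f"small: ${small_monthly:.2f}/month")  # ~357.70, in line with the
print(f"large: ${large_monthly:.2f}/month")  # ~357-euro figure above
```

Note that this assumes the runtime runs 24/7, which it effectively must, since the environment cannot currently be paused.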
It is also important to note that the Managed Airflow environment currently cannot be paused, whereas Kubernetes allows us to pause the environment and thus limit costs when it is not in use (e.g., for a development environment).
Furthermore, while automatic scaling has been announced as a feature, it is not yet available. We look forward to its release and to seeing how effective it is.
Lastly, we have not yet tested Managed Airflow in production conditions (with many DAGs requiring a lot of computing power), but we expect the Azure infrastructure to handle it.
A feature that could improve Azure Data Factory capabilities
To conclude, this feature seems promising, as it can enhance the overall ETL capabilities of Azure Data Factory: a clearer overview of every run of every pipeline, custom triggers to cover all types of scenarios, and near-infinite flexibility with Python. Still, some question marks remain (among others the inconvenient development environment, the cost, and the auto-scaling feature) that we think Microsoft should address.
For companies that want to step away from the heavy management of Airflow without having to translate their ETL logic into Data Factory pipelines, migrating their DAGs to Azure could already be a viable option.
|Pros|Cons|
|---|---|
|Setup in a couple of clicks, without knowledge of other tools (e.g., Kubernetes)|Costly compared to the unmanaged setup|
|No maintenance|Inconvenient development environment|
|Security (AAD authentication and metadata encryption)|Auto-scaling announced but not yet available|
|Auto-updating|Not possible to pause the integration runtime|
|Monitoring through Azure Monitor|Requirements must be provided manually|