“We've been informed by Microsoft's product group that they're currently working on making V-Order enabled by default for all Fabric analytical engines that create delta files. There are a few more details to be finalized. We'll provide updates as we learn more. It's good to see that Microsoft is continually improving its user experience.”
With the continuous evolution of data engineering techniques, the advent of Microsoft Fabric's Direct Lake signifies a major step forward. Positioned as a game-changing feature, Direct Lake allows Power BI to directly connect to a data lakehouse in Fabric. This is largely driven by a proprietary mechanism Microsoft has implemented known as V-Order optimization, which enhances performance when applied to the parquet file format in Microsoft Fabric's Delta Lake. However, it's worth noting that Direct Lake can also function efficiently without this optimization, making it compatible with standard delta files as well.
V-Order optimization provides an array of performance enhancements, including specialized sorting, row group distribution, dictionary encoding, and compression, all while keeping the interoperability of the open-source parquet format. This approach should, in theory, give us import like performance while allowing direct access to live data.
Let’s find out! 🔥🔥
Objective & Process
This insight aims to compare the performance of Direct Lake and Power BI's traditional Import mode. Additionally, we will evaluate the impact of V-Order optimization on the performance of Direct Lake. By conducting performance tests on a standard dataset using both modes and varying the V-Order optimization, we will objectively evaluate the unique characteristics and potential advantages of Direct Lake over Import mode.
During the data preparation, we observed an intriguing detail. As we loaded the "New York Taxi" sample file, a parquet file of 2GB, into OneLake through the Fabric interface in delta format, we found that V-Order optimization isn't automatically applied. This is the same behaviour we see with shortcutted delta files, which don't have V-Order applied by default either. Notably, a 'shortcutted' delta file in this context refers to a file that is 'linked' in Fabric, without being natively ingested into OneLake. In other words, it's a pointer to data residing in another lake, thereby avoiding redundancy and saving storage space
For our comparative study, we decided to not only assess Direct Lake with V-Order optimization but also without. Therefore, our performance analysis will cover Import mode, Direct Lake with V-Order, and Direct Lake without V-Order.
The process consists of three stages:
- Data Loading: The first step of our process involves transferring the "New York Taxi" sample file into a Lakehouse within Microsoft Fabric. We leveraged Fabric's native pipeline for this task, though the specific details of this procedure won't be delved into here. This stage lays the groundwork for the ensuing performance tests, all handled through the Fabric Data Factory.
- Direct Lake Reporting: We will leverage Power BI's Direct Lake mode to generate a collection of reports from the dataset. Here, we will focus on the speed of data reading, the fluidity of data manipulation, and the overall user experience. In this stage, we will also identify if the V-Order optimization is applied to the data files.
To create a table with V-Order optimization, execute the following command:
CREATE TABLE taxi_data_vorder USING delta TBLPROPERTIES("delta.parquet.vorder.enabled"="true")
This statement generates a new table 'taxi_data_vorder' using the delta format, with V-Order optimization enabled.
To confirm that V-Order is now activated on the 'taxi_data_vorder' table, execute the following queries:
SHOW TBLPROPERTIES taxi_data; SHOW TBLPROPERTIES taxi_data_vorder;
These commands display the properties of the 'taxi_data' and 'taxi_data_vorder' tables, allowing you to verify that V-Order is enabled on the latter."
- Import Mode Reporting: To provide a basis for comparison, we will generate similar reports using Power BI's traditional Import mode. By noting the performance metrics during this process, we can then compare them with those of Direct Lake.
Test Set & Comparison
For our experiment, we crafted four distinct reports aimed to mirror a variety of visualization types typically used in Power BI while simultaneously pushing the capabilities of Direct Lake.
- Simple Visual Report
- Simple Visual with Date Slicer Report
- Visualization highlighting Average by Top 10 Locations
- A Comprehensive Flat Table holding granular data
Our selected visuals were chosen to mimic varying complexity levels. Report 3, for instance, is expected to execute an exhaustive scan of the entire table to fetch data, thereby introducing a significant computational demand similar to the granular flat table in Report 4. In contrast, reports 1 and 2 should represent more standard scenarios.
The results observed during the testing phase varied from the minimum time taken to the maximum time recorded for data retrieval and report generation.
|DL – no optimization
|Did not complete
Each row contains a range, indicating the minimal and maximal time required to refresh the report. The minimum value typically appears after some data has been cached. It's evident that even in Direct Lake mode, caching is still at play, as we can see notable variability in performance times once the visual has been queried once. This caching effect also transpires in Import mode, albeit to a far less noticeable extent.
An important thing to note is that if a query sent by the visual cannot be sent back within the cold cache bounds, then it will result in a direct query, as seen in Report 4. This implies that you may encounter the same limitations as you would using direct query mode, such as DAX limitations.
The anomaly with Report 2
Interestingly, the Direct Lake mode encountered instability during the testing of Report 2, specifically while interacting with the date slicer. As the screenshot illustrates, the system faced issues which prevented its completion.
However, the problem was promptly mitigated upon the removal of the date filter. Notably, the Direct Lake feature is still in its preview stage, suggesting that such minor glitches are likely to be addressed in forthcoming updates.
Interestingly, this issue was not present with the v-order-enabled dataset.
Through our tests, Power BI's Direct Lake mode showed promise, particularly when boosted with V-Order optimization, outperforming the traditional Import mode in certain scenarios. Its key feature is the live data connection, continuously providing the latest data, a convenience not present in the Import mode. However, this comes at a cost: maintaining the Fabric capacity continuously operational increases compute demands and, consequently, the cost.
Importantly, our tests revealed that Direct Lake doesn't consistently outperform Import mode without V-Order optimisation. Therefore, the choice between the two largely depends on your specific needs and the feasibility of applying V-Order optimization. An added complexity is Direct Lake's fallback to Direct Query under certain conditions, bringing along its own set of limitations, such as those related to DAX.
In conclusion, Direct Lake presents an exciting potential for Power BI users, especially with V-Order optimization. It can handle large data sources and improve performance in specific scenarios. However, it's still a work in progress and we look forward to future updates that will further enhance this promising feature.