In-memory computing vs. direct query in Qlik with Databricks

Choosing the right approach for accessing and analyzing your data is crucial. As more businesses leverage advanced platforms like Qlik and Databricks for powerful analytics, understanding how to optimize data connections becomes essential. When integrating these two tools, a common question arises: Should you use Direct Query or load data into Qlik? Each method offers distinct advantages and considerations depending on your organization’s data size, performance requirements, and real-time needs.

In this insight, we’ll explore the key differences between Direct Query and in-memory computing in Qlik when working with Databricks, helping you decide which approach best suits your data strategy.

What is Direct Query and In-memory computing

Direct Query allows users to access data directly from its underlying data source without loading it into Qlik. This method keeps the data external, which can significantly increase the speed at which users can interact with their datasets. Direct Query is especially useful for those who need near-real-time insights or are working with very large datasets where loading all the data may be impractical.

In-memory computing, on the other hand, brings data into Qlik’s in-memory engine, allowing for advanced modeling and deeper interactivity. This method offers unparalleled speed and flexibility, making it the preferred choice for complex analytical tasks.

Image
Analytics spectrum in-memory computing vs compute at database

Why Choose Direct Query

While in-memory computing is a powerful option, Direct Query unlocks unique capabilities that are particularly valuable in certain scenarios:

  • Real-Time Insights: Direct Query ensures that users are always working with up-to-date data, making it ideal for monitoring dashboards and status reports.
  • Optimized for Big Data: When dealing with massive datasets (e.g., 20M+ rows), Direct Query enables users to analyze information without the time and resource overhead of full reloads in Qlik.
  • Efficient Data Exploration: Users can explore new databases and tables before deciding whether loading the data into Qlik is necessary, streamlining data discovery and reducing redundancy.
  • Seamless Writeback Functionality: With Direct Query, changes in underlying data are reflected instantly in Qlik. This is a significant advantage over in-memory apps, where users would need to reload modified tables to see the latest updates. By using Direct Query, the most up-to-date results from the database are displayed in real time, improving workflow efficiency.
  • Integration with On-Demand App Generation (ODAG): Direct Query can be used in the selection app to efficiently extract targeted data slices, applying filters across all relevant tables. This streamlines the process of transferring selected data into Qlik Cloud’s in-memory engine, making it ideal for situations where only specific portions of the data are needed for deeper analysis. For more information, refer to the details on ODAG functionality.

Getting Started with Direct Query

Starting with the direct query is very easy.

  1. Create a connection between Qlik and your Cloud database. You can find more information about connecting Qlik with Databricks in a previous insight: Connecting Qlik Sense to Databricks: Manual vs. Partner Connect
  2. Create a new app
  3. From the Overview page in the app, click "Files and other sources". You cannot start the direct query from the data load editor
  4. Click on your connection
  5. Click the buttons on the top right to select the direct query mode
  6. Select your tables and fields
Image
Create connection - Qlik - Getting started with Direct Query

Completing these steps will automatically create an SQL script in the data load editor

Image
SQL script - Getting started with Direct Query

Direct query versus traditional in-memory computing performance evaluation

Data

To evaluate performance, I utilized a dataset from Kaggle. This dataset includes information on various URLs, specifying whether each URL is classified as phishing or not. The CSV file is substantial, with a size of 2 GB, encompassing 2.5 million rows and 18 columns.

Below is a preview of the dataset:

Image
Preview of the data - Direct query versus traditional in-memory computing performance evaluation

Visuals

For our experiment, we designed five unique charts to reflect a range of visualization types commonly used in Qlik, while also testing the limits of Direct Query.

  1. Pie chart visualizing the percentage of legitimate and phishing URLs
    Image
    Pie chart visualizing the percentage of legitimate and phishing URLs
  2. KPI chart visualizing the total amount of records

    Image
    Qlik Sense - KPI chart - Total records
  3. Bar chart visualizing the average URL length of phishing and legitimate URLs

    Bar chart visualizing the average URL length of phishing and legitimate URLs
  4. Scatter plot visualizing the relation between the average entropy and the average length of the URLs

    Image
    Scatter plot visualizing the relation between the average entropy and the average length of URLs

  5. Straight table visualizing detailed information about the URLs

    Image
    Qlik Sense - Straight table visualizing detailed information about the URLs

These visuals were selected to represent different levels of complexity. For example, Chart 4 contains an IF statement creating a significant computational load and Chart 5 is designed to perform a comprehensive scan of the entire table to retrieve data. Conversely, Reports 1, 2, and 3 are intended to depict more typical scenarios.

Results

We selected the value "Phishing" within the "Label" column to test the performance. We used the Chrome extension Add Sense to show the calculation time of the different charts. Below you can find the results.

  Chart Direct Query Direct Query with Zorder In-memory computing
1 Pie chart 901 ms 798 ms 74 ms
2 KPI chart 1275 ms 800 ms 88 ms
3 Bart chart 1561 ms 787 ms 88 ms
4 Scatter plot 1951 ms 1007 ms 89 ms
5 Straight table 1867 ms 644 ms 107 ms

The performance results of our test highlight key differences in calculation times between Direct Query and traditional load methods for various chart types. While traditional load methods generally provide faster calculations due to Qlik’s optimized in-memory processing, Direct Query offers a powerful advantage—real-time data access directly from the source.

Because Direct Query retrieves live data from the underlying Databricks database for each interaction, performance depends on factors such as network latency, query processing times, and database optimization. This ensures that users always work with the most up-to-date data without requiring periodic reloads. For organizations that prioritize real-time insights and need to query large datasets dynamically, Direct Query can be a valuable option.

Our tests also demonstrated that implementing ZORDER BY in Databricks improved Direct Query performance by optimizing data retrieval. While in-memory processing remains the fastest approach, ZORDER helps reduce query times by clustering frequently accessed data, enhancing efficiency for Direct Query users.

Conclusion

Choosing the right approach for accessing and analyzing data is crucial for businesses leveraging advanced platforms like Qlik and Databricks. This insight has explored the key differences between Direct Query and in-memory computing, highlighting their advantages and use cases.

Direct Query allows for real-time data access and is particularly beneficial for scenarios involving large datasets, writeback functionality, and exploratory analysis of new databases. It provides a streamlined way to interact with data directly from the source, making it ideal for real-time insights and reducing the overhead of loading large datasets.

However, in-memory computing remains the preferred option if maximum performance and advanced analytics are required. While Direct Query introduces latency due to the need to send queries to the database with each selection, traditional load methods benefit from having data preloaded into memory, resulting in faster response times.

Ultimately, the choice between Direct Query and in-memory computing should be guided by your specific data needs, performance requirements, and the nature of your analytical tasks. For real-time data access and handling large datasets, Direct Query is a valuable tool. For scenarios where performance and speed are critical, in-memory computing may be more suitable.

More information

For more information, please contact us!