The Age of in-memory BI : is there still a need for a data warehouse ?
Creating insight in your company data is becoming crucial for all companies. Data is seen as an asset and can be used as a strategic advantage towards competitors. Knowledge is power, but how can this knowledge be organized in a company and shared amongst all users?
Typically, a Business Analytics environment (potentially consisting out of both Business Intelligence (BI) and Corporate Performance Management modules) will be the way companies are trying to achieve this goal. Business Analytics systems will provide the means to answer business questions and turn data from operational systems into knowledge; knowledge that is up-to-date, actionable, easy to understand and accessible in a responsive way.
In today’s world, more and more vendors are releasing in-memory BI solutions. This approach, also often referred to as "data discovery”, has been popularized by solutions like QlikView, Tableau and Spotfire. It meanwhile –as technology approach- has been "copied” and added to major BI platforms like IBM Cognos with Cognos Insight, SAP Business Objects with SAP Lumira and Microsoft BI with PowerBi/Powerpivot. Given that the data resides in memory, contrary to "old school” diskbased OLAP approaches, users can quickly navigate the data and change how to interact with it, not hindered by fixed dimensions and fixed levels within the dimensions. Associations are the links between data-elements and aggregation is recalculated on the fly.
Whilst this new approach for sure has considerable user benefits, one should see this more as an additional flavor of BI functionality then a replacement of "BI as we know it”.
In the context of this Insight we want to focus on a marketing statement often made by these tools. Some of these vendors are stating that with this technology, there is no further need for a data warehouse within the overall Business Analytics architecture. Is this truly the case ? Can some new technology somehow magically address all the issues an integrated data warehouse was designed for to solve ?
Within this Insight we will cover multiple reasons why a data warehouse is still a crucial component of any management information initiative, and how it can co-operate with in-memory applications to get the most out of your data and take the best decisions based on the information available within your company’s data warehouse.
In-memory BI versus a traditional data warehouse: what is the difference?
As memory is getting cheaper every year, many major vendors are trying to capitalize on the in-memory hype, both within and outside the BI industry. This evolution has been initiated mostly by niche players who position in-memory as a much more performing solution then traditional, I/O based technologies. But how does this ‘in memory processing’ really work? And what are the main differentiators between a traditional data warehouse and an in-memory BI solution?
Most data warehouses that have been built over the years used "traditional” relational database engines, based on storing data in "rows”. The database enables the organization to store large amounts of data in a central place and in an efficient way and query data to display parts of it in an overview: a list, crosstab, chart or report.
In the early days of transactional databases, all data was normalized so that it was only stored once in a database and that redundant data could be left out. This is still how most Operational applications (ERP, CRM, SCM, …) store their data.
Writing a query to get your data out of those systems was complex and demanded a lot of calculation power within the database server. As storage became cheaper and more insight in data was needed a new way of storing and accessing data was needed and data was de-normalized again.
By denormalizing, data that belongs together (e.g. customer data, including address) is stored together in a separate structure that is optimized for retrieving data instead of data entry. This concept was translated by the "father of data warehousing”, Ralph Kimball, into an approach called dimensional modeling (or star schemas). The resulting traditional data warehouse is an open solution on which almost any Business Intelligence reporting tool, in-memory or not, can connect and report upon.
In-memory solutions are based on a different architecture: instead of storing data on disk, they take into memory the data to be available at all times, without running any query. The newest compression techniques are applied to the data to be able to store large amounts of data in the computer’s RAM memory. In order to optimize the reports and response times, not all data is just put into the memory, but logical blocks are created around for example one business area (e.g. sales data). This data is then instantly accessible via the corresponding proprietary BI solution.The current in-memory solutions work only with their own front-end application, since special compression techniques are used to be able to store the data in RAM. Those can only be read by the application that knows how to decrypt the applied compression.
Within most companies some sort of reporting is already available, although this is often departmentally organized in so called "stove-pipes”. There is no "enterprise-wide” view on the information. The result of departmental reports can be that calculations between departments on certain figures are calculated in a different way and that definitions are getting their own interpretation in each different department.
Quite often these reports have been created in easy-to-use tools -BI solutions or even Excel and Access- that allow you to implement your own business rules and change a formula quite easily.
Some of the modern in-memory BI tools are using these same capabilities, to make it easy for the end-users to create their own attractive reports and analysis, enabling them to use their own formulas in dashboard-like environments. These tools offer an embedded ‘integration’ tool to create the datasets needed for the specific reporting set.
It thus does not come as a surprise that the in-memory BI, "data discovery” tools have often become popular through departmental introduction, even in companies who already had some kind of "enterprise BI” in place.
But what about the "single version of the truth” if every department is organizing their "BI” in this way? The central idea behind this concept is that an organization should use the same definitions and calculations for all figures and KPI’s. This is where a data warehouse comes in: it contains the necessary metadata for definitions, calculations and formulas, in order to make sure that everyone has the same figures in their reports and analysis.
Whenever a certain definition is adopted or adapted company-wide, you only need to define it once to have it available in all different departments and all reports that contain the information concerned.
Many people will argue that it is much faster to change their departmental reports than to have a central department changing all the necessary structures and reports.But this is most probably not the most efficient way in terms of total workload. Investigating and adapting the code per department often has a higher cost than doing this in a central location just once.
And there is also the data synchronization issue: today the new changes are already adopted by one department, but not yet for another. How can data over the different departments be matched and compared?Having one version of the truth will enable a company in taking better decisions based on consistent data throughout the whole company.
What about data quality?
Data quality is one of the critical success factors within a reporting or data warehouse solution. Reports with poor data quality influence the acceptance of your reporting environment by the business users; if data is not represented in the right way, with the correct groupings on it, or even just wrong data, data is useless as the basis for correct decisions. Also confidence in the deployed reporting solution will quickly degrade.
If there is no data warehouse available, there is a risk that data quality will be ignored or a data correction is done –over and over again- within the departmental reports itself. This can lead to non-consistent solutions, which lead further to changes in the data over time.
Within a data warehouse a thorough data quality assurance system can ensure that only validated and correct data is loaded in the data warehouse and/or that data is cleansed in a uniform way. Applying uniform rules on cleansing data is crucial to compare facts and figures over the different departments and over time.
A good cleansing of your data and correction of your faulty data is better served from a data warehouse structure than from the departmental view. If uniform definitions exist company-wide on how to treat and deal with faulty data, it will be an excessive cost to apply all of these rules for each individual layer that uses the data. Having an intermediate layer where all cleansed and corrected data is available for all different users is a huge advantage for achieving consistent and trustful data within your organization.
Within a centralized environment, such as a data warehouse, a central audit and traceability can be set up. This traceable and auditable environment can be understood and changed by any developer since it should be designed in a comprehensible and open way. Finding the root cause of your problem becomes considerably easier with a centralized BI system, such as a data warehouse, which typically holds its data in an open SQL based database system and thus not in the proprietary structures of in-memory BI.
Is your in-memory BI tool capable of producing complex transformations?
In-memory tools often offer an embedded ‘integration’ –ETL like- tool to create the datasets needed for your specific reporting set. This will do the trick for most one-on-one transformations (i.e. just getting the data in terms of different columns from a table) or for simple transformations (i.e. concatenating a first-name and a last-name together), especially when coming from one single source (ie. no data-integration needs between sources with different naming conventions and aligning issues). It will however already be a little harder when intermediate transformations such as ‘if then - else’ structures are required. But, finally it will be almost impossible to execute very complex transformations in terms of for example calculations concerning a duration in days based on a multitude of business rules & tables.
After all, an in-memory tool is still a front-end BI solution and not an ETL tool and lacks the broad and deep functionality that the latter typically have.
Do you have an historic view on your data?
Data changes frequently and not only your measures (account balance, turnover, headcount, market share, …) change. All data that is related to these measures is not static forever either; a product can change in terms of product group it belongs to or a sales responsible can change territory as examples.
How will you cope with those changes in your in-memory BI environment ? In most operational systems you are only able to see the last known value for a certain dimension item (e.g. what is the current sales territory of this sales person). It is a picture of the "here and now”. Having your reporting connected directly to the operational database results in only seeing the last situation even for historical data. If an in-memory BI solution is connected to this source, it will have the same challenge for correct historical reporting on the changed objects.
To deliver consistent reporting, you should be able to run a report today over past data that will have the exact same results as if you would have run this report in the past. To set-up this kind of reporting asks for a specific treatment of your data and the procedures to keep track of this historical data. In your transactional system the data that changes often is simply overwritten. If you have no intermediate system such as a data warehouse, there is no place where you can keep track of your historical changes in the data.
Why is this historic view on data so important? Well one reason is already mentioned above: to create consistent reports today on the past as if you would have ran them in the past. The 100 that was on the account balance in the past under a certain description should still be reported as 100 under the same description, even if in the current transactional system the description, an attribute or grouping has changed.
Let’s give another example that is quite sensitive for some people: the bonus-calculation of a sales person. What if a certain customer was served until a few months ago by sales person A and now by sales person B ? We most probably want the historical sales of this customer correctly assigned to sales person A and not to sales person B. Having all historical data available in your data warehouse results in having the correct figures right away, independently of how the operational system is able to store history.
Within the data warehouse methodology, multiple techniques exist to cope with these kind of questions. There is a reason why surrogate keys and so-called "slowly changing dimensions” were invented in the first place and need to be applied consistently in a data warehouse environment in order to have a correct view on the data.
Can you integrate your reporting coming from multiple sources?
The integration of data over multiple departments or multiple data-sources/applications is crucial in the context of "enterprise” Business Intelligence. If you are a HR manager, you may for example need data from your time registration system and from your ERP system to analyze a KPI like productivity by employee. In a similar way, revenue by market segment and customer might require both information from the ERP and the CRM system.
How can you combine this data from different sources into one report? With Excel or an in-memory BI solution you can look up the values to create intermediate values that can finally be used within your reports. Each time a new report with integrated data, or a separate analysis is made the lookups need to be validated and verified, if not recreated.
Data coming from multiple sources can be integrated in a data warehouse by applying identical rules as you would do in your Excel or other reporting on top of your transactional system.
However through having an integration set-up in the intermediate system, the integrated data can be made available to all people that need these integrations and have it applied in a consistent way for all different users and usage.
It can even become a source for other systems eg. cleaned and aligned data from the accounting and HR system could be loaded into a CRM application. The final workload of integrating data just once on a central location will be more cost effective than having them developed multiple times, for each individual report or set of reports.
Integration can go much further than just reporting on multiple data sources. What if your data source for your reports changes completely eg. when you change or upgrade your ERP system ? This has a huge impact on all your individual reports if they are linked directly to this changing source; you need to apply all changes in all individual reports.
A data warehouse is set-up in a more generic way, ensuring that whenever a source changes, only the extraction of your source changes. During your transformation the data is re-arranged to fit into the current reporting set-up. In that way your reports don’t need to be changed or the impact on end-user Business Intelligence reports is minimal.
A data-warehouse is built for the future. It provides you with a sound basis for a complete reporting environment and is able to incorporate expansions of current sources (e.g. new fields and tables) or even complete new business areas. Where possible it can re-use existing dimensions or just some minor adaptions to incorporate the new data. Existing reports are untouched and run flawlessly within your BI system.
Is all your data available at the source?
Within most operational systems only a few measures and metrics are included, mainly basic calculations or just an amount or a value.
What if you need more than these basic measures e.g. some average numbers or a complex calculation? Do you apply them in the different reports or do you calculate those figures in an intermediate data warehouse? Understanding the need for extra measures is easy, but how do you deal with extra attributes you need on for example a person’s name, address or other personal information?
Both a centralized data warehouse and the in-memory tools can deal with these sorts of problems. However the risk exists when an in-memory BI solution is chosen and implemented with a stove pipe vision that the different calculations need to be implemented in all different departments using the BI solution independently of each other. In the section of "single version of the truth”, we already explained that having just one environment is key, regardless of the technology chosen.
Enriching data can be another type of providing data that is not available at the source.In many cases some data is already available, but you want to add some extra data characteristics to your analysis. All calculations and regroupings or data enrichment can be done both in a central data warehouse as in a report or in-memory application.
For the latter, this means that coding needs to be applied into the integration part or that some extra merging of elements is needed. For a data warehouse, the additional characteristics are just treated as another source that can be loaded just once or daily, depending on the requirements. Within a data warehouse a more stable and future-proof solution can be set-up to integrate additional data or definitions.
Finally it may even happen that certain data -at some point- will no longer be available in a certain source, due to performance or other reasons impacting that same source (archiving, implementation of a new system without loading historical backdata, …). While a data warehouse will still be able to provide that data further on – since the data is stored separately on a detailed level next to the operational data source, and in-memory solution will only be able to generate data based on its’ source availability.
The role of the data warehouse in Master data management
In larger enterprise environments, some data about the key "reporting dimensions” like customers, employees, cost centers, products … is available in multiple applications. This often leads to conflicts in this information between sources.
Which of the applications is leading for that data-object? Or does the data just exist as many times as there are applications?Having a data warehouse can help you to solve these kinds of questions and pitfalls, once and for all and enterprise-wide.
A data warehouse can serve as a master datasource for all applications. At first, all data can be collected, merged or integrated into the data warehouse. Afterwards this data can be spread out in a unique and uniform way to all different applications from one single point, your data warehouse. It then has to be decided in which application new data elements are created. The distribution of your new data can be taken care of by the data warehouse, which makes this "master data” available to other applications. Adding in a new application that uses the same central data can automatically use correct, cleansed and qualified data in the source.
"Master data” is closely related to having "one single version of the truth”; it is not only applicable over the different departments or divisions within your organization, but is also needed over the different operational applications. A customer "John Smith” and his characteristics (e.g. phone number, e-mail address …) should be identical in all different systems where he appears within your organization. A data warehouse can potentially fulfill the role as a supplier towards the different operational systems to deliver the correct and unique master data, potentially even enriched with KPIs calculated across applications.
Master data is one of the topics that is often neglected in terms of importance of having a data warehouse available in your BI landscape. Most of the times the data warehouse is seen as an end point of your data: the data stored in the data warehouse is just used for reporting and nothing else. In modern ICT architectures, the data warehouse is just an intermediate environment that can play a role as a hub: complex calculations or ratios can be pre-calculated within the BI system, taking into account all detailed data coming from multiple systems, before it is incorporated in another application.
Can your in-memory BI tool foresee in all your functional BI needs?
Most often an organization requires a diversity of BI functionalities, ranging from static reporting and analysis over dashboarding to scorecarding & advanced analytics. Potentially the work that is being done in the in-memory tool, which is capable of only one or at most two –typically analysis & dashboarding- of the above functionalities, needs to be re-done within the other tools which most likely require similar data.
Why not do it only once in a data warehouse using enterprise BI tools that deliver all capabilities?
How long does it take to generate your in-memory BI ?
Within a traditional data warehouse environment operational metadata is taking care of basic error handling and load performance. It is essential to understand in a pro-active way the performance of your load processes in order to act upon changes therein. When no operational metadata is gathered, as for some in-memory tools is the case, it might be difficult to take proper countermeasures when loads are getting slower or potentially break. ETL transparency is key in terms of integration in a data warehousing context.
Is your end-user a developer?
Typically within a BI environment a data warehouse team or a business intelligence competence center takes care of the complete back-end (data staging, data integration, data warehouse) in a managed way. The front-end might be owned by a certain number of power-users or might be completed by self-service BI.
In the case of in-memory tools the paradigm is often somewhat shifted. The ETL now becomes an integral part of the BI tool and as such the end-users might be expected to perform "ETL –like” operations if they want to use the tools as "self-service BI". This ETL will most of the time not be done in a common SQL language but within a proprietary language which mostly is only still at its child steps and opens the risk for compatibility issues within next releases, next to the issues of learning the language.
There is great value in the arrival of in-memory BI technology in Business Analytics. We however do not see a need to question the existence of the data warehouse in selling the benefits of data discovery.
We have argued that a data warehouse is still an added value for any BI environment. The emerging trend of in-memory OLAP is something we encourage, however in-memory OLAP technology is not an alternative for a data warehouse architecture.
The role of a data warehouse remains key in having current, historical, actionable, comprehensive, validated, integrated, reformatted, summarized and restructured data. Some in-memory vendors may argue the opposite and try to focus the discussion on some of the fancy front-end functionality that their tools have to offer. It is our strong belief that in-memory solutions – in combination with traditional BI technology - only deliver most of their benefits if they are integrated into a data warehouse-based architecture.
Having done the preparation of data in the back-end of your architecture will enable your end-users to display data in an easy to use way regardless of the front-end tools you want to use and the data source the data is coming from. In our opinion, in-memory BI applications work best on a data warehouse as source.
The final, most important advantage of having a data warehouse in between your BI tool-set (in-memory or not) is that you have an "integrated single version of the truth” of your company’s information, that is valid anywhere, at any time and for any use, being it analysis by end-users company-wide or as a trustworthy source for other applications.