Big Data Demystified: What Is It, and Should We Already Care?

Introduction

Over the past decade, the world has seen significant changes that impact our everyday lives. Two innovations can be considered the main drivers behind this: the internet and the mobile phone. As a result, our lives have become more digitalized than ever. Vast amounts of data and information are now recorded, customers are often better informed than sales representatives thanks to the internet, and communication is more direct than ever before.

This new era of information also extends to the world of Business Analytics, where it is consolidated into one new term (or should we say "hype"?): Big Data.

Big Data only emerged as a concept around 2011, but it has been gaining tremendous interest in the wider IT world ever since. It is now one of the most used, and also most misused, buzzwords of the moment.

 


Google Trends – search term "Big Data" volumes over time

 

As with any hype, it is necessary to understand what this new technology and concept is about, and how it can bring value to your organization. To answer that question, it is important to first understand what the term stands for and how it differs from existing technologies, in particular Business Intelligence as we know it. That is exactly what this Insight sets out to do.

 


"Data doesn’t become Big Data until you can draw insights from it that will impact business results.”- Jay Parikh

Big Data Defined

There are many definitions of Big Data out there, but the most commonly accepted definition is the one provided by Gartner.

"Big data is high volume, high velocity, and/or high variety information assets
that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."

Volume, velocity and variety are considered the three V's of Big Data. More recently a fourth V has been added, namely veracity (some argue the fourth should be variability). The V's describe the characteristics of the information that Big Data deals with, but, as the definition states, the overall purpose of Big Data is to create insights that lead to enhanced decision making and optimization. That goal is virtually identical to the goal of Business Intelligence and Business Analytics; it is the V's that drive the new hype and explain why Big Data has become so widely known in such a short time span.

Volume

It is estimated that we generate around 2.5 quintillion bytes of data (roughly 2.5 billion gigabytes) worldwide on a daily basis. Consider that 90% of the world's data in existence today was created in the last two years. If you were to digitize the entire Library of Congress, considered the biggest library in the world, you would end up with about 15 terabytes of data.

That is roughly the volume of tweets sent worldwide every single day... This gigantic boom of information is caused by the growing number of devices that can generate information and by the increase in internet connectivity. Currently 6 out of 7 billion people worldwide have a cell phone. Just consider how much information you generate on a daily basis through email, social networks, cameras, and so on. This boom of information is clearly shown in the graph below.


Worldwide Corporate Data Growth
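To make these numbers concrete, here is a quick back-of-the-envelope conversion in Python, using the estimates quoted above:

    # Rough scale check for the figures above.
    daily_bytes = 2.5e18                 # 2.5 quintillion bytes generated per day
    gigabyte = 1e9
    terabyte = 1e12
    library_of_congress = 15 * terabyte  # digitized Library of Congress estimate

    print(f"{daily_bytes / gigabyte:.1e} GB per day")                   # ~2.5e+09 GB
    print(f"{daily_bytes / library_of_congress:,.0f} Libraries of Congress per day")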
 

It is important to note that traditional data systems can already handle high volumes of information. Despite what the name "Big Data" suggests, sheer volume is not the most difficult aspect for today's technologies to overcome.

Velocity

Not only is it important to understand the volume of information being generated; it is equally important to consider the velocity at which this occurs. Thanks to improved technologies, this velocity is growing at an equally exceptional rate. Volume and velocity are obviously closely linked, but they need to be considered separately when designing solutions to handle this flow of information.

Variety

Generally speaking, data can be divided into three groups:

  • structured (data that fits a predefined data model, such as relational tables),
  • semi-structured (data that conforms to some kind of structure without having a formal data model) and
  • unstructured (data with no pre-defined data model).

The scope of BI is usually limited to structured data, while Big Data should handle all kinds of data (database tables, XML, audio and video files, etc.). Unlocking the possibilities of handling unstructured data is one of the key differences between Big Data and traditional Business Intelligence. Traditional systems, relational databases for example, typically cannot elegantly process these types of information.
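
To illustrate the difference, here is a minimal Python sketch of the three groups; the records themselves are invented for the example:

    import json

    # Structured: fits a predefined schema, e.g. a row in a relational table.
    structured_row = ("C-1001", "Jane Doe", 249.99)  # (customer_id, name, order_total)

    # Semi-structured: self-describing keys, but no rigid schema across records.
    semi_structured = json.loads('{"customer": "Jane Doe", "tags": ["mobile", "returning"]}')

    # Unstructured: free text with no data model; meaning must be extracted.
    unstructured = "Loved the delivery speed, but the packaging was damaged again."

    print(semi_structured["tags"])            # keys can be navigated record by record
    print("damaged" in unstructured.lower())  # a naive stand-in for text analytics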

Veracity

Of course, not all information that is generated can be considered accurate, relevant and trustworthy. Think of all the Web 2.0 data... With Big Data, there is often no control over the quality of information, as opposed to traditional BI data, so it is crucial that this aspect is considered when analysing Big Data. You need to be able to identify which data can potentially benefit your organization, and to assess the accuracy of that information once you decide to take the data stream into account.

Traditional data management systems are simply not able to process these amounts of information in the variety of forms in which they occur. New methods and processes are required to handle this task.

Big Data Architecture

When you talk about Big Data, Hadoop quickly comes into play. Hadoop has been developed within the Apache open source community, with major contributions from Yahoo among others, and builds on technologies originally described by Google. It is a framework designed to store and process big data, and it consists of several key modules. Two of those components clearly illustrate the difference with traditional systems:

  • Hadoop Distributed File System (HDFS), a technology that enables distributed storage of information on commodity hardware
  • Hadoop MapReduce, which enables distributed computing on commodity hardware throughout the organization, the individual machines often referred to as nodes. This part of Hadoop is based on Google's MapReduce technology.

This framework enables the processing of information at the velocity, volume, variety and veracity of Big Data. The main characteristic of the platform is its decentralized approach to both the storage of data and the computation required for analytics. Rather than having a few big servers carry the entire workload of Big Data initiatives, the work is spread across commodity hardware in the organization or in the cloud.
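
To give a feel for the programming model, below is the classic word-count example written as two small Python scripts for Hadoop Streaming, the facility that lets mappers and reducers be plain programs reading standard input and writing standard output. The file names are illustrative.

    #!/usr/bin/env python3
    # mapper.py - emits "word<TAB>1" for every word read from standard input.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

    #!/usr/bin/env python3
    # reducer.py - sums the counts per word; Hadoop sorts mapper output by key,
    # so identical words arrive grouped together.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

The same pipeline can be simulated locally with "cat input.txt | python3 mapper.py | sort | python3 reducer.py"; on a cluster, Hadoop runs many mapper and reducer instances in parallel across the nodes.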

Next to these two components, many other features and add-ons are freely available, because Hadoop is backed by the Apache open-source community. To mention some of the best-known add-ons (bear in mind that the open-source community is always quite creative in naming its products):

  • Pig – A high-level platform whose scripts, written in the Pig Latin language, are compiled into MapReduce jobs to perform data analysis and infrastructure processes
  • Hive – Provides a data warehouse function on top of the Hadoop cluster, offering data summarization, query and analysis capabilities (see the sketch after this list)
  • Mahout – A solution that offers machine learning algorithms, used for collaborative filtering, classification, clustering and data mining of information
  • Sqoop – A tool used to efficiently transfer data between Hadoop cluster(s) and relational databases
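
As a taste of how such an add-on is used in practice, here is a minimal sketch of querying Hive from Python with the PyHive library; the host, table and column names are assumptions made for the example:

    from pyhive import hive

    # Connect to the cluster's HiveServer2 endpoint (host and port assumed).
    cursor = hive.connect(host="hadoop-edge-node", port=10000).cursor()

    # HiveQL looks like SQL, but is compiled into jobs that run on the cluster.
    cursor.execute(
        "SELECT page, COUNT(*) AS hits "
        "FROM web_clickstream "
        "GROUP BY page "
        "ORDER BY hits DESC "
        "LIMIT 10"
    )
    for page, hits in cursor.fetchall():
        print(page, hits)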

There are other technologies that are also labelled "Big Data", for example NoSQL databases. The characteristic they all share is the ability to process and store large amounts of unstructured information.

Case Studies

We have now introduced a definition of Big Data and its characteristics, but it becomes much more interesting when we translate this into some compelling business cases. These will hopefully answer the question of whether the hype around Big Data is justified.

Improved customer insights - Trident Marketing

Trident Marketing is a direct response marketing and sales firm for brands in the US. Its success is based on its ability to acquire the maximum number of paying customers at the lowest cost. The company managed to grow its revenue from USD 5 million to USD 53 million over a four-year period. It achieved this by creating a solid Big Data platform consisting of the following main data components:

  • The company's telephone records
  • CRM and order entry information
  • Various external data sources, such as credit bureaus and marketing information
  • Clickstream data from Google and Bing

Thanks to the combination of all these data elements, the company was able to better predict which search keywords would provide which yield. As a result, sales increased by 10% in just the first six months, while the cost of marketing decreased by 30%. But the biggest results were achieved in the telemarketing department, where the platform used predictive analytics to drive sales throughput. The company became able to predict whom to call, at what moment in time, in which geographic location and about which specific product.
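
The case study does not disclose the actual models used, but the idea can be sketched with a toy classifier that estimates, from a few features of a planned call, the probability that the call converts. All features, data and thresholds below are invented purely for illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Invented features per historical call: [hour_of_day, region_code, prior_purchases]
    X = rng.integers(low=[8, 0, 0], high=[21, 5, 10], size=(1000, 3))

    # Invented outcomes: evening calls to repeat buyers convert more often.
    y = ((X[:, 0] >= 17) & (X[:, 2] >= 3)).astype(int)
    flip = rng.random(1000) < 0.15          # add some label noise
    y = np.where(flip, 1 - y, y)

    model = LogisticRegression().fit(X, y)

    # Score a candidate call: 6 pm, region 2, three prior purchases.
    print(model.predict_proba([[18, 2, 3]])[0, 1])  # estimated conversion probability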

Creation of a Smarter City template – The city of Barcelona

Barcelona has always been a global innovator in terms of trade, tourism, IT and architecture. In order to shape its future success, Barcelona has chosen the path of technology. The city built a hybrid cloud solution to store and analyse Big Data. The platform stores data from:

  • its own city systems,
  • social media platforms,
  • software log files,
  • cell phone information and
  • GPS signals.

The data captured in the platform is distributed to city employees, residents and businesses through various platforms and interfaces. Citizens have access to a BI platform that tracks 120 KPIs covering administrative procedures, city services (including public bike usage and the number of people using each bus route), the economy and population demographics. Companies can also access this information in order to create new apps or to identify areas for investment.

City employees can now quickly access much more information, at any given location, which allows them to improve the service towards citizens. During La Mercè, Barcelona's biggest annual festival, they receive information about the entertainment and food venues, citizen interest and satisfaction, crowd mobility and incidents that may occur. They also monitor the flow of people by measuring throughput at specific measuring points.

The city also uses the information to improve everyday life. Bike rental stations are now being optimized to connect better with other transportation networks such as trains and metros. Staffing of medical workers, police and fire departments can now be optimized to fit the needs at any given time and location.

Use Cases of Big Data

The chart below shows the typical data volumes generated by common enterprise applications.
 


Potential applications of big data span functions and industries.

It is important to note that, certainly while the technology is still immature, the benefits of Big Data are bigger for some industries than for others. In general, the biggest opportunities can be found in the following sectors:

  • The computer, electronics and information sector
  • Finance and Insurance
  • Government
  • Retail

According to a McKinsey study, these sectors have the biggest potential in terms of Big Data investments, as they operate in a B2C environment where consumers, through their transactions and interactions, provide a lot of "big data" input. But any given sector can identify gains by using Big Data.

Just to list some examples of potential applications:

  • Online & Web
    • Search ranking
    • Ad tracking
    • Location and proximity tracking
    • Online gesture tracking
    • Audience engagement
  • Marketing
    • Conversation reach
    • Advocate influence & impact
    • Topic trends
    • Sentiment ratio
    • Idea impact
  • Science & Medical
    • Genomics analysis
    • CAT scan comparison
    • Big science data collection
  • Utilities & Industry
    • Smart utility meters
    • Satellite image comparison
    • Building sensors
    • Transportation network monitoring
  • Finance
    • Financial account fraud detection/intervention
    • System hacking detection/intervention

Traditional BI vs Big Data

The obvious question to ask now is how this new technology will develop in the years to come, and how it will impact our existing Business Intelligence and Data Warehousing infrastructure. Will we all switch to Big Data solutions and move away from traditional relational databases and data warehouses, as some claim? It is obviously hard to forecast how this new technology will evolve, but some trends in the market give a good indication of where it is heading.

It is already perfectly possible to set up a Hadoop platform next to existing infrastructure to conduct analysis on unstructured or semi-structured information. The storage and processing of the information are executed in the Hadoop cluster, and the results of the analysis are passed to the existing data warehouse, where they can enrich existing data marts or populate completely new ones.


Hybrid Solution of Big Data in a classical Data Warehouse architecture

 

This approach is often referred to as the Analyze and Store principle. The low-level information is not stored in the data warehouse, resulting in lower storage and administrative overhead costs and less integration effort with the existing data.
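
Here is a minimal sketch of the final step of this Analyze and Store flow, loading aggregates computed on the Hadoop side into the warehouse using pandas and SQLAlchemy; the file name, connection string and table name are assumptions made for the example:

    import pandas as pd
    from sqlalchemy import create_engine

    # Aggregated results exported from the cluster, e.g. as a CSV extract.
    results = pd.read_csv("clickstream_aggregates.csv")

    # Append the aggregates to a data mart table in the relational warehouse.
    engine = create_engine("postgresql://dw_user:secret@dw-host:5432/warehouse")
    results.to_sql("fact_clickstream_daily", engine, if_exists="append", index=False)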

Opposed to that is the Store and Analyze principle, which is becoming possible thanks to the big vendors: almost all the major relational database vendors (Microsoft, Oracle, IBM, ...) are integrating Hadoop functionality into their relational systems at a fast pace. This allows Big Data to be stored alongside and integrated with the existing data sets, and analysed as one consolidated whole. As the data is better integrated, data quality will improve and the history of information will be easier to access. The downside of this approach is the increased cost of storage and data integration.

Whichever way forward is chosen, a new array of possibilities with regard to Business Analytics will become available to end users. After all, the goals of Big Data and traditional BI can be considered the same; the only differences lie in the information that is processed and the means of doing so.

Conclusion

Big Data has gained a lot of attention in the past two years, and the main reason is the boom of information that is occurring across the globe. Big Data is in essence a new way of processing and storing information, but with the same goals as traditional BI: to enable enhanced decision making, insight discovery and process optimization. As such it will become increasingly important in the years to come.

The question is not whether Big Data is here to stay, but when it will arrive in your company.