Learn the 3 + 1 Vs and how they affect design choices in data processing in this article by Anuj Kumar, a Senior Enterprise Architect with FireEye, a Cybersecurity Service Provider, and the author of Easy Test Framework, a data-driven testing framework used by more than 50 companies.
The 3 Vs
Data usually exhibits three characteristics that matter when you design a data collection system. The industry calls them the 3 Vs of data. The biggest challenge in data processing is to understand the three Vs of a data-intensive system and their effect on the overall design of the data processing pipeline.
The three Vs stand for the velocity of the data, the volume of the data, and the variety of the data. The outcome of the data processing system is generally the fourth V of the equation, that is, the value of the data. Some experts add more Vs to this equation, for example, data veracity (depicting abnormalities in the data), data validity (the data represents what it is intended to represent), and data volatility (expressing, in loose terms, the importance of data over a period of time). While they are all important, they are more or less covered by the basic three Vs of data.
When talking about data-intensive systems, it’s safe to assume that the volume of the data will be huge. Thus, the data processing system that you design should be capable of handling large volumes of data at a regular pace. In today’s world, if you want to make sense of your data, the data in which you are trying to find value is probably quite large. Assume that you have two hours of free time and you would like to watch a movie on Netflix. There are typically two ways to decide which movie to watch. The most straightforward is that you already know exactly which movie you want to watch. This is not a use case for data processing systems.
The other, more interesting, use case is that you want to look at movie recommendations and find the titles that have been most watched and most recommended by others. Your chances of finding a good movie increase significantly when you take recommendations into consideration, and the larger the recommendation set, the better those chances. That is why movie-recommendation sites, such as IMDb, are so popular; they have the right set of tools or data processing algorithms in place to generate recommendations for their users.
But, as you will realize when you design a data processing system, as the volume of the data increases, so does the cost of storing and processing it. With the ever-increasing volume of data, the key design decision for a data processing system is how to enable large-scale processing of data while keeping the overall system cost low. This problem is only compounded when you bring the variety and velocity of the data into the equation. The builders of a data processing system should carefully study and understand the behavior of data within their environment to come up with the right set of tools for defining a data processing pipeline.
To elaborate further on this statement, imagine you have a data collection system that collects data from a multitude of sensors deployed in home energy-monitoring devices (for example, Nest) sold by a company.
This data is collected to analyze the average energy use of a household and to recommend ways to save energy by looking at the general patterns of consumption.
In this case, the following assumptions will generally hold true:
- The volume of data could be potentially large if the solution is a hit in the market. Imagine this device being used in hundreds of households and the sensors sending data at a per-second rate.
- The variety of the data will be low. In all cases, it will be limited to a few different message types, all following the same or a similar data structure and pattern.
- The velocity of the data will potentially be high as the IoT devices will be sending data at a regular pace. Assuming no/minimal buffering is built into the collection systems, the data needs to be sent and processed by the data processing system at a high rate. This would mean more CPU-intensive systems and possibly parallel processing with multiple worker threads/nodes at each stage of processing.
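To make the volume and velocity assumptions above concrete, here is a minimal back-of-the-envelope sketch. The fleet size, the number of sensors per household, and the message size are hypothetical figures chosen only for illustration; they do not come from any real deployment.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_load(households, sensors_per_household, bytes_per_message,
               messages_per_second=1):
    """Return (messages per day, bytes per day) for the whole fleet."""
    messages = (households * sensors_per_household
                * messages_per_second * SECONDS_PER_DAY)
    return messages, messages * bytes_per_message


# Hypothetical fleet: 500 households, 4 sensors each, 200-byte messages,
# each sensor sending one message per second.
messages, total_bytes = daily_load(households=500,
                                   sensors_per_household=4,
                                   bytes_per_message=200)
print(f"{messages:,} messages/day, {total_bytes / 1e9:.1f} GB/day")
# prints: 172,800,000 messages/day, 34.6 GB/day
```

Even with only a few hundred households, a per-second reporting rate produces well over a hundred million messages a day, which is why the pipeline has to be sized for sustained throughput rather than occasional bursts.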
From this, you can determine that the data processing logic should be split into decoupled components that interact with each other through some queuing mechanism, so that no component depends directly on the outcome of another. In addition, processing happens in an event-driven fashion: each dataset (or batch) is treated separately from the others, and an event handler is always available for each one. Such a system becomes distributed by nature, because a monolithic application running on a single piece of commodity hardware cannot keep up with this load.
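The decoupling described here can be sketched on a single machine with in-process queues and worker threads. This is only an illustration of the idea, not a production design: the stage functions `parse` and `average` are hypothetical, and in a real distributed system the `queue.Queue` objects would be replaced by a message broker with workers running on separate nodes.

```python
import queue
import threading

STOP = object()  # sentinel that tells a worker to shut down


def run_stage(handler, inbox, outbox, workers):
    """Start `workers` threads that apply `handler` to each batch from `inbox`."""
    def loop():
        while True:
            batch = inbox.get()
            if batch is STOP:
                break
            outbox.put(handler(batch))

    threads = [threading.Thread(target=loop) for _ in range(workers)]
    for thread in threads:
        thread.start()
    return threads


def parse(batch):
    """Hypothetical stage 1: convert raw readings to numbers."""
    return [float(reading) for reading in batch]


def average(batch):
    """Hypothetical stage 2: average one (non-empty) batch."""
    return sum(batch) / len(batch)


def run_pipeline(batches, workers=2):
    raw, parsed, results = queue.Queue(), queue.Queue(), queue.Queue()
    parsers = run_stage(parse, raw, parsed, workers)
    averagers = run_stage(average, parsed, results, workers)

    for batch in batches:
        raw.put(batch)

    # Shut down stage 1 first, then stage 2, so no batch is lost.
    for _ in parsers:
        raw.put(STOP)
    for thread in parsers:
        thread.join()
    for _ in averagers:
        parsed.put(STOP)
    for thread in averagers:
        thread.join()

    # Batches are independent, so completion order is not guaranteed.
    return sorted(results.get() for _ in range(len(batches)))


print(run_pipeline([["1.0", "3.0"], ["10", "20", "30"]]))
# prints: [2.0, 20.0]
```

Each stage only knows about its inbox and outbox, so the number of workers per stage can be changed independently of the others.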
The cost associated with latency
The other, and possibly more important, challenge of data processing systems is a second cost associated with large-scale processing: the cost of latency. You may develop a data processing system that can recommend the right hotel to you, but if the processing takes a week to generate a recommendation, then it’s not going to be useful to anyone.
The tools that will be used for processing the data should be chosen carefully to allow for higher volumes of data to be processed without significantly compromising on the overall processing time.
The classic way of doing things
Another challenge, usually ignored but very important, in arriving at the right architecture and design for a data processing system stems from taking the traditional way of doing things and trying to fit it into your data processing design.
Traditional ways of handling data no longer scale with the volume of data that needs to be processed in a data-intensive system.
Traditionally, organizations bought huge computers with a huge price tag. These computers were provided by some of the top companies in the industry at that time. These systems worked, and still work, well for the kind of use cases they are supposed to handle. For example, if you know that the volume of your data will never increase and you have these systems lying around, then it still makes sense to use them in the data processing system you design. Unfortunately, the hypothesis that the data for processing will never grow generally does not hold, and there comes a time when these huge systems need to be complemented with more memory and more CPUs. This is generally referred to as scaling up the system. Scaling up the hardware is a very expensive process and, given the uncertainty about the volume of data that needs to be processed with low latency, these costs can easily run very high.
Even if you invest a lot of money in these systems, there are still limits to how big a single machine/host can be. Thus, the challenge is to shed the traditional way of approaching system architecture and start embracing more modern ways of designing systems. Shedding the notion of scaling up, and replacing it with the idea of scaling out, becomes more appealing as the volume of the data increases.
Unlike the scale-up architecture, where more and more horsepower is added to the same machine, scale-out architectures rely on adding more machines and distributing the workload across these machines in a consistent manner. Thus, if the data volume increases twofold, you add enough hosts to double the capacity of the cluster instead of doubling the size of a single host. This applies in the reverse situation as well. When the volume of data decreases, simply remove hosts from the system and use them somewhere else.
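As a rough illustration of how a scale-out cluster follows the load in both directions, the sketch below computes the number of hosts needed for a given event rate. The per-host capacity of 500 events per second is an assumed figure, and the calculation assumes throughput grows linearly with the number of hosts, which real systems only approximate.

```python
import math


def hosts_needed(events_per_second, per_host_capacity):
    """Hosts required for the load, assuming capacity scales linearly."""
    return max(1, math.ceil(events_per_second / per_host_capacity))


# Assumed capacity: each host handles 500 events per second.
for load in (2_000, 4_000, 1_000):
    print(load, hosts_needed(load, per_host_capacity=500))
# prints:
# 2000 4
# 4000 8
# 1000 2
```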
As you may have realized, scaling out comes with its own set of challenges. For example, you need to clearly define strategies for splitting the data into several independent chunks and then for merging the results back together. Other challenges include developing a mechanism to detect when hosts are not responding and to replace them with new ones, at large scale and within tight time bounds.
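One common way to split and merge is to partition records by a stable hash of their key, process each chunk independently, and then combine the partial results. The sketch below shows this under simplifying assumptions: the record layout (`(key, value)` pairs), the per-key sum, and the function names are all hypothetical, and the chunks are processed in one process rather than on separate hosts. It does not address the second challenge of detecting and replacing unresponsive hosts.

```python
import zlib
from collections import Counter


def partition(records, num_hosts):
    """Split (key, value) records into one chunk per host.

    A stable hash (CRC32) sends every record with the same key to the
    same chunk, so chunks can be processed independently.
    """
    chunks = [[] for _ in range(num_hosts)]
    for key, value in records:
        index = zlib.crc32(key.encode("utf-8")) % num_hosts
        chunks[index].append((key, value))
    return chunks


def process_chunk(chunk):
    """Work done by a single host: sum the values per key."""
    totals = Counter()
    for key, value in chunk:
        totals[key] += value
    return totals


def merge(partials):
    """Combine the per-host partial results into one result."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return merged


records = [("home-1", 3), ("home-2", 5), ("home-1", 4)]
totals = merge(process_chunk(chunk) for chunk in partition(records, num_hosts=3))
print(dict(sorted(totals.items())))
# prints: {'home-1': 7, 'home-2': 5}
```

Because the merge step is a simple sum, the final result does not depend on how many hosts the data was split across, which is what makes adding or removing hosts safe for this kind of computation.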
Moving to a scale-out architecture for data processing requires significant engineering effort as well as a change in mindset.
If you found this article interesting, you can explore Anuj Kumar’s Architecting Data-Intensive Applications to architect and design data-intensive applications. This book is your gateway to building smart data-intensive systems by incorporating the core data-intensive architectural principles, patterns, and techniques directly into your application architecture.