Big Data
In this section, we will take a look at Big Data and the concepts related to it. Big Data has become highly relevant and gathered much attention, mainly due to the rise of internet usage and the corresponding explosion of data being collected by organizations. Data is generated by various devices, by humans and by machines, often in real time. Now that organizations have data from many sources, they naturally want to derive insights from it that help them make business decisions.
What is Big Data?
Big Data is a term associated with the computation and analytical processing of data characterized by three properties, often called the "3 Vs":
- Volume
- Variety
- Velocity
Before we go deeper into these three properties, let us first understand what Big Data actually deals with. Big Data involves analyzing data to derive value or meaning from it. This is partly similar to what traditional analytical systems (also known as data warehouses) have long done: they analyze historical data and answer questions about past behavior (of users, systems, and so on). However, Big Data systems go one step further. They also predict future behavior by building models from the historical data. For instance, where traditional systems could answer questions like "In which quarter of last year did the maximum sales of a car happen?", modern Big Data systems can also answer questions like "In which quarter of next year might the maximum sales of a car happen?"
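As a minimal sketch of the predictive side, the snippet below fits a simple model to hypothetical quarterly sales figures and extrapolates to the coming year. The numbers and the choice of a plain linear trend plus seasonal dummies are assumptions for illustration only; real systems would use far richer features and models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical quarterly car sales for the past three years (assumed data).
sales = np.array([110, 125, 150, 140,
                  118, 132, 160, 149,
                  126, 141, 172, 158], dtype=float)

t = np.arange(12)                       # overall time trend
X = np.column_stack([t, np.eye(4)[t % 4]])  # trend + one-hot quarter (seasonality)

model = LinearRegression().fit(X, sales)

# Build the same features for the four quarters of the coming year and predict.
t_next = np.arange(12, 16)
X_next = np.column_stack([t_next, np.eye(4)[t_next % 4]])
forecast = model.predict(X_next)

best = int(np.argmax(forecast)) + 1
print(f"Predicted peak: Q{best} of next year ({forecast.max():.0f} units)")
```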
Big Data systems also derive patterns from data by running various algorithms on it. Though many of these tasks were performed earlier too, what changed from traditional systems are the three Vs described above.
Let us take a look at each of these V factors.
1. Volume
Big Data deals with huge volumes of data, often in excess of petabytes. Such volumes cannot be handled easily by traditional data storage or data processing systems.
2. Variety
Big Data deals with many different types of data, not necessarily data that fits into an SQL-driven relational schema. Traditional data warehouses generally loaded data into relational databases and ran SQL queries or stored procedures against it. However, data can originate from a variety of sources, and it may be unstructured text or media content such as images and videos. Following are a few examples of data from different sources:
- Data from social networking platforms like Twitter, Facebook, etc.
- Data from internet-connected devices like sensors, devices in cars, etc.
- Data collected from users surfing the internet, such as websites visited, the users' IP addresses, and time spent on each site (generally referred to as clickstream data).
- Data collected from surveys, such as health surveys, employee surveys, etc.
3. Velocity
Big Data deals with data arriving at very high velocity, or speed. Traditional analytical systems dealt with data at rest (copied versions of live data). However, Big Data also deals with data that changes in real time. For instance, consider a real-time fraud detection system in a banking application: the system has to recognize that a transaction is fraudulent in real time (within a few seconds) and prevent the transaction.
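As a toy sketch of this idea (not a production design), the snippet below applies a simple rule to each transaction as it arrives. The threshold rule and the transaction format are assumptions made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    account: str
    amount: float
    country: str

# Hypothetical rule: flag unusually large transactions from an unexpected country.
def is_fraudulent(tx: Transaction, home_country: str = "US",
                  limit: float = 10_000.0) -> bool:
    return tx.amount > limit and tx.country != home_country

def process_stream(stream):
    """Check each transaction as it arrives and block suspicious ones."""
    for tx in stream:
        if is_fraudulent(tx):
            print(f"BLOCKED:  {tx}")   # in reality: reject within seconds
        else:
            print(f"approved: {tx}")

# Simulated incoming stream of transactions.
process_stream([
    Transaction("alice", 42.50, "US"),
    Transaction("alice", 25_000.00, "RU"),
])
```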
Tools and techniques for Big Data
The 3 Vs of Big Data simply mean that traditional databases and data processing algorithms cannot support it in one way or another. Clearly, we need modern data storage systems, algorithms, and tools to work with Big Data. Moreover, Big Data storage and processing systems need to be distributed in nature in order to scale.
Modern Data Algorithms:
To process and analyze data in Big Data systems, many modern algorithms and programming models are available that may not be supported by traditional database systems. These include MapReduce and various machine learning algorithms.
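To make the MapReduce model concrete, here is a minimal single-process sketch of its classic word-count example. Real MapReduce frameworks such as Hadoop run the map and reduce phases in parallel across a cluster, but the structure is the same.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data deals with big volumes", "data arrives as a stream"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))
```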
Data storage systems:
Data storage systems like Hadoop, NoSQL databases like Cassandra, and in-memory data processing systems like Apache Spark provide the features and algorithms necessary to operate on Big Data.
Data processing types
While discussing Big Data, we will take a look at two different categories of data processing:
- Stream Processing
- Batch processing
Let us take a look at each of these.
1. Stream Processing
Stream processing is a programming paradigm in which data arrives at a system continuously as a stream and must be processed in real time. Stream processing requires high-performance, in-memory processing systems. Frameworks like Apache Spark provide stream processing capabilities.
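For instance, a running word count over a text stream can be expressed in a few lines with Spark's Structured Streaming API. The sketch below follows the well-known Spark socket example; it assumes a local Spark installation and a text source on localhost port 9999 (e.g. started with `nc -lk 9999`).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Treat lines arriving on a TCP socket as an unbounded streaming table.
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split each line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the updated counts to the console.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```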
2. Batch Processing
Batch processing is the offline processing of very large datasets for carrying out analytics or import/export tasks. In contrast to stream processing, batch processing operates on data at rest. For batch analysis of Big Data, MapReduce-style techniques are well suited.
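A batch job over data at rest might look like the following sketch, again using Spark. The file name `sales.csv` and its `region`/`amount` columns are hypothetical, chosen only to illustrate an offline aggregation.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BatchSalesReport").getOrCreate()

# Read a complete dataset at rest (hypothetical file and schema).
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Aggregate total sales per region in one offline pass.
totals = (sales.groupBy("region")
          .agg(F.sum("amount").alias("total_amount"))
          .orderBy(F.desc("total_amount")))
totals.show()
```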