How to use BIG DATA for DATA ANALYTICS – Rubikon’s Approach (Part 1 of 2)


So, what comes to mind as soon as you hear of big data?

Exactly!

Something really big. But it's not as simple as it sounds.

 

According to Gartner, Big Data is defined as follows:

“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”

 

“It’s a combination of all forms of data collected by organizations that can be mined for information and used in machine learning projects, predictive modeling and other advanced analytics applications.”

 

In simple words, Big Data refers to the enormous amount of data generated day after day, in several different formats, by large enterprises across the digital world.

 

Such large volumes of data are extremely difficult to process using traditional data processing applications such as an RDBMS.

Although big data doesn’t equate to any specific volume of data, big data deployments often involve terabytes (TB), petabytes (PB) and even exabytes (EB) of data captured over time.

 

To illustrate how data has been exploding through the years, please refer to the graph below.

IMPORTANCE OF BIG DATA

 


5V’S OF BIG DATA:

 

  1. VOLUME – the huge amount of data that keeps flooding in over time.
  2. VALUE – the extraction of useful data for business decisions.
  3. VARIETY – heterogeneity, i.e. the different formats of data (structured, semi-structured, unstructured); a small example follows this list.

Structured data – RDBMS tables (rows and columns)

Unstructured data – audio files, video files, images

Semi-structured data – JSON, XML files

 

4. VELOCITY – the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors and mobile devices. The flow of data is massive and continuous.

5. VERACITY – the inconsistency and uncertainty that data can show at times, which hampers the ability to handle and manage the data effectively.
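To make the VARIETY point concrete, here is a small, purely illustrative Python sketch (the customer record and values are hypothetical) showing the same kind of information in structured, semi-structured and unstructured form:

    import json

    # Structured: fixed schema, fits neatly into an RDBMS row (columns in a table).
    structured_row = ("C001", "Asha", "asha@example.com", 42)

    # Semi-structured: self-describing JSON; fields can vary from record to record.
    semi_structured = json.dumps({
        "customer_id": "C001",
        "name": "Asha",
        "preferences": {"newsletter": True, "topics": ["analytics", "cloud"]},
    })

    # Unstructured: raw bytes (e.g. an audio clip) with no schema at all.
    unstructured = b"RIFF....WAVEfmt "  # placeholder bytes standing in for a real file

    print(structured_row)
    print(semi_structured)
    print(len(unstructured), "bytes of unstructured data")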


EXAMPLES – BIG DATA USE CASES

1. NYPD – identification of crime hotspots to provide better service.

 

2. Amazon – customer satisfaction through clickstream analysis and a personalized recommendation system.

 

3. Netflix – Through big data analytics, Netflix was able to secure its success.

 

4. Case study – Netflix identified that the British version of House of Cards was watched by many of its customers. Members who watched the British version of House of Cards also tended to watch films starring Kevin Spacey.

This was one of the patterns that led to Kevin Spacey being cast in the lead role of the US remake.

 

In 2013, Netflix released the US version of House of Cards, a remake of the earlier British TV series, produced at a cost of over $100 million.

CHALLENGES WITH BIG DATA

  • STORAGE

  • PROCESSING

 

SOLUTION 

  • RDBMS – A BIG NO!

 

WHY?

 

  1. An RDBMS cannot handle very large volumes of data; query latency becomes too high.

  2. An RDBMS cannot handle unstructured or semi-structured data.

TECHNOLOGY SOLUTIONS TO BIG DATA

 

  1. HADOOP – The Hadoop framework was designed to store and process data in a distributed data processing environment on commodity hardware, using a simple programming model. It can store and analyze data spread across different machines at high speed and low cost.

2. NoSQL DATABASES – These databases store data in non-relational forms such as JSON documents, key-value pairs, wide-column or graph structures, rather than as relational database tables.

3. SAP HANA

4. APACHE SPARK

5. DATA LAKES

6. CLOUD PLATFORMS – AWS

 

There are multiple tools for processing Big Data, such as Hadoop, Pig, Hive, Cassandra, Spark and Kafka, and the choice depends upon the requirements of the organisation.
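As a small, hedged illustration of the distributed-processing style these tools enable (not Rubikon's production code), here is a minimal PySpark word-count sketch; the input file name events.log is a hypothetical placeholder and PySpark is assumed to be installed:

    from pyspark.sql import SparkSession

    # Start a local Spark session (in a real cluster this would point at YARN, Kubernetes, etc.).
    spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

    # Read a text file, split each line into words, and count occurrences in parallel.
    counts = (
        spark.sparkContext.textFile("events.log")   # hypothetical input file
        .flatMap(lambda line: line.split())
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )

    # Print the first few word counts.
    for word, count in counts.take(10):
        print(word, count)

    spark.stop()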

Top Big Data Technologies

Top big data technologies are commonly divided into four fields: data storage, data mining, data analytics and data visualization.

 

CONCLUSION

——————

Like many other companies, Rubikon Labs has come up with solutions for handling its big data, using technologies such as AWS, Spark, Presto, data lakes, Kafka, data science and machine learning to store, mine and analyze big data and drive better decision making for its business.

 

In fact, my very first project was aimed at providing a solution for our clients to store their data by building a data warehouse pipeline. We provided a cloud-based solution to our client and migrated their data from Athena to Redshift.

 

Now you might be wondering what Athena and Redshift are!

While both are great means of analyzing data, each has its own advantages and disadvantages.

REDSHIFT

—————

Redshift is a fully managed data warehouse that runs in the cloud. It's based on PostgreSQL 8.0.2 and is designed to deliver fast query and I/O performance for datasets of any size. Redshift first requires the user to set up collections of servers called clusters; each cluster runs an Amazon Redshift engine and holds one or more databases. Users can then quickly run complicated queries and intelligently analyze the outcomes. Redshift is best suited to large, structured datasets.
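As a hedged sketch (not our actual client pipeline), this is roughly how you might connect to a Redshift cluster from Python with psycopg2, since Redshift speaks the PostgreSQL wire protocol; the endpoint, credentials and the orders table are all hypothetical placeholders:

    import psycopg2

    # Connect to the cluster endpoint; every value here is a placeholder.
    conn = psycopg2.connect(
        host="my-cluster.abc123xyz0.us-east-1.redshift.amazonaws.com",
        port=5439,               # default Redshift port
        dbname="analytics",
        user="admin",
        password="REPLACE_ME",   # never hard-code real credentials
    )

    with conn.cursor() as cur:
        # Run an analytical query against a hypothetical "orders" table.
        cur.execute("""
            SELECT order_date, SUM(amount) AS daily_revenue
            FROM orders
            GROUP BY order_date
            ORDER BY order_date DESC
            LIMIT 10;
        """)
        for row in cur.fetchall():
            print(row)

    conn.close()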

 

ATHENA

————-

Athena is an interactive query service that lets you conveniently analyze data stored in Amazon Simple Storage Service (S3) using standard SQL. It's completely serverless, meaning there's no infrastructure to set up or manage. Athena can be used to analyze unstructured, semi-structured and structured data stored in Amazon S3.
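Here is a similarly hedged sketch of querying data in S3 through Athena with boto3; the region, the sample_db database, the web_logs table and the results bucket are hypothetical placeholders:

    import time

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Kick off a standard SQL query against data already catalogued in Athena.
    execution = athena.start_query_execution(
        QueryString="SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status",
        QueryExecutionContext={"Database": "sample_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
    )
    query_id = execution["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until the query finishes.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        results = athena.get_query_results(QueryExecutionId=query_id)
        for row in results["ResultSet"]["Rows"]:
            print([col.get("VarCharValue") for col in row["Data"]])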



In our next blog, we shall see how we can automate the migration of data from Athena to Redshift.