Friday, March 09, 2012

When to use Hadoop

Hadoop is one of the big players in big data and can be seen as one of the main engines running the big-data machine. However, we still do not have a clear picture of what big data actually is. We have some working definitions of when a lot of data becomes big data, but putting a hard number on it has not been done and most likely never will be. I already zoomed in on this definition question in the "Map reduce into relation of Big Data and Oracle" post on this blog. A number of key factors indicate whether data is big data: the volume of the data, the velocity at which the data grows, the variety of sources that add to that volume, and the value it can potentially hold. These factors can help you decide when data is big data.

Then there is the question of when data (even big data) can still be handled in a standard relational database with a "standard" approach. There are some guidelines that can help you, summarized in the comparison below. Please note that this is primarily a comparison of processing data in a relational database versus in Hadoop; it is not about storing data.

              RDBMS                      Hadoop / MapReduce
Data size     Gigabytes                  Petabytes
Access        Interactive and batch      Batch
Structure     Fixed schema               Unstructured schema
Language      SQL                        Procedural (Java, C++, Ruby, etc.)
Integrity     High                       Low
Scaling       Nonlinear                  Linear
Updates       Read and write             Write once, read many times
Latency       Low                        High
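To give a feeling for the "Language" and "Access" rows: where a relational database answers a question like "how often does each word occur" with a single SELECT ... GROUP BY query run interactively, Hadoop asks you to write procedural map and reduce functions and submit them as a batch job. Below is a minimal word-count sketch in Java against the standard org.apache.hadoop.mapreduce API; the class names (WordCount, TokenizerMapper, IntSumReducer) are just illustrative, and it assumes a Hadoop version where Job.getInstance is available.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word found in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures the batch job; input and output paths come from the command line.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

You would package this into a jar and submit it with something like "hadoop jar wordcount.jar WordCount <input dir> <output dir>", after which the cluster runs it as a batch job over the input files rather than returning an interactive answer.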

Taking these points into consideration when you are struggling with the question of whether you need a MapReduce approach or an RDBMS approach should make the decision a little easier.
