Wednesday, December 19, 2012
Hadoop datacenter blueprint for HDFS
When you are planning a rather small implementation of HDFS you will most likely be fine with a simple layout for you namenodes as shown below. All your namenodes are connected to a single network segment and are most likley all within the same rack cabinet.
When you define your HDFS you will have to set the number of replications for your blocks. This means that every block will be replicated X times on different nodes. This setting is done in your hdfs-site.xml configuration file by adjusting the variable dfs.replication .When you update data which resides in a certain block this block will be replicated to all other nodes where this particular block resides. This means that you will have a lot of communication of data between your datanodes that is only used to keep your cluster in sync.
Within a simple setup this is not a real issue however when spanning over 100+ nodes in multiple racks your data will have to travel a lot and will take up networking resources that due to this cannot be used for other things.
In the image below you can see a more complex landscape where we have 15 datanodes postitioned over 3 racks. All racks have a top-rack switch to connect all the HDFS datanodes to the rack network backbone, all racks are connected to the cluster backbone for communication between the nodes of each rack.
If we do not take any action we can come into the situation that we will have a lot of communication between datanodes in different racks and by having this we will see a lot of communication over the cluster backbone and multiple rack backbones. This might cause additional network latency and potential bottlenecks especially at your cluster backbone switch. For example it would be quite possible that datanode 2 holds data that needs to be replicated to datanode 7 and datanode 10 which are both in a different rack.
To prevent this from happening you can make HDFS datanodes aware of there physical location in the datacenter and state that replication will only is allowed between datanodes within the same rack. The dfs.network.script scripting will allow you to state the rack ID and will help you limit the traffic between racks. This will make sure you will get a cluster model as shown below where you have a root (cluster) and 2 racks both containing 3 datanodes
When planning your rack setup both physical and within your cluster configuration there are some things to consider, if you allow replication only within a single rack and this rack is a physical rack at the same time as a software configured rack you can run into troubles if this rack fails for example due to a powerloss on your rack. You will not have access to any of the data that is stored within this rack as all the replications are in the same rack.
When planning you do have to consider this and think about the possibilities to have virtual racks where for example all the odd numbered servers in physical rack A and B are configured to a single (virtual) rack and all the even numbers in physical rack A and B into another virtual rack. To cope with possible loss of a switch and to provide quick communication between the racks you might want to consider connecting some switches with fiber to assure quick connection between physical racks.