Simplify Complex Data Storage with Erasure Coding in HDFS

Erasure coding is a data storage technique that lets you store huge amounts of data with less storage overhead than traditional replication-based approaches.

In traditional replication-based approaches, data is stored on multiple machines, and multiple copies of each data block are created to provide fault tolerance. This is resource-intensive and increases the storage overhead.

In contrast, erasure coding allows you to store data using fewer copies by encoding the data in a way that allows it to be reconstructed even if some of the data blocks are lost. This can reduce the storage overhead and improve the fault tolerance of the system.

The default 3x replication scheme in HDFS has a 200% overhead in storage space and other resources (e.g., network bandwidth), making replication expensive. For warm and cold datasets with relatively modest I/O activity, the additional block replicas are rarely accessed during routine operations, yet they consume the same amount of resources as the first replica.
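The overhead figures above follow from simple arithmetic, sketched below. The RS(6,3) layout used for comparison is an assumption here, chosen because a 6-data/3-parity Reed-Solomon split is a common erasure coding configuration:

```python
# Storage overhead comparison: n-way replication vs. erasure coding.
# Overhead = extra storage beyond the original data, as a percentage.

def replication_overhead(replicas: int) -> float:
    """Extra storage used by n-way replication (copies beyond the first)."""
    return (replicas - 1) * 100.0

def ec_overhead(data_units: int, parity_units: int) -> float:
    """Extra storage used by an EC scheme with the given data/parity split."""
    return parity_units / data_units * 100.0

print(replication_overhead(3))  # 3x replication -> 200.0 (% overhead)
print(ec_overhead(6, 3))        # RS(6,3)        -> 50.0 (% overhead)
```

The same data that needs two extra full copies under 3x replication needs only three parity blocks per six data blocks under RS(6,3), which is where the "no higher than 50%" figure comes from.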

As a result, replacing replication with Erasure Coding (EC), which offers the same level of fault tolerance while requiring significantly less storage capacity, is a logical upgrade. The storage overhead in typical Erasure Coding (EC) deployments is no higher than 50%. The replication factor of an EC file is meaningless: it is always 1 and cannot be modified with the -setrep command.

In HDFS, erasure coding is an optional feature that can be enabled to store data more efficiently. When erasure coding is enabled, HDFS uses a technique called Reed-Solomon coding to encode data blocks. Reed-Solomon coding is a form of erasure coding that avoids storing full copies: it divides the data into smaller chunks and computes redundant parity information from them.
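The core idea can be illustrated with the simplest possible erasure code: a single XOR parity chunk. This is a toy sketch, not HDFS's actual implementation; real Reed-Solomon coding generalizes the same idea with finite-field arithmetic so that several parity chunks can be produced and several lost chunks recovered at once:

```python
from functools import reduce

def xor_chunks(chunks):
    """Bytewise XOR of equal-length byte chunks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

def encode(data_chunks):
    """Return the data chunks plus one XOR parity chunk appended at the end."""
    return list(data_chunks) + [xor_chunks(data_chunks)]

def reconstruct(chunks, lost_index):
    """Rebuild the chunk at lost_index by XORing all surviving chunks.

    Works because XORing the parity with all surviving data chunks cancels
    them out, leaving exactly the missing chunk.
    """
    survivors = [c for i, c in enumerate(chunks) if i != lost_index]
    return xor_chunks(survivors)

data = [b"AAAA", b"BBBB", b"CCCC"]
stored = encode(data)               # 3 data chunks + 1 parity chunk on "disk"
recovered = reconstruct(stored, 1)  # pretend chunk 1 (b"BBBB") was lost
assert recovered == b"BBBB"
```

A single parity chunk tolerates the loss of any one chunk; Reed-Solomon with k parity chunks tolerates the loss of any k chunks, at the cost of more expensive encoding and decoding.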

This redundant information can be used to reconstruct the original data if some of the data blocks are lost. Enabling erasure coding in HDFS can reduce the storage overhead of the system and improve its fault tolerance, but it can also increase the processing overhead and reduce the performance of the system. Therefore, it is important to carefully consider the trade-offs between storage efficiency, fault tolerance, and performance when deciding whether to enable erasure coding in HDFS.
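In Hadoop 3.x, erasure coding is managed per directory with the `hdfs ec` subcommand. A sketch of a typical workflow is shown below; the directory path `/data/cold` is hypothetical, and the set of available policies depends on your cluster's configuration:

```shell
# List the erasure coding policies known to the cluster.
hdfs ec -listPolicies

# Enable a built-in Reed-Solomon policy (6 data blocks + 3 parity blocks).
hdfs ec -enablePolicy -policy RS-6-3-1024k

# Apply the policy to a directory; new files written under it are erasure-coded.
hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k

# Verify which policy a path uses.
hdfs ec -getPolicy -path /data/cold
```

Since the CPU cost of encoding and decoding is the main trade-off, erasure coding is usually applied to warm and cold directories like the one above rather than to hot, frequently read data.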
