Ideal Size of HDFS Block

I am a Tech Enthusiast having 13+ years of experience in ๐๐ as a ๐๐จ๐ง๐ฌ๐ฎ๐ฅ๐ญ๐๐ง๐ญ, ๐๐จ๐ซ๐ฉ๐จ๐ซ๐๐ญ๐ ๐๐ซ๐๐ข๐ง๐๐ซ, ๐๐๐ง๐ญ๐จ๐ซ, with 12+ years in training and mentoring in ๐๐จ๐๐ญ๐ฐ๐๐ซ๐ ๐๐ง๐ ๐ข๐ง๐๐๐ซ๐ข๐ง๐ , ๐๐๐ญ๐ ๐๐ง๐ ๐ข๐ง๐๐๐ซ๐ข๐ง๐ , ๐๐๐ฌ๐ญ ๐๐ฎ๐ญ๐จ๐ฆ๐๐ญ๐ข๐จ๐ง ๐๐ง๐ ๐๐๐ญ๐ ๐๐๐ข๐๐ง๐๐. I have ๐๐๐๐๐๐๐ ๐๐๐๐ ๐๐๐๐ 10,000+ ๐ฐ๐ป ๐ท๐๐๐๐๐๐๐๐๐๐๐๐ and ๐๐๐๐ ๐๐๐๐๐ ๐๐๐๐ ๐๐๐๐ 500+ ๐๐๐๐๐๐๐๐ ๐๐๐๐๐๐๐๐ in the areas of ๐๐จ๐๐ญ๐ฐ๐๐ซ๐ ๐๐๐ฏ๐๐ฅ๐จ๐ฉ๐ฆ๐๐ง๐ญ, ๐๐๐ญ๐ ๐๐ง๐ ๐ข๐ง๐๐๐ซ๐ข๐ง๐ , ๐๐ฅ๐จ๐ฎ๐, ๐๐๐ญ๐ ๐๐ง๐๐ฅ๐ฒ๐ฌ๐ข๐ฌ, ๐๐๐ญ๐ ๐๐ข๐ฌ๐ฎ๐๐ฅ๐ข๐ณ๐๐ญ๐ข๐จ๐ง๐ฌ, ๐๐ซ๐ญ๐ข๐๐ข๐๐ข๐๐ฅ ๐๐ง๐ญ๐๐ฅ๐ฅ๐ข๐ ๐๐ง๐๐ ๐๐ง๐ ๐๐๐๐ก๐ข๐ง๐ ๐๐๐๐ซ๐ง๐ข๐ง๐ . I am interested in ๐ฐ๐ซ๐ข๐ญ๐ข๐ง๐ ๐๐ฅ๐จ๐ ๐ฌ, ๐ฌ๐ก๐๐ซ๐ข๐ง๐ ๐ญ๐๐๐ก๐ง๐ข๐๐๐ฅ ๐ค๐ง๐จ๐ฐ๐ฅ๐๐๐ ๐, ๐ฌ๐จ๐ฅ๐ฏ๐ข๐ง๐ ๐ญ๐๐๐ก๐ง๐ข๐๐๐ฅ ๐ข๐ฌ๐ฌ๐ฎ๐๐ฌ, ๐ซ๐๐๐๐ข๐ง๐ ๐๐ง๐ ๐ฅ๐๐๐ซ๐ง๐ข๐ง๐ new subjects.
HDFS Stands for Hadoop Distributed File System is the world's most reliable Distributed Storage System. HDFS is a FileSystem designed for storing very large files.
Block
In Hadoop a file is split into small chunks known as Blocks. These are considered as the smallest unit of data in a FileSystem.
The default block size in Hadoop 1.x is 64 MB and 128 MB in Hadoop 2.x
The size of the block affects sequential reads and writes.
Block Size
There is no such rule set by Hadoop to the bound user with a certain block size. Usually, it depends on the input data. If you want to maximize throughput for a very large input file, using very large blocks (maybe 128MB or even 256MB) is best. But on the other hand for smaller files, using a smaller block size is better.
So we are talking about larger file large blocks & smaller file small blocks. In Industry we can get files of different sizes & we can have files with different block sizes on the same file system. So in order to overcome that situation "dfs.block.size" parameter can be used when the file is written. It will help you in overriding default block size written in hdfs-site.xml
What happens when the block size is small
When the block size is small number of seeks increases as small size of block means the data when divided into blocks will be distributed in more number of blocks and as more blocks are created, there will be more number of seeks to read/write data from/to the blocks.
Also, a large number of blocks increases overhead for the name node as it requires more memory to store the metadata.
When the block size is smaller there will be more tasks to execute by the JVM.
What happens when the block size is large
- When the block size is larger, then parallel processing takes a hit and the complete processing will take a very long time as data in one block may take large amount of time for processing
Hence we should choose a moderate block size of 128 MB and then analyze and observe the performance of the cluster.We can then choose to increase/decrease the block size depending upon our observation.
Important Points to consider while choosing Block Size
Typically a file will have fewer blocks if the block size is larger. The advantage is it is possible for clients to read/write more data without interacting with the NameNode which saves time.
Having larger block size also reduces the metadata size of the NameNode, reducing NameNode load.
With fewer blocks, the file may potentially be stored on fewer nodes in total, this can reduce total throughput of parallel access.
Having fewer & larger blocks, also means longer tasks which in turn may not gain maximum parallelism.
Also while a larger block is being processed and some failure occurs more work needs to be done.



