Ideal Size of HDFS Block

I am a Tech Enthusiast having 13+ years of experience in IT as a Consultant, Corporate Trainer, Mentor, with 12+ years in training and mentoring in Software Engineering, Data Engineering, Test Automation and Data Science. I have trained more than 10,000+ IT Professionals and conducted more than 500+ training sessions in the areas of Software Development, Data Engineering, Cloud, Data Analysis, Data Visualizations, Artificial Intelligence and Machine Learning. I am interested in writing blogs, sharing technical knowledge, solving technical issues, reading and learning new subjects.

HDFS stands for Hadoop Distributed File System. It is a reliable distributed storage system designed for storing very large files.

Block

  1. In Hadoop, a file is split into chunks known as blocks. A block is the smallest unit of data that HDFS stores and replicates.

  2. The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later.

  3. The size of the block affects sequential reads and writes.
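To make the splitting concrete, here is a minimal Python sketch of how a file maps onto blocks. The function name `split_into_blocks` and the 300 MB file are made up for illustration. The sketch relies on the fact that the last block of a file holds only the leftover bytes rather than a full block.

```python
from typing import List

MB = 1024 * 1024

def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the size in bytes of each block a file would occupy."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block stores only the leftover bytes, not a full block.
        sizes.append(remainder)
    return sizes

# A hypothetical 300 MB file under the Hadoop 1.x and 2.x defaults.
for block_mb in (64, 128):
    blocks = split_into_blocks(300 * MB, block_mb * MB)
    print(f"{block_mb} MB blocks: {len(blocks)} blocks, last one {blocks[-1] // MB} MB")
```

With 64 MB blocks the file occupies 5 blocks, and with 128 MB blocks it occupies 3. In both cases the last block holds 44 MB.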

Block Size

Hadoop does not bind the user to a particular block size. The right value usually depends on the input data. If you want to maximize throughput for a very large input file, large blocks (128 MB or even 256 MB) usually work best. For smaller files, on the other hand, a smaller block size is the better fit.

So the guideline is large blocks for larger files and small blocks for smaller files. In industry we get files of many different sizes, and HDFS allows files with different block sizes to live on the same file system. To handle this, set the "dfs.blocksize" parameter (named "dfs.block.size" in Hadoop 1.x, now deprecated) when the file is written. For that file, it overrides the default block size configured in hdfs-site.xml.
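As a rough illustration of a per-file override, the Python sketch below builds an `hdfs dfs -put` command that passes the block size through the generic `-D` option. It uses `dfs.blocksize`, the Hadoop 2.x and later name of the `dfs.block.size` parameter. The helper name `build_put_command` and both paths are hypothetical, and the sketch only prints the command instead of running it.

```python
import shlex
from typing import List

MB = 1024 * 1024

def build_put_command(local_path: str, hdfs_path: str, block_size_mb: int) -> List[str]:
    """Build an 'hdfs dfs -put' command that overrides the block size for one file."""
    if block_size_mb <= 0:
        raise ValueError("block_size_mb must be positive")
    block_size_bytes = block_size_mb * MB
    return [
        "hdfs", "dfs",
        "-D", f"dfs.blocksize={block_size_bytes}",  # per-command override of the default
        "-put", local_path, hdfs_path,
    ]

# Hypothetical paths: upload one large log file with 256 MB blocks.
cmd = build_put_command("events.log", "/data/raw/events.log", 256)
print(shlex.join(cmd))
```

On a machine with the Hadoop client installed, the list could be passed to `subprocess.run(cmd, check=True)`. Files already in HDFS keep the block size they were written with.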

What happens when the block size is small

  1. When the block size is small, the number of seeks increases. The same data is divided across more blocks, and every additional block means another seek when reading from or writing to it.

  2. A large number of blocks also increases the overhead on the NameNode, which needs more memory to store the block metadata.

  3. When the block size is smaller, there are more tasks to execute, and each task typically runs in its own JVM, which adds startup and scheduling overhead.
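The NameNode overhead can be approximated with a back-of-the-envelope calculation. The Python sketch below assumes roughly 150 bytes of NameNode heap per file or block object. That figure is a commonly quoted rule of thumb, not a number from this post, so treat the results as order-of-magnitude estimates. The function name and the workload are hypothetical.

```python
from typing import Iterable

MB = 1024 * 1024
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a measured value

def namenode_metadata_bytes(file_sizes: Iterable[int], block_size: int) -> int:
    """Estimate NameNode heap used by file and block objects."""
    file_objects = 0
    block_objects = 0
    for size in file_sizes:
        file_objects += 1
        block_objects += -(-size // block_size)  # integer ceiling division
    return (file_objects + block_objects) * BYTES_PER_OBJECT

# Hypothetical workload: 10,000 files of 1 GB each.
files = [1024 * MB] * 10_000
for block_mb in (16, 64, 128):
    estimate = namenode_metadata_bytes(files, block_mb * MB)
    print(f"{block_mb:>3} MB blocks -> ~{estimate / MB:.1f} MB of metadata")
```

Under these assumptions, going from 128 MB blocks to 16 MB blocks raises the estimate from about 13 MB to about 93 MB for the same data.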

What happens when the block size is large

  1. When the block size is larger, parallel processing takes a hit: there are fewer blocks to spread across the cluster, and each one can take a long time to process, so the whole job takes longer.
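The effect on parallelism can be sketched with simple arithmetic. The Python example below assumes one task per block and a fixed number of task slots in the cluster. Both are simplifications, and the function name `task_plan`, the 2 GB file, and the 20 slots are made up for illustration.

```python
from typing import Tuple

MB = 1024 * 1024

def task_plan(file_size: int, block_size: int, parallel_slots: int) -> Tuple[int, int, int]:
    """Return (number of tasks, slots kept busy, scheduling waves).

    Assumes one task per block, which is a simplification.
    """
    num_tasks = -(-file_size // block_size)   # integer ceiling division
    busy_slots = min(num_tasks, parallel_slots)
    waves = -(-num_tasks // parallel_slots)   # rounds needed to run every task
    return num_tasks, busy_slots, waves

# Hypothetical 2 GB file on a cluster with 20 task slots.
for block_mb in (32, 128, 1024):
    tasks, busy, waves = task_plan(2048 * MB, block_mb * MB, 20)
    print(f"{block_mb:>4} MB blocks: {tasks} tasks, {busy}/20 slots busy, {waves} wave(s)")
```

With 1024 MB blocks only 2 of the 20 slots do any work, and each task has to process 1 GB on its own. With 32 MB blocks all slots are busy, but the 64 tasks need 4 waves and pay the per-task overhead 64 times.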

Hence we should start with a moderate block size of 128 MB, then analyze and observe the performance of the cluster. We can then increase or decrease the block size depending on what we observe.

Important Points to consider while choosing Block Size

  1. A file typically has fewer blocks if the block size is larger. The advantage is that clients can read and write more data without contacting the NameNode, which saves time.

  2. A larger block size also reduces the amount of metadata the NameNode holds, which reduces NameNode load.

  3. With fewer blocks, the file may be stored on fewer nodes in total, which can reduce the total throughput of parallel access.

  4. Fewer, larger blocks also mean longer tasks, which may not achieve maximum parallelism.

  5. If a failure occurs while a larger block is being processed, more work has to be redone.

Naveen P.N's Tech Blog