Ideal Size of HDFS Block

Updated April 17, 2025

I am a Tech Enthusiast with 13+ years of experience in IT as a Consultant, Corporate Trainer and Mentor, including 12+ years of training and mentoring in Software Engineering, Data Engineering, Test Automation and Data Science. I have trained more than 10,000 IT professionals and conducted more than 500 training sessions in the areas of Software Development, Data Engineering, Cloud, Data Analysis, Data Visualization, Artificial Intelligence and Machine Learning. I am interested in writing blogs, sharing technical knowledge, solving technical issues, and reading about and learning new subjects.

Part of series: Data Engineering

HDFS stands for Hadoop Distributed File System. It is a distributed storage system designed for storing very large files reliably across the machines of a cluster.

Block

In Hadoop, a file is split into chunks known as blocks. A block is the smallest unit of data that HDFS stores and tracks.
The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x.
The block size affects the performance of sequential reads and writes.
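
To make the splitting concrete, here is a minimal sketch of the arithmetic. The function name `split_into_blocks` is hypothetical and is not part of any Hadoop API; it only assumes that a file is cut into full-size blocks and that the last block holds whatever bytes remain.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=128 * MB):
    """Return (number_of_blocks, size_of_last_block_in_bytes).

    Hypothetical helper for illustration only, not a Hadoop API.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    if file_size == 0:
        return 0, 0
    full_blocks, remainder = divmod(file_size, block_size)
    if remainder == 0:
        # The file is an exact multiple of the block size.
        return full_blocks, block_size
    # One extra, partially filled block holds the remaining bytes.
    return full_blocks + 1, remainder

# A 300 MB file with the Hadoop 2.x default (128 MB) and the 1.x default (64 MB).
for block_mb in (128, 64):
    count, last = split_into_blocks(300 * MB, block_mb * MB)
    print(f"{block_mb} MB blocks: {count} blocks, last block {last // MB} MB")
```

With 128 MB blocks the 300 MB file occupies three blocks (128 + 128 + 44 MB), and with 64 MB blocks it occupies five.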

Block Size

Hadoop does not bind users to a particular block size. The right choice usually depends on the input data. To maximize throughput on a very large input file, large blocks (for example 128 MB or even 256 MB) work best. For smaller files, a smaller block size is better.

So the rule of thumb is large blocks for large files and small blocks for small files. In practice, files of different sizes arrive on the same cluster, and files with different block sizes can coexist on the same file system. To handle this, the "dfs.block.size" parameter (named "dfs.blocksize" in Hadoop 2.x and later, where the old name is deprecated) can be set when a file is written. It overrides, for that file, the default block size configured in hdfs-site.xml.
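
As an illustration of a per-file override, the sketch below builds (but does not run) an `hdfs dfs -put` command line that passes the block size as a generic `-D` option. It uses `dfs.blocksize`, the Hadoop 2.x name of the property; on Hadoop 1.x the name would be `dfs.block.size`. The helper `build_put_command` and the example paths are assumptions made for this example, not part of Hadoop.

```python
MB = 1024 * 1024

def build_put_command(local_path, hdfs_path, block_size_bytes):
    """Build the argument list for uploading one file with a custom block size.

    Hypothetical helper: it only assembles the command, it does not execute it.
    """
    if block_size_bytes <= 0:
        raise ValueError("block_size_bytes must be positive")
    return [
        "hdfs", "dfs",
        "-D", f"dfs.blocksize={block_size_bytes}",  # per-command override
        "-put", local_path, hdfs_path,
    ]

# Example paths are placeholders, not real locations.
cmd = build_put_command("big_input.csv", "/data/big_input.csv", 256 * MB)
print(" ".join(cmd))
```

On a machine with a Hadoop client installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`. Files already stored in HDFS keep the block size they were written with.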

What happens when the block size is small

When the block size is small, the number of seeks increases. The same data is divided into more blocks, and every additional block means another seek when reading from or writing to it.
A large number of blocks also increases the overhead on the NameNode, which needs more memory to store the block metadata.
A smaller block size also means more tasks to execute, and each task carries its own JVM startup and scheduling overhead.
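
The effect on the NameNode can be estimated with simple arithmetic. The sketch below assumes roughly 150 bytes of NameNode heap per block object, a commonly quoted rule of thumb rather than an exact figure, and it ignores file and directory objects and replication. The function name `block_overhead` is hypothetical.

```python
MB = 1024 * 1024
TB = 1024 ** 4

# Assumption: rule-of-thumb heap cost per block object on the NameNode.
BYTES_PER_BLOCK_OBJECT = 150

def block_overhead(data_size, block_size):
    """Estimate block count and NameNode metadata for data_size bytes."""
    blocks = -(-data_size // block_size)  # ceiling division
    return {
        "blocks": blocks,
        "metadata_bytes": blocks * BYTES_PER_BLOCK_OBJECT,
    }

# Compare a very small block size with the Hadoop 2.x default for 1 TB of data.
for block_mb in (1, 128):
    est = block_overhead(1 * TB, block_mb * MB)
    print(f"{block_mb:>3} MB blocks: {est['blocks']:>9,} blocks, "
          f"~{est['metadata_bytes'] / MB:.1f} MB of metadata")
```

Under these assumptions, 1 TB stored in 1 MB blocks needs 128 times as many block entries as the same data stored in 128 MB blocks. Since MapReduce by default creates about one map task per block, the number of tasks grows by the same factor.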

What happens when the block size is large

When the block size is large, parallel processing takes a hit. There are fewer blocks to spread across the cluster, and each block takes longer to process, so the job as a whole can take much longer.

Hence we should start with a moderate block size of 128 MB and then observe the performance of the cluster. We can then increase or decrease the block size depending on what we observe.
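
One way to reason about the trade-off before measuring is to count tasks and "waves" of tasks for a few candidate block sizes. The sketch below assumes one task per block and a cluster that can run a fixed number of tasks at once. The 10 GB file, the 40 task slots and the function name `parallelism` are made-up values for illustration.

```python
MB = 1024 * 1024
GB = 1024 * MB

def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def parallelism(file_size, block_size, slots):
    """Return (tasks, waves, busy_slots_in_first_wave).

    Assumes one task per block and `slots` tasks running concurrently.
    """
    tasks = ceil_div(file_size, block_size)
    waves = ceil_div(tasks, slots)
    busy = min(tasks, slots)
    return tasks, waves, busy

FILE_SIZE = 10 * GB  # hypothetical input file
SLOTS = 40           # hypothetical cluster capacity

for block_mb in (64, 128, 256, 1024):
    tasks, waves, busy = parallelism(FILE_SIZE, block_mb * MB, SLOTS)
    print(f"{block_mb:>4} MB blocks: {tasks:>3} tasks, "
          f"{waves} wave(s), {busy}/{SLOTS} slots busy")
```

In this made-up scenario, 64 MB blocks produce 160 short tasks that run in four waves, while 1 GB blocks produce only 10 long tasks and leave 30 of the 40 slots idle. A moderate size keeps the cluster busy without multiplying task overhead, which is why it is a reasonable starting point for tuning.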

Important Points to consider while choosing Block Size

A file has fewer blocks when the block size is larger. The advantage is that clients can read or write more data without contacting the NameNode, which saves time.
A larger block size also reduces the amount of metadata the NameNode has to hold, which reduces the load on the NameNode.
With fewer blocks, the file may be stored on fewer nodes in total, which can reduce the total throughput of parallel access.
Fewer, larger blocks also mean longer-running tasks, which may prevent the job from achieving maximum parallelism.
If a failure occurs while a larger block is being processed, more work has to be redone.

#hadoop #big-data

Naveen P.N's Tech Blog
