Series

Data Engineering

Ideal Size of HDFS Block
HDFS stands for Hadoop Distributed File System, a reliable distributed storage system designed for storing very large files. Block: In Hadoop, a file is split into small chunks known as blocks. These are consider...
Apr 24, 2023 · 3 min read · 159
Important Hadoop Configurations
In this article, we will learn about important Hadoop configuration files. hadoop-env.sh: environment variables that are used in the scripts to run Hadoop. core-site.xml: all the configuration settings related to Hadoop core, such as I/O setting...
Apr 24, 2023 · 2 min read · 109
SparkContext vs SparkSession
SparkContext and SparkSession are two important components in Apache Spark, but they serve different purposes. SparkContext: SparkContext (sc) is the entry point for interacting with Spark and represents the connection to a Spark cluster. It was the...
May 26, 2023 · 2 min read · 132
PySpark on Google Colab
In this blog post, I will explain how to install PySpark on Google Colab. Installing Colab: Open drive.google.com in your favorite browser. Click on New at the top left → Click on More → Connect more apps. Search for Colab and install. Installing PySpa...
May 27, 2023 · 1 min read · 1.0K
SQL vs NoSQL
SQL: SQL, which stands for “Structured Query Language,” is the programming language that’s been widely used in managing data in relational database management systems (RDBMS). NoSQL: NoSQL stands for "not only SQL," and refers to a non-relational or d...
Jun 13, 2023 · 2 min read · 42
Scaling
When thinking about system design, one of the important points that comes to mind is scaling. So what is scaling? Imagine a scenario where you have built a product and gained some customers. Everything was going well until you go...
Jul 29, 2023 · 2 min read · 55
