PySpark on Google Colab

I am a Tech Enthusiast having 13+ years of experience in 𝐈𝐓 as a 𝐂𝐨𝐧𝐬𝐮𝐥𝐭𝐚𝐧𝐭, 𝐂𝐨𝐫𝐩𝐨𝐫𝐚𝐭𝐞 𝐓𝐫𝐚𝐢𝐧𝐞𝐫, 𝐌𝐞𝐧𝐭𝐨𝐫, with 12+ years in training and mentoring in 𝐒𝐨𝐟𝐭𝐰𝐚𝐫𝐞 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠, 𝐃𝐚𝐭𝐚 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠, 𝐓𝐞𝐬𝐭 𝐀𝐮𝐭𝐨𝐦𝐚𝐭𝐢𝐨𝐧 𝐚𝐧𝐝 𝐃𝐚𝐭𝐚 𝐒𝐜𝐢𝐞𝐧𝐜𝐞. I have 𝒕𝒓𝒂𝒊𝒏𝒆𝒅 𝒎𝒐𝒓𝒆 𝒕𝒉𝒂𝒏 10,000+ 𝑰𝑻 𝑷𝒓𝒐𝒇𝒆𝒔𝒔𝒊𝒐𝒏𝒂𝒍𝒔 and 𝒄𝒐𝒏𝒅𝒖𝒄𝒕𝒆𝒅 𝒎𝒐𝒓𝒆 𝒕𝒉𝒂𝒏 500+ 𝒕𝒓𝒂𝒊𝒏𝒊𝒏𝒈 𝒔𝒆𝒔𝒔𝒊𝒐𝒏𝒔 in the areas of 𝐒𝐨𝐟𝐭𝐰𝐚𝐫𝐞 𝐃𝐞𝐯𝐞𝐥𝐨𝐩𝐦𝐞𝐧𝐭, 𝐃𝐚𝐭𝐚 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠, 𝐂𝐥𝐨𝐮𝐝, 𝐃𝐚𝐭𝐚 𝐀𝐧𝐚𝐥𝐲𝐬𝐢𝐬, 𝐃𝐚𝐭𝐚 𝐕𝐢𝐬𝐮𝐚𝐥𝐢𝐳𝐚𝐭𝐢𝐨𝐧𝐬, 𝐀𝐫𝐭𝐢𝐟𝐢𝐜𝐢𝐚𝐥 𝐈𝐧𝐭𝐞𝐥𝐥𝐢𝐠𝐞𝐧𝐜𝐞 𝐚𝐧𝐝 𝐌𝐚𝐜𝐡𝐢𝐧𝐞 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠. I am interested in 𝐰𝐫𝐢𝐭𝐢𝐧𝐠 𝐛𝐥𝐨𝐠𝐬, 𝐬𝐡𝐚𝐫𝐢𝐧𝐠 𝐭𝐞𝐜𝐡𝐧𝐢𝐜𝐚𝐥 𝐤𝐧𝐨𝐰𝐥𝐞𝐝𝐠𝐞, 𝐬𝐨𝐥𝐯𝐢𝐧𝐠 𝐭𝐞𝐜𝐡𝐧𝐢𝐜𝐚𝐥 𝐢𝐬𝐬𝐮𝐞𝐬, 𝐫𝐞𝐚𝐝𝐢𝐧𝐠 𝐚𝐧𝐝 𝐥𝐞𝐚𝐫𝐧𝐢𝐧𝐠 new subjects.
In this blog post, I will explain how to install PySpark on Google Colab
Installing Colab
Open drive.google.com in your favorite browser.
Click on New on top left → Click on More → Connect more apps
Search for Colab and install.
Installing PySpark Libraries on Colab
#install Apache Spark 3.0.1 with Hadoop 2.7 from here.
!wget https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
# Now, we just need to unzip that folder.
!tar -xvzf spark-3.0.0-bin-hadoop2.7.tgz
!pip install findspark
import os
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop2.7"
import findspark
findspark.init()
Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("RDD Demo").getOrCreate()
sc = spark.sparkContext
Create RDD
data = list(range(1,101))
rdd = sc.parallelize(data)
rdd.take(5)
If you like my work and want to support me…
I share tips, tricks and insights on #softwareengineering, #dataengineering #cloud #ml on LinkedIn.
Do you want to connect with me, I have started mentoring others for career and interviews at 𝐭𝐨𝐩𝐦𝐚𝐭𝐞.𝐢𝐨/𝐧𝐚𝐯𝐞𝐞𝐧𝐩𝐧


