SQL & PySpark Equivalent

| Concept | SQL | Spark / PySpark |
| --- | --- | --- |
| SELECT | SELECT column(s) FROM table; SELECT * FROM table; | df.select("column(s)"); df.select("*") |
| AVG | SELECT AVG(column) FROM table; | from pyspark.sql.functions import avg; df.agg(avg("column")) |
| MAX / MIN | SELECT MAX(column) FROM table; | from pyspark.sql.functions import max, min (this shadows Python's built-in max and min); df.agg(max("column"), min("column")) |
| String Length | SELECT LEN(string) FROM table; (SQL Server; most other dialects use LENGTH) | from pyspark.sql.functions import col, length; df.select(length(col("string"))) |
| Convert to Uppercase | SELECT UPPER(string) FROM table; | from pyspark.sql.functions import col, upper; df.select(upper(col("string"))) |
| Convert to Lowercase | SELECT LOWER(string) FROM table; | from pyspark.sql.functions import col, lower; df.select(lower(col("string"))) |
| Concatenate Strings | SELECT CONCAT(string1, string2) FROM table; | from pyspark.sql.functions import col, concat; df.select(concat(col("string1"), col("string2"))) |
| Trim String | SELECT TRIM(string) FROM table; | from pyspark.sql.functions import col, trim; df.select(trim(col("string"))) |
| Substring | SELECT SUBSTRING(string, start, length) FROM table; | from pyspark.sql.functions import col, substring; df.select(substring(col("string"), start, length)) |
| CURDATE, NOW, CURTIME | SELECT CURDATE() FROM table; | from pyspark.sql.functions import current_date; df.select(current_date()) |
| CASE WHEN | SELECT CASE WHEN condition THEN value1 ELSE value2 END FROM table; | from pyspark.sql.functions import when; df.select(when(condition, value1).otherwise(value2)) |
| COALESCE | SELECT COALESCE(column1, column2, column3) FROM table; | from pyspark.sql.functions import coalesce, col; df.select(coalesce(col("column1"), col("column2"), col("column3"))) |
| Window Function (RANK) | SELECT column, RANK() OVER (ORDER BY column) AS rank FROM table; | from pyspark.sql.functions import rank; from pyspark.sql.window import Window; df.select("column", rank().over(Window.orderBy("column")).alias("rank")) |
| CTE | WITH cte1 AS (SELECT * FROM table1) SELECT * FROM cte1 WHERE condition; | df.createOrReplaceTempView("cte1"); df_cte1 = spark.sql("SELECT * FROM cte1 WHERE condition"); df_cte1.show() or df.filter(condition1).filter(condition2) |
| Datatypes | INT: integer values; BIGINT: large integer values; FLOAT: floating point values; DOUBLE: double precision floating point values; CHAR: fixed-length character strings; VARCHAR: variable-length character strings; DATE: date values; TIMESTAMP: timestamp values | PySpark data types are similar but are represented as classes in pyspark.sql.types. IntegerType: integer values; LongType: long integer values; FloatType: floating point values; DoubleType: double precision floating point values; StringType: character strings; TimestampType: timestamp values; DateType: date values |
| Create Table | CREATE TABLE table_name (column_name data_type constraint); | df.write.format("parquet").saveAsTable("table_name") |
| Create Table with Columns definition | CREATE TABLE table_name (column_name data_type [constraints], column_name data_type [constraints], ...); | from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DecimalType; schema = StructType([StructField("id", IntegerType(), True), StructField("name", StringType(), False), StructField("age", IntegerType(), True), StructField("salary", DecimalType(10, 2), True)]); df = spark.createDataFrame([], schema) |
| Create Table with Primary Key | CREATE TABLE table_name (column_name data_type PRIMARY KEY, ...); If the table already exists: ALTER TABLE table_name ADD PRIMARY KEY (column_name); | Primary key constraints are not enforced directly in PySpark or HiveQL. However, you can use the dropDuplicates() method to remove duplicate rows based on one or more columns: df = df.dropDuplicates(["id"]) |
| Create Table with Auto Increment constraint | CREATE TABLE table_name (id INT AUTO_INCREMENT, name VARCHAR(255), PRIMARY KEY (id)); | Not natively supported by the DataFrame API, but there are several ways to achieve similar functionality. One option generates unique, increasing IDs that are not guaranteed to be consecutive: from pyspark.sql.functions import monotonically_increasing_id; df = df.withColumn("id", monotonically_increasing_id() + start_value) |
| Adding a column | ALTER TABLE table_name ADD column_name datatype; | from pyspark.sql.functions import lit; df = df.withColumn("column_name", lit(None).cast("datatype")) |
| Modifying a column | ALTER TABLE table_name MODIFY column_name datatype; | df = df.withColumn("column_name", df["column_name"].cast("datatype")) |
| Dropping a column | ALTER TABLE table_name DROP COLUMN column_name; | df = df.drop("column_name") |
| Rename a column | ALTER TABLE table_name RENAME COLUMN old_column_name TO new_column_name; In MySQL: ALTER TABLE employees CHANGE COLUMN first_name first_name_new VARCHAR(255); | df = df.withColumnRenamed("existing_column", "new_column") |
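
The Auto Increment row above uses monotonically_increasing_id(), and it helps to see why the resulting IDs are not a gapless 1, 2, 3 sequence the way AUTO_INCREMENT values usually are. The sketch below is a pure-Python simulation, not Spark's code. It assumes the bit layout described in the Spark documentation for the current implementation (partition index in the upper 31 bits, record number within the partition in the lower 33 bits). The function name simulated_ids and the partition sizes are hypothetical and only there for illustration.

```python
# Illustrative simulation only: this is NOT Spark's implementation.
# It mimics the documented bit layout of monotonically_increasing_id():
# partition index in the upper 31 bits, record number in the lower 33 bits.
from typing import List

RECORD_BITS = 33  # lower bits hold the record number within a partition


def simulated_ids(rows_per_partition: List[int], start_value: int = 0) -> List[int]:
    """Return the IDs that rows would receive for the given partition sizes."""
    ids = []
    for partition_index, row_count in enumerate(rows_per_partition):
        # Each partition starts its own numbering block.
        base = partition_index << RECORD_BITS
        for record_number in range(row_count):
            ids.append(base + record_number + start_value)
    return ids


if __name__ == "__main__":
    # Hypothetical layout: two partitions with three rows each.
    print(simulated_ids([3, 3]))
    # [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

The jump between the third and fourth ID is the partition boundary. If consecutive numbers are required, row_number() over an ordered Window is a common alternative, at the cost of moving all rows into a single partition when the window has no partitionBy.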
