1. Introduction

1.1 Spark DataFrames vs. RDDs

RDD (Resilient Distributed Dataset)

Spark's core data structure: an immutable, distributed collection of objects spread across the cluster. Powerful, but low-level and hard to work with directly.

Spark DataFrames

A higher-level abstraction built on top of RDDs, designed to behave a lot like a SQL table: named columns, typed rows, and operations that Spark can optimize automatically.

Create a SparkSession

# To start working with Spark DataFrames, create (or retrieve) a
# SparkSession, the entry point to DataFrame and SQL functionality
from pyspark.sql import SparkSession

my_spark = SparkSession.builder.getOrCreate()
print(my_spark)

# SparkSession has an attribute called catalog, which lists all the data
# (tables and views) registered inside the cluster.
# Returns the names of all the tables in the catalog
print(my_spark.catalog.listTables())

2. Spark Schemas