1. Introduction
1.1 Spark DataFrames vs. RDDs
RDD (Resilient Distributed Dataset)
Spark's core data structure
- ✅: A low-level object that lets Spark work its magic by splitting data across multiple nodes in the cluster.
- ❌: RDDs are hard to work with directly, so we'll be using the Spark DataFrame abstraction built on top of RDDs.
Spark DataFrames
Designed to behave a lot like a SQL table
- ✅:
  - Easier to understand than raw RDDs
  - Operations using DataFrames are automatically optimized
  - When using RDDs, it's up to the data scientist to figure out the right way to optimize each query; the DataFrame implementation has much of this optimization built in (see the sketch after this list)!
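To make the contrast concrete, here is a minimal sketch (the data is made up) that filters the same rows once through the RDD API and once through the DataFrame API; only the latter goes through Spark's Catalyst optimizer:

# A minimal sketch contrasting the two APIs (the data is made up)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 29)], ["name", "age"])

# RDD route: Spark receives an opaque Python function, so it cannot optimize it
adults_rdd = df.rdd.filter(lambda row: row.age > 30).collect()

# DataFrame route: a declarative expression the optimizer can analyze and rearrange
adults_df = df.filter(df.age > 30).collect()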
1.2 Create a SparkSession
- SparkContext: our connection to the cluster
- SparkSession: our interface with that connection
# To start working with Spark DataFrames
from pyspark.sql import SparkSession

# getOrCreate() returns the existing SparkSession if there is one, and creates a new one otherwise
my_spark = SparkSession.builder.getOrCreate()
print(my_spark)

# SparkSession has an attribute called catalog, which lists all the data inside the cluster;
# listTables() returns the names of all the tables
print(my_spark.catalog.listTables())
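Since the session wraps the connection, the underlying SparkContext is available as an attribute of the session rather than something you build by hand; a quick sanity check:

# The cluster connection lives on the session itself
print(my_spark.sparkContext)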
2. Spark Schemas
- Define the format of a DataFrame
- May contain various data types:
  - Strings, dates, integers, arrays
- Can filter garbage data during import
- Improve read performance (Spark can skip inferring types from the data); see the sketch below
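A minimal sketch of defining a schema and applying it at read time (the file name and columns are hypothetical). Passing the schema to the reader is what skips the inference pass, and mode="DROPMALFORMED" is one way to discard rows that don't match:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

spark = SparkSession.builder.getOrCreate()

# Each StructField gives a column name, its data type, and whether nulls are allowed
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
    StructField("signup_date", DateType(), nullable=True),
])

# With an explicit schema Spark skips type inference, and DROPMALFORMED
# silently drops any row that doesn't fit the declared format
people = spark.read.csv("people.csv", schema=schema, mode="DROPMALFORMED")
people.printSchema()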