1. Introduction
1.1 Spark DataFrames vs. RDDs
RDD (Resilient Distributed Dataset)
Spark's core data structure
- ✅: A low-level object that lets Spark work its magic by splitting data across multiple nodes in the cluster.
- ❌: RDDs are hard to work with directly, so we'll be using the Spark DataFrame abstraction built on top of RDDs.
Spark DataFrames
Designed to behave a lot like a SQL table
- ✅:
  - Easier to understand than raw RDDs
  - Operations using DataFrames are automatically optimized
  - When using RDDs, it's up to the data scientist to figure out the right way to optimize each query; the DataFrame implementation has much of this optimization built in (see the sketch after this list)!
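To make the contrast concrete, here is a minimal sketch (the data is made up) that filters the same rows once through the RDD API and once through the DataFrame API; only the latter goes through Spark's Catalyst optimizer:

# A minimal sketch contrasting the two APIs (the data is made up)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 29)], ["name", "age"])

# RDD route: Spark receives an opaque Python function, so it cannot optimize it
adults_rdd = df.rdd.filter(lambda row: row.age > 30).collect()

# DataFrame route: a declarative expression the optimizer can analyze and rearrange
adults_df = df.filter(df.age > 30).collect()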
1.2 Create a SparkSession
- SparkContext: our connection to the cluster
- SparkSession: our interface with that connection
# To start working with Spark DataFrames
from pyspark.sql import SparkSession

# getOrCreate() returns the existing SparkSession if there is one, and creates a new one otherwise
my_spark = SparkSession.builder.getOrCreate()
print(my_spark)

# SparkSession has an attribute called catalog, which lists all the data inside the cluster;
# listTables() returns the names of all the tables
print(my_spark.catalog.listTables())
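Since the session wraps the connection, the underlying SparkContext is available as an attribute of the session rather than something you build by hand; a quick sanity check:

# The cluster connection lives on the session itself
print(my_spark.sparkContext)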
2. Spark Schemas
- Define the format of a DataFrame
- May contain various data types:
  - Strings, dates, integers, arrays
- Can filter garbage data during import
- Improve read performance (Spark can skip inferring types from the data); see the sketch below
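A minimal sketch of defining a schema and applying it at read time (the file name and columns are hypothetical). Passing the schema to the reader is what skips the inference pass, and mode="DROPMALFORMED" is one way to discard rows that don't match:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

spark = SparkSession.builder.getOrCreate()

# Each StructField gives a column name, its data type, and whether nulls are allowed
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
    StructField("signup_date", DateType(), nullable=True),
])

# With an explicit schema Spark skips type inference, and DROPMALFORMED
# silently drops any row that doesn't fit the declared format
people = spark.read.csv("people.csv", schema=schema, mode="DROPMALFORMED")
people.printSchema()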