SparkSession in PySpark
The core of PySpark’s functionality is encapsulated in the `SparkSession` object, which serves as the entry point for programming Spark with the Dataset and DataFrame API. This article explores the `SparkSession` in PySpark, covering its creation, usage, and some of the key methods and configurations it offers.
Introduction to SparkSession
Before the introduction of `SparkSession`, Spark used different context objects for different kinds of functionality – for instance, `SparkContext` for RDD (Resilient Distributed Dataset) operations, and `SQLContext` and `HiveContext` for SQL and Hive operations. Starting with Spark 2.0, these contexts were encapsulated within `SparkSession`. Today, `SparkSession` is the central point of interaction with Spark’s functionalities, providing a unified and simplified entry point to work with structured data (DataFrames and Datasets).
Creating a SparkSession
Creating a `SparkSession` is the first step when working with PySpark. It instantiates the underlying Spark and SQL contexts and provides a way to program Spark with DataFrame and Dataset APIs. Here’s how you can create a simple `SparkSession` instance:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Exploring SparkSession") \
    .getOrCreate()
print(spark)
Output:
<pyspark.sql.session.SparkSession object at 0x...>
After running this code snippet, you should see a `SparkSession` object printed, indicating that a session has been successfully created with the application name “Exploring SparkSession”.
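You can also confirm which Spark version the session is running and which application it belongs to by inspecting a couple of its attributes (the version shown below is only an example; yours depends on your installation):
print(spark.version)
print(spark.sparkContext.appName)
Output: (version varies by installation)
3.5.0
Exploring SparkSession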
Configuring a SparkSession
When creating a `SparkSession`, you can configure various aspects of the Spark execution environment using the `config` method. For example, you can set the number of executor cores, the memory for each executor, and other Spark properties. Here’s an example of how to set the master URL and a configuration property before initializing the `SparkSession`:
spark = SparkSession.builder \
    .appName("Configured SparkSession") \
    .master("local[4]") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()
print(spark.sparkContext.getConf().getAll())
Output: (The output might vary, but includes all configuration settings)
[('spark.master', 'local[4]'), ('spark.executor.memory', '2g'), ('spark.app.name', 'Configured SparkSession'), ...]
In the code example, the `.master()` method sets the Spark master URL, which in this case is local mode using 4 worker threads. The `.config()` method sets the `spark.executor.memory` property to “2g”, which allocates 2 GB of memory to each executor.
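Once the session exists, configuration can also be read – and, for runtime-adjustable properties, changed – through the `spark.conf` interface. Here is a small sketch, assuming the session built above; the shuffle-partition value of 8 is an arbitrary example:
print(spark.conf.get("spark.executor.memory"))
spark.conf.set("spark.sql.shuffle.partitions", "8")
print(spark.conf.get("spark.sql.shuffle.partitions"))
Output:
2g
8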
Accessing SparkContext and SQLContext through SparkSession
Even though `SparkSession` provides a unified entry point, you might sometimes need access to the underlying `SparkContext` or `SQLContext`. `SparkSession` exposes them via properties:
sc = spark.sparkContext
print(sc)
sqlContext = spark.sqlContext
print(sqlContext)
Output:
<SparkContext master=local[4] appName=Configured SparkSession>
<pyspark.sql.context.SQLContext object at 0x...>
This code snippet prints the `SparkContext` and `SQLContext` objects. Access to the `SparkContext` is useful when you need to interact with low-level Spark capabilities, like accumulators or broadcasts. The `SQLContext` can still be used for backward compatibility.
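For instance, the `SparkContext` obtained above is what you would use for RDD-level work such as `parallelize`, broadcast variables, and accumulators. A minimal sketch:
rdd = sc.parallelize([1, 2, 3, 4])
lookup = sc.broadcast({"a": 1, "b": 2})
print(rdd.map(lambda x: x * 2).collect())
print(lookup.value["b"])
Output:
[2, 4, 6, 8]
2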
Using SparkSession to Create DataFrames
One of the primary reasons for using `SparkSession` is to create DataFrames from various data sources. Here’s a simple example of creating a DataFrame from a list of tuples and inferring the schema automatically:
data = [("James", "Bond", "100"), ("Ann", "Varsa", "200")]
columns = ["firstname", "lastname", "id"]
df = spark.createDataFrame(data).toDF(*columns)
df.show()
Output:
+---------+--------+---+
|firstname|lastname| id|
+---------+--------+---+
| James| Bond|100|
| Ann| Varsa|200|
+---------+--------+---+
In this snippet, a list of tuples representing people with their first names, last names, and IDs is converted into a DataFrame with the `createDataFrame` method and column names provided with `toDF`. The `show` method is called on the DataFrame to print the data.
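Schema inference is convenient, but you can also declare the schema explicitly with `StructType` when you want precise control over column names and types. A short sketch using the same data:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
])

df_with_schema = spark.createDataFrame(data, schema=schema)
df_with_schema.printSchema()
Output:
root
 |-- firstname: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)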
Interacting with External Data Sources
`SparkSession` also simplifies the interaction with external data sources like Hive, JDBC, JSON, Parquet, and more. Here’s an example of reading a JSON file into a DataFrame using `SparkSession`:
json_df = spark.read.json("path/to/input.json")
json_df.show()
This would output the content of the “input.json” file (the path here is a placeholder) as a Spark DataFrame. Note that `spark.read.json` expects JSON Lines by default, i.e. one JSON object per line; for a regular multi-line JSON document, pass `.option("multiLine", True)` before calling `.json(...)`.
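The same `spark.read` interface covers other formats as well; the paths below are placeholders for your own files. For example, a CSV file with a header row can be read with options and the result written back out as Parquet:
csv_df = spark.read.option("header", True).option("inferSchema", True).csv("path/to/input.csv")
csv_df.write.mode("overwrite").parquet("path/to/output_parquet")
parquet_df = spark.read.parquet("path/to/output_parquet")
parquet_df.show()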
Using SparkSession for SQL Queries
Another powerful feature of `SparkSession` is the ability to execute SQL queries directly on DataFrames. First, you register the DataFrame as a temporary view, and then use the `sql` method to run your SQL query:
df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT * FROM people WHERE id > 100")
sqlDF.show()
Output:
+---------+--------+---+
|firstname|lastname| id|
+---------+--------+---+
| Ann| Varsa|200|
+---------+--------+---+
In this example, the `createOrReplaceTempView` method creates a temporary view of the DataFrame that you can then query using SQL syntax. The result of the query is returned as a new DataFrame.
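Any valid Spark SQL can be run against the registered view, including aggregations. For example, counting the rows in the same “people” view:
agg_df = spark.sql("SELECT COUNT(*) AS total FROM people")
agg_df.show()
Output:
+-----+
|total|
+-----+
|    2|
+-----+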
Stopping a SparkSession
It’s important to stop the `SparkSession` once your job is finished, because stopping it releases the resources held by the session. This is especially important in environments like shared clusters. The `stop` method is used for this purpose:
spark.stop()
After executing this line, the `SparkSession` along with its associated `SparkContext` will be stopped, all executors will be removed, and no further operations can be performed unless a new `SparkSession` is created.
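In a standalone script, a common pattern is to wrap the job in a try/finally block so the session is always stopped, even if the job raises an error. A minimal sketch (the application name and the `spark.range(10)` call are just placeholders for your own job logic):
spark = SparkSession.builder.appName("My Batch Job").getOrCreate()
try:
    spark.range(10).show()
finally:
    spark.stop()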
Conclusion
`SparkSession` in PySpark provides a comprehensive and convenient interface for interacting with Spark’s powerful data processing engine. With its ability to configure sessions, access contexts, and work with various data sources and APIs, `SparkSession` simplifies the development of scalable, complex data processing tasks in Python. Whether you are processing large-scale datasets or running interactive queries, understanding and leveraging `SparkSession` is essential for efficient and effective use of Apache Spark with PySpark.