Install and Run PySpark on Windows – Apache Spark is a powerful distributed computing system designed for big data processing and analytics. PySpark is the Python interface to Apache Spark: it lets you work with Spark’s powerful data abstractions using Python’s simpler syntax and vast ecosystem. In this guide, we’ll go through the steps to install and run PySpark on a Windows machine, along with some basic usage examples.
System Requirements
Before we begin, make sure that your system meets the following requirements:
- Windows 7 or higher
- Python 3.6 or higher (recent PySpark releases require Python 3.8 or newer)
Step 1: Install Java
Apache Spark requires Java to be installed on your machine; recent Spark 3.x releases run on Java 8, 11, or 17. You can download Java from the official Oracle website (or use an OpenJDK build). After downloading, run the installer and follow the instructions to install Java on your system.
Verifying Java Installation
To verify Java is installed correctly, open a Command Prompt and type the command:
java -version
If Java is installed correctly, you should see the version of Java printed in the console.
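Many Spark components also look for the JAVA_HOME environment variable. You can set it from a Command Prompt with setx (the JDK path below is just an example – use the directory where your installer actually placed Java):
setx JAVA_HOME "C:\Program Files\Java\jdk-17"
Open a new Command Prompt afterwards, since setx only affects sessions started after it runs.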
Step 2: Install Scala
Spark itself is written in Scala, but the Spark distribution bundles the Scala runtime it needs, so a separate Scala installation is not strictly required for PySpark. It is still handy if you plan to use the Scala spark-shell. You can download the Scala binaries from the official Scala website. After downloading the installer, run it and follow the instructions to install Scala.
Verifying Scala Installation
To ensure Scala is installed correctly, open the Command Prompt and type:
scala -version
This should output the version of Scala installed on your machine.
Step 3: Install Apache Spark
Now it’s time to install Apache Spark. You can download the latest Spark release from the official download page. Choose a package type that is pre-built for Apache Hadoop; for Windows users, the pre-built-for-Hadoop package will usually work fine.
Unzipping Spark Files
After downloading, unzip the Spark binaries to a directory on your system, such as C:\spark. Note that Spark releases ship as .tgz archives, so on Windows you may need a tool such as 7-Zip to extract them.
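To sanity-check the extraction, list the bin folder; on Windows you should see launcher scripts such as pyspark.cmd and spark-submit.cmd among the files:
dir C:\spark\bin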
Step 4: Configure Environment Variables
For Spark and PySpark to run correctly, we need to set up a few environment variables.
Adding Spark to the PATH
Add the bin directory of Spark to the PATH environment variable. Follow these steps:
1. Right-click on ‘This PC’ (or ‘Computer’ on older versions of Windows) and click ‘Properties’.
2. Click ‘Advanced system settings’.
3. In the ‘System Properties’ window, click ‘Environment Variables’.
4. Under ‘System Variables’ find ‘Path’ and select it.
5. Click ‘Edit’.
6. Click ‘New’ and add the path to the Spark bin directory, e.g., C:\spark\bin.
7. Click ‘OK’ to close all dialogs.
Set SPARK_HOME
You should also set the SPARK_HOME environment variable:
1. In the ‘Environment Variables’ window, click ‘New’ under ‘System Variables’.
2. Set the variable name to ‘SPARK_HOME’ and the variable value to your Spark installation directory, e.g., C:\spark.
3. Click ‘OK’.
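If you prefer the command line over the dialogs, the same variable can be set with setx (assuming C:\spark is where you unzipped Spark):
setx SPARK_HOME "C:\spark"
You can confirm it took effect by opening a new Command Prompt and running echo %SPARK_HOME%.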
Step 5: Install winutils.exe
Spark depends on Hadoop client libraries, and on Windows the Hadoop code requires a native helper, winutils.exe, to work correctly.
1. Download winutils.exe from a verified source, or from the Hadoop releases if you know which Hadoop version your Spark build was compiled against.
2. Create a directory for the Hadoop binaries, such as C:\hadoop\bin.
3. Place the downloaded winutils.exe in the bin directory.
Set HADOOP_HOME
1. Using the same ‘Environment Variables’ window, create a new System Variable named ‘HADOOP_HOME’ with the value set to the directory above bin, e.g., C:\hadoop.
2. Ensure the bin directory (C:\hadoop\bin) is added to your PATH variable.
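As with SPARK_HOME, you can set the variable from a Command Prompt instead:
setx HADOOP_HOME "C:\hadoop"
If you later hit permission errors mentioning \tmp\hive, a commonly suggested fix is to create that directory and open up its permissions with winutils (this is only needed for Hive-backed features):
mkdir C:\tmp\hive
C:\hadoop\bin\winutils.exe chmod 777 C:\tmp\hive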
Step 6: Install PySpark
With all the dependencies in place, you can now install PySpark. The easiest way to do this is using pip, Python’s package manager. Open a Command Prompt and run the following command:
pip install pyspark
Verifying PySpark Installation
After the installation is complete, you can verify it by running:
pyspark --version
This will display the version of Spark that PySpark is using.
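You can also confirm that Python itself can import the package, which checks the pip installation independently of the launcher scripts:
python -c "import pyspark; print(pyspark.__version__)"
This should print a version string such as 3.5.1 (the exact number depends on the release pip installed).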
Step 7: Running PySpark
To run PySpark, simply enter the `pyspark` command in your Command Prompt. This will open an interactive PySpark shell where you can type Python code and see the results immediately.
PySpark Hello World
Let’s run a simple Spark job to test our installation. The interactive PySpark shell already provides a ready-made SparkSession named spark, so getOrCreate() below simply returns it. Type the following:
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("HelloWorld").getOrCreate()
>>> df = spark.createDataFrame([("Hello, world!",)], ["text"])
>>> df.show()
The output should be:
+-------------+
| text|
+-------------+
|Hello, world!|
+-------------+
You’ve successfully run your first PySpark job!
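Beyond the interactive shell, you can also run PySpark as a standalone script. Here is a minimal sketch; the file name hello_spark.py and the word list are just examples:
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession for this application
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Build a small DataFrame and count rows per word
df = spark.createDataFrame([("hello",), ("world",), ("hello",)], ["word"])
df.groupBy("word").count().show()

# Stop the session so the application exits cleanly
spark.stop()
Then run it with spark-submit (or plain python, since pyspark was installed with pip):
spark-submit hello_spark.py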
Conclusion
Congratulations! You have successfully installed and run PySpark on your Windows machine. With PySpark installed, you can begin exploring the vast array of functionality it offers, from basic data manipulation to advanced machine learning applications. Remember that Spark is a complex and powerful tool – it’s worth investing time to learn about its architecture and capabilities to fully leverage its potential.
Happy Sparking!