Adding third-party Java JAR files to a PySpark application is a common requirement, especially when you need to use custom libraries or UDFs written in Java. Below are the main ways to include such JAR files in your PySpark job.
Method 1: Adding JAR files when starting the PySpark shell
When you start the PySpark shell, use the `--jars` option followed by the path to your JAR file. This will ensure that the specified JAR file is included in the classpath of the Spark driver and executors.
```bash
pyspark --jars path/to/your-file.jar
```
Example in PySpark shell:
```bash
$ pyspark --jars /path/to/my-custom-library.jar
```
This will start a PySpark interactive shell with the specified JAR file added to the classpath.
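Once the shell is running, classes from the JAR can be reached through the Py4J gateway exposed by the session. The sketch below assumes a hypothetical class `com.example.MyCustomClass` inside `my-custom-library.jar`; substitute whatever your JAR actually provides:

```python
# Inside the PySpark shell, the `spark` session already exists.
# Classes from the added JAR are reachable via the JVM gateway.
# "com.example.MyCustomClass" is a placeholder, not a real library class.
my_obj = spark._jvm.com.example.MyCustomClass()
print(my_obj)  # confirms the class could be instantiated from the JAR
```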
Method 2: Adding JAR files to a spark-submit job
If you’re submitting a PySpark job using `spark-submit`, you can similarly use the `--jars` option followed by the path to the JAR file.
```bash
spark-submit --jars path/to/your-file.jar your-python-script.py
```
Example command:
```bash
$ spark-submit --jars /path/to/my-custom-library.jar my_pyspark_script.py
```
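If the application depends on more than one JAR, `--jars` accepts a comma-separated list with no spaces between entries. The file names below are placeholders:

```bash
$ spark-submit --jars /path/to/my-custom-library.jar,/path/to/another-dependency.jar my_pyspark_script.py
```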
Method 3: Adding JAR files programmatically in SparkSession
You can also add JAR files programmatically when creating a `SparkSession`. This can be particularly useful if you need to add JAR files dynamically based on certain conditions in your code. Note that `spark.jars` only takes effect when the session is first created; setting it on an already-running session has no effect.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.jars", "path/to/your-file.jar") \
    .getOrCreate()
```
Example PySpark Code:
```python
from pyspark.sql import SparkSession

# Build SparkSession with the required JAR file
spark = SparkSession.builder \
    .appName("AddJarExample") \
    .config("spark.jars", "/path/to/my-custom-library.jar") \
    .getOrCreate()

# Create a DataFrame
data = [("James", "Sales", 3000),
        ("Michael", "Sales", 4600),
        ("Robert", "Sales", 4100)]
columns = ["EmployeeName", "Department", "Salary"]
df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show()
```
```
+-------------+----------+------+
| EmployeeName|Department|Salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
+-------------+----------+------+
```
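The usual reason to ship a Java JAR with a PySpark job is to call code inside it, for example a Java UDF. The sketch below assumes the JAR contains a hypothetical UDF class `com.example.udf.ToUpperCase` implementing Spark's `UDF1<String, String>` interface; adjust the class name and return type to match your library:

```python
from pyspark.sql.functions import expr
from pyspark.sql.types import StringType

# Register a Java UDF from the JAR under a SQL-callable name.
# "com.example.udf.ToUpperCase" is a placeholder class name.
spark.udf.registerJavaFunction("to_upper_java", "com.example.udf.ToUpperCase", StringType())

# Call the registered function through a SQL expression
df.withColumn("DepartmentUpper", expr("to_upper_java(Department)")).show()
```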
Method 4: Using Spark Configuration
Alternatively, you can supply JAR files via the Spark configuration properties in the `spark-defaults.conf` file:
```
spark.jars=/path/to/your-file.jar
```
This configuration is applied to every Spark application launched with that Spark installation, so the JAR no longer needs to be passed on each submit.
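A related property, `spark.jars.packages`, pulls dependencies by Maven coordinates instead of a local file path, and Spark resolves and downloads the JAR along with its transitive dependencies. The coordinate below is only an illustration, not a real artifact:

```
spark.jars=/path/to/your-file.jar
spark.jars.packages=org.example:example-library:1.0.0
```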
Conclusion
Adding third-party Java JAR files to your PySpark application is straightforward once you know your options. Whether it’s through the PySpark shell, `spark-submit`, programmatically via `SparkSession`, or through cluster configuration, these methods ensure that the necessary JAR files are included in your Spark job’s classpath. Choose the method that best suits your development and deployment workflow.