Running a Spark Java program involves a few steps: setting up the development environment, writing the Spark application, packaging the application into a JAR file, and finally running the JAR with the spark-submit script. Below is a detailed step-by-step guide.
Step 1: Setting Up the Development Environment
First, ensure you have the following tools installed on your system:
- Java Development Kit (JDK)
- Apache Spark
- Apache Maven
- Your favorite Integrated Development Environment (IDE) like IntelliJ IDEA or Eclipse
Install JDK
Make sure you have the JDK installed. You can check it by running:
```bash
java -version
```
If it’s not installed, download and install it from the official site.
Install Apache Spark
Download the latest version of Apache Spark from the official site. Extract it and set the `SPARK_HOME` environment variable to point to the root folder of the extracted directory.
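For example, on Linux or macOS you might add lines like the following to your shell profile; the `/opt/spark` path is only an assumption, so substitute wherever you extracted the archive:

```bash
# Assumed extraction location; adjust to your own path
export SPARK_HOME=/opt/spark
# Optional: put spark-submit and the other Spark scripts on the PATH
export PATH="$SPARK_HOME/bin:$PATH"
```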
Install Maven
Check whether Maven is installed by running:
```bash
mvn -version
```
If it’s not installed, download and install it from the official site.
Step 2: Creating the Maven Project
Create a New Maven Project
In your IDE, create a new Maven project. This will provide you with a basic project structure.
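If you prefer the command line to an IDE wizard, the Maven quickstart archetype generates an equivalent skeleton; the `com.example` group ID and `spark-java-example` artifact ID below simply match the names used later in this guide:

```bash
mvn archetype:generate \
  -DgroupId=com.example \
  -DartifactId=spark-java-example \
  -DarchetypeArtifactId=maven-archetype-quickstart \
  -DinteractiveMode=false
```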
Add Spark Dependencies
Edit the `pom.xml` file to include the Spark dependencies.
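A minimal version of the relevant sections might look like the following, assuming Spark 3.1.2 (the version shown in the sample output below) built against Scala 2.12. The `provided` scope keeps Spark's own classes out of your JAR because spark-submit supplies them at runtime, and the compiler properties ensure the Java 8+ lambdas used below compile:

```xml
<properties>
    <!-- Lambdas in the example below require Java 8 or newer -->
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
</properties>

<dependencies>
    <!-- spark-sql pulls in spark-core transitively -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.1.2</version>
        <scope>provided</scope>
    </dependency>
</dependencies>
```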
Step 3: Writing the Spark Application
Create a new Java class `SimpleApp.java` in the `com.example` package with the following content:
```java
package com.example;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class SimpleApp {
    public static void main(String[] args) {
        String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system
        SparkSession spark = SparkSession.builder().appName("Simple Application").getOrCreate();

        // textFile() returns a Dataset<String> with one element per line; cache it
        // because it is scanned twice below.
        Dataset<String> logData = spark.read().textFile(logFile).cache();

        long numAs = logData.filter((FilterFunction<String>) s -> s.contains("a")).count();
        long numBs = logData.filter((FilterFunction<String>) s -> s.contains("b")).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        spark.stop();
    }
}
```
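Two details are worth noting: `read().textFile()` returns a `Dataset<String>` with one element per line rather than a `Dataset<Row>`, and the explicit `FilterFunction<String>` cast is required because `filter()` is overloaded in Java, so a bare lambda would be ambiguous.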
Step 4: Packaging the Application
Use Maven to package your application into a JAR file. Run the following command from the root directory of your project:
```bash
mvn clean package
```
This command generates a JAR file in the `target` directory.
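You can verify the artifact with a quick listing; the `spark-java-example-1.0-SNAPSHOT.jar` name assumes the artifact ID and version used elsewhere in this guide:

```bash
ls target/*.jar
# target/spark-java-example-1.0-SNAPSHOT.jar
```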
Step 5: Running the Application
Use the spark-submit script to run your Spark application. From the root directory of your project, run:
```bash
$SPARK_HOME/bin/spark-submit --class com.example.SimpleApp --master local target/spark-java-example-1.0-SNAPSHOT.jar
```
Make sure to replace `com.example.SimpleApp` with the correct fully qualified class name if yours differs, and ensure you have replaced `YOUR_SPARK_HOME` in `SimpleApp.java` with the actual path to your Spark home directory.
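Note that `--master local` runs Spark with a single worker thread. For a quick sanity check that uses every core on your machine, you can pass `local[*]` instead (the class and JAR names are the same assumptions as above):

```bash
# local[*] spawns one worker thread per available core
$SPARK_HOME/bin/spark-submit \
  --class com.example.SimpleApp \
  --master "local[*]" \
  target/spark-java-example-1.0-SNAPSHOT.jar
```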
Output
If everything is set up correctly, you should see output similar to the following:
```
21/10/11 15:22:09 INFO SparkContext: Running Spark version 3.1.2
21/10/11 15:22:09 INFO SparkContext: Submitted application: Simple Application
Lines with a: 62, lines with b: 32
21/10/11 15:22:10 INFO SparkContext: Successfully stopped SparkContext
```
This output confirms that your Spark application ran successfully and counted the lines containing the letters “a” and “b” in the specified file.