Spark Create DataFrame: Step-by-Step Examples for Easy Understanding!

Spark Create DataFrame: In Apache Spark, you can create DataFrames in several ways using Scala. A DataFrame is a distributed collection of data organized into named columns. Below are some common methods for creating DataFrames, each with an example:

Creating DataFrames from Existing Data

You can create DataFrames from existing data structures like Lists, Arrays, or Seqs. Here’s an example:

import org.apache.spark.sql.{SparkSession, DataFrame}

// Create a SparkSession, which is the entry point for Spark functionality.
val spark = SparkSession.builder()
  .appName("DataFrameCreation")
  .getOrCreate()

// Create a sequence of data as tuples.
val data = Seq(("Alice", 25), ("Bob", 30), ("Charlie", 35))

// Define the column names for the DataFrame.
val columns = Seq("Name", "Age")

// Create a DataFrame by converting the sequence of tuples and specifying column names.
val df: DataFrame = spark.createDataFrame(data).toDF(columns: _*)

// Show the contents of the DataFrame.
df.show()

/* 
Output
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+
*/
  • This code starts by importing the necessary Spark libraries.
  • It creates a SparkSession named “DataFrameCreation,” which is the entry point for Spark operations.
  • A sequence of tuples named data is defined, representing the rows of the DataFrame.
  • columns defines the column names for the DataFrame.
  • The spark.createDataFrame(data) method creates a DataFrame from the data sequence, and .toDF(columns: _*) specifies the column names.
  • Finally, df.show() is used to display the contents of the DataFrame.

Defining a Schema

You can explicitly define a schema for your DataFrame. This is useful when the source data does not carry reliable type information (for example, CSV or text files) or when you want to enforce specific column types rather than rely on inference.

import org.apache.spark.sql.{SparkSession, DataFrame, Row}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Create a SparkSession, which is the entry point for Spark functionality.
val spark = SparkSession.builder()
  .appName("DataFrameCreation")
  .getOrCreate()

// Define the schema explicitly: column names, types, and nullability.
val schema = StructType(Seq(
  StructField("Name", StringType, nullable = true),
  StructField("Age", IntegerType, nullable = true)
))

// Represent each record as a Row so the schema can be applied.
val rowRDD = spark.sparkContext.parallelize(Seq(
  Row("Alice", 25), Row("Bob", 30), Row("Charlie", 35)
))

// Pair the RDD of Rows with the schema to create the DataFrame.
val df: DataFrame = spark.createDataFrame(rowRDD, schema)
df.show()
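
The same schema object can also be supplied when reading files, which skips schema inference and enforces the declared types. A minimal sketch, assuming a hypothetical JSON Lines file:

// Apply the explicit schema while reading a file (hypothetical path).
val jsonDf: DataFrame = spark.read
  .schema(schema)
  .json("path/to/your/people.json")

jsonDf.printSchema()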

Using RDDs (Resilient Distributed Datasets)

You can create a DataFrame from an existing RDD.

// Create an RDD containing a sequence of tuples.
val rdd = spark.sparkContext.parallelize(Seq(("Alice", 25), ("Bob", 30), ("Charlie", 35)))

// Create a DataFrame from the RDD and specify column names.
val df: DataFrame = spark.createDataFrame(rdd).toDF("Name", "Age")

// Display the contents of the DataFrame.
df.show()

/*
Output
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+
*/

First, an RDD named rdd is created using the parallelize method of Spark’s SparkContext. The RDD contains a sequence of tuples, where each tuple represents a row of data; here each tuple has two elements, a name (String) and an age (Int).

The RDD is then converted into a DataFrame named df using the spark.createDataFrame method, and the column names “Name” and “Age” are assigned with .toDF.

Finally, df.show() displays the contents of the DataFrame, printing its rows to the console.
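
As an alternative, once you import spark.implicits._ you can call toDF directly on an RDD of tuples instead of going through createDataFrame. A brief sketch using the same rdd:

// Alternative: convert the RDD to a DataFrame via the session's implicits.
import spark.implicits._

val dfAlt: DataFrame = rdd.toDF("Name", "Age")
dfAlt.show()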

Using Case Classes

You can create a DataFrame from a sequence of case class objects.

import org.apache.spark.sql.SparkSession

case class Person(Name: String, Age: Int)

object sparkDataFrame {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession, which is the entry point for Spark functionality.
    val spark = SparkSession.builder()
      .appName("DataFrameCreation")
      .master("local")
      .getOrCreate()

    // Create a sequence of case class objects.
    val data = Seq(Person("Alice", 25), Person("Bob", 30), Person("Charlie", 35))

    // Create a DataFrame from the sequence of case class objects;
    // column names are derived from the case class fields by reflection.
    val df = spark.createDataFrame(data)

    // Show the contents of the DataFrame.
    df.show()
    spark.stop()
  }
}

/*
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+
*/
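
With import spark.implicits._ in scope inside main, the same case class data can also be built as a strongly typed Dataset first and then viewed as a DataFrame. A minimal sketch:

// Alternative sketch: build a typed Dataset[Person], then view it as a DataFrame.
import spark.implicits._

val ds = Seq(Person("Alice", 25), Person("Bob", 30), Person("Charlie", 35)).toDS()
val dfFromDs = ds.toDF()
dfFromDs.show()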

Reading Data from External Sources

Spark allows you to read data from various external sources like CSV, Parquet, JSON, etc. Here’s an example reading from a CSV file:

DataFrame from CSV file

val csvPath = "path/to/your/file.csv"
val df: DataFrame = spark.read.csv(csvPath)
df.show()
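
By default, spark.read.csv treats every value as a string and does not use the first row as a header, so the columns come out as _c0, _c1, and so on. A short sketch with the commonly used options:

// Treat the first line as a header and let Spark infer column types.
val dfWithHeader: DataFrame = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvPath)

dfWithHeader.show()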

DataFrame from TXT file

To create a DataFrame from a text (TXT) file in Apache Spark using Scala, you can use the spark.read.text method.

// Read the text file into a DataFrame.
val textFilePath = "path/to/your/data.txt"
val df: DataFrame = spark.read.text(textFilePath)

// Show the contents of the DataFrame.
df.show()
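
spark.read.text produces a DataFrame with a single string column named value, one row per line. If the lines contain delimited fields, you can split them yourself; the sketch below assumes comma-separated lines such as "Alice,25":

import org.apache.spark.sql.functions.{col, split}

// Split the single "value" column into named, typed columns.
val parsed = df
  .withColumn("Name", split(col("value"), ",").getItem(0))
  .withColumn("Age", split(col("value"), ",").getItem(1).cast("int"))
  .drop("value")

parsed.show()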

DataFrame from JSON file

To create a DataFrame from a JSON file in Apache Spark using Scala, you can use the spark.read.json method.

// Read the JSON file into a DataFrame.
val jsonFilePath = "path/to/your/data.json"
val df: DataFrame = spark.read.json(jsonFilePath)

// Show the contents of the DataFrame.
df.show()
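
Note that spark.read.json expects JSON Lines by default, i.e. one JSON object per line. If the file is a single pretty-printed JSON array or object, enable the multiLine option; a brief sketch:

// For a pretty-printed JSON file, enable multiLine parsing.
val multiLineDf: DataFrame = spark.read
  .option("multiLine", "true")
  .json(jsonFilePath)

multiLineDf.show()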

DataFrame from XML file

Apache Spark does not natively support reading XML files into DataFrames. However, you can use the Databricks library, “spark-xml,” which provides a way to read XML files into DataFrames. You’ll need to include this library in your Spark project. Here’s how you can create a DataFrame from an XML file using the “spark-xml” library:

Add the “spark-xml” library to your project. You can do this in your build tool (e.g., build.sbt for Scala projects or pom.xml for Maven projects). Add the following dependency:

For Maven:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.13</artifactId>
    <version>0.17.0</version>
</dependency>

For SBT:

libraryDependencies += "com.databricks" %% "spark-xml" % "0.17.0"

Be sure to use the appropriate version of the library that matches your Spark and Scala versions.

After adding the library, you can create a DataFrame from an XML file as follows:

val xmlFilePath = "E:\\sparktpoint\\data\\data.xml"

// Read the XML file into a DataFrame using the spark-xml library.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")
  .load(xmlFilePath)

In the code above, you need to specify the rowTag option to indicate the XML element that should be treated as a row. In this example, we assume that the XML file contains records enclosed in a “record” tag. You should replace "record" with the actual tag used in your XML file.
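
For illustration, a hypothetical data.xml matching the rowTag above could look like the following; each <record> element becomes one row, and its child elements become columns:

<?xml version="1.0"?>
<records>
    <record>
        <Name>Alice</Name>
        <Age>25</Age>
    </record>
    <record>
        <Name>Bob</Name>
        <Age>30</Age>
    </record>
</records>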

Using SQL Queries

You can create a DataFrame by executing SQL queries on existing DataFrames.

val df1 = spark.createDataFrame(Seq(("Alice", 25), ("Bob", 30), ("Charlie", 35))).toDF("Name", "Age")
df1.createOrReplaceTempView("people")

val df2 = spark.sql("SELECT * FROM people WHERE Age >= 30")
df2.show()

/*
Output
+-------+---+
|   Name|Age|
+-------+---+
|    Bob| 30|
|Charlie| 35|
+-------+---+
*/

In summary, this code snippet demonstrates how to use Spark SQL to create a DataFrame, register it as a temporary table, perform SQL queries on the temporary table, and display the results.
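
A view registered with createOrReplaceTempView is scoped to the current SparkSession. If the view needs to be visible to other sessions in the same application, a global temporary view can be used instead; a brief sketch:

// Global temporary views live in the reserved global_temp database.
df1.createGlobalTempView("people_global")
val df3 = spark.sql("SELECT * FROM global_temp.people_global WHERE Age < 30")
df3.show()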

Spark Create DataFrame from RDBMS Database

To create a DataFrame from an RDBMS (Relational Database Management System) database using Scala and Apache Spark, you can use the spark.read.jdbc method. This method allows you to read data from a relational database and load it into a DataFrame. Here’s how you can do it:

Assuming you have a PostgreSQL database and want to create a DataFrame from a table named “mytable,” follow these steps:

Define the database connection properties:

    val jdbcUrl = "jdbc:postgresql://localhost:5432/postgres" // JDBC URL of your database
    val connectionProperties = new java.util.Properties()
    connectionProperties.put("user", "postgres") // Database username
    connectionProperties.put("password", "1234") // Database password

Read data from the RDBMS and create a DataFrame:

    val tableName = "mytable" // Name of the table you want to read
    val df = spark.read.jdbc(jdbcUrl, tableName, connectionProperties)

Note: To connect to a relational database using JDBC with Apache Spark, you need to include the appropriate JDBC driver for your specific database. You typically download the JDBC driver for your database (e.g., MySQL, PostgreSQL, Oracle, SQL Server, etc.) and add it to your Spark application’s classpath or add it as a dependency using your build tool (e.g., SBT, Maven, Gradle).
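
For example, with SBT the PostgreSQL driver can be added as an ordinary dependency (the version below is illustrative; pick one that matches your database server):

libraryDependencies += "org.postgresql" % "postgresql" % "42.7.3"

When submitting with spark-submit, the driver jar can alternatively be passed via the --jars option.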

Create DataFrame from HBase table

To create a DataFrame from an HBase table in Apache Spark, you’ll typically use a library like HBase-Spark Connector, which facilitates the integration of Spark and HBase. First, you need to include the HBase-Spark Connector library as a dependency in your Spark project. Make sure the version of the connector matches your Spark and HBase versions.

Use the HBase-Spark Connector to read data from an HBase table and create a DataFrame.

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Define a catalog that specifies the HBase table and its schema
val catalog = s"""{
   |"table":{"namespace":"default", "name":"yourHBaseTable"},
   |"rowkey":"key",
   |"columns":{
     |"col0":{"cf":"cf1", "col":"col0", "type":"string"},
     |"col1":{"cf":"cf1", "col":"col1", "type":"int"}
   |}
|}""".stripMargin

// Read data from HBase into a DataFrame
val df: DataFrame = spark.read
  .option("catalog", catalog)
  .option("hbase.config.resources", "/path/to/hbase-site.xml") // Replace with your HBase configuration file path
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

Create DataFrame from Hive table

To create a DataFrame from a Hive table in Apache Spark, you need to have Hive integration set up in your Spark application. First, you need to create a SparkSession, which is the entry point for Spark functionality. Make sure to enable Hive support by calling .enableHiveSupport().

Hive tables are accessible through Spark when you enable Hive support. You can access Hive tables using SQL queries or directly as DataFrames.

val spark = SparkSession.builder()
  .appName("DataFrameCreation")
  .master("local")
  .enableHiveSupport()
  .getOrCreate()

// Access a Hive table directly as a DataFrame.
val df1 = spark.table("yourHiveTable")

// Or run a SQL query against it.
val df2 = spark.sql("SELECT * FROM yourHiveTable")

From Other Sources

Spark provides connectors to various data sources, such as Kafka, Cassandra, Elasticsearch, and more. You can create DataFrames from these sources using the respective connectors.
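
As an illustration, the sketch below performs a batch read from a Kafka topic. It assumes the spark-sql-kafka-0-10 connector is on the classpath, and the broker address and topic name are placeholders:

// Batch read from a Kafka topic (requires the spark-sql-kafka-0-10 package).
val kafkaDf: DataFrame = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "yourTopic")
  .load()

// Kafka delivers key and value as binary; cast them to strings to inspect.
kafkaDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show()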

