How to Save DataFrame Directly to Hive?

Apache Spark allows you to save a DataFrame directly to a Hive table from any of its supported language APIs. Below is a detailed explanation, with code snippets showing how to do this in both PySpark and Scala.

Prerequisites

Before we proceed, ensure that the following prerequisites are met:

  • Apache Spark is installed and configured correctly.
  • Hive is set up and configured correctly.
  • Spark is configured to work with Hive (e.g., hive-site.xml is available in Spark’s conf directory); a quick connectivity check is sketched right after this list.
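
Before writing any data, it is worth confirming that Spark can actually reach the Hive metastore. The snippet below is a minimal PySpark sketch of such a check; the warehouse path /user/hive/warehouse is only an assumed example and should match your own environment.

from pyspark.sql import SparkSession

# Build a Hive-enabled session; the warehouse path is an assumed example
# location -- replace it with the directory used in your environment.
spark = SparkSession.builder \
    .appName("VerifyHiveConnectivity") \
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
    .enableHiveSupport() \
    .getOrCreate()

# If hive-site.xml is picked up correctly, this lists the Hive databases
# (at minimum, the default database).
spark.sql("SHOW DATABASES").show()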

Saving DataFrame to Hive Using PySpark

First, let’s create a sample DataFrame and then save it to a Hive table.

PySpark Code Example


from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("SaveDataFrameToHive") \
    .enableHiveSupport() \
    .getOrCreate()

# Create a sample DataFrame
data = [("Alice", 34), ("Bob", 45), ("Catherine", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, schema=columns)

# Write the DataFrame to a Hive table
df.write.mode("overwrite").saveAsTable("default.people")

# Query the table to verify
df_hive = spark.sql("SELECT * FROM default.people")
df_hive.show()

Output


+---------+---+
|     Name|Age|
+---------+---+
|    Alice| 34|
|      Bob| 45|
|Catherine| 29|
+---------+---+
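
saveAsTable is not limited to overwriting a managed table with default storage. As a sketch of common variations (the table name default.people_partitioned is just an illustrative choice), you can append rows to the existing table or write a partitioned, Parquet-backed table:

# Append new rows to the existing Hive table instead of overwriting it
new_data = [("David", 51)]
df_new = spark.createDataFrame(new_data, schema=columns)
df_new.write.mode("append").saveAsTable("default.people")

# Write a partitioned, Parquet-backed Hive table
# ("default.people_partitioned" is an illustrative table name)
df.write.mode("overwrite") \
    .format("parquet") \
    .partitionBy("Age") \
    .saveAsTable("default.people_partitioned")

The supported save modes are overwrite, append, ignore, and error/errorifexists (the default).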

Saving DataFrame to Hive Using Scala

Scala Code Example


import org.apache.spark.sql.SparkSession

// Initialize Spark session
val spark = SparkSession.builder()
  .appName("SaveDataFrameToHive")
  .enableHiveSupport()
  .getOrCreate()

// Import implicits for toDF, then create a sample DataFrame
import spark.implicits._
val data = Seq(("Alice", 34), ("Bob", 45), ("Catherine", 29))
val columns = Seq("Name", "Age")
val df = data.toDF(columns: _*)

// Write the DataFrame to a Hive table
df.write.mode("overwrite").saveAsTable("default.people")

// Query the table to verify
val df_hive = spark.sql("SELECT * FROM default.people")
df_hive.show()

Output


+---------+---+
|     Name|Age|
+---------+---+
|    Alice| 34|
|      Bob| 45|
|Catherine| 29|
+---------+---+

Conclusion

By following the steps above, you can easily save a DataFrame directly to a Hive table using either PySpark or Scala. The process involves creating a DataFrame, writing it to Hive with saveAsTable, and then optionally querying the table to verify the data. Make sure Spark is correctly configured to connect to Hive, usually by placing `hive-site.xml` in Spark’s `conf` directory.
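
If you also want to confirm the table’s presence or clean it up afterwards, plain Spark SQL statements work with the same Hive-enabled session; the snippet below is a small PySpark sketch.

# List the tables in the default database to confirm that "people" exists
spark.sql("SHOW TABLES IN default").show()

# Drop the table once it is no longer needed
spark.sql("DROP TABLE IF EXISTS default.people")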
