Apache Spark is a powerful distributed computing system that provides high-level APIs in Java, Scala, Python and R. It is designed to handle various data processing tasks ranging from batch processing to real-time analytics and machine learning. Spark SQL, a component of Apache Spark, introduces the concept of tables and views as abstractions over data, which can be used to impose a structure on data and execute SQL queries on top of it.
Understanding Data Abstractions: Tables vs. Views
In Spark SQL, both tables and views act as structured data abstractions, but there's a key distinction between them. Tables are structured datasets that can be stored in a variety of data sources, such as HDFS, Hive, or cloud storage, and can be persisted across sessions. Views, on the other hand, are defined on top of tables and provide a named, virtual structure that can be queried like a table but does not hold data of its own.
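To make the distinction concrete, here is a minimal sketch (it assumes a table named `fruits` with columns `id` and `fruit` already exists in the current database, and a `sparkSession` handle like the one created in the examples below): a view is just a named query over a table, and querying it re-runs that query rather than reading stored view data.

// Assumption: a table named "fruits" with columns (id, fruit) already exists.
// The view stores no data; it is a named query evaluated on each access.
sparkSession.sql("""
  CREATE OR REPLACE TEMP VIEW banana_view AS
  SELECT id, fruit FROM fruits WHERE fruit = 'banana'
""")
sparkSession.sql("SELECT * FROM banana_view").show()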
Managed and Unmanaged Tables
Within Spark’s domain, tables can be categorized into ‘managed’ and ‘unmanaged’ tables.
Managed Tables
Managed tables, sometimes called internal tables, are tables whose data and metadata are both managed by Spark. When a managed table is created, Spark decides where the data lives in the file system (by default, under the warehouse directory configured by `spark.sql.warehouse.dir`) and manages both the metadata and the data lifecycle. This means that if a managed table is dropped, Spark will delete both the metadata and the data itself.
Creating a Managed Table
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().appName("ManagedTableExample").getOrCreate()
import sparkSession.implicits._ // enables toDF on local Scala collections

// Build a small DataFrame and save it as a managed table; Spark picks the
// storage location and uses its default file format (Parquet).
val data = Seq((1, "apple"), (2, "banana"))
val df = data.toDF("id", "fruit")
df.write.saveAsTable("managed_fruits")
If you now run `sparkSession.sql("SHOW TABLES").show()`, you should see the `managed_fruits` table listed among the tables in the current database. Upon dropping this table, Spark will remove both the metadata and the data:
sparkSession.sql("DROP TABLE managed_fruits")
Unmanaged Tables
Unmanaged tables, also known as external tables, are tables for which Spark manages only the metadata, while the data lives at a location in the filesystem that you specify. When an unmanaged table is dropped, Spark deletes only the metadata; the data remains intact.
Creating an Unmanaged Table
// Register a table over data at an explicit path; Spark tracks only the metadata.
val location = "/path/to/table/data"
df.write.option("path", location).saveAsTable("unmanaged_fruits")
Similarly, when you drop an unmanaged table, only the metadata is removed and the data at `/path/to/table/data` remains untouched:
sparkSession.sql("DROP TABLE unmanaged_fruits")
Temporary and Global Temporary Views
Aside from permanent tables, Spark SQL provides temporary and global temporary views, which differ in how widely they are visible and how long they live.
Temporary Views
Temporary views are session-scoped and will not be visible in other Spark sessions. They are removed when the session that created them terminates.
Creating a Temporary View
df.createOrReplaceTempView("temp_fruits")
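Once registered, the view can be queried by name like any table, but only within the session that created it:

// Works here; a different SparkSession would not see this view.
sparkSession.sql("SELECT * FROM temp_fruits WHERE id = 1").show()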
Global Temporary Views
Global temporary views, in contrast, are visible to all Spark sessions within the same application and are tied to a system-preserved temporary database called `global_temp`. They are removed only when the Spark application terminates.
Creating a Global Temporary View
df.createGlobalTempView("global_temp_fruits")
When querying a global temporary view, you must reference the `global_temp` database:
sparkSession.sql("SELECT * FROM global_temp.global_temp_fruits").show()
Catalog and Metastore
The Spark SQL Catalog API is an interface to interact with the data catalog, which keeps all the metadata about tables, databases, columns, and functions. Depending on the configuration, Spark uses either an in-memory catalog or a persistent metastore, such as Hive, to store this information.
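The Catalog API is available on every SparkSession. A few illustrative calls, using standard Catalog methods and the table and view names from the examples above:

// List the databases and tables the catalog knows about; temporary views
// appear alongside persistent tables, distinguished by their tableType.
sparkSession.catalog.listDatabases().show()
sparkSession.catalog.listTables().show()

// Inspect the columns of a registered table or temporary view.
sparkSession.catalog.listColumns("temp_fruits").show()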
Performance Considerations for Tables and Views
How Spark interacts with different types of tables and views also has performance implications. Managed tables simplify lifecycle management, but they tie the data's location and layout to Spark's defaults; an unmanaged table lets you point Spark at data that is already stored in an optimized layout, such as partitioned Parquet or ORC files, and share it with other systems without risking deletion on `DROP TABLE`. Similarly, because temporary views are not materialized, complex transformations behind a view are recomputed each time the view is queried unless the underlying data is cached.
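A common way to avoid that recomputation is to cache the view's data so repeated queries are served from memory; a minimal sketch using the `temp_fruits` view from earlier:

// Materialize the view's result in memory; subsequent queries reuse it.
sparkSession.sql("CACHE TABLE temp_fruits")
sparkSession.sql("SELECT count(*) FROM temp_fruits").show()

// Release the cached data when it is no longer needed.
sparkSession.sql("UNCACHE TABLE temp_fruits")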
In summary, Apache Spark provides a robust set of abstractions for working with structured data. By understanding the differences between the various types of tables and views, data engineers and developers can make more informed decisions on how to structure their Spark applications for data processing and analysis tasks.