
Spark Databases, Tables, Views, and the Hive Metastore

🧾 Understanding Spark Databases, Tables, Views, and Hive Metastore in Databricks

When working with structured data in Apache Spark and Azure Databricks, it's essential to understand how data is organized, stored, and queried. Key components like databases, tables (managed and external), views, and the Hive Metastore form the foundation of Spark SQL operations.

In this blog, we'll walk through:

  • Spark's metadata and table architecture
  • Differences between managed and external tables
  • How Databricks connects with Hive Metastore and Azure Data Lake
  • Real-world examples using ADLS layers

🧱 Databases, Tables, and Views in Spark

In Spark SQL, the core organizational structure is:

  • Database – A namespace that contains related tables and views
  • Table – A structured dataset (either managed or external)
  • View – A virtual table defined by a SQL query (it stores no data physically)

Example:

CREATE DATABASE f1;
USE f1;

CREATE TABLE drivers (
  driver_id INT,
  name STRING,
  points INT
);

CREATE VIEW top_drivers AS
SELECT * FROM drivers WHERE points > 100;

🧠 Hive Metastore: The Metadata Repository

Spark uses the Hive Metastore to store metadata about databases, tables, views, and partitions.

Key points:

  • Spark interacts with the metastore through Spark SQL
  • You can use the Databricks default metastore, or configure an external metastore (e.g., Azure SQL, MySQL)
  • While the metadata resides in the metastore, actual data is typically stored in Azure Data Lake
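You can inspect what the metastore tracks directly from Spark SQL. A quick sketch, using the f1 database created earlier:

SHOW DATABASES;
DESCRIBE DATABASE EXTENDED f1;  -- location, owner, and database properties
SHOW TABLES IN f1;              -- tables and views registered in the database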

🧩 Spark Databases and Table Types

Within a Databricks workspace, databases contain two main types of tables:

1. Managed Tables

  • Spark handles both metadata and data storage
  • Dropping the table deletes the data

2. External Tables

  • Spark only manages metadata
  • The data resides in an external path (e.g., ADLS, Blob)
  • Dropping the table does not remove the data

Example:

-- Managed Table
CREATE TABLE drivers_managed (
  name STRING, points INT
);

-- External Table
CREATE TABLE drivers_external (
  name STRING, points INT
)
USING parquet
OPTIONS (path "/mnt/data/drivers/");
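The difference becomes visible when you drop the tables. A minimal illustration, using the two tables above:

DROP TABLE drivers_managed;   -- removes the metadata and deletes the data files
DROP TABLE drivers_external;  -- removes only the metadata; files in /mnt/data/drivers/ remain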

๐Ÿ” ADLS & Table Storage Architecture

In modern data platforms, we often separate data into logical layers in Azure Data Lake and define appropriate table types:

Layer              | Table Type     | Description
-------------------|----------------|----------------------------------
Raw Layer          | External Table | Unstructured/ingested data
Ingested Layer     | Managed Table  | Cleaned, semi-structured datasets
Presentation Layer | Managed Table  | Aggregated, report-ready data

These layers are typically powered by:

  • ADF Pipelines for movement
  • Spark jobs for transformation
  • Power BI for reporting
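To make the layering concrete, here is a minimal sketch. The storage account, container names, and columns are hypothetical; adjust the abfss paths to your own lake:

CREATE DATABASE IF NOT EXISTS raw;
CREATE DATABASE IF NOT EXISTS presentation;

-- Raw layer: external table over files ingested by ADF (path is illustrative)
CREATE TABLE raw.race_results_ext (
  race_id INT, driver STRING, points INT
)
USING parquet
OPTIONS (path "abfss://raw@mydatalake.dfs.core.windows.net/race_results/");

-- Presentation layer: managed table built by a Spark transformation
CREATE TABLE presentation.driver_standings AS
SELECT driver, SUM(points) AS total_points
FROM raw.race_results_ext
GROUP BY driver;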

๐Ÿ‘ Views: Logical Data Abstraction

Views are useful for:

  • Reusing business logic
  • Simplifying complex queries
  • Providing abstraction from underlying table structures

They do not store any data and are evaluated each time they are queried.
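Spark distinguishes permanent views (registered in the metastore), temporary views (scoped to the current Spark session), and global temporary views (shared across sessions on the same cluster). A short sketch using the drivers table from earlier:

-- Session-scoped: disappears when the session ends
CREATE OR REPLACE TEMP VIEW session_top_drivers AS
SELECT name, points FROM drivers WHERE points > 100;

-- Cluster-scoped: registered in the reserved global_temp database
CREATE OR REPLACE GLOBAL TEMP VIEW shared_top_drivers AS
SELECT name, points FROM drivers WHERE points > 100;

SELECT * FROM global_temp.shared_top_drivers;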


🧪 How to Identify Table Type

Use the command below to check if a table is managed or external:

DESCRIBE FORMATTED database_name.table_name;

Look for the "Type" and "Location" fields in the result.
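For the two example tables, the relevant rows would look roughly like this (abbreviated, illustrative output; the managed location shown is the Databricks default warehouse path):

-- drivers_managed
Type:      MANAGED
Location:  dbfs:/user/hive/warehouse/drivers_managed

-- drivers_external
Type:      EXTERNAL
Location:  dbfs:/mnt/data/drivers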


✅ Best Practices

Area             | Recommendation
-----------------|---------------------------------------------------------------
Raw data         | Use external tables with explicit paths
Processed layers | Use managed tables with auto-managed storage
Shared metadata  | Use an external Hive Metastore for consistency across clusters
Views            | Use views for reusable query logic
Governance       | Adopt Unity Catalog for fine-grained access control (if available)

📌 Summary

Concept          | Description
-----------------|------------------------------------------------
Hive Metastore   | Stores metadata only
Managed Tables   | Spark controls both data and metadata
External Tables  | Data managed externally; Spark tracks metadata
Views            | Virtual tables for abstraction and reusability
ADLS Integration | Supports a scalable lakehouse architecture

๐Ÿ Final Thoughts

Understanding the role of databases, tables, and the Hive Metastore is critical for building organized, secure, and scalable data solutions in Apache Spark and Databricks.

By using the right table types at each stage of your data pipeline, you'll improve governance, performance, and maintainability, especially when integrated with Azure Data Lake, ADF Pipelines, and Power BI.

