Mohammad Gufran Jahangir April 20, 2025 0

Table of Contents

🧾 Understanding Spark Databases, Tables, Views, and Hive Metastore in Databricks

When working with structured data in Apache Spark and Azure Databricks, it’s essential to understand how data is organized, stored, and queried. Key components like databases, tables (managed and external), views, and the Hive Metastore form the foundation of Spark SQL operations.

In this blog, we’ll walk through:

Spark’s metadata and table architecture
Differences between managed and external tables
How Databricks connects with Hive Metastore and Azure Data Lake
Real-world examples using ADLS layers

🧱 Databases, Tables, and Views in Spark

In Spark SQL, the core organizational structure is:

Database – A namespace that contains related tables and views
Table – A structured dataset (either managed or external)
View – A virtual table based on SQL logic (not storing data physically)

Example:

CREATE DATABASE f1;
USE f1;

CREATE TABLE drivers (
  driver_id INT,
  name STRING,
  points INT
);

CREATE VIEW top_drivers AS
SELECT * FROM drivers WHERE points > 100;

🧠 Hive Metastore: The Metadata Repository

Spark uses the Hive Metastore to store metadata about databases, tables, views, and partitions.

Key points:

Spark interacts with the metastore through Spark SQL
You can use the Databricks default metastore, or configure an external metastore (e.g., Azure SQL, MySQL)
While the metadata resides in the metastore, actual data is typically stored in Azure Data Lake

🧩 Spark Databases and Table Types

Within a Databricks workspace, databases contain two main types of tables:

1. Managed Tables

Spark handles both metadata and data storage
Dropping the table deletes the data

2. External Tables

Spark only manages metadata
The data resides in an external path (e.g., ADLS, Blob)
Dropping the table does not remove the data

Example:

-- Managed Table
CREATE TABLE drivers_managed (
  name STRING, points INT
);

-- External Table
CREATE TABLE drivers_external (
  name STRING, points INT
)
USING parquet
OPTIONS (path "/mnt/data/drivers/");

🔁 ADLS & Table Storage Architecture

In modern data platforms, we often separate data into logical layers in Azure Data Lake and define appropriate table types:

Layer	Table Type	Description
Raw Layer	External Table	Unstructured/ingested data
Ingested Layer	Managed Table	Cleaned, semi-structured datasets
Presentation Layer	Managed Table	Aggregated, report-ready data

These layers are typically powered by:

ADF Pipelines for movement
Spark jobs for transformation
Power BI for reporting

👁 Views: Logical Data Abstraction

Views are useful for:

Reusing business logic
Simplifying complex queries
Providing abstraction from underlying table structures

They do not store any data and are evaluated each time they are queried.

🧪 How to Identify Table Type

Use the command below to check if a table is managed or external:

DESCRIBE FORMATTED database_name.table_name;

Look for the "Type" and "Location" fields in the result.

✅ Best Practices

Area	Recommendation
Raw data	Use external tables with explicit paths
Processed layers	Use managed tables with auto-managed storage
Shared metadata	Use external Hive Metastore for consistency across clusters
Views	Use views for reusable query logic
Governance	Adopt Unity Catalog for fine-grained access control (if available)

📌 Summary

Concept	Description
Hive Metastore	Stores metadata only
Managed Tables	Spark controls both data and metadata
External Tables	Data managed externally, Spark tracks metadata
Views	Virtual tables for abstraction and reusability
ADLS Integration	Supports scalable lakehouse architecture

🏁 Final Thoughts

Understanding the role of databases, tables, and the Hive Metastore is critical for building organized, secure, and scalable data solutions in Apache Spark and Databricks.

By using the right table types at each stage of your data pipeline, you’ll improve governance, performance, and maintainability — especially when integrated with Azure Data Lake, ADF Pipelines, and Power BI.

Mohammad Gufran Jahangir

Category:

Spark Databases, Tables, Views, and the Hive Metastore

🧾 Understanding Spark Databases, Tables, Views, and Hive Metastore in Databricks

🧱 Databases, Tables, and Views in Spark

🧠 Hive Metastore: The Metadata Repository

Key points:

🧩 Spark Databases and Table Types

1. Managed Tables

2. External Tables

Example:

🔁 ADLS & Table Storage Architecture

👁 Views: Logical Data Abstraction

🧪 How to Identify Table Type

✅ Best Practices

📌 Summary

🏁 Final Thoughts