๐งพ Understanding Spark Databases, Tables, Views, and Hive Metastore in Databricks
When working with structured data in Apache Spark and Azure Databricks, itโs essential to understand how data is organized, stored, and queried. Key components like databases, tables (managed and external), views, and the Hive Metastore form the foundation of Spark SQL operations.
In this blog, weโll walk through:
- Sparkโs metadata and table architecture
- Differences between managed and external tables
- How Databricks connects with Hive Metastore and Azure Data Lake
- Real-world examples using ADLS layers
๐งฑ Databases, Tables, and Views in Spark
In Spark SQL, the core organizational structure is:
- Database โ A namespace that contains related tables and views
- Table โ A structured dataset (either managed or external)
- View โ A virtual table based on SQL logic (not storing data physically)
Example:
CREATE DATABASE f1;
USE f1;
CREATE TABLE drivers (
driver_id INT,
name STRING,
points INT
);
CREATE VIEW top_drivers AS
SELECT * FROM drivers WHERE points > 100;
๐ง Hive Metastore: The Metadata Repository
Spark uses the Hive Metastore to store metadata about databases, tables, views, and partitions.
Key points:
- Spark interacts with the metastore through Spark SQL
- You can use the Databricks default metastore, or configure an external metastore (e.g., Azure SQL, MySQL)
- While the metadata resides in the metastore, actual data is typically stored in Azure Data Lake
๐งฉ Spark Databases and Table Types
Within a Databricks workspace, databases contain two main types of tables:
1. Managed Tables
- Spark handles both metadata and data storage
- Dropping the table deletes the data
2. External Tables
- Spark only manages metadata
- The data resides in an external path (e.g., ADLS, Blob)
- Dropping the table does not remove the data
Example:
-- Managed Table
CREATE TABLE drivers_managed (
name STRING, points INT
);
-- External Table
CREATE TABLE drivers_external (
name STRING, points INT
)
USING parquet
OPTIONS (path "/mnt/data/drivers/");
๐ ADLS & Table Storage Architecture
In modern data platforms, we often separate data into logical layers in Azure Data Lake and define appropriate table types:
Layer | Table Type | Description |
---|---|---|
Raw Layer | External Table | Unstructured/ingested data |
Ingested Layer | Managed Table | Cleaned, semi-structured datasets |
Presentation Layer | Managed Table | Aggregated, report-ready data |
These layers are typically powered by:
- ADF Pipelines for movement
- Spark jobs for transformation
- Power BI for reporting
๐ Views: Logical Data Abstraction
Views are useful for:
- Reusing business logic
- Simplifying complex queries
- Providing abstraction from underlying table structures
They do not store any data and are evaluated each time they are queried.
๐งช How to Identify Table Type
Use the command below to check if a table is managed or external:
DESCRIBE FORMATTED database_name.table_name;
Look for the "Type"
and "Location"
fields in the result.
โ Best Practices
Area | Recommendation |
---|---|
Raw data | Use external tables with explicit paths |
Processed layers | Use managed tables with auto-managed storage |
Shared metadata | Use external Hive Metastore for consistency across clusters |
Views | Use views for reusable query logic |
Governance | Adopt Unity Catalog for fine-grained access control (if available) |
๐ Summary
Concept | Description |
---|---|
Hive Metastore | Stores metadata only |
Managed Tables | Spark controls both data and metadata |
External Tables | Data managed externally, Spark tracks metadata |
Views | Virtual tables for abstraction and reusability |
ADLS Integration | Supports scalable lakehouse architecture |
๐ Final Thoughts
Understanding the role of databases, tables, and the Hive Metastore is critical for building organized, secure, and scalable data solutions in Apache Spark and Databricks.
By using the right table types at each stage of your data pipeline, youโll improve governance, performance, and maintainability โ especially when integrated with Azure Data Lake, ADF Pipelines, and Power BI.