External tables supported in Azure Synapse Analytics

Mohammad Gufran Jahangir, December 28, 2023

An overview of the external table types supported in Azure Synapse Analytics

Hadoop-Based External Tables:

Access Data Sources: Azure Blob Storage, Azure Data Lake Storage Gen2, or Azure Data Lake Storage Gen1.
Technology: Utilize PolyBase for efficient data movement and query execution between external sources and Synapse.
File Formats Supported: Delimited text files (CSV, TSV, PSV), Parquet, ORC, and Hive RC.
Key Considerations:
- Potential performance overhead due to PolyBase.
- Require configuration of PolyBase.

Native External Tables:

Access Data Source: Azure Data Lake Storage Gen2 (currently).
Technology: Directly query data without PolyBase, aiming for enhanced performance and functionality.
File Formats Supported: Delimited/CSV, Parquet, and Delta Lake in the serverless SQL pool; Parquet only (public preview) in the dedicated SQL pool.
Key Considerations:
- Available in the serverless SQL pool, but still in public preview in the dedicated SQL pool, where functionality and support might be limited.
- Potentially offer better performance than Hadoop-based tables.

Key Differences:

Performance: Native tables can potentially outperform Hadoop-based tables due to the absence of PolyBase overhead.
File System Access: Native tables directly access Azure Data Lake Storage Gen2, while Hadoop-based tables rely on PolyBase.
Functionality: Native tables might offer broader SQL support and features in the future.

Choosing the Right Type:

Performance Priority: Consider native tables if performance is critical.
File Formats: Ensure the required formats are supported.
Data Source: Native tables currently support only Azure Data Lake Storage Gen2.
Comfort with Preview Features: If comfortable with preview features, try native tables for potential benefits.

Additional Considerations:

PolyBase Configuration: If using Hadoop-based tables, factor in PolyBase setup and management.
Future Developments: Native tables are still under development, so expect feature enhancements and broader support in future releases.

Feature	Hadoop external tables	Native external tables
Access data sources	Azure Blob Storage, Azure Data Lake Storage Gen2, or Azure Data Lake Storage Gen1	Azure Data Lake Storage Gen2 (currently)
Technology	PolyBase handles data movement and query execution between external sources and Synapse.	Queries the data directly without PolyBase, aiming for enhanced performance and functionality.
Key considerations	Potential performance overhead due to PolyBase. Requires configuration of PolyBase.	Still in preview in the dedicated SQL pool, so functionality and support might be limited. Potentially better performance than Hadoop-based tables.
Dedicated SQL pool	Available	Only Parquet tables are available, in public preview.
Serverless SQL pool	Not available	Available
Supported formats	Delimited/CSV, Parquet, ORC, Hive RC, and RC	Serverless SQL pool: Delimited/CSV, Parquet, and Delta Lake. Dedicated SQL pool: Parquet (preview).
Storage authentication	Storage Access Key (SAK), Microsoft Entra passthrough, managed identity, custom application Microsoft Entra identity	Shared Access Signature (SAS), Microsoft Entra passthrough, managed identity, custom application Microsoft Entra identity
CETAS (exporting/transformation)	Yes	CETAS with native tables as a target works only in the serverless SQL pool. You cannot use dedicated SQL pools to export data using native tables.
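
The availability and format rows of the comparison above can be captured as a small lookup. The following Python sketch is only an illustration: `SUPPORT` and `table_types_for` are hypothetical names, not part of any Synapse API, and the matrix simply transcribes the table, so it will go stale as the preview evolves.

```python
# Support matrix transcribed from the comparison table above.
# Keys are (external table type, SQL pool); values are the readable file formats.
# Hypothetical helper data for illustration, not a Synapse API.
SUPPORT = {
    ("hadoop", "dedicated"): {"csv", "parquet", "orc", "rc"},
    ("hadoop", "serverless"): set(),            # Hadoop tables are not available
    ("native", "dedicated"): {"parquet"},       # public preview
    ("native", "serverless"): {"csv", "parquet", "delta"},
}


def table_types_for(pool, file_format):
    """Return the external table types usable for a pool and file format."""
    pool = pool.lower()
    file_format = file_format.lower()
    return sorted(
        kind
        for (kind, supported_pool), formats in SUPPORT.items()
        if supported_pool == pool and file_format in formats
    )


print(table_types_for("dedicated", "parquet"))
print(table_types_for("serverless", "orc"))
```

An empty result, as for ORC in the serverless SQL pool, means the data would need to be converted or queried from a dedicated SQL pool instead.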

What are Parquet and ORC?

Both Parquet and ORC are columnar data file formats designed for efficiently storing and querying large datasets, commonly used with big data technologies like Apache Hadoop and Apache Spark. They offer significant advantages over traditional row-oriented formats like CSV or text files.

Key Advantages:

1. Storage Optimization:

Columnar: Store data by column instead of by row, reducing wasted space for repeated values in columns.
Compression: Implement efficient compression algorithms to further shrink file size.

2. Query Performance:

Selective Scanning: Allow reading only specific columns needed for a query, dramatically reducing data transferred and processed.
Predicate Pushdown: Push filtering conditions down to the file format, further minimizing data scanned.
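
The two ideas above can be illustrated with a toy model in plain Python. This is a simplified sketch, not how Parquet or ORC are actually implemented: the sample rows, the group size, and the `scan_greater_than` helper are all made up for the example. It stores each column as its own list and keeps min/max statistics per group of values so that a filter can skip whole groups.

```python
# Made-up sample data in row-oriented form.
rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "EU", "amount": 25.0},
    {"id": 3, "region": "US", "amount": 40.0},
    {"id": 4, "region": "US", "amount": 55.0},
]

# Columnar layout: one list per column instead of one record per row.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Selective scanning: summing amounts reads one column and never touches
# the id or region values.
total_amount = sum(columns["amount"])

# Predicate pushdown: split the column into groups and record min/max per
# group, so a filter can rule out a group without reading its values.
GROUP_SIZE = 2
amounts = columns["amount"]
groups = [amounts[i:i + GROUP_SIZE] for i in range(0, len(amounts), GROUP_SIZE)]
stats = [(min(group), max(group)) for group in groups]


def scan_greater_than(threshold):
    """Return the matching values and how many groups were actually read."""
    matches, groups_read = [], 0
    for group, (_low, high) in zip(groups, stats):
        if high <= threshold:
            continue  # no value in this group can exceed the threshold
        groups_read += 1
        matches.extend(value for value in group if value > threshold)
    return matches, groups_read


print(total_amount)
print(scan_greater_than(30.0))
```

With a threshold of 30.0 only the second group is read, because the statistics of the first group already show that none of its values can match.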

3. Schema Evolution:

Flexible Schema: Can handle some schema changes, such as adding new columns, without rewriting existing files.

4. Additional Features:

Parquet:
- Supports nested data structures.
- Offers different encoding options for specific data types.
ORC:
- Is the storage format behind Apache Hive ACID transactional tables (the transactions are provided by Hive, not by the file format itself).
- Provides bloom filters for faster filtering.

Choosing Between Parquet and ORC:

Performance: Both are high-performing; which one is faster depends on the query engine and workload, so test with your own queries.
Compatibility: Parquet is supported by a wider range of tools and frameworks (for example, Synapse native external tables read Parquet but not ORC), while ORC is most closely tied to the Apache Hive ecosystem.
Features: Consider specific needs like nested data structures (Parquet) or Hive ACID tables (ORC).
Storage Efficiency: Evaluate the compression effectiveness for your dataset and query patterns.

Final Note:

Both Parquet and ORC are excellent choices for big data storage and processing. Carefully consider your specific needs and environment to choose the best format for your scenario.

Limitations of external tables in Azure Synapse Analytics

The key limitations of external tables in Azure Synapse Analytics:

1. Performance:

Query Execution: Queries can be slower than queries against tables loaded into a SQL pool due to factors like:
- Network latency when accessing external storage
- Metadata operations
- Limited query optimization capabilities
Data Partitioning: External tables generally don’t support partition elimination, so a query may scan more data than necessary.
File Formats: Parquet and ORC often outperform text formats, but some operations might still be slower than on tables loaded into a SQL pool.

2. Functionality:

Limited SQL Features: Not all SQL features are available on external tables, including:
- Indexes
- Constraints (primary keys, foreign keys, check constraints)
- Triggers
- DML operations (INSERT, UPDATE, DELETE)
Statistics: Statistics can be created on external tables, but they are not always created or refreshed automatically, so they may need to be created manually.
No Data Manipulation: You cannot directly modify data in external tables; changes must be made in the source storage.

3. Security:

External Data Access: Requires careful management of permissions for external data sources.
Data Encryption: Encryption at rest in external storage is managed separately from Synapse.

4. Metadata Management:

Manual Updates: Changes to external data (file structure, location) might require manual updates to external table definitions.

5. PolyBase Considerations (Hadoop-based external tables):

Additional Configuration: PolyBase setup and configuration can add complexity.
Resource Utilization: PolyBase uses compute resources, potentially impacting other workloads.

6. Native External Table Limitations:

Parquet Only in the Dedicated SQL Pool: Native tables there are in public preview and support only the Parquet file format.
Limited Functionality: Some SQL features might still be unavailable.

Key Considerations:

Use external tables strategically, understanding their trade-offs.
Optimize queries and file formats for better performance.
Consider native external tables (in preview) for potential improvements.
If performance, functionality, or security requirements are paramount, loading data into Synapse might be more suitable.

Uses of external tables in Azure Synapse Analytics

The key scenarios where using external tables in Azure Synapse Analytics is advantageous:

1. Querying Large Data Files:

Avoid loading massive files into Synapse, potentially consuming significant storage and time.
Directly query data from external sources like Azure Blob Storage or Azure Data Lake Storage.

2. Integrating Data from Different Sources:

Combine data from various sources without complex ETL processes.
Create external tables for each source and query them together using joins and unions.

3. Analyzing Data in Place:

Analyze data without disrupting its original location or structure.
Ideal for sensitive data or compliance requirements that mandate data residency in specific storage.

4. Minimizing Data Movement:

Reduce data transfer costs and improve query performance, especially for large datasets.

5. Data Exploration and Pre-Processing:

Easily explore and evaluate the structure and content of external data before deciding on loading strategies.
Perform initial data cleaning, filtering, and transformations using external tables.

Specific Use Cases:

Data Warehousing: Integrate data from multiple sources for analytical querying.
Data Science: Access and analyze large datasets for machine learning and statistical modeling.
Data Archiving: Query archived data without restoring it to Synapse.
Log Analysis: Process and analyze log files stored in external storage.

Key Benefits:

Avoid Data Replication: Eliminate redundancy and storage costs.
Reduce Data Movement: Enhance query performance and cost-efficiency.
Simplify Data Integration: Streamline multi-source data analysis.
Maintain Data Integrity: Preserve data in its original format and location.

When to Consider Alternatives:

Frequent Updates: If data in external sources changes often, consider loading it into Synapse for better query performance and consistency.
Complex Queries: Some SQL features might not be fully supported for external tables, potentially limiting query options.
Strict Data Governance: If strict data governance and security policies require full control over data access and management, loading data into Synapse might be more suitable.

How do external tables work in Azure Synapse Analytics?

1. Definition:

External tables provide a virtual table structure within a SQL pool that maps to data residing in external storage, such as Azure Blob Storage or Azure Data Lake Storage.
The actual data remains in its original location; only metadata about the table structure and data location is stored within Synapse.

2. Creation:

You create external tables using CREATE EXTERNAL TABLE statements, specifying:
- Data source (external storage location)
- File format (CSV, Parquet, ORC, etc.)
- Table structure (column definitions)
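
The three pieces listed above map onto three T-SQL statements: CREATE EXTERNAL DATA SOURCE, CREATE EXTERNAL FILE FORMAT, and CREATE EXTERNAL TABLE. The Python sketch below assembles them as strings so the shape of each statement is visible. Everything specific in it is an assumption made for illustration: the `external_table_ddl` helper, the object names, the column list, and the storage URL with its `<container>` and `<account>` placeholders. Credentials, which a real data source usually needs, are left out.

```python
def external_table_ddl(table, columns, data_source, file_format,
                       storage_url, folder, format_type="PARQUET", hadoop=False):
    """Return the three T-SQL statements that define an external table.

    All names are caller-supplied placeholders. Values are interpolated
    as-is, so this is for illustration only and is not injection-safe.
    """
    # Hadoop external tables add TYPE = HADOOP to the data source;
    # native external tables omit the TYPE option.
    type_clause = ",\n    TYPE = HADOOP" if hadoop else ""
    data_source_sql = (
        f"CREATE EXTERNAL DATA SOURCE {data_source}\n"
        f"WITH (\n    LOCATION = '{storage_url}'{type_clause}\n);"
    )
    file_format_sql = (
        f"CREATE EXTERNAL FILE FORMAT {file_format}\n"
        f"WITH (FORMAT_TYPE = {format_type});"
    )
    column_lines = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    table_sql = (
        f"CREATE EXTERNAL TABLE {table} (\n{column_lines}\n)\n"
        f"WITH (\n"
        f"    LOCATION = '{folder}',\n"
        f"    DATA_SOURCE = {data_source},\n"
        f"    FILE_FORMAT = {file_format}\n);"
    )
    return [data_source_sql, file_format_sql, table_sql]


# Hypothetical example: a Hadoop external table over Parquet files.
statements = external_table_ddl(
    table="dbo.sales_ext",
    columns=[("id", "INT"), ("amount", "DECIMAL(10, 2)")],
    data_source="sales_source",
    file_format="parquet_format",
    storage_url="abfss://<container>@<account>.dfs.core.windows.net",
    folder="/sales/",
    hadoop=True,
)
print("\n\n".join(statements))
```

Passing `hadoop=False` produces the same statements without the TYPE option, which is how a native external table's data source is declared.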

3. Querying:

Once defined, you query external tables using standard SQL queries, similar to regular tables.
The query engine translates these queries into actions on the external data source.

4. Data Access:

Data is accessed directly from the external storage during query execution, avoiding upfront data loading.
PolyBase technology (for Hadoop-based external tables) optimizes data movement and query execution across Azure storage and Synapse.

Key Points:

No Data Duplication: Data stays in its original location, reducing storage costs and management overhead.
Minimal Data Movement: Only relevant data is retrieved for queries, improving performance and reducing network traffic.
Limited Functionality: Some SQL features might not be supported for external tables due to their virtual nature.
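Data Types and Constraints: External tables declare column data types but do not enforce constraints, and the files are not validated when the table is created; data that does not match the declared schema surfaces as errors or rejected rows only during query execution.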
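
The last point, that integrity problems show up only when data is read, can be mimicked in a few lines of Python. This is a conceptual sketch, not Synapse behavior: `DECLARED_COLUMNS`, `RAW_FILE`, and `read_rows` are invented for the example, and the in-memory string stands in for a CSV file in external storage.

```python
import csv
import io

# Hypothetical declared schema, playing the role of the column list
# in a CREATE EXTERNAL TABLE statement.
DECLARED_COLUMNS = [("id", int), ("amount", float)]

# Stand-in for a CSV file in external storage; line 2 violates the schema.
RAW_FILE = "1,10.5\n2,not-a-number\n3,7.25\n"


def read_rows(text):
    """Lazily parse rows, converting each field to its declared type."""
    for line_no, record in enumerate(csv.reader(io.StringIO(text)), start=1):
        try:
            yield tuple(
                cast(value)
                for (_name, cast), value in zip(DECLARED_COLUMNS, record)
            )
        except ValueError as exc:
            raise ValueError(f"line {line_no}: {exc}") from exc


# "Defining the table" succeeds because nothing has been read yet.
rows = read_rows(RAW_FILE)
print(next(rows))  # the first row matches the declared types

try:
    next(rows)  # the bad value is discovered only when the row is read
except ValueError as error:
    print("rejected at query time:", error)
```

The file with the bad row is accepted without complaint when the reader is created; the mismatch appears only once a query reaches that row.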

Additional Considerations:

Security: Ensure appropriate permissions to access the external data sources.
Performance: Query performance depends on factors like:
- External storage type and network latency
- File format (Parquet and ORC often outperform text formats)
- Query complexity and data size
Best Practices:
- Use external tables strategically for specific use cases.
- Consider performance implications and trade-offs.
- Optimize queries and file formats for efficiency.
- Explore native external tables (in preview) for enhanced performance and functionality.
