Azure Synapse SQL architecture

Mohammad Gufran Jahangir January 15, 2024 0

Table of Contents

Synapse dedicate SQL pool and Serverless SQL pool architecture components

Synapse SQL uses a scale-out architecture to distribute the data across multiple nodes.
In dedicated SQL pools, the “data warehouse unit” is like a measure of how much computing power is used to handle tasks. It’s a way to understand and manage how much work your system can handle at a time.
For serverless SQL pool, being serverless, scaling is done automatically according to query resource requirements.
As topology changes over time by adding, removing nodes or failovers, it adapts to changes and makes sure your query has enough resources and finishes successfully

Synapse SQL uses a node-based architecture.
The Azure Synapse SQL Control node utilizes a distributed query engine to optimize queries for parallel processing, and then passes operations to Compute nodes to do their work in parallel.
The serverless SQL pool Control node utilizes Distributed Query Processing (DQP) engine to optimize and orchestrate distributed execution of user query by splitting it into smaller queries that will be executed on Compute nodes. Each small query is called task and represents distributed execution unit. It reads file(s) from storage, joins results from other tasks, groups, or orders data retrieved from other tasks.
The Compute nodes store all user data in Azure Storage and run the parallel queries.
The Data Movement Service (DMS) is a system-level internal service that moves data across the nodes as necessary to run queries in parallel and return accurate results.

For serverless SQL pool scaling is done automatically, while for dedicated SQL pool one can:

Grow or shrink compute power, within a dedicated SQL pool, without moving data.
Pause compute capacity while leaving data intact, so you only pay for storage.
Resume compute capacity during operational hours.

Control node

The Control node is the brain of the architecture. It’s the front end that interacts with all applications and connections.
In Synapse SQL, the distributed query engine runs on the Control node to optimize and coordinate parallel queries.
When you submit a T-SQL query to dedicated SQL pool, the Control node transforms it into queries that run against each distribution in parallel.
In serverless SQL pool, the DQP engine runs on Control node to optimize and coordinate distributed execution of user query by splitting it into smaller queries that will be executed on Compute nodes.

Distributed query engine

Planning the most efficient route for your queries. It breaks down your queries into smaller tasks and distributes them among the compute nodes for parallel processing.
Control node utilizes a distributed query engine to optimize queries for parallel processing, and then passes operations to Compute nodes to do their work in parallel.

Compute nodes

Your data resides in Azure Storage and compute nodes retrieve and analyze it as instructed. Dedicated SQL pool allows scaling these nodes up or down based on your workload.
As you pay for more compute resources, pool remaps the distributions to the available Compute nodes.
The number of compute nodes ranges from 1 to 60, and is determined by the service level for the dedicated SQL pool.
Each Compute node has a node ID that is visible in system views.
In serverless SQL pool, each Compute node is assigned task and set of files to execute task on. Task is distributed query execution unit, which is actually part of query user submitted. Automatic scaling is in effect to make sure enough Compute nodes are utilized to execute user query.

Data Movement Service

Data Movement Service (DMS) is the data transport technology in dedicated SQL pool that coordinates data movement between the Compute nodes.
This internal service ensures smooth and efficient transfer.

Azure Storage

Synapse SQL uses Azure Storage to keep your user data safe.
Dedicated SQL pool utilizes Azure Blob Storage or Azure Data Lake Storage, offering flexibility and scalability for your information
Serverless SQL pool allows you to query your data lake files, while dedicated SQL pool allows you to query and ingest data from your data lake files.
When data is ingested into dedicated SQL pool, the data is sharded into distributions to optimize the performance of the system.

You can choose below distributions types

Hash
Round Robin
Replicate

Difference between Teradata Database and Synapse SQL Architecture

Serverless SQL pool in Azure Synapse Analytics

Every Azure Synapse Analytics workspace comes with serverless SQL pool endpoints that you can use to query data in the Azure Data Lake (Parquet, Delta Lake, delimited text formats), Azure Cosmos DB, or Dataverse.

Serverless SQL pool is a query service over the data in your data lake.

You can access your data through following:

T-SQL syntax to query data
Integrated connectivity via the T-SQL interface

Serverless SQL pool is a distributed data processing system, built for large-scale data and computational functions.
Serverless SQL pool enables you to analyze your Big Data in seconds to minutes, depending on the workload.
Serverless SQL pool is serverless, hence there’s no infrastructure to setup or clusters to maintain.
A default endpoint for this service is provided within every Azure Synapse workspace, so you can start querying data as soon as the workspace is created.
There is no charge for resources reserved, you are only being charged for the data processed by queries you run, so this model is a true pay-per-use model.

Benefits for Serverless SQL pool

Discovery and exploration
Logical data warehouse
Data transformation

Different professional roles can benefit from serverless SQL pool:

Data Engineers can explore the lake, transform and prepare data using this service, and simplify their data transformation pipelines.
Data Scientists can quickly reason about the contents and structure of the data in the lake, thanks to features such as OPENROWSET and automatic schema inference.
Data Analysts can explore data and Spark external tables created by Data Scientists or Data Engineers using familiar T-SQL language or their favorite tools, which can connect to serverless SQL pool.
BI Professionals can quickly create Power BI reports on top of data in the lake and Spark tables.

1. Cost Optimization:

Pay-per-Execution: You’re billed only for the queries you run, not for provisioned resources, leading to significant savings for intermittent workloads.
Resource Auto-Scaling: The pool automatically scales compute resources up or down based on query demand, eliminating the need for manual provisioning and maintenance.
Cost-Effective for Exploration and Analysis: Ideal for scenarios like data exploration, ad-hoc analysis, and development, where usage patterns are unpredictable.

2. Instant Availability:

No Provisioning Delays: Serverless pools are always available, so you can start querying data immediately without waiting for resources to spin up.
Faster Time to Insights: Accelerate data insights and analysis without infrastructure provisioning overhead.

3. Simplified Management:

No Infrastructure Configuration: No need to manage compute resources, as the serverless pool handles it automatically.
Reduced Operational Overhead: Focus on data analysis and insights instead of infrastructure management tasks.

4. Seamless Integration:

Unified Experience: Serverless pools are integrated within the Azure Synapse Analytics workspace, providing a unified experience with other analytics services.
Data Access Across Services: Easily access data stored in Azure Data Lake Storage or other data sources within your workspace.

5. Performance and Scalability:

Optimized for Analytics: Built on the same distributed query engine as dedicated SQL pools, delivering high performance for analytical workloads.
Elastic Scaling: Automatically scales to handle large queries and concurrent users, ensuring optimal performance and resource utilization.

T-SQL support in Serverless SQL pool

some aspects of the T-SQL language aren’t supported due to the design of serverless SQL pool, as an example, DML functionality is currently not supported.

Workload can be organized using familiar concepts:
Databases – serverless SQL pool endpoint can have multiple databases.
Schemas – Within a database, there can be one or many object ownership groups called schemas.
Views, stored procedures, inline table value functions
External resources – data sources, file formats, and tables

Security can be enforced using:

Logins and users
Credentials to control access to storage accounts
Grant, deny, and revoke permissions per object level
Microsoft Entra integration

Supported T-SQL:

Full SELECT surface area is supported, including a majority of SQL functions
CETAS – CREATE EXTERNAL TABLE AS SELECT
DDL statements related to views and security only

Serverless SQL pool has no local storage, only metadata objects are stored in databases. T-SQL related to the following concepts isn’t supported:

Tables
Triggers
Materialized views
DDL statements other than ones related to views and security
DML statements

Note:

Serverless SQL pool queries have a timeout. For more information on query timeout that may affect your workload.Currently you can’t change the timeout.
Serverless SQL pools have limitations, and you can’t create more than 20 databases per workspace.

Serverless SQL pool Constraints

https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/resources-self-help-sql-on-demand?tabs=x80070002#constraints

What is dedicated SQL pool (formerly SQL DW) in Azure Synapse Analytics?

Azure Synapse Analytics is an analytics service that brings together enterprise data warehousing and Big Data analytics

Once your dedicated SQL pool is created, you can import big data with simple PolyBase T-SQL queries, and then use the power of the distributed query engine to run high-performance analytics

Key component of a big data solution

In a cloud data solution, data is ingested into big data stores from a variety of sources.
Once in a big data store, Hadoop, Spark, and machine learning algorithms prepare and train the data.
When the data is ready for complex analysis, dedicated SQL pool uses PolyBase to query the big data stores.
PolyBase uses standard T-SQL queries to bring the data into dedicated SQL pool (formerly SQL DW) tables.

Limitations

Feature	Dedicated SQL Pool	Serverless SQL Pool
Tables	Yes	No, in-database tables are not supported. Serverless SQL pool can query only external tables that reference data stored in Azure Data Lake storage or Dataverse.
Views	Yes, views can use query language elements that are available in dedicated and serverless models	Yes, you can create views over external tables, the queries with the OPENROWSET function, and other views. Views can use query language elements that are available in serverless model
Schemas	Yes	Yes, schemas are supported. Use schemas to isolate different tenants and place their tables per schemas.
Temporary tables	Yes	Yes, temporary tables might be used just to store some information from the system views, literals, or other temporary tables. UPDATE/DELETE on temp table is also supported. You can join temporary tables with the system views. You cannot select data from an external table to insert it into temporary table or join a temporary table with an external table these operations will fail because external data and temporary tables cannot be mixed in the same query.
User-defined procedures	Yes	Yes, stored procedures can be placed in any user databases (not ster database). Procedures can just read external data and use query language elements that are available in serverless pool.
User-defined functions	Yes	Yes, only inline table-valued functions are supported. Scalar user-defined functions are not supported.
Triggers	No	No, serverless SQL pools do not allow changing data, so the triggers cannot react on data changes.
External tables	Yes	Yes, external tables are available and can be used to read data from Azure Data Lake storage or Dataverse. See the supported data formats
Caching queries	Yes, multiple forms (SSD-based caching, in-memory resultset caching). In addition, Materialized View are supported	No. only the file statistics are cached. No. the query results are not cached. Only the file statistics are cached. No. the Materialized views are not supported in the serverless SQL pools.
Result set caching	Yes	No
Materialized views	Yes	No
Table variables	No, use temporary tables	No, table variables are not supported
Table distribution	Yes	No, table distributions are not supported
Table indexes	Yes	No, indexes are not supported
Table partitioning	Yes	External tables do not support partitioning. You can partition files using Hive-partition folder structure and create partitioned tables in Spark. The Spark partitioning will be synchronized with the serverless pool. If you are not using Spark, you can partition your filles in folder structure and can create partitioned views on folder partition structure, but the external tables cannot be created on partitioned folders.
Statistics	Yes	Yes, statistics are created on external files
Workload management	Yes see workload management, resource management, resource classes, and concurrency control	No, you cannot manage the resources that are assigned to the queries. The serverless SQL pool automatically manages the resources.
Cost control	Yes, using scale-up and scale-down actions.	Yes, you can limit daily, weekly, or monthly usage of serverless pool using the Azure portal or T-SQL procedure.

For more Details – https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/overview-features

Mohammad Gufran Jahangir

Category:

Uncategorized