A distributed table appears as a single table, but the rows are actually stored across 60 distributions. The rows are distributed with a hash or round-robin algorithm.
When Synapse SQL runs a query, the work is divided into 60 smaller queries that run in parallel.
Each of the 60 smaller queries runs on one of the data distributions. Each Compute node manages one or more of the 60 distributions.
A dedicated SQL pool with maximum compute resources has one distribution per Compute node. A dedicated SQL pool with minimum compute resources has all the distributions on one compute node.
Recommendations on the best table distribution strategy to use based on your workloads.
Hash-distribution improves query performance on large fact tables, and is the focus of this article.
Round-robin distribution is useful for improving loading speed. These design choices have a significant impact on improving query and loading performance.
Another table storage option is to replicate a small table across all the Compute nodes.
Please consider the choices of distribution on below points
- How large is the table?
- How often is the table refreshed?
- Do I have fact and dimension tables in a dedicated SQL pool?
Hash distributed tables
- A hash-distributed table distributes table rows across the Compute nodes by using a deterministic hash function to assign each row to one
- A hash distributed table can deliver the highest query performance for joins and aggregations on large tables.
- In the table definition, one of the columns is designated as the distribution column.
- The hash function uses the values in the distribution column to assign each row to a distribution.
The following diagram illustrates how a full (non-distributed table) gets stored as a hash-distributed table.
- Each row belongs to one distribution.
- A deterministic hash algorithm assigns each row to one distribution.
- The number of table rows per distribution varies as shown by the different sizes of tables.
Hash-distributed tables work well for large fact tables in a star schema. They can have very large numbers of rows and still achieve high performance.
Choosing a good distribution column or columns is one such consideration.
Consider using a hash-distributed table when:
- The table size on disk is more than 2 GB.
- The table has frequent insert, update, and delete operations.
- Use for Fact tables
- Use for Large dimension tables
Round-robin distributed tables
- A round-robin table is the simplest table to create and delivers fast performance when used as a staging table for loads.
- A round-robin distributed table distributes data evenly across the table but without any further optimization.
- A distribution is first chosen at random and then buffers of rows are assigned to distributions sequentially.
- It is quick to load data into a round-robin table, but query performance can often be better with hash distributed tables.
- Joins on round-robin tables require reshuffling data, which takes additional time.
- Unlike hash-distributed tables, rows with equal values are not guaranteed to be assigned to the same distribution.
Consider using the round-robin distribution for your table in the following scenarios:
- When getting started as a simple starting point since it is the default
- If there is no obvious joining key
- If there is no good candidate column for hash distributing the table
- If the table does not share a common join key with other tables
- If the join is less significant than other joins in the query
- When the table is a temporary/staging table
Replicated Tables
- A replicated table provides the fastest query performance for small tables.
- A table that is replicated caches a full copy of the table on each compute node.
- Consequently, replicating a table removes the need to transfer data among compute nodes before a join or aggregation.
- Replicated tables are best utilized with small tables.
- Extra storage is required and there is additional overhead that is incurred when writing data, which make large tables impractical.
Consider using the replicated distribution for your table in the following scenarios:
- Small dimension tables in a star schema with less than 2 GB of storage after compression (~5x compression)
Summarization
Tips:
- Start with Round Robin, but aspire to a hash distribution strategy to take advantage of a massively parallel architecture.
- Make sure that common hash keys have the same data format.
- Don’t distribute on varchar format.
- Dimension tables with a common hash key to a fact table with frequent join operations can be hash distributed.
- Use sys.dm_pdw_nodes_db_partition_stats to analyze any skewness in the data.
- Use sys.dm_pdw_request_steps to analyze data movements behind queries, monitor the time broadcast, and shuffle operations take. This is helpful to review your distribution strategy.