Below are the types of Indexes on dedicated SQL pool tables in Azure Synapse Analytics
- Clustered columnstore indexes
- Heap tables
- Clustered and nonclustered indexes
1. Clustered Columnstore Indexes (CCI):
- Storage format: Data is compressed and stored by columns rather than rows, minimizing storage footprint and optimizing read performance for queries focusing on specific columns.
- Read performance: Excels for queries involving aggregations, group-bys, and filtering based on indexed columns. Data is already organized for these operations, reducing data movement.
- Write performance: Less efficient than heaps due to compression and reorganization overhead.
- Clustering: Data is physically ordered based on the indexing key column(s). This can further improve query performance.
- Example: Consider a large table storing sales data with columns like product_id, customer_id, and sales_amount. A CCI on product_id would significantly improve performance for queries analyzing sales trends by product.
There are a few scenarios where clustered columnstore may not be a good option:
- Columnstore tables do not support varchar(max), nvarchar(max), and varbinary(max). Consider heap or clustered index instead.
- Columnstore tables may be less efficient for transient data. Consider heap and perhaps even temporary tables.
- Small tables with less than 60 million rows. Consider heap tables.
CREATE TABLE SalesData (
product_id INT PRIMARY KEY CLUSTERED COLUMNSTORE,
customer_id INT,
sales_amount DECIMAL(10,2),
order_date DATE
);
-- This creates a CCI on the 'product_id' column, physically ordering the data
-- based on this column and optimizing performance for queries
-- focusing on specific products.
CREATE INDEX idx_customer_id ON SalesData (customer_id) NONCLUSTERED;
-- Additionally, you can create nonclustered indexes on other columns
-- like 'customer_id' for faster lookups and filtering based on that column.
2. Heap Tables:
- Storage format: Data is stored in row format, unordered and uncompressed.
- Read performance: Less efficient than CCI for specific column-based queries, as data needs to be scanned entirely.
- Write performance: Faster than CCI due to simpler insertion and update operations.
- Clustering: No physical ordering of data exists.
- Example: A temporary table storing intermediate results during a data processing pipeline might be best suited as a heap for fast insertions and updates without requiring specific optimizations.
Cluster columnstore tables begin to achieve optimal compression once there is more than 60 million rows. For small lookup tables, less than 60 million rows, consider using HEAP or clustered index for faster query performance.
CREATE TABLE myTable
(
id int NOT NULL,
lastName varchar(20),
zipCode varchar(6)
)
WITH ( HEAP );
3. Clustered Indexes:
- Indexing type: Creates a separate physical order of the data based on the indexing key column(s). This order mirrors the clustered columnstore index if CCI exists.
- Benefits: Similar to CCI, improves read performance for queries involving the indexing key(s) through data ordering.
- Trade-offs: Similar to CCI, write performance might be slower due to reorganization overhead. Not as efficient as CCI for column-based queries.
- Example: Imagine a smaller table storing employee data with columns like employee_id, department_id, and salary. A clustered index on department_id could benefit queries focusing on department-specific salary analyses.
CREATE TABLE EmployeeData (
employee_id INT PRIMARY KEY,
department_id INT,
salary DECIMAL(10,2),
name VARCHAR(100)
);
CREATE CLUSTERED INDEX idx_department ON EmployeeData (department_id);
-- This creates a clustered index on the 'department_id' column, ordering
-- the data based on this column. This can benefit queries that analyze
-- data by department, as the relevant rows are already physically grouped.
4. Nonclustered Indexes:
- Indexing type: Creates a separate index structure pointing to actual data rows but doesn’t physically reorder the data.
- Benefits: Faster lookups based on indexed columns without incurring write performance overhead. Useful for queries filtering, joining, and sorting on specific columns not covered by CCI or a clustered index.
- Trade-offs: Introduces additional storage overhead for the index structure. Requires data movement during query execution to access actual rows.
- Example: Suppose you have a large customer table with columns like customer_id, name, and city. A nonclustered index on city could significantly improve performance for queries searching for customers based on their location.
Clustered indexes may outperform clustered columnstore tables when a single row needs to be quickly retrieved. For queries where a single or very few row lookup is required to perform with extreme speed, consider a clustered index or nonclustered secondary index. The disadvantage to using a clustered index is that only queries that benefit are the ones that use a highly selective filter on the clustered index column. To improve filter on other columns, a nonclustered index can be added to other columns. However, each index that is added to a table adds both space and processing time to loads.
CREATE TABLE CustomerTable (
customer_id INT PRIMARY KEY,
name VARCHAR(100),
city VARCHAR(50),
email VARCHAR(255)
);
CREATE INDEX idx_city ON CustomerTable (city) NONCLUSTERED;
-- This creates a nonclustered index on the 'city' column. While the data
-- itself remains unordered, the index provides a separate structure for
-- faster lookups and filtering based on city, improving performance for
-- location-based queries.
Remember:
- Choose the right option based on your data size, access patterns, and query types.
- CCIs excel in read-heavy scenarios with predictable column focus, while heaps offer flexibility for frequent writes and unpredictable queries.
- Clustered and nonclustered indexes provide additional read optimization for specific keys and columns.
- Monitor performance and adapt your strategies to optimize efficiency and cost-effectiveness for your workloads.
Indexing is helpful for reading tables quickly. There is a unique set of technologies that you can use based on your needs:
Tips:
- On top of a clustered index, you might want to add a nonclustered index to a column heavily used for filtering.
- Be careful how you manage the memory on a table with CCI. When you load data, you want the user (or the query) to benefit from a large resource class. Make sure to avoid trimming and creating many small compressed row groups.
- On Gen2, CCI tables are cached locally on the compute nodes to maximize performance.
- For CCI, slow performance can happen due to poor compression of your row groups. If this occurs, rebuild or reorganize your CCI. You want at least 100,000 rows per compressed row groups. The ideal is 1 million rows in a row group.
- Based on the incremental load frequency and size, you want to automate when you reorganize or rebuild your indexes. Spring cleaning is always helpful.
- Be strategic when you want to trim a row group. How large are the open row groups? How much data do you expect to load in the coming days?