Prepare data in Azure Storage

  • PolyBase can’t load rows that have more than 1,000,000 bytes of data. When you put data into the text files in Azure Blob storage or Azure Data Lake Store, they must have fewer than 1,000,000 bytes of data. This byte limitation is true regardless of the table schema.
  • Split large compressed files into smaller compressed files.

Run loads with enough compute

For fastest loading speed, run only one load job at a time. If that is not feasible, run a minimal number of loads concurrently. If you expect a large loading job, consider scaling up your dedicated SQL pool before the load.

Create a loading user

This example creates a loading user classified to a specific workload group. The first step is to connect to master and create a login.

-- Connect to master
   CREATE LOGIN loader WITH PASSWORD = 'a123STRONGpassword!';

Connect to the dedicated SQL pool and create a user. The following code assumes you’re connected to the database called mySampleDataWarehouse.

-- Connect to the dedicated SQL pool
   CREATE USER loader FOR LOGIN loader;
   GRANT ADMINISTER DATABASE BULK OPERATIONS TO loader;
   GRANT INSERT ON <yourtablename> TO loader;
   GRANT SELECT ON <yourtablename> TO loader;
   GRANT CREATE TABLE TO loader;
   GRANT ALTER ON SCHEMA::dbo TO loader;
   
   CREATE WORKLOAD GROUP DataLoads
   WITH ( 
       MIN_PERCENTAGE_RESOURCE = 0
       ,CAP_PERCENTAGE_RESOURCE = 100
       ,REQUEST_MIN_RESOURCE_GRANT_PERCENT = 100
	);

   CREATE WORKLOAD CLASSIFIER [wgcELTLogin]
   WITH (
	     WORKLOAD_GROUP = 'DataLoads'
       ,MEMBERNAME = 'loader'
   );

Load to a staging table

To achieve the fastest loading speed for moving data into a data warehouse table, load data into a staging table. Define the staging table as a heap and use round-robin for the distribution option.

Load to a columnstore index

Columnstore indexes require large amounts of memory to compress data into high-quality rowgroups. For best compression and index efficiency, the columnstore index needs to compress the maximum of 1,048,576 rows into each rowgroup. When there is memory pressure, the columnstore index might not be able to achieve maximum compression rates. 

  • To ensure the loading user has enough memory to achieve maximum compression rates, use loading users that are a member of a medium or large resource class.
  • Load enough rows to completely fill new rowgroups. During a bulk-load, every 1,048,576 rows get compressed directly into the columnstore as a full rowgroup. Loads with fewer than 102,400 rows send the rows to the deltastore where rows are held in a b-tree index. If you load too few rows, they might all go to the deltastore and not get compressed immediately into columnstore format.

Create statistics after the load

To improve query performance, it’s important to create statistics on all columns of all tables after the first load, or major changes occur in the data. 

create statistics [SensorKey] on [Customer_Speed] ([SensorKey]);
create statistics [CustomerKey] on [Customer_Speed] ([CustomerKey]);
create statistics [GeographyKey] on [Customer_Speed] ([GeographyKey]);
create statistics [Speed] on [Customer_Speed] ([Speed]);
create statistics [YearMeasured] on [Customer_Speed] ([YearMeasured]);

Design a PolyBase data loading strategy for dedicated SQL pool in Azure Synapse Analytics

While SQL pool supports many loading methods including non-Polybase options such as BCP and SQL BulkCopy API, the fastest and most scalable way to load data is through PolyBase. PolyBase is a technology that accesses external data stored in Azure Blob storage or Azure Data Lake Store via the T-SQL language.

Extract, Load, and Transform (ELT)

Extract, Load, and Transform (ELT) is a process by which data is extracted from a source system, loaded into a data warehouse, and then transformed.

The basic steps for implementing a PolyBase ELT for dedicated SQL pool are:

  1. Extract the source data into text files.
  2. Land the data into Azure Blob storage or Azure Data Lake Store.
  3. Prepare the data for loading.
  4. Load the data into dedicated SQL pool staging tables using PolyBase.
  5. Transform the data.
  6. Insert the data into production tables.

1. Extract the source data into text files

Getting data out of your source system depends on the storage location. The goal is to move the data into PolyBase supported delimited text files.

PolyBase external file formats

PolyBase loads data from UTF-8 and UTF-16 encoded delimited text files. PolyBase also loads from the Hadoop file formats RC File, ORC, and Parquet. PolyBase can also load data from Gzip and Snappy compressed files. PolyBase currently does not support extended ASCII, fixed-width format, and nested formats such as WinZip, JSON, and XML.

2. Land the data into Azure Blob storage or Azure Data Lake Store

To land the data in Azure storage, you can move it to Azure Blob storage or Azure Data Lake Store. In either location, the data should be stored in text files. PolyBase can load from either location.

Tools and services you can use to move data to Azure Storage:

  • Azure ExpressRoute service enhances network throughput, performance, and predictability. ExpressRoute is a service that routes your data through a dedicated private connection to Azure.
  • AzCopy utility moves data to Azure Storage over the public internet. This works if your data sizes are less than 10 TB.
  • Azure Data Factory (ADF) has a gateway that you can install on your local server. Then you can create a pipeline to move data from your local server up to Azure Storage. To use Data Factory with dedicated SQL pool, see Load data into dedicated SQL pool.

3. Prepare the data for loading

You might need to prepare and clean the data in your storage account before loading it into dedicated SQL pool. 

Define external tables

Before you can load data, you need to define external tables in your data warehouse. PolyBase uses external tables to define and access the data in Azure Storage. 

Defining external tables involves specifying the data source, the format of the text files, and the table definitions

  • CREATE EXTERNAL DATA SOURCE
  • CREATE EXTERNAL FILE FORMAT
  • CREATE EXTERNAL TABLE

Format text files

Once the external objects are defined, you need to align the rows of the text files with the external table and file format definition. The data in each row of the text file must align with the table definition. 

  • If your data is coming from a non-relational source, you need to transform it into rows and columns.
  • Format data in the text file to align with the columns and data types in the SQL pool destination table. Misalignment between data types in the external text files and the data warehouse table causes rows to be rejected during the load.
  • Separate fields in the text file with a terminator. Be sure to use a character or a character sequence that is not found in your source data.

4. Load the data into dedicated SQL pool staging tables using PolyBase

  • It is best practice to load data into a staging table. Staging tables allow you to handle errors without interfering with the production tables. 
  •  A staging table also gives you the opportunity to use SQL pool built-in distributed query processing capabilities for data transformations before inserting the data into production tables.

Options for loading with PolyBase

To load data with PolyBase, you can use any of these loading options:

  • PolyBase with T-SQL works well when your data is in Azure Blob storage or Azure Data Lake Store.
  • PolyBase with SSIS works well when your source data is in SQL Server.
  • PolyBase with Azure Data Factory (ADF) is another orchestration tool. It defines a pipeline and schedules jobs.
  • PolyBase with Azure Databricks transfers data from an Azure Synapse Analytics table to a Databricks dataframe and/or writes data from a Databricks dataframe to an Azure Synapse Analytics table using PolyBase.

Non-PolyBase loading options

  • If your data is not compatible with PolyBase, you can use bcp or the SQLBulkCopy API.
  • BCP loads directly to dedicated SQL pool without going through Azure Blob storage, and is intended only for small loads.
  • Note, the load performance of these options is slower than PolyBase.

5. Transform the data

While data is in the staging table, perform transformations that your workload requires. Then move the data into a production table.

6. Insert the data into production tables

The INSERT INTO … SELECT statement moves the data from the staging table to the permanent table.