External Tables with Azure Synapse serverless SQL pool

Posted by

As a first step, you need to configure data source and specify file format of remotely stored data.

Below are Basic things to remember while creating external tables:

  • Data source: A map pointing to the location of books in different sections (storage accounts, containers, folders).
  • External file format: A guide explaining how a particular book is organized (file format, delimiters, encoding).
  • External table: A catalog entry (table) referencing a book’s location and format, allowing you to read it without physically moving it.

In Simple ways how to understand data source , external file format and External table:

Gather Ingredients (Data Source):

  • List ingredients with their locations in the pantry and fridge (like creating a data source).
  • Example:
Ingredient List:
- Flour: Pantry, Shelf 2
- Eggs: Fridge, Door
- Sugar: Pantry, Top Shelf

Understand Ingredient Properties (External File Format):

  • Know the state (solid, liquid) and measurements (cups, grams) of each ingredient to use them correctly (like creating an external file format).

Write Recipes (External Tables):

  • Create recipes referencing ingredients from the list, specifying their locations and properties (like creating external tables).
  • Example:
Cake Recipe:
- 2 cups flour (Pantry, Shelf 2)
- 3 eggs (Fridge, Door)
- 1 cup sugar (Pantry, Top Shelf)

Why This Order?

  • Organization: Separate steps keep information organized and easy to follow.
  • Reusability: Use the same ingredient list and knowledge for multiple recipes.
  • Flexibility: Change ingredient locations or properties without rewriting recipes.

Key Points:

  • Data sources and file formats provide essential instructions for Synapse on how to access and interpret external data.
  • Creating them first ensures a well-structured and efficient cooking process.
  • This allows you to create recipes (external tables) that seamlessly reference data without physically moving it around.

Configure data sources and formats

Data sources represent connection string information that describes where your data is placed and how to authenticate to your data source.

CREATE EXTERNAL DATA SOURCE ecdc_cases WITH (
    LOCATION = 'https://pandemicdatalake.blob.core.windows.net'
);

Note: A caller may access data source without credential if an owner of data source allowed anonymous access or give explicit access to Microsoft Entra identity of the caller.

You can explicitly define a custom credential that will be used while accessing data on external data source.

  • Managed Identity of the Synapse workspace
  • Shared Access Signature of the Azure storage
  • Custom Service Principal Name or Azure Application identity.
  • Read-only Azure Cosmos DB account key that enables you to read Azure Cosmos DB analytical storage.

As a prerequisite, you will need to create a master key in the database:

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Setup you password - you need to create master key only once';

In the following external data source, Synapse SQL pool should use a managed identity of the workspace to access data in the storage.

CREATE DATABASE SCOPED CREDENTIAL WorkspaceIdentity
WITH IDENTITY = 'Managed Identity';
GO
CREATE EXTERNAL DATA SOURCE ecdc_cases WITH (
    LOCATION = 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/',
    CREDENTIAL = WorkspaceIdentity
);

Define external file formats

CREATE EXTERNAL FILE FORMAT ParquetFormat WITH (  FORMAT_TYPE = PARQUET );
GO
CREATE EXTERNAL FILE FORMAT CsvFormat WITH (  FORMAT_TYPE = DELIMITEDTEXT );
or 
CREATE EXTERNAL FILE FORMAT [ParquetFF] WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
GO

Explore your data

Once you set up your data sources, you can use the OPENROWSET function to explore your data.

OPENROWSET function reads content of a remote data source (for example file) and returns the content as a set of rows.

select top 10  *
from openrowset(bulk 'latest/ecdc_cases.parquet',
                data_source = 'ecdc_cases',
                format='parquet') as a

Create external tables on Azure storage

you can create external tables that are referencing the files. 

The following external table is referencing the ECDC COVID parquet file placed in the Azure storage:

create external table ecdc_adls.cases (
    date_rep                   date,
    day                        smallint,
    month                      smallint,
    year                       smallint,
    cases                      smallint,
    deaths                     smallint,
    countries_and_territories  varchar(256),
    geo_id                     varchar(60),
    country_territory_code     varchar(16),
    pop_data_2018              int,
    continent_exp              varchar(32),
    load_date                  datetime2(7),
    iso_country                varchar(16)
) with (
    data_source= ecdc_cases,
    location = 'latest/ecdc_cases.parquet',
    file_format = ParquetFormat
);
CREATE EXTERNAL TABLE [dbo].[<myexternaltable>] WITH (
        LOCATION = '<myoutputsubfolder>/',
        DATA_SOURCE = [SynapseSQLwriteable],
        FILE_FORMAT = [ParquetFF]
) AS
SELECT * FROM [<myview>];
GO
CREATE EXTERNAL TABLE dbo.FactInternetSalesNew
    WITH (
            LOCATION = '/files/Customer',
            DATA_SOURCE = customer_ds,
            FILE_FORMAT = customer_ff
            ) AS

SELECT T1.ProductKey AS ProductKeyNoChange,
    T1.OrderDateKey AS OrderDate,
    T1.ShipDateKey AS ShipDate,
    T1.CustomerKey AS CustomerKeyNoChange,
    T1.OrderQuantity AS Quantity,
    T1.SalesAmount AS MONEY
FROM dbo.FactInternetSales T1
INNER JOIN dbo.DimCustomer T2
    ON (T1.CustomerKey = T2.CustomerKey)
OPTION (HASH JOIN);
GO
guest
0 Comments
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x