How to Ingest data into your Warehouse using data pipelines

Posted by

  • Data pipelines are like organized groups of tasks for moving data.
  • They help with activities like extracting, transforming, and loading data (ETL).
  • Instead of doing each task separately, pipelines let you manage them all together.

Create a data pipeline

  • To create a new pipeline, go to your workspace and click the +New button, then select Data pipeline.
  • In the New pipeline dialog, give your pipeline a name and click Create. You’ll then land in the pipeline canvas area.
  • There are three options to start building your pipeline:
    • “Add pipeline activity” lets you create new pipelines from scratch using pipeline activities.
    • “Copy data” guides you through selecting a data source and destination, configuring options, and creating a Copy Data task.
    • “Choose a task to start” provides predefined templates for different scenarios.
  • Choose “Copy data” to launch the Copy assistant.
  • In the Copy data assistant, select your data source (e.g., COVID-19 Data Lake sample) and destination workspace (e.g., Data Warehouse).
  • Configure the destination table by selecting the warehouse, providing a table name, and configuring column mappings.
  • Review the options for staging and advanced settings without changing them.
  • Review the summary of the copy activity and select “Start data transfer immediately” then click Save + Run.
  • The pipeline starts running automatically, and you can monitor its status in the Output pane.
  • After completion, navigate back to your warehouse to preview the copied data.

How to Ingest data into your Warehouse using the COPY statement

  • The COPY statement is used to bring data into Warehouse tables efficiently.
  • It can pull in a lot of data quickly from an Azure storage account.
  • You can customize how it handles things like file formats and where to put rejected rows.
  • The tutorial demonstrates how to use the COPY statement to load data into a Warehouse table.

Create a table

  1. First, you need to create the destination table before using the COPY statement.
  2. To create the table, follow these steps:
    • Go to your Microsoft Fabric workspace and open your warehouse.
    • Switch to the Home tab and click on “New SQL query.”
  3. To create the table used as the destination in this tutorial, run the following code.

Ingest Parquet data using the COPY statement

  • First, we’re loading data from a Parquet source.
  • This data is public and doesn’t need any login info.
  • You can copy this data by saying where it’s from and where it’s going.
  • You only need to mention the FILE_TYPE.
COPY INTO [dbo].[bing_covid-19_data]
FROM 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet'

Ingest CSV data using the COPY statement and skipping a header row

COPY INTO [dbo].[bing_covid-19_data]
FROM 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.csv'
    FILE_TYPE = 'CSV', 
    FIRSTROW = 2

Ingest data into your Warehouse using Transact-SQL

Creating a new table with the result of a query by using CREATE TABLE AS SELECT (CTAS)

The CREATE TABLE AS SELECT (CTAS) statement allows you to create a new table in your warehouse from the output of a SELECT statement. 

CREATE TABLE [dbo].[bing_covid-19_data_2023]
FROM [dbo].[bing_covid-19_data] 
WHERE DATEPART(YEAR,[updated]) = '2023';

Ingesting data into existing tables with T-SQL queries

INSERT INTO [dbo].[bing_covid-19_data_2023]
SELECT * FROM [dbo].[bing_covid-19_data] 
WHERE [updated] > '2023-02-28';

Ingesting data from tables on different warehouses and lakehouses

  • You can use either “CREATE TABLE AS SELECT” or “INSERT…SELECT” to make a new table based on the results of a SELECT statement.
  • With these commands, you can reference tables from different warehouses (places where data is stored) by using a special naming convention.
  • This can be achieved by using the three-part naming convention [warehouse_or_lakehouse_name.][schema_name.]table_name.
  • For example, if you have a lakehouse called “cases_lakehouse” and warehouses called “reference_warehouse” and “research_warehouse,” you can create a new table that combines data from all these places using the three-part naming convention.
CREATE TABLE [research_warehouse].[dbo].[cases_by_continent]
FROM [cases_lakehouse].[dbo].[bing_covid-19_data] cases
INNER JOIN [reference_warehouse].[dbo].[bing_covid-19_data] reference
ON cases.[iso3] = reference.[countrycode];
Inline Feedbacks
View all comments
Would love your thoughts, please comment.x