Data Factory in Microsoft Fabric

  • Data Factory brings data in from many different sources, such as databases and real-time feeds, and prepares it for analysis.
  • It offers two main features: dataflows and data pipelines.
  • Dataflows let you transform data with more than 300 built-in transformations, including AI-based ones.
  • Pipelines let you build flexible workflows that orchestrate your data processing from end to end.

In short, Data Factory makes it easy to handle data, whether you’re a beginner or an expert, and it brings your data quickly to where you need it for analysis.

What are dataflows in Microsoft Fabric?

  • Dataflows are easy-to-use tools for getting data from many different places.
  • You can transform the data in lots of ways using more than 300 built-in transformations.
  • Once transformed, the data can be loaded into destinations such as Azure SQL databases.
  • You can set dataflows to run on their own, either on demand or on a schedule.
  • Dataflows are built with Power Query, which you might have seen in Excel or Power BI.
  • Power Query makes it simple for anyone, expert or beginner, to reshape data without writing complicated code.
  • You can combine data, summarize it, clean it up, and much more, all through a user-friendly interface.
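
The combine/summarize/clean operations mentioned above map onto familiar data-wrangling steps. A minimal pure-Python sketch of the same three transformations, using made-up sample records (the data and field names are illustrative, not from any Fabric API):

```python
# Illustrative only: mimics three common dataflow transformations
# (clean, combine, summarize) on hypothetical sales records.

orders = [
    {"id": 1, "region": "EU", "amount": 120.0},
    {"id": 2, "region": "US", "amount": None},   # dirty row
    {"id": 3, "region": "EU", "amount": 80.0},
]
regions = {"EU": "Europe", "US": "United States"}

# Clean: drop rows with missing amounts (like "Remove empty" in Power Query).
cleaned = [o for o in orders if o["amount"] is not None]

# Combine: join each order to its region name (like "Merge queries").
joined = [{**o, "region_name": regions[o["region"]]} for o in cleaned]

# Summarize: total amount per region (like "Group by").
totals = {}
for o in joined:
    totals[o["region_name"]] = totals.get(o["region_name"], 0) + o["amount"]

print(totals)  # {'Europe': 200.0}
```

In Power Query each of these steps would be a recorded transformation in the applied-steps list rather than hand-written code.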

What are data pipelines in Microsoft Fabric?

  • Data pipelines are powerful cloud-scale workflows that can handle large volumes of data.
  • They let you refresh data, move very large datasets, and orchestrate complex sequences of tasks.
  • You can use data pipelines to build complete ETL (Extract, Transform, Load) processes and workflows in your data factory.
  • Pipelines have built-in features for controlling the flow of your data, including loops and conditions.
  • You can combine different activities in one pipeline, such as refreshing data and running code, making it easy to build end-to-end data processes.
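
The idea of a pipeline chaining activities can be sketched in plain Python: each activity is a function, and the pipeline runs them in order, passing each activity's output to the next. Everything here (the sample rows and function names) is hypothetical, standing in for real Copy and Dataflow activities:

```python
# Hypothetical sketch of an ETL pipeline: extract -> transform -> load.

def extract():
    # Stand-in for a Copy activity reading rows from a source store.
    return ["2024-01-01,ok", "2024-01-02,error", "2024-01-03,ok"]

def transform(rows):
    # Stand-in for a Dataflow Gen2 activity: keep only clean rows.
    return [r for r in rows if r.endswith("ok")]

def load(rows):
    # Stand-in for a Copy activity writing to a sink; here, just a list.
    return list(rows)

def run_pipeline(activities, data=None):
    # Run activities in sequence, feeding each one's output to the next.
    for activity in activities:
        data = activity(data) if data is not None else activity()
    return data

result = run_pipeline([extract, transform, load])
print(result)  # ['2024-01-01,ok', '2024-01-03,ok']
```

A real pipeline adds what this sketch omits: scheduling, retries, parallelism, and the control-flow activities described later in this post.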

What are connectors in Microsoft Fabric?

  • Data Factory in Microsoft Fabric has many connectors.
  • These connectors let you link up with various kinds of data storage.
  • You can use these connectors to either change data in dataflows or move large datasets in a data pipeline.

Supported data connectors in dataflows

  • Dataflow Gen2: a tool for bringing in and transforming data from many different places.
  • Data sources: where the data comes from. They can be files, databases, web services, cloud services, or systems you run on-premises.
  • Data connectors: the bridges that connect Dataflow Gen2 to all these data sources. There are over 145 of them!
  • Authoring experience: where you work with Dataflow Gen2 to set up how it gets and transforms data. All the data connectors are available from here.

For a comprehensive list of all currently supported data connectors, go to https://www.cloudopsnow.in/what-is-difference-between-azure-data-factory-and-data-factory-in-microsoft-fabric/

The following connectors are currently available for output destinations in Dataflow Gen2:

  • Azure Data Explorer
  • Azure SQL
  • Data Warehouse
  • Lakehouse

Supported data stores in data pipeline

Data Factory in Microsoft Fabric supports data stores in a data pipeline through the Copy, Lookup, Get Metadata, Delete, Script, and Stored Procedure activities. For a list of all currently supported data connectors, go to https://www.cloudopsnow.in/what-is-difference-between-azure-data-factory-and-data-factory-in-microsoft-fabric/

Service principal support in Data Factory

  • Azure service principal (SPN): It’s like an identity card for applications, letting them access data securely without needing a specific user’s identity.
  • Usage: You can assign permissions to these service principals to let them access your data sources.
  • In Microsoft Fabric: Service principal authentication is supported in various parts like datasets, dataflows, and datamarts, making it easy for applications to securely connect to your data.
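
Under the hood, a service principal authenticates with the OAuth 2.0 client-credentials flow: the app presents its own ID and secret to the Microsoft identity platform and receives an access token, with no user involved. The sketch below only builds the request an app would send; the tenant ID, client ID, and scope are placeholder/example values, not real credentials:

```python
# Sketch of the OAuth 2.0 client-credentials request a service principal
# uses to obtain a token. All IDs below are placeholders.
TENANT_ID = "00000000-0000-0000-0000-000000000000"   # placeholder
CLIENT_ID = "11111111-1111-1111-1111-111111111111"   # placeholder

token_url = f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token"
payload = {
    "grant_type": "client_credentials",   # app identity, no signed-in user
    "client_id": CLIENT_ID,
    "client_secret": "<secret from your key vault>",  # never hard-code this
    "scope": "https://analysis.windows.net/powerbi/api/.default",  # example scope
}
# An app would POST `payload` to `token_url` and use the returned access
# token as a Bearer token when connecting to the data source.
print(sorted(payload))
```

In practice you would use a library such as MSAL rather than posting this payload by hand, and keep the secret in a key vault.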

Supported data sources

Currently, the SPN authentication type only supports the following data sources:

  • Azure Data Lake Storage
  • Azure Data Lake Storage Gen2
  • Azure Blob Storage
  • Azure Synapse Analytics
  • Azure SQL Database
  • Dataverse
  • SharePoint Online
  • Web

Service principal isn't supported on the on-premises data gateway and virtual network data gateway.

Service principal authentication isn't supported for a SQL data source with Direct Query in datasets.

Data pipelines in Microsoft Fabric

  • A Microsoft Fabric Workspace can contain one or more pipelines.
  • A pipeline groups together different tasks or activities that work together to accomplish something.
  • For instance, a pipeline might have activities to collect and clean log data, followed by a task to analyze that data.
  • Pipelines make it easier to manage and schedule groups of activities instead of handling each one separately.
  • Activities in a pipeline determine what actions are taken with your data, like copying data from one place to another or transforming it.
  • Microsoft Fabric offers three types of activities: data movement (like copying data), data transformation (changing data format), and control (managing the flow of tasks).

Data movement activities

  • Copy activity in Microsoft Fabric moves data from one place (source data store) to another (sink data store).
  • Fabric supports various data stores, as listed in the Connector overview.
  • You can move data from any source to any destination using Fabric’s copy activity.
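
Conceptually, a Copy activity streams records from a source store to a sink store in batches. A sketch of that behavior, with plain Python lists standing in for real connectors (the function and batch size are illustrative, not a Fabric API):

```python
# Conceptual sketch of a copy activity: move rows source -> sink in batches.

def copy_activity(source, sink, batch_size=2):
    """Copy rows from source to sink in batches; return rows copied."""
    batch, copied = [], 0
    for row in source:
        batch.append(row)
        if len(batch) == batch_size:
            sink.extend(batch)       # a real sink would bulk-insert here
            copied += len(batch)
            batch.clear()
    if batch:                        # flush the final partial batch
        sink.extend(batch)
        copied += len(batch)
    return copied

source_rows = [{"id": i} for i in range(5)]
sink_rows = []
print(copy_activity(source_rows, sink_rows))  # 5
```

The real Copy activity adds the parts this sketch leaves out: connector-specific readers and writers, parallelism, staging, and fault tolerance.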

Data transformation activities

Microsoft Fabric supports the following transformation activities that can be added either individually or chained with another activity.

Data Transformation Activity    Compute Environment
Copy Data                       Compute managed by Microsoft Fabric
Dataflow Gen2                   Compute managed by Microsoft Fabric
Delete Data                     Compute managed by Microsoft Fabric
Fabric Notebook                 Apache Spark clusters managed by Microsoft Fabric
Fabric Spark Job Definition     Apache Spark clusters managed by Microsoft Fabric
Stored Procedure                Azure SQL, Azure Synapse Analytics, or SQL Server
SQL Script                      Azure SQL, Azure Synapse Analytics, or SQL Server

Control flow activities

  • Append Variable: Adds a value to an existing array variable.
  • Azure Batch Activity: Runs an Azure Batch script.
  • Azure Databricks Activity: Runs an Azure Databricks job (Notebook, Jar, Python).
  • Azure Machine Learning Activity: Runs an Azure Machine Learning job.
  • Deactivate Activity: Deactivates another activity.
  • Fail: Causes pipeline execution to fail with a customized error message and error code.
  • Filter: Applies a filter expression to an input array.
  • ForEach: Defines a repeating control flow in your pipeline. This activity iterates over a collection and executes the specified activities in a loop, similar to a foreach looping structure in programming languages.
  • Functions Activity: Executes an Azure Function.
  • Get Metadata: Retrieves metadata of any data in a Data Factory or Synapse pipeline.
  • If Condition: Branches based on a condition that evaluates to true or false, providing the same functionality an if statement provides in programming languages.
  • Invoke Pipeline: Allows a Data Factory or Synapse pipeline to invoke another pipeline.
  • KQL Activity: Executes a KQL script against a Kusto instance.
  • Lookup Activity: Reads or looks up a record, table name, or value from an external source. The output can be referenced by succeeding activities.
  • Set Variable: Sets the value of an existing variable.
  • Switch Activity: Implements a switch expression that allows multiple subsequent activities for each potential result of the expression.
  • Teams Activity: Posts a message in a Teams channel or group chat.
  • Until Activity: Implements a do-until loop, similar to do-until looping structures in programming languages. It executes a set of activities in a loop until the condition associated with the activity evaluates to true. You can specify a timeout value for the Until activity.
  • Wait Activity: Makes the pipeline wait for the specified time before continuing with subsequent activities.
  • Web Activity: Calls a custom REST endpoint from a pipeline.
  • Webhook Activity: Calls an endpoint and passes a callback URL. The pipeline run waits for the callback to be invoked before proceeding to the next activity.
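
Three of the control-flow activities above (ForEach, If Condition, Until) correspond directly to ordinary programming constructs. A small illustrative sketch of their semantics, with hypothetical file names and conditions:

```python
# Plain-Python analogues of three control-flow activities.

# ForEach: run an inner activity once per item in a collection.
items = ["a.csv", "b.csv", "c.csv"]
processed = [f"copied:{name}" for name in items]   # ForEach wrapping a Copy

# If Condition: branch on an expression that evaluates to true or false.
row_count = len(processed)
branch = "load_warehouse" if row_count > 0 else "send_alert"

# Until: repeat activities until a condition becomes true, with a cap
# playing the role of the Until activity's timeout.
attempts, succeeded = 0, False
while not succeeded and attempts < 3:
    attempts += 1
    succeeded = attempts == 2      # pretend the second attempt works

print(processed, branch, attempts)
```

The difference in a pipeline is that each "loop body" or "branch" is itself a set of activities, which can run in parallel and be monitored individually.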

Dataflow Gen2 data destinations and managed settings

  • After cleaning and preparing your data with Dataflow Gen2, you need to put it somewhere.
  • Dataflow Gen2 lets you choose where to put your data, called data destinations.
  • You can choose from options like Azure SQL, Fabric Lakehouse, and others.
  • Once you select a destination, Dataflow Gen2 transfers your data there.
  • After that, you can analyze and report on your data using the chosen destination.

The following list contains the supported data destinations.

  • Azure SQL databases
  • Azure Data Explorer (Kusto)
  • Fabric Lakehouse
  • Fabric Warehouse
  • Fabric KQL database

Getting from Dataflow Generation 1 to Dataflow Generation 2

  • Dataflow Gen2 is the latest version of dataflows.
  • It’s newer than Dataflow Gen1 and has better features.
  • The next part will show how Dataflow Gen2 compares to Dataflow Gen1.

Feature overview

Feature                                    Dataflow Gen2    Dataflow Gen1
Author dataflows with Power Query          Yes              Yes
Shorter authoring flow                     Yes
Auto-Save and background publishing        Yes
Data destinations                          Yes
Improved monitoring and refresh history    Yes
Integration with data pipelines            Yes
High-scale compute                         Yes
Get Data via Dataflows connector           Yes              Yes
Direct Query via Dataflows connector                        Yes
Incremental refresh                                         Yes
AI Insights support                                         Yes

Data destinations

  1. Data Transformation: Like in Dataflow Gen1, Dataflow Gen2 helps you change your data and store it temporarily in a special storage called staging storage. You can then access this data using a Dataflow connector.
  2. Separate Storage: With Dataflow Gen2, you can also choose where to put your data permanently. This means you can keep your data transformation logic separate from where your data ends up.
  3. Benefits: This separation is helpful because now you can do different things with your data. For instance, you can use a dataflow to prepare data and then use a notebook to analyze it. Or, you can load data into different destinations like a lakehouse, Azure SQL Database, Azure Data Explorer, and more.
  4. New Destinations: In Dataflow Gen2, you can send your data to places like Fabric Lakehouse, Azure Data Explorer (Kusto), Azure Synapse Analytics (SQL DW), and Azure SQL Database.