Data Factory in Microsoft Fabric

  • Data Factory brings data in from many different sources, such as databases and real-time feeds, and prepares it for analysis.
  • It offers two main features: dataflows and data pipelines.
  • Dataflows let you transform data with more than 300 built-in transformations, including AI-based ones.
  • Pipelines let you build flexible workflows that orchestrate your data processing from end to end.

In short, Data Factory makes it easy to handle data, whether you’re a beginner or an expert, and it brings your data quickly to where you need it for analysis.

What are dataflows in Microsoft Fabric?

  • Dataflows are easy-to-use tools for getting data from many different places.
  • You can transform the data in lots of ways using more than 300 built-in transformations.
  • Once transformed, the data can be loaded into destinations such as Azure SQL databases.
  • You can set dataflows to run on their own, either on demand or on a schedule.
  • Dataflows are built with Power Query, which you might have seen in Excel or Power BI.
  • Power Query makes it simple for anyone, expert or beginner, to reshape data without writing complicated code.
  • You can combine data, summarize it, clean it up, and much more, all through a user-friendly interface.
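
The combine/summarize/clean operations mentioned above map onto familiar data-wrangling steps. A minimal pure-Python sketch of the same three transformations, using made-up sample records (the data and field names are illustrative, not from any Fabric API):

```python
# Illustrative only: mimics three common dataflow transformations
# (clean, combine, summarize) on hypothetical sales records.

orders = [
    {"id": 1, "region": "EU", "amount": 120.0},
    {"id": 2, "region": "US", "amount": None},   # dirty row
    {"id": 3, "region": "EU", "amount": 80.0},
]
regions = {"EU": "Europe", "US": "United States"}

# Clean: drop rows with missing amounts (like "Remove empty" in Power Query).
cleaned = [o for o in orders if o["amount"] is not None]

# Combine: join each order to its region name (like "Merge queries").
joined = [{**o, "region_name": regions[o["region"]]} for o in cleaned]

# Summarize: total amount per region (like "Group by").
totals = {}
for o in joined:
    totals[o["region_name"]] = totals.get(o["region_name"], 0) + o["amount"]

print(totals)  # {'Europe': 200.0}
```

In Power Query each of these steps would be a recorded transformation in the applied-steps list rather than hand-written code.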

What are data pipelines in Microsoft Fabric?

  • Data pipelines are powerful cloud-scale workflows that can handle large volumes of data.
  • They let you refresh data, move very large datasets, and orchestrate complex sequences of tasks.
  • You can use data pipelines to build complete ETL (Extract, Transform, Load) processes and workflows in your data factory.
  • Pipelines have built-in features for controlling the flow of your data, including loops and conditions.
  • You can combine different activities in one pipeline, such as refreshing data and running code, making it easy to build end-to-end data processes.
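
The idea of a pipeline chaining activities can be sketched in plain Python: each activity is a function, and the pipeline runs them in order, passing each activity's output to the next. Everything here (the sample rows and function names) is hypothetical, standing in for real Copy and Dataflow activities:

```python
# Hypothetical sketch of an ETL pipeline: extract -> transform -> load.

def extract():
    # Stand-in for a Copy activity reading rows from a source store.
    return ["2024-01-01,ok", "2024-01-02,error", "2024-01-03,ok"]

def transform(rows):
    # Stand-in for a Dataflow Gen2 activity: keep only clean rows.
    return [r for r in rows if r.endswith("ok")]

def load(rows):
    # Stand-in for a Copy activity writing to a sink; here, just a list.
    return list(rows)

def run_pipeline(activities, data=None):
    # Run activities in sequence, feeding each one's output to the next.
    for activity in activities:
        data = activity(data) if data is not None else activity()
    return data

result = run_pipeline([extract, transform, load])
print(result)  # ['2024-01-01,ok', '2024-01-03,ok']
```

A real pipeline adds what this sketch omits: scheduling, retries, parallelism, and the control-flow activities described later in this post.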

What are connectors in Microsoft Fabric?

  • Data Factory in Microsoft Fabric has many connectors.
  • These connectors let you link up with various kinds of data storage.
  • You can use these connectors to either change data in dataflows or move large datasets in a data pipeline.

Supported data connectors in dataflows

  • Dataflow Gen2: a tool for bringing in and transforming data from many different places.
  • Data sources: where the data comes from. They can be files, databases, web services, cloud services, or systems you run on-premises.
  • Data connectors: the bridges that connect Dataflow Gen2 to all these data sources. There are over 145 of them!
  • Authoring experience: where you work with Dataflow Gen2 to set up how it gets and transforms data. All the data connectors are available from here.

For a comprehensive list of all currently supported data connectors, go to https://www.cloudopsnow.in/what-is-difference-between-azure-data-factory-and-data-factory-in-microsoft-fabric/

The following connectors are currently available for output destinations in Dataflow Gen2:

  • Azure Data Explorer
  • Azure SQL
  • Data Warehouse
  • Lakehouse

Supported data stores in data pipeline

Data Factory in Microsoft Fabric supports data stores in a data pipeline through the Copy, Lookup, Get Metadata, Delete, Script, and Stored Procedure activities. For a list of all currently supported data connectors, go to https://www.cloudopsnow.in/what-is-difference-between-azure-data-factory-and-data-factory-in-microsoft-fabric/

Service principal support in Data Factory

  • Azure service principal (SPN): It’s like an identity card for applications, letting them access data securely without needing a specific user’s identity.
  • Usage: You can assign permissions to these service principals to let them access your data sources.
  • In Microsoft Fabric: Service principal authentication is supported in various parts like datasets, dataflows, and datamarts, making it easy for applications to securely connect to your data.
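
Under the hood, a service principal authenticates with the OAuth 2.0 client-credentials flow: the app presents its own ID and secret to the Microsoft identity platform and receives an access token, with no user involved. The sketch below only builds the request an app would send; the tenant ID, client ID, and scope are placeholder/example values, not real credentials:

```python
# Sketch of the OAuth 2.0 client-credentials request a service principal
# uses to obtain a token. All IDs below are placeholders.
TENANT_ID = "00000000-0000-0000-0000-000000000000"   # placeholder
CLIENT_ID = "11111111-1111-1111-1111-111111111111"   # placeholder

token_url = f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token"
payload = {
    "grant_type": "client_credentials",   # app identity, no signed-in user
    "client_id": CLIENT_ID,
    "client_secret": "<secret from your key vault>",  # never hard-code this
    "scope": "https://analysis.windows.net/powerbi/api/.default",  # example scope
}
# An app would POST `payload` to `token_url` and use the returned access
# token as a Bearer token when connecting to the data source.
print(sorted(payload))
```

In practice you would use a library such as MSAL rather than posting this payload by hand, and keep the secret in a key vault.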

Supported data sources

Currently, the SPN authentication type only supports the following data sources:

  • Azure Data Lake Storage
  • Azure Data Lake Storage Gen2
  • Azure Blob Storage
  • Azure Synapse Analytics
  • Azure SQL Database
  • Dataverse
  • SharePoint Online
  • Web

Service principal isn't supported on the on-premises data gateway and virtual network data gateway.

Service principal authentication isn't supported for a SQL data source with Direct Query in datasets.

Data pipelines in Microsoft Fabric

  • A Microsoft Fabric Workspace can contain one or more pipelines.
  • A pipeline groups together different tasks or activities that work together to accomplish something.
  • For instance, a pipeline might have activities to collect and clean log data, followed by a task to analyze that data.
  • Pipelines make it easier to manage and schedule groups of activities instead of handling each one separately.
  • Activities in a pipeline determine what actions are taken with your data, like copying data from one place to another or transforming it.
  • Microsoft Fabric offers three types of activities: data movement (like copying data), data transformation (changing data format), and control (managing the flow of tasks).

Data movement activities

  • Copy activity in Microsoft Fabric moves data from one place (source data store) to another (sink data store).
  • Fabric supports various data stores, as listed in the Connector overview.
  • You can move data from any source to any destination using Fabric’s copy activity.
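
Conceptually, a Copy activity streams records from a source store to a sink store in batches. A sketch of that behavior, with plain Python lists standing in for real connectors (the function and batch size are illustrative, not a Fabric API):

```python
# Conceptual sketch of a copy activity: move rows source -> sink in batches.

def copy_activity(source, sink, batch_size=2):
    """Copy rows from source to sink in batches; return rows copied."""
    batch, copied = [], 0
    for row in source:
        batch.append(row)
        if len(batch) == batch_size:
            sink.extend(batch)       # a real sink would bulk-insert here
            copied += len(batch)
            batch.clear()
    if batch:                        # flush the final partial batch
        sink.extend(batch)
        copied += len(batch)
    return copied

source_rows = [{"id": i} for i in range(5)]
sink_rows = []
print(copy_activity(source_rows, sink_rows))  # 5
```

The real Copy activity adds the parts this sketch leaves out: connector-specific readers and writers, parallelism, staging, and fault tolerance.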

Data transformation activities

Microsoft Fabric supports the following transformation activities that can be added either individually or chained with another activity.

Data Transformation Activity    Compute Environment
Copy Data                       Compute managed by Microsoft Fabric
Dataflow Gen2                   Compute managed by Microsoft Fabric
Delete Data                     Compute managed by Microsoft Fabric
Fabric Notebook                 Apache Spark clusters managed by Microsoft Fabric
Fabric Spark Job Definition     Apache Spark clusters managed by Microsoft Fabric
Stored Procedure                Azure SQL, Azure Synapse Analytics, or SQL Server
SQL Script                      Azure SQL, Azure Synapse Analytics, or SQL Server

Control flow activities

  • Append Variable: Adds a value to an existing array variable.
  • Azure Batch Activity: Runs an Azure Batch script.
  • Azure Databricks Activity: Runs an Azure Databricks job (Notebook, Jar, Python).
  • Azure Machine Learning Activity: Runs an Azure Machine Learning job.
  • Deactivate Activity: Deactivates another activity.
  • Fail: Causes pipeline execution to fail with a customized error message and error code.
  • Filter: Applies a filter expression to an input array.
  • ForEach: Defines a repeating control flow in your pipeline. This activity iterates over a collection and executes the specified activities in a loop, similar to a foreach looping structure in programming languages.
  • Functions Activity: Executes an Azure Function.
  • Get Metadata: Retrieves metadata of any data in a Data Factory or Synapse pipeline.
  • If Condition: Branches based on a condition that evaluates to true or false, providing the same functionality an if statement provides in programming languages.
  • Invoke Pipeline: Allows a Data Factory or Synapse pipeline to invoke another pipeline.
  • KQL Activity: Executes a KQL script against a Kusto instance.
  • Lookup Activity: Reads or looks up a record, table name, or value from an external source. The output can be referenced by succeeding activities.
  • Set Variable: Sets the value of an existing variable.
  • Switch Activity: Implements a switch expression that allows multiple subsequent activities for each potential result of the expression.
  • Teams Activity: Posts a message in a Teams channel or group chat.
  • Until Activity: Implements a do-until loop, similar to do-until looping structures in programming languages. It executes a set of activities in a loop until the condition associated with the activity evaluates to true. You can specify a timeout value for the Until activity.
  • Wait Activity: Makes the pipeline wait for the specified time before continuing with subsequent activities.
  • Web Activity: Calls a custom REST endpoint from a pipeline.
  • Webhook Activity: Calls an endpoint and passes a callback URL. The pipeline run waits for the callback to be invoked before proceeding to the next activity.
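
Three of the control-flow activities above (ForEach, If Condition, Until) correspond directly to ordinary programming constructs. A small illustrative sketch of their semantics, with hypothetical file names and conditions:

```python
# Plain-Python analogues of three control-flow activities.

# ForEach: run an inner activity once per item in a collection.
items = ["a.csv", "b.csv", "c.csv"]
processed = [f"copied:{name}" for name in items]   # ForEach wrapping a Copy

# If Condition: branch on an expression that evaluates to true or false.
row_count = len(processed)
branch = "load_warehouse" if row_count > 0 else "send_alert"

# Until: repeat activities until a condition becomes true, with a cap
# playing the role of the Until activity's timeout.
attempts, succeeded = 0, False
while not succeeded and attempts < 3:
    attempts += 1
    succeeded = attempts == 2      # pretend the second attempt works

print(processed, branch, attempts)
```

The difference in a pipeline is that each "loop body" or "branch" is itself a set of activities, which can run in parallel and be monitored individually.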

Dataflow Gen2 data destinations and managed settings

  • After cleaning and preparing your data with Dataflow Gen2, you need to put it somewhere.
  • Dataflow Gen2 lets you choose where to put your data, called data destinations.
  • You can choose from options like Azure SQL, Fabric Lakehouse, and others.
  • Once you select a destination, Dataflow Gen2 transfers your data there.
  • After that, you can analyze and report on your data using the chosen destination.

The following list contains the supported data destinations.

  • Azure SQL databases
  • Azure Data Explorer (Kusto)
  • Fabric Lakehouse
  • Fabric Warehouse
  • Fabric KQL database

Getting from Dataflow Generation 1 to Dataflow Generation 2

  • Dataflow Gen2 is the latest version of dataflows.
  • It’s newer than Dataflow Gen1 and has better features.
  • The next part will show how Dataflow Gen2 compares to Dataflow Gen1.

Feature overview

Feature                                    Dataflow Gen2    Dataflow Gen1
Author dataflows with Power Query          Yes              Yes
Shorter authoring flow                     Yes
Auto-Save and background publishing        Yes
Data destinations                          Yes
Improved monitoring and refresh history    Yes
Integration with data pipelines            Yes
High-scale compute                         Yes
Get Data via Dataflows connector           Yes              Yes
Direct Query via Dataflows connector                        Yes
Incremental refresh                                         Yes
AI Insights support                                         Yes

Data destinations

  1. Data Transformation: Like in Dataflow Gen1, Dataflow Gen2 helps you change your data and store it temporarily in a special storage called staging storage. You can then access this data using a Dataflow connector.
  2. Separate Storage: With Dataflow Gen2, you can also choose where to put your data permanently. This means you can keep your data transformation logic separate from where your data ends up.
  3. Benefits: This separation is helpful because now you can do different things with your data. For instance, you can use a dataflow to prepare data and then use a notebook to analyze it. Or, you can load data into different destinations like a lakehouse, Azure SQL Database, Azure Data Explorer, and more.
  4. New Destinations: In Dataflow Gen2, you can send your data to places like Fabric Lakehouse, Azure Data Explorer (Kusto), Azure Synapse Analytics (SQL DW), and Azure SQL Database.