Azure Data Factory (ADF) Basics

1. Self-hosted integration runtime

1.1. Installs on premises and enables Data Factory to copy data from on-premises sources into Azure cloud storage

1.2. Set up for copying data from on prem to Azure Storage

1.2.1. Storage accounts | Select storage account

1.2.1.1. Blob Service | Containers

1.2.1.1.1. 1. Click + Container button

1.2.2. Data factory | Author & Monitor

1.2.2.1. Manage | Integration runtimes

1.2.2.1.1. 1. Click + New button

1.2.2.1.2. 2. Select Azure, Self-Hosted

1.2.2.1.3. 3. Click Continue

1.2.2.1.4. 4. Select Self-Hosted

1.2.2.1.5. 5. Click Continue

1.2.2.1.6. 6. Set Name

1.2.2.1.7. 7. Click Create

1.3. After installation, find "Microsoft Integration Runtime" on the local machine to launch it

1.3.1. Home tab

1.3.1.1. Shows status of integration runtime service

1.3.1.2. Option to Stop/Start service

1.3.2. Settings tab

1.3.2.1. Options for configuring remote access from intranet or using an HTTP proxy

1.4. Note: this self-hosted integration runtime can also be installed on an Azure virtual machine

1.4.1. This constitutes integration with Data Factory from a private network

2. Copy Data Wizard

2.1. Azure Portal | Data factory | Author & Monitor | Overview

2.1.1. Copy Data

2.1.1.1. 1. Properties

2.1.1.1.1. 1.a. Set Task name

2.1.1.1.2. 1.b. (Optional) Set Task description

2.1.1.1.3. 1.c. Select Task schedule

2.1.1.1.4. 1.d. Click Next

2.1.1.2. 2. Source

2.1.1.2.1. 2.a. Select Database

2.1.1.2.2. 2.b. Click Create new connection

2.1.1.2.3. 2.c. Select SQL Server

2.1.1.2.4. 2.d. Click Continue

2.1.1.2.5. 2.e. Select new Linked Service and click Next

2.1.1.2.6. 2.f. Select table(s) for your source dataset

2.1.1.2.7. 2.g. Click Next

2.1.1.3. 3. Destination

2.1.1.3.1. 3.a. Under Azure, click Create new connection

2.1.1.3.2. 3.b. Select Azure Blob Storage

2.1.1.3.3. 3.c. Select new Linked Service and click Next

2.1.1.3.4. 3.d. Set Folder path

2.1.1.3.5. 3.e. Set File name

2.1.1.3.6. 3.f. Set Block size (MB)

2.1.1.3.7. 3.g. Click Next

2.1.1.3.8. 3.h. Configure File format settings

2.1.1.3.9. 3.i. Configure Settings

2.1.1.3.10. 3.j. Click Next

2.1.1.4. 4. Summary

2.1.1.4.1. Displays summary of the new pipeline

2.1.1.4.2. Click Next

2.1.1.5. 5. Deployment

2.1.1.5.1. Automatically deploys and runs new pipeline

2.1.1.5.2. Click Finish

2.2. The wizard allows a wide range of configurations for Source and Destination, but the values given here illustrate the options to set when copying data from an on-prem SQL Server to Azure Blob Storage

3. ADF Interface

3.1. Overview

3.1.1. Quick links to wizards and training videos

3.1.1.1. Completed wizards create Factory Resources

3.2. Author

3.2.1. This is where we create, edit and delete the main components of our data factory; the top-level building blocks are the Factory Resources

3.2.2. Factory Resources

3.2.2.1. Available to create via the ADF Author page

3.2.2.2. Pipeline

3.2.2.2.1. When looking at a pipeline via the Author page, you gain access to a whole host of Actions

3.2.2.2.2. You can use the Actions to construct a pipeline and then publish it

3.2.2.2.3. You can also construct a pipeline from the Actions and save it as a template in order to facilitate efficient and standardised future pipelines for your data integration project

3.2.2.3. Pipeline from template

3.2.2.4. Dataset

3.2.2.4.1. Named view of data, pointing to or referencing that data

3.2.2.4.2. Includes tables, files, folders and documents

3.2.2.5. Data Flow

3.2.2.5.1. Added in ADF v2 and inspired by the SSIS data flow (and control flow) concept, Data Flows enable ETL in a graphical user interface, with the emphasis on adding the transformation functionality in between the source extraction and target load

3.2.2.5.2. Under the hood, data flows actually leverage Azure Databricks clusters to provide massive, automated scaling for workflows, and automated workload optimization

3.2.2.6. Copy Data

3.2.2.7. Pipelines, Datasets and Data Flows can be organised into one or more folders that you can create and name as you please

3.3. Monitor

3.3.1. This is where you can view the history of data factory executions, including pipelines

3.3.2. You can also view the status of integration runtimes and create/monitor alert rules

3.4. Manage

3.4.1. This is where we create, update and delete connections, which are based on linked services and integration runtimes

3.4.2. We can also configure source control integration here, create triggers, global parameters and manage security

3.4.3. Linked services, integration runtimes and triggers are also factory resources

3.4.3.1. Integration runtimes deliver the compute resources that ADF uses to execute its various supported data integration activities

3.4.3.2. Linked services can be thought of as connection managers, providing the interface for connecting data sources and targets

3.4.3.2.1. Once created, you can re-use a linked service among pipelines

3.4.3.3. Triggers define the scheduling for automatically executing data integration activities in the data factory

4. Data Flows

4.1. Brings data transformation capabilities to ADF (similar to SSIS data flows)

4.2. When you create a new data flow, you have two choices: Mapping data flow or Wrangling data flow

4.2.1. Mapping data flow gives you a canvas somewhat similar to an SSIS data flow

4.2.1.1. The idea is for you to create one or more source datasets, add any number of transformations, and ultimately to connect to one or more sink datasets

4.2.1.2. Under the hood, Databricks is used

4.2.2. Wrangling data flows use the Power Query interface

4.2.2.1. The idea is for you to explore source datasets and prepare them for some downstream purpose

4.2.2.2. Under the hood, the M (mashup) query language is used and run in a managed Spark environment

4.2.2.3. Wrangling data flows are currently in preview and don't yet support as many different types of source and sink dataset as mapping data flows

4.3. Data Flow Sources

4.3.1. At time of writing, we have the following options for a data flow source:

4.3.1.1. Azure Blob Storage

4.3.1.1.1. Options here include the following (and same for Data Lake Storage)

4.3.1.2. Azure Cosmos DB

4.3.1.3. Azure Data Lake Storage (Gen1 & Gen2)

4.3.1.4. Azure SQL Database

4.3.1.5. Azure Synapse Analytics

4.3.1.6. Snowflake

4.3.2. Every data flow must have a minimum of 1 source

4.3.3. Creating a new source dataset for a data flow

4.3.3.1. Author | Data flows | New data flow

4.3.3.1.1. Set Name

4.3.3.1.2. Click Add Source on canvas

4.4. Data Flow Transformations

4.4.1. Aggregate transformation

4.4.1.1. 1. Click + against a node in data flow and choose Aggregate

4.4.1.2. 2. Set Output stream name

4.4.1.3. 3. Select Group by columns

4.4.1.4. 4. Set Aggregates via expression

4.4.1.4.1. e.g. sum(Quantity)

4.4.1.4.2. See attached for sum() aggregation using the Visual expression builder
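
For reference, the Aggregate transformation behaves like a classic group-by. Here is a minimal pandas sketch of the equivalent logic, using invented ProductKey and Quantity columns to mirror the sum(Quantity) example above:

```python
import pandas as pd

# Hypothetical input stream: one row per order line
orders = pd.DataFrame({
    "ProductKey": [1, 1, 2],
    "Quantity":   [5, 3, 7],
})

# Group by ProductKey and compute sum(Quantity), as in the aggregate expression above
total_quantity = (
    orders.groupby("ProductKey", as_index=False)["Quantity"]
          .sum()
          .rename(columns={"Quantity": "TotalQuantity"})
)
print(total_quantity)
```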

4.4.2. Conditional Split transformation

4.4.2.1. 1. Click + against a node in data flow and choose Conditional Split

4.4.2.2. 2. Set Output stream name

4.4.2.3. 3. Choose Split on

4.4.2.3.1. Default is First matching condition

4.4.2.3.2. Other option is All matching conditions

4.4.2.4. 4. Define Split conditions

4.4.2.4.1. You can create multiple streams

4.4.2.4.2. See attached for example conditional split streams using the equality == operator

4.4.3. Filter transformation

4.4.3.1. Works in a similar way to Conditional Split but produces only a single output stream, with the focus being on reducing its input stream to a subset based on a condition

4.4.3.2. 1. Click + against a node in data flow and choose Filter

4.4.3.3. 2. Set Output stream name

4.4.3.4. 3. Define Filter by expression

4.4.3.4.1. See attached for example filter expression using the equality == operator

4.4.4. Select transformation

4.4.4.1. Allows you to reduce an input stream to a subset of columns

4.4.4.2. 1. Click + against a node in data flow and choose Select

4.4.4.3. 2. Set Output stream name

4.4.4.4. 3. Set Options

4.4.4.4.1. By default, duplicate column names in the input and output streams will be eliminated

4.4.4.5. 4. Edit mappings

4.4.4.5.1. Typically, you might delete mappings to reduce the columns flowing into the output stream - see attached

4.4.5. Lookup transformation

4.4.5.1. Requires two input streams: the primary stream and the lookup stream

4.4.5.1.1. Bear in mind that the output stream of the Lookup transformation will include all columns in the lookup stream, including a copy of the key column(s) used for the lookup

4.4.5.1.2. Basically it works like a Left Outer join, with the purpose being to append one or more columns to a stream.

4.4.5.2. 1. Click + against a node in data flow and choose Lookup

4.4.5.3. 2. Set Output stream name

4.4.5.4. 3. Select Primary and Lookup streams

4.4.5.4.1. Primary stream will be pre-selected according to the context of the + button click

4.4.5.4.2. The lookup stream will typically be a source dataset, perhaps transformed via a Select and/or Filter/Aggregate

4.4.5.5. 4. Set options for handling multiple rows

4.4.5.5.1. I think these options are probably for handling the possibility of repeating rows in the lookup stream, based on the lookup conditions

4.4.5.6. 5. Set Lookup conditions

4.4.5.6.1. See attached screenshot

4.4.6. Join transformation

4.4.6.1. Similar to the Lookup transformation but more flexible because it allows you to specify the type of join

4.4.6.1.1. Default is Inner join with options for the following additional join types: Full outer join, Left outer join, Right outer join, Cross join

4.4.6.2. 1. Click + against a node in data flow and choose Join

4.4.6.3. 2. Set Output stream name

4.4.6.4. 3. Select Left and Right streams

4.4.6.4.1. Left stream will be pre-selected according to the context of the + button click

4.4.6.4.2. The right stream will typically be a source dataset, perhaps reduced by a Select transformation

4.4.6.5. 4. Set Join conditions

4.4.6.5.1. See attached screenshot

4.4.7. Derived Column transformation

4.4.7.1. Allows you to apply transformations to any column from the input stream or to add new columns to the output stream

4.4.7.2. 1. Click + against a node in data flow and choose Derived Column

4.4.7.3. 2. Set Output stream name

4.4.7.4. 3. Select existing column or add new column, and set expression

4.4.7.4.1. See attached example of transforming existing Bin column and adding new Code column

4.4.8. Exists transformation

4.4.8.1. The equivalent of SQL EXISTS: this allows you to compare the left stream with the right stream and select the subset of the left stream that does (or does not) exist in the right stream

4.4.8.2. 1. Click + against a node in data flow and choose Exists

4.4.8.3. 2. Set Output stream name

4.4.8.4. 3. Select Right stream

4.4.8.5. 4. Choose Exists type

4.4.8.5.1. Exists or Doesn't exist

4.4.8.6. 5. Set Exists conditions

4.4.8.6.1. See attached screenshot

4.5. Data Flow Sinks

4.5.1. At time of writing, we have the same options for a data flow sink as available for a data flow source

4.5.2. Scenario for my first data flow:

4.5.2.1. Source is file that lists all products sold by company

4.5.2.2. Use Lookup to retrieve model description from a different table

4.5.2.3. Add shipping weight calculated as actual weight + 10% (Derived Column)

4.5.2.4. Filter out products with no list price

4.5.2.5. Order data into a sink file

4.5.3. See attached screenshot of my first data flow
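
As a companion to the screenshot, here is a rough pandas sketch of the same flow (Lookup as a left outer join for the model description, Derived Column for the 10% shipping weight uplift, Filter to drop rows with no list price); all column and file names are invented:

```python
import pandas as pd

# Hypothetical source stream: all products sold by the company
products = pd.DataFrame({
    "ProductKey": [1, 2, 3],
    "ModelKey":   [10, 11, 12],
    "Weight":     [2.0, 5.0, 1.5],
    "ListPrice":  [19.99, None, 4.50],
})
# Hypothetical lookup stream: model descriptions
models = pd.DataFrame({
    "ModelKey":         [10, 11, 12],
    "ModelDescription": ["Basic", "Pro", "Mini"],
})

# Lookup: left outer join to append the model description
flow = products.merge(models, on="ModelKey", how="left")

# Derived Column: shipping weight = actual weight + 10%
flow["ShippingWeight"] = flow["Weight"] * 1.10

# Filter: remove products with no list price
flow = flow[flow["ListPrice"].notna()]

# Sink: write the ordered result to a file
flow.sort_values("ProductKey").to_csv("products_enriched.csv", index=False)
```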

4.5.4. Data flow debug mode is very handy to enable while you are developing and testing your data flow

4.5.4.1. When enabled, it provisions a new Databricks cluster in the background to execute the data flow; provisioning typically takes around 5 minutes

4.5.5. In order to test the sink outcome of a data flow, you need to create a pipeline and add to it the Data flow activity (under Move & transform)

4.5.5.1. When running debug on a pipeline containing a data flow activity you will need to enable the Data flow debug (and will be prompted to do so if it's disabled)

4.5.5.2. See attached screenshot of my first pipeline that runs a data flow

4.5.6. Enabling the sampling option on the data flow sources is a handy way to reduce costs when debugging data flows during development, as the Databricks clusters are chargeable as a compute resource, and there is also a potential storage cost for the sink

5. Schedule Pipelines using Triggers

5.1. 3 types of trigger available:

5.1.1. Schedule trigger

5.1.1.1. The traditional form of scheduling that you see in SQL Server Agent

5.1.1.2. Trigger execution operates on a fire and forget basis - as long as the pipeline it triggers starts, the trigger is marked as successfully fired

5.1.1.3. Schedule recurrence can be set at multiple levels: minute, hour, day, week, month

5.1.1.4. Supports a single trigger starting multiple pipelines and a single pipeline being tied to multiple schedule triggers (many-to-many)

5.1.2. Tumbling window trigger

5.1.2.1. An interesting enhancement of the traditional schedule trigger, aimed at more specific scenarios than the general, flexible schedule trigger

5.1.2.2. Trigger execution is bound tightly to the triggered pipeline execution and retains state

5.1.2.2.1. Retaining state means that if the triggered pipeline fails or is cancelled, the trigger execution is also marked as failed or cancelled

5.1.2.3. Supports more advanced scenarios than the schedule trigger, including a configurable number of retries on failure and back-fill scenarios

5.1.2.4. Works on discrete, contiguous, non-overlapping time intervals that can be past or future dated

5.1.2.4.1. e.g. Monthly periods for a historic backfill

5.1.2.5. Relationship between tumbling window trigger and pipeline is one-to-one
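
To picture what discrete, contiguous, non-overlapping windows look like for a backfill, here is a small Python sketch that generates monthly windows (the dates are arbitrary):

```python
from datetime import date

def monthly_windows(start: date, end: date):
    """Yield contiguous, non-overlapping [window_start, window_end) pairs."""
    current = start
    while current < end:
        # First day of the following month
        if current.month == 12:
            nxt = date(current.year + 1, 1, 1)
        else:
            nxt = date(current.year, current.month + 1, 1)
        yield current, min(nxt, end)
        current = nxt

# e.g. a 2020 historic backfill would fire the pipeline once per monthly window
for window_start, window_end in monthly_windows(date(2020, 1, 1), date(2021, 1, 1)):
    print(window_start, "->", window_end)
```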

5.1.3. Event-based trigger

5.1.3.1. Currently based on two events: Blob created, Blob deleted

5.2. Creating a schedule trigger

5.2.1. ADF Author & Monitor | Manage

5.2.1.1. Author | Triggers | Create trigger

5.2.1.1.1. 1. Set Name

5.2.1.1.2. 2. Set Description

5.2.1.1.3. 3. Set Type to Schedule

5.2.1.1.4. 4. Set Start date

5.2.1.1.5. 5. Set Recurrence

5.2.1.1.6. 6. Select Activated option

5.2.1.1.7. 7. Click OK

5.2.1.1.8. 8. Click Publish all | Publish

5.2.2. Assign pipeline to trigger

5.2.2.1. Author | Pipelines | Select pipeline

5.2.2.1.1. Click Add trigger
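
Behind the UI, the published trigger is stored as a JSON resource. The sketch below, written as a Python dict, shows roughly the shape of a schedule trigger with an assigned pipeline; the names and values are placeholders and the exact schema can vary:

```python
import json

# Approximate shape of a published schedule trigger (placeholder names and values)
schedule_trigger = {
    "name": "DailyLoadTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",      # Minute, Hour, Day, Week or Month
                "interval": 1,
                "startTime": "2021-01-01T02:00:00Z",
                "timeZone": "UTC",
            }
        },
        # The pipeline(s) assigned to the trigger via Add trigger
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopySalesPipeline",
                    "type": "PipelineReference",
                }
            }
        ],
    },
}
print(json.dumps(schedule_trigger, indent=2))
```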

6. ADF Notifications via Azure Logic Apps

6.1. One of the interesting options we can leverage for notifications arising from our ADF pipelines is to use Azure Logic Apps

6.1.1. POC is a simple example, but you can make an Azure Logic App more dynamic - see attached for the simple POC design

6.1.1.1. Azure Logic Apps are based on triggering events that initiate a flow of actions; for our POC the trigger will be an HTTP request from the ADF pipeline, which will cause the Logic App to send an email

6.1.1.2. In order to trigger Azure Logic App to send us an email, we will use the Web Activity in our pipeline

6.2. Creating a Logic App for email notifications

6.2.1. Logic Apps | Create Logic App

6.2.1.1. 1. Select Subscription

6.2.1.1.1. e.g. Pay-As-You-Go

6.2.1.2. 2. Select Resource Group

6.2.1.2.1. e.g. argento-internal-training

6.2.1.3. 3. Set Logic app name

6.2.1.3.1. e.g. DataFactorySuccessEmail

6.2.1.4. 4. Select Location

6.2.1.4.1. e.g. UK South

6.2.1.5. 5. Set Log analytics option

6.2.1.5.1. Defaults to Off; changing it to On adds a small cost

6.2.1.6. 6. Click Review + create and then confirm by clicking Create

6.2.2. Logic Apps | Select new logic app

6.2.2.1. Click Blank Logic App

6.2.2.1.1. 1. (Triggers) Select HTTP Request

6.2.2.1.2. 2. (Triggers) Set HTTP Request Body JSON Schema & click Next step

6.2.2.1.3. 3. (Actions) Select Office365 Send an email

6.2.2.1.4. 4. (Actions) Sign in with your Office365 account

6.2.2.1.5. 5. (Actions) Set Mail To

6.2.2.1.6. 6. (Actions) Set Subject

6.2.2.1.7. 7. (Actions) Set Body

6.2.2.1.8. 8. Click Save

6.2.2.1.9. 9. Return to HTTP request trigger and copy the HTTP POST URL

6.3. Create ADF Pipeline that triggers Logic App for sending notification

6.3.1. ADF | Author & Monitor

6.3.1.1. Author | Pipeline | New pipeline

6.3.1.1.1. 1. Set Name

6.3.1.1.2. 2. Click and drag Wait activity onto canvas

6.3.1.1.3. 3. Click and drag Web activity onto canvas and connect the Wait activity's success output to it

6.3.1.1.4. 4. Set Name of Web activity

6.3.1.1.5. 5. Set URL

6.3.1.1.6. 6. Set Method to POST

6.3.1.1.7. 7. Set Body via dynamic content

6.3.1.1.8. 8. Click Publish all | Publish
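
The Web activity configured above is essentially an HTTP POST, so the Logic App can also be exercised independently of ADF. A sketch with Python requests, where the URL and body fields are placeholders (the body must match the JSON schema defined on the Logic App's HTTP Request trigger):

```python
import requests

# Placeholder for the HTTP POST URL copied from the Logic App's Request trigger
logic_app_url = "https://prod-00.uksouth.logic.azure.com/workflows/<workflow-id>/triggers/manual/paths/invoke?<query>"

# Field names must match the JSON schema configured on the HTTP Request trigger
body = {
    "dataFactoryName": "my-data-factory",
    "pipelineName": "NotifyPipeline",
    "message": "Pipeline completed successfully",
}

response = requests.post(logic_app_url, json=body, timeout=30)
print(response.status_code)   # 202 Accepted usually means the Logic App run was queued
```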

6.4. Test ADF pipeline with logic app

6.4.1. Author | Pipeline | Click Debug

6.4.1.1. Verify that pipeline succeeds and notification email is delivered (see attached)

7. Provisioning ADF

7.1. Provision via Azure Portal

7.1.1. Data factories | Add

7.1.1.1. 1. Select Subscription

7.1.1.2. 2. Select Resource Group

7.1.1.3. 3. Select Region

7.1.1.4. 4. Give name to new data factory

7.1.1.4.1. Must be globally unique across Azure

7.1.1.5. 5. Select Version

7.1.1.5.1. V1 or V2

7.1.1.6. 6. Select Repository type

7.1.1.6.1. Azure DevOps

7.1.1.6.2. GitHub

7.1.1.6.3. You can also check option "Configure Git Later" to defer source control integration

7.1.1.7. 7. Click Create
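
The same provisioning can also be scripted. A sketch using the Azure management SDK for Python, assuming the azure-identity and azure-mgmt-datafactory packages are installed; all names below are placeholders and model/method names may differ slightly between SDK versions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<subscription-id>"           # placeholder
resource_group = "argento-internal-training"    # placeholder resource group
factory_name = "my-globally-unique-adf-name"    # must be globally unique

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) a V2 data factory in the chosen region
factory = client.factories.create_or_update(
    resource_group, factory_name, Factory(location="uksouth")
)
print(factory.name, factory.provisioning_state)
```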

8. Author & Monitor

8.1. Click this from the Data factory Overview

8.2. Launches a new browser window

8.2.1. Data Factory canvas (i.e. the Data Factory UI)

8.2.1.1. Data Factory Overview

8.2.1.1.1. Default screen

8.2.1.1.2. Includes a bunch of links to get started quickly with authoring tasks

8.2.1.1.3. Also includes a bunch of videos and tutorial links

8.2.1.2. Author

8.2.1.2.1. This is the main development interface for designing and developing your data factory

8.2.1.2.2. Click + button to add new resource

8.2.1.2.3. Pipelines, Datasets and Data flows can be organised into folders for grouping together things logically

8.2.1.3. Monitor

8.2.1.3.1. For monitoring the following:

8.2.1.3.2. Dashboards also available

8.2.1.4. Manage

8.2.1.4.1. Connections

8.2.1.4.2. Triggers

8.2.1.5. ADF UI is officially supported in Microsoft Edge and Google Chrome browsers

8.2.1.5.1. It is reportedly also compatible with other browsers such as Safari and Firefox, but sticking to Chrome or Edge seems the safest bet

9. Pipelines

9.1. Activities

9.1.1. Define actions to perform on data

9.1.2. Examples

9.1.2.1. Batch Service (custom activity)

9.1.2.2. Databricks

9.1.2.3. Data Lake Analytics

9.1.2.4. HDInsight

9.1.2.5. Machine Learning

9.1.2.6. Copy Data

9.1.2.7. Stored Procedure

9.2. Copy data activity

9.2.1. Use to copy data between data stores on-premises and in the cloud

9.2.2. Performance

9.2.2.1. Measured in Data Integration Units (DIU)

9.2.2.1.1. DIU is a metric abstraction that represents a combination of CPU, memory and network resource allocation for each Data Factory pipeline activity

9.2.2.1.2. Billing for data factory is based on DIU consumption

9.2.2.1.3. DIU for an activity can be set between 2 (min) and 256 (max)

9.2.2.2. Degree of copy parallelism

9.2.2.2.1. Additional way of modifying performance for the copy data activity

9.2.2.2.2. Setting represents max number of threads to use

9.2.2.2.3. Generally it is best to leave this unset as ADF will automatically decide on parallelism
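
As a rough feel for how DIU consumption drives the data movement cost, a back-of-the-envelope calculation; the per-DIU-hour rate below is purely a placeholder, so check current Azure pricing:

```python
# Hypothetical example: a copy activity running at 8 DIUs for 30 minutes
dius = 8
duration_hours = 0.5
price_per_diu_hour = 0.25   # placeholder rate only, not an official price

cost = dius * duration_hours * price_per_diu_hour
print(f"Approximate data movement cost: ${cost:.2f}")
```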

9.2.3. The Copy Data wizard will automatically create and configure the pipeline with a copy data activity for you

9.2.3.1. However, the wizard will produce some poorly named resources, particularly for the datasets, and ADF pros will generally work directly with the Author tools to develop Production grade pipelines

9.2.4. Creating pipeline with copy data activity

9.2.4.1. Author | Pipelines | New Pipeline

9.2.4.1.1. Set name for pipeline

9.2.4.1.2. Activities | Move & Transform

9.2.4.1.3. Click Publish All | Publish

9.2.4.1.4. Click Debug button to test run the pipeline

9.2.4.2. Renaming a dataset can be done by opening it and then clicking the Properties button to expose the General tab (which can be hidden)

9.3. Get metadata activity

9.3.1. We can use this to make our pipelines dynamic

9.3.1.1. Use case: track when file was last updated, when it was last loaded, and store this in SQL Database for the purpose of deciding whether or not to re-run a Copy activity

9.3.2. Creating pipeline with metadata activity

9.3.2.1. Author | Pipelines | New Pipeline

9.3.2.1.1. Set name for pipeline

9.3.2.1.2. Activities | General

9.3.2.1.3. Click Publish All | Publish

9.3.2.1.4. Click Debug button to test run the pipeline
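
For orientation, the Get Metadata activity surfaces its results as JSON on the activity output. The sketch below shows roughly the shape to expect; which fields appear depends on the field list you select and the dataset type, and the values here are invented:

```python
# Approximate shape of a Get Metadata activity output (invented values)
get_metadata_output = {
    "lastModified": "2021-03-01T08:15:00Z",   # when the source file was last updated
    "itemName": "InternetSales_2021_03.csv",
    "childItems": [                           # returned when the dataset points at a folder
        {"name": "InternetSales_2021_03.csv", "type": "File"},
        {"name": "InternetSales_2021_02.csv", "type": "File"},
    ],
}

# Downstream activities read these fields with expressions such as
# activity('Get Metadata1').output.lastModified (the activity name is hypothetical)
print(get_metadata_output["lastModified"])
```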

9.4. Stored procedure activity

9.4.1. Invokes a stored procedure (see attached example)

9.4.1.1. Supports Azure SQL Database, Azure Synapse Analytics, SQL Server Database

9.4.2. Can utilise outputs from other pipeline activities

9.4.2.1. e.g. the Metadata activity

9.4.3. Output parameters not available to ADF pipeline

9.4.3.1. but there is a workaround for this: use the Lookup activity instead

9.4.4. Extending pipeline by connecting stored procedure activity to metadata activity

9.4.4.1. Author | Pipelines | Select (and Open)

9.4.4.1.1. Activities | General

9.4.4.1.2. Click Publish All | Publish

9.4.4.1.3. Click Debug button to test run the pipeline

9.4.4.1.4. Verify that the expected outcome of the stored procedure in the database has occurred

9.5. Lookup activity

9.5.1. Use case: fetch "Last Load Date" metadata from Azure SQL Database and use this in combination with Get metadata activity to implement a conditional Copy data activity via an intermediate If Condition activity

9.5.1.1. See attached for example database "Control Table" that could be a typical source for a Lookup activity

9.5.1.2. See attached for example database stored procedure that can be used in a Lookup activity to retrieve a dataset

9.5.2. Purpose is to retrieve a dataset

9.5.3. Supports the following:

9.5.3.1. Any ADF data source

9.5.3.2. Executing stored procedures

9.5.3.3. Executing SQL scripts

9.5.3.4. Output parameters

9.5.4. Outputs either a single value or an array (object variable)

9.5.5. Cloning pipeline and replacing stored procedure activity with lookup activity

9.5.5.1. Author | Pipelines | Select (and Clone)

9.5.5.1.1. Rename cloned pipeline

9.5.5.1.2. Activities | General

9.5.5.1.3. Click Publish All | Publish

9.5.5.1.4. Click Debug button to test run the pipeline

9.5.5.1.5. Review Output from Lookup activity
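
For reference, the Lookup activity output is also JSON: with "First row only" enabled you get a firstRow object, otherwise a value array. A sketch of both shapes with invented values:

```python
# With "First row only" enabled (invented values)
lookup_single = {
    "firstRow": {
        "FileName": "InternetSales.csv",
        "LastLoadDate": "2021-02-28T23:00:00Z",
    }
}

# Without "First row only": an array of rows plus a row count
lookup_array = {
    "value": [
        {"FileName": "InternetSales.csv", "LastLoadDate": "2021-02-28T23:00:00Z"},
        {"FileName": "ResellerSales.csv", "LastLoadDate": "2021-02-27T23:00:00Z"},
    ],
    "count": 2,
}

# Referenced downstream as activity('Lookup1').output.firstRow.LastLoadDate
# (activity and column names are hypothetical)
print(lookup_single["firstRow"]["LastLoadDate"])
```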

9.6. If Condition activity

9.6.1. Purpose is to provide classic IF statement functionality, evaluating a Boolean expression in order to split the execution path of a pipeline

9.6.2. Supports the ADF expression language and built in ADF functions

9.6.3. Defines If True activities and If False activities, which are essentially nested pipelines to be invoked depending on the outcome of the If Condition activity

9.6.4. Extending pipeline by connecting Get Metadata activity and Lookup activity to If Condition activity

9.6.4.1. Design pattern example:

9.6.4.1.1. Check the source file's last modified date using the Get Metadata activity, check the last load date using the Lookup activity, and then use the If Condition activity to compare the two, so that the source file is only loaded (using the Copy data activity) if it has been updated since it was last loaded - see the sketch after these steps

9.6.4.2. Author | Pipelines | Select (and Open)

9.6.4.2.1. Activities | Filter

9.6.4.2.2. Click Publish All | Publish

9.6.4.2.3. Click Debug button to test run the pipeline

9.6.4.2.4. Review output from pipeline debug run to verify that the expected test Wait activity was invoked by the If Condition
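
A minimal Python sketch of the comparison the If Condition performs in this pattern; in ADF itself this is an expression built from the Get Metadata and Lookup outputs (something along the lines of @greater(last modified, last load date), with hypothetical activity names):

```python
from datetime import datetime, timezone

# Values that would come from the Get Metadata and Lookup activity outputs
last_modified = datetime(2021, 3, 1, 8, 15, tzinfo=timezone.utc)  # source file timestamp
last_loaded = datetime(2021, 2, 28, 23, 0, tzinfo=timezone.utc)   # from the control table

if last_modified > last_loaded:
    # True activities: Copy data activity + stored procedure to update the control table
    print("File changed since last load - run the copy")
else:
    # False activities: skip the load (e.g. a placeholder Wait activity)
    print("No change - skip the copy")
```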

9.6.5. Extending pipeline by replacing If Condition True activities (dummy Wait activity to be replaced by Copy data activity + Stored procedure activity)

9.6.5.1. See attached for example database stored procedure that can be used in an If Condition true activity to update a control table

9.6.5.2. Author | Pipelines | Select (and Open)

9.6.5.2.1. Select If Condition activity

9.6.5.2.2. Click Publish All | Publish

9.6.5.2.3. Click Debug button to test run the pipeline

9.6.5.2.4. Verify Output to confirm that the Copy data and Stored procedure activities were triggered as expected

9.6.5.2.5. Verify that control table in SQL DB updated as expected

9.7. ForEach activity

9.7.1. Use case: implement a parent-child pipeline architecture, whereby we iterate over a folder of files for processing and invoke one or more activities for each file

9.7.1.1. Also we will use the concept of a task queue table (i.e. using the ForEach activity to capture a list of files for processing and adding details about those files into a task queue table held in SQL Database)

9.7.1.1.1. Concept of a Task Queue

9.7.1.1.2. See attached for example of task queue table

9.7.1.1.3. See attached for example of stored procedure that will add a single item to our task queue (will be invoked for each file being iterated by the ForEach activity)

9.7.1.2. To prep, we will use Azure Storage Explorer to upload a bunch of files into a new folder in Blob storage

9.7.2. Purpose is to iterate over a list of items

9.7.3. Create child pipeline with initial object parameter for taking a list of file names (to come from parent pipeline)

9.7.3.1. Author | Pipelines | New Pipeline

9.7.3.1.1. 1. (General) Set Name

9.7.3.1.2. 2. (Parameters) New

9.7.4. Create parent pipeline for capturing the metadata of the source file input folder and feeding this into the child pipeline

9.7.4.1. Author | Pipelines | New Pipeline

9.7.4.1.1. (General) Set Name

9.7.4.1.2. Activities | General

9.7.5. Complete child pipeline by adding activities

9.7.5.1. Author | Pipeline | Select Pipeline

9.7.5.1.1. Activities | Iteration & conditionals

9.7.6. Click Publish All | Publish

9.7.6.1. Note: we publish the parent and child pipelines as one

9.7.7. Click Debug button to test run the parent pipeline

9.7.8. Review outcome of ForEach activity in the Task Queue table
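
A loose Python sketch of what this parent/child pattern does with the task queue: iterate the file list returned for the input folder and register each file via a stored procedure. The connection string, procedure name and file names are all hypothetical:

```python
import pyodbc

# Hypothetical connection to the Azure SQL Database that holds the task queue
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=argento-training.database.windows.net;DATABASE=ControlDB;"
    "UID=<user>;PWD=<password>"
)
cursor = conn.cursor()

# Stand-in for the Get Metadata childItems list fed to the ForEach activity
child_items = [
    {"name": "InternetSales_2021_01.csv", "type": "File"},
    {"name": "InternetSales_2021_02.csv", "type": "File"},
]

# One stored-procedure call per file, mirroring the activity inside the ForEach
for item in child_items:
    cursor.execute("EXEC dbo.AddTaskQueueItem @FileName = ?", item["name"])

conn.commit()
conn.close()
```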

9.7.9. Dynamic Datasets

9.7.9.1. This is the idea that we can use a ForEach loop to iterate over the files in a folder and run the Copy data activity with a dynamic file reference rather than a fixed one

9.7.9.1.1. This idea depends on all files in the folder having an identical schema, so we first test and review how adding a single file with a different schema to the source folder breaks the pipeline

9.7.9.2. See attached for database table that can be a target for the multiple files that land in the Internet Sales folder

9.7.9.3. We will start by creating a source dataset from a sample source file and a sink dataset for the matching target database table

9.7.9.3.1. Source Dataset

9.7.9.3.2. Sink Dataset

9.7.9.4. With (dynamic) Source dataset and Sink dataset created, we will now clone the previous child pipeline and edit the ForEach activity to use a Copy data activity based upon our new datasets

9.7.9.4.1. Author | Pipelines | Clone pipeline

9.7.9.5. Next we will edit the previous parent pipeline to invoke the new child pipeline, noting that at this stage the process will break due to the inclusion of a rogue file in the source folder (i.e. a single file that does not conform to the schema required for loading into FactInternetSales)

9.7.9.5.1. Author | Pipelines | Select pipeline

9.8. Filter activity

9.8.1. Purpose is to apply a filter expression to an input array

9.8.2. For each item in array, it runs a conditional expression and if the condition resolves to False, the item is removed from the array

9.8.3. To demonstrate its use, we will edit the existing parent pipeline to insert the Filter activity in between the Get Metadata activity and the Execute pipeline activity, which will resolve the failure we previously observed

9.8.3.1. Author | Pipelines | Select pipeline

9.8.3.1.1. Iteration & Conditionals | Click and drag Filter activity onto canvas

9.8.3.1.2. Make space between Get Metadata activity and Execute pipeline activity and delete the direct connection between the two

9.8.3.1.3. Connect output of Get Metadata activity into new Filter activity

9.8.3.1.4. Edit Filter activity

9.8.3.1.5. Connect output of Filter activity to Execute pipeline activity

9.8.3.1.6. Edit Execute pipeline activity

9.8.3.1.7. Click Publish All | Publish

9.8.3.1.8. Click Debug button to test run the pipeline

9.8.3.1.9. Review run output
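
Conceptually, the Filter activity added here is just a list comprehension over the Get Metadata childItems array, keeping only the files the child pipeline knows how to load. A Python sketch with an invented naming condition (in ADF this would be an expression such as @startswith(item().name, 'InternetSales')):

```python
# Stand-in for the Get Metadata childItems array (invented file names)
child_items = [
    {"name": "InternetSales_2021_01.csv", "type": "File"},
    {"name": "InternetSales_2021_02.csv", "type": "File"},
    {"name": "RogueFile.txt", "type": "File"},   # wrong schema, must not reach the copy
]

# Filter: keep only the files the child pipeline knows how to load
filtered = [item for item in child_items
            if item["name"].startswith("InternetSales")]

# The filtered list is what gets passed on to the Execute pipeline activity
for item in filtered:
    print("Queue for copy:", item["name"])
```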

10. SSIS Lift and Shift

10.1. Executing SSIS packages stored in Azure using Azure rather than on-prem resources

10.2. Requirements

10.2.1. Azure subscription

10.2.2. Azure Data Factory

10.2.3. Azure-SSIS Integration Runtime (IR)

10.2.3.1. This IR is actually running on an Azure VM, which brings with it the associated costs of running a VM in Azure

10.2.3.1.1. You won't see it listed as a VM resource though

10.2.3.1.2. If you pause the IR, this halts the VM charges

10.2.4. Azure SQL DB or Azure SQL Managed Instance

10.2.4.1. Will host SSIS Catalog (SSISDB)

10.3. Benefits

10.3.1. Familiar toolset - keep developing SSIS packages in Visual Studio and deploy to SSISDB (hosted in the Azure SQL Database or Managed Instance used by the Azure-SSIS IR)

10.3.2. Azure-SSIS IR provides access to SQL Agent and PowerShell

10.3.2.1. ADF pipelines can also invoke SSIS packages from the IR as an activity

10.3.3. The VM that hosts the IR can be scaled up and scaled out

10.3.3.1. Note: the IR is actually a managed cluster of VMs - hence the scale out capability

10.3.4. Setting up a site-to-site or point-to-site VPN enables connecting an Azure VNet (where Lift and Shift SSIS packages will reside) to on-prem resources, such as source files

10.3.4.1. Enables a phased approach to migrating from on prem to cloud, as you can have your SSIS-based ETL running in Azure without being forced to migrate all of your source data and sink databases to Azure

10.4. Provision Azure-SSIS IR

10.4.1. ADF Author & Monitor

10.4.1.1. Manage | Integration Runtimes | New

10.4.1.1.1. Select Azure-SSIS and click Continue

10.5. Visual Studio SSIS Deployment to Azure-SSIS IR

10.5.1. For a very quick and easy proof of concept, we will just create a project with a single package and add only a connection manager and a script task

10.5.2. 1. Create new project

10.5.2.1. e.g. MyFirstAzureSSISEnabledProject

10.5.3. 2. Create New OLEDB Connection Manager and point to Azure SQL Server endpoint

10.5.3.1. e.g. argento-training.database.windows.net

10.5.3.2. See attached example

10.5.4. 3. Click and drag a Script task onto canvas

10.5.4.1. The Script task is ideal for testing purposes as without any configuration it will simply complete successfully

10.5.5. 4. Right-click project in Solution Explorer and choose Properties

10.5.6. 5. Under Configuration Properties | Deployment, set the values, and click OK

10.5.6.1. e.g. Server name = argento-training.database.windows.net

10.5.6.2. e.g. Server Project Path = \SSISDB\Lift_and_Shift

10.5.7. 6. Right click project again via Solution Explorer and choose Deploy option

10.5.7.1. Starts Deployment Wizard

10.5.7.1.1. Choose SSIS in Azure Data Factory for Deployment Target

10.5.7.1.2. Select Destination

10.5.7.1.3. Complete wizard and confirm successful deployment

10.5.8. 7. Verify deployment using SSMS

10.5.8.1. Once connected to Azure SQL Server via SSMS, you will see SSISDB (which was created as part of the Azure-SSIS IR provisioning process)

10.5.8.2. Inspect contents of internal.folders table to verify that folder created during deployment now exists

10.5.8.3. Inspect contents of internal.projects table to verify that new project now exists

10.5.8.4. Inspect contents of internal.packages table to verify that new package now exists
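
The same verification can be scripted against SSISDB. A sketch using pyodbc, reusing the example server name from above; the credentials are placeholders:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=argento-training.database.windows.net;DATABASE=SSISDB;"
    "UID=<user>;PWD=<password>"
)
cursor = conn.cursor()

# Folders, projects and packages created by the Visual Studio deployment
for table in ("internal.folders", "internal.projects", "internal.packages"):
    cursor.execute(f"SELECT name FROM {table}")
    print(table, [row.name for row in cursor.fetchall()])

conn.close()
```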

10.6. Access SSISDB Catalog in Azure SQL Database

10.6.1. The trick to this is to explicitly connect to the SSISDB database when making the connection in SSMS

10.6.1.1. Once connection is made you will see Integration Services Catalogs and be able to set up environment variables, configure packages, etc.

10.7. Execute SSIS packages from Azure

10.7.1. This is enabled by the Azure-SSIS IR and the SSIS project deployment from Visual Studio to Azure SQL Database or Azure SQL Managed Instance

10.7.2. ADF Author & Monitor

10.7.2.1. Author | Pipelines | New pipeline

10.7.2.1.1. 1. Set Name

10.7.2.1.2. 2. Under General, click and drag Execute SSIS package onto canvas

10.7.2.1.3. 3. Click Publish All | Publish

10.7.2.1.4. 4. Click Debug (for the pipeline)

11. Monitoring

11.1. Requires execution from a trigger

11.1.1. Debug operations do not appear under Monitor

11.2. The following can be monitored:

11.2.1. Pipeline runs, Activity runs, Integration runtimes (including Azure-SSIS IR) and Trigger runs

11.3. We can specify custom date & time ranges to review monitored activity

11.4. Before going into Author & Monitor to see details of monitored activity, we can already see a number of charts for monitoring ADF activity in the Azure portal, and these can be pinned to one of your dashboards if you like

11.5. ADF Author & Monitor | Monitor

11.5.1. Select Monitor subject

11.5.1.1. Pipeline runs

11.5.1.1.1. You can drilldown on pipeline runs to see the detail

11.5.1.2. Trigger runs

11.5.1.3. Integration runtimes

11.6. Lift and Shift SSIS package executions only show up in ADF Monitor when invoked by a pipeline

11.6.1. We can also invoke these packages using T-SQL, in which case we can interrogate SSISDB for details of the execution

11.6.1.1. See attached screenshots for scripting out package execution from SSISDB catalog
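
As a closing sketch, here is the T-SQL route for starting a package execution from the SSISDB catalog, run from Python via pyodbc; the folder and project names follow the earlier example, while the package name and credentials are placeholders:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=argento-training.database.windows.net;DATABASE=SSISDB;"
    "UID=<user>;PWD=<password>",
    autocommit=True,
)
cursor = conn.cursor()

# Create and start an execution for the deployed package (placeholder package name)
tsql = """
SET NOCOUNT ON;
DECLARE @execution_id BIGINT;
EXEC catalog.create_execution
     @folder_name  = N'Lift_and_Shift',
     @project_name = N'MyFirstAzureSSISEnabledProject',
     @package_name = N'Package.dtsx',
     @execution_id = @execution_id OUTPUT;
EXEC catalog.start_execution @execution_id;
SELECT @execution_id AS execution_id;
"""
cursor.execute(tsql)
print("Started execution:", cursor.fetchone().execution_id)
conn.close()
```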