Azure Data Factory CI/CD

1. Methods of implementing CI/CD for ADF

1.1. Visual tools method

1.1.1. UI based method

1.1.2. Use ADF to import/export ARM templates

1.1.3. Integrate with a version control system (e.g. Git)

1.1.4. Simpler method but not fully automated

1.1.4.1. Requires button clicks in Azure Portal

1.2. Automated method

1.2.1. Combines ADF with Git integration and Azure DevOps

2. CI/CD Infrastructure

2.1. ADF is a data integration tool and Azure DevOps provides the CI/CD capabilities, but a typical data integration solution also needs several other Azure infrastructure components

2.1.1. Azure Resource Groups

2.1.1.1. All Azure resources, including ADF, must belong to a resource group

2.1.2. Azure Storage

2.1.2.1. Modern data analytics solutions involve a data lake or blob storage, which is provided via Azure Storage accounts

2.1.3. Azure Key Vault

2.1.3.1. Automated deployments into Azure resource groups require authentication. As a security best practice we avoid holding sensitive passwords or access tokens directly in our DevOps pipeline code; Azure Key Vault lets us retrieve these "secrets" securely

2.1.4. Azure Data Factory

2.1.4.1. The data factory we build comprises infrastructure (configuration of the ADF service instance, including linked services and integration runtimes) plus the code we develop in the form of pipelines, data flows and datasets

2.1.5. Azure DevOps

2.1.5.1. It is most convenient to host our Git repos in Azure DevOps; we can then use its Azure Pipelines feature to develop CI/CD processes that plug into those repos and automate build and release actions

2.2. We typically have 3 environments for our data integration solution, which CI/CD pipelines need to interconnect:

2.2.1. Development

2.2.1.1. This is the Azure resource group that the development team works in day to day, using feature branches (one per developer) and merging completed features into the collaboration branch (typically the develop branch) via DevOps pull requests

2.2.1.1.1. Typical testing of the data pipelines involves small data samples

2.2.2. Staging

2.2.2.1. This environment may also be referred to as QA or Test. It is typically where all integration testing occurs, both automated and any manual testing the QA testers need to do; release to this environment may be automatic or manually triggered depending on the team's requirements

2.2.2.1.1. Typically this will involve testing of the data pipelines with more significant data samples

2.2.3. Production

2.2.3.1. Releases to this environment are most commonly triggered manually, at least until the CI/CD practices have matured to the point where automated test coverage is robust enough to give a high degree of trust in every release

2.2.3.1.1. Typically this is the only environment that runs the data pipelines with real data (or at least, the only one running with sensitive, unobfuscated data), and it is much more restricted in terms of security and access

2.3. Initial setup

2.3.1. We can set up the skeleton for our data integration solution CI/CD infrastructure by creating resource groups for each environment and provisioning a storage account and key vault in each

2.3.1.1. Having a well defined naming convention helps here

2.3.1.1.1. A real world example I've seen is: org-resourcetype-solutionname-environment

2.3.1.2. Step 1: create 3 resource groups

2.3.1.2.1. In this example all 3 are created in the same subscription, but a scenario I've often seen in the real world is to keep each environment in its own subscription

2.3.1.2.2. argento-rgp-datalake-dev

2.3.1.2.3. argento-rgp-datalake-qa

2.3.1.2.4. argento-rgp-datalake-prd
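
2.3.1.2.5. As an alternative to the Portal wizard, a minimal Azure CLI sketch for this step could look like the following (the westeurope location is just an assumption for the example):

```bash
# Create one resource group per environment (assumed location: westeurope)
az group create --name argento-rgp-datalake-dev --location westeurope
az group create --name argento-rgp-datalake-qa  --location westeurope
az group create --name argento-rgp-datalake-prd --location westeurope
```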

2.3.1.3. Step 2: create an ADLS Gen2 storage account in each resource group

2.3.1.3.1. Inside each storage account, we will create two containers (raw and curated) with private access only

2.3.1.3.2. argentoadls2datalakedev

2.3.1.3.3. argentoadls2datalakeqa

2.3.1.3.4. argentoadls2datalakeprd
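
2.3.1.3.5. A hedged Azure CLI sketch for creating the three ADLS Gen2 accounts and their two private containers (flag names can vary slightly between CLI versions, and the SKU/location are assumptions, so treat this as illustrative only):

```bash
# ADLS Gen2 = StorageV2 account with the hierarchical namespace enabled
for env in dev qa prd; do
  az storage account create \
    --name "argentoadls2datalake${env}" \
    --resource-group "argento-rgp-datalake-${env}" \
    --location westeurope \
    --sku Standard_LRS \
    --kind StorageV2 \
    --hierarchical-namespace true

  # Containers are private by default; raw and curated are the names used later in the POC
  # (with no auth option given, the CLI looks up the account key using your ARM credentials)
  for container in raw curated; do
    az storage container create \
      --account-name "argentoadls2datalake${env}" \
      --name "${container}"
  done
done
```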

2.3.1.4. Step 3: create a key vault in each resource group

2.3.1.4.1. argento-kv-datalake-dev

2.3.1.4.2. argento-kv-datalake-qa

2.3.1.4.3. argento-kv-datalake-prd
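
2.3.1.4.4. The equivalent Azure CLI sketch for the key vaults (again purely illustrative):

```bash
# One key vault per environment, following the same naming convention
for env in dev qa prd; do
  az keyvault create \
    --name "argento-kv-datalake-${env}" \
    --resource-group "argento-rgp-datalake-${env}" \
    --location westeurope
done
```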

2.3.1.5. Idea: Deploy QA and Production resource groups using ARM template from Development resource group

2.3.1.5.1. Although this example is too trivial to really justify setting up the QA and Production skeletons this way, it is an interesting experiment as an alternative to the purely GUI-driven approach of repeatedly walking through the deployment wizards

2.3.1.5.2. Step 1: Go to the dev resource group containing the provisioned storage account and key vault and export the ARM template

2.3.1.5.3. Step 2: Rename the downloaded template and parameters JSON files to reflect resource group names and open up in Visual Studio Code

2.3.1.5.4. Step 3: Perform a find & replace within both the template and parameters files to make the parameter names more generic, and set the values in the parameters file to those for QA

2.3.1.5.5. Step 4: Use Azure CLI to upload ARM template files

2.3.1.5.6. Step 5: Run through the CLI commands to create new QA resource group and use ARM template to deploy storage account and key vault into it

2.3.1.5.7. Step 6: Verify the QA deployment via Azure Portal

2.3.1.5.8. Step 7: Repeat for Production resource group deployment
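
2.3.1.5.9. For reference, once the template and parameters files are sitting in the Cloud Shell home directory, steps 5 to 7 boil down to a couple of CLI commands (the file names here are assumptions based on the renaming done in step 2):

```bash
# Create the QA resource group and deploy the exported template into it
az group create --name argento-rgp-datalake-qa --location westeurope

az deployment group create \
  --resource-group argento-rgp-datalake-qa \
  --template-file argento-rgp-datalake-qa-template.json \
  --parameters @argento-rgp-datalake-qa-parameters.json

# Repeat with prd values in the parameters file for the Production resource group
```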

2.3.2. In addition, let's upload the same test JSON file into the raw container of each environment's storage account, as we'll use it later to test our data factory pipelines

2.3.2.1. We can use Azure Storage Explorer to upload a test JSON file into the raw container for each of our new storage accounts

2.3.2.1.1. In this example, we upload a JSON file named Sales.Customer.json into the raw container of our dev storage account
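
2.3.2.2. Alternatively, the same upload can be scripted with the Azure CLI (assuming the caller holds a data-plane role such as Storage Blob Data Contributor on the accounts):

```bash
# Upload the test file into the raw container of each environment's storage account
for env in dev qa prd; do
  az storage blob upload \
    --account-name "argentoadls2datalake${env}" \
    --container-name raw \
    --name Sales.Customer.json \
    --file ./Sales.Customer.json \
    --auth-mode login
done
```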

3. Initial development in Data Factory

3.1. The idea here is to provision a development data factory linked to a Git repo and develop a simple data flow that integrates a source customer file in JSON format into Delta Lake format (based on Parquet)

3.2. This data factory will be Git enabled but none of the others will be

3.3. I have serialised tables from the AdventureWorks2017 database into JSON, and we will use the Sales.Customer.json file for our POC; this file is uploaded to the raw container in all 3 of our environment storage accounts

3.4. Before developing the data factory data flow and pipeline, we will grant the new data factory permission to get secrets from the dev key vault, add a secret to hold the storage account connection string, and also grant the data factory permission to read from the raw container and write to the curated container

3.4.1. We need to repeat this configuration in the QA and Prod resource groups (i.e. add a data factory, adhering to the naming convention, and grant it rights to the equivalent key vault and storage account, with the respective storage account connection strings added as secrets in each key vault)

3.4.1.1. This is easily done manually via the Azure Portal UI but I may choose to deploy those changes to QA and Prod resource groups using ARM template development just for practice!

3.5. POC setup

3.5.1. Step 1: provision new data factory in argento-rgp-datalake-dev

3.5.1.1. We name the data factory according to naming convention, argento-adf-datalake-dev

3.5.1.2. We skip Git configuration on provisioning because we'll do that later after first setting up permissions for the new data factory

3.5.1.3. We get a new data factory named argento-adf-datalake-dev

3.5.2. Step 2: Grant dev data factory read permission to raw container in dev storage account and write permission in curated container

3.5.2.1. Via Azure Portal, we navigate to the raw container of the dev storage account and go to the Access Control (IAM) page

3.5.2.2. We add a new role, choose Storage Blob Data Reader and then search and select the managed identity of the new dev data factory

3.5.2.3. Gotcha! In addition to giving the reader role at the raw container level, we must give the reader role at the storage account level

3.5.3. Step 3: Repeat step 2 but for curated container, and rather than Storage Blob Data Reader, we grant the role of Storage Blob Data Contributor
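
3.5.3.1. Steps 2 and 3 can also be scripted; a hedged CLI sketch follows (looking up the factory's managed identity via a generic az resource query is my assumption, not something from the Portal walkthrough):

```bash
SUB_ID=$(az account show --query id -o tsv)
RG=argento-rgp-datalake-dev
SA=argentoadls2datalakedev

# Object ID of the dev data factory's system-assigned managed identity
ADF_PID=$(az resource show \
  --resource-group "$RG" \
  --resource-type Microsoft.DataFactory/factories \
  --name argento-adf-datalake-dev \
  --query identity.principalId -o tsv)

SA_SCOPE="/subscriptions/$SUB_ID/resourceGroups/$RG/providers/Microsoft.Storage/storageAccounts/$SA"

# Reader role at the storage account level (the gotcha) and on the raw container
az role assignment create --assignee "$ADF_PID" --role "Storage Blob Data Reader" --scope "$SA_SCOPE"
az role assignment create --assignee "$ADF_PID" --role "Storage Blob Data Reader" \
  --scope "$SA_SCOPE/blobServices/default/containers/raw"

# Contributor role on the curated container so the factory can write to it
az role assignment create --assignee "$ADF_PID" --role "Storage Blob Data Contributor" \
  --scope "$SA_SCOPE/blobServices/default/containers/curated"
```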

3.5.4. Step 4: Copy dev storage account connection string and add as secret in dev key vault

3.5.4.1. In the dev key vault we can create a new secret named datalake-conn-str and paste in the connection string for our dev storage account
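
3.5.4.2. The CLI equivalent of this step (same secret name and vault; treat it as a sketch):

```bash
# Fetch the dev storage account connection string and store it as a key vault secret
CONN_STR=$(az storage account show-connection-string \
  --name argentoadls2datalakedev \
  --resource-group argento-rgp-datalake-dev \
  --query connectionString -o tsv)

az keyvault secret set \
  --vault-name argento-kv-datalake-dev \
  --name datalake-conn-str \
  --value "$CONN_STR"
```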

3.5.5. Step 5: Grant dev data factory access to "Get" secrets from the dev key vault

3.5.5.1. The Access policies page in key vault allows us to Add Access Policy, and then we look up the managed identity for the dev data factory and assign it simply the Get permission for secrets only

3.5.5.2. Once done, we see that the dev data factory has permission to get secrets
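
3.5.5.3. The same access policy can be added from the CLI, reusing the ADF_PID managed identity lookup from the step 2/3 sketch:

```bash
# Allow the dev data factory's managed identity to Get secrets only
az keyvault set-policy \
  --name argento-kv-datalake-dev \
  --object-id "$ADF_PID" \
  --secret-permissions get
```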

3.5.6. Step 6: Create linked service for dev key vault in dev data factory

3.5.6.1. I named linked service AzureKeyVault

3.5.6.2. Authentication works via managed identity thanks to step 5

3.5.7. Step 7: Create linked service for dev storage account in dev data factory

3.5.7.1. I named linked service ADLSGen2

3.5.7.2. Authentication works via managed identity thanks to step 2, but watch out: argento-adf-datalake-dev needs the Storage Blob Data Reader role at the storage account level, not just the container level, otherwise the connection will fail

3.5.7.3. Don't forget to click the Publish button so that both linked services are properly added to argento-adf-datalake-dev

3.5.8. Step 8: Create new ARM template for the argento-rgp-datalake-dev resource group

3.5.8.1. This is definitely a trickier and more advanced way to prepare the addition of the base data factory deployments for the QA and Prod environments!

3.5.8.1.1. I'm choosing to do it this way just to practice with ARM templates but it would be much easier to simply repeat steps 1 to 7 twice over, once for each environment

3.5.8.2. Gotcha! As a newbie, the most intuitive way for me to create the ARM template is to run Export template on the resource group via the Portal, but you get an alert that exporting 3 or more resource types is not supported

3.5.8.3. My starting point was to make edits to the existing ARM template that I previously created for deploying the storage account and key vault to QA and Prod resource groups

3.5.8.4. My first action was to try and fix the previous error arising from the container deployment

3.5.8.4.1. The resource type is Microsoft.Storage/storageAccounts/blobServices/containers and the apiVersion is set to 2020-08-01-preview

3.5.8.4.2. According to documentation, the latest api version (as of Feb-2021) is 2019-06-01

3.5.8.4.3. So I will try changing the API version for these two resources as an experiment to see if this resolves the deployment error

3.5.8.5. Next, I go into the ADF UI and export the ARM template via the Manage page

3.5.8.5.1. This produces a collection of JSON files that looks like this:

3.5.8.6. Next, I integrate the contents of the data factory ARM templates into my existing one for the storage account + key vault deployment

3.5.8.6.1. First, I extend the parameters section by copying and pasting from two files:

3.5.8.6.2. Next, I extend the template parameters file by copying and pasting from two files, and setting values initially for the qa environment:

3.5.8.6.3. Next, I extend the resources section by copying from argento-adf-datalake-dev_arm_template.json

3.5.8.6.4. Next I append to the end of the resources section the linked service deployments by copying from arm_template.json and adding dependencies

3.5.8.7. Next, I start an Azure CLI (Cloud Shell) session via the Portal and upload the modified ARM templates

3.5.8.7.1. Note that I already had an ARM template uploaded for this resource group so I will overwrite

3.5.8.7.2. Also remember that the Cloud Shell's Upload feature only uploads one file at a time and puts it into your home directory (i.e. there is no option to choose a sub-directory)

3.5.8.8. Repeat steps (5 to 7) from the initial setup to use the Azure CLI to deploy first the QA resource group and then the Prod resource group, using the ARM templates

3.5.8.8.1. After this you should see the QA and Prod data factories deployed to their respective resource groups

3.5.8.9. Finally, we need to take care of the data factory linked services permissions in the QA and Prod environments

3.5.8.9.1. Gotcha! New data factories may not have a managed identity

3.5.8.9.2. For the storage account, this involves assigning the QA data factory's managed identity the Storage Blob Data Reader role on the QA storage account and the Storage Blob Data Contributor role on the curated container

3.5.8.9.3. For the key vault, this involves adding an access policy to allow the QA data factory to Get secrets from the QA key vault

3.5.8.9.4. The final act is to validate the linked services in the newly deployed data factories
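
3.5.8.9.5. Tip: because this permission setup has to be repeated per environment, it can also be scripted as a small loop (a hedged sketch reusing the same assumptions as the dev CLI sketches earlier):

```bash
for env in qa prd; do
  RG="argento-rgp-datalake-${env}"
  SA="argentoadls2datalake${env}"

  # Managed identity of the environment's data factory
  PID=$(az resource show --resource-group "$RG" \
    --resource-type Microsoft.DataFactory/factories \
    --name "argento-adf-datalake-${env}" \
    --query identity.principalId -o tsv)

  SCOPE="/subscriptions/$(az account show --query id -o tsv)/resourceGroups/$RG/providers/Microsoft.Storage/storageAccounts/$SA"

  # Storage roles: read at the account level, write on the curated container
  az role assignment create --assignee "$PID" --role "Storage Blob Data Reader" --scope "$SCOPE"
  az role assignment create --assignee "$PID" --role "Storage Blob Data Contributor" \
    --scope "$SCOPE/blobServices/default/containers/curated"

  # Key vault access policy: Get secrets only
  az keyvault set-policy --name "argento-kv-datalake-${env}" \
    --object-id "$PID" --secret-permissions get
done
```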

4. Create new Azure DevOps repo and link Data Factory dev to this

4.1. In your Azure DevOps project create a new repo and add a develop branch (branched from the default main branch)

4.1.1. For this POC, I created the repo argento-ci-cd-training
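
4.1.2. If you prefer the CLI, the azure-devops extension can create the repo as well (the organisation and project values below are placeholders):

```bash
# Requires: az extension add --name azure-devops
az devops configure --defaults \
  organization=https://dev.azure.com/<your-org> project=<your-project>

az repos create --name argento-ci-cd-training
# The develop branch can then be created from main via Repos | Branches | New branch
```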

4.2. In your dev Data Factory UI, go to the Manage page and click Configure to set up Git integration

4.2.1. Choose Azure DevOps Git as the repository type and point to the existing repo that we created in the previous step, then click Apply

4.3. Once done, you will see your dev data factory is linked to the default collaboration branch that we chose, named develop

4.3.1. You can also see that you can toggle between Git mode and the non-source-controlled version of the data factory, referred to as "live mode"

4.3.2. Note that once the data factory becomes Git-enabled, you can only publish via Git mode (not live mode) and only from the branch designated as the collaboration branch (develop in our case)

4.3.2.1. If you switch to live mode for example, you will see that publishing is not permitted

4.4. When we look in Azure DevOps, we see that our develop collaboration branch has received some JSON files that represent our data factory so far

5. Data Factory development phase 1 (pre CI/CD pipeline)

5.1. Let's start by creating a story for my work in Azure DevOps, as this will become the basis for my feature branch in Data Factory

5.1.1. In Azure DevOps, we can go to Boards and create a new item

5.1.2. Complete the details of the story so we know what we're trying to achieve

5.1.2.1. Our new story has been automatically given a number by DevOps, 9 in this case, which we will use as basis of our feature branch in the data factory development

5.2. Create a new feature branch in ADF

5.2.1. We branch from our main collaboration branch, which in this case is develop

5.2.2. I've adopted the feature branch naming convention: feature/<item_no>_<summary_title>

5.2.2.1. In this case the feature branch I created and switched to in ADF is called: feature/9_RawToCurated_Customer

5.3. Create a dataset for our raw Customer JSON file

5.4. Develop mapping data flow that integrates the raw Customer file into the curated Customer file, with some transformations

5.4.1. For the sink I chose inline Delta and pointed to: curated/customer

5.5. Develop pipeline that calls data flow

5.5.1. After saving all changes, observe the changes committed into your feature branch by looking via Azure DevOps

5.5.1.1. New objects in JSON format are added for dataset, data flow and pipeline

5.6. Test pipeline whilst still in feature branch

5.7. Process pull request for feature into develop collaboration branch

5.7.1. We can create a pull request for our feature branch into develop directly from ADF

5.7.1.1. This launches Azure DevOps and auto-populates a new pull request for us

5.7.1.2. I normally complete with a squash commit, which means the feature goes into the head of develop with a single summary commit for the whole feature

5.8. Now that the feature is merged into the develop collaboration branch, let's test that the new pipeline still works

5.9. Once the test is successful in the develop collaboration branch, we should publish it via the Publish button

5.9.1. Note that publishing can only be done from the designated collaboration branch (e.g. develop)

5.9.2. When publishing we see the list of pending changes for us to confirm

5.9.2.1. If there are pending changes included that we don't recognise as part of our feature, we may need to check with other developers about the status of their testing of promoted features

5.9.3. The effect of publishing is to generate ARM templates for deployment and commit them into a special branch named adf_publish

5.9.3.1. We can see the ARM templates committed into the adf_publish branch via Azure DevOps

6. Deployment using ADF Visual Tools

6.1. Although this is not CI/CD via Azure DevOps, it's worth being aware of the easy-to-use visual tools that we can use to deploy the dev data factory to the QA data factory

6.2. In the dev data factory, go to Manage | ARM Template and choose the option to Export ARM Template

6.2.1. There are two main files in the zip folder downloaded to your local machine; one represents the data factory and the other is the parameters file

6.3. In the qa data factory, go to Manage | ARM Template and choose the option to Import ARM Template

6.3.1. This takes you to a Custom deployment screen in the Portal, where you should click the option: Build your own template in the editor

6.4. In the custom deployment | edit template screen, click the Load file option and point it to your exported template file: arm_template.json

6.4.1. Once loaded, click the Save button

6.5. Now we are returned to the Basics page of the Custom deployment screen, and all we need to do is set the target QA resource group and tweak the parameter values to reference qa instead of dev

6.5.1. The validation should pass and we click create to initiate the deployment to the QA data factory

6.5.1.1. We should see the deployment succeed after a short wait

6.6. In the QA data factory we now see the dataset, data flow and pipeline all deployed, and we can trigger this manually to test that it works

6.6.1. All being well, we should see that the newly deployed pipeline succeeds

6.6.2. And we can see that the curated container in the QA storage account got the Customer data landed

7. Implementing CI/CD pipeline

7.1. We use Azure DevOps to develop CI/CD, and more specifically we use a service called Azure Pipelines that is included and integrated with DevOps

7.1.1. Continuous Integration (CI) is the practice used by development teams to automate the merging and testing of code

7.1.2. Continuous Delivery (CD) is a process by which code is built, tested, and deployed to one or more test and production environments

7.2. Azure Pipelines

7.2.1. Variable groups

7.2.1.1. Use a variable group to store values that you want to control and make available across multiple pipelines

7.2.1.2. You can also use variable groups to store secrets and other values that might need to be passed into a YAML pipeline

7.2.1.3. Variable groups are defined and managed in the Library page under Pipelines

7.2.1.4. For our POC, we will create two variables groups, one for QA and one for Production

7.2.1.4.1. In the QA variable group we add a variable called Environment and set its value to qa

7.2.1.4.2. In the Production variable group we add a variable called Environment and set its value to prd
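
7.2.1.4.3. The two variable groups can also be created with the azure-devops CLI extension; the group names here are my own assumption since the POC doesn't prescribe them:

```bash
# One variable group per target environment, each holding an Environment variable
az pipelines variable-group create --name "adf-cicd-qa"  --variables Environment=qa
az pipelines variable-group create --name "adf-cicd-prd" --variables Environment=prd
```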

7.2.2. For data factory, we don't need a Build (CI) pipeline, but rather a Release (CD) pipeline

7.2.2.1. This is because for ADF the build process essentially happens behind the scenes when we click the Publish button and the build artifact is the ARM templates committed into the adf_publish branch of our repo

7.2.2.2. Step 1: create a new release pipeline

7.2.2.3. Step 2: start with an empty job

7.2.2.3.1. Note that there are a bunch of templates available for creating your new pipeline, but for our POC we're choosing an empty job

7.2.2.4. Step 3: create the QA stage and give the new pipeline an appropriate name

7.2.2.4.1. Every pipeline is composed of stages, where stages are a way of organising jobs

7.2.2.5. Step 4: open Repos | Files in another browser tab and navigate to the ARM templates parameters file in the adf_publish branch

7.2.2.6. Step 5: using copy and paste between browser tabs, we create a new pipeline parameter variable for factoryName

7.2.2.6.1. In the factoryName variable value, we substitute the "dev" reference with a reference to the Environment variable that we created earlier in our variable groups

7.2.2.7. Step 6: repeat step 5 for remaining parameters of the ARM template that are environment specific

7.2.2.7.1. In this case we have two parameters, both URLs, one referencing the storage account and the other referencing the key vault
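
7.2.2.7.2. To illustrate the idea, the override template parameters string ends up looking something like the snippet below; the exact parameter names are generated by ADF from your linked service names, so these are placeholders based on the AzureKeyVault and ADLSGen2 linked services created earlier:

```
-factoryName "argento-adf-datalake-$(Environment)"
-ADLSGen2_properties_typeProperties_url "https://argentoadls2datalake$(Environment).dfs.core.windows.net/"
-AzureKeyVault_properties_typeProperties_baseUrl "https://argento-kv-datalake-$(Environment).vault.azure.net/"
```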

7.2.2.8. Step 7: save release pipeline and add comment to summarise what we did

7.2.2.9. Step 8: add artifact to release pipeline and point it to the adf_publish branch in your repo

7.2.2.9.1. The default source for an artifact is Build, which means the output of a build pipeline in Azure Pipelines

7.2.2.9.2. The Source alias is supposed to be a unique identifier for the artifact

7.2.2.9.3. Save the changes and add a comment

7.2.2.10. Step 9: add ARM template deployment task to release pipeline

7.2.2.11. Step 10: configure the ARM template deployment task

7.2.2.11.1. Change the display name and pick an available Azure service connection

7.2.2.11.2. Pick the qa resource group and then manually replace the "qa" reference with the expression that substitutes the Environment variable: $(Environment)

7.2.2.11.3. Set the location, and set the template location to Linked artifact; this requires us to use the ellipsis (…) buttons to set the template, template parameters and override template parameters

7.2.2.11.4. Finally, we save the ARM template deployment task

7.2.2.12. Step 11: add PowerShell task to disable any existing data factory triggers on the deployment target

7.2.2.12.1. If the target data factory has any active triggers, this can cause the ARM template deployment task to fail

7.2.2.12.2. Using VS Code, open the ADF directory in your local repo and add the pre/post deployment script from Microsoft

7.2.2.12.3. Commit and push the changes to the Azure DevOps remote

7.2.2.12.4. Edit the release pipeline tasks again and add a new task for Azure PowerShell

7.2.2.12.5. Configure the Azure PowerShell task as a pre-deployment script

7.2.2.13. Step 12: clone the pre-deployment task to create the post-deployment task, and tweak its configuration

7.2.2.13.1. First, we clone the pre-deployment task

7.2.2.13.2. Next, we position and rename the cloned task as the post deployment task

7.2.2.13.3. Next, we edit the script arguments and change the values for post deployment

7.2.2.13.4. Finally, save the changes
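
7.2.2.13.5. For reference, Microsoft's sample pre/post deployment script is driven entirely by its script arguments; assuming the artifact alias and factory folder in the paths match your own setup, the two argument strings look roughly like this (pre-deployment stops triggers, post-deployment restarts them and cleans up deleted resources):

```
Pre-deployment task (Step 11):
  -armTemplate "$(System.DefaultWorkingDirectory)/<artifact alias>/<factory folder>/ARMTemplateForFactory.json" -ResourceGroupName "argento-rgp-datalake-$(Environment)" -DataFactoryName "argento-adf-datalake-$(Environment)" -predeployment $true -deleteDeployment $false

Post-deployment task (Step 12):
  -armTemplate "$(System.DefaultWorkingDirectory)/<artifact alias>/<factory folder>/ARMTemplateForFactory.json" -ResourceGroupName "argento-rgp-datalake-$(Environment)" -DataFactoryName "argento-adf-datalake-$(Environment)" -predeployment $false -deleteDeployment $true
```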

7.2.2.14. Step 13: clone the QA stage to create new Production stage and link to variable groups

7.2.2.14.1. Hover the mouse over the QA stage and click the clone button

7.2.2.14.2. Rename the cloned stage to Production

7.2.2.14.3. Switch to the Variables tab and, under Variable groups, click Link variable group

7.2.2.14.4. Once done, expand the linked variable groups to see the variables and their values

7.2.2.15. Step 14: enable CD trigger on the artifact

7.2.2.15.1. After clicking the trigger on the artifact, we enable it and add a branch filter for adf_publish

7.2.2.15.2. There is no save button here - just click the "x" in the top right corner after you're done

7.2.2.16. Step 15: change trigger on Production stage to manual only

7.2.2.16.1. After clicking the trigger on the Production stage, we change the trigger from After Stage to Manual Only

7.2.2.16.2. It is standard practice to add one or more authorised users as approvers for the Production stage - in this case I added myself

7.2.2.16.3. Finally, remember to click Save to ensure your changes become permanent

8. Data Factory development phase 2 (testing CI/CD pipeline)

8.1. At this point, our release pipeline in Azure DevOps should be fully configured and ready to go, so all we need to do is make some change in data factory and publish it

8.2. In Azure DevOps I create a new story for the change

8.2.1. In this example, it's item # 10

8.3. In ADF, create a new feature branch from develop and make your change

8.3.1. In this example, I add a simple Wait activity to a pipeline

8.3.2. Save the changes once done

8.3.2.1. This will commit the changes to your feature branch

8.3.2.2. At this point you should validate your data factory and test your changes before raising a PR to develop, but as this is such a trivial change and it's a POC, I'll skip this

8.4. In Azure DevOps, raise the pull request to merge your feature branch into the develop collaboration branch

8.4.1. Verify that the pull request completes successfully

8.5. In ADF, switch to the develop collaboration branch to confirm that the change is present and then click Publish

8.5.1. Again, normal practice would be to test the change in the develop collaboration branch before clicking Publish, but we can skip this for brevity

8.6. In Azure DevOps, we see that a new release has been automatically triggered and is running the QA stage

8.6.1. Verify that the QA stage release completes successfully

8.6.1.1. Note that the Production stage did not run because we set the trigger for this to be manual only

8.7. In ADF, open the QA data factory and verify that the changes are present, then trigger a test to ensure it works

8.7.1. Verify that the QA pipeline test for the deployed change was successful

8.7.1.1. In the real world, you would also make additional checks to verify that the expected outcome(s) of the pipeline tests were all met

8.7.1.1.1. In my case I simply verified that the curated container was updated with the expected modification timestamp

8.8. In Azure DevOps, manually trigger the Production deployment by clicking on the Production stage

8.8.1. Click the Deploy button

8.8.1.1. Add a comment and click Deploy to confirm the deployment

8.8.1.1.1. If you are one of the designated approvers, you will be able to click the Approve button and then confirm the approval

8.9. Finally, all being well, we should see that our Production release has succeeded

8.9.1. We can see our latest changes in the Production data factory