Executing Azure Databricks Notebook in Azure Data Factory Pipeline Using Access Tokens

A guide to adding and executing an Azure Databricks notebook in an Azure Data Factory pipeline, with the Access Token kept safe in Azure Key Vault



Photo by ZSun Fu on Unsplash

Azure Data Factory is a great tool for creating and orchestrating ETL and ELT pipelines. Data Factory's power lies in seamlessly integrating a vast range of data sources with various compute and storage components.

This article looks at how to add a Notebook activity to an Azure Data Factory pipeline to perform data transformations. We will execute a PySpark notebook on an Azure Databricks cluster from a Data Factory pipeline, while safeguarding the Access Token as a secret in Azure Key Vault.


Prerequisites

  1. An active Microsoft Azure subscription
  2. Azure Key Vault
  3. Azure Data Factory pipeline
  4. Azure Databricks Workspace with a notebook

If you don’t have the prerequisites set up yet, refer to our previous articles to get started.

A Data Factory pipeline connects to a component using connection strings known as linked services. A linked service tells the pipeline about a resource and how to access it. We need to create two linked services, one referencing our Databricks workspace and one referencing our key vault. The key vault will hold the authentication token for our Databricks workspace. Let’s start by generating an Access Token in Databricks and storing it in our key vault.

Generate a Databricks Access Token

Sign in to the Azure Portal, open your Databricks instance, and launch the workspace. Click the user icon in the top-right corner and select User Settings. On the User Settings page, click Generate New Token. In the pop-up that appears, enter an identifiable name as the Comment, set a Lifetime for the token, and click Generate. The pop-up will now show your Access Token; make sure to copy it to a notepad, as you won’t be able to retrieve it later, and we need this token to access our Databricks notebook. Follow the steps as shown.

Azure Databricks: Generate an Access Token (Image by author)
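If you later need to rotate or automate this token, the Databricks Token API can create one programmatically. Below is a minimal sketch, assuming you already have some credential for the workspace (for example, an existing personal access token or an Azure AD token); the workspace URL, comment, and lifetime values are placeholder assumptions.

```python
import requests

# Placeholders for illustration: replace with your workspace URL and an
# existing credential (e.g., an Azure AD token or another personal access token).
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
EXISTING_AUTH_TOKEN = "<existing-aad-or-pat-token>"

# The Token API (2.0) mints a new personal access token with a comment
# and a lifetime in seconds (90 days here).
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/token/create",
    headers={"Authorization": f"Bearer {EXISTING_AUTH_TOKEN}"},
    json={"comment": "adf-pipeline", "lifetime_seconds": 90 * 24 * 3600},
)
resp.raise_for_status()

# The token value is returned only once; store it immediately (e.g., in Key Vault).
access_token = resp.json()["token_value"]
```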

Store the Databricks Access Token in Azure Key Vault

Go to the Azure portal home and open your key vault. Click Secrets to add a new secret, then select + Generate/Import. On the Create a secret page, give it a Name, enter your Databricks access token as the Value, add a Content type for easier readability, and set an expiration date of 365 days. Click Create; your vault should now hold your Databricks Access Token as a secret.

Azure Key Vault: Create a secret (Image by author)
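The same secret can also be created from a script with the azure-keyvault-secrets package. A minimal sketch, assuming a vault at https://my-keyvault.vault.azure.net, a secret named databricks-access-token, and a signed-in identity that is allowed to set secrets; all of these names are assumptions.

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Placeholders for illustration: vault URL and secret name.
VAULT_URL = "https://my-keyvault.vault.azure.net"

client = SecretClient(vault_url=VAULT_URL, credential=DefaultAzureCredential())

# Store the Databricks access token with a content type and a 365-day expiry,
# mirroring the portal steps above.
client.set_secret(
    "databricks-access-token",
    "<databricks-access-token>",  # the token value copied from the Databricks UI
    content_type="Databricks PAT",
    expires_on=datetime.now(timezone.utc) + timedelta(days=365),
)
```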

Grant Azure Data Factory rights to read secrets from the Azure Key Vault

We need to give our Data Factory rights to read keys and secrets from our key vault. Click Access policies and select + Add Access Policy. On the Add access policy screen, select Get under Secret permissions. Go to Select principal, search for your Data Factory’s name on the Principal blade, select it from the matched results, and proceed as shown.

Azure Key Vault: Add access policy (Image by author)
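If you prefer scripting, the azure-mgmt-keyvault management SDK can add the same access policy. The sketch below is only an outline: the subscription, resource group, vault, tenant, and the Data Factory managed identity’s object ID are placeholders you would need to fill in, and model names may differ slightly between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.keyvault import KeyVaultManagementClient
from azure.mgmt.keyvault.models import (
    AccessPolicyEntry,
    Permissions,
    VaultAccessPolicyParameters,
    VaultAccessPolicyProperties,
)

# Placeholders for illustration.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "my-resource-group"
VAULT_NAME = "my-keyvault"
TENANT_ID = "<tenant-id>"
ADF_OBJECT_ID = "<data-factory-managed-identity-object-id>"

client = KeyVaultManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Add an access policy that lets the Data Factory identity read (Get) secrets.
client.vaults.update_access_policy(
    RESOURCE_GROUP,
    VAULT_NAME,
    "add",
    VaultAccessPolicyParameters(
        properties=VaultAccessPolicyProperties(
            access_policies=[
                AccessPolicyEntry(
                    tenant_id=TENANT_ID,
                    object_id=ADF_OBJECT_ID,
                    permissions=Permissions(secrets=["get"]),
                )
            ]
        )
    ),
)
```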

Create linked services in Data Factory

Go to the Azure portal home, then locate and open your Data Factory. Select Author & Monitor on the Overview page to load the Data Factory authoring experience in a new browser tab. Switch to that tab and select Manage from the left-hand menu.

We will create a linked service to our Azure Key Vault first. Follow the steps as shown.

Azure Data Factory: Create a linked service (Image by author)
Azure Data Factory: Create an Azure Key Vault linked service (Image by author)
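For reference, the same Key Vault linked service can also be defined with the azure-mgmt-datafactory SDK rather than the authoring UI. A minimal sketch, assuming placeholder subscription, resource group, factory, and vault names:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureKeyVaultLinkedService,
    LinkedServiceResource,
)

# Placeholders for illustration.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# A Key Vault linked service only needs the vault's base URL; the Data Factory's
# managed identity (granted Get on secrets above) is used to read the secrets.
adf_client.linked_services.create_or_update(
    "my-resource-group",
    "my-data-factory",
    "AzureKeyVaultLinkedService",
    LinkedServiceResource(
        properties=AzureKeyVaultLinkedService(
            base_url="https://my-keyvault.vault.azure.net"
        )
    ),
)
```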

Repeat the creation process to create a linked service to our Databricks instance. Follow the steps as shown. Our Key Vault linked service will be available for selection under AKV linked service; make sure to enter the correct Secret Name referring to the Databricks token stored in the key vault.

Azure Data Factory: Create an Azure Databricks linked service (Image by author)
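Under the hood, the Databricks linked service pulls its access token from the Key Vault linked service. A sketch of an equivalent definition through the same SDK; the workspace URL, cluster ID, linked service names, and secret name are assumptions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureDatabricksLinkedService,
    AzureKeyVaultSecretReference,
    LinkedServiceReference,
    LinkedServiceResource,
)

# Placeholders for illustration: all names, the workspace URL, and the cluster ID
# must match your environment.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

adf_client.linked_services.create_or_update(
    "my-resource-group",
    "my-data-factory",
    "AzureDatabricksLinkedService",
    LinkedServiceResource(
        properties=AzureDatabricksLinkedService(
            domain="https://adb-1234567890123456.7.azuredatabricks.net",
            # Resolve the access token from the Key Vault linked service at runtime.
            access_token=AzureKeyVaultSecretReference(
                store=LinkedServiceReference(
                    type="LinkedServiceReference",
                    reference_name="AzureKeyVaultLinkedService",
                ),
                secret_name="databricks-access-token",
            ),
            existing_cluster_id="<existing-cluster-id>",
        )
    ),
)
```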

Add Notebook activity to the pipeline

Select Author on the left-hand menu and locate your pipeline on the Factory Resources blade. Find the Notebook activity under the Databricks category on the Activities blade and drag and drop it onto the canvas. Connect it to the success end of the previous activity. Give the activity a name, switch to the Azure Databricks tab, and select the Databricks linked service we just created. Then switch to the Settings tab, browse, and choose your notebook. This notebook will be invoked and run automatically every time our pipeline executes.

Data Factory pipeline: Add Notebook activity (Image by author)
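The Notebook activity itself can likewise be expressed in code. A sketch using the same SDK, where the pipeline, activity, notebook path, and previous activity name are all assumptions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ActivityDependency,
    DatabricksNotebookActivity,
    LinkedServiceReference,
    PipelineResource,
)

# Placeholders for illustration; the linked service name matches the one above.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

notebook_activity = DatabricksNotebookActivity(
    name="TransformWithDatabricks",
    notebook_path="/Users/<your-user>/<your-notebook>",
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference",
        reference_name="AzureDatabricksLinkedService",
    ),
    # Run only after the preceding activity succeeds, as in the UI walkthrough.
    depends_on=[
        ActivityDependency(
            activity="<previous-activity-name>", dependency_conditions=["Succeeded"]
        )
    ],
)

# Note: create_or_update replaces the pipeline definition, so in practice the
# activities list should contain every activity in the pipeline, not just this one.
adf_client.pipelines.create_or_update(
    "my-resource-group",
    "my-data-factory",
    "my-pipeline",
    PipelineResource(activities=[notebook_activity]),
)
```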

Our pipeline is now ready to execute our Databricks notebook. You can publish the changes or go for a debug run to see the output logs.
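If you’d rather start a run outside the UI, the same SDK can trigger one. A minimal sketch with placeholder names:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholders for illustration.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Start an on-demand run and keep the run ID for monitoring the activity output.
run = adf_client.pipelines.create_run(
    "my-resource-group", "my-data-factory", "my-pipeline"
)
print(f"Started pipeline run: {run.run_id}")
```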

Conclusion

We learned how to add a Databricks Notebook to a Data Factory pipeline. We also learned how to create and secure a Databricks Access Token in Azure Key Vault.
