Azure Data Factory — Data Pipeline to move files from ADLS Gen2 to Azure SQL

Abhishek Bhagat
Better Data Platforms
9 min read · May 31, 2021


Azure Data Factory

Building a robust and scalable data pipeline is an essential part of any data engineering solution. It plays a crucial role in landing data from source to target. For business users to consume data, they must receive it seamlessly, and an automated system for delivering that data is always preferred. As data engineers, we need to ensure that our solutions meet these requirements.

In this write-up, I explain one such solution in detail. I have incorporated critical requirements such as robustness, scalability, and automation. The pipeline consumes data from Azure Data Lake Storage Gen2 and moves it to Azure SQL Managed Instance using Azure Data Factory. Once built, this pipeline can accommodate many additional requirements, such as adding tables, choosing between incremental and flush-and-fill loads, and controlling the number of years of data stored in the target, without changing the underlying code. Because the pipeline is fully automated within the Azure environment using in-house services, it is extremely robust. I also store the log reports of each run, which helps with audits as well as with debugging in case of an abnormal termination (abend). We can easily increase the number of tables consumed by simply adding entries to the metadata table, and we can activate or deactivate any table by changing the metadata parameters.

We need the Azure Data Factory service and a database to implement this solution, plus the source and target details. I chose ADLS Gen2 because it provides best-in-class storage performance, so fewer compute resources are needed to extract data for analysis. This means faster insights from data as well as reduced costs. ADLS Gen2 provides an elastic, scalable environment that can easily adapt to the ever-increasing volume of data that needs to be analyzed, and it offers multiple security features when creating a data lake in Azure. Similarly, I preferred Azure SQL Managed Instance because it delivers near-perfect compatibility with the latest SQL Server (Enterprise Edition) Database Engine used for on-premises databases. Managed instances are deployed into a native virtual network (VNet), which mimics on-premises deployments and increases security. Using managed instances, you can lift and shift on-premises SQL Servers and applications to Azure with minimal changes. This gives you cloud scalability and availability while eliminating the responsibility for on-premises infrastructure maintenance, updates, and backups.

Technical Design: Build a pipeline that performs ETL with ADLS Gen2 as the source and Azure SQL Managed Instance as the target. Below is the block diagram of the pipeline that caters to our requirements.

Block Diagram of Pipeline

Following are the details of the technical design:

1. Build the Azure services below and provide the necessary permissions:

Azure Data Lake Storage Gen2 (ADLS2)

Azure Data Factory

Azure SQL Managed Instance

2. Build a scalable pipeline to load files (Parquet) from Azure Data Lake Storage Gen2 to Azure SQL Managed Instance. The Parquet file format is preferable because the schema travels with the data when you move tables from Hive to Azure as files. The pipeline should be controlled via a metadata table built on Azure SQL Database.

3. Store logs in a log table built on Azure SQL Database. It contains crucial details such as pipeline run IDs, file paths, pipeline names, and error messages.
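To make points 2 and 3 concrete, below is a minimal sketch of what the metadata and log tables could look like. The schema, table names (dbo.adfmetadata_tbl, dbo.adf_log_tbl), and columns are illustrative assumptions on my part; the author's actual definitions are in the sample code linked near the end of this article.

```sql
-- Illustrative sketch only; the real schema ships with the sample code linked below.
CREATE TABLE dbo.adfmetadata_tbl (
    TableId        INT IDENTITY(1,1) PRIMARY KEY,
    SourceFolder   NVARCHAR(500)  NOT NULL,            -- ADLS Gen2 folder holding the Parquet files
    TargetTable    NVARCHAR(128)  NOT NULL,            -- main table in Azure SQL Managed Instance
    StagingTable   NVARCHAR(128)  NOT NULL,            -- e.g. <TargetTable>_Back_End
    LoadType       VARCHAR(20)    NOT NULL,            -- 'INCREMENTAL' or 'FLUSH_AND_FILL'
    RetentionYears INT            NULL,                -- how many years of data to keep in the target
    ActiveFlag     BIT            NOT NULL DEFAULT 1   -- flip to 0 to deactivate a table
);

CREATE TABLE dbo.adf_log_tbl (
    LogId          BIGINT IDENTITY(1,1) PRIMARY KEY,
    PipelineName   NVARCHAR(200)  NOT NULL,
    PipelineRunId  NVARCHAR(100)  NOT NULL,
    FilePath       NVARCHAR(1000) NULL,
    Status         VARCHAR(20)    NOT NULL,            -- 'SUCCEEDED' / 'FAILED'
    ErrorMessage   NVARCHAR(MAX)  NULL,
    LoggedAt       DATETIME2      NOT NULL DEFAULT SYSUTCDATETIME()
);
```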

Azure Data Lake Storage Gen2:

  1. Build an ADLS Gen2 service according to your requirements; in an enterprise, it will typically be pre-built and configured.
  2. I have created the storage account devadls2 (South Central US). In the case of an enterprise storage account, make sure your account has Contributor access. You can check this by going to Access Control (IAM) and looking for your ID in the search box.
  3. Also, ensure that firewalls and virtual networks are configured properly. I have attached a screenshot for reference.
To check the permission level of the account
Firewall settings

Azure Data Factory: Azure Data Factory is Azure’s cloud ETL service for scale-out serverless data integration and data transformation. It offers a code-free UI for intuitive authoring and single-pane-of-glass monitoring and management.

1. Create an Azure Data Factory service, devadf (South Central US). Make sure you create it in the same region as the ADLS Gen2 account.

2. In the case of an enterprise account, make sure your account has Contributor access. Also, ensure that the ADF instance devadf has a public endpoint created.

3. In the case of an enterprise account, check with the admin team to provide access.

4. Add Adls2-devadf account in Access Control and give contributor access.

5. Add ADF-devadf in Access Control (IAM) in ADLS Gen2 and give contributor access.

6. Add Azure SQL Managed Instance/Azure SQL in Access Control (IAM) in ADF and provide contributor access. I have attached a screenshot for reference.

Network access settings

Azure SQL Managed Instance:
In our requirement, the staging tables, stored procedures, and final tables reside in Azure SQL Managed Instance. We created the dev_test_MI database to hold these tables and procedures. The staging tables follow the naming convention tablename_Back_End, as sketched below.
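For example, for a hypothetical customer table, the staging and main pair in dev_test_MI could look like the sketch below. The table and column names are placeholders of mine; the real DDL comes from the sample code linked later.

```sql
-- Hypothetical 'customer' table pair; columns are placeholders.
CREATE TABLE dbo.customer_Back_End (    -- staging table loaded by the Copy activity
    customer_id   INT           NOT NULL,
    customer_name NVARCHAR(200) NULL,
    load_date     DATE          NULL
);

CREATE TABLE dbo.customer (             -- main table populated by the stored procedure
    customer_id   INT           NOT NULL,
    customer_name NVARCHAR(200) NULL,
    load_date     DATE          NULL
);
```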

Linked services: Linked services are much like connection strings; they define the connection information Data Factory needs to connect to external resources. Below are the steps to create a linked service.

To create a new ADLS linked service:
1. Go to the Datasets section, click the actions menu, and choose “New dataset”.
2. Choose Azure Data Lake Storage Gen2 and click Continue.
3. Choose the required file format. In this example I chose Parquet because, with Parquet files, we can move data directly from file to table; there is no need to define the schema of the files separately. Click Continue.
4. Give the dataset an appropriate name and, in the Linked service box, select +New.
5. A new window will pop up. Give the linked service a name, add a description, and add the ADLS Gen2 credentials in the respective boxes. Click Test Connection to check the credentials, then save the linked service and then the dataset.
6. I have attached a screenshot for reference.

Azure Data Lake Storage- Details

To create a new Azure SQL Managed Instance linked service:
1. Again create a new dataset, choose the Azure SQL Managed Instance option, and give the dataset a name.
2. As with ADLS Gen2, create a new linked service, provide the appropriate credentials, and test the connection. Once the connection is established, save the linked service and then the dataset.
3. I have attached the screenshot for reference.

Azure SQL Managed Instance Link service

Now we have created the linked services and datasets needed for our requirement.

To fulfill our requirement we need four activities in ADF. Below is a brief introduction to each of them.

Lookup activity: The Lookup activity can retrieve a dataset from any of the Azure Data Factory-supported data sources. It reads and returns the content of a configuration file or table. In our requirement, it reads the metadata table (dev_test_db.adfmetadata_tbl) residing in the dev_test_db database; a sample query is sketched below.

ADLS2 Link service details.
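In practice, the Lookup activity is usually pointed at a query rather than the whole table. A minimal sketch, assuming the illustrative metadata columns from the earlier sketch (ActiveFlag and the other column names are my assumption, not necessarily the author's schema):

```sql
-- Query a Lookup activity might run; only active entries drive the pipeline.
SELECT SourceFolder, TargetTable, StagingTable, LoadType, RetentionYears
FROM dbo.adfmetadata_tbl
WHERE ActiveFlag = 1;
```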

Copy Activity: In Azure Data Factory, you can use the Copy activity to copy data among data stores located on-premises and in the cloud. We will use this activity to dynamically read the files from ADLS Gen2 and move data to staging tables in Azure SQL Managed Instance.

ADLS Gen2 Linked service details.
Copy Activity-Sink
Copy Activity-Mapping
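The Sink tab of the Copy activity (shown above) also offers a pre-copy script option. A minimal sketch, assuming the flush-and-fill staging pattern and the placeholder customer_Back_End table from earlier, would simply truncate the staging table before each copy:

```sql
-- Possible pre-copy script on the Copy activity sink (assumes flush-and-fill staging).
TRUNCATE TABLE dbo.customer_Back_End;
```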

Stored Procedure Activity: The Stored Procedure activity is one of the transformation activities that Data Factory supports. We will run stored procedures to load data from the staging tables to the main tables in Azure SQL Managed Instance.

Stored Procedure Activity Details
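The author's actual procedure is in Staging_to_Main_Load.sql, linked in the Sample Code section below. As a rough idea of what such a procedure can do, here is a simplified flush-and-fill sketch using the placeholder customer tables from earlier; the procedure name and logic are illustrative only.

```sql
-- Simplified flush-and-fill sketch; the real procedure is in Staging_to_Main_Load.sql below.
CREATE OR ALTER PROCEDURE dbo.usp_load_customer
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;
        TRUNCATE TABLE dbo.customer;              -- flush the main table
        INSERT INTO dbo.customer (customer_id, customer_name, load_date)
        SELECT customer_id, customer_name, load_date
        FROM dbo.customer_Back_End;               -- fill from staging
    COMMIT TRANSACTION;
END;
```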

Delete Activity: We can use the Delete activity in Azure Data Factory to delete files or folders from on-premises or cloud storage stores. Use this activity to clean up or archive files when they are no longer needed. Here, we delete the files from the ADLS Gen2 folder after the load to the final table completes, so that the next run processes a fresh set of data. If the files were not deleted, the pipeline would load duplicate data into the final table.

Delete Activity Details

Steps to build pipeline:

  1. Create a metadata table along with stored procedures in Azure SQL to control the pipeline. I have attached the sample code.
  2. Create the staging and main tables along with the stored procedure in Azure SQL Managed Instance. I have attached the sample code.
  3. Move to the pipeline tab and click on the actions menu. Choose “New pipeline” from the given options.
  4. Assign an appropriate name and description to the pipeline by clicking on Properties at the far right.
  5. Drag and drop all the required activities shown in the block diagram above and build the connections between them.
  6. Add the dynamic values and parameters to each activity.
  7. Use the appropriate linked services created above in each activity.
  8. Validate the pipeline using the “Validate all” option at the top left of the window.
  9. Publish the pipeline using the “Publish all” option at the top left of the window.
Complete Data pipeline

10. If any error is received, debug it by checking the error message.
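When a run does fail, the log table described earlier is a good first stop. Below is a minimal debugging query, assuming the illustrative dbo.adf_log_tbl schema sketched at the start of the design (not necessarily the author's exact table):

```sql
-- Pull the latest failed runs from the (assumed) log table for debugging.
SELECT TOP (20) PipelineName, PipelineRunId, FilePath, ErrorMessage, LoggedAt
FROM dbo.adf_log_tbl
WHERE Status = 'FAILED'
ORDER BY LoggedAt DESC;
```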

Sample Code:

https://gist.githubusercontent.com/abhishek31/4499b2e3e54ed99d81dbc142e6adfe4a/raw/13a9ba2fd806ea3a847fdfde5fc50b7c76b996d9/metadata.sql

https://gist.githubusercontent.com/abhishek31/1aefd53b55e8f87b4b61438f04ef2870/raw/739efbe97b393eaa01cb732b6b45e345c7d137b8/Staging_to_Main_Load.sql

Triggers:

Triggers are another option for executing your pipeline. A trigger represents a unit of processing that determines when a pipeline execution needs to be kicked off. Currently, Data Factory supports three types of triggers:

  1. Schedule Trigger: A trigger that invokes a pipeline on a wall-clock schedule.
  2. Tumbling Window trigger: A trigger that operates on a periodic interval, while also retaining state.
  3. Event-based trigger: A trigger that responds to an event.

To create a new trigger:

Select the Triggers option and then select “New trigger”.

Trigger

Add a name and description, choose the type of trigger, and fill in the required settings. Click Continue to move to the next section.

Trigger details

Add values for the parameter fields and click Continue to save the trigger.
Click on Publish all. Once publishing completes successfully, the trigger will be activated.

Trigger details


Conclusion

Overall, I’ve actually quite enjoyed using Azure Data Factory. It was a great experience to learn and build a pipeline. You can connect with me if you need any help with your pipeline.
