Published on: 8 April 2022
Author: Ramesh Kanjinghat
When you run pipelines in Azure Data Factory to move data between different locations, you need Linked Services. Linked Services are how Data Factory talks to different data sources such as SQL Server, Azure Data Lake Storage (ADLS), etc. These Linked Services need an Integration Runtime (IR) to perform compute operations.
Learn more about linked services at https://docs.microsoft.com/en-us/azure/data-factory/concepts-linked-services?tabs=data-factory.
Learn more about IR at https://docs.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime.
While moving data between data sources that reside in Azure, we can use an Azure hosted IR, which is very cost-effective. With Azure hosted IRs, Azure spins up fresh computing infrastructure every time an activity in a pipeline needs compute and shuts it down right after that activity is completed.
Isn't this awesome? Yes and no, there is a catch.
With the cheapest option available, which is 4 cores (+4 Driver cores) and the General Purpose compute type, it takes more than 3 minutes just to spin up the computing infrastructure. Most of the activities in my pipeline take less than a minute to run, but combined with the spin-up time each one takes more than 4 minutes.
If you are not worried about time, or these pipelines run only once every few hours, or you don't have many activities in the pipeline that need compute power, or you don't want to spend more money, then yes, this is awesome.
If you want to save as much time as possible, there is a way to keep the computing infrastructure from shutting down, so that the next activity in the pipeline (and, as a matter of fact, activities in other pipelines or the next run of the same pipeline) can use the same computing infrastructure. This eliminates the subsequent spin-up times.
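To put rough numbers on this, here is a back-of-the-envelope sketch in Python. The 3-minute spin-up and 1-minute run time come from the figures quoted above; the function name and the assumption that activities run one after another are mine, for illustration only.

```python
# Rough timing model. The constants are illustrative assumptions taken from
# the approximate figures in the text, not measurements from Azure.
SPIN_UP_MINUTES = 3.0  # approximate cold-start time of the compute
RUN_MINUTES = 1.0      # upper bound for a typical activity in my pipeline


def total_minutes(activity_count: int, reuse_compute: bool) -> float:
    """Estimate wall-clock minutes for activities that run sequentially."""
    if activity_count <= 0:
        return 0.0
    # Without reuse every activity pays the spin-up; with reuse only the first one does.
    cold_starts = 1 if reuse_compute else activity_count
    return cold_starts * SPIN_UP_MINUTES + activity_count * RUN_MINUTES


print(total_minutes(5, reuse_compute=False))  # 20.0
print(total_minutes(5, reuse_compute=True))   # 8.0
```

With five such activities, keeping the compute alive would save roughly 12 minutes per run under these assumptions.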
Let's see how.
We will create a Linked Service to connect to ADLS Gen2, first with the Azure resolved IR and then with a Long Living IR (I made up this name).
First, the Azure resolved IR
Assuming you are on the home page of your Data Factory instance:
- Go to Manage > Linked services.
- Hit + New
- Select ADLS Gen2 as Data source and hit Continue.
- Change Name if you want.
- Notice that I have selected AutoResolveIntegrationRuntime for Connect via integration runtime.
- Now select the Authentication type you want and provide relevant configuration values.
- If you want, hit Test connection to test your connection.
- Hit Create.
- Now hit the Publish all button at the top of the screen.
Done. We have created a Linked Service, AzureDataLakeStorage1, to connect to ADLS Gen2. Every time an activity that uses this Linked Service runs, Azure spins up computing infrastructure.
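For reference, Data Factory stores a linked service created through the UI as a JSON definition. The sketch below builds a similar definition as a Python dictionary. The storage account URL is a placeholder, the authentication properties are left out, and the helper name is hypothetical, so compare it with the JSON that Data Factory shows for your own linked service before relying on it.

```python
import json


def build_adls_linked_service(name: str, account_url: str, ir_name: str) -> dict:
    """Return an ADLS Gen2 linked service definition as a plain dict (hypothetical helper)."""
    return {
        "name": name,
        "properties": {
            "type": "AzureBlobFS",  # linked service type used for ADLS Gen2
            "typeProperties": {
                "url": account_url,  # authentication settings omitted in this sketch
            },
            # connectVia is the part that selects the integration runtime.
            "connectVia": {
                "referenceName": ir_name,
                "type": "IntegrationRuntimeReference",
            },
        },
    }


linked_service = build_adls_linked_service(
    name="AzureDataLakeStorage1",
    account_url="https://<your-storage-account>.dfs.core.windows.net",  # placeholder
    ir_name="AutoResolveIntegrationRuntime",
)
print(json.dumps(linked_service, indent=2))
```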
If you want to test this, you can create a simple pipeline that copies a file from one container to another in an ADLS instance and run it twice. You will notice that there is no significant difference between the time the first and second runs take.
Here is the documentation on how to copy files between two storage accounts: https://docs.microsoft.com/en-us/azure/data-factory/load-azure-data-lake-storage-gen2-from-gen1
Now Long Living IR
Now we will use an IR that spins up computing infrastructure that can be reused by other activities.
First create an Azure hosted Integration Runtime.
- Go to Manage > Integration runtimes.
- Hit + New.
- Select Azure, Self-Hosted and hit Continue.
- Select Azure and hit Continue.
- On the Settings tab
  - Change Name.
  - Select the Region that works best for you.
- Go to the Data flow runtime tab
  - Select the Compute type and Core count that work best for you.
  - Select Time to live.
- Hit Create.
Time to live is the option that keeps the computing infrastructure from shutting itself down. The longer the time, the higher the cost. Similarly, the better the Compute type and the higher the Core count, the higher the cost.
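The settings chosen above end up in the JSON definition of the integration runtime. Here is a minimal sketch of what that definition might look like, written as a Python dictionary. The helper name and the example values (General compute, 8 cores, 10 minutes) are my own assumptions, so check the JSON of your own IR for the exact values.

```python
import json


def build_azure_ir(name: str, compute_type: str, core_count: int, ttl_minutes: int) -> dict:
    """Return an Azure IR definition with data flow settings (hypothetical helper)."""
    if ttl_minutes < 0:
        raise ValueError("ttl_minutes must not be negative")
    return {
        "name": name,
        "properties": {
            "type": "Managed",  # Azure hosted IR
            "typeProperties": {
                "computeProperties": {
                    "location": "AutoResolve",  # or a specific region
                    "dataFlowProperties": {
                        "computeType": compute_type,
                        "coreCount": core_count,
                        # Minutes the compute stays alive while idle.
                        "timeToLive": ttl_minutes,
                    },
                }
            },
        },
    }


ir_definition = build_azure_ir("integrationRuntime1", "General", 8, 10)
print(json.dumps(ir_definition, indent=2))
```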
Configure the Linked Service to use our Long Living IR.
- Go to Manage > Linked services.
- Click on the AzureDataLakeStorage1 link.
- In the Edit linked service window, select the integration runtime we have just created, which is integrationRuntime1.
- Hit Save.
- Now hit the Publish all button to publish the changes to the linked service.
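In terms of the linked service's JSON definition, the change made in these steps amounts to pointing connectVia at a different integration runtime. The sketch below shows that idea on a trimmed-down definition; the helper name is hypothetical and the dictionary leaves out everything except the fields needed for the illustration.

```python
import copy


def use_integration_runtime(linked_service: dict, ir_name: str) -> dict:
    """Return a copy of the definition that connects via the given IR (hypothetical helper)."""
    updated = copy.deepcopy(linked_service)  # leave the original definition untouched
    updated["properties"]["connectVia"] = {
        "referenceName": ir_name,
        "type": "IntegrationRuntimeReference",
    }
    return updated


# Trimmed-down definition, for illustration only.
original = {
    "name": "AzureDataLakeStorage1",
    "properties": {
        "type": "AzureBlobFS",
        "connectVia": {
            "referenceName": "AutoResolveIntegrationRuntime",
            "type": "IntegrationRuntimeReference",
        },
    },
}

updated = use_integration_runtime(original, "integrationRuntime1")
print(updated["properties"]["connectVia"]["referenceName"])   # integrationRuntime1
print(original["properties"]["connectVia"]["referenceName"])  # AutoResolveIntegrationRuntime
```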
Now run the test pipeline we created twice. The second run will take much less time than the first, because on the first run Azure spins up computing infrastructure, while the second run reuses the infrastructure that was spun up in the previous run.
Make sure that you run the pipeline the second time within the time frame you have selected for Time to live. For example, if you have selected 10 minutes, you should run the pipeline within 10 minutes of the first run; otherwise Azure will shut down the compute infrastructure.
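The Time to live rule can be written down as a small check. This is only a sketch of the reasoning, with a hypothetical function name, and it assumes the idle clock starts when the previous run finishes; the timestamps are made up.

```python
from datetime import datetime, timedelta


def within_time_to_live(previous_run_end: datetime,
                        next_run_start: datetime,
                        ttl_minutes: int) -> bool:
    """Return True if the next run starts while the compute should still be alive."""
    idle = next_run_start - previous_run_end
    return timedelta(0) <= idle <= timedelta(minutes=ttl_minutes)


first_run_end = datetime(2022, 4, 8, 9, 0)  # made-up timestamp

# 7 minutes idle with a 10-minute Time to live: the compute can be reused.
print(within_time_to_live(first_run_end, datetime(2022, 4, 8, 9, 7), ttl_minutes=10))   # True

# 12 minutes idle: the compute has been shut down and must spin up again.
print(within_time_to_live(first_run_end, datetime(2022, 4, 8, 9, 12), ttl_minutes=10))  # False
```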