Azure Data Factory

Azure Data Factory - Microsoft Learn - Usage, Components

From: Introduction to Azure Data Factory - Azure Data Factory | Microsoft Learn

  • Problem: How to organize and manage processes to turn data into business intelligence
  • Solution: a service to extract-transform-load (ETL), extract-load-transform (ELT), and integrate data. Data is integrated on a regular schedule with data stores and analytics services for reporting and business intelligence.

Azure Data Factory provides the service for this solution: data workflows (pipelines), data movement, and transformation at scale.

See ADF visual guide:

  • Problem: raw, unorganized data in different formats and sizes -> data transformation -> analytics -> actionable insights
  • ADF: data integration service that manages data movement and transformation from different data sources to analytical resources
  • Connect and collect
    • Collect data from on-premises and cloud sources, structured, semi-structured, and unstructured, arriving at different times
    • Centralize the data in one location and process it
  • Transform and enrich
    • Process or transform data using data flows
    • Execute on
      • Apache Spark, without needing to understand or program Spark
      • Compute like Hadoop, Spark, Data lake analytics, machine learning
  • CI/CD and Publish
    • CI/CD of data pipelines supported by Azure DevOps and GitHub
    • Load data into data stores such as databases/warehouses (Azure SQL, data warehouse, Cosmos DB)
  • Monitor
    • Service/pipeline monitoring, alerting and managed service
    • Dashboard
  • Pipelines
    • Group of activities to do work on data
    • Arguments can be passed to it
    • Pipeline run: single pipeline execution
  • Activities
    • Processing step in a pipeline
    • Example might be a Hive activity: Hive query on an Azure HDInsight cluster to transform and analyze data
    • Types: movement, transformation, control
  • Datasets
    • Data structures within data stores that point to the data used by activities as inputs or outputs
  • Linked services
    • Connection information for external resources like an Azure storage account or database
  • Data Flows
    • Transformation logic on data
    • Can be built into a library of data transformations
  • Integration Runtimes
    • Compute used to execute activities and linked services; the runtime is where an activity runs or is dispatched from
    • May be close to data store or target compute service to meet performance, security, and compliance needs
  • Triggers
    • Unit of processing determining when a pipeline starts
  • Parameters
    • Key-value pairs of a configuration
    • Used by activities
    • For example, a dataset and a linked service are both strongly typed parameters that can be reused and referenced (see the SDK sketch after this list)
  • Control flow
    • Management of pipeline activities that includes: chaining in sequence, branching, setting parameters in pipeline, and passing arguments to start pipelines (on demand or from trigger).
    • Includes state passing and looping containers, that is, for-each iterators
  • Variables
    • Temporary values in pipelines
    • Can be used with parameters to pass values
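
Putting the components together, a minimal sketch with the Python SDK (azure-identity and azure-mgmt-datafactory) might look like the following. It creates a linked service, datasets, and a pipeline with a single copy activity, then starts and monitors a pipeline run. All resource names, the connection string, and the blob paths are placeholders, and exact model signatures can vary between SDK versions.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        AzureStorageLinkedService, LinkedServiceResource, LinkedServiceReference,
        AzureBlobDataset, DatasetResource, DatasetReference,
        CopyActivity, BlobSource, BlobSink, PipelineResource, SecureString,
    )

    SUB, RG, DF = "<subscription-id>", "<resource-group>", "<data-factory-name>"
    adf = DataFactoryManagementClient(DefaultAzureCredential(), SUB)

    # Linked service: connection information for an Azure Storage account
    ls = LinkedServiceResource(properties=AzureStorageLinkedService(
        connection_string=SecureString(value="<storage-connection-string>")))
    adf.linked_services.create_or_update(RG, DF, "StorageLS", ls)

    # Datasets: data structures pointing at data inside the linked store
    ls_ref = LinkedServiceReference(type="LinkedServiceReference", reference_name="StorageLS")
    adf.datasets.create_or_update(RG, DF, "InputDS", DatasetResource(
        properties=AzureBlobDataset(linked_service_name=ls_ref,
                                    folder_path="container/input", file_name="data.csv")))
    adf.datasets.create_or_update(RG, DF, "OutputDS", DatasetResource(
        properties=AzureBlobDataset(linked_service_name=ls_ref,
                                    folder_path="container/output")))

    # Activity and pipeline: a copy (movement) activity grouped into a pipeline
    copy = CopyActivity(
        name="CopyBlobToBlob",
        inputs=[DatasetReference(type="DatasetReference", reference_name="InputDS")],
        outputs=[DatasetReference(type="DatasetReference", reference_name="OutputDS")],
        source=BlobSource(), sink=BlobSink())
    adf.pipelines.create_or_update(RG, DF, "CopyPipeline",
                                   PipelineResource(activities=[copy], parameters={}))

    # Pipeline run: a single execution of the pipeline, then check its status
    run = adf.pipelines.create_run(RG, DF, "CopyPipeline", parameters={})
    print(adf.pipeline_runs.get(RG, DF, run.run_id).status)

In practice the same definitions are more often authored in ADF Studio or as JSON and deployed through the CI/CD integration mentioned above.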

From: Integration runtime - Azure Data Factory & Azure Synapse | Microsoft Learn

  • Integration runtime (IR) is compute infrastructure used by ADF and Azure Synapse pipelines to do data integration across different networks for:
    • Data flows
    • Moving data, such as copy activities and format conversions
    • Activity dispatch - call and monitor transformation activities with compute like Databricks, Azure SQL database
    • SQL Server Integration Services (SSIS) package execution
  • The integration runtime provides the bridge between activities and linked services; it is created in the ADF or Azure Synapse UI and is referenced by the activities, datasets, and data flows that use it.
  • Azure integration runtime can:
    • Run data flows in Azure
    • Copy activities
    • Dispatch transform activities in a public network to other resources
  • Self-hosted (see the creation sketch after this list)
    • Run copy activities between cloud data stores and data stores on private networks
    • Dispatch transform activities in an on-premises or Azure virtual network
      • Useful when you need to bring your own drivers for data stores such as MySQL
    • Requires a Java Runtime Environment (JRE) on the IR machine
    • The self-hosted IR only makes outbound HTTP-based connections to the internet
  • Azure-SSIS: a managed cluster of Azure VMs for lift-and-shift execution of SQL Server Integration Services (SSIS) packages
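
As a rough sketch of how a self-hosted IR is provisioned with the Python SDK (all names are placeholders): the IR is first created as a logical resource in the data factory, and its authentication keys are then used to register the on-premises node through the IR installer.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        IntegrationRuntimeResource, SelfHostedIntegrationRuntime,
    )

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    # Create the self-hosted IR as a logical resource inside the data factory
    ir = IntegrationRuntimeResource(properties=SelfHostedIntegrationRuntime(
        description="IR for on-premises data stores"))
    adf.integration_runtimes.create_or_update(
        "<resource-group>", "<data-factory-name>", "MySelfHostedIR", ir)

    # Fetch the auth keys; one of them is entered into the IR installer on the
    # on-premises machine to register that node with this logical IR
    keys = adf.integration_runtimes.list_auth_keys(
        "<resource-group>", "<data-factory-name>", "MySelfHostedIR")
    print(keys.auth_key1)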

For choosing a type for your use case and the specific features of each type, see:

Create and Configure a Self-Hosted Integration Runtime (SHIR)

From: Create a self-hosted integration runtime - Azure Data Factory & Azure Synapse | Microsoft Learn

  • A single IR can access multiple data sources and can be shared with another data factory in the same Azure tenant (see the sharing sketch after this list)

  • The IR can be installed anywhere; the recommendation is to place it close to the data source(s) and on a separate machine to prevent resource competition.

  • Data sources can be used by multiple IRs

  • IRs can enable usage of cloud and on-premises resources as if they were on the same network.

  • IR tasks may fail if FIPS-compliant encryption is enabled on the IR machine; if so, disable it.
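
The sharing mentioned in the first bullet can be sketched roughly like this with the Python SDK; all names are placeholders, LinkedIntegrationRuntimeRbacAuthorization is one of the SDK's linked-IR authorization models, and this assumes the host data factory has already granted the consumer factory access to the shared IR.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        IntegrationRuntimeResource, SelfHostedIntegrationRuntime,
        LinkedIntegrationRuntimeRbacAuthorization,
    )

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    # Resource ID of the self-hosted IR owned by the "host" data factory
    shared_ir_id = (
        "/subscriptions/<subscription-id>/resourceGroups/<host-rg>"
        "/providers/Microsoft.DataFactory/factories/<host-adf>"
        "/integrationRuntimes/MySelfHostedIR"
    )

    # In the consumer data factory, create a linked IR that reuses the shared IR
    linked_ir = IntegrationRuntimeResource(properties=SelfHostedIntegrationRuntime(
        linked_info=LinkedIntegrationRuntimeRbacAuthorization(resource_id=shared_ir_id)))
    adf.integration_runtimes.create_or_update(
        "<consumer-rg>", "<consumer-adf>", "LinkedSelfHostedIR", linked_ir)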

    @startuml
    title Data flow steps for copying with a self-hosted integration runtime
    cloud Azure {
    [Azure Data Factory]
    [Azure PowerShell or Portal]
    [Cloud storage]
    }
    rectangle "On-premises" {
    [Integration Runtime (self-hosted),2 ]
    [On-premises storage]
    }
    [Integration Runtime (self-hosted),2 ] --> [On-premises storage] : Read-write requests, 1
    [Integration Runtime (self-hosted),2 ] --> [Azure Data Factory] : Control Channel, 3
    [Integration Runtime (self-hosted),2 ] --> [Cloud storage] : Read-write requests, 4
    [Azure Data Factory] <--> [Azure PowerShell or Portal] : 1
    legend right
    1. A person creates the self-hosted IR in ADF using the Azure portal or PowerShell, creates a linked service for the on-premises data store, and specifies the IR
    2. The self-hosted IR holds encrypted credentials, which can be stored locally. A proxy, if present, is configured during IR registration.
    3. ADF pipelines communicate with the IR to schedule and manage jobs; the IR polls a queue for work.
    4. The IR copies data between data stores; copy direction and source/destination are set in the pipelines
    endlegend
    @enduml
    Copying data from on-premises to cloud storage
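
Step 1 in the legend, creating a linked service for the on-premises data store that specifies the self-hosted IR, might look roughly like the following with the Python SDK; the SQL Server connection details and all names are placeholders.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        LinkedServiceResource, SqlServerLinkedService, SecureString,
        IntegrationRuntimeReference,
    )

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    # Linked service for an on-premises SQL Server; connect_via routes traffic
    # through the self-hosted IR instead of the default Azure IR
    ls = LinkedServiceResource(properties=SqlServerLinkedService(
        connection_string=SecureString(
            value="Server=<onprem-sql>;Database=<db>;Integrated Security=False"),
        user_name="<sql-user>",
        password=SecureString(value="<sql-password>"),
        connect_via=IntegrationRuntimeReference(
            type="IntegrationRuntimeReference", reference_name="MySelfHostedIR")))
    adf.linked_services.create_or_update(
        "<resource-group>", "<data-factory-name>", "OnPremSqlLS", ls)
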
  1. Ports and Firewalls

    The IR uses HTTPS port 443 for the various outbound connections described at: https://learn.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime?tabs=data-factory#ports-and-firewalls
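
As a hypothetical smoke test (not part of the linked doc), outbound reachability on port 443 from the SHIR host can be probed with a few lines of Python; the hostnames below are only examples, and the authoritative endpoint list is in the page linked above.

    import socket

    # Example endpoints the self-hosted IR typically needs over outbound 443;
    # replace with the concrete domains listed in the ports-and-firewalls doc
    ENDPOINTS = [
        "download.microsoft.com",               # IR setup/auto-update downloads
        "<region>.frontend.clouddatahub.net",   # placeholder service endpoint
    ]

    def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
        """Return True if a TCP connection to host:port succeeds within timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for host in ENDPOINTS:
        print(f"{host}:443 reachable -> {can_reach(host)}")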