Essential Guidelines for Building Optimized ETL Data Pipelines in the Cloud With Azure Data Factory
Discover how to optimize Azure Data Factory pipelines by selecting the right activities and managing resources for reliable data processing.
When building ETL data pipelines using Azure Data Factory (ADF) to process large amounts of data from different sources, you may often run into performance and design-related challenges. This article serves as a guide to building high-performance ETL pipelines that are both efficient and scalable.
Below are the major guidelines to consider when building optimized ETL data pipelines in ADF:
Designing Pipelines
Activity Type Selection
ADF provides many types of activities to orchestrate data; based on the business requirement, choose the ones that best fit your scenario. For example, when copying data from an on-premises Oracle source to an on-premises SQL Server sink that has many tables, consider combining Lookup, ForEach, and Copy activities to design the pipeline optimally.
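As a rough illustration, the sketch below shows how such a pipeline might be laid out in ADF's JSON authoring format. The names (CopyOracleToSqlServer, LookupTableList, OracleGenericDataset, and so on) are placeholders, and the dataset references on the inner Copy activity are omitted for brevity.

```json
{
  "name": "CopyOracleToSqlServer",
  "properties": {
    "activities": [
      {
        "name": "LookupTableList",
        "type": "Lookup",
        "typeProperties": {
          "source": {
            "type": "OracleSource",
            "oracleReaderQuery": "SELECT table_name FROM user_tables"
          },
          "dataset": { "referenceName": "OracleGenericDataset", "type": "DatasetReference" },
          "firstRowOnly": false
        }
      },
      {
        "name": "ForEachTable",
        "type": "ForEach",
        "dependsOn": [ { "activity": "LookupTableList", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": {
          "items": { "value": "@activity('LookupTableList').output.value", "type": "Expression" },
          "isSequential": false,
          "batchCount": 10,
          "activities": [
            {
              "name": "CopyTable",
              "type": "Copy",
              "typeProperties": {
                "source": { "type": "OracleSource" },
                "sink": { "type": "SqlServerSink" }
              }
            }
          ]
        }
      }
    ]
  }
}
```

The Lookup returns the list of tables, and the ForEach fans out parallel Copy runs over that list instead of requiring one hard-coded pipeline per table.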
Linked Services and Datasets
A linked service is a connection to a data source that can be created once and reused across multiple pipelines within the same ADF. It is efficient to create one linked service per source for easy maintenance.
Similarly, datasets are derived from linked services to fetch data from the source. Ideally, create a single dataset per linked service and reuse it across all pipelines in the same ADF, adding separate datasets only when the data format from the source varies.
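One common way to achieve this reuse is a parameterized dataset. The sketch below assumes a SQL Server linked service named SqlServerLS and passes the schema and table names in at runtime, so a single dataset can serve every table.

```json
{
  "name": "SqlServerGenericTable",
  "properties": {
    "type": "SqlServerTable",
    "linkedServiceName": { "referenceName": "SqlServerLS", "type": "LinkedServiceReference" },
    "parameters": {
      "schemaName": { "type": "string" },
      "tableName": { "type": "string" }
    },
    "typeProperties": {
      "schema": { "value": "@dataset().schemaName", "type": "Expression" },
      "table": { "value": "@dataset().tableName", "type": "Expression" }
    }
  }
}
```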
Resiliency and Reliability
When you build pipelines with numerous activities, troubleshooting becomes complicated when a failure occurs. To achieve reliability, split complex logic into multiple pipelines and chain them together with dependencies. Additionally, intermittent network issues can cause pipeline failures; to avoid manual intervention, enable a retry policy at the activity level so failed activities rerun automatically.
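For example, a retry policy can be attached to an individual activity through its policy block. The values below (3 retries, a 60-second interval, a 2-hour timeout) are illustrative only and should be tuned to your workload.

```json
{
  "name": "CopyTable",
  "type": "Copy",
  "policy": {
    "timeout": "0.02:00:00",
    "retry": 3,
    "retryIntervalInSeconds": 60
  },
  "typeProperties": {
    "source": { "type": "OracleSource" },
    "sink": { "type": "SqlServerSink" }
  }
}
```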
Scheduling Using Triggers
In a big data environment, processing large amounts of data often requires developing and scheduling multiple pipelines to run in batches. To optimize resource utilization, stagger pipeline schedules where possible. When business requirements demand processing within a strict SLA, scale up resources on the Integration Runtime (IR) and on the source/sink to utilize and manage resources effectively.
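A schedule trigger sketch is shown below, assuming a pipeline named CopyOracleToSqlServer; the 1:30 AM UTC slot is an arbitrary example of staggering one batch away from other nightly loads.

```json
{
  "name": "NightlyLoadTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-01-01T00:00:00Z",
        "timeZone": "UTC",
        "schedule": { "hours": [ 1 ], "minutes": [ 30 ] }
      }
    },
    "pipelines": [
      { "pipelineReference": { "referenceName": "CopyOracleToSqlServer", "type": "PipelineReference" } }
    ]
  }
}
```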
Expressions and Variables
You can leverage expressions and variables in pipelines to implement dynamic content that adapts to environmental changes without manual intervention and to diversify workflows. This can optimize pipeline and resource usage across activities. For example, expressions can be used in If Condition and Switch activities to direct execution based on conditions, while variables can capture values from activities such as Lookup and pass them to subsequent activities such as ForEach.
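As a hedged illustration, the fragment below sets a variable from a Lookup activity's output and branches on it with an If Condition. The activity names and the rowCount variable (assumed to be declared on the pipeline as a String) are placeholders, and the true branch holds only a placeholder Wait activity.

```json
[
  {
    "name": "SetRowCount",
    "type": "SetVariable",
    "typeProperties": {
      "variableName": "rowCount",
      "value": { "value": "@string(activity('LookupTableList').output.count)", "type": "Expression" }
    }
  },
  {
    "name": "BranchOnRowCount",
    "type": "IfCondition",
    "dependsOn": [ { "activity": "SetRowCount", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
      "expression": { "value": "@greater(int(variables('rowCount')), 0)", "type": "Expression" },
      "ifTrueActivities": [
        { "name": "ProceedWithLoad", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 1 } }
      ]
    }
  }
]
```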
Performance Optimization Techniques
Using Azure Integration Runtime
Azure Integration Runtime (IR) provides a compute environment for your activities that is managed by ADF. Transferring large amounts of data requires higher computing power, and this compute is adjustable via Data Integration Units (DIU) in the Copy activity settings. Increasing DIUs can significantly reduce activity runtime.
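For example, DIUs (and the degree of parallel copies) can be set on the Copy activity's type properties; the values below are illustrative and should be tuned against the observed throughput.

```json
{
  "name": "CopyLargeDataset",
  "type": "Copy",
  "typeProperties": {
    "source": { "type": "AzureSqlSource" },
    "sink": { "type": "AzureSqlSink" },
    "dataIntegrationUnits": 32,
    "parallelCopies": 8
  }
}
```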
Using Self-Hosted Integration Runtime (SHIR)
SHIR lets you use your own compute resources for activity execution and is mostly used to connect to on-premises sources. It supports a maximum of four nodes. For large workloads, consider using high-configuration machines across all nodes (ideally the same size and location) to avoid performance bottlenecks such as slow throughput and high queue times.
Network Throughput
When transferring huge amounts of data through pipelines, high network throughput is necessary to handle the volume. For instance, when connecting on-premises data sources to cloud destinations, consider using a dedicated network connection to achieve efficient, high-throughput data transfer.
Parallel Reads and Writes
Copying large datasets without enabling parallel reads and writes can delay execution. Choose the right partition column option in the source or sink settings to enable parallel reads and writes.
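Below is a sketch of dynamic-range partitioning on a SQL Server source, assuming a numeric column named order_id; the bounds, batch size, and parallel copy count are placeholders to tune for your data volume.

```json
{
  "name": "CopyPartitioned",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "SqlServerSource",
      "partitionOption": "DynamicRange",
      "partitionSettings": {
        "partitionColumnName": "order_id",
        "partitionLowerBound": "1",
        "partitionUpperBound": "1000000"
      }
    },
    "sink": { "type": "AzureSqlSink", "writeBatchSize": 10000 },
    "parallelCopies": 8
  }
}
```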
Advanced Techniques
Data Transformations Using External Compute
When transforming data that is stored externally, it is efficient to use that system's compute for the transformations. For instance, triggering a Databricks Notebook activity leverages the compute of the Databricks cluster to perform the processing, which reduces data movement and optimizes performance.
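A Databricks Notebook activity might look like the sketch below, assuming a Databricks linked service named AzureDatabricksLS and a notebook at /Shared/transform_orders; both names are placeholders.

```json
{
  "name": "TransformInDatabricks",
  "type": "DatabricksNotebook",
  "linkedServiceName": { "referenceName": "AzureDatabricksLS", "type": "LinkedServiceReference" },
  "typeProperties": {
    "notebookPath": "/Shared/transform_orders",
    "baseParameters": {
      "run_date": "@formatDateTime(utcnow(), 'yyyy-MM-dd')"
    }
  }
}
```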
Cost Optimization
Pipeline costs correlate strongly with the number of activities and transformations and with the runtime duration of each activity, and the majority of the cost typically comes from Azure IR utilization. Applying the optimization techniques discussed in this guide is essential to keeping costs low. To reduce costs further when running data flow activities in parallel, enable a time-to-live (TTL) setting on the Azure IR so the compute can be reused between runs.
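As an illustration, a time-to-live can be set in the data flow properties of a managed Azure IR; the core count and 10-minute TTL below are assumptions to adjust for your workload.

```json
{
  "name": "AzureIR-DataFlows",
  "properties": {
    "type": "Managed",
    "typeProperties": {
      "computeProperties": {
        "location": "AutoResolve",
        "dataFlowProperties": {
          "computeType": "General",
          "coreCount": 8,
          "timeToLive": 10
        }
      }
    }
  }
}
```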
Security Considerations
During pipeline execution, data movement typically occurs over public networks. To enhance security, you can enable ADF managed private endpoints to ensure that connections to your sources are private and secure.
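A hedged sketch of a managed private endpoint definition targeting a storage account's blob sub-resource is shown below; the resource IDs are placeholders, and this applies only when the factory's Azure IR is enabled for a managed virtual network.

```json
{
  "name": "mpe-blob-storage",
  "properties": {
    "privateLinkResourceId": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>",
    "groupId": "blob"
  }
}
```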
Operational Best Practices
Data Lineage
Maintaining an audit trail for each record as data moves through pipelines is an important best practice. Integrating ADF with Microsoft Purview lets you track data lineage and document the origin of the data in a data catalog for future reference.
Fault Tolerance
During pipeline execution, data compatibility issues can arise and cause activities to fail. To handle them, implement fault tolerance by logging incompatible records so they can be corrected and reprocessed later while compatible data continues to load. Enable this at the activity level by redirecting incompatible records to a storage account.
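For example, fault tolerance can be configured on a Copy activity by skipping incompatible rows and redirecting them to a storage account; the linked service name and path below are assumptions.

```json
{
  "name": "CopyWithFaultTolerance",
  "type": "Copy",
  "typeProperties": {
    "source": { "type": "SqlServerSource" },
    "sink": { "type": "AzureSqlSink" },
    "enableSkipIncompatibleRow": true,
    "redirectIncompatibleRowSettings": {
      "linkedServiceName": { "referenceName": "ErrorLogStorageLS", "type": "LinkedServiceReference" },
      "path": "errorlogs/incompatiblerows"
    }
  }
}
```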
Backup and Restore Pipelines
To safeguard your ADF pipelines, integrate ADF with GitHub or Azure DevOps. This ensures that pipelines can be restored from source control if the ADF resource is accidentally deleted.
Note that resources configured as sources or sinks in the pipelines must be restored separately: source control can only restore pipeline configuration, not the actual resources that were deleted. For instance, if you use Azure Blob Storage as a source, it needs to be restored separately if it was deleted.
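As a rough sketch, Git integration is configured at the factory level through a repoConfiguration block; the Azure DevOps organization, project, and repository values below are placeholders.

```json
{
  "repoConfiguration": {
    "type": "FactoryVSTSConfiguration",
    "accountName": "<devops-organization>",
    "projectName": "<project>",
    "repositoryName": "<repository>",
    "collaborationBranch": "main",
    "rootFolder": "/"
  }
}
```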
Monitoring and Alerts
Setting up monitoring for pipeline resources is critical for proactive detection and response before issues escalate. Monitor pipeline runs and integration runtime utilization, and set up alerts to notify you via email or phone.
Conclusion
Many components of a data pipeline need to be considered and planned before development. These best practices guide you toward the right design for each component so that your pipelines are high-performing, reliable, cost-effective, and scalable enough to handle your growing business needs.