Best Practices for Building Data Pipelines
Following best practices when building data pipelines significantly improves data quality and reduces the risk of pipeline breakage.
In my previous article, 'Data Validation to Improve Data Quality', I shared the importance of data quality and a checklist of validation rules to achieve it. Those validation rules alone may not guarantee the best data quality. In this article, we focus on the best practices to employ while building data pipelines to ensure data quality.
1. Idempotency
A data pipeline should be built so that running it multiple times does not duplicate data. Likewise, when a failure occurs and the pipeline is rerun after the issue is resolved, there should be no data loss or improper alterations. Most pipelines are automated and run on a fixed schedule. By capturing the logs of previous successful runs, such as the parameters passed (date range), the count of records inserted/modified/deleted, the timespan of the run, etc., the parameters for the next run can be set relative to the previous successful run. For example, if a pipeline runs every hour and a failure happens at 2 pm, the next run should automatically capture the data from 1 pm onward, and the timeframe should not be incremented until the current run succeeds.
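Here is a minimal Python sketch of this idea, assuming a hypothetical in-memory run log (in production this would be a metadata table or the orchestrator's run history) and an upsert-based load:

```python
from datetime import datetime, timedelta

# Hypothetical run log; in production, each successful run's parameters
# and row counts would be persisted to a metadata table.
run_log = [
    {"window_end": datetime(2024, 1, 1, 13, 0), "status": "success", "rows": 120},
]

def next_window(now: datetime, interval: timedelta) -> tuple[datetime, datetime]:
    """Derive the next window from the last *successful* run, so a rerun
    after a failure picks up exactly where the last good run left off."""
    last_success = max(
        (r["window_end"] for r in run_log if r["status"] == "success"),
        default=now - interval,
    )
    return last_success, now

def run_pipeline(start: datetime, end: datetime) -> None:
    # Idempotent load: an upsert (merge) keyed on a business key means
    # re-processing the same window does not create duplicates.
    print(f"Upserting records for {start} .. {end}")
    run_log.append({"window_end": end, "status": "success", "rows": 0})

start, end = next_window(datetime(2024, 1, 1, 15, 0), timedelta(hours=1))
run_pipeline(start, end)  # covers 13:00-15:00 because the 14:00 run failed
```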
2. Consistency
In cases where data flows from an upstream to a downstream database, if a pipeline run completed successfully but did not add, modify, or delete any records, the next run should cover a larger time frame that includes the previous run's window to avoid data loss. This helps maintain consistency between the source and target databases when data lands in the source with some delay. Continuing the example above, if the pipeline ran successfully at 2 pm but did not add, modify, or delete any records, the next run at 3 pm should fetch the data from 1 pm to 3 pm instead of 2 pm to 3 pm.
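A small sketch of this window-widening logic, again assuming a hypothetical run log that records each run's window and row count:

```python
from datetime import datetime

# Hypothetical run log; the 2 pm run succeeded but saw no data,
# possibly because the source data arrived late.
run_log = [
    {"window_start": datetime(2024, 1, 1, 13, 0),
     "window_end": datetime(2024, 1, 1, 14, 0),
     "rows": 0},
]

def next_window(now: datetime) -> tuple[datetime, datetime]:
    """Widen the window to re-cover any previous run that wrote zero rows,
    so late-arriving source data is still picked up."""
    start = run_log[-1]["window_end"]
    if run_log[-1]["rows"] == 0:
        start = run_log[-1]["window_start"]  # re-read the empty window too
    return start, now

start, end = next_window(datetime(2024, 1, 1, 15, 0))
print(start, end)  # 13:00 .. 15:00 instead of 14:00 .. 15:00
```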
3. Concurrency
If a data pipeline is scheduled to run frequently on a short interval and the previous run takes longer than usual to finish, the next scheduled run might be triggered while it is still in progress. This can cause performance bottlenecks and inconsistent data. To prevent concurrent runs, the pipeline should check whether the previous run is still in progress and either raise an exception or exit gracefully if a parallel run is detected. If there are dependencies between pipelines, they can be managed using Directed Acyclic Graphs (DAGs).
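One simple way to enforce this is a lock check at startup. The sketch below uses a file-based lock with a hypothetical path; orchestrators typically provide equivalent controls (for example, limiting active runs per pipeline), but the idea is the same:

```python
import sys
from pathlib import Path

LOCK_FILE = Path("/tmp/my_pipeline.lock")  # hypothetical lock location

def acquire_lock() -> bool:
    try:
        # Creating with exist_ok=False fails if the file already exists,
        # i.e. another run is still in progress.
        LOCK_FILE.touch(exist_ok=False)
        return True
    except FileExistsError:
        return False

def release_lock() -> None:
    LOCK_FILE.unlink(missing_ok=True)

if not acquire_lock():
    print("Previous run still in progress; exiting gracefully.")
    sys.exit(0)
try:
    pass  # run the pipeline steps here
finally:
    release_lock()  # always release, even if the run fails
```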
4. Schema Evolution
As source systems continue to evolve with changing requirements or software/hardware updates, the schema is subject to change, which might cause the pipeline to write data with inconsistent data types or to add or modify fields unexpectedly. To avoid pipeline breaks or data loss, it is a good strategy to compare the source and target schemas and, if there is a mismatch, add logic to handle it. Another option is to adopt a schema-on-read approach instead of schema-on-write. Modern tools like Upsolver SQLake allow the pipeline to adapt dynamically to schema evolution.
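A minimal sketch of a schema comparison step, assuming the schemas are available as simple column-to-type mappings (in practice they would come from the source system's metadata and the warehouse information schema; the table and column names are illustrative):

```python
# Hypothetical schemas expressed as column -> type mappings.
source_schema = {"id": "bigint", "email": "varchar", "signup_ts": "timestamp",
                 "referrer": "varchar"}        # new column added upstream
target_schema = {"id": "bigint", "email": "varchar", "signup_ts": "timestamp"}

def diff_schemas(source: dict, target: dict) -> dict:
    return {
        "added": sorted(set(source) - set(target)),
        "removed": sorted(set(target) - set(source)),
        "type_changed": sorted(c for c in set(source) & set(target)
                               if source[c] != target[c]),
    }

diff = diff_schemas(source_schema, target_schema)
for column in diff["added"]:
    # One possible policy: evolve the target table rather than failing the load.
    print(f"ALTER TABLE users ADD COLUMN {column} {source_schema[column]};")
if diff["removed"] or diff["type_changed"]:
    raise ValueError(f"Schema mismatch needs manual review: {diff}")
```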
5. Logging and Performance Monitoring
If there are hundreds or thousands of data pipelines, it is not feasible to monitor every single pipeline every day. Using tools to log and monitor performance metrics in real time, and setting up alerts and notifications, helps surface issues early and resolve them on time. This also helps address issues such as abnormally high or low data volumes, latency, throughput, resource consumption, performance degradation, and error rates, all of which eventually impact data quality.
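A small sketch of run-level metric logging with alert thresholds; the thresholds and the stand-in load function are hypothetical, and in production the metrics would be shipped to a monitoring system with its own alerting rules:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

# Hypothetical thresholds for volume and runtime alerts.
MIN_EXPECTED_ROWS, MAX_EXPECTED_ROWS = 1_000, 1_000_000
MAX_RUNTIME_SECONDS = 900

def run_with_metrics(extract_and_load) -> None:
    started = time.monotonic()
    rows = extract_and_load()
    elapsed = time.monotonic() - started
    log.info("run finished: rows=%d runtime=%.1fs", rows, elapsed)
    if not MIN_EXPECTED_ROWS <= rows <= MAX_EXPECTED_ROWS:
        log.warning("abnormal data volume: %d rows", rows)   # would trigger an alert
    if elapsed > MAX_RUNTIME_SECONDS:
        log.warning("latency breach: %.1fs", elapsed)        # would trigger an alert

run_with_metrics(lambda: 1_250)  # stand-in for the real extract/load step
```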
6. Timeout and Retry Mechanism
If the pipeline makes API calls or sends and receives requests over the network, it can run into issues such as slow or dropped connections, packet loss, etc. Adding a timeout for each request and a retry mechanism with sensible limits prevents the pipeline from hanging indefinitely.
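A sketch of a bounded request with exponential backoff, assuming the requests library is available (the URL is a placeholder, and any HTTP client with a timeout option works the same way):

```python
import time
import requests

def fetch_with_retry(url: str, max_attempts: int = 3,
                     timeout_seconds: float = 10.0) -> requests.Response:
    """Bound each request with a timeout and retry with exponential backoff
    so a slow or flaky endpoint cannot hang the pipeline indefinitely."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, timeout=timeout_seconds)
            response.raise_for_status()
            return response
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == max_attempts:
                raise  # give up and let the pipeline's error handling take over
            time.sleep(2 ** attempt)  # back off: 2s, 4s, ...

# response = fetch_with_retry("https://api.example.com/orders")  # hypothetical endpoint
```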
7. Validation
Validation plays a key role in measuring data quality. It verifies whether the data meets predefined rules and standards. Incorporating validation rules at each stage of the pipeline, such as extraction, transformation, and loading, ensures integrity, reliability, and consistency, and enhances data quality.
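A minimal sketch of rule-based validation over simple dict records; the field names and rules are illustrative, and real pipelines often use frameworks such as Great Expectations or dbt tests for the same purpose:

```python
# Hypothetical rule set: each field maps to a predicate it must satisfy.
rules = {
    "id": lambda v: v is not None,
    "email": lambda v: isinstance(v, str) and "@" in v,
    "amount": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def validate(record: dict) -> list[str]:
    """Return the names of fields that break a rule; an empty list means valid."""
    return [field for field, rule in rules.items()
            if not rule(record.get(field))]

records = [
    {"id": 1, "email": "a@example.com", "amount": 10.5},
    {"id": 2, "email": "not-an-email", "amount": -3},
]
valid = [r for r in records if not validate(r)]
rejected = [(r, validate(r)) for r in records if validate(r)]
print(f"{len(valid)} valid, {len(rejected)} rejected: {rejected}")
```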
8. Error Handling and Testing
Error handling means anticipating the exceptions, potential failure scenarios, and edge cases that could cause the pipeline to break, and handling them in the pipeline to avoid breakage. Another important phase of building a data pipeline is testing. A series of tests such as unit tests, integration tests, load tests, etc., can be performed to ensure all blocks of the pipeline work as expected and to give an idea of the data volume limits.
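As a sketch, a unit test for a single transformation step using the standard unittest module; the to_cents function and its edge cases are hypothetical stand-ins for whatever transformations your pipeline performs:

```python
import unittest

def to_cents(amount: str) -> int:
    """Hypothetical transformation: parse a currency string into integer cents,
    handling edge cases (whitespace, currency symbol, empty input)."""
    cleaned = amount.strip().lstrip("$")
    if not cleaned:
        raise ValueError(f"empty amount: {amount!r}")
    return round(float(cleaned) * 100)

class ToCentsTest(unittest.TestCase):
    def test_plain_value(self):
        self.assertEqual(to_cents("12.34"), 1234)

    def test_symbol_and_whitespace(self):
        self.assertEqual(to_cents("  $0.99 "), 99)

    def test_empty_string_raises(self):
        with self.assertRaises(ValueError):
            to_cents("   ")

if __name__ == "__main__":
    unittest.main()
```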
Data pipelines, whether batch or streaming, can be built using different coding languages and tools, and there is a vast set of tools offering different capabilities. It is a good idea to analyze the complete requirements of your use case, along with the functionalities and limitations each tool offers, and choose the right platform based on your needs. Regardless of the choice, the best practices above can come in handy when building, monitoring, and maintaining data pipelines.