Using Airflow to Manage Talend ETL Jobs
Learn how to schedule and execute Talend jobs with Airflow, an open-source platform that programmatically orchestrates workflows as directed acyclic graphs of tasks.
Airflow, an open-source platform, is used to orchestrate workflows as directed acyclic graphs (DAGs) of tasks in a programmatic manner. The Airflow scheduler is used to schedule workflows and data processing pipelines. The Airflow user interface allows easy visualization of pipelines running in the production environment, monitoring of workflow progress, and troubleshooting of issues when needed. Rich command-line utilities make it possible to perform complex surgeries on DAGs.
In this blog, let's discuss scheduling and executing Talend jobs with Airflow.
Prerequisites
- Airflow 1.7 or above
- Python 2.7
- Talend Open Studio (Big Data or Data Integration)
Use Case
Schedule and execute Talend ETL jobs with Airflow.
Synopsis
- Author Talend jobs
- Schedule Talend jobs
- Monitor workflows in Web UI
Job Description
Talend ETL jobs are created by the following steps (a rough SQL equivalent of the join-and-load step is sketched after this list):
- Joining application_id from applicant_loan_info and loan_info, as shown in the below diagram.
- Loading the matched data into the loan_application_analysis table.
- Applying a filter on the LoanDecisionType field in the loan_application_analysis table to segregate values as Approved, Denied, and Withdrawn.
- Applying another filter on the above-segregated values to segregate LoanType as Personal, Auto, Credit, and Home.
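As a rough illustration only (the actual logic lives in the Talend job, and every column other than application_id is an assumption made for this sketch), the join-and-load step is comparable to:

```sql
-- Hypothetical SQL equivalent of the Talend join-and-load step;
-- column placement beyond application_id is assumed for illustration.
INSERT INTO loan_application_analysis (application_id, LoanDecisionType, LoanType)
SELECT a.application_id, l.LoanDecisionType, l.LoanType
FROM applicant_loan_info a
JOIN loan_info l ON l.application_id = a.application_id;
```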
The created Talend job is built and moved to the server location. A DAG named Loan_Application_Analysis.py is created with the corresponding path of the scripts to execute the flow as and when required.
Creating DAG Folder and Restarting Airflow Webserver
After installing Airflow, perform the following:
- Create a DAG folder (/home/ubuntu/airflow/dags) in the Airflow path.
- Move all the .py files into the DAG folder.
- Restart the Airflow webserver using the below commands to view this DAG in the UI list:
```
# Go to the AIRFLOW_HOME path, e.g. /home/ubuntu/airflow
cd $AIRFLOW_HOME
# Restart the webserver
airflow webserver
# Restart the scheduler
airflow scheduler
```
After restarting the webserver, all the .py files (DAGs) in the folder are detected and loaded into the web UI's DAG list.
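To confirm that the DAGs were picked up, you can also list them from the command line using the Airflow 1.x CLI:

```
airflow list_dags
```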
Scheduling Jobs
The created Talend jobs can be scheduled using the Airflow scheduler. For the complete code, see the References section; a minimal sketch of such a DAG is shown below.
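The following is a sketch only, assuming the built Talend job was unzipped on the server and exposes the launcher shell script that Talend generates at build time; the path, owner, schedule, and task names are illustrative, not values from the original article.

```python
# Loan_Application_Analysis.py -- minimal sketch of a DAG that runs a
# Talend job's launcher script via the BashOperator. The script path,
# owner, and schedule below are illustrative assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2017, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    dag_id='Loan_Application_Analysis',
    default_args=default_args,
    schedule_interval='@daily',  # adjust to the desired schedule
)

# Run the shell launcher that Talend generates when the job is built.
# The trailing space after the .sh path keeps Jinja from treating it
# as a template file to render.
run_talend_job = BashOperator(
    task_id='run_loan_application_analysis',
    bash_command='sh /home/ubuntu/talend/loan_application_analysis/loan_application_analysis_run.sh ',
    dag=dag,
)
```

Once a file like this is placed in the DAG folder, the scheduler picks it up and triggers runs according to the schedule_interval.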
Note: The job can be manually triggered by clicking the Run button under the Links column as shown below:
Both the auto-scheduled and manually triggered jobs can be viewed in the UI as follows:
Monitoring Jobs
On executing a job, its upstream and downstream tasks are started as defined in the DAG (a minimal sketch of such dependency wiring follows). Upon clicking a particular DAG, the corresponding status of each task, such as success, failure, retry, queued, and so on, can be visualized in different ways in the UI.
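A minimal sketch of wiring upstream and downstream tasks, using placeholder DummyOperator tasks with illustrative names:

```python
# Minimal dependency-wiring sketch with three placeholder tasks.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('dependency_demo', start_date=datetime(2017, 1, 1), schedule_interval=None)

extract = DummyOperator(task_id='extract', dag=dag)
transform = DummyOperator(task_id='transform', dag=dag)
load = DummyOperator(task_id='load', dag=dag)

# transform is downstream of extract; load is downstream of transform.
extract.set_downstream(transform)
transform.set_downstream(load)
```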
Graph View
The statuses of the jobs are represented in a graphical format as shown below:
Tree View
The statuses of the jobs along with execution dates of the jobs are represented in a tree format as shown below:
Gantt View
The statuses of the jobs, along with their execution dates, are represented in a Gantt chart format as shown below:
Viewing Task Duration
Upon clicking the Task Duration tab, you can view the task duration of the whole process or of individual DAGs in a graphical format, as shown below:
Viewing Task Instances
By clicking Browse > Task Instances, you can view the instances on which the tasks are running, as shown below:
Viewing Jobs
By clicking Browse > Jobs, you can view details such as start time, end time, and executors of the jobs, as shown in the below diagram:
Viewing Logs
By clicking Browse > View Log, you can view the details of the logs, as shown in the below diagram:
Data Profiling
Airflow provides a simple SQL query interface to query the data and a chart UI to visualize the tasks.
To profile your data, click Admin > Connections to select the database connection type, as shown in the below diagram:
Ad Hoc Query
To write and query the data, click Data Profiling > Ad Hoc Query.
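For example, a query such as the following could be run against the table the Talend job loads (the field names come from earlier in this article; the aggregation itself is illustrative):

```sql
-- Count applications per decision in the loan_application_analysis table.
SELECT LoanDecisionType, COUNT(*) AS applications
FROM loan_application_analysis
GROUP BY LoanDecisionType;
```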
Charts
Different types of visualizations can be created for task duration and task status using charts.
To generate charts such as bar, line, area, and so on for a particular DAG using a SQL query, click Data Profiling > Charts > DAG_id, as shown in the below diagram:
All the DAGs are graphically represented, as shown in the below diagram:
Email Notification
Email notifications such as email_on_failure and email_on_retry can be set to report job status.
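A minimal sketch of these flags in a DAG's default_args (the recipient address is a placeholder):

```python
# Task-level email flags, typically passed via a DAG's default_args.
default_args = {
    'owner': 'airflow',
    'email': ['alerts@example.com'],  # placeholder recipient
    'email_on_failure': True,         # mail when a task fails
    'email_on_retry': True,           # mail when a task is retried
}
```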
To enable the notification, perform the following:
- Configure the SMTP settings in the airflow.cfg file in the airflow_home path (a sample is sketched after this list).
- In your Gmail settings, set Allow less secure apps to ON to receive email alerts from Airflow.
Note: You may get an authentication_error if the email settings are not properly configured. To overcome this issue, confirm the login attempt in Gmail's device review by choosing Yes, That Was Me.
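A sketch of the relevant airflow.cfg entries, assuming Gmail's SMTP server; the user, password, and sender address are placeholders:

```
[email]
email_backend = airflow.utils.email.send_email_smtp

[smtp]
smtp_host = smtp.gmail.com
smtp_starttls = True
smtp_ssl = False
smtp_user = your_account@gmail.com
smtp_password = your_password
smtp_port = 587
smtp_mail_from = your_account@gmail.com
```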
A job failure email is shown below:
Upon clicking the Link in the email, you will be redirected to the Logs page.
Conclusion
In this blog, we discussed authoring, scheduling, and monitoring workflows from the web UI, as well as triggering the Talend jobs directly from the web UI on demand using the BashOperator. You can also transfer data from one database to another using the GenericTransfer operator, sketched below.
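A minimal sketch of such a transfer task; the connection IDs, table, and SQL are placeholders, and the import path assumes Airflow 1.x:

```python
# Sketch of moving rows between two configured connections with
# GenericTransfer (connection IDs, table, and SQL are illustrative).
from datetime import datetime

from airflow import DAG
from airflow.operators.generic_transfer import GenericTransfer

dag = DAG('transfer_demo', start_date=datetime(2017, 1, 1), schedule_interval=None)

transfer = GenericTransfer(
    task_id='copy_loan_analysis',
    source_conn_id='source_db',             # defined under Admin > Connections
    destination_conn_id='destination_db',
    sql='SELECT * FROM loan_application_analysis',
    destination_table='loan_application_analysis',
    dag=dag,
)
```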
References
Published at DZone with permission of Rathnadevi Manivannan.