Seven Steps To Deploy Kedro Pipelines on Amazon EMR
In this post, the author explains how to launch an Amazon EMR cluster and how to deploy a Kedro project to run a Spark job.
Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform for applications built using open-source big data frameworks, such as Apache Spark, that process and analyze vast amounts of data with AWS.
1. Set up the Amazon EMR Cluster
One way to install Python libraries onto Amazon EMR is to package a virtual environment and deploy it. For this to work, the virtual environment must be built on the same Amazon Linux 2 environment that Amazon EMR uses. We used the following Dockerfile to package our dependencies on an Amazon Linux 2 base:
FROM --platform=linux/amd64 amazonlinux:2 AS base

# Install Python and create a virtual environment
RUN yum install -y python3
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# Install the project dependencies plus venv-pack, which packages the virtual environment
COPY requirements.txt /tmp/requirements.txt
RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install venv-pack==0.2.0 && \
    python3 -m pip install -r /tmp/requirements.txt

# Pack the virtual environment into a tarball
RUN mkdir /output && venv-pack -o /output/pyspark_deps.tar.gz

# Export only the tarball from the build
FROM scratch AS export
COPY --from=base /output/pyspark_deps.tar.gz /
Note: the Docker BuildKit backend is required to build this Dockerfile, so make sure it is available and enabled.
Build the image using the following command:
DOCKER_BUILDKIT=1 docker build --output <output-path> .
This will generate a pyspark_deps.tar.gz file at the <output-path> specified in the command above.
Use this command if your Dockerfile has a different name:
DOCKER_BUILDKIT=1 docker build -f Dockerfile-emr-venv --output <output-path> .
2. Set up CONF_ROOT
The kedro package command only packages the source code, yet the conf directory is essential for running any Kedro project. To make it available to Kedro separately, its location can be controlled by setting CONF_ROOT.
By default, Kedro looks at the root conf folder for all its configuration (catalog, parameters, globals, credentials, logging) to run the pipelines, but this can be customised by changing CONF_ROOT in settings.py.
Change CONF_ROOT in settings.py to the location where the conf directory will be deployed. It could be anything, e.g. ./conf or /mnt1/kedro/conf.
For Kedro versions >= 0.18.5, use the --conf-source CLI parameter directly with kedro run to specify the path; CONF_ROOT need not be changed in settings.py.
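For earlier versions, a minimal sketch of the change in settings.py might look like the following; the setting name and the exact path are assumptions that depend on your Kedro version and on where the conf directory will be unpacked on the cluster:

# settings.py (project settings) -- illustrative only
# Point Kedro at the conf directory that will be shipped alongside the Spark job.
CONF_ROOT = "conf"  # e.g. "./conf" or "/mnt1/kedro/conf"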
3. Package the Kedro Project
Package the project using the kedro package command from the root of your project folder. This creates a .whl file in the dist folder, which is later passed to spark-submit via --py-files to make the source code available on the Amazon EMR cluster.
4. Create .tar for conf
As described above, the kedro package command only packages the source code, yet the conf directory is essential for running any Kedro project. It therefore needs to be deployed separately as a tar.gz file. Note that the contents of the folder need to be zipped, not the conf folder itself.
Use the following command to zip the contents of the conf directory and generate a conf.tar.gz file containing catalog.yml, parameters.yml and the other files needed to run the Kedro pipeline. It will be passed to spark-submit via the --archives option, which unpacks the contents into a conf directory.
tar -czvf conf.tar.gz --exclude="local" -C conf .
5. Create an Entrypoint for the Spark Application
Create an entrypoint.py file that the Spark application will use to start the job. This file can be modified to take arguments; once the params array is removed, it runs using only main(sys.argv) and accepts the command-line arguments, for example:
python entrypoint.py --pipeline my_new_pipeline --params run_date:2023-02-05,runtime:cloud
This mimics the exact kedro run behaviour.
import sys

from proj_name.__main__ import main

if __name__ == "__main__":
    """
    These params could be used as *args to
    test pipelines locally. The example below
    will run `my_new_pipeline` using `ThreadRunner`,
    applying a set of params:

    params = [
        "--pipeline",
        "my_new_pipeline",
        "--runner",
        "ThreadRunner",
        "--params",
        "run_date:2023-02-05,runtime:cloud",
    ]
    main(params)
    """
    main(sys.argv)
6. Upload Relevant Files to S3
Upload the relevant files to an S3 bucket (Amazon EMR should have access to this bucket) in order to run the Spark job. The following artifacts should be uploaded to S3:
- .whl file created in step #3
- Virtual environment tar.gz created in step #1 (e.g. pyspark_deps.tar.gz)
- .tar file for the conf folder created in step #4 (e.g. conf.tar.gz)
- entrypoint.py file created in step #5
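As an illustrative sketch, the upload could be scripted with boto3; the bucket name and the .whl filename below are placeholders for your own values:

# upload_artifacts.py -- illustrative boto3 upload of the build artifacts
import boto3

S3_BUCKET = "my-emr-artifacts"  # hypothetical bucket name
s3 = boto3.client("s3")

artifacts = [
    "dist/my_project-0.1-py3-none-any.whl",  # .whl from step #3 (name is a placeholder)
    "pyspark_deps.tar.gz",                   # packed virtual environment from step #1
    "conf.tar.gz",                           # conf archive from step #4
    "entrypoint.py",                         # entrypoint from step #5
]

for path in artifacts:
    # Upload each file under the bucket root, keeping its base name as the object key
    s3.upload_file(path, S3_BUCKET, path.split("/")[-1])

Equivalently, each file could be uploaded with the aws s3 cp CLI command or through the S3 console.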
7. spark-submit to the Amazon EMR Cluster
Use the following spark-submit command as a step on Amazon EMR running in cluster mode. A few points to note:
- pyspark_deps.tar.gz is unpacked into a folder named environment
- Environment variables are set to refer to the libraries unpacked in that environment directory, e.g. PYSPARK_PYTHON=environment/bin/python
- The conf directory is unpacked into the folder specified after the # symbol (s3://{S3_BUCKET}/conf.tar.gz#conf)
Note the following:
- Kedro versions < 0.18.5: the folder location/name after the # symbol should match CONF_ROOT in settings.py.
- Kedro versions >= 0.18.5: you could follow the same approach as above. However, Kedro now provides the flexibility to supply the configuration location through the CLI using --conf-source instead of setting CONF_ROOT in settings.py. The --conf-source option can therefore be specified directly in the CLI parameters, and step 2 can be skipped completely.
spark-submit \
--deploy-mode cluster \
--master yarn \
--conf spark.submit.pyFiles=s3://{S3_BUCKET}/<whl-file>.whl \
--archives=s3://{S3_BUCKET}/pyspark_deps.tar.gz#environment,s3://{S3_BUCKET}/conf.tar.gz#conf \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
--conf spark.yarn.appMasterEnv.<env-var-here>={ENV} \
--conf spark.executorEnv.<env-var-here>={ENV} \
s3://{S3_BUCKET}/entrypoint.py --env base --pipeline my_new_pipeline --params run_date:2023-03-07,runtime:cloud
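If you prefer to register the step programmatically rather than through the console, a sketch using boto3 could look like the following; it wraps the same spark-submit call in an EMR step via command-runner.jar. The cluster ID, region, bucket name and step name are placeholders:

# add_emr_step.py -- illustrative submission of the spark-submit call as an EMR step
import boto3

S3_BUCKET = "my-emr-artifacts"        # hypothetical bucket name
CLUSTER_ID = "j-XXXXXXXXXXXXX"        # your EMR cluster ID
emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

spark_submit_args = [
    "spark-submit",
    "--deploy-mode", "cluster",
    "--master", "yarn",
    "--conf", f"spark.submit.pyFiles=s3://{S3_BUCKET}/<whl-file>.whl",
    "--archives", f"s3://{S3_BUCKET}/pyspark_deps.tar.gz#environment,"
                  f"s3://{S3_BUCKET}/conf.tar.gz#conf",
    "--conf", "spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python",
    "--conf", "spark.executorEnv.PYSPARK_PYTHON=environment/bin/python",
    f"s3://{S3_BUCKET}/entrypoint.py",
    "--env", "base",
    "--pipeline", "my_new_pipeline",
    "--params", "run_date:2023-03-07,runtime:cloud",
]

emr.add_job_flow_steps(
    JobFlowId=CLUSTER_ID,
    Steps=[
        {
            "Name": "kedro-pipeline",
            "ActionOnFailure": "CONTINUE",
            # command-runner.jar executes the spark-submit command on the cluster
            "HadoopJarStep": {"Jar": "command-runner.jar", "Args": spark_submit_args},
        }
    ],
)

The same step can also be added from the EMR console or with the aws emr add-steps CLI command.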
Summary
This post describes the sequence of steps needed to deploy a Kedro project to an Amazon EMR cluster.
- Set up the Amazon EMR cluster
- Set up CONF_ROOT (optional for Kedro versions >= 0.18.5)
- Package the Kedro project
- Create a .tar for conf
- Create an entrypoint for the Spark application
- Upload relevant files to S3
- spark-submit to the Amazon EMR cluster
Kedro supports a range of deployment targets, including Amazon SageMaker, Databricks, Vertex AI and Azure ML, and our documentation additionally includes a range of approaches for single-machine deployment to a production server.
Published at DZone with permission of Jo Stichbury, DZone MVB.