Example: Thinking Out of the Box
This article shows how doing some real out-of-the-box thinking and trying a few outrageous approaches may surprise you and your team.
About two years ago, I worked on a project where we had to migrate an enterprise application from an MPP database to Spark with Delta.io on an on-premises Hadoop environment.
We had around 17,000 SQL statements in 4,000 files, which were executed as per a schedule. The SQLs were run on the MPP using a custom database driver. Yup. The application had grown this way over a period of 20 years.
When we started, all the SQL statements had to be converted to be compatible with Spark SQL syntax. In addition, the scheduler was to be replaced by Airflow. So, as always, the deadline was pretty aggressive. But that is not the story.
To execute the SQLs on the target platform, we had to build a Python framework. Using suitable metadata, we then generated the PySpark applications: the generator read the SQLs from Excel and stored them in the required structure. Each program was independent and executed one or more SQL statements using SparkSQL.

Before we could run the programs on a schedule, we had to ensure that each application worked as expected. For this, the team submitted each application and checked its results, modifying the SQL queries and the Python framework in parallel as needed.

Because each PySpark application was independent, everything it needed for execution was available in the same program. A downside was that the same methods ended up repeated in every application. Which methods? One method executed an SQL statement, took care of exceptions, and returned only True or False depending on the outcome of the execution. Another fetched parameter values and populated them in the SQL statement before executing it. These two methods were repeated in each program.
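The two repeated helpers might have looked roughly like this minimal sketch. The names (`run_sql`, `resolve_params`), the placeholder syntax, and the signatures are my assumptions, not the project's actual code; the execute helper takes the Spark session as an argument so the pattern is visible without a running cluster.

```python
def run_sql(spark, sql):
    """Execute a statement via SparkSQL, swallowing exceptions and
    returning True/False so the caller can decide whether to continue."""
    try:
        spark.sql(sql)
        return True
    except Exception as exc:  # deliberate catch-all, per the framework's contract
        print(f"SQL failed: {exc}")
        return False


def resolve_params(sql, params):
    """Populate parameter values in an SQL statement before execution.
    Placeholder syntax (${name}) is illustrative only."""
    for name, value in params.items():
        sql = sql.replace("${" + name + "}", str(value))
    return sql
```

With these copied into every generated file, each application stayed fully self-contained, at the cost of the duplication described above.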
You may ask: why not store the SQLs in a database and have the framework pull the queries from there? Four reasons, the first being time: each connection to the database takes additional time. The second was SQL modification. Because each Python file, and the SQLs within it, were independent, each developer could edit her file, make changes, and submit the application without affecting any other script. (Even so, we had cases where people overwrote their own files, and in some cases we lost a few files.) The third reason was the inability of a database cell to serve as a proper editor. The team would have had to copy the code into an editor like Notepad++ (I know, people are not comfortable using vi), edit it, and paste the query back into the proper cell. And here, 'proper' is important: with the typical database editors we use, it is quite easy to overwrite another cell. The fourth was the number of people editing the database simultaneously. Imagine 40 people editing the same table. It would be a classic horror story.
When we started executing the PySpark applications through Airflow, we found that initialization of the Spark session takes significant time. If we could combine the execution of multiple PySpark applications, we could save some of that time. Does that not sound like a wonderful idea? 'Combine multiple PySpark applications into one application.' The question was how to do it with minimal impact on the delivery timeline. Even if we took the time to train and guide each team member on the changes needed, we were looking at significant effort. Combining multiple PySpark applications was not trivial. First, we would have to edit each file, remove the common methods, and move the code in 'main' into a method with the required parameters. Then we would have to write another program that imported the modified scripts and executed them in the proper order.
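The restructuring itself is mechanical, which is what makes it automatable. In a rough sketch (the shapes and names below are illustrative assumptions, not the project's actual generated code), each standalone script had to become an importable module:

```python
# BEFORE (illustrative): each generated script was self-contained. It built
# its own SparkSession in 'main', carried its own copies of the helper
# methods, and paid the session-initialization cost on every run:
#
#   spark = SparkSession.builder.appName("job_0042").getOrCreate()
#   run_sql(spark, "INSERT INTO ...")   # local copy of run_sql
#
# AFTER (illustrative): the helpers move to a shared module, and the body of
# 'main' becomes a function that accepts an already-initialized session,
# so one session can be reused across many combined applications.

def run(spark, params):
    """Entry point a combined driver calls; the SparkSession is created once
    by the driver and shared, saving per-application startup time."""
    sql = "INSERT INTO target SELECT * FROM source WHERE dt = '{dt}'".format(**params)
    spark.sql(sql)
```

The key point is that nothing in the converted module creates or stops a session; session lifetime becomes the driver's responsibility.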
It was quite a monumental task, given the number of team members and the number of files involved.
I did the next best thing; I automated the process.
I wrote a Python program to scan through each script, extract the SQL statements, and upload them in the proper order into an SQLite database. Why not Excel? Because some SQLs were so large that they exceeded the character limit of an Excel cell, so I had to use a database. And why a local database like SQLite rather than a remote one? Time again: with Excel or SQLite, code generation took around 15 minutes; with a remote database, the application needed two hours. Because the PySpark applications had been generated by a program, the structure and naming of methods and variables were well-defined and consistent across all files, which made automated extraction reliable.
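A condensed sketch of that extraction step might look as follows. The variable-naming pattern, table layout, and function names are assumptions for illustration; in the real project, the fixed structure of the generated scripts made matching far more dependable than generic SQL parsing would be.

```python
import re
import sqlite3

# Assumed convention: generated scripts assign each statement to a variable
# like sql_001 = """..."""; the fixed generator output makes this reliable.
SQL_ASSIGNMENT = re.compile(r'sql_\d+\s*=\s*"""(.*?)"""', re.DOTALL)


def extract_sqls(script_text):
    """Return the SQL statements found in a generated script, in source order."""
    return [m.strip() for m in SQL_ASSIGNMENT.findall(script_text)]


def load_into_sqlite(conn, script_name, sqls):
    """Store each statement with its script name and position, so the new
    generator can later rebuild the execution order from the table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extracted_sql "
        "(script TEXT, seq INTEGER, statement TEXT)"
    )
    conn.executemany(
        "INSERT INTO extracted_sql VALUES (?, ?, ?)",
        [(script_name, i, s) for i, s in enumerate(sqls)],
    )
    conn.commit()
```

Because SQLite ships with Python and runs in-process (even fully in memory via `sqlite3.connect(":memory:")`), there is no connection overhead and nothing to install or administer.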
After extracting all the SQLs into a database table, I had the liberty of redesigning the architecture and making use of common classes and methods. The new version of the application generator created each script without a 'main' and also generated a driver program that executed these mainless scripts in the required order. All script names and their order were managed using suitable metadata.
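A sketch of what such a generated driver could look like (the metadata shape and names are my assumptions): it creates one Spark session, then imports each mainless script in the metadata's order and calls its entry point, reusing the session throughout.

```python
import importlib


def run_schedule(spark, metadata):
    """Execute generated modules in order, sharing one Spark session.

    metadata: an ordered list of (module_name, params) pairs, assumed to be
    read from the generator's metadata store. Each generated module is
    assumed to expose a run(spark, params) entry point instead of a 'main'.
    """
    results = {}
    for module_name, params in metadata:
        module = importlib.import_module(module_name)
        results[module_name] = module.run(spark, params)
    return results
```

This is where the session-initialization savings come from: however many applications the metadata lists, the session is created once, by the driver, instead of once per application.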
I was able to successfully extract SQLs from 95% of the PySpark applications, and the extraction itself ran in minutes: only around 10 to 15 minutes to process around 4,000 scripts. I ended up saving the team many days of mundane work. For all practical purposes, the team continued to use the helper shell scripts for application execution, though the helper scripts also had to be changed.
How much time did we save by changing the architecture? We chopped around four hours from the schedule, which had earlier been running for around 24 hours. Obviously, much more optimization was to follow, because the schedule was supposed to complete in eight hours so that it could execute three times a day.
What did we learn from this exercise?
- We need to find time for design. If we make poor design decisions, we face long-term consequences.
- Do some real out-of-the-box thinking and try some outrageous approaches. You might surprise yourself.
- Believe in yourself and your team.
Published at DZone with permission of Bipin Patwardhan. See the original article here.