The Case for Working on Non-Glamorous Migration Projects
Explore this contrarian take on how to accelerate your journey to becoming a battle-hardened software developer and leader.
In my 13 years of engineering experience, I have seen many people make career decisions based on the opportunity to work on a brand-new service. There is nothing wrong with that decision. However, today I am going to make the contrarian case for working on boring migration projects. What I did not realize early in my career was that most of my foundational software development learning came from migration projects, such as migrating an underlying data store to a cloud-based technology or deprecating a monolithic service in favor of new microservices.
This is because migrations are inherently hard: you are forced to meet, if not exceed, an existing bar on availability, scale, latency, and customer experience that was built and honed over years by multiple engineers. You won't face those constraints on a brand-new system because you are free to define them. On top of that, no matter how thorough you are with a migration, there will be hidden skeletons in the closet to deal with when you switch over to new parts of the system (check out this interesting article on how Doordash's migration from Int to BigInt for a database field was fraught with blockers).
These projects force you to think meticulously about testing methodologies, the accuracy of results from the new system, software rollout plans, software rollback plans, and so on. You don't feel that same stress on a brand-new system because there are no existing customers that you are serving. The most boring part is that existing customers are not supposed to notice that you replaced an underlying system or code base at all.
1. But Do You Really Need a Migration?
I often see new engineers wanting to try a new technology to replace existing functionality, or wanting to completely refactor a code base. If it is a contained change (e.g., using a well-tested open-source library to perform a small operation in a service), I don't mind it. But if it is a major architectural change or a rework of an entire code base, it is important to remember a famous engineering tenet: "Respect What Came Before." (I found this tweet, which refers to legacy code as legendary code, funny.)
Coming back to migration projects, it is always wise to evaluate whether you can fix the same problem with a comparatively smaller effort instead of a major overhaul of the code base or architecture. But a new technology or design pattern is always tempting, so how do we evaluate this decision? Here are a few questions and considerations to help you get started before you embark on a migration journey:
- Is the business (or customer experience) adversely impacted today, or will it be impacted in the future, if we don't solve this problem? Has the team exhausted all options to resolve it without the major undertaking of a migration project? Opt for a review from a senior engineer who is not on your team and who can act as a devil's advocate to pressure-test your reasoning. Some examples of justifications could be improving agility by 4 developer months for every feature launch, using different tech stacks for different services to improve p99 latency by 400 ms, or removing scaling bottlenecks beyond X TPS. Always seek out disagreement to break your confirmation bias in such situations.
- Compare the effort of the migration with the benefits it will yield, so you can estimate how long it will take to start reaping the benefits of the project. A personal example that I can share is as follows:
- My team owned two separate systems serving two different customer bases, and every new feature launch required the team to make similar, but not identical, changes to both systems. Overall, the duplication added roughly 1 developer month of effort per feature. We launched about 4 such features every year, leading to 4 developer months of duplicated, wasted effort annually. This was frustrating for engineers. One of the engineers proposed combining the two systems and estimated the effort at 24 developer months. It would take 24 feature launches, or 6 years (assuming 4 features launched per year), for the team to start reaping the benefits of the migration (see the break-even sketch after this list). We didn't do the migration and moved to an alternative approach of using shared libraries to reduce the duplication effort by 50%, and 3 years later we deprecated the system in favor of another service.
- In some cases, the migration is top-down guidance to meet a broader goal (e.g., Amazon switching away from Oracle), where you may still do the analysis but are not required to get approval to proceed with the project.
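To make the break-even math from the earlier example concrete, here is a minimal sketch using those illustrative numbers; swap in your own team's estimates.

```python
# Break-even estimate for a migration, using the illustrative numbers from the
# example above. All figures are assumptions; plug in your own team's estimates.

migration_cost_dev_months = 24      # estimated effort to combine the two systems
saving_per_feature_dev_months = 1   # duplicated effort removed per feature launch
features_per_year = 4               # historical launch rate

features_to_break_even = migration_cost_dev_months / saving_per_feature_dev_months
years_to_break_even = features_to_break_even / features_per_year

print(f"Break-even after {features_to_break_even:.0f} feature launches "
      f"(~{years_to_break_even:.1f} years)")
# -> Break-even after 24 feature launches (~6.0 years)
```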
Once you have identified the right justifications to do the migration and pressure-tested the reasoning with some external engineers or leaders, it is time to move to the next steps.
2. Lay Out Functional and Non-Functional Requirements of a System
This is similar to what you would do while preparing for a system design interview. Once functional and non-functional requirements are laid out, it is prudent to forget about the existing system for the time being and lay out how you would build a new system if there were no constraints.
The reason to do this exercise is that many existing team members will have an unconscious bias toward building a new system that is not very different from the existing one, defeating the very purpose of the migration in many cases. Consider another example from my past:
- We had decided to move away from an on-premises SQL database due to scaling bottlenecks and maintenance issues. Our service was using the SQL database as a scheduling engine to track inventory-related updates, running multiple complex queries against it every second, which was wasteful. Our engineers presented a design to replace our SQL database with a cloud SQL database. The justification was that cloud SQL was more scalable, but what we were effectively doing was pushing a problem caused by a bad access pattern onto cloud technology instead of fixing the pattern itself. A Principal Engineer steered our approach toward building an event-driven system using Pub/Sub notifications and a streaming queue (e.g., Kafka or AWS Kinesis) that scaled orders of magnitude better than our original proposal.
Involving someone more experienced who had not worked on the existing system steered the conversation toward building a completely different system that was more scalable, real-time, and easier to maintain. This may not always be possible, but it doesn't hurt to go through the exercise.
If you are doing a like-for-like migration, as we originally proposed (i.e., moving an on-prem SQL DB to a cloud SQL DB), you may have an easier time meeting non-functional requirements. However, if your end system is drastically different from the current one, you should at least make an attempt to fix the anti-patterns built into the system. For example, instead of polling the database for changes to a key, you can publish a change notification to subscribers using a Pub/Sub service, as sketched below.
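As a rough illustration of that shift, here is a minimal sketch contrasting the polling anti-pattern with a push-based notification. The `db`, `publisher`, topic name, and handler are hypothetical placeholders, not any specific product's API.

```python
import time

# Anti-pattern: repeatedly querying the database for changes to a key.
def poll_for_update(db, key, interval_seconds=1):
    last_value = db.get(key)
    while True:
        current_value = db.get(key)
        if current_value != last_value:
            handle_update(key, current_value)
            last_value = current_value
        time.sleep(interval_seconds)  # burns queries even when nothing changed

# Event-driven alternative: the writer publishes a change event once,
# and subscribers react only when something actually changed.
def write_and_notify(db, publisher, key, value):
    db.put(key, value)
    publisher.publish(topic="inventory-updates", message={"key": key, "value": value})

def on_inventory_update(message):
    # Invoked by the Pub/Sub subscription (e.g., a Kafka or Kinesis consumer).
    handle_update(message["key"], message["value"])

def handle_update(key, value):
    print(f"{key} changed to {value}")
```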
However, like every project in distributed systems, migrations come with trade-offs in non-functional requirements, and you will need to plan for them. For instance, if a monolith with 99.9% availability handles two separate business calculations (delivery date estimation and shipping charge estimation), and we decide to split this responsibility into two microservices, A (Delivery Date Estimation Service) and B (Shipping Charge Estimation Service), each with an availability of 99.9%, then the overall availability of the system becomes:
P(A) * P(B) = 0.999 * 0.999 ≈ 0.998, i.e., 99.8% availability
Creating microservices from a monolith led to a reduction of availability from 99.9% to 99.8%.
Always remember, if you need results from ‘n’ service calls (sequential or parallel service calls) to return a response to your client, you multiply the individual availability of each of ‘n’ services to arrive at the final availability of the system.
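A few lines of arithmetic make that rule easy to sanity-check; this is a minimal sketch, not tied to any particular system.

```python
# Composite availability of a response that depends on n downstream calls
# (sequential or parallel): multiply the individual availabilities.

def composite_availability(availabilities):
    result = 1.0
    for a in availabilities:
        result *= a
    return result

print(composite_availability([0.999, 0.999]))  # ~0.998 -> the 99.8% from the example
print(composite_availability([0.999] * 5))     # ~0.995 -> five three-nines dependencies
```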
To meet or exceed the original availability of the system (i.e. 99.9%), we will need to think about other techniques like caching, retries, etc. But each of these options has its own drawbacks. For example, caching, in some cases, may mean your system should be able to tolerate stale data; retries can add delays and make the system susceptible to retry storms, etc.
However, doing this exercise should allow you to see if you are at least meeting an existing bar on non-functional requirements or if you need leadership approval on new non-functional requirements that you want to provide to your customers.
3. Do Clients Need To Take Any Actions?
With a new system, your customers simply adopt your new client version. With migration projects, you may have to deal with the problem of customers that can't migrate to the new version of the client (i.e., you have to think about backward compatibility). If all of your clients are internal to the company, or you have limited adoption outside the company, you can work with all of your customers to move them to the new version of the client.
In other cases, this is simply not possible. For example, if you own a large cloud service that is broadly adopted in the industry, there is no way you can force all customers to move to a new version of the client. This can add significant blockers as well as maintenance overhead for the team. In some cases, the solution is to maintain two versions of the system, with the older version in maintenance mode (i.e., no new customers are added to it), while offering existing customers an incentive to move to the newer version because of its improved benefits.
However, if you have a situation like the Doordash example linked above, where using Int as the data type of a primary key was going to overflow, you have no option but to force everyone to do the migration.
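When you do end up keeping an older client format alive alongside a newer one, the serving layer usually needs a small translation shim. The sketch below shows one common shape for this; the field names, versions, and helpers are entirely hypothetical.

```python
# Sketch of serving old and new client payloads from the same endpoint while the
# old format is in maintenance mode. Field names and the adapter are hypothetical.

def handle_request(payload: dict) -> dict:
    if payload.get("schema_version", 1) == 1:
        payload = adapt_v1_to_v2(payload)   # translate legacy requests in one place
    return process_v2(payload)

def adapt_v1_to_v2(old: dict) -> dict:
    return {
        "schema_version": 2,
        "customer_id": old["customerId"],    # renamed field
        "items": old.get("lineItems", []),   # renamed and now optional
    }

def process_v2(payload: dict) -> dict:
    return {"status": "ok", "items_processed": len(payload["items"])}
```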
4. Have Similar Migrations Already Been Done?
When building new systems, most engineers do a fabulous job of covering almost all use cases. The reverse happens with migrations because you are handling a system that was developed, patched, and maintained by tens, if not hundreds, of engineers before you. Even if you want to learn about every use case, code path, or system bottleneck, it is difficult to wrap your head around the entire service.
In such cases, the simplest thing to do is to seek out learnings from teams and senior engineers who have performed similar migrations, and ask what processes you can follow to cover your blind spots. Many companies follow a process of broader org-level design and migration reviews. Treat seeking out disagreement as a sacred part of the process to solidify your approach and understanding. Migrations are fraught with landmines that trip you up in unexpected ways.
Most migrations fall into one of the two categories below, or some combination of both:
- Service migrations: Deprecating an existing service in favor of a new architecture, which may consist of reusing parts of the current service alongside a new service, or launching new microservices to replace an existing system
- Datastore migrations: Deprecating an existing data store and replacing it with a new data store or making use of an event-driven system.
Even if you don't find an exact migration example, you can always draw broader learnings from these buckets. In my personal experience, data store migrations were the hardest, as there are concerns around the accuracy of data, which is impacted by consistency issues between the old and new data stores. For example, a user might see an older version of data from the new data store due to delays in propagation.
5. Running Old and New Systems in Parallel
Running the existing and new systems in parallel, while only serving data from the existing system, allows you to compare the results of both systems against real customer requests. This is the single most useful and powerful step for validating that your new system works correctly.
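For service migrations, this parallel run often takes the shape of shadow traffic. Here is a minimal sketch of that idea, assuming hypothetical `old_service`, `new_service`, and `comparison_log` clients; it is the general pattern, not the exact setup from the story that follows.

```python
# Shadow-traffic sketch: serve every request from the old service, mirror it
# asynchronously to the new service, and log both results for offline comparison
# (e.g., a daily query over the log location). All client objects are
# hypothetical placeholders.

import json
import threading

def handle_request(request, old_service, new_service, comparison_log):
    response = old_service.call(request)          # customers only ever see this

    def shadow():
        try:
            new_response = new_service.call(request)
            comparison_log.write(json.dumps({
                "request_id": request["id"],
                "old": response,
                "new": new_response,
                "match": response == new_response,
            }))
        except Exception as exc:                  # never let the shadow path hurt customers
            comparison_log.write(json.dumps({"request_id": request["id"], "error": str(exc)}))

    threading.Thread(target=shadow, daemon=True).start()
    return response
```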
Many years back, I worked on a service migration to a new technical stack. Whenever our old service received customer requests, we made a parallel asynchronous call to our new service in the backend and logged both services' results to an S3 location. We then ran an AWS Athena query at the end of the day to find discrepancies and identify issues with the new service. That was still a fairly predictable exercise compared to another tricky data store migration we handled. We were moving from an old SQL data store to a new NoSQL data store that was populated from a newer, more reliable data source. However, the times at which specific keys were updated in the old and new data stores were unpredictable, as the updates came from two entirely different systems.
After unsuccessfully trying multiple approaches to compare data between the old and new data stores, we worked with our upstream teams to release versions for data keys, so we could compare data accuracy for a given key using versions in both systems. This was throwaway work, as we did not need the versions after the project, but there was no other way to handle it.
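A rough sketch of what such a version-gated comparison can look like is below; the store clients and field names are hypothetical, not the actual systems we used.

```python
# Compare a record only once both stores have ingested the same upstream version
# of the key; otherwise defer, since the stores haven't converged yet.
# Store clients and field names are hypothetical.

def compare_key(key, old_store, new_store):
    old_record = old_store.get(key)
    new_record = new_store.get(key)
    if old_record["version"] != new_record["version"]:
        return None  # stores not yet on the same upstream version; retry later
    return {
        "key": key,
        "version": old_record["version"],
        "match": old_record["value"] == new_record["value"],
    }
```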
6. Expect Things To Go Wrong: Can You Go Back to the Previous World?
Even after running Step 5, where you were able to thoroughly compare the results of the old and new systems, it is quite possible that you never hit a specific type of request from a few customers who rarely use your system. I have lost sleep while working on some of these migration projects thinking, "What if everything goes wrong with the new system?"
The easiest way to tackle this was to have an off switch for the new system: if our alarms caught something unexpected, or we triggered it manually, traffic moved back to the old system. Mind you, this is not as easy as it sounds. In some cases, there may not be a way to go back to the older system, but having this lever relieves a lot of pressure on the team.
For cases where this is not possible, your only point of reliance is being thorough with Step 5 (running old and new systems in parallel), followed by a slow, gradual rollout of the new system. You can define a gradual rollout using techniques like moving a small percentage of traffic (1%, then 5%, 10%, 25%, 50%, 100%) to the new service, or handpicking a few customers, with whom you work closely during the migration, to be served by the new service. A minimal sketch of such a rollout lever follows.
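A gradual rollout with an off switch can be as simple as a percentage flag consulted on every request. The sketch below assumes a hypothetical `flag_store` and hashes the customer ID so each customer consistently lands on the same system; setting the flag to 0 routes all traffic back to the old system.

```python
# Sketch of a percentage-based rollout with a built-in off switch.
# The flag store, flag name, and system clients are hypothetical placeholders.

import hashlib

def rollout_percentage(flag_store) -> int:
    # e.g., 0 (fully rolled back), 1, 5, 10, 25, 50, 100
    return int(flag_store.get("new_system_rollout_percent", 0))

def serve(request, old_system, new_system, flag_store):
    # Hash the customer ID so the same customer consistently hits the same system.
    bucket = int(hashlib.sha256(request["customer_id"].encode()).hexdigest(), 16) % 100
    if bucket < rollout_percentage(flag_store):
        return new_system.call(request)
    return old_system.call(request)
```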
It is also important to broadly review the incident response runbook that operators will follow if things go wrong. If everything fails, manual intervention can help with edge cases that were missed, but this can quickly become unmanageable if the number of impacted customers grows into the thousands. This is the reason to give enough time to the phases described in points 5 and 6.
Conclusion
While working on migrations is not the only way to hone these skills, it can definitely speed up the learning that you can apply to future projects, even brand-new initiatives. Migration projects are less glamorous, but they are the ones that made me battle-tested, especially when I provide feedback on design documents or other technical documents. So if you get a chance to work on one, give it a try: you won't be disappointed, and you will gain career-long learnings that you can hopefully pass on to others to build resilient systems.