Deep Work for Site Reliability Engineers

This article discusses the concept of Deep Work, its benefits, and the strategies that Site Reliability Engineers can employ.

Krishna Vinnakota

May. 31, 24 · Analysis

In this article, I will discuss:

The concept of Deep Work
Why it is important in this day and age
The unique challenges that make Deep Work hard for Site Reliability Engineers
Strategies that Site Reliability Engineering teams can employ to overcome these challenges and create an environment for Deep Work

What Is Deep Work?

The concept of Deep Work was introduced by Cal Newport in his book "Deep Work: Rules for Focused Success in a Distracted World." Newport defines Deep Work as the act of focusing without distraction on a cognitively demanding task. Its opposite is Shallow Work, which he defines as logistical-style tasks that can be performed while distracted, such as work coordination and communication tasks that are easy to replicate.

Why Is Deep Work Important?

First, Deep Work is meaningful and satisfying. According to a recent Gallup survey, employee engagement in the United States has hit a record low, with employees reporting less clarity about, and less satisfaction with, their organizations. Deep Work can help address this problem.

Second, Deep Work can pave the path to a Flow State, and research has found that the Flow State leads to happiness.

Finally, Deep Work is rewarding. Cognitively demanding work brings value to teams and organizations, which in turn can lead to promotions and financial rewards for the individual doing it. As Cal Newport says, "A deep life is a good life."

Now, let's look at some of the activities that are cognitively demanding for SREs, the activities that can be considered Shallow activities, and some strategies that SRE teams can employ to promote Deep Work within the SRE teams.

What Are Some Cognitively Demanding Tasks for SREs?

The following are some of the cognitively demanding tasks that SRE teams can perform to have a greater impact on their organizations:

Automation and building services: Developing good automation to eliminate toil, improve the efficiency of managing infrastructure, and reduce costs is a cognitively demanding task. Contributing to the codebases that backend teams develop can also be a good opportunity for SREs and is a cognitively demanding task.
Improving observability: Another cognitively demanding task for Site Reliability Engineers is improving the observability of the systems. This can be done through designing and creating usable dashboards, tuning alerts to improve signal-to-noise ratio, instrumenting codebases to emit useful metrics, etc.
Debugging and troubleshooting difficult issues impacting production systems: Troubleshooting difficult issues affecting production systems availability under time pressure is another cognitively demanding task.
Improving processes: Improving processes such as change management and incident management to raise the team's overall efficiency, and improving SLOs, is another cognitively demanding task.
Improving documentation: Writing good documentation can be impactful and requires focus to get it done. A few examples of good documentation are usable troubleshooting guides, Standard Operating Procedures, architectural diagrams, etc.
Learning new technical skills: Continuous learning is key to becoming better at an SRE job. Learning new technical skills and keeping up with technology trends such as Generative AI is cognitively demanding as well.
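
As one illustration of the alert tuning mentioned under improving observability, the sketch below estimates how often each alert rule led to real action. It is a minimal example under stated assumptions: the rule names, the (rule, was_actionable) record shape, and the 0.5 threshold are all hypothetical and not taken from any particular monitoring tool.

```python
from collections import Counter


def actionable_ratio(alerts):
    """Return {rule_name: fraction of firings that needed human action}.

    `alerts` is an iterable of (rule_name, was_actionable) tuples.
    This record shape is an assumption made for illustration.
    """
    fired = Counter()
    actionable = Counter()
    for rule, was_actionable in alerts:
        fired[rule] += 1
        if was_actionable:
            actionable[rule] += 1
    return {rule: actionable[rule] / fired[rule] for rule in fired}


def noisy_rules(alerts, min_ratio=0.5):
    """List rules whose actionable fraction falls below `min_ratio`."""
    ratios = actionable_ratio(alerts)
    return sorted(rule for rule, ratio in ratios.items() if ratio < min_ratio)


# Hypothetical alert history: rule name and whether someone had to act.
history = [
    ("disk_full", True),
    ("disk_full", True),
    ("cpu_high", False),
    ("cpu_high", False),
    ("cpu_high", True),
]
print(noisy_rules(history))
```

Rules that show up in the noisy list are candidates for a higher threshold, a longer evaluation window, or removal, which is the kind of focused analysis that benefits from uninterrupted time.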

What Challenges Do SREs Face To Perform Deep Work?

The following are some shallow tasks that SREs need to do to run the business that make it difficult for them to do Deep Work:

1. Deployments and Upgrades

These are essential activities for the business but tend to be repetitive in nature. Depending on the level of automation that exists within the team, SREs spend some amount of time on these activities.

2. Answering Questions of Other Engineers

Engineers on other teams often bring ad hoc questions to SREs, because SRE teams tend to have deeper knowledge of production systems and infrastructure. Answering these questions helps the people asking, but it randomizes the SRE's time.

3. Production Access Requests

In many teams, access to production systems is restricted to the SRE team to maintain the stability of the production environments. Members of other teams, such as backend engineering and data engineering, may interrupt SREs to get information from production systems for purposes such as debugging issues.

4. Randomization Due to On-Call and Production Issues

SREs tend to have end-to-end knowledge of the production systems and may often be pulled into on-call issues even when they are not in the current on-call rotation. This takes time away from working on meaningful projects.

5. Meetings

Meetings carry a lot of overhead. In SRE roles, many people often join calls to troubleshoot an issue, and these calls tend to run long, with many of the engineers acting as bystanders for extended periods of time.

6. Answering Emails and Replying to Teams/Slack Chats

This is a common activity for most of the people working in the knowledge economy, and SREs are not immune to it. Replying to emails and chats constantly randomizes an SRE's time and takes their attention away from important work.

What Strategies Can SREs Employ To Facilitate Deep Work?

Now let's look at some of the strategies that SRE teams can employ to minimize time spent on Shallow work and spend that time on Deep Work:

1. Invest in Automation

SRE teams should prioritize investing time in automation to eliminate toil and reduce the operational burden of activities such as deployments and upgrades. Creating robust Continuous Integration and Continuous Deployment pipelines with built-in automated verifications will reduce the time spent on these activities. The goal should be to give development teams the tools they need to perform upgrades and deployments through self-service. SRE team management should plan so that proper resources are allocated to these kinds of projects.
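
To make the idea of a built-in automated verification more concrete, here is a minimal sketch of a post-deployment check that a pipeline step could run before promoting a release. The DeployMetrics fields and the default thresholds are assumptions chosen for illustration; a real pipeline would pull these numbers from its own monitoring system.

```python
from dataclasses import dataclass


@dataclass
class DeployMetrics:
    """Hypothetical metrics sampled after a deployment (names are assumptions)."""
    requests: int
    errors: int
    p99_latency_ms: float


def verify_deployment(metrics: DeployMetrics,
                      max_error_rate: float = 0.01,
                      max_p99_latency_ms: float = 500.0) -> list[str]:
    """Return the list of failed checks; an empty list means the deploy passes."""
    if metrics.requests == 0:
        # Without traffic there is nothing to measure, so fail safe.
        return ["no traffic observed, cannot verify"]

    failures = []
    error_rate = metrics.errors / metrics.requests
    if error_rate > max_error_rate:
        failures.append(
            f"error rate {error_rate:.2%} exceeds {max_error_rate:.2%}")
    if metrics.p99_latency_ms > max_p99_latency_ms:
        failures.append(
            f"p99 latency {metrics.p99_latency_ms:.0f} ms exceeds "
            f"{max_p99_latency_ms:.0f} ms")
    return failures


# Example: a pipeline step would roll back if any check fails.
failed = verify_deployment(DeployMetrics(requests=1000, errors=50,
                                         p99_latency_ms=800.0))
for reason in failed:
    print("verification failed:", reason)
```

Because the check returns reasons rather than a bare pass or fail, the pipeline can show developers why a release was stopped without involving an SRE.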

2. Build Just-In-Time Access Systems

Just-in-time access systems with proper audit trails and approval processes can give people outside the SRE team the access to production environments that they need. SREs then no longer have to spend time providing shadow access to others and can focus on Deep Work.
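
The core of a just-in-time access system can be sketched in a few lines: access is granted for a limited time, someone other than the requester must approve it, and every decision is recorded. The following is a minimal in-memory illustration, not a production design. The class names, the user and resource names, and the tuple format of the audit log are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class AccessGrant:
    user: str
    resource: str
    approver: str
    expires_at: datetime


@dataclass
class JitAccess:
    """In-memory sketch of time-boxed access with an audit trail."""
    grants: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def approve(self, user, resource, approver, ttl_minutes, now):
        # Approval process: requesters may not approve their own access.
        if approver == user:
            raise ValueError("requester cannot approve their own access")
        grant = AccessGrant(user, resource, approver,
                            now + timedelta(minutes=ttl_minutes))
        self.grants.append(grant)
        self.audit_log.append((now, "approved", user, resource, approver))
        return grant

    def has_access(self, user, resource, now):
        # Access exists only while an unexpired grant matches.
        allowed = any(
            g.user == user and g.resource == resource and now < g.expires_at
            for g in self.grants
        )
        self.audit_log.append((now, "checked", user, resource, allowed))
        return allowed


# Example with hypothetical names: a 30-minute grant on a production database.
t0 = datetime(2024, 5, 31, 12, 0, tzinfo=timezone.utc)
jit = JitAccess()
jit.approve("dev1", "prod-db", approver="sre1", ttl_minutes=30, now=t0)
print(jit.has_access("dev1", "prod-db", t0 + timedelta(minutes=10)))
print(jit.has_access("dev1", "prod-db", t0 + timedelta(minutes=31)))
```

Passing the current time in explicitly keeps the sketch easy to test. A real system would also persist grants and the audit log and would revoke credentials when a grant expires.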

3. Proactively Plan for Projects

SRE teams can have proper Project Management in place to prioritize important work such as improving the observability of critical production services.

4. Share the On-Call Load With R&D and Backend Engineering Teams

Sharing the on-call load with backend engineering teams lets SRE teams focus on improving tooling and documentation and on training others to handle on-call issues effectively.

5. Follow Efficient On-Call Rotations and Incident Management Processes

Following efficient on-call rotations, where only the engineers who are on call that week handle most of the on-call issues, lets other engineers focus on dedicated projects and makes Deep Work possible for the rest of the team. Clear, easy-to-follow troubleshooting guides help here as well.
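
A weekly rotation like the one described above can be computed rather than tracked by hand, so everyone can see who owns on-call issues in a given week. The sketch below assumes a simple round-robin schedule with seven-day shifts. The engineer names and the rotation start date are hypothetical.

```python
from datetime import date


def on_call_for(day: date, engineers: list[str], rotation_start: date) -> str:
    """Return the engineer on call on `day` in a weekly round-robin rotation.

    `rotation_start` is assumed to be the first day of the first shift.
    """
    if not engineers:
        raise ValueError("rotation needs at least one engineer")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    weeks_elapsed = (day - rotation_start).days // 7
    return engineers[weeks_elapsed % len(engineers)]


# Example with hypothetical names and a hypothetical start date.
team = ["asha", "bo", "chen"]
start = date(2024, 6, 3)
print(on_call_for(date(2024, 6, 12), team, start))
```

Publishing the result somewhere visible, such as a chat channel topic, gives other teams one clear person to contact instead of whichever SRE they happen to know.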

6. Create Time Blocks to Focus on Important Projects

On a personal level, individual SRE team members can block time on the calendar to focus on working on important projects to avoid randomization.

7. Provide Time and Resources for Continuous Learning

Giving time to SRE team members to learn and explore new technologies and the freedom to implement the technologies to solve reliability problems is a great way to facilitate learning. Also providing subscriptions to online learning services and books would be a great idea.

8. Allow SREs To Work on Projects of Their Choice

Allowing SRE team members to work on projects of their choice would be a great way to encourage them to do Deep Work. For example, writing features used by end users, experimenting with a new piece of technology, and working on a different team for a short term are some of the ways to implement this idea. Google famously allowed its employees to spend 20% of their time on projects of their choice. Implementing such a policy would be a great way to encourage Deep Work.

Conclusion

By following the strategies discussed in this article, Site Reliability Engineers can make room for Deep Work and find happiness, satisfaction, and reward in their work while having a greater impact on their organizations.

Opinions expressed by DZone contributors are their own.