Cracking the SRE Interview
This article discusses the skills that companies look for in a candidate when interviewing for a Site Reliability Engineering role.
This article discusses the skill set that companies expect from Site Reliability Engineers (SREs). I have worked as a Site Reliability Engineer at companies such as Amazon, Microsoft Corporation, and TikTok. I have attended numerous interviews for Site Reliability Engineering roles and have interviewed other engineers for SRE roles at the companies where I worked.
The role of Site Reliability Engineer can have different titles in various companies. For example, Google calls this role Site Reliability Engineering, Microsoft used to call this role Service Engineering, Amazon calls it Systems Development Engineer, Meta calls it Production Engineering, and a few other companies call this role DevOps. These roles have many common requirements.
Let's look into various skills that companies, especially the big technology companies, look for while interviewing engineers for these roles.
Coding
One of the important skills that SREs need to have is coding since automating repetitive tasks and writing tools to manage infrastructure efficiently is an important part of the SRE job. Companies test the candidate's coding skills through coding interviews. Usually, these interviews tend to be of two types.
The first type of coding interview focuses on standard data structures and algorithms. Coding challenges from websites like LeetCode or HackerRank are good practice for this type of interview. The second type of coding interview focuses on coding challenges that emulate some of the day-to-day tasks SREs work on, for example, reading data from files and processing it.
Companies are usually open to candidates using any programming language. Based on my experience, though, coding in Python is helpful: solutions are quick to implement in Python, and many SREs use it for day-to-day automation.
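As an illustration of the second type of coding challenge, here is a minimal sketch that counts HTTP status codes in access-log lines. The log format, the sample lines, and the function name `count_status_codes` are assumptions made up for this example, not taken from any particular interview.

```python
from collections import Counter


def count_status_codes(lines):
    """Count HTTP status codes in access-log lines.

    Assumes a hypothetical space-separated format:
    <timestamp> <method> <path> <status> <latency_ms>
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        counts[fields[3]] += 1
    return counts


# Sample data standing in for the contents of a log file (hypothetical).
sample_log = [
    "2024-01-01T00:00:01Z GET /home 200 12",
    "2024-01-01T00:00:02Z GET /login 500 250",
    "2024-01-01T00:00:03Z POST /login 200 48",
    "malformed line",
    "2024-01-01T00:00:04Z GET /home 404 9",
]

status_counts = count_status_codes(sample_log)
for status, count in status_counts.most_common():
    print(status, count)
```

With a real file, the same function can be called on an open file object, since iterating over a file yields its lines. Interviewers often follow up by asking how the approach would change for a file too large to fit in memory; processing line by line, as above, is one reasonable answer.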
System Design
The second important skill that an SRE needs to have is a solid understanding of large-scale distributed systems. Companies test this knowledge by asking system design questions during the interviews. An example question for a system design interview is "Design a logging service." These questions tend to be vague, so it is important to ask plenty of clarifying questions before proposing a design. A few key things to focus on as an SRE while designing a system are its scalability, reliability, and security. It is also important to cover the non-abstract parts of the design, such as capacity planning.
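Capacity planning for the logging service example usually comes down to back-of-the-envelope arithmetic. The sketch below shows one way to do it; every number (request rate, log line size, retention, per-node throughput, headroom) is an assumption chosen for illustration, and `estimate_capacity` is a hypothetical helper.

```python
import math


def estimate_capacity(requests_per_sec, bytes_per_log_line, retention_days,
                      per_node_ingest_mb_per_sec, headroom=0.5):
    """Rough sizing for a log ingestion tier.

    headroom is the fraction of each node's throughput kept in reserve
    for traffic spikes and node failures.
    """
    ingest_mb_per_sec = requests_per_sec * bytes_per_log_line / 1_000_000
    # 86,400 seconds per day; 1,000,000 MB per TB (decimal units).
    storage_tb = ingest_mb_per_sec * 86_400 * retention_days / 1_000_000
    usable_mb_per_sec = per_node_ingest_mb_per_sec * (1 - headroom)
    nodes = math.ceil(ingest_mb_per_sec / usable_mb_per_sec)
    return ingest_mb_per_sec, storage_tb, nodes


# Assumed inputs: 100k requests/s, one 500-byte log line per request,
# 30 days of retention, and nodes that can each ingest 20 MB/s.
ingest, storage, nodes = estimate_capacity(
    requests_per_sec=100_000,
    bytes_per_log_line=500,
    retention_days=30,
    per_node_ingest_mb_per_sec=20,
)
print(f"ingest: {ingest:.0f} MB/s, storage: {storage:.1f} TB, nodes: {nodes}")
```

Under these assumptions the service ingests 50 MB/s, needs about 129.6 TB of raw storage before replication or compression, and requires 5 ingest nodes. In an interview, stating the assumptions out loud matters more than the exact figures.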
Operating Systems
A deep understanding of operating systems, especially Linux, is invaluable for an SRE. Companies test this knowledge through interviews focused on the Linux operating system. The questions may cover topics such as common Linux commands for administration and troubleshooting, the Linux kernel, system calls, troubleshooting performance issues, and the memory, network, disk, and process subsystems of Linux.
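Questions about the memory subsystem often touch on `/proc/meminfo`, which Linux exposes as plain text. The following sketch parses text in that format and derives a memory usage percentage. To keep it runnable anywhere, it works on a shortened sample string with made-up values rather than reading the real file, and `parse_meminfo` is a hypothetical helper.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most fields are reported in kB, e.g. "MemTotal:  16000000 kB".
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


# Shortened sample with made-up values; on a Linux host the text
# could instead come from open("/proc/meminfo").read().
sample = """MemTotal:       16000000 kB
MemFree:         2000000 kB
MemAvailable:    4000000 kB
"""

info = parse_meminfo(sample)
# MemAvailable estimates memory usable without swapping, so it is a
# better basis for "memory in use" than MemFree.
used_pct = 100 * (1 - info["MemAvailable"] / info["MemTotal"])
print(f"{used_pct:.1f}% of memory in use")
```

Being able to explain why `MemAvailable` differs from `MemFree` (page cache and reclaimable memory) is the kind of detail these interviews tend to probe.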
Computer Networking
A good understanding of networking protocols and the TCP/IP model is a great skill for an SRE, as it helps in troubleshooting production issues and designing infrastructure. Protocols worth understanding in depth include HTTP, TLS, DNS, TCP, UDP, IPv4, IPv6, ARP, and ICMP. It is also useful to know which tools can be used to analyze each of these protocols.
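IP addressing questions, such as whether an address belongs to a given subnet, can be checked with Python's standard `ipaddress` module. This is a small sketch; the addresses and the helper name `describe` are chosen for illustration only.

```python
import ipaddress


def describe(address, cidr):
    """Report the IP version of an address and whether it is in a network."""
    ip = ipaddress.ip_address(address)
    network = ipaddress.ip_network(cidr)
    return {
        "version": ip.version,
        # Compare versions first so IPv4 and IPv6 are never mixed.
        "in_network": ip.version == network.version and ip in network,
    }


print(describe("10.1.2.3", "10.0.0.0/8"))
print(describe("192.168.1.1", "10.0.0.0/8"))
print(describe("2001:db8::1", "2001:db8::/32"))
```

The first and third addresses fall inside their networks, while the second does not. Working through such examples by hand, using the prefix length as a bit mask, is good preparation for the interview itself.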
SRE Best Practices
Companies often look for candidates who understand SRE best practices in areas such as observability (alerts, metrics, logs, traces, and dashboards), incident management, change management, automation, operational excellence, and capacity planning. The topics may also include concepts such as SLI/SLO/SLA and MTTR/MTTA/MTTI.
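A common follow-up to SLO questions is the error budget, the amount of unreliability an SLO permits. The sketch below shows the usual arithmetic; the SLO target, window, and request counts are assumed values, and both function names are hypothetical.

```python
def error_budget_minutes(slo_target, window_days):
    """Minutes of full downtime an availability SLO allows in a window."""
    return (1 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of a request-based error budget still unspent."""
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# Assumed example: a 99.9% SLO measured over a 30-day window.
minutes = error_budget_minutes(0.999, 30)
print(f"allowed downtime: {minutes:.1f} minutes")

# Assumed example: 1,000,000 requests so far, 250 of them failed.
remaining = budget_remaining(0.999, 1_000_000, 250)
print(f"error budget remaining: {remaining:.0%}")
```

With these inputs the SLO allows roughly 43.2 minutes of downtime per 30 days, and about 75% of the request-based budget is still available.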
Work Experience
This category includes questions about the projects you have worked on in your current and previous jobs. Interviewers typically ask about a specific project the candidate worked on and dive deep to understand aspects such as the complexity of the project, the challenges faced and how the candidate overcame them, and what the candidate learned from any failures.
Infrastructure
A key responsibility of SREs is to design, deploy, and maintain infrastructure components such as Kubernetes, SQL databases, NoSQL databases, message queues, load balancers, and Content Delivery Networks. Knowledge of and experience with major cloud services such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) is another important aspect that companies look for in a candidate. Depending on the team the position is in, companies may assess the engineer's understanding of one or more of these infrastructure components.
Troubleshooting
Being part of the on-call rotation is an essential part of an SRE's job. Effective troubleshooting skills are important because resolving user-impacting issues under time pressure is critical for maintaining the uptime of the services. SREs combine their knowledge of various technologies and systems with their experience operating services in production to troubleshoot issues. Companies assess troubleshooting skills by asking how the engineer would solve a given hypothetical issue. Approaching the problem methodically and demonstrating an understanding of distributed systems are important in this type of interview.
Behavioral
Every company has its own culture, values, and leadership principles. Behavioral interviews probe whether the engineer is a good match for the company's culture. The questions tend to focus on how the engineer acted in a particular situation in the past. An example question is "Tell me about a time when you had to disagree with your manager." A popular way to answer such questions is the STAR method, which stands for Situation, Task, Action, and Result.
Conclusion
The Site Reliability Engineering role is a challenging one that requires a deep understanding of various technologies. By focusing on these key skills, one can become a great Site Reliability Engineer, crack challenging technical interviews, and have a rewarding career. Happy interviewing!