Principles for Building Kubernetes Operators
Best practices and pitfalls to avoid when building a Kubernetes operator.
The automation of data services on Kubernetes is increasing in popularity, and running stateful workloads on K8s means using operators. But engineers are often surprised by the complexity of writing a Kubernetes operator, which impacts end users. The Data on Kubernetes 2021 Report found that the quality of Kubernetes operators was blocking companies from further expanding their data-on-Kubernetes footprint.
Anynines CEO Julian Fischer, who has built automation tools for nearly a decade, knows a lot about dealing with the complexity of running stateful workloads on cloud-native platforms and distributed infrastructure such as Kubernetes.
Julian first shares a methodology for building operators, which he calls the operational model, divided into four levels:
- Level 1: what a sysadmin or DBA would do
- Level 2: containerization, YAML + kubectl
- Level 3: writing the operator
- Level 4: operator lifecycle management
By the end of the talk, you will know how to avoid common pitfalls in data service automation, allowing you to write better Kubernetes operators from both a technical and a methodological point of view.
Julian Fischer 01:21
Well, thank you very much. Today we're talking about principles for building operators. I've already been introduced, so I don't want to bore you with that stuff, except maybe noting that we've built automation for many data services over nearly the last decade. A lot of what I'd like to share today is about general data service automation, and then we will look at the Kubernetes context here and there. So it bounces back and forth between the general topic of data service automation and Kubernetes in particular. It's usually a one-hour talk, at least, but I'm trying to get through a bit quicker today.

In general, if you talk about data service automation, one of the first things you have to do is scope it: what do you actually mean by data service automation? There's a mission statement for us at anynines, which is about fully automating the entire lifecycle of a wide range of data services to run on cloud-native platforms, across infrastructures, at scale. This is not some marketing claim here; it's an example of how data service automation needs to be scoped. With the intention, for example, to automate multiple data services, you'll also see certain sharing effects: things that you can put in a data service automation framework beyond the operator SDK, for example. And thus the context of your mission has a lot of impact. If you think about, for example, a simple Kubernetes cluster, let's say a small organization that primarily runs its applications using a Postgres database (Postgres is always my favourite example): one Kubernetes cluster, one operator, one service instance, and applications will connect to that one database. And there you go; that's a different story from the one we'd like to talk about here today.
Now imagine on-demand provisioning of dedicated service instances, where a service instance, let's say a Postgres database, is represented as a StatefulSet, and the operator allows you to create many of them. There's more complexity there, because you have more data service instances to take care of. If you then introduce more data services, for example adding RabbitMQ, MongoDB, or any other database to the set of your operators, the challenge becomes even greater.
Now, the organizations we usually work with sometimes have hundreds of thousands of employees, with thousands or even tens of thousands of developers; it's unbelievable, the number of engineers they have. And thus there will be many Kubernetes clusters: we expect dozens or hundreds of Kubernetes clusters, based on the environments we already experience. For example, in virtual-machine-based data service automation, organizations often have thousands of virtual machines running thousands of service instances. Depending on whether they are clustered, you can assume a ratio of one service instance to three pods, for example, if a small clustered instance is running. Now, at that scale, the requirements for automation change a lot, and scale matters. Even for a simple task, like making sausages and handing them out, you can imagine that just by the sheer scale, by the number of people you want to serve, the technical solution has to be adapted as well. Pretty much the same happens for data service automation. So if we think about those large environments, where a lot of those service instances are sitting around, you should never forget that each data service instance matters to someone. And therefore the automation needs to live up to a certain standard. If that standard isn't lived up to, the automation will be refused by organizations, and technology adoption will not occur.
Alright, so if we now think about data services with Kubernetes, a few topics come to mind. First, how do you implement an operator? I think the community knows how that can be done. The most straightforward way is to use Kubernetes CRDs, custom resource definitions, which allow you to teach Kubernetes new data structures, for example one describing your Postgres instances (instances in the plural, because we are provisioning dedicated instances on demand), as well as a controller that will take the specification of the object you've created and turn it into something viable. So basically, what operators do is translate the specification of a primary object, such as a Postgres instance with Postgres version 12.2, into secondary resources. And the operator SDK is, to my knowledge, the most popular way to build and generate CRDs and to get boilerplate code for your controllers. So those are the two things we have in mind when we talk about data service automation with Kubernetes. At the same time, there's KUDO.
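The translation from a primary object into secondary resources can be sketched in plain Go. This is a minimal, self-contained sketch, not the operator SDK's actual API: the `PostgresInstance` type, the map-based `store`, and the `reconcile` function are all illustrative stand-ins for a CRD, the API server, and a controller's reconcile loop.

```go
package main

import "fmt"

// PostgresInstance is a hypothetical primary resource, analogous to
// what a custom resource would describe: a name and an engine version.
type PostgresInstance struct {
	Name    string
	Version string
}

// store stands in for the Kubernetes API: secondary resources keyed by name.
type store map[string]string

// reconcile translates the primary object into its desired secondary
// resources (StatefulSet, headless Service, Secret) and creates any that
// are missing. A real controller would do this against the API server.
func reconcile(pg PostgresInstance, s store) []string {
	desired := map[string]string{
		pg.Name + "-statefulset": "postgres:" + pg.Version,
		pg.Name + "-service":     "headless",
		pg.Name + "-secret":      "credentials",
	}
	var created []string
	for name, spec := range desired {
		if _, ok := s[name]; !ok {
			s[name] = spec
			created = append(created, name)
		}
	}
	return created
}

func main() {
	s := store{}
	created := reconcile(PostgresInstance{Name: "db1", Version: "12.2"}, s)
	fmt.Println(len(created), "secondary resources created")
	// A second pass finds everything already in place and creates nothing.
	created = reconcile(PostgresInstance{Name: "db1", Version: "12.2"}, s)
	fmt.Println(len(created), "secondary resources created")
}
```

The second call returning nothing is the essential property: reconciliation converges on the desired state rather than blindly re-executing creation steps.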
If you are interested in what that is, there's a talk I gave a few weeks earlier at the DoK community meetup; KUDO is very interesting for prototyping data service automation, but it will not be covered in greater depth today. Alright, stages of development: if you develop an operator, one of the challenges is how to approach this endeavour in a systematic way. There's a simple model, which we call the operational model, divided into four levels, that helps somebody approach data service automation when doing it for the first time; a little construct to set your mind to the task. We propose that at the first level, for example when automating Postgres, the first thing you need to grasp is what a sysadmin or DBA would do. In particular, this perspective is influenced by what an application developer would want. What exactly is it that they desire? What does the average application developer expect from Postgres, for example? Do they need clustered instances with automatic failover? Do they prefer, in that case, asynchronous or synchronous replication? What kind of failover and cluster manager would you like to use: repmgr, or would you rather go with Patroni?
And, and that’s basically you’re figuring out the configuration files, you want the basic setup of Postgres, that’s all operational model level one to understand how to configure database just assuming that you have a virtual machine and you can do whatever you want, you know, install packages, and so on. So once you’ve done that, once you know how the configuration file should look like once you know, all of that could be done. You can think about containerization which could be picking existing container images and assembling them into Kubernetes specifications of stateful sets services and creating a template for creating secrets, which is the YAML part in operational model level two. So at the end operation of operation model level two, regardless of whether you’ve chosen existing container images or you’ve created them yourselves, you also have Kubernetes specifications that you can use with kubectl, to create your own service instances manually. Once you’ve done that, once you can basically create your Postgres instance let’s say with three replicas and synchronous streaming replication, you basically know how to do that manually, then you can turn it into an operator much more easily by thinking about the problem how do you write the gde that for example, creates that particular stateful set that particular headless service that particular secret for example.
Now, suppose we remind ourselves that the environment we are talking about potentially contains thousands of data service instances, across multiple data services, across many Kubernetes clusters. In that case, we also need to accept that operator lifecycle management itself is an essential part of our toolchain, and therefore we also need automation to manage the lifecycle of the operator itself. Whether this is the Operator Lifecycle Manager or some other technology doesn't matter at this point; most importantly, you need to think of this as part of your overall data service automation challenge. Now, if you think about Kubernetes operators, be reminded that a custom resource definition, basically a YAML structure, describes a new data type that can be taught to the Kubernetes API, which will then provide an endpoint to you as well as persistently store the specification in its etcd. It's not prettily formatted here, but you can see what a custom resource of a particular custom resource definition will look like: here, we taught Kubernetes how to create such an object. However, your CRD alone doesn't do anything; you need the controller, which has the code that observes events, for example that such an object has been created. The controller can then take care of finding out whether the secondary resources for this particular service instance already exist: the Service, the Secret, and the StatefulSet it wants to create. So Kubernetes controllers basically, as I said before, translate primary resources into a combination of secondary resources. In our example so far, these have been Kubernetes-internal resources, but this is not necessarily the case; we'll come back to that later.
If you want to start writing operators, the operator SDK makes a proposal about operator maturity levels, where an operator is classified into five different classes. I'm not really sure whether I would agree with the assignment of all those abilities to those classes, but if you're getting started, it's definitely a good place to begin and to ask the right questions, which are also in the documentation. Just as a hint, this will bring a few of your thoughts into a sequence, and that's very good. I think that if you really build operators, you need to get some core functionality together, for example patch-level updates as well as backup and restore functionality. Usually these are must-have criteria, and users are likely to refuse a solution that doesn't have them. But at some point you'll start with your implementation, and that plan will help you a bit, so keep it in mind.

Common pitfalls: well, there are very many common pitfalls. Let's exclude those problems that arise from programming distributed systems in general, or from organizing Git, or anything like that. I would say, from my experience, probably the biggest problem with data service automation in general is that people underestimate the complexity and effort required to do it. This has many manifestations, including insufficient coverage of essential lifecycle operations, as well as other qualities, such as robustness and observability, being insufficient. Now, at this point, it makes sense to ask what the barrier to entry actually is: what does the automation need to do in order to be accepted by the target audience?
Well, this is heavily dependent on the target audience itself, but there are a few things I can share that I've learned with our organization and that are important to many of our larger clients. We won't go through them all, because that would be time-consuming for the little time we have. Accepting configuration updates is important, to the degree that the application developer is able to express, through the automation, the things they've learned about the database and their application. Often, if the application is nontrivial, you need to tweak the database a bit so that it really utilizes the resources as well as possible. So you need to interview the target audience and find out whether these configuration options are already covered in the automation or not, and you need to be good at adapting your automation to particular needs as you gain more developers within your organization. Obviously, all the cloud-native requirements are there too, like being observable and being infrastructure-agnostic. With Kubernetes, to some degree you already get that. But in the context of backups, when you need to store the backups somewhere, you often have to write them to an object store. This is where people make assumptions about the existence of an S3 API, for example, where you should rather go with some abstraction library that hides the underlying object store.
Horizontal scalability of service instances: if you think about a service instance, you could think about a single Postgres with just a single pod, or a clustered Postgres with asynchronous streaming replication. Once you want to make that horizontal scale-out from one to three replicas, you introduce a lot of complexity into the automation, because Postgres isn't a simple service to automate, which is what makes it my favourite example. You will need to add a cluster manager for failure detection, and you need leader election and leader promotion logic to help with that. Also, if you happen to be in a data centre with different availability zones, you want to make sure that you use them: distribute your pods so that they won't end up on a single Kubernetes node, and if there are availability zones and your Kubernetes cluster is aware of them, making use of them is something that should absolutely be done if possible.

In general, reconstructing a StatefulSet will happen many times through the lifecycle, whether because of planned switchovers, upgrades, or a vertical scale-up making smaller pods larger, for example; these things need to be incorporated. Backup and restore, which we'll come back to, is obviously very important, because it is often the last resort for application developers to recover their application without waiting for the manual intervention of a platform operator. It's all about on-demand self-service, insofar as application developers can take care of themselves: create service instances, modify them, reconfigure them, and if the instances happen to fail, or the data has been deleted accidentally, recover the data within the requirements of the application, in particular its tolerance for potential data loss.
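The availability-zone point can be made concrete with a small plain-Go sketch. In a real cluster you would express this declaratively with pod anti-affinity or `topologySpreadConstraints` rather than computing placements by hand; the `spreadAcrossZones` function below is purely illustrative of the goal, namely that replicas of one clustered instance should not share a zone when enough zones exist.

```go
package main

import "fmt"

// spreadAcrossZones assigns replicas to availability zones round-robin,
// so a three-replica instance in three zones lands one pod per zone.
// It returns how many replicas each zone receives.
func spreadAcrossZones(replicas int, zones []string) map[string]int {
	counts := make(map[string]int, len(zones))
	for i := 0; i < replicas; i++ {
		counts[zones[i%len(zones)]]++
	}
	return counts
}

func main() {
	zones := []string{"zone-a", "zone-b", "zone-c"}
	// One replica per zone: losing any single zone leaves a quorum of two.
	fmt.Println(spreadAcrossZones(3, zones))
}
```

The payoff of this distribution is exactly the failure-detection scenario above: if one zone goes down, the cluster manager still has a majority of replicas available for leader election.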
One thing that’s also not very obvious sometimes is like providing the newest data service version. Let’s say the newest Postgres version is good, more progressive users will love that. But also an organization. There could be applications that are stateful for a while they are usually in maintenance mode, they don’t, you know, evolve that much and application developers, therefore, need to choose to be able to choose which data service version they want and manage the operator with the number of versions to support for how long with sunrise and sundown faces for all the supported versions of your automation is essential, you know, the policy that you will have to make with your automation. Because this also will, you know, utilize a lot of your team’s capacity if you provide too many versions. Well, documentation helps you to reduce support. But also security is important encryption and rest and encryption of transit is often demanded, where you want to have, you know, the disk data residing on the disk being encrypted of the disk isn’t used, for example, as well as the data being sent from client to the data service instance. And as well as the ports in the stateful set, for example, should all be encrypted.
Alright, so be aware that these service instances are not something that will go away quickly. That may sometimes be the case, but some instances live a long life with their applications; a service instance may live for years. And if you think about the lifecycle use cases and the things that happen to service instances, you'll get a long list; the real list is much longer than this one, but it gives you a first impression of the things that may happen: scale-out and scale-up, going through various version upgrades, binding applications to service instances, which I call service bindings here, but also dealing with network partitioning and fluctuations in network bandwidth and delay.
One more thing you have to take into consideration is service bindings: they represent the connection of an application to a data service instance. If you think, for example, about two microservices, two different applications so to say, connecting to the same service instance, it would be desirable that each of those applications has access to the data service instance, let's say Postgres, with a dedicated Postgres user, so that the secret is unique.
So there are two things you have to do in order to make such a service binding happen: you have to create a secret where the credentials are stored, and then you have to create the actual data service user.
And that has some complexity, which we'll come back to later. A similar aspect of data service automation that, in my opinion, should be represented idiomatically with CRDs in Kubernetes is backups and backup plans. If you want to create a backup for a particular service instance, describe it as a custom resource. And similar to a Job and a CronJob, a backup plan describes how to create those backups regularly.
Now, from the methodological perspective, there are certain principles in data service automation that you may benefit from if you stick to them. So before we come to the technological takeaways, these are the principles you may want to think about. First of all, as we've seen earlier: know your audience. The requirements and desired qualities are essential, and you need to understand whether you are doing something for, say, a team that is highly dependent on a particular data service. For example, I've seen companies where the entire company revolves around a single MongoDB instance, or a few of them. Building an operator for that case will be different from the context I presented earlier. So be aware that context is one of the other core ingredients of good data service automation. Choosing data services wisely is also a good thing. As I mentioned earlier, with Postgres, for example, you have to go a long way in automating it. With other data services, you may run into license issues, because vendors sometimes change licenses, some even moving away from open-source licenses, and you have to take care of that as well; with data services based on and backed by a single vendor, even under an open-source license, that may happen. In general, there's the idea of designing for rebuilding: whenever you use StatefulSets, you're already using on-demand provisioning, and whenever you change the StatefulSet, it's very likely that the pods will be recreated. That's what is addressed by rebuilding failed instances instead of fixing them.
You can use that rebuilding from a known state as a tool that helps you fix problems in some cases. Operational model first is what I explained earlier: just understand the data service before you head into automation. Being a backup and restore hero is necessary and important, because that's the means of last resort before the platform operator's telephone rings, and that should be avoided. If you happen to automate multiple data services, there's a lot of synergy between them, and that should be incorporated into a framework. This framework may have a code base shared between controllers, but also perhaps scripts that you ship in container images, or anything like that. Testing is significant, because with automation you basically write some code, that code will be distributed to many environments, and from those environments you will create many service instances. So having good test coverage, both for the code and for the resulting service instance, guiding it through use cases with integration tests or whatever you call them, is very valuable. My advice is also to have test cases for those scenarios where you know your automation still has weaknesses; share those test cases with your clients, tell them about the weaknesses, and allow them to run the tests in their local environment. That creates trust, and it also gives them the opportunity to get a better feeling for the circumstances they can monitor to avoid running into problems.
Yeah, "don't touch upstream source code" is a principle that we use because we have so many different data services; we do forks and pull requests, and temporary hotfixes are allowed. Mastering release management means that the delay from the day Postgres is released to the day your automation is released should be short. And once you have released the automation, you want to deliver that release into target environments fast, because only then can the application developer upgrade their instances.
A few words about technology; not too many, because we are running out of time. When you write a controller for the first time, the reconciliation of external resources is sometimes a bit tricky. In the example of a service binding, you need to create a secret but also a Postgres database user.
Now, there are several ways you can do that, but in general the challenge will be: how do you ensure consistency, so that both of these resources are there? I think the most straightforward answer is a two-level approach: you could, for example, represent the Postgres user as a custom resource as well, and have a controller for it, so that this controller has a single purpose, which is to reconcile those Postgres users. That reduces the complexity in the service binding controller, which creates the secret, a secondary resource already known to Kubernetes, as well as the Postgres user, which, if you have it as a CRD, is also a resource known to your Kubernetes cluster. One of the takeaways here is that, of course, there is no atomicity guaranteed; there are no transactions. So a declarative approach is more idiomatic in Kubernetes, and you can live with such an inconsistent state because you have eventual consistency: you should be able to be notified if the Postgres user couldn't be created, and then just try to reconcile again.
So make your actions idempotent, so that the loop can reconcile the same spec again without getting stuck or ending up in an error state.
So instead of plainly creating a user, create the user only if it does not already exist, for example; just a simple bit of food for thought.
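The create-if-not-exists idea can be sketched in plain Go. The `users` map and `ensureUser` function below are illustrative stand-ins for the real database operation; note that Postgres itself has no `IF NOT EXISTS` clause for `CREATE USER`/`CREATE ROLE`, so there you would first check the `pg_roles` catalog before issuing the create statement.

```go
package main

import "fmt"

// users stands in for the set of database users on a service instance.
type users map[string]bool

// ensureUser is the idempotent form of "create user": if the user already
// exists, it succeeds without doing anything, so a reconcile loop can
// replay the same service-binding spec without ending up in an error state.
// It reports whether a user was actually created.
func ensureUser(u users, name string) (created bool) {
	if u[name] {
		return false // already there: a no-op, not an error
	}
	u[name] = true
	return true
}

func main() {
	u := users{}
	fmt.Println(ensureUser(u, "app1")) // true: user created on the first pass
	fmt.Println(ensureUser(u, "app1")) // false: replayed reconcile, nothing to do
}
```

The point is the contract, not the storage: every external action a controller takes should be safe to repeat, because the reconcile loop gives no guarantee it runs exactly once per spec change.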
One last thing to share: caching. If you modify a resource in your controller, you're basically modifying it using the Kubernetes API, and your local cache may not immediately reflect the change; this can sometimes lead to strange behaviour. So be aware that there might be a lag between the update of objects in the Kubernetes API and the update of your local cache, especially if you're dealing with controllers that have some kind of hierarchical relationship.
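That lag can be simulated in a few lines of plain Go. The `cluster` type below is a hypothetical stand-in with an authoritative `api` map (the API server) and a lazily synced `cache` map (an informer cache); `reconcileOnce` shows the idiomatic reaction to a stale read, which is to signal "requeue and try again later" rather than treat the missing object as an error.

```go
package main

import "fmt"

// cluster models the split a controller sees: the authoritative API
// server versus a local informer cache that trails behind it.
type cluster struct {
	api   map[string]string // authoritative state (the API server)
	cache map[string]string // local cache, synced lazily
}

// write goes to the API server; the cache is deliberately left stale.
func (c *cluster) write(key, val string) { c.api[key] = val }

// sync lets the cache catch up, as an informer eventually would.
func (c *cluster) sync() {
	for k, v := range c.api {
		c.cache[k] = v
	}
}

// reconcileOnce reads only from the cache, as controllers typically do.
// It returns true when the expected object is visible, false to requeue.
func (c *cluster) reconcileOnce(key string) bool {
	_, ok := c.cache[key]
	return ok
}

func main() {
	c := &cluster{api: map[string]string{}, cache: map[string]string{}}
	c.write("db1-secret", "credentials")
	fmt.Println(c.reconcileOnce("db1-secret")) // false: cache still stale, requeue
	c.sync()
	fmt.Println(c.reconcileOnce("db1-secret")) // true: cache has caught up
}
```

Combined with idempotent actions, requeueing on a stale read is harmless: the next reconcile sees the synced cache and converges.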
So, to sum it up: the technology part of this talk was a bit short; we spent a lot of time on the general challenges of data service automation, how important it is to understand the target audience, and how the context affects your mission when writing your particular operator or operators. And I presented a few ideas about the things that are commonly requested. Maybe that's a good place to start: if you write your own operator, ask yourself whether that is what your application developers need, and if not, talk to them and try to understand their ideas. And then try to make your first steps. If you come back to this talk, the slides will be shared; maybe there's something you want to revisit later and benefit from.
You can find me on Twitter and ask me anything, so feel free to do so; I'm happy to chat about this topic. It's one of my passions, besides nearly electrocuting myself while trying to build a battery. So thank you very much for your attention.
Published at DZone with permission of Sylvain Kalache. See the original article here.
Opinions expressed by DZone contributors are their own.