Virtual Clusters: The Key to Taming Cloud Costs in the Kubernetes Era
In this interview, learn how virtual clusters impact cloud cost management, offering efficiency and flexibility for K8s deployments in today's economic climate.
The economic volatility in the tech industry has most enterprises looking at their cloud bills and searching for deterministic ways to drive down costs. One interesting layer of that consideration is the cloud architecture itself. The following is an interview with Loft Labs CEO Lukas Gentele, creator of the vCluster open-source project, about how virtualizing clusters is giving developers and platform teams productive new ways to right-size cloud resource utilization.
Interview
Question 1
From your point of view, what’s different about the cloud cost outlook today compared to recent years?
Answer
Besides AI driving the stock market and generating massive hype, software companies across the broader economy feel pressure to operate more efficiently. I am seeing a shift away from the “growth at all costs” mindset to prioritizing efficiency as companies are reckoning with high cloud bills after making early investments to virtualize and move to the cloud.
At the same time, research shows that most companies are still increasing their cloud infrastructure spend. Currently, a lot of that is on multi-cloud deployments, which reflects the efficiency mindset; with multi-cloud, companies avoid vendor lock-in and free themselves to find the best pricing models and the products and services that provide the most value for their operations. In cases where organizations were managing things on-premises, shifting to a multi-cloud strategy frees up IT resources. I have seen a lot of companies succeed with multi-cloud because they can scale up or down as needed with less overhead, and they can find the optimal combination of cost and performance from different providers.
I would also note that the ROI for cloud spending depends on a company's growth trajectory. Mature organizations can make bigger cloud investments without wasting money on overprovisioning resources or paying for idle compute because they have the staff and processes in place to optimize cloud resource management. So, they might be getting a lot of payoff regarding faster development and increased security, while less mature organizations may not see those benefits.
Question 2
Why is it so challenging to correlate cloud costs in today’s modern cloud-native systems and application architectures? Where are teams struggling with the mandate to lower cloud costs?
Answer
Almost every organization is likely seeing some waste in its cloud spending. One major issue is a lack of the necessary employee knowledge and training to manage cloud resources efficiently. This is also why more mature organizations see more benefit from cloud spending: they’re equipped to optimize it.
Another huge challenge is resource overprovisioning. It is incredibly common for companies to allocate more resources than they need, which raises unnecessary expenses. Similarly, paying for resources that remain idle or underused also wastes money, often because teams lack visibility into their cloud usage. Many teams do not have the right monitoring and automation tools to track patterns of overuse and to implement autoscaling that adjusts resources to workload demands.
The idea of a cloud chargeback process, or “responsibility accounting,” makes a lot of sense. If you can map cloud consumption to internal consumers — business units, departments, projects — to bill them accordingly, that is a significant incentive for every team to get smarter about their utilization. But again, this is quite challenging to implement without careful monitoring of cloud services.
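The chargeback idea boils down to a simple attribution step: map each usage record to its owning team via labels, roll up the cost, and surface whatever cannot be attributed. Here is a minimal illustrative sketch in Python; the record fields and hourly rates are hypothetical and do not reflect any specific provider's billing schema:

```python
from collections import defaultdict

# Hypothetical hourly rates per resource metric (not real provider pricing).
RATES = {"cpu_core_hours": 0.04, "gb_ram_hours": 0.005}

def chargeback(usage_records):
    """Roll up cloud cost per team from labeled usage records.

    Each record is a dict like:
      {"team": "checkout", "cpu_core_hours": 120, "gb_ram_hours": 480}
    Records without a team label fall into an "unattributed" bucket,
    which is itself a useful signal of missing cost visibility.
    """
    totals = defaultdict(float)
    for rec in usage_records:
        team = rec.get("team") or "unattributed"
        cost = sum(rec.get(metric, 0) * rate for metric, rate in RATES.items())
        totals[team] += cost
    return dict(totals)

records = [
    {"team": "checkout", "cpu_core_hours": 100, "gb_ram_hours": 200},
    {"team": "search", "cpu_core_hours": 50, "gb_ram_hours": 100},
    {"cpu_core_hours": 10, "gb_ram_hours": 20},  # unlabeled workload
]
bills = chargeback(records)  # e.g. {"checkout": 5.0, "search": 2.5, "unattributed": 0.5}
```

In practice the hard part is not the arithmetic but getting consistent labels onto every resource and feeding in accurate usage data, which is exactly the monitoring gap described above.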
Cloud cost challenges are similar to the issues with Kubernetes cost optimization and cluster sprawl that we are tackling at Loft. We saw many companies dive into Kubernetes to keep up with competitors, and they would start spinning up huge numbers of clusters, handing out a cluster for every possible workload. At that point, fleet management stops being a sufficient solution: even at a startup running 30–60 clusters, keeping Istio, Datadog, and all of these components running on every cluster is a huge IT burden. We are in a similar place with cloud spending, where resource sprawl is becoming untenable, even as companies continue spending more to gain a competitive advantage.
Question 3
What do you see in the open source community in terms of encouraging trends/patterns/new technology approaches that are bringing this cloud cost equation into better control?
Answer
The virtual cluster approach we are taking with vCluster is aimed squarely at reining in cloud costs. As with earlier virtualization trends, like the early days of Docker or even VMware, the next step is the virtualization of Kubernetes. Anyone can use open-source vCluster to make their clusters much lighter, and we also have an automatic “sleep mode,” which shuts down idle clusters so you are not wasting money running idle environments.
I also want to highlight the work CoreWeave is doing. They are a unique example of a specialized cloud provider taking a much more open approach than the big players. For instance, at last year’s KubeCon North America, they shared the inner workings of their platform and the internal open-source projects they use, which not every cloud provider would do. Unlike the legacy providers, they are also building most of their platform on Kubernetes, so they are genuinely cloud-native.
CoreWeave is focused on AI workloads, and they let customers spin up GPUs mainly to run inference workloads for AI and machine learning applications. Leading AI companies like OpenAI need Kubernetes, so CoreWeave gives their customers Kubernetes clusters. It is impressive that they are creating lightweight virtual clusters that can offer shared or dedicated GPU clusters depending on a customer’s needs. That matters because smaller companies and startups could not otherwise afford dedicated, heavyweight, large-scale GPU nodes for their workloads. The shared route means they can run their workloads for a fraction of the cost and then switch to dedicated GPU nodes when they need them.
Question 4
What is your general sense of the FinOps movement and how well or poorly “finance” and “engineers” communicate today? What’s broken? What needs to be improved?
Answer
Engineers often lack real-time insight into cloud spend, which makes cost-efficient decisions difficult. Meanwhile, finance teams may not fully grasp the technical factors driving these costs. This gap, combined with differing priorities (finance teams focus on cost savings, engineers on performance and innovation), leads to friction and misaligned goals. Additionally, the complexity of cloud environments makes it hard for both teams to track and manage costs effectively.
Tools that provide real-time visibility into cloud spend are crucial to bridge these gaps, empowering engineers to make informed decisions. Establishing FinOps practices that foster regular communication and collaboration between finance and engineering teams is also essential. Automated cost management solutions, such as vCluster’s Sleep Mode for pre-production environments, can significantly reduce costs while maintaining performance. Finally, providing education and training on FinOps principles can enhance understanding and partnership between both teams.
Question 5
How do virtual clusters and vClusters fit into this trend of cloud costs and FinOps? What’s new and different via vClusters regarding cost savings, utilization, fewer moving parts, and not burning money on idle resources?
Answer
Cost-effectiveness is one of the significant benefits of virtual clusters, so vCluster is central to this conversation. There are two main reasons for runaway cloud spending tied to Kubernetes. The first is heavyweight infrastructure components in the platform stack, like Istio and Open Policy Agent. Kubernetes is inherently expensive because there is a lot of replication, a heavy platform stack on each cluster, and each cluster has multiple compute nodes. We once talked to a company paying over $10 million yearly for a single division of their organization!
The second is idle time for clusters — nobody turns them off when they are not in use or flags clusters that are no longer needed, which is a considerable problem since Kubernetes is so heavy.
Virtual clusters make the cluster itself lightweight and ephemeral, similar to what virtual machines did for physical servers and containers did for applications; now we are bringing that to Kubernetes. Broadly, virtual clusters are isolated Kubernetes environments within a single physical cluster, enabling efficient resource sharing and better operational control. The activities of one tenant do not affect others, so users will not have to spend money fixing vulnerabilities and stabilizing the system.
Virtual clusters are also very flexible and can easily be spun up or torn down in response to tenant demands. This is great for GPU-based AI and machine learning workloads, as they often require rapid deployment and resource reconfiguration, which would be very costly with heavier physical clusters. Spinning up a vCluster takes only about six seconds, and as I mentioned, we have “sleep mode” to combat wasted spend on idle resources. vCluster can detect whether anyone is working and automatically scale down nodes so that only necessary resources are used at any given time.
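The idle-detection idea behind a sleep mode can be sketched in a few lines: if no activity has been observed within a timeout window, the environment becomes a candidate for scale-down. This is only an illustrative Python sketch of the concept, with a hypothetical threshold, not vCluster's actual implementation:

```python
from datetime import datetime, timedelta

IDLE_TIMEOUT = timedelta(minutes=30)  # hypothetical inactivity threshold

def should_sleep(last_activity: datetime, now: datetime,
                 timeout: timedelta = IDLE_TIMEOUT) -> bool:
    """Return True when an environment has been idle past the timeout.

    In a real system, `last_activity` would come from watching API-server
    requests or ingress traffic; here it is simply a timestamp supplied
    by the caller.
    """
    return now - last_activity >= timeout

now = datetime(2024, 1, 1, 12, 0)
assert should_sleep(datetime(2024, 1, 1, 11, 0), now)       # idle for 1 hour
assert not should_sleep(datetime(2024, 1, 1, 11, 45), now)  # active 15 min ago
```

The interesting engineering is in choosing a signal for "activity" that avoids both waking environments for background noise and sleeping ones that are genuinely in use.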
At a higher level, virtual clusters reduce management overhead. While managing multiple physical Kubernetes clusters is resource-intensive and complex, virtual clusters consolidate operational tasks and make managing updates, security, and compliance much easier. vCluster also gives users more granular billing capabilities, as each virtual cluster can be monitored independently. This makes attributing resource usage to specific tenants easy, enabling more equitable and accurate billing practices.
Question 6
What advice would you give to enterprises that feel the cloud providers have too much leverage against them in the cost equation? What can the enterprise do to put themselves in a better position?
Answer
Investing in thoroughly training and preparing teams to manage complex, cloud-native applications always pays off. Taking the time to compare cloud providers and to review open-source projects that can add value and streamline operations will also help enterprises avoid going all in on a big cloud provider that is not delivering optimal returns.
I advise them not to be hesitant to adopt emerging technologies, even though this can be difficult for organizations that have accumulated lots of technical debt, are running on legacy systems, or are in highly regulated industries. While shifting to new frameworks is costly up front, sticking with an inefficient cloud strategy will be much more expensive. At Loft, much of our work in evangelizing virtual clusters is about convincing people that it is the right architectural choice for today and for the future. Another virtualization layer can seem like added complexity at first, but history shows it will be well worth the effort.
If you are an enterprise today and need a server, nobody will plug in a physical server for you — the same thing will happen with Kubernetes. If you need a cluster five years from now, you will get a virtual one, except in edge cases, because they are much more cost-effective and easier to maintain. From tracking vCluster’s adoption, we can tell this is the case because in only a couple of years, we already have over 40 million virtual clusters in use and are seeing a lot of commercial success. This applies to many components of cloud-native development, so I think it is all about being proactive, anticipating future challenges with legacy systems, and working to adopt more modern solutions as soon as possible.
Opinions expressed by DZone contributors are their own.