Unpacking Our Findings From Assessing Numerous Infrastructures (Part 2)
Making superior performance accessible: get better at assessing your core infrastructure needs and find out where engineering teams often falter.
When superior performance comes at a higher price tag, innovation makes it accessible. This is quite evident from the way AWS has been evolving its services:
- gp3, the successor to gp2 volumes: Offers the same durability, supported volume size, max IOPS per volume, and max IOPS per instance. The main difference is that gp3 decouples IOPS, throughput, and volume size, and this flexibility to configure each piece independently is where the savings come in (see the sketch after this list).
- AWS Graviton3 processors: Deliver up to 25% better compute performance, up to 2x higher floating-point performance, and improved cryptographic performance compared to Graviton2. They are up to 3x faster for machine learning workloads and support DDR5 memory, which provides 50% more bandwidth than the DDR4 used with Graviton2.
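To make the decoupling concrete, here is a minimal sketch, using boto3 with a placeholder volume ID and illustrative numbers, of migrating an existing gp2 volume to gp3 in place while setting IOPS and throughput independently of its size:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Migrate an existing gp2 volume to gp3 in place, setting IOPS and
# throughput independently of the volume size.
response = ec2.modify_volume(
    VolumeId="vol-0123456789abcdef0",  # placeholder volume ID
    VolumeType="gp3",
    Iops=6000,        # gp3 baseline is 3,000 IOPS; raise it independently if needed
    Throughput=250,   # MiB/s; gp3 baseline is 125 MiB/s
)
print(response["VolumeModification"]["ModificationState"])
```

The modification runs while the volume remains attached and in use, which is what makes this an easy first optimization to pick up.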
Knowing the AWS services, however, is just half the battle in assessing your core infrastructure needs. In my previous blog, I discussed numerous areas where engineering teams often falter. Do give it a read: Unpacking Our Findings From Assessing Numerous Infrastructures – Part 1
Here is what we'll be discussing:
- Are your systems truly reliable?
- How do you respond to a security incident?
- How do you reduce defects, ease remediation, and improve flow into production? (Operational Excellence)
Are Your Systems Truly Reliable?
Nearly 67% of teams showed high risk on the questions around resilience testing, starting with the lack of basic pre-thinking about how things might fail and of plans for what you would do in that event. Of course, teams did perform root cause analysis after things actually went wrong, which we can consider learning from mistakes. But for the majority of them, there is no playbook or procedure for investigating failures and running post-incident analysis.
How Do You Plan for Disaster Recovery?
Eighty percent of the workloads we reviewed scored a high risk in this area. Despite disaster recovery being a vital necessity, many organizations avoid it due to its perceived complexity and cost. Other common reasons were insufficient time, inadequate resources, and an inability to prioritize due to a lack of skilled personnel.
An easy way to begin is by noting down the:
- Recovery point objective (RPO): How much data are you prepared to lose?
- Recovery time objective (RTO): How long a period of downtime can you tolerate and still serve your customers?
The next important step is planning and working on the recovery strategies. Let's consider a Lambda function as an example. How might you go about thinking through the various error scenarios?
- Manual deployment errors: Risk of deploying incorrect code or configuration changes.
- Cold start delay: Lambda takes time to initialize the underlying execution environment, so the first request after a period of inactivity (when idle environments have expired) takes longer to serve, resulting in a poor user experience.
- Lambda concurrency limit: Risk of hitting the default concurrency limit; once it is exceeded, further invocations are throttled and requests can be lost. (A sketch of mitigating the last two scenarios follows this list.)
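Neither of the last two scenarios fixes itself; both have configuration-level mitigations. Here is a minimal sketch, assuming boto3 and a hypothetical function name and alias, of reserving concurrency so the function cannot be starved (or starve others) and pre-warming execution environments to soften cold starts:

```python
import boto3

lambda_client = boto3.client("lambda")

# Reserve concurrency so this one function cannot consume the whole account
# limit, and other functions cannot starve it either.
lambda_client.put_function_concurrency(
    FunctionName="checkout-handler",          # hypothetical function name
    ReservedConcurrentExecutions=100,
)

# Keep a number of execution environments pre-initialized on a published
# version or alias (not $LATEST) to reduce cold-start latency.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="checkout-handler",
    Qualifier="prod",                         # alias or version number
    ProvisionedConcurrentExecutions=10,
)
```

The numbers here are purely illustrative; they should come from the load you actually expect, which is exactly the kind of pre-thinking the resilience questions are probing for.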
Or it may be about answering questions like: What happens to your application if your database goes away? Does it reconnect? Does it reconnect properly? Does it re-resolve the DNS name?
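As a rough illustration of those reconnect questions (not a drop-in implementation), here is a small Python sketch; `connect_to_db` is a hypothetical callable that opens a fresh connection, and the hostname is re-resolved on every attempt so a failed-over endpoint gets picked up:

```python
import socket
import time

def connect_with_retry(hostname: str, port: int, connect_to_db, retries: int = 5):
    """Reconnect to a database with backoff, re-resolving DNS on each attempt.

    connect_to_db is a hypothetical callable (host_ip, port) -> connection.
    """
    delay = 1.0
    for attempt in range(1, retries + 1):
        try:
            # Re-resolve the DNS name every time; a cached IP can point at a
            # database instance that no longer exists after a failover.
            host_ip = socket.getaddrinfo(hostname, port)[0][4][0]
            return connect_to_db(host_ip, port)
        except OSError as exc:
            if attempt == retries:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
            delay *= 2  # exponential backoff between attempts
```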
While the cloud does take away much of the “heavy lifting” of infrastructure management, it does not manage your application and business requirements for you.
Some Best Practices To Follow
- Be aware of unchangeable (fixed) service quotas, service constraints, and physical resource limits to prevent service interruptions or financial overruns.
- Validate your backup integrity and processes by performing recovery tests.
- Ensure a sufficient gap exists between the current quotas and your maximum usage to accommodate failover (a minimal headroom check is sketched below).
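As a minimal example of that last practice, assuming boto3 credentials are configured, you can compare the account-level Lambda concurrency quota against the peak concurrency actually observed in CloudWatch:

```python
from datetime import datetime, timedelta, timezone

import boto3

lambda_client = boto3.client("lambda")
cloudwatch = boto3.client("cloudwatch")

# Account-wide concurrency quota.
quota = lambda_client.get_account_settings()["AccountLimit"]["ConcurrentExecutions"]

# Peak concurrent executions observed over the last 14 days.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="ConcurrentExecutions",
    StartTime=datetime.now(timezone.utc) - timedelta(days=14),
    EndTime=datetime.now(timezone.utc),
    Period=3600,
    Statistics=["Maximum"],
)
peak = max((point["Maximum"] for point in stats["Datapoints"]), default=0)

headroom = quota - peak
print(f"Quota: {quota}, observed peak: {peak}, headroom: {headroom}")
if headroom < quota * 0.2:  # arbitrary 20% buffer; tune it for your failover needs
    print("Warning: less than 20% headroom left for failover traffic")
```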
How Do You Respond to a Security Incident?
75% of technology teams are not doing a good job of responding to security incidents. They are not planning ahead for what is happening in the security landscape. Only 30% of teams knew what tooling they would use to either mitigate or investigate a security incident.
Here, we're talking about security incidents caused by exploited frameworks. Some of the common tell-tale signs we observed were:
- Allowing untrusted code execution on your machines.
- Failure to set up adequate access controls on storage services, for example, data leakage from an S3 bucket that was inadvertently made public (a minimal guardrail for this case is sketched after this list).
- Accidental exposure of API keys, such as when checked into a public Git repository.
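For the S3 case in particular, a cheap guardrail is to enable Block Public Access on the bucket. A minimal sketch with boto3 and a placeholder bucket name:

```python
import boto3

s3 = boto3.client("s3")

# Block all forms of public access on the bucket, regardless of ACLs or
# bucket policies added later.
s3.put_public_access_block(
    Bucket="example-data-bucket",  # placeholder bucket name
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```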
Another aspect of security is understanding the health of your workload, which implies monitoring and telemetry. In this framework, we differentiate between user behavior monitoring and real user monitoring on the one hand, and workload behavior monitoring on the other. This is notable because teams are undoubtedly collecting all sorts of data but are not doing much with it.
- More than half of them have clearly defined their KPIs, but fewer have actually established baselines for what normal looks like.
- The number drops further when it comes to setting up alerts on those monitored items (a minimal example follows this list).
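Once a baseline exists, wiring an alert to it is a small step. A minimal sketch, assuming the application already publishes a p99 latency metric under a hypothetical custom namespace and an SNS topic already exists for notifications:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when p99 latency stays above the established baseline (500 ms here,
# purely illustrative) for three consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="api-p99-latency-above-baseline",
    Namespace="MyApp/API",        # hypothetical custom namespace
    MetricName="LatencyP99",
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=500.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",  # missing telemetry is itself a signal
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder ARN
)
```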
Then comes access control and granting least privilege. Although teams understood what work people do and what access they should have, not many were following through on it. There was an absolute absence of:
- Role-Based Access Mechanism
- Multi-factor authentication
- Rotation of passwords, and
- Use of secret vaults like AWS Secrets Manager or HashiCorp Vault (secrets were instead simply baked into application config), etc.
In short, automation of credential management is pretty much nonexistent.
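For contrast with config-baked credentials, here is a minimal sketch of fetching a database credential from AWS Secrets Manager at startup; the secret name is a placeholder, and rotation can then happen on the Secrets Manager side without touching application config:

```python
import json

import boto3

secrets = boto3.client("secretsmanager")

# Fetch the credential at runtime instead of baking it into application config.
response = secrets.get_secret_value(SecretId="prod/orders-db")  # placeholder secret name
credentials = json.loads(response["SecretString"])

db_user = credentials["username"]
db_password = credentials["password"]
```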
How Do You Reduce Defects, Ease Remediation, and Enhance the Production Deployment Process?
Yes, finally, we are talking about the operational excellence pillar. People are generally familiar with version control and are (mostly) using Git. They run a lot of automated testing in their CI, primarily smoke tests and integration tests.
Operational excellence focuses on defining, executing, measuring, and improving the standard operating procedures in response to incidents and client requests. Following the DevOps philosophy is not enough if the tools and workflows don’t support it. The absence of proper documentation and sole dependence on DevOps engineers to use automation has led to burnout. DevOps engineers manually stitching solutions for every situation has resulted in slow workflow development and brittle operations.
As per Gartner, platform engineering is an emerging trend within digital transformation efforts that “improves developer experience and productivity by providing self-service capabilities with automated infrastructure operations.” Beyond commercial hype, an Internal Developer Platform is a curated set of tools, capabilities, and processes packaged together for easy consumption by development teams. Reduced human dependency and standardized workflows empower engineering teams to scale efficiently.
I guess the primary takeaway for us from the reviews was that today people are better at building platforms than they are at securing or running them. This is the real lesson, and there's a high chance it applies to you as well.
What’s Next?
Over time, your workloads evolve to accommodate demanding business needs and customers who rely on them heavily, making it all the more necessary to ensure they remain secure, reliable, and performant.
You should totally try the Well-Architected Tool that's available right in your AWS console. You can begin by working through its questions and following the linked information to better understand your own practices.
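The review itself is question-driven in the console, but the Well-Architected Tool also has an API. As a rough sketch, assuming you have already defined workloads in the tool, you could pull out the questions currently flagged as high risk:

```python
import boto3

wa = boto3.client("wellarchitected")

# For each workload defined in the Well-Architected Tool, list the questions
# flagged as high risk under the standard lens.
for workload in wa.list_workloads()["WorkloadSummaries"]:
    answers = wa.list_answers(
        WorkloadId=workload["WorkloadId"],
        LensAlias="wellarchitected",  # the standard AWS Well-Architected lens
    )["AnswerSummaries"]
    high_risk = [a["QuestionTitle"] for a in answers if a.get("Risk") == "HIGH"]
    print(workload["WorkloadName"], "high-risk questions:", high_risk)
```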
Strip the 'AWS label' off the tool, and you're left with best practices that help you deliver a consistent approach to architecting secure and scalable systems on the AWS Cloud.