SRE with 5+ years across the cloud-native and consulting industries. Specialised in GitOps, multi-cloud Kubernetes at scale, cost optimisation, and automation.

I’m Yaiser Avila Rodríguez, a Site Reliability Engineer with 5+ years building and operating large-scale, multi-cloud Kubernetes infrastructure across AWS, GCP, and other providers. Currently working on multi-region production fleets in a cloud-native streaming-analytics environment.
My focus is on turning manual ops into automated, observable systems — from GitOps adoption and progressive delivery platforms to fleet-wide cost-optimisation initiatives that have delivered measurable savings and reclaimed significant engineering time.
Outside work I run InfraBio, a personal brand where I share SRE content and tooling.
Hydrolix · Remote
Triggle Spain SLU · Spain
Knowmad mood · Spain
Accenture · Spain
live SRE demos · running in your browser · no backend
Est. Resource Waste
mid fleet · CAST AI: 99.94% over-provisioned · Datadog 2024: 83% idle · Kubecost: 35–50% baseline
35%
auto-calculated
VPA/HPA Implementation
VPA 25.9% + HPA 7.5% of bill · 150 clusters
GitOps Automation
3 SREs × 30 hrs/release × 12 · 150 clusters
Alerts / Runbooks / Auto-Remediation
5% auto-resolved/mo · 35 min · 150 clusters
Projected Annual Savings
$287K vs. 1 SRE-month investment ($16K)
payback in ~2.9 weeks
CAST AI · Datadog · Kubecost benchmarks
Year-1 financial model
ROI: 1.8× · self-funds in ~29 wk
What you actually buy
“Don’t hire an SRE to react to incidents. Hire one to build the systems that prevent them — and fund their own salary while doing it.”
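The headline numbers above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, not the calculator's actual code: it assumes payback is the investment divided by weekly savings, that "self-funds in" is 52 weeks divided by the ROI multiple, and that the "× 12" in the GitOps line means 12 releases per year. All function names are hypothetical.

```python
WEEKS_PER_YEAR = 52


def payback_weeks(investment: float, annual_savings: float) -> float:
    """Weeks until cumulative savings cover the up-front investment."""
    return investment / (annual_savings / WEEKS_PER_YEAR)


def self_fund_weeks(roi_multiple: float) -> float:
    """Weeks until returns equal cost, assuming returns accrue evenly over a year."""
    return WEEKS_PER_YEAR / roi_multiple


def release_hours_per_year(engineers: int, hours_per_release: float, releases: int) -> float:
    """Engineer-hours spent on releases per year (assumed: 12 releases/year)."""
    return engineers * hours_per_release * releases


if __name__ == "__main__":
    print(round(payback_weeks(16_000, 287_000), 1))  # $16K investment vs. $287K/year
    print(round(self_fund_weeks(1.8)))               # ROI of 1.8x
    print(release_hours_per_year(3, 30, 12))         # 3 SREs x 30 hrs/release x 12
```

Under these assumptions the sketch reproduces the ~2.9-week payback and the ~29-week self-funding figure, and puts the manual release effort at 1,080 engineer-hours per year.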
Dashboard panels: Traffic (Request Rate) · Errors (Error Rate) · Latency (P99 Latency) · Saturation (CPU Saturation)
synthetic data · real Prometheus patterns · 4 SRE golden signals (Latency · Traffic · Errors · Saturation)
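As a rough illustration of how the four golden signals could be derived from raw request samples, here is a minimal sketch using only the standard library. The `Request` record, the `golden_signals` function, the 5xx-only error definition, and the nearest-rank percentile are all assumptions made for this example; they are not the demo's actual implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def golden_signals(requests, window_s, cpu_used_cores, cpu_limit_cores):
    """Summarise one observation window into the four golden signals."""
    if not requests:
        raise ValueError("no requests in window")
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumed: only 5xx count as errors
    return {
        "traffic_rps": total / window_s,
        "error_rate": errors / total,
        "p99_latency_ms": percentile([r.latency_ms for r in requests], 99),
        "cpu_saturation": cpu_used_cores / cpu_limit_cores,
    }


if __name__ == "__main__":
    # Synthetic window: 100 requests in 10 s, two server errors, one slow outlier.
    sample = [Request(latency_ms=20.0 + i, status=200) for i in range(97)]
    sample += [Request(900.0, 200), Request(30.0, 500), Request(35.0, 503)]
    print(golden_signals(sample, window_s=10, cpu_used_cores=1.5, cpu_limit_cores=2.0))
```

Nearest-rank keeps the percentile an actual observed sample, which is easy to reason about on small synthetic windows; a histogram-based estimate would be the more usual choice at production volumes.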
Triggle Spain SLU (2024)
EKS clusters were running version 1.24, which had entered extended support, priced at 6× the standard rate. With no upgrade plan in place, that surcharge was accruing every month.
Oversaw the full upgrade path from EKS 1.24 to 1.29 using Terraform and Velero for workload backup. Managed a team of 3 engineers automating key infrastructure components and autoscaling group configurations throughout the process.
82.5% reduction in extended support costs. Enhanced system scalability and reliability post-upgrade.
Terraform, Terraformer, Velero, Kubernetes (EKS), AWS Auto Scaling
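EKS updates the control plane one minor version at a time, so going from 1.24 to 1.29 means five sequential hops rather than a single jump. The sketch below only enumerates those hops; `upgrade_hops` is a hypothetical helper written for this example and makes no AWS calls.

```python
def upgrade_hops(current: str, target: str) -> list[tuple[str, str]]:
    """Return the (from, to) minor-version hops between two Kubernetes versions."""
    cur_major, cur_minor = (int(p) for p in current.split("."))
    tgt_major, tgt_minor = (int(p) for p in target.split("."))
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("target must share the major version and not be older than current")
    return [
        (f"{cur_major}.{m}", f"{cur_major}.{m + 1}")
        for m in range(cur_minor, tgt_minor)
    ]


if __name__ == "__main__":
    for src, dst in upgrade_hops("1.24", "1.29"):
        # In practice each hop would be a reviewed infrastructure change,
        # with a workload backup taken before it is applied.
        print(f"{src} -> {dst}")
```

Treating every hop as its own change leaves room for a backup and a health check between versions.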
Triggle Spain SLU (2023)
The CI/CD process was entirely manual, prone to human errors, and slow — delaying the deployment of new features to production.
Automated the full CI/CD pipeline with concurrent builds and Docker layer caching. Overhauled deployment methodologies and introduced automation scripts that streamlined the end-to-end deployment process.
Builds and deployments became 5× faster. Enhanced consistency and reliability across all deployments.
AWS CodeBuild, Bitbucket, Lambda, API Gateway, Python, Bash
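The gain from concurrent builds can be pictured with a small sketch. `build` is a hypothetical stand-in that just sleeps; in a real pipeline it would trigger a CI job and wait for it to finish. The service names and worker count are likewise illustrative, and nothing here reflects the actual pipeline code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def build(service: str) -> str:
    """Hypothetical stand-in for one build job."""
    time.sleep(0.1)  # simulate build time
    return f"{service}: ok"


def build_all(services, max_workers=4):
    """Run builds concurrently; results come back in the same order as `services`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(build, services))


if __name__ == "__main__":
    start = time.perf_counter()
    print(build_all(["api", "web", "worker", "batch"]))
    # With four workers the four simulated builds overlap instead of running back to back.
    print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

Threads fit this case because each job mostly waits on an external system rather than using local CPU.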
Triggle Spain SLU (2024)
Over-provisioned resources, suboptimal allocation, and the lack of effective autoscaling and cleanup policies were generating significant unnecessary AWS spend.
Led a cost reduction initiative across AWS accounts: enhanced autoscaling capabilities, implemented cleanup lifecycle policies, automated resource cleanup via crons, and revised resource allocation to match actual usage patterns.
27% reduction in overall platform costs via autoscaling and cleanup policies. Improved budget efficiency and resource utilisation.
AWS, ArgoCD, CodeBuild, Terraform, CAST AI
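A cron-driven cleanup job usually comes down to a selection rule plus a delete call. The sketch below shows only the selection rule, under assumptions made for this example: resources are plain dicts with `id`, `created`, and an optional `keep` flag, and anything older than 30 days without that flag is considered stale. The real policies and thresholds are not described on this page.

```python
from datetime import datetime, timedelta, timezone


def stale_resources(resources, now, max_age_days=30):
    """Return ids of resources older than the cutoff that are not marked to keep."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        r["id"]
        for r in resources
        if r["created"] < cutoff and not r.get("keep", False)
    ]


if __name__ == "__main__":
    now = datetime(2024, 6, 1, tzinfo=timezone.utc)
    inventory = [  # hypothetical inventory records
        {"id": "snap-old", "created": now - timedelta(days=90)},
        {"id": "snap-pinned", "created": now - timedelta(days=90), "keep": True},
        {"id": "snap-new", "created": now - timedelta(days=2)},
    ]
    print(stale_resources(inventory, now))
```

Keeping the selection logic as a pure function lets it be dry-run and tested before any delete call is wired in.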


Open to new roles, collaborations, or a conversation about SRE, Kubernetes, and infrastructure automation.