Site Reliability Engineer Blog/Portfolio

Engineering reliability at scale across multi-cloud Kubernetes fleets

SRE with 5+ years across the cloud-native and consulting industries. Specialised in GitOps, multi-cloud Kubernetes at scale, cost optimisation, and automation.


About Me

Yaiser Avila Rodríguez

I’m Yaiser Avila Rodríguez, a Site Reliability Engineer with 5+ years building and operating large-scale, multi-cloud Kubernetes infrastructure across AWS, GCP, and other providers. Currently working on multi-region production fleets in a cloud-native streaming-analytics environment.

My focus is on turning manual ops into automated, observable systems — from GitOps adoption and progressive delivery platforms to fleet-wide cost-optimisation initiatives that have delivered measurable savings and reclaimed significant engineering time.

Outside work I run InfraBio, a personal brand where I share SRE content and tooling.

5+ · Years SRE experience
Multi-cloud · Multi-cloud K8s fleets
GitOps · Cost-optimisation focus
Fleet-scale · Significant cost savings delivered
1,400+ · GitHub commits

Work Experience

Site Reliability Engineer

Hydrolix

2024 – Present

Remote

  • Drove a Vertical Pod Autoscaler rollout across the production Kubernetes fleet using a phased, region-gated approach
  • Led company-wide GitOps adoption with ArgoCD as the unified continuous delivery model
  • Co-designed and built a fleet-wide progressive delivery platform on ArgoCD + Kustomize + Argo Workflows — health-gated, automated rollouts with one-click rollback
  • Built and operated Prometheus + Grafana observability for production environments
  • Designed an automated OOM-remediation system that surfaces GitOps changes automatically based on observed signals
  • Owned EMEA on-call and incident command; significantly reduced triage time using AI-assisted tooling; supported capacity planning for major scheduled events
  • Sustained high-volume contribution cadence across code, internal tooling, and technical documentation over 2 years
Kubernetes (multi-cloud), ArgoCD, Go, Pulumi, Prometheus, Grafana, Argo Workflows, VPA, Helm, Terraform

DevOps / Cloud Engineer

Triggle Spain SLU

2023 – 2024

Spain

  • Upgraded EKS clusters 1.24 → 1.29, reducing extended support costs by 82.5%
  • Led 27% AWS platform cost reduction via autoscaling policies, cleanup crons, and resource right-sizing
  • Managed a team of 3 engineers automating key infrastructure components
  • Implemented Grafana OnCall + Prometheus alerting, replacing paid UptimeRobot subscription at zero cost
Kubernetes (EKS), Terraform, AWS, ArgoCD, Prometheus, Grafana, CAST AI

DevOps Engineer

Knowmad mood

2023

Spain

  • Reduced Jenkins build times from 20 min to 5 min (75%) by integrating Docker layer cache stored in S3
  • Implemented S3 cleanup lifecycle policies to control cache storage costs
Docker, Jenkins, AWS S3, AWS IAM

DevOps Engineer

Accenture

2022

Spain

  • Built Branch Inspector tool using GitLab API + Python — cut CI build times by 80%
  • Reduced EC2 Jenkins agent usage by 10% through stale branch detection and automated cleanup alerts
  • Integrated Kubernetes ConfigMaps, Logstash, Prometheus, and Grafana dashboards for branch monitoring
Python, Kubernetes, GitLab API, Logstash, Prometheus, Grafana, AWS SES
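The stale-branch detection mentioned in the Accenture bullets can be sketched roughly as below. This is an illustration only, not the original Branch Inspector code: the function name `find_stale_branches`, the 30-day threshold, and the sample records are assumptions. The record shape (`name`, `protected`, `commit.committed_date`) mirrors what the GitLab branches API returns.

```python
from datetime import datetime, timedelta, timezone


def find_stale_branches(branches, now, max_age_days=30):
    """Return names of unprotected branches whose last commit is older than max_age_days.

    `branches` is a list of dicts shaped like GitLab branch records.
    The 30-day default is an illustrative assumption, not a value from the original tool.
    """
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for branch in branches:
        if branch.get("protected"):
            continue  # never flag protected branches such as main
        last_commit = datetime.fromisoformat(branch["commit"]["committed_date"])
        if last_commit < cutoff:
            stale.append(branch["name"])
    return stale


# Hypothetical sample data, for illustration only.
SAMPLE_BRANCHES = [
    {"name": "main", "protected": True,
     "commit": {"committed_date": "2022-01-10T09:00:00+00:00"}},
    {"name": "feature/old-login", "protected": False,
     "commit": {"committed_date": "2022-02-01T12:00:00+00:00"}},
    {"name": "feature/new-api", "protected": False,
     "commit": {"committed_date": "2022-05-28T08:30:00+00:00"}},
]

if __name__ == "__main__":
    reference_time = datetime(2022, 6, 1, tzinfo=timezone.utc)
    print(find_stale_branches(SAMPLE_BRANCHES, reference_time))
```

Keeping the check a pure function over already-fetched records makes it easy to test without network access; the API call and the alerting step would sit around it.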
$ ls ~/sandbox/

live SRE demos · running in your browser · no backend

cost-impact-calculator
interactive · industry benchmarks
cost · model
Clusters: 150
Monthly Compute Bill (USD): $50K
SRE Team Size: 3 engineers
SRE Hourly Rate (USD): $100/hr

Est. Resource Waste

mid fleet · CAST AI: 99.94% over-provisioned · Datadog 2024: 83% idle · Kubecost: 35–50% baseline

35%

auto-calculated

VPA/HPA Implementation

VPA 25.9% + HPA 7.5% of bill · 150 clusters

+$200K/yr

GitOps Automation

3 SREs × 30 hrs/release × 12 · 150 clusters

+$81K/yr

Alerts / Runbooks / Auto-Remediation

5% auto-resolved/mo · 35 min · 150 clusters

+$5K/yr

Projected Annual Savings

$287K

vs. 1 SRE-month investment ($16K)

payback in ~2.9 weeks

CAST AI · Datadog · Kubecost benchmarks

// why hire yaiser? · the business case

Year-1 financial model

Projected savings: $287K
Senior SRE cost (~$100/hr): -$160K
Net Year-1 value: +$127K

ROI

1.8×

self-funds in

~29wk
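The figures above can be checked with simple arithmetic. The sketch below is an assumption about how the inputs combine, not the calculator's actual source. It derives only the VPA/HPA line from the bill, and takes the $287K total, the $16K investment, and the $160K senior SRE cost as stated on the page, because the formulas behind the GitOps and auto-remediation lines are not shown.

```python
# Inputs copied from the page; the function names are illustrative.
MONTHLY_BILL_USD = 50_000
VPA_SHARE = 0.259              # "VPA 25.9% ... of bill"
HPA_SHARE = 0.075              # "HPA 7.5% of bill"
PROJECTED_SAVINGS_USD = 287_000
INVESTMENT_USD = 16_000        # 1 SRE-month
SENIOR_SRE_COST_USD = 160_000


def autoscaling_savings(monthly_bill, vpa_share, hpa_share):
    """Annual savings if VPA and HPA each recover a fixed share of the bill."""
    return monthly_bill * 12 * (vpa_share + hpa_share)


def weeks_to_recover(cost, annual_savings):
    """Weeks of savings needed to cover a one-off or annual cost."""
    return cost / annual_savings * 52


def roi_multiple(annual_savings, cost):
    """Savings expressed as a multiple of cost."""
    return annual_savings / cost


if __name__ == "__main__":
    print(round(autoscaling_savings(MONTHLY_BILL_USD, VPA_SHARE, HPA_SHARE)))
    print(round(weeks_to_recover(INVESTMENT_USD, PROJECTED_SAVINGS_USD), 1))
    print(round(weeks_to_recover(SENIOR_SRE_COST_USD, PROJECTED_SAVINGS_USD)))
    print(round(roi_multiple(PROJECTED_SAVINGS_USD, SENIOR_SRE_COST_USD), 1))
    print(PROJECTED_SAVINGS_USD - SENIOR_SRE_COST_USD)
```

Under these assumptions the VPA/HPA line comes to about $200K a year, payback on the one-month investment to roughly 2.9 weeks, and the senior SRE cost is recovered in about 29 weeks at a 1.8× return, which matches the numbers shown.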

What you actually buy

5+ yrs SRE across multi-cloud Kubernetes at scale
GitOps adoption — production apps under ArgoCD from day 1
VPA + HPA rollout framework, ready to deploy
On-call, runbooks & auto-remediation included
Savings compound year-over-year, no extra headcount
Track record of significant fleet-wide cost reductions
↗ Book a 30-min intro call

Don’t hire an SRE to react to incidents. Hire one to build the systems that prevent them — and fund their own salary while doing it.

ci-pipeline.yml
github-actions
push → main · ubuntu-latest · free tier
1. actions/checkout@v4
2. setup-node@v4 (node 20)
3. npm ci
4. npm run lint
5. tsc --noEmit
6. npm run build
↗ view ci.yml
kubectl get all
kubernetes
ns/default · 4/4 Running
web-app-0 · Running · 45m CPU · 2h20m
web-app-1 · Running · 52m CPU · 1h26m
web-app-2 · Running · 38m CPU · 50m · 1R
metrics-0 · Running · 12m CPU · 20h13m
golden-signals
prometheus · grafana
cluster: production · Prometheus · simulated
● live

Traffic

Request Rate

healthy
1,200

Errors

Error Rate

healthy
0.18%

Latency

P99 Latency

healthy
118ms

Saturation

CPU Saturation

healthy
42%

synthetic data · real Prometheus patterns · 4 SRE golden signals (Latency · Traffic · Errors · Saturation)
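As a rough illustration of how the four signals could be derived from raw samples, here is a minimal sketch. It is not the demo's code: the function name, its parameters, and the synthetic inputs are all assumptions, and P99 is computed with the simple nearest-rank method rather than Prometheus histogram interpolation.

```python
def golden_signals(latencies_ms, error_count, window_seconds,
                   cpu_used_millicores, cpu_capacity_millicores):
    """Summarise one observation window into the four golden signals.

    latencies_ms holds one entry per request served in the window.
    """
    total = len(latencies_ms)
    if total == 0:
        raise ValueError("no requests in window")
    ordered = sorted(latencies_ms)
    # Nearest-rank percentile: ceil(0.99 * total), done in integer arithmetic.
    rank = (99 * total + 99) // 100
    return {
        "traffic_rps": total / window_seconds,
        "error_rate_pct": 100 * error_count / total,
        "p99_latency_ms": ordered[rank - 1],
        "cpu_saturation_pct": 100 * cpu_used_millicores / cpu_capacity_millicores,
    }


if __name__ == "__main__":
    # Synthetic window: 200 requests with latencies of 1..200 ms, one failure.
    sample = golden_signals(
        latencies_ms=list(range(1, 201)),
        error_count=1,
        window_seconds=10,
        cpu_used_millicores=4200,
        cpu_capacity_millicores=10000,
    )
    for name, value in sample.items():
        print(f"{name}: {value}")
```

In a real Prometheus setup these values would normally come from rate and quantile queries over counters and histograms; the sketch only shows what each signal measures.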

Flagship Projects

-82.5% extended support cost

Triggle Spain SLU (2024)

EKS Upgrade 1.24 → 1.29 — 82.5% Extended Support Cost Reduction

Problem

EKS clusters were running version 1.24, which had entered extended support — priced at 6× the standard rate. With no upgrade plan in place, costs were compounding monthly.

Solution

Oversaw the full upgrade path from EKS 1.24 to 1.29 using Terraform and Velero for workload backup. Managed a team of 3 engineers automating key infrastructure components and autoscaling group configurations throughout the process.

Outcome

82.5% reduction in extended support costs. Enhanced system scalability and reliability post-upgrade.

Technologies Used

Terraform, Terraformer, Velero, Kubernetes (EKS), AWS Auto Scaling
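EKS control planes can only be upgraded one minor version at a time, so a move from 1.24 to 1.29 implies five sequential hops. The sketch below plans that sequence; the function name is illustrative, and the per-hop work (Velero backups, node group and add-on upgrades, validation) is not modelled.

```python
def plan_upgrade_path(current, target):
    """Return the ordered (from, to) minor-version hops between two versions.

    Versions are "major.minor" strings such as "1.24".
    """
    cur_major, cur_minor = (int(part) for part in current.split("."))
    tgt_major, tgt_minor = (int(part) for part in target.split("."))
    if cur_major != tgt_major:
        raise ValueError("major version changes are not handled by this sketch")
    if tgt_minor < cur_minor:
        raise ValueError("target version is older than the current version")
    return [
        (f"{cur_major}.{minor}", f"{cur_major}.{minor + 1}")
        for minor in range(cur_minor, tgt_minor)
    ]


if __name__ == "__main__":
    for step, (src, dst) in enumerate(plan_upgrade_path("1.24", "1.29"), start=1):
        print(f"hop {step}: {src} -> {dst}")
```

Each hop would typically be applied and verified on its own before starting the next, which is where a workload backup per hop earns its keep.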

Triggle Spain SLU (2023)

CI/CD Automation and Deployment Acceleration

Problem

The CI/CD process was entirely manual, prone to human errors, and slow — delaying the deployment of new features to production.

Solution

Automated the full CI/CD pipeline with concurrent builds and Docker layer caching. Overhauled deployment methodologies and introduced automation scripts that streamlined the end-to-end deployment process.

Outcome

Builds and deployments became 5× faster. Enhanced consistency and reliability across all deployments.

Technologies Used

AWS CodeBuild, Bitbucket, Lambda, API Gateway, Python, Bash

-27% platform costs

Triggle Spain SLU (2024)

AWS Platform Cost Optimisation — 27% Reduction

Problem

Over-provisioned resources, suboptimal allocation, and the lack of effective autoscaling and cleanup policies were generating significant unnecessary AWS spend.

Solution

Led a cost reduction initiative across AWS accounts: enhanced autoscaling capabilities, implemented cleanup lifecycle policies, automated resource cleanup via crons, and revised resource allocation to match actual usage patterns.

Outcome

27% reduction in overall platform costs via autoscaling and cleanup policies. Improved budget efficiency and resource utilisation.

Technologies Used

AWS, ArgoCD, CodeBuild, Terraform, CAST AI

Tools & Technologies

Continuous Delivery

Kubernetes
Docker
ArgoCD
Argo Workflows
Helm

Cloud & Infrastructure as Code

AWS
Azure
GCP
Linode
Terraform
Pulumi
Ansible
Linux

Observability & Reliability

Prometheus
Grafana

Automation & AI

Go
Python
GitHub
GitLab
Jenkins
Claude
Codex

Let’s work together

Open to new roles, collaborations, or a conversation about SRE, Kubernetes, and infrastructure automation.