CKA · Terraform Associate · AWS Certified

Engineering reliability at scale across multi-cloud Kubernetes fleets

SRE with 5+ years of experience and a biology background, specialising in Kubernetes at scale, GitOps, cost optimisation, and SRE automation.

About Me

I’m Yaiser Avila Rodríguez, a Site Reliability Engineer with 5+ years of experience and a biology background, building and operating large-scale, multi-cloud Kubernetes infrastructure across AWS, GCP, and other providers. I currently work on multi-region production fleets in a cloud-native streaming-analytics environment (Hydrolix, since Jul 2024).

My focus is on turning manual ops into automated, observable systems: from GitOps adoption and progressive delivery platforms to fleet-wide cost-optimisation initiatives, such as a VPA/HPA rollout that cut compute cost by 80% and automation that saved 100+ engineering hours.

Outside work I run InfraBio, a personal brand where I share SRE content and tooling. I also hold an MSc in Plant Biology and an MSc in Plant Breeding — the same systems-thinking mindset, different substrate.

LinkedIn · GitHub · Email

5+ Years SRE experience

300+ Multi-cloud K8s clusters

20,000+ Pods under management

80% Compute cost cut via VPA/HPA rollout

1,400+ GitHub commits

Work Experience

Site Reliability Engineer

Hydrolix

Jul 2024 – Present

Switzerland · Remote

▸Drove a Vertical Pod Autoscaler rollout across the production Kubernetes fleet using a phased, region-gated approach
▸Led company-wide GitOps adoption with ArgoCD as the unified continuous delivery model
▸Co-designed and built a fleet-wide progressive delivery platform on ArgoCD + Kustomize + Argo Workflows — health-gated, automated rollouts with one-click rollback
▸Built and operated Prometheus + Grafana observability for production environments
▸Designed an automated OOM-remediation system that surfaces GitOps changes automatically based on observed signals
▸Owned EMEA on-call and incident command; significantly reduced triage time using AI-assisted tooling; supported capacity planning for major scheduled events
▸Delivered 10× faster deployments and saved 100+ hours through automation
▸Sustained high-volume contribution cadence across code, internal tooling, and technical documentation over 2 years

Kubernetes (multi-cloud) · ArgoCD · Go · Pulumi · Prometheus · Grafana · Argo Workflows · VPA · Helm · Terraform

Senior DevOps Engineer

Triggle Spain SLU

Jul 2023 – Jul 2024

Spain

▸Upgraded EKS clusters 1.24 → 1.29, reducing extended support costs by 82.5%
▸Led 27% AWS platform cost reduction via autoscaling policies, cleanup crons, and resource right-sizing
▸Managed a team of 3 engineers automating key infrastructure components
▸Implemented Grafana OnCall + Prometheus alerting, replacing a paid UptimeRobot subscription at no additional cost

Kubernetes (EKS) · Terraform · AWS · ArgoCD · Prometheus · Grafana · CAST AI

Senior DevOps Engineer

atSistemas Consulting

Feb 2023 – Jul 2023

Spain

▸Maintained on-prem and AWS services for an airline customer, automated deployments, and ensured systems could scale
▸Mentored 2 junior SREs

Python · Bash · CloudFormation · Ansible · Kubernetes · Jenkins · Docker · Datadog

Jr. Site Reliability Engineer / DevOps

Accenture

Feb 2022 – Feb 2023

Spain

▸Built Branch Inspector tool using GitLab API + Python — cut CI build times by 80%
▸Reduced EC2 Jenkins agent usage by 10% through stale branch detection and automated cleanup alerts
▸Integrated Kubernetes ConfigMaps, Logstash, Prometheus, and Grafana dashboards for branch monitoring

Python · Kubernetes · GitLab API · Logstash · Prometheus · Grafana · AWS SES

ls ~/sandbox/

live SRE demos · running in your browser · no backend

cost-impact-calculator

interactive · industry benchmarks

cost · model

Clusters: 150

Monthly Compute Bill (USD): $50K

SRE Team Size: 3 engineers

SRE Hourly Rate (USD): $100/hr

Est. Resource Waste

mid fleet · CAST AI: 99.94% over-provisioned · Datadog 2024: 83% idle · Kubecost: 35–50% baseline

35%

auto-calculated

VPA/HPA Implementation

VPA 25.9% + HPA 7.5% of bill · 150 clusters

+$200K/yr

GitOps Automation

3 SREs × 30 hrs/release × 12 · 150 clusters

+$81K/yr

Alerts / Runbooks / Auto-Remediation

5% auto-resolved/mo · 35 min · 150 clusters

+$5K/yr

Projected Annual Savings

$287K

vs. 1 SRE-month investment ($16K)

payback in ~2.9 weeks

CAST AI · Datadog · Kubecost benchmarks
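
The VPA/HPA line item and the payback period can be reproduced from the inputs shown above. The following is a minimal sketch, not the calculator's actual implementation: the function names are hypothetical, and the $287K total is taken as displayed, since the page does not expose the full formulas behind the GitOps and auto-remediation line items.

```python
def vpa_hpa_savings(monthly_bill_usd, vpa_share=0.259, hpa_share=0.075):
    """Annual savings if VPA and HPA together trim the stated shares of the bill."""
    return monthly_bill_usd * 12 * (vpa_share + hpa_share)


def payback_weeks(investment_usd, annual_savings_usd):
    """Weeks of savings needed to cover a one-off investment."""
    return investment_usd / annual_savings_usd * 52


# Inputs as displayed: $50K monthly bill, $16K investment, $287K projected savings.
annual_vpa_hpa = vpa_hpa_savings(50_000)
weeks = payback_weeks(16_000, 287_000)

print(f"VPA/HPA savings: ${annual_vpa_hpa:,.0f}/yr")   # about $200K/yr
print(f"Payback: ~{weeks:.1f} weeks")                   # about 2.9 weeks
```

The three displayed line items ($200K, $81K, $5K) sum to $286K rather than $287K, which is consistent with each figure having been rounded for display.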

// why hire yaiser? the business case

Year-1 financial model

Projected savings: $287K

Senior SRE cost (~$100/hr): -$160K

Net Year-1 value: +$127K

ROI

1.8×

self-funds in

~29wk
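
The Year-1 figures above follow from the two displayed amounts. The sketch below assumes that ROI here means projected savings divided by cost (a gross multiple), which is inferred only because it matches the displayed 1.8×; the function name is hypothetical.

```python
def year_one_model(savings_usd, sre_cost_usd):
    """Return (net value, ROI multiple, weeks until savings cover the cost)."""
    net = savings_usd - sre_cost_usd
    roi = savings_usd / sre_cost_usd              # assumed definition: gross multiple
    self_fund_weeks = sre_cost_usd / savings_usd * 52
    return net, roi, self_fund_weeks


net, roi, weeks = year_one_model(287_000, 160_000)
print(f"Net Year-1 value: ${net:,}")      # $127,000
print(f"ROI: {roi:.1f}x")                 # 1.8x
print(f"Self-funds in ~{weeks:.0f} wk")   # ~29 wk
```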

What you actually buy

✓5+ yrs SRE across multi-cloud Kubernetes at scale

✓GitOps adoption — production apps under ArgoCD from day 1

✓VPA + HPA rollout framework, ready to deploy

✓On-call, runbooks & auto-remediation included

✓Savings compound year-over-year, no extra headcount

✓Track record of significant fleet-wide cost reductions

↗ Book a 30-min intro call

“Don’t hire an SRE to react to incidents. Hire one to build the systems that prevent them — and fund their own salary while doing it.”

ci-pipeline.yml

github-actions

push → main · ubuntu-latest · free tier

·1 actions/checkout@v4

·2 setup-node@v4 (node 20)

·3 npm ci

·4 npm run lint

·5 tsc --noEmit

·6 npm run build

kubectl get all

kubernetes

ns/default · 4/4 Running

●web-app-0 · Running · 45m CPU · 2h20m

●web-app-1 · Running · 52m CPU · 1h26m

●web-app-2 · Running · 38m CPU · 50m · 1R

●metrics-0 · Running · 12m CPU · 20h13m

golden-signals

prometheus · grafana

cluster: production · Prometheus · simulated

● live

Traffic: Request Rate 1,200 (healthy)

Errors: Error Rate 0.18% (healthy)

Latency: P99 Latency 118ms (healthy)

Saturation: CPU Saturation 42% (healthy)

synthetic data · real Prometheus patterns · 4 SRE golden signals (Latency · Traffic · Errors · Saturation)
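
One way a panel like this could label each signal is by comparing the reading against an upper threshold. The sketch below is illustrative only: the readings are copied from the panel, while the `Signal` class, the `status` function, and every threshold are hypothetical and do not come from the demo.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signal:
    name: str
    value: float
    unit: str
    warn_above: Optional[float]  # hypothetical threshold; None means no upper bound is checked


def status(signal: Signal) -> str:
    """Label a reading 'healthy' unless it exceeds its (assumed) threshold."""
    if signal.warn_above is None or signal.value <= signal.warn_above:
        return "healthy"
    return "degraded"


# Readings from the panel; thresholds are invented for illustration.
panel = [
    Signal("Request Rate", 1200, "", None),
    Signal("Error Rate", 0.18, "%", 1.0),
    Signal("P99 Latency", 118, "ms", 300),
    Signal("CPU Saturation", 42, "%", 80),
]

for s in panel:
    print(f"{s.name}: {s.value}{s.unit} -> {status(s)}")
```

With these assumed thresholds all four readings come out as healthy, matching the labels on the panel.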

Flagship Projects

-82.5% extended support cost

Triggle Spain SLU (2024)

EKS Upgrade 1.24 → 1.29 — 82.5% Extended Support Cost Reduction

Problem

EKS clusters were running version 1.24, which had entered extended support, priced at 6× the standard rate. With no upgrade plan in place, the extra cost kept accruing every month.

Solution

Oversaw the full upgrade path from EKS 1.24 to 1.29 using Terraform and Velero for workload backup. Managed a team of 3 engineers automating key infrastructure components and autoscaling group configurations throughout the process.

Outcome

82.5% reduction in extended support costs. Enhanced system scalability and reliability post-upgrade.

Technologies Used

Terraform, Terraformer, Velero, Kubernetes (EKS), AWS Auto Scaling

Triggle Spain SLU (2023)

CI/CD Automation and Deployment Acceleration

Problem

The CI/CD process was entirely manual, slow, and prone to human error, delaying the deployment of new features to production.

Solution

Automated the full CI/CD pipeline with concurrent builds and Docker layer caching. Overhauled deployment methodologies and introduced automation scripts that streamlined the end-to-end deployment process.

Outcome

Builds and deployments ran 5× faster. Improved consistency and reliability across all deployments.

Technologies Used

AWS CodeBuild, Bitbucket, Lambda, API Gateway, Python, Bash

-27% platform costs

Triggle Spain SLU (2024)

AWS Platform Cost Optimisation — 27% Reduction

Problem

Over-provisioned resources, suboptimal allocation, and the lack of effective autoscaling and cleanup policies were generating significant unnecessary AWS spend.

Solution

Led a cost reduction initiative across AWS accounts: enhanced autoscaling capabilities, implemented cleanup lifecycle policies, automated resource cleanup via crons, and revised resource allocation to match actual usage patterns.