Site Reliability Engineer Blog/Portfolio

Providing the best project deployment and reliability experience

As a Site Reliability Engineer with over 3.5 years of experience at leading tech companies such as Accenture, atSistemas, and currently Triggle Spain SLU, I specialize in optimizing system reliability and efficiency.


Flagship projects

EKS Quantic Update and Leadership

Triggle Spain SLU (2024)

Problem

Our AWS EKS clusters were running a Kubernetes version that had fallen into extended support, which is six times more expensive than standard support. The goal of this project was to eliminate that extra cost.

Solution

I oversaw the upgrade of the EKS clusters from version 1.24 to 1.29, which, together with more efficient use of Auto Scaling groups, reduced extended support costs by 82.5%. I have also been managing a team of three engineers automating key infrastructure components.
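EKS upgrades the control plane one minor version at a time, so moving from 1.24 to 1.29 means five sequential upgrades rather than a single jump. The following is a minimal sketch of how that path can be computed for planning purposes. The function name `upgrade_path` is hypothetical and is not part of any AWS tooling or of the original project.

```python
def upgrade_path(current: str, target: str) -> list[str]:
    """Return every minor version to step through, in order, ending at target."""
    cur_major, cur_minor = (int(part) for part in current.split("."))
    tgt_major, tgt_minor = (int(part) for part in target.split("."))
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("only forward upgrades within one major version are supported")
    # One entry per intermediate minor version, because each step is its own upgrade.
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


if __name__ == "__main__":
    for version in upgrade_path("1.24", "1.29"):
        print(f"upgrade control plane and node groups to {version}")
```

Each step in the list is a natural checkpoint for taking a backup (for example with Velero) and validating workloads before continuing.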

Outcome

82.5% reduction in extended support costs due to cluster updates -> Enhanced system scalability and reliability

Technologies Used

Terraform, Terraformer, Velero, Kubernetes (EKS), AWS Auto Scaling

CI/CD Automation and Deployment Acceleration

Triggle Spain SLU (2023)

Problem

Initially, our CI/CD process was entirely manual, prone to human errors, and slow, delaying the deployment of new features to production.

Solution

Focusing on improving operational efficiency, I spearheaded the automation of the continuous integration and continuous deployment (CI/CD) pipelines and shortened build times using concurrency and Docker caching. This project involved overhauling existing deployment methodologies and introducing automation scripts that streamlined the deployment process.

Outcome

Deployments and builds roughly 5x faster -> Enhanced consistency and reliability in deployments

Technologies Used

AWS CodeBuild, Bitbucket, AWS Lambda, API Gateway, Python, Bash

AWS Cost Optimization

Triggle Spain SLU (2024)

Problem

Platform costs across our AWS accounts were higher than they needed to be. This likely arose from over-provisioning, suboptimal resource allocation, and the lack of effective autoscaling and cleanup policies, which resulted in unnecessary expenses. The initiative aimed to tackle these inefficiencies by implementing targeted optimizations to improve how resources were managed and utilized.

Solution

In my role at Triggle Spain SLU, a cloud-native company serving the tourism sector, I led a major initiative to reduce platform costs across AWS accounts. By implementing targeted optimizations and refining resource usage, I achieved a 27% reduction in overall platform costs. Key strategies included enhancing autoscaling capabilities, applying cleanup policies, automating cleanups with cron jobs, and revising our resource allocation to better fit usage patterns.

Outcome

27% cost reduction through autoscaling policies, 10% cost savings through cleanup cron jobs -> Improved budget efficiency and resource utilization

Technologies Used

AWS, Argo CD, AWS CodeBuild, Terraform, CAST AI

Free Platform for On-Call Duties

Triggle Spain SLU (2024)

Problem

We were facing significant expenses from paying for an UptimeRobot subscription to cover on-call duties throughout the week.

Solution

We leveraged our existing Prometheus and Grafana setup by installing Grafana OnCall and connecting it to a Telegram channel. This integration triggers a call every time a service goes down.
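Grafana OnCall ships its own Telegram integration, so no custom code is needed for the setup described here. Purely as an illustration of what a "service down" notification looks like at the Telegram Bot API level, here is a minimal sketch that builds a `sendMessage` request. The function name, the message wording, and the token and chat ID values are assumptions, not details from the original setup.

```python
import json
import urllib.request


def build_telegram_alert(bot_token: str, chat_id: str, service: str) -> urllib.request.Request:
    """Build (but do not send) a Telegram Bot API sendMessage request."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    payload = {"chat_id": chat_id, "text": f"ALERT: {service} is down"}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder credentials; a real bot token and chat ID are required to send.
    request = build_telegram_alert("<bot-token>", "<chat-id>", "checkout-api")
    print(request.full_url)
```

Passing the returned request to `urllib.request.urlopen` would deliver the message; the sketch stops short of that so it can run without credentials.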

Outcome

Zero Cost: This approach eliminated the subscription costs of the on-call platform, leaving us with only the expenses for developing and maintaining the tool and its integration.

Technologies Used

Prometheus, Grafana, Telegram

Project Branch Inspector

Accenture (2022)

Problem

Build times were slow because repositories had many undeleted branches, creating bottlenecks during dependency tracking. A tool was needed to alert developers about excessive branches, flag those over 30 days old as stale, and send cleanup reminder emails.

Solution

I developed a tool utilizing the GitLab API and Python, as the community version of GitLab lacks this functionality. It identifies old branches based on a 30-day threshold. Configurations are stored in a Kubernetes ConfigMap, integrated into a cron job that processes and sends log data to Logstash, Prometheus, and Grafana. An alarm system alerts branch owners and flags branches for deletion.
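The core of the 30-day check can be sketched as a small function over the JSON that the GitLab branches endpoint (`GET /projects/:id/repository/branches`) returns, where each branch carries a `commit` object with `committed_date` and `author_email`. This is a minimal sketch under those assumptions, not the original tool; the function name `find_stale_branches` and the output shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=30)  # threshold described in the text above


def find_stale_branches(branches: list[dict], now: datetime) -> list[dict]:
    """Return name, owner and age for branches whose last commit is older than 30 days."""
    stale = []
    for branch in branches:
        commit = branch["commit"]
        # committed_date is an ISO 8601 timestamp with a UTC offset.
        last_commit = datetime.fromisoformat(commit["committed_date"])
        age = now - last_commit
        if age > STALE_AFTER:
            stale.append(
                {
                    "name": branch["name"],
                    "owner": commit["author_email"],
                    "age_days": age.days,
                }
            )
    return stale


if __name__ == "__main__":
    sample = [
        {
            "name": "feature/old",
            "commit": {
                "committed_date": "2024-01-01T10:00:00+00:00",
                "author_email": "dev@example.com",
            },
        }
    ]
    print(find_stale_branches(sample, datetime.now(timezone.utc)))
```

Taking `now` as a parameter keeps the function deterministic, which makes it easy to run from a cron job and to test with fixed dates.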

Outcome

This tool cut build times by 80%, boosted productivity, reduced EC2 usage for Jenkins agents by 10%, and enabled managers to monitor branch statuses through Grafana dashboards.

Technologies Used

Python, Kubernetes, GitLab API, Logstash, Prometheus, Grafana, AWS SES

Enhancing Development Times Using Docker Build Cache Stored in an S3 Bucket

Knowmad mood (2023)

Problem

We encountered the issue of prolonged build times in Jenkins for projects using Python and Node.js, which averaged around 20 minutes.

Solution

To address this issue, I integrated a Docker cache into the pipelines. This cache stores the layers from the build process in an S3 bucket.
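One way to store build layers in S3 is the BuildKit `s3` cache backend, used through `docker buildx build` with `--cache-to` and `--cache-from`. The sketch below only assembles such a command; it assumes this backend rather than whatever mechanism the original pipelines used, and the bucket, region, cache name, and image tag are placeholders. As far as I know, the S3 backend also requires a builder other than the default `docker` driver.

```python
def buildx_s3_cache_command(image: str, bucket: str, region: str, cache_name: str) -> list[str]:
    """Assemble a docker buildx command that reads and writes layer cache in S3."""
    cache = f"type=s3,region={region},bucket={bucket},name={cache_name}"
    return [
        "docker", "buildx", "build",
        "--cache-from", cache,              # reuse layers from earlier builds
        "--cache-to", f"{cache},mode=max",  # export all layers, including intermediate ones
        "--tag", image,
        ".",
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    command = buildx_s3_cache_command("my-app:latest", "my-cache-bucket", "eu-west-1", "my-app")
    print(" ".join(command))
```

In a Jenkins stage, the resulting list could be passed to `subprocess.run(command, check=True)`, with IAM permissions on the bucket and an S3 lifecycle policy to expire old cache objects.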

Outcome

This enhancement significantly reduced the build time from an average of 20 minutes to 5 minutes, achieving a fourfold increase in speed.

Technologies Used

Docker, Jenkins, AWS S3, AWS IAM, S3 Cleanup Policies

LinkedIn Auto-Posts

Personal project - Personal Brand: InfraBio (2024)

Problem

As part of my hobby and personal branding strategy, I wanted to post regularly on my LinkedIn account without having to publish each post by hand.

Solution

This tool comprises two main components: firstly, a CRUD interface to manage the publication of posts, and secondly, a worker in Cloudflare that retrieves posts from the database and publishes them on LinkedIn using its API. This setup allows me to spend a few hours each week creating content, while the worker automatically publishes the posts at optimal times and days. This arrangement frees up my time to learn and explore other interests.
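The worker's job boils down to "find posts that are due, publish them, mark them as published". The real worker runs on Cloudflare Workers against Turso; the sketch below uses Python and the standard `sqlite3` module only to illustrate that selection step. The `posts` table, its columns, and both function names are hypothetical, and the LinkedIn API call itself is left out.

```python
import sqlite3


def due_posts(conn: sqlite3.Connection, now_iso: str) -> list[tuple[int, str]]:
    """Return (id, body) for unpublished posts scheduled at or before now_iso."""
    return conn.execute(
        "SELECT id, body FROM posts "
        "WHERE published = 0 AND publish_at <= ? "
        "ORDER BY publish_at",
        (now_iso,),
    ).fetchall()


def mark_published(conn: sqlite3.Connection, post_id: int) -> None:
    """Flag a post so the next worker run skips it."""
    conn.execute("UPDATE posts SET published = 1 WHERE id = ?", (post_id,))
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT, publish_at TEXT, published INTEGER DEFAULT 0)"
    )
    conn.execute("INSERT INTO posts (body, publish_at) VALUES ('Hello', '2024-05-01T09:00:00')")
    for post_id, body in due_posts(conn, "2024-05-02T00:00:00"):
        print(f"would publish post {post_id}: {body}")
        mark_published(conn, post_id)
```

Storing `publish_at` as an ISO 8601 string keeps the comparison a plain string comparison, which SQLite handles without any date functions.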

Outcome

This tool has significantly increased my productivity and enhanced my presence on the network, making it easier to manage my personal brand.

Technologies Used

React.js, Node.js, Turso SQLite, Cloudflare Workers, Wrangler