Site Reliability Engineer Blog/Portfolio
As a Site Reliability Engineer with over 3.5 years of experience at leading tech companies such as Accenture and atSistemas, and currently at Triggle Spain SLU, I specialize in optimizing system reliability and efficiency.
Flagship projects
EKS Quantic Update and Leadership
Triggle Spain SLU (2024)
Problem
In this project, I worked to reduce our AWS EKS costs: the clusters were running a Kubernetes version in extended support, which is 6 times more expensive than standard support.
Solution
I oversaw the upgrade of our EKS clusters from version 1.24 to 1.29, which moved them out of extended support and reduced those costs by 82.5%, helped by more efficient use of autoscaling groups. Additionally, I have been managing a team of three engineers automating key infrastructure components.
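Amazon EKS upgrades a cluster's control plane one Kubernetes minor version at a time, so going from 1.24 to 1.29 means five consecutive upgrades. The minimal Python sketch below only plans that sequence; the function names `parse_minor` and `upgrade_path` are hypothetical helpers, not part of any AWS tooling, and nothing here calls the EKS API.

```python
def parse_minor(version: str) -> tuple[int, int]:
    """Split a Kubernetes version string such as "1.24" into (major, minor)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def upgrade_path(current: str, target: str) -> list[str]:
    """Return every minor version to upgrade through, in order.

    EKS only moves the control plane forward by one minor version per
    update, so each intermediate version has to be visited.
    """
    cur_major, cur_minor = parse_minor(current)
    tgt_major, tgt_minor = parse_minor(target)
    if cur_major != tgt_major:
        raise ValueError("only minor-version upgrades within one major version are supported")
    if tgt_minor <= cur_minor:
        raise ValueError("target must be newer than the current version")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


if __name__ == "__main__":
    for step in upgrade_path("1.24", "1.29"):
        # A real rollout would also upgrade node groups and add-ons between steps.
        print(f"upgrade control plane to {step}")
```

Planning the path explicitly makes it easy to schedule compatibility checks (node groups, add-ons, deprecated APIs) for each hop rather than treating the jump as a single change.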
Outcome
82.5% reduction in extended support costs due to cluster updates -> Enhanced system scalability and reliability
Technologies Used
Terraform, Terraformer, Velero, Kubernetes (EKS), AWS Auto Scaling
CI/CD Automation and Deployment Acceleration
Triggle Spain SLU (2023)
Problem
Initially, our CI/CD process was entirely manual, slow, and prone to human error, which delayed the deployment of new features to production.
Solution
Focusing on operational efficiency, I spearheaded the automation of the continuous integration and continuous deployment (CI/CD) pipelines and reduced build times using concurrency and Docker caching. This project involved overhauling existing deployment methodologies and introducing automation scripts that streamlined the deployment process.
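One common way to connect Bitbucket to CodeBuild through API Gateway and Lambda is a handler that reads the push webhook and starts a build for the pushed commit. The sketch below assumes a Bitbucket Cloud push payload and an API Gateway proxy integration; the `BRANCH_TO_PROJECT` mapping and project names are hypothetical. The `start_build` callable is injected so the routing logic can be tested without AWS; in Lambda it would be `boto3.client("codebuild").start_build`, which accepts `projectName` and `sourceVersion`.

```python
import json
from typing import Callable

# Hypothetical mapping from branch name to CodeBuild project name.
BRANCH_TO_PROJECT = {
    "main": "app-production",
    "develop": "app-staging",
}


def builds_for_push(payload: dict) -> list[tuple[str, str]]:
    """Return (project, commit hash) pairs for the branches we build.

    Assumes the Bitbucket Cloud push shape:
    payload["push"]["changes"][i]["new"] = {"type": "branch", "name": ..., "target": {"hash": ...}}
    """
    jobs = []
    for change in payload.get("push", {}).get("changes", []):
        new = change.get("new")
        # "new" is null when a branch is deleted; tags are skipped too.
        if not new or new.get("type") != "branch":
            continue
        project = BRANCH_TO_PROJECT.get(new["name"])
        if project:
            jobs.append((project, new["target"]["hash"]))
    return jobs


def make_handler(start_build: Callable[..., dict]):
    """Create a Lambda handler that starts one CodeBuild build per matching branch."""
    def handler(event, context):
        payload = json.loads(event.get("body") or "{}")
        started = []
        for project, commit in builds_for_push(payload):
            response = start_build(projectName=project, sourceVersion=commit)
            started.append(response["build"]["id"])
        return {"statusCode": 200, "body": json.dumps({"started": started})}
    return handler
```

In the Lambda module this would be wired as `handler = make_handler(boto3.client("codebuild").start_build)`. CodeBuild can also subscribe to Bitbucket webhooks directly; a Lambda in front mainly helps when branch-to-project routing needs custom logic.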
Outcome
Deployment and build times reduced fivefold -> Enhanced consistency and reliability in deployments
Technologies Used
AWS CodeBuild, Bitbucket, AWS Lambda, API Gateway, Python, Bash
AWS Cost Optimization
Triggle Spain SLU (2024)
Problem
Our AWS platform costs were higher than necessary. This likely stemmed from over-provisioning, suboptimal resource allocation, and the lack of effective autoscaling and cleanup policies, all of which created unnecessary expenses. The initiative aimed to tackle these inefficiencies with targeted optimizations to how resources were managed and utilized.
Solution
In my role at Triggle Spain SLU, a cloud-native company serving the tourism sector, I led a major initiative to reduce platform costs across AWS accounts. By implementing targeted optimizations and refining resource usage, I achieved a 27% reduction in overall platform costs. Key strategies included enhancing autoscaling, applying cleanup policies, automating cleanups with cron jobs, and revising resource allocation to better fit usage patterns.
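To illustrate the autoscaling side, assuming pod-level scaling with the Kubernetes HorizontalPodAutoscaler (the project description does not say which autoscaler was tuned), the HPA computes desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue) and skips scaling when that ratio is within a tolerance of 1.0 (0.1 by default). The function name and default bounds below are illustrative, not taken from the project.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Replica count following the HorizontalPodAutoscaler scaling rule."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: the HPA leaves the replica count unchanged.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds, like minReplicas/maxReplicas on the HPA object.
    return max(min_replicas, min(max_replicas, desired))
```

Most of the savings from this rule come from the scale-down direction: when load drops, replicas shrink toward `min_replicas`, and a node autoscaler can then remove the emptied nodes.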
Outcome
27% cost reduction through autoscaling policies, 10% cost savings through efficient cleanup cron jobs -> Improved budget efficiency and resource utilization
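The cleanup cron jobs are not described in detail. As a hypothetical example, a scheduled job might look for unattached EBS volumes older than a retention window. The sketch works on dictionaries shaped like the `Volumes` entries returned by boto3's `ec2.describe_volumes` (`VolumeId`, `State`, `CreateTime`), where `State == "available"` means the volume is not attached to any instance; the 14-day default is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def stale_unattached_volumes(volumes: list[dict],
                             max_age: timedelta = timedelta(days=14),
                             now: Optional[datetime] = None) -> list[str]:
    """Return IDs of volumes that are unattached and older than max_age.

    Each item mirrors an entry of describe_volumes()["Volumes"]:
    "State" is "available" when nothing is attached, and "CreateTime"
    is a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - max_age
    return [
        v["VolumeId"]
        for v in volumes
        if v["State"] == "available" and v["CreateTime"] < cutoff
    ]
```

Run from a cron job, the returned IDs could be passed to `ec2.delete_volume(VolumeId=...)`; a safer rollout reports candidates for a while before deleting anything.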
Technologies Used
AWS, Argo CD, AWS CodeBuild, Terraform, CAST AI
Free On-Call Platform
Triggle Spain SLU (2024)
Problem
We were incurring significant expenses from an UptimeRobot subscription used for on-call duties throughout the week.
Solution
We leveraged our existing Prometheus and Grafana setup by installing Grafana OnCall and connecting it to a Telegram channel. This integration triggers a call every time a service goes down.
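In a Prometheus setup, "a service goes down" usually means the `up` metric for a scrape target drops to 0. As a sketch of that detection step only (in the setup described here, alert routing is handled by Grafana and Grafana OnCall, not custom code), the function below reads the JSON that the Prometheus HTTP API returns for an instant query of `up` and lists the targets that are down. The function name is hypothetical.

```python
def down_targets(response: dict) -> list[str]:
    """Return "job/instance" for every target whose `up` value is 0.

    `response` is the JSON body of GET /api/v1/query?query=up, where each
    sample's "value" is a [timestamp, "string value"] pair.
    """
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response.get('error')}")
    down = []
    for sample in response["data"]["result"]:
        _timestamp, value = sample["value"]
        if float(value) == 0:
            labels = sample["metric"]
            down.append(f"{labels.get('job', '?')}/{labels.get('instance', '?')}")
    return down
```

In practice the same condition is expressed as an alert rule on `up == 0`, and the firing alert is what Grafana OnCall escalates to the Telegram channel.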
Outcome
Zero Cost: This approach eliminated the costs associated with the on-call platform, leaving only the expense of developing and maintaining the tool and its integration.
Technologies Used
Prometheus, Grafana, Telegram
Project Branch Inspector
Accenture (2022)
Problem
Builds were slow because repositories had many undeleted branches, which created bottlenecks during dependency tracking. We needed a tool to alert developers to excessive branches and to flag branches older than 30 days as stale, prompting cleanup emails.
Solution
I developed a tool using Python and the GitLab API, since GitLab Community Edition lacks this functionality. It flags branches as stale once they pass a 30-day threshold. Its configuration is stored in a Kubernetes ConfigMap and consumed by a cron job that processes the branch data and sends the results to Logstash, Prometheus, and Grafana. An alarm system alerts branch owners and flags branches for deletion.
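The core of such a check can be sketched in a few lines. This assumes the JSON list returned by GitLab's `GET /projects/:id/repository/branches` endpoint (fields `name`, `default`, `protected`, and a `commit` object with `committed_date` and `committer_email`) and treats the last committer as the branch "owner", which is an assumption since GitLab branches have no owner field. The function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=30)  # threshold from the project description


def parse_gitlab_date(value: str) -> datetime:
    """Parse GitLab's ISO 8601 timestamps into timezone-aware datetimes."""
    # Python < 3.11 rejects a trailing "Z", so normalise it to an explicit offset.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Treat any timestamp without an offset as UTC.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def stale_branches_by_author(branches: list[dict], now: datetime) -> dict[str, list[str]]:
    """Group stale branch names by the email of their last committer.

    Default and protected branches are never flagged.
    """
    stale = defaultdict(list)
    for branch in branches:
        if branch.get("default") or branch.get("protected"):
            continue
        commit = branch["commit"]
        if now - parse_gitlab_date(commit["committed_date"]) > STALE_AFTER:
            stale[commit["committer_email"]].append(branch["name"])
    return dict(stale)
```

Grouping by email maps directly onto one notification per developer. Note that the branches endpoint is paginated (20 items per page by default, up to 100 with `per_page`), so a real job needs to follow the pages.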
Outcome
This tool cut build times by 80%, boosted productivity, reduced EC2 usage for Jenkins agents by 10%, and enabled managers to monitor branch statuses through Grafana dashboards.
Technologies Used
Python, Kubernetes, GitLab API, Logstash, Prometheus, Grafana, AWS SES
Enhancing Development Times Using Docker Build Cache Stored in an S3 Bucket
Knowmad mood (2023)
Problem
Jenkins builds for our Python and Node.js projects took too long, averaging around 20 minutes.
Solution
To address this issue, I integrated a Docker cache into the pipelines. This cache stores the layers from the build process in an S3 bucket.
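The description does not say exactly how the cache was wired into Jenkins. One way to do it, assuming BuildKit's `s3` cache backend, is to pass `--cache-from` and `--cache-to` to `docker buildx build`. That backend requires a builder with a non-default driver (for example one created with `docker buildx create --driver docker-container`). The Python sketch below only assembles the command; the image, bucket, region, and cache name are placeholders.

```python
import shlex


def buildx_command(image: str, bucket: str, region: str, cache_name: str,
                   context: str = ".") -> list[str]:
    """Assemble a `docker buildx build` call that reads and writes layer cache in S3."""
    cache = f"type=s3,region={region},bucket={bucket},name={cache_name}"
    return [
        "docker", "buildx", "build",
        "--tag", image,
        "--cache-from", cache,
        # mode=max also exports intermediate layers, not only the final image's.
        "--cache-to", cache + ",mode=max",
        # Push the built image to its registry; the cache goes to S3 separately.
        "--push",
        context,
    ]


if __name__ == "__main__":
    cmd = buildx_command("registry.example.com/app:latest", "ci-build-cache",
                         "eu-west-1", "app")
    print(shlex.join(cmd))
```

Because the cache lives in S3 rather than on a particular agent, any Jenkins agent can reuse it, and a lifecycle rule on the bucket keeps old cache objects from accumulating.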
Outcome
This enhancement significantly reduced the build time from an average of 20 minutes to 5 minutes, achieving a fourfold increase in speed.
Technologies Used
Docker, Jenkins, AWS S3, AWS IAM, S3 Lifecycle Policies
LinkedIn Auto-Posts
Personal project - Personal Brand: InfraBio (2024)
Problem
As part of my hobby and personal branding strategy, I wanted to publish on LinkedIn regularly without spending time on manual posting, so I developed a tool that automatically posts to my LinkedIn account.
Solution
This tool has two main components: a CRUD interface for managing post publication, and a Cloudflare Worker that retrieves posts from the database and publishes them on LinkedIn through its API. This setup lets me spend a few hours each week creating content while the worker automatically publishes the posts at optimal times and days, freeing up time to learn and explore other interests.
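The worker's central decision is which posts are due on each run. The sketch below expresses that selection in Python for illustration only. The actual worker runs on Cloudflare Workers against Turso, and the `Post` fields here are hypothetical, not the tool's real schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Post:
    id: int
    text: str
    scheduled_at: datetime  # when the post should go out
    published: bool = False


def due_posts(posts: list[Post], now: datetime, limit: int = 1) -> list[Post]:
    """Return up to `limit` unpublished posts whose scheduled time has passed, oldest first."""
    pending = [p for p in posts if not p.published and p.scheduled_at <= now]
    pending.sort(key=lambda p: p.scheduled_at)
    return pending[:limit]
```

On a scheduled (cron-triggered) run, the worker would publish the returned posts through the LinkedIn API and then mark them as published so later runs skip them. Against the database, the same selection is a single query filtering on `published` and `scheduled_at`, ordered by `scheduled_at`.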
Outcome
This tool has significantly increased my productivity and enhanced my presence on the network, making it easier to manage my personal brand.
Technologies Used
React.js, Node.js, Turso SQLite, Cloudflare Workers, Wrangler