#site-reliability-engineer

6 posts

5 May

Abdurrahman J. Allawala 5 May 2026 8 min read

Monitoring reliably at scale

Designing monitoring that works when everything else doesn’t. By: Abdurrahman J. Allawala. Introduction: When an incident hits, teams lean on observability to answer the only questions that matter: what’s broken, and why? Monitoring systems are designed to help you answer these questions, and they usually do. But what happens when your observability stack is dependent on the same systems…

engineering infrastructure technology observability site-reliability-engineer

28 Apr

Nikos Katirtzis 28 Apr 2026 7 min read

Expedia’s Service Telemetry Analyzer

Expedia

Expedia Group Technology: Engineering. A system that facilitates investigation of service degradations and outages using service telemetry data and AI. The recent advancements in the artificial intelligence space make us re-evaluate how work is done, from programming to designing systems to operating them in production. While there is considerable focus on automating…

observability software-engineering distributed-systems generative-ai-tools site-reliability-engineer

21 Apr

Rishabh Kumar 21 Apr 2026 9 min read

Building a fault-tolerant metrics storage system at Airbnb

Airbnb

How we built a storage system that ingests 50 million samples per second and stores 2.5 petabytes of logical time series data. By: Rishabh Kumar. Modern observability practice encourages instrumenting every meaningful code path. Over the past 15 years, open-source observability SDKs like Prometheus, OpenTelemetry, and StatsD have made deep instrumentation nearly ubiquitous. These days, most software, open-source…

site-reliability-engineer infrastructure technology engineering software-architecture

7 Apr

Eugene Ma 7 Apr 2026 9 min read

Building a high-volume metrics pipeline with OpenTelemetry and vmagent

Airbnb

A production-tested approach for moving a large-scale metrics pipeline from StatsD to OpenTelemetry and Prometheus. By: Eugene Ma, Natasha Aleksandrova. When migrating to a new monitoring system, you’ll want to frontload the work to collect all your metrics. This exposes bottlenecks at full write scale and unblocks the migration of assets which require real data for validation, such as…

engineering technology infrastructure observability site-reliability-engineer

16 Mar 2023

Jacob 16 Mar 2023 3 min read

In Focus: Sumanth Reddy

Blinkit

A conversation with engineers who help run Blinkit. Chinthakunta Sumanth Kumar Reddy is an SDE 3 at Blinkit. He joined us in March 2021 and has since helped us build a resilient application platform. He currently works as part of Software Resilience Engineering (SRE), enabling scalable database migrations for Blinkit’s applications. Tell us about your background and your…

quick-commerce devops site-reliability-engineer culture people-at-blinkit

23 Feb 2023

Jacob 23 Feb 2023 3 min read

In Focus: Jay Dihenkar

Blinkit

A conversation with engineers who help run Blinkit. Jay Dihenkar is a Staff Engineer at Blinkit. He joined us in December 2020 and has helped different teams manage and streamline their build and release processes. He is currently working towards continuously improving the reliability, scalability, observability, developer productivity, and other such aspects of a software system critical for ensuring that…

people-at-blinkit culture devops quick-commerce site-reliability-engineer