~/devreads

#site-reliability-engineer

6 posts

5 May

Abdurrahman J. Allawala 8 min read

Designing monitoring that works when everything else doesn’t. By: Abdurrahman J. Allawala. Introduction: When an incident hits, teams lean on observability to answer the only questions that matter: what’s broken, and why? Monitoring systems are designed to help you answer these questions, and they usually do. But what happens when your observability stack is dependent on the same systems…

#engineering #infrastructure #technology #observability #site-reliability-engineer

28 Apr

Nikos Katirtzis 7 min read

Expedia Group Technology - Engineering. A system that facilitates investigation of service degradations and outages using service telemetry data and AI. Recent advancements in artificial intelligence make us re-evaluate how work is done, from programming to designing systems and even operating them in production. While there is considerable focus on automating…

#observability #software-engineering #distributed-systems #generative-ai-tools #site-reliability-engineer

21 Apr

Rishabh Kumar 9 min read

How we built a storage system that ingests 50 million samples per second and stores 2.5 petabytes of logical time series data. By: Rishabh Kumar. Modern observability practice encourages instrumenting every meaningful code path. Over the past 15 years, open-source observability SDKs like Prometheus, OpenTelemetry, and StatsD have made deep instrumentation nearly ubiquitous. These days, most software, open-source…

#site-reliability-engineer #infrastructure #technology #engineering #software-architecture

7 Apr

Eugene Ma 9 min read

A production-tested approach for moving a large-scale metrics pipeline from StatsD to OpenTelemetry and Prometheus. By: Eugene Ma, Natasha Aleksandrova. When migrating to a new monitoring system, you’ll want to front-load the work to collect all your metrics. This exposes bottlenecks at full write scale and unblocks the migration of assets which require real data for validation, such as…

#engineering #technology #infrastructure #observability #site-reliability-engineer

16 Mar 2023

Jacob 3 min read

A conversation with engineers who help run Blinkit. Chinthakunta Sumanth Kumar Reddy is an SDE 3 at Blinkit. He joined us in March 2021 and has since helped us build a resilient application platform at Blinkit. He currently works as part of Software Resilience Engineering (SRE), enabling scalable database migrations for Blinkit’s applications. Tell us about your background and your…

#quick-commerce #devops #site-reliability-engineer #culture #people-at-blinkit

23 Feb 2023

Jacob 3 min read

A conversation with engineers who help run Blinkit. Jay Dihenkar is a Staff Engineer at Blinkit. He joined us in December 2020 and has helped different teams manage and streamline their build and release processes. He is currently working towards continuously improving the reliability, scalability, observability, developer productivity, and other such aspects of a software system critical for ensuring that…

#people-at-blinkit #culture #devops #quick-commerce #site-reliability-engineer