#infrastructure

45 posts

4 Jun

Bo Teng 4 Jun 2026 10 min read

Sitar-agent: Building a reliable dynamic configuration sidecar at scale

How Airbnb built a Kubernetes sidecar to deliver dynamic configuration reliably at scale. By: Bo Teng, Cosmo Qiu, Siyuan Zhou, Ankur Soni, Xin Huang, Willis Harvey. Introduction: In our previous post, we explored Airbnb’s dynamic configuration system, Sitar, with a focus on service architecture and configuration change safety. Now for the harder question:…

distributed-systems infrastructure software-architecture engineering software-development

1 Jun

Giovanni Pereira Zantedeschi 1 Jun 2026 8 min read

How we reduced core unit boot time from hours to minutes

Cloudflare

We investigated why firmware updates were causing our core servers to take four hours to reboot. By diving into UEFI data structures and iPXE automation, we eliminated unnecessary timeouts and cut boot times back down to minutes.

infrastructure engineering networking core

19 May

Lucen Zhao 19 May 2026 7 min read

Scaling Airbnb’s identity graph with a unified knowledge graph infrastructure

Airbnb

How Airbnb shifts from PaaS to an internal knowledge graph infrastructure at scale. By: Lucen Zhao, Shukun Yang, Ashish Jain. Knowledge graphs offer a natural and powerful way to represent relationships between entities. Many real-world systems are fundamentally about connections. Airbnb’s identity graph captures relationships between users in a graph database. The identity graph serves aggregated insights that…

technology graph-database infrastructure knowledge-graph engineering

5 May

Abdurrahman J. Allawala 5 May 2026 8 min read

Monitoring reliably at scale

Airbnb

Designing monitoring that works when everything else doesn’t. By: Abdurrahman J. Allawala. Introduction: When an incident hits, teams lean on observability to answer the only questions that matter: what’s broken, and why? Monitoring systems are designed to help you answer these questions, and they usually do. But what happens when your observability stack is dependent on the same systems…

engineering infrastructure technology observability site-reliability-engineer

1 May

Netflix Technology Blog 1 May 2026 13 min read

State of Routing in Model Serving

Netflix Technology Blog

By: Nipun Kumar, Rajat Shah, Peter Chng. Introduction: This is the first blog post in a multi-part series that shares technical insights into how our ML model serving infrastructure powers several personalized experiences at scale across various domains (e.g., title recommendations, commerce). In this introductory blog post, we will dive into our domain-independent API abstraction and its traffic…

ai-platform distributed-systems infrastructure machine-learning

Pinterest Engineering 1 May 2026 16 min read

Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer

Guangtong Bai | Staff Software Engineer, Product ML Infrastructure*; Shantam Shorewala | Software Engineer II, Product ML Infrastructure*; Chi Zhang | Staff Software Engineer, AI Platform*; Neha Upadhyay | Software Engineer II, AI Platform*; Haoyang Li | Director, Product ML Infrastructure. *These authors contributed equally to this article. Background: At Pinterest, our online ML serving systems employ a root-leaf architecture.…

engineering pinterest machine-learning infrastructure efficiency

28 Apr

Ricardo Gamba 28 Apr 2026 12 min read

Skipper: Building Airbnb’s embedded workflow engine

Airbnb

How Airbnb built a lightweight workflow engine to solve durable execution. By: Ricardo Gamba, Andriy Sergiyenko. Introduction: The durable execution problem. Picture this hypothetical flow: A host submits an insurance claim about their listing to Airbnb. The system needs to validate the claim, run trust and safety checks, assess estimates, process the payout, and send notifications. Halfway through…

workflow software-architecture infrastructure technology engineering

21 Apr

Rishabh Kumar 21 Apr 2026 9 min read

Building a fault-tolerant metrics storage system at Airbnb

Airbnb

How we built a storage system that ingests 50 million samples per second and stores 2.5 petabytes of logical time series data. By: Rishabh Kumar. Modern observability practice encourages instrumenting every meaningful code path. Over the past 15 years, open-source observability SDKs like Prometheus, OpenTelemetry, and StatsD have made deep instrumentation nearly ubiquitous. These days, most software — open-source…

site-reliability-engineer infrastructure technology engineering software-architecture

20 Apr

Pinterest Engineering 20 Apr 2026 10 min read

Smarter URL Normalization at Scale: How MIQPS Powers Content Deduplication at Pinterest

Shanhai Liao | Senior Software Engineer, Content Acquisition and Media Platform; Di Ruan | Senior Staff Software Engineer, Content Acquisition and Media Platform; Evan Li | Senior Engineering Manager, Content Acquisition and Media Platform. Introduction: Accurate content understanding underpins Pinterest’s ability to drive distribution and engagement. This requires deep insight not just into the image itself, but also the outbound…

pinner-experience engineering infrastructure eng-culture pinterest

13 Apr

Pinterest Engineering 13 Apr 2026 8 min read

Scaling Recommendation Systems with Request-Level Deduplication

Authors: Matt Lawhon | Sr. Machine Learning Engineer; Filip Ryzner | Machine Learning Engineer II; Kousik Rajesh | Machine Learning Engineer II; Chen Yang | Sr. Staff Machine Learning Engineer; Saurabh Vishwas Joshi | Principal Engineer. At Pinterest, scaling our recommendation models delivers outsized impact on the quality of the content we serve to users. Our Foundation Model (oral spotlight,…

pinterest machine-learning infrastructure engineering recommendation-system

7 Apr

Eugene Ma 7 Apr 2026 9 min read

Building a high-volume metrics pipeline with OpenTelemetry and vmagent

Airbnb

A production-tested approach for moving a large-scale metrics pipeline from StatsD to OpenTelemetry and Prometheus. By: Eugene Ma, Natasha Aleksandrova. When migrating to a new monitoring system, you’ll want to frontload the work to collect all your metrics. This exposes bottlenecks at full write scale and unblocks the migration of assets which require real data for validation, such as…

engineering technology infrastructure observability site-reliability-engineer

2 Apr

Kazuaki Okumura, Mike White, Kevin Altschuler, Facundo Agriel 2 Apr 2026 13 min read

Improving storage efficiency in Magic Pocket, our immutable blob store

Dropbox

By turning compaction into a layered, adaptive pipeline and strengthening our monitoring and controls, we made Magic Pocket more resilient to workload changes.

magic-pocket storage infrastructure

31 Mar

Carlo Preciado 31 Mar 2026 4 min read
