~/devreads

#data infrastructure

14 posts

21 May

Pinterest Engineering 12 min read

Authors (listed alphabetically): Ads Feature Engineering Infra team: Ajay Venkatakrishnan, Le Zhang; Core ML Infra team: Eric Shang, Pihui Wei; ML Data team: Connor Votroubek, Yi He; User Understanding team: Camilo Munoz, Simin Li. If you work on ranking, retrieval, or recommendation systems, you’ve probably asked for some version of the same thing: “Give me the last N…

machine-learning, recommendation-system, engineering, data-infrastructure, pinterest

12 May

9 min read

Meta’s data ingestion system, which our engineering teams leverage for up-to-date snapshots of the social graph, has recently undergone a significant revamp to enhance its reliability at scale. Moving from our legacy system to our new architecture required a large-scale migration of our entire data ingestion system. We’re sharing the solutions and strategies that enabled [...]

data infrastructure

8 Apr

1 min read

As AI increases developer speed and productivity, it also increases the need for safeguards. On this episode of the Meta Tech Podcast, Pascal Hartig sits down with Ishwari and Joe from Meta’s Configurations team to discuss how Meta makes config rollouts safe at scale. Listen in to learn about canarying and progressive rollouts, the health checks [...]

data infrastructure, devinfra, production engineering, security privacy, meta tech podcast

2 Mar

2 min read

Meta recognizes the long-term benefits of jemalloc, a high-performance memory allocator, in its software infrastructure. We are renewing focus on jemalloc, aiming to reduce maintenance needs and modernize the codebase while continuing to evolve the allocator to adapt to the latest hardware and workloads. We are committed to continuing jemalloc development with the [...]

data infrastructure, open source

19 Dec 2025

5 min read

Incident investigation can be a daunting task in today’s digital landscape, where large-scale systems comprise numerous interconnected components and dependencies. DrP is a root cause analysis (RCA) platform, designed by Meta, to programmatically automate the investigation process, significantly reducing the mean time to resolve (MTTR) for incidents and alleviating on-call toil. Today, DrP is used [...]

data infrastructure, ml applications

21 Nov 2025

9 min read

We’re introducing Zoomer, Meta’s comprehensive, automated debugging and optimization platform for AI. Zoomer works across all of our training and inference workloads at Meta and provides deep performance insights that enable energy savings, workflow acceleration, and efficiency gains in our AI infrastructure. Zoomer has delivered training time reductions and significant QPS improvements, making it the [...]

data center engineering, data infrastructure, ml applications

14 Oct 2025

6 min read

At Open Compute Project Summit (OCP) 2025, we’re sharing details about the direction of next-generation network fabrics for our AI training clusters. We’ve expanded our network hardware portfolio and are contributing new disaggregated network platforms to OCP. We look forward to continued collaboration with OCP to open designs for racks, servers, storage boxes, and motherboards [...]

data center engineering, data infrastructure, devinfra, ml applications, networking traffic

29 Sept 2025

17 min read

Over the past 21 years, Meta has grown exponentially from a small social network connecting a few thousand people in a handful of universities in the U.S. into several apps and novel hardware products that serve over 3.4 billion people throughout the world. Our infrastructure has evolved significantly over the years, growing from a [...]

ai research, data center engineering, data infrastructure, devinfra, ml applications

13 Aug 2025

10 min read

In this post, we explore the ways we’re evolving Meta’s data warehouse to facilitate productivity and security to serve both human users and AI agents. We detail how we’re developing agents that help users making data access requests to get to the data they need, and that help data owners process requests and maintain security. [...]

data infrastructure

22 Jul 2025

13 min read

Hardware faults can have a significant impact on AI training and inference. Silent data corruptions (SDCs), undetected data errors caused by hardware, can be particularly harmful for AI systems that rely on accurate data for training as well as providing useful outputs. We are sharing methodologies we deploy at various scales for detecting SDC across [...]

data infrastructure

20 May 2025

4 min read

As Meta has launched new, innovative products leveraging generative AI (GenAI), we need to make sure the underlying infrastructure components evolve along with it. Applying infrastructure knowledge and optimizations has allowed us to adapt to changing product requirements, delivering a better product along the way. Ultimately, our infrastructure systems need to balance our need to [...]

data infrastructure, devinfra, production engineering, web

8 May 2025

3 min read

Meta and NVIDIA collaborated to accelerate vector search on GPUs by integrating NVIDIA cuVS into Faiss v1.10, Meta’s open source library for similarity search. This new implementation of cuVS will be more performant than classic GPU-accelerated search in some areas. For inverted file (IVF) indexing, NVIDIA cuVS outperforms classical GPU-accelerated IVF build times by up [...]

ai research, data infrastructure, ml applications, open source

31 Mar 2025

1 min read

Mobile GraphQL is a framework used at Meta for fetching data in mobile applications using GraphQL, a strongly-typed, declarative query language. At Meta it handles data fetching for apps like Facebook and Instagram. Sabrina, a software engineer on Meta’s Mobile GraphQL Platform Team, joins Pascal Hartig on the Meta Tech podcast to discuss the evolution [...]

data infrastructure, meta tech podcast

26 Apr 2022

Laura Nolan 12 min read

By Laura Nolan, with contributions from Glen D. Sanford, Jamie Scheinblum, and Chris Sullivan. Assessing conditions: Slack experienced a major incident on February 22 this year, during which time many users were unable to connect to Slack, including the author, which certainly made my role as Incident Commander more challenging! This incident was a…

uncategorized, data-infrastructure, debugging, incident-response