We’re introducing Instantaneous PowerLoss Storm, a new testing paradigm within Meta’s infrastructure for handling and mitigating instant or zero-notice power loss in our data centers. We’re sharing: how we built readiness to tolerate instant failures into our existing systems with defense-in-depth strategies; tradeoffs made in implementing it, and how we validated our readiness. Disaster preparedness [...] Read More... The post…
#production engineering
10 posts
3 Jun
1 May
The HSM-based Backup Key Vault Meta’s HSM-based Backup Key Vault provides the foundation for end-to-end encrypted backups for WhatsApp and Messenger. The system allows people to protect their backed-up message history with a recovery code, ensuring that the recovery code is stored in tamper-resistant hardware security modules (HSMs) and is inaccessible to Meta, cloud storage [...] Read More... The post…
16 Apr
We’re sharing insights into Meta’s Capacity Efficiency Program, where we’ve built an AI agent platform that helps automate finding and fixing performance issues throughout our infrastructure. By leveraging encoded domain expertise across a unified, standardized tool interface these agents help save power and free up engineers’ time away from addressing performance issues to innovating on [...] Read More... The post…
8 Apr
As AI increases developer speed and productivity it also increases the need for safeguards. On this episode of the Meta Tech Podcast, Pascal Hartig sits down with Ishwari and Joe from Meta’s Configurations team to discuss how Meta makes config rollouts safe at scale. Listen in to learn about canarying and progressive rollouts, the health checks [...] Read More... The…
14 Nov 2025
Most people have heard of open-source software. But have you heard about open hardware? And did you know open source can have a positive impact on the environment? On this episode of the Meta Tech Podcast, Pascal Hartig sits down with Dharmesh and Lisa to talk about all things open hardware, and Meta’s biggest announcements [...] Read More... The post…
21 May 2025
In this post, we explore how Instagram has successfully scaled its algorithm to include over 1000 ML models without sacrificing recommendation quality or reliability. We delve into the intricacies of managing such a vast array of models, each with its own performance characteristics and product goals. We share insights and lessons learned along the way—from [...] Read More... The post…
20 May 2025
As Meta has launched new, innovative products leveraging generative AI (GenAI), we need to make sure the underlying infrastructure components evolve along with it. Applying infrastructure knowledge and optimizations have allowed us to adapt to changing product requirements, delivering a better product along the way. Ultimately, our infrastructure systems need to balance our need to [...] Read More... The post…
3 Feb 2025
We’ve previously described why we think it’s time to leave the leap second in the past. In today’s rapidly evolving digital landscape, introducing new leap seconds to account for the long-term slowdown of the Earth’s rotation is a risky practice that, frankly, does more harm than good. This is particularly true in the data center [...] Read More... The post…
21 Jan 2025
We’re sharing details about Strobelight, Meta’s profiling orchestrator. Strobelight combines several technologies, many open source, into a single service that helps engineers at Meta improve efficiency and utilization across our fleet. Using Strobelight, we’ve seen significant efficiency wins, including one that has resulted in an estimated 15,000 servers’ worth of annual capacity savings. Strobelight, Meta’s [...] Read More... The post…
10 Jan 2025
Today’s rapidly evolving landscape of use cases that demand highly performant and efficient network infrastructure is placing new emphasis on how in-line amplifiers (ILAs) are designed and deployed. Meta’s ILA Evo effort seeks to reimagine how an ILA site could be deployed to improve speed and cost while making a step function improvement in power [...] Read More... The post…