The Architect of Systems That Self-Heal and Scale Under Pressure


14 Oct 2025 12:53 IST



Billions of people rely on platforms to connect, share, and do business instantly. Behind every smooth social media post, video upload, or job application lies a hidden infrastructure that must function well and grow without interruption. As these systems expand, they face additional challenges, including hardware failures, software glitches, and unexpected issues. Engineers create solutions to help these systems not only handle pressure but also fix themselves quickly and automatically. Leading this effort is Nikhita Kataria, an experienced professional whose work focuses on balancing growth, automation, and system reliability.


Kataria’s career spans over a decade of shaping infrastructure at two of the world’s biggest tech companies. At Meta, her work revolved around ensuring that the massive inventory powering Facebook Advertisements could scale reliably. After she moved to LinkedIn, her focus shifted to strengthening backend systems that support everything from the site’s core functions to its AI-driven features. In particular, her impact has been most visible in the maintenance stack, the unseen foundation ensuring that millions of users never sense the turbulence caused by hardware degradation, security vulnerabilities, or large-scale system upgrades.

One of Kataria’s landmark contributions at LinkedIn was spearheading “Bad Host Remediation,” a system designed to detect faulty hosts and swiftly move applications to healthier environments. What once required several hours of manual intervention became an automated, near-instantaneous process. As she explains, “A self-healing system should be fast, but above all, it must remain invisible to the end user.” Guided by this principle, her work not only restored failing infrastructure within minutes but also reduced downtime for services central to LinkedIn’s global network. Beyond reducing engineering hours, the approach redefined what it meant for infrastructure to both heal and operate independently.
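To make the idea concrete, the detect-and-move pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not LinkedIn's actual system: the `Host` class, the `remediate_bad_hosts` function, and the "least-loaded healthy host" placement rule are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Host:
    """Hypothetical host record: a name, a health flag, and the apps it runs."""
    name: str
    healthy: bool = True
    apps: List[str] = field(default_factory=list)


def remediate_bad_hosts(hosts: List[Host]) -> Dict[str, str]:
    """Move every app off unhealthy hosts onto the least-loaded healthy host.

    Returns a mapping of app name -> new host name for each app that moved.
    """
    healthy = [h for h in hosts if h.healthy]
    if not healthy:
        # Nothing safe to move to: automation stops and a human takes over.
        raise RuntimeError("no healthy hosts available; escalate to a human")

    moves: Dict[str, str] = {}
    for bad in (h for h in hosts if not h.healthy):
        for app in list(bad.apps):
            # Assumed placement rule: pick the healthy host running the fewest apps.
            target = min(healthy, key=lambda h: len(h.apps))
            bad.apps.remove(app)
            target.apps.append(app)
            moves[app] = target.name
    return moves
```

In a real fleet the health flag would come from monitoring signals and the move would go through a scheduler; the sketch only shows the shape of the loop that replaces manual intervention.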

At the same time, she led another important project: automating operating system upgrades for LinkedIn’s Kubernetes stack. Previously, these upgrades were risky and often raised worries about availability or disruptions. Her team integrated health checks, fault-domain awareness, and systematic fail-safes into a pipeline that could push security patches across large fleets with minimal risk. In other words, what used to be a high-stakes manual task became a dependable, low-risk routine smoothly embedded into system workflows.
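A rough sketch of how health checks, fault-domain awareness, and a fail-safe can fit together in a rolling upgrade is shown below. It is a simplified illustration, not the pipeline her team built: the `rolling_upgrade` function and its `upgrade` and `is_healthy` callbacks are hypothetical names, and a fault domain is assumed to be something like a rack or zone.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def rolling_upgrade(
    nodes: Dict[str, str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade nodes one fault domain at a time, halting on a failed health check.

    nodes maps node name -> fault domain (for example a rack or zone).
    Returns the nodes that were upgraded and passed their health check.
    """
    by_domain = defaultdict(list)
    for node, domain in nodes.items():
        by_domain[domain].append(node)

    done: List[str] = []
    for domain in sorted(by_domain):
        # Only one fault domain is touched at a time, so the others keep serving.
        for node in by_domain[domain]:
            upgrade(node)
            if not is_healthy(node):
                # Fail-safe: stop here so a bad patch cannot spread further.
                return done
            done.append(node)
    return done
```

Because the rollout halts at the first unhealthy node, a faulty patch affects at most one node in one fault domain before a person is brought in.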

Looking beyond the individual projects, Kataria also redefined how scale itself was understood within the infrastructure space. Different applications interpret scale differently, whether in terms of query load, data transfers, or uptime needs. By utilizing advanced stress-testing tools, her team could replay real-world production traffic and verify whether systems would withstand extreme loads without faltering. These tools served as a vital mechanism not only for testing resilience but also for ensuring confidence in live rollouts across thousands of servers at any given time.

Building self-healing systems presented significant challenges. A key issue involved identifying which components could genuinely be classified as “healable.” While some failures could be addressed through automated processes, others required direct human intervention and clear escalation protocols. Achieving an appropriate balance between automation and safety was essential. Furthermore, within such critical and high-risk environments, automation had to demonstrate precision, consistency, and reliability.
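The split between “healable” failures and those that need a person can be pictured as a simple lookup with an escalation path. The sketch below is purely illustrative: the failure types, the `AUTO_FIXES` catalogue, the `handle_failure` function, and the retry limit are assumptions made for the example, not details from Kataria's work.

```python
from typing import Callable, Dict

# Hypothetical catalogue: failure type -> automated fix considered safe to run.
AUTO_FIXES: Dict[str, Callable[[str], str]] = {
    "disk_full": lambda host: f"cleared temp files on {host}",
    "service_crash": lambda host: f"restarted service on {host}",
}


def handle_failure(kind: str, host: str, attempts: int = 0, max_attempts: int = 2) -> str:
    """Run an automated fix for known failure types; otherwise escalate.

    A failure is also escalated once the automated fix has already been
    tried max_attempts times, so automation never loops indefinitely.
    """
    fix = AUTO_FIXES.get(kind)
    if fix is None or attempts >= max_attempts:
        return f"escalated {kind} on {host} to on-call"
    return fix(host)
```

Keeping the catalogue explicit means anything not listed defaults to human review, which is one way to favour safety over automation when the two conflict.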

Over time, the use of AI in infrastructure management is expected to become increasingly common. Many organisations are beginning to explore AI-driven solutions for system remediation, but the unpredictability of system failures remains a major obstacle. Learning from past incidents provides valuable insights, yet it does not guarantee that a system can predict or handle failures it has never seen. Moving forward, success will rely on developing adaptive learning systems that can evolve without compromising reliability.

In a field where downtime directly impacts trust, the industry is gradually shifting toward systems that can detect issues, respond to them, and recover autonomously. The efforts of industry professionals demonstrate that this goal is not just theoretical but actively being realised. As infrastructure grows increasingly complex, these self-healing systems offer a vision of a future where resilience is an expectation rather than an exception.


brand story