Vivek Khurana

vivek@outlook.com

Engineering leader with 15+ years building large-scale distributed systems, now leading teams at the frontier of agentic AI infrastructure. Shipped foundational platforms at Meta, including Agent Memory, AgentBus, AgentTune, and RL-as-a-Service, and at Microsoft across OneDrive authorization, caching, DLP, and abuse throttling. Track record of taking 0-to-1 systems to production scale and driving measurable business impact across latency, capacity, reliability, and engagement.

Core Skills

  • Domains: Agentic AI infrastructure, reinforcement learning systems, LLM fine-tuning and serving, distributed systems, control planes, P2P distribution, authorization, security
  • Leadership: Multi-team management, cross-org strategy, XFN partnership, hiring, org design, technical vision
  • Stack: C#, Python, C++, PyTorch, vLLM, MAST, KV stores, vector search, shared-log systems

Experience

Engineering Manager, Meta Superintelligence Labs - RL and Agentic Infra, Meta 2025 - Present

Lead roughly 14 engineers building production infrastructure for Meta's AI agents across memory, multi-agent execution, RL, and fine-tuning/serving workflows.

  • Shipped Agent Memory, a persistent namespace-scoped memory platform backed by Meta's KV Store, Vector Search, and Blob Storage; onboarded 2 agent frameworks with millions of memory objects stored across shared agents.
  • Shipped AgentBus, a shared-log execution substrate for multi-agent systems with pluggable safety voters, replay-based fault tolerance, and full audit trails; delivered the core security invariant and 99.99% reliability for onboarded frameworks.
  • Delivered AgentTune and RL-as-a-Service, enabling trajectory-based agent learning, SFT/LoRA fine-tuning, checkpoint registration, batch inference sweeps, one-click vLLM deployments, tiered rate limiting, and observability.
  • Built operating cadence for a 0-to-1 org by clarifying roadmaps, staffing multiple workstreams, partnering with research/product teams, and converting frontier prototypes into production-grade infrastructure.

Engineering Manager, AI Infra - Training Control Plane, Meta 2024 - 2025

Led 23+ ICs and 2 managers delivering Meta's AI Training Control Plane across product groups and infrastructure teams, connecting training pipelines, model registration, serving readiness, and fleet capacity management.

  • Drove a 99.3% reduction in Training-to-Serving latency, producing a 7-10% lift in cold-start engagement on downstream ranking surfaces.
  • Reclaimed 1.6 MW of capacity and delivered 20% resource reduction across training fleets.
  • Unified Model Freshness strategy across Data, Inference, and Training organizations, aligning metrics, SLAs, and ownership boundaries for end-to-end model delivery.
  • Led cross-org planning and execution for emerging business requirements, translating product urgency into infrastructure milestones, launch sequencing, and measurable operational outcomes.

Engineering Manager, Core Systems - Distribution Infrastructure, Meta 2021 - 2024

Owned Tier-0 distribution services powering mission-critical fleet-wide functionality across Meta's production fleet.

  • Led Falcon, a globally distributed control-plane service for config and service discovery with massive fanout, low latency, and Tier-0 reliability.
  • Led Owl, a P2P distribution system for TB-scale objects, including AI/ML models and Ads data, across Meta's private cloud.
  • Drove reliability, capacity, and operational planning for systems that sit on the startup path of critical services.
  • Partnered with cross-functional teams and senior leadership to evolve core infrastructure consumed by virtually every service at Meta.

Principal Engineering Manager, OneDrive and SharePoint, Microsoft 2015 - 2021

Led 14 engineers across abuse throttling, caching, authorization, purchase platform, and Photos experiences.

  • Designed and shipped a unified Cache Framework spanning local cache, cluster cache, and dual-cache implementations, adopted across OneDrive.
  • Delivered the initial Data Loss Prevention implementation for OneDrive business customers, enforcing document access through compliance policies.
  • Rebuilt the OneDrive abuse throttling subsystem, adding filtering by application, usage, and configurable limits.
  • Designed and shipped a unified Authorization Framework for granular runtime resource access checks across OneDrive components.

Software Development Engineer, Windows Services, Microsoft 2011 - 2015

Built the primary contact data store powering Skype, Hotmail/Outlook.com, and Windows clients, serving hundreds of thousands of requests per second over EAS, REST, and SOAP.

  • Owned all REST APIs for reading and writing contact data.
  • Shipped Contact Sync, EAS extensions, Skype Push APIs, and the back-compatibility model.
  • Owned Sandbox, a library for standardized third-party contact and activity integrations consumed by multiple platforms.

Associate Software Engineer, Nokia India Pvt. Ltd. 2008 - 2009

  • Improved media playback performance on high-resolution devices and contributed to Touch Keypad and OVI Media Player development.
  • First-ever recipient of Nokia's Water Tight Quality Champion award; also received an Outstanding Contribution Award.

Software Engineering Intern, IBM Software Labs 2007

  • Built internal PHP/MySQL and WebSphere tools for employee surveys and server IP allocation tracking.

Education

Georgia Institute of Technology, M.S. Computer Science 2009 - 2011

  • GPA: 3.66
  • Coursework: Advanced Operating Systems, Algorithms, Computer Networks, AI, HCI, Mobile Applications and Services

M. S. Ramaiah Institute of Technology, B.E. Computer Science and Engineering, Bangalore, India

  • Graduated First Class with Distinction, 73.0% aggregate, top 10% of class

Research and Publications

Georgia Tech CERCS Research Group, Faculty: Ada Gavrilovska

  • Built a QoS-aware scheduler for the integrated cryptographic accelerator on Intel's EP80579 SoC.
  • Co-author, "A Split-Driver Approach to SoC Virtualization - Challenges and Opportunities," 5th International Symposium on Embedded Multicore SoCs (MCSoC-10).