Reflections on building AI infrastructure in a research-heavy organization
Lately, it feels like everywhere I look, teams are talking about platforms for machine learning and artificial intelligence, and especially for generative AI. MLOps, ML infrastructure, reliability engineering, agent orchestration, evaluations, RAG, tool calling. The vocabulary has evolved, but the direction feels clear. We are all trying to find better ways to turn experiments into something customers can actually use.
In that sense, the need feels obvious now. We need reusable primitives. We need common junctions where ideas can move from training and prototype playgrounds into real systems. Places where work can change shape without getting lost. The need itself is not new, but it does feel like it has become clearer and more widely shared.
My own path to this conclusion, especially within a research-heavy organization, looked a little different. What follows is a reflection on how I found myself building an AI platform in that environment, and what it taught me along the way.
I've spent a significant part of my career in consumer robotics. By nature, it's a field that leans heavily toward research. Vision models, autonomy, planning, and more recently, large language models for natural conversation, personalization, and activity generation. Experimentation isn't just encouraged. It's foundational.
When the LLM era arrived, that bias naturally intensified. Science teams were running experiments very quickly. Senior leaders were R&D focused, with a strong bias toward exploration and experimentation. The organization leaned that way, and honestly, for good reason. There was real novelty, and there was real upside to chase.
But from my vantage point, shaped by having built and scaled infrastructure for high-volume production systems before, something else was happening at the same time. I had seen teams go through similar transitions before, from early prototypes that worked for a small set of users to systems that needed to hold up for real customers at scale.
As we pushed models further, I started noticing a subtle but growing distance between what worked well in experiments and how those same ideas behaved once they were exposed to real customers, real devices, and real constraints. Agentic workflows that looked encouraging in notebooks often needed additional care and adaptation when they entered production environments. Evaluation pipelines, in practice, lagged behind the pace of experimentation, which made it harder to understand why something worked in one setting and not another. QA workflows and processes required frequent refinement. Day-to-day testing, even the very basic kind we were doing regularly, became one of the earliest places where these gaps started to show up.
Over time, it became clearer that many of the hardest problems we were facing were not just scientific in nature. They increasingly showed up in the operational layers, especially as we tried to move ideas from experiments into something customers could rely on.
Infrastructure as a Future Problem
In robotics, research naturally dominates early thinking. That makes sense. But as systems scale, infrastructure becomes part of the product, even if no one explicitly frames it that way. Latency budgets, reproducibility, rollout safety, rollback mechanisms, and on-device constraints. These details often sound secondary, but they determine whether innovation ever reaches customers.
Proposing shared or democratized AI infrastructure and processes in a research-heavy organization is difficult. Whether it is reusable compute primitives for AI and ML workloads, or common workflows that span research and production, it can easily sound like slowing things down. And to be honest, to many people it genuinely sounds like premature optimization.
For me, it started with a very small and concrete observation. I noticed that prompts and the way they were constructed, along with the surrounding context, had become central to almost everything we were doing.
There was no real infrastructure to manage this in an organized way. Prompts were not versioned. There was no common store that could be used across training and production stacks. If something broke in an evaluation, it was often hard to trace it back to a specific change unless you were deeply familiar with the code and willing to dig through pull requests and builds.
Much of this logic lived directly on device because it made experimentation easier for science teams. At the same time, other parts of the system lived in the cloud, where we were already running into tight latency budgets due to device-to-cloud round trips.
What stood out to me was not that science teams were wrong to optimize for speed and simplicity. Simplicity is essential for good science. But that simplicity should be preserved by the platform, not achieved by fragmenting logic across stacks.
That was the motivation behind proposing a shared prompt store. Something that could be versioned, tracked, and reused across both prototype and production systems. In many ways, it felt similar to what feature platforms once enabled in classical machine learning. The goal was not to slow experimentation down, but to make it easier to carry forward.
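To make the idea concrete, here is a minimal sketch of the kind of thing I mean, in Python. The names (PromptStore, PromptVersion, save, get) are hypothetical, and this is an illustration rather than what we actually built: immutable, content-addressed prompt versions that experiments can track at head while production pins a specific revision.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib


@dataclass(frozen=True)
class PromptVersion:
    """One immutable, content-addressed revision of a prompt template."""
    name: str
    template: str
    version: int
    content_hash: str
    created_at: str


class PromptStore:
    """In-memory sketch of a shared, versioned prompt store.

    Every save creates a new immutable version, so an evaluation result or
    a production incident can be traced back to the exact prompt text that
    was in play, without digging through pull requests and builds.
    """

    def __init__(self) -> None:
        self._versions: dict[str, list[PromptVersion]] = {}

    def save(self, name: str, template: str) -> PromptVersion:
        digest = hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]
        history = self._versions.setdefault(name, [])
        # Saving identical text does not create a new version.
        if history and history[-1].content_hash == digest:
            return history[-1]
        new_version = PromptVersion(
            name=name,
            template=template,
            version=len(history) + 1,
            content_hash=digest,
            created_at=datetime.now(timezone.utc).isoformat(),
        )
        history.append(new_version)
        return new_version

    def get(self, name: str, version: int | None = None) -> PromptVersion:
        history = self._versions[name]
        return history[-1] if version is None else history[version - 1]


# Experiments track the latest revision; production pins one and upgrades deliberately.
store = PromptStore()
store.save("greeting", "You are a helpful home robot. Greet {user_name} warmly.")
pinned = store.get("greeting", version=1)
latest = store.get("greeting")
print(pinned.content_hash, latest.version)
```

A real version would sit behind a service with persistence and access control, but even this shape captures the point: a prompt's identity and history live in one place that both prototype and production code can reference.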
The pushback came quickly, though it was not unexpected, and it came from multiple directions.
Some of it was very reasonable. Concerns about flexibility, iteration speed, and constraining research too early were valid. Other resistance was harder for me to fully understand at the time. In retrospect, I think part of it came from an implicit hierarchy of priorities. Science was seen as the critical path, while infrastructure was often viewed as something that could wait.
The harder challenge was carrying experimental progress across the finish line: making it repeatable, measurable, and shippable without losing the speed that made exploration possible.
Organizational Churn and Quiet Fear
What made this period especially challenging was not disagreement itself. It was organizational inertia, and the way it quietly slowed everything down.
Many product managers and program managers were strongly in favor of building a platform. UX partners could clearly see the long-term impact. Some engineering managers agreed privately. But speaking up publicly was harder. When technical leadership is strongly oriented around one worldview, even thoughtful disagreement can feel risky.
This was also happening during a broader industry downturn. Layoffs, uncertainty, and shifting priorities made people more cautious. When insecurity increases, organizations tend to converge toward familiar mental models. Ideas that fall outside the norm can start to feel unsafe, even when they are necessary.
I stayed persistent. I kept sharing what I was seeing, and I kept turning fuzzy concerns into concrete examples people could react to. The work was not always visible, and it was not always easy to explain, but it felt necessary to keep connecting the dots between experiments and real-world behavior.
What kept me engaged was not a sense of where the industry was heading, but the stories I kept hearing. In conversations with engineers, researchers, and partners from different disciplines, similar pain points kept coming up. Different words, different contexts, but the same underlying friction.
Traction, Slowly
Momentum did not come from winning arguments. It came from gradual alignment.
Over time, support grew from different parts of the organization. Product, UX, and eventually managers who were willing to take some risk. We aligned on a design. Even then, there were ongoing questions about whether we could afford to spend time on it.
And then something shifted. It did not feel sudden at the time, but in retrospect it was likely the result of many small conversations and gradual alignment building across the organization.
It was no longer about whether a platform was the right idea in principle. It became a much simpler question.
Could we build something small, quickly, and honestly see what it revealed?
After a long stretch of small conversations and trust building, I was given the space to take a real shot at it, with whatever support we could reasonably assemble. There was no grand mandate and no long runway. Just a narrow window, a lot of trust, and some doubt. Given how strategic the effort was, I stepped into a tech lead manager role. Beyond designing and architecting the system, I owned sprint planning, made resourcing decisions, drove alignment across cross-functional partners, and reported progress to executive leadership.
A small team came together, roughly ten engineers, all juggling existing responsibilities. Over the next two to three months, we worked intensely. Long days, frequent course corrections, and a lot of learning in public. What we built was not perfect, and it was never meant to be. But it was shared. It was grounded in production realities. And for the first time, it gave research and deployment a common surface to meet on.
That moment changed the trajectory more than any single argument ever could.
Looking back, that period was one of the most stressful phases of my career. It was also one of the most formative.
In Retrospect
Today, funnily enough, all I see are teams trying to build platforms or evolve their existing classical ML platforms to support modern AI workloads.
Organizations across the industry are investing in AI infrastructure that can support both classical machine learning and more modern agentic systems. GPU-backed compute, evaluation frameworks, reusable tools, shared data, and standardized workflows. What once felt difficult to justify now feels much easier to articulate.
I sometimes wish there had been more openness earlier. More space to talk about AI infrastructure without it being framed as a distraction. More psychological safety to challenge dominant narratives. At the same time, I also understand how organizations drift. Fear, incentives, background, and timing all shape what feels reasonable in a given moment.
Some answers only feel obvious in retrospect. Living through the journey rarely feels that way.
I do not think there is a clean moral to pull out of this. If anything, the experience made me more patient with how organizations move.
Some ideas feel obvious only after a system has lived through enough friction to recognize them. Some ideas need the right timing. And sometimes you keep going simply because you believe the product deserves a sturdier foundation, even if the value is not immediately legible.
When I look back, I mostly feel gratitude. For the people who pushed, even when it was tough. For the people who asked the right questions. And for the people who stayed up late to ship the first version. It was messy, it was stressful, and it taught me a lot. And in a strange way, it is a memory I would not trade.