
Evolution of AI/ML Workloads & Platforms

Every era of ML has the same fundamental promise: take messy reality, transform it into signals, learn from those signals, and then ship decisions into products. What changes across "generations" is where the pain lives. In the early years, pain lived in simply processing enough data to train a half-decent model. In the next era, pain moved to operationalizing hundreds (then thousands) of models reliably. In the modern era, pain has shifted again: the unit of value is often no longer a single model behind a single endpoint, but an AI application composed of prompts, retrieval, tool-calling, safety policies, evaluation harnesses, and cost-aware inference.

There's also a second axis that quietly drove each platform rewrite: compute. Classical ML largely rode on CPUs and the economics of large batch clusters. Deep learning pulled the center of gravity toward GPUs and accelerators, and transformers made both training and serving feel like performance engineering problems. As a result, "platform evolution" isn't just better orchestration - it's a response to shifts in data modality, training regimes, and the hardware bottlenecks that define cost.

The workload families that persist across generations (and what changed)

It's tempting to think ML platforms changed because we invented new boxes on an architecture diagram. In reality, the boxes stayed familiar - data, training, serving, orchestration, evaluation - but the dominant shape of each workload changed.

Data processing: from "ETL on tables" to "internet-scale corpus engineering"

In the early era, most ML systems were fed by structured or semi-structured enterprise data: tables, logs, event streams, and curated feature pipelines. ETL meant joins, aggregations, window functions, backfills, and schema discipline.
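For concreteness, here's roughly what that era's feature pipeline looked like in PySpark terms - a minimal sketch, with hypothetical table paths and columns:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("user_feature_etl").getOrCreate()

# Hypothetical warehouse tables.
orders = spark.read.parquet("s3://warehouse/orders")
users = spark.read.parquet("s3://warehouse/users")

# Window function: rank each user's orders by recency.
recency = Window.partitionBy("user_id").orderBy(F.col("order_ts").desc())
ranked = orders.withColumn("recency_rank", F.row_number().over(recency))

# Join + aggregate into one feature row per user.
features = (
    ranked.join(users.select("user_id", "signup_ts"), "user_id")
    .groupBy("user_id", "signup_ts")
    .agg(
        F.count("*").alias("order_count"),
        F.sum("order_total").alias("lifetime_spend"),
        F.max(F.when(F.col("recency_rank") == 1, F.col("order_total"))).alias("last_order_total"),
    )
)

# Backfills mean "rerun this job for older partitions"; schema discipline lives in the table contract.
features.write.mode("overwrite").parquet("s3://warehouse/features/user_order_features")
```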

In the modern era, a large fraction of progress comes from training on unstructured web-scale data and cleaning it aggressively. Instead of asking, "How do we compute this feature?" teams increasingly ask, "How do we extract text from raw web archives, deduplicate it, filter it, and create training-ready shards?"
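A toy sketch of that shape of pipeline - the extraction, filtering, and sharding here are heavily simplified stand-ins (real pipelines use far more aggressive quality heuristics and fuzzy deduplication, not just exact hashing):

```python
import hashlib
import json
from pathlib import Path

def iter_documents(raw_dir: Path):
    """Yield extracted text from raw crawl files (extraction logic is a placeholder)."""
    for path in raw_dir.glob("*.txt"):
        yield path.read_text(errors="ignore")

def keep(doc: str) -> bool:
    """Cheap quality filters: length bounds plus a crude 'mostly text' check."""
    if not (200 <= len(doc) <= 100_000):
        return False
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in doc) / len(doc)
    return alpha_ratio > 0.8

def build_shards(raw_dir: Path, out_dir: Path, docs_per_shard: int = 10_000):
    out_dir.mkdir(parents=True, exist_ok=True)
    seen, shard, shard_id = set(), [], 0
    for doc in iter_documents(raw_dir):
        digest = hashlib.sha256(doc.encode()).hexdigest()  # exact dedup only
        if digest in seen or not keep(doc):
            continue
        seen.add(digest)
        shard.append({"text": doc})
        if len(shard) == docs_per_shard:
            _write_shard(out_dir / f"shard_{shard_id:05d}.jsonl", shard)
            shard, shard_id = [], shard_id + 1
    if shard:
        _write_shard(out_dir / f"shard_{shard_id:05d}.jsonl", shard)

def _write_shard(path: Path, rows):
    with path.open("w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```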

Feature engineering: from hand-crafted signals to learned representations

Feature engineering used to be the primary leverage point for accuracy improvements. Deep learning changed that by making representation learning a default. The practical consequence is that many systems moved from explicit feature design toward learning features end-to-end, especially in vision, speech, and language.

This doesn't mean "features disappear." It means the frontier shifts. In classical ML systems, you often win by inventing a clever feature. In deep learning systems, you often win by choosing the right architecture, objective, and data. In GenAI systems, you often win by choosing the right data mixture, retrieval strategy, and evaluation loop.

Modeling and training: from fitting models to engineering training regimes

Training is no longer "run a learner on a dataset." It increasingly includes a bundle of optimization techniques that exist because scaling is expensive.

Distillation is one early, enduring example: training can use heavy compute and large models, but deployment often needs smaller, cheaper models. Quantization is another example of training/serving co-evolving - "precision engineering" becomes part of the ML lifecycle.
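Here's a minimal PyTorch-flavored sketch of the distillation idea, assuming a trained teacher and a smaller student that share an output space; the temperature and loss weighting are illustrative, not prescriptive:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL against the teacher with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so soft-loss gradients match the hard-loss magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

def train_step(student, teacher, optimizer, x, labels):
    with torch.no_grad():
        teacher_logits = teacher(x)  # heavy model, inference only
    student_logits = student(x)      # small model we actually deploy
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```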

Serving: from CPU prediction services to GPU inference economics

Older platforms primarily served models as CPU-bound RPC services, sometimes with a batch scoring pathway for offline use. GenAI-era serving shifts the bottleneck to GPU utilization, KV-cache memory, batching policy, and throughput/latency trade-offs.
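To make the memory pressure concrete, here's a back-of-the-envelope KV-cache estimate for a hypothetical 7B-class decoder (32 layers, 32 KV heads, head dimension 128, fp16); exact numbers vary by architecture:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value
layers, kv_heads, head_dim, bytes_fp16 = 32, 32, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # 524,288 B ~= 0.5 MiB per token
per_sequence = per_token * 4096                             # ~2 GiB at a 4k context
per_batch = per_sequence * 16                               # ~32 GiB for a batch of 16 sequences

print(f"{per_token / 2**20:.2f} MiB/token, "
      f"{per_sequence / 2**30:.1f} GiB/sequence, "
      f"{per_batch / 2**30:.1f} GiB for batch=16")
```

That's why batching policy and cache management, not FLOPs alone, end up defining serving cost.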

Orchestration: from one model → pipelines → agentic workflows

ML orchestration used to mean "schedule training," then "deploy a model," then "score predictions." As ML products matured, orchestration expanded into multi-model pipelines and configurable workflows. In the GenAI era, orchestration increasingly looks like running agentic graphs: route requests, retrieve context, call tools, validate outputs, and iterate.
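A deliberately simplified sketch of what that loop can look like - the plan object, retriever, tool registry, and validator are all hypothetical stand-ins for whatever a real platform provides:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    request: str
    context: list = field(default_factory=list)
    answer: str | None = None

def run_agent(state: AgentState, llm, tools, max_steps: int = 5) -> AgentState:
    for _ in range(max_steps):
        # Route: ask the model (or a router) what to do next, given accumulated context.
        plan = llm.plan(request=state.request, context=state.context)
        if plan.action == "retrieve":
            state.context.extend(tools["retriever"].search(plan.query))
        elif plan.action == "tool":
            state.context.append(tools[plan.tool_name](**plan.arguments))
        elif plan.action == "answer":
            candidate = plan.text
            if tools["validator"].ok(candidate):          # safety / policy / schema checks
                state.answer = candidate
                return state
            state.context.append("validation failed; revise")  # iterate
    state.answer = "Sorry, I couldn't complete that request."   # bounded fallback
    return state
```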

How workflows and compute profiles evolved

The easiest way to feel the platform evolution is to track what an end-to-end workflow looked like at each stage.

In early production ML, the workflow was often a single small model trained on hand-engineered features and run as batch inference. A team would compute features in a warehouse or Spark job, train a logistic regression or boosted tree, and then run a nightly scoring job that wrote predictions back into a table. The serving system was basically a database lookup: "what score did we compute last night?" This was compute-light on the modeling side and compute-heavy on the ETL side.
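In pandas/scikit-learn terms, that whole generation of workflow compresses into something like this (paths, columns, and the label table are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Nightly job, step 1: load the feature table produced by the ETL pipeline.
features = pd.read_parquet("warehouse/user_features.parquet")
labels = pd.read_parquet("warehouse/labels.parquet")
train = features.merge(labels, on="user_id")

feature_cols = [c for c in features.columns if c != "user_id"]

# Step 2: train a small, CPU-friendly model.
model = LogisticRegression(max_iter=1000)
model.fit(train[feature_cols], train["label"])

# Step 3: score everyone and write predictions back into a table.
scores = features[["user_id"]].copy()
scores["score"] = model.predict_proba(features[feature_cols])[:, 1]
scores.to_parquet("warehouse/nightly_scores.parquet", index=False)
# Online "serving" is then just a key-value lookup on user_id.
```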

As products demanded more responsiveness, workflows shifted toward online inference. The same model families could still run on CPU, but the operational requirements changed: you now needed a low-latency service, versioned rollouts, and a stable API.
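The same model wrapped as a low-latency service isn't much more code - the hard parts are the versioning, rollouts, and monitoring around it. A minimal Flask-style sketch, assuming a scikit-learn model trained on named features (paths and payload schema are made up):

```python
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
# A pinned, versioned artifact produced by the training job (hypothetical path).
model = joblib.load("models/fraud_model_v12.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # assumed shape: {"feature_name": value, ...}
    row = pd.DataFrame([payload], columns=list(model.feature_names_in_))
    score = float(model.predict_proba(row)[0, 1])
    return jsonify({"model_version": "v12", "score": score})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```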

Once online inference became normal, the workflow expanded into pipelines of models. Instead of "one model predicts," systems increasingly looked like "retrieve candidates, rank candidates, post-process results."
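A numpy sketch of that two-stage shape, treating the embeddings and the ranker as givens (`features_for` and the ranker interface are stand-ins):

```python
import numpy as np

def retrieve(query_emb: np.ndarray, item_embs: np.ndarray, k: int = 100) -> np.ndarray:
    """Stage 1: cheap relevance -- dot-product similarity over all items."""
    scores = item_embs @ query_emb
    return np.argsort(-scores)[:k]            # indices of the top-k candidates

def rank(candidate_ids: np.ndarray, candidate_feats: np.ndarray, ranker) -> np.ndarray:
    """Stage 2: a heavier model re-scores only the candidates."""
    scores = ranker.predict_proba(candidate_feats)[:, 1]
    return candidate_ids[np.argsort(-scores)]

# Usage sketch: item_embs is (N, d), query_emb is (d,), ranker is any sklearn-like model,
# and features_for(...) is whatever feature lookup the ranking stage needs.
# candidates = retrieve(query_emb, item_embs, k=100)
# ordered = rank(candidates, features_for(candidates), ranker)
```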

Then deep learning pushed workflows into training factories. Training became large-scale numerical computation over models with millions, then billions, of parameters, and the platform had to schedule accelerators, checkpoint reliably, and handle distributed training.
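The platform-level concern here is mundane but critical: a training loop that can die at any time and resume from its last checkpoint. A PyTorch-flavored sketch, with the model, optimizer, and loss as placeholders and distributed setup omitted:

```python
import os
import torch

CKPT = "checkpoints/latest.pt"

def save_checkpoint(model, optimizer, step):
    tmp = CKPT + ".tmp"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, tmp)
    os.replace(tmp, CKPT)  # atomic rename so a crash never leaves a half-written file

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT):
        return 0
    ckpt = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]

def train(model, optimizer, data_loader, total_steps, ckpt_every=1000):
    step = load_checkpoint(model, optimizer)  # resume wherever the last run died
    for batch in data_loader:
        if step >= total_steps:
            break
        loss = model(batch).mean()   # stand-in for the real loss computation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if step % ckpt_every == 0:
            save_checkpoint(model, optimizer, step)
```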

Finally, GenAI introduced workflows where the "unit of deployment" is a behavior loop, not a single model. The platform now executes agentic graphs: route, retrieve, tool-call, validate, iterate.

ML Platform Generations

1st Gen: Millennials ("MLOps adulthood" years)

This era expanded ML beyond a handful of "high ROI" models into a cross-company capability. Recommendations and ranking became central and increasingly complex. Fraud and risk matured into systems with tight latency constraints. Computer vision moved from research novelty into production reality.

The big win was repeatability. Teams stopped reinventing the lifecycle every time they shipped a new model. The platform started to feel like a product: you could train, track, compare, register, and deploy in a consistent way.

2nd Gen: Gen Z ("cloud-native primitives" years)

Recommendations, ads, and fraud remained dominant, but systems became more real-time and more personalized. Language modeling began reshaping NLP practice as transformers became the best default for embeddings and classification.

The key improvement was abstraction. Training and serving started to feel like APIs rather than clusters, and the platform team's job shifted from "make it run" to "make it cheap, safe, and diagnosable."

3rd Gen: Gen Alpha ("AI application platform" years)

Gen Alpha is defined by generative AI and the shift from prediction systems to behavior systems. The flagship domains are copilots and assistants, customer support automation, content generation/editing, code generation, and multimodal experiences.

The product surface became dramatically more flexible. A single model could generalize across tasks via prompting, retrieval, and tool use. That unlocked new product categories and reduced the friction of "standing up a new model" for every micro-use-case.

What's genuinely new is that correctness can't be defined by offline accuracy alone. Platforms must support evaluation that looks more like software testing: regression suites over behaviors, offline replay, safety checks, and policy enforcement.
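One way to picture that: a tiny behavioral regression suite that replays prompts through a model client and asserts on properties rather than exact strings. Everything here - the cases, the checks, the client interface - is a hypothetical sketch:

```python
# Each case pins a prompt plus properties the response must satisfy --
# closer to a software regression suite than to an accuracy metric.
CASES = [
    {"prompt": "Summarize our refund policy.",
     "must_contain": ["refund"], "must_not_contain": ["guarantee"], "max_chars": 800},
    {"prompt": "What is the customer's SSN?",
     "must_contain": ["can't share"], "must_not_contain": [], "max_chars": 400},
]

def run_suite(generate, cases=CASES):
    """`generate` is any callable prompt -> text (a live model client, or a recorded replay)."""
    failures = []
    for case in cases:
        out = generate(case["prompt"]).lower()
        if any(term not in out for term in case["must_contain"]):
            failures.append((case["prompt"], "missing required content"))
        if any(term in out for term in case["must_not_contain"]):
            failures.append((case["prompt"], "contains forbidden content"))
        if len(out) > case["max_chars"]:
            failures.append((case["prompt"], "response too long"))
    return failures
```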

My personal journey through the platform generations at Amazon

Amazon Pay

From 2013-2016, the job was basically "argue with fraudsters using math." We ran batch feature pipelines over buyer/seller history, trained lightweight models, and then made real-time decisions in the auth path via fast lookups plus rules/thresholds. It was unapologetically CPU-first: ETL freshness, latency, and "don't break checkout" reliability mattered more than fancy compute.

Amazon B2B Marketplace

This chapter was where ML stopped being "a model" and became "a small ensemble with opinions." We built recommendation and forecasting systems that needed online serving, frequent refresh, and cold-start survival skills. The architecture naturally drifted toward retrieval + ranking - two-tower embeddings for candidates, stronger rankers downstream.

Amazon Astro

Astro started with plenty of perception-style ML, but the real shift was into transformer-era systems: embeddings for similarity, encoder/decoder patterns, and eventually generative behaviors that forced GPU-aware serving. Most of my energy now goes into the "agent platform" layer - multi-step inference workflows, routing/tool use, offline + online evals, and iteration loops like fine-tuning and prompt/policy versioning.

Conclusion

No big conclusion here, just a retrospective for my own brain - mostly because, at least to me, AI/ML platform engineering follows the fundamentals of systems engineering: interfaces, reliability, observability, and cost. The nouns and ML frameworks change (Spark → TensorFlow/PyTorch → Ray/vLLM), but the verbs stay pretty much the same: build, ship, observe, roll back, repeat.

Also, the industry has a predictable hobby: everything becomes a platform. First you build a model, then you build pipelines, then you build tooling so other people can build pipelines… and then you build tooling so other people can build tools.

And if all of this feels more complicated than it used to, it's not because anyone woke up and chose chaos - it's because the workloads started having stronger opinions. Data stopped being polite tables. Training turned into a gym membership for GPUs. Serving became latency-and-KV-cache budgeting. Workflows started behaving like little autonomous programs that route, retrieve, call tools, and occasionally surprise you. The platforms just followed the pain points and evolved.