jeremy.runtime
jeremy@agent: /projects/datarobot-ai-platform

DataRobot AI Platform

Seven-plus years building the platform layer between advanced AI capability and production reality: predictive AI, trust and explainability, agent runtimes, eval infrastructure, tool execution, and recovery systems.

What I built last quarter, and the decision behind it

DataRobot's public REST surface is roughly four megabytes of OpenAPI - far too large to hand an LLM as a tool catalog without destroying its context budget on every turn. So when I designed our Global MCP gateway I borrowed Cloudflare's Code Mode pattern: agents get a small, semantic search/execute surface, then run arbitrary code against the full DataRobot SDK inside a sandbox.
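Concretely, the agent-facing surface collapses to two tools. A minimal sketch of that shape, assuming the FastMCP package - the names, the catalog, and the stubbed runner are illustrative, not the production gateway:

```python
from fastmcp import FastMCP

mcp = FastMCP("datarobot-global-mcp")

# Stand-in for the real semantic index over the SDK reference.
_CATALOG = {
    "datarobot.Project.start": "Create and train a project from a dataset.",
    "datarobot.Deployment.predict": "Score data against a live deployment.",
}

@mcp.tool()
def search_sdk(query: str, top_k: int = 5) -> list[dict]:
    """Return matching SDK entry points (symbol + one-line doc) -
    never the full 4MB OpenAPI catalog."""
    hits = [
        {"symbol": sym, "doc": doc}
        for sym, doc in _CATALOG.items()
        if query.lower() in f"{sym} {doc}".lower()
    ]
    return hits[:top_k]

@mcp.tool()
def execute_code(code: str, timeout_s: int = 120) -> dict:
    """Run agent-authored Python against the full SDK. In production
    this dispatches a per-call sandboxed K8s Job; stubbed here."""
    raise NotImplementedError("dispatch to the sandbox runner")

if __name__ == "__main__":
    mcp.run()
```

The agent's context carries two tool schemas no matter how large the SDK grows; everything else is retrieved on demand.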

The sandbox is a per-call Kubernetes Job with an Envoy + OPA egress proxy, a 15MB base image, sub-200ms cold starts, and a strict securityContext that I pushed through the internal Workload API. The centerpiece PR - feat: add sandboxed code execution via K8s Jobs - is about 28,000 lines across 214 files. The gateway went live in production in April 2026; agent skills shipped to the Cursor marketplace the same week and were submitted to the Claude marketplace immediately after.
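The shape of that Job through the Python Kubernetes client, simplified - image, namespace, and resource limits are placeholders, and the Envoy + OPA egress sidecar wiring is omitted:

```python
from kubernetes import client, config

def make_sandbox_job(call_id: str, code: str) -> client.V1Job:
    # Strict securityContext: non-root, read-only rootfs, no capabilities.
    sec = client.V1SecurityContext(
        run_as_non_root=True,
        read_only_root_filesystem=True,
        allow_privilege_escalation=False,
        capabilities=client.V1Capabilities(drop=["ALL"]),
        seccomp_profile=client.V1SeccompProfile(type="RuntimeDefault"),
    )
    container = client.V1Container(
        name="runner",
        image="registry.example.internal/sandbox-runner:latest",  # small base image
        command=["python", "-c", code],
        security_context=sec,
        resources=client.V1ResourceRequirements(limits={"cpu": "1", "memory": "512Mi"}),
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"sandbox-call": call_id}),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )
    return client.V1Job(
        metadata=client.V1ObjectMeta(name=f"sandbox-{call_id}"),
        spec=client.V1JobSpec(
            template=template,
            backoff_limit=0,                # one shot per call, no retries
            active_deadline_seconds=120,    # hard wall-clock cap
            ttl_seconds_after_finished=60,  # GC the Job after it reports
        ),
    )

# Inside the cluster:
config.load_incluster_config()
client.BatchV1Api().create_namespaced_job("sandboxes", make_sandbox_job("abc123", "print('hi')"))
```

backoff_limit=0 plus the TTL keeps failed calls from silently retrying and keeps the sandbox namespace from accumulating dead Jobs.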

That project, in miniature, is what I do. Pick the right architectural primitive for an agentic constraint - context economics + tool-call reliability + sandbox threat model - ship the load-bearing PR myself, then build the team and the process around it.

How I think about agents in production

A few opinions, since they are the part of a resume I would want to read first:

Evals are infrastructure, not artifacts. I built datarobot-agent-tester as a reusable pytest plugin that hash-caches LLM-graded skill E2E tests so the grader cost does not dominate CI. Skill quality is a product feature; the eval harness has to be cheap enough to gate every PR.
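The caching idea in miniature - illustrative, since the real plugin hooks pytest's fixture and reporting machinery: grade once per unique (skill, transcript, grader) tuple, replay from cache on every rerun:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".eval_cache")

def cached_grade(skill_id: str, transcript: str, grader_model: str, grade_fn):
    """grade_fn is the expensive LLM-as-judge call; it only runs when
    one of the inputs has actually changed."""
    key = hashlib.sha256(
        json.dumps([skill_id, transcript, grader_model]).encode()
    ).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())  # cache hit: the grader call is free
    result = grade_fn(transcript)
    CACHE_DIR.mkdir(exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```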

Code Mode beats tool-list bloat above about 50 tools. Below that, named tools and good docstrings; above it, a typed code surface plus retrieval is the only thing that scales with the SDK without scaling with the prompt.

Sandbox threat-modeling is where most agent products are weakest. Per-call K8s Jobs + egress proxy + capability-scoped credentials is the floor, not the ceiling. Signed skill bundles with capability manifests are next, so a compromised skill cannot silently widen its blast radius.
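One possible shape for that check - a sketch only; a real design would use asymmetric signatures rather than an HMAC shared secret:

```python
import hashlib
import hmac
import json

def verify_bundle(manifest_bytes: bytes, signature: str, key: bytes) -> dict:
    """Refuse to load a skill whose manifest does not verify."""
    expected = hmac.new(key, manifest_bytes, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        raise PermissionError("skill bundle signature mismatch")
    return json.loads(manifest_bytes)

def enforce_capabilities(manifest: dict, requested: set[str]) -> None:
    """A compromised skill cannot request beyond its signed manifest."""
    granted = set(manifest.get("capabilities", []))
    extra = requested - granted
    if extra:
        raise PermissionError(f"capabilities not in manifest: {sorted(extra)}")
```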

Shipping fast on AI infra requires its own safety substrate. I built ours: feature flags + entitlements as a 60-second kill switch, 60-second smoke tests, AI-categorized nightly regression triage, one-command rollback in under five minutes, and 15-minute synthetic monitoring. Five PRs, one design doc, presented at design review. Agent platforms break in ways traditional services do not; the recovery loop has to be bounded in minutes, not hours.
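The kill-switch gate in miniature - an in-memory stand-in for the flag/entitlements service; the point is that every agent entry point re-checks the flag, so flipping it off is an off switch, not a redeploy:

```python
import functools

FLAGS = {"agent-platform": True}  # backed by the flag service in production

def require_flag(name: str):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if not FLAGS.get(name, False):
                raise RuntimeError(f"{name} disabled by kill switch")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@require_flag("agent-platform")
def handle_agent_request(payload: dict) -> dict:
    return {"status": "ok", "echo": payload}
```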

The DataRobot arc

2018-2021 - Platform Engineer. API sharing/permissions services, then DataRobot's Visual AI / image ML platform: image augmentations service split-out, batch image predictions for S3, fixing reference-counting/concurrency in the prediction pipeline. Roughly the work you would hope a senior IC at an ML platform company would have done.

2021-2024 - Engineering Manager, Trust & Explainability. Owned the domain that ships the features regulated-industry customers - financial services, healthcare, government - most depend on: compliance docs, model insights, fairness, performance. Two campaigns from that era are the work I would want to be evaluated on.

Root Insights

Root Insights was a reusable architectural framework - base classes, validators, formatters, a job manager - that consolidated a dozen one-off insight implementations into one contract. Feature Impact, Permutation SHAP, Lift, ROC, Confusion Matrix: all of them needed one underlying shape. The migration unlocked features the old design could not express, including slicing, external test sets, and segmented performance, and shipped without a customer-reported regression.
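The contract, reduced to a skeleton with illustrative names: every insight implements the same hooks, and the shared job manager, validators, and formatters do the rest - which is why slicing and segmented performance only had to be expressed once:

```python
from abc import ABC, abstractmethod
from typing import Any

class RootInsight(ABC):
    """One contract per insight; shared machinery handles job
    management, validation, and response formatting."""
    name: str

    @abstractmethod
    def validate(self, request: dict) -> None:
        """Reject bad inputs before a job is ever enqueued."""

    @abstractmethod
    def compute(self, model: Any, data: Any, slice_spec: dict | None = None) -> Any:
        """Produce raw results; slice_spec is where slicing and
        segmented performance live, once, for every insight."""

    @abstractmethod
    def format(self, raw: Any) -> dict:
        """Shape raw results into the stable API payload."""

class FeatureImpact(RootInsight):
    name = "featureImpact"
    def validate(self, request: dict) -> None: ...
    def compute(self, model: Any, data: Any, slice_spec: dict | None = None) -> Any: ...
    def format(self, raw: Any) -> dict: ...
```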

Lean Testing

I took TREX's flaky, expensive functional/E2E suite - 458 hours of testing per day before the work began - and redesigned the test pyramid around Flask-based contract tests and in-memory fixtures. The headline PR was titled "Burn the boats": a single change that deleted 15,914 lines of E2E tests after coverage had been re-grown at the contract and unit layers.
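A contract test in the shape that replaced the E2E layer - simplified, with a hypothetical app factory and illustrative field names: exercise the real Flask route against in-memory fixtures and assert the response contract, without booting the full stack:

```python
import pytest
from myapp import create_app  # hypothetical app factory

@pytest.fixture
def client():
    # In-memory fixtures instead of real services: fast and deterministic.
    app = create_app({"TESTING": True, "STORAGE": "memory"})
    return app.test_client()

def test_feature_impact_contract(client):
    resp = client.get("/api/v2/projects/p1/featureImpact/")
    assert resp.status_code == 200
    body = resp.get_json()
    # The contract: the fields downstream consumers actually depend on.
    assert {"featureImpacts", "ranRedundancyDetection"} <= body.keys()
```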

Result: 85% reduction in test execution time, 65% reduction in test infrastructure cost, zero customer regressions. Then I turned the playbook outward: Engineering All Hands slot, Lean Testing office hours for other domains, three company-wide tech talks.

Cross-org leadership

In parallel: led DataRobot storage cost reductions across the Build org, covering both logs and old modeling artifacts; owned TREX on-prem release readiness for 10.2, including CVE/OSS legal review; designed the TREX firefighting / on-call / SLA process; wrote the staff and principal job descriptions for the team; and promoted a direct report to Senior Engineer on the strength of the Root Insights continuation.

2025-2026 - Principal Architect, MCP & agent platform

When DataRobot moved seriously into agentic AI, I moved with it. Currently leading three intertwined initiatives:

Global MCP - the gateway described at the top. Live in production April 2026, on Cursor's marketplace.

DR-Claw - DataRobot's long-running agents / OpenClaw work. Replaces the previous OpenShell sandbox with the design above. I negotiated with PM to keep the hackathon team driving the prototype through to main rather than handing off, because a handoff at that stage would have cost us a quarter's worth of velocity.

Domain Packs & the Skills marketplace - six personalized agent packs for engineering productivity, CFDS, PM, sales, executive, and docs-writer workflows, each shipping with a bootstrap and a default use case. Reusable infrastructure underneath includes datarobot-agent-tester, DataRobot agent skills, and an LLM-powered AGENTS.md generator for templated projects.

I also lead DataRobot's BuildAI Architecture Guild and serve as Incident Commander.

What I got wrong and what I learned

The Root Insights migration took longer than I told leadership it would - closer to a year than nine months, because each downstream consumer wanted bespoke negotiation about which fields and contracts to preserve. The framework was the easy part; the politics of an architectural migration through a monolith with thirty teams pointing at it was not. The fix in the second half was less about code and more about giving each consumer a paved path with a co-author from my team and a clear deprecation timeline; the same playbook now drives the MCP-tool migration off legacy endpoints.

The Lean Testing rollout had a parallel lesson: the technical work was done in months, but getting other domains to adopt the pattern took a year of office hours, talks, and pairing. Both shaped how I scope cross-org architectural work now: budget the persuasion as carefully as the engineering.

Style

I run my own life on the systems I build. I have a Claude-powered daily wrap-up agent that pulls Jira, GitHub PRs, Slack mentions, Gmail, and meeting transcripts, and posts a status every morning. I have a personal "deep-coder" persona - a four-doc playbook covering concepts and architecture, quickstart, debugging runbook, and meeting run-of-show - that I use to drive my own coding agents in production work.

My PR titles include "Burn the boats" and "All your test are belong to us". I am comfortable writing the K8s Job spec, the FastMCP proxy, the OpenAPI validator, the Cypress test, the customer UI, the on-call runbook, and the promotion case for a report - usually in the same week.