jeremy.runtime
jeremy@agent: /writing

Writing notes

Shorter notes and article-sized fragments on AI platform infrastructure, agentic systems, developer tooling, Swoleby, and applied AI product judgment.

long / developer tooling

Agent-driven development needs a control plane

The longer version covers the operating system around agentic development: architect/coder splits, review agents, babysitting, cron-driven follow-through, PR triage, low-risk autonomy, rollback testing, and feature flags.

Read the longer version

01 / agentic systems

Enterprise AI needs boring infrastructure

The public conversation around agents tends to reward demos: a tool call works, a browser clicks around, a spreadsheet gets edited. The enterprise version is less glamorous and more important. The hard parts are permissions, audit trails, recovery paths, sandboxing, observability, and human review.

The argument: useful agentic systems look less like magic and more like infrastructure. They need contracts for tools, typed inputs and outputs, execution modes, policy boundaries, and logs that let a human understand what happened after the fact.
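
The contract idea can be made concrete in a few lines. A minimal sketch with invented names (`ToolContract`, `get_invoice`), not any specific framework's API: a tool pairs a declared input schema with an execution mode and a log a human can read afterward.

```python
from dataclasses import dataclass, field
from typing import Any, Callable
import time

@dataclass
class ToolContract:
    """A tool is an API with a declared schema, a mode, and an audit log."""
    name: str
    input_schema: dict            # JSON-schema-style description of arguments
    mode: str                     # "read_only" | "needs_approval" | "autonomous"
    handler: Callable[[dict], Any]
    log: list = field(default_factory=list)

    def call(self, args: dict) -> Any:
        # Reject arguments the schema does not declare.
        unknown = set(args) - set(self.input_schema["properties"])
        if unknown:
            raise ValueError(f"undeclared arguments: {unknown}")
        result = self.handler(args)
        # Append a log entry a human can understand after the fact.
        self.log.append({"tool": self.name, "args": args,
                         "mode": self.mode, "ts": time.time()})
        return result

# Illustrative read-only tool.
lookup = ToolContract(
    name="get_invoice",
    input_schema={"properties": {"invoice_id": {"type": "string"}}},
    mode="read_only",
    handler=lambda a: {"invoice_id": a["invoice_id"], "status": "paid"},
)
```

The point of the sketch is that the boring parts (schema check, mode, log entry) live in the contract, not in the prompt.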

02 / mcp

What MCP changes about tool ecosystems

MCP is useful because it gives agent systems a way to discover and call capabilities without every integration becoming a bespoke prompt hack. That matters most when tool use becomes a platform surface rather than a demo-specific script.

The explanation centers on why tool discovery, schema discipline, execution context, and error handling matter. The thesis is simple: agent tools are APIs with a new caller, not an excuse to abandon API design.

03 / openapi tools

OpenAPI is still one of the best bridges into agent tooling

Every company already has APIs. Many of them already have OpenAPI specs. That means the fastest path to practical AI tool use is not always inventing a new tool protocol from scratch. It is turning existing API contracts into usable, governed tool surfaces.

The useful angle is the messy middle: auth, pagination, ambiguous endpoints, destructive operations, rate limits, and how to decide which API actions should be exposed to an agent at all.

04 / secure execution

Sandboxing is a product feature, not an implementation detail

If an AI system can execute code or operate tools, sandboxing becomes part of the user experience. Users need to know what can happen, what cannot happen, what needs approval, and how to inspect the result.

The connection is between technical sandbox design and product trust: scoped filesystems, network controls, timeouts, secrets boundaries, preview-before-apply flows, and why "the model decided" is never a satisfying explanation.
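
Those boundaries can be expressed as a policy object the product can display, not just enforce. A minimal sketch, with invented names and defaults:

```python
from dataclasses import dataclass

@dataclass
class SandboxPolicy:
    """User-visible sandbox limits: what can happen, what needs approval."""
    writable_paths: tuple = ("/workspace",)          # scoped filesystem
    allowed_hosts: tuple = ()                        # network: deny by default
    timeout_s: int = 30                              # hard wall-clock limit
    approval_required: tuple = ("write", "network")  # preview-before-apply

def decide(policy: SandboxPolicy, action: str, target: str) -> str:
    """Return 'allow', 'ask', or 'deny' so the UI can explain the outcome."""
    if action == "write":
        if not any(target.startswith(p) for p in policy.writable_paths):
            return "deny"
        return "ask" if "write" in policy.approval_required else "allow"
    if action == "network":
        if target not in policy.allowed_hosts:
            return "deny"
        return "ask" if "network" in policy.approval_required else "allow"
    return "allow"  # e.g. reads inside the sandbox
```

Because the decision is a named function over a named policy, "what happened and why" is answerable without appealing to the model.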

05 / human review

Human-in-the-loop is not a modal

Many products treat human review as a confirmation dialog. That is too shallow. Human-in-the-loop design is about choosing where judgment belongs, what context the human needs, and what the system should learn from approval or rejection.

The argument treats review as a workflow primitive: summarize intent, show diffs, explain risk, preserve undo, and make approval data useful for future evals.
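
As a workflow primitive, a review request carries intent, diff, risk, and an undo path, and the decision itself is kept for later evals. A sketch; field names are invented:

```python
from dataclasses import dataclass

@dataclass
class ReviewRequest:
    intent: str   # what the agent is trying to do, in plain language
    diff: str     # the exact change being proposed
    risk: str     # "low" | "medium" | "high"
    undo: str     # how to reverse the change if approved

decisions: list = []   # approval data, reusable as eval labels later

def review(req: ReviewRequest, approved: bool, note: str = "") -> bool:
    """Record the human judgment alongside the context it was made in."""
    decisions.append({"intent": req.intent, "risk": req.risk,
                      "approved": approved, "note": note})
    return approved
```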

06 / evals

Agent evals should measure behavior, not vibes

Agent evaluation gets weak when it only asks whether an answer sounds good. Real systems need to evaluate task completion, tool choice, safety boundaries, cost, latency, and whether the system recovered from messy intermediate states.

The Swoleby engagement experiments are a good public example: tone guardrails, direct calls to action, max-length caps, non-empty outputs, and stage-level scoring are small but concrete moves toward behavior-based quality.
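
Checks like these can be plain functions over the output rather than model judgments. A sketch of that style of guardrail; the keyword lists and length cap are invented for illustration:

```python
def check_output(text: str, max_len: int = 320) -> dict:
    """Behavior checks: cheap, deterministic, runnable on every output."""
    lowered = text.lower()
    checks = {
        "non_empty": bool(text.strip()),
        "max_length": len(text) <= max_len,
        # Crude call-to-action check: does the message ask for an action?
        "has_cta": any(w in lowered for w in ("reply", "try", "log", "?")),
        # Crude tone guardrail: no shaming language.
        "tone_ok": not any(w in lowered for w in ("lazy", "failure")),
    }
    checks["pass"] = all(checks.values())
    return checks
```

Each key is a stage-level score on its own, so a failing output tells you which behavior regressed, not just that "quality dropped."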

07 / product design

Building Swoleby: why SMS beats another app

The product bet behind Swoleby is that behavior change does not start in a dashboard. It starts when the person is about to skip the thing they said they wanted to do. SMS is closer to that moment than a dashboard or a weekly report.

The product logic is reminders, context, accountability, tone, fallback plans, and the difference between a feature that is impressive and a loop that someone actually uses.

08 / memory

How much memory should an AI coach have?

Memory is useful until it feels creepy or wrong. A coach needs enough context to avoid asking the same questions repeatedly, but not so much that the user feels surveilled or trapped by old information.

A practical memory model separates stable profile facts, recent conversation state, explicit user preferences, opt-out boundaries, and reviewable summaries that let the user correct the system.
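
That separation can be literal in the data model. A sketch with invented tier names, where each tier can be shown, edited, or cleared on its own:

```python
from dataclasses import dataclass, field

@dataclass
class CoachMemory:
    """Memory split into tiers the user can see, correct, or clear separately."""
    profile: dict = field(default_factory=dict)      # stable facts ("goal: 5k")
    recent: list = field(default_factory=list)       # rolling conversation state
    preferences: dict = field(default_factory=dict)  # explicit user choices
    opt_out: set = field(default_factory=set)        # keys never to store

    def remember(self, kind: str, key: str, value) -> bool:
        if key in self.opt_out:
            return False                  # the boundary beats convenience
        if kind == "profile":
            self.profile[key] = value
        elif kind == "preference":
            self.preferences[key] = value
        else:
            self.recent.append((key, value))
            self.recent = self.recent[-20:]   # recent state stays bounded
        return True

    def correct(self, key: str) -> None:
        """Let the user delete a fact instead of arguing with the system."""
        self.profile.pop(key, None)
        self.preferences.pop(key, None)
```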

09 / reminders

The reminder is the product

For behavior-change products, reminders are not notification plumbing. They are the main surface where product judgment shows up. A reminder can be useful, annoying, shaming, timely, irrelevant, or exactly what the user needed.

The product surface includes cadence, quiet hours, reply handling, user control, and evaluation of whether reminders produce action rather than just more engagement noise.

10 / prompt systems

Prompt tuning is product tuning

Prompt work is often framed as model whispering. In a real product it is closer to product tuning. A prompt encodes tone, policy, assumptions, data contracts, tool expectations, and what the product considers a good next action.

Swoleby and agent tooling provide concrete examples: short responses, direct calls to action, safe boundaries, and prompts that are evaluated against concrete behavior checks.

11 / developer tools

Slash commands are underrated AI product design

Slash commands make capability visible. They give users a way to discover what the system can do, repeat useful actions, and build a mental model of the tool.

The broader argument: good AI interfaces should expose affordances instead of hiding everything behind a blank text box. A little structure can make the system feel more powerful, not less.
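
A slash-command registry is only a few lines: registration makes capability enumerable, and `/help` falls out for free. A sketch with invented command names:

```python
from typing import Callable

COMMANDS: dict = {}   # name -> (help text, handler)

def command(name: str, help_text: str):
    """Register a slash command so capability is discoverable, not hidden."""
    def wrap(fn: Callable[[str], str]):
        COMMANDS[name] = (help_text, fn)
        return fn
    return wrap

@command("/summarize", "Summarize the current thread")
def summarize(arg: str) -> str:
    return f"summary of: {arg}"

@command("/help", "List available commands")
def help_cmd(arg: str) -> str:
    return "\n".join(f"{n} - {h}" for n, (h, _) in sorted(COMMANDS.items()))

def dispatch(line: str) -> str:
    name, _, arg = line.partition(" ")
    if name not in COMMANDS:
        return f"unknown command {name}; try /help"
    return COMMANDS[name][1](arg)
```

Even the error path teaches the mental model: an unknown command points back at the affordance list.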

12 / codex workflow

What I want from an AI coding runtime

AI coding tools are strongest when they preserve engineering discipline: read the code first, make scoped changes, test the actual surface, and explain tradeoffs. They are weakest when they become autocomplete with commit access.

The through-line runs across OpenClaw, Codex, Claude Code, Cursor, and oh-my-codex: reusable prompts, skills, state, teams, browser verification, and the need for better agent runtime ergonomics.

13 / pareto

Pareto frontiers are a better metaphor for AI systems work

Most AI product decisions are tradeoffs: accuracy, cost, latency, safety, user control, and implementation complexity. A single "best" answer is often the wrong goal. The better question is which frontier you are moving.

The article connects syftr-style optimization to career and product storytelling: the best work expands the frontier instead of optimizing one metric in isolation.
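
The frontier itself is cheap to compute: keep every candidate that no other candidate beats on all axes at once. A sketch over invented configs and scores:

```python
def pareto_frontier(points: list, maximize: tuple, minimize: tuple) -> list:
    """Keep points not dominated on every axis by some other point."""
    def dominates(a, b):
        ge = (all(a[k] >= b[k] for k in maximize)
              and all(a[k] <= b[k] for k in minimize))
        gt = (any(a[k] > b[k] for k in maximize)
              or any(a[k] < b[k] for k in minimize))
        return ge and gt
    return [p for p in points if not any(dominates(q, p) for q in points)]

configs = [
    {"name": "small",   "accuracy": 0.78, "cost": 0.2},
    {"name": "medium",  "accuracy": 0.85, "cost": 0.9},
    {"name": "bloated", "accuracy": 0.84, "cost": 1.5},  # dominated by medium
]
frontier = pareto_frontier(configs, maximize=("accuracy",), minimize=("cost",))
```

"Small" and "medium" both survive because each wins on a different axis; "bloated" is strictly worse than "medium" and drops out. That is the shape of most real AI product decisions.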

14 / predictive ai

Generative AI still needs predictive AI

LLMs changed the interface, but prediction did not stop mattering. Ranking, scoring, routing, detection, evaluation, personalization, and decision support are still central to useful AI systems.

Generative and predictive AI are complementary. The model that writes the response is only one component. The system still needs to decide what to retrieve, what to trust, what to show, and what to do next.
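
A small example of prediction inside a generative system: a scoring step decides whether retrieval is trustworthy enough to generate from at all. The scores and threshold below are invented:

```python
def route(candidates: list, min_score: float = 0.6) -> dict:
    """Predictive step: rank and filter retrievals before any text is generated."""
    trusted = [c for c in candidates if c["score"] >= min_score]
    if not trusted:
        # Nothing trustworthy to ground on: don't generate, ask instead.
        return {"action": "ask_clarifying_question", "context": []}
    trusted.sort(key=lambda c: c["score"], reverse=True)
    return {"action": "generate", "context": trusted[:3]}
```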

15 / technical writing

Real Python taught me that developer education is product work

Developer education is not just documentation. It is product design for understanding. The examples, pacing, conceptual model, and failure modes all determine whether someone can actually use the technology.

Real Python connects directly to current AI tooling: if developers cannot understand the system, debug it, and build confidence through small wins, the platform does not matter.

16 / consulting

What consulting teaches you about enterprise AI

Running a consulting firm teaches you that the clean architecture diagram is never the whole story. The real system includes budgets, timelines, users, support, client politics, and the cost of production failure.

Softworks is the background and enterprise AI is the point: organizations need systems that fit how they actually operate, not demos that assume ideal users and ideal data.

17 / scientific software

Scientific software made me care about metadata

At Harvard Medical School, the software challenge was not just computation. It was metadata, quality, workflow, and making complex data usable for researchers. That lesson transfers directly to AI systems.

AI products are only as useful as the surrounding context model. Bad metadata creates bad retrieval, bad evaluation, and bad user trust.

18 / iot

From microcontrollers to agents: interfaces that reach the real world

The old NodeMCU and Alexa smart-home projects look far away from agentic AI, but the pattern is similar: connect software to a physical or behavioral workflow, then design for reliability at the boundary.

The older hardware posts connect naturally to Swoleby: the interface matters most when it changes what happens outside the screen.

19 / ad agents

Ad management is a good agent benchmark

Meta ads are a useful agent domain because the workflow is repetitive but not trivial: monitor spend, spot fatigue, find winners, generate copy, upload variants, and ask for approval before money moves.

Meta-ads-kit is the public example of why the best agent workflows combine pattern recognition, tool execution, human approval, and closed-loop measurement.

20 / operating systems

Personal operating systems for agent-led work

The next useful layer around coding agents is not one more prompt. It is a working system for plans, context, review queues, CI signals, rollback checks, and long-running follow-through.

The article connects OpenClaw, Codex, review agents, babysitting workflows, cron-driven work, and PR triage into one operating model for getting useful work shipped.