Agentic Workflow Studio — Planner · Critic · Tool Registry · Eval Harness
Most agent demos collapse in production because they ship a planner without a critic, tools without a registry, and evals without a regression suite. The Studio treats agents like software: typed contracts, deterministic fallbacks, golden sets, signed tool calls, and a tracing plane your SRE team actually trusts.
What the lab is testing right now
- Comparing reflexion, tree-of-thought and graph planners on 12 enterprise task suites.
- When confidence dips below threshold, route to typed deterministic code paths — measured cost & accuracy uplift.
- Real-time routing between GPT-5, Claude, Gemini 3 and OSS Mistral based on task class, cost ceiling and region.
- Per-tool spend and side-effect quotas enforced by the runtime, not by prompts.
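The confidence-gated fallback above can be sketched in a few lines of TypeScript. A minimal sketch, assuming a `routeStep` helper and a 0.78 threshold (both illustrative, not the Studio API — the threshold mirrors the `min_confidence` in the SDK example on this page):

```typescript
// Hypothetical sketch: route a planner step either to the model or to a
// typed deterministic code path when critic confidence dips below threshold.
type StepResult = { output: string; route: "model" | "deterministic" };

const MIN_CONFIDENCE = 0.78; // mirrors critic.min_confidence in the SDK example

function routeStep(
  criticConfidence: number,
  modelCall: () => string,
  deterministicFallback: () => string
): StepResult {
  if (criticConfidence < MIN_CONFIDENCE) {
    // Low confidence: take the cheap, auditable deterministic path.
    return { output: deterministicFallback(), route: "deterministic" };
  }
  return { output: modelCall(), route: "model" };
}
```

Because the fallback is ordinary typed code, its cost and accuracy can be measured against the model path on the same task suite.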
Everything the lab ships
- Studio SDK: TypeScript + Python SDK for declaring agents, tools, evals and traces in code.
- Eval harness: Golden sets, regression suites, drift detection, CI gates that fail builds on quality dips.
- Tool registry: Signed, versioned tools with auth scopes, cost ceilings and per-tenant policy.
- Tracing plane: OpenTelemetry-native traces of plan, tool calls, model routes and cost — exportable to Datadog / Honeycomb.
- Reference patterns: Six production blueprints — renewals, support, research, ops, claims, recruiting.
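As a sketch of how an eval-harness CI gate might fail a build on a quality dip (the `gatePassRate` function and `GoldenCase` type are illustrative assumptions, not the actual SDK API):

```typescript
// Hypothetical CI gate: fail the build when the pass rate on a golden set
// drops below the contracted threshold (e.g. pass >= 0.94).
type GoldenCase = { prompt: string; passed: boolean };

function gatePassRate(cases: GoldenCase[], minPassRate: number): number {
  const passRate = cases.filter((c) => c.passed).length / cases.length;
  if (passRate < minPassRate) {
    // Throwing makes the CI step exit non-zero, which fails the build.
    throw new Error(`Eval gate failed: ${passRate.toFixed(3)} < ${minPassRate}`);
  }
  return passRate;
}
```

Running the gate in CI turns model quality into a regression test: a drop below threshold blocks the merge rather than surfacing in production.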
- Agent Systems Principal
- Eval Engineer
- Tool / SDK Engineer
- Tracing & Observability Lead
```typescript
import { defineAgent, tool, eval as evalSet } from "@axp/studio";

export const renewals = defineAgent({
  name: "renewals.copilot",
  planner: "graph",
  critic: { strategy: "reflexion", min_confidence: 0.78 },
  models: { primary: "gpt-5", fallback: "gemini-3-pro" },
  tools: [
    tool("crm.read", { scope: "tenant", cost_ceiling_usd: 0.05 }),
    tool("contracts.draft", { scope: "tenant", cost_ceiling_usd: 0.20 }),
  ],
  evals: evalSet("renewals.golden.v7"), // 412 prompts, pass >= 0.94
  trace: { otel: true, redact: ["pii"] },
  region_pin: "eu-west",
});
```

Weeks 1–8 · first agent in shadow by week 2
1. Weeks 1–2 · Shadow deploy
   Wrap an existing workflow; capture baseline cost, latency and quality with the eval harness.
2. Weeks 3–5 · Planner + critic + tools
   Wire the planner/critic loop, register typed tools, ship deterministic fallbacks.
3. Weeks 5–8 · Production hand-off
   Tracing, blast-radius limits, model routing, runbook + SRE on-call rotation.
Productionised by these squads
Receipts, not just thesis
- Critic-grounded planners outperform monolithic agents on enterprise tasks · NeurIPS Workshop on Agentic AI · 2025
- Deterministic fallbacks: a cost-quality study across 1.4M agent runs · AXP Internal Whitepaper · 2026
What partners actually ask
Is the Studio just another agent framework?
No — it's the missing engineering layer around frameworks. Use LangGraph, DSPy, OpenAI Agents or your own; the Studio adds evals, tools, traces and policy.

How do you keep agent spend under control?
Per-tool and per-task spend ceilings enforced at the runtime, plus auto-cancellation at policy thresholds. Logged to your FinOps cockpit.

Can we run it self-hosted, with our own models?
Yes — the runtime ships as a container; OSS models served on your private GPUs are first-class.

Do we need a new observability stack?
No. Traces export via OpenTelemetry to whatever you already use (Datadog, Honeycomb, Grafana).
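The runtime-enforced spend ceilings described above can be sketched as code rather than prompt text. A minimal sketch, assuming a hypothetical `SpendLedger` class (not the actual Studio runtime API):

```typescript
// Hypothetical runtime-side quota: reject a tool call once cumulative spend
// would exceed that tool's ceiling — enforced in code, not in prompts.
class SpendLedger {
  private spent = new Map<string, number>();

  // Returns true and records the charge if within the ceiling; false cancels the call.
  charge(toolName: string, costUsd: number, ceilingUsd: number): boolean {
    const total = (this.spent.get(toolName) ?? 0) + costUsd;
    if (total > ceilingUsd) return false; // over quota: auto-cancel
    this.spent.set(toolName, total);
    return true;
  }
}
```

With the ceilings declared per tool (as in the `cost_ceiling_usd` fields of the SDK example), the ledger can deny the call before any side effect occurs, independent of what the model asks for.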
Co-build Agentic Workflow Studio with us in Weeks 1–8.
We'll respond within one business day with a scoping note, a fixed-price outcome contract, and a named principal cleared for your domain. Design partners get first-look access, joint publication rights and roadmap influence.
- Outcome-priced — no T&M.
- Sovereign by default — your data, your region, your keys.
- Refund-backed if the contracted KPI isn't hit.
- Joint publication rights and conference slots.