Innovation Labs · Lab L1 · Beta · Multi-agent systems

Agentic Workflow Studio
Planner · Critic · Tool Registry · Eval Harness

Most agent demos collapse in production because they ship a planner without a critic, tools without a registry, and evals without a regression suite. The Studio treats agents like software: typed contracts, deterministic fallbacks, golden sets, signed tool calls, and a tracing plane your SRE team actually trusts.

Research thesis
Agents need the same engineering posture as microservices: contracts, evals, traces, blast-radius controls and rollbacks.
  • Task completion: +4.2×
  • Hallucination rate: −81%
  • Cost per resolved task: −64%
Active experiments

What the lab is testing right now

Planner/critic loops at scale

Comparing reflexion, tree-of-thought and graph planners on 12 enterprise task suites.
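
A minimal sketch of the loop shape under test, with hypothetical plan and critique functions standing in for the planner and critic models; the 0.78 threshold echoes the critic setting in the example declaration further down.

// Illustrative planner/critic loop; every name here is hypothetical.
type Plan = { steps: string[] };
type Critique = { confidence: number; feedback: string };

async function planWithCritic(
  task: string,
  plan: (task: string, feedback?: string) => Promise<Plan>,
  critique: (task: string, candidate: Plan) => Promise<Critique>,
  minConfidence = 0.78, // same threshold as the critic block in the declaration below
  maxRounds = 3,
): Promise<Plan> {
  let candidate = await plan(task);
  for (let round = 0; round < maxRounds; round++) {
    const review = await critique(task, candidate);
    if (review.confidence >= minConfidence) return candidate; // critic accepts the plan
    candidate = await plan(task, review.feedback);            // revise against the critique
  }
  return candidate; // still low confidence: hand off to a deterministic fallback
}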

Deterministic fallbacks

When confidence dips below threshold, route to typed deterministic code paths — measured cost & accuracy uplift.
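
A sketch of that routing decision, assuming a hypothetical agentResolve call and a typed, rule-based quote function; only the 0.78 threshold comes from the example declaration below.

// Illustrative confidence-gated fallback; all names are hypothetical.
type RenewalQuote = { accountId: string; priceUsd: number; termMonths: number };

async function resolveRenewal(
  accountId: string,
  agentResolve: (id: string) => Promise<{ quote: RenewalQuote; confidence: number }>,
  deterministicQuote: (id: string) => Promise<RenewalQuote>, // typed, rule-based path
  minConfidence = 0.78,
): Promise<{ quote: RenewalQuote; route: "agent" | "deterministic" }> {
  const attempt = await agentResolve(accountId);
  if (attempt.confidence >= minConfidence) {
    return { quote: attempt.quote, route: "agent" };
  }
  // Below threshold: take the deterministic code path and record which route answered,
  // so the eval harness can compare cost and accuracy across both.
  return { quote: await deterministicQuote(accountId), route: "deterministic" };
}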

Cross-model arbitrage

Real-time routing between GPT-5, Claude, Gemini 3 and OSS Mistral based on task class, cost ceiling and region.
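
One way the routing table could be expressed; the model names come from the list above, while the task classes, cost estimates and selection policy are illustrative assumptions.

// Illustrative router: pick a model by task class, cost ceiling and region.
type Route = { model: string; estCostUsd: number; regions: string[] };

const routes: Record<string, Route[]> = {
  "contracts.draft": [
    { model: "gpt-5",        estCostUsd: 0.18, regions: ["us-east", "eu-west"] },
    { model: "gemini-3-pro", estCostUsd: 0.12, regions: ["eu-west"] },
  ],
  "crm.summarise": [
    { model: "mistral-oss",  estCostUsd: 0.01, regions: ["eu-west", "on-prem"] },
    { model: "claude",       estCostUsd: 0.04, regions: ["us-east", "eu-west"] },
  ],
};

function pickModel(taskClass: string, costCeilingUsd: number, region: string): string {
  const eligible = (routes[taskClass] ?? [])
    .filter((r) => r.regions.includes(region) && r.estCostUsd <= costCeilingUsd)
    .sort((a, b) => a.estCostUsd - b.estCostUsd); // cheapest eligible route first (placeholder policy)
  if (eligible.length === 0) {
    throw new Error(`no eligible model for ${taskClass} in ${region}`);
  }
  return eligible[0].model;
}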

Agent blast-radius limits

Per-tool spend and side-effect quotas enforced by the runtime, not by prompts.
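
A sketch of a runtime-enforced guard, assuming a hypothetical BlastRadiusGuard; the per-tool ceilings mirror those in the example declaration below, and nothing here depends on prompt content.

// Illustrative runtime quota check: spend and side-effect budgets per tool, per task.
type Quota = { maxSpendUsd: number; maxSideEffects: number };
type Usage = { spendUsd: number; sideEffects: number };

class BlastRadiusGuard {
  private usage = new Map<string, Usage>();

  constructor(private quotas: Record<string, Quota>) {}

  // Called by the runtime before every tool call; prompts never see this logic.
  assertAllowed(toolName: string, estCostUsd: number, isSideEffecting: boolean): void {
    const quota = this.quotas[toolName];
    if (!quota) throw new Error(`tool ${toolName} is not registered`);
    const used = this.usage.get(toolName) ?? { spendUsd: 0, sideEffects: 0 };
    if (used.spendUsd + estCostUsd > quota.maxSpendUsd) {
      throw new Error(`spend quota exceeded for ${toolName}`); // cancel, don't retry
    }
    if (isSideEffecting && used.sideEffects + 1 > quota.maxSideEffects) {
      throw new Error(`side-effect quota exceeded for ${toolName}`);
    }
    this.usage.set(toolName, {
      spendUsd: used.spendUsd + estCostUsd,
      sideEffects: used.sideEffects + (isSideEffecting ? 1 : 0),
    });
  }
}

const guard = new BlastRadiusGuard({
  "crm.read":        { maxSpendUsd: 0.05, maxSideEffects: 0 },
  "contracts.draft": { maxSpendUsd: 0.20, maxSideEffects: 1 },
});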

Shippable artefacts

Everything the lab ships

  • Studio SDK
    TypeScript + Python SDK for declaring agents, tools, evals and traces in code.
  • Eval harness
    Golden sets, regression suites, drift detection, CI gates that fail builds on quality dips (a minimal CI-gate sketch follows this list).
  • Tool registry
    Signed, versioned tools with auth scopes, cost ceilings and per-tenant policy.
  • Tracing plane
    OpenTelemetry-native traces of plan, tool calls, model routes and cost — exportable to Datadog / Honeycomb.
  • Reference patterns
    Six production blueprints: renewals, support, research, ops, claims, recruiting.
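
A minimal sketch of the CI gate mentioned in the eval harness bullet, assuming golden cases are scored offline in a Node CI step; the 0.94 threshold echoes the pass>=0.94 comment in the example declaration, and the names are assumptions.

// Illustrative CI gate: fail the build if the golden-set pass rate dips below the bar.
type GoldenCase = { prompt: string; expected: string };
type Grader = (prompt: string, expected: string) => Promise<boolean>;

async function passRate(cases: GoldenCase[], grade: Grader): Promise<number> {
  let passed = 0;
  for (const c of cases) {
    if (await grade(c.prompt, c.expected)) passed++;
  }
  return passed / cases.length;
}

async function evalGate(cases: GoldenCase[], grade: Grader, minPassRate = 0.94): Promise<void> {
  const rate = await passRate(cases, grade);
  if (rate < minPassRate) {
    console.error(`eval gate failed: pass rate ${rate.toFixed(3)} < ${minPassRate}`);
    process.exit(1); // non-zero exit fails the build, which is the point of the gate
  }
  console.log(`eval gate passed: pass rate ${rate.toFixed(3)}`);
}
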
Lab team
  • Agent Systems Principal
  • Eval Engineer
  • Tool / SDK Engineer
  • Tracing & Observability Lead
Partners we collaborate with
OpenAI · Anthropic · Google DeepMind · LangChain · DSPy · Datadog · Honeycomb
Example output · Studio · agents.declare · TypeScript
import { defineAgent, tool, eval as evalSet } from "@axp/studio";

export const renewals = defineAgent({
  name: "renewals.copilot",
  planner: "graph",
  critic: { strategy: "reflexion", min_confidence: 0.78 },
  models: { primary: "gpt-5", fallback: "gemini-3-pro" },
  tools: [
    tool("crm.read",      { scope: "tenant", cost_ceiling_usd: 0.05 }),
    tool("contracts.draft", { scope: "tenant", cost_ceiling_usd: 0.20 }),
  ],
  evals: evalSet("renewals.golden.v7"), // 412 prompts, pass>=0.94
  trace: { otel: true, redact: ["pii"] },
  region_pin: "eu-west",
});
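
Keeping the critic threshold, model fallback, per-tool cost ceilings, eval-set version and region pin in one typed declaration means each of those knobs is diffable in code review and enforceable in CI, rather than living in prompts or dashboards.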
Engagement timeline

Weeks 1–8 · first agent in shadow by week 2

  1. Weeks 1–2 · Shadow deploy

    Wrap an existing workflow; capture baseline cost, latency and quality with the eval harness (see the shadow-wrapper sketch after this timeline).

  2. Weeks 3–5 · Planner + critic + tools

    Wire planner/critic loop, register typed tools, ship deterministic fallbacks.

  3. Weeks 5–8 · Production hand-off

    Tracing, blast-radius limits, model routing, runbook + SRE on-call rotation.
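
A sketch of the step 1 shadow wrapper: the incumbent workflow keeps serving the user while the agent's answer is only recorded for the eval harness. All names here are illustrative.

// Illustrative shadow wrapper: the incumbent path stays authoritative,
// the agent's answer is only recorded for later comparison.
type Sample<T> = { value: T; latencyMs: number };

async function timed<T>(fn: () => Promise<T>): Promise<Sample<T>> {
  const start = Date.now();
  const value = await fn();
  return { value, latencyMs: Date.now() - start };
}

async function inShadow<T>(
  incumbent: () => Promise<T>,                                        // existing workflow, still serves the user
  agent: () => Promise<T>,                                            // new agent, shadow traffic only
  record: (baseline: Sample<T>, shadow: Sample<T>) => Promise<void>,  // feeds the eval harness baseline
): Promise<T> {
  const baseline = await timed(incumbent);
  // Run the agent in the background; its failures must never affect the live response.
  timed(agent)
    .then((shadow) => record(baseline, shadow))
    .catch(() => { /* shadow-only: errors surface in traces, not to users */ });
  return baseline.value;
}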

Flagship pods

Productionised by these squads

Renewals Agent Pod
Support Triage Pod
Research Co-pilot Pod
Claims & Adjudication Pod
Selected publications

Receipts, not just thesis

  • Critic-grounded planners outperform monolithic agents on enterprise tasks
    NeurIPS Workshop on Agentic AI · 2025
  • Deterministic fallbacks: a cost-quality study across 1.4M agent runs
    AXP Internal Whitepaper · 2026
FAQs

What partners actually ask

Is this another agent framework?

No — it's the missing engineering layer around frameworks. Use LangGraph, DSPy, OpenAI Agents or your own; the Studio adds evals, tools, traces and policy.

How do you stop agent loops blowing budgets?

Per-tool and per-task spend ceilings enforced at the runtime, plus auto-cancellation at policy thresholds. Logged to your FinOps cockpit.

Can it run on-prem?

Yes — the runtime ships as a container; OSS models served on your private GPUs are first-class.

Do you replace our SRE / observability stack?

No. Traces export via OpenTelemetry to whatever you already use (Datadog, Honeycomb, Grafana).

Design-partner programme · L1 Agentic Workflow Studio

Co-build Agentic Workflow Studio with us in Weeks 1–8.

We'll respond within one business day with a scoping note, a fixed-price outcome contract, and a named principal cleared for your domain. Design partners get first-look access, joint publication rights and roadmap influence.

  • Outcome-priced — no T&M.
  • Sovereign by default — your data, your region, your keys.
  • Refund-backed if the contracted KPI isn't hit.
  • Joint publication rights and conference slots.