Innovation Labs · Lab L1 · Beta · Multi-agent systems

Agentic Workflow Studio
Planner · Critic · Tool Registry · Eval Harness

Most agent demos collapse in production because they ship a planner without a critic, tools without a registry, and evals without a regression suite. The Studio treats agents like software: typed contracts, deterministic fallbacks, golden sets, signed tool calls, and a tracing plane your SRE team actually trusts.

Research thesis
Agents need the same engineering posture as microservices: contracts, evals, traces, blast-radius controls and rollbacks.
  • Task completion: +4.2×
  • Hallucination rate: −81%
  • Cost per resolved task: −64%
Active experiments

What the lab is testing right now

Planner/critic loops at scale

Comparing reflexion, tree-of-thought and graph planners on 12 enterprise task suites.
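
A minimal sketch of the loop shape under test, with hypothetical plan and critique functions standing in for the planner and critic models; the 0.78 threshold echoes the critic setting in the example declaration further down.

// Illustrative planner/critic loop; every name here is hypothetical.
type Plan = { steps: string[] };
type Critique = { confidence: number; feedback: string };

async function planWithCritic(
  task: string,
  plan: (task: string, feedback?: string) => Promise<Plan>,
  critique: (task: string, candidate: Plan) => Promise<Critique>,
  minConfidence = 0.78, // same threshold as the critic block in the declaration below
  maxRounds = 3,
): Promise<Plan> {
  let candidate = await plan(task);
  for (let round = 0; round < maxRounds; round++) {
    const review = await critique(task, candidate);
    if (review.confidence >= minConfidence) return candidate; // critic accepts the plan
    candidate = await plan(task, review.feedback);            // revise against the critique
  }
  return candidate; // still low confidence: hand off to a deterministic fallback
}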

Deterministic fallbacks

When confidence dips below threshold, route to typed deterministic code paths — measured cost & accuracy uplift.
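
A sketch of that routing decision, assuming a hypothetical agentResolve call and a typed, rule-based quote function; only the 0.78 threshold comes from the example declaration below.

// Illustrative confidence-gated fallback; all names are hypothetical.
type RenewalQuote = { accountId: string; priceUsd: number; termMonths: number };

async function resolveRenewal(
  accountId: string,
  agentResolve: (id: string) => Promise<{ quote: RenewalQuote; confidence: number }>,
  deterministicQuote: (id: string) => Promise<RenewalQuote>, // typed, rule-based path
  minConfidence = 0.78,
): Promise<{ quote: RenewalQuote; route: "agent" | "deterministic" }> {
  const attempt = await agentResolve(accountId);
  if (attempt.confidence >= minConfidence) {
    return { quote: attempt.quote, route: "agent" };
  }
  // Below threshold: take the deterministic code path and record which route answered,
  // so the eval harness can compare cost and accuracy across both.
  return { quote: await deterministicQuote(accountId), route: "deterministic" };
}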

Cross-model arbitrage

Real-time routing between GPT-5, Claude, Gemini 3 and OSS Mistral based on task class, cost ceiling and region.
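
One way the routing table could be expressed; the model names come from the list above, while the task classes, cost estimates and selection policy are illustrative assumptions.

// Illustrative router: pick a model by task class, cost ceiling and region.
type Route = { model: string; estCostUsd: number; regions: string[] };

const routes: Record<string, Route[]> = {
  "contracts.draft": [
    { model: "gpt-5",        estCostUsd: 0.18, regions: ["us-east", "eu-west"] },
    { model: "gemini-3-pro", estCostUsd: 0.12, regions: ["eu-west"] },
  ],
  "crm.summarise": [
    { model: "mistral-oss",  estCostUsd: 0.01, regions: ["eu-west", "on-prem"] },
    { model: "claude",       estCostUsd: 0.04, regions: ["us-east", "eu-west"] },
  ],
};

function pickModel(taskClass: string, costCeilingUsd: number, region: string): string {
  const eligible = (routes[taskClass] ?? [])
    .filter((r) => r.regions.includes(region) && r.estCostUsd <= costCeilingUsd)
    .sort((a, b) => a.estCostUsd - b.estCostUsd); // cheapest eligible route first (placeholder policy)
  if (eligible.length === 0) {
    throw new Error(`no eligible model for ${taskClass} in ${region}`);
  }
  return eligible[0].model;
}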

Agent blast-radius limits

Per-tool spend and side-effect quotas enforced by the runtime, not by prompts.
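
A sketch of a runtime-enforced guard, assuming a hypothetical BlastRadiusGuard; the per-tool ceilings mirror those in the example declaration below, and nothing here depends on prompt content.

// Illustrative runtime quota check: spend and side-effect budgets per tool, per task.
type Quota = { maxSpendUsd: number; maxSideEffects: number };
type Usage = { spendUsd: number; sideEffects: number };

class BlastRadiusGuard {
  private usage = new Map<string, Usage>();

  constructor(private quotas: Record<string, Quota>) {}

  // Called by the runtime before every tool call; prompts never see this logic.
  assertAllowed(toolName: string, estCostUsd: number, isSideEffecting: boolean): void {
    const quota = this.quotas[toolName];
    if (!quota) throw new Error(`tool ${toolName} is not registered`);
    const used = this.usage.get(toolName) ?? { spendUsd: 0, sideEffects: 0 };
    if (used.spendUsd + estCostUsd > quota.maxSpendUsd) {
      throw new Error(`spend quota exceeded for ${toolName}`); // cancel, don't retry
    }
    if (isSideEffecting && used.sideEffects + 1 > quota.maxSideEffects) {
      throw new Error(`side-effect quota exceeded for ${toolName}`);
    }
    this.usage.set(toolName, {
      spendUsd: used.spendUsd + estCostUsd,
      sideEffects: used.sideEffects + (isSideEffecting ? 1 : 0),
    });
  }
}

const guard = new BlastRadiusGuard({
  "crm.read":        { maxSpendUsd: 0.05, maxSideEffects: 0 },
  "contracts.draft": { maxSpendUsd: 0.20, maxSideEffects: 1 },
});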

Shippable artefacts

Everything the lab ships

  • Studio SDK
    TypeScript + Python SDK for declaring agents, tools, evals and traces in code.
  • Eval harness
    Golden sets, regression suites, drift detection, CI gates that fail builds on quality dips (a minimal CI-gate sketch follows this list).
  • Tool registry
    Signed, versioned tools with auth scopes, cost ceilings and per-tenant policy.
  • Tracing plane
    OpenTelemetry-native traces of plan, tool calls, model routes and cost — exportable to Datadog / Honeycomb.
  • Reference patterns
    Six production blueprints: renewals, support, research, ops, claims, recruiting.
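
A minimal sketch of the CI gate mentioned in the eval harness bullet, assuming golden cases are scored offline in a Node CI step; the 0.94 threshold echoes the pass>=0.94 comment in the example declaration, and the names are assumptions.

// Illustrative CI gate: fail the build if the golden-set pass rate dips below the bar.
type GoldenCase = { prompt: string; expected: string };
type Grader = (prompt: string, expected: string) => Promise<boolean>;

async function passRate(cases: GoldenCase[], grade: Grader): Promise<number> {
  let passed = 0;
  for (const c of cases) {
    if (await grade(c.prompt, c.expected)) passed++;
  }
  return passed / cases.length;
}

async function evalGate(cases: GoldenCase[], grade: Grader, minPassRate = 0.94): Promise<void> {
  const rate = await passRate(cases, grade);
  if (rate < minPassRate) {
    console.error(`eval gate failed: pass rate ${rate.toFixed(3)} < ${minPassRate}`);
    process.exit(1); // non-zero exit fails the build, which is the point of the gate
  }
  console.log(`eval gate passed: pass rate ${rate.toFixed(3)}`);
}
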
Lab team
  • Agent Systems Principal
  • Eval Engineer
  • Tool / SDK Engineer
  • Tracing & Observability Lead
Partners we collaborate with
OpenAI · Anthropic · Google DeepMind · LangChain · DSPy · Datadog · Honeycomb
Example output · Studio · agents.declare · TypeScript
import { defineAgent, tool, eval as evalSet } from "@axp/studio";

export const renewals = defineAgent({
  name: "renewals.copilot",
  planner: "graph",
  critic: { strategy: "reflexion", min_confidence: 0.78 },
  models: { primary: "gpt-5", fallback: "gemini-3-pro" },
  tools: [
    tool("crm.read",      { scope: "tenant", cost_ceiling_usd: 0.05 }),
    tool("contracts.draft", { scope: "tenant", cost_ceiling_usd: 0.20 }),
  ],
  evals: evalSet("renewals.golden.v7"), // 412 prompts, pass>=0.94
  trace: { otel: true, redact: ["pii"] },
  region_pin: "eu-west",
});
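
Keeping the critic threshold, model fallback, per-tool cost ceilings, eval-set version and region pin in one typed declaration means each of those knobs is diffable in code review and enforceable in CI, rather than living in prompts or dashboards.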
Engagement timeline

Weeks 1–8 · first agent in shadow by week 2

  1. Weeks 1–2 · Shadow deploy

    Wrap an existing workflow; capture baseline cost, latency and quality with the eval harness (see the shadow-wrapper sketch after this timeline).

  2. Weeks 3–5 · Planner + critic + tools

    Wire planner/critic loop, register typed tools, ship deterministic fallbacks.

  3. Weeks 5–8 · Production hand-off

    Tracing, blast-radius limits, model routing, runbook + SRE on-call rotation.
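
A sketch of the step 1 shadow wrapper: the incumbent workflow keeps serving the user while the agent's answer is only recorded for the eval harness. All names here are illustrative.

// Illustrative shadow wrapper: the incumbent path stays authoritative,
// the agent's answer is only recorded for later comparison.
type Sample<T> = { value: T; latencyMs: number };

async function timed<T>(fn: () => Promise<T>): Promise<Sample<T>> {
  const start = Date.now();
  const value = await fn();
  return { value, latencyMs: Date.now() - start };
}

async function inShadow<T>(
  incumbent: () => Promise<T>,                                        // existing workflow, still serves the user
  agent: () => Promise<T>,                                            // new agent, shadow traffic only
  record: (baseline: Sample<T>, shadow: Sample<T>) => Promise<void>,  // feeds the eval harness baseline
): Promise<T> {
  const baseline = await timed(incumbent);
  // Run the agent in the background; its failures must never affect the live response.
  timed(agent)
    .then((shadow) => record(baseline, shadow))
    .catch(() => { /* shadow-only: errors surface in traces, not to users */ });
  return baseline.value;
}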

Flagship pods

Productionised by these squads

Renewals Agent Pod
Support Triage Pod
Research Co-pilot Pod
Claims & Adjudication Pod
Selected publications

Receipts, not just thesis

  • Critic-grounded planners outperform monolithic agents on enterprise tasks
    NeurIPS Workshop on Agentic AI · 2025
  • Deterministic fallbacks: a cost-quality study across 1.4M agent runs
    AXP Internal Whitepaper · 2026
FAQs

What partners actually ask

Is this another agent framework?

No — it's the missing engineering layer around frameworks. Use LangGraph, DSPy, OpenAI Agents or your own; the Studio adds evals, tools, traces and policy.

How do you stop agent loops blowing budgets?

Per-tool and per-task spend ceilings enforced at the runtime, plus auto-cancellation at policy thresholds. Logged to your FinOps cockpit.

Can it run on-prem?

Yes — the runtime ships as a container; OSS models served on your private GPUs are first-class.

Do you replace our SRE / observability stack?

No. Traces export via OpenTelemetry to whatever you already use (Datadog, Honeycomb, Grafana).

Design-partner programme · L1 Agentic Workflow Studio

Co-build Agentic Workflow Studio with us in Weeks 1–8.

We'll respond within one business day with a scoping note, a fixed-price outcome contract, and a named principal cleared for your domain. Design partners get first-look access, joint publication rights and roadmap influence.

  • Outcome-priced — no T&M.
  • Sovereign by default — your data, your region, your keys.
  • Refund-backed if the contracted KPI isn't hit.
  • Joint publication rights and conference slots.