Agent workflows and commercial delivery

Agents as a way to support commercial execution

The demo is not the product

In the spring of 2023, AutoGPT was circulating widely. You could watch a language model receive a goal such as "research competitors and draft a market report" and then break it into steps, call a search tool, read the results, generate a summary, and write a file. The whole thing ran in a terminal. People shared recordings of it as if something fundamental had shifted.

Something had shifted, but not quite what the recordings suggested.

The demos showed agents that could plan, use tools, and persist across steps. What they did not show was what happened when the plan was wrong, when the tool returned noise, when the task had ambiguous success criteria, or when the agent got stuck in a loop and just kept going. The gap between "impressive in a terminal window" and "reliable enough to put inside a commercial product" was real, and in May 2023 almost nobody was taking it seriously.

That gap is the actual problem worth thinking through.

What planning actually requires

A language model that can generate a plausible sequence of steps is not the same thing as a system that can execute those steps reliably. Planning is a cognitive operation. Execution involves the world: tools with inconsistent APIs, web pages that change, files that don't exist, inputs that are ambiguous.

The early agent frameworks (LangChain, AutoGPT, BabyAGI) were primarily wiring. They gave models a way to call tools and pass results back into context. That was a real contribution. But the intelligence driving the plan was still the model, and models make errors: they hallucinate tool names, misread returned data, lose track of where they are in a multi-step sequence, or pursue a sub-goal so enthusiastically that they forget the original instruction.

What makes a plan robust in a commercial context is not generative intelligence alone. It is constraint. A useful commercial agent is not one that can do anything: it is one that operates within a defined envelope where its successes are reliable and its failures are legible. The boundary of that envelope is a design decision, not a model capability question.

This is the distinction that the demo culture of early 2023 consistently collapsed. Watching an agent autonomously browse the web and synthesise a document feels qualitatively different from asking a model to complete a prompt. It is different. But the gap between "autonomous and impressive" and "autonomous and trustworthy" is where most of the real work lives.

The control plane question

There is a design question that every team building agent systems eventually encounters, whether they name it or not: how much authority do you give the agent, and over what?

Authority here is not metaphorical. It means: can the agent send an email? Can it modify a record? Can it delete a file? Can it make an API call that has downstream consequences? Can it take an action that cannot be undone?

In a demo context, these questions rarely surface. The agent is running in a sandbox, or against synthetic data, or being watched closely enough that a bad outcome gets stopped before it matters. In a production context, every one of those questions is consequential.

The naive answer is to give the agent read-only access and have humans action everything it recommends. That is safe, but it largely defeats the purpose. If every recommendation requires a human to review, approve, and execute, you have built an elaborate suggestion engine, useful but not the category of thing that most people are imagining when they talk about agent workflows.

The more interesting answer is to think carefully about which classes of action are reversible, which are low-stakes, and which carry irreversible or high-consequence outcomes. Give the agent authority over the first category freely. Build lightweight human checkpoints into the second. Reserve the third for explicit human initiation only. This produces a tiered autonomy model that is neither paralysed by caution nor reckless about consequences.
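A minimal sketch of that tiered model, in Python, may make the shape concrete. Everything named here is an assumption for illustration: the tier names, the `ACTION_TIERS` table, the example actions, and the `decide` function are hypothetical, and which action belongs in which tier is exactly the product decision the text describes.

```python
from enum import Enum


class Tier(Enum):
    REVERSIBLE = "reversible"              # agent may act freely
    LOW_STAKES = "low_stakes"              # agent acts after a lightweight human checkpoint
    HIGH_CONSEQUENCE = "high_consequence"  # explicit human initiation only


class Decision(Enum):
    EXECUTE = "execute"
    AWAIT_APPROVAL = "await_approval"
    REFUSE = "refuse"


# Hypothetical mapping from action name to tier. In a real system this
# table is a design artefact reviewed by people, not something the agent sets.
ACTION_TIERS = {
    "draft_email": Tier.REVERSIBLE,
    "update_task_status": Tier.LOW_STAKES,
    "send_email": Tier.HIGH_CONSEQUENCE,
    "delete_record": Tier.HIGH_CONSEQUENCE,
}


def decide(action: str, *, approved: bool = False, initiated_by_human: bool = False) -> Decision:
    """Return what the system should do with an action the agent proposes."""
    # Unknown actions fall into the most restrictive tier by default.
    tier = ACTION_TIERS.get(action, Tier.HIGH_CONSEQUENCE)
    if tier is Tier.REVERSIBLE:
        return Decision.EXECUTE
    if tier is Tier.LOW_STAKES:
        return Decision.EXECUTE if approved else Decision.AWAIT_APPROVAL
    # High-consequence actions run only when a human started them.
    return Decision.EXECUTE if initiated_by_human else Decision.REFUSE
```

Under these assumptions, `decide("draft_email")` executes immediately, `decide("update_task_status")` waits for approval, and `decide("delete_record")` is refused unless a human initiated it. Defaulting unknown actions to the strictest tier is one way to keep failures legible: an unclassified action is blocked rather than silently allowed.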

The control plane question is really a product design question wearing AI clothes. Getting it right requires the same discipline as any other product decision: understand the user, understand the stakes, understand what goes wrong when it goes wrong.

Why commercial delivery is harder than research delivery

Research on agent systems tends to optimise for a different objective than commercial deployment. A research benchmark measures whether the agent completes a task correctly in controlled conditions. A commercial product measures whether the system is reliable enough that a business will trust it with real work, at scale, over time.

Those are different problems. Research systems can tolerate failure rates that would be unacceptable in a commercial context. Research systems can require a human to set up carefully structured inputs that no ordinary user would provide. Research systems can run slowly because the evaluation is not time-sensitive.

Commercial delivery imposes constraints that change the whole design problem. The system has to be fast enough to feel immediate. It has to handle messy, incomplete inputs because that is what real users provide. It has to fail gracefully rather than catastrophically, and when it fails, the failure has to be explainable to someone who does not understand how language models work. It has to be maintainable by a team that did not build it, in conditions that will change.

These are software engineering problems as much as AI problems. And in 2023, most of the public conversation about agents focused on AI capability questions, to the near-exclusion of the engineering and product design problems that commercial delivery actually has to solve.

The Benediction Lab orientation

At Benediction Lab, the research direction in this period was deliberately constrained. The question was not "what can agents do in general?" but "what can an agent do reliably inside a structured commercial workflow?"

That reframing changes what you study. It shifts attention from model capability maximisation towards interface design: how does the agent receive its task? What does it have access to? How does it communicate uncertainty? How does the human re-enter the loop when needed? How is the agent's work logged in a way that makes review tractable?

Memory becomes a specific problem rather than a general AI aspiration. An agent operating inside a commercial workflow needs to know the context of what it is doing: the history of the account it is working on, the status of prior steps, the constraints that apply to this specific case. That is not a matter of making the model more capable. It is a matter of engineering the information environment the agent operates within.
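One way to read "engineering the information environment" is as plain data plumbing: the system, not the model, decides what context the agent sees for each task. The sketch below is illustrative only. The `WorkflowContext` class, its field names, and the rendered text layout are assumptions, not a description of any actual Benediction Lab system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkflowContext:
    """Hypothetical container for what the agent is told about one case."""
    account_history: List[str] = field(default_factory=list)
    completed_steps: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)

    def render(self, task: str, max_history: int = 5) -> str:
        """Build the text block placed in front of the model for this task."""
        # Keep only the most recent history so the context stays bounded.
        recent = self.account_history[-max_history:] if max_history > 0 else []
        sections = [
            ("Task", [task]),
            ("Recent account history", recent),
            ("Completed steps", self.completed_steps),
            ("Constraints", self.constraints),
        ]
        lines: List[str] = []
        for title, items in sections:
            lines.append(f"{title}:")
            # Say "(none)" explicitly rather than leaving a section empty.
            lines.extend(f"- {item}" for item in (items or ["(none)"]))
        return "\n".join(lines)
```

The design choice worth noticing is that empty sections are stated rather than omitted, so the model is told that no prior steps exist instead of being left to guess.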

Tool use becomes a specific problem rather than an abstract capability. Which tools should be available? What happens when a tool returns an error? What gets logged? These are infrastructure questions as much as AI questions.
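Those three questions can be answered in a small amount of infrastructure code. The following is a sketch under stated assumptions: `ToolRegistry`, `ToolResult`, and the log record fields are hypothetical names, and a production system would likely add timeouts, retries, and argument validation on top.

```python
import logging
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

logger = logging.getLogger("agent.tools")


@dataclass
class ToolResult:
    ok: bool
    value: Any = None
    error: Optional[str] = None


class ToolRegistry:
    """Hypothetical allow-list of tools with uniform error handling and logging."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self.call_log: List[Dict[str, Any]] = []

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, tool_name: str, /, **kwargs: Any) -> ToolResult:
        if tool_name not in self._tools:
            # A hallucinated tool name becomes a structured error, not a crash.
            result = ToolResult(ok=False, error=f"unknown tool: {tool_name}")
        else:
            try:
                result = ToolResult(ok=True, value=self._tools[tool_name](**kwargs))
            except Exception as exc:
                # Tool failures are returned to the caller instead of raised.
                result = ToolResult(ok=False, error=f"{type(exc).__name__}: {exc}")
        # Every call is recorded, successful or not, so review stays tractable.
        record = {"tool": tool_name, "args": kwargs, "ok": result.ok, "error": result.error}
        self.call_log.append(record)
        logger.info("tool call: %s", record)
        return result
```

Only registered tools are callable, every failure comes back as data the agent loop can inspect, and the call log gives a reviewer a complete trace of what the agent attempted.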

The orientation produces a different kind of research artefact: not demonstrations of maximal agent capability, but patterns for agent deployment that can be evaluated against commercial reliability criteria.

Where this connects to Orbit and TUXX

For Orbit, the agent question has a clear commercial shape. The product covers the lead-to-launched workflow: the commercial execution surface for a business. An agent operating inside Orbit is not a general intelligence. It is a specialist with access to specific records, specific tools, and specific authority. It can surface what needs attention. It can draft what needs drafting. It can move tasks forward within defined parameters. What it does not do is replace the judgement of the person running the business.

That distinction is important to say plainly: the goal is not to automate the commercial operator out of the loop. The goal is to reduce the friction cost of keeping the operator well-informed and well-positioned to make decisions. An agent that surfaces the right information at the right moment and drafts the right output for review is genuinely valuable. It does not need to be autonomous to be useful.

For TUXX, the implication is different. When building custom systems for clients, the agent question is almost always a reliability and trust question before it is a capability question. A client who has seen an impressive demo and wants an agent system wants to know one thing more than any other: will it do what I think it will do, every time? That question does not have a model answer. It has a systems design answer.

The honest position in May 2023

The honest position on agent workflows in May 2023 is this: the capability trajectory is real and the excitement is not irrational. Models that can plan and use tools represent a genuine expansion of what software can do. The research progress from GPT-3 to GPT-4 was substantial, and the agent frameworks built on top of those models opened up categories of automation that were not previously tractable.

But the gap between research capability and commercial reliability is wide, and it will not close automatically as models improve. It will close as the engineering and product design work catches up: as teams build better abstractions for tool use, better patterns for human-in-the-loop checkpoints, better infrastructure for memory and context, better methods for failure logging and recovery.

That work is less visible than a striking demo. It does not produce recordings that circulate on social media. It produces systems that work reliably in production, which is a different and less legible kind of progress.

The commercial opportunity in agent systems is not for whoever builds the most capable agent. It is for whoever builds the most deployable one. Those are not the same thing, and in 2023, they are not even close to the same thing.

That is the problem worth working on.