Research

Interface control and agent ideas

Interfaces as the operating surface for agentic systems

What happens when the model can see the screen

Most of the conversation around AI in mid-2022 is still organised around text. You send a prompt, you receive a completion. The interface is a text box and the surface of interaction is the message itself. That is a reasonable way to start, but it is not where things are going.

The more interesting question, and one that feels genuinely early to be thinking about, is what becomes possible when a model can not only read text but also operate a software environment. When it can see a browser, identify a button, decide what to click, and take that action. When it can move through a desktop application the way a human operator would, but faster, more consistently, and without fatigue.

This is not science fiction in June 2022. The pieces exist, scattered and rough, but they exist. The question is what kind of work can be done to pull them into coherent systems.

The gap between understanding and operating

There is a meaningful distinction between a model that understands what an interface is and a model that can use one. Understanding is the easier problem. Language models in 2022 can read descriptions of software, explain what buttons do, draft instructions for operating an application, and answer questions about interface behaviour with reasonable accuracy.

Operating is different. Operating requires the model to receive a live representation of an interface (a screenshot, a DOM structure, or an accessibility tree), make a judgement about what to do next, execute an action, observe the result, and then repeat that process until the goal is achieved. It is a closed-loop problem, not an open-ended generation problem. The model has to be right about what it sees, right about what to do, and right about when it is finished.
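To make the closed loop concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `Observation`, `Action`, `ToyWizard`, `run_loop`, `click_first`, and `reached_confirmation` are hypothetical names, and the toy three-screen environment stands in for a real browser or desktop. It shows only the shape of the loop (observe, decide, act, check for completion), not how a real system would perceive a screen.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Observation:
    """Snapshot of the interface, e.g. labels read from an accessibility tree."""
    elements: Tuple[str, ...]


@dataclass(frozen=True)
class Action:
    kind: str    # e.g. "click"
    target: str  # label of the element to act on


class ToyWizard:
    """Hypothetical stand-in environment: each screen advances when its button is clicked."""
    SCREENS = (("Next",), ("Submit",), ("Confirmation",))

    def __init__(self) -> None:
        self.index = 0

    def observe(self) -> Observation:
        return Observation(self.SCREENS[self.index])

    def execute(self, action: Action) -> None:
        on_screen = action.target in self.SCREENS[self.index]
        is_last = self.index == len(self.SCREENS) - 1
        if action.kind == "click" and on_screen and not is_last:
            self.index += 1


def run_loop(
    env: ToyWizard,
    decide: Callable[[Observation], Optional[Action]],
    is_done: Callable[[Observation], bool],
    max_steps: int = 20,
) -> bool:
    """Observe, decide, act, repeat. Returns True only if the goal is reached."""
    for _ in range(max_steps):
        obs = env.observe()
        if is_done(obs):
            return True   # the agent has to know when it is finished
        action = decide(obs)
        if action is None:
            return False  # no confident next step: stop rather than guess
        env.execute(action)
    return is_done(env.observe())


def click_first(obs: Observation) -> Optional[Action]:
    """Toy policy: click the first visible element, if there is one."""
    return Action("click", obs.elements[0]) if obs.elements else None


def reached_confirmation(obs: Observation) -> bool:
    return "Confirmation" in obs.elements


if __name__ == "__main__":
    print(run_loop(ToyWizard(), click_first, reached_confirmation))
```

The step budget (`max_steps`) and the early exit when the policy has no action are design choices of this sketch: a loop that cannot terminate or cannot decline to act would fail the "right about when it is finished" requirement.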

Research groups are beginning to take this seriously. GUI grounding, the work of teaching models to map natural-language instructions to specific locations in an interface, is an active research direction. The challenge is not simply vision; it is the combination of visual understanding, goal-oriented reasoning, and reliable action. A model that hallucinates interface elements, or takes the right action in the wrong place, is worse than no automation at all.

What makes this genuinely difficult is that interfaces are not stable. A web page reloads, a modal appears unexpectedly, a button changes label between sessions. A human operator adjusts without thinking. A model needs to be taught to adjust, or to ask for help when adjustment is required. That adaptive layer, knowing when the expected environment does not match the actual environment, is one of the harder problems in the space.
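One way to picture that adaptive layer is as a check that runs before every action: compare what the agent expected to see with what is actually on screen, and escalate when the two disagree. The sketch below is an assumption-laden illustration in Python, not a description of any existing system. `Expectation`, `Assessment`, and `assess` are hypothetical names, and the interface is reduced to a list of element labels.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class Expectation:
    required: FrozenSet[str]                     # labels the next step depends on
    interruptions: FrozenSet[str] = frozenset()  # labels that signal a known interruption


@dataclass(frozen=True)
class Assessment:
    verdict: str                  # "proceed" or "ask_for_help"
    missing: Tuple[str, ...]      # required labels that were not observed
    unexpected: Tuple[str, ...]   # interruption labels that were observed


def assess(expectation: Expectation, observed_labels: Iterable[str]) -> Assessment:
    """Compare the expected environment with the observed one."""
    observed = set(observed_labels)
    missing = tuple(sorted(expectation.required - observed))
    unexpected = tuple(sorted(expectation.interruptions & observed))
    verdict = "proceed" if not missing and not unexpected else "ask_for_help"
    return Assessment(verdict, missing, unexpected)


# Hypothetical example: the agent expects a "Submit" button on a form.
form = Expectation(
    required=frozenset({"Submit"}),
    interruptions=frozenset({"Session expired"}),
)
as_expected = assess(form, ["Email", "Submit"])
relabelled = assess(form, ["Email", "Send"])  # the button changed label
interrupted = assess(form, ["Email", "Submit", "Session expired"])  # a modal appeared
```

In this sketch a relabelled button or an unexpected modal both produce `ask_for_help` along with the specific mismatch, which is the minimum a human would need in order to step in. A more capable agent might try to recover on its own first; that policy is deliberately left out here.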

Instructed navigation as a capability

One frame that feels useful: think of this as instructed navigation. You describe a goal. The agent navigates towards it through the available interface. The word "navigate" is doing real work here: it implies a sequence of decisions across time, not a single response.

Instructed navigation has an interesting property: it does not require the underlying software to be designed for AI. It works on any surface a human can use. That means the bottleneck is not the software ecosystem but the agent's ability to reliably perceive and act within whatever interface exists. Practically, that opens up possibilities that are otherwise closed. Systems that do not have APIs, workflows that are locked inside desktop applications, processes that exist only as a series of human clicks across multiple tools: all of these become reachable.

This is different from integration. Integration requires both sides to agree on a contract, maintain a connection, and handle changes at the boundary. Instructed navigation has no such requirement. The agent simply works through the interface the way a person would. It is more brittle in some ways, as interfaces change without warning, but it is far more generalisable. You can point it at nearly anything.

The practical form this is likely to take in the near term is task automation: sequences of steps that are currently done by a human precisely because they require navigating a UI, but that are repetitive enough to benefit from automation. Data entry across multiple systems. Research workflows that require clicking through a series of web pages. Software testing that requires walking through user flows. These are not glamorous, but they are common, and they consume a substantial amount of human time.

Benediction Lab and the longer research agenda

Benediction Lab exists at MSG to work on problems that are one level of abstraction above the commercial products. Not "how do we build a better CRM" but "what happens when software can be instructed rather than only programmed." Not "how do we design better onboarding flows" but "what does it mean for an AI system to navigate software autonomously and reliably."

GUI control is one of the threads that Benediction Lab is paying close attention to in this period. Not as an isolated technical project, but as part of a broader thesis about what agentic systems look like when they are fully integrated into working environments. The model layer provides intelligence. The memory layer provides context. The interface control layer provides reach: the ability to act within existing software environments rather than only within specially constructed API surfaces.

The combination of those three layers is what makes an agent genuinely useful rather than merely impressive. Intelligence without memory produces conversations that reset every time. Intelligence with memory but without action produces recommendations that someone else has to implement. Intelligence with memory and interface reach produces something that can actually move work forward: something that can pick up a task, operate through the relevant software, and complete it.
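The way the three layers fit together can be sketched in a few lines of Python. This is a conceptual illustration only: `ListMemory`, `ScriptedModel`, `RecordingInterface`, and `Agent` are hypothetical names, the "model" is a fixed plan rather than a language model, and the "interface" merely records what it was asked to do.

```python
from typing import List, Optional


class ListMemory:
    """Stand-in memory layer: keeps notes about what has already happened."""

    def __init__(self) -> None:
        self.notes: List[str] = []

    def recall(self) -> List[str]:
        return list(self.notes)

    def remember(self, note: str) -> None:
        self.notes.append(note)


class ScriptedModel:
    """Stand-in model layer: picks the first planned step not yet recorded in context."""

    def __init__(self, plan: List[str]) -> None:
        self.plan = plan

    def next_step(self, goal: str, context: List[str]) -> Optional[str]:
        completed = {note.split(" -> ")[0] for note in context}
        for step in self.plan:
            if step not in completed:
                return step
        return None  # nothing left to do for this goal


class RecordingInterface:
    """Stand-in interface control layer: records each step instead of clicking anything."""

    def __init__(self) -> None:
        self.performed: List[str] = []

    def perform(self, step: str) -> str:
        self.performed.append(step)
        return "ok"


class Agent:
    def __init__(self, model: ScriptedModel, memory: ListMemory, interface: RecordingInterface) -> None:
        self.model = model
        self.memory = memory
        self.interface = interface

    def advance(self, goal: str) -> Optional[str]:
        """Take one step towards the goal, using memory for context and the interface for reach."""
        step = self.model.next_step(goal, self.memory.recall())
        if step is None:
            return None
        outcome = self.interface.perform(step)
        self.memory.remember(f"{step} -> {outcome}")
        return step
```

With an empty or discarded memory, every call to `advance` would return the first step of the plan again, which is the "conversations that reset every time" failure in miniature. Without the interface layer, `advance` could only return a recommendation for someone else to carry out.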

That is the shape of the system Benediction Lab is working towards understanding. Not in the sense of having built it, as June 2022 is still early, but in the sense of having a clear conceptual map of what needs to be true for it to work, and what the hard problems are along the way.

The boundary question

Any serious thinking about agentic systems runs into the same question eventually: what should the agent not do? Where does it ask permission? Where does it stop?

This is not purely a safety question in the abstract AI ethics sense, though it is partly that. It is also a practical product question. An agent that does too much, that makes consequential decisions without checking, that operates across systems in ways the user cannot easily inspect or reverse, is not a useful tool. It is a liability. Trust in the system depends on the system being transparent about what it is doing and predictable about where it will pause.

The answers are not obvious in June 2022, and they are not settled in any corner of the research community. What is becoming clearer is that the question cannot be deferred. Building interface control capabilities without simultaneously building the permission and transparency model that wraps them is building something unsafe to ship. The two problems have to be worked in parallel.

At the level of Benediction Lab's research agenda, this means paying attention not just to capability (what the agent can do) but to legibility (what the user can understand about what the agent is doing, why it made the decisions it made, and what they could do to redirect it). An agent that can operate a browser with high reliability but cannot explain its reasoning or accept mid-task correction is an incomplete system. The interface layer and the communication layer have to be designed together.
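A small sketch shows what wrapping capability in a permission and transparency model might look like. It is written in Python under loud assumptions: `ProposedAction`, `GatedExecutor`, and the reversible/irreversible split are hypothetical choices made for this example, not a settled design. Reversible actions run immediately, irreversible ones wait for an approval callback, and every decision is written to a trace together with the agent's stated rationale.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class ProposedAction:
    description: str   # what the agent wants to do
    rationale: str     # why, in terms the user can read
    reversible: bool   # assumption: the agent can classify this reliably


@dataclass
class GatedExecutor:
    approve: Callable[[ProposedAction], bool]        # asks the user; hypothetical hook
    trace: List[str] = field(default_factory=list)   # inspectable record of decisions

    def submit(self, action: ProposedAction, run: Callable[[], None]) -> bool:
        """Run the action if it is reversible or approved; log the outcome either way."""
        if not action.reversible and not self.approve(action):
            self.trace.append(f"BLOCKED {action.description}: {action.rationale}")
            return False
        run()
        self.trace.append(f"DONE {action.description}: {action.rationale}")
        return True


# Hypothetical usage: a user who declines every irreversible action.
ran: List[str] = []
gate = GatedExecutor(approve=lambda action: False)
opened = gate.submit(
    ProposedAction("open report", "needed to read the totals", reversible=True),
    lambda: ran.append("open report"),
)
deleted = gate.submit(
    ProposedAction("delete record", "looks like a duplicate entry", reversible=False),
    lambda: ran.append("delete record"),
)
```

The trace is the legibility half of the design: it records what was done, what was stopped, and the reason given in each case. The hard part, which this sketch assumes away, is classifying actions as reversible or not in an interface the agent does not control.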

A different kind of software product

There is a version of this future where agentic systems simply replace point integrations. Where instead of connecting Tool A to Tool B via a custom integration, you instruct an agent to do the work that flows between them. That is a real use case and it has genuine value. But it is the conservative version of what becomes possible.

The less conservative version is that interface control, combined with capable model systems and reliable memory, produces something closer to a new kind of software product altogether. Not an app you open and operate yourself. Not a background process running scheduled tasks. Something in between: a system that participates in your working environment, that can be given goals and responsibilities, that operates across the software landscape you already inhabit rather than requiring you to inhabit a new one built for AI.

That is a significant shift in what it means to build a software product. It means the product boundary stops being the edge of your application and becomes the set of things your agent can reach. It means product design includes designing how the agent perceives and acts in external environments, not just how users perceive and act inside your own interface.

For Orbit and Orion, that has specific implications. For TUXX, it changes the surface area of what can be delivered to a client. For Benediction Lab, it is precisely the territory worth spending time in. The commercial applications are downstream of the research problems being understood now.

The value of thinking about this in June 2022

The risk with research areas that have genuine momentum is becoming a commentator rather than a builder. Every week there are new papers, new demos, new frameworks, new claims. The volume is high enough to fill all available attention if you let it.

The more useful orientation, and the one that feels right for where MSG sits, is to maintain a clear sense of what the capability would unlock for the specific systems being built, and to track research progress against that specific threshold. The question is not "is GUI control an interesting research problem?" (it clearly is) but "at what level of reliability does instructed navigation become something we can build product around, and what does that product look like?"

That is a harder question to answer in June 2022 than it will be in twelve months. But asking it now is the point. The groundwork for building useful agentic systems is the clarity of thinking that happens in the periods before the capability is fully ready. By the time it is obvious what to do, it is usually too late to be the one who figured it out first.

---

*Benediction Lab is MSG's research group working on agents, memory systems, GUI control, and the infrastructure required for autonomous software operation.*