Research

Model choice becomes product strategy

Frontier, open-weight, voice and specialist models force better product architecture.

A few years ago, the model question was easy to answer badly. There was one credible option for most serious use cases, GPT-4, and everything else was either too weak, too slow, or too unreliable to ship against. The decision was effectively made for you.

That period is over.

By August 2025, the frontier is genuinely crowded. Claude, GPT-4o, Gemini 1.5 Pro, Llama 3 and Mistral are each capable of doing things that would have felt like research demonstrations eighteen months ago. The AI Index published this year by Stanford's Institute for Human-Centered AI noted that the performance gap between the leading closed model and the best open-weight alternatives has collapsed to a degree that would have been difficult to predict even at the start of 2024. The transition from "one good model" to "several capable models with different profiles" has happened quietly but completely.

What follows from that is not a simple story about abundance. It is a story about decisions that now have to be made deliberately, where before they didn't have to be made at all.

The split that matters

There is a natural tendency to frame the current landscape as a race: which model scores highest on the latest benchmark, which provider releases the longest context window, which company stages the most impressive demo. That framing is not useless, but it misses the split that actually drives product decisions.

The meaningful division right now is not between providers. It is between the kinds of work a model is being asked to do, and the constraints that work operates under.

Frontier closed models (the largest Claude variants, GPT-4o, Gemini Ultra) remain the strongest generalists. They handle complex reasoning, nuanced instruction-following and long-form synthesis. The tradeoff is cost per token, latency on longer contexts, and the fact that data leaves your infrastructure. For many tasks that is perfectly acceptable. For some it is not.

Open-weight models like Llama 3 and its fine-tuned derivatives change the economics and the ownership structure considerably. Running inference on your own hardware or a private cloud removes the data transit question entirely and brings marginal cost down sharply at scale. What you give up is some ceiling on raw capability, though that ceiling has risen substantially and continues to rise. The question of when open-weight capability is "good enough" is less theoretical than it used to be.
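
The "marginal cost at scale" point can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and every figure in it (the hosted price per million tokens, the fixed monthly cost of running your own inference, the marginal cost per million tokens) are assumptions invented for the example, not quotes from any provider.

```python
from typing import Optional


def hosted_monthly_cost(tokens: float, price_per_million: float) -> float:
    """Pure per-token pricing: cost grows linearly with usage."""
    return tokens / 1_000_000 * price_per_million


def self_hosted_monthly_cost(
    tokens: float, fixed_monthly: float, marginal_per_million: float
) -> float:
    """Fixed infrastructure cost plus a (smaller) per-token running cost."""
    return fixed_monthly + tokens / 1_000_000 * marginal_per_million


def break_even_tokens(
    price_per_million: float, fixed_monthly: float, marginal_per_million: float
) -> Optional[float]:
    """Monthly token volume above which self-hosting is cheaper.

    Returns None when self-hosting never wins, i.e. when its marginal
    cost is not below the hosted per-token price.
    """
    saving_per_million = price_per_million - marginal_per_million
    if saving_per_million <= 0:
        return None
    return fixed_monthly / saving_per_million * 1_000_000


if __name__ == "__main__":
    # Hypothetical figures: 10.00 per million tokens hosted, versus
    # 4,500 per month fixed plus 1.00 per million tokens self-hosted.
    volume = break_even_tokens(10.0, 4500.0, 1.0)
    print(f"Break-even volume: {volume:,.0f} tokens per month")
```

Below the break-even volume the fixed cost dominates and per-token pricing is cheaper; above it the balance reverses. The calculation deliberately ignores engineering time and any capability gap, which in practice often matter as much as the token price.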

Voice models introduce a different kind of shift: not about capability but about input modality. When the interface changes from keyboard to voice, the experience of interacting with software changes in ways that compound over time. Response latency becomes more noticeable. Context held between turns becomes more valuable. The entire grammar of how people ask questions shifts.

Specialist and fine-tuned models sit across all of the above categories. A model trained specifically on legal documents is not necessarily more capable than a frontier generalist, but it may be more consistent, more cost-efficient, and easier to constrain within a workflow. Specialisation trades breadth for predictability, and predictability is often what a production system actually needs.

Where these decisions land

None of this is theoretical in the way that model research often is. It lands in specific product surfaces and forces specific choices.

The relevant question for any product team is not "what is the best model?" It is four quieter questions stacked on top of each other. What does this task actually require? What are the data constraints around it? What does acceptable latency look like? What is the cost curve as usage scales?

Those questions have different answers in different parts of the same product. A workflow surface that drafts structured output from an operator's existing documents has different requirements than a real-time voice interaction that needs to feel natural. A research synthesis that needs to weigh conflicting sources has different requirements than an alert that needs to fire on a pattern. Choosing a single model and routing every task through it is not architecture; it is avoidance of architecture.

The products that will hold up under scrutiny are the ones where model selection is made at the task level, not the product level. That means the layer sitting above the model (the routing logic, the context management, the memory system) carries as much weight as the model itself.
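
One way to picture task-level selection is to write the four questions down as data. The following is a minimal sketch under stated assumptions: the `TaskProfile` and `ModelOption` types, the 1 to 5 capability scale, and the three catalogue entries with their latency and cost figures are all hypothetical, chosen only to show the shape of the decision.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelOption:
    name: str
    capability: int                 # hypothetical scale: 1 (basic) to 5 (frontier)
    self_hosted: bool               # True if data never leaves your infrastructure
    typical_latency_ms: int
    cost_per_million_tokens: float


@dataclass(frozen=True)
class TaskProfile:
    name: str
    min_capability: int             # what does the task actually require?
    data_must_stay_internal: bool   # what are the data constraints?
    max_latency_ms: int             # what does acceptable latency look like?


def select_model(task: TaskProfile, options: List[ModelOption]) -> Optional[ModelOption]:
    """Return the cheapest option that satisfies every constraint, or None."""
    eligible = [
        m for m in options
        if m.capability >= task.min_capability
        and (m.self_hosted or not task.data_must_stay_internal)
        and m.typical_latency_ms <= task.max_latency_ms
    ]
    # Cost is the tie-breaker among models that are all "good enough".
    return min(eligible, key=lambda m: m.cost_per_million_tokens, default=None)


# Invented catalogue; the names and numbers are placeholders, not real models.
CATALOGUE = [
    ModelOption("frontier-hosted", 5, False, 2500, 15.0),
    ModelOption("open-weight-local", 3, True, 800, 1.0),
    ModelOption("small-voice", 2, True, 200, 0.3),
]

if __name__ == "__main__":
    tasks = [
        TaskProfile("research-synthesis", 5, False, 10_000),
        TaskProfile("document-extraction", 3, True, 3_000),
        TaskProfile("voice-turn", 2, False, 300),
    ]
    for task in tasks:
        chosen = select_model(task, CATALOGUE)
        print(task.name, "->", chosen.name if chosen else "no eligible model")
```

The design choice worth noting is that the function can return nothing: a task whose constraints no available model satisfies is a product decision to surface, not something for the router to paper over.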

What this looks like from inside MSG

Across the portfolio, the model landscape in its current form has sharpened a set of questions we had been working through more loosely.

Orbit is a system that covers a complex commercial workflow. The tasks inside it are not uniform. Some require nuanced synthesis: reading a conversation history and drawing out what is actually being said underneath the stated position. Some are closer to structured extraction: parsing a document and populating a set of fields. Some are ambient and continuous: monitoring for a change in context and surfacing it at the right moment. These are not the same task, and treating them as if the same model configuration handles all of them equally well produces a product that does none of them particularly well.

Orion, the intelligence layer, exists precisely to make these distinctions at runtime. The routing question (which model, at which point, with which context) is a design problem that Orion has to solve cleanly for Orbit to function well under real usage. That includes decisions about when to use a stronger, more expensive model and when a smaller, faster, cheaper one is more appropriate given the task and what the user actually needs in that moment.
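
The "cheaper first, stronger when needed" decision is often expressed as an escalation loop. The sketch below illustrates that general pattern; it is not a description of how Orion is implemented. The `route_with_escalation` function, the tier names, and the acceptance check are assumptions made for the example, and the models are stand-in functions rather than real API calls.

```python
from typing import Callable, Sequence, Tuple

# A "model" here is any function from prompt to text. Real clients would
# wrap a provider SDK or a local inference server behind this shape.
Model = Callable[[str], str]


def route_with_escalation(
    prompt: str,
    tiers: Sequence[Tuple[str, Model]],
    is_acceptable: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try tiers in order (cheapest first) and stop at the first acceptable output.

    Returns (tier_name, output). If no tier passes the check, the output of
    the last (strongest) tier is returned so the caller can decide what to do.
    """
    if not tiers:
        raise ValueError("at least one model tier is required")
    name, output = "", ""
    for name, model in tiers:
        output = model(prompt)
        if is_acceptable(output):
            break
    return name, output


if __name__ == "__main__":
    # Stand-in models: the small one gives up on long prompts.
    def small_model(prompt: str) -> str:
        return "" if len(prompt) > 40 else "short answer"

    def large_model(prompt: str) -> str:
        return "detailed answer"

    tiers = [("small", small_model), ("large", large_model)]

    def non_empty(text: str) -> bool:
        return bool(text.strip())

    print(route_with_escalation("Quick question", tiers, non_empty))
    print(route_with_escalation("A much longer prompt " * 5, tiers, non_empty))
```

Everything interesting lives in `is_acceptable`. A check that is too lenient sends weak answers to users; one that is too strict escalates every request and removes the cost benefit.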

CheekyGains and Naira present a different version of the problem. Naira is a performance coach. The interaction is personal and recurring. The user is not looking for a technically impressive response; they are looking for something that feels coherent with what they said yesterday and the day before, that notices patterns, that meets them where they are. That use case has a particular sensitivity to consistency and memory over raw capability. The right model for Naira is the one that holds context well and produces responses that feel grounded in the user's actual situation, not the one that scores highest on a reasoning benchmark.
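
To make "holds context well" slightly more concrete, here is a minimal sketch of a memory layer that sits in front of whichever model is used. It is an illustration of the general idea, not Naira's implementation. The `SessionMemory` class and its fields are hypothetical, and the "summary" is a naive concatenation standing in for a real summarisation step.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque


@dataclass
class SessionMemory:
    """Keeps the most recent turns verbatim and folds older ones into a summary."""

    max_recent_turns: int = 6
    summary: str = ""
    recent: Deque[str] = field(default_factory=deque)

    def add_turn(self, speaker: str, text: str) -> None:
        self.recent.append(f"{speaker}: {text}")
        while len(self.recent) > self.max_recent_turns:
            oldest = self.recent.popleft()
            # Placeholder: a real system would compress this, for example by
            # asking a model for a summary, rather than simply concatenating.
            self.summary = f"{self.summary} | {oldest}" if self.summary else oldest

    def build_context(self, new_message: str) -> str:
        """Assemble what the model sees: older summary, recent turns, new message."""
        parts = []
        if self.summary:
            parts.append("Earlier sessions: " + self.summary)
        parts.extend(self.recent)
        parts.append("user: " + new_message)
        return "\n".join(parts)


if __name__ == "__main__":
    memory = SessionMemory(max_recent_turns=2)
    memory.add_turn("user", "Slept badly, skipped training.")
    memory.add_turn("coach", "Noted. Let's keep today light.")
    memory.add_turn("user", "Felt better after the walk.")
    print(memory.build_context("What should I do tomorrow?"))
```

Because the memory lives outside the model, the same history can be handed to a different model tomorrow, which is what makes consistency a property of the system rather than of any one provider.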

TUXX operates across client environments that vary considerably. Some clients have data handling requirements that make closed frontier models impractical. Some have latency constraints. Some have cost structures that make per-token pricing at scale look unattractive. Working across those different configurations requires not just an opinion about which model is best but a genuine ability to work with several models and deploy the appropriate one against the actual constraint, which is not always the constraint the client initially identifies.

Benediction Lab sits furthest from production constraints and has the most latitude to experiment. Research into agents, memory systems, and autonomous workflows benefits from access to the full range of current models precisely because understanding their differences empirically is part of the work. What a model does when a tool call fails, how it handles conflicting instructions, where it degrades gracefully and where it does not: these questions can only be answered by running against the actual models in structured conditions.
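
"Structured conditions" can be as simple as a grid of scenarios run against every model under test. The sketch below is one hypothetical way to lay that out; the `Scenario` type, the `run_matrix` function, and the keyword-based pass checks are assumptions for illustration, and the models are stub functions rather than real endpoints.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Model = Callable[[str], str]


@dataclass(frozen=True)
class Scenario:
    name: str
    prompt: str
    passes: Callable[[str], bool]   # judgement applied to the model's output


def run_matrix(
    models: Dict[str, Model], scenarios: List[Scenario]
) -> Dict[str, Dict[str, bool]]:
    """Run every scenario against every model; an exception counts as a failure."""
    results: Dict[str, Dict[str, bool]] = {}
    for model_name, model in models.items():
        row: Dict[str, bool] = {}
        for scenario in scenarios:
            try:
                row[scenario.name] = scenario.passes(model(scenario.prompt))
            except Exception:
                # Crashing is itself a data point about graceful degradation.
                row[scenario.name] = False
        results[model_name] = row
    return results


if __name__ == "__main__":
    scenarios = [
        Scenario(
            "tool-call-fails",
            "TOOL_ERROR: timeout while fetching record. What do you do next?",
            lambda out: "retry" in out.lower(),
        ),
        Scenario(
            "conflicting-instructions",
            "Reply only in French. Reply only in English. Say hello.",
            lambda out: "clarify" in out.lower(),
        ),
    ]

    def careful_stub(prompt: str) -> str:
        return "I will retry once, then report the failure."

    def brittle_stub(prompt: str) -> str:
        raise RuntimeError("unhandled tool error")

    table = run_matrix({"careful": careful_stub, "brittle": brittle_stub}, scenarios)
    for model_name, row in table.items():
        print(model_name, row)
```

Keyword checks like these are crude. The point of the structure is that the same scenarios run unchanged against every model, so differences in the table reflect the models and not the test setup.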

The adoption pattern across industries

At the industry level, looking at how AI capability is being absorbed across sectors, the most sophisticated adopters have moved past the question of whether to use these models and into a different set of questions about where they belong in the stack.

The organisations that integrated early and broadly are now doing something more deliberate: pulling back surface area where the model was being asked to do too much unsupervised, and being more precise about the boundary between what the model handles and what a human or a structured process handles. This is maturation, not disillusionment.

The adoption patterns that hold up are the ones built around tasks where the model's output can be evaluated clearly and where errors are recoverable. The patterns that are struggling are the ones that tried to apply generalist models to tasks that needed either more structure or more specialisation than a generalist can reliably provide.

This is not a reason to be cautious about AI capability. It is a reason to be specific.

The moat is not the model

The concern that model capability is converging (that all the frontier providers will reach similar performance thresholds and the differentiator will disappear) has some truth in it at the raw capability level. Benchmarks are converging. The gap between providers on common tasks is narrowing. Fine-tuning and open-weight models bring that capability further down the cost curve over time.

But the moat was never going to be the model. The moat is the system built around it.

Which data reaches the model and when. How context is structured, summarised, and persisted across sessions. How different task types are routed to appropriate model configurations. How outputs are evaluated, verified, and incorporated into the user's record. How the product behaves when the model is wrong. These are product architecture questions, not model questions. They are the decisions that produce a system that compounds value over time, rather than one that performs a demonstration and then flattens.

The organisations and products that will be in a meaningfully stronger position twelve months from now are not the ones that chose the best model today. They are the ones that built the layer above the model well enough that, as the landscape continues to shift (and it will), they can swap models without rebuilding the product around the new one.
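
In code, "swap models without rebuilding" usually comes down to product logic depending on a narrow interface rather than on a provider. The following is a minimal sketch under that assumption: the `TextModel` protocol, the `Summariser` class, and the `TaggedStub` stand-in are hypothetical names, and no real provider SDK is called.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only thing product code is allowed to know about a model."""

    def complete(self, prompt: str) -> str: ...


class TaggedStub:
    """Stand-in for a real client (hosted API, local server, fine-tune)."""

    def __init__(self, tag: str) -> None:
        self.tag = tag

    def complete(self, prompt: str) -> str:
        return f"[{self.tag}] {prompt}"


class Summariser:
    """Product-level feature: knows about prompts and outputs, not providers."""

    def __init__(self, model: TextModel) -> None:
        self._model = model

    def summarise(self, text: str) -> str:
        return self._model.complete("Summarise: " + text)


if __name__ == "__main__":
    # The wiring is the only place a specific model is named.
    feature = Summariser(TaggedStub("provider-a"))
    print(feature.summarise("quarterly notes"))

    # Swapping the model leaves Summariser untouched.
    feature = Summariser(TaggedStub("provider-b"))
    print(feature.summarise("quarterly notes"))
```

In this arrangement, changing the model behind a feature is a change to one line of wiring rather than to the product code that depends on it.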

That is what model choice as product strategy actually means. Not picking a winner. Building something that does not depend on having picked the right one.