Sora and the expanding model stack

Video generation shows that AI is becoming a stack of capabilities, not a single product category.

In December 2024, OpenAI made Sora available to the public. Video generation (genuinely compelling, temporally coherent, production-adjacent video) was now accessible to anyone with a paid ChatGPT subscription. That is not a trivial shift in what AI can do. It is a signal about what the field is becoming.

Sora is not a new product category alongside the others. It is evidence that AI capability is assembling itself into a stack: text, code, image, audio, video, planning, retrieval, tool use. The layers arrived at intervals, but December 2024 is the month it became possible to look at them as a coherent whole. Not a single tool. A surface. And the implications of that are different in kind from the implications of any single capability on its own.

That changes the question every builder should be asking.

From singular to plural

The original AI product question was simple: what can the model do? The answer in 2020 was mostly text: summarise this, write that, answer this question. Useful, but bounded. The surface area was narrow, and the design space was narrow with it.

Each subsequent capability opened the question a little further. Codex showed that code was within scope. DALL-E and Stable Diffusion showed that image generation could be fast and iterable enough to actually sit inside a workflow. Whisper showed that audio transcription could be made near-free. Voice synthesis matured slowly and then quickly, moving from uncanny to natural in what felt like a single quarter. And now video, not as a curiosity or a research demo, but as something that could sit inside a production workflow without embarrassing itself.

The point is not that each of these is individually impressive. The point is that they are additive. They are modalities, and they are accumulating. A system built in late 2024 has access to a capability stack that no single human expert could plausibly replicate alone at this cost and this speed. The question of what to do with that stack (which pieces to pick up, which to leave alone, which human judgement to keep in the loop) is the actual design question now. And it is a harder question than the earlier ones.

What video changes, specifically

It is tempting to file Sora under "media" and move on. That would be the wrong category.

Video generation changes the economics of explanation. Concepts that would have required a design team, a script, a shoot day, and a post-production pipeline can now be visualised quickly enough to test whether the idea is worth the full production cost. That is not a media question. It is a product development question. It belongs alongside the other tools that shorten the distance between an idea and something testable.

It changes how early-stage thinking gets stress-tested. When you can render something (a product interaction, a service environment, a customer journey, an architectural concept), you can interrogate it. You can hand it to someone outside the team and ask what they see. You can iterate before committing. The feedback loop shortens. The distance between imagination and artefact narrows in a way that has historically required significant resources.

It also changes how teams communicate internally. Describing a vision in words leaves significant interpretive space. A rough rendered sequence leaves much less. Teams working on ambiguous creative or product problems have always spent considerable energy bridging that gap: alignment meetings, reference boards, lengthy briefs, all of which are imperfect proxies for a shared image of what the thing should be. Video generation does not close that gap entirely, but it compresses it in ways that will matter at scale.

None of this requires the output to be broadcast-quality. The utility arrives well before perfection. That is a lesson the earlier modalities already taught: imperfect code generation was still economically significant; imperfect image generation was still useful enough to reshape design workflows across entire industries. The same logic applies here. Sora in December 2024 does not need to replace a cinematographer to be worth building with.

The wrong way to read this moment

There is a predictable error pattern when a new capability arrives at this scale. Either it gets dismissed ("it cannot do X, therefore it is limited") or it gets over-extended ("this changes everything immediately"). Both miss what is actually useful about the moment.

Sora in December 2024 is not broadcast-ready in every context. It is not a replacement for directors who understand narrative and human emotion. It has artefacts. It makes errors with physics and with faces over long durations. There are genuine constraints on sequence length, on fine-grained controllability, on the consistency of characters and environments across cuts. These are real limitations and worth naming clearly.

But the limitations describe the current state of a capability that is on an improvement curve, and a steep one at that. What matters industrially is not where the capability is today but what class of work it makes possible at all. In December 2024, that class expanded to include temporally coherent, generative video. That is the relevant fact. The boundary will move, and probably quickly.

The more useful frame is to ask: which parts of the work I currently do, or the systems I build, depend on video as an output? And which of those are bottlenecked by cost, time, or access to production infrastructure? Wherever the answer to both questions is yes, there is now a different option available. Not a perfect option. An option.

What the stack asks of builders

What the accumulating model stack asks of builders is specificity. Not "use AI." That sentence has no content anymore. But: which capability, applied at which point in the workflow, improving which specific outcome for which specific person?

That is a harder question than it appears. It requires knowing your workflow in enough detail to identify where the actual bottleneck lives. It requires being honest about which parts of the work are routine enough to delegate and which require judgement that has not yet been successfully encoded. And it requires resisting the temptation to build everything just because the capability to build it now exists.

There is a version of every product that tries to use all the modalities because they are available. That version tends to be overbuilt and underspecified: impressive in a demo, confusing in use. The better version makes a deliberate choice about which capabilities genuinely serve the user's need, and holds everything else. The discipline required is exactly the same as it has always been; the capability stack just makes the temptation to skip it much greater.

The model stack expanding is not an invitation to add features. It is an invitation to think harder about what a given system is actually for and what it should leave alone.

Implications across the portfolio

At Mustard Seed Group, the expanding stack means different things to different parts of the portfolio, and it is worth being specific rather than general about that.

Orbit is a business execution system. The capabilities it draws on are primarily language and structured reasoning: turning context into action, surfacing the right information at the right moment in a commercial workflow. The addition of video to the stack does not immediately change what Orbit needs to do at its core. What it does change is the cost and speed at which explanatory and onboarding content can be produced alongside a product like this. That is a workflow concern rather than a core product concern, but it is a real one, and the savings compound.

For All Purpose, the consumer media and creative ecosystem, the question is more direct. Sound, image and story have been central to what All Purpose is building toward. A world in which video generation is publicly accessible is a world in which the creative surface expands for everyone working within that ecosystem. That is not because generated video replaces creative vision. It does not, and the attempt to replace it would miss the point. It is because generated video changes the relationship between vision and execution. The ability to render an idea quickly enough to test it is meaningful for any creative project that has historically been constrained by production cost. All Purpose is precisely the environment where that constraint has mattered most.

TUXX operates as the services division, building custom AI systems in live client environments. Here the expanding stack functions as a capability library. Each new modality is a potential component in a client system: not used by default, not applied because it is available, but reachable when the specific problem calls for it. The work is increasingly architectural: knowing which layer to reach for and how to compose the layers into something coherent and reliable.

Benediction Lab sits furthest upstream. The research questions the lab is working through (agents, memory, autonomous execution, GUI control) are precisely the questions that determine whether individual capabilities in the stack can be reliably composed into systems that do useful work over extended time horizons. A model stack that is wide but shallow in reliability does not serve that ambition. The research work is, among other things, about making composition trustworthy enough to build on.

The actual question

The question is not which model to use. That question is becoming almost administrative: the answer changes quarterly, the options proliferate, and the choice matters less than what is built around it.

The question is which capability belongs in this system, serving which user, in which moment. That question requires knowing your workflow and your constraints at a level of specificity that no model can substitute for. It requires builders who understand what they are actually building and why, before any capability is selected.

The model stack will keep expanding. Audio generation is maturing rapidly. Real-time multimodal reasoning is arriving. The capability frontier will shift again in 2025, and the shift will likely be significant. The builders who will do useful work with it are the ones who already know what problem they are solving with enough clarity that a new capability can slot in as an answer rather than arriving as a question in search of one.

Sora is genuinely significant. It is also one data point in a longer sequence. The stack is not finished. The question of what to build with it remains ours.

---

*December 2024*