NEWS
Diffusion models break out of the image lane: video, language and open weights in 2026
Six months into 2026, the diffusion architecture has expanded well beyond image generation. Alibaba, Inception Labs and Meituan have shipped releases that reshape what the technology can do and what it costs.
For most of the past three years, "diffusion model" has been shorthand for one thing: an AI system that turns text prompts into images. That shorthand no longer covers the field. Between October 2025 and June 2026, three releases in particular have stretched the architecture into territory that used to belong to other approaches: open-source video with fine-grained control, frontier reasoning over text, and minutes-long generation on a single high-end GPU. The story is no longer "diffusion versus autoregressive". It is "where does diffusion go next".
This is a non-technical round-up of what shipped, when, and what UK teams should take from it.
A one-paragraph refresher
A diffusion model is trained to take a noisy version of something (an image, a clip, a sentence) and clean it up. Run that cleanup repeatedly from pure noise, guided by a prompt, and you get a generated result. Until recently, the architecture of choice was a U-Net, borrowed from medical imaging. The shift to Diffusion Transformers (DiTs) in 2023 and 2024 unlocked far better scaling, and that one architectural change is what has made the 2026 releases possible.
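For readers who prefer code to prose, the cleanup loop can be sketched in a few lines of Python. This is a toy under loud assumptions: `toy_denoiser` is a hypothetical hand-written stand-in for a trained network, and the "prompt" is simply a known target signal. It shows only the shape of the loop (start from noise, clean up repeatedly), not how any of the models below are built.

```python
import random


def toy_denoiser(noisy, target, strength):
    """Hypothetical stand-in for a trained network.

    A real diffusion model learns its cleanup step from data and is steered
    by a prompt. Here we cheat and nudge each value towards a known target.
    """
    return [x + strength * (t - x) for x, t in zip(noisy, target)]


def generate(target, steps=20, strength=0.3, seed=0):
    rng = random.Random(seed)
    # Start from pure noise, as the refresher describes.
    sample = [rng.gauss(0.0, 1.0) for _ in target]
    errors = []
    for _ in range(steps):
        # One cleanup pass; repeating it is what produces the result.
        sample = toy_denoiser(sample, target, strength)
        errors.append(max(abs(x - t) for x, t in zip(sample, target)))
    return sample, errors


target = [0.0, 0.5, 1.0, 0.5, 0.0]
sample, errors = generate(target)
print(f"largest error after first pass: {errors[0]:.4f}")
print(f"largest error after last pass:  {errors[-1]:.4f}")
```

Each pass removes a fixed share of the remaining error, so the sample drifts steadily from noise towards the target. Real models replace the hand-written nudge with a learned network.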
A visual metaphor for diffusion: ordered detail emerging from noise. Photo: Igor Omilaev / Unsplash.
Alibaba Wan 2.7: open-source video catches up
Released 6 April 2026 by Alibaba's Tongyi Lab.
The Wan 2.7 release is the most consequential video diffusion news of the year so far. Alibaba shipped a four-model suite covering text-to-video, image-to-video, reference-based video, and instruction-driven video editing, all under the Apache 2.0 licence. Clips run from two to fifteen seconds at 720p or 1080p, and the system supports a 5,000-character prompt window, multi-reference input of up to nine images, HEX-code colour control, and on-screen text rendering across twelve languages.
The headline feature is what Alibaba calls "Thinking Mode": the model first parses the prompt, plans the composition, and only then generates frames. In practice, that means fewer wasted seconds on prompts that ask for something physically incoherent.
A Tongyi Lab spokesperson framed the release as a step beyond conventional video generation: "Wan 2.7 marks a major leap forward in controllable generative AI. We are delivering tools that combine creative freedom with precise control like never before."
Pricing on Alibaba Cloud's Model Studio starts at around US$0.10 per second of generated video, which puts it within reach of small studios and independent creators. For UK production teams, the practical takeaway is that you no longer need a closed-platform subscription to test a serious video model on a real brief.
Open-source video generation is closing the gap with closed platforms. Photo: Jakob Owens / Unsplash.
Inception Mercury 2: diffusion comes for the language model
Released 24 February 2026 by Inception Labs.
If Wan 2.7 was the predictable headline, Mercury 2 was the surprise. Inception Labs released the first commercial-scale diffusion-based reasoning language model, and the numbers are not subtle: 1,009 tokens per second on Nvidia Blackwell hardware, end-to-end latency of 1.7 seconds, and a price of US$0.25 per million input tokens and US$0.75 per million output tokens.
Conventional language models such as GPT-5, Claude Opus and Gemini work autoregressively: they predict one token at a time, in order. Mercury 2 starts with a noisy block of text and refines the whole block in parallel, the same way an image diffusion model refines pixels. Inception claims this makes Mercury up to ten times faster than speed-optimised competitors at comparable quality.
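The difference in the number of sequential model calls can be illustrated with a toy sketch. It assumes a simplified "masked block" picture of parallel refinement. The article's sources do not describe Mercury 2's internals, so the reveal schedule and both function names here are hypothetical. No model is involved: the code only counts how many passes each style would need.

```python
MASK = "_"


def autoregressive_decode(target):
    """Reveal one token per model call, left to right."""
    out, calls = [], 0
    for token in target:
        out.append(token)  # a real model would predict this token
        calls += 1
    return out, calls


def parallel_refine_decode(target, steps):
    """Start from a fully masked block and fill in positions on every call.

    Hypothetical schedule: call number s reveals every position i with
    i % steps == s. A real diffusion language model decides what to commit
    from its own predictions; this toy only counts passes.
    """
    out = [MASK] * len(target)
    calls = 0
    for step in range(steps):
        for i in range(step, len(target), steps):
            out[i] = target[i]
        calls += 1
    return out, calls


target = "the quick brown fox jumps over the lazy dog again and again".split()
ar_out, ar_calls = autoregressive_decode(target)
pr_out, pr_calls = parallel_refine_decode(target, steps=4)
print(f"autoregressive calls:      {ar_calls}")
print(f"parallel refinement calls: {pr_calls}")
```

The autoregressive count grows with the length of the text, while the refinement count is fixed by the number of steps. That is the intuition behind the speed claim, although real-world throughput also depends on how expensive each pass is.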
On standard reasoning benchmarks the picture is mixed: Mercury 2 scores 74 on GPQA Diamond and 91 on AIME, trailing Gemini 3 Flash on the first and beating it on the second. The real selling point is latency. Inception is targeting voice assistants, in-editor coding tools, and search systems, all places where a two-second wait feels like a two-minute wait.
For UK developers building chat features, the implication is straightforward: it is now reasonable to expect sub-two-second responses from a reasoning model, which closes off "the model was thinking" as an excuse for sluggish product experiences.
Parallel decoding is what gives diffusion language models their speed advantage. Photo: Joshua Sortino / Unsplash.
Meituan LongCat-Video: long-form open video
Released 27 October 2025 by Meituan's LongCat team.
LongCat-Video is the open-source release that explains why Wan 2.7 had to happen. Meituan put out a 13.6-billion-parameter Diffusion Transformer under the MIT licence that handles text-to-video, image-to-video and video continuation in a single architecture. Crucially, it was pretrained on a continuation objective, which is what lets it produce minutes-long clips while keeping characters, lighting and motion coherent. That is the hardest problem in video generation, and the one that most closed models still solve by silently capping clip length.
Two architectural details matter for readers building with it:
- Block sparse attention keeps memory usage manageable at higher resolutions, so the model can render 720p at 30 frames per second within minutes on a single high-end GPU.
- Coarse-to-fine generation along both time and space means the model sketches the whole clip first, then sharpens, rather than rendering frame by frame.
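To see why block sparse attention helps, it is enough to count how many blocks of the attention matrix get computed. The sketch below assumes a simple local-window pattern chosen purely for illustration. It is not the selection rule LongCat-Video uses, and the function name, token count, block size and window are all hypothetical.

```python
def attention_block_counts(num_tokens, block_size, window_blocks):
    """Count attention blocks under dense vs block-sparse attention.

    Assumed pattern: each query block attends only to key blocks that sit
    within `window_blocks` positions of it. Illustrative only.
    """
    num_blocks = -(-num_tokens // block_size)  # ceiling division
    dense = num_blocks * num_blocks
    sparse = sum(
        1
        for q in range(num_blocks)
        for k in range(num_blocks)
        if abs(q - k) <= window_blocks
    )
    return dense, sparse


dense, sparse = attention_block_counts(
    num_tokens=4096, block_size=128, window_blocks=2
)
print(f"dense blocks:  {dense}")
print(f"sparse blocks: {sparse} ({sparse / dense:.1%} of dense)")
```

Dense attention grows with the square of the sequence length, while a fixed window grows roughly linearly. That is why skipping blocks matters more as resolution and clip length go up.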
For UK educators, agencies and independent creators, LongCat-Video is the most permissively licensed video model of comparable quality. The MIT licence allows commercial use without the field-of-use restrictions that come with many "open weights" releases from rival labs.
Minutes-long coherent video is the hardest test of a video diffusion model. Photo: Denise Jans / Unsplash.
What this means for UK teams
Three practical points.
- The licence story has flipped. Two of the three releases above are under genuinely permissive open licences (Apache 2.0 and MIT). A year ago, anyone serious about generative video defaulted to a closed platform. Today the open option is competitive and the legal terms are friendlier.
- Diffusion is no longer image-only. If your roadmap assumes "we will use OpenAI for text and a diffusion vendor for images", that mental model is already out of date. Mercury 2 shows that text generation itself is now a diffusion target. Plan for a more fluid landscape rather than a clean split.
- Inference cost is the next battleground. Mercury 2 at over 1,000 tokens per second and Wan 2.7 at ten cents per video second both point in the same direction: 2025 was about capability, 2026 is about throughput per pound. Build your prototypes against APIs that publish per-token and per-second pricing, not against models that only offer per-seat plans.
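Published per-token and per-second prices make back-of-envelope budgeting straightforward. The sketch below uses only the figures quoted in this article; the Wan 2.7 price is a "starts at around" figure, so treat it as a floor. The workload sizes and the exchange rate are placeholder assumptions, not quoted values, and the function names are illustrative.

```python
# Figures quoted in this article; check current price lists before budgeting.
MERCURY_INPUT_USD_PER_M = 0.25     # US$ per million input tokens
MERCURY_OUTPUT_USD_PER_M = 0.75    # US$ per million output tokens
MERCURY_TOKENS_PER_SECOND = 1009   # quoted throughput on Nvidia Blackwell
WAN_USD_PER_VIDEO_SECOND = 0.10    # "starts at around" price


def text_cost_usd(input_tokens, output_tokens):
    """Cost of a text workload at the quoted per-token prices."""
    return (
        input_tokens * MERCURY_INPUT_USD_PER_M
        + output_tokens * MERCURY_OUTPUT_USD_PER_M
    ) / 1_000_000


def generation_seconds(output_tokens):
    """Time to produce the output at the quoted throughput (ignores queueing)."""
    return output_tokens / MERCURY_TOKENS_PER_SECOND


def video_cost_usd(clip_seconds, clips=1):
    """Cost of generated video at the quoted per-second price."""
    return clip_seconds * clips * WAN_USD_PER_VIDEO_SECOND


def to_gbp(usd, usd_per_gbp):
    """Convert using an exchange rate supplied by the caller."""
    return usd / usd_per_gbp


# Hypothetical workload: 10,000 chat requests, 2,000 tokens in, 500 out each.
requests = 10_000
chat_usd = text_cost_usd(2_000 * requests, 500 * requests)
# Hypothetical brief: twenty 15-second clips.
video_usd = video_cost_usd(clip_seconds=15, clips=20)
# Placeholder exchange rate; substitute the current one.
rate = 1.25

print(f"chat:  US${chat_usd:.2f} (about £{to_gbp(chat_usd, rate):.2f})")
print(f"video: US${video_usd:.2f} (about £{to_gbp(video_usd, rate):.2f})")
print(f"500-token reply: about {generation_seconds(500):.2f} s to generate")
```

A per-seat plan cannot be modelled this way, which is the practical argument for prototyping against metered APIs.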
The next milestone to watch is the second wave of diffusion language models. Google has an experimental text diffusion model in limited release, and a handful of academic groups are publishing convergent results. Once a second commercial player ships, "diffusion LLM" will move from a one-vendor curiosity to a category, and the architectural conversation will reset again.
Sources
- Alibaba Launches Wan 2.7: Breakthrough AI Image & Video Generation Model with Thinking Mode, FinancialContent, 6 April 2026.
- Inception launches Mercury 2, the first diffusion-based language reasoning model, The Decoder, 24 February 2026.
- LongCat-Video Technical Report, arXiv 2510.22200, Meituan LongCat Team, 27 October 2025.
- From U-Nets to DiTs: The Architectural Evolution of Text-to-Image Diffusion Models, ICLR Blogposts 2026, 27 April 2026.
- AI Updates Today (June 2026), LLM Stats, accessed 8 June 2026.
Written by
Mohamed AL-Kaisi
Editor-in-chief of the Data & AI Hub.