Twenty-Five Open-Weight Models in One Week. A Practitioner's Breakdown.

ai llm open-source machine-learning

There is a phrase that has been floating around AI circles this week: “open AI, notice the space.” It is a deliberate dig at the increasingly closed posture of OpenAI the company, and a nod to the fact that the most interesting action in AI right now is happening in the open-weight space.

Last week delivered more than 25 notable open-weight releases across text, image, audio, video, and 3D generation. That is not a normal week. Before going deep, here is the short version for practitioners who want the bottom line first.

Quick Picks

| Use case | Model | Why |
| --- | --- | --- |
| On-device / Apple Silicon | Liquid AI LFM2.5-8B | 1.5B active params, MLX-ready |
| Deployable multimodal | Google Gemma 4 12B | One checkpoint, ONNX + MLX, 140+ languages |
| Frontier reasoning (cloud) | NVIDIA Nemotron 3 Ultra | 89.1 MMLU, 1M context, 55B active |
| Coding agents | JetBrains Mellum2-12B | Built for IDE workflows, 2.5B active |
| Text-in-image / design | Ideogram 4 | #1 open-weight on Design Arena |
| Real-time TTS | Higgs Boson Health Audio v3 | Sub-second first audio, 21 emotions |
| Artifact-free TTS | rednote dots.tts | Codec-free, continuous waveform |
| Speech-to-text at scale | NVIDIA Nemotron-3.5 ASR | 17x concurrency vs Parakeet RNNT |
| Document parsing | PaddleOCR-VL-1.6 | State-of-the-art at 1B params |
| Audio-visual generation | Baidu NAVA | Best A/V sync in open weights |
| Long-form video | JD JoyAI-Echo | Up to 5 minutes, multi-shot |
| Image to 3D | VAST TripoSplat | MIT license, Gaussian splatting |
| Robotics simulation | NVIDIA Cosmos3-Super | Physical AI, action-conditioned video |

Language Models

NVIDIA Nemotron 3 Ultra (550B)

The headline number is 550 billion parameters, but the more interesting number is 55 billion: the active parameter count at inference time. Nemotron Ultra uses a hybrid Mamba-MoE architecture, combining recurrent state-space models (Mamba) with Mixture-of-Experts routing. Instead of activating all parameters for every token, MoE routes each token through a small subset of specialist sub-networks, keeping per-token compute roughly proportional to active params, not total params (memory still has to hold all of the weights).
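To make the routing idea concrete, here is a minimal sketch of top-k expert selection for a single token. It is a generic MoE illustration, not Nemotron's actual router: the expert functions, the router scores, and `top_k=2` are all made-up assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k=2):
    """Pick the top_k experts for one token and renormalize their gate weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(token, experts, router_logits, top_k=2):
    """Weighted sum of the outputs of only the selected experts."""
    out = [0.0] * len(token)
    for idx, gate in route_token(router_logits, top_k):
        expert_out = experts[idx](token)  # the other experts are never evaluated
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out

# Toy experts (hypothetical): each one just scales the input by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
logits = [0.1, 2.0, 0.3, 1.5]  # made-up router scores for this token

selected = route_token(logits, top_k=2)
output = moe_layer(token, experts, logits, top_k=2)
print(selected, output)
```

Only two of the four experts run for this token, which is the sense in which compute tracks the active parameter count.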

The 1 million token context window puts it in the same league as Gemini 1.5 Pro for long-document tasks. At 89.1 on MMLU it sits comfortably in frontier territory. The NVFP4 variant uses a 4-bit floating point format where numbers are grouped into blocks with a shared scaling factor, recovering dynamic range while keeping memory footprint small. NVIDIA claims 5x throughput on Blackwell compared to standard FP16. For teams already running Blackwell clusters, this is worth serious evaluation. For everyone else, the 550B weight size means a multi-node setup regardless.
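The block-scaling idea can be sketched in a few lines. This is an illustration of the general technique, not NVIDIA's implementation: the grid below is the set of magnitudes a 4-bit E2M1 float can represent, while the block size of 16 and the use of a full-precision Python float for the scale are simplifying assumptions.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign is stored separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block to FP4 values that share a single scale factor."""
    amax = max(abs(v) for v in block)
    scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in block:
        target = abs(v) / scale
        mag = min(FP4_GRID, key=lambda g: abs(g - target))  # nearest grid point
        codes.append(-mag if v < 0 else mag)
    return scale, codes

def dequantize_block(scale, codes):
    """Map FP4 codes back to approximate original values."""
    return [scale * c for c in codes]

def quantize(values, block_size=16):
    """Split values into blocks and quantize each with its own scale."""
    return [quantize_block(values[i:i + block_size])
            for i in range(0, len(values), block_size)]

# Two blocks with very different magnitudes (made-up data).
small = [0.001 * i for i in range(16)]
large = [10.0 * i for i in range(16)]
blocks = quantize(small + large)
restored = [x for scale, codes in blocks for x in dequantize_block(scale, codes)]
print([scale for scale, _ in blocks])
```

Because each block carries its own scale, the tiny values in the first block are not flattened to zero by the large values in the second, which is what "recovering dynamic range" refers to.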

Google Gemma 4 12B

Gemma 4 is the most practically deployable model of the week. It handles text, image, audio, and video in a single encoder-free architecture, supports 256k context, and covers 140+ languages. The AIME 2026 score of 77.5 puts it above most models twice its size on mathematical reasoning.

What sets it apart operationally is the 23-checkpoint QAT wave. Quantization-Aware Training means the model was trained with quantization in mind from the start, rather than shrunk post-hoc. Google shipped ONNX and MLX variants simultaneously, meaning you can run it on mobile or Apple Silicon without a separate quantization step. If you need a single model that works across web, mobile, and server without maintaining multiple checkpoints, this is the obvious pick this week.
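For readers who have not seen QAT before, the core trick is "fake quantization": the forward pass uses weights snapped to the low-precision grid, while gradient updates are applied to a full-precision copy. The toy below fits a single weight this way. It is a sketch of the general idea, not Google's training recipe, and the data, learning rate, and scale are made-up assumptions.

```python
def fake_quant(w, scale, bits=4):
    """Round w to a signed integer grid, clip it, and map it back to float."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(w / scale)))
    return q * scale

def train_qat(xs, ys, steps=200, lr=0.05, scale=0.25):
    """Fit y = w * x while the forward pass only ever sees the quantized weight."""
    w = 0.0  # full-precision "shadow" weight updated by SGD
    for _ in range(steps):
        for x, y in zip(xs, ys):
            wq = fake_quant(w, scale)
            err = wq * x - y
            # Straight-through estimator: treat d(wq)/d(w) as 1, so the
            # gradient of the squared error is applied directly to w.
            w -= lr * 2 * err * x
    return w, fake_quant(w, scale)

xs = [1.0, 2.0, 3.0]
ys = [1.3 * x for x in xs]  # true slope 1.3 is not on the 0.25-step grid
w_full, w_quant = train_qat(xs, ys)
print(w_full, w_quant)
```

The quantized weight ends up on one of the grid points next to 1.3, because the loss was always computed with the quantized value rather than the full-precision one.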

Liquid AI LFM2.5-8B

Liquid Foundation Models use a recurrent architecture rather than the standard Transformer. The LFM2.5-8B has 8B total parameters but only 1.5B active at inference, with 128k context. MATH500 at 88.8 is strong for an on-device model. It is MLX-ready out of the box, meaning it runs natively on Apple Silicon using Apple’s own ML framework.

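This is the pick for anything that needs to run locally without cloud round-trips. All 8B weights still have to sit in memory, but at 4-bit quantization that is roughly 4 GB, which fits comfortably in the unified memory of an M-series chip, and the 1.5B active parameter count keeps per-token compute low.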
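A quick way to sanity-check whether a model fits on a given machine is to estimate weight storage from the total parameter count and the bits per weight. The helper below is a back-of-the-envelope sketch: it ignores KV cache, activations, and runtime overhead, and it uses total rather than active parameters because all experts normally need to be loaded.

```python
def weight_memory_gb(total_params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return total_params_billion * bits_per_weight / 8

# The 8B total parameter count comes from the text; the bit widths are common choices.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(8, bits):.1f} GB")
```

The same function applied to a 550B model at 16 bits gives 1100 GB, which is why the largest release of the week stays in data-center territory.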

JetBrains Mellum2-12B

JetBrains’ first open MoE model. 12B total parameters, 2.5B active, with a reasoning mode (“Thinking”) that closes the gap with Qwen3-14B on coding benchmarks. Apache 2.0 license. Given that JetBrains built this specifically for IDE integration and code completion, it is worth testing in coding agent workflows where latency matters more than peak benchmark scores.

Image Generation

Ideogram 4 (9.3B)

Ideogram releasing open weights is the surprise of the week. This is a 9.3B parameter flow-matching Diffusion Transformer trained from scratch, not a fine-tune of an existing checkpoint. A Diffusion Transformer (DiT) replaces the UNet backbone of classic diffusion models with a Transformer, giving it better scaling properties. Flow-matching is the training objective, a more stable alternative to the denoising score matching used in older diffusion models.
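For reference, one common form of the flow-matching objective uses a straight-line path between a noise sample and a data sample. The source does not say which variant Ideogram trained with, so treat this as the textbook formulation rather than their recipe. Here $x_0$ is Gaussian noise, $x_1$ is a training image, $t$ is sampled uniformly from $[0, 1]$, and $v_\theta$ is the network (the DiT in this case):

```latex
x_t = (1 - t)\,x_0 + t\,x_1,
\qquad
\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}
  \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2
```

The network learns to predict the constant velocity $x_1 - x_0$ that carries noise to data, and sampling integrates that velocity field from $t = 0$ to $t = 1$.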

It ranks second overall on design benchmarks behind GPT Image 2, and first among open-weight models on both Design Arena and LMArena. Its specific strength is text rendering inside images: logos, typography, posters, anywhere the text needs to be legible and correctly spelled. This has historically been the hardest problem for diffusion models to get right. Getting access to the weights changes what is possible for teams building design tooling.

Audio and Speech

Four labs shipped audio models this week, which is unusual.

Higgs Boson Health Audio v3 (4B): 102 languages, 21 distinct emotional styles including singing, whispering, and shouting. Sub-second Time to First Audio makes it viable for real-time applications. The emotional range here goes well beyond what most open TTS models offer.

rednote dots.tts: The architecturally interesting one. Most TTS systems convert text to discrete audio tokens via a neural codec, then synthesize audio from those tokens. dots.tts removes the codec entirely, generating waveforms in a fully continuous space. Apache 2.0. The practical benefit is fewer artifacts and better prosody in edge cases that confuse codec-based systems, particularly unusual pronunciations and emotional transitions.

Google Magenta RealTime 2: Music generation with under 200ms latency, accepting text, audio, and MIDI as input. The latency number makes it viable for live performance tools where a human musician is in the loop. It was ported to PyTorch and running on ZeroGPU demos within hours of release.

NVIDIA Nemotron-3.5 ASR (600M): A streaming ASR model that handles 17x more concurrent streams than Parakeet RNNT 1.1B at comparable accuracy. RNNT (Recurrent Neural Network Transducer) is an architecture that combines an encoder, prediction network, and joint network to enable streaming transcription without requiring the full audio sequence upfront. For teams running speech recognition at scale, that 17x concurrency multiplier translates directly to infrastructure cost per audio hour processed.
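The link between concurrency and cost is simple arithmetic for real-time streaming, where one GPU-hour transcribes as many audio-hours as it can hold concurrent streams. The sketch below uses a hypothetical GPU price and a hypothetical baseline of 100 streams; only the 17x multiplier comes from the text.

```python
def cost_per_audio_hour(gpu_cost_per_hour, concurrent_streams):
    """Cost of transcribing one hour of audio when every stream runs in real time."""
    return gpu_cost_per_hour / concurrent_streams

GPU_COST = 2.0          # hypothetical hourly GPU price in dollars
BASELINE_STREAMS = 100  # hypothetical concurrency for the baseline model

baseline = cost_per_audio_hour(GPU_COST, BASELINE_STREAMS)
improved = cost_per_audio_hour(GPU_COST, BASELINE_STREAMS * 17)
print(f"baseline ${baseline:.4f}/h, improved ${improved:.4f}/h")
```

Whatever the real price and baseline are, the ratio between the two costs stays at 17, as long as accuracy and hardware are held constant.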

Vision and Multimodal

StepFun Step-3.7-Flash: 198B sparse MoE VLM with around 11B active parameters. The SWE-Bench PRO score of 56.3 is notable for a vision-language model; software engineering benchmarks are usually dominated by text-only models. Apache 2.0.

PaddleOCR-VL-1.6: Document parsing at 1B parameters. Most document understanding models require much larger checkpoints to handle complex layouts, tables, and mixed text/image content reliably. At 1B it is deployable on hardware that would struggle with heavier VLMs, which matters for enterprise environments with strict hardware constraints.

Baidu NAVA (6.3B): Joint audio-video generation with best-in-class audio-visual synchronization. Generating video where mouth movements match audio, or where ambient sound matches scene content, has been a persistent weakness in open-weight video models. NAVA addresses the sync problem at the model level rather than as a post-processing step. Apache 2.0.

Video, 3D, and World Models

NVIDIA Cosmos3-Super (64B): An omnimodal world model for Physical AI. The use case is robotics and autonomous systems, not content generation. It couples action trajectories with video and audio generation, letting you condition outputs on “what would happen if the robot arm moved this way.” The target audience is simulation environments for robot training, where you need photorealistic rollouts of hypothetical actions at scale.

JD JoyAI-Echo: Text-to-video up to 5 minutes long, built on LTX-2.3. Five minutes of coherent multi-shot video from text is a meaningful capability jump; most open models cap out at 10-15 seconds. Multi-shot means the model maintains scene and character consistency across cuts, which is the harder problem.

ByteDance Bernini-R + VAST TripoSplat: Single-image-to-3D generation using Gaussian splatting, under MIT license. Gaussian splatting represents a 3D scene as a collection of semi-transparent ellipsoids, each with a position, shape, color, and opacity, rather than a traditional mesh or voxel grid. It is fast to render and produces photorealistic novel views from any angle. MIT license means usable in commercial products with minimal conditions: keep the copyright and license notice with the code.
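Rendering a splat scene comes down to sorting the Gaussians that cover a pixel by depth and alpha-blending them front to back. The sketch below shows that step for one pixel. It is a simplification under stated assumptions: real renderers project anisotropic 3D Gaussians to 2D, whereas this uses an isotropic 2D footprint, and the two splats are made-up data.

```python
import math

def splat_alpha(opacity, dx, dy, sigma):
    """Opacity of an isotropic 2D Gaussian footprint at offset (dx, dy) from its centre."""
    return opacity * math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))

def composite_pixel(splats):
    """Front-to-back alpha compositing of (depth, rgb, alpha) tuples for one pixel."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through
    for _, rgb, alpha in sorted(splats, key=lambda s: s[0]):
        weight = transmittance * alpha
        color = [c + weight * ch for c, ch in zip(color, rgb)]
        transmittance *= 1.0 - alpha
    return color, transmittance

# A half-transparent red splat in front of an opaque blue one, both centred on the pixel.
splats = [
    (2.0, (0.0, 0.0, 1.0), splat_alpha(1.0, 0.0, 0.0, sigma=1.0)),
    (1.0, (1.0, 0.0, 0.0), splat_alpha(0.5, 0.0, 0.0, sigma=1.0)),
]
color, remaining = composite_pixel(splats)
print(color, remaining)
```

The nearer red splat contributes half of the pixel and the blue one behind it fills the rest, so no background shows through.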

What This Week Actually Means

The pattern across all of these releases is compression. Models that required frontier-scale infrastructure six months ago now run on laptops. The gap between what is possible with a closed API and what is possible with local weights is narrowing faster than most people expected.

The more interesting question is what happens to the deployment layer. Running a single model endpoint is straightforward. Running a heterogeneous fleet where different requests route to different specialized models (LFM2.5 for on-device, Nemotron Ultra for complex reasoning, Ideogram 4 for design tasks) requires actual infrastructure thinking: routing logic, fallbacks, cost monitoring, latency SLOs per model type.
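As a starting point, the routing and fallback part can be expressed as a small lookup table plus a health check. Everything in this sketch is hypothetical: the task names, the latency SLO numbers, and the choice of Gemma 4 12B as a fallback are placeholders for illustration, not recommendations from any vendor.

```python
from dataclasses import dataclass, field

@dataclass
class Route:
    primary: str
    fallbacks: list = field(default_factory=list)
    latency_slo_ms: int = 1000  # hypothetical per-route latency target

# Hypothetical routing table built from the models named in this post.
ROUTES = {
    "on_device": Route("LFM2.5-8B", [], 200),
    "reasoning": Route("Nemotron 3 Ultra", ["Gemma 4 12B"], 5000),
    "design": Route("Ideogram 4", [], 10000),
}

def pick_model(task, healthy):
    """Return the first healthy model for a task, trying the primary before fallbacks."""
    route = ROUTES.get(task)
    if route is None:
        raise ValueError(f"no route configured for task {task!r}")
    for name in [route.primary, *route.fallbacks]:
        if name in healthy:
            return name
    raise RuntimeError(f"no healthy model available for task {task!r}")

# Usage: the primary reasoning model is down, so the request falls back.
healthy_models = {"LFM2.5-8B", "Gemma 4 12B", "Ideogram 4"}
print(pick_model("reasoning", healthy_models))
```

A production version would also record per-model cost and measured latency against `latency_slo_ms`, which is where most of the real engineering effort goes.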

That is the part that does not ship in a LinkedIn post. It is also the part that will matter most over the next twelve months.