NVIDIA Nemotron™ 3 Nano Omni is a 30B-A3B open multimodal model designed to serve as a perception and context sub-agent in enterprise agent systems. It accepts text, image, video, and audio inputs and produces text output, so agents can perceive and reason across modalities in a single inference loop. Built on a hybrid MoE Transformer-Mamba architecture with Conv3D video layers and Efficient Video Sampling (EVS), it delivers approximately 2× higher throughput and 2.5× lower compute for video reasoning than separate vision and speech pipelines. It supports a context length of up to 300K tokens and a reasoning budget of 16,384 tokens, with extended thinking enabled via reasoning.
Vision, Reasoning | Proprietary Model
AI Performance Evaluation

Overall
AA Intelligence Index: 21% (↓18%)

Reasoning & Math
GPQA Diamond: 47% (↓35%)
HLE: 5.3% (↓12%)

Coding
AA Coding Index: 15% (↓22%)
TAU2: 45% (↓35%)
TerminalBench: 8.3% (↓26%)
SciCode: 28% (↓14%)

Language & Instructions
IFBench: 63% (↑0%)
AA-LCR: 36% (↓26%)

Output Speed
Standard Mode: 312 tok/s (↑235)
First Output: 6.96 s

Source: Artificial Analysis
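To put the output speed figures above in practical terms, the following is a rough sketch of how they might translate into end-to-end response time. It assumes that the 6.96 s "First Output" figure is the delay before generation starts and that the 312 tok/s rate holds constant afterwards; neither assumption is confirmed by the source, and the function name `estimated_latency_s` is purely illustrative.

```python
def estimated_latency_s(
    output_tokens: int,
    tokens_per_s: float = 312.0,   # Standard Mode output speed reported above
    first_output_s: float = 6.96,  # "First Output" latency reported above
) -> float:
    """Rough response-time estimate: first-output delay plus streaming time.

    Assumption (not from the source): throughput stays constant once
    generation has started, and the first-output delay is paid once.
    """
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return first_output_s + output_tokens / tokens_per_s


if __name__ == "__main__":
    # 16_384 corresponds to the full reasoning budget mentioned in the description.
    for n in (256, 1024, 16_384):
        print(f"{n:>6} tokens: ~{estimated_latency_s(n):.1f} s")
```

Under these assumptions, short answers are dominated by the first-output delay, while a response that uses the full 16,384-token reasoning budget is dominated by streaming time.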