NVIDIA · llm
Nemotron Mini 4B
NVIDIA's compact 4B model optimized for edge deployment.
4B params · nemotron · other · 8K ctx · 3.01–4.65 GB VRAM
about·model card

Nemotron Mini 4B is a lightweight NVIDIA language model designed for efficient local deployment. With 4 billion parameters, it balances output quality against resource consumption. It supports a context length of 8192 tokens, enough for multi-turn conversations and moderately long documents. Typical applications include chatbots, content creation, summarization, and translation.

In its size class, Nemotron Mini 4B stands out for its efficiency. It needs 3.01 GB of VRAM with Q4_K_M and 4.65 GB with Q8_0, so it can run on mid-range GPUs, including those in laptops and desktops with moderate specifications. Its small footprint means it generally runs faster and uses less memory than larger models. That makes it a good fit for developers and enthusiasts who need local text generation without high-end hardware, particularly on machines with limited GPU memory.

./quantizations·2 variants
Quantization  Bits  File Size  VRAM Needed  RAM Needed  Quality
Q4_K_M        4.5   2.512 GB   3.01 GB      3.51 GB     85%
Q8_0          8     4.154 GB   4.65 GB      5.15 GB     98%
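
The VRAM figures in the quantization table can be turned into a quick fit check. The sketch below is illustrative only: the function name `best_quant` and the `extra_gb` parameter (memory you want to reserve for context or other overhead) are assumptions, not part of the site or of any runtime.

```python
# VRAM needs (GB) and quality ratings copied from the quantization table.
QUANTS = {
    "Q4_K_M": {"vram_gb": 3.01, "quality_pct": 85},
    "Q8_0": {"vram_gb": 4.65, "quality_pct": 98},
}


def best_quant(free_vram_gb, extra_gb=0.0):
    """Return the highest-quality quant that fits in free_vram_gb, or None.

    extra_gb is a hypothetical reserve (for example, KV cache) added on top
    of each variant's listed VRAM need.
    """
    fitting = [
        (spec["quality_pct"], name)
        for name, spec in QUANTS.items()
        if spec["vram_gb"] + extra_gb <= free_vram_gb
    ]
    # max() compares quality first, so the best-rated fitting variant wins.
    return max(fitting)[1] if fitting else None
```

For example, `best_quant(4.0)` selects Q4_K_M, while a card with 8 GB free has room for Q8_0.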

Context window & KV cache

Adds 0.66 GB to VRAM

Long chats and RAG inputs cost real memory: the KV cache grows with the number of tokens held in context.

Model native max: 8K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.
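
As a rough illustration of why the estimate depends on attention layout, the standard back-of-the-envelope KV-cache formula multiplies layers, KV heads, head dimension, and context length. The layout numbers in the usage note are hypothetical placeholders, not Nemotron Mini 4B's actual configuration, and the site's own estimator may use a different method.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate KV-cache size in GB (1 GB = 1024**3 bytes).

    bytes_per_value=2 assumes 16-bit cache entries; the layout arguments must
    come from the model's real config, which is not given on this page.
    """
    # Factor 2: each layer stores one key tensor and one value tensor.
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    )
    return total_bytes / 1024**3
```

With a hypothetical layout of 32 layers, 8 KV heads, and a head dimension of 128, `kv_cache_gb(32, 8, 128, 8192)` gives 1.0 GB at the full 8K context, and halving the context halves the cache.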

How to run Nemotron Mini 4B

Ollama is the easiest runtime: each step is a single command, and the server listens on :11434. The curl command below uses Ollama's native /api/chat endpoint; an OpenAI-compatible API is also served under /v1.

  1. Pull the model

    ollama pull nemotron-mini
  2. Chat

    ollama run nemotron-mini
  3. Use as API

    curl http://localhost:11434/api/chat \
      -d '{"model":"nemotron-mini","messages":[{"role":"user","content":"Hi"}]}'
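
The same API call can be made from a script. This is a minimal sketch using only the Python standard library, assuming an Ollama server is running locally on the default port with `nemotron-mini` already pulled. The helper names `build_payload` and `chat` are illustrative; `"stream": false` asks Ollama for one complete JSON reply instead of a stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # same endpoint as the curl command


def build_payload(prompt, model="nemotron-mini"):
    """Build the JSON body for /api/chat with streaming turned off."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def chat(prompt, model="nemotron-mini", url=OLLAMA_URL):
    """Send one user message and return the assistant's reply text."""
    body = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    # Supplying data makes urllib issue a POST request.
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["message"]["content"]
```

With the server running, `print(chat("Hi"))` prints the model's reply. Without `"stream": false`, Ollama returns newline-delimited JSON chunks that would need to be read line by line.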

Community benchmarks

Real tokens/sec reports from people running Nemotron Mini 4B on actual hardware.

No community runs yet for this model. Be the first to submit your numbers.

Self-host serving plan

Want to host Nemotron Mini 4B for many users? Or run it on a card that’s technically too small? The figures below show the estimate for one user.

VRAM needed: 4.0 GB (3.0 GB weights + 0.5 GB KV)

Aggregate tok/s: 63 (across 1 user)

Per-user tok/s: 63 (4 B dense)

✅ Fits in 24 GB VRAM with 20.0 GB headroom. Pure-GPU inference — full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.
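
One possible reading of that rule, sketched as code: each doubling of concurrent users adds about 70 % of the single-user rate to the aggregate, and the total stops growing past roughly 8 users. The function name, parameters, and the logarithmic interpolation between doublings are assumptions made for illustration; the site's actual estimator may differ.

```python
import math


def aggregate_tps(single_user_tps, users, gain_per_doubling=0.70, plateau_users=8):
    """Estimate aggregate tokens/sec under the sub-linear scaling rule.

    Each doubling of users adds gain_per_doubling * single_user_tps, and the
    total is held flat once users exceeds plateau_users.
    """
    if users < 1:
        raise ValueError("users must be at least 1")
    effective_users = min(users, plateau_users)
    doublings = math.log2(effective_users)
    return single_user_tps * (1 + gain_per_doubling * doublings)


def per_user_tps(single_user_tps, users):
    """Split the aggregate estimate evenly across concurrent users."""
    return aggregate_tps(single_user_tps, users) / users
```

Starting from the page's 63 tok/s single-user figure, this reading gives about 107 tok/s aggregate for 2 users and about 195 tok/s for 8 or more.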


faq·common questions
how much VRAM do I need to run Nemotron Mini 4B?

Nemotron Mini 4B requires a minimum of 3.01 GB of VRAM with Q4_K_M quantization. With Q8_0 you need 4.65 GB.

which quant should I pick?

Q4_K_M is the best quality/VRAM balance: it is rated 85% in the quantization table and needs 3.01 GB of VRAM, about 65% of what Q8_0 needs. Q8_0 (rated 98%) is near-lossless if you have the headroom.

faq://ai-curated·20 entries
What GPU do I need to run Nemotron Mini 4B?

To run Nemotron Mini 4B, you need a GPU with at least 3.01 GB of free VRAM for the Q4_K_M quantization, or 4.65 GB for Q8_0. Leave extra room for the KV cache if you plan to use long contexts.

Is Nemotron Mini 4B good for coding?

Nemotron Mini 4B is suitable for coding tasks, offering a balance between performance and resource usage, making it a viable option for code generation and assistance.

Nemotron Mini 4B vs Llama 3.1 8B?

Nemotron Mini 4B has 4 billion parameters, making it smaller and more efficient than Llama 3.1 8B, which has 8 billion parameters. Nemotron Mini 4B is better suited for edge devices and scenarios with limited resources.

Can I run Nemotron Mini 4B on a Mac?

Yes, you can run Nemotron Mini 4B on a Mac with enough free memory for the model (3.01 GB for Q4_K_M). On Apple Silicon, runtimes such as Ollama and llama.cpp use Metal for GPU acceleration with unified memory. CUDA and ROCm are not available on current macOS.

How much VRAM does Nemotron Mini 4B need?

Nemotron Mini 4B needs 3.01 GB of VRAM with Q4_K_M and 4.65 GB with Q8_0. The KV cache adds to that as the context grows.

Is Nemotron Mini 4B censored?

Nemotron Mini 4B is not inherently censored, but its behavior can be influenced by the data it was trained on and any post-training modifications or filters applied by the user or the platform.

Is Nemotron Mini 4B commercial-use allowed?

The commercial use of Nemotron Mini 4B depends on the specific license terms provided by NVIDIA. Check the license details on the official NVIDIA website or the runthismodel.com page for more information.

Nemotron Mini 4B context length?

Nemotron Mini 4B supports a context length of up to 8192 tokens (8K), which is the model's native maximum.

Does Nemotron Mini 4B support function calling?

Nemotron Mini 4B supports function calling, enabling it to interact with external systems and APIs, enhancing its capabilities in various applications.

Nemotron Mini 4B quantization options?

This page lists two quantizations of Nemotron Mini 4B: Q4_K_M (4.5 bits, 2.512 GB file, 3.01 GB VRAM) and Q8_0 (8 bits, 4.154 GB file, 4.65 GB VRAM).

Can Nemotron Mini 4B run on CPU?

Nemotron Mini 4B can run on a CPU, but generation is significantly slower than on a GPU. The quantization table lists RAM needs of 3.51 GB for Q4_K_M and 5.15 GB for Q8_0. For best performance, use a GPU with at least 3.01 GB of VRAM.

Nemotron Mini 4B fine-tuning?

Nemotron Mini 4B can be fine-tuned on custom datasets to improve performance on specific tasks. Fine-tuning requires additional computational resources and expertise in training deep learning models.

Nemotron Mini 4B system requirements?

To run Nemotron Mini 4B, you need a system with at least 8 GB of RAM, a multi-core CPU, and a GPU with at least 3.01 GB of VRAM for Q4_K_M. Q8_0 needs 4.65 GB of VRAM.

Nemotron Mini 4B performance benchmark?

No community benchmarks have been submitted for Nemotron Mini 4B yet. The serving-plan estimate on this page is 63 tokens per second for a single user on a 24 GB GPU; real throughput varies with the hardware and quantization level used.

Nemotron Mini 4B for RAG?

Nemotron Mini 4B can be used for Retrieval-Augmented Generation (RAG) tasks, where it can generate text based on retrieved documents, enhancing its ability to provide contextually relevant responses.

Nemotron Mini 4B for agents?

Nemotron Mini 4B is well-suited for creating conversational agents and chatbots, thanks to its compact size and efficient performance, making it ideal for deployment on edge devices.

Nemotron Mini 4B for coding vs general?

Nemotron Mini 4B performs well in both coding and general text generation tasks. However, its smaller size may result in slightly less nuanced outputs compared to larger models, making it a balanced choice for a wide range of applications.

Nemotron Mini 4B vs ChatGPT?

Nemotron Mini 4B is far smaller than the models behind ChatGPT, whose parameter counts OpenAI has not published, and it runs locally rather than as a hosted service. Nemotron Mini 4B is better suited for edge devices and resource-constrained environments, while ChatGPT offers superior performance and context understanding.

Nemotron Mini 4B download size?

The download size of Nemotron Mini 4B depends on the quantization level: 2.512 GB for Q4_K_M and 4.154 GB for Q8_0, which is lightweight compared to larger models.

Best quant for Nemotron Mini 4B?

The best quantization level for Nemotron Mini 4B depends on your specific needs. Q4_K_M provides a good balance between quality and resource usage at 3.01 GB of VRAM, while Q8_0 offers higher fidelity at the cost of increased VRAM usage (4.65 GB).