Mistral AI · llm
Mistral Nemo 12B
Mistral's 12B model with excellent instruction following.
12B params · mistral · apache-2.0 · 128K ctx · 7.46–12.63 GB VRAM
about·model card

Mistral Nemo 12B is a large language model (LLM) developed by Mistral AI, with 12 billion parameters. It generates text across a wide range of tasks, including writing, summarization, translation, and question answering. Its context length of 131,072 tokens lets it handle long inputs, which suits applications that need a lot of context at once. The Apache-2.0 license permits both research and commercial use, and the model has over 125,000 downloads and 1,662 likes.

In the 12B parameter size class, Mistral Nemo 12B holds its own, and it is noted for coherent, contextually relevant responses even with complex prompts. The Q4_K_M and Q8_0 quantizations make local deployment more accessible, with VRAM requirements of about 7.5 GB and 12.6 GB respectively. That puts the Q4_K_M variant within reach of consumer GPUs with 8 GB or more of VRAM, while Q8_0 needs a larger card. Ideal users include researchers, developers, and content creators who need a capable yet efficient LLM for local use without the overhead of cloud services.

./quantizations·2 variants
Quantization   Bits   File Size   VRAM Needed   RAM Needed   Quality
Q4_K_M         4.5    6.964 GB    7.46 GB       7.96 GB      85%
Q8_0           8      12.128 GB   12.63 GB      13.13 GB     98%
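
As a rough illustration of how the quantization figures can be used, the sketch below picks the highest-quality variant that fits a given VRAM budget. The `Quant` class, the `best_fit` helper, and the optional `kv_cache_gb` allowance are hypothetical names introduced here; the page does not say whether its VRAM figures already include a KV-cache allowance, so treat the extra term as an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quant:
    name: str
    bits: float
    vram_gb: float
    quality_pct: int


# Figures copied from the quantization table on this page.
QUANTS = [
    Quant("Q4_K_M", 4.5, 7.46, 85),
    Quant("Q8_0", 8.0, 12.63, 98),
]


def best_fit(vram_budget_gb: float, kv_cache_gb: float = 0.0) -> Optional[Quant]:
    """Return the highest-quality quant that fits the budget, or None.

    kv_cache_gb is an assumed extra allowance added on top of the
    table's "VRAM Needed" value (hypothetical, not from the page).
    """
    fitting = [q for q in QUANTS if q.vram_gb + kv_cache_gb <= vram_budget_gb]
    return max(fitting, key=lambda q: q.quality_pct, default=None)
```

Under these assumptions, `best_fit(8.0)` selects Q4_K_M, `best_fit(16.0)` selects Q8_0, and `best_fit(6.0)` returns `None` because neither variant fits.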

Context window & KV cache

The KV cache adds 1.25 GB to VRAM at the displayed context setting.

Long chats and RAG inputs cost real memory, and the requirement grows with context length, so a 128K context needs far more VRAM than a 32K one.

Model native max: 128K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.
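
To make the KV-cache cost concrete, here is a minimal sketch of the usual back-of-the-envelope formula: two tensors (keys and values) per layer, each storing `n_kv_heads * head_dim` values per token. The architecture defaults (40 layers, 8 KV heads, head dimension 128) and the 16-bit cache are assumptions taken from the publicly released model configuration, not from this page; check them against the model's `config.json` before relying on the numbers.

```python
def kv_cache_bytes(
    tokens: int,
    n_layers: int = 40,       # assumed from the public model config
    n_kv_heads: int = 8,      # assumed (grouped-query attention)
    head_dim: int = 128,      # assumed
    bytes_per_value: int = 2, # assumed 16-bit cache entries
) -> int:
    """Estimate KV-cache size: keys + values for every layer and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * tokens


def to_gib(n_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


# Under the assumptions above:
#   8,192 tokens   -> 1.25 GiB
#   32,768 tokens  -> 5.0 GiB
#   131,072 tokens -> 20.0 GiB
```

The estimate scales linearly with context length, which is why a full 128K window can cost more memory than the weights themselves. Runtimes that quantize the cache or allocate it differently will land elsewhere, consistent with the ±30 % caveat above.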

How to run Mistral Nemo 12B

Ollama is the easiest runtime: a single command, with both its native API and an OpenAI-compatible API served on :11434. The commands below are pre-filled with this model's repo.

  1. Pull the model

    ollama pull mistral-nemo

  2. Chat

    ollama run mistral-nemo

  3. Use as API (native /api/chat endpoint)

    curl http://localhost:11434/api/chat \
      -d '{"model":"mistral-nemo","messages":[{"role":"user","content":"Hi"}]}'
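
The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes Ollama is running locally on its default port and that `mistral-nemo` has already been pulled; `build_chat_request` and `chat` are hypothetical helper names. Setting `"stream": false` asks Ollama for a single JSON response instead of a stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_request(prompt: str, model: str = "mistral-nemo") -> urllib.request.Request:
    """Build a POST request for Ollama's native chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object back instead of streamed chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str = "mistral-nemo", timeout: float = 120.0) -> str:
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["message"]["content"]
```

With the server running, `print(chat("Hi"))` prints the model's reply. Keeping request construction separate from the network call makes the payload easy to inspect without a live server.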


Self-host serving plan

Want to host Mistral Nemo 12B for many users, or run it on a card that's technically too small? The planner shows the following estimate:

VRAM needed: 8.8 GB (7.5 GB weights + 0.9 GB KV)

Aggregate tok/s: 21 (across 1 user)

Per-user tok/s: 21 (12 B dense)

✅ Fits in 24 GB VRAM with 15.2 GB headroom. Pure-GPU inference — full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.
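
The page does not give the formula behind this estimate. One plausible reading, sketched below, is that each doubling of concurrent users adds 70 % of the single-user rate to the aggregate, with no further gain past 8 users. The log2 form, the function names, and the default parameters are all assumptions made for illustration, not the planner's actual implementation.

```python
import math


def aggregate_tps(
    single_user_tps: float,
    users: int,
    gain_per_doubling: float = 0.70,  # "~70 %" from the text above
    plateau_users: int = 8,           # "until ~8" from the text above
) -> float:
    """Aggregate tokens/sec under an assumed log2 scaling rule."""
    if users < 1:
        raise ValueError("users must be at least 1")
    effective = min(users, plateau_users)  # no gain past the plateau
    return single_user_tps * (1 + gain_per_doubling * math.log2(effective))


def per_user_tps(single_user_tps: float, users: int) -> float:
    """Share of the aggregate rate each concurrent user would see."""
    return aggregate_tps(single_user_tps, users) / users
```

Starting from the 21 tok/s single-user figure, this reading gives roughly 35.7 tok/s aggregate for 2 users and roughly 65.1 tok/s for 8, after which the aggregate stays flat and the per-user rate keeps falling.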


faq·common questions
how much VRAM do I need to run Mistral Nemo 12B?

Mistral Nemo 12B requires a minimum of 7.46 GB VRAM with Q4_K_M quantization. The higher-quality Q8_0 quantization needs 12.63 GB.

which quant should I pick?

Q4_K_M is the best quality/VRAM balance: the quantization table rates it at 85% quality for 7.46 GB of VRAM. Q8_0 is near-lossless (98%) if you have the headroom for 12.63 GB.

faq://ai-curated·20 entries
What GPU do I need to run Mistral Nemo 12B?

To run Mistral Nemo 12B, you need a GPU with at least 7.5 GB of VRAM for Q4_K_M, and 12.6 GB for Q8_0. An NVIDIA RTX 3060 with 12 GB covers Q4_K_M; Q8_0 needs a card with more than 12 GB.

Is Mistral Nemo 12B good for coding?

Mistral Nemo 12B is well-suited for coding tasks due to its strong instruction-following capabilities and large context length of 131,072 tokens.

Mistral Nemo 12B vs Llama 3.1 8B?

Mistral Nemo 12B has more parameters (12B vs 8B), while both models support a context length of 131,072 tokens. Mistral Nemo 12B is generally more capable but requires more VRAM.

Can I run Mistral Nemo 12B on a Mac?

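Yes, you can run Mistral Nemo 12B on a Mac with Apple silicon (M1 or later). The model runs from unified memory, so the machine needs enough RAM for the chosen quantization: about 8 GB for Q4_K_M and about 13 GB for Q8_0, plus room for the system.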

How much VRAM does Mistral Nemo 12B need?

The VRAM requirement for Mistral Nemo 12B ranges from 7.5 GB to 12.6 GB, depending on the quantization level used.

Is Mistral Nemo 12B censored?

Mistral Nemo 12B is not inherently censored, but it follows ethical guidelines and can be fine-tuned to avoid generating harmful content.

Is Mistral Nemo 12B commercial-use allowed?

Yes, Mistral Nemo 12B is licensed under Apache-2.0, which allows for commercial use without additional fees.

Mistral Nemo 12B context length?

Mistral Nemo 12B has a context length of 131,072 tokens, allowing it to process very long sequences of text.

Does Mistral Nemo 12B support function calling?

Yes, Mistral Nemo 12B supports function calling, enabling it to interact with external systems and APIs.

Mistral Nemo 12B quantization options?

This page lists two quantized variants of Mistral Nemo 12B: Q4_K_M (4.5 bits per weight) and Q8_0 (8 bits per weight). The smaller variant targets GPUs with less VRAM; the larger one stays closer to the original quality.

Can Mistral Nemo 12B run on CPU?

While Mistral Nemo 12B can run on a CPU, it will be significantly slower compared to running on a GPU. A multi-core CPU with high clock speed is recommended.

Mistral Nemo 12B fine-tuning?

Mistral Nemo 12B can be fine-tuned using frameworks like Hugging Face Transformers, allowing you to adapt it to specific tasks or domains.

Mistral Nemo 12B system requirements?

To run Mistral Nemo 12B, you need a system with at least 16 GB of RAM, a multi-core CPU, and a GPU with 7.5 GB to 12.6 GB of VRAM, depending on the quantization level.

Mistral Nemo 12B performance benchmark?

No community benchmark runs have been submitted for this model yet. The serving planner on this page estimates about 21 tokens per second for a single user with Q4_K_M on a 24 GB card; real throughput depends on the GPU, quantization, and context length.

Mistral Nemo 12B for RAG?

Mistral Nemo 12B is suitable for Retrieval-Augmented Generation (RAG) tasks due to its large context length and ability to integrate external data sources.

Mistral Nemo 12B for agents?

Mistral Nemo 12B can be used to create intelligent agents for tasks like chatbots, virtual assistants, and automated customer service, leveraging its strong language understanding and generation capabilities.

Mistral Nemo 12B for coding vs general?

Mistral Nemo 12B performs well in both coding and general tasks, but it may require fine-tuning for optimal performance in specialized areas like code generation.

Mistral Nemo 12B vs ChatGPT?

Mistral Nemo 12B is an open-weight model under Apache-2.0 with a 131,072-token context that you can run locally, while ChatGPT is a hosted service with a more polished user interface that is optimized for conversational tasks.

Mistral Nemo 12B download size?

The download size for Mistral Nemo 12B varies based on the quantization level: approximately 7 GB for Q4_K_M and approximately 12 GB for Q8_0. Unquantized 16-bit weights come to roughly 24 GB.
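
These sizes can be approximated from the parameter count and the bits per weight. The sketch below assumes a nominal 12 billion parameters and uses decimal gigabytes; `weights_size_gb` is a hypothetical helper. Real files differ somewhat because the parameter count is not exactly 12 billion and quantized formats mix tensor precisions and carry metadata.

```python
def weights_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters * bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


# Nominal 12B parameters (assumption):
#   4.5 bits -> 6.75 GB   (page lists 6.964 GB for Q4_K_M)
#   8 bits   -> 12.0 GB   (page lists 12.128 GB for Q8_0)
#   16 bits  -> 24.0 GB
```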

Best quant for Mistral Nemo 12B?

The best quantization level depends on your hardware. For most users, Q4_K_M offers the better balance between quality and resource usage and fits lower-end GPUs at 7.46 GB of VRAM. Pick Q8_0 for near-lossless output if you have 12.63 GB of VRAM available.