~/runthismodel
Mistral AI · llm
Mistral Nemo Base 12B
Official Mistral-Nemo 12B foundation model (NVIDIA collab) — pretrained only, no instruct or refusal layer. Naturally uncensored, Apache 2.0, 128K context.
12b params · mistral · apache-2.0 · 128K ctx · 7.7–24.5 GB vram
about·model card

Mistral Nemo Base 12B is a large language model (LLM) developed by Mistral AI with 12 billion parameters. As a pretrained-only base model, it continues text rather than following instructions, which suits content generation, few-shot prompting, and use as a starting point for fine-tuning; chatbot-style use generally needs careful prompting or further tuning. Its context length of 131,072 tokens (128K) lets it handle long inputs, which is useful for tasks such as summarizing long documents or keeping long conversations coherent.

In its size class, Mistral Nemo Base 12B offers a balance between performance and efficiency. It will not match the largest models on every task, but it has much lower resource requirements. This page lists two variants: BF16, the unquantized 16-bit weights, and Q4_K_M, a roughly 4.5-bit quantization that cuts the file from 24 GB to 7.2 GB at a listed quality score of 85%. Q4_K_M needs about 7.7 GB of VRAM and BF16 about 24.5 GB, before KV cache. That puts the quantized model within reach of mid-range consumer GPUs, while BF16 exceeds what a 24 GB card provides and calls for larger professional hardware.

./quantizations·2 variants
Quantization   Bits   File Size   VRAM Needed   RAM Needed   Quality
BF16           16     24 GB       24.5 GB       25 GB        100%
Q4_K_M         4.5    7.2 GB      7.7 GB        8.2 GB       85%
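
The file sizes in the table can be roughly cross-checked from parameter count and bits per weight. The following is a minimal sketch, assuming exactly 12 billion parameters and decimal gigabytes (both simplifications). `weight_size_gb` is a hypothetical helper, not part of any tool mentioned on this page.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Back-of-envelope weight size in decimal GB.

    Assumes every parameter is stored at the same bit width, which is
    only approximately true for mixed schemes such as Q4_K_M.
    """
    # billions of params * bits / 8 bits-per-byte = billions of bytes = GB
    return params_billion * bits_per_weight / 8


bf16_gb = weight_size_gb(12, 16)     # table lists 24 GB
q4_k_m_gb = weight_size_gb(12, 4.5)  # table lists 7.2 GB
print(f"BF16:   {bf16_gb:.2f} GB")
print(f"Q4_K_M: {q4_k_m_gb:.2f} GB")
```

Under these assumptions BF16 comes out at 24 GB, matching the table, while Q4_K_M comes out at 6.75 GB, a little under the listed 7.2 GB. A gap of that size is expected from an estimate like this, since the parameter count and the 4.5-bit figure are both rounded.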

Context window & KV cache

Long chats and RAG inputs cost real memory. At the estimator's current setting the KV cache adds 1.25 GB to VRAM, and the cost grows with the context window (for example, 32K vs 128K).

Model native max: 128K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.
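
A KV-cache estimate of this kind can be sketched as follows. The architecture values used as defaults (40 layers, 8 key/value heads, head dimension 128, 16-bit cache entries) are assumptions that do not appear on this page, so check them against the model's published configuration before relying on the numbers. `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(
    context_tokens: int,
    n_layers: int = 40,        # assumed, not stated on this page
    n_kv_heads: int = 8,       # assumed (grouped-query attention)
    head_dim: int = 128,       # assumed
    bytes_per_value: int = 2,  # assumed 16-bit cache entries
) -> int:
    """Approximate KV-cache size for a single sequence."""
    # Each layer stores one key and one value vector per KV head per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


def to_gib(n_bytes: int) -> float:
    return n_bytes / 2**30


for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {to_gib(kv_cache_bytes(ctx)):.2f} GiB")
```

With these assumed values an 8,192-token window comes to 1.25 GiB, which lines up with the 1.25 GB figure above, although the page does not say which context length that figure corresponds to. The same arithmetic gives 5 GiB at 32K and 20 GiB at 128K, before the ±30 % uncertainty the page mentions.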

How to run Mistral Nemo Base 12B

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

LM Studio (GUI). Browse → download → chat. MLX on Apple Silicon.

  1. Open LM Studio. Go to the 🔍 Search tab.
  2. Search for bartowski/Mistral-Nemo-Base-2407-GGUF
  3. Download. Pick the Q4_K_M quant: best balance of size vs. quality.
  4. Chat. Hit ▶ Load Model and start chatting. Toggle 'Local Server' to expose an OpenAI-compatible API on :1234.
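
Once the local server is on, the model can be called from code. The sketch below uses only the Python standard library and assumes the server is reachable at `localhost` and exposes the OpenAI-style `/v1/completions` route. The model identifier is a placeholder: use whatever name LM Studio shows for the loaded model. Because this is a base model, the example sends a plain text prompt to continue instead of a chat message list.

```python
import json
import urllib.request

# Port taken from the step above; host and route are assumptions.
BASE_URL = "http://localhost:1234/v1"


def build_completion_request(
    prompt: str,
    model: str = "mistral-nemo-base-2407",  # placeholder identifier
    max_tokens: int = 64,
    temperature: float = 0.7,
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-completion request."""
    body = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the generated continuation."""
    req = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=120) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["text"]
```

Calling `complete("The three laws of thermodynamics are")` with the server running should return the model's continuation of that sentence. Splitting request construction from sending makes the payload easy to inspect without a live server.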

Self-host serving plan

Want to host Mistral Nemo Base 12B for many users? Or run it on a card that’s technically too small? Slide the knobs.

VRAM needed: 9.1 GB (7.7 GB weights + 0.9 GB KV)

Aggregate tok/s: 21 (across 1 user)

Per-user tok/s: 21 (12 B dense)

✅ Fits in 24 GB VRAM with 14.9 GB headroom. Pure-GPU inference — full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.
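
One possible reading of this rule, sketched in code: each doubling of concurrent users adds 70 % of the single-user rate to the aggregate, and the aggregate stops growing past 8 users. The exact formula the estimator uses is not given on this page, so the logarithmic interpolation between doublings and the hard cap are assumptions, and `aggregate_tps` is a hypothetical helper.

```python
import math


def aggregate_tps(
    single_user_tps: float,
    users: int,
    gain_per_doubling: float = 0.7,  # "~70 %" from the text above
    plateau_users: int = 8,          # "~8" from the text above
) -> float:
    """Estimated total tokens/sec across all concurrent users."""
    if users < 1:
        raise ValueError("users must be at least 1")
    # Past the plateau, extra users no longer add aggregate throughput.
    effective = min(users, plateau_users)
    return single_user_tps * (1.0 + gain_per_doubling * math.log2(effective))


for n in (1, 2, 4, 8, 16):
    total = aggregate_tps(21, n)
    print(f"{n:>2} users: {total:5.1f} tok/s total, {total / n:5.1f} per user")
```

Starting from the 21 tok/s single-user figure shown above, this reading gives about 35.7 tok/s in aggregate for 2 users and about 65.1 tok/s for 8 or more, so per-user speed falls steadily as concurrency rises.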

faq·common questions
how much VRAM do I need to run Mistral Nemo Base 12B?

Mistral Nemo Base 12B requires 7.7 GB VRAM minimum with Q4_K_M quantization. For full BF16 precision you need 24.5 GB.

which quant should I pick?

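Q4_K_M is the best quality/VRAM balance: per the quantization table, about 85% of BF16 quality at about 30% of the footprint (7.2 GB vs 24 GB). Q8_0 is generally near-lossless if you have the headroom, but it is not one of the two variants listed on this page.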

faq://ai-curated·20 entries
What GPU do I need to run Mistral Nemo Base 12B?

A GPU with at least 7.7 GB of VRAM runs the Q4_K_M quant. The unquantized BF16 weights need 24.5 GB. Add headroom for the KV cache if you use long contexts.

Is Mistral Nemo Base 12B good for coding?

It can complete code, and the 131,072-token context helps with long files. It is a general-purpose base model, not a code-specialized or instruction-tuned one, so it works best for code continuation and less well for following natural-language coding requests.

Mistral Nemo Base 12B vs Llama 3.1 8B?

Mistral Nemo Base 12B has more parameters (12B vs 8B), so it is the larger model and needs more VRAM at the same quantization. Both models advertise a 128K-token context window, so context length is not a differentiator.

Can I run Mistral Nemo Base 12B on a Mac?

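Yes. LM Studio runs on Apple Silicon Macs (this page notes MLX support there), using the Mac's unified memory in place of dedicated VRAM. No NVIDIA GPU or CUDA is involved. Make sure the Mac has enough free memory for the quant you choose, roughly 8 GB for Q4_K_M per the quantization table.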

How much VRAM does Mistral Nemo Base 12B need?

Between 7.7 GB (Q4_K_M) and 24.5 GB (BF16), depending on the quantization, plus KV cache for the context you use. Lower-bit quantization reduces VRAM usage but costs some output quality.

Is Mistral Nemo Base 12B censored?

No. It is a pretrained-only base model with no instruct tuning or refusal layer, so it generates text without built-in content restrictions.

Is Mistral Nemo Base 12B commercial-use allowed?

Yes, Mistral Nemo Base 12B is licensed under Apache 2.0, which allows commercial use as long as you comply with the license terms.

Mistral Nemo Base 12B context length?

Mistral Nemo Base 12B has a context length of 131,072 tokens, making it suitable for handling very long sequences of text.

Does Mistral Nemo Base 12B support function calling?

No. As a pretrained-only model it has no built-in function-calling or tool-use format. You can approximate it with few-shot prompts and your own output parsing, or by fine-tuning.

Mistral Nemo Base 12B quantization options?

This page lists two variants: BF16 (16 bits per weight, 24 GB file) and Q4_K_M (about 4.5 bits per weight, 7.2 GB file). Q4_K_M cuts the VRAM requirement from 24.5 GB to 7.7 GB at a listed quality of 85%.

Can Mistral Nemo Base 12B run on CPU?

Yes. GGUF builds such as Q4_K_M can run on a CPU given enough system RAM (the quantization table lists 8.2 GB for Q4_K_M), but generation is much slower than on a GPU. A GPU is strongly recommended for practical use.

Mistral Nemo Base 12B fine-tuning?

Mistral Nemo Base 12B can be fine-tuned for specific tasks using frameworks like Hugging Face Transformers. Ensure you have the necessary computational resources and data for effective fine-tuning.

Mistral Nemo Base 12B system requirements?

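You need enough GPU memory for the chosen quant: about 7.7 GB of VRAM for Q4_K_M or 24.5 GB for BF16, plus KV cache for your context length. The quantization table lists 8.2 GB and 25 GB of system RAM respectively. An NVIDIA GPU is not strictly required, since LM Studio also runs on Apple Silicon.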

Mistral Nemo Base 12B performance benchmark?

This page has no community-submitted throughput numbers for this model yet. The self-host serving estimate on this page shows 21 tok/s for a single user at its displayed settings. Real speed depends heavily on the GPU, the quant, and the context length.

Mistral Nemo Base 12B for RAG?

Mistral Nemo Base 12B can be used for Retrieval-Augmented Generation (RAG) by integrating it with a retrieval system to enhance its context and generate more informed responses.
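
The integration can be as simple as retrieving a few passages and placing them in the prompt ahead of the question. The sketch below is a toy illustration, not a production retriever: it ranks passages by word overlap, where a real system would typically use embeddings, and both function names are hypothetical. The prompt ends at "Answer:" because a base model continues text and does not respond to instructions.

```python
import re


def overlap_score(query: str, passage: str) -> int:
    """Count distinct words shared by the query and the passage."""
    q_words = set(re.findall(r"\w+", query.lower()))
    p_words = set(re.findall(r"\w+", passage.lower()))
    return len(q_words & p_words)


def build_rag_prompt(query: str, passages: list[str], top_k: int = 2) -> str:
    """Pick the top_k most relevant passages and assemble a completion prompt."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    context = "\n\n".join(
        f"Document {i + 1}:\n{p}" for i, p in enumerate(ranked[:top_k])
    )
    # End where the answer should begin so the model continues from there.
    return f"{context}\n\nQuestion: {query}\nAnswer:"


docs = [
    "Apache 2.0 permits commercial use.",
    "The KV cache grows with context length.",
    "Bananas are yellow.",
]
print(build_rag_prompt("How does context length affect the KV cache?", docs, top_k=1))
```

The resulting string can be sent as the prompt to whichever runtime serves the model. Keep in mind that every retrieved passage adds tokens, and therefore KV-cache memory.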

Mistral Nemo Base 12B for agents?

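Mistral Nemo Base 12B can serve as the foundation for conversational agents, and its large context length helps with long interaction histories. Because it is a base model without instruct tuning, expect to fine-tune it or rely on structured few-shot prompts before it follows agent instructions reliably.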

Mistral Nemo Base 12B for coding vs general?

It is a general-purpose base model and is not specialized for code. It can continue both code and prose, and its 131,072-token context is most useful when a prompt has to hold long files or documents.

Mistral Nemo Base 12B vs ChatGPT?

Mistral Nemo Base 12B is an open-weight base model under Apache 2.0 that you can download and run locally, with a 131,072-token context. ChatGPT is a hosted, closed-source service built on instruction-tuned models, so it follows conversational instructions out of the box, which a base model does not.

Mistral Nemo Base 12B download size?

The download size depends on the quantization: about 7.2 GB for Q4_K_M and 24 GB for BF16.

Best quant for Mistral Nemo Base 12B?

It depends on your hardware. Q4_K_M (7.7 GB VRAM) is the practical choice for most consumer GPUs, while BF16 (24.5 GB VRAM) preserves full quality if you have the memory.