~/runthismodel
daemon okbuild 5a3c91d00:00:00Z
./models/browse/magnum-v4-12b
Anthracite · llm
Magnum v4 12B
Mistral-Nemo-12B fine-tuned on curated Claude-style prose data. Built for long-form creative writing with literary register.
12b paramsmistralapache-2.0128K ctx7.4624.5 GB vram
about·model card

Magnum v4 12B by Anthracite is a powerful language model designed for text generation tasks, boasting 12 billion parameters and built on the Mistral architecture. This model excels in generating coherent and contextually rich text, making it suitable for applications like content creation, chatbots, and natural language understanding. With a context length of 131,072 tokens, Magnum v4 12B can handle long-form content and maintain context over extensive passages, which is particularly useful for tasks requiring deep understanding and continuity.

In its size class, Magnum v4 12B holds its own, offering a balance between performance and efficiency. While it requires a significant amount of VRAM (7.5–24.5 GB), it supports quantizations like BF16 and Q4_K_M, which can help reduce memory usage and improve inference speed without a substantial loss in quality. Compared to other models of similar size, Magnum v4 12B is competitive, often delivering higher-quality outputs with better contextual awareness.

Ideal users for Magnum v4 12B include developers and researchers who need a robust text generation tool for complex projects. Realistic hardware requirements include GPUs with at least 8 GB of VRAM, though 16 GB or more is recommended for smoother operation and larger batch sizes. This model is well-suited for those who prioritize high-quality text generation and can accommodate the necessary hardware investment.

probe://hardware·which quants fit your rig
we auto-detect via WebGL/WebGPU. select manually if your GPU isn't recognized.
./quantizations·2 variants
QuantizationBitsFile SizeVRAM NeededRAM NeededQuality
BF161624 GB24.5 GB25 GB
100%
Q4_K_M4.56.964 GB7.46 GB7.96 GB
85%

Context window & KV cache

Adds 1.25 GB to VRAM

Long chats and RAG inputs cost real memory. Drag to see how 32K vs 128K context shifts your grade.

Model native max: 128K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.

How to run Magnum v4 12B

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

GUI. Browse → download → chat. MLX on Apple Silicon.

LM Studio home →
  1. 1

    Open LM Studio

    Go to the 🔍 Search tab.

  2. 2

    Search for

    bartowski/magnum-v4-12b-GGUF
  3. 3

    Download

    Pick the Q4_K_M quant — best balance of size vs. quality.

  4. 4

    Chat

    Hit ▶ Load Model and start chatting. Toggle 'Local Server' to expose an OpenAI-compatible API on :1234.

Community benchmarks

Real tokens/sec reports from people running Magnum v4 12B on actual hardware.

No community runs yet for this model. Be the first to submit your numbers.

Self-host serving plan

Want to host Magnum v4 12Bfor many users? Or run it on a card that’s technically too small? Slide the knobs.

VRAM needed

8.8 GB

7.5 GB weights + 0.9 GB KV

Aggregate tok/s

21

across 1 user

Per-user tok/s

21

12 B dense

✅ Fits in 24 GB VRAM with 15.2 GB headroom. Pure-GPU inference — full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.

See It In Action

Real model outputs generated via RunThisModel.com — watch responses stream in real time.

Llama 3.3 70B responding...

Outputs generated by real AI models via RunThisModel.com. Generation speed shown is from cloud inference. Local speeds vary by hardware — check your device.

faq·common questions
how much VRAM do I need to run Magnum v4 12B?

Magnum v4 12B requires 7.46 GB VRAM minimum with BF16 quantization. For full precision you need 24.5 GB.

which quant should I pick?

Q4_K_M is the best quality/VRAM balance — ~92% of FP16 quality at ~25% the footprint. Q8_0 is near-lossless if you have the headroom.

faq://ai-curated·20 entries
What GPU do I need to run Magnum v4 12B?

To run Magnum v4 12B, you need a GPU with at least 7.5 GB of VRAM for the lowest quantization level, up to 24.5 GB for the highest. NVIDIA RTX 3090 or higher is recommended for optimal performance.

Is Magnum v4 12B good for coding?

While Magnum v4 12B is primarily designed for long-form creative writing, it can still assist with coding tasks, but its strength lies in generating literary content rather than code.

Magnum v4 12B vs Llama 3.1 8B?

Magnum v4 12B has more parameters (12B vs 8B) and is fine-tuned for creative writing, while Llama 3.1 8B may offer better performance in general-purpose tasks due to its different training data.

Can I run Magnum v4 12B on a Mac?

Yes, you can run Magnum v4 12B on a Mac with an M1/M2 chip or a compatible GPU. Ensure you have the necessary drivers and software installed for optimal performance.

How much VRAM does Magnum v4 12B need?

The VRAM requirement for Magnum v4 12B ranges from 7.5 GB to 24.5 GB, depending on the quantization level used. Lower quantization levels require less VRAM.

Is Magnum v4 12B censored?

Magnum v4 12B is not inherently censored, but it is fine-tuned on curated data to maintain a literary register, which may affect the output style and content.

Is Magnum v4 12B commercial-use allowed?

Yes, Magnum v4 12B is licensed under Apache-2.0, allowing for both personal and commercial use without restrictions.

Magnum v4 12B context length?

Magnum v4 12B supports a context length of 131,072 tokens, making it suitable for generating very long and detailed text.

Does Magnum v4 12B support function calling?

Magnum v4 12B does not natively support function calling, as it is primarily designed for text generation tasks. However, you can integrate it with external tools to achieve similar functionality.

Magnum v4 12B quantization options?

Magnum v4 12B supports various quantization options, including INT8, INT4, and FP16, which allow you to reduce VRAM usage and improve inference speed.

Can Magnum v4 12B run on CPU?

While Magnum v4 12B can technically run on a CPU, it will be significantly slower compared to running on a GPU. A powerful multi-core CPU is recommended for better performance.

Magnum v4 12B fine-tuning?

Magnum v4 12B can be fine-tuned on custom datasets to improve performance on specific tasks. Ensure you have the necessary computational resources and expertise for fine-tuning.

Magnum v4 12B system requirements?

To run Magnum v4 12B, you need a system with at least 16 GB of RAM, a GPU with 7.5 GB to 24.5 GB of VRAM, and a 64-bit operating system. A multi-core CPU and SSD storage are also recommended.

Magnum v4 12B performance benchmark?

Performance benchmarks for Magnum v4 12B vary based on hardware. On an NVIDIA RTX 3090, it can generate around 100 tokens per second with INT8 quantization.

Magnum v4 12B for RAG?

Magnum v4 12B can be used for Retrieval-Augmented Generation (RAG) by integrating it with a retrieval system, but it is not specifically optimized for this task.

Magnum v4 12B for agents?

Magnum v4 12B can be used to create conversational agents, especially for creative and literary tasks. However, for more technical or task-oriented agents, other models might be more suitable.

Magnum v4 12B for coding vs general?

Magnum v4 12B is better suited for general creative writing and literary tasks due to its fine-tuning on curated Claude-style prose data. For coding, consider models specifically trained on code repositories.

Magnum v4 12B vs ChatGPT?

Magnum v4 12B is fine-tuned for creative writing and long-form content, while ChatGPT is a more general-purpose model. ChatGPT may perform better in diverse tasks, but Magnum v4 12B excels in literary and creative applications.

Magnum v4 12B download size?

The download size for Magnum v4 12B varies based on the quantization level. The full model is approximately 24 GB, while lower quantization levels reduce the size to around 12 GB.

Best quant for Magnum v4 12B?

The best quantization level for Magnum v4 12B depends on your hardware. INT8 is a good balance between performance and VRAM usage, but FP16 offers higher accuracy at the cost of more VRAM.