~/runthismodel

Can RTX 5090 run Llama 3.1 70B Instruct?

Grade: D

No: does not fit in VRAM

~0 tok/sec GPU-only · Cannot run without CPU offload: insufficient VRAM

Your VRAM: 32 GB
Model size: 70B
Best quant: Q5_K_M
VRAM needed: 50.0 GB

The verdict

The RTX 5090 (32 GB VRAM) cannot hold Llama 3.1 70B Instruct entirely in VRAM: the Q5_K_M quantization needs 50.0 GB, which is 18 GB more than the card provides. GPU-only throughput is therefore around 0 tokens/second, because the model does not load. Running it at all requires offloading part of the model to system RAM, which feels slow in interactive use. Llama 3.1 70B Instruct is Meta's flagship 70B-parameter model, with performance rivaling GPT-4 on many benchmarks.

Setup tutorial: Llama 3.1 70B Instruct on RTX 5090

AI-generated, GPU-specific.

TL;DR

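Llama 3.1 70B Instruct earns a grade D on the NVIDIA GeForce RTX 5090. The Q5_K_M quantization requires 50.0GB of VRAM, but the card has 32GB, so about 18GB of the model must be offloaded to system RAM and there is no VRAM headroom left for context. With that offload, throughput is estimated at ~13 tok/sec; GPU-only, the model does not load.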

Prerequisites

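Before starting, ensure you have at least 50GB of free disk space, enough system RAM to hold the roughly 18GB of model weights that do not fit in VRAM, a compatible operating system (Windows or Linux), and a recent NVIDIA driver. The RTX 5090 is a Blackwell-generation card and needs an R570-series or later driver; CUDA toolkits older than 12.8 do not support it. Ollama ships its own CUDA runtime libraries, so a separate CUDA toolkit install is generally not required.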

Expected performance

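With the Q5_K_M quantization, the model weights alone consume around 50.0GB, which is 18GB more than the RTX 5090's 32GB of VRAM. There is no VRAM headroom: the layers that do not fit run from system RAM. The estimated ~13 tokens per second assumes this partial offload, and actual speed depends heavily on CPU and system memory bandwidth. Context is stored in the KV cache, which grows linearly with the number of tokens, so a 16,000-token window adds several more gigabytes on top of the weights.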
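To make the context cost concrete, here is a rough sketch of a KV cache estimate. It assumes the published Llama 3.1 70B architecture (80 layers, 8 key/value heads, head dimension 128) and an unquantized fp16 cache; a runtime that quantizes the cache or allocates extra buffers will show different numbers.

```python
# Rough KV cache estimate for Llama 3.1 70B.
# Assumptions: 80 layers, 8 KV heads (grouped-query attention),
# head dimension 128, fp16 cache entries.
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16; a quantized KV cache would be smaller


def kv_cache_bytes(context_tokens: int) -> int:
    """Bytes needed to store keys and values for `context_tokens` tokens."""
    # Factor of 2: one key vector and one value vector per head per layer.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return context_tokens * per_token


def kv_cache_gb(context_tokens: int) -> float:
    """Same figure in decimal gigabytes (1 GB = 1e9 bytes)."""
    return kv_cache_bytes(context_tokens) / 1e9


for tokens in (4_096, 16_000, 131_072):
    print(f"{tokens:>7} tokens -> {kv_cache_gb(tokens):.2f} GB of KV cache")
```

Under these assumptions each token costs 327,680 bytes, so a 16,000-token window needs a little over 5GB in addition to the 50.0GB of weights.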

1. Install the runtime: Ollama

# Linux; on Windows, use the installer from https://ollama.com/download
curl -fsSL https://ollama.com/install.sh | sh
ollama --version

2. Download the model

Download the Q5_K_M quantized model (48.0GB file) from Hugging Face.

ollama pull hf.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF:Q5_K_M

3. Run it

# Starts an interactive chat session; type /bye to exit
ollama run hf.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF:Q5_K_M
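Besides the interactive CLI, a running Ollama server also accepts requests over its local REST API. The sketch below builds a request for the /api/generate endpoint using only the Python standard library. The model tag is an assumption (it must match whatever name the model was pulled under), and the num_ctx and num_gpu defaults are illustrative rather than recommended values.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
# Assumed model tag; replace it with the name shown by `ollama list`.
MODEL = "hf.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF:Q5_K_M"


def build_generate_request(prompt, num_ctx=4096, num_gpu=None):
    """Build (but do not send) a non-streaming generate request."""
    options = {"num_ctx": num_ctx}      # context window in tokens
    if num_gpu is not None:
        options["num_gpu"] = num_gpu    # number of layers to place on the GPU
    payload = {"model": MODEL, "prompt": prompt, "stream": False, "options": options}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate("Hello", num_ctx=2048) would return the model's reply as a string once the server has loaded the model, which can take a while for a 48.0GB file.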

4. Optimize for RTX 5090

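The Q5_K_M weights (50.0GB) exceed the RTX 5090's 32GB of VRAM by about 18GB, so only part of the model can live on the GPU. Ollama chooses the CPU/GPU split automatically; to set it by hand, use the num_gpu parameter (the number of layers placed on the GPU), which corresponds to --n-gpu-layers in llama.cpp. Flash attention reduces attention memory use and can improve speed: set the OLLAMA_FLASH_ATTENTION=1 environment variable for the Ollama server, or pass --flash-attn in llama.cpp. Keep the context length (num_ctx in Ollama) modest, because every additional token of context enlarges the KV cache and pushes more of the model out of VRAM.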
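As a starting point for the layer split, the following sketch estimates how many layers fit on the GPU. It is a back-of-the-envelope calculation, not how any runtime actually decides: it assumes the 50.0GB of weights are spread evenly over 80 transformer layers (ignoring embedding and output tensors), and the 3GB reserve for KV cache and runtime buffers is a hypothetical figure to tune.

```python
import math


def gpu_layers_that_fit(model_gb, vram_gb, n_layers, reserve_gb):
    """Estimate how many equally sized layers fit in VRAM after a reserve."""
    per_layer_gb = model_gb / n_layers             # assumes uniform layer size
    usable_gb = max(vram_gb - reserve_gb, 0.0)     # VRAM left for weights
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Figures from this page: 50.0 GB model, 32 GB card, 80 layers.
# reserve_gb=3.0 is an assumed allowance for KV cache and buffers.
layers = gpu_layers_that_fit(model_gb=50.0, vram_gb=32.0, n_layers=80, reserve_gb=3.0)
print(f"Try roughly {layers} of 80 layers on the GPU")
```

Treat the result as an initial value for num_gpu or --n-gpu-layers and lower it if out-of-memory errors appear, since a longer context needs a larger reserve.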

Troubleshooting

Out of memory errors during inference.

Lower the number of layers placed on the GPU (num_gpu in Ollama, --n-gpu-layers <number> in llama.cpp) or decrease the context length (num_ctx in Ollama).

Slow inference speed.

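Enable flash attention (OLLAMA_FLASH_ATTENTION=1 for Ollama, --flash-attn in llama.cpp) and run ollama ps to see how the model is split between CPU and GPU. With 50.0GB of weights on a 32GB card, a large CPU share is expected and is the main cause of slow generation.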

Model fails to load.

Verify that the model file is correctly downloaded and not corrupted. Check the Ollama logs for any errors.
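For a GGUF file downloaded by hand (for example, for use with llama.cpp), one way to check for corruption is to compare its SHA-256 digest with the checksum shown on the file's Hugging Face page. This is a minimal sketch; the file path and the expected digest are placeholders to supply yourself.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Chunked reads keep memory use flat even for a ~48 GB model file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with a published checksum (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A call such as matches_expected("model.gguf", "<published sha256>") returns False for a truncated or corrupted download, in which case re-downloading the file is the usual fix.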

Alternative runtimes

Alternative runtimes include LM Studio, llama.cpp, and Jan. LM Studio provides a graphical interface and suits those who prefer a GUI. llama.cpp exposes low-level options such as --n-gpu-layers directly and is the better fit for advanced users. Jan is an open-source desktop chat application for running models locally. Whichever you choose, the 50.0GB requirement is the same, so partial CPU offload is still needed on a 32GB card.

FAQ

What GPU do I need to run Llama 3.1 70B Instruct?

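To run Llama 3.1 70B Instruct entirely on the GPU, you need at least 40.1 GB of VRAM for the most compact quantization considered here. Higher-precision quantizations need more, up to 142.0 GB for full 16-bit precision. Cards with less VRAM, such as the 32 GB RTX 5090, must offload part of the model to system RAM.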

Is Llama 3.1 70B Instruct good for coding?

Yes, Llama 3.1 70B Instruct performs well in coding tasks, often rivaling GPT-4 in code generation and understanding complex programming concepts.

Llama 3.1 70B Instruct vs Llama 3.1 8B?

Llama 3.1 70B Instruct offers significantly better performance and more nuanced responses compared to Llama 3.1 8B, but requires much more VRAM and computational resources.

Can I run Llama 3.1 70B Instruct on a Mac?

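Yes, on an Apple Silicon Mac with enough unified memory. Apple Silicon GPUs share memory with the CPU, so what matters is total unified memory rather than discrete VRAM. The 40.1 GB to 142.0 GB requirements still apply, which in practice means a 64 GB or larger configuration. Ollama, LM Studio, and llama.cpp all support macOS.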

How much VRAM does Llama 3.1 70B Instruct need?

Llama 3.1 70B Instruct requires between 40.1 GB and 142.0 GB of VRAM, depending on the quantization level used.
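The range follows from simple arithmetic: weight size is roughly the parameter count times the bits stored per weight. The sketch below illustrates this with two assumed inputs, a parameter count of about 70.6 billion and an effective 5.67 bits per weight for Q5_K_M. Both are approximations chosen for illustration, not values taken from this page.

```python
def weights_size_gb(params_billion, bits_per_weight):
    """Approximate weight size in decimal GB: parameters x bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


PARAMS_B = 70.6  # assumed approximate parameter count, in billions

# Bits-per-weight values are approximate; K-quants mix several precisions.
for name, bpw in (("FP16", 16.0), ("Q5_K_M (approx.)", 5.67)):
    print(f"{name:<18} ~{weights_size_gb(PARAMS_B, bpw):.1f} GB")
```

Under these assumptions the 16-bit figure lands near the 142.0 GB upper bound and Q5_K_M near 50.0 GB; runtime overhead and the KV cache come on top of the weights.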

Is Llama 3.1 70B Instruct censored?

Llama 3.1 70B Instruct is safety-tuned: it has been trained to refuse some requests it considers harmful or inappropriate. When run locally, no separate content filter is applied unless you add one.

Is Llama 3.1 70B Instruct commercial-use allowed?

Yes, the Llama 3.1 Community License permits commercial use, subject to Meta's Acceptable Use Policy. Organizations with more than 700 million monthly active users must request a separate license from Meta.

Llama 3.1 70B Instruct context length?

Llama 3.1 70B Instruct has a context length of 131,072 tokens, allowing it to process very long sequences of text.
