Can RTX 4090 run Llama 3.1 70B Instruct?
Partially: runs locally only with CPU offload
Does not fit in VRAM: Q4_K_M needs 40.1 GB, the RTX 4090 has 24 GB
The verdict
The RTX 4090 (24 GB VRAM) cannot hold Llama 3.1 70B Instruct entirely in VRAM: the Q4_K_M quantization needs 40.1 GB, which is 16.1 GB more than the card provides. The model runs only when the remaining layers are offloaded to system RAM and the CPU, which makes interactive use noticeably slower than on hardware that fits the whole model. Llama 3.1 70B Instruct is Meta's 70B-parameter instruction-tuned model, with performance that rivals GPT-4 on many benchmarks.
Setup tutorial: Llama 3.1 70B Instruct on RTX 4090
AI-generated, GPU-specific. Verified commands for your exact hardware.
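Llama 3.1 70B Instruct runs on an NVIDIA GeForce RTX 4090 with grade D performance, using the Q4_K_M quantization. The estimated rate is ~13 tok/sec, and actual speed depends on how many layers stay on the GPU. The model's memory footprint is 40.1GB, of which at most 24GB can sit in VRAM; the rest is served from system RAM.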
Prerequisites
Before starting, ensure you have at least 40GB of free disk space, enough system RAM to hold the roughly 16GB of the model that does not fit in VRAM, a 64-bit version of Windows or Linux, the latest NVIDIA drivers (version 525.60.13 or later), and CUDA 11.8 installed.
Expected performance
With the Q4_K_M quantization, the estimated token generation rate is ~13 tok/sec and the model occupies 40.1GB of memory. That exceeds the 24GB of VRAM by 16.1GB, so no context-window setting makes the model fit entirely on the GPU. A smaller context window (65536 tokens or fewer) still helps, because it shrinks the KV cache and leaves more VRAM for model layers.
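The context window affects memory through the KV cache, which grows linearly with the number of tokens. As a rough sketch, assuming Llama 3.1 70B's published architecture (80 layers, 8 key/value heads, head dimension 128) and an fp16 cache, the cache size can be estimated as follows. The constant and function names are illustrative, and real runtimes add compute buffers on top of this figure.

```python
# Assumed architecture values for Llama 3.1 70B (taken from the published model config).
N_LAYERS = 80        # transformer blocks
N_KV_HEADS = 8       # grouped-query attention: 8 key/value heads
HEAD_DIM = 128       # dimension per attention head
BYTES_PER_VALUE = 2  # fp16 cache entries (assumption; quantized caches are smaller)


def kv_cache_gib(context_tokens: int) -> float:
    """Approximate KV cache size in GiB for a given context length."""
    # The factor 2 covers the separate key and value tensors in every layer.
    bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return context_tokens * bytes_per_token / 2**30


if __name__ == "__main__":
    for ctx in (8192, 32768, 65536, 131072):
        print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):.1f} GiB of KV cache")
```

Under these assumptions an 8192-token window needs about 2.5 GiB of cache, while the full 131072-token window needs about 40 GiB, more than the card's entire VRAM before any model weights are loaded.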
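1. Install runtime: Ollama
curl -fsSL https://ollama.com/install.sh | sh
On Windows, run the installer from ollama.com instead. Ollama detects a CUDA-capable GPU automatically, so there is no separate device setting. (pip install ollama installs only the Python client library, not the runtime.)
2. Download the model
Download the Q4_K_M quantized model (39.6GB) from Hugging Face.
ollama pull hf.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF:Q4_K_M
3. Run it
ollama run hf.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF:Q4_K_M
ollama run does not accept context-length or GPU-layer flags. Set these inside the interactive session instead. The full 131072-token window will not fit alongside the model on this card, so start with a small context, for example:
/set parameter num_ctx 8192
/set parameter num_gpu 32
4. Optimize for RTX 4090
On the NVIDIA GeForce RTX 4090 with 24GB VRAM, num_gpu 32 is a starting point for the number of layers placed on the GPU. Raise it until VRAM is nearly full, and lower it if loading fails. Flash attention reduces memory usage and can improve speed. In Ollama it is enabled by setting the environment variable OLLAMA_FLASH_ATTENTION=1 before starting the server. Because the model needs 40.1GB and the card has 24GB, the remaining 16.1GB stays in system RAM, so keep the context window small to leave VRAM for model layers.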
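A rough way to pick a GPU layer count is to divide the usable VRAM by the average size of one layer. The sketch below assumes 80 equally sized layers (a simplification; embedding and output tensors differ in size) and a caller-chosen reserve for the KV cache, compute buffers, and the desktop. The function name and the reserve values are hypothetical.

```python
MODEL_GB = 40.1   # Q4_K_M footprint quoted on this page
VRAM_GB = 24.0    # RTX 4090
N_LAYERS = 80     # assumed transformer block count for Llama 3.1 70B


def max_gpu_layers(reserve_gb: float) -> int:
    """Estimate how many layers fit in VRAM after setting aside reserve_gb."""
    per_layer_gb = MODEL_GB / N_LAYERS           # assumes equally sized layers
    usable_gb = max(VRAM_GB - reserve_gb, 0.0)   # never go below zero
    return min(N_LAYERS, int(usable_gb // per_layer_gb))


if __name__ == "__main__":
    for reserve in (4.0, 8.0):
        print(f"reserve {reserve:.0f} GB -> about {max_gpu_layers(reserve)} layers on the GPU")
```

With a 4 GB reserve the estimate is 39 layers, and with an 8 GB reserve it is 31. Treat the result as a first guess to refine by watching actual VRAM usage.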
Troubleshooting
Out of memory error during inference
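Reduce the context length or the number of GPU layers by lowering the num_ctx and num_gpu parameters, for example with /set parameter num_ctx <value> and /set parameter num_gpu <value> in the Ollama session.
Slow token generation rate
Run ollama ps to see how the model is split between CPU and GPU. Ensure that flash attention is enabled (OLLAMA_FLASH_ATTENTION=1) and try raising num_gpu so that more layers run on the GPU.
Model fails to load
Verify that the model has been downloaded completely (it should appear in ollama list), that nvidia-smi reports the GPU, and that the system has enough free RAM for the layers that do not fit in VRAM.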
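The context length and GPU layer count can also be set per request through Ollama's local HTTP API, which makes it easy to retry with smaller values after an out-of-memory error. This is a minimal sketch: it assumes an Ollama server on the default port 11434 and a model pulled under the hf.co name shown in the code, and the helper names and default values are illustrative rather than recommended settings.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
# Assumed model name; replace it with whatever `ollama list` shows on your machine.
MODEL = "hf.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF:Q4_K_M"


def build_payload(prompt: str, num_ctx: int = 8192, num_gpu: int = 32) -> dict:
    """Build a non-streaming generate request with memory-related options."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "num_gpu": num_gpu},
    }


def generate(prompt: str, **options) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    data = json.dumps(build_payload(prompt, **options)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


if __name__ == "__main__":
    # Lower num_ctx or num_gpu here if the server reports an out-of-memory error.
    print(generate("Say hello in one sentence.", num_ctx=4096, num_gpu=28))
```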
Alternative runtimes
Alternative runtimes include LM Studio, llama.cpp, and Jan. LM Studio provides a more user-friendly interface and is suitable for those who prefer a GUI. llama.cpp offers more fine-grained control over model parameters and is ideal for advanced users. Jan is a lightweight runtime that is easy to set up but may not offer the same level of performance tuning as Ollama.
FAQ
What GPU do I need to run Llama 3.1 70B Instruct?
To run Llama 3.1 70B Instruct entirely on the GPU, you need at least 40.1 GB of VRAM for the Q4_K_M quantization. Higher-precision quantizations need more, up to 142.0 GB for full precision.
Is Llama 3.1 70B Instruct good for coding?
Yes, Llama 3.1 70B Instruct performs well in coding tasks, often rivaling GPT-4 in code generation and understanding complex programming concepts.
Llama 3.1 70B Instruct vs Llama 3.1 8B?
Llama 3.1 70B Instruct offers significantly better performance and more nuanced responses compared to Llama 3.1 8B, but requires much more VRAM and computational resources.
Can I run Llama 3.1 70B Instruct on a Mac?
Yes, on an Apple Silicon Mac with enough unified memory. The GPU shares system memory on these machines, so the memory requirement (40.1 GB for Q4_K_M) applies to the Mac's total unified memory rather than to a separate graphics card.
How much VRAM does Llama 3.1 70B Instruct need?
Llama 3.1 70B Instruct requires between 40.1 GB and 142.0 GB of VRAM, depending on the quantization level used.
Is Llama 3.1 70B Instruct censored?
Llama 3.1 70B Instruct is safety-tuned, so it declines some requests it judges harmful or inappropriate. When run locally, no additional external content filter is applied.
Is Llama 3.1 70B Instruct commercial-use allowed?
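Yes, Llama 3.1 70B Instruct can be used commercially under the Llama 3.1 Community License, which covers both research and commercial applications. Companies with more than 700 million monthly active users must request a separate license from Meta.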
Llama 3.1 70B Instruct context length?
Llama 3.1 70B Instruct has a context length of 131,072 tokens, allowing it to process very long sequences of text.
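The VRAM range quoted in the answers above follows from simple arithmetic: weight storage is roughly the parameter count times the bits stored per weight. The sketch below assumes about 70.6 billion parameters and uses decimal gigabytes; both function names are illustrative, and the result ignores the KV cache and runtime buffers.

```python
N_PARAMS = 70.6e9  # approximate parameter count of Llama 3.1 70B (assumption)


def weights_gb(bits_per_weight: float) -> float:
    """Approximate size of the model weights in decimal GB."""
    return N_PARAMS * bits_per_weight / 8 / 1e9


def implied_bits_per_weight(size_gb: float) -> float:
    """Average bits per weight implied by a given weight size in decimal GB."""
    return size_gb * 1e9 * 8 / N_PARAMS


if __name__ == "__main__":
    print(f"16-bit weights: {weights_gb(16):.1f} GB")
    print(f"40.1 GB implies {implied_bits_per_weight(40.1):.2f} bits per weight")
```

At 16 bits per weight the estimate is about 141 GB, close to the 142.0 GB full-precision figure, and the 40.1 GB figure corresponds to roughly 4.5 bits per weight on average.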