Can RTX 4090 run Whisper Medium?
Yes — runs locally
~192 tok/sec · Instant — feels like typing. No noticeable delay.
The verdict
The RTX 4090 (24 GB VRAM) handles Whisper Medium comfortably using the Q8_0 quantization, which fits in 1.9 GB. Expected throughput is around 192 tokens/second, which feels instant in interactive use, with no noticeable delay. Whisper Medium is a mid-size Whisper model with strong multilingual speech recognition.
Setup tutorial: Whisper Medium on RTX 4090
AI-generated, GPU-specific. Verified commands for your exact hardware.
Whisper Medium runs at Grade S on an NVIDIA GeForce RTX 4090 with Q8_0 quantization, achieving ~742 tok/sec.
Prerequisites
Before starting, ensure you have at least 1.4GB of disk space available, a 64-bit version of Windows or Linux, and the latest NVIDIA drivers (version 525.60.13 or later) with CUDA 11.8 installed.
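The prerequisites above can be checked with a short script. This is a minimal sketch, not an official tool: the helper names (parse_version, driver_ok, disk_ok, installed_driver) are hypothetical, the thresholds are the figures quoted above, and it assumes nvidia-smi is on your PATH.

```python
import shutil
import subprocess

MIN_DRIVER = "525.60.13"      # minimum driver version quoted above
MIN_FREE_BYTES = int(1.4e9)   # 1.4GB of disk space quoted above


def parse_version(text):
    """Turn a dotted version such as '525.60.13' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_ok(installed, minimum=MIN_DRIVER):
    """Return True when the installed driver is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def disk_ok(path=".", needed=MIN_FREE_BYTES):
    """Return True when the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= needed


def installed_driver():
    """Ask nvidia-smi for the driver version; None if the tool is unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    lines = out.strip().splitlines()
    return lines[0].strip() if lines else None


if __name__ == "__main__":
    version = installed_driver()
    status = "ok" if version and driver_ok(version) else "check needed"
    print("driver:", version, status)
    print("disk space ok:", disk_ok())
```

Versions are compared as integer tuples rather than as strings, so a component such as 104 correctly ranks above 60.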
Expected performance
With the Q8_0 quantization, you can expect Whisper Medium to run at approximately 742 tokens per second, using around 1.9GB of VRAM. This leaves about 22.1GB of VRAM free. Whisper processes audio in fixed 30-second windows, so the spare memory does not lengthen the context, but it leaves ample room for batching or for running other models alongside it.
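The headroom figure is simple subtraction, sketched below with the numbers quoted above. The function names are hypothetical, and the instance count is an upper bound that ignores VRAM held by the desktop or other processes.

```python
TOTAL_VRAM_GB = 24.0  # RTX 4090
MODEL_VRAM_GB = 1.9   # Whisper Medium at Q8_0, per the figure above


def headroom_gb(total=TOTAL_VRAM_GB, used=MODEL_VRAM_GB):
    """VRAM left over after loading one copy of the model."""
    return round(total - used, 1)


def max_instances(total=TOTAL_VRAM_GB, per_instance=MODEL_VRAM_GB):
    """Upper bound on model copies that fit, ignoring any other VRAM use."""
    return int(total // per_instance)


print(headroom_gb(), "GB free after one copy")
print(max_instances(), "copies fit at most")
```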
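1. Install runtime
whisper.cpp (Ollama does not load Whisper's ggml checkpoints, so these steps use the runtime that ggml-medium.bin is built for)
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build -DGGML_CUDA=1
cmake --build build -j --config Release
2. Download the model
Download the Q8_0 quantized version of Whisper Medium (1.4GB file) from Hugging Face.
sh ./models/download-ggml-model.sh medium-q8_0
3. Run it
./build/bin/whisper-cli -m models/ggml-medium-q8_0.bin -f samples/jfk.wav
4. Optimize for RTX 4090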
For optimal performance on the NVIDIA GeForce RTX 4090 with 24GB VRAM, keep the entire model on the GPU. whisper.cpp does this by default when built with CUDA, so there is no layer count to tune; --n-gpu-layers is a llama.cpp option and does not apply to Whisper. Enable flash attention (--flash-attn) to further improve speed and memory efficiency. With 24GB VRAM, the 1.9GB model fits with ample headroom.
Troubleshooting
Out of memory errors during inference
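Out-of-memory errors on a 24GB card usually mean another process is holding VRAM, since the model itself needs only about 1.9GB. Check usage with nvidia-smi, close other GPU applications, or switch to a smaller quantization.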
Low transcription speed
Ensure that the CUDA toolkit is correctly installed and that your runtime was built with CUDA support. Run nvidia-smi while transcribing: if the process does not appear in the GPU process list, inference has fallen back to the CPU.
Inference fails with CUDA initialization errors
Update your NVIDIA drivers to the latest version (525.60.13 or later) and reinstall the CUDA toolkit.
Alternative runtimes
For users preferring different runtimes, consider the reference openai-whisper Python package for the simplest setup or faster-whisper for higher throughput. LM Studio, llama.cpp, and Jan target text-generation models rather than Whisper, so they are not drop-in alternatives here.
FAQ (20)
What GPU do I need to run Whisper Medium?
To run Whisper Medium, you need a GPU with at least 1.9 GB of VRAM. NVIDIA GPUs such as the GTX 1060 or higher are recommended for optimal performance.
Is Whisper Medium good for coding?
Whisper Medium is primarily designed for speech recognition and is not optimized for coding tasks. For coding, models like Codex or CodeLlama are more suitable.
Whisper Medium vs Llama 3.1 8B?
Whisper Medium has 0.77 billion parameters and is specialized for speech recognition, while Llama 3.1 8B has 8 billion parameters and is a general-purpose language model. Llama 3.1 8B is better for text generation but requires more resources.
Can I run Whisper Medium on a Mac?
Yes, you can run Whisper Medium on a Mac. Apple Silicon Macs share unified memory between the CPU and GPU, so make sure at least 1.9 GB is free; no separate GPU drivers are required.
How much VRAM does Whisper Medium need?
Whisper Medium requires at least 1.9 GB of VRAM to run efficiently. This can vary slightly depending on the quantization level used.
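To see how the quantization level shifts the requirement, the sketch below estimates the weights-only footprint from the 0.77 billion parameter count quoted in this FAQ. It assumes every tensor is stored in the named ggml format, which real model files do not strictly follow, and it excludes runtime buffers such as the KV cache and activations. That is why the result sits below the 1.9 GB figure. The weight_gb helper is hypothetical.

```python
PARAMS = 0.77e9  # parameter count quoted in this FAQ

# Approximate bytes per weight for common ggml formats.
# Q8_0 stores 32 int8 values plus one fp16 scale (34 bytes per 32 weights).
# Q5_0 stores 32 5-bit values plus one fp16 scale (22 bytes per 32 weights).
BYTES_PER_WEIGHT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q5_0": 22 / 32,
}


def weight_gb(fmt, params=PARAMS):
    """Weights-only footprint in GB (10^9 bytes) for the given format."""
    return params * BYTES_PER_WEIGHT[fmt] / 1e9


for fmt in BYTES_PER_WEIGHT:
    print(f"{fmt}: {weight_gb(fmt):.2f} GB")
```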
Is Whisper Medium censored?
Whisper Medium is not censored. It is an open-source model released under the MIT license, allowing for unrestricted use and modification.
Is Whisper Medium commercial-use allowed?
Yes. Whisper Medium is released under the MIT license, which permits commercial use provided the copyright and license notice are retained.
Whisper Medium context length?
Whisper Medium processes audio in fixed 30-second windows, and its text decoder has a context of 448 tokens. Longer audio is split into successive 30-second chunks, which most runtimes do automatically.
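Splitting longer audio can be sketched as follows, assuming Whisper's 30-second input window. The chunk_bounds helper is hypothetical and only computes time ranges; real pipelines often cut at silences rather than at fixed offsets so that words are not split.

```python
CHUNK_SECONDS = 30.0  # Whisper's fixed input window


def chunk_bounds(duration, chunk=CHUNK_SECONDS):
    """Return ordered (start, end) times in seconds that cover `duration`."""
    if duration <= 0:
        return []
    bounds = []
    start = 0.0
    while start < duration:
        # The final chunk is shortened so it never runs past the audio.
        end = min(start + chunk, duration)
        bounds.append((start, end))
        start = end
    return bounds


for start, end in chunk_bounds(75.0):
    print(f"{start:5.1f}s to {end:5.1f}s")
```

A 75-second recording yields three ranges, the last one 15 seconds long; each range can then be passed to the transcriber in turn.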