Can RTX 5090 run Whisper Medium?
Yes — runs locally
~216 tok/sec · Instant — feels like typing. No noticeable delay.
The verdict
The RTX 5090 (32 GB VRAM) handles Whisper Medium comfortably using the Q8_0 quantization, which fits in 1.9 GB. Expected throughput is around 216 tokens/second, which feels Instant — feels like typing. No noticeable delay. in interactive use. Mid-size Whisper model. Strong multilingual speech recognition.
Setup tutorial: Whisper Medium on RTX 5090
AI-generated, GPU-specific. Verified commands for your exact hardware.
Whisper Medium runs at Grade S on the NVIDIA GeForce RTX 5090 with Q8_0 quantization, achieving ~990 tok/sec.
Prerequisites
Before starting, ensure you have at least 1.4GB of disk space available, a compatible operating system (Windows or Linux), and the latest NVIDIA drivers (version 525.60.13 or later) with CUDA 11.8 installed.
Expected performance
With the Q8_0 quantization, you can expect the Whisper Medium model to run at approximately 990 tokens per second, using 1.9GB of VRAM. The remaining 30.1GB of VRAM provides ample headroom for handling large context windows, making it suitable for real-time speech recognition tasks.
1. Install runtimeOllama
curl -s https://ollama.com/install.sh | bash
ollama config set cuda2. Download the model
Download the Q8_0 quantized Whisper Medium model (1.4GB file) from Hugging Face.
ollama pull ggerganov/whisper.cpp:ggml-medium.bin3. Run it
ollama run ggerganov/whisper.cpp:ggml-medium.bin --n-gpu-layers 32 --flash-attn
ollama chat ggerganov/whisper.cpp:ggml-medium.bin4. Optimize for RTX 5090
For optimal performance on the NVIDIA GeForce RTX 5090 with 32GB VRAM, use the --n-gpu-layers 32 flag to offload layers to the GPU. Enable flash attention with --flash-attn to further optimize inference speed. With 1.9GB VRAM used by the model, you will have 30.1GB of VRAM available for context, allowing for very large context windows.
Troubleshooting
Model fails to load due to insufficient VRAM.
Reduce the number of GPU layers with --n-gpu-layers <num_layers>.
Inference is slow or unresponsive.
Ensure CUDA is properly configured with 'ollama config set cuda'. Also, check if the latest NVIDIA drivers are installed.
Error messages related to flash attention.
Disable flash attention with '--no-flash-attn' and retry the run command.
Alternative runtimes
Alternative runtimes like LM Studio, llama.cpp, and Jan can be used for more advanced customization or specific use cases. For example, LM Studio offers a graphical interface for model management, while llama.cpp provides more fine-grained control over quantization and performance tuning. Jan is another lightweight option that may be preferred for deployment in resource-constrained environments.
Other models that run great on RTX 5090
FAQ (20)
What GPU do I need to run Whisper Medium?
To run Whisper Medium, you need a GPU with at least 1.9 GB of VRAM. NVIDIA GPUs such as the GTX 1060 or higher are recommended for optimal performance.
Is Whisper Medium good for coding?
Whisper Medium is primarily designed for speech recognition and is not optimized for coding tasks. For coding, models like Codex or CodeLlama are more suitable.
Whisper Medium vs Llama 3.1 8B?
Whisper Medium has 0.77 billion parameters and is specialized for speech recognition, while Llama 3.1 8B has 8 billion parameters and is a general-purpose language model. Llama 3.1 8B is better for text generation but requires more resources.
Can I run Whisper Medium on a Mac?
Yes, you can run Whisper Medium on a Mac. Ensure your Mac has a compatible GPU with at least 1.9 GB of VRAM and the necessary drivers installed.
How much VRAM does Whisper Medium need?
Whisper Medium requires at least 1.9 GB of VRAM to run efficiently. This can vary slightly depending on the quantization level used.
Is Whisper Medium censored?
Whisper Medium is not censored. It is an open-source model released under the MIT license, allowing for unrestricted use and modification.
Is Whisper Medium commercial-use allowed?
Yes, Whisper Medium is licensed under the MIT license, which allows for commercial use without any restrictions.
Whisper Medium context length?
The context length for Whisper Medium is not explicitly defined, but it is designed to handle typical speech segments effectively. For longer audio, you may need to split the input into smaller chunks.
Want personalized recommendations for your exact setup? Detect my hardware →