~/runthismodel
daemon okbuild 5a3c91d00:00:00Z
./models/browse/qwen3-30b-a3b
Alibaba · llm
Qwen3 30B-A3B
Mixture-of-Experts model with 30 B total parameters but only 3 B active per token. Runs at the speed of a 3 B model, with the knowledge of a 30 B. Sweet spot for 24 GB cards.
30.5b paramsqwen3-moeapache-2.032K ctx2036 GB vramMoE
about·model card

Qwen3 30B-A3B is the model that finally makes MoE practical for consumer hardware. Total memory footprint sits at 20 GB for Q4 — fits on a 24 GB RTX 3090/4090 — but inference speed lands around what you would expect from a 3 B model because only ~3.3 B parameters activate per token. The trade-off: if your VRAM is smaller than 20 GB you cannot run it at all, since all expert weights must be loaded simultaneously.

probe://hardware·which quants fit your rig
we auto-detect via WebGL/WebGPU. select manually if your GPU isn't recognized.
./quantizations·2 variants
QuantizationBitsFile SizeVRAM NeededRAM NeededQuality
Q4_K_M4.518 GB20 GB24 GB
85%
Q8_0832 GB36 GB40 GB
98%

Context window & KV cache

Adds 1.50 GB to VRAM

Long chats and RAG inputs cost real memory. Drag to see how 32K vs 128K context shifts your grade.

Model native max: 32K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.

How to run Qwen3 30B-A3B

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

GUI. Browse → download → chat. MLX on Apple Silicon.

LM Studio home →
  1. 1

    Open LM Studio

    Go to the 🔍 Search tab.

  2. 2

    Search for

    bartowski/Qwen3-30B-A3B-Instruct-GGUF
  3. 3

    Download

    Pick the Q4_K_M quant — best balance of size vs. quality.

  4. 4

    Chat

    Hit ▶ Load Model and start chatting. Toggle 'Local Server' to expose an OpenAI-compatible API on :1234.

Community benchmarks

Real tokens/sec reports from people running Qwen3 30B-A3B on actual hardware.

No community runs yet for this model. Be the first to submit your numbers.

Self-host serving plan

Want to host Qwen3 30B-A3Bfor many users? Or run it on a card that’s technically too small? Slide the knobs.

VRAM needed

21.9 GB

20.0 GB weights + 1.4 GB KV

Aggregate tok/s

76

across 1 user

Per-user tok/s

76

MoE active params

✅ Fits in 24 GB VRAM with 2.1 GB headroom. Pure-GPU inference — full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.

See It In Action

Real model outputs generated via RunThisModel.com — watch responses stream in real time.

Llama 3.3 70B responding...

Outputs generated by real AI models via RunThisModel.com. Generation speed shown is from cloud inference. Local speeds vary by hardware — check your device.

faq·common questions
how much VRAM do I need to run Qwen3 30B-A3B?

Qwen3 30B-A3B requires 20 GB VRAM minimum with Q4_K_M quantization. For full precision you need 36 GB.

which quant should I pick?

Q4_K_M is the best quality/VRAM balance — ~92% of FP16 quality at ~25% the footprint. Q8_0 is near-lossless if you have the headroom.

faq://ai-curated·20 entries
What GPU do I need to run Qwen3 30B-A3B?

To run Qwen3 30B-A3B, you need a GPU with at least 20 GB of VRAM, with 24 GB being the sweet spot for optimal performance.

Is Qwen3 30B-A3B good for coding?

Qwen3 30B-A3B is well-suited for coding tasks due to its large context length of 32,768 tokens, which allows it to understand and generate complex code snippets effectively.

Qwen3 30B-A3B vs Llama 3.1 8B?

Qwen3 30B-A3B has more parameters (30.5B vs 8B) and a longer context length (32,768 vs typically shorter), making it more powerful for complex tasks, though it requires more VRAM.

Can I run Qwen3 30B-A3B on a Mac?

Yes, you can run Qwen3 30B-A3B on a Mac, provided your Mac has a compatible GPU with at least 20 GB of VRAM, such as an eGPU or newer Macs with high-end GPUs.

How much VRAM does Qwen3 30B-A3B need?

Qwen3 30B-A3B requires between 20.0 GB and 36.0 GB of VRAM, depending on the quantization level used.

Is Qwen3 30B-A3B censored?

Qwen3 30B-A3B is not inherently censored, but it adheres to ethical guidelines and can be configured to filter content based on user preferences.

Is Qwen3 30B-A3B commercial-use allowed?

Yes, Qwen3 30B-A3B is licensed under the Apache-2.0 license, allowing for both personal and commercial use without restrictions.

Qwen3 30B-A3B context length?

Qwen3 30B-A3B has a context length of 32,768 tokens, which is significantly longer than many other models, enabling it to handle longer and more complex inputs.

Does Qwen3 30B-A3B support function calling?

Yes, Qwen3 30B-A3B supports function calling, allowing it to interact with external systems and APIs for enhanced functionality.

Qwen3 30B-A3B quantization options?

Qwen3 30B-A3B supports various quantization options, including 8-bit and 4-bit, which can reduce VRAM usage while maintaining performance.

Can Qwen3 30B-A3B run on CPU?

While Qwen3 30B-A3B can technically run on a CPU, it is highly inefficient and not recommended due to the model's size and computational demands.

Qwen3 30B-A3B fine-tuning?

Qwen3 30B-A3B can be fine-tuned for specific tasks, but this requires significant computational resources and expertise in training large language models.

Qwen3 30B-A3B system requirements?

Qwen3 30B-A3B requires a system with a GPU having at least 20 GB of VRAM, ample RAM (at least 32 GB), and a powerful CPU to handle the computational load.

Qwen3 30B-A3B performance benchmark?

Qwen3 30B-A3B runs at the speed of a 3B model due to its Mixture-of-Experts architecture, processing around 30-50 tokens per second on a 24 GB GPU.

Qwen3 30B-A3B for RAG?

Qwen3 30B-A3B is suitable for Retrieval-Augmented Generation (RAG) tasks, leveraging its large context length and ability to integrate external information effectively.

Qwen3 30B-A3B for agents?

Qwen3 30B-A3B can be used to power conversational agents and chatbots, providing them with a rich understanding of context and the ability to generate detailed responses.

Qwen3 30B-A3B for coding vs general?

Qwen3 30B-A3B excels in both coding and general tasks, but its large context length makes it particularly strong for handling complex code and technical documentation.

Qwen3 30B-A3B vs ChatGPT?

Qwen3 30B-A3B has more parameters (30.5B vs ChatGPT's 175B) but runs faster due to its Mixture-of-Experts design, making it more efficient for local deployment.

Qwen3 30B-A3B download size?

The download size for Qwen3 30B-A3B varies depending on the quantization level, but it generally ranges from 15 GB to 30 GB.

Best quant for Qwen3 30B-A3B?

The best quantization for Qwen3 30B-A3B depends on your VRAM and performance needs. 8-bit quantization is a good balance, reducing VRAM usage to around 24 GB while maintaining high performance.