~/runthismodel
daemon okbuild 5a3c91d00:00:00Z
./models/browse/qwen3-235b-a22b
Alibaba · llm
Qwen3 235B-A22B
Flagship MoE — 235 B total parameters, 22 B active. Frontier quality but needs 80 GB+ VRAM to run.
235b paramsqwen3-moeapache-2.032K ctx144144 GB vramMoE
about·model card

Qwen3 235B-A22B is Alibaba flagship open-weights model. Real frontier quality, but the 144 GB VRAM bar at Q4 puts it firmly in the dual-H100 / multi-A100 territory. Not consumer-runnable yet, but listed for cloud-GPU planning.

probe://hardware·which quants fit your rig
we auto-detect via WebGL/WebGPU. select manually if your GPU isn't recognized.
./quantizations·1 variants
QuantizationBitsFile SizeVRAM NeededRAM NeededQuality
Q4_K_M4.5140 GB144 GB200 GB
85%

Context window & KV cache

Adds 3.00 GB to VRAM

Long chats and RAG inputs cost real memory. Drag to see how 32K vs 128K context shifts your grade.

Model native max: 32K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.

How to run Qwen3 235B-A22B

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

GUI. Browse → download → chat. MLX on Apple Silicon.

LM Studio home →
  1. 1

    Open LM Studio

    Go to the 🔍 Search tab.

  2. 2

    Search for

    unsloth/Qwen3-235B-A22B-GGUF
  3. 3

    Download

    Pick the Q4_K_M quant — best balance of size vs. quality.

  4. 4

    Chat

    Hit ▶ Load Model and start chatting. Toggle 'Local Server' to expose an OpenAI-compatible API on :1234.

Community benchmarks

Real tokens/sec reports from people running Qwen3 235B-A22B on actual hardware.

No community runs yet for this model. Be the first to submit your numbers.

Self-host serving plan

Want to host Qwen3 235B-A22Bfor many users? Or run it on a card that’s technically too small? Slide the knobs.

VRAM needed

148.3 GB

144.0 GB weights + 3.8 GB KV

Aggregate tok/s

3

across 1 user

Per-user tok/s

3

MoE active params

⚠ Will spill 124.3 GB of weights to system RAM (~5× slower per offloaded layer). Use llama.cpp’s --cpu-offload-gb or vLLM’s --swap-space.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.

See It In Action

Real model outputs generated via RunThisModel.com — watch responses stream in real time.

Llama 3.3 70B responding...

Outputs generated by real AI models via RunThisModel.com. Generation speed shown is from cloud inference. Local speeds vary by hardware — check your device.

faq·common questions
how much VRAM do I need to run Qwen3 235B-A22B?

Qwen3 235B-A22B requires 144 GB VRAM minimum with Q4_K_M quantization. For full precision you need 144 GB.

which quant should I pick?

Q4_K_M is the best quality/VRAM balance — ~92% of FP16 quality at ~25% the footprint. Q8_0 is near-lossless if you have the headroom.

faq://ai-curated·20 entries
What GPU do I need to run Qwen3 235B-A22B?

To run Qwen3 235B-A22B, you need a GPU with at least 144 GB of VRAM, such as multiple NVIDIA A100 or H100 GPUs in a multi-GPU setup.

Is Qwen3 235B-A22B good for coding?

Qwen3 235B-A22B is highly effective for coding tasks due to its large context length of 32,768 tokens and advanced language understanding capabilities.

Qwen3 235B-A22B vs Llama 3.1 8B?

Qwen3 235B-A22B has significantly more parameters (235B vs 8B) and a longer context length (32,768 vs typically 2,048), making it more powerful for complex tasks but requiring much more VRAM.

Can I run Qwen3 235B-A22B on a Mac?

Running Qwen3 235B-A22B on a Mac is challenging due to the high VRAM requirement. You would need a Mac with a powerful external GPU setup or consider cloud-based solutions.

How much VRAM does Qwen3 235B-A22B need?

Qwen3 235B-A22B requires 144 GB of VRAM, which can be achieved using multiple high-end GPUs like the NVIDIA A100 or H100.

Is Qwen3 235B-A22B censored?

Qwen3 235B-A22B is not inherently censored, but its responses can be filtered or moderated based on the implementation and usage policies set by the user or organization.

Is Qwen3 235B-A22B commercial-use allowed?

Yes, Qwen3 235B-A22B is licensed under the Apache-2.0 license, which allows for commercial use without additional restrictions.

Qwen3 235B-A22B context length?

Qwen3 235B-A22B has a context length of 32,768 tokens, allowing it to handle very long sequences of text effectively.

Does Qwen3 235B-A22B support function calling?

Qwen3 235B-A22B supports function calling, enabling it to interact with external systems and APIs for enhanced functionality.

Qwen3 235B-A22B quantization options?

Qwen3 235B-A22B can be quantized to reduce VRAM usage, but the exact quantization options and their impact on performance depend on the specific implementation and tools used.

Can Qwen3 235B-A22B run on CPU?

While Qwen3 235B-A22B can technically run on a CPU, it is extremely resource-intensive and impractical due to the high computational demands and long processing times.

Qwen3 235B-A22B fine-tuning?

Qwen3 235B-A22B can be fine-tuned for specific tasks, but this process requires significant computational resources and expertise due to its large size.

Qwen3 235B-A22B system requirements?

Qwen3 235B-A22B requires a system with at least 144 GB of VRAM, multiple high-end GPUs, and ample CPU and storage resources to handle the large model size and computational demands.

Qwen3 235B-A22B performance benchmark?

Performance benchmarks for Qwen3 235B-A22B show it can process around 100-150 tokens per second on high-end GPU setups, but this can vary based on the specific hardware and optimization techniques used.

Qwen3 235B-A22B for RAG?

Qwen3 235B-A22B is well-suited for Retrieval-Augmented Generation (RAG) tasks due to its large context length and ability to handle complex queries and large datasets.

Qwen3 235B-A22B for agents?

Qwen3 235B-A22B can be used to create sophisticated AI agents due to its advanced language capabilities and support for function calling, making it ideal for tasks requiring natural language interaction and decision-making.

Qwen3 235B-A22B for coding vs general?

Qwen3 235B-A22B excels in both coding and general language tasks, but its large context length and specialized training make it particularly strong for complex coding scenarios.

Qwen3 235B-A22B vs ChatGPT?

Qwen3 235B-A22B has more parameters (235B vs 175B for GPT-3) and a longer context length (32,768 vs 2,048), offering superior performance for complex tasks but requiring more VRAM.

Qwen3 235B-A22B download size?

The download size for Qwen3 235B-A22B is approximately 940 GB, reflecting its large model size and the need for substantial storage space.

Best quant for Qwen3 235B-A22B?

The best quantization option for Qwen3 235B-A22B depends on your specific use case and available hardware, but 8-bit quantization is often a good balance between performance and VRAM efficiency.