Qwen 3.5 Model Family Released: New SOTA for Open-Weight LLMs
Alibaba Cloud has officially released the Qwen 3.5 model family, marking a significant leap forward for open-weight language models. The lineup includes six sizes ranging from 0.5B to 72B parameters, all available under the Apache 2.0 license with GGUF quantizations ready for local inference.
What's new in Qwen 3.5
The headline improvement is reasoning performance. Qwen 3.5 72B now matches GPT-4o on the MMLU-Pro benchmark and surpasses it on several math and coding evaluations. The architecture introduces grouped-query attention (GQA) across all model sizes, which improves inference efficiency by roughly 15 percent compared to Qwen 2.5 at equivalent parameter counts. Context windows have been extended to 128K tokens for the 7B and larger models.
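The efficiency gain from grouped-query attention comes largely from shrinking the KV cache: several query heads share each key/value head, so far fewer K/V tensors are stored per token. A minimal sketch of the arithmetic, using illustrative dimensions (Qwen's actual layer counts and head configurations are not published in this article):

```python
# Rough sketch: why grouped-query attention (GQA) shrinks the KV cache.
# All dimensions below are illustrative assumptions, not Qwen 3.5's real config.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elt=2):
    """Key+value cache size for one sequence (fp16 elements by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

# Hypothetical 7B-class model: 32 layers, 32 query heads, head_dim 128.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8,  head_dim=128, seq_len=8192)

print(f"MHA cache: {mha / 2**30:.1f} GiB")  # every query head has its own K/V
print(f"GQA cache: {gqa / 2**30:.1f} GiB")  # 4 query heads share each K/V head
```

With these assumed numbers, sharing K/V heads 4-to-1 cuts the cache from 4.0 GiB to 1.0 GiB at an 8K context, which is why GQA matters most for long-context and batched inference.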
Local inference performance
For local users, the most exciting development is the 7B model. Qwen 3.5 7B in Q4_K_M quantization fits comfortably in 6GB of VRAM and delivers noticeably better output quality than its predecessor. The 14B variant requires about 10GB in Q4_K_M and offers a strong balance of quality and speed for users with mid-range GPUs like the RTX 4070 or RTX 3080.
Availability and compatibility
All Qwen 3.5 models are available on Hugging Face with official GGUF conversions. They work out of the box with Ollama, LM Studio, and llama.cpp. RunThisModel has already added hardware compatibility checks for the full Qwen 3.5 lineup, so you can verify whether your GPU can handle each size and quantization before downloading.
Impact on the open-source landscape
This release further narrows the gap between open-weight and proprietary models. Combined with recent releases from Meta and Google, the first quarter of 2026 has been remarkable for local AI. Users with consumer GPUs now have access to model quality that required expensive API subscriptions just a year ago.