Alibaba · llm
Qwen3 8B Base
Official Qwen3 8B foundation model — pretrained only, no RLHF or refusal training. The 'naturally uncensored' option: no abliteration needed because alignment was never applied. Apache 2.0.
8B params · qwen3 · apache-2.0 · 32K ctx · 5.3–16.5 GB VRAM
about·model card

Qwen3 8B Base is an 8-billion-parameter language model developed by Alibaba, sized for efficient local deployment. As a pretrained-only model, it continues text rather than following instructions, so it suits content generation, summarization, and chatbot prototypes driven by few-shot prompts. Its context length of 32,768 tokens lets it handle long-form inputs and outputs where context must be maintained over extended passages. The model is licensed under Apache-2.0, which permits both commercial and non-commercial use.

In its size class, Qwen3 8B Base holds its own, balancing output quality against resource use. It does not need top-tier hardware, which makes it a practical choice for users with mid-range GPUs. Two variants are available: BF16 at full precision, and Q4_K_M, which cuts memory use to roughly a third of BF16. Together they cover hardware with 5.3 to 16.5 GB of VRAM. Ideal users are developers and enthusiasts who want text generation in their projects without the overhead of cloud services, keeping data on their own machine and avoiding network latency.

./quantizations·2 variants
Quantization  Bits  File Size  VRAM Needed  RAM Needed  Quality
BF16          16    16 GB      16.5 GB      17 GB       100%
Q4_K_M        4.5   4.8 GB     5.3 GB       5.8 GB      85%

Context window & KV cache

Adds 1.00 GB to VRAM

Long chats and RAG inputs cost real memory: the KV cache grows with context length, on top of the model weights.

Model native max: 32K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.
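The KV-cache cost can be approximated with simple arithmetic. The sketch below is a rough estimate, not the site's own calculator: it assumes a standard attention layout that stores one key and one value vector per layer, per KV head, per token. The layer count, KV-head count, and head size are assumed values for an 8B-class model with grouped-query attention; check the model's config before relying on them.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Approximate KV-cache size: keys + values for every layer and token."""
    # Factor 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Assumed shape (not taken from this page); 2 bytes per value means FP16/BF16 cache.
ASSUMED_SHAPE = dict(n_layers=36, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for ctx in (4096, 8192, 32768):
        gib = kv_cache_bytes(n_tokens=ctx, **ASSUMED_SHAPE) / 2**30
        print(f"{ctx:>6} tokens: {gib:.2f} GiB")
```

The cost is linear in context length, so filling the full 32K window costs eight times what a 4K chat does. Runtimes that quantize the cache or use a different attention layout will land elsewhere, which is consistent with the ±30 % caveat above.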

How to run Qwen3 8B Base

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

LM Studio (GUI): browse, download, chat. Uses MLX on Apple Silicon.

  1. Open LM Studio and go to the 🔍 Search tab.
  2. Search for bartowski/Qwen3-8B-Base-GGUF
  3. Download: pick the Q4_K_M quant, the best balance of size vs. quality.
  4. Chat: hit ▶ Load Model and start chatting. Toggle 'Local Server' to expose an OpenAI-compatible API on :1234.
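Once the Local Server is on, the model can be called from code. The following is a minimal sketch using only the Python standard library, assuming the server listens on localhost:1234 and exposes the OpenAI-style /v1/completions route. The model identifier is a placeholder, not a confirmed name; use whatever identifier LM Studio shows for the loaded model. A plain completion request is used rather than a chat request because this is a base model.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # port taken from the Local Server step
MODEL_ID = "qwen3-8b-base"             # hypothetical; copy the real id from LM Studio


def build_completion_request(prompt, max_tokens=64):
    """Build (but do not send) an OpenAI-style text completion request."""
    payload = {"model": MODEL_ID, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, max_tokens=64):
    """Send the request and return the generated continuation text."""
    with urllib.request.urlopen(build_completion_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


if __name__ == "__main__":
    # Requires LM Studio running with a model loaded.
    print(complete("The three primary colors are"))
```

Base models respond best to prompts written as the start of a document they can continue, rather than as instructions.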


Self-host serving plan

Want to host Qwen3 8B Base for many users? Or run it on a card that's technically too small? The estimate below covers one user at the 5.3 GB (Q4_K_M) weight size.

VRAM needed: 6.5 GB (5.3 GB weights + 0.7 GB KV)
Aggregate tok/s: 31 (across 1 user)
Per-user tok/s: 31 (8 B dense)

✅ Fits in 24 GB VRAM with 17.5 GB headroom. Pure-GPU inference at full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.
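One way to read that scaling rule is as a formula: each doubling of concurrent users adds a fixed fraction of single-user throughput, up to a plateau. The sketch below encodes that reading. It is an interpretation of the sentence above, not the site's actual estimator, and the function and parameter names are made up for illustration.

```python
import math


def aggregate_tps(single_user_tps, users, gain_per_doubling=0.7, plateau_users=8):
    """Estimated total tokens/sec across all users under the sub-linear rule."""
    if users < 1:
        raise ValueError("users must be at least 1")
    # Beyond the plateau, extra users add no aggregate throughput.
    effective_users = min(users, plateau_users)
    return single_user_tps * (1 + gain_per_doubling * math.log2(effective_users))


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        total = aggregate_tps(31, n)
        print(f"{n:>2} users: {total:6.1f} tok/s total, {total / n:5.1f} per user")
```

Under this reading, per-user speed falls steadily as users are added, and past the plateau it falls in direct proportion to the user count.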


faq·common questions
how much VRAM do I need to run Qwen3 8B Base?

Qwen3 8B Base requires 5.3 GB VRAM minimum with Q4_K_M quantization. For full precision (BF16) you need 16.5 GB.

which quant should I pick?

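Q4_K_M is the best quality/VRAM balance: it is rated at 85% of BF16 quality in the table above at roughly 30% of the footprint (4.8 GB vs. 16 GB). Q8_0 is near-lossless if you have the headroom, but it is not one of the two variants listed on this page.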
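The choice can be expressed as a small lookup. This sketch uses the VRAM and quality figures from the quantization table on this page and picks the highest-quality variant that fits. The 1.0 GB KV-cache allowance is the page's own estimate, and the function name is illustrative.

```python
# (name, VRAM needed in GB, quality in %) from the table on this page.
QUANTS = [
    ("BF16", 16.5, 100),
    ("Q4_K_M", 5.3, 85),
]


def pick_quant(vram_gb, kv_cache_gb=1.0):
    """Return the highest-quality variant that fits, or None if nothing does."""
    fitting = [q for q in QUANTS if q[1] + kv_cache_gb <= vram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda q: q[2])[0]


if __name__ == "__main__":
    for vram in (6, 8, 12, 24):
        print(f"{vram:>2} GB VRAM -> {pick_quant(vram)}")
```

A result of None means the weights plus cache exceed VRAM, in which case the runtime would have to offload part of the model to system RAM at a speed cost.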

faq://ai-curated·20 entries
What GPU do I need to run Qwen3 8B Base?

To run Qwen3 8B Base, you need a GPU with at least 5.3 GB of VRAM for the Q4_K_M variant, or 16.5 GB for BF16. NVIDIA GPUs like the RTX 3060 or higher are recommended.

Is Qwen3 8B Base good for coding?

Qwen3 8B Base can generate and complete code, but as a pretrained-only model it continues text rather than following instructions, and it is not as specialized as models trained specifically for coding.

Qwen3 8B Base vs Llama 3.1 8B?

Llama 3.1 8B supports a longer context window (128K tokens) than Qwen3 8B Base (32,768 tokens). Qwen3 8B Base uses the Apache 2.0 license, which is more permissive for commercial use than the Llama 3.1 Community License.

Can I run Qwen3 8B Base on a Mac?

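Yes, you can run Qwen3 8B Base on a Mac with an M1 or later chip. Apple Silicon uses unified memory, so the model shares system RAM with the GPU: plan for at least 5.8 GB free for Q4_K_M or 17 GB for BF16. No Docker or separate GPU driver is needed; LM Studio runs the model through MLX on Apple Silicon.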

How much VRAM does Qwen3 8B Base need?

The VRAM requirement for Qwen3 8B Base ranges from 5.3 GB (Q4_K_M) to 16.5 GB (BF16). The Q4_K_M variant needs less VRAM at some cost in output quality: it is rated 85% here, against 100% for BF16.

Is Qwen3 8B Base censored?

No, Qwen3 8B Base is not censored. It is a foundation model without alignment or refusal training, allowing for more natural and uncensored responses.

Is Qwen3 8B Base commercial-use allowed?

Yes, Qwen3 8B Base is licensed under Apache 2.0, which allows commercial use, modification, and distribution, provided you retain the license text and attribution notices.

Qwen3 8B Base context length?

Qwen3 8B Base has a native context length of 32,768 tokens, enough for long documents and extended multi-turn sessions. Longer contexts use more VRAM for the KV cache.

Does Qwen3 8B Base support function calling?

Qwen3 8B Base has no built-in function calling, since it was never trained on a tool-use format. You can still get the behavior with custom integration code: prompt the model with examples of structured calls, then parse and execute its output yourself.
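A custom integration of this kind usually means prompting for a structured call and parsing it out of the completion. The sketch below shows only the parsing and dispatch half, assuming the model was prompted (for example with few-shot examples) to emit a JSON object with "name" and "arguments" keys. That JSON shape, the tool registry, and the get_weather tool are all hypothetical and not part of any Qwen or LM Studio API.

```python
import json


def get_weather(city):
    # Hypothetical tool for illustration; a real one would call a weather service.
    return f"(stub) weather for {city}"


TOOLS = {"get_weather": get_weather}


def dispatch_tool_call(completion):
    """Find the first JSON object naming a known tool and call that tool."""
    decoder = json.JSONDecoder()
    start = completion.find("{")
    while start != -1:
        try:
            call, _ = decoder.raw_decode(completion, start)
        except json.JSONDecodeError:
            call = None
        if isinstance(call, dict) and call.get("name") in TOOLS:
            return TOOLS[call["name"]](**call.get("arguments", {}))
        # Not a usable call at this brace; try the next one.
        start = completion.find("{", start + 1)
    return None


if __name__ == "__main__":
    text = 'Sure. {"name": "get_weather", "arguments": {"city": "Oslo"}}'
    print(dispatch_tool_call(text))
```

Base-model output is free-form, so production code would also validate argument names and types before calling anything.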

Qwen3 8B Base quantization options?

This page lists two variants of Qwen3 8B Base: BF16 (16-bit) and Q4_K_M (about 4.5 bits per weight). They let you trade VRAM usage against output quality.

Can Qwen3 8B Base run on CPU?

Qwen3 8B Base can run on a CPU, but it will be significantly slower compared to running on a GPU. A powerful multi-core CPU is recommended for better performance.

Qwen3 8B Base fine-tuning?

Qwen3 8B Base can be fine-tuned on your own data to improve its performance on specific tasks. Fine-tuning requires a dataset and a training environment, and it may take several hours to complete.

Qwen3 8B Base system requirements?

To run the Q4_K_M variant of Qwen3 8B Base, you need a system with at least 5.3 GB of VRAM and 5.8 GB of RAM; BF16 needs 16.5 GB of VRAM and 17 GB of RAM. A dedicated GPU is strongly recommended, since CPU-only inference is much slower.

Qwen3 8B Base performance benchmark?

No community benchmarks have been submitted for Qwen3 8B Base yet. The serving estimate on this page is about 31 tokens per second for a single user on a 24 GB GPU; real speeds vary with the quantization level and hardware configuration.

Qwen3 8B Base for RAG?

Qwen3 8B Base can be used for Retrieval-Augmented Generation (RAG) by integrating it with a retrieval system. This setup can enhance its ability to generate contextually relevant responses.

Qwen3 8B Base for agents?

Qwen3 8B Base can be used to power conversational agents and chatbots, providing them with natural language understanding and generation capabilities. However, you may need to add additional logic for task-specific functionalities.

Qwen3 8B Base for coding vs general?

Qwen3 8B Base is versatile and can handle both coding and general tasks, but it may not be as specialized in coding as models like Codex. For general tasks, it performs well due to its large context length and natural language capabilities.

Qwen3 8B Base vs ChatGPT?

Qwen3 8B Base is an open-weight model under Apache 2.0 that runs on your own hardware, while ChatGPT is a hosted, proprietary service. As a pretrained-only model, Qwen3 8B Base is not tuned for chat, and its 32,768-token context is not a clear advantage over hosted models.

Qwen3 8B Base download size?

The download size of Qwen3 8B Base depends on the quantization level. The BF16 file is approximately 16 GB, while the Q4_K_M file is about 4.8 GB.
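These sizes follow from parameter count times bits per weight. The sketch below is back-of-the-envelope arithmetic, assuming a round 8 billion parameters and ignoring file metadata and any tensors kept at higher precision, so it slightly undershoots the listed Q4_K_M size.

```python
def approx_file_size_gb(n_params, bits_per_weight):
    """Rough weight-file size in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 8e9  # assumed round figure from the "8B" label
    for name, bits in (("BF16", 16), ("Q4_K_M", 4.5)):
        print(f"{name}: ~{approx_file_size_gb(n_params, bits):.1f} GB")
```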

Best quant for Qwen3 8B Base?

The best quantization level for Qwen3 8B Base depends on your hardware. Of the two variants listed here, Q4_K_M is the practical default: it needs 5.3 GB of VRAM and is rated at 85% quality. Choose BF16 only if you have 16.5 GB of VRAM to spare and want full precision.