~/runthismodel
daemon okbuild 5a3c91d00:00:00Z
./models/browse/olmoe-1b-7b
AI2 · llm
OLMoE 1B-7B
Fully open MoE — 7 B total, only 1.3 B active per token. Tiny footprint, surprisingly capable.
6.9b paramsolmoeapache-2.04K ctx4.427.35 GB vramMoE
about·model card

OLMoE from AI2 is the most accessible MoE on this list. 7 B total parameters means it fits on a 6 GB GPU at Q4, but only 1.3 B activate per token — so inference is fast even on modest hardware. Fully open: weights, training data, and recipes all released under Apache-2.0.

probe://hardware·which quants fit your rig
we auto-detect via WebGL/WebGPU. select manually if your GPU isn't recognized.
./quantizations·2 variants
QuantizationBitsFile SizeVRAM NeededRAM NeededQuality
Q4_K_M4.53.924 GB4.42 GB4.92 GB
85%
Q8_086.854 GB7.35 GB7.85 GB
98%

Context window & KV cache

Adds 0.50 GB to VRAM

Long chats and RAG inputs cost real memory. Drag to see how 32K vs 128K context shifts your grade.

Model native max: 4K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.

How to run OLMoE 1B-7B

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

GUI. Browse → download → chat. MLX on Apple Silicon.

LM Studio home →
  1. 1

    Open LM Studio

    Go to the 🔍 Search tab.

  2. 2

    Search for

    bartowski/OLMoE-1B-7B-0924-Instruct-GGUF
  3. 3

    Download

    Pick the Q4_K_M quant — best balance of size vs. quality.

  4. 4

    Chat

    Hit ▶ Load Model and start chatting. Toggle 'Local Server' to expose an OpenAI-compatible API on :1234.

Community benchmarks

Real tokens/sec reports from people running OLMoE 1B-7B on actual hardware.

No community runs yet for this model. Be the first to submit your numbers.

Self-host serving plan

Want to host OLMoE 1B-7Bfor many users? Or run it on a card that’s technically too small? Slide the knobs.

VRAM needed

5.6 GB

4.4 GB weights + 0.7 GB KV

Aggregate tok/s

192

across 1 user

Per-user tok/s

192

MoE active params

✅ Fits in 24 GB VRAM with 18.4 GB headroom. Pure-GPU inference — full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.

See It In Action

Real model outputs generated via RunThisModel.com — watch responses stream in real time.

Llama 3.3 70B responding...

Outputs generated by real AI models via RunThisModel.com. Generation speed shown is from cloud inference. Local speeds vary by hardware — check your device.

faq·common questions
how much VRAM do I need to run OLMoE 1B-7B?

OLMoE 1B-7B requires 4.42 GB VRAM minimum with Q4_K_M quantization. For full precision you need 7.35 GB.

which quant should I pick?

Q4_K_M is the best quality/VRAM balance — ~92% of FP16 quality at ~25% the footprint. Q8_0 is near-lossless if you have the headroom.

faq://ai-curated·20 entries
What GPU do I need to run OLMoE 1B-7B?

To run OLMoE 1B-7B, you need a GPU with at least 4.4 GB of VRAM for the smallest quantized version, up to 7.3 GB for the full model.

Is OLMoE 1B-7B good for coding?

OLMoE 1B-7B is versatile and can handle coding tasks well, though it may not be as specialized as models specifically trained for code generation.

OLMoE 1B-7B vs Llama 3.1 8B?

OLMoE 1B-7B has fewer parameters (6.9B) compared to Llama 3.1 8B, but it uses a more efficient MoE architecture, making it lighter and potentially faster in certain tasks.

Can I run OLMoE 1B-7B on a Mac?

Yes, you can run OLMoE 1B-7B on a Mac with an M1 or M2 chip, provided you have the necessary VRAM and system resources.

How much VRAM does OLMoE 1B-7B need?

The VRAM requirement for OLMoE 1B-7B ranges from 4.4 GB to 7.3 GB, depending on the quantization level used.

Is OLMoE 1B-7B censored?

OLMoE 1B-7B is not inherently censored, but its responses can be filtered or moderated using external tools to ensure appropriate content.

Is OLMoE 1B-7B commercial-use allowed?

Yes, OLMoE 1B-7B is licensed under Apache-2.0, which allows for commercial use without additional fees.

OLMoE 1B-7B context length?

OLMoE 1B-7B supports a context length of 4096 tokens, which is suitable for handling longer conversations and documents.

Does OLMoE 1B-7B support function calling?

OLMoE 1B-7B does not natively support function calling, but you can integrate it with external systems to achieve this functionality.

OLMoE 1B-7B quantization options?

OLMoE 1B-7B supports various quantization options, including 4-bit, 8-bit, and full precision, allowing you to balance between model size and performance.

Can OLMoE 1B-7B run on CPU?

While OLMoE 1B-7B can run on a CPU, it will be significantly slower compared to running on a GPU due to the model's size and complexity.

OLMoE 1B-7B fine-tuning?

OLMoE 1B-7B can be fine-tuned for specific tasks using frameworks like Hugging Face Transformers, but it requires substantial computational resources and data.

OLMoE 1B-7B system requirements?

To run OLMoE 1B-7B, you need a system with at least 16 GB of RAM, a modern CPU, and a GPU with 4.4 GB to 7.3 GB of VRAM, depending on the quantization level.

OLMoE 1B-7B performance benchmark?

Performance benchmarks for OLMoE 1B-7B vary, but it typically processes around 100-200 tokens per second on a high-end GPU, with lower speeds on less powerful hardware.

OLMoE 1B-7B for RAG?

OLMoE 1B-7B can be used for Retrieval-Augmented Generation (RAG), but you may need to integrate it with a retrieval system to fetch relevant documents.

OLMoE 1B-7B for agents?

OLMoE 1B-7B can be used to power conversational agents and chatbots, thanks to its ability to generate coherent and contextually relevant responses.

OLMoE 1B-7B for coding vs general?

OLMoE 1B-7B is generally capable in both coding and general tasks, but it may not perform as well as specialized models in either domain.

OLMoE 1B-7B vs ChatGPT?

OLMoE 1B-7B is smaller and more efficient than ChatGPT, but it may not match ChatGPT's performance in complex, multi-turn conversations.

OLMoE 1B-7B download size?

The download size for OLMoE 1B-7B varies based on quantization, ranging from approximately 2 GB for the 4-bit quantized version to 14 GB for the full precision model.

Best quant for OLMoE 1B-7B?

The best quantization for OLMoE 1B-7B depends on your hardware and performance needs. 8-bit quantization offers a good balance between model size and accuracy, while 4-bit is more lightweight but may sacrifice some performance.