HuggingFace · llm
SmolLM2 360M
Compact 360M model. Good for basic tasks on very constrained devices.
0.36B params · smollm · apache-2.0 · 8K ctx · 0.75–0.86 GB vram
about·model card

SmolLM2 360M is a lightweight language model developed by Hugging Face for efficient text generation in a small footprint. At 360 million parameters, it targets applications such as simple chatbots, content drafting, and summarization. Its 8192-token context window lets it keep longer inputs in view, which helps with tasks that depend on earlier context.

For its size, SmolLM2 360M offers a useful balance between computational efficiency and output quality. The two quantizations listed here, Q4_K_M and Q8_0, reduce its footprint further, so it can run on hardware with limited resources such as older laptops or some Raspberry Pi setups. With a VRAM requirement of only 0.75 to 0.86 GB, it is usable without expensive hardware.

./quantizations·2 variants
Quantization   Bits   File Size   VRAM Needed   RAM Needed   Quality
Q4_K_M         4.5    0.252 GB    0.75 GB       1.25 GB      85%
Q8_0           8      0.36 GB     0.86 GB       1.36 GB      98%
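
As a rough illustration of how the table can be used, the sketch below picks the highest-quality quantization that fits a given amount of free VRAM. The figures are copied from the table; the `best_fit` helper and the choice to add the KV cache on top of the "VRAM Needed" column are assumptions made for this example, not something the page specifies.

```python
from typing import NamedTuple, Optional


class Quant(NamedTuple):
    name: str
    bits: float
    file_gb: float
    vram_gb: float
    ram_gb: float
    quality_pct: int


# Values copied from the quantization table above.
QUANTS = [
    Quant("Q4_K_M", 4.5, 0.252, 0.75, 1.25, 85),
    Quant("Q8_0", 8.0, 0.36, 0.86, 1.36, 98),
]


def best_fit(free_vram_gb: float, kv_cache_gb: float = 0.0) -> Optional[Quant]:
    """Return the highest-quality quant that fits, or None if none does.

    Assumption: the KV cache is counted on top of the table's VRAM figure.
    """
    fitting = [q for q in QUANTS if q.vram_gb + kv_cache_gb <= free_vram_gb]
    return max(fitting, key=lambda q: q.quality_pct, default=None)
```

For example, `best_fit(0.8)` selects Q4_K_M, while `best_fit(1.0)` selects Q8_0.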

Context window & KV cache

Adds 0.13 GB to VRAM

Long chats and RAG inputs cost real memory.

Model native max: 8K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.
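
The KV-cache cost can be approximated with the usual formula: two tensors (keys and values) per layer, per KV head, per token. The sketch below is a minimal version of that arithmetic. The layer count, KV-head count, and head dimension in the usage note are illustrative assumptions, not values stated on this page; read the real ones from the model's `config.json`.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size in bytes for one sequence.

    The leading factor of 2 covers keys and values; bytes_per_value=2
    assumes an FP16 cache.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def with_margin(value: float, margin: float = 0.30) -> tuple:
    """Return the (low, high) band for the +/-30 % uncertainty noted above."""
    return value * (1 - margin), value * (1 + margin)
```

With the assumed shape `layers=32, kv_heads=5, head_dim=64`, `kv_cache_bytes(8192, 32, 5, 64)` comes to 335,544,320 bytes, roughly 0.34 GB at the full 8K window.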

How to run SmolLM2 360M

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

Easiest. Single command. Serves an API on :11434 (native /api/chat, plus an OpenAI-compatible endpoint under /v1).

  1. Pull the model

    ollama pull smollm2:360m

  2. Chat

    ollama run smollm2:360m

  3. Use as API

    curl http://localhost:11434/api/chat \
      -d '{"model":"smollm2:360m","messages":[{"role":"user","content":"Hi"}]}'
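
The same API call can be made from a script. The following is a minimal sketch using only the Python standard library; it assumes Ollama is running locally on the default port and that the model has already been pulled. The helper names `build_chat_request` and `chat` are made up for this example, and `"stream": false` is added so the server returns a single JSON object rather than a stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # same endpoint as the curl command
MODEL = "smollm2:360m"


def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build the POST request for a single-turn chat."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON response instead of streamed chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def chat(prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

With the server running, `print(chat("Hi"))` prints the model's reply.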

Self-host serving plan

Want to host SmolLM2 360M for many users? Or run it on a card that’s technically too small? The estimate below is for one user on a 24 GB GPU.

VRAM needed: 1.4 GB (0.8 GB weights + 0.1 GB KV)

Aggregate tok/s: 694 across 1 user

Per-user tok/s: 694 (0.36 B dense)

✅ Fits in 24 GB VRAM with 22.6 GB headroom. Pure-GPU inference at full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.
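
One way to read the scaling rule above is that each doubling of users adds 70 % of the single-user rate, and the total stops growing past 8 users. The sketch below implements that reading; the logarithmic interpolation between doublings and the function names are assumptions for illustration, not the page's actual estimator.

```python
import math


def aggregate_tps(single_user_tps: float, users: int,
                  gain_per_doubling: float = 0.70,
                  plateau_users: int = 8) -> float:
    """Estimated total tokens/sec across all concurrent users."""
    if users < 1:
        raise ValueError("users must be at least 1")
    # Past the plateau, extra users no longer raise aggregate throughput.
    effective = min(users, plateau_users)
    return single_user_tps * (1 + gain_per_doubling * math.log2(effective))


def per_user_tps(single_user_tps: float, users: int) -> float:
    """Each user's share of the aggregate estimate."""
    return aggregate_tps(single_user_tps, users) / users
```

Starting from the 694 tok/s single-user figure, this gives about 1,180 tok/s for 2 users and about 2,151 tok/s for 8 or more, so per-user speed falls steadily once the plateau is reached.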

faq·common questions
how much VRAM do I need to run SmolLM2 360M?

SmolLM2 360M requires 0.75 GB VRAM minimum with Q4_K_M quantization. With Q8_0 you need 0.86 GB.

which quant should I pick?

Q4_K_M has the smallest footprint (0.75 GB VRAM) and is rated 85% quality in the quantization table above. Q8_0 is near-lossless (98%) if you have the headroom for 0.86 GB.

faq://ai-curated·20 entries
What GPU do I need to run SmolLM2 360M?

To run SmolLM2 360M, you need a GPU with at least 0.75 GB of VRAM for Q4_K_M or 0.86 GB for Q8_0.

Is SmolLM2 360M good for coding?

SmolLM2 360M is suitable for basic coding tasks due to its compact size and efficiency, but it may not perform as well on complex or specialized coding challenges.

SmolLM2 360M vs Llama 3.1 8B?

SmolLM2 360M has 0.36B parameters, making it much smaller and more resource-efficient than Llama 3.1 8B, which has 8B parameters. SmolLM2 360M is better suited for constrained devices, while Llama 3.1 8B offers higher performance and capacity.

Can I run SmolLM2 360M on a Mac?

Yes. A Mac needs about 0.75 GB (Q4_K_M) to 0.86 GB (Q8_0) of memory available to the GPU; on Apple Silicon this comes out of unified memory.

How much VRAM does SmolLM2 360M need?

SmolLM2 360M requires between 0.8 GB and 0.9 GB of VRAM, depending on the quantization level used.

Is SmolLM2 360M censored?

This page does not document any content filtering for SmolLM2 360M. The Apache 2.0 license covers use and redistribution of the model only and imposes no content moderation policy.

Is SmolLM2 360M commercial-use allowed?

Yes. SmolLM2 360M is licensed under Apache 2.0, which permits commercial use provided you follow the license terms, such as keeping the license text and attribution notices with redistributed copies.

SmolLM2 360M context length?

SmolLM2 360M supports a context length of 8192 tokens, which is suitable for handling longer sequences of text.

Does SmolLM2 360M support function calling?

SmolLM2 360M does not natively support function calling, but you can integrate it with external tools or APIs to achieve this functionality.

SmolLM2 360M quantization options?

This page lists two quantizations: Q8_0 (8-bit) and Q4_K_M (about 4.5 bits per weight). Both reduce file size and VRAM usage, with Q4_K_M giving up more quality.

Can SmolLM2 360M run on CPU?

Yes, SmolLM2 360M can run on a CPU, although it will be slower compared to running on a GPU. It is suitable for devices with limited GPU resources.

SmolLM2 360M fine-tuning?

SmolLM2 360M can be fine-tuned for specific tasks using frameworks like Hugging Face's Transformers. Fine-tuning can improve its performance on domain-specific tasks.

SmolLM2 360M system requirements?

To run SmolLM2 360M, you need about 0.75 GB (Q4_K_M) to 0.86 GB (Q8_0) of VRAM, 4 GB of RAM, and a modern CPU. It is compatible with Windows, Linux, and macOS.

SmolLM2 360M performance benchmark?

No measured numbers are available on this page yet. Its serving-plan estimate is about 694 tokens per second for a single user on a 24 GB GPU; actual throughput depends on hardware.

SmolLM2 360M for RAG?

SmolLM2 360M can be used for Retrieval-Augmented Generation (RAG), but its smaller size may limit its effectiveness in handling complex retrieval tasks compared to larger models.
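
With an 8192-token window, a RAG pipeline has to decide how many retrieved chunks to include. The sketch below greedily packs chunks into that budget. The four-characters-per-token estimate, the 512 tokens reserved for the answer, and the helper names are all assumptions for illustration; use the model's tokenizer for real counts.

```python
from typing import List


def approx_tokens(text: str) -> int:
    """Crude token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def pack_chunks(chunks: List[str], question: str,
                context_limit: int = 8192,
                reserve_for_answer: int = 512) -> List[str]:
    """Keep chunks, in the order given, that fit the remaining budget.

    Chunks are assumed to arrive sorted by relevance, most relevant first.
    """
    budget = context_limit - reserve_for_answer - approx_tokens(question)
    picked = []
    for chunk in chunks:
        cost = approx_tokens(chunk)
        if cost > budget:
            continue  # skip chunks that no longer fit; a smaller one may
        picked.append(chunk)
        budget -= cost
    return picked
```

Skipping oversized chunks instead of stopping at the first miss lets smaller, lower-ranked chunks still use the leftover space.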

SmolLM2 360M for agents?

SmolLM2 360M is suitable for creating lightweight conversational agents on devices with limited resources, but it may not match the capabilities of larger models in terms of depth and nuance.

SmolLM2 360M for coding vs general?

SmolLM2 360M performs reasonably well for both coding and general tasks, but it may excel more in general tasks due to its broader training data. For advanced coding, consider larger models.

SmolLM2 360M vs ChatGPT?

SmolLM2 360M is much smaller and more resource-efficient than ChatGPT, which has billions of parameters. ChatGPT offers superior performance and context understanding, but SmolLM2 360M is ideal for devices with limited resources.

SmolLM2 360M download size?

The quantized files listed on this page are about 0.25 GB (Q4_K_M) and 0.36 GB (Q8_0).

Best quant for SmolLM2 360M?

The best quantization for SmolLM2 360M depends on your use case. Q8_0 (8-bit, rated 98% quality, 0.86 GB VRAM) offers a good balance between quality and resource usage, while Q4_K_M (about 4.5 bits, rated 85%, 0.75 GB VRAM) further reduces VRAM usage at the cost of some quality.