~/runthismodel
Sentence Transformers · embedding
all-MiniLM-L6-v2
Tiny embedding model. Only 23 MB. Perfect for on-device search.
0.023B params · BERT · Apache-2.0 · 256-token ctx · 0.1 GB VRAM
about·model card

The all-MiniLM-L6-v2 model, developed by Sentence Transformers, is a lightweight BERT-based model designed for feature extraction and embedding generation. With only 23 million parameters it is compact enough for resource-constrained environments. It produces sentence embeddings that can be used for natural language processing tasks such as semantic similarity, clustering, and classification. It accepts sequences of up to 256 tokens, which covers sentences and short paragraphs; longer text has to be split or truncated.

Despite its small size, all-MiniLM-L6-v2 offers a good balance between computational efficiency and embedding quality. It will not match larger embedding models on every task, but it is often good enough where low latency matters, such as search, content recommendation, and retrieval for chatbots. It is a practical choice for edge devices, CPU-only machines, and low-end GPUs: the listed VRAM requirement is about 0.1 GB, so it fits on almost any hardware.

./quantizations·1 variant
Quantization  Bits  File Size  VRAM Needed  RAM Needed  Quality
Q8_0          8     0.023 GB   0.1 GB       0.2 GB      92%

How to run all-MiniLM-L6-v2

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

Local embedding server with OpenAI-compat /v1/embeddings.

Ollama
  1. Pull

    ollama pull all-minilm
  2. Use

    curl http://localhost:11434/api/embed -d '{"model":"all-minilm","input":"hello world"}'
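
The same request can be made from a script. The following is a minimal sketch, assuming a local Ollama server on its default port as in the curl command above. The helper names (`build_payload`, `embed`, `cosine_similarity`) are illustrative and not part of any Ollama client library; only the standard library is used.

```python
import json
import math
import urllib.request

# Assumption: a local Ollama server on its default port, as in the curl example.
OLLAMA_EMBED_URL = "http://localhost:11434/api/embed"


def build_payload(texts, model="all-minilm"):
    """Return the JSON body for /api/embed; `input` may hold several strings."""
    return json.dumps({"model": model, "input": list(texts)}).encode("utf-8")


def embed(texts, model="all-minilm", url=OLLAMA_EMBED_URL):
    """POST the texts to Ollama and return one embedding (list of floats) per text."""
    request = urllib.request.Request(
        url,
        data=build_payload(texts, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # The response carries one vector per input string under "embeddings".
    return body["embeddings"]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

With the server running, `vectors = embed(["hello world", "hi there"])` followed by `cosine_similarity(vectors[0], vectors[1])` gives a similarity score for the two sentences.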

Community benchmarks

Real tokens/sec reports from people running all-MiniLM-L6-v2 on actual hardware.

No community runs yet for this model. Be the first to submit your numbers.

faq·common questions
how much VRAM do I need to run all-MiniLM-L6-v2?

all-MiniLM-L6-v2 needs about 0.1 GB of VRAM with Q8_0 quantization. The full-precision figure is also listed as 0.1 GB, so quantization makes little practical difference to VRAM at this size.
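
As a rough check on these figures, the size of the weights alone can be estimated as parameter count times bytes per weight. This is a back-of-the-envelope sketch, not how the page computes its numbers: it ignores per-block scale factors, activations, and runtime overhead, which is presumably why the listed VRAM figure is larger than the weights. The function name `weight_size_gb` is made up for illustration.

```python
def weight_size_gb(n_params, bits_per_weight):
    """Size of the raw weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 23_000_000  # "0.023b params" as listed on this page

for label, bits in [("8-bit", 8), ("16-bit", 16), ("32-bit", 32)]:
    print(f"{label}: {weight_size_gb(N_PARAMS, bits):.3f} GB")
# prints:
# 8-bit: 0.023 GB
# 16-bit: 0.046 GB
# 32-bit: 0.092 GB
```

The 8-bit estimate matches the 0.023 GB file size listed for Q8_0, and even 32-bit weights stay under 0.1 GB.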

which quant should I pick?

Only one variant is listed for this model: Q8_0 (8-bit), rated at 92% quality in the quantization table on this page. At a 0.023 GB file size there is little footprint left to save, so Q8_0 is the one to pick.

faq://ai-curated·20 entries
What GPU do I need to run all-MiniLM-L6-v2?

The all-MiniLM-L6-v2 model requires minimal VRAM, so any GPU with at least 0.1 GB of VRAM will suffice. It can even run efficiently on integrated GPUs.

Is all-MiniLM-L6-v2 good for coding?

all-MiniLM-L6-v2 is an embedding model, so it does not write or complete code. It can still be useful for code embeddings or semantic search within a codebase, where its small size and speed help.

all-MiniLM-L6-v2 vs Llama 3.1 8B?

The two models do different jobs. all-MiniLM-L6-v2 has only 23 million parameters and turns text into embeddings; Llama 3.1 8B has 8 billion parameters and generates text. Llama 3.1 8B offers far broader language understanding but requires significantly more resources.

Can I run all-MiniLM-L6-v2 on a Mac?

Yes, you can run all-MiniLM-L6-v2 on a Mac. The model's small size and low resource requirements make it compatible with most Mac hardware, including older models.

How much VRAM does all-MiniLM-L6-v2 need?

all-MiniLM-L6-v2 requires only 0.1 GB of VRAM, making it suitable for devices with limited graphics memory.

Is all-MiniLM-L6-v2 censored?

The question does not really apply. all-MiniLM-L6-v2 is a general-purpose embedding model: it outputs vectors, not text, so it has no refusal behavior or content filter, and it will embed any input you give it.

Is all-MiniLM-L6-v2 commercial-use allowed?

Yes, all-MiniLM-L6-v2 is licensed under Apache-2.0, which allows for commercial use as long as you comply with the license terms.

all-MiniLM-L6-v2 context length?

The context length for all-MiniLM-L6-v2 is 256 tokens, which is suitable for short text inputs like sentences or paragraphs.
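
Longer documents need to be split before embedding. The sketch below shows one simple way to do that with overlapping windows of whitespace-separated words. Word count is only a proxy for token count, since a single word can become several tokens, so the default of 150 words is an illustrative assumption chosen to stay under 256 tokens for typical English text; the function name `chunk_words` is also made up for this example.

```python
def chunk_words(text, max_words=150, overlap=20):
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Each chunk can then be embedded separately. For exact limits, count tokens with the model's own tokenizer instead of counting words.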

Does all-MiniLM-L6-v2 support function calling?

No, all-MiniLM-L6-v2 is an embedding model and does not support function calling. It is designed to generate embeddings for text inputs.

all-MiniLM-L6-v2 quantization options?

This page lists a single variant, Q8_0 (8-bit). In principle the model can also be quantized to 4-bit precision to reduce its memory footprint further, but no 4-bit variant is listed here.

Can all-MiniLM-L6-v2 run on CPU?

Yes, all-MiniLM-L6-v2 can run efficiently on a CPU. Its small size makes it suitable for devices without dedicated GPUs.

all-MiniLM-L6-v2 fine-tuning?

Yes, all-MiniLM-L6-v2 can be fine-tuned for specific tasks using labeled data. Fine-tuning can improve its performance on domain-specific tasks.

all-MiniLM-L6-v2 system requirements?

The system requirements for all-MiniLM-L6-v2 are minimal: at least 0.1 GB of VRAM, 23 MB of storage, and a modern CPU or GPU. It runs efficiently on most modern devices.

all-MiniLM-L6-v2 performance benchmark?

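Throughput depends heavily on hardware, batch size, and runtime. The rough figures quoted for all-MiniLM-L6-v2 are about 100 tokens per second on a mid-range CPU and up to 500 tokens per second on a mid-range GPU, but no community runs have been submitted yet, so treat them as unverified estimates.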
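
Because such figures vary with hardware, it is worth measuring on your own machine. The sketch below times a single call to any embedding function you pass in. Everything here is hypothetical scaffolding: `measure_throughput` is not part of any runtime, and counting whitespace-separated words is only a stand-in for real token counts.

```python
import time


def measure_throughput(embed_fn, texts, count_tokens=lambda t: len(t.split())):
    """Time one call to embed_fn(texts); return (tokens_per_second, seconds)."""
    total_tokens = sum(count_tokens(t) for t in texts)
    start = time.perf_counter()
    embed_fn(texts)
    # Guard against a zero reading from a very fast call or a coarse clock.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total_tokens / elapsed, elapsed
```

Pass in whatever function sends your texts to the runtime, and run it several times on a realistic batch, since the first call often includes model loading.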

all-MiniLM-L6-v2 for RAG?

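all-MiniLM-L6-v2 can serve as the retriever in a Retrieval-Augmented Generation (RAG) system: it embeds the document chunks at indexing time and the query at search time, and the closest chunks are passed to a separate generative model. Its 256-token limit means documents need to be split into short chunks first.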
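
The retrieval step can be sketched as a cosine-similarity ranking over precomputed vectors. This is a minimal brute-force illustration, assuming the embeddings already exist as lists of floats; a real system would usually use a vector index. The function names `cosine` and `top_k` are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k documents most similar to the query."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The returned indices point back into the list of chunks, whose text is then placed in the generative model's prompt.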

all-MiniLM-L6-v2 for agents?

Yes, all-MiniLM-L6-v2 can be used in agent-based systems to generate embeddings for natural language understanding tasks, making it suitable for lightweight conversational agents.

all-MiniLM-L6-v2 for coding vs general?

all-MiniLM-L6-v2 is versatile and can be used for both coding and general text processing tasks. However, for specialized coding tasks, models trained specifically on code may offer better performance.

all-MiniLM-L6-v2 vs ChatGPT?

all-MiniLM-L6-v2 is a much smaller embedding model compared to ChatGPT, which is a large language model. ChatGPT excels in generating human-like text, while all-MiniLM-L6-v2 is optimized for generating high-quality text embeddings with minimal resources.

all-MiniLM-L6-v2 download size?

The download size for all-MiniLM-L6-v2 is approximately 23 MB, making it easy to deploy on devices with limited storage.

Best quant for all-MiniLM-L6-v2?

Only Q8_0 (8-bit) is listed for this model, and it is the sensible default. In general, 8-bit quantization offers a good balance between quality and memory reduction, while 4-bit quantization reduces memory further at some cost in embedding quality. For a model this small the extra savings are negligible.
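
To make the trade-off concrete, here is a generic sketch of symmetric 8-bit quantization for one block of weights: each value is stored as a small integer plus a shared scale. It illustrates the idea behind block-wise 8-bit formats and does not reproduce the exact Q8_0 file layout. Both function names are made up for this example.

```python
def quantize_block(values):
    """Symmetric 8-bit quantization: return (scale, integers in [-127, 127])."""
    max_abs = max(abs(v) for v in values)
    # An all-zero block gets an arbitrary scale of 1.0 to avoid dividing by zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return scale, [round(v / scale) for v in values]


def dequantize_block(scale, quants):
    """Recover approximate floats from the stored integers and the shared scale."""
    return [q * scale for q in quants]
```

The reconstruction error for each value is at most half the scale. Using fewer bits makes the scale coarser, which is where the quality loss comes from.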