~/runthismodel
Shanghai AI Lab · llm
InternLM 2.5 7B
Strong 7B model from China. Good at tool use and math.
7.7b params · internlm2 · apache-2.0 · 32K ctx · 4.89–8.16 GB vram
about·model card

InternLM 2.5 7B, developed by the Shanghai AI Lab, is a 7.7-billion-parameter language model aimed at local deployment. It generates coherent, contextually relevant text and suits content creation, chatbots, and natural language understanding tasks. Its architecture, internlm2, supports a context length of 32,768 tokens, enough for long conversations and sizeable RAG inputs.

Among models of similar size, InternLM 2.5 7B offers a good balance between computational requirements and output quality. Two quantizations are available, Q4_K_M and Q8_0. With Q4_K_M the model runs in as little as 4.9 GB of VRAM, which puts it within reach of mid-range GPUs. Users who need high-quality text generation on moderate hardware should consider it.

./quantizations·2 variants
Quantization  Bits  File Size  VRAM Needed  RAM Needed  Quality
Q4_K_M        4.5   4.389 GB   4.89 GB      5.39 GB     85%
Q8_0          8     7.659 GB   8.16 GB      8.66 GB     98%
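
To check which variant fits a given card, the table can be turned into a small lookup. This is a minimal sketch: the figures are copied from the table above, while the function name `quants_that_fit` and the idea of adding KV-cache memory on top of the "VRAM Needed" column are assumptions made for illustration, not something the page specifies.

```python
# Figures copied from the quantization table on this page (all in GB).
QUANTS = {
    "Q4_K_M": {"file": 4.389, "vram": 4.89, "ram": 5.39},
    "Q8_0": {"file": 7.659, "vram": 8.16, "ram": 8.66},
}


def quants_that_fit(gpu_vram_gb, kv_cache_gb=0.0):
    """Return the quant names whose VRAM need plus KV cache fits on the GPU.

    Treating the KV cache as an additive extra is an assumption of this sketch.
    """
    return [
        name
        for name, spec in QUANTS.items()
        if spec["vram"] + kv_cache_gb <= gpu_vram_gb
    ]
```

For example, `quants_that_fit(8.0, kv_cache_gb=1.0)` keeps only Q4_K_M, because Q8_0 plus 1 GB of cache exceeds 8 GB.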

Context window & KV cache

Adds 1.00 GB to VRAM

Long chats and RAG inputs cost real memory: the KV cache grows with context length, so a longer window raises the VRAM requirement.

Model native max: 32K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.
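
A rough way to see where a KV-cache figure comes from is to multiply it out. The sketch below uses the standard size formula for a grouped-query-attention cache. The architecture values (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries) are assumptions that do not appear on this page, so check them against the model's config before relying on the result.

```python
def kv_cache_bytes(context_tokens, layers=32, kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Estimate KV-cache size in bytes for a given context length.

    The default architecture values are assumptions, not taken from this page.
    The leading factor of 2 covers one key and one value tensor per layer.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens


def kv_cache_gib(context_tokens, **kwargs):
    """Same estimate, expressed in GiB."""
    return kv_cache_bytes(context_tokens, **kwargs) / 2**30
```

Under these assumed values, an 8,192-token window comes to 1 GiB and the full 32,768-token window to 4 GiB. That is the same order as the 1.00 GB shown above, though the page does not say which context length that figure corresponds to, and the stated ±30 % tolerance still applies.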

How to run InternLM 2.5 7B

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

Ollama: easiest, single command, OpenAI-compatible API on :11434.

  1. Pull the model

    ollama pull internlm2:7b

  2. Chat

    ollama run internlm2:7b

  3. Use as API

    curl http://localhost:11434/api/chat \
      -d '{"model":"internlm2:7b","messages":[{"role":"user","content":"Hi"}]}'
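
The same API call can be made from a script. The following is a minimal sketch using only the Python standard library. It posts to the same `/api/chat` endpoint as the curl command and sets `"stream": false` so that Ollama returns one JSON object instead of a stream of chunks. The helper names `build_chat_request` and `chat` are hypothetical, and the code assumes an Ollama server is already running locally with the model pulled.

```python
import json
import urllib.request

# Same endpoint and model tag as the curl example on this page.
OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_request(prompt, model="internlm2:7b"):
    """Build (but do not send) a non-streaming chat request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for a single JSON response
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, model="internlm2:7b"):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        body = json.load(resp)
    return body["message"]["content"]

# Usage (requires a running Ollama server):
#   print(chat("Hi"))
```

Keeping request construction separate from sending makes the payload easy to inspect without a server.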

Community benchmarks

Real tokens/sec reports from people running InternLM 2.5 7B on actual hardware.

No community runs yet for this model. Be the first to submit your numbers.

Self-host serving plan

Want to host InternLM 2.5 7B for many users? Or run it on a card that’s technically too small? Slide the knobs.

VRAM needed: 6.1 GB (4.9 GB weights + 0.7 GB KV)
Aggregate tok/s: 32 (across 1 user)
Per-user tok/s: 32
Model: 7.7 B dense

✅ Fits in 24 GB VRAM with 17.9 GB headroom. Pure-GPU inference — full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.
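
One way to read the scaling rule above is as a logarithmic curve: each doubling of users adds 70 % of the single-user rate, and nothing is gained past 8 users. The sketch below encodes that reading. The function name `aggregate_tps` and the exact curve shape are assumptions for illustration and may not match the estimator the page actually uses.

```python
import math


def aggregate_tps(single_user_tps, users, gain_per_doubling=0.7,
                  plateau_users=8):
    """Estimate total tokens/sec across concurrent users.

    Assumed model: each doubling of users adds `gain_per_doubling` times the
    single-user rate, and throughput stops growing at `plateau_users`.
    """
    if users < 1:
        raise ValueError("users must be at least 1")
    effective_users = min(users, plateau_users)
    return single_user_tps * (1 + gain_per_doubling * math.log2(effective_users))


def per_user_tps(single_user_tps, users, **kwargs):
    """Share of the aggregate rate that each user sees."""
    return aggregate_tps(single_user_tps, users, **kwargs) / users
```

With the page's single-user figure of 32 tok/s, `aggregate_tps(32, 1)` returns 32.0, matching the numbers shown above for one user.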

faq·common questions
how much VRAM do I need to run InternLM 2.5 7B?

InternLM 2.5 7B requires 4.89 GB VRAM minimum with Q4_K_M quantization. With Q8_0 quantization you need 8.16 GB.

which quant should I pick?

Q4_K_M is the best quality/VRAM balance: it scores 85% on the quality scale in the table above and needs 4.89 GB of VRAM, about 60% of what Q8_0 needs. Q8_0 (98%) is near-lossless if you have the headroom.

faq://ai-curated·20 entries
What GPU do I need to run InternLM 2.5 7B?

To run InternLM 2.5 7B, you need a GPU with at least 4.9 GB of VRAM for the lowest quantization level, up to 8.2 GB for the highest. NVIDIA GPUs like the RTX 3060 or higher are recommended.

Is InternLM 2.5 7B good for coding?

Yes, InternLM 2.5 7B is effective for coding tasks due to its strong performance in tool use and math, making it suitable for generating and understanding code.

InternLM 2.5 7B vs Llama 3.1 8B?

InternLM 2.5 7B has 7.7 billion parameters and excels in tool use and math, while Llama 3.1 8B has more parameters and may offer broader language understanding. Choose based on your specific needs.

Can I run InternLM 2.5 7B on a Mac?

Yes, you can run InternLM 2.5 7B on a Mac. Apple Silicon Macs share unified memory between the CPU and GPU, so make sure at least 4.9 GB is free for the model with Q4_K_M, on top of what the system and other apps use.

How much VRAM does InternLM 2.5 7B need?

InternLM 2.5 7B requires between 4.9 GB and 8.2 GB of VRAM, depending on the quantization level used.

Is InternLM 2.5 7B censored?

InternLM 2.5 7B is not inherently censored, but its responses can be moderated through configuration settings to filter out inappropriate content.

Is InternLM 2.5 7B commercial-use allowed?

Yes, InternLM 2.5 7B is licensed under Apache-2.0, which allows for commercial use as long as you comply with the license terms.

InternLM 2.5 7B context length?

InternLM 2.5 7B supports a context length of 32,768 tokens, allowing for long and complex inputs.

Does InternLM 2.5 7B support function calling?

Yes, InternLM 2.5 7B supports function calling, enabling it to interact with external tools and APIs effectively.

InternLM 2.5 7B quantization options?

This page lists two quantization options for InternLM 2.5 7B: Q4_K_M (4.5 bits) and Q8_0 (8 bits). The lower-bit variant saves memory, and the higher-bit variant preserves more quality.

Can InternLM 2.5 7B run on CPU?

While InternLM 2.5 7B can run on a CPU, it will be significantly slower compared to running on a GPU. Consider using a GPU for better performance.

InternLM 2.5 7B fine-tuning?

Yes, InternLM 2.5 7B can be fine-tuned on your own data to improve its performance on specific tasks or domains.

InternLM 2.5 7B system requirements?

To run InternLM 2.5 7B with Q4_K_M, you need about 4.9 GB of VRAM and 5.4 GB of RAM. With Q8_0, you need about 8.2 GB of VRAM and 8.7 GB of RAM. A GPU is strongly recommended.

InternLM 2.5 7B performance benchmark?

No community benchmarks have been submitted for InternLM 2.5 7B yet. The serving-plan estimate on this page is 32 tokens per second for a single user. Actual speed depends on the GPU, the quantization level, and batch size.

InternLM 2.5 7B for RAG?

Yes, InternLM 2.5 7B is suitable for Retrieval-Augmented Generation (RAG) tasks, leveraging its strong context handling and function calling capabilities.

InternLM 2.5 7B for agents?

InternLM 2.5 7B can be used to create intelligent agents due to its proficiency in tool use and math, making it ideal for tasks requiring interaction with external systems.

InternLM 2.5 7B for coding vs general?

InternLM 2.5 7B is particularly strong in coding tasks due to its tool use and math capabilities, but it also performs well in general language understanding and generation.

InternLM 2.5 7B vs ChatGPT?

InternLM 2.5 7B is a 7.7B parameter model with strong tool use and math capabilities, while ChatGPT is a larger, more general-purpose model. Choose based on your specific needs for task-specific performance or broad language understanding.

InternLM 2.5 7B download size?

The download size for InternLM 2.5 7B depends on the quantization level: 4.389 GB for Q4_K_M and 7.659 GB for Q8_0.

Best quant for InternLM 2.5 7B?

The best quantization level for InternLM 2.5 7B depends on your hardware. Q4_K_M is the more resource-efficient choice (4.89 GB VRAM, 85% quality score). Q8_0 stays closer to the original weights (98%) if you have 8.16 GB of VRAM to spare.