Getting Started · 10 min read · Updated 2026-04-10

How to Run AI Models Locally: A Complete Beginner's Guide

Running AI models on your own computer gives you complete privacy, zero API costs, and the ability to use AI offline. This guide walks you through the entire process from checking your hardware to generating your first response.

Why run AI locally

When you use ChatGPT or Claude through a browser, your prompts travel to remote servers for processing. Running models locally means everything stays on your machine. Your conversations are never logged by a third party, you can use AI without an internet connection, and there are no usage limits or subscription fees. The tradeoff is that you need decent hardware, and local models are generally smaller and less capable than the largest cloud models. But for many tasks, a well-chosen local model performs more than adequately.

Check your hardware first

The most important specification for local AI is VRAM, the dedicated memory on your graphics card. VRAM determines which models and quantization levels you can run. As a rough guide, 4GB of VRAM lets you run small models up to about 3B parameters. 8GB of VRAM handles most 7B models comfortably. 16GB opens up 13B to 14B models. And 24GB or more lets you run 30B to 70B models depending on quantization. If you have an Apple Silicon Mac, your unified memory serves the same purpose as VRAM, and you can typically dedicate 60 to 75 percent of your total RAM to model loading. RunThisModel can detect your hardware automatically and show you exactly which models you can run.
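The rule of thumb above can be turned into a rough calculator. The figures used here (4.5 bits per weight for a typical Q4 quantization, plus 20 percent overhead for context and runtime buffers) are approximations, not exact numbers for any particular runtime:

```python
def vram_needed_gb(params_billion: float, bits_per_weight: float = 4.5,
                   overhead: float = 1.2) -> float:
    """Rough VRAM needed to run a model: quantized weights plus ~20%
    headroom for the KV cache and runtime buffers. Ballpark only."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1024**3

for size in (3, 7, 14, 32, 70):
    print(f"{size:>2}B at ~Q4: needs roughly {vram_needed_gb(size):.1f} GB")
```

Plugging in the sizes from the guide reproduces its tiers: a 7B model lands under 8 GB, a 14B model under 16 GB, and a 70B model well past 24 GB unless you quantize more aggressively.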

Option 1: Ollama (command line)

Ollama is the quickest way to start if you are comfortable in a terminal. It is a command-line tool that handles model downloading, quantization selection, and inference in a single package. Install it from ollama.ai, then open your terminal and type: ollama run llama3.2:3b. That single command downloads a 3 billion parameter Llama model and starts an interactive chat. Ollama automatically selects the best quantization for your hardware. You can also run it as a background server with: ollama serve. This exposes a local API on port 11434 that other applications can connect to. Ollama supports dozens of models. Type ollama list to see the models you have already downloaded, or visit the Ollama model library online for the full catalog.
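The local API that ollama serve exposes on port 11434 can be called from any language. Here is a minimal Python sketch using only the standard library; it assumes the server is running and that the model named below has already been pulled:

```python
import json
import urllib.request

# Default address of a running `ollama serve` instance.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    """JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    try:
        print(generate("llama3.2:3b", "Explain VRAM in one sentence."))
    except OSError:
        print("Could not reach Ollama -- is `ollama serve` running?")
```

Setting "stream" to False returns the whole response in one JSON object; leave it out (streaming is the default) if you want tokens as they are generated.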

Option 2: LM Studio (graphical app)

LM Studio provides a graphical desktop application for browsing, downloading, and running models. It looks similar to ChatGPT but runs entirely on your machine. After installing from lmstudio.ai, open the app and use the built-in model browser to search for models. LM Studio shows you which models fit your hardware and lets you download GGUF files directly. Once downloaded, click a model to load it and start chatting. LM Studio also includes a local server that mimics the OpenAI API, so any application that speaks the OpenAI format can use your local models instead. The settings panel lets you adjust parameters like temperature, context length, and GPU layer offloading.
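Because LM Studio's local server speaks the OpenAI chat-completions format, calling it looks different from the Ollama example: you send a list of role/content messages. The sketch below assumes the server is running on LM Studio's default port 1234 and that the model name matches one you have loaded (the name shown here is a placeholder):

```python
import json
import urllib.request

# LM Studio's local server defaults to port 1234 and mimics the OpenAI API.
LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

def build_payload(model: str, user_message: str) -> dict:
    """OpenAI-style chat body: a list of role/content messages."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": 0.7,
    }

def chat(model: str, user_message: str) -> str:
    data = json.dumps(build_payload(model, user_message)).encode("utf-8")
    req = urllib.request.Request(
        LMSTUDIO_URL, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

if __name__ == "__main__":
    try:
        print(chat("llama-3.2-3b-instruct", "Say hello in five words."))
    except OSError:
        print("Could not reach LM Studio -- is the local server running?")
```

The same payload shape works with any tool that supports the OpenAI API format, which is what makes this server feature useful for integrations.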

Option 3: GPT4All (simplest install)

GPT4All from Nomic AI is another desktop application with an even simpler setup process. Download the installer, run it, and pick a model from the built-in list. GPT4All focuses on ease of use and works well on lower-end hardware. It includes a local documents feature that lets you chat with your own files using retrieval-augmented generation. GPT4All is less configurable than LM Studio but is an excellent choice if you want the least friction getting started.

Choosing your first model

For a first model, we recommend starting with something in the 3B to 7B parameter range. Llama 3.2 3B Instruct is an excellent starting point. It runs on virtually any modern hardware, responds quickly, and handles general conversation, writing, and simple reasoning tasks well. If your hardware is more capable, jump to Qwen 2.5 7B Instruct or Gemma 3 4B IT for noticeably better quality. Avoid starting with the largest model your hardware can technically load. Loading a model that maxes out your VRAM leaves no room for context and results in painfully slow generation. Start one size smaller than your maximum and work up from there.

Tips for the best experience

Keep your context length reasonable. Most local models work best with context windows of 2048 to 4096 tokens. Larger contexts are supported but slow things down significantly. Use the Q4_K_M quantization as your default. It offers the best balance of quality and performance for most models. Close other GPU-intensive applications before running AI models. Games, video editors, and even some browsers with hardware acceleration compete for your VRAM. If a model generates text slowly, try a smaller model or a more aggressive quantization before giving up. Speed is highly dependent on both model size and your specific hardware configuration.
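To see why Q4_K_M is a sensible default, it helps to compare approximate download sizes across quantization levels. The bits-per-weight figures below are ballpark values for common GGUF quantizations and vary slightly from model to model:

```python
# Approximate bits per weight for common GGUF quantization levels.
# These are ballpark figures, not exact values for any given model.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def file_size_gb(params_billion: float, quant: str) -> float:
    """Rough GGUF file size for a model at a given quantization level."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1024**3

for quant in BITS_PER_WEIGHT:
    print(f"7B at {quant:7s}: ~{file_size_gb(7, quant):.1f} GB")
```

Q4_K_M cuts a 7B model to well under half its full-precision size while keeping quality close to Q8_0, which is why it sits at the sweet spot for most hardware.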

What to do next

Once you are comfortable with basic chat, explore these next steps. Try different models to find ones that suit your specific needs, whether that is creative writing, coding assistance, or factual Q&A. Experiment with system prompts to customize model behavior. Look into running models as local API servers to integrate them with other tools and workflows. And check RunThisModel regularly as we add new models and update compatibility information.