Google Launches Gemma 4: Smaller, Faster, and More Capable
Google DeepMind has released Gemma 4, the next generation of its open-weight model family. Available in four sizes from 2B to 27B parameters, Gemma 4 brings meaningful improvements to instruction following, multilingual performance, and a new native tool-use capability that lets the model call functions and APIs in a structured format.
Architecture improvements
Gemma 4 uses a refined transformer architecture with sliding-window attention and improved positional encoding. The 9B model now supports a 64K-token context window, up from 8K in Gemma 2. All models use a new tokenizer with a 384K-token vocabulary that significantly improves handling of non-English languages, code, and mathematical notation.
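The core idea of sliding-window attention is that each token attends only to a fixed number of recent tokens rather than the full history, keeping attention cost linear in window size. The sketch below illustrates the mask shape with a toy window; the window value is purely illustrative, not Gemma 4's actual configuration.

```python
# Illustrative sliding-window attention mask. The window size here is a
# toy value for demonstration, not Gemma 4's real hyperparameter.
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """True where query position q may attend to key position k:
    causal (k <= q) and within the last `window` positions."""
    return [[q - window < k <= q for k in range(seq_len)]
            for q in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=3)
# Each row is a query position, each column a key position; every row
# has at most `window` True entries, so attention cost stays bounded.
```

Interleaving such windowed layers with occasional full-attention layers (a common pattern in this model family) preserves long-range information flow while most layers stay cheap.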
Performance benchmarks
The 9B instruction-tuned model is the standout performer in its class. It scores within two points of Llama 3.3 70B on several reasoning benchmarks at roughly one-eighth the parameter count. The 27B model competes with models twice its size on coding tasks and delivers strong results on the new GPQA Diamond evaluation. For edge deployment, the 2B model runs at over 60 tokens per second on an M3 MacBook Air.
Tool use and structured output
A notable addition is built-in tool-use support. Gemma 4 models can generate structured function calls, parse tool results, and incorporate them into responses without additional fine-tuning. This makes them particularly useful for building AI agents that interact with external services, databases, or APIs.
Running Gemma 4 locally
GGUF quantizations are available from day one. The 4B Q4_K_M variant needs just 3.5GB of VRAM, making it accessible to virtually any modern GPU or Apple Silicon Mac. The 9B Q5_K_M strikes an excellent balance at around 7GB, fitting on an RTX 3060 or M1 Pro with room for context. RunThisModel now includes all Gemma 4 sizes with full compatibility grading.
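The VRAM figures above can be sanity-checked with back-of-envelope arithmetic: quantized weight size is roughly parameters times bits-per-weight, plus some headroom for the KV cache and runtime buffers. The bits-per-weight values below are rough community approximations for these quant types, not official numbers, and the flat 1GB overhead is a simplifying assumption.

```python
# Rough VRAM estimate for a quantized GGUF model. Bits-per-weight values
# are approximate community figures for these quant types; real usage
# also scales with context length, which a flat overhead term ignores.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}

def estimate_vram_gb(params_billion: float, quant: str,
                     overhead_gb: float = 1.0) -> float:
    weights_gb = params_billion * BITS_PER_WEIGHT[quant] / 8
    return round(weights_gb + overhead_gb, 1)

# A 4B model at Q4_K_M: 4 * 4.8/8 + 1 ≈ 3.4 GB, consistent with the
# ~3.5GB figure quoted above; 9B at Q5_K_M lands in the ~7GB range.
```

Estimates like this are useful for picking a quant that leaves room for your target context length on a given GPU.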