Gemma 4 Wiki
Track Gemma 4 model sizes, benchmarks, prompting, function calling, multimodal input, local deployment, and fine-tuning across the official Google ecosystem.

Latest Updates
Discover the newest guides, tips, and content
Gemma 2 vs Gemma 4: Ultimate AI Model Comparison Guide 2026
A comprehensive breakdown of Google's Gemma 2 vs Gemma 4 series, covering benchmarks, efficiency, and real-world performance for developers and gamers.
KoboldCPP Gemma 4: Optimization and Setup Guide 2026
Learn how to optimize KoboldCPP Gemma 4 for maximum performance. Explore Multi-Token Prediction, speculative decoding, and hardware requirements for 2026.
Gemma 3 vs Gemma 4 Google AI: Full Comparison & Dev Guide 2026
Explore the major differences in the Gemma 3 vs Gemma 4 Google AI showdown. Learn about MoE architecture, local performance, and game dev integration.
vLLM Gemma 4: Local AI Model Setup and Testing Guide 2026
Learn how to deploy Google's Gemma 4 models using vLLM. Explore benchmarks, model variants, and local hardware requirements for 2026.
Gemma 3 vs Gemma 4 Release: Full Comparison & Guide 2026
Explore the major differences in the Gemma 3 vs Gemma 4 release, including architecture shifts, Mixture of Experts, and local hardware requirements for 2026.
Gemma 3 vs Gemma 4 Differences: AI Model Comparison Guide 2026
Explore the key gemma 3 vs gemma 4 differences, including performance benchmarks, multimodal capabilities, and hardware requirements for 2026.
Gemma 31B Requirements: Best Hardware for Google’s Open Model 2026
Explore the essential Gemma 31B requirements for local deployment. Learn about VRAM needs, quantization impacts, and benchmarks for Google's latest dense model.
Gemma 12B 4-bit VRAM Requirement RTX 4070 12GB: Full Guide 2026
Analyze the Gemma 12B 4-bit VRAM requirement on an RTX 4070 12GB to optimize your local AI setup. Learn about quantization, context windows, and performance benchmarks.
Gemma 3 vs Gemma 4 Google: Full Comparison and Guide 2026
Explore the major differences in the Gemma 3 vs Gemma 4 Google debate. Learn about Mixture of Experts, local AI performance, and the new Apache 2.0 licensing.
Qwen 3.6 vs Gemma 4: Local AI Benchmark & Performance Guide 2026
A comprehensive comparison of Qwen 3.6 vs Gemma 4 for local AI enthusiasts. Discover which model wins in speed, tool calling, and hardware efficiency.
Gemma 4 4B Requirements: Full PC Hardware & Setup Guide 2026
Learn the exact Gemma 4 4B requirements to run Google's latest open AI model locally. Hardware specs, RAM needs, and GPU optimizations for 2026.
Gemma 4 2B: The Ultimate Local AI Guide for Developers 2026
Explore the capabilities of Google's Gemma 4 2B model. Learn about its agentic workflows, mobile efficiency, and how to implement it locally for gaming and apps.
Gemma 4 Reasoning: Advanced AI Agent & Logic Guide 2026
Explore the advanced gemma 4 reasoning capabilities. Learn about the 31B and 26B models, agentic workflows, and local AI performance for developers and gamers.
Gemma 4 KoboldCPP: Local AI Performance Guide 2026
Learn how to optimize Gemma 4 in KoboldCPP. Explore the 26B MoE architecture, hardware requirements, and how to manage the new thinking mode for peak performance.
Gemma 4 Jan AI: The Ultimate Local AI Coding Setup 2026
Learn how to set up Gemma 4 with Jan AI for a powerful, private, and free local AI environment. Guide includes benchmarks, setup steps, and coding integrations.
Gemma 4 31B VRAM: Hardware Requirements & Performance Guide 2026
Master the hardware requirements for Google's Gemma 4 31B. Learn about VRAM needs, quantization performance, and local gaming AI benchmarks for 2026.
Gemma 4 1B: Complete Guide to Google's Newest Lightweight AI 2026
Explore the capabilities of the Gemma 4 1B and E2B models. Learn about on-device performance, agentic workflows, and the massive benchmark jumps from Gemma 3.
Gemma 4 vs GPT-4o: The Ultimate Open-Source Comparison 2026
Explore the technical breakdown of Gemma 4 vs GPT-4o. Learn about Google's latest open-source model family, benchmarks, and hardware requirements for 2026.
Gemma 4 Token Limit: Complete Context Window Guide 2026
Explore the Gemma 4 token limit and context window capabilities. Learn how to optimize Google's latest open-source AI models for local performance and coding tasks.
Gemma 4 vLLM: Local AI Setup & Performance Guide 2026
Learn how to deploy Google's Gemma 4 models using vLLM. Explore the 26B MoE architecture, hardware requirements, and agentic performance for 2026.
Gemma 4 vs Phi: Ultimate AI Model Comparison Guide 2026
A deep dive into the battle of small language models. Compare Gemma 4 and Phi for coding, agentic workflows, and local performance in 2026.
Gemma 4 9B: Complete Guide to Google’s New Open Models 2026
Explore the full capabilities of the Gemma 4 9B and the entire Gemma 4 family. Learn about agentic workflows, local performance, and benchmark results.
Gemma 4 PT Model: The Ultimate Guide to Google’s Open AI 2026
Explore the power of the Gemma 4 PT model series. Learn about its agentic workflows, local performance, and how it revolutionizes AI for gamers and developers in 2026.
Gemma 4 Context Length: Full Technical Guide & Specs 2026
Explore the impressive Gemma 4 context length and model specifications. Learn how Google's 2026 open-source AI revolutionizes local processing for developers and gamers.
Gemma 4 SWE-bench: The Ultimate Open-Source AI Coding Guide 2026
Master Google's Gemma 4 series with our comprehensive guide. Explore SWE-bench performance, local installation tips, and agentic coding workflows for 2026.
Gemma 4 Jailbreak: Comprehensive Guide to AI Performance & Guardrails 2026
Explore the latest Gemma 4 31B benchmarks, coding capabilities, and guardrail testing. Learn about Gemma 4 jailbreak techniques and performance compared to Qwen 3.6.
Gemma 4 Model Size Parameters VRAM Requirements Local Inference 2026
A comprehensive guide to Gemma 4 model size parameters, VRAM requirements, and local inference benchmarks for 2026 hardware.
Gemma 4 Vision: Ultimate AI Integration Guide 2026
Master the new Gemma 4 Vision capabilities. Learn about the Apache 2.0 open-source models, agentic workflows, and multimodal reasoning for local hardware.
Gemma 4 26B A4B Ollama VRAM Requirements: Full Setup Guide 2026
Master the hardware needs for Google's Gemma 4 series. Learn the specific Gemma 4 26B A4B Ollama VRAM requirements and optimization tips for local AI performance.
Gemma 4 31B RAM Requirements: Full Hardware Guide 2026
Learn the exact Gemma 4 31B RAM requirements for local deployment. Compare quantization levels, VRAM needs, and hardware recommendations for Google's flagship model.
Gemma 4 Resources
Everything you need to get started with Gemma 4 — from local setup to API integration
Gemma 4 Tutorial
Gemma 4 launched on March 31, 2026 in four official sizes: E2B, E4B, 26B A4B, and 31B. The family is built for open-weight deployment under Apache 2.0, with smaller edge models aimed at mobile and laptop-class hardware and larger models aimed at desktops, workstations, and servers.
Understand the four official Gemma 4 sizes
Gemma 4 comes in E2B, E4B, 26B A4B, and 31B. E2B and E4B accept text, image, and audio input; 26B A4B and 31B accept text and image input and target larger local or server deployments.
Match the model to your hardware
Use E2B or E4B when you want mobile, edge, or laptop-friendly local inference. Use 26B A4B for a stronger general-purpose local model, and 31B when you want the largest official Gemma 4 checkpoint.
Choose a starting point
Gemma 4 26B A4B is a strong default when you want a powerful first experience. If you want the lightest starting point, begin with an instruction-tuned edge model and move up when your workload needs more capability.
Pick how you want to try it
Try hosted Gemma 4 through Google AI Studio and the Gemini API, or download open weights from Hugging Face or Kaggle for local use, tuning, and custom deployment.
Know what Gemma 4 is optimized for
The family is built for reasoning, coding, agentic workflows, and multimodal understanding. Edge models support 128K context, while 26B A4B and 31B support up to 256K context.
Quick Tips
- Instruction-tuned (-it) variants are best for chat and assistant use cases.
- E2B and E4B are the most hardware-accessible starting points for local experimentation.
- The 26B A4B is a Mixture-of-Experts model with faster effective inference than a dense model of similar total size.
- All Gemma 4 weights are released under the Apache 2.0 license.
Gemma 4 Ollama Setup
Ollama is one of the fastest ways to get Gemma 4 running on a laptop or workstation. The default Ollama flow is simple: install Ollama, pull Gemma 4, confirm the model list, choose the right tag for your hardware, and then run from the CLI or local API.
Install and verify Ollama
Download Ollama for Windows, macOS, or Linux, install it, and verify the setup with the command ollama --version.
Pull the default Gemma 4 variant
Use ollama pull gemma4 to download the default Gemma 4 package, then run ollama list to confirm it is available locally.
Choose the right model tag
Use gemma4:e2b for the lightest edge option, gemma4:e4b for a stronger edge default, gemma4:26b for the 26B A4B MoE workstation model, and gemma4:31b for the full large model.
Know what each tag expects
On the Ollama library page, e2b is listed at 7.2GB with 128K context, e4b at 9.6GB with 128K, 26b at 18GB with 256K, and 31b at 20GB with 256K.
Run your first prompt
For a first text test, run ollama run gemma4 "Hello, what can you do?". Ollama also supports image input with the prompt form shown in the official guide.
Use the local API for app integration
Ollama exposes a local web service at http://localhost:11434/api/generate, so you can move from CLI testing to a lightweight local application without setting up a separate model server.
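As a hedged example of that jump from CLI to app code, the sketch below calls the local endpoint with Python's requests library. It assumes you have already pulled a Gemma 4 tag from this guide (the plain gemma4 tag here); everything else follows Ollama's standard /api/generate request shape.

```python
import requests

# Minimal sketch: query the local Ollama API after `ollama pull gemma4`.
# Swap the model value for gemma4:e2b, gemma4:26b, etc. to match the tag
# you actually downloaded.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4",
        "prompt": "Summarize what Gemma 4 is optimized for in two sentences.",
        "stream": False,  # one JSON object back instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```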
Quick Tips
- E2B and E4B are the practical first picks for local experimentation on lighter hardware.
- The 26b tag targets the 26B A4B MoE model, which uses less active compute than a dense model of similar total size.
- ollama list shows all locally downloaded models and their sizes.
- Ollama supports image input: run a multimodal tag such as gemma4:e2b and include the image file path in the prompt.
Gemma 4 API Guide
The Gemini API provides hosted access to Gemma 4, useful when building without managing local inference. The hosted Gemma 4 models in AI Studio and the Gemini API are gemma-4-26b-a4b-it and gemma-4-31b-it.
Create an API key in Google AI Studio
Open Google AI Studio and create a Gemini API key. New users can start with a default Google Cloud project, while existing users can import a Cloud project and create keys there.
Set the key in your environment
The Gemini SDKs automatically pick up GEMINI_API_KEY or GOOGLE_API_KEY. If both are set, GOOGLE_API_KEY takes precedence.
Install the official SDK
For Python, install google-genai. For JavaScript and TypeScript, install @google/genai. Google also publishes SDK paths for Go, Java, C#, and Apps Script.
Choose the hosted Gemma 4 model ID
For hosted Gemma 4, use gemma-4-26b-a4b-it for a faster MoE large model, or gemma-4-31b-it for the flagship dense checkpoint.
Send a first generateContent request
The official example uses client.models.generate_content with the model field set to gemma-4-31b-it. In REST, requests go to the generateContent endpoint with the x-goog-api-key header.
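A minimal Python sketch of that request is shown below, assuming the google-genai SDK is installed and GEMINI_API_KEY (or GOOGLE_API_KEY) is set in the environment; the gemma-4-31b-it model ID is the hosted ID named in this guide.

```python
from google import genai

# The client picks up GEMINI_API_KEY / GOOGLE_API_KEY from the environment.
client = genai.Client()

response = client.models.generate_content(
    model="gemma-4-31b-it",  # hosted Gemma 4 model ID from this guide
    contents="Explain when to choose 26B A4B over 31B in one paragraph.",
)
print(response.text)
```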
Use AI Studio to bridge from testing to code
Google AI Studio lets you experiment with prompts, model settings, function calling, and structured output, then export working code through the Get code flow.
Quick Tips
- AI Studio is the fastest way to test Gemma 4 prompts before writing any code.
- The Gemini API supports streaming responses for chat and long-generation use cases.
- gemma-4-26b-a4b-it is the MoE model — generally faster and more cost-efficient than 31B.
- Function calling and structured output are available for both hosted Gemma 4 model IDs.
Gemma 4 Hugging Face Download
The official Google collection on Hugging Face includes eight core Gemma 4 checkpoints: E2B, E4B, 26B A4B, and 31B, each in base and instruction-tuned form. Instruction-tuned (-it) repositories are the natural starting point for chat, coding, and assistant experiences.
google/gemma-4-E2B-it
Edge checkpoint with text, image, and audio input and 128K context. Best for fast local assistants and on-device multimodal experimentation.
google/gemma-4-E4B-it
Stronger edge checkpoint with text, image, and audio input and 128K context. More capable than E2B without jumping to workstation-class hardware.
google/gemma-4-26B-A4B-it
Mixture-of-Experts checkpoint with 256K context and text-image input. Large-model quality with faster effective inference than a dense model of similar total size.
google/gemma-4-31B-it
Flagship dense Gemma 4 checkpoint with 256K context and text-image input. Best for the strongest chat, reasoning, coding, and agent workflows.
google/gemma-4-E2B
Base edge checkpoint for users who want to study, adapt, or fine-tune the smallest multimodal Gemma 4 model.
google/gemma-4-E4B
Base edge checkpoint that keeps text, image, and audio input while leaving downstream instruction behavior to your own tuning pipeline.
google/gemma-4-26B-A4B
Base MoE large checkpoint for custom adaptation where you want the 26B A4B architecture without default instruction-tuned behavior.
google/gemma-4-31B
Base 31B dense checkpoint for teams that want the largest official Gemma 4 foundation model before their own fine-tuning or alignment stage.
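If you want the weights locally before wiring up any inference code, a hedged sketch with huggingface_hub is shown below. The repo ID is one of the checkpoints listed above; like earlier Gemma releases, these repos may be gated, so accept the license on the model page and log in with huggingface-cli login first.

```python
from huggingface_hub import snapshot_download

# Download every file in the repo to the local Hugging Face cache and
# return the cache path. Swap the repo ID for any checkpoint listed above.
local_dir = snapshot_download(repo_id="google/gemma-4-E2B-it")
print("Checkpoint files downloaded to:", local_dir)
```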
Choose the Right Gemma 4 Size for Your Hardware
Gemma 4 ships in four sizes with very different trade-offs. The fastest choice is not always the smallest model, and the highest-quality choice is not always the easiest one to deploy.
Gemma 4 is available in two edge-first dense models, one efficient Mixture-of-Experts model, and one large dense model. For most teams, the real decision is not just quality, but where the model runs: phone, laptop, workstation, or server. A practical starting point is 26B A4B when you want strong quality without jumping all the way to 31B.
Gemma 4 E2B
Offline assistants, lightweight multimodal apps, edge deployment
Gemma 4 E4B
Stronger local copilots, on-device reasoning, multimodal apps with more headroom
Gemma 4 26B A4B
Best balance of quality, speed, and long-context work for most teams
Gemma 4 31B
Highest-end reasoning, coding, and multimodal quality in the Gemma 4 family
The Gemma 4 Specs That Actually Matter Before You Build
For most builders, the key questions are context length, modalities, language coverage, licensing, and app-level features. These are the specs that change implementation choices, hosting cost, and product scope.
Gemma 4 is not just a text model refresh. The family combines long context, multimodal input, thinking mode, native system prompts, and function-calling support in one open-weight lineup. The smaller models add audio input, while the larger models extend context to 256K for document-heavy and repository-scale workloads.
Release date: March 31, 2026
This is the current Gemma core generation and the one Google now highlights across docs and launch materials.
Input modalities: all models take text and image input and produce text output; E2B and E4B also accept audio input
You can build text-only, vision, and lightweight speech understanding flows without switching model families.
Context window: 128K tokens on E2B and E4B; 256K tokens on 26B A4B and 31B
Large prompts such as long documents, long chats, or multi-file code context fit in a single request.
Language coverage: over 140 languages
This matters for multilingual products, OCR, and globally deployed assistants.
License: Apache 2.0 with open weights and support for responsible commercial use
You can tune, deploy, and run Gemma 4 in your own stack with fewer licensing constraints.
App-level controls: configurable thinking mode, native system role support, structured JSON output, and function calling
These features make Gemma 4 much easier to use for agents, tool use, and instruction-heavy applications.
Vision token budgets: variable image resolutions with budgets of 70, 140, 280, 560, or 1120 tokens per image
You can trade image detail for speed depending on whether the task is OCR, UI reading, chart analysis, or fast frame processing.
Official Gemma 4 Benchmark Snapshot
These benchmarks show where each Gemma 4 size is strongest across reasoning, coding, science, vision, and long-context retrieval. Use them to shortlist a model quickly, then match that shortlist to your latency and memory budget.
Gemma 4 is positioned as a model family for reasoning, agentic workflows, coding, and multimodal understanding. The official benchmark tables show a clear pattern: 31B leads, 26B A4B stays surprisingly close while being much more efficient, and E4B and E2B bring meaningful capability to smaller devices.
MMLU Pro
Knowledge and reasoning
Best quick comparison for general high-level reasoning performance across the family.
AIME 2026 (no tools)
Math reasoning
31B and 26B A4B are the right targets for math-heavy assistants and planning tasks.
LiveCodeBench v6
Competitive coding
If coding is a primary use case, the larger two models are in a different tier from the edge models.
GPQA Diamond
Scientific reasoning
A strong signal for technical and expert-facing workflows.
MMMU Pro
Multimodal reasoning
Vision tasks benefit heavily from the larger models when accuracy matters more than footprint.
MRCR v2 (128K, 8-needle)
Long-context retrieval
For large-document and repository-scale prompting, 31B is the strongest long-context choice.
How to Fine-Tune Gemma 4 for Real Product Work
Fine-tuning matters when prompting alone is not enough and you want Gemma 4 to perform better on a specific domain, workflow, or role. The practical paths are lightweight adapter tuning for text tasks and multimodal adapter tuning for image-plus-text tasks.
The official Gemma tuning docs center on a simple rule: tune for a defined task, not for vague improvement. For many builders, QLoRA is the most realistic place to start because it keeps hardware requirements much lower than full-model tuning.
Start with a narrow tuning goal
Choose a task or role where you want the base model to perform better, such as customer support, text-to-SQL, or product description generation. Use fine-tuning when the task is specific and repeated.
Pick the tuning path
Use text tuning for instruction and generation tasks, or vision tuning when your dataset combines images and text. The text QLoRA guide demonstrates text-to-SQL; the vision QLoRA guide demonstrates image-plus-text product descriptions.
Choose a realistic framework
Gemma 4 supports Keras with LoRA, the Gemma library, Hugging Face-based workflows, GKE, and Vertex AI. Hugging Face plus TRL is the most direct path for many developers.
Match the workflow to your hardware
The official text QLoRA example is designed around a T4 16GB setup. The vision QLoRA guide calls for a BF16-capable GPU such as NVIDIA L4 or A100 with more than 16GB of memory.
Use QLoRA when efficiency matters
QLoRA keeps the base model quantized to 4-bit, freezes the original weights, and trains only the added LoRA adapters. This lowers memory usage while preserving strong task performance.
Prepare data in the right format
Build a dataset that directly matches the behavior you want, then format it for conversation-style training with TRL and SFTTrainer. The official text guide uses a large synthetic text-to-SQL dataset.
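Pulling the QLoRA and data-format steps together, the sketch below shows one way this can look with Transformers, PEFT, and TRL. It is a hedged outline, not the official notebook: the google/gemma-4-E2B-it repo ID comes from this page, train.jsonl stands in for your own conversation-formatted dataset, and the hyperparameters are placeholders sized for a single 16 GB-class GPU.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "google/gemma-4-E2B-it"  # instruction-tuned starting point

# The "Q" in QLoRA: load the frozen base model in 4-bit.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# The "LoRA" part: only these small adapter matrices are trained.
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    task_type="CAUSAL_LM", target_modules="all-linear",
)

# Placeholder dataset: each row carries a "messages" list in the chat
# format the instruction-tuned checkpoint already understands.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="gemma4-qlora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
    ),
)
trainer.train()
trainer.save_model()  # saves the LoRA adapter, not a merged full model
```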
Evaluate, compare, and deploy
After training, run inference checks against your base model, verify task gains, and then deploy the tuned model or adapter. Treat deployment format as an early decision because framework choice affects the output format you get.
Quick Tips
- Start with QLoRA and a T4-class GPU for text tasks — full fine-tuning is rarely needed for task adaptation.
- Format your dataset to mirror the instruction-tuned chat format that Gemma 4 already understands.
- Keep your eval set from the same distribution as your training data to get meaningful improvement signals.
- The 26B A4B MoE model activates only a fraction of its parameters at inference, but its total parameter count still affects checkpoint size during training.
- Use the Gemma 4 -it checkpoint as your starting point for instruction tasks rather than the pre-trained base.
Gemma 4 Prompt Guide
Gemma 4 introduces a new turn-based prompt format with native system instructions, multimodal placeholders, and built-in controls for thinking and tool use.
This guide turns the official Gemma 4 format into a practical prompt library. Structure every interaction as turns, use the system role for behavior and global rules, insert image or audio placeholders where needed, and only enable thinking or tool use when the task actually benefits from them.
Core chat skeleton
Gemma 4 uses native system, user, and model roles, wrapped in turn markers; the sketch after this list shows the same structure expressed through the chat template.
- Use system for global instructions
- Use user for the current request
- Use model as the generation start point
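A minimal sketch of that skeleton, assuming the google/gemma-4-E2B-it repo ID from this page, is to let the checkpoint's own chat template emit the turn markers rather than writing them by hand:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-E2B-it")

messages = [
    {"role": "system", "content": "You are a concise release-notes assistant."},
    {"role": "user", "content": "Summarize the Gemma 4 sizes in one line each."},
]

# add_generation_prompt=True appends the opening of the model turn, so
# generation starts exactly where the format expects the reply to begin.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # inspect the rendered system/user/model turn structure
```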
System prompt pattern
Put stable behavior rules in one system turn instead of repeating them every time.
- Good for style, scope, and output format
- Native system role support starts with Gemma 4
- Keep it concise and task-specific
Multimodal placeholders
Use placeholder tokens to indicate where image and audio embeddings should be inserted.
- Use <|image|> for images
- Use <|audio|> for audio
- The processor replaces placeholders with embeddings after tokenization
Thinking-ready prompt
Thinking mode is activated by placing <|think|> inside the system instruction.
- Enable it for reasoning-heavy tasks
- Keep it off for simple direct generation
- Use one system turn for both thinking and other global instructions
Tool-aware prompt structure
Tool declarations belong in the system turn, and tool calls and tool responses are handled with dedicated control tokens.
- Useful for APIs, search, calculators, and external data lookups
- Tool use is structured rather than simulated in plain text
- Reasoning and tool use can happen in the same turn
Gemma 4 Thinking Mode
Thinking mode lets Gemma 4 produce a reasoning channel before the final answer, and the processor can separate both parts for application use.
Thinking mode is best for tasks where the model benefits from intermediate reasoning before it answers: ambiguous questions, math, coding, tool planning, and multimodal analysis. In Gemma 4, you can enable it at the chat-template level, stream the reasoning live, and then split the output into a thinking block and a user-facing answer block.
Choose the right tasks
Use thinking mode when the request needs decomposition, comparison, planning, or careful interpretation rather than a short direct reply.
- Good fits: math, code debugging, structured decision-making, image-plus-text reasoning
- Less necessary for simple rewrites, short summaries, or straightforward facts
- Official examples cover both text-only and image-text workflows
Enable thinking in the chat template
With Hugging Face Transformers, set enable_thinking=True in apply_chat_template(). At the token level, Gemma 4 uses <|think|> in the system turn.
- E2B and E4B: thinking OFF uses a simple user-model flow; thinking ON adds a system turn with <|think|>
- 26B A4B and 31B: official templates include an empty thinking token when thinking is off to stabilize output
- Thinking is designed to be enabled at the conversation level
Generate and separate the result
The model can emit a reasoning channel first and the final answer after it. You can stream it with TextStreamer and split it with parse_response(), as in the sketch after this list.
- processor.parse_response() returns separated thinking and answer content
- This works for text prompts and image-text prompts
- The reasoning channel can also include tool calls when the turn becomes agentic
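As a hedged sketch of that flow: the snippet below enables thinking through the chat template, streams the generation, and then splits reasoning from answer. The repo ID comes from this page, and enable_thinking plus processor.parse_response() are used exactly as described above, so check the exact helper names against your transformers version.

```python
from transformers import AutoModelForCausalLM, AutoProcessor, TextStreamer

model_id = "google/gemma-4-26B-A4B-it"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", dtype="auto")

messages = [
    {"role": "user", "content": "A train leaves at 9:40 and arrives at 11:05. How long is the trip?"},
]
inputs = processor.apply_chat_template(
    messages,
    enable_thinking=True,        # adds the thinking control to the system turn
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

streamer = TextStreamer(processor.tokenizer, skip_prompt=True)  # watch reasoning live
output_ids = model.generate(**inputs, max_new_tokens=512, streamer=streamer)

# Separate the reasoning channel from the user-facing answer; the call
# follows this page's description of parse_response().
new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
print(processor.parse_response(new_tokens))
```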
Handle multi-turn chats correctly
For normal multi-turn conversations, strip the thoughts generated in previous turns before sending the history back. In tool-calling turns, keep the thought flow intact until the tool cycle finishes.
- Regular chat: remove prior thought blocks before the next turn
- Tool-use exception: do not remove thoughts between function calls inside the same turn
- This keeps context clean while preserving agentic behavior
Gemma 4 Function Calling
Gemma 4 supports native structured tool use, letting the model request functions instead of faking external actions in plain text.
Function calling is the practical bridge between model output and real application behavior. Instead of asking Gemma 4 to guess live data or simulate actions, you define tools, let the model generate a structured call, execute the function in your app, and then feed the result back so the model can finish with a clean natural-language answer.
Define tools clearly
Pass tools through apply_chat_template() using either a manual JSON schema or a raw Python function converted to schema.
- Manual JSON schema is best when you need precise nested parameters
- Raw Python functions are convenient for simple tools with clear type hints and docstrings
- Tool definitions should include name, description, parameter types, and required fields
Let the model request a tool
Gemma 4 receives the user prompt plus available tools and returns a structured function call object rather than plain text when a tool is needed.
- Tool use is controlled with dedicated tokens such as tool, tool_call, and tool_response
- A typical example is a weather or search function
- This is better than plain text when the answer depends on external state or system actions
Validate and execute in your app
Gemma 4 cannot execute code on its own. Your application must parse the function name and arguments, validate them, and run the real function safely.
- Always validate function names and arguments before execution
- Do not rely on generated code without safeguards
- For production systems, map tool names to approved handlers instead of dynamic execution
Return tool output for the final answer
Append the tool result back into the chat history, then let Gemma 4 produce the final user-facing response; the sketch after this list walks through the full loop.
- Official workflow: define tools, model turn, developer turn, final response
- This pattern works for APIs, live lookups, calculators, settings updates, and agent loops
- Tool responses should stay structured so the model can ground the final answer correctly
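The sketch below walks the full define, call, execute, respond loop in Python. It is deliberately schematic: the repo ID comes from this page, get_weather is a stand-in tool, and how the requested call and the tool response are encoded (tool_call tokens, message roles) depends on the checkpoint's chat template, so the parsing step is left as a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def get_weather(city: str) -> str:
    """Return the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    return f"Sunny, 21°C in {city}"  # stand-in for a real API call

model_id = "google/gemma-4-26B-A4B-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", dtype="auto")

messages = [{"role": "user", "content": "What is the weather in Berlin right now?"}]

# 1) Model turn: offer the tool and let the model request it.
inputs = tokenizer.apply_chat_template(
    messages, tools=[get_weather], add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
raw_call = tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:])

# 2) Developer turn: parse and validate the requested call, then run the
#    real function. Map tool names to approved handlers; never execute
#    arbitrary generated code.
tool_result = get_weather("Berlin")  # pretend we parsed {"city": "Berlin"} from raw_call

# 3) Feed the structured result back so the model can finish in plain language.
messages += [
    {"role": "assistant", "content": raw_call},
    {"role": "tool", "name": "get_weather", "content": tool_result},
]
inputs = tokenizer.apply_chat_template(
    messages, tools=[get_weather], add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
final = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(final[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```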
Gemma 4 Multimodal Guide
Gemma 4 handles text and image across all models, supports video as frames, and adds native audio support on E2B and E4B.
Gemma 4 is built for multimodal input. All models support image and video-style visual understanding, the small models add audio input, and the runtime lets you trade off visual detail against speed using token budgets. That makes Gemma 4 suitable for OCR, captioning, object detection, speech tasks, and mixed media prompts inside one chat flow.
Image understanding
All Gemma 4 models support text-plus-image workflows.
- Common tasks: OCR, object detection, visual question answering, image captioning
- Supports reasoning across multiple images in one prompt
- Best for screenshots, documents, product images, and scene analysis
Video understanding
All Gemma 4 models can process video as a sequence of frames.
- Good for scene description, human interaction, and situational summaries
- Video is passed as a content item in the messages array
- Maximum supported video length is 60 seconds at 1 frame per second
Audio understanding
Audio is available on the E2B and E4B models.
- Supports multilingual speech recognition, speech translation, and general speech understanding
- Audio token cost is 25 tokens per second
- Maximum audio length is 30 seconds
Visual token budgets
Gemma 4 introduces variable-resolution image processing so you can choose speed or detail based on the task.
- Supported image budgets: 70, 140, 280, 560, 1120 tokens
- Lower budgets for faster classification, captioning, and video frame analysis
- Higher budgets for OCR, document parsing, and reading small text
Input preparation rules
The processor handles much of the media formatting, but a few limits matter in production; the sketch after this list shows the audio preparation steps.
- Audio should be mono, 16 kHz, float32, normalized to [-1, 1]
- Image file support depends on the framework used to convert files into tensors
- Prompt quality still matters: specific instructions outperform vague multimodal requests
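A small preprocessing sketch for those audio rules is shown below. It assumes librosa and NumPy as the loading stack (the framework choice is yours) and a hypothetical voice_note.wav input; the resulting array is what you would hand to the processor.

```python
import numpy as np
import librosa

MAX_SECONDS = 30  # audio length limit noted above

# librosa handles the mono downmix and 16 kHz resampling and returns float32.
audio, sr = librosa.load("voice_note.wav", sr=16000, mono=True)

audio = audio[: MAX_SECONDS * sr]        # trim to the supported length
peak = np.max(np.abs(audio))
if peak > 1.0:                           # normalize only if outside [-1, 1]
    audio = audio / peak
audio = audio.astype(np.float32)
```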
Model capability split
Use the smallest models for mobile and speech-heavy use cases, and the larger models for heavier reasoning with long context.
- E2B and E4B: audio-enabled small models with 128K context
- 26B A4B and 31B: larger reasoning-focused models with 256K context
- All four official sizes available in base and instruction-tuned variants
Gemma 4 GGUF and Quantization
Choose the smallest Gemma 4 footprint that still fits your machine
For most local setups, the practical decision is whether to stay with E2B or E4B, or move up to a 26B A4B GGUF build. Google documents approximate memory needs for BF16, SFP8, and 4-bit-style deployment choices across all four official sizes.
Official local entry points
Google's Ollama guide exposes four Gemma 4 tags: gemma4:e2b, gemma4:e4b, gemma4:26b, and gemma4:31b. LM Studio also supports Gemma models in both GGUF and MLX formats for fully local inference.
Start with E2B or E4B for a lighter local loop, and move to 26B or 31B only when you have the RAM budget and want a stronger reasoning model.
Approximate memory by official size
Google lists approximate inference memory as E2B 9.6 GB BF16 / 3.2 GB Q4_0, E4B 15 GB / 5 GB, 26B A4B 48 GB / 15.6 GB, and 31B 58.3 GB / 17.4 GB.
If your target is a mainstream local machine, 4-bit-style deployment or a smaller model size is usually the line between runnable and impractical.
Official 26B A4B GGUF example
The official ggml-org Gemma 4 26B A4B IT GGUF page recommends llama-server for startup and lists Q4_K_M at 16.8 GB, Q8_0 at 26.9 GB, and F16 at 50.5 GB.
Q4_K_M is the most practical default when you want a large local Gemma 4 model but cannot afford Q8_0 or full 16-bit memory use.
What quantization changes
Higher parameter counts and higher precision are generally more capable, but they cost more processing cycles, memory, and power. Lower precision reduces those costs but can reduce capability.
Use quantization to fit the model to your hardware: smaller GGUF builds help you run locally, but they are a deployment compromise rather than a free upgrade.
Gemma 4 PyTorch Guide
Run Gemma 4 from a PyTorch-first stack
The fastest Python path for Gemma 4 is Hugging Face Transformers on top of PyTorch: install torch and transformers, pick a Gemma 4 model ID, and begin with pipeline-based text inference before moving into multimodal or tool-enabled workflows.
Install the runtime
Google's Gemma 4 text inference guide starts with torch, accelerate, and transformers, plus a dialog structure for conversation handling.
Pick an official Gemma 4 checkpoint
Google's Gemma 4 examples show four official instruction-tuned IDs: google/gemma-4-E2B-it, google/gemma-4-E4B-it, google/gemma-4-26B-A4B-it, and google/gemma-4-31B-it.
Start with text generation
Use transformers.pipeline with task="text-generation", device_map="auto", and dtype="auto" as the quickest way to get a first response.
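A quick-start sketch with those pipeline settings is shown below; the repo ID is one of the instruction-tuned IDs listed in the previous step, and the dtype argument follows the guide's wording, so adjust it if your transformers version still expects torch_dtype.

```python
from transformers import pipeline

generator = pipeline(
    task="text-generation",
    model="google/gemma-4-E2B-it",
    device_map="auto",
    dtype="auto",
)

messages = [{"role": "user", "content": "Give me three ideas for a local AI demo."}]
result = generator(messages, max_new_tokens=200)
# Chat-style input returns the full message list; the last entry is the reply.
print(result[0]["generated_text"][-1]["content"])
```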
Move to multimodal and tools when needed
For multimodal and function-calling workflows, use AutoProcessor and AutoModelForMultimodalLM with apply_chat_template for tool-aware prompting.
Use native PyTorch for deeper control
Google's PyTorch guide documents Kaggle credential setup, dependency installation, cloning gemma_pytorch, and loading multimodal model classes for experimentation with direct checkpoint control.
Gemma 4 Mobile Deployment
Put Gemma 4 on mobile through the current Android stack
Gemma 4 now has three practical mobile-facing paths: ML Kit Prompt API on AICore preview devices, Android Studio local-model workflows for developer-side usage, and LiteRT-LM for lower-level runtime control across mobile and embedded devices.
Choose the path that matches your goal
Use ML Kit Prompt API on AICore if you are building an Android app experience, Android Studio local models if you want offline coding help, and LiteRT-LM if you need lower-level runtime control.
Prototype on-device with AICore
Google's April 2026 preview lets you target Gemma 4 E2B or E4B through model preference settings inside the Prompt API flow on AICore-enabled devices.
Know the device expectations
Preview models run on AICore-enabled devices and the latest AI accelerators from Google, MediaTek, and Qualcomm. AI Edge Gallery is available for quick model checks on non-AICore devices.
Use Android Studio for developer-side workflows
Android Studio currently recommends Gemma 4 as its local model option. Gemma E4B requires 12 GB RAM and 4 GB storage; Gemma 26B MoE requires 24 GB RAM and 17 GB storage.
Switch to LiteRT-LM for deeper runtime control
LiteRT-LM is a cross-platform library for language model pipelines from phones to embedded systems, with CPU, GPU, and NPU paths including Qualcomm AI Engine Direct and MediaTek NeuroPilot.
Gemma 4 vs Gemma 3
See what actually changes when you move from Gemma 3 to Gemma 4
This comparison is for developers deciding whether to keep an existing Gemma 3 workflow or rebuild around Gemma 4. The clearest differences show up in context length, control format, multimodal scope, and benchmark performance at the top end of each family.
Release and core sizes
Gemma 4 trims the family around clearer deployment tiers: edge-first E-models plus larger workstation-class models.
Context window
For long documents, tool traces, or multi-step history, Gemma 4's larger models open significantly more headroom.
Multimodality
Gemma 4 is the broader multimodal family if your use case moves beyond image-text into video, OCR-heavy flows, or audio-capable edge models.
Prompt and control format
Teams building agents or structured workflows get a cleaner control surface in Gemma 4.
Top-end benchmark snapshot
If upgrading for reasoning, coding, or high-difficulty QA, the top-end Gemma 4 jump is large enough to justify a migration.
Deployment profile
Stay on Gemma 3 when small classic sizes already fit your stack; move to Gemma 4 when you want newer control features, larger-context top models, or stronger edge-oriented variants.