Open Multimodal Model Family

Gemma 4 Wiki

Track Gemma 4 model sizes, benchmarks, prompting, function calling, multimodal input, local deployment, and fine-tuning across the official Google ecosystem.

Official Site
What's new in Gemma 4

Latest Updates

Discover the newest guides, tips, and content

Gemma 2 vs Gemma 4: Ultimate AI Model Comparison Guide 2026

A comprehensive breakdown of Google's Gemma 2 vs Gemma 4 series, covering benchmarks, efficiency, and real-world performance for developers and gamers.

Apr 19, 2026 · guide
Read more →
KoboldCPP Gemma 4: Optimization and Setup Guide 2026

Learn how to optimize KoboldCPP Gemma 4 for maximum performance. Explore Multi-Token Prediction, speculative decoding, and hardware requirements for 2026.

Apr 19, 2026 · guide
Read more →
gemma 3 vs gemma 4 google ai: Full Comparison & Dev Guide 2026

Explore the major differences in the gemma 3 vs gemma 4 google ai showdown. Learn about MoE architecture, local performance, and game dev integration.

Apr 19, 2026 · guide
Read more →
vLLM Gemma 4: Local AI Model Setup and Testing Guide 2026

Learn how to deploy Google's Gemma 4 models using vLLM. Explore benchmarks, model variants, and local hardware requirements for 2026.

Apr 19, 2026 · guide
Read more →
Gemma 3 vs Gemma 4 Release: Full Comparison & Guide 2026

Explore the major differences in the Gemma 3 vs Gemma 4 release, including architecture shifts, Mixture of Experts, and local hardware requirements for 2026.

Apr 19, 2026 · guide
Read more →
Gemma 3 vs Gemma 4 Differences: AI Model Comparison Guide 2026

Explore the key gemma 3 vs gemma 4 differences, including performance benchmarks, multimodal capabilities, and hardware requirements for 2026.

Apr 19, 2026 · guide
Read more →
Gemma 31B Requirements: Best Hardware for Google’s Open Model 2026

Explore the essential Gemma 31B requirements for local deployment. Learn about VRAM needs, quantization impacts, and benchmarks for Google's latest dense model.

Apr 19, 2026 · guide
Read more →
Gemma 12B 4-bit VRAM Requirement RTX 4070 12GB: Full Guide 2026

Analyze the gemma 12b 4-bit vram requirement rtx 4070 12gb to optimize your local AI setup. Learn about quantization, context windows, and performance benchmarks.

Apr 19, 2026 · guide
Read more →
Gemma 3 vs Gemma 4 Google: Full Comparison and Guide 2026

Explore the major differences in the Gemma 3 vs Gemma 4 Google debate. Learn about Mixture of Experts, local AI performance, and the new Apache 2.0 licensing.

Apr 19, 2026 · guide
Read more →
Qwen 3.6 vs Gemma 4: Local AI Benchmark & Performance Guide 2026

A comprehensive comparison of Qwen 3.6 vs Gemma 4 for local AI enthusiasts. Discover which model wins in speed, tool calling, and hardware efficiency.

Apr 19, 2026 · guide
Read more →
Gemma 4 4b Requirements: Full PC Hardware & Setup Guide 2026

Learn the exact Gemma 4 4b requirements to run Google's latest open AI model locally. Hardware specs, RAM needs, and GPU optimizations for 2026.

Apr 11, 2026 · requirements
Read more →
Gemma 4 2B: The Ultimate Local AI Guide for Developers 2026

Explore the capabilities of Google's Gemma 4 2B model. Learn about its agentic workflows, mobile efficiency, and how to implement it locally for gaming and apps.

Apr 11, 2026 · models
Read more →
Gemma 4 Reasoning: Advanced AI Agent & Logic Guide 2026

Explore the advanced gemma 4 reasoning capabilities. Learn about the 31B and 26B models, agentic workflows, and local AI performance for developers and gamers.

Apr 11, 2026 · benchmark
Read more →
Gemma 4 KoboldCPP: Local AI Performance Guide 2026

Learn how to optimize Gemma 4 in KoboldCPP. Explore the 26B MoE architecture, hardware requirements, and how to manage the new thinking mode for peak performance.

Apr 11, 2026 · install
Read more →
Gemma 4 Jan AI: The Ultimate Local AI Coding Setup 2026

Learn how to set up Gemma 4 with Jan AI for a powerful, private, and free local AI environment. Guide includes benchmarks, setup steps, and coding integrations.

Apr 11, 2026 · install
Read more →
Gemma 4 31B VRAM: Hardware Requirements & Performance Guide 2026

Master the hardware requirements for Google's Gemma 4 31B. Learn about VRAM needs, quantization performance, and local gaming AI benchmarks for 2026.

Apr 11, 2026 · requirements
Read more →
Gemma 4 1b: Complete Guide to Google's Newest Lightweight AI 2026

Explore the capabilities of the Gemma 4 1b and E2B models. Learn about on-device performance, agentic workflows, and the massive benchmark jumps from Gemma 3.

Apr 11, 2026 · models
Read more →
Gemma 4 vs GPT-4o: The Ultimate Open-Source Comparison 2026

Explore the technical breakdown of Gemma 4 vs GPT-4o. Learn about Google's latest open-source model family, benchmarks, and hardware requirements for 2026.

Apr 11, 2026 · comparison
Read more →
Gemma 4 Token Limit: Complete Context Window Guide 2026

Explore the Gemma 4 token limit and context window capabilities. Learn how to optimize Google's latest open-source AI models for local performance and coding tasks.

Apr 11, 2026 · requirements
Read more →
Gemma 4 vLLM: Local AI Setup & Performance Guide 2026

Learn how to deploy Google's Gemma 4 models using vLLM. Explore the 26B MoE architecture, hardware requirements, and agentic performance for 2026.

Apr 11, 2026 · install
Read more →
Gemma 4 vs Phi: Ultimate AI Model Comparison Guide 2026

A deep dive into the battle of small language models. Compare Gemma 4 and Phi for coding, agentic workflows, and local performance in 2026.

Apr 11, 2026 · comparison
Read more →
Gemma 4 9b: Complete Guide to Google’s New Open Models 2026

Explore the full capabilities of the Gemma 4 9b and the entire Gemma 4 family. Learn about agentic workflows, local performance, and benchmark results.

Apr 11, 2026 · models
Read more →
Gemma 4 PT Model: The Ultimate Guide to Google’s Open AI 2026

Explore the power of the Gemma 4 pt model series. Learn about its agentic workflows, local performance, and how it revolutionizes AI for gamers and developers in 2026.

Apr 11, 2026 · models
Read more →
Gemma 4 Context Length: Full Technical Guide & Specs 2026

Explore the impressive Gemma 4 context length and model specifications. Learn how Google's 2026 open-source AI revolutionizes local processing for developers and gamers.

Apr 11, 2026 · requirements
Read more →
Gemma 4 SWE-bench: The Ultimate Open-Source AI Coding Guide 2026

Master Google's Gemma 4 series with our comprehensive guide. Explore SWE-bench performance, local installation tips, and agentic coding workflows for 2026.

Apr 11, 2026 · benchmark
Read more →
Gemma 4 Jailbreak: Comprehensive Guide to AI Performance & Guardrails 2026

Explore the latest Gemma 4 31B benchmarks, coding capabilities, and guardrail testing. Learn about Gemma 4 jailbreak techniques and performance compared to Qwen 3.6.

Apr 9, 2026 · guide
Read more →
Gemma 4 Model Size Parameters VRAM Requirements Local Inference 2026

A comprehensive guide to Gemma 4 model size parameters, VRAM requirements, and local inference benchmarks for 2026 hardware.

Apr 9, 2026 · guide
Read more →
Gemma 4 Vision: Ultimate AI Integration Guide 2026

Master the new Gemma 4 Vision capabilities. Learn about the Apache 2.0 open-source models, agentic workflows, and multimodal reasoning for local hardware.

Apr 9, 2026 · guide
Read more →
Gemma 4 26B A4B Ollama VRAM Requirements: Full Setup Guide 2026

Master the hardware needs for Google's Gemma 4 series. Learn the specific Gemma 4 26b a4b ollama vram requirements and optimization tips for local AI performance.

Apr 9, 2026 · guide
Read more →
Gemma 4 31B RAM Requirements: Full Hardware Guide 2026

Learn the exact gemma 4 31b ram requirements for local deployment. Compare quantization levels, VRAM needs, and hardware recommendations for Google's flagship model.

Apr 9, 2026 · guide
Read more →

Gemma 4 Resources

Everything you need to get started with Gemma 4 — from local setup to API integration

Quick Start

Gemma 4 Tutorial

Gemma 4 launched on March 31, 2026 in four official sizes: E2B, E4B, 26B A4B, and 31B. The family is built for open-weight deployment under Apache 2.0, with smaller edge models aimed at mobile and laptop-class hardware and larger models aimed at desktops, workstations, and servers.

1

Understand the four official Gemma 4 sizes

Gemma 4 comes in E2B, E4B, 26B A4B, and 31B. E2B and E4B accept text, image, and audio input; 26B A4B and 31B accept text and image input and target larger local or server deployments.

2

Match the model to your hardware

Use E2B or E4B when you want mobile, edge, or laptop-friendly local inference. Use 26B A4B for a stronger general-purpose local model, and 31B when you want the largest official Gemma 4 checkpoint.

3

Choose a starting point

Gemma 4 26B A4B is a strong default when you want a powerful first experience. If you want the lightest starting point, begin with an instruction-tuned edge model and move up when your workload needs more capability.

4

Pick how you want to try it

Try hosted Gemma 4 through Google AI Studio and the Gemini API, or download open weights from Hugging Face or Kaggle for local use, tuning, and custom deployment.

5

Know what Gemma 4 is optimized for

The family is built for reasoning, coding, agentic workflows, and multimodal understanding. Edge models support 128K context, while 26B A4B and 31B support up to 256K context.

Quick Tips

  • Instruction-tuned (-it) variants are best for chat and assistant use cases.
  • E2B and E4B are the most hardware-accessible starting points for local experimentation.
  • The 26B A4B is a Mixture-of-Experts model with faster effective inference than a dense model of similar total size.
  • All Gemma 4 weights are released under the Apache 2.0 license.
Local Run

Gemma 4 Ollama Setup

Ollama is one of the fastest ways to get Gemma 4 running on a laptop or workstation. The default Ollama flow is simple: install Ollama, pull Gemma 4, confirm the model list, choose the right tag for your hardware, and then run from the CLI or local API.

1

Install and verify Ollama

Download Ollama for Windows, macOS, or Linux, install it, and verify the setup with the command ollama --version.

2

Pull the default Gemma 4 variant

Use ollama pull gemma4 to download the default Gemma 4 package, then run ollama list to confirm it is available locally.

3

Choose the right model tag

Use gemma4:e2b for the lightest edge option, gemma4:e4b for a stronger edge default, gemma4:26b for the 26B A4B MoE workstation model, and gemma4:31b for the full large model.

4

Know what each tag expects

On the Ollama library page, e2b is listed at 7.2GB with 128K context, e4b at 9.6GB with 128K, 26b at 18GB with 256K, and 31b at 20GB with 256K.

5

Run your first prompt

For a first text test, run ollama run gemma4 "Hello, what can you do?". Ollama also supports image input with the prompt form shown in the official guide.

6

Use the local API for app integration

Ollama exposes a local web service at http://localhost:11434/api/generate, so you can move from CLI testing to a lightweight local application without setting up a separate model server.
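The move from CLI to API described above can be sketched with nothing but the standard library. This is a minimal, hedged sketch: the `/api/generate` endpoint, its `model`/`prompt`/`stream` fields, and the `response` key follow Ollama's documented REST API, while the `gemma4` tag is the default from this guide.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_payload(model: str, prompt: str, stream: bool = False) -> dict:
    """Build the JSON body Ollama's /api/generate endpoint expects."""
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(model: str, prompt: str) -> str:
    """POST a prompt to the local Ollama server and return the response text."""
    body = json.dumps(build_generate_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama instance with the gemma4 tag pulled):
# print(generate("gemma4", "Hello, what can you do?"))
```

With `stream=False`, Ollama returns one JSON object per request, which keeps the first integration simple; switch to streaming once you need token-by-token output.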

Quick Tips

  • E2B and E4B are the practical first picks for local experimentation on lighter hardware.
  • The 26b tag targets the 26B A4B MoE model, which uses less active compute than a dense model of similar total size.
  • ollama list shows all locally downloaded models and their sizes.
  • Ollama supports image input by including a file path in the prompt, for example: ollama run gemma4:e2b "Describe this image: ./photo.png".
Hosted API

Gemma 4 API Guide

The Gemini API provides hosted access to Gemma 4, useful when building without managing local inference. The hosted Gemma 4 models in AI Studio and the Gemini API are gemma-4-26b-a4b-it and gemma-4-31b-it.

1

Create an API key in Google AI Studio

Open Google AI Studio and create a Gemini API key. New users can start with a default Google Cloud project, while existing users can import a Cloud project and create keys there.

2

Set the key in your environment

The Gemini SDKs automatically pick up GEMINI_API_KEY or GOOGLE_API_KEY. If both are set, GOOGLE_API_KEY takes precedence.

3

Install the official SDK

For Python, install google-genai. For JavaScript and TypeScript, install @google/genai. Google also publishes SDK paths for Go, Java, C#, and Apps Script.

4

Choose the hosted Gemma 4 model ID

For hosted Gemma 4, use gemma-4-26b-a4b-it for a faster MoE large model, or gemma-4-31b-it for the flagship dense checkpoint.

5

Send a first generateContent request

The official example uses client.models.generate_content with the model field set to gemma-4-31b-it. In REST, requests go to the generateContent endpoint with the x-goog-api-key header.
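The REST shape of that first request can be sketched as below. The endpoint pattern, `x-goog-api-key` header, and `contents`/`parts` body follow the Gemini API's generateContent REST convention; the `gemma-4-31b-it` model ID is taken from this guide, and the response-parsing path is an assumption based on the standard `candidates` structure.

```python
import json
import os
import urllib.request

API_ROOT = "https://generativelanguage.googleapis.com/v1beta"

def build_request(model: str, prompt: str, api_key: str):
    """Return (url, headers, body) for a generateContent REST call."""
    url = f"{API_ROOT}/models/{model}:generateContent"
    headers = {"x-goog-api-key": api_key, "Content-Type": "application/json"}
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return url, headers, body

def generate(model: str, prompt: str) -> str:
    """Send one prompt and return the first candidate's text."""
    url, headers, body = build_request(model, prompt, os.environ["GEMINI_API_KEY"])
    req = urllib.request.Request(url, data=json.dumps(body).encode(), headers=headers)
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    return data["candidates"][0]["content"]["parts"][0]["text"]

# Example (requires a valid GEMINI_API_KEY in the environment):
# print(generate("gemma-4-31b-it", "Explain Mixture of Experts in one paragraph."))
```

The official Python SDK (`google-genai`) wraps exactly this request, so once the REST call works you can swap in `client.models.generate_content` without changing your prompt logic.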

6

Use AI Studio to bridge from testing to code

Google AI Studio lets you experiment with prompts, model settings, function calling, and structured output, then export working code through the Get code flow.

Quick Tips

  • AI Studio is the fastest way to test Gemma 4 prompts before writing any code.
  • The Gemini API supports streaming responses for chat and long-generation use cases.
  • gemma-4-26b-a4b-it is the MoE model — generally faster and more cost-efficient than 31B.
  • Function calling and structured output are available for both hosted Gemma 4 model IDs.
Downloads

Gemma 4 Hugging Face Download

The official Google collection on Hugging Face includes eight core Gemma 4 checkpoints: E2B, E4B, 26B A4B, and 31B, each in base and instruction-tuned form. Instruction-tuned (-it) repositories are the natural starting point for chat, coding, and assistant experiences.

Instruction-tuned

google/gemma-4-E2B-it

Edge checkpoint with text, image, and audio input and 128K context. Best for fast local assistants and on-device multimodal experimentation.

Instruction-tuned

google/gemma-4-E4B-it

Stronger edge checkpoint with text, image, and audio input and 128K context. More capable than E2B without jumping to workstation-class hardware.

Instruction-tuned

google/gemma-4-26B-A4B-it

Mixture-of-Experts checkpoint with 256K context and text-image input. Large-model quality with faster effective inference than a dense model of similar total size.

Instruction-tuned

google/gemma-4-31B-it

Flagship dense Gemma 4 checkpoint with 256K context and text-image input. Best for the strongest chat, reasoning, coding, and agent workflows.

Pre-trained

google/gemma-4-E2B

Base edge checkpoint for users who want to study, adapt, or fine-tune the smallest multimodal Gemma 4 model.

Pre-trained

google/gemma-4-E4B

Base edge checkpoint that keeps text, image, and audio input while leaving downstream instruction behavior to your own tuning pipeline.

Pre-trained

google/gemma-4-26B-A4B

Base MoE large checkpoint for custom adaptation where you want the 26B A4B architecture without default instruction-tuned behavior.

Pre-trained

google/gemma-4-31B

Base 31B dense checkpoint for teams that want the largest official Gemma 4 foundation model before their own fine-tuning or alignment stage.

Model Comparison

Choose the Right Gemma 4 Size for Your Hardware

Gemma 4 ships in four sizes with very different trade-offs. The fastest choice is not always the smallest model, and the highest-quality choice is not always the easiest one to deploy.

Gemma 4 is available in two edge-first dense models, one efficient Mixture-of-Experts model, and one large dense model. For most teams, the real decision is not just quality, but where the model runs: phone, laptop, workstation, or server. A practical starting point is 26B A4B when you want strong quality without jumping all the way to 31B.

Gemma 4 E2B

Architecture: Dense
Parameters: 2.3B effective
Context: 128K tokens
Memory: 9.6 GB BF16 / 4.6 GB SFP8 / 3.2 GB Q4_0
Platform: Mobile devices

Offline assistants, lightweight multimodal apps, edge deployment

Gemma 4 E4B

Architecture: Dense
Parameters: 4.5B effective
Context: 128K tokens
Memory: 15 GB BF16 / 7.5 GB SFP8 / 5 GB Q4_0
Platform: Mobile and laptops

Stronger local copilots, on-device reasoning, multimodal apps with more headroom

Gemma 4 26B A4B

Architecture: MoE
Parameters: 25.2B total, 3.8B active
Context: 256K tokens
Memory: 48 GB BF16 / 25 GB SFP8 / 15.6 GB Q4_0
Platform: Desktop and small servers

Best balance of quality, speed, and long-context work for most teams

Gemma 4 31B

Architecture: Dense
Parameters: 30.7B
Context: 256K tokens
Memory: 58.3 GB BF16 / 30.4 GB SFP8 / 17.4 GB Q4_0
Platform: Large servers

Highest-end reasoning, coding, and multimodal quality in the Gemma 4 family

Core Specs

The Gemma 4 Specs That Actually Matter Before You Build

For most builders, the key questions are context length, modalities, language coverage, licensing, and app-level features. These are the specs that change implementation choices, hosting cost, and product scope.

Gemma 4 is not just a text model refresh. The family combines long context, multimodal input, thinking mode, native system prompts, and function-calling support in one open-weight lineup. The smaller models add audio input, while the larger models extend context to 256K for document-heavy and repository-scale workloads.

Release

March 31, 2026

This is the current Gemma core generation and the one Google now highlights across docs and launch materials.

Input and Output

All models: text and image → text; E2B and E4B also support audio input

You can build text-only, vision, and lightweight speech understanding flows without switching model families.

Maximum Context Window

128K tokens on E2B and E4B; 256K tokens on 26B A4B and 31B

Large prompts such as long documents, long chats, or multi-file code context fit in a single request.

Language Coverage

Over 140 languages

This matters for multilingual products, OCR, and globally deployed assistants.

License and Weights

Apache 2.0 license with open weights and support for responsible commercial use

You can tune, deploy, and run Gemma 4 in your own stack with fewer licensing constraints.

Reasoning and Control

Configurable thinking mode, native system role support, structured JSON output, and function calling

These features make Gemma 4 much easier to use for agents, tool use, and instruction-heavy applications.

Visual Handling

Variable image resolutions and token budgets of 70, 140, 280, 560, or 1120 tokens

You can trade image detail for speed depending on whether the task is OCR, UI reading, chart analysis, or fast frame processing.

Performance

Official Gemma 4 Benchmark Snapshot

These scores show where each Gemma 4 size is strongest across reasoning, coding, science, vision, and long-context retrieval. Use them to shortlist a model quickly, then match that shortlist to your latency and memory budget.

Gemma 4 is positioned as a model family for reasoning, agentic workflows, coding, and multimodal understanding. The official benchmark tables show a clear pattern: 31B leads, 26B A4B stays surprisingly close while being much more efficient, and E4B and E2B bring meaningful capability to smaller devices.

MMLU Pro

Knowledge and reasoning

31B: 85.2% · 26B A4B: 82.6% · E4B: 69.4% · E2B: 60.0%

Best quick comparison for general high-level reasoning performance across the family.

AIME 2026 (no tools)

Math reasoning

31B: 89.2% · 26B A4B: 88.3% · E4B: 42.5% · E2B: 37.5%

31B and 26B A4B are the right targets for math-heavy assistants and planning tasks.

LiveCodeBench v6

Competitive coding

31B: 80.0% · 26B A4B: 77.1% · E4B: 52.0% · E2B: 44.0%

If coding is a primary use case, the larger two models are in a different tier from the edge models.

GPQA Diamond

Scientific reasoning

31B: 84.3% · 26B A4B: 82.3% · E4B: 58.6% · E2B: 43.4%

A strong signal for technical and expert-facing workflows.

MMMU Pro

Multimodal reasoning

31B: 76.9% · 26B A4B: 73.8% · E4B: 52.6% · E2B: 44.2%

Vision tasks benefit heavily from the larger models when accuracy matters more than footprint.

MRCR v2 (128K, 8-needle)

Long-context retrieval

31B: 66.4% · 26B A4B: 44.1% · E4B: 25.4% · E2B: 19.1%

For large-document and repository-scale prompting, 31B is the strongest long-context choice.

Customization

How to Fine-Tune Gemma 4 for Real Product Work

Fine-tuning matters when prompting alone is not enough and you want Gemma 4 to perform better on a specific domain, workflow, or role. The practical paths are lightweight adapter tuning for text tasks and multimodal adapter tuning for image-plus-text tasks.

The official Gemma tuning docs center on a simple rule: tune for a defined task, not for vague improvement. For many builders, QLoRA is the most realistic place to start because it keeps hardware requirements much lower than full-model tuning.

1

Start with a narrow tuning goal

Choose a task or role where the base model should perform better, such as customer support, text-to-SQL, or product description generation. Use fine-tuning when the task is specific and repeated.

2

Pick the tuning path

Use text tuning for instruction and generation tasks, or vision tuning when your dataset combines images and text. The text QLoRA guide demonstrates text-to-SQL; the vision QLoRA guide demonstrates image-plus-text product descriptions.

3

Choose a realistic framework

Gemma 4 supports Keras with LoRA, the Gemma library, Hugging Face-based workflows, GKE, and Vertex AI. Hugging Face plus TRL is the most direct path for many developers.

4

Match the workflow to your hardware

The official text QLoRA example is designed around a T4 16GB setup. The vision QLoRA guide calls for a BF16-capable GPU such as NVIDIA L4 or A100 with more than 16GB of memory.

5

Use QLoRA when efficiency matters

QLoRA keeps the base model quantized to 4-bit, freezes the original weights, and trains only the added LoRA adapters. This lowers memory usage while preserving strong task performance.

6

Prepare data in the right format

Build a dataset that directly matches the behavior you want, then format it for conversation-style training with TRL and SFTTrainer. The official text guide uses a large synthetic text-to-SQL dataset.
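The conversation-style formatting step above can be sketched as a plain mapping function. This is a hedged illustration: the `schema`/`question`/`sql` field names and the system prompt text are assumptions for a text-to-SQL dataset, while the `messages` list of role/content dicts is the standard conversational format TRL's SFTTrainer accepts.

```python
# Illustrative system prompt and field names for a text-to-SQL dataset.
SYSTEM_PROMPT = (
    "You are a text-to-SQL assistant. Given a schema and a question, return only SQL."
)

def to_conversation(row: dict) -> dict:
    """Map one raw example to a chat-style training sample for SFTTrainer."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Schema: {row['schema']}\nQuestion: {row['question']}"},
            {"role": "assistant", "content": row["sql"]},
        ]
    }

raw = {
    "schema": "CREATE TABLE users (id INT, name TEXT)",
    "question": "How many users are there?",
    "sql": "SELECT COUNT(*) FROM users;",
}
sample = to_conversation(raw)
```

In a real pipeline you would apply `to_conversation` with `dataset.map(...)` over a Hugging Face dataset and pass the result straight to SFTTrainer, which applies the model's chat template for you.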

7

Evaluate, compare, and deploy

After training, run inference checks against your base model, verify task gains, and then deploy the tuned model or adapter. Treat deployment format as an early decision because framework choice affects the output format you get.

Quick Tips

  • Start with QLoRA and a T4-class GPU for text tasks — full fine-tuning is rarely needed for task adaptation.
  • Format your dataset to mirror the instruction-tuned chat format that Gemma 4 already understands.
  • Keep your eval set from the same distribution as your training data to get meaningful improvement signals.
  • The 26B A4B MoE model activates only a fraction of its parameters per token, but its total parameter count still determines checkpoint size during training.
  • Use the Gemma 4 -it checkpoint as your starting point for instruction tasks rather than the pre-trained base.
Prompting

Gemma 4 Prompt Guide

Gemma 4 introduces a new turn-based prompt format with native system instructions, multimodal placeholders, and built-in controls for thinking and tool use.

This guide turns the official Gemma 4 format into a practical prompt library. Structure every interaction as turns, use the system role for behavior and global rules, insert image or audio placeholders where needed, and only enable thinking or tool use when the task actually benefits from them.

Core chat skeleton

Gemma 4 uses native system, user, and model roles, wrapped in turn markers.

  • Use system for global instructions
  • Use user for the current request
  • Use model as the generation start point
<|turn>system You are a helpful assistant.<turn|>
<|turn>user Summarize the following article in 5 bullets.<turn|>
<|turn>model

System prompt pattern

Put stable behavior rules in one system turn instead of repeating them every time.

  • Good for style, scope, and output format
  • Native system role support starts with Gemma 4
  • Keep it concise and task-specific
<|turn>system You are a technical writer. Answer in clear English, use short paragraphs, and include one practical example.<turn|>
<|turn>user Explain function calling for a beginner.<turn|>
<|turn>model

Multimodal placeholders

Use placeholder tokens to indicate where image and audio embeddings should be inserted.

  • Use <|image|> for images
  • Use <|audio|> for audio
  • The processor replaces placeholders with embeddings after tokenization
<|turn>user Describe this image: <|image|> Then transcribe this clip: <|audio|><turn|>
<|turn>model

Thinking-ready prompt

Thinking mode is activated by placing <|think|> inside the system instruction.

  • Enable it for reasoning-heavy tasks
  • Keep it off for simple direct generation
  • Use one system turn for both thinking and other global instructions
<|turn>system <|think|>You are a careful reasoning assistant.<turn|>
<|turn>user Compare two pricing models and recommend one for a startup.<turn|>
<|turn>model

Tool-aware prompt structure

Tool declarations belong in the system turn, and tool calls and tool responses are handled with dedicated control tokens.

  • Useful for APIs, search, calculators, and external data lookups
  • Tool use is structured, not simulated in plain text
  • Reasoning and tool use can happen in the same turn
Define tools in the system turn using the tool declaration token block, then set user and model turns as usual. Gemma 4 handles the rest with structured tool_call and tool_response tokens.
Reasoning

Gemma 4 Thinking Mode

Thinking mode lets Gemma 4 produce a reasoning channel before the final answer, and the processor can separate both parts for application use.

Thinking mode is best for tasks where the model benefits from intermediate reasoning before it answers: ambiguous questions, math, coding, tool planning, and multimodal analysis. In Gemma 4, you can enable it at the chat-template level, stream the reasoning live, and then split the output into a thinking block and a user-facing answer block.

1

Choose the right tasks

Use thinking mode when the request needs decomposition, comparison, planning, or careful interpretation rather than a short direct reply.

  • Good fits: math, code debugging, structured decision-making, image-plus-text reasoning
  • Less necessary for simple rewrites, short summaries, or straightforward facts
  • Official examples cover both text-only and image-text workflows
2

Enable thinking in the chat template

With Hugging Face Transformers, set enable_thinking=True in apply_chat_template(). At the token level, Gemma 4 uses <|think|> in the system turn.

  • E2B and E4B: thinking OFF uses a simple user-model flow; thinking ON adds a system turn with <|think|>
  • 26B A4B and 31B: official templates include an empty thinking token when thinking is off to stabilize output
  • Thinking is designed to be enabled at the conversation level
3

Generate and separate the result

The model can emit a reasoning channel first and the final answer after it. You can stream it with TextStreamer and split it with parse_response().

  • processor.parse_response() returns separated thinking and answer content
  • This works for text prompts and image-text prompts
  • The reasoning channel can also include tool calls when the turn becomes agentic
4

Handle multi-turn chats correctly

For normal multi-turn conversations, strip the previous turn's generated thoughts before sending the history back. In tool-calling turns, keep the thought flow intact until the tool cycle finishes.

  • Regular chat: remove prior thought blocks before the next turn
  • Tool-use exception: do not remove thoughts between function calls inside the same turn
  • This keeps context clean while preserving agentic behavior
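The history-cleaning rule above can be sketched as a small helper. This is an illustrative sketch only: the `<thought>...</thought>` delimiters are a placeholder assumption, since the real markers depend on how your processor serializes the reasoning channel, and the role names mirror this guide's system/user/model convention.

```python
import re

# NOTE: "<thought>...</thought>" is a placeholder delimiter pair; replace it
# with whatever markers your processor actually emits for the reasoning channel.
THOUGHT_RE = re.compile(r"<thought>.*?</thought>", flags=re.DOTALL)

def strip_thoughts(history: list, in_tool_cycle: bool = False) -> list:
    """Remove reasoning blocks from prior model turns before resending history.

    Skips stripping while a tool-calling cycle is still in progress, matching
    the rule that thoughts stay intact between function calls within one turn.
    """
    if in_tool_cycle:
        return history
    cleaned = []
    for turn in history:
        if turn["role"] == "model":
            turn = {**turn, "content": THOUGHT_RE.sub("", turn["content"]).strip()}
        cleaned.append(turn)
    return cleaned
```

Calling `strip_thoughts(history)` before each new user turn keeps context windows lean, while `in_tool_cycle=True` preserves agentic state mid-turn.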
Agentic Workflows

Gemma 4 Function Calling

Gemma 4 supports native structured tool use, letting the model request functions instead of faking external actions in plain text.

Function calling is the practical bridge between model output and real application behavior. Instead of asking Gemma 4 to guess live data or simulate actions, you define tools, let the model generate a structured call, execute the function in your app, and then feed the result back so the model can finish with a clean natural-language answer.

1

Define tools clearly

Pass tools through apply_chat_template() using either a manual JSON schema or a raw Python function converted to schema.

  • Manual JSON schema is best when you need precise nested parameters
  • Raw Python functions are convenient for simple tools with clear type hints and docstrings
  • Tool definitions should include name, description, parameter types, and required fields
2

Let the model request a tool

Gemma 4 receives the user prompt plus available tools and returns a structured function call object rather than plain text when a tool is needed.

  • Tool use is controlled with dedicated tokens such as tool, tool_call, and tool_response
  • A typical example is a weather or search function
  • This is better than plain text when the answer depends on external state or system actions
3

Validate and execute in your app

Gemma 4 cannot execute code on its own. Your application must parse the function name and arguments, validate them, and run the real function safely.

  • Always validate function names and arguments before execution
  • Do not rely on generated code without safeguards
  • For production systems, map tool names to approved handlers instead of dynamic execution
4

Return tool output for the final answer

Append the tool result back into the chat history, then let Gemma 4 produce the final user-facing response.

  • Official workflow: define tools, model turn, developer turn, final response
  • This pattern works for APIs, live lookups, calculators, settings updates, and agent loops
  • Tool responses should stay structured so the model can ground the final answer correctly
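The validate-and-execute steps above can be sketched as a small dispatcher. Everything here is illustrative: the `get_weather` tool, the `{"name": ..., "arguments": {...}}` call shape, and the `tool_response` wrapper are assumptions to show the pattern of mapping approved tool names to handlers instead of executing anything dynamic.

```python
def get_weather(city: str) -> dict:
    """Stand-in for a real weather API call."""
    return {"city": city, "temp_c": 21}

# Map approved tool names to handlers; never execute model output directly.
TOOL_REGISTRY = {"get_weather": get_weather}

def execute_tool_call(tool_call: dict) -> dict:
    """Validate a model-requested tool call and run the approved handler."""
    name = tool_call.get("name")
    if name not in TOOL_REGISTRY:
        raise ValueError(f"Unknown tool: {name!r}")
    args = tool_call.get("arguments", {})
    if not isinstance(args, dict):
        raise ValueError("Tool arguments must be a JSON object")
    result = TOOL_REGISTRY[name](**args)
    # Keep the response structured so the model can ground its final answer.
    return {"role": "tool_response", "name": name, "content": result}
```

The returned `tool_response` dict is what you would append to the chat history before asking the model for its final user-facing reply.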
Multimodal

Gemma 4 Multimodal Guide

Gemma 4 handles text and image across all models, supports video as frames, and adds native audio support on E2B and E4B.

Gemma 4 is built for multimodal input. All models support image and video-style visual understanding, the small models add audio input, and the runtime lets you trade off visual detail against speed using token budgets. That makes Gemma 4 suitable for OCR, captioning, object detection, speech tasks, and mixed media prompts inside one chat flow.

Image understanding

All Gemma 4 models support text-plus-image workflows.

  • Common tasks: OCR, object detection, visual question answering, image captioning
  • Supports reasoning across multiple images in one prompt
  • Best for screenshots, documents, product images, and scene analysis

Video understanding

All Gemma 4 models can process video as a sequence of frames.

  • Good for scene description, human interaction, and situational summaries
  • Video passed as a content item in the messages array
  • Maximum supported video length is 60 seconds at 1 frame per second

Audio understanding

Audio is available on the E2B and E4B models.

  • Supports multilingual speech recognition, speech translation, and general speech understanding
  • Audio token cost is 25 tokens per second
  • Maximum audio length is 30 seconds
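The audio limits above translate directly into a prompt token budget. A minimal helper, using only the figures stated here (25 tokens per second, 30-second cap):

```python
AUDIO_TOKENS_PER_SECOND = 25   # documented audio token rate
MAX_AUDIO_SECONDS = 30         # documented maximum clip length

def audio_token_cost(seconds: float) -> int:
    """Token cost of an audio clip, enforcing the 30-second limit."""
    if seconds > MAX_AUDIO_SECONDS:
        raise ValueError(f"Clip too long: {seconds}s exceeds {MAX_AUDIO_SECONDS}s limit")
    return int(seconds * AUDIO_TOKENS_PER_SECOND)

# A full 30-second clip costs 750 tokens of the context budget.
```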

Visual token budgets

Gemma 4 introduces variable-resolution image processing so you can choose speed or detail based on the task.

  • Supported image budgets: 70, 140, 280, 560, 1120 tokens
  • Lower budgets for faster classification, captioning, and video frame analysis
  • Higher budgets for OCR, document parsing, and reading small text

Input preparation rules

The processor handles much of the media formatting, but a few limits matter in production.

  • Audio should be mono, 16 kHz, float32, normalized to [-1, 1]
  • Image file support depends on the framework used to convert files into tensors
  • Prompt quality still matters: specific instructions outperform vague multimodal requests
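The mono/float32/[-1, 1] preparation rule above can be sketched with plain Python lists. This is only a shape illustration: real pipelines would use numpy or librosa, and resampling to 16 kHz is assumed to happen upstream.

```python
def downmix_stereo(left, right):
    """Average two channels into one mono track (simple downmix)."""
    return [(l + r) / 2 for l, r in zip(left, right)]

def prepare_audio(samples):
    """Peak-normalize mono samples into [-1, 1].

    Assumes the clip is already mono and resampled to 16 kHz upstream;
    values already within range are passed through unchanged.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak <= 1.0:
        return list(samples)
    return [s / peak for s in samples]
```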

Model capability split

Use the smallest models for mobile and speech-heavy use cases, and the larger models for heavier reasoning with long context.

  • E2B and E4B: audio-enabled small models with 128K context
  • 26B A4B and 31B: larger reasoning-focused models with 256K context
  • All four official sizes available in base and instruction-tuned variants
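
The split above fits in a small lookup table, which is handy when routing requests by capability. The values restate the figures on this page; the dict layout and helper are just an illustration.

```python
# Capability split across the four official Gemma 4 sizes.
GEMMA4_MODELS = {
    "E2B":     {"audio_input": True,  "context_k": 128},
    "E4B":     {"audio_input": True,  "context_k": 128},
    "26B-A4B": {"audio_input": False, "context_k": 256},
    "31B":     {"audio_input": False, "context_k": 256},
}

def models_for(need_audio=False, min_context_k=0):
    """Names of sizes satisfying the requested capabilities."""
    return [
        name for name, caps in GEMMA4_MODELS.items()
        if (caps["audio_input"] or not need_audio)
        and caps["context_k"] >= min_context_k
    ]
```

models_for(need_audio=True) returns the two E-models; models_for(min_context_k=256) returns the two larger sizes.
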
Local Deployment

Gemma 4 GGUF and Quantization

Choose the smallest Gemma 4 footprint that still fits your machine

For most local setups, the practical decision is whether to stay with E2B or E4B, or move up to a 26B A4B GGUF build. Google documents approximate memory needs for BF16, SFP8, and 4-bit-style deployment choices across all four official sizes.

Official local entry points

Google's Ollama guide exposes four Gemma 4 tags: gemma4:e2b, gemma4:e4b, gemma4:26b, and gemma4:31b. LM Studio also supports Gemma models in both GGUF and MLX formats for fully local inference.

Start with E2B or E4B for a lighter local loop, and move to 26B or 31B only when you have the RAM budget and want a stronger reasoning model.

Approximate memory by official size

Google lists approximate inference memory as E2B 9.6 GB BF16 / 3.2 GB Q4_0, E4B 15 GB / 5 GB, 26B A4B 48 GB / 15.6 GB, and 31B 58.3 GB / 17.4 GB.

If your target is a mainstream local machine, 4-bit-style deployment or a smaller model size is usually the line between runnable and impractical.
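
Those figures translate directly into a fits-or-not check. A sketch using the approximate numbers above (the size labels and dict layout are mine; real usage also needs headroom for KV cache and OS overhead):

```python
# Approximate inference memory in GB, per Google's published figures.
MEMORY_GB = {
    "E2B":     {"bf16": 9.6,  "q4_0": 3.2},
    "E4B":     {"bf16": 15.0, "q4_0": 5.0},
    "26B-A4B": {"bf16": 48.0, "q4_0": 15.6},
    "31B":     {"bf16": 58.3, "q4_0": 17.4},
}

def runnable_sizes(available_gb, precision="q4_0"):
    """Official sizes whose approximate footprint fits in available_gb."""
    return [s for s, need in MEMORY_GB.items() if need[precision] <= available_gb]
```

On a 16 GB machine the 4-bit builds up to 26B A4B fit, but the 31B Q4_0 build (17.4 GB) does not.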

Official 26B A4B GGUF example

The official ggml-org Gemma 4 26B A4B IT GGUF page recommends llama-server for startup and lists Q4_K_M at 16.8 GB, Q8_0 at 26.9 GB, and F16 at 50.5 GB.

Q4_K_M is the most practical default when you want a large local Gemma 4 model but cannot afford Q8_0 or full 16-bit memory use.
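
Choosing among those files is a one-liner once you know your memory budget. Note that GGUF file size understates total runtime use (context cache and runtime overhead come on top), so treat this as a first filter only:

```python
# GGUF file sizes in GB from the official ggml-org 26B A4B IT page.
GGUF_26B_A4B = {"Q4_K_M": 16.8, "Q8_0": 26.9, "F16": 50.5}

def best_quant(budget_gb):
    """Largest listed quantization whose file fits the budget, else None."""
    fitting = {q: gb for q, gb in GGUF_26B_A4B.items() if gb <= budget_gb}
    return max(fitting, key=fitting.get) if fitting else None
```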

What quantization changes

Higher parameter counts and higher precision are generally more capable, but they cost more processing cycles, memory, and power. Lower precision reduces those costs but can reduce capability.

Use quantization to fit the model to your hardware: smaller GGUF builds help you run locally, but they are a deployment compromise rather than a free upgrade.

Python Workflow

Gemma 4 PyTorch Guide

Run Gemma 4 from a PyTorch-first stack

The fastest Python path for Gemma 4 is Hugging Face Transformers on top of PyTorch: install torch and transformers, pick a Gemma 4 model ID, and begin with pipeline-based text inference before moving into multimodal or tool-enabled workflows.

1

Install the runtime

Google's Gemma 4 text inference guide starts with torch, accelerate, and transformers, plus dialog for conversation handling.

pip install torch accelerate
pip install transformers
pip install dialog
2

Pick an official Gemma 4 checkpoint

Google's Gemma 4 examples show four official instruction-tuned IDs: google/gemma-4-E2B-it, google/gemma-4-E4B-it, google/gemma-4-26B-A4B-it, and google/gemma-4-31B-it.

MODEL_ID = "google/gemma-4-E2B-it"
3

Start with text generation

Use transformers.pipeline with task="text-generation", device_map="auto", and dtype="auto" as the quickest way to get a first response.

from transformers import pipeline

txt_pipe = pipeline(
    task="text-generation",
    model=MODEL_ID,
    device_map="auto",
    dtype="auto",
)
4

Move to multimodal and tools when needed

For multimodal and function-calling workflows, use AutoProcessor and AutoModelForMultimodalLM with apply_chat_template for tool-aware prompting.

from transformers import AutoProcessor, AutoModelForMultimodalLM

model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID, dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
5

Use native PyTorch for deeper control

Google's PyTorch guide documents Kaggle credential setup, dependency installation, cloning gemma_pytorch, and loading multimodal model classes for experimentation with direct checkpoint control.

pip install -q -U torch immutabledict sentencepiece
git clone https://github.com/google/gemma_pytorch.git
On-Device AI

Gemma 4 Mobile Deployment

Put Gemma 4 on mobile through the current Android stack

Gemma 4 now has three practical mobile-facing paths: ML Kit Prompt API on AICore preview devices, Android Studio local-model workflows for developer-side usage, and LiteRT-LM for lower-level runtime control across mobile and embedded devices.

1

Choose the path that matches your goal

Use ML Kit Prompt API on AICore if you are building an Android app experience, Android Studio local models if you want offline coding help, and LiteRT-LM if you need lower-level runtime control.

Path by use case:
- App feature prototype: ML Kit Prompt API + AICore
- Local coding workflow: Android Studio local model
- Custom runtime control: LiteRT-LM
2

Prototype on-device with AICore

Google's April 2026 preview lets you target Gemma 4 E2B or E4B through model preference settings inside the Prompt API flow on AICore-enabled devices.

val previewFullConfig = generationConfig {
    modelConfig = ModelConfig {
        releaseTrack = ModelReleaseTrack.PREVIEW
        preference = ModelPreference.FULL
    }
}
3

Know the device expectations

Preview models run on AICore-enabled devices and the latest AI accelerators from Google, MediaTek, and Qualcomm. AI Edge Gallery is available for quick model checks on non-AICore devices.

Testing options:
- AICore-enabled phone for preview models
- AI Edge Gallery for quick model checks
- High-end Android hardware (Pixel 8, Samsung S23+)
4

Use Android Studio for developer-side workflows

Android Studio currently recommends Gemma 4 as its local model option. Gemma E4B requires 12 GB RAM and 4 GB storage; Gemma 26B MoE requires 24 GB RAM and 17 GB storage.

Settings > Tools > AI > Model Providers
5

Switch to LiteRT-LM for deeper runtime control

LiteRT-LM is a cross-platform library for language model pipelines from phones to embedded systems, with CPU, GPU, and NPU paths including Qualcomm AI Engine Direct and MediaTek NeuroPilot.

LiteRT-LM supports:
- CPU / GPU execution
- Qualcomm AI Engine Direct
- MediaTek NeuroPilot
Model Comparison

Gemma 4 vs Gemma 3

See what actually changes when you move from Gemma 3 to Gemma 4

This comparison is for developers deciding whether to keep an existing Gemma 3 workflow or rebuild around Gemma 4. The clearest differences show up in context length, control format, multimodal scope, and benchmark performance at the top end of each family.

Release and core sizes

Gemma 4
Released March 31, 2026 in E2B, E4B, 26B A4B, and 31B sizes.
Gemma 3
Released March 10, 2025 in 1B, 4B, 12B, and 27B sizes, with 270M added August 14, 2025.

Gemma 4 trims the family around clearer deployment tiers: edge-first E-models plus larger workstation-class models.

Context window

Gemma 4
E2B and E4B support up to 128K context; 26B A4B and 31B support up to 256K.
Gemma 3
4B, 12B, and 27B support 128K context; 1B and 270M support 32K.

For long documents, tool traces, or multi-step history, Gemma 4's larger models open significantly more headroom.

Multimodality

Gemma 4
Supports image, video, interleaved text-image, and native audio input on E2B and E4B.
Gemma 3
Core models support text and image input with text output.

Gemma 4 is the broader multimodal family if your use case moves beyond image-text into video, OCR-heavy flows, or audio-capable edge models.

Prompt and control format

Gemma 4
Adds native system-role support and specialized control tokens for tools, reasoning, images, and audio.
Gemma 3
Legacy formatting uses user/model turns; the separate system role is not supported.

Teams building agents or structured workflows get a cleaner control surface in Gemma 4.

Top-end benchmark snapshot

Gemma 4
Gemma 4 31B: MMLU Pro 85.2, AIME 2026 89.2, LiveCodeBench v6 80.0, GPQA Diamond 84.3.
Gemma 3
Gemma 3 27B (no think): MMLU Pro 67.6, AIME 2026 20.8, LiveCodeBench v6 29.1, GPQA Diamond 42.4.
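
The gap is easier to see as absolute deltas; this is pure arithmetic over the scores quoted above:

```python
# (Gemma 4 31B, Gemma 3 27B no-think) scores as quoted in this section.
SCORES = {
    "MMLU Pro":         (85.2, 67.6),
    "AIME 2026":        (89.2, 20.8),
    "LiveCodeBench v6": (80.0, 29.1),
    "GPQA Diamond":     (84.3, 42.4),
}

deltas = {bench: round(g4 - g3, 1) for bench, (g4, g3) in SCORES.items()}
# AIME shows the largest jump (+68.4 points), MMLU Pro the smallest (+17.6).
```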

If upgrading for reasoning, coding, or high-difficulty QA, the top-end Gemma 4 jump is large enough to justify a migration.

Deployment profile

Gemma 4
E2B and E4B for efficient local and on-device use; 26B A4B and 31B for consumer GPU or workstation scenarios.
Gemma 3
Remains strong for smaller classic sizes like 1B and 4B, with a 27B top end and 128K context on main larger variants.

Stay on Gemma 3 when small classic sizes already fit your stack; move to Gemma 4 when you want newer control features, larger-context top models, or stronger edge-oriented variants.