https://feeds.feedburner.com/glaforge

Agentic Design Patterns

At Devoxx France 2026, I had the pleasure of presenting a session on Agentic Design Patterns. In this talk, I explore how to move beyond basic LLM wrappers to build reliable, scalable, and sophisticated AI agent systems.

In the coming weeks, I’ll be blogging about some of these patterns, which I implemented using LangChain4j and ADK for Java.

Abstract

It’s time to dive into the deep end, far from “hello world” demos. To build your multi-agent systems, you often start by assembling classic bricks: sequential or parallel flows, or loops. The basics!

But it’s by combining these components that we create truly powerful patterns, such as reflection loops or critic/reviewer duos. However, to reach true “AI-llumination,” you’ll have to make choices: which routing pattern to adopt? How to drive a swarm of agents so they collaborate without going in all directions?

And what if we pushed even further? We can now imagine an entire ecosystem of remote agents communicating with each other. In short, all this and more, but with code snippets inside!

As a committer on LangChain4j and ADK for Java (Agent Development Kit), I invite you to join me in this “agentic” adventure. My goal: to give you the keys to choosing the right abstractions and avoiding technical dead ends on your journey.

Recording

The video is available in French only for now, as it was only recorded at Devoxx France 2026, but I’ll update this post whenever an English version becomes available.

Slides

The first slide is in French, but don’t worry, the rest is all English 😃

Summary

As we transition from simple RAG pipelines to autonomous agents, the complexity of managing non-determinism, “context rot,” and execution reliability increases. This talk introduces a “Pattern Language” for agents, categorizing them into key domains.

Deep Dive into the Patterns
  • Programmatic Planning: While LLMs can plan dynamically, some business processes require strict adherence to specific steps. This pattern uses hardcoded sequences or state machines to guide the agent. It provides high determinism, simplifies debugging, and ensures that the agent follows “golden paths” for critical tasks.

  • Progressive Disclosure (Agent Skills): To prevent “context rot” and reduce hallucinations, you shouldn’t overwhelm an agent with every possible tool or instruction at once. Instead, this pattern dynamically injects specific “skills” or detailed tool documentation into the prompt only when the agent identifies a need for them. This keeps the context window lean and the reasoning focused.

  • Hierarchical Agent Decomposition: Complex problems are best solved by a team of experts rather than a single generalist. This pattern involves a “Manager” agent that orchestrates specialized sub-agents (e.g., a “Coder,” a “Researcher,” and a “Reviewer”). This allows for model-specific optimization—using smaller, faster models for simple sub-tasks and larger ones for coordination.

  • Goal-Oriented Action Planning (GOAP): Borrowed from game AI, GOAP reverses the planning process. Instead of defining a path, you define a Goal and the Preconditions and Effects of each available tool. A planner then autonomously selects the sequence of actions that transition the current state to the goal state, providing immense flexibility in dynamic environments.

  • Feedback Loops (Reflection): Reliability is achieved through iteration. In this pattern, an agent’s initial output is passed to a “Critic” or “Verifier” (which could be the same agent or a different one). The agent then reflects on the feedback and regenerates the response. This “think-correct-execute” loop is essential for complex tasks like code generation or mathematical reasoning.

  • LLM-as-Judge: Evaluating agentic systems at scale is a major challenge. This pattern uses high-capacity LLMs to evaluate the outputs of other models based on predefined rubrics and scoring criteria. By quantifying qualitative aspects like “helpfulness” or “safety,” it enables automated benchmarking and faster development cycles.
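To make the GOAP item in the list above more concrete, here is a minimal sketch of such a planner in plain Java. It is not taken from the talk: the `GoapSketch` class, the `Action` record, and the example tool names are hypothetical, and it assumes a simplified world model in which facts are strings and effects only add facts (nothing is ever retracted). A breadth-first search then finds a shortest sequence of actions whose accumulated effects satisfy the goal.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

class GoapSketch {
    /** A tool described only by what must hold before it runs and what holds afterwards. */
    record Action(String name, Set<String> preconditions, Set<String> effects) {}

    /** Search node: the facts known so far and the actions taken to reach them. */
    private record Node(Set<String> facts, List<String> path) {}

    /** Breadth-first search from the start facts until every goal fact holds; null if unreachable. */
    static List<String> plan(Set<String> start, Set<String> goal, List<Action> actions) {
        Queue<Node> frontier = new ArrayDeque<>();
        Set<Set<String>> visited = new HashSet<>();
        frontier.add(new Node(start, List.of()));
        visited.add(start);
        while (!frontier.isEmpty()) {
            Node node = frontier.poll();
            if (node.facts().containsAll(goal)) {
                return node.path();
            }
            for (Action action : actions) {
                // Skip actions whose preconditions are not met in this state
                if (!node.facts().containsAll(action.preconditions())) continue;
                Set<String> next = new HashSet<>(node.facts());
                next.addAll(action.effects());
                // Only expand states we have not seen before
                if (visited.add(next)) {
                    List<String> path = new ArrayList<>(node.path());
                    path.add(action.name());
                    frontier.add(new Node(next, path));
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Action> tools = List.of(
            new Action("send_email", Set.of("order_details", "stock_checked"), Set.of("customer_notified")),
            new Action("check_stock", Set.of("order_details"), Set.of("stock_checked")),
            new Action("fetch_order", Set.of("order_id"), Set.of("order_details")));
        System.out.println(plan(Set.of("order_id"), Set.of("customer_notified"), tools));
    }
}
```

The tools are declared in an arbitrary order; the planner derives the ordering from preconditions and effects alone. In an agent, each action name would map to a real tool invocation, and the fact set would be updated from actual tool results rather than from declared effects.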

The core takeaway is that while agents provide the “reasoning engine,” it is the Orchestration and Harness Engineering—guided by these patterns—that ensures they reach their goals consistently.
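Coming back to the Feedback Loops pattern listed above, the control flow can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not code from the talk: `ReflectionLoop`, `refine`, and the functional shapes of the generator and critic are hypothetical stand-ins for LLM calls, and the loop is capped by a round limit so that it always terminates.

```java
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.Function;

class ReflectionLoop {
    /**
     * generator: (task, feedback or null) -> draft
     * critic:    draft -> empty when the draft is acceptable, otherwise feedback text
     * maxRounds: upper bound on the number of drafts generated
     */
    static String refine(String task,
                         BiFunction<String, String, String> generator,
                         Function<String, Optional<String>> critic,
                         int maxRounds) {
        String draft = generator.apply(task, null);
        for (int round = 1; round < maxRounds; round++) {
            Optional<String> verdict = critic.apply(draft);
            if (verdict.isEmpty()) {
                return draft; // the critic is satisfied
            }
            // Feed the critique back into the next generation attempt
            draft = generator.apply(task, verdict.get());
        }
        return draft; // round budget exhausted: return the latest draft
    }

    public static void main(String[] args) {
        String result = refine(
            "declare a sum",
            (task, feedback) -> feedback == null ? "sum = a + b" : "int sum = a + b;",
            draft -> draft.endsWith(";") ? Optional.empty() : Optional.of("missing type and semicolon"),
            3);
        System.out.println(result);
    }
}
```

Whether the critic is the same model with a different prompt or a separate one only changes what is passed as the `critic` function; the loop itself stays the same.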

https://glaforge.dev/talks/2026/05/18/agentic-design-patterns/
Streaming Gemini 3.1's expressive new TTS model in Java

Google just released Gemini 3.1 Flash Text-to-Speech (TTS), a new expressive TTS model that you can steer with audio tags and scene descriptions.

I wanted to see how it worked with the Gemini Interactions SDK for Java.

Expressive control

The model sounds natural out of the box, but the real benefit is the control you have over expressiveness. By defining “Audio Profiles”, “Scene Details”, and “Director’s Notes” in your prompt, you can control the character’s pacing, tone, and environment.

You can also use inline tags like [excitedly], [whispers], or [shouting] to change the emotional delivery mid-sentence. There is no fixed set of tags: you can express any emotion within the square brackets.

To learn more about prompting Gemini 3.1 TTS

For more on the prompting mechanics, see this article from DEV Community:

How to prompt Gemini 3.1’s new text to speech model

The article even suggests a meta-prompt you can use to generate good prompts for Gemini 3.1 TTS! You could even turn that into a reusable SKILL.md file!

Streaming audio directly to the speakers

I set up a “Morning DJ” persona using the example and techniques from the article. Beyond just generating a file, I wanted to stream the audio directly to the speakers as the model generated it.

Here is the implementation using the Gemini Interactions Java SDK.

First, let’s define the client, with a Gemini API key (which you can get from AI Studio):

GeminiInteractionsClient client = GeminiInteractionsClient.builder()
    .apiKey(System.getenv("GEMINI_API_KEY"))
    .build();

You don’t necessarily have to create such a complex and detailed prompt, but I’ve reused the example from the article:

String prompt = """
    # AUDIO PROFILE: Jaz R.
    ## THE SCENE: The London Studio
    It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline,
    but inside, it is blindingly bright. The red "ON AIR" tally light is blazing.
    Jaz is standing up, not sitting, bouncing on the balls of their heels
    to the rhythm of a thumping backing track.
    Their hands fly across the faders on a massive mixing desk.
    It is a chaotic, caffeine-fueled cockpit designed to wake up an entire nation.

    ### DIRECTOR'S NOTES
    Style:
    * The "Vocal Smile": You must hear the grin in the audio.
      The soft palate is always raised to keep the tone bright, sunny, and explicitly inviting.
    * Dynamics: High projection without shouting. Punchy consonants and elongated vowels on excitement words.
    Accent: Jaz is a DJ from Brixton, London
    Pace: Speaks at an energetic pace, keeping up with the fast music.
    Speaks with a "bouncing" cadence. High-speed delivery with fluid transitions—no dead air, no gaps.

    ### SAMPLE CONTEXT
    Jaz is the industry standard for Top 40 radio, high-octane event promos,
    or any script that requires a charismatic Estuary accent and 11/10 infectious energy.

    #### TRANSCRIPT
    [excitedly] Yes, massive vibes in the studio!
    You are locked in and it is absolutely popping off in London right now.
    If you're stuck on the tube, or just sat there pretending to work... stop it.
    Seriously, I see you.
    [shouting] Turn this up! We've got the project roadmap landing in three, two... let's go!
    """;

Feel free to start with just the audio [tags]; they already go a long way.

Now it’s time to create the request and pass the model, prompt, output modalities (i.e. audio!), and the speech config to choose the voice and language. Let’s also not forget the streaming setting, to stream the answer as soon as it’s generated:

ModelInteractionParams request = ModelInteractionParams.builder()
    .model("gemini-3.1-flash-tts-preview")
    .input(prompt)
    .responseModalities(Interaction.Modality.AUDIO)
    .speechConfig(new SpeechConfig("Algenib", "en-GB"))
    .stream(true)
    .build();
Info

Gemini 3.1 TTS is not a streaming model like the Gemini Live model: it generates the audio and sends it once it’s ready. The point of enabling streaming here is to start playing the audio as soon as we begin receiving it.

We use the client.stream() method to consume Server-Sent Events (SSE), and open up the local audio system’s data line to serve chunks of audio as they are generated:

try (Stream<Events> eventStream = client.stream(request)) {
    AudioFormat format = new AudioFormat(24000, 16, 1, true, false);
    DataLine.Info info = new DataLine.Info(SourceDataLine.class, format);

    // Obtain a SourceDataLine connected to the system's active audio output
    try (SourceDataLine line = (SourceDataLine) AudioSystem.getLine(info)) {
        line.open(format);
        line.start();

        // Decode base64 bytes dynamically and pipe to the speakers
        eventStream.forEach(event -> {
            if (event instanceof Events.ContentDelta cd && cd.delta() instanceof Events.AudioDelta audioDelta) {
                byte[] audioData = Base64.getDecoder().decode(audioDelta.data());
                line.write(audioData, 0, audioData.length);
            }
        });

        line.drain();
    }
} catch (Exception e) {
    e.printStackTrace();
}
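
One detail the playback loop above relies on: SourceDataLine.write expects a byte count that is a whole number of sample frames, and throws IllegalArgumentException otherwise. With the 16-bit mono format used here, a frame is 2 bytes. I have not verified whether the API ever delivers a chunk that splits a sample, so the following is only a defensive sketch: the `PcmChunks` helper is hypothetical, uses plain arithmetic that mirrors what AudioFormat.getFrameSize() reports for PCM, and carries any leftover bytes over to the next chunk.

```java
import java.util.Arrays;

class PcmChunks {
    private final int frameSize;
    private byte[] carry = new byte[0];

    PcmChunks(int frameSize) {
        this.frameSize = frameSize;
    }

    /** Bytes per sample frame for PCM: bytes per sample times channel count. */
    static int frameSize(int sampleSizeInBits, int channels) {
        return ((sampleSizeInBits + 7) / 8) * channels;
    }

    /** Playback duration, in seconds, of a given number of PCM bytes. */
    static double seconds(long byteCount, float sampleRate, int frameSize) {
        return byteCount / ((double) sampleRate * frameSize);
    }

    /** Returns the largest frame-aligned prefix of (leftover + chunk) and keeps the rest for later. */
    byte[] align(byte[] chunk) {
        byte[] all = new byte[carry.length + chunk.length];
        System.arraycopy(carry, 0, all, 0, carry.length);
        System.arraycopy(chunk, 0, all, carry.length, chunk.length);
        int usable = all.length - (all.length % frameSize);
        carry = Arrays.copyOfRange(all, usable, all.length);
        return Arrays.copyOf(all, usable);
    }

    /** Number of bytes currently held back, waiting for the rest of their frame. */
    int pending() {
        return carry.length;
    }

    public static void main(String[] args) {
        int frame = frameSize(16, 1); // 16-bit mono, as in the AudioFormat above
        System.out.println(seconds(48_000, 24_000f, frame) + " s of audio in 48000 bytes");
    }
}
```

In the forEach above, this would mean writing `aligner.align(audioData)` to the line instead of the raw decoded bytes. The duration helper is also handy for logging how much audio has been received so far.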
Results

Let’s listen! What do you think of the expressivity and tone of the voice? Pretty good, right?

https://glaforge.dev/posts/2026/04/16/streaming-gemini-3-1-expressive-new-tts-model-in-java/
A Simple Coding Agent in a Loop with LangChain4j, Jbang, and Gemini

A few days ago, Max Rydahl Andersen published a fascinating article about nanocode: a minimalist Claude Code alternative implemented in just 260 lines of Java (inspired by a 250-line Python equivalent). It was a masterclass in “leanness,” using raw HTTP calls, Jackson JSON parsing, and an OpenRouter or Anthropic LLM endpoint to create an autonomous coding loop.

I loved the concept, but I had a very practical motivation to take it in a different direction: I don’t have a Claude subscription. 😃

Instead, I’m a heavy user of Google’s ecosystem (who would have guessed) and I really wanted to use Gemini. This led me to explore what it would look like if I integrated LangChain4j and its first-class support for Google AI Gemini.

In this post, I’ll walk through the two variants I built, the architectural trade-offs between them, and how I evolved the code from its original fork with the help of Gemini CLI and its powerful plan mode.

A Major Disclaimer on Security

Before we go further, we need to address the elephant in the room: Security.

A basic coding agent like this is potentially dangerous. It has:

  • No sandboxing: It runs directly on your machine with your user permissions.
  • No security checks: There is no “human-in-the-loop” to validate shell commands before they execute.
  • Full system access: If the LLM decides to run rm -rf /, this script will happily try to do it.

This is “run at your own risk” territory. Do not use this on your production code or any sensitive machine without further sandboxing (like Docker or a VM) and strict security measures. This is exactly what differentiates a professional coding agent (like Claude Code or Gemini CLI) from a 300-line “toy” project like this one.


What exactly is a “Coding Agent”?

Strip away the marketing fluff and a coding agent is essentially a persistent while loop.

  1. It waits for a request from the user.
  2. It sends that request to an LLM along with a set of tools (functions the agent can run locally or that invoke remote APIs).
  3. The LLM decides which tools to call to achieve the goal (e.g., “Read this file”, “Run this test”, “Write this function”).
  4. The loop executes those tools, feeds the results back to the LLM, and repeats until the goal is achieved.

This “agentic action” is what distinguishes an agent from a simple chatbot. It doesn’t just talk about code; it actively works on your filesystem to solve the problem you set forth.
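
Steps 2 to 4 above can be sketched without any LLM library at all. The following is a toy illustration, not the nanocode implementation: `AgentLoopSketch`, `Step`, and `Model` are hypothetical names, the "model" is a scripted stand-in for a real LLM call, and a step budget is added so the loop cannot run forever.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class AgentLoopSketch {
    /** What the model wants next: a tool call, or a final answer when tool is null. */
    record Step(String tool, String argument, String answer) {
        static Step call(String tool, String argument) { return new Step(tool, argument, null); }
        static Step done(String answer) { return new Step(null, null, answer); }
    }

    /** Stand-in for the LLM: reads the transcript so far and decides the next step. */
    interface Model {
        Step next(List<String> transcript);
    }

    /** Handles one user request: call the model, run the requested tools, feed results back. */
    static String run(String request, Model model,
                      Map<String, Function<String, String>> tools, int maxSteps) {
        List<String> transcript = new ArrayList<>();
        transcript.add("user: " + request);
        for (int i = 0; i < maxSteps; i++) {
            Step step = model.next(transcript);
            if (step.tool() == null) {
                return step.answer(); // goal reached: the model produced its final answer
            }
            Function<String, String> tool = tools.get(step.tool());
            String result = tool == null
                    ? "unknown tool: " + step.tool()
                    : tool.apply(step.argument());
            // The tool result becomes part of what the model sees on the next turn
            transcript.add("tool " + step.tool() + "(" + step.argument() + ") -> " + result);
        }
        return "stopped after " + maxSteps + " steps";
    }

    public static void main(String[] args) {
        Model scripted = t -> t.size() == 1 ? Step.call("read", "notes.txt") : Step.done(t.get(1));
        Map<String, Function<String, String>> tools = Map.of("read", path -> "hello");
        System.out.println(run("what is in notes.txt?", scripted, tools, 5));
    }
}
```

Libraries such as LangChain4j run this inner loop for you; the outer `while` loop that waits for user input (step 1) then wraps around a single call.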

The Foundation: Java 25 and Gemini 3

Both variants are written as single-file JBang scripts and rely on two recent Java features: compact source files with an instance main method (the bare void main() method, previewed as Implicitly Declared Classes and finalized in Java 25) and the new java.lang.IO class (for friendly IO.println()/readln() shortcuts).

Thanks to the new IO class, the main loop is incredibly lean:

while (true) {
    var input = readln("❯ "); // Modern Java 25 input
    if (input == null || input.equals("/q")) break;

    var response = assistant.chat(input);
    println("\n⏺ " + markdown(response));
}

The model of choice is gemini-3-flash-preview. Gemini 3 introduces “thinking” capabilities and “thought signatures”, which are essential for stable tool-calling in long-running agentic conversations.

Advice

On some Coding/SWE-focused benchmarks Gemini 3 Flash is often just as good as Gemini 3 Pro, but faster! So don’t hesitate to use this super fast model! And reserve Pro for more complex reasoning scenarios.

Two Approaches to Agentic Design

I implemented two distinct variants to explore using LangChain4j for coding agents.

[Figure: Monolithic vs. Multi-Agent diagram]

1. The Monolithic Agent (nanocode_basic.java)

This version uses the tried-and-true AiServices pattern. It’s a single agent that is directly “wired” to a set of tools (read, write, bash, etc.).

var assistant = AiServices.builder(Assistant.class)
        .chatModel(model)
        .chatMemory(MessageWindowChatMemory.withMaxMessages(20))
        .tools(new Tools())
        .build();
2. The Multi-Agent Supervisor (nanocode_agentic.java)

This variant uses the experimental langchain4j-agentic module. Instead of one agent with twenty tools, we have a Supervisor orchestrating specialist agents. Each specialist has a narrower set of tools: not all of them, just the ones useful for the task at hand.

To get the idea, here is a simplified look at how the sub-agents and supervisor are structured and wired together:

// 1. Specialized Tool Sets
class FileTools {
    @Tool public String read(String path, Integer offset, Integer limit) { ... }
    @Tool public String write(String path, String content) { ... }
}
class SystemTools {
    @Tool public String bash(String cmd, String dir) { ... }
}

// 2. Sub-Agent Interfaces
public interface FileAgent {
    @Agent(name = "file_specialist")
    String work(@V("task") String task);
}
public interface SystemAgent {
    @Agent(name = "system_specialist")
    String work(@V("task") String task);
}

// 3. Wiring it all together
var fileAgent = AgenticServices.agentBuilder(FileAgent.class)
        .chatModel(model).tools(new FileTools()).build();

var systemAgent = AgenticServices.agentBuilder(SystemAgent.class)
        .chatModel(model).tools(new SystemTools()).build();

SupervisorAgent supervisor = AgenticServices.supervisorBuilder()
        .chatModel(model)
        .subAgents(fileAgent, systemAgent, webSearchAgent)
        .responseStrategy(SupervisorResponseStrategy.SUMMARY)
        .build();
  • Pros: Each sub-agent has a narrower context and higher accuracy.
  • Cons: The module is still experimental and adds orchestration overhead.

Implementing the Tools

The tools themselves are simple POJOs with methods annotated with @Tool. Here is an example of the read tool, which reads a file with line numbers and provides a nice console log so you can see exactly what the agent is doing:

@Tool("Read file with line numbers")
public String read(@P("Path to the file") String path,
                   @P("Start line") Integer offset,
                   @P("Limit") Integer limit) throws IOException {
    println("\n⏺ Read(" + path + ")"); // Visual feedback
    var lines = readAllLines(Path.of(path));
    // ... logic to format lines with numbers ...
    return formattedContent;
}

These annotations tell the LLM the purpose of each tool and the meaning of each parameter, so there is no need to write the tools’ JSON schemas by hand.
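
The formatting logic elided in the read tool above could look something like the following. This is a guess at the behavior, not the actual nanocode code: the `LineNumbering` class is hypothetical, and it assumes that offset is a 0-based index of the first line to return, that limit is a maximum line count, and that null means "from the start" or "no limit" respectively.

```java
import java.util.List;

class LineNumbering {
    /** Formats lines[offset, offset + limit) with 1-based line numbers; null means no bound. */
    static String format(List<String> lines, Integer offset, Integer limit) {
        int start = offset == null ? 0 : Math.max(0, offset);
        int end = limit == null
                ? lines.size()
                : (int) Math.min(lines.size(), (long) start + Math.max(0, limit));
        StringBuilder out = new StringBuilder();
        for (int i = start; i < end; i++) {
            // Right-align the line number so the text column stays aligned
            out.append(String.format("%4d| %s\n", i + 1, lines.get(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(format(List.of("alpha", "beta", "gamma"), 1, 2));
    }
}
```

Returning line numbers alongside the content gives the model stable coordinates to refer to when it later asks for an edit or for a different slice of the same file.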


Adding some “Personal Touches”

While I moved away from the “smallest LOC / least dependencies” goal, I wanted to keep the script concise while adding features that genuinely improve the CLI experience.

ANSI Markdown Rendering

Reading raw Markdown strings in a terminal is a chore. I added a markdown() method inspired by a routine I shared in a previous article, which uses regex patterns to transform Markdown syntax into ANSI escape codes.

static String markdown(String md) {
    return md
        .replaceAll("\\*\\*(.*?)\\*\\*", BOLD + "$1" + RESET) // Bold
        .replaceAll("\\*(.*?)\\*", ITALIC + "$1" + RESET)     // Italic
        .replaceAll("(?s)```(\\w+)?\\n(.*?)\\n```", CODE_BG + "$2" + RESET) // Code blocks
        // ... more regex rules ...
}
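
For reference, here is a self-contained variant of the same idea that can be run as is. The constant values are an assumption on my side (standard ANSI SGR codes: 1 for bold, 3 for italic, 36 for cyan, 0 for reset), the `AnsiMarkdown` class name is hypothetical, and only bold, italic, and inline code are handled; it is a sketch of the regex approach, not a Markdown parser.

```java
class AnsiMarkdown {
    // Standard ANSI SGR escape sequences (assumed values, not taken from the original script)
    static final String RESET = "\u001B[0m";
    static final String BOLD = "\u001B[1m";
    static final String ITALIC = "\u001B[3m";
    static final String CODE = "\u001B[36m";

    static String markdown(String md) {
        return md
            // Bold must run before italic, otherwise "**x**" is consumed as two italic markers
            .replaceAll("\\*\\*(.*?)\\*\\*", BOLD + "$1" + RESET)
            .replaceAll("\\*(.*?)\\*", ITALIC + "$1" + RESET)
            // Inline code between single backticks, rendered in cyan
            .replaceAll("`([^`]+)`", CODE + "$1" + RESET);
    }

    public static void main(String[] args) {
        System.out.println(markdown("**Done**: updated *two* files, see `Main.java`"));
    }
}
```

Because the rules are applied one after another on the whole string, asterisks inside code spans will still be treated as emphasis; a real renderer would need to protect code spans first.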
Built-in Web Search

A coding agent is only as up-to-date as its model’s knowledge cut-off date. I added a websearch tool that leverages Gemini’s native Google Search capability, so that the coding agent can search the web for the latest information (for example, finding the latest version of a dependency in Maven Central, or how to use the latest JDK enhancement).

I created a dedicated sub-agent that takes care of the searches, simply by calling the Gemini model with Google Search enabled:

class SearchTools {
    @Tool("Search the web using Google Search")
    public String search(String query) {
        var searchModel = GoogleAiGeminiChatModel.builder()
                .apiKey(GEMINI_KEY)
                .allowGoogleSearch(true) // Native Gemini Search
                .build();
        return searchModel.chat(query);
    }
}
Conclusion

Whether you prefer the stability of a monolithic agent or the sophisticated orchestration of a multi-agent system, LangChain4j makes building these tools remarkably accessible. By combining it with the reasoning power of Gemini 3 and the modern features of Java 25, you can build a cool little coding assistant in a single file (thanks to JBang’s ability to handle dependencies and execution without the boilerplate of a project build file).

Info

You can find both implementations in my fork here:

github.com/glaforge/nanocode

I’m curious to see what others will do with this experiment. What would you add next? How would you do it differently? I’d love to see more forks that continue to explore this space while keeping the coding agent small enough to fit in a single file.

Happy coding!

https://glaforge.dev/posts/2026/04/11/a-simple-coding-agent-in-a-loop-with-langchain4j-jbang-and-gemini/
An ADK Java agent powered by Gemma 4

Today, DeepMind announced the release of Gemma 4, a very impressive and powerful new version of the Gemma family of models. As I’ve been contributing to ADK Java a fair bit recently, I was curious to see how I would configure ADK Java agents to work with Gemma 4.

In this article, we’ll explore 3 paths:

  • Calling the AI Studio API surface directly.
  • Calling Gemma 4 hosted via a vLLM instance thanks to the LangChain4j bridge.
  • Calling Gemma 4 locally via Ollama.

The third path, running Gemma 4 locally via Ollama, requires the model weights in the appropriate format and is only covered briefly at the end.

1 — The Easy Case: Gemma 4 on AI Studio

If you’re using Gemma 4 via the Google AI Studio API surface, you have to use the Gemini model builder and reference the model name:

Gemini gemma4 = Gemini.builder()
    .modelName("gemma-4-31b-it")
    .apiKey(System.getenv("GEMINI_API_KEY"))
    .build();

LlmAgent agent = LlmAgent.builder()
    .model(gemma4)
    // ... instructions and tools
    .build();

Here, Gemma 4 is exposed the same way as the Gemini models, via the same API surface. That’s why the model is an instance of Gemini.

Tip

In an upcoming release of ADK, we’ll also be able to simplify the above by just setting the model string like we do for Gemini models:

LlmAgent agent = LlmAgent.builder()
    .model("gemma-4-31b-it")
    // ... instructions and tools
    .build();
2 — Calling a vLLM hosted Gemma 4 via LangChain4j

During the beta testing period, internally at Google, my colleague Vlad was exposing the Gemma 4 model weights via vLLM, running inside a Google Cloud Run instance with GPU. And I was using his endpoint to test Gemma 4 😉

However, vLLM features an OpenAI-compatible API. So Gemma 4 on vLLM needs to be called with that API surface, not with the Gemini one.

Fortunately, the LangChain4j bridge I developed last year lets you plug in OpenAI-compatible models: you use LangChain4j’s OpenAiChatModel (or its streaming variant) to connect to the vLLM server.

Creating a Simple Agent

First, we need to configure the OpenAiChatModel (or OpenAiStreamingChatModel):

ChatModel model = OpenAiChatModel.builder()
    .modelName("gg-hf-gg/gemma-4-31b-it")
    .apiKey("YOUR_API_KEY") // A dummy key if not required by your vLLM setup
    .baseUrl("https://your-vllm-instance/v1")
    .timeout(Duration.ofMinutes(5))
    .customParameters(
        Map.of("chat_template_kwargs", Map.of("enable_thinking", true))
    )
    .build();
Important

For function calling (tool use) to work correctly with Gemma 4 on vLLM, as we shall see in further examples, you must enable the thinking capability in the chat template. This is done via the chat_template_kwargs / enable_thinking parameter, which enables thinking but also function calling at the same time.

Note

I’ve defined a long timeout, as the cold start to load the weights in memory can take up to 4 minutes! But once the Cloud Run instance is hot, Gemma 4 replies instantly.

Let’s have a look at a simple science teacher agent:

LlmAgent teacherAgent = LlmAgent.builder()
    .name("science-teacher")
    .model(LangChain4j.builder()
        .chatModel(model)
        .modelName("gg-hf-gg/gemma-4-31b-it")
        .build())
    .instruction("""
        You're a friendly science teacher
        who explains concepts simply.
        """)
    .build();

We use the LangChain4j.builder() to wrap the OpenAI compatible chat model as a Java class extending ADK’s BaseLlm class, which is the parent class of all LLMs supported by ADK.

Adding Tools (Local Java Functions)

Gemma 4’s reasoning capabilities shine when you add tools. You can expose any Java method as a tool using ADK’s FunctionTool.

LlmAgent orderAgent = LlmAgent.builder()
    .name("order-agent")
    .model(LangChain4j.builder()
        .chatModel(model)
        .modelName("gg-hf-gg/gemma-4-31b-it")
        .build())
    .instruction(
        "Use the `lookup_order` tool to retrieve order details.")
    .tools(FunctionTool.create(this, "retrieveOrder"))
    .build();

@Annotations.Schema(name = "lookup_order",
        description = "Retrieve order details by ID")
public Map<String, Object> retrieveOrder(String orderId) {
    // Your database logic here...
    return Map.of("status", "out_for_delivery");
}

In this example, we reference a local Java function to look up order details, so Gemma 4 can call it should the user ask for the status of their order.

3 — Calling Gemma 4 locally via Ollama

It’s also possible to take a third path, with Ollama’s Gemma 4 support. Thanks to the LangChain4j bridge again, you can configure Gemma 4 with the following LangChain4j chat model definition:

OllamaChatModel ollamaChatModel = OllamaChatModel.builder()
   .modelName("gemma4:e4b")
   .baseUrl("http://127.0.0.1:11434")
   .build();
Wrapping up

That’s about it for today! With ADK Java and Gemma 4, you have a powerful, flexible, and open-weight foundation for your next AI agent project! 🤖 Thanks to the LangChain4j / ADK bridge, it’s even possible to invoke Gemma via different API surfaces than Gemini’s.

Note

As a reminder, we’ve just announced ADK Java 1.0, if you want to have a refresher about the latest features and enhancements to the project.

And you can watch this YouTube video I recorded that goes through the new features, as well as a concrete ADK agent called “Comic Trip” that transforms travel photography into vintage pop-art comic illustrations. Go check out the behind-the-scenes article on how I built it.

https://glaforge.dev/posts/2026/04/02/an-adk-java-agent-powered-by-gemma-4/
Creating a Wikipedia MCP Server in Java in a Few Prompts with Skills

Since I started using Model Context Protocol (MCP) to equip my AI agents with useful tools, I’ve been looking for ways to quickly build and iterate on local servers. A few weeks ago, I shared how to easily build a local MCP server in Java with a custom skill in Gemini CLI. Today, I wanted to put that skill to the test by creating a Wikipedia MCP server.

What’s impressive is that I didn’t even have to leave my terminal or read documentation. The entire process was a conversation with Gemini CLI, leveraging its ability to search the web, find libraries, and even check migration guides!

The Interactive Process

I started by asking Gemini CLI about the Wikipedia API. Instead of guessing, I used the @search command to find the exact “contracts” for searching and retrieving pages.

The conversation went something like this:

  1. Exploring the API: I asked @search what is the contract for the Wikipedia API to search for Wikipedia pages?. Gemini found the modern Wikimedia REST API (/search/page) and the older Action API.

    [Screenshot: Using @search in Gemini CLI to find information about the Wikipedia API]

  2. Retrieving Content: I then asked how to get the actual page content. It identified the /page/html/{title} endpoint as the best way to get clean HTML.

    [Screenshot: Gemini CLI search about the full Wikipedia page retrieval]

  3. Finding a Converter: Since LLMs prefer Markdown over raw HTML (returned by the Wikipedia API), I searched for a Java library: @search how to render HTML to Markdown in Java?. It suggested CopyDown (a Java port of Turndown) as the simplest option.

    [Screenshot: Gemini CLI interactive session on how to find an HTML to Markdown converter]

One particularly “pro” move from Gemini: it noticed I wanted to use Jackson for JSON parsing. It proactively searched for the latest version, found that Jackson 3.0.0-rc4 was just released, and even checked the migration guide to ensure the new tools.jackson package names were used correctly!

Then, I asked Gemini CLI to use my JBang / LangChain4j MCP server creation skill to generate the code of my Wikipedia STDIO MCP server.

The Resulting Java Code

Once we had the blueprint, it triggered my jbang-mcp-server skill. It scaffolded the following JBang script, combining LangChain4j’s MCP support with our chosen libraries.

///usr/bin/env jbang "$0" "$@" ; exit $?
//DEPS dev.langchain4j:langchain4j-core:1.11.0
//DEPS dev.langchain4j:langchain4j-community-mcp-server:1.11.0-beta19
//DEPS org.slf4j:slf4j-simple:2.0.17
//DEPS io.github.furstenheim:copy_down:1.1
//DEPS tools.jackson.core:jackson-databind:3.0.0-rc4
//JAVA 21

import dev.langchain4j.agent.tool.Tool;
import dev.langchain4j.community.mcp.server.McpServer;
import dev.langchain4j.community.mcp.server.transport.StdioMcpServerTransport;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import io.github.furstenheim.CopyDown;
import tools.jackson.databind.ObjectMapper;
import tools.jackson.databind.JsonNode;
import tools.jackson.databind.json.JsonMapper;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class WikipediaMcpServer {

    static {
        // Configure SLF4J Simple Logger to write to System.err
        // This is crucial for MCP servers over STDIO to avoid polluting stdout
        System.setProperty("org.slf4j.simpleLogger.logFile", "System.err");
    }

    private static final Logger log = LoggerFactory.getLogger(WikipediaMcpServer.class);

    public static void main(String[] args) throws Exception {
        log.info("Starting Wikipedia MCP Server...");

        WikipediaTools tools = new WikipediaTools();
        McpServer server = new McpServer(List.of(tools));
        StdioMcpServerTransport transport = new StdioMcpServerTransport(server);

        log.info("MCP Server started successfully on STDIO.");
        new CountDownLatch(1).await();
    }

    public static class WikipediaTools {
        private final HttpClient httpClient = HttpClient.newBuilder().followRedirects(HttpClient.Redirect.NORMAL).build();
        private final ObjectMapper objectMapper = JsonMapper.builder().build();
        private final CopyDown copyDown = new CopyDown();
        private final String USER_AGENT = "WikipediaMcpServer/1.0 (contact@example.com)";

        @Tool("Search Wikipedia for a given query and return a list of matching page titles and brief descriptions. Use this to find the exact page title before retrieving content.")
        public String searchWikipedia(String query) throws Exception {
            log.info("Searching Wikipedia for: {}", query);
            String encodedQuery = URLEncoder.encode(query, StandardCharsets.UTF_8);
            String url = "https://en.wikipedia.org/w/rest.php/v1/search/page?q=" + encodedQuery + "&limit=5";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(url))
                    .header("User-Agent", USER_AGENT)
                    .GET()
                    .build();

            HttpResponse<String> response = httpClient.send(request, HttpResponse.BodyHandlers.ofString());

            if (response.statusCode() != 200) {
                return "Error searching Wikipedia: " + response.statusCode() + " - " + response.body();
            }

            JsonNode rootNode = objectMapper.readTree(response.body());
            JsonNode pagesNode = rootNode.get("pages");
            StringBuilder result = new StringBuilder("Search Results:\n\n");

            if (pagesNode != null && pagesNode.isArray()) {
                for (JsonNode page : pagesNode) {
                    String title = page.has("title") ? page.get("title").asText() : "";
                    String description = page.has("description") ? page.get("description").asText() : "No description";
                    result.append("- **").append(title).append("**: ").append(description).append("\n");
                }
            }

            if (result.length() == "Search Results:\n\n".length()) {
                return "No results found for query: " + query;
            }

            return result.toString();
        }

        @Tool("Retrieve the content of a specific Wikipedia page by its exact title, converted to Markdown format. Use the exact title returned by searchWikipedia.")
        public String getWikipediaPageContent(String title) throws Exception {
            log.info("Retrieving Wikipedia page: {}", title);
            String encodedTitle = URLEncoder.encode(title.replace(" ", "_"), StandardCharsets.UTF_8);
            String url = "https://en.wikipedia.org/api/rest_v1/page/html/" + encodedTitle;

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(url))
                    .header("User-Agent", USER_AGENT)
                    .GET()
                    .build();

            HttpResponse<String> response = httpClient.send(request, HttpResponse.BodyHandlers.ofString());

            if (response.statusCode() != 200) {
                return "Error retrieving Wikipedia page: " + response.statusCode() + " - " + response.body();
            }

            String htmlContent = response.body();
            log.info("Converting HTML to Markdown for page: {}", title);
            return copyDown.convert(htmlContent);
        }
    }
}

I didn’t touch the code at all, and it worked flawlessly out of the box. The only thing I would change is the LangChain4j dependency version, which is not the latest one. But that’s about it.
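A side note on how the script above builds its URLs: `URLEncoder` applies form encoding, so it turns spaces into `+`. That is fine in the search query string, and the page lookup sidesteps it by replacing spaces with underscores before encoding. The minimal sketch below reproduces the two string constructions in isolation; the class name `WikiUrls` and its method names are made up for illustration and are not part of the generated script.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class WikiUrls {
    // Same construction as in searchWikipedia(): the text goes into the query string
    static String searchUrl(String query) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return "https://en.wikipedia.org/w/rest.php/v1/search/page?q=" + encoded + "&limit=5";
    }

    // Same construction as in getWikipediaPageContent(): spaces become underscores first
    static String pageUrl(String title) {
        String encoded = URLEncoder.encode(title.replace(" ", "_"), StandardCharsets.UTF_8);
        return "https://en.wikipedia.org/api/rest_v1/page/html/" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("Artemis II"));
        System.out.println(pageUrl("Artemis II"));
        System.out.println(pageUrl("AC/DC"));
    }
}
```

Characters that would otherwise break a path segment, like the slash in "AC/DC", end up percent-encoded.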

Configuring the MCP Server in Gemini CLI

To use this server, you just need to register it in your ~/.gemini/settings.json. Gemini CLI will then automatically launch it as a child process and communicate with it over Standard Input/Output.

Add the following to your mcpServers section:

{
  "mcpServers": {
    "wikipedia-mcp": {
      "command": "jbang",
      "args": [
        "run",
        "--quiet",
        "/path/to/your/WikipediaMcpServer.java"
      ]
    }
  }
}

Once the MCP configuration was saved and Gemini CLI reloaded, I checked that the MCP server was available by running the /mcp list command:

List of available MCP servers tools in Gemini CLI

There are 2 tools available: one for fetching a list of relevant pages, and the other to fetch the content of an individual page.

Putting it to the test

Yesterday, the 🚀 Artemis 2 mission launched to travel around the moon 🌔. I’m sure Wikipedia has already been updated with the latest information about the status of the mission. Let’s double-check:

Asking about the Artemis 2 mission

It found relevant pages, loaded their content, located the information, and summarized its findings:

Response about the status of the Artemis 2 mission

Mission in progress! But my own personal mission over those few interactive prompts inside Gemini CLI is accomplished: in less than 5 minutes, I had my custom MCP server to query Wikipedia! It actually took me more time to write this article!

Wrapping up

My winning combo of the day: Gemini CLI, agent skills, Java, JBang, and LangChain4j… and boom 💥

Maybe we’ll find that obvious in a few months, but I’m still impressed by how easy it is today to create, in just a few prompts, something useful like an MCP server that your friendly AI agents can use. Building tools for your AI agents has never been this fluid.

Happy hacking! 🚀 And let’s go to the 🌔 and beyond!

https://glaforge.dev/posts/2026/04/02/creating-a-wikipedia-mcp-server-in-java-in-a-few-prompts/
Building my Comic Trip agent with ADK Java 1.0

I’m happy to echo here the release of ADK for Java v1.0, Google’s Agent Development Kit framework to build AI agents in Java. I spent a lot of time on this project. I also wrote the announcement blog post on the Google for Developers blog. And I’ve recorded this YouTube video highlighting some of the new features of the framework, in which I’m demonstrating some of them via an app I built: my Comic Trip agent (pun intended).

Screenshot of the Comic Trip app

The Comic Trip agent is a fun little application that transforms your travel photography into a vibrant, pop-art comic strip experience. Beyond the visual style, it also guesses locations (thanks to Gemini) and enriches each “panel” with nearby points of interest (via Google Maps integration).

For a recap and a demo of what this application is capable of, please check out the YouTube video:

This project is a showcase for the Agent Development Kit (ADK) for Java 1.0, demonstrating how to build multi-agent systems with:

  • smart models — Gemini and 🍌 Nano Banana,
  • ADK tools — 🔍 Google Search, 🗺️ Google Maps,
  • ADK services — Google Cloud Storage artifact service, and
  • cloud-native storage integration — Google Cloud Firestore database.

Source code available on GitHub

If you’re interested in seeing the source code of this application, please have a look at the repository on GitHub. The repository also explains how to build and deploy this application on Google Cloud Run. And of course, you can run it locally on your machine as well.

Powered by ADK for Java 1.0

ADK for Java 1.0 makes it easier to build and coordinate AI agents. In this application, I leverage several key concepts:

  • App & Plugins: The entire agent hierarchy is encapsulated within an App. We use the LoggingPlugin for seamless execution observability and debugging. Via this App shell, the logging plugin is actually applied to all the sub-agents involved in the multi-agent system. No need to configure each sub-agent individually.

  • Runners & Sessions: An InMemoryRunner manages the execution flow, while an InMemorySessionService ensures that the context for each user’s trip is isolated and persistent throughout the multi-step process.

  • Specialized Agents: We utilize a variety of agent types, including

    • LlmAgent for LLM-based tasks,
    • SequentialAgent for step-by-step flows, and
    • ParallelAgent for running multiple agents in parallel.

Multi-Agent Architecture

The core intelligence of the Comic Trip agent is driven by the following agent hierarchy:

  1. picture_analyzer_agent (Gemini 3 Flash): The entry point of our flow. It analyzes the uploaded photograph to extract a detailed description and identify the location. It’s impressive to see how good Gemini is at guessing the location of a landmark anywhere in the world!

  2. poi_and_comic_flow (Parallel Execution): Once the context is established, two specialized agents run in parallel:

    • comic_illustrator_agent (Gemini 3.1 Flash Image): This multimodal agent (also known as 🍌 “Nano Banana 2”) transforms the original image into a pop-art masterpiece.

    • points_of_interest_agent (Gemini 2.5 Flash): Equipped with the GoogleMapsTool, it searches for nearby attractions based on the identified location, adding depth to the travel experience.

To configure the Google Maps integration, you just need to add the .tools(new GoogleMapsTool()) call to your agent definition:

LlmAgent poiGoogleMapsAgent = LlmAgent.builder()
    .name("points_of_interest_agent")
    .model("gemini-2.5-flash")
    .instruction("""
        Given the location in:
        {description_and_location}

        Please list points of interest (POI)
        in the area no further than a kilometer away
        using the `google_maps` tool.

        Each POI should have a name and a description.

        Don't mention distances in your response.
        And don't start with introductory text for the list.
        """)
    .tools(new GoogleMapsTool())
    .outputKey(OUTPUT_KEY_POINTS_OF_INTEREST)
    .build();

The entire process is orchestrated by a SequentialAgent named main_flow, which strictly orders the initial analysis before triggering the parallel phase.
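To picture what main_flow does, here is a plain-Java analogy built on `CompletableFuture`. It is not ADK code: the class name, the placeholder strings, and the state keys `comic_image_id` and `points_of_interest` are assumptions made for illustration; only `description_and_location` appears in the instruction above.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

class MainFlowSketch {
    static Map<String, String> run(String photoName) {
        // Shared state, standing in for the session state that agents read and write
        Map<String, String> state = new ConcurrentHashMap<>();

        // Step 1 (sequential): the analysis has to finish before anything else starts
        state.put("description_and_location", "description and location of " + photoName);

        // Step 2 (parallel): both branches read the output of step 1
        String context = state.get("description_and_location");
        CompletableFuture<Void> illustrator = CompletableFuture.runAsync(
            () -> state.put("comic_image_id", "comic for [" + context + "]"));
        CompletableFuture<Void> pointsOfInterest = CompletableFuture.runAsync(
            () -> state.put("points_of_interest", "POIs near [" + context + "]"));

        // Wait for both branches before handing back the combined state
        CompletableFuture.allOf(illustrator, pointsOfInterest).join();
        return state;
    }

    public static void main(String[] args) {
        run("eiffel-tower.jpg").forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The point of the ordering is that neither parallel branch can start without the description and location, while the two branches do not depend on each other.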

Architecture diagram of the various agents and flows

Quick tip

The above diagram was generated by a tool I vibe-coded a while ago: it takes your multi-agent source code and creates diagrams that visualize the flow of agents and their sub-agents.

The Visuals: Multimedia Generation and Artifacts

Generating a high-quality comic panel is just the first step. ADK 1.0 handles the resulting multimedia artifacts as well.

The comic_illustrator_agent uses an afterModelCallback to intercept the generated image bytes. These bytes are then persisted using the GcsArtifactService, which automatically handles the upload to a Google Cloud Storage bucket. This integration ensures that generated media is stored and is easily accessible via public URLs (for the frontend):

gs://comic-trip-picture-bucket/comic_trip_app/comic_trip_user/{tripId}/{imageId}.png/0
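The `gs://` form is a storage URI rather than something a browser can load. Assuming the objects in the bucket are publicly readable, the matching HTTPS address follows the `https://storage.googleapis.com/<bucket>/<object>` pattern. The helper below is a hypothetical sketch (the class name, method name, and the sample trip and image IDs are made up), and it does not percent-encode special characters in object names.

```java
class GcsUrls {
    // Converts gs://bucket/object into the public HTTPS form (only valid for publicly readable objects)
    static String toPublicUrl(String gsUri) {
        String prefix = "gs://";
        if (!gsUri.startsWith(prefix)) {
            throw new IllegalArgumentException("Not a gs:// URI: " + gsUri);
        }
        return "https://storage.googleapis.com/" + gsUri.substring(prefix.length());
    }

    public static void main(String[] args) {
        // Sample IDs are placeholders, not real objects
        System.out.println(toPublicUrl(
            "gs://comic-trip-picture-bucket/comic_trip_app/comic_trip_user/trip42/img7.png/0"));
    }
}
```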

The code in the callback goes through the Content and Parts to find parts that contain the generated image (in an inlineData field), and we save it as an artifact, via the saveArtifact() method of the callbackContext:

.afterModelCallback((callbackContext, llmResponse) ->
    Maybe.fromOptional(llmResponse.content()
        .flatMap(Content::parts)
        .stream()
        .flatMap(List::stream)
        .filter(part -> part.inlineData().isPresent())
        .findFirst()
        .flatMap(part -> {
            String imageId = generateId();
            callbackContext.saveArtifact(imageId + ".png", part)
                .blockingAwait();

            return Optional.of(llmResponse.toBuilder()
                .content(Content.fromParts(Part.fromText(imageId)))
                .build());
        })))

Backend Architecture: Quarkus & Cloud Services

The application’s web backend is built with the Quarkus framework, and deployed on Google Cloud Run. It also serves the frontend assets.

When a user uploads a batch of images, the MissionControlResource receives the multipart request. To ensure maximum throughput, it uses Java 21 Virtual Threads to run the ComicTripAnalyzer agent for each image in parallel.

Warning

As you might notice, I didn’t use a ParallelAgent here, but let the Quarkus controller handle the parallelization for me. An ADK parallel agent makes sense when you have a known, discrete set of sub-tasks to run in parallel, but in my case, I don’t know in advance how many images I’ll receive. One alternative, though, would be to create a custom subclass of BaseAgent to handle the fan-out / fan-in in the agent graph instead of at the level of the web controller.
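As a rough illustration of that controller-level fan-out, here is a minimal sketch using Java 21 virtual threads. It is not the application's actual code: `ImageFanOut`, `analyze`, and `processAll` are hypothetical names, and `analyze` is a placeholder for the per-image agent run.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ImageFanOut {
    // Placeholder for the per-image work (the real app would run its agent here)
    static String analyze(String imageName) {
        return "panel for " + imageName;
    }

    static List<String> processAll(List<String> imageNames) {
        // One virtual thread per image, however many images the request contains
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            // Fan-out: toList() forces every task to be submitted before any join
            List<CompletableFuture<String>> futures = imageNames.stream()
                .map(name -> CompletableFuture.supplyAsync(() -> analyze(name), executor))
                .toList();
            // Fan-in: collect the results in upload order
            return futures.stream().map(CompletableFuture::join).toList();
        }
    }

    public static void main(String[] args) {
        processAll(List.of("paris.jpg", "rome.jpg", "tokyo.jpg")).forEach(System.out::println);
    }
}
```

Because the number of tasks is only known at request time, a per-task executor fits better here than a fixed set of parallel branches.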

While the images live in GCS, the trip’s metadata and enriched details (descriptions, POIs, and image links) are stored in Google Cloud Firestore. This save is also triggered at the web controller level, once the agents have finished processing every image.

Frontend: Vibe-Coded with Stitch and Antigravity

The user interface of the Comic Trip agent was designed by Google Stitch and implemented using Antigravity.

Screenshot of the Stitch user interface where the frontend was designed

The frontend interacts with the backend via a simple REST API. Images are sent as multipart/form-data, and the backend returns a comprehensive JSON response, after reading the metadata from Firestore. The frontend then dynamically renders the “Comic Strip” view, pulling the comic-styled illustrations directly from the GCS bucket URLs.

What’s next?

The Comic Trip app shows that you can build sophisticated, multi-agent tools without having to worry about low-level state or orchestration. ADK for Java 1.0 handles the plumbing, so you can focus on defining how your agents behave and what tools they use.

If you’re ready to start building:

Happy building! 🤖

https://glaforge.dev/posts/2026/03/30/building-my-comic-trip-agent-with-adk-java-1-0/
Generating music with Lyria 3 and the Gemini Interactions Java SDK

Generative AI isn’t just about text or images (with Nano Banana) but it’s also great at generating videos (with Veo 3). And now with the recently released Lyria 3 model from DeepMind, you can create some engaging and creative music with lyrics (generated, or your own) or invent a calming instrumental track to loop in the background of your online TikTok or YouTube Shorts.

And of course, if you’re a Java developer like me, you can do all that in Java!

In this article, we’ll learn how to create our own songs and clips with the Lyria 3 model, in Java, using my Gemini Interactions API Java SDK.

Idea

The examples in this article are inspired by this Colab Notebook in Python. Don’t hesitate to check it out. And if you want to learn more, this article shows some fun app integration ideas, like an alarm clock waking you up with a different song every morning!

Info

To run those examples, you’ll need to get a Gemini API key in Google AI Studio. You’ll then be able to instantiate a Gemini Interactions API client as follows:

GeminiInteractionsClient client = GeminiInteractionsClient.builder()
       .apiKey(System.getenv("GEMINI_API_KEY"))
       .build();

Clip vs. Pro: The Models

Lyria 3 comes in two primary flavors:

  • lyria-3-clip-preview — Perfect for generating short clips (30 seconds long), snippets, or quick iterations for sound effects, choruses, and jingles.
  • lyria-3-pro-preview — Fully capable of generating long, structurally cohesive, full-length songs (up to 3 minutes).

Here’s an example of firing off a request for a full-length song using the SDK:

ModelInteractionParams request = ModelInteractionParams.builder()
    .model("models/lyria-3-pro-preview")
    .input("""
        Write a full length epic power metal song
        about a brave knight fighting a dragon.
        It should have a guitar solo.
        """)
    .responseModalities(
        Interaction.Modality.AUDIO,
        Interaction.Modality.TEXT)
    .build();

Interaction interaction = client.create(request);

In addition to the MP3, Lyria also generated the following lyrics:

[[A0]]
[[B1]]
[16.0:] Cold mountain peaks in the morning haze,
[:] The knight rides forth through the silver maze,
[:] With steel in hand and a heart of fire,
[:] To face the beast and the burning pyre.
[:] Through ancient gates where the shadows sleep,
[:] He finds the path to the valley deep,
[:] No fear of death in his iron soul,
[:] He seeks the fire and the final goal.
[[C2]]
[48.0:] Upon the wind comes the dragon’s breath!
[:] A storm of flame and a dance of death!
[:] Oh, carry the flame on the steel of the knight!
[:] Into the dragon, into the light!
[:] The world will tremble as titans collide!
[:] Nowhere for the ancient beast to hide!
[:] Into the light!
[[B3]]
[80.0:] The claws of iron and teeth of obsidian,
[:] A mountain of scales in the dark stygian,
[:] The sword strikes home but the sparks do fly,
[:] Underneath the heavy sulfur sky.
[:] A roar that echoes through the mountain hall,
[:] The knight stands steady, he will not fall,
[:] Through smoke and cinders the legend grows,
[:] He strikes the heart where the furnace glows!
[[C4]]
[112.0:] Upon the wind comes the dragon’s breath!
[:] A storm of flame and a dance of death!
[:] Oh, carry the flame on the steel of the knight!
[:] Into the dragon, into the light!
[:] The world will tremble as titans collide!
[:] Nowhere for the ancient beast to hide!
[:] Into the light!
[[A5]]
[[B6]]
[160.0:] The wings are broken, the fire is out,
[:] The knight is standing amidst the doubt,
[:] A savior's light in the dark of the cave,
[:] To the halls of legend, he’s wise and brave.
[[D7]]
[176.0:] The legend remains. Forevermore!
[:] (Forevermore!)

MP3 Decoding and Dual Modalities

One of the coolest features of the Interactions API is the ability to request multiple Response Modalities. Notice the responseModalities parameter in the code snippet above? By requesting both AUDIO and TEXT, the API will return:

  1. Text: The actual lyrics generated and the structural breakdown of the song.
  2. Audio: The music itself natively encoded as an MP3 file.
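The text output shown earlier mixes section markers such as `[[B1]]` with lyric lines prefixed by `[16.0:]` or `[:]`. Judging only from that sample (this is an inferred format, not a documented one), the leading number looks like a start time in seconds. Under that assumption, a small parser could be sketched as follows; `LyricsParser` and `LyricLine` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LyricsParser {
    // One lyric line: its section marker, its start time (null when absent), and its text
    record LyricLine(String section, Double startSeconds, String text) {}

    private static final Pattern SECTION = Pattern.compile("^\\[\\[(\\w+)\\]\\]$");
    private static final Pattern LINE = Pattern.compile("^\\[(\\d+(?:\\.\\d+)?)?:\\]\\s*(.*)$");

    static List<LyricLine> parse(String lyrics) {
        List<LyricLine> lines = new ArrayList<>();
        String currentSection = null;
        for (String raw : lyrics.split("\\R")) {
            String line = raw.strip();
            Matcher section = SECTION.matcher(line);
            Matcher lyric = LINE.matcher(line);
            if (section.matches()) {
                // A marker like [[B1]] opens a new section
                currentSection = section.group(1);
            } else if (lyric.matches()) {
                // [16.0:] carries a number, [:] does not
                Double start = lyric.group(1) == null ? null : Double.valueOf(lyric.group(1));
                lines.add(new LyricLine(currentSection, start, lyric.group(2)));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String sample = "[[B1]]\n[16.0:] Cold mountain peaks in the morning haze,\n[:] The knight rides forth";
        parse(sample).forEach(System.out::println);
    }
}
```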

Because the API returns the audio already encoded as MP3, you don’t need to do any complex WAV header manipulation or PCM decoding. You can safely extract the returned bytes from the payload and write them directly to disk as an .mp3 file:

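interaction.outputs().stream()
    .filter(output -> output instanceof AudioContent)
    .map(output -> (AudioContent) output)
    .findFirst()
    .ifPresent(audio -> {
        try {
            Files.write(Paths.get("song.mp3"), audio.data());
        } catch (IOException e) {
            // Files.write throws a checked IOException, which the lambda cannot propagate
            throw new UncheckedIOException(e);
        }
    });

Prompt Constraints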

The Lyria 3 model is highly receptive to prompting constraints. Depending on what you pass into the text input, here are four major ways to dictate the song output:

1. Give it Lyrics

You don’t have to rely on the model inventing lyrics. If you have written your own song, you can literally paste the lyrics into the prompt:

ModelInteractionParams request = ModelInteractionParams.builder()
    .model("models/lyria-3-clip-preview")
    .input("""
        An uplifting song with guitar riffs about nano banana.
        The lyrics should be:
          Yellow peel, a tiny sweet,
          The Nano Banana, a tropical treat.
          But wait—it hums, it starts to create,
          Switching into AI mode...
        """)
    .responseModalities(
        Interaction.Modality.AUDIO,
        Interaction.Modality.TEXT)
    .build();

Interaction interaction = client.create(request);

2. Control the Structure

You can instruct the model on song composition layout by using bracketed metadata such as [Intro], [Verse], [Chorus], and [Outro].

ModelInteractionParams request = ModelInteractionParams.builder()
    .model("models/lyria-3-clip-preview")
    .input("""
        [Intro] Calm piano music setting a sunset scene on the beach
        [Verse] Epic rock ballad as the storm rages.
        [Outro] Opera with choir as the sun reappears
                again through the black clouds.
        """)
    .responseModalities(
        Interaction.Modality.AUDIO,
        Interaction.Modality.TEXT)
    .build();

Interaction interaction = client.create(request);

3. Instrumental Only

If you aren’t looking for lyrics or vocals, simply instruct the model that the track should be instrumental. It excels at generating ambient background loops!

ModelInteractionParams request = ModelInteractionParams.builder()
    .model("models/lyria-3-clip-preview")
    .input("""
        Create a looping meditation music that feels like the wind.
        Instrumental only.
        """)
    .responseModalities(Interaction.Modality.AUDIO)
    .build();

Interaction interaction = client.create(request);

4. Give it a Picture for Inspiration!

Since Lyria 3 is a multimodal model, it accepts not only a text prompt as input: you can also pass images to drive its generative inspiration:

// picture of a groceries list which will drive the lyrics
byte[] imageBytes = URI.create(
    "https://storage.googleapis.com/generativeai-downloads/images/groceries.jpeg")
    .toURL()
    .openStream()
    .readAllBytes();

ModelInteractionParams request = ModelInteractionParams.builder()
    .model("models/lyria-3-clip-preview")
    .input(
        new TextContent("""
            An epic song with opera voices about this quest.
            Deep synths and a speeding up tempo.
            """),
        new ImageContent(imageBytes, "image/jpeg")
    )
    .responseModalities(
        Interaction.Modality.AUDIO,
        Interaction.Modality.TEXT)
    .build();

Interaction interaction = client.create(request);

Wrap Up

Adding music generation to Java applications or AI agents is easier than ever with the Interactions API and Lyria 3. I highly recommend taking a look at the newly added test cases over in the LyriaTest.java class within the SDK repository to see the full setup in action.

Happy prompting & rocking! 🎸

https://glaforge.dev/posts/2026/03/25/generating-music-with-lyria-3-and-the-gemini-interactions-java-sdk/
Extracting JSON from LLM chatter with JsonSpotter

LLMs are great at generating structured data, in particular those that support a strict JSON output mode (sometimes also called structured decoding), but sometimes they give you a bit more than the requested JSON. You get a Markdown code block wrapped in “Here’s the data you asked for:” and “Hope this helps!”. If you’re lucky, the JSON is valid. If you’re not, it has trailing commas or comments that break standard parsers.

I wrote JsonSpotter to handle this. It’s a small Java library that finds and extracts JSON-like structures from any text, even if the JSON itself is a bit messy. Then, you can use a lenient JSON parser like Jackson 3 to parse the extracted JSON to work with proper type-safe Java objects, instead of text or maps of lists of maps of… more basic types.

Why not just use Regex?

You could try indexOf("{") or a regular expression, but those break quickly. Nested objects, extra braces in the conversational text, or complex arrays make string manipulation a nightmare.

JsonSpotter uses a recursive descent parser. It actually understands the structure it’s looking for. It doesn’t just find brackets; it validates the object or array boundaries structurally as it scans. By doing so, it can accurately find the longest well-balanced JSON-like structure in the text, ensuring that nested objects are handled correctly and that it doesn’t get tripped up by random braces in the conversational text.
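To give a feel for why structure-aware scanning beats `indexOf("{")`, here is a deliberately simplified bracket-matching scan. It is not JsonSpotter's implementation: it uses a stack instead of a recursive descent parser, returns the first balanced span rather than the longest one, and only understands double-quoted strings. The class and method names are made up.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BalancedJsonScan {
    // Returns the first balanced {...} or [...] span in the text, or null if there is none
    static String firstBalanced(String text) {
        for (int start = 0; start < text.length(); start++) {
            char c = text.charAt(start);
            if (c == '{' || c == '[') {
                int end = matchingEnd(text, start);
                if (end >= 0) {
                    return text.substring(start, end + 1);
                }
            }
        }
        return null;
    }

    // Index of the bracket closing the one at 'start', or -1 if the span is not balanced
    private static int matchingEnd(String text, int start) {
        Deque<Character> expected = new ArrayDeque<>();
        boolean inString = false;
        for (int i = start; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                expected.push('}');
            } else if (c == '[') {
                expected.push(']');
            } else if (c == '}' || c == ']') {
                if (expected.isEmpty() || expected.pop() != c) {
                    return -1; // mismatched closer
                }
                if (expected.isEmpty()) {
                    return i; // the opening bracket is now closed
                }
            }
        }
        return -1; // ran out of text with brackets still open
    }

    public static void main(String[] args) {
        System.out.println(firstBalanced("Use {braces carefully. {\"ok\": true} Hope this helps!"));
    }
}
```

A stray `{` in the chatter is skipped because its span never balances, and a `}` inside a string value does not close anything.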

Handling “lenient” JSON

Sometimes LLMs output what I call “human-friendly” JSON. They don’t always output strict and valid JSON. They can add comments to explain fields, add an ellipsis to omit parts of the content, or leave trailing commas. By default, standard libraries like Jackson or Gson will throw an error immediately when they see a // comment or a trailing ,.

JsonSpotter is built to be lenient during extraction. It recognizes:

  • Single and multi-line comments (//, /* */, #)
  • Unquoted keys and single-quoted strings
  • Trailing commas
  • Non-standard numbers like .5 or NaN

Once JsonSpotter extracts the raw string, you can pass it to a proper JSON parser configured for leniency (i.e. being tolerant to malformed JSON content).

A quick example

First, extract the JSON from your LLM response:

String rawText = "..."; // Text returned by your LLM
String jsonString = JsonSpotter.extractJson(rawText);

Then, parse it with something like Jackson 3 (which has great support for lenient features):

import tools.jackson.core.json.JsonReadFeature;
import tools.jackson.databind.JsonNode;
import tools.jackson.databind.json.JsonMapper;

// Enable the lenient read features matching what the extracted text may contain
JsonMapper mapper = JsonMapper.builder()
    .enable(JsonReadFeature.ALLOW_JAVA_COMMENTS)
    .enable(JsonReadFeature.ALLOW_TRAILING_COMMA)
    .enable(JsonReadFeature.ALLOW_SINGLE_QUOTES)
    .enable(JsonReadFeature.ALLOW_UNQUOTED_PROPERTY_NAMES)
    .build();

MyClass myObj = mapper.readValue(jsonString, MyClass.class);
// or
JsonNode node = mapper.readTree(jsonString);

Get it

JsonSpotter is dependency-free and available on Maven Central.

Add the dependency to your pom.xml:

<dependency>
    <groupId>io.github.glaforge.jsonspotter</groupId>
    <artifactId>jsonspotter</artifactId>
    <version>0.1.2</version>
</dependency>

Or your build.gradle:

dependencies {
    implementation("io.github.glaforge.jsonspotter:jsonspotter:0.1.2")
}

Give it a try and let me know what you think on GitHub if you find it useful.

https://glaforge.dev/posts/2026/03/22/extracting-json-from-llm-chatter-with-jsonspotter/
Fixing AI Slop with a Skill in Gemini CLI

We all recognize AI writing when we see it. As language models get used for everything, their specific tics are everywhere. People call this “AI slop”. The grammar is fine, but the text is boring, repetitive, and lacks any real voice.

A site called tropes.fyi tracks these patterns. It lists the exact words and structures that give AI away. You’ll see callouts for overused adverbs like “deeply” and “arguably,” the dreaded “delve” family, and structural crutches like “It’s not X, it’s Y.”

I wanted a way to automatically fix these issues in my own generated text. So, I built a deslopify skill for Gemini CLI.

How I built it

Gemini CLI has a built-in skill-creator that handles the boilerplate. I just asked the CLI to create a new skill, and it generated the folder structure and a SKILL.md file. As a reference, it used the Markdown file shared by tropes.fyi, which lists all the typical elements of AI sloppiness.

The SKILL.md file tells the agent how to behave. I wrote a short workflow instructing the agent to read the user’s text, cross-reference it with the anti-patterns from tropes.fyi, and rewrite it. The goal is just to make the text sound like a normal person wrote it.

Here’s what the skill generated by Gemini CLI looks like:

deslopify generated skill

The structure of the skill is just this SKILL.md file, and a references/ directory containing the style guide:

deslopify/
├── SKILL.md
└── references/
    └── style_guide.md

I didn’t even have to modify the generated SKILL.md as it did the job perfectly, and was clear out of the box.

Trying it out

The code is up on GitHub: glaforge/deslopify.

You can install it directly from the repo in Gemini CLI:

gemini skills install https://github.com/glaforge/deslopify

Installing the deslopify skill within Gemini CLI

Then reload your interactive session:

/skills reload

Information

Because the skill format is just a directory with a SKILL.md file and some assets, you can actually use it in any AI agent that supports the standard. Be sure to check the syntax and command to install it in your favorite agent. And feel free to read more about agent skills to understand their structure.

Once it’s loaded, you can ask Gemini CLI (or your preferred agent tool) to clean up your text:

  • "Deslopify this article: [URL]"
  • "Naturalize this draft: [paste text]"
  • "Remove the AI slop from my README.md file."

The agent will strip out the pompous phrasing and weird structural tics, leaving you with something far more readable.

Is this text generated?

If you were wondering, yes, I generated a good chunk of this blog post from my session in Gemini CLI when creating the deslopify skill! I went through each and every sentence and made some light edits or sometimes added a paragraph or extra clarification sentence. I also added all the screenshots, added missing links. But the skill worked, and for quickly sharing details about a little project like this, this skill came in handy to avoid the blank page syndrome. Hopefully, you won’t feel bored reading it!

https://glaforge.dev/posts/2026/03/08/fixing-ai-slop-with-a-skill-in-gemini-cli/
Easily Build a Local MCP Server in Java with a Skill inside Gemini CLI

Recently, I’ve been exploring the Model Context Protocol (MCP) and how to easily create custom servers to extend the capabilities of AI assistants like Gemini CLI, which I use daily.

I wanted a way to build these servers in Java without the heavy boilerplate of a traditional Maven or Gradle project, and without a complex framework. The solution? Combining JBang, LangChain4j, and… 🥁… a custom Gemini CLI skill!

In this post, I’ll walk you through how I streamlined the creation of MCP STDIO servers, by creating an agent SKILL.md to replicate what I had learned in my previous article.

The Recap: JBang and LangChain4j

In that recent article, I used JBang to write and run self-contained Java scripts. It handles dependencies and JVM execution automatically, which makes it a good fit for lightweight MCP servers. No directory structure, no build files, no pre-compilation.

LangChain4j’s recent release provided the langchain4j-community-mcp-server module, which allows you to create STDIO MCP servers without the need for a server framework. By simply annotating a method of a Java class with @Tool, you can expose it as a tool to an LLM.

Beware

The critical requirement for an MCP STDIO server is ensuring that JSON-RPC communication over System.out remains uncorrupted. This means all logging must be redirected to System.err.

But to create and install a new MCP server, I had to do a fair bit of copy and paste, and a bit of scaffolding. That’s how I came up with the 💡 idea of creating an agent skill to simplify this task!

Step 1 — Automating with a Skill for Gemini CLI

Instead of writing the boilerplate manually every time, I first created a custom Gemini CLI skill (jbang-mcp-server.skill). To do this efficiently, I leveraged Gemini CLI’s own skill-creator skill, which is designed to bootstrap new capabilities for the agent.

Requesting a SKILL.md Creation by Gemini CLI’s Skill Creator

I fed the skill creator the whole article mentioned above, and it produced the following SKILL.md file to act as a specialized “generator” for automating the repetitive parts of building and installing Java-based MCP servers:

---
name: jbang-mcp-server
description: Scaffolds and installs zero boilerplate Java-based MCP STDIO servers using JBang and LangChain4j for Gemini CLI. Use this to quickly bootstrap an MCP server from scratch.
---

# JBang LangChain4j MCP Server Creator

This skill helps quickly scaffold a new Java-based MCP STDIO server using JBang and LangChain4j, and installs it into Gemini CLI's `settings.json`.

## Process

1.  **Ask User for Details:**
    *   Desired file name (e.g., `McpToolServer.java`) and path to save it.
    *   The name of the server to register in `~/.gemini/settings.json` (e.g., `java-calc`).
    *   (Optional) High-level description of the tools they want to add initially.

2.  **Scaffold the Server:**
    *   Read the template file located at `assets/McpServerTemplate.java` using the `read_file` tool. Note that the path to `assets/McpServerTemplate.java` needs to be resolved relative to the skill directory or read from the skill's bundled assets. As an alternative if the absolute path is unknown, directly write out the template contents described below.
    *   Replace `{SERVER_CLASS_NAME}` with the base name of the requested Java file (e.g., `McpToolServer` if file is `McpToolServer.java`).
    *   Replace `{TOOL_CLASS_NAME}` with a related name (e.g., `MyTools`).
    *   (Optional) Modify the `@Tool` annotated methods to reflect the user's requirements.
    *   Write the finalized content to the user's requested path using the `write_file` tool.
    *   Make the file executable using `chmod +x <path_to_file>` via `run_shell_command`.

3.  **Verify the Server:**
    *   Run `jbang build <path_to_file>` using the `run_shell_command` tool to check for any compilation errors.
    *   If there are compilation errors, use the `replace` tool to fix them.
    *   Repeat the compilation check until successful.

4.  **Install the Server in Gemini CLI:**
    *   Read `~/.gemini/settings.json`.
    *   Use the `replace` tool or jq via `run_shell_command` to inject a new entry under `mcpServers`.
    *   The new entry should look like this:
        ```json
        "{server_name}": {
          "command": "jbang",
          "args": [
            "run",
            "--quiet",
            "{absolute_path_to_java_file}"
          ]
        }
        ```
    *   Inform the user that the server has been created and configured, and remind them that Gemini CLI automatically reloads configurations.

## Template Backup
If `assets/McpServerTemplate.java` cannot be read, use this template:
```java
///usr/bin/env jbang "$0" "$@" ; exit $?
//DEPS dev.langchain4j:langchain4j-core:1.11.0
//DEPS dev.langchain4j:langchain4j-community-mcp-server:1.11.0-beta19
//DEPS org.slf4j:slf4j-simple:2.0.17
//JAVA 21

import dev.langchain4j.agent.tool.Tool;
import dev.langchain4j.community.mcp.server.McpServer;
import dev.langchain4j.community.mcp.server.transport.StdioMcpServerTransport;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.List;
import java.util.concurrent.CountDownLatch;

public class {SERVER_CLASS_NAME} {

    static {
        // Configure SLF4J Simple Logger to write to System.err
        // This is crucial for MCP servers over STDIO to avoid polluting stdout
        System.setProperty("org.slf4j.simpleLogger.logFile", "System.err");
    }

    private static final Logger log = LoggerFactory.getLogger({SERVER_CLASS_NAME}.class);

    public static void main(String[] args) throws Exception {
        log.info("Starting LangChain4j MCP Server...");

        // Instantiate tools
        {TOOL_CLASS_NAME} tools = new {TOOL_CLASS_NAME}();

        // Create Server
        McpServer server = new McpServer(List.of(tools));

        // Start Transport
        StdioMcpServerTransport transport = new StdioMcpServerTransport(server);

        log.info("MCP Server started successfully on STDIO.");

        // Keep Alive
        new CountDownLatch(1).await();
    }

    // --- Tool Definition ---
    public static class {TOOL_CLASS_NAME} {

        @Tool("Description of your tool")
        public String sampleTool(String input) {
            log.info("Called sampleTool with {}", input);
            return "Processed: " + input;
        }
    }
}
```

## Key Rules

*   **Logging:** JBang STDIO servers MUST write all logs to `System.err` to avoid polluting the JSON-RPC standard output stream. This is already handled in the template via `System.setProperty("org.slf4j.simpleLogger.logFile", "System.err");` but ensure this is maintained if modifying the file structure.
*   **Dependencies:** The template relies on LangChain4j and `slf4j-simple`. Do not remove the `//DEPS` directives at the top of the template.

Along the way, Gemini CLI asked me a few questions, like how to name the skill, the Java class, etc. And of course, it also asked me for permission to create that skill:

Gemini CLI asked for Permission to Create a Skill

Once installed, I could check that the skill was available in my Gemini CLI session:

Gemini CLI Skills List

Here is how the skill works and why it’s useful:

Zero-Boilerplate Scaffolding

When triggered, the skill creates a standalone Java file that is immediately ready to run as a script. It automatically includes:

  • JBang Directives: //DEPS and //JAVA lines so you don’t need a pom.xml or build.gradle.
  • MCP Server Setup: The boilerplate code required to initialize the McpServer and connect it to a StdioMcpServerTransport.
  • Critical Logging Configuration: It includes a static block that redirects all SLF4J logs to System.err. This is vital for MCP STDIO servers because logging to System.out would corrupt the JSON-RPC messages used to talk to the AI.

Automatic Registration

One of the most tedious parts of adding an MCP server is editing the ~/.gemini/settings.json file manually. This skill handles that automatically:

  • It calculates the absolute path to your new Java file.
  • It injects a new entry into the mcpServers section of your configuration.
  • It sets up the jbang run --quiet command so Gemini CLI knows exactly how to start your server.

Rapid Tool Development

The skill provides a template with a sample @Tool. This means you can go from “I want a new tool” to “I have a working tool” in seconds by just naming the server and then having the AI modify the logic inside the newly generated Java class.

With this skill installed, bootstrapping a new MCP server takes seconds rather than minutes.

Step 2 — Building the “File Tree” Tool

To take the new skill for a ride, I scaffolded a server named TreeMcpServer.java. My goal was to create a tool that the LLM could use to inspect the local file system structure.

I let Gemini CLI implement a FileTreeTools class with a tree method. This method takes a directory path and uses Java’s java.nio.file API to recursively build a string representation of the directory tree (limiting the depth to prevent massive outputs).

@Tool("Displays a tree of the local directories and files in the specified path")
public String tree(String pathStr) {
    // ... directory traversal logic ...
}
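
To make the elided traversal concrete, here is a minimal sketch of what such a depth-limited method could look like. It is not the code Gemini CLI generated: the class name FileTreeSketch, the MAX_DEPTH value of 3, and the two-space indentation are all assumptions chosen for illustration, and the @Tool annotation is left out so the file runs without any dependency.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical stand-in for the FileTreeTools class described above.
class FileTreeSketch {

    // Assumed depth cap; the post does not state the limit it used.
    static final int MAX_DEPTH = 3;

    // Returns an indented listing of the directory at pathStr.
    static String tree(String pathStr) {
        Path root = Path.of(pathStr);
        if (!Files.isDirectory(root)) {
            return "Not a directory: " + pathStr;
        }
        StringBuilder out = new StringBuilder();
        out.append(root).append("\n");
        walk(root, 1, out);
        return out.toString();
    }

    // Appends the children of dir, indented by depth, then recurses into subdirectories.
    private static void walk(Path dir, int depth, StringBuilder out) {
        if (depth > MAX_DEPTH) {
            return; // stop descending to keep the output small
        }
        List<Path> children;
        try (Stream<Path> stream = Files.list(dir)) {
            children = stream.sorted().toList();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (Path child : children) {
            boolean isDir = Files.isDirectory(child);
            out.append("  ".repeat(depth))
               .append(child.getFileName())
               .append(isDir ? "/" : "")
               .append("\n");
            if (isDir) {
                walk(child, depth + 1, out);
            }
        }
    }

    public static void main(String[] args) {
        System.out.print(tree(args.length > 0 ? args[0] : "."));
    }
}
```

Note that every newline in the string building is written as the escaped "\n", which is exactly the kind of detail that went wrong in the first generated version described in the next step.
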
Step 3 — Troubleshooting the Build

I hit a small snag during development. When I first asked Gemini CLI to show the file tree, it couldn’t connect to the tool. This was actually due to a compilation error.

To debug, I asked Gemini CLI to run a compilation check using JBang: jbang build TreeMcpServer.java. This immediately highlighted the issue: I had some unclosed string literals in the generated Java code where newline characters were literally inserted instead of escaped.

Using Gemini CLI’s replace tool, I quickly fixed the string literals. The interesting part is what came next: I asked Gemini CLI to update the SKILL.md so that the skill now checks that the generated code compiles. That made the skill more robust.

Reload skills

In Gemini CLI, if you update a skill (here, for example, I improved the skill to handle potential compilation errors), you can reload it with the following slash command:

/skills reload

The Result

With the compilation issues resolved, the Gemini CLI immediately recognized the newly registered file-tree MCP server. When asked to “Show me a file tree of the current directory,” the CLI autonomously invoked our Java tool:

Gemini CLI MCP Tool Call Approval

…and returned a clean, formatted representation of the workspace directly in the chat:

Gemini CLI MCP Tool Result

This workflow — using an AI assistant to build a skill, which in turn builds a tool that extends the assistant itself — is a powerful demonstration of how quickly we can iterate and expand our development capabilities using standard Java tools like JBang and LangChain4j.

Conclusion

Agent skills are a powerful way to automate boring, repetitive work. Instead of manually scaffolding boilerplate code every time you need a new MCP server (or repeating any other chore), you can delegate the work to an AI agent, then package what you learned into a reusable skill that you can share with others.

What’s particularly interesting is how Gemini CLI’s skill-creator skill works as a bridge between exploration and automation. During my interactive session with Gemini CLI, I experimented with building MCP servers, discovered the patterns, and hit challenges that I solved on the fly.

Rather than keeping that knowledge locked in chat history, the skill-creator skill let me capture and summarize everything I had learned (the best practices, the gotchas, the template structure) into a single, reusable SKILL.md file.

Now, what took me a session of trial-and-error can be replicated instantly by anyone (or by future me) with a single skill invocation.

The Real Power of Agent Skills

They turn ad-hoc experimentation into systematic, shareable automation!

If you want to learn more about agent skills, be sure to check this great article from my colleague Daniela, who used skills to turn a repetitive task into a more automated workflow.

https://glaforge.dev/posts/2026/02/21/easily-build-a-local-mcp-server-in-java-with-a-skill-in-gemini-cli/
Zero Boilerplate Java STDIO MCP Servers with LangChain4j and JBang

By now, you’re probably all familiar with the Model Context Protocol (MCP): it’s the standard for connecting Large Language Models (LLMs) to tools and data. But if you look at the current ecosystem, you’ll see a lot of Python and TypeScript…

As a Java developer, you might be wondering: How can I easily and quickly run my own MCP servers?

On this blog, I’ve explained how to develop MCP servers with Quarkus and Micronaut. But thanks to a recent community contribution to LangChain4j, and the simplicity of JBang, building a local MCP server in Java is now even easier, with zero boilerplate.

In this post, we’ll build a standalone Java MCP server that runs over STDIO, perfect for local integration with tools like the Gemini CLI or other locally running agentic tools supporting MCP servers.


The Stack: Why This Matters

To keep things lightweight, we’re using two powerful tools:

  1. LangChain4j: The leading framework for building AI-powered Java applications. It now includes a dedicated MCP server module for the STDIO transport (in addition to the existing MCP client module).
  2. JBang: A tool that lets you run Java files as scripts. No pom.xml, no Gradle builds, just a single .java file with your dependencies declared right at the top.

Requirement

The key requirement here is that you’ll need to install JBang if you haven’t already.

The Code: A Standalone MCP Server

Here is a complete, runnable MCP server in a single Java file. This server exposes a “Calculator” tool to any MCP-compatible LLM.

No build file, no project directory structure or anything. Just a single Java file.

///usr/bin/env jbang "$0" "$@" ; exit $?
//DEPS dev.langchain4j:langchain4j-core:1.11.0
//DEPS dev.langchain4j:langchain4j-community-mcp-server:1.11.0-beta19
//DEPS org.slf4j:slf4j-simple:2.0.17
//JAVA 21

import dev.langchain4j.agent.tool.Tool;
import dev.langchain4j.community.mcp.server.McpServer;
import dev.langchain4j.community.mcp.server.transport.StdioMcpServerTransport;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.List;
import java.util.concurrent.CountDownLatch;

public class McpToolServer {

    static {
        // Important: Redirect logs to System.err
        System.setProperty("org.slf4j.simpleLogger.logFile",
                           "System.err");
    }

    private static final Logger log =
            LoggerFactory.getLogger(McpToolServer.class);

    public static void main(String[] args) throws Exception {
        log.info("Starting LangChain4j MCP Server...");

        // 1. Define your tools
        CalculatorTools tools = new CalculatorTools();

        // 2. Wrap them in an McpServer
        McpServer server = new McpServer(List.of(tools));

        // 3. Connect to the STDIO transport
        StdioMcpServerTransport transport =
                new StdioMcpServerTransport(server);

        log.info("MCP Server started successfully on STDIO.");

        // Keep the script alive
        new CountDownLatch(1).await();
    }

    // Define MCP tools
    public static class CalculatorTools {
        @Tool("Calculates the sum of two numbers")
        public double add(double a, double b) {
            log.info("Called add({}, {})", a, b);
            return a + b;
        }

        @Tool("Calculates the square root of a number")
        public double sqrt(double x) {
            log.info("Called sqrt({})", x);
            return Math.sqrt(x);
        }
    }
}

Breaking It Down

  • JBang Directives: The //DEPS lines at the top handle all your dependencies. When you run this file, JBang downloads everything automatically (like Groovy’s @Grab annotation).
  • The @Tool Annotation: Any public method annotated with LangChain4j’s @Tool annotation is automatically exposed as an MCP tool, with a specification (name, description, parameters) that the LLM can understand.
  • StdioMcpServerTransport: Most local MCP clients communicate via Standard Input/Output. This transport layer handles the JSON-RPC handshake for you.

The “Secret Sauce”: Logging to System.err

There is one critical rule for MCP servers running over STDIO: System.out is for communication ONLY.

I could have cut the example above almost in half by removing all the logging code. But it’s worth showing if you want to log information along the way.

An MCP STDIO server reads JSON-RPC messages from stdin and writes its replies to stdout. If your application (or a library) prints a generic INFO: Hello World to stdout, it will corrupt the JSON stream and break the connection.

That’s why we use this static block:

static {
    System.setProperty("org.slf4j.simpleLogger.logFile", "System.err");
}

By forcing all logs to stderr, we keep the communication channel clean while still being able to see our logs in the terminal.
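
To illustrate the rule independently of any library, here is a minimal sketch of a line-based loop that keeps the two channels apart. It is not LangChain4j code and not a real MCP implementation: the class name StdioChannelsSketch, the serve method, and the placeholder reply are assumptions made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;

// Hypothetical illustration of the stdout/stderr split; not LangChain4j code.
class StdioChannelsSketch {

    // Reads one message per line and answers each with one line on protocolOut.
    // Everything diagnostic goes to logOut so the protocol stream stays parseable.
    static void serve(BufferedReader in, PrintStream protocolOut, PrintStream logOut)
            throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            logOut.println("INFO: received " + line.length() + " chars");
            // Placeholder line only; a real server would build a proper JSON-RPC reply here.
            protocolOut.println("{\"jsonrpc\":\"2.0\",\"result\":\"ok\"}");
            protocolOut.flush();
        }
    }
}
```

In a real script, serve would be called with a reader over System.in, System.out as the protocol stream, and System.err as the log stream. A client reading the protocol stream then only ever sees lines it can parse as JSON.
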


Testing Your Server

Before you plug it into an LLM, you can test it using the MCP Inspector. It’s a handy web UI that lets you see exactly what’s happening under the hood.

Run your server with the following command:

npx @modelcontextprotocol/inspector jbang run --quiet McpToolServer.java

Notes
  • You’ll need to have npx installed to run the MCP inspector.
  • The --quiet flag tells JBang to stop printing build messages to stdout! We don’t want JBang to interfere with the STDIO protocol either!

Once the inspector is running, you can click to connect to the server, list the tools, select a tool, all in your browser, and watch your Java code execute in real-time.

Running it in Gemini CLI

To let Gemini use your new tool, add it to your ~/.gemini/settings.json:

{
  "mcpServers": {
    "java-calc": {
      "command": "jbang",
      "args": ["run", "--quiet", "/path/to/McpToolServer.java"]
    }
  }
}

Now, when you ask Gemini “What is the square root of 144?”, it will reach out to your JBang script, execute the Java method, and give you the answer.

Well… maybe it won’t call the tool because the LLM already knows the answer to such a simple question 😃 thanks to its training data and understanding of simple math, but for more specific and complex tools, it should be called! In my case, it actually figured out it should call it. And the response seems correct to me 😃


Wrapping Up

Building MCP servers doesn’t have to be complex. With LangChain4j and JBang, you get the best of both worlds: the power of the Java ecosystem with the agility of a scripting language!

So next time you need to give an LLM access to a legacy Java library or a complex calculation, remember: you’re only one @Tool annotation away.

Java developers can be as agile as all the script kiddies! Happy MCP server coding!

https://glaforge.dev/posts/2026/02/11/zero-boilerplate-java-stdio-mcp-servers-with-langchain4j-and-jbang/
Advanced RAG — Understanding Reciprocal Rank Fusion in Hybrid Search

Today, let’s come back to one of my favorite generative AI topics: Retrieval Augmented Generation, or RAG for short.

In RAG, the quality of your generation (when an LLM crafts its answer based on search results) is only as good as your retrieval (the actually retrieved search results).

While vector search (semantic) and keyword search (BM25) each have their strengths, combining them often yields the best results. That’s what we often call Hybrid Search: combining two search techniques or the results of different searches with slight variations.

But how do you meaningfully combine a cosine similarity score of 0.85 (from vector search) with a BM25 score of 12.4? Those values are on two distinct unrelated scales!

Enter Reciprocal Rank Fusion (RRF).

I vibe-coded a little RRF simulator that shows how two lists of documents are fused into one ranking. If you’re impatient, feel free to go ahead and play with it. Otherwise, you’ll find more information on how to use it at the bottom of this article.

What is RRF?

RRF is a robust, “zero-shot” algorithm for merging search results from different retrieval methods. The technique was formally introduced by Gordon V. Cormack and his colleagues in their 2009 SIGIR paper, “Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods”.

Instead of trying to somehow normalize arbitrary scores, RRF ignores the scores entirely and focuses on rank.

It operates on a simple premise: Documents that appear at the top of multiple lists are likely the most relevant. In their research, the authors found that RRF consistently outperformed individual search systems and more complex fusion methods, providing a stable and scalable way to combine diverse ranking signals.

The Formula Simplified

The RRF score for a document is calculated as:

$$\text{Score}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}(r, d)}$$

  • $\text{rank}(r, d)$: The position of the document in a specific search result list (1st, 2nd, etc.).
  • $k$: A smoothing constant, typically set to 60.

Why is $k$ so important?

Think of $k$ as a “balance” dial.

  • If $k$ is low (e.g., 1): The formula gives a massive advantage to the top-ranked items. This configuration favors Precision: it trusts that the absolute top results are correct and allows a single high-performing retriever to dominate.
  • If $k$ is high (e.g., 60): The advantage of being #1 shrinks. This configuration improves Recall and Consensus. It ensures that even if the “perfect” result is buried at rank #10 across multiple lists, it will still rise to the top.

Why use 60? By setting $k=60$ (the industry standard), RRF prioritizes consensus over individual outliers. It ensures that a document appearing consistently (e.g., ranked #10 in both keyword and vector search) will score higher than a document that is #1 in only one list but completely missing from the others.

It rewards documents that multiple algorithms agree on, rather than letting a single outlier dominate the results.
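
The formula and the effect of the smoothing constant can be checked with a few lines of code. The following is a minimal sketch rather than any library’s implementation: the class name RrfSketch and the document ids ("solo", "shared", and so on) are invented for the example. It reproduces the case described above, where one document is ranked #10 in both lists and another is #1 in a single list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal RRF sketch; names are illustrative, not taken from any library.
class RrfSketch {

    // Sums 1 / (k + rank) over every list in which a document appears (ranks start at 1).
    static Map<String, Double> scores(List<List<String>> rankedLists, int k) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                scores.merge(list.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        return scores;
    }

    // Returns document ids ordered by descending RRF score (ties keep first-seen order).
    static List<String> fuse(List<List<String>> rankedLists, int k) {
        Map<String, Double> byId = scores(rankedLists, k);
        List<String> ids = new ArrayList<>(byId.keySet());
        ids.sort(Comparator.comparingDouble((String id) -> byId.get(id)).reversed());
        return ids;
    }

    public static void main(String[] args) {
        List<String> keyword = new ArrayList<>(List.of("solo")); // "solo" is #1 here only
        List<String> vector = new ArrayList<>();
        for (int i = 2; i <= 9; i++) keyword.add("k" + i);
        for (int i = 1; i <= 9; i++) vector.add("v" + i);
        keyword.add("shared"); // rank 10 in the keyword list
        vector.add("shared");  // rank 10 in the vector list
        List<List<String>> lists = List.of(keyword, vector);

        System.out.println(fuse(lists, 60).get(0)); // shared: 2/70 beats 1/61
        System.out.println(fuse(lists, 1).get(0));  // solo: 1/2 beats 2/11
    }
}
```

With $k=60$, the consensus document scores 2/70 ≈ 0.0286 against 1/61 ≈ 0.0164 for the single-list winner. With $k=1$ the order flips, since 1/2 is far larger than 2/11.
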

A Note on Document Sets

RRF is fundamentally designed to find consensus. This means it works best when your different retrieval methods are looking at the same overall set of documents and return some overlaps.

If your search results are totally disjoint (meaning no document appears in more than one list), RRF will simply interleave the results: you’ll get the #1 from list A, then #1 from list B, followed by the #2 from list A, and so on. The algorithm only truly begins to “fuse” and re-sort the results when documents start appearing in multiple lists.
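
A quick worked case shows why, assuming two disjoint lists $A$ and $B$ and $k = 60$ (the notation $A_1$ for the top document of list $A$ is introduced here just for illustration):

$$\text{Score}(A_1) = \text{Score}(B_1) = \frac{1}{60 + 1} \approx 0.0164 \quad > \quad \text{Score}(A_2) = \text{Score}(B_2) = \frac{1}{60 + 2} \approx 0.0161$$

Each document only contributes a single term to the sum, so documents at the same rank tie exactly. Which of the two comes first then depends on how a given implementation breaks ties.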

Why Use RRF in RAG?

  1. Normalization Free: You don’t need to know the distribution of your vector or BM25 scores. RRF works purely on position.
  2. Scalability: It’s extremely efficient for sharded, billion-scale indices where global score normalization is expensive.
  3. Candidate Selection: RRF is an excellent “first stage” reranker. A common pattern is to retrieve the top 100 documents via RRF, and then use a more expensive (but more precise) Cross-Encoder (a reranker model) to pick the top 10 for the LLM context window (see below for more details).

The Two-Stage Architecture: RRF + Cross-Encoder

While RRF is excellent at merging lists, it lacks deep semantic understanding of the query-document relationship. This is where Cross-Encoders shine — models like BERT that score the actual interaction between query and text. However, they are computationally expensive and slow.

The industry standard pattern is a “Two-Stage” architecture:

  1. Stage 1 (Candidate Selection): Use Hybrid Search (Vector + Keyword) fused with RRF to retrieve a broad pool of candidates (e.g., top 100). This ensures high Recall — the right answer is likely somewhere in this list.
  2. Stage 2 (Precision Reranking): Pass only these top 100 candidates to a Cross-Encoder. The model re-scores them based on deep relevance, picking the absolute best 5-10 chunks for the LLM’s context window.

This pipeline gives you the best of both worlds: the speed and breadth of RRF with the precision of a Cross-Encoder.
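
The second stage can be sketched in a few lines. This is an illustration under assumptions, not a production reranker: the class TwoStageSketch is hypothetical, the candidates are assumed to arrive already fused by RRF, and the overlap method is a deliberately crude word-matching stand-in for a real Cross-Encoder model.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

// Hypothetical sketch of stage 2; the cross-encoder is passed in as a plain function.
class TwoStageSketch {

    // Pairs a candidate with its relevance score so the model is called once per document.
    record Scored(String doc, double score) {}

    // Re-scores the stage 1 candidates against the query and keeps the topN best.
    static List<String> rerank(String query, List<String> candidates,
                               ToDoubleBiFunction<String, String> crossEncoder, int topN) {
        return candidates.stream()
                .map(doc -> new Scored(doc, crossEncoder.applyAsDouble(query, doc)))
                .sorted(Comparator.comparingDouble(Scored::score).reversed())
                .limit(topN)
                .map(Scored::doc)
                .toList();
    }

    // Toy scorer (assumption): counts how many query words occur in the document.
    static double overlap(String query, String doc) {
        Set<String> docWords = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")));
        return Arrays.stream(query.toLowerCase().split("\\s+"))
                .filter(docWords::contains)
                .count();
    }
}
```

Calling TwoStageSketch.rerank(query, fusedCandidates, TwoStageSketch::overlap, 10) would keep the ten best candidates. In a real pipeline the toy scorer would be replaced by a call to an actual reranking model, which is where most of the latency goes, hence the small candidate pool.
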

Going Further: RAG-Fusion

RAG-Fusion takes the hybrid approach a step further. This technique was introduced by Zackary Rackauckas in the 2024 paper, “RAG-Fusion: a New Take on Retrieval-Augmented Generation”. It uses an LLM to generate multiple variations of the user’s original query to “broaden the net” and find relevant context that a single query might miss.

The process follows a clever loop:

  1. Multi-Query Generation: An LLM generates 3-5 different versions of the user’s query (e.g., synonyms, rephrasings, or breaking down a complex question).
  2. Parallel Retrieval: Each variation is sent to the search engine (both Vector and Keyword).
  3. RRF Aggregation: All resulting lists are fused using RRF.

By using RRF to merge results from multiple query variations, the system naturally filters out “topic drift.” Documents that appear consistently across many query variants rise to the top, while noise from a single poor query variation is pushed down. This “consensus” approach can help reduce hallucinations, since the LLM is provided with content validated by multiple search angles.
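
The three steps above can be sketched as a single method. This is a simplified illustration, not the paper’s reference code: the class RagFusionSketch is hypothetical, the LLM call and the search engine are both passed in as plain functions, and retrieval runs sequentially here rather than in parallel to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical RAG-Fusion sketch; the LLM and the retriever are injected as functions.
class RagFusionSketch {

    static List<String> ragFusion(String query,
                                  Function<String, List<String>> generateVariants, // stand-in for the LLM
                                  Function<String, List<String>> retrieve,         // stand-in for the search engine
                                  int k) {
        Map<String, Double> scores = new LinkedHashMap<>();
        // Steps 1 and 2: generate the query variants and retrieve a ranked list for each.
        for (String variant : generateVariants.apply(query)) {
            List<String> ranked = retrieve.apply(variant);
            // Step 3: accumulate the RRF contribution 1 / (k + rank), ranks starting at 1.
            for (int i = 0; i < ranked.size(); i++) {
                scores.merge(ranked.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        // Order documents by descending fused score.
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .toList();
    }
}
```

A document returned for several variants accumulates several terms, while a document surfaced only by one off-topic variant gets a single term and sinks in the final ranking.
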

RRF in the Wild: LangChain4j

RRF isn’t just a theoretical concept; it’s a standard component in modern AI stacks. LangChain4j, the popular Java library for building LLM-powered applications, uses RRF as its default mechanism for aggregating results from multiple sources.

The DefaultContentAggregator class in LangChain4j employs a ReciprocalRankFuser to merge ranked lists of content. This means if you configure a RAG pipeline with multiple retrievers (e.g., one for recent web data and one for internal documents), LangChain4j automatically applies RRF to give you the best of both worlds without any manual tuning.

Here is how you can set up a hybrid retrieval system in LangChain4j that implicitly uses RRF:

// 1. Define your retrievers
ContentRetriever bm25Retriever = ...;

ContentRetriever vectorSearchRetriever =
    EmbeddingStoreContentRetriever.builder()
        .embeddingStore(embeddingStore)
        .embeddingModel(embeddingModel)
        .maxResults(10)
        .build();

// 2. Combine them in the RetrievalAugmentor
// DefaultQueryRouter sends each query to all
// retrievers; DefaultRetrievalAugmentor then uses
// DefaultContentAggregator which uses RRF
RetrievalAugmentor retrievalAugmentor =
    DefaultRetrievalAugmentor.builder()
        .queryRouter(new DefaultQueryRouter(
            bm25Retriever, vectorSearchRetriever))
        .build();

// 3. Configure the augmentor on the AI service
Assistant assistant = AiServices.builder(Assistant.class)
    ...
    .retrievalAugmentor(retrievalAugmentor)
    .build();

By simply adding multiple retrievers, the DefaultContentAggregator kicks in, calculating the $\frac{1}{k + \text{rank}}$ score for every item found by either retriever and re-sorting them into a single, high-quality context for your LLM.

Try the Simulator

To truly understand how the smoothing constant $k$ impacts rankings and how different lists merge, I’ve built a Reciprocal Rank Fusion Simulator.

You can experiment with different document rankings and see the fusion math in real-time here: 👉 Launch RRF Simulator

Use this tool to visualize how RRF balances precision (favoring top ranks) vs. consensus (favoring agreement across lists) and tune your intuition for hybrid search architectures.

  • Interactive Rank Experimentation: Create new lists, add/remove documents, shuffle the lists, etc. Use drag-and-drop to reorder results in two independent search engines. Since RRF ignores raw scores and focuses only on position, you can see exactly how moving a document up or down one list impacts its final “fused” standing.
  • Visualizing Consensus: The simulation demonstrates RRF’s “consensus” logic: documents that appear in both lists (even at mediocre ranks) often outperform documents that rank #1 in only one list. This highlights why hybrid search is so effective.
  • Real-Time Parameter Tuning: By adjusting the $k$ constant, you can see how the algorithm’s sensitivity changes. You’ll observe how a lower $k$ prioritizes “top-heavy” results, while a higher $k$ gives more weight to documents found deeper in the search results.

Summary

I hope this article helped you get a better intuition of Reciprocal Rank Fusion, why it’s so useful, and how it works! By focusing on rank rather than arbitrary scores, RRF provides a robust and scalable way to merge diverse search results, making it a cornerstone of modern hybrid search and advanced RAG architectures.

https://glaforge.dev/posts/2026/02/10/advanced-rag-understanding-reciprocal-rank-fusion-in-hybrid-search/
Decoded: How Google AI Studio Securely Proxies Gemini API Requests

If you’ve recently vibe-coded and exported a Gemini-powered app from Google AI Studio to host it online on Google Cloud Run, you might have noticed a server/ directory containing a Node.js application. This isn’t just a simple file server; it’s a clever “transparent proxy” designed to solve a classic problem in frontend AI development:

How do I use my API key without leaking it to the browser?

In this post (although vibe-coding is supposed to be all about not looking at the code at all) we’ll dissect exactly how this architecture works, why it’s safer than a client-side key, and where its security limits lie.

Note

When exporting/downloading an AI Studio generated app, you won’t see what I’m going to explain below. It’s when you export the app to Cloud Run that the mechanism explained here is put in place. I looked at the code that was deployed on Cloud Run, from the Google Cloud Console.

The Problem: Client-Side Keys

When building an app with a React, Vue, or vanilla JS frontend that talks to an AI model, the path of least resistance would often be:

// ❌ DANGEROUS: Do not do this in production
const API_KEY = "AIzaSy...";
const genAI = new GoogleGenerativeAI(API_KEY);

The moment you deploy this, your API key is visible in the browser’s “Network” tab of the Dev tools, or by inspecting the JavaScript source. A malicious actor can grab your key and use your quota for their own projects, potentially racking up bills or exhausting your limits.

Scary illustration of a person on his computer horrified to discover on his screen that his LLM tokens have been consumed because of a leaked API key

The Solution: The “Transparent” Proxy

💡 Google AI Studio’s exported starter code uses a Node.js proxy server combined with Service Workers to hide the key while letting you write frontend code as if you were calling the API directly.

The classical approach is to have your frontend call your backend code, and it’s only your backend code that has access to the API key. Here, the approach taken by AI Studio for Cloud Run deployment is to let developers continue to write frontend code as usual, calling the generative AI API as if it were a direct call from the frontend… But in reality, the call is intercepted and proxied to a backend server that makes the real call to the AI model and handles the API key on the server side, without ever exposing it.

Part 1: The Server

The heart of the system is an Express.js server (server/server.js). It serves your frontend files but also listens on a special endpoint: /api-proxy.

When a request hits this endpoint, the server:

  1. Injects the API Key: It takes the key from a secure environment variable on the server-side (GEMINI_API_KEY).
  2. Forwards the Request: It sends the modified request to Google’s real API (generativelanguage.googleapis.com).
  3. Streams the Response: It pipes the answer back to your browser.

Here is the critical logic in server.js where the key is added:

// server/server.js

// ... inside the /api-proxy route handler ...

// Prepare headers for the outgoing request
const outgoingHeaders = {};

// Copy most headers from the incoming request (content-type, etc.)
// ... (code to copy headers) ...

// 🔐 KEY INJECTION HAPPENS HERE
outgoingHeaders['X-Goog-Api-Key'] = process.env.GEMINI_API_KEY;

const axiosConfig = {
    method: req.method,
    url: `https://generativelanguage.googleapis.com/${targetPath}`,
    headers: outgoingHeaders,
    // ...
};

// Forward the request to Google
const apiResponse = await axios(axiosConfig);

The frontend never receives the key. It only receives the results of the API call.

Part 2: Client-Side Interception

If you look at the frontend code (e.g., in App.tsx), you might see standard calls to the Gemini API (usually it’s implemented in a dedicated services/geminiService.ts file):

// Frontend code looks like it's calling Google directly!
const model = genAI.getGenerativeModel({
  model: "gemini-3-flash-preview",
  temperaturecontents: {...}
});
const result = await model.generateContent(prompt);

How does this work if the frontend doesn’t have the key?

The server injects two scripts into your index.html at runtime:

  1. websocket-interceptor.js
  2. service-worker.js

The Service Worker

The service-worker.js acts like a network traffic cop inside your browser. It monitors all outgoing fetch requests. If it sees a request headed for generativelanguage.googleapis.com, it stops it and redirects it to your local server instead.

// server/public/service-worker.js

const TARGET_URL_PREFIX = 'https://generativelanguage.googleapis.com';

self.addEventListener('fetch', (event) => {
  const requestUrl = event.request.url;

  if (requestUrl.startsWith(TARGET_URL_PREFIX)) {
    // ✋ Stop! Don't go to Google directly.
    // 👉 Go to our local proxy instead.
    const remainingPath = requestUrl.substring(TARGET_URL_PREFIX.length);
    const proxyUrl = `${self.location.origin}/api-proxy${remainingPath}`;

    // Forward the request to /api-proxy
    event.respondWith(fetch(new Request(proxyUrl, { ... })));
  }
});

This “transparent” redirection means you don’t have to change your frontend code to point to http://localhost:3000/api-proxy. You just write standard SDK code, and the Service Worker handles the routing.

The WebSocket Interceptor

For streaming features or chat, the Gemini API uses WebSockets. Service Workers cannot easily intercept WebSocket connections, so the solution uses a different trick: Monkey Patching.

The websocket-interceptor.js overwrites the global browser WebSocket constructor.

// server/public/websocket-interceptor.js

const originalWebSocket = window.WebSocket;

window.WebSocket = new Proxy(originalWebSocket, {
  construct(target, args) {
    let [url, protocols] = args;

    // Check if the connection is destined for Gemini
    if (url.includes('generativelanguage.googleapis.com')) {
       // Redirect to our local proxy endpoint
       url = url.replace('wss://generativelanguage.googleapis.com',
                         `wss://${window.location.host}/api-proxy`);
    }

    // Create the WebSocket with the new URL
    return new originalWebSocket(url, protocols);
  }
});

Security Reality Check

Is this secure? Yes and No.

✅ The Good: Credential Protection

This architecture successfully hides the “Secret String” (your API Key), as if you had written your own backend server.

  • It is not in the JavaScript bundle.
  • It is not in the network traffic (except between your server and Google).
  • A user cannot “copy-paste” your key to use in their own unrelated backend script.

⚠️ The Bad: The “Open Proxy” Risk

The server is a “dumb pipe”: it blindly adds your API key to any request sent to /api-proxy. A malicious user on your site can still abuse your quota by opening Chrome DevTools and running:

// This will be intercepted by the Service Worker and proxied!
fetch('https://generativelanguage.googleapis.com/v1beta/models/gemini-pro:generateContent', {
    method: 'POST',
    body: JSON.stringify({ contents: [{ parts: [{ text: "Generate 5000 words of spam..." }] }] })
})

Your server will happily stamp this request with your API key and send it to Google.

🛡️ The Mitigation: Rate Limiting

The AI Studio team anticipated this. The generated server includes Rate Limiting to prevent a single user from draining your quota instantly.

// server/server.js

const proxyLimiter = rateLimit({
    windowMs: 15 * 60 * 1000, // 15 minutes
    max: 100, // Limit each IP to 100 requests per window
    message: 'Too many requests from this IP...'
});

app.use('/api-proxy', proxyLimiter);

This ensures that while a user can still make requests via your proxy, each IP address is capped at a fixed rate (here, 100 requests per 15 minutes, which might still be a bit too much!).

Conclusion

I’ve always been curious about how Google AI Studio protects the Gemini API key, even though the code appears to use the key directly on the frontend. The (real) code that is actually deployed on Cloud Run is pretty smart, making use of some interesting tricks to avoid exposing the key and to mitigate the abuse of your quota.

The Google AI Studio proxy server is a nice piece of engineering for prototyping and demos. It allows for a “serverless-feeling” frontend development experience while adhering to the basic security rule of keeping API keys on the server.

However, for a production application, you should eventually replace this generic proxy with specific backend endpoints (e.g., /api/generate-recipe, /api/chat-response) that:

  1. Validate user input (“Is this actually a recipe request?”).
  2. Authenticate the user (“Is this user logged in?”).
  3. Apply strict business logic before calling the Gemini API.
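
As a rough sketch of what such a dedicated endpoint could do before calling Gemini, the plain Node.js function below rejects anonymous callers, validates the input, and builds the request body on the server. Everything here is hypothetical: the name buildRecipeRequest, the user.isLoggedIn flag (session handling is not shown), the limits, and the prompt wording are assumptions for illustration, not code generated by AI Studio.

```javascript
// Hypothetical helper for a dedicated /api/generate-recipe endpoint.
// The client sends only a list of ingredients; the server owns the prompt.
const MAX_INGREDIENTS = 20;       // assumed limit, tune for your app
const MAX_INGREDIENT_LENGTH = 50; // assumed limit, tune for your app

function buildRecipeRequest(user, body) {
  // 1. Authenticate: reject anonymous callers (session handling not shown)
  if (!user || !user.isLoggedIn) {
    return { status: 401, error: 'Authentication required' };
  }
  // 2. Validate: accept only a short list of short, non-empty strings
  const ingredients = body && body.ingredients;
  const valid =
    Array.isArray(ingredients) &&
    ingredients.length > 0 &&
    ingredients.length <= MAX_INGREDIENTS &&
    ingredients.every(
      (i) => typeof i === 'string' && i.trim().length > 0 && i.length <= MAX_INGREDIENT_LENGTH
    );
  if (!valid) {
    return { status: 400, error: 'Invalid ingredients' };
  }
  // 3. Business logic: the prompt template is fixed on the server,
  //    the caller only supplies the ingredients
  const prompt = `Suggest one recipe using: ${ingredients.map((i) => i.trim()).join(', ')}`;
  return {
    status: 200,
    geminiBody: { contents: [{ parts: [{ text: prompt }] }] },
  };
}

// Example: a logged-in user asking for a recipe
console.log(JSON.stringify(buildRecipeRequest({ isLoggedIn: true }, { ingredients: ['eggs', 'flour'] })));
```

Unlike the generic proxy, an endpoint built this way never forwards a caller-supplied request body to the Gemini API, which narrows what a malicious user can do with your quota.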

Even if Google AI Studio protects your API key to some extent, your quota can still be exhausted by a malicious user. So if you’re exposing such an application to the public, ask AI Studio to either add authentication, or to request the user to pass their own API key. Or even both!

I hope you found this exploration interesting!

https://glaforge.dev/posts/2026/02/09/decoded-how-google-ai-studio-securely-proxies-gemini-api-requests/
Latest Gemini and Nano Banana Enhancements in LangChain4j

A few days ago, LangChain4j 1.11.0 was released, and with this version, a few notable enhancements to the support of the Gemini model family have landed. Let’s dive in!

New Image Generation Models (Gemini 2.5 & 3.0 Preview, aka 🍌 Nano Banana)

Note

Before showing some snippets of code, let me give you the link to the full documentation on the new image model: docs.langchain4j.dev/integrations/image-models/gemini

There’s a new GoogleAiGeminiImageModel class which allows text-to-image generation and image editing using the latest 🍌 Nano Banana models.

Supported Models:

  • gemini-2.5-flash-image (Nano Banana): Optimized for speed.
  • gemini-3-pro-image-preview (Nano Banana Pro): High-fidelity, up to 4K resolution.

Features:

  • Text-to-Image: Generate images from prompts.
  • Image Editing: Edit existing images using text prompts (with optional mask support).
  • Search Grounding: Ground image generation in Google Search results.

Text-to-Image Generation
var imageModel = GoogleAiGeminiImageModel.builder()
    .apiKey(System.getenv("GEMINI_API_KEY"))
    .modelName("gemini-2.5-flash-image")
    .aspectRatio("16:9")
    .imageSize("2K")
    .build();

Response<Image> response = imageModel.generate(
        "A cinematic shot of a futuristic city at sunset");

// Save the generated image to a file
Image image = response.content();
byte[] imageBytes = Base64.getDecoder().decode(image.base64Data());
Files.write(Paths.get("output.png"), imageBytes);

As you can see, different configuration parameters are possible:

  • aspectRatio: among 16:9, 4:3, 3:2, 1:1, 2:3, 3:4, and 9:16
  • imageSize: among 1K, 2K, 4K

Image Generation with Google Search Grounding

A powerful capability of Nano Banana Pro is the ability to ground its image generation in Google Search results, with the useGoogleSearchGrounding(true) flag.

It’s a model that’s able to search for image references on the web, or for the latest information about a topic.

var groundedModel = GoogleAiGeminiImageModel.builder()
    .apiKey(System.getenv("GEMINI_API_KEY"))
    .modelName("gemini-3-pro-image-preview") // search only in Pro
    .useGoogleSearchGrounding(true)
    .aspectRatio("1:1")
    .build();

Response<Image> groundedResponse = groundedModel.generate("""
    A kawaii illustration of the current weather forecast for Paris
    showing the current temperature (in Celsius)
    """);

Here, we want to create a kawaii illustration of the current weather in Paris. So Nano Banana Pro is going to search on Google to find out about the weather forecast at this point in time!

At the time of this writing, the forecast is:

A Kawaii illustration of a little cloud and sun characters indicating a temperature of 13°C in Paris

Pretty kawaii, right? 😃

Google Maps Grounding

You can now enable Google Maps grounding to allow the model to access real-world location data, including place IDs, addresses, and reviews.

Warning

This is currently available on the 2.5 models, not (yet?) the 3.0 models.

var chatModel = GoogleAiGeminiChatModel.builder()
    .apiKey(System.getenv("GEMINI_API_KEY"))
    .modelName("gemini-2.5-flash")
    .allowGoogleMaps(true) // Enable Google Maps tool
    .retrieveGoogleMapsWidgetToken(true) // Optional: UI widget
    .build();

String response = chatModel.chat(
    "Find the best restaurant near the Eiffel tower");

Gemini might answer something along the lines of this (with details coming from Maps rather than from its training knowledge, as the star ratings show):

For a memorable dining experience near the Eiffel Tower, several highly-rated restaurants offer a range of cuisines and atmospheres.

French Cuisine with a View: For an exceptional meal with stunning views, consider Jules Verne, an elegant restaurant located within the Eiffel Tower itself, boasting a 4.5-star rating. Another option is Francette, a refined restaurant on a barge on the Seine, also with a 4.5-star rating and direct views of the tower.

Café de l’Homme is a stylish bistro in the Musée de l’Homme, featuring outdoor tables with Eiffel Tower views and a 4.1-star rating.

Top-Rated French Bistros: If a classic French bistro is more your style, De la Tour is a popular family-run establishment with a 4.8-star rating. Arnaud Nicolas, known for its artfully presented tasting menus, has a 4.7-star rating. With a 4.8-star rating, Milagro is another excellent choice. Also highly rated is Le CasseNoix, a charming, retro spot with a 4.7-star rating.

Italian Options: If you’re in the mood for Italian food, Chez Pippo is a cozy trattoria with a 4.6-star rating. La Casa di Alfio is another popular choice with a 4.5-star rating. With a 4.7-star rating, In Casa is also nearby. And Pink Mamma has an impressive 4.7-star rating with over 45,000 reviews.

Google Search Grounding

Standard text generation can now be grounded using Google Search, ensuring responses are based on up-to-date web information.

var chatModel = GoogleAiGeminiChatModel.builder()
    .apiKey(System.getenv("GEMINI_API_KEY"))
    .modelName("gemini-3-flash-preview")
    .allowGoogleSearch(true) // Enable Google Search tool
    .build();

String response = chatModel.chat(
    "What are the latest models from OpenAI, Anthropic, and Google?");

URL Context Tool

This feature allows the model to access and use information directly from specific URLs provided in the prompt context.

var chatModel = GoogleAiGeminiChatModel.builder()
    .apiKey(System.getenv("GEMINI_API_KEY"))
    .modelName("gemini-3-flash-preview")
    .allowUrlContext(true) // Enable URL Context tool
    .build();

// The model can now fetch and reason
// over content from URLs in the prompt
String response = chatModel.chat("""
    Check Guillaume Laforge's blog archive at
    https://glaforge.dev/archive/
    and tell me how many articles he wrote in January 2026
    """);

It suffices to include the URLs in your prompt. There is no need to fetch or scrape the content yourself before making the LLM call.

Multimodal Agents (Image Generation)

AiServices now supports returning generated images directly, enabling the creation of multimodal agents that can produce visual content.

interface CreativeAssistant {
    @UserMessage("Generate a high-quality image of {{description}}")
    ImageContent generateArtwork(@V("description") String description);
}

CreativeAssistant assistant = AiServices
    .builder(CreativeAssistant.class)
    .chatModel(GoogleAiGeminiChatModel.builder()
        .apiKey(System.getenv("GEMINI_API_KEY"))
        .modelName("gemini-3-pro-image-preview")
        .build())
    .build();

ImageContent artwork = assistant.generateArtwork(
    "a cyberpunk street food stall");

Then you can retrieve the image via artwork.image().base64Data() and save it.

Note

In this example, we’re using Nano Banana Pro! Nano Banana is actually a chat model that has 2 response modalities: text and images.

In the case of the GoogleAiGeminiImageModel, we were only requesting images to be generated. No text.

Gemini 3.0 Thinking Configuration

You can configure the “thinking” process (Chain-of-Thought) for Gemini 3.0 models, allowing you to control the depth of reasoning.

Thinking levels available: MINIMAL, LOW, MEDIUM, HIGH.

var thinkingModel = GoogleAiGeminiChatModel.builder()
    .apiKey(System.getenv("GEMINI_API_KEY"))
    .modelName("gemini-3-flash-preview")
    .sendThinking(true)   // Send thinking process to the model
    .returnThinking(true) // Return thought process in the response
    .thinkingConfig(GeminiThinkingConfig.builder()
        .thinkingLevel(GeminiThinkingLevel.HIGH) // Reasoning depth
        .build())
    .build();

String response = thinkingModel.chat(
    "Solve this complex logic puzzle...");

Enhanced Metadata & Token Usage

Responses from both Chat and Image models now include richer metadata, including detailed token usage and grounding source information (e.g., which web pages or map locations were used).

ChatResponse response = chatModel.chat(request);

// Cast to the Gemini specific response type
// to get access to the metadata provided by Gemini
GoogleAiGeminiChatResponseMetadata metadata =
        (GoogleAiGeminiChatResponseMetadata) response.metadata();

// Access Grounding Metadata
if (metadata.groundingMetadata() != null) {
    metadata.groundingMetadata()
        .groundingChunks().forEach(chunk -> {
            if (chunk.web() != null) {
                System.out.println("Source: " +
                    chunk.web().title() + " (" +
                    chunk.web().uri() + ")");
            }
        });
}

Summary

These Gemini-related enhancements in LangChain4j 1.11.0 further expand the capabilities of the Gemini integration.

From advanced image generation and editing with Nano Banana (Gemini 2.5 and 3.0 Preview) to powerful grounding features with Google Search and Google Maps, developers can now build more intelligent and context-aware applications.

The introduction of the URL Context tool, multimodal agents, and configurable thinking processes for Gemini 3.0 allows for richer interactions and more precise control over model behavior.

The improved metadata and token usage reporting also provide valuable insights for optimizing and understanding model responses.

Have fun with Gemini!

https://glaforge.dev/posts/2026/02/06/latest-gemini-and-nano-banana-enhancements-in-langchain4j/
Researching Topics in the Age of AI — Rock-Solid Webhooks Case Study

Back in 2019, I spent significant time researching Webhooks. In particular, I was interested in best practices, pitfalls, design patterns, and approaches for implementing Webhooks in a reliable, resilient, and effective way.

Everything is distilled in that article: Implementing Webhooks, not as trivial as it may seem

It likely took me a full week to dive deep into this subject, finding sources and experimenting with design patterns myself. But nowadays, AI makes it easier to dive deeper into topics, explore unfamiliar aspects, and share findings with your team.

As I built a research agent based on Google’s Deep Research agent, I wanted to see how far it’d go with a topic I had covered a while ago.

Armed with my custom research agent (via the Javelit frontend I built around it), I entered the query: Webhook best practices for rock solid and resilient deployments.

Gemini 3 Flash gave me a list of topics associated with that theme, and I selected the following ones:

  • Cryptographic signature verification using HMAC-SHA256
  • Implementing idempotency keys to prevent duplicate event processing
  • Asynchronous processing architectures using message queues and background workers
  • Preventing replay attacks with timestamp validation and nonces
  • Retry strategies using exponential backoff with jitter
  • Dead letter queue (DLQ) implementation and management for failed deliveries
  • Security through Mutual TLS (mTLS) and IP allowlisting
  • Webhook payload versioning and backward compatibility strategies
  • Handling high-volume event bursts with rate limiting and buffering
  • Circuit breaker patterns to protect downstream services from failure
  • Schema validation and data minimization in webhook payloads

There’s significant overlap with the topics I covered in my old presentation! I talked about idempotency, signatures, dead-letter queues, IP allowlisting, rate limiting & buffering, etc. This validates my previous findings and shows that today’s generative AI capabilities can identify the same key topics.

So I let Deep Research explore those topics. After a few minutes, my agent wrote a long report in a concise style (I used prompts to steer the report generation toward concise writing and avoid “AI Slop”). Then Gemini 3 Pro created a summary, and Nano Banana Pro generated a sketchnote illustration.

Let’s start with the illustration (the same you saw as the banner for this post):

Infographics

I have a special prompt to generate such sketchnotes, which I find readable, and appealing to share on social media:

Create a hand-drawn and hand-written sketchnote style
summary infographic, with a pure white background,
use fluo highlighters for the key points,
about the following information: ...

And I pass the summary generated by Gemini 3 Pro below.

Summary

The summary is generated via this prompt, passing the entire research:

Create a concise summary of the research below.
Go straight to the summary; don't introduce it
(don't write "Here's a summary..." or equivalent).

For this research on Webhooks, here’s what the summary was. The summary’s structure also guided the organization of the sketchnote above:

Webhooks facilitate real-time, event-driven communication between systems but require a defensive architecture to ensure security, reliability, and scalability. Unlike polling, webhooks utilize a “push” model, necessitating robust handling of network partitions, malicious activity, and traffic spikes.

Security and Authentication Security implementation must go beyond obscured URLs. The industry standard involves HMAC-SHA256 signature verification to ensure payload integrity and authenticity. Critical implementation details include using constant-time string comparisons to prevent timing attacks and validating raw, unparsed payloads. To prevent replay attacks, systems should enforce timestamp tolerance windows and utilize nonces. While Mutual TLS (mTLS) offers a higher security standard for zero-trust environments, it introduces significant complexity compared to signatures and IP allowlisting.

Reliability and Architecture Because webhooks typically guarantee “at-least-once” delivery, receivers must implement idempotency using unique event keys and atomic storage to prevent duplicate processing from corrupting data. To handle high throughput and avoid timeouts, architectures should be asynchronous: an ingestion layer should immediately acknowledge requests (returning 202 Accepted) and offload the payload to a message queue for background processing by workers.

Failure Handling and Recovery Robust systems employ exponential backoff with jitter for retries to prevent “thundering herd” scenarios that could overwhelm the receiver. Messages that fail all retry attempts should be routed to a Dead Letter Queue (DLQ) for inspection and potential redrive rather than being discarded. Additionally, circuit breakers are essential to pause delivery to failing endpoints, protecting the infrastructure from cascading failures during outages.

Scalability and Payload Design To manage bursty traffic, providers should enforce rate limiting and buffering. Payload design involves a trade-off between “Fat” payloads (full state, convenient but larger attack surface) and “Thin” payloads (notifications only, secure but require callback API calls). Best practices suggest keeping payloads under 20kb, minimizing PII, and utilizing additive versioning per event type to maintain backward compatibility.

Even just the summary, as is, is already valuable when researching a subject, as it gives you all the key pointers to go further.

Generating a Slide Deck

I delivered my Webhook talk at a handful of conferences in Europe. And of course, I needed a slide deck! I often spend significant time on my decks, as I want them to be polished, clear, and beautiful. Usually, I try to reduce the amount of text in favor of visual representations like diagrams and photographic illustrations.

Right, so I have a report and an infographic, but what about the deck? My agent doesn’t (yet?) handle that, so instead, I turned to NotebookLM, gave it my research, and asked it to generate a slide deck with a particular design:

A slide deck for a technical audience, describing all the best practices to implement rock solid resilient webhooks. Opt for a blueprint architectural style, with illustrations.

And it complied, generating the following deck, which I could see myself presenting:

The style is consistent across the deck and quite beautiful. I usually put less text on slides, but it works here. Under the hood, NotebookLM uses 🍌 Nano Banana Pro (aka Gemini 3 Pro Image), and the included graphics look spot on, sharp, and accurate. I didn’t even spot typos in the generated text.

Going Further: Should I share those?

More and more I use AI in my work, but I still prefer writing my articles by hand. I can use generative AI to do a first draft, explain a piece of code, or come up with a conclusion. I also use image generation to create either illustrations or sketchnotes. But otherwise, it’s still me. My writing, my design, my style, my authentic voice.

What I’m wondering though is what to do with such research reports. I run such reports to explore a particular topic, to avoid forgetting some key angle or aspect. But I usually keep that research to myself (sometimes saved inside my Obsidian vault, or as Google Docs here and there).

This week, I had a nice lunch with a couple of old friends who thought it was worth sharing those reports more widely, rather than keeping them private (one of my friends shares his research publicly in a GitHub repository).

On the one hand, I don’t want to increase the quantity of AI slop available on the internet, but on the other hand, once they’re generated, it’s sad to see all those tokens wasted, and benefiting me exclusively! I would clearly label them as AI generated research reports or similar though, but maybe others would find those useful and interesting?

I’d be curious to hear your thoughts on this! Don’t hesitate to share them with me on social media.

Generative AI, assisted by agents that do targeted web searches, really changes the game in terms of research. Tools like NotebookLM are able to find the right sources of information, and can generate all sorts of artifacts and visualisations (audio and video podcasts, mindmaps, infographics, quizzes, etc.). And image models like 🍌 Nano Banana are incredible and able to generate very clear visuals. This is really an interesting era to learn more about any topic, and at a much greater depth than scouring your favorite search engine manually! Deep Research and NotebookLM give you the URLs of the sources, so you can double-check the accuracy of the reference material.

For the curious, here’s the full research report below, that my agent crafted about rock-solid webhooks:

Webhook Implementation Best Practices: Security, Reliability, and Scalability

Key Points
  • Security is Paramount: Relying solely on obscure URLs is insufficient; implementation must include cryptographic signing (HMAC-SHA256) to ensure integrity and authenticity, alongside HTTPS for confidentiality.
  • Reliability through Idempotency: Because webhooks typically guarantee “at-least-once” delivery, receivers must implement idempotency keys to safely handle duplicate requests without corrupting data state.
  • Asynchronous Architecture: Decoupling ingestion from processing using message queues is critical for handling traffic bursts and preventing timeout failures at the ingress point.
  • Failure Mitigation: Robust systems employ exponential backoff with jitter for retries to prevent “thundering herd” scenarios, utilizing Dead Letter Queues (DLQs) for effectively managing permanently failed messages.
  • Traffic Management: Circuit breakers and rate limiting are essential to protect both the sender and receiver infrastructure from cascading failures during high-load events or outages.
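
To illustrate the failure mitigation point above, here is a small JavaScript sketch of exponential backoff using the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling, followed by a hand-off to a Dead Letter Queue once the attempts are exhausted. The function names and the values for baseMs, capMs, and maxAttempts are assumptions for illustration; the report does not prescribe specific numbers.

```javascript
// Sketch of retry delays using exponential backoff with "full jitter".
// baseMs, capMs and maxAttempts are assumed values, not taken from the report.
function backoffDelayMs(attempt, { baseMs = 1000, capMs = 60000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt); // 1s, 2s, 4s, ... capped at capMs
  return Math.floor(random() * ceiling); // spread uniformly in [0, ceiling)
}

// Decide what to do after a failed delivery attempt (0-based attempt counter).
function nextAction(attempt, maxAttempts = 5) {
  return attempt + 1 >= maxAttempts
    ? { action: 'dead-letter' } // give up: route the message to the DLQ
    : { action: 'retry', delayMs: backoffDelayMs(attempt) };
}

// Print the decision taken after each of five consecutive failures
for (let attempt = 0; attempt < 5; attempt++) {
  console.log(attempt, nextAction(attempt));
}
```

The random component is what prevents many senders that failed at the same moment from retrying at the same moment, which is the "thundering herd" scenario mentioned above.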
Introduction

Webhooks represent the standard for event-driven communication between distributed systems, allowing platforms to notify downstream services of state changes in real-time. Unlike polling, which is resource-intensive and suffers from latency, webhooks enable a “push” model where data is transmitted immediately upon event occurrence. However, this architectural shift introduces significant challenges regarding security, reliability, and scalability. A production-grade webhook implementation requires rigorous adherence to defensive design patterns to handle network partitions, malicious actors, and extreme volume spikes.

This report synthesizes best practices across security verification, architectural decoupling, failure recovery, and payload design. It serves as a comprehensive guide for engineering teams aiming to build or consume resilient webhook systems that function reliably at scale.

1. Cryptographic Security and Authentication

The public exposure of webhook endpoints makes them susceptible to impersonation, tampering, and man-in-the-middle attacks. Security must be implemented in layers, primarily focusing on transport security and payload verification.

1.1 HMAC-SHA256 Signature Verification

The industry standard for authenticating webhook payloads is the Hash-based Message Authentication Code (HMAC) using the SHA-256 algorithm. This mechanism allows the receiver to verify that the payload was generated by the expected sender and has not been modified in transit 1, 2.

Mechanism of Action:

  1. Shared Secret: A secret key is exchanged between the webhook provider and the consumer. This key is never transmitted over the network but is stored securely (e.g., in a Key Management Service) 2, 3.
  2. Hashing: The provider computes a hash of the payload body using the shared secret and the SHA-256 algorithm.
  3. Transmission: This hash is included in the HTTP headers (e.g., X-Signature or X-Hub-Signature-256) sent with the POST request 4.
  4. Verification: Upon receipt, the consumer independently computes the hash of the raw payload using their copy of the secret and compares it to the header value 5, 6.

Implementation Criticalities:

  • Constant-Time Comparison: When comparing the calculated signature with the received signature, developers must use a constant-time string comparison function. Standard string comparison returns false as soon as a mismatch is found, which exposes the system to timing attacks where an attacker can deduce the signature character by character based on response time 3, 6.
  • Raw Payload Access: Verification must be performed on the raw, unparsed request body. Frameworks that automatically parse JSON before verification can alter whitespace or field ordering, causing hash mismatches 3, 6.
  • Key Rotation: Security best practices dictate the ability to rotate secrets without downtime. This is achieved by supporting multiple active keys during a transition period or using a Key-ID header to indicate which secret was used for signing 1, 2.
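
The verification step and the constant-time comparison described above can be sketched in Java as follows. This is a minimal sketch, not a reference implementation: the class name WebhookSignatures and the choice of a lowercase hex encoding for the header value are assumptions (providers differ on encoding and prefixes), and HexFormat requires Java 17 or later.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.HexFormat;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Illustrative helper (hypothetical name) for HMAC-SHA256 webhook signatures. */
final class WebhookSignatures {

    private WebhookSignatures() {}

    /** Computes the lowercase hex HMAC-SHA256 of the raw, unparsed request body. */
    static String sign(byte[] secret, byte[] rawBody) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return HexFormat.of().formatHex(mac.doFinal(rawBody));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable or key invalid", e);
        }
    }

    /** Returns true only if the received header matches the locally computed signature. */
    static boolean verify(byte[] secret, byte[] rawBody, String signatureHeader) {
        if (signatureHeader == null) {
            return false;
        }
        byte[] expected = sign(secret, rawBody).getBytes(StandardCharsets.US_ASCII);
        byte[] received = signatureHeader.getBytes(StandardCharsets.US_ASCII);
        // MessageDigest.isEqual is a time-constant comparison, unlike String.equals.
        return MessageDigest.isEqual(expected, received);
    }
}
```

To support key rotation, a receiver could call verify once per active secret and accept the request if any of them matches.
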
1.2 Mutual TLS (mTLS)

While HTTPS ensures encryption in transit, Mutual TLS (mTLS) provides a higher level of authentication by requiring both the client (webhook sender) and the server (webhook receiver) to present valid X.509 certificates during the TLS handshake 4, 7.

Advantages and Trade-offs: mTLS aligns with Zero Trust security principles, ensuring that traffic is trusted in both directions at the transport layer before any application logic is invoked 7, 8. It effectively mitigates spoofing and man-in-the-middle attacks. However, it introduces significant operational complexity regarding certificate management, issuance, and rotation 7, 8. For many use cases, mTLS is considered overkill compared to HMAC, but it remains the gold standard for high-security environments like banking or healthcare 8.

1.3 IP Allowlisting

Restricting webhook traffic to a specific list of IP addresses (allowlisting) is a common defense-in-depth strategy. By blocking all traffic not originating from known provider IPs, the attack surface is reduced 9, 10.

Limitations in Modern Architectures: IP allowlisting is increasingly difficult to maintain in cloud-native environments where providers use dynamic IP ranges or serverless infrastructure 11. It creates a maintenance burden where the consumer must manually update firewall rules whenever the provider expands their infrastructure 12, 13. Consequently, IP allowlisting should be treated as a supplementary measure rather than a primary authentication method 3, 13.

1.4 Preventing Replay Attacks

A replay attack occurs when an attacker intercepts a valid, signed webhook request and resends it to the endpoint to duplicate an action (e.g., forcing a second payment).

Timestamp Validation: To prevent this, the signature header should include a timestamp. The receiver verifies that the timestamp is within a strictly defined tolerance window (e.g., 5 minutes) relative to the system time 6, 14. If the request is too old, it is rejected, even if the signature is valid. Including the timestamp in the signed payload ensures the attacker cannot modify the time to bypass the check 14.

Nonce Implementation: For stronger protection, a unique “nonce” (number used once) or unique request ID can be included. The receiver stores processed nonces in a fast lookup store (like Redis) with a Time-To-Live (TTL) matching the replay window. If a nonce is seen a second time, the request is rejected 6.
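
The timestamp window and the nonce check can be combined in one small guard, sketched below under a few assumptions: the class name ReplayGuard is hypothetical, the in-memory map stands in for a TTL store such as Redis, and the current time is passed in explicitly so the logic is easy to test. The guard is meant to run after the signature (which covers the timestamp) has been verified.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative replay guard: timestamp tolerance plus one-time nonces. */
final class ReplayGuard {

    private final Duration tolerance;
    // nonce -> expiry; in-memory stand-in for a store with TTL support
    private final ConcurrentHashMap<String, Instant> seenNonces = new ConcurrentHashMap<>();

    ReplayGuard(Duration tolerance) {
        this.tolerance = tolerance;
    }

    /** Accepts a request only if its timestamp is fresh and its nonce is unseen. */
    boolean accept(Instant signedTimestamp, String nonce, Instant now) {
        // Reject timestamps outside the tolerance window (too old or too far ahead).
        if (Duration.between(signedTimestamp, now).abs().compareTo(tolerance) > 0) {
            return false;
        }
        // Drop nonces whose timestamp would no longer pass the window check anyway.
        seenNonces.values().removeIf(expiry -> expiry.isBefore(now));
        // putIfAbsent is atomic: only the first request carrying this nonce wins.
        return seenNonces.putIfAbsent(nonce, signedTimestamp.plus(tolerance)) == null;
    }
}
```

With a 5-minute tolerance, a first delivery is accepted, and the same nonce presented again ten seconds later is rejected even though its timestamp is still fresh.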

2. Reliability and Data Integrity

Distributed systems cannot guarantee “exactly-once” delivery, because an acknowledgment can be lost on the network after the receiver has already processed the event. Therefore, webhook systems almost universally operate on an “at-least-once” delivery model, necessitating robust handling of duplicate events.

2.1 Idempotency Implementation

Idempotency ensures that performing the same operation multiple times produces the same result as performing it once. This is the primary defense against data corruption caused by webhook retries 15, 16.

Idempotency Keys: Providers should include a unique identifier (Idempotency Key or event_id) in the webhook headers or payload 16, 17. The receiver uses this key to lock processing for that specific event.

  • Deduplication Store: A fast, atomic store (e.g., Redis) checks if the key has been processed. Using atomic operations like SETNX (Set if Not Exists) prevents race conditions where two parallel requests for the same event might both proceed 16.
  • Retention Window: The keys should be stored for a duration exceeding the maximum retry window of the provider (typically 24 to 72 hours) 18.
  • Transactional Upserts: In database operations, using “upsert” logic (update if exists, insert if new) based on the unique event ID ensures consistency at the database level 16.
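
The deduplication logic can be sketched as follows. The class name IdempotentHandler is hypothetical, and the ConcurrentHashMap is only an in-process stand-in for an atomic store: putIfAbsent plays the role that SETNX plays in Redis. Releasing the key on failure, so that a provider retry can be processed again, is a design assumption of this sketch and not something the sources prescribe.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Illustrative wrapper that runs the business logic at most once per event ID. */
final class IdempotentHandler {

    // In-memory stand-in for a deduplication store with a retention window.
    private final ConcurrentHashMap<String, Boolean> processed = new ConcurrentHashMap<>();
    private final Consumer<String> businessLogic;

    IdempotentHandler(Consumer<String> businessLogic) {
        this.businessLogic = businessLogic;
    }

    /** Returns true if the event was processed, false if it was a duplicate. */
    boolean handle(String eventId, String payload) {
        // Atomic claim: of two parallel deliveries, only one sees null here.
        if (processed.putIfAbsent(eventId, Boolean.TRUE) != null) {
            return false;
        }
        try {
            businessLogic.accept(payload);
            return true;
        } catch (RuntimeException e) {
            // Release the claim so that a later retry is not silently skipped.
            processed.remove(eventId);
            throw e;
        }
    }
}
```
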
2.2 Asynchronous Processing Architectures

Synchronous processing of webhooks—where the receiver executes business logic before returning an HTTP response—is a major anti-pattern. It couples the provider’s availability to the consumer’s processing speed and risks timeouts 15, 16.

Queue-Based Decoupling: The recommended architecture involves an ingestion layer that does nothing but authenticate the request, push the payload to a message queue (e.g., RabbitMQ, Kafka, SQS), and immediately return a 202 Accepted status 19, 20, 21.

  • Benefits: This ensures that the ingestion layer can handle high throughput without waiting for slow downstream processes (e.g., generating PDFs, sending emails) 22, 23.
  • Worker Pattern: Background workers pull messages from the queue to process them. If a worker fails, the message remains in the queue or is moved to a retry queue, ensuring no data is lost during application crashes 20, 24.
  • Buffering: This architecture acts as a buffer (shock absorber) during traffic spikes, allowing the system to “hold the load” and process it at a manageable rate rather than crashing the web server 23.
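
The split between ingestion and processing can be sketched with a bounded in-process queue. Everything here is illustrative: WebhookIngestion is a hypothetical name, the LinkedBlockingQueue stands in for a durable broker (RabbitMQ, Kafka, SQS), and the 401 and 503 status codes for invalid signatures and a full queue are assumptions of this sketch. Only the 202 Accepted response comes from the text above.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Illustrative ingestion layer: authenticate, enqueue, acknowledge. */
final class WebhookIngestion {

    // Bounded queue; a real deployment would use a durable message broker.
    private final BlockingQueue<String> queue;

    WebhookIngestion(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    /** Returns the HTTP status the endpoint would answer with. */
    int ingest(String rawBody, boolean signatureValid) {
        if (!signatureValid) {
            return 401; // assumed status for failed authentication
        }
        // offer() never blocks: no business logic runs on the request thread.
        return queue.offer(rawBody) ? 202 : 503;
    }

    /** Called by a background worker; returns null when nothing is pending. */
    String pollNext() {
        return queue.poll();
    }
}
```
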
3. Failure Handling and Recovery

Failures in webhook delivery are inevitable due to network blips, downtime, or bugs. A robust system must distinguish between transient and permanent failures and handle each appropriately.

3.1 Retry Strategies: Exponential Backoff and Jitter

When a webhook delivery fails (e.g., receiver returns 500 or times out), the provider must retry. However, immediate retries can worsen the issue, especially if the receiver is overloaded.

Exponential Backoff: This algorithm increases the wait time between retries exponentially (e.g., 1s, 2s, 4s, 8s). This gives the failing system “breathing room” to recover 25, 26.

  • Formula: $Delay = Base \times 2^{Attempt}$ 25.
  • Capping: A maximum delay (e.g., 1 hour) prevents retry intervals from becoming unreasonably long 25.

Jitter: Exponential backoff alone can lead to the “Thundering Herd” problem, where multiple failed webhooks retry at the exact same synchronized times, creating repeated spikes of traffic. “Jitter” adds randomness to the backoff interval to desynchronize these retries 25, 27.

  • Full Jitter: $Delay = random(0, Base \times 2^{Attempt})$ is highly effective at spreading load 18, 25.
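
The two formulas above, together with the cap, translate into a few lines of Java. This is a sketch under stated assumptions: the class and method names are hypothetical, delays are expressed in milliseconds, inputs are assumed to be non-negative, and attempts are counted from 0 so that a 1-second base yields 1s, 2s, 4s, 8s.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative retry delay calculator: capped exponential backoff with full jitter. */
final class Backoff {

    private Backoff() {}

    /** Delay = min(cap, base * 2^attempt), computed without overflowing. */
    static long exponentialDelayMillis(long baseMillis, int attempt, long capMillis) {
        int shift = Math.min(Math.max(attempt, 0), 62);
        // base * 2^shift > cap  <=>  base > cap / 2^shift (for non-negative values)
        if (baseMillis > (capMillis >> shift)) {
            return capMillis;
        }
        return baseMillis << shift;
    }

    /** Full jitter: a uniformly random delay between 0 and the capped exponential delay. */
    static long fullJitterDelayMillis(long baseMillis, int attempt, long capMillis) {
        long ceiling = exponentialDelayMillis(baseMillis, attempt, capMillis);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```

For a 1-second base and a 1-hour cap, attempt 3 gives a ceiling of 8 seconds, and the jittered delay falls anywhere between 0 and 8 seconds.
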
3.2 Dead Letter Queues (DLQ)

If a message fails to deliver after all retry attempts are exhausted, it should not be discarded. Instead, it must be moved to a Dead Letter Queue (DLQ) 19, 26, 28.

  • Purpose: The DLQ acts as a holding area for “poison messages” or permanently failed deliveries. This prevents the retry queue from being clogged with unprocessable events 17.
  • Management: Systems must provide tooling to inspect DLQ messages, determine the root cause (e.g., bug in consumer code), and “redrive” (replay) them once the issue is fixed 18, 29.
  • Alerting: The size of the DLQ is a critical metric; growing DLQ depth should trigger alerts for manual intervention 17, 19.
3.3 Circuit Breaker Pattern

While retries handle individual message failures, the Circuit Breaker pattern protects the entire ecosystem from total collapse during prolonged outages 30, 31.

Functionality: If a specific endpoint exceeds a failure threshold (e.g., 100 failures in 60 seconds), the circuit breaker “trips” to an Open state.

  • Open State: Delivery is paused entirely for that endpoint. The provider stops wasting resources trying to send requests that will likely fail 31, 32.
  • Half-Open State: After a cooldown period, the system allows a limited number of “test” requests. If these succeed, the circuit closes and normal flow resumes. If they fail, it re-opens 31.
  • Distributed State: In large systems, the state of the circuit breaker is often managed in a distributed store like Etcd or Redis to synchronize awareness across all delivery nodes 33.
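
The three states can be captured in a small state machine. The sketch below simplifies in ways worth naming: it counts consecutive failures instead of failures within a sliding time window, it keeps its state in a single process instead of a distributed store, and it lets requests through in the Half-Open state until the first recorded outcome instead of limiting them to a fixed number. The class name and the explicit clock parameter are assumptions made for illustration.

```java
/** Illustrative single-process circuit breaker for one webhook endpoint. */
final class CircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long cooldownMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAtMillis;

    CircuitBreaker(int failureThreshold, long cooldownMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
    }

    /** Returns whether a delivery attempt may be made right now. */
    synchronized boolean allowRequest(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAtMillis >= cooldownMillis) {
            state = State.HALF_OPEN; // cooldown elapsed: allow test requests
        }
        return state != State.OPEN;
    }

    /** A success closes the circuit and resets the failure count. */
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    /** A failed test request, or too many failures, (re)opens the circuit. */
    synchronized void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAtMillis = nowMillis;
        }
    }

    synchronized State state() {
        return state;
    }
}
```
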
4. Scalability and Payload Design

As systems grow, the volume and complexity of webhook events increase. Proper design choices in payloads and traffic management are essential for long-term maintainability.

4.1 Handling High-Volume Bursts

Webhook traffic is rarely uniform; it is “bursty” by nature (e.g., bulk updates triggering thousands of events) 23, 34.

Rate Limiting: Providers should enforce rate limits on outgoing webhooks to prevent overwhelming consumer endpoints. This smooths traffic spikes into a consistent stream 35. If the limit is reached, requests are throttled or queued rather than dropped 35.
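
The sources do not prescribe a specific rate-limiting algorithm. A token bucket is one common choice, sketched here as an assumption: each delivery consumes a token, tokens refill at a steady rate, and a caller that gets false would queue or delay the delivery instead of dropping it. The class name and the explicit nanosecond clock are illustrative.

```java
/** Illustrative token bucket for pacing outgoing webhook deliveries. */
final class TokenBucket {

    private final double capacity;
    private final double refillPerSecond;
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double capacity, double refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity; // start full so that short bursts are allowed
        this.lastRefillNanos = nowNanos;
    }

    /** Returns true if a delivery may be sent now, false if it should wait. */
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSeconds = (nowNanos - lastRefillNanos) / 1e9;
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

A bucket with capacity 2 and a refill rate of 1 token per second lets two deliveries through immediately, refuses the third, and admits one more after a second has passed.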

Buffering: For self-hosted solutions, an intermediate buffering layer (like NGINX or a lightweight “Holding the Load” service) can accept connections rapidly and persist requests to storage before they reach the heavier application logic 23, 36.

4.2 Payload Design: Fat vs. Thin Events

The content of the webhook payload involves a trade-off between efficiency and coupling.

  • Fat Payloads (Event-Carried State Transfer): The payload contains the full resource state (e.g., the complete Order object).
    • Pros: Decouples systems; the consumer doesn’t need to call back to the API to get data 37, 38.
    • Cons: Can leak sensitive data if not carefully filtered; payloads can become large (bloat); data might be stale by the time it is processed 37, 39.
  • Thin Payloads (Event Notification): The payload contains minimal data, typically just the event type and resource ID (e.g., {"event": "order.created", "id": "123"}).
    • Pros: Secure (no sensitive data in payload); ensures consumer fetches the latest state 40, 41.
    • Cons: Increases API load due to “callback” traffic (consumer receiving the event immediately calls the API to get details) 37.
  • Best Practice: Many systems use a hybrid or offer “Thin” events by default for security, with optional expansion for trusted internal consumers 39. Ideally, keep payloads under 20 KB to reduce transmission overhead 42.
4.3 Schema Validation and Versioning

Webhook payloads constitute an API contract. Breaking changes (removing fields, changing types) can cause downstream failures.

Versioning:

  • Event-Type Versioning: Versioning is best applied per event type (e.g., v2.invoice.paid vs invoice.paid) rather than a global API version, as this allows granular evolution 43.
  • Additive Changes: Schema evolution should be additive (adding new fields is safe; removing fields is breaking). Deprecation periods are required for removing fields 21.

Validation: Consumers should validate incoming payloads against a JSON Schema to fail fast if the data structure is malformed 44, 45. However, validation logic should be permissive (“tolerant reader” pattern)—ignoring unknown fields to maintain forward compatibility 21.

Data Minimization: To comply with privacy regulations (GDPR, CCPA), payloads should minimize Personally Identifiable Information (PII). Sensitive data should preferably be retrieved via a secure API call (Thin Payload) rather than broadcast in the webhook 6, 14, 41.

Conclusion

Implementing a robust webhook system requires a holistic approach that balances security, reliability, and efficiency. By authenticating payloads with HMAC and securing the transport with mTLS, ensuring reliability through idempotency and retries, and designing for scale with asynchronous queues and circuit breakers, developers can build event-driven architectures that are resilient to the chaotic nature of distributed networks. The separation of concerns, where the ingestion layer strictly handles intake and the worker layer handles processing, remains the fundamental architectural pattern for successful high-volume webhook implementations.

Sources:

  1. ngrok.com
  2. medium.com
  3. loginradius.com
  4. stytch.com
  5. dev.to
  6. webflow.com
  7. webhooks.fyi
  8. latenode.com
  9. security.com
  10. webhookrelay.com
  11. techradar.com
  12. dev.to
  13. hookdeck.com
  14. snyk.io
  15. hookdeck.com
  16. medium.com
  17. medium.com
  18. medium.com
  19. hookdeck.com
  20. dev.to
  21. medium.com
  22. youtube.com
  23. dev.to
  24. medium.com
  25. hookdeck.com
  26. svix.com
  27. latenode.com
  28. amazon.com
  29. integrate.io
  30. medium.com
  31. mambu.com
  32. stackoverflow.com
  33. raymondtukpe.com
  34. hookdeck.com
  35. trackunit.com
  36. medium.com
  37. codesimple.blog
  38. codeopinion.com
  39. brianlovin.com
  40. hookdeck.com
  41. mendix.com
  42. github.com
  43. svix.com
  44. inventivehq.com
  45. zuplo.com
https://glaforge.dev/posts/2026/02/04/researching-topics-in-the-age-of-ai-rock-solid-webhooks-case-study/
How to Integrate Gemini CLI with IntelliJ IDEA Using ACP

The Agent Client Protocol (ACP) allows you to connect external AI agents directly into IDEs and text editors that support that protocol (like JetBrains’ IntelliJ IDEA, PyCharm, or WebStorm, as well as Zed). This means you can bring the power of the Gemini CLI directly into your editor, allowing it to interact with your code, run terminal commands, and use Model Context Protocol (MCP) servers right from the AI Assistant chat window.

This guide will walk you through setting up Gemini CLI as a custom agent in IntelliJ IDEA.

Note

The JetBrains AI Assistant help pages mention that it should be possible to install an AI agent from the ACP registry, but this option wasn’t available to me at the time of writing, which is why I decided to investigate and write this tutorial!

Prerequisites
  • IntelliJ IDEA (or other JetBrains IDEs) version 2025.3 or later.
  • Node.js installed (version 20+ recommended), needed by Gemini CLI.
  • Gemini CLI installed.
Step 1: Install Gemini CLI

If you haven’t already, install the Gemini CLI globally using npm:

npm install -g @google/gemini-cli

Once installed, verify it works by running:

gemini --version
Step 2: Locate the Gemini Executable

You need the absolute path to the installed gemini executable for the configuration file.

On macOS / Linux: Run the following command in your terminal:

which gemini

Example output: /Users/username/.nvm/versions/node/v22.16.0/bin/gemini

On Windows: Run the following command in Command Prompt or PowerShell:

where gemini

Example output: C:\Program Files\nodejs\gemini.cmd (or similar inside AppData)

Important

Copy this path; you will need it for the next step.

Step 3: Configure the ACP Agent

JetBrains IDEs look for a specific JSON configuration file to load external agents. You need to create or edit this file.

File Location:

  • macOS / Linux: ~/.jetbrains/acp.json
  • Windows: %USERPROFILE%\.jetbrains\acp.json

Configuration Content:

Create the file (if it doesn’t exist) and add the following JSON content. Paste the path you found in Step 2 into the "command" field.

{
  "agent_servers": {
    "Gemini CLI": {
      "command": "/path/to/your/gemini",
      "args": [
        "--experimental-acp"
      ],
      "use_idea_mcp": true,
      "use_custom_mcp": true
    }
  }
}
Important

Be sure to set the --experimental-acp flag, as it’s still an experimental feature in Gemini CLI.

Configuration Breakdown:
  • "Gemini CLI": This is the display name you will see in the IDE.
  • "command": The absolute path to the Gemini executable.
  • "args": We pass --experimental-acp to tell Gemini to start in Agent Client Protocol (ACP) mode. You can also pass other flags supported by Gemini CLI, for example to force a particular Gemini model version.
  • "use_idea_mcp": true: This is crucial. It allows Gemini to access the IDE’s built-in Model Context Protocol (MCP) server, giving it context about your open files, project structure, and more.
  • "use_custom_mcp": true: Allows Gemini to use any other custom MCP servers you might have configured in the IDE.
Step 4: Restart and Connect
  1. Restart IntelliJ IDEA to load the new configuration.
  2. Open the AI Assistant tool window (usually on the right side).
  3. Look for the Agent Selector. It might be a dropdown menu at the top of the chat or a “More” (…) menu.
  4. Select “Gemini CLI” from the list.

In the AI Chat window, you should see something like this:

Select Gemini CLI from that drop-down menu. And you’re good to go!

Usage

Once selected, you can chat with Gemini just like the default AI assistant, but with the added capabilities of the CLI!

  • Context Awareness: It knows about your project files via the IDE’s MCP.
  • Tool Use: It can perform actions defined in the CLI’s toolset.
Troubleshooting
  • Agent not appearing? Double-check the path in acp.json. If you use nvm (Node Version Manager), ensure the path points to the specific version currently in use, not a generic alias that might not be available to the IDE’s environment.
  • Permissions: On macOS/Linux, ensure the file ~/.jetbrains/acp.json is readable by your user.
  • Experimental Flag: Ensure you didn’t forget the --experimental-acp argument; otherwise, the CLI will try to launch in interactive terminal mode and hang.
https://glaforge.dev/posts/2026/02/01/how-to-integrate-gemini-cli-with-intellij-idea-using-acp/
A Javelit Frontend for the Deep Research Agent

A month ago, I wrote about building a research assistant in Java, using the Gemini Interactions API and the Deep Research agent. Today I wanted to revisit this project, with the goal of making it more user-friendly thanks to a Javelit-based frontend.

The Research Pipeline: From Query to Infographic
  • First, the user enters the subject of the research.
  • A button action triggers the search for possible topics about that subject (i.e., the different facets or angles of the subject), using Gemini 3 Flash with Google Search activated.
  • The user selects the facets they’re interested in, to restrict the research to only those aspects.
  • Then the Deep Research agent kicks in, via the Gemini Interactions API, and spends a few minutes researching the topic.
  • Once the final report is ready, Gemini 3 Pro creates a solid summary.
  • The summary is used to generate a sketchnote with Nano Banana Pro.
A Look at the User Interface

Let’s say, as a user, I want to research information about the OpenClaw / MoltBot / ClawdBot personal AI assistant (unless it has again changed its name? 😃)

With Javelit, the code looks like this:

// The main title
Jt.title("🔎 Deep Research Agent").use();

// A header
Jt.header("Subject").use();

// The form containing the text area and submit buttons
var formSubject = Jt.form().key("form_subject").use();
String subject = Jt.textArea("Subject")
        .key("subject")
        .placeholder("Enter the subject you want to research...")
        .use(formSubject);

// The buttons (submit and clear) are inside 2 columns on a row
var columns = Jt.columns(2).widths(List.of(0.9, 0.1)).use(formSubject);

// I clear the state if the user clicks the clear button
Jt.formSubmitButton("Clear All").onClick(b -> {
    Jt.setComponentState("subject", "");
    Jt.sessionState().remove("topics");
}).use(columns.col(1));

// We proceed with the next steps
// if the user clicks the exploration button
Jt.formSubmitButton("Explore Topics").type("primary").onClick(b -> {
    Jt.sessionState().remove("topics");
}).use(columns.col(0));
1. Topic Exploration (Gemini 3 Flash)

Then, I click on Explore Topics to find the various facets of that story. Looks like Gemini 3 Flash is thinking, and is actively searching for the most up-to-date information on Google Search:

Indeed, I created a model interaction that invokes Gemini 3 Flash directly and requests a structured output (an array of strings) containing the different facets of the subject:

ModelInteractionParams planParams = ModelInteractionParams.builder()
        .model("gemini-3-flash-preview")
        .input(String.format("""
                Find a list of topics to research
                on the following subject:
                %s
                """, state.subject))
        .responseFormat(GSchema.fromClass(String[].class))
        .tools(new GoogleSearch())
        .store(true)
        .build();

Interaction planInteraction = client.create(planParams);

On the UI side, since the elements are rendered synchronously as the code executes, we can define placeholder elements that will later receive other components:

// A header
Jt.header("Topics").use();

// The form containing the topic checkboxes and the submit button
var formTopics = Jt.form().key("form_topics").use();

// An "empty" container to hold an info bubble and the future topics
var topicsContainer =
        Jt.empty().key("topics_container").use(formTopics);
Jt.info("Preparing topics...").icon(":hourglass:")
        .use(topicsContainer);

In the above, the Jt.empty() component receives the Jt.info() bubble. But once the list of facets is returned by the model interaction, the info bubble will be replaced by a list of checkboxes with the topics to select:

var topicSelectionContainer =
        Jt.container().key("topics").use(topicsContainer);

List<String> selectedTopics = topics.stream()
        .filter(topic ->
            Jt.checkbox(topic).use(topicSelectionContainer))
        .toList();

Jt.formSubmitButton("Launch Research").type("primary").use(formTopics);

if (selectedTopics.isEmpty()) {
    // wait for user to select topics and hit form submit button
    return;
}

Gemini 3 found a good list of facets. Let me select the ones I’m the most interested in:

2. The Deep Dive (Deep Research Pro)

Now when I click the Launch Research button, the Deep Research agent is actively working:

In this part of the interface, you’ll notice the use of tabs, to separate the full report, the summary, and the infographic. This is achieved with the Jt.tabs() component.

// A header
Jt.header("Report").use();

// The 3 tabs
var tabLabels = List.of("Full Report", "Summary", "Infographic");
var tabs = Jt.tabs(tabLabels).use();

// Each tab has a placeholder
var reportPlaceholder =
        Jt.empty().key("fullReport").use(tabs.tab(tabLabels.get(0)));
var summaryPlaceholder =
        Jt.empty().key("summary").use(tabs.tab(tabLabels.get(1)));
var infographicPlaceholder =
        Jt.empty().key("infographic").use(tabs.tab(tabLabels.get(2)));

// The placeholders are info bubbles,
// soon replaced by the report, summary, and infographic
Jt.info("Preparing full report...")
        .icon(":hourglass:")
        .use(reportPlaceholder);
Jt.info("Preparing summary...")
        .icon(":hourglass:")
        .use(summaryPlaceholder);
Jt.info("Preparing infographic...")
        .icon(":hourglass:")
        .use(infographicPlaceholder);

As the research and thinking progress, Deep Research shares its thoughts and current actions with me (the info bubbles being replaced by the thoughts and actions):

It’s worth pausing to see how the thoughts are streamed in real time.

Let’s have a look at the agent interaction definition:

AgentInteractionParams researchParams =
    AgentInteractionParams.builder()
        .agent("deep-research-pro-preview-12-2025")
        .input(String.format("""
            Write a concise research report on the following subject:
            <subject>
            %s
            </subject>

            By focusing on the following topics:
            <topics>
            %s
            </topics>
            """,
            subject, topicsList))
        .background(true)
        .stream(true)
        .agentConfig(
                new DeepResearchAgentConfig(ThinkingSummaries.AUTO))
        .store(true)
        .build();

The important bits here are:

  • background(true) to state it’s a background operation that can take time to complete.
  • stream(true) to state it should be streamed in real time.
  • And .agentConfig(new DeepResearchAgentConfig(ThinkingSummaries.AUTO)) says that thoughts should be sent as they occur.

The part taking care of the streaming is the stream() method on the agent interaction:

client.stream(researchParams).forEach(event -> {
    if (event instanceof ContentDelta delta) {
        if (delta.delta() instanceof ThoughtSummaryDelta thought) {
            if (thought.content() instanceof TextContent textContent) {
                Jt.markdown(textContent.text()).use(reportPlaceholder);
            }
        } else if (delta.delta() instanceof TextDelta textPart) {
            reportBuilder.append(textPart.text());
            Jt.markdown(reportBuilder.toString())
                    .use(reportPlaceholder);
        }
    // ...

The client is receiving thoughts, but it’s also later going to receive the report in the stream. So as soon as thoughts or pieces of the report arrive, they are reflected in the UI of our research agent.

And after a little while (from one to six minutes or so) the final report is ready:

3. Synthesis (Gemini 3 Pro)

But I can go straight to the TL;DR, because Gemini 3 Pro will have prepared a high-level summary of the report:

The summary is prepared by a call to Gemini 3 Pro:

ModelInteractionParams summaryParams = ModelInteractionParams.builder()
    .model("gemini-3-pro-preview")
    .input(String.format("""
        Create a concise summary of the research below.
        Go straight with the summary, don't introduce the summary
        (don't write "Here's a summary..." or equivalent).

        %s
        """, reportBuilder))
    .store(true)
    .build();
4. Visualizing with Infographics (Gemini 3 Pro Image)

Thanks to the talent of 🍌 Nano Banana Pro:

For generating the image, we just pass the summary to the model:

var infographicParams = ModelInteractionParams.builder()
    .model("gemini-3-pro-image-preview")
    .input(String.format("""
        Create a hand-drawn and hand-written sketchnote
        style summary infographic, with a pure white background,
        use fluorescent highlighters for the key points,
        about the following information:

        %s
        """, summaryText))
    .responseModalities(Interaction.Modality.IMAGE)
    .build();
Whipping up the UI with Javelit

What makes this research frontend interesting isn’t just the AI logic with the Gemini Interactions API, it’s how quickly you’re able to whip up a functional web UI for it using Javelit, without the hassle of a complicated web framework.

Info

I invite you to check out my other articles on Javelit to learn more.

The Rendering Loop Philosophy

The core of Javelit is its rendering loop. You provide a lambda that describes your UI (or a main method when running with the javelit command). Every time an interaction occurs (a button click, a checkbox toggle…) it re-executes the UI code from top to bottom.

As Cyril de Catheu (the creator of Javelit) would put it, it’s a bit like a video game loop. A state update triggers a repaint. But contrary to video games, catching input and state updates is done automatically by the Javelit components, and the repaint isn’t flying at 60fps, but only when there’s a UI interaction. Because the code runs top-to-bottom, you don’t need to manage complex event listeners or manual DOM updates.
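
To make the rerun idea concrete, here is a toy model in plain Java. It is not Javelit's implementation and uses none of its API: the RerunLoop class, its state map, and the list of strings standing in for rendered components are all hypothetical. The only point it illustrates is that each interaction updates some state and then re-executes the whole UI script from top to bottom.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/** Toy rerun-on-interaction loop (hypothetical, unrelated to Javelit's internals). */
final class RerunLoop {

    private final Map<String, Object> state = new HashMap<>();
    private final BiConsumer<Map<String, Object>, List<String>> script;
    private List<String> frame = List.of();

    RerunLoop(BiConsumer<Map<String, Object>, List<String>> script) {
        this.script = script;
        rerun(); // initial render
    }

    /** Simulates a widget interaction: update the state, then repaint everything. */
    void interact(String key, Object value) {
        state.put(key, value);
        rerun();
    }

    /** Re-executes the UI script top to bottom and keeps the resulting frame. */
    private void rerun() {
        List<String> out = new ArrayList<>();
        script.accept(state, out);
        frame = List.copyOf(out);
    }

    List<String> frame() {
        return frame;
    }
}
```

A script that always emits a title, and emits a topics section only once a "subject" key is present in the state, produces a one-element frame at first and a two-element frame after interact("subject", ...) is called, without any explicit event listener.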

Conclusion

By combining the power of specialized Gemini models and agents with the Gemini Interactions API (and my Java SDK for it) with the rapid UI development of Javelit, I was able to build a sophisticated research tool, taking advantage of the powerful Deep Research agent, in a fraction of the time it would take with a traditional frontend stack.

The ability to stream thoughts from the Deep Research agent directly into a reactive Javelit container makes the whole experience feel “alive” and transparent.

And everything in Java… No Python was harmed in this exercise! 😃

Info

Full source code of this deep research frontend on GitHub.

https://glaforge.dev/posts/2026/01/30/a-javelit-frontend-for-the-deep-research-agent/
Executable Markdown Files with gcli-mdrun & Gemini CLI

Have you ever wanted to turn your cool LLM prompts & tools, research notes, automation ideas, or even a simple “todo” list into an executable script? Inspired by a HackerNews post about executable Markdown, I’m happy to share gcli-mdrun, a smart little script that allows you to transform standard Markdown files into executable scripts powered by Gemini CLI.

This project allows you to create AI-driven automation, pipelines, and autonomous bots using mere Markdown text files. You can find the project on GitHub at https://github.com/glaforge/gcli-mdrun.

Quick Start

Imagine a file named weather.md:

#!/usr/bin/env gemini-run
Use only the Google Search tool to find the answer to the question below:
What is the weather like in Paris right now?

Run it like any other script:

chmod +x weather.md
./weather.md

And you’d get something like:

I will search for the current weather in Paris.
The current weather in Paris is cloudy with a temperature of 7 °C.
Humidity is at 84%. Wind is blowing from the Southeast at 11 km/h.
There is currently no precipitation.

The prompt (below the shebang line) is actually executed by Gemini CLI.

How to Install

To get started, you need the gemini-run wrapper script from the gcli-mdrun repository.

  1. Download the script: You can find it in the scripts/ directory of the repo.
  2. Install it: Make it executable and move it to your system path.
chmod +x gemini-run
sudo mv gemini-run /usr/local/bin/
Note

Of course, you’ll have to have Gemini CLI installed, and a valid Gemini API key configured as a GEMINI_API_KEY environment variable.

Usage & Features

YOLO Mode (Autonomous Execution, aka Live Dangerously)

By using the Gemini CLI --yolo flag in the shebang of your markdown scripts, Gemini will execute tools and commands automatically without asking for confirmation.

WARNING!

⚠️ Use with caution! ⚠️

#!/usr/bin/env -S gemini-run --yolo
List all files in the current directory and rename any file
with a '.txt' extension to have a '.bak' extension instead.
Piping and Pipelines

Because gemini-run supports stdin, you can chain multiple markdown scripts together or mix them with standard Unix tools.

cat customers.log | ./step1_extract.md | ./step2_analyze.md

You can also redirect their output to files with >.

Note

If you are familiar with Gemini CLI custom commands, you might find them more useful than piping script outputs, in particular for handling inputs or arguments.

Real-World Examples

Here are some cool things you can do with gcli-mdrun (look at these examples from the repo):

1. Automated Release Notes (git-log-summary.md)

This script analyzes your recent git commits (in the git project in the current folder) and generates structured release notes. It uses the run_shell_command tool to fetch git logs and diffs (using the git command).

I won’t copy the whole script here as it’s a bit too long, but I’d like just to show you the shebang line:

#!/usr/bin/env -S gemini-run --model gemini-2.5-flash --allowed-tools=run_shell_command(git)

Gemini CLI allows you to specify which tools to allow or forbid, which MCP servers to use or restrict, etc. Here, I only allowed the execution of the git command via Gemini CLI’s run_shell_command tool.

Note

Be sure to check the documentation of Gemini CLI, and its flags, and ideally avoid giving too many permissions to your executable scripts.

For example, I applied the script to my gcli-mdrun repository and got this (after piping the output to the glow Markdown highlighter):

[Image: Release notes for gcli-mdrun]

2. Intelligent Search (google-search.md)

Leverage the power of Google Search directly from your Markdown scripts. This example fetches real-time information from the web, requesting the weather forecast for Paris:

#!/usr/bin/env -S gemini-run --allowed-tools=google_web_search,web_fetch

Use Google Search to find the answer to the question below.
Don't use any other tools.

What is the weather currently Paris?
(be sure to use international units exclusively)
Note

Notice how the tools are restricted with an allow-list.

3. Visual Summaries with Nano Banana (nano-banana.md)

This is where it gets really creative. You can use the nanobanana MCP server or Nano Banana Gemini CLI extension to generate infographics and mindmaps from articles or search results.

For example, I ran it against one of my recent articles:

#!/usr/bin/env -S gemini-run --allowed-mcp-server-names=nanobanana

* Find the key points of the article at
  https://glaforge.dev/posts/2026/01/03/building-a-research-assistant-with-the-interactions-api-in-java/
* Make a sketchnote mindmap of the article, with pure white background.
  Use highlighters to stress important keywords, and colored thick arrows for each section.
* Display this infographic

And it generated this beautiful sketchnote mindmap:

[Image: Sketchnote mindmap of building a research assistant]

4. Chaining Search and Graphics (nano-banana-chain.md)

You can even pipe the output of a search script into a graphics script. For instance, getting the weather in Paris from the Google Search script we’ve already seen, and immediately generating a kawaii-style infographic of it by piping the weather output to a 🍌 Nano Banana script…

☁️ And then you can get a super-cute output like this one to illustrate the weather! ☀️

[Image: Infographic of current weather in Paris]

Conclusion

gcli-mdrun is all about making AI more accessible and integrable into your existing workflows. By treating Markdown as code, we can bridge the gap between human-readable documentation and machine-executable tasks, thanks to Gemini CLI and a little bit of glue shell script.

Check out the project on GitHub and tell me what cool and handy executable Markdown scripts you’ll create!

https://glaforge.dev/posts/2026/01/26/executable-markdown-files-with-gcli-mdrun-and-gemini-cli/
Implementing an arXiv MCP Server with Quarkus in Java

For my recent presentation at SnowCamp on AI Standards & Protocols for AI Agents, I decided to build an MCP server to access the arXiv research paper website where pre-print versions are published and shared with the community.

My goal was to shed light on some lesser-known aspects of the Model Context Protocol:

  • 🛠️ While the majority of MCP servers use the tools feature to expose actions that LLMs can request to call,
  • 📄 An MCP server can also share resources (and resource templates), exposing various static assets the AI app might be interested in,
  • ✏️ And prompts (and prompt templates) that users can access and reuse to utilize the MCP server effectively.
Info

For the impatient, feel free to go straight to the GitHub repository for the full source code. The README.md file gives instructions on how to build, run, configure, and use the server.

The Combo: Antigravity + Quarkus + Java

To implement this server, I selected the Quarkus framework (in Java) and its extensive MCP support (documented here).

I enlisted Antigravity to help me in this adventure. I pointed my agentic IDE to the arXiv API User’s Manual to draft and scaffold my project, and iteratively collaborated with it to expand the coverage of the arXiv API. It was a pretty productive session! I highly recommend checking out Antigravity!

A Look at the arXiv API

PDFs are accessible at a URL of the form https://arxiv.org/pdf/{paperID}.

The arXiv API offers programmatic access to e-prints via HTTP requests with parameters for searches, specific paper IDs, pagination, and sorting. Users can build complex queries using boolean operators, phrases, and grouping.

So for searching, I’ve created a REST client that I used in my MCP server implementation:

@RegisterRestClient(baseUri = "https://export.arxiv.org/api")
@RegisterProvider(ArxivResponseFilter.class)
public interface ArxivClient {

    @GET
    @Path("/query")
    @Produces("*/*")
    String search(
        @QueryParam("search_query") String searchQuery,
        @QueryParam("id_list") String idList,
        @QueryParam("start") int start,
        @QueryParam("max_results") int maxResults,
        @QueryParam("sortBy") String sortBy,
        @QueryParam("sortOrder") String sortOrder);
}
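To see what kind of HTTP request this client ends up sending, here is a JDK-only sketch that builds the same query URL by hand. The base URL and the parameter names come from the interface above, and the cat: and all: field prefixes and the AND operator belong to arXiv's query syntax. The ArxivUrls class and its method names are hypothetical and not part of the project.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; the actual project relies on the Quarkus REST client.
class ArxivUrls {
    static final String BASE = "https://export.arxiv.org/api/query";

    /** Builds a search URL with the same query parameters as ArxivClient.search(). */
    static String searchUrl(String searchQuery, int start, int maxResults,
                            String sortBy, String sortOrder) {
        return BASE
            // The query contains ':' and spaces, so it must be URL-encoded
            + "?search_query=" + URLEncoder.encode(searchQuery, StandardCharsets.UTF_8)
            + "&start=" + start
            + "&max_results=" + maxResults
            + "&sortBy=" + sortBy
            + "&sortOrder=" + sortOrder;
    }

    /** PDFs are served at https://arxiv.org/pdf/{paperID}. */
    static String pdfUrl(String paperId) {
        return "https://arxiv.org/pdf/" + paperId;
    }

    public static void main(String[] args) {
        // A boolean query: papers in the cs.AI category that mention "agents" anywhere
        System.out.println(searchUrl("cat:cs.AI AND all:agents", 0, 10,
                                     "lastUpdatedDate", "descending"));
        System.out.println(pdfUrl("2601.10702"));
    }
}
```

With the declarative REST client, the encoding and the string concatenation are handled by the framework, which is the main reason to prefer the annotated interface.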

Unusually, the arXiv API returns results in Atom 1.0 XML format (rather than the typical JSON), providing detailed metadata for both the query and individual articles, including titles, abstracts, authors, categories, and links.

In order to parse the Atom format and map the feeds to Java classes, I simply went with Jackson’s XML parser (perhaps I could have used the venerable Rome project).

Here’s one of the entities used in the Atom domain model:

@JsonIgnoreProperties(ignoreUnknown = true)
public class Entry {
    @JacksonXmlProperty(namespace = "http://www.w3.org/2005/Atom")
    public String id;

    @JacksonXmlProperty(namespace = "http://www.w3.org/2005/Atom")
    public String published;

    @JacksonXmlProperty(namespace = "http://www.w3.org/2005/Atom")
    public String title;

    @JacksonXmlProperty(namespace = "http://www.w3.org/2005/Atom")
    public String summary;

    @JacksonXmlElementWrapper(useWrapping = false)
    @JacksonXmlProperty(localName = "author",
                        namespace = "http://www.w3.org/2005/Atom")
    public List<Author> authors;

    // ...
}

(It is somewhat ironic to use the @JsonIgnoreProperties annotation when parsing XML!)
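If you'd rather not pull in Jackson's XML module, the same feed can be read with the DOM parser that ships with the JDK. The sketch below swaps the Jackson mapping shown above for plain DOM calls and only extracts entry titles. The AtomTitles class is a hypothetical name, and the sample feed in main is made-up data, not a real arXiv response.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical alternative to the Jackson mapping, using only the JDK's DOM parser.
class AtomTitles {
    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    /** Returns the title of every Atom entry found in the given feed. */
    static List<String> entryTitles(String atomXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(atomXml)));

            List<String> titles = new ArrayList<>();
            NodeList entries = doc.getElementsByTagNameNS(ATOM_NS, "entry");
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                NodeList title = entry.getElementsByTagNameNS(ATOM_NS, "title");
                if (title.getLength() > 0) {
                    titles.add(title.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse Atom feed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample feed, for demonstration purposes only
        String feed = """
            <feed xmlns="http://www.w3.org/2005/Atom">
              <title>Query results</title>
              <entry><title>First sample paper</title></entry>
              <entry><title>Second sample paper</title></entry>
            </feed>
            """;
        System.out.println(entryTitles(feed));
    }
}
```

The feed-level title is skipped because only title elements nested inside an entry are collected. The trade-off compared to Jackson is more boilerplate for each extra field you want to read.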

Let’s Start with Tools

What I like about the Quarkus MCP support is that to turn a Quarkus app into an MCP server, you just need a few Java annotations, and everything is handled for you!

In my ArxivMcpServer, I injected my ArxivClient REST client:

public class ArxivMcpServer {
    @Inject
    @RestClient
    ArxivClient arxivClient;
    // ...
}

Then to define a tool, I used the @Tool annotation:

@Tool(
    description = "Search for papers on arXiv",
    name = "search_papers")
public Feed searchPapers(String query, int maxResults,
                         SortBy sortBy, SortOrder sortOrder) {
    return performSearch(
        query, null, 0, maxResults == 0 ? 5 : maxResults,
        sortBy == null ? null : sortBy.name(),
        sortOrder == null ? null : sortOrder.name());
}

I could have added @ToolArg annotations on the parameters of this method to give the LLM more context on their role, but the parameter names were self-explanatory.
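The tool method above quietly normalizes its arguments: a maxResults of 0 becomes 5, and a missing enum becomes a null query parameter. Here is a small standalone sketch of that logic. The SortBy and SortOrder definitions are assumptions on my part (their constants mirror the values the arXiv API accepts for sortBy and sortOrder, so that name() can be passed straight through), and the SearchDefaults class is hypothetical.

```java
// Hypothetical sketch of the argument normalization done in searchPapers().
class SearchDefaults {

    // Assumed definitions: constant names match the values accepted by the arXiv API
    enum SortBy { relevance, lastUpdatedDate, submittedDate }
    enum SortOrder { ascending, descending }

    /** Falls back to 5 results when the caller left maxResults at 0. */
    static int effectiveMaxResults(int maxResults) {
        return maxResults == 0 ? 5 : maxResults;
    }

    /** Converts an optional enum into an optional query parameter value. */
    static String paramOrNull(Enum<?> value) {
        return value == null ? null : value.name();
    }

    public static void main(String[] args) {
        System.out.println(effectiveMaxResults(0));
        System.out.println(paramOrNull(SortBy.lastUpdatedDate));
        System.out.println(paramOrNull(null));
    }
}
```

Passing null for an unset parameter matters because REST clients typically omit null query parameters from the request, which lets the arXiv API apply its own defaults.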

Expose the Taxonomy as a Resource

For well-known static assets like the taxonomy of all the domain categories of research papers, you can expose an MCP Resource:

@Resource(
    uri = "arxiv://taxonomy",
    description = "List of arXiv categories and their codes",
    mimeType = "text/markdown")
public TextResourceContents getTaxonomy() {
    return TextResourceContents.create("arxiv://taxonomy",
        """
        # arXiv Category Taxonomy

        ## Computer Science (cs)
        *   **Artificial Intelligence** (cs.AI)
        *   **Computation and Language** (cs.CL)
        *   **Computer Vision** (cs.CV)
        *   **Machine Learning** (cs.LG)
        *   **Robotics** (cs.RO)
        *   **Software Engineering** (cs.SE)
        *   ... and many more.

        ## Physics
        ...
        """);
}

Again, an annotation, @Resource, is all that’s needed to define a resource. We give it a URI, a description, and a MIME type.

Resources are either text content or binary content. So depending on the type of your resource, your methods can return either TextResourceContents or BlobResourceContents. Here, for the taxonomy, it’s just Markdown text.

Since there is only one known taxonomy, the resource name is static and explicit. However, you can also take advantage of MCP Resource Templates which support parameterization.

This is the case, for example, when accessing metadata for each paper. So I created a resource template as follows:

@ResourceTemplate(
    uriTemplate = "arxiv://papers/{id}/metadata",
    description = "The full metadata of the arXiv paper",
    mimeType = "application/json")
public TextResourceContents getMetadata(@ResourceTemplateArg String id) {
    Feed feed = performSearch(null, id, 0, 1, null, null);
    if (feed.entries != null && !feed.entries.isEmpty()) {
        try {
            return TextResourceContents.create(
                "arxiv://papers/" + id + "/metadata",
                jsonMapper.writeValueAsString(feed.entries.get(0)));
        } catch (JsonProcessingException e) {
            throw new RuntimeException(
                "Failed to serialize paper metadata", e);
        }
    }
    throw new RuntimeException("Paper not found: " + id);
}

I used an @ResourceTemplate annotation, and the uri parameter is replaced with uriTemplate which contains a placeholder for the ID of the paper. That paper ID is actually passed as a parameter to the method, and this parameter is annotated with a @ResourceTemplateArg annotation.

As with plain resources, a resource template method returns either TextResourceContents for text or BlobResourceContents for binary content.
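The framework takes care of matching an incoming URI against the template and extracting the placeholder values. To illustrate the idea (this is not how the Quarkus extension is implemented, just a sketch of the concept), here is a tiny matcher built on regular expressions. UriTemplateMatcher is a hypothetical class, and it only supports simple {name} placeholders that don't span a slash.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of URI template matching; not the Quarkus implementation.
class UriTemplateMatcher {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    /** Returns the placeholder values if the URI matches the template, or empty otherwise. */
    static Optional<Map<String, String>> match(String template, String uri) {
        List<String> names = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher placeholders = PLACEHOLDER.matcher(template);
        int last = 0;
        while (placeholders.find()) {
            // Literal parts are quoted, each placeholder becomes a capturing group
            regex.append(Pattern.quote(template.substring(last, placeholders.start())));
            regex.append("([^/]+)");
            names.add(placeholders.group(1));
            last = placeholders.end();
        }
        regex.append(Pattern.quote(template.substring(last)));

        Matcher candidate = Pattern.compile(regex.toString()).matcher(uri);
        if (!candidate.matches()) {
            return Optional.empty();
        }
        Map<String, String> values = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            values.put(names.get(i), candidate.group(i + 1));
        }
        return Optional.of(values);
    }

    public static void main(String[] args) {
        System.out.println(match("arxiv://papers/{id}/metadata",
                                 "arxiv://papers/2601.10702/metadata"));
        System.out.println(match("arxiv://papers/{id}/metadata", "arxiv://taxonomy"));
    }
}
```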

Prepare Reusable Prompts for the User

MCP Prompts are ready-made prompts that users can invoke to make the best possible use of the MCP server.

Here’s a method returning a prompt to get summaries of papers:

@Prompt(
    name = "summarize_paper",
    description = "Summarize the given paper")
public PromptMessage summarizePaper(String id) {
    Feed feed = performSearch(null, id, 0, 1, null, null);
    if (feed.entries != null && !feed.entries.isEmpty()) {
        String summary = feed.entries.get(0).summary;
        return PromptMessage.withUserRole(
            String.format("""
                Please summarize this paper abstract (ID: %s):

                %s""", id, summary));
    }
    return PromptMessage.withUserRole("Error: Paper not found");
}

The method returns a PromptMessage with the user role, as it is a prompt for the user.

This is a simple prompt that summarizes the paper’s abstract (a summary of a summary). Abstracts can be overly scientific and hard for non-experts to decipher. However, this simple prompt usually yields easy-to-understand summaries. Of course, you might instead retrieve the whole paper and create a much more elaborate summary that analyzes the whole content instead of just the abstract.

Perhaps more interesting is the prompt I defined to help craft search queries:

@Prompt(
    name = "construct_search_query",
    description = "Helper to construct an arXiv search query")
public PromptMessage constructSearchQuery(
    @PromptArg(description = "Topic or keywords") String topic,
    @PromptArg(description = "Author name") String author,
    @PromptArg(description = "Category code (e.g. cs.AI)") String category,
    @PromptArg(description = "Year (e.g. 2024)") String year) {
        // ...
}

I’m just showing the signature here, but notice that this time I used @PromptArg annotations to give more details about each argument.

Configuration in Gemini CLI

To use this MCP server in your favorite MCP client, you’ll have to configure the client to point at this Quarkus application. MCP servers can be either local STDIO servers that run alongside your application (they are launched by your client and communicate over standard input and output), or remote servers that use the Streamable HTTP transport (such a server can also run locally, or be deployed in the cloud).

One cool thing with the Quarkus MCP extension is that choosing between STDIO and Streamable HTTP is just a build dependency change. And if you want, you can build your MCP server to support both transports by using both dependencies in your build. Here, for example with Maven’s pom.xml:

<dependency>
    <groupId>io.quarkiverse.mcp</groupId>
    <artifactId>quarkus-mcp-server-http</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>io.quarkiverse.mcp</groupId>
    <artifactId>quarkus-mcp-server-stdio</artifactId>
    <version>1.8.1</version>
</dependency>

For coding and plenty of other automation tasks, I tend to use Gemini CLI, but the syntax should be similar for your favorite chat / coding agent.

In development mode, I was actually running my MCP server as a Streamable HTTP server on the same host. So I was simply running Quarkus in development mode with:

./mvnw quarkus:dev

And then I pointed the Gemini CLI at the local URL in my ~/.gemini/settings.json:

{
  "mcpServers": {
    "arxiv": {
      "httpUrl": "http://localhost:8080/mcp"
    }
  }
}

But once I’m happy with the development, I install the application in my Maven local repository (or elsewhere), and then I configure the MCP server to point at the absolute path where the JAR was installed:

{
  "mcpServers": {
    "arxiv": {
      "command": "java",
      "args": ["-jar", "/absolute/path/to/quarkus-run.jar"]
    }
  }
}

Of course, if I decided to deploy my arXiv server to Cloud Run, for example, I would use the httpUrl parameter configuration approach instead.

Let’s Search Papers!

When I run the /mcp list command inside Gemini CLI, I see my tools, resources, and prompts are properly exposed:

Note

Resource templates are currently not supported by Gemini CLI, so they don’t yet appear, but soon will, hopefully! Stay tuned.

I asked what are the latest 10 papers in artificial intelligence (sorted by publication date)? and you can see that it invoked my arXiv server and its search_papers tool with the following parameters: {"sortOrder": "descending", "sortBy": "lastUpdatedDate", "query": "cat:cs.AI", "maxResults": 10}.
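Under the hood, MCP messages are JSON-RPC 2.0, and a tool invocation is a tools/call request carrying the tool name and its arguments. As a rough sketch of what the client sends for the call above, here is a hand-rolled builder. ToolCallRequest is a hypothetical class, the request id is arbitrary, and a real client would use a JSON library rather than string concatenation.

```java
// Hypothetical illustration of the JSON-RPC 2.0 envelope of an MCP tool call.
class ToolCallRequest {

    /**
     * Builds a "tools/call" request. The argumentsJson parameter must already be
     * a valid JSON object, and toolName must not need JSON escaping.
     */
    static String toolsCall(int id, String toolName, String argumentsJson) {
        return "{\"jsonrpc\": \"2.0\", \"id\": " + id
            + ", \"method\": \"tools/call\""
            + ", \"params\": {\"name\": \"" + toolName + "\""
            + ", \"arguments\": " + argumentsJson + "}}";
    }

    public static void main(String[] args) {
        // The arguments reported by Gemini CLI for the search above
        String arguments = "{\"sortOrder\": \"descending\", \"sortBy\": \"lastUpdatedDate\", "
            + "\"query\": \"cat:cs.AI\", \"maxResults\": 10}";
        System.out.println(toolsCall(1, "search_papers", arguments));
    }
}
```

With the STDIO transport, such a message travels on the server's standard input, whereas with Streamable HTTP it is posted to the /mcp endpoint configured earlier.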

Then Gemini synthesized a human-readable interpretation of those JSON search results:

I asked for more details about #2 and #7 of the list with this query: I'd like to learn more about #2 and #7

And the get_paper_details MCP tool was invoked with the paper IDs ({"ids": ["2601.10702", "2601.10679"]}):

Finally Gemini gave me a bit more context about those two papers:

And voila!

Summary

In this post, we’ve seen how the combination of Quarkus and its MCP extension makes it straightforward to build a feature-complete MCP server in Java. By leveraging annotations, we easily exposed not just tools, but also resources (for taxonomy and metadata) and prompts (to guide the user), providing a rich context for any AI agent.

The development process was also a great example of AI-assisted productivity: using Antigravity to scaffold the project and handle the integration with the arXiv API significantly sped up the implementation. Whether you choose to run your server via STDIO for local use or HTTP for remote access, the Model Context Protocol opens up exciting possibilities for making your data and services “AI-ready.”

Feel free to explore the source code on GitHub and start building your own MCP servers!

https://glaforge.dev/posts/2026/01/18/implementing-an-arxiv-mcp-server-with-quarkus-in-java/
On AI Standards and Protocols: Focus on MCP and A2A

At SnowCamp 2026, with my Cast Codeurs buddy Emmanuel Bernard of Hexactgon, I had the chance to deliver a talk on AI standards and protocols, with a big focus on MCP (Model Context Protocol), and A2A (Agent 2 Agent Protocol).

Without further ado, here’s the slide deck we presented:

This talk is based on the Devoxx 2025 deep dive session that I delivered with Emmanuel and my colleague Mete Atamel. As the talk wasn’t recorded during SnowCamp, I’ll share with you the 3h-long video from Devoxx below:

Abstract

AI agent foundations are built on a handful of common protocols that you need to master to make the best of your LLM and agent framework. That’s why it’s important to understand them. But some are catching on, others are not.

In this deep dive, we will explore the ecosystem showing you these standards and focusing on the important ones. Knowing some of the frameworks is useful too to get started faster. Welcome MCP, A2A, ACP protocols, and ADK, Arc, Quarkus, LangChain4j frameworks!

After giving you an overview of the main standards and protocols, their merit and their popularity, we will start by building an agent using Agent Development Kit (ADK) and walk through making a tool call. From there, zooming on MCP, we’ll see how to standardize that tool via a local MCP server and then deploy it as a remote MCP server to share it with others.

Next, we’ll dive into the A2A protocol and enable our agent to participate in multi-agent conversations. And to do that, we will use another framework, Quarkus and LangChain4j, showing how different stacks interact seamlessly through A2A.

You’ll learn not just what these protocols do and how they work, but why they matter, with detailed walkthroughs and live demos throughout.

If you’re struggling to understand all the protocol details around AI agents, this session is for you!

Important links

Throughout the presentation, we showed various demos, implemented in Java:

  • MCP
  • A2A
https://glaforge.dev/talks/2026/01/16/on-ai-standards-and-protocols-focus-on-mcp-and-a2a/
Building a Research Assistant with the Interactions API in Java

First of all, dear readers, let me wish you a happy new year! This is my first post on this blog for 2026. I’m looking forward to continuing sharing interesting content with you.

During my holiday break, I wanted to put my recent Java implementation of the Gemini Interactions API to the test. I implemented and released it with the help of Antigravity. My colleague Shubham Saboo and Gargi Gupta wrote a tutorial on how to build an AI research agent with Google Interactions API & Gemini 3. I thought this was a great opportunity to replicate this example in Java using my Interactions API Java SDK.

A picture is often worth a thousand words, so let’s have a look at the key components of our research agent workflow:

The goal of this tutorial is to build a research assistant, split into 4 key phases:

  • Phase 1️⃣ : Planning — Given a topic to research, aided by the Google Search tool, Gemini 3 Flash defines different research tasks related to the topic.
  • Phase 2️⃣ : Research — The Deep Research model will be launched as a background task to research the different topic areas defined in the planning phase.
  • Phase 3️⃣ : Synthesis — This time, we use the more powerful Gemini 3 Pro to do the synthesis of the research report.
  • Phase 4️⃣ : Infographic — Last but not least, we’ll use 🍌 Nano Banana Pro (aka Gemini 3 Pro Image) to generate an infographic about this research.
Let’s Implement this Research Workflow!

Information

You’ll find the entire source code for this example in my GitHub repository.

This all starts with the planning phase, using Gemini 3 Flash:

// Step 0: Define the research goal
String researchGoal = """
    Research the current state of Quantum Computing in 2025,
    specifically looking for major breakthroughs in error correction.
    """;

// --- Phase 1: Plan ---
// Gemini 3 Flash Preview creates research tasks
ModelInteractionParams planParams = ModelInteractionParams.builder()
        .model("gemini-3-flash-preview")
        .input(String.format("""
            Create a numbered research plan for: %s
            Format: 1. [Task] - [Details]
            Include 3 specific tasks.
            """, researchGoal))
        .tools(new GoogleSearch())
        .store(true)
        .build();

// Launch the request
Interaction planInteraction = client.create(planParams);

// Retrieve the response with text and interaction ID
String planText = getText(planInteraction);
String planId = planInteraction.id();

The goal is to research information about the latest breakthroughs in Quantum Computing over the past year. We create an interaction that asks Gemini 3 Flash to define a few research tasks, following a specific format.

Notice that we provide the built-in GoogleSearch tool so the model can search the internet when defining those tasks.

We set store(true) to save the interaction on the server-side, and we save the interaction ID for later reuse, ensuring subsequent interactions continue the same discussion.

// Utility method to extract the LLM generated tasks
List<String> tasks = parseTasks(planText);

// --- Phase 2: Research ---
// Select tasks and run Deep Research Agent
// In this test, we select all tasks.

String selectedTasks = String.join("\n\n", tasks);

AgentInteractionParams researchParams = AgentInteractionParams.builder()
        .agent("deep-research-pro-preview-12-2025")
        .input(String.format(
                "Research these tasks thoroughly with sources:\n\n%s",
                selectedTasks))
        .previousInteractionId(planId)
        .background(true)
        .store(true)
        .build();
Interaction researchInteraction = client.create(researchParams);

String researchId = researchInteraction.id();
String researchText = getText(researchInteraction);

// Wait for completion up to 10 mins as deep research can be slow
researchInteraction = waitForCompletion(client, researchId, 600);

In this second phase, I use a few utility methods to parse the tasks generated in the previous phase (I should probably use structured output for that at some point) and to wait for the completion of the background tasks.
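The parseTasks utility isn't shown in this post, so here is one possible way to write it, assuming the model sticks to the requested "1. [Task] - [Details]" format with one task per line. PlanParser is a hypothetical class, and this regex-based version simply ignores any line that doesn't start with a number followed by a dot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: one possible implementation of a parseTasks() helper.
class PlanParser {
    // A line starting with digits and a dot, e.g. "2. Compare vendors - details"
    private static final Pattern NUMBERED_LINE =
        Pattern.compile("^\\s*\\d+\\.\\s+(.+)$", Pattern.MULTILINE);

    /** Extracts the text of each numbered item, without its leading number. */
    static List<String> parseTasks(String planText) {
        List<String> tasks = new ArrayList<>();
        Matcher matcher = NUMBERED_LINE.matcher(planText);
        while (matcher.find()) {
            tasks.add(matcher.group(1).trim());
        }
        return tasks;
    }

    public static void main(String[] args) {
        // Made-up plan text, shaped like the format requested in the prompt
        String plan = """
            Here is the research plan:
            1. Survey recent results - Look for announcements from the past year
            2. Compare approaches - Contrast the main error correction schemes
            3. Summarize open problems - List what remains unsolved
            """;
        parseTasks(plan).forEach(System.out::println);
    }
}
```

Text parsing like this is brittle if the model drifts from the format, which is why structured output is the more robust option in the long run.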

A few interesting points here:

  • We reuse the interaction ID from the previous phase via previousInteractionId(), taking advantage of the stateful nature of the Interactions API. This allows the research agent to maintain context from the planning phase, in addition to the specific tasks generated.
  • We specify that the task should run in the background with background(true).
  • Finally, we poll for the completion of the Deep Research agent, which can take several minutes.
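The polling step can be sketched independently of the SDK. The version below is a generic helper, not the waitForCompletion utility from the repository: it takes a status supplier instead of a client, and it assumes that a still-running interaction reports the status "in_progress" (an assumption for this sketch). The Polling class and its parameters are hypothetical.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical, SDK-independent sketch of polling a background task until it finishes.
class Polling {

    /**
     * Calls statusSupplier repeatedly until it returns something other than
     * "in_progress", and returns that final status. Fails if the timeout elapses first.
     */
    static String waitForCompletion(Supplier<String> statusSupplier,
                                    Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            String status = statusSupplier.get();
            if (!"in_progress".equals(status)) {
                return status; // finished, successfully or not
            }
            if (System.nanoTime() >= deadline) {
                throw new IllegalStateException("Timed out after " + timeout);
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        // Simulated task that completes on the third status check
        int[] checks = {0};
        String result = waitForCompletion(
            () -> ++checks[0] < 3 ? "in_progress" : "completed",
            Duration.ofSeconds(5), Duration.ofMillis(100));
        System.out.println(result + " after " + checks[0] + " checks");
    }
}
```

In a real application, the supplier would fetch the interaction by its ID and return its current status, and the interval would be several seconds to avoid hammering the API.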
// --- Phase 3: Synthesis ---
ModelInteractionParams synthesisParams = ModelInteractionParams.builder()
        .model("gemini-3-pro-preview")
        .input("""
            Create executive report with Summary, Findings,
            Recommendations, Risks based on the research.
            """)
        .previousInteractionId(researchId)
        .store(true)
        .build();

Interaction synthesisInteraction = client.create(synthesisParams);
String synthesisText = getText(synthesisInteraction);

Phase 3 is a simple call to Gemini 3 Pro to synthesize the research report. Again, we store the session and reuse the previous interaction ID.

// --- Phase 4: Infographic ---
ModelInteractionParams infographicParams = ModelInteractionParams.builder()
        .model("gemini-3-pro-image-preview")
        .input("Create a whiteboard summary infographic for the following: \n\n"
                + synthesisText)
        .responseModalities(List.of(Modality.IMAGE))
        .build();

Interaction infographicInteraction = client.create(infographicParams);

saveInfographic(infographicInteraction);

The last phase is the infographic generation, using 🍌 Nano Banana Pro. We pass the synthesis from the previous phase. We don’t need to reuse the interaction ID here, as the synthesis itself provides enough context for the infographic.
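The saveInfographic helper isn't shown here. As a rough idea of what such a helper has to do, the sketch below assumes the image comes back as a Base64-encoded string, which is a common convention for JSON APIs. How the SDK actually exposes the image data isn't covered in this post. ImageSaver and saveBase64Image are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical sketch: decoding a Base64-encoded image payload and writing it to disk.
class ImageSaver {

    /** Decodes the Base64 payload and writes the raw bytes to the target file. */
    static Path saveBase64Image(String base64Data, Path target) {
        byte[] imageBytes = Base64.getDecoder().decode(base64Data);
        try {
            return Files.write(target, imageBytes);
        } catch (IOException e) {
            throw new UncheckedIOException("Could not write " + target, e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in payload: three arbitrary bytes instead of real image data
        String payload = Base64.getEncoder().encodeToString(new byte[] {1, 2, 3});
        Path file = Files.createTempFile("infographic", ".png");
        System.out.println("Saved to " + saveBase64Image(payload, file));
    }
}
```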

Reminder

You can check the entire source code on GitHub.

The Outcome

I won’t include the full output, but I’d like to highlight the impressive infographic generated by Nano Banana Pro as a result of this research plan:

I hope this makes sense to those of you knowledgeable about Quantum Computing. While I’m not an expert in the field, I’m really fond of the infographics that 🍌 Nano Banana can generate, with their sharp and crisp text.

What’s to Like?

What I particularly like about the Interactions API is that it handles state on the server-side. It’s a departure from the traditional stateless LLM conversations where frameworks must pass the entire history at each round. Even unrelated LLM requests or agent tasks can share the same session by reusing the interaction ID.

Additionally, I’m happy that my Java SDK for the Interactions API works well for more involved use cases, validating its capabilities. Until the Gemini unified SDK supports the Interactions API, I’ll definitely be sticking with my own!

https://glaforge.dev/posts/2026/01/03/building-a-research-assistant-with-the-interactions-api-in-java/
Implementing the Interactions API with Antigravity

Google and DeepMind have announced the Interactions API, a new way to interact with Gemini models and agents.

Here are some useful links to learn more about this new API:

About the Interactions API

The Rationale and Motivation

The Interactions API was introduced to address a shift in AI development, moving from simple, stateless text generation to more complex, multi-turn agentic workflows. It serves as a dedicated interface for systems that require memory, reasoning, and tool use. It provides a unified interface for both simple LLM calls and more complex agent calls.

If you’ve used the Gemini API before, the standard operation (generateContent) was designed for simple request-response tasks. It was stateless, requiring you to send the entire conversation history with each new question. As models evolved to incorporate “thinking” processes and advanced tool use (e.g., built-in tools, sequential and parallel function calls), the classic API approach required extra fields to manage state, such as “thought signatures”. This new interface and endpoint support both raw models (like Gemini 3 Pro) and fully managed agents (like the Gemini Deep Research Agent).

The Key Advantages

The Interactions API simplifies state and context management through server-side history and native handling of agent “thoughts” and data schemas. It supports background processing for long-running tasks (in particular for agents). Furthermore, work on interoperability with the Agent Development Kit (ADK) and the Agent2Agent (A2A) protocol is ongoing. Finally, it offers advanced capabilities such as tooling (including Google Search and Code execution), structured JSON outputs, Model Context Protocol (MCP) support, and native multimodal handling.

Implementing the Interactions API with Antigravity

A Few Words About Antigravity

I decided to put Antigravity, Google’s new agentic development environment, to the test by pointing it at the OpenAPI 3 specification of the Interactions API, and iterating with it to come up with a Java implementation.

Learning more about Antigravity

You can read more about the key aspects of Antigravity in my colleague Mete Atamel’s article introducing Antigravity and tips and tricks, as well as his follow-up article on how to customize Antigravity with rules and workflows. And my other colleague, Romin Irani, wrote a great codelab to get you started.

What I find interesting with Antigravity is that it shifts the developer’s perspective. We’re used to having the code editor as our central point of focus and action. But with Antigravity (advertised as an agentic development platform), the main point of entry is the agent manager.

The agent manager has an inbox, with the various ongoing implementation tasks.

You have workspaces for the different projects you’re working on.

There’s a playground to test ideas, which you can then convert to a proper workspace should the experiment become serious. It is actually through the playground that I started experimenting with the Interactions API.

Fear not, you can always switch back to the code editor (a fork of VS Code)! That’s also where you’ll be able to approve/reject code changes suggested by Antigravity.

Another important aspect of Antigravity is that you start with a prompt explaining the task at hand, potentially adding all sorts of context (like screenshots or documents). Then Antigravity creates an implementation plan that you can comment on and review, as you would in a Google Doc.

Once you’re happy with the plan, Antigravity will start working for you. Depending on configuration and task complexity, it might request you to approve changes, tool usage, etc.

Once the task is accomplished, Antigravity will show you a walkthrough to guide you through the implementation. And as always, it’s still possible to review it, and ask for further modifications or improvements.

You can also view the task list of all the incremental steps Antigravity went through to implement your requests.

Color Me Impressed!

Honestly, I was quite impressed with Antigravity and Gemini 3 Pro. It successfully implemented an elegant Java API based on the Interactions API’s OpenAPI specification. Then I iterated with Antigravity to further tweak it to my liking, to add tests, and to help me deploy the project to Maven Central. Not only was it good at coding and following a plan, but it was also very helpful on the command line for running the build, the deployment commands, etc.

Now Let’s Interact!

Antigravity helped me publish my Java implementation of the Interactions API to Maven Central, guiding me through the creation of public/private keys and the Maven commands required to prepare and perform the release.

Note

You can find my implementation on GitHub. Have a look at the README for usage details, but we’ll go through them together in this article.

Setup and Authentication

In your Java project’s build file, you’ll need to specify the dependency to my SDK.

In your pom.xml if you’re building with Maven:

<dependency>
    <groupId>io.github.glaforge</groupId>
    <artifactId>gemini-interactions-api-sdk</artifactId>
    <version>0.3.0</version>
</dependency>

And in your build.gradle if you’re building with Gradle:

implementation 'io.github.glaforge:gemini-interactions-api-sdk:0.3.0'

And you should export a Gemini API key (that you can get in Google AI Studio) as an environment variable:

export GEMINI_API_KEY=YOUR_API_KEY

Now you’re ready to interact!

Your First Interaction

Instead of just sending a prompt string, we create an Interaction object. We specify the kind (the model or agent we want to talk to) and the parameters (our prompt).

Let’s make a simple call to Gemini (I’ll spare you the imports):

// Create a client with your API key
GeminiInteractionsClient client = GeminiInteractionsClient.builder()
        .apiKey(System.getenv("GEMINI_API_KEY"))
        .build();

// Create the interaction, choosing a model, and passing the prompt in input
Interaction response = client.create(ModelInteractionParams.builder()
        .model("gemini-2.5-flash")
        .input("Why is the sky blue?")
        .build());

// The output is multimodal, so let's see if there's text, image, or thoughts in output
response.outputs().forEach((Content output) -> {
    switch (output) {
        case TextContent text -> System.out.println(text.text());
        case ImageContent image -> System.out.println(image.data());
        case ThoughtContent thought -> System.out.println("Thought: " + thought.signature());
        default -> System.out.println("Unknown content type: " + output);
    }
});

This is a synchronous call, so you’ll be waiting for it to finish generating its answer.

A Multi-turn conversation

When not taking advantage of the statefulness of the Interactions API, you can still do multi-turn conversations:

ModelInteractionParams request = ModelInteractionParams.builder()
    .model("gemini-2.5-flash")
    .input(
        new Turn(USER, "Hello!"),
        new Turn(MODEL, "Hi! How can I help?"),
        new Turn(USER, "Tell me a joke")
    )
    .build();

Interaction response = client.create(request);

A Multimodal request

The Interactions API handles multimodal requests, mixing text, images, audio, videos, etc.:

ModelInteractionParams request = ModelInteractionParams.builder()
    .model("gemini-2.5-flash")
    .input(
        new TextContent("Describe this image"),
        // Create an image from Base64 string
        new ImageContent("BASE64_STRING...", "image/png")
    )
    .build();

Interaction response = client.create(request);

Creating an Image with Nano Banana Pro 🍌

I’m a big fan of the Nano Banana model for creating and editing images. You can easily invoke it as well:

ModelInteractionParams request = ModelInteractionParams.builder()
    .model("gemini-3-pro-image-preview")
    .input("Create an infographic about blood, organs, and the circulatory system")
    .responseModalities(Modality.IMAGE)
    .build();

Interaction response = client.create(request);

response.outputs().forEach(content -> {
    if (content instanceof ImageContent image) {
        byte[] imageBytes = Base64.getDecoder().decode(image.data());
        // Save imageBytes to a file
    }
});

Function Calling

You can pass tools to your model so that it can request their use to achieve its goal. This example is a little bit more involved, of course, as it defines a function and handles the back-and-forth exchange:

// 1. Define the tool: its name, description, and input schema (a JSON Schema expressed as plain Maps)
Function weatherTool = Function.builder()
    .name("get_weather")
    .description("Get the current weather")
    .parameters(
        Map.of(
            "type", "object",
            "properties", Map.of(
                "location", Map.of("type", "string")
            ),
            "required", List.of("location")
        )
    )
    .build();

// 2. Initial request with tools
ModelInteractionParams request = ModelInteractionParams.builder()
    .model("gemini-2.5-flash")
    .input("What is the weather in London?")
    .tools(weatherTool)
    .build();

Interaction interaction = client.create(request);

// 3. Handle function call
Content lastOutput = interaction.outputs().getLast();
if (lastOutput instanceof FunctionCallContent call) {
    if ("get_weather".equals(call.name())) {
        String location = (String) call.arguments().get("location");
        // Execute local logic...
        String weather = "Rainy, 15°C"; // Simulated result

        // 4. Send Function Result
        ModelInteractionParams continuation = ModelInteractionParams.builder()
            .model("gemini-2.5-flash")
            // Passing previous interaction ID instead of the whole conversation
            .previousInteractionId(interaction.id())
            .input(new FunctionResultContent(
                "function_result",
                call.id(),
                call.name(),
                false,
                Map.of("weather", weather)
            ))
            .build();

        Interaction finalResponse = client.create(continuation);
        System.out.println(finalResponse.outputs().getLast());
    }
}

My SDK is fairly bare-bones and doesn’t handle automatic function calling the way the Gemini SDK or LangChain4j do. But it could be a possible enhancement.

In the code, notice how it handles the session state. Instead of passing the whole conversation again when replying with the tool response, we actually pass only the interaction ID of the previous call. Hence, state is handled on the server side!

Deep Research

An important aspect of the Interactions API is its ability to call agents, and not just models. The Deep Research agent is the first to implement the Interactions API. This is the research agent you may be familiar with from the Gemini web app, used to create long and detailed reports. This time we’re going to create an agent interaction instead of a model interaction:

AgentInteractionParams request = AgentInteractionParams.builder()
    .agent("deep-research-pro-preview-12-2025")
    .input("Research the history of the Google TPUs")
    .build();

Interaction interaction = client.create(request);

// Poll for completion
while (interaction.status() != Status.COMPLETED) {
    Thread.sleep(1000);
    interaction = client.get(interaction.id());
}

System.out.println(interaction.outputs());

This is an asynchronous task, so you have to wait for its completion by polling. You can also steer the output report generation by further customizing your prompt, suggesting a particular report structure.

Model Context Protocol

The Interactions API comes with built-in MCP (Model Context Protocol) support. And by support, I mean that it handles the MCP call itself.

A few caveats to be aware of:

  • Currently, it can only call remote MCP servers (not local STDIO ones), and only Streamable HTTP servers are supported (Server-Sent Events are deprecated anyway).
  • Currently, it works with Gemini 2.5, but not with Gemini 3 Pro.
  • MCP server names shouldn’t contain a - character, as it is reserved for namespacing function calls with the name of the server (e.g. moon_server-current_moon_phase, with moon_server being the server name and current_moon_phase the function name). So favor underscores in server names.

// 1. Define the MCP Server tool
String serverName = "moon_server";
String serverUrl = "https://mn-mcp-server-1029513523185.europe-west1.run.app/mcp";

Tool mcpServer = new Tool.McpServer(serverName, serverUrl);
List<Tool> tools = List.of(mcpServer);

// 2. Create Interaction
ModelInteractionParams createParams = ModelInteractionParams.builder()
    .model("gemini-2.5-flash")
    .input("What is the current phase of the moon?")
    .tools(tools)
    .build();

// 3. Make the Request
Interaction response = client.create(createParams);

// 4. Then Analyze the Response...

Wrapping up

The Interactions API unifies the way we handle both fast inference (standard models) and slow thinking (agents). By generating this SDK, and iterating on it with Antigravity, I’ve tried to make it as easy as possible to integrate these new capabilities into Java applications.

Note that this is not (yet?) a production-grade SDK; it’s the result of a few hours of experimentation and would benefit from further refinement. But I was very happy with the outcome, as I was able to quickly experiment with this new API in Java.

Check out the README for more configuration details and examples.

https://glaforge.dev/posts/2025/12/15/implementing-the-interactions-api-with-antigravity/
AI Agentic Patterns and Anti-Patterns

This week, I was on stage at the Tech Rocks Summit 2025 in the beautiful Théâtre de Paris. This is the first time I’m attending this event, which gathers a nice crowd of CTOs, tech leads, architects, and decision makers.

My talk focused on what everyone is talking about right now: AI Agents. And in particular, I was interested in sharing with the audience things I’ve seen work or not work in companies, startups, and via tons of discussions with AI practitioners I met at conferences, meetups, or customer meetings.

Without further ado, here’s the deck in French 🇫🇷 I showed on stage:

And in English as well for my international readers:

A Quick Historical Recap

We saw the Transformer wave in 2017, the ChatGPT tsunami in 2023, and the RAG (Retrieval Augmented Generation) trend in 2024. In 2025, here we are: Agents are the new frontier for LLMs.

But concretely, what does this change for us, devs and tech leaders? What works, what doesn’t work? Here are the key points of my presentation.

What is an Agent, Really?

Forget the magic for two minutes. An agent is a fairly simple equation:

Agent = LLM + Memory + Planning + Tools

It is no longer just a model predicting the next word. It is a system that observes, plans, acts, and thinks (the famous Reflection loop to correct its own errors).

Architecture Patterns that Work

I presented 4 patterns to avoid reinventing the wheel:

  • The Orchestrator: A supervisor agent that delegates to specialized sub-agents. This is crucial for breaking down a complex task into digestible chunks.
  • Rethinking Tools: Don’t just throw your raw REST API at the LLM. Create “business task” oriented tools (e.g., “Schedule Meeting” vs POST /calendar/v1/events). Fewer tools = less confusion = more determinism.
  • MCP (Model Context Protocol): This is the future standard, essentially the USB for AI tools. Launched by Anthropic and now widely adopted (but still rapidly evolving), it standardizes how an agent connects to its tools.
  • A2A (Agent to Agent): Google and its partners are pushing this extensible protocol so that agents can discover and collaborate with each other, regardless of their language or framework.

Traps to Avoid (Anti-Patterns)

I insisted on this because I see teams falling into these traps:

  • The “Chatbot Mandate”: Does your leadership want “a chatbot”? Resist. AI should often be invisible (like a Head-Up Display), not necessarily an endless conversation.
  • Insufficient Vibe-Checking: “It looks like it works” is not a testing strategy. You need Golden Responses, LLM-as-a-Judge, and a real evaluation phase.
  • Silent Confabulation: RAG is great, but if the AI invents things, it’s dangerous. Force source citation and aim for IVO (Immediately Validatable Output, coined by my colleague Zack Akil): the user must be able to verify the result at a glance.
  • The Coding “Rabbit Hole”: Coding agents are stunning but can lead you down the wrong path with incredible confidence. (“You’re absolutely right!”) Keep a cool head and focus on value (MVP), not feature creep.

Back at the Office: What Do We Do?

I concluded with a “Todo List” for when attendees are back at the office:

  • Don’t ask yourself “Where can I squeeze in a chatbot?”. Instead, identify the most painful business process (the Critical User Journey).
  • Experiment small. The goal is to learn.
  • Measure & Evaluate. It’s your users who will tell you if you’re right, not the hype.

The agent might not buy happiness, but implemented well, it can seriously contribute to it! 😄

https://glaforge.dev/talks/2025/12/02/ai-agentic-patterns-and-anti-patterns/
Gemini Is Cooking Bananas Under Antigravity

What a wild title, isn’t it? It’s a catchy one, not generated by AI, to illustrate this crazy week of announcements by Google. Of course, there are big highlights like Gemini 3 Pro, Antigravity, and Nano Banana Pro, but there is more. The purpose of this article is to share everything with you, including links to all the interesting material about this news.

Gemini 3 Pro

The community was eagerly anticipating the release of Gemini 3. Gemini 3 Pro is a state-of-the-art model with excellent multimodal capabilities and advanced reasoning, and it excels at coding and other agentic activities.

You’ll see below the results on various benchmarks, which are quite impressive, and represent a significant leap forward on some of them:

[Image: Gemini 3 Pro results on benchmarks]

Here are a few pointers with more details:

Available In Many Places

Gemini 3 Pro is available in other places:

Available In Gemini CLI

Gemini CLI has also been updated to take advantage of Gemini 3 Pro.

Be sure to read this document, which explains how to access Gemini 3 Pro in Gemini CLI. It’s available to Google AI Ultra users and to paid Gemini and Vertex API key holders, so if you’re not in these categories, you might want to join the waitlist to experience all the fun!

Check out this article as well:

  • 5 things to try with Gemini 3 Pro in Gemini CLI, on how to set up Gemini 3, how to turn visual ideas into working apps, how to generate shell commands from natural language or accurate documentation from your codebase, and how it can help you debug issues.

Antigravity

Now let’s move on to Antigravity, a new agentic development platform, based on VS Code. Of course, you have all the usual functionalities of a code editor, including (smart) code completion. However, the interesting aspect of Antigravity is that the main window is not the IDE: the central place where things happen is the agent manager, where you launch your requests. These requests are interpreted and trigger the creation of a plan with various steps. You can comment on and review the plan to adjust it further. Then Gemini 3 handles the implementation of those tasks. Antigravity also produces various other artifacts along the way, like task lists, walkthroughs, screenshots, and even browser recordings.

[Image: Antigravity agent manager screenshot]

It’s also possible to use Claude Sonnet and GPT-OSS, so this product is not limited to just Gemini 3, however good it may be.

What I find impressive is the nice integration with the browser, to inspect and see what the implementation looks like, and to loop back to continue improving it or fixing it if it’s not what the user asked for.

I haven’t covered Nano Banana Pro yet, but with that image generation and editing model integrated in Antigravity, you’re able to create designs, update them visually with manual squiggles and such, and have Antigravity implement that design for you!

Articles to dive deeper:

Nano Banana Pro

I wrote about Nano Banana in previous articles showing how to call it from a Javelit frontend, how to create ADK agents with Nano Banana, and simply how to call Nano Banana from Java.

I was already super impressed with its capabilities in terms of image editing. However, if I wanted the best quality, I would usually start with an Imagen generation, then I’d iterate with Nano Banana for editing. But now, Nano Banana Pro is another level above both Imagen 4 Ultra and the original Nano Banana, in terms of prompt adherence, understanding of user intent, creativity, and quality of generation.

When you use it, you’ll notice how great it is at text, even lots of text! It made huge leaps in terms of typography. And what’s crazy, with the fact it’s based on Gemini 3 Pro, is that it’s able to understand articles or videos, and generate detailed and precise infographics about them! It’s connected to Google Search, and it can research, for example, the weather in Paris, and create a diagram with live data! I’ll certainly come back to that topic in forthcoming articles.

For example, here are some infographics that Nano Banana Pro created to summarize this article:

[Image: Illustration of this article via Nano Banana Pro]

You can mix ingredients together, like different characters (with character consistency), or use them as references for a kind of style transfer, to follow a certain style.

It has a high level of understanding of scenes and can easily change the lighting or the viewing angle.

You can generate images up to 4K! And have a wide range of aspect ratios to choose from.

Pay attention to the pricing, however, as it’s more expensive than Nano Banana. So for small edits, maybe you’ll stick with Nano Banana, but when you want the most complex design and quality, choose Nano Banana Pro.

Some links to dive deeper:

Towards Live And Richer Conversational UIs

I wanted to finish this overview of the announcements of the week with something that you might not have heard of, but which I think is interesting for the future of generative AI and conversational interfaces.

In my talks, at meetups, in conversations with developers, I often explain that imposing chatbots everywhere is not the best use and ideal integration of AI in their applications, and that more transparent and seamless generative AI integrations are preferred for ensuring their success with users and customers.

I think the following two projects are helping towards a smoother integration of generative AI:

The GenUI SDK for Flutter allows developers to create dynamic, personalized user interfaces using LLMs, transforming text-based conversations into rich, interactive experiences. It acts as an orchestration layer, sending user prompts and available widgets to an AI agent which generates content and a suitable UI description. The SDK then deserializes and dynamically renders this UI into interactive Flutter widgets, with user interactions triggering subsequent updates. This system relies on the (upcoming) A2UI protocol.

Generative UI enables AI models to create custom, interactive user experiences, like web pages and tools, directly from any prompt. This dynamic capability is rolling out in the Gemini app and Google Search’s AI Mode, leveraging Gemini 3 Pro with tool access and detailed instructions. Users prefer these outputs over standard AI text, though human-expert designs remain slightly favored. Despite current challenges with speed and occasional inaccuracies, Generative UI signifies a major step toward fully AI-generated and adaptive interfaces.

Now, Your Turn To Have Fun!

With all those announcements, and key pointers to learn more about them, I hope you’re ready to build exciting new things with Gemini 3 Pro, Antigravity, Nano Banana Pro, and more!

https://glaforge.dev/posts/2025/11/21/gemini-is-cooking-bananas-under-antigravity/
Semantic Document Similarity: Finding related articles with vector embedding models

When you’ve enjoyed reading an article on a blog, you might be interested in other, similar articles. As a blog author, you want to surface that relevant content to your readers to keep them engaged. For a long time, I’ve wanted to add a “Similar articles” section to my posts, but I never quite found a simple and effective way to do it. Hugo (the static site generator I’m using) has a related content concept, but it wasn’t really what I was after.

But with the power of modern generative AI models (in particular, embedding models), I’ve finally implemented a system that automatically finds and displays related content. In this post, I’ll walk you through how I did it.

The Core Idea: From Words to Numbers

The fundamental challenge is to determine how “similar” two articles are. Humans can do this intuitively by reading them, but how can a computer do it? The answer lies in a technique called vector embeddings.

The idea is to convert a piece of text into a list of numbers, called a vector. This vector represents the text’s semantic meaning. Texts with similar meanings will have vectors that are “close” to each other in a multi-dimensional space. So, the process looks like this:

  1. Summarize: For each article, create a concise summary that captures its essence.
  2. Embed: Convert each summary into a vector embedding.
  3. Compare: Calculate the “distance” between every pair of vectors.
  4. Display: For each article, find the ones with the closest vectors and display them.

Let’s dive into each step.

Step 1: Summarizing the Content

My blog posts can be quite long and cover various topics. To get a clean signal for comparison, I first decided to summarize each article. This helps to distill the core message and remove noise.

This approach was also necessary because most embedding models have limitations on the size of the input text they can accept. Many of my articles were too long for the embedding model’s input, so creating detailed summaries that are as close as possible to the original content ensures that the resulting embedding vector accurately represents the article’s meaning.

For this task, I turned to Google’s Gemini model, specifically gemini-2.5-flash, which is fast and effective. With the help of Gemini CLI, I wrote a simple Node.js script that iterates through all my Markdown files, extracts the content, and sends it to the Gemini API with a straightforward prompt:

const prompt = `Please provide a long, detailed, and factual summary
of the following article. The summary should capture the main points,
key arguments, and any important conclusions. It should be
comprehensive enough to give a good understanding of the article's
content.

Article:
${text}

Summary:
`

To call Gemini, you need to import the @google/genai module:

import {GoogleGenAI} from '@google/genai';

Pass an API key (that you can obtain from Google AI Studio):

const genAI = new GoogleGenAI({apiKey: API_KEY});

Then call the model to create the summary:

const result = await genAI.models.generateContent({
    model: "gemini-2.5-flash",
    contents: prompt,
});

const summary = result?.candidates?.[0]?.content?.parts?.[0]?.text;

To avoid re-generating summaries every time I run the script, I added a simple caching mechanism. The first time the script runs, it generates a summary and saves it to a text file in a summaries/ directory, mirroring the structure of my content/posts/ directory. On subsequent runs, it just reads the summary from the cache.

Step 2: Creating Vector Embeddings

Once I have a summary, the next step is to convert it into a vector. This is where embedding models come in. I used another of Google’s models, gemini-embedding-001, which is designed for this exact purpose.

The model takes a piece of text and returns a vector. You can even configure the size of this vector. I chose a dimensionality of 256, which provides a good balance between detail and performance (to speed up vector similarity calculations).

I also specified the task_type as SEMANTIC_SIMILARITY, which optimizes the embeddings for this kind of comparison task. You can check the other task types, for classification, clustering, Q&A, etc.

Again, this is fairly short to write in JavaScript:

const result = await genAI.models.embedContent({
    model: "gemini-embedding-001",
    contents: texts,
    task_type: "SEMANTIC_SIMILARITY",
    output_dimensionality: 256,
});
const embeddings = result?.embeddings;

My script collects all the summaries and sends them to the embedding model in batches of 100 (the maximum allowed by the API). This is much more efficient than sending them one by one.
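The batching step can be sketched as a small helper. This is only an illustration under stated assumptions: the function name toBatches is hypothetical (it is not taken from the actual script), and each resulting batch would then be passed as the contents of one embedding request like the one shown above.

```javascript
// Hypothetical helper: split an array into consecutive batches of at most `size` items.
function toBatches(items, size = 100) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Example: 250 placeholder summaries become batches of 100, 100 and 50.
const summaries = Array.from({ length: 250 }, (_, i) => `summary ${i}`);
console.log(toBatches(summaries).map((batch) => batch.length)); // [ 100, 100, 50 ]
```

Keeping the batch size as a parameter makes it easy to adapt if the API limit differs for another model.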

Step 3: Calculating Similarity

Now that every article’s summary is represented by a 256-dimensional vector, how do we measure the “distance” between them? The most common method for this is cosine similarity.

Imagine each vector as an arrow pointing in a certain direction in a 256-dimensional space. The cosine similarity measures the angle between two of these arrows.

  • If the arrows point in the same direction (a small angle), the texts are very similar, and the cosine similarity is close to 1.
  • If the arrows are perpendicular, the texts are unrelated, and the similarity is 0.
  • If they point in opposite directions, they are dissimilar, and the similarity is -1.

The calculation itself is quite simple. When the vectors are normalized (which I do, as the vectors are not always normalized depending on the output dimensionality you’ve chosen), the cosine similarity is just their dot product.
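To make the normalization and dot product concrete, here is a minimal sketch in plain JavaScript. The function names are my own illustrations and may not match the ones in the actual script; the vectors are tiny 2-dimensional examples rather than real 256-dimensional embeddings.

```javascript
// Scale a vector to unit length (L2 normalization).
function normalize(vector) {
  const norm = Math.hypot(...vector);
  if (norm === 0) throw new RangeError("cannot normalize a zero vector");
  return vector.map((x) => x / norm);
}

// Dot product of two vectors of the same length.
function dot(a, b) {
  if (a.length !== b.length) throw new RangeError("vectors must have the same length");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Cosine similarity: once both vectors are unit length, it is just their dot product.
function cosineSimilarity(a, b) {
  return dot(normalize(a), normalize(b));
}

console.log(cosineSimilarity([1, 0], [2, 0]));  // same direction: similarity of 1
console.log(cosineSimilarity([1, 0], [0, 3]));  // perpendicular: similarity of 0
console.log(cosineSimilarity([1, 0], [-4, 0])); // opposite directions: similarity of -1
```

In practice you would normalize each embedding once, right after receiving it, so that every pairwise comparison is a single dot product.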

Step 4: Putting It All Together in a Script

The Node.js script that orchestrates this whole process is available on GitHub: summarize-and-embed.js.

Here’s how it works:

  1. It finds all my blog posts.
  2. It loops through them, either loading the summary from the cache or generating a new one (with a one-second delay between API calls to avoid rate limiting).
  3. It generates embeddings for all summaries in batches.
  4. It then iterates through every article and calculates the cosine similarity with every other article.
  5. For each article, it sorts the others by their similarity score in descending order. I also filter out any articles with a similarity score below 0.75 to ensure the recommendations are high-quality.
  6. Finally, it takes the top 3 most similar articles and updates the frontmatter of the original Markdown file, adding a similar array with the paths to the related posts.
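Steps 4 to 6 can be sketched as a single function. This is a simplified illustration, not the code from the script: the findSimilar name and the { path, vector } article shape are assumptions, and the vectors are expected to be normalized already so that the dot product equals the cosine similarity.

```javascript
// Rank the other articles by similarity to `target`, drop those below the
// threshold, and return the paths of the top N matches.
// `articles` is an array of { path, vector } objects (hypothetical shape).
function findSimilar(target, articles, { threshold = 0.75, topN = 3 } = {}) {
  const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);
  return articles
    .filter((other) => other.path !== target.path)
    .map((other) => ({ path: other.path, score: dot(target.vector, other.vector) }))
    .filter((candidate) => candidate.score >= threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map((candidate) => candidate.path);
}

// Toy example with 2-dimensional unit vectors.
const posts = [
  { path: "posts/a.md", vector: [1, 0] },
  { path: "posts/b.md", vector: [0.8, 0.6] },   // score 0.8 against a
  { path: "posts/c.md", vector: [0.6, 0.8] },   // score 0.6 against a: filtered out
  { path: "posts/d.md", vector: [0.96, 0.28] }, // score 0.96 against a
];
console.log(findSimilar(posts[0], posts)); // [ 'posts/d.md', 'posts/b.md' ]
```

The returned paths are what would end up in the similar array of the frontmatter.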

Step 5: Displaying the Similar Articles in Hugo

With the similar array in my frontmatter, the final step was to display the links in my blog’s theme. I’m using the Hugo static site generator, and this was surprisingly easy. I edited the partial template responsible for rendering a single post to include this snippet:

{{ with .Params.similar }}
<div class="similar-articles">
    <h3>Similar articles</h3>
    <ul>
    {{ range . }}
        {{ with site.GetPage . }}
            <li><a href="{{ .RelPermalink }}">{{ .Title }}</a></li>
        {{ end }}
    {{ end }}
    </ul>
</div>
{{ end }}

This code checks if the similar parameter exists. If it does, it loops through the paths. For each path, it uses Hugo’s site.GetPage function to fetch the full page object, from which I can get the .Title and .RelPermalink. And with a little bit of CSS, it looks like a nice, integrated part of my blog.

What Does It Look Like?

Let’s have a look at a few examples (you’ll see the recommended articles at the bottom of each post).

For my article on the Gemini Computer Use model, the vector similarity suggested 3 articles related to Gemini as well:

On the article about using the Nano Banana image model within an ADK agent, the algorithm pointed to these 3 articles, on getting started with ADK, how to call Nano Banana from Java, and how to get started with Gemini:

For some quite unique articles, it would sometimes return just one result, or none at all, but that’s expected: if there are no similar articles, I don’t want my readers to get suggestions about totally unrelated material.

Considerations and Other Approaches

I’m happy with the summarize -> embed approach as it is simple and gives me pretty good recommendations. However, I considered some alternatives.

Leveraging Tags

First of all, I’m using tags on my blog. So I considered leveraging them, as I strive for consistency in applying them to my articles. For instance, I could have narrowed the search space to only articles sharing at least one common tag. Alternatively, a heuristic could have been devised to influence the ranking, favoring articles with a higher number of shared tags.
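
As a rough illustration of such a heuristic, here is a small Java sketch that measures tag overlap with the Jaccard index and blends it into the embedding score. The `TagOverlap` class, the `boosted` method, and the `tagWeight` parameter are all made up for this example, and the blending formula is just one possible choice.

```java
import java.util.HashSet;
import java.util.Set;

class TagOverlap {
    /** Jaccard index: number of shared tags divided by the number of distinct tags overall. */
    static double jaccard(Set<String> tagsA, Set<String> tagsB) {
        if (tagsA.isEmpty() && tagsB.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(tagsA);
        shared.retainAll(tagsB);
        Set<String> all = new HashSet<>(tagsA);
        all.addAll(tagsB);
        return (double) shared.size() / all.size();
    }

    /** Blends the embedding similarity with tag overlap; tagWeight in [0, 1] is a hypothetical tuning knob. */
    static double boosted(double similarity, Set<String> tagsA, Set<String> tagsB, double tagWeight) {
        return (1 - tagWeight) * similarity + tagWeight * jaccard(tagsA, tagsB);
    }

    public static void main(String[] args) {
        Set<String> first = Set.of("java", "gemini");
        Set<String> second = Set.of("gemini", "adk");
        System.out.println(jaccard(first, second));
        System.out.println(boosted(0.8, first, second, 0.25));
    }
}
```

Narrowing the search space instead would amount to skipping any pair whose `jaccard` value is 0.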

Averaging Vectors (Mean Pooling)

I went with the summarization route because of the limited input size of the embedding model. But summarization is a lossy process, as you lose some precision along the way. Through some experiments, I noticed that summarization could sometimes result in a similarity score that was 0.1 lower than a direct embedding of the full text. With a similarity threshold of 0.75, a 0.1 difference can decide whether an article is selected or not.

If you chunk your article into smaller passages (according to the maximum number of characters or tokens your embedding model can ingest), you’ll end up with multiple chunks and their respective vector embeddings. One approach, often referred to as mean pooling, is to calculate the average of these vectors, assuming this ‘mean pooled’ vector isn’t too far from the embedding of the entire text. I didn’t extensively explore this, but my initial impressions suggested it sometimes performed worse than the summarization and embedding method. This wasn’t a scientific study, just my gut feeling, and the finer details of research papers on mean pooling escaped me at the time!
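
For reference, averaging chunk vectors takes only a few lines. The sketch below assumes the chunk embeddings are already available as `double[]` arrays of the same dimension; `MeanPooling` and `meanPool` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

class MeanPooling {
    /** Averages chunk embeddings component by component; all vectors must share one dimension. */
    static double[] meanPool(List<double[]> chunkEmbeddings) {
        if (chunkEmbeddings.isEmpty()) {
            throw new IllegalArgumentException("no chunk embeddings");
        }
        int dim = chunkEmbeddings.get(0).length;
        double[] mean = new double[dim];
        for (double[] vector : chunkEmbeddings) {
            if (vector.length != dim) {
                throw new IllegalArgumentException("dimension mismatch");
            }
            for (int i = 0; i < dim; i++) {
                mean[i] += vector[i];
            }
        }
        for (int i = 0; i < dim; i++) {
            mean[i] /= chunkEmbeddings.size();
        }
        return mean;
    }

    public static void main(String[] args) {
        double[] pooled = meanPool(List.of(new double[] {1.0, 0.0}, new double[] {0.0, 1.0}));
        System.out.println(Arrays.toString(pooled)); // prints [0.5, 0.5]
    }
}
```

The pooled vector is generally not unit length, which doesn’t matter for cosine similarity since it ignores vector magnitude.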

Passage to Passage Comparisons

A higher fidelity approach would have been to do a passage-to-passage comparison. If you’re familiar with RAG (Retrieval Augmented Generation), you know about the chunking phase we’ve just mentioned. In RAG, you compare the query’s vector with vectors of document passages. The same idea can be applied to compare the passages of one article with the passages of another. You could then devise a function to aggregate these passage-level similarities into an overall document score, favoring documents with more highly similar passages. I didn’t explore this idea, as the number of embedding requests and comparisons is significantly higher (and thus more expensive and time-consuming).
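
One possible aggregation function, sketched here in Java, takes the best match for each passage of the first article and averages those best scores. This is only one choice among many, and `PassageSimilarity` and `documentScore` are hypothetical names; the passage embeddings are assumed to be precomputed, non-zero vectors.

```java
import java.util.List;

class PassageSimilarity {
    /** Cosine similarity between two non-zero vectors of the same length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** For each passage of A, finds its best match in B, then averages those best scores. */
    static double documentScore(List<double[]> passagesA, List<double[]> passagesB) {
        if (passagesA.isEmpty() || passagesB.isEmpty()) {
            return 0.0;
        }
        double total = 0;
        for (double[] a : passagesA) {
            double best = -1.0; // cosine similarity is never below -1
            for (double[] b : passagesB) {
                best = Math.max(best, cosine(a, b));
            }
            total += best;
        }
        return total / passagesA.size();
    }

    public static void main(String[] args) {
        List<double[]> articleA = List.of(new double[] {1.0, 0.0}, new double[] {0.0, 1.0});
        List<double[]> articleB = List.of(new double[] {1.0, 0.0});
        System.out.println(documentScore(articleA, articleB)); // prints 0.5
    }
}
```

Note that this score is not symmetric: in the example, comparing B against A gives 1.0, since B’s single passage has a perfect match in A. The nested loops also make the cost visible: every pair of articles needs one comparison per pair of passages.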

A Mix! With Reciprocal Rank Fusion

Last but not least, you could even combine those approaches together: summarize/embed, tag set comparison, passage-to-passage chunking/embedding and comparison. Each approach yields a ranking of similar articles, but their scores aren’t necessarily on the same scale, as they’re comparing different aspects of similarity. In such situations, you can use methods like Reciprocal Rank Fusion (RRF) to combine different rankings together. RRF is often used in hybrid search scenarios, where you want to combine different searches together, like a classical keyword-based search, and a semantic search.
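
Here is a minimal Java sketch of Reciprocal Rank Fusion, to show why differing score scales stop being a problem: only the rank positions are used. Each item receives the sum of 1 / (k + rank) over the rankings where it appears. The constant k = 60 is a commonly used default, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReciprocalRankFusion {
    /**
     * Fuses several best-first rankings into one.
     * Ranks start at 1; ties are broken alphabetically to keep the output deterministic.
     */
    static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        return scores.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Double>comparingByKey()))
            .map(Map.Entry::getKey)
            .toList();
    }

    public static void main(String[] args) {
        // Three hypothetical rankings: summary embeddings, shared tags, passage comparison.
        List<List<String>> rankings = List.of(
            List.of("a", "b", "c"),
            List.of("b", "c", "a"),
            List.of("b", "a", "d"));
        System.out.println(fuse(rankings, 60)); // prints [b, a, c, d]
    }
}
```

An article that appears near the top of several rankings wins over one that tops a single ranking but is absent from the others.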

Ultimately, since the simple summarization-and-embed solution provided good results, I decided against further complicating the system. Sometimes, being pragmatic and choosing a ‘good enough’ solution is more effective than pursuing the absolute best. However, I wanted to share this thought process and highlight that various other solutions exist.

Conclusion

And there you have it! It might seem like a lot of steps, but the overall process is quite logical. By leveraging the power of summarization and embedding models, I was able to build a powerful related articles feature. For now, I’m still running the script manually, but later on, I’ll see how I can integrate it in my GitHub Actions workflow.

It’s a great example of how generative AI can be used to enhance existing applications in practical and useful ways. I’m really happy with the results, and I hope my readers will find the new recommendations useful for discovering more content they’re interested in.

https://glaforge.dev/posts/2025/11/12/finding-related-articles-with-vector-embedding-models/
Driving a web browser with Gemini's Computer Use model in Java

In this article, I’ll guide you through the process of programmatically interacting with a web browser using the new Computer Use model in Gemini 2.5 Pro. We’ll accomplish this in Java ☕ leveraging Microsoft’s powerful Playwright Java SDK to handle the browser automation.

The New Computer Use Model

Unveiled in this announcement article and made available in public preview last month, via the Gemini API on Google AI Studio and Vertex AI, Gemini 2.5 Pro introduces a pretty powerful “Computer Use” feature.

This allows the model to understand and interact with a computer screen much like a human would. It’s a multimodal model that takes a screenshot of a web page as input and returns a sequence of actions to perform, such as clicking buttons, filling out text fields, and navigating through pages, until it reaches a certain goal set by the user.

The general flow is illustrated in the following diagram:

Gemini Computer Use flow

Let’s get the project set up, and then we’ll see further down how this flow works, and how to implement it, using Java and Playwright.

Project Setup

For this tutorial, I’m using a straightforward Java project built with Maven. We’ll need two main dependencies: one for the Gemini API and another for Playwright.

Here’s the relevant section of my pom.xml file:

<dependencies>
    <dependency>
        <groupId>com.google.genai</groupId>
        <artifactId>google-genai</artifactId>
        <version>1.24.0</version>
    </dependency>
    <dependency>
        <groupId>com.microsoft.playwright</groupId>
        <artifactId>playwright</artifactId>
        <version>1.56.0</version>
    </dependency>
</dependencies>

Getting Started with Playwright

Now, let’s dive into the code. Playwright is a library from Microsoft for automating browser actions, and it’s available for several languages, including Java.

Here’s a basic example of how to launch a browser and navigate to a page:

import com.microsoft.playwright.*;

public class BrowserAutomation {
    public static void main(String[] args) {
        try (Playwright playwright = Playwright.create()) {
            Browser browser = playwright.chromium().launch(
                new BrowserType.LaunchOptions().setHeadless(true)
            );
            Page page = browser.newPage();
            page.navigate("https://www.google.com");
            // We'll add the Gemini integration here.
            browser.close();
        }
    }
}

In this snippet, I’m using the Chromium browser that ships with Playwright. But other browsers like Firefox are also available.

I’m launching it in “headless” mode, meaning no browser window will be visible. This is ideal for automated tasks where visual inspection isn’t necessary. For debugging, however, you can set setHeadless(false) to watch the automation in real-time.

Integrating with Gemini

First, you’ll need a Gemini API key, which you can obtain from Google AI Studio. Once you have your key, it’s best practice to set it as an environment variable named GEMINI_API_KEY. We’ll use a try-with-resources block to create (and automatically close) the client to access Gemini:

try (Client client = new Client.Builder()
    .apiKey(System.getenv("GEMINI_API_KEY"))
    .build()) {
        // ...
}

With the browser automation foundation in place, we can now integrate Gemini to steer it. The process works as a loop: we send the model a prompt stating the goal to reach, the model returns a series of actions, our code executes them using Playwright, we share a screenshot of the current web page after each action, and we repeat the cycle until the task is complete.

The Agent Loop

The Gemini documentation refers to this cyclical process as an “agent loop”:

  1. Send Request: The loop begins by sending the user’s prompt, the latest UI screenshot, and the enabled computer_use tool to the model.
  2. Receive Response: The model analyzes the inputs and returns a function_call suggesting a specific UI action.
  3. Execute Action: Our Java code parses this function_call and translates it into a Playwright command to be executed in the browser.
  4. Capture New State: After the action is performed, we capture a new screenshot and send it back to the model along with a function_response, starting the loop over.

This process continues until the model determines the initial goal has been met.
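
Stripped of any Gemini or Playwright specifics, the shape of this loop can be sketched with plain Java functions. Everything here is a stand-in: the `Turn` record, the `model` and `executor` functions, and the `maxTurns` cap are hypothetical, the latter being a safety net so the loop can’t run forever if the model never stops requesting actions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class AgentLoopSketch {
    /** Hypothetical stand-in for one model response: actions to run, or a final answer. */
    record Turn(List<String> actions, String finalText) {}

    /**
     * Repeats request / execute / observe until the model returns no more actions,
     * or until maxTurns model calls have been made.
     */
    static String run(String goal,
                      Function<List<String>, Turn> model,
                      Function<String, String> executor,
                      int maxTurns) {
        List<String> history = new ArrayList<>();
        history.add("user: " + goal);
        for (int turn = 0; turn < maxTurns; turn++) {
            Turn next = model.apply(List.copyOf(history));      // 1-2. send request, receive response
            if (next.actions().isEmpty()) {
                return next.finalText();                        // goal reached, no more actions
            }
            for (String action : next.actions()) {
                history.add("call: " + action);
                history.add("result: " + executor.apply(action)); // 3-4. execute, capture new state
            }
        }
        return "stopped after " + maxTurns + " turns";
    }

    public static void main(String[] args) {
        // Fake model: asks for two clicks, then answers.
        Function<List<String>, Turn> fakeModel = history -> history.size() < 5
            ? new Turn(List.of("click_at"), null)
            : new Turn(List.of(), "done");
        System.out.println(run("click twice", fakeModel, action -> "success", 10)); // prints done
    }
}
```

In the real implementation, the history holds the conversation contents, and each result carries the function response along with a fresh screenshot.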

Code Walkthrough

Let’s examine the BrowserUse.java file to see how this loop is implemented. (You’ll find the full source code at the bottom of the page.)

First, we initialize Playwright and the Gemini client (as explained above), setting a specific viewport size which is crucial for the coordinate scaling we’ll discuss shortly:

try (Playwright playwright = Playwright.create()) {
    BrowserType chromium = playwright.chromium();
    Browser browser = chromium.launch(
      new BrowserType.LaunchOptions().setHeadless(true));
    final int WIDTH = 1000;
    final int HEIGHT = 1000;
    BrowserContext context = browser.newContext(
      new Browser.NewContextOptions().setViewportSize(WIDTH, HEIGHT));
    Page page = context.newPage();

    try (Client client = new Client.Builder()
            .apiKey(System.getenv("GEMINI_API_KEY"))
            .build()) {
        // The agent loop will go here.
    }
}

Next, we kick off the conversation with the model by sending our initial prompt. This defines the overall goal for the agent.

List<Content> history = new ArrayList<>();

Content initialContent = Content.fromParts(Part.fromText("""
    Find the tallest Stitch plushie under €100 on Amazon.fr
    and then provide the link to that item.
    """));

history.add(initialContent);

GenerateContentResponse response = client.models.generateContent(
    "gemini-2.5-computer-use-preview-10-2025",
    initialContent,
    GenerateContentConfig.builder()
        .tools(Tool.builder()
            .computerUse(ComputerUse.builder()
                .environment(
                    Environment.Known.ENVIRONMENT_BROWSER)
                .build())
            .build())
        .build());

Here, we’re using the gemini-2.5-computer-use-preview-10-2025 model and explicitly enabling the computer_use tool.

The core of the application is a while loop that continues as long as the model returns function calls for us to execute:

while (response.functionCalls() != null &&
        !response.functionCalls().isEmpty()) {
    // ...
}

We iterate over the function calls returned by the model and use a switch statement to execute the corresponding Playwright action. After each interaction, we save a screenshot of the page and give it back to the model, so it can see the result of the action on the page.

for (FunctionCall functionCall : response.functionCalls()) {
    // ...
    switch (functionCall.name().get()) {
        case "navigate_to_url":
            page.navigate((String) args.get("url"));
            result.put("status", "success");
            break;
        case "click_at":
            // ...
            break;
        // ... other cases
    }
    // ...
}

This switch block is the heart of the integration, translating the model’s intentions into concrete browser actions.

Supported Actions

You can check the list of supported actions in the documentation. You don’t necessarily have to implement them all, depending on your use case. And there’s a proposed implementation in Python that you can take inspiration from. Note that there are often multiple ways to implement certain actions, like scrolling for example. And you’ll see in the full source code below how I implemented most of them in Java with Playwright.

Coordinate Scaling

A critical detail in this process is coordinate handling. The Gemini model operates on a normalized 1000x1000 grid, independent of the actual browser viewport size. Therefore, we must scale the coordinates it provides to match our browser’s dimensions.

In my case, I simply ended up using 1000x1000 for my browser window, but I kept the scaling calculation in place, should I want to change the browser size in the future.

Here’s the implementation for a click_at action:

case "click_at":
    int xClick = ((Number) args.get("x")).intValue();
    int yClick = ((Number) args.get("y")).intValue();
    var scaledCoordForClicking =
        new ScaledCoord(xClick, yClick).scaleTo(WIDTH, HEIGHT);
    page.mouse().click(
        scaledCoordForClicking.x, scaledCoordForClicking.y);
    result.put("status", "success");
    break;

This is accomplished with a simple ScaledCoord record:

private record ScaledCoord(int x, int y) {
    ScaledCoord scaleTo(int width, int height) {
        return new ScaledCoord(
                (int) (this.x / 1000.0 * width),
                (int) (this.y / 1000.0 * height)
        );
    }
}
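
To see the arithmetic at work with a viewport that isn’t 1000x1000, here is a small standalone example. The 1440x900 viewport is a made-up value, and the record is copied into a hypothetical `ScalingExample` class so the snippet runs on its own.

```java
class ScalingExample {
    /** Same scaling logic as above: from the model's 1000x1000 grid to viewport pixels. */
    record ScaledCoord(int x, int y) {
        ScaledCoord scaleTo(int width, int height) {
            return new ScaledCoord(
                    (int) (this.x / 1000.0 * width),
                    (int) (this.y / 1000.0 * height)
            );
        }
    }

    public static void main(String[] args) {
        // The model asks for a click at (500, 250) on its normalized grid.
        ScaledCoord pixel = new ScaledCoord(500, 250).scaleTo(1440, 900);
        // 500 / 1000.0 * 1440 = 720 and 250 / 1000.0 * 900 = 225
        System.out.println(pixel); // prints ScaledCoord[x=720, y=225]
    }
}
```

The cast to int truncates, so a result such as 899.1 becomes 899.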

Accepting Safety Decisions

For certain actions, the model might require a confirmation before proceeding. This is a safety feature to prevent unintended consequences, for example when dealing with sensitive information or performing critical operations. The model will include a safety_decision field in its response, indicating that a confirmation is needed. In a real-world application, you should prompt the user for their approval. However, for the purpose of this demonstration, the code at the bottom of this article automatically acknowledges these safety decisions. This is implemented by checking for the presence of the safety_decision field in the function call arguments and then adding a safety_acknowledgement field with the value true to the function response, as you can see in the provided source code.

for (FunctionCall functionCall : response.functionCalls()) {
    Map<String, Object> args = functionCall.args().get();

    Map<String, Object> result = new HashMap<>();
    if (args.containsKey("safety_decision")) {
        result.put("safety_acknowledgement", "true");
    }
    // ...
}

This safety_decision request often comes up when you have to accept things like cookie policies and other pop-ups. You can learn more about this in the documentation.
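
For a real-world application, the auto-acknowledgement could be replaced with an actual question to the user. The following is only a sketch: `SafetyConfirmation`, `confirm`, and `askOnConsole` are hypothetical names, and it assumes the safety_decision argument is a map carrying a human-readable "explanation" entry, falling back to the raw value otherwise.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Scanner;
import java.util.function.Predicate;

class SafetyConfirmation {
    /**
     * Returns the extra function-response fields for one function call,
     * or an empty Optional when the user declines (the caller should then skip the action).
     */
    static Optional<Map<String, Object>> confirm(Map<String, Object> args, Predicate<String> approver) {
        Map<String, Object> result = new HashMap<>();
        Object decision = args.get("safety_decision");
        if (decision == null) {
            return Optional.of(result); // nothing to confirm
        }
        // Assumption: the decision is a map with an "explanation" entry.
        String explanation = decision instanceof Map<?, ?> m && m.get("explanation") != null
            ? String.valueOf(m.get("explanation"))
            : String.valueOf(decision);
        if (!approver.test(explanation)) {
            return Optional.empty();
        }
        result.put("safety_acknowledgement", "true");
        return Optional.of(result);
    }

    /** Console-based approver: only an explicit "y" counts as approval. */
    static boolean askOnConsole(String explanation) {
        System.out.println("The model asks for confirmation: " + explanation);
        System.out.print("Proceed? [y/N] ");
        Scanner in = new Scanner(System.in);
        return in.hasNextLine() && in.nextLine().trim().equalsIgnoreCase("y");
    }

    public static void main(String[] args) {
        Map<String, Object> callArgs = Map.of("safety_decision",
            Map.of("decision", "require_confirmation", "explanation", "Cookie consent banner"));
        // A non-interactive approver that always declines, to keep the demo deterministic.
        System.out.println(confirm(callArgs, explanation -> false)); // prints Optional.empty
    }
}
```

In the agent loop, you would call `confirm(args, SafetyConfirmation::askOnConsole)` and only execute the action when the returned Optional is present.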

Closing the Loop

After executing an action, we must inform the model of the outcome. This is done by adding a function_response containing the result to the conversation history. We also give the model the screenshot, so it sees the updated page.

result.put("url", page.url());
// wait to ensure the page has fully rendered after the action
sleep(1000);
byte[] screenshot = page.screenshot(
    // also saving the screenshot locally for debugging purpose
    new Page.ScreenshotOptions()
        .setPath(Paths.get("screenshot-" + index++ + ".png")));
history.add(Content.fromParts(
        Part.fromFunctionResponse(functionCall.name().get(), result),
        Part.fromBytes(screenshot, "image/png")
));

We then call generateContent again with the updated history, including the result of the action as well as the screenshot of the page, and the loop continues until the model determines the task is finished and stops returning function calls. At that point, we can print the final text-based response.

System.out.println(response.text());

Important

You must always take a screenshot after each action, and send it to the model each time. Otherwise, the Computer Use model is in the dark, and doesn’t know what’s going on in the browser.

This concludes our tour of how to use the Gemini 2.5 Pro Computer Use model with Java and Playwright. Now, before having a look at the full source code at the bottom of the article, let’s have a look at a few example use cases I tried.

Example Browser Use Requests

Playing a Button Clicking Game

When working on implementing the Playwright Chromium functions to echo the actions requested by the model, at first, I wasn’t sure if my logic for handling button clicks was correct (in particular the coordinate scaling). So I wanted to double check that it was working fine, and for that purpose, I created a simple button clicking game: each time you click on the red button, the button moves randomly on the page, and a score shows the number of clicks. (Feel free to play it, it’s stupidly boring!)

I asked Gemini Computer Use to go to that game page, click the button 10 times, and respond with the final score at the end. Logically, clicking 10 times would give a score of 10!

Go to this red button clicking game page: https://storage.googleapis.com/public-bucket-for-demos/button.html Click 10 times the red button. Return the score displayed in the top left hand corner.

This little animated GIF shows you the button moving around and the score increasing:

button clicking game animation

Here’s the model’s final answer:

I have evaluated step 12. I clicked the button and the score increased to 10. I have clicked the button 10 times and the score is now 10, as displayed in the top left corner.

Searching an Article on my blog

I asked Gemini Computer Use to find a particular article on my blog:

Find an article on glaforge.dev on how to use Nano Banana in an ADK agent

Let’s see that in action in this other animation:

Searching an article on my blog

Here’s the model’s final answer:

I have evaluated step 18 and scrolled down further. The article
presents the code for `NanoBananaCreativeAgent` and highlights
the line `.model("gemini-2.5-flash-image-preview")` as key for
telling the ADK to route requests to the Nano Banana
(Gemini 2.5 Flash Image) model for image generation within the agent.

It also mentions an `.afterModelCallback` part that it will zoom
in on. This article clearly explains how to configure the ADK
agent to use the Nano Banana model.

I have found the information requested.

The article "Creative Java AI agents with ADK and Nano Banana" on
glaforge.dev explains how to use the "Nano Banana" model
(gemini-2.5-flash-image-preview) within an ADK agent.

It is done by specifying `.model("gemini-2.5-flash-image-preview")`
when building the `LlmAgent` using `LlmAgent.builder()`
within the agent's definition.
The article also goes into handling the image output
using `.afterModelCallback`.

Finding an Item on Amazon

This time, I asked Gemini Computer Use:

Find the tallest Stitch plushie under €100 on Amazon.fr and give me the URL of the product. Be sure it’s not Angel the pink one.

It found a nice Stitch plushie!

A big Stitch plushie found on Amazon

The Full Source Code

Time to reveal the complete source code! Feel free to adapt it to your needs, or to further expand the browser actions supported.

import com.google.genai.Client;
import com.google.genai.types.*;
import com.microsoft.playwright.*;

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

void main() {
    try (Playwright playwright = Playwright.create()) {
        BrowserType chromium = playwright.chromium();
        Browser browser = chromium.launch(new BrowserType.LaunchOptions().setHeadless(true));

        final int WIDTH = 1000;
        final int HEIGHT = 1000;

        BrowserContext context = browser.newContext(new Browser.NewContextOptions().setViewportSize(WIDTH, HEIGHT));
        Page page = context.newPage();

        try (Client client = new Client.Builder()
                .apiKey(System.getenv("GEMINI_API_KEY"))
                .build()) {

            List<Content> history = new ArrayList<>();

            Content initialContent = Content.fromParts(Part.fromText("""
                    Find an article on glaforge.dev on how to use Nano Banana in an ADK agent
                    """));

            history.add(initialContent);

            GenerateContentResponse response = null;
            int index = 0;

            while (true) {
                response = client.models.generateContent(
                    "gemini-2.5-computer-use-preview-10-2025",
                    history,
                    GenerateContentConfig.builder()
                        .tools(Tool.builder()
                            .computerUse(ComputerUse.builder()
                                .environment(Environment.Known.ENVIRONMENT_BROWSER)
                                .build())
                            .build())
                        .build());

                if (response.functionCalls() == null || response.functionCalls().isEmpty()) {
                    break;
                }

                if (!response.candidates().isEmpty()) {
                    history.add(response.candidates().get().get(0).content().get());
                }

                for (FunctionCall functionCall : response.functionCalls()) {
                    Map<String, Object> args = functionCall.args().get();
                    System.out.println("==> " + functionCall.name().get() + " " + args);

                    Map<String, Object> result = new HashMap<>();
                    if (args.containsKey("safety_decision")) {
                        result.put("safety_acknowledgement", "true");
                    }

                    switch (functionCall.name().get()) {
                        case "open_web_browser":
                            result.put("status", "success");
                            break;
                        case "navigate":
                        case "navigate_to_url":
                            page.navigate((String) args.get("url"));
                            result.put("status", "success");
                            break;
                        case "go_back":
                            page.goBack();
                            result.put("status", "success");
                            break;
                        case "get_page_content":
                            String pageContent = page.content();
                            if (pageContent.length() > 20000) {
                                pageContent = pageContent.substring(0, 20000);
                            }
                            result.put("content", pageContent);
                            break;
                        case "click_element":
                            try {
                                page.click((String) args.get("selector"));
                                result.put("status", "success");
                            } catch (PlaywrightException e) {
                                result.put("status", "error");
                                result.put("message", e.getMessage());
                            }
                            break;
                        case "click_at":
                            int xClick = ((Number) args.get("x")).intValue();
                            int yClick = ((Number) args.get("y")).intValue();
                            var scaledCoordForClicking =
                                new ScaledCoord(xClick, yClick).scaleTo(WIDTH, HEIGHT);
                            page.mouse().click(scaledCoordForClicking.x, scaledCoordForClicking.y);
                            result.put("status", "success");
                            break;
                        case "type_text":
                            String text = (String) args.get("text");
                            String selectorForTyping = (String) args.get("selector");
                            if (selectorForTyping != null) {
                                page.locator(selectorForTyping).type(text);
                            } else {
                                page.keyboard().type(text);
                            }
                            result.put("status", "success");
                            break;
                        case "type_text_at":
                            String text_to_type = (String) args.get("text");
                            boolean press_enter = (Boolean) args.getOrDefault("press_enter", false);
                            int xType = ((Number) args.get("x")).intValue();
                            int yType = ((Number) args.get("y")).intValue();
                            var scaledCoordForTyping =
                                new ScaledCoord(xType, yType).scaleTo(WIDTH, HEIGHT);
                            page.mouse().click(scaledCoordForTyping.x, scaledCoordForTyping.y);
                            page.keyboard().type(text_to_type);
                            if (press_enter) {
                                page.keyboard().press("Enter");
                            }
                            result.put("status", "success");
                            break;
                        case "scroll_document":
                            String direction = (String) args.get("direction");
                            int magnitude = 800;
                            if (args.containsKey("magnitude")) {
                                magnitude = ((Number) args.get("magnitude")).intValue();
                            }
                            if ("down".equals(direction)) {
                                page.evaluate("window.scrollBy(0, " + magnitude + ")");
                            } else if ("up".equals(direction)) {
                                page.evaluate("window.scrollBy(0, -" + magnitude + ")");
                            }
                            result.put("status", "success");
                            break;
                        case "search":
                            String query = (String) args.get("query");
                            if (query != null && !query.isEmpty()) {
                                page.navigate("https://www.google.com/search?q=" +
                                        URLEncoder.encode(query, StandardCharsets.UTF_8));
                                result.put("status", "success");
                            } else {
                                result.put("status", "unsupported function");
                                result.put("message", "search function requires a query argument.");
                            }
                            break;
                        case "take_screenshot":
                            byte[] screenshotBytes = page.screenshot();
                            result.put("screenshot-image-bytes", screenshotBytes);
                            result.put("status", "success");
                            break;
                        case "wait_5_seconds":
                            sleep(5000);
                            result.put("status", "success");
                            break;
                        default:
                            result.put("error", "unsupported function");
                    }
                    result.put("url", page.url());

                    sleep(1000);
                    byte[] screenshot = page.screenshot(
                            new Page.ScreenshotOptions()
                                .setPath(Paths.get("screenshot-" + index++ + ".png")));
                    history.add(Content.fromParts(
                            Part.fromFunctionResponse(functionCall.name().get(), result),
                            Part.fromBytes(screenshot, "image/png")
                    ));
                }
            }

            System.out.println(response.text());
        } finally {
            browser.close();
        }
    }
}

private record ScaledCoord(int x, int y) {
    ScaledCoord scaleTo(int width, int height) {
        return new ScaledCoord(
                (int) (this.x / 1000.0 * width),
                (int) (this.y / 1000.0 * height)
        );
    }
}

private static void sleep(int milliseconds) {
    try {
        Thread.sleep(milliseconds);
    } catch (InterruptedException e) {
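        // restore the interrupt flag so that callers can still notice the interruption
        Thread.currentThread().interrupt();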
    }
}

Conclusion

My experiments with the Gemini 2.5 Pro Computer Use model have been insightful, revealing both its potential and some of its limitations.

One of the first things you’ll notice is the pacing. Each turn in the agent loop (a model call followed by the execution of an action) takes time. A multi-step task like finding a product on an e-commerce site requires patience, as the process unfolds deliberately, one step at a time. It’s a good idea to wait a second before taking a screenshot, to be sure the page has fully rendered after the last action. Otherwise you could feed a blank or half-blank screen back to the model, which won’t be helpful. So a computer-use model is probably better suited to asynchronous background tasks than to situations requiring an immediate response.

A significant real-world challenge is the prevalence of cookie consent banners and other pop-ups. But the model is able to click the right “accept” buttons here and there, to get rid of them, and focus on the task at hand. Captchas can also get in the way, because websites notice that this is an automated agent at play. But there’s a way to automate what they call safety decisions (which is what I implemented in the code, by auto-acknowledgement).

Interestingly, in one of my many experiments, in moments of apparent frustration, the model decided to abandon the target website and default to a Google search, to find the information on the website it couldn’t find by scrolling and clicking around!

In summary, once you acknowledge some of the potential challenges, this is clearly a fascinating technology with a promising agentic future! Imagine web agents able to handle mundane, boring web-based tasks for you, saving you precious time. That’s definitely something worth investigating!

https://glaforge.dev/posts/2025/11/03/driving-a-web-browser-with-gemini-computer-use-model-in-java/
A Javelit frontend for an ADK agent

Continuing my journey with Javelit, after creating a frontend for “Nano Banana” to generate images and a chat interface for a LangChain4j-based Gemini chat model, I decided to see how I could integrate an ADK agent with a Javelit frontend.

The Javelit interface for an ADK search agent

A Javelit interface for an ADK search agent

The key ingredients of this interface:

  • a title (with some emojis 😃)
  • a container that displays the agent’s answer
  • a text input field to enter the search query

The ADK agent

For the purpose of this article, I built a simple search agent, with a couple of search tools:

  • a Google Search tool
  • a Google Maps tool

So you can search for up-to-date information on Google Search, as well as details about landmarks, points of interest, restaurants, etc., via Google Maps.

Note

I recently contributed the GoogleMapsTool to ADK for Java, so it’s not yet available in a released version. You’ll have to wait for the next release (or build it from source) to be able to use it.

Let’s have a closer look at the agent code:

LlmAgent agent = LlmAgent.builder()
    .name("gemini-search-agent")
    .instruction("""
        You are a helpful search assistant,
        able to search the web and Google Maps.
        When a user asks for research,
        be sure to use the appropriate tools detailed below.

        Use the `google_search` tool
        to search for up-to-date information.

        Use the `google_maps` tool
        to search for geographical information.
        """)
    .model("gemini-2.5-flash")
    .tools(
        new GoogleSearchTool(),
        new GoogleMapsTool()
    )
    .build();

This is a simple agent, with instructions detailing the tools at its disposal and wiring the two tools.

Next, to interact with the agent, we need some setup: we’ll need a Runner and prepare a Session:

InMemorySessionService sessionService = new InMemorySessionService();
InMemoryArtifactService artifactService = new InMemoryArtifactService();

Runner runner = new Runner(agent, agent.name(), artifactService, sessionService, null, null);

final String appName = runner.appName();
final String userId = UUID.randomUUID().toString();

Session session = runner
    .sessionService()
    .createSession(appName, userId)
    .blockingGet();

To interact with this agent via the Runner’s runAsync() method, we need to keep the runner, the session, and the user ID around, so I created a record to hold them, and a method to wrap it all:

record AgentRunnerSession(Runner runner, String userId, Session session) { }

private static AgentRunnerSession getAgentSession() {
    //... agent definition above...
    return new AgentRunnerSession(runner, userId, session);
}

Building the UI and saving the agent in the Javelit session

Like in the previous articles, the UI code layout is prepared in the class’ main method:

public static void main(String[] args) {
  // Javelit UI layout
}

With Javelit, the UI component and layout code is re-run each time there’s an interaction from the user. In order to keep the conversation going with the agent, we need to store it in Javelit’s session state. It’s created the first time thanks to the computeIfAbsent() method and retrieved upon subsequent calls:

AgentRunnerSession holder = (AgentRunnerSession) Jt.sessionState()
    .computeIfAbsent("agentRunnerSession", key -> getAgentSession());
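
To make the "created once, reused on every rerun" behavior concrete, here is a minimal plain-Java sketch. It does not use Javelit at all: the `render` method stands in for the re-executed main method, a regular `HashMap` stands in for the session state, and `Holder` is a hypothetical placeholder for the `AgentRunnerSession` record.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Plain-Java model of a rerun-style UI: the script body runs again on every
// interaction, while the session state map outlives each run.
class SessionStateSketch {
    // Counts how many times the holder was actually built.
    static final AtomicInteger CREATIONS = new AtomicInteger();

    // Hypothetical stand-in for the AgentRunnerSession record.
    record Holder(String userId) { }

    static Holder createHolder() {
        CREATIONS.incrementAndGet();
        return new Holder("user-" + CREATIONS.get());
    }

    // Stand-in for the main method that is re-executed on each interaction.
    static Holder render(Map<String, Object> sessionState) {
        return (Holder) sessionState.computeIfAbsent("holder", key -> createHolder());
    }

    public static void main(String[] args) {
        Map<String, Object> sessionState = new HashMap<>();
        Holder first = render(sessionState);   // first run: the holder is created
        Holder second = render(sessionState);  // rerun: the same holder is returned
        System.out.println("Same holder reused: " + (first == second));
        System.out.println("Number of creations: " + CREATIONS.get());
    }
}
```

The factory lambda only runs when the key is missing, which is why the runner and session survive across reruns.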

We add the title component, a container to hold the agent’s response, and a text input field for the user’s search query:

Jt.title("\uD83D\uDD0D ADK Search Agent \uD83E\uDD16\uD83E\uDDE0").use();

JtContainer eventContainer = Jt.container().border(true).use();

String searchQuery = Jt.textInput("Search query").use();

When the user interacts with this text input field by hitting Enter, the input value is saved in the searchQuery variable. Once we have that user query, we can pass it to the agent via the Runner’s runAsync() method. For each event, we add a Markdown element with the content of that event:

if (searchQuery != null && !searchQuery.isEmpty()) {
    holder.runner().runAsync(
        holder.userId(),
        holder.session().id(),
        Content.fromParts(Part.fromText(searchQuery)))
        .blockingForEach(event -> {
            Jt.markdown(event.stringifyContent()).use(eventContainer);
        });
}

With this approach, we maintain the conversational state. Although the UI isn’t displaying the past requests and responses, each time the user enters a query, both the query and response are kept in the agent’s memory. That way, if you ask for information about a restaurant, then you ask about opening times, it remembers it’s about this particular restaurant.

The whole example

/// usr/bin/env jbang "$0" "$@" ; exit $?
//DEPS com.google.adk:google-adk:0.3.1-SNAPSHOT
package adk;

import com.google.adk.agents.LlmAgent;
import com.google.adk.artifacts.InMemoryArtifactService;
import com.google.adk.runner.Runner;
import com.google.adk.sessions.InMemorySessionService;
import com.google.adk.sessions.Session;
import com.google.adk.tools.GoogleMapsTool;
import com.google.adk.tools.GoogleSearchTool;
import com.google.genai.types.Content;
import com.google.genai.types.Part;
import io.javelit.core.Jt;
import io.javelit.core.JtContainer;

import java.util.UUID;

public class App {

  public static void main(String[] args) {
    AgentRunnerSession holder = (AgentRunnerSession) Jt.sessionState()
        .computeIfAbsent("holder", key -> getAgentSession());

    Jt.title("\uD83D\uDD0D ADK Search Agent \uD83E\uDD16\uD83E\uDDE0").use();

    JtContainer eventContainer = Jt.container().border(true).use();

    String searchQuery = Jt.textInput("Search query").use();

    if (searchQuery != null && !searchQuery.isEmpty()) {
      holder.runner().runAsync(
          holder.userId(),
          holder.session().id(),
          Content.fromParts(Part.fromText(searchQuery))).blockingForEach(event -> {
        Jt.markdown(event.stringifyContent()).use(eventContainer);
      });
    }
  }

  private record AgentRunnerSession(Runner runner, String userId, Session session) { }

  private static AgentRunnerSession getAgentSession() {
    LlmAgent agent = LlmAgent.builder()
        .name("gemini-search-agent")
        .instruction("""
            You are a helpful search assistant, able to search the web and Google Maps.
            When a user asks for research, be sure to use the appropriate tools detailed below.

            Use the `google_search` tool to search for up-to-date information.
            Use the `google_maps` tool to search for geographical information.
            """)
        .model("gemini-2.5-flash")
        .tools(
            new GoogleSearchTool(),
            new GoogleMapsTool()
        )
        .build();

    InMemorySessionService sessionService = new InMemorySessionService();
    InMemoryArtifactService artifactService = new InMemoryArtifactService();

    Runner runner = new Runner(agent, agent.name(), artifactService, sessionService, null, null);

    final String appName = runner.appName();
    final String userId = UUID.randomUUID().toString();

    Session session = runner
        .sessionService()
        .createSession(appName, userId)
        .blockingGet();

    return new AgentRunnerSession(runner, userId, session);
  }
}

Summary

Another Javelit integration on the books, this time with ADK for Java.

At first, I was a bit surprised by the library’s unusual approach, compared to more event-driven or reactive web frameworks. However, I’m liking the simplicity of Javelit for quickly building a web frontend to let me experiment with ideas.

There are still many more features and components I’d like to explore (like charts, multi-page components, etc.), so you might see me write a few more articles, or use it more in my own presentations and workshops!

https://glaforge.dev/posts/2025/10/26/a-javelit-frontend-for-an-adk-agent/
Creating a Javelit chat interface for LangChain4j

Yesterday, I introduced the Javelit project in this article, where I built a small frontend to create and edit images with Google’s Nano Banana image model.

Javelit

Javelit is an open source project inspired by Streamlit from the Python ecosystem to enable rapid prototyping and deployment of applications in Java.

Today, I want to show you another example of Javelit. This time, I’m creating a chat interface using LangChain4j with the Gemini chat model.

What we want to build

Generative AI chat interface built with Javelit, LangChain4j, and the Gemini model

Notice how we alternate user and AI messages, and how the text is nicely rendered from Markdown? Let’s see how to implement such an interface with Javelit and LangChain4j.

Let’s build it!

Feel free to use any LLM model provider, but in my example today, I’m using Gemini:

private static final ChatModel CHAT_MODEL = GoogleAiGeminiChatModel.builder()
    .modelName("gemini-2.5-flash")
    .apiKey(System.getenv("GOOGLE_API_KEY"))
    .build();

When using LangChain4j chat models at a low level (not using AiServices or the new agentic module), we keep track of chat messages via a simple List of ChatMessage. This chat history needs to be stored in Javelit’s session state:

List<ChatMessage> chatHistory = (List<ChatMessage>) Jt.sessionState()
    .computeIfAbsent("chatHistory", (key) -> new ArrayList<>());

Let’s give this application a title, and prepare a container that will receive all the chat messages from both the AI and the user:

Jt.title(":coffee::parrot: LangChain4j Chat :speech_balloon:").use();
JtContainer msgContainer = Jt.container().use();

Notice that Javelit supports emoji code names!

Next, let’s append all the messages from the chat history to the message container, alternating between AI and user messages:

for (ChatMessage message : chatHistory) {
    switch (message.type()) {
        case USER -> Jt.markdown( ":speech_balloon: " +
            ((UserMessage) message).singleText()).use(msgContainer);
        case AI -> Jt.markdown(":robot: " +
            ((AiMessage) message).text()).use(msgContainer);
    }
}

Again, we’re using the Jt.markdown() component that outputs Markdown. Which is nice since LLMs love to return Markdown responses!

It’s time to get some input message from the user, with the Jt.textInput() component, and save its value in a local variable:

String inputMessage = Jt.textInput("Your message:").use();

Let’s update the chat history, display the user’s message, call the Gemini chat model, and then display its response:

if (inputMessage != null && !inputMessage.trim().isEmpty()) {
    chatHistory.add(UserMessage.from(inputMessage));
    Jt.markdown(":speech_balloon: " + inputMessage).use(msgContainer);

    ChatResponse response = CHAT_MODEL.chat(chatHistory);
    chatHistory.add(response.aiMessage());

    Jt.markdown(":robot: " + response.aiMessage().text()).use(msgContainer);
}

And that’s it! When the user hits the Enter key after typing a message, it triggers a UI refresh. Javelit goes over all the UI component rendering again, but uses the state to show the alternation of user/AI messages.
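
The replay mechanism can be sketched without any library. In this plain-Java model, `Message`, `Role`, `render`, and `submit` are hypothetical names of mine, not LangChain4j or Javelit types, and the "model" simply echoes the input. The point is only to show that the UI is rebuilt from the stored history on each run.

```java
import java.util.ArrayList;
import java.util.List;

class ChatReplaySketch {
    enum Role { USER, AI }

    // Hypothetical, simplified stand-in for a chat message.
    record Message(Role role, String text) { }

    // Rebuilds every line of the conversation from the stored history,
    // like the for loop that runs at the top of each rerun.
    static List<String> render(List<Message> history) {
        List<String> lines = new ArrayList<>();
        for (Message message : history) {
            String prefix = switch (message.role()) {
                case USER -> ":speech_balloon: ";
                case AI -> ":robot: ";
            };
            lines.add(prefix + message.text());
        }
        return lines;
    }

    // One interaction: record the user message, then the (fake) model reply.
    static void submit(List<Message> history, String input) {
        history.add(new Message(Role.USER, input));
        history.add(new Message(Role.AI, "echo: " + input));
    }

    public static void main(String[] args) {
        List<Message> history = new ArrayList<>(); // lives in the session state
        submit(history, "Hello");
        submit(history, "How are you?");
        render(history).forEach(System.out::println);
    }
}
```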

The whole source code

/// usr/bin/env jbang "$0" "$@" ; exit $?
//DEPS dev.langchain4j:langchain4j-core:1.8.0
//DEPS dev.langchain4j:langchain4j-google-ai-gemini:1.8.0

import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.model.googleai.GoogleAiGeminiChatModel;
import io.javelit.core.Jt;
import io.javelit.core.JtContainer;

import java.util.ArrayList;
import java.util.List;

public class App {
  private static final ChatModel CHAT_MODEL = GoogleAiGeminiChatModel.builder()
      .modelName("gemini-2.5-flash")
      .apiKey(System.getenv("GOOGLE_API_KEY"))
      .build();

  public static void main(String[] args) {

    List<ChatMessage> chatHistory = (List<ChatMessage>) Jt.sessionState()
        .computeIfAbsent("chatHistory", (key) -> new ArrayList<>());

    Jt.title(":coffee::parrot: LangChain4j Chat :speech_balloon:").use();
    JtContainer msgContainer = Jt.container().use();

    for (ChatMessage message : chatHistory) {
      switch (message.type()) {
        case USER -> Jt.markdown( ":speech_balloon: " +
            ((UserMessage) message).singleText()).use(msgContainer);
        case AI -> Jt.markdown(":robot: " +
            ((AiMessage) message).text()).use(msgContainer);
      }
    }

    String inputMessage = Jt.textInput("Your message:").use();

    if (inputMessage != null && !inputMessage.trim().isEmpty()) {
      chatHistory.add(UserMessage.from(inputMessage));
      Jt.markdown(":speech_balloon: " + inputMessage).use(msgContainer);

      ChatResponse response = CHAT_MODEL.chat(chatHistory);
      chatHistory.add(response.aiMessage());

      Jt.markdown(":robot: " + response.aiMessage().text()).use(msgContainer);
    }
  }
}

Then you can run this class (after having installed Javelit) with:

javelit run App.java

Summary

And voilà! In this article, we’ve managed to build a simple chat UI for LangChain4j chat models using the Javelit UI toolkit. We took advantage of Javelit’s state management, as well as the built-in markdown rendering, as LLMs generally use markdown in their responses.

https://glaforge.dev/posts/2025/10/25/creating-a-javelit-chat-interface-for-langchain4j/
Javelit to create quick interactive app frontends in Java

Have you ever heard of Javelit? It’s like Streamlit in the Python ecosystem, but for the Java developer! I was lucky that the project creator reached out and introduced me to this cool little tool!

Javelit is a tool to quickly build interactive app frontends in Java, particularly for data apps, but it’s not limited to them. It helps you quickly develop rapid prototypes, with a live-reload loop, so that you can quickly experiment and update the app instantly.

The way it works (and thus the way you program with it) is a little unusual, so it took me a bit of time to really get it. But basically, as the documentation states (in the fundamentals section):

Javelit’s architecture allows you to write apps the same way you write plain Java methods. To unlock this, Javelit apps have a unique data flow: any time something must be updated on the screen, Javelit reruns your entire Java main method from top to bottom.

So you have to think about it as if there were somehow a big loop around your UI code, and Javelit redraws it whenever you modify the source code (because of the live-reload capability), or of course, when a user interacts somehow with the app (submitting a form, clicking a button, moving a slider, etc.)

It’s possible to embed it in your own servers, but here, I’ll illustrate it with the standalone command-line tool, which I’ve installed thanks to Jbang.

It always starts with “Hello World!”

A simple example could be:

/// usr/bin/env jbang "$0" "$@" ; exit $?
import io.javelit.core.Jt;

public class App {
    public static void main(String[] args) {
        Jt.title("Hello World!").use();
        Jt.markdown("""
            ## My first official message
            Hello World!
            """).use();
    }
}

Then, once Javelit is installed, you’d run it with the following command:

javelit run App.java

It will open your browser automatically, and you’ll be able to view the app UI. Then, start making some changes to the title or markdown text, and notice how the UI is live reloaded.

Later on, add maybe a button, or the many components available, like the various text elements, the input elements & forms, the containers, pages & layouts, or the data components like tables or charts with Apache Echarts.

Have a look at the more complete Hello World from the documentation, which shows some interactivity with a button click counter.

Creating an interactive image playground with Nano Banana

After “Hello World!”, you’ve got to build something a little more involved, right?

Since I love playing with Nano Banana (i.e. Gemini 2.5 Flash Image) to create and edit pictures, I decided to build an interactive image playground, to create new images, and then incrementally edit the image with further prompts.

Here’s the UI I came up with, and let’s see how to build and interact with it:

What do we see in that UI? A title, a form containing a text area to enter the prompts, and a button to launch the image generation. So we’ll lay out those components in that order:

public class App {
  public static void main(String[] args) {

    Jt.title("🍌 Nano Banana Playground 🍌").use();

    var form = Jt.form().use();

    var text = Jt.textArea("Image prompt")
        .placeholder("An impressionist painting of a cat")
        .use(form);

    if (Jt.formSubmitButton("Generate image").use(form)) {
      // Nano Banana magic to generate or edit the image, then...
      Jt.html("<img src='data:" + mimeType +
          ";base64," + b64encoded + "'>")
          .use();
    }
  }
}

We added:

  • a title with the Jt.title().use() method,
  • a form, to associate the text area and submit button, with Jt.form().use(),
  • a text area, with a placeholder, and hosted within the form, with Jt.textArea().use(),
  • a button to launch the image generation, with Jt.formSubmitButton().use(),
  • and Jt.html().use() to append an image as a data src img (there’s no image component yet, but it’s on the roadmap).
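
The snippet above leaves out where mimeType and b64encoded come from. As a small illustration, this standalone helper (the class and method names are mine) builds the same kind of img tag from raw bytes using only the JDK's Base64 encoder. The two bytes in the example are not a real image; they just show the shape of the output.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class DataUriSketch {
    // Encodes the bytes as Base64 and embeds them in a data: URI,
    // so the browser can display the image without a separate file or URL.
    static String imgTag(byte[] data, String mimeType) {
        String b64encoded = Base64.getEncoder().encodeToString(data);
        return "<img src='data:" + mimeType + ";base64," + b64encoded + "'>";
    }

    public static void main(String[] args) {
        byte[] fakeImage = "hi".getBytes(StandardCharsets.UTF_8);
        // prints <img src='data:image/png;base64,aGk='>
        System.out.println(imgTag(fakeImage, "image/png"));
    }
}
```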

But what’s more interesting is this mysterious if statement… The first time, the UI is drawn with the title and form. But since the user hasn’t yet clicked the submit button, the formSubmitButton() method returns false. So the code inside the if isn’t executed.

But once the user interacts with the UI, after having entered some text and clicked on the button, this time the method will return true, and the image component (here the HTML component) is going to be added to the UI.

At first, this is not really obvious, as you have to think in terms of a loop redrawing the UI after each interaction with the components. But you’ll get the hang of it after a little while.
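
If it helps, here is a tiny plain-Java model of that redraw loop. It is not Javelit code: `render` stands in for one full execution of the main method, and its boolean parameter plays the role of the value returned by the submit button during that run.

```java
import java.util.ArrayList;
import java.util.List;

class RerunLoopSketch {
    // One full "redraw": returns the components that end up on screen.
    static List<String> render(boolean submitClicked) {
        List<String> ui = new ArrayList<>();
        ui.add("title");
        ui.add("form");
        if (submitClicked) {
            // Only reached during the run triggered by the button click.
            ui.add("image");
        }
        return ui;
    }

    public static void main(String[] args) {
        System.out.println(render(false)); // prints [title, form]
        System.out.println(render(true));  // prints [title, form, image]
    }
}
```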

Handling state

The idea of this image playground is to

  • first, create a brand-new image,
  • but then, you change the prompt with some image editing commands, and each time you submit the form, the image will be updated accordingly.

So you need to somehow keep track of the state from the previous interaction and rendering loop. How do you do that? With the Jt.sessionState().

For example, to save the bytes and the MIME type of the image generated by Nano Banana, I would write:

Jt.sessionState().put("mimeType", mimeType);
Jt.sessionState().put("bytes", data);

And if I want to get those variables back in the next rendering loop, I’d write:

String mimeType = Jt.sessionState().getString("mimeType");
// get() returns an untyped value, so cast it back to a byte array
byte[] data = (byte[]) Jt.sessionState().get("bytes");

It’s a so-called typed map, and there are many methods like putIfAbsent(), computeInt(), etc.
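
To give a rough idea of what such a typed map looks like, here is a simplified sketch. This is a hypothetical class of mine, not Javelit's actual implementation: it wraps a plain map, offers a typed getter next to the untyped one, and adds a helper for the "is there already an image?" check that the playground needs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified typed map; the real session state has more methods.
class TypedStateSketch {
    private final Map<String, Object> values = new HashMap<>();

    void put(String key, Object value) {
        values.put(key, value);
    }

    // Untyped access: the caller has to cast the result.
    Object get(String key) {
        return values.get(key);
    }

    // Typed access: the cast is done once, inside the map.
    String getString(String key) {
        return (String) values.get(key);
    }

    // True once a previous run has stored both the MIME type and the bytes.
    boolean hasPreviousImage() {
        return getString("mimeType") != null && get("bytes") != null;
    }

    public static void main(String[] args) {
        TypedStateSketch state = new TypedStateSketch();
        System.out.println(state.hasPreviousImage()); // prints false
        state.put("mimeType", "image/png");
        state.put("bytes", new byte[] {1, 2, 3});
        System.out.println(state.hasPreviousImage()); // prints true
    }
}
```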

The final version of the code of our playground

Most of the code below is actually generating and editing the image, as I explained in previous articles using the Gemini GenAI Java SDK directly, or within ADK for Java. The UI code from Javelit is really just a dozen lines or so!

/// usr/bin/env jbang "$0" "$@" ; exit $?
//DEPS com.google.genai:google-genai:1.24.0

package demo;

import com.google.genai.Client;
import com.google.genai.types.Content;
import com.google.genai.types.GenerateContentConfig;
import com.google.genai.types.Part;
import io.javelit.core.Jt;

import java.util.Base64;

public class App {
  public static void main(String[] args) {

    Jt.title("🍌 Nano Banana Playground 🍌").use();

    var form = Jt.form().use();

    var imgContainer = Jt.empty().use();

    var text = Jt.textArea("Image prompt")
        .placeholder("An impressionist painting of a cat")
        .use(form);

    if (Jt.formSubmitButton("Generate image").use(form)) {

      try (Client client = new Client.Builder()
          .apiKey(System.getenv("GOOGLE_API_KEY"))
          .build()) {

        String mimeTypeFromState = Jt.sessionState().getString("mimeType");
        byte[] bytesFromState = (byte[]) Jt.sessionState().get("bytes");

        Content content;
        // first run --> create a brand-new image
        if (mimeTypeFromState == null || bytesFromState == null) {
          content = Content.fromParts(
              Part.fromText(text)
          );
        } else { // second run --> edit the previously generated image
          content = Content.fromParts(
              Part.fromBytes(bytesFromState, mimeTypeFromState),
              Part.fromText(text)
          );
        }

        var response = client.models.generateContent(
            "gemini-2.5-flash-image-preview",
            content,
            GenerateContentConfig.builder()
                .responseModalities("TEXT", "IMAGE")
                .build());

        response.candidates()
            .flatMap(candidates ->
                candidates.getFirst().content()
                    .flatMap(Content::parts)
                    .flatMap(parts -> parts.getLast().inlineData()))
            .ifPresent(inlineData -> {
              String mimeType = inlineData.mimeType().orElse("image/png");
              inlineData.data().ifPresent(data -> {
                Jt.sessionState().put("mimeType", mimeType);
                Jt.sessionState().put("bytes", data);

                String b64encoded = Base64.getEncoder().encodeToString(data);
                Jt.html("<img src='data:" + mimeType +
                        ";base64," + b64encoded + "'>")
                    .use(imgContainer);
              });
            });
      }
    }
  }
}

Go ahead and have fun with Javelit!

I had a lot of fun playing with Javelit so far, and I’m looking forward to using this nice little tool to experiment with various application ideas. I highly encourage you to try it out, so go check out Javelit!

https://glaforge.dev/posts/2025/10/24/javelit-to-create-quick-interactive-app-frontends-in-java/
Building AI Agents with ADK for Java

At Devoxx Belgium, I recently had the chance to present this new talk dedicated to ADK for Java, the open source Agent Development Kit framework developed by Google.

The presentation covered:

  • an introduction to the notion of AI agents
  • how to get started in a Java and Maven project
  • how to create your first agent
  • how to debug an agent via the Dev UI
  • the coverage of the various tools (custom function tools, built-in tools like Google Search or code execution, an agent as tool, MCP tools)
  • an overview of the different ways to combine agents into a multi-agent system: sub-agents, sequential agents, parallel agents, loop agents
  • some details on the event loop and services (session and state management, artifacts, runner…)
  • structured input / output schemas
  • the various callbacks in the agent lifecycle
  • the integration with LangChain4j (to give access to the plethora of LLMs supported by LangChain4j)
  • the definition of agents via configuration in YAML
  • the new long-term memory support
  • the plugin system
  • the new external code executors (via Docker containers or backed by Google Cloud Vertex AI)
  • how to launch an agent with the Dev UI from JBang

Slides of the presentation

The slide deck of this session is embedded below:

Video recording of the talk

And you can also watch the recording of this presentation here:

Samples demonstrated during the conference

During the presentation, I demonstrated a lot of samples showing ADK in action. You can find all the examples in this GitHub repository. There are also a couple of servers implementing an MCP server and an A2A server agent.

Among those examples, you’ll find:

Updated ADK template project

With the recent (somewhat silent) release of version 0.3.0, I seized the opportunity to also update my ADK template project on GitHub. In a recent article, I wrote about this template project, how you can clone it or reuse it, to get started with ADK for Java easily, with a sample agent, and a Maven build.

Getting started with a codelab

In addition to the GitHub project template, I have developed a codelab to build AI agents with ADK for Java, for Java developers. This codelab goes through setting up your environment, writing your first agent, empowering an agent with tools, using the built-in Google Search tool, and mastering the various types of workflow patterns (sub-agent, sequential, parallel, loop).

https://glaforge.dev/talks/2025/10/22/building-ai-agents-with-adk-for-java/
Creative Java AI agents with ADK and Nano Banana 🍌

Large Language Models (LLMs) are all becoming “multimodal”. They can process text, but also other “modalities” in input, like pictures, videos, or audio files. But models that output more than just text are less common…

Recently, I wrote about my experiments with Nano Banana 🍌 (in Java), a Gemini chat model flavor that can create and edit images. This is particularly handy for interactive creative tasks, for example a marketing assistant that would help you design a new product by describing it, further tweaking its look, showing it in different settings for marketing ads, etc.

The “Nano Banana” 🍌 model we’ll use today is the nickname for the popular gemini-2.5-flash-image-preview model. It’s not just a conversational AI; it’s a creative partner that can generate and edit images right within a chat session. It generates not only text, but also images.

And today, we’re going to explore how to configure and use this model inside an AI agent developed with ADK (Agent Development Kit, and especially its Java version). We’ll focus on a key piece of the puzzle: processing the image output from the model and saving it for later use.

Meet the NanoBananaCreativeAgent

Let’s look at some code. Our example is the following NanoBananaCreativeAgent class. Its goal is to act as a creative assistant, using the 🍌 “Nano Banana” model to handle image-related tasks based on user prompts.

Remark

Maybe in a later post, we’ll see how to create a more complete agent like the marketing scenario I suggested in introduction, but for now, I want to highlight how to configure the model and save its output for later use.

Here’s how we define the agent using ADK’s LlmAgent.builder():

import com.google.adk.agents.BaseAgent;
import com.google.adk.agents.LlmAgent;
import com.google.genai.types.Blob; // used by the callback shown further down
import com.google.genai.types.Content;
import com.google.genai.types.Part;
import io.reactivex.rxjava3.core.Maybe;
import java.util.List;

public class NanoBananaCreativeAgent {
    public BaseAgent getAgent() {
        return LlmAgent.builder()
            .name("nano-banana-creative-agent")
            .model("gemini-2.5-flash-image-preview") // 🍌
            .instruction("""
                You are a creative assistant, and you help users
                create new images or edit existing ones, using the
                Nano Banana model (aka Gemini 2.5 Flash Image)
                """)
            .afterModelCallback((callbackContext, llmResponse) -> {
                // We'll zoom in on this part next!
            })
            .build();
    }
    // ... main method to run the agent and other helpers
}

There are two important lines here for enabling image generation:

  1. .model("gemini-2.5-flash-image-preview"): This tells the ADK to route requests to the specific model endpoint capable of generating and editing images.
  2. .instruction(...): The instruction primes the model, letting it know its role is to be a creative assistant focused on image tasks.

Handling the image response with a callback

When you ask the model to “create an image of a cat wearing a party hat,” it doesn’t return a URL or a file path. It returns the image data directly in its response, typically as a Part containing binary data, in a big byte array.

Note

The model usually replies with some text to introduce the image, along with the image itself. But sometimes it returns text only, in particular when it asks for clarifications before generating the requested image. So it’s important to check that an image is indeed present in the output of the model. Also, when the output contains an image, there is only ever one, which means I could use findFirst() instead of forEach() in the implementation below.

How do we capture and use this binary data? This is where ADK’s afterModelCallback becomes handy. It’s a hook that lets you execute custom Java code immediately after the LLM sends its response, but before the agent’s turn is finished.

Let’s look at our callback code:

.afterModelCallback((callbackContext, llmResponse) -> {
    llmResponse.content() // 1. Let's find the image part!
        .flatMap(Content::parts)
        .stream()
        .flatMap(List::stream)
        // Filter parts containing image content
        .filter(part -> part.inlineData().isPresent())
        .forEach(part -> {
            // 2. Save the image as an artifact for the pipeline
            callbackContext.saveArtifact("rendered-image", part);

            // 3. Potentially save the image as a file elsewhere
            Blob blob = part.inlineData().get();
            byte[] imageBytes = blob.data().get();
            String mimeType = blob.mimeType().get();
        });
    // Returning empty means not altering the agent's response
    return Maybe.empty();
})

Let’s break down what’s happening in this lambda:

  1. Find the image part: We use a Java Stream to navigate the llmResponse structure. A response can have zero or one Content object (an Optional), with an optional list of Parts (which can be text, function calls, or inline data). We filter down to find the Part that contains inlineData — this is our image.

  2. Save as an artifact: callbackContext.saveArtifact("rendered-image", part) is a key ADK feature. It saves the raw image Part into the agent’s artifact registry under the name rendered-image. This makes the generated image available to other agents or tools that might run later in a more complex pipeline.

  3. Do something with the image bytes: Potentially, instead of (or in addition to) saving the image as an artifact, you can decide to do something yourself with the bytes of the image and its MIME type, like saving that file directly to the file system if you’re building some kind of command-line based agent tool.
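As a minimal sketch of steps 1 and 3, here is the same Optional-flattening idea using `findFirst()`, plus one way to pick a file name from the MIME type. The `Blob`, `Part`, and `Content` records are hypothetical stand-ins that only mimic the Optional-heavy shape of the response. They are not the ADK types. The extension table and the `fileNameFor` and `save` helpers are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class ImagePartSaver {
    // Hypothetical stand-ins mimicking the response structure (NOT the ADK classes)
    record Blob(Optional<byte[]> data, Optional<String> mimeType) {}
    record Part(Optional<String> text, Optional<Blob> inlineData) {}
    record Content(Optional<List<Part>> parts) {}

    // Assumed MIME type to file extension mapping; extend as needed
    private static final Map<String, String> EXTENSIONS = Map.of(
        "image/png", ".png",
        "image/jpeg", ".jpg",
        "image/webp", ".webp");

    // Returns the inline data of the first part that carries some, if any
    static Optional<Blob> firstImage(Optional<Content> content) {
        return content
            .flatMap(Content::parts)
            .stream()
            .flatMap(List::stream)
            .map(Part::inlineData)
            .flatMap(Optional::stream)
            .findFirst();
    }

    // Builds a file name such as "rendered-image.png"; falls back to ".bin"
    static String fileNameFor(Blob blob, String baseName) {
        String extension = blob.mimeType()
            .map(type -> EXTENSIONS.getOrDefault(type, ".bin"))
            .orElse(".bin");
        return baseName + extension;
    }

    // Writes the bytes into the directory; empty result when the blob has no bytes
    static Optional<Path> save(Blob blob, Path directory, String baseName) {
        return blob.data().map(bytes -> {
            try {
                return Files.write(directory.resolve(fileNameFor(blob, baseName)), bytes);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}
```

Because `firstImage` returns an `Optional`, a text-only reply from the model (for example a clarifying question) simply yields an empty result instead of an exception.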

Going further: building a creative marketing agent

This simple agent is the building block for a more complex creative workflow. Imagine a “Creative Marketing Agent” built as a pipeline of agents.

  • Step 1 — Product ideation. A user interacts with our NanoBananaCreativeAgent. They prompt: “Generate an image of a new energy drink can called ‘Cosmic Charge’. It should be dark blue with a glowing yellow lightning bolt.”

  • Step 2 — Image generation & persistence. Our agent calls the model to generate the image. The user iterates potentially a few rounds to further improve the rendered picture, thanks to Nano Banana’s editing capabilities. The afterModelCallback we just analyzed fires, and cosmic-charge.png is saved to the file system, and/or the image is also saved as an artifact in the agent’s session.

  • Step 3 — Further asset generation. A second agent in the pipeline (via an ADK SequentialAgent that we explored in a previous article), a MarketingAssetAgent, is triggered. Its instruction might be: “You will be given a product image. Your job is to create a marketing banner for social media.” This agent can now be given a new prompt like, “Take the product image from the ‘rendered-image’ artifact and place it on a background of a starry nebula. Add the text ‘Feel the Power of the Cosmos!’”. Or you could also have a dedicated video generation agent, using the Veo 3 video model, to generate a video illustrating the product in action, like someone drinking this fancy energy drink.

By saving the image to a file or as an artifact, thanks to the callback trick, we’ve successfully passed a complex, generated asset from one stage of our agentic workflow to the next.

Conclusion

Integrating advanced image generation and editing into your Java agentic applications is no longer science fiction. With the Agent Development Kit for Java and just a few lines of hook code, you can configure an agent to use powerful multimodal models like Nano Banana. By leveraging the afterModelCallback in your ADK agent definition, you gain precise control over the model’s output, allowing you to process, save, and chain creative tasks together to build more useful and creative agents.

So go ahead, start experimenting, and see what amazing creative workflows you can build! And, of course, be sure to read my series of articles on ADK for Java!

https://glaforge.dev/posts/2025/09/22/creative-ai-agents-with-adk-and-nano-banana/
Creating a Streamable HTTP MCP server with Micronaut

In previous articles, I explored how to create an MCP server with Micronaut by vibe-coding one, following the Model Context Protocol specification (which was a great way to better understand the underpinnings) and how to create an MCP server with Quarkus.

Micronaut used to lack a dedicated module for creating MCP servers, but it recently added official support for MCP, so I was eager to try it out!

For the impatient

You can check out the code we’ll be covering in this article on GitHub.

What to build?

Like in my previous article with Quarkus, I decided to build another version of my 🌔 moon phases MCP server. This makes it easy to contrast Quarkus and Micronaut’s approaches.

I reused my code for calculating the moon phases. My MoonPhasesService is fairly simple (as long as you don’t look at the exact math calculation) and consists of two methods:

  • currentMoonPhase() — to know the phase at this point in time,
  • moonPhaseAtUnixTimestamp(long timeSeconds) — to know the phase at a specific point in time.

The contract is as follows, nothing specific to MCP for now:

@Singleton
public class MoonPhasesService {
    // ...
    public MoonPhase currentMoonPhase() { /*...*/ }
    public MoonPhase moonPhaseAtUnixTimestamp(long timeSeconds) { /*...*/ }
}

Compared to my Quarkus version, the service returns MoonPhase records instead of enum values, as it seems Micronaut is unhappy with returning my enum. So I changed MoonPhase to look like this:

@JsonSchema
@Introspected
public record MoonPhase(
    @NotBlank String phase,
    @NotBlank String emoji
) { }

What’s interesting here is the @JsonSchema annotation, which comes from the Micronaut JSON Schema module and provides very rich support for all the subtleties of the JSON Schema specification. The @Introspected annotation is here to help with annotation processing and Ahead-of-Time compilation.

Let’s look at the MoonPhasesMcpServer now:

@Singleton
public class MoonPhasesMcpServer {
    @Inject
    private MoonPhasesService moonPhasesService;

    @Tool(name = "current-moon-phase",
        description = "Provides the current moon phase")
    public MoonPhase currentMoonPhase() {
        return moonPhasesService.currentMoonPhase();
    }

    @Tool(name = "moon-phase-at-date",
        description = "Provides the moon phase at a certain date (with a format of yyyy-MM-dd)")
    public MoonPhase moonPhaseAtDate(
        @ToolArg(name = "localDate")
        @NotBlank @Pattern(regexp = "\\d{4}-\\d{2}-\\d{2}")
        String localDate
    ) {
        LocalDate parsedLocalDate = LocalDate.parse(localDate);
        return moonPhasesService.moonPhaseAtUnixTimestamp(parsedLocalDate.toEpochDay() * 86400);
    }
}

You’ll find the same pair of annotations as in Quarkus: @Tool and @ToolArg. In Micronaut, @ToolArg is missing a description field, but it should be added soon.

What’s more powerful here in Micronaut is the use of Micronaut Validation annotations: notice the @NotBlank and even better, the @Pattern annotation!

With Micronaut, I don’t have to handle malformed inputs myself, as they are caught by validation much earlier. If the input is incorrect, Micronaut handles the situation on its own, and the method isn’t even called.
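To make the two rules concrete, here is a plain-Java sketch of what they check. It runs without Micronaut and makes no claim about how Micronaut Validation is implemented. The class and method names are hypothetical. It also shows a detail worth knowing: the regular expression only checks the shape of the string, so a value like 2025-13-45 passes the pattern and is only rejected later by LocalDate.parse().

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Pattern;

class DateArgumentCheck {
    // Same regular expression as in the @Pattern annotation of the tool argument
    private static final Pattern DATE_SHAPE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    // Mirrors the intent of @NotBlank plus @Pattern (full match), as an assumption
    static boolean hasExpectedShape(String input) {
        return input != null && !input.isBlank() && DATE_SHAPE.matcher(input).matches();
    }

    // Converts a yyyy-MM-dd string to epoch seconds at midnight UTC,
    // which equals toEpochDay() * 86400; empty when the input is not a real date
    static Optional<Long> toEpochSeconds(String input) {
        if (!hasExpectedShape(input)) {
            return Optional.empty();
        }
        try {
            LocalDate date = LocalDate.parse(input);
            return Optional.of(date.atStartOfDay(ZoneOffset.UTC).toEpochSecond());
        } catch (DateTimeParseException e) {
            // Right shape, but not a valid calendar date (month 13, February 30...)
            return Optional.empty();
        }
    }
}
```

If impossible dates matter for your tool, one option is to catch the parsing exception inside the tool method and return a friendlier error.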

Testing the MCP server with the MCP Inspector

When using MCP Inspector to test my server manually, if I pass a blank value to the moon-phase-at-date method, I’ll see validation kicking in:

MCP error -32603: moonPhaseAtDate.localDate: must not be blank,
moonPhaseAtDate.localDate: must match "\d{4}-\d{2}-\d{2}"

Extra bonus point: Micronaut MCP will create (at compile time) the JSON schemas for the various @JsonSchema annotated beans, adding more fine-grained information about the manipulated input / output structures.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "http://localhost:8080/schemas/moonPhase.schema.json",
  "title": "Phase of the moon",
  "description": "The phase of the moon is composed of the name of the phase and an emoji representing it",
  "type": "object",
  "properties": {
    "emoji": {
      "type": "string",
      "minLength": 1
    },
    "phase": {
      "type": "string",
      "minLength": 1
    }
  }
}

For those schemas to be served as static assets, application.properties must configure the static resources:

# specify the HTTP Streamable transport
micronaut.mcp.server.transport=HTTP
micronaut.mcp.server.info.name=moon-phases
micronaut.mcp.server.info.version=1.0.0

# Specify how & where the schemas should be exposed
micronaut.router.static-resources.jsonschema.paths=classpath:META-INF/schemas
micronaut.router.static-resources.jsonschema.mapping=/schemas/**

# Potentially define a specific base URL, otherwise it's inferred
# micronaut.jsonschema.validation.baseUri=https://example.com/schemas

Quick look at the dependencies

You can check out the code and read the README, but I’d like to mention how I scaffolded the project in the first place, and which dependencies (and tweaks) were needed.

Creating the Micronaut application

This project was bootstrapped with the mn Micronaut command-line tool, which can be installed via SDKman.

mn create-app --build=gradle --jdk=21 --lang=java --test=junit \
  --features=jackson-databind,json-schema,validation,json-schema-validation mn.mcp.server.mn-mcp-server

As the MCP support is based on the official MCP SDK, which is currently tied to Jackson, you must use the Jackson data binding (not Micronaut’s built-in serialization). You need to add json-schema, validation, and json-schema-validation features.

But you’ll have to make some tweaks to the dependencies.

Custom dependency tweaks

The following dependencies were added to build.gradle, or updated, to support the MCP server and enhance JSON Schema generation:

dependencies {
    // Existing dependencies
    // ...

    // The Micronaut MCP support
    implementation("io.micronaut.mcp:micronaut-mcp-server-java-sdk:0.0.3")

    // For rich JSON schema handling
    annotationProcessor("io.micronaut.jsonschema:micronaut-json-schema-processor:1.7.0")
    implementation("io.micronaut.jsonschema:micronaut-json-schema-annotations:1.7.0")
}

First of all, the MCP module is not yet part of the features you can select from the mn command, or from the Micronaut launch site, but once the MCP support stabilizes, it’ll be available.

I had to update the dependency version of the JSON Schema support (instead of using the default version from the BOM), but this new version will be available soon in the Micronaut BOM.

So maybe when you read this, you’ll just add the mcp feature to the list of features, and have everything configured properly. But those tweaks are just because I’m living on the bleeding edge right now!

Invoking the server via Gemini CLI

For fun, I decided to add this MCP server to my Gemini CLI installation.

Before launching the CLI, I installed the MCP server as follows:

gemini mcp add moonPhases --transport http http://localhost:8080/mcp

Then when I launch gemini and list the MCP servers, I can see the moon phase server:

I ask what was the phase of the moon when mankind first landed on the moon (and Gemini figures out the correct date format, although I gave it in plain English). Gemini CLI asks for my acknowledgement to execute the MCP server tool:

Finally, Gemini CLI responds with a proper English response:

Going further

I hope you enjoyed the ride so far, but what are the next steps?

Java developers have some great options nowadays for developing their MCP servers, including Quarkus and Micronaut. Be sure to evaluate those options for your next projects! For enterprise deployments, nothing beats Java! 😉 And Micronaut offers a pretty elegant handling of structured inputs and outputs thanks to its rich JSON Schema support.

https://glaforge.dev/posts/2025/09/16/creating-a-streamable-http-mcp-server-with-micronaut/
Generating videos in Java with Veo 3

Yesterday, we went bananas 🍌 creating and editing images with Nano Banana, in Java. Now, what about generating videos as well, still in Java, with Veo 3?

Especially since this week, Google announced that Veo 3 became generally available, with reduced pricing, a new 9:16 aspect ratio (nice for those vertical viral videos) and even with resolution up to 1080p!

In today’s article, we’ll see how to create videos, in Java, with the GenAI Java SDK. We’ll create videos either:

  • with a prompt,
  • or starting with an existing image.
Setting up your project

Depending on your build tool of choice, you’ll have to declare the following dependency for the GenAI SDK:

For Maven
<dependency>
    <groupId>com.google.genai</groupId>
    <artifactId>google-genai</artifactId>
    <version>1.15.0</version>
</dependency>
For Gradle
dependencies {
    implementation "com.google.genai:google-genai:1.15.0"
}

Next, you’ll need to decide how to authenticate: either with a Google AI API key, or with an existing Google Cloud project.

Using a Google AI API key
try (Client client = new Client.Builder()
    .apiKey(System.getenv("GOOGLE_API_KEY"))
    .build()) {
        // ...
}
Using a Google Cloud project
try (Client client = new Client.Builder()
    .project(System.getenv("GOOGLE_CLOUD_PROJECT_ID"))
    .location(System.getenv("GOOGLE_CLOUD_LOCATION"))
    .vertexAI(true)
    .build()) {
        // ...
}

Now we’re ready to go!

Choosing the right model

There are two Veo 3 versions, the highest quality one and a faster one:

  • veo-3.0-generate-001
  • veo-3.0-fast-generate-001

Let’s store the model in a variable:

String modelName = "veo-3.0-generate-001";
Note

For the rest of this article, I’ll use the best and greatest version! The generated videos are of higher quality, but they are also more expensive. So you might want to use the “fast” variant, if you don’t need the highest quality possible, and you also want faster video generation.

You can create a video in two ways: from a text prompt, where Veo creates the entire scene, or from an existing image that serves as a starting point, preserving its atmosphere and tone. Let’s see both approaches.

Generating a video with a prompt

The client.models.generateVideos() method we will use has a few overloads. I usually prefer the one that accepts a GenerateVideosSource builder object. On that object, you can configure whether you want to pass a prompt, or an image as starting point.

So let’s create a video of a funky banana:

var operation = client.models.generateVideos(
    modelName,
    GenerateVideosSource.builder()
        .prompt("""
            A funky banana is dancing on the dance floor in a
            nightclub with flashy color lights and faceted ball,
            and in the background, a neon light says "Veo 3"
            """)
        .build(),
    GenerateVideosConfig.builder()
        .aspectRatio("16:9")
        .resolution("720p")
        .generateAudio(true)
        .build()
);

What does it look (and sound!) like?

Note

Vertex AI and Google AI don’t necessarily support the same video configuration options. The example here was made with Veo from Vertex AI, but with the developer endpoint, for example, you can’t specify the resolution or the audio generation parameter. So be sure to double check which parameters work with the flavor of Veo you’re using.

Generating a video with an existing image

What if, instead, I already have an image that I want to animate? For example, I created this nice dancing banana with Nano Banana (check yesterday’s post on Nano Banana from Java):

So let’s animate that particular image (banana.png) and this time, I’ll use a 9:16 vertical format, so that I can easily share it on social media, with a nice aspect ratio for mobile devices:

var operation = client.models.generateVideos(
    modelName,
    GenerateVideosSource.builder()
        .prompt("The banana is dancing to the sound of disco music")
        .image(Image.fromFile("banana.png", "image/png"))
        .build(),
    GenerateVideosConfig.builder()
        .aspectRatio("9:16")
        .resolution("1080p")
        .generateAudio(true)
        .build()
);

Here’s the resulting video:

Do you feel like dancing now? Or maybe you’d rather eat a banana?

Waiting for the video to be generated

I showed the videos in the previous sections already, but I didn’t yet explain how to retrieve those generated videos.

Images are quite fast to generate, so we can wait synchronously for them. Videos, on the other hand, take longer (between 1 and 2 minutes). That’s why we have to poll to check their status.

The generateVideos() method returns a GenerateVideosOperation object, which is a snapshot of the status of the operation. We have to request a new GenerateVideosOperation instance each time we want to check the status:

while (!operation.done().orElse(false)) {
    Thread.sleep(1000);
    operation = client.operations.getVideosOperation(operation, null);
}

With a while loop, I’m checking every second when/if the video is finished or not. And I request a new operation object and check again.
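For longer generations, it may be worth bounding the wait and spacing out the checks. Here is a sketch of a generic polling helper with a timeout and a capped, doubling delay. The `Poller` class and its `pollUntil` method are hypothetical and not part of the GenAI SDK. The delays are arbitrary assumptions.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

class Poller {
    // Refreshes the state until isDone accepts it or the timeout elapses.
    // Returns the final state, or empty on timeout or interruption.
    static <T> Optional<T> pollUntil(T initial,
                                     UnaryOperator<T> refresh,
                                     Predicate<T> isDone,
                                     Duration initialDelay,
                                     Duration maxDelay,
                                     Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        Duration delay = initialDelay;
        T current = initial;
        while (!isDone.test(current)) {
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty(); // gave up: the operation took too long
            }
            try {
                Thread.sleep(delay.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return Optional.empty();
            }
            current = refresh.apply(current);
            // Double the delay after each check, without exceeding maxDelay
            Duration doubled = delay.multipliedBy(2);
            delay = doubled.compareTo(maxDelay) > 0 ? maxDelay : doubled;
        }
        return Optional.of(current);
    }
}
```

With the loop above, `refresh` would be `op -> client.operations.getVideosOperation(op, null)` and `isDone` would be `op -> op.done().orElse(false)`.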

When the operation is actually done, we can fetch the video bytes and save them in a file. There’s a bit of Java stream magic involved here, because the API is (unfortunately) relying too much on Optionals for everything:

operation.response()
    .flatMap(GenerateVideosResponse::generatedVideos)
    .stream()
    .flatMap(List::stream)
    .findFirst()
    .ifPresent(video -> client.files.download(video, "video.mp4", null));

And here you go, the video is now available and saved in the video.mp4 file.

Conclusion

In this article, we’ve seen how to use the GenAI Java SDK to generate videos with Veo 3 in Java. We explored creating videos from both text prompts and existing images, configuring aspect ratios and resolutions, and handling the asynchronous nature of video generation by polling for results. With just a few lines of Java code, you can now integrate powerful video generation capabilities into your applications.

https://glaforge.dev/posts/2025/09/10/generating-videos-in-java-with-veo3/
Generating and editing images with Nano Banana in Java

By now, you’ve all probably seen the incredible images generated by the Nano Banana model (also known as Gemini 2.5 Flash Image preview). If you haven’t, I encourage you to play with it within Google AI Studio and from the Gemini app, or have a look at the @NanoBanana X/Twitter account, which shares some of its greatest creations.

As a Java developer, you may be wondering how you can integrate Nano Banana in your own LLM-powered apps. This is what this article is about! I’ll show you how you can use this model to:

  • create new images
  • edit existing images
  • assemble images together

For that, I’ll be using the GenAI Java SDK from Google.

Remark

On this blog, I regularly talk about LangChain4j and ADK for Java. It’s possible to use Nano Banana in ADK, but it’s not yet possible to use it with LangChain4j, because LangChain4j doesn’t yet support models that feature output multimodality (i.e. returning text and images). I’ll definitely come back to that once it’s supported to show how to call Nano Banana from LangChain4j.

Creating your first image

First of all, you’ll need to add the following dependency:

For Maven
<dependency>
    <groupId>com.google.genai</groupId>
    <artifactId>google-genai</artifactId>
    <version>1.15.0</version>
</dependency>
For Gradle
dependencies {
    implementation "com.google.genai:google-genai:1.15.0"
}

Next, you’ll need to decide how to authenticate: either with a Google AI API key, or with an existing Google Cloud project.

Using a Google AI API key
try (Client client = new Client.Builder()
    .apiKey(System.getenv("GOOGLE_API_KEY"))
    .build()) {
        // ...
}
Using a Google Cloud project
try (Client client = new Client.Builder()
    .project(System.getenv("GOOGLE_CLOUD_PROJECT_ID"))
    .location(System.getenv("GOOGLE_CLOUD_LOCATION"))
    .vertexAI(true)
    .build()) {
        // ...
}

Time to generate our first image!

var response = client.models.generateContent(
    "gemini-2.5-flash-image-preview",
    """
    An impressionist oil painting
    of the port of La Rochelle
    with its towers and sailing ships.
    """,
    GenerateContentConfig.builder()
        .responseModalities("TEXT", "IMAGE")
        .build());

Call the generateContent() method with the name of the model, the text prompt, and some extra configuration to specify that we want both text and image output.

Note: When using Nano Banana with Google Cloud Vertex AI, this GenerateContentConfig setting is required, but it’s implicit when using the Google AI API endpoint instead. It’s a little discrepancy that might be ironed out later on, but in the meantime, always specify that parameter so that your code works for both flavors of the model and API endpoint.

Then you need to save the image part of the response into a file, or potentially stream it to the user.

for (Part part : Objects.requireNonNull(response.parts())) {
    if (part.inlineData().isPresent()) {
        var blob = part.inlineData().get();
        if (blob.data().isPresent()) {
            try {
                Files.write(Paths.get("oil.png"), blob.data().get());
                break;
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }
}

Important: Nano Banana is actually a chat model! So it can return both text and image. It may sometimes ask follow up questions to be sure what image it should generate, so an image is not necessarily always present in the output. But when an image is generated, there’s always just one.

Let’s see what my oil painting looks like:

It looks very much like an oil painting from an impressionist painter!

Editing an existing image

One area where Nano Banana excels is editing existing images. That’s why some people claim it’s a Photoshop killer!

To edit an image, you’ll have to use a variant of the generateContent() method that takes a Content object (made of Parts), instead of using the method that takes a simple string, because you’ll need to pass:

  • the image you want to edit,
  • as well as the instructions of the changes you want to make.
var response = client.models.generateContent(modelName,
    Content.fromParts(
        Part.fromBytes(
            Files.readAllBytes(Path.of("oil.png")), "image/png"),
        Part.fromText("""
            Simplify this painting to focus on key elements, turn
            this oil painting into a black and white ink noir comic
            drawing, make the weather rainy and change the time of
            the day to be at night.
            """)
    ),
    GenerateContentConfig.builder()
        .responseModalities("TEXT", "IMAGE")
        .build());

Here, I pass the oil.png image that we created in the previous run, and add the instructions to change that oil painting into a noir comic style, on a rainy night (the modelName variable holds the same model name as in the first call). You should also update the name of the file in which the image is saved (I changed it to noir.png in my code). Let’s see if it looks like a noir comic:

Spot on!

Combining several images

Another area where Nano Banana is extremely good is combining several images together. This capability is very important, for example, for product marketers who want to put a bottle of perfume in a different decor, or to let a person virtually try on some new clothes.

Let’s say I want to put this person, in the decor, and make her wear the red dress:

Let’s tweak the generateContent() call again, passing all three images, and the editing instructions:

var response = client.models.generateContent(modelName,
    Content.fromParts(
        Part.fromBytes(Files.readAllBytes(Path.of("decor.jpg")), "image/jpeg"),
        Part.fromBytes(Files.readAllBytes(Path.of("taylor.jpg")), "image/jpeg"),
        Part.fromBytes(Files.readAllBytes(Path.of("red-dress.png")), "image/png"),
        Part.fromText("""
            Add this person to the exterior decor,
            and make her wear the red dress.
            """)
    ),
    GenerateContentConfig.builder()
        .responseModalities("TEXT", "IMAGE")
        .build());

The result speaks for itself:

Next time, maybe I’ll use better lighting!

Conclusion and going further

I hope you enjoyed this quick tutorial on how to generate and edit images with Nano Banana in Java! It’s a very fun model to use, so don’t hesitate to unleash your creativity. And again, you don’t need to be a 🐍 Python developer to do that! You can do everything in Java.

In the meantime, have fun generating cool images with Nano Banana! In particular from Java! And don’t hesitate to share them with me on social media (links at the bottom of this blog), I’m curious to see your creations!

https://glaforge.dev/posts/2025/09/09/calling-nano-banana-from-java/
In-browser semantic search with EmbeddingGemma

A few days ago, Google DeepMind released a new embedding model based on the Gemma open weight model: EmbeddingGemma. With 308 million parameters, such a model is tiny enough to be able to run on edge devices like your phone, tablet, or your computer.

Embedding models are the cornerstone of Retrieval Augmented Generation systems (RAG), and what generally powers semantic search solutions. Being able to run an embedding model locally means you don’t need to rely on a server (no need to send your data over the internet): this is great for privacy. And of course, cost is reduced as well, because you don’t need to pay for a remote / hosted embedding model.

In this article, I’ll walk you through how I built a simple semantic search application. This web app allows users to add a collection of documents, type a query, and instantly get a ranked list of the most relevant documents based on their semantic similarity.

I’ll show you how I brought this to life using the following stack:

  • The embedding model: Google’s new, lightweight EmbeddingGemma model.
  • The inference engine: 🤗 HuggingFace’s Transformers.js library, which runs the model directly in the browser.
  • The UI: A simple and clean interface built with Vite, React, and Tailwind CSS.
  • The deployment: A fully automated CI/CD pipeline using GitHub Actions to deploy the static site to 🔥 Firebase Hosting.

Ready to see how it’s done? Let’s dive in.

Tip

For those who are in a hurry, feel free to check out the live demo or browse the source code on GitHub:

For the demo, 1️⃣ first click to load the weights of the model, 2️⃣ then add a few documents in the database, 3️⃣ finally you can ask a question, 4️⃣ and find the most relevant documents.

Why run AI in the browser?

Running an AI model (here an embedding model) directly on the client-side might seem unconventional considering the best models are usually too big to run on edge devices, but it offers a compelling set of advantages, especially for applications like this one:

  • Privacy: Since all the data processing and embedding calculations happen on the user’s device, no sensitive information ever leaves the browser. The documents and queries are never sent over the network, making it a perfect solution for applications that handle personal or confidential text.
  • Zero added server costs: The “backend” is the user’s browser. No need for expensive GPU-powered servers to run the AI model. The application itself is just a set of static files, which can be hosted for free on services like 🔥 Firebase Hosting or GitHub Pages. Of course, the rest of your application may need servers, but at least this part isn’t tied to a server.
  • Low latency: With the model running locally, there’s no network round-trip to a server. Once the model weights are loaded in memory, search queries are processed instantly, providing a snappy and responsive user experience. Well, at least as long as you don’t have many millions of documents to search through, as it’ll scale linearly without a proper vector database.
  • Offline-first capability: After the initial load, the entire application and the AI model can be cached by the browser (and the data be stored in the browser’s database or local storage), allowing it to function perfectly even without an internet connection.
The core components: a model and a library

At the heart of my application are two key pieces of technology that make in-browser semantic search possible:

The model: EmbeddingGemma

The “brain” of the search is EmbeddingGemma, the new, state-of-the-art (SOTA) text embedding model from Google. Unlike massive language models designed for generating text, embedding models are specialized for a different task: converting a piece of text into a numerical vector (a list of numbers). This vector represents the text’s semantic meaning. The closer two vectors are to each other in mathematical space, the more similar their meanings are.
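To illustrate what “closer” can mean, here is a sketch (in Java, to match the other examples on this blog) that compares vectors with cosine similarity and ranks documents by decreasing score. Cosine similarity is one common choice and an assumption here. The `Similarity` class and its method names are hypothetical, and the application itself runs in TypeScript.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.IntStream;

class Similarity {
    // Cosine of the angle between two vectors: 1 means same direction,
    // 0 means orthogonal. The result is NaN if either vector is all zeros.
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns document indexes ordered from most to least similar to the query.
    // This is a linear scan: every document is compared with the query.
    static List<Integer> rank(float[] query, List<float[]> documents) {
        return IntStream.range(0, documents.size())
            .boxed()
            .sorted(Comparator
                .comparingDouble((Integer i) -> cosine(query, documents.get(i)))
                .reversed())
            .toList();
    }
}
```

The linear scan is why search time grows with the number of documents when there is no vector database behind it.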

EmbeddingGemma is the perfect choice for this project for several reasons:

  • High performance, small size: It is the highest-performing model of its size (under 500M parameters) on the multilingual Massive Text Embedding Benchmark (MTEB). Built on the Gemma 3 architecture, it’s designed for on-device applications where resources are limited.
  • On-device efficiency: With quantization, the model’s memory footprint can be less than 200MB, making it ideal for running in a web browser without overwhelming the user’s device.
  • Matryoshka Representation Learning (MRL): While the model produces a high-quality, full-size embedding with 768 dimensions, MRL allows us to truncate that vector to a smaller size (512, 256, or 128 dimensions) with a minimal loss in accuracy. This gives us a good trade-off between performance and computational cost. In my application, I use the first 128 dimensions for the vector visualizations, which is a perfect example of MRL in action.
  • Multilingual support: The model was trained on data from over 100 languages. This is quite rare for an embedding model of that size to be good across a wide variety of spoken languages.
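
As a sketch of what truncating an MRL embedding can look like: keep the first dimensions, then re-normalize so that similarity scores stay comparable. The re-normalization step is a common practice that I’m assuming here, and the `Matryoshka` class is a hypothetical helper written in Java, not code from the application.

```java
import java.util.Arrays;

class Matryoshka {
    // Keeps the first `dimensions` values and rescales them to unit length
    static float[] truncate(float[] embedding, int dimensions) {
        if (dimensions <= 0 || dimensions > embedding.length) {
            throw new IllegalArgumentException("Invalid target size: " + dimensions);
        }
        float[] head = Arrays.copyOf(embedding, dimensions);
        double squaredSum = 0;
        for (float value : head) {
            squaredSum += (double) value * value;
        }
        double norm = Math.sqrt(squaredSum);
        if (norm == 0) {
            return head; // an all-zero vector can't be normalized
        }
        for (int i = 0; i < head.length; i++) {
            head[i] = (float) (head[i] / norm);
        }
        return head;
    }
}
```

For a 768-dimension embedding, `Matryoshka.truncate(embedding, 128)` would return the 128-dimension variant mentioned above.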
The library: Transformers.js

The “engine” that runs the model is Transformers.js from 🤗 HuggingFace. This cool JavaScript library is designed to run a wide variety of popular AI models directly in the browser. It handles all the complex, low-level work of loading the model, managing the cache, and executing the computations efficiently using the browser’s and device’s capabilities.

Transformers.js makes it simple to run a model like EmbeddingGemma on the client-side. As shown in 🤗 HuggingFace’s blog post, it only takes a few lines of code to get a model up and running, as we’ll see in the next section.

How the code works

While the UI is a standard React application built with Vite and styled with Tailwind CSS, the most interesting part is the embeddingService.ts file, which acts as a wrapper around the Transformers.js library.

I actually vibe-coded the whole application thanks to Google AI Studio and Gemini CLI. As I’m not an expert in React or TypeScript, it was easier to guide Gemini into creating the UI I wanted, and I also fed Gemini the code from the 🤗 HuggingFace article to get started with the inference.

Let’s look at a simplified version of the core logic.

Initializing the model

First, I needed to create a singleton instance of the service. This ensures I only ever initialize one copy of the model. The getInstance method handles this, and the init method does the heavy lifting by calling AutoTokenizer.from_pretrained() and AutoModel.from_pretrained() to load the text tokenizer and the model, respectively.

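import {
  AutoModel,
  AutoTokenizer,
  matmul, // used later to compute the similarity scores
  PreTrainedModel,
  PreTrainedTokenizer,
} from "@huggingface/transformers";

class EmbeddingService {
  // from_pretrained() resolves to tokenizer and model instances
  private tokenizer: PreTrainedTokenizer | null = null;
  private model: PreTrainedModel | null = null;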

  // ... singleton logic ...

  private async init() {
    // Load the tokenizer and model from the /model/ directory
    this.tokenizer = await AutoTokenizer.from_pretrained("/model/");
    this.model = await AutoModel.from_pretrained("/model/", {
      dtype: "q4", // Use a quantized version for efficiency
    });
  }
}
Note

The model can also be loaded from 🤗 HuggingFace’s Hub, but I wanted the model to be local as well, for a faster loading experience and for a full local-first approach.
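
The singleton logic elided in the snippet above is not shown in this post, so here is a minimal sketch of one common way to write it, not necessarily what the application does. The class name, the LoadedModel type, and the fake loading step are all hypothetical stand-ins; the idea is simply to cache the initialization promise so that concurrent callers share a single load.

```typescript
// Hypothetical stand-in for whatever the real loading step returns.
type LoadedModel = { name: string };

class LazyEmbeddingService {
  // Holds the in-flight (or completed) initialization, shared by all callers.
  private static instance: Promise<LazyEmbeddingService> | null = null;
  // Counts how many times the expensive load actually ran.
  static loadCount = 0;

  readonly model: LoadedModel;

  private constructor(model: LoadedModel) {
    this.model = model;
  }

  static getInstance(): Promise<LazyEmbeddingService> {
    if (LazyEmbeddingService.instance === null) {
      LazyEmbeddingService.instance = LazyEmbeddingService.init();
    }
    return LazyEmbeddingService.instance;
  }

  private static async init(): Promise<LazyEmbeddingService> {
    LazyEmbeddingService.loadCount += 1;
    // Assumption: the real service would await from_pretrained() here.
    const model: LoadedModel = await Promise.resolve({ name: "fake-model" });
    return new LazyEmbeddingService(model);
  }
}

// Two callers asking at the same time get the same promise: one load only.
const first = LazyEmbeddingService.getInstance();
const second = LazyEmbeddingService.getInstance();
console.log(first === second, LazyEmbeddingService.loadCount); // true 1
```

Caching the promise rather than the resolved instance avoids a double load when two components request the service before the first load has finished.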

Generating embeddings and calculating similarity

Once the model is ready, it can be fed the query and documents. The model expects specific prefixes for queries and documents to perform best (task: search result | query: and title: none | text:), so I made sure to add those first.

The core steps are:

  1. Tokenize: Convert the text (query and documents) into tokens that the model can understand.
  2. Embed: Pass the tokens to the model to get the sentence embeddings.
  3. Calculate similarity: Use a matrix multiplication (matmul) of the embeddings with their transpose to get a similarity score between the query and every document.
// ... inside the EmbeddingService class ...

async embed(query: string, documents: string[]) {
  // ...
  // Add the required prefixes for the model
  const prefixedQuery = "task: search result | query: " + query;
  const prefixedDocs = documents.map(doc => "title: none | text: " + doc);

  // 1. Tokenize the inputs
  const inputs = await this.tokenizer([prefixedQuery, ...prefixedDocs], {
    padding: true,
    truncation: true,
  });

  // 2. Get the sentence embeddings
  const { sentence_embedding } = await this.model(inputs);

  // 3. Calculate the similarity scores
  const scores = await matmul(sentence_embedding, sentence_embedding.transpose(1, 0));

  // The first row of the scores matrix contains the similarity
  // of the query to all other documents.
  const similarities = (scores.tolist() as number[][])[0].slice(1);

  // ... logic to rank documents based on scores ...
}

And that’s the core of it! With just these two methods (init() and embed()), I had a fully functional semantic search engine running in the browser.
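
The ranking logic is elided in the snippet above, so here is a small, library-free sketch of the idea using plain number arrays instead of Transformers.js tensors. The function names and the toy vectors are made up for illustration. It normalizes the vectors explicitly, so the dot product is a cosine similarity whether or not the embeddings are already unit-length.

```typescript
// Dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Scale a vector to unit length so dot products become cosine similarities.
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(dot(v, v));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

interface RankedDocument {
  index: number; // position in the original documents array
  score: number; // cosine similarity with the query
}

// Rank documents by decreasing similarity to the query embedding.
function rankBySimilarity(query: number[], docs: number[][]): RankedDocument[] {
  const q = normalize(query);
  return docs
    .map((doc, index) => ({ index, score: dot(q, normalize(doc)) }))
    .sort((a, b) => b.score - a.score);
}

// Toy 3-dimensional "embeddings" (made up for illustration).
const ranked = rankBySimilarity(
  [1, 0, 0],
  [
    [0, 1, 0], // orthogonal to the query
    [2, 0, 0], // same direction as the query
    [1, 1, 0], // 45 degrees away from the query
  ],
);
console.log(ranked.map((r) => r.index)); // [ 1, 2, 0 ]
```

Computing every query/document dot product this way gives the same numbers as the first row of the matmul result, just without the document-to-document scores that the full matrix also contains.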

Visualizing the embeddings

To make the concept of semantic similarity more tangible, I added a simple visualization for each document’s embedding vector and the search query. As soon as you type a few characters in the search query input field, or when you add a new document, you’ll see a colored representation of its vector.

Each cell in the visualization represents one of the numbers in the embedding vector, with its color intensity indicating the value. When you compare the visualizations of the query and a relevant document, you can often spot visual similarities in their patterns, offering an intuitive glimpse into how the model “sees” the relationships between texts.

This is where the Matryoshka Representation Learning (MRL) feature of EmbeddingGemma truly shines. The full embedding vector has 768 dimensions, which would be too much to display effectively. Thanks to MRL, I can use just the first 128 dimensions of the vector for this visualization with a minimal loss of semantic information. This provides a compact and meaningful visual fingerprint of the text’s meaning.
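
To make the truncation concrete, here is a minimal sketch of how an MRL-style embedding can be shortened and turned into display values. The function names and the intensity scaling are assumptions for illustration, not the application’s actual code. Re-normalizing after truncation is the usual practice when the shortened vectors are also used for similarity, since cutting dimensions changes the vector’s length.

```typescript
// Keep the first `dims` values of an embedding and rescale to unit length.
function truncateEmbedding(embedding: number[], dims: number): number[] {
  const head = embedding.slice(0, dims);
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? head : head.map((x) => x / norm);
}

// Map each value to a 0..1 intensity for display. Hypothetical scaling:
// relative to the largest absolute value in the vector.
function toIntensities(vector: number[]): number[] {
  const max = Math.max(...vector.map((x) => Math.abs(x)));
  return max === 0 ? vector.map(() => 0) : vector.map((x) => Math.abs(x) / max);
}

// A made-up 768-dimensional vector standing in for a real embedding.
const full = Array.from({ length: 768 }, (_, i) => Math.sin(i + 1));
const compact = truncateEmbedding(full, 128);
const cells = toIntensities(compact);
console.log(compact.length, cells.length); // 128 128
```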

The deployment pipeline: CI/CD with a twist

A key challenge with this project was handling the model files. Although I’m using quantized versions of the weights, they are still quite heavy, far too large to commit to a Git repository. However, the application needs these files to be present in the public/model directory to function.

I solved this with a clever CI/CD pipeline using GitHub Actions. Instead of storing the model in my repository, I download it on-the-fly during the deployment process. Since the model files are then hosted alongside the application, they load faster at startup than they would from 🤗 HuggingFace’s Hub.

Here’s the relevant snippet from my .github/workflows/firebase-hosting-merge.yml file:

jobs:
  build_and_deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download Model
        run: |
          sudo apt-get install git-lfs
          git lfs install
          git clone https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX public/model
      - run: npm ci && npm run build
      - uses: FirebaseExtended/action-hosting-deploy@v0
        with:
          repoToken: ${{ secrets.GITHUB_TOKEN }}
          firebaseServiceAccount: ${{ secrets.FIREBASE_SERVICE_ACCOUNT_EMBEDDING_GEMMA }}
          # ... other options

This workflow does the following every time I push to the main branch:

  1. Checks out my code.
  2. Downloads the model: It installs git-lfs and clones the model files directly from the 🤗 HuggingFace Hub into the public/model directory. Note that the downloaded model weights could certainly be cached to make the CI/CD pipeline snappier.
  3. Builds the app: It runs npm ci && npm run build, which creates the static dist folder. Because the model is now in the public directory, Vite automatically includes it in the build output.
  4. Deploys: It sends the final dist folder, now containing the model, up to 🔥 Firebase Hosting.

This approach gives me the best of both worlds: a lightweight Git repository and a fully functional, self-contained application deployed to users.

Conclusion

Google AI Studio and the Gemini CLI allowed me to easily build a demonstration app for running EmbeddingGemma on the client-side via Transformers.js.

This application requires no server (apart from the static assets hosting), is privacy-focused (your data is never sent over the internet), and is cost-effective, as its semantic search engine runs entirely in the browser.

I hope this project demonstrates the growing potential of client-side AI and small models (both embedding and language models).

I encourage you to explore the project and see it in action for yourself:

https://glaforge.dev/posts/2025/09/08/in-browser-semantic-search-with-embeddinggemma/
Vibe-coding a Chrome extension with Gemini CLI to summarize articles

I often find myself staring at a wall of text online. It could be a lengthy technical article, a detailed news report, or a deep-dive blog post. My first thought is often: “Is this worth the time to read in full?” On top of that, for my podcast, Les Cast Codeurs, I’m constantly gathering links and need to create quick shownotes, which is essentially… a summary.

My first attempt to solve this was a custom Gemini Gem I created: a personalized chatbot that could summarize links. It worked, but I often ran into a wall: it couldn’t access paywalled content, pages that required a login, or dynamically generated sites that I was already viewing in my browser. The solution was clear: I needed to bring the summarization to the content, not the other way around. The idea for a Chrome extension was born.

This got me thinking: what if I had a browser extension that could give me the gist of any page with a single click? I had the idea and the need, but one small problem: I’d never built a Chrome extension before.

This project became an experiment in the trendy “vibe-coding” approach, and my partner in crime was the Gemini CLI. I didn’t start by reading pages of documentation. Instead, I had a clear vision for what I wanted and decided to build it interactively.

What made the process so unique was that the Gemini CLI was incredibly proactive. From the very beginning, when I just created an empty directory for the project, it immediately understood my intent. Before I even had a chance to ask, it suggested that I probably wanted to create a Chrome extension for summarization and laid out a full plan, just by inferring my intent from the name of the directory I had created!

Throughout the development, it took liberties, suggesting better ways to handle the API key setup and proposing new features. It felt less like giving instructions and more like a productive collaboration between a human and an AI. The result is the Gemini summarizer Chrome extension, and it was built entirely on vibes.

Of course, it’s not always a bed of roses, like when Gemini CLI reverted some of my manual changes (which led me to warn it that I had made manual changes, so that it could update its internal context), or when it somehow got stuck in a loop and stopped obeying my commands, for a reason I couldn’t figure out. Restarting it made it work: maybe, like us humans, it needs a break once in a while? I’m probably anthropomorphising too much, but overall, that was a productive experience!

Under the hood: Powered by Gemini

The core of this extension is, of course, the Google Gemini API. It’s responsible for the powerful summarization capabilities. To use the extension, you’ll need to provide your own Gemini API key. Don’t worry, it’s stored securely and locally in your browser’s storage and is only used to communicate with the official Google AI endpoints.

Features that evolved from the vibe

What started as a simple idea quickly grew as we built it. Here’s what the extension can do:

  • Summarize web pages: Click the button to get a summary of the main article on a page.
  • Summarize selected text only: If you highlight a specific piece of text, the extension is smart enough to summarize only your selection.
  • Real-time streaming: The summary doesn’t just appear; it streams in word-by-word, making the experience feel fast and responsive.
  • Go shorter: Got a summary but want it even more concise? The “Shrink” button re-summarizes the summary to be as brief as possible.
  • Speak my language: A simple toggle lets you get your summary in either English or French (I’m writing shownotes for the podcast in French).
  • Copy with formatting: The copy button copies the summary as rich text, preserving all the formatting like bolding and bullet points. That way, it’s easy to paste into Google Docs, a Slack channel, or wherever you want.
Potential future evolutions

This project is fully functional (at least for my personal need) but there are always more vibes to explore. Here are a few ideas for where this could go next:

  1. Publish to the Chrome Web Store: To make installation as easy as a single click for any user.
  2. Expanded language support: Add a dropdown or input field to allow summarization in any language Gemini supports.
  3. Customizable prompts: Allow users to tweak the summarization prompts directly in the options page to tailor the style of the summaries to their exact needs. And maybe even support multiple prompts at the same time, that you can choose from another dropdown.
Get the code and vibe code your own variant

This entire project is open-source and available on GitHub. I encourage you to check it out, install it for yourself, and see how it feels.

Info

➡️ View the project on GitHub

The installation is straightforward for developers, with instructions in the README.md file — until I decide it’s worth sharing it more broadly via the Chrome extension store!

More importantly, I encourage you to take this as inspiration. If you have an idea, even if you don’t know the technology, try vibe-coding it. I now find myself starting projects I wouldn’t have dared to before because I wasn’t familiar with the language, framework, or patterns. I’d think of a cool app idea but wouldn’t even start, knowing it would take too many days or evenings just to get something basic going. With vibe-coding, I can experiment and get a working prototype in a fraction of the time.

In the pure vibe-coding approach as coined by Andrej Karpathy, you wouldn’t even look at the generated code at all. But I’m a developer, I can’t help it, and will always look at how my AI coding agent does its magic.

I didn’t know how Chrome extensions were defined and worked, but now I’m more knowledgeable because I had the chance to witness how such an extension is built, and to see the project structure, the key files needed (manifest.json), the use of service workers, how to store a secret like an API key in Chrome’s storage, etc. It’s a great learning opportunity when you actually look at the generated vibe-coded code!

That’s the real power of this new way of building software: a great model like Gemini paired with a great tool like the Gemini CLI. It’s a game-changer. Feel free to fork this project, extend it, or build your own extension from scratch. Happy vibe-coding!

https://glaforge.dev/posts/2025/08/06/vibe-coding-a-chrome-extension-with-gemini-cli-to-summarize-articles/
Visualizing ADK multiagent systems

Let me share an interesting experiment I worked on to visualize your AI agent structure, more specifically, Agent Development Kit (ADK) multiagents.

The more complex your agents become, as you split tasks and spin off more specialized and focused sub-agents, the harder it is to see what your system is really made of, and how the interactions happen between the various components.

This is also something I experienced when I was covering Google Cloud Workflows: the more steps, loops, indirections, and conditions a workflow had, the trickier it was to understand and debug. And sometimes, as the saying goes, a picture is worth a thousand words. So when I was working on my recent series of articles on ADK agentic workflows (drawing diagrams by hand), this idea of experimenting with an ADK agent visualizer came up immediately.

Let me introduce you to my ADK Agent Code Visualizer, a tool designed to give you a clear, visual representation of your agent-based applications built with the Agent Development Kit (ADK).

Disclaimer

This is an experiment, not really something I intend to develop much further. It works fine for single-file systems that you can easily copy and paste into the visualizer. But it won’t understand your project structure or agents spanning multiple files.

➡️ That said, for those who want to play with it already without further ado, please head over to this online version of my ADK code visualizer.

📂 I’ve also pushed the code of this project to GitHub if you want to take a look or further extend it.

From vibe-coding to cloud deployment

The journey of this project is a testament to the power of modern AI-assisted development. It all started in Google AI Studio, where I “vibe-coded” the initial concept. I provided a high-level description of what I wanted to build, and AI Studio generated the foundational React and Node.js code. From there, I moved to my local environment and used Gemini CLI for the iterative development process—adding features, refining the logic, and fixing bugs. The entire development experience was a seamless collaboration between human and AI. Yeah!

Finally, the application was deployed on Google Cloud, more specifically as a container on Cloud Run, making it accessible for everyone to use.

What it does

The ADK Agent Code Visualizer takes a single source file (Python or Java) containing an ADK agent definition and generates an interactive graph (using ReactFlow). This visualization helps you understand the architecture of your multi-agent system at a glance.

Here’s what the visualization highlights:

  • Agent relationships: See how your agents are connected. Sub-agents are clearly marked with arrows indicating the flow of control.
  • Tool usage: If an agent uses one or more tools, it’s indicated on the graph, giving you insight into the agent’s capabilities.
  • Callbacks: The visualization also flags when callbacks are defined for an agent.
  • Orchestration patterns: The graph shows whether agents are configured to run sequentially, in parallel, or in a loop.

The visualization is composed of four main types of nodes (with different color codes), each representing a different kind of agent:

  • LlmAgent: A plain agent node.
  • SequentialAgent: For agents that execute in a sequence.
  • ParallelAgent: For agents that run concurrently.
  • LoopAgent: For agents that repeat actions in a loop.

Whenever there are sub-agents or linked agents, you’ll see arrows pointing from the parent or preceding agent to the next one in the flow, making the orchestration logic easy to follow.

Dark mode!

And for all the nerds who prefer dark mode, you can easily switch between light and dark modes!

Limitations

As I said in the introduction, it’s important to note a key limitation of the current version: the tool analyzes single source files only. This is more of a proof-of-concept than a finalized product. This means it does not parse an entire multi-file project to resolve dependencies or definitions across different files (externalized prompts, code split across different files or directories, etc.).

Despite this, the visualizer works pretty well for the kind of self-contained examples you’ll find in the official ADK documentation and the associated adk-samples GitHub repository. It’s a perfect companion for learning ADK or for quickly mapping out the structure of your single-file agent orchestrations.

Conclusion

The ADK Agent Code Visualizer is a simple yet powerful tool for anyone working with the Agent Development Kit. It streamlines the process of understanding complex agent interactions and serves as a great example of how modern AI tools like Google AI Studio and Gemini CLI can accelerate the development lifecycle from idea to deployment.

Feel free to check it out and visualize your own agent creations! And tell me what you think!

➡️ You can try it online and check out the source code.

https://glaforge.dev/posts/2025/08/01/visualizing-adk-multiagent-systems/
Mastering agentic workflows with ADK: the recap

Over the past few articles, we’ve taken a deep dive into the powerful agentic workflow orchestration capabilities of the Agent Development Kit (ADK) for Java. We’ve seen how to build robust, specialized AI agents by moving beyond single, monolithic agents. We’ve explored how to structure our agents with sub-agents, sequential, parallel, and loop workflows.

In this final post, let’s bring it all together. We’ll summarize each pattern, clarify when to use one over the other, and show how their true power is unlocked when you start combining them.

For the impatient

You can also check out this video, generated by my colleague Romin Irani thanks to the new Video Overview feature of NotebookLM:

Then dive into the articles to get the full details!

A quick recap: your agentic workflow cheat sheet

Choosing the right workflow is about matching the structure of your problem to the structure of your agent system. Here’s a quick guide:

| Workflow | Key ADK Class | Use Case | Pros | Cons |
| Sub-Agents | LlmAgent | User-driven, flexible tasks where the next step is not always known. | High flexibility, conversational, great for user-facing bots. | Less predictable, relies on LLM reasoning for flow control. |
| Sequential | SequentialAgent | Fixed, multi-step processes where order is critical. | Predictable, reliable, easy to debug, guarantees order. | Inflexible, can be slower if tasks could be parallelized. |
| Parallel | ParallelAgent | Gathering data from multiple sources or running independent tasks. | Highly efficient, significantly reduces latency for I/O-bound tasks. | All tasks run; no short-circuiting. Less suitable for tasks with dependencies. |
| Loop | LoopAgent | Iterative refinement, self-correction, or processes that repeat until a condition is met. | Powerful for complex problem-solving, enables agents to improve their own work. | Can lead to infinite loops if not designed carefully (always use maxIterations!). |

When to use each workflow

Let’s distill the decision-making process down to its essence.

Choose sub-agents for flexibility and conversation

The LlmAgent with a team of sub-agents is your go-to for building conversational assistants. This “divide and conquer” strategy gives an orchestrator LLM the autonomy and agency to choose the right specialist (sub-agent) for the job based on the user’s request.

Use it when:
  • The user is in control, and the conversation can go in many different directions.
  • You need to delegate to a wide range of tools or specialists.
  • The exact sequence of operations is not — and should not be — predetermined.
Choose SequentialAgent for order and predictability

When you have a process that must follow a specific order, the SequentialAgent is the perfect tool. It creates a fixed pipeline where the output of one step becomes the input for the next. This provides structure and guarantees a consistent outcome.

Use it when:
  • You are automating a business process, like “Step A, then Step B, then Step C.”
  • The outcome of one agent is a necessary prerequisite for the next.
  • You need a deterministic and easily debuggable workflow.
Choose ParallelAgent for efficiency and speed

If your workflow involves multiple independent tasks—like fetching data from different APIs or performing separate analyses—running them sequentially is a waste of time. The ParallelAgent executes these tasks concurrently, dramatically speeding up the total execution time.

Use it when:
  • You have multiple I/O-bound tasks (e.g., web searches, database queries).
  • The tasks do not depend on each other’s results.
  • Minimizing latency is a primary concern.
Choose LoopAgent for iteration and refinement

Some problems can’t be solved in a single shot. For tasks that require trial, feedback, and correction — like generating code and then having a reviewer agent critique it — the LoopAgent is indispensable. It automates the cycle of “do, check, refine.”

Use it when:
  • An agent needs to improve its work based on feedback (from another agent or a tool).
  • You are building a system that needs to work towards a goal state through iteration.
  • The task involves complex problem-solving that benefits from a trial-and-error approach.

The power of composition: better together

While each workflow is powerful on its own, the real magic happens when you start composing them. This is the core philosophy of the ADK: building sophisticated systems from simple, modular blocks.

In our series, we saw a prime example of this with the company-detective agent. It was a SequentialAgent that orchestrated a two-step process:

  • A ParallelAgent that ran three different research agents concurrently (company-profiler, news-finder, financial-analyst).
  • A final LlmAgent (report-compiler) that took the aggregated results from the parallel step and synthesized them into a final report.

This hybrid approach gave us the best of both worlds: the raw speed of parallel execution for data gathering and the structured order of a sequential pipeline to ensure the final report was only compiled after all the research was complete.
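
To illustrate the shape of that composition outside of any framework, here is a conceptual sketch in which plain promises stand in for agents. This is not the ADK API: the helper names and the fake research tasks are hypothetical. It only shows the control flow, where three tasks are started together and the report step waits for all of them.

```typescript
// Conceptual sketch only: plain promises stand in for agents (not the ADK API).
type Task = () => Promise<string>;

// Records which research tasks have started, in start order.
const started: string[] = [];

// Hypothetical research task that notes when it starts.
function researcher(name: string): Task {
  return async () => {
    started.push(name);
    return `${name}: findings`;
  };
}

// Parallel step: start every task at once and wait for all of them.
function runParallel(tasks: Task[]): Promise<string[]> {
  return Promise.all(tasks.map((task) => task()));
}

// Sequential step: the report is only compiled once all research is in.
async function companyDetective(): Promise<string> {
  const research = await runParallel([
    researcher("company-profiler"),
    researcher("news-finder"),
    researcher("financial-analyst"),
  ]);
  return `Report based on ${research.length} sources`;
}

const report = companyDetective();
// All three researchers have started before any result has been awaited.
console.log(started.length); // 3
report.then((text) => console.log(text)); // Report based on 3 sources
```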

Similarly, we built a code-refiner-assistant by embedding a LoopAgent inside a SequentialAgent. This allowed us to first iteratively generate and review code until it was perfect, and then proceed to the final step of presenting it to the user.

Conclusion: determinism and control for better AI

By breaking down complex problems and assigning them to specialized agentic workflows, we gain more control and produce more reliable outcomes. Instead of relying on a single, massive LLM to figure everything out, we guide the process. We trade a little bit of the LLM’s raw agency for a massive gain in predictability, maintainability, and overall quality.

The ADK for Java gives you the toolkit to be a true AI architect. By mastering and combining these fundamental patterns (sub-agents, sequential, parallel, and loop) you can move beyond simple bots and start building genuinely capable and robust AI systems that solve real-world problems in a structured and deterministic way.

https://glaforge.dev/posts/2025/07/29/mastering-agentic-workflows-with-adk-the-recap/
Mastering agentic workflows with ADK: Loop agents

Welcome to the final installment of our series on mastering agentic workflows with the ADK for Java. We’ve covered a lot of ground:

Now, we’ll explore a pattern that enables agents to mimic a fundamental human problem-solving technique: iteration. For tasks that require refinement, trial-and-error, and self-correction, the ADK provides a LoopAgent.

Building agents that refine their work

Think about how you write code. You write a first draft, you review it (or a colleague does), you find issues, and you refine it. You repeat this cycle until the code is correct. The LoopAgent is designed to automate exactly this kind of iterative process.

A loop workflow is perfect for:

  • Complex problem-solving: Where the agent needs to try an approach, evaluate the outcome, and then try again.
  • Self-correction: Building systems that can review and improve their own output.
  • Reaching a goal state: Continuing a process until a specific condition is met.
A practical example: a code-refiner-assistant

Let’s look at our example of the day. I’ve built a code-refiner-assistant that iteratively generates and reviews a Python function until it meets the required standards.

Let’s detail its components, but first, a visual diagram will help understand the flow:

First, we’ll need a code-generator agent:

var codeGenerator = LlmAgent.builder()
    .name("code-generator")
    .description(
        "Writes and refines code based on a request and feedback.")
    .instruction("""
        Your role is to write a Python function
        based on the user's request.
        On the first turn, write the initial version of the code.
        On subsequent turns, you will receive feedback on your code.
        Your task is to refine the code based on this feedback.

        Previous feedback (if any):
        {feedback?}
        """)
    .model("gemini-2.0-flash")
    .outputKey("generated_code")
    .build();

Notice two things:

  • The result of this agent will be stored in the generated_code output key (in the agent’s state).
  • As this agent will be part of the refinement loop, it reads the feedback it may have received from a previous iteration of the loop. The ? character in {feedback?} signals that the feedback state variable may be absent, which is the case on the first iteration: instead of throwing an exception, the placeholder is simply replaced by an empty string.

Next, we need a code-reviewer agent to judge whether the code is complete or needs work:

var codeReviewer = LlmAgent.builder()
    .name("code-reviewer")
    .description(
        "Reviews code and decides if it's complete or needs work.")
    .instruction("""
        Your role is to act as a senior code reviewer.
        Analyze the provided Python code for correctness,
        style, and potential bugs.

        Code to review:
        {generated_code}
        """)
    .model("gemini-2.0-flash")
    .outputKey("feedback")
    .build();

It takes as input the generated_code output from the code-generator agent, and stores its result in the feedback state variable, so that the code-generator can update the code if needed.

Now, let’s put the two agents in a loop:

var codeRefinerLoop = LoopAgent.builder()
    .name("code-refiner-loop")
    .description(
        "Iteratively generates and reviews code until it is correct.")
    .subAgents(
        codeGenerator,
        codeReviewer)
    .maxIterations(5) // Safety net!
    .build();

This coder/refiner loop is the first step of a sequential agent, and the last step is the following final-presenter agent that displays the resulting reviewed code:

var finalPresenter = LlmAgent.builder()
    .name("final-presenter")
    .description(
        "Presents the final, accepted code to the user.")
    .instruction("""
        The code has been successfully generated and reviewed.
        Present the final version to the user in a clear format.

        Final Code:
        {generated_code}
        """)
    .model("gemini-2.0-flash")
    .build();

Let’s finalize this setup with the code of the sequential agent, combining the code-refiner-loop and final-presenter agents:

return SequentialAgent.builder()
    .name("code-refiner-assistant")
    .description(
        "Manages the full code generation and refinement process.")
    .subAgents(
        codeRefinerLoop,
        finalPresenter)
    .build();
}

The safety net: maxIterations

Left unchecked, this agent could run endlessly, bouncing between the code generator and the code reviewer. Fortunately, we added the maxIterations(5) call to the LoopAgent definition to avoid this situation: no more than 5 iterations are allowed here.

You should always specify a maximum number of iterations, otherwise your agents could be stuck in an endless loop.

Now, we would like to stop the iteration once a certain condition is met: when we’re happy with the quality of the generated code, when the code doesn’t need to be further improved.

Exiting the loop early: setEscalate(true)

An uncontrolled loop is a dangerous thing. To stop a LoopAgent, there is one fundamental mechanism: calling setEscalate(true) on the context’s event actions. This is the universal “break” statement that tells the ADK runtime to stop the current agent and pass control to its parent.
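
To picture how the “break” and the safety net interact, here is a small plain-Java model of such a loop. The names (LoopSketch, Step, runLoop) are hypothetical and deliberately do not mirror the ADK classes. A step returning true plays the role of an escalation.

```java
import java.util.List;

// Plain-Java model of a loop with a safety net; the names below are
// hypothetical and do not mirror the ADK classes.
class LoopSketch {

    // A step returns true to "escalate", i.e. ask the loop to stop.
    interface Step {
        boolean run(int iteration);
    }

    // Runs all steps in order, at most maxIterations times.
    // Returns the number of iterations that were started.
    static int runLoop(List<Step> steps, int maxIterations) {
        int started = 0;
        for (int i = 1; i <= maxIterations; i++) {
            started = i;
            for (Step step : steps) {
                if (step.run(i)) {
                    return started; // escalation acts like "break"
                }
            }
        }
        return started; // safety net reached
    }

    public static void main(String[] args) {
        Step generator = iteration -> false;          // never asks to stop
        Step reviewer = iteration -> iteration == 3;  // satisfied on the 3rd pass
        System.out.println(runLoop(List.of(generator, reviewer), 5)); // prints 3
        System.out.println(runLoop(List.of(generator), 5));           // prints 5
    }
}
```

The second call shows why the limit matters: without a step that ever escalates, only the iteration cap ends the loop.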

The key to building robust loops is understanding the two main strategies for invoking this critical method.

Strategy #1: in-flight escalation with a FunctionTool

This approach treats exiting the loop as a primary, explicit action the agent must perform.

First, we define a simple Java method in our class to serve as the tool’s implementation. This method does one powerful thing: it calls setEscalate(true).

@Schema(description = """
    Call this function ONLY when the code-reviewer
    agent indicates no further changes are needed,
    signaling the iterative process should end.
    """)
public Map<String, Object> exitLoop(ToolContext toolContext) {
    toolContext.eventActions().setEscalate(true);
    return Map.of(); // The return value can be an empty map
}

Next, we create an instance of FunctionTool and register it with our code-reviewer agent. We also instruct the agent that it MUST call this tool when it’s satisfied.

Let’s have a look at the modified code-reviewer:

var codeReviewer = LlmAgent.builder()
    .name("code-reviewer")
    .description("""
        Reviews code and decides if it's complete
        or needs more work.
        """)
    .instruction("""
        Your role is to act as a senior code reviewer.
        Analyze the provided Python code for correctness,
        style, and potential bugs.

        Code to review:
        {generated_code}

        If the code is perfect and meets the user's request,
        you MUST call the `exitLoop` tool.

        Otherwise, provide constructive feedback
        for the `code-generator` to improve the code.
        """)
    .model("gemini-2.0-flash")
    .outputKey("feedback")
    .tools(FunctionTool.create(this, "exitLoop"))
    .build();

When the agent calls this tool, the escalation happens during the agent’s turn, interrupting the normal flow. This pattern is powerful because the decision to exit is a deliberate, observable action in the agent’s execution trace.

Strategy #2: programmatic escalation via Callbacks

This approach is more programmatic. Instead of a tool, we instruct the code-reviewer to output a simple keyword (like “EXIT”) when it’s done. Then, we use a callback to check for this keyword.

However, there’s an important subtlety: a callback runs outside the agent’s turn, either before it starts or after it completes. Simply calling setEscalate(true) inside a callback is not enough, because the runtime might not “notice” it.

To ensure the escalation is processed, the callback must also do one of two things:

  1. Modify the state (so the state delta is not empty).
  2. Return a non-empty Maybe (e.g., Maybe.just(...)).

This signals to the runtime that something has changed, prompting it to check for the escalation flag. Returning Maybe.empty() in a callback is equivalent to actually doing nothing!

For callbacks, there are two approaches, which mirror the classic do/while and while loops, depending on whether you want to check the exit condition after or before an agent runs.
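
As a reminder of the difference between the two loop shapes, here is a plain-Java analogy. It is only a sketch: generate and review are hypothetical stand-ins for the two LLM agents, and the reviewer is hard-coded to accept the third draft.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java analogy only: "generate" and "review" stand in for the two
// LLM agents, and the reviewer is hard-coded to accept the third draft.
class LoopShapesSketch {

    static String generate(int draft) {
        return "draft-" + draft;
    }

    static String review(String code) {
        return code.equals("draft-3") ? "EXIT" : "needs work";
    }

    // do/while: check the exit condition after the body has run.
    static List<String> doWhileStyle(int maxIterations) {
        List<String> trace = new ArrayList<>();
        int i = 0;
        String feedback;
        do {
            i++;
            String code = generate(i);
            trace.add(code);
            feedback = review(code);
        } while (!feedback.equals("EXIT") && i < maxIterations);
        return trace;
    }

    // while: check the previous feedback before starting a new iteration.
    static List<String> whileStyle(int maxIterations) {
        List<String> trace = new ArrayList<>();
        int i = 0;
        String feedback = ""; // no feedback before the first iteration
        while (!feedback.equals("EXIT") && i < maxIterations) {
            i++;
            String code = generate(i);
            trace.add(code);
            feedback = review(code);
        }
        return trace;
    }
}
```

In both shapes the loop stops after the third draft. The only difference is where the check sits: at the end of the iteration that produced “EXIT”, or at the start of the one that would have followed.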

The do/while approach: afterAgentCallback

This pattern checks the exit condition after the main work of the iteration is done. We usually place the afterAgentCallback on the last agent in the loop (code-reviewer), but you could also decide to exit the loop midway, depending on your use case.

It’s important to also tweak the instructions to force the agent to return a particular magic word, so that the callback can check the result and decide whether to exit.

var codeReviewer = LlmAgent.builder()
    .name("code-reviewer")
    .description("""
        Reviews code and decides if it's complete or needs more work.
        """)
    .instruction("""
        Your role is to act as a senior code reviewer.
        Analyze the provided Python code for correctness,
        style, and potential bugs.

        Code to review:
        {generated_code}

        If the code is perfect and meets the user's request,
        you MUST reply with just one single word: EXIT
        Don't add any introduction or commentary.
        Just reply with EXIT.

        Otherwise, provide constructive feedback
        for the `code-generator` to improve the code.
        """
    )
    .model("gemini-2.0-flash")
    .outputKey("feedback")
    .afterAgentCallback(callbackContext -> {
        var feedback = callbackContext.state()
            .getOrDefault("feedback", "").toString();

        if (feedback.trim().equalsIgnoreCase("EXIT")) {
            callbackContext.eventActions().setEscalate(true);
            callbackContext.state().put("review", "OK");
            return Maybe.just(
                Content.fromParts(Part.fromText("EXIT")));
        } else {
            return Maybe.empty();
        }
    })
    .build();

The loop body (both code-generator and code-reviewer) always runs. Then, the callback inspects the result. Because the state is changed with a new review state variable, and because we return a non-empty Maybe, the runtime will process the setEscalate(true) flag and stop the next iteration.

Note that either one is enough: setting a new state variable or returning a non-empty Maybe. There’s no need to do both, as this example does.

The while approach: beforeAgentCallback

This pattern checks the exit condition before the iteration begins. We place a beforeAgentCallback on the first agent in the loop (code-generator). So the idea is to check whether a condition is met before starting a new iteration.

In our example, that applies to the code-generator. There’s no need to change the prompt this time, we only need to add the beforeAgentCallback:

var codeGenerator = LlmAgent.builder()
    .name("code-generator")
    .description(
        "Writes and refines code based on a request and feedback.")
    .instruction("""
        Your role is to write a Python function
        based on the user's request.
        On the first turn, write the initial version of the code.
        On subsequent turns, you will receive feedback on your code.
        Your task is to refine the code based on this feedback.

        Previous feedback (if any):
        {feedback?}
        """)
    .model("gemini-2.0-flash")
    .outputKey("generated_code")
    .beforeAgentCallback(callbackContext -> {
        var feedback = callbackContext.state()
            .getOrDefault("feedback", "").toString();

        if (feedback.trim().equalsIgnoreCase("EXIT")) {
            callbackContext.eventActions().setEscalate(true);
            return Maybe.just(
                Content.fromParts(Part.fromText("EXIT")));
        } else {
            return Maybe.empty();
        }
    })
    .build();

This is highly efficient. It checks the result from the last turn, and if it’s time to exit, it escalates before running the code-generator, saving an unnecessary LLM call.

Which strategy is better?

There’s no single answer.

  • The FunctionTool approach is excellent when the decision to exit is a core part of the agent’s responsibility and should be an explicit action.
  • The Callback approach offers a more deterministic, programmatic check that is less reliant on the LLM’s tool-calling ability, making it ideal for simple state-based exit conditions.

Don’t forget the safety net!

Regardless of the pattern you choose, you must always include a maxIterations limit. This is a critical safety net to prevent infinite loops if the agents get stuck in a cycle of corrections.

Conclusion: The power of composition

Across this four-part series, we’ve seen how the ADK for Java provides a powerful and elegant toolkit for building sophisticated AI systems. We’ve moved beyond single, monolithic agents to composing workflows that are flexible, ordered, efficient, and even iterative.

  • Sub-agents bring flexibility and grant agents agency.
  • Sequential agents give us order.
  • Parallel agents give us speed.
  • Loop agents give us intelligence through iteration.

The real magic happens when you realize these aren’t mutually exclusive choices. They are composable building blocks. By combining these patterns, you can design and build truly advanced agentic applications that can tackle complex, multi-step problems in a robust and maintainable way.

https://glaforge.dev/posts/2025/07/28/mastering-agentic-workflows-with-adk-loop-agents/
Mastering agentic workflows with ADK for Java: Parallel agents

Let’s continue our exploration of ADK for Java (Agent Development Kit for building AI agents). In this series, we’ve explored two fundamental agentic workflows: an LlmAgent with sub-agents for flexibility, and the SequentialAgent for a fixed, ordered process.

But what if your problem isn’t about flexibility or a fixed sequence? What if it’s about efficiency? Some tasks don’t depend on each other and can be done at the same time. Why wait for one to finish before starting the next?

This is where the ADK for Java offers a third, powerful pattern: the ParallelAgent.

Maximum efficiency with parallel workflows

A parallel workflow is designed to execute multiple agents concurrently. It’s the perfect solution when you need to perform several independent tasks and want to get them done as quickly as possible. The ParallelAgent runs its sub-agents in separate threads, waits for all of them to complete, and then gathers the results.
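
The “fan out, wait for all, gather” idea can be sketched in plain Java with CompletableFuture. This is only an illustration of the concept, not the ParallelAgent implementation. ParallelSketch and runAll are hypothetical names, and the task bodies are placeholders for real work.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Plain-Java sketch of "fan out, wait for all, gather"; it does not use
// or reproduce the ADK ParallelAgent, and the task bodies are placeholders.
class ParallelSketch {

    // Runs every task concurrently and stores each result under its key.
    static Map<String, String> runAll(Map<String, Supplier<String>> tasks) {
        Map<String, String> state = new ConcurrentHashMap<>();
        CompletableFuture<?>[] futures = tasks.entrySet().stream()
            .map(entry -> CompletableFuture
                .supplyAsync(entry.getValue())
                .thenAccept(result -> state.put(entry.getKey(), result)))
            .toArray(CompletableFuture[]::new);
        // Block until all tasks have completed.
        CompletableFuture.allOf(futures).join();
        return state;
    }

    public static void main(String[] args) {
        Map<String, Supplier<String>> tasks = Map.of(
            "profile", () -> "company overview",
            "news", () -> "recent headlines",
            "financials", () -> "earnings snapshot");
        System.out.println(runAll(tasks).keySet());
    }
}
```

The total waiting time is roughly that of the slowest task rather than the sum of all of them, which is where the time saving comes from.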

This is ideal for:

  • Comprehensive data gathering: Running different types of searches or analyses on the same topic simultaneously.
  • Independent sub-tasks: Performing unrelated actions that can happen in any order.
  • Saving time: Drastically reducing the total execution time compared to running the same tasks one after another.

A practical example: the market-researcher

Let’s look at our ParallelFlow.java example, illustrated by the diagram below:

Here’s the full source code, but we’ll explore each part in detail further down:

var companyProfiler = LlmAgent.builder()
    .name("company-profiler")
    .description(
        "Provides a general overview of a company.")
    .instruction("""
        Your role is to provide a brief overview of the
        given company.
        Include its mission, headquarters, and current CEO.
        Use the Google Search Tool to find this information.
        """)
    .model("gemini-2.0-flash")
    .tools(new GoogleSearchTool())
    .outputKey("profile")
    .build();

var newsFinder = LlmAgent.builder()
    .name("news-finder")
    .description(
        "Finds the latest news about a company.")
    .instruction("""
        Your role is to find the top 3-4 recent news headlines
        for the given company.
        Use the Google Search Tool.
        Present the results as a simple bulleted list.
        """)
    .model("gemini-2.0-flash")
    .tools(new GoogleSearchTool())
    .outputKey("news")
    .build();

var financialAnalyst = LlmAgent.builder()
    .name("financial-analyst")
    .description(
        "Analyzes the financial performance of a company.")
    .instruction("""
        Your role is to provide a snapshot of the given company's
        recent financial performance.
        Focus on stock trends or recent earnings reports.
        Use the Google Search Tool.
        """)
    .model("gemini-2.0-flash")
    .tools(new GoogleSearchTool())
    .outputKey("financials")
    .build();

var marketResearcher = ParallelAgent.builder()
    .name("market-researcher")
    .description(
        "Performs comprehensive market research on a company.")
    .subAgents(
        companyProfiler,
        newsFinder,
        financialAnalyst
    )
    .build();

var reportCompiler = LlmAgent.builder()
    .name("report-compiler")
    .description(
        "Compiles a final market research report.")
    .instruction("""
        Your role is to synthesize the provided information
        into a coherent market research report.
        Combine the company profile, latest news, and
        financial analysis into a single, well-formatted report.

        ## Company Profile
        {profile}

        ## Latest News
        {news}

        ## Financial Snapshot
        {financials}
        """)
    .model("gemini-2.0-flash")
    .build();

return SequentialAgent.builder()
    .name("company-detective")
    .description(
        "Collects various market information about a company.")
    .subAgents(
        marketResearcher,
        reportCompiler
    ).build();

Now let’s zoom in. We’ve built a market-researcher designed to quickly compile a report on a public company. The research involves several distinct tasks that don’t depend on each other:

  • company-profiler: Finds the company’s mission, CEO, headquarters, etc.
  • news-finder: Scans for recent news headlines.
  • financial-analyst: Looks up stock performance.

These agents use the GoogleSearchTool to find up-to-date information, beyond the LLM’s training cut-off date.

We can run all three of these at the same time. Each agent is configured with an outputKey to store its findings, just like we saw with the SequentialAgent.

var companyProfiler = LlmAgent.builder()
    .name("company-profiler")
    // ...
    .outputKey("profile")
    .build();

var newsFinder = LlmAgent.builder()
    .name("news-finder")
    // ...
    .outputKey("news")
    .build();

var financialAnalyst = LlmAgent.builder()
    .name("financial-analyst")
    // ...
    .outputKey("financials")
    .build();

The best of both worlds: a hybrid sequential-parallel flow

Now, here’s where we take our architecture to the next level. We’re composing different flows, by embedding our ParallelAgent inside a SequentialAgent.

This creates a multi-stage pipeline that combines the efficiency of parallel execution with the structured order of a sequence.

Here’s how our hybrid company-detective works. It’s a SequentialAgent: it triggers the parallel flow and, once that completes, compiles the final report from the outputs of the parallel agents.

The sequential agent combines two steps:

return SequentialAgent.builder()
    .name("company-detective")
    .description(
        "Collects various market information about a company.")
    .subAgents(
        marketResearcher,
        reportCompiler
    ).build();

  1. Step 1 (in parallel): Sub-agents collect information in parallel about the company, the latest news, and its financial status.

    var marketResearcher = ParallelAgent.builder()
        .name("market-researcher")
        .description(
            "Performs comprehensive market research on a company.")
        .subAgents(
            companyProfiler,
            newsFinder,
            financialAnalyst
        )
        .build();
    

    This agent’s only job is to execute the three research tasks concurrently. Once all three are done, it bundles their outputs (profile, news, financials) and passes them to the next stage in the sequence, thanks to outputKeys.

  2. Step 2 (sequentially): A final report-compiler agent runs.

    var reportCompiler = LlmAgent.builder()
        .name("report-compiler")
        .instruction("""
            Your role is to synthesize the provided information...
    
            ## Company Profile
            {profile}
    
            ## Latest News
            {news}
    
            ## Financial Snapshot
            {financials}
            """)
        .build();
    

    This agent is a standard step in our SequentialAgent. It takes the collected data from the completed parallel step and formats it into a clean, final report.

By structuring our flow this way, we’ve created a highly modular and extensible pipeline. The parallel research block is a self-contained, reusable component. We could easily add more sequential steps before or after it, like a step to verify the company’s stock ticker first, or a final step to email the report.
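
The composability of these building blocks can be illustrated with a tiny plain-Java composite. This is a sketch under assumptions: Stage, sequential, and parallel are hypothetical names rather than ADK types, and each stage simply writes a string into a shared state map instead of calling an LLM.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Plain-Java composite sketch; Stage, sequential and parallel are
// hypothetical names, not ADK types.
class CompositionSketch {

    // A stage reads from and writes to a shared state map.
    interface Stage {
        void run(Map<String, String> state);
    }

    // Runs stages one after another, in the given order.
    static Stage sequential(List<Stage> stages) {
        return state -> stages.forEach(stage -> stage.run(state));
    }

    // Runs stages concurrently and waits for all of them.
    static Stage parallel(List<Stage> stages) {
        return state -> CompletableFuture.allOf(
            stages.stream()
                .map(stage -> CompletableFuture.runAsync(() -> stage.run(state)))
                .toArray(CompletableFuture[]::new)
        ).join();
    }

    static Map<String, String> demo() {
        // Stage 1: three independent research tasks, run in parallel.
        Stage research = parallel(List.of(
            s -> s.put("profile", "overview"),
            s -> s.put("news", "headlines"),
            s -> s.put("financials", "earnings")));
        // Stage 2: combine the three results once they are all available.
        Stage compile = s -> s.put("report",
            s.get("profile") + " | " + s.get("news") + " | " + s.get("financials"));
        Map<String, String> sharedState = new ConcurrentHashMap<>();
        sequential(List.of(research, compile)).run(sharedState);
        return sharedState;
    }
}
```

Because a composite is itself a Stage, the parallel block can be dropped into a longer sequence, and extra steps can be added before or after it without touching the research tasks.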

Example run

I asked my composite agent: Give me a report about Google. It ran through the various agents, and generated the following report about the company:

## Google: Market Research Report (Q2 2025)

Executive Summary:

This report provides an overview of Google, its recent performance,
and current market position. Google, a subsidiary of Alphabet Inc.,
continues to be a dominant force in the technology industry, driven
by its core search engine, cloud services, and advancements in
artificial intelligence. Recent financial results for Q2 2025
exceeded expectations, demonstrating strong growth across key
segments. However, Google faces ongoing antitrust scrutiny and
reputational challenges.

1. Company Profile:

*   Name: Google (a subsidiary of Alphabet Inc.)
*   Mission: To organize the world's information and make it
    universally accessible and useful.
*   Headquarters: 1600 Amphitheatre Parkway, Mountain View,
    California (Googleplex)
*   CEO: Sundar Pichai (also CEO of Alphabet Inc.)

2. Latest News & Developments:

*   Antitrust Lawsuit: Illinois has joined an antitrust lawsuit
    accusing Google of monopolizing advertising technology, adding
    to the legal challenges faced by the company.
*   Antitrust Violation: A federal judge ruled that Google violated
    antitrust laws, further highlighting regulatory pressures.
*   Google Maps Update: Google Maps is now displaying the Gulf of
    Mexico as the "Gulf of America" for U.S. users, sparking some
    controversy and discussion.
*   BadBox 2.0 Botnet: Google is taking legal action against the
    BadBox 2.0 botnet, demonstrating its commitment to cybersecurity
    and protecting its users.

3. Financial Analysis (Q2 2025):

Alphabet Inc. reported strong financial results for the second
quarter of 2025, exceeding analysts' expectations.

*   Revenue: Consolidated revenue increased by 14% year-over-year
    to $96.4 billion (13% increase in constant currency).
    *   Google Services revenue increased by 12% to $82.5 billion,
        driven by Google Search, YouTube ads, and Google
        subscriptions.
    *   Google Cloud revenue increased by 32% to $13.6 billion,
        fueled by growth in Google Cloud Platform (GCP), AI
        Infrastructure, and Generative AI Solutions. The annual
        revenue run-rate for Google Cloud now exceeds $50 billion.
*   Operating Income: Increased by 14%, with an operating margin
    of 32.4%.
*   Net Income: Increased by 19%, and Earnings Per Share (EPS)
    increased by 22% to $2.31.
*   Capital Expenditures: Increased investment to approximately
    $85 billion in 2025 due to strong demand for cloud products
    and services.

4. Stock Performance & Trends:

*   Stock Performance: Alphabet's Class A shares initially fell
    slightly in after-hours trading but then rose by 2%.
    The stock was up about 1% for 2025 through Wednesday's close.
*   Analyst Ratings: The average rating for GOOG stock is "Buy,"
    with a 12-month stock price target of $206.35, representing a
    potential increase of 7.75%.
*   Recent Stock Movement: GOOG stock has risen by 3.96% compared
    to the previous week and 14.52% over the last month.
*   52-Week Range: $140.53 (low) to $207.05 (high).
*   AI Impact: Google's AI Overview tool has over 2 billion monthly
    users, boosting search impressions by 49% since launch.
*   Long-Term Growth: An investment of $1,000 in Alphabet stock at
    its IPO in 2004 would be worth approximately $75,272 today,
    reflecting a compound annual growth rate of 22.92% over 21 years.

5. Key Takeaways & Future Outlook:

*   AI as a Growth Driver: Sundar Pichai emphasizes that AI is
    positively impacting every part of the business and driving
    strong momentum. Google's advancements in AI are expected to
    continue to be a significant growth driver.
*   Cloud Services Expansion: The strong growth in Google Cloud
    revenue signifies its increasing competitiveness in the cloud
    computing market, challenging leaders like AWS and Azure.
    Increased capital expenditure in this area further indicates
    Google's commitment to expanding its cloud infrastructure.
*   Financial Strength: Alphabet's Q2 2025 results demonstrate
    its robust financial health and ability to generate significant
    revenue and profit.
*   Potential Challenges: Ongoing antitrust lawsuits and scrutiny
    pose risks to Google's market dominance and could potentially
    lead to changes in its business practices. Public perception
    issues, such as those arising from the Google Maps update,
    require careful management.

6. Conclusion:

Google remains a powerful and influential company with a strong
financial foundation and a commitment to innovation. Its investments
in AI and cloud computing position it well for future growth.
However, the company must navigate regulatory challenges and manage
its public image effectively to maintain its market leadership.
The overall outlook for Google appears positive, contingent on
its ability to adapt to evolving market dynamics and regulatory
pressures.

Choosing the right workflow

Let’s quickly recap the three patterns we’ve covered:

  • LlmAgent with sub-agents: Choose this for flexibility. The orchestrator LLM decides the best next step. It’s great for conversational bots and user-driven tasks.
  • SequentialAgent: Choose this for order. The process is fixed and predictable. It’s perfect for automating linear, multi-step procedures.
  • ParallelAgent: Choose this for efficiency. Independent tasks run concurrently. It’s ideal for data gathering and speeding up workflows.

And as we’ve just seen, the true power comes from composing these patterns together to build sophisticated, real-world applications.

In the final article of this series, we’ll explore one more fascinating workflow: the loop flow, designed for tasks that require iteration and self-correction. Stay tuned!

https://glaforge.dev/posts/2025/07/25/mastering-agentic-workflows-with-adk-parallel-agent/
Mastering agentic workflows with ADK for Java: Sequential agents

In the first part of this series, we explored the “divide and conquer” strategy using sub-agents to create a flexible, modular team of AI specialists. This is perfect for situations where the user is in the driver’s seat, directing the flow of conversation. But what about when the process itself needs to be in charge?

Some tasks are inherently linear. You have to do Step A before Step B, and Step B before Step C. Think about a CI/CD pipeline: you build, then you test, then you deploy. You can’t do it out of order… or if you do, be prepared for havoc!

For these situations, the ADK for Java provides another powerful tool in our orchestration toolbox: the SequentialAgent.

When order matters: the sequential workflow

A sequential workflow is all about creating a predictable, step-by-step process. Unlike the flexible sub-agent model where an orchestrator LLM decides which sub-agent to invoke and when, a SequentialAgent executes a predefined list of agents in a fixed order.

This is useful for:

  • Automating multi-step processes: Guiding a user through a fixed procedure, like a trip planner or a complex data analysis pipeline.
  • Ensuring consistency: Guaranteeing that tasks are always performed in the correct sequence for reliable and predictable outcomes.
  • Building on previous results: Creating workflows where the output of one step is the essential input for the next.

A practical example: a trip-planner

Let’s dive into our SequentialFlow.java example (code further down). This code defines a trip-planner agent that guides a user through the process of planning a vacation. You can’t suggest restaurants before you have an itinerary, and you can’t build an itinerary before you’ve researched the destination. This is a perfect use case for a SequentialAgent.

Our trip-planner consists of three specialist agents that will be executed in a specific order:

  1. destination-researcher: The first step. It takes a user’s request (e.g., “a romantic weekend in Paris”) and finds suitable attractions.
  2. itinerary-creator: The second step. It takes the list of attractions from the previous step and organizes it into a logical 2-day plan.
  3. restaurant-suggester: The final step. It takes the completed itinerary and enriches it with relevant lunch and dinner recommendations.

I’ll show you the full source code here first, but then in subsequent sections, we’ll explore each step in more detail.

var destinationResearcher = LlmAgent.builder()
    .name("destination-researcher")
    .description("""
        Finds points of interest for a given destination
        and travel style.
        """)
    .instruction("""
        Your role is to find points of interest for a given
        city and travel style, according to the duration of
        the trip, and potentially budget.
        Use the Google Search Tool to find relevant information.
        Present the results as a simple bullet point list
        of key attractions.
        """)
    .model("gemini-2.0-flash")
    .tools(new GoogleSearchTool())
    .outputKey("destination-research")
    .build();

var itineraryCreator = LlmAgent.builder()
    .name("itinerary-creator")
    .description("""
        Creates an itinerary from a list of attractions.
        """)
    .instruction("""
        Your role is to create a logical itinerary
        from the provided list of attractions:

        {destination-research}

        Group the attractions by proximity or theme for each day.
        Present the itinerary clearly, with each day's plan laid out.
        """)
    .model("gemini-2.0-flash")
    .outputKey("itinerary")
    .build();

var restaurantSuggester = LlmAgent.builder()
    .name("restaurant-suggester")
    .description("""
        Suggests restaurants for each day of the itinerary.
        """)
    .instruction("""
        Your role is to suggest one lunch and one dinner restaurant
        for each day of the itinerary:

        {itinerary}

        Use the Google Search Tool to find restaurants
        that are near the day's activities
        and match the overall travel style.
        Add the restaurant suggestions to the itinerary.
        """)
    .model("gemini-2.0-flash")
    .tools(new GoogleSearchTool())
    .build();

return SequentialAgent.builder()
    .name("trip-planner")
    .description("""
        Helps you plan a trip by finding attractions,
        creating an itinerary, and suggesting restaurants.
        """)
    .subAgents(
        destinationResearcher,
        itineraryCreator,
        restaurantSuggester
    )
    .build();

The magic ✨ of state passing: outputKey

“But wait,” you ask, “if these are separate agents, how does the itinerary-creator know what the destination-researcher found?”

This is the core mechanism of the SequentialAgent: passing state between agents. The ADK makes this incredibly simple and elegant using an outputKey.

Here’s how it works:

  1. The producer agent: The agent that generates the data uses the .outputKey("some-key") method. This tells the SequentialAgent to store the final result of this agent’s turn in a context variable named some-key.

    var destinationResearcher = LlmAgent.builder()
        .name("destination-researcher")
        // ... instructions ...
        .outputKey("destination-research") // <-- Produces the data
        .build();
    
  2. The consumer agent: The next agent in the sequence can then reference this stored data directly in its instructions using curly braces: {some-key}.

    var itineraryCreator = LlmAgent.builder()
        .name("itinerary-creator")
        // The {destination-research} placeholder below consumes the data
        .instruction("""
            Your role is to create a logical itinerary
            from the provided list of attractions:
    
            {destination-research}
    
            Group the attractions by proximity or theme for each day.
            """)
        .outputKey("itinerary") // Can produce data for the next agent!
        .build();
    

The ADK handles the injection automatically. The {destination-research} placeholder will be replaced by the full output of the destinationResearcher agent before the prompt is sent to the itineraryCreator agent.
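
To see the mechanics of this hand-off without any framework, here is a minimal plain-Java sketch. It is an assumption-laden illustration: AgentStep, runSequence, and the fake model function are hypothetical stand-ins, not ADK types. Each reply is stored under the step’s output key, then substituted into later prompts.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Plain-Java sketch of outputKey-style state passing. AgentStep and the
// fake "model" function are hypothetical stand-ins, not ADK types.
class StatePassingSketch {

    // instruction: prompt template; outputKey: where the reply is stored
    // (null means the reply is not stored).
    record AgentStep(String instruction, String outputKey) {}

    static Map<String, String> runSequence(
            List<AgentStep> steps, UnaryOperator<String> model) {
        Map<String, String> state = new LinkedHashMap<>();
        for (AgentStep step : steps) {
            String prompt = step.instruction();
            // Inject every known state variable into its {placeholder}.
            for (Map.Entry<String, String> entry : state.entrySet()) {
                prompt = prompt.replace("{" + entry.getKey() + "}", entry.getValue());
            }
            String reply = model.apply(prompt);
            if (step.outputKey() != null) {
                state.put(step.outputKey(), reply);
            }
        }
        return state;
    }
}
```

For instance, with a fake model that wraps its prompt in brackets, a first step storing its reply under destination-research and a second step whose instruction contains {destination-research} would see the first reply embedded in the second prompt.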

Building the SequentialAgent

Putting it all together is straightforward. We use the SequentialAgent.builder() and provide our agents to the .subAgents() method. The key difference here is that the order is strictly enforced.

return SequentialAgent.builder()
    .name("trip-planner")
    .description("""
        Helps you plan a trip by finding attractions,
        creating an itinerary, and suggesting restaurants.
        """)
    .subAgents(
        destinationResearcher, // will always run first
        itineraryCreator,      // will always run second
        restaurantSuggester    // will always run third
    )
    .build();

When this agent runs, it will execute destinationResearcher, take its output, feed it into itineraryCreator, take its output, feed it into restaurantSuggester, and finally return the completed, enriched itinerary to the user. The last agent could also reference the output of the first agent if needed, for full context awareness!

Example run

Ideally, you would have a first agent collecting key information like the destination, number of days of the stay, the budget you plan to spend, the types of attractions you’d like to visit, and the sort of cuisine you want to eat. Here, I’m going straight with the full detailed request:

👤 You > Plan a trip of 5 days in Paris, with a medium budget,
to visit the key tourist attractions (museums, parks, landmarks),
and eat French cuisine!

Let’s see what the destination research replies:

🤖 destination-researcher >

Okay, I can help you plan a 5-day trip to Paris focusing on key
tourist attractions, French cuisine, and keeping it within a medium
budget. Here are some ideas:

Here is a possible itinerary for your 5-day trip to Paris:

Attractions:

*   Eiffel Tower: A must-see iconic landmark. Consider booking timed
    tickets in advance to avoid long queues. You can climb to the
    second floor or the top for panoramic views.
*   Louvre Museum: One of the world's largest and most visited
    museums, home to masterpieces like the Mona Lisa. Booking a time
    slot in advance is essential.
*   Arc de Triomphe: Climb to the top of the Arc for stunning views
    of the Champs-Élysées and the city.
*   Notre Dame Cathedral: While still undergoing restoration, you can
    view the cathedral from the outside and visit the archeological
    crypt.
*   Seine River Cruise: A relaxing way to see many of Paris's famous
    landmarks from a different perspective. Many cruises depart
    near the Eiffel Tower.
*   Musée d'Orsay: Housed in a former railway station, this museum
    features Impressionist and Post-Impressionist art.
*   Palace of Versailles: A former royal residence, this opulent
    palace and its gardens are a must-see. Advance booking is required.
*   Sainte-Chapelle: Known for its stunning stained glass windows.
    Booking a timeslot in advance is recommended.
*   Montmartre & Sacré-Cœur Basilica: Explore the artistic neighborhood
    of Montmartre and visit the Sacré-Cœur Basilica for panoramic views.

Parks & Gardens:

*   Luxembourg Gardens: A beautiful and popular park with gardens,
    fountains, and activities.
*   Tuileries Garden: Located between the Louvre and Place de la
    Concorde, this formal garden is perfect for a stroll.
*   Parc des Buttes-Chaumont: One of the most beautiful and unique
    gardens in Paris, with an artificial mountain structure and
    impressive panoramic views.
*   Parc Monceau: An English-style garden with follies, statues,
    and children's play areas.

Food:

*   Bouillon Chartier: Offers traditional French cuisine
    at affordable prices.
*   Chez Gladines: Serves hearty Basque-inspired cuisine
    at reasonable prices.
*   Le Bistrot des Vosges: A no-frills restaurant with classic
    French dishes at affordable prices.
*   Brasserie Dubillot: An all-day spot for affordable
    French classics.
*   Explore local markets: Visit Rue Montorgueil
    for a taste of local market life.

Possible Itinerary Outline:

*   Day 1: Eiffel Tower, Champ de Mars, Seine River cruise,
           Arc de Triomphe, Champs-Élysées.
*   Day 2: Louvre Museum, Tuileries Garden, Place de la Concorde,
           Sainte-Chapelle, Notre Dame (exterior).
*   Day 3: Montmartre (Sacré-Cœur Basilica, Place du Tertre),
           Moulin Rouge (exterior), explore the neighborhood.
*   Day 4: Versailles Palace and Gardens (allow a full day).
*   Day 5: Musée d'Orsay, Luxembourg Gardens,
           explore the Latin Quarter.

Tips for a Medium Budget:

*   Paris Museum Pass: Consider purchasing a Paris Museum Pass
    for access to over 50 museums and monuments.
    A 4-day pass might be a good option.
*   Free museum days: Some museums offer free entry on the first
    Sunday of the month, but be prepared for crowds.
*   Lunch menus: Many restaurants offer affordable fixed-price
    lunch menus.
*   Picnics: Enjoy picnics in the parks and gardens
    to save money on meals.
*   Transportation: Use the efficient
    and affordable Paris Metro system.
*   Accommodation: Consider staying in areas slightly outside
    the city center for more affordable options.
*   Look for "bouillons": These restaurants serve traditional
    French food at reasonable prices.

Enjoy your trip to Paris!

It found good suggestions for the duration of the trip, and it took the allocated budget into account.

Now let’s see what the itinerary agent comes up with:

🤖 itineraryCreator >

Okay, here is a possible 5-day Paris itinerary with a medium budget,
focusing on key attractions, French cuisine,
and efficient use of time:

Day 1: Iconic Paris & Latin Quarter Charm

*   Morning (9:00 AM): Eiffel Tower. Pre-book tickets online to
    avoid long queues. Ascend to the second floor for stunning
    panoramic views. Budget ~€18-€29 (depending on lift access).
*   Lunch (12:00 PM): Enjoy a classic French crepe from a street
    vendor near the Eiffel Tower. Budget ~€8-€12.
*   Afternoon (1:30 PM): Champ de Mars & Musée du Quai Branly -
    Jacques Chirac. Relax in the park with views of the tower.
    Then visit the museum dedicated to non-European art and cultures.
    Museum Budget: ~€12.
*   Late Afternoon (4:30 PM): Walk or take the Metro to the
    Latin Quarter. Explore the historic Sorbonne University and
    browse the Shakespeare and Company bookstore.
*   Dinner (7:30 PM): Enjoy a traditional French meal at a bistro
    in the Latin Quarter. Look for set menus (formules) for better
    value. Try a steak frites or moules marinières. Budget ~€20-€30.

Day 2: Art, Gardens & Montmartre Magic

*   Morning (9:00 AM): Musée du Louvre. Focus on key masterpieces
    like the Mona Lisa, Venus de Milo, and Winged Victory
    of Samothrace. Pre-book tickets! Budget: ~€17.
*   Lunch (12:30 PM): Grab a quick and tasty sandwich jambon-beurre
    (ham and butter sandwich) from a boulangerie near the Louvre.
    Budget ~€5-€8.
*   Afternoon (2:00 PM): Tuileries Garden. Stroll through the
    beautiful gardens connecting the Louvre to Place de la Concorde.
*   Late Afternoon (4:00 PM): Place de la Concorde & Champs-Élysées.
    Admire the obelisk and walk along the famous avenue towards the
    Arc de Triomphe.
*   Evening (6:30 PM): Take the Metro to Montmartre. Explore the
    charming streets, visit the Sacré-Cœur Basilica, and enjoy the
    artistic atmosphere of Place du Tertre.
*   Dinner (8:00 PM): Have dinner at a traditional bistro in
    Montmartre. Consider soupe à l'oignon gratinée (French onion soup).
    Budget ~€20-€30.

Day 3: History, Notre Dame & Parisian Elegance

*   Morning (9:30 AM): Île de la Cité. Visit the Conciergerie
    (former royal palace and prison) and Sainte-Chapelle
    (stunning stained glass windows).
    Budget: Conciergerie ~€11.50, Sainte-Chapelle ~€11.50.
*   Lunch (12:30 PM): Enjoy a picnic lunch by the Seine River,
    with bread, cheese, and fruit from a local market. Budget ~€10-€15.
*   Afternoon (2:00 PM): Notre Dame Cathedral. View the cathedral
    from the outside (still under renovation but an important landmark).
    Walk around Île Saint-Louis and try Berthillon ice cream.
*   Late Afternoon (4:00 PM): Le Marais District. Explore the trendy
    Marais district, known for its Jewish heritage, art galleries,
    and boutiques. Visit Place des Vosges, the oldest planned square
    in Paris.
*   Dinner (7:30 PM): Have dinner in Le Marais. This area offers
    diverse cuisine options. Consider falafel or a crêperie. Budget ~€15-€25.

Day 4: Versailles Grandeur

*   Full Day (9:00 AM - 5:00 PM): Palace of Versailles.
    Take a train (RER C) to Versailles (approx. 45 minutes).
    Explore the palace, gardens, and Marie Antoinette's estate
    (the Petit Trianon and the Queen's Hamlet).
    Pre-book tickets! Palace and Estate of Trianon ticket ~€21.50.
*   Lunch (1:00 PM): Have lunch at one of the cafes in the
    Versailles gardens, or pack your own picnic to save money.
    Budget ~€10-€20.
*   Dinner (8:00 PM): Upon return to Paris,
    have dinner near your accommodation.

Day 5: Museums and Departure

*   Morning (9:30 AM): Musée d'Orsay. Located in a
    former train station,
    this museum houses a stunning collection of Impressionist
    and Post-Impressionist art. Budget ~€16.
*   Lunch (12:30 PM): Enjoy a final French lunch at a brasserie
    near the Musée d'Orsay. Try a croque monsieur or salade niçoise.
    Budget ~€15-€25.
*   Afternoon (2:00 PM): Depending on your departure time,
    you can visit the Centre Pompidou for modern art (~€15),
    do some souvenir shopping, or revisit a favorite spot.
*   Departure: Allow ample time to travel to the airport
    or train station.

Budget Considerations (per person per day - estimates):

*   Accommodation: €70-€120 (hostel private room/budget hotel)
*   Food: €40-€60 (mix of affordable eateries, picnics,
    and some restaurant meals)
*   Attractions: €20-€40 (depending on museum choices
    and pre-booking)
*   Transportation: €10-€15 (Metro/bus travel - consider a
    travel pass if using public transport frequently)

Tips for a Medium Budget Trip:

*   Accommodation: Consider staying in a hostel private room,
    a budget hotel in a less central location, or an Airbnb.
*   Transportation: Utilize the Paris Metro for efficient travel.
    Consider purchasing a Navigo Découverte pass for unlimited
    travel within zones 1-5.
*   Food: Take advantage of affordable lunch options like formules
    (set menus), crêpes, and sandwiches. Shop at local markets
    for picnic supplies.
*   Museums: Take advantage of free museum days (usually the first
    Sunday of the month, but check in advance). Consider the Paris
    Museum Pass if you plan to visit many museums.
*   Free Activities: Walk along the Seine, explore parks and gardens,
    visit free churches, and enjoy the atmosphere of different
    neighborhoods.

Enjoy your trip to Paris!

Alright, it detailed the itinerary with quite precise timings. It already includes some culinary suggestions, but the next agent will make some concrete suggestions:

🤖 restaurant-suggester >

Okay, I can help you enhance your Paris itinerary with restaurant
suggestions for lunch and dinner each day, keeping your medium budget
and focus on French cuisine in mind. I will use the Google Search Tool
to find restaurants near your planned activities.

Day 1: Iconic Paris & Latin Quarter Charm

*   Morning (9:00 AM): Eiffel Tower. Pre-book tickets online to avoid
    long queues. Ascend to the second floor for stunning panoramic
    views. Budget ~€18-€29 (depending on lift access).
*   Lunch (12:00 PM): Enjoy a classic French crepe from a street
    vendor near the Eiffel Tower. Budget ~€8-€12.
    Restaurant Suggestion: Crêperie Générale (near Eiffel Tower) -
    Quick and affordable crepes with various sweet and savory options.
*   Afternoon (1:30 PM): Champ de Mars & Musée du Quai Branly -
    Jacques Chirac. Relax in the park with views of the tower.
    Then visit the museum dedicated to non-European art and cultures.
    Museum Budget: ~€12.
*   Late Afternoon (4:30 PM): Walk or take the Metro to the Latin Quarter.
    Explore the historic Sorbonne University and browse the
    Shakespeare and Company bookstore.
*   Dinner (7:30 PM): Enjoy a traditional French meal at a bistro in
    the Latin Quarter. Look for set menus (formules) for better value.
    Try a steak frites or moules marinières. Budget ~€20-€30.
    Restaurant Suggestion: Le *Bouillon Chartier* (Latin Quarter/Grands
    Boulevards area) - A historic and very popular bouillon serving
    classic French dishes at very affordable prices. Expect queues.

Day 2: Art, Gardens & Montmartre Magic

*   Morning (9:00 AM): Musée du Louvre. Focus on key masterpieces
    like the Mona Lisa, Venus de Milo, and Winged Victory of Samothrace.
    Pre-book tickets! Budget: ~€17.
*   Lunch (12:30 PM): Grab a quick and tasty sandwich jambon-beurre
    (ham and butter sandwich) from a boulangerie near the Louvre.
    Budget ~€5-€8.
    Restaurant Suggestion: *Boulangerie Julien* (various locations
    including near the Louvre) - Known for its traditional jambon-beurre
    and other delicious sandwiches and pastries.
*   Afternoon (2:00 PM): Tuileries Garden. Stroll through the beautiful
    gardens connecting the Louvre to Place de la Concorde.
*   Late Afternoon (4:00 PM): Place de la Concorde & Champs-Élysées.
    Admire the obelisk and walk along the famous avenue towards the
    Arc de Triomphe.
*   Evening (6:30 PM): Take the Metro to Montmartre.
    Explore the charming streets, visit the Sacré-Cœur Basilica,
    and enjoy the artistic atmosphere of Place du Tertre.
*   Dinner (8:00 PM): Have dinner at a traditional bistro in Montmartre.
    Consider soupe à l'oignon gratinée (French onion soup).
    Budget ~€20-€30.
    Restaurant Suggestion: Le *Relais Gascon* (Montmartre) -
    Hearty Southwestern French cuisine, known for its generous portions
    and reasonable prices.

Day 3: History, Notre Dame & Parisian Elegance

*   Morning (9:30 AM): Île de la Cité. Visit the Conciergerie
    (former royal palace and prison) and Sainte-Chapelle
    (stunning stained glass windows).
    Budget: Conciergerie ~€11.50, Sainte-Chapelle ~€11.50.
*   Lunch (12:30 PM): Enjoy a picnic lunch by the Seine River,
    with bread, cheese, and fruit from a local market.
    Budget ~€10-€15.
    Restaurant Suggestion (for supplies): Marché Maubert
    (near Île de la Cité) - A lively local market where you can
    buy fresh bread, cheese, fruits, and other picnic essentials.
*   Afternoon (2:00 PM): Notre Dame Cathedral.
    View the cathedral from the outside (still under renovation
    but an important landmark). Walk around Île Saint-Louis and
    try Berthillon ice cream.
*   Late Afternoon (4:00 PM): Le Marais District. Explore the trendy
    Marais district, known for its Jewish heritage, art galleries,
    and boutiques. Visit Place des Vosges, the oldest planned square
    in Paris.
*   Dinner (7:30 PM): Have dinner in Le Marais.
    This area offers diverse cuisine options.
    Consider falafel or a crêperie. Budget ~€15-€25.
    Restaurant Suggestion: *L'As du Fallafel* (Le Marais) -
    Famous for its delicious and affordable falafel sandwiches.
    Expect a queue, but it moves quickly.

Day 4: Versailles Grandeur

*   Full Day (9:00 AM - 5:00 PM): Palace of Versailles.
    Take a train (RER C) to Versailles (approx. 45 minutes).
    Explore the palace, gardens, and Marie Antoinette's estate
    (the Petit Trianon and the Queen's Hamlet).
    Pre-book tickets! Palace and Estate of Trianon ticket ~€21.50.
*   Lunch (1:00 PM): Have lunch at one of the cafes in the
    Versailles gardens or pack your own picnic to save money.
    Budget ~€10-€20.
    Restaurant Suggestion: *La Flotille* (Versailles Gardens) -
    A casual restaurant in the gardens offering salads, sandwiches,
    and hot dishes. A convenient option if you don't want to pack a picnic.
*   Dinner (8:00 PM): Upon return to Paris, have dinner near your accommodation.
    Restaurant Suggestion: This depends on the location of your accommodation.
    If you provide this, I can give a more specific recommendation.
    Otherwise, search for a local brasserie or bistro in your neighborhood.

Day 5: Museums and Departure

*   Morning (9:30 AM): Musée d'Orsay. Located in a former train station,
    this museum houses a stunning collection of Impressionist and
    Post-Impressionist art. Budget ~€16.
*   Lunch (12:30 PM): Enjoy a final French lunch at a brasserie
    near the Musée d'Orsay.
    Try a croque monsieur or salade niçoise.
    Budget ~€15-€25.
    Restaurant Suggestion: *Brasserie Thoumieux* (near Musée d'Orsay) -
    A classic Parisian brasserie with a stylish Art Deco interior,
    serving traditional dishes.
*   Afternoon (2:00 PM): Depending on your departure time,
    you can visit the Centre Pompidou for modern art (~€15),
    do some souvenir shopping, or revisit a favorite spot.
*   Departure: Allow ample time to travel to the airport or train station.
    Restaurant Suggestion (if near Centre Pompidou): Crêperie Beaubourg
    (near Centre Pompidou) - A good option for a quick and affordable
    crepe before heading to the airport.

One restaurant is suggested twice because it’s in the same area, another one isn’t really French cuisine, and some are small food-on-the-go suggestions to keep the budget on target. In some places, the LLM would have liked more precise guidance to help you make a choice. But overall, the culinary expert gave some good suggestions!

When to choose SequentialAgent
  • Choose a SequentialAgent when you have a fixed, linear process where Step B logically depends on the output of Step A. It provides predictability and structure.
  • Choose an LlmAgent with sub-agents (as in part 1) when you need flexibility and want the orchestrator LLM to dynamically decide the best tool or next step based on the conversational context.

By understanding both patterns, you can build incredibly robust and sophisticated AI systems tailored to the specific problem you’re trying to solve.

Stay tuned for the next part of the series, where we’ll explore how to run agents in parallel to handle multiple tasks at once!

https://glaforge.dev/posts/2025/07/24/mastering-agentic-workflows-with-adk-sequential-agent/
Mastering agentic workflows with ADK for Java: Sub-agents

Let me come back to the Agent Development Kit (ADK) for Java! We recently discussed the many ways to expand ADK agents with tools. But today, I want to explore the multi-agentic capabilities of ADK, by talking about sub-agent workflows.

In upcoming articles in this series, we’ll also talk about sequential, parallel, and loop flows.

The “divide and conquer” strategy

Think of building a complex application. You wouldn’t put all your logic in a single, monolithic class, would you? You’d break it down into smaller, specialized components. The sub-agent workflow applies this same “divide and conquer” principle to AI agents.

Instead of one “do-it-all” agent, you create a hierarchy:

  • An orchestrator agent that acts as a project manager.
  • Several specialized sub-agents that act as team members, each with a specific skill.

This approach has some important advantages:

  • Clarity and focus: Each agent gets a clear, concise set of instructions (its system prompt). A focused agent is a more reliable and predictable agent. It’s less prone to hallucinations, or to getting lost in the complexity of the task at hand.
  • Modularity and reusability: An agent designed for a specific task, like summarizing text, can be reused across many different applications. So you can assemble agents like Lego bricks to build your next system!
  • Maintainability: Debugging or enhancing a small, focused agent is far easier than untangling the logic of a massive, overburdened one, with potentially conflicting system instructions to tackle all the corner cases of each situation.

A practical example: a content companion

Let’s make this concrete by looking at an example I’ve been playing with: a “content companion,” an AI assistant designed to help bloggers and influencers research topics and draft social media posts.

This system is a perfect illustration of a sub-agent hierarchy, composed of a few distinct agents working in concert.

Here is the full source code, before diving in:
LlmAgent searchAgent = LlmAgent.builder()
    .name("google-search-agent")
    .description("""
        An agent that searches on Google Search
        """)
    .instruction("""
        Your role is to search on Google Search.
        Use the Google Search Tool to search up-to-date
        and relevant information about the topic.
        """)
    .model("gemini-2.0-flash")
    .tools(new GoogleSearchTool())
    .build();

LlmAgent topicSearchAgent = LlmAgent.builder()
    .name("topic-search-agent")
    .description("""
        An agent that searches and dives in particular topics
        """)
    .instruction("""
        Your role is to help explore a particular topic.
        Use the `google-search-agent` tool to search up-to-date
        and relevant information about the topic.
        Be sure to display the result of the search
        to inform the user.
        """)
    .model("gemini-2.0-flash")
    .tools(AgentTool.create(searchAgent))
    .afterAgentCallback(callbackContext -> {
        callbackContext.eventActions()
            .setTransferToAgent("content-companion");
        return Maybe.empty();
    })
    .build();

LlmAgent socialMediaAgent = LlmAgent.builder()
    .name("social-media-agent")
    .description("""
        An agent that crafts social media posts about a topic
        """)
    .instruction("""
        Given content about a topic, your role is to craft
        an attractive social media post about it.
        Don't hesitate to use meaningful emojis
        when it helps convey the message.
        """)
    .model("gemini-2.0-flash")
    .afterAgentCallback(callbackContext -> {
        callbackContext.eventActions()
            .setTransferToAgent("content-companion");
        return Maybe.empty();
    })
    .build();

return LlmAgent.builder()
    .name("content-companion")
    .description("""
        A content companion that searches topics
        and crafts compelling social media stories
        """)
    .instruction("""
        Your role is to help bloggers and influencers
        come up with interesting topic ideas,
        to search information about the topic to write about,
        and potentially to craft a compelling social media post.

        Don't search yourself:
        Use the `topic-search-agent`
        to find information about a topic.

        Don't write social media posts yourself:
        Use the `social-media-agent`
        to craft a social media post about the topic.
        """)
    .model("gemini-2.0-flash")
    .subAgents(socialMediaAgent, topicSearchAgent)
    .build();

Let’s zoom in on the various components.

1. The orchestrator: content-companion

At the top of our hierarchy is the content-companion. This is the agent we, the user, will interact with. Its main job is to understand our high-level goal and delegate the actual work to the right specialist.

Notice its instructions: it’s explicitly told what not to do.

return LlmAgent.builder()
    .name("content-companion")
    .description("""
        A content companion that searches topics
        and crafts compelling social media stories
        """)
    .instruction("""
        Your role is to help bloggers and influencers
        come up with interesting topic ideas,
        to search information about the topic to write about,
        and potentially to craft a compelling social media post.

        Don't search yourself: Use the `topic-search-agent`
        to find information about a topic.

        Don't write social media posts yourself:
        Use the `social-media-agent` to craft
        a social media post about the topic.
        """)
    .model("gemini-2.0-flash")
    .subAgents(socialMediaAgent, topicSearchAgent) // Sub-agents!
    .build();

The .subAgents() method is where the magic happens. It registers the socialMediaAgent and topicSearchAgent (detailed further down below) and makes them available as callable agents for the content-companion. The orchestrator doesn’t need to know how they work, just what they do.

Now let’s turn our attention to the sub-agents.

2. The specialists: The sub-agents

The content-companion manages two direct reports:

  • topic-search-agent: the research assistant.
  • social-media-agent: the creative copywriter.

Each of these is an independent LlmAgent with its own focused prompt. For example, the social-media-agent is single-mindedly focused on its creative task:

LlmAgent socialMediaAgent = LlmAgent.builder()
    .name("social-media-agent")
    .description("""
        An agent that crafts social media posts about a topic
        """)
    .instruction("""
        Given content about a topic, your role is to craft
        an attractive social media post about it.
        Don't hesitate to use meaningful emojis
        when it helps convey the message.
        """)
    .model("gemini-2.0-flash")
    .build();

3. Deeper delegation: Agents as tools

Here’s where it gets interesting. The delegation doesn’t have to stop at one level. As I covered in my previous post on expanding agent capabilities with tools, one agent can be used as a tool by another.

Our topic-search-agent needs to search the web, but we want to abstract that capability. So, it delegates the raw search functionality to an even more specialized agent.

// The lowest-level worker,
// whose only job is to use the GoogleSearchTool
LlmAgent searchAgent = LlmAgent.builder()
    .name("google-search-agent")
    .tools(new GoogleSearchTool())
    // ...
    .build();

// The mid-level manager,
// which uses the searcher as a tool
LlmAgent topicSearchAgent = LlmAgent.builder()
    .name("topic-search-agent")
    .instruction("""
        Your role is to help explore a particular topic.
        Use the `google-search-agent` tool...
        """)
    .tools(AgentTool.create(searchAgent)) // An agent becomes a tool!
    .build();

By wrapping searchAgent with AgentTool.create(), we turn a fully-fledged agent into a simple, callable tool. This is a powerful abstraction that keeps our agent responsibilities cleanly separated.
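Conceptually, treating an agent as a tool is an adapter: something with an agent-shaped interface is exposed through a tool-shaped one. The plain-Java sketch below illustrates only that idea. The `Agent` and `Tool` interfaces and the `asTool` method are hypothetical and much simpler than ADK's real types, and I'm not claiming this is how `AgentTool.create()` works internally.

```java
import java.util.function.UnaryOperator;

// Hypothetical illustration only: these types are not ADK's.
class AgentAsToolSketch {

    // An agent takes a request and produces a reply.
    interface Agent {
        String name();
        String reply(String request);
    }

    // A tool is something a calling agent can invoke by name.
    interface Tool {
        String name();
        String call(String arguments);
    }

    // Builds a simple agent from a name and a function (stand-in for a model).
    static Agent agent(String name, UnaryOperator<String> behavior) {
        return new Agent() {
            @Override public String name() { return name; }
            @Override public String reply(String request) { return behavior.apply(request); }
        };
    }

    // The adapter: exposes an agent through the Tool interface,
    // so the caller only sees a name and a call method.
    static Tool asTool(Agent agent) {
        return new Tool() {
            @Override public String name() { return agent.name(); }
            @Override public String call(String arguments) { return agent.reply(arguments); }
        };
    }

    public static void main(String[] args) {
        Agent searcher = agent("google-search-agent", q -> "results for: " + q);
        Tool searchTool = asTool(searcher);
        System.out.println(searchTool.name() + " => " + searchTool.call("exoplanets 2025"));
    }
}
```

The caller depends only on `Tool`, so the wrapped agent could be swapped for another implementation without touching the calling agent.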

Note

In our use case, this is also because of a technical limitation: in ADK, with Gemini, you can’t have multiple tools configured when a built-in tool like the GoogleSearchTool is declared. So using agents-as-tools also helps circumvent this limitation.

Tracing the flow of delegation

Let’s trace a simple request to see how this all fits together:

  1. Me: “Find me some information about the latest planet discoveries in 2025.”
  2. content-companion: It receives the request. Its instructions forbid it from searching. It sees that the topic-search-agent is the right tool for the job and invokes it.
  3. topic-search-agent: It’s now active. Its instructions tell it to use the google-search-agent tool. It calls this tool to perform the search.
  4. google-search-agent: This agent’s only job is to execute its tool, GoogleSearchTool. It runs the search and returns the raw results.
  5. The results flow back up the chain: The raw data goes from google-search-agent to topic-search-agent. The topic-search-agent might then process or summarize these results before passing them up to the content-companion, which finally presents the answer to me.

Then via a new interaction, you can ask for a tweet about the discoveries:

  1. Me: “Craft a short tweet about those discoveries.”
  2. social-media-agent: It receives the request, remembers the ongoing conversation about exoplanets, and prepares a tweet as requested.
  3. The result is shared with the content-companion, which will deliver the final tweet suggestion back to me.

Choreographing the conversation with afterAgentCallback

If you look closely at the entire Java code posted at the beginning of the article, you’ll spot a curious addition to our specialist agents: an afterAgentCallback. This isn’t just boilerplate; it’s a crucial piece of conversational choreography.

.afterAgentCallback(callbackContext -> {
    callbackContext.eventActions()
        .setTransferToAgent("content-companion");
    return Maybe.empty();
})

So, what does this do? It explicitly manages the flow of control.

Without this callback, after the topic-search-agent finishes its research, the conversational focus would remain with it. If I then said, “Okay, now write a tweet about that,” the topic-search-agent would be confused. That’s not its job. I’d be “stuck” talking to the specialist, and I’d have to manually re-engage the main content-companion. That’s a clunky user experience.

The afterAgentCallback solves this elegantly. It acts as a “return to sender” instruction. The setTransferToAgent("content-companion") command tells the ADK: “As soon as this agent’s turn is over, immediately transfer the conversational control back to the content-companion.”

This ensures the user always has a smooth, continuous dialogue with the main orchestrator, which is always ready for the next command. It’s a vital mechanism for designing complex yet intuitive agentic systems.
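To visualize the difference the callback makes, here is a toy plain-Java model of "which agent receives the next user message". Nothing here is ADK code: `TransferSketch`, `Session`, and the `transferBack` flag are hypothetical names, with the flag standing in for the presence of the callback.

```java
// Hypothetical illustration only: a toy model of "who gets the next message".
class TransferSketch {

    static final String ORCHESTRATOR = "content-companion";

    // Holds the name of the agent that will receive the next user message.
    static final class Session {
        String activeAgent = ORCHESTRATOR;
    }

    // Simulates one delegated turn. The boolean flag stands in for the
    // presence of a callback that hands control back after the turn.
    static void runSpecialistTurn(Session session, String specialist,
                                  boolean transferBack) {
        session.activeAgent = specialist;        // the orchestrator delegates
        // ... the specialist would produce its answer here ...
        if (transferBack) {
            session.activeAgent = ORCHESTRATOR;  // "return to sender"
        }
    }

    public static void main(String[] args) {
        Session stuck = new Session();
        runSpecialistTurn(stuck, "topic-search-agent", false);
        System.out.println("without callback: " + stuck.activeAgent);

        Session smooth = new Session();
        runSpecialistTurn(smooth, "topic-search-agent", true);
        System.out.println("with callback:    " + smooth.activeAgent);
    }
}
```

In the first case the next message would land on the specialist; in the second it lands on the orchestrator, which matches the behavior described above.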

Why bother with agent hierarchies?

Creating a sub-agent workflow makes a real difference when:

  • Your problem is multi-faceted, like “research a topic, write a summary, and then draft a tweet.”
  • You want to build a system that is robust and easily extensible. Need a new capability? Just build a new specialist agent and register it with your orchestrator.
  • You want your prompts to be simple and effective. Less ambiguity in the prompt leads to more reliable behavior from the LLM.

By composing agents this way, we move from simple command-and-response bots to building truly scalable and capable AI systems.

Stay tuned for our next post! We’ll explore sequential workflows to orchestrate tasks that must happen in a specific, predictable order.

https://glaforge.dev/posts/2025/07/23/mastering-agentic-workflows-with-adk-sub-agents/
The Sci-Fi naming problem: Are LLMs less creative than we think?

Like many developers, I’ve been exploring the creative potential of Large Language Models (LLMs). At the beginning of the year, I crafted a project to build an AI agent that could generate short science-fiction stories. I used LangChain4j to create a deterministic workflow to drive Gemini for the story generation, and Imagen for the illustrations. The initial results were fascinating. The model could weave narratives, describe futuristic worlds, and create characters with seemingly little effort. But as I generated more stories, a strange and familiar pattern began to emerge…

A “Dr. Thorne” would frequently appear, sometimes as a brilliant scientist, other times as a starship captain. The heroines were often named Anya, or Elena. The stories unfolded on planets with names that all seemed to echo each other (but often inspired by current exoplanet findings). It was as if the AI was drawing from a very small, very specific cast of characters and settings.

My first thought was that this was a limitation of the specific model I was using, Google’s Gemini. Was it simply not “creative” enough? Was its imagination stuck in a loop? I was about to chalk it up to a model-specific quirk, but then I stumbled upon a benchmark for long-form creative writing: EQ Bench’s longform creative writing. To my surprise, I noticed that other models from different providers were also generating sci-fi stories with eerily similar character names. The problem wasn’t isolated; it was systemic.

  • The Gemini 2.5 Pro sci-fi sample on the benchmark had its main character called… Dr Aris Thorne.
  • DeepSeek’s novel was featuring Elara Voss as its main protagonist.
  • Claude Opus 4’s sample was talking about Elana Vasquez. These were common names that I found multiple times across models.

This discovery shifted my perspective. What if the issue wasn’t a lack of creativity, but a reflection of the data the models were trained on? The hypothesis was simple: for a specialized genre like science fiction, perhaps the available training data is more limited than we assume. If the models were all learning from a similar, relatively small pool of sci-fi literature, it would stand to reason they would reproduce the most common elements from that pool.

Hunting for science-fiction datasets

This led me down a rabbit hole to Kaggle, a popular platform for data scientists, where I searched for science-fiction book datasets. My intuition was that big models were trained on common datasets of novels. And for a very precise topic, the actual data contained in those datasets was probably scarce, leading to less diversity. I found exactly what I was looking for: large text corpora of sci-fi novels. My intuition nagging at me, I began searching through these datasets for the very names that had been popping up in my generated stories.

  • The Science Fiction Books dataset features 10 thousand books. It only contains metadata, but the book descriptions cover the main characters, the narrative, etc. The data is split across several CSV files for different types (or sub-genres) of sci-fi stories.
  • The SciFi Stories Text corpus is a ~150MB text file containing various sci-fi stories (however, I’m not sure exactly what it covers or where the data comes from).

I can’t say if the big models were trained on those particular datasets, but there they were. Dr. Thorne. Elara. Anya…

The names weren’t just present; they were frequent. The models weren’t failing at being creative. They were succeeding, perhaps too well, at identifying and reproducing the most statistically common patterns in the data they were fed. The perceived lack of creativity was, in fact, a direct consequence of the limitations of the training data for this specific genre.

Gemini CLI to the rescue

I pointed Gemini CLI at the CSV files and at this big text file of sci-fi novels from the 2 datasets I mentioned, and asked it to find some references to characters…

The infamous Dr Thorne appeared 204 times in 26 book descriptions, and Anya was present in 8 book descriptions and made some 204 appearances! So they were clearly very well known!
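For readers who would rather run this kind of search themselves, here is a minimal plain-Java sketch of counting name occurrences in a text corpus. It is not what Gemini CLI does internally: the `NameFrequency` class, its `count` method, and the sample sentence are hypothetical, and the matching is a simple case-sensitive whole-word regex.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class NameFrequency {

    /** Counts whole-word, case-sensitive occurrences of each name in the corpus. */
    static Map<String, Integer> count(String corpus, List<String> names) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : names) {
            // \b keeps "Anya" from matching inside a longer word such as "Anyara"
            Matcher matcher = Pattern.compile("\\b" + Pattern.quote(name) + "\\b").matcher(corpus);
            int occurrences = 0;
            while (matcher.find()) {
                occurrences++;
            }
            counts.put(name, occurrences);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample text; a real run would read the dataset files instead.
        String sample = "Dr. Thorne met Anya. Anya smiled at Thorne, while Anyara waited.";
        System.out.println(count(sample, List.of("Thorne", "Anya", "Elara")));
    }
}
```

On a real corpus, the text would be loaded with something like `Files.readString`, and the counts sorted to surface the most frequent names.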

After various searches through the datasets, Gemini CLI told me:

Thorne, Anya, Althea, Elena were very busy characters in the science-fiction novels! That’s certainly why my short sci-fi story generator was often yielding the same characters’ names.

Conclusion

This experience reveals a crucial nuance in how we should think about AI and creativity. We often treat LLMs as black boxes of boundless imagination. But in reality, their creative output is a mirror reflecting the data they have ingested. For broad topics, where the training data is vast and diverse (the entire internet, for example), this mirror is so large and multifaceted that the reflections appear endlessly unique. But for more niche domains, like the specific subgenres of science fiction, the mirror is smaller. The reflections become more focused, more repetitive, and the patterns become obvious.

So, are LLMs uncreative? The answer is more complex than a simple yes or no. Their creativity is not one of imagination in the human sense, but of sophisticated pattern recognition and recombination. When the patterns are limited, so is the apparent creativity. This doesn’t diminish their power as tools, but it does highlight the critical role of data diversity. For AI to be a truly powerful creative partner in specialized fields, it needs to be fed a rich and varied diet of information from that domain.

But it gave me some ideas for improving my story generator, to further enhance its creativity, by focusing on creating more diverse names first, regardless of the science-fiction focus (to avoid staying in the pit of the common names), building up their profiles, and only then injecting them in the context of the sci-fi world… So I may come back to this creative process in upcoming episodes on this blog!

Until then, we may have to get used to hearing more tales of Dr. Thorne and his adventures across the galaxy! To infinity and beyond!

https://glaforge.dev/posts/2025/07/22/the-sci-fi-naming-problem-are-llms-less-creative-than-we-think/
AI Agents, the New Frontier for LLMs

I recently gave a talk titled “AI Agents, the New Frontier for LLMs”. The session explored how we can move beyond simple request-response interactions with Large Language Models to build more sophisticated and autonomous systems.

If you’re already familiar with LLMs and Retrieval Augmented Generation (RAG), the next logical step is to understand and build AI agents.

What makes a system “agentic”?

An agent is more than just a clever prompt. It’s a system that uses an LLM as its core reasoning engine to operate autonomously. The key characteristics that make a system “agentic” include:

  • Planning and decomposition: The ability to break down a complex goal into a sequence of smaller, manageable steps.
  • Tool use: The capacity to interact with external systems, APIs, or data sources to gather information or perform actions (via “function calling”). This could be anything from searching the web to querying a database or calling a specific function.
  • Reflection: The capability to analyze its own actions and their outcomes, learn from mistakes, and refine its plan to achieve the final objective.

This matters because it’s a fundamental shift from simply asking an LLM for information to tasking a system with achieving a goal. An agent can handle ambiguity and orchestrate a series of operations to deliver a result that a single LLM call cannot.

Common Design Patterns

To build these agents, we rely on established design patterns that provide structure to their autonomous behavior. In the talk, I cover several of these with concrete code examples, including:

  • ReAct (Reason and Act): This is a foundational pattern where the agent iterates through a loop of reasoning about the next best action, taking that action (often with a tool), and then observing the outcome to inform its next step.
  • Function calling: This allows the model to declare that it needs to invoke an external tool or function and to provide the necessary arguments. The system then executes the function and feeds the result back to the model so it can proceed.
  • Human-in-the-Loop: For tasks that require validation, approval, or handling ambiguity, this pattern ensures that the agent can pause its execution and request input from a human user before continuing.
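To make the ReAct loop more concrete, here is a minimal plain-Java sketch of the reason / act / observe cycle. It does not use LangChain4j or ADK: the `ReActLoopSketch` class, the `Step` record, the `lookupPopulation` tool, and the scripted "reasoner" function that stands in for the LLM are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: a scripted function plays the role of the LLM.
final class ReActLoopSketch {

    /** One decision of the model: either call a tool, or give the final answer. */
    record Step(String tool, String argument, String finalAnswer) {
        static Step call(String tool, String argument) { return new Step(tool, argument, null); }
        static Step answer(String text) { return new Step(null, null, text); }
        boolean isFinal() { return finalAnswer != null; }
    }

    /** Runs the loop until the reasoner gives a final answer or maxSteps is reached. */
    static String run(String goal,
                      Function<List<String>, Step> reasoner,
                      Map<String, Function<String, String>> tools,
                      int maxSteps) {
        List<String> transcript = new ArrayList<>();
        transcript.add("Goal: " + goal);
        for (int i = 0; i < maxSteps; i++) {
            Step step = reasoner.apply(transcript);          // reason about the next action
            if (step.isFinal()) {
                return step.finalAnswer();
            }
            Function<String, String> tool = tools.get(step.tool());
            String observation = (tool == null)
                    ? "Unknown tool: " + step.tool()
                    : tool.apply(step.argument());           // act by calling the tool
            transcript.add("Action: " + step.tool() + "(" + step.argument() + ")");
            transcript.add("Observation: " + observation);   // observe the outcome
        }
        return "Stopped after " + maxSteps + " steps without a final answer";
    }

    public static void main(String[] args) {
        Map<String, Function<String, String>> tools =
                Map.of("lookupPopulation", city -> city.equals("Berlin") ? "3.7 million" : "unknown");
        // Scripted stand-in for the model: call the tool once, then answer from the observation.
        Function<List<String>, Step> scriptedReasoner = transcript -> {
            String last = transcript.get(transcript.size() - 1);
            return last.startsWith("Observation: ")
                    ? Step.answer("Berlin has " + last.substring("Observation: ".length()) + " inhabitants")
                    : Step.call("lookupPopulation", "Berlin");
        };
        System.out.println(run("How many people live in Berlin?", scriptedReasoner, tools, 5));
    }
}
```

In a real agent, the reasoner would be an LLM call that receives the transcript and the tool descriptions, and the `maxSteps` guard is what keeps a confused model from looping forever.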

The presentation demonstrates these concepts with practical examples, including a RAG agent and a story-generation application, using frameworks like LangChain4j and the new Agent Development Kit (ADK, in particular ADK for Java). I also touch on emerging standards for agent-to-agent communication (via the A2A protocol), and how to interact with external tools via MCP (the Model Context Protocol).

If you are interested in learning how to build systems that can reason, plan, and act, you can find the full recording (for now only in French) and the accompanying slides below.

The abstract

Do you know Large Language Models like the back of your hand? Have you mastered Retrieval Augmented Generation to help an LLM search your documents? It’s time to dive into the wonderful world of intelligent agents, the next frontier for LLMs!

In this session, we will first define what agents are, or at least what makes a system “agentic”. We will explain the limitations of LLMs. Then, through concrete examples, we will implement different agents in Java, using the LangChain4j framework and ADK (the Agent Development Kit), to illustrate some typical agent patterns and to understand how to go beyond a simple LLM call to obtain responses that meet the needs of your users, or even to trigger actions with the surrounding system.

But that’s not all we’ll learn about! An agent doesn’t live alone on a deserted tropical island. Indeed, it can communicate with other agents via tools that can be invoked thanks to the Model Context Protocol (MCP). Agents can also interact with remote agents from other platforms and ecosystems, thanks to the Agent To Agent protocol (A2A).

Are you ready for the next hype on agents? Come and discover it in this session!

The slide deck The video of the talk

I had the chance to give this talk at Devoxx France. The only recording I have right now is in French 🇫🇷, but hopefully, once this talk is available in English, I’ll update this post to also share an English version of it.

https://glaforge.dev/talks/2025/07/16/ai-agents-the-new-frontier-for-llms/
Advanced RAG — Using Gemini and long context for indexing rich documents (PDF, HTML...)

A very common question I get when presenting and talking about advanced RAG (Retrieval Augmented Generation) techniques is how to best index and search rich documents like PDFs (or web pages) that contain both text and rich elements, like pictures or diagrams.

Another very frequent question that people ask me is about RAG versus long context windows. Indeed, models with long context windows usually have a more global understanding of a document, and each excerpt in its overall context. But of course, you can’t feed all the documents of your users or customers in one single augmented prompt. Also, RAG has other advantages like offering a much lower latency, and is generally cheaper.

However, the answer I usually give is that you can take the best of both worlds, with a hybrid approach:

  • You can use a RAG approach (or a mix of keyword search engine, graph RAG, or vector-based RAG) to find relevant documents for the user query.
  • And then feed only those key documents in the context window of a model like Gemini that accepts 1+ million tokens.

That way, the model can focus on whole documents, with a finer understanding of each one, but is not overwhelmed with too many documents to find the needle in the haystack.
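Here is a rough plain-Java sketch of this two-step hybrid flow, under simplifying assumptions: retrieval is reduced to a naive keyword-overlap score (a real system would use a search engine or a vector store), and the `HybridContextSketch` class, the `Doc` record, and the prompt layout are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of "retrieve a few documents, then put them whole in the context".
final class HybridContextSketch {

    record Doc(String id, String fullText) {}

    /** Lower-cased distinct words of a text. */
    static Set<String> words(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toSet());
    }

    /** Naive relevance score: how many distinct query words appear in the document. */
    static long keywordScore(String query, Doc doc) {
        Set<String> docWords = words(doc.fullText());
        return words(query).stream().filter(docWords::contains).count();
    }

    /** Step 1: retrieval narrows the corpus down to the few most relevant documents. */
    static List<Doc> topDocuments(String query, List<Doc> corpus, int limit) {
        return corpus.stream()
                .sorted(Comparator.comparingLong((Doc d) -> keywordScore(query, d)).reversed())
                .limit(limit)
                .toList();
    }

    /** Step 2: the selected documents go into the prompt whole, not as isolated chunks. */
    static String buildPrompt(String query, List<Doc> selected) {
        StringBuilder prompt = new StringBuilder("Answer using only the documents below.\n\n");
        for (Doc doc : selected) {
            prompt.append("--- ").append(doc.id()).append(" ---\n")
                  .append(doc.fullText()).append("\n\n");
        }
        return prompt.append("Question: ").append(query).toString();
    }
}
```

The resulting prompt string would then be sent to a long-context model. The `limit` parameter is the knob that trades coverage against the risk of burying the needle in too much hay.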

The current trending topic of context engineering is exactly about this: it’s crucial to give the best contextual information to your favorite LLM!

Using Gemini to finely index a rich PDF document for RAG search

Before feeding the relevant document in the context window of the LLM, we first need to index it.

About hypothetical question generation

In my last article, I explored an interesting technique called Hypothetical Questions embedding. This technique works great with question/answer tasks, as the idea is to compare “hypothetical” questions to user questions. This tends to give higher levels of similarity when calculating the vector similarity for the semantic search.

One aspect I haven’t explained in detail is how many questions we should generate for a given paragraph or set of sentences. It seemed to me that 10 questions or so per chunk of 2000 characters or less was a good rule of thumb. But if you use an LLM to parse a document, and let it figure out sections of text that make sense together, the LLM can just as well figure out which questions to ask, and how many of them, depending on both the length of the given text and its semantic density.

What about diagrams and other pictures?

Sometimes customers I talk to manage to extract and embed the text in rich documents, but they’re not sure how to treat the various diagrams and images in those documents. They may also use an image model to calculate embeddings for whole pages (containing both the text and the pictures). It’s possible to use image models to extract bounding boxes around images and then somehow extract just the boxes containing the pictures. Then they analyze the picture, get a text description, and embed that text.

Again, a multimodal model like Gemini is actually able to look at each page, see where all the diagrams are, and see all the images in each page, to extract some meaningful description.

💡 The idea: Use an LLM to chunk text, generate questions, and describe pictures

All at once! Let’s use Gemini to do this for us (or any multimodal LLM with a context window large enough to contain a whole document of your corpus). We’ll use structured output to get a JSON output that you can easily parse to extract the key information (chunks, questions, image descriptions, but also pages or titles).

The smart prompt

Let’s have a look at the following prompt which we can use as system instructions for an LLM call:

Your goal is to analyze long documents for Retrieval Augmented
Generation (RAG) by splitting the content in pieces.
The content pieces will be the chunks that will be embedded
thanks to an embedding model to calculate vector embedding
to store in a vector database.

Given a long PDF document, you must extract and format the
content so that each piece of content is meaningful
as a standalone unit of text.

Each piece of content should be tagged with the page number
it appears in, and the title of the section it's situated in,
as well as a list of questions whose answers can be provided
by this piece of content.

If there is a table of content, you can use it for adding
more context to a given excerpt, or for the section titles,
but don't return the table of content as an excerpt.

When you encounter an image, a picture, or a diagram,
the piece of content you return should be the description
of the image with as much detail as possible.
If the figure has a caption, use it as the title
of the piece of content.
Like for text excerpts, use the title of the current section
the image is in in its description.

Don't create pieces of text for menu, navigation,
and other unrelated elements.
But be sure to create a piece of content for all the text
of the document.
Go through ALL the pages of the document.
The piece of text should be the exact text found in the article.
DON'T change a single word. DON'T summarize.

Let’s examine the prompt, but feel free to enhance or expand it for your needs:

  • The first paragraph gives the general goal we try to achieve, and explains the general idea of the task we ask the LLM.
  • The second paragraph tells the LLM it’s important to create pieces of content (chunks!) that are meaningful. It tends to help it find relevant chunks that talk about the same topic or aspect, instead of splitting the document into random and/or fixed-length pieces.
  • The third paragraph details what information to carry for each individual piece of content. We want to retain:
    • the page number (in particular for PDF documents),
    • the title of the section the text appears in (if it’s a structured document with various sections and header titles),
    • a list of questions in the spirit of the Hypothetical Question Embedding technique we’ve used in the previous article,
    • and of course the chunk itself.
  • The fourth paragraph says that we don’t need to create chunks for the table of contents, but that a table of contents can actually be useful for crafting the titles of the sections.
  • The fifth paragraph now talks about diagrams, pictures, etc. It tells the LLM it should actually create a detailed description of the image as a textual representation. It also advises to pay attention to captions or the section the picture appears in.
  • The last paragraph suggests ignoring certain elements of the page or document, like navigation elements, and also asks the LLM to create chunks of text for the whole document without missing anything, and to quote the text as is, without any rewriting or summarization.

Generally, the LLM will generate a number of questions that makes sense for the length and semantic density of the chunk. A smaller paragraph with more pieces of key information could yield more questions than a much longer paragraph that doesn’t carry much semantic meaning.

It’s important to give further advice on which content it can safely ignore, or that it should really take the text verbatim without any creative changes it could think of.

For the illustrations, not only will the LLM give you a detailed description of them, but it will also generate questions that can be answered by these illustrations. For example, if it’s a picture of a flag of Germany, it may generate a question like what are the colors of the German flag.

Defining the structured output

In order to guide the LLM to generate a useful JSON output that you can then later parse easily, you can define its structured output JSON schema. Models like Gemini support this feature. Sometimes, some models require you to define the format of the JSON output via prompting, but more modern and bigger models usually support this feature out of the box.

The structured output I defined:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "title": {
        "type": "string",
        "description": "The title of the content."
      },
      "text": {
        "type": "string",
        "description": "The text content."
      },
      "page": {
        "type": "integer",
        "description": "The page number."
      },
      "questions": {
        "type": "array",
        "description": "A list of questions whose answers can be found in this piece of text.",
        "items": {
          "type": "string"
        }
      }
    },
    "required": ["title", "text", "page", "questions"]
  }
}

Here, we have an array of pieces that contain a title, a text chunk (which might be the detailed description of a picture), a page, and a list of hypothetical questions. All those fields are required.

I didn’t put any top level information like the title of the whole document, its location (URL/URI or unique ID), or potentially some tags that help categorize the document. But these are extremely useful bits of information you can also request the LLM to generate for you. I wrote about using LLMs to tag pictures in an older article.

The prompt in action

I’ve exported the Wikipedia article about the city of Berlin. I saved it as a PDF document. Then I used the prompt from the previous section as system instructions. I passed the PDF document as a user message. I simply tweaked my prompt by using Google AI Studio but feel free to use any tool you prefer.

Let’s see some extracts of the JSON that Gemini generated for this PDF.

For the first chunk, it followed exactly the content of the first paragraph of the Wikipedia page:

{
  "page": 1,
  "questions": [
    "What is Berlin's status in Germany?",
    "What is the population of Berlin?",
    "Which state surrounds Berlin?",
    "What is the population of Berlin's urban and metropolitan areas?"
  ],
  "text": "Berlin (/bɜːrˈlɪn/ bur-LIN; German: [bɛɐ̯ˈliːn]) is the capital and largest city of Germany, by both area and population.[10] With 3.7 million inhabitants,[5] it has the highest population within its city limits of any city in the European Union. The city is also one of the states of Germany, being the third smallest state in the country by area. Berlin is surrounded by the state of Brandenburg, and Brandenburg's capital Potsdam is nearby. The urban area of Berlin has a population of over 4.6 million and is therefore the most populous urban area in Germany.[6][11] The Berlin-Brandenburg capital region has around 6.2 million inhabitants and is Germany's second-largest metropolitan region after the Rhine-Ruhr region,[5] as well as the fifth-biggest metropolitan region by GDP in the European Union.[12]",
  "title": "Berlin"
}

It used Berlin as the title, 1 as the page since we’re on the first page, and it generated 4 questions, including questions about the population of the city, where it’s geographically situated, or what its relationship with the country itself is (it’s the capital).

For the info box showing the flag and coat of arms of the city:

{
  "page": 1,
  "questions": [
    "What does the flag of Berlin look like?",
    "What does the coat of arms of Berlin depict?"
  ],
  "text": "The infobox shows the flag of Berlin, which is a horizontal tricolor of red, white, and red, with the city's coat of arms in the center of the white band. The coat of arms of Berlin is also shown, depicting a black bear standing rampant with its tongue out on a silver shield, topped by a golden crown.",
  "title": "Flag and Coat of arms"
}

It clearly described the orientation and colors of the bands of the Berlin flag, and described the bear on the coat of arms. The title also mentioned the flag and coat of arms. Some embedding models accept title metadata that helps them better understand the context when calculating vector embeddings. The questions are also quite obvious: what the flag looks like, or what is depicted in the coat of arms.

I won’t go through the whole (and long!) document, but you’ll notice that for some short paragraphs with dense information, the LLM can generate a lot more questions than for a paragraph that doesn’t say much (for example with more descriptive but less important details).

What to do with this structured output?

As we’ve activated structured output, we had the following JSON structure returned by the LLM:

[
  {
     "page": 1,
     "questions": [
       "question 1", "question 2"...
     ],
    "text": "...",
    "title": "..."
  }...
]

With that JSON document, you can go through each chunk, calculate vector embeddings for both questions and the chunk itself, and store the page and title as metadata, in a database supporting vector operations. Then you’re ready to go with your RAG pipeline!
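As a sketch of that last step, the plain-Java snippet below flattens the parsed pieces into index entries: one for the chunk itself and one per hypothetical question, each carrying the page and title as metadata. The `IndexEntriesSketch` class and the `IndexEntry` record are hypothetical, and the actual embedding call and database write are deliberately left out, since they depend on the embedding model and vector store you use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: prepares what to embed, without calling any embedding model.
final class IndexEntriesSketch {

    /** Same shape as the JSON objects returned by the model. */
    record PieceOfContent(int page, String title, String text, List<String> questions) {}

    /** The text to embed, the chunk to return at retrieval time, and the metadata kept alongside. */
    record IndexEntry(String textToEmbed, String chunk, Map<String, Object> metadata) {}

    static List<IndexEntry> toIndexEntries(List<PieceOfContent> pieces) {
        List<IndexEntry> entries = new ArrayList<>();
        for (PieceOfContent piece : pieces) {
            Map<String, Object> metadata = Map.of("page", piece.page(), "title", piece.title());
            // One entry for the chunk itself...
            entries.add(new IndexEntry(piece.text(), piece.text(), metadata));
            // ...and one entry per hypothetical question, all pointing back to the same chunk.
            for (String question : piece.questions()) {
                entries.add(new IndexEntry(question, piece.text(), metadata));
            }
        }
        return entries;
    }
}
```

Each `textToEmbed` would then be passed to the embedding model, and the resulting vector stored together with `chunk` and `metadata`.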

Implementation with LangChain4j

This approach is pretty straightforward to implement with any framework, and of course, with LangChain4j in Java.


First, describe the data structure that will be used for the structured output:

@Description("A single piece of content")
record PieceOfContent(
    @Description("Number of the page this text appears in")
    int page,
    @Description("Title of the section of this text")
    String title,
    @Description("The text chunk")
    String text,
    @Description("List of questions that can be answered by this text")
    List<String> questions
) {}

Then, configure the chat model, here we’ll use Gemini 2.5 Flash, and load the document:

try (VertexAiGeminiChatModel model = VertexAiGeminiChatModel.builder()
    .project(System.getenv("GCP_PROJECT_ID"))
    .location(System.getenv("GCP_LOCATION"))
    .modelName("gemini-2.5-flash")
    .responseSchema(SchemaHelper.fromClass(PieceOfContent[].class))
    .build()) {

    Path pathToPdf = Path.of("src/main/resources/Berlin-Wikipedia.pdf");
    byte[] pdfBytes = Files.readAllBytes(pathToPdf);
    String encodedPdf = Base64.getEncoder().encodeToString(pdfBytes);

Now, it’s time to prepare the call to the model:

ChatResponse chatResponse = model.chat(ChatRequest.builder()
.messages(SystemMessage.from("""
        Your goal is to analyze long documents for Retrieval Augmented
        Generation (RAG) by splitting the content in pieces.
        The content pieces will be the chunks that will be embedded
        thanks to an embedding model to calculate vector embedding
        to store in a vector database.

        Given a long PDF document, you must extract and format
        the content so that each piece of content is meaningful
        as a standalone unit of text.

        Each piece of content should be tagged with the page number
        it appears in, and the title of the section it's situated in,
        as well as a list of questions whose answers can be provided
        by this piece of content.

        If there is a table of content, you can use it for adding
        more context to a given excerpt, or for the section titles,
        but don't return the table of content as an excerpt.

        When you encounter an image, a picture, or a diagram, the piece
        of content you return should be the description of the image
        with as much detail as possible.
        If the figure has a caption, use it as the title of the piece
        of content.
        Like for text excerpts, use the title of the current section
        the image is in in its description.

        Don't create pieces of text for menu, navigation, references,
        and other unrelated elements.
        But be sure to create a piece of content
        for all the text of the document.
        Go through ALL the pages of the document.
        The piece of text should be the exact text found in the article.
        DON'T change a single word. DON'T summarize.
        """),
    UserMessage.from(
        TextContent.from("Analyze the following document:"),
        PdfFileContent.from(PdfFile.builder()
            .base64Data(encodedPdf)
            .mimeType("application/pdf")
            .build()))
)
.build());

String responseText = chatResponse.aiMessage().text();
PieceOfContent[] piecesOfContent =
    new Gson().fromJson(responseText, PieceOfContent[].class);

We passed the prompt we talked about, and the PDF file of the Wikipedia page. The model then returns the JSON structure that we unmarshal into an array of our record.

Conclusion

Taking advantage of a large language model like Gemini lets you skip hand-crafting your own chunking logic, and instead rely on the LLM to do semantic chunking for you. You can read more about Retrieval Augmented Generation in some of my previous articles on the topic.

https://glaforge.dev/posts/2025/07/14/advanced-rag-using-gemini-and-long-context-for-indexing-rich-documents/
Advanced RAG — Hypothetical Question Embedding

In the first article of this Advanced RAG series, I talked about an approach I called sentence window retrieval, where we calculate vector embeddings per sentence, but the chunk of text returned (and added in the context of the LLM) also contains the surrounding sentences, to add more context to that embedded sentence. This tends to give a better vector similarity than the whole surrounding context. It is one of the techniques I’m covering in my talk on advanced RAG techniques.

Today, I’d like to cover another technique I often use in applications which are more Question/Answer focused, where users ask questions, to find answers contained in the indexed documents: Hypothetical Question Embedding.

This is an approach I first discovered in this article which covers both hypothetical question embedding and hypothetical document embedding (HyDE, which we might cover in another article later on). Comparing user queries to hypothetical questions is the technique we’ll study today.

The intuition behind Hypothetical Questions

When explaining vector similarity (or distance), we usually say that embedding vectors of user queries are closer to vector embeddings of text chunks that contain the answer to that query. It’s generally true, and that’s why simple fixed-size chunking approaches (with overlap) usually work well enough. However, this naive approach compares questions to text containing potential answers.

💡 Intuitively, wouldn’t it be better to compare user questions to other questions? Or to compare a hypothetical answer (even if wrong) to text chunks with the answer?

Let’s say you want to embed this chunk of text from the Wikipedia page of Berlin:

Berlin is the capital and largest city of Germany, by both area and population. With 3.7 million inhabitants, it has the highest population within its city limits of any city in the European Union. The city is also one of the states of Germany, being the third smallest state in the country by area. Berlin is surrounded by the state of Brandenburg, and Brandenburg’s capital Potsdam is nearby. The urban area of Berlin has a population of over 4.6 million and is therefore the most populous urban area in Germany. The Berlin-Brandenburg capital region has around 6.2 million inhabitants and is Germany’s second-largest metropolitan region after the Rhine-Ruhr region, as well as the fifth-biggest metropolitan region by GDP in the European Union.

The idea is to ask an LLM, like Gemini, to generate questions about this paragraph, with a prompt similar to the following:

Suggest 10 clear questions whose answer could be given by the user provided text. Don’t use pronouns, be explicit about the subjects and objects of the question.

You might want to change the number of questions generated, depending on the length of the chunk of text, or if you know that some documents you embed seem to have a higher or lower density of information. You can also let the LLM figure out on its own how many questions it could ask.

The second sentence of the prompt is critical to keep the LLM from generating questions with pronouns, like its population is.... You want fully-qualified named entities, so that each question carries the whole context of the information.

For this paragraph, the LLM could generate 10 questions like the following ones:

  1. What city is the capital of Germany?
  2. What is the population of Berlin within the city limits?
  3. Which city in the European Union has the highest population within its city limits?
  4. What is Berlin’s rank by area among the states of Germany?
  5. Which German state surrounds the city of Berlin?
  6. What is the population of the urban area of Berlin?
  7. What is the most populous urban area in Germany?
  8. How many inhabitants does the Berlin-Brandenburg capital region have?
  9. What is Germany’s second-largest metropolitan region?
  10. What is the rank of the Berlin-Brandenburg metropolitan region by GDP in the European Union?

A user query such as What is the population of Berlin or How many inhabitants live in Berlin? would match best (with the highest similarity) with the second generated question: What is the population of Berlin within the city limits?

When storing the results in the database with vector support, you will have one record per question. You will have the vector embedding of each question, associated with the whole paragraph each time. There’s redundancy here, as the chunk of text is repeated as many times as there are questions. So this is a technique that uses more space. And it also takes more time (and potentially costs more) to embed a whole document, as you have to call an LLM for each chunk. But we’ll come back to the pros and cons in the following section.

Upon retrieval, the user question is compared to all those generated questions. And at prompt augmentation time, it’s the text chunk that is returned, not the generated question.
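The retrieval step can be sketched in plain Java as follows. To keep the example self-contained, a toy bag-of-words vector stands in for a real embedding model (so it only captures shared words, not paraphrases), and the `QuestionRetrievalSketch` class and `QuestionRecord` record are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the toy embed() below is NOT a real embedding model.
final class QuestionRetrievalSketch {

    /** One stored record per generated question, each pointing at its source chunk. */
    record QuestionRecord(String question, String chunk) {}

    /** Toy stand-in for an embedding model: a bag-of-words term-frequency vector. */
    static Map<String, Integer> embed(String text) {
        Map<String, Integer> vector = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                vector.merge(word, 1, Integer::sum);
            }
        }
        return vector;
    }

    static double norm(Map<String, Integer> vector) {
        double sum = 0;
        for (int value : vector.values()) {
            sum += value * value;
        }
        return Math.sqrt(sum);
    }

    /** Cosine similarity between two sparse vectors (0 when either one is empty). */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            dot += entry.getValue() * b.getOrDefault(entry.getKey(), 0);
        }
        double norms = norm(a) * norm(b);
        return norms == 0 ? 0 : dot / norms;
    }

    /** Compares the query to the stored questions, but returns the chunk of the best match. */
    static String retrieveChunk(String userQuery, List<QuestionRecord> index) {
        Map<String, Integer> queryVector = embed(userQuery);
        QuestionRecord best = null;
        double bestScore = -1;
        for (QuestionRecord candidate : index) {
            double score = cosine(queryVector, embed(candidate.question()));
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best == null ? null : best.chunk();
    }
}
```

With a real embedding model and a vector database, `embed` and the scan over `index` would be replaced by the model call and a nearest-neighbor query, but the key point stays the same: questions are matched, chunks are returned.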

Try it out

If you want to test this idea of hypothetical question embedding, feel free to go ahead and try this application I vibe-coded with Gemini Canvas (for scaffolding the UI) and Gemini CLI (for creating the server app) and deployed to Cloud Run.

You can:

  • enter a chunk of text,
  • generate hypothetical questions for that chunk,
  • compare the vector embeddings between the user query, the document, and the hypothetical questions.

Hypothetical Question embedding vs fixed-sized chunk embedding

What are the pros and cons of each approach?

Classical Fixed-Sized Chunk Embedding

This is the most straightforward method. You simply split your documents into chunks of a fixed size (e.g., 500 characters) and then create an embedding for each chunk.
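As an illustration, a fixed-size chunker with overlap fits in a few lines of plain Java. The `FixedSizeChunker` class and its parameters are hypothetical; frameworks like LangChain4j ship their own document splitters, which this sketch does not try to reproduce.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of character-based chunking with overlap.
final class FixedSizeChunker {

    /**
     * Splits text into chunks of at most `size` characters;
     * consecutive chunks share `overlap` characters.
     */
    static List<String> chunk(String text, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("Need size > 0 and 0 <= overlap < size");
        }
        List<String> chunks = new ArrayList<>();
        int step = size - overlap; // how far the window advances each time
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + size, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last chunk reached the end of the text
            }
        }
        return chunks;
    }
}
```

For example, `FixedSizeChunker.chunk("abcdefghij", 4, 1)` yields the chunks `abcd`, `defg`, and `ghij`, where each chunk starts with the last character of the previous one.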

Pros: 👍

  • Simplicity and speed: It’s easy to implement and computationally efficient, making it great for large datasets.
  • Predictable size: Uniform chunk sizes make it easy to manage and process without resource spikes.

Cons: 👎

  • Context splitting: This method can cut sentences or even words in half, leading to a loss of meaning and context. However, this is generally mitigated by using an overlap between chunks.
  • Loss of coherence: Arbitrarily splitting text can make it difficult for the model to understand the overall narrative or argument.
  • “Lost in the middle” problem: Important information can be lost if it’s located in the middle of a long document, as the smaller, more focused chunks might not capture the broader context. It can be mitigated by storing and returning a wider surrounding context than what was embedded (i.e. the technique from the first article of this series).
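
As a rough illustration of fixed-size chunking with an overlap, here is a small plain-Java sketch. The FixedSizeChunker class and its split method are hypothetical names, not a library API, and the sizes in main are deliberately tiny so that the overlap is visible.

```java
import java.util.ArrayList;
import java.util.List;

public class FixedSizeChunker {

    // Split text into chunks of at most chunkSize characters.
    // Each chunk starts (chunkSize - overlap) characters after the previous one,
    // so consecutive chunks share `overlap` characters.
    static List<String> split(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("need 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last chunk has been emitted
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // prints [abcd, defg, ghij]
        System.out.println(split("abcdefghij", 4, 1));
    }
}
```

A character-based splitter like this one is exactly what can cut words in half; splitting on paragraph or sentence boundaries first reduces that risk.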

Hypothetical Question Embedding

With this technique, you use a language model to generate questions for each chunk of your document. Then, you embed these questions instead of the document chunks themselves. When a user asks a question, the system compares their query to the embedded hypothetical questions. But when augmenting the context of the LLM with the results of the vector search, you actually return the initial text from which the questions were generated.

Pros: 👍

  • Improved alignment: You are comparing a question to a question, which can lead to better semantic matching than comparing a question to a document chunk. This can significantly improve retrieval accuracy.
  • Addresses the “Lost in the Middle” problem: By generating questions for all parts of a document, you’re more likely to retrieve relevant information regardless of where it is.

Cons: 👎

  • Increased index size: Generating multiple questions per document chunk means your vector index can become much larger, potentially slowing down search and increasing storage costs. This is because you store the chunk of text as many times as the number of questions which were generated.
  • Upfront computational cost: You need to use a language model to generate all the hypothetical questions, which can be time-consuming and expensive if you use hosted models that you pay by the token.
  • Quality depends on generated questions: The effectiveness of this method is entirely dependent on the quality of the generated questions. If the language model fails to generate relevant questions, the retrieval will suffer.
  • Non-deterministic questions: Because this approach relies on an LLM, re-indexing your documents (for example after an update) might produce a quite different batch of questions, even with the same embedding model, as the LLM won’t necessarily generate the same questions each time.

Implementation details

The Hypothetical Question approach can be implemented in any language or framework. But for the sake of this article, I’ll be using LangChain4j in Java.

You can have a look at the whole source code of this implementation, but I’ll explain in detail below what’s happening.

At ingestion time

Let’s ingest the documents, generate questions, and calculate the vector embeddings.

The first thing to do is to load the document (I saved the Wikipedia article as a text file):

Document documentAboutBerlin = FileSystemDocumentLoader.loadDocument("berlin.txt", new TextDocumentParser());

Let’s configure the large language model, here with Gemini from Vertex AI, using a response schema, to force the model to return a JSON array of hypothetical question strings:

VertexAiGeminiChatModel gemini = VertexAiGeminiChatModel.builder()
    .project(System.getenv("GCP_PROJECT_ID"))
    .location(System.getenv("GCP_LOCATION"))
    .modelName(MODEL_GEMINI_FLASH)
    .maxRetries(5)
    .responseSchema(Schema.newBuilder()
        .setType(Type.ARRAY)
        .setItems(Schema.newBuilder().setType(Type.STRING).build())
        .build())
    .build();

The QuestionParagraph record will hold a pair of question and the current text paragraph whose content can answer the question:

record QuestionParagraph(
    String question,
    TextSegment paragraph
) {}

The question/paragraph pairs are held in a list:

List<QuestionParagraph> allQuestionParagraphs = new ArrayList<>();

I’m splitting the document into paragraphs, with paragraphs no longer than 2000 characters, and with an overlap of 100 characters, when the threshold is hit:

DocumentByParagraphSplitter splitter =
    new DocumentByParagraphSplitter(2000, 100);
List<TextSegment> paragraphs = splitter.split(documentAboutBerlin);

Now comes the interesting part! For each paragraph, I ask the LLM to generate 10 questions, and I store the question/paragraph pairs in the allQuestionParagraphs list:

for (TextSegment paragraphSegment : paragraphs) {
    ChatResponse aiResult = gemini.chat(
        SystemMessage.from("""
            Suggest 10 clear questions whose answer could be given
            by the user provided text.

            Don't use pronouns, be explicit about the subjects
            and objects of the question.
            """),
        UserMessage.from(paragraphSegment.text())
    );
    String[] questions =
        gson.fromJson(aiResult.aiMessage().text(), String[].class);

    for (int i = 0; i < questions.length; i++) {
        String question = questions[i];
        allQuestionParagraphs.add(
            new QuestionParagraph(question, paragraphSegment));
    }
}

The other important piece is to calculate the vector embeddings of the questions, but save the text of the paragraph in the embedding vector store:

List<TextSegment> embeddedSegments = allQuestionParagraphs.stream()
    .map(questionParagraph -> TextSegment.from(
        questionParagraph.question(),
        new Metadata().put(
            PARAGRAPH_KEY, questionParagraph.paragraph().text())))
    .toList();

List<Embedding> embeddings =
    embeddingModel.embedAll(embeddedSegments).content();
embeddingStore.addAll(embeddings, embeddedSegments);

I’m using the text-embedding-004 model (configured elsewhere in the code). You can also use text-embedding-005 or the new gemini-embedding-01 model, but the latter lacks batching for now, so you can’t embed several text segments at once.

The key aspect to pay attention to is that I store extra metadata: the actual paragraph. But it’s really the hypothetical question whose embedding vector is calculated. But the metadata will be useful at retrieval time, to inject the paragraph in the LLM prompt.

At retrieval time

Now that the ingestion is done, let’s have a look at the retrieval phase, when users ask questions.

Let’s use the low-level components of LangChain4j to do the search:

EmbeddingSearchResult<TextSegment> searchResults =
    embeddingStore.search(EmbeddingSearchRequest.builder()
        .maxResults(4)
        .minScore(0.7)
        .queryEmbedding(embeddingModel.embed(queryString).content())
        .build());

This means we’re calculating the vector embedding of queryString (the user’s question), and comparing it with the other vectors stored in the database. We want to retrieve at most 4 results, with a minimum similarity score of 0.7 (the value ranges between 0 and 1).
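
The combined effect of maxResults and minScore can be pictured with a small plain-Java sketch. This is not how LangChain4j or any embedding store implements search; the Scored record and the search method are hypothetical, and the scores are invented for the example.

```java
import java.util.Comparator;
import java.util.List;

public class TopKSearch {

    // A candidate text with its (already computed) similarity score. Hypothetical type.
    record Scored(String text, double score) {}

    // Keep matches scoring at least minScore, best first, at most maxResults of them.
    static List<Scored> search(List<Scored> candidates, int maxResults, double minScore) {
        return candidates.stream()
            .filter(c -> c.score() >= minScore)
            .sorted(Comparator.comparingDouble(Scored::score).reversed())
            .limit(maxResults)
            .toList();
    }

    public static void main(String[] args) {
        List<Scored> candidates = List.of(
            new Scored("question A", 0.90),
            new Scored("question B", 0.65), // below the threshold, dropped
            new Scored("question C", 0.80),
            new Scored("question D", 0.75),
            new Scored("question E", 0.95),
            new Scored("question F", 0.71)); // above the threshold, but only 4 results are kept
        search(candidates, 4, 0.7).forEach(System.out::println);
    }
}
```

With a high threshold, fewer than maxResults matches (possibly none) can come back, so the calling code should cope with an empty result.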

Important Remark

A very important remark: be sure to use the same embedding model for both ingestion and retrieval. Otherwise the vector embedding values will likely be drastically different, and will give totally garbage results.

Now it’s time to do the prompt augmentation, by injecting all the paragraphs associated with the closest vectors of the query:

Concatenate all the relevant paragraphs:

String concatenatedExtracts = searchResults.matches().stream()
    .map(match -> match.embedded().metadata().getString(PARAGRAPH_KEY))
    .distinct()
    .collect(Collectors.joining("\n---\n", "\n---\n", "\n---\n"));
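
The distinct() call matters because several hypothetical questions can point at the same paragraph. Here is a standalone sketch of that step using only the JDK; the ExtractJoiner class and the sample strings are made up for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ExtractJoiner {

    // De-duplicate the matched paragraphs, then join them with a separator,
    // also adding the separator before the first and after the last extract.
    static String joinExtracts(List<String> matchedParagraphs) {
        return matchedParagraphs.stream()
            .distinct()
            .collect(Collectors.joining("\n---\n", "\n---\n", "\n---\n"));
    }

    public static void main(String[] args) {
        // Three of the four best-matching questions came from the same paragraph.
        List<String> matches =
            List.of("Paragraph A", "Paragraph B", "Paragraph A", "Paragraph A");
        System.out.println(joinExtracts(matches));
    }
}
```

In this example, four matches yield only two extracts in the prompt, so maxResults is an upper bound on the number of paragraphs injected, not an exact count.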

And augment the prompt with those extracts:

UserMessage userMessage = PromptTemplate.from("""
    You must answer the following question:
    {{question}}

    Base your answer on the following documentation extracts:
    {{extracts}}
    """).apply(Map.of(
    "question", queryString,
    "extracts", concatenatedExtracts
)).toUserMessage();

At the end, it’s time to ask the LLM to formulate a response with this augmented prompt:

String response = chatModel.chat(userMessage).aiMessage().text();

And voilà!

Fixed chunk embedding or Hypothetical Question Embedding?

The question I often get is to know which technique should be used: fixed chunking, hypothetical question embedding, or another? I’ll reply with the typical consultant answer: it depends!

That’s the point where I’ll tell you that evaluation is key! Hypothetical Question embedding typically works better for applications that are indeed Question/Answer focused. If users ask questions about their data (let’s say, an HR chatbot to ask questions about the vacation policy), this technique works well. But for applications where the semantic search is more about finding similar documents, it might not yield the same kind of performance.

It’s important to run evaluation on your data, with typical user queries, and check which technique yields better results. We might cover evaluation in another article, later on.

Learn More

If you want to learn more about evaluation techniques, be sure to check out the articles from my colleague Mete Atamel.

Going forward

  • Play with the hypothetical question application I deployed on Cloud Run to see the impact on vector similarity.
  • Have another read of my article on the sentence window retrieval technique, to see if it fits better with the kind of documents you have in your corpus.
  • You can also watch the talk on advanced RAG techniques, as well as the slides, to see all the other techniques that can be combined.
  • But never forget to prepare evaluations on your data, with typical user queries, to compare which techniques yield better results!
https://glaforge.dev/posts/2025/07/06/advanced-rag-hypothetical-question-embedding/
Expanding ADK AI agent capabilities with tools

In a nutshell, the AI agent equation is the following:

AI Agent = LLM + Memory + Planning + Tool Use

AI agents are nothing without tools! And they are actually more than mere Large Language Model calls. They require some memory management to handle the context of the interactions (short term, long term, or contextual information like in the Retrieval Augmented Generation approach). Planning is also important (with variations around the Chain-of-Thought prompting approach, and LLMs with reasoning or thinking capabilities) for an agent to carry out its tasks.

But for agents to be useful and to be able to sense or act upon their environment, they need access to tools. Generally speaking, tool use is about leveraging an LLM’s function calling ability: the model understands when it needs to request that some function be called in order to proceed with its next actions or steps.

In my previous articles about ADK, I guided you through the creation of your first AI agent with ADK for Java, and I even shared a GitHub project template to help you get started faster. But today, I want to explore with you the concept of tools, and what tools are at your disposal when creating AI agents in Java with ADK.

Built-in tools

ADK comes with a handful of very useful built-in tools:

  • a Google Search tool,
  • a Python code executor,
  • an artifact service to store and load files.

Circumventing LLM’s cut-off date with Google Search

LLMs’ knowledge is as recent as the last information of the corpus of training data they were trained on. For example, if you asked an LLM who won the Roland Garros tennis tournament, since the finals ended just a week ago, it wouldn’t be able to tell you who won the 2025 edition. However, if you give an LLM access to a search engine, it can answer that question with grounded facts.

To do that with ADK, you need to add the GoogleSearchTool tool:

// Given
LlmAgent agent = LlmAgent.builder()
    .name("helpful-assistant")
    .description("a helpful assistant who can search in Google")
    .instruction("""
        You're a helpful assistant
        who knows how to search in Google Search.
        Today is 2025-06-15.
        """)
    .model("gemini-2.0-flash")
    .tools(new GoogleSearchTool())
    .build();

// When
List<Event> events = askAgent(agent,
        "Who's the man who won Roland Garros 2025?");

// Then
assertThat(events.get(0).content().get().text())
        .containsIgnoringCase("Alcaraz");

The Google Search tool is also very useful if you want to build some kind of deep research agent that is able to search the web to collect key information to create complex reports.

Executing code when advanced calculations or algorithms are needed

LLMs are notoriously bad at math or letter games, and at unrolling complex algorithms needed for reasoning purposes (like logic puzzles). However, they are pretty good at generating code. For math or algorithms, they are totally capable of generating the right piece of code that could solve the task at hand. So if you give an LLM the ability to execute some code it generates, and examine the output of that execution, it’s going to be able to understand the problem and give a correct answer.

ADK gives you access to the BuiltInCodeExecutionTool tool:

// Given
LlmAgent agent = LlmAgent.builder()
    .name("helpful-assistant")
    .description("a helpful assistant that knows how to code")
    .instruction("""
        You're a helpful assistant.
        Today is 2025-06-10.
        """)
    .model("gemini-2.0-flash")
    .tools(new BuiltInCodeExecutionTool())
    .build();

// When
List<Event> events = askAgent(agent,
        "How much is Fibonacci(12) + Ackermann(3,3)?");

// Then
Content content = events.get(0).content().get();
assertThat(content.text()).contains("205");

In the example above, Gemini will write some Python code, and execute it inside a sandboxed Python interpreter, to give you the final answer.

Tool to save and load artifacts

Last built-in tool I’d like to mention briefly: the LoadArtifactsTool tool, to deal with artifacts (although they’d deserve their own article too).

Artifacts are named, versioned text or binary data associated with a user session or associated with a user across sessions. Such files can be persisted via the artifact service (there’s even a Google Cloud Storage artifact service for long term storage).

Artifacts are accessible via methods like saveArtifact(), loadArtifact(), or listArtifacts() on objects like CallbackContext (when adding callbacks to your agents), or ToolContext (when adding tools to your agents in methods taking a ToolContext parameter). Artifacts can also be accessed via the system instructions used to declare your agent.

I won’t dive into details today, but for the sake of completeness, here’s how you can configure the tool and mention artifacts in the agent system instructions:

LlmAgent agent = LlmAgent.builder()
    .name("helpful-movie-assistant")
    .description("""
        a helpful assistant who knows
        about some rather unknown or obscure movies
        """)
    .instruction("""
        You're a helpful movie assistant.

        When asked questions about actors in a movie,
        forget about all intrinsic knowledge, and
        lookup the details in the artifact {artifact.movies.txt}.
        """)
    .model("gemini-2.0-flash")
    .tools(new LoadArtifactsTool())
    .build();

Custom tools

When you need your own piece of logic to help your AI agent, you can create custom tools, via the FunctionTool class. Custom tools are just regular methods, but with a twist: with some carefully crafted annotations to describe the tool to help LLMs understand what this tool can do.

Let’s give the agent access to a moonPhase method inside the ToolsTest class to compute the phase of the moon for a given date:

// Given
LlmAgent agent = LlmAgent.builder()
    .name("helpful-assistant")
    .description("a helpful assistant")
    .instruction("""
        You're a helpful assistant who knows about the moon.
        Today is 2025-06-15.
        """)
    .model("gemini-2.0-flash")
    .tools(FunctionTool.create(ToolsTest.class, "moonPhase"))
    .build();

// When
List<Event> events = askAgent(agent,
        "What's the moon phase today?");

// Then
String text = events.get(2).parts().get().get(0).text().get();
assertThat(text).containsIgnoringCase("full moon");

And now let’s see what the moonPhase method does (with a hard-coded answer):

@Schema(description = "get the moon phase for a given date")
public static Map<String, String> moonPhase(
    @Schema(name = "date",
        description = "the date for which to get the moon phase")
    String date) {
    return Map.of("moon-phase", "full moon");
}

I annotated the moonPhase() method with a @Schema with a description, as well as the date parameter with both a description and a name.

Important

It is very important to properly document your custom tool, as LLMs will use this information to figure out which method to invoke, and which parameters to pass to it.

As of the time of this writing, the 0.1.0 release of ADK for Java supports static methods, but in an upcoming version, it’ll be possible to use instance methods as well.

Also note that it is mandatory to return a Map. The reason is that you either return some kind of complex JSON object (that can be transparently un/marshalled), or you return a map with some status field in addition to the normal return object, to help the LLM understand if the execution was successful or not. For example: {"status": "success", "moon-phase": "full moon"}.
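
Here is what such a status-carrying map could look like, as a plain-Java sketch. It is a hypothetical variant of the hard-coded moonPhase method: the ADK annotations are left out so the snippet compiles on its own, and the "error" and "message" keys are my own naming choice for the example, not something mandated by ADK.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Map;

public class StatusMapExample {

    // Returns a map with a "status" field next to the actual result,
    // so the LLM can tell a successful call from a failed one.
    static Map<String, String> moonPhaseWithStatus(String date) {
        try {
            LocalDate.parse(date); // expects an ISO date such as 2025-06-15
        } catch (DateTimeParseException e) {
            return Map.of(
                "status", "error",
                "message", "invalid date: " + date);
        }
        // Still a hard-coded answer, like in the example above.
        return Map.of(
            "status", "success",
            "moon-phase", "full moon");
    }

    public static void main(String[] args) {
        System.out.println(moonPhaseWithStatus("2025-06-15").get("status")); // success
        System.out.println(moonPhaseWithStatus("yesterday").get("status"));  // error
    }
}
```

Returning an error map rather than throwing gives the model a chance to read the message and retry with a corrected argument.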

What about multimodal tools?

Since tool support is done via LLM’s function calling capability, it’s also limited by it! Currently, I’m not aware of LLMs that are able to generate function calls that contain non-textual information, such as images, videos, etc. Fortunately, there’s a way to circumvent this limitation, thanks to ADK’s ToolContext.

Let’s say you want to leverage LLM’s multimodal ability, by looking at a picture of the moon, and guess the phase of the moon depicted in that image. Function calling can’t pass the image directly, so let’s see how you can access the user’s full multimodal message via the ToolContext:

// Given
LlmAgent agent = LlmAgent.builder()
    .name("helpful-assistant")
    .description(
        "a helpful assistant who can analyze pictures of the moon")
    .instruction("""
        You're a helpful assistant who knows about the moon.
        When asked a question about the moon, or pictures of the moon,
        you MUST call the `moonPhaseFromImage` function.
        """)
    .model("gemini-2.0-flash")
    .tools(FunctionTool.create(ToolsTest.class, "moonPhaseFromImage"))
    .build();

We’re still creating a custom tool with FunctionTool.create() as before. However, our method definition will have an additional parameter: an instance of ToolContext. Note that it should be named toolContext, otherwise ADK won’t be happy.

@Schema(description = "get the moon phase by analyzing pictures")
public static Map<String, String> moonPhaseFromImage(
    @Schema(name = "toolContext")
    ToolContext toolContext) {

    Optional<List<Part>> optionalParts =
            toolContext.userContent().flatMap(Content::parts);
    if (optionalParts.isPresent()) {
        List<Part> imageParts = optionalParts.get().stream()
                .filter(part -> part.inlineData().isPresent()).toList();
        if (imageParts.size() == 1) {
            Part imagePart = imageParts.get(0);
            byte[] imageBytes =
                imagePart.inlineData().get().data().get();
            // do something with the image bytes...
            // make a normal multimodal LLM call
            // and return the result
            return Map.of("moon-phase", "half moon");
        }
    }
    return Map.of("moon-phase", "unknown");
}

The key line to look at (in addition to the toolContext parameter in the signature of the method) is the toolContext.userContent() call. It gives you access to the Parts of the Content object which represents the user request.

In this code snippet above, we just retrieve the bytes of the uploaded image, and we’re faking doing something with them. But that’s the place where you could make an LLM invocation to ask to analyze the image, and guess the phase of the moon depicted in the image.

Long running custom tools

So far, I’ve talked about tools that are pretty much synchronous in nature, as they usually answer quite rapidly. But what about situations where you have long-running workflows that take several hours or even days to run?

Or what about scenarios where there’s the need for a human in the loop to validate some action, like a manager who needs to accept or reject an expense report from an employee? That’s where long-running custom tools come in handy.

ADK offers LongRunningFunctionTools. In terms of API, they are exactly like FunctionTools. It’s just that the framework knows the function will acknowledge the reception of the request, but the full completion of the request may happen at a later time.

Something that confused me initially was that working with LLMs is very request / response oriented, in the sense that there’s always an input from a user, that leads to an output from the LLM. And the conversation goes on and so forth.

What bothered me was the fact that I didn’t know what would happen when we actually receive the final completion answer from the LLM. Or even how will we receive that completion status? Well, the thing is that we need to shift our mindset from the request / response turns approach, and instead think of the fact that ADK is more like an event loop, which doesn’t need a response to always follow a request, but events can flow in and out, from the user, or from the system itself, in any order.

Let’s configure a long running function:

// Given
LlmAgent agent = LlmAgent.builder()
    .name("helpful-assistant")
    .description("a helpful assistant who can execute workflows")
    .instruction("You're a helpful assistant.")
    .model("gemini-2.0-flash")
    .tools(LongRunningFunctionTool.create(ToolsTest.class,
                                          "executeWorkflow"))
    .build();

This looks just like a normal FunctionTool declaration.

Now let’s send a… recipe workflow!

// When
List<Event> events = askAgent(agent, """
    Execute the following workflow:
    - peel the potatoes
    - cut the potatoes in dice
    - put olive oil in the pan
    - heat pan
    - put the potato dices
    - stir regularly
    - cook for 10 minutes
    """);

Baking some potatoes can take some time, so you’re not going to eat your cooked potatoes immediately!

Let’s have a look at what our executeWorkflow method does:

@Schema(description = """
    execute a long running workflow made of
    several steps explained in natural language
    """)
public static Map<String, Object> executeWorkflow(
    @Schema(name = "workflowDescription",
        description = "description of the workflow to execute")
    String workflowDescription,
    @Schema(name = "workflowSteps",
        description = "a list of workflow steps")
    List<String> workflowSteps) {
    return Map.of(
        "status", LongRunningOperation.Status.STARTED,
        "longRunningOperation", new LongRunningOperation(
            LongRunningOperation.Status.STARTED,
            workflowDescription, workflowSteps));
}

My executeWorkflow method actually returns immediately, to acknowledge the reception of the request. We can imagine here that we’re sending the steps (here a list of strings) to some workflow execution engine. As return type of the method, I’ve decided to return a status indicating that the long running operation has started. And I also return (in the same map) some LongRunningOperation object with the status, the workflow description, and the steps.

I’ve defined this LongRunningOperation as a record, and the status itself as a Status enum.

record LongRunningOperation(
    LongRunningOperation.Status status,
    String description, List<String> steps) {
    enum Status {
        STARTED, FINISHED, ERROR
    }
}

But you can, of course, return a more complex object that contains an ID identifying the workflow execution, so that you know which execution completed.
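
As a sketch of that idea, assuming a simple in-memory registry (the WorkflowRegistry and TrackedOperation names are hypothetical and not part of ADK), the operation could carry a generated ID that is later used to mark it as finished:

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class WorkflowRegistry {

    enum Status { STARTED, FINISHED, ERROR }

    // Same shape as the record above, plus an ID to identify the execution.
    record TrackedOperation(String id, Status status,
                            String description, List<String> steps) {}

    private final Map<String, TrackedOperation> operations = new ConcurrentHashMap<>();

    // Registers a new execution and returns it, including its generated ID.
    TrackedOperation start(String description, List<String> steps) {
        TrackedOperation op = new TrackedOperation(
            UUID.randomUUID().toString(), Status.STARTED,
            description, List.copyOf(steps));
        operations.put(op.id(), op);
        return op;
    }

    // Marks an execution as finished; returns null if the ID is unknown.
    TrackedOperation finish(String id) {
        return operations.computeIfPresent(id, (key, op) ->
            new TrackedOperation(op.id(), Status.FINISHED,
                op.description(), op.steps()));
    }

    // Current status of an execution, or null if the ID is unknown.
    Status statusOf(String id) {
        TrackedOperation op = operations.get(id);
        return op == null ? null : op.status();
    }

    public static void main(String[] args) {
        WorkflowRegistry registry = new WorkflowRegistry();
        TrackedOperation op = registry.start(
            "cook potatoes", List.of("peel the potatoes", "heat pan"));
        System.out.println(registry.statusOf(op.id())); // STARTED
        registry.finish(op.id());
        System.out.println(registry.statusOf(op.id())); // FINISHED
    }
}
```

The tool method would return the ID in its result map, and whatever component learns about the completion would call finish with that same ID.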

At this point, the LLM used by ADK will reply to the user to acknowledge the reception of the request, and the start of the workflow. Later on, an event should be sent back via ADK somehow, to notify your application of the completion of the workflow, in a more event-oriented approach compared to the usual request/response approach of LLM conversations.

The example above is a dummy one, as we’re not really executing a long running operation, and we’re not going to receive an event upon completion. So my idea is that I’ll come back with an article later on that will dive deeper into long running operations and human in the loop scenarios, as I believe we need a full article and complete demonstration to illustrate this concept more thoroughly.

An agent as a tool

So far, we talked about built-in tools, and custom tools, but there’s another kind of tool which is quite powerful, and which turns another agent into a tool itself!

Let’s come back to our moon phase example. First, let’s define the agent that has access to the moonPhase() function. It’ll be the agent that will serve as a tool:

LlmAgent moonAgent = LlmAgent.builder()
    .name("moon-agent")
    .description("Agent that knows about the moon")
    .instruction("""
        You know everything about the moon!

        Today is 2025-06-15.

        When asked about the phase of the moon,
        call the `moonPhase` tool with the current date as parameter.
        """)
    .model("gemini-2.0-flash")
    .tools(
        FunctionTool.create(ToolsTest.class, "moonPhase")
    )
    .build();

Now let’s see how we can turn this sidekick agent into a tool, via the AgentTool.create() method:

LlmAgent mainAgent = LlmAgent.builder()
    .name("helpful-assistant")
    .description("a helpful assistant who knows about the moon")
    .instruction("""
        You're a helpful assistant.

        When a question about the moon is asked,
        ask the question to the `moon-agent` tool.
        """)
    .model("gemini-2.0-flash")
    .tools(AgentTool.create(moonAgent))
    .build();

// When
List<Event> events = askAgent(mainAgent, "What's the moon phase today?");

I’m not showing all the assert statements, but there will be 3 events: a function execution request addressed to the moon-agent, a function response from the moon-agent, and the final answer by the LLM who will reformulate the function response into a nice human readable answer.

I haven’t talked about the various agent flows supported by ADK, like sub-agents, sequential agents, loop agents, and parallel agents, but agent as tool is a very powerful pattern for creating more complex agents composed of more specific agents.

Generally, a multi-agent system will be more powerful and more reliable than a big monolithic agent, when tasks can be split and shared among more specialized agents. I’ll come back to this later, in a subsequent article.

Important

Today, a limitation of Gemini is that you can’t use a function call and a built-in tool at the same time. ADK, when using Gemini as the underlying LLM, takes advantage of Gemini’s built-in ability to do Google searches, and uses function calling to invoke your custom ADK tools. So agent tools can come in handy, as you can have a main agent, that delegates live searches to a search agent that has the GoogleSearchTool configured, and another tool agent that makes use of a custom tool function.

Usually, this shows up as a mysterious error like this one (reported against ADK for Python): {'error': {'code': 400, 'message': 'Tool use with function calling is unsupported', 'status': 'INVALID_ARGUMENT'}}. This means that you can’t use a built-in tool and function calling at the same time in the same agent. The workaround, then, is to decompose the problem into multiple agents, and take advantage of agent tools.

Calling MCP tools

Last but not least, let’s finish the round of tools with the most trendy one: MCP tools (Model Context Protocol)!

Last week, I wrote about how to create an MCP SSE server with Quarkus, where I exposed a couple of MCP tools that allow you to know the phase of the moon today or at a custom date — hence the theme again today with the moon. Let’s see how we can configure an agent to use this tool.

It is possible to use Server-Sent Event (SSE for short), or STDIO (standard-in / standard-out) protocols. The moon phases MCP tool I created the other day was an SSE one.

The first thing to do is to configure it by giving the endpoint of the MCP server:

SseServerParameters sseParams = SseServerParameters.builder()
    .url("https://moonphases-1234567890.europe-west1.run.app/mcp/sse")
    .build();

Then you can list (and potentially filter manually) the tools available:

McpToolset.McpToolsAndToolsetResult toolsAndToolsetResult =
    McpToolset.fromServer(sseParams).get();
List<McpTool> moonPhasesTools = toolsAndToolsetResult.getTools();

And now you can configure the agent with the list of MCP tools you want to give it access to:

LlmAgent agent = LlmAgent.builder()
    .name("helpful-assistant")
    .description("a helpful assistant who knows about the moon")
    .instruction("""
        You're a helpful assistant.
        """)
    .model("gemini-2.0-flash")
    .tools(moonPhasesTools)
    .build();

And that’s it! Pretty easy, right?

Summary

In this article, we explored the concept of tools in AI agents, specifically in the context of ADK for Java. First, we looked at the built-in tools, like Google Search, Python code execution, or the artifacts service. Next, we explored custom tools, including long-running tools, or also how to handle multimodal requests thanks to the tool context. We discovered the agent as tool concept, as a smart agent can be a tool itself for another agent. Lastly, we also learned about remote MCP server tools.

In upcoming articles in this series, we’ll dive deeper into some of those tools, and we’ll build more complex use cases. So stay tuned for the next episode!

https://glaforge.dev/posts/2025/06/15/expanding-ai-agent-capabilities-with-tools/
Building an MCP server with Quarkus and deploying on Google Cloud Run

As I’m contributing to ADK (Agent Development Kit) for Java, and LangChain4j (the LLM orchestration framework) I interact with MCP (Model Context Protocol) servers and tools to further expand the capabilities of my LLMs.

Recently, I showed how to vibe-code an MCP server using Micronaut. You know I usually talk about Micronaut, but this time, I wanted to experiment with Quarkus, and in particular with its built-in support for implementing MCP servers.

Getting started with Quarkus’ MCP support

I created a brand new Quarkus project from IntelliJ IDEA, with its Quarkus template, and I added a couple key dependencies for JSON marshalling, but even more important, for the MCP support:

<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-resteasy-jackson</artifactId>
</dependency>
<dependency>
    <groupId>io.quarkiverse.mcp</groupId>
    <artifactId>quarkus-mcp-server-sse</artifactId>
    <version>1.2.0</version>
</dependency>

As I’m going to deploy the server in the cloud, I chose to go with an SSE (Server-Sent Events) server. The STDIO protocol is usually used for MCP servers running locally alongside the MCP host (i.e. your application invoking the tool).

Instead of going with the usual weather forecast use case, which is a bit like the hello world of MCP servers, I decided to implement a service that calculates the phases of the moon! I got the idea from a recent post on Hacker News that pointed at a GitHub repository offering different implementations of the moon phase calculation. I used Gemini to convert the algorithm to Java, as there was no Java implementation.

I’ll spare you the details of the calculation, but you can have a look at the code I wrote (or Gemini wrote!) to do the math. However, I’ll show you the structure of my MoonPhasesService class:

import jakarta.inject.Singleton;

@Singleton
public class MoonPhasesService {
    public MoonPhase currentMoonPhase() {
        return moonPhaseAtUnixTimestamp(System.currentTimeMillis() / 1000L);
    }

    public MoonPhase moonPhaseAtUnixTimestamp(long timeSeconds) {
        // ...
    }
    // ...
}

This service is able to give you the phase of the moon at this current moment in time, or you can specify a particular date, as a UNIX epoch time in seconds.
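For readers curious about what such a calculation can look like, here is a minimal sketch of one common approximation. It is not the algorithm used in the repository mentioned above. It assumes a reference new moon on 2000-01-06 at 18:14 UTC and a mean synodic month of about 29.53 days, and the class and method names (`MoonAgeSketch`, `cycleFraction`, `phaseName`) are made up for illustration. It can drift by several hours compared to a real ephemeris.

```java
public final class MoonAgeSketch {
    // Assumed reference new moon: 2000-01-06 18:14 UTC, in Unix epoch seconds.
    public static final long REFERENCE_NEW_MOON = 947_182_440L;
    // Mean length of a synodic month (new moon to new moon), in seconds.
    public static final double SYNODIC_MONTH_SECONDS = 29.530588853 * 86_400;

    private static final String[] PHASE_NAMES = {
        "new moon", "waxing crescent", "first quarter", "waxing gibbous",
        "full", "waning gibbous", "last quarter", "waning crescent"
    };

    /** Fraction of the lunar cycle elapsed, in [0, 1): 0 is new moon, 0.5 is full. */
    public static double cycleFraction(long timeSeconds) {
        double cycles = (timeSeconds - REFERENCE_NEW_MOON) / SYNODIC_MONTH_SECONDS;
        // Subtracting the floor keeps the result positive for dates before the reference.
        return cycles - Math.floor(cycles);
    }

    /** Maps the cycle fraction onto eight equal slices, each centered on a phase. */
    public static String phaseName(long timeSeconds) {
        int index = (int) Math.floor(cycleFraction(timeSeconds) * 8 + 0.5) % 8;
        return PHASE_NAMES[index];
    }

    public static void main(String[] args) {
        System.out.println(phaseName(System.currentTimeMillis() / 1000L));
    }
}
```

Like the service above, the sketch takes a UNIX epoch time in seconds, so the two entry points (current time or a given date) can share the same calculation.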

This service returns a MoonPhase object. It’s an enum that looks like so:

import com.fasterxml.jackson.annotation.JsonFormat;

@JsonFormat(shape = JsonFormat.Shape.OBJECT)
public enum MoonPhase {
    NEW_MOON(            "🌑", "new moon"),
    WAXING_CRESCENT(     "🌒", "waxing crescent"),
    FIRST_QUARTER(       "🌓", "first quarter"),
    WAXING_GIBBOUS(      "🌔", "waxing gibbous"),
    FULL(                "🌕", "full"),
    WANING_GIBBOUS(      "🌖", "waning gibbous"),
    LAST_QUARTER(        "🌗", "last quarter"),
    WANING_CRESCENT(     "🌘", "waning crescent"),
    NEW_MOON_APPROACHING("🌑", "new moon approaching");

    // constructor
    // getters for emoji and phase name...
}

As I didn’t want to just return the moon phase name, I customized the serialization so that Jackson returns the enum values as normal objects. So, for example, the full moon will be returned as:

{
  "phase": "full",
  "emoji": "🌕"
}
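To make the elided parts more concrete, here is a minimal sketch of what the constructor and getters might look like. The `MoonPhaseSketch` name and the `toJson()` helper are hypothetical: the helper only mimics by hand the object shape shown above, so that the example runs without Jackson. In the real class, Jackson derives the `phase` and `emoji` properties from the getters, and the property order may differ.

```java
public enum MoonPhaseSketch {
    NEW_MOON("🌑", "new moon"),
    FULL("🌕", "full");

    private final String emoji;
    private final String phase;

    MoonPhaseSketch(String emoji, String phase) {
        this.emoji = emoji;
        this.phase = phase;
    }

    // With @JsonFormat(shape = OBJECT), getters like these become JSON properties.
    public String getEmoji() {
        return emoji;
    }

    public String getPhase() {
        return phase;
    }

    /** Hand-written stand-in for the JSON object that Jackson would produce. */
    public String toJson() {
        return "{\"phase\":\"" + phase + "\",\"emoji\":\"" + emoji + "\"}";
    }

    public static void main(String[] args) {
        System.out.println(FULL.toJson());
    }
}
```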

Now comes the interesting part! How to expose an MCP tool that LLMs can access? The @Tool and @ToolArg annotations are your friends!

Let’s implement a new class, in which I inject the MoonPhasesService:

import jakarta.inject.Inject;

public class MoonPhasesMcpServer {

    @Inject
    MoonPhasesService moonPhasesService;
    // ...
}

Now, let’s create two tool methods: one that gives the current phase of the moon, and the other one that gives the phase at a given date.

@Tool(name = "current-moon-phase",
    description = "Provides the current moon phase")
public TextContent currentMoonPhase() {
    return new TextContent(moonPhasesService.currentMoonPhase().toString());
}

The first one gives the current phase, as of today. The name and description of the tool are very important, as they help LLMs figure out what the tool does and understand when they should call it. I return a TextContent result, but it’s also possible to return other kinds of content, like audio or image content, or a resource.

Let’s have a look at the other method, the one that gives the moon phase for a given date:

@Tool(name = "moon-phase-at-date", description =
        "Provides the moon phase at a certain date " +
        "(with a format of yyyy-MM-dd)")
public ToolResponse moonPhaseAtDate(
    @ToolArg(name = "localDate", description =
        "The date for which the user wants to know the phase " +
        "of the moon (in yyyy-MM-dd format)")
    String localDate) {
    try {
        LocalDate parsedLocalDate = LocalDate.parse(localDate);
        MoonPhase moonPhase =
            moonPhasesService.moonPhaseAtUnixTimestamp(
                parsedLocalDate.toEpochDay() * 86400);
        return ToolResponse.success(
                new TextContent(moonPhase.toString()));
    } catch (DateTimeException dte) {
        return ToolResponse.error(
                "Not a valid date (yyyy-MM-dd): " + localDate);
    }
}

This time, the method also takes an argument. That’s why I annotated the parameter with a @ToolArg annotation, again with a name and description (including how the date should be formatted). Since this method can fail at the time of parsing the date string, I decided to return a ToolResponse which wraps either a result (the moon phase) or an error in case the parsing fails.
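The date handling in that method can be tried in isolation. The sketch below uses only the JDK: it parses a yyyy-MM-dd string and converts it to epoch seconds at midnight UTC, which is equivalent to the `toEpochDay() * 86400` expression above. The `DateToEpochSketch` class and its `toEpochSeconds` method are hypothetical names, and returning an empty `OptionalLong` is just one way to signal a parsing failure outside of an MCP tool.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;
import java.util.OptionalLong;

public final class DateToEpochSketch {

    /** Returns the epoch seconds at midnight UTC, or empty if the date is invalid. */
    public static OptionalLong toEpochSeconds(String isoDate) {
        try {
            LocalDate date = LocalDate.parse(isoDate); // expects yyyy-MM-dd
            return OptionalLong.of(date.atStartOfDay(ZoneOffset.UTC).toEpochSecond());
        } catch (DateTimeParseException e) {
            // Thrown for malformed input as well as impossible dates like 2025-02-30.
            return OptionalLong.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(toEpochSeconds("2025-06-09"));
        System.out.println(toEpochSeconds("09/06/2025"));
    }
}
```

Since `DateTimeParseException` is a subclass of `DateTimeException`, the broader catch clause used in the tool method covers the same failures.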

As you can see, it’s fairly easy to implement tools for an MCP server! You almost just need annotations, and that’s it!

This server isn’t secured in any way, to keep things simple in this article. But if you need to dig deeper and learn more about securing an MCP server, I invite you to read this article by Sergey Beryozkin on getting ready for secure MCP with Quarkus MCP server.

Running the server and checking it works

To run this MCP server, you can simply run Quarkus in dev mode with the following command (if you’re using Maven):

./mvnw quarkus:dev

You can quickly check that the endpoint is alive and running by opening this URL in your browser: http://localhost:8080/mcp/sse. You’ll see a Server-Sent Event like this one:

event: endpoint
data: /mcp/messages/OTRiYzEyNTItNWY1Ni00NWJhLWExZTEtYzE5ZWU1YjdkNWQy
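That frame is plain text: an event name, and a data line carrying the path the client will then post its messages to. As a rough illustration, here is a minimal parser for a single frame, written with the JDK only. The `SseFrameSketch` class is a made-up name, and the parsing is deliberately simplified: it ignores comment lines and lines without a colon, and it doesn’t handle streaming or reconnection.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public final class SseFrameSketch {

    /** Parses the "field: value" lines of one SSE frame into a map. */
    public static Map<String, String> parse(String frame) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : frame.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // skips blank lines, comments (":...") and colon-less lines
            }
            String name = line.substring(0, colon);
            String value = line.substring(colon + 1);
            if (value.startsWith(" ")) {
                value = value.substring(1); // a single leading space is not part of the value
            }
            if (name.equals("data")) {
                // Several data lines in one frame are joined with newlines.
                fields.merge(name, value, (previous, next) -> previous + "\n" + next);
            } else {
                fields.put(name, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("event: endpoint\ndata: /mcp/messages/some-session-id"));
    }
}
```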

But we’re not really testing our two MCP tools.

One approach is to invoke the MCP server with LangChain4j. I already covered that in an earlier article about an MCP client and server with the Java MCP SDK and LangChain4j, so I won’t repeat myself today. You can also read more about LangChain4j’s MCP support in its documentation.

But here, I wanted to highlight a very convenient tool: the MCP inspector. It’s a tool provided by the MCP project itself. It’s a Node-based tool that you can install and run locally on your machine, with the following npx command:

npx @modelcontextprotocol/inspector

It provides a UI to interact with an MCP server. Here, my MCP server was already deployed: I connected to it (selecting SSE and giving the URL of my server), requested the list of tools (shown in the middle pane), and invoked the tool that gives the phase of the moon at a given date (panel on the right of the screenshot):

I really encourage you to use the MCP inspector to test your MCP servers manually. This is a very handy tool in your toolbelt.
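Under the hood, MCP clients like the inspector exchange JSON-RPC 2.0 messages with the server, and invoking a tool is a `tools/call` request. The following sketch builds such a request by hand for the `moon-phase-at-date` tool, to show its general shape. The `ToolCallRequestSketch` class is a hypothetical helper, the request id is arbitrary, and the string concatenation doesn’t escape special characters, so a real client should rely on a JSON library instead.

```java
public final class ToolCallRequestSketch {

    /** Builds a JSON-RPC "tools/call" request with a single string argument. */
    public static String toolsCall(int id, String toolName, String argName, String argValue) {
        // No escaping is done here: inputs are assumed to contain no quotes or backslashes.
        return "{\"jsonrpc\":\"2.0\",\"id\":" + id
            + ",\"method\":\"tools/call\""
            + ",\"params\":{\"name\":\"" + toolName + "\""
            + ",\"arguments\":{\"" + argName + "\":\"" + argValue + "\"}}}";
    }

    public static void main(String[] args) {
        System.out.println(toolsCall(1, "moon-phase-at-date", "localDate", "2025-06-09"));
    }
}
```

The tool and argument names match the `@Tool` and `@ToolArg` annotations declared earlier, which is why choosing them carefully matters.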

Deploying on Cloud Run

So far so good, it works on my machine(™). What about deploying the server in the cloud, since we chose to go with an SSE MCP server? My go-to solution to host my apps quickly and efficiently is to containerize them and deploy them on Google Cloud Run. Cloud Run is a managed platform to run containers that scale up with traffic, and down to zero instances when there’s no activity (costing you 0 cents).

Cloud Run made the highlight at Google I/O this year, as it was announced that you can:

Since Cloud Run is a container-based platform, let’s containerize our application. Quarkus offers a handful of Dockerfiles depending on how you want to create your container. For some reason, the native build ones didn’t work for me (I got a “the --chmod option requires BuildKit” error message that I haven’t investigated further), so I went with the Dockerfile.jvm file, which I copied into a Dockerfile at the root of my project, so that Cloud Build could easily pick it up and build it:

gcloud builds submit \
    --tag gcr.io/YOUR_PROJECT_ID/moonphases

Once built, it’s available in Google Cloud Artifact Registry. And I can deploy the containerized moon phases service to Cloud Run with the following command:

gcloud run deploy moonphases \
    --allow-unauthenticated \
    --image gcr.io/YOUR_PROJECT_ID/moonphases

Of course, you’ll have to replace the YOUR_PROJECT_ID placeholders with the real ID of your own Google Cloud project. And along the way, you’ll be asked to enable important APIs (Artifact Registry, Cloud Run, etc.).

Note

If you’re interested, there’s a great page about hosting MCP servers on Cloud Run, to learn more about the possibilities.

Bonus: Configuring the MCP server in Agent Development Kit

Of course, you can configure and invoke this MCP server from LangChain4j, but let’s have a quick look at configuring and invoking it from ADK (Agent Development Kit):

SseServerParameters sseParams = SseServerParameters.builder()
    .url("https://moonphases-2029713823481.europe-west1.run.app/mcp/sse")
    .build();

McpToolset.McpToolsAndToolsetResult toolsAndToolsetResult =
   McpToolset.fromServer(sseParams).get();
List<McpTool> moonPhasesTools = toolsAndToolsetResult.getTools();

LlmAgent scienceTeacherAgent = LlmAgent.builder()
    .name("science-app")
    .description("Science teacher agent")
    .model("gemini-2.0-flash")
    .instruction("""
        You're a friendly science teacher
        answering questions about scientific concepts.

        If the question is about the phases of the moon,
        you MUST call the `current-moon-phase` function tool
        to know the current phase as of right now,
        or the `moon-phase-at-date` function tool
        to know the phase of the moon on a particular day
        (the date format is then yyyy-MM-dd).
        """)
    .tools(moonPhasesTools)
    .build();

I’m not going to detail everything here, but if you want to learn more about ADK for Java, please read my getting started guide that I published recently. What’s needed here is to configure the SSE server parameters, create an MCP toolset, and then get the list of tools to pass to the agent via its tools() method.

For the record, here is what the ADK Dev UI shows when asking for the current phase of the moon, and the phase for a later date:

Summary

In this article, we walked through building an MCP server with Quarkus and deploying it on Google Cloud Run.

First, we created a Quarkus project with the necessary dependencies for the MCP support. Then, we implemented a service to calculate moon phases and exposed it as MCP tools using @Tool and @ToolArg annotations provided by Quarkus. We used the MCP inspector to test the server and we showed how to configure and invoke it from ADK, the Agent Development Kit. Finally, we containerized the application and deployed it to Google Cloud Run for scalability.

If you want to have a closer look at the full source code, you can check out this repository to learn more about creating your own MCP servers!

https://glaforge.dev/posts/2025/06/09/building-an-mcp-server-with-quarkus-and-deploying-on-google-cloud-run/
Expanding ADK Java LLM coverage with LangChain4j

Recently on these pages, I’ve covered ADK (Agent Development Kit) for Java, launched at Google I/O 2025. I showed how to get started writing your first Java agent, and I shared a GitHub template that you can use to kick-start your development.

But you also know that I’m a big fan of, and a contributor to the LangChain4j project, where I’ve worked on the Gemini support, embedding models, GCS document loaders, Imagen generation, etc.

How can I reconcile both? By integrating ADK and LangChain4j together! But why? Because currently, ADK for Java only supports two models: Gemini and Claude, compared to the Python version that supports other models via its LiteLLM integration. So if I could integrate ADK with LangChain4j, I could make ADK Java access any model that LangChain4j supports! Then developers could use models from OpenAI, Anthropic, Mistral, and also all the models that can run via Ollama, like Gemma, Qwen, Phi, and others!

Warning

This is a work-in-progress glimpse into the ADK / LangChain4j integration I’ve been working on with Dmytro (LangChain4j’s founder). It hasn’t been integrated in either ADK or LangChain4j yet. Currently, it lives as a Pull Request against the ADK GitHub project. Stay tuned! I’ll blog again when it’s available!

Using local Ollama models in ADK

Let’s say you want to build a Java agent with ADK, using the Qwen 3 model, that you installed locally via Ollama. You have Ollama running on your computer and serving the model. Then all you have to do is to configure the Ollama LangChain4j model, and wrap it in a LangChain4j ADK model adapter:

OllamaChatModel ollamaChatModel = OllamaChatModel.builder()
    .modelName("qwen3:1.7b")
    .baseUrl("http://127.0.0.1:11434")
    .build();

LlmAgent scienceTeacherAgent = LlmAgent.builder()
    .name("science-app")
    .description("Science teacher agent")
    .model(new LangChain4j(ollamaChatModel))
    .instruction("""
        You are a helpful science teacher
        who explains science concepts to kids and teenagers.
        """)
    .build();

In the following screenshot of the ADK Dev UI, I configured Ollama to serve a Gemma 3 model, as you can see at the bottom left hand corner, where it shows the events, and LLM requests & responses:

Using big provider models in ADK

But you can use the big guns as well, including Anthropic models:

AnthropicChatModel claudeModel = AnthropicChatModel.builder()
    .apiKey(System.getenv("ANTHROPIC_API_KEY"))
    .modelName(CLAUDE_3_7_SONNET_20250219)
    .build();

LlmAgent agent = LlmAgent.builder()
    .name("science-app")
    .description("Science teacher agent")
    .model(new LangChain4j(claudeModel, CLAUDE_3_7_SONNET_20250219))
    .instruction("""
        You are a helpful science teacher
        who explains science concepts to kids and teenagers.
        """)
    .build();

Or OpenAI, this time using a streaming model:

StreamingChatModel openaiStreamingModel = OpenAiStreamingChatModel.builder()
    .apiKey(System.getenv("OPENAI_API_KEY"))
    .modelName("gpt-4o-mini")
    .build();

LlmAgent agent = LlmAgent.builder()
    .name("science-app")
    .description("Science teacher agent")
    .model(new LangChain4j(openaiStreamingModel))
    .instruction("""
        You are a helpful science teacher
        who explains science concepts to kids and teenagers.
        """)
    .build();

In the ADK Dev UI, you can flip the switch to enable or disable streaming. In that case, if you want to support both modes in the UI, configure two LangChain4j models: the streaming and the non-streaming one.

LlmAgent agent = LlmAgent.builder()
    .name("science-app")
    .description("Science teacher agent")
    .model(new LangChain4j(openaiModel, openaiStreamingModel))
    .instruction("""
        You are a helpful science teacher
        who explains science concepts to kids and teenagers.
        """)
    .build();

What about tools?

With ADK, your agents can make use of tools, so if the underlying LangChain4j model supports function calling, tools will work too. And there’s one particular tool that I’d like to mention: agent tools. An agent can be a tool. So you can mix and match different LLMs as sub-agents, or use a tool backed by a LangChain4j LLM.

For example, here’s a main agent using Claude, and a tool agent using OpenAI to give weather information:

StreamingChatModel openaiStreamingModel = OpenAiStreamingChatModel.builder()
    .apiKey(System.getenv("OPENAI_API_KEY"))
    .modelName("gpt-4o-mini")
    .build();

LlmAgent weatherAgent = LlmAgent.builder()
    .name("weather-agent")
    .description("Weather agent")
    .model(new LangChain4j(openaiStreamingModel))
    .instruction("""
        Your role is to always answer that the weather is sunny and 20°C.
        """)
    .build();

AnthropicChatModel claudeModel = AnthropicChatModel.builder()
    .apiKey(System.getenv("ANTHROPIC_API_KEY"))
    .modelName(CLAUDE_3_7_SONNET_20250219)
    .build();

LlmAgent agent = LlmAgent.builder()
    .name("friendly-weather-app")
    .description("Friend agent that knows about the weather")
    .model(new LangChain4j(claudeModel, CLAUDE_3_7_SONNET_20250219))
    .instruction("""
        You are a friendly assistant.

        If asked about the weather forecast for a city,
        you MUST call the `weather-agent` function.
        """)
    .tools(AgentTool.create(weatherAgent))
    .build();

Mixing different models in a multi-agent scenario is quite interesting, as you can use the best model for the job. Maybe you’ll need to use a super fast model to do a simple classification task to route requests depending on the ask, while you’ll use a beefier model for the main task that requires more advanced thinking (like a Gemini 2.5 thinking model).
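To make the routing idea a bit more tangible, here is a minimal sketch in plain Java, independent of ADK and LangChain4j. The models are stand-in `Function<String, String>` lambdas, the classifier is a simple `Predicate<String>`, and all the names (`ModelRouterSketch`, `route`, `isSimple`) are hypothetical. In a real system, the classification itself would typically be delegated to the fast model.

```java
import java.util.function.Function;
import java.util.function.Predicate;

public final class ModelRouterSketch {

    /** Returns a function sending simple prompts to the fast model, others to the strong one. */
    public static Function<String, String> route(
            Predicate<String> isSimple,
            Function<String, String> fastModel,
            Function<String, String> strongModel) {
        return prompt -> isSimple.test(prompt)
            ? fastModel.apply(prompt)
            : strongModel.apply(prompt);
    }

    public static void main(String[] args) {
        // Stand-in models that only tag the prompt with the name of the chosen model.
        Function<String, String> router = route(
            prompt -> prompt.length() < 40,
            prompt -> "[fast] " + prompt,
            prompt -> "[strong] " + prompt);

        System.out.println(router.apply("What is the moon?"));
        System.out.println(router.apply("Explain in detail why the moon shows phases over a month."));
    }
}
```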

Summary

These are still early days: as I mentioned at the beginning, this is just a work in progress right now. But I believe it is a great way to extend ADK Java to support a lot more models, including local ones, and it opens up some interesting perspectives in multi-agent scenarios when mixing models together.

Stay tuned, I’ll keep you posted on this development!

https://glaforge.dev/posts/2025/06/05/expanding-adk-java-llm-coverage-with-langchain4j/
An ADK Java GitHub template for your first Java AI agent

With the unveiling of the Java version of Agent Development Kit (ADK), which lets you build AI agents in Java, I recently covered how to get started developing your first agent.

The installation and quickstart documentation also helps with the first steps, but I realized that it would be handy to provide a template project, to further accelerate your time-to-first-conversation with your Java agents! This led me to play with GitHub’s template project feature, which allows you to create a copy of the template project in your own account or organization. It comes with a ready-made project structure, a configured pom.xml file, and a first Java agent that you can customize at will, and run from either the command line or the ADK Dev UI.

Clone the project

➡️ Clone the adk-java-maven-template project! ⬅️

The project follows a standard Java project structure:

project_folder/
├── pom.xml
└── src/
    └── main/
        └── java/
            └── com/
                └── example/
                    └── agent/
                        └── HelloWeatherAgent.java

The pom.xml build file declares the two ADK dependencies:

<!-- The ADK core dependency -->
<dependency>
    <groupId>com.google.adk</groupId>
    <artifactId>google-adk</artifactId>
    <version>0.1.0</version>
</dependency>
<!-- The ADK dev web UI to debug your agent -->
<dependency>
    <groupId>com.google.adk</groupId>
    <artifactId>google-adk-dev</artifactId>
    <version>0.1.0</version>
</dependency>

And the HelloWeatherAgent.java class shows how to create a simple agent:

//...
LlmAgent.builder()
    .name("hello-weather-agent")
    .description("Hello World")
    .instruction("""
        You are a friendly assistant,
        answering questions in a concise manner.

        When asked about weather information,
        you MUST use the `getWeather` function.
        """)
    .model("gemini-2.0-flash")
    .tools(FunctionTool.create(HelloWeatherAgent.class, "getWeather"))
    .build();
//...

An agent that makes use of a tool to request weather forecasts:

//...
@Schema(description = "Get the weather forecast for a given city")
public static Map<String, String> getWeather(
        @Schema(name = "city",
        description = "Name of the city to get the weather forecast for")
        String city) {
    return Map.of(
        "city", city,
        "forecast", "Sunny day, clear blue sky, temperature up to 24°C"
    );
}
//...
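The tool is registered by passing a class and a method name, which suggests a lookup by reflection. The sketch below illustrates that general technique with the JDK only. It is not ADK’s actual implementation, and the `ReflectiveToolSketch` class and its `invokeStatic` helper are made-up names. It looks up a public static method by name and invokes it with the given arguments.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Map;

public final class ReflectiveToolSketch {

    // A stand-in tool method, similar in shape to getWeather above.
    public static Map<String, String> getWeather(String city) {
        return Map.of("city", city, "forecast", "Sunny day, clear blue sky");
    }

    /** Finds a public static method by name and argument count, then invokes it. */
    public static Object invokeStatic(Class<?> owner, String methodName, Object... args) {
        for (Method method : owner.getMethods()) {
            if (method.getName().equals(methodName)
                    && Modifier.isStatic(method.getModifiers())
                    && method.getParameterCount() == args.length) {
                try {
                    return method.invoke(null, args); // null receiver: the method is static
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("Could not invoke " + methodName, e);
                }
            }
        }
        throw new IllegalArgumentException("No static method named " + methodName);
    }

    public static void main(String[] args) {
        System.out.println(invokeStatic(ReflectiveToolSketch.class, "getWeather", "Paris"));
    }
}
```

A framework would additionally read annotations such as the descriptions above to build the function declaration sent to the model, which this sketch leaves out.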

There are two ways to run the agent. Via the command line:

mvn compile exec:java \
    -Dexec.mainClass="com.example.agent.HelloWeatherAgent"

Or the ADK Dev UI:

mvn compile exec:java \
    -Dexec.mainClass="com.google.adk.web.AdkWebServer" \
    -Dexec.classpathScope="compile"

Which will show the nice and handy development UI, to help you prototype and debug your agent:

😃 Happy Java AI agent building! 🛠️

https://glaforge.dev/posts/2025/05/27/adk-java-github-template/
Things you never dared to ask about LLMs — Take 2

Recently, I had the chance to deliver this talk on the mysteries of LLMs at Devoxx France, with my good friend Didier Girard. It was fun to uncover the oddities of LLMs, and better understand where they thrive or fail, and why. I also delivered this talk alone at Devoxx Poland.

In this post, I’d like to share an update of the presentation deck, with a few additional slides here and there, to cover, for example:

  • the difficulty LLMs have working with acronyms, scientific molecule names, plant names, and special uncommon vocabulary, which require more tokens and weaken attention,
  • the difference between deterministic and probabilistic problems, and why predictive models are still important,
  • some limits of LLMs with regards to understanding dates, data ownership, or the fact they can’t easily forget what they learned.

It was fun delivering the talk with Didier, as a friendly dialogue makes things more entertaining! We were lucky that this talk was recorded (in French 🇫🇷, however), and you can watch the video below:

Fortunately, this talk was also recorded at Devoxx Poland, in English this time:

https://glaforge.dev/talks/2025/05/26/things-you-never-dared-to-ask-about-llms-take-2/
Beyond the chatbot or AI sparkle: a seamless AI integration

When I talk about Generative AI, whether it’s with developers at conferences or with customers, I often find myself saying the same thing: chatbots are just one way to use Large Language Models (LLMs).

Unfortunately, I see many articles or presentations that just focus on demonstrating LLMs at work within the context of chatbots. I feel guilty of showing the traditional chat interfaces too. But there’s so much more to it!

For example, when I analyzed Bluesky topic trends, there was no chatbot involved, but Gemini and embedding models helped me make sense of clusters of posts. Or when I played with generating short science fiction stories, there was again no chat interface, but the LLM and the Imagen image generation models were used for their creative facet.

LLMs are also very good at replicating more classical Natural Language Processing (NLP) tasks like sentiment analysis, entity extraction, etc. Sometimes dedicated predictive models are more (cost) effective at those tasks. However, LLMs allow developers to implement those NLP use cases easily by properly prompting their favorite model. And developers may then be able to add nice and seamless features here and there in their applications.

The trigger

What led me to share my views on these usage patterns of AI? It’s when I read Kojo Osei’s article titled “there should be no AI button”. Indeed, we’re seeing a proliferation of AI “sparkle” buttons in various applications and websites. This feels like a quick hack, an extra patch, to say that the application is smart, but it also adds unnecessary cognitive load and breaks the user’s focus. It takes more than an AI sparkle button to make an application intuitive and seamlessly smart!

The author argues that dedicated “AI buttons” in user interfaces are a flawed and temporary design choice. The author says, and I agree, that the best AI user experience is seamless and integrated, and that AI buttons create unnecessary limitations and frustrations. The article calls for more contextual alternatives that don’t artificially segregate AI functionalities, and avoid breaking the flow of the user.

My take

In my opinion, the best way to use AI is to build it right into your applications, making them smarter and more helpful in a way that feels completely natural to the user. It’s not about hiding the fact that AI is involved – on the contrary, I believe users should know. But, as people go about their daily tasks in an app or on a website, AI should be there to assist them smoothly, without them needing to click a special “AI button” to make something happen, or having to open a chat window to ask for help.

I see chat interfaces as one specific use case for LLMs, and that’s fine. But they aren’t the only option, and I don’t think they’re always the most intuitive or the least disruptive. People need to stay in their flow, focused on what they’re doing. Their work should be augmented by AI, not broken up by extra clicks or messages.

Think about it: if someone is deep in concentration, writing or designing, the last thing they need is to stop, look for a button, and then start a conversation with an AI to get a suggestion. That kind of interruption significantly increases their cognitive load, breaks their focus and makes the whole process feel clunky.

Examples of more seamless and intuitive flows

Personally, I like when a smart application is proactive but doesn’t get in the way of my normal flow. To illustrate this, let’s think about some common patterns I’ve seen that I found successful at this:

  • In Gmail and Chat, the UI shows me a summary of the ongoing conversation that I missed. I can still go through the unread messages, but I can also be up-to-speed rapidly by reading the summary, and then quickly glancing through the messages to get more details.

  • Another summarization example is when I use Obsidian to take notes of articles I find interesting and want to remember. I installed a Chrome add-on, the Obsidian web clipper and its “interpreter”, which I configured to use Gemini, to create a bullet point summary of the articles, and to create relevant tags that help me navigate through similar content I already have in my notes.

  • For coding, we (developers) are now used to the seamless LLM-powered code completion. I like how it waits a little before suggesting anything. And often, when I use Gemini Code Assist, I have the impression that the LLM read my mind and knew exactly about the code I wanted to write. And if it’s not what I wanted, I’m not really disrupted or distracted, as I can continue typing if the suggestion doesn’t make sense.

  • To stay in the realm of developer workflows, your AI peer can act as the first responder to the tickets users create, and you can then hop into the conversation as the user details their issues. Similarly, for PRs (Pull Requests), your AI coding bot can analyze the code you submitted and make first recommendations to improve it, like the Gemini Code Assist bot does on GitHub.

  • Before creating a ticket, the bug tracker could also take advantage of LLM-powered or embedding-based semantic search to find similar issues, to avoid creating duplicates, or guide the reporter to pick the right component or category, depending on what the issue is about.

  • Large Language Models are great at creating first drafts of documents. Give it the right outline, directions, ideas, and you’ll get a first sketch that you can refine — even a few sentences in this very article that you read were first drafted with Gemini!

  • You can also imagine this draft generation in contexts like CRM apps, where users can create a first draft of something they want to send their customers, depending on the current situation of that customer. Or like in the linked article, with this image creation app, where the user draws a few broad strokes of colors to guide the image generation to follow along.

  • But of course if you’re already working in a chat environment (like Slack or similar messaging platforms), it still makes sense to be able to chat with an AI-powered bot! I don’t want to get rid of chat spaces altogether. But an AI assistant should be that: an assistant, a peer, a colleague you can involve where needed, or who can nudge you to tell you “hey, you forgot to add a test”, or “here are the relevant PRDs or bug entries about this new feature you’re talking about”, or “here’s a summary of the relevant past conversations on that topic”.

These are some examples of non-intrusive and seamless integrations, but UX designers need to rethink the app or website, and be acquainted with what LLMs can offer. Making an application smart isn’t just adding an AI sparkle button or a chatbot interface.

Conclusion

While chatbots are great for things like customer support or answering direct questions, they’re not always the best fit for every situation. For many tasks, I believe AI assistance should be more like a quiet, helpful partner — there when you need it, maybe even anticipating what you need, without you having to constantly ask.

What I really believe we should aim for is AI that boosts what people are already doing, making their work easier and more powerful without adding extra steps. We should be creating experiences where AI enhances the tools people use every day, making them more efficient and insightful, without forcing them to constantly switch contexts or explicitly request AI intervention for every little thing via a sparkle button or a chat message.

Ultimately, I think the AI integrations that will truly succeed are the ones that don’t feel like an add-on. Instead, they’ll feel like a natural, intelligent part of the system. My ideal is to see us build AI experiences that genuinely empower people by working seamlessly in the background, helping them stay focused and achieve more, more rapidly.

https://glaforge.dev/posts/2025/05/23/beyond-the-chatbot-or-ai-sparkle-a-seamless-ai-integration/
Write AI agents in Java — Agent Development Kit getting started guide

At Google Cloud Next ‘25, last April, Google released Agent Development Kit (ADK) for Python, a flexible and modular framework for developing and deploying AI agents.

Now at Google I/O, a Java version of ADK has been made available! And I’m glad to have had the chance to participate in its launch, via code samples, documentation, and helping shape the API so it’s idiomatic for Java developers.

In this article, my goal is to give you the basics to get started with the ADK framework, in Java, using the Gemini model, and running your first Java agents locally.

Be sure to check out the video I recorded on how to get started and learn more about some key concepts of ADK.

And if you want to start from a ready-made project with a sample agent using tools, check out the article on my GitHub template project, or a direct link to the template project.

In the meantime, let’s now focus on the key steps to write your first AI agents with ADK!

Project structure

Create a usual Java project structure like the following, with your Java sources in src/main/java:

[YOUR_PROJECT_DIRECTORY]/
 ├── pom.xml
 └── src/
     └── main/
         └── java/
             └── agents/
                 └── ScienceTeacherAgent.java

Let’s have a look at a minimal pom.xml to compile and run your agent:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>adk-agents</groupId>
    <artifactId>adk-agents</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!-- ADK core dependency -->
        <dependency>
            <groupId>com.google.adk</groupId>
            <artifactId>google-adk</artifactId>
            <version>0.1.0</version>
        </dependency>
        <!-- ADK dev web UI and API server -->
        <dependency>
            <groupId>com.google.adk</groupId>
            <artifactId>google-adk-dev</artifactId>
            <version>0.1.0</version>
        </dependency>
    </dependencies>
</project>

You just need a couple of dependencies:

  • the Core ADK framework dependency, and
  • the ADK web server that provides a nice and useful Dev UI to interact with your agents, as well as an API server.

Your first agent

Now that the project is set up, let’s implement a very simple agent. It’ll be a single agent. And it won’t yet use tools to interact with the external world. We’ll explore multi agents and tool support later on, in more advanced articles.

So let’s have a look at a science teacher agent, whose role is to help kids and teenagers understand scientific concepts in a friendly and approachable way.

Here’s the full source code. We’ll explain everything further down, in a piecemeal fashion.

package agents;

import static java.nio.charset.StandardCharsets.UTF_8;
import java.util.Scanner;

import com.google.adk.agents.BaseAgent;
import com.google.adk.agents.LlmAgent;
import com.google.adk.events.Event;
import com.google.adk.runner.InMemoryRunner;
import com.google.adk.sessions.Session;
import com.google.genai.types.Content;
import com.google.genai.types.Part;

import io.reactivex.rxjava3.core.Flowable;

/** Science teacher agent. */
public class ScienceTeacherAgent {

    public static BaseAgent ROOT_AGENT = initAgent();

    public static BaseAgent initAgent() {
        return LlmAgent.builder()
            .name("science-app")
            .description("Science teacher agent")
            .model("gemini-2.0-flash")
            .instruction("""
                You are a helpful science teacher that explains
                science concepts to kids and teenagers.
                """)
            .build();
    }

    public static void main(String[] args) {
        InMemoryRunner runner = new InMemoryRunner(ROOT_AGENT);

        Session session = runner
                .sessionService()
                .createSession(runner.appName(), "student")
                .blockingGet();

        try (Scanner scanner = new Scanner(System.in, UTF_8)) {
            while (true) {
                System.out.print("\nYou > ");
                String userInput = scanner.nextLine();
                if ("quit".equalsIgnoreCase(userInput)) {
                    break;
                }

                Content userMsg = Content.fromParts(Part.fromText(userInput));
                Flowable<Event> events =
                    runner.runAsync(session.userId(), session.id(), userMsg);

                System.out.print("\nAgent > ");
                events.blockingForEach(event -> {
                    System.out.println(event.stringifyContent());
                });
            }
        }
    }
}

So what does the simplest agent look like?

LlmAgent.builder()
    .name("science-app")
    .description("Science teacher agent")
    .model("gemini-2.0-flash")
    .instruction("""
        You are a helpful science teacher that explains
        science concepts to kids and teenagers.
        """)
    .build();

An agent consists of a name, a description, a model, and some instructions, and that’s it! In other articles, we’ll also see how to give it tools, how to use state, how it can interact with other agents and workflows, but for now, let’s stick with this simple science teacher agent.

Running your agent

Let’s say you’ve saved the above agent in a ROOT_AGENT static field of your class (it’ll come in handy for the Dev UI later on). Now we need to instantiate a runner and a session to run the agent.

First, you instantiate an InMemoryRunner with your agent:

InMemoryRunner runner = new InMemoryRunner(ROOT_AGENT);

Then you create a session with the SessionService that you can grab from the runner itself (passing an application name and user ID):

Session session = runner
        .sessionService()
        .createSession(runner.appName(), "student")
        .blockingGet();

Now, if you want a familiar chatbot-like experience, you’ll loop, alternating between user-provided messages and LLM-generated answers. Let’s use a Scanner to read the user’s input inside a while loop, and send the prompt to the LLM agent via the runner’s runAsync() method, passing the user ID, the session ID, and the message. The agent replies with a flow of events that we can then print:

try (Scanner scanner = new Scanner(System.in, UTF_8)) {
    while (true) {
        System.out.print("\nYou > ");
        String userInput = scanner.nextLine();
        if ("quit".equalsIgnoreCase(userInput)) {
            break;
        }

        Content userMsg = Content.fromParts(Part.fromText(userInput));
        Flowable<Event> events = runner
            .runAsync(session.userId(), session.id(), userMsg);

        System.out.print("\nAgent > ");
        events.blockingForEach(event -> {
            System.out.println(event.stringifyContent());
        });
    }
}

Define environment variables

You’ll need to export two environment variables:

  • a Gemini key that you can get from AI Studio,
  • a variable to specify you’re not using Vertex AI this time.

export GOOGLE_GENAI_USE_VERTEXAI=FALSE
export GOOGLE_API_KEY=AIzaSyDF...

In this article, let’s use the Google AI / DeepMind endpoint and API key for Gemini, but you can also use Gemini from Google Cloud Vertex AI.
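Forgetting one of these exports tends to surface later as a confusing authentication error, so it can help to fail fast. Here is a minimal sketch of a startup check for the two variables above; the `EnvCheck` class and its `missing` method are hypothetical names for illustration, not part of ADK:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical startup check for the two variables exported above. */
class EnvCheck {

    static final List<String> REQUIRED =
        List.of("GOOGLE_GENAI_USE_VERTEXAI", "GOOGLE_API_KEY");

    /** Returns the names of required variables that are absent or blank. */
    static List<String> missing(Map<String, String> env) {
        List<String> result = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.isBlank()) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> absent = missing(System.getenv());
        if (!absent.isEmpty()) {
            System.err.println("Missing environment variables: " + absent);
            System.exit(1);
        }
        System.out.println("Environment looks good.");
    }
}
```

Taking the environment as a `Map` parameter, rather than calling `System.getenv()` inside the method, keeps the check easy to exercise without touching the real environment.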

Executing the agent from the command line

From the command line, let’s use the exec:java goal of the Exec Maven Plugin to launch this science teacher agent:

mvn compile exec:java -Dexec.mainClass="agents.ScienceTeacherAgent"

Let’s see the output when asking a simple question about “qubits”!

[INFO] Scanning for projects...
[INFO]
[INFO] -----------------------< adk-agents:adk-agents >------------------------
[INFO] Building adk-agents 1.0-SNAPSHOT
[INFO]   from pom.xml
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- resources:3.3.1:resources (default-resources) @ adk-agents ---
[INFO] skip non existing resourceDirectory /Users/me/Projects/adk-agents/src/main/resources
[INFO]
[INFO] --- compiler:3.13.0:compile (default-compile) @ adk-agents ---
[INFO] Nothing to compile - all classes are up to date.
[INFO]
[INFO] --- exec:3.5.0:java (default-cli) @ adk-agents ---

You > What is a qbit? Please answer in a concise manner.

Agent > Hey there, future tech wiz! 👋

A **qubit** (short for "quantum bit") is the basic unit of information
in a quantum computer. Unlike a regular bit in your computer, which is
either a 0 or a 1, a qubit can be a 0, a 1, or *both at the same time*
thanks to something called **superposition**! This "both at once"
ability is what makes quantum computers super powerful for certain
kinds of problems.

You > quit
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  0:22 min
[INFO] Finished at: 2025-05-19T01:00:12+02:00
[INFO] ------------------------------------------------------------------------

And the conversation can go on and on, till you’re bored and you enter “quit” to end the chat.
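One detail about the loop shown earlier: `scanner.nextLine()` throws a `NoSuchElementException` when standard input is closed (for example when the input is piped in). The sketch below is one way to guard against that. The agent call is abstracted behind a plain `Function<String, String>` so that the snippet runs without ADK; `ChatLoop` and `run` are hypothetical names, not ADK APIs.

```java
import java.io.PrintStream;
import java.util.Scanner;
import java.util.function.Function;

/** Hypothetical helper: a read-reply loop that stops on "quit" or end of input. */
class ChatLoop {

    /** Returns the number of user messages that were answered. */
    static int run(Scanner in, PrintStream out, Function<String, String> agent) {
        int turns = 0;
        while (true) {
            out.print("\nYou > ");
            // hasNextLine() returns false once the input is exhausted,
            // instead of letting nextLine() throw NoSuchElementException.
            if (!in.hasNextLine()) {
                break;
            }
            String userInput = in.nextLine();
            if ("quit".equalsIgnoreCase(userInput.strip())) {
                break;
            }
            out.print("\nAgent > ");
            out.println(agent.apply(userInput));
            turns++;
        }
        return turns;
    }
}
```

In the science teacher example, the function passed in would wrap the call to the runner and concatenate the content of the returned events.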

Executing the agent in the Dev UI

ADK comes with a Dev web UI (and API server) that you can use to run your agents, see how they behave, the data they exchange, how they interact with the underlying LLM. So let’s talk about this.

The following Maven command will launch the Dev UI:

mvn exec:java \
    -Dexec.mainClass="com.google.adk.web.AdkWebServer" \
    -Dexec.classpathScope="compile"

Note: You can add a different path or even use a sub-directory (if you want to expose only agents in a certain package) by adding and customising this flag: -Dexec.args="--adk.agents.source-dir=src/main/java/com/foo/agents".

This time, you run the AdkWebServer class from the second dependency (google-adk-dev). There’s one constraint currently for the Dev UI to grab and run your agents: they have to be statically initialized and stored in a public static BaseAgent ROOT_AGENT field. That’s why the class had the following structure:

public class ScienceTeacherAgent {
    public static BaseAgent ROOT_AGENT = initAgent();

    public static BaseAgent initAgent() { ... }
}
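
To make this constraint more concrete, here is a minimal sketch of how a loader could locate such a field by reflection. It only illustrates the general idea and is not the Dev UI’s actual implementation: `RootAgentFinder`, `find`, and `SampleAgent` are hypothetical, and a plain `Object` stands in for `BaseAgent` so that the snippet runs without ADK.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Optional;

/** Hypothetical loader that looks for a public static ROOT_AGENT field. */
class RootAgentFinder {

    static Optional<Object> find(Class<?> candidate) {
        try {
            // getField() only sees public fields, including inherited ones.
            Field field = candidate.getField("ROOT_AGENT");
            if (!Modifier.isStatic(field.getModifiers())) {
                return Optional.empty(); // an instance field would need an object
            }
            // Reading a static field triggers class initialization,
            // which is when the agent gets built.
            return Optional.ofNullable(field.get(null));
        } catch (NoSuchFieldException | IllegalAccessException e) {
            return Optional.empty();
        }
    }
}

/** Stand-in for an agent class; the String replaces a real BaseAgent. */
class SampleAgent {
    public static Object ROOT_AGENT = "science-app";
}
```

A private or non-static field would simply not be found by a lookup of this kind, which is consistent with the requirement that the field be public and static.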

To access the Dev UI, just head over to your browser and open the following URL: http://localhost:8080/dev-ui

You should see something similar to this:

adk-dev-ui-science-teacher.png

  • You can select an agent from the loaded agents in the top left hand corner.
  • When clicking on the events, you can see the popup over the left pane that shows all the details of the events that are flowing through, as well as the input request and output response from the LLM.
  • On the right side, in the chat area, you can see the dialogue between the user and the AI, as well as potential tool calls.
  • You can also enable token streaming with the slider at the top of the screen, if you want to see responses streamed as the model generates them.

Congratulations!

Wait, what? Are we already done? Yes, you’ve coded and executed your first Java agent with the ADK framework!

Now what?

Many possible next steps, and also pointers, that I’d like to share with you.

I’m really looking forward to hearing from you about what you’ll be building with ADK for Java! You can expect many more articles on this blog on the topic of ADK. So stay tuned!

https://glaforge.dev/posts/2025/05/20/writing-java-ai-agents-with-adk-for-java-getting-started/
Vibe coding an MCP server with Micronaut, LangChain4j, and Gemini

Unlike Quarkus and Spring Boot, Micronaut doesn’t (yet?) provide a module to facilitate the implementation of MCP servers (Model Context Protocol). But since it’s my favorite framework, I decided to see what it takes to build a quick implementation, by vibe coding it, with the help of Gemini!

In a recent article, I explored how to use the MCP reference implementation for Java to implement an MCP server, served as a servlet via Jetty, and to call that server from LangChain4j’s great MCP support. One approach with Micronaut may have been to somehow integrate the servlet I had built via Micronaut’s servlet support, but that didn’t really feel like a genuine and native way to implement a server, so I decided to do it from scratch.

Vibe coding with Gemini

The concept of vibe coding came from a tweet by Andrej Karpathy, who defined it as interacting with an LLM to build a new prototype or weekend project, and iterating with the LLM till it works, but without looking at or touching the code yourself. It’s quite different from using AI assistance to build a production-ready code base. And Simon Willison has just written a good piece on what is and what is not vibe coding.

I started by throwing some simple prompts at Gemini 2.5 Pro for creating an MCP server with Micronaut and Java 21, but it would not generate something really usable, at least not in one shot! For example, it would not use Server-Sent Events, or it hadn’t figured out that the protocol uses JSON-RPC, etc. So instead of steering the LLM in the right direction via multiple prompts, I reused my tactic of feeding as much information as needed into the prompt. This is the approach I took to grok one’s own content with LLMs.txt, or when I wrote about the power of large context windows for your documentation efforts.

So what was the successful prompt?

Let’s implement a Model Context Protocol (MCP) using the Micronaut framework.

We will use Micronaut 4.8 and Java 21. You can find Micronaut’s documentation here: https://docs.micronaut.io/4.8.11/guide/

The details of the Model Context Protocol (MCP) can be found here: https://modelcontextprotocol.io/llms-full.txt

For the client, we will use LangChain4j as shown in this article: https://glaforge.dev/posts/2025/04/04/mcp-client-and-server-with-java-mcp-sdk-and-langchain4j/

You can find all the code of the LangChain4j MCP client support in the attached file.

[!INFO] Gitingest 📎 [gitingest of the LangChain4j MCP client code]

You can reuse the Java classes of the LangChain4j MCP client to implement the MCP server support with Micronaut.

Implement a simple MCP server to let MCP clients request the weather forecast. Return fake data like {"forecast": "sunny"}

The trick here was to feed the whole MCP specification thanks to the llms-full.txt file, the whole Micronaut single-page documentation, and also the LangChain4j MCP client source code in attachment (via gitingest).

Did it work in one shot? Actually, no. Because my client wouldn’t connect to it somehow, I figured that it wasn’t actually using Server-Sent Events. So I sent a follow-up prompt:

The Micronaut controller must use HTTP Server Sent Events, as this is what the MCP protocol mandates for MCP remote servers. Please update the controller to use SSE.

Then I had a running server.

Gemini created an SseBroadcaster class which takes care of the Server-Sent Events, thanks to Reactor’s Publisher, Flux, and Sinks, and Micronaut’s JsonMapper and SSE support.
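
For readers less familiar with Server-Sent Events, the wire format is plain text: an optional `event:` line, one `data:` line per line of payload, and a blank line that terminates the event. Here is a minimal sketch of a formatter, independent of Micronaut and Reactor; the `SseFormat` class is a hypothetical name and is not part of the code Gemini generated.

```java
/** Hypothetical helper that serializes one Server-Sent Event to its text form. */
class SseFormat {

    static String format(String eventName, String data) {
        StringBuilder sb = new StringBuilder();
        if (eventName != null && !eventName.isEmpty()) {
            sb.append("event: ").append(eventName).append('\n');
        }
        // A payload spanning several lines needs one "data:" field per line.
        // The -1 limit keeps trailing empty lines instead of dropping them.
        for (String line : data.split("\r\n|\r|\n", -1)) {
            sb.append("data: ").append(line).append('\n');
        }
        sb.append('\n'); // a blank line dispatches the event
        return sb.toString();
    }
}
```

With this format, a serialized JSON-RPC response simply travels as the `data` payload of an event. In a framework like Micronaut, the SSE support handles this serialization for you.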

The generated PostController handles the various JSON-RPC operations (initialize, notifications/initialized, tools/list, tools/call, and ping):

private McpResponse processRequest(McpRequest request) {
    // --- Same logic as before to generate the McpResponse object ---
    switch (request.method()) {
        case "initialize":
            log.info("Handling initialize request");
            InitializeResult initResult = new InitializeResult(new ServerCapabilities());
            return new McpResponse(request.id(), initResult);

        case "notifications/initialized":
            log.info("Received initialized notification");
            // This is a notification FROM the client. MCP spec says notifications
            // don't have responses. So we return null here, and the POST handler
            // will just return HTTP OK.
            return null;

        case "tools/list":
            log.info("Handling tools/list request");
            ToolSpecificationData weatherTool = new ToolSpecificationData(
                WEATHER_TOOL_NAME,
                "Gets the current weather forecast.",
                new InputSchema(
                    "object",
                    Map.of("location", Map.of(
                        "type", "string",
                        "description", "Location to get the weather for")
                    ),
                    List.of("location"),
                    false)
            );
            ListToolsResult listResult = new ListToolsResult(List.of(weatherTool));
            return new McpResponse(request.id(), listResult);

        case "tools/call":
            log.info("Handling tools/call request");
            if (request.params() != null && request.params().has("name")) {
                String toolName = request.params().get("name").asText();
                if (WEATHER_TOOL_NAME.equals(toolName)) {
                    log.info("Executing tool: {}", toolName);
                    TextContentData textContent = new TextContentData(FAKE_WEATHER_JSON);
                    CallToolResult callResult = new CallToolResult(List.of(textContent));
                    return new McpResponse(request.id(), callResult);
                } else {
                    log.warn("Unknown tool requested: {}", toolName);
                    return new McpResponse(request.id(), new McpError(-32601, "Method not found: " + toolName));
                }
            } else {
                log.error("Invalid tools/call request: Missing 'name' in params");
                return new McpResponse(request.id(), new McpError(-32602, "Invalid params for tools/call"));
            }

        case "ping":
            log.info("Handling ping request");
            return new McpResponse(request.id(), Collections.emptyMap());

        default:
            log.warn("Unsupported MCP method: {}", request.method());
            return new McpResponse(request.id(), new McpError(-32601, "Method not found: " + request.method()));
    }
}
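
The numeric codes in this switch come from JSON-RPC 2.0, which reserves -32601 for "Method not found" and -32602 for "Invalid params". As a rough sketch of the envelopes that end up on the wire, here are hand-built JSON strings standing in for the McpResponse and McpError classes, whose definitions aren’t shown here. The `JsonRpc` class is hypothetical, and it assumes numeric request ids, although JSON-RPC also allows string ids.

```java
/** Hypothetical helper building JSON-RPC 2.0 response envelopes as strings. */
class JsonRpc {

    static final int METHOD_NOT_FOUND = -32601;
    static final int INVALID_PARAMS = -32602;

    /** Success envelope; resultJson must already be valid JSON. */
    static String result(long id, String resultJson) {
        return "{\"jsonrpc\":\"2.0\",\"id\":" + id + ",\"result\":" + resultJson + "}";
    }

    /** Error envelope; only quotes and backslashes in the message are escaped. */
    static String error(long id, int code, String message) {
        String escaped = message.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"jsonrpc\":\"2.0\",\"id\":" + id
            + ",\"error\":{\"code\":" + code + ",\"message\":\"" + escaped + "\"}}";
    }
}
```

A response carries either a `result` or an `error` member, never both, and notifications such as notifications/initialized get no envelope at all. Real code would rely on a JSON library for serialization, as the generated server does with Micronaut’s JsonMapper.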
From vibe coding, to a more classical AI-assisted approach

The vibe coding part ended somewhere here, as I then went on to make a few tweaks here and there to the code base. When you’re a developer, you can’t resist tweaking a few things here and there, right?

I continued the journey with Gemini’s help, but this time via Gemini Code Assist within IntelliJ IDEA.

I reused my MCP client from my recent MCP article, but I asked Gemini Code Assist to transform the Java class into a proper JUnit test. Since the Micronaut documentation was still part of the conversation’s context, thanks to Gemini’s huge context window, it again did a great job of converting my code into a proper Micronaut unit test that launches an embedded server.

I like that Gemini created three test methods: one to check that the server can reply to a simple greeting message, then one to check that it can list the MCP tools at the disposal of clients:

@Test
void testListTools() {
    log.info("Testing listTools...");
    assertDoesNotThrow(() -> {
        List<dev.langchain4j.agent.tool.ToolSpecification> tools = mcpClient.listTools();
        assertNotNull(tools, "Tool list should not be null");
        assertFalse(tools.isEmpty(), "Tool list should not be empty");
        // Add more specific assertions if needed, e.g., check tool names
        assertTrue(tools.stream().anyMatch(t -> "getWeatherForecast".equals(t.name())),
            "Should find the 'getWeatherForecast' tool");
        log.info("listTools returned: {}", tools);
    }, "Listing tools should not throw an exception");
}

And finally, one to check that it’s able to reply to a weather request by returning some dummy weather data:

@Test
void testWeatherRequest() {
    log.info("Testing weather request...");
    String question = "What's the weather like in Paris today?";
    String response = assertDoesNotThrow(() -> weatherAssistant.request(question),
        "Weather request should not throw an exception");

    log.info("Question: {}", question);
    log.info("Response: {}", response);

    assertNotNull(response, "Response should not be null");
    assertFalse(response.isBlank(), "Response should not be blank");
    // Check if the response likely contains the mocked forecast
    assertTrue(response.toLowerCase().contains("sunny"),
        "Response should contain the weather information (sunny)");
}

Now, the code!

Since you made it to the end, here’s your reward: I pushed the code to this GitHub repository! The project doesn’t implement all the bells and whistles of the MCP specification (like prompts, resources, sampling, etc.) but it’s certainly a starting point if you want to write your own MCP server with Micronaut. Since LangChain4j released some new versions today, I updated the dependencies to use the latest and greatest LangChain4j.

Going further, I’d love to see Micronaut offer a dedicated MCP server module, to make it easier to implement MCP servers, with some clever annotations, to streamline the whole process. Fingers crossed 🤞

And of course, once you’re happy with your MCP server implementation, the extra step is to deploy the MCP server to Google’s Cloud Run, like I explained in this article about the various ways to deploy Micronaut apps to Cloud Run.

https://glaforge.dev/posts/2025/05/02/vibe-coding-an-mcp-server-with-micronaut-and-gemini/
MCP Client and Server with the Java MCP SDK and LangChain4j

MCP (Model Context Protocol) is creating a buzz these days! MCP is a protocol invented last November by Anthropic, integrated in Claude Desktop and in more and more tools and frameworks, to expand LLMs’ capabilities by giving them access to various external tools and functions.

My colleague Philipp Schmid gave a great introduction to MCP recently, so if you want to learn more about MCP, this is the place for you.

In this article, I’d like to guide you through the implementation of an MCP server, and an MCP client, in Java. As I’m contributing to LangChain4j, I’ll be using LangChain4j’s mcp module for the client.

For the server, it’s possible to use Quarkus or Spring Boot. But Christian Tzolov, who built the MCP library used by Spring Boot (which is also the official Java SDK promoted by the Model Context Protocol project), recently tweeted that the MCP reference implementation can also be used standalone without a mandatory framework:

🚀 The MCP Java SDK core module provides default STDIO & SSE client/server transports - no external web frameworks needed!

Build MCP clients/servers in plain Java with both Async (Reactor) & Sync APIs. Spring support is optional.

🔗 Minimal deps: https://t.co/9uW5vgMqqh pic.twitter.com/aHLBHuXOUz

— Christian Tzolov🇧🇬🇪🇺🇺🇦 🦋@tzolov.bsky.social (@christzolov) March 30, 2025
Developing the MCP server

For the server, I need the reference implementation dependency, as well as some Jetty JARs (or the servlet container of your choice) to expose an HTTP Server-Sent Events endpoint (you can also create stdio servers):

<dependency>
	<groupId>io.modelcontextprotocol.sdk</groupId>
	<artifactId>mcp</artifactId>
	<version>0.8.1</version>
</dependency>
<dependency>
	<groupId>org.eclipse.jetty</groupId>
	<artifactId>jetty-server</artifactId>
	<version>12.0.18</version>
</dependency>
<dependency>
	<groupId>org.eclipse.jetty.ee10</groupId>
	<artifactId>jetty-ee10-servlet</artifactId>
	<version>12.0.18</version>
</dependency>

I’m using Jetty here, as I want to expose the HTTP SSE endpoint as a servlet.

The first thing needed is to create an HTTP servlet SSE transport provider. I’ll expose the /sse endpoint that the client will be able to access:

HttpServletSseServerTransportProvider transportProvider =
    new HttpServletSseServerTransportProvider(
        new ObjectMapper(), "/", "/sse");

The MCP reference implementation allows you to implement async or sync servers. I’m going with a synchronous one, as it’s easier to implement:

McpSyncServer syncServer = McpServer.sync(transportProvider)
    .serverInfo("custom-server", "0.0.1")
    .capabilities(McpSchema.ServerCapabilities.builder()
        .tools(true)
        .resources(false, false)
        .prompts(false)
        .build())
    .build();

MCP servers can expose:

  • tools
  • resources
  • prompts

In my case, I’m just interested in exposing a tool. I’ll go with a classical weather tool, which is a bit like the Hello World of LLM function calling!

Let’s define our weather-forecast tool:

McpServerFeatures.SyncToolSpecification syncToolSpecification =
    new McpServerFeatures.SyncToolSpecification(
        new McpSchema.Tool("weather-forecast",
            "gives today's weather forecast for a given location",
            """
            {
              "type": "object",
              "properties": {
                "location": {
                  "type": "string"
                }
              },
              "required": ["location"]
            }
            """
        ),
        (mcpSyncServerExchange, stringObjectMap) -> {
            return new McpSchema.CallToolResult(
                List.of(new McpSchema.TextContent("""
                    {
                        "location": "Paris",
                        "forecast": "Nice and sunny weather, with clear blue sky, and temperature of 17°C."
                    }
                    """
                )), false);
        }
    );

syncServer.addTool(syncToolSpecification);

I defined the tool, with a description (which helps LLMs know which tools to invoke for which use case). The schema of the input is described as a JSON string (as I struggled a bit to find the correct way to create the schema programmatically).

Then I defined the lambda function that is called when the tool is invoked. I’m returning a JSON object that contains the location and the forecast.
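
Note that the handler above ignores its arguments and always answers for Paris. As a small sketch of how the requested location could be read instead, assuming (as the `stringObjectMap` parameter name suggests) that the arguments arrive as a `Map<String, Object>`; the `FakeWeather` class and its method are hypothetical, and the forecast is still fake data:

```java
import java.util.Map;

/** Hypothetical helper producing the fake forecast for the requested location. */
class FakeWeather {

    static String forecastJson(Map<String, Object> arguments) {
        // Fall back to a placeholder when the client sent no "location".
        Object location = arguments.getOrDefault("location", "unknown");
        // Escape backslashes and quotes so the value stays a valid JSON string.
        String escaped = String.valueOf(location)
            .replace("\\", "\\\\")
            .replace("\"", "\\\"");
        return """
            {
                "location": "%s",
                "forecast": "sunny"
            }
            """.formatted(escaped);
    }
}
```

Inside the lambda, the returned string could then be wrapped in the text content of the tool result, in place of the hard-coded JSON.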

And I’m done with the MCP server implementation!

But now, I need to expose this server thanks to the Jetty Servlet container.

Let’s define a new Jetty server, connector, and servlet context handler, register the servlet, and start the server:

QueuedThreadPool threadPool = new QueuedThreadPool();
threadPool.setName("server");

Server server = new Server(threadPool);

ServerConnector connector = new ServerConnector(server);
connector.setPort(45450);
server.addConnector(connector);

ServletContextHandler context = new ServletContextHandler();
context.setContextPath("/");
context.addServlet(new ServletHolder(transportProvider), "/*");

server.setHandler(context);
server.start();

Now if you run this code, your MCP server will be waiting for its first invocations.

Developing the MCP client with LangChain4j

For the MCP client, I use the LangChain4j MCP module:

<dependency>
	<groupId>dev.langchain4j</groupId>
	<artifactId>langchain4j-mcp</artifactId>
	<version>${langchain4j.version}</version>
</dependency>

Since I’m using Gemini, I need some dependencies for the Vertex AI Gemini model:

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j</artifactId>
    <version>${langchain4j.version}</version>
</dependency>
<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-vertex-ai-gemini</artifactId>
    <version>${langchain4j.version}</version>
</dependency>

Let’s get started by instantiating our Gemini 2.0 Flash-Lite model, as the first resource of a try-with-resources block:

try (
    VertexAiGeminiChatModel model = VertexAiGeminiChatModel.builder()
        .project("genai-playground24")
        .location("us-central1")
        .modelName("gemini-2.0-flash-lite")
        .build();

In the same try-with-resources header, I’m defining an McpTransport pointing at my local Jetty SSE server:

McpTransport transport = new HttpMcpTransport.Builder()
    .sseUrl("http://0.0.0.0:45450/sse")
    .build()) {

Let’s create an MCP client using that transport:

McpClient mcpClient = new DefaultMcpClient.Builder()
    .transport(transport)
    .build();

And a tool provider that will expose just the weather-forecast tool:

ToolProvider toolProvider = McpToolProvider.builder()
    .mcpClients(List.of(mcpClient))
    .build();

You can list the available tools as follows:

mcpClient.listTools().forEach(System.out::println);

As I’m going to create a LangChain4j AI service, I need a contract. This will be the following simple interface:

interface WeatherAssistant {
    String request(String message);
}

Now, it’s time to instantiate that service:

WeatherAssistant meteo = AiServices.builder(WeatherAssistant.class)
    .chatLanguageModel(model)
    .toolProvider(toolProvider)
    .build();

And we can now chat with that service, asking mundane questions, as well as weather-related questions that will invoke the weather-forecast tool:

List.of(
    "Hello!",
    "What's the weather like in Paris today?"
).forEach((String q) -> {
    System.out.println(blue(q));
    System.out.println(green(meteo.request(q)));
});
}

It will print something like:

Hello!
The weather in Paris is sunny today, with a clear blue sky,
and a temperature of 17°C.
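
The `blue()` and `green()` helpers used in the loop above are not shown in this article. One plausible way to write them, as a sketch, is with standard ANSI escape codes (34 for a blue foreground, 32 for green, 0 to reset); the `Ansi` class name is hypothetical:

```java
/** Hypothetical ANSI color helpers; the article does not show its own versions. */
class Ansi {

    private static final String RESET = "\u001B[0m";

    static String blue(String text) {
        return "\u001B[34m" + text + RESET;
    }

    static String green(String text) {
        return "\u001B[32m" + text + RESET;
    }
}
```

With a static import of these two methods, the loop would print questions in blue and answers in green on terminals that support ANSI colors.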
Conclusion

In this article, we’ve seen:

  • how to create an MCP server, using the official Java MCP SDK,
  • and how to create an MCP client with LangChain4j.

We’ve seen how to define a tool, and how to expose it via an HTTP SSE endpoint. We’ve also seen how to use the MCP client to invoke that tool, and how to integrate it in a LangChain4j AI service.

This is a great way to expand the capabilities of LLMs, by giving them access to external tools and functions. And it shows that you can interact with any MCP server with LangChain4j.

https://glaforge.dev/posts/2025/04/04/mcp-client-and-server-with-java-mcp-sdk-and-langchain4j/
Quick Tip: Clearing disk space in Cloud Shell

Right in the middle of a workshop I was delivering, as I was launching Google Cloud console’s Cloud Shell environment, I received the dreaded warning message: no space left on device.

And indeed, I didn’t have much space left, and Cloud Shell was reminding me it was high time I clean up the mess! Fortunately, the shell gives a nice hint, with a pointer to this documentation page with advice on how to reclaim space.

The documentation suggests running the following command:

du -hs $(ls -A)

This command shows the disk space used by each file and directory in the current directory, hidden ones included.

Here’s the output I got after having cleaned up the many caches, directories and projects I didn’t need anymore:

20K     .bash_history
4.0K    .bash_logout
4.0K    .bashrc
20M     .cache
320M    .codeoss
112K    .config
8.0K    .docker
247M    gemini-workshop-for-java-developers
4.0K    .gitconfig
341M    .gradle
12K     .gsutil
4.0K    .lesshst
16K     .npm
4.0K    .profile
0       .python_history
4.0K    README-cloudshell.txt
8.0K    .redhat
4.0K    .ssh
0       .sudo_as_admin_successful
8.0K    .vscode

You quickly see which directories (like .codeoss or my gemini-workshop-for-java-developers) take up the most space, and you can go after each of them and launch some rm -Rf some-directory commands here and there. Of course, pay attention to what you’re going to delete, as this is irreversible!

https://glaforge.dev/posts/2025/03/08/quick-tip-clearing-disk-space-in-cloud-shell/
LLMs.txt to help LLMs grok your content

Since I started my career, I’ve been sharing what I’ve learned along the way in this blog. It makes me happy when developers find solutions to their problems, or discover new things, thanks to articles I’ve written here. So it’s important for me that readers are able to find those posts. Of course, my blog is indexed by search engines, and people usually find out about it from Google or other engines, or they discover it via the links I share on social media. But with LLM-powered tools (like Gemini, ChatGPT, Claude, etc.), you can also make your content more easily grokkable by such tools.

A proposal emerged last year to add a new file to websites, llms.txt, which is a Markdown document that lists all the posts, pages, and articles of that website. In a way, it’s similar to other kinds of indexing-related files like robots.txt or sitemap.xml. But this time, the idea is to offer the content of your documentation, website, or blog, as Markdown, easily grokkable by LLM-powered tools.

Updating Hugo to generate LLMs.txt files

This blog is powered by Hugo, a static site generator. I’ll explain how I added 2 new files to my blog:

  • llms.txt — which references links to all the blog posts, a bit like a sitemap.
  • llms-full.txt — which contains all the blog post articles in one giant file.

In config.toml, I added two new output formats:

[outputFormats.llms]
baseName = "llms"
mediaType = "text/plain"
isPlainText = true

[outputFormats.llms-full]
baseName = "llms-full"
mediaType = "text/plain"
isPlainText = true

[outputs]
home = ["HTML", "RSS", "llms", "llms-full"]

And in my custom theme’s layouts/_default folder, I added two templates for those two files:

  • index.llms.txt references all the articles
# Blog of {{ .Site.Title }}

> {{ .Site.Params.Description }}

You can find [more information about this site here](/about).

## All articles
{{ range .Site.RegularPages.GroupByDate "2006" }}

### {{ .Key }}
{{ range .Pages }}
- [{{- .Title -}}]({{- .Site.BaseURL -}}{{- .RelPermalink -}})
{{- end -}}
{{ end }}
  • index.llms-full.txt provides the full content of all the articles
{{ range .Site.RegularPages }}
# {{ .Title }}

{{ .RawContent }}
{{- end -}}
Now what can you do with that?

In my recent article on the power of large context windows for your documentation efforts, I mentioned that you could feed all your content into an LLM prompt and have a discussion with it. This is the same idea I’m following here. Thanks to the full export of all my articles, I can ask questions about everything I’ve written so far.

Let’s go to Google AI Studio and ask Gemini, with its huge context window!

The above screenshot shows Gemini’s answer when I asked about the main topics of this blog. It highlights my content about Apache Groovy, Google Cloud Platform, generative AI, Web APIs, and software development best practices.

Look at the number of tokens! My whole blog is about 682k tokens! Only Gemini could grok all of that and make sense of it.

As a follow up question, I also asked “Can you detail more the articles about Generative AI?”, and Gemini highlighted some of my recent articles and summarized them all:

Okay, based on the file, here’s a breakdown of the articles specifically about Generative AI:

  • The power of large context windows for your documentation efforts: This article explores using LLMs with large context windows (like Gemini) to chat with reference documentation, to assist in tasks like creating new modules for LangChain4j. It discusses using “gitingest” to convert a GitHub project into a large text file that can be fed to an LLM.
  • A Generative AI Agent with a real declarative workflow: This piece compares an imperative Java-based AI agent with a declarative approach using Google Cloud Workflows. The agent generates short sci-fi stories, and the article details the pros and cons of each method.
  • An AI agent to generate short sci-fi stories: This project details how to build an AI-powered short story generation agent using Java, LangChain4j, Gemini, and Imagen 3, deployed on Cloud Run jobs. It contrasts explicit vs. autonomous agent workflows.
  • Analyzing trends and topics from Bluesky’s Firehose with generative AI: This article describes a project that analyzes the Bluesky social media stream, using vector embeddings and clustering to identify trending topics, and using Gemini to generate summaries of these topics.
  • Let’s think with Gemini Flash 2.0’s experimental thinking mode and LangChain4j: This article explores the new “thinking mode” of Gemini 2.0 Flash, using LangChain4j to solve riddles and multimodal problems.
  • Detecting objects with Gemini 2.0 and LangChain4j: This article discusses using Gemini 2.0’s object detection capabilities with LangChain4j to recognize playing cards in images.
  • Semantic code search for Programming Idioms with LangChain4j and Vertex AI embedding models: Explores the use of Vertex AI embedding models for semantic code search on the Programming Idioms community website. It focuses on using code retrieval task types and LangChain4j for efficient retrieval of code examples.
  • Redacting sensitive information when using Generative AI models: Details the importance of redacting Personally Identifiable Information (PII) before sending user messages to LLMs, using the Google Cloud Data Loss Prevention (DLP) API.
  • Data extraction: The many ways to get LLMs to spit JSON content: This article reviews different techniques to make LLMs generate JSON output and extract data from text, using Gemini and LangChain4j.

These articles demonstrate a clear focus on using various LLMs (primarily Gemini) with tools like LangChain4j and Google Cloud services to address different AI-related problems.

Not only did it summarize all those recent articles, but I also like the final sentence that outlines what most of those articles have in common: They are primarily using the Gemini model, with the LangChain4j framework, and running on Google Cloud Platform.

When I asked it to dive deeper into what is so special about LangChain4j, for instance, it again highlighted the key features of the framework (that I demonstrated in my articles):

LangChain4j appears to be a central tool for building Generative AI solutions in the provided articles, so I collected a few pieces of information to explain what’s so special about it:

  • Java-Centric Approach: LangChain4j is a Java library. In a field dominated by Python, LangChain4j is a key tool for Java developers to leverage generative AI without having to switch to a different programming language or environment.
  • LLM Orchestration: LangChain4j facilitates the orchestration of different components in an AI pipeline, from loading documents and splitting them into chunks, to calculate their embedding vectors, storing them in vector databases, and retrieve them to be passed to the LLM, and the augment the LLM prompt with context from those vectors. This reduces the complexity of AI application development.
  • Integration with Google Cloud Services: LangChain4j has built-in integrations with Google Cloud’s Vertex AI, Google AI Gemini, and Google Cloud Storage. This allows Java developers to easily use Google’s LLMs, embedding models, and data storage solutions.
  • Extensible and Customizable: The library offers a way to plug custom logic and algorithms into the different steps, and to define how the information should be used to interact with the LLM.
  • Higher-Level Abstractions: LangChain4j provides higher-level abstractions for interacting with LLMs, such as the AiServices abstraction that returns type-safe objects, which allows developers to manipulate real Java objects, in a type-safe manner, instead of dealing with raw JSON strings.
  • Capabilities like Tool Calling, Context Window, and the like: It integrates new LLM capabilities to improve the quality of the AI powered application’s responses.
  • Metadata Handling: It eases the handling of metadata in text segments, including filtering based on metadata during the retrieval phase.

In the context of the articles, LangChain4j simplifies the implementation of complex tasks like:

  • Creating chatbots that can answer questions based on a specific knowledge base (RAG).
  • Analyzing trends in social media data.
  • Generating structured content (e.g., JSON) from unstructured text.
  • Implementing semantic code search.

LangChain4j’s key value is that it allows Java developers to more easily build, experiment with, and deploy Generative AI applications using their existing skills and infrastructure, while abstracting away much of the underlying complexity.

Thoughts and summary

For the past months, I’ve been exploring various Retrieval Augmented Generation (RAG) techniques to improve the quality of answers for questions on heaps of documents and data, often chunked and stored in vector databases. However, RAG hits some limits when you want to ask more general questions, or to analyze some content more deeply, by understanding it in its entirety.

That’s where LLMs like Gemini with large context windows help, as they can figure out the fine print, the relationships between various sections, see common patterns across articles. I believe we should take the best of both worlds, by finding relevant documents with a hybrid search approach mixing keyword and semantic searches, but then feeding LLMs not only with extracts but with whole documents.

The llms.txt initiative is interesting, and I hope it’ll be adopted more widely in the future. It will make it simpler for users to reason about documentation, blogs, etc. via LLMs. As LLMs get bigger context windows, discussions with whole docs or sites become richer, with a full understanding of the context of the corpus of documents.

https://glaforge.dev/posts/2025/03/03/llms-txt-to-help-llms-grok-your-content/
Pretty-print Markdown on the console

With Large Language Models loving to output Markdown responses, I’ve been wanting to display those Markdown snippets nicely in the console, when developing some LLM-powered apps and experiments. At first, I thought I could use a Markdown parser library, and implement some kind of output formatter to display the text nicely, taking advantage of ANSI color codes and formats. However it felt a bit over-engineered, so I thought “hey, why not just use some simple regular expressions!” (and now you’ll tell me I have a second problem with regexes)

In this blog post, I just want to share the few lines of code I’ve added to a utility class to output Markdown content nicely. It doesn’t cover all the bells & whistles of the Markdown syntax (in particular GitHub-flavored extensions, like tables) but it’s good enough for my use case.

Markdown syntax highlighting on the console
public static String markdown(String md) {
  return md
    // Bold
    .replaceAll("\\*\\*(.*?)\\*\\*", "\u001B[1m$1\u001B[0m")
    // Italic
    .replaceAll("\\*(.*?)\\*", "\u001B[3m$1\u001B[0m")
    // Underline
    .replaceAll("__(.*?)__", "\u001B[4m$1\u001B[0m")
    // Strikethrough
    .replaceAll("~~(.*?)~~", "\u001B[9m$1\u001B[0m")
    // Blockquote
    .replaceAll("(> ?.*)",
        "\u001B[3m\u001B[34m\u001B[1m$1\u001B[22m\u001B[0m")
    // Lists (bold magenta number and bullet)
    .replaceAll("([\\d]+\\.|-|\\*) (.*)",
        "\u001B[35m\u001B[1m$1\u001B[22m\u001B[0m $2")
    // Block code (black on gray)
    .replaceAll("(?s)```(\\w+)?\\n(.*?)\\n```",
        "\u001B[3m\u001B[1m$1\u001B[22m\u001B[0m\n\u001B[57;107m$2\u001B[0m\n")
    // Inline code (black on gray)
    .replaceAll("`(.*?)`", "\u001B[57;107m$1\u001B[0m")
    // Headers (cyan bold)
    .replaceAll("(#{1,6}) (.*?)\n",
        "\u001B[36m\u001B[1m$1 $2\u001B[22m\u001B[0m\n")
    // Headers with a single line of text followed by 2 or more equal signs
    .replaceAll("(.*?\n={2,}\n)",
        "\u001B[36m\u001B[1m$1\u001B[22m\u001B[0m\n")
    // Headers with a single line of text followed by 2 or more dashes
    .replaceAll("(.*?\n-{2,}\n)",
        "\u001B[36m\u001B[1m$1\u001B[22m\u001B[0m\n")
    // Images (blue underlined)
    .replaceAll("!\\[(.*?)]\\((.*?)\\)",
        "\u001B[34m$1\u001B[0m (\u001B[34m\u001B[4m$2\u001B[0m)")
    // Links (blue underlined)
    .replaceAll("!?\\[(.*?)]\\((.*?)\\)",
        "\u001B[34m$1\u001B[0m (\u001B[34m\u001B[4m$2\u001B[0m)");
}

This can easily be translated into other programming languages. Just be careful with the small differences in syntax of regular expressions.
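When porting these rules, the order of the replacements matters as much as the regex syntax. Here is a minimal, self-contained sketch that keeps only the bold and italic rules from the method above. The class and method names (MarkdownOrderDemo, boldThenItalic, italicThenBold) are hypothetical, introduced only for this illustration. If the italic rule runs first, its lazy pattern matches the empty string between the two stars of each `**`, and the bold markers are consumed before the bold rule can see them:

```java
class MarkdownOrderDemo {

    static final String BOLD = "\u001B[1m";
    static final String ITALIC = "\u001B[3m";
    static final String RESET = "\u001B[0m";

    // Bold first, then italic: the same order as in the method above
    static String boldThenItalic(String md) {
        return md
            .replaceAll("\\*\\*(.*?)\\*\\*", BOLD + "$1" + RESET)
            .replaceAll("\\*(.*?)\\*", ITALIC + "$1" + RESET);
    }

    // Italic first: each "**" is seen as an empty italic span,
    // so nothing is left for the bold rule to match
    static String italicThenBold(String md) {
        return md
            .replaceAll("\\*(.*?)\\*", ITALIC + "$1" + RESET)
            .replaceAll("\\*\\*(.*?)\\*\\*", BOLD + "$1" + RESET);
    }

    public static void main(String[] args) {
        String input = "Some **bold** and *italic* text";
        System.out.println(boldThenItalic(input)); // "bold" is rendered in bold
        System.out.println(italicThenBold(input)); // the bold styling is lost
    }
}
```

The same ordering concern should apply in any language whose regex engine supports lazy quantifiers.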

For the following Markdown text:

# Main title

Big title

## Subtitle

Small title

# Bold and italic

Some **bold text**.
Bits of _italicized text_.
It's __underlined__.
And ~~striked through~~.

## Links

A [link](https://www.example.com) to an article.

![alt text](image.jpg)

### Quoting

> a quote of someone famous, potentially wrapping around multiple lines.

# Lists

1. First item
2. Second item
3. Third item

- First item
- Second item
- Third item

# Code

Some inline `code` inside a paragraph.
Return type is `void` and args are `String[]`.

A fenced code block:

```java
public class Hello {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}
```

On the console output, the above Markdown document would be rendered as follows:

Do you like this Markdown syntax highlighting?

Bonus points

At first, in my utility class, I only had methods for adding some colors in my program outputs. Even if you don’t use Markdown, those touches of color can be useful to differentiate key parts of your output.

So I created some methods for wrapping text in ANSI codes:

public static String red(String msg) {
    return "\u001B[31m" + msg + "\u001B[0m";
}

public static String green(String msg) {
    return "\u001B[32m" + msg + "\u001B[0m";
}
//...
public static String bold(String msg) {
    return "\u001B[1m" + msg + "\u001B[0m";
}

public static String italic(String msg) {
    return "\u001B[3m" + msg + "\u001B[0m";
}
//...

You can combine them like in bold(green(msg)), add more colors, or even write some fancy rainbow text! As long as the output is still readable, it’s all fair game!
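One thing to keep in mind when nesting such helpers: `\u001B[0m` resets every attribute, so in bold("a " + green("b") + " c") the trailing " c" is no longer bold. A possible variant, sketched below under the assumption that your terminal honors the standard SGR codes 22 (normal intensity) and 39 (default foreground color), closes each style with its own code. The AnsiStyles class and its style helper are hypothetical names, not part of the utility class described above:

```java
class AnsiStyles {

    // Wraps a message between an "on" code and the matching "off" code
    static String style(int on, int off, String msg) {
        return "\u001B[" + on + "m" + msg + "\u001B[" + off + "m";
    }

    // 32 = green foreground, 39 = back to the default foreground color
    static String green(String msg) {
        return style(32, 39, msg);
    }

    // 1 = bold, 22 = back to normal intensity
    static String bold(String msg) {
        return style(1, 22, msg);
    }

    public static void main(String[] args) {
        // The text after the green part stays bold,
        // because green() only resets the color
        System.out.println(
            bold("bold, " + green("bold and green") + ", still bold"));
    }
}
```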

https://glaforge.dev/posts/2025/02/27/pretty-print-markdown-on-the-console/
Advanced RAG — Sentence Window Retrieval

Retrieval Augmented Generation (RAG) is a great way to expand the knowledge of Large Language Models to let them know about your own data and documents. With RAG, LLMs can ground their answers on the information you provide, which reduces the chances of hallucinations.

Implementing RAG is fairly trivial with a framework like LangChain4j. However, the results may not be on par with your quality expectations. Often, you’ll need to further tweak different aspects of the RAG pipeline, like the document preparation phase (in particular document chunking), or the retrieval phase to find the best information in your vector database.

In this first article (hopefully of a series on advanced RAG techniques) I’d like to explore an approach that may yield better results: sentence window retrieval, inspired by the technique described in this article.

Info

I’ve explored many techniques in my advanced RAG techniques presentation, if you feel like discovering other techniques that we’ll explore in more detail in this series.

Let’s step back to naive chunking

First, why do we even split documents into smaller chunks? We split documents into chunks in RAG because:

  • It’s easier to find the specific, relevant piece of information within a smaller chunk than a huge document.
  • Large Language Models have limited context windows. Chunks allow us to feed them just the necessary context, instead of overwhelming them with the whole document.
  • Smaller chunks lead to more precise retrieval, delivering more accurate answers.

The naive approach is to split the text into chunks of a fixed number of characters. For example, on the Wikipedia page of Berlin, a 100-character split might look as follows:

If a user asks the question “What is the population of Berlin?”, the number of inhabitants is split across two chunks. So neither the first, nor the second chunk would yield the correct information, for the LLM to generate an accurate answer.

An obvious improvement is to use overlapping chunks:

The red chunk and the orange chunk overlap: both contain the gray part. This means that the second chunk contains the full number we’re interested in.
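To make the difference between plain and overlapping chunks concrete, here is a minimal sketch in plain Java, independent of LangChain4j. The NaiveChunker class is a hypothetical helper written for this illustration, and the sample text is a made-up placeholder rather than the actual Wikipedia content:

```java
import java.util.ArrayList;
import java.util.List;

class NaiveChunker {

    // Splits the text into chunks of at most `size` characters.
    // Each chunk starts `size - overlap` characters after the previous one,
    // so consecutive chunks share `overlap` characters.
    static List<String> chunk(String text, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("Expected 0 <= overlap < size");
        }
        List<String> chunks = new ArrayList<>();
        int step = size - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + size, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last chunk reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "Population: 1234567 people"; // placeholder text

        // Without overlap, the number is cut in two
        System.out.println(chunk(text, 16, 0));
        // With an overlap of 8 characters, the second chunk holds the full number
        System.out.println(chunk(text, 16, 8));
    }
}
```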

Another possible approach, to avoid splits and overlaps, is to chunk by sentences. After all, human beings write sentences for a good reason: each one bears information that represents a unit of semantic meaning.
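As a rough illustration of sentence chunking, the sketch below uses the JDK’s java.text.BreakIterator as a stand-in for a real sentence detector. The SentenceSplitter class is a hypothetical helper, and the sample sentences are made up. BreakIterator relies on punctuation rules, so abbreviations and unusual formatting may be split incorrectly:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitter {

    // Splits a text into trimmed sentences using the JDK sentence boundary rules
    static List<String> split(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        iterator.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next();
             end != BreakIterator.DONE;
             start = end, end = iterator.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        split("Berlin is a city. Its population is large. It has many museums.")
            .forEach(System.out::println);
    }
}
```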

However, both the chunk-with-overlap example above and the sentence split expose another flaw: the pronoun its, in the second chunk or the second sentence, doesn’t carry the information that it actually refers to Berlin. So an important aspect of the sentence is lost: this is a sentence about the population of Berlin, not any other city.

An alternative may be to increase the size of the chunk, and/or the size of the overlap, to avoid information being split across chunks (like the population figure), and to give more context about possible links between sentences (like our pronoun and its city). However, the wider the chunks, the more diluted the semantic meaning in the resulting vector embeddings.

With more dilution, it’s harder for query vectors (the user prompt) to match the chunks of text with high similarity values.
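The similarity values mentioned here are typically cosine similarities between embedding vectors. The sketch below uses toy two-dimensional vectors, made up for illustration (real embeddings have hundreds of dimensions), to show how a chunk that mixes two topics scores lower against a focused query than a chunk about a single topic. The CosineSimilarity class is a hypothetical helper:

```java
class CosineSimilarity {

    // Cosine of the angle between two vectors of the same length
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] query = {1, 0};        // a query about a single topic
        double[] focusedChunk = {1, 0}; // a small chunk about that same topic
        double[] dilutedChunk = {1, 1}; // a wide chunk mixing two topics

        System.out.println(cosine(query, focusedChunk)); // 1.0
        System.out.println(cosine(query, dilutedChunk)); // about 0.707
    }
}
```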

Enter sentence window retrieval

The name of the technique comes from this article I mentioned. But maybe it’s not the best name we could find. Maybe something like wider-context-sliding-window-embedding would be more explicit, but that’s a mouthful!

Let’s have a look at this approach:

The idea is as follows:

  • We calculate vector embeddings for the sentence in dark green.
  • But we save the surrounding sentences in light green (for example, one sentence before, and two after).

At retrieval time, the vector similarity calculation will match better with the dark green sentence (in spite of its missing Berlin aspect). But the whole light + dark green context will be added in the prompt of the LLM, instead of the single sentence.

The advantages are that:

  • We keep working with small, meaningful units (a few sentences), thus avoiding both key information being cut between splits and the semantic dilution of bigger chunks.
  • It helps the LLM resolve links between pronouns and their related entity. The LLM knows that we’re talking about Berlin here.
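Before looking at any framework code, the core of the idea can be sketched in a few lines of plain Java: for each sentence, join it with a configurable number of neighbors, clamping the window at both ends of the document. The SentenceWindow class and the placeholder sentences are assumptions made for this illustration:

```java
import java.util.ArrayList;
import java.util.List;

class SentenceWindow {

    // For each sentence, builds the surrounding context made of
    // `before` sentences before it and `after` sentences after it
    static List<String> windows(List<String> sentences, int before, int after) {
        List<String> contexts = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            int from = Math.max(0, i - before);
            int to = Math.min(sentences.size(), i + after + 1);
            contexts.add(String.join(" ", sentences.subList(from, to)));
        }
        return contexts;
    }

    public static void main(String[] args) {
        List<String> sentences = List.of("A.", "B.", "C.", "D.", "E.");

        // One sentence before, two after
        List<String> contexts = windows(sentences, 1, 2);
        for (int i = 0; i < sentences.size(); i++) {
            // The sentence is what gets embedded,
            // the context is what would be sent to the LLM
            System.out.println(sentences.get(i) + " -> " + contexts.get(i));
        }
    }
}
```
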
The canonical RAG implementation in LangChain4j

With LangChain4j, the base approach is as follows. Let’s start with the ingestion phase:

// Load the document (the Wikipedia page about Berlin)

Document capitalDocument = Document.from(text);

// Define an embedding model to calculate vector embeddings,
// both for the text of the article, and for the user queries

var embeddingModel = VertexAiEmbeddingModel.builder()
    .project(System.getenv("GCP_PROJECT_ID"))
    .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
    .location(System.getenv("GCP_LOCATION"))
    .publisher("google")
    .modelName("text-embedding-005")
    .build();

// Store the chunks and their vectors in a vector database
// (in this example, we'll use a simple in-memory store)

var embeddingStore = new InMemoryEmbeddingStore<TextSegment>();

// Ingest the document in chunks of 100 characters
// with an overlap of 20 characters,
// use the in-memory vector store,
// and the embedding model for vector calculations

EmbeddingStoreIngestor.builder()
    .documentSplitter(DocumentSplitters.recursive(100, 20))
    .embeddingStore(embeddingStore)
    .embeddingModel(embeddingModel)
    .build()
    .ingest(capitalDocument);

This is the naive approach using chunks of 100 characters with overlap. Let’s see what it looks like during the retrieval phase:

// Declare the LLM model we want to use

VertexAiGeminiChatModel chatModel = VertexAiGeminiChatModel.builder()
    .project(System.getenv("GCP_PROJECT_ID"))
    .location(System.getenv("GCP_LOCATION"))
    .modelName("gemini-2.0-flash-001")
    .build();

// Create an interface contract
// that LangChain4j will implement for us

interface CapitalsAssistant {
    Result<String> learnAboutCapitals(String query);
}

// AiServices implements the interface
// and binds the LLM, and a content retriever
// that links the embedding model and vector store

CapitalsAssistant assistant =
    AiServices.builder(CapitalsAssistant.class)
        .chatLanguageModel(chatModel)
        .contentRetriever(EmbeddingStoreContentRetriever.builder()
            .embeddingModel(embeddingModel)
            .embeddingStore(embeddingStore)
            .build())
        .build();

// Now we can ask questions
Result<String> response = assistant.learnAboutCapitals(
    "How many inhabitants live in Berlin?");

We could also add a memory component to keep track of the ongoing discussion: it’s just one extra line. But here, I stick to single user questions.

Let’s implement the sentence window retrieval

Now, how can we expand the above code to implement the algorithm?

We need to split the text in sentences, and keep track of the surrounding sentences, as a sliding window, to give extra context to the LLM. We can store that information as metadata of each text segment. We must prepare the LLM prompt by inserting the surrounding context, instead of single sentences.

During the ingestion phase, we can plug in a TextSegmentTransformer that transforms our text chunks, to compute and store the surrounding context in the text segment metadata. We need to override both the transform() and transformAll() methods, because we need to modify all chunks together (to get the surrounding sentences):

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
  .documentSplitter(new DocumentBySentenceSplitter(200, 20))
  .embeddingStore(embeddingStore)
  .embeddingModel(embeddingModel)
  .textSegmentTransformer(new TextSegmentTransformer() {
    @Override
    public TextSegment transform(TextSegment segment) {
      return transformAll(Collections.singletonList(segment))
            .getFirst();
    }

    @Override
    public List<TextSegment> transformAll(List<TextSegment> segments) {
      List<TextSegment> list = new ArrayList<>();
      for (int i = 0; i < segments.size(); i++) {
        TextSegment textSegment = segments.get(i);

        // Create a sliding window of sentences to gather
        // the context surrounding the embedded sentence

        // (2 sentences before, 3 after,
        // but you could make it configurable)
        String context = IntStream.rangeClosed(i - 2, i + 3)
            .filter(j -> j >= 0 && j < segments.size())
            .mapToObj(j -> segments.get(j).text())
            .collect(Collectors.joining(" "));

        // Store the surrounding context as metadata
        // of the text segment (the current chunk)

        Metadata metadata =
            new Metadata(textSegment.metadata().toMap());
        metadata.put(METADATA_CONTEXT_KEY, context);
        list.add(TextSegment.from(textSegment.text(), metadata));
      }
      return list;
    }
  })
  .build();

That’s a bit of code, but I hope to contribute an implementation to LangChain4j directly, so that you don’t have to write this algorithm each time you want to apply it.

Let’s now focus on the retrieval phase, because we need to inject the surrounding context in the LLM prompt, instead of the sentence chunk itself. We need to create a RetrievalAugmentor, and configure both the ContentRetriever we used before and a ContentInjector:

CapitalsAssistant assistant =
  AiServices.builder(CapitalsAssistant.class)
    .chatLanguageModel(chatModel)
    .retrievalAugmentor(DefaultRetrievalAugmentor.builder()

    // the content retriever is defined
    // at the level of the retrieval augmentor

    .contentRetriever(EmbeddingStoreContentRetriever.builder()
      .embeddingModel(embeddingModel)
      .embeddingStore(embeddingStore)
      .build())

    // We create a content injector that injects
    // the surrounding context in the LLM prompt

    .contentInjector((contents, userMessage) -> {

      // Retrieves the surrounding sentences
      // from the text segment's metadata

      String excerpts = contents.stream()
        .map(content ->
          content
            .textSegment()
            .metadata()
            .getString(METADATA_CONTEXT_KEY))
        .collect(Collectors.joining("\n\n"));

      // Customize the prompt for our geography use case

      return PromptTemplate.from("""
        You are a helpful geography assistant
        knowing everything about the capitals of the world.

        Here's the question from the user:
        <question>
        {{userMessage}}
        </question>

        Answer the question using the following information:
        <excerpts>
        {{contents}}
        </excerpts>
        """).apply(Map.of(
          "userMessage", userMessage.singleText(),
          "contents", excerpts
        )).toUserMessage();
    }).build())
  .build();

Again, that’s a bit of code, but we can make it reusable easily if needed in different contexts.

Summary

With this sentence window retrieval approach, we calculate and store the vector embedding of a sentence, but we inject a wider surrounding context (a few sentences before and after) into the context of the LLM to generate its response with more information than just the single sentence. This tends to avoid the problem of key pieces of information cut in the middle, and to resolve references between sentences (like a pronoun pointing at a named entity defined earlier).

It’s a technique worth experimenting with, to see if it gives better results in your own scenario. However, before blindly applying a particular technique, be sure to prepare some evaluations: Measure the quality of your RAG pipeline before making changes. Then, measure after having applied a new technique, to see if the answers are better.
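As a very rough illustration of such a before/after measurement, the sketch below computes the fraction of questions whose answer contains an expected snippet. This is a deliberately crude metric chosen for brevity, not one of the metrics covered in the articles mentioned below, and the RagEval class, the test cases, and the stub pipelines are all hypothetical. In practice, the two functions would call your real RAG pipeline before and after the change:

```java
import java.util.Map;
import java.util.function.Function;

class RagEval {

    // Fraction of questions whose answer contains the expected snippet
    // (case-insensitive substring match)
    static double hitRate(Map<String, String> expectedByQuestion,
                          Function<String, String> pipeline) {
        if (expectedByQuestion.isEmpty()) {
            return 0.0;
        }
        long hits = expectedByQuestion.entrySet().stream()
            .filter(e -> pipeline.apply(e.getKey()).toLowerCase()
                .contains(e.getValue().toLowerCase()))
            .count();
        return (double) hits / expectedByQuestion.size();
    }

    public static void main(String[] args) {
        Map<String, String> cases = Map.of(
            "What is the capital of Germany?", "Berlin",
            "Which river flows through Berlin?", "Spree");

        // Stub pipelines standing in for the real ones
        Function<String, String> baseline = q -> "Berlin";
        Function<String, String> candidate =
            q -> q.contains("river") ? "The Spree" : "Berlin";

        System.out.println("Before: " + hitRate(cases, baseline));  // 0.5
        System.out.println("After:  " + hitRate(cases, candidate)); // 1.0
    }
}
```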

We’ll have to explore the topic of evaluation another day, but in the meantime, I encourage you to read the blog posts of my colleague Mete Atamel who covered RAG pipeline evaluation, the DeepEval tool, and the RAG triad metric.

https://glaforge.dev/posts/2025/02/25/advanced-rag-sentence-window-retrieval/
The power of large context windows for your documentation efforts

My colleague Jaana Dogan pointed at Anthropic’s MCP (Model Context Protocol) documentation pages, which describe how to build MCP servers and clients. The interesting twist is that the documentation is prepared so that Claude can assist you in building those MCP servers & clients, rather than clearly documenting how to do so.

MCP tutorials are great. There are no tutorials really.

"Copy these resources to Claude, and start asking some questions like..." pic.twitter.com/GG50DMWNLW

— Jaana Dogan ヤナ ドガン (@rakyll) February 14, 2025

No more tutorials:

  • You discuss with the reference documentation.
  • Your chat with the LLM becomes the tutorial!

LLM-powered documentation chats become your tailored tutorial, for your very own specific needs and requirements. Not only can LLMs assist you in authoring articles, documentation, and reports, but they can also craft explanations that help you achieve a particular goal for which there’s no tutorial or how-to guide available yet.

Also, sometimes, you overlook some key paragraph or section when browsing through the documentation, and you miss the key information that would have helped you fix the problem at hand. This happened to me recently while using an Obsidian plugin: I needed to configure the plugin in a certain way, and I had the impression it wasn’t possible, but the plugin author pointed me at the key paragraph that I had somehow glossed over. Had I asked the question to an LLM-powered chat that has all the reference documentation in its context, maybe it would have found that paragraph I missed?

Here come the large context window LLMs…

As you may already know, I contribute to the LangChain4j open source project, which provides integrations with various LLMs (like Gemini) or vector databases. I hope to start working on an additional module to integrate the new unified Gemini SDK. The advantage of this new SDK is that you can call both Gemini flavors: the one provided by Google Cloud’s Vertex AI offering, as well as DeepMind’s Google AI version. One SDK to rule them all!

I’m also interested in potentially creating a new vector store module for Google Cloud Firestore, which recently added vector calculation support in its Java client library. It would be neat to be able to use Firestore for RAG (Retrieval Augmented Generation) scenarios, taking advantage of the document database, its filtering capabilities, and its ability to do vector similarity searches.

LangChain4j’s documentation provides some explanations on how to contribute new integrations, but I was interested in trying this approach of chatting with the reference documentation to guide my steps in creating a new LLM module, and a new vector store implementation module.

Fortunately, my favorite large language model, Gemini, has a huge context window of up to 2 million tokens! This should be enough to ingurgitate the whole sources and documentation pages of the project.

Additionally, I discovered gitingest, a great online tool that allows you to convert a GitHub project and all its sources into one gigantic text file… that you can then feed to an LLM. Basically, you take the URL of a GitHub repository, and you replace the github.com part with gitingest.com. For example, for LangChain4j, the URL becomes: https://gitingest.com/langchain4j/langchain4j. Then, a download button lets you get the whole content of the repository.
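The URL substitution described above can be sketched in a few lines of Java. This is only an illustration: the GitIngestUrl class and its toGitIngest method are hypothetical names, not part of gitingest or LangChain4j, and the sketch assumes a plain https://github.com/{owner}/{repo} URL.

```java
import java.net.URI;

// Hypothetical helper, for illustration only (not part of gitingest or LangChain4j).
class GitIngestUrl {

    // Swaps the github.com host for gitingest.com and keeps the repository path.
    static String toGitIngest(String gitHubUrl) {
        URI uri = URI.create(gitHubUrl);
        if (!"github.com".equalsIgnoreCase(uri.getHost())) {
            throw new IllegalArgumentException("Not a github.com URL: " + gitHubUrl);
        }
        return "https://gitingest.com" + uri.getPath();
    }

    public static void main(String[] args) {
        // Prints the gitingest URL for the LangChain4j repository
        System.out.println(toGitIngest("https://github.com/langchain4j/langchain4j"));
    }
}
```

The download itself still happens through the gitingest web page; the sketch only builds the address to open.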

The prompt

For this experiment, I didn’t code anything; I simply used Google AI Studio.

I selected Gemini 2.0 Pro experimental:

Then, I loaded the big text file of the LangChain4j project, created by gitingest, and looked at the number of tokens:

Yes, the whole repository and documentation is almost 2 million tokens long! Only Gemini can ingest that many tokens!

Time to ask my question:

I’d like to create a new language module for LangChain4j. How should I do that? What are the key steps to follow to propose a new model?

The response

Okay, let’s outline the steps to create a new language model module for LangChain4j and propose its inclusion. Based on the provided file structure, you’ll be focusing on creating a new module similar to the existing ones (e.g., langchain4j-open-ai, langchain4j-ollama, etc.). Here’s a breakdown of the process, referencing the structure you’ve provided:

Key Steps and Considerations

  1. Understand the Abstractions and SPI: LangChain4j, like its Python counterpart, is built around core abstractions. You need to understand these to implement your integration correctly. The core abstractions you must implement are:

    • ChatLanguageModel / StreamingChatLanguageModel: For conversational models (like ChatGPT, Gemini). Implement ChatLanguageModel for synchronous responses, and StreamingChatLanguageModel if the model supports streaming responses token by token.
    • LanguageModel / StreamingLanguageModel: For models with a simpler text-in, text-out interface (less common these days).
    • EmbeddingModel: If the model provider offers embedding capabilities.
    • ModerationModel: If the model provider offers content moderation.
    • ScoringModel: If the model provider offers scoring/ranking capabilities.
    • Builder Factories: You’ll also need to create builder factories (SPIs) for each model type you implement. These are how users will construct your model classes. See examples like AzureOpenAiChatModelBuilderFactory. These are registered using the Java ServiceLoader mechanism (the META-INF/services files).
  2. Choose a Module Structure (and Repository):

    • Community Repo (Preferred for new integrations): Start your integration in the langchain4j-community repository. This is the recommended approach for new contributions. It allows for easier initial review and iteration before considering a move to the core langchain4j repository. Clone this repo, don’t fork the main langchain4j repo directly.
    • Main langchain4j Repo (For Core Integrations): If your integration is with a very widely used and well-established model provider (like OpenAI, Google, etc.), and you are confident in its stability and long-term maintenance, you might propose it for the main repo. However, start in langchain4j-community first.
    • Module Naming: Follow the pattern: langchain4j-{provider-name} (e.g., langchain4j-my-llm).
    • Directory Structure: Create a directory structure mirroring the existing modules (see langchain4j-open-ai or langchain4j-ollama as good examples):
      langchain4j-{provider-name}/
          pom.xml  (Your module's Maven build file)
          src/
              main/
                  java/
                      dev/
                          langchain4j/
                              model/
                                  {providername}/  (e.g., myllm)
                                      {ProviderName}ChatModel.java  (Your implementation)
                                      internal/ (API client and related classes)
                                      spi/      (Builder factory for your model)
                                          {ProviderName}ChatModelBuilderFactory.java
                  resources/
                      META-INF/
                          services/
                              (Files to register your builder factory, see examples)
              test/
                  java/
                      dev/
                          langchain4j/
                              model/
                                  {providername}/
                                      {ProviderName}ChatModelIT.java (Integration tests)
      
  3. Implement the API Client:

    • Official SDK (Preferred): If the LLM provider has an official Java SDK, use it. This is usually the best approach for stability, performance, and access to all features. See langchain4j-bedrock for an example using an official SDK.
    • HTTP Client (If no SDK): If there’s no official SDK, use the JDK’s built-in java.net.http.HttpClient (available since Java 11). This minimizes external dependencies. Avoid adding new dependencies unless absolutely necessary. See http-clients/langchain4j-http-client-jdk for how LangChain4j wraps this. Avoid using the older okhttp3 directly if possible, prefer langchain4j-http-client-jdk (or langchain4j-http-client-spring-restclient if building a Spring Boot starter).
    • JSON Handling: Use Jackson for JSON serialization/deserialization, as it’s already a dependency.
    • Error Handling: Make sure to handle HTTP errors (non-2xx responses) appropriately. Throw a dev.langchain4j.exception.HttpException for these.
    • Request/Response Logging: Implement logging for requests and responses (see langchain4j-anthropic for a complete example). This is very helpful for debugging.
  4. Implement the Model Interface(s):

    • Implement ChatLanguageModel, StreamingChatLanguageModel, EmbeddingModel, etc., as appropriate, based on the provider’s capabilities.
    • Use the Builder pattern for your model classes to allow for flexible configuration.
    • Make sure your implementation handles request/response mapping and error handling correctly.
    • Implement TokenCountEstimator if possible, so the TokenWindowChatMemory can calculate the token usage. Implement DimensionAwareEmbeddingModel to report the output dimension from the embedding model.
  5. Write Tests:

    • Unit Tests: Create unit tests for any complex logic, utility methods, and request/response mappers.
    • Integration Tests (ITs): Create integration tests (e.g., MyLlmChatModelIT.java) that interact with the real LLM provider’s API. These are crucial for ensuring your integration works correctly.
      • Use environment variables (e.g., MYLLM_API_KEY) to store API keys and other secrets. Do not hardcode them.
      • Use @EnabledIfEnvironmentVariable to skip the tests if the required environment variables are not set.
      • Extend AbstractChatModelIT, AbstractStreamingChatModelIT, AbstractEmbeddingModelIT, and/or AbstractScoringModelIT to get a set of basic tests.
      • Test all relevant features of the model (e.g., text generation, streaming, different parameters, tool use, JSON mode).
      • Add test for concurrent requests if possible.
      • Consider adding a test for the Tokenizer interface (see examples in langchain4j-core).
      • Add @RetryingTest if model response is inconsistent
  6. Add to BOM (Bill of Materials): Add your new module to langchain4j-bom/pom.xml. This helps users manage dependencies.

  7. Documentation:

    • Update README.md: Add your integration to the list of supported models and embedding stores.
    • Create Markdown Documentation: Create Markdown files in the docs/docs/integrations/ directory, following the structure of existing integrations. You’ll need:
      • A main file (e.g., my-llm.md).
      • An entry in docs/docs/integrations/language-models/index.md and in docs/sidebars.js.
      • An entry in _category_.json files in docs/docs/integrations/language-models and docs/docs/integrations/embedding-stores
    • Examples (Highly Recommended): Create a simple example in the langchain4j-examples repository. This is very helpful for users.
  8. General Guidelines (from CONTRIBUTING.md):

    • Java 17: Maintain compatibility with Java 17.
    • Minimal Dependencies: Avoid adding new dependencies if possible. If necessary, try to use libraries already present. Run mvn dependency:analyze to check.
    • Backwards Compatibility: Avoid breaking changes. If necessary, deprecate old methods/fields instead of removing them.
    • Naming Conventions: Follow existing naming conventions.
    • No Lombok: Avoid using Lombok in new code; remove it from existing code if you touch it.
    • Javadoc: Add Javadoc where needed.
    • Code Style: Run make lint and make format before committing.
    • Large Features: Discuss large features with maintainers (@langchain4j) before implementation.
  9. Open a Pull Request (Draft First):

    • Open a draft PR in the langchain4j-community repository.
    • Fill out all sections of the PR template.
    • Once the PR is reviewed and approved, you will be asked to finalize it (add documentation, examples, etc.).

Example Code Snippets (Illustrative)

// MyLlmChatModel.java (in the appropriate package)
public class MyLlmChatModel implements ChatLanguageModel {

    private final MyLlmClient client; // Your custom client
    private final String modelName;

    @Builder
    public MyLlmChatModel(String baseUrl, String apiKey, String modelName, Duration timeout) {
        this.client = new MyLlmClient(baseUrl, apiKey, timeout); // Your custom client
        this.modelName = modelName;
    }

    @Override
    public Response<AiMessage> generate(List<ChatMessage> messages) {
        // 1. Convert Langchain4j messages to MyLLM's request format
        // 2. Call client.chat(...)
        // 3. Convert MyLLM's response to LangChain4j's Response<AiMessage>
        // 4. Handle errors (throw HttpException)
        return null; // TODO: Implement
    }

    // ... other methods, builder, etc. ...
}


// MyLlmClient.java (in a suitable package, e.g., dev.langchain4j.model.myllm.internal)
class MyLlmClient {

    // ... implementation using java.net.http.HttpClient ...
}

// MyLlmChatModelBuilderFactory.java (in a suitable package, e.g., dev.langchain4j.model.myllm.spi)
public class MyLlmChatModelBuilderFactory implements Supplier<MyLlmChatModel.Builder> {

    @Override
    public MyLlmChatModel.Builder get() {
        return MyLlmChatModel.builder();
    }
}

// META-INF/services/dev.langchain4j.model.chat.spi.ChatLanguageModelBuilderFactory (resource file)
// Add a line with the fully qualified name of your factory:
// dev.langchain4j.model.myllm.spi.MyLlmChatModelBuilderFactory

Key Points and Common Mistakes

  • Thorough Testing: Integration tests are absolutely essential. Test with real API calls.
  • Error Handling: Handle HTTP errors (non-2xx responses) from the LLM provider’s API.
  • Token Usage: If the LLM provider gives you token usage information, include it in the Response.
  • Finish Reason: If the LLM provider gives you a finish reason (e.g., “stop”, “length”), include it in the Response.
  • Consistency: Maintain consistency with existing LangChain4j modules in terms of naming, structure, and coding style.
  • Documentation: Clear, concise, and accurate documentation is critical. Follow the patterns you see in existing modules.
  • SPI Registration: Remember to create the src/main/resources/META-INF/services/ files to register your model implementation as a service.

This comprehensive guide, combined with the provided file structure and examples, should give you a strong foundation for contributing your new language model integration to LangChain4j. Remember to start in the langchain4j-community repository for initial development and review. Good luck!

The verdict

Honestly, this is really good! It follows the advice in the contributing guidelines. And since it has access to the whole code base, it understands the various steps to follow and the classes to implement pretty well.

A few nits though: for example, it confirms we shouldn’t use Lombok for creating builders, but the example code shows a @Builder annotation from Lombok, because it’s still used in several modules. There’s also a reference to the make tool, although the project doesn’t use it at all. And the code suggests implementing the generate() methods, whereas newer models should now implement the new chat() method instead. Gemini can’t really know about this, though, as it is fairly new and not many models implement those interfaces for now.
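To make the Lombok nit concrete, here is a minimal sketch of what a hand-written builder could look like instead of the @Builder annotation. It is an assumption-laden illustration, not the code of any existing LangChain4j module: the class deliberately implements no LangChain4j interface so that it compiles on its own, and the 60-second default timeout is an arbitrary choice of mine.

```java
import java.time.Duration;
import java.util.Objects;

// Illustrative only: reuses the MyLlmChatModel name from Gemini's answer,
// but omits the LangChain4j interface to stay self-contained.
class MyLlmChatModel {

    private final String baseUrl;
    private final String apiKey;
    private final String modelName;
    private final Duration timeout;

    private MyLlmChatModel(Builder builder) {
        this.baseUrl = Objects.requireNonNull(builder.baseUrl, "baseUrl");
        this.apiKey = Objects.requireNonNull(builder.apiKey, "apiKey");
        this.modelName = Objects.requireNonNull(builder.modelName, "modelName");
        // Assumption: fall back to an arbitrary 60s timeout when none is given
        this.timeout = builder.timeout != null ? builder.timeout : Duration.ofSeconds(60);
    }

    static Builder builder() {
        return new Builder();
    }

    String baseUrl() { return baseUrl; }
    String apiKey() { return apiKey; }
    String modelName() { return modelName; }
    Duration timeout() { return timeout; }

    // Hand-written replacement for what Lombok's @Builder would generate
    static class Builder {
        private String baseUrl;
        private String apiKey;
        private String modelName;
        private Duration timeout;

        Builder baseUrl(String baseUrl) { this.baseUrl = baseUrl; return this; }
        Builder apiKey(String apiKey) { this.apiKey = apiKey; return this; }
        Builder modelName(String modelName) { this.modelName = modelName; return this; }
        Builder timeout(Duration timeout) { this.timeout = timeout; return this; }

        MyLlmChatModel build() {
            return new MyLlmChatModel(this);
        }
    }
}
```

Usage would then read MyLlmChatModel.builder().baseUrl(...).apiKey(...).modelName(...).build(), which is the same fluent style users get with Lombok, just written out by hand.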

But other than that, it’s excellent!

Gemini created a contributing guide by looking at almost 2 million tokens of code and documentation. But it doesn’t end here: you can continue asking other questions to get more insight into particular aspects of the development of a new module, and Gemini will continue the discussion, guiding you and helping you in your task. I found that quite impressive.

Let’s continue the conversation with the code base and documentation

As I was interested in creating a new embedding store for Firestore, I then added the documentation of the Java SDK for Firestore to the context, and asked the following question:

I would also like to create a new embedding store for the Google Cloud Firestore database. What are the steps to create a new LangChain4j embedding store module?

Let’s see what it suggested:

Okay, here’s a breakdown of how to create a new LangChain4j embedding store module for Google Cloud Firestore, along with the key steps and considerations, mirroring the structure of existing modules like langchain4j-milvus.

Project Structure

Your project structure should follow the established pattern. I’ll create a simplified version based on the most relevant parts from the provided file listing. The full structure would be much larger (like the main langchain4j project), but this captures the essentials:

langchain4j/
└── langchain4j-embedding-store-google-firestore/  (or similar name)
    ├── pom.xml                                     (Your module's Maven build file)
    └── src/
        └── main/
            └── java/
                └── dev/
                    └── langchain4j/
                        └── store/
                            └── embedding/
                                └── google/
                                    └── firestore/
                                        └── GoogleFirestoreEmbeddingStore.java  (Main class)
                                        └── (Any other supporting classes, like request/response objects)
        └── test/
            └── java/
                └── dev/
                    └── langchain4j/
                        └── store/
                            └── embedding/
                                └── google/
                                    └── firestore/
                                        └── GoogleFirestoreEmbeddingStoreIT.java (Integration test)

Steps to Create the Module

  1. Project Setup (Maven Module):

    • Create a new directory named langchain4j-embedding-store-google-firestore (or a similar, descriptive name) within the langchain4j parent directory.
    • Create a pom.xml file inside this new directory. This file will define your module as a Maven project and specify its dependencies. Use langchain4j-milvus/pom.xml as a guide, but adapt it:
      • Parent: Set the parent to langchain4j-parent, as shown in your example.
      • ArtifactId: Use langchain4j-embedding-store-google-firestore.
      • Name and Description: Update these appropriately.
      • Dependencies: Include langchain4j-core as a compile-time dependency. Add the Google Cloud Firestore Java client library as a dependency:
        <dependency>
            <groupId>com.google.cloud</groupId>
            <artifactId>google-cloud-firestore</artifactId>
            <version>YOUR_VERSION_HERE</version> <!-- Lookup the latest version -->
        </dependency>
        
        Add any other necessary dependencies, such as logging (SLF4J), Lombok (if desired, but try to avoid in new code), etc. Minimize dependencies as much as possible.
      • Test Dependencies: Include JUnit, AssertJ, and any mocking libraries (like Mockito) in the test scope.
      • Licenses: Ensure your module has the correct Apache 2.0 license information in the POM.
  2. Implement EmbeddingStore<TextSegment>:

    • Create GoogleFirestoreEmbeddingStore.java in the dev.langchain4j.store.embedding.google.firestore package.
    • Implement the EmbeddingStore<TextSegment> interface from langchain4j-core. This is the crucial part. You’ll need to implement the following methods, mapping them to Firestore operations:
      • add(Embedding embedding): Adds a single embedding to the store, generating a unique ID.
      • add(String id, Embedding embedding): Adds an embedding with a specified ID.
      • add(Embedding embedding, TextSegment textSegment): Adds an embedding with associated text and metadata.
      • addAll(List<Embedding> embeddings): Adds multiple embeddings.
      • addAll(List<String> ids, List<Embedding> embeddings, List<TextSegment> textSegments): Adds multiple embeddings with associated IDs and text segments.
      • removeAll(Collection<String> ids): Delete embeddings by id
      • removeAll(): Deletes everything
      • removeAll(Filter filter): Deletes records using the condition filter
      • search(EmbeddingSearchRequest request): Find related embeddings
      • findRelevant(Embedding referenceEmbedding, int maxResults, double minScore): Finds relevant embeddings.
    • Considerations for Firestore Implementation:
      • Data Model: How will you store the embeddings (as float arrays) and associated data (text, metadata)? Firestore uses a NoSQL document model. You’ll likely store each TextSegment and its Embedding as a document in a collection.
      • Metadata: Firestore supports storing metadata as document fields. You’ll need a way to map TextSegment metadata (which is a Map<String, String>) to Firestore document fields. You have a few options here, mirroring what existing modules do:
        • Individual Columns (Preferred): Each metadata key becomes a separate field in the document. This is efficient for querying but requires knowing the metadata keys in advance. The Mariadb integration uses this approach with MetadataColumDefinition.
        • Single JSON Field: Store all metadata as a single JSON string. This is flexible but less efficient for filtering. The Chroma integration uses this.
        • Mixed Approach: Common metadata fields (like “source” or “document_id”) could be separate fields, and a catch-all “metadata” field could store the rest as JSON.
      • Vector Search: Firestore now has native Vector Search, which is excellent! You’ll use this for the findRelevant method. The key will be understanding how to map LangChain4j’s EmbeddingSearchRequest (which includes maxResults and minScore) to the appropriate Firestore vector search query.
      • Filtering: Implement filtering based on metadata using Firestore’s query capabilities. LangChain4j’s Filter interface needs to be translated into a Firestore query. Look at existing implementations like ChromaMetadataFilterMapper for inspiration, but adapt it to Firestore’s query syntax.
      • Error Handling: Wrap Firestore exceptions in RuntimeException or a more specific custom exception.
      • Configuration: Use a Builder pattern (like OpenAiChatModel does) to allow users to configure:
        • Firestore project ID, database, collection name
        • Credentials (API key, service account)
        • Timeout settings
        • Maximum number of results (maxResults)
        • Minimum relevance score (minScore)
        • Field names for text, embedding, and metadata (if configurable)
        • Possibly options for automatically creating the collection/indexes.
      • Indexing: You must create the necessary index in Firestore for vector search to work. The documentation you provided explains how to do this with the gcloud CLI. The code should, ideally, at least check if the index exists. It could even attempt to create it, but this would require more permissions.
      • Concurrency: if there is any concurrency, the code should be thread-safe.
  • Implement GoogleFirestoreEmbeddingModel: You should also implement the EmbeddingModel interface, by adapting the google API to the langchain4j interface.
  3. Create SPI Builder Factory:

    • Create a GoogleFirestoreEmbeddingStoreBuilderFactory class that implements Supplier<GoogleFirestoreEmbeddingStore.Builder>.
    • Create a file named META-INF/services/dev.langchain4j.spi.store.embedding.EmbeddingStoreFactory in src/main/resources.
    • Add a single line to this file containing the fully qualified name of your factory class (e.g., dev.langchain4j.store.embedding.google.firestore.GoogleFirestoreEmbeddingStoreBuilderFactory).
  4. Write Integration Tests:

    • Create GoogleFirestoreEmbeddingStoreIT.java (and potentially ...WithFilteringIT.java, ...WithRemovalIT.java, etc., mirroring the existing modules).
    • Extend EmbeddingStoreIT (or EmbeddingStoreWithFilteringIT, etc.) to inherit a basic set of tests.
    • Implement the abstract methods (like embeddingStore()) to provide instances of your store and a compatible embedding model.
    • Add tests specific to Google Cloud Firestore features and limitations.
    • Use @EnabledIfEnvironmentVariable to conditionally run the tests only when the necessary environment variables (credentials) are set. See OllamaChatModelIT for an example.
  5. Add to BOM (Bill of Materials): Add your new module to langchain4j-bom/pom.xml to manage its version consistently.

  6. Documentation:

    • Add the new model in all the index.md files
    • Write a guide in docs/docs/integrations/embedding-stores
    • Create a README file

Code Example (Conceptual)

// In your GoogleFirestoreEmbeddingStore.java
package dev.langchain4j.store.embedding.google.firestore;

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.EmbeddingStore;
import lombok.Builder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.List;

// ... other imports ...

public class GoogleFirestoreEmbeddingStore implements EmbeddingStore<TextSegment> {

    private static final Logger log = LoggerFactory.getLogger(GoogleFirestoreEmbeddingStore.class);

    // ... (fields for Firestore client, collection name, etc.) ...

    @Builder
    public GoogleFirestoreEmbeddingStore(...) {
        // ... (initialize Firestore client, etc.) ...
    }

    @Override
    public String add(Embedding embedding) {
        // ... implementation ...
    }

    @Override
    public void add(String id, Embedding embedding) {
        // ... implementation ...
    }

    @Override
    public String add(Embedding embedding, TextSegment textSegment) {
        // ... implementation ...
    }

    @Override
    public List<String> addAll(List<Embedding> embeddings) {
        // ... implementation ...
    }

    @Override
    public void addAll(List<String> ids, List<Embedding> embeddings, List<TextSegment> embedded) {
        // ... implementation ...
    }

    @Override
    public List<EmbeddingMatch<TextSegment>> findRelevant(Embedding referenceEmbedding, int maxResults, double minScore) {
        // ... implementation using Firestore vector search ...
        // ... map Firestore results to EmbeddingMatch objects ...
    }

    // ... other methods ...
}

// In a separate file: GoogleFirestoreEmbeddingStoreBuilderFactory.java
package dev.langchain4j.store.embedding.google.firestore;

import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.spi.store.embedding.EmbeddingStoreFactory;

public class GoogleFirestoreEmbeddingStoreBuilderFactory implements EmbeddingStoreFactory {
    @Override
    public GoogleFirestoreEmbeddingStore.Builder get() {
       return GoogleFirestoreEmbeddingStore.builder();
    }
}

Key Improvements and Considerations from the examples:

  • Error Handling: Wrap Firestore API calls in try-catch blocks and throw RuntimeException (or a more specific custom exception) to handle potential errors.
  • Logging: Use SLF4J (log.debug(), log.info(), log.warn(), log.error()) for logging. This is crucial for debugging and monitoring.
  • Configuration: The Builder should allow configuration of all necessary parameters (endpoint, credentials, collection name, timeouts, etc.). Use ensureNotNull and ensureNotBlank to validate.
  • Dependency Management: The pom.xml should correctly specify the Google Cloud Firestore client library as a dependency.
  • Testcontainers: Consider using Testcontainers for integration tests to spin up a Firestore emulator. This will make your tests more reliable and portable. (See examples in the provided code, such as for CassandraEmbeddingStoreDockerIT).
  • Metadata: Implement proper handling of metadata, including mapping it to/from Firestore document fields. Decide on your strategy (separate fields vs. JSON).
  • Filtering: Implement filtering based on metadata using Firestore’s query capabilities. You will likely need a helper class like ChromaMetadataFilterMapper to translate Filter objects into Firestore queries.
  • Service Provider Interface (SPI): The spi package and META-INF/services file are critical for LangChain4j to discover your implementation. This is how AiServices (and other parts of the framework) dynamically load components. Without this, your module won’t be usable.
  • Return type: The findRelevant method should take a {@link dev.langchain4j.store.embedding.EmbeddingSearchRequest} and return a {@link dev.langchain4j.store.embedding.EmbeddingSearchResult}. This would ensure that we have consistency through all embedding stores and allow advanced filtering and reranking features.
  • Interface segregation: Introduce a EmbeddingStoreWithFiltering interface, that extends EmbeddingStore and adds a removeAll(Filter filter) method.
  • Metadata handling: Implement the MetadataHandler interface, or extend one of its base classes if it’s a good fit.
  • Support Vector Search API in different database modes: Support vector search in the different modes of Cosmos DB, implementing the search in each of them.
  • Implement the close() method: You should close the connection to Firestore in the close() method.
  • Test your embedding store against the EmbeddingStoreWithFilteringIT and EmbeddingStoreWithRemovalIT: The provided langchain4j-core contains those abstract test classes that you can use to test your embedding store.
  • Add documentation: Create Markdown files in the docs/docs/integrations/embedding-stores directory, following the structure of existing integrations.
  • Add an example: to the examples repository.

By following these steps and adapting the provided example code, you can create a robust and well-integrated LangChain4j module for Google Cloud Firestore.

Second verdict

Again, it’s pretty spot on and accurate! There’s still a mention of Lombok, and a strange reference to Cosmos DB, but I generally like how it advises me to look at existing modules, telling me that a particular one may be a good inspiration to get my implementation right.

I’ve never implemented an embedding store so far, and I think newer embedding stores should implement a search() method instead of findRelevant(). So I asked how to implement this method, using the documentation of the Firestore Java SDK, and a minute later, it suggested a concrete implementation. I won’t copy the output again, as I don’t want to make this article too long (it already is). I haven’t tried this implementation, but the code and explanations seemed pretty convincing, so when I get some time, I’ll try to see if it’s correct.
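To give an idea of what such a search has to do, independently of Firestore, here is a minimal in-memory sketch: score every stored vector against the query, drop those below minScore, and keep the best maxResults. It is not the implementation Gemini suggested, and it uses neither the LangChain4j nor the Firestore APIs. The InMemorySearchSketch class and the Match record are hypothetical names, and using raw cosine similarity as the score is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical, self-contained illustration of a maxResults / minScore search.
class InMemorySearchSketch {

    // A match is just the stored ID and its similarity score
    record Match(String id, double score) {}

    // Cosine similarity between two vectors of the same length
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Brute-force search: filter by minScore, sort by score, keep the top maxResults
    static List<Match> search(Map<String, float[]> store, float[] query,
                              int maxResults, double minScore) {
        List<Match> matches = new ArrayList<>();
        for (Map.Entry<String, float[]> entry : store.entrySet()) {
            double score = cosine(query, entry.getValue());
            if (score >= minScore) {
                matches.add(new Match(entry.getKey(), score));
            }
        }
        matches.sort(Comparator.comparingDouble(Match::score).reversed());
        return matches.subList(0, Math.min(maxResults, matches.size()));
    }

    public static void main(String[] args) {
        Map<String, float[]> store = Map.of(
                "a", new float[]{1f, 0f},
                "b", new float[]{0f, 1f},
                "c", new float[]{1f, 1f});
        System.out.println(search(store, new float[]{1f, 0f}, 2, 0.5));
    }
}
```

A real store would delegate the nearest-neighbor part to the database’s vector index rather than scanning every document, but the contract around maxResults and minScore stays the same.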

Conclusion

Only Gemini offers a 2 million token context window. And I’m glad it does, because the LangChain4j project is close to that limit! With models with smaller windows, I would have had to be way more selective, and send in the prompt just the right types of artifacts (i.e., just the LLM modules, or just the embedding store implementations). Thanks to the huge window, I was able to feed in the repository in its entirety!
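For a model with a smaller window, the selection step could look like the following sketch: keep only the files under chosen module directories, and stop once a token budget is reached. Everything here is hypothetical, including the ContextSelector name and the file paths. In particular, the 4-characters-per-token ratio is only a rough rule of thumb that I assume for illustration; a real tokenizer would give different numbers.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for trimming a repository dump to fit a smaller context window.
class ContextSelector {

    // Assumption: roughly 4 characters per token (a crude estimate, not a tokenizer)
    static final int CHARS_PER_TOKEN = 4;

    static long estimateTokens(String text) {
        return (text.length() + CHARS_PER_TOKEN - 1) / CHARS_PER_TOKEN;
    }

    // Keeps files whose path starts with one of the prefixes, in order,
    // and stops before the estimated token budget would be exceeded.
    static Map<String, String> select(Map<String, String> files,
                                      List<String> prefixes, long tokenBudget) {
        Map<String, String> selected = new LinkedHashMap<>();
        long used = 0;
        for (Map.Entry<String, String> file : files.entrySet()) {
            String path = file.getKey();
            boolean wanted = prefixes.stream().anyMatch(prefix -> path.startsWith(prefix));
            if (!wanted) {
                continue;
            }
            long cost = estimateTokens(file.getValue());
            if (used + cost > tokenBudget) {
                break;
            }
            selected.put(path, file.getValue());
            used += cost;
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("langchain4j-open-ai/src/A.java", "x".repeat(40));
        files.put("docs/intro.md", "x".repeat(400));
        files.put("langchain4j-ollama/src/B.java", "x".repeat(80));
        List<String> llmModules = List.of("langchain4j-open-ai/", "langchain4j-ollama/");
        System.out.println(select(files, llmModules, 30).keySet());
    }
}
```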

Does it mean it’s the end of writing proper tutorials or how-to guides? Certainly not. But I find it very interesting that I’m able to have this kind of highly detailed conversation with the LLM, without having to understand all the tiny little details of the underlying project, as the model is able to grok it for me and distill just the right level of information for me to do the task I asked about.

What’s very interesting is that I can continue the conversation to go in various directions, or zoom in on some specific aspects, which may not necessarily be covered by existing tutorials or guides. It’s as if I were pair programming with the founder of the project.

https://glaforge.dev/posts/2025/02/15/the-power-of-large-context-windows-for-your-documentation-efforts/
A Generative AI Agent with a real declarative workflow

In my previous article, I detailed how to build an AI-powered short story generation agent using Java, LangChain4j, Gemini, and Imagen 3, deployed on Cloud Run jobs.

This approach involved writing explicit Java code to orchestrate the entire workflow, defining each step programmatically. This follow-up article explores an alternative, declarative approach using Google Cloud Workflows.

I’ve written extensively on Workflows in the past, and for AI agents that follow a very explicit plan and orchestration, I believe Workflows is also a great, declarative approach.

From imperative to declarative: defining the workflow

The Java-based agent employed an imperative style, where the code explicitly defined the sequence of operations. Each step, from story conception to image selection, was a method call within the ExplicitStoryGeneratorAgent class (you can check the code of this class). This provided fine-grained control and allowed for parallelization.

Cloud Workflows offers a declarative approach. Instead of writing code, you define the workflow in a YAML file. This file specifies the steps, their inputs and outputs, and the order in which they should be executed. You can also easily create loops (sequential or parallel), and you can implement human-in-the-loop callbacks if needed. The workflow engine then interprets this definition and executes the steps accordingly.
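
To make the difference concrete, here is a minimal sketch in plain Java (it is not Cloud Workflows itself, just an illustration of the idea). The workflow is described as a list of named steps held as data, and a tiny engine walks that list, passing a shared map of variables from one step to the next. The MiniWorkflowEngine class, the Step record, and the step bodies are all hypothetical names made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical names throughout: this only mimics the idea of a workflow engine.
final class MiniWorkflowEngine {

    // A step is data: a name, plus a function from the current variables to new variables.
    record Step(String name, Function<Map<String, Object>, Map<String, Object>> action) {}

    // The "engine": it interprets the definition by running each step in order.
    static Map<String, Object> run(List<Step> definition, Map<String, Object> input) {
        Map<String, Object> vars = new LinkedHashMap<>(input);
        for (Step step : definition) {
            vars.putAll(step.action().apply(vars));
        }
        return vars;
    }

    // The workflow definition itself: what to do, not how to run it.
    static List<Step> storyWorkflow() {
        return List.of(
            new Step("generate_story",
                vars -> Map.of("content", "Story about " + vars.get("topic"))),
            new Step("generate_image_prompt",
                vars -> Map.of("prompt", "A cartoon of " + vars.get("topic"))),
            new Step("prepare_result",
                vars -> Map.of("done", true))
        );
    }
}
```

Calling MiniWorkflowEngine.run(MiniWorkflowEngine.storyWorkflow(), Map.of("topic", "robots")) returns the accumulated variables. The point is that the list of steps could just as well be loaded from a file, which is roughly the role the YAML definition plays for Workflows.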

YAML can be a fair bit more cumbersome to write than code in a programming language, but non-developers could get a workflow definition rolling without having to fire up an IDE. For a Java developer, though, it’s certainly simpler to write code, with the help of their favorite programming environment.

In this article, I’ve created a simplified variant: I removed the LLM-as-judge step that picked the best images, and I create pictures for the whole story, not for each chapter. So it’s not exactly the same agent workflow as in the previous article. But I don’t want you to drown in too much YAML!

The workflow:

  • creates the story with Gemini 2,
  • creates a prompt (for the whole story, not for each chapter),
  • generates images with Imagen 3,
  • saves the result in Cloud Firestore

Let’s have a look at the full YAML definition, and read the comments explaining what each step does:

main:
  params: [input]
  steps:

    # Let's define the Gemini and Image models we want to use:
    - setup:
        assign:
          - GEMINI_MODEL: ${"projects/" + sys.get_env("GOOGLE_CLOUD_PROJECT_ID") +
              "/locations/us-central1/publishers/google/models/gemini-2.0-flash-exp"}
          - IMAGEN_MODEL: ${"projects/" + sys.get_env("GOOGLE_CLOUD_PROJECT_ID") +
              "/locations/us-central1/publishers/google/models/imagen-3.0-generate-002"}

    # We call Gemini to generate the story
    - generate_story:
        call: googleapis.aiplatform.v1.projects.locations.endpoints.generateContent
        args:
          model: ${GEMINI_MODEL}
          region: 'us-central1'
          body:
            contents:
              role: user
              parts:  # Let's write a sci-fi story!
                - text: "Write a short science-fiction story"
            generationConfig:
              temperature: 2.0
              responseMimeType: application/json
              # Use a JSON schema to define the format of the output
              responseSchema:
                type: OBJECT
                properties:
                  title:
                    type: STRING
                    description: The title of the short story
                  content:
                    type: STRING
                    description: The body of the story
                required: ['title', 'content']
            # You can define system instructions
            systemInstruction:
              parts:
                - text: >
                    You are a creative fiction author,
                    and your role is to write stories.
                    You write a story as requested by the user.

                    A story always has a title,
                    and is made of 5 long chapters.
                    Each chapter has a title, is split into paragraphs,
                    and is at least 20 sentences long.
        result: short_story

    # Assign the story, title, content into some variables
    - get_story:
        assign:
          - story_output: ${json.decode(short_story.candidates[0].content.parts[0].text)}
          - title: ${story_output.title}
          - content: ${story_output.content}

    # Let's call Gemini again, but for creating a prompt for Imagen
    - generate_image_prompt:
        call: googleapis.aiplatform.v1.projects.locations.endpoints.generateContent
        args:
          model: ${GEMINI_MODEL}
          region: 'us-central1'
          body:
            contents:
              role: user
              parts:
                - text: ${content}
            systemInstruction:
              parts:
                - text: |
                    You are an expert artist who masters crafting great
                    prompts for image generation models, to illustrate
                    short stories.
                    When given a short story, reply with a concise
                    prompt that could be used to create an illustration
                    with the Imagen 3 model.
                    Don't use any flags like those used with MidJourney.
                    Just answer with the short concise text prompt.

                    Your answer MUST start with "A cartoon of ",
                    as we want to use cartoon or comics illustrations.
                    The user gives you the following image prompt
                    for the chapter to illustrate:
        result: image_prompt

    # Retrieve the prompt from Gemini's output
    - assign_prompt:
        assign:
          - prompt: ${image_prompt.candidates[0].content.parts[0].text}

    # Time to generate the images
    - image_generation:
        call: googleapis.aiplatform.v1.projects.locations.endpoints.predict
        args:
          endpoint: ${IMAGEN_MODEL}
          region: us-central1
          body:
            instances:
              - prompt: ${prompt}
            parameters: # Store images in Google Cloud Storage
              storageUri: 'gs://short-scifi-stories-generated-images'
        result: images

    # Utility step to create the picture data for Firestore
    - prepare_images_uri_list:
        steps:
          - create_empty_list:
              assign:
                - uris: []
                - uris_for_firestore: []
          - loop_over_images:
              for:
                value: img_object
                in: ${images.predictions}
                steps:
                  - append_uri:
                      assign:
                        - uris: ${list.concat(uris, img_object.gcsUri)}
                        - stringUriMap:
                            stringValue: ${img_object.gcsUri}
                        - uris_for_firestore: ${list.concat(uris_for_firestore, stringUriMap)}

    # Let's prepare the final output to return
    # as the result of the workflow execution
    - prepare_result:
        assign:
          - final_result:
              title: ${title}
              content: ${content}
              prompt: ${prompt}
              images: ${uris}
              createdAt: ${sys.now()}

    # Finally, let's save the story in Firestore
    - save_to_firestore:
        call: googleapis.firestore.v1.projects.databases.documents.createDocument
        args:
          collectionId: short-story
          parent: ${"projects/" + sys.get_env("GOOGLE_CLOUD_PROJECT_ID") + "/databases/(default)/documents"}
          query:
            documentId: ${uuid.generate()}
          body:
            fields:
              title:
                stringValue: ${final_result.title}
              content:
                stringValue: ${final_result.content}
              prompt:
                stringValue: ${final_result.prompt}
              images:
                arrayValue:
                  values: ${uris_for_firestore}
              createdAt:
                timestampValue: ${time.format(final_result.createdAt,"GMT")}

    # Return the data
    - return_output:
        return: ${final_result}

This YAML file defines the entire story generation process. It calls the Gemini and Imagen APIs, extracts the necessary information from the responses, and saves the final result to Firestore. No Java code is required to manage the flow of execution.
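
One detail worth spelling out is why the workflow keeps two lists of image URIs. The Firestore REST API expects typed values: a string is wrapped as a stringValue, and a list is wrapped as an arrayValue holding a values list. Here is a minimal sketch of the same transformation in plain Java, using only maps and lists. The FirestoreValues class and its method names are hypothetical, and no Firestore client library is involved.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper mirroring what the prepare_images_uri_list step builds.
final class FirestoreValues {

    // Wraps a plain string as a Firestore REST "stringValue".
    static Map<String, Object> stringValue(String s) {
        return Map.of("stringValue", s);
    }

    // Wraps a list of strings as a Firestore REST "arrayValue".
    static Map<String, Object> arrayOfStrings(List<String> items) {
        List<Map<String, Object>> values = items.stream()
            .map(FirestoreValues::stringValue)
            .toList();
        return Map.of("arrayValue", Map.of("values", values));
    }
}
```

The plain list of URIs is what gets returned by the workflow, while the wrapped variant is what goes into the images field of the Firestore document.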

Key differences and trade-offs

Let’s zoom in on the pros and cons of both approaches.

Imperative / programming approach:
  • Pros:
    • Fine-grained control over the workflow.
    • Explicit parallelization for performance optimization.
    • Familiar programming and debugging tools.
    • Cloud Run jobs is fully managed and scaled by Google Cloud.
    • Job execution can be scheduled by Cloud Scheduler.
  • Cons:
    • You need to be familiar with a programming language & environment.
    • It can potentially be challenging to maintain as the workflow evolves.
    • The approach used required being familiar with running & scheduling containers as jobs.

Declarative / workflow based approach:
  • Pros:
    • Pretty easy-to-understand workflow definitions.
    • Workflows offers a visualisation of the steps (also during execution).
    • Parallelization can be defined explicitly (with the parallel keyword on iterations or step branches).
    • Simplified maintenance and updates. Just need to update the YAML in the console.
    • Workflows is scalable and reliable out of the box without extra effort.
    • Workflow execution can be scheduled by Cloud Scheduler.
  • Cons:
    • YAML authoring can be painful, if you’re not familiar with the APIs you call.
    • Parallelization is declarative but might be limited depending on the workflow definition and Google Cloud Workflows capabilities. You would have more control with a programming language.
    • There’s no emulator to run workflows locally, so you might have to create copies and work on these, to not affect the production workflow.
    • Debugging relies on workflow execution logs, which might be less intuitive than traditional debugging.

Choosing the right approach

It depends! 😅

Of course, the choice between these approaches depends on the specific project requirements. If fine-grained control and explicit parallelization are critical, the imperative programming approach might be preferable.

However, for simpler workflows where ease of development and maintainability are critical, Cloud Workflows offers an interesting alternative. You can easily make a tweak to the workflow directly from the Google Cloud console if needed.

In the case of this story generation agent, the declarative approach sounds like a good fit, but the YAML authoring can be a bit painful at times, as you have to look up the various payload schemas for the APIs to invoke, to be able to make the service calls. But that’s definitely a plus as well, in the sense that pretty much all the products and services offered on Google Cloud Platform can easily be called via REST endpoints, and Workflows excels at that.

Conclusion

Explicit declarative planning helps AI agents stay focused, and ensures a high level of predictability. My experience with agents which plan their own actions has been mixed, as sometimes the LLM hallucinates function calls, or calls functions with bogus parameters. In the previous article, I used an imperative programming approach, but in this article, I developed a simplified equivalent with a declarative workflow definition.

Google Cloud Workflows offers a powerful and convenient way to build and manage declarative AI agents — and obviously any other kind of process that needs to call APIs. By defining the workflow declaratively, you can focus on the logic of your agent rather than the details of execution. While it might not be suitable for every use case, it’s definitely a valuable tool to consider when building AI-powered applications on Google Cloud!

https://glaforge.dev/posts/2025/01/31/a-genai-agent-with-a-real-workflow/
An AI agent to generate short sci-fi stories

This project demonstrates how to build a fully automated short story generator using Java, LangChain4j, Google Cloud’s Gemini and Imagen 3 models, and a serverless deployment on Cloud Run.

Every night at midnight UTC, a new story is created, complete with AI-generated illustrations, and published via Firebase Hosting. So if you want to read a new story every day, head over to:

short-ai-story.web.app

The code of this agent is available on GitHub. So don’t hesitate to check out the code:

github.com/glaforge/short-genai-stories

Let’s have a closer look at the architecture and workflow of this automated storytelling machine.

The agent: the storyteller’s brain

At the heart of the system lies the ExplicitStoryGeneratorAgent, a Java class orchestrating the entire story generation process. This agent follows a clear, multi-step workflow:

  • Story conception (Gemini): The agent first calls the Gemini large language model (LLM) to generate the core story elements: a title, five chapters each with title and content.
  • Image prompt engineering (Gemini): For each chapter’s content, the agent again leverages Gemini to craft tailored image generation prompts. This ensures that the image prompts are relevant to the specific content of each chapter.
  • Illustration generation (Imagen 3): Using the generated prompts, the agent calls Imagen 3 to produce a set of image candidates (four by default) for each chapter.
  • Image selection (Gemini, self-reflection): In a unique “self-reflection” step, the agent presents the generated images back to Gemini, asking the LLM to select the image that best visually represents the chapter’s narrative. This crucial step ensures that the illustrations truly complement the story.
  • Persistence (Firestore): Once the story, chapter titles, content, and selected images are finalized, the agent stores them in a Firestore database (a NoSQL document database). This makes retrieving complete data relatively straightforward from the web frontend, thanks to the Firebase framework.

For the more visual people among us, this diagram illustrates the steps above:

[Figure: Agent workflow diagram]

Note: The critique step where Gemini is asked to judge the best image isn’t really necessary, to be honest. Imagen generates images that adhere very closely to the given prompt. So any of them would seem suitable to illustrate each chapter. But it was interesting to implement a self-reflection step in this workflow.

Digression: Explicit vs. Autonomous Agent Workflows

This project utilizes an explicit workflow agent, where the story generation process is meticulously defined and controlled by the Java code. This approach contrasts with fully autonomous agents, which rely on the LLM to plan and execute the workflow dynamically.

Let’s explore the key differences and trade-offs between these two approaches:

Explicit workflow agent (code-driven planning):
  • Predictable execution: The Java code dictates the exact sequence of steps, ensuring a highly predictable and reliable workflow. Each stage, from story conception to image selection, is explicitly programmed, leaving no room for unexpected deviations.
  • Improved performance through parallelization: With explicit control, tasks that can be executed concurrently (such as generating images for different chapters or judging the best image for each chapter) can be easily parallelized. This significantly reduces the overall execution time.
  • Easier debugging and maintenance: The clear, structured code makes debugging and maintenance straightforward. The flow of execution is transparent to the developer, and any errors can be readily identified and addressed.
  • Limited flexibility: The explicit nature of the workflow could be seen as offering less flexibility. Indeed, the code needs to be updated to handle changes to the workflow. However, it’s not necessarily worse than endlessly tweaking prompts to coerce an LLM into correctly planning the needed workflow changes.

Autonomous agent (LLM-driven planning):
  • Dynamic workflow: Autonomous agents use the LLM’s capabilities to plan and execute the workflow. This allows for greater flexibility and adaptability to different story generation requirements. The LLM can theoretically decide which steps to take, in which order, and how many times.
  • Potential for hallucinations and errors: Relying on the LLM for planning introduces the risk of hallucinations and incorrect function calls. The LLM might generate nonsensical steps, omit crucial actions, provide incorrect parameters to functions, or execute functions in an illogical order. This can lead to unpredictable results and make it harder to catch potential errors. Even with perfect prompts, LLMs might make mistakes in function calling. This is actually the problem I encountered when trying this approach first.
  • Debugging challenges: Debugging autonomous agents can be more complex. The dynamic nature of the workflow makes it harder to trace the execution path and identify the source of errors. Troubleshooting often involves analyzing the logs of the LLM and the tools it requested to call, which can be challenging to interpret at times.
  • Less control over execution: With autonomous agents, developers cede some control over the execution flow to the LLM. While this offers flexibility, it also means less fine-grained control over performance optimization. Parallelization opportunities, for example, might not be readily apparent or easily exploitable. Currently, when receiving parallel function call requests, LangChain4j doesn’t yet offer the possibility to request their parallelization.

The autonomous approach would have looked like the following diagram:

[Figure: Autonomous agent diagram]

With this approach, the main agent generates the story, then would first call the prompt creation tool, then the image generation tool, and would finish with the tool that picks the best image. However, in my experience, in spite of a good amount of prompt engineering tweaks, I couldn’t get this to work reliably. I tried with different versions of Gemini (1.5 Flash, 1.5 Pro, and 2.0 Flash experimental, from worst to best outcome), but sometimes, for example, it would request to judge images before they had been generated, or the URLs of the images would be hallucinated instead of coming from the outcome of the judge. So I preferred moving to a more explicit approach.
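
For what it’s worth, one way to contain this kind of failure, if you still want to go the autonomous route, is to validate each tool call the model requests before actually executing it. The following is only a sketch under that assumption: a small guard remembers which image URIs the generation tool really returned, and refuses a judging request that mentions anything else (or that arrives before any image exists). The ToolCallGuard class and its methods are hypothetical. They are not part of LangChain4j, nor of this project.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical guard placed between the LLM's tool requests and the real tools.
final class ToolCallGuard {

    private final Set<String> generatedImageUris = new HashSet<>();

    // Call this with the real output of the image generation tool.
    void recordGeneratedImages(List<String> uris) {
        generatedImageUris.addAll(uris);
    }

    // Returns true only if every URI the model wants judged was really generated.
    boolean canJudge(List<String> requestedUris) {
        return !requestedUris.isEmpty()
            && generatedImageUris.containsAll(requestedUris);
    }
}
```

Such a check doesn’t make the planning any smarter, but it turns a silent hallucination into an explicit rejection that can be logged or sent back to the model.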

I invite you to read this great article from Anthropic about building effective agents, which also makes the distinction between agents (fully autonomous planning agents) and workflows (the more explicit approach with code driving the execution planning). They also recommend sticking to workflows when the logic of the agent is very clear upfront: when you can draw a workflow on a sheet of paper, that’s surely because you do need a workflow.

Choosing the right approach:

The choice between explicit and autonomous workflows depends on the specific requirements of the project. Explicit workflows are best suited for applications where predictability, reliability, and performance are paramount.

Autonomous agents are more appropriate when flexibility, adaptability, and dynamic planning are prioritized, even at the cost of potential errors and increased debugging complexity.

This project prioritizes the former over the latter, which explains why an explicit agent is preferred for this use case. In other words, it’s better to have a predictable solution, which then can be easily parallelized to reduce latency, than a non-predictable solution that is slower.

A closer look at the code

As you’ll be exploring the code base, I’d like to highlight a few points.

The ExplicitStoryGeneratorAgent class uses a structured and predictable approach to orchestrating the LLM. Its core logic resides within the main() method, outlining a clear, step-by-step workflow:

// 1️⃣ let's prepare the story
Story story = prepareStory("a science-fiction novel");

// 2️⃣ iterate over each chapter in parallel
List<Story.Chapter> newChaptersWithImages = story.chapters.stream()
    .parallel()
    .map(chapter -> {
        // 3️⃣ prepare an image prompt for each chapter
        String imagePrompt = prepareImagePromptForChapter(chapter);

        // 4️⃣ generate up to 4 images per chapter
        List<String> imagesForChapter = generateImages(imagePrompt);

        // 5️⃣ judge the best image for this chapter
        String bestImage = pickBestImageForChapter(
            chapter.chapterContent, imagesForChapter);

        return new Story.Chapter(
            chapter.chapterTitle,
            chapter.chapterContent,
            bestImage);
    }).toList();

Story newStoryWithImages =
    new Story(story.title, newChaptersWithImages);
// 6️⃣ save the story to Firestore
saveToFirestore(newStoryWithImages);
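
If you want to play with this structure without calling any model, here is a self-contained sketch of the same shape, where the three per-chapter steps are replaced by stubs. The ParallelChaptersSketch class and its stubbed methods are hypothetical stand-ins, not the project’s code. It shows that the chapters are processed concurrently, and that toList() still returns them in their original order.

```java
import java.util.List;

// Stand-in for the real agent: the three per-chapter steps are stubbed out.
final class ParallelChaptersSketch {

    record Chapter(String title, String content, String image) {}

    // Stub for the Gemini call that writes an image prompt.
    static String prepareImagePrompt(Chapter c) {
        return "A cartoon of " + c.title();
    }

    // Stub for the Imagen call that returns candidate images.
    static List<String> generateImages(String prompt) {
        return List.of(prompt + " #1", prompt + " #2");
    }

    // Stub for the judging step: simply picks the first candidate.
    static String pickBestImage(String content, List<String> images) {
        return images.get(0);
    }

    static List<Chapter> illustrate(List<Chapter> chapters) {
        return chapters.stream()
            .parallel() // chapters are independent, so they can run concurrently
            .map(c -> {
                List<String> images = generateImages(prepareImagePrompt(c));
                return new Chapter(c.title(), c.content(),
                    pickBestImage(c.content(), images));
            })
            .toList(); // encounter order is preserved: chapter 1 stays first
    }
}
```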

Story generation depends on structured output: The agent uses Gemini to generate the story’s title and five chapters, each with a title and content. Crucially, it leverages Java records and responseSchema to ensure type safety and consistent outputs. You’ll notice the use of @Description annotations to ensure the LLM really understands what each field corresponds to:

record Story(
    @Description("The title of the story")
    String title,
    @Description("The chapters of the story")
    List<Chapter> chapters) {
    record Chapter(
        @Description("The title of the chapter")
        String chapterTitle,
        @Description("The content of the chapter")
        String chapterContent,
        @Description("The Google Cloud Storage URI of the image...")
        String gcsURI) {
    }
}

To configure the model generation to use structured outputs, here’s how the schema of this output is defined:

var chatModel = VertexAiGeminiChatModel.builder()
    .project(GCP_PROJECT_ID)
    .location(GCP_LOCATION)
    .modelName(CHAT_MODEL_NAME)
    .temperature(1.5f)
    .responseSchema(Schema.newBuilder()
        .setType(Type.OBJECT)
        .putProperties("title", Schema.newBuilder()
            .setDescription("The title of the story")
            .setType(Type.STRING)
            .build())
        .putProperties("chapters", Schema.newBuilder()
            .setDescription("The list of 5 chapters")
            .setType(Type.ARRAY)
            .setItems(Schema.newBuilder()
                .setDescription(
                    "A chapter with a title, and its content")
                .setType(Type.OBJECT)
                .putProperties("chapterTitle", Schema.newBuilder()
                    .setType(Type.STRING)
                    .setDescription("The title of the chapter")
                    .build())
                .putProperties("chapterContent", Schema.newBuilder()
                    .setType(Type.STRING)
                    .setDescription("The content of the chapter, " +
                        "made of 20 sentences")
                    .build())
                .addAllRequired(
                    List.of("chapterTitle", "chapterContent"))
                .build())
            .build())
        .addAllRequired(List.of("title", "chapters"))
        .build())
    .build();

It’s possible to simplify the schema creation by taking advantage of a helper class. This schema could have been simplified to:

    // ...
    .responseSchema(SchemaHelper.fromClass(Story.class))
    // ...
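
To give an idea of what such a helper might do under the hood, here is a rough sketch that derives a schema description from a record through reflection. To be clear, this is not the actual SchemaHelper implementation: the RecordSchemaSketch class, the Desc annotation, and the map-based output are hypothetical, and the sketch only handles strings, lists, and nested records.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.RecordComponent;
import java.lang.reflect.Type;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RecordSchemaSketch {

    // Hypothetical stand-in for a description annotation.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.RECORD_COMPONENT)
    @interface Desc { String value(); }

    record Chapter(@Desc("The title of the chapter") String chapterTitle,
                   @Desc("The content of the chapter") String chapterContent) {}

    record Story(@Desc("The title of the story") String title,
                 @Desc("The chapters of the story") List<Chapter> chapters) {}

    // Recursively maps String -> STRING, List<T> -> ARRAY, record -> OBJECT.
    static Map<String, Object> schemaOf(Type type) {
        Map<String, Object> schema = new LinkedHashMap<>();
        if (type == String.class) {
            schema.put("type", "STRING");
        } else if (type instanceof ParameterizedType p && p.getRawType() == List.class) {
            schema.put("type", "ARRAY");
            schema.put("items", schemaOf(p.getActualTypeArguments()[0]));
        } else if (type instanceof Class<?> c && c.isRecord()) {
            schema.put("type", "OBJECT");
            Map<String, Object> props = new LinkedHashMap<>();
            for (RecordComponent rc : c.getRecordComponents()) {
                Map<String, Object> prop = schemaOf(rc.getGenericType());
                Desc d = rc.getAnnotation(Desc.class);
                if (d != null) prop.put("description", d.value());
                props.put(rc.getName(), prop);
            }
            schema.put("properties", props);
        } else {
            throw new IllegalArgumentException("Unsupported type: " + type);
        }
        return schema;
    }
}
```

Calling RecordSchemaSketch.schemaOf(RecordSchemaSketch.Story.class) yields a nested map with the same overall shape as the hand-written schema above, minus the required fields.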

To instruct the LLM at each step, I tend to use system instructions for setting the role and goal for the LLM, but I use user messages to give the more variable part, like the chapter’s content, or the image prompt. Here’s an example:

Response<AiMessage> response = chatModel.generate(
    SystemMessage.from("""
        You are a creative fiction author, and your role is to write stories.
        You write a story as requested by the user.

        A story always has a title, and is made of 5 long chapters.
        Each chapter has a title, is split into paragraphs, \
        and is at least 20 sentences long.
        """),
    UserMessage.from(storyType)
);

The storyType variable in the user message contains the type of story to generate, like "a science-fiction story". It’s currently set in stone, but you could parameterize this to generate fantasy novels, love stories, etc.

The self-reflection step, where the LLM judges which illustration is the best for a chapter, takes advantage of Gemini’s multimodal capabilities. Gemini is instructed to pick the best image out of a few, and it is given the text of the request (including the URLs of the pictures), as well as inline references to those images (i.e. the Google Cloud Storage URIs pointing at the location of the pictures). Thus, this is a multimodal request, as both text and images are passed in the prompt:

List<String> imagesForChapter = generateImages(imagePrompt);
String bestImage = pickBestImageForChapter(chapter.chapterContent, imagesForChapter);

// Inside pickBestImageForChapter we have:
List<ChatMessage> messages = new ArrayList<>();
messages.add(SystemMessage.from("...prompt to select best image..."));
messages.add(UserMessage.from("...chapter content..."));
imagesForChapter.forEach(imageUrl -> {
    // Send each URL as text and as image to the model
    messages.add(UserMessage.from(imageUrl + "\n"));
    messages.add(UserMessage.from(ImageContent.from(imageUrl)));
});

Response<AiMessage> response = chatModel.generate(messages);
// ... parse best image from response

Building the application

The project employs a standard Java development workflow using Maven for dependency management and building:

  • Dependencies: The pom.xml file defines the project’s dependencies, including LangChain4j (for LLM orchestration), the Google Cloud Firestore library (for data persistence), and Google Cloud’s Gemini and Imagen libraries.
  • Packaging: The Maven build process packages the application into a JAR, with its dependencies by its side. I followed the approach explained in that article: building a JAR with its dependencies on the side, instead of a shaded / fat JAR. One benefit I see is that the dependencies are one container layer, while the application itself is another, so Docker builds should be faster: the dependencies don’t change often, so that dependency layer stays cached.
  • Containerization (Docker): A Dockerfile is used to containerize the application. The container image includes the executable JAR and dependencies, as well as the Java runtime environment. I used Azul’s Zulu distroless Java 21 base image. The container is finally built thanks to Cloud Build.

Deployment and automation

To automate story generation and deployment, the project leverages several Google Cloud services:

  • Cloud Build: Cloud Build automates the process of building the Docker container image. The provided justfile contains commands and recipes to build and submit the container image (I covered just in a previous article, a nifty little tool to parameterize and run common commands for the project). I simply followed the tutorial in the Cloud Build documentation to submit a build via the CLI (the gcloud CLI SDK), after having done some IAM setup as explained here to be able to push the built image in Artifact Registry.
  • Cloud Run jobs: The application runs as a Cloud Run job. Contrary to Cloud Run services, where incoming HTTP requests trigger the service, here, jobs are triggered and run to completion. The Cloud Run job allows for serverless execution of the story generation agent. I followed this guide to create jobs. Don’t forget to set up the required environment variables.
  • Cloud Scheduler: Cloud Scheduler triggers the Cloud Run job every day at midnight UTC. This automation ensures that a new story is generated and published daily. To configure this, this page explains how to set scheduled triggers.
  • Firebase Hosting: Firebase Hosting serves the static assets of the website (HTML, CSS, JavaScript) that displays the stories. Firebase also provides easy access to the Firestore database where the stories are stored, at the last stage of our agentic workflow.

Further possible improvements

I’m not in the business of selling novels, so I won’t really spend much more time improving this application. However, I noticed a few areas where this project could be improved.

More creativity

When reading the short stories, you’ll notice a certain lack of creativity. Somehow, the stories often happen around the year 2340, the action takes place on Xylos, and some characters appear very frequently, like Aris Thorne. Similarly, some words or concepts appear all the time, like the words echoes, obsidian, crimson, etc. Maybe the model has seen such novels, with such personas, locations, and time periods, in its training data. I’ve seen some people online getting the same kind of stories, and even a book with the same characters or location.

I think it’d be interesting to explore how to make the stories more diverse and varied. For example by adding more steps in the workflow to work on character creation, on different narrative arcs, on environment definitions. For science fiction alone, there are tons of types of sci-fi stories. My friend, Philippe Charrière, worked on how to generate random RPG character names with LLMs. He shared plenty of ideas on how to guide LLMs to get more creative with personas.

Character definition for illustration consistency

Speaking of character creation, if you look at the illustrations, you’ll see that the characters often don’t have the same style or appearance. Indeed, I don’t give Imagen the whole context of the story when I let Gemini create the image prompts. A possible area of improvement could be to work on proper character definitions (face characteristics, attire, etc.), and ensure that the information is passed through to Imagen. The same would apply for the setting, like the planet, the spaceship details, and more.

Chapter legibility

Each story is split into 5 chapters, of about 20 sentences or so. I tried to get Gemini to generate paragraphs, to improve legibility. However, in spite of a bit of time spent on tweaking the prompts, I failed to coerce it into creating paragraphs to delineate the key sections of the chapters. When prompting can’t solve this, an extra LLM call loop can take the chapter’s content and make it more readable.
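
Such a post-processing pass could look like the following sketch. It assumes the model is hidden behind a plain Function<String, String> (prompt in, text out) rather than any specific LangChain4j API. The ParagraphFixer class, the prompt wording, and the retry limit are all hypothetical. The loop asks again until the answer really contains blank-line paragraph breaks, and falls back to the original text otherwise.

```java
import java.util.function.Function;

// Hypothetical post-processing step; the LLM is abstracted as a plain function.
final class ParagraphFixer {

    static final int MAX_ATTEMPTS = 3;

    static String reflow(String chapter, Function<String, String> llm) {
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            String candidate = llm.apply(
                "Split the following chapter into paragraphs separated by a blank line, "
                + "without changing any wording:\n\n" + chapter);
            // Accept the answer only if it really contains paragraph breaks.
            if (candidate != null && candidate.contains("\n\n")) {
                return candidate.strip();
            }
        }
        return chapter; // give up and keep the original text
    }
}
```

Note that this sketch doesn’t check that the wording is left untouched, so a stricter version would also compare the text before and after.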

Conclusion

The key takeaway from this experiment is that when you can describe your AI agent’s plan of action with an explicit and predictable workflow, you should definitely follow that route, and avoid giving the LLM the freedom to handle the planning alone. LLM autonomous planning works much better in more unpredictable cases, where steps can’t be foreseen. Be sure to use the right approach!

Again, Gemini and Imagen were up to the task for this new project and gave great stories and illustrations, even if the creativity could be improved. And I’ll keep using LangChain4j as my Swiss-army knife for all my Generative AI projects, as it works reliably, and offers rich capabilities.

Knowing that I would build a workflow, I also explored the use of my beloved Google Cloud Workflows which I’ve written a lot about. I’ll likely write another (shorter) article where I’ll show how to create such GenAI workflows with it, stay tuned.

This project was also a good opportunity for me to use Cloud Run jobs. I love Cloud Run for all my serverless, auto-scaled, fully-managed, HTTP services, but I hadn’t used a Cloud Run job so far. For such batch kind of tasks, this is the right tool for the job (pun intended)! There’s also Google Cloud Batch but it’s more for heavier computation kind of workloads.

So what’s next? Check out the website to read a short story every day, and explore the code base to better understand how stories are baked. If you want to replicate this application, and haven’t yet tried Google Cloud, feel free to use the $300 of credits for new users.

https://glaforge.dev/posts/2025/01/27/an-ai-agent-to-generate-short-scifi-stories/
Analyzing trends and topics from Bluesky's Firehose with generative AI

First article of the year, so let me start by wishing you all, my dear readers, a very happy new year! And what is the subject of this new piece of content? For a while, I’ve been interested in analyzing trends and topics in social media streams. I recently joined Bluesky (you can follow me at @glaforge.dev), and contrary to X, it’s possible to access its Firehose (the stream of all the messages sent by its users) pretty easily, and even for free. So let’s see what we can learn from the firehose!

Without further ado, here’s the end goal!

Bluesky’s Firehose — a stream of social messages

The underlying protocol used by Bluesky is the AT Protocol. There’s an API to access Bluesky’s streams via this protocol, but it’s a bit cumbersome to use. In order to reduce the quantity of data sent via the AT protocol over its “relay” network, the Bluesky team introduced JetStream, which relays all the messages as well, via WebSockets, in JSON format, for a fraction of the size of the AT protocol payloads. You can also read about how they shrank the payloads by 99%!

The JetStream GitHub repository shares the endpoints you can use to access the firehose, and gives some details about the various types of payloads (new messages, likes, shares, etc.). It also mentions a nice little tool called websocat, a command-line tool to connect to WebSockets, which is very handy to analyze the payloads.

To better understand the JSON message formats, I used websocat, as well as Simon Willison’s client-side online tool to access the JetStream, and see the flows of messages.

A bird’s eye view of the project

Before diving into the code, and showing how to fetch the Bluesky posts, I’d like to give you a high level overview of what we’re going to implement.

I used Napkin.ai to generate this diagram! Go check out this tool. You can paste your article, and for a given paragraph, it can suggest nice diagrams to represent them.

The key steps towards a topic visualization of the stream are:

  1. Data acquisition — The first step is to subscribe to the firehose via WebSockets to gather enough data points to make it interesting to extract trends from them.
  2. Embedding messages — In order to compare users’ posts, the text of the posts should be transformed into vector embeddings via an embedding model, which maps posts into a multidimensional space in which distances can be calculated (texts whose vectors are close to each other are usually semantically similar).
  3. Clustering messages — Now that we have all the vector embeddings, a clustering algorithm is used to create groups of messages that are close to each other in vector space, and form a cluster of data points on the same topic.
  4. Generating a summary of clusters’ messages — The clustering algorithm grouped messages into different clusters. However, at that point, we don’t know what all those grouped messages are about. That’s where a generative AI model is called to make sense of those messages, to get a short description of them.
  5. Preparing the data for visualization — Armed with the clusters of posts and their descriptions, the data for the visualization is prepared.
  6. Visualizing the trends — The last step is to visualize those clusters of messages with a nice visualization. For that purpose, I decided to present the groups of messages as bubbles (the more posts in a cluster, the bigger the bubble).

Let’s get coding!

In the article, I’ll show only key snippets, sometimes simplifying the code a little bit, but you’ll be able to check out all the code in this GitHub repository.

As usual, the code will be in Java, and I’m going to use my favorite Generative AI framework: LangChain4j. For the large language model, I chose Gemini, and for the embedding model, I’m calculating vectors thanks to Google Cloud Vertex AI embedding models. Clusters of messages will be created with the Apache Commons Math library. The visualization will be implemented in JavaScript with the D3.js library.

Acquiring Bluesky messages via WebSocket

Let’s kick off the project by establishing a real-time connection to the Bluesky firehose using WebSockets, thanks to JDK 11’s HTTP client. This allows us to receive a constant stream of public posts as they happen. The liveMessages() method manages the WebSocket connection and filters incoming messages based on language:

httpClient.newWebSocketBuilder().buildAsync(
        URI.create(JETSTREAM_WS_ENDPOINT),
        new WebSocket.Listener() {
    @Override
    public CompletionStage<?> onText(WebSocket webSocket,
                                    CharSequence data,
                                    boolean last) {
        // ... process incoming message ...
    }
});
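The `last` flag passed to onText() tells the listener whether the current frame completes a message, since the JDK WebSocket API may deliver one text message in several parts. Here is a minimal sketch of how such fragments could be buffered before JSON parsing. The `FrameAccumulator` and `JsonMessageListener` names are hypothetical and are not taken from the project's code base.

```java
import java.net.http.WebSocket;
import java.util.Optional;
import java.util.concurrent.CompletionStage;
import java.util.function.Consumer;

/** Hypothetical helper: buffers partial text frames until a whole message has arrived. */
class FrameAccumulator {
    private final StringBuilder buffer = new StringBuilder();

    /** Returns the complete message once the final fragment has been seen. */
    Optional<String> accept(CharSequence data, boolean last) {
        buffer.append(data);
        if (!last) {
            return Optional.empty();
        }
        String complete = buffer.toString();
        buffer.setLength(0); // reset for the next message
        return Optional.of(complete);
    }
}

/** Hypothetical listener: forwards each complete JSON message to a callback. */
class JsonMessageListener implements WebSocket.Listener {
    private final FrameAccumulator accumulator = new FrameAccumulator();
    private final Consumer<String> onMessage;

    JsonMessageListener(Consumer<String> onMessage) {
        this.onMessage = onMessage;
    }

    @Override
    public CompletionStage<?> onText(WebSocket webSocket, CharSequence data, boolean last) {
        accumulator.accept(data, last).ifPresent(onMessage);
        // The default implementation requests the next frame and returns null.
        return WebSocket.Listener.super.onText(webSocket, data, last);
    }
}
```

An instance of this listener could be passed to buildAsync() in place of the anonymous class, with the callback doing the Gson parsing.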

The onText() method within the WebSocket.Listener is our gateway to the firehose. Each incoming message, received as a JSON string, needs to be parsed into a usable Java object. Here’s where Google’s Gson library and Java records come into play. We’ve defined a set of nested Java records that mirror the Bluesky message structure:

record Message(Commit commit, String did) {
  record Commit(Record record, String cid) {
    record Record(String text, List<String> langs, Date createdAt) {}
  }
}

These records give us a strongly typed way to access message data. The innermost Record record holds the actual post content (text), a list of languages (langs), and the creation timestamp (createdAt); it is nested within the Commit record, which is itself nested within Message. We use Gson to deserialize the JSON strings into these records:

Message message = GSON.fromJson(String.valueOf(text), Message.class);

Calculating vector embeddings for all the messages

To analyze the semantic similarity between posts, we convert each post’s text into a numerical vector representation, or embedding. This is achieved using a Vertex AI embedding model, via LangChain4j’s Vertex AI module:

EmbeddingModel embeddingModel = VertexAiEmbeddingModel.builder()
    .project(System.getenv("GCP_PROJECT_ID"))
    .location(System.getenv("GCP_LOCATION"))
    .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
    .modelName("text-embedding-005")
    .publisher("google")
    .build();

We’re using text-embedding-005 which is a good embedding model and understands multiple spoken languages (which is important for analyzing posts coming from a hundred different spoken languages or so).
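To make "distance between embeddings" concrete, here is a small, dependency-free sketch of two common measures, Euclidean distance and cosine similarity, on plain float arrays. The class and method names are illustrative only, and the vectors in main() are toy values, not real embeddings.

```java
class VectorDistances {
    /** Straight-line distance between two vectors of equal length. */
    static double euclidean(float[] a, float[] b) {
        requireSameLength(a, b);
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Cosine of the angle between two non-zero vectors (1 means same direction). */
    static double cosine(float[] a, float[] b) {
        requireSameLength(a, b);
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    private static void requireSameLength(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
    }

    public static void main(String[] args) {
        float[] first = {0.9f, 0.1f, 0.0f};   // toy vectors, not real embeddings
        float[] second = {0.8f, 0.2f, 0.1f};
        System.out.println("euclidean = " + euclidean(first, second));
        System.out.println("cosine    = " + cosine(first, second));
    }
}
```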

As embedding all messages takes a while, we’re batching the calculation in parallel:

List<TextSegment> allSegments = allMessages.stream()
    .map(message -> TextSegment.from(message.commit().record().text()))
    .toList();

List<Embedding> allEmbeddings =
    IntStream.range(0, numberOfParallelBatches)
        .parallel()
        .mapToObj(i -> embeddingModel.embedAll(allSegments...).content())
        .flatMap(List::stream)
        .toList();
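The batching idea can be sketched independently of any embedding API: split the list into consecutive sublists, process each sublist in parallel, and flatten the results back in their original order. In this sketch, `mapInBatches` is a hypothetical helper and the "embedding" is a stand-in function returning the text length; it needs Java 16 or later for Stream.toList().

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.IntStream;

class ParallelBatches {
    /** Splits items into consecutive batches, maps each batch in parallel, keeps the order. */
    static <T, R> List<R> mapInBatches(List<T> items, int batchSize,
                                       Function<List<T>, List<R>> batchFn) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = (items.size() + batchSize - 1) / batchSize; // ceiling division
        return IntStream.range(0, batches)
            .parallel()
            .mapToObj(i -> items.subList(i * batchSize,
                                         Math.min(items.size(), (i + 1) * batchSize)))
            .map(batchFn)
            .flatMap(List::stream)
            .toList();
    }

    public static void main(String[] args) {
        List<String> posts = List.of("a", "bb", "ccc", "dddd", "eeeee");
        // Stand-in for an embedding call: maps each text to its length.
        List<Integer> lengths =
            mapInBatches(posts, 2, batch -> batch.stream().map(String::length).toList());
        System.out.println(lengths); // prints [1, 2, 3, 4, 5]
    }
}
```

Because the source stream is ordered, the flattened results line up with the input list even though the batches run in parallel.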

Creating clusters of posts

With embeddings in hand, we can now group similar posts together using the DBSCAN clustering algorithm (Density-based spatial clustering of applications with noise):

var clusters = new DBSCANClusterer<ClusterableEmbeddedMessage>(
        MAXIMUM_NEIGHBORHOOD_RADIUS, MINIMUM_POINTS_PER_CLUSTER)
    .cluster(clusterableEmbeddedMessages);

For 10k posts, using a minimum of 10 points per cluster sounds about right. As a rule of thumb, I got good visualizations with one cluster point per 1k messages (i.e. 10 points per cluster for 10k messages, 20 points per cluster for 20k messages).

The maximum neighborhood radius at 0.5 also looked like a good value. I tried smaller and bigger values, but the clusters are either too specific and narrow with low values, or too broad and generalist with higher values.

It’s important to check for yourself the hyperparameters of the algorithms you chose for your use case. Some values might be better than others, and they are very much use-case dependent. There are no magic numbers: you have to experiment to find the right mix for you!

Using a different embedding model (like text-multilingual-embedding-002), reducing the dimensionality to 128 dimensions, I had to use a max neighborhood radius of 0.2 instead, to get a good number of clusters.
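To get a feel for how the two hyperparameters interact, here is a didactic DBSCAN sketch in plain Java, run on toy 2D points. It is not the Apache Commons Math implementation: for instance, this version counts a point as its own neighbor, which may differ from the library's convention. The names `TinyDbscan`, `cluster` and `clusterCount` are made up for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class TinyDbscan {
    static final int UNVISITED = -2;
    static final int NOISE = -1;

    /** Returns one label per point: a cluster index (0, 1, ...) or NOISE. */
    static int[] cluster(double[][] points, double eps, int minPts) {
        int[] labels = new int[points.length];
        Arrays.fill(labels, UNVISITED);
        int nextCluster = 0;
        for (int i = 0; i < points.length; i++) {
            if (labels[i] != UNVISITED) continue;
            List<Integer> seeds = neighbors(points, i, eps);
            if (seeds.size() < minPts) {      // not dense enough to start a cluster
                labels[i] = NOISE;
                continue;
            }
            int id = nextCluster++;
            labels[i] = id;
            Deque<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                int j = queue.poll();
                if (labels[j] == NOISE) labels[j] = id;   // border point joins the cluster
                if (labels[j] != UNVISITED) continue;
                labels[j] = id;
                List<Integer> reachable = neighbors(points, j, eps);
                if (reachable.size() >= minPts) queue.addAll(reachable); // core point: expand
            }
        }
        return labels;
    }

    /** Indices of all points within eps of point i (including i itself). */
    static List<Integer> neighbors(double[][] points, int i, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int j = 0; j < points.length; j++) {
            double sum = 0;
            for (int d = 0; d < points[i].length; d++) {
                double diff = points[i][d] - points[j][d];
                sum += diff * diff;
            }
            if (Math.sqrt(sum) <= eps) result.add(j);
        }
        return result;
    }

    static int clusterCount(int[] labels) {
        return Arrays.stream(labels).max().orElse(NOISE) + 1;
    }

    public static void main(String[] args) {
        double[][] points = {
            {0, 0}, {0.1, 0}, {0.2, 0},   // first dense group
            {5, 5}, {5.1, 5}, {5.2, 5},   // second dense group
            {10, 10}                      // isolated point
        };
        for (double eps : new double[] {0.05, 0.3, 20}) {
            System.out.println("eps=" + eps + " -> clusters: "
                + clusterCount(cluster(points, eps, 3)));
        }
    }
}
```

On this toy data, a tiny radius leaves every point as noise, a moderate radius finds the two groups, and a huge radius merges everything into one cluster, which mirrors the "too narrow" versus "too broad" trade-off described above.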

Generating a description for clusters of messages

At this point, we have topic clusters. But they’re just bags of numbers without a real meaning for us, human beings. What we need is a way to make sense of those clusters, to know what topic they cover.

We configure the Vertex AI Gemini model, thanks to LangChain4j’s Gemini module, with a max number of tokens, to avoid situations where a topic description is too long:

ChatLanguageModel chatModel = VertexAiGeminiChatModel.builder()
    .project(System.getenv("GCP_PROJECT_ID"))
    .location(System.getenv("GCP_LOCATION"))
//  .modelName("gemini-2.0-flash-exp")
    .modelName("gemini-1.5-flash-002")
    .maxOutputTokens(25)
    .build();

You can use either the latest Gemini 1.5 Flash, or the new 2.0 Flash experimental model. If you’re hitting quota limits, as 2.0 is currently only in preview, 1.5 will give great results too.

To make the clusters more understandable, we call Gemini to generate a concise summary for each cluster, passing all the messages contained in that cluster:

Response<AiMessage> modelResponse = chatModel.generate(
    SystemMessage.from("""
        Summarize the following list of social media messages in one
        simple description. Don't give a full sentence saying the
        social messages are about a topic, just give the topic
        directly in 10 words or less, without mentioning the
        messages are social media posts or reactions.
        """),
    UserMessage.from(appendedMessages)
);

When I was running this code on January 1st, I was seeing topics like “New Year's greetings and well wishes” or “Happy New Year 2025 wishes and hopeful sentiments for the year”. But some categories of topics often come back, like a big cluster of emojis conveying various expressions, or people sharing video links on YouTube, or pictures from Instagram. I also saw some interesting trends as they came up, like weather alerts for snow storms, or someone famous receiving congratulations for announcing some anniversary. There are also repeated posts tagging people to request funding for some cause. Funnily, in the morning, I was often seeing people sharing in how many steps they solved the Wordle word puzzle!

I filtered the messages to analyze only English messages for the purpose of this demo, but a bunch of users set their language to English while posting in another language. However, that’s not really a problem for Gemini, which happily handles more than a hundred spoken languages.
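As an illustration of that filtering step, here is a minimal sketch that keeps only posts whose declared languages include a given code. The flattened `Post` record is a simplification invented for this example (the article's records nest text and langs deeper), and the check relies on the declared language only, which, as noted above, may not match the language actually used.

```java
import java.util.List;

class LanguageFilter {
    /** Hypothetical flattened view of a post: just the text and declared languages. */
    record Post(String text, List<String> langs) {}

    /** True when the post has some text and declares the given language code. */
    static boolean isDeclaredIn(Post post, String lang) {
        return post != null
            && post.text() != null && !post.text().isBlank()
            && post.langs() != null && post.langs().contains(lang);
    }

    static List<Post> keepLanguage(List<Post> posts, String lang) {
        return posts.stream().filter(p -> isDeclaredIn(p, lang)).toList();
    }

    public static void main(String[] args) {
        List<Post> posts = List.of(
            new Post("Happy new year!", List.of("en")),
            new Post("Bonne année !", List.of("fr")),
            new Post("No language declared", null));
        System.out.println(keepLanguage(posts, "en").size()); // prints 1
    }
}
```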

Preparing the data for visualization

The cluster summaries and their sizes (number of posts) are then formatted as JSON data, for ingestion by D3.js:

const data = {
  name: "Bluesky topic clusters",
  children: [
    {name: "Summary of Cluster 1", value: 396},
    // ... other clusters
  ]
};

This JSON structure is ideal for consumption by D3.js, which we’ll use for visualization. The FirehoseConsumer class writes this JSON data to the newdata.js file, which is integrated in the static web assets and loaded by D3.
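A sketch of how such a file body could be produced from cluster summaries and sizes is shown below. This is not the project's FirehoseConsumer code: `ClusterDataWriter`, `toJs` and `escape` are hypothetical names, and the escaping only covers backslashes, double quotes and line breaks.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class ClusterDataWriter {
    /** Escapes backslashes, double quotes and line breaks for a JavaScript string literal. */
    static String escape(String s) {
        return s.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    /** Renders the cluster summaries and sizes as a `const data = {...};` declaration. */
    static String toJs(Map<String, Integer> clusterSizes) {
        String children = clusterSizes.entrySet().stream()
            .map(e -> "    {name: \"" + escape(e.getKey()) + "\", value: " + e.getValue() + "}")
            .collect(Collectors.joining(",\n"));
        return "const data = {\n"
            + "  name: \"Bluesky topic clusters\",\n"
            + "  children: [\n"
            + children + "\n"
            + "  ]\n"
            + "};\n";
    }

    public static void main(String[] args) {
        Map<String, Integer> clusters = new LinkedHashMap<>(); // keeps insertion order
        clusters.put("New Year's greetings and well wishes", 396);
        clusters.put("Wordle results", 42); // made-up size, for illustration only
        System.out.print(toJs(clusters));
    }
}
```

The resulting string could then be written to newdata.js, for example with Files.writeString().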

Visualizing the data with D3.js

Finally, the visualisation.js script uses D3.js to create an interactive bubble chart. Each bubble represents a cluster, with its surface area corresponding to the number of posts in that cluster. The color of the circles is also dynamically generated:

const colorScale = d3.scaleQuantize()
    .domain([0, maxValue])
    .range(colorPalette);
    //.. later, in the circle
    .attr("fill",
          d => d.children == null ? colorScale(d.r) : "#fefef0")

What’s more interesting in this part of the project is how the visualization is created. I was inspired by the circle packing visualization seen in this article, which uses D3.js’s circle packing layout method. I borrowed heavily from this example, and tweaked it for my needs, and to my liking.

const pack = d3.pack()
    .size([width - margin * 2, height - margin * 2])
    .padding(4);

const root = pack(d3.hierarchy(data)
    .sum(d => d.value)
    .sort((a, b) => b.value - a.value));

The tricky part, as well, was how to render and lay out the text of the topics, along with the number of posts per cluster, inside each circle. I got it working by appending a custom div, as a foreign object in the SVG document, and by tweaking the positioning:

node.filter(d => !d.children)
    .append("foreignObject")
    .attr("x", d => -0.8 * d.r) // center horizontally
    .attr("y", d => -1.1*d.r) // center vertically, manually adjusted
    .attr("width", d => 1.6 * d.r)
    .attr("height", d => 2 * d.r)
    .append("xhtml:div")
    .classed("foreignDiv", true)
    .style("font-size", d => d.r / 5.3 + "px") // dynamic font sizing
    .html(d =>
        "<span style='font-size: " + (d.r / 2.5) + "px; color: "
        + d3.color(colorScale(d.r)).darker(1) + ";'>"
        + format(d.value)
        + "</span>"
        + d.data.name
        + "<br/>"
    );

Lots of hard-coded values to make it look nice!

To put everything together: an HTML file imports D3.js, our newdata.js file containing the cluster definitions, and the visualization.js file that creates the bubble chart, plus some CSS in styles.css. And when running the Java class, the newdata.js file is generated and updated in the static web asset folder.

Experiments, and what else to explore

No live demo available

Interesting topic visualizations happen when you have collected enough messages to analyze. Gathering about 10 thousand posts seemed to offer good results, but in spite of the 10+ million users on Bluesky, you still need about 4 or 5 minutes to collect that many messages. That’s not to mention the time it takes to calculate the embeddings (about 30 seconds in parallel), and to run the clustering algorithm (about a minute and a half, with a runtime complexity of n*log(n)). So this is not ideal for a real-time analysis of the current trending topics. That’s why I haven’t posted a demo application online: it takes too long to wait for the result to appear on screen.

What might be interesting to explore is a live-updating view that would be recalculated every couple of minutes or so, over a sliding window of messages, but the clustering duration is still a problem. It’s also something that could quickly become costly, considering the number of embedding calculations and generative summaries to generate each time.
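The sliding-window part of that idea could look like the following sketch: a time-based buffer that drops anything older than a fixed span whenever a new item arrives. This is only an exploration of the idea, not code from the project; the `SlidingWindow` class is hypothetical and assumes items arrive with non-decreasing timestamps.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class SlidingWindow<T> {
    private record Timed<V>(Instant at, V value) {}

    private final Deque<Timed<T>> items = new ArrayDeque<>();
    private final Duration span;

    SlidingWindow(Duration span) {
        this.span = span;
    }

    /** Adds an item, then evicts everything older than the window span. */
    void add(Instant at, T value) {
        items.addLast(new Timed<>(at, value));
        Instant cutoff = at.minus(span);
        while (!items.isEmpty() && items.peekFirst().at().isBefore(cutoff)) {
            items.removeFirst();
        }
    }

    /** Current content of the window, oldest first. */
    List<T> snapshot() {
        return items.stream().map(Timed::value).toList();
    }

    public static void main(String[] args) {
        SlidingWindow<String> window = new SlidingWindow<>(Duration.ofMinutes(2));
        Instant start = Instant.parse("2025-01-01T00:00:00Z");
        window.add(start, "first post");
        window.add(start.plusSeconds(60), "second post");
        window.add(start.plusSeconds(150), "third post");
        System.out.println(window.snapshot()); // prints [second post, third post]
    }
}
```

Each snapshot would still need to be embedded, clustered and summarized, so the cost and latency concerns mentioned above remain.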

Different embedding models

Before parallelizing / batching the vector embedding calculations (which still take half a minute), I also tried a non-cloud hosted embedding model, like a quantized version of the all-MiniLM-L6-v2 embedding model, which can run locally without a big GPU. I used it in some other projects with success, but for this clustering exercise, I found the result of poor quality, as if it wasn’t knowledgeable enough to discern different topics.

I paid attention to restricting the messages to only English messages, as I knew that that small model was more at ease with English, but that didn’t really help. Ideally, I’d like to find a fast embedding model with good classification capabilities. But read on, for another idea on speeding up the clustering part of the equation.

Different clustering algorithms

DBSCAN isn’t super fast, with a n*log(n) runtime complexity.

Apache Commons Math also offers a KMeans++ implementation that is faster (with a more linear runtime), but the k hyperparameter to specify always gives a fixed number of clusters. On one hand, it’s nice to have a more predictable visualization (neither too few, nor too many bubbles with small text to display). On the other hand, the fact that the number of clusters is set in stone leads the clusters to be too generic and too broad, and there’s always one cluster that contains everything that couldn’t be clustered in meaningful groups.

In spite of its runtime complexity, I like DBSCAN for the fact that it creates quite diverse but accurate clusters, as it figures out by itself how many clusters to create, depending on the various topics it comes across.

There’s another library that I’d like to try some day, that’s Smile. It supports even more clustering algorithms than Apache Commons Math.

Another interesting aspect of Smile is its dimensionality reduction algorithms (which they call manifold learning), like t-SNE and UMAP.

Why am I mentioning dimensionality reduction? For one, it’s super handy for visualizing the clusters in 2D or 3D. But another idea I wanted to try was that if the reduction is fast enough, maybe applying the clustering algorithm on lower-dimensioned data would be much faster. The projection (reducing the dimensionality) before clustering approach is also the one this project from HuggingFace followed to cluster the Cosmopedia dataset.

Indeed, Vertex AI embeddings generate vectors of 768 dimensions. That said, some of the Vertex AI embeddings are Matryoshka embeddings, so we could also calculate clusters on truncated vectors, without losing too much accuracy, and without even doing dimensionality reduction! Both text-embedding-005 and text-multilingual-embedding-002 support reducing the vector dimension, so it’s worth trying. You just need to set outputDimensionality(128) on the embedding model builder to reduce the dimensions down to 128. Then the clustering time can go down to 15 seconds instead of 80 seconds with full 768-dimension vectors.
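For intuition, truncating a Matryoshka-style embedding simply means keeping its first dimensions. The sketch below does this on the client side and rescales the result to unit length, which is a common follow-up step when distances are compared afterwards. It is an illustration under those assumptions, with a hypothetical `truncateAndNormalize` helper, and says nothing about what the Vertex AI service does internally when outputDimensionality is set.

```java
class TruncateEmbedding {
    /** Keeps the first `dims` components and rescales the result to unit length. */
    static float[] truncateAndNormalize(float[] vector, int dims) {
        if (dims <= 0 || dims > vector.length) {
            throw new IllegalArgumentException("dims must be between 1 and " + vector.length);
        }
        double squares = 0;
        for (int i = 0; i < dims; i++) {
            squares += (double) vector[i] * vector[i];
        }
        double norm = Math.sqrt(squares);
        float[] result = new float[dims];
        for (int i = 0; i < dims; i++) {
            // An all-zero prefix stays all-zero instead of dividing by zero.
            result[i] = norm == 0 ? 0f : (float) (vector[i] / norm);
        }
        return result;
    }

    public static void main(String[] args) {
        float[] full = {3f, 4f, 12f};              // toy vector, not a real embedding
        float[] shorter = truncateAndNormalize(full, 2);
        System.out.println(shorter[0] + ", " + shorter[1]); // prints 0.6, 0.8
    }
}
```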

What else to try?
  • In this experiment, I analyzed text, but users post hashtags, pictures, links, on their profiles. It might be interesting to look at what is trending in terms of hashtags, or analyze the sentiment of messages related to such a hashtag.
  • Looking at links, maybe it’d be interesting to also see what is shared, which news article is more popular…
  • Regarding pictures, we could perhaps see which animals are more trendy? And do some fun analysis of favorite animals in different countries…
  • Another interesting analysis could be to cluster user profiles, to find users posting on the same topics.
  • I’d like to think more about how to make this application more lively, and let users explore individual posts contained in each cluster.

Many more things to try out and explore!

Summary

The generated visualization offers an intuitive and engaging way to explore the trending topics on Bluesky. And generative AI tools like Gemini and Vertex AI are here to help create such data explorations.

This project combines the power of real-time data streaming, AI-driven analysis, and (not-yet-interactive) visualization to provide a valuable tool for understanding the ever-evolving conversations on Bluesky. It sets the stage for more sophisticated analysis, such as tracking topic evolution over time, sentiment analysis within clusters, and identification of key influencers within specific discussions.

As always, this project also confirmed that Java and LangChain4j are my two best buddies to explore topics with generative AI approaches (no need for Python!) And I was happy to use D3.js again for visualization purposes. It’s not easy to master, but it’s a super powerful library! I’m also glad that Gemini Code Assist helped me work with D3.js, to develop and enhance the visualization.

Finally, of course, the Gemini chat model and Vertex AI embedding model were perfect for the task, giving high quality embedding vectors, and clear synthetic summaries of social media posts.

Don’t hesitate to check out the code and play with this project!

https://glaforge.dev/posts/2025/01/06/analyzing-trends-and-topics-from-blueskys-firehose-with-generative-ai/
Let's think with Gemini Flash 2.0's experimental thinking mode and LangChain4j

Yesterday, Google released yet another cool Gemini model update, with Gemini 2.0 Flash thinking mode. Integrating natively and transparently some chain of thought techniques, the model is able to take some more thinking time, and automatically decomposes a complex task into smaller steps, and explores various paths in its thinking process. Thanks to this approach, Gemini 2.0 Flash Thinking is able to solve more complex problems than Gemini 1.5 Pro or the recent Gemini 2.0 Flash experimental model.

And of course, you can already use Gemini 2.0 Flash Thinking with LangChain4j!

So let’s delve into some examples together!

Solving a riddle

I found some riddles online and thought I would try one with both the new Gemini 2.0 Flash model, and the new thinking flavor as well, in order to contrast their way of working, and the outcome.

var model = GoogleAiGeminiChatModel.builder()
        .apiKey(System.getenv("GOOGLE_AI_GEMINI_API_KEY"))
        .modelName("gemini-2.0-flash-exp")
        .build();

String response = model.generate("""
        The day before yesterday I was 21, and next year I will be 24.
        When is my birthday?
        """);

Gemini replies with:

This is a classic riddle! Here's how to solve it:

* **"The day before yesterday I was 21"** means yesterday you were 22.
* **"Next year I will be 24"** means you are currently 23.
* If you were 22 yesterday and 23 today, your birthday must be **today**.

Therefore, your birthday is **today**.

The answer is good (it’s today), but it doesn’t tell us exactly which date the birthday falls on.

Now let’s update the model name to use the new thinking mode:

var model = GoogleAiGeminiChatModel.builder()
        .apiKey(System.getenv("GOOGLE_AI_GEMINI_API_KEY"))
        .modelName("gemini-2.0-flash-thinking-exp-1219")
        .build();

String response = model.generate("""
        The day before yesterday I was 21, and next year I will be 24.
        When is my birthday?
        """);

Gemini’s advanced thinking process shows up:

Here's a thinking process to solve this riddle:

1. **Analyze the core statements:**
    * "The day before yesterday I was 21."
    * "Next year I will be 24."

2. **Break down the time references:**
Identify the different points in time being referenced.
    * "The day before yesterday" refers to a past date.
    * "Yesterday" refers to the day after "the day before yesterday".
    * "Today" is the present day.
    * "Tomorrow" is the day after today.
    * "Next year" refers to the year after the current year.

3. **Assign ages to the time references:**  Based on the statements,
assign ages to the person at these different time points.
    * The day before yesterday: 21 years old.
    * Yesterday: Since the person's age changes by one each birthday,
     yesterday they were either still 21 or had just turned 22.
    * Today: Continuing the progression, today they are either 22
      or 23.
    * Next year: 24 years old.

4. **Focus on the age jump:**  The key is the jump in age from
"today" to "next year." To go from being 22 or 23 today to being 24
next year *definitively* means their birthday must fall *between*
now and next year.

5. **Work backward from "next year":** If they will be 24 next year,
they must turn 24 sometime within that year. This means they are
currently 23 *or* they are still 23 and about to turn 24 soon.

6. **Work forward from "the day before yesterday":**
If they were 21 the day before yesterday, then yesterday they
turned 22. This means today they are 22.

7. **Reconcile the information:** We have two possibilities for the
current age: 22 (from the first statement) and potentially 23
(considering the jump to 24 next year). The only way both statements
can be true is if the birthday is *very* soon.

8. **Consider the implications of the age progression:**
    * If today they are 22, and next year they will be 24,
      the birthday must be *before* the end of the current year.
    * Since they were 21 the day before yesterday, they must have
      turned 22 *yesterday*.

9. **Determine the date:**  If they turned 22 yesterday,
and the statement refers to "next year I will be 24," this can only
happen if today is the *last day* of the year, and their birthday
is *today*.

10. **Verify the solution:** Let's check if this works:
    * If today is December 31st:
        * The day before yesterday (December 29th) they were 21.
        * Yesterday (December 30th) they turned 22.
        * Today (December 31st) they are 22.
        * Next year (after December 31st),
          they will turn 23 and then 24. This works!

11. **Formalize the answer:** The birthday is on December 31st.

This is a classic riddle that plays on how we perceive time.
Here's how to solve it:

* **"The day before yesterday I was 21"**:
  This means yesterday you were 22.
* **"Next year I will be 24"**: This means you are currently 23.

For both of these statements to be true,
today must be **January 1st**.

Let's break it down:

* **January 1st:** You are 23.
* **December 31st (yesterday):** You were 22.
* **December 30th (the day before yesterday):** You were 21.
* **Sometime later this year (your birthday):** You will turn 24.

Therefore, your birthday is on **December 31st**.

Now, thanks to this advanced reasoning approach, it’s able to tell me the exact date of the birthday.
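The final answer can be double-checked with a small brute-force search using java.time. The sketch below tries every possible "today" in a given year and every plausible birth date, keeping the pairs where the age two days earlier is 21 and the age at the end of the following year is 24. Reading "next year I will be 24" as "my age on December 31st of next year" is an assumption of this sketch, and the `BirthdayRiddle` class is made up for the example.

```java
import java.time.LocalDate;
import java.time.Period;
import java.util.ArrayList;
import java.util.List;

class BirthdayRiddle {
    /** Age in completed years on the given date. */
    static int age(LocalDate birth, LocalDate on) {
        return Period.between(birth, on).getYears();
    }

    /** Tries every "today" in the given year and every plausible birth date. */
    static List<String> solve(int year) {
        List<String> solutions = new ArrayList<>();
        LocalDate endOfNextYear = LocalDate.of(year + 1, 12, 31);
        for (LocalDate today = LocalDate.of(year, 1, 1);
             today.getYear() == year;
             today = today.plusDays(1)) {
            for (LocalDate birth = LocalDate.of(year - 26, 1, 1);
                 birth.getYear() < year - 19;
                 birth = birth.plusDays(1)) {
                boolean was21 = age(birth, today.minusDays(2)) == 21;
                boolean willBe24 = age(birth, endOfNextYear) == 24;
                if (was21 && willBe24) {
                    solutions.add("today=" + today + " birthday="
                        + birth.getMonth() + " " + birth.getDayOfMonth());
                }
            }
        }
        return solutions;
    }

    public static void main(String[] args) {
        // prints [today=2024-01-01 birthday=DECEMBER 31]
        System.out.println(solve(2024));
    }
}
```

The search finds a single combination: the statement is made on January 1st by someone born on December 31st, which matches the thinking model's final answer.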

I’ve also tried the model’s reasoning capabilities on coding problems (puzzles like Advent of Code ones), and it’s really doing a great job at solving most of them!

Multimodal reasoning

Remember that Gemini is a multimodal model? That means it can accept other modalities like images, in addition to text.

So let’s have a look at another example, with a word game from the Reader’s Digest. Only one of the rows is an anagram of a five-letter English word…

This time, it’s a word game, but the information is inside an image. So we need to pass both the image, and the problem statement.

var model = GoogleAiGeminiChatModel.builder()
        .apiKey(System.getenv("GOOGLE_AI_GEMINI_API_KEY"))
        .modelName("gemini-2.0-flash-thinking-exp-1219")
        .build();

Response<AiMessage> response = model.generate(UserMessage.from(
        ImageContent.from(
            new String(Base64.getEncoder().encode(readBytes(FILE_URI))),
            "image/png"),
        TextContent.from("""
            Only one of the rows in the picture is an anagram of a
            five-letter English word. Do you know which one it is?
            """)
));

If I give that puzzle to Gemini 2.0 Flash (by changing the model name used above), it’ll tell me (quickly) that:

Certainly!
The anagram in the image is **O L C R E**,
which unscrambled is **COLOR**.

Gemini 2.0 Flash is at the very top of the LLM leaderboard but it gets the answer wrong. Notice that COLOR is not an anagram of OLCRE at all.
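Checking an anagram claim is mechanical: two strings are anagrams when their letters, once sorted, are identical. Here is a small sketch with a hypothetical `isAnagram` helper that ignores spaces and letter case.

```java
import java.util.Arrays;
import java.util.Locale;

class AnagramCheck {
    /** True when both strings contain exactly the same letters, ignoring spaces and case. */
    static boolean isAnagram(String first, String second) {
        return Arrays.equals(sortedLetters(first), sortedLetters(second));
    }

    private static char[] sortedLetters(String text) {
        char[] letters = text.replace(" ", "").toUpperCase(Locale.ROOT).toCharArray();
        Arrays.sort(letters);
        return letters;
    }

    public static void main(String[] args) {
        System.out.println(isAnagram("O L C R E", "COLOR")); // prints false
        System.out.println(isAnagram("L U D I B", "BUILD")); // prints true
    }
}
```

COLOR needs two O's while the row O L C R E only has one, so the check fails for that pair.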

Now let’s see what the thinking model says:

The user wants me to find the row that is an anagram of a five-letter
English word. I need to go through each row and see if I can
rearrange the letters to form a valid five-letter word.

Row 1: L E G T N - Could be GENTL but need 5 letters. Could be GLENT,
but is not a common word.
Row 2: O L C R E - Could be CLORE, COLER, RECOL. RECOL is the most
promising one.
Row 3: T C O H E - Could be ECHOT.
Row 4: L U D I B - Could be BUILD.
Row 5: M E U D Q - No obvious combination.
Row 6: H I N W L - Could be WHILN.
Row 7: W E S O L - Could be LOWES, OWELS, SOWEL, WOLES.

From the analysis, RECOL seems to be a valid word and has 5 letters.
Let me double check other possibilities.
Row 1: L E G T N -  TENGLE? No.
Row 3: T C O H E -  Could be CHETO, but not a common word.
Row 4: L U D I B -  Could be BLUID, but not a common word.
Row 6: H I N W L -  Could be WHILE.

Out of the possible options, WHILE and RECOL are valid five-letter
words. Let me re-examine the instruction "Only one of the rows in
the picture is an anagram of a five-letter English word."

Row 1: L E G T N - No obvious five-letter word.
Row 2: O L C R E - CLORE, COLER, RECOL
Row 3: T C O H E - ECHOT
Row 4: L U D I B - BUILD
Row 5: M E U D Q - No obvious five-letter word.
Row 6: H I N W L - WHILE
Row 7: W E S O L - LOWES, OWELS, SOWEL, WOLES

There seems to be more than one. Let me re-check the letter counts.
 All rows have 5 letters.

Rechecking my anagram findings:
Row 2: O L C R E -  Could form RECOL
Row 3: T C O H E - Could form ECHOT
Row 4: L U D I B - Could form BUILD
Row 6: H I N W L - Could form WHILE
Row 7: W E S O L - Could form LOWES

Let me think again. The instruction says *only one*. I might be making
a mistake in identifying a valid word.

Re-examining:
Row 1: L E G T N - No common five-letter word.
Row 2: O L C R E - RECOL
Row 3: T C O H E - ECHOT
Row 4: L U D I B - BUILD
Row 5: M E U D Q - No common five-letter word.
Row 6: H I N W L - WHILE
Row 7: W E S O L - LOWES

Perhaps some of these are not valid English words.
Let me verify the words.
RECOL -  Yes, it's a valid word.
ECHOT - No, ECHO is a word, ECHOT is not.
BUILD - Yes, it's a valid word.
WHILE - Yes, it's a valid word.
LOWES - Yes, it's a valid word.

I am still finding more than one valid word. Let me go through each
row very carefully again.

Row 1: L E G T N - No obvious 5-letter word
Row 2: O L C R E - Could be CLORE, COLER, RECOL.
Row 3: T C O H E - Could be CHOTE. No such word.
Row 4: L U D I B - Could be BUILD.
Row 5: M E U D Q - No obvious 5-letter word.
Row 6: H I N W L - Could be WHILE.
Row 7: W E S O L - Could be LOWES.

Let me verify the validity of each potential word.
RECOL - Valid English word.
BUILD - Valid English word.
WHILE - Valid English word.
LOWES - Valid English word (name of a store).

I am still struggling to find only one. Perhaps I am overlooking
something simple. Let me reread the question carefully.
"Only one of the rows in the picture is an anagram of a five-letter
English word."

Let me try rearranging the letters systematically for each row.

Row 1: L E G T N -  No five letter word.
Row 2: O L C R E -  RECOL
Row 3: T C O H E - No five letter word.
Row 4: L U D I B -  BUILD
Row 5: M E U D Q - No five letter word.
Row 6: H I N W L -  WHILE
Row 7: W E S O L -  LOWES

I am still identifying multiple valid words. Could there be a
constraint I am missing? The hints are just crops, so they don't
add new information.

Let me consider uncommon words.

Row 2: O L C R E -  Could be ORCEL.
Row 3: T C O H E -
Row 7: W E S O L - Could be OLWES.

I am still finding multiple possibilities. The question states
*only one*. This implies that my initial assessment of multiple
possibilities is incorrect. Let me re-examine each row with more
scrutiny for common five-letter words.

Row 1: L E G T N - No.
Row 2: O L C R E - RECOL.
Row 3: T C O H E - No.
Row 4: L U D I B - BUILD.
Row 5: M E U D Q - No.
Row 6: H I N W L - WHILE.
Row 7: W E S O L - LOWES.

I have identified four potential candidates. There must be something
I am not considering. Is there a specific type of five-letter word?
No constraint mentioned.

Let me check my understanding of anagram. An anagram is a
rearrangement of letters to form a new word.

Perhaps some of the words I identified are not standard English words.

RECOL - Yes, it is.
BUILD - Yes, it is.
WHILE - Yes, it is.
LOWES - Yes, it is.

I am still at a loss. The instruction is very clear: "Only one of the
rows... is an anagram of a five-letter English word."

Let me try a different approach. Assume only one is correct
and try to disprove the others.

If Row 2 (O L C R E - RECOL) is the answer, then none of the others
should form a valid five-letter word.
Row 4 (L U D I B) - BUILD is a valid word. This contradicts the
"only one" rule.

If Row 4 (L U D I B - BUILD) is the answer, then none of the others
should form a valid five-letter word.
Row 2 (O L C R E) - RECOL is a valid word. This contradicts the
"only one" rule.

If Row 6 (H I N W L - WHILE) is the answer, then none of the others
should form a valid five-letter word.
Row 2 (O L C R E) - RECOL is a valid word. This contradicts the
"only one" rule.

If Row 7 (W E S O L - LOWES) is the answer, then none of the others
should form a valid five-letter word.
Row 2 (O L C R E) - RECOL is a valid word. This contradicts the
"only one" rule.

There must be something very obvious I am missing.

Let me read the rows aloud and see if something jumps out.
L E G T N
O L C R E
T C O H E
L U D I B
M E U D Q
H I N W L
W E S O L

Is it possible that some of the letter combinations seem like a word
but aren't actually?

Let's focus on making sure each identified word is definitively
correct.

RECOL - Define: To collect again. Yes, it's a valid word.
BUILD - Define: To construct. Yes, it's a valid word.
WHILE - Define: A period of time. Yes, it's a valid word.
LOWES -  While primarily known as a store name, it can be
a plural noun.
However, the question is about anagrams of five-letter words.

Let's go back to each row and try to form words methodically.

Row 1: L E G T N -  Tingle, Longe, Glent. None are common five-letter
words.
Row 2: O L C R E -  Clore, Coler, Lorec, Recol. RECOL is a valid word.
Row 3: T C O H E -  Chote, Cothe, EchoT. None are common five-letter
words.
Row 4: L U D I B -  Bluid, Build. BUILD is a valid word.
Row 5: M E U D Q - No valid word.
Row 6: H I N W L -  Whiln, While. WHILE is a valid word.
Row 7: W E S O L -  Lowes, Owels, Sowle, Woles. LOWES is a valid word.

The problem states *only one*. This is the key.
There must be something about one of these words that makes it
the correct answer over the others.

Could it be about the commonality of the word?  While all are valid,
some are more frequently used.

RECOL - Less common.
BUILD - Very common.
WHILE - Very common.
LOWES - Common (as a store name).

Could there be a nuance in the definition or usage?

Let me consider if the capitalization or the "RD" at the bottom has
any relevance (unlikely for an anagram puzzle).

The crops are there to help see better, so the visual aspect
is important.

Let me look at the letters themselves and see if there's a pattern
or something I've missed.

Maybe one of the identified words isn't as standard as the others
in a pure linguistic sense.

Consider "LOWES". It's primarily a proper noun. Is the puzzle looking
for common nouns or verbs?  The question says "a five-letter English
word," which is quite broad.

Let me revisit the less common words I dismissed earlier.

Row 1: L E G T N - GLENT (a small secluded valley).
Row 3: T C O H E -  None.
Row 6: H I N W L -  WHILN (the pod or husk of seeds).

If GLENT is considered, we have another possibility.
If WHILN is considered, another possibility.

The constraint "only one" is crucial.

Let's assume, for a moment, that the answer is indeed one of the ones
I identified and I need to find a reason to exclude the others.

Could the structure of the letter arrangement provide a clue?
No obvious pattern.

Let me focus on the most common five-letter words from my list:
BUILD and WHILE. If one of these is the answer, why isn't the other?

Perhaps there's a very simple word I'm overlooking in one of the rows.

Row 3: T C O H E -  THROE.

THROE is a valid five-letter word.

Now my list is: RECOL, BUILD, WHILE, LOWES, THROE.
Still more than one.

I am stuck on the "only one" constraint. What could make one row
uniquely solvable?

Could it be about pronunciation or homophones?
Unlikely for an anagram.

Let me go back to the instructions and the image.
There are no additional instructions.

Is it possible there's a very obscure five-letter English word?

Let's review each row and try all possible combinations.
This will be tedious, but the "only one" constraint suggests
a unique solution.

Row 1: L E G T N -  ELTNG, ELNTG, ETLNG, ETNLG, ENLTG, ENTLG...
This is not efficient.

Let me rely on my vocabulary for common words first.

Consider the possibility that one of the words I identified isn't
a standard English word in the context of this puzzle.

Could "LOWES" as a store name not count? But the letters
can form the word.

Let's try working backward from the assumption that only one row
is correct. If I pick a row, can I definitively say the others
don't form any five-letter words?

If Row 2 (O L C R E) is the answer (RECOL), then the others cannot
form a valid five-letter word. But Row 4 (BUILD), Row 6 (WHILE),
and Row 7 (LOWES) do.

There must be a constraint I am not understanding
or a word I am misinterpreting.

Could it be about the frequency of letter usage in the
English language? Unlikely.

Let me consider common knowledge or general vocabulary.

Perhaps one of the letter sets can *only* form one five-letter word,
while others can form multiple.

Row 2: O L C R E - RECOL, CLORE, COLER.
Row 4: L U D I B - BUILD. Only one common word.
Row 6: H I N W L - WHILE. Only one common word.
Row 7: W E S O L - LOWES, OWELS, SOWLE, WOLES.

This is a potential lead. Row 4 and Row 6 seem to have a single,
very common five-letter word.

Let me double-check Row 4 (L U D I B). Are there other common
five-letter words?  BLUID is not common.

Let me double-check Row 6 (H I N W L). Are there other common
five-letter words? WHILN is less common.

The wording "anagram of a five-letter English word" is precise.

Could it be that some of the words I found are plurals or other forms
that don't fit the strict definition?  No, RECOL, BUILD, WHILE
are base forms.

Let me consider the visual arrangement again.
Is there any hidden message or pattern?  The lines separate the rows.

The "RD" in the corner is likely just a watermark.

The color is uniform.

Back to the anagrams. The "only one" is the key.

If I had to pick one, which row is *unambiguously*
a single five-letter English word?

Row 4: L U D I B - BUILD. It's hard to make another common
five-letter word from these letters.

Let's confirm this. Consider permutations.
BLUID is the only other plausible one, and it's not common.

Could the answer be BUILD based on this uniqueness?

Let me test this hypothesis against the other rows.

Row 2: O L C R E - RECOL, CLORE, COLER. Multiple options.
Row 6: H I N W L - WHILE. WHILN is another possibility,
though less common.
Row 7: W E S O L - LOWES, OWELS, etc. Multiple options.

This line of reasoning seems promising.
The "only one" could refer to the row that forms *only one*
common five-letter English word.

If this is the case, Row 4 (L U D I B - BUILD)
fits this criterion best.


Solution: The row that is an anagram of a five-letter English word
is **LUDIB**, which can be rearranged to form the word **BUILD**.

While some other rows can form five-letter words (such as RECOL
from OLCRE, WHILE from HINWL, and LOWES from WESOL), the question
states that "only one of the rows... is an anagram of a five-letter
English word."  Considering the commonality and unambiguous nature
of the resulting word, "BUILD" from "LUDIB" stands out as the most
likely single correct answer. Other rows could potentially form
multiple valid five-letter words, making "LUDIB" unique in its most
obvious and common anagram.

That’s a pretty long answer, but it’s correct!

The response concatenates the long thinking process and, at the end, the simpler answer deduced from it. Once a version of the API endpoint clearly distinguishes the thinking and answer parts, an update to the LangChain4j framework will make it configurable whether to return the thinking steps, in case you only want the final answer.

Thinking about it…

It's very interesting to see the thinking process the model follows when reasoning about the problem. Compared to a certain competing model that hides its thoughts, Gemini shares everything it went through. And it does it faster!

You would likely not use a reasoning model for each and every question you could ask an LLM, obviously. However, for solving advanced problems that require a deeper thought process, this is definitely the way to go!

And of course, I’m happy that LangChain4j lets me play with this new model out of the box! If you don’t feel like coding right away in Java, you can also play with the model in Google AI Studio.

https://glaforge.dev/posts/2024/12/20/lets-think-with-gemini-2-thinking-mode-and-langchain4j/
Detecting objects with Gemini 2.0 and LangChain4j

Hot on the heels of the announcement of Gemini 2.0, I played with the new experimental model both from within Google AI Studio, and with LangChain4j.

Google released Gemini 2.0 Flash, with new modalities, including interleaving images, audio, text, video, both in input and output. Even a live bidirectional speech-to-speech mode, which is really exciting!

When experimenting with AI Studio, what attracted my attention was AI Studio’s new starter apps section. There are 3 examples (including links to Github projects showing how they were implemented):

  • spatial understanding — get Gemini to recognize objects in pictures, and give you bounding boxes for those objects
  • video analyzer — to summarize, describe scenes, extract texts and objects from videos
  • map explorer — an integration with Google Maps to explore the world

The first one, on detecting objects, reminded me of an old demo of mine I had developed with Gemini 1.0 Pro Vision to recognise the cards of the Skyjo card game (a fun little card game I’ve been playing a lot with my youngest daughter):

If you look at the screenshot above, you’ll see some prompt suggestions to get bounding boxes around detected objects. You’ll notice that the model seems pretty capable at recognising the numbers on those cards. And with a bit of prompt engineering, it ignores cards facing down (attributing a value of 0 to those cards). In the end, you can sum up all the points, and have the current score for your cards.

Back in the day, Gemini 1.0 was making quite a few mistakes when detecting and recognising the values of the cards, in particular when the cards were tilted, or upside down. But Gemini 2.0 Flash has greatly improved, and is much more capable.

So I decided to see:

  • if LangChain4j works well with Gemini 2.0 Flash,
  • and if I can craft a prompt that detects my cards flawlessly.

And I’m glad to report that for all the photos I had taken of my games (14 pictures), I reached a 100% recognition rate. Of course, LangChain4j is happy to call Gemini 2 without a problem (although we’ll have to update the framework with the new modalities when a Java SDK is made available).

Let’s code!

I’ll skip some of the boilerplate code to iterate over all my test pictures, properly labeled with the card values. But you can have a look at this gist with all the code.

First, let’s create some Java records to represent the cards, their bounding box, and number labels:

record Card(
    int label,
    BoundingBox boundingBox
) {
    record BoundingBox(int x1, int y1, int x2, int y2) {}
}

We’ll use GSON for marshalling/unmarshalling those card details.

var model = VertexAiGeminiChatModel.builder()
    .project(System.getenv("PROJECT_ID"))
    .location(System.getenv("LOCATION"))
    .modelName("gemini-2.0-flash-exp")
    .responseMimeType("application/json")
    .responseSchema(SchemaHelper.fromClass(Card[].class))
    .temperature(0.1f)
    .build();

Notice that we’re using the new model: gemini-2.0-flash-exp (it’s labeled experimental for now). And also pay attention to the response MIME type, which is JSON, and the fact we’re defining a response schema: We instruct Gemini to return a valid JSON object whose schema corresponds to the records we’ve just defined.

Next, let’s load all the cards pictures and details (our sample dataset):

var cardsExamples = processImageFiles(
        Path.of("skyjo-counter/samples"));

Now we can iterate over all the cards, to check that Gemini 2 recognises all of them:

for (CardsExample example : cardsExamples) {
    System.out.println("File: " + example.imageFile());
    // ...
}

Let’s look at the convoluted prompt I came up with to make sure all my sample pictures are recognised:

Response<AiMessage> response =
  model.generate(
    SystemMessage.from("""
      Detect playing cards with numbers, with no more than 12 items.
      Output a JSON list of cards, where each entry contains the 2D
      bounding box in `boundingBox` and the `label` is the big number
      displayed in the center of the card.
      If you see the text "SKYJO" on the card, use 0 as the label
      in `label`.
      Ignore the small numbers in the corners of the cards.
      Ignore cards with text written on them.
      Be careful when reading the numbers, as sometimes some cards
      are tilted, cut, or upside down.
      """),
    UserMessage.from(
      ImageContent.from(example.imageFile().toUri()),
      TextContent.from("""
        Detect the cards of this image.
        """)
    ));

We give Gemini some system instructions to pay attention to the card numbers (12 cards at most), to return bounding boxes around the detected cards, and to give the big number at the center of each card as the label. There are some extra instructions for upside-down cards, to ignore the small numbers in the corners, and to pay attention to the fact that some cards may be cut, tilted, etc. This prompt may not necessarily be perfect, but at least it worked for all my pictures!

Then, as user message, we pass both the picture, and the request to detect the cards in the picture.

As a last step, we parse the returned JSON structure with GSON (I could have used LangChain4j’s AiServices for a cleaner and more type-safe approach), and we count the points. If the sum of points isn’t correct, we display the cards that have been recognised, for troubleshooting purposes.

Let’s check the output:

File: skyjo-counter/samples/-1 -1 -2 0 3 0 4.jpg
 ==> Your points: 3
File: skyjo-counter/samples/1 4 1 -1 3 0 0 3 3 3.jpg
 ==> Your points: 17
File: skyjo-counter/samples/3 9 3 4 5 2 4 5.jpg
 ==> Your points: 35
File: skyjo-counter/samples/3 5 2 4 5.jpg
 ==> Your points: 19
File: skyjo-counter/samples/-1 4.jpg
 ==> Your points: 3
File: skyjo-counter/samples/1 0 2.jpg
 ==> Your points: 3
File: skyjo-counter/samples/1 0 3 4 0 3 1 -1 2.jpg
 ==> Your points: 13
File: skyjo-counter/samples/4 4 1 2 0 2 1 2 3.jpg
 ==> Your points: 19
File: skyjo-counter/samples/0 -1 -1 -2 0 0 0 0 -1.jpg
 ==> Your points: -5
File: skyjo-counter/samples/4 1 -2 2 4 2 3 3 0 5.jpg
 ==> Your points: 22
File: skyjo-counter/samples/4 3 0 -2 -1 -1 2 1 3.jpg
 ==> Your points: 9
File: skyjo-counter/samples/6 1 2 6 1 3.jpg
 ==> Your points: 19
File: skyjo-counter/samples/3 3 5 2 5.jpg
 ==> Your points: 18
File: skyjo-counter/samples/1 -2 5 2 -1 8 0.jpg
 ==> Your points: 13

The picture file names contain the values of the cards, so it was easy to check for the ground truth! And if we sum up those numbers, we should come up with the same number of points.
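As a minimal sketch of that ground-truth check, here is how the expected score could be derived from a sample file name. The `SkyjoGroundTruth` class and its `expectedPoints` method are hypothetical names for illustration, not taken from the gist, and the sketch assumes file names made only of space-separated integers followed by an extension:

```java
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper (not from the gist): derives the expected score
// from a sample file name such as "1 0 2.jpg", which lists the card values.
class SkyjoGroundTruth {

    static int expectedPoints(Path imageFile) {
        String name = imageFile.getFileName().toString();
        // Drop the extension, e.g. ".jpg"
        int dot = name.lastIndexOf('.');
        String values = dot >= 0 ? name.substring(0, dot) : name;
        // Split on whitespace and sum the (possibly negative) card values
        return Arrays.stream(values.trim().split("\\s+"))
                .mapToInt(Integer::parseInt)
                .sum();
    }

    public static void main(String[] args) {
        Path sample = Path.of("skyjo-counter/samples/-1 -1 -2 0 3 0 4.jpg");
        System.out.println(expectedPoints(sample)); // prints 3
    }
}
```

The value returned here can then be compared with the sum of the `label` fields of the cards Gemini detected for the same picture.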

Now what?

Well, first of all, I’m happy that LangChain4j works with Gemini 2.0! Secondly, the quality of object detection keeps on progressing nicely. Thirdly, I might have to update my old demo, to make it a PWA that could run on mobile, so that I don’t have to add up the card numbers in my head, because I’m lazy!

https://glaforge.dev/posts/2024/12/13/detecting-objects-with-gemini-2-and-langchain4j/
Semantic code search for Programming Idioms with LangChain4j and Vertex AI embedding models

By Guillaume Laforge & Valentin Deleplace

The Programming Idioms community website created by Valentin lets developers share typical implementations in various programming languages for usual tasks like printing the famous “Hello World!” message, counting the characters in a string, sorting collections, or formatting dates, to name a few. And many more: there are currently 350 idioms, covering 32 programming languages. It’s a nice way to discover how various languages implement such common tasks!

The website features a typical keyword-based search feature, which is able to search through idiom descriptions, source code, comments, and tags. However, we (Guillaume & Valentin) were curious to see if we could enhance the search with a more semantic focus, taking advantage of Vertex AI embedding models, and their ability to search through code from natural language queries. With a semantic search, you’re not limited to results that match some keywords from a query, but you’ll get results even when using synonyms, or descriptions of what the code is doing.

Embedding models take a string as input, and generate a multidimensional floating point vector representation of that string. What’s interesting with those vectors is that input strings whose vectors are close to each other (for instance via a cosine similarity calculation) are generally close to each other semantically speaking as well. This is why you can create semantic searches: you can search for semantically similar strings, even if they don’t share the same keywords and use synonyms instead. You can explore Guillaume’s article “The power of embeddings: How numbers unlock the meaning of data” to learn more about embedding models.
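To make the notion of "closeness" concrete, here is a minimal sketch of a cosine similarity calculation in plain Java. The `CosineSimilarity` class is a hypothetical illustration, not part of LangChain4j, and the tiny two-dimensional vectors stand in for real embeddings, which have hundreds of dimensions:

```java
// Hypothetical illustration of cosine similarity between two vectors.
// Assumes both vectors have the same length and are non-zero.
class CosineSimilarity {

    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];     // accumulate the dot product
            normA += a[i] * a[i];   // squared length of a
            normB += b[i] * b[i];   // squared length of b
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        float[] query = {1f, 0f};
        // Same direction: similarity is highest
        System.out.println(cosine(query, new float[]{2f, 0f}));
        // Orthogonal direction: no similarity
        System.out.println(cosine(query, new float[]{0f, 3f}));
    }
}
```

The result ranges from -1 (opposite directions) to 1 (same direction), regardless of the length of the vectors.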

In the code shown in this article, we’ll be coding in Java, and we will be using the LangChain4j open source framework. You can view the full source code in this gist, and below, we’ll highlight the key elements of this program.

We’ll be using the latest version of Google Cloud Vertex AI embedding models: text-embedding-005. Why is it important? Because this new version supports a new task type: CODE_RETRIEVAL_QUERY.

With this embedding model, different task types optimize the embedding of text for different purposes, like document retrieval, question answering, fact verification… and now code retrieval queries. With this code retrieval query task type, you can search for code snippets with natural language! This will come in handy for us when we want to do natural language semantic search throughout our programming idioms!

Before implementing our smart code search, if you want to learn more about the new task types of our embedding models, please go check this video:

New “task type” embedding from the DeepMind team improves RAG search quality

Let’s collect the idioms

The Programming Idioms website exposes a simple REST API. An endpoint allows you to get all the idioms in one HTTP GET call, but you can also access individual idioms via another GET request:

Idioms contain various fields like their title, description, keywords, and provide one or more implementations in various programming languages.

For example, the “Sum of squares” idiom starts like this:

{
  Id: 202,
  OrigId: 0,
  Title: "Sum of squares",
  LeadParagraph: "Calculate the sum of squares _s of _data, an array of floating point values.",
  ExtraKeywords: "reduce",
  Author: "Bart",
  CreationDate: "2019-09-28T20:37:11.726064Z",
  LastEditor: "programming-idioms.org",
  EditSummary: "New Java implementation by user [reilas]",
  LastEditedImplID: 6839,
  OriginalAttributionURL: "",
  Picture: "",
  ImageURL: "",
  ImageWidth: 0,
  ImageHeight: 0,
  ImageAlt: "",
  Version: 40,
  VersionDate: "2024-11-08T22:54:02.691646Z",
  Implementations:
  [
    {
      Id: 3466,
      OrigId: -1,
      Author: "Bart",
      CreationDate: "2019-09-28T20:37:11.726064Z",
      LastEditor: "programming-idioms.org",
      LanguageName: "Pascal",
      CodeBlock: "var
  data: array of double;
...
  s := SumOfSquares(data);
...",
      OriginalAttributionURL: "",
      DemoURL: "",
      DocumentationURL: "",
      AuthorComment: "",
      Version: 2,
      VersionDate: "2021-12-07T10:07:15.952746Z",
      Rating: 0,
      Checked: false,
      ImportsBlock: "uses math;",
      PictureURL: "",
      Protected: false
    },
    …
  ]
}

What’s interesting for us, for a semantic code search engine, are the following idiom fields:

  • Id — the unique ID of the idiom
  • Title — that describes the idiom in a short way
  • LeadParagraph — which is a more detailed definition of the idiom
  • ExtraKeywords — words related to the idiom, for search

And for the implementations, the fields:

  • Id — the unique ID of the idiom implementation
  • CodeBlock — which contains the source code of the implemented idiom
  • LanguageName — which says which programming language was used for that implementation
  • AuthorComment — a small explanation about the implementation

We can represent those two notions, idiom & implementations, as Java records:

record Idiom(
    @SerializedName("Id")
    long id,
    @SerializedName("Title")
    String title,
    @SerializedName("LeadParagraph")
    String description,
    @SerializedName("ExtraKeywords")
    String keywords,
    @SerializedName("Implementations")
    Implementation[] implementations
) {
    record Implementation(
        @SerializedName("Id")
        long id,
        @SerializedName("LanguageName")
        String language,
        @SerializedName("CodeBlock")
        String code,
        @SerializedName("AuthorComment")
        String comment

    ) {
    }
}

The annotations are here to map between the JSON key names and the Java record field names.

We load all the idioms from the website, and we create TextSegment instances, the class LangChain4j passes to the embedding model to create vectors.

Idiom[] idioms = loadIdioms();
// Collects one text segment per non-empty implementation
List<TextSegment> allCodeSegments = new ArrayList<>();

for (Idiom idiom : idioms) {
    System.out.println("-> " + idiom.title);

    for (var implementation : idiom.implementations) {
        if (implementation.code != null &&
               !implementation.code.isBlank()) {
            allCodeSegments.add(new TextSegment(
                implementation.code,
                new Metadata()
                    .put("idiomId", idiom.id)
                    .put("title", idiom.title)
                    .put("description", idiom.description)
                    .put("titleAndDescription",
                        idiom.title + ": " + idiom.description)
                    .put("keywords", idiom.keywords)
                    .put("implementationId", implementation.id)
                    .put("language", implementation.language)
            ));
        }
    }
}

Notice that we also add some metadata. Not only do we embed the code snippets, but we also add some extra information like the title, description, keywords, or programming language. This will be useful for showing the results found during the semantic search.

We create a metadata field that concatenates the title and description of the idiom, as this is useful meta-information that the embedding model can use when calculating the vector embeddings. The text-embedding-005 model pays attention to that information, and this will influence the calculations and enrich the semantic context of the vector embedding.

Calculating embedding vectors

To compute those embeddings, we configure and use the text-embedding-005 embedding model offered by Vertex AI. We define two instances of the model, with two distinct task types:

private static final VertexAiEmbeddingModel EMBEDDING_MODEL =
    VertexAiEmbeddingModel.builder()
        .project(System.getenv("GCP_PROJECT_ID"))
        .location(System.getenv("GCP_LOCATION"))
        .modelName("text-embedding-005")
        .publisher("google")
        .taskType(VertexAiEmbeddingModel.TaskType.RETRIEVAL_DOCUMENT)
        .titleMetadataKey("titleAndDescription")
        .maxSegmentsPerBatch(150)
        .build();

private static final VertexAiEmbeddingModel EMBEDDING_MODEL_FOR_RETRIEVAL =
    VertexAiEmbeddingModel.builder()
        .project(System.getenv("GCP_PROJECT_ID"))
        .location(System.getenv("GCP_LOCATION"))
        .modelName("text-embedding-005")
        .publisher("google")
        .taskType(VertexAiEmbeddingModel.TaskType.CODE_RETRIEVAL_QUERY)
        .titleMetadataKey("titleAndDescription")
        .build();

We use the EMBEDDING_MODEL instance, with a RETRIEVAL_DOCUMENT task type, to calculate the vector embeddings of the code snippets, and the EMBEDDING_MODEL_FOR_RETRIEVAL instance, with a CODE_RETRIEVAL_QUERY task type, to embed the queries at retrieval time.

The documentation on task types explains that it helps optimize the vector embedding calculation for different types of tasks. And this is what allows us to compare natural language queries like "calculating string length" with the actual code that computes the length of a string. Task types put the questions and answers closer in the embedding space.

We calculate all the embeddings in batch with:

InMemoryEmbeddingStore<TextSegment> embeddingStore =
    new InMemoryEmbeddingStore<>();
List<Embedding> allEmbeddings =
    EMBEDDING_MODEL.embedAll(allCodeSegments).content();
embeddingStore.addAll(allEmbeddings, allCodeSegments);
embeddingStore.serializeToFile(filePath);

Embedding the query and searching

With vector databases, when doing a search, we compare a vector embedding of what we’re searching for, with all the vector embeddings stored. So now that we have all our code snippets embedded, we need to compare an embedding of a user query to all those snippets. The in-memory embedding store can calculate cosine similarities between vectors for us.
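The comparison described above can be sketched as a brute-force scan over all stored vectors. This is only an illustration under stated assumptions, not the actual implementation of the in-memory embedding store: the `BruteForceSearch` class, its `Entry` and `Match` records, and the mapping of the cosine similarity onto a 0 to 1 score are all hypothetical choices made for this example.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of a brute-force similarity search.
// Assumes non-zero vectors that all share the same dimension.
class BruteForceSearch {

    record Entry(String text, float[] vector) {}
    record Match(String text, double score) {}

    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Assumption for this sketch: rescale cosine from [-1, 1] to [0, 1]
    static double relevance(double cosine) {
        return (cosine + 1) / 2;
    }

    static List<Match> search(List<Entry> store, float[] query,
                              int maxResults, double minScore) {
        return store.stream()
                // Score every stored vector against the query
                .map(e -> new Match(e.text(), relevance(cosine(e.vector(), query))))
                // Keep only the matches that reach the minimum score
                .filter(m -> m.score() >= minScore)
                // Best matches first, truncated to maxResults
                .sorted(Comparator.comparingDouble(Match::score).reversed())
                .limit(maxResults)
                .toList();
    }
}
```

A linear scan like this is fine for a few thousand snippets, which is the order of magnitude of our idiom implementations; larger collections usually call for a dedicated vector database with an index.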

Simplifying the code from the gist a little, what we do here is to calculate the embedding for the user query, and prepare an embedding search request:

Embedding queryEmbedding =
    EMBEDDING_MODEL_FOR_RETRIEVAL.embed(question)
        .content();

var searchRequest =
    EmbeddingSearchRequest.builder()
        .maxResults(5)
        .minScore(0.8)
        .queryEmbedding(queryEmbedding)
        .build();

EmbeddingSearchResult<TextSegment> searchResult =
    embeddingStore.search(searchRequest);

We chose to return only the 5 best search results, whose minimal score is above 0.8 (the score is a value between 0 and 1, with 1 being the highest). Then, we can iterate over the hits, and display the results for this search with some formatting:

searchResult.matches().forEach(match -> {
    TextSegment matchedSegment = match.embedded();

    System.out.format("""
            ——— %s ——— (score: %4.5f) —————————
            Title: %s

            Description: %s

            Code:
            %s

            """,
        matchedSegment.metadata().getString("language"),
        match.score(),
        matchedSegment.metadata().getString("title"),
        matchedSegment.metadata().getString("description"),
        matchedSegment.text()
    );
});

We can try different queries:

  • How can I make an HTTP POST request?
  • How to count the characters in a string?
  • How to use the LibXML parser in Perl?

For example, for the first query, the top results look interesting, with some good scores:

——— Java ——— (score: 0.85341) —————————
Title: Make HTTP POST request

Description: Make a HTTP request with method POST to the URL u

Code:
String s = HttpClient.newHttpClient().send(HttpRequest.newBuilder()
                        .uri(URI.create(u))
                        .POST(HttpRequest.BodyPublishers.ofString(content))
                        .build(), HttpResponse.BodyHandlers.ofString())
                .body();

——— D ——— (score: 0.84189) —————————
Title: Make HTTP POST request

Description: Make a HTTP request with method POST to the URL u

Code:
auto response = post(u, content);

——— Go ——— (score: 0.84010) —————————
Title: Make HTTP POST request

Description: Make a HTTP request with method POST to the URL u

Code:
response, err := http.Post(u, contentType, body)

——— Go ——— (score: 0.83938) —————————
Title: Make HTTP POST request

Description: Make a HTTP request with method POST to the URL u

Code:
response, err := http.PostForm(u, formValues)

——— Lisp ——— (score: 0.83770) —————————
Title: Make HTTP POST request

Description: Make a HTTP request with method POST to the URL u

Code:
(dex:post u)

Our search implementation found the right idioms and implementations.

Restricting the search with metadata filtering

Now if we try to be more specific, as in our third question, which explicitly asks about a specific programming language (Perl), the search would still yield results in all programming languages. But the user wanted only Perl examples! To get better and more precise results, we can take advantage of LangChain4j’s metadata filtering.

You remember that we added various metadata information to our embedded text segments? We included the programming language used in the code snippet in a language metadata field. With metadata filtering, we can focus the search only on a subset of vector embeddings whose language metadata field matches the programming language we’re interested in.

Let’s update our search query as follows:

var searchRequest = EmbeddingSearchRequest.builder()
    .maxResults(5)
    .minScore(0.8)
    .queryEmbedding(queryEmbedding)
    .filter(new IsEqualTo("language",
        programmingLanguageRecognised))
    .build();

We added a filter() method, that checks that the language is equal to some value. But then, it means we have to know up-front that the user wants results just for one specific programming language. We could have some kind of UI element that users have to fill to select the programming language. But in our search query, we had a user providing the programming language directly in that query: "How to use the LibXML parser in Perl?"

In such a situation, we can’t rely on a UI component or CLI parameter: we have to guess the requested programming language from the query string itself. This is where Gemini can come to the rescue. With a little bit of prompting, we can ask the generative model to tell us if a programming language is present in the query, and which one.

First, let’s have a look at the programming languages offered by Programming Idioms:

private static final List<String> KNOWN_PROGRAMMING_LANGUAGES =
    List.of("UNKNOWN",
        "Go", "Rust", "Python", "Perl", "Ruby", "Java", "JS",
        "C#", "Dart", "Pascal", "PHP", "C++", "Haskell", "D",
        "Lua", "Clojure", "Fortran", "Elixir", "Kotlin",
        "Erlang", "C", "Lisp", "VB", "Groovy", "Ada", "Scala",
        "Scheme", "Smalltalk", "Obj-C", "Cobol", "Prolog", "Caml"
);

We added an UNKNOWN value for cases where the language is not specified or not recognised.

Now we configure a Gemini 1.5 Flash model, specifying a response schema that restricts the model’s answer to one of the values in this enumeration of known programming languages:

private static final ChatLanguageModel GEMINI_MODEL =
    VertexAiGeminiChatModel.builder()
        .project(System.getenv("GCP_PROJECT_ID"))
        .location(System.getenv("GCP_LOCATION"))
        .modelName("gemini-1.5-flash-002")
        .responseSchema(Schema.newBuilder()
            .setType(Type.STRING)
            .addAllEnum(KNOWN_PROGRAMMING_LANGUAGES)
            .build())
        .build();

Let’s prompt Gemini to find the programming language in the user query (if present):

String programmingLanguageRecognised =
    GEMINI_MODEL.generate(
        SystemMessage.from("""
            Your role is to classify the user message to decide
            if it is a question about a particular programming
            language or not.
            If you don't know, or if the programming language
            is not specified, reply with `UNKNOWN`, otherwise
            reply with just the name of the programming
            language recognized among the following list:
            """ + KNOWN_PROGRAMMING_LANGUAGES),
        UserMessage.from(question)
    ).content().text();

Gemini will either reply with UNKNOWN if no programming language was mentioned, or with the language it has recognized.
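
One detail to watch for: when Gemini replies UNKNOWN, filtering on that value would likely return nothing, assuming no segment carries UNKNOWN as its language metadata. Here is a small sketch of how the reply could be turned into an optional filter value. The LanguageReply class and its toFilterValue() method are hypothetical helpers, and the list of languages is shortened for brevity.

```java
import java.util.List;
import java.util.Optional;

class LanguageReply {

    // Shortened list for this sketch; the article's list is longer.
    static final List<String> KNOWN =
        List.of("UNKNOWN", "Go", "Rust", "Python", "Perl", "Java");

    // Returns the language to filter on, or an empty Optional
    // when the search should run without any language filter.
    static Optional<String> toFilterValue(String modelReply) {
        if (modelReply == null) {
            return Optional.empty();
        }
        String reply = modelReply.strip();
        if (reply.equals("UNKNOWN") || !KNOWN.contains(reply)) {
            return Optional.empty();
        }
        return Optional.of(reply);
    }
}
```

When the Optional is empty, you would simply build the search request without calling filter().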

Now, when making a search for an idiom in a particular language, only implementations in that language are returned, giving much better results, in line with the expectations of the user.

Possible further improvements

Where can we go from there? We can make the search a little bit snappier, or further enhance the quality of the search results.

Let’s talk first about the search speed. Searching through the in-memory vector database is pretty fast, and only requires a couple dozen milliseconds. After all, it’s all in memory, and there aren’t millions of records in the database. What takes more time are the round trips to the cloud-hosted embedding model and generative model.

Depending on the cloud region you use, and from where you call the program, an embedding request can take up to a second and a half, and the Gemini call less than a second. So making a request to Gemini to guess the programming language, then calling the embedding model to embed the query for comparison with the in-memory database, would be roughly two and a half seconds long if done serially. Since both operations are unrelated, we can call them in parallel using an executor service with two threads:

List<Future<Object>> futures;
try (var executorService = Executors.newFixedThreadPool(2)) {
    futures = executorService.invokeAll(List.of(
        () -> recognizeProgrammingLanguage(question),
        () -> embedQuery(question)
    ));
}
String programmingLanguageRecognised = (String) futures.get(0).get();
Embedding queryEmbedding = (Embedding) futures.get(1).get();

With this trick, embedding the query and guessing the programming language together take only as long as the slower of the two tasks. Usually, the embedding seems to be the slower one. So we shave about a second off the wait time for the user. It’s a win!
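
If you’d rather avoid the casts from Object, the same idea can be sketched with CompletableFuture, which keeps each result typed. In this sketch, the two methods are simple stand-ins for the remote calls (their bodies are made up), and Prepared is a hypothetical record holding both results.

```java
import java.util.concurrent.CompletableFuture;

class ParallelCalls {

    // Hypothetical stand-in for the Gemini call that guesses the language.
    static String recognizeProgrammingLanguage(String question) {
        return question.contains("Perl") ? "Perl" : "UNKNOWN";
    }

    // Hypothetical stand-in for the embedding model call.
    static float[] embedQuery(String question) {
        return new float[] { question.length() };
    }

    // Holds the two results, each with its own type.
    record Prepared(String language, float[] embedding) {}

    static Prepared prepare(String question) {
        var language = CompletableFuture.supplyAsync(
            () -> recognizeProgrammingLanguage(question));
        var embedding = CompletableFuture.supplyAsync(
            () -> embedQuery(question));
        // thenCombine() completes once both results are available,
        // and join() rethrows any failure as an unchecked exception.
        return language.thenCombine(embedding, Prepared::new).join();
    }
}
```

Note that supplyAsync() runs tasks on the common fork-join pool by default. For blocking network calls, passing your own executor as a second argument may be a better fit.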

The other aspect we could improve further is the quality of search results. We already improved it by applying two techniques: using a code retrieval task type with our embedding model, and also the programming language filtering to avoid returning languages the user isn’t interested in.

However, there’s another approach we haven’t explored (this could be the topic for another article) which is to combine the existing keyword-based search provided by the Programming Idioms website, with our semantic search. This is what is called hybrid search: combining the results of two or more searches, to give better results, applying techniques like Reciprocal Rank Fusion to merge results.

Embedding and generative models understand text pretty well, but can struggle with acronyms, product names, etc, that they haven’t seen much (if at all) in their training set. But keyword-based searches excel at that. So by combining the best of both worlds, our little website search box could tackle more queries, and give the best answers to our users.
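
To give an idea of how Reciprocal Rank Fusion works, here is a minimal sketch in plain Java. Each result list contributes 1 / (k + rank) to the score of every document it contains, and documents are then sorted by their total score. The class and method names are mine, and a value of 60 for k is a commonly used default, not something tuned for Programming Idioms.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReciprocalRankFusion {

    // Fuses several ranked lists of document ids into a single ranking.
    // A larger k reduces the advantage of being ranked first in one list.
    static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                // Ranks are 1-based, so position i contributes 1 / (k + i + 1).
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        return scores.entrySet().stream()
            // Highest score first; ties are broken by id for a stable order.
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Double>comparingByKey()))
            .map(Map.Entry::getKey)
            .toList();
    }
}
```

For example, with semantic results [a, b, c], keyword results [b, d, a], and k set to 60, the fused ranking is [b, a, d, c]: documents found by both searches rise to the top.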

Summary

This article explored semantic code search for programming idioms using Vertex AI embedding models and the LangChain4j framework. We aimed to enable natural language queries for code examples, going beyond keyword-based searches. Key learnings included:

  • Embedding models represented text as multidimensional vectors, capturing semantic similarities.
  • Vertex AI’s text-embedding-005 model, particularly the CODE_RETRIEVAL_QUERY task type, was optimized for code-related searches.
  • LangChain4j provided a framework for building LLM applications in Java.
  • Gemini, a generative AI model, could be used to infer the programming language from a user’s query, improving search accuracy.
  • Parallel processing enhanced search speed by concurrently executing embedding and language recognition tasks.
  • Metadata filtering allowed for more precise searches based on attributes like the programming language name.
  • Hybrid search, combining semantic and keyword-based approaches, could further improve search quality.

Overall, the article demonstrated how we could build a fast and intelligent programming idiom search engine that understands natural language queries and retrieves contextually relevant code examples.

https://glaforge.dev/posts/2024/12/02/semantic-code-search-for-programming-idioms-with-langchain4j-and-vertex-ai-embedding-models/
Redacting sensitive information when using Generative AI models

As we are making our apps smarter with the help of Large Language Models, we must keep in mind that we are often dealing with potentially sensitive information coming from our users. In particular, in the context of chatbots, our application users have the ability to input any text in the conversation.

Personally Identifiable Information (PII) should be handled with the highest level of care, because we care about our users, we don’t want to leak their personal details, and we must comply with all sorts of laws and regulations. In a word, we are responsible AI developers.

In this article, we’ll learn about the Google Cloud Data Loss Prevention (DLP) API. It’s a very powerful and rich service, which allows you to identify, classify, filter, redact any PII like names, passport numbers, bank account numbers, and more.

Today, with DLP, our goal is to redact the PII information sent by our user, before sending the user’s message to our LLM.

In a nutshell (in pseudo-code), instead of doing:

String userMessage = "...";
String response = model.generate(userMessage);

We want to add a step in the middle that redacts the personally identifiable information before it is sent to the LLM, so we need to insert a method call in between:

String userMessage = "...";
String redactedMessage = redact(userMessage);
String response = model.generate(redactedMessage);

We’ll simply redact the user message, but remember that there are other areas where you can apply good practices when handling user information. For example, when you store data, when you log interactions, etc.

Meet our user!

Our user, let’s call her Alicia, is a bit talkative, and shares way more information than she should. Let’s imagine that she is travelling, has lost her wallet, and needs some money to be wired in a rush. Maybe she could send a message to our travel application that looks as follows:

String userMessage = """
    My name is Alicia Bob.
    My number is +33612345678, can you call me please?
    Please wire some $$$ on FR7630001007941234567890185
    You can check my passport if needed, it's 78TH67845.
    """;

Whoa! In one message she gave her name, her phone number, her bank account (IBAN), and even her passport number! But our application doesn’t necessarily need all those details!

In our code, we’re sending that information to our Gemini model, using LangChain4j:

var model = VertexAiGeminiChatModel.builder()
    .project(System.getenv("GCP_PROJECT_ID"))
    .location(System.getenv("GCP_LOCATION"))
    .modelName("gemini-1.5-flash-002")
    .build();

String redactedMessage = redact(userMessage);

System.out.println(redactedMessage);

String response = model.generate(redactedMessage);

Our mission, if we accept it, is to implement the redact() method that will remove all the PII from that request.

Redacting this [REDACTED] message!

First, let’s have a look at all the code of our redact() method, and we’ll explain bits and pieces further down. You can also find all the code in this gist on GitHub.

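import com.google.cloud.dlp.v2.DlpServiceClient;
import com.google.privacy.dlp.v2.*;
import java.io.IOException;
import java.util.stream.Stream;

public static String redact(String userMessage) {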
  try (var dlp = DlpServiceClient.create()) {
    var item = ContentItem.newBuilder().setValue(userMessage).build();
    var inspectConfigbuilder = InspectConfig.newBuilder();
    var redactConfig = DeidentifyConfig.newBuilder();
    var infoTypeTransfBuilder = InfoTypeTransformations.newBuilder();

    Stream.of("PERSON_NAME", "PHONE_NUMBER", "PASSPORT", "IBAN_CODE")
      .forEach(toRedact -> {
          var infoType = InfoType.newBuilder()
            .setName(toRedact)
            .build();
          inspectConfigbuilder.addInfoTypes(infoType);

          var replaceValueConfig =
            ReplaceValueConfig.newBuilder()
              .setNewValue(Value.newBuilder()
                .setStringValue("[" + toRedact + "]").build())
              .build();

          var primitiveTransformation =
            PrimitiveTransformation.newBuilder()
              .setReplaceConfig(replaceValueConfig).build();

          var infoTypeTransformation =
            InfoTypeTransformations.InfoTypeTransformation.newBuilder()
              .addInfoTypes(infoType)
              .setPrimitiveTransformation(primitiveTransformation)
              .build();

          infoTypeTransfBuilder
            .addTransformations(infoTypeTransformation);
        });

    redactConfig.setInfoTypeTransformations(
      infoTypeTransfBuilder);

    DeidentifyContentRequest request =
      DeidentifyContentRequest.newBuilder()
        .setParent(
            LocationName.of(System.getenv("GCP_PROJECT_ID"), "global")
            .toString())
        .setItem(item)
        .setDeidentifyConfig(redactConfig)
        .setInspectConfig(inspectConfigbuilder)
        .build();

    DeidentifyContentResponse response =
      dlp.deidentifyContent(request);

    return response.getItem().getValue();
  } catch (IOException e) {
    throw new RuntimeException("Failed to redact message.", e);
  }
}

As you can see, the DLP API is quite verbose, but it’s really powerful, and capable of much more than just redacting PII.

First of all, we need to create a client for the DLP service (which is AutoCloseable, hence the try-with-resources pattern):

try (var dlp = DlpServiceClient.create()) { ... } catch (...) {...}

We create a ContentItem from our user message:

var item = ContentItem.newBuilder()
    .setValue(userMessage)
    .build();

Then we’ll create some InfoTypes which represent the different kinds of identifiable information we’re interested in:

Stream.of("PERSON_NAME", "PHONE_NUMBER", "PASSPORT", "IBAN_CODE")
    .forEach(toRedact -> {
        var infoType = InfoType.newBuilder().setName(toRedact).build();
        inspectConfigbuilder.addInfoTypes(infoType);

Here, we only care about person names, phone numbers, passport numbers, and IBAN codes. But there are a ton of other info types we can redact.

The next few instructions associate a transformation rule with each info type, to replace the PII with a redacted placeholder. We could have used something generic like [REDACTED], but we are going to reuse the name of the info type: [PERSON_NAME], [PHONE_NUMBER], [PASSPORT], and [IBAN_CODE]:

var replaceValueConfig =
    ReplaceValueConfig.newBuilder()
        .setNewValue(Value.newBuilder()
            .setStringValue("[" + toRedact + "]").build())
        .build();

var primitiveTransformation =
    PrimitiveTransformation.newBuilder()
        .setReplaceConfig(replaceValueConfig).build();

var infoTypeTransformation =
    InfoTypeTransformations.InfoTypeTransformation.newBuilder()
        .addInfoTypes(infoType)
        .setPrimitiveTransformation(primitiveTransformation)
        .build();

We add all those text transformations to the information type transformation builder, and then it’s time to actually make the request to the DLP service:

DeidentifyContentRequest request =
    DeidentifyContentRequest.newBuilder()
        .setParent(
            LocationName.of(System.getenv("GCP_PROJECT_ID"), "global")
              .toString())
        .setItem(item)
        .setDeidentifyConfig(redactConfig)
        .setInspectConfig(inspectConfigbuilder)
        .build();

DeidentifyContentResponse response = dlp.deidentifyContent(request);

return response.getItem().getValue();

We wire everything together by creating a DeidentifyContentRequest instance with our user message (the item) and all PII identification and transformation configuration. We configured the DLP service by passing our Google Cloud project ID, after having enabled the API. We call the DLP service with dlp.deidentifyContent(request) and finally we can get the redacted value with response.getItem().getValue().

So what does our original user message look like now, once redaction is applied? Let’s see:

My name is [PERSON_NAME] [PERSON_NAME].
My number is [PHONE_NUMBER], can you call me please?
Please wire some $$$ on [IBAN_CODE]
You can check my passport if needed, it's [PASSPORT].

No more personally identifiable information left!

Summary

Our users’ trust is one of the most important things we must care about. Not only for compliance purposes but also simply because it’s the right thing to do. There are so many hackers out there trying to get access to such information, for nefarious reasons. Let’s not offer them an extra chance to harm our users.

In this article and sample code, we’ve seen that the Google Cloud DLP API is able to redact information, but it can be used in a myriad of other ways, for example to analyze data at rest, or to deidentify and reidentify information. Be sure to check out what this service is capable of doing. We focused on just a few kinds of PII, but DLP supports a huge number of info types.

There’s a big list of snippets of code that you can have a look at to see what you can do with the DLP API. There are SDKs for various programming languages, if you use another language than Java. And check out the documentation!

You can apply this technique to filter user input before sending it to a generative model, but you can also apply it to the model’s output, or when you log user messages or store data in databases or other places.
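
As a rough sketch of that idea, the small wrapper below passes the input, the output, and anything that gets logged through the same redaction function. The RedactingChat class is hypothetical: the redactor could be the redact() method from this article, the model is a plain function standing in for the LLM call, and the list stands in for a real logger or database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

class RedactingChat {

    private final UnaryOperator<String> redactor; // e.g. the redact() method
    private final UnaryOperator<String> model;    // stand-in for the LLM call
    final List<String> log = new ArrayList<>();   // stand-in for a logger or database

    RedactingChat(UnaryOperator<String> redactor, UnaryOperator<String> model) {
        this.redactor = redactor;
        this.model = model;
    }

    String chat(String userMessage) {
        // Redact before the model ever sees the message.
        String safeInput = redactor.apply(userMessage);
        // Redact the answer too, in case the model produced sensitive details.
        String safeOutput = redactor.apply(model.apply(safeInput));
        // Only redacted text is ever stored.
        log.add("user: " + safeInput);
        log.add("model: " + safeOutput);
        return safeOutput;
    }
}
```

The design choice here is that the raw user message never leaves the chat() method: everything downstream only ever sees the redacted version.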

And remember, be mindful of your users’ data!

https://glaforge.dev/posts/2024/11/25/redacting-sensitive-information-when-using-generative-ai-models/
Data extraction: The many ways to get LLMs to spit JSON content

Data extraction from unstructured text is a very important task where LLMs shine, as they understand human languages well. Rumor has it that 80% of the worldwide knowledge and data comes in the form of unstructured text (vs 20% for data stored in databases, spreadsheets, JSON/XML, etc.) Let’s see how we can get access to that trove of information thanks to LLMs.

In this article, we’ll have a look at different techniques to make LLMs generate JSON output and extract data from text. This applies to most LLMs and frameworks, but for illustration purposes, we’ll use Gemini and LangChain4j in Java.

We’ll explore the following approaches:

  • prompting
  • function calling
  • structured output with a JSON mode
  • structured output with a JSON response schema

Let’s get started

Your mission, if you accept it, is to extract the name and age from the biography of a person:

Anna is a 23 year old artist based in Brooklyn, New York. She
was born and raised in the suburbs of Chicago, where she developed a
love for art at a young age. She attended the School of the Art
Institute of Chicago, where she studied painting and drawing. After
graduating, she moved to New York City to pursue her art career.
Anna's work is inspired by her personal experiences and observations
of the world around her. She often uses bright colors and bold lines
to create vibrant and energetic paintings. Her work has been exhibited
in galleries and museums in New York City and Chicago.

From that text, we want to extract the following JSON snippet:

{
  "name": "Anna",
  "age": 23
}

Let’s just ask politely!

The first approach is to simply craft a user message, via prompting, that requests the response to be returned as JSON. A simple prompt suffices:

Return the name and age of the person described in the biography below.
Give the name and age in the form of a JSON object following this
structure: `{"name": "Jon Doe", "age": 36}`
Only return JSON, without any explanation,
without surrounding markdown code markup.

Here is the biography:

Anna is a 23 year old artist based in Brooklyn, New York. She
was born and raised in the suburbs of Chicago, where she developed a
love for art at a young age. She attended the School of the Art
Institute of Chicago, where she studied painting and drawing. After
graduating, she moved to New York City to pursue her art career.
Anna's work is inspired by her personal experiences and observations
of the world around her. She often uses bright colors and bold lines
to create vibrant and energetic paintings. Her work has been exhibited
in galleries and museums in New York City and Chicago.

JSON:

LLMs don’t always follow instructions precisely. Sometimes they wrap their answers with messages like “Here is the name and age of the person…” or with extra Markdown code blocks. So you may have to nudge them a little bit, and further tweak the prompt, to request that they really output only JSON.
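
As a safety net on top of prompt tweaking, you can also clean up the response before parsing it. The sketch below is a hypothetical helper (JsonResponseCleaner is not part of LangChain4j) that removes a Markdown code fence if there is one, and keeps only what lies between the first opening brace and the last closing brace. It assumes the answer contains a single JSON object.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JsonResponseCleaner {

    // Three backticks, i.e. the Markdown code fence delimiter.
    private static final String FENCE = "`".repeat(3);

    // Matches a fenced block, with an optional "json" language hint.
    private static final Pattern FENCED = Pattern.compile(
        FENCE + "(?:json)?\\s*(.*?)\\s*" + FENCE, Pattern.DOTALL);

    static String extractJson(String response) {
        Matcher fenced = FENCED.matcher(response);
        // Use the content of the fenced block when there is one.
        String candidate = fenced.find() ? fenced.group(1) : response;
        // Drop any commentary before the first '{' or after the last '}'.
        int start = candidate.indexOf('{');
        int end = candidate.lastIndexOf('}');
        return (start >= 0 && end > start)
            ? candidate.substring(start, end + 1)
            : candidate.strip();
    }
}
```

The cleaned-up string can then be handed to your JSON parser as usual.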

Quick illustration with Gemini and LangChain4j:

String biography = "Anna is a 23 year old artist…";

var model = VertexAiGeminiChatModel.builder()
    .project(System.getenv("PROJECT_ID"))
    .location(System.getenv("LOCATION"))
    .modelName("gemini-1.5-pro-002")
    .build();

String response = model.generate("""
    Return the name and age of the person described in the biography
    below. Give the name and age in the form of a JSON object
    following this structure: `{"name": "Jon Doe", "age": 36}`
    Only return JSON, without any explanation,
    without surrounding markdown code markup.

    Here is the biography:

    """ + biography + """

    JSON:
    """);

System.out.println(response);
// {"name": "Anna", "age": 23}

The output is a String, so you have to parse it with your favorite JSON parser, but the data has been successfully extracted into a JSON object.

Most LLMs support the notion of system instructions. Usually, LLMs follow system instructions a bit more closely than instructions given in user prompts. So you could also rewrite the example above by moving the instructions into the system instructions, and putting only the biography in the user prompt.

Function calling to the rescue!

Before the advent of JSON modes and response schemas (that we’ll review in the next sections) a more certain way to get JSON outputs was to take advantage of function calling. You have to encourage the LLM to request a function call to extract the information. Here’s the trick.

{
  "name": "extractNameAndAgeFromBiography",
  "description": "extract the name and age of a person described in the biographical text given in input",
  "parameters": {
    "type": "object",
    "properties": {
      "name": {
        "type": "string"
      },
      "age": {
        "type": "integer"
      }
    },
    "required": ["name", "age"]
  }
}

You define a function whose signature looks like extractNameAndAgeFromBiography(String name, int age), following the OpenAPI specification. You should add very precise descriptions for the function and its arguments. Here, I could have added more information about the parameters, but the names seemed self-explanatory to me. Then you can just pass the biography directly, and it should just work out of the box.

You can add system instructions to request the model to call that function to find the name and age of the person. Some LLMs also allow you to force the model to request a function call.

What does it look like in Java with LangChain4j?

var model = VertexAiGeminiChatModel.builder()
    .project(System.getenv("PROJECT_ID"))
    .location(System.getenv("LOCATION"))
    .modelName("gemini-1.5-pro-002")
    .toolCallingMode(ToolCallingMode.ANY)
    .allowedFunctionNames(List.of("extractNameAndAgeFromBiography"))
    .build();

I specified the tool calling mode: this ANY value instructs the model to call one of the methods defined in the allowed function names list. It is a forced call request. The model will have to request the call.

Response<AiMessage> response = model.generate(
    List.of(
        SystemMessage.from("""
          Return the name and age of the person described by the user
          by calling the function `extractNameAndAgeFromBiography()`
          and passing the name and the age of the person recognized.
        """),
        UserMessage.from(biography)
    ),
    ToolSpecification.builder()
        .description("""
            extract the name and age of a person described
            in the biographical text given in input
        """)
        .name("extractNameAndAgeFromBiography")
        .parameters(
            JsonObjectSchema.builder()
                .addStringProperty("name")
                .addIntegerProperty("age")
                .required("name", "age")
                .build()
        )
        .build()
);

The generate() call is a bit more convoluted. With forced tool calling, the system message is not mandatory, but it can help ensure all parameters are passed as arguments. Look at how we defined the contract of the function extractNameAndAgeFromBiography() by creating an object schema with a string property and an integer property.

Now we’ll extract the function call request. We don’t look at the text content, as the model returns a tool execution request instead:

System.out.println(response
    .content()
    .toolExecutionRequests()
    .getFirst()
    .arguments());

// {"name":"Anna","age":23.0}

You can retrieve just the arguments, as a JSON string. It’s already following the JSON object structure we wished to obtain.

You might notice a minor annoyance here, though, which is the fact the age is not an integer, but a floating point number. I’m not entirely sure at this point why we don’t get an integer. I’ll have to dig a little deeper…
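
In the meantime, a possible workaround is to convert the number yourself once you have read it from the JSON arguments. The sketch below uses BigDecimal so that 23.0 becomes 23, while a value with a real fractional part fails loudly instead of being silently truncated. AgeCoercion and toExactInt() are hypothetical names, and the method takes the number as a string literal to stay independent of any JSON library.

```java
import java.math.BigDecimal;

class AgeCoercion {

    // Converts a JSON number literal such as "23" or "23.0" to an int.
    // Throws ArithmeticException for values like "23.5" (non-zero fraction)
    // or for values that don't fit in an int.
    static int toExactInt(String jsonNumber) {
        return new BigDecimal(jsonNumber.strip()).intValueExact();
    }
}
```

Here, toExactInt("23.0") returns 23, and toExactInt("23.5") throws an ArithmeticException.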

Let’s now have a look at the JSON mode and response schema approaches.

JSON mode approach

Some LLMs started offering the ability to request the model to output valid JSON. It’s not necessarily 100% certain that it will follow your requested format (for example, some JSON object keys could sometimes be named differently) but it works most of the time.

With the JSON mode (sometimes called structured output, or constrained decoding), we come back to our first approach, by prompting the LLM to generate JSON. But this time, we don’t have to nudge the LLM as much, because it must always generate valid JSON in output. It won’t add any Markdown markup, or any commentary.
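
Since the JSON mode alone doesn’t guarantee the key names, it can be worth checking the shape of the object after parsing it. Here is a small sketch that assumes your JSON library has already parsed the response into a Map. PersonJsonCheck and problems() are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class PersonJsonCheck {

    // Returns the list of problems found in the parsed JSON object.
    // The list is empty when the object has the expected shape.
    static List<String> problems(Map<String, Object> parsed) {
        List<String> problems = new ArrayList<>();
        if (!(parsed.get("name") instanceof String)) {
            problems.add("missing or non-string 'name'");
        }
        if (!(parsed.get("age") instanceof Number)) {
            problems.add("missing or non-numeric 'age'");
        }
        return problems;
    }
}
```

If the list isn’t empty, you can retry the request, or fall back to a stricter approach.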

Let’s see how to use the JSON mode with Gemini and LangChain4j:

var model = VertexAiGeminiChatModel.builder()
    .project(System.getenv("PROJECT_ID"))
    .location(System.getenv("LOCATION"))
    .modelName("gemini-1.5-pro-002")
    .responseMimeType("application/json")
    .build();

Notice how we set the response MIME type to application/json? That’s how we enable Gemini to always return valid JSON in output.

Response<AiMessage> response = model.generate(
    List.of(
        SystemMessage.from("""
            Return the name and age of the person described in the
            biography below. Give the name and age in the form of
            a JSON object following this structure:
            `{"name": "Jon Doe", "age": 36}`
            """
        ),
        UserMessage.from(biography)
    )
);

System.out.println(response.content().text());
// {"name": "Anna", "age": 23}

We just needed to encourage Gemini to follow the JSON structure shown in the example in the system instruction. We don’t have to give further nudges to the model to not output Markdown code markup, or to prevent it from adding extra explanations.

This gives great results, but to go even further and ensure that the returned JSON document is compliant with the format you really wish to get, you can also define a JSON response schema. That’s what we’re gonna see next.

Even better with JSON schema for structured output

In addition to the response MIME type, you can specify the JSON schema that the JSON response must comply with. Let’s complement the previous example, and add that schema definition:

var model = VertexAiGeminiChatModel.builder()
    .project(System.getenv("PROJECT_ID"))
    .location(System.getenv("LOCATION"))
    .modelName("gemini-1.5-pro-002")
    .responseMimeType("application/json")
    .responseSchema(Schema.newBuilder()
        .setType(Type.OBJECT)
        .putProperties("name", Schema.newBuilder()
            .setType(Type.STRING)
            .setDescription(
                "The name of the person described in the biography")
            .build())
        .putProperties("age", Schema.newBuilder()
            .setType(Type.INTEGER)
            .setDescription(
                "The age of the person described in the biography")
            .build())
        .addAllRequired(List.of("name", "age"))
        .build())
    .build();

The response should be an object with a string name property and an integer age property.

Bonus points with type safe objects with LangChain4j

In our LangChain4j-based examples in Java, the low-level APIs offered by the framework always responded with JSON strings. But as Java developers, we’d prefer to manipulate real Java objects instead. Of course, you can take advantage of the unmarshalling capabilities of your favorite JSON library. But what if the framework provided a higher level abstraction and did all the work for you? That’s where we’ll use LangChain4j’s AI services.

First, let’s define a data structure to hold the name and age of our biographies, with a Java record:

record Person(String name, int age) { }

The next step is to create a contract that the framework will implement for you. In input, a string biography, and in output, a Person record:

interface PersonExtractor {
    @SystemMessage("""
        Your role is to extract the name and age
        of the person described in the biography.
        """)
    Person extractPerson(String biography);
}

Notice how we annotate the method with a system instruction that tells the model what its role is.

We still need to instantiate our chat model:

var model = VertexAiGeminiChatModel.builder()
    .project(System.getenv("PROJECT_ID"))
    .location(System.getenv("LOCATION"))
    .modelName("gemini-1.5-pro-002")
    .responseMimeType("application/json")
    .responseSchema(SchemaHelper.fromClass(Person.class))
    .build();

We specify again the response MIME type, and also the response schema. But we’re using a convenience method provided by the SchemaHelper class to derive a schema from a Java class (here, our Person record).
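
To get a feel for what deriving a schema from a class involves, here is a simplified sketch based on Java reflection. This is not the actual SchemaHelper implementation: RecordSchemaSketch, describe(), and the type names it returns are assumptions for illustration, and nested objects, lists, and descriptions are left out.

```java
import java.lang.reflect.RecordComponent;
import java.util.LinkedHashMap;
import java.util.Map;

class RecordSchemaSketch {

    record Person(String name, int age) {}

    // Maps each record component to a schema-like type name,
    // keeping the declaration order of the components.
    static Map<String, String> describe(Class<? extends Record> type) {
        Map<String, String> properties = new LinkedHashMap<>();
        for (RecordComponent component : type.getRecordComponents()) {
            properties.put(component.getName(), jsonType(component.getType()));
        }
        return properties;
    }

    static String jsonType(Class<?> javaType) {
        if (javaType == String.class) return "STRING";
        if (javaType == int.class || javaType == Integer.class
                || javaType == long.class || javaType == Long.class) return "INTEGER";
        if (javaType == double.class || javaType == Double.class
                || javaType == float.class || javaType == Float.class) return "NUMBER";
        if (javaType == boolean.class || javaType == Boolean.class) return "BOOLEAN";
        return "OBJECT"; // nested types are not handled in this sketch
    }
}
```

Calling describe(Person.class) yields a map that prints as {name=STRING, age=INTEGER}, which mirrors the schema we wrote by hand in the previous section.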

Now we can instantiate our person extractor contract as follows:

PersonExtractor extractor =
    AiServices.create(PersonExtractor.class, model);

And finally, we can pass it the biography in input:

Person person = extractor.extractPerson(biography);

System.out.println(person.name());  // Anna
System.out.println(person.age());   // 23

We have an instance of our Person record in output that is properly populated with the name and age of the person described in our biography! That way, as Java developers, we manipulate a real Java object, in a type-safe manner! Our application is enhanced by an LLM, but from a developer perspective, we manipulate interfaces and objects.
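
This article doesn’t go into how AI services work internally, and the code below is not LangChain4j’s actual code. But the general idea of a framework implementing your interface at runtime can be illustrated with a JDK dynamic proxy. In this sketch, the backend function is a hypothetical stand-in for "call the model and map its JSON answer to a Person".

```java
import java.lang.reflect.Proxy;
import java.util.function.Function;

class ExtractorProxySketch {

    record Person(String name, int age) {}

    interface PersonExtractor {
        Person extractPerson(String biography);
    }

    // Creates an implementation of the interface at runtime.
    // Every call on the proxy is forwarded to the backend function,
    // which receives the first argument of the method (the biography).
    // Methods inherited from Object, like toString(), are not handled here.
    static PersonExtractor create(Function<String, Person> backend) {
        return (PersonExtractor) Proxy.newProxyInstance(
            PersonExtractor.class.getClassLoader(),
            new Class<?>[] { PersonExtractor.class },
            (proxy, method, args) -> backend.apply((String) args[0]));
    }
}
```

From the caller’s point of view, the proxy is just a PersonExtractor: the fact that an LLM sits behind it stays hidden behind the interface.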

Summary

Lots of articles, videos, or presentations often talk about the chatbot use case, when creating applications powered by large language models. However, data extraction is another very important and useful task where LLMs shine.

In this article, we saw different approaches to do data extraction: via prompting, function calling, or with a JSON mode or JSON schema. If your LLM supports the ability to set a response schema, that’s definitely the best way to get the JSON output you expect.

Also, if the LLM orchestration framework you use supports it, be sure to check if it’s able to return type-safe objects that you can manipulate with your programming language directly, without having to parse the JSON string yourself.

https://glaforge.dev/posts/2024/11/18/data-extraction-the-many-ways-to-get-llms-to-spit-json-content/
Things you never dared to ask about LLMs

Along my learning journey about generative AI, lots of questions popped up in my mind. I was very curious to learn how things worked under the hood in Large Language Models (at least having an intuition rather than knowing the maths in and out). Sometimes, I would wonder about how tokens are created, or how hyperparameters influence text generation.

Before the dotAI conference, I was invited to talk at the meetup organised by DataStax. I talked about all those things you never dared to ask about LLMs, sharing both the questions I came up with while learning about generative AI, and the answers I found along the way.

Without further ado, here’s the deck:

Abstract
Things you never dared to ask about LLMs

Large Language Models (LLMs) have taken the world by storm, powering applications from chatbots to content generation. Yet, beneath the surface, these models remain enigmatic.

This presentation will “delve” into the hidden corners of LLM technology that often leave developers scratching their heads. It’s time to ask those questions you’ve never dared ask about the mysteries underpinning LLMs.

Here are some of the questions we’ll answer:

  • Do you wonder why LLMs spit tokens instead of words? Where do those tokens come from?
  • What’s the difference between a “foundation” / “pre-trained” model, and an “instruction-tuned” one?
  • We’re often tweaking (hyper)parameters like temperature, top-p, top-k, but do you know how they really affect how tokens are picked up?
  • Quantization makes models smaller, but what are all those number encodings like fp32, bfloat16, int8, etc?
  • LLMs are good at translation, right? Do you speak the Base64 language too?

We’ll realize together that LLMs are far from perfect:

  • We’ve all heard about hallucinations, or should we say confabulations?
  • What is this reversal curse that makes LLMs ignore some facts from a different viewpoint?
  • You’d think that LLMs are deterministic at low temperature, but you’d be surprised by how the context influences LLMs’ answers…

Buckle up, it’s time to dispel the magic of LLMs, and ask those questions we never dared to ask!

This talk wasn’t recorded, but I hope to give this presentation again sometime soon, and hopefully, it’ll be recorded then. If that happens, I’ll share the video recording once it’s available.

Illustrations: Imagen 3 to the rescue

For those who are curious about the cute little robots that appear in this presentation, I’ve generated them with DeepMind’s Imagen 3 image generation model.

The quality of the output was really lovely, and I might have gone a bit overboard with the number of generated robots in this deck.

I would start pretty much all my prompts with “cartoon of a cute little robot…”

For my Java developer friends, you can generate images with Imagen via LangChain4j (as explained in that article where I generated black’n white ink drawings).

https://glaforge.dev/talks/2024/10/24/things-you-never-dared-to-ask-about-llms/
Advanced RAG Techniques

Retrieval Augmented Generation (RAG) is a pattern to let you prompt a large language model (LLM) about your own data, via in-context learning by providing extracts of documents found in a vector database (or potentially other sources too).

Implementing RAG isn’t very complicated, but the results you get are not necessarily up to your expectations. In the presentations below, I explore various advanced techniques to improve the quality of the responses returned by your RAG system:

Ingestion chunking techniques like:

  • Embedding of sliding windows of sentences
  • Hypothetical question embedding
  • Contextual retrieval embedding (invented recently by Anthropic)
  • Semantic chunking (created by Greg Kamradt)
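
To make the first technique in this list more concrete, here is a minimal sketch of sliding-window chunking in plain Java. It assumes the text has already been split into sentences. The class and method names (SlidingWindowChunker, windows) and the window and step parameters are hypothetical, they are not part of LangChain4j.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a LangChain4j class): groups sentences into overlapping windows.
final class SlidingWindowChunker {

    // Returns chunks made of `windowSize` consecutive sentences, moving forward by `step` sentences.
    // When step < windowSize, neighbouring chunks overlap and share some context.
    static List<String> windows(List<String> sentences, int windowSize, int step) {
        if (windowSize <= 0 || step <= 0) {
            throw new IllegalArgumentException("windowSize and step must be positive");
        }
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < sentences.size(); start += step) {
            int end = Math.min(start + windowSize, sentences.size());
            chunks.add(String.join(" ", sentences.subList(start, end)));
            if (end == sentences.size()) {
                break; // this window already reaches the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> sentences = List.of("One.", "Two.", "Three.", "Four.");
        windows(sentences, 3, 1).forEach(System.out::println);
    }
}
```

Each window would then be embedded and stored, so that a matching sentence comes back together with its neighbours.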

Retrieval techniques, including:

  • Query compression
  • Hypothetical Document Embedding (HyDE)

I also mention how an agentic approach can help with more advanced and complex needs, by providing intermediary results that are combined in a final response. Agentic RAG is a very important and promising approach that I’ll certainly come back to in upcoming articles.

At Devoxx Belgium 2024, I gave a 50-minute session, and a 3-hour long deep dive with my friend Cédrick Lunven from DataStax (we used the great Astra DB vector database in our demos). You’ll find both decks and videos below.

Code available on GitHub

All the code presented in those sessions is available in this GitHub repository

RAG: from dumb implementation to serious results

Abstract

Embarking on your RAG journey may seem effortless, but achieving satisfying results often proves challenging. Inaccurate, incomplete, or outdated answers, suboptimal document retrieval, and poor text chunking can quickly dampen your initial enthusiasm.

In this session, we’ll leverage LangChain4j to elevate your RAG implementations. We’ll explore:

  • Advanced Chunking Strategies: Optimize document segmentation for improved context and relevance.
  • Query Refinement Techniques: Expand and compress queries to enhance retrieval accuracy.
  • Metadata Filtering: Leverage metadata to pinpoint the most relevant documents.
  • Document Reranking: Reorder retrieved documents for optimal result presentation.
  • Data Lifecycle Management: Implement processes to maintain data freshness and relevance.
  • Evaluation and Presentation: Assess the effectiveness of your RAG pipeline and deliver results that meet user expectations.

Join us as we transform your simplistic RAG implementation from a source of frustration into one that delights your users with meaningful and accurate answers.

Presentation slide deck
YouTube video recording

RAG: from dumb implementation to serious results

Abstract

It’s easy to get started with Retrieval Augmented Generation, but you’ll quickly be disappointed with the generated answers: inaccurate or incomplete, missing context or outdated information, bad text chunking strategy, not the best documents returned by your vector database, and the list goes on.

After meeting thousands of developers across Europe, we’ve explored those pain points, and will share with you how to overcome them. As part of the team building a vector database we are aware of the different flavors of searches (semantic, meta-data, full text, multimodal) and embedding model choices. We have been implementing RAG pipelines across different projects and frameworks and are contributing to LangChain4j.

In this deep-dive, we will examine various techniques using LangChain4j to bring your RAG to the next level: with semantic chunking, query expansion & compression, metadata filtering, document reranking, data lifecycle processes, and how to best evaluate and present the results to your users.

Presentation slide deck
YouTube video recording
https://glaforge.dev/talks/2024/10/14/advanced-rag-techniques/
A Gemini and Gemma tokenizer in Java

It’s always interesting to know how the sausage is made, don’t you think? That’s why, a while ago, I looked at embedding model tokenization, and I implemented a little visualization to see the tokens in a colorful manner. Yet, I was still curious to see how Gemini would tokenize text…

Both LangChain4j Gemini modules (from Vertex AI and from Google AI Labs) can count the tokens included in a piece of text. However, both do so by calling a REST API endpoint method called countTokens. This is not ideal, as it requires a network hop to get the token counts, thus adding undesired extra latency. Wouldn’t it be nicer if we could count tokens locally instead?

Interestingly, both Gemini and the open-weights Gemma models share the same tokenizer and token vocabulary. Also, the tokenizer is based on SentencePiece, which is a tokenizer/detokenizer implementing the byte-pair-encoding (BPE) and unigram language algorithms.

If you look at the Gemma code on HuggingFace, you’ll see a tokenizer.json file that you can open to see the available tokens in the vocabulary, and a tokenizer.model file which is some kind of binary compressed variation.

Knowing that the list of tokens supported by Gemini and Gemma is available in those files, and how they are encoded, I was curious to see if I could implement a Java tokenizer that could run locally, rather than calling a remote endpoint.

The SentencePiece implementation from Google is a C++ library, but I didn’t really feel like wrapping it myself with JNI, and fortunately, I discovered that the DJL project had done the JNI wrapping job already.

So let’s see how to tokenize text for Gemini and Gemma, in Java!

Gemini and Gemma tokenization in Java with DJL

First of all, let’s set up the dependency on DJL’s SentencePiece module:

  • From Maven:
<dependency>
    <groupId>ai.djl.sentencepiece</groupId>
    <artifactId>sentencepiece</artifactId>
    <version>0.30.0</version>
</dependency>
  • From Gradle:
implementation 'ai.djl.sentencepiece:sentencepiece:0.30.0'

I saved the tokenizer.model file locally. Note that it’s a 4MB file, as Gemini/Gemma have a very large vocabulary of around a quarter of a million tokens!

Now, let’s instantiate an SpTokenizer object that loads this vocabulary file, and tokenize some text:

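import ai.djl.sentencepiece.SpTokenizer;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
// ...
// Load the SentencePiece vocabulary file saved locally
Path model = Paths.get("src/test/resources/gemini/tokenizer.model");
byte[] modelFileBytes = Files.readAllBytes(model);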

try (SpTokenizer tokenizer = new SpTokenizer(modelFileBytes)) {
    List<String> tokens = tokenizer.tokenize("""
    When integrating an LLM into your application to extend it and \
    make it smarter, it's important to be aware of the pitfalls and \
    best practices you need to follow to avoid some common problems \
    and integrate them successfully. This article will guide you \
    through some key best practices that I've come across.
    """);

    for (String token: tokens) {
        System.out.format("[%s]%n", token);
    }

    System.out.println("Token count: " + tokens.size());
}

When running this Java class, you’ll see the following output:

[When]
[▁integrating]
[▁an]
[▁L]
[LM]
[▁into]
[▁your]
[▁application]
...

Token count: 61
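
The ▁ character (U+2581) in this output is the marker SentencePiece uses for a whitespace. As an illustration, here is a minimal sketch in plain Java of how the original text could be put back together from such tokens. The SentencePieceTokens class and its join method are hypothetical names, and the sketch ignores special cases such as byte-fallback tokens.

```java
import java.util.List;

// Hypothetical helper, for illustration only: rebuilds text from SentencePiece-style tokens.
final class SentencePieceTokens {

    // SentencePiece encodes a whitespace as the "lower one eighth block" character.
    private static final char WHITESPACE_MARKER = '\u2581';

    // Concatenates the tokens and turns every whitespace marker back into a regular space.
    static String join(List<String> tokens) {
        return String.join("", tokens).replace(WHITESPACE_MARKER, ' ');
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("When", "\u2581integrating", "\u2581an", "\u2581L", "LM");
        System.out.println(join(tokens));
    }
}
```

This also shows why “LLM” counts as two tokens here: the [▁L] and [LM] pieces are glued back together without a space between them.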
Next steps

Do we need next steps? Yes, why not! My idea is to contribute a tokenizer module to LangChain4j, so that the Vertex AI Gemini and the Google AI Gemini modules can both import it, instead of relying on remote endpoint calls to count tokens.

https://glaforge.dev/posts/2024/10/04/a-gemini-and-gemma-tokenizer-in-java/
AI Inktober — Generating ink drawings with Imagen 3

Every year in October, the Inktober challenge takes place: every day of the month, you have to do a drawing representing the word of the day. The list of prompts this year is the following:

Inktober 2024 prompts

I participated in some of the daily challenges the past few years, but I never did all of them. But this year, for fun, I thought I could ask Google’s Imagen 3 image model to draw for me! (Or at least to draw something I could try to reproduce.)

Of course, the goal of the challenge is not to generate images with the help of an AI. On the contrary, the idea is about the pleasure you can have drawing yourself, with your own hands! However, I was curious to see how Imagen would perform on such a challenge.

So I fired up my favorite Java AI framework, LangChain4j, which supports Imagen 3 as an image model.

Note

Imagen 3 is generally available on Google Cloud’s Vertex AI platform, but it’s behind an allow list. So you have to request access to be able to use it.

You will need the following dependency:

  • For Gradle users:
implementation 'dev.langchain4j:langchain4j-vertex-ai:0.35.0'
  • For Maven users:
<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-vertex-ai</artifactId>
    <version>0.35.0</version>
</dependency>

Now let’s have a look at the code:

import dev.langchain4j.data.image.Image;
import dev.langchain4j.model.output.Response;
import dev.langchain4j.model.vertexai.VertexAiImageModel;

import java.nio.file.Path;

public class AInktober {
    public static void main(String[] args) {
        VertexAiImageModel imagenModel = VertexAiImageModel.builder()
            .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
            .location(System.getenv("GCP_LOCATION"))
            .project(System.getenv("GCP_PROJECT_ID"))
            .publisher("google")
            .modelName("imagen-3.0-fast-generate-001")
            .aspectRatio(VertexAiImageModel.AspectRatio.SQUARE)
            .negativePrompt("watercolor, gray shades")
            .persistTo(Path.of("/tmp/imagen"))
            .build();

        String prompt = """
            A black and white ink drawing of a
            backpack, on a fully white background
            """;

        Response<Image> imageResponse = imagenModel.generate(prompt);
        System.out.println(imageResponse.content().url());
    }
}
  • I have set up several environment variables containing my Google Cloud project details.
  • I decided to use imagen-3.0-fast-generate-001, which generates images faster (and cheaper!) than imagen-3.0-generate-001 at the cost of a slightly lower quality (but for ink drawings, that’s not really visible).
  • I went with square images, but you can use landscape, portrait, and wider variants too.
  • I added a negative prompt, because some images looked a bit more like watercolor at times, but I wanted images more black and white.
  • I persist all the generated images into a temporary folder.
  • My prompt contains the first subject of the day, a “backpack”, and I specify that I want a black and white ink drawing, but I also added that I wanted a white background, as sometimes the background can be fully black, or some sepia shade.

So what does the first image look like?

Inktober 2024’s backpack

It definitely looks like an ink drawing of a backpack!

Don’t worry, I won’t post a new article each day for the new daily image prompt. Instead, I’ll share the other days on my usual social media channels (see the bottom of the blog to find them).

Be sure to check out Imagen 3, it’s pretty good!

https://glaforge.dev/posts/2024/10/01/ai-nktober-generating-ink-drawings-with-imagen/
Lots of new cool Gemini stuff in LangChain4j 0.35.0

While LangChain4j 0.34 introduced my new Google AI Gemini module, a new 0.35.0 version is already here today, with some more cool stuff for Gemini and Google Cloud!

Let’s have a look at what’s in store!

Gemini 1.5 Pro 002 and Gemini 1.5 Flash 002

This week, Google announced the release of the new versions of the Gemini 1.5 models:

  • gemini-1.5-pro-002
  • gemini-1.5-flash-002

Of course, both models are supported by LangChain4j! The Google AI Gemini module also supports the gemini-1.5-flash-8b-exp-0924 8-billion parameter model.

Versions 002 come with:

  • much improved math and reasoning capabilities
    (7%-20% increase depending on the benchmark),
  • 2x faster output, and 3x lower latency,
  • and also roughly a 50% price cut!
Google Cloud Storage document loader

When implementing Retrieval Augmented Generation (RAG), you must load the documents from somewhere. You can feed the docs directly in the context, but LangChain4j comes with the notion of document loaders. There are existing document loaders for the file system, for files at remote URLs, or source files stored in GitHub.

In this release, I’ve implemented a Google Cloud Storage document loader, which lets you reference documents stored inside cloud storage buckets.

Create a GCS document loader with the new builder:

var gcsLoader = GoogleCloudStorageDocumentLoader.builder()
    .project(System.getenv("GCP_PROJECT_ID"))
    .build();

Then you can load a single document, and parse it:

Document document = gcsLoader.loadDocument(
    "BUCKET_NAME", "FILE_NAME.txt", new TextDocumentParser());

All the documents in a bucket:

List<Document> documents = gcsLoader.loadDocuments(
    "BUCKET_NAME", new TextDocumentParser());

Or just a list of files filtered with a glob pattern:

List<Document> documents = gcsLoader.loadDocuments(
    "BUCKET_NAME", "*.txt", new TextDocumentParser());
Vertex AI Ranking API

When implementing Retrieval Augmented Generation (RAG), your vector database returns a certain number of results, usually sorted by vector similarity. But the results with the highest vector similarity are not necessarily the best matches to answer a user query. To mitigate this problem, ranking (or reranking) APIs and models can order results according to how well they match the query.

The Vertex AI platform from Google Cloud offers a ranking API for that purpose, a little known API that deserves more awareness. I implemented a ScoringModel for this Vertex AI Ranking API:

VertexAiScoringModel scoringModel = VertexAiScoringModel.builder()
    .projectId(System.getenv("GCP_PROJECT_ID"))
    .projectNumber(System.getenv("GCP_PROJECT_NUMBER"))
    .projectLocation(System.getenv("GCP_LOCATION"))
    .model("semantic-ranker-512")
    .build();

Response<List<Double>> score = scoringModel.scoreAll(Stream.of(
        "The sky appears blue due to a phenomenon called Rayleigh " +
        "scattering. Sunlight is comprised of all the colors of " +
        "the rainbow. Blue light has shorter wavelengths than other " +
        "colors, and is thus scattered more easily.",

        "A canvas stretched across the day,\n" +
        "Where sunlight learns to dance and play.\n" +
        "Blue, a hue of scattered light,\n" +
        "A gentle whisper, soft and bright."
        ).map(TextSegment::from).collect(Collectors.toList()),
    "Why is the sky blue?");

// [0.8199999928474426, 0.4300000071525574]

In the example above, a user asks why the sky is blue. The Ranking API attempts to determine which of two excerpts best matches this question. The first excerpt appears to be an explanation of this celestial phenomenon, while the second sounds more like a poem. When scoring these text fragments, we observe that the first one has a higher value (0.82 vs. 0.43).

It is also possible to score just one piece of text with the score(text, query) and score(segment, query) methods.
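
To illustrate what can be done with such scores, here is a minimal sketch in plain Java that orders texts by descending score and keeps the best ones. The Reranker class and its topN method are hypothetical names, and the scores are passed in directly rather than fetched from the Ranking API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, for illustration only: keeps the best-scored texts.
final class Reranker {

    // Returns at most `n` texts, ordered from the highest score to the lowest.
    // scores.get(i) is assumed to be the relevance score of texts.get(i).
    static List<String> topN(List<String> texts, List<Double> scores, int n) {
        if (texts.size() != scores.size()) {
            throw new IllegalArgumentException("one score per text is required");
        }
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < texts.size(); i++) {
            order.add(i);
        }
        // Sort the indices by their score, best first
        order.sort(Comparator.comparingDouble((Integer i) -> scores.get(i)).reversed());

        List<String> best = new ArrayList<>();
        for (int rank = 0; rank < Math.min(n, order.size()); rank++) {
            best.add(texts.get(order.get(rank)));
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> texts = List.of("poem about the sky", "Rayleigh scattering explanation");
        List<Double> scores = List.of(0.43, 0.82);
        System.out.println(topN(texts, scores, 1));
    }
}
```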

Now what’s interesting is that this LangChain4j notion of scoring models is also well integrated in the RAG pipeline:

VertexAiScoringModel scoringModel = VertexAiScoringModel.builder()
    .projectId(System.getenv("GCP_PROJECT_ID"))
    .projectNumber(System.getenv("GCP_PROJECT_NUMBER"))
    .projectLocation(System.getenv("GCP_LOCATION"))
    .model("semantic-ranker-512")
    .build();

ContentAggregator contentAggregator =
    ReRankingContentAggregator.builder()
        .scoringModel(scoringModel)
        ...
        .build();

RetrievalAugmentor retrievalAugmentor =
    DefaultRetrievalAugmentor.builder()
        ...
        .contentAggregator(contentAggregator)
        .build();

return AiServices.builder(Assistant.class)
    .chatLanguageModel(...)
    .retrievalAugmentor(retrievalAugmentor)
    .build();

When creating the AI service, you specify the chat model to use. Additionally, you can integrate a retrieval augmentor, which allows you to configure a content aggregator. The content aggregator, in turn, can specify a scoring model. This process involves three steps, but it enables you to leverage the ranking of semantic search results when implementing RAG. This means you can prioritize the most relevant results based on their semantic similarity, not solely on their vector similarity.

New parameters for the Vertex AI embedding models

Embedding models are critical for RAG, and LangChain4j has had support for the Google Cloud Vertex AI embedding models for a long time. But there are a couple of new flags that have recently been introduced:

EmbeddingModel embeddingModel = VertexAiEmbeddingModel.builder()
    .project(PROJECT_ID)
    .location("us-central1")
    .publisher("google")
    .modelName(MODEL_NAME)
    .autoTruncate(true)
    .outputDimensionality(512)
    .build();

The autoTruncate(true) method automatically truncates text to embed to a maximum of 2048 tokens. If your input is longer than this limit, you would get an error from the model. With auto-truncation, no more error, but if your text is truncated, you might miss a bit of meaning from the part that was cut off.

The other new method is outputDimensionality(512). The Vertex AI embedding models usually default to 768-dimensional vectors. However, our latest embedding models are Matryoshka embedding models, which means that the most meaningful values in the vector come first. So when you do vector comparisons, you can make calculations quicker if you focus on the first dimensions only, and with this new method, you can directly get back vectors with fewer dimensions.
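
To give an intuition of what keeping only the first dimensions means, here is a minimal sketch in plain Java, applied on the client side to a vector you already have. The MatryoshkaVectors class and its methods are hypothetical names. Re-normalizing after truncation is an assumption that makes cosine comparisons behave well, and it is not something the text above prescribes.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only: shortens embedding vectors and compares them.
final class MatryoshkaVectors {

    // Keeps the first `dimensions` values of the vector and rescales them to unit length.
    static float[] truncate(float[] vector, int dimensions) {
        if (dimensions <= 0) {
            throw new IllegalArgumentException("dimensions must be positive");
        }
        float[] head = Arrays.copyOf(vector, Math.min(dimensions, vector.length));
        double norm = 0;
        for (float value : head) {
            norm += value * value;
        }
        norm = Math.sqrt(norm);
        if (norm > 0) {
            for (int i = 0; i < head.length; i++) {
                head[i] /= norm;
            }
        }
        return head;
    }

    // Cosine similarity between two vectors of the same length (NaN if one of them is all zeros).
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Comparing two 512-dimensional vectors takes fewer multiplications than comparing two 768-dimensional ones, which is where the speed-up comes from.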

Google AI embedding model

Speaking of embedding models, if you use the Google AI Gemini model instead of the Vertex AI flavor, you can now also access our embedding models without relying on the Vertex AI models, thanks to the new embedding model for Google AI:

var embeddingModel = GoogleAiEmbeddingModel.builder()
    .apiKey(GOOGLE_AI_GEMINI_API_KEY)
    .modelName("embedding-001")
    .maxRetries(3)
    .logRequestsAndResponses(true)
    .titleMetadataKey("title")
    .taskType(GoogleAiEmbeddingModel.TaskType.RETRIEVAL_DOCUMENT)
    .outputDimensionality(512)
    .build();

This new embedding model is the same as the one coming from Vertex AI, and has the same feature set.

Google AI Gemini token count estimation and tokenizer

The Google AI Gemini model implements the TokenCountEstimator interface, which means you can use the estimateTokenCount() method to count tokens:

var gemini = GoogleAiGeminiChatModel.builder()
    .apiKey(GOOGLE_AI_GEMINI_API_KEY)
    .modelName("gemini-1.5-flash")
    .build();

int countedTokens = gemini.estimateTokenCount(
    "What is the capital of France?");

There is also now a GoogleAiGeminiTokenizer class, implementing the misnamed Tokenizer interface (misnamed because it’s not tokenizing text, it’s just counting tokens):

var geminiTokenizer = GoogleAiGeminiTokenizer.builder()
    .apiKey(GOOGLE_AI_GEMINI_API_KEY)
    .modelName("gemini-1.5-flash")
    .build();

int count = geminiTokenizer.estimateTokenCountInText("Hello world!");

Note that both the estimateTokenCount() method and the GoogleAiGeminiTokenizer call a remote API endpoint. They don’t use a tokenizer class to count the tokens, so those calls incur some network hops.

What’s interesting with the Tokenizers is that they can be used by document splitters to split documents according to the number of tokens, rather than by characters or other boundaries:

DocumentSplitter splitter = DocumentSplitters.recursive(
    maxSegmentSizeInTokens,
    maxOverlapSizeInTokens,
    geminiTokenizer);

Currently, only the Google AI module implements this Tokenizer interface, but it can be used with the Vertex AI Gemini module as well. Later down the road, I think I’ll also implement it for the Vertex AI module.
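
To give an intuition of token-based splitting, here is a simplified sketch in plain Java. It is not LangChain4j’s recursive splitter: it greedily packs paragraphs into segments while staying under a token budget, and it doesn’t handle overlap. The TokenBudgetSplitter class is a hypothetical name, and the token counter is passed in as a function. The countWords method is only a crude stand-in for a real token counter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical helper, for illustration only: packs paragraphs into segments under a token budget.
final class TokenBudgetSplitter {

    // Adds paragraphs to the current segment until the next one would exceed `maxTokens`.
    // A single paragraph that is over the budget on its own is kept whole.
    static List<String> split(List<String> paragraphs, int maxTokens,
                              ToIntFunction<String> tokenCounter) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String paragraph : paragraphs) {
            String candidate = current.length() == 0
                ? paragraph
                : current + "\n\n" + paragraph;
            if (current.length() > 0 && tokenCounter.applyAsInt(candidate) > maxTokens) {
                segments.add(current.toString());       // close the current segment
                current = new StringBuilder(paragraph); // and start a new one
            } else {
                current = new StringBuilder(candidate);
            }
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }

    // Crude stand-in for a tokenizer: counts whitespace-separated words.
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}
```

Note that the counter is invoked once per paragraph here. With a counter that calls a remote endpoint, each of those invocations is a network hop, which is another reason why a local tokenizer would be handy.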

Chat listener support

Both the Google AI Gemini and the Vertex AI modules implement the new chat listener support.

  • Vertex AI Gemini:
VertexAiGeminiChatModel.builder()
    .project(System.getenv("GCP_PROJECT_ID"))
    .location(System.getenv("GCP_LOCATION"))
    .modelName("gemini-1.5-pro-002")
    .listeners(singletonList(listener))
    .build();
  • Google AI Gemini:
GoogleAiGeminiChatModel.builder()
    .apiKey(System.getenv("GEMINI_API_KEY"))
    .modelName("gemini-1.5-flash-002")
    .listeners(singletonList(listener))
    .build();

Let’s have a look at the listener interface, which allows you to listen to model requests, responses, and errors:

public interface ChatModelListener {
  default void onRequest(ChatModelRequestContext reqContext) {...}
  default void onResponse(ChatModelResponseContext respContext) {...}
  default void onError(ChatModelErrorContext errContext) {...}
}

The various *Context parameters contain various details about how the model was parameterized, what the prompt was, or what error was encountered.

It might be interesting to follow the recent OpenTelemetry GenAI recommendations and implement a listener that directly plugs into your observability solution!

Enum structured output

I’ll finish the laundry list of features with the enum structured output.

The Gemini models have great support for structured output. Not only can you ask for JSON outputs, but you can also specify a JSON schema so that the model follows that schema for generating its JSON response. This is of utmost importance for deterministic parseable results that fit well with your strongly typed programming language.

Gemini lets you return arbitrary JSON objects and arrays. But for tasks like classification or sentiment analysis, it is also able to return a single enum value, rather than a JSON object that would have a property containing the value.

  • Vertex AI Gemini:
VertexAiGeminiChatModel model = VertexAiGeminiChatModel.builder()
    .project(System.getenv("GCP_PROJECT_ID"))
    .location(System.getenv("GCP_LOCATION"))
    .modelName(GEMINI_1_5_PRO)
    .responseSchema(Schema.newBuilder()
        .setType(Type.STRING)
        .addAllEnum(Arrays.asList("POSITIVE", "NEUTRAL", "NEGATIVE"))
        .build())
    .build();

Response<AiMessage> response = model.generate(asList(
    SystemMessage.from(
        "Your role is to analyse the sentiment of user's messages"),
    UserMessage.from(
        "This is super exciting news, congratulations!")
));

System.out.println(response.content().text());
// POSITIVE
  • Google AI Gemini:
GoogleAiGeminiChatModel gemini = GoogleAiGeminiChatModel.builder()
    .apiKey(System.getenv("GEMINI_API_KEY"))
    .modelName("gemini-1.5-flash")
    .responseFormat(ResponseFormat.builder()
        .type(JSON)
        .jsonSchema(JsonSchema.builder()
            .rootElement(JsonObjectSchema.builder()
                .properties(new LinkedHashMap<String, JsonSchemaElement>() {{
                    put("sentiment", JsonEnumSchema.builder()
                        .enumValues("POSITIVE", "NEUTRAL", "NEGATIVE")
                        .build());
                }})
                .build())
            .build())
        .build())
    .build();

ChatResponse response = gemini.chat(ChatRequest.builder()
    .messages(
        SystemMessage.from(
            "Your role is to analyse the sentiment of user's messages"),
        UserMessage.from(
            "This is super exciting news, congratulations!"
        )
    )
    .build());

System.out.println(response.aiMessage().text());
// POSITIVE

This is particularly useful for all sorts of classification tasks!
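
On the Java side, such a reply can then be mapped to a real enum. Here is a minimal sketch in plain Java, assuming the model’s reply is the bare constant name, possibly surrounded by whitespace or quotes. The Sentiment enum and the SentimentParser class are hypothetical names.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical enum mirroring the values declared in the response schema
enum Sentiment { POSITIVE, NEUTRAL, NEGATIVE }

// Hypothetical helper, for illustration only: turns the model's text reply into a Sentiment.
final class SentimentParser {

    // Returns the matching constant, or an empty Optional when the reply isn't a known value.
    static Optional<Sentiment> parse(String modelOutput) {
        if (modelOutput == null) {
            return Optional.empty();
        }
        // Tolerate surrounding whitespace, quotes, and lowercase replies
        String name = modelOutput.trim().replace("\"", "").toUpperCase(Locale.ROOT);
        try {
            return Optional.of(Sentiment.valueOf(name));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("POSITIVE"));
    }
}
```

Returning an Optional rather than throwing leaves it to the caller to decide what to do with an unexpected reply.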

Documentation updates

As I often used to say when working on the Apache Groovy project:

“A feature doesn’t exist if it’s not documented.”

— Guillaume Laforge

With that motto in mind, I thought it was high time that I expanded the documentation for the Gemini related pages of the LangChain4j documentation:

Tell me what you use LangChain4j’s Gemini support for!

I’m really curious about what you are developing with LangChain4j, and even more so, if you’re using the various Gemini components. Don’t hesitate to reach out to me, via the social media platforms mentioned below! I’m also interested in the features you’d like to see prioritized and implemented.

https://glaforge.dev/posts/2024/09/29/lots-of-new-cool-gemini-stuff-in-langchain4j/
Some advice and good practices when integrating an LLM in your application

When integrating an LLM into your application to extend it and make it smarter, it’s important to be aware of the pitfalls and best practices you need to follow to avoid some common problems and integrate them successfully. This article will guide you through some key best practices that I’ve come across.

Understanding the Challenges of Implementing LLMs in Real-World Applications

One of the first challenges is that LLMs are constantly being improved. This means that the model you start using could change under the hood, and suddenly your application doesn’t work as it did before. Your prompts might need adjustments to work with the newer version, or worse, they might even lead to unintended results!

Furthermore, you need to consider how to effectively manage your prompts, especially when your applications start to get more complex. Prompts can easily become convoluted and difficult to modify. Imagine having to comb through a hundred lines of code in your application to find a specific prompt just to tweak a single word. That’s why prompt externalization (not keeping your prompts inside your source files) is going to be important, to easily find your prompts, and have a bird’s-eye view of all of them!

Keeping track of changes and ensuring that the LLM’s behavior remains consistent throughout your development process is another challenge. How can you ensure that a particular feature still functions correctly after upgrading your prompts, or even changing model versions? You need to version your prompts (we’ll cover that in more detail in a moment). Think of your prompts like code: just like your software code, prompts should have version control for easy management. Versioning ensures that you can quickly revert to previous versions if necessary, while providing a helpful audit trail to see exactly what changes have occurred in your prompt management process.

Prompt Engineering for Consistent and Effective LLM Applications

You’ve probably noticed that one of the main things that determines how well your application works with a Large Language Model (LLM) is the prompt you use to guide it. Prompts act like a guidebook for the LLM, explaining what you expect from it and how it should format its response. You’ve likely heard about the importance of using good prompts, but how do you go about creating prompts that are reliable and adaptable in the long run?

Think of your prompts like code artifacts. Just as you version your code to keep track of changes and ensure consistency, you should also version your prompts. This allows you to:

  • Keep track of your prompt evolution: You’ll have a clear record of how your prompts have changed over time, which helps you understand the application’s evolving behavior.
  • Create a helpful audit trail: Having versions of your prompts will help you understand exactly how the application behaved at specific times. This is essential for debugging, diagnosing issues, or understanding how user feedback impacted your application.
  • Enable rollbacks: If you encounter an issue or want to test different prompt versions, you can easily revert to a previous state to ensure that you can isolate problems, revert to previously working versions, or simply experiment with different phrasing.
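
To make the three points above concrete, here’s a minimal sketch of a prompt history kept in memory. The PromptHistory class and its method names are hypothetical, made up for this illustration; in a real project, your version control system (Git, for example) would usually play this role.

```java
import java.util.ArrayList;
import java.util.List;

/** Keeps every revision of one prompt, so changes can be audited or rolled back. */
final class PromptHistory {

    private final List<String> revisions = new ArrayList<>();

    /** Stores a new revision and returns its 1-based version number. */
    int publish(String text) {
        revisions.add(text);
        return revisions.size();
    }

    /** Returns the text of a given version (1-based). */
    String get(int version) {
        return revisions.get(version - 1);
    }

    /** Returns the most recent revision; fails if nothing was published yet. */
    String latest() {
        return get(revisions.size());
    }

    /**
     * Rolls back by re-publishing an older revision as the newest one.
     * Nothing is deleted, so the audit trail stays complete.
     */
    int rollbackTo(int version) {
        return publish(get(version));
    }
}
```

Rolling back appends a copy instead of deleting newer revisions, which is what keeps the audit trail intact.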

But simply versioning prompts isn’t enough. Imagine you need to make a change to one particular prompt in a massive LLM-powered application. It might involve a lot of tedious code-hunting. That’s where prompt externalization, which we mentioned earlier, comes in! Externalizing prompts is all about taking them out of your code and treating them like separate configuration files. This way, they are:

  • Easy to modify: Changing your prompts becomes a breeze. Just go to your external prompt file, make the adjustments, and you’re ready to go! No more scouring through complex code to find a single prompt in some string variables somewhere.
  • More flexible: By using externalized prompts, you can easily experiment with different versions or phrasing without rewriting your entire application. This lets you quickly adapt your prompts in response to user feedback or changes in your model.
  • Easier to manage: Keeping prompts in their own dedicated file makes it easy to maintain them, making sure that your prompts are consistent and up-to-date. This approach becomes increasingly valuable as your applications become more complex and you have a growing set of prompts to maintain.
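
As an illustration, here’s a minimal sketch of externalized prompts using only the JDK. The PromptTemplates class, the .prompt.txt file suffix, and the {{placeholder}} syntax are all assumptions I made up for this example; they are not a standard format.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

/** Loads prompt templates from a directory and fills in {{placeholders}}. */
final class PromptTemplates {

    private final Path promptDir;

    PromptTemplates(Path promptDir) {
        this.promptDir = promptDir;
    }

    /** Reads <promptDir>/<name>.prompt.txt as UTF-8 text. */
    String load(String name) {
        try {
            return Files.readString(promptDir.resolve(name + ".prompt.txt"));
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read prompt: " + name, e);
        }
    }

    /** Replaces each {{key}} found in the template with its value. */
    static String render(String template, Map<String, String> variables) {
        String result = template;
        for (Map.Entry<String, String> entry : variables.entrySet()) {
            result = result.replace("{{" + entry.getKey() + "}}", entry.getValue());
        }
        return result;
    }
}
```

With a file such as prompts/summarize.prompt.txt, the application code would only call load("summarize") and render(...), and tweaking the wording of the prompt would no longer require touching any Java source file.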

Open source projects and open formats have emerged recently to externalize prompts. For example, Firebase’s GenKit LLM framework came up with its dotPrompt format, which externalizes not only the prompt itself, but also the name of the model and its configuration (temperature, etc.).

Model Versioning: Preventing Surprises

Now let’s cover the importance of managing model versions, a critical aspect of ensuring that your LLM-powered application continues to work reliably.

Imagine that you’ve built a great application using a particular model, and you’re proud of the results. However, what happens when the LLM provider releases an updated version? It might offer performance improvements, but the updates can also change how the model responds to your prompts, potentially leading to unexpected issues or even breaking your application.

To avoid these unexpected changes, the key principle is to pin the specific version of the LLM model that you use for your application. For example, when using Gemini 1.5 Pro, if you use the version gemini-1.5-pro, you’re actually using the latest version of the model. Currently, it’s gemini-1.5-pro-001. But if tomorrow Google releases gemini-1.5-pro-002, your application would suddenly start using that new version. So be very explicit in the model version.
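
A small startup check can catch an unpinned model name before it reaches production. The sketch below assumes that pinned names end with a three-digit suffix such as -001, as in the Gemini example above; other providers use different naming schemes, so the pattern would need adapting. The ModelVersions class is hypothetical.

```java
import java.util.regex.Pattern;

/** Rejects model names that don't carry an explicit version suffix. */
final class ModelVersions {

    // Assumption: a pinned name ends with a dash and three digits, e.g. "-001".
    private static final Pattern PINNED = Pattern.compile(".+-\\d{3}");

    static boolean isPinned(String modelName) {
        return PINNED.matcher(modelName).matches();
    }

    /** Returns the name unchanged if pinned, otherwise fails fast. */
    static String requirePinned(String modelName) {
        if (!isPinned(modelName)) {
            throw new IllegalArgumentException(
                "Model name is not pinned to a version: " + modelName);
        }
        return modelName;
    }
}
```

You could wrap the model name with ModelVersions.requirePinned(...) wherever you configure your model, so that a floating alias fails at startup rather than silently changing behavior later.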

Here’s why this is essential:

  • Avoid Drifting Model Behavior: The update to an LLM might come with subtle changes that can lead to a shift in the model’s responses, and you may not always be able to anticipate these changes beforehand. This can lead to inconsistency, where a prompt that generated a certain output in one version of the model generates a completely different (and perhaps undesirable) output in a newer version.
  • Maintain Application Consistency: To keep your application performing reliably, you want to control the LLM’s responses as much as possible, and pinning the model version ensures that you can do this. If you’re using a specific model, the prompts that are part of your application work in the context of that model’s specific training and behaviors. Pinning the version helps you avoid unexpected changes that may interfere with your prompts’ effectiveness.
  • Simplify Auditing and Debugging: In case of an unexpected issue or an unexplained change in your LLM’s behavior, being able to easily trace back the specific model version that’s running provides invaluable context for debugging and understanding why those changes occurred. It helps isolate issues to specific model versions, so you can resolve them quicker.

While using the latest and greatest LLM version might seem tempting for its improved capabilities, remember: the consistent performance and reliability of your application should be a top priority. By pinning the model version you use, you gain better control over its behavior and maintain a smooth and predictable experience for your users.

Optimizing for Efficiency: The Power of Response Caching

Even with well-crafted prompts and pinned versions, generating responses from a Large Language Model (LLM) can still be expensive. This is where response caching comes in, offering a crucial way to improve both the performance and the cost-efficiency of your application.

Models like Gemini support context caching. Quoting the documentation:

Use context caching to reduce the cost of requests that contain repeat content with high input token counts. Cached context items, such as a large amount of text, an audio file, or a video file, can be used in prompt requests to the Gemini API to generate output. Requests that use the same cache in the prompt also include text unique to each prompt. For example, each prompt request that composes a chat conversation might include the same context cache that references a video along with unique text that comprises each turn in the chat. The minimum size of a context cache is 32,768 tokens.

By caching frequently used responses or heavy multimodal documents, you avoid having to generate them over and over again, leading to a dramatic improvement in performance and a reduction in LLM usage costs. Imagine users frequently asking the same question, like “What are the benefits of using your app?”. By caching the response to this question, you’ll be able to provide users with a fast and efficient response without burdening the LLM each time.

But how do you actually implement caching? You can choose different strategies for your caching system, each with its own benefits:

  • Context Caching: If your model, like Gemini, supports caching already, be sure to understand how it works, what can be cached or not, the pros and cons, or potential limitations.
  • Basic Caching: Store LLM responses based on the exact input. If you encounter a query that you’ve already generated, you can provide a pre-cached response, saving on processing time. You could also do some minimal string modifications to normalize whitespace, put everything in lowercase, etc, to get the chance to cache very similar prompts.
  • Advanced Caching with Similarity Search: Even with string normalization, you might find that users don’t always ask the exact same question, but the query can still be extremely similar. Think of typos, minor word substitutions, synonyms, or variations in punctuation. Instead of treating every query as unique, consider approximate nearest neighbor search and embedding vector similarity. This approach helps you find queries that are nearly identical, even with minor variations. You can then leverage this functionality to serve the same cached response for queries that are semantically similar, increasing the effectiveness of your caching strategy and ensuring that you only compute distinct queries once. For vector similarity, make sure to test with different inputs, to find the right threshold to say that a new prompt is equivalent to an older cached prompt/response pair.
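
The basic caching strategy can be sketched in a few lines. The ResponseCache class below is hypothetical, and the model is represented by a plain Function<String, String> rather than a real LLM client, so that the example stays self-contained.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Exact-match response cache, keyed by a normalized form of the prompt. */
final class ResponseCache {

    private final Map<String, String> cache = new ConcurrentHashMap<>();

    /** Trims, collapses runs of whitespace, and lowercases the prompt. */
    static String normalize(String prompt) {
        return prompt.strip().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    /** Returns the cached response, calling the model only on a cache miss. */
    String getOrGenerate(String prompt, Function<String, String> model) {
        return cache.computeIfAbsent(normalize(prompt), key -> model.apply(prompt));
    }
}
```

A production cache would also need an eviction policy and an expiration time, and should probably include the model version and the prompt version in the key, so that stale answers aren’t served after an upgrade.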

Caching responses not only speeds up your LLM-powered application, lowering the perceived latency, but also significantly cuts down on LLM usage costs, helping you keep your application running smoothly while maximizing cost-effectiveness.
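
For the similarity-based strategy, here’s a minimal sketch that compares embedding vectors with cosine similarity. It assumes the embeddings are computed elsewhere by an embedding model, and it does a simple linear scan, not a true approximate nearest neighbor search, which a vector database would provide. The SemanticCache class is hypothetical, and the threshold has to be tuned on your own data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Serves a cached response when a query embedding is close enough to a stored one. */
final class SemanticCache {

    private record Entry(double[] embedding, String response) { }

    private final List<Entry> entries = new ArrayList<>();
    private final double threshold;

    /** @param threshold minimum cosine similarity to count as "the same question" */
    SemanticCache(double threshold) {
        this.threshold = threshold;
    }

    /** Cosine similarity of two vectors of the same length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    void put(double[] embedding, String response) {
        entries.add(new Entry(embedding, response));
    }

    /** Returns the response of the most similar entry at or above the threshold. */
    Optional<String> lookup(double[] queryEmbedding) {
        Entry best = null;
        double bestScore = threshold;
        for (Entry entry : entries) {
            double score = cosine(queryEmbedding, entry.embedding());
            if (score >= bestScore) {
                best = entry;
                bestScore = score;
            }
        }
        return Optional.ofNullable(best).map(Entry::response);
    }
}
```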

Building Safeguards: Ensuring Robustness with Guardrails

Let’s shift our focus to building safety mechanisms. This is crucial for creating reliable, trustworthy applications. Enter the concept of guardrails, which are safety systems designed to protect your application and users from unexpected, unwanted, or even harmful outcomes.

Think of guardrails like a protective fence, ensuring that the LLM stays within safe boundaries while performing its tasks. Imagine if someone tried to make an inappropriate request, or worse, a request that could cause harm. This is where guardrails step in.

Guardrails serve two main purposes:

  • Input Validation: Guardrails can examine the user input and determine whether it’s acceptable for your application and whether it aligns with your intended use case. Imagine preventing your LLM from processing prompts with malicious language or data that could cause harm to users.
  • Output Filtering: Guardrails are not only for examining the user’s input but also for checking the outputs of the LLM. By analyzing the LLM’s generated responses, you can filter out inappropriate content or responses that don’t meet your requirements.

What are the three primary types of guardrails?

  • Model’s safety settings: Models have usually been fine tuned to avoid certain harmful content in both input and output. They also give you access to safety settings, with different harm categories and safety thresholds. You should test those settings and how they can be configured for your use case. For example, have a look at the available safety filters for Gemini.
  • Static Guardrails: These are predefined rules that are set before the LLM begins to process any input. These can be rules that dictate certain formatting for prompts, limitations on input length, or even basic checks for prohibited terms or requests. Static guardrails offer fast processing times, since the checks are performed beforehand on the input strings, in your own code.
  • Dynamic Guardrails: These are flexible guardrails that work in conjunction with the LLM or the embedding model used for text classification, continuously adapting to changes in user input or the output of the model itself. They allow you to handle complex or unpredictable situations and perform nuanced checks to maintain the safety and integrity of your application. You might have a look at Google Cloud Natural Language Processing’s moderation endpoint, or the free Perspective API used by newspapers.
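
A static guardrail can be as simple as the following sketch, which checks the input length and looks for prohibited terms. The StaticGuardrail class, the limit, and the term list are illustrative assumptions; a plain substring match like this one is crude and will produce false positives.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** Rule-based input checks that run in your own code, before calling the LLM. */
final class StaticGuardrail {

    private final int maxLength;
    private final List<String> prohibitedTerms;

    StaticGuardrail(int maxLength, List<String> prohibitedTerms) {
        this.maxLength = maxLength;
        this.prohibitedTerms = prohibitedTerms;
    }

    /** Returns a rejection reason, or an empty Optional if the input passes. */
    Optional<String> check(String input) {
        if (input.length() > maxLength) {
            return Optional.of("input too long");
        }
        String lower = input.toLowerCase(Locale.ROOT);
        for (String term : prohibitedTerms) {
            if (lower.contains(term.toLowerCase(Locale.ROOT))) {
                return Optional.of("prohibited term: " + term);
            }
        }
        return Optional.empty();
    }
}
```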

When you implement guardrails, it’s also critical to consider performance impact. You want to make sure that these safeguards don’t add too much latency and delay user experience. That’s where parallel processing can come into play! Instead of waiting for the guardrail check to finish before starting the LLM generation, consider launching both tasks in parallel, optimizing speed and efficiency without sacrificing safety. Then, if the guardrails raise a red flag, you can stop the response generation, and reply right away to the user that the input content was problematic. For the response, unless you have a streaming kind of guardrail system, you might have to wait for the whole response to be generated before evaluating it with the guardrail, in which case, you can’t really do parallel processing.
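
Here’s a minimal sketch of the parallel approach for the input check, using CompletableFuture from the JDK. The guardrail and the model are passed in as plain functions, and the ParallelGuardrail class and refusal message are hypothetical. Note that cancel() only abandons the result here: it doesn’t interrupt a request that is already running, so a real LLM client would need its own way to abort the call.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.function.Predicate;

/** Runs the input guardrail and the LLM generation at the same time. */
final class ParallelGuardrail {

    static final String REFUSAL = "Sorry, I can't help with that request.";

    static String answer(String input,
                         Predicate<String> isInputSafe,
                         Function<String, String> model) {
        // Both tasks start right away, on the common thread pool.
        CompletableFuture<Boolean> safety =
            CompletableFuture.supplyAsync(() -> isInputSafe.test(input));
        CompletableFuture<String> generation =
            CompletableFuture.supplyAsync(() -> model.apply(input));

        // Wait for the guardrail verdict first: it is usually the faster task.
        if (!safety.join()) {
            generation.cancel(true); // drop the result, reply immediately
            return REFUSAL;
        }
        return generation.join();
    }
}
```

The trade-off is that a rejected input still costs an LLM call that has already started, so this pattern buys latency at the expense of some wasted tokens.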

Always remember: Guardrails should be continuously refined and updated as you identify new potential risks. Gathering feedback from users, by giving them the ability to report a bad response, is one approach. But you should also monitor your application’s LLM responses and do some vibe-checks at random to ensure your application is behaving correctly.

Evaluating and Monitoring for Consistent Performance

On to one of the most crucial aspects of any application, regardless of its technology: evaluation and monitoring. This is essential for ensuring your LLM-powered application continues to function reliably and meets your expectations as it interacts with users in the real world.

Imagine you make an update to your application, or perhaps even a simple tweak to one of your prompts. Without proper monitoring, you won’t know if those changes had unintended consequences. You could end up with an app that gives unexpected results, leads to user frustration, or even creates unforeseen safety issues. That’s where a robust evaluation and monitoring framework comes into play!

Your LLM-powered app needs a systematic way to ensure that everything is running smoothly and effectively. You need to:

  • Establish Evaluation Metrics: You need clear guidelines to judge the LLM’s performance. Think of key metrics like accuracy, relevance, and coherence.

    • Accuracy: This measures how often the LLM generates correct and factually accurate responses. This is particularly crucial if your application is designed for providing reliable information or carrying out fact-based tasks.
    • Relevance: You need to make sure the LLM stays focused on the core issue. It should respond to your prompts in a meaningful and helpful way, instead of giving irrelevant or off-topic responses.
    • Coherence: You need to check if the LLM produces well-written and logical text. Coherent responses are easily understood by users and don’t leave them feeling confused or disoriented.
  • Gather User Feedback: It’s essential to go beyond just numbers. Your application’s performance shouldn’t just be evaluated on your own terms. Get feedback from the users, gather data on how they are using the application, and check their satisfaction with the outputs of your application. You can even ask users to provide their opinions on specific generated answers, giving you valuable insights into what resonates with them and how you can improve. Consider using tools like “thumbs up” or “thumbs down” buttons, offering an easy way for users to indicate their sentiment towards the LLM’s responses, or a way to report and explain what wasn’t up to the level of their expectations.

  • Build a “Golden Responses” Dataset: Create a collection of carefully chosen inputs and their desired, accurate responses. These “golden” examples act like benchmarks, helping you measure how closely the LLM matches your expected results for specific tasks. By periodically checking how your LLM performs against these golden examples, you can get a clear picture of potential issues and make necessary adjustments. You can use this set as a starting point to track potential regressions and make sure the LLM’s behavior is aligned with your expectations.

  • Implement Continuous Monitoring: Monitoring shouldn’t be a one-time event. It’s an ongoing process, like keeping a watchful eye on your application as it functions in the real world. By monitoring in real-time, you can detect anomalies, unexpected issues, or performance regressions promptly. It allows you to address these issues before they cause significant problems for your users. You might check out the recent OpenTelemetry guidelines for Gen AI to observe how your system and LLM are performing live.

You can further improve your LLM-powered application by analyzing the user’s requests and responses generated by the LLM, especially those flagged by users as problematic or unexpected. These can be added to your collection of golden responses, constantly refining the process of evaluation. This helps your application evolve based on real-world interactions.
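
A golden responses check can start very small. In the sketch below, the GoldenEvaluation class is hypothetical, the model is a plain function, and an example passes when the output contains the expected text (ignoring case). That matching rule is a deliberate simplification: real evaluations often rely on semantic similarity or on an LLM acting as a judge.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

/** Scores a model against a set of golden input / expected-output pairs. */
final class GoldenEvaluation {

    record GoldenExample(String input, String expected) { }

    /** An example passes if the output contains the expected text, ignoring case. */
    static boolean passes(GoldenExample example, Function<String, String> model) {
        String output = model.apply(example.input()).toLowerCase(Locale.ROOT);
        return output.contains(example.expected().toLowerCase(Locale.ROOT));
    }

    /** Fraction of examples that pass, between 0.0 and 1.0 (0.0 for an empty set). */
    static double passRate(List<GoldenExample> examples, Function<String, String> model) {
        if (examples.isEmpty()) {
            return 0.0;
        }
        long passed = examples.stream().filter(ex -> passes(ex, model)).count();
        return (double) passed / examples.size();
    }
}
```

Running this in your build after each prompt or model change, and failing when the pass rate drops below a level you choose, turns the golden dataset into a regression test.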

Addressing Data Privacy Concerns

Another important topic to keep in mind: data privacy. LLMs have access to a vast amount of text data, which makes them incredibly powerful. But this same power brings with it the responsibility of safeguarding sensitive information. If your application handles user data, you need to ensure that you’re handling it with utmost care, protecting it from unauthorized access and ensuring that you comply with relevant privacy regulations.

Think of data privacy as a trust contract. You, as the developer, are entrusted with safeguarding the sensitive information of your users. It’s your responsibility to implement measures that keep this data secure and prevent breaches or misuse.

Here are some key steps to address data privacy concerns in your LLM application:

  • Implement strong security measures: Use robust encryption methods to secure your application and data. Employ security best practices such as access controls, secure storage, and secure communication channels.
  • Stay aligned with data privacy regulations: Comply with relevant privacy regulations like GDPR, CCPA, and HIPAA. You might need to review your data handling policies and make necessary adjustments.
  • Ensure data anonymization: When working with sensitive data, always strive to anonymize or pseudonymize it to the fullest extent possible. You can utilize techniques like differential privacy, aggregation, or removing identifying details to protect user information (with Google Cloud Data Loss Prevention API for example).
  • Be transparent with users: Communicate clearly with your users about how you collect, use, and store their data. Offer users options to control their data, and provide mechanisms to update or delete their information if needed.
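
As a tiny illustration of removing identifying details before a prompt leaves your system, the sketch below masks email addresses with a regular expression. The PiiRedactor class and the pattern are assumptions for this example: the pattern is deliberately simple, it only covers emails, and it is no substitute for a dedicated service or for techniques like differential privacy.

```java
import java.util.regex.Pattern;

/** Masks email addresses in a text before it is sent to an LLM. */
final class PiiRedactor {

    // Simplified email pattern: good enough for a demo, not RFC-complete.
    private static final Pattern EMAIL =
        Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    static String redactEmails(String text) {
        return EMAIL.matcher(text).replaceAll("[EMAIL]");
    }
}
```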

By prioritizing data privacy in your LLM application, you not only uphold ethical standards but also build trust with your users. Your users should be confident that their information is being handled with respect and care, encouraging long-term trust in your application.

Tailoring LLMs for Specific Business Goals

Remember that LLMs are tools, and the success of your LLM application ultimately hinges on aligning its capabilities with your unique goals and your target audience. So, how do you get the most out of an LLM in your business?

First, define your goals. What specific tasks can an LLM help you accomplish? What pain points are you trying to solve? Once you understand the big picture, you can break down those goals into actionable tasks that the LLM can potentially assist with.

Then, it’s time to find the right LLM for the job. Not all LLMs are created equal. Different models excel at specific tasks, have varying levels of language support, and even require different levels of computational resources. For example, if your business uses many different languages, you’ll want an LLM with multilingual support.

To select the best LLM for your needs, ask yourself:

  • What specific task does this LLM need to perform? Different LLMs excel at different tasks like text generation, summarization, or translation.
  • How does the LLM’s accuracy compare with the level of accuracy required for your application? The model needs to generate results with the appropriate level of precision for your goals.
  • How much computational power does it take to run this LLM? If you host the model on your own, consider your budget and available infrastructure. A cloud-hosted model might be better (and cheaper) depending on your usage patterns, especially if you don’t want the hassle of managing your own infrastructure and GPUs.
  • What language capabilities does the LLM offer? Is the model good at the languages you need to use, or are there specific domains where the model is particularly strong? It’s not just about spoken languages: with code as well, some models may be better at one particular programming language than at another.

You can often find models with specialized skills. You may find, for example, a model trained on scientific papers if your work requires the processing of highly technical content, or a model trained on a particular field, such as text of laws, to be highly effective in that domain.

Once you’ve chosen your LLM, the next step could be fine-tuning, where you’d tailor the model to your specific needs. It’s like customizing a tool to do the exact job you need it to do. For example, imagine your application is helping people book vacations. You can train the model on a massive amount of vacation-related text data so it can accurately understand and respond to vacation-specific questions, making your application highly relevant for its intended purpose. But fine-tuning is not necessarily for the faint of heart, and can be complicated to do right.

While choosing and fine-tuning are critical steps, assessing potential risks is equally important. Think about potential unintended consequences. LLMs, for example, might not always be factual or accurate in their responses. You’ll need to find ways to manage those potential issues, often incorporating guardrails to mitigate potential harms or biases, or implementing techniques like Retrieval Augmented Generation to ground the model’s responses on your own data and documents.

Ultimately, you’ll want to make your application a tool that not only works reliably but also gives real value to your business. By understanding your business goals, choosing the right model, customizing it effectively, and understanding the potential risks, you’re on the right path to success!

Looking Ahead: Emerging Trends and Future Directions

Remember that this field is constantly changing! New capabilities are emerging, and existing models are getting even smarter and more efficient. This is an exciting time to be working with LLMs because the possibilities feel endless!

While it’s fantastic to get your application off the ground using the latest LLMs, it’s equally important to be open to continuous improvement. What’s great today may not be optimal in the future. The world of LLMs is one where ongoing development is key! Here are a few tips:

  • Embrace continuous learning. You should always be seeking out information about the newest developments in the field, how LLMs are being enhanced, and the impact those changes could have on your applications. Look out for improvements to existing models, new LLM models coming out, and fresh research.
  • Think ahead. What new features could you integrate in your application that take advantage of those advancements? Would your app benefit from a specific, task-oriented model that focuses on summarization, question answering, or code generation? Maybe there’s a model out there that will significantly boost performance and help you offer a smoother, more feature-rich experience for your users!
  • Prepare for evolution. Remember that LLMs aren’t static! Your app should be built with a framework for easy adaptation. Consider how you can adapt to model updates or new model releases in a structured way, perhaps by putting in place frameworks for incorporating new models seamlessly and managing prompt changes for various models, like LangChain4j (if you’re a Java developer) which offers higher-level abstractions and that allows you to switch models easily.

The landscape of LLMs is evolving rapidly. Stay up-to-date with the latest developments and ensure your applications can adapt, allowing you to unlock the full potential of LLMs for your business!

https://glaforge.dev/posts/2024/09/23/some-good-practices-when-integrating-an-llm-in-your-application/
New Gemini model in LangChain4j

A new version of LangChain4j, the super powerful LLM toolbox for Java developers, was released today. In 0.34.0, a new Gemini model has been added. This time, this is not the Gemini flavor from Google Cloud Vertex AI, but the Google AI variant.

It was a frequently requested feature by LangChain4j users, so I took a stab at developing a new chat model for it, during my summer vacation break.

Gemini, show me the code!

Let’s dive into some code examples to see it in action!

But first, you’ll need an API key. So just follow the instructions to obtain your Gemini API key. I’ve saved mine in the GEMINI_AI_KEY environment variable, so that I don’t have to hardcode it in my source files.

The code examples below have been compiled with Java 21.

I’ve imported the following libraries in my build tool:

  • dev.langchain4j:langchain4j-google-ai-gemini:0.34.0
  • dev.langchain4j:langchain4j-core:0.34.0
  • dev.langchain4j:langchain4j:0.34.0

Let’s be polite and say hello

My mom always told me to be polite and to say hello:

ChatLanguageModel gemini = GoogleAiGeminiChatModel.builder()
    .apiKey(System.getenv("GEMINI_AI_KEY"))
    .modelName("gemini-1.5-flash")
    .build();

String response = gemini.generate("Konnichiwa Gemini!");

System.out.println("Gemini> " + response);
// Gemini> Konnichiwa! It's nice to hear from you.
//         What can I do for you today?

Don’t you like strawberries?

In the first example, I used the usual generate() method to send my greeting to Gemini. But LangChain4j 0.34 introduces some new signatures and classes to interact with an LLM:

  • ChatRequest: a new class that contains your conversation messages, the tools this request can use, and a response format definition to decide what should be the shape of the output
  • ChatResponse: this class holds the LLM’s response, the token usage information, and the finish reason (i.e. whether the response was cut, filtered, or generated to the end)
  • ChatResponse chat(ChatRequest req): this new method is added to the LLM contract to interact with it.

ChatLanguageModel gemini = GoogleAiGeminiChatModel.builder()
    .apiKey(System.getenv("GEMINI_AI_KEY"))
    .modelName("gemini-1.5-flash")
    .build();

ChatResponse chatResponse = gemini.chat(ChatRequest.builder()
    .messages(UserMessage.from(
        "How many R's are there in the word 'strawberry'?"))
    .build());

String response = chatResponse.aiMessage().text();

System.out.println("Gemini> " + response);
// Gemini> There are **three** R's in the word "strawberry".

Let’s roll the JSON dice!

Both Gemini 1.5 Flash and Pro allow you to specify that the output should be valid JSON. It’s sometimes called the JSON mode.

ChatLanguageModel gemini = GoogleAiGeminiChatModel.builder()
    .apiKey(System.getenv("GEMINI_AI_KEY"))
    .modelName("gemini-1.5-flash")
    .responseMimeType("application/json")
    .build();

String roll = gemini.generate("Roll a 6-sided dice");

System.out.println(roll);
// {"roll": "3"}

Gemini will always reply with valid JSON structures.

Here, the JSON object key is not always roll, and is sometimes die, dice_roll, etc. But you could tweak your prompt to ask for a specific key name.

Gemini usually follows such instructions closely, but there’s no guarantee that it will really use the requested key name. But fear not, there’s an even more powerful solution, thanks to response formats!

Let’s cook something with our strawberries

You can configure Gemini to make it generate outputs that comply with a JSON schema. It’s sometimes called controlled generation, or constrained decoding.

Let’s say we have a schema that represents recipes! It’s time to do something with our strawberries!

ChatLanguageModel gemini = GoogleAiGeminiChatModel.builder()
    .apiKey(System.getenv("GEMINI_AI_KEY"))
    .modelName("gemini-1.5-flash")
    .responseSchema(JsonSchema.builder()
        .rootElement(JsonObjectSchema.builder()
            .properties(Map.of(
                "title", JSON_STRING_SCHEMA,
                "preparationTimeMinutes", JSON_INTEGER_SCHEMA,
                "ingredients", JsonArraySchema.builder()
                    .items(JSON_STRING_SCHEMA)
                    .build(),
                "steps", JsonArraySchema.builder()
                    .items(JSON_STRING_SCHEMA)
                    .build()
                ))
            .build())
        .build())
    .build();

String recipeResponse = gemini.generate(
    "Suggest a dessert recipe with strawberries");

System.out.println(recipeResponse);

/*
{"ingredients": ["1 pint fresh strawberries, hulled and sliced",
"1/2 cup sugar", "1/4 cup water", "1 tablespoon lemon juice",
"1/2 teaspoon vanilla extract", "1 cup heavy cream, whipped"],
"preparationTimeMinutes": 30, "steps": ["In a saucepan, combine
the sugar, water, and lemon juice. Bring to a boil over medium
heat, stirring until the sugar is dissolved.", "Reduce the heat
to low and simmer for 5 minutes, or until the syrup thickens
slightly.", "Remove from heat and stir in the vanilla extract.",
"Pour the syrup over the strawberries in a bowl and stir to coat.",
"Refrigerate for at least 30 minutes, or until chilled.",
"To serve, top the strawberries with whipped cream and enjoy!"],
"title": "Strawberry Shortcake"}
 */

Gemini strictly follows the specified JSON schema, and generates a JSON object that matches.

This is particularly important when you integrate LLMs in your application. You want a deterministic format for the output that can easily be parsed and handled by your system.

Tasty strawberries from Japan!

A few months back, I had the chance to visit Japan with my family, and they have some really gorgeous and tasty strawberries there! And don’t get me started on strawberry daifukus (mochis with fruits inside) we had in Osaka!

But before tasting those lovely confections, we need to plan our trip to Japan.

In the previous example, you might have found it a bit painful to describe the JSON schema. When integrating with a Java application, you might have more complex data structures to represent, so writing out the big schema that defines them can be tedious. Fortunately, there’s a little trick to get the JSON schema for a Java class (or record, enum, etc.).

Let’s define and describe our trip itinerary object:

@Description("details of a trip itinerary")
record TripItinerary(
    String country,
    Integer numberOfPersons,
    Month month,
    @Description("key highlights when visiting the city")
    List<CityHighlights> cityHighlights
) {
    enum Month {
        JANUARY, FEBRUARY, MARCH, APRIL, MAY, JUNE, JULY,
        AUGUST, SEPTEMBER, OCTOBER, NOVEMBER, DECEMBER
    }

    record CityHighlights(
        String cityName,
        List<String> visitHighlights
    ) { }
}

The itinerary is represented by some records, enums, and lists, and the @Description annotation can help the LLM better understand what some elements are about. This is particularly useful when you have cryptic field names, but here it’s not strictly necessary, as Gemini is smart enough to understand what each field is about.

Now let’s ask for our Japan itinerary:

ChatLanguageModel gemini = GoogleAiGeminiChatModel.builder()
    .apiKey(System.getenv("GEMINI_AI_KEY"))
    .modelName("gemini-1.5-flash")
    .temperature(2.0)
    .responseSchema(jsonSchemaFrom(TripItinerary.class).get())
    .build();

We derive the JSON schema from the TripItinerary class. No need to tediously craft a JSON schema for it.

Let’s see what Gemini suggests for our visit:

Response<AiMessage> tripResponse = gemini.generate(
    SystemMessage.from("You are an expert trip planner"),
    UserMessage.from("""
        Suggest an itinerary for Japan.
        Cities visited: Tokyo, Kyoto, Osaka.
        Trip for a family of 4 persons.
        Provide key highlights for each city visited.
        """)
);

We tell Gemini to act as an expert trip planner, and we give some details about the travellers, the cities we’d like to visit.

So what’s the JSON structured output for this Japan visit?

// Print the text of the AI message, not the Response wrapper object
System.out.println(tripResponse.content().text());
/*
{"cityHighlights": [{"cityName": "Tokyo", "visitHighlights":
["Explore the vibrant Shibuya Crossing and the iconic Shibuya
Scramble.", "Visit the Meiji Jingu Shrine, a serene oasis in
the heart of Tokyo.", "Experience the fascinating world of
technology at the Miraikan National Museum of Emerging Science
and Innovation.", "Enjoy a traditional tea ceremony at one of
Tokyo's many teahouses.", "Get lost in the eclectic streets
of Harajuku and admire the unique fashion styles.", "Embark
on a scenic boat trip on the Sumida River, passing by Tokyo
Skytree.", "Indulge in a delightful sushi dinner at a renowned
Tsukiji Fish Market.", "Discover the charm of Ueno Park, home
to museums, temples, and the Ueno Zoo."]}, {"cityName": "Kyoto",
"visitHighlights": ["Wander through the serene gardens of the
Golden Pavilion (Kinkaku-ji).", "Immerse yourselves in the rich
history of the Kiyomizu-dera Temple, famous for its wooden stage.",
"Explore the ancient Gion district, known for its traditional
wooden buildings and geisha houses.", "Stroll through the Fushimi
Inari Shrine, famous for its thousands of red torii gates.",
"Discover the treasures of the Nishiki Market, offering a diverse
selection of food and crafts.", "Experience a traditional geisha
performance at one of Kyoto's exclusive theaters.", "Learn the
art of calligraphy at a traditional workshop in the Gion district.",
"Relax in the serene atmosphere of the Ryoan-ji Zen Garden.",
"Witness the beauty of the Arashiyama Bamboo Grove."]},
{"cityName": "Osaka", "visitHighlights": ["Experience the vibrant
Dotonbori district, renowned for its neon lights, street food,
and entertainment.", "Explore the Osaka Castle, a historic
landmark and symbol of the city.", "Enjoy the breathtaking
panoramic views from the Abeno Harukas, Japan's tallest
skyscraper.", "Visit the Osaka Aquarium Kaiyukan, home to diverse
marine life from around the world.", "Stroll through the lively
Kuromon Market, known for its fresh seafood and local produce.",
"Take a scenic ride on the Osaka Ferris Wheel, offering views of
the cityscape.", "Indulge in the delicious okonomiyaki, Osaka's
signature dish.", "Experience the unique culture of the Sumiyoshi
Taisha Shrine, dedicated to the gods of seafaring."]}],
"country": "Japan", "month": "MARCH", "numberOfPersons": 4}
 */

Damn! It didn’t even mention the most delicious daifukus we had in Osaka!

Can I go outside without my umbrella in Osaka, tonight?

Speaking of visiting Osaka and those great daifukus, what’s the weather like there? It’s been raining a lot in Paris today, so I’m curious if it’s better in Osaka.

This new Gemini chat model works with LangChain4j’s higher-level abstractions, such as AiServices, to create very powerful LLM-based apps, like smart agents or RAG (Retrieval Augmented Generation).

We’ll have a look at a great use case for LLMs: data extraction from free-form text.

Let’s define and describe a weather forecast record:

record WeatherForecast(
    @Description("minimum temperature")
    Integer minTemperature,
    @Description("maximum temperature")
    Integer maxTemperature,
    @Description("chances of rain")
    boolean rain
) { }

We’ll also create an interface for our weather service contract:

interface WeatherForecastAssistant {
    WeatherForecast extract(String forecast);
}

Let’s configure Gemini, instantiate our weather assistant, and extract the weather forecast from today’s newspaper:

ChatLanguageModel gemini = GoogleAiGeminiChatModel.builder()
    .apiKey(System.getenv("GEMINI_AI_KEY"))
    .modelName("gemini-1.5-flash")
    .build();

WeatherForecastAssistant forecastAssistant =
    AiServices.builder(WeatherForecastAssistant.class)
        .chatLanguageModel(gemini)
        .build();

WeatherForecast forecast = forecastAssistant.extract("""
    Morning: The day dawns bright and clear in Osaka, with crisp
    autumn air and sunny skies. Expect temperatures to hover
    around 18°C (64°F) as you head out for your morning stroll
    through Namba.
    Afternoon: The sun continues to shine as the city buzzes with
    activity. Temperatures climb to a comfortable 22°C (72°F).
    Enjoy a leisurely lunch at one of Osaka's many outdoor cafes,
    or take a boat ride on the Okawa River to soak in the beautiful
    scenery.
    Evening: As the day fades, expect clear skies and a slight chill
    in the air. Temperatures drop to 15°C (59°F). A cozy dinner at a
    traditional Izakaya will be the perfect way to end your day in
    Osaka.
    Overall: A beautiful autumn day in Osaka awaits, perfect for
    exploring the city's vibrant streets, enjoying the local cuisine,
    and soaking in the sights.
    Don't forget: Pack a light jacket for the evening and wear
    comfortable shoes for all the walking you'll be doing.
    """);

System.out.println("Gemini> " + forecast);
// Gemini> WeatherForecast[
//             minTemperature=15,
//             maxTemperature=22,
//             rain=false]

Awesome, no need for my umbrella!

What’s great here is that we’re dealing with a real type-safe Java object, not JSON strings like before. So it integrates very well within our Java codebase!
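
As a small illustration of that point, here is a sketch of ordinary Java logic working directly on the typed fields. The record has the same shape as the WeatherForecast above (without the annotations); the ForecastUsage class and its advice() method are hypothetical names, not part of LangChain4j.

```java
public class ForecastUsage {

    // Same shape as the WeatherForecast record used for extraction
    record WeatherForecast(Integer minTemperature, Integer maxTemperature, boolean rain) { }

    // Hypothetical helper: the compiler checks every field access, no JSON parsing involved
    static String advice(WeatherForecast forecast) {
        int swing = forecast.maxTemperature() - forecast.minTemperature();
        String umbrella = forecast.rain() ? "take an umbrella" : "no umbrella needed";
        return umbrella + ", temperature swing of " + swing + " degrees";
    }

    public static void main(String[] args) {
        // Values taken from the extraction result shown above
        System.out.println(advice(new WeatherForecast(15, 22, false)));
    }
}
```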

Time for a little coding quiz

Alright, after the touristic detour, let’s get back to some math and some coding. LLMs are quite good at reasoning, in particular when you encourage them to think step by step. But sometimes they fall short, and can’t really calculate results. They’re language models, not calculators, right?

Gemini has the ability to create some Python scripts, and to execute them in a sandbox. So how can we configure Gemini for solving a little math problem?

ChatLanguageModel gemini = GoogleAiGeminiChatModel.builder()
    .apiKey(System.getenv("GEMINI_AI_KEY"))
    .modelName("gemini-1.5-flash")
    .allowCodeExecution(true)
    .includeCodeExecutionOutput(true)
    .build();

There are 2 builder methods:

  • allowCodeExecution(true): to let Gemini know it can do some Python coding
  • includeCodeExecutionOutput(true): if you want to see the actual Python script it came up with, and the output of its execution

Do you know off the top of your head what fibonacci(22) or ackermann(3, 4) evaluates to? Let’s ask Gemini:

Response<AiMessage> mathQuizz = gemini.generate(
    SystemMessage.from("""
        You are an expert mathematician.
        When asked a math problem or logic problem,
        you can solve it by creating a Python program,
        and execute it to return the result.
        """),
    UserMessage.from("""
        Implement the Fibonacci and Ackermann functions.
        What is the result of `fibonacci(22) - ackermann(3, 4)`?
        """)
);

Looks like Gemini is a Python and math wiz:

Code executed:
```python
def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)

def ackermann(m, n):
    if m == 0:
        return n + 1
    elif n == 0:
        return ackermann(m - 1, 1)
    else:
        return ackermann(m - 1, ackermann(m, n - 1))

print(fibonacci(22) - ackermann(3, 4))
```
Output:
```
17586
```
The result of `fibonacci(22) - ackermann(3, 4)` is **17586**.

I implemented the Fibonacci and Ackermann functions in Python.
Then I called `fibonacci(22) - ackermann(3, 4)` and printed the result.

If you don’t set includeCodeExecutionOutput(true), you receive only the end of the message, without the script and its output:

The result of `fibonacci(22) - ackermann(3, 4)` is **17586**.

I implemented the Fibonacci and Ackermann functions in Python.
Then I called `fibonacci(22) - ackermann(3, 4)` and printed the result.

I didn’t encounter any snake in Japan, but I’m happy Gemini can write some Python functions when needed!
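
The number Gemini reported can be cross-checked locally with a straightforward Java port of the two Python functions. MathQuizCheck is just an illustrative class name; the logic follows the Python script line by line.

```java
public class MathQuizCheck {

    // Naive recursive Fibonacci, same definition as the Python script
    static int fibonacci(int n) {
        return n <= 1 ? n : fibonacci(n - 1) + fibonacci(n - 2);
    }

    // Ackermann function, same three cases as the Python script
    static int ackermann(int m, int n) {
        if (m == 0) {
            return n + 1;
        }
        if (n == 0) {
            return ackermann(m - 1, 1);
        }
        return ackermann(m - 1, ackermann(m, n - 1));
    }

    public static void main(String[] args) {
        // fibonacci(22) is 17711 and ackermann(3, 4) is 125, so this prints 17586
        System.out.println(fibonacci(22) - ackermann(3, 4));
    }
}
```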

What about the weather in Tokyo?

Besides this Python code execution sandbox, the more traditional function calling mechanism works. We heard about the weather in Osaka, now let’s ask for Tokyo.

Let’s define a tool to retrieve structured weather forecasts:

record WeatherForecast(
    String location,
    String forecast,
    int temperature) {}

class WeatherForecastService {
    @Tool("Get the weather forecast for a location")
    WeatherForecast getForecast(
        @P("Location to get the forecast for") String location) {
        if (location.equals("Paris")) {
            return new WeatherForecast("Paris", "sunny", 20);
        } else if (location.equals("London")) {
            return new WeatherForecast("London", "rainy", 15);
        } else if (location.equals("Tokyo")) {
            return new WeatherForecast("Tokyo", "warm", 32);
        } else {
            return new WeatherForecast("Unknown", "unknown", 0);
        }
    }
}

We need a weather forecast assistant as well, which we’ll instantiate and configure with our tool, thanks to AiServices:

interface WeatherAssistant {
    String chat(String userMessage);
}

WeatherForecastService weatherForecastService =
    new WeatherForecastService();

ChatLanguageModel gemini = GoogleAiGeminiChatModel.builder()
    .apiKey(System.getenv("GEMINI_AI_KEY"))
    .modelName("gemini-1.5-flash")
    .temperature(0.0)
    .build();

WeatherAssistant weatherAssistant =
    AiServices.builder(WeatherAssistant.class)
        .chatLanguageModel(gemini)
        .tools(weatherForecastService)
        .build();

String tokyoWeather = weatherAssistant.chat(
        "What is the weather forecast for Tokyo?");

System.out.println("Gemini> " + tokyoWeather);
// Gemini> The weather forecast for Tokyo is warm
//         with a temperature of 32 degrees.

I didn’t expect such warm temperatures in Tokyo! Well, of course, it’s all fake, but you can imagine calling a real weather service.

The beauty of LangChain4j’s AiServices is that it handles calling the service for you. Usually, with function calling, the LLM just replies with a request saying that you should call a tool or API, and then hand the tool/API’s answer back to it. Here, with AiServices, it’s all automatic and transparent.
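
To picture the back-and-forth that AiServices hides, here is a minimal sketch of a manual tool-calling loop, with a stubbed model instead of a real LLM. Everything in it (ManualToolLoop, ModelReply, ToolRequest, FinalAnswer, fakeModel) is hypothetical and only illustrates the protocol; it is not how LangChain4j is implemented.

```java
import java.util.Map;
import java.util.function.Function;

public class ManualToolLoop {

    // A model turn is either a final answer or a request to call a tool
    interface ModelReply { }
    record FinalAnswer(String text) implements ModelReply { }
    record ToolRequest(String toolName, String argument) implements ModelReply { }

    // Hypothetical stand-in for an LLM: first asks for the tool, then answers
    static ModelReply fakeModel(String question, String toolResult) {
        if (toolResult == null) {
            return new ToolRequest("getForecast", "Tokyo");
        }
        return new FinalAnswer("The weather forecast for Tokyo is " + toolResult + ".");
    }

    static String chat(String question, Map<String, Function<String, String>> tools) {
        String toolResult = null;
        while (true) {
            ModelReply reply = fakeModel(question, toolResult);
            if (reply instanceof FinalAnswer answer) {
                return answer.text();
            }
            ToolRequest request = (ToolRequest) reply;
            // The application, not the model, executes the requested tool...
            toolResult = tools.get(request.toolName()).apply(request.argument());
            // ...and the loop hands the result back to the model on the next turn
        }
    }

    public static void main(String[] args) {
        Function<String, String> getForecast =
            city -> city.equals("Tokyo") ? "warm, 32 degrees" : "unknown";
        Map<String, Function<String, String>> tools = Map.of("getForecast", getForecast);
        System.out.println(chat("What is the weather forecast for Tokyo?", tools));
    }
}
```

With a real model, each loop iteration is a network round trip, which is exactly the plumbing you no longer write yourself.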

Of parrots pictures, text files, and multimodality

Let’s finish our whirlwind tour of this Google AI Gemini model for LangChain4j with an example that highlights Gemini’s multimodal capabilities.

Gemini is a multimodal LLM: as input, in addition to text, it accepts pictures, videos, audio, PDF files, and text files.

Let’s ask what Gemini thinks of the cute colorful parrot mascot of LangChain4j.

// README.md Markdown file from the LangChain4j project's GitHub repository
String base64Text = b64encoder.encodeToString(readBytes(
  "https://github.com/langchain4j/langchain4j/blob/main/README.md"));

// PNG of the cute colorful parrot mascot of the LangChain4j project
String base64Img = b64encoder.encodeToString(readBytes(
  "https://avatars.githubusercontent.com/u/132277850?v=4"));

ChatLanguageModel gemini = GoogleAiGeminiChatModel.builder()
    .apiKey(System.getenv("GEMINI_AI_KEY"))
    .modelName("gemini-1.5-flash")
    .build();

Response<AiMessage> response = gemini.generate(
    UserMessage.from(
        TextFileContent.from(base64Text, "text/x-markdown"),
        ImageContent.from(base64Img, "image/png"),
        TextContent.from("""
            Do you think this logo fits well
            with the project description?
            """)
    )
);

System.out.println("Gemini> " + response);
/*
   Gemini> The logo of a parrot drinking tea doesn't seem like a
   good fit for a project description of a Java version of LangChain.
   It's not clear how the logo relates to the project's purpose or
   functionality. A logo that better reflects the project's technical
   nature, such as a stylized representation of code or a language
   model, would be more appropriate.
 */

Ah well, it looks like LLMs don’t want to be compared to stochastic parrots, so Gemini thinks the parrot mascot doesn’t represent the project well enough!

Sorry Gemini, I have to disagree, I really love this logo!

This example shows that you can craft an elaborate prompt that contains a text query, an external text file (the description of the project in Markdown format), and the picture of the parrot mascot.
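
The multimodal example relies on a b64encoder variable and a readBytes() method that the post doesn’t show. Here is a minimal sketch of what they could look like using only the JDK; the Base64Helpers class name and both definitions are assumptions, not the author’s actual code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Helpers {

    // Assumed definition of the `b64encoder` variable used in the example
    static final Base64.Encoder b64encoder = Base64.getEncoder();

    // Assumed definition of `readBytes`: downloads the whole resource into memory
    static byte[] readBytes(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return in.readAllBytes();
        }
    }

    public static void main(String[] args) {
        // The encoding step in isolation, no network access needed
        byte[] bytes = "parrot".getBytes(StandardCharsets.UTF_8);
        System.out.println(b64encoder.encodeToString(bytes));
    }
}
```

Reading everything into memory is fine for a README and a small PNG, but larger files would call for a size limit or streaming.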

Let’s wrap up — with beautiful ‘furoshiki’ fabric!

Throughout this journey through code examples, strawberries, daifukus, Japan itineraries and weather forecasts, you learned about the brand new LangChain4j module for Google AI’s Gemini API, and its capabilities. I hope this article makes you want to try it out!

Before calling it a day or night (depending on your timezone), I’d like to mention some limitations, as it’s still early days for this new module:

  • Currently, there’s only a ChatLanguageModel available, but no StreamingChatLanguageModel class, so you won’t get streamed responses yet.
  • Gemini’s content caching capability is not surfaced in this implementation, so you can’t use caching to save some bucks or yens.
  • For multimodality, you should pass the Base64 encoded bytes of the files, and not use URLs to reference those resources, as this module doesn’t yet upload files to Gemini’s file service (Gemini won’t download from an external URL).

Hopefully, the community will adopt this module, work with it, and provide feedback to help us improve it further! Don’t hesitate to reach out with questions or to report any problems you encounter. And if you build something cool, please tell me too!

https://glaforge.dev/posts/2024/09/05/new-gemini-model-in-langchain4j/
A retryable JUnit 5 extension for flaky tests

As I work a lot with Large Language Models (LLMs), I often have to deal with flaky test cases, because LLMs are not always consistent and deterministic in their responses. Thus, sometimes, a test passes maybe a few times in a row, but then, once in a while, it fails.

Maybe some prompt tweaks will make the test pass more consistently, as might lowering the temperature, or using techniques like few-shot prompting to help the model better understand what it has to do. But in some circumstances, you can’t find ways around those weird failures, and the sole solution I found was to make a test retryable.

If a test fails, let’s retry a few more times (2 or 3 times) until it passes. But if it fails every time in spite of the retries, then it’ll just fail as expected.
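
Outside of any test framework, that retry policy can be sketched in a few lines of plain Java. RetrySketch and its retry() method are hypothetical names for illustration. Note that this sketch only catches RuntimeException, whereas JUnit assertion failures are Errors, so a test-oriented variant would need to catch Throwable instead.

```java
import java.util.function.Supplier;

public class RetrySketch {

    /** Runs the action up to maxAttempts times and rethrows the last failure. */
    static <T> T retry(int maxAttempts, Supplier<T> action) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get(); // success: stop retrying
            } catch (RuntimeException e) {
                lastFailure = e;     // remember the failure and try again
                System.err.println("Attempt #" + attempt + " failed: " + e.getMessage());
            }
        }
        throw lastFailure;           // every attempt failed
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated flaky action: fails twice, succeeds on the third call
        String result = retry(3, () -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("flaky");
            }
            return "passed after " + calls[0] + " attempts";
        });
        System.out.println(result);
    }
}
```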

I wrote JUnit Rules in the past for such situations, but that was in the JUnit 4 days. Now, I’m using JUnit 5, and although it’s possible to make JUnit 4 tests run under JUnit 5, I thought it was a great opportunity to try creating a JUnit 5 extension, which is the more powerful mechanism that replaces JUnit 4 rules.

It all starts with a failing test case

Let’s say you have a hypothetical flaky test that fails a few times in a row:

    private static int count = 1;
    @Test
    void test_custom_junit_retry_extension() {
        assertThat(count++).isEqualTo(4);
    }

The first 3 executions will see an assertion failure, but the 4th would succeed as the counter is then equal to 4.

I’d like to annotate this test method with a custom annotation that indicates the number of times I’m ready to retry that test:

    private static int count = 1;
    @Test
    @ExtendWith(RetryExtension.class)
    @Retry(4)
    void test_custom_junit_retry_extension() {
        assertThat(count++).isEqualTo(4);
    }

This @ExtendWith() annotation indicates that I’m registering a JUnit 5 extension. And @Retry(4) is a custom annotation that I’ve created.

Note that @ExtendWith() can be at the class-level, but it can also live at the method level.

Let’s have a look at the @Retry annotation:

import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

@Retention(RetentionPolicy.RUNTIME)
public @interface Retry {
    int value() default 3;
}

By default, I attempt the test 3 times, if no number is provided for the annotation value.

Now it’s time to see how the extension code works:

import org.junit.jupiter.api.extension.ExtensionContext;
import org.junit.jupiter.api.extension.TestExecutionExceptionHandler;

import java.util.concurrent.atomic.AtomicInteger;

public class RetryExtension implements TestExecutionExceptionHandler {

    private final AtomicInteger counter = new AtomicInteger(1);

    private void printError(Throwable e) {
        System.err.println(
            "Attempt test execution #" + counter.get() +
            " failed (" + e.getClass().getName() +
            " thrown): " + e.getMessage());
    }

    @Override
    public void handleTestExecutionException(
        ExtensionContext extensionContext, Throwable throwable)
        throws Throwable {

        printError(throwable);

        extensionContext.getTestMethod().ifPresent(method -> {
            int maxExecutions =
                method.getAnnotation(Retry.class) != null ?
                method.getAnnotation(Retry.class).value() : 1;

            while (counter.incrementAndGet() <= maxExecutions) {
                try {
                    extensionContext.getExecutableInvoker().invoke(
                        method,
                        extensionContext.getRequiredTestInstance());
                    return;
                } catch (Throwable t) {
                    printError(t);

                    if (counter.get() >= maxExecutions) {
                        throw t;
                    }
                }
            }
        });
    }
}

Let’s go through the code step by step:

  • The extension has a counter to count the number of executions
  • A printError() method reports the assertion failure or exception
  • The class implements the TestExecutionExceptionHandler interface
  • That interface requires the method handleTestExecutionException() to be implemented
  • This method is invoked when a test throws an exception
  • If an exception is thrown, we check whether the method is annotated with the @Retry annotation
  • and retrieve the number of attempts requested by the developer (a single execution when the annotation is absent)
  • Then we loop to execute the test method again, until it passes or the maximum number of attempts is reached

Missing standard JUnit 5 extension?

I thought a @Retry extension would be pretty common, and that it would be integrated in JUnit 5 directly. Or at least, some library would provide common JUnit 5 extensions? But my search didn’t yield anything meaningful. Did I overlook or miss something?

At least now, I have a solution to work around some flaky tests, thanks to this retryable extension!

Going further

If you want to learn more about JUnit 5 extensions, a few resources helped me develop this one. First of all, two articles from Baeldung: Migrating from JUnit 4 to JUnit 5, to understand the changes since JUnit 4, and the Guide to JUnit 5 Extensions. And of course, the JUnit 5 documentation on extensions.

Update

I’m glad I shared this article on Twitter, because I immediately got a response! Thanks @donal_tweets for your answer!

The JUnit Pioneer library provides a JUnit 5 extension pack, which includes a powerful retrying extension. Replace the usual @Test annotation with @RetryingTest. You can specify the number of attempts, the minimum number of successes, or some wait time before retries.

There’s also a rerunner extension that is quite similar.

My friend @aheritier also suggested that Maven Surefire can be configured to automatically retry failing tests a few times, thanks to a special flag:

mvn -Dsurefire.rerunFailingTestsCount=2 test

In my case, I don’t want to retry all failing tests, but only a specific one that I know is flaky.

For those who prefer Gradle over Maven, there’s a Gradle plugin as well: test-retry. You can configure the behavior in your build.gradle file:

test {
   retry {
       maxRetries = 2
       maxFailures = 20
       failOnPassedAfterRetry = true
   }
}

Someone also suggested that I use fuzzy assertions, but my test is very binary: it either fails or succeeds. There’s no threshold or value that would fit within some bounds.

https://glaforge.dev/posts/2024/09/01/a-retryable-junit-5-extension/
Let LLM suggest Instagram hashtags for your pictures

In this article, we’ll explore another great task where Large Language Models shine: entity and data extraction. LLMs are really useful beyond just mere chatbots (even smart ones using Retrieval Augmented Generation).

Let me tell you a little story of a handy application we could build, for wannabe Instagram influencers!

Great Instagram hashtags, thanks to LLMs

When posting Instagram pictures, I often struggle with finding the right hashtags to engage with the community. Large Language Models are pretty creative, and they’ve certainly seen a bunch of Instagram pictures with their descriptions.

So it’s natural to try asking an AI like Gemini what it could suggest in terms of hashtags:

[Image: Gemini Instagram hashtag suggestion]

This is a picture taken in the port of Heraklion in Crete, a Greek island. Here’s the conversation I had with Gemini, if you want to see all the tags it suggested. I think you’ll agree with me that those hashtags look pretty good. Gemini was able to recognise where the picture was taken, as it had tags like #heraklion, #crete, #greece, #greekisland, etc. In another attempt, it even told me the name of the fortress of the Venetian port, and suggested other tags along those lines. We also have several tags typically found on Instagram, like #travelgram, #instatravel, and more specific tags like #cretephotography.

My developer mind quickly started spinning with ideas of an online tool to help users be more creative with their Instagram tags. Armed with my usual tools of the trade, Java and LangChain4j, I tried to see how I could implement such a tool.

When you want to integrate an LLM into an application, it’s important to be able to use more structured outputs than plain text. And what’s great with the Gemini 1.5 Flash model is that it can generate a JSON response, and Gemini 1.5 Pro can even follow a specific JSON schema (also called controlled generation).

Let’s implement an Instagram hashtag generator

First, let’s see how the Gemini API responds to a plain prompt, without asking it to return a JSON payload:

var modelCreative = VertexAiGeminiChatModel.builder()
    .project(PROJECT_ID)
    .location(LOCATION)
    .modelName("gemini-1.5-flash")
    .build();

List<ChatMessage> messages = new ArrayList<>();
messages.add(SystemMessage.from("""
    You are an Instagram influencer and expert.
    You master the fine art of choosing the best creative hashtags
    to share users' best pictures, and to ensure engagement with
    the Instagram community is the highest possible.
    """));
messages.add(UserMessage.from(
    ImageContent.from(
      Paths.get("src/main/resources/travel-picture.jpg").toUri()),
    TextContent.from(
      "What are the best Instagram hashtags to describe that picture?")
));

Response<AiMessage> response = modelCreative.generate(messages);
String responseText = response.content().text();

System.out.println(responseText);

  • First, we instantiate a Vertex AI Gemini chat model.
  • We prepare a list of messages: one system message to describe the task the LLM must achieve, and one compound message made of the picture and the request to generate hashtags for it. Gemini is a multimodal model that accepts text, but also images, audio, videos, and PDF files.
  • Then we ask Gemini to generate a response, which we then print.

The Gemini LLM API will reply with a nice and creative response suggesting lots of great hashtags, along with some interesting explanations:

Here are some relevant hashtags for your picture:

**General Hashtags:**
#greece #greekislands #crete #creteisland #cretelife #holiday #travel
#vacay #instatravel #travelgram #beautifuldestinations
#travelphotography #photooftheday #instagood #igdaily #instapic

**Specific Hashtags:**
#heraklion #heraklioncrete #venetianharbour #harbourlife #boatlife
#boatday #greekboats #greekharbour #beautifulcrete #creteharbour
#cretevibes #greece_gram #crete_gram #greece_travel

**Trending Hashtags:**
#wanderlust #traveltuesday #instagood #travelgram
#beautifuldestinations #adventuretime

**Hashtags for Engagement:**
#explorecrete #cretelove #lovegreece #greecevibes #cretephoto
#creteexperience #discovercrete

Make sure to also use a mix of hashtags with different lengths and
popularity. You can also use location hashtags to target people in
your area.

For example, you could use hashtags like #heraklioncrete or
#creteharbour to target people who are interested in visiting that
specific location.

You should also consider using a mix of popular and less popular
hashtags. Popular hashtags will help your post reach a wider audience,
while less popular hashtags can help your post stand out from the
crowd.

**Tips for Choosing Hashtags:**

* Research relevant hashtags. Look at what other users are using and
see what's trending.
* Use a mix of general and specific hashtags.
* Don't use too many hashtags.
* Keep your hashtags relevant to your image.

By using a mix of relevant and engaging hashtags, you can increase
your chances of getting your post seen by a wider audience and improve
your engagement on Instagram.

Now, let’s tweak this code to output the Instagram hashtags as a JSON array of hashtag strings. With LangChain4j’s Gemini integration, it’s pretty trivial: we can specify that we want a JSON payload in return, thanks to the responseMimeType() method:

var modelCreative = VertexAiGeminiChatModel.builder()
    .project(PROJECT_ID)
    .location(LOCATION)
    .modelName("gemini-1.5-flash")
    .responseMimeType("application/json")
    .build();

Let’s update the system instruction to be explicit about what the JSON should look like:

List<ChatMessage> messages = new ArrayList<>();
messages.add(SystemMessage.from("""
    You are an Instagram influencer and expert.
    You master the fine art of choosing the best creative hashtags
    to share users' best pictures, and to ensure engagement with
    the Instagram community is the highest possible.

    Return a JSON array containing the hashtags as strings, for example:
    ```json
    ["#beach", "#island", "#traveltahiti"]
    ```
    """));

Now let’s see the LLM’s response:

[
  "#greece",
  "#crete",
  "#heraklion",
  "#cretelife",
  "#mediterraneansea",
  "#creteisland",
  "#greecevacations",
  "#greekislands",
  "#cretetravel",
  "#heraklionport"
]

On one hand, it’s nice that Gemini obeyed us and generated the requested JSON array of hashtags. However, notice that there are fewer hashtags, and that they are also a bit less creative.

LLMs are less creative when constrained

It’s not the first time I’ve noticed that behavior with an LLM. I tried this with other LLMs like ChatGPT. It seems that LLMs are less creative when they are constrained to follow a stricter output format.

And indeed, what actually triggered the idea for this article and its example was a paper I came across: Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models. It was published on arXiv a few days ago, and it confirms my intuition that LLMs are less creative when using controlled generation:

Structured generation, the process of producing content in standardized formats like JSON and XML, is widely utilized in real-world applications to extract key output information from large language models (LLMs). This study investigates whether such constraints on generation space impact LLMs’ abilities, including reasoning and domain knowledge comprehension. Specifically, we evaluate LLMs’ performance when restricted to adhere to structured formats versus generating free-form responses across various common tasks. Surprisingly, we observe a significant decline in LLMs’ reasoning abilities under format restrictions. Furthermore, we find that stricter format constraints generally lead to greater performance degradation in reasoning tasks.

A better solution with a two-step approach with entity extraction

Since LLMs are not as good when we control their generation, we can try a slightly smarter approach:

  • Firstly, we can ask the LLM to give its usual plain-text creative answer,
  • Secondly, we ask the LLM to actually extract all the hashtags from the previous response, using controlled generation.

LLMs are great at various classical Natural Language Processing tasks like entity extraction. And here, indeed, what we want is to just extract the hashtags from the plain-text response.

For such a task, controlled generation won’t hinder creativity, and the extraction will be accurate and correctly pick up all the tags. The aforementioned paper also seemed to hint that controlled generation can actually help with some tasks, like classification.

Let’s have a look at our improved approach. We keep the first attempt from the beginning of this article, without using controlled generation, but we’ll use a different configuration for the second step:

var modelExtraction = VertexAiGeminiChatModel.builder()
    .project(PROJECT_ID)
    .location(LOCATION)
    .modelName("gemini-1.5-pro")
    .responseSchema(SchemaHelper.fromClass(String[].class))
    .build();

List<ChatMessage> messagesForExtraction = new ArrayList<>();
messagesForExtraction.add(SystemMessage.from("""
    Your job is to extract Instagram hashtags from a given text, and
    return them as a JSON array of strings representing those hashtags.
    """));
messagesForExtraction.add(UserMessage.from("""
    Here is the text to extract Instagram hashtags from:

    """ + responseText));

Response<AiMessage> responseFromExtraction =
    modelExtraction.generate(messagesForExtraction);

String extractedTagsJson = responseFromExtraction.content().text();
System.out.println(extractedTagsJson);

  • In this example, I used Gemini 1.5 Pro instead of Gemini 1.5 Flash to show you the use of the responseSchema() method, which allows you to specify the exact shape of the JSON we want to retrieve. I could have used Gemini 1.5 Flash like before, but I would have had to give a bit more prompting help to specify the JSON schema.
  • This time, we use a different system message to explain the task of hashtag extraction.
  • And the user message reuses the creative response from the previous LLM call to extract hashtags from it.

So what’s the output like?

[
  "#greece",
  "#greekislands",
  "#crete",
  "#creteisland",
  "#cretelife",
  "#holiday",
  "#travel",
  "#vacay",
  "#instatravel",
  "#travelgram",
  "#beautifuldestinations",
  "#travelphotography",
  "#photooftheday",
  "#instagood",
  "#igdaily",
  "#instapic",
  "#heraklion",
  "#heraklioncrete",
  "#venetianharbour",
  "#harbourlife",
  "#boatlife",
  "#boatday",
  "#greekboats",
  "#greekharbour",
  "#beautifulcrete",
  "#creteharbour",
  "#cretevibes",
  "#greece_gram",
  "#crete_gram",
  "#greece_travel",
  "#wanderlust",
  "#traveltuesday",
  "#instagood",
  "#travelgram",
  "#beautifuldestinations",
  "#adventuretime",
  "#explorecrete",
  "#cretelove",
  "#lovegreece",
  "#greecevibes",
  "#cretephoto",
  "#creteexperience",
  "#discovercrete"
]

Excellent! It managed to extract all the tags of the creative response!
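
Note that the extracted array contains a few repeated tags (#instagood, #travelgram and #beautifuldestinations each appear twice). Here is a minimal sketch of a post-processing step that removes them, assuming the extracted JSON is a flat array of strings as shown above. The HashtagCleaner class and its uniqueTags method are hypothetical names, and the regular expression is a shortcut that only works for this simple shape, not a general JSON parser:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration, not part of LangChain4j.
class HashtagCleaner {

    // Matches a double-quoted hashtag such as "#crete" inside a flat JSON array.
    private static final Pattern QUOTED_TAG = Pattern.compile("\"(#[^\"\\\\]+)\"");

    // Returns the lowercased hashtags in order of first appearance, without duplicates.
    static List<String> uniqueTags(String jsonArray) {
        Set<String> seen = new LinkedHashSet<>();
        Matcher matcher = QUOTED_TAG.matcher(jsonArray);
        while (matcher.find()) {
            seen.add(matcher.group(1).toLowerCase(Locale.ROOT));
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        String extractedTagsJson = """
            ["#greece", "#instagood", "#travelgram", "#instagood", "#travelgram"]
            """;
        // prints [#greece, #instagood, #travelgram]
        System.out.println(uniqueTags(extractedTagsJson));
    }
}
```

A LinkedHashSet keeps the order in which the model listed the tags, which may matter if the first tags are the most relevant ones.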

Conclusion & discussion

Even if researchers found that LLMs may be less creative when constrained with controlled generation, we can work around this limitation with a two-step approach: the first call is the creative one, while the second handles the data extraction.

One drawback of this approach, however, is that we have to make two calls to the LLM. This can be a bit more costly in terms of tokens generated, and it also adds latency. So you might have to balance cost and latency against quality, depending on your use case.

But it’s always great to have the choice!

https://glaforge.dev/posts/2024/08/12/let-llm-suggest-instagram-hashtags/
Gemini Nano running locally in your browser

Generative AI use cases are usually about running large language models somewhere in the cloud. However, with the advent of smaller models and open models, you can run them locally on your machine, with projects like llama.cpp or Ollama.

And what about in the browser? With MediaPipe and TensorFlow.js, you can train and run small neural networks for tons of fun and useful tasks (like recognising hand movements through the webcam of your computer), and it’s also possible to run Gemma 2B and even 7B models.

But there’s something interesting cooking these days: built-in language models in the browser!

The Chrome developers are working on a new Web API to integrate LLMs in the browser, and are experimenting with the Gemini Nano model (already integrated in some smartphones like Samsung Galaxy or Google Pixel phones) inside Chrome Canary.

Getting started with Gemini Nano and Chrome Canary

I’m sure you want to experiment with that too? Let’s see how to proceed:

  • First of all, you’ll need to download Chrome Canary

  • In chrome://flags, you must enable two experiments:

    • Prompt API for Gemini Nano and
    • Enables optimization guide on device.
  • You’ll have to restart the browser, after having enabled those two flags.

It may take quite a bit of time to download Gemini Nano. As it’s a small model, it takes only around 1.7GB of space, but you’ll need about 20GB of free space on your hard drive at installation time. The API will tell you if the model weights are not fully downloaded yet.

Experimenting in the playground

Now it’s time to play! Let’s see what this embedded Gemini Nano can do, in the Prompt API playground. This is a simple form where you can send prompts to the model, and see its replies.

Looks like it’s smart enough to know that no cat ever went on the moon!

A bit of code

The code of this demo is available on GitHub.

Let’s have a look at the key lines of the Prompt API usage.

To know if the browser supports the Prompt API, you’ll need to check the existence of the new ai object on window:

if (!window.ai) {
    ...
}

Then you’ll have to create a text session with:

const session = await window.ai.createTextSession();

Then you can either wait for the full response, or stream the tokens as they are generated. Here, let’s see the streaming scenario, and how to iterate over the streamed tokens:

const stream = await session.promptStreaming(
    "What's the name of the first cat who stepped on the moon?"
);

for await (const chunk of stream) {
    var fullResponse = chunk.trim();
    // do something with the response, like appending it to a DOM node
}

If you’re not streaming the response, you can also do as follows, to get the response in one go once it’s fully generated:

const result = await session.prompt(
    "What's the name of the first cat who stepped on the moon?"
);

But why run AI in the browser?

Maybe I should have started there, after all. Why would you want to run models locally in the browser, rather than using a cloud-hosted one?

As the documentation outlines:

  • For privacy reasons: you may want to do local processing of sensitive data, to avoid sending such information on the web.
  • For latency gains: once the model is loaded in the browser (in about 3 seconds on my machine), the model responds super fast to all subsequent requests. So you can have a very snappy experience, without the long roundtrip through the internet!
  • For lower costs: since all the AI inference is done in the browser, it’s not going to cost you anything on the server-side.
  • For offline usage: as it runs in the browser, even if you lost your internet connection, your Web UI will continue to function with all its smart AI features.

Interesting resources

  • Check out the Prompt API playground to play with it (after having followed the instructions above).
  • Have a look at the sources of the playground to learn how the demo is done.
  • There’s a nice publication that shows how to use the Prompt API to summarize the content of the web page displayed in your browser.
  • The HuggingFace people have an extended article on how to run Gemini Nano in the browser, with some advanced details about the Prompt API.
  • Read the page that explains the goals of the built-in AI.
  • It’s interesting to glance through the explainer of the Prompt API to understand how it’s been designed.
  • And the best resource for the end, the user guide of the built-in AI early preview, which gives lots of details about the Prompt API.

Summary

I’ve been focusing mostly on large language models in the cloud so far, in particular Gemini, but I’m excited at the prospect of the interesting use cases that a model built into the browser can enable.

Imagine, for example, a travel itinerary application that stores all the information about your trip locally (in IndexedDB or a WebAssembly-fied SQLite), so that you can ask all the questions you want about the journey while offline (basically, RAG in the browser!). No need to hunt for a public wifi network or a local SIM card.

There are also many tasks some browser extension could handle:

  • When preparing my podcast episode and show notes, I could ask Gemini Nano to make a 5-bullet-point summary of the article I’m reading.
  • When reading the reviews for a product, I could get a sentiment analysis signal that tells me if customers are happy with that product.

We could also think of some hybrid scenarios, as both cloud-hosted and local-running models could complement each other.

I hope this Web API will become a standard and that other browsers support it too, and offer different models as well.

https://glaforge.dev/posts/2024/08/07/gemini-nano-running-locally-in-your-browser/
Sentiment analysis with few-shot prompting

In a recent article, we talked about text classification using Gemini and LangChain4j.

A typical example of text classification is the case of sentiment analysis.

In my LangChain4j-powered Gemini workshop, I used this use case to illustrate the classification problem:

ChatLanguageModel model = VertexAiGeminiChatModel.builder()
    .project(System.getenv("PROJECT_ID"))
    .location(System.getenv("LOCATION"))
    .modelName("gemini-1.5-flash-001")
    .maxOutputTokens(10)
    .maxRetries(3)
    .build();

PromptTemplate promptTemplate = PromptTemplate.from("""
    Analyze the sentiment of the text below.
    Respond only with one word to describe the sentiment.

    INPUT: This is fantastic news!
    OUTPUT: POSITIVE

    INPUT: Pi is roughly equal to 3.14
    OUTPUT: NEUTRAL

    INPUT: I really disliked the pizza. Who would use pineapples as a pizza topping?
    OUTPUT: NEGATIVE

    INPUT: {{text}}
    OUTPUT:
    """);

Prompt prompt = promptTemplate.apply(
    Map.of("text", "I love strawberries!"));

Response<AiMessage> response = model.generate(prompt.toUserMessage());

System.out.println(response.content().text());

I used a PromptTemplate to craft the prompt, with a {{text}} placeholder value to analyze the sentiment of that particular text.

Notice that I used the few-shot prompting technique, with example inputs and outputs.

Few-shot prompting with a list of messages

Somehow, I had the impression that this INPUT/OUTPUT notation was a bit of a hack to encourage the LLM to believe this is an actual exchange between the user and the AI.

I believed it would be cleaner to use a real list of messages that alternate user and AI messages. So I implemented this alternative approach, but haven’t yet committed it to my workshop repository.

Meanwhile, as I was chatting with my colleague Dan Dobrin, he pointed me at this very recent blog post from the LangChain people, who were investigating few-shot prompting to improve tool-calling performance.

What’s interesting in their analysis is that overall, at least on this anecdotal example, LLMs seem to do better with real user/AI messages than with a big string of inputs/outputs.

Let’s see how to implement the same approach, with a real exchange of messages:

List<ChatMessage> fewShotPrompts = List.of(
    SystemMessage.from("""
        Analyze the sentiment of the text below.
        Respond only with one word to describe the sentiment.
        """),

    UserMessage.from("This is fantastic news!"),
    AiMessage.from("POSITIVE"),

    UserMessage.from("Pi is roughly equal to 3.14"),
    AiMessage.from("NEUTRAL"),

    UserMessage.from("I really disliked the pizza. " +
                     "Who would use pineapples as a pizza topping?"),
    AiMessage.from("NEGATIVE"),

    UserMessage.from("I love strawberries!")
);

response = model.generate(fewShotPrompts);

System.out.println(response.content().text());

This is not much more verbose than the previous approach, and it’s still very readable. And when pulling the few-shot data from an external database, it feels cleaner than concatenating a big string.
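
When the examples live in a database or a file, the list of messages can be built in a loop. Here is a minimal sketch of that idea, deliberately free of LangChain4j dependencies so it can run on its own. The FewShotBuilder class and its Example and Message records are hypothetical stand-ins: in real code, you would create SystemMessage, UserMessage and AiMessage instances instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for LangChain4j's message types, for illustration only.
class FewShotBuilder {

    record Example(String input, String label) {}

    record Message(String role, String text) {}

    // Builds: the system instruction, then user/AI pairs, then the text to classify.
    static List<Message> build(String instruction, List<Example> examples, String text) {
        List<Message> messages = new ArrayList<>();
        messages.add(new Message("system", instruction));
        for (Example example : examples) {
            messages.add(new Message("user", example.input()));
            messages.add(new Message("ai", example.label()));
        }
        messages.add(new Message("user", text));
        return messages;
    }

    public static void main(String[] args) {
        // In a real application, these rows would come from a database query.
        List<Example> examples = List.of(
            new Example("This is fantastic news!", "POSITIVE"),
            new Example("Pi is roughly equal to 3.14", "NEUTRAL"));

        build("Respond only with one word to describe the sentiment.",
              examples, "I love strawberries!")
            .forEach(m -> System.out.println(m.role() + ": " + m.text()));
    }
}
```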

More type-safe few-shot prompting with messages and AiServices

To further improve on the list of messages tactic, we can use LangChain4j’s AiServices concept, which is a higher-level abstraction than using the model and prompt templates directly.

enum Sentiment { POSITIVE, NEUTRAL, NEGATIVE }

interface SentimentAnalysis {
    @SystemMessage("""
        Analyze the sentiment of the text below.
        Respond only with one word to describe the sentiment.
        """)
    Sentiment analyze(String text);
}

MessageWindowChatMemory memory =
    MessageWindowChatMemory.withMaxMessages(10);

memory.add(UserMessage.from("This is fantastic news!"));
memory.add(AiMessage.from(Sentiment.POSITIVE.name()));

memory.add(UserMessage.from("Pi is roughly equal to 3.14"));
memory.add(AiMessage.from(Sentiment.NEUTRAL.name()));

memory.add(UserMessage.from("I really disliked the pizza. " +
        "Who would use pineapples as a pizza topping?"));
memory.add(AiMessage.from(Sentiment.NEGATIVE.name()));

SentimentAnalysis analyzer =
    AiServices.builder(SentimentAnalysis.class)
        .chatLanguageModel(model)
        .chatMemory(memory)
        .build();

System.out.println(analyzer.analyze("I love strawberries!"));

This third and final approach may be a bit more verbose, and introduces a few more LangChain4j concepts like system messages, chat memory, and the AI service itself, but it has the advantages of being:

  • more type-safe, as we’re using a Sentiment enum, which is easier to manipulate from code,
  • cleaner, because we’re using system instructions to instruct the model about what its job is.

We created:

  • a Java enum to represent the possible values of the sentiment,
  • a SentimentAnalysis interface with a clear signature: a text in input, a Sentiment enum value in output,
  • a @SystemMessage instruction to describe the analysis task,
  • a ChatMemory (here a MessageWindowChatMemory) to hold the few-shot examples.

Then we bind everything together, thanks to AiServices: the analysis interface that LangChain4j will implement for us, the language model, and the chat memory.

Finally, users just have to call the analyze() method, passing the text to analyze.

I also like the fact that we are coding against an interface, and potentially later on, developers could swap the implementation of the sentiment analyzer, and use a different approach.

Conclusion

All three approaches are valid: a big string, a low-level list of messages, or an AiServices abstraction. But I have a slight preference for the approach that is more type-safe and less stringy.

Just like LangChain4j provides an EmbeddingModelTextClassifier class that leverages vector embeddings for text similarity, we could investigate whether it would make sense to also add a few-shot prompting classification solution directly in the LangChain4j project.

https://glaforge.dev/posts/2024/07/30/sentiment-analysis-with-few-shots-prompting/
Analyzing video, audio and PDF files with Gemini and LangChain4j

Certain models like Gemini are multimodal. This means that they accept more than just text as input. Some models support text and images, but Gemini goes further and also supports audio, video, and PDF files. So you can mix and match text prompts and different multimedia files or PDF documents.

Until LangChain4j 0.32, the models could only support text and images, but since my PR got merged into the newly released 0.33 version, you can use all those files with the LangChain4j Gemini module!

Let’s have a look!

Getting the transcription of a podcast recording

Are you an avid podcast listener who wants to read the transcript of an episode? Or do you want to publish that transcript as show notes for your own podcast on your website?

You can ask Gemini for the transcription with the following code:

var model = VertexAiGeminiChatModel.builder()
    .project(PROJECT_ID)
    .location(LOCATION)
    .modelName("gemini-1.5-pro")
    .build();

Response<AiMessage> response = model.generate(
    UserMessage.from(
        AudioContent.from(
            "https://storage.googleapis.com/cloud-samples-data/"
            + "generative-ai/audio/pixel.mp3"),
        TextContent.from(
            "Write a transcription of this audio file")
    )
);

System.out.println(response.content().text());

Above, we created an audio content object with the AudioContent.from(...) method. This method can take a string that is either a direct URL to a file on the web or a Google Cloud Storage URL (like gs://bucket/audio.mp3). It is also possible to load a local file from your file system with AudioContent.from(Paths.get("audio.mp3").toUri()). You can even pass the Base64-encoded content of the audio file and specify its MIME type.
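
For that last option, the Base64 string can be produced with the JDK alone. Here is a minimal sketch, assuming a local MP3 file: the AudioEncoding class is a hypothetical helper and audio.mp3 is a placeholder path. The resulting string and MIME type are what you would then hand over to the audio content object.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical helper for illustration, not part of LangChain4j.
class AudioEncoding {

    // Encodes raw bytes with the standard Base64 alphabet.
    static String toBase64(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    // Reads the whole file in memory, which is fine for short recordings.
    static String toBase64(Path file) throws IOException {
        return toBase64(Files.readAllBytes(file));
    }

    public static void main(String[] args) throws IOException {
        Path audio = Path.of(args.length > 0 ? args[0] : "audio.mp3"); // placeholder path
        if (Files.exists(audio)) {
            String base64 = toBase64(audio);
            String mimeType = "audio/mpeg"; // MIME type commonly used for MP3 files
            System.out.println(mimeType + ", " + base64.length() + " Base64 characters");
        } else {
            System.out.println("No such file: " + audio);
        }
    }
}
```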

What else could you do with audio files?

  • If you’re in a hurry and don’t have time to listen to this one-hour episode, instead of asking for the whole transcript, you could change the prompt to ask for a summary. That way you know if it’s worth spending an hour to listen to it all.

  • Gemini also accepts several audio files in input, so if you are recording interviews with several people on a specific topic, you could ask Gemini to contrast their responses.

Preparing YouTube video chaptering

Let’s say you’re a YouTuber, and you want to do your own video chaptering, instead of relying on the automatic chapters. How can you do that?

var model = VertexAiGeminiChatModel.builder()
    .project(PROJECT_ID)
    .location(LOCATION)
    .modelName("gemini-1.5-flash")
    .build();

Response<AiMessage> response = model.generate(
    UserMessage.from(
        VideoContent.from(
            "https://storage.googleapis.com/cloud-samples-data/"
            + "generative-ai/video/behind_the_scenes_pixel.mp4"),
        TextContent.from(
            "Prepare chapters for this video file, "
            + "using the YouTube chapter notation")
    )
);

System.out.println(response.content().text());

For this video, the chapters generated look as follows:

00:00 Making a Film with a Blind Director
00:16 Adam Morse, Filmmaker
00:28 The Film Shoot
00:48 A Blind Man & His Girlfriend
01:15 Google Pixel Phone
01:33 Guided Frame
02:06 The Technical Crew
02:32 Visual Effects
02:45 Misconceptions About Blindness
03:20 Filmmaking with a Team
03:46 Google Accessibility
04:00 One Person's Perspective
04:29 Adam's Vision
05:03 A Beautiful Position
05:19 Google Logo
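
If you want to post-process such chapters, for instance to index them in a search engine, each line can be parsed into a start offset and a title. Here is a minimal sketch, assuming the model sticks to the "MM:SS Title" layout shown above. ChapterParser and Chapter are hypothetical names:

```java
import java.time.Duration;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
class ChapterParser {

    record Chapter(Duration start, String title) {}

    // Accepts "MM:SS Title" and "H:MM:SS Title" lines.
    private static final Pattern LINE =
        Pattern.compile("^(?:(\\d+):)?(\\d{1,2}):(\\d{2})\\s+(.+)$");

    // Lines that don't look like a chapter are silently skipped.
    static List<Chapter> parse(String chapters) {
        return chapters.lines()
            .map(String::strip)
            .map(LINE::matcher)
            .filter(Matcher::matches)
            .map(m -> new Chapter(
                Duration.ofHours(m.group(1) == null ? 0 : Long.parseLong(m.group(1)))
                    .plusMinutes(Long.parseLong(m.group(2)))
                    .plusSeconds(Long.parseLong(m.group(3))),
                m.group(4)))
            .toList();
    }

    public static void main(String[] args) {
        String generated = """
            00:00 Making a Film with a Blind Director
            01:15 Google Pixel Phone
            """;
        parse(generated).forEach(c ->
            System.out.println(c.start().getSeconds() + "s: " + c.title()));
    }
}
```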

What else could you do with videos?

  • If a video of your meeting or your conference presentation has been recorded, you could use this approach to ask Gemini for a summary of the video, to list its various sections, or to write the transcript.

  • We often record videos of our family, our children, etc. It’s not always easy to search through those videos. You could ask Gemini to provide a summary of the video, that you would then index with some search engine, or just do some simple grep search from the command-line.

Asking questions about PDF documents

Let’s have a look at one last example: PDF documents.

With LangChain4j, it’s possible to use the Apache Tika-based document loader to get the text content of a PDF. However, you lose some important semantic information, as the layout may be important, and the figures may also convey critical details.

Fortunately, Gemini can ingest PDF documents directly, without an intermediate text transcription.

This allows you to ask questions about PDF documents, and since Gemini has a very large context window, it’s able to analyze very big documents, or several documents at the same time, without having to implement your own RAG system (Retrieval Augmented Generation).

var model = VertexAiGeminiChatModel.builder()
    .project(PROJECT_ID)
    .location(LOCATION)
    .modelName("gemini-1.5-flash")
    .build();

Response<AiMessage> response = model.generate(
    UserMessage.from(
        PdfFileContent.from(
            "https://proceedings.neurips.cc/paper_files/paper/2017"
            + "/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"),
        TextContent.from(
            "Give a summary of this paper")
    )
);

System.out.println(response.content().text());

This example analyzes the famous “Attention is all you need” paper that introduced the concept of Transformer neural networks:

This paper proposes a novel neural network architecture called the
Transformer, which relies entirely on an attention mechanism and
dispenses with recurrence and convolutions. The Transformer
outperforms existing models on two machine translation tasks, WMT
2014 English-to-German and WMT 2014 English-to-French, while
requiring significantly less training time. The authors argue that
the Transformer's ability to learn global dependencies without
regard to their distance in the input or output sequences, as well
as its parallelizable nature, make it a promising approach for
sequence modeling and transduction problems. They also present an
analysis of the Transformer's different components and their effect
on performance. The paper concludes by discussing potential future
directions for research.

What else could you do with PDF documents?

  • You can implement some smart question answering solutions over your documents.

  • Gemini can help make sense of differences between two versions of your PDF paper.

  • Gemini allows you to ingest multiple files at the same time, so it is possible to pass the PDF of your dishwasher manual, at the same time as a tutorial showing how to repair it, and then ask the LLM to answer a question on how to fix it.

Summary

Multimodality is a powerful feature of Gemini, and now LangChain4j is equipped with the ability to send text, images, audio files, videos, and PDF documents, potentially all at the same time, to create some innovative multimedia integrations.

https://glaforge.dev/posts/2024/07/25/analyzing-videos-audios-and-pdfs-with-gemini-in-langchain4j/
Text classification with Gemini and LangChain4j

Generative AI has potential applications far beyond chatbots and Retrieval Augmented Generation. For example, a nice use case is text classification.

I had the chance of meeting some customers and prospects who had the need for triaging incoming requests, or for labeling existing data. In the first case, a government entity was tasked with routing citizen requests to access undisclosed information to the right governmental service that could grant or reject that access. In the second case, a company needed to sort out tons of existing internal documents that were not properly organized, and they wanted to quickly start better structuring this trove of information, by labelling each of these docs into different categories.

In both situations, the task was a text classification one: to put each request or document in a distinct pile, so they could more easily be sorted out, organized, and treated more rapidly.

Before generative AI, text classification would be handled by data scientists who would craft and train dedicated machine learning models for that purpose. But it is now also possible to do the same with the help of large language models. That’s what I’d like to explore with you in this article today.

As usual, I’ll be using the Gemini model, and the LangChain4j framework for implementing illustrative examples in Java.

Text classification: putting a label on a document

Before diving into the code, let’s step back a short moment to clarify what text classification is about. When we classify documents, we put a label on them.

For example, in a bug tracker, we could automate adding labels on new tickets that say that the bug report is related to a certain component. So we would put the name of the component as the label for that new ticket.

For routing incoming document access requests, we could put the label of the service that must treat the request, etc.

Filtering is also a text classification problem: we can filter the content of emails to state whether they are spam or not. And we can also use LLMs to filter harmful content from users’ inputs, and even classify the category of harm (hateful speech, harassment, etc.).

Zero-shot prompting: just ask the model!

What about just asking a large language model what it thinks the classification, or the label should be? And indeed, LLMs are often very smart and can figure out the correct classification, without being trained specifically for that purpose.

Let’s illustrate this with a very common type of text classification: sentiment analysis.

First, we can define an enum representing the various sentiments that can be recognized:

enum Sentiment {
    POSITIVE, NEUTRAL, NEGATIVE
}

We create a record which will hold the result of the sentiment analysis:

record SentimentClassification(
    Sentiment sentiment
) {}

We will also need an interface to represent the type-safe Java service that the developers integrating this LLM-backed solution will call to retrieve the sentiment of the text:

interface SentimentClassifier {
    SentimentClassification classify(String text);
}

Notice that it takes in input an unstructured String text, but in output, you’ll manipulate a strongly typed object, not just a mere string.

It’s time to prepare our Gemini model:

var model = VertexAiGeminiChatModel.builder()
    .project(PROJECT_ID)
    .location(LOCATION)
    .modelName("gemini-1.5-pro")
    .responseMimeType("application/json")
    .responseSchema(Schema.newBuilder()
        .setType(Type.OBJECT)
        .putProperties("sentiment",
            Schema.newBuilder()
                .setType(Type.STRING)
                .addAllEnum(Stream.of(Sentiment.values())
                    .map(Enum::name)
                    .collect(Collectors.toList()))
                .build())
        .build())
    .build();

We’re taking advantage of the latest feature of Gemini and LangChain4j, which lets us specify that we want 100% valid JSON in output and, even better, that the generated JSON output must comply with a JSON schema!

Now we create the sentiment analysis service:

SentimentClassifier sentimentClassifier =
    AiServices.create(SentimentClassifier.class, model);

And we call it to retrieve the sentiment of the text we want to analyze:

SentimentClassification classification =
    sentimentClassifier.classify("I am happy!");
System.out.println(classification.sentiment()); // POSITIVE

We didn’t even need to give Gemini any examples, which is why it’s called zero-shot prompting. LLMs are usually smart enough to easily handle familiar classification tasks like sentiment analysis.

Few-shot prompting: when the model needs a little help

A more common approach with LLMs for text classification is few-shot prompting. As the name implies, it’s a prompting technique.

You give the model a task (classifying text), and you show it examples of classifications, with a clear input/output format, to force the LLM to reply with just the expected class.

ChatLanguageModel model = VertexAiGeminiChatModel.builder()
    .project(PROJECT_ID)
    .location(LOCATION)
    .modelName("gemini-1.5-flash-001")
    .maxOutputTokens(10)
    .maxRetries(3)
    .build();

PromptTemplate promptTemplate = PromptTemplate.from("""
    Analyze the sentiment of the text below.
    Respond only with one word to describe the sentiment.

    INPUT: This is fantastic news!
    OUTPUT: POSITIVE

    INPUT: Pi is roughly equal to 3.14
    OUTPUT: NEUTRAL

    INPUT: I really disliked the pizza. Who'd put pineapple toppings?
    OUTPUT: NEGATIVE

    INPUT: {{text}}
    OUTPUT:
    """);

Prompt prompt = promptTemplate.apply(
    Map.of("text", "I love strawberries!"));

Response<AiMessage> response = model.generate(prompt.toUserMessage());

System.out.println(response.content().text()); // POSITIVE

In the above approach, we use LangChain4j’s PromptTemplate, with a placeholder value {{text}} that will contain the text to classify. We don’t use an enum value though, so in the end we have to compare the reply against a string. But we could also apply the same response schema handling as in our previous zero-shot example.
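
Without a response schema, the one-word reply can be mapped back onto an enum by hand. Here is a minimal sketch, assuming the model may add whitespace, different casing, or trailing punctuation around the label. SentimentParser and its parse method are hypothetical helpers, not LangChain4j APIs:

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper for illustration, not part of LangChain4j.
class SentimentParser {

    enum Sentiment { POSITIVE, NEUTRAL, NEGATIVE }

    // Returns the matching enum constant, or empty when the reply is not a known label.
    static Optional<Sentiment> parse(String reply) {
        if (reply == null) {
            return Optional.empty();
        }
        // Keep letters only, so that " Positive.\n" becomes "POSITIVE".
        String normalized = reply.replaceAll("[^A-Za-z]", "").toUpperCase(Locale.ROOT);
        try {
            return Optional.of(Sentiment.valueOf(normalized));
        } catch (IllegalArgumentException unknownLabel) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(" Positive.\n"));   // prints Optional[POSITIVE]
        System.out.println(parse("I am not sure"));  // prints Optional.empty
    }
}
```

Returning an Optional forces the calling code to decide what to do when the model replies with something unexpected, instead of failing with an exception deep inside the parsing logic.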

Let’s rewrite this code a little bit differently, to fake a conversation with the model. The model will see an exchange between a user and itself, and will also follow the same syntax, and will reply with just one word: the sentiment. We’ll use system instructions, and alternating AI and user messages:

List<ChatMessage> fewShotPrompts = List.of(
    SystemMessage.from("""
        Analyze the sentiment of the text below.
        Respond only with one word to describe the sentiment.
        """),

    UserMessage.from("This is fantastic news!"),
    AiMessage.from("POSITIVE"),

    UserMessage.from("Pi is roughly equal to 3.14"),
    AiMessage.from("NEUTRAL"),

    UserMessage.from("I really disliked the pizza. " +
                     "Who'd put pineapple toppings?"),
    AiMessage.from("NEGATIVE"),

    UserMessage.from("I love strawberries!")
);

response = model.generate(fewShotPrompts);

System.out.println(response.content().text()); // POSITIVE

Same outcome, strawberries are yummy!

Text classification with embedding models

In the two previous sections, we took advantage of LLMs’ abilities to classify text on their own, based on their intrinsic knowledge, or with the help of a few examples. But there’s another way we can investigate: using embedding vectors to compare texts.

Embedding vectors are mathematical representations of words/sentences/paragraphs, in the form of a vector of floating point values. The way those vectors are calculated by embedding models makes them close to each other (in terms of distance) when the texts they represent are semantically close. You can have a look at my recent article introducing vector embeddings.

LangChain4j provides a TextClassifier interface which lets you classify a text by comparing it to sets of other texts that belong to the same class. So we give it a map of possible labels, each associated with a list of texts that belong to that category.

In particular, there’s an EmbeddingModelTextClassifier that uses embedding models to compare the text with the examples of each label. We can even tweak its internal algorithm to say whether we prefer the text to be close to the average of all the examples, or close to one of the examples (by default, it’s half distance to the mean, and half distance to the closest example).
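
To make that scoring idea more concrete, here is a minimal sketch in plain Java. It is not LangChain4j’s actual implementation: the EmbeddingScoring class, its method names and the meanToMaxRatio parameter are assumptions made for illustration, cosine similarity is used as the closeness measure (higher means closer), and the tiny two-dimensional vectors stand in for real embeddings. Each label is scored by blending the mean similarity to its examples with the similarity to its closest example:

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch only, not LangChain4j's implementation.
class EmbeddingScoring {

    // Cosine similarity: 1.0 for vectors pointing in the same direction, 0.0 for orthogonal ones.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // A ratio of 1.0 uses only the mean similarity, 0.0 only the closest example.
    static double score(double[] text, List<double[]> examples, double meanToMaxRatio) {
        double sum = 0;
        double max = Double.NEGATIVE_INFINITY;
        for (double[] example : examples) {
            double similarity = cosine(text, example);
            sum += similarity;
            max = Math.max(max, similarity);
        }
        double mean = sum / examples.size();
        return meanToMaxRatio * mean + (1 - meanToMaxRatio) * max;
    }

    // Picks the label whose examples get the highest blended score.
    static <L> L classify(double[] text, Map<L, List<double[]>> examplesByLabel, double ratio) {
        L best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (var entry : examplesByLabel.entrySet()) {
            double s = score(text, entry.getValue(), ratio);
            if (s > bestScore) {
                bestScore = s;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, List<double[]>> examples = Map.of(
            "DESSERT", List.of(new double[] {1.0, 0.1}, new double[] {0.9, 0.2}),
            "MAIN", List.of(new double[] {0.1, 1.0}, new double[] {0.2, 0.9}));
        // prints DESSERT
        System.out.println(classify(new double[] {0.8, 0.3}, examples, 0.5));
    }
}
```

With a ratio of 0.5, a label wins either because the text resembles its examples on average, or because it is very close to one of them, which helps when the examples of a label are quite diverse.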

So let’s have a look at this solution.

Instead of doing sentiment analysis, we’ll go with recipe classification: our goal will be to classify a recipe, to know if it’s an appetizer, a main course, or a dessert.

First, we need to define our labels, with an enum:

enum DishType {
    APPETIZER, MAIN, DESSERT
}

Because we don’t have a dataset of recipes, we’ll use Gemini to generate sample recipes, for each label. For that, we need to configure Gemini:

private static final VertexAiGeminiChatModel CHAT_MODEL =
    VertexAiGeminiChatModel.builder()
        .project(PROJECT_ID)
        .location(LOCATION)
        .modelName("gemini-1.5-flash")
        .build();

We’ll also configure an embedding model to calculate the vector embeddings:

private static final VertexAiEmbeddingModel EMBEDDING_MODEL =
    VertexAiEmbeddingModel.builder()
        .project(PROJECT_ID)
        .location(LOCATION)
        .endpoint(ENDPOINT)
        .publisher("google")
        .modelName("text-embedding-004")
        .taskType(VertexAiEmbeddingModel.TaskType.CLASSIFICATION)
        .build();

Vertex AI’s embedding models are capable of handling various tasks, including:

  • classification,
  • semantic similarity,
  • clustering,
  • question answering,
  • fact verification,
  • query or document retrieval.

Let’s create a method to generate a recipe for a particular type of dish:

private static String recipeOf(DishType type) {
    return CHAT_MODEL.generate(
        "Write a recipe for a %s dish"
            .formatted(type.name().toLowerCase()));
}

And we’ll collect 3 examples of recipes for each type of dish:

var examplesOfRecipes = Stream.of(DishType.values())
    .collect(
        Collectors.toMap(
            dishType -> dishType,
            dishType ->
                Stream.generate(() -> recipeOf(dishType))
                    .limit(3)
                    .toList()
        )
    );

That way, we have our dataset ready, and we’ll prepare a text classifier:

EmbeddingModelTextClassifier<DishType> recipeClassifier =
    new EmbeddingModelTextClassifier<>(EMBEDDING_MODEL,
                                       examplesOfRecipes);

It takes a little while to calculate the initial embedding vectors of all the samples, but now our classifier is ready! Let’s see if the following recipe is an appetizer, a main course, or a dessert:

List<DishType> classifiedDishes = recipeClassifier.classify("""
    **Classic Moist Chocolate Cake**

    This recipe delivers a rich, moist chocolate cake that's
    perfect for any occasion.

    Ingredients:
    * 1 ¾ cups all-purpose flour
    * 2 cups granulated sugar
    * ¾ cup unsweetened cocoa powder
    * 1 ½ teaspoons baking powder
    * 1 ½ teaspoons baking soda
    * 1 teaspoon salt
    * 2 large eggs
    * 1 cup milk
    * ½ cup vegetable oil
    * 2 teaspoons vanilla extract
    * 1 cup boiling water

    Instructions:
    * Preheat oven to 350°F (175°C). Grease and flour two 9-inch
      round cake pans.
    * Combine dry ingredients: In a large bowl, whisk together flour,
      sugar, cocoa powder, baking powder, baking soda, and salt.
    * Add wet ingredients: Beat in eggs, milk, oil, and vanilla until
      combined.
    * Stir in boiling water: Carefully stir in boiling water. The
      batter will be thin.
    * Bake: Pour batter evenly into prepared pans. Bake for 30-35
      minutes, or until a toothpick inserted into the center comes
      out clean.
    * Cool: Let cakes cool in pans for 10 minutes before transferring
      to a wire rack to cool completely.
    """);

System.out.println("This recipe is of type: " + classifiedDishes);
// This recipe is of type: [DESSERT]

And voilà, we used the full power of embedding models to calculate text similarity to classify our chocolate cake recipe as a dessert!

Conclusion

Large Language Models like Gemini are great at classifying text, thanks to their general knowledge of the world that they acquired during their training. But for more specialized use cases, we might need to guide the LLM to recognize labels, because the subject is very specific to our data. That’s when few-shot prompting or embedding model-based classification helps.

If we have lots of samples for each label, using a few-shot prompting approach means we’ll have to pass all those examples again and again in the context window of the LLM, which yields a high token count. So if you pay per token, it can become a bit expensive.

If we use the embedding model text classifier, it might take a while to compute all the embedding vectors, but we’ll do it only once. After that, we only calculate the vector embedding of the text to classify, so the only tokens we pay for are those of that text. If we have lots of samples, the classifier needs to do quite a few vector / matrix computations to calculate the distance to the samples, but it’s usually quite fast (unless we really have hundreds or thousands of samples).
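To get a feel for the difference, here's a back-of-the-envelope calculation. All the numbers are hypothetical (9 samples, 400 tokens per sample and per text to classify, 1,000 requests), and the sketch only counts tokens: chat models and embedding models are usually priced differently, so this is not a cost estimate.

```java
class TokenCountSketch {

    // Few-shot prompting: every request resends all the examples,
    // plus the text to classify.
    static long fewShotTokens(long requests, long samples,
                              long tokensPerSample, long tokensPerInput) {
        return requests * (samples * tokensPerSample + tokensPerInput);
    }

    // Embedding-based classification: the examples are embedded once,
    // then each request only embeds the text to classify.
    static long embeddingTokens(long requests, long samples,
                                long tokensPerSample, long tokensPerInput) {
        return samples * tokensPerSample + requests * tokensPerInput;
    }

    public static void main(String[] args) {
        // Hypothetical workload, invented for the example
        long requests = 1_000, samples = 9, perSample = 400, perInput = 400;

        System.out.println("Few-shot tokens:  "
            + fewShotTokens(requests, samples, perSample, perInput));
        System.out.println("Embedding tokens: "
            + embeddingTokens(requests, samples, perSample, perInput));
    }
}
```

Under these assumptions, the few-shot approach processes roughly ten times more tokens, and the gap widens as the number of samples per label grows.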

I hope this article showed you that Generative AI is useful beyond the usual chatbots and RAG use cases. It’s great at text classification as well. And LangChain4j and Gemini are well suited for that use case, and you learned how to implement different approaches to do text classification.

https://glaforge.dev/posts/2024/07/11/text-classification-with-gemini-and-langchain4j/
Latest Gemini features support in LangChain4j 0.32.0

LangChain4j 0.32.0 was released yesterday, including my pull request with the support for lots of new Gemini features:

  • JSON output mode, to force Gemini to reply using JSON, without any markup,
  • JSON schema, to control and constrain the JSON output to comply with a schema,
  • Response grounding with Google Search web results and with private data in Vertex AI datastores,
  • Easier debugging, thanks to new builder methods to log requests and responses,
  • Function calling mode (none, automatic, or a subset of functions),
  • Safety settings to catch harmful prompts and responses.

Let’s explore those new features together, thanks to some code examples! And at the end of the article, if you make it through, you’ll also discover 2 extra bonus points.

JSON output mode

Creating LLM-powered applications means working with text, as this is what LLMs return. But to facilitate this integration between LLM responses and your code, the text format of choice is usually JSON, as it’s human-readable, and easy to parse programmatically.

However, LLMs are a bit chatty: rather than sending you back a nice raw JSON document, they often reply with some extra sentences, and some Markdown markup to wrap the piece of JSON.

Fortunately, Gemini 1.5 (Flash and Pro) allows you to specify the response MIME type. Currently, only application/json is supported, but other formats may come later.

To do that, when instantiating the Gemini model, use the responseMimeType() builder method:

var model = VertexAiGeminiChatModel.builder()
    .project(PROJECT_ID)
    .location(LOCATION)
    .modelName("gemini-1.5-flash")
    .responseMimeType("application/json")
    .build();

String response = model.generate("Roll a dice");

System.out.println(response);

No sentence, no markdown markup, nothing, just pure JSON:

{"roll": 3}

We didn’t even need to say in the prompt we wanted to get a JSON response!

However, the JSON key of that document may vary from time to time, so you may still wish to be a bit more prescriptive in your prompt, and ask the model to return JSON explicitly, give it an example of the JSON output you expect, etc. That’s the usual prompting approach…
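Since the key can vary, the code consuming the response has to be a bit defensive. Here's a minimal, dependency-free sketch that extracts the first integer value, whatever the key is called. The class and method names are made up for this illustration, and in a real application you'd rather use a proper JSON parser (Jackson, Gson, etc.) than a regular expression:

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DiceRollParsingSketch {

    // Matches the first "someKey": <integer> pair, regardless of the key name.
    private static final Pattern FIRST_INT_VALUE =
        Pattern.compile("\"[^\"]+\"\\s*:\\s*(-?\\d+)");

    static OptionalInt firstIntValue(String json) {
        Matcher matcher = FIRST_INT_VALUE.matcher(json);
        return matcher.find()
            ? OptionalInt.of(Integer.parseInt(matcher.group(1)))
            : OptionalInt.empty();
    }

    public static void main(String[] args) {
        // The key differs between the two responses, the extraction still works
        System.out.println(firstIntValue("{\"roll\": 3}"));
        System.out.println(firstIntValue("{\"dice_roll\": 5}"));
    }
}
```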

But now there’s more!

JSON Schema output

This is quite unique in the LLM ecosystem, as I believe it’s the only model out there that allows you to specify a JSON schema for constraining the JSON output. This works for Gemini 1.5 Pro only, not with Gemini 1.5 Flash.

Let’s have another look at our previous dice roll example, and let’s update it to specify a JSON schema for the output generation:

import static dev.langchain4j.model.vertexai.SchemaHelper.fromClass;
//...

record DiceRoll(int roll) {}

var model = VertexAiGeminiChatModel.builder()
    .project("genai-java-demos")
    .location("us-central1")
    .modelName("gemini-1.5-pro")
    .responseSchema(fromClass(DiceRoll.class))
    .build();

String response = model.generate("Roll a dice");

System.out.println(response);

The generated JSON document will always contain the roll key:

{ "roll": 5 }

In this example, we used a convenience method called fromClass() that creates a JSON schema that corresponds to a Java type (here a Java record).

But there’s also another convenient method that lets us pass a JSON schema string, called fromJsonSchema():

var model = VertexAiGeminiChatModel.builder()
    .project("genai-java-demos")
    .location("us-central1")
    .modelName("gemini-1.5-pro")
    .responseSchema(fromJsonSchema("""
        {
            "type": "object",
            "properties": {
                "roll": {
                    "type": "integer"
                }
            }
        }
        """))
    .build();

It’s also possible to construct a JSON schema programmatically:

var model = VertexAiGeminiChatModel.builder()
    .project("genai-java-demos")
    .location("us-central1")
    .modelName("gemini-1.5-pro")
    .responseSchema(Schema.newBuilder()
        .setType(Type.OBJECT)
        .putProperties("roll",
            Schema.newBuilder()
                .setType(Type.INTEGER)
                .build())
        .build())
    .build();

Now you always get consistent JSON outputs!

Response grounding with Google Search web results and Vertex AI datastores

Large Language Models are wonderful creative machines, but rather than benefiting from their high degree of creativity, we’d prefer having factual responses grounded on data and documents.

Gemini offers the ability to ground responses:

  • against Google Search web results,
  • against Vertex AI search datastores.

Use Google Search to ground responses

The training of an LLM ended at a certain date: its cut-off date. So it doesn’t know about news that happened after that date. But you can request Gemini to use Google Search to find more up-to-date information.

For example, if we ask Gemini about the current elections going on in France, it could reply with something like this:

There is no current national election happening in France right now.

The last major national election in France was the **Presidential
election in April and May 2022**, where Emmanuel Macron won a second
term.

There are, however, **local elections** happening regularly in
different regions of France.

To stay updated on French elections, you can check the website of
the **French Ministry of the Interior** or reputable news sources
like **The Guardian, BBC, CNN, or Le Monde**.

Now, let’s enable the use of Google Search web results with the useGoogleSearch(true) method:

var model = VertexAiGeminiChatModel.builder()
    .project(PROJECT_ID)
    .location(LOCATION)
    .modelName("gemini-1.5-flash")
    .useGoogleSearch(true)
    .build();

String response = model.generate(
    "What is the current election going on in France?");

System.out.println(response);

The answer will be much different, and indeed factual and up-to-date:

France held the first round of a parliamentary election on July 4,
2024. The second round will be on July 7, 2024. The election is
significant because it could result in the first far-right government
in France since World War II.  The National Rally, President Emmanuel
Macron’s centrist alliance, and the New Popular Front coalition are
the three major political blocs competing in the election. The
outcome of the election is highly uncertain, with the far-right
National Rally potentially gaining a parliamentary majority.  If the
National Rally wins a majority, Macron would be expected to appoint
Jordan Bardella, the party's president, as prime minister.

There’s indeed a parliamentary election going on right now in France. Those elections were called only a month ago, thus past the cut-off date of the model’s knowledge.

For my French audience, don’t forget to go voting next Sunday!

Grounding with Vertex AI Search

The idea is that we want to ground responses on our own data. This is particularly important when the knowledge required is actually private information, like our internal docs, or our customers’ docs.

My colleague Mete wrote a great article explaining how to set up grounding with private data. Below, I’ll assume that we created a Vertex AI search app with a datastore backed by a Google Cloud Storage bucket that contains a fictitious document: a car manual for the Cymbal Starlight car model! I’m taking the same example as in Mete’s article.

This time, we specify the search location to point at the Vertex AI search datastore with vertexSearchDatastore():

var model = VertexAiGeminiChatModel.builder()
    .project(PROJECT_ID)
    .location(LOCATION)
    .modelName("gemini-1.5-flash")
    .vertexSearchDatastore(String.format(
            "projects/%s/locations/%s/collections/%s/dataStores/%s",
            PROJECT_ID, "global", "default_collection",
            "cymbal-datastore_1720169982142")
    )
    .build();

String response = model.generate(
    "What is the cargo capacity of Cymbal Starlight?");

System.out.println(response);

It’s a fictitious car, but it’s covered in that private document, and indeed, Gemini is now able to respond to that question:

The Cymbal Starlight 2024 has a cargo capacity of 13.5 cubic feet.

What’s interesting as well is that the response returned by Gemini provides some context about the source document that helped it answer the user query (we’ll see in the next section how to enable logging requests and responses):

  grounding_metadata {
    2: {
      1: {
        3: 66
      }
      2: 0x3f7deee0
    }
    5: {
      2: {
        1: "gs://genai-java-demos-documents/cymbal-starlight-2024.pdf"
        2: "cymbal-starlight-2024"
      }
    }
    6: {
      1: {
        3: 66
        4: "The Cymbal Starlight 2024 has a cargo capacity of 13.5 cubic feet."
      }
      2: "\000"
      3: {
        257772: 63
      }
    }

However, to be honest, I’m not quite sure what the numbers exactly mean, but this metadata mentions that the PDF uploaded in cloud storage is the one that was used to shape the answer of the LLM, and gives an excerpt of the sentence that was found in the document.

Request and response logging

To better understand what’s going on under the hood, you can enable request and response logging. That way, you’re able to see exactly what is sent to Gemini, and what Gemini replies.

To enable logging, there are two methods we can use:

  • logRequests(true) to log the request sent to Gemini,
  • logResponses(true) to log the response received from Gemini.

Let’s see that in action:

var model = VertexAiGeminiChatModel.builder()
    .project(PROJECT_ID)
    .location(LOCATION)
    .modelName("gemini-1.5-flash")
    .logRequests(true)
    .logResponses(true)
    .build();

String response = model.generate("Why is the sky blue?");

System.out.println(response);

Here’s what’s logged:

[main] DEBUG dev.langchain4j.model.vertexai.VertexAiGeminiChatModel -
 GEMINI (gemini-1.5-flash) request: InstructionAndContent {
 systemInstruction = null,
 contents = [role: "user"
parts {
  text: "Why is the sky blue?"
}
]
} tools: []


[main] DEBUG dev.langchain4j.model.vertexai.VertexAiGeminiChatModel -
 GEMINI (gemini-1.5-flash) response: candidates {
  content {
    role: "model"
    parts {
      text: "The sky appears blue due to a phenomenon called
      **Rayleigh scattering**. Here\'s a breakdown:\n\n* **Sunlight
      is made up of all colors of the rainbow.**  When sunlight enters
      the Earth\'s atmosphere, it encounters tiny particles like
      nitrogen and oxygen molecules.\n* **These particles scatter the
      sunlight in all directions.**  However, shorter wavelengths of
      light, like blue and violet, scatter more strongly than longer
      wavelengths, like red and orange.\n* **This preferential
      scattering of shorter wavelengths is called Rayleigh
      scattering.**
      As a result, we see more blue light scattered throughout the sky,
      making it appear blue.\n\n**Why is the sky not violet?**\n\nEven
      though violet light scatters even more strongly than blue, our
      eyes are more sensitive to blue light. This is why we perceive
      the sky as blue rather than violet.\n\n**Other factors that
      affect sky color:**\n\n* **Time of day:** The sky appears more
      red or orange at sunrise and sunset because the sunlight has to
      travel through more of the atmosphere, scattering away most of
      the blue light.\n* **Clouds:** Clouds are made up of larger water
      droplets or ice crystals, which scatter all wavelengths of light
      equally. This is why clouds appear white.\n* **Pollution:**
      Pollution particles can scatter light differently, sometimes
      making the sky appear hazy or even reddish.\n\nLet me know if
      you have any other questions about the sky! \n"
    }
  }
  finish_reason: STOP
  safety_ratings {
    category: HARM_CATEGORY_HATE_SPEECH
    probability: NEGLIGIBLE
    probability_score: 0.054802597
    severity: HARM_SEVERITY_NEGLIGIBLE
    severity_score: 0.03314852
  }
  safety_ratings {
    category: HARM_CATEGORY_DANGEROUS_CONTENT
    probability: NEGLIGIBLE
    probability_score: 0.100348406
    severity: HARM_SEVERITY_NEGLIGIBLE
    severity_score: 0.06359858
  }
  safety_ratings {
    category: HARM_CATEGORY_HARASSMENT
    probability: NEGLIGIBLE
    probability_score: 0.10837755
    severity: HARM_SEVERITY_NEGLIGIBLE
    severity_score: 0.021491764
  }
  safety_ratings {
    category: HARM_CATEGORY_SEXUALLY_EXPLICIT
    probability: NEGLIGIBLE
    probability_score: 0.10338596
    severity: HARM_SEVERITY_NEGLIGIBLE
    severity_score: 0.020410307
  }
}
usage_metadata {
  prompt_token_count: 6
  candidates_token_count: 288
  total_token_count: 294
}

Let me give you a few more details about the logging. LangChain4j uses Slf4j by default for logging. Requests and responses are logged at DEBUG level. So we have to configure our logger and/or logger façade accordingly.

In my test project for this article, I configured the following Maven dependencies for Slf4j and the Simple logger:

<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>2.0.13</version>
</dependency>
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-simple</artifactId>
    <version>2.0.13</version>
</dependency>

I created a properties file to configure the loggers: src/main/resources/simplelogger.properties, which contains the following configuration:

org.slf4j.simpleLogger.defaultLogLevel=debug
org.slf4j.simpleLogger.log.io.grpc.netty.shaded=info

I set the default logging level to be debug. But there’s also Netty, the networking library used under the hood by the Gemini Java SDK, that logs at debug level. So I specified that the logging for this library should only be at info and above, otherwise the output is super chatty.

Function calling mode

So far, when using Gemini for function calling, the model would decide on its own if a function would be useful to call, and which function to call.

But Gemini introduces the ability to control the function or tool choice.

There are 3 options:

  • AUTO — The familiar and default mode, where Gemini decides on its own if a function call is necessary and which one should be made,
  • ANY — Lets you specify a subset of functions from all those available, but also forces the model to pick one of them (only supported by Gemini 1.5 Pro),
  • NONE — Even if tools are defined and available, prevents Gemini from using any of those tools.

Let’s have a look at this example:

var model = VertexAiGeminiChatModel.builder()
    .project(PROJECT_ID)
    .location(LOCATION)
    .modelName("gemini-1.5-pro")
    .logRequests(true)
    .logResponses(true)
    .toolCallingMode(ToolCallingMode.ANY)
    .allowedFunctionNames(Arrays.asList("add"))
    .build();

ToolSpecification adder = ToolSpecification.builder()
    .description("adds two numbers")
    .name("add")
    .addParameter("a", JsonSchemaProperty.INTEGER)
    .addParameter("b", JsonSchemaProperty.INTEGER)
    .build();

UserMessage message = UserMessage.from("How much is 3 + 4?");
Response<AiMessage> answer = model.generate(asList(message), adder);

System.out.println(
    answer.content().toolExecutionRequests().getFirst());

We specify the ToolCallingMode.ANY mode, and we list the allowed function names of the functions that the model must pick in order to reply to the request (with the allowedFunctionNames() builder method).

We describe the tool that can be called. We create a message. And when calling generate(), we pass the tool specification corresponding to the function we want to be called.

The output will show that the model replied with the mandatory tool execution request:

ToolExecutionRequest { id = null, name = "add",
                       arguments = "{"a":3.0,"b":4.0}" }

Now it’s our turn to call the add function with the arguments, and then send the function execution result back to Gemini.
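Calling the function ourselves can be sketched as follows, in plain Java. The ToolCall record is a hypothetical stand-in for the tool execution request (just a name and a JSON arguments string), and the arguments are extracted with a regular expression only to keep the example free of dependencies. A real application would use a JSON parser:

```java
import java.util.Map;
import java.util.function.BiFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ToolDispatchSketch {

    // Hypothetical stand-in for a tool execution request
    record ToolCall(String name, String argumentsJson) {}

    // Registry of the functions we're willing to execute locally
    static final Map<String, BiFunction<Double, Double, Double>> TOOLS =
        Map.of("add", Double::sum);

    // Reads a numeric argument such as "a":3.0 from the JSON string
    static double numberArg(String json, String key) {
        Matcher matcher = Pattern
            .compile("\"" + Pattern.quote(key) + "\"\\s*:\\s*(-?\\d+(?:\\.\\d+)?)")
            .matcher(json);
        if (!matcher.find()) {
            throw new IllegalArgumentException("Missing argument: " + key);
        }
        return Double.parseDouble(matcher.group(1));
    }

    // Looks up the requested function and runs it with the model's arguments
    static double execute(ToolCall call) {
        var tool = TOOLS.get(call.name());
        if (tool == null) {
            throw new IllegalArgumentException("Unknown tool: " + call.name());
        }
        return tool.apply(
            numberArg(call.argumentsJson(), "a"),
            numberArg(call.argumentsJson(), "b"));
    }

    public static void main(String[] args) {
        var call = new ToolCall("add", "{\"a\":3.0,\"b\":4.0}");
        System.out.println("Result to send back to the model: " + execute(call));
    }
}
```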

Warning

Currently, it is not possible to use the ANY forced function calling mode when using LangChain4j’s AiServices class.

AiServices takes care of automatic function calling. But the process is a two-step request / response mechanism:

  • First, we ask the model the math question and pass the tool specification along.
  • The model replies with a ToolExecutionRequest.
  • Then AiServices makes the function call locally, and replies to the model with the function execution result. However, since the ANY calling mode is specified at the model level, the model still wants to reply with yet another tool execution request. Although at this point, the second call made to the model was just to pass the function execution result, not to request another tool execution.
  • So AiServices enters an infinite loop as the model requests a function execution again and again, not taking into account the execution result that it received.

When using AiServices, it’s better to let Gemini operate under the default AUTO tool mode. That way, it knows when it needs to request a tool execution, or if it just needs to handle the tool execution response.

If you want to use the ANY mode with allowedFunctionNames(), then don’t use AiServices, and handle the function calls on your own in your code, to avoid such infinite loop situations.
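When handling the request / response exchange yourself, one simple safeguard is to cap the number of rounds. The sketch below simulates this in plain Java: the Reply record and the Function standing in for the model are hypothetical, and the code doesn't call Gemini at all, it only shows the shape of a bounded loop:

```java
import java.util.function.Function;

class BoundedToolLoopSketch {

    // Hypothetical reply: either another tool request, or a final text answer
    record Reply(String toolRequest, String text) {
        boolean wantsTool() {
            return toolRequest != null;
        }
    }

    // Runs the exchange, giving up after maxRounds tool requests
    static String run(Function<String, Reply> model, String question, int maxRounds) {
        String input = question;
        for (int round = 0; round < maxRounds; round++) {
            Reply reply = model.apply(input);
            if (!reply.wantsTool()) {
                return reply.text();
            }
            // Here we would execute the tool for real; we just fake a result
            input = "tool result for " + reply.toolRequest();
        }
        throw new IllegalStateException(
            "Model kept requesting tools after " + maxRounds + " rounds");
    }

    public static void main(String[] args) {
        // A fake model that always asks for the tool, like a forced ANY mode
        Function<String, Reply> stubborn = input -> new Reply("add", null);
        try {
            run(stubborn, "How much is 3 + 4?", 5);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```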

Specify safety settings

In LLM-powered applications, where users can enter any kind of weird textual inputs, you may want to limit harmful content that may be ingested. To do so, you can specify some safety settings, for different categories of content, with different thresholds of acceptance:

import static dev.langchain4j.model.vertexai.HarmCategory.*;
import static dev.langchain4j.model.vertexai.SafetyThreshold.*;
//...
var model = VertexAiGeminiChatModel.builder()
    .project(PROJECT_ID)
    .location(LOCATION)
    .modelName("gemini-1.5-flash")
    .safetySettings(Map.of(
        HARM_CATEGORY_DANGEROUS_CONTENT, BLOCK_LOW_AND_ABOVE,
        HARM_CATEGORY_SEXUALLY_EXPLICIT, BLOCK_MEDIUM_AND_ABOVE,
        HARM_CATEGORY_HARASSMENT, BLOCK_ONLY_HIGH,
        HARM_CATEGORY_HATE_SPEECH, BLOCK_MEDIUM_AND_ABOVE
    ))
    .build();

If you want to make your app safer for your end-users, and to avoid malicious or ill-disposed users, that’s the way to go!

Bonus point #1: Streaming responses with lambda functions

I’ll round off the review of Gemini-focused features with one little addition I contributed to the project: the ability to pass a lambda instead of a streaming content handler, when using a streaming model.

This is not Gemini-related, you can use it with any model!

More concretely, if you want to use Gemini or another model in streaming mode, to see the response being printed as it’s generated by the model, you would usually write the following code:

var model = VertexAiGeminiStreamingChatModel.builder()
    .project(PROJECT_ID)
    .location(LOCATION)
    .modelName("gemini-1.5-flash")
    .build();

model.generate("Why is the sky blue?", new StreamingResponseHandler<>() {
    @Override
    public void onNext(String aFewTokens) {
        System.out.print(aFewTokens);
    }

    @Override
    public void onError(Throwable throwable) {
        throw new RuntimeException(throwable);
    }
});

Using an anonymous inner class implementing the StreamingResponseHandler interface is quite verbose. Fortunately, I contributed a couple static methods you can import, to make the code a little bit more concise:

import static dev.langchain4j.model.LambdaStreamingResponseHandler.onNext;
import static dev.langchain4j.model.LambdaStreamingResponseHandler.onNextAndError;
//...

// onNext
model.generate("Why is the sky blue?",
    onNext(System.out::println));

// onNextAndError
model.generate("Why is the sky blue?",
    onNextAndError(
        System.out::println,
        ex -> { throw new RuntimeException(ex); }
));

Now you can stream your LLM output in a single instruction!
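If you're curious about how such helpers can be built, here's a small standalone sketch of the technique: static factory methods that wrap lambdas into a handler object. The TokenHandler interface and the fake stream() method are made up for this illustration and are not the LangChain4j classes. In particular, rethrowing errors by default is a choice of this sketch:

```java
import java.util.List;
import java.util.function.Consumer;

class LambdaHandlerSketch {

    // Hypothetical handler interface with two callbacks
    interface TokenHandler {
        void onNext(String token);
        void onError(Throwable error);
    }

    // Wraps a single lambda; errors are rethrown by default in this sketch
    static TokenHandler onNext(Consumer<String> nextConsumer) {
        return onNextAndError(nextConsumer, error -> {
            throw new RuntimeException(error);
        });
    }

    // Wraps two lambdas, one per callback
    static TokenHandler onNextAndError(Consumer<String> nextConsumer,
                                       Consumer<Throwable> errorConsumer) {
        return new TokenHandler() {
            @Override
            public void onNext(String token) {
                nextConsumer.accept(token);
            }

            @Override
            public void onError(Throwable error) {
                errorConsumer.accept(error);
            }
        };
    }

    // Fake "streaming model" that emits a fixed list of chunks
    static void stream(List<String> chunks, TokenHandler handler) {
        chunks.forEach(handler::onNext);
    }

    public static void main(String[] args) {
        stream(List.of("The sky ", "is blue ", "because..."),
            onNext(System.out::print));
        System.out.println();
    }
}
```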

Bonus point #2: Generating stunning images with Imagen v3

A second bonus point in this new LangChain4j release is the fact that the Vertex AI Image model now supports Imagen v3 (Google DeepMind’s latest high-quality image generation model).

Warning

To use the Imagen model, you’ll still have to be allow-listed for now. You’ll need to fill this form to request access to the model.

There are a few new parameters that are available that you can take advantage of when generating pictures. Let’s have a look at the following image generation code:

var imagenModel = VertexAiImageModel.builder()
    .project(PROJECT)
    .location(LOCATION)
    .endpoint(ENDPOINT)
    .publisher("google")
    .modelName("imagen-3.0-generate-preview-0611")
    .aspectRatio(VertexAiImageModel.AspectRatio.LANDSCAPE)
    .mimeType(VertexAiImageModel.MimeType.JPEG)
    .compressionQuality(80)
    .watermark(true) // true by default with Imagen v3
    .withPersisting()
    .logRequests(true)
    .logResponses(true)
    .build();

String prompt = """
    An oil painting close-up, with heavy brush strokes full of
    paint, of two hands shaking together, a young one, and an
    old one conveying a sense of heartfelt thanks and connection
    between generations
    """;

Response<Image> imageResponse = imagenModel.generate(prompt);
System.out.println(imageResponse.content().url());

Let’s see the resulting picture?

In the code above, you certainly noticed the new builder methods:

  • aspectRatio() — not only square, but wide and narrow landscape and portrait modes are available,
  • mimeType() — in addition to PNG, you can request JPEG image generation,
  • compressionQuality() — when requesting JPEG, you can choose the level of compression for encoding the image,
  • watermark() — to have all your generated images be watermarked with SynthID,
  • logRequests() / logResponses() — to see what is exchanged with the model, in and out,
  • persistToCloudStorage() — to specify you want the image saved in a cloud storage bucket (not used in this example).

If you get a chance, and request access to Imagen v3, you’ll notice really great quality improvements compared to v2!

Conclusion

Lots of new Gemini-related features in this release of LangChain4j! I hope this article helped you learn about them, and will make you want to use them in your projects.

If you want to go hands-on with Gemini with LangChain4j, don’t forget to check out my self-paced codelab: Gemini codelab for Java developers, using LangChain4j.

https://glaforge.dev/posts/2024/07/05/latest-gemini-features-support-in-langchain4j/
The power of embeddings: How numbers unlock the meaning of data
Prelude

As I’m focusing a lot on Generative AI, I’m curious about how things work under the hood, to better understand what I’m using in my gen-ai powered projects. A topic I’d like to focus on more is: vector embeddings, to explain more clearly what they are, how they are calculated, and what you can do with them.

A colleague of mine, André, was showing me a cool experiment he’s been working on, to help people prepare an interview, with the help of an AI, to shape the structure of the resulting final article to write.

The idea is to provide: a topic, a target audience, and to describe the goals for the audience. Then, a large language model like Gemini prepares a list of questions (that you can update freely) on that topic. Next, it’s your turn to fill in the blanks, answer those questions, and then the LLM generates an article, with a plan following those key questions and your provided answers. I cheated a bit, and asked Gemini itself those questions, and honestly, I really liked how the resulting article came to be, and I wanted to share with you the outcome below.

It’s a great and simple introduction to vector embeddings! I like how AI can help organize information, and shape the structure and the content of an article. I’m not advocating for letting AI write all your articles, far from it. But as an author, I like that it can help me avoid the blank page syndrome, avoid missing key elements in my essay, and improve the quality of my prose.

Generative AI, in its creative aspect, and as your assistant, can be super useful! Use it as a tool to help drive your creativity! But always use your critical sense to gauge the quality and factuality of the content.

Introduction: What are vector embeddings?

Imagine you have a vast library filled with books on every topic imaginable. Finding a specific book can be a daunting task, especially if you only know the general subject matter. Now imagine a magical system that can understand the meaning of each book and represent it as a unique code. This code, called a vector embedding, can then be used to quickly find the most relevant books based on your search query, even if you only have a vague idea of what you’re looking for.

This is the power of vector embeddings. They are essentially numerical representations of complex data, like text, images, or audio, that capture the underlying meaning and relationships within the data. These numerical codes, arranged as vectors, allow computers to process and compare data in a way that mimics human understanding.

From Text to Numbers: The Journey of Embedding Creation

Creating vector embeddings involves a multi-step process that transforms raw data into meaningful mathematical representations. The journey begins with data preprocessing, where the data is cleaned, normalized, and prepared for embedding generation. This might involve tasks like removing irrelevant information, standardizing data formats, and breaking text into individual words or subwords (tokenization).

Next comes the heart of the process: embedding generation. This step leverages various techniques and algorithms, such as Word2Vec, GloVe, BERT, and ResNet, to convert each data point into a high-dimensional vector. The specific algorithm chosen depends on the type of data being embedded (text, images, or audio) and the intended application.

For instance, Word2Vec uses a neural network to learn relationships between words by analyzing how they co-occur in large text corpora. This results in vector representations for words, where similar words have similar vectors, capturing semantic relationships. Similarly, for images, convolutional neural networks (CNNs) like ResNet can be used to extract features from images, resulting in vectors that represent the visual content.

Vector Databases: The Power of Storing and Searching Embeddings

Once embeddings are generated, they need a dedicated storage system for efficient retrieval and comparison. This is where vector databases come into play. Unlike traditional databases designed for structured data, vector databases are optimized for storing and searching high-dimensional vector data.

Vector databases employ specialized approximate nearest neighbor indexing, using algorithms such as HNSW and libraries such as Annoy and Faiss, to create efficient data structures that allow for fast similarity search. This means that when a user submits a query (e.g., a search term, an image), the database can quickly find the most similar data points based on the similarity of their vector representations.
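
To make "similarity of vector representations" concrete, here is a minimal sketch of a brute-force similarity search using cosine similarity. It is not how a vector database is implemented (those rely on approximate indexes rather than scanning every vector), and the class name, method names, and the three-dimensional vectors are all made up for illustration. It assumes non-zero vectors of equal dimension.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class EmbeddingSearch {

    // Cosine similarity: dot(a, b) / (|a| * |b|), a value between -1 and 1.
    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Exhaustive scan: returns the keys of the k vectors most similar to the query.
    static List<String> nearest(Map<String, double[]> index, double[] query, int k) {
        return index.entrySet().stream()
            .sorted(Comparator.comparingDouble(
                (Map.Entry<String, double[]> e) -> -cosineSimilarity(e.getValue(), query)))
            .limit(k)
            .map(Map.Entry::getKey)
            .toList();
    }

    public static void main(String[] args) {
        // Hypothetical 3-dimensional embeddings, invented for this example.
        Map<String, double[]> index = Map.of(
            "dog",   new double[] {0.9, 0.1, 0.0},
            "puppy", new double[] {0.8, 0.2, 0.1},
            "car",   new double[] {0.0, 0.1, 0.9});
        // prints [dog, puppy]
        System.out.println(nearest(index, new double[] {1.0, 0.0, 0.0}, 2));
    }
}
```

The exhaustive scan costs one comparison per stored vector, which is exactly the cost that approximate indexes are designed to avoid on large collections.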

Embeddings Empower Search: Finding the Needle in the Haystack

The combination of vector embeddings and vector databases revolutionizes search by enabling semantic search. This means that instead of relying solely on keyword matching, search engines can understand the meaning behind the data and find relevant results even if the query doesn’t use exact keywords.

For example, imagine searching for “a picture of a dog with a hat.” Traditional keyword-based search might struggle to find relevant images, as the search term might not match the image description. However, with vector embeddings, the search engine can understand the semantic meaning of the query and find images that contain both a dog and a hat, even if those words are not explicitly mentioned in the image description.

Beyond Search: Expanding the Reach of Embeddings

Vector embeddings are not limited to search applications. They have become essential tools in a wide range of fields, including:

  • Retrieval Augmented Generation (RAG): This technique combines the power of information retrieval and generative models to create more informative and relevant responses. Embeddings are used to find relevant information in large text corpora, which is then used to augment prompts for language models, resulting in more accurate and context-aware outputs.
  • Data Classification: Embeddings enable the classification of data points into different categories based on their similarity. This finds application in areas like sentiment analysis, spam detection, object recognition, and music genre classification.
  • Anomaly Detection: By representing data points as vectors, anomalies can be identified as data points that are significantly different from the majority. This technique is used in various fields, including network intrusion detection, fraud detection, and industrial sensor monitoring.
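
As an illustration of the anomaly detection idea in the list above, the following sketch flags vectors that lie far from the centroid (the average) of all vectors. This is only one simple way to define "significantly different from the majority"; the class name, the Euclidean distance choice, and the fixed threshold are assumptions made for the example, not a recommended production technique.

```java
import java.util.ArrayList;
import java.util.List;

final class CentroidAnomalyDetector {

    // Component-wise average of all vectors (assumes a non-empty list of equal-length vectors).
    static double[] centroid(List<double[]> vectors) {
        int dim = vectors.get(0).length;
        double[] c = new double[dim];
        for (double[] v : vectors) {
            for (int i = 0; i < dim; i++) {
                c[i] += v[i] / vectors.size();
            }
        }
        return c;
    }

    // Euclidean distance between two vectors of the same dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the indexes of the vectors whose distance to the centroid exceeds the threshold.
    static List<Integer> anomalies(List<double[]> vectors, double threshold) {
        double[] c = centroid(vectors);
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < vectors.size(); i++) {
            if (distance(vectors.get(i), c) > threshold) {
                result.add(i);
            }
        }
        return result;
    }
}
```

With the four hypothetical points (0, 0), (0.1, 0), (0, 0.1) and (5, 5) and a threshold of 3.0, only the last point is reported.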

Facing the Challenges and Shaping the Future

While vector embeddings have revolutionized data analysis, they still face some challenges. These include the difficulty of capturing polysemy (multiple meanings of a word), contextual dependence, and the challenge of interpreting the meaning behind the high-dimensional vector representations.

Despite these limitations, research continues to push the boundaries of vector embeddings. Researchers are exploring techniques like contextual embeddings, multilingual embeddings, knowledge graph integration, and explainable embeddings to overcome existing limitations and unlock the full potential of these powerful representations.

Stepping into the World of Embeddings: Resources and Next Steps

For those interested in diving deeper into the world of vector embeddings, a wealth of resources is available. Online courses and tutorials on platforms like Coursera, Fast.ai, and Stanford’s online learning platform provide a solid foundation in the underlying concepts and techniques.

Books like “Speech and Language Processing” by Jurafsky and Martin and “Deep Learning” by Goodfellow, Bengio, and Courville offer in-depth coverage of the field. Additionally, research papers and articles on platforms like arXiv and Medium offer insights into the latest advancements and applications.

To gain practical experience, explore Python libraries like Gensim, spaCy, and TensorFlow/PyTorch. These libraries provide tools for creating and working with embeddings, allowing you to build your own models and experiment with various applications.

The world of vector embeddings is constantly evolving, offering exciting opportunities for innovation and discovery. By understanding the power of these representations, you can unlock new possibilities for data analysis, information retrieval, and artificial intelligence applications.

https://glaforge.dev/posts/2024/07/02/the-power-of-embeddings-how-numbers-unlock-the-meaning-of-data/
Functional builders in Java with Jilt

A few months ago, I shared an article about what I called Java functional builders, inspired by an equivalent pattern found in Go. The main idea was to have builders that looked like this example:

LanguageModel languageModel = new LanguageModel(
    name("cool-model"),
    project("my-project"),
    temperature(0.5),
    description("This is a generative model")
);

Compared to the more traditional builder approach:

  • You’re using the new keyword again to construct instances.
  • There’s no more build() method, which felt a bit verbose.

Compared to using constructors with tons of parameters:

  • You have methods like in traditional builders, that say what each parameter is about (name(), temperature()…) a bit similar to named parameters in some programming languages.

The approach I followed was to take advantage of lambda functions under the hood:

public static ModelOption temperature(Float temperature) {
    return model -> model.temperature = temperature;
}
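
To show how such option methods fit together, here is a minimal, self-contained sketch of the lambda-based approach. The ModelOption interface shape, the accessor methods, and the wrapper class are assumptions made for this example. It also uses Double rather than Float so that a literal like 0.5 can be passed directly.

```java
import java.util.function.Consumer;

final class FunctionalBuilderSketch {

    // An option is simply a function that mutates the instance under construction.
    interface ModelOption extends Consumer<LanguageModel> {}

    static final class LanguageModel {
        // Not final: the option lambdas assign these fields after construction starts.
        private String name;
        private Double temperature;

        // The constructor applies each option in the order it was passed.
        LanguageModel(ModelOption... options) {
            for (ModelOption option : options) {
                option.accept(this);
            }
        }

        String name() { return name; }
        Double temperature() { return temperature; }
    }

    static ModelOption name(String name) {
        return model -> model.name = name;
    }

    static ModelOption temperature(Double temperature) {
        return model -> model.temperature = temperature;
    }

    public static void main(String[] args) {
        LanguageModel model = new LanguageModel(name("cool-model"), temperature(0.5));
        // prints cool-model @ 0.5
        System.out.println(model.name() + " @ " + model.temperature());
    }
}
```

Because every option is passed through the same varargs parameter, nothing forces a caller to supply any of them, and the fields have to stay mutable.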

However, there were a few downsides:

  • Of course, it’s not very conventional! So it can be a bit disturbing for people used to classical builders.
  • I didn’t make the distinction between required and optional parameters (they were all optional!)
  • The internal fields were not final, and I felt they should be.

Discovering Jilt

When searching on this topic, I found Adam Ruka’s great annotation processor library: Jilt.

One of the really cool features of Jilt is its staged builder concept, which makes builders very type-safe, and forces you to call all the required property methods by chaining them. I found this approach very elegant.

Adam heard about my functional builder approach, and decided to implement this new style of builder in Jilt. There are a few differences with my implementation, but it addresses some of the downsides I mentioned.

Let’s have a look at what functional builders look like from a usage standpoint:

LanguageModel languageModel = languageModel(
    name("cool-model"),
    project("my-project"),
    temperature(0.5),
    description("This is a generative model")
);

Compared to my approach, you’re not using constructors (as annotation processors can’t change existing classes), so you have to use a static method instead. But otherwise, inside that method call, you have the named-parameter-like methods you’re used to in builders.

Here, name(), project() and temperature() are mandatory, and you’d get a compilation error if you forgot one of them. But description() is optional and can be omitted.

Let’s now look at the implementation:

import org.jilt.Builder;
import org.jilt.BuilderStyle;
import org.jilt.Opt;

import static jilt.testing.LanguageModelBuilder.*;
import static jilt.testing.LanguageModelBuilder.Optional.description;
//...
LanguageModel languageModel = languageModel(
    name("cool-model"),
    project("my-project"),
    temperature(0.5),
    description("This is a generative model")
);
//...
@Builder(style = BuilderStyle.FUNCTIONAL)
public record LanguageModel(
    String name,
    String project,
    Double temperature,
    @Opt String description
) {}

I used a Java record but it could be a good old POJO. You must annotate that class with the @Builder annotation. The style parameter specifies that you want to use a functional builder. Notice the use of the @Opt annotation to say that a parameter is not required.
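
To get an intuition for how a static factory can turn a forgotten required property into a compilation error, here is a hand-written approximation. It is not the code that Jilt generates, and all the type and method names below are hypothetical. The idea is that each required property gets its own parameter type, while optional properties share a single type passed as varargs.

```java
final class RequiredOptionalSketch {

    // Immutable target: record components are final.
    record LanguageModel(String name, Double temperature, String description) {}

    // One wrapper type per required property.
    record Name(String value) {}
    record Temperature(Double value) {}

    // Optional properties all share one functional type and fill in a mutable draft.
    static final class Draft {
        String description;
    }

    interface OptionalProperty {
        void applyTo(Draft draft);
    }

    static Name name(String value) { return new Name(value); }

    static Temperature temperature(Double value) { return new Temperature(value); }

    static OptionalProperty description(String value) {
        return draft -> draft.description = value;
    }

    // Required parameters are positional and distinctly typed; optional ones are varargs.
    static LanguageModel languageModel(Name name, Temperature temperature,
                                       OptionalProperty... optionals) {
        Draft draft = new Draft();
        for (OptionalProperty optional : optionals) {
            optional.applyTo(draft);
        }
        return new LanguageModel(name.value(), temperature.value(), draft.description);
    }
}
```

With this shape, languageModel(name("cool-model")) does not compile because the Temperature argument is missing, whereas leaving out description(...) is fine.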

Derived instance creation

Let me close this article with another neat trick offered by Jilt, which is how to build other instances from existing ones:

@Builder(style = BuilderStyle.FUNCTIONAL, toBuilder = "derive")
public record LanguageModel(...) {}
//...
LanguageModel derivedModel = derive(languageModel, name("new-name"));

By adding the toBuilder = "derive" parameter to the annotation, you get the ability to create new instances similar to the original one, but you can change both required and optional parameters, to derive a new instance.

Time to try Jilt!

You can try functional builders in Jilt 1.6 which was just released a few days ago!

https://glaforge.dev/posts/2024/06/17/functional-builders-in-java-with-jilt/
Let's make Gemini Groovy!

The happy users of Gemini Advanced, the powerful AI web assistant powered by the Gemini model, can execute some Python code, thanks to a built-in Python interpreter. So, for math, logic, calculation questions, the assistant can let Gemini invent a Python script, and execute it, to let users get a more accurate answer to their queries.

But with my Apache Groovy hat on, I wondered if I could get Gemini to invoke some Groovy scripts as well, for advanced math questions!

LangChain4j based approach

As usual, my tool of choice for any LLM problem is the powerful LangChain4j framework! Interestingly, there are already some code engine integrations:

  • a GraalVM Polyglot Truffle engine, that can execute Python and JavaScript code,
  • a Judge0 engine that uses the Judge0 online code execution system, which also supports Groovy!

I haven’t tried Judge0 yet, as I saw it was supporting Groovy 3 only, and not yet Groovy 4. But for math or logic questions, Groovy 3 is just fine anyway. Instead, I wanted to explore how to create my own Groovy interpreter!

In the following experiment, I’m going to use the Gemini model, because it supports function calling, which means we can instruct the model that it can use some tools when needed.

Let’s walk through this step by step.

First, I instantiate a Gemini chat model:

var model = VertexAiGeminiChatModel.builder()
    .project("MY_GCP_PROJECT_ID")
    .location("us-central1")
    .modelName("gemini-1.5-flash-001")
    .maxRetries(1)
    .build();

Then, I create a tool that is able to run Groovy code, thanks to the GroovyShell evaluator:

class GroovyInterpreter {
  @Tool("Execute a Groovy script and return the result of its execution.")
  public Map<String, String> executeGroovyScript(
    @P("The groovy script source code to execute") String groovyScript) {
    String script = groovyScript.replace("\\n", "\n");
    System.err.format("%n--> Executing the following Groovy script:%n%s%n", script);
    try {
      Object result = new GroovyShell().evaluate(script);
      return Map.of("result", result == null ? "null" : result.toString());
    } catch (Throwable e) {
      // Map.of() rejects null values and getMessage() can return null, so use toString()
      return Map.of("error", e.toString());
    }
  }
}

Notice the @Tool annotation that describes what this tool can do. And the @P annotation which explains what the parameter is about.

I noticed that sometimes the raw script that Gemini suggested contained some \n strings, instead of the plain newline characters, so I’m replacing them with newlines instead.
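
The replacement is easier to follow in isolation. In the sketch below (the class and method names are made up for the example), the Java literal "\\n" denotes the two characters backslash and n, while "\n" denotes a real line feed:

```java
final class EscapedNewlines {

    // Turns the two-character sequence backslash + 'n' into an actual line feed.
    static String unescapeNewlines(String script) {
        return script.replace("\\n", "\n");
    }

    public static void main(String[] args) {
        // This string contains a literal backslash followed by the letter n.
        String raw = "def x = 1\\nx + 1";
        // Prints the script on two lines.
        System.out.println(unescapeNewlines(raw));
    }
}
```

One caveat of this blanket replacement: a script that legitimately contains a \n escape inside a quoted Groovy string would also be rewritten, which would split that string literal across two lines.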

I return a map containing either a result (as a string), or an error message if one was encountered.

Now it’s time to create our assistant contract, in the form of an interface, but with a very carefully crafted system instruction:

interface GroovyAssistant {
  @SystemMessage("""
    You are a problem solver equipped with the capability of \
    executing Groovy scripts.
    When you need to or you're asked to evaluate some math \
    function, some algorithm, or some code, use the \
    `executeGroovyScript` function, passing a Groovy script \
    that implements the function, the algorithm, or the code \
    that needs to be run.
    In the Groovy script, return a value. Don't print the result \
    to the console.
    Don't use semicolons in your Groovy scripts, it's not necessary.
    When reporting the result of the execution of a script, \
    be sure to show the content of that script.
    Call the `executeGroovyScript` function only once, \
    don't call it in a loop.
    """)
  String chat(String msg);
}

This complex system instruction above tells the model what its role is, and that it should call the provided Groovy script execution function whenever it encounters the need to calculate some function, or execute some logic.

I also instruct it to return values instead of printing results.

Funnily, Gemini is a pretty decent Groovy programmer, but it insists on always adding semicolons like in Java, so for a more idiomatic code style, I ask it to get rid of them!

The final step is now to create our LangChain4j AI service with the following code:

var assistant = AiServices.builder(GroovyAssistant.class)
    .chatLanguageModel(model)
    .chatMemory(MessageWindowChatMemory.withMaxMessages(20))
    .tools(new GroovyInterpreter())
    .build();

I combine the Gemini chat model, with a memory to keep track of users’ requests, and the Groovy interpreter tool I’ve just created.

Now let’s see if Gemini is able to create and calculate a fibonacci function:

System.out.println(
  assistant.chat(
    "Write a `fibonacci` function, and calculate `fibonacci(18)`"));

And the output is as follows:

def fibonacci(n) {
  if (n <= 1) {
    return n
  } else {
    return fibonacci(n - 1) + fibonacci(n - 2)
  }
}
fibonacci(18)

The result of executing the script is: 2584.
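
The value can be cross-checked without going through the model. Here is a small iterative version in plain Java (the class name is arbitrary), using the same convention as the generated script where fibonacci(0) is 0 and fibonacci(1) is 1:

```java
final class FibonacciCheck {

    // Iterative Fibonacci: fibonacci(0) = 0, fibonacci(1) = 1, fibonacci(2) = 1, ...
    static long fibonacci(int n) {
        long previous = 0;
        long current = 1;
        for (int i = 0; i < n; i++) {
            long next = previous + current;
            previous = current;
            current = next;
        }
        return previous;
    }

    public static void main(String[] args) {
        // prints 2584
        System.out.println(fibonacci(18));
    }
}
```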

Discussion

It took me a bit of time to find the right system instruction to get Groovy scripts that complied with my requirements. However, I sometimes noticed internal errors returned by the model, which I haven’t fully understood (in particular, why they happen at all).

On some occasions, I also noticed that LangChain4j keeps sending the same script for execution, in a loop. Same thing: I still have to investigate why this rare behavior happens.

So this solution is a fun experiment, but I’d call it just that, an experiment, as it’s not as rock-solid as I want it to be. But if I manage to make it more bullet-proof, maybe I could contribute it back as a dedicated execution engine for LangChain4j!

Full source code

Here’s the full content of my experiment:

import dev.langchain4j.agent.tool.P;
import dev.langchain4j.agent.tool.Tool;
import dev.langchain4j.memory.chat.MessageWindowChatMemory;
import dev.langchain4j.model.vertexai.VertexAiGeminiChatModel;
import dev.langchain4j.service.AiServices;
import dev.langchain4j.service.SystemMessage;
import groovy.lang.GroovyShell;
import java.util.Map;

public class GroovyCodeInterpreterAssistant {
  public static void main(String[] args) {
    var model = VertexAiGeminiChatModel.builder()
      .project("MY_GCP_PROJECT_ID")
      .location("us-central1")
      .modelName("gemini-1.5-flash-001")
      .maxRetries(1)
      .build();

    class GroovyInterpreter {
      @Tool("Execute a Groovy script and return the result of its execution.")
      public Map<String, String> executeGroovyScript(
          @P("The groovy script source code to execute")
          String groovyScript) {
        System.err.format("%n--> Raw Groovy script:%n%s%n", groovyScript);
        String script = groovyScript.replace("\\n", "\n");
        System.err.format("%n--> Executing:%n%s%n", script);
        try {
          Object result = new GroovyShell().evaluate(script);
          return Map.of("result", result == null ? "null" : result.toString());
        } catch (Throwable e) {
          // Map.of() rejects null values and getMessage() can return null, so use toString()
          return Map.of("error", e.toString());
        }
      }
    }

    interface GroovyAssistant {
      @SystemMessage("""
        You are a problem solver equipped with the capability of \
        executing Groovy scripts.
        When you need to or you're asked to evaluate some math \
        function, some algorithm, or some code, use the \
        `executeGroovyScript` function, passing a Groovy script \
        that implements the function, the algorithm, or the code \
        that needs to be run.
        In the Groovy script, return a value. Don't print the result \
        to the console.
        Don't use semicolons in your Groovy scripts, it's not necessary.
        When reporting the result of the execution of a script, \
        be sure to show the content of that script.
        Call the `executeGroovyScript` function only once, \
        don't call it in a loop.
        """)
      String chat(String msg);
    }

    var assistant = AiServices.builder(GroovyAssistant.class)
      .chatLanguageModel(model)
      .chatMemory(MessageWindowChatMemory.withMaxMessages(20))
      .tools(new GroovyInterpreter())
      .build();

    System.out.println(
      assistant.chat(
        "Write a `fibonacci` function, and calculate `fibonacci(18)`"));
  }
}

https://glaforge.dev/posts/2024/06/03/lets-make-gemini-groovy/
Grounding Gemini with Web Search results in LangChain4j

The latest release of LangChain4j (version 0.31) added the capability of grounding large language models with results from web searches. There’s an integration with Google Custom Search Engine, and also Tavily.

Grounding an LLM’s response with search engine results lets the LLM find relevant information about the query on the web. This will likely include up-to-date information that the model could not see during training, because it was published after its training cut-off date.

Remark

Gemini has a built-in Google Web Search grounding capability. However, LangChain4j’s Gemini integration doesn’t yet surface this feature. I’m currently working on a pull request to support this.

Asking questions to your website

An interesting use case for LLM web search grounding is searching a particular website. I was interested in asking questions related to articles that I have posted on my personal website and blog. Let’s see, step by step, how you can implement this.

Creating a custom search engine

First of all, as I decided to use Google Custom Search, I created a new custom search engine. I won’t detail the steps involved in this process, as it’s explained in the documentation. I restricted my custom search engine to the content of my website: glaforge.dev. But you can potentially search the whole internet if you wish, or just your company website, etc.

Google Custom Search gave me an API key, as well as a Custom Search ID (csi) for my newly created custom search engine. You can test the custom search engine with that ID with this URL: https://programmablesearchengine.google.com/controlpanel/overview?cx=YOUR_CSI_HERE. It gives you a Google Search-like interface where you can enter your queries. There’s also a widget that you can integrate in your website if you wish.
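
For reference, the API key and the search engine ID are the two credentials that Google’s Custom Search JSON API expects, as the key and cx query parameters, next to the query q. The sketch below only builds such a request URI; it is an illustration of the raw API and not a description of how LangChain4j’s integration issues its requests. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class CustomSearchRequest {

    // Builds a Custom Search JSON API URI from the API key, the engine ID (cx) and the query.
    static URI searchUri(String apiKey, String engineId, String query) {
        return URI.create("https://www.googleapis.com/customsearch/v1"
            + "?key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8)
            + "&cx=" + URLEncoder.encode(engineId, StandardCharsets.UTF_8)
            + "&q=" + URLEncoder.encode(query, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Placeholder credentials: replace with your own values.
        System.out.println(searchUri("YOUR_API_KEY", "YOUR_CSI_HERE", "gemma langchain4j"));
    }
}
```

Sending a GET request to that URI returns the search results as JSON, which is handy for checking the credentials before wiring them into an application.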

Implementation

First of all, I configure the chat model I want to use. I’m using the latest and fastest Gemini model: Gemini 1.5 Flash. I’ve saved my Google Cloud project ID and location in environment variables.

VertexAiGeminiChatModel model = VertexAiGeminiChatModel.builder()
    .project(System.getenv("PROJECT_ID"))
    .location(System.getenv("LOCATION"))
    .modelName("gemini-1.5-flash-001")
    .build();

Next, I configure my web search engine. Here, I’m using Google Search, but it could be Tavily as well. I also saved my API key and the ID of my custom web search in environment variables:

WebSearchEngine webSearchEngine = GoogleCustomWebSearchEngine.builder()
    .apiKey(System.getenv("GOOGLE_CUSTOM_SEARCH_API_KEY"))
    .csi(System.getenv("GOOGLE_CUSTOM_SEARCH_CSI"))
//    .logRequests(true)
//    .logResponses(true)
    .build();

Note that you can log the requests and responses for debugging purposes.

Next, I define a content retriever. This is a way to let LangChain4j know that content can be retrieved from a particular tool or location:

ContentRetriever contentRetriever = WebSearchContentRetriever.builder()
    .webSearchEngine(webSearchEngine)
    .maxResults(3)
    .build();

Now, I define the contract I want to use to interact with my Gemini model, by creating my own custom search interface:

interface SearchWebsite {
    String search(String query);
}

This interface will be implemented by LangChain4j’s AiServices system that binds several components together: the chat language model (here, Gemini), and the web search content retriever I created above:

SearchWebsite website = AiServices.builder(SearchWebsite.class)
    .chatLanguageModel(model)
    .contentRetriever(contentRetriever)
    .build();

Then I can ask my question to the LLM, which will find the relevant information in my blog:

String response = website.search(
    "How can I call the Gemma model from LangChain4j?");

System.out.println("response = " + response);

If I comment out the line contentRetriever(contentRetriever), Gemini makes a best effort to answer my question, but since there’s nothing in its training data (before its cut-off date) about how to call the Gemma model from LangChain4j, it is not able to provide a useful answer.

But with the web search content retriever, Gemini is able to find the right material to ground its answer, as the custom search returns my article on calling Gemma with Ollama, Testcontainers, and LangChain4j:

Based on the provided information, you can call the Gemma model from
LangChain4j using the following approach:

1. **Use Ollama:** The articles highlight Ollama as a tool for
interacting with Gemma. You would need to set up Ollama and ensure it
has access to the Gemma model.
2. **Integrate TestContainers:** TestContainers helps you manage
containerized environments for testing. You can use it to run Ollama
within a container alongside LangChain4j.
3. **Utilize LangChain4j:** LangChain4j provides the framework for
interacting with large language models. You would define your prompt,
send it to Ollama (which runs Gemma), and receive the response back
through LangChain4j.

**Example Steps:**

1. **Set up Ollama:** Install Ollama and configure it to use the
Gemma model.
2. **Create a Dockerfile:** Use a Dockerfile to define an image that
includes Ollama and any dependencies.
3. **Run Ollama in a container using TestContainers:** Start the
container using TestContainers and ensure it is accessible from your
LangChain4j code.
4. **Implement LangChain4j calls:** Use LangChain4j to construct your
prompt and send it to Ollama (which will pass it to Gemma).
5. **Receive and process the response:** Receive the generated response
from Gemma and process it as needed in your Java application.

**Note:** These steps provide a general approach. You will need to
refer to the documentation for Ollama, TestContainers, and LangChain4j
for specific implementation details.

This method leverages Ollama as an intermediary to access Gemma.
If you have access to Google's Gemini model directly, you might be
able to integrate it with LangChain4j without the Ollama step,
depending on the specific API or SDK offered by Google.

The LLM found that I have to use Ollama and TestContainers, as explained in my article. This information wasn’t part of my query, so it proves that it really found the info in the article.

Discussion

The LLM based its answer on the excerpts contained in the search results, not the whole content of the article, so some aspects of this answer are not totally correct: For instance, you don’t have to install Ollama or create your own Dockerfile.

To make the response perfect, I believe we would have to combine web search results with Retrieval Augmented Generation, or pass the whole context of the article to the model, so that it could provide a more thorough and factual answer.

For different queries that lead to shorter answers, the answer would probably be more to the point.

Another approach is to annotate our String search(String query) method with a @SystemMessage annotation whose instructions encourage the LLM to provide a shorter answer. But it’s difficult to find the right balance between too long and too short, and of course without any sort of hallucinations!

For example, you can try with the following system instruction:

interface SearchWebsite {
  @SystemMessage("""
    Provide a paragraph-long answer, not a long step by step explanation.
    Reply with "I don't know the answer" if the provided information isn't relevant.
    """)
  String search(String query);
}

I got the following response:

The provided information mentions using Gemma with Ollama,
TestContainers, and LangChain4j. You can use Ollama, a local
LLM server, and TestContainers, which provides lightweight,
disposable containers, to set up a testing environment.
Then, with LangChain4j, a Java library for interacting with LLMs,
you can call Gemma through the Ollama server.

Which is shorter and more factual, without being too short either!

What’s next?

In an upcoming article, I’ll show you how to use Gemini’s built-in Google Search grounding, but first, I have to finish my pull request for the LangChain4j project!

Or I can explore how to reply more precisely to queries that lead to complex answers like the above, maybe by combining web search with a RAG approach to get the full context of the article found by the web search.

Also, the Tavily API seems to be able to return the raw content of the article, so it may help give the LLM the full context of the article to base its answers on. It may be worth comparing those two web search integrations too.

Stay tuned!

Full sample code

For reference, here is the full sample (with the system instruction approach):

import dev.langchain4j.model.vertexai.VertexAiGeminiChatModel;
import dev.langchain4j.rag.content.retriever.ContentRetriever;
import dev.langchain4j.rag.content.retriever.WebSearchContentRetriever;
import dev.langchain4j.service.AiServices;
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.web.search.WebSearchEngine;
import dev.langchain4j.web.search.google.customsearch.GoogleCustomWebSearchEngine;

public class GroundingWithSearch {
  public static void main(String[] args) {
    VertexAiGeminiChatModel model = VertexAiGeminiChatModel.builder()
      .project(System.getenv("PROJECT_ID"))
      .location(System.getenv("LOCATION"))
      .modelName("gemini-1.5-flash-001")
      .build();

    WebSearchEngine webSearchEngine = GoogleCustomWebSearchEngine.builder()
      .apiKey(System.getenv("GOOGLE_CUSTOM_SEARCH_API_KEY"))
      .csi(System.getenv("GOOGLE_CUSTOM_SEARCH_CSI"))
//    .logRequests(true)
//    .logResponses(true)
      .build();

    ContentRetriever contentRetriever = WebSearchContentRetriever.builder()
      .webSearchEngine(webSearchEngine)
      .maxResults(3)
      .build();

    interface SearchWebsite {
      @SystemMessage("""
        Provide a paragraph-long answer, not a long step by step explanation.
        Reply with "I don't know the answer" if the provided information isn't relevant.
        """)
      String search(String query);
    }

    SearchWebsite website = AiServices.builder(SearchWebsite.class)
      .chatLanguageModel(model)
      .contentRetriever(contentRetriever)
      .build();

    String response = website.search(
      "How can I call the Gemma model from LangChain4j?");

    System.out.println("response = " + response);
  }
}

https://glaforge.dev/posts/2024/05/28/grounding-gemini-with-web-search-in-langchain4j/
Gemini, Google's Large Language Model, for Java Developers

As a follow-up to my talk on generative AI for Java developers, I’ve developed a new presentation that focuses more on the Gemini large multimodal model by Google.

In this talk, we cover the multimodality capabilities of the model, as it’s able to ingest code, PDF, audio, video, and is able to reason about them. Another specificity of Gemini is its huge context window of up to 1 million tokens! This opens interesting perspectives, especially in multimodal scenarios.

We also talk about the Gemma model, a small open-weights model in the Gemini family. I recently covered how to run it locally thanks to Ollama and Testcontainers.

In that presentation, I’m showing some of my past Gemini-powered demos, as well as the code examples in my Gemini workshop for Java developers, using LangChain4j.

https://glaforge.dev/talks/2024/05/03/gemini-google-large-language-model-for-java-developers/
Calling Gemma with Ollama, TestContainers, and LangChain4j

Lately, for my Generative AI powered Java apps, I’ve used the Gemini multimodal large language model from Google. But there’s also Gemma, its little sister model.

Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. Gemma is available in two sizes: 2B and 7B. Its weights are freely available, and its small size means you can run it on your own, even on your laptop. So I was curious to give it a run with LangChain4j.

How to run Gemma

There are many ways to run Gemma: in the cloud, via Vertex AI with a click of a button, or GKE with some GPUs, but you can also run it locally with Jlama or Gemma.cpp.

Another good option is to run Gemma with Ollama, a tool that you install on your machine, and which lets you run small models, like Llama 2, Mistral, and many others. They quickly added support for Gemma as well.

Once installed locally, you can run:

ollama run gemma:2b
ollama run gemma:7b

Cherry on the cake, the LangChain4j library provides an Ollama module, so you can plug Ollama supported models in your Java applications easily.

Containerization

After a great discussion with my colleague Dan Dobrin who had worked with Ollama and TestContainers (#1 and #2) in his serverless production readiness workshop, I decided to try the approach below.

Which brings us to the last piece of the puzzle: Instead of having to install and run Ollama on my computer, I decided to use Ollama within a container, handled by TestContainers.

TestContainers is not only useful for testing, but you can also use it for driving containers. There’s even a specific OllamaContainer you can take advantage of!

So here’s the whole picture:

Time to implement this approach!

You’ll find the code in the GitHub repository accompanying my recent Gemini workshop.

Let’s start with the easy part, interacting with an Ollama supported model with LangChain4j:

OllamaContainer ollama = createGemmaOllamaContainer();
ollama.start();

ChatLanguageModel model = OllamaChatModel.builder()
    .baseUrl(String.format("http://%s:%d", ollama.getHost(), ollama.getFirstMappedPort()))
    .modelName("gemma:2b")
    .build();

String response = model.generate("Why is the sky blue?");

System.out.println(response);

  • You run an Ollama test container.
  • You create an Ollama chat model, by pointing at the address and port of the container.
  • You specify the model you want to use.
  • Then, you just need to call model.generate(yourPrompt) as usual.

Easy? Now let’s have a look at the trickier part, my local method that creates the Ollama container:

// check if the custom Gemma Ollama image exists already
List<Image> listImagesCmd = DockerClientFactory.lazyClient()
    .listImagesCmd()
    .withImageNameFilter(TC_OLLAMA_GEMMA_2_B)
    .exec();

if (listImagesCmd.isEmpty()) {
    System.out.println("Creating a new Ollama container with Gemma 2B image...");
    OllamaContainer ollama = new OllamaContainer("ollama/ollama:0.1.26");
    ollama.start();
    ollama.execInContainer("ollama", "pull", "gemma:2b");
    ollama.commitToImage(TC_OLLAMA_GEMMA_2_B);
    return ollama;
} else {
    System.out.println("Using existing Ollama container with Gemma 2B image...");
    // Substitute the default Ollama image with our Gemma variant
    return new OllamaContainer(
        DockerImageName.parse(TC_OLLAMA_GEMMA_2_B)
            .asCompatibleSubstituteFor("ollama/ollama"));
}

You need to create a derived Ollama container that pulls in the Gemma model. Either this image was already created beforehand, or if it doesn’t exist yet, you create it.

Use the Docker Java client to check if the custom Gemma image exists. If it doesn’t exist, notice how TestContainers lets you create an image derived from the base Ollama image, pull the Gemma model, and then commit that image to your local Docker registry.

Otherwise, if the image already exists (i.e. you created it in a previous run of the application), you just tell TestContainers that you want to substitute the default Ollama image with your Gemma-powered variant.

And voilà!

You can call Gemma locally on your laptop, in your Java apps, using LangChain4j, without having to install and run Ollama locally (but of course, you need to have a Docker daemon running).

Big thanks to Dan Dobrin for the approach, and to Sergei, Eddú and Oleg from TestContainers for the help and useful pointers.

https://glaforge.dev/posts/2024/04/04/calling-gemma-with-ollama-and-testcontainers/
Gemini codelab for Java developers using LangChain4j

No need to be a Python developer to do Generative AI! If you’re a Java developer, you can take advantage of LangChain4j to implement some advanced LLM integrations in your Java applications. And if you’re interested in using Gemini, one of the best models available, I invite you to have a look at the following “codelab” that I worked on:

Codelab — Gemini for Java Developers using LangChain4j

In this workshop, you’ll find various examples covering the following use cases, in a crescendo approach:

  • Making your first call to Gemini (streaming & non-streaming)
  • Maintaining a conversation
  • Taking advantage of multimodality by analysing images with your prompts
  • Extracting structured information from unstructured text
  • Using prompt templates
  • Doing text classification with few-shot prompting
  • Implementing Retrieval Augmented Generation to chat with your documentation
  • Using function calling to let the LLM interact with external APIs and services

You’ll find all the code samples on GitHub.

If you’re attending Devoxx France, be sure to attend the Hands-on-Lab workshop with my colleagues Mete Atamel and Valentin Deleplace who will guide you through this codelab.

https://glaforge.dev/posts/2024/03/27/gemini-codelab-for-java-developers/
Visualize PaLM-based LLM tokens

As I was working on tweaking the Vertex AI text embedding model in LangChain4j, I wanted to better understand how the textembedding-gecko model tokenizes the text, in particular when we implement the Retrieval Augmented Generation approach.

The various PaLM-based models offer a computeTokens endpoint, which returns a list of tokens (encoded in Base 64) and their respective IDs.

Note

At the time of this writing, there’s no equivalent endpoint for Gemini models.

So I decided to create a small application that lets users:

  • input some text,
  • select a model,
  • calculate the number of tokens,
  • and visualize them with some nice pastel colors.
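
Since the endpoint returns tokens encoded in Base 64, they need to be decoded before they can be displayed. Here is a minimal sketch of that step, assuming the tokens have already been extracted from the response as a list of strings. The TokenDecoder class and the sample values are hypothetical and are not the application's actual code:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.List;
import java.util.stream.Collectors;

public class TokenDecoder {

    // Decode each Base64-encoded token into its UTF-8 text form.
    public static List<String> decode(List<String> base64Tokens) {
        return base64Tokens.stream()
            .map(token -> new String(
                Base64.getDecoder().decode(token), StandardCharsets.UTF_8))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample: "Hello" and " world" encoded in Base64.
        List<String> tokens = List.of("SGVsbG8=", "IHdvcmxk");
        decode(tokens).forEach(t -> System.out.println("[" + t + "]"));
    }
}
```

Keeping the leading space of a token visible (hence the brackets in the sample output) matters, because whitespace is often part of the token itself.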

The available PaLM-based models are:

  • textembedding-gecko
  • textembedding-gecko-multilingual
  • text-bison
  • text-unicorn
  • chat-bison
  • code-gecko
  • code-bison
  • codechat-bison

You can try the application online.

You can also have a look at the source code on GitHub. It’s a Micronaut application. I serve the static assets as explained in my recent article. I deployed the application on Google Cloud Run, the easiest way to deploy a container, and let it auto-scale for you. I did a source-based deployment, as explained at the bottom here.

And voilà, I can visualize my LLM tokens!

https://glaforge.dev/posts/2024/02/05/visualize-palm-based-llm-tokens/
Image generation with Imagen and LangChain4j

This week LangChain4j, the LLM orchestration framework for Java developers, released version 0.26.1, which contains my first significant contribution to the open source project: support for the Imagen image generation model.

Imagen is a text-to-image diffusion model that was announced last year. And it recently upgraded to Imagen v2, with even higher quality graphics generation. As I was curious to integrate it in some of my generative AI projects, I thought that would be a great first contribution to LangChain4j.

Caution

At the time of this writing, image generation is still only for allow-listed accounts.

Furthermore, to run the snippets covered below, you should have an account on Google Cloud Platform, created a project, configured a billing account, enabled the Vertex AI API, and authenticated with the gcloud SDK and the command: gcloud auth application-default login.

Now let’s dive into how to use Imagen v1 and v2 with LangChain4j in Java!

Generate your first images

In the following examples, I’m using the following constants, to point at my project details, the endpoint, the region, etc:

private static final String ENDPOINT = "us-central1-aiplatform.googleapis.com:443";
private static final String LOCATION = "us-central1";
private static final String PROJECT = "YOUR_PROJECT_ID";
private static final String PUBLISHER = "google";

First, we’re going to create an instance of the model:

VertexAiImageModel imagenModel = VertexAiImageModel.builder()
    .endpoint(ENDPOINT)
    .location(LOCATION)
    .project(PROJECT)
    .publisher(PUBLISHER)
    .modelName("imagegeneration@005")
    .maxRetries(2)
    .withPersisting()
    .build();

There are 2 models you can use:

  • imagegeneration@005 corresponds to Imagen 2
  • imagegeneration@002 is the previous version (Imagen 1)

In this article, we’ll use both models. Why? Because currently Imagen 2 doesn’t support image editing, so we’ll have to use Imagen 1 for that purpose.

The configuration above uses withPersisting() to save the generated images in a temporary folder on your system. If you don’t persist the image files, the content of the image is available as Base 64 encoded bytes in the Image objects returned. You can also use persistTo(somePath) to specify a particular directory where you want the generated files to be saved.

Let’s create our first image:

Response<Image> imageResponse = imagenModel.generate(
    "watercolor of a colorful parrot drinking a cup of coffee");

The Response object wraps the created Image. You can get the Image by calling imageResponse.content(). And you can retrieve the URL of the image (if saved locally) with imageResponse.content().url(). The Base 64 encoded bytes can be retrieved with imageResponse.content().base64Data().
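
If you skip persistence and want to write the image yourself, the Base 64 string can be decoded with the JDK alone. This is only a sketch: the Base64ImageSaver class is hypothetical, and the payload in main is a placeholder standing in for the string returned by base64Data():

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class Base64ImageSaver {

    // Decode the Base64 payload and write the raw image bytes to the target file.
    public static Path save(String base64Data, Path target) throws IOException {
        byte[] bytes = Base64.getDecoder().decode(base64Data);
        return Files.write(target, bytes);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder payload; in practice this would be the image's Base64 data.
        String payload = Base64.getEncoder().encodeToString(new byte[] {1, 2, 3});
        Path file = save(payload, Files.createTempFile("imagen-", ".png"));
        System.out.println("Saved " + Files.size(file) + " bytes to " + file);
    }
}
```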

Some other tweaks to the model configuration:

  • Specify the language of the prompt: language("ja") (if the language is not officially supported, it’s usually translated back to English anyway).
  • Define a negative prompt with things you don’t want to see in the picture: negativePrompt("black feathers").
  • Use a particular seed to always generate the same image with the same seed: seed(1234L).

So if you want to generate a picture of a pizza with a prompt in Japanese, but you don’t want to have pepperoni and pineapple, you could configure your model and generate as follows:

VertexAiImageModel imagenModel = VertexAiImageModel.builder()
        .endpoint(ENDPOINT)
        .location(LOCATION)
        .project(PROJECT)
        .publisher(PUBLISHER)
        .modelName("imagegeneration@005")
        .language("ja")
        .negativePrompt("pepperoni, pineapple")
        .maxRetries(2)
        .withPersisting()
        .build();

Response<Image> imageResponse = imagenModel.generate("ピザ"); // pizza

Image editing with Imagen 1

With Imagen 1, you can edit existing images:

  • mask-based editing: you can specify a mask, a black & white image where the white parts are the corresponding parts of the original image that should be edited,
  • mask-free editing: you just give a prompt and let the model figure out what should be edited, on its own or following the prompt.

When generating and editing with Imagen 1, you can also configure the model to use a particular style (with Imagen 2, you just specify it in the prompt) with sampleImageStyle(VertexAiImageModel.ImageStyle.photograph):

  • photograph
  • digital_art
  • landscape
  • sketch
  • watercolor
  • cyberpunk
  • pop_art

When editing an image, you may wish to decide how strong the modification should be, with .guidanceScale(100). Usually, between 0 and 20 or so, the image is lightly edited; between 20 and 100, the edits become more impactful; and at 100 and above, you get the maximum level of editing.

Let’s say I generated an image of a lush forest (I’ll use that as my original image):

VertexAiImageModel model = VertexAiImageModel.builder()
        .endpoint(ENDPOINT)
        .location(LOCATION)
        .project(PROJECT)
        .publisher(PUBLISHER)
        .modelName("imagegeneration@002")
        .seed(19707L)
        .sampleImageStyle(VertexAiImageModel.ImageStyle.photograph)
        .guidanceScale(100)
        .maxRetries(4)
        .withPersisting()
        .build();

Response<Image> forestResp = model.generate("lush forest");

Now I want to edit my forest to add a small red tree at the bottom of the image. I’m loading a black and white mask image with a white square at the bottom. And I pass the original image, the mask image, and the modification prompt, to the new edit() method:

URI maskFileUri = getClass().getClassLoader().getResource("mask.png").toURI();

Response<Image> compositeResp = model.edit(
        forestResp.content(),              // original image to edit
        fromPath(Paths.get(maskFileUri)),  // the mask image
        "red trees"                        // the new prompt
);
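
If you don't have a mask at hand, one way to produce a black image with a white square at the bottom is with the JDK's imaging classes. This is a sketch under assumptions: the MaskGenerator class is hypothetical, and the 1024 pixel image size and 256 pixel square are illustrative values, not the ones used for the mask in this article:

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class MaskGenerator {

    // Build a black mask with a white square centered along the bottom edge.
    public static BufferedImage createMask(int size, int squareSize) {
        BufferedImage mask = new BufferedImage(size, size, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = mask.createGraphics();
        g.setColor(Color.BLACK);
        g.fillRect(0, 0, size, size);
        g.setColor(Color.WHITE); // white marks the area to be edited
        int x = (size - squareSize) / 2;
        int y = size - squareSize;
        g.fillRect(x, y, squareSize, squareSize);
        g.dispose();
        return mask;
    }

    public static void main(String[] args) throws IOException {
        // Assumed dimensions, for illustration only.
        ImageIO.write(createMask(1024, 256), "png", new File("mask.png"));
    }
}
```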

Another kind of editing you can do is to upscale an existing image. As far as I know, it’s only supported for Imagen v1 for now, so we’ll continue with that model.

In this example, we’ll generate an image of 1024x1024 pixels, and we’ll scale it to 4096x4096:

VertexAiImageModel imagenModel = VertexAiImageModel.builder()
        .endpoint(ENDPOINT)
        .location(LOCATION)
        .project(PROJECT)
        .publisher(PUBLISHER)
        .modelName("imagegeneration@002")
        .sampleImageSize(1024)
        .withPersisting()
        .persistTo(defaultTempDirPath)
        .maxRetries(3)
        .build();

Response<Image> imageResponse =
        imagenModel.generate("A black bird looking itself in an antique mirror");

VertexAiImageModel imagenModelForUpscaling = VertexAiImageModel.builder()
        .endpoint(ENDPOINT)
        .location(LOCATION)
        .project(PROJECT)
        .publisher(PUBLISHER)
        .modelName("imagegeneration@002")
        .sampleImageSize(4096)
        .withPersisting()
        .persistTo(defaultTempDirPath)
        .maxRetries(3)
        .build();

Response<Image> upscaledImageResponse =
        imagenModelForUpscaling.edit(imageResponse.content(), "");

And now you have a much bigger image!

Conclusion

That’s about it for image generation and editing with Imagen in LangChain4j today! Be sure to use LangChain4j v0.26.1, which contains that new integration. And I’m looking forward to seeing the pictures you generate with it!

https://glaforge.dev/posts/2024/02/01/image-generation-with-imagen-and-langchain4j/
Serving static assets with Micronaut

My go-to framework when developing Java apps or microservices is Micronaut. For the apps that should have a web frontend, I rarely use Micronaut Views and its templating support. Instead, I prefer to just serve static assets from my resource folder, and have some JavaScript framework (usually Vue.js) to populate my HTML content (often using Shoelace for its nice Web Components). However, the static asset documentation is a bit light on explanations. So, since I always forget how to configure Micronaut to serve static assets, I thought that would be useful to document this here.

In /src/main/resources/application.properties, I’m adding the following:

micronaut.router.static-resources.default.paths=classpath:public
micronaut.router.static-resources.default.mapping=/**
micronaut.router.static-resources.default.enabled=true

micronaut.server.cors.enabled=true

  • The first line says that my resources will live in src/main/resources/public/.
  • The second line means the pattern will match recursively for sub-directories as well.
  • The enabled flag is to activate static serving (not strictly needed as it’s supposed to be enabled by default).
  • I also enabled CORS (cross-origin resource sharing).

Then in src/main/resources/public/, I’ll have my index.html file, my css and js folders.

https://glaforge.dev/posts/2024/01/21/serving-static-assets-with-micronaut/
Light Mode Bookmarklet

A while ago, my friend Sylvain Wallez shared a little bookmarklet
on Twitter/X that transforms a dark mode site into light mode.
I know the trend is towards dark mode, but for a lot of people with certain vision issues,
for example with astigmatism like me, certain dark modes can be very painful.

This site about vision
(and you’ll find other similar references) mentions that:

People who have myopia or astigmatism also may experience halation (from the word “halo”).
Halation occurs when light spreads past a certain boundary, creating a foggy or blurry appearance.

So for certain websites, often with a too strong contrast, I’m using the following bookmarklet trick.

Go to your bookmark manager, and save the following bookmarklet (I called mine “light mode”):

javascript:(function(){document.documentElement.style.filter=document.documentElement.style.filter?%27%27:%27invert(100%)%20hue-rotate(180deg)%27})();

Here is the same code, pretty-printed and with the URL-encoded characters decoded, to decipher what it does:

(function () {
  document.documentElement.style.filter = document.documentElement.style.filter
    ? ""
    : "invert(100%) hue-rotate(180deg)";
})();

Two filters are going to be applied to your current web page:

  • First, it will completely invert all the colors, like a photographic negative
  • Second, compared to Sylvain, I also add a hue rotation of 180 degrees

Why the hue rotation

Because the color inversion is also going to shift the colors: a red will become blue, a yellow will be dark blue, a violet will turn pink, etc.
With a hue rotation, we get back the right color, a red is still red, a blue is still blue, etc.
The difference, however, will be in the lightness, as a light blue becomes dark, and a dark green becomes light.
But at least, it’s a bit more faithful to the original images.
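
To see the effect on a single color, here is a small Java sketch that applies an inversion followed by a half-turn hue rotation. It is only an approximation of the idea: it rotates the hue in the HSB color space via java.awt.Color, whereas browsers implement hue-rotate() with a color matrix, so exact values in a browser will differ. The InvertAndRotate class is hypothetical:

```java
import java.awt.Color;

public class InvertAndRotate {

    // Invert each RGB channel, similar in spirit to the CSS invert(100%) filter.
    static Color invert(Color c) {
        return new Color(255 - c.getRed(), 255 - c.getGreen(), 255 - c.getBlue());
    }

    // Rotate the hue by half a turn (180 degrees) in the HSB color space.
    static Color rotateHue180(Color c) {
        float[] hsb = Color.RGBtoHSB(c.getRed(), c.getGreen(), c.getBlue(), null);
        float hue = (hsb[0] + 0.5f) % 1.0f;
        return new Color(Color.HSBtoRGB(hue, hsb[1], hsb[2]));
    }

    // Inversion followed by hue rotation, as in the bookmarklet.
    static Color lightMode(Color c) {
        return rotateHue180(invert(c));
    }

    public static void main(String[] args) {
        Color lightBlue = new Color(173, 216, 230);
        System.out.println("red, inverted only:      " + invert(Color.RED));
        System.out.println("red, inverted + rotated: " + lightMode(Color.RED));
        System.out.println("light blue, both steps:  " + lightMode(lightBlue));
    }
}
```

In this model, pure red comes back as pure red after both steps, while inversion alone turns it into cyan. A light color such as the light blue above keeps roughly its hue but comes out much darker.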

Here’s a picture to highlight the differences.
See how the rainbow picture is transformed:

Possible improvements

Perhaps we could avoid applying the filter globally, or at least avoid applying it to the images, so that they are not affected by those filters.
At least for now, that’s good enough for me!

https://glaforge.dev/posts/2024/01/18/light-mode-bookmarlet/
Functional builder approach in Java

In Java, builders are a pretty classical pattern for creating complex objects with lots of attributes. A nice aspect of builders is that they help reduce the number of constructors you need to create, in particular when not all attributes are required to be set (or if they have default values).

However, I’ve always found builders a bit verbose with their newBuilder() / build() method combos, especially when you work with deeply nested object graphs, leading to lines of code of builders of builders of…

As I was chatting about those builders with my colleague Valentin, who is a Go developer, he told me about Golang’s functional builder approach. It’s not a very common implementation practice for Java builders, but it’s worth revisiting!

First, the classical builder

Let’s start with an example. We want to create a builder for a class with a few attributes. Not all attributes are mandatory, some may have some default values, and we don’t want to create as many constructors as possible combinations of attributes.

Let me introduce you to my SomeModel class:

public class SomeModel {
    private String modelName;
    private Float temperature = 0.3f;
    private Integer maxOutputTokens = 100;
    // ... possibly many other attributes

    private SomeModel(String modelName,
                      Float temperature,
                      Integer maxOutputTokens) {
        this.modelName = modelName;
        this.temperature = temperature;
        this.maxOutputTokens = maxOutputTokens;
    }
}

Creating tons of constructors for the various model configurations can be painful. Furthermore, some attributes can have the same type, so from a user perspective, it’s hard to know which value corresponds to which parameter type. So creating a builder can reduce that toil.

We could write a static builder class inside SomeModel along the lines of:

public static class SomeModelBuilder {
    private String modelName;
    private Float temperature = 0.3f;
    private Integer maxOutputTokens = 100;

    public SomeModelBuilder modelName(String modelName) {
        this.modelName = modelName;
        return this;
    }

    public SomeModelBuilder temperature(Float temperature) {
        this.temperature = temperature;
        return this;
    }

    public SomeModelBuilder maxOutputTokens(Integer maxOutputTokens) {
        this.maxOutputTokens = maxOutputTokens;
        return this;
    }

    public SomeModel build() {
        return new SomeModel(modelName, temperature, maxOutputTokens);
    }
}

Inside SomeModel you would add a method to instantiate a builder:

public static SomeModelBuilder newBuilder() {
    return new SomeModelBuilder();
}

Then, the user would create a model instance with the builder as follows:

var model = SomeModel.newBuilder()
    .modelName("gemini")
    .temperature(0.2f)
    .maxOutputTokens(300)
    .build();

Not too bad. There are some variations to this approach, like passing the builder in the class’ constructor, using setter methods that return this, using or not using final fields, etc. But they are mostly stylistic variations.

However, I was wondering about this idea of a functional builder…

Existing functional approaches in Java

I haven’t found much literature on this theme. There are 2 blog posts (here and there) that suggest an approach with lambda expressions and Consumers, but I find it even more unconventional than the approach I’m going to describe further in this article:

SomeModel model = new SomeModelBuilder()
        .with($ -> {
            $.modelName = "Gemini";
            $.temperature = 0.4f;
        })
        .with($ -> $.maxOutputTokens = 100);

You can pass one or more lambdas in chained calls. It’s the end-user who controls how the model is built, not the implementor, so I feel it’s less safe. The use of the $ sign is a bit of a syntactical hack to avoid repeating the name of the variable corresponding to the model. Finally, there’s still a builder class after all, and maybe we can find a way to get rid of it.

Let’s see what Go has to offer instead, and if we can get some inspiration from it!

The Go approach

My colleague Valentin pointed me at Dave Cheney’s article on Go’s functional option pattern. There’s also a video available.

The idea is that the constructor takes option functions as a variadic parameter, and each of them is able to modify the instance that’s being built.

Let’s illustrate this with the following snippet.

We create a struct that represents our model object like in our Java example:

package main

import "fmt"

type SomeModel struct {
    modelName string
    temperature float32
    maxOutputTokens int
}

We define a method to construct our model, which takes a vararg of options:

func NewModel(options ...func(*SomeModel)) (*SomeModel) {
    m := SomeModel{"", 0.3, 100}

    for _, option := range options {
        option(&m)
    }

    return &m
}

Those options are actually functions that take a model object as parameter.

Now we can create utility methods that create such option functions, and we pass the value for each field of the struct via the method parameter. So we have a method for each structure field: model name, temperature and max output tokens:

func modelName(name string) func(*SomeModel) {
    return func(m *SomeModel) {
        m.modelName = name
    }
}

func temperature(temp float32) func(*SomeModel) {
    return func(m *SomeModel) {
        m.temperature = temp
    }
}

func maxOutputTokens(max int) func(*SomeModel) {
    return func(m *SomeModel) {
        m.maxOutputTokens = max
    }
}

Next we can create the model in the following way, by calling the utility methods that return functions that are able to modify the struct.

func main() {
    m := NewModel(
        modelName("gemini"),
        temperature(0.5),
        maxOutputTokens(100))

    fmt.Println(m)
}

Notice there’s not even a NewBuilder() or Build() method!

Let’s implement our functional builder in Java!

We can follow the same approach in Java. Instead of Go functions, we’ll use Java’s lambdas. Our lambdas will be converted into Consumers of SomeModel.

So let’s recreate our SomeModel class, with the same fields as before. This time, however, the constructor won’t be private, and it’ll take a list of options (lambda expressions that consume instances of SomeModel). We’ll iterate over all of them to execute them:

import java.util.function.Consumer;

public class SomeModel {
    private String modelName;
    private Float temperature = 0.3f;
    private Integer maxOutputTokens = 100;

    public SomeModel(ModelOption... options) {
        for (ModelOption option : options) {
            option.accept(this);
        }
    }

Update #1

A neat suggestion by Eric Deandrea on Twitter to use streams to filter the null options, as I was not checking them with an extra if before calling accept() on the option:

Optional.ofNullable(options)
   .map(Stream::of)
   .orElseGet(Stream::empty)
   .forEach(option -> option.accept(this));
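
As a variation on that idea (not Eric's exact suggestion), the constructor could guard against both a null options array and individual null entries. The following self-contained sketch uses a hypothetical NullSafeModel class with only two fields and getters added for illustration:

```java
import java.util.Arrays;
import java.util.Objects;
import java.util.function.Consumer;

public class NullSafeModel {
    private String modelName;
    private Float temperature = 0.3f;

    public interface ModelOption extends Consumer<NullSafeModel> {}

    public NullSafeModel(ModelOption... options) {
        // A null varargs array is treated like an empty one.
        if (options != null) {
            Arrays.stream(options)
                .filter(Objects::nonNull) // skip individual null options
                .forEach(option -> option.accept(this));
        }
    }

    public static ModelOption modelName(String modelName) {
        return model -> model.modelName = modelName;
    }

    public static ModelOption temperature(Float temperature) {
        return model -> model.temperature = temperature;
    }

    public String getModelName() { return modelName; }

    public Float getTemperature() { return temperature; }

    public static void main(String[] args) {
        // The null in the middle is silently ignored.
        NullSafeModel model = new NullSafeModel(modelName("gemini"), null, temperature(0.5f));
        System.out.println(model.getModelName() + " / " + model.getTemperature());
    }
}
```

Whether silently skipping a null option is preferable to failing fast is a design choice. Throwing an exception instead may surface bugs in calling code earlier.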

And what is this ModelOption type? It’s just a synonym for a Consumer<SomeModel> (so not strictly needed, but it can help with readability). It’s a nested interface:

    public interface ModelOption extends Consumer<SomeModel> {}

Next, we create similar utility methods that will update the model instance:

    public static ModelOption modelName(String modelName) {
        return model -> model.modelName = modelName;
    }

    public static ModelOption temperature(Float temperature) {
        return model -> model.temperature = temperature;
    }

    public static ModelOption maxOutputTokens(Integer maxOutputTokens) {
        return model -> model.maxOutputTokens = maxOutputTokens;
    }
}

Now, if we want to create a model, we’ll be able to call the constructor as follows:

import fn.builder.SomeModel;
import static fn.builder.SomeModel.*;
//...

SomeModel model = new SomeModel(
    modelName("gemini"),
    temperature(0.5f),
    maxOutputTokens(100)
);

Don’t forget to use a static import to keep the syntax short.

Discussion

A few advantages I see with this approach:

  • I like the fact we’re using a constructor to construct our model instances!
  • And the constructor is super simple and short!
  • It also means that the constructor won’t change when there’s going to be a new parameter to handle (better backward compatibility). On the other hand, with a traditional builder, the constructor could also take the builder itself as sole parameter.
  • I’m also happy that I got rid of the verbose newBuilder() / build() combo. It feels like we don’t really have a builder at play here.
  • At first, I was wondering if I was opening Pandora’s box, as I feared developers could provide their own lambdas and potentially wreak havoc in my instance construction, but because of visibility rules, only my methods can modify the internals of the model class.
  • Although we’re using a constructor, passing those method calls as parameters feels a bit like having named arguments, as in languages like Python or Groovy (which can also create builders for you via AST transformations). It also looks like the classical builder, which has that same readability aspect.
  • I can pass the arguments in whichever order I want.
  • I can put validation rules both in each mutator method and in the constructor after all mutators have been called.
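
To illustrate the last point, here is a sketch of where such validation rules could live. The ValidatedModel class, the rule that the model name is mandatory, and the 0 to 1 temperature range are all assumptions made for the example:

```java
import java.util.function.Consumer;

public class ValidatedModel {
    private String modelName;
    private Float temperature = 0.3f;

    public interface ModelOption extends Consumer<ValidatedModel> {}

    public ValidatedModel(ModelOption... options) {
        for (ModelOption option : options) {
            option.accept(this);
        }
        // Checks on mandatory or related fields run once all options are applied.
        if (modelName == null || modelName.isBlank()) {
            throw new IllegalArgumentException("modelName is required");
        }
    }

    public static ModelOption modelName(String modelName) {
        return model -> model.modelName = modelName;
    }

    public static ModelOption temperature(Float temperature) {
        // Per-option check, performed as soon as the option is created.
        if (temperature == null || temperature < 0f || temperature > 1f) {
            throw new IllegalArgumentException("temperature must be between 0 and 1");
        }
        return model -> model.temperature = temperature;
    }

    public Float getTemperature() { return temperature; }

    public static void main(String[] args) {
        ValidatedModel model = new ValidatedModel(modelName("gemini"), temperature(0.5f));
        System.out.println("temperature = " + model.getTemperature());
    }
}
```

Checking a single value inside its mutator method reports the problem close to the offending argument, while checks that involve several fields, or mandatory ones, fit better at the end of the constructor.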

Potential tweaks:

  • I used non-final fields, because I wanted to be able to define my default values for some fields at definition time rather than in the constructor, but we could certainly tweak this implementation a bit if needed. And anyway, only my mutator methods can alter those fields, so I guess it’s fine.
  • I was curious if I could use Java enums for storing only my allowed mutators, but I haven’t found an effective and concise way of implementing this. Java enums don’t work like Rust’s, but there’s an interesting article about this here on how to implement sum types.
  • I wondered also about a mix of sealed interfaces and maybe records, but similarly to enums, I couldn’t find a nice and short syntax that I was happy with.

In the cons:

  • It’s a bit unconventional, as I haven’t seen this approach implemented in the wild. So maybe the approach suffers in terms of readability.
  • The other concern I have is with discoverability. When auto-completing code, an IDE like IntelliJ is smart enough to suggest that the mutator methods can be used inside the constructor. But it’s not that clear that such mutator methods exist. It’s going to be important to document the constructor to say that those mutators exist.

Feedback

I’d be curious to hear your thoughts on this. Don’t hesitate to interact with me on Mastodon, Twitter, or BlueSky.

https://glaforge.dev/posts/2024/01/16/java-functional-builder-approach/
URL slug or how to remove accents from strings in Java

In this article, we’ll figure out how to create slugs. Not the slobbery kind of little gastropod that crawls on the ground. Instead, we’ll see how to create the short hyphenated text you can see in the URL of your web browser, which is often a URL-friendly variation of the title of the article.

Interestingly, one of the most popular posts on my blog is an almost 20 year old article that explains how to remove accents from a string. And indeed, in slugs you would like to remove accents, among other things.

Show full content

In this article, we’ll figure out how to create slugs. Not the slobbery kind of little gastropods that crawl on the ground. Instead, we’ll see how to create the short hyphenated text you can see in the URL of your web browser, and that is often a URL-friendly variation of the title of the article.

Interestingly, one of the most popular posts on my blog is an almost 20 year old article that explains how to remove accents from a string. And indeed, in slugs you would like to remove accents, among other things.

So what problem are we trying to solve today? Let’s say you have an article whose title is “L’été, où es tu ?” (which translates to: “Summer, where have you been?”). You want your blog to have a friendly URL that looks like the title, but without the punctuation or the accents (also called diacritical marks), and you also want to replace spaces with hyphens. The final URL should then be https://myblog.com/l-ete-ou-es-tu.

A naive approach would be to try to replace all the letters bearing diacritical marks with their unmarked equivalents, replacing “é” with “e”, and so on. Don’t do that: you’ll likely miss some letters in some languages. A better approach is to take advantage of Unicode normalization.

If you are interested, you can learn more about Unicode normalization on the Unicode.org website. But in a nutshell, some letters, like accented letters, are a combination of a base letter and a diacritical mark.

Let’s have a look at this image from the link above:

The uppercase angström letter is composed of the uppercase letter A and the ring above diacritical mark. The composed letter has a Unicode value of U+212B but can be decomposed into U+0041 (uppercase A) and U+030A (ring above).

I’ll spare you from the details of the various normalization forms. But Java allows you to work with the normalized forms of letters thanks to the java.text.Normalizer class. We’ll also take advantage of Java’s regex Pattern class to identify particular classes of characters.
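To make the decomposition concrete, here is a minimal sketch that prints the code points of a string after NFD normalization. The class and method names (DecompositionSketch, decomposedCodePoints) are made up for this illustration; only java.text.Normalizer comes from the JDK.

```java
import java.text.Normalizer;

final class DecompositionSketch {

    // Normalizes the text to NFD (canonical decomposition) and lists
    // the resulting code points in the usual U+XXXX notation.
    static String decomposedCodePoints(String text) {
        String nfd = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        nfd.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The angstrom sign splits into the letter A and the combining ring above.
        System.out.println(decomposedCodePoints("\u212B")); // U+0041 U+030A
        // A precomposed e with acute accent splits into e and the combining acute accent.
        System.out.println(decomposedCodePoints("\u00E9")); // U+0065 U+0301
    }
}
```

Once the marks are separate code points like this, removing them is a simple matter of filtering on their character class.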

Be sure to check the Javadocs of the Normalizer and Pattern classes: the former explains how to do string normalization, and the latter will give you the list of available character classes.

Let’s have a look at the following Java snippet:

import java.text.Normalizer;

String title = "L'été, où es tu ?";

String slug = Normalizer.normalize(title, Normalizer.Form.NFD)
          .toLowerCase()                  // "l'été, où es tu ?"
          .replaceAll("\\p{IsM}+", "")    // "l'ete, ou es tu ?"
          .replaceAll("\\p{IsP}+", " ")   // "l ete  ou es tu  "
          .trim()                         // "l ete  ou es tu"
          .replaceAll("\\s+", "-");       // "l-ete-ou-es-tu"

My approach is usually the following:

  • First, I normalize the text into the NFD form (canonical decomposition), so base characters and diacritical marks are now separated,
  • Then, I replace all the uppercase letters with lowercase ones,
  • Next, I use the IsM property, which selects the diacritical marks, and remove them,
  • Similarly, I look at the punctuation characters, with the IsP property, and replace them with spaces,
  • I usually trim the string at that point, as I don’t want to have spaces at the beginning or end of the string (when a punctuation mark is replaced with a space in the previous step),
  • Finally, all the runs of space characters are replaced with hyphens.
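The steps above can be wrapped in a small reusable method. This is only a sketch: the SlugSketch class and toSlug method names are invented for the example, and passing Locale.ROOT to toLowerCase() is my own addition so that the result doesn’t depend on the default locale of the machine.

```java
import java.text.Normalizer;
import java.util.Locale;

final class SlugSketch {

    // Applies the steps listed above, in the same order.
    static String toSlug(String title) {
        return Normalizer.normalize(title, Normalizer.Form.NFD) // split base letters and marks
                .toLowerCase(Locale.ROOT)                       // locale-independent lowercasing
                .replaceAll("\\p{IsM}+", "")                    // drop the diacritical marks
                .replaceAll("\\p{IsP}+", " ")                   // punctuation becomes spaces
                .trim()                                         // no leading or trailing spaces
                .replaceAll("\\s+", "-");                       // runs of spaces become one hyphen
    }

    public static void main(String[] args) {
        // The title from the article, written with Unicode escapes
        // so that the source file encoding doesn't matter.
        String title = "L'\u00e9t\u00e9, o\u00f9 es tu ?";
        System.out.println(toSlug(title)); // l-ete-ou-es-tu
    }
}
```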
Slugify

Recently, I came across a Java library that takes care of creating slugs: Slugify!

With Slugify, you can do a similar transformation as mine, with the following code:

import com.github.slugify.Slugify;

Slugify slugify = Slugify.builder().build();
String slug = slugify.slugify("L'été, où es tu ?");    // "lete-ou-es-tu"

A small difference here is that the quote is removed, which leads to having the string lete instead of l-ete. I find that a bit less readable at a glance, but Slugify has various knobs you can tweak to customize its output.

One thing I particularly like about this library is its use of the ICU4J library, which supports transliteration (ICU is a well-known set of libraries for full Unicode and globalization support).

The problem with our examples above is that they only work well for languages with Latin-like alphabets. My examples keep characters like ideograms intact, and Slugify removes them by default. If you want to have URLs that stay within the ASCII realm, you can use transliteration, which can map text in one language into readable Latin-like text that sounds like the original text.

So if I wanted to transliterate my string into ASCII-friendly text, I could use Slugify’s integration of ICU:

import com.github.slugify.Slugify;
import java.util.Locale;

Slugify slugify = Slugify.builder()
        .transliterator(true)          // use transliteration
        .locale(Locale.ENGLISH)
        .build();

String slug = slugify.slugify("夏よ、どこにいるの?");  // "xiayo-dokoniiruno"
https://glaforge.dev/posts/2024/01/08/url-slug-or-how-to-remove-accents-in-java/
Gemini Function Calling

A promising feature of the Gemini large language model, released recently by Google DeepMind, is the support for function calls. It’s a way to supplement the model, by letting it know that external functions or APIs can be called. So you’re not limited by the knowledge cut-off of the model: instead, in the flow of the conversation with the model, you can pass a list of functions the model will know are available to get the information it needs to complete the generation of its answer.

For example, if you want to ask the model about the weather, it doesn’t have the realtime information about the weather forecast. But we can tell it that there’s a function that can be called, to get the forecast for a given location. Internally, the model will acknowledge it doesn’t know the answer about the weather, but it will request that you call an external function that you describe, using a specific set of parameters which correspond to the user’s request.

Just days ago, I wrote about how to get started with Gemini in Java. In that article, we explored how to use the hand-written Java SDK that is available to interact with Gemini from Java. However, the Java SDK doesn’t yet expose all the features of the model: in particular, function calling is missing. But not all hope is lost! Because under the hood, the SDK relies on the generated protobuf classes library, which exposes everything!

Note

Soon, Gemini will be supported by LangChain4j, and the Java SDK will also provide an easier way to take care of function calling. But in this article, I wanted to explore the use of the internal protobuf classes, to see how to best implement its support in the SDK.

Let’s go step by step!

Instead of using the GenerativeModel API from the SDK, we’ll go straight with the PredictionServiceClient:

try (VertexAI vertexAI = new VertexAI(projectId, location)) {
  PredictionServiceClient client = vertexAI.getPredictionServiceClient();
  ...
}

We need to prepare a function declaration to describe the kind of functions that the LLM can ask us to call, and we’ll wrap it in a Tool:

FunctionDeclaration functionDeclaration = FunctionDeclaration.newBuilder()
    .setName("getCurrentWeather")
    .setDescription("Get the current weather in a given location")
    .setParameters(
        Schema.newBuilder()
            .setType(Type.OBJECT)
            .putProperties("location", Schema.newBuilder()
                .setType(Type.STRING)
                .setDescription("location")
                .build()
            )
            .addRequired("location")
            .build()
    )
    .build();

Tool tool = Tool.newBuilder()
    .addFunctionDeclarations(functionDeclaration)
    .build();

Functions are described using classes that represent a subset of the OpenAPI 3 specification.

Important

It is important to provide descriptions for the function and its parameters, as the LLM will use that information to figure out which function to call, and which parameters should be passed.

Next, let’s prepare a question asking about the weather in Paris, and configure the text generation request with that prompt and the tool defined above:

String resourceName = String.format(
    "projects/%s/locations/%s/publishers/google/models/%s",
    vertexAI.getProjectId(), vertexAI.getLocation(), modelName);

Content questionContent =
    ContentMaker.fromString("What's the weather in Paris?");

GenerateContentRequest questionContentRequest =
    GenerateContentRequest.newBuilder()
        .setEndpoint(resourceName)
        .setModel(resourceName)
        .addTools(tool)
        .addContents(questionContent)
        .build();

ResponseStream<GenerateContentResponse> responseStream =
    new ResponseStream<>(new ResponseStreamIteratorWithHistory<>(
        client
            .streamGenerateContentCallable()
            .call(questionContentRequest)
            .iterator())
);

GenerateContentResponse generateContentResponse =
    responseStream.stream().findFirst().get();
Content callResponseContent =
    generateContentResponse.getCandidates(0).getContent();

If you print the callResponseContent variable, you’ll see that it contains a function call request, suggesting that you should call the predefined function with the parameter of Paris:

role: "model"
parts {
  function_call {
    name: "getCurrentWeather"
    args {
      fields {
        key: "location"
        value {
          string_value: "Paris"
        }
      }
    }
  }
}

At that point, as the developer, it’s your turn to work a little, and make the call to that function yourself! Let’s pretend I called an external Web Service that gives weather information, and that it returns some JSON payload that would look like so:

{
  "weather": "sunny",
  "location": "Paris"
}
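Routing the function call requested by the model to your own code can be sketched with a small dispatch table. This is a plain Java illustration that doesn’t use the SDK: the FunctionDispatchSketch class, its HANDLERS registry, and the canned getCurrentWeather implementation are all hypothetical, and a real handler would call an actual weather service.

```java
import java.util.Map;
import java.util.function.Function;

final class FunctionDispatchSketch {

    // Hypothetical stand-in for a real weather web service call.
    static Map<String, String> getCurrentWeather(Map<String, String> args) {
        return Map.of("weather", "sunny", "location", args.get("location"));
    }

    // Maps the function names declared to the model to local handlers.
    static final Map<String, Function<Map<String, String>, Map<String, String>>> HANDLERS =
            Map.of("getCurrentWeather", FunctionDispatchSketch::getCurrentWeather);

    // Looks up the handler named in the model's function call and invokes it.
    static Map<String, String> dispatch(String functionName, Map<String, String> args) {
        Function<Map<String, String>, Map<String, String>> handler = HANDLERS.get(functionName);
        if (handler == null) {
            throw new IllegalArgumentException("Unknown function: " + functionName);
        }
        return handler.apply(args);
    }

    public static void main(String[] args) {
        // Name and arguments as they appear in the function call request above.
        Map<String, String> result = dispatch("getCurrentWeather", Map.of("location", "Paris"));
        System.out.println(result.get("weather") + " in " + result.get("location")); // sunny in Paris
    }
}
```

Rejecting unknown function names explicitly is a deliberate choice here: the name comes from the model’s output, so it’s safer not to assume it always matches a declared function.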

We now need to create a function response structure to pass that information back to the LLM:

Content contentFnResp = Content.newBuilder()
    .addParts(Part.newBuilder()
        .setFunctionResponse(
            FunctionResponse.newBuilder()
                .setResponse(
                    Struct.newBuilder()
                        .putFields("weather",
                            Value.newBuilder().setStringValue("sunny").build())
                        .putFields("location",
                            Value.newBuilder().setStringValue("Paris").build())
                        .build()
                )
                .build()
        )
        .build())
    .build();

Then, since LLMs are actually stateless beasts, we need to give the model the whole context of the conversation again, passing the query, the function call the model suggested we make, as well as the response we got from the external weather service:

GenerateContentRequest generateContentRequest = GenerateContentRequest.newBuilder()
    .setEndpoint(resourceName)
    .setModel(resourceName)
    .addContents(questionContent)
    .addContents(callResponseContent)
    .addContents(contentFnResp)
    .addTools(tool)
    .build();

And to finish, we’ll invoke the client one last time with that whole dialog and information, and print a response out:

responseStream = new ResponseStream<>(new ResponseStreamIteratorWithHistory<>(
    client
        .streamGenerateContentCallable()
        .call(generateContentRequest)
        .iterator())
);

for (GenerateContentResponse resp : responseStream) {
    System.out.println(ResponseHandler.getText(resp));
}

And happily, Gemini will reply to us that:

The weather in Paris is sunny.

What a lovely way to start the holiday season, with nice and sunny weather!

I wish you all happy year end festivities, and I look forward to seeing you next year. Hopefully next month, I’ll be able to show you some cool new SDK features or the LangChain4j integration! Thanks for reading.

https://glaforge.dev/posts/2023/12/22/gemini-function-calling/
Visualize and Inspect Workflows Executions

When using a service like Google Cloud Workflows, in particular as your workflows get bigger, it can be difficult to understand what’s going on under the hood. With multiple branches, step jumps, iterations, and also parallel branches and iterations, if your workflow fails during an execution, until now, you had to check the execution status, or go deep through the logs to find more details about the failed step.

I have good news for you! Workflows recently added some deeper introspection capability: you can now view the history of execution steps. From the Google Cloud console, you can see the list of steps, and see the logical flow between them. The usual workflow visualisation will also highlight the successful steps in green, and the failed ones in red. Of course, it is also possible to make a curl call to get the JSON of the list of executed steps.

Let’s have a look!

In the console, when you click on an execution, in the summary tab, you’ll see not only the failed step, but also the nice workflow graph colored green and red:

That way, you know which path the execution followed, in a visual manner. But you can also see the actual list of steps executed, with more details, by clicking on the steps tab:

From this table, the filter will let you narrow down to the particular types of steps you’d like to investigate, or visualise the steps of a subworkflow only, etc.

This is a nice improvement to the developer experience, and for your ops team, to better understand what happens during your workflow executions! Feel free to read more about this new capability in the documentation about viewing the history of execution steps.

https://glaforge.dev/posts/2023/12/22/visualize-and-inspect-workflows-executions/
Hands on Codelabs to dabble with Large Language Models in Java

Hot on the heels of the release of Gemini, I’d like to share a couple of resources I created to get your hands on large language models, using LangChain4J, and the PaLM 2 model. Later on, I’ll also share with you articles and codelabs that take advantage of Gemini, of course.

The PaLM 2 model supports 2 modes:

  • text generation,
  • and chat.

In the 2 codelabs, you’ll need to have created an account on Google Cloud, and created a project. The codelabs will guide you through the steps to set up the environment, and show you how to use the Google Cloud built-in shell and code editor, to develop in the cloud.

You should be a Java developer, as the examples are in Java, use the LangChain4J project, and Maven for building the code.

Generative AI text generation in Java with PaLM and LangChain4J

In the first codelab you can explore:

  • how to make your first call to PaLM for simple question/answer scenarios
  • how to extract structured data out of unstructured text
  • how to use prompts and prompt templates
  • how to classify text, with an example on sentiment analysis
Generative AI powered chat with users and docs in Java with PaLM and LangChain4J

In the second codelab you’ll use the chat model to learn:

  • how to create your first chat with the PaLM model
  • how to give your chatbot a personality, with an example with a chess player
  • how to extract structured data out of unstructured text using LangChain4J’s AiServices and its annotations
  • how to implement Retrieval Augmented Generation (RAG) to answer questions about your own documentation
Going further with Generative AI

If you’re interested in going further with Generative AI, and learning more, feel free to join the Google Cloud Innovators program.

Google Cloud Innovators is free and includes:

  • live discussions, AMAs, and roadmap sessions to learn the latest directly from Googlers,
  • the latest Google Cloud news right in your inbox,
  • digital badge and video conference background,
  • and more.

Go check what the program offers!

https://glaforge.dev/posts/2023/12/18/get-hands-on-codelabs-to-dabble-with-llms/
Get Started with Gemini in Java

Google announced today the availability of Gemini, its latest and most powerful Large Language Model. Gemini is multimodal, which means it’s able to consume not only text, but also images or videos.

I had the pleasure of working on the Java samples and helping with the Java SDK, with wonderful engineer colleagues, and I’d like to share some examples of what you can do with Gemini, using Java!

First of all, you’ll need to have an account on Google Cloud and to have created a project. The Vertex AI API should be enabled, to be able to access the Generative AI services, and in particular the Gemini large language model. Be sure to check out the instructions.

Preparing your project build

To get started with some coding, you’ll need to create a Gradle or a Maven build file that requires the Google Cloud libraries BOM, and the google-cloud-vertexai library. Here’s an example with Maven:

...
<dependencyManagement>
    <dependencies>
        <dependency>
            <artifactId>libraries-bom</artifactId>
            <groupId>com.google.cloud</groupId>
            <scope>import</scope>
            <type>pom</type>
            <version>26.29.0</version>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <dependency>
        <groupId>com.google.cloud</groupId>
        <artifactId>google-cloud-vertexai</artifactId>
    </dependency>
    ...
</dependencies>
...
Your first queries

Now let’s have a look at our first multimodal example, mixing text prompts and images:

try (VertexAI vertexAI = new VertexAI(projectId, location)) {
    byte[] imageBytes = Base64.getDecoder().decode(dataImageBase64);

    GenerativeModel model = new GenerativeModel("gemini-pro-vision", vertexAI);
    GenerateContentResponse response = model.generateContent(
        ContentMaker.fromMultiModalData(
            "What is this image about?",
            PartMaker.fromMimeTypeAndData("image/jpeg", imageBytes)
        ));

    System.out.println(ResponseHandler.getText(response));
}

  • You instantiate VertexAI with your Google Cloud project ID, and the region location of your choice.
  • To pass images to Gemini, you can either pass the bytes directly, or pass a URI of an image stored in a cloud storage bucket (like gs://my-bucket/my-img.jpg).
  • You create an instance of the model. Here, I’m using gemini-pro-vision. But later on, a gemini-ultra-vision model will also be available.
  • Let’s ask the model to generate content with the generateContent() method, by passing both a text prompt and an image.
  • The ContentMaker and PartMaker classes are helpers to further simplify the creation of more advanced prompts that mix different modalities. But you could also just pass a simple string as argument of the generateContent() method.
  • The ResponseHandler utility will retrieve all the text of the answer of the model.

Instead of getting the whole output once all the text is generated, you can also adopt a streaming approach:

model.generateContentStream("Why is the sky blue?")
    .stream()
    .forEach(System.out::print);

You can also iterate over the stream with a for loop:

ResponseStream<GenerateContentResponse> responseStream =
    model.generateContentStream("Why is the sky blue?");

for (GenerateContentResponse responsePart: responseStream) {
    System.out.print(ResponseHandler.getText(responsePart));
}
Let’s chat!

Gemini is a multimodal model, and it’s actually both a text generation model and a chat model. So you can chat with Gemini, and ask a series of questions in context. There’s a handy ChatSession utility class which simplifies the handling of the conversation:

try (VertexAI vertexAI = new VertexAI(projectId, location)) {
    GenerateContentResponse response;

    GenerativeModel model = new GenerativeModel(modelName, vertexAI);
    ChatSession chatSession = new ChatSession(model);

    response = chatSession.sendMessage("Hello.");
    System.out.println(ResponseHandler.getText(response));

    response = chatSession.sendMessage("What are all the colors in a rainbow?");
    System.out.println(ResponseHandler.getText(response));

    response = chatSession.sendMessage("Why does it appear when it rains?");
    System.out.println(ResponseHandler.getText(response));
}

ChatSession is convenient to use, as it takes care of keeping track of past questions from the user, and answers from the assistant.
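To give an idea of what keeping track of a conversation involves, here is a minimal plain Java sketch. It is not the SDK’s ChatSession implementation: the ChatHistorySketch class, its Turn type, and the stubbed model function are assumptions made for this illustration. The idea is simply that every new message is sent along with all the previous turns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class ChatHistorySketch {

    // One message of the conversation, from the user or from the model.
    static final class Turn {
        final String role;
        final String text;

        Turn(String role, String text) {
            this.role = role;
            this.text = text;
        }
    }

    private final List<Turn> history = new ArrayList<>();
    // Hypothetical stand-in for the model: receives the whole history, returns an answer.
    private final Function<List<Turn>, String> model;

    ChatHistorySketch(Function<List<Turn>, String> model) {
        this.model = model;
    }

    // Records the user message, calls the model with the full history, records the answer.
    String sendMessage(String userText) {
        history.add(new Turn("user", userText));
        String answer = model.apply(List.copyOf(history));
        history.add(new Turn("model", answer));
        return answer;
    }

    List<Turn> history() {
        return List.copyOf(history);
    }

    public static void main(String[] args) {
        // A fake model that only reports how many turns it was given.
        ChatHistorySketch chat = new ChatHistorySketch(turns -> "turns received: " + turns.size());
        System.out.println(chat.sendMessage("Hello."));                                // turns received: 1
        System.out.println(chat.sendMessage("What are all the colors in a rainbow?")); // turns received: 3
    }
}
```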

Going further

These are just a few examples of the capabilities of Gemini. Be sure to check out some of the samples that are available on GitHub. Read more about Gemini and Generative AI in the Google Cloud documentation.

https://glaforge.dev/posts/2023/12/13/get-started-with-gemini-in-java/
Generative AI in practice: Concrete LLM use cases in Java, with the PaLM API

Large Language Models, available through easy-to-use APIs, put powerful machine learning tools into the hands of developers. Although Python is usually seen as the lingua franca of everything ML, with LLM APIs and LLM orchestration frameworks, complex tasks become easier to implement for enterprise developers.

Abstract

Large language models (LLMs) are a powerful new technology that can be used for a variety of tasks, including generating text, translating languages, and writing different kinds of creative content. However, LLMs can be difficult to use, especially for developers who are not proficient in Python, the lingua franca for AI. So what about us Java developers? How can we make use of Generative AI?

This presentation will go through how to use LLMs in Java without the need for Python. We will use the PaLM API, provided by Google Cloud’s Vertex AI services, to perform a variety of tasks, such as searching through documentation, generating kids stories, summarizing content, extracting keywords or entities, and more. In our journey through demos, we’ll discover LangChain4J, a wonderful LLM orchestrator for Java developers that simplifies the implementation of advanced LLM use cases.

I had the chance to get this talk recorded at Devoxx Belgium:

And you can check the slides here:

https://glaforge.dev/talks/2023/11/13/gen-ai-with-palm-2-and-java/
Tech Watch #5 — November, 15, 2023
  • Some friends shared this article from Uwe Friedrichsen, titled “Back to the future”, that talks about this feeling of “déjà-vu”, this impression that in IT we keep on reinventing the wheel. With references to mainframes, Uwe compared CICS to Lambda function scheduling, JCL to step functions, mainframe software development environments to the trendy platform engineering. There are two things I like about this article. First of all, it rings a bell with me, as we’ve seen the pendulum swing as we keep reinventing some patterns or rediscovering certain best practices, sometimes favoring an approach one day, and coming back to another approach the next day. But secondly, Uwe referenced Gunter Dueck who talked about spirals rather than a pendulum. I’ve had that same analogy in mind for years: rather than swinging from one side to the other and back, I always had this impression that we’re circling and spiraling, but each time, even when passing on the same side, we’ve learned something along the way, and we’re getting closer to an optimum, with a slightly different view angle, and hopefully with a better view and more modern practices. Last week at FooConf #2 in Helsinki, I was just talking with my friend Venkat Subramaniam about this spiral visualisation, and I’m glad to see I’m not the only one thinking that IT is spiraling rather than swinging like a pendulum.

  • Automerge-repo, a batteries included toolkit for building local-first applications
    Automerge is one of the most well-known CRDT algorithms (Conflict-Free Replicated Data Type) that allows you to implement collaborative applications (think Google Docs kind of collaboration, for example). With CRDT algorithms and data structures, concurrent changes on different devices can be merged automatically without requiring a central server, and without complex merge processes. However, having an algorithm and data structure is one thing, but putting the whole system in place is not necessarily easy. This new automerge-repo project tries to solve this problem, by offering networking and storage adapters to facilitate the communication between the peers, or with a potential sync server.

  • The WebAssembly Garbage Collection proposal (WasmGC) lands in the latest Chrome version. The V8 team dives into the details about WasmGC. It’ll be particularly useful to better support garbage collected languages (like Java and friends) without having to ship a garbage collector in each wasm package.

  • Although I’m not developing native apps for Macs, I spotted this article about an open source implementation of Apple code signing and notarization, implemented in Rust, and that can run on non-Mac hardware. With this approach, when you’re building native apps for the Mac, you can integrate that approach in your Linux-based CI/CD pipeline, without having a Mac box somewhere.

  • Document summarization is an area where large language models excel. There are different approaches to do so when your context window can’t fit the whole document to summarize. In this article, different approaches are mentioned: stuffing (when it fits in the context window), Map/Reduce to split the content in sections that can be summarised and a summary of summary can be made, and the more sequential Refine method where we summarize what fits in memory, and then ask to refine that first summary with the details of the following sections, till we run out of content.

  • Large Language Models face two big issues: one is hallucinations, and how to mitigate them by grounding answers or finding ways to assess the response’s factuality, and the other one is prompt injection, as a malicious attacker can misguide an LLM into doing something other than what it was programmed for. The folks at Scott Logic developed a demo based on the idea of ImmersiveLabs’ online playground to experiment with prompt injection and techniques to circumvent it. There’s also an article that talks about the project, and a video that shows it all in action.

  • My good friend Ken Kousen dives into the magic of AI Services with LangChain4J. He has a nice blog post, and also a great accompanying video on YouTube where he shows some of the powerful features of LangChain4J, in particular the AI service that allows you to decorate an interface with annotations to interact with your large language model and get plain Java types or objects in return.

  • My colleague Romin Irani also integrated LangChain4J and the PaLM 2 chat model, showing how to deploy a Google Cloud Function chatbot.

  • Baeldung also gives an introduction to LangChain4J showing the basics of prompts, models, memory management, retrieval, chains, and agents.

  • LangChain4J using Redis: Stephan Janssen, the founder of Devoxx, is using LangChain4J inside the Devoxx CFP and schedule application. In this article on LinkedIn, he explains how he used Redis to store vector embeddings corresponding to the talks of the conference, to search for similar talks.
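
The Map/Reduce and Refine summarization approaches mentioned in the list above can be sketched in a few lines of Java. This is a simplified illustration, not the code from the linked article: the SummarizationSketch class is made up, and firstWords is a toy stand-in for the call to a large language model that would produce a real summary.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

final class SummarizationSketch {

    // Map/Reduce: summarize each section on its own, then summarize the summaries.
    static String mapReduce(List<String> sections, UnaryOperator<String> summarize) {
        String summaries = sections.stream()
                .map(summarize)
                .collect(Collectors.joining("\n"));
        return summarize.apply(summaries);
    }

    // Refine: summarize the first section, then fold each following section into the summary.
    static String refine(List<String> sections, UnaryOperator<String> summarize) {
        String summary = "";
        for (String section : sections) {
            String input = summary.isEmpty() ? section : summary + "\n" + section;
            summary = summarize.apply(input);
        }
        return summary;
    }

    // Toy stand-in for an LLM call: keeps only the first word of each line.
    static String firstWords(String text) {
        return Arrays.stream(text.split("\n"))
                .map(line -> line.split(" ")[0])
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        List<String> sections = List.of("alpha one", "beta two", "gamma three");
        System.out.println(mapReduce(sections, SummarizationSketch::firstWords)); // alpha beta gamma
        System.out.println(refine(sections, SummarizationSketch::firstWords));    // alpha gamma
    }
}
```

With the toy summarizer, the Refine variant drops details from earlier sections at each step, which hints at a trade-off of the sequential approach: what survives depends on how well each intermediate summary preserves the earlier content.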

https://glaforge.dev/posts/2023/11/13/tech-watch-0005/