Build a Production Agent Loop with Ollama Tool Calling: Complete Guide

The bottom line: Ollama’s native tool calling (function calling) API lets you build a local agent loop that executes multi-step reasoning with tool use, parallel dispatch, and error recovery, with no GPU cluster needed. This guide walks through building a production-grade agent loop with Qwen 3, covering the core loop, parallel tools, streaming, error recovery, model selection, and deployment on modest hardware.

Why Local Tool Calling Matters

Running agent workloads locally isn’t just about cost savings or privacy. For many production patterns — CI/CD integrations, local file operations, development tooling — a remote API introduces latency, rate limits, and data exfiltration risk. Ollama’s tool calling API, available since v0.3.0, gives you the same function-calling primitive as OpenAI and Anthropic, but running on your own hardware [1].

The models that support it in 2026: Qwen 3 (all sizes), Llama 3.1/3.3, Gemma 4, Mistral Small 3.1, DeepSeek R1, and Hermes 3 [2]. Qwen 3 is the go-to for agent workloads — best tool selection accuracy per parameter, and its think=True mode reveals intermediate reasoning, which is invaluable for debugging agent behavior.

The Core Agent Loop

A production agent loop needs more than the minimal example from the docs. Here’s the pattern with error boundaries, loop limits, and structured tool dispatch:

import json
from typing import Callable, Optional
from ollama import chat, ChatResponse
from tenacity import retry, stop_after_attempt, wait_exponential

MAX_ITERATIONS = 15

class AgentLoop:
    def __init__(self, model: str = "qwen3", tools: Optional[dict[str, Callable]] = None):
        self.model = model
        self.tools = tools or {}
        self.tool_schemas = self._build_schemas()

    def _build_schemas(self) -> list[dict]:
        # Tools are passed as Python functions — Ollama SDK auto-generates JSON schemas
        return list(self.tools.values())

    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
    def _llm_call(self, messages: list[dict]) -> ChatResponse:
        return chat(
            model=self.model,
            messages=messages,
            tools=self.tool_schemas,
            options={"temperature": 0.1},  # low temp for more consistent tool selection
            think=True,                      # surface reasoning for debugging
        )

    def run(self, user_input: str) -> str:
        messages = [{"role": "user", "content": user_input}]
        for iteration in range(MAX_ITERATIONS):
            response = self._llm_call(messages)
            messages.append(response.message)

            if not response.message.tool_calls:
                return response.message.content  # done

            for call in response.message.tool_calls:
                fn_name = call.function.name
                fn_args = call.function.arguments or {}
                try:
                    result = self.tools[fn_name](**fn_args)
                except Exception as e:
                    result = f"ERROR: {e}"
                messages.append({
                    "role": "tool",
                    "tool_name": fn_name,
                    "content": str(result),
                })

        return "Agent reached max iterations without final answer."

Key differences from the minimal example [1]:

Retry wrapper: a failed Ollama call (for example, a request that arrives while the model is still loading) is attempted up to three times with exponential backoff
Temperature 0.1: low-variance sampling makes tool selection more consistent and reduces hallucinated function names
Max iterations: prevents infinite loops on confused models
Error boundaries per tool: a crashing tool doesn’t kill the entire agent
think=True: exposes model reasoning, critical for debugging tool selection
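
The loop limit and the per-tool error boundary can be exercised without a running Ollama server. The sketch below is a simplified stand-in, not the class above: run_loop, make_call, and scripted_llm are hypothetical helpers, and the fake model just replays canned replies shaped like the SDK's response message.

```python
from types import SimpleNamespace
from typing import Callable


def make_call(name: str, **arguments) -> SimpleNamespace:
    """Build an object shaped like one tool call (hypothetical test helper)."""
    return SimpleNamespace(function=SimpleNamespace(name=name, arguments=arguments))


def run_loop(llm: Callable, tools: dict, user_input: str, max_iterations: int = 5):
    """Same control flow as AgentLoop.run, with the model call injected."""
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_iterations):
        reply = llm(messages)
        messages.append(reply)
        if not reply.tool_calls:
            return reply.content, messages
        for call in reply.tool_calls:
            try:
                result = tools[call.function.name](**call.function.arguments)
            except Exception as e:  # error boundary: report the failure to the model
                result = f"ERROR: {e}"
            messages.append({"role": "tool", "tool_name": call.function.name,
                             "content": str(result)})
    return "Agent reached max iterations without final answer.", messages


def scripted_llm(replies: list) -> Callable:
    """Return a fake model that plays back canned replies in order."""
    queue = iter(replies)
    return lambda messages: next(queue)


def divide(a: float, b: float) -> float:
    return a / b


replies = [
    SimpleNamespace(content="", tool_calls=[make_call("divide", a=1, b=0)]),
    SimpleNamespace(content="Division by zero is undefined.", tool_calls=None),
]
answer, transcript = run_loop(scripted_llm(replies), {"divide": divide}, "What is 1/0?")
tool_message = transcript[2]  # user, assistant tool call, then the tool result
print(tool_message["content"])
print(answer)
```

The crashing tool shows up in the transcript as an ordinary tool message, so the model gets a chance to react to the error on its next turn instead of the whole run failing.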

Parallel Tool Dispatch

When the model needs data from multiple sources — weather in three cities, file stats across directories — the agent loop shouldn’t wait for each tool sequentially. Ollama supports parallel tool calls: the model returns an array of tool_calls in a single response [1].

def run_parallel(self, user_input: str) -> str:
    messages = [{"role": "user", "content": user_input}]
    for iteration in range(MAX_ITERATIONS):
        response = self._llm_call(messages)
        messages.append(response.message)

        if not response.message.tool_calls:
            return response.message.content

        # Run every tool the model requested in this turn (sequentially here)
        tool_results = []
        for call in response.message.tool_calls:
            fn_name = call.function.name
            fn_args = call.function.arguments or {}
            try:
                result = self.tools[fn_name](**fn_args)
            except Exception as e:
                result = f"ERROR: {e}"
            tool_results.append({
                "role": "tool",
                "tool_name": fn_name,
                "content": str(result),
            })

        messages.extend(tool_results)

    return "Agent reached max iterations without final answer."

For truly parallel execution (I/O-bound tools like API calls), use concurrent.futures.ThreadPoolExecutor:

from concurrent.futures import ThreadPoolExecutor

def _dispatch_tools(self, calls: list) -> list[dict]:
    results = []
    with ThreadPoolExecutor(max_workers=5) as pool:
        # Submit everything first so the tools run concurrently
        futures = [pool.submit(self._execute_tool, c) for c in calls]
        # Collect in submission order so each result lines up with its call,
        # which matters when the same tool is called several times
        for call, future in zip(calls, futures):
            try:
                result = future.result()
            except Exception as e:
                result = f"ERROR: {e}"
            results.append({
                "role": "tool",
                "tool_name": call.function.name,
                "content": str(result),
            })
    return results

The model handles parallel results naturally — it sees all tool outputs in the next turn and synthesizes them into a coherent answer. Testing with qwen3 shows it correctly correlates results from parallel calls with their original context [3].
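
To see why result order deserves attention, here is a small self-contained sketch. It assumes a hypothetical slow_lookup tool that just sleeps, and it collects futures in submission order so each result stays aligned with the call that produced it, even when a later call finishes first.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_lookup(city: str, delay: float) -> str:
    """Stand-in for an I/O-bound tool such as an HTTP request (hypothetical)."""
    time.sleep(delay)
    return f"{city}: ok"


def dispatch_in_order(jobs: list) -> list:
    """Run all jobs concurrently, return results in the order they were submitted."""
    with ThreadPoolExecutor(max_workers=5) as pool:
        futures = [pool.submit(slow_lookup, city, delay) for city, delay in jobs]
        return [future.result() for future in futures]


jobs = [("Oslo", 0.3), ("Lima", 0.1), ("Pune", 0.2)]
start = time.perf_counter()
results = dispatch_in_order(jobs)
elapsed = time.perf_counter() - start
print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Total time should be close to the slowest single call rather than the sum of all three, which is the whole benefit for I/O-bound tools. CPU-bound tools would not gain much from threads.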

Streaming: Real-Time Output for UX

For interactive applications, streaming gives the user incremental progress instead of a long wait:

def run_streaming(self, user_input: str):
    messages = [{"role": "user", "content": user_input}]
    for iteration in range(MAX_ITERATIONS):
        stream = chat(
            model=self.model,
            messages=messages,
            tools=self.tool_schemas,
            stream=True,
            options={"temperature": 0.1},
        )

        full_response = ""
        tool_calls = []

        for chunk in stream:
            if chunk.message and chunk.message.content:
                yield ("delta", chunk.message.content)
                full_response += chunk.message.content
            if chunk.message and chunk.message.tool_calls:
                # Tool calls may arrive across several chunks, so accumulate them
                tool_calls.extend(chunk.message.tool_calls)

        if tool_calls:
            yield ("tool_calls", [
                {"name": tc.function.name, "args": tc.function.arguments}
                for tc in tool_calls
            ])
            messages.append({
                "role": "assistant",
                "content": full_response,
                "tool_calls": [
                    {"type": "function", "function": {
                        "name": tc.function.name,
                        # Pass arguments as a dict; the SDK expects a mapping, not a JSON string
                        "arguments": dict(tc.function.arguments or {}),
                    }}
                    for tc in tool_calls
                ]
            })
            # Execute and append tool results
            for tc in tool_calls:
                try:
                    result = str(self.tools[tc.function.name](**(tc.function.arguments or {})))
                except Exception as e:
                    result = f"ERROR: {e}"
                messages.append({"role": "tool", "tool_name": tc.function.name, "content": result})
                yield ("tool_result", {"name": tc.function.name, "result": result})
        else:
            yield ("done", full_response)
            return

    # Loop limit reached without a final answer
    yield ("done", "Agent reached max iterations without final answer.")

Consumers can render this in a terminal with rich text or a web UI with server-sent events:

import sys

for event_type, data in agent.run_streaming("What's the disk usage on /data?"):
    if event_type == "delta":
        sys.stdout.write(data)
        sys.stdout.flush()
    elif event_type == "tool_calls":
        # The model may request several tools in one turn, so list them all
        names = ", ".join(call["name"] for call in data)
        print(f"\n[Calling: {names}]\n")
    elif event_type == "tool_result":
        print(f"\n[Result: {str(data['result'])[:80]}...]\n")

Error Recovery Strategies

Models sometimes get tool calling wrong. Here’s how to handle the three most common failure modes:

1. Hallucinated function names

A model might call get_weather when only get_temperature exists:

def _execute_tool(self, call) -> str:
    fn_name = call.function.name
    if fn_name not in self.tools:
        return f"ERROR: Unknown function '{fn_name}'. Available: {list(self.tools.keys())}"
    return str(self.tools[fn_name](**call.function.arguments))

Returning the correct function list lets the model recover on the next turn.
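
One optional refinement, sketched here as an assumption rather than anything Ollama provides: include the closest known tool name in the error so the model has a concrete correction to try. The helper name unknown_tool_message and the 0.5 similarity cutoff are illustrative choices; difflib is part of the standard library.

```python
import difflib


def unknown_tool_message(name: str, available: list) -> str:
    """Build an error string that names the closest available tool, if any."""
    suggestions = difflib.get_close_matches(name, available, n=1, cutoff=0.5)
    hint = f" Did you mean '{suggestions[0]}'?" if suggestions else ""
    return f"ERROR: Unknown function '{name}'.{hint} Available: {sorted(available)}"


print(unknown_tool_message("get_weather", ["get_temperature", "list_files"]))
print(unknown_tool_message("zzz", ["list_files"]))
```

A lower cutoff produces more suggestions but also more wrong ones, so it is worth tuning against the names in your own tool set.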

2. Missing required parameters

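def safe_invoke(fn: Callable, args: dict) -> str:
    import inspect
    sig = inspect.signature(fn)
    # *args and **kwargs have no default but are never required
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    required = {
        name for name, param in sig.parameters.items()
        if param.default is inspect.Parameter.empty and param.kind not in variadic
    }
    missing = required - set(args.keys())
    if missing:
        # Report the names so the model can supply them on the next turn
        return f"ERROR: Missing required parameters: {sorted(missing)}"
    return str(fn(**args))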
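
Models also invent parameters that a tool does not accept. A possible complement to the check above, shown as a sketch with a hypothetical check_args helper, is to let inspect.Signature.bind validate the whole argument set at once: it raises TypeError for both missing and unexpected names.

```python
import inspect
from typing import Callable, Optional


def check_args(fn: Callable, args: dict) -> Optional[str]:
    """Return an error string if args do not fit fn's signature, else None."""
    try:
        inspect.signature(fn).bind(**args)
    except TypeError as e:
        return f"ERROR: Bad arguments for '{fn.__name__}': {e}"
    return None


def search_files(pattern: str, path: str = ".") -> str:
    return f"searching {path} for {pattern}"


print(check_args(search_files, {"path": "/data"}))                # missing 'pattern'
print(check_args(search_files, {"pattern": "*.py", "depth": 2}))  # unexpected 'depth'
print(check_args(search_files, {"pattern": "*.py"}))              # valid call
```

This only checks names and arity. It does not verify that values have the right types, so a tool that needs an integer still has to handle a string.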

3. Model stuck in tool-calling loop

Sometimes the model calls tools forever without generating a final answer. The max-iteration guard handles this, but you can also detect the pattern mid-loop:

def _detect_loop(self, history: list[dict]) -> bool:
    """Detect repeated identical tool calls"""
    # Keep only the turns that requested tools, then look at the last four.
    # (A fixed window over raw history also counts tool results, so it
    # rarely contains four assistant turns.)
    recent = [m for m in history if m.get("tool_calls")][-4:]
    if len(recent) == 4:
        call_sigs = [str(m.get("tool_calls")) for m in recent]
        return len(set(call_sigs)) <= 2  # same calls repeated
    return False
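
Comparing str() of the tool calls treats the same call with reordered arguments as two different calls. A possible alternative, sketched with hypothetical helpers call_signature and most_repeated, is to build a canonical JSON signature per call and count repeats.

```python
import json
from collections import Counter


def call_signature(name: str, arguments: dict) -> str:
    """Canonical form of one tool call: key order does not affect the result."""
    return json.dumps({"name": name, "args": arguments}, sort_keys=True, default=str)


def most_repeated(calls: list) -> tuple:
    """Return (signature, count) for the most frequent call, or ("", 0) if empty."""
    if not calls:
        return ("", 0)
    counts = Counter(call_signature(name, args) for name, args in calls)
    return counts.most_common(1)[0]


calls = [
    ("disk_usage", {"path": "/data"}),
    ("list_files", {"path": "/data"}),
    ("disk_usage", {"path": "/data"}),
    ("disk_usage", {"path": "/data"}),
]
signature, count = most_repeated(calls)
is_stuck = count >= 3  # threshold is an assumption; tune it for your tools
print(signature, count, is_stuck)
```

When the check fires, one option is to append a message telling the model it has already seen that result and should answer with what it has.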

Model Selection Guide

Not all models handle tool calling equally. From real-world testing and community reports [2][3]:

Model	Params	VRAM	Tool Accuracy	Notes
Qwen 3	8B-235B	6GB+	Excellent	Best overall, strong parallel support
Llama 3.1 8B	8B	6GB	Good	Fast, reliable for simple tool sets
Gemma 4	9B-27B	7GB+	Very good	Strong reasoning with tool use
DeepSeek R1 14B	14B	10GB	Good	Slower but thorough reasoning
Hermes 3 70B	70B	40GB	Excellent	Best accuracy, needs high-end GPU

For development on consumer hardware (16GB VRAM or less), start with Qwen 3 8B (5.5GB) or Qwen 3 14B (9GB). For production accuracy on sensitive workflows, Qwen 3 32B (20GB) offers the best accuracy-to-cost ratio [3].

Production Deployment

For a production agent service, run Ollama behind a simple queue to prevent VRAM contention:

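import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
agent = AgentLoop(model="qwen3:14b", tools=my_tools)  # my_tools: your name -> function dict

class Query(BaseModel):
    input: str
    stream: bool = False

def sse_events(user_input: str):
    # run_streaming yields (event_type, data) tuples; encode each as an SSE frame
    for event_type, data in agent.run_streaming(user_input):
        yield f"event: {event_type}\ndata: {json.dumps(data, default=str)}\n\n"

@app.post("/agent")
async def run_agent(query: Query):
    if query.stream:
        return StreamingResponse(sse_events(query.input), media_type="text/event-stream")
    # agent.run blocks, so keep it off the event loop
    result = await asyncio.to_thread(agent.run, query.input)
    return {"result": result}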
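
The endpoint above does not queue anything by itself. One way to get the queueing behavior, sketched here with a hypothetical Gate class and a fake agent function, is an asyncio.Semaphore that caps how many blocking agent runs execute at once while the rest wait their turn. The limit of 2 is an assumption; it would normally mirror what the Ollama server is configured to handle.

```python
import asyncio
import threading
import time

MAX_CONCURRENT_RUNS = 2  # assumption: keep in line with the server's parallelism


class Gate:
    """Allow at most `limit` blocking calls to run at once; others wait."""

    def __init__(self, limit: int):
        self._sem = asyncio.Semaphore(limit)

    async def run(self, fn, *args):
        async with self._sem:
            # Run the blocking function in a worker thread
            return await asyncio.to_thread(fn, *args)


stats = {"active": 0, "peak": 0}
lock = threading.Lock()


def fake_agent_run(prompt: str) -> str:
    """Stand-in for agent.run that records how many calls overlap."""
    with lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.05)
    with lock:
        stats["active"] -= 1
    return f"answer to {prompt}"


async def main() -> list:
    gate = Gate(MAX_CONCURRENT_RUNS)  # created inside the running event loop
    return await asyncio.gather(*(gate.run(fake_agent_run, f"q{i}") for i in range(6)))


answers = asyncio.run(main())
print(answers)
print("peak concurrency:", stats["peak"])
```

In the FastAPI handler, the same idea would wrap the asyncio.to_thread call so that bursts of requests wait in line instead of all hitting the model at once.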

Critical deployment rules for Ollama in production:

One model at a time per GPU: Ollama keeps the loaded model in VRAM. Concurrent requests to different models that do not fit in VRAM together force repeated unloading and reloading.
Set OLLAMA_NUM_PARALLEL: controls how many requests a loaded model handles concurrently. Start at 1, increase to 2-4 for latency-tolerant workloads, and remember that each extra slot needs additional memory for its context.
Systemd keepalive: run ollama serve as a systemd service with Restart=always and a health check every 30 seconds.
Health endpoint: http://localhost:11434/api/tags returns JSON. Monitor it in your load balancer.

# systemd unit snippet
[Service]
ExecStart=/usr/bin/ollama serve
Environment=OLLAMA_NUM_PARALLEL=2
Environment=OLLAMA_KEEP_ALIVE=24h
Restart=always
RestartSec=5
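
The health endpoint rule can be turned into a small probe. The sketch below uses only the standard library; the function name ollama_is_healthy and the 2-second timeout are assumptions, and it treats any connection error, timeout, or malformed body as unhealthy.

```python
import json
import urllib.error
import urllib.request


def ollama_is_healthy(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if /api/tags answers with a JSON object containing a model list."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a body that is not JSON
        return False
    return isinstance(payload, dict) and isinstance(payload.get("models"), list)
```

A cron job or systemd timer could call this every 30 seconds and restart the service after a few consecutive failures. Note that a healthy response only shows the server is up, not that a particular model is loaded.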

Putting It All Together

Here’s a complete working example — a file analysis agent that reads directories, checks disk usage, and searches file contents. Pull qwen3:14b and run it:

import os, glob
from agent_loop import AgentLoop

def list_files(path: str = ".") -> str:
    """List files in a directory"""
    return "\n".join(os.listdir(path)[:30])

def disk_usage(path: str = ".") -> str:
    """Check disk usage of a path"""
    import shutil
    total, used, free = shutil.disk_usage(path)
    return f"Total: {total//(2**30)}GB, Used: {used//(2**30)}GB, Free: {free//(2**30)}GB"

def search_files(pattern: str, path: str = ".") -> str:
    """Search for files matching a glob pattern"""
    # recursive=True is required for "**" to match nested directories
    return "\n".join(glob.glob(os.path.join(path, pattern), recursive=True)[:20])

tools = {
    "list_files": list_files,
    "disk_usage": disk_usage,
    "search_files": search_files,
}

agent = AgentLoop(model="qwen3:14b", tools=tools)
result = agent.run("Find all Python files in /data, then check disk usage")
print(result)

A typical run looks like this: call search_files("**/*.py", "/data") → get results → call disk_usage("/data") → synthesize both into a human-readable answer. The exact sequence of calls depends on the model. All locally, no API keys needed.

Verdict

Ollama’s tool calling API in 2026 is production-ready for agent workloads that need local execution, data privacy, or low latency. Qwen 3 provides reliable tool selection at every model size tier. The patterns above — parallel dispatch, streaming, error recovery, and deployment hardening — turn the basic doc example into a real agent service.

The biggest remaining risk is model confusion on complex multi-tool workflows. Test your specific tool set exhaustively before deploying. For most single-domain agent tasks (file ops, code analysis, local API orchestration), the local loop can be faster, cheaper, and more private than a remote API.

References

[1] Ollama Tool Calling Documentation — https://docs.ollama.com/capabilities/tool-calling

[2] “Best Ollama Local Models for Tool Calling 2026” — Clawdbook — https://clawdbook.org/blog/openclaw-best-ollama-models-2026

[3] “Ollama Tool Calling Guide: Build AI Agents with Local LLMs” — Local AI Master — https://localaimaster.com/blog/ollama-tool-calling-guide

[4] Qwen 3 Model Card — Ollama Library — https://ollama.com/library/qwen3

[5] Ollama GitHub Repository — Server Configuration — https://github.com/ollama/ollama
